⸻
TL;DR (what to run)
Run all four; keep the one that makes decoded text coherent and lifts retrieval metrics.
⸻
Assumptions & Setup
• Space: 768-D Semantic GPS (unit-norm; use SGPS projector).
• Encoder/Decoder: Use vec2text for both directions (encoder and decoder) to avoid GTR-T5 compatibility headaches.
• Data: Wikipedia chunks (40–50k), CPESH labels when available.
• Lanes: TMD lanes (Factoid, Math-Deriv, Code-API, …) as categorical features and sampling buckets.
Common tricks (use in all tracks):
• Unit-norm outputs: ŷ ← y / ||y||.
• EMA teacher for stability (decay ≈ 0.999–0.9999).
• SGPS projection penalty: keep outputs near training distribution (e.g., distance to k-NN hull or small L2 to retrieval centroid).
• Vec-decoder compatibility loss (below) on a random 30–50% of batches.
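The common tricks above can be sketched in a few lines. This is a minimal numpy illustration (real training would live in torch/JAX); `unit_norm`, `ema_update`, and `cosine` are hypothetical helper names, not part of any library:

```python
import numpy as np

def unit_norm(v, eps=1e-8):
    # ŷ ← y / ||y||: keep every vector on the unit sphere.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def ema_update(teacher, student, decay=0.999):
    # EMA teacher: teacher ← decay·teacher + (1 − decay)·student.
    return decay * teacher + (1.0 - decay) * student

def cosine(a, b):
    # Cosine similarity between two already-flattened vectors.
    return float(np.dot(unit_norm(a), unit_norm(b)))
```

With unit-norm outputs, 1 − cos(·,·) and squared L2 distance become interchangeable up to a factor of 2, which is why the losses below mix the two freely.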
⸻
Track 1 — E2E Vector Supervision (E2E-V)
Purpose: Baseline that already fixes many “garbage decode” cases by keeping vectors on-manifold and close to targets.
Batch I/O
• Input: q (question vec), optional pooled retrieval vector r̄, lane embedding t.
• Target: y (CPESH Expected) or teacher vector (vec2text encoder of gold text).
Model (simple): ŷ = fθ(q, r̄, t) (Mamba or a slim MLP); output is 768-D.
Loss
• Alignment: L_align = 1 − cos(ŷ, y).
• Manifold regularizer: L_mani = ||ŷ − Proj_SGPS(ŷ)||² or small penalty to drift from k nearest vectors in train set.
• Decoder-compat: decode and re-encode once:
• txt = vec2text.decode(ŷ)
• v' = vec2text.encode(txt)
• L_cycle = 1 − cos(ŷ, v')
Total: L = L_align + λ_mani·L_mani + λ_cycle·L_cycle.
Why it helps: Forces outputs to sit where vec2text is accurate, not just “somewhere” in 768-D.
⸻
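A minimal sketch of the Track 1 total loss. `proj_sgps`, `decode`, and `encode` are placeholders you supply (the real vec2text API differs from this pseudocode, so the round trip is injected as callables rather than hard-coded):

```python
import numpy as np

def cos(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def e2e_v_loss(y_hat, y, proj_sgps, decode, encode,
               lam_mani=0.05, lam_cycle=0.2):
    # L = L_align + λ_mani·L_mani + λ_cycle·L_cycle (Track 1).
    l_align = 1.0 - cos(y_hat, y)
    l_mani = float(np.sum((y_hat - proj_sgps(y_hat)) ** 2))
    v_prime = encode(decode(y_hat))  # one decode → re-encode round trip
    l_cycle = 1.0 - cos(y_hat, v_prime)
    return l_align + lam_mani * l_mani + lam_cycle * l_cycle
```

Note the decode→re-encode pass is non-differentiable in most setups; in practice L_cycle is applied on the 30–50% of batches mentioned above, with v′ treated as a stop-gradient target.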
Track 2 — Iterative No-Grad → Grad (INGR)
Purpose: Teach the head (or a tiny refiner) to improve by recursion without paying full backprop at every step.
Module: Tiny Refiner (CTR-SGPS) that updates (z, y) for S steps; only the last step trains.
(z_0 = 0, y_0 = pool(Y) or the initial head output)
for s in 1..S−1:
    (z_s, y_s) = step_no_grad(z_{s−1}, y_{s−1}; q, r̄, t)
(z_S, y_S) = step_grad(z_{S−1}, y_{S−1}; q, r̄, t)
Loss (on final step only):
L = (1 − cos(y_S, y)) + α·BCE(p_halt,S, 𝟙{cos(y_S, y) ≥ τ}) + β·L_cycle(y_S)
Notes
• Set S ≈ 8–16 for the baseline; cap latency with a halting head at inference.
• Use EMA of the refiner for evaluation.
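The rollout structure above can be written as a plain loop. In this sketch `step` is your refiner step function and `ingr_rollout` is a hypothetical name; in a real framework the first loop runs under `torch.no_grad()` (or behind `jax.lax.stop_gradient`), which here is only marked by comments:

```python
def ingr_rollout(step, z0, y0, ctx, S=8):
    # No-grad → grad: run S−1 refinement steps "free", train only the last.
    z, y = z0, y0
    for _ in range(S - 1):
        z, y = step(z, y, ctx)  # no-grad region: no graph kept for these steps
    z, y = step(z, y, ctx)      # grad step: only this one backpropagates
    return z, y
```

The memory cost is that of a single step, which is what makes S ≈ 8–16 affordable.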
Why it helps: You get the iterative behavior (a big win for coherence) at almost the cost of a single forward pass.
⸻
Track 3 — Contrastive + Cycle (CONTRAST-CYCLE)
Purpose: Make vectors discriminative and decoder-friendly. Fixes “blurry” outputs that decode to nonsense.
Positives/Negatives
• Positive: y or the teacher vector for the same chunk.
• Negatives: (a) lane-hard: same lane, different article; (b) near-miss: high retrieval but wrong; (c) adversarial: lexically similar but semantically different.
Loss
• InfoNCE:
L_nce = −log [ exp(cos(ŷ, y⁺)/τ_c) / ( exp(cos(ŷ, y⁺)/τ_c) + Σ_j exp(cos(ŷ, y⁻_j)/τ_c) ) ]
• Cycle: same L_cycle as Track 1.
• Optional Mutual consistency: symmetrize with teacher: 1 − cos(fθ(q,…), stopgrad(y)) + 1 − cos(stopgrad(fθ(q,…)), y).
Total: L = L_nce + λ_cycle·L_cycle.
Why it helps: Separates close concepts; keeps decodes sharp and faithful.
⸻
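The InfoNCE term above, term-for-term, as a numpy sketch (single example, list of negatives; a batched torch version would use a logit matrix instead). `info_nce` is an illustrative name, not a library call:

```python
import numpy as np

def info_nce(y_hat, y_pos, y_negs, tau_c=0.07):
    # L_nce = −log exp(cos(ŷ,y⁺)/τ_c) / (exp(cos(ŷ,y⁺)/τ_c) + Σ_j exp(cos(ŷ,y⁻_j)/τ_c))
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(y_hat, y_pos) / tau_c] +
                      [cos(y_hat, n) / tau_c for n in y_negs])
    logits -= logits.max()  # stabilize before exponentiation
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))  # positive sits at index 0
```

Lane-hard and near-miss negatives plug straight into `y_negs`; the harder they are, the more the loss separates close concepts.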
Track 4 — Curriculum + Lane Hard-Negatives (CURR-LANE)
Purpose: Stabilize training; expose the model to difficulty gradually; force lane-aware precision.
Stages
• Curriculum by readability/length and retrieval ambiguity.
• Maintain lane balance per epoch.
Scheduler
• Warmup → cosine decay; raise λ_cycle over time; increase the negative count per batch over stages.
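One way to sketch that scheduler (stdlib only; `lr_at` and `lam_cycle_at` are hypothetical helpers, and the λ_cycle values reuse the {0.1, 0.2, 0.3} grid suggested in the 7-day plan below):

```python
import math

def lr_at(step, total, base_lr=3e-4, warmup=500):
    # Linear warmup, then cosine decay to zero.
    if step < warmup:
        return base_lr * step / max(1, warmup)
    t = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

def lam_cycle_at(stage, schedule=(0.1, 0.2, 0.3)):
    # Raise λ_cycle stage by stage; clamp at the last value.
    return schedule[min(stage, len(schedule) - 1)]
```

The negative count per batch can follow the same staged pattern as `lam_cycle_at`.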
⸻
Implementation Notes (do these no matter what)
• Normalize everywhere: inputs, intermediate y, outputs.
• SGPS projector: if you have a PCA/autoencoder of the corpus vectors, project predictions back (light orthogonality helps).
• Adapters per lane: small FiLM/LoRA heads keyed by TMD → lowers interference.
• Batch shaping: mix 70% in-lane, 30% cross-lane examples to avoid collapse.
• Decode gating (for logging only): if cos(ŷ, v') < 0.7, mark as “decode-risky”; surface to eval dashboard.
• Teacher path (optional): keep a frozen EMA of a previously good checkpoint to generate soft targets y when CPESH isn’t available.
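The 70/30 batch-shaping note above is mechanical enough to sketch. `shape_batch` is a hypothetical helper; `pool_by_lane` maps TMD lane names to example lists:

```python
import random

def shape_batch(pool_by_lane, lane, batch_size=32, in_lane_frac=0.7, rng=random):
    # Mix ~70% in-lane and ~30% cross-lane examples to avoid collapse.
    n_in = int(round(batch_size * in_lane_frac))
    in_lane = list(pool_by_lane[lane])
    cross = [ex for l, exs in pool_by_lane.items() if l != lane for ex in exs]
    batch = rng.sample(in_lane, min(n_in, len(in_lane)))
    batch += rng.sample(cross, min(batch_size - len(batch), len(cross)))
    rng.shuffle(batch)
    return batch
```

Pass a seeded `random.Random` as `rng` to make batch composition reproducible across runs.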
⸻
Minimal Loss Recipes (copy/paste)
E2E-V:
L = (1 − cos(ŷ, y)) + 0.05·||ŷ − Proj_SGPS(ŷ)||² + 0.2·(1 − cos(ŷ, vec2text.encode(vec2text.decode(ŷ))))
INGR:
L = (1 − cos(y_S, y)) + 0.2·BCE(p_halt,S, 𝟙{cos(y_S, y) ≥ 0.85}) + 0.2·(1 − cos(y_S, vec2text.encode(vec2text.decode(y_S))))
CONTRAST-CYCLE:
L = InfoNCE(ŷ, y⁺, {y⁻}) + 0.2·(1 − cos(ŷ, vec2text.encode(vec2text.decode(ŷ))))
⸻
Metrics & Gates (what “good” looks like)
• Vector alignment: cos(ŷ, y) ↑; median ≥ 0.88 on val after Stage B.
• Decoder cycle: cos(ŷ, v') ↑; median ≥ 0.82.
• Retrieval synergy: nDCG@10 on re-query with ŷ ↑ vs baseline.
• Lane accuracy: per-lane pass rate (CPESH Expected within top-k when decoded) ↑.
• Halting efficiency (INGR): avg steps ≤ 8 with ≥ 85% inside S_max.
• Human sanity checks: 50-sample blind read—≥ 70% “sensible” after Stage C.
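The quantitative gates above are easy to encode as a single check for a CI dashboard. This is a sketch with hypothetical metric-dict keys (`median_cos_align`, `ndcg10`, etc.), not an existing tool:

```python
def passes_gates(m):
    # Gates from above: alignment ≥ 0.88, cycle ≥ 0.82,
    # retrieval uplift vs baseline, INGR avg steps ≤ 8.
    checks = {
        "align":   m["median_cos_align"] >= 0.88,
        "cycle":   m["median_cos_cycle"] >= 0.82,
        "ndcg":    m["ndcg10"] > m["ndcg10_baseline"],
        "halting": m.get("avg_steps", 0) <= 8,  # trivially passes for non-INGR runs
    }
    return all(checks.values()), checks
```

Returning the per-gate dict (not just the boolean) makes it obvious which gate failed when a run is rejected; the human sanity check stays manual.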
⸻
Failure Modes & Fast Fixes
• Nonsensical decodes despite high cosine to y: raise λ_cycle; add near-miss negatives.
• Mode collapse (all vectors look alike): increase negative count; add lane adapters; up InfoNCE temperature τ_c.
• Over-halting (INGR stops too early): lower halt threshold τ or add penalty for early halts.
• Training stable but val decode poor: enable EMA eval; tighten SGPS projector; add small L2 to nearest-neighbor barycenter.
⸻
7-Day Plan (practical)
Day 1–2: Implement Tracks 1 & 3. Log cos(ŷ, y), cos(ŷ, v'), nDCG@10.
Day 3: Add SGPS projector + EMA; run a quick grid on λ_cycle ∈ {0.1, 0.2, 0.3}.
Day 4: Implement INGR (Track 2) with S=8, τ=0.85, and the halting head.
Day 5: Turn on CURR-LANE: Stage A→B scheduling + lane adapters (rank-8).
Day 6: Ablate: no cycle vs cycle, no contrast vs contrast, no INGR vs INGR.
Day 7: Pick the winner by decode sanity + retrieval uplift; checkpoint + freeze.
⸻
Experiment Naming (so results aren’t chaos)
exp/
example:
exp/2025-10-13_CONTRAST-CYCLE_S0_CY0.2_N64_LANEon_EMAon
⸻
Optional: Wire the Tiny Refiner (CTR-SGPS) cleanly
• Inputs: (q, r̄, Y_topk, t) → iterate (z, y); output (ŷ, p_halt).
• Use it inside Track 2 (training) and after the LVM head at inference for hard queries.
• Keep h=512, S_max=16, τ∈[0.82,0.9].
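The inference-time wiring (run after the LVM head on hard queries) reduces to a loop with the halting head as the stop condition. A sketch under the settings above; `step` and `halt_prob` are your trained refiner step and halting head, and `refine_at_inference` is a hypothetical name:

```python
def refine_at_inference(step, halt_prob, z0, y0, ctx, s_max=16, tau_halt=0.85):
    # Iterate the tiny refiner until the halting head is confident or S_max is hit.
    z, y = z0, y0
    for s in range(1, s_max + 1):
        z, y = step(z, y, ctx)
        if halt_prob(z) >= tau_halt:
            break
    return y, s  # refined vector and the number of steps actually used
```

Logging `s` per query feeds the halting-efficiency gate (avg steps ≤ 8) directly.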
⸻
Final Notes (blunt and practical)
• If vec2text is the only reliable pair, standardize on it for both encode/decode.
• The single biggest lever against “garbage text” is CONTRAST-CYCLE with lane-hard negatives.
• The single cheapest quality bump is INGR (no-grad→grad) + EMA.
• Do not skip normalization and the SGPS projector; most drift bugs come from that.
If you want, I can turn this into a Makefile + runnable trainer skeleton next, with flags for --track, --lambda-cycle, --negatives, --ema, and --ingr-steps.