⸻
TL;DR (what to run)
Run all four; keep the one that makes decoded text coherent and lifts retrieval metrics.
⸻
Assumptions & Setup
• Space: 768-D Semantic GPS (unit-norm; use SGPS projector).
• Encoder/Decoder: Use vec2text for both directions (encoder and decoder) to avoid GTR-T5 compatibility headaches.
• Data: Wikipedia chunks (40–50k), CPESH labels when available.
• Lanes: TMD lanes (Factoid, Math-Deriv, Code-API, …) as categorical features and sampling buckets.
Common tricks (use in all tracks):
• Unit-norm outputs: ŷ ← y / ||y||.
• EMA teacher for stability (decay ≈ 0.999–0.9999).
• SGPS projection penalty: keep outputs near training distribution (e.g., distance to k-NN hull or small L2 to retrieval centroid).
• Vec-decoder compatibility loss (below) on a random 30–50% of batches.
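The common tricks above can be sketched in a few lines. This is a minimal numpy illustration (real training would live in torch/JAX); `unit_norm`, `ema_update`, and `cosine` are hypothetical helper names, not part of any library:

```python
import numpy as np

def unit_norm(v, eps=1e-8):
    # ŷ ← y / ||y||: keep every vector on the unit sphere.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def ema_update(teacher, student, decay=0.999):
    # EMA teacher: teacher ← decay·teacher + (1 − decay)·student.
    return decay * teacher + (1.0 - decay) * student

def cosine(a, b):
    # Cosine similarity between two already-flattened vectors.
    return float(np.dot(unit_norm(a), unit_norm(b)))
```

With unit-norm outputs, 1 − cos(·,·) and squared L2 distance become interchangeable up to a factor of 2, which is why the losses below mix the two freely.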
⸻
Track 1 — E2E Vector Supervision (E2E-V)
Purpose: Baseline that already fixes many “garbage decode” cases by keeping vectors on-manifold and close to targets.
Batch I/O
• Input: q (question vec), optional pooled retrieval vector r̄, lane embedding t.
• Target: y (CPESH Expected) or teacher vector (vec2text encoder of gold text).
Model (simple): ŷ = fθ(q, r̄, t) (Mamba or a slim MLP); output is 768-D.
Loss
• Alignment: L_align = 1 − cos(ŷ, y).
• Manifold regularizer: L_mani = ||ŷ − Proj_SGPS(ŷ)||² or small penalty to drift from k nearest vectors in train set.
• Decoder-compat: decode and re-encode once:
• txt = vec2text.decode(ŷ)
• v' = vec2text.encode(txt)
• L_cycle = 1 − cos(ŷ, v')
Total: L = L_align + λ_mani·L_mani + λ_cycle·L_cycle.
Why it helps: Forces outputs to sit where vec2text is accurate, not just “somewhere” in 768-D.
⸻
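A minimal sketch of the Track 1 total loss. `proj_sgps`, `decode`, and `encode` are placeholders you supply (the real vec2text API differs from this pseudocode, so the round trip is injected as callables rather than hard-coded):

```python
import numpy as np

def cos(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def e2e_v_loss(y_hat, y, proj_sgps, decode, encode,
               lam_mani=0.05, lam_cycle=0.2):
    # L = L_align + λ_mani·L_mani + λ_cycle·L_cycle (Track 1).
    l_align = 1.0 - cos(y_hat, y)
    l_mani = float(np.sum((y_hat - proj_sgps(y_hat)) ** 2))
    v_prime = encode(decode(y_hat))  # one decode → re-encode round trip
    l_cycle = 1.0 - cos(y_hat, v_prime)
    return l_align + lam_mani * l_mani + lam_cycle * l_cycle
```

Note the decode→re-encode pass is non-differentiable in most setups; in practice L_cycle is applied on the 30–50% of batches mentioned above, with v′ treated as a stop-gradient target.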
Track 2 — Iterative No-Grad → Grad (INGR)
Purpose: Teach the head (or a tiny refiner) to improve by recursion without paying full backprop at every step.
Module: Tiny Refiner (CTR-SGPS) that updates (z, y) for S steps; only the last step trains.
(z_0 = 0, y_0 = pool(Y) or the initial head output)
for s in 1..S−1:
    (z_s, y_s) = step_no_grad(z_{s−1}, y_{s−1}; q, r̄, t)
(z_S, y_S) = step_grad(z_{S−1}, y_{S−1}; q, r̄, t)
Loss (on final step only):
L = (1 − cos(y_S, y)) + α·BCE(p_halt,S, 𝟙{cos(y_S, y) ≥ τ}) + β·L_cycle(y_S)
Notes
• Set S ≈ 8–16 for the baseline; cap latency with a halting head at inference.
• Use EMA of the refiner for evaluation.
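The rollout structure above can be written as a plain loop. In this sketch `step` is your refiner step function and `ingr_rollout` is a hypothetical name; in a real framework the first loop runs under `torch.no_grad()` (or behind `jax.lax.stop_gradient`), which here is only marked by comments:

```python
def ingr_rollout(step, z0, y0, ctx, S=8):
    # No-grad → grad: run S−1 refinement steps "free", train only the last.
    z, y = z0, y0
    for _ in range(S - 1):
        z, y = step(z, y, ctx)  # no-grad region: no graph kept for these steps
    z, y = step(z, y, ctx)      # grad step: only this one backpropagates
    return z, y
```

The memory cost is that of a single step, which is what makes S ≈ 8–16 affordable.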
Why it helps: You get the iterative behavior (a big win for coherence) at almost the cost of a single forward pass.
⸻
Track 3 — Contrastive + Cycle (CONTRAST-CYCLE)
Purpose: Make vectors discriminative and decoder-friendly. Fixes “blurry” outputs that decode to nonsense.
Positives/Negatives
• Positive: y or the teacher vector for the same chunk.
• Negatives: (a) lane-hard: same lane, different article; (b) near-miss: high retrieval but wrong; (c) adversarial: lexically similar but semantically different.
Loss
• InfoNCE:
L_nce = −log [ exp(cos(ŷ, y⁺)/τ_c) / ( exp(cos(ŷ, y⁺)/τ_c) + Σ_j exp(cos(ŷ, y⁻_j)/τ_c) ) ]
• Cycle: same L_cycle as Track 1.
• Optional Mutual consistency: symmetrize with teacher: 1 − cos(fθ(q,…), stopgrad(y)) + 1 − cos(stopgrad(fθ(q,…)), y).
Total: L = L_nce + λ_cycle·L_cycle.
Why it helps: Separates close concepts; keeps decodes sharp and faithful.
⸻
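The InfoNCE term above, term-for-term, as a numpy sketch (single example, list of negatives; a batched torch version would use a logit matrix instead). `info_nce` is an illustrative name, not a library call:

```python
import numpy as np

def info_nce(y_hat, y_pos, y_negs, tau_c=0.07):
    # L_nce = −log exp(cos(ŷ,y⁺)/τ_c) / (exp(cos(ŷ,y⁺)/τ_c) + Σ_j exp(cos(ŷ,y⁻_j)/τ_c))
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(y_hat, y_pos) / tau_c] +
                      [cos(y_hat, n) / tau_c for n in y_negs])
    logits -= logits.max()  # stabilize before exponentiation
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))  # positive sits at index 0
```

Lane-hard and near-miss negatives plug straight into `y_negs`; the harder they are, the more the loss separates close concepts.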
Track 4 — Curriculum + Lane Hard-Negatives (CURR-LANE)
Purpose: Stabilize training; expose the model to difficulty gradually; force lane-aware precision.
Stages
• Curriculum by readability/length and retrieval ambiguity.
• Maintain lane balance per epoch.
Scheduler
• Warmup → cosine decay; raise λ_cycle over time; increase the negative count per batch over stages.
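One way to sketch that scheduler (stdlib only; `lr_at` and `lam_cycle_at` are hypothetical helpers, and the λ_cycle values reuse the {0.1, 0.2, 0.3} grid suggested in the 7-day plan below):

```python
import math

def lr_at(step, total, base_lr=3e-4, warmup=500):
    # Linear warmup, then cosine decay to zero.
    if step < warmup:
        return base_lr * step / max(1, warmup)
    t = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

def lam_cycle_at(stage, schedule=(0.1, 0.2, 0.3)):
    # Raise λ_cycle stage by stage; clamp at the last value.
    return schedule[min(stage, len(schedule) - 1)]
```

The negative count per batch can follow the same staged pattern as `lam_cycle_at`.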
⸻
Implementation Notes (do these no matter what)
• Normalize everywhere: inputs, intermediate y, outputs.
• SGPS projector: if you have a PCA/autoencoder of the corpus vectors, project predictions back (light orthogonality helps).
• Adapters per lane: small FiLM/LoRA heads keyed by TMD → lowers interference.
• Batch shaping: mix 70% in-lane, 30% cross-lane examples to avoid collapse.
• Decode gating (for logging only): if cos(ŷ, v') < 0.7, mark as “decode-risky”; surface to eval dashboard.
• Teacher path (optional): keep a frozen EMA of a previously good checkpoint to generate soft targets y when CPESH isn’t available.
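The 70/30 batch-shaping note above is mechanical enough to sketch. `shape_batch` is a hypothetical helper; `pool_by_lane` maps TMD lane names to example lists:

```python
import random

def shape_batch(pool_by_lane, lane, batch_size=32, in_lane_frac=0.7, rng=random):
    # Mix ~70% in-lane and ~30% cross-lane examples to avoid collapse.
    n_in = int(round(batch_size * in_lane_frac))
    in_lane = list(pool_by_lane[lane])
    cross = [ex for l, exs in pool_by_lane.items() if l != lane for ex in exs]
    batch = rng.sample(in_lane, min(n_in, len(in_lane)))
    batch += rng.sample(cross, min(batch_size - len(batch), len(cross)))
    rng.shuffle(batch)
    return batch
```

Pass a seeded `random.Random` as `rng` to make batch composition reproducible across runs.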
⸻
Minimal Loss Recipes (copy/paste)
E2E-V:
L = (1 − cos(ŷ, y)) + 0.05·||ŷ − Proj_SGPS(ŷ)||² + 0.2·(1 − cos(ŷ, vec2text.encode(vec2text.decode(ŷ))))
INGR:
L = (1 − cos(y_S, y)) + 0.2·BCE(p_halt,S, 𝟙{cos(y_S, y) ≥ 0.85}) + 0.2·(1 − cos(y_S, vec2text.encode(vec2text.decode(y_S))))
CONTRAST-CYCLE:
L = InfoNCE(ŷ, y⁺, {y⁻}) + 0.2·(1 − cos(ŷ, vec2text.encode(vec2text.decode(ŷ))))
⸻
Metrics & Gates (what “good” looks like)
• Vector alignment: cos(ŷ, y) ↑; median ≥ 0.88 on val after Stage B.
• Decoder cycle: cos(ŷ, v') ↑; median ≥ 0.82.
• Retrieval synergy: nDCG@10 on re-query with ŷ ↑ vs baseline.
• Lane accuracy: per-lane pass rate (CPESH Expected within top-k when decoded) ↑.
• Halting efficiency (INGR): avg steps ≤ 8 with ≥ 85% inside S_max.
• Human sanity checks: 50-sample blind read—≥ 70% “sensible” after Stage C.
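The quantitative gates above are easy to encode as a single check for a CI dashboard. This is a sketch with hypothetical metric-dict keys (`median_cos_align`, `ndcg10`, etc.), not an existing tool:

```python
def passes_gates(m):
    # Gates from above: alignment ≥ 0.88, cycle ≥ 0.82,
    # retrieval uplift vs baseline, INGR avg steps ≤ 8.
    checks = {
        "align":   m["median_cos_align"] >= 0.88,
        "cycle":   m["median_cos_cycle"] >= 0.82,
        "ndcg":    m["ndcg10"] > m["ndcg10_baseline"],
        "halting": m.get("avg_steps", 0) <= 8,  # trivially passes for non-INGR runs
    }
    return all(checks.values()), checks
```

Returning the per-gate dict (not just the boolean) makes it obvious which gate failed when a run is rejected; the human sanity check stays manual.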
⸻
Failure Modes & Fast Fixes
• Nonsensical decodes despite high cosine to y: raise λ_cycle; add near-miss negatives.
• Mode collapse (all vectors look alike): increase negative count; add lane adapters; up InfoNCE temperature τ_c.
• Over-halting (INGR stops too early): lower halt threshold τ or add penalty for early halts.
• Training stable but val decode poor: enable EMA eval; tighten SGPS projector; add small L2 to nearest-neighbor barycenter.
⸻
7-Day Plan (practical)
Day 1–2: Implement Tracks 1 & 3. Log cos(ŷ, y), cos(ŷ, v'), nDCG@10.
Day 3: Add SGPS projector + EMA; run a quick grid on λ_cycle ∈ {0.1, 0.2, 0.3}.
Day 4: Implement INGR (Track 2) with S=8, τ=0.85, and the halting head.
Day 5: Turn on CURR-LANE: Stage A→B scheduling + lane adapters (rank-8).
Day 6: Ablate: no cycle vs cycle, no contrast vs contrast, no INGR vs INGR.
Day 7: Pick the winner by decode sanity + retrieval uplift; checkpoint + freeze.
⸻
Experiment Naming (so results aren’t chaos)
exp/
example:
exp/2025-10-13_CONTRAST-CYCLE_S0_CY0.2_N64_LANEon_EMAon
⸻
Optional: Wire the Tiny Refiner (CTR-SGPS) cleanly
• Inputs: (q, r̄, Y_topk, t) → iterate (z, y); output (ŷ, p_halt).
• Use it inside Track 2 (training) and after the LVM head at inference for hard queries.
• Keep h=512, S_max=16, τ∈[0.82,0.9].
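The inference-time wiring (run after the LVM head on hard queries) reduces to a loop with the halting head as the stop condition. A sketch under the settings above; `step` and `halt_prob` are your trained refiner step and halting head, and `refine_at_inference` is a hypothetical name:

```python
def refine_at_inference(step, halt_prob, z0, y0, ctx, s_max=16, tau_halt=0.85):
    # Iterate the tiny refiner until the halting head is confident or S_max is hit.
    z, y = z0, y0
    for s in range(1, s_max + 1):
        z, y = step(z, y, ctx)
        if halt_prob(z) >= tau_halt:
            break
    return y, s  # refined vector and the number of steps actually used
```

Logging `s` per query feeds the halting-efficiency gate (avg steps ≤ 8) directly.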
⸻
Final Notes (blunt and practical)
• If vec2text is the only reliable pair, standardize on it for both encode/decode.
• The single biggest lever against “garbage text” is CONTRAST-CYCLE with lane-hard negatives.
• The single cheapest quality bump is INGR (no-grad→grad) + EMA.
• Do not skip normalization and the SGPS projector; most drift bugs come from that.
If you want, I can turn this into a Makefile + runnable trainer skeleton next, with flags for --track, --lambda-cycle, --negatives, --ema, and --ingr-steps.