**TRICO — Training Ideas (Vector-Only)**

LNSP/LVM • Semantic GPS 768-D • October 13, 2025

Goal. Stop "garbage decodes" by making the model's 768-D outputs both (a) on-manifold in your Semantic-GPS space and (b) decoder-compatible with vec2text. We'll do that with four concrete, vector-native training tracks you can run in parallel and compare.

TL;DR (what to run)
  • E2E Vector Supervision (E2E-V): Simple cosine-align to targets; adds manifold control.
  • Iterative No-Grad→Grad (INGR): Teach “improve-by-recursion” with cheap compute.
  • Contrastive + Cycle (CONTRAST-CYCLE): Align to positives, repel hard negatives, and enforce vec2text round-trips.
  • Curriculum + Lane Hard-Negatives (CURR-LANE): Stage difficulty and TMD-aware negatives for stability.
  • Run all four; keep the one that makes decoded text coherent and lifts retrieval metrics.

    Assumptions & Setup

    • Space: 768-D Semantic GPS (unit-norm; use the SGPS projector).
    • Encoder/Decoder: Use vec2text for both directions (encode and decode) to avoid GTR-T5 compatibility headaches.
    • Data: Wikipedia chunks (40–50k), CPESH labels when available.
    • Lanes: TMD lanes (Factoid, Math-Deriv, Code-API, …) as categorical features and sampling buckets.

    Common tricks (use in all tracks):

    • Unit-norm outputs: ŷ ← y / ||y||.
    • EMA teacher for stability (decay ≈ 0.999–0.9999).
    • SGPS projection penalty: keep outputs near the training distribution (e.g., distance to k-NN hull or small L2 to retrieval centroid).
    • Vec-decoder compatibility loss (below) on a random 30–50% of batches.
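    Two of these tricks are mechanical enough to sketch directly. A minimal NumPy version (the names `unit_norm` and `ema_update` are illustrative, not from any codebase):

```python
import numpy as np

def unit_norm(y, eps=1e-8):
    # ŷ ← y / ||y||, applied row-wise so a whole batch normalizes at once
    return y / (np.linalg.norm(y, axis=-1, keepdims=True) + eps)

def ema_update(teacher, student, decay=0.999):
    # EMA teacher: θ_teacher ← decay·θ_teacher + (1 − decay)·θ_student
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}
```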

    Track 1 — E2E Vector Supervision (E2E-V)

    Purpose: Baseline that already fixes many "garbage decode" cases by keeping vectors on-manifold and close to targets.

    Batch I/O

    • Input: q (question vec), optional retrieval pooled r̄, lane embedding t.

    • Target: y (CPESH Expected) or teacher vector (vec2text encoder of gold text).

    Model (simple): ŷ = fθ(q, r̄, t) (Mamba or slim MLP); output is 768-D.

    Loss

    • Alignment: L_align = 1 − cos(ŷ, y).

    • Manifold regularizer: L_mani = ||ŷ − Proj_SGPS(ŷ)||², or a small penalty for drifting from the k nearest vectors in the train set.

    • Decoder-compat: decode and re-encode once:

    • txt = vec2text.decode(ŷ)

    • v' = vec2text.encode(txt)

    • L_cycle = 1 − cos(ŷ, v')

    Total: L = L_align + λ_mani·L_mani + λ_cycle·L_cycle.

    Why it helps: Forces outputs to sit where vec2text is accurate, not just "somewhere" in 768-D.
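    The three-term loss can be sketched as follows, assuming the SGPS projection and the vec2text round-trip vector v' are computed elsewhere and passed in (all names here are illustrative):

```python
import numpy as np

def cos_sim(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def e2ev_loss(y_hat, y, proj_sgps, v_cycle, lam_mani=0.05, lam_cycle=0.2):
    # L_align: pull ŷ toward the target vector
    l_align = 1.0 - cos_sim(y_hat, y)
    # L_mani: squared distance to the SGPS projection of ŷ
    l_mani = float(np.sum((y_hat - proj_sgps(y_hat)) ** 2))
    # L_cycle: agreement with v' = vec2text.encode(vec2text.decode(ŷ)),
    # computed outside this function and passed in as v_cycle
    l_cycle = 1.0 - cos_sim(y_hat, v_cycle)
    return l_align + lam_mani * l_mani + lam_cycle * l_cycle
```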

    Track 2 — Iterative No-Grad → Grad (INGR)

    Purpose: Teach the head (or a tiny refiner) to improve by recursion without paying full backprop at every step.

    Module: Tiny Refiner (CTR-SGPS) that updates (z, y) for S steps; only the last step trains.

    (z0 = 0, y0 = pool(Y or initial head))
    for s in 1..S−1:
        (zs, ys) = step_no_grad(zs−1, ys−1; q, r̄, t)
    (zS, yS) = step_grad(zS−1, yS−1; q, r̄, t)

    Loss (on final step only):

    L = (1 − cos(yS, y)) + α·BCE(p_halt,S, 𝟙{cos(yS,y)≥τ}) + β·L_cycle(yS)

    Notes

    • Set S≈8–16 for baseline; cap latency with a halting head at inference.

    • Use EMA of the refiner for evaluation.

    Why it helps: You get the iterative behavior (a big win for coherence) at nearly the cost of a single forward pass.
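    The rollout can be sketched structurally. NumPy stands in here for a real autodiff framework: in PyTorch the first S−1 calls would sit under `torch.no_grad()`, and the toy `refiner_step` below is a placeholder for the actual CTR-SGPS update:

```python
import numpy as np

def refiner_step(z, y, q, W):
    # hypothetical tiny refiner: one recurrent update of latent z and vector y
    z = np.tanh(W @ np.concatenate([z, y, q]))
    y = y + 0.1 * z[: y.shape[0]]
    return z, y / (np.linalg.norm(y) + 1e-8)

def ingr_rollout(q, y0, W, S=8):
    # steps 1..S−1 run without gradients in a real framework;
    # only the final call sits on the gradient path and feeds the loss
    z, y = np.zeros(W.shape[0]), y0
    for _ in range(S - 1):
        z, y = refiner_step(z, y, q, W)  # no-grad steps
    z, y = refiner_step(z, y, q, W)      # grad step
    return y
```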

    Track 3 — Contrastive + Cycle (CONTRAST-CYCLE)

    Purpose: Make vectors discriminative and decoder-friendly. Fixes "blurry" outputs that decode nonsensically.

    Positives/Negatives

    • Positive: y or teacher vector for the same chunk.

    • Negatives: (a) lane-hard: same lane, different article; (b) near-miss: high retrieval but wrong; (c) adversarial: lexically similar but semantically different.

    Loss

    InfoNCE:

    L_nce = −log [ exp(cos(ŷ, y⁺)/τ_c) / ( exp(cos(ŷ, y⁺)/τ_c) + Σ_j exp(cos(ŷ, y⁻_j)/τ_c) ) ]

    Cycle: same L_cycle as Track 1.

    • Optional mutual consistency: symmetrize with the teacher: (1 − cos(fθ(q,…), stopgrad(y))) + (1 − cos(stopgrad(fθ(q,…)), y)).

    Total: L = L_nce + λ_cycle·L_cycle.

    Why it helps: Separates close concepts; keeps decodes sharp and faithful.
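    The InfoNCE term reduces to a max-stabilized log-softmax over one positive and J negatives. An illustrative NumPy sketch (not the training code):

```python
import numpy as np

def info_nce(y_hat, y_pos, y_negs, tau_c=0.07):
    # −log exp(cos(ŷ,y⁺)/τ_c) / (exp(cos(ŷ,y⁺)/τ_c) + Σ_j exp(cos(ŷ,y⁻_j)/τ_c))
    def c(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([c(y_hat, y_pos)] + [c(y_hat, n) for n in y_negs]) / tau_c
    logits -= logits.max()  # subtract the max for numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

    Harder negatives (higher cosine to ŷ) raise the loss, which is exactly the pressure that separates near-miss concepts.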

    Track 4 — Curriculum + Lane Hard-Negatives (CURR-LANE)

    Purpose: Stabilize training; expose the model to difficulty gradually; force lane-aware precision.

    Stages
  • Stage A (Easy): Short, clean paragraphs; far negatives; high batch size.
  • Stage B (Medium): Normal chunks; lane-hard negatives; add L_cycle.
  • Stage C (Hard): Long/technical chunks; near-miss negatives; enable INGR (Track 2) for the last 30–50% of training.

    Sampling

    • Curriculum by readability/length and retrieval ambiguity.

    • Maintain lane balance per epoch.

    Scheduler

    • Warmup → cosine decay; raise λ_cycle over time; increase negative count per batch over stages.
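    A plausible sketch of this scheduler, assuming linear warmup, cosine decay to zero, and a linear λ_cycle ramp (the exact shapes are design choices, not pinned down above):

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr):
    # linear warmup, then cosine decay to zero
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

def lambda_cycle_at(step, total_steps, lam_start=0.1, lam_end=0.3):
    # raise λ_cycle linearly over training
    return lam_start + (lam_end - lam_start) * min(1.0, step / total_steps)
```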

    Implementation Notes (do these no matter what)

    • Normalize everywhere: inputs, intermediate y, outputs.
    • SGPS projector: if you have a PCA/autoencoder of the corpus vectors, project predictions back (light orthogonality helps).
    • Adapters per lane: small FiLM/LoRA heads keyed by TMD → lowers interference.
    • Batch shaping: mix 70% in-lane, 30% cross-lane examples to avoid collapse.
    • Decode gating (for logging only): if cos(ŷ, v') < 0.7, mark as "decode-risky"; surface to the eval dashboard.
    • Teacher path (optional): keep a frozen EMA of a previously good checkpoint to generate soft targets y when CPESH isn't available.
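    The decode gate is just a batched cosine threshold. A sketch using the 0.7 gate from the note above (`decode_risky_mask` is an illustrative name):

```python
import numpy as np

def decode_risky_mask(y_hat, v_prime, threshold=0.7):
    # flag rows whose round-trip cosine cos(ŷ, v') falls below the gate
    num = np.sum(y_hat * v_prime, axis=-1)
    den = np.linalg.norm(y_hat, axis=-1) * np.linalg.norm(v_prime, axis=-1) + 1e-8
    return (num / den) < threshold
```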

    Minimal Loss Recipes (copy/paste)

    E2E-V

    L = (1 - cos(ŷ, y)) + 0.05 ||ŷ - Proj_SGPS(ŷ)||^2 + 0.2 (1 - cos(ŷ, vec2text.encode(vec2text.decode(ŷ))))

    INGR

    L = (1 - cos(yS, y)) + 0.2 BCE(p_halt,S, 𝟙{cos(yS,y)≥0.85}) + 0.2 (1 - cos(yS, vec2text.encode(vec2text.decode(yS))))

    CONTRAST-CYCLE

    L = InfoNCE(ŷ, y+, {y-}) + 0.2 (1 - cos(ŷ, vec2text.encode(vec2text.decode(ŷ))))

    Metrics & Gates (what "good" looks like)

    • Vector alignment: cos(ŷ, y) ↑; median ≥ 0.88 on val after Stage B.
    • Decoder cycle: cos(ŷ, v') ↑; median ≥ 0.82.
    • Retrieval synergy: nDCG@10 on re-query with ŷ ↑ vs. baseline.
    • Lane accuracy: per-lane pass rate (CPESH Expected within top-k when decoded) ↑.
    • Halting efficiency (INGR): avg steps ≤ 8, with ≥ 85% halting inside S_max.
    • Human sanity checks: 50-sample blind read; ≥ 70% "sensible" after Stage C.

    Failure Modes & Fast Fixes

    • Nonsensical decodes despite high cosine to y: raise λ_cycle; add near-miss negatives.
    • Mode collapse (all vectors look alike): increase the negative count; add lane adapters; raise the InfoNCE temperature τ_c.
    • Over-halting (INGR stops too early): lower the halt threshold τ or penalize early halts.
    • Training stable but val decodes poor: enable EMA eval; tighten the SGPS projector; add a small L2 to the nearest-neighbor barycenter.

    7-Day Plan (practical)

    • Day 1–2: Implement Tracks 1 & 3. Log cos(ŷ, y), cos(ŷ, v'), nDCG@10.
    • Day 3: Add SGPS projector + EMA; run a quick grid on λ_cycle ∈ {0.1, 0.2, 0.3}.
    • Day 4: Implement INGR (Track 2) with S=8, τ=0.85, and the halting head.
    • Day 5: Turn on CURR-LANE: Stage A→B scheduling + lane adapters (rank-8).
    • Day 6: Ablate: no cycle vs. cycle, no contrast vs. contrast, no INGR vs. INGR.
    • Day 7: Pick the winner by decode sanity + retrieval uplift; checkpoint + freeze.

    Experiment Naming (so results aren’t chaos)

    exp/<date>_<track>_S<steps>_CY<λcycle>_N<negatives>_LANE<on|off>_EMA<on|off>

    example:

    exp/2025-10-13_CONTRAST-CYCLE_S0_CY0.2_N64_LANEon_EMAon
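    A small helper that emits names in this convention (illustrative; the flag set simply mirrors the fields in the example above):

```python
def run_name(date, track, steps, lam_cycle, negatives, lane_on, ema_on):
    # build an experiment name in the exp/<date>_<track>_S..._CY..._N..._LANE..._EMA... convention
    return (f"exp/{date}_{track}_S{steps}_CY{lam_cycle}_N{negatives}"
            f"_LANE{'on' if lane_on else 'off'}_EMA{'on' if ema_on else 'off'}")
```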

    Optional: Wire the Tiny Refiner (CTR-SGPS) cleanly

    • Inputs: (q, r̄, Y_topk, t) → iterate (z,y); output (ŷ, p_halt).

    • Use it inside Track 2 (training) and after the LVM head at inference for hard queries.

    • Keep h=512, S_max=16, τ∈[0.82,0.9].

    Final Notes (blunt and practical)

    • If vec2text is the only reliable pair, standardize on it for both encode/decode.

    • The single biggest lever against “garbage text” is CONTRAST-CYCLE with lane-hard negatives.

    • The single cheapest quality bump is INGR (no-grad→grad) + EMA.

    • Do not skip normalization and the SGPS projector; most drift bugs come from that.

    Next step: turn this into a Makefile + runnable trainer skeleton, with flags for --track, --lambda-cycle, --negatives, --ema, and --ingr-steps.
