Tiny Recursion Meets Latent-Space Reasoning
A White Paper for Integrating a Minimal Recursive Refiner into the LNSP/LVM Stack

Author: Trent Carter + ChatGPT
Date: October 12, 2025 · 13 min read · 1,898 words
Scope: LNSP (Latent Neurolese Semantic Processor), LVM (Large Vector Model), TMD lanes, CPESH supervision, vecRAG, Semantic GPS (768-D), Mamba-family LVM core.

Abstract

We propose a tiny, recursion-based refiner—a 2-layer micro-network repeatedly applied over a compact latent state—to improve vector-only reasoning in the LNSP/LVM pipeline. Instead of adding a deep stack of parameters, we apply many small improvement steps with an adaptive halting head. The module ingests (i) the question/context vector (Semantic GPS), (ii) LVM output candidate vectors (top-K), and (iii) lane metadata (TMD). It returns a refined 768-D concept vector plus a halting confidence, before vec2text.

We detail three integration patterns:

  • TRM-as-Refiner (post-retrieval/post-LVM, vector-only).
  • TRM-style Training (no-grad recursions + final grad step) to teach “improve-by-iteration.”
  • Tiny Recursion Experts (TREX) per TMD lane with shared body + lightweight adapters.

This white paper formalizes interfaces, training, inference, evaluation, and risks, and includes ASCII diagrams and pseudocode to enable immediate prototyping on your 768-D Semantic GPS.

    1. Background & Motivation

    1.1 LNSP/LVM today

    Inputs: Question → embed to q⁷⁶⁸ (Semantic GPS).

    vecRAG: Retrieve top-K concept vectors r₁..r_K⁷⁶⁸.

    Core LVM (Mamba-family): Produces answer candidate vectors y₁..y_M⁷⁶⁸.

    TMD Router: Chooses one or more lanes (e.g., L1_FACTOID, Math-Deriv, Code-API).

    vec2text: Convert final vector back to text if needed.

    1.2 Why a tiny recursion module?

    Capacity via steps, not depth. A tiny network (≈2 layers) is iterated S times over a latent state z, yielding compound reasoning without a deep param stack.

    Adaptive compute: A halting head stops early on easy cases; runs longer on hard ones.

    Data-efficient: “Deep supervision across improvement steps” (many no-grad recursions + one grad step) teaches iterative refinement with modest compute.

    2. The Module: Contextual Tiny Recursion Refiner (CTR)

    Working name options (pick one and standardize):

    CTR-SGPS (Contextual Tiny Refiner — Semantic GPS)

    TRM-SGPS (Tiny Recursion Model aligned to SGPS)

    SGPS-R (Semantic GPS Refiner)

    C-TRM (Contextual TRM)

    TREX-Lane (Tiny Recursion EXpert per lane)

    We recommend CTR-SGPS for clarity.

    2.1 I/O and shapes (default)

    Semantic space: 768-D (SGPS).

    Inputs:

    q⁷⁶⁸: embedded question/context vector (can be a distilled bundle of multi-concept inputs).

    Y^{K×768}: top-K candidate answer vectors from LVM (K≈5–10).

    r̄⁷⁶⁸: (optional) retrieval centroid (mean of vecRAG hits) or a learned attention-pooled retrieval vector.

    t^d_TMD: TMD lane features (one-hot or learned embedding; d_TMD≈32–64).

    Outputs:

    ŷ⁷⁶⁸: refined answer vector.

    p_halt ∈ (0,1): halting confidence.

    2.2 Internal structure

    A compact MLP core applied recurrently over a latent state:

    • Latent z^h (suggest h=512 default; can set h=768 for isometry with SGPS).

    • Two small transforms reused at each step _s_ = 1…S:

    Latent update:

    z_{s} = f_lat(z_{s-1}, q, y_{s-1}, r̄, t)  (MLP + gating; LayerNorm)

    Answer update:

    y_{s} = f_ans(y_{s-1}, z_{s})  (MLP residual to keep y on-manifold)

    Halting head:

    p_halt,s = σ(wᵀ·[z_s ⊕ y_s ⊕ q ⊕ t] + b)

    Stop when p_halt,s ≥ τ or when s = S_max (cap).

    Attention-free vs attentionful:

    • For small fixed slots (e.g., K≤8), fold Y into a pooled summary plus per-candidate deltas using MLP mixing.

    • For larger candidate sets or structured artifacts (code traces, graph walks), add a light self-attention over {y_{s-1}^k} before pooling.
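    The two pooling paths above can be sketched as a single module. This is a minimal illustration, not the production design: the `CandidatePool` name and the bilinear scoring function are assumptions; any scorer that compares each candidate to q would do.

```python
import torch

class CandidatePool(torch.nn.Module):
    """Pools K candidate vectors Y into one d-dim summary.
    mode='attn': score each candidate against the question vector q
    (light attention for larger candidate sets);
    mode='mean': attention-free fallback for small fixed slots (K<=8)."""
    def __init__(self, d=768, mode="attn"):
        super().__init__()
        self.mode = mode
        # illustrative bilinear scorer: s_k = q^T W y_k
        self.score = torch.nn.Bilinear(d, d, 1)

    def forward(self, Y, q):
        # Y: [B, K, d], q: [B, d]
        if self.mode == "mean":
            return Y.mean(dim=1)
        q_exp = q.unsqueeze(1).expand_as(Y).contiguous()   # [B, K, d]
        scores = self.score(q_exp, Y)                      # [B, K, 1]
        w = torch.softmax(scores, dim=1)                   # attention over K
        return (w * Y).sum(dim=1)                          # [B, d]
```

    Either mode yields the pooled summary that initializes y₀; per-candidate deltas can still be kept internally as described above.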

    3. Where It Fits: Three Integration Patterns

    3.1 Pattern A — CTR-SGPS as a Refiner

    Run CTR after LVM candidate generation and lane routing.

    [Question Text]

       │

       ├─► Embed → q(768)

       ├─► vecRAG → r1..rK (768) ──► pool → r̄(768)

       └─► LVM(Mamba) → y1..yM (768), M≈5..10

              │

              ├─► TMD router → lane(s), t

              └─► CTR-SGPS(q, Y, r̄, t) → iterative refine → ŷ, p_halt

                                       └─► if needed vec2text(ŷ)

    Why: Improve fidelity to the original question by conditioning on q + r̄ + t while refining Y. Early stop if confident.

    3.2 Pattern B — CTR-style Training for LVM heads

    Teach the LVM (or its vector head) to improve by recursion without huge compute:

    Supervision schedule: For each training example:

    • Run S-1 steps no-grad (detached), then 1 step with backprop.

    • Use deep supervision across steps to stabilize small data training.

    Losses:

    Cosine alignment: L_align = 1 − cos(ŷ, y) where y is CPESH “Expected.”

    Halting BCE: target 1 if cos(ŷ,y)≥τ; else 0.

    Lane regularizers: optional per-lane priors (e.g., smoothness for Math-Deriv).

    EMA: Maintain an exponential moving average of weights for inference stability.

    Why: You get the iteration skill with minimal backprop cost and better small-data generalization.

    3.3 Pattern C — TREX per TMD lane

    Deploy CTR as Tiny Recursion EXperts:

    One shared CTR body, lane adapters (LoRA/FiLM) keyed by TMD.

    Compute policy: e.g., S_max=48 with ACT-style halting.

    3→2→1 loop integration: Use CTR primarily for verify→refine, then hand back ŷ to the lane combiner (or to a Mamba verifier if confidence low).

    Why: Specialist behavior with a tiny footprint and adaptive compute budget.
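    One way to realize the lane adapters is FiLM conditioning: the TMD lane embedding t produces a per-feature scale and shift applied to the shared CTR latent. A minimal sketch under assumed dimensions (h=512, d_TMD=32); the `LaneFiLM` name is illustrative, and a LoRA-style low-rank update would be an equally valid choice.

```python
import torch

class LaneFiLM(torch.nn.Module):
    """FiLM-style lane adapter over the shared CTR body:
    z' = gamma(t) * z + beta(t), with gamma initialized near identity."""
    def __init__(self, h=512, d_tmd=32):
        super().__init__()
        self.to_gamma = torch.nn.Linear(d_tmd, h)
        self.to_beta = torch.nn.Linear(d_tmd, h)

    def forward(self, z, t):
        # z: [B, h] shared-body latent; t: [B, d_tmd] lane embedding
        gamma = 1.0 + self.to_gamma(t)   # near-identity at init
        beta = self.to_beta(t)
        return gamma * z + beta
```

    Because only the two small linear maps are lane-specific, each TREX lane adds a few tens of thousands of parameters on top of the shared body.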

    4. Formalization

    Let Y = {y₀^k}_{k=1..K} be LVM candidates; we initialize y₀ as a pooled candidate (e.g., softmax over scores) and keep per-candidate deltas internally.

    Step s update:

    \begin{aligned}
    h_s &= \phi\big(W_h [z_{s-1} \oplus y_{s-1} \oplus q \oplus \bar{r} \oplus t] + b_h\big) \\
    z_s &= z_{s-1} + U_h h_s \quad (\text{residual + norm}) \\
    \tilde{y}_s &= \psi\big(W_y [y_{s-1} \oplus z_s] + b_y\big) \\
    y_s &= \text{proj}_{\text{SGPS}}\big(y_{s-1} + \tilde{y}_s\big) \quad (\text{unit-norm or SGPS manifold proj}) \\
    p_{\text{halt},s} &= \sigma\big(w^\top[z_s \oplus y_s \oplus q \oplus t] + b\big)
    \end{aligned}

    Success indicator for supervision:

    \mathbb{1}_\text{ok}(y_s, y^*) = \mathbb{1}\{\cos(y_s, y^*) \ge \tau\}

    Loss (final supervised step):

    \mathcal{L} = \lambda_1\big(1-\cos(y_S, y^*)\big) \;+\;
    \lambda_2\,\text{BCE}\big(p_{\text{halt},S},\, \mathbb{1}_\text{ok}\big) \;+\;
    \lambda_3\,\Omega_\text{lane}(y_{1:S})

    5. Training & Inference

    5.1 Data & alignment

    Dimensionality: Keep 768-D throughout to remain isometric with Semantic GPS.

    Co-training: Train CTR on the same corpus and CPESH instances used to align the LVM to SGPS.

    Lane-specific fine-tuning: After base CTR pretrain, apply adapters per lane.

    5.2 Schedule (Pattern B baseline)

    Per batch example:

  • Compute q, r̄, Y. Initialize z₀=0, y₀=pool(Y).
  • For s=1..S-1: forward without grad to get (z_s, y_s).
  • Final step s=S: forward with grad; compute ℒ; backprop.
  • EMA: decay 0.999–0.9999; evaluate with EMA weights.

    5.3 Inference policy

    • Set S_max (e.g., 48).

    • Run steps until p_halt,s ≥ τ (τ≈0.8–0.9) or s=S_max.

    • Return (ŷ=y_s, p_halt,s) and a small trace (S used, top contributing candidates).
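    The inference policy above can be written as a short loop against the CTR.step interface from Section 7. This is a sketch: the `ctr_infer` name and the all-examples-halt batch policy are assumptions (per-example halting masks are equally possible).

```python
import torch

@torch.no_grad()
def ctr_infer(ctr, q, rbar, Y, t, s_max=48, tau=0.85, h=512):
    """Run CTR steps until p_halt >= tau or s = s_max (assumes s_max >= 1).
    Returns the refined vector, final halting confidence, and steps used."""
    y = Y.mean(dim=1)                  # pooled initialization y0
    z = q.new_zeros(q.shape[0], h)     # z0 = 0
    p = q.new_zeros(q.shape[0], 1)
    for s in range(1, s_max + 1):
        z, y, p = ctr.step(z, y, q, rbar, t)
        if bool((p >= tau).all()):     # simple policy: halt when all examples halt
            break
    return y, p, s                     # (y_hat, p_halt, steps used for the trace)
```

    The returned step count is the "S used" element of the small trace mentioned above.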

    6. End-to-End Flow (with Context Injection)

                   ┌───────────────────────────────────────────────────────────┐

    [Text Q] ──► E ┤  q(768)  │ vecRAG: r1..rK(768) → pool r̄ │ LVM → Y( K×768 ) ├─┐

                   └───────────────────────────────────────────────────────────┘ │

                                                         TMD router → lane t     │

                                                                                  ▼

                                 ┌───────────────────────────────────────────────┐

                                 │         CTR-SGPS (iterative, tiny)            │

                                 │ Inputs: q, r̄, Y, t                            │

                                 │ Loop:   (z,y) ← f(z,y; q,r̄,Y,t)               │

                                 │ Halt:   if p_halt≥τ or s=S_max                 │

                                 └───────────────────────────────────────────────┘

                                                    │

                                                    ▼

                                          ŷ(768), p_halt

                                                    │

                                             vec2text(ŷ)

    7. Pseudocode (clean & runnable structure)

    7.1 CTR core (PyTorch-style sketch)

    import torch

    class CTR(torch.nn.Module):

        def __init__(self, d=768, h=512, d_tmd=32):

            super().__init__()

            self.f_lat = torch.nn.Sequential(

                torch.nn.LayerNorm(d + d + d + d + d_tmd),  # y, z, q, r̄, t

                torch.nn.Linear(d + d + d + d + d_tmd, h),

                torch.nn.GELU(),

                torch.nn.Linear(h, h)

            )

            self.f_ans = torch.nn.Sequential(

                torch.nn.LayerNorm(d + h),

                torch.nn.Linear(d + h, d),

                torch.nn.GELU(),

                torch.nn.Linear(d, d)

            )

            self.halt = torch.nn.Linear(h + d + d + d_tmd, 1)

        def step(self, z, y, q, rbar, t):

            h_lat = self.f_lat(torch.cat([y, z, q, rbar, t], dim=-1))

            z = z + h_lat

            dy = self.f_ans(torch.cat([y, z], dim=-1))

            y = torch.nn.functional.normalize(y + dy, dim=-1)  # SGPS projection

            p = torch.sigmoid(self.halt(torch.cat([z, y, q, t], dim=-1)))

            return z, y, p

    7.2 Training loop (no-grad … grad)

    def train_step(batch, ctr, optimizer, S=16, tau=0.85, ema=None):

        q, rbar, Y, t, y_star = batch  # shapes: [B,768], [B,768], [B,K,768], [B,d_tmd], [B,768]

        y = Y.mean(dim=1)              # simple pool; can learn

        z = torch.zeros_like(q[:, :512])  # h=512

        with torch.no_grad():

            for s in range(S-1):

                z, y, p = ctr.step(z, y, q, rbar, t)

        # final supervised step

        z, y, p = ctr.step(z, y, q, rbar, t)

        align = 1 - torch.cosine_similarity(y, y_star, dim=-1)

        ok = (torch.cosine_similarity(y, y_star, dim=-1) >= tau).float()

        halt_loss = torch.nn.functional.binary_cross_entropy(p.squeeze(-1), ok)

        loss = align.mean() + 0.2 * halt_loss

        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

        if ema: ema.update(ctr)

        return loss.item()
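    The trainer above calls `ema.update(ctr)` without defining it. A minimal EMA helper consistent with that call and with the decay range from Section 5.2 might look like the following (the class name and `shadow` attribute are illustrative):

```python
import copy
import torch

class EMA:
    """Exponential moving average of model weights for inference stability.
    shadow <- decay * shadow + (1 - decay) * current; decay in 0.999-0.9999."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # frozen averaged copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```

    Evaluation then runs `ema.shadow` in place of the live model, per the "evaluate with EMA weights" item in Section 5.2.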

    8. Evaluation Plan

    8.1 Metrics

    Vector alignment: Δcos(ŷ, y) uplift vs. baseline LVM head.

    Retrieval synergy: MRR / nDCG change when CTR is inserted.

    Lane success: Per-TMD pass rate; compute used (avg steps to halt).

    Downstream text: BLEU/ROUGE for vec2text, human preference when applicable.
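    The vector-alignment metric can be computed directly. A small helper sketch; the `delta_cos_uplift` name is an assumption:

```python
import torch

def delta_cos_uplift(y_hat, y_base, y_star):
    """Mean cosine uplift of the refined vector y_hat over the baseline
    LVM head y_base, measured against the CPESH 'Expected' target y_star.
    All tensors: [B, 768]."""
    cos = torch.nn.functional.cosine_similarity
    return (cos(y_hat, y_star, dim=-1) - cos(y_base, y_star, dim=-1)).mean()
```

    A positive value means CTR moved answers closer to the target on average; reporting it per TMD lane pairs naturally with the lane success metric above.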

    8.2 Ablations

    With/without q (question vector).

    With/without r̄ (retrieval context).

    With/without TMD adapters.

    Attention-free vs light attention over Y.

    Training schedule: full-grad every step vs no-grad…grad.

    EMA on/off.

    9. Compute & Scaling Notes

    Memory: ~linear in steps (store z and y; discard intermediate activations in no-grad steps).

    Latency: Proportional to the halting step ŝ; cheap on easy queries.

    Params: Few million; scalable across lanes via adapters.

    Throughput: Amenable to MPS on Mac; can batch multiple queries with different halting horizons.
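    Batching queries with different halting horizons reduces to finding, per example, the first step whose confidence crossed τ. A sketch of that bookkeeping (the function name is illustrative):

```python
import torch

@torch.no_grad()
def first_halt_step(p_steps, tau=0.85):
    """p_steps: [S, B] halting confidences per step and example.
    Returns, per example, the index of the first step with p >= tau,
    or S-1 for examples that never halt (they run to the cap)."""
    S, B = p_steps.shape
    crossed = p_steps >= tau                       # [S, B] bool
    # argmax on a 0/1 tensor returns the FIRST maximal index per column
    return torch.where(crossed.any(dim=0),
                       crossed.float().argmax(dim=0),
                       torch.full((B,), S - 1))
```

    Downstream, each example's output is gathered at its own halt step, so easy queries stop paying for steps only the hard ones need.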

    10. Risks & Mitigations

    Over-confidence halting: Calibrate τ via validation; add small penalty for premature halts.

    Manifold drift in y: Use normalize/Proj_SGPS and consistency regularizers (e.g., keep ‖Δy‖ bounded).

    Lane leakage: Strengthen TMD gating; train with lane-specific hard negatives (CPESH “H”).

    Insufficient context: Always pass q and r̄ alongside Y.

    11. Roadmap (30-day)

    Week 1: Implement CTR-SGPS (Pattern A), wire to LVM outputs and vecRAG; add adapters; set S_max=32.

    Week 2: Integrate no-grad→grad schedule; add EMA; run CPESH-Day3/Day4 subsets.

    Week 3: Lane-specialize (TREX), calibrate τ per lane; add light attention path for long slots.

    Week 4: Full eval suite + ablations; ship a Makefile target and CLI flags (--ctr --ctr-steps --ctr-tau).

    12. Implementation Interfaces

    CLI flags (examples):

    --ctr.enable

    --ctr.steps 32

    --ctr.tau 0.88

    --ctr.h 512

    --ctr.pool mean|attn

    --ctr.adapters lora-rank=8

    --ctr.lane FACTOID|MATH|CODE

    Python API:

    ŷ, p = ctr_refine(q, Y, rbar, t, steps=32, tau=0.88)

    13. Naming Decision

    • Recommend: CTR-SGPS (Contextual Tiny Refiner — Semantic GPS).

    • Lane specialist variant: TREX-<LANE> (e.g., TREX-MATH, TREX-FACTOID).

    14. Conclusion

    You don’t need a deeper tower to get smarter signals—you need iterative improvement that is context-aware and compute-adaptive. CTR-SGPS adds exactly that to your LNSP: a tiny, recursive vector refiner that sees the question (q), the retrieval context (r̄), the LVM’s candidates (Y), and the lane (t)—and then moves the answer vector closer to the CPESH “Expected” with a calibrated halting policy. It cleanly fits your 3→2→1 loop and keeps everything in your 768-D Semantic GPS.

    Appendix A — Minimal End-to-End Trainer (pseudo-CLI)

    make ctr-train \

      DATA=cpesh/train.jsonl \

      LVM_CKPT=... \

      CTR_H=512 CTR_STEPS=16 CTR_TAU=0.88 \

      TMD_ADAPTERS=on EMA=0.999

    make ctr-eval \

      DATA=cpesh/val.jsonl \

      REPORT=reports/ctr_eval.md

    Appendix B — ASCII Timing Diagram

    time →

    Embed(q) ─┬─────────────┐

    vecRAG(r) ├─► pool r̄ ──┼───────────────┐

    LVM(Y) ───┘            │               │

                           ▼               │

                     CTR step 1: (z0,y0) → (z1,y1), p1

                           │               │

                     CTR step 2: (z1,y1) → (z2,y2), p2

                           │               │

                           ⋮               │

                     CTR step ŝ: (zŝ-1,yŝ-1) → (zŝ,yŝ), pŝ ≥ τ  ─► HALT

                                                       │

                                                    ŷ=yŝ

    A natural next step is a repo-ready scaffold (PyTorch module, trainer, Makefile, and a FastAPI endpoint POST /ctr_refine) that drops straight into the LNSP codebase.
