Tiny Recursion Meets Latent-Space Reasoning
A White Paper for Integrating a Minimal Recursive Refiner into the LNSP/LVM Stack

Author: Trent Carter + ChatGPT
Date: October 12, 2025 · 13 min read · 1,898 words
Scope: LNSP (Latent Neurolese Semantic Processor), LVM (Large Vector Model), TMD lanes, CPESH supervision, vecRAG, Semantic GPS (768-D), Mamba-family LVM core.

Abstract

We propose a tiny, recursion-based refiner—a 2-layer micro-network repeatedly applied over a compact latent state—to improve vector-only reasoning in the LNSP/LVM pipeline. Instead of adding a deep stack of parameters, we apply many small improvement steps with an adaptive halting head. The module ingests (i) the question/context vector (Semantic GPS), (ii) LVM output candidate vectors (top-K), and (iii) lane metadata (TMD). It returns a refined 768-D concept vector plus a halting confidence, before vec2text.

We detail three integration patterns:

  • TRM-as-Refiner (post-retrieval/post-LVM, vector-only).
  • TRM-style Training (no-grad recursions + final grad step) to teach “improve-by-iteration.”
  • Tiny Recursion Experts (TREX) per TMD lane with shared body + lightweight adapters.

This white paper formalizes interfaces, training, inference, evaluation, and risks, and includes ASCII diagrams and pseudocode to enable immediate prototyping on your 768-D Semantic GPS.

    1. Background & Motivation

    1.1 LNSP/LVM today

    Inputs: Question → embed to q⁷⁶⁸ (Semantic GPS).

    vecRAG: Retrieve top-K concept vectors r₁..r_K⁷⁶⁸.

    Core LVM (Mamba-family): Produces answer candidate vectors y₁..y_M⁷⁶⁸.

    TMD Router: Chooses one or more lanes (e.g., L1_FACTOID, Math-Deriv, Code-API).

    vec2text: Convert final vector back to text if needed.

    1.2 Why a tiny recursion module?

    Capacity via steps, not depth. A tiny network (≈2 layers) is iterated S times over a latent state z, yielding compound reasoning without a deep param stack.

    Adaptive compute: A halting head stops early on easy cases; runs longer on hard ones.

    Data-efficient: “Deep supervision across improvement steps” (many no-grad recursions + one grad step) teaches iterative refinement with modest compute.

    2. The Module: Contextual Tiny Recursion Refiner (CTR)

    Working name options (pick one and standardize):

    CTR-SGPS (Contextual Tiny Refiner — Semantic GPS)

    TRM-SGPS (Tiny Recursion Model aligned to SGPS)

    SGPS-R (Semantic GPS Refiner)

    C-TRM (Contextual TRM)

    TREX-Lane (Tiny Recursion EXpert per lane)

    We recommend CTR-SGPS for clarity.

    2.1 I/O and shapes (default)

    Semantic space: 768-D (SGPS).

    Inputs:

    q⁷⁶⁸: embedded question/context vector (can be a distilled bundle of multi-concept inputs).

    Y^{K×768}: top-K candidate answer vectors from LVM (K≈5–10).

    r̄⁷⁶⁸: (optional) retrieval centroid (mean of vecRAG hits) or a learned attention-pooled retrieval vector.

    t^d_TMD: TMD lane features (one-hot or learned embedding; d_TMD≈32–64).

    Outputs:

    ŷ⁷⁶⁸: refined answer vector.

    p_halt ∈ (0,1): halting confidence.

    2.2 Internal structure

    A compact MLP core applied recurrently over a latent state:

    • Latent z^h (suggest h=512 default; can set h=768 for isometry with SGPS).

    • Two small transforms reused at each step _s_ = 1…S:

    Latent update:

    z_{s} = f_lat(z_{s-1}, q, y_{s-1}, r̄, t)  (MLP + gating; LayerNorm)

    Answer update:

    y_{s} = f_ans(y_{s-1}, z_{s})  (MLP residual to keep y on-manifold)

    Halting head:

    p_halt,s = σ(wᵀ·[z_s ⊕ y_s ⊕ q ⊕ t] + b)

    Stop when p_halt,s ≥ τ or when s = S_max (cap).

    Attention-free vs attentionful:

    • For small fixed slots (e.g., K≤8), fold Y into a pooled summary plus per-candidate deltas using MLP mixing.

    • For larger candidate sets or structured artifacts (code traces, graph walks), add a light self-attention over {y_{s-1}^k} before pooling.
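    The two pooling paths above can be sketched as a single module. This is a minimal illustration, not the production design: the `CandidatePool` name and the bilinear scoring function are assumptions; any scorer that compares each candidate to q would do.

```python
import torch

class CandidatePool(torch.nn.Module):
    """Pools K candidate vectors Y into one d-dim summary.
    mode='attn': score each candidate against the question vector q
    (light attention for larger candidate sets);
    mode='mean': attention-free fallback for small fixed slots (K<=8)."""
    def __init__(self, d=768, mode="attn"):
        super().__init__()
        self.mode = mode
        # illustrative bilinear scorer: s_k = q^T W y_k
        self.score = torch.nn.Bilinear(d, d, 1)

    def forward(self, Y, q):
        # Y: [B, K, d], q: [B, d]
        if self.mode == "mean":
            return Y.mean(dim=1)
        q_exp = q.unsqueeze(1).expand_as(Y).contiguous()   # [B, K, d]
        scores = self.score(q_exp, Y)                      # [B, K, 1]
        w = torch.softmax(scores, dim=1)                   # attention over K
        return (w * Y).sum(dim=1)                          # [B, d]
```

    Either mode yields the pooled summary that initializes y₀; per-candidate deltas can still be kept internally as described above.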

    3. Where It Fits: Three Integration Patterns

    3.1 Pattern A — CTR-SGPS as a Refiner

    Run CTR after LVM candidate generation and lane routing.

    [Question Text]

       │

       ├─► Embed → q(768)

       ├─► vecRAG → r1..rK (768) ──► pool → r̄(768)

       └─► LVM(Mamba) → y1..yM (768), M≈5..10

              │

              ├─► TMD router → lane(s), t

              └─► CTR-SGPS(q, Y, r̄, t) → iterative refine → ŷ, p_halt

                                       └─► if needed vec2text(ŷ)

    Why: Improve fidelity to the original question by conditioning on q + r̄ + t while refining Y. Early stop if confident.

    3.2 Pattern B — CTR-style Training for LVM heads

    Teach the LVM (or its vector head) to improve by recursion without huge compute:

    Supervision schedule: For each training example:

    • Run S-1 steps no-grad (detached), then 1 step with backprop.

    • Use deep supervision across steps to stabilize small data training.

    Losses:

    Cosine alignment: L_align = 1 − cos(ŷ, y) where y is CPESH “Expected.”

    Halting BCE: target 1 if cos(ŷ,y)≥τ; else 0.

    Lane regularizers: optional per-lane priors (e.g., smoothness for Math-Deriv).

    EMA: Maintain an exponential moving average of weights for inference stability.

    Why: You get the iteration skill with minimal backprop cost and better small-data generalization.

    3.3 Pattern C — TREX per TMD lane

    Deploy CTR as Tiny Recursion EXperts:

    One shared CTR body, lane adapters (LoRA/FiLM) keyed by TMD.

    Compute policy: e.g., S_max=48 with ACT-style halting.

    3→2→1 loop integration: Use CTR primarily for verify→refine, then hand back ŷ to the lane combiner (or to a Mamba verifier if confidence low).

    Why: Specialist behavior with a tiny footprint and adaptive compute budget.
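    One way to realize the lane adapters is FiLM conditioning: the TMD lane embedding t produces a per-feature scale and shift applied to the shared CTR latent. A minimal sketch under assumed dimensions (h=512, d_TMD=32); the `LaneFiLM` name is illustrative, and a LoRA-style low-rank update would be an equally valid choice.

```python
import torch

class LaneFiLM(torch.nn.Module):
    """FiLM-style lane adapter over the shared CTR body:
    z' = gamma(t) * z + beta(t), with gamma initialized near identity."""
    def __init__(self, h=512, d_tmd=32):
        super().__init__()
        self.to_gamma = torch.nn.Linear(d_tmd, h)
        self.to_beta = torch.nn.Linear(d_tmd, h)

    def forward(self, z, t):
        # z: [B, h] shared-body latent; t: [B, d_tmd] lane embedding
        gamma = 1.0 + self.to_gamma(t)   # near-identity at init
        beta = self.to_beta(t)
        return gamma * z + beta
```

    Because only the two small linear maps are lane-specific, each TREX lane adds a few tens of thousands of parameters on top of the shared body.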

    4. Formalization

    Let Y = {y₀^k}_{k=1..K} be LVM candidates; we initialize y₀ as a pooled candidate (e.g., softmax over scores) and keep per-candidate deltas internally.

    Step s update:

    \begin{aligned}
    h_s &= \phi\big(W_h [z_{s-1} \oplus y_{s-1} \oplus q \oplus \bar{r} \oplus t] + b_h\big) \\
    z_s &= z_{s-1} + U_h h_s \quad (\text{residual + norm}) \\
    \tilde{y}_s &= \psi\big(W_y [y_{s-1} \oplus z_s] + b_y\big) \\
    y_s &= \text{proj}_{\text{SGPS}}\big(y_{s-1} + \tilde{y}_s\big) \quad (\text{unit-norm or SGPS manifold proj}) \\
    p_{\text{halt},s} &= \sigma\big(w^\top[z_s \oplus y_s \oplus q \oplus t] + b\big)
    \end{aligned}

    Success indicator for supervision:

    \mathbb{1}_\text{ok}(y_s, y^*) = \mathbb{1}\{\cos(y_s, y^*) \ge \tau\}

    Loss (final supervised step):

    \mathcal{L} = \lambda_1\big(1-\cos(y_S, y^*)\big) \;+\;
    \lambda_2\,\text{BCE}\big(p_{\text{halt},S},\, \mathbb{1}_\text{ok}\big) \;+\;
    \lambda_3\,\Omega_\text{lane}(y_{1:S})

    5. Training & Inference

    5.1 Data & alignment

    Dimensionality: Keep 768-D throughout to remain isometric with Semantic GPS.

    Co-training: Train CTR on the same corpus and CPESH instances used to align the LVM to SGPS.

    Lane-specific fine-tuning: After base CTR pretrain, apply adapters per lane.

    5.2 Schedule (Pattern B baseline)

    Per batch example:

  • Compute q, r̄, Y. Initialize z₀=0, y₀=pool(Y).
  • For s=1..S-1: forward without grad to get (z_s, y_s).
  • Final step s=S: forward with grad; compute ℒ; backprop.
  • EMA: decay 0.999–0.9999; evaluate with EMA weights.

    5.3 Inference policy

    • Set S_max (e.g., 48).

    • Run steps until p_halt,s ≥ τ (τ≈0.8–0.9) or s=S_max.

    • Return (ŷ=y_s, p_halt,s) and a small trace (S used, top contributing candidates).
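    The inference policy above can be written as a short loop against the CTR.step interface from Section 7. This is a sketch: the `ctr_infer` name and the all-examples-halt batch policy are assumptions (per-example halting masks are equally possible).

```python
import torch

@torch.no_grad()
def ctr_infer(ctr, q, rbar, Y, t, s_max=48, tau=0.85, h=512):
    """Run CTR steps until p_halt >= tau or s = s_max (assumes s_max >= 1).
    Returns the refined vector, final halting confidence, and steps used."""
    y = Y.mean(dim=1)                  # pooled initialization y0
    z = q.new_zeros(q.shape[0], h)     # z0 = 0
    p = q.new_zeros(q.shape[0], 1)
    for s in range(1, s_max + 1):
        z, y, p = ctr.step(z, y, q, rbar, t)
        if bool((p >= tau).all()):     # simple policy: halt when all examples halt
            break
    return y, p, s                     # (y_hat, p_halt, steps used for the trace)
```

    The returned step count is the "S used" element of the small trace mentioned above.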

    6. End-to-End Flow (with Context Injection)

                   ┌───────────────────────────────────────────────────────────┐

    [Text Q] ──► E ┤  q(768)  │ vecRAG: r1..rK(768) → pool r̄ │ LVM → Y( K×768 ) ├─┐

                   └───────────────────────────────────────────────────────────┘ │

                                                         TMD router → lane t     │

                                                                                  ▼

                                 ┌───────────────────────────────────────────────┐

                                 │         CTR-SGPS (iterative, tiny)            │

                                 │ Inputs: q, r̄, Y, t                            │

                                 │ Loop:   (z,y) ← f(z,y; q,r̄,Y,t)               │

                                 │ Halt:   if p_halt≥τ or s=S_max                 │

                                 └───────────────────────────────────────────────┘

                                                    │

                                                    ▼

                                          ŷ(768), p_halt

                                                    │

                                             vec2text(ŷ)

    7. Pseudocode (clean & runnable structure)

    7.1 CTR core (PyTorch-style sketch)

    import torch

    class CTR(torch.nn.Module):

        def __init__(self, d=768, h=512, d_tmd=32):

            super().__init__()

            self.f_lat = torch.nn.Sequential(

                torch.nn.LayerNorm(d + d + d + d + d_tmd),  # y, z, q, r̄, t

                torch.nn.Linear(d + d + d + d + d_tmd, h),

                torch.nn.GELU(),

                torch.nn.Linear(h, h)

            )

            self.f_ans = torch.nn.Sequential(

                torch.nn.LayerNorm(d + h),

                torch.nn.Linear(d + h, d),

                torch.nn.GELU(),

                torch.nn.Linear(d, d)

            )

            self.halt = torch.nn.Linear(h + d + d + d_tmd, 1)

        def step(self, z, y, q, rbar, t):

            h_lat = self.f_lat(torch.cat([y, z, q, rbar, t], dim=-1))

            z = z + h_lat

            dy = self.f_ans(torch.cat([y, z], dim=-1))

            y = torch.nn.functional.normalize(y + dy, dim=-1)  # SGPS projection

            p = torch.sigmoid(self.halt(torch.cat([z, y, q, t], dim=-1)))

            return z, y, p

    7.2 Training loop (no-grad … grad)

    def train_step(batch, ctr, optimizer, S=16, tau=0.85, ema=None):

        q, rbar, Y, t, y_star = batch  # shapes: [B,768], [B,768], [B,K,768], [B,d_tmd], [B,768]

        y = Y.mean(dim=1)              # simple pool; can learn

        z = torch.zeros_like(q[:, :512])  # h=512

        with torch.no_grad():

            for s in range(S-1):

                z, y, p = ctr.step(z, y, q, rbar, t)

        # final supervised step

        z, y, p = ctr.step(z, y, q, rbar, t)

        align = 1 - torch.cosine_similarity(y, y_star, dim=-1)

        ok = (torch.cosine_similarity(y, y_star, dim=-1) >= tau).float()

        halt_loss = torch.nn.functional.binary_cross_entropy(p.squeeze(-1), ok)

        loss = align.mean() + 0.2 * halt_loss

        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

        if ema: ema.update(ctr)

        return loss.item()
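    The trainer above calls `ema.update(ctr)` without defining it. A minimal EMA helper consistent with that call and with the decay range from Section 5.2 might look like the following (the class name and `shadow` attribute are illustrative):

```python
import copy
import torch

class EMA:
    """Exponential moving average of model weights for inference stability.
    shadow <- decay * shadow + (1 - decay) * current; decay in 0.999-0.9999."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # frozen averaged copy
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```

    Evaluation then runs `ema.shadow` in place of the live model, per the "evaluate with EMA weights" item in Section 5.2.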

    8. Evaluation Plan

    8.1 Metrics

    Vector alignment: Δcos(ŷ, y) uplift vs. baseline LVM head.

    Retrieval synergy: MRR / nDCG change when CTR is inserted.

    Lane success: Per-TMD pass rate; compute used (avg steps to halt).

    Downstream text: BLEU/ROUGE for vec2text, human preference when applicable.
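    The vector-alignment metric can be computed directly. A small helper sketch; the `delta_cos_uplift` name is an assumption:

```python
import torch

def delta_cos_uplift(y_hat, y_base, y_star):
    """Mean cosine uplift of the refined vector y_hat over the baseline
    LVM head y_base, measured against the CPESH 'Expected' target y_star.
    All tensors: [B, 768]."""
    cos = torch.nn.functional.cosine_similarity
    return (cos(y_hat, y_star, dim=-1) - cos(y_base, y_star, dim=-1)).mean()
```

    A positive value means CTR moved answers closer to the target on average; reporting it per TMD lane pairs naturally with the lane success metric above.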

    8.2 Ablations

    With/without q (question vector).

    With/without r̄ (retrieval context).

    With/without TMD adapters.

    Attention-free vs light attention over Y.

    Training schedule: full-grad every step vs no-grad…grad.

    EMA on/off.

    9. Compute & Scaling Notes

    Memory: ~linear in steps (store z and y; discard intermediate activations in no-grad steps).

    Latency: Proportional to the halting step ŝ; cheap on easy queries.

    Params: Few million; scalable across lanes via adapters.

    Throughput: Amenable to MPS on Mac; can batch multiple queries with different halting horizons.
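    Batching queries with different halting horizons reduces to finding, per example, the first step whose confidence crossed τ. A sketch of that bookkeeping (the function name is illustrative):

```python
import torch

@torch.no_grad()
def first_halt_step(p_steps, tau=0.85):
    """p_steps: [S, B] halting confidences per step and example.
    Returns, per example, the index of the first step with p >= tau,
    or S-1 for examples that never halt (they run to the cap)."""
    S, B = p_steps.shape
    crossed = p_steps >= tau                       # [S, B] bool
    # argmax on a 0/1 tensor returns the FIRST maximal index per column
    return torch.where(crossed.any(dim=0),
                       crossed.float().argmax(dim=0),
                       torch.full((B,), S - 1))
```

    Downstream, each example's output is gathered at its own halt step, so easy queries stop paying for steps only the hard ones need.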

    10. Risks & Mitigations

    Over-confidence halting: Calibrate τ via validation; add small penalty for premature halts.

    Manifold drift in y: Use normalize/Proj_SGPS and consistency regularizers (e.g., keep ‖Δy‖ bounded).

    Lane leakage: Strengthen TMD gating; train with lane-specific hard negatives (CPESH “H”).

    Insufficient context: Always pass q and r̄ alongside Y.

    11. Roadmap (30-day)

    Week 1: Implement CTR-SGPS (Pattern A), wire to LVM outputs and vecRAG; add adapters; set S_max=32.

    Week 2: Integrate no-grad→grad schedule; add EMA; run CPESH-Day3/Day4 subsets.

    Week 3: Lane-specialize (TREX), calibrate τ per lane; add light attention path for long slots.

    Week 4: Full eval suite + ablations; ship a Makefile target and CLI flags (--ctr --ctr-steps --ctr-tau).

    12. Implementation Interfaces

    CLI flags (examples):

    --ctr.enable

    --ctr.steps 32

    --ctr.tau 0.88

    --ctr.h 512

    --ctr.pool mean|attn

    --ctr.adapters lora-rank=8

    --ctr.lane FACTOID|MATH|CODE

    Python API:

    ŷ, p = ctr_refine(q, Y, rbar, t, steps=32, tau=0.88)

    13. Naming Decision

    • Recommend: CTR-SGPS (Contextual Tiny Refiner — Semantic GPS).

    • Lane specialist variant: TREX-<LANE> (e.g., TREX-MATH, TREX-FACTOID).

    14. Conclusion

    You don’t need a deeper tower to get smarter signals—you need iterative improvement that is context-aware and compute-adaptive. CTR-SGPS adds exactly that to your LNSP: a tiny, recursive vector refiner that sees the question (q), the retrieval context (r̄), the LVM’s candidates (Y), and the lane (t)—and then moves the answer vector closer to the CPESH “Expected” with a calibrated halting policy. It cleanly fits your 3→2→1 loop and keeps everything in your 768-D Semantic GPS.

    Appendix A — Minimal End-to-End Trainer (pseudo-CLI)

    make ctr-train \

      DATA=cpesh/train.jsonl \

      LVM_CKPT=... \

      CTR_H=512 CTR_STEPS=16 CTR_TAU=0.88 \

      TMD_ADAPTERS=on EMA=0.999

    make ctr-eval \

      DATA=cpesh/val.jsonl \

      REPORT=reports/ctr_eval.md

    Appendix B — ASCII Timing Diagram

    time →

    Embed(q) ─┬─────────────┐

    vecRAG(r) ├─► pool r̄ ──┼───────────────┐

    LVM(Y) ───┘            │               │

                           ▼               │

                     CTR step 1: (z0,y0) → (z1,y1), p1

                           │               │

                     CTR step 2: (z1,y1) → (z2,y2), p2

                           │               │

                           ⋮               │

                     CTR step ŝ: (zŝ-1,yŝ-1) → (zŝ,yŝ), pŝ ≥ τ  ─► HALT

                                                       │

                                                    ŷ=yŝ

    A natural next step is a repo-ready scaffold (PyTorch module, trainer, Makefile, and a FastAPI endpoint POST /ctr_refine) that drops straight into the LNSP codebase.
