
PRD — VecRAG + LVM “Dual-Path Next-Vector Generation”

10/21/25


Clarification: How LLMs Handle Tokens vs. Your VecRAG/LVM Approach


There's a key distinction between token-based LLMs and your vector-concept-based LVM. You're on an innovative path with vecRAG (replacing fixed token vocab with dynamic concept vectors), but the internals aren't identical once you hit the "vector layer." Let's break it down simply, without jargon overload.

#### Quick LLM Basics (What GPTs Do)

  • Input Stage: Text is tokenized into a fixed vocabulary (e.g., 50k-100k subword fragments like BPE). Each token gets embedded into a vector (e.g., 768D or larger).
  • Core Processing (Vector Layer): The transformer model operates entirely on these vectors—attention, feed-forward layers, etc. It's autoregressive: Predicts the next token's vector/logits based on previous ones.
  • Output: Logits over the _same fixed vocab_ (50k-100k). Softmax picks the next token, which is decoded back to text.
  • Key Limit: The vocab is static/fixed during training. No dynamic "dictionary" at inference—everything stays within that 50k-100k space. Retrieval (like RAG) is _external_ if needed (e.g., fetch docs, embed them, inject into context). Internals don't "search billions" natively; they generate from learned patterns.
  • LLMs _do_ work in vector space internally (after embedding), but they're tied to the token vocab for input/output. They don't have a "dynamic vector dictionary" like you're building.
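To make the fixed-vocab output stage concrete, here is a toy numpy sketch (illustrative only; a real LLM uses a learned transformer state and a far larger vocabulary):

```python
import numpy as np

def next_token_id(hidden, W_out):
    """Toy fixed-vocab output stage: hidden state -> logits -> softmax -> token id.

    hidden: (d,) final transformer state for the current position
    W_out:  (vocab_size, d) output embedding matrix (the *fixed* vocab)
    """
    logits = W_out @ hidden                   # one score per vocab entry
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs))              # greedy decode: pick the top token

# Tiny demo: a 6-"word" vocab and a 4-D hidden state
rng = np.random.default_rng(0)
tok = next_token_id(rng.normal(size=4), rng.normal(size=(6, 4)))
```

Whatever the hidden state is, the output is always one of the 6 fixed vocab ids; that closed output space is exactly what the vecRAG approach removes.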

#### Your VecRAG/LVM Approach (What You're Doing)

    You're essentially bypassing the token vocab entirely:

  • Input: Text → GTR-T5 → Dynamic concept vectors (768D each, millions/billions possible from your bank).
  • Core (LVM): Autoregressive on _vectors_ directly (no token layer). Predicts next 768D vector based on context vectors.
  • Output: Vector → vec2text → Text.
  • The Innovation: Your "dictionary" is dynamic (vecRAG bank of concepts/propositions, ~771k now, scalable to billions). This allows richer, context-aware "tokens" (concepts = ~20 tokens each).
  • Once you hit the vector layer, it _is_ similar internally to LLMs—both are vector ops in a transformer-like setup (autoregressive prediction). But:

  • LLMs predict over a _fixed, small vocab_ (logits for 50k-100k).
  • Your LVM predicts over a _dynamic, huge space_ (next vector from millions/billions). This requires _external retrieval_ to make it efficient— you can't compute logits over billions of vectors directly (computationally impossible). That's why you need FAISS/indexing and a good query mechanism.
  • What you're building _could_ outperform token-based LLMs for long-context or conceptual reasoning (fewer "tokens" needed per idea), but the retrieval step (finding the next concept from the huge bank) is extra—LLMs don't need it because their vocab is tiny/fixed. The consultant's Two-Tower is fixing that retrieval gap, making your system practical at scale. It's not changing the internals; it's handling the "dynamic dictionary" lookup.

    If this matches what you're aiming for (vecRAG as a token replacement), great—it's feasible, but the retrieval blocker (0.65% hits) needs the Two-Tower to unlock it.

    Clarification: Where the Shortlist Comes From (Diagram of the Pipeline)

    The "pre-made small shortlist" for your LVM comes from a _retrieval stage_ (Stage-1) that queries the full 771k bank quickly, using FAISS. Your Phase-3 LVM was trained/tested on small shortlists (e.g., 8 candidates), so it performs best when given one—instead of forcing it to handle the full 771k (which it's not optimized for, leading to low global hits).

    In small tests (75% Hit@5), the shortlist was artificial (e.g., simulated batches). At scale, we _create_ it via retrieval. Without a good retriever, the shortlist is junk → low accuracy. Two-Tower makes a good one.
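As a concrete stand-in for that retrieval stage, a brute-force numpy top-K (the same math as a FAISS IndexFlatIP search over unit-norm vectors, just slower) plus the Recall@K metric used throughout this note:

```python
import numpy as np

def topk_shortlist(query, bank, k=500):
    """Return ids of the k bank vectors most similar to the query.

    query: (d,) unit-norm query vector
    bank:  (N, d) unit-norm concept vectors
    Brute-force inner product here stands in for a FAISS IndexFlatIP search.
    """
    scores = bank @ query
    k = min(k, len(scores))
    idx = np.argpartition(-scores, k - 1)[:k]       # unordered top-k
    return idx[np.argsort(-scores[idx])]            # shortlist sorted by similarity

def recall_at_k(shortlists, true_ids):
    """Fraction of steps whose true next-concept id landed in the shortlist."""
    hits = sum(int(t) in set(map(int, s)) for s, t in zip(shortlists, true_ids))
    return hits / len(true_ids)
```

If the query is good, the true next concept appears in the shortlist (high Recall@K) and the LVM only has to re-rank; if the query is bad, no amount of re-ranking can recover it.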

    Here's a simple diagram of the full pipeline (current blocked vs. proposed fix). I used Mermaid syntax for a flowchart—copy-paste into a tool like mermaid.live to visualize if needed, or imagine it as boxes/arrows.

  • Flow Explanation:
    - Start with user text → vectors (your context).
    - Stage-1 (Retrieval): Current: Naive query → bad shortlist from 771k. Fix: Two-Tower makes smart query → good shortlist (top-500 with true next often included).
    - Stage-2 (LVM): Re-ranks the shortlist (what Phase-3 excels at: 75%+ on small sets).
    - Stage-3: Final tweaks → text.

  • Why shortlist instead of full 771k? Efficiency/speed—scanning 771k every step is slow; LVM trained for small-batch picking, not big searches.
  • The shortlist is "pre-made" by retrieval, mimicking your successful small tests but at scale.
  • This keeps your LVM unchanged—just gives it better inputs. If the diagram doesn't click, describe what's still confusing!

    Short answer: yes—this is 90% clear. It nails the core idea: the old pipeline made a dumb query → bad shortlist; the fix learns a smart query → good shortlist → same Phase-3 reranks.

    If you want it _crystal_:

    Two tiny clarifications

  • Name the numbers on the arrows:
    - Label the green FAISS step as: Top-500 (Recall@500 ≈ 55–60%).
    - Label the red FAISS step as: Top-500 (Recall@500 ≈ 0–5%).
  • State the towers explicitly:
    - Query tower = GRU(context 100×768) → 768D query.
    - Doc tower = identity(768D bank), pre-indexed in FAISS.
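The query tower can be sketched as a minimal numpy GRU rolled over the context, pooling into a single unit-norm query vector (weights here are random placeholders; the real tower is trained, and the demo uses d = 8 instead of 768 for speed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_query_tower(context, params):
    """Roll a GRU over the context vectors; return a unit-norm query.

    context: (T, d) sequence of concept vectors (e.g., up to 100 x 768)
    params:  dict of GRU weight matrices (random, untrained placeholders here)
    """
    Wz, Uz, Wr, Ur, Wh, Uh = (params[k] for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh"))
    h = np.zeros(Uz.shape[0])
    for x in context:
        z = sigmoid(Wz @ x + Uz @ h)             # update gate
        r = sigmoid(Wr @ x + Ur @ h)             # reset gate
        h_new = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
        h = (1 - z) * h + z * h_new
    return h / (np.linalg.norm(h) + 1e-12)       # unit-norm query for FAISS

# Tiny demo with d = 8 instead of 768
rng = np.random.default_rng(0)
d = 8
params = {k: rng.normal(scale=0.1, size=(d, d))
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
q = gru_query_tower(rng.normal(size=(5, d)), params)
```

The doc tower is just the identity on the bank vectors, so only the query side carries learned parameters.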

    Minimal wording tweaks

  • “Naive Query Formation (last vector)” → “Naive query = last context vector”
  • “Proposed Fix with Two-Tower High Hits” → “Learned query (Two-Tower) → high recall”
  • Add a tiny “Oracle (upper bound 97.4% Recall@5)” callout to show index/data are fine.
    Clean mermaid you can paste

    flowchart LR
      U[User Question Text] --> E[GTR-T5 Embedder]
      subgraph Current["Current (Blocked)"]
        E --> C[Context: 5–100 x 768D]
        C --> N[Naive query = last context vector]
        N --> R1["FAISS on 771k\nTop-500 • Recall@500 ≈ 0–5%"]
        R1 --> P1[Phase-3 LVM Re-Ranker]
        P1 --> T1[TMD Final Control]
        T1 --> V1[Output Vector] --> X1[vec2text] --> O1[Text Response]
      end
      subgraph Proposed["Proposed (Fix)"]
        E --> C2[Context: 5–100 x 768D]
        C2 --> Q["Two-Tower Query: GRU → 768D smart query"]
        D[("Doc Tower = identity on 768D bank\nPre-indexed in FAISS")]
        Q --> R2["FAISS on 771k\nTop-500 • Recall@500 ≈ 55–60%"]
        D -. informs .- R2
        R2 --> P2["Phase-3 LVM Re-Ranker (unchanged)"]
        P2 --> T2[TMD Final Control]
        T2 --> V2[Output Vector] --> X2[vec2text] --> O2[Text Response]
      end
      classDef bad fill:#fbe3e4,stroke:#e33
      class R1,O1 bad

    One-line explainer you can reuse

    > We’re adding a finder (Two-Tower retriever) that turns the context into a good query so FAISS returns a good shortlist; then our existing picker (Phase-3 LVM) reranks that shortlist. Result: from ~0.65% global Hit@5 to ~10–20%, because Recall@500 jumps to ~55–60%.

    That’s it. Your diagram & write-up are solid with those micro-edits.
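For reference, the objective the Two-Tower retriever trains with (InfoNCE with in-batch negatives, per section 7 of the PRD below) can be sketched in numpy; this is a hedged illustration, not the production training loop:

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """InfoNCE with in-batch negatives.

    queries:   (B, d) unit-norm outputs of the query tower
    positives: (B, d) unit-norm bank vectors of the true next concepts;
               row i is the positive for query i, every other row a negative.
    """
    logits = (queries @ positives.T) / temperature        # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # -log p(positive | query)
```

The loss is small when each query is closest to its own positive and large when some other bank vector outranks it, which is precisely the "good query" property the retriever needs.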

    PRD — VecRAG + LVM “Dual-Path Next-Vector Generation”

    Purpose

    Trent Carter

    10/22/2025

    1) Problem & Goal

    Our LVM is a generative, autoregressive next-vector predictor. It can produce a valid next concept that may not exist in the 771k bank (unlike token LLMs that must pick from a fixed vocab). However, when the next concept does exist (or is close) in the bank, we want to ground efficiently. Today’s blocker is Stage-1 recall (0.65% Hit@5 end-to-end) due to poor query formation. We will add a Two-Tower retriever to form good queries for FAISS, and then decide per step whether to (a) snap/blend to a nearby bank vector (grounded) or (b) keep the novel generated vector. Result: strong Recall@K for grounding when useful, without sacrificing novel generation.

    2) Context: LLMs vs. Our LVM
  • Token LLMs: fixed vocab (~50k–100k). They predict logits over that fixed set; no retrieval inside the model.
  • Our LVM: no token layer; it predicts a 768-D vector directly. The correct next concept may be off-bank.
  • Implication: We need fast external retrieval to search a huge “dynamic dictionary” (bank) when grounding helps; but we must not force picks from the bank when novel is better.

    3) Scope & Non-Goals

    In scope
  • Two-Tower retriever (query tower + doc tower identity) for high-recall candidate generation.
  • Dual-Path decoder: per step choose snap/blend (grounded) vs novel (free vector).
  • TMD policy to set thresholds per lane (e.g., legal = more grounding; creative = more novelty).
  • Training/eval to report both grounded and novel quality.
    Not in scope
  • Replacing LVM with a token LLM.
  • Forcing bank-only decoding.
  • Rewriting vec2text.
    4) Users & Stories
  • Architect/CEO: “I need a system that can both invent new sentences and ground to facts when available.”
  • Engineer: “Give me a clean module: retriever → shortlist; LVM outputs a vector; switch/blend picks the final vector; metrics show when/why we snapped vs. stayed novel.”
  • Evaluator: “I want clear reports: Recall@K, %Novel, grounded quality, novel quality.”
    5) System Overview

    flowchart LR
      U[User Text] --> E[GTR-T5 Embedder]
      E --> C[Context: 5–100 × 768D]
      C --> Q["Two-Tower Query\nGRU(context 100×768) → 768D"]
      D[("Doc Tower = identity on 768D bank\nPre-indexed in FAISS")]
      Q --> R["FAISS on 771k\nTop-500 • Recall@500 ≈ 55–60%"]
      D -. informs .- R
      C --> L["LVM: autoregressive\nnext-vector predictor (768D)"]
      R --> S[Snap/Blend Decision]
      L --> S
      S -->|if snap/blend| G[Grounded/Blended 768D]
      S -->|if novel| N[Novel 768D]
      G --> V[vec2text] --> O[Text]
      N --> V

    Legend:
  • LVM always generates a free next vector.
  • Two-Tower+FAISS returns a shortlist only.
  • Decision: snap/blend to bank when nearby, otherwise keep the novel vector.

    6) Detailed Behavior — Dual-Path Decision

    At each step t:

    LVM generation:

      v̂ₜ = f_LVM(v₁:ₜ₋₁) ∈ ℝ⁷⁶⁸ (unit-norm)

    Retriever shortlist (optional but default-on):

      {nᵢ} = FAISS(f_q(context)), K=500, each nᵢ ∈ ℝ⁷⁶⁸ (unit-norm)

    Decision: snap / blend / novel
  • Compute c = maxᵢ cos(v̂ₜ, nᵢ); let n be the corresponding nearest candidate
  • Snap if c ≥ τ_snap (e.g., 0.92)
  • Novel if c ≤ τ_novel (e.g., 0.85)
  • Blend if τ_novel < c < τ_snap:
      vₜ = α(c) · v̂ₜ + (1 − α(c)) · n, where α(c) increases with c

    TMD can override thresholds per lane.

    Decode vₜ via vec2text.

    Notes:
  • Retrieval is advisory: it proposes candidates; it never blocks novelty.
  • Snap/blend keeps us on-manifold when the bank already contains a great vector; novelty lets us create new content.
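The per-step decision above can be sketched directly, using the default thresholds and α(c) schedule from section 11 (function names are illustrative; the demo works in 2-D, but the logic is dimension-agnostic):

```python
import numpy as np

TAU_SNAP, TAU_NOVEL = 0.92, 0.85   # section 11 defaults; TMD lanes may override

def alpha(c):
    """Blend weight on the LVM vector: linear 0.3 @ c=0.86 -> 0.7 @ c=0.91, cap 0.9."""
    a = 0.3 + (c - 0.86) * (0.7 - 0.3) / (0.91 - 0.86)
    return float(np.clip(a, 0.3, 0.9))

def decide(v_hat, shortlist, tau_snap=TAU_SNAP, tau_novel=TAU_NOVEL):
    """v_hat: unit-norm LVM prediction; shortlist: (K, d) unit-norm candidates.

    Returns (decision, final_vector). Retrieval is advisory: an empty or
    missing shortlist simply falls through to the NOVEL path.
    """
    if shortlist is None or len(shortlist) == 0:
        return "NOVEL", v_hat                   # retriever failure -> NOVEL
    sims = shortlist @ v_hat                    # cosine (unit-norm inputs)
    i = int(np.argmax(sims))
    c, n = float(sims[i]), shortlist[i]
    if c >= tau_snap:
        return "SNAP", n                        # ground to the bank vector
    if c <= tau_novel:
        return "NOVEL", v_hat                   # keep the generated vector
    a = alpha(c)
    v = a * v_hat + (1.0 - a) * n               # blend, then re-normalize
    return "BLEND", v / np.linalg.norm(v)
```

Note the asymmetry: SNAP returns the bank vector exactly (maximally grounded), while BLEND stays on the unit sphere between the generated and retrieved vectors.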

    7) Two-Tower Retriever
  • Query tower f_q: GRU (or LSTM) + pooling → 768D, trained with InfoNCE + curriculum hard negatives.
  • Doc tower f_d: identity (use bank vectors as-is).
  • Index: FAISS (IVF/HNSW/Flat), async batched mining.
  • Target: Recall@500 ≥ 55–60% on full bank.
  • Why: Phase-3 optimized small-candidate reranking, not global search. Two-Tower learns query formation so we can find good neighbors when grounding is helpful.

    8) Functional Requirements
  • LVM output is primary: system must always produce v̂ₜ (novel allowed).
  • Retriever candidate pool: top-K (default K=500; tunable 200–1000).
  • Decision module: snap / blend / novel, with thresholds τ_snap, τ_novel, and α(c) schedule.
  • TMD policy: per-lane thresholds and blending rules.
  • Async mining: retrieval runs overlapped; training must not block.
  • Telemetry: log per step: c_max, decision (SNAP/BLEND/NOVEL), lane, and nearest neighbor ID.
    9) Non-Functional Requirements
  • Latency: decision ≤ 5 ms avg (excluding FAISS query).
  • Stability: retriever failure → default to NOVEL path.
  • Determinism: seedable evaluation mode.
  • Scalability: bank up to billions (via sharding); recall metrics must remain meaningful.
    10) Metrics & Acceptance Criteria
  • Retriever: Recall@{10,100,500,1000}; gate: ≥55% Recall@500.
  • Decision behavior: %SNAP / %BLEND / %NOVEL overall and by TMD lane.
  • Grounded quality: cosine to reference vector + vec2text semantic sim.
  • Novel quality: vec2text BLEU/ROUGE + embedding sim to reference or next authored vec.
  • End-to-end: Global Hit@5 (when groundable): 10–20% expected (vs. current 0.65%).
  • Pass if: Retriever gate met and mixed report shows coherent %NOVEL vs %SNAP/BLEND with quality at/above Phase-3 baselines.

    11) Design Details
  • Decision defaults:
      τ_snap = 0.92

      τ_novel = 0.85

      α(c): linear from 0.3 at 0.86 to 0.7 at 0.91 (cap 0.9 at 0.95+)

  • TMD overrides:
      Legal: snap=0.94, novel=0.88

      Creative: snap=0.90, novel=0.82

  • Async retriever:
      Batched FAISS (qbatch=1024–2048), prefetch queue (depth 2–3), TTL cache (TTL = 3–5)

      Indices-only return; gather vectors via index_select from CPU bank (fp16 optional)

  • Failure modes:
      Timeout or empty queue → NOVEL

      If c_max > 0.98 and near duplicate → drop or NOVEL
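The decision defaults and per-lane TMD overrides above can be captured in a small config sketch (the lane keys and fallback behavior are illustrative assumptions, not a fixed API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LaneThresholds:
    tau_snap: float = 0.92    # neutral defaults from the Decision defaults above
    tau_novel: float = 0.85

# Per-lane TMD overrides from the Design Details
TMD_OVERRIDES = {
    "legal":    LaneThresholds(tau_snap=0.94, tau_novel=0.88),  # more grounding
    "creative": LaneThresholds(tau_snap=0.90, tau_novel=0.82),  # more novelty
}

def thresholds_for(lane):
    """Return the lane's thresholds, falling back to neutral defaults."""
    return TMD_OVERRIDES.get(lane, LaneThresholds())
```

Keeping the overrides in data rather than code makes the Conservative/Neutral/Creative profiles from section 12 a matter of swapping dictionaries.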

    12) Plan & Milestones

    MVP (2–3 days)
  • Implement decision module + TMD hooks
  • Integrate Two-Tower (v4) and async mining
  • Report %SNAP/BLEND/NOVEL + Recall@K
    Gate Review
  • Pass if Recall@500 ≥ 55% and telemetry sane by lane
    Production hardening (1–2 days)
  • Tune thresholds per lane
  • Add config profiles (Conservative/Neutral/Creative)
  • Add guardrails (fallback to NOVEL)
  • Ship dashboards
    13) Risks & Mitigations
  • Over-snapping harms novelty → Per-lane thresholds, widen novel band, log %NOVEL target
  • Retriever stalls → Async + TTL + fallback
  • Bank bias → Keep blend path; periodic novel-only ablation