**VCRB: Vectorized Conversational Recursive Blockchain**

2026-04-08 · 17 min read · 1,721 words

_A dynamic, verifiable conversation ledger powering vector-native generative systems_

_10/10/25_

_Trent Carter_

Abstract

We propose VCRB: a system that records every utterance in intelligent conversations to an immutable ledger, vectorizes those utterances into a stable semantic space (768-D), and recursively reuses the most relevant prior conversations during inference and training. VCRB turns the world’s dialogue into a queryable vector memory with auditability, data provenance, and model/version traceability baked in.

Core idea: each session is atomized into (text, 768-D vector, TMD, CPESH, metadata) and committed on-chain (hashes + pointers), while full payloads live off-chain in a vector store. At inference, the LVM retrieves high-similarity conversation snippets (e.g., cosine ≥ 0.999), composes context, and generates a new 768-D output. Periodically (or continuously), new high-quality data is batch-sealed to chain and used for fine-tuning/RL, with model artifacts cryptographically versioned on-chain.

1) Motivation

Generative + Retrieval: Token LMs are static; RAG is untrusted; chat logs are fragile. VCRB gives verifiable provenance and a global conversational memory in one architecture.

Cost & Velocity: Mamba-family / hybrid state-space models train faster than large transformers; VCRB exploits that by continuously curating fresh, audited data to fine-tune on tight loops.

Stability: A single, fixed 768-D space (“Semantic GPS”) keeps “glucose” and every other concept anchored—enabling precise cross-session reuse.

2) System Overview

Input/Output contract: LVM consumes a 768-D context (plus small TMD/CPESH side-channels) and outputs a 768-D vector.

Conversation to Ledger: Each turn → vectorize → quality-gate → Tx committed (hash, signatures, Merkle position); full vectors/text live off-chain (content-addressed).

Recursive retrieval: New queries fetch top-K prior conversational vectors (and their text) to extend the model’s effective context.

Training loop: Nightly/continuous jobs seal batch roots on-chain, update model weights, and commit ModelVersion hashes to the ledger.
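The commit step in this loop can be sketched as deterministic hashing of a turn's payload. This is a minimal sketch: `commit_turn` is a hypothetical helper, field names follow the schema in §3.2, and `sha256` plus canonical JSON serialization stand in for whatever hash and content-addressing (CID) scheme the chain actually uses.

```python
import hashlib
import json

def commit_turn(text: str, vec_768: list[float]) -> dict:
    """Build the digests for an UtteranceCommit transaction.

    The full text/vector payload stays off-chain; only hashes and a
    content address go on-chain. sha256 is a stand-in for the chain's hash.
    """
    hash_text = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Serialize the vector deterministically before hashing.
    vec_bytes = json.dumps(vec_768, separators=(",", ":")).encode("utf-8")
    hash_vec = hashlib.sha256(vec_bytes).hexdigest()
    # Content address over the combined digests (CID stand-in).
    content_addr = hashlib.sha256((hash_text + hash_vec).encode()).hexdigest()
    return {"hash_text": hash_text, "hash_vec": hash_vec,
            "content_addr": content_addr}
```

Because the serialization is fixed, the same turn always produces the same digests, which is what lets later audits re-derive and match the on-chain commitment.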

3) Reference Architecture

3.1 Components

Ingress API: FastAPI for session/turn ingestion (auth, PII scrub).

Vectorizer: GTR-T5 (768-D) or compatible; deterministic pre-/post-norm.

Metadata:

TMD (16-D): Task, Modifier, Domain bits.

CPESH: Concept, Probe, Expected(+ Soft/Hard negatives).

Vector Store: FAISS/pgvector for ANN; HNSW/IVF, product quantization.

Knowledge Graph (optional): Neo4j for ontological links, parent/child hops.

Ledger: Energy-efficient PoS / app-chain (e.g., Tendermint/Substrate) or verifiable append-only log with periodic chain anchoring.

Trainer: Fine-tuning / RL loop (batch-sealed input sets).

Registry: Model & dataset registry with content hashes on-chain.

3.2 Key Data Types (simplified)

```json
{
  "turn_id": "uuid",
  "session_id": "uuid",
  "ts": "2025-10-10T13:12:05Z",
  "speaker": "user|model",
  "text": "string",
  "vec_768": [0.001, ...],
  "tmd_16": "D.T.M",
  "cpesh": {"C": "glucose", "P": "effect on adolescence", "E": "hormonal modulation", "S": ["..."], "H": ["..."]},
  "quality": {"pii_score": 0.0, "toxicity": 0.01, "human_ok": true},
  "content_address": "bafy... (off-chain blob)",
  "tx_hash": "0x...",
  "block_height": 123456
}
```

4) End-to-End Flow (ASCII)

4.1 Request → Retrieval → Generation → Commit

```
+-----------+      +--------------+      +------------------+      +-------------+
|  Client   |----->| Ingress API  |----->| Vectorizer 768D  |----->|  Retriever  |
+-----------+      +--------------+      +------------------+      +-------------+
       |                   |                         |                     |
       |  text             |   text                  |   vec(query)        |
       |                   v                         v                     v
       |             +-----------+            +-------------+        +------------+
       |             |  Ledger   |<-----------|  Commit Tx  |        |  Top-K     |
       |             +-----------+  (hashes)  +-------------+        |  Prior Vec |
       |                                                             +------------+
       |                                                                  |
       |                                                                  v
       |                                                          +---------------+
       |                                                          |  Composer     |
       |                                                          |  (context mk) |
       |                                                          +-------+-------+
       |                                                                  |
       |                                                                  v
       |                                                          +---------------+
       |                                                          |   LVM (768D)  |
       |                                                          +-------+-------+
       |                                                                  |
       |                                                                  v
       |                                                          +---------------+
       |                                                          |  Output 768D  |
       |                                                          +-------+-------+
       |                                                                  |
       |                                                                  v
       |                                                          +---------------+
       |                                                          |  Post-Vector  |
       |                                                          |  (vec2text)   |
       |                                                          +---------------+
```

4.2 Block Building & Audit

```
[Turn Vecs + Metadata] --> [Quality Gate] --> [Merkle Build]
                                           \
                                            \--> [Tx: UtteranceCommit]
                                                 [Tx: ContextAttest]
                                                 [Tx: FeedbackAttest]
                                                 [Tx: BatchSealRoot] ---> [Block N]
```
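The [Merkle Build] step can be sketched as pairwise hashing up to a single root. A minimal sketch: `sha256` and the odd-node duplication rule are assumptions, since real chains fix their own hash and leaf encoding.

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash leaves pairwise level by level until one root remains.

    Odd levels duplicate their last node so every node has a sibling.
    """
    if not leaves:
        raise ValueError("empty batch")
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the odd node out
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

Only this 32-byte root goes into the BatchSealRoot transaction; any single altered turn in the batch changes the root and is therefore detectable.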

4.3 Nightly Training (or Continuous)

```
[New Off-chain Batch] --hash--> [Merkle Root] --anchor--> [Ledger]
        |
        v
[Trainer: FT/RL] --> [Model vN+1 artifacts] --hash--> [ModelVersionCommit Tx]
```

5) Retrieval & Recursion

Top-K + Thresholding

• Find prior turns with cosine ≥ τ (e.g., 0.999).

• Diversify by session/source to avoid near-duplicate collapse.

• Optionally follow graph edges (parent/child/sibling) for 1-2 hops to enrich.

Depth-limited Recursion

• Layer-0: current query.

• Layer-1: top-K priors.

• Layer-2: for each prior, fetch its top-M neighbors (cap total tokens/vectors).

• Hard budget by vector count (not tokens): e.g., ≤ 128 vectors into composer.
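The layered recursion and hard vector budget above can be sketched in pure Python. A minimal sketch, assuming an in-memory list of prior vectors (a real deployment would query the ANN index) and unit-length 768-D embeddings; `recurse` and its parameters are illustrative names, and dedup across layers is omitted for brevity.

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; for unit-normalized vectors this is just a dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def recurse(query, store, k=32, m=4, tau=0.999, max_vecs=128):
    """Layer-1: top-K priors at or above tau; Layer-2: top-M neighbors
    per prior; hard budget counted in vectors, not tokens."""
    scored = sorted(store, key=lambda v: cosine(query, v), reverse=True)
    layer1 = [v for v in scored[:k] if cosine(query, v) >= tau]
    ctx = list(layer1)
    for prior in layer1:
        # [1:m+1] drops the nearest hit, which is the prior itself.
        nbrs = sorted(store, key=lambda v: cosine(prior, v), reverse=True)[1:m + 1]
        ctx.extend(nbrs)
        if len(ctx) >= max_vecs:
            break
    return ctx[:max_vecs]
```

Capping by vector count rather than tokens keeps the composer's input size fixed regardless of how verbose the underlying turns are.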

6) Training Schedules (policy knobs)

Continuous RL: small on-policy updates when confident signals exist.

Nightly: curated, de-duplicated batch with batch-sealed Merkle root.

Periodic (weekly/quarterly): heavyweight evals, ablations, and rollbacks.

Quality gates: PII redaction → toxicity/NSFW → dedupe → diversity → human spot-check (1%).
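The gate sequence can be sketched as predicates applied in order, dropping a turn at the first failure. A minimal sketch: the toxicity threshold and field names are assumptions, the diversity gate is omitted, and the ~1% human spot-check is modeled as a seeded random flag.

```python
import random

def quality_gate(turns: list[dict], seen_hashes: set, spot_rate=0.01, rng=None):
    """Apply gates in order: PII -> toxicity -> dedupe -> human spot-check flag."""
    rng = rng or random.Random(0)
    passed = []
    for t in turns:
        if t["pii_score"] > 0.0:           # redaction must have fully cleaned it
            continue
        if t["toxicity"] > 0.05:           # toxicity/NSFW cutoff (assumed value)
            continue
        if t["hash_text"] in seen_hashes:  # dedupe by content hash
            continue
        seen_hashes.add(t["hash_text"])
        t["needs_human_check"] = rng.random() < spot_rate  # ~1% spot-audit
        passed.append(t)
    return passed
```

Running the cheap gates first (PII, toxicity) means the dedupe set only ever sees turns that could actually be sealed into a batch.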

7) Transaction Types (on-chain)

UtteranceCommit: hash(text), hash(vec), TMD, CPESH, signer, ts.

ContextAttest: list of referenced turn IDs + similarity scores.

FeedbackAttest: signed user/model feedback; RL reward hints.

BatchSealRoot: Merkle root of curated batch (training set).

ModelVersionCommit: hashes of weights/optimizer/state + evaluation digest.

PolicyUpdate: retrieval thresholds, privacy knobs, retention.

8) Security, Privacy, Compliance

PII scrub before vectorization; store redaction map off-chain with strict ACL.

Private chain or public-verifier / private-payload model (commit-reveal).

Right-to-be-forgotten: tombstone pointer off-chain; on-chain retains only opaque commitment; rekey vector store segments; exclude from future batches.

Auditability: end-to-end reconstruction via content-addresses + Merkle proofs.
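The Merkle-proof check behind this auditability claim can be sketched as follows. A minimal sketch: `sha256` and the left/right sibling encoding are assumptions about the chain's proof format.

```python
import hashlib

def sha(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def verify_inclusion(leaf: bytes, proof: list, root: bytes) -> bool:
    """Walk the proof path from leaf to root.

    Each proof step is (sibling_digest, side), where side says whether
    the sibling sits left or right of the running node.
    """
    node = sha(leaf)
    for sibling, side in proof:
        node = sha(sibling + node) if side == "left" else sha(node + sibling)
    return node == root
```

An auditor who holds the off-chain payload can thus confirm it belongs to a sealed batch using only the on-chain root plus a logarithmic-size proof.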

9) Three Novel Additions

9.1 ZK-Sim: Zero-Knowledge Similarity Proofs

Allow a node to prove that retrieved context exceeded a similarity threshold (e.g., cosine ≥ 0.999) without revealing the vectors or text.

• Use polynomial commitments/range proofs over normalized inner products.

• Benefits: privacy-preserving _and_ verifiable retrieval; drives trust in “why this context?” even in regulated domains.

9.2 Semantic Anchor Beacons (SAB) for Drift Control

Curate a public set of anchor vectors for canonical concepts (e.g., “glucose”, “mRNA”, “Jupiter”).

• At each retrain, solve a small Procrustes alignment to keep the live embedding space aligned to SAB.

• Commit SAB set + alignment error to chain.

• Benefits: long-term coordinate stability for your Semantic GPS, preventing “concept drift” across model versions.
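The alignment step can be sketched with orthogonal Procrustes via SVD. A minimal sketch assuming NumPy: `anchors_ref` stands for the committed SAB set and `anchors_live` for the same concepts re-embedded by the new model; the returned error is what would be committed on-chain alongside the SAB set.

```python
import numpy as np

def sab_align(anchors_live: np.ndarray, anchors_ref: np.ndarray):
    """Solve min_R ||anchors_live @ R - anchors_ref||_F over orthogonal R.

    Returns the alignment rotation and the residual error to report on-chain.
    """
    # Orthogonal Procrustes: SVD of the cross-covariance gives the optimum.
    u, _, vt = np.linalg.svd(anchors_live.T @ anchors_ref)
    r = u @ vt
    err = float(np.linalg.norm(anchors_live @ r - anchors_ref))
    return r, err
```

Applying `r` to the new model's output space snaps it back onto the Semantic GPS coordinates, so a concept's 768-D address stays stable across retrains.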

9.3 Conversational Curriculum Miner (CCM)

A bandit-style miner scans the vector space for low-density, high-value regions (under-covered concepts) and:

• Generates targeted prompts / outreach to collect examples, or

• Synthesizes probe-expected pairs (CPESH) from trusted seeds, later human-audited.

• Commit mined curricula and their evaluation uplift to chain.

• Benefits: _data flywheel_ that systematically fills conceptual gaps and improves generalization.

10) Implementation Plan (MVP → V1)

MVP (2–3 weeks)

API: FastAPI endpoints /ingest, /commit, /query, /attest.

Vectorizer: GTR-T5-base (768-D), fixed preproc; deterministic seeds.

Store: pgvector (ANN) + S3/MinIO for blobs (content-addressed).

Ledger: lightweight Tendermint app-chain or append-only log with daily Ethereum (or similar) anchor.

Composer: top-K + simple diversity; depth-1 recursion.

Trainer: nightly fine-tune job (FT) with batch-sealed inputs; log evals.

V1 (6–10 weeks)

Graph layer (Neo4j) for 1-hop ontological enrich.

RLHF/RLAIF: reward from FeedbackAttest (thumbs/stop/“good”).

ZK-Sim (prototype): off-chain prover / on-chain verifier for thresholded cosine.

SAB drift control: anchor set + alignment report on-chain.

CCM: bandit miner + ablation reports (“coverage uplift”, “error decay”).

11) ASCII Flow Diagrams (detailed)

11.1 Ingest + Commit

```
User/Model Turn ──> [Sanitize/Redact] ──> [Vectorize 768D] ──> [TMD/CPESH Tag]
         |                                                |
         |                                                v
         |                                      [Off-chain Blob Store]
         |                                                |
         v                                                v
   [Build Tx: UtteranceCommit]  <── hash(vec), hash(text), content_addr
         |
         v
     [Submit] ───────────────> [Mempool] ─> [Block Propose/Finalize]
                                             |
                                             v
                              [Block N contains Tx hashes + Merkle root]
```

11.2 Retrieval + Recursion

```
Query text -> vec(q) -> ANN search (K)
     |                       |
     |                       v
     |                Top-K prior turns
     |                       |
     v                       v
  Filter by quality ----> Diversify ----> (optional) graph hops
               \              |                   /
                \             v                  /
                 \------->  Compose Context  <---
                               |
                               v
                           LVM (768D out) -> vec2text
                               |
                               v
                        Commit ContextAttest Tx
```

11.3 Nightly Training + Versioning

```
[Daily New Turns] -> [Quality Gate] -> [Batch Merkle Root] -> [BatchSeal Tx]
          |
          v
     [Trainer: FT/RL] -> [Weights vN+1] -> hash -> [ModelVersionCommit Tx]
                                         \-> evals -> [EvalDigestCommit Tx]
```

12) Retrieval & Composition (pseudocode)

```python
def retrieve_context(query_vec, k=32, tau=0.999, m=4, max_ctx=128):
    C = ann_search(query_vec, k)                  # Top-K
    C = [c for c in C if c.cosine >= tau]         # Threshold
    C = diversify(C, by=["session", "source"])    # Reduce redundancy
    G = graph_enrich(C, hops=1, per_seed=m)       # Optional 1-hop KG
    ctx = truncate(C + G, max_items=max_ctx)      # Hard budget
    return ctx
```

13) Evaluation Plan

Retrieval p@K / nDCG on held-out conversational QA.

Context Utility Uplift: Δ in downstream generation quality with vs. without VCRB context.

Chain-Consistency Score: fraction of responses whose attested contexts are (a) retrievable, (b) threshold-valid, (c) non-tampered via Merkle proof.

Drift Metric: SAB alignment error across versions.

Data Flywheel KPI: CCM coverage increase (% of low-density regions filled) vs. error rate decay.
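The p@K metric in this plan can be computed directly from attested refs against a held-out relevance set. A minimal sketch; the turn IDs are hypothetical and nDCG is left out for brevity.

```python
def precision_at_k(retrieved: list[str], relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved turn IDs that appear in the relevance set."""
    if k == 0:
        return 0.0
    top = retrieved[:k]
    return sum(1 for turn_id in top if turn_id in relevant) / k
```

Because ContextAttest transactions record exactly which turn IDs were retrieved, this metric can be recomputed by any auditor from chain data plus the labeled QA set.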

14) Risks & Mitigations (blunt)

PII leakage: redact pre-vector; strict ACL for de-redaction maps; default private chain; ZK-Sim to avoid exposing raw content.

Vector drift: addressed via SAB + periodic alignment; regression alarms on anchor error.

Chain bloat: store hashes/roots on-chain; content-address off-chain; periodic root anchoring to public chain.

Garbage-in: quality gates, dedupe, human spot-audit (≥1%), CCM focuses data collection on valuable gaps, not volume.

Latency: commit is async; inference never blocks on chain finality.

15) Openness & Licensing

Architecture: fully open (spec + reference code).

Models: choose permissive bases (e.g., open Mamba/Hybrid).

Ledger: open validator set, reproducible node stack.

Artifacts: every dataset/model/eval has a content hash and optional public mirror.

16) Conclusion

VCRB fuses vector-native retrieval, immutable provenance, and fast retraining into a single, composable system. It turns conversations into a living curriculum—verifiable, privacy-preserving, and relentlessly improving. With ZK-Sim proofs, Semantic Anchor Beacons, and a Conversational Curriculum Miner, VCRB does more than store chats: it industrializes conversational intelligence.

Appendix: Minimal Schemas

UtteranceCommit (on-chain)

```json
{
  "turn_id": "uuid", "session_id": "uuid", "ts": "iso8601",
  "speaker": "user|model",
  "hash_text": "0x..", "hash_vec": "0x..", "content_addr": "bafy..",
  "tmd": "3.2.5", "cpesh": {"C": "...", "P": "...", "E": "..."},
  "signer": "pubkey", "sig": "0x.."
}
```

ContextAttest

```json
{
  "turn_id": "uuid",
  "refs": [{"ref_turn_id": "uuid", "cosine": 0.9993}, {"ref_turn_id": "uuid", "cosine": 0.9991}],
  "zk_sim_proof": "0x...(optional)",
  "signer": "pubkey", "sig": "0x.."
}
```

ModelVersionCommit

```json
{
  "model_id": "VCRB-LVM",
  "version": "v0.8.3",
  "hash_weights": "0x..",
  "hash_optimizer": "0x..",
  "eval_digest": {"set": "shadow_eval_v9", "score": {"BLEU": .., "Uplift": ..}},
  "parent_version": "v0.8.2"
}
```

Next steps (concrete)
  • Stand up MVP pipeline (FastAPI + pgvector + Tendermint app-chain) and ship /ingest, /query, /attest, /seal, /model_commit.
  • Implement SAB set + alignment check in the nightly job; report drift to chain.
  • Prototype ZK-Sim (off-chain prover) for thresholded cosine; verify on-chain.
  • Launch CCM with bandit sampling; track coverage uplift and commit ablations.
  • Generate the exact FastAPI endpoints, Makefile, and reference DB schema so the stack can drop into a repo and pass Day-1 smoke tests.
