Text-Vector-Text Processing System PRD


2025-08-20 · 6 min read · 1,120 words


Executive Summary

A modular, publisher-subscriber architecture for processing text through vector embeddings and back to text, supporting multiple encoding/decoding strategies and vector transformations. The system enables efficient batch processing and comparative analysis of different vec2text approaches.

System Architecture

Core Components

```
┌─────────────────┐
│ vec_text_vect.py│ (Main Orchestrator)
└────────┬────────┘
         ├─── text_vect.py (GTR-T5 Encoder)
         ├─── Subscriber Registry
         │      ├─── jxe_vect2text.py
         │      ├─── ielab_vec2text.py
         │      └─── vmmoe_vec2vec.py
         └─── vect_to_vect.py (Secondary Orchestrator)
                └─── Forwards VMMoE output to vec2text subscribers
```

Component Specifications

1. Main Orchestrator: vec_text_vect.py

Purpose: Central control point for the entire pipeline

Key Features:
  • Accepts single text or batch input (list of strings)
  • Manages subscriber registration and notification
  • Coordinates data flow between components
  • Aggregates and formats results
Interface:

```python
from typing import Any, Dict, List, Union

class VecTextVectOrchestrator:
    def __init__(self, config: dict = None):
        """Initialize with optional config for subscribers"""

    def register_subscriber(self, name: str, subscriber: BaseSubscriber):
        """Register a processing subscriber"""

    def process(self,
                input_data: Union[str, List[str]],
                subscribers: List[str] = None) -> Dict[str, Any]:
        """
        Process text through pipeline

        Args:
            input_data: Single text or list of texts
            subscribers: List of subscriber names to use (None = all)

        Returns:
            Results dictionary with outputs from each subscriber
        """
```
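The registration-and-dispatch behavior described above can be sketched with plain callables. This is a minimal toy, not the real orchestrator: `MiniOrchestrator` and the `upper`/`length` subscribers are hypothetical stand-ins for the actual decoders.

```python
from typing import Any, Callable, Dict, List, Optional

class MiniOrchestrator:
    """Toy publisher-subscriber dispatcher mirroring VecTextVectOrchestrator."""

    def __init__(self):
        self._subscribers: Dict[str, Callable[[List[str]], Any]] = {}

    def register_subscriber(self, name: str, fn: Callable[[List[str]], Any]) -> None:
        self._subscribers[name] = fn

    def process(self, input_data, subscribers: Optional[List[str]] = None) -> Dict[str, Any]:
        # Normalize a single string into a batch of one
        texts = [input_data] if isinstance(input_data, str) else list(input_data)
        # None means "notify every registered subscriber"
        names = subscribers if subscribers is not None else list(self._subscribers)
        return {name: self._subscribers[name](texts) for name in names}

orch = MiniOrchestrator()
orch.register_subscriber("upper", lambda ts: [t.upper() for t in ts])
orch.register_subscriber("length", lambda ts: [len(t) for t in ts])

out = orch.process("test")                      # all subscribers
picked = orch.process(["a", "bb"], ["length"])  # selected subscriber only
```

The same shape (single-or-batch input, optional subscriber subset, dict of results) is what the real `process` signature promises.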

CLI Interface:

```bash
# Single text
python vec_text_vect.py --input-text "Linear algebra basics" --subscribers all

# Batch from file
python vec_text_vect.py --batch-file texts.txt --subscribers jxe,ielab

# Specific subscribers
python vec_text_vect.py --input-text "test" --subscribers vmmoe

# JSON output
python vec_text_vect.py --input-text "test" --output-format json
```

2. Text Encoder: text_vect.py

Purpose: Efficient batch encoding of text to vectors using GTR-T5

Key Features:
  • Batch processing for efficiency
  • Automatic batching for large inputs
  • GPU memory management
  • Normalized embeddings option
Interface:

```python
from typing import List

import torch

class TextToVectorEncoder:
    def __init__(self, model_path: str = "data/teacher_models/gtr-t5-base",
                 device: str = None,
                 batch_size: int = 32):
        """Initialize GTR-T5 encoder"""

    def encode(self, texts: List[str],
               normalize: bool = True) -> torch.Tensor:
        """
        Encode texts to vectors

        Returns: Tensor of shape [N, 768]
        """
```
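The "automatic batching for large inputs" feature amounts to chunking the input list so no single call exceeds `batch_size`. A minimal sketch, with a hypothetical `encode_in_batches` helper and a fake encoder standing in for GTR-T5:

```python
from typing import Callable, List

def encode_in_batches(texts: List[str],
                      encode_batch: Callable[[List[str]], List[List[float]]],
                      batch_size: int = 32) -> List[List[float]]:
    """Split a large input into fixed-size chunks and concatenate the results,
    so callers never exceed the encoder's per-batch memory budget."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(encode_batch(texts[start:start + batch_size]))
    return vectors

# Stand-in encoder: maps each text to a 2-d "embedding" (char count, word count)
fake_encode = lambda batch: [[float(len(t)), float(len(t.split()))] for t in batch]

vecs = encode_in_batches(["a b", "ccc", "d e f"], fake_encode, batch_size=2)
```

In the real encoder the chunks would go through GTR-T5 and the results would be concatenated into one `[N, 768]` tensor.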

3. Base Subscriber Interface

Purpose: Common interface for all processing modules

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch

class BaseSubscriber(ABC):
    """Base class for all subscribers"""

    @abstractmethod
    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> Any:
        """Process vectors and return results"""

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique name for this subscriber"""

    @property
    @abstractmethod
    def output_type(self) -> str:
        """Output type: 'text' or 'vector'"""
```
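To make the contract concrete, here is a toy subscriber implementing the interface. `NormReportSubscriber` is hypothetical (not one of the shipped subscribers), and plain Python lists stand in for `torch.Tensor` so the sketch is self-contained:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseSubscriber(ABC):
    """Base class for all subscribers (lists stand in for torch.Tensor here)."""

    @abstractmethod
    def process(self, vectors, metadata: Dict[str, Any] = None) -> Any: ...

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def output_type(self) -> str: ...

class NormReportSubscriber(BaseSubscriber):
    """Toy subscriber: reports the L2 norm of each vector as text."""

    def process(self, vectors: List[List[float]],
                metadata: Dict[str, Any] = None) -> List[str]:
        return [f"norm={sum(x * x for x in v) ** 0.5:.1f}" for v in vectors]

    @property
    def name(self) -> str:
        return "norm_report"

    @property
    def output_type(self) -> str:
        return "text"

sub = NormReportSubscriber()
report = sub.process([[3.0, 4.0], [0.0, 1.0]])
```

Any class that implements `process`, `name`, and `output_type` can be registered with the orchestrator; the ABC makes a missing method an instantiation-time error rather than a runtime surprise.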

4. Vec2Text Subscribers

4.1 JXE Vec2Text: jxe_vect2text.py

Purpose: Original vec2text implementation using the LNSP processor

Key Features:
  • Uses proven LNSP processor approach
  • Configurable decoding steps
  • Batch processing support
```python
class JXEVec2TextSubscriber(BaseSubscriber):
    def __init__(self, teacher_model_path: str = "data/teacher_models/gtr-t5-base",
                 steps: int = 1,
                 device: str = None):
        """Initialize JXE vec2text decoder"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> List[str]:
        """
        Decode vectors to text

        Args:
            vectors: [N, 768] tensor
            metadata: Optional dict with 'original_texts' key

        Returns:
            List of decoded texts
        """
```

4.2 IELab Vec2Text: ielab_vec2text.py

Purpose: Improved vec2text implementation from IELab

Key Features:
  • Better reconstruction quality
  • Faster inference
  • Python 3.11 compatibility
```python
class IELabVec2TextSubscriber(BaseSubscriber):
    def __init__(self, steps: int = 20,
                 beam_width: int = 1,
                 device: str = None):
        """Initialize IELab vec2text decoder"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> List[str]:
        """Decode vectors using IELab models"""
```

5. Vector Transformation: vmmoe_vec2vec.py

Purpose: Transform vectors using the VMMoE model

Key Features:
  • Vector-to-vector transformation
  • Preserves semantic information
  • Configurable normalization
```python
class VMMoEVec2VecSubscriber(BaseSubscriber):
    def __init__(self, checkpoint_path: str,
                 normalize_output: bool = True,
                 device: str = None):
        """Initialize VMMoE transformer"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> torch.Tensor:
        """
        Transform vectors through VMMoE

        Returns: Transformed vectors [N, 768]
        """
```

6. Secondary Orchestrator: vect_to_vect.py

Purpose: Process VMMoE output through vec2text subscribers

Key Features:
  • Receives transformed vectors from VMMoE
  • Forwards to vec2text subscribers
  • Enables comparison of vec2text on transformed vs original vectors
```python
class VectorToVectorOrchestrator:
    def __init__(self, vec2text_subscribers: List[BaseSubscriber]):
        """Initialize with vec2text subscribers"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> Dict[str, List[str]]:
        """Process vectors through all vec2text subscribers"""
```

Data Flow Example

```python
# Input
texts = ["The cat sits on the mat", "Linear algebra basics"]

# Step 1: Text → Vector (GTR-T5)
vectors = text_encoder.encode(texts)  # [2, 768]

# Step 2: Process through subscribers
results = {
    'jxe': jxe_subscriber.process(vectors),     # ["cat on mat", "algebra basics"]
    'ielab': ielab_subscriber.process(vectors), # ["The cat sits", "Linear algebra"]
    'vmmoe': vmmoe_subscriber.process(vectors)  # [2, 768] transformed
}

# Step 3: VMMoE output → vec2text
vmmoe_vectors = results['vmmoe']
vmmoe_decoded = {
    'jxe': jxe_subscriber.process(vmmoe_vectors),
    'ielab': ielab_subscriber.process(vmmoe_vectors)
}
```
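Comparing reconstructions across subscribers relies on cosine similarity between the original embedding and the re-encoded output. A self-contained sketch of the metric (pure Python; the real pipeline would compute this on torch tensors):

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A vector compared with itself scores 1.0; orthogonal vectors score 0.0
same = cosine_similarity([1.0, 2.0, 2.0], [1.0, 2.0, 2.0])
ortho = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Note that when embeddings are already L2-normalized (the `normalize: true` default), cosine similarity reduces to a plain dot product.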

Configuration Management

Config File: config.yaml

```yaml
encoding:
  model_path: "data/teacher_models/gtr-t5-base"
  batch_size: 32
  normalize: true

subscribers:
  jxe:
    enabled: true
    steps: 1
    teacher_model: "data/teacher_models/gtr-t5-base"
  ielab:
    enabled: true
    steps: 20
    beam_width: 1
  vmmoe:
    enabled: true
    checkpoint: "output/vmmoe_stable/best_model.pth"
    normalize_output: true

processing:
  device: "auto"  # auto, cuda, mps, cpu
  parallel_subscribers: false
  max_batch_size: 100
```
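One way to apply such a config is to overlay user-supplied per-subscriber settings on built-in defaults, so a config file only needs to list what it changes. The `merge_subscriber_config` helper and the defaults below are illustrative assumptions, not part of the spec:

```python
from typing import Any, Dict

# Hypothetical built-in defaults mirroring the config.yaml values above
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "jxe": {"enabled": True, "steps": 1},
    "ielab": {"enabled": True, "steps": 20, "beam_width": 1},
}

def merge_subscriber_config(user_cfg: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Overlay user-supplied per-subscriber settings on top of the defaults."""
    merged: Dict[str, Dict[str, Any]] = {}
    for name, defaults in DEFAULTS.items():
        merged[name] = {**defaults, **user_cfg.get(name, {})}
    return merged

cfg = merge_subscriber_config({"ielab": {"steps": 40}, "jxe": {"enabled": False}})
```

Keys the user omits keep their defaults; keys they supply win.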

Output Format

Standard Output

```json
{
  "input_texts": ["text1", "text2"],
  "encoded_vectors": {
    "shape": [2, 768],
    "norms": [1.0, 1.0]
  },
  "results": {
    "jxe": {
      "decoded_texts": ["decoded1", "decoded2"],
      "processing_time": 2.34,
      "cosine_similarities": [0.89, 0.91]
    },
    "ielab": {
      "decoded_texts": ["decoded1", "decoded2"],
      "processing_time": 1.23,
      "cosine_similarities": [0.95, 0.97]
    },
    "vmmoe": {
      "transformed_vectors": {
        "shape": [2, 768],
        "norms": [1.0, 1.0]
      },
      "cosine_to_original": [0.98, 0.99],
      "decoded_by": {
        "jxe": ["vmmoe_decoded1", "vmmoe_decoded2"],
        "ielab": ["vmmoe_decoded1", "vmmoe_decoded2"]
      }
    }
  },
  "summary": {
    "total_processing_time": 5.67,
    "best_reconstruction": {
      "direct": "ielab",
      "via_vmmoe": "ielab"
    }
  }
}
```
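The `summary.best_reconstruction` field can be derived by picking the subscriber with the highest mean cosine similarity. A sketch, assuming a hypothetical `best_reconstruction` helper and the similarity numbers from the example output:

```python
from typing import Dict, List

def best_reconstruction(similarities: Dict[str, List[float]]) -> str:
    """Return the subscriber whose decoded texts have the highest mean cosine similarity."""
    return max(similarities, key=lambda name: sum(similarities[name]) / len(similarities[name]))

best = best_reconstruction({"jxe": [0.89, 0.91], "ielab": [0.95, 0.97]})
```

With the example numbers, `ielab` (mean 0.96) beats `jxe` (mean 0.90), matching the sample summary.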

Performance Optimizations

  • Batch Processing: All components support batch operations
  • Lazy Loading: Models loaded only when needed
  • GPU Memory Management: Automatic batch sizing based on available memory
  • Parallel Processing: Optional parallel subscriber execution
  • Caching: Vector encoding cache for repeated inputs
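The caching optimization can be sketched as a dict-backed wrapper that memoizes per-text encodings; `CachingEncoder` is an illustrative name, and a trivial lambda stands in for the real GTR-T5 encoder:

```python
from typing import Callable, Dict, List, Tuple

class CachingEncoder:
    """Wrap an encoder with a dict cache so repeated texts are encoded only once."""

    def __init__(self, encode_fn: Callable[[str], Tuple[float, ...]]):
        self._encode = encode_fn
        self._cache: Dict[str, Tuple[float, ...]] = {}
        self.misses = 0  # counts actual encoder calls, for inspection

    def encode(self, texts: List[str]) -> List[Tuple[float, ...]]:
        for t in texts:
            if t not in self._cache:
                self.misses += 1
                self._cache[t] = self._encode(t)
        return [self._cache[t] for t in texts]

enc = CachingEncoder(lambda t: (float(len(t)),))
first = enc.encode(["abc", "de", "abc"])  # "abc" encoded once, then reused
second = enc.encode(["abc"])              # served entirely from cache
```

For repeated batch runs over overlapping inputs, this turns redundant GPU encodes into dictionary lookups.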
Error Handling

  • Graceful Degradation: If a subscriber fails, others continue
  • Clear Error Messages: Detailed error reporting with context
  • Retry Logic: Configurable retry for transient failures
  • Validation: Input validation and sanitization
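The graceful-degradation requirement can be sketched as a dispatch loop that records a per-subscriber error instead of raising; the `run_subscribers` helper and the sample subscribers are illustrative, not the real implementation:

```python
from typing import Any, Callable, Dict, List

def run_subscribers(subscribers: Dict[str, Callable[[List[str]], Any]],
                    texts: List[str]) -> Dict[str, Any]:
    """Run every subscriber; a failure in one is recorded without stopping the rest."""
    results: Dict[str, Any] = {}
    for name, fn in subscribers.items():
        try:
            results[name] = {"ok": True, "output": fn(texts)}
        except Exception as exc:  # graceful degradation: report with context, don't raise
            results[name] = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return results

def broken(_texts: List[str]) -> Any:
    raise RuntimeError("model checkpoint missing")

out = run_subscribers({"good": lambda ts: [t[::-1] for t in ts], "bad": broken}, ["ab"])
```

The error entry carries the exception type and message, giving the "clear error messages" behavior while the remaining subscribers still deliver results.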
Testing Strategy

  • Unit Tests: Each component tested independently
  • Integration Tests: Full pipeline testing
  • Performance Tests: Benchmark batch processing speeds
  • Quality Tests: Compare reconstruction quality metrics
Future Extensions

  • Additional Subscribers: Easy to add new vec2text models
  • Web API: REST API for remote processing
  • Streaming Support: Process large datasets in chunks
  • Model Zoo: Support for multiple encoder models
  • Visualization: Add embedding visualization tools
Implementation Timeline

  • Week 1: Core architecture and base classes
  • Week 2: GTR-T5 encoder and JXE subscriber
  • Week 3: IELab and VMMoE subscribers
  • Week 4: Testing and optimization
  • Week 5: Documentation and deployment
Success Metrics

  • Correctness: Output matches expected format
  • Performance: Batch of 1000 texts processed in < 60 seconds
  • Quality: Vec2text reconstruction cosine similarity > 0.8
  • Reliability: 99.9% uptime with graceful error handling
  • Extensibility: New subscriber added in < 100 lines of code