# Text-Vector-Text Processing System PRD

## Executive Summary

A modular, publisher-subscriber architecture for processing text through vector embeddings and back to text, supporting multiple encoding/decoding strategies and vector transformations. The system enables efficient batch processing and comparative analysis of different vec2text approaches.

## System Architecture

### Core Components
```
┌─────────────────┐
│ vec_text_vect.py│  (Main Orchestrator)
└────────┬────────┘
         │
         ├─── text_vect.py (GTR-T5 Encoder)
         │
         ├─── Subscriber Registry
         │    ├─── jxe_vect2text.py
         │    ├─── ielab_vec2text.py
         │    └─── vmmoe_vec2vec.py
         │
         └─── vect_to_vect.py (Secondary Orchestrator)
              └─── Forwards VMMoE output to vec2text subscribers
```
## Component Specifications

### 1. Main Orchestrator: `vec_text_vect.py`

**Purpose:** Central control point for the entire pipeline.

**Key Features:**
```python
from typing import Any, Dict, List, Union

class VecTextVectOrchestrator:
    def __init__(self, config: dict = None):
        """Initialize with optional config for subscribers."""

    def register_subscriber(self, name: str, subscriber: BaseSubscriber):
        """Register a processing subscriber."""

    def process(self,
                input_data: Union[str, List[str]],
                subscribers: List[str] = None) -> Dict[str, Any]:
        """
        Process text through the pipeline.

        Args:
            input_data: Single text or list of texts
            subscribers: List of subscriber names to use (None = all)

        Returns:
            Results dictionary with outputs from each subscriber
        """
```
**CLI Interface:**

```bash
# Single text
python vec_text_vect.py --input-text "Linear algebra basics" --subscribers all

# Batch from file
python vec_text_vect.py --batch-file texts.txt --subscribers jxe,ielab

# Specific subscribers
python vec_text_vect.py --input-text "test" --subscribers vmmoe

# JSON output
python vec_text_vect.py --input-text "test" --output-format json
```
### 2. Text Encoder: `text_vect.py`

**Purpose:** Efficient batch encoding of text to vectors using GTR-T5.

**Key Features:**
```python
from typing import List

import torch

class TextToVectorEncoder:
    def __init__(self, model_path: str = "data/teacher_models/gtr-t5-base",
                 device: str = None,
                 batch_size: int = 32):
        """Initialize GTR-T5 encoder."""

    def encode(self, texts: List[str],
               normalize: bool = True) -> torch.Tensor:
        """
        Encode texts to vectors.

        Returns: Tensor of shape [N, 768]
        """
```
### 3. Base Subscriber Interface

**Purpose:** Common interface for all processing modules.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch

class BaseSubscriber(ABC):
    """Base class for all subscribers."""

    @abstractmethod
    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> Any:
        """Process vectors and return results."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique name for this subscriber."""

    @property
    @abstractmethod
    def output_type(self) -> str:
        """Output type: 'text' or 'vector'"""
```
### 4. Vec2Text Subscribers

#### 4.1 JXE Vec2Text: `jxe_vect2text.py`

```python
class JXEVec2TextSubscriber(BaseSubscriber):
    def __init__(self, teacher_model_path: str = "data/teacher_models/gtr-t5-base",
                 steps: int = 1,
                 device: str = None):
        """Initialize JXE vec2text decoder."""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> List[str]:
        """
        Decode vectors to text.

        Args:
            vectors: [N, 768] tensor
            metadata: Optional dict with 'original_texts' key

        Returns:
            List of decoded texts
        """
```
#### 4.2 IELab Vec2Text: `ielab_vec2text.py`

```python
class IELabVec2TextSubscriber(BaseSubscriber):
    def __init__(self, steps: int = 20,
                 beam_width: int = 1,
                 device: str = None):
        """Initialize IELab vec2text decoder."""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> List[str]:
        """Decode vectors using IELab models."""
```
### 5. Vector Transformation: `vmmoe_vec2vec.py`

**Purpose:** Transform vectors using the VMMoE model.

**Key Features:**

```python
class VMMoEVec2VecSubscriber(BaseSubscriber):
    def __init__(self, checkpoint_path: str,
                 normalize_output: bool = True,
                 device: str = None):
        """Initialize VMMoE transformer."""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> torch.Tensor:
        """
        Transform vectors through VMMoE.

        Returns: Transformed vectors [N, 768]
        """
```
### 6. Secondary Orchestrator: `vect_to_vect.py`

**Purpose:** Process VMMoE output through vec2text subscribers.

**Key Features:**

```python
class VectorToVectorOrchestrator:
    def __init__(self, vec2text_subscribers: List[BaseSubscriber]):
        """Initialize with vec2text subscribers."""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> Dict[str, List[str]]:
        """Process vectors through all vec2text subscribers."""
```
## Data Flow Example

```python
# Input
texts = ["The cat sits on the mat", "Linear algebra basics"]

# Step 1: Text → Vector (GTR-T5)
vectors = text_encoder.encode(texts)  # [2, 768]

# Step 2: Process through subscribers
results = {
    'jxe': jxe_subscriber.process(vectors),      # ["cat on mat", "algebra basics"]
    'ielab': ielab_subscriber.process(vectors),  # ["The cat sits", "Linear algebra"]
    'vmmoe': vmmoe_subscriber.process(vectors)   # [2, 768] transformed
}

# Step 3: VMMoE output → vec2text
vmmoe_vectors = results['vmmoe']
vmmoe_decoded = {
    'jxe': jxe_subscriber.process(vmmoe_vectors),
    'ielab': ielab_subscriber.process(vmmoe_vectors)
}
```
## Configuration Management

**Config File:** `config.yaml`

```yaml
encoding:
  model_path: "data/teacher_models/gtr-t5-base"
  batch_size: 32
  normalize: true

subscribers:
  jxe:
    enabled: true
    steps: 1
    teacher_model: "data/teacher_models/gtr-t5-base"
  ielab:
    enabled: true
    steps: 20
    beam_width: 1
  vmmoe:
    enabled: true
    checkpoint: "output/vmmoe_stable/best_model.pth"
    normalize_output: true

processing:
  device: "auto"  # auto, cuda, mps, cpu
  parallel_subscribers: false
  max_batch_size: 100
```
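One way the orchestrator might consume this config: instantiate only subscribers with `enabled: true`, passing the remaining keys as constructor kwargs. This is a hypothetical sketch, not the specified behavior; the parsed YAML is represented as a plain dict so the example stays dependency-free.

```python
# Hypothetical config-consumption sketch: filter enabled subscribers and
# strip the 'enabled' flag so the rest can be passed as constructor kwargs.
from typing import Any, Dict

config = {
    "subscribers": {
        "jxe": {"enabled": True, "steps": 1},
        "ielab": {"enabled": False, "steps": 20, "beam_width": 1},
    }
}

def enabled_subscribers(cfg: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    return {
        name: {k: v for k, v in opts.items() if k != "enabled"}
        for name, opts in cfg["subscribers"].items()
        if opts.get("enabled", False)
    }

print(enabled_subscribers(config))  # {'jxe': {'steps': 1}}
```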
## Output Format

**Standard Output:**

```json
{
  "input_texts": ["text1", "text2"],
  "encoded_vectors": {
    "shape": [2, 768],
    "norms": [1.0, 1.0]
  },
  "results": {
    "jxe": {
      "decoded_texts": ["decoded1", "decoded2"],
      "processing_time": 2.34,
      "cosine_similarities": [0.89, 0.91]
    },
    "ielab": {
      "decoded_texts": ["decoded1", "decoded2"],
      "processing_time": 1.23,
      "cosine_similarities": [0.95, 0.97]
    },
    "vmmoe": {
      "transformed_vectors": {
        "shape": [2, 768],
        "norms": [1.0, 1.0]
      },
      "cosine_to_original": [0.98, 0.99],
      "decoded_by": {
        "jxe": ["vmmoe_decoded1", "vmmoe_decoded2"],
        "ielab": ["vmmoe_decoded1", "vmmoe_decoded2"]
      }
    }
  },
  "summary": {
    "total_processing_time": 5.67,
    "best_reconstruction": {
      "direct": "ielab",
      "via_vmmoe": "ielab"
    }
  }
}
```
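The `cosine_similarities` and `cosine_to_original` fields both reduce to the same metric. A dependency-free sketch of the computation (the function name is illustrative; a real implementation would operate on tensors):

```python
# Cosine similarity between two vectors: dot product over the product of norms.
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that because the pipeline normalizes embeddings to unit norm, the denominator is 1 in practice and the metric collapses to a plain dot product.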