Text-Vector-Text Processing System PRD


2025-08-20 · 6 min read · 1,120 words


Executive Summary

A modular, publisher-subscriber architecture for processing text through vector embeddings and back to text, supporting multiple encoding/decoding strategies and vector transformations. The system enables efficient batch processing and comparative analysis of different vec2text approaches.

System Architecture

Core Components

```
┌─────────────────┐
│ vec_text_vect.py│ (Main Orchestrator)
└────────┬────────┘
         ├─── text_vect.py (GTR-T5 Encoder)
         ├─── Subscriber Registry
         │      ├─── jxe_vect2text.py
         │      ├─── ielab_vec2text.py
         │      └─── vmmoe_vec2vec.py
         └─── vect_to_vect.py (Secondary Orchestrator)
                └─── Forwards VMMoE output to vec2text subscribers
```

Component Specifications

1. Main Orchestrator: vec_text_vect.py

Purpose: Central control point for the entire pipeline

Key Features:
  • Accepts single text or batch input (list of strings)
  • Manages subscriber registration and notification
  • Coordinates data flow between components
  • Aggregates and formats results
Interface:

```python
from typing import Any, Dict, List, Union

class VecTextVectOrchestrator:
    def __init__(self, config: dict = None):
        """Initialize with optional config for subscribers"""

    def register_subscriber(self, name: str, subscriber: BaseSubscriber):
        """Register a processing subscriber"""

    def process(self,
                input_data: Union[str, List[str]],
                subscribers: List[str] = None) -> Dict[str, Any]:
        """
        Process text through pipeline

        Args:
            input_data: Single text or list of texts
            subscribers: List of subscriber names to use (None = all)

        Returns:
            Results dictionary with outputs from each subscriber
        """
```
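The registration-and-dispatch behavior described above can be sketched with plain callables. This is a minimal toy, not the real orchestrator: `MiniOrchestrator` and the `upper`/`length` subscribers are hypothetical stand-ins for the actual decoders.

```python
from typing import Any, Callable, Dict, List, Optional

class MiniOrchestrator:
    """Toy publisher-subscriber dispatcher mirroring VecTextVectOrchestrator."""

    def __init__(self):
        self._subscribers: Dict[str, Callable[[List[str]], Any]] = {}

    def register_subscriber(self, name: str, fn: Callable[[List[str]], Any]) -> None:
        self._subscribers[name] = fn

    def process(self, input_data, subscribers: Optional[List[str]] = None) -> Dict[str, Any]:
        # Normalize a single string into a batch of one
        texts = [input_data] if isinstance(input_data, str) else list(input_data)
        # None means "notify every registered subscriber"
        names = subscribers if subscribers is not None else list(self._subscribers)
        return {name: self._subscribers[name](texts) for name in names}

orch = MiniOrchestrator()
orch.register_subscriber("upper", lambda ts: [t.upper() for t in ts])
orch.register_subscriber("length", lambda ts: [len(t) for t in ts])

out = orch.process("test")                      # all subscribers
picked = orch.process(["a", "bb"], ["length"])  # selected subscriber only
```

The same shape (single-or-batch input, optional subscriber subset, dict of results) is what the real `process` signature promises.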

CLI Interface:

```bash
# Single text
python vec_text_vect.py --input-text "Linear algebra basics" --subscribers all

# Batch from file
python vec_text_vect.py --batch-file texts.txt --subscribers jxe,ielab

# Specific subscribers
python vec_text_vect.py --input-text "test" --subscribers vmmoe

# JSON output
python vec_text_vect.py --input-text "test" --output-format json
```

2. Text Encoder: text_vect.py

Purpose: Efficient batch encoding of text to vectors using GTR-T5

Key Features:
  • Batch processing for efficiency
  • Automatic batching for large inputs
  • GPU memory management
  • Normalized embeddings option
Interface:

```python
from typing import List

import torch

class TextToVectorEncoder:
    def __init__(self, model_path: str = "data/teacher_models/gtr-t5-base",
                 device: str = None,
                 batch_size: int = 32):
        """Initialize GTR-T5 encoder"""

    def encode(self, texts: List[str],
               normalize: bool = True) -> torch.Tensor:
        """
        Encode texts to vectors

        Returns: Tensor of shape [N, 768]
        """
```
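The "automatic batching for large inputs" feature amounts to chunking the input list so no single call exceeds `batch_size`. A minimal sketch, with a hypothetical `encode_in_batches` helper and a fake encoder standing in for GTR-T5:

```python
from typing import Callable, List

def encode_in_batches(texts: List[str],
                      encode_batch: Callable[[List[str]], List[List[float]]],
                      batch_size: int = 32) -> List[List[float]]:
    """Split a large input into fixed-size chunks and concatenate the results,
    so callers never exceed the encoder's per-batch memory budget."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(encode_batch(texts[start:start + batch_size]))
    return vectors

# Stand-in encoder: maps each text to a 2-d "embedding" (char count, word count)
fake_encode = lambda batch: [[float(len(t)), float(len(t.split()))] for t in batch]

vecs = encode_in_batches(["a b", "ccc", "d e f"], fake_encode, batch_size=2)
```

In the real encoder the chunks would go through GTR-T5 and the results would be concatenated into one `[N, 768]` tensor.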

3. Base Subscriber Interface

Purpose: Common interface for all processing modules

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch

class BaseSubscriber(ABC):
    """Base class for all subscribers"""

    @abstractmethod
    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> Any:
        """Process vectors and return results"""

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique name for this subscriber"""

    @property
    @abstractmethod
    def output_type(self) -> str:
        """Output type: 'text' or 'vector'"""
```
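To make the contract concrete, here is a toy subscriber implementing the interface. `NormReportSubscriber` is hypothetical (not one of the shipped subscribers), and plain Python lists stand in for `torch.Tensor` so the sketch is self-contained:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class BaseSubscriber(ABC):
    """Base class for all subscribers (lists stand in for torch.Tensor here)."""

    @abstractmethod
    def process(self, vectors, metadata: Dict[str, Any] = None) -> Any: ...

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def output_type(self) -> str: ...

class NormReportSubscriber(BaseSubscriber):
    """Toy subscriber: reports the L2 norm of each vector as text."""

    def process(self, vectors: List[List[float]],
                metadata: Dict[str, Any] = None) -> List[str]:
        return [f"norm={sum(x * x for x in v) ** 0.5:.1f}" for v in vectors]

    @property
    def name(self) -> str:
        return "norm_report"

    @property
    def output_type(self) -> str:
        return "text"

sub = NormReportSubscriber()
report = sub.process([[3.0, 4.0], [0.0, 1.0]])
```

Any class that implements `process`, `name`, and `output_type` can be registered with the orchestrator; the ABC makes a missing method an instantiation-time error rather than a runtime surprise.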

4. Vec2Text Subscribers

4.1 JXE Vec2Text: jxe_vect2text.py

Purpose: Original vec2text implementation using the LNSP processor

Key Features:
  • Uses proven LNSP processor approach
  • Configurable decoding steps
  • Batch processing support
```python
class JXEVec2TextSubscriber(BaseSubscriber):
    def __init__(self, teacher_model_path: str = "data/teacher_models/gtr-t5-base",
                 steps: int = 1,
                 device: str = None):
        """Initialize JXE vec2text decoder"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> List[str]:
        """
        Decode vectors to text

        Args:
            vectors: [N, 768] tensor
            metadata: Optional dict with 'original_texts' key

        Returns:
            List of decoded texts
        """
```

4.2 IELab Vec2Text: ielab_vec2text.py

Purpose: Improved vec2text implementation from IELab

Key Features:
  • Better reconstruction quality
  • Faster inference
  • Python 3.11 compatibility
```python
class IELabVec2TextSubscriber(BaseSubscriber):
    def __init__(self, steps: int = 20,
                 beam_width: int = 1,
                 device: str = None):
        """Initialize IELab vec2text decoder"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> List[str]:
        """Decode vectors using IELab models"""
```

5. Vector Transformation: vmmoe_vec2vec.py

Purpose: Transform vectors using the VMMoE model

Key Features:
  • Vector-to-vector transformation
  • Preserves semantic information
  • Configurable normalization
```python
class VMMoEVec2VecSubscriber(BaseSubscriber):
    def __init__(self, checkpoint_path: str,
                 normalize_output: bool = True,
                 device: str = None):
        """Initialize VMMoE transformer"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> torch.Tensor:
        """
        Transform vectors through VMMoE

        Returns: Transformed vectors [N, 768]
        """
```

6. Secondary Orchestrator: vect_to_vect.py

Purpose: Process VMMoE output through vec2text subscribers

Key Features:
  • Receives transformed vectors from VMMoE
  • Forwards to vec2text subscribers
  • Enables comparison of vec2text on transformed vs original vectors
```python
class VectorToVectorOrchestrator:
    def __init__(self, vec2text_subscribers: List[BaseSubscriber]):
        """Initialize with vec2text subscribers"""

    def process(self, vectors: torch.Tensor,
                metadata: Dict[str, Any] = None) -> Dict[str, List[str]]:
        """Process vectors through all vec2text subscribers"""
```

Data Flow Example

```python
# Input
texts = ["The cat sits on the mat", "Linear algebra basics"]

# Step 1: Text → Vector (GTR-T5)
vectors = text_encoder.encode(texts)  # [2, 768]

# Step 2: Process through subscribers
results = {
    'jxe': jxe_subscriber.process(vectors),     # ["cat on mat", "algebra basics"]
    'ielab': ielab_subscriber.process(vectors), # ["The cat sits", "Linear algebra"]
    'vmmoe': vmmoe_subscriber.process(vectors)  # [2, 768] transformed
}

# Step 3: VMMoE output → vec2text
vmmoe_vectors = results['vmmoe']
vmmoe_decoded = {
    'jxe': jxe_subscriber.process(vmmoe_vectors),
    'ielab': ielab_subscriber.process(vmmoe_vectors)
}
```
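Comparing reconstructions across subscribers relies on cosine similarity between the original embedding and the re-encoded output. A self-contained sketch of the metric (pure Python; the real pipeline would compute this on torch tensors):

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A vector compared with itself scores 1.0; orthogonal vectors score 0.0
same = cosine_similarity([1.0, 2.0, 2.0], [1.0, 2.0, 2.0])
ortho = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Note that when embeddings are already L2-normalized (the `normalize: true` default), cosine similarity reduces to a plain dot product.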

Configuration Management

Config File: config.yaml

```yaml
encoding:
  model_path: "data/teacher_models/gtr-t5-base"
  batch_size: 32
  normalize: true

subscribers:
  jxe:
    enabled: true
    steps: 1
    teacher_model: "data/teacher_models/gtr-t5-base"
  ielab:
    enabled: true
    steps: 20
    beam_width: 1
  vmmoe:
    enabled: true
    checkpoint: "output/vmmoe_stable/best_model.pth"
    normalize_output: true

processing:
  device: "auto"  # auto, cuda, mps, cpu
  parallel_subscribers: false
  max_batch_size: 100
```
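One way to apply such a config is to overlay user-supplied per-subscriber settings on built-in defaults, so a config file only needs to list what it changes. The `merge_subscriber_config` helper and the defaults below are illustrative assumptions, not part of the spec:

```python
from typing import Any, Dict

# Hypothetical built-in defaults mirroring the config.yaml values above
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "jxe": {"enabled": True, "steps": 1},
    "ielab": {"enabled": True, "steps": 20, "beam_width": 1},
}

def merge_subscriber_config(user_cfg: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Overlay user-supplied per-subscriber settings on top of the defaults."""
    merged: Dict[str, Dict[str, Any]] = {}
    for name, defaults in DEFAULTS.items():
        merged[name] = {**defaults, **user_cfg.get(name, {})}
    return merged

cfg = merge_subscriber_config({"ielab": {"steps": 40}, "jxe": {"enabled": False}})
```

Keys the user omits keep their defaults; keys they supply win.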

Output Format

Standard Output

```json
{
  "input_texts": ["text1", "text2"],
  "encoded_vectors": {
    "shape": [2, 768],
    "norms": [1.0, 1.0]
  },
  "results": {
    "jxe": {
      "decoded_texts": ["decoded1", "decoded2"],
      "processing_time": 2.34,
      "cosine_similarities": [0.89, 0.91]
    },
    "ielab": {
      "decoded_texts": ["decoded1", "decoded2"],
      "processing_time": 1.23,
      "cosine_similarities": [0.95, 0.97]
    },
    "vmmoe": {
      "transformed_vectors": {
        "shape": [2, 768],
        "norms": [1.0, 1.0]
      },
      "cosine_to_original": [0.98, 0.99],
      "decoded_by": {
        "jxe": ["vmmoe_decoded1", "vmmoe_decoded2"],
        "ielab": ["vmmoe_decoded1", "vmmoe_decoded2"]
      }
    }
  },
  "summary": {
    "total_processing_time": 5.67,
    "best_reconstruction": {
      "direct": "ielab",
      "via_vmmoe": "ielab"
    }
  }
}
```
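The `summary.best_reconstruction` field can be derived by picking the subscriber with the highest mean cosine similarity. A sketch, assuming a hypothetical `best_reconstruction` helper and the similarity numbers from the example output:

```python
from typing import Dict, List

def best_reconstruction(similarities: Dict[str, List[float]]) -> str:
    """Return the subscriber whose decoded texts have the highest mean cosine similarity."""
    return max(similarities, key=lambda name: sum(similarities[name]) / len(similarities[name]))

best = best_reconstruction({"jxe": [0.89, 0.91], "ielab": [0.95, 0.97]})
```

With the example numbers, `ielab` (mean 0.96) beats `jxe` (mean 0.90), matching the sample summary.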

Performance Optimizations

  • Batch Processing: All components support batch operations
  • Lazy Loading: Models loaded only when needed
  • GPU Memory Management: Automatic batch sizing based on available memory
  • Parallel Processing: Optional parallel subscriber execution
  • Caching: Vector encoding cache for repeated inputs
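The caching optimization can be sketched as a dict-backed wrapper that memoizes per-text encodings; `CachingEncoder` is an illustrative name, and a trivial lambda stands in for the real GTR-T5 encoder:

```python
from typing import Callable, Dict, List, Tuple

class CachingEncoder:
    """Wrap an encoder with a dict cache so repeated texts are encoded only once."""

    def __init__(self, encode_fn: Callable[[str], Tuple[float, ...]]):
        self._encode = encode_fn
        self._cache: Dict[str, Tuple[float, ...]] = {}
        self.misses = 0  # counts actual encoder calls, for inspection

    def encode(self, texts: List[str]) -> List[Tuple[float, ...]]:
        for t in texts:
            if t not in self._cache:
                self.misses += 1
                self._cache[t] = self._encode(t)
        return [self._cache[t] for t in texts]

enc = CachingEncoder(lambda t: (float(len(t)),))
first = enc.encode(["abc", "de", "abc"])  # "abc" encoded once, then reused
second = enc.encode(["abc"])              # served entirely from cache
```

For repeated batch runs over overlapping inputs, this turns redundant GPU encodes into dictionary lookups.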
Error Handling

  • Graceful Degradation: If a subscriber fails, others continue
  • Clear Error Messages: Detailed error reporting with context
  • Retry Logic: Configurable retry for transient failures
  • Validation: Input validation and sanitization
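The graceful-degradation requirement can be sketched as a dispatch loop that records a per-subscriber error instead of raising; the `run_subscribers` helper and the sample subscribers are illustrative, not the real implementation:

```python
from typing import Any, Callable, Dict, List

def run_subscribers(subscribers: Dict[str, Callable[[List[str]], Any]],
                    texts: List[str]) -> Dict[str, Any]:
    """Run every subscriber; a failure in one is recorded without stopping the rest."""
    results: Dict[str, Any] = {}
    for name, fn in subscribers.items():
        try:
            results[name] = {"ok": True, "output": fn(texts)}
        except Exception as exc:  # graceful degradation: report with context, don't raise
            results[name] = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return results

def broken(_texts: List[str]) -> Any:
    raise RuntimeError("model checkpoint missing")

out = run_subscribers({"good": lambda ts: [t[::-1] for t in ts], "bad": broken}, ["ab"])
```

The error entry carries the exception type and message, giving the "clear error messages" behavior while the remaining subscribers still deliver results.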
Testing Strategy

  • Unit Tests: Each component tested independently
  • Integration Tests: Full pipeline testing
  • Performance Tests: Benchmark batch processing speeds
  • Quality Tests: Compare reconstruction quality metrics
Future Extensions

  • Additional Subscribers: Easy to add new vec2text models
  • Web API: REST API for remote processing
  • Streaming Support: Process large datasets in chunks
  • Model Zoo: Support for multiple encoder models
  • Visualization: Add embedding visualization tools
Implementation Timeline

  • Week 1: Core architecture and base classes
  • Week 2: GTR-T5 encoder and JXE subscriber
  • Week 3: IELab and VMMoE subscribers
  • Week 4: Testing and optimization
  • Week 5: Documentation and deployment
Success Metrics

  • Correctness: Output matches expected format
  • Performance: Batch of 1000 texts processed in < 60 seconds
  • Quality: Vec2text reconstruction cosine similarity > 0.8
  • Reliability: 99.9% uptime with graceful error handling
  • Extensibility: New subscriber added in < 100 lines of code