Data Curation Options & Resource Requirements


Trent Carter · 2025-08-05 · 3 min read · 455 words


Curation Strategy Options

| Strategy | Quality | Scale | CPU Hours | GPU Hours | Internet Queries | Processing Time | Cost Estimate |
|---|---|---|---|---|---|---|---|
| Manual + LLM Validation | Excellent | 10M max | 2,000 | 200 | 50K | 3-4 weeks | $500-1K |
| Automated + Two-Stage LLM | Very Good | 100M | 500 | 100 | 500K | 1-2 weeks | $200-500 |
| Hybrid: Seeds + Expansion | High | 1B | 1,000 | 300 | 1M | 2-3 weeks | $800-1.5K |
| Pure Automated Pipeline | Good | 1B+ | 200 | 50 | 10M | 3-5 days | $100-300 |
| Crystallization Method | Very High | 500M | 800 | 400 | 2M | 1.5-2 weeks | $1K-2K |
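As a rough cross-check of the cost column, the strategies can be normalized to cost per million concepts. The figures below are the midpoints of the table's ranges, so treat them as illustrative:

```python
# Cost-per-concept comparison across the curation strategies above.
# Scales and costs are midpoints of the table's ranges (illustrative).
strategies = {
    "Manual + LLM Validation":   {"scale": 10_000_000,    "cost_usd": 750},
    "Automated + Two-Stage LLM": {"scale": 100_000_000,   "cost_usd": 350},
    "Hybrid: Seeds + Expansion": {"scale": 1_000_000_000, "cost_usd": 1150},
    "Pure Automated Pipeline":   {"scale": 1_000_000_000, "cost_usd": 200},
    "Crystallization Method":    {"scale": 500_000_000,   "cost_usd": 1500},
}

def cost_per_million(s):
    """Dollars spent per million concepts curated."""
    return s["cost_usd"] / (s["scale"] / 1_000_000)

for name, s in strategies.items():
    print(f"{name}: ${cost_per_million(s):.2f} per 1M concepts")
```

The spread is stark: manual validation runs about $75 per million concepts, while the pure automated pipeline is around $0.20 — a ~375x difference that quality requirements have to justify.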

Data Sources & Yields

| Source | Raw Size | Concept Yield | Quality | Extraction Rate | Special Requirements |
|---|---|---|---|---|---|
| The Stack v2 | 67.5TB | 50M code concepts | High | 1M/day | AST parsing, sandboxing |
| C4 (Cleaned) | 750GB | 200M text concepts | Medium | 5M/day | Language detection |
| Wikipedia | 100GB | 30M concepts | High | 2M/day | Entity linking |
| arXiv Papers | 500GB | 20M scientific | Very High | 500K/day | PDF parsing, citations |
| ConceptNet 5.7 | 1GB | 8M relations | Excellent | 8M/day | Already structured |
| Wikidata | 120GB | 100M entities | Excellent | 10M/day | SPARQL queries |
| CodeContests | 2GB | 50K code patterns | Excellent | 50K/day | Self-validating |
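Dividing each source's concept yield by its extraction rate gives a rough serial-processing timeline (figures taken from the table above; real pipelines would parallelize across sources):

```python
# Days to exhaust each source's concept yield at its stated extraction rate.
# (concept yield, concepts/day) pairs copied from the table above.
sources = {
    "The Stack v2":   (50_000_000, 1_000_000),
    "C4 (Cleaned)":   (200_000_000, 5_000_000),
    "Wikipedia":      (30_000_000, 2_000_000),
    "arXiv Papers":   (20_000_000, 500_000),
    "ConceptNet 5.7": (8_000_000, 8_000_000),
    "Wikidata":       (100_000_000, 10_000_000),
    "CodeContests":   (50_000, 50_000),
}

for name, (concept_yield, daily_rate) in sources.items():
    print(f"{name}: {concept_yield / daily_rate:.0f} days")
```

The structured sources (ConceptNet, Wikidata, CodeContests) finish in days, while C4 and arXiv each need about 40 days of serial extraction — which is why the phased plans below mix fast structured sources in early.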

Resource Scaling by Concept Count

| Concepts | Storage (Parquet) | RAM (Loading) | FAISS Index | Training RAM | Inference RAM | Build Time |
|---|---|---|---|---|---|---|
| 10K | 50MB | 100MB | 10MB | 2GB | 500MB | 1 hour |
| 100K | 500MB | 1GB | 100MB | 4GB | 1GB | 4 hours |
| 1M | 5GB | 10GB | 1GB | 8GB | 2GB | 12 hours |
| 10M | 50GB | 50GB | 10GB | 20GB | 4GB | 2 days |
| 100M | 500GB | 200GB | 100GB | 40GB | 8GB | 1 week |
| 1B | 5TB | 1TB | 1TB | 80GB | 16GB | 2-3 weeks |

*Requires memory mapping and batch loading
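A minimal standard-library sketch of that memory-mapped access pattern, assuming embeddings are stored as a flat binary file of float32 vectors (the file name and layout here are illustrative, not the project's actual format):

```python
# Memory-mapped embedding access for datasets larger than RAM.
# Assumes a flat binary file of float32 vectors; "embeddings.bin"
# and DIM are illustrative names, not the project's real layout.
import mmap
import os
import struct

DIM = 768        # vector width (the 768D schema above)
REC = DIM * 4    # bytes per float32 vector

def write_demo(path, vectors):
    """Write vectors as contiguous float32 records."""
    with open(path, "wb") as f:
        for v in vectors:
            f.write(struct.pack(f"{DIM}f", *v))

def load_vector(mm, idx):
    """Unpack one vector; only its ~3KB slice gets paged in."""
    off = idx * REC
    return struct.unpack(f"{DIM}f", mm[off:off + REC])

write_demo("embeddings.bin", [[float(i)] * DIM for i in range(3)])
with open("embeddings.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    v = load_vector(mm, 2)
    print(v[0])  # prints 2.0
    mm.close()
os.remove("embeddings.bin")
```

The same idea is what Parquet row groups and FAISS on-disk indexes give you for free at larger scales; this just shows the mechanism.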

Storage Schema Breakdown (per concept)

| Component | 768D (bytes) | 1536D (bytes) | Multi-Dim (bytes) |
|---|---|---|---|
| Concept ID | 16 | 16 | 16 |
| Text | 200 (avg) | 200 | 200 |
| Embeddings | 3,072 | 6,144 | 9,216 |
| Projections | 0 | 0 | 4,608 |
| Relations | 100 (avg) | 100 | 100 |
| Metadata | 150 | 150 | 150 |
| Domain Info | 50 | 50 | 50 |
| Total per concept | ~3.6KB | ~6.7KB | ~14.3KB |
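The 768D column can be sanity-checked in a few lines (component sizes copied from the table; "KB" here is decimal, matching the table's ~3.6KB total):

```python
# Back-of-envelope check of the per-concept storage figures above,
# 768D column. Sizes in bytes; "avg" entries use the table's averages.
components_768d = {
    "concept_id": 16,
    "text": 200,            # average
    "embedding": 768 * 4,   # float32 -> 3,072 bytes
    "relations": 100,       # average
    "metadata": 150,
    "domain_info": 50,
}
total = sum(components_768d.values())
print(f"{total} bytes (~{total / 1000:.1f} KB per concept)")  # prints 3588 bytes (~3.6 KB per concept)
```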

Phase 1: Proof of Concept (10K concepts)

  • Sources: CodeContests + Wikipedia sample + ConceptNet subset
  • Time: 2-3 days
  • Resources: 1 CPU, local processing
  • Storage: 50MB
  • Purpose: Validate pipeline, test both VMM and Diffusion

Phase 2: Development Dataset (100K concepts)

  • Sources: The Stack (Python subset) + arXiv abstracts + ConceptNet
  • Time: 1 week
  • Resources: M4 Mac full utilization
  • Storage: 500MB
  • Purpose: Full model training, architecture comparison

Phase 3: Production Dataset (10M concepts)

  • Sources: Multi-language code + scientific papers + knowledge graphs
  • Time: 2 weeks
  • Resources: M4 Mac + some cloud processing
  • Storage: 50GB
  • Purpose: Real-world performance validation

Phase 4: Full Scale (100M+ concepts)

  • Sources: Full Stack v2 + C4 subset + all knowledge sources
  • Time: 3-4 weeks
  • Resources: Hybrid local/cloud
  • Storage: 500GB-5TB
  • Purpose: Production-ready model

Critical "Get It Right First Time" Considerations

    1. Schema Design

  • Multi-dimensional embeddings: Store all dimensions from day 1
  • Projection matrices: Learn 768→384 and 1536→768 mappings
  • Versioning: Track embedding model versions for reproducibility
  • Temporal stamps: Enable training data evolution analysis
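One hedged sketch of a row honoring these schema decisions — all dimensions stored, a versioned embedding model, and a timestamp from day one. The field names, the "embed-v1" version string, and the dataclass shape are illustrative assumptions, not the project's actual schema:

```python
# Illustrative concept row carrying the schema decisions above.
# Field names and the "embed-v1" version string are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ConceptRow:
    concept_id: str
    text: str
    embedding_768: list    # stored from day 1
    embedding_1536: list   # stored from day 1 (multi-dimensional)
    embedding_384: list    # output of the learned 768->384 projection
    embedding_model: str   # versioned for reproducibility
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

row = ConceptRow("c-001", "binary search", [0.0] * 768, [0.0] * 1536,
                 [0.0] * 384, "embed-v1")
print(len(row.embedding_768), row.embedding_model)
```

Storing every dimension up front costs ~4x the 768D-only footprint (per the schema table above) but avoids re-embedding the corpus when a projection is retrained.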

    2. Quality Assurance Pipeline

  • Validation scores: Every concept gets multiple validation passes
  • Source tracking: Maintain provenance for debugging
  • Relationship confidence: Weight connections by validation strength
  • Domain confidence: Track classification uncertainty
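A minimal sketch of the QA record these bullets imply — multiple validation passes, source provenance, and confidence-weighted relations. The field names and the 0.8 threshold are illustrative, not the pipeline's actual values:

```python
# Illustrative QA record: several validation passes per concept,
# provenance for debugging, confidence on relations and domain.
# Field names and the 0.8 threshold are assumptions.
def accept(concept, threshold=0.8):
    """Keep a concept only if every validation pass clears the threshold."""
    return all(score >= threshold for score in concept["validation_scores"])

concept = {
    "id": "c-042",
    "source": "wikipedia/en/Binary_search_algorithm",  # provenance
    "validation_scores": [0.93, 0.88, 0.85],           # multiple passes
    "relations": [("c-007", "is_a", 0.91)],            # confidence-weighted edge
    "domain": ("computer_science", 0.97),              # classification certainty
}
print(accept(concept))  # prints True
```

Requiring every pass to clear the bar (rather than averaging) biases toward precision, which matters when downstream experts train on these concepts as ground truth.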

    3. Scalability Preparation

  • Sharded storage: Design for horizontal scaling from start
  • Incremental updates: Support adding new concepts without rebuild
  • Memory mapping: Handle datasets larger than RAM
  • Distributed indexing: FAISS clustering for massive scales
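The sharding bullet can be made concrete with hash-based shard assignment: a concept's shard is a pure function of its ID, so an incremental insert touches exactly one shard's index instead of forcing a global rebuild. The shard count below is illustrative:

```python
# Hash-based shard assignment: shard = f(concept_id), so adding a
# concept updates one shard only. NUM_SHARDS is an illustrative value.
import hashlib

NUM_SHARDS = 64

def shard_for(concept_id: str) -> int:
    """Deterministic shard index derived from the concept ID."""
    digest = hashlib.sha1(concept_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Each incremental insert lands in exactly one shard:
shards = {i: [] for i in range(NUM_SHARDS)}
for cid in ("c-001", "c-002", "c-003"):
    shards[shard_for(cid)].append(cid)
print(sum(len(s) for s in shards.values()))  # prints 3
```

The same partitioning key can drive both the Parquet file layout and per-shard FAISS indexes, so storage and search scale out together.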

    4. Testing Integration

  • Held-out validation: 10% of each domain reserved for testing
  • Temporal splits: Early concepts for training, later for validation
  • Cross-domain evaluation: Test physics expert on biology concepts
  • Adversarial samples: Include deliberately challenging cases
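The temporal split above can be sketched in a few lines: concepts ingested earlier go to training, the latest tail is held out. The 10% holdout fraction matches the bullet; the tuple shape is an illustrative assumption:

```python
# Temporal split: train on early concepts, hold out the latest 10%.
# Concepts are (concept_id, ingest_timestamp) tuples (illustrative shape).
def temporal_split(concepts, holdout_frac=0.10):
    """Sort by ingest time and reserve the newest fraction for validation."""
    ordered = sorted(concepts, key=lambda c: c[1])
    cut = int(len(ordered) * (1 - holdout_frac))
    return ordered[:cut], ordered[cut:]

data = [(f"c-{i:03d}", i) for i in range(100)]  # timestamp = ingest order
train, heldout = temporal_split(data)
print(len(train), len(heldout))  # prints 90 10
```

Unlike a random split, this catches a model that only memorizes early vocabulary: every held-out concept postdates everything it trained on.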