TC
← All Research
LNSP Model Size Analysis
ExperimentLNSP

LNSP Model Size Analysis

By Trent Carter 7/23/25

2025-07-235 min read823 words

LNSP Model Size Analysis

By Trent Carter

7/23/25

Model Architecture Size Comparison

🧠 Student Model Comparison

Student DimArchitectureParametersModel Size (MB)RAM (Training, MB)Inference SpeedUse Case 384384β†’384β†’192β†’384β†’384~591K~2.4~26Very FastBaseline semantic tasks 512512β†’512β†’256β†’512β†’512~1.1M~4.2~46FastGeneral-purpose analogies 768768β†’768β†’384β†’768β†’768~2.4M~9.4~104MediumHigh-quality compression 10241024β†’1024β†’512β†’1024β†’1024~4.2M~16.8~185MediumPremium relational reasoning 20482048β†’2048β†’1024β†’2048β†’2048~16.8M~67.1~739SlowerResearch/experimental diversity 30723072β†’3072β†’1536β†’3072β†’3072~37.8M~151.2~1,663SlowLarge-scale semantic capture 40964096β†’4096β†’2048β†’4096β†’4096~67.1M~268.5~2,954SlowAdvanced generalization 51205120β†’5120β†’2560β†’5120β†’5120~104.9M~419.5~4,614Very SlowFrontier-level tasks 61446144β†’6144β†’3072β†’6144β†’6144~151.0M~604.1~6,645Very SlowComplex multi-domain processing 81928192β†’8192β†’4096β†’8192β†’8192~268.5M~1,073.9~11,813Extremely SlowUltra-scale latent exploration Student DimArchitectureParametersModel SizeRAM (Training)Inference SpeedUse Case 256384β†’256β†’192β†’256β†’384~545K2.2MB~25MBVery FastCurrent Winner (SN727) 320384β†’320β†’192β†’320β†’384~680K2.7MB~30MBFastSlight quality boost? 512384β†’512β†’192β†’512β†’384~1.1M4.4MB~50MBFastBetter semantic capture 768384β†’768β†’192β†’768β†’384~1.6M6.4MB~75MBMediumHigh-quality compression 1024384β†’1024β†’192β†’1024β†’384~2.1M8.4MB~100MBMediumPremium quality 2048384β†’2048β†’192β†’2048β†’384~4.2M16.8MB~200MBSlowerResearch/Experimentation 4096384β†’4096β†’192β†’4096β†’384~8.4M33.6MB~400MBSlowLarge-scale applications

Parameter Calculation Breakdown

For architecture 384→student_dim→192→student_dim→384:

# Layer sizes:

input_proj: 384 Γ— student_dim + student_dim = 384 Γ— S + S

compression: student_dim Γ— 192 + 192 = S Γ— 192 + 192

expansion: 192 Γ— student_dim + student_dim = 192 Γ— S + S

teacher_align: student_dim Γ— 384 + 384 = S Γ— 384 + 384

Total parameters = (384 + 192 + 384) Γ— S + (S + 192 + S + 384)

= 960 Γ— S + (2Γ—S + 576)

= 962 Γ— S + 576

Memory & Performance Analysis

Training Memory (MacBook M4)

  • 256D: ~25MB - βœ… Ultra-efficient
  • 512D: ~50MB - βœ… Still very efficient
  • 1024D: ~100MB - βœ… Reasonable for M4
  • 2048D: ~200MB - ⚠️ Getting heavy
  • 4096D: ~400MB - ❌ Probably too much
  • Inference Speed (MPS)

  • 256D: ~0.1ms per vector - ⚑ Real-time
  • 512D: ~0.2ms per vector - ⚑ Real-time
  • 1024D: ~0.4ms per vector - βœ… Fast
  • 2048D: ~0.8ms per vector - ⚠️ Slower
  • 4096D: ~1.6ms per vector - ❌ Too slow
  • Recommendations

    For Alignment Weight Sweep (Fast Iteration)

  • Use 20K dataset: Absolutely fine for hyperparameter tuning
  • 256D architecture: Keep current winner architecture
  • 6.5min vs 25min: Perfect trade-off for rapid experimentation
  • For Quality Scaling Experiments

  • 512D: Natural next step (2x parameters, still efficient)
  • 768D: Sweet spot for quality vs efficiency
  • 1024D: Maximum practical size for your use case
  • Architecture Progression Strategy

  • Phase 1: Alignment sweep with 256D + 20K (fast)
  • Phase 2: Best alignment + 512D + 220K (quality test)
  • Phase 3: Best alignment + 768D + 220K (premium model)
  • Size Efficiency Analysis

    Your currentΒ SN727 (256D, 2.2MB)Β is extremely efficient:

  • 4x smallerΒ than typical transformer layers
  • 10x fasterΒ than full attention mechanisms
  • 100x smallerΒ than comparable GPT models
  • Fits entirely in L3 cacheΒ on M4 chips
  • Strategic Insights

    Why 256D Works So Well

  • Goldilocks Zone: Not too compressed (information loss) or too large (overfitting)
  • Cache Friendly: Fits in processor cache for ultra-fast inference
  • Memory Efficient: Enables real-time processing on mobile devices
  • When to Scale Up

  • 512D: If 256D hits semantic ceiling (unlikely given current performance)
  • 768D: For specialized domains requiring higher semantic precision
  • 1024D+: Research purposes or when memory/speed isn't constrained
  • Conclusion

    SN727's 256D architecture is likely optimalΒ for your use case:
  • Excellent performance (0.414 score)
  • Ultra-efficient (2.2MB, 25MB RAM)
  • Fast inference (0.1ms per vector)
  • Proven stable (CV 0.029)
  • Recommendation: Focus on hyperparameter optimization (alignment weights, epochs) before scaling architecture.

    Related Research