LNSP Model Size Analysis
By Trent Carter
7/23/25
Model Architecture Size Comparison
π§ Student Model Comparison
| Student Dim | Architecture | Parameters | Model Size (MB) | RAM (Training, MB) | Inference Speed | Use Case |
| 384 | 384β384β192β384β384 | ~591K | ~2.4 | ~26 | Very Fast | Baseline semantic tasks |
| 512 | 512β512β256β512β512 | ~1.1M | ~4.2 | ~46 | Fast | General-purpose analogies |
| 768 | 768β768β384β768β768 | ~2.4M | ~9.4 | ~104 | Medium | High-quality compression |
| 1024 | 1024β1024β512β1024β1024 | ~4.2M | ~16.8 | ~185 | Medium | Premium relational reasoning |
| 2048 | 2048β2048β1024β2048β2048 | ~16.8M | ~67.1 | ~739 | Slower | Research/experimental diversity |
| 3072 | 3072β3072β1536β3072β3072 | ~37.8M | ~151.2 | ~1,663 | Slow | Large-scale semantic capture |
| 4096 | 4096β4096β2048β4096β4096 | ~67.1M | ~268.5 | ~2,954 | Slow | Advanced generalization |
| 5120 | 5120β5120β2560β5120β5120 | ~104.9M | ~419.5 | ~4,614 | Very Slow | Frontier-level tasks |
| 6144 | 6144β6144β3072β6144β6144 | ~151.0M | ~604.1 | ~6,645 | Very Slow | Complex multi-domain processing |
| 8192 | 8192β8192β4096β8192β8192 | ~268.5M | ~1,073.9 | ~11,813 | Extremely Slow | Ultra-scale latent exploration |
| Student Dim | Architecture | Parameters | Model Size | RAM (Training) | Inference Speed | Use Case |
| 256 | 384β256β192β256β384 | ~545K | 2.2MB | ~25MB | Very Fast | Current Winner (SN727) |
| 320 | 384β320β192β320β384 | ~680K | 2.7MB | ~30MB | Fast | Slight quality boost? |
| 512 | 384β512β192β512β384 | ~1.1M | 4.4MB | ~50MB | Fast | Better semantic capture |
| 768 | 384β768β192β768β384 | ~1.6M | 6.4MB | ~75MB | Medium | High-quality compression |
| 1024 | 384β1024β192β1024β384 | ~2.1M | 8.4MB | ~100MB | Medium | Premium quality |
| 2048 | 384β2048β192β2048β384 | ~4.2M | 16.8MB | ~200MB | Slower | Research/Experimentation |
| 4096 | 384β4096β192β4096β384 | ~8.4M | 33.6MB | ~400MB | Slow | Large-scale applications |
Parameter Calculation Breakdown
For architectureΒ 384βstudent_dimβ192βstudent_dimβ384:
# Layer sizes:
input_proj: 384 Γ student_dim + student_dim = 384 Γ S + S
compression: student_dim Γ 192 + 192 = S Γ 192 + 192
expansion: 192 Γ student_dim + student_dim = 192 Γ S + S
teacher_align: student_dim Γ 384 + 384 = S Γ 384 + 384
Total parameters = (384 + 192 + 384) Γ S + (S + 192 + S + 384)
= 960 Γ S + (2ΓS + 576)
= 962 Γ S + 576
Training Memory (MacBook M4)
256D: ~25MB - β
Ultra-efficient
512D: ~50MB - β
Still very efficient
1024D: ~100MB - β
Reasonable for M4
2048D: ~200MB - β οΈ Getting heavy
4096D: ~400MB - β Probably too much
Inference Speed (MPS)
256D: ~0.1ms per vector - β‘ Real-time
512D: ~0.2ms per vector - β‘ Real-time
1024D: ~0.4ms per vector - β
Fast
2048D: ~0.8ms per vector - β οΈ Slower
4096D: ~1.6ms per vector - β Too slow
Recommendations
For Alignment Weight Sweep (Fast Iteration)
Use 20K dataset: Absolutely fine for hyperparameter tuning
256D architecture: Keep current winner architecture
6.5min vs 25min: Perfect trade-off for rapid experimentation
For Quality Scaling Experiments
512D: Natural next step (2x parameters, still efficient)
768D: Sweet spot for quality vs efficiency
1024D: Maximum practical size for your use case
Architecture Progression Strategy
Phase 1: Alignment sweep with 256D + 20K (fast)
Phase 2: Best alignment + 512D + 220K (quality test)
Phase 3: Best alignment + 768D + 220K (premium model)
Size Efficiency Analysis
Your currentΒ SN727 (256D, 2.2MB)Β is extremely efficient:
4x smallerΒ than typical transformer layers
10x fasterΒ than full attention mechanisms
100x smallerΒ than comparable GPT models
Fits entirely in L3 cacheΒ on M4 chips
Strategic Insights
Why 256D Works So Well
Goldilocks Zone: Not too compressed (information loss) or too large (overfitting)
Cache Friendly: Fits in processor cache for ultra-fast inference
Memory Efficient: Enables real-time processing on mobile devices
When to Scale Up
512D: If 256D hits semantic ceiling (unlikely given current performance)
768D: For specialized domains requiring higher semantic precision
1024D+: Research purposes or when memory/speed isn't constrained
Conclusion
SN727's 256D architecture is likely optimalΒ for your use case:
Excellent performance (0.414 score)
Ultra-efficient (2.2MB, 25MB RAM)
Fast inference (0.1ms per vector)
Proven stable (CV 0.029)
Recommendation: Focus on hyperparameter optimization (alignment weights, epochs) before scaling architecture.