8/7/2025
1. High-Quality Seed Datasets with Natural Relationships
| Dataset | Relationship Types | Scale | Quality |
| --- | --- | --- | --- |
| ConceptNet 5.7 | IsA, PartOf, UsedFor, RelatedTo, HasContext | 8M edges | High - human curated |
| Wikidata | P31 (instance), P279 (subclass), P361 (part of) | 100M+ items | Very High - structured |
| WordNet | Hypernym, Hyponym, Meronym, Holonym | 117K synsets (155K words) | Excellent - linguistic gold standard |
| ATOMIC 2020 | Causes, Effects, Intents, Reactions | 1.33M inferences | High - commonsense reasoning |
| Visual Genome | Spatial, Attribute, Action relationships | 2.3M relationships (3.8M object instances) | Good - grounded in images |
| SciGraph | Citations, Methods, Results, Hypotheses | 15M papers | Domain-specific excellence |
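
The datasets above name similar relations differently (Wikidata's `P279` and WordNet's hypernym both express an is-a link), so a seed pipeline would likely need a normalization step. The following is a minimal sketch of one way to do that, not a method taken from any of these datasets. The `Triple` type, the `CANONICAL` mapping, and the `normalize` helper are all hypothetical names introduced for illustration, and collapsing `P31` and `P279` into a single `IsA` label is a simplifying assumption.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str
    source: str


# Hypothetical mapping: (dataset, native label) -> (shared label, flip head/tail?).
# Mapping both P31 and P279 to "IsA" is lossy and only an assumption for this sketch.
CANONICAL = {
    ("wikidata", "P31"): ("IsA", False),      # instance of
    ("wikidata", "P279"): ("IsA", False),     # subclass of
    ("wikidata", "P361"): ("PartOf", False),  # part of
    ("wordnet", "hypernym"): ("IsA", False),  # tail is the broader term
    ("wordnet", "hyponym"): ("IsA", True),    # tail is the narrower term, so flip
    ("wordnet", "holonym"): ("PartOf", False),
    ("wordnet", "meronym"): ("PartOf", True),  # tail is the part, so flip
    ("conceptnet", "IsA"): ("IsA", False),
    ("conceptnet", "PartOf"): ("PartOf", False),
}


def normalize(source: str, head: str, relation: str, tail: str) -> Optional[Triple]:
    """Return a Triple in the shared vocabulary, or None if the relation is unmapped."""
    entry = CANONICAL.get((source, relation))
    if entry is None:
        return None
    name, flip = entry
    if flip:
        head, tail = tail, head
    return Triple(head, name, tail, source)
```

For example, `normalize("wordnet", "car", "meronym", "wheel")` yields a triple stating that `wheel` is `PartOf` `car`, while an unmapped label such as `"antonym"` returns `None` so the caller can decide whether to drop or log it.
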
Code-Focused Datasets for Concept Training
| Dataset | Size | Quality Features | Concept Extraction Value |
| --- | --- | --- | --- |
| The Stack v2 | 67.5TB, 600+ languages | Permissively licensed, deduplicated | Massive scale, multi-paradigm |
| CodeParrot | 50GB Python | Clean, well-documented | Pure Python focus |
| CodeContests | 13k problems | Solutions + test cases | Self-validating logic |
| APPS | 10k problems | Difficulty levels, test suites | Progressive complexity |
| HumanEval-X | 164 problems × 5 languages (820 samples) | Hand-written tests | Cross-lingual concepts |
| MBPP | 1000 Python tasks | Natural language → code | Concept bridging |
| CodeXGLUE | 10 tasks, 14 datasets | Understanding + generation | Semantic code relationships |
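
The "self-validating" property of datasets that pair solutions with test cases can be sketched as follows. This is an illustrative simplification: real entries in these datasets are usually whole programs judged on stdin/stdout, whereas the hypothetical `passes_all` helper below checks plain Python callables against `(args, expected)` pairs.

```python
from typing import Any, Callable, Iterable, Tuple


def passes_all(solution: Callable[..., Any],
               cases: Iterable[Tuple[tuple, Any]]) -> bool:
    """Return True only if the solution matches the expected value on every case."""
    for args, expected in cases:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed case in this sketch.
            return False
    return True


# Toy problem (made up for illustration): sum of the integers 1..n.
CASES = [((1,), 1), ((4,), 10), ((10,), 55)]


def closed_form(n: int) -> int:
    return n * (n + 1) // 2


def off_by_one(n: int) -> int:
    return sum(range(n))  # bug: stops at n - 1
```

Here `passes_all(closed_form, CASES)` is `True` and `passes_all(off_by_one, CASES)` is `False`, so only the first candidate would be kept as a trusted example of the concept.
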
Code-Specific Concept Relations
```python
CODE_SPECIFIC_RELATIONS = [
    "implements",   # Function implements an algorithm
    "optimizes",    # Better version of another approach
    "generalizes",  # More general version
    "specializes",  # More specific version
    "tests",        # Test case for a concept
    "depends_on",   # Requires another concept
    "parallel_to",  # Can run concurrently with
    "inverse_of",   # Undo operation
    "composed_of",  # Built from smaller concepts
]
```
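
One possible way to use this vocabulary is as the edge-type set of a small concept graph that rejects unknown relation names. The sketch below is illustrative only: `ConceptGraph`, `SYMMETRIC`, and `CONVERSE` are hypothetical names, and treating `parallel_to` and `inverse_of` as symmetric, and `generalizes`/`specializes` as converses of each other, are assumptions not stated in the list itself. The relation names are repeated so the block runs on its own.

```python
from collections import defaultdict

# Same nine names as the list above, repeated so this block is self-contained.
RELATIONS = frozenset({
    "implements", "optimizes", "generalizes", "specializes", "tests",
    "depends_on", "parallel_to", "inverse_of", "composed_of",
})

# Assumption: these relations hold in both directions.
SYMMETRIC = frozenset({"parallel_to", "inverse_of"})
# Assumption: if A specializes B, then B generalizes A (and vice versa).
CONVERSE = {"generalizes": "specializes", "specializes": "generalizes"}


class ConceptGraph:
    """Stores typed edges as (head, relation) -> set of tails."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, head, relation, tail):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation!r}")
        self._edges[(head, relation)].add(tail)
        # Record the implied reverse edge where the assumptions above apply.
        if relation in SYMMETRIC:
            self._edges[(tail, relation)].add(head)
        elif relation in CONVERSE:
            self._edges[(tail, CONVERSE[relation])].add(head)

    def related(self, head, relation):
        """Return the tails linked to head by relation, sorted for stable output."""
        return sorted(self._edges.get((head, relation), ()))


g = ConceptGraph()
g.add("quicksort", "implements", "sorting")
g.add("push", "inverse_of", "pop")
g.add("binary_search_tree", "specializes", "tree")
```

With these three edges, `g.related("pop", "inverse_of")` returns `["push"]` and `g.related("tree", "generalizes")` returns `["binary_search_tree"]`, both derived from the implied reverse edges. A misspelled relation such as `"implement"` raises `ValueError` instead of silently creating a new edge type.
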