Benchmark Infrastructure

The Probe56 Benchmark Library

Every model scan is validated against published, peer-reviewed academic benchmarks — mapped to the 56 Holosynthics elements. Zero auto-generated questions. Ever.

1,801,476 Questions
46 Categories
88 Datasets
56 Elements

All verified academic benchmarks. Zero auto-generated.

Fixed — Every Scan

8 Universal Benchmark Categories

These 8 categories run automatically on every model. No configuration needed.

🔍 General Knowledge (✓ Standard)
58,934 questions
MMLU · MMLU-Pro · TriviaQA · HellaSwag · WinoGrande · CommonsenseQA · OpenBookQA

🧠 Reasoning (✓ Standard)
869,978 questions
BBH · CLUTRR · ProofWriter · EntailmentBank

Logic (✓ Standard)
1,151 questions
LogiQA · ReClor

📐 Mathematics (✓ Standard)
7,158 questions
GSM8K · MATH-500 · MathQA · TheoremQA · SVAMP · ASDiv · GSM-Hard

💡 Commonsense (✓ Standard)
17,277 questions
PIQA · SocialIQA · XCOPA

🎯 Hallucination (✓ Standard)
5,143 questions
TruthfulQA · SimpleQA

🛡️ Safety (✓ Standard)
157,934 questions
BBQ · RealToxicityPrompts · StereoSet

📖 Reading Comprehension (✓ Standard)
28,075 questions
DROP · BoolQ · RACE · SQuAD · Quail · ROPES

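As a quick cross-check of the figures above, the Python sketch below adds up the question counts and dataset names of the eight universal categories and compares them with the library-wide totals. The dictionary layout and variable names are illustrative assumptions, not part of any Probe56 API. The numbers are copied from this page.

```python
# Question counts and dataset names copied from the eight universal categories.
# The structure and names below are illustrative only, not a Probe56 interface.
UNIVERSAL = {
    "General Knowledge": (58_934, ["MMLU", "MMLU-Pro", "TriviaQA", "HellaSwag",
                                   "WinoGrande", "CommonsenseQA", "OpenBookQA"]),
    "Reasoning": (869_978, ["BBH", "CLUTRR", "ProofWriter", "EntailmentBank"]),
    "Logic": (1_151, ["LogiQA", "ReClor"]),
    "Mathematics": (7_158, ["GSM8K", "MATH-500", "MathQA", "TheoremQA",
                            "SVAMP", "ASDiv", "GSM-Hard"]),
    "Commonsense": (17_277, ["PIQA", "SocialIQA", "XCOPA"]),
    "Hallucination": (5_143, ["TruthfulQA", "SimpleQA"]),
    "Safety": (157_934, ["BBQ", "RealToxicityPrompts", "StereoSet"]),
    "Reading Comprehension": (28_075, ["DROP", "BoolQ", "RACE", "SQuAD",
                                       "Quail", "ROPES"]),
}

# Library-wide totals as stated at the top of the page.
LIBRARY_QUESTIONS = 1_801_476
LIBRARY_DATASETS = 88

# Sum the per-category question counts and count the listed dataset names.
universal_questions = sum(count for count, _ in UNIVERSAL.values())
universal_datasets = sum(len(names) for _, names in UNIVERSAL.values())

print(f"{universal_questions:,} of {LIBRARY_QUESTIONS:,} questions")
print(f"{universal_datasets} of {LIBRARY_DATASETS} datasets")
```

Run as written, it reports 1,145,650 of 1,801,476 questions and 34 of 88 datasets. The dataset figure assumes that the names listed per category are complete and that no dataset is shared between categories.
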
Domain — Customer Select

27 Industry Domain Categories

Applied per your industry vertical.

Supplementary — On Request

10 Supplementary Categories

Available for specialized evaluation needs. Contact us to include in your scan.
