Benchmark Infrastructure
The Probe56 Benchmark Library
Every model scan is validated against published, peer-reviewed academic benchmarks, each mapped to the 56 Holosynthics elements. Zero auto-generated questions. Ever.
1,801,476 Questions
46 Categories
88 Datasets
56 Elements
All verified academic benchmarks. Zero auto-generated.
Fixed — Every Scan
8 Universal Benchmark Categories
These 8 categories run automatically on every model. No configuration needed.
General Knowledge (✓ Standard): 58,934 questions. MMLU · MMLU-Pro · TriviaQA · HellaSwag · WinoGrande · CommonsenseQA · OpenBookQA
Reasoning (✓ Standard): 869,978 questions. BBH · CLUTRR · ProofWriter · EntailmentBank
Logic (✓ Standard): 1,151 questions. LogiQA · ReClor
Mathematics (✓ Standard): 7,158 questions. GSM8K · MATH-500 · MathQA · TheoremQA · SVAMP · ASDiv · GSM-Hard
Commonsense (✓ Standard): 17,277 questions. PIQA · SocialIQA · XCOPA
Hallucination (✓ Standard): 5,143 questions. TruthfulQA · SimpleQA
Safety (✓ Standard): 157,934 questions. BBQ · RealToxicityPrompts · StereoSet
Reading Comprehension (✓ Standard): 28,075 questions. DROP · BoolQ · RACE · SQuAD · Quail · ROPES
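To see how much of the library the universal categories account for, the per-category figures above can be added up. The sketch below is illustrative only: the names `UNIVERSAL_CATEGORIES` and `summarize` are made up for this example and are not part of any Probe56 API. The numbers and dataset names are copied from the category cards, and the library totals from the headline figures.

```python
# Illustrative only: names here are hypothetical, not a Probe56 API.
# Counts and dataset names are copied from the category cards on this page.
UNIVERSAL_CATEGORIES = {
    "General Knowledge": (58_934, ["MMLU", "MMLU-Pro", "TriviaQA", "HellaSwag",
                                   "WinoGrande", "CommonsenseQA", "OpenBookQA"]),
    "Reasoning": (869_978, ["BBH", "CLUTRR", "ProofWriter", "EntailmentBank"]),
    "Logic": (1_151, ["LogiQA", "ReClor"]),
    "Mathematics": (7_158, ["GSM8K", "MATH-500", "MathQA", "TheoremQA",
                            "SVAMP", "ASDiv", "GSM-Hard"]),
    "Commonsense": (17_277, ["PIQA", "SocialIQA", "XCOPA"]),
    "Hallucination": (5_143, ["TruthfulQA", "SimpleQA"]),
    "Safety": (157_934, ["BBQ", "RealToxicityPrompts", "StereoSet"]),
    "Reading Comprehension": (28_075, ["DROP", "BoolQ", "RACE", "SQuAD",
                                       "Quail", "ROPES"]),
}

# Headline total for the whole library (all categories, not just these 8).
LIBRARY_TOTAL_QUESTIONS = 1_801_476


def summarize(categories):
    """Return (total questions, total datasets) for a category mapping."""
    questions = sum(count for count, _ in categories.values())
    datasets = sum(len(names) for _, names in categories.values())
    return questions, datasets


universal_questions, universal_datasets = summarize(UNIVERSAL_CATEGORIES)
share = universal_questions / LIBRARY_TOTAL_QUESTIONS

print(f"{universal_questions:,} questions across {universal_datasets} datasets")
# prints: 1,145,650 questions across 34 datasets
print(f"{share:.1%} of the library's questions")
# prints: 63.6% of the library's questions
```

By this count, the 8 universal categories cover 1,145,650 of the 1,801,476 questions and 34 of the 88 datasets; the remainder sits in the domain and supplementary categories.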
Domain — Customer Select
27 Industry Domain Categories
Applied according to your industry vertical.
Supplementary — On Request
10 Supplementary Categories
Available for specialized evaluation needs. Contact us to include in your scan.