Benchmark Infrastructure

The Probe56 Benchmark Library

Every model scan is validated against published, peer-reviewed academic benchmarks, each mapped to the 56 Holosynthics elements. Zero auto-generated questions. Ever.

1,801,476 Questions

46 Categories

88 Datasets

56 Elements

All verified academic benchmarks. Zero auto-generated.

Fixed — Every Scan

8 Universal Benchmark Categories

These 8 categories run automatically on every model. No configuration needed.

🔍 ✓ Standard

General Knowledge

58,934 questions

MMLU · MMLU-Pro · TriviaQA · HellaSwag · WinoGrande · CommonsenseQA · OpenBookQA

🧠 ✓ Standard

Reasoning

869,978 questions

BBH · CLUTRR · ProofWriter · EntailmentBank

⚡ ✓ Standard

Logic

1,151 questions

LogiQA · ReClor

📐 ✓ Standard

Mathematics

7,158 questions

GSM8K · MATH-500 · MathQA · TheoremQA · SVAMP · ASDiv · GSM-Hard

💡 ✓ Standard

Commonsense

17,277 questions

PIQA · SocialIQA · XCOPA

🎯 ✓ Standard

Hallucination

5,143 questions

TruthfulQA · SimpleQA

🛡️ ✓ Standard

Safety

157,934 questions

BBQ · RealToxicityPrompts · StereoSet

📖 ✓ Standard

Reading Comprehension

28,075 questions

DROP · BoolQ · RACE · SQuAD · Quail · ROPES
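
For readers who want to sanity-check the figures, here is a minimal sketch that tallies the eight universal categories and compares the result with the library-wide total. The counts are copied from the listing above. The names `UNIVERSAL_CATEGORIES`, `LIBRARY_TOTAL`, `universal_total`, and `share_of_library` are hypothetical, chosen for illustration only; they are not part of any Probe56 API.

```python
# Question counts per universal category, copied from the listing above.
# The dictionary name and layout are illustrative assumptions.
UNIVERSAL_CATEGORIES = {
    "General Knowledge": 58_934,
    "Reasoning": 869_978,
    "Logic": 1_151,
    "Mathematics": 7_158,
    "Commonsense": 17_277,
    "Hallucination": 5_143,
    "Safety": 157_934,
    "Reading Comprehension": 28_075,
}

# Headline question count for the whole library, as stated on this page.
LIBRARY_TOTAL = 1_801_476


def universal_total(counts: dict[str, int]) -> int:
    """Sum the question counts across all listed categories."""
    return sum(counts.values())


def share_of_library(counts: dict[str, int], library_total: int) -> float:
    """Fraction of the library covered by the listed categories."""
    return universal_total(counts) / library_total


if __name__ == "__main__":
    total = universal_total(UNIVERSAL_CATEGORIES)
    share = share_of_library(UNIVERSAL_CATEGORIES, LIBRARY_TOTAL)
    print(f"{len(UNIVERSAL_CATEGORIES)} categories, {total:,} questions")
    print(f"Share of the full library: {share:.1%}")
```

By this tally the universal categories account for 1,145,650 questions, with the rest of the library sitting in the domain and supplementary categories.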

Domain — Customer Select

27 Industry Domain Categories

Applied per your industry vertical. Click any category to see the datasets inside.

Supplementary — On Request

10 Supplementary Categories

Available for specialized evaluation needs. Contact us to include in your scan.
