Language Model Benchmarks

Performance comparison of leading large language models across standard benchmarks.

Models compared: GPT-4, Claude 2, Llama 2 (70B), PaLM 2, and GPT-3.5.

Benchmark Details

MMLU

Massive Multitask Language Understanding - multiple-choice questions testing knowledge across 57 subjects, from STEM to the humanities.

Top Score: 89.4% (GPT-4)
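As a rough illustration of how an MMLU number is produced, the sketch below computes per-subject accuracy over A-D predictions and then averages across subjects. The `results` records and the `mmlu_score` helper are hypothetical, and published scores can differ in whether they average per question or per subject.

```python
# Minimal sketch of MMLU-style scoring: per-subject accuracy, then an
# unweighted average across subjects. The demo records are made up; a real
# harness would load the 57-subject test split and the model's A-D choices.
from collections import defaultdict

def mmlu_score(results):
    """results: iterable of (subject, predicted_choice, gold_choice) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, pred, gold in results:
        total[subject] += 1
        correct[subject] += int(pred == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    macro_avg = sum(per_subject.values()) / len(per_subject)
    return per_subject, macro_avg

# Example with made-up predictions:
demo = [("anatomy", "B", "B"), ("anatomy", "C", "A"), ("law", "D", "D")]
print(mmlu_score(demo))  # ({'anatomy': 0.5, 'law': 1.0}, 0.75)
```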

HumanEval

Evaluates code generation on 164 hand-written Python programming problems; a completion counts as correct only if it passes the problem's unit tests.

Top Score: 84.9% (GPT-4)
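HumanEval results are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below implements the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021); the sample counts in the example are made up.

```python
# Unbiased pass@k estimator: with n samples per problem of which c pass the
# tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 50 of them pass, estimate pass@1:
print(pass_at_k(200, 50, 1))  # 0.25
```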

GSM8K

Grade school math word problems requiring multi-step arithmetic reasoning; answers are checked against the final numeric result.

Top Score: 92.0% (Claude 2)
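GSM8K reference solutions end with a final numeric answer marked by "####", and evaluation typically extracts the last number in the model's response and checks it for an exact match. The sketch below shows one minimal version of that check; the regular expression and normalization are simplified assumptions, and real harnesses differ in the details.

```python
# Minimal sketch of GSM8K-style answer checking: pull the last number out of
# the model's output and compare it to the reference answer after "####".
import re

def extract_final_number(text: str):
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text.replace("$", ""))
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output: str, reference_solution: str) -> bool:
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_output)
    return pred is not None and pred == gold

print(is_correct("... so the answer is 42.", "work shown ... #### 42"))  # True
```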

HellaSwag

Common sense reasoning about everyday situations: the model must pick the most plausible continuation of a scene from four candidate endings.

Top Score: 95.3% (GPT-4)
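HellaSwag is scored as multiple-choice accuracy: for each context the model ranks the four endings, typically by length-normalized log-likelihood, and is correct when the top-ranked ending matches the label. The sketch below assumes a hypothetical `ending_logprob` function standing in for a real model call; it is not part of any specific library.

```python
# Sketch of HellaSwag-style evaluation. Each item has a context, four
# candidate endings, and a gold label; the predicted ending is the one with
# the highest (here word-length-normalized) log-likelihood.
from typing import Callable, Sequence

def hellaswag_accuracy(items: Sequence[dict],
                       ending_logprob: Callable[[str, str], float]) -> float:
    """items: dicts with 'ctx' (str), 'endings' (list[str]), 'label' (int)."""
    correct = 0
    for item in items:
        scores = [ending_logprob(item["ctx"], e) / max(len(e.split()), 1)
                  for e in item["endings"]]
        correct += int(scores.index(max(scores)) == item["label"])
    return correct / len(items)
```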

