Peter Wildeford, from a post on GPT-5:
Benchmark | Quick explanation | Quote from the article | Link | Order in the article |
---|---|---|---|---|
BIG-Bench Hard | A diverse set of questions — commonsense reasoning, social understanding, physics — specifically chosen to be challenging for LLMs. | “AI models have steadily improved at answering diverse questions—from commonsense reasoning to understanding social situations and physics. This is demonstrated on the ‘BIG-Bench Hard’ benchmark, which features diverse questions specifically chosen to challenge LLMs.” | https://github.com/google/BIG-bench | 1 |
ImageNet | Standard benchmark for image recognition accuracy; the compute-efficiency trend quoted here is sketched just after the table. | “Every two years, the compute needed to get the same performance across a wide range of models has decreased tenfold.” (based on the ImageNet benchmark) | https://www.image-net.org/ | 2 |
GPQA Diamond | PhD-level science questions designed to be “Google-proof,” so only domain experts can answer. | “Consider the GPQA Diamond benchmark — a set of scientific questions designed so that people with PhDs in the field can mostly answer them, but non-experts can’t, even with 30 minutes of access to Google.” | https://arxiv.org/abs/2311.12022 | 3 |
FrontierMath | Extremely difficult math problems, from Olympiad-level up to challenges for professional mathematicians. | “Epoch AI created Frontier Math — a benchmark of insanely hard mathematical problems. The easiest 25% are similar to Olympiad-level problems. The most difficult 25% are, according to Fields Medalist Terence Tao, ‘extremely challenging,’ and would typically need an expert in that branch of mathematics to solve them.” | https://arxiv.org/abs/2411.04872 | 4 |
SWE-bench Verified | Real-world software engineering tasks drawn from GitHub issues, each typically taking a human engineer about an hour. | “SWE-bench Verified is a benchmark of real-world software engineering problems from GitHub that typically take about an hour to complete.” | https://www.swebench.com/ | 5 |
RE-Bench | Seven difficult AI research engineering tasks (e.g., fine-tuning, predicting experiments) designed to mimic real AI R&D. | “Now consider perhaps the world’s most important benchmark: METR’s set of difficult AI research engineering problems (‘RE Bench’).” | https://arxiv.org/abs/2411.15114 | 6 |
METR Time Horizon benchmark | Categorises computer-use tasks by how long they take humans (seconds → weeks), measuring how long models can sustain performance on a task. | “METR made a broader benchmark of computer use tasks categorised by time horizon. GPT-2 was only able to do tasks that took humans a few seconds; GPT-4 managed a few minutes; and the latest reasoning models could do tasks that took humans just under an hour.” | https://metr.org/ | 7 |
MMLU | Compilation of college and professional knowledge tests covering multiple subjects. | “MMLU: compilation of college and professional knowledge tests.” | https://arxiv.org/abs/2009.03300 | 8 |
Humanity’s Last Exam | 3,000 extremely hard questions at the frontier of human knowledge. | “Humanity’s last exam: a compilation of 3,000 even harder questions at the frontier of human knowledge.” | https://lastexam.ai/ | 9 |
MATH | High school mathematics competition questions. | “MATH: High school math competition questions.” | https://arxiv.org/abs/2103.03874 | 10 |
Situational Awareness benchmark | Tests whether a model understands itself, its outputs, and its deployment/training status. | “Situational Awareness: questions designed to test if model understands itself and context.” | (No standalone link found — likely internal to METR research) | 11 |
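
To make the ImageNet row's efficiency claim concrete, here is a minimal sketch of the quoted trend, assuming a steady tenfold reduction in required compute every two years. Only that rate comes from the article; the function name, baseline value, and years shown are illustrative.

```python
# Minimal sketch of the trend quoted in the ImageNet row:
# "Every two years, the compute needed to get the same performance ... has decreased tenfold."
# Only the 10x-per-2-years rate comes from the article; everything else here is illustrative.

def compute_needed(initial_compute: float, years_elapsed: float) -> float:
    """Compute required for the same performance after `years_elapsed` years,
    assuming a steady tenfold reduction every two years."""
    return initial_compute * 10 ** (-years_elapsed / 2)

if __name__ == "__main__":
    baseline = 1.0  # arbitrary compute units at year 0
    for years in (0, 2, 4, 6):
        print(f"after {years} years: {compute_needed(baseline, years):.3f}x the original compute")
```

Under that assumption, matching the year-0 result takes 0.1x the compute after two years, 0.01x after four, and 0.001x after six.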
Lech Mazur gives us the Short Story Creative Writing benchmark, where GPT-5-Thinking comes out on top. I continue not to trust the grading on writing, but it's not meaningless.