The quality of AI products and services depends heavily on the capabilities of the LLMs that power them, and as these models continue to improve, we need to regularly revisit our choices.
To keep you up to date on how new models perform and which ones lead each category, we've put together a curated list of benchmarks and datasets.
Leaderboards
LM Arena
Subjects:
Chat models, Reasoning models
Topics:
reasoning
language understanding
user preference
LM Arena is an open platform created by researchers from UC Berkeley that lets users pose questions and compare responses from two different Large Language Models without knowing which model produced which answer. The aggregate results of these head-to-head "battles" generate a ranking and leaderboard, similar to the Elo rating system used in chess.
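To make the Elo analogy concrete, here is a minimal sketch of how pairwise "battle" outcomes can be turned into ratings. The K-factor, starting ratings, and update rule are assumptions for illustration only; LM Arena's production leaderboard uses its own statistical methodology.

```python
# Minimal sketch of an Elo-style update from one head-to-head comparison.
# K and the starting ratings are assumed values, not LM Arena's.

K = 32  # update step size (assumed)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return both models' new ratings after a single battle."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + K * (s_a - e_a)
    new_b = rating_b + K * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins one battle.
print(update(1000, 1000, a_won=True))  # -> (1016.0, 984.0)
```

Aggregating many such updates over thousands of user votes is what produces the relative ranking shown on the leaderboard.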
LiveBench
Subjects:
Chat models, Reasoning models, Code models
Topics:
math
coding
reasoning
language
instruction following
data analysis
LiveBench is an evolving benchmark for Large Language Models that aims to avoid "training set leakage" (contamination) by continuously introducing fresh, recently sourced tasks. Each question is paired with a verifiable ground-truth answer, so model outputs can be scored automatically, without an external judge (human or AI).
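The sketch below illustrates what judge-free scoring against verifiable ground truth can look like. The question format and normalization are assumptions for illustration, not LiveBench's actual grading harness.

```python
# Minimal sketch of automatic scoring against verifiable ground-truth answers.
# Data format and normalization are assumed for illustration.

def normalize(answer: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences don't count."""
    return answer.strip().lower()

def score(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of questions where the model's answer matches the reference answer."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(gold)
        for qid, gold in ground_truth.items()
    )
    return correct / len(ground_truth)

ground_truth = {"q1": "42", "q2": "Paris"}
predictions = {"q1": " 42 ", "q2": "London"}
print(score(predictions, ground_truth))  # -> 0.5
```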
LiveSWEBench
Subjects:
Code models, Coding agents
Topics:
code editing
feature implementation
bug fixing
repository comprehension
LiveSWEBench is a live, continuously updated benchmark that evaluates the software engineering (SWE) capabilities of AI agents in real-world conditions. It draws tasks from active repositories and scores agents on their ability to propose correct, executable changes.
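As a rough illustration of how "correct, executable changes" can be verified, the sketch below applies an agent's proposed patch to a repository and runs the project's test suite. The paths, patch format, and test command are placeholders, not LiveSWEBench's actual evaluation harness.

```python
# Minimal sketch: apply an agent's patch, then run the repo's tests.
# Repo path, patch file, and test command are hypothetical placeholders.

import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Return True if the patch applies cleanly and the test suite passes."""
    apply_result = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply_result.returncode != 0:
        return False  # the proposed change does not even apply
    test_result = subprocess.run(
        test_cmd, cwd=repo_dir, capture_output=True, text=True,
    )
    return test_result.returncode == 0

# Example usage (hypothetical repository and patch):
# passed = evaluate_patch("/tmp/project", "agent_change.patch", ["pytest", "-q"])
```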