The quality of AI products and services depends heavily on the capabilities of the LLMs that power them, and as these models continue to improve, we need to regularly revisit our choices.
To keep you up to date on how new models perform and which ones lead each category, we've put together a curated list of benchmarks and datasets.
Leaderboards
LM Arena
Subjects:
Chat models, Reasoning models
Topics:
reasoning
language understanding
user preference
LM Arena is an open platform created by researchers from UC Berkeley that lets users pose questions and compare responses from two different Large Language Models without knowing which model produced which answer. The aggregate results of these head-to-head "battles" generate a ranking and leaderboard, similar to the Elo rating system used in chess.
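To make the Elo analogy concrete, here is a minimal sketch of how pairwise "battle" outcomes can be turned into ratings. The K-factor, starting ratings, and update rule are assumptions for illustration only; LM Arena's production leaderboard uses its own statistical methodology.

```python
# Minimal sketch of an Elo-style update from one head-to-head comparison.
# K and the starting ratings are assumed values, not LM Arena's.

K = 32  # update step size (assumed)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return both models' new ratings after a single battle."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + K * (s_a - e_a)
    new_b = rating_b + K * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins one battle.
print(update(1000, 1000, a_won=True))  # -> (1016.0, 984.0)
```

Aggregating many such updates over thousands of user votes is what produces the relative ranking shown on the leaderboard.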
LiveBench
Subjects:
Chat models, Reasoning models, Code models
Topics:
math
coding
reasoning
language
instruction following
data analysis
LiveBench is an evolving benchmark for Large Language Models that aims to avoid "training set leakage" (contamination) by continuously introducing fresh, recently sourced tasks. Each question is paired with a verifiable ground-truth answer, so model outputs can be scored automatically, without an external judge (human or AI).
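The sketch below illustrates what judge-free scoring against verifiable ground truth can look like. The question format and normalization are assumptions for illustration, not LiveBench's actual grading harness.

```python
# Minimal sketch of automatic scoring against verifiable ground-truth answers.
# Data format and normalization are assumed for illustration.

def normalize(answer: str) -> str:
    """Strip whitespace and lowercase so trivial formatting differences don't count."""
    return answer.strip().lower()

def score(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of questions where the model's answer matches the reference answer."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(gold)
        for qid, gold in ground_truth.items()
    )
    return correct / len(ground_truth)

ground_truth = {"q1": "42", "q2": "Paris"}
predictions = {"q1": " 42 ", "q2": "London"}
print(score(predictions, ground_truth))  # -> 0.5
```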
LiveSWEBench
Subjects:
Code models, Coding agents
Topics:
code editing
feature implementation
bug fixing
repository comprehension
LiveSWEBench is a live, continuously updated benchmark that evaluates the software engineering (SWE) capabilities of AI agents in real-world conditions. It draws tasks from active repositories and scores agents on their ability to propose correct, executable changes.
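As a rough illustration of how "correct, executable changes" can be verified, the sketch below applies an agent's proposed patch to a repository and runs the project's test suite. The paths, patch format, and test command are placeholders, not LiveSWEBench's actual evaluation harness.

```python
# Minimal sketch: apply an agent's patch, then run the repo's tests.
# Repo path, patch file, and test command are hypothetical placeholders.

import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Return True if the patch applies cleanly and the test suite passes."""
    apply_result = subprocess.run(
        ["git", "apply", patch_file],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if apply_result.returncode != 0:
        return False  # the proposed change does not even apply
    test_result = subprocess.run(
        test_cmd, cwd=repo_dir, capture_output=True, text=True,
    )
    return test_result.returncode == 0

# Example usage (hypothetical repository and patch):
# passed = evaluate_patch("/tmp/project", "agent_change.patch", ["pytest", "-q"])
```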