LLM-as-a-Judge

LLM-as-a-Judge is an evaluation technique in which a Large Language Model (LLM) scores, ranks, or critiques outputs produced by another model or AI system. It is used when quality criteria such as relevance, clarity, completeness, or style are difficult to express with deterministic metrics.

Common judging formats include:

assigning a score according to a rubric;
comparing two responses and selecting the better one;
checking whether an answer satisfies a list of criteria; and
producing a critique that explains a failure.
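As a minimal sketch of the first format, rubric scoring can be split into building the judge prompt and parsing the reply. The rubric text, the `Score: <n>` reply convention, and the names `RUBRIC`, `build_rubric_prompt`, and `parse_score` are assumptions made for this example; the call to the judge model itself is deliberately left out.

```python
import re
from typing import Optional

# Hypothetical rubric for illustration; real criteria depend on the task.
RUBRIC = {
    1: "Off-topic or unusable.",
    3: "Relevant but incomplete or unclear.",
    5: "Relevant, complete, and clearly written.",
}


def build_rubric_prompt(question: str, response: str) -> str:
    """Assemble a judge prompt that states the rubric and the reply format."""
    levels = "\n".join(f"{score}: {text}" for score, text in sorted(RUBRIC.items()))
    return (
        "Grade the response on a 1-5 scale using this rubric:\n"
        f"{levels}\n"
        "Explain briefly, then end with a line of the form 'Score: <n>'.\n\n"
        f"Question:\n{question}\n\nResponse:\n{response}"
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first 'Score: <n>' value in the reply, or None if it is
    missing or outside the allowed range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None
```

Returning `None` for a missing or out-of-range score lets the caller retry or discard the sample instead of silently recording an invalid grade.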

LLM judges can scale evaluation more cheaply and quickly than human review, but their decisions are not objective ground truth. Known failure modes include position bias, preference for verbose answers, self-preference, sensitivity to formatting, inconsistent scoring, and vulnerability to instructions embedded in the content being judged.
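Position bias in particular can be probed by presenting the same pair in both orders and accepting a verdict only when it survives the swap. The sketch below assumes a hypothetical `judge` callable that receives the question and the two responses in display order and returns `"first"` or `"second"`; that interface is an assumption of this example, not a standard API.

```python
from typing import Callable

# Hypothetical judge interface: (question, shown_first, shown_second)
# returns "first" or "second".
Judge = Callable[[str, str, str], str]


def order_consistent_verdict(judge: Judge, question: str,
                             resp_x: str, resp_y: str) -> str:
    """Query the judge in both display orders.

    Returns "x" or "y" when both orders agree on the same response,
    and "inconsistent" when the verdict flips with the order.
    """
    forward = judge(question, resp_x, resp_y)  # resp_x is shown first
    reverse = judge(question, resp_y, resp_x)  # resp_x is shown second
    if forward == "first" and reverse == "second":
        return "x"
    if forward == "second" and reverse == "first":
        return "y"
    return "inconsistent"
```

A judge that always prefers whichever response appears first yields `"inconsistent"` here. The check says nothing about other biases: a judge that always prefers the longer response passes it while remaining biased.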

A reliable judging pipeline uses a precise rubric, hides irrelevant metadata, randomizes pairwise answer order, separates the evaluated content from judge instructions, and calibrates results against expert human ratings. Multiple judge samples or models can be used when variance is high.
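One way to calibrate against human ratings is to measure chance-corrected agreement between judge labels and expert labels on a shared sample. The sketch below uses Cohen's kappa, which is one common choice rather than the only one; the function name and the label lists are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = human_counts.keys() | judge_counts.keys()
    p_e = sum(human_counts[k] * judge_counts[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so report full agreement by convention here.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels only, not real evaluation data.
human_labels = ["A", "A", "B", "B"]
judge_labels = ["A", "A", "B", "A"]
kappa = cohen_kappa(human_labels, judge_labels)  # 0.5 for these labels
```

Raw agreement in this toy sample is 0.75, but kappa is 0.5 because half of that agreement would be expected by chance given how often each rater uses each label.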

Deterministic verification remains preferable when correctness can be computed directly. Code should be executed against tests, structured data should be schema-checked, and mathematical answers should be verified symbolically or numerically where possible. An LLM judge is most appropriate for the residual qualitative properties that such checks cannot cover.
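The deterministic checks described above can be very small. The sketch below shows a numeric comparison and a basic structural check on JSON output using only the standard library; the helper names and the `required` field map are assumptions for illustration, and a full schema validator would be stricter than this type check.

```python
import json
import math


def check_numeric(answer: str, expected: float, rel_tol: float = 1e-6) -> bool:
    """Parse the answer as a float and compare it within a relative tolerance."""
    try:
        value = float(answer.strip())
    except ValueError:
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)


def check_json_fields(text: str, required: dict) -> bool:
    """Check that text is a JSON object containing each required key
    with a value of the expected Python type."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())
```

Only outputs that pass checks like these need to go to an LLM judge, and then only for qualities such as clarity or tone.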

In agent evaluation, a judge can assess final-answer quality or the appropriateness of a trajectory. It should not be the sole authority for security, authorization, or irreversible actions.

The research paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., 2023) established widely used methods for studying agreement between model judges and human preferences.
