Evaluation-driven Development

Evaluation-driven development in prompt engineering is akin to test-driven development in software engineering, with the key distinction that evaluations (or evals) replace unit tests. In this approach, prompt engineers iteratively refine prompts by running evaluations against completions to measure the model's performance with respect to predefined criteria. Unlike unit tests, which have fixed, deterministic outputs, evaluations in prompt engineering often focus on qualitative metrics such as accuracy, relevance, and coherence, guiding the continuous refinement of prompts.
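To make the loop concrete, here is a minimal sketch of an evaluation harness in Python. It is not tied to any particular tool: the `generate` callable and the keyword-based `check` are placeholder assumptions standing in for a real LLM client and for whatever scoring method a project actually uses (human rating, LLM-as-judge, string matching, etc.).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    input_text: str                # input substituted into the prompt template
    check: Callable[[str], float]  # scores a completion between 0.0 and 1.0


def run_evals(prompt_template: str,
              cases: List[EvalCase],
              generate: Callable[[str], str]) -> float:
    """Render the prompt for each case, collect completions, and average the scores."""
    scores = []
    for case in cases:
        prompt = prompt_template.format(input=case.input_text)
        completion = generate(prompt)            # call out to the LLM (placeholder)
        scores.append(case.check(completion))    # qualitative criterion -> numeric score
    return sum(scores) / len(scores)


# Illustrative case: a relevance-style check that looks for a required keyword.
cases = [
    EvalCase("What is the capital of France?",
             lambda completion: 1.0 if "Paris" in completion else 0.0),
]

# Stub generate function used only so the sketch runs; a real harness would call a model API.
average_score = run_evals("Answer concisely: {input}", cases,
                          generate=lambda prompt: "Paris is the capital of France.")
```

After each prompt revision, the same eval suite is re-run and the aggregate score is compared against the previous iteration, which is what drives the iterative refinement described above.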

A major challenge in evaluation-driven development, compared to test-driven development, is the inherent unpredictability and variability in Large Language Model (LLM) outputs. While unit tests in software have clear, deterministic outcomes, evaluations for LLMs often deal with subjective or probabilistic results, making it difficult to establish a definitive “pass” or “fail” criterion. Additionally, LLM behavior can change with new model versions or even slight prompt modifications, complicating the maintenance of consistent evaluation standards. Evaluations may also require human judgment, making them more time-consuming and susceptible to bias.
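One common way to cope with this variability, again sketched here under the same assumptions as the previous example rather than as a prescribed method, is to sample each prompt several times and gate changes on a pass rate threshold instead of a single binary pass/fail outcome.

```python
from typing import Callable


def pass_rate(prompt: str,
              check: Callable[[str], float],
              generate: Callable[[str], str],
              n_samples: int = 5) -> float:
    """Sample the model several times and report the fraction of completions that pass."""
    passes = sum(1 for _ in range(n_samples) if check(generate(prompt)) >= 0.5)
    return passes / n_samples


# Instead of a hard pass/fail, a prompt change might be accepted only if, say,
# pass_rate(...) >= 0.8 across the eval suite.
```

Thresholds like this trade strictness for robustness: they absorb some of the run-to-run noise in LLM outputs while still flagging genuine regressions when a prompt or model version changes.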
