An evaluation (or "eval") is a key step in developing generative AI models. It assesses the performance, quality, and effectiveness of AI-generated outputs such as text, images, or audio. Evaluation helps identify areas for improvement, checks that the model meets the desired standards, and validates its readiness for real-world use. Common methods include human ratings, automated metrics, and comparisons against ground-truth data or benchmarks. Running evals regularly is essential for maintaining the output quality of LLM-based services for end users.
An evaluation should represent the actual task distribution and define measurable success criteria before prompts or models are optimized. Deterministic checks are preferable when an answer can be verified directly; human review or LLM-as-a-Judge can help assess qualities such as relevance, tone, and completeness.
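As a rough illustration of a deterministic check with a success criterion fixed in advance, the sketch below scores a model against a small eval set using normalized exact match. Everything in it is an assumption made for the example: `EvalCase`, `run_eval`, `toy_model`, the three cases, and the 0.9 threshold are hypothetical, and `toy_model` merely stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval item: an input prompt and its directly verifiable answer."""
    prompt: str
    expected: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences pass.
    return " ".join(text.lower().split())


def exact_match(output: str, expected: str) -> bool:
    # Deterministic check: same input always gives the same verdict.
    return normalize(output) == normalize(expected)


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output matches the expected answer."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(exact_match(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; answers two of three cases.
    answers = {"2 + 2 =": "4", "Capital of France?": " paris "}
    return answers.get(prompt, "unknown")


# Assumed eval set; a real one should be sampled from the actual task distribution.
CASES = [
    EvalCase("2 + 2 =", "4"),
    EvalCase("Capital of France?", "Paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

# Success criterion chosen before any prompt or model tuning (illustrative value).
PASS_THRESHOLD = 0.9

pass_rate = run_eval(toy_model, CASES)
meets_bar = pass_rate >= PASS_THRESHOLD
print(f"pass rate: {pass_rate:.2f}, meets bar: {meets_bar}")
```

Exact match only suits answers that can be verified directly. For qualities such as tone or completeness, the `exact_match` function would be swapped for a human rating or an LLM-as-a-Judge score while the rest of the loop stays the same.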
Multi-step systems require agent evaluation, which also measures tool use, execution trajectories, cost, latency, safety, and changes to external state.
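One possible way to picture agent evaluation is to record each run as a trajectory and check several dimensions at once. The sketch below is a simplified illustration, not a standard framework: `ToolCall`, `AgentRun`, `Expectation`, `evaluate_run`, the tool names, and all budgets are hypothetical, and the recorded run is hand-written rather than produced by a real agent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    name: str
    latency_s: float
    cost_usd: float


@dataclass
class AgentRun:
    """A recorded trajectory: the answer, the tools used, and side effects."""
    final_answer: str
    tool_calls: list[ToolCall]
    state_changes: set[str] = field(default_factory=set)


@dataclass(frozen=True)
class Expectation:
    answer: str
    required_tools: tuple[str, ...]        # must appear in this order
    allowed_state_changes: frozenset[str]  # any other change counts as unsafe
    max_cost_usd: float
    max_latency_s: float


def is_subsequence(needle: tuple[str, ...], haystack: list[str]) -> bool:
    # True if needle's items occur in haystack in order (gaps allowed).
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


def evaluate_run(run: AgentRun, exp: Expectation) -> dict[str, bool]:
    """Score one run on outcome, trajectory, cost, latency, and state safety."""
    names = [call.name for call in run.tool_calls]
    total_cost = sum(call.cost_usd for call in run.tool_calls)
    total_latency = sum(call.latency_s for call in run.tool_calls)
    return {
        "answer_correct": run.final_answer.strip() == exp.answer,
        "trajectory_ok": is_subsequence(exp.required_tools, names),
        "within_cost": total_cost <= exp.max_cost_usd,
        "within_latency": total_latency <= exp.max_latency_s,
        "state_safe": run.state_changes <= exp.allowed_state_changes,
    }


# Hand-written example run: the answer is right, but the agent also sent
# an email that the task did not authorize.
run = AgentRun(
    final_answer="Refund issued for order 17.",
    tool_calls=[
        ToolCall("search_orders", latency_s=0.4, cost_usd=0.001),
        ToolCall("issue_refund", latency_s=0.9, cost_usd=0.002),
    ],
    state_changes={"refund:order-17", "email:customer-3"},
)
expectation = Expectation(
    answer="Refund issued for order 17.",
    required_tools=("search_orders", "issue_refund"),
    allowed_state_changes=frozenset({"refund:order-17"}),
    max_cost_usd=0.01,
    max_latency_s=5.0,
)

report = evaluate_run(run, expectation)
print(report)
```

In this example the final answer and the tool sequence both pass, yet the run still fails on `state_safe`. That is the kind of problem an output-only eval would miss, which is why agent evaluation looks at the trajectory and at changes to external state as well.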
The LLM Knowledge Base is a collection of bite-sized explanations for commonly used terms and abbreviations related to Large Language Models and Generative AI.
It's an educational resource that helps you stay up-to-date with the latest developments in AI research and its applications.