Reranking is a second-stage retrieval process that reorders an initial set of search results using a more accurate but more computationally expensive relevance model. The goal is to improve which documents are ultimately supplied to a user or a Large Language Model (LLM), and in what order.
The first-stage retriever prioritizes recall and speed. It may use keyword search, vector search, or hybrid search to select tens or hundreds of candidates from a large corpus. The reranker then evaluates those candidates more deeply and returns a smaller ordered list.
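The retrieve-then-rerank flow can be sketched in a few lines. This is a toy illustration, not a production design. The first stage ranks documents by shared-term count, standing in for keyword or vector search. `toy_pair_scorer` is a hypothetical stand-in for an expensive reranker and counts how many query bigrams appear in each document. The point is the structure: the cheap stage narrows the corpus to `n` candidates, and the expensive scorer runs only on those before returning the top `k`.

```python
from typing import Callable, List, Tuple


def first_stage(query: str, corpus: List[str], n: int) -> List[int]:
    """Cheap, recall-oriented stage: rank by shared-term count (toy stand-in for keyword or vector search)."""
    q_terms = set(query.lower().split())
    scores = [len(q_terms & set(doc.lower().split())) for doc in corpus]
    # Highest score first; Python's sort is stable, so ties keep corpus order.
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


def toy_pair_scorer(query: str, doc: str) -> float:
    """Hypothetical stand-in for an expensive reranker: share of query bigrams found in the document."""
    q, d = query.lower().split(), doc.lower().split()
    q_bigrams, d_bigrams = set(zip(q, q[1:])), set(zip(d, d[1:]))
    return len(q_bigrams & d_bigrams) / len(q_bigrams) if q_bigrams else 0.0


def rerank(query: str, corpus: List[str], candidates: List[int],
           scorer: Callable[[str, str], float], k: int) -> List[Tuple[int, float]]:
    """Second stage: score only the candidates with the expensive scorer and keep the top k."""
    scored = [(i, scorer(query, corpus[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


corpus = [
    "reranking with an encoder and a cross check",
    "a cross encoder scores each pair",
    "the weather is sunny",
    "cross encoder reranking explained",
]
query = "cross encoder reranking"
candidates = first_stage(query, corpus, n=3)
top = rerank(query, corpus, candidates, toy_pair_scorer, k=2)
print(candidates)  # [0, 3, 1]
print(top)         # [(3, 1.0), (1, 0.5)]
```

Document 0 ties for the best first-stage score because it contains all three query words. The reranker drops it because none of the query's word pairs appear in it, so document 3 moves to the top.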
Common reranking models include:
- cross-encoders that jointly process the query and each document;
- late-interaction models that compare token-level representations;
- LLMs prompted to score or compare passages; and
- domain-specific learned ranking functions.
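For the late-interaction category, a minimal sketch of the MaxSim-style scoring used by ColBERT-like models is shown below. It assumes each query and document has already been encoded into one vector per token. The 2-D vectors are hand-picked toy values rather than output from any real model, and all function names are illustrative. Each query token keeps only its best-matching document token, and those maxima are summed.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: best document-token match per query token, summed."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-D token embeddings chosen by hand; a real model would produce these.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]  # has a close match for both query tokens
doc_b = [[1.0, 0.0], [0.9, 0.1]]              # matches only the first query token well
print(maxsim_score(query_tokens, doc_a))  # 2.0
print(maxsim_score(query_tokens, doc_b) < maxsim_score(query_tokens, doc_a))  # True
```

Document embeddings can be computed once at indexing time because query and document are encoded separately. Only the cheap token-level comparison runs at query time, which is the usual reason to choose late interaction over a cross-encoder.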
Cross-encoders are often more precise than embedding similarity because they model direct interactions between query and document tokens. Their cost scales with the number and length of candidates, so candidate limits and truncation policies matter.
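The cost argument can be made concrete with simple arithmetic. In this sketch, `prepare_pairs`, `max_candidates`, and `max_doc_tokens` are hypothetical names. Whitespace-separated words stand in for model tokens, which is only an approximation because real models count subword tokens and have their own maximum input length. Since a cross-encoder reads the query and the document together, every pair pays for both.

```python
from typing import List, Tuple


def prepare_pairs(query: str, candidates: List[str], max_candidates: int,
                  max_doc_tokens: int) -> Tuple[List[Tuple[str, str]], int]:
    """Cap the candidate list, truncate each document, and estimate tokens the reranker must process.

    Whitespace-separated words stand in for model tokens (an approximation).
    """
    q_tokens = query.split()
    pairs: List[Tuple[str, str]] = []
    total_tokens = 0
    for doc in candidates[:max_candidates]:
        d_tokens = doc.split()[:max_doc_tokens]
        pairs.append((query, " ".join(d_tokens)))
        # Each (query, document) pair is processed jointly, so it costs both lengths.
        total_tokens += len(q_tokens) + len(d_tokens)
    return pairs, total_tokens


query = "what is reranking"
candidates = [" ".join(["word"] * 300)] * 100  # 100 candidates of 300 words each

_, full_cost = prepare_pairs(query, candidates, max_candidates=100, max_doc_tokens=300)
pairs, capped_cost = prepare_pairs(query, candidates, max_candidates=20, max_doc_tokens=128)
print(full_cost, capped_cost)  # 30300 2620
```

In this toy setting, capping at 20 candidates and truncating to 128 words cuts the work by more than a factor of ten. The trade-off is that truncation can discard the passage text that actually answers the query.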
In Retrieval-Augmented Generation (RAG), reranking can reduce irrelevant context and improve evidence ordering. It does not verify that a document is factually correct, trustworthy, current, or safe to follow.
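One way a RAG system might use reranker scores when assembling context is sketched below. The minimum-score cutoff, the word budget, and the name `build_context` are all assumptions for illustration, and word count is used as a rough token proxy. As the paragraph above notes, a high score reflects estimated relevance only. Nothing in this step checks whether a passage is correct or safe.

```python
from typing import List, Tuple


def build_context(reranked: List[Tuple[str, float]], min_score: float,
                  token_budget: int) -> List[str]:
    """Keep high-scoring passages, best first, until a word budget is used up.

    Assumes `reranked` is sorted by score, highest first. Scores reflect
    relevance only; nothing here checks whether a passage is correct or safe.
    """
    selected: List[str] = []
    used = 0
    for text, score in reranked:
        if score < min_score:
            break  # the rest of the list scores even lower
        cost = len(text.split())  # word count as a rough token proxy
        if used + cost > token_budget:
            continue  # too long for the remaining budget; a shorter passage may still fit
        selected.append(text)
        used += cost
    return selected


reranked = [
    ("alpha beta gamma", 0.9),
    ("delta epsilon zeta eta theta", 0.8),
    ("iota kappa", 0.7),
    ("lambda", 0.1),
]
print(build_context(reranked, min_score=0.5, token_budget=6))
# ['alpha beta gamma', 'iota kappa']
```

The selected passages keep their reranked order, so the strongest evidence appears first in the prompt.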
Reranking quality should be measured with retrieval metrics such as normalized Discounted Cumulative Gain (nDCG), Mean Reciprocal Rank (MRR), and recall at K, as well as with downstream answer accuracy. Offline relevance gains should be weighed against the latency and cost the reranker adds.
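The three retrieval metrics can be computed from a ranked list and relevance judgments. The sketch below uses one common convention for nDCG: a linear gain divided by log2(position + 1), with 1-based positions. Another widely used variant applies 2^rel - 1 as the gain, so values differ between implementations. The document IDs and judgments are hypothetical. Downstream answer accuracy requires an end-to-end evaluation and is not covered here.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: the gain at 0-based position i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranking: List[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by DCG of the ideal ordering of the judged documents."""
    gains = [relevance.get(doc, 0.0) for doc in ranking]
    ideal = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / (1-based rank of the first relevant document), or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: List[List[str]], relevant_sets: List[Set[str]]) -> float:
    """Average reciprocal rank over a set of queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_sets)) / len(rankings)


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    return len(set(ranking[:k]) & relevant) / len(relevant) if relevant else 0.0


ranking = ["d3", "d1", "d2"]        # reranker output for one query
relevance = {"d1": 1.0, "d2": 1.0}  # hypothetical binary judgments
print(reciprocal_rank(ranking, set(relevance)))   # 0.5
print(recall_at_k(ranking, set(relevance), k=2))  # 0.5
print(mean_reciprocal_rank([ranking, ["d1", "d4"]], [set(relevance), {"d1"}]))  # 0.75
print(round(ndcg_at_k(ranking, relevance, k=3), 3))  # 0.693
```

To check a reranker offline, compute these metrics on the first-stage order and again on the reranked order for the same judged queries, then compare the difference against the added latency.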
Google Cloud's Ranking API documentation describes semantic reranking of previously retrieved documents.
The LLM Knowledge Base is a collection of bite-sized explanations for commonly used terms and abbreviations related to Large Language Models and Generative AI.
It's an educational resource that helps you stay up to date with the latest developments in AI research and its applications.