Inference-Time Scaling (Test-Time Compute)

Inference-time scaling, also called test-time compute scaling, is the practice of allocating additional computation to a model after training in order to improve the quality of its answer. Instead of relying on a single forward generation, the system may let a reasoning model reason for longer, generate multiple candidates, search over possible solutions, or verify intermediate results.

This differs from training-time scaling, where capability is increased by using more parameters, data, or training compute. Inference-time scaling changes the amount of work performed for each request and can therefore be adjusted dynamically according to task difficulty, latency requirements, and cost.

Common approaches include:

producing several candidate answers and selecting one with a verifier;
iterative self-correction or critique;
tree or graph search over reasoning steps;
executing code or tests to validate candidates;
increasing a model's reasoning-effort budget; and
using tool calls to gather evidence before answering.
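The first approach in the list, sampling several candidates and selecting one with a verifier, can be sketched as follows. This is a minimal illustration rather than a reference implementation: `best_of_n`, `make_toy_generator`, and `toy_verifier` are hypothetical names, and the generator and verifier are toy stand-ins for a real model call and a real scoring function.

```python
import itertools
from typing import Callable, Iterable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    # On ties, max() keeps the earliest candidate.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def make_toy_generator(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: cycles through fixed answers."""
    cycle = itertools.cycle(answers)
    return lambda prompt: next(cycle)


def toy_verifier(prompt: str, candidate: str) -> float:
    """Hypothetical verifier: checks a sum written as 'a + b'."""
    left, right = prompt.split("+")
    try:
        return 1.0 if int(candidate) == int(left) + int(right) else 0.0
    except ValueError:
        return 0.0  # non-numeric candidates get the lowest score


if __name__ == "__main__":
    gen = make_toy_generator(["41", "42", "43"])
    print(best_of_n("17 + 25", gen, toy_verifier, n=3))
```

Raising `n` is the compute knob here: each extra sample costs one more generation and one more verifier call, and the result is only as good as the verifier that ranks the candidates.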

More compute does not automatically produce a better result. Repeated samples can share the same misconception, weak verifiers can select confidently incorrect answers, and long reasoning traces increase latency and token cost. Effective systems allocate compute selectively rather than applying the same budget to every request.
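One simple way to allocate compute selectively is to keep sampling only while the answers disagree. The sketch below assumes a hypothetical `adaptive_vote` helper with illustrative thresholds; it stops early once one answer holds a large enough share of the votes and otherwise spends the full budget.

```python
import itertools
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(
    prompt: str,
    generate: Callable[[str], str],
    min_samples: int = 3,
    max_samples: int = 9,
    agreement: float = 0.8,
) -> Tuple[str, int]:
    """Return (answer, samples_used), stopping early when samples agree."""
    if not 1 <= min_samples <= max_samples:
        raise ValueError("require 1 <= min_samples <= max_samples")
    counts: Counter = Counter()
    answer = ""
    for used in range(1, max_samples + 1):
        counts[generate(prompt)] += 1
        answer, votes = counts.most_common(1)[0]
        # Stop as soon as the leading answer has a large enough share.
        if used >= min_samples and votes / used >= agreement:
            return answer, used
    # Budget exhausted: fall back to the current plurality answer.
    return answer, max_samples


if __name__ == "__main__":
    # Toy stand-ins for a model: one always agrees, one is inconsistent.
    easy = lambda prompt: "42"
    noisy_cycle = itertools.cycle(["41", "42", "42", "43", "42", "42"])
    hard = lambda prompt: next(noisy_cycle)

    print(adaptive_vote("easy question", easy))  # stops at min_samples
    print(adaptive_vote("hard question", hard))  # uses the full budget
```

Agreement is only a proxy for correctness: if every sample shares the same misconception, this loop stops early on a wrong answer, which is why a separate verifier or test execution is often combined with voting.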

Inference-time scaling is especially relevant to AI agents, which can spend additional steps planning, testing actions, and recovering from errors. Its benefit should be measured through agent evaluation that tracks both end-to-end accuracy and resource usage, so that any gain in quality is weighed against the extra latency and cost.

The paper Scaling LLM Test-Time Compute Optimally shows that the most effective strategy depends on problem difficulty and that adaptively allocated test-time compute can outperform substantially larger models on suitable tasks.
