Agent Guardrail

An agent guardrail is a control that checks, restricts, or interrupts the behavior of an AI agent. Guardrails reduce the chance that model-generated decisions lead to unsafe, unauthorized, invalid, or out-of-policy outcomes.

Guardrails can operate at several boundaries:

  • Input guardrails inspect user requests before the agent begins work.
  • Output guardrails validate the final response before it reaches the user.
  • Tool guardrails approve, reject, or modify individual tool calls.
  • Runtime guardrails enforce limits on steps, cost, time, data access, or external side effects.
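As an illustration of the last item, a runtime guardrail can be a small budget object that the agent loop charges once per step. This is a minimal sketch; the names RunBudget, BudgetExceeded, and charge are hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when an agent run exceeds a configured limit."""


@dataclass
class RunBudget:
    # Hypothetical limits chosen by the application, not by the model.
    max_steps: int
    max_cost_usd: float
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one agent step and stop the run if a limit is exceeded."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit of {self.max_cost_usd} USD exceeded")
```

The agent loop would call charge() before or after each model or tool call and treat BudgetExceeded as a hard stop, so the limit holds regardless of what the model decides to do next.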

A guardrail may use deterministic rules, schema validation, classifiers, another Large Language Model (LLM), human approval, or a combination of these methods. Deterministic checks are preferable for properties that can be expressed exactly, such as permission scopes, numeric limits, file paths, or allowed API operations.
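The following sketch shows what a deterministic check for allowed operations and file paths might look like. The operation names, the workspace directory, and the function check_tool_call are assumptions made for illustration; a real deployment would take them from its own tool definitions.

```python
from pathlib import Path

# Hypothetical allowlist and workspace root for this example.
ALLOWED_OPERATIONS = {"read_file", "list_dir"}
WORKSPACE = Path("/srv/agent-workspace").resolve()


def check_tool_call(operation: str, path: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a file tool call proposed by the agent."""
    if operation not in ALLOWED_OPERATIONS:
        return False, f"operation {operation!r} is not allowed"

    # Resolve ".." segments and symlinks before comparing, so that a path
    # such as "../../etc/passwd" cannot escape the workspace.
    resolved = (WORKSPACE / path).resolve()
    if not resolved.is_relative_to(WORKSPACE):  # requires Python 3.9+
        return False, f"path {path!r} is outside the workspace"

    return True, "ok"
```

Because the check is exact, it gives the same answer for the same input every time, which is harder to guarantee when a classifier or a second model makes the decision.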

Guardrails are not equivalent to instructions in a system message. Instructions influence model behavior, while an independently enforced guardrail can block an action even when the model ignores or misinterprets those instructions.

Important design properties include fail-closed behavior for high-impact operations, clear error handling, auditability, and low false-positive rates. Guardrails should run close to the resource they protect; a database permission check is more reliable than asking a model whether a query appears safe.
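Fail-closed behavior can be sketched as a wrapper that treats a crashing guardrail as a denial for high-impact operations and logs the decision for later audit. The operation names and the function is_allowed are hypothetical, and allowing low-impact operations when the check fails is one possible policy, not a recommendation from the text above.

```python
import logging
from typing import Callable

log = logging.getLogger("guardrail")

# Hypothetical set of operations considered high-impact.
HIGH_IMPACT = {"send_payment", "delete_records"}


def is_allowed(operation: str, check: Callable[[str], bool]) -> bool:
    """Run a guardrail check and decide what to do if the check itself fails."""
    try:
        allowed = bool(check(operation))
    except Exception:
        # The guardrail could not reach a verdict. Deny high-impact
        # operations (fail closed); allow the rest (assumed policy).
        allowed = operation not in HIGH_IMPACT
        log.warning("guardrail error for %s, allowed=%s", operation, allowed)
        return allowed

    log.info("guardrail verdict for %s, allowed=%s", operation, allowed)
    return allowed
```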

No guardrail provides complete protection against prompt injection attacks or model error. Layered controls, scoped credentials, sandboxing, human confirmation, and agent tracing remain necessary.

The OpenAI Agents SDK guardrails documentation describes input, output, and tool-level guardrails with tripwire behavior.

Promptmetheus © 2023-present