A Computer-Using Agent (CUA) is an AI agent that interacts with software through a graphical user interface. It observes screenshots or accessibility data and generates actions such as mouse movements, clicks, scrolling, and keyboard input.
Computer use provides a general interface for systems that lack an API. A CUA can operate websites, desktop applications, remote virtual machines, and legacy enterprise software using many of the same controls as a human user. This makes it useful for browser automation, quality assurance, data entry, research, and cross-application workflows.
A typical computer-use loop consists of:
- capturing the current screen and relevant environment metadata;
- sending the observation and task state to a multimodal or reasoning model;
- receiving a proposed interface action;
- validating and executing the action in an isolated environment; and
- returning a new observation until the task succeeds, fails, or reaches a limit.
CUAs differ from traditional Robotic Process Automation (RPA). RPA usually relies on predefined selectors, rules, or recorded sequences. A computer-using agent interprets visual and linguistic context dynamically, which makes it more adaptable but also less deterministic.
Graphical interfaces are an error-prone action space. Layout changes, pop-ups, ambiguous controls, loading states, and visual similarity can cause incorrect actions. Reliable systems use restricted environments, action allowlists, timeouts, checkpoints, and explicit confirmation before purchases, messages, deletions, permission changes, or other consequential operations.
Security is particularly important because untrusted content displayed on screen can contain instructions intended to manipulate the agent. This is a form of indirect prompt injection attack. The agent should distinguish task instructions from interface content, and sensitive data should be minimized or redacted from observations.
Computer use is often combined with tool calling. APIs remain preferable when they provide reliable, structured access; the graphical interface is best treated as a fallback or complementary tool. OpenAI describes this capability in its computer use documentation.
The LLM Knowledge Base is a collection of bite-sized explanations for commonly used terms and abbreviations related to Large Language Models and Generative AI.
It's an educational resource that helps you stay up-to-date with the latest developments in AI research and its applications.