A Vision-Language Model (VLM) is a multimodal model that jointly processes visual information and natural language. It can connect images or video with text to perform tasks such as visual question answering, captioning, document understanding, image retrieval, and instruction following in visual environments.
A typical VLM includes:
- a vision encoder that transforms pixels or image regions into representations;
- a projection, adapter, or cross-attention mechanism that connects visual representations to language tokens; and
- a language model that interprets the combined sequence and generates text or actions.
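The three components above can be illustrated with a deliberately tiny, dependency-free sketch. It is not how any particular VLM is implemented: the `encode_patches` and `project` functions, the `[mean, max]` patch features, the weight values, and the text embeddings are all hypothetical stand-ins chosen only to show how visual features are mapped to the language model's embedding width and joined with text tokens into one sequence.

```python
from typing import List

Vector = List[float]


def encode_patches(image: List[List[float]], patch: int) -> List[Vector]:
    """Toy stand-in for a vision encoder: one [mean, max] feature per tile."""
    features = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            tile = [
                image[i][j]
                for i in range(r, min(r + patch, len(image)))
                for j in range(c, min(c + patch, len(image[0])))
            ]
            features.append([sum(tile) / len(tile), max(tile)])
    return features


def project(features: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Linear projection from visual feature width to text embedding width.

    `weights` has shape (feature_width, embedding_width).
    """
    out_width = len(weights[0])
    return [
        [sum(f[k] * weights[k][d] for k in range(len(f))) for d in range(out_width)]
        for f in features
    ]


# Hypothetical inputs: a 4x4 grayscale image and two already-embedded text tokens.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [2.0, 2.0, 3.0, 3.0],
    [2.0, 2.0, 3.0, 3.0],
]
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # 2 visual features -> 3 embedding dims
text_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Vision encoder -> projection -> combined sequence handed to the language model.
visual_tokens = project(encode_patches(image, patch=2), weights)
sequence = visual_tokens + text_tokens
```

Here the four image patches become four visual tokens with the same width as the text embeddings, so a language model could attend over all six positions. A real system would use a learned encoder and projection; some architectures use cross-attention instead of concatenating tokens.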
Some VLMs are trained end to end, while others connect pretrained vision and language components. Training data may contain image-caption pairs, interleaved webpages, videos with transcripts, visual instructions, or synthetic data.
The term VLM is narrower than "multimodal model," which can also cover audio, speech, or other data types. A VLM is also distinct from a pure image generator: it primarily learns relationships between visual inputs and language, although a single system may support both understanding and generation.
VLMs are central to computer-using agents and robotics because they can interpret screenshots, diagrams, interfaces, and physical scenes. Visual grounding remains imperfect, however: a model may misread small text, spatial relationships, counts, or occluded objects and still produce a fluent explanation.
Evaluation should separate perception errors from reasoning errors and test robustness across resolution, layout, image quality, and domain shift. Sensitive images also require appropriate privacy and retention controls.
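One way to keep perception and reasoning errors apart is to score each test item twice and tally the results per test condition. The sketch below assumes a hypothetical evaluation setup in which every item has a separate perception probe (for example, asking the model to transcribe the relevant text) in addition to the final answer check. The `EvalRecord` fields, the condition labels, and the sample records are illustrative, not taken from any benchmark.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    condition: str       # e.g. a resolution or layout bucket (hypothetical labels)
    perception_ok: bool  # did the model read or locate the visual content correctly?
    answer_ok: bool      # was the final answer correct?


def classify(record: EvalRecord) -> str:
    """Attribute a failure to perception first, then to reasoning."""
    if record.answer_ok:
        return "correct"
    if not record.perception_ok:
        return "perception_error"
    return "reasoning_error"


def summarize(records: List[EvalRecord]) -> Dict[str, Counter]:
    """Count outcomes separately for each test condition."""
    summary: Dict[str, Counter] = defaultdict(Counter)
    for record in records:
        summary[record.condition][classify(record)] += 1
    return summary


# Made-up records purely to exercise the functions.
records = [
    EvalRecord("high_res", perception_ok=True, answer_ok=True),
    EvalRecord("high_res", perception_ok=True, answer_ok=False),
    EvalRecord("low_res", perception_ok=False, answer_ok=False),
    EvalRecord("low_res", perception_ok=False, answer_ok=False),
    EvalRecord("low_res", perception_ok=True, answer_ok=True),
]
summary = summarize(records)
```

Breaking the counts down by condition makes it visible when failures shift from reasoning to perception as resolution or image quality degrades. Note that this simple rule counts an item as correct even when the perception probe failed, so answers reached by luck are not flagged.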
The Flamingo paper describes an influential architecture that connects pretrained vision and language models and handles interleaved image, video, and text inputs.