Multimodal Models: When AI Sees, Hears, and Reads at Once

Your e-commerce platform has a problem. Users upload photos of products they want to find - a screenshot from Instagram, a photo of a dress they saw on the street, a picture of a broken part they need to replace. Your text-based search is useless here. You could build a separate image classification pipeline, map it to product categories, then search by category. But that loses all the nuance - the specific shade of blue, the style of stitching, the exact model of the component.

Then you try a multimodal model. You feed it the image and ask “find similar products.” It returns results that match not just the category but the visual style, because it understands images and text in the same representational space. The image of a mid-century modern lamp matches products described as “retro brass desk lamp” even though no one tagged it that way.

Multimodal models are not just “vision added to a chatbot.” They change how AI systems process and relate information across formats: instead of converting everything to text first, the model reasons over the image, audio, or video alongside the text.

What multimodal actually means

A multimodal model processes multiple types of input (modalities) - text, images, audio, video - through a unified architecture that can reason across them simultaneously. The key distinction from multi-model pipelines: the model does not process each modality in isolation and then combine results. It builds joint representations where a pixel, a word, and a sound wave can all relate to each other in the same latent space.

Current modalities in production models:

Text - the foundation modality, always present
Images - static visuals, screenshots, diagrams, photos
Audio - speech, music, environmental sounds
Video - temporal sequences of frames with optional audio
Structured data - tables, code, mathematical notation (treated as specialized text)

graph TD
  subgraph inputs["Input Modalities"]
      IMG["Image<br/>(pixels)"]
      TXT["Text<br/>(tokens)"]
      AUD["Audio<br/>(waveform)"]
  end
  subgraph encoders["Modality Encoders"]
      VE["Vision Encoder<br/>(ViT)"]
      TE["Text Encoder<br/>(Tokenizer + Embedding)"]
      AE["Audio Encoder<br/>(Whisper-style)"]
  end
  subgraph unified["Unified Transformer"]
      UT["Joint Attention<br/>Across All Modalities"]
  end
  subgraph output["Output"]
      OUT["Text Response<br/>(understands all inputs)"]
  end

  IMG --> VE
  TXT --> TE
  AUD --> AE
  VE --> UT
  TE --> UT
  AE --> UT
  UT --> OUT

  style IMG fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style TXT fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style AUD fill:#FAEEDA,stroke:#854F0B,color:#633806
  style UT fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style OUT fill:#E1F5EE,stroke:#0F6E56,color:#085041

How multimodal models work

Architecture approach 1: Vision encoder + LLM fusion

The most common architecture for image+text models. Open models such as LLaVA document it in detail; GPT-4V and Claude 3 are generally assumed to work along similar lines, though their internals are unpublished:

The image passes through a vision encoder (typically a Vision Transformer, or ViT) that converts it into a sequence of “visual tokens” - patch embeddings that represent different regions of the image
Visual tokens are projected into the same dimensional space as text tokens through a learned projection layer
Both token sequences are concatenated and processed by the main transformer, which can attend across both visual and text tokens
The model generates text output conditioned on both the image understanding and the text prompt
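
The four steps above can be sketched with toy dimensions. This is a minimal pure-Python illustration, not any vendor's implementation: the sizes (`PATCH_DIM`, `MODEL_DIM`), the random projection weights, and the helper names `project` and `fuse` are all assumptions made for the example. In a real model the projection is learned and the patch embeddings come from a ViT.

```python
import random

random.seed(0)

PATCH_DIM = 8   # assumed width of each vision-encoder output vector
MODEL_DIM = 4   # assumed embedding width of the language model

# Stand-in for a learned projection layer: a PATCH_DIM x MODEL_DIM matrix.
# Real weights are trained; random values are used here only to show shapes.
W_PROJ = [[random.uniform(-1, 1) for _ in range(MODEL_DIM)]
          for _ in range(PATCH_DIM)]


def project(patch_embedding):
    """Map one vision-encoder vector into the LLM's embedding space."""
    return [
        sum(patch_embedding[i] * W_PROJ[i][j] for i in range(PATCH_DIM))
        for j in range(MODEL_DIM)
    ]


def fuse(patch_embeddings, text_embeddings):
    """Project visual tokens, then concatenate them ahead of the text tokens."""
    visual_tokens = [project(p) for p in patch_embeddings]
    return visual_tokens + text_embeddings


# 6 fake patch embeddings (encoder output) and 3 fake text token embeddings.
patches = [[random.random() for _ in range(PATCH_DIM)] for _ in range(6)]
text = [[random.random() for _ in range(MODEL_DIM)] for _ in range(3)]

sequence = fuse(patches, text)
print(len(sequence), len(sequence[0]))  # 9 tokens, each MODEL_DIM wide
```

After fusion the transformer sees one sequence of same-width vectors, which is what lets attention run across visual and text tokens alike.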

A single high-resolution image might become 1,000-2,000 visual tokens. This is why image inputs are expensive - they consume significant context window space.
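
The token count follows from simple arithmetic when the encoder uses fixed-size square patches and emits one token per patch. The sketch below assumes exactly that scheme; the 14-pixel default patch size is an assumption borrowed from common ViT configurations, and real APIs often resize, tile, or pool patches, so their billed counts will differ.

```python
import math


def visual_token_count(width, height, patch_size=14):
    """Tokens produced if every patch_size x patch_size patch becomes one token.

    Partial patches at the edges count as full patches (ceil), which mirrors
    padding the image up to a multiple of patch_size.
    """
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return cols * rows


print(visual_token_count(224, 224))  # 16 * 16 = 256
print(visual_token_count(448, 448))  # 32 * 32 = 1024
print(visual_token_count(630, 448))  # 45 * 32 = 1440
```

Doubling both dimensions quadruples the token count, which is why resolution is the main lever on image cost.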

Architecture approach 2: Unified tokenization

Models described as natively multimodal, such as Gemini, take a different approach: they bring all modalities into a shared token space from the start. In the fully discrete version of this design, images become tokens through a learned visual tokenizer (like VQ-VAE) and audio becomes tokens through a learned audio codec. Everything is just tokens to the transformer.

The advantage: truly native multimodal reasoning with no adapter layer. The disadvantage: training complexity and the loss of information during tokenization.
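
The core of a VQ-style visual tokenizer is a nearest-neighbor lookup against a learned codebook. The sketch below shows only that lookup plus the shift into a shared vocabulary; the tiny codebook, `TEXT_VOCAB_SIZE`, and the function names are hypothetical, and a real tokenizer learns its codebook jointly with an encoder and decoder.

```python
TEXT_VOCAB_SIZE = 1000  # assumed size of the text vocabulary

# Assumed 4-entry codebook of 2-D vectors; real codebooks hold thousands of
# higher-dimensional entries learned during training.
CODEBOOK = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patch_vector):
    """Return the index of the codebook entry closest to patch_vector."""
    return min(range(len(CODEBOOK)),
               key=lambda i: squared_distance(patch_vector, CODEBOOK[i]))


def to_shared_token_ids(patch_vectors):
    """Quantize patches and offset the ids so they sit after the text ids."""
    return [TEXT_VOCAB_SIZE + quantize(v) for v in patch_vectors]


patch_vectors = [[0.1, 0.2], [0.9, 0.1], [0.8, 0.95]]
print(to_shared_token_ids(patch_vectors))  # [1000, 1001, 1003]
```

Snapping `[0.8, 0.95]` to the codebook entry `[1.0, 1.0]` is a small instance of the information loss mentioned above: whatever distinguished the patch from its nearest code is discarded.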

graph LR
  subgraph fusion["Approach 1: Encoder Fusion"]
      F1["ViT Encoder"] --> F2["Projection Layer"]
      F2 --> F3["Concatenate with text tokens"]
      F3 --> F4["Transformer LLM"]
  end
  subgraph unified["Approach 2: Unified Tokenization"]
      U1["Visual Tokenizer<br/>(VQ-VAE)"]
      U2["Text Tokenizer<br/>(BPE)"]
      U3["Audio Tokenizer<br/>(Codec)"]
      U1 --> U4["Single Token Sequence"]
      U2 --> U4
      U3 --> U4
      U4 --> U5["Transformer"]
  end

  style F1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style F4 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style U4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style U5 fill:#E1F5EE,stroke:#0F6E56,color:#085041

CLIP: the embedding bridge

OpenAI’s CLIP (Contrastive Language-Image Pre-training) learns a shared embedding space in which an image and a text description of that image map to nearby vectors. It is not a generative model - it is an embedding model that captures cross-modal relationships. CLIP powers many downstream systems: image search, zero-shot classification, and the vision encoder inside larger multimodal models.
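
Zero-shot classification with a CLIP-style model reduces to comparing one image vector against the vectors of candidate text labels. The sketch below uses hand-written 3-D vectors in place of real encoder outputs, so the numbers are illustrative assumptions only; the cosine-similarity comparison is the part that carries over.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Pick the text label whose embedding is closest to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(image_embedding,
                                            label_embeddings[label]),
    )


# Made-up embeddings standing in for image-encoder and text-encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a lamp": [0.8, 0.2, 0.1],
    "a photo of a dress": [0.1, 0.9, 0.3],
    "a photo of a car part": [0.2, 0.1, 0.9],
}

print(zero_shot_classify(image_embedding, label_embeddings))
# a photo of a lamp
```

No lamp classifier was trained here; new categories are added by writing new label strings and embedding them.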

Where multimodal breaks or gets interesting

Spatial reasoning is still weak

Current multimodal models struggle with precise spatial relationships. “What is to the left of the red object?” or “Count the items on the shelf” often produce errors. The vision encoder compresses spatial information into patch embeddings that lose fine-grained positional detail. Models are improving rapidly here, but do not trust them for pixel-precise tasks.

Hallucinating visual content

Just as text LLMs hallucinate facts, multimodal models hallucinate visual content. They might describe objects that are not in the image, misread text in screenshots, or infer details that are not visible. This is especially dangerous for document processing - the model might “read” a number as 156 when the image shows 165.

Token cost of images

A single high-res image can consume 2,000+ tokens of context window. If your application processes many images per request (document pages, product galleries, video frames), you can exhaust your context window quickly. Strategies: resize images to the minimum resolution the task needs, split high-res images into tiles and send only the tiles that matter, or process images in separate calls and aggregate the results.
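
Downscaling and tiling are plain geometry, so their effect can be estimated before any pixels are touched. The sketch below computes target dimensions and a tile count; `max_side=1024` and `tile=512` are assumed limits for illustration, not values from any particular API.

```python
import math


def fit_within(width, height, max_side=1024):
    """Scale (width, height) down so the longer side is at most max_side.

    Aspect ratio is preserved; images already small enough are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def tile_count(width, height, tile=512):
    """Number of tile x tile crops needed to cover the image."""
    return math.ceil(width / tile) * math.ceil(height / tile)


w, h = fit_within(4000, 3000)
print(w, h)              # 1024 768
print(tile_count(w, h))  # 2 * 2 = 4
```

An actual resize would be done with an imaging library; the point here is only that the target size and tile count can be budgeted ahead of the call.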

Video is just expensive frames

Most “video understanding” in current models works by sampling frames at intervals (e.g., 1 frame per second) and processing them as a sequence of images. A 60-second video at 1fps becomes 60 images, each consuming hundreds to thousands of tokens depending on the model and resolution. True temporal understanding (motion, causality, timing) is limited.
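
The cost of frame sampling is easy to estimate up front. The sketch below assumes a fixed token cost per frame; `tokens_per_frame=1500` is a placeholder inside the 1,000-2,000 range quoted earlier for a single image, and the real figure depends on the model and the frame resolution.

```python
def video_token_estimate(duration_seconds, fps=1.0, tokens_per_frame=1500):
    """Estimate context tokens consumed by sampling a video at a fixed rate."""
    frames = int(duration_seconds * fps)
    return frames, frames * tokens_per_frame


frames, tokens = video_token_estimate(60)
print(frames, tokens)  # 60 90000

# Halving the sampling rate halves the cost, at the price of temporal detail.
print(video_token_estimate(60, fps=0.5))  # (30, 45000)
```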

OCR is not perfect

Multimodal models can read text in images, but they are not OCR systems. They miss small text, misread similar characters (O vs 0, l vs 1), and struggle with unusual fonts or low contrast. For critical text extraction from documents, dedicated OCR (Tesseract, Google Document AI) followed by LLM processing is still more reliable.

Real-world systems using multimodal models

GPT-4V/4o (OpenAI) - accepts images and text, generates text. Powers ChatGPT’s image understanding, Be My Eyes accessibility, and many developer applications
Claude 3/4 (Anthropic) - processes images, PDFs, and text. Strong at document understanding, chart interpretation, and generating code from screenshots
Gemini 1.5/2.0 (Google) - natively multimodal (text, image, audio, video). The context window of up to 2M tokens in Gemini 1.5 Pro enables processing hour-long videos
LLaVA / InternVL - open-source multimodal models that report results competitive with GPT-4V on a number of benchmarks
Midjourney, DALL-E 3, Stable Diffusion - text-to-image generation (inverse direction - text input, image output)
Google Lens - multimodal search: take a photo, get text-based results

How to apply multimodal in practice

Document processing pipeline

For invoices, receipts, contracts, or forms:

Convert PDF pages to images (or use native PDF support if available)
Send image + extraction prompt to multimodal model
Request structured JSON output with the fields you need
Validate extracted data against business rules
Flag low-confidence extractions for human review
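
The validation and review steps can be sketched as a function over the model's JSON output. The field names (`vendor`, `date`, `line_items`, `total`), the one-cent tolerance, and the specific rules are assumptions for illustration; a real pipeline would encode its own business rules. The JSON string stands in for a model response, and a failed rule is used here as the trigger for human review.

```python
import json
from datetime import date

REQUIRED_FIELDS = ("vendor", "date", "line_items", "total")


def validate_invoice(raw_json, tolerance=0.01):
    """Return a list of problems; an empty list means every check passed."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(data, dict):
        return ["response is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in data]
    if problems:
        return problems

    try:
        date.fromisoformat(data["date"])
    except (TypeError, ValueError):
        problems.append("date is not a valid ISO date")

    try:
        line_sum = sum(item["amount"] for item in data["line_items"])
        if abs(line_sum - data["total"]) > tolerance:
            problems.append(
                f"line items sum to {line_sum:.2f} but total is "
                f"{data['total']:.2f}")
    except (KeyError, TypeError, ValueError):
        problems.append("line_items or total has an unexpected shape")
    return problems


# Stand-in for a model response in which the total was misread (165 vs 156).
response = json.dumps({
    "vendor": "Acme Supplies",
    "date": "2024-03-01",
    "line_items": [{"amount": 100.0}, {"amount": 56.0}],
    "total": 165.0,
})

problems = validate_invoice(response)
needs_human_review = bool(problems)
print(problems)  # ['line items sum to 156.00 but total is 165.00']
```

Arithmetic cross-checks like the line-item sum are cheap and catch exactly the transposed-digit misreads that are hard to spot by eye.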

Visual Q&A for products

For e-commerce, support, or inventory:

Accept user-uploaded image
Combine with text query (“What model is this?” or “Is this damaged?”)
Use multimodal model for classification/description
Route to appropriate business logic based on response
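
The routing step can be as simple as a dispatch table keyed on a label the model is asked to return. The sketch below assumes the model's answer has already been parsed into a dict with hypothetical `intent` and `detail` keys; the handler names and return strings are placeholders for real business logic.

```python
def handle_identify(answer):
    return f"search catalog for: {answer['detail']}"


def handle_damage_check(answer):
    if answer["detail"] == "damaged":
        return "open return request"
    return "no action needed"


def handle_unknown(answer):
    return "escalate to human agent"


HANDLERS = {
    "identify": handle_identify,
    "damage_check": handle_damage_check,
}


def route(answer):
    """Dispatch on the intent label; anything unrecognized goes to a human."""
    handler = HANDLERS.get(answer.get("intent"), handle_unknown)
    return handler(answer)


print(route({"intent": "identify", "detail": "retro brass desk lamp"}))
print(route({"intent": "damage_check", "detail": "damaged"}))
print(route({"intent": "poem", "detail": ""}))
```

Constraining the model to a small set of labels and falling back to a human on anything else keeps free-form model text out of the control flow.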

Chart and diagram understanding

For business intelligence or documentation:

Render charts/diagrams as images
Ask the model to describe trends, extract data points, or explain the diagram
Use structured output mode for reliable data extraction
Cross-validate extracted numbers against source data when possible
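
When the source data behind a chart is available, the cross-validation step is a keyed comparison with a tolerance. The sketch below is one way to do it under assumed inputs: the quarter labels, the values, and the 2% relative tolerance are all illustrative, and the tolerance exists because values read off a rendered chart are approximations.

```python
import math


def cross_validate(extracted, source, rel_tol=0.02):
    """Compare extracted chart values with source values, key by key.

    Returns {key: (extracted_value, source_value)} for every key that is
    missing from the extraction or differs by more than rel_tol.
    """
    mismatches = {}
    for key, expected in source.items():
        got = extracted.get(key)
        if got is None or not math.isclose(got, expected, rel_tol=rel_tol):
            mismatches[key] = (got, expected)
    return mismatches


source = {"Q1": 120.0, "Q2": 135.0, "Q3": 150.0}
extracted = {"Q1": 121.0, "Q2": 153.0}  # Q2 digits swapped, Q3 missed

print(cross_validate(extracted, source))
# {'Q2': (153.0, 135.0), 'Q3': (None, 150.0)}
```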

When NOT to use multimodal

Pixel-precise measurements - use computer vision libraries (OpenCV)
High-volume OCR - dedicated OCR is faster and cheaper at scale
Real-time video processing - frame-by-frame multimodal inference is too slow
Medical imaging diagnosis - requires specialized, validated models with regulatory approval

FAQ

Q: Should I send images to the multimodal model or extract text first with OCR and send just the text?

It depends on the information density. For text-heavy documents (contracts, articles), OCR + text is cheaper and often sufficient. For documents where layout matters (invoices, forms, diagrams), sending the image preserves spatial relationships that OCR loses. For mixed content (slides with charts and text), image input captures everything including visual elements that OCR cannot represent.

Q: How do I reduce the token cost of image inputs?

Resize images to the minimum resolution the task requires. Some APIs (OpenAI’s, for example) expose a “detail” parameter (low/high) - use low detail for simple classification tasks. Crop to the relevant region rather than sending full screenshots. For multi-page documents, process pages individually rather than all at once, and only process pages likely to contain relevant information.

Q: Can multimodal models generate images, or only understand them?

Most multimodal LLMs (GPT-4V, Claude) only understand images as input and generate text as output. Image generation has traditionally used separate architectures (diffusion models like DALL-E 3, Midjourney, Stable Diffusion). Some newer unified models, including Gemini 2.0 and GPT-4o, can both understand and generate images natively, though image quality from unified models has generally trailed dedicated generators.

Interview questions

Q: Design a system that processes thousands of invoices daily, extracting vendor name, date, line items, and total amount. Would you use a multimodal model?

Strong answers discuss the tradeoff: multimodal models handle diverse invoice formats without template configuration, but are expensive at scale. A production system would use a tiered approach - fast OCR + template matching for known formats (say 80% of volume), multimodal model for unknown/complex formats (the remaining 20%). Include confidence scoring, human review for low-confidence extractions, and feedback loops to improve the system over time. Mention cost analysis with the pricing assumption stated: at an illustrative $0.03 per page image, 1,000 invoices/day with 2 pages each is 2,000 pages, or roughly $60/day for the multimodal path.
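
A back-of-the-envelope cost model makes the pricing assumption explicit. The sketch below assumes a flat price per page image; the $0.03 figure is illustrative, chosen so that 2,000 pages come to about $60/day, and is not a quoted vendor price. `multimodal_share` is a hypothetical parameter for the fraction of volume that reaches the multimodal tier.

```python
def daily_multimodal_cost(invoices_per_day, pages_per_invoice, price_per_page,
                          multimodal_share=1.0):
    """Daily spend on the multimodal path under a flat per-page price."""
    pages = invoices_per_day * pages_per_invoice * multimodal_share
    return pages * price_per_page


# 1,000 invoices/day, 2 pages each, all sent to the multimodal model.
print(round(daily_multimodal_cost(1000, 2, 0.03), 2))  # 60.0

# 5,000 invoices/day with only 20% routed to the multimodal tier: same spend.
print(round(daily_multimodal_cost(5000, 2, 0.03, multimodal_share=0.2), 2))  # 60.0
```

The second call shows why the tiered design matters: routing 80% of volume to cheap OCR lets the same budget cover five times as many invoices.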

Q: A user uploads a photo and asks “find me something similar” in your product catalog of 5 million items. How do you architect this?

Pre-compute CLIP embeddings for all catalog images and store them in a vector database. When a user uploads a photo, embed it with the same CLIP model, run a nearest-neighbor search (approximate, at this catalog size), and return the top-k results. For better results: combine visual similarity with text-based metadata (category, price range, brand). Use a multimodal model to generate a text description of the uploaded image, then do hybrid search (visual embedding + generated text description). This captures both visual similarity and semantic intent.
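
The hybrid ranking can be sketched as a weighted sum of two cosine similarities. Everything concrete below is an assumption for illustration: the 2-D vectors stand in for real embeddings, the SKUs are made up, and `alpha=0.7` is an arbitrary weighting. The brute-force loop is fine for a demo; at millions of items the same scoring would sit behind an approximate nearest-neighbor index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))


def hybrid_search(query_image_vec, query_text_vec, catalog, k=2, alpha=0.7):
    """Rank catalog items by a weighted mix of image and text similarity."""
    scored = []
    for item in catalog:
        score = (alpha * cosine(query_image_vec, item["image_vec"])
                 + (1 - alpha) * cosine(query_text_vec, item["text_vec"]))
        scored.append((score, item["sku"]))
    scored.sort(reverse=True)
    return [sku for _, sku in scored[:k]]


# Toy catalog with precomputed (made-up) image and description embeddings.
catalog = [
    {"sku": "lamp-01", "image_vec": [0.9, 0.1], "text_vec": [0.8, 0.2]},
    {"sku": "lamp-02", "image_vec": [0.7, 0.3], "text_vec": [0.9, 0.1]},
    {"sku": "dress-01", "image_vec": [0.1, 0.9], "text_vec": [0.2, 0.8]},
]

# Query: embedding of the uploaded photo plus embedding of its generated caption.
print(hybrid_search([1.0, 0.0], [1.0, 0.0], catalog))  # ['lamp-01', 'lamp-02']
```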

Q: Your multimodal document processing system works well on English invoices but fails on Japanese receipts. What is happening and how do you fix it?

The vision encoder and LLM were likely trained predominantly on English text in images. Japanese text is visually denser and can be written vertically, and the model may not have seen enough training examples of it. Fixes: try a model with stronger multilingual training (Gemini is often cited as strong here, but test on your own receipts), increase image resolution for dense text, add explicit language hints in the prompt (“This receipt is in Japanese”), or use a specialized multilingual OCR first and send the extracted text to the LLM for structuring.