AI Engineering
Multimodal AI: Vision, Audio, Documents - One API for Everything
Lesson Objectives
- Understand the difference between multimodal AI and a pipeline of specialized models
- Integrate Vision API (GPT-4o, Claude) for image analysis
- Implement document processing: text extraction, vision-based and structured extraction
- Use GPT-4o audio input for speech-to-meaning tasks
- Design a multimodal backend with upload, routing and processing pipeline
GPT-4V (September 2023) was the model that sees. In May 2024 GPT-4o added native audio. In the same year, Gemini 1.5 Pro processes one full hour of video in a single context window. Multimodality stopped being a feature and became a baseline expectation. Now a user photographs a receipt and the app extracts the total and every line item. A student uploads a lecture PDF and gets a summary with formulas. A customer sends a voice message with an error screenshot - and the support bot sees and hears it at the same time.
- Notion AI analyzes uploaded PDFs, images and tables - extracts data in structured format without a pipeline of separate models
- Stripe Document AI verifies IDs and bank statements via vision: 60% reduction in manual review
- Google Lens: multimodal real-time search - recognizes text, translates, finds products in one request
- Perplexity analyzes images in search queries - GPT-4o vision as part of the retrieval pipeline
How Multimodality Emerged
- **CLIP (OpenAI, January 2021)** - the first model to link images and text in a shared latent space. Trained on 400M image-text pairs from the internet. Does not generate - understands relationships.
- **Flamingo (DeepMind, April 2022)** - the first few-shot multimodal LLM: a handful of examples in context, and the model handles new tasks.
- **GPT-4V (OpenAI, September 2023)** - vision arrives in GPT-4. For the first time a mass audience sees it: LLM reads screenshots, analyzes charts, describes photos.
- **GPT-4o (May 2024)** - native multimodal: one transformer for text, audio and images. Not a pipeline, not fine-tuning on top - a single unified architecture.
- **Gemini 1.5 Pro (2024)** - 1M context window: 1 hour of video, 11 hours of audio or 30,000 lines of code in one request.
Prerequisites
What is Multimodal AI: Vision, Audio, Documents in One API
January 2021. OpenAI releases CLIP - a model that understands the connection between images and text. Nobody calls it a revolution. A year later, DeepMind ships Flamingo - the first model where image and text share a single context. Then GPT-4V. Then GPT-4o: one model that hears, sees, and reads. **In three years, multimodality went from a feature to a baseline expectation.**
Multimodal AI refers to models that accept multiple data types simultaneously: text, images, audio, video, documents. The key word is **simultaneously**. Not OCR → LLM → TTS chained together, but one model that understands context across everything at once. The difference is like a live orchestra versus soloists playing from separate rooms.
| Model | Text | Images | Audio | Video | PDF/Docs |
|---|---|---|---|---|---|
| GPT-4o | Yes | Yes | Yes (input) | Via frames | Via vision |
| Claude Sonnet/Opus | Yes | Yes | No | No | Yes (native PDF) |
| Gemini 2.0 Flash | Yes | Yes | Yes | Yes (native) | Yes |
| Llama 3.2 Vision | Yes | Yes | No | No | Via vision |
**A multimodal model is not the same as a multimodal pipeline.** True multimodal means a single model processes all data types in a unified context. A pipeline chains separate models (OCR + LLM + TTS) sequentially. In production, a hybrid is common: multimodal for understanding + specialized models for generation.
For backend developers, this means a new class of tasks: processing uploaded images, parsing PDF documents, analyzing screenshots, voice-to-action flows. The API interface stays familiar - the same messages array, just with content as an array of different typed objects.
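The "content as an array of typed objects" idea can be sketched as a small payload builder. The part shapes follow the OpenAI Chat Completions format for image input; `buildVisionMessage` and the example URL are hypothetical names used only for illustration, and no network call is made.

```typescript
// Content parts in the OpenAI Chat Completions shape for text + image input.
type TextPart = { type: "text"; text: string };
type ImagePart = {
  type: "image_url";
  image_url: { url: string; detail?: "low" | "high" | "auto" };
};
type ContentPart = TextPart | ImagePart;

interface UserMessage {
  role: "user";
  content: ContentPart[];
}

// Hypothetical helper (not part of any SDK): one text prompt plus N images.
function buildVisionMessage(
  prompt: string,
  imageUrls: string[],
  detail: "low" | "high" | "auto" = "auto",
): UserMessage {
  const images: ImagePart[] = imageUrls.map((url) => ({
    type: "image_url",
    image_url: { url, detail },
  }));
  return { role: "user", content: [{ type: "text", text: prompt }, ...images] };
}

// The message keeps the familiar role/content structure; only content changed shape.
const receiptMessage = buildVisionMessage(
  "What is the total on this receipt?",
  ["https://example.com/receipt.jpg"], // placeholder URL
  "high",
);
```

The resulting object goes into the usual `messages` array of a chat request, next to plain text messages.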
What is the key advantage of a multimodal model over a pipeline of specialized models?
Vision API: GPT-4 Vision and Claude Vision - Image Analysis
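One image sent to GPT-4o is not a "picture". It is **a series of tokens**. OpenAI has not published GPT-4o's exact image tokenizer, but in ViT-style architectures the image is split into small patches (typically 14x14 or 16x16 pixels), each patch becomes a vector, and those vectors flow into the transformer alongside text tokens. Downstream there is no special "vision module" - the same attention layers, just different inputs. For billing, OpenAI counts 512x512 tiles: at `detail: high`, a 1024x1024 image is resized to 768x768 and covered by four tiles at 170 tokens each plus a base of 85, for 765 tokens. At `detail: low` - a fixed 85 tokens.
Two ways to pass an image: **URL** (the provider fetches it, so the request body stays small) and **base64** (embedded in the request body, works without external hosting). The token cost is the same either way. OpenAI (GPT-4o) and Anthropic (Claude) are the two main providers, with different API formats.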
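The patch arithmetic can be illustrated with a minimal sketch for a generic ViT-style encoder. It assumes edge patches are padded (hence the rounding up) and ignores resizing, tiling and token pooling that production models add, so it is not a billing formula. `patchTokenCount` is a hypothetical name.

```typescript
// Number of patch tokens a ViT-style encoder produces for one image.
// Assumption: partial patches at the right/bottom edge are padded, so we round up.
function patchTokenCount(width: number, height: number, patchSize: number): number {
  if (width <= 0 || height <= 0 || patchSize <= 0) {
    throw new Error("width, height and patchSize must be positive");
  }
  return Math.ceil(width / patchSize) * Math.ceil(height / patchSize);
}

const tokensAt16 = patchTokenCount(224, 224, 16); // 14 x 14 grid = 196 patches
const tokensAt14 = patchTokenCount(336, 336, 14); // 24 x 24 grid = 576 patches

// Halving each side of the image cuts the patch count to a quarter,
// which is why lower-resolution inputs are cheaper and lose fine detail.
const halfResolution = patchTokenCount(112, 112, 16); // 7 x 7 grid = 49 patches
```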
| Parameter | OpenAI (GPT-4o) | Anthropic (Claude) |
|---|---|---|
| Image format | image_url (URL or base64 in URL) | image (base64 as separate field) |
| Detail level | detail: low/high/auto | No parameter (always high) |
| Max size | 20 MB | 5 MB (base64 in request) |
| Multiple images | Yes (up to 10+) | Yes (up to 20) |
| Cost (high detail) | 85 base + 170 tokens per 512x512 tile (765 for 1024x1024) | ~(width x height) / 750 tokens, up to ~1600 per image |
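The format difference in the table can be sketched as two small builders for the same base64 image. The field names follow the public OpenAI Chat Completions and Anthropic Messages formats; the helper names and the placeholder data are assumptions for illustration only.

```typescript
type ImageMime = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

// OpenAI: base64 travels inside a data URL in the image_url field.
function openAiImagePart(
  base64: string,
  mime: ImageMime,
  detail: "low" | "high" | "auto" = "auto",
) {
  return {
    type: "image_url" as const,
    image_url: { url: `data:${mime};base64,${base64}`, detail },
  };
}

// Anthropic: base64 data and media type are separate fields under `source`.
function anthropicImagePart(base64: string, mime: ImageMime) {
  return {
    type: "image" as const,
    source: { type: "base64" as const, media_type: mime, data: base64 },
  };
}

// Placeholder payload: base64 of the 8-byte PNG signature, not a complete image.
const placeholderPng = "iVBORw0KGgo=";
const forOpenAi = openAiImagePart(placeholderPng, "image/png", "low");
const forAnthropic = anthropicImagePart(placeholderPng, "image/png");
```

Keeping both builders behind one internal image type lets the rest of the backend stay provider-agnostic.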
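**detail: 'low' saves about 9x tokens.** In low mode, OpenAI uses a fixed 85 tokens per image instead of 765 for a 1024x1024 image in high (more for larger images). For tasks like "what's in the picture?" - low is enough. High is needed for fine text, blueprints, UI screenshots. Price per image is the token count times the model's input price: at USD 10 per 1M input tokens, detail:low costs ~USD 0.00085 and detail:high for 1024x1024 ~USD 0.00765. Rates change, so check the current gpt-4o price list.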
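The token arithmetic behind `detail` can be sketched as follows. This is an estimate based on OpenAI's published tile accounting for GPT-4o-class models (85 base tokens, 170 per 512x512 tile, after resizing); it assumes images are only ever scaled down, and other models may count differently. The function names and the price argument are hypothetical.

```typescript
type Detail = "low" | "high";

const BASE_TOKENS = 85; // flat cost; also the full price of detail: low
const TILE_TOKENS = 170; // per 512x512 tile in detail: high
const TILE_SIZE = 512;

function estimateImageTokens(width: number, height: number, detail: Detail): number {
  if (detail === "low") return BASE_TOKENS;

  // Step 1: fit inside a 2048x2048 square, keeping the aspect ratio.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768 px (assumption: no upscaling).
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count the 512 px tiles needed to cover the resized image.
  const tiles = Math.ceil(w / TILE_SIZE) * Math.ceil(h / TILE_SIZE);
  return BASE_TOKENS + TILE_TOKENS * tiles;
}

// Cost is linear in tokens; pass in whatever the current input price is.
function estimateImageCostUsd(tokens: number, usdPerMillionInputTokens: number): number {
  return (tokens * usdPerMillionInputTokens) / 1_000_000;
}

const squareHigh = estimateImageTokens(1024, 1024, "high"); // 4 tiles
const squareLow = estimateImageTokens(1024, 1024, "low");
const tallHigh = estimateImageTokens(2048, 4096, "high"); // 6 tiles
```

For a 1024x1024 image this gives 765 tokens in high and 85 in low, a ratio of 9.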
When analyzing a mobile app screenshot with small text, which detail parameter should be used in GPT-4o Vision?
Document Processing: PDF Parsing, OCR, Structured Extraction
A PDF arrives - and immediately a choice has to be made. Is this a text-layer PDF (contract, article) or a scan (photo of a document, no text layer)? The answer changes everything: **text extraction** (pdf-parse, pdfjs) only works with the text layer. A scan is an image - there is no text layer. Tesseract recognizes text, but loses structure. **Vision API** sees the document like a human would - tables, layout, annotations, spatial relationships between elements.
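The text-layer check can be sketched as a heuristic on whatever a text extractor returned. This is an assumption-laden sketch: `extractedText` stands for the output of a library such as pdf-parse, and the 50-characters-per-page threshold is an arbitrary illustrative value that would need tuning on real documents.

```typescript
type PdfStrategy = "text-extraction" | "vision";

// A scan usually yields no text (or a few stray characters) from a text extractor.
function hasUsableTextLayer(
  extractedText: string,
  pageCount: number,
  minCharsPerPage = 50, // hypothetical threshold
): boolean {
  if (pageCount <= 0) return false;
  const visibleChars = extractedText.replace(/\s+/g, "").length;
  return visibleChars / pageCount >= minCharsPerPage;
}

// Text layer present: cheap text extraction + LLM. Otherwise: treat pages as images.
function choosePdfStrategy(extractedText: string, pageCount: number): PdfStrategy {
  return hasUsableTextLayer(extractedText, pageCount) ? "text-extraction" : "vision";
}

const scannedInvoice = choosePdfStrategy("", 3);
const contract = choosePdfStrategy("This agreement is made between ".repeat(20), 2);
```

A density check like this also catches PDFs whose text layer holds only a page number or a watermark.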
**Choosing an approach by document type:** text-layer PDF (contract, article) → text extraction + LLM (cheap, fast). Scan, table, form → vision-based (GPT-4o with Zod schema). PDF with complex layout → Claude native PDF (sees structure without conversion). Structured extraction with Zod validates the format at runtime and rejects anything that does not match the schema - no more "I extracted the number but it came back as a string".
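The "number came back as a string" failure can be made concrete with a validation sketch. To keep the block dependency-free, a hand-written validator stands in for the Zod schema here; in a real project `z.object({...}).safeParse` would play this role. The `Invoice` shape and `parseInvoice` are hypothetical.

```typescript
interface LineItem { description: string; amount: number }
interface Invoice { total: number; currency: string; items: LineItem[] }

type ParseResult = { ok: true; value: Invoice } | { ok: false; error: string };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Validates the raw model output instead of trusting it.
function parseInvoice(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "not valid JSON" };
  }
  if (!isRecord(data)) return { ok: false, error: "expected an object" };
  if (typeof data.total !== "number" || !Number.isFinite(data.total)) {
    return { ok: false, error: "total must be a finite number" };
  }
  if (typeof data.currency !== "string") {
    return { ok: false, error: "currency must be a string" };
  }
  if (!Array.isArray(data.items)) return { ok: false, error: "items must be an array" };

  const items: LineItem[] = [];
  for (const item of data.items as unknown[]) {
    if (!isRecord(item) || typeof item.description !== "string" || typeof item.amount !== "number") {
      return { ok: false, error: "each item needs a string description and a numeric amount" };
    }
    items.push({ description: item.description, amount: item.amount });
  }
  return { ok: true, value: { total: data.total, currency: data.currency, items } };
}

// A stringified total is rejected rather than silently passed downstream.
const rejected = parseInvoice('{"total":"42.50","currency":"USD","items":[]}');
const accepted = parseInvoice(
  '{"total":42.5,"currency":"USD","items":[{"description":"Coffee","amount":42.5}]}',
);
```

On a failed parse, a common follow-up is to retry the extraction with the error message included in the prompt.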
For extracting data from a scanned paper invoice (scan in PDF), the best approach is:
Audio Understanding: Audio Input for GPT-4o, Speech-to-Meaning
Whisper transcribes speech to text - and loses half the meaning. Sarcasm, hesitation, a customer's frustration late in a call - all of it disappears into flat text. **GPT-4o audio input accepts audio directly**. The model hears intonation, pauses, pace. This is not STT - it is speech-to-meaning: the model understands what was said and how. Stripe uses audio understanding to analyze support calls - not just transcription, but sentiment detection and escalation risk scoring.
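Passing audio directly can be sketched as a request body builder. The `input_audio` content part and the `gpt-4o-audio-preview` model name follow the OpenAI Chat Completions audio format as documented at the time of writing; `buildAudioRequest` is a hypothetical helper, the base64 string is a placeholder, and nothing is sent over the network.

```typescript
type AudioFormat = "wav" | "mp3";

interface AudioChatRequest {
  model: string;
  modalities: ["text"]; // ask for a text answer only
  messages: Array<{
    role: "user";
    content: Array<
      | { type: "text"; text: string }
      | { type: "input_audio"; input_audio: { data: string; format: AudioFormat } }
    >;
  }>;
}

// Hypothetical helper: one instruction plus one base64-encoded audio clip.
function buildAudioRequest(
  base64Audio: string,
  format: AudioFormat,
  instruction: string,
): AudioChatRequest {
  return {
    model: "gpt-4o-audio-preview",
    modalities: ["text"],
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: instruction },
          { type: "input_audio", input_audio: { data: base64Audio, format } },
        ],
      },
    ],
  };
}

const supportCallRequest = buildAudioRequest(
  "UklGRg==", // placeholder base64, not a playable file
  "wav",
  "Rate the caller's frustration from 1 to 5 and explain which cues you used.",
);
```

The instruction asks about tone, which is the kind of question a transcript alone cannot answer.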
**Audio input vs Whisper STT:** Whisper transcribes speech to text - losing intonation, emotions, pauses. GPT-4o audio input analyzes audio directly, preserving all nuances. For tasks involving sentiment analysis, tone and non-verbal signals - audio input is preferable. But: max duration is ~10 minutes. For long recordings (lectures, podcasts, 10+ min), Whisper STT → LLM is still the better call.
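The trade-off above can be sketched as a small decision function. It takes the ~10 minute limit from the text as given (worth verifying against current provider documentation) and assumes that when tone does not matter, the cheaper transcription path is preferred. All names are hypothetical.

```typescript
type AudioStrategy = "audio-input" | "stt-then-llm";

interface AudioJob {
  durationSeconds: number;
  needsTone: boolean; // true when sentiment, intonation or pauses matter
}

// Limit cited in the text; treat it as a configurable assumption.
const MAX_DIRECT_SECONDS = 10 * 60;

function chooseAudioStrategy(job: AudioJob): AudioStrategy {
  // Long recordings go through transcription regardless of the task.
  if (job.durationSeconds > MAX_DIRECT_SECONDS) return "stt-then-llm";
  // Short recordings use direct audio only when tone carries information.
  return job.needsTone ? "audio-input" : "stt-then-llm";
}

const angryVoicemail = chooseAudioStrategy({ durationSeconds: 90, needsTone: true });
const lecture = chooseAudioStrategy({ durationSeconds: 55 * 60, needsTone: true });
const dictatedNote = chooseAudioStrategy({ durationSeconds: 30, needsTone: false });
```

A long call that still needs tone analysis could be split into chunks under the limit, at the price of losing context across chunk boundaries.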
What is the advantage of GPT-4o audio input over the Whisper STT → GPT-4o text chain?
Multimodal Architecture: Upload, Processing, Routing
A production multimodal backend is not "accept a file and send it to OpenAI". It is a routing system: images go to GPT-4o with `detail: high`, PDFs go to Claude native (no conversion), audio goes to gpt-4o-audio-preview, video gets split into frames. **Every modality demands its own strategy.** The router selects a strategy by MIME type - and the client code knows nothing about it.
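Routing by MIME type can be sketched as a pure function that returns a processing plan. The provider and strategy labels mirror the text; the allowlist of raster image types and every identifier are assumptions for illustration, and the MIME type is expected to have been verified on the server beforehand.

```typescript
type Provider = "openai" | "anthropic";
type Strategy = "vision-detail-high" | "native-pdf" | "audio-input" | "frame-sampling";

interface Route {
  provider: Provider;
  strategy: Strategy;
}

// Raster formats only: vector formats such as SVG are deliberately not routed to vision.
const RASTER_IMAGES = new Set(["image/jpeg", "image/png", "image/gif", "image/webp"]);

function routeByMime(mime: string): Route {
  // Drop parameters such as "; charset=binary" and normalize case.
  const type = (mime.split(";")[0] ?? "").trim().toLowerCase();

  if (type === "application/pdf") return { provider: "anthropic", strategy: "native-pdf" };
  if (RASTER_IMAGES.has(type)) return { provider: "openai", strategy: "vision-detail-high" };
  if (type.startsWith("audio/")) return { provider: "openai", strategy: "audio-input" };
  if (type.startsWith("video/")) return { provider: "openai", strategy: "frame-sampling" };

  throw new Error(`Unsupported MIME type: ${mime}`);
}

const pdfRoute = routeByMime("application/pdf");
const screenshotRoute = routeByMime("image/png");
const voiceRoute = routeByMime("audio/wav");
```

Client code calls one endpoint and never sees which provider handled the file.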
**Upload security:** always validate MIME type on the server (never trust the client's Content-Type - it is directly spoofable), limit file size, delete temp files after processing, and never pass user-supplied filenames directly to the filesystem. An uploaded SVG can carry XSS, an uploaded PDF can carry an exploit. MIME-type validation is the first barrier, not the only one.
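Server-side type validation can be sketched by checking the file's leading bytes against known signatures instead of trusting the declared Content-Type. The signatures for PNG, JPEG and PDF are standard; the size limit, the allowlist and the function names are hypothetical, and as the text says this is only a first barrier.

```typescript
type SniffedType = "image/png" | "image/jpeg" | "application/pdf";

const SIGNATURES: Array<{ mime: SniffedType; bytes: number[] }> = [
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
];

// Detects the type from the first bytes of the file; null means "not on the allowlist".
function sniffMime(data: Uint8Array): SniffedType | null {
  for (const { mime, bytes } of SIGNATURES) {
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) return mime;
  }
  return null;
}

const MAX_UPLOAD_BYTES = 20 * 1024 * 1024; // hypothetical limit

function validateUpload(data: Uint8Array, claimedMime: string): SniffedType {
  if (data.length === 0 || data.length > MAX_UPLOAD_BYTES) {
    throw new Error("File size out of range");
  }
  const actual = sniffMime(data);
  if (actual === null) throw new Error("Unrecognized or disallowed file type");
  if (actual !== claimedMime) throw new Error("Content does not match declared type");
  return actual;
}

// Build the on-disk name from a server-generated id, never from the user's filename.
function storageName(id: string, mime: SniffedType): string {
  const ext = { "image/png": "png", "image/jpeg": "jpg", "application/pdf": "pdf" }[mime];
  return `${id.replace(/[^A-Za-z0-9_-]/g, "")}.${ext}`;
}
```

An SVG renamed to `.png` fails the signature check here, because its first bytes are text rather than the PNG signature.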
**Choosing a provider by modality** is a production best practice. Claude is better for PDF (native support, understands structure without conversion), GPT-4o for vision and audio (detail control, audio-preview model). A router lets engineers swap providers without touching client code - switching to Gemini 2.0 for native video takes three lines.
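Swapping providers without touching client code can be sketched with a handler registry. The handlers below are stubs that only format a string; real ones would wrap the provider SDK calls. `MultimodalRouter`, `ModalityHandler` and the provider labels are hypothetical names.

```typescript
type Modality = "image" | "pdf" | "audio" | "video";

interface ModalityHandler {
  provider: string;
  process(file: Uint8Array, prompt: string): Promise<string>;
}

// Stub handler: stands in for a real SDK call so the sketch runs offline.
function stubHandler(provider: string): ModalityHandler {
  return {
    provider,
    async process(file, prompt) {
      return `[${provider}] ${prompt} (${file.length} bytes)`;
    },
  };
}

class MultimodalRouter {
  private handlers = new Map<Modality, ModalityHandler>();

  // Registering the same modality again replaces the previous handler.
  register(modality: Modality, handler: ModalityHandler): this {
    this.handlers.set(modality, handler);
    return this;
  }

  providerFor(modality: Modality): string | undefined {
    return this.handlers.get(modality)?.provider;
  }

  async analyze(modality: Modality, file: Uint8Array, prompt: string): Promise<string> {
    const handler = this.handlers.get(modality);
    if (!handler) throw new Error(`No handler registered for ${modality}`);
    return handler.process(file, prompt);
  }
}

const router = new MultimodalRouter()
  .register("pdf", stubHandler("anthropic"))
  .register("image", stubHandler("openai"))
  .register("audio", stubHandler("openai"))
  .register("video", stubHandler("openai-frames"));

// Switching video to another provider is one registration; analyze() callers do not change.
router.register("video", stubHandler("gemini"));
```

Because callers depend only on `analyze`, a provider change stays confined to the registration code.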
Why does the multimodal router choose Claude for PDF and GPT-4o for images?
Myth: The model "sees" an image the way a human does
Reality: The model splits the image into small patches (typically 14x14 or 16x16 pixels), and each patch becomes a vector-token. There is no vision - only matrix operations on numbers
"Vision" is a marketing term. Internally: the image is divided into an N×N grid of patches (patch size is typically 14 or 16 pixels). Each patch is linearly projected into an embedding. These embeddings enter the transformer alongside text tokens. The attention mechanism finds relationships between patches and text. There is no separate "vision module" - the same self-attention, the same weight matrices. This is why the model loses detail at detail:low - fewer patches means fewer tokens.
Myth: A multimodal pipeline and a multimodal model are the same thing
Reality: A pipeline is a chain of specialized models (OCR → LLM → TTS), each unaware of the others. A multimodal model is a single unified context for all modalities
In a pipeline, every step loses information. OCR outputs text - but loses element positions, colors, visual context. The LLM receives flat text without visual structure. In a multimodal model, the image and text are processed jointly inside one attention mechanism - the model sees both simultaneously. This matters critically for tasks like "what text appears next to the red button in this screenshot".
Summary
- CLIP 2021 → Flamingo 2022 → GPT-4V 2023 → GPT-4o native multimodal 2024: multimodality became the baseline expectation in three years
- The model does not "see" - it splits images into small patches (typically 14x14 or 16x16 px), each patch becomes a token, everything flows through the same transformer
- Vision API: GPT-4o (detail: low = 85 tokens, high = 85 + 170 per 512x512 tile, so 765 for a 1024x1024 image), Claude (native base64, no detail parameter). Cost scales with tokens: 765 tokens is ~USD 0.00765 at USD 10 per 1M input tokens
- Document processing: text extraction for text-layer PDFs, vision+Zod for scans, Claude native PDF for complex layouts
- GPT-4o audio input is speech-to-meaning with intonation and emotion, Whisper is transcription only. Audio input limit: 10 minutes
- Production router: MIME type → strategy → provider. Claude for PDF, GPT-4o for vision/audio, Gemini for native video
Questions for Reflection
- Which tasks in a current project could be handled by a multimodal API instead of a chain of specialized tools?
- When is an OCR + LLM pipeline better than native vision? Which factors - cost, speed, volume - drive the choice?
- What would a multimodal router look like for a document processing system with 10 file types and 3 providers?
What's Next
Multimodal AI analyzes existing content. The next step is generating new content: images, video, audio through AI APIs.
- Image Generation API — From analyzing images to generating them - DALL-E, Stable Diffusion, prompt engineering for images
- Structured Output — Structured extraction with Zod - continuation of patterns from the structured output lesson
- Cost Management — Vision tokens are expensive. Optimizing detail, choosing the model - part of cost management
Related Lessons
- aie-05-api-integration — Multimodal calls extend the same API integration
- aie-26-image-generation — Vision understanding pairs with image generation
- aie-07-structured-output — Vision results parsed into structured schemas
- aie-29-cost-management — Image tokens inflate cost and need control
- ml-38-image-classification — Classic vision task now solved by one prompt
- ml-29-cnn