Multimodal LLMs — Vision, Language, Audio, Tool Use Combined
Session 36 of the 48-session learning series.
Why this session matters
By 2026, "LLM" really means "multimodal foundation model": Claude, GPT, and Gemini all handle images and tools natively, and GPT and Gemini handle audio as well. Knowing how the modalities are stitched into a shared representation, and what fails when you cross them, separates AI engineers from prompt-jockeys.
Agenda
- How vision is plugged into a transformer (CLIP, ViT, vision-language models)
- Audio models — Whisper, audio tokens, native audio in/out
- Late fusion vs early fusion vs interleaved modalities
- Tool use across modalities — image → action → answer
- Evaluation — VQA, MMMU, Audio-LM benchmarks; the new failure modes
Pre-read (skim before the session)
- CLIP paper (Radford et al., 2021)
- Flamingo paper (Alayrac et al., 2022)
- LLaVA paper (Liu et al., 2023)
- Whisper paper (Radford et al., 2022)
Deep dive
1. What "multimodal" actually means
Three patterns are conflated under the same word:
- Cross-modal embedding (CLIP) — text and image in same vector space for similarity / search.
- Vision-language generation (LLaVA, GPT-4V) — model takes image + text, generates text.
- Any-to-any (GPT-4o, Gemini Native) — image / audio / video / text in; image / audio / text out.
Each is a different architecture problem. We'll cover all three.
2. CLIP — the foundational cross-modal model
Train two encoders side by side:
- Image encoder (ViT or ResNet) → image embedding.
- Text encoder (transformer) → text embedding.
Train with contrastive loss: for a batch of (image, caption) pairs, the diagonal of image_emb · text_emb.T should be high, off-diagonal low.
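The contrastive objective above can be sketched in plain Python. This is a minimal illustration, not the CLIP training code: the function names are made up for this example, the temperature is fixed at 0.07 (CLIP learns it during training), and real implementations run on batched tensors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    # Numerically stable -log softmax(logits)[target_index].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target_index]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs.

    Row k of image_embs is assumed to match row k of text_embs, so the
    correct "class" for row k of the similarity matrix is column k.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine(image_i, text_j) / temperature
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    # Image-to-text: each row should peak on the diagonal.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should peak on the diagonal.
    loss_t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2
```

With perfectly aligned toy pairs the loss is close to zero; swapping the captions so the diagonal is low drives it up sharply, which is the signal the encoders are trained on.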
After training:
- Zero-shot image classification by checking cos(image, "a photo of a {class}") for each class.
- Image search by text query.
- Text-image retrieval at scale.
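Zero-shot classification with the prompt template above can be sketched as follows. The embed_text argument is a hypothetical stand-in for a real CLIP text encoder; here it would be any callable that maps a prompt string to a vector in the same space as the image embedding.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_classify(image_emb, class_names, embed_text):
    """Pick the class whose prompt embedding is closest to the image embedding.

    embed_text is an assumed callable: prompt string -> embedding vector.
    """
    scores = {
        name: cosine(image_emb, embed_text(f"a photo of a {name}"))
        for name in class_names
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy lookup table standing in for a trained text encoder (illustrative only).
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [1.0, 0.1],
    "a photo of a dog": [0.1, 1.0],
}

label, scores = zero_shot_classify(
    [0.9, 0.2], ["cat", "dog"], TOY_TEXT_EMBEDDINGS.__getitem__
)
print(label)
```

No classifier head is trained: adding a class only means adding another prompt.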
CLIP-style encoders are the embedding spine of many modern multimodal stacks: Stable Diffusion's text conditioning and most "search images by text" features build on them.
3. From CLIP to vision-language generation
CLIP doesn't generate text. To get a VLM:
image → CLIP-like vision encoder → image tokens (~256–576 per image)
↓
[ MLP projection: image-emb-dim → LLM-emb-dim ]
↓
interleave with text tokens:
<image_tokens> "Describe this." <generated_text>
↓
LLM transformer
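That's the LLaVA recipe: a frozen CLIP vision encoder, a small projection layer (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5), and an LLM. The projection is trained first with the LLM frozen, then the projection and the LLM are fine-tuned together on image-instruction pairs. Cheap to build; impressively capable.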
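The projection step in the diagram can be sketched with plain lists. This is an illustration under assumptions, not LLaVA's code: the two-layer MLP with GELU follows the LLaVA-1.5 description, the weights are supplied by the caller, and all function names are invented for this example.

```python
import math


def linear(x, weight, bias):
    # weight has one row per output dimension; each row has len(x) entries.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def project_patch(patch, w1, b1, w2, b2):
    """Map one vision-encoder patch feature into the LLM embedding dimension."""
    hidden = [gelu(h) for h in linear(patch, w1, b1)]
    return linear(hidden, w2, b2)


def build_input_sequence(patch_features, text_embeddings, projector):
    """Prepend projected image tokens to the text token embeddings.

    The result is what the LLM transformer consumes: every element has the
    LLM embedding dimension, whether it came from the image or the text.
    """
    image_tokens = [projector(p) for p in patch_features]
    return image_tokens + list(text_embeddings)
```

The point of the design is that the LLM never sees "an image": it sees a longer sequence of vectors in its own embedding space, which is why only the small projector needs to be learned at first.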
4. Native multimodal architectures
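GPT-4o and Gemini are described as going further (Claude 3.5+ accepts images but emits only text, and none of these vendors publish full architectural details):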
- Image, audio, video all encoded into tokens that share the model's token space.
- Trained end-to-end — no frozen modules.
- Generation can output across modalities (text + image, text + audio).
Architectural variants:
- Late fusion — encode each modality separately, concat embeddings, pass to LLM. Simplest; least cross-modal reasoning.
- Early fusion — interleave tokens from all modalities in one stream from the start. Strongest reasoning across modalities; hardest to train.
- Cross-attention fusion (Flamingo) — image tokens accessed via cross-attention in dedicated layers. Compromise; mainstream pre-2024.
Trend in 2026: pure early fusion with shared tokeniser.
5. Vision encoder choices
- CLIP-ViT — most common; 224×224 or 336×336 inputs; 256 or 576 tokens out respectively at patch size 14.
- SigLIP — better contrastive loss; sharper at smaller sizes.
- DinoV2 — self-supervised; better at fine detail, less semantic.
- ConvNeXt / EVA-CLIP — a pure CNN backbone and a scaled-up CLIP-style ViT, respectively; sometimes higher quality.
For document understanding / OCR-heavy: higher resolution + multi-crop sliding window.
6. Image tokens are expensive
A 336×336 image at patch size 14 → 24 × 24 = 576 vision tokens. That eats your context window fast.
- 4-image prompt: 2304 tokens for images alone.
- High-res / multi-crop: easily 4000+ tokens per image.
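The arithmetic behind these numbers is short enough to write down. A minimal sketch, assuming a ViT-style encoder with square patches of size 14 and no token pooling; the crops_per_image parameter is a hypothetical knob for multi-crop schemes, whose exact tiling differs between models.

```python
def vision_tokens(width, height, patch_size=14):
    """Number of patch tokens a ViT-style encoder emits for one image crop."""
    return (width // patch_size) * (height // patch_size)


def prompt_image_tokens(n_images, width, height, patch_size=14, crops_per_image=1):
    """Total vision tokens for a prompt; crops_per_image > 1 models multi-crop."""
    return n_images * crops_per_image * vision_tokens(width, height, patch_size)


print(vision_tokens(224, 224))           # 256
print(vision_tokens(336, 336))           # 576
print(prompt_image_tokens(4, 336, 336))  # 2304
# Seven 336x336 crops of one high-resolution image already pass 4000 tokens.
print(prompt_image_tokens(1, 336, 336, crops_per_image=7))  # 4032
```

Hosted APIs count image tokens with their own rules, so treat this as a budgeting estimate and check the provider's documentation for billing.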
Inference cost scales with this. Practical advice:
- Resize aggressively before sending.
- Use a smaller model for image-only queries; reserve the big model for cross-modal reasoning.
- Cache image embeddings if you'll re-query the same image multiple times.
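Caching embeddings only helps when you run the vision encoder yourself (for example in a retrieval pipeline); hosted chat APIs generally take the image, not an embedding. Under that assumption, a content-hash cache might look like the sketch below, where encode_fn is a hypothetical callable wrapping whatever encoder you use.

```python
import hashlib


class ImageEmbeddingCache:
    """Memoize image embeddings by the SHA-256 of the raw image bytes."""

    def __init__(self, encode_fn):
        self._encode = encode_fn  # assumed: bytes -> embedding vector
        self._store = {}
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Only unseen images pay the encoder cost.
            self.misses += 1
            self._store[key] = self._encode(image_bytes)
        return self._store[key]
```

Hashing the bytes rather than the filename means a re-uploaded or renamed copy of the same image still hits the cache, while a resized copy correctly misses.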
7. Audio in / audio out
Whisper-style encoders are dominant for audio→text. Modern any-to-any models:
- Audio in — mel-spectrogram → audio tokens (Whisper-like encoder) → interleaved with text tokens.
- Audio out — model emits audio tokens; a neural codec decoder (SoundStream, EnCodec) or a vocoder converts them to a waveform.
End-to-end voice latency: OpenAI reported GPT-4o audio response times as low as 232 ms, averaging about 320 ms. The bottleneck moved from speech recognition to LLM decoding.
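To get a feel for how many positions the audio-in path produces, here is the sequence-length arithmetic for a Whisper-style front end. The defaults are taken from the Whisper paper's setup (16 kHz audio, 10 ms hop, 30-second windows, one stride-2 convolution before the transformer); other audio encoders use different values, so treat them as assumptions.

```python
def whisper_style_lengths(seconds=30.0, sample_rate=16_000, hop_length=160, conv_stride=2):
    """Return (samples, mel_frames, encoder_positions) for one audio window."""
    samples = int(seconds * sample_rate)
    # One log-mel frame per hop (160 samples = 10 ms at 16 kHz).
    mel_frames = samples // hop_length
    # The convolutional stem downsamples time by conv_stride.
    encoder_positions = mel_frames // conv_stride
    return samples, mel_frames, encoder_positions


print(whisper_style_lengths())  # (480000, 3000, 1500)
```

So a 30-second window costs about 1500 encoder positions, roughly 50 per second of audio, which is the audio analogue of the image-token budget discussed earlier.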
8. Document understanding — the killer business use case
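- OCR-first — PaddleOCR or a similar engine extracts text + structure, optionally with a layout model such as LayoutLM on top; pass the result to an LLM.
- OCR-free — feed the image directly to a VLM (or a dedicated model such as Donut) and ask it to extract a table. Works for clean docs, fails on dense ones.
- Hybrid — OCR + VLM ensemble. Most production systems land here.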
For invoices, contracts, forms: hybrid + structured output schema + per-field validation is the only reliable approach.
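Per-field validation of a model's structured output can be sketched with the standard library alone. The schema below (invoice_number, issue_date, total, currency) is a hypothetical example, and in practice a library such as Pydantic would replace the hand-written checks.

```python
import json
import re
from datetime import date


def _is_iso_date(value):
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def _is_amount(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value >= 0


# Hypothetical invoice schema: field name -> validity check.
FIELD_CHECKS = {
    "invoice_number": lambda v: isinstance(v, str) and v.strip() != "",
    "issue_date": _is_iso_date,
    "total": _is_amount,
    "currency": lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{3}", v) is not None,
}


def validate_invoice(raw_json):
    """Return (data, errors); errors maps each failing field to a reason."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, {"_document": f"not valid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return None, {"_document": "expected a JSON object"}
    errors = {}
    for field, check in FIELD_CHECKS.items():
        if field not in data:
            errors[field] = "missing"
        elif not check(data[field]):
            errors[field] = f"invalid value: {data[field]!r}"
    return data, errors
```

Returning errors per field, rather than failing the whole document, lets a pipeline re-ask the model for just the bad fields or route them to a human.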
9. Multimodal RAG
When your corpus has images, charts, PDFs:
- Embed both text and image with a CLIP-class encoder.
- Index in a vector DB with type tag.
- At query time, retrieve top-K across both; pass everything to a VLM.
Easier said than done: chart interpretation is still a weakness. A chart-aware prompt plus structured extraction clearly outperforms naive image RAG.
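The index-with-type-tag idea from the steps above fits in a few lines. This is an in-memory sketch with brute-force cosine search, assuming the embeddings already come from a shared text-image encoder; a real deployment would use a vector database, and the class name is invented for this example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MultimodalIndex:
    """Toy vector index whose entries carry a modality tag."""

    def __init__(self):
        self._items = []  # (item_id, modality, embedding)

    def add(self, item_id, modality, embedding):
        if modality not in ("text", "image"):
            raise ValueError(f"unknown modality: {modality}")
        self._items.append((item_id, modality, list(embedding)))

    def search(self, query_embedding, k=3, modality=None):
        """Top-k by cosine similarity, optionally restricted to one modality."""
        candidates = [it for it in self._items if modality is None or it[1] == modality]
        scored = [
            (cosine(query_embedding, emb), item_id, mod)
            for item_id, mod, emb in candidates
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(item_id, mod, score) for score, item_id, mod in scored[:k]]
```

Keeping the tag on each hit lets the caller decide how to present it to the VLM: text chunks go in as text, image hits are attached as images, and chart hits can be routed through a structured-extraction step first.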
10. Tool use × multimodal
The two combine into compound workflows:
- Vision identifies a chart → call the parse_chart_to_csv() tool → retrieve numbers → LLM reasons → answer.
- Vision identifies a UI screenshot → call read_button(label) → click → screenshot again → continue.
This is the "computer use" / agentic-vision frontier. Anthropic shipped computer use in public beta in late 2024, and OpenAI followed with Operator in early 2025. Latency and reliability are the open problems.
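The chart workflow can be sketched as a small pipeline. Everything here is a stand-in: parse_chart_to_csv is named in the text but its behaviour is not specified, so the stub returns canned data; classify_image plays the vision step; and a plain max() replaces the LLM's reasoning.

```python
import csv
import io


def parse_chart_to_csv(image_id):
    """Hypothetical tool stub; a real tool would run a chart parser on the image."""
    return "quarter,revenue\nQ1,10\nQ2,15\nQ3,12\n"


def rows_from_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


def answer_from_chart(image_id, classify_image, tools):
    """Vision -> tool -> reasoning, with a trace of each step for auditing."""
    trace = []
    kind = classify_image(image_id)  # vision step (assumed callable)
    trace.append(("vision", kind))
    if kind != "chart":
        return None, trace
    csv_text = tools["parse_chart_to_csv"](image_id)  # tool step
    trace.append(("tool", "parse_chart_to_csv"))
    rows = rows_from_csv(csv_text)
    # Reasoning step: a stand-in for the LLM, answering "which quarter was best?"
    best = max(rows, key=lambda r: float(r["revenue"]))
    trace.append(("reason", "max revenue"))
    return best["quarter"], trace


answer, trace = answer_from_chart(
    "img-001",
    classify_image=lambda _id: "chart",
    tools={"parse_chart_to_csv": parse_chart_to_csv},
)
print(answer)  # Q2
```

The numbers the model reasons over come from the tool's output, not from the model reading pixels, which is what makes the answer checkable.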
11. Evaluation
- VQA (Visual Question Answering) — image + question, expect short answer.
- MMMU — multimodal multi-discipline understanding; college-level.
- MathVista — visual math.
- MMBench — broad rubric-based eval.
- OCRBench — OCR-style tasks.
- VideoMME — video understanding at length.
Same caveats as LLM eval (S25): contamination, judge bias, your-product-eval > public benchmark.
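For a concrete sense of how a VQA score is computed, here is the commonly quoted simplified form of the VQA accuracy metric: an answer counts as fully correct if at least three human annotators gave it. The official evaluation script applies heavier answer normalization and averages over annotator subsets, so this sketch is an approximation.

```python
def normalize(answer):
    # Light normalization only: lowercase and collapse whitespace.
    return " ".join(answer.lower().strip().split())


def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching humans / 3, 1)."""
    pred = normalize(prediction)
    matches = sum(1 for a in human_answers if normalize(a) == pred)
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(examples):
    """examples: iterable of (prediction, human_answers) pairs."""
    scores = [vqa_accuracy(p, answers) for p, answers in examples]
    return sum(scores) / len(scores) if scores else 0.0
```

Partial credit is the notable design choice: an answer given by only one or two annotators earns a third or two thirds of a point instead of zero.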
12. New failure modes
- Hallucinating across modalities — model invents details not in the image ("the man is wearing a red hat" — no hat).
- Reading text in image wrong — OCR errors hidden inside fluent prose.
- Spatial reasoning — "to the left of the cup" still fails about 30% of the time in 2026.
- Chart precision — values read off bar charts can be 10% off; never trust without validation.
- Cross-modal jailbreak — instructions hidden inside images (sticker says "ignore the user").
Mitigations: structured output, verification tools, second-pass validation, never auto-action on visual claims without confirmation.
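Second-pass validation can be as simple as comparing what the VLM read against an independent source and refusing to act on disagreements. A minimal sketch, assuming you have reference values from a parsing tool or the underlying data; the 2% relative tolerance and the function names are illustrative choices.

```python
def disagreements(vlm_values, reference_values, rel_tol=0.02):
    """Return {label: (vlm_value, reference_value)} for every value to re-check.

    A label is flagged when the VLM omitted it or when its reading differs
    from the reference by more than rel_tol (relative to the reference).
    """
    flagged = {}
    for label, ref in reference_values.items():
        seen = vlm_values.get(label)
        if seen is None or abs(seen - ref) > rel_tol * max(abs(ref), 1e-9):
            flagged[label] = (seen, ref)
    return flagged


def safe_to_auto_act(vlm_values, reference_values, rel_tol=0.02):
    """Gate: only proceed without human confirmation when nothing is flagged."""
    return not disagreements(vlm_values, reference_values, rel_tol)
```

For example, a VLM reading of 16.5 against a parsed value of 15.0 is a 10% miss and gets flagged, while 10.1 against 10.0 passes at this tolerance.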
13. Reality check
A pragmatic multimodal stack:
- Use API for general multimodal (GPT-4o, Claude, Gemini).
- For document extraction at scale: open OCR (PaddleOCR) + structured prompt + Pydantic validator + LLM-as-judge sample audit.
- For images-only retrieval: CLIP + Faiss.
- Avoid building your own multimodal training pipeline unless this is the company's IP.
Reading material
Books:
- Deep Learning with PyTorch — Eli Stevens, Luca Antiga, Thomas Viehmann (the vision chapters; the foundation under CLIP/ViT)
- Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the multimodal chapter)
- Generative Deep Learning, 2nd ed. — David Foster (the diffusion + multimodal chapters)
Papers:
- Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al. 2021 — the contrastive vision-language model that unlocked everything that followed.
- Flamingo: a Visual Language Model for Few-Shot Learning — Alayrac et al. 2022 (DeepMind, NeurIPS) — the perceiver-resampler + cross-attention pattern.
- Visual Instruction Tuning (LLaVA) — Liu et al. 2023 (NeurIPS) — the open-source recipe for instruction-tuned VLMs.
- Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) — Radford et al. 2022 — the canonical multilingual ASR.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models — Li et al. 2023 — the Q-Former bridge between frozen vision + frozen LLM.
- Sigmoid Loss for Language-Image Pre-Training (SigLIP) — Zhai et al. 2023 — the simpler-than-CLIP contrastive loss now used by most modern VLMs.
- Gemini: A Family of Highly Capable Multimodal Models — Google 2023 — the natively multimodal training recipe.
Official docs:
- OpenAI — Vision (GPT-4o) — the production API for the most-used VLM.
- Anthropic — Vision (Claude) — Claude's vision API.
- Hugging Face — Vision-language model docs — the open ecosystem.
- OpenAI — Whisper — STT API + open-source weights.
Blog posts:
- OpenAI — Hello GPT-4o — the announcement; the native multimodal architecture story.
- Lilian Weng — Generalized Visual Language Models — the survey that named the patterns.
- Sebastian Raschka — Understanding Multimodal LLMs — the cleanest practitioner intro.
In-depth research material
- LLaVA — github.com/haotian-liu/LLaVA — ~21k ★, the canonical open-source VLM training + inference repo.
- open_clip — github.com/mlfoundations/open_clip — ~10k ★, the open-source reimplementation of CLIP.
- Whisper — github.com/openai/whisper — ~75k ★, the reference implementation; the foundation under most production ASR.
- whisper.cpp — github.com/ggerganov/whisper.cpp — ~36k ★, the CPU/edge Whisper port.
- transformers — github.com/huggingface/transformers — ~134k ★, the home of every open VLM (LLaVA, Idefics, Qwen-VL, Phi-3-Vision, etc.).
- Idefics3 — Hugging Face — HF's reference open VLM training recipe.
- Florence-2 — Microsoft 2024 — the small (0.7B) unified vision foundation model.
- MMMU benchmark — github.com/MMMU-Benchmark/MMMU — the massive multi-discipline multimodal benchmark used by every frontier lab.
- VQAv2 leaderboard — the original visual question-answering benchmark.
- Coqui / XTTS — github.com/coqui-ai/TTS — ~33k ★, the open-source multilingual TTS / voice-cloning library.
- Meta — SeamlessM4T — speech↔text translation across 100+ languages.
Videos
- Multimodal AI: LLMs that can see and hear — Andrej Karpathy — Karpathy · 25 min — the visual-grounding intuition from the master communicator.
- Multimodal Models — Stanford CS25 — Stanford CS25 · 1 h 02 min — the academic-rigour overview.
- The Illustrated CLIP — Aleksa Gordić — Aleksa Gordić · 46 min — the paper walkthrough that everyone copies.
- GPT-4o demo + technical details — OpenAI — OpenAI Spring Update · 26 min — the launch demo that defined native multimodality.
- How LLaVA was built — Haotian Liu (NeurIPS 2023) — 39 min — the paper's first author walking through the recipe.
LeetCode — Image Overlap
- Link: https://leetcode.com/problems/image-overlap/
- Difficulty: Medium
- Why this problem: Treat two images as binary grids and find the shift that maximizes their overlap. Forces the same kind of spatial-reasoning mental model that VLMs struggle with.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain CLIP's contrastive training and what it enables.
- Sketch the LLaVA architecture (CLIP encoder + projection + LLM).
- Compare late fusion, early fusion, cross-attention fusion.
- Estimate how many tokens an image consumes and design context budget around it.
- Pick the right approach for document extraction (OCR-first, OCR-free, hybrid).
- Solve image-overlap: grid alignment and shift counting, a mirror of the spatial-reasoning gap.