ai ml · intermediate · 12m · 2026-06-09

Multimodal LLMs — Vision, Language, Audio, Tool Use Combined

Session 36 of the 48-session learning series.

Why this session matters

This session sits in the LLM track. By 2026, "LLM" really means "multimodal foundation model": Claude, GPT, and Gemini all handle images, audio, and tools natively. Knowing how the modalities are stitched into a shared representation, and what fails when you cross them, separates AI engineers from prompt-jockeys.

Agenda

How vision is plugged into a transformer (CLIP, ViT, vision-language models)
Audio models — Whisper, audio tokens, native audio in/out
Late fusion vs early fusion vs interleaved modalities
Tool use across modalities — image → action → answer
Evaluation — VQA, MMMU, Audio-LM benchmarks; the new failure modes

Pre-read (skim before the session)

Deep dive

1. What "multimodal" actually means

Three patterns conflated under the same word:

Cross-modal embedding (CLIP): text and image in the same vector space for similarity / search.
Vision-language generation (LLaVA, GPT-4V): model takes image + text, generates text.
Any-to-any (GPT-4o, Gemini): image / audio / video / text in; image / audio / text out.

Each is a different architecture problem. We'll cover all three.

2. CLIP: contrastive image-text pre-training

Train two encoders side by side:

Image encoder (ViT or ResNet) → image embedding.
Text encoder (transformer) → text embedding.

Train with contrastive loss: for a batch of (image, caption) pairs, the diagonal of image_emb · text_emb.T should be high, off-diagonal low.
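The contrastive objective can be sketched in plain Python. This is a minimal illustration, not CLIP's actual training code: the function names are made up for this example, the embeddings are toy 2-d vectors, and the temperature of 0.07 is assumed fixed here (in CLIP it is a learned parameter).

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, caption) pairs."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and caption j
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text: each row should peak on the diagonal
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should peak on the diagonal
    loss_t2i = sum(cross_entropy_row([logits[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Matched pairs on the diagonal give a low loss; swapped captions give a high one.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The two directions (image-to-text and text-to-image) are averaged so that neither encoder can satisfy the objective alone.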

After training:

Zero-shot image classification by checking cos(image, "a photo of a {class}") for each class.
Image search by text query.
Text-image retrieval at scale.
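A sketch of the zero-shot classification step, assuming the encoders already exist. The `TOY_TEXT_EMBEDDINGS` table is a hypothetical stand-in for a real text encoder, and the 3-d vectors are hand-made for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-in for a text encoder: prompt -> toy embedding.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def zero_shot_classify(image_emb, class_names):
    """Score the image against one prompt per class and return the best class."""
    scores = {}
    for name in class_names:
        prompt = f"a photo of a {name}"
        scores[name] = cosine(image_emb, TOY_TEXT_EMBEDDINGS[prompt])
    return max(scores, key=scores.get), scores

# A toy image embedding that points mostly in the "cat" direction.
label, scores = zero_shot_classify([0.8, 0.3, 0.1], ["cat", "dog", "car"])
```

No classifier head is trained: adding a class only means writing one more prompt.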

CLIP-style encoders are the embedding spine of many modern multimodal stacks (Stable Diffusion's text conditioning, most "search images by text" features).

3. From CLIP to vision-language generation

CLIP doesn't generate text. To get a VLM:

image → CLIP-like vision encoder → image tokens (~256–576 per image)
                                          ↓
                       [ MLP projection: image-emb-dim → LLM-emb-dim ]
                                          ↓
                       interleave with text tokens:
                       <image_tokens> "Describe this." <generated_text>
                                          ↓
                                    LLM transformer

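That's the LLaVA recipe: a frozen CLIP encoder + projection + LLM, trained on image-instruction pairs. The projection is a single linear layer in the original LLaVA and a two-layer MLP in LLaVA-1.5. Training first fits the projection alone with the LLM frozen, then fine-tunes the projection and the LLM together. Cheap to build; impressively capable.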
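The projection step in the diagram can be sketched as follows. This is an illustrative toy with assumed sizes (4-d vision features, 6-d LLM embeddings, 3 image tokens) and a single linear layer with random weights; real models use dimensions in the thousands and trained weights.

```python
import random

def project(image_tokens, weight, bias):
    """Linear projection: map each vision-dim token to one LLM-dim token.

    weight has one row per output dimension (LLM_DIM rows of VISION_DIM values).
    """
    return [[sum(x * w for x, w in zip(token, row)) + b
             for row, b in zip(weight, bias)]
            for token in image_tokens]

random.seed(0)
VISION_DIM, LLM_DIM, NUM_IMAGE_TOKENS = 4, 6, 3  # toy sizes, assumptions only

# Stand-ins for the vision encoder output and the projection parameters.
image_tokens = [[random.gauss(0, 1) for _ in range(VISION_DIM)]
                for _ in range(NUM_IMAGE_TOKENS)]
weight = [[random.gauss(0, 0.1) for _ in range(VISION_DIM)]
          for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

# Stand-in for the embedded text prompt ("Describe this." as 2 toy tokens).
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

projected = project(image_tokens, weight, bias)
# The LLM sees one sequence: projected image tokens followed by text tokens.
llm_input = projected + text_embeddings
```

After projection the LLM cannot tell image tokens from text tokens by shape; they are all vectors of the same width in one sequence.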

4. Native multimodal architectures

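GPT-4o and Gemini go further, as their developers describe them (architectural details are unpublished; Claude 3.5+ accepts image input but its outputs are text):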

Image, audio, video all encoded into tokens that share the model's token space.
Trained end-to-end — no frozen modules.
Generation can output across modalities (text + image, text + audio).

Architectural variants:

Late fusion — encode each modality separately, concat embeddings, pass to LLM. Simplest; least cross-modal reasoning.
Early fusion — interleave tokens from all modalities in one stream from the start. Strongest reasoning across modalities; hardest to train.
Cross-attention fusion (Flamingo) — image tokens accessed via cross-attention in dedicated layers. Compromise; mainstream pre-2024.
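To make the cross-attention variant concrete, here is a minimal single-head sketch in which text positions query image tokens. It deliberately omits the learned query/key/value projections, multiple heads, and the gating that Flamingo adds; all names and values are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, image_keys, image_values):
    """Each text query attends over all image tokens (scaled dot-product)."""
    d = len(image_keys[0])
    outputs, all_weights = [], []
    for q in text_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in image_keys]
        weights = softmax(scores)
        # Weighted sum of image values, one output vector per text position
        out = [sum(w * v[c] for w, v in zip(weights, image_values))
               for c in range(len(image_values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# One text query that aligns with the first of two image tokens.
queries = [[4.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out, weights = cross_attention(queries, keys, values)
```

In early fusion there is no separate query side: image and text tokens sit in one sequence and ordinary self-attention mixes them.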

Trend in 2026: pure early fusion with shared tokeniser.

5. Vision encoder choices

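CLIP-ViT: most common; ViT-L/14 takes 224×224 or 336×336 inputs and emits 256 or 576 patch tokens respectively.
SigLIP: sigmoid pairwise loss instead of CLIP's softmax contrastive loss; holds up better at smaller batch sizes.
DINOv2: self-supervised; better at fine detail, less semantic.
ConvNeXt / EVA-CLIP: alternative backbones (ConvNeXt is a pure CNN, EVA-CLIP a scaled-up ViT); sometimes higher quality.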

For document understanding / OCR-heavy: higher resolution + multi-crop sliding window.

6. Image tokens are expensive

A 336×336 image → 576 vision tokens. That eats your context window fast.

4-image prompt: 2304 tokens for images alone.
High-res / multi-crop: easily 4000+ tokens per image.
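The arithmetic behind these numbers, as a small sketch. It assumes a ViT with 14-pixel patches and no token pooling; the 7-crop layout is a hypothetical example of a multi-crop scheme (say, six tiles plus one thumbnail), not any specific model's policy.

```python
def vision_tokens(side_px, patch_px=14):
    """Patch tokens for a square image: one token per patch."""
    per_side = side_px // patch_px
    return per_side * per_side

def prompt_image_tokens(num_images, side_px=336, crops_per_image=1):
    """Total vision tokens when every crop is encoded at side_px resolution."""
    return num_images * crops_per_image * vision_tokens(side_px)

tokens_224 = vision_tokens(224)                       # 16 x 16 patches
tokens_336 = vision_tokens(336)                       # 24 x 24 patches
four_images = prompt_image_tokens(4)                  # four 336x336 images
multi_crop = prompt_image_tokens(1, crops_per_image=7)  # hypothetical 7 crops

# Share of a hypothetical 8192-token context used by the four images
budget_share = four_images / 8192
```

Token count grows with the square of the image side, which is why halving resolution cuts the image cost to roughly a quarter.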

Inference cost scales with this. Practical advice:

Resize aggressively before sending.
Use a smaller model for image-only queries; reserve the big model for cross-modal reasoning.
Cache image embeddings if you'll re-query the same image multiple times.
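A sketch of the caching advice, keyed by a content hash so the same bytes are never encoded twice. It assumes you run the vision encoder yourself (hosted APIs generally do not expose reusable image embeddings); `fake_embed` is a hypothetical stand-in for that encoder.

```python
import hashlib

class EmbeddingCache:
    """In-memory cache of image embeddings keyed by SHA-256 of the image bytes."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # how many times the encoder actually ran

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(image_bytes)
        return self._store[key]

def fake_embed(image_bytes):
    # Hypothetical stand-in for an expensive vision-encoder call.
    return [len(image_bytes), sum(image_bytes) % 251]

cache = EmbeddingCache(fake_embed)
first = cache.get(b"\x89PNG-demo-bytes")
second = cache.get(b"\x89PNG-demo-bytes")  # served from the cache
```

Hashing the bytes rather than the filename means a renamed or re-uploaded copy of the same image still hits the cache.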

7. Audio in / audio out

Whisper-style encoders are dominant for audio→text. Modern any-to-any models:

Audio in: mel-spectrogram → audio tokens (Whisper-like encoder) → interleaved with text tokens.
Audio out: model emits audio tokens; a neural codec decoder (SoundStream, EnCodec) converts them to a waveform.
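For a sense of how much sequence length audio consumes, the sketch below works through the Whisper front end: 16 kHz audio, a 10 ms hop between spectrogram frames, and a stride-2 convolution before the transformer. Other audio encoders use different rates, so treat the constants as specific to Whisper.

```python
SAMPLE_RATE = 16_000   # samples per second expected by Whisper
HOP_LENGTH = 160       # samples between mel frames (10 ms)
CONV_STRIDE = 2        # downsampling in the encoder's convolutional stem

def mel_frames(seconds):
    """Number of mel-spectrogram frames for a clip of the given length."""
    return seconds * SAMPLE_RATE // HOP_LENGTH

def encoder_positions(seconds):
    """Sequence length the transformer encoder sees after the conv stem."""
    return mel_frames(seconds) // CONV_STRIDE

frames_30s = mel_frames(30)            # Whisper's fixed 30-second window
positions_30s = encoder_positions(30)
positions_per_second = encoder_positions(1)
```

At 50 positions per second, a minute of speech costs about as much sequence length as five 336×336 images at 576 tokens each.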

End-to-end voice latency: OpenAI reported GPT-4o responding to audio in as little as 232 ms, about 320 ms on average. The bottleneck moved from speech recognition to LLM decoding.

8. Document understanding — the killer business use case

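OCR-first: PaddleOCR for text, LayoutLM for layout-aware encoding of the OCR output; extract text + structure; pass to LLM.
OCR-free: feed the image directly to a VLM (or a dedicated OCR-free model such as Donut) and ask it to extract a table. Works for clean docs, fails on dense layouts.
Hybrid: OCR + VLM ensemble. What most production systems use.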

For invoices, contracts, forms: hybrid + structured output schema + per-field validation is the only reliable approach.
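A sketch of what "structured output schema + per-field validation" can look like, using only the standard library. The `Invoice` fields, the invoice-number pattern, and the allowed currencies are assumptions for illustration; a real pipeline would take them from the business rules.

```python
import json
import re
from dataclasses import dataclass
from datetime import date

@dataclass
class Invoice:
    invoice_number: str
    issue_date: date
    total: float
    currency: str

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed whitelist

def validate_invoice(raw_json):
    """Parse model output and return (Invoice or None, list of failed fields)."""
    data = json.loads(raw_json)
    errors = []

    number = str(data.get("invoice_number", ""))
    if not re.fullmatch(r"[A-Z0-9-]{3,20}", number):
        errors.append("invoice_number")

    try:
        issued = date.fromisoformat(str(data.get("issue_date", "")))
    except ValueError:
        issued = None
        errors.append("issue_date")

    try:
        total = float(data.get("total"))
        if total < 0:
            raise ValueError("negative total")
    except (TypeError, ValueError):
        total = None
        errors.append("total")

    currency = str(data.get("currency", ""))
    if currency not in ALLOWED_CURRENCIES:
        errors.append("currency")

    if errors:
        return None, errors
    return Invoice(number, issued, total, currency), []

good, good_errors = validate_invoice(
    '{"invoice_number": "INV-001", "issue_date": "2026-01-15", '
    '"total": "199.90", "currency": "EUR"}')
bad, bad_errors = validate_invoice(
    '{"invoice_number": "INV-002", "issue_date": "15/01/2026", '
    '"total": "-5", "currency": "EUR"}')
```

Returning the list of failed fields, rather than a single pass/fail flag, lets the pipeline re-prompt or route only the broken fields to a human.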

9. Multimodal RAG

When your corpus has images, charts, PDFs:

Embed both text and image with a CLIP-class encoder.
Index in a vector DB with type tag.
At query time, retrieve top-K across both; pass everything to a VLM.

Easier said than done: chart interpretation is still a weakness. A chart-aware prompt plus structured extraction usually beats naive image RAG, because the numbers reach the model as text instead of pixels.
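The three retrieval steps can be sketched with an in-memory index. The item ids and 3-d vectors are toy values, and `MultimodalIndex` is a hypothetical class standing in for a real vector database; it assumes text and images were embedded into the same space.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

class MultimodalIndex:
    """Toy vector index where every entry carries a modality tag."""

    def __init__(self):
        self.items = []  # (item_id, modality, embedding)

    def add(self, item_id, modality, embedding):
        self.items.append((item_id, modality, embedding))

    def search(self, query_emb, k=3, modality=None):
        """Top-k by cosine similarity, optionally restricted to one modality."""
        candidates = [it for it in self.items
                      if modality is None or it[1] == modality]
        ranked = sorted(candidates,
                        key=lambda it: cosine(query_emb, it[2]),
                        reverse=True)
        return [(it[0], it[1]) for it in ranked[:k]]

index = MultimodalIndex()
index.add("report.pdf#p3-text", "text", [0.9, 0.1, 0.0])
index.add("report.pdf#p3-chart", "image", [0.8, 0.2, 0.1])
index.add("logo.png", "image", [0.0, 0.1, 0.9])

hits = index.search([1.0, 0.0, 0.0], k=2)                     # both modalities
image_hits = index.search([1.0, 0.0, 0.0], k=1, modality="image")
```

The modality tag is what lets the caller decide how to present each hit to the VLM: text chunks go in as text, image hits as images or as extracted tables.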

10. Tool use × multimodal

The compound:

Vision identifies a chart → ask parse_chart_to_csv() tool → retrieve numbers → LLM reasons → answer.
Vision identifies a UI screenshot → call read_button(label) → click → screenshot again → continue.
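The chart flow above, reduced to its control structure. Everything here is a stand-in: `parse_chart_to_csv` returns canned data instead of reading an image, and the tool call that a model would emit is hard-coded, so this only illustrates the dispatch and the reasoning over structured output.

```python
import csv
import io

def parse_chart_to_csv(image_id):
    # Hypothetical tool: a real one would run chart extraction on the image.
    return "quarter,revenue\nQ1,120\nQ2,150\n"

TOOLS = {"parse_chart_to_csv": parse_chart_to_csv}

def run_tool_call(call):
    """Dispatch a tool call of the form {"name": ..., "arguments": {...}}."""
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

def answer_growth(image_id):
    # Step 1: the model decides the image is a chart and emits a tool call.
    # (Hard-coded here; in a real loop this comes from the model's output.)
    call = {"name": "parse_chart_to_csv", "arguments": {"image_id": image_id}}
    # Step 2: the runtime executes the tool and gets numbers back as text.
    rows = list(csv.DictReader(io.StringIO(run_tool_call(call))))
    # Step 3: reasoning happens over exact values, not pixel estimates.
    first = float(rows[0]["revenue"])
    last = float(rows[-1]["revenue"])
    return (last - first) / first

growth = answer_growth("chart-001")
```

Rejecting unknown tool names in the dispatcher matters more in the multimodal case, since text inside an image can try to steer the model toward calls the user never asked for.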

This is the "computer use" / agentic-vision frontier. Anthropic shipped computer use in public beta in October 2024, and OpenAI followed with Operator in January 2025. Latency and reliability are the open problems.

11. Evaluation

VQA (Visual Question Answering) — image + question, expect short answer.
MMMU — multimodal multi-discipline understanding; college-level.
MathVista — visual math.
MMBench — broad rubric-based eval.
OCRBench — OCR-style tasks.
VideoMME — video understanding at length.

Same caveats as LLM eval (S25): contamination, judge bias, your-product-eval > public benchmark.
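For a product-specific eval, a scorer in the style of the VQA accuracy metric is a reasonable starting point. The sketch below uses the commonly quoted simplified form, min(matching human answers / 3, 1), with only lowercase and whitespace normalization; the official evaluation applies more normalization and averages over annotator subsets.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: full credit once 3 annotators agree."""
    norm = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == norm)
    return min(matches / 3.0, 1.0)

def mean_accuracy(examples):
    """Average score over (prediction, human_answers) pairs."""
    return sum(vqa_accuracy(p, h) for p, h in examples) / len(examples)

answers = ["red", "red", "red", "maroon"]  # toy annotator answers
full_credit = vqa_accuracy("Red", answers)        # 3 matches
partial_credit = vqa_accuracy("maroon", answers)  # 1 match
no_credit = vqa_accuracy("blue", answers)         # 0 matches
overall = mean_accuracy([("Red", answers), ("maroon", answers), ("blue", answers)])
```

Partial credit reflects genuine annotator disagreement, which is common for colour, count, and "what is this" questions.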

12. New failure modes

Hallucinating across modalities: model invents details not in the image ("the man is wearing a red hat" when there is no hat).
Reading text in image wrong: OCR errors hidden inside fluent prose.
Spatial reasoning: "to the left of the cup" questions still fail roughly 30% of the time in 2026.
Chart precision: values read off bar charts can be 10% off; never trust without validation.
Cross-modal jailbreak: instructions hidden inside images (a sticker that says "ignore the user").

Mitigations: structured output, verification tools, second-pass validation, never auto-action on visual claims without confirmation.
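One way to implement second-pass validation for chart readings: compare what the VLM read against an independent source (a parsing tool, the underlying data) and hold anything outside a tolerance for confirmation. The 2% tolerance and the function names are assumptions for illustration.

```python
def within_tolerance(vlm_value, reference_value, rel_tol=0.02):
    """True when the VLM's reading is within rel_tol of the reference."""
    if reference_value == 0:
        return abs(vlm_value) <= rel_tol
    return abs(vlm_value - reference_value) / abs(reference_value) <= rel_tol

def review_chart_reading(vlm_values, reference_values, rel_tol=0.02):
    """Flag every label that is missing a reference or disagrees with it."""
    flagged = [label for label, value in vlm_values.items()
               if label not in reference_values
               or not within_tolerance(value, reference_values[label], rel_tol)]
    return {"auto_accept": not flagged, "needs_human_review": flagged}

# Q1 is off by under 1%; Q2 is off by 10%, the error size noted above.
result = review_chart_reading(
    vlm_values={"Q1": 120.0, "Q2": 165.0},
    reference_values={"Q1": 121.0, "Q2": 150.0},
)
```

A value with no reference at all is flagged rather than accepted, which keeps the default on the side of confirmation.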

13. Reality check

A pragmatic multimodal stack:

Use API for general multimodal (GPT-4o, Claude, Gemini).
For document extraction at scale: open OCR (PaddleOCR) + structured prompt + Pydantic validator + LLM-as-judge sample audit.
For image-only retrieval: CLIP + Faiss.
Avoid building your own multimodal training pipeline unless this is the company's IP.

Reading material

Books:

Deep Learning with PyTorch — Eli Stevens, Luca Antiga, Thomas Viehmann (the vision chapters; the foundation under CLIP/ViT)
Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the multimodal chapter)
Generative Deep Learning, 2nd ed. — David Foster (the diffusion + multimodal chapters)

Papers:

Learning Transferable Visual Models From Natural Language Supervision (CLIP) — Radford et al. 2021 — the contrastive vision-language model that unlocked everything that followed.
Flamingo: a Visual Language Model for Few-Shot Learning — Alayrac et al. 2022 (DeepMind, NeurIPS) — the perceiver-resampler + cross-attention pattern.
Visual Instruction Tuning (LLaVA) — Liu et al. 2023 (NeurIPS) — the open-source recipe for instruction-tuned VLMs.
Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) — Radford et al. 2022 — the canonical multilingual ASR.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models — Li et al. 2023 — the Q-Former bridge between frozen vision + frozen LLM.
Sigmoid Loss for Language-Image Pre-Training (SigLIP) — Zhai et al. 2023 — the simpler-than-CLIP contrastive loss now used by most modern VLMs.
Gemini: A Family of Highly Capable Multimodal Models — Google 2023 — the natively multimodal training recipe.

Official docs:

OpenAI — Vision (GPT-4o) — the production API for the most-used VLM.
Anthropic — Vision (Claude) — Claude's vision API.
Hugging Face — Vision-language model docs — the open ecosystem.
OpenAI — Whisper — STT API + open-source weights.

Blog posts:

OpenAI — Hello GPT-4o — the announcement; the native multimodal architecture story.
Lilian Weng — Generalized Visual Language Models — the survey that named the patterns.
Sebastian Raschka — Understanding Multimodal LLMs — the cleanest practitioner intro.

In-depth research material

LLaVA — github.com/haotian-liu/LLaVA — ~21k ★, the canonical open-source VLM training + inference repo.
open_clip — github.com/mlfoundations/open_clip — ~10k ★, the open-source reimplementation of CLIP.
Whisper — github.com/openai/whisper — ~75k ★, the reference implementation; the foundation under most production ASR.
whisper.cpp — github.com/ggerganov/whisper.cpp — ~36k ★, the CPU/edge Whisper port.
transformers — github.com/huggingface/transformers — ~134k ★, the home of every open VLM (LLaVA, Idefics, Qwen-VL, Phi-3-Vision, etc.).
Idefics3 — Hugging Face — HF's reference open VLM training recipe.
Florence-2 — Microsoft 2024 — the small (0.7B) unified vision foundation model.
MMMU benchmark — github.com/MMMU-Benchmark/MMMU — the massive multi-discipline multimodal benchmark used by every frontier lab.
VQAv2 leaderboard — the original visual question-answering benchmark.
Coqui / XTTS — github.com/coqui-ai/TTS — ~33k ★, the open-source multilingual TTS / voice-cloning library.
Meta — SeamlessM4T — speech↔text translation across 100+ languages.

Videos

Multimodal AI: LLMs that can see and hear — Andrej Karpathy — Karpathy · 25 min — the visual-grounding intuition from the master communicator.
Multimodal Models — Stanford CS25 — Stanford CS25 · 1 h 02 min — the academic-rigour overview.
The Illustrated CLIP — Aleksa Gordić — Aleksa Gordić · 46 min — the paper walkthrough that everyone copies.
GPT-4o demo + technical details — OpenAI — OpenAI Spring Update · 26 min — the launch demo that defined native multimodality.
How LLaVA was built — Haotian Liu (NeurIPS 2023) — 39 min — the paper's first author walking through the recipe.

LeetCode — Image Overlap

Link: https://leetcode.com/problems/image-overlap/
Difficulty: Medium
Why this problem: Given two binary image grids, find the shift that maximizes their overlap. Forces the same kind of spatial-reasoning mental model that VLMs struggle with.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain CLIP's contrastive training and what it enables.
Sketch the LLaVA architecture (CLIP encoder + projection + LLM).
Compare late fusion, early fusion, cross-attention fusion.
Estimate how many tokens an image consumes and design context budget around it.
Pick the right approach for document extraction (OCR-first, OCR-free, hybrid).
Solve image-overlap — grid alignment & shift counting, mirror of the spatial-reasoning gap.

