Memory Model, GC, Heap, GC Leaks, Profiling
Session 20 of the 48-session learning series.
Why this session matters
This is Session 20 of 48 in the OOP & Languages track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- Memory layout — stack, heap, code, data segments
- Reference counting vs tracing GC; generational and compacting GCs
- Python's memory model — refcount + cycle collector; Java/Go GC
- Memory leaks in managed languages — cycles, caches, listeners
- Profiling and finding leaks — heap snapshots, allocators, py-spy
Pre-read (skim before the session)
- Pythonspeed — Python's memory model
- Java GC Basics (Oracle)
- Go GC Guide
- V8 Garbage Collection (Orinoco)
Deep dive
1. Process memory layout
high addr ┌──────────────────────┐
│ kernel space │
├──────────────────────┤
│ stack (grows ↓) │
│ │ │
│ ▼ │
│ │
│ ▲ │
│ │ │
│ heap (grows ↑) │
├──────────────────────┤
│ BSS (zero-init) │
│ data (init) │
│ text (code) │
low addr └──────────────────────┘
- Stack — fast, per-thread, LIFO. Local vars, return addresses. Small (default ~1–8 MB).
- Heap — slow(er), shared, free-form. Malloc / new / object allocations.
- Text — compiled code. Read-only, shared between processes via mmap.
- Data / BSS — globals.
2. Stack vs heap allocation cost
Stack push = sub rsp, N. 1 ns.
Heap malloc = allocator metadata lookup, bin search, ~50–200 ns.
Object pooling, arenas, and slab allocators exist to dodge this cost on hot paths.
3. Manual vs automatic memory management
Manual (C, C++) — you malloc / free. Mistakes = use-after-free, double-free, leaks.
RAII (C++, Rust) — destructors run on scope exit. Eliminates most manual errors. Rust's ownership system is RAII with compile-time enforcement.
Garbage collected (Java, Python, Go, JS, C#) — runtime tracks reachability, frees the unreachable.
4. Tracing GC — the family tree
Mark-sweep — walk all roots → mark reachable; sweep frees the rest. Simple, stops the world, fragments memory.
Mark-compact — like mark-sweep but compacts survivors to one end. Eliminates fragmentation; more expensive copy.
Copying (semi-space) — divide heap into two halves; allocate in one; on GC copy live to the other and flip. Fast for young objects, halves usable heap.
Generational — most objects die young (the generational hypothesis). Two/three generations (young, old, sometimes perm). Young collected often + cheaply; old rarely + expensively. The default for Java, .NET, JS.
Concurrent / incremental — do GC work concurrently with the mutator. Trades throughput for lower pause time. Java's G1, ZGC, Shenandoah; Go's concurrent collector.
5. Python's memory model
CPython uses reference counting as primary mechanism:
PyObject *obj = ...;
Py_INCREF(obj); // refcount++
Py_DECREF(obj); // refcount--; if (refcount == 0) free
Deterministic — object freed the moment last ref drops. Great for files, sockets (no defer close). Cost: every assignment touches a counter (lots of cache traffic).
Refcount can't free cycles (A → B → A). Python's cycle collector runs periodically (gc module) — generational mark-sweep over container objects only.
import gc
gc.collect() # force a cycle collection
gc.set_threshold(700, 10, 10)
gc.disable() # if you really know what you're doing
6. Java GC — the modern menu
- Serial GC — single thread, stop-the-world. Tiny heaps only.
- Parallel GC — multi-thread STW. Highest throughput; long pauses.
- G1 GC (default since Java 9) — concurrent + STW; region-based; tunable pause target.
- ZGC — sub-ms pauses on TB heaps. Higher CPU overhead.
- Shenandoah (Red Hat) — like ZGC; concurrent compaction.
Tuning rule of thumb:
- < 4 GB heap + throughput priority → Parallel GC.
- 4–32 GB heap + balanced → G1.
-
32 GB heap or low-latency requirement → ZGC.
7. Go GC
Concurrent mark-sweep, non-generational, write-barrier-based. Pause targets <1 ms by default. Tunable via GOGC env var (% of live heap to grow before next GC; default 100% means GC triggers when heap = 2× live).
GOGC=off disables GC (debugging only). GOMEMLIMIT (1.19+) sets a soft heap ceiling — pairs well with container memory limits.
8. Memory leaks in managed languages
GC frees the unreachable; leaks happen when objects stay reachable unintentionally:
- Caches without eviction.
dictgrowing forever. Usefunctools.lru_cache(maxsize=...),weakref.WeakValueDictionary. - Event listeners not unregistered. Especially in GUIs, web frontends, observers.
- Closures capturing big state.
def outer(huge): return lambda: huge[0]—hugelives as long as the closure. - Thread-locals. Long-lived threads with growing thread-local state.
- Global registries. Singletons accumulating refs.
- Module-level mutables.
LOGS = []that grows forever.
9. Finding leaks
Python:
import tracemalloc
tracemalloc.start()
# ... run workload ...
snap = tracemalloc.take_snapshot()
top = snap.statistics("lineno")[:10]
for s in top: print(s)
Or objgraph to visualise reference chains; memray (Bloomberg) for sample-based heap profiling — production-safe.
Java: jmap -dump:format=b,file=heap.bin \<pid>, open in MAT (Eclipse Memory Analyzer). Look at "Leak Suspects" report.
Node.js: --inspect, take heap snapshot in Chrome DevTools. Compare snapshots over time — growing classes = leak suspects.
Go: pprof heap profile. go tool pprof -alloc_space heap.pprof.
10. Cache hierarchy — the other memory cost
register 1 cycle (~0.3 ns)
L1 cache 4 cycles (~1 ns)
L2 cache 12 cycles (~3 ns)
L3 cache 40 cycles (~10 ns)
DRAM 200 cycles (~50–100 ns)
NVMe SSD ~50 μs
network ~ms
Cache-friendly code:
- Sequential access > random (prefetcher loves it).
- Arrays of structs (SoA) vs structs of arrays — depends on access pattern.
- Avoid pointer chasing (linked lists vs arrays).
This is why NumPy, polars, and Arrow exist — columnar layout + vectorised ops keep the CPU pipeline fed.
11. Allocators
Inside malloc:
- glibc ptmalloc / jemalloc / mimalloc — general-purpose; mimalloc is the fastest in benchmarks.
- Arena / bump allocator — allocate from a pre-sized chunk by bumping a pointer. Free everything at once (e.g., per-request arena in a web server).
- Slab allocator — pre-allocated free lists for fixed object sizes (Linux kernel).
- Pool / freelist — return-to-pool instead of free. Connection pools, buffer pools.
For high-allocation Python services, switching to jemalloc (LD_PRELOAD=/path/to/libjemalloc.so) often cuts RSS by 20–40% by reducing fragmentation.
12. RSS vs heap vs working set
top shows RSS (Resident Set Size) — pages currently in RAM. Includes:
- Heap that's actually touched.
- Shared libraries (overcounted).
- Page cache (kernel may evict to make room).
heap (in your GC stats) is just what the runtime tracks. RSS is usually higher (allocator overhead, fragmentation, native libs).
working set (cloud term) = pages actually used recently. Pricing on Azure / GCP often uses this.
Don't be alarmed by RSS > heap by 30–50%; that's normal.
Reading material
Books:
- Java Performance, 2nd ed. — Scott Oaks (the chapters on garbage collectors are gold)
- The Garbage Collection Handbook, 2nd ed. — Jones, Hosking, Moss (the canonical GC reference)
- High Performance Python, 2nd ed. — Micha Gorelick & Ian Ozsvald (profiling chapters; py-spy, memory_profiler)
- Systems Performance, 2nd ed. — Brendan Gregg (USE method, flame graphs; how to actually find the leak)
Papers:
- A Generational Mostly-Concurrent Garbage Collector — Printezis & Detlefs 2000 — foundations behind CMS / G1.
- The Z Garbage Collector — Tene, Iyengar, Wolf 2018 (OOPSLA) — ZGC; sub-ms pauses on 100GB+ heaps.
- Shenandoah: An Open-Source Concurrent Compacting GC — Flood et al. 2016 — Red Hat's pauseless GC.
Official docs:
- CPython memory management — Python docs — obmalloc, arena/pool/block model.
- Python
gcmodule — reference-count + generational cycle collector. - JVM — HotSpot VM Options — G1, ZGC, Shenandoah flags.
- Go GC Guide — the tri-color concurrent mark-sweep, GOGC, GOMEMLIMIT.
- V8 Blog — Trash Talk: Orinoco GC — JS's generational + incremental GC.
Blog posts:
- Memray: a memory profiler for Python — Bloomberg Engineering — the most useful Python memory tool of the last 5 years.
- Memory profiling — Python Speed (Itamar Turner-Trauring) — the practical toolset and decision tree.
- Aleksey Shipilëv's GC blog — the JVM GC writing nobody else does.
In-depth research material
- Memray — github.com/bloomberg/memray — ~14k ★, the Bloomberg Python memory profiler.
- py-spy — github.com/benfred/py-spy — ~13k ★, sampling profiler, attaches to a running process.
- Scalene — github.com/plasma-umass/scalene — ~12k ★, CPU + GPU + memory, line-level.
- Pyroscope / Grafana Pyroscope — github.com/grafana/pyroscope — ~10k ★, continuous profiling for production.
- async-profiler — github.com/async-profiler/async-profiler — ~7k ★, the JVM profiler that doesn't lie (safepoint-bias-free).
- Crafting Interpreters — Garbage Collection chapter — build a mark-sweep GC from scratch; clears the concepts forever.
- Discord — Why Discord is switching from Go to Rust — a real story about GC pause tails.
- Twitter — GC tuning for Twitter Cache — historical, but still cited.
- Brendan Gregg — USE Method — the framework for diagnosing any resource issue.
- Aleksey Shipilëv — JVM GC Pauses Tutorial — "JVM Anatomy Quarks" series; tiny essays on big GC questions.
Videos
- Garbage Collection in Python — PyConDE / Pablo Galindo · 35 min — ref counts + cyclic GC + recent improvements.
- Memory Management in Python — Talin — 35 min — obmalloc internals, arenas, fragmentation.
- Diagnosing memory leaks in Python with Memray — Pablo Galindo — Pablo Galindo · 40 min — by Memray's author.
- The Garbage Collection Handbook in 30 minutes — Richard Jones — 30 min — the GC author tours the algorithm landscape.
- Generational GC in JVM (G1, ZGC, Shenandoah) — Monica Beckwith — 50 min — deep JVM GC walk-through by a JVM perf engineer.
LeetCode — Lfu Cache
- Link: https://leetcode.com/problems/lfu-cache/
- Difficulty: Hard
- Why this problem: Two hash-maps + per-frequency LL; bookkeeping for min-freq. Real eviction policy.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Diagram stack/heap/data/text layout of a process.
- Explain why Python needs both refcount AND a cycle collector.
- Pick a Java GC for 8 GB throughput-bound vs 64 GB latency-bound.
- Identify 3 common leak patterns in your favourite language.
- Use tracemalloc or memray to find a leak in a small Python script.
- Solve
lfu-cache— two hash-maps + per-freq doubly-linked list; a real eviction policy.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.