engineeringintermediate 12m2026-06-09

Memory Model, GC, Heap, GC Leaks, Profiling

Session 20 of the 48-session learning series.

Why this session matters

This is Session 20 of 48 in the OOP & Languages track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

Memory layout — stack, heap, code, data segments
Reference counting vs tracing GC; generational and compacting GCs
Python's memory model — refcount + cycle collector; Java/Go GC
Memory leaks in managed languages — cycles, caches, listeners
Profiling and finding leaks — heap snapshots, allocators, py-spy

Pre-read (skim before the session)

Deep dive

1. Process memory layout

high addr  ┌──────────────────────┐
           │   kernel space       │
           ├──────────────────────┤
           │   stack (grows ↓)    │
           │       │              │
           │       ▼              │
           │                      │
           │       ▲              │
           │       │              │
           │   heap (grows ↑)     │
           ├──────────────────────┤
           │   BSS (zero-init)    │
           │   data (init)        │
           │   text (code)        │
low addr   └──────────────────────┘

Stack — fast, per-thread, LIFO. Local vars, return addresses. Small (default ~1–8 MB).
Heap — slow(er), shared, free-form. Malloc / new / object allocations.
Text — compiled code. Read-only, shared between processes via mmap.
Data / BSS — globals.

2. Stack vs heap allocation cost

Stack push = sub rsp, N. 1 ns.

Heap malloc = allocator metadata lookup, bin search, ~50–200 ns.

Object pooling, arenas, and slab allocators exist to dodge this cost on hot paths.

3. Manual vs automatic memory management

Manual (C, C++) — you malloc / free. Mistakes = use-after-free, double-free, leaks.

RAII (C++, Rust) — destructors run on scope exit. Eliminates most manual errors. Rust's ownership system is RAII with compile-time enforcement.

Garbage collected (Java, Python, Go, JS, C#) — runtime tracks reachability, frees the unreachable.

4. Tracing GC — the family tree

Mark-sweep — walk all roots → mark reachable; sweep frees the rest. Simple, stops the world, fragments memory.

Mark-compact — like mark-sweep but compacts survivors to one end. Eliminates fragmentation; more expensive copy.

Copying (semi-space) — divide heap into two halves; allocate in one; on GC copy live to the other and flip. Fast for young objects, halves usable heap.

Generational — most objects die young (the generational hypothesis). Two/three generations (young, old, sometimes perm). Young collected often + cheaply; old rarely + expensively. The default for Java, .NET, JS.

Concurrent / incremental — do GC work concurrently with the mutator. Trades throughput for lower pause time. Java's G1, ZGC, Shenandoah; Go's concurrent collector.

5. Python's memory model

CPython uses reference counting as primary mechanism:

PyObject *obj = ...;
Py_INCREF(obj);  // refcount++
Py_DECREF(obj);  // refcount--; if (refcount == 0) free

Deterministic — object freed the moment last ref drops. Great for files, sockets (no defer close). Cost: every assignment touches a counter (lots of cache traffic).

Refcount can't free cycles (A → B → A). Python's cycle collector runs periodically (gc module) — generational mark-sweep over container objects only.

import gc
gc.collect()        # force a cycle collection
gc.set_threshold(700, 10, 10)
gc.disable()        # if you really know what you're doing

Serial GC — single thread, stop-the-world. Tiny heaps only.
Parallel GC — multi-thread STW. Highest throughput; long pauses.
G1 GC (default since Java 9) — concurrent + STW; region-based; tunable pause target.
ZGC — sub-ms pauses on TB heaps. Higher CPU overhead.
Shenandoah (Red Hat) — like ZGC; concurrent compaction.

Tuning rule of thumb:

< 4 GB heap + throughput priority → Parallel GC.
4–32 GB heap + balanced → G1.
32 GB heap or low-latency requirement → ZGC.

7. Go GC

Concurrent mark-sweep, non-generational, write-barrier-based. Pause targets <1 ms by default. Tunable via GOGC env var (% of live heap to grow before next GC; default 100% means GC triggers when heap = 2× live).

GOGC=off disables GC (debugging only). GOMEMLIMIT (1.19+) sets a soft heap ceiling — pairs well with container memory limits.

8. Memory leaks in managed languages

GC frees the unreachable; leaks happen when objects stay reachable unintentionally:

Caches without eviction. dict growing forever. Use functools.lru_cache(maxsize=...), weakref.WeakValueDictionary.
Event listeners not unregistered. Especially in GUIs, web frontends, observers.
Closures capturing big state. def outer(huge): return lambda: huge[0] — huge lives as long as the closure.
Thread-locals. Long-lived threads with growing thread-local state.
Global registries. Singletons accumulating refs.
Module-level mutables. LOGS = [] that grows forever.

9. Finding leaks

Python:

import tracemalloc
tracemalloc.start()
# ... run workload ...
snap = tracemalloc.take_snapshot()
top = snap.statistics("lineno")[:10]
for s in top: print(s)

Or objgraph to visualise reference chains; memray (Bloomberg) for sample-based heap profiling — production-safe.

Java: jmap -dump:format=b,file=heap.bin \<pid>, open in MAT (Eclipse Memory Analyzer). Look at "Leak Suspects" report.

Node.js: --inspect, take heap snapshot in Chrome DevTools. Compare snapshots over time — growing classes = leak suspects.

Go: pprof heap profile. go tool pprof -alloc_space heap.pprof.

10. Cache hierarchy — the other memory cost

register    1   cycle  (~0.3 ns)
L1 cache    4   cycles (~1 ns)
L2 cache    12  cycles (~3 ns)
L3 cache    40  cycles (~10 ns)
DRAM        200 cycles (~50–100 ns)
NVMe SSD            ~50 μs
network              ~ms

Cache-friendly code:

Sequential access > random (prefetcher loves it).
Arrays of structs (SoA) vs structs of arrays — depends on access pattern.
Avoid pointer chasing (linked lists vs arrays).

This is why NumPy, polars, and Arrow exist — columnar layout + vectorised ops keep the CPU pipeline fed.

11. Allocators

Inside malloc:

glibc ptmalloc / jemalloc / mimalloc — general-purpose; mimalloc is the fastest in benchmarks.
Arena / bump allocator — allocate from a pre-sized chunk by bumping a pointer. Free everything at once (e.g., per-request arena in a web server).
Slab allocator — pre-allocated free lists for fixed object sizes (Linux kernel).
Pool / freelist — return-to-pool instead of free. Connection pools, buffer pools.

For high-allocation Python services, switching to jemalloc (LD_PRELOAD=/path/to/libjemalloc.so) often cuts RSS by 20–40% by reducing fragmentation.

12. RSS vs heap vs working set

top shows RSS (Resident Set Size) — pages currently in RAM. Includes:

Heap that's actually touched.
Shared libraries (overcounted).
Page cache (kernel may evict to make room).

heap (in your GC stats) is just what the runtime tracks. RSS is usually higher (allocator overhead, fragmentation, native libs).

working set (cloud term) = pages actually used recently. Pricing on Azure / GCP often uses this.

Don't be alarmed by RSS > heap by 30–50%; that's normal.

Reading material

Books:

Java Performance, 2nd ed. — Scott Oaks (the chapters on garbage collectors are gold)
The Garbage Collection Handbook, 2nd ed. — Jones, Hosking, Moss (the canonical GC reference)
High Performance Python, 2nd ed. — Micha Gorelick & Ian Ozsvald (profiling chapters; py-spy, memory_profiler)
Systems Performance, 2nd ed. — Brendan Gregg (USE method, flame graphs; how to actually find the leak)

Papers:

A Generational Mostly-Concurrent Garbage Collector — Printezis & Detlefs 2000 — foundations behind CMS / G1.
The Z Garbage Collector — Tene, Iyengar, Wolf 2018 (OOPSLA) — ZGC; sub-ms pauses on 100GB+ heaps.
Shenandoah: An Open-Source Concurrent Compacting GC — Flood et al. 2016 — Red Hat's pauseless GC.

Official docs:

CPython memory management — Python docs — obmalloc, arena/pool/block model.
Python gc module — reference-count + generational cycle collector.
JVM — HotSpot VM Options — G1, ZGC, Shenandoah flags.
Go GC Guide — the tri-color concurrent mark-sweep, GOGC, GOMEMLIMIT.
V8 Blog — Trash Talk: Orinoco GC — JS's generational + incremental GC.

Blog posts:

Memray: a memory profiler for Python — Bloomberg Engineering — the most useful Python memory tool of the last 5 years.
Memory profiling — Python Speed (Itamar Turner-Trauring) — the practical toolset and decision tree.
Aleksey Shipilëv's GC blog — the JVM GC writing nobody else does.

In-depth research material

Memray — github.com/bloomberg/memray — ~14k ★, the Bloomberg Python memory profiler.
py-spy — github.com/benfred/py-spy — ~13k ★, sampling profiler, attaches to a running process.
Scalene — github.com/plasma-umass/scalene — ~12k ★, CPU + GPU + memory, line-level.
Pyroscope / Grafana Pyroscope — github.com/grafana/pyroscope — ~10k ★, continuous profiling for production.
async-profiler — github.com/async-profiler/async-profiler — ~7k ★, the JVM profiler that doesn't lie (safepoint-bias-free).
Crafting Interpreters — Garbage Collection chapter — build a mark-sweep GC from scratch; clears the concepts forever.
Discord — Why Discord is switching from Go to Rust — a real story about GC pause tails.
Twitter — GC tuning for Twitter Cache — historical, but still cited.
Brendan Gregg — USE Method — the framework for diagnosing any resource issue.
Aleksey Shipilëv — JVM GC Pauses Tutorial — "JVM Anatomy Quarks" series; tiny essays on big GC questions.

Videos

Garbage Collection in Python — PyConDE / Pablo Galindo · 35 min — ref counts + cyclic GC + recent improvements.
Memory Management in Python — Talin — 35 min — obmalloc internals, arenas, fragmentation.
Diagnosing memory leaks in Python with Memray — Pablo Galindo — Pablo Galindo · 40 min — by Memray's author.
The Garbage Collection Handbook in 30 minutes — Richard Jones — 30 min — the GC author tours the algorithm landscape.
Generational GC in JVM (G1, ZGC, Shenandoah) — Monica Beckwith — 50 min — deep JVM GC walk-through by a JVM perf engineer.

LeetCode — Lfu Cache

Link: https://leetcode.com/problems/lfu-cache/
Difficulty: Hard
Why this problem: Two hash-maps + per-frequency LL; bookkeeping for min-freq. Real eviction policy.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Diagram stack/heap/data/text layout of a process.
Explain why Python needs both refcount AND a cycle collector.
Pick a Java GC for 8 GB throughput-bound vs 64 GB latency-bound.
Identify 3 common leak patterns in your favourite language.
Use tracemalloc or memray to find a leak in a small Python script.
Solve lfu-cache — two hash-maps + per-freq doubly-linked list; a real eviction policy.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Lakehouse — Delta Lake, Iceberg, Hudi, ACID on Object Storage

Function Calling, Tool Use, Agentic Loops