engineering · intermediate · 12m · 2026-06-09

Testing, Mocks, Property-Based Tests, Mutation Testing

Session 41 of the 48-session learning series.

Why this session matters

This session is part of the OOP track. Code without tests is a draft. Many engineers stop at "I wrote some unit tests and they pass" and miss property-based tests, mutation testing, and the discipline of fast/slow tiers. Over a multi-year codebase, investment in test infrastructure compounds.

Agenda

The test pyramid — unit, integration, e2e; what each catches
Mocking — stubs, mocks, fakes, spies; when each is right
Property-based testing — Hypothesis, fast-check; the bug-finding superpower
Mutation testing — measuring the quality of the tests themselves
Test infrastructure — fixtures, parallelism, flakes, coverage thresholds

Pre-read (skim before the session)

Deep dive

1. The pyramid (and the diamond, and the trophy)

        ▲
        e2e          slow, brittle, gives confidence at top
       ─────
     integration     real DB / queue / network
    ────────────
   unit tests       fast, deterministic, isolate one thing

Classic pyramid: unit-heavy, fewer integration, few e2e.

Modern variants:

Trophy (Kent C. Dodds) — heavier on integration than pure pyramid.
Diamond — light on unit, heavy on integration.

The right shape depends on your architecture. With microservices, most of the risk sits at service boundaries, so lean toward integration tests; with a tight monolith, lean toward unit tests.

2. What "unit" actually means

Two schools:

Solitary unit — mock every collaborator. Test the function in isolation.
Sociable unit — use real collaborators when cheap (in-process, no I/O). Mock only external boundaries.

The sociable school catches integration bugs earlier; solitary is faster and easier to debug.

Pragmatic default: sociable units with mocked external boundaries (HTTP, DB if remote, time, randomness).
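A minimal sketch of that default, under stated assumptions: Checkout, Discount, and the weekend rule are hypothetical names invented for illustration. The in-process collaborator (Discount) is used for real, and only the clock, an external boundary, is replaced by injecting a function.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Discount:
    """Real in-process collaborator: cheap, pure, no I/O."""
    percent: int

    def apply(self, cents: int) -> int:
        return cents - (cents * self.percent) // 100


class Checkout:
    """Unit under test. The clock is the only external boundary."""

    def __init__(self, discount: Discount, now: Callable[[], datetime]):
        self._discount = discount
        self._now = now

    def total(self, cents: int) -> int:
        # Hypothetical business rule: the discount applies only on weekends.
        is_weekend = self._now().weekday() >= 5
        return self._discount.apply(cents) if is_weekend else cents


def test_weekend_discount_is_applied():
    saturday = datetime(2024, 6, 8, 12, 0, tzinfo=timezone.utc)
    checkout = Checkout(Discount(percent=10), now=lambda: saturday)
    assert checkout.total(1000) == 900


def test_weekday_price_is_unchanged():
    monday = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
    checkout = Checkout(Discount(percent=10), now=lambda: monday)
    assert checkout.total(1000) == 1000
```

Passing the clock in as a parameter is one option among several; patching the time source with a library such as freezegun achieves the same isolation without changing the constructor.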

3. Stubs vs mocks vs fakes vs spies

Type	Behaviour	Verifies?
Stub	Returns hardcoded values	No
Mock	Expects specific calls; fails if unmet	Yes
Fake	Working implementation, simplified (in-memory DB)	No
Spy	Wraps real object; records calls	Yes

Rule of thumb: prefer fakes, then stubs, then mocks, then spies. Mocks couple tests to implementation details, so a behaviour-preserving refactor can still break them.
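The four kinds of double in the table can be put side by side in a short sketch. Mailer, StubMailer, FakeMailer, and welcome are hypothetical names made up for this example; Mock(spec=...) and Mock(wraps=...) are standard unittest.mock features.

```python
from unittest.mock import Mock


class Mailer:
    """Hypothetical real collaborator that would send email over SMTP."""

    def send(self, to: str, body: str) -> bool:
        raise RuntimeError("no network access in tests")


class StubMailer:
    """Stub: returns a canned answer and records nothing."""

    def send(self, to: str, body: str) -> bool:
        return True


class FakeMailer:
    """Fake: a working but simplified implementation (in-memory outbox)."""

    def __init__(self):
        self.outbox = []

    def send(self, to: str, body: str) -> bool:
        if "@" not in to:
            return False
        self.outbox.append((to, body))
        return True


def welcome(mailer, address: str) -> bool:
    """Unit under test."""
    return mailer.send(address, "Welcome!")


# Stub: assert only on the result.
assert welcome(StubMailer(), "a@example.com") is True

# Fake: assert on observable state, not on how it was reached.
fake = FakeMailer()
assert welcome(fake, "a@example.com") is True
assert fake.outbox == [("a@example.com", "Welcome!")]
assert welcome(fake, "not-an-address") is False

# Mock: assert on the interaction itself.
mock = Mock(spec=Mailer)
mock.send.return_value = True
welcome(mock, "a@example.com")
mock.send.assert_called_once_with("a@example.com", "Welcome!")

# Spy: wrap a real object so calls pass through and are also recorded.
spy = Mock(wraps=fake)
welcome(spy, "b@example.com")
spy.send.assert_called_once_with("b@example.com", "Welcome!")
assert len(fake.outbox) == 2
```

Note that the fake-based assertions would survive a refactor that changes how welcome calls the mailer, while the mock-based assertion pins the exact call.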

4. Test doubles per language

Python: unittest.mock, pytest-mock. For HTTP: responses (for requests), pytest-httpx (its httpx_mock fixture), pytest-httpserver. For time: freezegun.
Node/TS: jest.mock(), sinon, nock for HTTP.
Java: Mockito, WireMock for HTTP.
Go: handcrafted interfaces + table-driven tests (idiomatic; few libs).
Rust: mockall, wiremock.

5. Property-based testing

Instead of asserting add(2, 3) == 5, state a property that must hold for any a, b: add(a, b) == add(b, a). The framework generates many inputs (Hypothesis runs 100 examples per test by default) and shrinks any failure to a minimal example.

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)

Where it wins:

Stateless pure functions (parsers, serialisers, hashers).
Round-trip properties (decode(encode(x)) == x).
Invariants of data structures (BST always sorted, heap property).
Stateful systems and concurrency models (stateful / state-machine testing).

A few hours of property-based testing can surface bugs that years of example-based tests have missed.
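To make the generate-then-shrink idea concrete, here is a toy version built only on the standard library. It is not how Hypothesis is implemented: dedupe, holds, shrink, and find_counterexample are all hypothetical helpers, and this shrinker only shortens the list, whereas real frameworks also simplify the values.

```python
import random


def dedupe(xs):
    """Deliberately buggy: only removes *adjacent* duplicates."""
    out = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


def holds(xs):
    """Property: the result contains no repeated value."""
    result = dedupe(xs)
    return len(result) == len(set(result))


def shrink(xs, prop):
    """Greedily drop single elements while the property still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if not prop(candidate):
                xs, changed = candidate, True
                break
    return xs


def find_counterexample(prop, runs=1000, seed=0):
    """Generate random small lists; return a shrunk failing input or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        xs = [rng.randint(0, 5) for _ in range(rng.randint(0, 10))]
        if not prop(xs):
            return shrink(xs, prop)
    return None


counterexample = find_counterexample(holds)
```

Whatever random list first fails, shrinking reduces it to three elements of the form [a, b, a], which points straight at the non-adjacent duplicate the implementation forgets to handle.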

6. Mutation testing

A mutation-testing framework makes small changes to your code (+ to -, > to <) and re-runs your tests. If the tests still pass → your test suite missed the mutation → the suite is weaker than its coverage suggests.

Tools: PIT (Java), mutmut (Python), Stryker (JS/TS/.NET).

Original:  if x > 0:
Mutant:    if x >= 0:

Result: kill rate (% of mutants caught). As a rough guide, 80%+ is good; under 60% suggests your tests check that code runs, not that it's correct.

Run it on a schedule (weekly or nightly), not per-commit; it's slow.
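The "killed" versus "survived" distinction can be shown by hand with the same > to >= mutant. This is a sketch of the concept only: real tools generate and run mutants automatically, and the function and suite names below are hypothetical.

```python
def is_positive(x: int) -> bool:
    """Original code under test."""
    return x > 0


def is_positive_mutant(x: int) -> bool:
    """What a mutation tool might generate: '>' replaced by '>='."""
    return x >= 0


def weak_suite(fn) -> bool:
    """Covers every line but never probes the boundary."""
    return fn(5) is True and fn(-5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary case that distinguishes '>' from '>='."""
    return weak_suite(fn) and fn(0) is False


def mutant_killed(suite) -> bool:
    """A mutant is killed when the suite passes on the original
    and fails on the mutant."""
    return suite(is_positive) and not suite(is_positive_mutant)


assert mutant_killed(weak_suite) is False   # the mutant survives
assert mutant_killed(strong_suite) is True  # the mutant is killed
```

Both suites give the same line coverage of is_positive; only the kill result tells them apart.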

7. Fixtures and setup

Reuse expensive setup; avoid global state:

@pytest.fixture(scope='module') for one per test module (file).
@pytest.fixture(scope='session') for one per test run.
Use Testcontainers for ephemeral DB / Kafka / Redis in CI.

Anti-pattern: shared mutable state across tests → order-dependent flakes that fail in one run and pass on rerun.

8. Parallelism and isolation

Run tests in parallel for speed:

pytest -n auto (with pytest-xdist).
jest --maxWorkers=<n> (Jest runs test files in parallel by default; the flag caps the worker count).

Each parallel worker needs isolation:

Random schema/DB per worker.
Random ports.
No shared temp files.
No global mocks.

The discipline costs upfront; pays back every CI run.
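A few helpers along these lines cover the first three items of the isolation list. The helper names and the "main" fallback are assumptions of this sketch; PYTEST_XDIST_WORKER is the environment variable pytest-xdist sets in each worker process.

```python
import os
import socket
import tempfile


def worker_id() -> str:
    """pytest-xdist sets PYTEST_XDIST_WORKER (for example 'gw0') per worker.
    The 'main' fallback for non-parallel runs is this sketch's own choice."""
    return os.environ.get("PYTEST_XDIST_WORKER", "main")


def schema_name(prefix: str = "test") -> str:
    """One schema name per worker and process, so parallel runs never collide."""
    return f"{prefix}_{worker_id()}_{os.getpid()}"


def free_port() -> int:
    """Ask the OS for an unused port instead of hardcoding one."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def scratch_dir() -> str:
    """A private temp directory; the caller is responsible for cleanup."""
    return tempfile.mkdtemp(prefix=f"{worker_id()}_")
```

One caveat: free_port releases the port before the test server binds it, so another process can grab it in between. Where possible, let the server itself bind to port 0 and report the port it received.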

9. Flaky tests

A flaky test is worse than no test — it teaches the team to ignore failures.

Common causes:

Timing assumptions (sleep(1) instead of wait_until).
Ordering assumptions on un-ordered output.
Network calls leaking to real services.
Shared state leaking between tests.
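For the first cause, a polling helper removes the guesswork of a fixed sleep. The sketch below is one possible wait_until, not a library function; the Timer stands in for a hypothetical background job.

```python
import threading
import time


def wait_until(predicate, timeout=2.0, interval=0.01) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check at the deadline


def test_background_job_completes():
    done = threading.Event()
    worker = threading.Timer(0.05, done.set)  # hypothetical async job
    worker.start()
    try:
        # Returns as soon as the job finishes; fails only after the timeout.
        assert wait_until(done.is_set, timeout=2.0)
    finally:
        worker.cancel()
        worker.join()
```

The test finishes in roughly the time the job takes, yet tolerates a slow CI machine up to the timeout, which a fixed sleep(1) cannot do in both directions.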

Triage:

Mark flaky → quarantine (for example @pytest.mark.flaky(reruns=2) from the pytest-rerunfailures plugin, which reruns the test on failure).
Fix within a sprint or delete the test.
Track flake rate per test in CI history.

10. Coverage

A measurement, not a goal. 80% line coverage is meaningless if the covered lines are trivial "if x is None" guards, or if the tests execute code without asserting on its results.

Useful coverage practices:

Track per-PR delta — coverage shouldn't drop.
Branch coverage > line coverage.
Mutation testing > both for actual quality signal.

Coverage above 90% is rarely cost-effective except for libraries / safety-critical code.
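A small illustration of why line coverage can mislead, using a hypothetical shipping_cost function: the first test executes every line without checking anything and never takes the non-express path, while the second exercises both branches and asserts on the results.

```python
def shipping_cost(weight_kg: float, express: bool) -> float:
    cost = 5.0 + 2.0 * weight_kg
    if express:
        cost *= 2
    return cost


def test_lines_only():
    # Every line runs, so line coverage reports 100%, yet nothing is
    # asserted and the express=False path is never taken.
    shipping_cost(1.0, express=True)


def test_both_branches():
    assert shipping_cost(1.0, express=True) == 14.0
    assert shipping_cost(1.0, express=False) == 7.0
```

A branch-coverage report would flag the untaken path in the first test; only mutation testing would flag the missing assertion.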

11. Contract / consumer-driven testing

For microservices: each consumer publishes a contract (the responses it expects). Producer's CI runs against all consumer contracts. Catches breaking API changes before deploy.

Tools: Pact, Spring Cloud Contract.

Heavy investment; high payoff at 5+ services.
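The core loop can be sketched without any tooling. Everything here is hypothetical, including the contract format, the service names, and the handler; Pact and Spring Cloud Contract define their own formats and workflows.

```python
# Hypothetical contract format: consumer -> request -> required field types.
CONSUMER_CONTRACTS = {
    "billing-ui": {"GET /users/1": {"id": int, "email": str}},
    "reporting": {"GET /users/1": {"id": int, "created_at": str}},
}


def provider_handle(request: str) -> dict:
    """Stand-in for the producer's real handler."""
    if request == "GET /users/1":
        return {
            "id": 1,
            "email": "a@example.com",
            "created_at": "2024-01-01",
            "plan": "free",  # extra fields do not break any consumer
        }
    raise KeyError(request)


def violations(handler, contracts) -> list:
    """Return (consumer, request, field) for each broken expectation."""
    broken = []
    for consumer, expectations in contracts.items():
        for request, fields in expectations.items():
            response = handler(request)
            for field, expected_type in fields.items():
                if not isinstance(response.get(field), expected_type):
                    broken.append((consumer, request, field))
    return broken


def provider_after_rename(request: str) -> dict:
    """The same handler after a breaking change: a renamed field."""
    response = provider_handle(request)
    response["created"] = response.pop("created_at")
    return response


assert violations(provider_handle, CONSUMER_CONTRACTS) == []
assert violations(provider_after_rename, CONSUMER_CONTRACTS) == [
    ("reporting", "GET /users/1", "created_at")
]
```

The producer's CI fails on the rename and names the affected consumer, before anything is deployed.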

12. Fast / slow tiers

Split tests by speed:

Fast — < 1 ms each; run on every save in IDE; < 5 s total.
Slow — DB, network; run on push; minutes ok.
Nightly — e2e, perf, mutation; hours ok.

Don't make engineers wait 10 min for a unit test rerun.
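One way to implement the split with only the standard library is an opt-in switch for the slow tier. RUN_SLOW is a hypothetical variable name chosen for this sketch; pytest users would more commonly tag tests with markers and select them with -m.

```python
import os
import unittest

RUN_SLOW = os.environ.get("RUN_SLOW") == "1"  # hypothetical switch name


class FastTests(unittest.TestCase):
    def test_parse(self):
        self.assertEqual(int("42"), 42)


@unittest.skipUnless(RUN_SLOW, "slow tier: set RUN_SLOW=1 to include")
class SlowTests(unittest.TestCase):
    def test_database_roundtrip(self):
        # Placeholder for a test that would talk to a real database.
        self.assertTrue(True)


def run_tier() -> unittest.TestResult:
    """Load both tiers; the slow one is skipped unless switched on."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(FastTests),
        loader.loadTestsFromTestCase(SlowTests),
    ])
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Locally and on save the switch stays off, so only the fast tier runs; the push and nightly CI lanes set it to include the rest.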

13. Reality check

A modern test setup checklist:

pytest / jest / your stack's standard runner.
80%+ branch coverage (with caveats above).
Hypothesis / fast-check for pure-function modules.
Testcontainers for integration with real services.
Mutation testing weekly.
Coverage as a PR comment, not a hard gate.
Flaky test budget — < 0.5% flake rate.
Slow tests in a separate CI lane.

The teams that take testing seriously ship faster, not slower. The teams that "don't have time for tests" pay back the debt at 3 AM during outages.

Reading material

Books:

Working Effectively with Legacy Code — Michael Feathers (the seam + characterization-test playbook for code you can't safely change)
xUnit Test Patterns — Gerard Meszaros (the canonical taxonomy of test doubles, fixtures, and smells)
Growing Object-Oriented Software, Guided by Tests — Freeman & Pryce (the original mockist-school TDD book; where roles + responsibilities + collaborators come from)
The Art of Unit Testing (3rd ed.) — Roy Osherove (modern walk-through with the stub/mock/fake distinction made concrete)
Effective Software Testing — Maurício Aniche (property-based + mutation testing + spec-by-example, all in one practitioner book)

Papers:

QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs — Claessen & Hughes 2000 — the founding paper of property-based testing.
An Analysis and Survey of the Development of Mutation Testing — Jia & Harman 2011 — the canonical survey of the mutation-testing field.
Coverage Is Not Strongly Correlated with Test Suite Effectiveness — Inozemtseva & Holmes 2014 — the paper that showed coverage % is a weak signal; mutation score is stronger.

Official docs:

Hypothesis (Python) documentation — the canonical property-based-testing library.
fast-check (TypeScript/JavaScript) docs — the modern TS/JS property tester.
PIT mutation testing for Java — the canonical JVM mutation-testing tool.
mutmut — mutation testing for Python — the Python mutation tester (or Cosmic-Ray as alternative).
Pact — Consumer-driven contract testing — the canonical contract-test platform between services.
Testcontainers — disposable Docker-backed dependencies for integration tests.
pytest documentation — fixtures, parametrize, markers; how to actually structure tiered suites.

Blog posts:

Martin Fowler — Mocks Aren't Stubs — the canonical essay; required reading.
Martin Fowler — Test Pyramid — where the fast-vs-slow tiering discipline comes from.
Hypothesis Works — articles — short essays by the Hypothesis maintainers on property-based testing patterns.
Increment — Testing issue — a Stripe-published collection of senior-engineer essays on testing in production systems.
Google Testing Blog — Test sizes — where the Small/Medium/Large taxonomy used at Google comes from.

In-depth research material

Hypothesis — github.com/HypothesisWorks/hypothesis — ~7.5k ★, the canonical PBT library.
fast-check — github.com/dubzzz/fast-check — ~4k ★, the modern TS property tester used inside Vercel, Stripe, and Snowpack tests.
PIT — github.com/hcoles/pitest — the JVM mutation-testing tool used by JetBrains, Spring, and many enterprise codebases.
Stryker Mutator — github.com/stryker-mutator/stryker-js — mutation testing for JavaScript / TypeScript / C#.
mutmut — github.com/boxed/mutmut — a popular Python mutation tester.
Cosmic-Ray — github.com/sixty-north/cosmic-ray — alternative Python mutation framework (parallel runner).
Pact — github.com/pact-foundation/pact-specification — the canonical CDC test spec.
Testcontainers — github.com/testcontainers/testcontainers-java — ~7k ★, plus per-language ports.
Stripe Engineering — How we built our test platform — Stripe's real test-tiering story at scale.
Google Testing Blog — long-running war stories from Google's test-engineering org.
Eugene Yan — Reliable ML systems with tests — testing patterns applied to ML pipelines.
Hillel Wayne — Why Property-Based? — a practitioner essay on when PBT actually earns its weight.

Videos

Property-Based Testing for Better Code — John Hughes (Curry On) — John Hughes (the QuickCheck author) · 56 min — the canonical PBT talk; war stories from real bugs found.
Mocks Aren't Stubs — Martin Fowler-inspired talk by Sandi Metz — Sandi Metz · 38 min — the clearest distinction between stubs, mocks, fakes, and spies you'll see on video.
Mutation Testing in Practice — Henry Coles — Henry Coles (PIT author) · 45 min — the inventor's tour of mutation testing on a real codebase.
TDD, Where Did It All Go Wrong? — Ian Cooper — Ian Cooper · 1 h 02 min — the talk that pushed many teams from London-school back to Detroit-school TDD.
Hypothesis & Property-Based Testing in Python — David MacIver — David MacIver (Hypothesis author) · 36 min — practical Python PBT walkthrough from the library's creator.

LeetCode — Validate Binary Search Tree

Link: https://leetcode.com/problems/validate-binary-search-tree/
Difficulty: Medium
Why this problem: Verifying a structural invariant is exactly what tests do. Naive solutions miss edge cases; property-based thinking catches them.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Pick test-pyramid shape for a given codebase (microservices vs monolith).
Choose between mocks, fakes, stubs, spies for a scenario.
Write a property-based test for a pure function.
Explain mutation kill-rate and why it beats line coverage.
Triage and fix a flaky test class.
Solve validate-binary-search-tree — classic invariant check, both naive and property-aware.

