Testing, Mocks, Property-Based Tests, Mutation Testing
Session 41 of the 48-session learning series.
Why this session matters
This session is part of the OOP track. Code without tests is a draft. Most engineers stop at "I wrote some unit tests and they pass" and miss property-based tests, mutation testing, and the discipline of fast/slow tiers. Investment in test infrastructure compounds over the life of a multi-year codebase.
Agenda
- The test pyramid — unit, integration, e2e; what each catches
- Mocking — stubs, mocks, fakes, spies; when each is right
- Property-based testing — Hypothesis, fast-check; the bug-finding superpower
- Mutation testing — measuring the quality of the tests themselves
- Test infrastructure — fixtures, parallelism, flakes, coverage thresholds
Pre-read (skim before the session)
- Martin Fowler — Test Pyramid
- Martin Fowler — Mocks Aren't Stubs
- Hypothesis docs — Quick start
- PIT mutation testing — Mutators
Deep dive
1. The pyramid (and the diamond, and the trophy)
              ▲
             e2e              slow, brittle, gives confidence at top
            ─────
         integration          real DB / queue / network
        ────────────
         unit tests           fast, deterministic, isolate one thing
Classic pyramid: unit-heavy, fewer integration, few e2e.
Modern variants:
- Trophy (Kent C. Dodds) — heavier on integration than pure pyramid.
- Diamond — light on unit, heavy on integration.
The right shape depends on your architecture: microservices lean toward integration tests; a tightly coupled monolith leans toward unit tests.
2. What "unit" actually means
Two schools:
- Solitary unit — mock every collaborator. Test the function in isolation.
- Sociable unit — use real collaborators when cheap (in-process, no I/O). Mock only external boundaries.
The sociable school catches integration bugs earlier; solitary is faster and easier to debug.
Pragmatic default: sociable units with mocked external boundaries (HTTP, DB if remote, time, randomness).
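A minimal sketch of that default, assuming a hypothetical `PriceCalculator` with one cheap in-process collaborator (`TaxTable`) and one remote boundary (an exchange-rate service). All names and rates here are invented for illustration. The real collaborator is used as-is; only the boundary is replaced.

```python
from dataclasses import dataclass


class TaxTable:
    """Real in-process collaborator: cheap, deterministic, no I/O."""

    RATES = {"DE": 0.19, "FR": 0.20}

    def rate_for(self, country: str) -> float:
        return self.RATES[country]


class StubRateClient:
    """Stands in for a remote exchange-rate service (the external boundary)."""

    def eur_to(self, currency: str) -> float:
        return {"EUR": 1.0, "USD": 2.0}[currency]  # fixed, made-up rates


@dataclass
class PriceCalculator:
    taxes: TaxTable
    rates: object  # anything with an eur_to(currency) method

    def gross(self, net_eur: float, country: str, currency: str) -> float:
        with_tax = net_eur * (1 + self.taxes.rate_for(country))
        return round(with_tax * self.rates.eur_to(currency), 2)


def test_gross_price_applies_tax_then_converts():
    # Sociable: the real TaxTable takes part; only the remote boundary is stubbed.
    calc = PriceCalculator(taxes=TaxTable(), rates=StubRateClient())
    assert calc.gross(100.0, "DE", "USD") == 238.0
```

A solitary version of the same test would also replace `TaxTable`, which isolates failures more sharply but no longer notices if the two classes stop fitting together.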
3. Stubs vs mocks vs fakes vs spies
| Type | Behaviour | Verifies? |
|---|---|---|
| Stub | Returns hardcoded values | No |
| Mock | Expects specific calls; fails if unmet | Yes |
| Fake | Working implementation, simplified (in-memory DB) | No |
| Spy | Wraps real object; records calls | Yes |
Rule: prefer fakes > stubs > mocks > spies. Mocks couple tests to implementation details, so a behaviour-preserving refactor can break tests that should still pass.
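The four kinds of double can be sketched with the standard library's `unittest.mock`. The `register` function, `InMemoryUserRepo`, and the `ids` / `mailer` collaborators are hypothetical names made up for this example; only `Mock`, `return_value`, `wraps`, and `assert_called_once_with` come from the library.

```python
from unittest.mock import Mock


class InMemoryUserRepo:
    """Fake: a working, simplified implementation backed by a dict."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, name):
        self._users[user_id] = name

    def get(self, user_id):
        return self._users.get(user_id)


def register(name, ids, repo, mailer):
    """Hypothetical code under test: allocate an id, store, send a welcome mail."""
    user_id = ids.next_id()
    repo.save(user_id, name)
    mailer.send(to=user_id, subject=f"Welcome, {name}")
    return user_id


def stub_ids():
    ids = Mock()
    ids.next_id.return_value = "u1"   # stub: hardcoded answer, nothing verified
    return ids


def test_with_stub_and_fake():
    repo = InMemoryUserRepo()
    assert register("Ada", stub_ids(), repo, Mock()) == "u1"
    assert repo.get("u1") == "Ada"    # state-based check against the fake


def test_with_mock():
    mailer = Mock()
    register("Ada", stub_ids(), InMemoryUserRepo(), mailer)
    # Interaction-based check: ties the test to the exact call shape.
    mailer.send.assert_called_once_with(to="u1", subject="Welcome, Ada")


def test_with_spy():
    real_repo = InMemoryUserRepo()
    spy = Mock(wraps=real_repo)       # spy: delegates to the real object, records calls
    register("Ada", stub_ids(), spy, Mock())
    assert real_repo.get("u1") == "Ada"
    spy.save.assert_called_once_with("u1", "Ada")
```

Note which tests would survive a refactor that batches the welcome mail into a queue: the fake-based test keeps passing, while the mock-based one fails even though the user-visible behaviour may be unchanged.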
4. Test doubles per language
- Python: `unittest.mock`, `pytest-mock`. For HTTP: `responses`, `httpx_mock`, `pytest-httpserver`. For time: `freezegun`.
- Node/TS: `jest.mock()`, `sinon`, `nock` for HTTP.
- Java: Mockito, WireMock for HTTP.
- Go: handcrafted interfaces + table-driven tests (idiomatic; few libs).
- Rust: `mockall`, `wiremock`.
5. Property-based testing
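Instead of `assert add(2, 3) == 5`, write "for any a, b: `assert add(a, b) == add(b, a)`". The framework generates many inputs per property (Hypothesis runs 100 examples per test by default, and the count is configurable) and shrinks any failure to a minimal example.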
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sort_idempotent(xs):
        # Sorting an already-sorted list must change nothing.
        assert sorted(sorted(xs)) == sorted(xs)
Where it wins:
- Stateless pure functions (parsers, serialisers, hashers).
- Round-trip properties (`decode(encode(x)) == x`).
- Invariants of data structures (BST always sorted, heap property).
- Concurrency models (stateful, state-machine-based testing).
A few hours of property-based testing can surface edge-case bugs that years of example-based tests never hit.
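To make the "generate, then shrink" loop concrete, here is a toy runner built only on the standard library. It is not how Hypothesis is implemented, and every name in it (`check_property`, `shrink`, `int_lists`, `buggy_max`) is invented for this sketch. It shrinks only by dropping list elements, whereas real libraries also simplify the values themselves.

```python
import random


def check_property(prop, gen, runs=200, seed=0):
    """Toy runner: return None if prop held on every input, else a shrunk failure."""
    rng = random.Random(seed)
    for _ in range(runs):
        xs = gen(rng)
        if not prop(xs):
            return shrink(prop, xs)
    return None


def shrink(prop, xs):
    """Greedily drop single elements for as long as the property still fails."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if not prop(candidate):
                xs, changed = candidate, True
                break
    return xs


def int_lists(rng):
    """Generator: short lists of small integers."""
    return [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]


def buggy_max(xs):
    """Hypothetical bug: starting from 0 is wrong for all-negative input."""
    best = 0
    for x in xs:
        best = max(best, x)
    return best


def max_matches_builtin(xs):
    return not xs or buggy_max(xs) == max(xs)


def sort_is_idempotent(xs):
    return sorted(sorted(xs)) == sorted(xs)
```

Running `check_property(max_matches_builtin, int_lists)` returns a one-element list holding a single negative number, which is a far easier starting point for debugging than the longer random list that first triggered the failure. `check_property(sort_is_idempotent, int_lists)` returns `None`.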
6. Mutation testing
A mutation testing framework makes small changes to your code (`+` to `-`, `>` to `<`) and re-runs your tests. If the tests still pass, the suite missed the mutation and is weaker than its coverage suggests.
Tools: PIT (Java), mutmut (Python), Stryker (JS/TS/.NET).
Original: if x > 0:
Mutant: if x >= 0:
Result: kill rate (% of mutants caught). 80%+ = good; under 60% = your tests check that code runs, not that it's correct.
Run weekly, not per-commit (it's slow).
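The `x > 0` example above can be played through by hand. This is only an illustration of what a surviving and a killed mutant look like; real tools such as mutmut or PIT generate and run the mutants automatically. The function and suite names are made up.

```python
def is_positive(x):
    return x > 0          # original


def is_positive_mutant(x):
    return x >= 0         # mutant: `>` replaced by `>=`


def weak_suite(fn):
    """Executes the line, but never probes the boundary."""
    return fn(5) is True and fn(-5) is False


def strong_suite(fn):
    """Adds the boundary case x == 0, which tells `>` and `>=` apart."""
    return weak_suite(fn) and fn(0) is False


# Both suites pass on the original code.
assert weak_suite(is_positive) and strong_suite(is_positive)

# The mutant survives the weak suite and is killed by the strong one.
assert weak_suite(is_positive_mutant) is True
assert strong_suite(is_positive_mutant) is False
```

Both suites give the same line coverage of `is_positive`; only the kill rate shows the difference between them.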
7. Fixtures and setup
Reuse expensive setup; avoid global state:
- pytest `@pytest.fixture(scope='module')` for one-per-file.
- pytest `@pytest.fixture(scope='session')` for one-per-test-run.
- Use Testcontainers for ephemeral DB / Kafka / Redis in CI.
Anti-pattern: shared mutable state across tests, which produces order-dependent flakes that often disappear on rerun.
8. Parallelism and isolation
Run tests in parallel for speed:
- `pytest -n auto` (with `pytest-xdist`).
- `jest --maxWorkers=<n>`.
Each parallel worker needs isolation:
- Random schema/DB per worker.
- Random ports.
- No shared temp files.
- No global mocks.
The discipline costs upfront; pays back every CI run.
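One way to get per-worker isolation, sketched with the standard library. pytest-xdist exports the `PYTEST_XDIST_WORKER` environment variable in each worker process (for example `gw0`); the helper names and the schema naming scheme below are assumptions for illustration, not part of any library.

```python
import os
import socket
import tempfile
import uuid


def worker_id() -> str:
    """Worker name set by pytest-xdist, or "main" when running without it."""
    return os.environ.get("PYTEST_XDIST_WORKER", "main")


def schema_name(prefix: str = "test") -> str:
    """Hypothetical naming scheme: a fresh schema per worker and per call."""
    return f"{prefix}_{worker_id()}_{uuid.uuid4().hex[:8]}"


def scratch_dir() -> str:
    """Private temp directory, so workers never share files."""
    return tempfile.mkdtemp(prefix=f"{worker_id()}_")


def free_port() -> int:
    """Ask the OS for an unused port by binding to port 0.

    The port is released again before it is returned, so another process
    could grab it in between; that is usually acceptable in test setups.
    """
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

In a pytest suite these helpers would typically sit behind fixtures, so each test asks for "a schema" or "a port" and never hardcodes one.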
9. Flaky tests
A flaky test is worse than no test — it teaches the team to ignore failures.
Common causes:
- Timing assumptions (`sleep(1)` instead of `wait_until`).
- Ordering assumptions on un-ordered output.
- Network calls leaking to real services.
- Shared state leaking between tests.
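For the timing cause, a polling helper usually replaces the fixed sleep. The sketch below is one possible `wait_until`; the name matches the list above, but the signature and defaults are assumptions rather than any particular library's API.

```python
import threading
import time


def wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate until it is truthy; return False if timeout (seconds) passes."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def test_background_job_finishes():
    done = threading.Event()
    threading.Timer(0.05, done.set).start()   # stands in for asynchronous work
    # Returns as soon as the work is done; the timeout is only an upper bound.
    assert wait_until(done.is_set, timeout=2.0)
```

Compared with `sleep(1)`, the test finishes in about 50 ms on a fast machine and still tolerates a slow CI runner, because the timeout is a ceiling rather than a fixed wait.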
Triage:
- Mark flaky → quarantine (`@pytest.mark.flaky` with auto-rerun).
- Fix within a sprint or delete the test.
- Track flake rate per test in CI history.
10. Coverage
Coverage is a measurement, not a goal. Reaching 80% line coverage by exercising trivial `if x is None` guards says little about whether the behaviour is correct.
Useful coverage practices:
- Track per-PR delta — coverage shouldn't drop.
- Branch coverage > line coverage.
- Mutation testing > both for actual quality signal.
Coverage above 90% is rarely cost-effective except for libraries / safety-critical code.
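A small, made-up example of why branch coverage says more than line coverage. The function `final_price` and its bug are hypothetical; the point is that one test executes every line while never taking the path that skips the `if` body.

```python
def final_price(price_cents, coupon_percent=None):
    discount = None                      # bug: should start at 0
    if coupon_percent is not None:
        discount = coupon_percent
    return price_cents * (100 - discount) // 100


def test_price_with_coupon():
    # Executes every line of final_price: 100% line coverage.
    assert final_price(1000, 10) == 900


def test_price_without_coupon():
    # The path that skips the `if` body; this is where the bug shows up.
    try:
        final_price(1000)
    except TypeError:
        return "bug exposed"
    return "no bug"
```

With only the first test, a line-coverage report shows nothing missing, while a branch-coverage report flags the untaken path of the `if`. The second test takes that path and hits the `TypeError`.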
11. Contract / consumer-driven testing
For microservices: each consumer publishes a contract (the responses it expects). Producer's CI runs against all consumer contracts. Catches breaking API changes before deploy.
Tools: Pact, Spring Cloud Contract.
Heavy investment; high payoff at 5+ services.
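The idea can be sketched without any tooling. This is not Pact's format or API: the contract shape (field name mapped to expected type), `ORDER_CONTRACT`, and the two producer functions are hypothetical stand-ins for a consumer's expectations and a producer's real responses.

```python
# Hypothetical contract published by one consumer: field name -> expected type.
ORDER_CONTRACT = {"id": str, "total_cents": int, "currency": str}


def violations(contract, response):
    """Return human-readable problems; extra fields in the response are allowed."""
    problems = []
    for field, expected in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


def producer_v1():
    return {"id": "o-1", "total_cents": 1999, "currency": "EUR", "note": "extra"}


def producer_v2():
    """Breaking change: total_cents renamed to total and turned into a float."""
    return {"id": "o-1", "total": 19.99, "currency": "EUR"}
```

In the producer's CI, `violations(ORDER_CONTRACT, producer_v1())` is empty, while the v2 response reports `missing field: total_cents` and fails the build before the change reaches the consumer.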
12. Fast / slow tiers
Split tests by speed:
- Fast — < 1 ms each; run on every save in IDE; < 5 s total.
- Slow — DB, network; run on push; minutes ok.
- Nightly — e2e, perf, mutation; hours ok.
Don't make engineers wait 10 min for a unit test rerun.
13. Reality check
A modern test setup checklist:
- pytest / jest / your stack's standard runner.
- 80%+ branch coverage (with caveats above).
- Hypothesis / fast-check for pure-function modules.
- Testcontainers for integration with real services.
- Mutation testing weekly.
- Coverage as a PR comment, not a hard gate.
- Flaky test budget — < 0.5% flake rate.
- Slow tests in a separate CI lane.
The teams that take testing seriously ship faster, not slower. The teams that "don't have time for tests" pay back the debt at 3 AM during outages.
Reading material
Books:
- Working Effectively with Legacy Code — Michael Feathers (the seam + characterization-test playbook for code you can't safely change)
- xUnit Test Patterns — Gerard Meszaros (the canonical taxonomy of test doubles, fixtures, and smells)
- Growing Object-Oriented Software, Guided by Tests — Freeman & Pryce (the original mockist-school TDD book; where roles + responsibilities + collaborators come from)
- The Art of Unit Testing (3rd ed.) — Roy Osherove (modern walk-through with the stub/mock/fake distinction made concrete)
- Effective Software Testing — Maurício Aniche (property-based + mutation testing + spec-by-example, all in one practitioner book)
Papers:
- QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs — Claessen & Hughes 2000 — the founding paper of property-based testing.
- An Analysis and Survey of the Development of Mutation Testing — Jia & Harman 2011 — the canonical 50-year survey of the mutation-testing field.
- Coverage Is Not Strongly Correlated with Test Suite Effectiveness — Inozemtseva & Holmes 2014 — the paper that showed coverage % is a weak signal; mutation score is stronger.
Official docs:
- Hypothesis (Python) documentation — the canonical property-based-testing library.
- fast-check (TypeScript/JavaScript) docs — the modern TS/JS property tester.
- PIT mutation testing for Java — the canonical JVM mutation-testing tool.
- mutmut — mutation testing for Python — the Python mutation tester (or Cosmic-Ray as alternative).
- Pact — Consumer-driven contract testing — the canonical contract-test platform between services.
- Testcontainers — disposable Docker-backed dependencies for integration tests.
- pytest documentation — fixtures, parametrize, markers; how to actually structure tiered suites.
Blog posts:
- Martin Fowler — Mocks Aren't Stubs — the canonical essay; required reading.
- Martin Fowler — Test Pyramid — where the fast-vs-slow tiering discipline comes from.
- Hypothesis Works — articles — short essays by the Hypothesis maintainers on property-based testing patterns.
- Increment — Testing issue — a Stripe-published collection of senior-engineer essays on testing in production systems.
- Google Testing Blog — Test sizes — where the Small/Medium/Large taxonomy used at Google comes from.
In-depth research material
- Hypothesis — github.com/HypothesisWorks/hypothesis — ~7.5k ★, the canonical PBT library.
- fast-check — github.com/dubzzz/fast-check — ~4k ★, the modern TS property tester used inside Vercel, Stripe, and Snowpack tests.
- PIT — github.com/hcoles/pitest — the JVM mutation-testing tool used by JetBrains, Spring, and many enterprise codebases.
- Stryker Mutator — github.com/stryker-mutator/stryker-js — mutation testing for JavaScript / TypeScript / C#.
- mutmut — github.com/boxed/mutmut — a popular Python mutation tester.
- Cosmic-Ray — github.com/sixty-north/cosmic-ray — alternative Python mutation framework (parallel runner).
- Pact — github.com/pact-foundation/pact-specification — the canonical CDC test spec.
- Testcontainers — github.com/testcontainers/testcontainers-java — ~7k ★, plus per-language ports.
- Stripe Engineering — How we built our test platform — Stripe's real test-tiering story at scale.
- Google Testing Blog — long-running war stories from Google's test-engineering org.
- Eugene Yan — Reliable ML systems with tests — testing patterns applied to ML pipelines.
- Hillel Wayne — Why Property-Based? — a practitioner essay on when PBT actually earns its weight.
Videos
- Property-Based Testing for Better Code — John Hughes (Curry On) — John Hughes (the QuickCheck author) · 56 min — the canonical PBT talk; war stories from real bugs found.
- Mocks Aren't Stubs — Martin Fowler-inspired talk by Sandi Metz — Sandi Metz · 38 min — the clearest distinction between stubs, mocks, fakes, and spies you'll see on video.
- Mutation Testing in Practice — Henry Coles — Henry Coles (PIT author) · 45 min — the inventor's tour of mutation testing on a real codebase.
- TDD, Where Did It All Go Wrong? — Ian Cooper — Ian Cooper · 1 h 02 min — the talk that pushed many teams from London-school back to Detroit-school TDD.
- Hypothesis & Property-Based Testing in Python — David MacIver — David MacIver (Hypothesis author) · 36 min — practical Python PBT walkthrough from the library's creator.
LeetCode — Validate Binary Search Tree
- Link: https://leetcode.com/problems/validate-binary-search-tree/
- Difficulty: Medium
- Why this problem: Verifying a structural invariant is exactly what tests do. Naive solutions miss edge cases; property-based thinking catches them.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Pick test-pyramid shape for a given codebase (microservices vs monolith).
- Choose between mocks, fakes, stubs, spies for a scenario.
- Write a property-based test for a pure function.
- Explain mutation kill-rate and why it beats line coverage.
- Triage and fix a flaky test class.
- Solve `validate-binary-search-tree`: classic invariant check, both naive and property-aware.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.