Python, from the ground up Lesson 24 / 60

Property-based testing with hypothesis

Generate hundreds of test cases automatically. The patterns that find bugs unit tests miss.

The unit tests you write are the cases you thought of. The bugs in production are the cases you didn’t. Property-based testing closes that gap by letting the test framework generate the cases for you — hundreds of them per test run, biased toward the inputs most likely to break things.

In Python, the property-based testing library is hypothesis. It’s been mature for years, ships clean APIs, and integrates seamlessly with pytest. This lesson is the working knowledge: how to think about properties, how to write them, the patterns that find real bugs, and where property-based testing isn’t worth the cost.

Example-based versus property-based

A traditional pytest test is example-based: you pick concrete inputs and assert concrete outputs.

def test_round_price():
    assert round_price(1.005) == 1.00
    assert round_price(1.015) == 1.02

This works for the cases you wrote down. It says nothing about 1.005000000001, or -0.005, or 0.0, or float("nan"), unless you also wrote those down.

A property-based test asserts a property — something that should hold for all valid inputs — and lets the framework probe at it:

from hypothesis import given, strategies as st

@given(st.floats(min_value=0, max_value=10_000, allow_nan=False))
def test_round_price_is_idempotent(amount: float) -> None:
    once = round_price(amount)
    twice = round_price(once)
    assert once == twice

Hypothesis runs this with a hundred different floats by default, biased toward edge cases (0.0, very small numbers, numbers with awkward binary representations). If any input fails the property, hypothesis tells you which one — and shrinks it to the smallest counterexample it can find.

The shrinking is the magic. If a giant random float fails, hypothesis doesn’t just say “it failed with 0.7281928281828”; it whittles the input down until it finds the simplest input that still fails. Often you get back something like 1e-300 or 0.0, and the bug is suddenly obvious.
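
To watch shrinking on its own, here is a minimal sketch using hypothesis's find() helper, which searches for an input that satisfies a condition and then shrinks it. The condition (a list of non-negative integers summing to at least 100) is invented for illustration and has nothing to do with round_price.

```python
from hypothesis import find, strategies as st

# Ask for any list of non-negative ints whose sum reaches 100.
# The first hit is usually long and messy; find() returns the shrunk version.
smallest = find(
    st.lists(st.integers(min_value=0)),
    lambda xs: sum(xs) >= 100,
)

print(smallest)  # typically a one-element list, though the exact value is up to hypothesis
```

A failing @given test goes through the same shrinking process before it reports its counterexample.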

The basic moves

Install and import:

pip install hypothesis
from hypothesis import given, strategies as st

A strategy is a description of an input space. Hypothesis comes with strategies for every common Python type:

st.integers()                       # any int
st.integers(min_value=0)            # non-negative
st.floats(allow_nan=False, allow_infinity=False)  # finite floats only
st.text()                           # any unicode string
st.text(alphabet="abc", max_size=5) # restricted
st.lists(st.integers())             # list of ints
st.lists(st.integers(), min_size=1) # non-empty
st.dictionaries(st.text(), st.integers())
st.dates()                          # datetime.date
st.datetimes(timezones=st.timezones())
st.from_regex(r"^\d{3}-\d{4}$", fullmatch=True)

Combining strategies is just composition:

st.tuples(st.text(), st.integers())
st.lists(st.tuples(st.text(), st.integers()), max_size=10)
st.one_of(st.integers(), st.text())

For your own types, @st.composite builds a strategy out of others:

from hypothesis import strategies as st
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    id: int
    amount: float
    customer: str

@st.composite
def orders(draw: st.DrawFn) -> Order:
    return Order(
        id=draw(st.integers(min_value=1)),
        amount=draw(st.floats(min_value=0, allow_nan=False, allow_infinity=False)),
        customer=draw(st.text(min_size=1, max_size=50)),
    )

@given(orders())
def test_order_amount_non_negative(order: Order) -> None:
    assert order.amount >= 0

Once you have a strategy for your domain types, every test that needs them is a one-liner.
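
Where @st.composite earns its keep is when one drawn value constrains another, which plain st.tuples cannot express. A small sketch, assuming a hypothetical int_ranges strategy that is not part of the lesson's codebase:

```python
from hypothesis import given, strategies as st


@st.composite
def int_ranges(draw: st.DrawFn) -> tuple[int, int]:
    """Hypothetical strategy: a (low, high) pair with low <= high."""
    low = draw(st.integers(min_value=0, max_value=1_000))
    # The second draw depends on the first one.
    high = draw(st.integers(min_value=low, max_value=low + 1_000))
    return low, high


@given(int_ranges())
def test_range_length_matches_bounds(bounds: tuple[int, int]) -> None:
    low, high = bounds
    assert low <= high
    assert len(range(low, high)) == high - low
```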

Properties worth testing

The skill of property-based testing is identifying properties. A few patterns that come up everywhere:

Round-trip

If your code encodes and decodes, encoding then decoding should give you back the original:

import json

@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(data: dict[str, int]) -> None:
    assert json.loads(json.dumps(data)) == data

This catches a surprising number of bugs in custom serializers — anything to do with quoting, encoding, special characters, empty containers.
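
The same property on a hand-written serializer might look like the sketch below. The encode_fields / decode_fields pair is a hypothetical comma-separated format with backslash escaping, made up for this example.

```python
from hypothesis import given, strategies as st


def encode_fields(fields: list[str]) -> str:
    """Hypothetical serializer: comma-joined fields, escaping backslash and comma."""
    escaped = [f.replace("\\", "\\\\").replace(",", "\\,") for f in fields]
    return ",".join(escaped)


def decode_fields(line: str) -> list[str]:
    """Inverse of encode_fields for any string it produced."""
    fields: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i + 1])  # escaped character, taken literally
            i += 2
        elif ch == ",":
            fields.append("".join(current))  # unescaped comma ends a field
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


# min_size=1 is deliberate: [] and [""] both encode to "", so this format
# cannot tell them apart. Dropping min_size makes the property fail on [].
@given(st.lists(st.text(), min_size=1))
def test_fields_round_trip(fields: list[str]) -> None:
    assert decode_fields(encode_fields(fields)) == fields
```

The min_size=1 restriction is the interesting part: the property forces you to decide what the format does with an empty container instead of discovering it in production.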

Idempotence

Doing it twice equals doing it once:

@given(st.text())
def test_strip_is_idempotent(s: str) -> None:
    assert s.strip().strip() == s.strip()

@given(st.lists(st.integers()))
def test_sort_is_idempotent(xs: list[int]) -> None:
    assert sorted(sorted(xs)) == sorted(xs)

Idempotence is the natural property of any “normalising” function: rounding, sorting, deduplicating, canonicalising URLs, lowercasing.
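
A sketch of why the property is worth writing even for small normalisers. Both functions below are hypothetical; the naive one looks plausible but fails idempotence.

```python
from hypothesis import given, strategies as st


def collapse_spaces(s: str) -> str:
    """Hypothetical normaliser: trim, and collapse whitespace runs to one space."""
    return " ".join(s.split())


def collapse_spaces_naive(s: str) -> str:
    """Tempting one-pass version: three spaces become two, not one."""
    return s.replace("  ", " ")


@given(st.text())
def test_collapse_spaces_is_idempotent(s: str) -> None:
    once = collapse_spaces(s)
    assert collapse_spaces(once) == once
```

Pointing the same test at collapse_spaces_naive fails, because a run of three spaces needs two passes to settle.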

Commutativity (where it should hold)

Some operations should give the same answer regardless of order:

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_set_union_is_commutative(a: list[int], b: list[int]) -> None:
    assert set(a) | set(b) == set(b) | set(a)

Monotonicity

Sorted output is non-decreasing. Adding to a counter never decreases it. The output of an “average” never exceeds the maximum input:

@given(st.lists(st.integers(), min_size=1))
def test_sort_is_non_decreasing(xs: list[int]) -> None:
    s = sorted(xs)
    assert all(a <= b for a, b in zip(s, s[1:], strict=False))
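
The "average" claim can be sketched the same way. The exact_mean helper is an assumption of this example; it uses Fraction so that floating-point rounding does not blur the bounds.

```python
from fractions import Fraction

from hypothesis import given, strategies as st


def exact_mean(xs: list[int]) -> Fraction:
    """Hypothetical helper: the arithmetic mean as an exact fraction."""
    return Fraction(sum(xs), len(xs))


@given(st.lists(st.integers(), min_size=1))
def test_mean_is_bounded(xs: list[int]) -> None:
    assert min(xs) <= exact_mean(xs) <= max(xs)


@given(st.lists(st.integers(), min_size=1), st.integers(min_value=0))
def test_mean_is_monotonic(xs: list[int], bump: int) -> None:
    # Raising one element must never lower the mean.
    bumped = [xs[0] + bump] + xs[1:]
    assert exact_mean(bumped) >= exact_mean(xs)
```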

Invariants under transformation

Length doesn’t change after a permutation. Total doesn’t change after rounding errors are summed back in. The set of customer IDs is preserved across an ETL step.

@given(st.lists(st.integers()))
def test_reverse_preserves_length(xs: list[int]) -> None:
    assert len(list(reversed(xs))) == len(xs)
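
The ETL invariant might be sketched like this, assuming a hypothetical latest_per_customer step that keeps one row per customer. The row shape (customer_id, version) is made up for the example.

```python
from hypothesis import given, strategies as st


def latest_per_customer(rows: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Hypothetical ETL step: keep the highest version seen for each customer_id."""
    best: dict[int, int] = {}
    for customer_id, version in rows:
        if customer_id not in best or version > best[customer_id]:
            best[customer_id] = version
    return sorted(best.items())


# A small ID range makes duplicate customers likely, which is the interesting case.
rows = st.lists(
    st.tuples(st.integers(min_value=1, max_value=50), st.integers(min_value=0))
)


@given(rows)
def test_customer_ids_are_preserved(data: list[tuple[int, int]]) -> None:
    out = latest_per_customer(data)
    assert {cid for cid, _ in out} == {cid for cid, _ in data}
    assert len(out) <= len(data)
```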

A worked example: testing a price-rounding function

Here’s a function and the properties I’d write for it:

from decimal import Decimal, ROUND_HALF_EVEN

def round_price(amount: float) -> float:
    """Round to the nearest cent, banker's rounding."""
    return float(
        Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    )

The properties:

from hypothesis import given, strategies as st

prices = st.floats(min_value=0, max_value=1_000_000, allow_nan=False, allow_infinity=False)

@given(prices)
def test_round_price_is_idempotent(amount: float) -> None:
    once = round_price(amount)
    assert round_price(once) == once

@given(prices)
def test_round_price_close_to_input(amount: float) -> None:
    assert abs(round_price(amount) - amount) <= 0.005 + 1e-9

@given(prices)
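def test_round_price_two_decimals(amount: float) -> None:
    cents = round_price(amount) * 100
    # Compare with a tolerance: 0.07 * 100 is 7.000000000000001 in binary
    # floating point, so an exact == against round(cents) fails spuriously.
    assert abs(cents - round(cents)) < 1e-6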

Three properties, a few hundred generated cases per run, and a failure in any one of them points at something specific. When I first ran this kind of suite on a real money-handling codebase, hypothesis found a case where 1e-308 didn’t round to zero because of a precision oddity in the Decimal conversion. Not a case I would have written by hand.

Stateful testing

Some bugs only show up in sequences of operations: the first call works, the second one corrupts state. Hypothesis handles this with RuleBasedStateMachine:

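from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

# Cart is the class under test; it needs add(), remove_last(), total() and an items list.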

class CartMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.cart = Cart()
        self.expected_total = 0.0

    @rule(amount=st.floats(min_value=0, max_value=100))
    def add_item(self, amount: float) -> None:
        self.cart.add(amount)
        self.expected_total += amount

    @rule()
    def remove_last(self) -> None:
        if self.cart.items:
            removed = self.cart.remove_last()
            self.expected_total -= removed

    @invariant()
    def total_matches(self) -> None:
        assert abs(self.cart.total() - self.expected_total) < 1e-6

TestCart = CartMachine.TestCase

Hypothesis generates random sequences of add_item and remove_last calls and checks the invariant after each one. If the cart’s total() ever drifts from your bookkeeping, hypothesis shrinks the sequence to the shortest reproduction. This catches state-machine bugs that example tests can’t reach.
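
The machine above assumes a Cart class that the lesson never defines. A minimal sketch of one that would satisfy it follows. It is purely hypothetical, with the method names taken from the calls CartMachine makes.

```python
class Cart:
    """Hypothetical minimal cart: just enough surface for CartMachine."""

    def __init__(self) -> None:
        self.items: list[float] = []

    def add(self, amount: float) -> None:
        self.items.append(amount)

    def remove_last(self) -> float:
        # Returns the removed amount so the caller can adjust its own total.
        return self.items.pop()

    def total(self) -> float:
        return sum(self.items)


cart = Cart()
cart.add(2.5)
cart.add(1.0)
removed = cart.remove_last()
print(removed, cart.total())
```

A version that cached the total inside add() but forgot to update it in remove_last() would pass any add-only example test, and is the kind of bug the total_matches invariant is there to catch.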

The cost, and how to control it

Property-based tests are slower than example tests: a test that runs a hundred generated inputs takes more wall-clock time than a test with one. For most codebases this is fine; the suite still finishes in seconds. For tests that hit a database, a network, or anything expensive, you tune with @settings:

from hypothesis import given, settings, strategies as st

@settings(max_examples=20, deadline=None)
@given(st.text())
def test_slow_thing(s: str) -> None:
    ...

max_examples caps the number of generated cases. deadline=None disables hypothesis’s per-example time limit (200 ms by default), which otherwise trips on tests whose speed varies. There’s also @settings(database=None), which disables the local database hypothesis uses to remember failing cases between runs. Turning it off is useful in throwaway CI containers and annoying in local development, where you want yesterday’s failure replayed first.

A pattern I use: a slow profile for CI that runs max_examples=500, and a default profile for local development that uses the standard hundred. The CI run is more thorough; the local run is fast enough to keep me iterating.
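
One way to set that up is with hypothesis's settings profiles, usually registered in conftest.py. The profile names "slow" and "dev" and the HYPOTHESIS_PROFILE environment variable are this sketch's own choices, not something hypothesis prescribes.

```python
import os

from hypothesis import settings

# Thorough profile for CI, lighter one for local runs.
settings.register_profile("slow", max_examples=500)
settings.register_profile("dev", max_examples=100)

# Pick the profile from the environment, falling back to the local one.
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
```

CI would then export HYPOTHESIS_PROFILE=slow before invoking pytest, and local runs need no configuration at all.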

When to skip

Property-based testing isn’t always the right tool.

  • Pure UI code. “What property should this React component have?” Usually none worth automating.
  • Code with strong network or randomness dependencies. If the function’s output depends on the response from a third-party API, hypothesis can’t generate that.
  • One-off scripts. The investment doesn’t pay back.
  • When the property is harder to express than the implementation. If you’d write a parallel implementation just to assert the property, you’ve gained nothing.

The sweet spot: pure functions over rich input spaces. Parsers, serializers, data transformations, financial calculations, sorting algorithms, anything that operates on user-shaped data.

Real bugs property tests catch

A short list of bugs I’ve personally caught with hypothesis, none of which my example tests had:

  • A CSV writer that broke on values containing both a comma and a quote.
  • A timestamp parser that mishandled the boundary between standard and daylight saving time.
  • A leap-second-related bug in a duration calculator.
  • A sort routine that was unstable on None values where I’d assumed None would never appear.
  • An integer-overflow in a percentage calculation when the input was a 64-bit integer near 2^63.
  • A Unicode normalization mismatch that caused two strings that “looked the same” to compare unequal.
  • A dict merger that lost keys when the same key appeared with different cases in different inputs.

All boring. All shippable as production bugs. All caught by a property that took five minutes to write.
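
As an illustration of how small such a property can be, here is a sketch aimed at the Unicode normalization item. The same_text helper is hypothetical; the NFC and NFD forms come from the standard library's unicodedata module.

```python
import unicodedata

from hypothesis import given, strategies as st


def same_text(a: str, b: str) -> bool:
    """Hypothetical comparison: equal after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# "é" as one code point versus "e" plus a combining accent:
# they render the same, but compare unequal as raw strings.
assert "\u00e9" != "e\u0301"
assert same_text("\u00e9", "e\u0301")


@given(st.text())
def test_decomposed_form_compares_equal(s: str) -> None:
    # NFD rewrites s into its decomposed form; the comparison should not care.
    assert same_text(s, unicodedata.normalize("NFD", s))
```

Replacing same_text with a plain == makes the test fail as soon as hypothesis generates a precomposed character.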

If you only adopt one habit from this lesson: for every pure function in your codebase that takes a recognisable input shape (a list of numbers, a string with a known alphabet, a dataclass with documented fields), write one property test. Idempotence, round-trip, monotonicity — pick whichever fits. The bar is low, the payoff is high, and the first time hypothesis hands you back a one-line input that crashes your code, you’ll be glad you did.

For documentation: hypothesis’s own docs at https://hypothesis.readthedocs.io/ are the canonical reference, with a strategies catalogue worth bookmarking.
