Joins: merge, concat, the patterns that don't surprise you

Real analysis almost never lives in one DataFrame. You have orders in one table, customers in another, products in a third, and the question — “what’s the average order value per customer country?” — only makes sense after you’ve stitched them together. Today we cover the three operations that combine DataFrames: merge (SQL JOIN), concat (stack tables on top of or beside each other), and join (a thin wrapper around merge that aligns on index). We’ll also meet validate= and indicator=, two arguments that catch the bugs that otherwise only show up in production.

We’ll work with two small tables throughout:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

orders = pd.DataFrame(
    {
        "order_id":    [1, 2, 3, 4, 5],
        "customer_id": [10, 20, 10, 30, 40],
        "amount":      [120.0, 85.0, 50.0, 200.0, 75.0],
    }
)

customers = pd.DataFrame(
    {
        "customer_id": [10, 20, 30],
        "name":        ["Ada", "Boris", "Chiara"],
        "country":     ["IT", "DE", "IT"],
    }
)

Five orders, three known customers, one order (customer_id=40) with no matching customer. That mismatch is on purpose — it’s where the join semantics start to matter.

`pd.merge` — the SQL JOIN of pandas

pd.merge(left, right, on="key", how="inner") is the workhorse. The how parameter picks the join type:

pd.merge(orders, customers, on="customer_id", how="inner")
# 4 rows: only customers 10, 20, 30 match. Customer 40 is dropped.

pd.merge(orders, customers, on="customer_id", how="left")
# 5 rows: all orders, with NaN for customer 40's name and country.

pd.merge(orders, customers, on="customer_id", how="right")
# 3 customers all kept; customer 30 has one matching order, but customers
# without orders also appear (here all customers have orders, so no diff).

pd.merge(orders, customers, on="customer_id", how="outer")
# 5 rows: union of both sides, NaN where either side has no match.

pd.merge(orders, customers, how="cross")
# Cartesian product. 5 * 3 = 15 rows. Use sparingly.

In SQL terms: inner = INNER JOIN, left = LEFT JOIN, right = RIGHT JOIN, outer = FULL OUTER JOIN, cross = CROSS JOIN. The defaults match what most analysts mean: how="inner", on= inferred from common column names. For production code, always pass how= explicitly — silent inner joins drop rows you might not realise are being dropped, and the bug shows up in totals two weeks later.

Joining on multiple keys

When the relationship is keyed by a tuple — say (country, year) — pass a list:

pd.merge(sales, fx_rates, on=["country", "year"], how="left")

Both sides need the same column names for on=. If they don’t, use left_on= and right_on=:

pd.merge(
    orders,
    customers.rename(columns={"customer_id": "id"}),
    left_on="customer_id",
    right_on="id",
    how="left",
)

The result has both columns — customer_id and id — even though they’re identical. Drop one explicitly afterwards if it bothers you (and it should, in production code):

result = pd.merge(
    orders, customers,
    left_on="customer_id", right_on="id",
    how="left",
).drop(columns=["id"])

Joining against the index

If the right side is indexed by the join key, use right_index=True:

customers_indexed = customers.set_index("customer_id")
pd.merge(orders, customers_indexed, left_on="customer_id", right_index=True, how="left")

This is common when you’ve cached a lookup table by its primary key — no need to reset the index just to merge.

Suffixes — when columns collide

If both DataFrames have a column with the same name (other than the join key), pandas adds suffixes to disambiguate:

left = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
right = pd.DataFrame({"id": [1, 2], "value": [100, 200]})
pd.merge(left, right, on="id")
# columns: id, value_x, value_y

_x and _y are the defaults and they’re terrible. Pass suffixes= with names that mean something:

pd.merge(left, right, on="id", suffixes=("_orders", "_invoices"))

This is a tiny change that pays for itself the first time someone else reads your code.

`validate=` — the underused safety net

This is the argument that should be on every production merge and almost never is. validate= makes pandas check the relationship between the two sides and raise an error if it’s violated:

pd.merge(orders, customers, on="customer_id", how="left", validate="many_to_one")

The values:

"one_to_one" — every key appears at most once on each side. Useful when joining two reference tables that should each be keyed.
"one_to_many" — keys are unique on the left, may repeat on the right. Used when a parent record (order) joins to many children (order_lines).
"many_to_one" — keys may repeat on the left, unique on the right. The most common case in analytics: a fact table (orders) joining to a dimension table (customers). Each customer appears once; many orders can reference the same customer.
"many_to_many" — explicit “I know there’s no uniqueness.” Same as no validation, but documented.

If you’re joining orders to customers and someone duplicates a customer row by accident, an unvalidated merge silently doubles your order count. validate="many_to_one" raises MergeError immediately, which is much nicer than discovering it in a board-meeting dashboard. Pass it, every time you’re confident about the relationship. It’s free documentation and a free assertion.

`indicator=` — debugging “why are rows missing?”

The other argument you’ll thank yourself for: indicator=True adds a column called _merge that says, for each row, where it came from:

result = pd.merge(orders, customers, on="customer_id", how="outer", indicator=True)
result["_merge"]
# Categorical: "left_only", "right_only", or "both"

If you’re debugging “I expected 1,000 rows and got 970, where did the others go?”, do the merge with how="outer" and indicator=True, then count by _merge. The 30 missing rows are either left_only (no match in the right table) or right_only. From there you can inspect them directly:

unmatched = result.loc[result["_merge"] == "left_only", ["customer_id", "amount"]]

This single argument has saved me more debugging sessions than almost anything else in pandas.

`pd.concat` — stacking, not joining

merge combines tables side by side on a key. concat stacks them — vertically (more rows) or horizontally (more columns):

q1 = pd.DataFrame({"order_id": [1, 2], "amount": [120, 85]})
q2 = pd.DataFrame({"order_id": [3, 4], "amount": [50, 200]})

pd.concat([q1, q2])                # vertical stack, default axis=0
pd.concat([q1, q2], ignore_index=True)   # rebuild a clean 0..N RangeIndex

Without ignore_index=True, the original indices are preserved and concatenated as-is — so you’ll have two rows with index 0, two with index 1, etc. That’s almost never what you want; pass ignore_index=True unless you specifically need the originals.

Vertical concat aligns columns by name. Missing columns become NaN:

a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [3, 4], "y": [30, 40]})
pd.concat([a, b], ignore_index=True)
# columns: id, x, y -- with NaN where each side didn't have the column

Horizontal concat (axis=1) aligns rows by index:

left = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
right = pd.DataFrame({"b": [4, 5, 6]}, index=[10, 20, 40])
pd.concat([left, right], axis=1)
# index: 10, 20, 30, 40 (outer join of indices); NaN for non-matching rows

This is where pandas surprises people: horizontal concat is index-aligned, not row-positional. If your two DataFrames have unrelated indices and you wanted “glue them column-by-column in row order,” you need to reset_index(drop=True) on both first, then concat. Otherwise you get an outer-joined mess.

For “tag each chunk with where it came from,” keys= is handy:

pd.concat([q1, q2], keys=["q1", "q2"])
# MultiIndex: ("q1", 0), ("q1", 1), ("q2", 0), ("q2", 1)

The first level tells you which source each row came from. Useful for “stack quarterly files into one yearly file with a quarter label.”

`df.join` — the index-keyed shorthand

df.join is a method on DataFrame — not the same as pd.merge. By default, it joins on the index of the right side and the index of the left:

customers_indexed = customers.set_index("customer_id")
orders.set_index("customer_id").join(customers_indexed, how="left")

It also accepts an on= argument to join the right’s index against a column on the left:

orders.join(customers_indexed, on="customer_id", how="left")

df.join is a convenience for the common case “I have a lookup table indexed by the key, and I want to add its columns.” Anything more complex — different column names, validation, indicators — use pd.merge instead. Don’t try to remember join’s argument shape; remembering merge is enough.

When to use which

A short rubric:

pd.merge — joining two tables on a column. The default. Use this 90% of the time.
pd.concat(axis=0) — stacking tables with the same schema vertically. Loading multiple files of the same shape, accumulating chunks.
pd.concat(axis=1) — gluing columns from index-aligned DataFrames. Less common than you’d think; usually you want a merge.
df.join — quick lookup-by-index. Fine for ad-hoc work, prefer merge in pipelines.

Putting it together

A real analysis stitches several of these:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

orders = pd.read_parquet("orders.parquet")
customers = pd.read_parquet("customers.parquet")
fx_rates = pd.read_parquet("fx_rates.parquet")  # (currency, year) -> rate

result = (
    orders
    .merge(
        customers,
        on="customer_id",
        how="left",
        validate="many_to_one",
    )
    .merge(
        fx_rates,
        on=["currency", "year"],
        how="left",
        validate="many_to_one",
    )
    .assign(amount_eur=lambda d: d["amount"] * d["rate"])
    .groupby(["country", "year"], as_index=False)
    .agg(
        total_eur=("amount_eur", "sum"),
        n_orders=("order_id", "size"),
    )
    .sort_values(["country", "year"])
    .reset_index(drop=True)
)

Two left joins, both validated as many-to-one, then aggregate. If a customer ID gets duplicated upstream, validate="many_to_one" errors at the merge instead of producing inflated totals. If an unmatched currency/year pair sneaks in, amount_eur becomes NaN and shows up in the next totals review (or, better, you add a result.loc[result["amount_eur"].isna()] check after the second merge and assert it’s empty).

Indicator + validate, the production pattern

For pipelines I care about, the merge looks like this:

merged = pd.merge(
    orders,
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "customer_id"].unique()
if len(unmatched):
    raise ValueError(f"{len(unmatched)} unmatched customer_ids: {unmatched[:10]}")

merged = merged.drop(columns="_merge")

Validate the cardinality, indicator the matches, assert no unmatched rows, drop the indicator before continuing. Three extra lines; saves a quarter’s worth of “why are the totals off by 0.3%” tickets.

What’s next

That’s the end of the pandas mastery module. From here we move into module 6 — time-series and window functions, where we’ll see resampling, rolling windows, and the date-aware versions of the operations from the last few lessons.

pd.merge — the SQL JOIN of pandas