End-to-end data analysis project

We’ve spent twelve lessons building up a vocabulary: dtypes, IO, selection, joins, groupbys, time series, performance, Polars. Today we do the thing the vocabulary exists for. We take a real dataset, walk it from raw CSV to a clean answer, and write the kind of script you’d actually commit to a repo. This is the closing lesson of Module 6 and a working template for almost any analysis task you’ll get handed in your first job.

The dataset we’ll use is the NYC Yellow Taxi Trips monthly file — publicly available, around 3 million rows per month, real-world dirty in interesting ways. The question we’ll answer: what’s the average tip percentage by hour of day, and does it differ between credit-card and cash payers? It’s the kind of question a manager actually asks, and answering it requires every step of the pipeline.

The full script is at the end of the lesson. Here we walk through the six phases.

The shape: the scaffolding script

Before any code, the shape. A good analysis script is one Python file, runnable end to end, with section comments. Roughly:

"""nyc_taxi_tips.py — average tip % by hour, by payment type."""

# 1. Load + inspect
# 2. Clean
# 3. Explore
# 4. Transform
# 5. Answer
# 6. Save / present

if __name__ == "__main__":
    main()

Why one file, not a notebook? Notebooks are great for the exploration phase, but the version that goes in a repo and gets re-run on next month’s data should be a script — reproducible, diffable, runnable from cron. The pattern I use: explore in a notebook, then once I know the steps, copy them into a script that does only those steps in order. The notebook stays around as a journal; the script is the deliverable.

Phase 1 — load and inspect

Always the same three things: shape, dtypes, sample, nulls. Don’t skip any of them, no matter how confident you are about the file.

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2025-01.parquet", dtype_backend="pyarrow")

print(df.shape)
print(df.dtypes)
print(df.head())
print(df.tail())
print(df.sample(5))
print(df.isna().sum())
print(df.describe())

Running that on a January NYC taxi file gives you something like:

(3_267_104, 19)
VendorID                       int64[pyarrow]
tpep_pickup_datetime           timestamp[us][pyarrow]
tpep_dropoff_datetime          timestamp[us][pyarrow]
passenger_count                double[pyarrow]
trip_distance                  double[pyarrow]
...
fare_amount                    double[pyarrow]
tip_amount                     double[pyarrow]
total_amount                   double[pyarrow]
payment_type                   int64[pyarrow]

Three observations from that single dump:

payment_type is an integer code, not a label. Going to need a lookup table to make it human.
passenger_count is a float, which is suspicious for a count. Probably has nulls and got promoted.
The dataset has 3.2 million rows and 19 columns. Fits in memory comfortably; no need for chunking or Polars today.

describe() will tell you the rest of the story: minimum trip distance of 0 (zero-mile trips, those are bugs), maximum tip of $400 (one person tipped $400 on a $30 ride? probably a data entry error), fare_amount going slightly negative (refunds, or sign errors). All of these will need decisions in the cleaning phase.

isna().sum() shows you which columns have nulls. In this file passenger_count and a few others have a small percentage; we’ll handle those.

Phase 2 — clean

This is where most analyses spend most of their time, and where the script earns its trust. Every cleaning decision is a judgement call you’re making about the data, and writing them down explicitly is the difference between a script that’s reproducible and one that’s mysterious.

# Parse the payment type codes (from the NYC TLC data dictionary)
payment_map = {
    1: "credit_card",
    2: "cash",
    3: "no_charge",
    4: "dispute",
    5: "unknown",
    6: "voided_trip",
}
df["payment_label"] = df["payment_type"].map(payment_map).astype("category")

# Drop rows where the trip is obviously bogus
df = df[df["trip_distance"] > 0]                        # zero-distance trips
df = df[df["fare_amount"] > 0]                          # negative fares
df = df[df["tip_amount"] >= 0]                          # negative tips
df = df[df["total_amount"] > 0]
df = df[df["tpep_dropoff_datetime"] > df["tpep_pickup_datetime"]]  # time travel

# Cap outliers at the 99.5th percentile so one $1000 ride doesn't skew averages
fare_cap = df["fare_amount"].quantile(0.995)
df = df[df["fare_amount"] <= fare_cap]

# Fill passenger_count nulls with 1 (the most common value, and a reasonable assumption)
df["passenger_count"] = df["passenger_count"].fillna(1).astype("Int64")

# Cast known-categorical columns
df["VendorID"] = df["VendorID"].astype("category")

A few patterns to notice:

Each filter is its own line, with a comment. A future reader (including future-you) needs to be able to scan this and see what was thrown away.
The 99.5th percentile cap on fare is a judgement call — I’m saying “anything above this is probably bad data, and including it would bias the answer.” That’s the kind of decision worth justifying in a comment.
Casting to category at the end so that the explore and groupby phases are fast.

After cleaning, print the new shape:

print(f"Cleaned shape: {df.shape}, dropped {3_267_104 - len(df):,} rows")

You should always know how many rows you dropped. If it’s more than a few percent, look at why.

Phase 3 — explore

Before you compute the answer, look at the data. This is the phase where AI assistants are unusually useful: paste the output of df.describe() and df.head() into Claude or ChatGPT and ask “what should I look at next given the question ‘tips by hour by payment type’?” The model will reliably suggest the right column derivations and the standard sanity checks. It’s not magic — it’s pattern matching against ten thousand similar analyses on the internet — but it speeds up the “what am I missing?” loop a lot. Where AI is less useful is the question of which question to ask, and the judgement calls about cleaning. Those are still yours.

Concretely, in the explore phase:

print(df["payment_label"].value_counts(normalize=True))
print(df.groupby("payment_label", observed=True)["tip_amount"].describe())
print(df["tpep_pickup_datetime"].dt.hour.value_counts().sort_index())

What I’d look for here: do credit-card and cash payers have suspiciously different tip distributions? (Yes — cash tips are usually $0 in this dataset, because the driver doesn’t enter them. That’s a huge finding for the question and we’ll have to handle it.) Are there hours of the day with weirdly few trips? (3-4 AM, expected.) Is the time range correct, or did the file include some stray December rows?

Phase 4 — transform

Now the columns the question actually needs:

df["hour"] = df["tpep_pickup_datetime"].dt.hour
df["tip_pct"] = (df["tip_amount"] / df["fare_amount"]) * 100

# The cash-tip problem: cash trips almost always show tip=0 because of how
# the data is collected. Including them would systematically pull cash tips
# toward zero in a way that doesn't reflect reality. The right call here is
# to either restrict to credit-card trips for the headline number, or to
# show both with a clear caveat. We'll do both.
df_cc = df[df["payment_label"] == "credit_card"].copy()

That third comment block is the kind of judgement call that turns “a script” into “an analysis.” A reader who only sees the numbers wouldn’t know that the cash data is structurally biased. A reader who sees this script does.

Phase 5 — answer

The actual computation, after all that setup, is a few lines:

# Headline: tip % by hour, credit-card payers only
hourly = (
    df_cc.groupby("hour", observed=True)["tip_pct"]
    .agg(["mean", "median", "count"])
    .round(2)
)
print(hourly)

# Comparison: tip % by hour, by payment type, both populations
by_pmt = (
    df.groupby(["hour", "payment_label"], observed=True)["tip_pct"]
    .mean()
    .unstack()
    .round(2)
)
print(by_pmt)

Look at the result and sanity-check it. Mean tip % around 18-22% during the day, dropping a couple of points late at night, jumping at 4-6 AM (small samples, airport runs to LGA/JFK with bigger fares and rounder tips). Cash column near zero across the board, exactly as expected from the structural problem.

Always sanity-check. If the answer looks dramatic, the bug is more often in your code than in the data.

Phase 6 — save and present

Write the result somewhere durable, in a format that’s diffable and re-loadable:

hourly.to_parquet("output/tip_pct_by_hour.parquet")

with open("output/summary.md", "w") as f:
    f.write("# NYC taxi tips, January 2025\n\n")
    f.write(f"Trips analyzed: {len(df_cc):,} (credit-card only)\n")
    f.write(f"Overall mean tip %: {df_cc['tip_pct'].mean():.2f}\n\n")
    f.write("## Hourly tip percentage\n\n")
    f.write(hourly.to_markdown())

ax = hourly["mean"].plot(kind="bar", figsize=(10, 5),
                         title="Mean tip % by hour of day (credit-card)")
ax.set_xlabel("Hour")
ax.set_ylabel("Tip %")
ax.figure.savefig("output/tip_pct_by_hour.png", dpi=150, bbox_inches="tight")

Three artifacts: a Parquet file (for further programmatic work), a Markdown summary (for the human reader), and a PNG plot (for slides). That’s the durable output of the analysis. The script can be re-run next month with a different input file and produce the same three artifacts.

The whole script

Glued together, with the boilerplate:

"""nyc_taxi_tips.py — average tip % by hour, by payment type."""
from pathlib import Path
import pandas as pd

INPUT = "data/yellow_tripdata_2025-01.parquet"
OUTPUT_DIR = Path("output")

PAYMENT_MAP = {
    1: "credit_card", 2: "cash", 3: "no_charge",
    4: "dispute", 5: "unknown", 6: "voided_trip",
}


def load_and_inspect(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path, dtype_backend="pyarrow")
    print(f"Loaded {df.shape[0]:,} rows, {df.shape[1]} cols")
    print(df.dtypes, "\n")
    print(df.describe(), "\n")
    print("Nulls per column:")
    print(df.isna().sum()[df.isna().sum() > 0], "\n")
    return df


def clean(df: pd.DataFrame) -> pd.DataFrame:
    initial = len(df)
    df = df.copy()
    df["payment_label"] = df["payment_type"].map(PAYMENT_MAP).astype("category")
    df = df[df["trip_distance"] > 0]
    df = df[df["fare_amount"] > 0]
    df = df[df["tip_amount"] >= 0]
    df = df[df["total_amount"] > 0]
    df = df[df["tpep_dropoff_datetime"] > df["tpep_pickup_datetime"]]
    df = df[df["fare_amount"] <= df["fare_amount"].quantile(0.995)]
    df["passenger_count"] = df["passenger_count"].fillna(1).astype("Int64")
    df["VendorID"] = df["VendorID"].astype("category")
    print(f"Dropped {initial - len(df):,} rows ({(initial - len(df)) / initial:.1%})")
    return df


def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["hour"] = df["tpep_pickup_datetime"].dt.hour
    df["tip_pct"] = (df["tip_amount"] / df["fare_amount"]) * 100
    return df


def analyze(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    df_cc = df[df["payment_label"] == "credit_card"]
    hourly = (
        df_cc.groupby("hour", observed=True)["tip_pct"]
        .agg(["mean", "median", "count"])
        .round(2)
    )
    by_pmt = (
        df.groupby(["hour", "payment_label"], observed=True)["tip_pct"]
        .mean().unstack().round(2)
    )
    return hourly, by_pmt


def save(hourly: pd.DataFrame, by_pmt: pd.DataFrame, df_cc: pd.DataFrame) -> None:
    OUTPUT_DIR.mkdir(exist_ok=True)
    hourly.to_parquet(OUTPUT_DIR / "tip_pct_by_hour.parquet")
    by_pmt.to_parquet(OUTPUT_DIR / "tip_pct_by_hour_by_payment.parquet")

    with open(OUTPUT_DIR / "summary.md", "w") as f:
        f.write("# NYC taxi tips, January 2025\n\n")
        f.write(f"- Credit-card trips analyzed: **{len(df_cc):,}**\n")
        f.write(f"- Overall mean tip %: **{df_cc['tip_pct'].mean():.2f}**\n\n")
        f.write("## Hourly tip percentage (credit-card)\n\n")
        f.write(hourly.to_markdown())

    ax = hourly["mean"].plot(
        kind="bar", figsize=(10, 5),
        title="Mean tip % by hour of day (credit-card)",
    )
    ax.set_xlabel("Hour"); ax.set_ylabel("Tip %")
    ax.figure.savefig(OUTPUT_DIR / "tip_pct_by_hour.png",
                      dpi=150, bbox_inches="tight")


def main() -> None:
    df = load_and_inspect(INPUT)
    df = clean(df)
    df = transform(df)
    hourly, by_pmt = analyze(df)
    print("\nHourly (credit-card):"); print(hourly)
    print("\nBy payment type:"); print(by_pmt)
    save(hourly, by_pmt, df[df["payment_label"] == "credit_card"])


if __name__ == "__main__":
    main()

About a hundred lines, six functions, runnable end to end. Each function does one phase. Each phase prints something so you can see progress. The output goes to a directory. You could schedule this with cron tonight and never look at it again, or you could rerun it on March’s file by changing one constant.

Patterns worth carrying forward

A few things in the script above that come up in every analysis I’ve ever written:

Functions per phase, called from a main(). Easier to test, easier to skip a phase when you’re iterating (“just rerun analyze on the in-memory frame”).
Constants at the top, including paths. Never bury a file path inside a function.
Print after each phase. Long pandas scripts that run silently for ten minutes and then crash are demoralizing; print the row count after each step.
Comment the judgement calls, not the code. “Cap at p99.5 because the long tail is dirty data” is worth a line. “Compute the mean of the column” is not.
Save Parquet, not CSV, for intermediate output. Faster, smaller, dtype-safe.
One plot per question. Too many plots and nobody looks at any of them.

When to reach for Polars instead

For a 3-million-row dataset, pandas is fine and the script above runs in 30 seconds. If the same script were processing a year of NYC taxi data — 36 million rows across 12 files — I’d rewrite the load and clean phases in Polars (pl.scan_parquet("data/*.parquet") followed by the same filters in expression form) and convert to pandas only for the final groupby and plotting step, where pandas’s matplotlib integration is convenient. The pattern from lesson 35 — Polars for the heavy lifting, pandas for the handoff — is exactly this.

Module 6 wrap

That closes Module 6. Over ten lessons we’ve gone from “what is a DataFrame” to “here’s a full analysis pipeline against a real public dataset,” with stops at every important pandas API along the way and a serious detour through Polars. You now know enough to do the data work in roughly any Python role: read the file, clean it, explore it, answer the question, save the answer, and have an opinion about whether to use pandas or Polars for the next one.

Module 7 is data engineering — taking analyses like the one above and making them production-grade. That means handling failure properly, writing them as schedulable jobs, monitoring them when they run unattended, and the boring-but-essential question of how to know when one of your scripts has silently started producing garbage. The shift is from “I ran the analysis once and got an answer” to “this thing runs every Monday at 6 AM and the business runs on its output.” Different mindset, same Python.