ML project: a classification problem, end to end

A model that gets 0.92 AUC in a Jupyter notebook and never leaves the notebook is worth zero. A model that gets 0.85 AUC and is sitting behind an HTTP endpoint that real systems can call is worth real money. The distance between the two is where most data scientists get stuck, and it’s where the engineering in machine learning engineering actually shows up.

This lesson is the practical conclusion of Module 9. We’ll take a real binary classification problem — customer churn — from raw CSV to deployed prediction service. The dataset is the IBM Telco Customer Churn dataset, freely available on Kaggle and in many mirrored repos. About 7,000 customers, 20-ish features, a binary Churn column. The same shape of problem you’d see in fraud detection, credit default, conversion prediction, or any “will this user do X” question.

Every step deliberately keeps the code straightforward. The point isn’t to maximize the leaderboard score; it’s to demonstrate the workflow from end to end with code you’d actually be willing to ship.

Step 1: load and explore

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("telco_churn.csv")
print(df.shape)
print(df.dtypes)
print(df["Churn"].value_counts(normalize=True))

First questions: what’s the target distribution? What columns are numeric vs categorical? Any missing values?

print(df.isna().sum())

# TotalCharges is read as object, a common gotcha: new customers have a blank string
# there, so the isna() check above does not count those rows as missing.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
print(df["TotalCharges"].isna().sum())  # ~11 rows
df = df.dropna(subset=["TotalCharges"])

df["Churn"] = (df["Churn"] == "Yes").astype(int)
df = df.drop(columns=["customerID"])
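
The blank-string gotcha is easy to reproduce on a toy frame. This is a minimal sketch with made-up values; the three-row frame and the names toy, missing_before and missing_after are assumptions for illustration, not rows from the Telco file.

```python
import pandas as pd

# Toy frame (hypothetical values): the blank TotalCharges mimics a brand-new customer.
toy = pd.DataFrame({
    "tenure": [0, 1, 24],
    "TotalCharges": [" ", "29.85", "1889.5"],
})

# A blank string is not NaN, so isna() reports nothing missing at this point.
missing_before = int(toy["TotalCharges"].isna().sum())

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(toy["TotalCharges"].isna().sum())

print(missing_before, missing_after)
```

Checking isna() only after the numeric conversion is what surfaces the rows to drop.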

The Churn rate is roughly 27% — class imbalance is mild but real. Worth knowing for stratification later. Quick visual sanity:

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["tenure"].hist(by=df["Churn"], bins=30, ax=axes)
axes[0].set_title("tenure | Churn=0")
axes[1].set_title("tenure | Churn=1")
plt.tight_layout()
plt.show()

Customers who churn have much shorter tenure. Useful prior — it tells us a tree will probably split on tenure early.

Step 2: feature engineering

Most of the columns are categorical with two or three levels. A few derived features that domain knowledge suggests:

df["AvgChargePerMonth"] = df["TotalCharges"] / df["tenure"].replace(0, 1)
df["IsLongTermContract"] = (df["Contract"] != "Month-to-month").astype(int)
df["NumServices"] = (
    (df["PhoneService"] == "Yes").astype(int)
    + (df["MultipleLines"] == "Yes").astype(int)
    + (df["InternetService"] != "No").astype(int)
    + (df["OnlineSecurity"] == "Yes").astype(int)
    + (df["OnlineBackup"] == "Yes").astype(int)
    + (df["DeviceProtection"] == "Yes").astype(int)
    + (df["TechSupport"] == "Yes").astype(int)
    + (df["StreamingTV"] == "Yes").astype(int)
    + (df["StreamingMovies"] == "Yes").astype(int)
)

These are domain hypotheses: customers paying a lot per month relative to their history might churn, contract type matters, total service depth matters. We’ll let the model decide if they’re useful.

For the modeling step we want a clean separation between categoricals and numerics so we can build a column transformer:

y = df["Churn"]
X = df.drop(columns=["Churn"])

categorical = X.select_dtypes(include="object").columns.tolist()
numeric = X.select_dtypes(exclude="object").columns.tolist()
print(f"{len(categorical)} categorical, {len(numeric)} numeric")

Step 3: stratified split

Three-way split: train, validation (for comparing models and checking the tuned one), test (touched once, at the end).

from sklearn.model_selection import train_test_split

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=42
)
print(X_train.shape, X_val.shape, X_test.shape)

stratify=y keeps the churn rate consistent across the splits. With imbalanced classes that’s not optional — without it, you can land a test set with a notably different positive rate and your evaluation becomes noisy.
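
A quick way to confirm the split did what you expect is to print the size and positive rate of each part. The sketch below uses synthetic labels (about 27% positive) rather than the Telco data; X_demo, y_demo and the other names are placeholders introduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y_demo = (rng.random(n) < 0.27).astype(int)   # synthetic labels, roughly 27% positive
X_demo = np.arange(n).reshape(-1, 1)          # placeholder feature, values do not matter

# Same two-step recipe as above: carve off the test set, then split the rest.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.15 / 0.85, stratify=y_tmp, random_state=42
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name:5s} n={len(part):4d} positive rate={part.mean():.3f}")
```

The three positive rates should agree to within a fraction of a percentage point; if they do not, stratification was probably dropped somewhere.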

Step 4: baseline — logistic regression

The ritual baseline. Build a Pipeline so preprocessing and model are coupled:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore", drop="if_binary"), categorical),
])

baseline = Pipeline([
    ("pre", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

baseline.fit(X_train, y_train)
val_proba = baseline.predict_proba(X_val)[:, 1]
print(f"Baseline AUC = {roc_auc_score(y_val, val_proba):.3f}")

Note class_weight="balanced": handy for the mild imbalance, though at a 27% positive rate you can also leave it at the default. The drop="if_binary" on the OneHotEncoder gets rid of the redundant column on yes/no features.
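
For intuition on what "balanced" does: scikit-learn documents the heuristic as n_samples / (n_classes * np.bincount(y)), so each class ends up with the same total weight. A small sketch on made-up labels (y_demo is an assumption, not the churn column):

```python
import numpy as np

y_demo = np.array([0] * 73 + [1] * 27)     # made-up labels, 27% positive

counts = np.bincount(y_demo)               # samples per class
weights = len(y_demo) / (len(counts) * counts)

# Each class contributes the same total weight after reweighting.
totals = weights * counts

print("weights:", weights.round(3))
print("total weight per class:", totals)
```

The minority class gets the larger per-sample weight, which is why the setting matters less as the classes get closer to even.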

Typical AUC on this dataset: around 0.84. That’s the number every other model has to beat.

Step 5: better models with sane defaults

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

xgb = Pipeline([
    ("pre", preprocess),
    ("clf", XGBClassifier(
        n_estimators=500,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        eval_metric="logloss",
    )),
])
xgb.fit(X_train, y_train)
print(f"XGB AUC = {roc_auc_score(y_val, xgb.predict_proba(X_val)[:, 1]):.3f}")

lgbm = Pipeline([
    ("pre", preprocess),
    ("clf", LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,
        random_state=42,
    )),
])
lgbm.fit(X_train, y_train)
print(f"LGBM AUC = {roc_auc_score(y_val, lgbm.predict_proba(X_val)[:, 1]):.3f}")

On Telco, both ensembles typically score 0.85-0.86 — only a hair above the linear baseline. This is, by the way, one of the patterns Lesson 52 warned about: on a problem dominated by additive signals, regularized linear models hold up.
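
One way to judge whether a gap this small is more than noise is a bootstrap interval on the AUC difference. The sketch below is an illustration under assumptions, not part of the lesson's workflow: bootstrap_auc_diff is a hypothetical helper, and the two score vectors are synthetic stand-ins for two models' validation scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=300, seed=0):
    """Percentile interval for AUC(b) - AUC(a) under row resampling."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined when a resample contains a single class
        diffs.append(
            roc_auc_score(y_true[idx], scores_b[idx])
            - roc_auc_score(y_true[idx], scores_a[idx])
        )
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic stand-ins: two equally good scorers built from a shared noisy signal.
rng = np.random.default_rng(1)
y_demo = (rng.random(1000) < 0.27).astype(int)
signal = y_demo + rng.normal(0, 1.0, size=1000)
scores_a = signal + rng.normal(0, 0.3, size=1000)
scores_b = signal + rng.normal(0, 0.3, size=1000)

low, high = bootstrap_auc_diff(y_demo, scores_a, scores_b)
print(f"95% interval for the AUC difference: [{low:.3f}, {high:.3f}]")
```

With real models you would pass y_val and the two sets of validation probabilities. An interval that straddles zero suggests the difference is within resampling noise.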

For pedagogy, let’s pretend the gap matters and tune.

Step 6: tune with Optuna

import optuna
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }
    pipe = Pipeline([
        ("pre", preprocess),
        ("clf", XGBClassifier(**params, random_state=42, eval_metric="logloss")),
    ])
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    scores = cross_val_score(pipe, X_train, y_train, scoring="roc_auc", cv=cv, n_jobs=1)
    return scores.mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=40, show_progress_bar=True)

print("Best AUC (CV):", study.best_value)
print("Best params:", study.best_params)

best_xgb = Pipeline([
    ("pre", preprocess),
    ("clf", XGBClassifier(**study.best_params, random_state=42, eval_metric="logloss")),
])
best_xgb.fit(X_train, y_train)
val_auc = roc_auc_score(y_val, best_xgb.predict_proba(X_val)[:, 1])
print(f"Tuned XGB val AUC = {val_auc:.3f}")

Forty trials is usually enough for the search to pick up most of the available gain, which on this dataset tends to be another 0.005-0.01 AUC. Worth it? Depends on the business. The point is the workflow.

Step 7: evaluate honestly

Now the test set, touched for the first time:

from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_recall_curve, roc_curve, auc,
)

test_proba = best_xgb.predict_proba(X_test)[:, 1]
test_pred = (test_proba > 0.5).astype(int)

print(f"Test AUC = {roc_auc_score(y_test, test_proba):.3f}")
print(classification_report(y_test, test_pred))
print(confusion_matrix(y_test, test_pred))

AUC alone hides operationally important details. The confusion matrix and per-class precision/recall tell you what kind of mistakes the model makes. For churn, false negatives (we said “won’t churn”, customer churned) are usually more expensive than false positives (we said “will churn”, customer didn’t, we wasted a retention coupon).
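
To make the link between the confusion matrix and the per-class numbers concrete, here is a small sketch with made-up counts. For 0/1 labels, scikit-learn's confusion_matrix(...).ravel() yields the counts in the order tn, fp, fn, tp; the helper name precision_recall and the counts themselves are assumptions for illustration.

```python
def precision_recall(tn, fp, fn, tp):
    """Precision and recall for the positive (churn) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up counts for 1,000 customers, 270 of whom actually churned.
tn, fp, fn, tp = 650, 80, 120, 150

precision, recall = precision_recall(tn, fp, fn, tp)
print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"missed churners (false negatives): {fn}")
```

Recall is the number to watch when false negatives are the expensive mistake: it is the share of actual churners the model caught.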

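That asymmetry argues for tuning the classification threshold away from 0.5. Choose it on the validation set rather than the test set, so the test numbers above remain an honest estimate:

val_proba_tuned = best_xgb.predict_proba(X_val)[:, 1]
prec, rec, thresh = precision_recall_curve(y_val, val_proba_tuned)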
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(thresh, prec[:-1], label="precision")
ax.plot(thresh, rec[:-1], label="recall")
ax.set_xlabel("threshold")
ax.legend()
ax.grid(True)
plt.show()

You read this curve and pick a threshold matching your business cost ratio. Maybe 0.35: lower threshold, more churn predictions, higher recall, lower precision. Acceptable if a retention call costs $5 and a saved customer is worth $200.
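
The cost reasoning can also be written down as a small sweep. This is a sketch under stated assumptions: pick_threshold is a hypothetical helper, the labels and scores are synthetic, and save_rate (the share of contacted churners who actually stay) is a parameter introduced here that the lesson does not specify.

```python
import numpy as np

def pick_threshold(y_true, proba, call_cost=5.0, saved_value=200.0, save_rate=0.1):
    """Return (threshold, net_value) with the highest net value on the given data."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best = (None, -np.inf)
    for t in np.linspace(0.05, 0.95, 19):
        flagged = proba >= t
        true_pos = np.sum(flagged & (y_true == 1))
        # Value of the churners we expect to save, minus the cost of every call made.
        value = true_pos * save_rate * saved_value - flagged.sum() * call_cost
        if value > best[1]:
            best = (round(float(t), 2), float(value))
    return best

# Synthetic validation labels and scores (assumption, not the churn model's output).
rng = np.random.default_rng(0)
y_demo = (rng.random(2000) < 0.27).astype(int)
proba_demo = np.clip(0.27 + 0.25 * (y_demo - 0.27) + rng.normal(0, 0.15, 2000), 0, 1)

threshold, value = pick_threshold(y_demo, proba_demo)
print(f"threshold={threshold:.2f} net value=${value:,.0f}")
```

Under these assumptions a call breaks even when precision among flagged customers exceeds call_cost / (save_rate * saved_value), which is 0.25 for the defaults. Run the sweep on validation data so the test set keeps its role.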

Step 8: interpret with SHAP

Stakeholders will ask “why did the model flag this customer?” SHAP values give you a per-row decomposition.

import shap

# Pull the trained classifier out of the pipeline
clf = best_xgb.named_steps["clf"]
X_train_transformed = best_xgb.named_steps["pre"].transform(X_train)
X_test_transformed = best_xgb.named_steps["pre"].transform(X_test)

# Recover feature names from the ColumnTransformer
feat_names = best_xgb.named_steps["pre"].get_feature_names_out()

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test_transformed)

# Global view
shap.summary_plot(shap_values, X_test_transformed, feature_names=feat_names)

# Single-customer explanation
i = 0
shap.force_plot(
    explainer.expected_value,
    shap_values[i],
    X_test_transformed[i],
    feature_names=feat_names,
    matplotlib=True,
)

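The summary plot ranks features by how much they move predictions on average. The force plot for a single customer tells you which features pushed that customer’s score up and which pushed it down, for example contract=month-to-month and tenure=3 both pushing toward churn. One caveat: by default TreeExplainer explains the model’s raw output, which for this classifier is log-odds, so the contributions add up in log-odds units, not probability points. A reading like “churn probability is 0.74 because contract=month-to-month added +0.18” is only valid if you request probability-space values (TreeExplainer takes model_output="probability" together with a background dataset; check the SHAP documentation for your version). That per-customer view is what you put in front of a churn prevention team: it tells them which lever to pull.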
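
The additivity behind a force plot is simple arithmetic. With TreeExplainer's default raw output for a boosted classifier, the base value and the per-feature contributions are in log-odds, and the probability comes from applying the sigmoid to their sum. The numbers below are illustrative only, not output from the churn model.

```python
import math

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values (assumption): base value and contributions in log-odds units.
base_value = -1.20
contributions = {
    "Contract=Month-to-month": 0.90,
    "tenure": 0.75,
    "MonthlyCharges": 0.30,
    "TechSupport=Yes": -0.25,
}

log_odds = base_value + sum(contributions.values())
probability = sigmoid(log_odds)

print(f"log-odds={log_odds:.2f} probability={probability:.3f}")
```

Because the sigmoid is nonlinear, the same log-odds contribution moves the probability by different amounts depending on where the customer already sits.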

Step 9: save the pipeline

The whole pipeline — preprocessing plus model — saved as one object. That’s the point of Pipeline.

import joblib
import json
from datetime import date

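# Refit on train+val for the final production model.
# clone() gives this pipeline its own unfitted preprocessor. Pipeline fits its steps
# in place, so reusing `preprocess` directly would refit the object best_xgb shares.
from sklearn.base import clone

X_full = pd.concat([X_train, X_val])
y_full = pd.concat([y_train, y_val])
final_model = Pipeline([
    ("pre", clone(preprocess)),
    ("clf", XGBClassifier(**study.best_params, random_state=42, eval_metric="logloss")),
])
final_model.fit(X_full, y_full)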

joblib.dump(final_model, "churn_model_v1.joblib")

# Schema documentation — the contract for callers
schema = {
    "model_version": "v1",
    "trained_on": str(date.today()),
    "test_auc": round(float(roc_auc_score(y_test, test_proba)), 3),
    "input_columns": {
        col: str(X[col].dtype) for col in X.columns
    },
    "categorical_levels": {
        col: sorted(X[col].dropna().unique().tolist()) for col in categorical
    },
}
with open("churn_model_v1.schema.json", "w") as f:
    json.dump(schema, f, indent=2, default=str)

The schema file is critical. Without it, six months from now nobody will remember what columns this model expects, in what order, or which values are valid for the categoricals. Save the contract.
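
To show how the saved contract can be used, here is a minimal sketch of a checker that compares one incoming record against a schema with the same two keys as the file above. validate_record and schema_demo are hypothetical names, and the cut-down schema is made up for the example; the real file also lists the engineered columns, since it is built from X.

```python
def validate_record(record, schema):
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    expected = schema["input_columns"]
    for col in expected:
        if col not in record:
            problems.append(f"missing column: {col}")
    for col in record:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    for col, levels in schema["categorical_levels"].items():
        if col in record and record[col] not in levels:
            problems.append(f"{col}: unknown level {record[col]!r}")
    return problems

# Cut-down, hypothetical schema with the same keys as the saved JSON file.
schema_demo = {
    "input_columns": {"tenure": "int64", "Contract": "object"},
    "categorical_levels": {"Contract": ["Month-to-month", "One year", "Two year"]},
}

ok = validate_record({"tenure": 3, "Contract": "Month-to-month"}, schema_demo)
bad = validate_record({"Contract": "Weekly"}, schema_demo)
print(ok)
print(bad)
```

In practice you would load the schema with json.load and run a check like this before calling predict_proba.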

Step 10: minimal serving endpoint

A FastAPI service that loads the pipeline and predicts on a posted JSON record:

# server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Literal
import pandas as pd
import joblib

app = FastAPI(title="Churn Predictor v1")
model = joblib.load("churn_model_v1.joblib")

class CustomerFeatures(BaseModel):
    gender: Literal["Male", "Female"]
    SeniorCitizen: int
    Partner: Literal["Yes", "No"]
    Dependents: Literal["Yes", "No"]
    tenure: int
    PhoneService: Literal["Yes", "No"]
    MultipleLines: Literal["Yes", "No", "No phone service"]
    InternetService: Literal["DSL", "Fiber optic", "No"]
    OnlineSecurity: Literal["Yes", "No", "No internet service"]
    OnlineBackup: Literal["Yes", "No", "No internet service"]
    DeviceProtection: Literal["Yes", "No", "No internet service"]
    TechSupport: Literal["Yes", "No", "No internet service"]
    StreamingTV: Literal["Yes", "No", "No internet service"]
    StreamingMovies: Literal["Yes", "No", "No internet service"]
    Contract: Literal["Month-to-month", "One year", "Two year"]
    PaperlessBilling: Literal["Yes", "No"]
    PaymentMethod: str
    MonthlyCharges: float
    TotalCharges: float

class Prediction(BaseModel):
    churn_probability: float
    will_churn: bool
    threshold: float = 0.35

@app.post("/predict", response_model=Prediction)
def predict(features: CustomerFeatures):
    df = pd.DataFrame([features.model_dump()])  # Pydantic v2; on v1 use features.dict()
    df["AvgChargePerMonth"] = df["TotalCharges"] / df["tenure"].replace(0, 1)
    df["IsLongTermContract"] = (df["Contract"] != "Month-to-month").astype(int)
    df["NumServices"] = (
        (df["PhoneService"] == "Yes").astype(int)
        + (df["MultipleLines"] == "Yes").astype(int)
        + (df["InternetService"] != "No").astype(int)
        + (df["OnlineSecurity"] == "Yes").astype(int)
        + (df["OnlineBackup"] == "Yes").astype(int)
        + (df["DeviceProtection"] == "Yes").astype(int)
        + (df["TechSupport"] == "Yes").astype(int)
        + (df["StreamingTV"] == "Yes").astype(int)
        + (df["StreamingMovies"] == "Yes").astype(int)
    )
    try:
        proba = float(model.predict_proba(df)[0, 1])
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))
    return Prediction(
        churn_probability=proba,
        will_churn=proba > 0.35,
    )

@app.get("/health")
def health():
    return {"status": "ok", "model_version": "v1"}

Run it with:

uv add fastapi uvicorn pydantic joblib pandas xgboost scikit-learn
uvicorn server:app --reload --port 8000

Test:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"gender":"Female","SeniorCitizen":0,"Partner":"Yes","Dependents":"No","tenure":1,"PhoneService":"No","MultipleLines":"No phone service","InternetService":"DSL","OnlineSecurity":"No","OnlineBackup":"Yes","DeviceProtection":"No","TechSupport":"No","StreamingTV":"No","StreamingMovies":"No","Contract":"Month-to-month","PaperlessBilling":"Yes","PaymentMethod":"Electronic check","MonthlyCharges":29.85,"TotalCharges":29.85}'

You’ll get back a JSON with churn_probability and a boolean. That’s a deployable model.

A few things you’d add for production that we left out for length: input validation against the saved schema, structured logging of every prediction (input + output) for monitoring drift, a /metrics Prometheus endpoint, request-level tracing, and the same feature engineering function imported from a shared module rather than copied between training and serving. The last one is the most important. Train/serve skew, where training and serving features are computed by different code paths that subtly disagree, is one of the most common causes of “the model worked great in evaluation, why is it bad in production.” Always factor the feature engineering into a function that both the training notebook and the serving endpoint import.
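
As a sketch of that last point, the three engineered columns can live in one function that both sides import. The function name add_features, the constant SERVICE_COLUMNS and the idea of putting them in a shared module are assumptions for illustration; the logic mirrors the feature code used in Step 2 and in server.py.

```python
import pandas as pd

# Yes/No service columns counted toward NumServices (InternetService is handled separately).
SERVICE_COLUMNS = [
    "PhoneService", "MultipleLines", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
]

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the three engineered columns added."""
    out = df.copy()
    out["AvgChargePerMonth"] = out["TotalCharges"] / out["tenure"].replace(0, 1)
    out["IsLongTermContract"] = (out["Contract"] != "Month-to-month").astype(int)
    out["NumServices"] = (
        (out[SERVICE_COLUMNS] == "Yes").sum(axis=1)
        + (out["InternetService"] != "No").astype(int)
    )
    return out

# One record with the same values as the curl example above.
row = pd.DataFrame([{
    "tenure": 1, "TotalCharges": 29.85, "Contract": "Month-to-month",
    "PhoneService": "No", "MultipleLines": "No phone service",
    "InternetService": "DSL", "OnlineSecurity": "No", "OnlineBackup": "Yes",
    "DeviceProtection": "No", "TechSupport": "No",
    "StreamingTV": "No", "StreamingMovies": "No",
}])
features = add_features(row)
print(features[["AvgChargePerMonth", "IsLongTermContract", "NumServices"]])
```

Training would call add_features(df) before the split and the endpoint would call it on the single-row frame, so any change to a feature definition happens in exactly one place.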

What you have, all together

That’s a complete project. We loaded raw data, cleaned it, engineered features, split with stratification, fit a baseline, fit better models, tuned with Optuna, evaluated on a held-out test set, interpreted with SHAP, saved the pipeline and its schema, and stood up an HTTP endpoint. Every step matches a tool from Module 9.

The thing nobody tells junior data scientists: the modeling itself is maybe 20% of the work. The other 80% is the loading, cleaning, splitting, evaluating, saving, and serving. Once you have that scaffold, you can swap models almost trivially. The scaffold is the asset.

What’s next: deep learning

This finishes Module 9. We’ve stayed firmly in the world of tabular models — linear, tree, ensemble. The deepest model we’ve touched still has clear interpretable knobs. Module 10 turns to deep learning: PyTorch, neural network training, transformers, and the parts of modern ML that don’t fit comfortably into a model.fit(X, y) call. Different toolchain, different style of debugging, different failure modes. But many of the same workflow lessons from this module — train/test discipline, baseline first, save the pipeline, watch for skew — carry over directly.

See you in Module 10.

References: scikit-learn user guide, Optuna documentation (https://optuna.org/), SHAP documentation (https://shap.readthedocs.io/), Telco customer churn dataset (IBM, available on Kaggle). Retrieved 2026-05-01.