Feature engineering: the part that matters most

There’s a quiet truth in tabular machine learning that most courses skip past: the model rarely matters as much as the features. You can take the same dataset, throw the same gradient-boosted trees at it twice, and one version beats the other by ten percentage points purely because someone spent a weekend thinking about what to put on the input side. That weekend is feature engineering, and this is the lesson that, more than any other in this module, will move your numbers.

We’ll go through the categories of feature engineering, then talk about leakage — the bug that silently inflates your metrics and ruins production. Most of the actual code lives in pipelines you’ve already seen in lesson 49; the hard part is the thinking.

What “engineering” actually means here

Feature engineering is the act of transforming raw columns into representations a model can learn from. Some transformations are mechanical (centering and scaling). Some require domain knowledge (knowing that a transaction at 3am is more suspicious than one at 3pm). Some are just clever (encoding the time of day as sin(2*pi*hour/24) and cos(2*pi*hour/24) so the model sees that 23:00 is close to 00:00).

The categories I’ll walk through:

Scaling
Encoding categorical variables
Interactions
Time features
Aggregations
Text features
Missing values

And then the leakage section, which is the most important part of this lesson and the part most rushed tutorials skip.

Scaling

Most non-tree models care about the scale of your inputs. Linear models, neural networks, k-nearest neighbors, and SVMs all do worse — sometimes catastrophically so — when one feature ranges from 0 to 1 and another from 0 to 1,000,000. The fix is to standardize each numeric column to mean 0 and standard deviation 1:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

Tree-based models (random forests, XGBoost, LightGBM) don’t care. They split on thresholds, and multiplying a feature by 1000 just multiplies the learned threshold by 1000; the rows land on the same sides of the split. Don’t waste time scaling for trees.

Two variants worth knowing. MinMaxScaler rescales to [0, 1], which is what you want for inputs to neural networks with bounded activation functions. RobustScaler uses the median and interquartile range, which is what you want when your data has outliers — a single big value won’t pull the mean and inflate the standard deviation.
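To make the outlier point concrete, here is a minimal sketch on made-up numbers (the array is hypothetical, not from any real dataset). It scales nine ordinary values plus one extreme value with both scalers and compares how much spread the ordinary values keep.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: nine ordinary values plus one extreme outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

standard = StandardScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# Spread (max minus min) of the nine ordinary points after each scaling.
standard_spread = standard[:9].max() - standard[:9].min()
robust_spread = robust[:9].max() - robust[:9].min()

print(f"StandardScaler spread of ordinary points: {standard_spread:.3f}")
print(f"RobustScaler spread of ordinary points:   {robust_spread:.3f}")
```

With StandardScaler the outlier inflates the standard deviation, so the ordinary points get squeezed into a narrow band; RobustScaler keeps them spread out because the median and interquartile range barely move.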

For very long-tailed, non-negative numeric features (income, transaction amounts, page views), apply a log transform first: np.log1p(x). That’s log(1+x), which handles zeros gracefully, whereas plain log(0) is negative infinity. After the log, the distribution is closer to symmetric and scaling does what it’s supposed to.
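A quick sketch of the log-then-scale idea, using synthetic lognormal data as a stand-in for something like transaction amounts. The data, the seed, and the small skewness helper are all assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical long-tailed, non-negative feature (think transaction amounts).
amounts = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

logged = np.log1p(amounts)  # log(1 + x), defined at x = 0
scaled = StandardScaler().fit_transform(logged.reshape(-1, 1)).ravel()

print(f"skewness before log: {skewness(amounts):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

The raw feature is heavily right-skewed; after log1p the skewness should drop close to zero, and the standardized result has mean 0 and standard deviation 1 as usual.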

Encoding categorical variables

Categories need to become numbers. The right encoding depends on cardinality and whether there’s order.

Low cardinality, no order: one-hot encoding. Each category becomes its own 0/1 column.

from sklearn.preprocessing import OneHotEncoder

# sparse_output needs scikit-learn 1.2+; older versions call the argument sparse.
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_encoded = encoder.fit_transform(X[["country", "plan"]])

The handle_unknown="ignore" is what saves you in production when a new category shows up that wasn’t in training data. Without it, your service crashes the moment marketing launches in a new country.

High cardinality: don’t one-hot. If you have 50,000 unique user IDs or 10,000 product SKUs, one-hot gives you a 50,000-column sparse matrix that most models choke on. The two real options:

Target encoding: replace each category with the mean of the target variable for that category. So if you have a country column and the average conversion rate for users in Italy is 4.2%, then country=IT becomes 0.042. Powerful, but a leakage trap (more on that below). Use the category_encoders library or sklearn’s TargetEncoder (added in scikit-learn 1.3; its fit_transform cross-fits internally to limit leakage).
Hashing: feed the category through a hash function modulo n, ending up with n columns. Fast, lossy, no leakage. Good for very large cardinality where you don’t care about exact category identity.
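The hashing option is simple enough to sketch by hand. The helper names below (hash_bucket, hash_encode) and the SKU strings are hypothetical; the sketch uses hashlib rather than Python's built-in hash() because the built-in is salted per process for strings, which would make the columns differ between training and serving.

```python
import hashlib

import numpy as np

def hash_bucket(category: str, n_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(categories, n_buckets=8):
    """One row per input, with a single 1 in the column of its hashed bucket."""
    encoded = np.zeros((len(categories), n_buckets), dtype=np.int8)
    for row, category in enumerate(categories):
        encoded[row, hash_bucket(category, n_buckets)] = 1
    return encoded

skus = ["SKU-001", "SKU-002", "SKU-001", "SKU-999"]  # made-up identifiers
X_hashed = hash_encode(skus)
print(X_hashed)
```

Identical categories always land in the same column, and unrelated categories may collide, which is the "lossy" part. For real workloads scikit-learn ships a FeatureHasher that does a sparse, faster version of the same idea.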

Ordered categories: ordinal encoding is fine. ["small", "medium", "large"] to [0, 1, 2] preserves the order. But never do this on unordered categories — you’re lying to the model about distance.
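One detail worth a sketch: the order has to be stated explicitly. Left to its defaults, OrdinalEncoder sorts categories alphabetically, which would put "large" before "small". The column name and values here are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Spell out the order; the default would be alphabetical (large, medium, small).
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(df)
```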

Interactions

Sometimes two features only mean something together. Income alone doesn’t tell you much; income per household member does. Hour-of-day alone tells you something; hour-of-day combined with day-of-week tells you a lot more.

For tree models, interactions come for free — a tree that splits on income, then on household_size inside one branch, has discovered an interaction. For linear models, you have to spell them out:

import pandas as pd

df["income_per_member"] = df["income"] / df["household_size"]
df["weekend_x_hour"] = df["is_weekend"] * df["hour"]

Or use PolynomialFeatures(degree=2, interaction_only=True) to generate all pairwise products automatically. It’s useful for small feature sets and dangerous for large ones, because the number of pairs grows quadratically: n features produce n(n-1)/2 products.
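A small sketch of how fast the column count grows. The input matrix is arbitrary, and n_output_columns is a hypothetical helper that just evaluates the counting formula for pairwise-only interactions without a bias column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(12, dtype=float).reshape(3, 4)  # 3 rows, 4 hypothetical features

# interaction_only=True keeps products of distinct features and drops squares;
# include_bias=False drops the constant column.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)

def n_output_columns(n_features: int) -> int:
    """Original features plus one column per unordered pair."""
    return n_features + n_features * (n_features - 1) // 2

print(X_inter.shape)
for n in (4, 20, 100, 1000):
    print(n, "features ->", n_output_columns(n), "columns")
```

Four features turn into ten columns, which is harmless. A thousand features turn into roughly half a million, which is not.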

Time features

Timestamps need to be unpacked. A raw datetime column carries almost no signal a model can use directly; pulling out the components does:

df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["month"] = df["timestamp"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)  # Saturday, Sunday
df["is_business_hour"] = df["hour"].between(9, 16).astype(int)  # 09:00-16:59; between() includes both ends

For cyclical features — hour, day of week, month — there’s a subtle problem: if you encode hour as 0-23, the model sees 23 and 0 as 23 units apart, when really they’re 1 hour apart. Cyclic encoding via sin/cos fixes this:

import numpy as np

df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

Now the model sees the wraparound correctly. Same trick works for day-of-week (period 7), month (period 12), day-of-year (period 365.25).
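To check the wraparound claim numerically, here is a short sketch. The helper names encode_hour and cyclic_distance are made up for this example; the distance is the plain Euclidean distance between the (sin, cos) pairs.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def cyclic_distance(hour_a, hour_b):
    """Euclidean distance between two hours in the sin/cos encoding."""
    return float(np.linalg.norm(encode_hour(hour_a) - encode_hour(hour_b)))

print("23 vs 0: ", cyclic_distance(23, 0))
print("0 vs 1:  ", cyclic_distance(0, 1))
print("0 vs 12: ", cyclic_distance(0, 12))
```

In this encoding 23:00 to 00:00 is exactly as far apart as 00:00 to 01:00, and noon is the farthest point from midnight.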

Don’t include the raw timestamp itself unless you’re trying to capture a long-term trend. Even then, convert it to elapsed time, e.g. (df["timestamp"] - reference_date).dt.days, so the feature is a small, interpretable number of days rather than a huge epoch value.
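A minimal sketch of the elapsed-days trend feature. The reference date and the timestamps are arbitrary choices for illustration; on a pandas column, the day count comes from the .dt accessor.

```python
import pandas as pd

df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-03-01"])}
)

# Hypothetical reference point, e.g. the first day covered by the training data.
reference_date = pd.Timestamp("2024-01-01")

# Subtracting gives a timedelta column; .dt.days extracts whole days.
df["days_since_reference"] = (df["timestamp"] - reference_date).dt.days
print(df)
```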

Aggregations

This is where the real lift hides on user-level or transaction-level problems. For each user, you compute features like:

Total amount spent in the past 7 days
Number of transactions in the past 30 days
Average transaction amount over the past 90 days
Days since last transaction

In pandas, this is a groupby plus a windowed aggregation. The catch — and we’re now sliding into the leakage section — is that “past 7 days” must mean past 7 days relative to the prediction time. If your prediction is for transaction T, then the aggregations must use only data with timestamp strictly before T. Including T itself in the aggregation is a leak: the model gets to see the answer.

# Wrong (leaks): aggregation over all data
df["user_total_spend"] = df.groupby("user_id")["amount"].transform("sum")

# Right: rolling window strictly before each row
df = df.sort_values(["user_id", "timestamp"])
df["spend_past_7d"] = (
    df.groupby("user_id")
      .rolling("7D", on="timestamp", closed="left")["amount"]
      .sum()
      .reset_index(level=0, drop=True)
)

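closed="left" makes each window [t - 7 days, t), so the current row is excluded, along with any other row that shares its exact timestamp. A user’s first transaction gets NaN because it has no history. Without closed="left", the default window includes the current row, and you’ve leaked.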
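The non-windowed features from the list above (counts, totals, days since last transaction) can be built with the same "strictly before" discipline. This is a sketch on a tiny made-up table; the column names are illustrative, and it assumes each user's timestamps are distinct so that row order within a user matches time order.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-10", "2024-01-02", "2024-01-04"]
    ),
    "amount": [10.0, 20.0, 30.0, 5.0, 7.0],
}).sort_values(["user_id", "timestamp"])

# Number of earlier transactions by the same user (0 for the first one).
df["prior_txn_count"] = df.groupby("user_id").cumcount()

# Total spent before this transaction: running total minus the current amount.
df["prior_spend"] = df.groupby("user_id")["amount"].cumsum() - df["amount"]

# Days since the user's previous transaction (NaN when there is none).
df["days_since_last"] = df.groupby("user_id")["timestamp"].diff().dt.days

print(df)
```

Each feature for a row is computed from that user's earlier rows only, so the current transaction's amount never feeds its own features.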

Text features

Free-text columns — descriptions, search queries, support tickets — don’t go in raw. The two classical paths:

Bag of words / TF-IDF — count terms, weight by inverse document frequency. Cheap, interpretable, surprisingly competitive on short documents.

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_text = vec.fit_transform(df["description"])

Embeddings — feed the text through a sentence transformer model and use the resulting vector as features. Much better signal, especially when meaning matters more than exact words. In 2026, the standard is sentence-transformers from Hugging Face; the small all-MiniLM-L6-v2 model is fast on CPU and good enough for most tabular-plus-text problems.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["description"].tolist())  # (n, 384) array

Concatenate the embedding columns to your other features and treat them like any other numeric input.
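A sketch of the concatenation step. To keep it runnable without downloading a model, random vectors stand in for the output of model.encode(...); the tabular columns and the emb_ column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical tabular features for three products.
df = pd.DataFrame({"price": [9.99, 24.5, 3.0], "n_reviews": [12, 340, 0]})

# Stand-in for model.encode(...): random vectors with the same width (384).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(df), 384))

# Name the columns and reuse df's index so rows line up when concatenating.
emb_cols = [f"emb_{i}" for i in range(embeddings.shape[1])]
emb_df = pd.DataFrame(embeddings, columns=emb_cols, index=df.index)

X = pd.concat([df, emb_df], axis=1)
print(X.shape)
```

Reusing df.index matters: pd.concat aligns on the index, so a mismatched index would silently produce NaN rows instead of an error.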

Missing values

Real data has gaps. Most models don’t handle them natively; XGBoost and LightGBM are exceptions, learning at each split which branch the missing values should follow. Three strategies:

Impute: replace missing with the mean, median, or mode. Median is usually the safer default for numeric features.
Sentinel value: replace with a value the model can’t confuse for real data, such as -999 for a positive-valued feature or "_missing_" for a category. A numeric sentinel suits tree models; a linear model or neural network will read -999 as a real magnitude.
Indicator column: add a separate 0/1 column saying whether the value was missing. Often combined with imputation, because the fact of missingness can itself be predictive.

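from sklearn.impute import SimpleImputer

# Record missingness first; after imputation the signal is gone.
X_train["age_missing"] = X_train["age"].isna().astype(int)
X_test["age_missing"] = X_test["age"].isna().astype(int)

# Learn the median from training data only, then reuse it on the test set.
imputer = SimpleImputer(strategy="median")
X_train[["age"]] = imputer.fit_transform(X_train[["age"]])
X_test[["age"]] = imputer.transform(X_test[["age"]])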
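The indicator-plus-imputation combination can also be done in one step with SimpleImputer's add_indicator option. This sketch uses tiny made-up arrays, and fits on the training portion only so the test rows are filled with the training median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column "age" data with gaps.
X_train = np.array([[25.0], [np.nan], [40.0], [31.0]])
X_test = np.array([[np.nan], [52.0]])

# add_indicator=True appends a 0/1 column for each feature that had
# missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
train_out = imputer.fit_transform(X_train)
test_out = imputer.transform(X_test)

print(train_out)  # columns: imputed age, was-missing flag
print(test_out)
```

Because the imputer is a regular transformer, it can go straight into a Pipeline, which keeps the median from ever being computed on held-out rows.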

For tree-based models with native missing-value support (XGBoost, LightGBM), just leave the NaNs in — the library will handle them and often does better than your imputation would.

Leakage: the bug that lies to you

Now the part that takes longer to learn than any other. Leakage happens when information that wouldn’t be available at prediction time sneaks into your training features. The model picks up on it, scores great in cross-validation, and falls apart in production because the leaked signal isn’t there at inference.

Two main flavors:

Target leakage — a feature includes information about the target. The classic example: predicting whether a customer will churn, with a feature like cancellation_processed_date. That field only gets populated after churn happens. Including it gives the model a near-perfect predictor that doesn’t exist when you actually want to predict.

The fix: think hard about when each feature becomes available. If you’re predicting at time T, every feature must be derivable from data before T. If a column might be populated after the target event, drop it or recompute it as of T.
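One way to enforce "derivable from data before T" is to filter the raw event log to the prediction time before computing anything. The sketch below assumes a hypothetical events table and a single cutoff date; the table layout and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical event log. The cancellation row only exists after churn happened.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-20", "2024-01-15", "2024-03-02"]
    ),
    "event_type": [
        "login", "support_ticket", "cancellation_processed", "login", "login",
    ],
})

prediction_time = pd.Timestamp("2024-03-01")

# Keep only what was knowable strictly before the prediction time.
visible = events[events["event_time"] < prediction_time]

features = visible.groupby("customer_id").agg(
    n_events=("event_type", "size"),
    last_event=("event_time", "max"),
)
features["days_since_last_event"] = (
    prediction_time - features["last_event"]
).dt.days

print(features)
```

The cancellation event never reaches the feature code, so no aggregate built on top of visible can encode it by accident.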

Train-test leakage: preprocessing uses statistics from the test set. If you fit a StandardScaler on the full dataset before splitting, the scaler has seen the test data’s mean and standard deviation. For a scaler the resulting inflation is usually tiny, but on tight problems it can be the difference between launch and don’t-launch, and for target encodings or feature selection fit on the full data it can be large.

The fix is exactly what scikit-learn pipelines were designed for: wrap the preprocessing in a Pipeline, pass it to cross_val_score, and the library refits everything inside each fold automatically.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inside each fold, the scaler refits on that fold's training portion only.
scores = cross_val_score(pipe, X, y, cv=5)

If you compute aggregations or target encodings outside the pipeline, you’re back to manual leakage prevention: split first, then compute features using only training data, then transform test data with the training-fit transformer.
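For target encoding, "manual leakage prevention" usually means out-of-fold encoding: each training row is encoded with category means computed from other rows only. The sketch below is one simple way to do it, not the method any particular library uses; the function names and the toy data are hypothetical, and it applies no smoothing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(train, column, target, n_splits=5, seed=0):
    """Encode each training row using target means from the other folds."""
    global_mean = train[target].mean()
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in folds.split(train):
        means = train.iloc[fit_idx].groupby(column)[target].mean()
        encoded.iloc[apply_idx] = (
            train[column].iloc[apply_idx].map(means).to_numpy()
        )
    # Categories absent from the fitting folds fall back to the global mean.
    return encoded.fillna(global_mean)

def target_encode_test(train, test, column, target):
    """Encode test rows with means learned from the full training set."""
    means = train.groupby(column)[target].mean()
    return test[column].map(means).fillna(train[target].mean())

train = pd.DataFrame({
    "country": ["IT", "IT", "FR", "FR", "DE", "IT", "FR", "DE"],
    "converted": [1, 0, 0, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({"country": ["IT", "BR"]})  # "BR" is unseen

train["country_te"] = target_encode_oof(train, "country", "converted")
test["country_te"] = target_encode_test(train, test, "country", "converted")
print(train)
print(test)
```

A training row's own target never contributes to its encoding, which is what removes the leak. Test rows can safely use means from the whole training set, because no test targets are involved.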

A useful gut check: if your model scores suspiciously well on the first try — say 99% accuracy on a problem your domain expert says is hard — assume leakage until proven otherwise. Real models on real problems are imperfect. A perfect model usually means a peeking model.

A note on AI assistance

When you ask an AI assistant to write feature engineering code for a dataset, it’s good at the standard patterns — give it a sample of your data and it will suggest reasonable scalers, encoders, and time-feature splits. What it’s unreliable about is the time semantics of aggregations. It will happily suggest features like “average of all past transactions” without a window or a closed="left", and the resulting code will leak the future into the present.

Always read AI-suggested feature code with one question on your mind: “at the moment we make a prediction, does each input column only contain information that was available before that moment?” If the answer is no for any column, fix it before you trust the metrics.

Closing

Feature engineering is the slow, unglamorous part of tabular ML, and also the part where the work pays off. The transforms in this lesson cover most cases you’ll meet in real datasets; the leakage discipline is what keeps your numbers honest. Next lesson, we put all this through tree-based models — the family that, in 2026, still wins on tabular problems.