Tree-based models: random forest, XGBoost, LightGBM

If you take one practical thing away from this module, let it be this: on tabular data, in 2026, tree-based models still win. Not “win sometimes, depending on the dataset.” Win as the default. The 2022 paper from Grinsztajn, Oyallon and Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?”, established this with a careful benchmark, and four years later nothing has displaced it. The reasons are structural (we’ll get to them at the end), but the consequence is that when you see a CSV with rows and columns and a target, the right first move is a gradient-boosted tree.

This lesson covers the three libraries that matter — XGBoost, LightGBM, CatBoost — plus scikit-learn’s random forest as the strong baseline. We’ll fit the same problem with all of them, talk about hyperparameters that actually move the needle, and discuss when to reach for which.

What a tree is, briefly

A decision tree asks a sequence of yes/no questions about features:

is income > 50000?
├── yes: is age > 30?
│   ├── yes: predict "approve"
│   └── no:  predict "deny"
└── no:  is employment_years > 5?
    ├── yes: predict "approve"
    └── no:  predict "deny"

That’s it. Each internal node splits on a feature and threshold; each leaf gives a prediction. The training algorithm picks each split greedily, choosing the feature and threshold that most reduce impurity (Gini or entropy for classification, mean squared error for regression). Each choice is only locally optimal, but it is fast.
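To make the greedy split search concrete, here is a minimal sketch for a single numeric feature using Gini impurity. The function names and the toy income data are made up for illustration; real libraries do the same search over every feature, with much faster bookkeeping.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try a threshold midway between each pair of adjacent distinct values.

    Returns (threshold, weighted child impurity) for the best candidate,
    or (None, parent impurity) if no split improves on the parent.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data, not from any real dataset
income = [20_000, 35_000, 48_000, 52_000, 70_000, 90_000]
decision = ["deny", "deny", "deny", "approve", "approve", "approve"]

threshold, impurity = best_threshold(income, decision)
print(threshold, impurity)  # 50000.0 0.0
```

A full tree builder would call this on every feature, keep the best one, and recurse into the two children.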

A single tree overfits. Train it deep enough and it memorizes the training set. The remedy is ensembles: combine many trees so their individual errors cancel out. Two flavors of ensemble dominate.

Bagging (random forest): grow many trees, each on a bootstrap sample of the data and a random subset of features at each split, then average their predictions. The randomness decorrelates the trees, so averaging reduces variance.

Boosting (gradient boosting): grow trees sequentially, each one trained to fix the errors of the ensemble so far. Slower, more careful, usually more accurate.

In practice, gradient boosting beats random forest on most tabular problems, but random forest is a great default — it’s robust, has few hyperparameters, and produces a strong baseline in three lines of code.
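The "fix the errors of the ensemble so far" idea is easiest to see for squared error, where the error to fix is just the residual. The sketch below boosts one-split trees (stumps) on a single feature in plain Python. It is a toy under those assumptions, with made-up function names, and not how XGBoost or LightGBM are implemented.

```python
import math

def fit_stump(x, residuals):
    """Least-squares one-split regressor: returns (threshold, left_value, right_value)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, lmean, rmean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # what the ensemble still gets wrong
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * predict_stump(stump, xi) for pi, xi in zip(pred, x)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return base, stumps, history

# Toy regression problem: one feature, smooth target
x = [i / 10 for i in range(40)]
y = [math.sin(xi) for xi in x]

base, stumps, history = boost(x, y)
print(f"train MSE after 1 round: {history[0]:.4f}, after 50 rounds: {history[-1]:.4f}")
```

For other losses the residual is replaced by the negative gradient of the loss, which is where the "gradient" in gradient boosting comes from.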

Random forest

Let’s start with the baseline. In scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(
    n_estimators=500,       # number of trees — more is better, up to a plateau
    max_depth=None,          # let trees grow fully; bagging handles overfit
    min_samples_leaf=2,      # mild regularization
    n_jobs=-1,               # use all CPU cores
    random_state=42,
)

scores = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print(scores.mean())

Random forest is forgiving. The default hyperparameters give you 80% of the achievable accuracy on most problems. Tune n_estimators upward until your score plateaus, set min_samples_leaf to maybe 2-5, and you’re done.
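One cheap way to see where the score plateaus is to grow the same forest in stages and watch the out-of-bag score, which scikit-learn computes from the rows each tree did not see. A sketch on synthetic data (the dataset and the tree counts are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real tabular problem
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0,
)

rf_demo = RandomForestClassifier(
    n_estimators=50,
    min_samples_leaf=2,
    oob_score=True,     # score each row using only trees that did not train on it
    warm_start=True,    # keep existing trees and add new ones on the next fit
    random_state=0,
)

oob_by_size = {}
for n_trees in (50, 100, 200, 400):
    rf_demo.set_params(n_estimators=n_trees)
    rf_demo.fit(X_demo, y_demo)
    oob_by_size[n_trees] = rf_demo.oob_score_
    print(n_trees, round(rf_demo.oob_score_, 4))
```

The out-of-bag score here is accuracy, not ROC AUC, so treat it as a plateau detector and keep cross-validation for the number you report.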

Missing values are a version-dependent detail. Since scikit-learn 1.4, RandomForestClassifier accepts NaN directly; on older versions, impute first. scikit-learn’s HistGradientBoostingClassifier has handled NaN natively for longer.

The boosting libraries: XGBoost, LightGBM, CatBoost

Now to the real workhorses. All three are gradient boosting libraries, all three follow scikit-learn’s .fit / .predict API, and they’re roughly interchangeable. Each has its own strengths.

XGBoost (since 2014) was the library that put gradient boosting on the map. It dominated Kaggle for years. Mature, GPU-friendly, lots of switches. In 2026 it’s on the 3.x line, with a stable Python API and improved categorical support.

uv add xgboost

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    eval_metric="auc",
    early_stopping_rounds=50,
    n_jobs=-1,
    random_state=42,
)

xgb.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)

LightGBM (Microsoft, 2017) is faster than XGBoost on most problems, particularly with large datasets. Its claim to fame is leaf-wise growth instead of depth-wise: it grows the tree by repeatedly expanding the leaf with the largest loss reduction, which converges faster but is more prone to overfitting on small datasets unless you regularize. Currently on version 4.x.

uv add lightgbm

import lightgbm as lgb   # needed for the lgb.early_stopping callback below
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    num_leaves=63,           # leaf-wise — this is the analog of max_depth
    min_child_samples=20,
    subsample=0.8,
    subsample_freq=1,        # LightGBM ignores subsample unless this is > 0
    colsample_bytree=0.8,
    reg_lambda=1.0,
    n_jobs=-1,
    random_state=42,
)

lgbm.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50)],
)
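Leaf-wise growth is easier to picture with a toy. The sketch below grows a regression tree over targets that are already sorted by a single feature, always splitting whichever leaf offers the largest squared-error reduction. Everything here (function names, the data, the one-feature setup) is a simplification for illustration; LightGBM's real implementation works on histograms and many features.

```python
import heapq

def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(y, start, end):
    """Best split of y[start:end]: (gain, split_index), or (0.0, None) if nothing helps."""
    parent = sse(y[start:end])
    best_gain, best_k = 0.0, None
    for k in range(start + 1, end):
        gain = parent - sse(y[start:k]) - sse(y[k:end])
        if gain > best_gain:
            best_gain, best_k = gain, k
    return best_gain, best_k

def grow_leaf_wise(y, max_leaves):
    """Repeatedly split the open leaf with the largest gain until max_leaves is reached."""
    gain, k = best_split(y, 0, len(y))
    heap = [(-gain, 0, len(y), 0, k)]  # (negated gain, start, end, depth, split index)
    closed = []                        # leaves that cannot be improved: (start, end, depth)
    while heap and len(heap) + len(closed) < max_leaves:
        _, start, end, depth, k = heapq.heappop(heap)
        if k is None:
            closed.append((start, end, depth))
            continue
        for s, e in ((start, k), (k, end)):
            g, kk = best_split(y, s, e)
            heapq.heappush(heap, (-g, s, e, depth + 1, kk))
    return sorted(closed + [(s, e, d) for _, s, e, d, _ in heap])

# Hypothetical targets: a long flat stretch, then a noisy one
y_sorted_by_x = [0] * 12 + [1, 5, 2, 8, 3, 9, 4, 7]

leaves = grow_leaf_wise(y_sorted_by_x, max_leaves=8)
print("leaves:", len(leaves), "max depth:", max(d for _, _, d in leaves))
```

A balanced depth-wise tree with 8 leaves has depth 3. The leaf-wise tree spends its splits on the noisy stretch and goes deeper there, which is both why it converges quickly and why it can overfit small datasets.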

CatBoost (Yandex, 2017) handles categorical features natively — you pass them as strings and it figures out an ordered target encoding internally that doesn’t leak. If your dataset is heavy on categoricals, CatBoost is often the easiest win because you skip a whole encoding step. It’s also the most user-friendly for beginners — sensible defaults, less tuning needed.

uv add catboost

from catboost import CatBoostClassifier

cat = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    cat_features=["country", "plan", "device"],   # tell it which columns are categorical
    early_stopping_rounds=50,
    random_seed=42,
    verbose=False,
)

cat.fit(X_train, y_train, eval_set=(X_val, y_val))
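The ordered target encoding mentioned above is worth seeing once, because it explains why it doesn't leak: each row is encoded using only the targets of rows that came before it. The sketch below is a simplified version of the idea with hypothetical names and data. CatBoost itself works over random permutations of the rows and has more machinery around it.

```python
def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row from the targets of EARLIER rows with the same category.

    The row's own target is never used, so the encoding cannot leak it.
    `prior` pulls rare categories toward a global value.
    """
    sums, counts, encoded = {}, {}, []
    for cat, target in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded.append((s + prior_weight * prior) / (c + prior_weight))
        # Only now does this row's target become visible to later rows
        sums[cat] = s + target
        counts[cat] = c + 1
    return encoded

# Hypothetical data: did the customer churn?
country = ["US", "DE", "US", "US", "DE"]
churned = [1, 0, 0, 1, 1]
prior = sum(churned) / len(churned)  # global mean, 0.6

encoded = ordered_target_encode(country, churned, prior)
print([round(v, 4) for v in encoded])  # [0.6, 0.6, 0.8, 0.5333, 0.3]
```

Compare this with a naive target encoding that averages over all rows of a category, including the row being encoded: that version hands the model a piece of its own label.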

Which one to pick? In practice, the three are within a couple of percentage points of each other on most problems. My defaults: LightGBM for fast iteration, CatBoost when the data has lots of categoricals, XGBoost when the team already knows it. Try at least two on any serious problem and pick whichever scores best on a held-out set.

The hyperparameters that matter

Gradient boosting has a lot of knobs. Most of them you can leave at default. The ones that actually move the score:

n_estimators (or iterations in CatBoost). Number of trees in the ensemble. More is better up to a point — beyond it, you start memorizing the training data. The right answer is “as many as you can use before validation loss stops improving,” which is what early stopping handles for you. Set this to a large number like 2000-5000 and let early_stopping_rounds cut you off.

learning_rate (also eta). How much each new tree corrects the previous error. Smaller learning rate plus more trees usually wins, but takes longer to fit. 0.05 is a sane default. 0.01 if you have time and want the last 0.5% of accuracy. 0.1-0.3 for fast prototyping.
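A back-of-the-envelope calculation shows why a smaller learning rate needs more trees. Assume, purely for illustration, that every tree fits the current residual perfectly; then each round leaves a fraction (1 - learning_rate) of the error behind. Real trees are nowhere near perfect, so treat the numbers as a lower bound on the trend, not a prediction.

```python
import math

def rounds_to_shrink(learning_rate, remaining=0.01):
    """Rounds needed until (1 - learning_rate) ** rounds <= remaining."""
    return math.ceil(math.log(remaining) / math.log(1.0 - learning_rate))

for lr in (0.3, 0.1, 0.05, 0.01):
    print(lr, rounds_to_shrink(lr))
# 0.3 13
# 0.1 44
# 0.05 90
# 0.01 459
```

Going from 0.05 to 0.01 costs roughly five times as many trees, which is the "if you have time" part of the advice.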

max_depth (XGBoost, CatBoost) or num_leaves (LightGBM). How complex each individual tree can be. Deeper trees fit more nuanced patterns and overfit faster. 4-10 covers most cases. A tree of depth d has at most 2^d leaves, so num_leaves=63 sits just under the 64 leaves of a full depth-6 tree. The comparison is only rough: a leaf-wise tree with 63 leaves can be much deeper than 6 along some branches.
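The arithmetic behind the num_leaves and max_depth comparison, as a small sketch with hypothetical helper names:

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth d has at most 2 ** d leaves."""
    return 2 ** max_depth

def depth_range_for_leaves(num_leaves):
    """(shallowest, deepest) possible depth of a binary tree with num_leaves leaves."""
    shallowest = (num_leaves - 1).bit_length()  # ceil(log2(num_leaves)) for num_leaves >= 1
    deepest = num_leaves - 1                    # a chain that splits one leaf per level
    return shallowest, deepest

print(max_leaves_for_depth(6))      # 64
print(depth_range_for_leaves(63))   # (6, 62)
```

In practice a leaf-wise tree lands somewhere between the two extremes. LightGBM also has a max_depth parameter (unlimited by default) if you want to cap the deep end.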

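min_child_weight (XGBoost) / min_samples_leaf (random forest) / min_child_samples (LightGBM). The minimum a leaf must contain. For min_samples_leaf and min_child_samples that is a row count; for min_child_weight it is a sum of hessians, which equals the row count for squared error but is smaller for log loss. Higher values mean more regularization. 20 is a reasonable default for the count-based versions on medium datasets; min_child_weight lives on a different scale, and XGBoost’s default is 1.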
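The "weighted samples" in min_child_weight deserve one worked number. For binary log loss, each row contributes p * (1 - p) to the leaf's hessian sum, where p is the current predicted probability. A quick sketch with made-up leaves:

```python
def hessian_sum_logistic(probabilities):
    """Sum of p * (1 - p), the second derivative of log loss, over the rows in a leaf."""
    return sum(p * (1.0 - p) for p in probabilities)

uncertain_leaf = [0.5] * 20    # 20 rows the model is unsure about
confident_leaf = [0.99] * 20   # 20 rows the model already predicts confidently

print(hessian_sum_logistic(uncertain_leaf))             # 5.0
print(round(hessian_sum_logistic(confident_leaf), 3))   # 0.198
```

So the same 20 rows count as 5 when the model is unsure and as about 0.2 when it is confident. A min_child_weight of 1 would allow the first leaf and block the second, which is not what a row-count threshold of 20 would do.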

subsample and colsample_bytree. Fraction of rows and columns to sample for each tree. Setting these below 1.0 (typically 0.7-0.9) introduces stochasticity that acts like bagging on top of boosting. Almost always helps a little. One gotcha: in LightGBM, subsample has no effect unless subsample_freq is also set above 0.

reg_lambda / l2_leaf_reg. L2 regularization on leaf weights. Default is fine in most cases; tune if you see overfit.
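What L2 regularization does to a leaf is a one-line formula. In the XGBoost formulation, the optimal leaf value is -G / (H + lambda), where G and H are the sums of gradients and hessians of the rows in the leaf. The numbers below are invented to show the shrinkage; CatBoost's l2_leaf_reg plays an analogous role, though its internals differ.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf value under L2 regularization: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)

# Hypothetical leaf: gradients sum to -10, hessians sum to 5
for lam in (0.0, 1.0, 10.0):
    print(lam, round(leaf_weight(grad_sum=-10.0, hess_sum=5.0, reg_lambda=lam), 4))
# 0.0 2.0
# 1.0 1.6667
# 10.0 0.6667
```

The smaller the leaf's hessian sum, the harder lambda pulls its value toward zero, so the penalty lands mostly on small, low-evidence leaves.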

The honest truth: with reasonable defaults plus early stopping, gradient boosting gets you 90% of the achievable accuracy. The remaining 10% takes serious tuning, ideally with a tool like Optuna that does Bayesian search over the parameter space.

import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

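def objective(trial):
    params = {
        # cross_val_score cannot pass an eval_set, so there is no early stopping here:
        # every trial fits all 2000 trees on every fold.
        "n_estimators": 2000,
        # Name each trial parameter after the LightGBM argument so that
        # study.best_params can be passed straight back to LGBMClassifier.
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "subsample_freq": 1,  # LightGBM ignores subsample unless this is > 0
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
    }
    model = LGBMClassifier(**params, n_jobs=-1, random_state=42, verbose=-1)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()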

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)

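That’s the core of the production pattern: pipeline plus Optuna plus early stopping plus a held-out test set you only look at once. The snippet above keeps things short and leaves out early stopping, because cross_val_score has no way to pass a validation set; to add it, fit each fold yourself with an eval_set.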

Early stopping

This is the trick that turns “pick n_estimators carefully” into “pick a big number and let the library figure it out.” Pass a validation set; the library evaluates it after every round; if the validation metric doesn’t improve for early_stopping_rounds consecutive rounds, training halts and predictions use the best iteration rather than the last one.

xgb.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
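# This relies on early_stopping_rounds=50 having been set in the XGBClassifier constructor.
# After fit, xgb.best_iteration is the best round, not the last one trained
# (training ran for up to 50 more rounds before giving up).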

The validation set must be separate from your test set. Test set is for final evaluation; validation set is for picking hyperparameters and stopping. Conflating them is a flavor of leakage.
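The stopping rule itself is only a few lines. Here is a library-independent sketch over a list of per-round validation scores; the function name and the AUC values are made up, and real libraries add details such as minimum improvement thresholds.

```python
def early_stop(val_scores, patience, higher_is_better=True):
    """Return (best_round, last_round_trained) for per-round validation scores."""
    best_round, best_score = 0, val_scores[0]
    for round_idx, score in enumerate(val_scores):
        improved = score > best_score if higher_is_better else score < best_score
        if round_idx == 0 or improved:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx  # gave up: no improvement for `patience` rounds
    return best_round, len(val_scores) - 1  # ran out of rounds without stopping

# Hypothetical validation AUC per boosting round
aucs = [0.80, 0.85, 0.88, 0.90, 0.895, 0.899, 0.898, 0.91]

print(early_stop(aucs, patience=3))  # (3, 6)
print(early_stop(aucs, patience=5))  # (7, 7)
```

With a patience of 3, training stops at round 6 and never sees the better score at round 7. That is the trade-off in choosing early_stopping_rounds: too small and a noisy plateau ends training early, too large and you pay for rounds you will throw away.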

Same problem, three libraries, side by side

import pandas as pd
import lightgbm as lgb
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Some realistic tabular dataset
data = fetch_openml("adult", version=2, as_frame=True)
X, y = data.data, (data.target == ">50K").astype(int)
X = pd.get_dummies(X, drop_first=True)  # quick categorical handling

# Hold out a test set first, then carve a validation set out of the training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42,
)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42),
    "XGBoost":      XGBClassifier(n_estimators=1000, learning_rate=0.05, max_depth=6,
                                  early_stopping_rounds=50, eval_metric="auc",
                                  n_jobs=-1, random_state=42),
    "LightGBM":     LGBMClassifier(n_estimators=1000, learning_rate=0.05, num_leaves=63,
                                   n_jobs=-1, random_state=42, verbose=-1),
}

for name, m in models.items():
    if name == "RandomForest":
        m.fit(X_tr, y_tr)  # no validation set needed
    elif name == "XGBoost":
        # early_stopping_rounds was set in the constructor
        m.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    else:
        # LightGBM 4.x takes early stopping as a callback; fit() has no verbose argument
        m.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50, verbose=False)])
    auc = roc_auc_score(y_test, m.predict_proba(X_test)[:, 1])
    print(f"{name:12s}  AUC = {auc:.4f}")

In a quick run on the adult dataset, expect numbers in the 0.91-0.93 ROC AUC range, with the boosters edging out the forest by a percentage point or so. On a real production problem, this is your starting point — pick the winner, do feature engineering (lesson 50), tune with Optuna, and you’re well into the territory of useful predictions.

Why trees still beat deep learning on tabular data

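The 2022 Grinsztajn et al. paper formalized what practitioners had been saying for years. Three structural reasons stand out. The first two echo the paper’s own findings; the third is a practical point rather than one of its results: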

Tabular data has heterogeneous features. A row might have age, country, transaction amount, and customer tenure: features measured in completely different units, with different distributions, and no spatial structure connecting them. Neural networks were built for inputs where neighboring values mean something (pixels, words). Trees split on one feature at a time, which matches how tabular data is actually structured.

Trees handle non-smooth target functions well. A small change in income might flip a creditworthiness decision; neural nets trained by gradient descent are biased toward smooth functions and struggle to represent such sharp boundaries. Trees represent them natively, because every split is a step.

Trees handle missing values and categoricals with little preprocessing. Modern boosting libraries split on missingness directly, and CatBoost and LightGBM (and recent XGBoost versions) can split on categorical values without one-hot encoding. Neural nets need everything in dense numeric form, which is more brittle and loses information.

There’s been progress on tabular deep learning — TabNet, FT-Transformer, SAINT, TabPFN — and on some specific problems they match or beat boosting. But “match or beat boosting” is the bar to clear, and on the average tabular problem in 2026, you start with LightGBM and end with LightGBM.

Closing

Three lessons in. You know the scikit-learn API, you know what makes good features and what leaks, and you know how to fit a tree-based model that’s competitive with anything else on tabular data. That’s enough to handle the bulk of real ML problems people get paid to solve. The next lessons in this module go beyond tabular — evaluation done properly, neural networks, NLP, and the productionization story — but if you stopped here and only ever did pipelines plus boosted trees, you’d already be ahead of most teams.