Python, from the ground up Lesson 53 / 60

Hyperparameter tuning: grid, random, bayesian, optuna

The four search strategies, when each makes sense, and why optuna is the 2026 default.

Every model has two kinds of numbers. Parameters are the ones the model learns from data: the weights in a logistic regression, the splits in a decision tree, the thousands of leaf values in XGBoost. Hyperparameters are the ones you set before training: learning_rate, max_depth, regularization strength, number of trees, dropout rate. Depending on how you set them, the same model on the same data can range from “useless” to “state of the art.”

Tuning is the process of systematically searching for good hyperparameter values. There are four common strategies, each with a regime where it makes sense, and one library — Optuna — that has eaten most of the others’ lunch in the last few years.

What you’re actually optimizing

The setup is always the same. You have a function that takes hyperparameters and returns a score:

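from sklearn.model_selection import cross_val_score

def objective(hyperparams: dict) -> float:
    # Model stands for any estimator class; X_train, y_train are your training data.
    model = Model(**hyperparams)
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    return score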

You want to find the hyperparameters that maximize that score. The function is expensive (each call requires fitting a model, possibly multiple times for cross-validation), noisy (cross-validation gives you an estimate, not the true score), and black-box (you don’t have a gradient — you can only evaluate it at points you choose).

That last constraint is the interesting one. Most optimization assumes you can take derivatives. Hyperparameter optimization can’t. The four strategies below are different answers to “given that I can only evaluate the function at points I pick, how should I pick those points?”

Strategy 1: grid search

The brute-force approach: define a discrete grid for each hyperparameter and try every combination.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

That grid is 3 x 3 x 3 = 27 combinations, times 5 CV folds = 135 model fits. Manageable.

The catch is that it explodes combinatorially. Add a fourth hyperparameter with 5 values and you’re at 27 x 5 = 135 combinations, or 675 fits with 5-fold CV. Add a fifth with 5 values and you’re at 3,375 fits. Grid search is a good fit when:

  • You have 2-3 hyperparameters,
  • with 3-5 sensible values each,
  • and a single fit is fast (logistic regression, small random forest).

For anything heavier — XGBoost on a real dataset, a neural network — grid search is dead before you start.
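To make the growth concrete, here is a small sketch that counts fits by enumerating a grid. The two extra hyperparameters and their candidate values are made up for illustration, and the count ignores the final refit on the full training set.

```python
from itertools import product

def count_fits(param_grid, cv_folds):
    """Number of model fits a full grid search performs (final refit excluded)."""
    n_combinations = sum(1 for _ in product(*param_grid.values()))
    return n_combinations * cv_folds

grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}
print(count_fits(grid, cv_folds=5))  # 135

# Hypothetical extra hyperparameters with 5 candidate values each.
grid["max_features"] = [0.2, 0.4, 0.6, 0.8, 1.0]
print(count_fits(grid, cv_folds=5))  # 675

grid["min_samples_leaf"] = [1, 2, 4, 8, 16]
print(count_fits(grid, cv_folds=5))  # 3375
```

Every added hyperparameter multiplies the total, which is why the budget runs out so quickly.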

Strategy 2: random search

Instead of trying every combination, sample combinations randomly from a distribution:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint

param_dist = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(3, 30),
    "min_samples_split": randint(2, 20),
    "max_features": loguniform(0.1, 1.0),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=50,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)

Bergstra and Bengio’s 2012 paper “Random Search for Hyper-Parameter Optimization” made the case that with the same compute budget, random search usually finds equally good or better hyperparameters than grid search. The intuition: only a few hyperparameters actually matter much, and grid search wastes compute by exploring the unimportant ones at high resolution. Random search, by sampling continuously, gives every hyperparameter a fair number of distinct values regardless of which ones turn out to matter.
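The intuition can be sketched with two hyperparameters and a budget of 9 evaluations. This is a toy count, not a reproduction of the paper’s experiments: it only shows how many distinct values of a single hyperparameter each strategy tries.

```python
import random

def distinct_values_grid(points_per_axis):
    """3 x 3 style grid over two hyperparameters in [0, 1]."""
    axis = [i / (points_per_axis - 1) for i in range(points_per_axis)]
    trials = [(a, b) for a in axis for b in axis]
    # Returns (number of trials, distinct values tried for the first hyperparameter).
    return len(trials), len({a for a, _ in trials})

def distinct_values_random(n_trials, seed=0):
    """Same budget, but each trial draws both hyperparameters independently."""
    rng = random.Random(seed)
    trials = [(rng.random(), rng.random()) for _ in range(n_trials)]
    return len(trials), len({a for a, _ in trials})

grid_result = distinct_values_grid(3)
random_result = distinct_values_random(9)
print(grid_result)    # (9, 3)
print(random_result)  # (9, 9)
```

If only the first hyperparameter matters, the grid has effectively run 3 experiments three times each, while random search has run 9 different ones.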

Random search is the right call when:

  • You have 4+ hyperparameters,
  • some are continuous (regularization strengths, learning rates),
  • and you have a fixed compute budget — say, 50 or 100 fits.

It still wastes compute, though. Once you’ve sampled 30 points and seen that high learning_rate is bad, random search keeps drawing more high-learning_rate samples. That’s where the next strategy earns its keep.

Strategy 3: bayesian optimization

The idea: build a probabilistic model of the score function as you go, and use that model to pick the next point smartly.

Concretely, after each evaluation you fit a Gaussian process (or a tree-based model, in modern implementations) to your (hyperparams, score) pairs. The model gives you, for any candidate hyperparameter setting, both a predicted score and an uncertainty estimate. You then pick the next point to evaluate by maximizing an acquisition function that balances:

  • Exploitation: try points where the model thinks the score will be high.
  • Exploration: try points where the model is uncertain.

The standard acquisition function is expected improvement: how much better than the current best do we expect this candidate to be? Each evaluation refines the model, so successive picks get smarter.
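As a sketch, assuming the surrogate returns a Gaussian prediction (a mean and a standard deviation) for each candidate, expected improvement for a maximization problem has a closed form. The xi parameter is an optional exploration margin added here for illustration, and the numbers in the example are made up.

```python
import math

def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Closed-form expected improvement for maximization under a Gaussian prediction."""
    gain = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty: the improvement is just the predicted gain, floored at zero.
        return max(gain, 0.0)
    z = gain / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return gain * cdf + std * pdf

best = 0.90
# Candidate A: predicted slightly better than the best, with almost no uncertainty.
confident = expected_improvement(mean=0.91, std=0.001, best_so_far=best)
# Candidate B: predicted slightly worse, but the surrogate is unsure about it.
uncertain = expected_improvement(mean=0.89, std=0.05, best_so_far=best)
print(confident < uncertain)  # True
```

Candidate B wins despite the lower predicted score, which is the exploration term at work: a wide uncertainty band leaves room for a large improvement.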

This is usually far more sample-efficient than random search. As a rough illustration, where random search might need a couple of hundred evaluations to find a good region, a bayesian method can often get there in a few dozen; the exact ratio depends on the problem. The cost is overhead per evaluation: fitting the surrogate model and optimizing the acquisition function can take seconds. For expensive objectives (XGBoost on a million rows, a neural network training run) that overhead is invisible. For cheap objectives, it’s not worth it.

Strategy 4: population-based and evolutionary

For very large compute budgets (think hyperparameter tuning a foundation model with thousands of GPUs) there’s a family of strategies that maintain a population of configurations, evaluate them in parallel, and use mutation/crossover or “exploit-and-explore” rules to evolve the population. Population-Based Training (DeepMind, 2017) lives here. Hyperband and its variants are close relatives: they also run many configurations at once, but rely on successive halving to stop the weak ones early rather than evolving them. You probably won’t need these unless you’re doing serious deep learning at scale. Worth knowing they exist.
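A minimal sketch of one “exploit-and-explore” step, loosely in the spirit of Population-Based Training. The population layout (dicts with an lr and a score), the 25% cutoff, and the 0.8/1.2 perturbation factors are illustrative assumptions; a real implementation would also copy model weights from the parent.

```python
import random

def exploit_and_explore(population, rng, fraction=0.25):
    """Replace the worst members with perturbed copies of the best ones."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n = len(ranked)
    k = max(1, int(n * fraction))
    top = ranked[:k]
    next_gen = [dict(m) for m in ranked[: n - k]]  # everyone but the bottom k survives
    for _ in range(k):
        parent = rng.choice(top)                            # exploit: copy a strong member
        child = dict(parent)
        child["lr"] = parent["lr"] * rng.choice([0.8, 1.2])  # explore: perturb it
        child["score"] = None                               # must be re-evaluated
        next_gen.append(child)
    return next_gen

rng = random.Random(0)
population = [{"lr": 10 ** rng.uniform(-4, -1), "score": rng.random()} for _ in range(8)]
next_gen = exploit_and_explore(population, rng)
print(len(next_gen), sum(m["score"] is None for m in next_gen))  # 8 2
```

Repeating this step between short training intervals is what lets the population drift toward good settings while training continues.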

Optuna: the 2026 default

Most of the above strategies are now wrapped under one library. Optuna has become the dominant Python hyperparameter library for several good reasons:

  • A single API covers grid, random, and bayesian (TPE — Tree-structured Parzen Estimator).
  • The objective function is plain Python — no awkward parameter dictionaries.
  • Trial pruning kills bad runs early.
  • Distributed mode runs trials across machines with a shared database.
  • The visualization tools are first-class.

The basic shape:

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
    }
    model = XGBClassifier(**params, random_state=42, eval_metric="logloss")
    score = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print(study.best_params)
print(study.best_value)

The objective function is just Python. You can put any logic in it: conditional hyperparameters (“if model_type is X, also tune Y”), preprocessing steps, anything. The trial.suggest_* calls register the search space implicitly as you call them — Optuna learns from the trials what to suggest next.

Note log=True on the learning rate and regularization strengths. Always tune those on a log scale. The difference between learning_rate=0.001 and learning_rate=0.01 is a factor of 10, which matters. The difference between learning_rate=0.29 and learning_rate=0.30 is a rounding error.
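A quick way to see what a log scale buys you is to compare where samples land. The sketch below uses only the standard library and the learning-rate range from the example above; it illustrates the idea and is not how Optuna samples internally.

```python
import math
import random

def sample_uniform(rng, low, high):
    return rng.uniform(low, high)

def sample_log_uniform(rng, low, high):
    # Sample uniformly in log space, then map back to the original scale.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
low, high, n = 1e-3, 0.3, 10_000
linear = [sample_uniform(rng, low, high) for _ in range(n)]
logged = [sample_log_uniform(rng, low, high) for _ in range(n)]

# Share of samples that fall in the smallest decade, [0.001, 0.01).
frac_small_linear = sum(v < 0.01 for v in linear) / n
frac_small_logged = sum(v < 0.01 for v in logged) / n
print(f"below 0.01: linear={frac_small_linear:.2f}, log={frac_small_logged:.2f}")
```

Analytically, a linear scale puts about 3% of samples below 0.01, while a log scale puts about 40% there, so the small-learning-rate decade actually gets explored.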

By default, Optuna uses TPE, a tree-structured Parzen estimator — a bayesian-flavored sampler that handles mixed continuous/discrete/categorical spaces well. You can swap it for RandomSampler, GridSampler, or CmaEsSampler by passing a sampler= argument.

Pruning: kill bad trials early

If you’re tuning a model with iterations (XGBoost rounds, neural network epochs), a slow trial that’s clearly behind the leaderboard is wasting compute. Optuna’s pruning lets the trial report intermediate scores and abort early:

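import optuna
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Assumes X_train, y_train, X_val, y_val come from an earlier train/validation split.
def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = XGBClassifier(**params, random_state=42)

    # Each "epoch" refits from scratch with one more boosting round: simple, but
    # the total cost grows quadratically with the number of rounds.
    for epoch in range(100):
        model.set_params(n_estimators=epoch + 1)
        model.fit(X_train, y_train)
        score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        trial.report(score, epoch)  # intermediate score at this step
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
)
study.optimize(objective, n_trials=100)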

MedianPruner stops a trial when its best intermediate score so far is worse than the median of earlier trials’ intermediate scores at the same step. HyperbandPruner prunes more aggressively, using successive halving. Pruning often gives something like a 2-5x speedup on iterative models, though the gain varies with the model and the pruner settings.
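The median rule itself is simple enough to sketch in a few lines. This is a simplified version for intuition, not Optuna’s implementation: it compares the current value directly against earlier trials at the same step, and the history values are made up.

```python
from statistics import median

def should_prune_median(history, current_value, step, n_warmup_steps=10):
    """history: one list of intermediate scores per earlier trial (higher is better)."""
    if step < n_warmup_steps:
        return False  # never prune during warmup
    peers = [scores[step] for scores in history if len(scores) > step]
    if not peers:
        return False  # nothing to compare against yet
    return current_value < median(peers)

history = [
    [0.60, 0.70, 0.75],
    [0.55, 0.65, 0.72],
    [0.62, 0.74, 0.80],
]
# At step 1 the earlier trials scored 0.70, 0.65, 0.74, so the median is 0.70.
print(should_prune_median(history, 0.66, step=1, n_warmup_steps=0))  # True
print(should_prune_median(history, 0.71, step=1, n_warmup_steps=0))  # False
```

The warmup guard matters because early scores are noisy; pruning on them would discard configurations that start slowly but finish well.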

Cross-validation, nested or not

The model selection question that bites everyone: if you tune hyperparameters on cross-validation and report the best CV score, you’re overfitting to the CV folds. The “true” generalization estimate has been contaminated by the search itself.

The correct setup: nested cross-validation. An outer CV loop measures generalization. Inside each outer fold, an inner CV loop tunes hyperparameters. The outer score is what you report.

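import numpy as np
import optuna
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

# Assumes X and y are a pandas DataFrame and Series, and that tune_inner(trial, X, y)
# is your own objective: it suggests hyperparameters and returns an inner-CV score.
outer = KFold(n_splits=5, shuffle=True, random_state=42)
outer_scores = []

for train_idx, test_idx in outer.split(X):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

    # Inner loop: tune using only the outer-training portion.
    study = optuna.create_study(direction="maximize")
    study.optimize(lambda trial: tune_inner(trial, X_tr, y_tr), n_trials=50)

    # Refit with the best settings, then score on the untouched outer-test fold.
    final_model = XGBClassifier(**study.best_params).fit(X_tr, y_tr)
    outer_scores.append(roc_auc_score(y_te, final_model.predict_proba(X_te)[:, 1]))

print(f"Generalization AUC: {np.mean(outer_scores):.3f} +/- {np.std(outer_scores):.3f}")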

Nested CV is expensive: 5 outer folds x 50 inner trials x 5 inner folds = 1,250 model fits, plus one final fit per outer fold. The cheap-and-honest alternative: hold out a fraction of your data before any tuning, tune on the rest with regular CV, and use the holdout once at the end. You lose statistical efficiency but the answer’s still trustworthy.
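A sketch of the holdout alternative, using only the standard library. The 20% fraction, the sample count, and the function name split_holdout are illustrative choices; for classification you would typically want a stratified split instead.

```python
import random

def split_holdout(n_samples, holdout_fraction=0.2, seed=42):
    """Shuffle row indices once and carve off a holdout set before any tuning."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_samples * holdout_fraction))
    return sorted(indices[n_holdout:]), sorted(indices[:n_holdout])

dev_idx, holdout_idx = split_holdout(1000)
# Tune with cross-validation on dev_idx only; score on holdout_idx once, at the end.
print(len(dev_idx), len(holdout_idx))  # 800 200

# Fit counts for the two setups, with 50 trials and 5 inner folds.
nested_fits = 5 * 50 * 5
holdout_fits = 50 * 5
print(nested_fits, holdout_fits)  # 1250 250
```

The holdout route costs a fifth of the fits here, at the price of a single, noisier generalization estimate.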

The search space matters more than the algorithm

If your hyperparameter ranges don’t include good values, no algorithm will find them. Random search over learning_rate in [0.5, 1.0] will not find the optimum at 0.05, no matter how many trials you run. Bayesian over the same range is just a smarter waste of compute.

A practical rule: for any new model class, look up the documented “typical” ranges and widen them by an order of magnitude on each end. Then look at where Optuna’s best trials cluster. If they pile up against a boundary of your search space, your boundary is wrong — widen it and re-tune.
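The boundary check can be automated with a small helper. This is a hypothetical utility, not part of Optuna, and the list of values is made up; in practice you would collect the parameter values of your best trials from the study.

```python
def boundary_fraction(values, low, high, rel_margin=0.05):
    """Fraction of values within rel_margin of either end of [low, high]."""
    margin = (high - low) * rel_margin
    hits = sum(1 for v in values if v - low <= margin or high - v <= margin)
    return hits / len(values)

# Made-up max_depth values from the ten best trials of a search over [3, 12].
top_max_depth = [12, 12, 11, 12, 10, 12, 12, 11, 12, 12]
print(boundary_fraction(top_max_depth, low=3, high=12))  # 0.7
```

A fraction this high suggests the upper bound is cutting off the good region. For hyperparameters tuned on a log scale, apply the same check to the logarithms of the values and bounds.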

Distributed Optuna

For large studies, Optuna can coordinate trials across machines via a shared database (PostgreSQL, MySQL, SQLite for local). Each worker pulls trials from the study, runs them, and writes results back:

study = optuna.create_study(
    storage="postgresql://user:pass@host/optuna_db",
    study_name="xgb-tuning-2026-05",
    direction="maximize",
    load_if_exists=True,
)
study.optimize(objective, n_trials=100)

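Same code on every worker, with coordination going through the shared database. Note that n_trials applies per call, so each worker running this snippet contributes up to 100 trials to the study. This is how serious tuning runs happen in 2026: you fire up a few cloud workers, point them at the same database, and let them chew through trials in parallel.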

AI assistance for search spaces

A small but real productivity tip in 2026: AI coding assistants are extremely good at suggesting Optuna search spaces given a model type. Prompts like “give me an Optuna search space for XGBoost regression on tabular data” or “suggest reasonable ranges for tuning a LightGBM classifier with class imbalance” get you working code with sane defaults in seconds. It’s not magic — those ranges are documented in many places — but it saves you the lookup. Sanity-check the result against the model’s docs and adjust based on your specific problem.

Next lesson: end-to-end ML project, putting all of Module 9 into one workflow.


References: Optuna documentation (https://optuna.org/), scikit-learn model selection guide (https://scikit-learn.org/stable/modules/grid_search.html). Retrieved 2026-05-01.
