If you’ve spent any time on the internet in the last decade, you have absorbed the impression that “neural network” means “magic.” It does not. A neural network is a function. Inputs go in one side, outputs come out the other side, and a large pile of tunable numbers in the middle decides how the inputs become outputs. That’s the entire premise. Everything else (backpropagation, gradient descent, the whole deep learning canon) is plumbing around that single idea.
This lesson is deliberately code-light. Module 10 is going to have plenty of PyTorch, plenty of training loops, plenty of GPU memory anxiety. Before any of that lands, I want the mental model to click. If you understand what a neural network is — the math at its core, why nonlinearity matters, what backpropagation is actually doing — every framework decision in the next two lessons will feel obvious instead of mysterious.
A neural network is a function
Think of the simplest possible classifier. Input: a vector x with two numbers. Output: a probability, a single number between 0 and 1. The function we want is f(x) → p. If we write it as a chain of operations, it might look like this:
z1 = W1 @ x + b1    # linear: matrix times vector, plus bias
a1 = relu(z1)       # nonlinearity
z2 = W2 @ a1 + b2   # another linear
p = sigmoid(z2)     # squash to (0, 1)
W1, W2, b1, b2 are the parameters. Everything else is fixed. Training a neural network means: find values of those parameters such that, on average, f(x) is close to the right answer for every training example we have.
That is the whole game. Loss function, gradient descent, all of it, exists to find good values for those W matrices and b vectors.
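To make the chain above concrete, here is a minimal sketch of the same four steps in plain Python, with no framework. The hidden size of 3 and every parameter value are assumptions picked for illustration; nothing here is trained.

```python
import math

def relu(v):
    """Elementwise max(0, z) over a list of floats."""
    return [max(0.0, z) for z in v]

def sigmoid(z):
    """Squash one float into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Matrix-vector product; W is a list of rows, x is a list."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, W1, b1, W2, b2):
    """f(x) -> p for the two-layer network described in the text."""
    z1 = [z + b for z, b in zip(matvec(W1, x), b1)]  # linear
    a1 = relu(z1)                                    # nonlinearity
    z2 = matvec(W2, a1)[0] + b2[0]                   # another linear, one output
    return sigmoid(z2)                               # probability

# Hypothetical parameters, chosen by hand (not learned).
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # shape (3, 2)
b1 = [0.0, 0.1, -0.5]
W2 = [[1.0, -2.0, 0.5]]                      # shape (1, 3)
b2 = [0.2]

p = forward([1.0, 2.0], W1, b1, W2, b2)
print(p)
```

Change any number in W1, b1, W2, or b2 and p changes. Training is the search for the numbers that make p match the labels.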
The neuron, demystified
A “neuron” is the smallest unit of this whole construction. It takes a vector of inputs x = [x1, x2, ..., xn], multiplies each by a weight, adds a bias, and runs the result through a nonlinear function:
output = activation(w1*x1 + w2*x2 + ... + wn*xn + b)
The weighted sum is just a dot product. The bias is just a number. The nonlinearity is one of:
- ReLU: max(0, z). Cheap, the workhorse of deep learning. If the input is negative, the output is zero. Otherwise, pass it through.
- GELU: smooth ReLU-shaped curve. Standard in transformers since around 2018.
- Sigmoid: 1 / (1 + exp(-z)). Squashes to (0, 1). Used at the output of binary classifiers; rarely used inside the network anymore because it kills gradients.
- Tanh: like sigmoid but squashes to (-1, 1). Same problem.
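A single neuron and the four activations fit in a few lines of plain Python. This is a sketch: the weights and inputs are made up, and gelu uses the exact definition via the normal CDF (an assumption on my part; the text only describes its shape). The last function shows what "kills gradients" means: the sigmoid's slope is nearly zero once the input is far from 0.

```python
import math

def relu(z):
    return max(0.0, z)

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s). Peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def neuron(x, w, b, activation):
    """activation(w . x + b) for one neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Hypothetical inputs and weights; the weighted sum works out to -0.4.
x = [0.5, -1.0, 2.0]
w = [1.0, 0.5, -0.25]
b = 0.1

for name, act in [("relu", relu), ("gelu", gelu),
                  ("sigmoid", sigmoid), ("tanh", math.tanh)]:
    print(name, neuron(x, w, b, act))

# Saturation: the slope at z = 10 is thousands of times smaller than at z = 0.
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
```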
A “layer” is just many neurons sharing the same input vector but with different weights. If the input is a 100-dim vector and a layer has 64 neurons, the layer’s weight matrix is (64, 100). A layer is a matrix multiplication followed by a nonlinearity. A network is several layers stacked.
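The shapes in that paragraph are easy to check. The sketch below builds a 64-neuron layer over a 100-dim input using small random weights (the initialization range and the seed are arbitrary assumptions) and confirms that the weight matrix is (64, 100) and the output has 64 entries.

```python
import random

def make_layer(n_in, n_out, seed=0):
    """Random weights, zero biases. The init range is an arbitrary choice."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def layer_forward(W, b, x):
    """Each row of W is one neuron; every neuron sees the same x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

W, b = make_layer(n_in=100, n_out=64)
x = [1.0] * 100
out = layer_forward(W, b, x)

n_params = len(W) * len(W[0]) + len(b)  # 64 * 100 weights + 64 biases
print((len(W), len(W[0])), len(out), n_params)
```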
Why nonlinearity is the whole point
Here is the thing that always sounds like a technicality but is actually load-bearing. Suppose you stack two linear layers with no nonlinearity in between:
y = W2 @ (W1 @ x + b1) + b2
= W2 @ W1 @ x + W2 @ b1 + b2
= W' @ x + b'
Where W' = W2 @ W1 and b' = W2 @ b1 + b2. This is just one linear layer. You can stack a thousand linear layers and the entire stack collapses to a single linear function. You learned nothing by going deep. The model can only draw straight lines through the input space, and most real problems do not have straight-line decision boundaries.
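You can verify the collapse numerically. This sketch uses small hand-picked matrices (hypothetical values), runs an input through the two stacked linear layers, then through the single collapsed layer W', b', and compares the results.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """(m x k) times (k x n), both as lists of rows."""
    k, n = len(B), len(B[0])
    return [[sum(row[t] * B[t][j] for t in range(k)) for j in range(n)]
            for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical parameters for two linear layers with no activation between.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # (3, 2)
b1 = [1.0, 0.0, -2.0]
W2 = [[2.0, -1.0, 0.5]]                     # (1, 3)
b2 = [4.0]

# The single equivalent layer from the derivation.
W_prime = matmul(W2, W1)                    # (1, 2)
b_prime = add(matvec(W2, b1), b2)

x = [3.0, -2.0]
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)
collapsed = add(matvec(W_prime, x), b_prime)
print(stacked, collapsed)
```

The two outputs agree for this x, and the algebra says they agree for every x, which is exactly why the extra layer bought nothing.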
The nonlinearity between layers is what prevents the collapse. ReLU is enough. The “universal approximation theorem” says, roughly: a neural network with at least one hidden layer and a nonlinearity can approximate any continuous function on a bounded input region as closely as you like, given enough neurons. In practice, more layers are dramatically more efficient than wider single layers; that’s the empirical observation that birthed “deep” learning.
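XOR is the standard small illustration. No single linear function computes it: a linear function must satisfy f(0,0) + f(1,1) = f(1,0) + f(0,1), and XOR needs 0 + 0 on the left and 1 + 1 on the right. Two ReLU neurons are enough. The weights below are set by hand as an assumption for the sketch, not learned.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    """Two hidden ReLU neurons, one linear output. Weights set by hand."""
    h1 = relu(x1 + x2)         # positive when at least one input is on
    h2 = relu(x1 + x2 - 1.0)   # positive only when both inputs are on
    return h1 - 2.0 * h2

def same_weights_no_relu(x1, x2):
    """Identical weights with the ReLU removed: collapses to 2 - (x1 + x2)."""
    h1 = x1 + x2
    h2 = x1 + x2 - 1.0
    return h1 - 2.0 * h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), xor_net(a, b), same_weights_no_relu(a, b))
```

With the ReLU in place the outputs are 0, 1, 1, 0. Without it, the same weights give 2, 1, 1, 0, a straight-line function of x1 + x2.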
Backpropagation, in plain words
You have a network. It outputs predictions. You compare predictions to the truth and compute a loss — a single number that says how wrong the network was. For classification, that’s usually cross-entropy. For regression, mean squared error.
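Both losses are short formulas. The sketch below writes out the binary form of cross-entropy (matching a single-probability output; multi-class cross-entropy generalizes it) and mean squared error. The eps clamp is an assumption added to avoid log(0).

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    """Loss for one example: p is the predicted probability, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_squared_error(preds, targets):
    """Average squared difference over a batch of regression outputs."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# A confident correct answer costs little; a confident wrong one costs a lot.
for p in (0.9, 0.6, 0.1):
    print(p, binary_cross_entropy(p, y=1))

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```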
You want to nudge each parameter in a direction that reduces the loss. The gradient of the loss with respect to a parameter tells you which direction to nudge it: if the gradient is positive, decreasing the parameter decreases the loss. If negative, increasing it decreases the loss. Take a small step in the negative gradient direction. That’s gradient descent.
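Here is the update rule on the simplest possible case: one parameter and a made-up loss (w - 3)^2 whose minimum sits at w = 3. The learning rate of 0.1 and the step count are arbitrary assumptions. Start above or below the minimum and the sign of the gradient pushes w the right way.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """dL/dw for the toy loss."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=50):
    for _ in range(steps):
        w = w - lr * grad(w)  # small step against the gradient
    return w

w_from_above = gradient_descent(10.0)  # gradient starts positive, w decreases
w_from_below = gradient_descent(-4.0)  # gradient starts negative, w increases
print(w_from_above, w_from_below)
```

A real network does the same thing with millions of parameters at once. The only hard part is getting the gradients.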
Backpropagation is just the algorithm for computing those gradients efficiently using the chain rule from calculus. The clever observation behind backprop, made by people in the 1970s and 80s, is that you don’t need to compute each parameter’s gradient independently. You compute the gradient at the output, then propagate it backward through the network one layer at a time, reusing intermediate results. The cost of computing all gradients is a small constant multiple of the cost of one forward pass, no matter how many parameters there are. Without this trick, deep learning is computationally hopeless. With it, you can train networks with hundreds of billions of parameters.
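To see the backward walk once, here is a hand-written sketch for a tiny network (2 inputs, 2 hidden ReLU neurons, 1 sigmoid output, binary cross-entropy loss). All parameter values are hypothetical, and W2 is stored as a flat list because there is only one output. A finite-difference check confirms the gradient the slow way.

```python
import math

def forward(params, x):
    """Return every intermediate value; the backward pass reuses them."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [max(0.0, z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    p = 1.0 / (1.0 + math.exp(-z2))
    return z1, a1, z2, p

def loss(params, x, y):
    p = forward(params, x)[3]
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def backward(params, x, y):
    """Chain rule from the output back to the first layer."""
    W1, b1, W2, b2 = params
    z1, a1, z2, p = forward(params, x)
    dz2 = p - y                          # dL/dz2 for sigmoid + cross-entropy
    dW2 = [dz2 * a for a in a1]
    db2 = dz2
    da1 = [dz2 * w for w in W2]          # push the gradient back through W2
    dz1 = [d if z > 0 else 0.0 for d, z in zip(da1, z1)]  # ReLU gate
    dW1 = [[d * xi for xi in x] for d in dz1]
    db1 = dz1
    return dW1, db1, dW2, db2

def numeric_grad_W1(params, x, y, i, j, eps=1e-6):
    """Central finite difference for one entry of W1 (slow, for checking)."""
    W1, b1, W2, b2 = params
    def with_value(v):
        W = [row[:] for row in W1]
        W[i][j] = v
        return (W, b1, W2, b2)
    base = W1[i][j]
    up = loss(with_value(base + eps), x, y)
    down = loss(with_value(base - eps), x, y)
    return (up - down) / (2.0 * eps)

# Hypothetical parameters and one training example.
params = ([[0.5, 0.3], [-0.8, 0.2]], [0.1, -0.1], [1.0, -1.5], 0.05)
x, y = [1.0, 2.0], 1

dW1, db1, dW2, db2 = backward(params, x, y)
print(dW1[0][0], numeric_grad_W1(params, x, y, 0, 0))
```

The finite-difference check needs two forward passes per parameter. The backward pass gets every gradient in one sweep, which is the whole reason backprop matters.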
You will essentially never write backprop by hand. Every modern framework — PyTorch, JAX, TensorFlow — does it for you via automatic differentiation. You define the forward pass; the framework records the operations and walks them backward. Lesson 56 shows what that looks like in PyTorch.
What “deep” buys you
Why bother stacking many layers instead of one wide layer? Because depth lets the network learn hierarchical features. The classic example is image recognition: the first layer learns to detect edges. The second layer learns to combine edges into corners and textures. The third layer combines those into eyes, wheels, leaves. The deeper the layer, the more abstract the feature.
You don’t engineer those features. The network discovers them on its own from the training signal. That is the magic, if there is any: deep learning replaces the painful, domain-specific feature engineering of classical ML with raw compute and lots of data. Module 9 spent a whole lesson (50) on feature engineering for tabular data. For images, that whole step disappears — convolutional networks learn the right features from pixels.
When deep learning genuinely wins
Deep learning is the right tool when:
- The input is high-dimensional and structured. Images (millions of pixels, spatial structure). Audio (waveforms, frequency structure). Text (sequences, syntactic structure). Video (spatial + temporal). These are domains where there is no good way to hand-craft features, and where the right architecture (CNN for images, transformer for sequences) can exploit the structure directly.
- You have lots of data. Deep networks have millions to billions of parameters. They overfit instantly on small datasets. Rule of thumb: under ~10,000 examples, you probably want a tree-based model from Lesson 51. Above ~1 million, deep learning is usually worth a try.
- Performance matters more than interpretability. A trained CNN that classifies medical images at human-radiologist accuracy is valuable even though nobody can fully explain why it makes any specific decision. In other settings — credit decisions, medical decisions where regulators are involved, scientific causal questions — opacity is disqualifying.
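On the data point above: it helps to count parameters before you count examples. A rough sketch for fully connected networks follows; the layer sizes are hypothetical, and convolutional or transformer layers count differently.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

tiny = mlp_param_count([2, 3, 1])               # a toy classifier
modest = mlp_param_count([784, 512, 512, 10])   # a small image-sized MLP
print(tiny, modest)
```

Even the modest network has several hundred thousand parameters, which is why a dataset of a few thousand examples is so easy to memorize.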
When deep learning loses
Lesson 51 made a point I’m going to repeat: for tabular data, gradient-boosted trees still win. XGBoost, LightGBM, CatBoost beat neural networks on the kind of structured data you find in spreadsheets, databases, and most business problems, even in 2026. There has been a steady stream of papers attempting to reverse this, and the honest summary is that on most tabular benchmarks, trees are still ahead or tied. They train in seconds on a CPU and they don’t require GPU clusters. Use them.
Other places deep learning is the wrong choice:
- Small datasets (under a few thousand examples). Use trees, linear models, or transfer learning from a pretrained network instead.
- Strict interpretability requirements. You can interpret a logistic regression’s coefficients. You cannot meaningfully interpret a 100-million-parameter network.
- Tight latency budgets on tiny hardware. For a microcontroller doing inference in 1 ms, you want a tiny tree or a hand-tuned linear model, not a transformer.
- Problems where the cost of being wrong is huge and the cost of being slow is small. A bank making a $50M loan decision can spend an hour and get a human in the loop. They don’t need a 5 ms inference.
The compute reality
Training a serious deep learning model means using a GPU or a TPU. Not optional. A modern transformer training run is essentially impossible on a CPU in any reasonable time. The hardware market in 2026 is dominated by NVIDIA H100s and the newer B200s, with AMD MI300s playing catch-up and Apple Silicon (M-series chips with the MPS backend) increasingly viable for development and small-scale training.
If you’re learning, use Google Colab’s free tier, Kaggle notebooks, or Lightning Studios. Don’t buy a GPU before you know you need one.
The “scale is most of it” lesson
The most important and least flattering finding of the last decade of deep learning research: architecture matters less than scale. The same transformer architecture from 2017 is what’s powering the frontier models in 2026. What changed is the amount of data, the amount of compute, and the amount of training. Cleverness in architecture design has produced modest gains. Throwing 10x more data and 10x more compute at the same architecture has produced almost everything dramatic that has happened in AI since 2018.
The practical implication for you, building deep learning systems in 2026: don’t innovate on architecture before you’ve maxed out the boring stuff. More data, cleaner data, longer training, bigger model, better optimizer, better learning rate schedule. The clever architectural twist that gets you 0.5% more accuracy is rarely worth the engineering cost compared to feeding the same model more data.
The plan from here
Lesson 56 introduces PyTorch: tensors, autograd, the nn.Module API. We’ll define a small classification network the right way and inspect what each piece is doing. Lesson 57 ties it all together with the training loop: the five lines that turn a randomly-initialized network into a trained model, plus the production bookkeeping (checkpointing, mixed precision, distributed training) that turns a notebook prototype into a real system. By the end of Module 10, you will have written a working deep learning system from scratch and you will know exactly what every line is for.
Onward.