Chapter 4

Gradient descent — how networks learn

Training a neural network is solving an optimization problem: find the parameters that make the loss as small as possible. Gradient descent is the algorithm that does it. The whole idea fits in a single sentence: "look at which way is downhill, take a small step that way, repeat."

1. The update rule

Given current parameters $\theta$ and a loss $L(\theta)$, repeat:

$$\theta \;\leftarrow\; \theta - \eta \, \nabla_\theta L(\theta)$$

where $\eta$ is the learning rate (the step size) and $\nabla_\theta L(\theta)$ is the gradient of the loss with respect to the parameters.
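
A minimal sketch of this update on a toy one-dimensional loss; the loss function, starting point, and learning rate below are illustrative choices, not values from this chapter:

```python
# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # arbitrary starting point
eta = 0.1     # learning rate: the size of each downhill step

for step in range(50):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * dL/dtheta

print(theta, loss(theta))   # theta ends up very close to the minimum at 3
```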

2. See it happen — try different learning rates

The function plotted below is a simple quadratic bowl with a single minimum. Try a small learning rate (slow, smooth), a medium one (fast, clean), and a large one (chaos). Watch the trail of past positions to feel each regime:

[Interactive demo: ball descending the curve, with a learning-rate slider (default η = 0.40).]

Watch what happens with η > 2: the step overshoots and the ball oscillates instead of settling. That's why learning-rate tuning matters.
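
The same three regimes can be reproduced in a few lines of code. The quadratic $f(x) = x^2/2$ below is an assumed stand-in for the plotted curve, chosen because its gradient is simply $x$ and its divergence threshold is exactly η = 2, matching the note above; the starting point and learning rates are illustrative:

```python
# Gradient descent on f(x) = x**2 / 2, whose gradient is x.
# The update x <- x - eta * x = (1 - eta) * x shrinks toward 0 only when |1 - eta| < 1,
# i.e. for 0 < eta < 2; beyond that, each step overshoots further than the last.
def run(eta, x0=5.0, steps=15):
    x = x0
    trail = [round(x, 3)]
    for _ in range(steps):
        x = x - eta * x          # gradient step
        trail.append(round(x, 3))
    return trail

print(run(0.1))   # slow, smooth slide toward 0
print(run(1.0))   # lands on the minimum in a single step
print(run(2.1))   # oscillates with growing amplitude: chaos
```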

3. Stochastic, mini-batch, and full-batch gradients

Computing the gradient over the entire dataset every step is correct but slow. In practice we use a mini-batch — a small random subset of examples — and average the gradient over just those (a sketch follows the list):

  • Batch size 1 — pure stochastic gradient descent (SGD). Each step is one example.
  • Batch size 32–256 — the sweet spot for most problems.
  • Whole dataset — full-batch. Stable but slow.
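
A minimal sketch of one mini-batch step, assuming NumPy arrays for the data and a grad_fn(params, X_batch, y_batch) callback that returns the gradient averaged over a batch; both names are placeholders, not identifiers from this project:

```python
import numpy as np

def minibatch_sgd_step(params, grad_fn, X, y, batch_size=32, eta=0.01, rng=None):
    """One mini-batch SGD step on dataset (X, y)."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a random mini-batch
    g = grad_fn(params, X[idx], y[idx])                       # gradient over those examples only
    return params - eta * g                                   # the usual downhill update
```

Setting batch_size to 1 recovers pure SGD, and batch_size equal to the dataset size recovers full-batch gradient descent.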

4. Momentum — give the gradient inertia

Plain SGD jiggles around in narrow valleys, especially when one direction is much steeper than another. Momentum smooths this out by keeping a running average of past gradients — like a rolling ball that builds up speed over time:
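
A minimal sketch of the momentum update, written as an exponential moving average of gradients; grad_fn, eta, and beta are illustrative placeholders (beta around 0.9 is a common choice), not values from this project:

```python
import numpy as np

def momentum_sgd(params, grad_fn, eta=0.01, beta=0.9, steps=100):
    """Gradient descent with momentum: steps follow a running average of gradients."""
    velocity = np.zeros_like(params)
    for _ in range(steps):
        g = grad_fn(params)                          # gradient at the current position
        velocity = beta * velocity + (1 - beta) * g  # blend the new gradient into the velocity
        params = params - eta * velocity             # step along the smoothed direction
    return params
```

Because the velocity averages recent gradients, components that flip sign from step to step cancel out, while components that consistently point the same way accumulate — which is the ball-rolling picture above.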

5. Adam — momentum + per-parameter scaling

Adam (the default optimizer in this project) is the most popular optimizer in deep learning. It does two things on top of plain SGD:

  1. Tracks an exponential moving average of the gradient itself (like momentum).
  2. Tracks an exponential moving average of the squared gradient.

Then it divides the first average by the square root of the second, so every parameter gets its own effective step size: directions with consistently large gradients take smaller steps, while directions with small or rare gradients take relatively larger ones.

First moment — average of past gradients: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$.
Second moment — average of past squared gradients: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$.
The actual parameter update — bias-corrected: $\theta_t = \theta_{t-1} - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, with $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$.
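
A minimal sketch of a single Adam step putting the three pieces together; the state dictionary, variable names, and hyperparameter defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the commonly used values) are illustrative, not this project's actual implementation:

```python
import numpy as np

def adam_step(params, g, state, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` carries the two moving averages and the step count."""
    state["t"] += 1
    # First moment: exponential moving average of the gradient (momentum-like).
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    # Second moment: exponential moving average of the squared gradient.
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    # Bias correction: both averages start at zero, so rescale them early in training.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Per-parameter scaling: divide by the square root of the second moment.
    return params - eta * m_hat / (np.sqrt(v_hat) + eps)

# Usage: create the state once, then call adam_step with each new gradient.
params = np.zeros(4)
state = {"m": np.zeros_like(params), "v": np.zeros_like(params), "t": 0}
```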