Gradient descent: how networks learn
Training a neural network is solving an optimization problem: find the parameters that make the loss as small as possible. Gradient descent is the algorithm that does it. The whole idea fits in a single sentence: "look at which way is downhill, take a small step that way, repeat."
The update rule
Given current parameters θ, a loss L(θ), and a learning rate η, repeat:

θ ← θ − η ∇L(θ)

The gradient ∇L(θ) points uphill, so subtracting it moves θ downhill; the learning rate η sets how big each step is.
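As a worked example with made-up numbers (a toy loss, not this project's loss): take a single parameter θ = 3, L(θ) = θ², and η = 0.1. The gradient is ∇L(θ) = 2θ = 6, so one update gives θ ← 3 − 0.1 · 6 = 2.4, a little closer to the minimum at 0.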
See it happen: try different learning rates
The function plotted below is . Its minimum is at . Try learning rates (slow, smooth), (fast, clean), and (chaos). Watch the trail of past positions to feel each regime:
Watch what happens with η > 2: the step overshoots and the ball oscillates instead of settling. That's why learning-rate tuning matters.
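You can reproduce the three regimes without the interactive plot. The sketch below assumes the plotted function is the quadratic f(x) = x²/2, whose gradient is simply x; that choice matches the stated instability threshold of η > 2, but the exact function and learning-rate values in the demo may differ:

```python
# Gradient descent on f(x) = x**2 / 2 (an assumed stand-in for the plotted
# function). The update x <- x - eta * x = (1 - eta) * x shrinks toward 0
# when 0 < eta < 2 and blows up when eta > 2.

def run(eta, x0=3.0, steps=15):
    x = x0
    trail = [x]
    for _ in range(steps):
        x = x - eta * x          # gradient of x**2 / 2 is just x
        trail.append(x)
    return trail

for eta in (0.1, 1.0, 2.1):      # slow / fast / chaotic (illustrative values)
    print(eta, [round(v, 3) for v in run(eta)])
```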
Stochastic, mini-batch, and full-batch gradients
Computing the gradient over the entire dataset at every step is correct but slow. In practice we use a mini-batch (a small random subset of examples) and average the gradient over just those; a code sketch follows the list below:
- Batch size 1: pure stochastic gradient descent (SGD). Each step uses a single example.
- Batch size 32–256: the sweet spot for most problems.
- Whole dataset: full-batch gradient descent. Stable but slow.
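Here is a minimal sketch of mini-batch steps in Python/NumPy. The linear-regression data, model, and hyperparameters are placeholders invented for illustration, not anything from this project:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data; stands in for whatever dataset you actually have.
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta = 0.1
batch_size = 32                                      # inside the "sweet spot" range above

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)   # draw a random mini-batch
    Xb, yb = X[idx], y[idx]
    err = Xb @ w - yb
    grad = Xb.T @ err / batch_size                   # gradient averaged over the batch
    w -= eta * grad                                  # gradient descent step

print(w)   # roughly [1.0, -2.0, 0.5]
```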
Momentum: give the gradient inertia
Plain SGD jiggles around in narrow valleys, especially when one direction is much steeper than another. Momentum smooths it out by keeping a running average of past gradients, like a rolling ball that builds up speed over time.
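One common way to write this down (a sketch of the standard heavy-ball formulation, not necessarily the exact variant used here) keeps a velocity that decays by a factor β each step and adds in the new gradient; the toy loss and hyperparameters below are illustrative choices:

```python
# SGD with heavy-ball momentum on the toy loss f(theta) = theta**2.
# beta = 0.9 is a typical value, not one taken from this project.

def grad(theta):
    return 2.0 * theta           # derivative of theta**2

theta = 5.0
velocity = 0.0
eta, beta = 0.05, 0.9

for step in range(200):
    velocity = beta * velocity + grad(theta)   # running average of past gradients
    theta = theta - eta * velocity             # step along the smoothed direction

print(theta)   # settles near 0 after some early overshoot
```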
Adam: momentum + per-parameter scaling
Adam (the default optimizer in this project) is the most popular optimizer in deep learning. It does two things on top of plain SGD:
- Tracks an exponential moving average of the gradient itself (like momentum).
- Tracks an exponential moving average of the squared gradient.
Then it divides the first average by the square root of the second (plus a small ε for numerical stability), so every parameter gets its own effective step size.
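Putting the two averages together gives the update sketched below, using the commonly published defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8 on the same toy loss; this project's actual settings may differ:

```python
import math

# Minimal Adam update on the toy loss f(theta) = theta**2.
# beta1/beta2/eps are the usual published defaults, not project-specific values.

def grad(theta):
    return 2.0 * theta

theta = 5.0
m, v = 0.0, 0.0                              # moving averages of grad and grad**2
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment: momentum-like average
    v = beta2 * v + (1 - beta2) * g * g      # second moment: average squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)   # per-parameter scaled step

print(theta)   # near 0 (Adam hovers within roughly eta of the minimum)
```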