Backpropagation, derived from scratch
Backpropagation is just the chain rule (chapter 2) applied to a neural network (chapter 3) so we can run gradient descent (chapter 4). This chapter walks through every line of the derivation, then maps it directly to the source code in this repo. No black boxes.
Setup: what we cached during the forward pass
For a network with L layers, the forward pass left us with (see the sketch after this list):
- A⁽⁰⁾ = X, the input batch.
- Z⁽ℓ⁾ = A⁽ℓ⁻¹⁾W⁽ℓ⁾ + b⁽ℓ⁾, the pre-activation at each layer.
- A⁽ℓ⁾ = σ(Z⁽ℓ⁾), the post-activation.
- The loss L, a single scalar, computed from A⁽ᴸ⁾ and the targets y.
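A minimal sketch of that bookkeeping, assuming a toy layer class on plain number[][] arrays (not the repo's actual Layer, which works on Float64Array buffers, though it caches the same lastInput and lastZ fields the real code reads later):

```ts
// Toy sketch of the caching the forward pass does.
type Matrix = number[][];

const matmul = (a: Matrix, b: Matrix): Matrix =>
  a.map(row => b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)));

class ToyLayer {
  lastInput!: Matrix; // A⁽ℓ⁻¹⁾, needed later for the weight gradient
  lastZ!: Matrix;     // Z⁽ℓ⁾, needed later for σ'(z)

  constructor(public weights: Matrix, public bias: number[]) {}

  forward(input: Matrix): Matrix {
    this.lastInput = input; // cache A⁽ℓ⁻¹⁾
    const z = matmul(input, this.weights).map(row =>
      row.map((v, j) => v + this.bias[j]),
    );
    this.lastZ = z; // cache Z⁽ℓ⁾
    return z.map(row => row.map(v => Math.max(0, v))); // A⁽ℓ⁾ = ReLU(Z⁽ℓ⁾)
  }
}
```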
The goal of backprop is to find the gradient of the loss with respect to every parameter: ∂L/∂W⁽ℓ⁾ and ∂L/∂b⁽ℓ⁾ for every layer ℓ.
Define a helper: the per-layer error δ
We'll keep things tidy by introducing a shorthand for "how much the loss changes if I nudge the pre-activation of layer ℓ": δ⁽ℓ⁾ ≡ ∂L/∂Z⁽ℓ⁾. Every parameter gradient we want will factor through this quantity.
Compute δ at the output layer
Start at the last layer. By the chain rule (chapter 2), differentiating through the activation function gives δ⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ σ'(Z⁽ᴸ⁾), where ⊙ is the element-wise product. If the output is a softmax paired with categorical cross-entropy, the two derivatives fuse into the much simpler δ⁽ᴸ⁾ = (1/N)(ŷ − y), which is the shortcut the code below takes.
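As a sketch, that fused shortcut is a one-liner on plain arrays (a standalone function for illustration, not the repo's API):

```ts
// δ⁽ᴸ⁾ = (1/N)(ŷ − y): softmax output minus one-hot targets, averaged over the batch.
function outputDeltaFused(yPred: number[][], yTrue: number[][]): number[][] {
  const N = yPred.length; // batch size
  return yPred.map((row, i) => row.map((p, j) => (p - yTrue[i][j]) / N));
}
```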
Push δ backward through the layers
For an interior layer ℓ, applying the chain rule again gives δ⁽ℓ⁾ = (δ⁽ℓ⁺¹⁾ (W⁽ℓ⁺¹⁾)ᵀ) ⊙ σ'(Z⁽ℓ⁾). In words:
"To find the error at layer ℓ: take the error at the next layer (ℓ+1), multiply by the next layer's weight matrix transposed (this 'undoes' the forward matrix multiply), then mask element-wise by the derivative of the activation function at layer ℓ."
Two factors with a clear story:
- The matrix piece projects the upstream error backward through layer ℓ+1. If a neuron at layer ℓ contributed strongly to many neurons at ℓ+1, it inherits errors from all of them.
- The element-wise piece says "but only the parts of the input where the activation was actually responsive matter." For ReLU, that means dead neurons (z < 0) get zero gradient: they don't update at all this step.
Apply this iteratively from ℓ = L−1 down to ℓ = 1 and you have δ⁽ℓ⁾ at every layer. That was the hard part.
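Here is one step of that recursion as a standalone sketch (toy number[][] helpers, ReLU assumed for the activation):

```ts
type Matrix = number[][];

const transpose = (m: Matrix): Matrix => m[0].map((_, j) => m.map(row => row[j]));

const matmul = (a: Matrix, b: Matrix): Matrix =>
  a.map(row => b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)));

// δ⁽ℓ⁾ = (δ⁽ℓ⁺¹⁾ · (W⁽ℓ⁺¹⁾)ᵀ) ⊙ σ'(Z⁽ℓ⁾), with σ = ReLU here.
function backwardStep(deltaNext: Matrix, wNext: Matrix, zHere: Matrix): Matrix {
  const projected = matmul(deltaNext, transpose(wNext)); // push the error back through W⁽ℓ⁺¹⁾
  // ReLU derivative: 1 where z > 0, else 0, so dead neurons get zero gradient.
  return projected.map((row, i) => row.map((v, j) => (zHere[i][j] > 0 ? v : 0)));
}
```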
Read the parameter gradients off δ
Once we have δ⁽ℓ⁾ at every layer, the gradients we actually wanted are ∂L/∂W⁽ℓ⁾ = (A⁽ℓ⁻¹⁾)ᵀ δ⁽ℓ⁾ and ∂L/∂b⁽ℓ⁾ = Σ δ⁽ℓ⁾, summed over the batch.
"For the weight gradient: matrix-multiply the transpose of the previous layer's activations with δ. For the bias gradient: sum δ across all examples in the batch."
Why these specific shapes? Because Z⁽ℓ⁾ = A⁽ℓ⁻¹⁾W⁽ℓ⁾ + b⁽ℓ⁾: when you differentiate Z with respect to W, you get A⁽ℓ⁻¹⁾. When you multiply that by ∂L/∂Z⁽ℓ⁾ (which is δ⁽ℓ⁾), you get the gradient with respect to W⁽ℓ⁾. The transpose handles the shape mechanics.
(A⁽ℓ⁻¹⁾)ᵀ is (width of layer ℓ−1) × N. δ⁽ℓ⁾ is N × (width of layer ℓ). Their product is (width of layer ℓ−1) × (width of layer ℓ): exactly the shape of W⁽ℓ⁾.
Always check shapes before debugging math. The shape rule will catch nine bugs out of ten before you've even started reasoning about derivatives.
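To make the shape rule concrete, a quick check with made-up sizes (batch of 32, layer ℓ−1 four units wide, layer ℓ three units wide):

```ts
type Matrix = number[][];
const zeros = (r: number, c: number): Matrix =>
  Array.from({ length: r }, () => new Array<number>(c).fill(0));
const transpose = (m: Matrix): Matrix => m[0].map((_, j) => m.map(row => row[j]));
const matmul = (a: Matrix, b: Matrix): Matrix =>
  a.map(row => b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)));

const N = 32;              // batch size (made up)
const aPrev = zeros(N, 4); // A⁽ℓ⁻¹⁾: 32 × 4
const delta = zeros(N, 3); // δ⁽ℓ⁾:   32 × 3

const gradW = matmul(transpose(aPrev), delta); // (4 × 32) · (32 × 3)
console.log(gradW.length, gradW[0].length);    // 4 3, exactly the shape of W⁽ℓ⁾
```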
The complete algorithm
- Forward. Compute every Z⁽ℓ⁾ and A⁽ℓ⁾. Compute the loss.
- Output δ. Either δ⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ σ'(Z⁽ᴸ⁾) or, if your output is softmax + categorical CE, the fused shortcut from step 3: δ⁽ᴸ⁾ = (1/N)(ŷ − y).
- Backward sweep. For ℓ = L−1 down to 1, compute δ⁽ℓ⁾ = (δ⁽ℓ⁺¹⁾ (W⁽ℓ⁺¹⁾)ᵀ) ⊙ σ'(Z⁽ℓ⁾).
- Parameter gradients. ∂L/∂W⁽ℓ⁾ = (A⁽ℓ⁻¹⁾)ᵀ δ⁽ℓ⁾, ∂L/∂b⁽ℓ⁾ = Σ δ⁽ℓ⁾ over the batch.
- Update. Apply the optimizer (chapter 4) using each gradient.
Code, line by line
Here are the key lines from src/lib/nn/network.ts in this repo. Compare each line to a step from the derivation above:
// Step 1: Forward pass → Network.forward
let out = x;
for (const layer of this.layers) out = layer.forward(out);
// Step 2: Output δ → fused softmax+CCE branch
if (fused) {
  delta = (yPred - yBatch) / N; // δ⁽ᴸ⁾ = (1/N)(ŷ − y)
} else {
  delta = lossFn.backward(yBatch, yPred); // ∂L/∂A⁽ᴸ⁾
}
// Steps 3+4+5: Backward sweep + parameter gradients → Layer.backward
const dZ = fusedSoftmax
  ? dA                                        // already δ if fused
  : this.activation.backward(dA, this.lastZ); // dA ⊙ σ'(z)
this.lastGradW = matmul(transpose(this.lastInput), dZ); // (A⁽ℓ⁻¹⁾)ᵀ · δ
this.lastGradB = sumRows(dZ); // Σ δ over batch
return matmul(dZ, transpose(this.weights)); // δ · Wᵀ → upstream
// Step 6: Update → Layer.applyGradients calls the optimizer.

Every comment maps to one step in the derivation. There's no library doing the math for us; those are the actual operations on Float64Array buffers. If you understand the chain rule, you understand this code.
Verify it works: the gradient check
A small but powerful sanity check: pick one parameter, perturb it by a tiny ε, and approximate its derivative numerically by running two forward passes:
"Pick a single weight. Push it up by a tiny ε, run the forward pass, get the loss. Push the same weight down by ε, run forward again, get the loss. Subtract, divide by 2ε. The result is a numerical estimate of that weight's gradient."
If your analytic gradient (from backprop) agrees with this numerical estimate to a few decimal places, your backprop is correct. If they disagree, something in your derivation is wrong, and now you know exactly where to look.
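Here is the numerical side of that check as a small sketch. The lossForWeight callback is a stand-in introduced for illustration: it should set the chosen weight to the given value, run a full forward pass on a fixed batch, and return the scalar loss.

```ts
// Central-difference estimate of ∂L/∂w for one weight: (L(w + ε) − L(w − ε)) / (2ε)
function numericalGradient(
  lossForWeight: (w: number) => number, // hypothetical helper: set the weight, forward pass, return loss
  w: number,
  eps = 1e-5,
): number {
  return (lossForWeight(w + eps) - lossForWeight(w - eps)) / (2 * eps);
}

// Usage: compare this estimate against the matching entry of the analytic
// gradient that backprop produced; they should agree to several decimal places.
```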
The repo's test file src/lib/nn/network.test.ts ships exactly this check on a small random network with MSE loss and tanh activation. It catches the kind of subtle off-by-a-factor or wrong-transpose bugs that pass smoke tests but quietly degrade training.