Backpropagation, derived from scratch
Backpropagation is just the chain rule (chapter 2) applied to a neural network (chapter 3) so we can run gradient descent (chapter 4). This chapter walks through every line of the derivation, then maps it directly to the source code in this repo. No black boxes.
Setup: what we cached during the forward pass
For a network with L layers, the forward pass left us with (see the sketch after this list):
- A⁽⁰⁾ = X, the input batch.
- Z⁽ℓ⁾ = A⁽ℓ⁻¹⁾W⁽ℓ⁾ + b⁽ℓ⁾, the pre-activation at each layer.
- A⁽ℓ⁾ = σ(Z⁽ℓ⁾), the post-activation.
- The loss L, a single scalar, computed from A⁽ᴸ⁾ and the targets y.
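A minimal sketch of that bookkeeping, assuming a toy layer class on plain number[][] arrays (not the repo's actual Layer, which works on Float64Array buffers, though it caches the same lastInput and lastZ fields the real code reads later):

```ts
// Toy sketch of the caching the forward pass does.
type Matrix = number[][];

const matmul = (a: Matrix, b: Matrix): Matrix =>
  a.map(row => b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)));

class ToyLayer {
  lastInput!: Matrix; // A⁽ℓ⁻¹⁾, needed later for the weight gradient
  lastZ!: Matrix;     // Z⁽ℓ⁾, needed later for σ'(z)

  constructor(public weights: Matrix, public bias: number[]) {}

  forward(input: Matrix): Matrix {
    this.lastInput = input; // cache A⁽ℓ⁻¹⁾
    const z = matmul(input, this.weights).map(row =>
      row.map((v, j) => v + this.bias[j]),
    );
    this.lastZ = z; // cache Z⁽ℓ⁾
    return z.map(row => row.map(v => Math.max(0, v))); // A⁽ℓ⁾ = ReLU(Z⁽ℓ⁾)
  }
}
```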
The goal of backprop is to find the gradient of the loss with respect to every parameter: ∂L/∂W⁽ℓ⁾ and ∂L/∂b⁽ℓ⁾ for every layer ℓ.
Define a helper: the per-layer error δ
We'll keep things tidy by introducing a shorthand for "how much the loss changes if I nudge the pre-activation of layer ℓ": δ⁽ℓ⁾ ≡ ∂L/∂Z⁽ℓ⁾. Every parameter gradient we want will factor through this quantity.
Compute δ at the output layer
Start at the last layer. By the chain rule (chapter 2), differentiating through the activation function gives δ⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ σ'(Z⁽ᴸ⁾), where ⊙ is the element-wise product. If the output is a softmax paired with categorical cross-entropy, the two derivatives fuse into the much simpler δ⁽ᴸ⁾ = (1/N)(ŷ − y), which is the shortcut the code below takes.
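As a sketch, that fused shortcut is a one-liner on plain arrays (a standalone function for illustration, not the repo's API):

```ts
// δ⁽ᴸ⁾ = (1/N)(ŷ − y): softmax output minus one-hot targets, averaged over the batch.
function outputDeltaFused(yPred: number[][], yTrue: number[][]): number[][] {
  const N = yPred.length; // batch size
  return yPred.map((row, i) => row.map((p, j) => (p - yTrue[i][j]) / N));
}
```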
Push δ backward through the layers
For an interior layer ℓ, applying the chain rule again gives δ⁽ℓ⁾ = (δ⁽ℓ⁺¹⁾ (W⁽ℓ⁺¹⁾)ᵀ) ⊙ σ'(Z⁽ℓ⁾). In words:
"To find the error at layer ℓ: take the error at the next layer (ℓ+1), multiply by the next layer's weight matrix transposed (this 'undoes' the forward matrix multiply), then mask element-wise by the derivative of the activation function at layer ℓ."
Two factors with a clear story:
- The matrix piece projects the upstream error backward through layer ℓ+1. If a neuron at layer ℓ contributed strongly to many neurons at ℓ+1, it inherits errors from all of them.
- The element-wise piece says "but only the parts of the input where the activation was actually responsive matter." For ReLU, that means dead neurons (z < 0) get zero gradient: they don't update at all this step.
Apply this iteratively from ℓ = L−1 down to ℓ = 1 and you have δ⁽ℓ⁾ at every layer. That was the hard part.
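Here is one step of that recursion as a standalone sketch (toy number[][] helpers, ReLU assumed for the activation):

```ts
type Matrix = number[][];

const transpose = (m: Matrix): Matrix => m[0].map((_, j) => m.map(row => row[j]));

const matmul = (a: Matrix, b: Matrix): Matrix =>
  a.map(row => b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)));

// δ⁽ℓ⁾ = (δ⁽ℓ⁺¹⁾ · (W⁽ℓ⁺¹⁾)ᵀ) ⊙ σ'(Z⁽ℓ⁾), with σ = ReLU here.
function backwardStep(deltaNext: Matrix, wNext: Matrix, zHere: Matrix): Matrix {
  const projected = matmul(deltaNext, transpose(wNext)); // push the error back through W⁽ℓ⁺¹⁾
  // ReLU derivative: 1 where z > 0, else 0, so dead neurons get zero gradient.
  return projected.map((row, i) => row.map((v, j) => (zHere[i][j] > 0 ? v : 0)));
}
```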
Read the parameter gradients off δ
Once we have δ⁽ℓ⁾ at every layer, the gradients we actually wanted are ∂L/∂W⁽ℓ⁾ = (A⁽ℓ⁻¹⁾)ᵀ δ⁽ℓ⁾ and ∂L/∂b⁽ℓ⁾ = Σ δ⁽ℓ⁾, summed over the batch.
"For the weight gradient: matrix-multiply the transpose of the previous layer's activations with δ. For the bias gradient: sum δ across all examples in the batch."
Why these specific shapes? Because Z⁽ℓ⁾ = A⁽ℓ⁻¹⁾W⁽ℓ⁾ + b⁽ℓ⁾: when you differentiate Z with respect to W, you get A⁽ℓ⁻¹⁾. When you multiply that by ∂L/∂Z⁽ℓ⁾ (which is δ⁽ℓ⁾), you get the gradient with respect to W⁽ℓ⁾. The transpose handles the shape mechanics.
(A⁽ℓ⁻¹⁾)ᵀ is (width of layer ℓ−1) × N. δ⁽ℓ⁾ is N × (width of layer ℓ). Their product is (width of layer ℓ−1) × (width of layer ℓ): exactly the shape of W⁽ℓ⁾.
Always check shapes before debugging math. The shape rule will catch nine bugs out of ten before you've even started reasoning about derivatives.
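To make the shape rule concrete, a quick check with made-up sizes (batch of 32, layer ℓ−1 four units wide, layer ℓ three units wide):

```ts
type Matrix = number[][];
const zeros = (r: number, c: number): Matrix =>
  Array.from({ length: r }, () => new Array<number>(c).fill(0));
const transpose = (m: Matrix): Matrix => m[0].map((_, j) => m.map(row => row[j]));
const matmul = (a: Matrix, b: Matrix): Matrix =>
  a.map(row => b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)));

const N = 32;              // batch size (made up)
const aPrev = zeros(N, 4); // A⁽ℓ⁻¹⁾: 32 × 4
const delta = zeros(N, 3); // δ⁽ℓ⁾:   32 × 3

const gradW = matmul(transpose(aPrev), delta); // (4 × 32) · (32 × 3)
console.log(gradW.length, gradW[0].length);    // 4 3, exactly the shape of W⁽ℓ⁾
```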
The complete algorithm
- Forward. Compute every Z⁽ℓ⁾ and A⁽ℓ⁾. Compute the loss.
- Output δ. Either δ⁽ᴸ⁾ = ∂L/∂A⁽ᴸ⁾ ⊙ σ'(Z⁽ᴸ⁾) or, if your output is softmax + categorical CE, the fused shortcut from step 3: δ⁽ᴸ⁾ = (1/N)(ŷ − y).
- Backward sweep. For ℓ = L−1 down to 1, compute δ⁽ℓ⁾ = (δ⁽ℓ⁺¹⁾ (W⁽ℓ⁺¹⁾)ᵀ) ⊙ σ'(Z⁽ℓ⁾).
- Parameter gradients. ∂L/∂W⁽ℓ⁾ = (A⁽ℓ⁻¹⁾)ᵀ δ⁽ℓ⁾, ∂L/∂b⁽ℓ⁾ = Σ δ⁽ℓ⁾ over the batch.
- Update. Apply the optimizer (chapter 4) using each gradient.
Code, line by line
Here are the key lines from src/lib/nn/network.ts in this repo. Compare each line to a step from the derivation above:
// Step 1: Forward pass → Network.forward
let out = x;
for (const layer of this.layers) out = layer.forward(out);
// Step 2: Output δ → fused softmax+CCE branch
if (fused) {
  delta = (yPred - yBatch) / N; // δ⁽ᴸ⁾ = (1/N)(ŷ − y)
} else {
  delta = lossFn.backward(yBatch, yPred); // ∂L/∂A⁽ᴸ⁾
}
// Steps 3+4+5: Backward sweep + parameter gradients → Layer.backward
const dZ = fusedSoftmax
  ? dA                                        // already δ if fused
  : this.activation.backward(dA, this.lastZ); // dA ⊙ σ'(z)
this.lastGradW = matmul(transpose(this.lastInput), dZ); // (A⁽ℓ⁻¹⁾)ᵀ · δ
this.lastGradB = sumRows(dZ); // Σ δ over batch
return matmul(dZ, transpose(this.weights)); // δ · Wᵀ → upstream
// Step 6: Update → Layer.applyGradients calls the optimizer.

Every comment maps to one step in the derivation. There's no library doing the math for us; those are the actual operations on Float64Array buffers. If you understand the chain rule, you understand this code.
Verify it works: the gradient check
A small but powerful sanity check: pick one parameter, perturb it by a tiny ε, and approximate its derivative numerically by running two forward passes:
"Pick a single weight. Push it up by a tiny ε, run the forward pass, get the loss. Push the same weight down by ε, run forward again, get the loss. Subtract, divide by 2ε. The result is a numerical estimate of that weight's gradient."
If your analytic gradient (from backprop) agrees with this numerical estimate to a few decimal places, your backprop is correct. If they disagree, something in your derivation is wrong, and now you know exactly where to look.
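Here is the numerical side of that check as a small sketch. The lossForWeight callback is a stand-in introduced for illustration: it should set the chosen weight to the given value, run a full forward pass on a fixed batch, and return the scalar loss.

```ts
// Central-difference estimate of ∂L/∂w for one weight: (L(w + ε) − L(w − ε)) / (2ε)
function numericalGradient(
  lossForWeight: (w: number) => number, // hypothetical helper: set the weight, forward pass, return loss
  w: number,
  eps = 1e-5,
): number {
  return (lossForWeight(w + eps) - lossForWeight(w - eps)) / (2 * eps);
}

// Usage: compare this estimate against the matching entry of the analytic
// gradient that backprop produced; they should agree to several decimal places.
```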
The repo's test file src/lib/nn/network.test.ts ships exactly this check on a small random network with MSE loss and tanh activation. It catches the kind of subtle off-by-a-factor or wrong-transpose bugs that pass smoke tests but quietly degrade training.