Chapter 2

Calculus you actually need

You don't need a year of analysis to understand backprop. You need three ideas: derivative, partial derivative, and chain rule. Everything else stacks on top.

1

What a derivative actually is

For a function with one input, $f(x)$, the derivative $f'(x_0)$ at a point $x_0$ is the slope of the tangent line there. Steeper slope → larger derivative. Flat → zero. Going down → negative.
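To make the slope idea concrete before any rules, here is a minimal sketch (my own illustration, not the guide's code) that estimates the derivative numerically as the slope of a tiny secant line:

```python
def numerical_derivative(f, x, h=1e-5):
    """Estimate f'(x) as the slope of a tiny secant line around x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# f(x) = x**2 has derivative 2x, so the slope at x = 3 should be close to 6.
print(numerical_derivative(lambda x: x**2, 3.0))  # ~6.0
```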

2

The four rules you'll use forever

Memorize these four. They cover almost every layer in any network you'll build in this guide.

Power rule. $\frac{d}{dx}\,x^n = n\,x^{n-1}$
Exponential. $\frac{d}{dx}\,e^x = e^x$
Sigmoid (used as an activation). $\sigma(x) = \frac{1}{1 + e^{-x}}$, with $\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$
Hyperbolic tangent. $\frac{d}{dx}\,\tanh(x) = 1 - \tanh^2(x)$
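As a sanity check (a sketch of my own, not code from the guide), you can compare each closed-form derivative above against a finite-difference estimate:

```python
import math

def approx(f, x, h=1e-5):
    # Central-difference estimate of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

x = 0.7
checks = [
    ("power rule, d/dx x^3",  approx(lambda t: t**3, x), 3 * x**2),
    ("exponential, d/dx e^x", approx(math.exp, x),       math.exp(x)),
    ("sigmoid s'(x)",         approx(sigmoid, x),        sigmoid(x) * (1 - sigmoid(x))),
    ("tanh, d/dx tanh(x)",    approx(math.tanh, x),      1 - math.tanh(x) ** 2),
]
for name, numeric, exact in checks:
    print(f"{name}: numeric={numeric:.6f}  exact={exact:.6f}")
```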

Plot any activation alongside its derivative; the dashed line is the derivative:

[Interactive plot: each activation function drawn with its derivative as a dashed line.]
3

Partial derivatives โ€” change one variable, hold the rest

Most functions in machine learning take many inputs at once. We need a way to ask "how does the output change if I nudge only this one input, leaving everything else alone?" That's a partial derivative, written $\frac{\partial f}{\partial x_i}$ for input $x_i$.

Stack all the partial derivatives into a vector and you get the gradient:

$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n} \right)$$
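Here is a minimal sketch (the function and names are my own, chosen for illustration) that builds the gradient numerically, nudging one coordinate at a time exactly as the definition says:

```python
def gradient(f, point, h=1e-5):
    """Build the gradient one partial derivative at a time:
    nudge coordinate i, hold every other coordinate fixed."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# f(x, y) = x**2 * y has partial derivatives (2xy, x**2).
f = lambda p: p[0] ** 2 * p[1]
print(gradient(f, [3.0, 2.0]))  # ~[12.0, 9.0]
```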

4

The chain rule โ€” the engine behind backpropagation

Most interesting functions are compositions โ€” functions inside functions. The chain rule tells you how to differentiate a composition. It's the single most important formula in this guide.

If $y = f(u)$ and $u = g(x)$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
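A quick worked example (an illustrative choice of functions on my part): take $u = 3x$ and $y = u^2$. Then $\frac{dy}{du} = 2u$ and $\frac{du}{dx} = 3$, so $\frac{dy}{dx} = 2u \cdot 3 = 18x$; at $x = 1$ that slope is $18$, even though $y$ itself is only $9$.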

The chain rule generalizes to any depth of composition — that's why it works for deep networks. If you stack $n$ functions, $y = f_n(f_{n-1}(\cdots f_1(x)))$, the derivative is a product of $n$ smaller derivatives:

$$\frac{dy}{dx} = \frac{dy}{du_{n-1}} \cdot \frac{du_{n-1}}{du_{n-2}} \cdots \frac{du_1}{dx}, \qquad u_k = f_k(u_{k-1}),\quad u_0 = x$$

[Interactive demo: slide x and watch each chain link compute, showing the inner value u, the outer value y, and the composed result.]

The chain rule lets us differentiate compositions one step at a time. Backpropagation is just this rule applied across many layers in sequence.
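To see that "product of local derivatives" idea in running code, here is a minimal sketch (the functions and names are my own, not the guide's): a stack of three scalar functions, a forward pass that records intermediate values, and a backward pass that multiplies local slopes from the last layer back to the first:

```python
import math

# A tiny "network": three scalar functions stacked, y = f3(f2(f1(x))).
# Each entry holds a function and its derivative.
layers = [
    (lambda x: 3 * x,        lambda x: 3.0),                  # f1 and f1'
    (lambda u: u ** 2,       lambda u: 2 * u),                # f2 and f2'
    (lambda v: math.tanh(v), lambda v: 1 - math.tanh(v)**2),  # f3 and f3'
]

def forward_and_backward(x):
    # Forward pass: remember every intermediate value.
    values = [x]
    for f, _ in layers:
        values.append(f(values[-1]))
    # Backward pass: multiply local derivatives, last layer first,
    # each evaluated at the value that layer received on the way forward.
    grad = 1.0
    for (_, df), v in zip(reversed(layers), reversed(values[:-1])):
        grad *= df(v)
    return values[-1], grad

y, dy_dx = forward_and_backward(0.5)
print(y, dy_dx)
```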

5

The vector chain rule โ€” one step harder, no scarier

When the function takes a vector in and produces a vector out, the derivative is no longer a single number — it's a matrix called the Jacobian. Same idea, more bookkeeping: for $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$,

$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

And the chain rule becomes a matrix product: if $\mathbf{y} = g(\mathbf{x})$ and $\mathbf{z} = f(\mathbf{y})$, then

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \, \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
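A small numerical check of that statement (a sketch with illustrative functions of my own, assuming NumPy is available): build each Jacobian by finite differences and verify that the Jacobian of the composition equals the product of the two Jacobians:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Numerical Jacobian: column j holds the partial derivatives
    of every output with respect to input x[j]."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

# Two vector-valued functions (illustrative choices, not from the text).
g = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])   # R^2 -> R^3
f = lambda y: np.array([y[0] + y[1], y[1] * y[2]])               # R^3 -> R^2

x = np.array([0.8, 1.3])
J_composed = jacobian(lambda x: f(g(x)), x)          # Jacobian of the composition
J_product = jacobian(f, g(x)) @ jacobian(g, x)       # matrix product of Jacobians
print(np.allclose(J_composed, J_product, atol=1e-4))  # True
```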