Chapter 2

Calculus you actually need

You don't need a year of analysis to understand backprop. You need three ideas: derivative, partial derivative, and chain rule. Everything else stacks on top.

1

What a derivative actually is

For a function with one input, $f(x)$, the derivative $f'(x_0)$ at a point $x_0$ is the slope of the tangent line there. Steeper slope → larger derivative. Flat → zero. Going down → negative.
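To make the slope idea concrete before any rules, here is a minimal sketch (my own illustration, not the guide's code) that estimates the derivative numerically as the slope of a tiny secant line:

```python
def numerical_derivative(f, x, h=1e-5):
    """Estimate f'(x) as the slope of a tiny secant line around x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# f(x) = x**2 has derivative 2x, so the slope at x = 3 should be close to 6.
print(numerical_derivative(lambda x: x**2, 3.0))  # ~6.0
```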

2

The four rules you'll use forever

Memorize these four. They cover almost every layer in any network you'll build in this guide.

Power rule. $\frac{d}{dx}\,x^n = n\,x^{n-1}$
Exponential. $\frac{d}{dx}\,e^x = e^x$
Sigmoid (used as an activation). $\sigma(x) = \frac{1}{1 + e^{-x}}$, with $\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$
Hyperbolic tangent. $\frac{d}{dx}\,\tanh(x) = 1 - \tanh^2(x)$
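As a sanity check (a sketch of my own, not code from the guide), you can compare each closed-form derivative above against a finite-difference estimate:

```python
import math

def approx(f, x, h=1e-5):
    # Central-difference estimate of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

x = 0.7
checks = [
    ("power rule, d/dx x^3",  approx(lambda t: t**3, x), 3 * x**2),
    ("exponential, d/dx e^x", approx(math.exp, x),       math.exp(x)),
    ("sigmoid s'(x)",         approx(sigmoid, x),        sigmoid(x) * (1 - sigmoid(x))),
    ("tanh, d/dx tanh(x)",    approx(math.tanh, x),      1 - math.tanh(x) ** 2),
]
for name, numeric, exact in checks:
    print(f"{name}: numeric={numeric:.6f}  exact={exact:.6f}")
```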

Plot any activation alongside its derivative; the dashed line is the derivative:

[Interactive plot: each activation function drawn with its derivative as a dashed line.]
3

Partial derivatives โ€” change one variable, hold the rest

Most functions in machine learning take many inputs at once. We need a way to ask "how does the output change if I nudge only this one input, leaving everything else alone?" That's a partial derivative, written $\frac{\partial f}{\partial x_i}$ for input $x_i$.

Stack all the partial derivatives into a vector and you get the gradient:

$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n} \right)$$
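Here is a minimal sketch (the function and names are my own, chosen for illustration) that builds the gradient numerically, nudging one coordinate at a time exactly as the definition says:

```python
def gradient(f, point, h=1e-5):
    """Build the gradient one partial derivative at a time:
    nudge coordinate i, hold every other coordinate fixed."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# f(x, y) = x**2 * y has partial derivatives (2xy, x**2).
f = lambda p: p[0] ** 2 * p[1]
print(gradient(f, [3.0, 2.0]))  # ~[12.0, 9.0]
```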

4

The chain rule โ€” the engine behind backpropagation

Most interesting functions are compositions โ€” functions inside functions. The chain rule tells you how to differentiate a composition. It's the single most important formula in this guide.

If $y = f(u)$ and $u = g(x)$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
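A quick worked example (an illustrative choice of functions on my part): take $u = 3x$ and $y = u^2$. Then $\frac{dy}{du} = 2u$ and $\frac{du}{dx} = 3$, so $\frac{dy}{dx} = 2u \cdot 3 = 18x$; at $x = 1$ that slope is $18$, even though $y$ itself is only $9$.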

The chain rule generalizes to any depth of composition — that's why it works for deep networks. If you stack $n$ functions, $y = f_n(f_{n-1}(\cdots f_1(x)))$, the derivative is a product of $n$ smaller derivatives:

$$\frac{dy}{dx} = \frac{dy}{du_{n-1}} \cdot \frac{du_{n-1}}{du_{n-2}} \cdots \frac{du_1}{dx}, \qquad u_k = f_k(u_{k-1}),\quad u_0 = x$$

[Interactive demo: slide x and watch each chain link compute, showing the inner value u, the outer value y, and the composed result.]

The chain rule lets us differentiate compositions one step at a time. Backpropagation is just this rule applied across many layers in sequence.
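To see that "product of local derivatives" idea in running code, here is a minimal sketch (the functions and names are my own, not the guide's): a stack of three scalar functions, a forward pass that records intermediate values, and a backward pass that multiplies local slopes from the last layer back to the first:

```python
import math

# A tiny "network": three scalar functions stacked, y = f3(f2(f1(x))).
# Each entry holds a function and its derivative.
layers = [
    (lambda x: 3 * x,        lambda x: 3.0),                  # f1 and f1'
    (lambda u: u ** 2,       lambda u: 2 * u),                # f2 and f2'
    (lambda v: math.tanh(v), lambda v: 1 - math.tanh(v)**2),  # f3 and f3'
]

def forward_and_backward(x):
    # Forward pass: remember every intermediate value.
    values = [x]
    for f, _ in layers:
        values.append(f(values[-1]))
    # Backward pass: multiply local derivatives, last layer first,
    # each evaluated at the value that layer received on the way forward.
    grad = 1.0
    for (_, df), v in zip(reversed(layers), reversed(values[:-1])):
        grad *= df(v)
    return values[-1], grad

y, dy_dx = forward_and_backward(0.5)
print(y, dy_dx)
```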

5

The vector chain rule โ€” one step harder, no scarier

When the function takes a vector in and produces a vector out, the derivative is no longer a single number — it's a matrix called the Jacobian. Same idea, more bookkeeping: for $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$,

$$J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

And the chain rule becomes a matrix product: if $\mathbf{y} = g(\mathbf{x})$ and $\mathbf{z} = f(\mathbf{y})$, then

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \, \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
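A small numerical check of that statement (a sketch with illustrative functions of my own, assuming NumPy is available): build each Jacobian by finite differences and verify that the Jacobian of the composition equals the product of the two Jacobians:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Numerical Jacobian: column j holds the partial derivatives
    of every output with respect to input x[j]."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

# Two vector-valued functions (illustrative choices, not from the text).
g = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])   # R^2 -> R^3
f = lambda y: np.array([y[0] + y[1], y[1] * y[2]])               # R^3 -> R^2

x = np.array([0.8, 1.3])
J_composed = jacobian(lambda x: f(g(x)), x)          # Jacobian of the composition
J_product = jacobian(f, g(x)) @ jacobian(g, x)       # matrix product of Jacobians
print(np.allclose(J_composed, J_product, atol=1e-4))  # True
```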