Calculus you actually need
You don't need a year of analysis to understand backprop. You need three ideas: derivative, partial derivative, and chain rule. Everything else stacks on top.
What a derivative actually is
For a function with one input, $f(x)$, the derivative $f'(x)$ at a point is the slope of the tangent line there. Steeper slope → larger derivative. Flat → zero. Going down → negative.
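To make that concrete, you can estimate a slope numerically with a finite difference: nudge the input a little in each direction and see how much the output moves. A minimal Python sketch (the helper `numerical_derivative` is just for illustration, not from any library):

```python
import numpy as np

def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Slope of f(x) = x**2 at a steep point, a flat point, and a downhill point.
f = lambda x: x**2
for x in [3.0, 0.0, -2.0]:
    print(x, numerical_derivative(f, x))   # ~6.0, ~0.0, ~-4.0
```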
The four rules you'll use forever
Memorize these four. They cover almost every layer in any network you'll build in this guide.
Plot any activation alongside its derivative, with the derivative drawn as a dashed line, and you can read the slope straight off the picture.
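Here is a minimal matplotlib sketch of that picture, using the sigmoid as the example activation (any activation and its derivative would do):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 400)
sigmoid = 1 / (1 + np.exp(-x))
dsigmoid = sigmoid * (1 - sigmoid)   # derivative of the sigmoid

plt.plot(x, sigmoid, label="sigmoid")
plt.plot(x, dsigmoid, "--", label="derivative")  # dashed line = derivative
plt.legend()
plt.show()
```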
Partial derivatives: change one variable, hold the rest
Most functions in machine learning take many inputs at once. We need a way to ask "how does the output change if I nudge only this one input, leaving everything else alone?" That's a partial derivative.
Stack all the partial derivatives into a vector and you get the gradient:

$$\nabla f(x_1, \dots, x_n) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)$$
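You can estimate a gradient the same way as a one-variable derivative: nudge one coordinate at a time, holding the rest fixed, so each loop iteration below computes one partial derivative. A minimal sketch, using a made-up two-variable function as the example:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Estimate each partial derivative by nudging one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        grad[i] = (f(xp) - f(xm)) / (2 * h)
    return grad

# Example: f(x, y) = x**2 + 3*y, so the gradient is (2x, 3).
f = lambda v: v[0]**2 + 3 * v[1]
print(numerical_gradient(f, np.array([2.0, -1.0])))   # ~[4.0, 3.0]
```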
The chain rule: the engine behind backpropagation
Most interesting functions are compositions: functions inside functions. The chain rule tells you how to differentiate a composition. It's the single most important formula in this guide.
If $y = f(u)$ and $u = g(x)$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
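It's worth checking this once on a concrete composition. A small sketch that applies the chain rule to $y = \sin(x^2)$ (chosen only as an example) and compares it against a finite-difference estimate:

```python
import numpy as np

# Composition: y = sin(x**2), i.e. f(u) = sin(u) with u = g(x) = x**2.
x = 1.3
u = x**2
dy_du = np.cos(u)      # f'(u)
du_dx = 2 * x          # g'(x)
chain = dy_du * du_dx  # chain rule: dy/dx = dy/du * du/dx

# Check against a finite-difference estimate of the same derivative.
h = 1e-6
numeric = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)
print(chain, numeric)   # the two numbers agree to several decimal places
```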
The chain rule generalizes to any depth of composition; that's why it works for deep networks. If you stack $n$ functions, so that $y = f_n(f_{n-1}(\cdots f_1(x) \cdots))$, the derivative is a product of $n$ smaller derivatives, one per link:

$$\frac{dy}{dx} = \frac{dy}{du_{n-1}} \cdot \frac{du_{n-1}}{du_{n-2}} \cdots \frac{du_1}{dx}, \qquad u_i = f_i(u_{i-1}), \; u_0 = x$$
Pick a concrete value of x and compute each chain link's factor one at a time; the full derivative is just their product.
The chain rule lets us differentiate compositions one step at a time. Backpropagation is just this rule applied across many layers in sequence.
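Here is that idea in miniature: a chain of three toy scalar functions (chosen arbitrarily), a forward pass that remembers each intermediate input, and a backward pass that multiplies the local derivatives in reverse order. This is a sketch of the principle, not a real framework.

```python
import numpy as np

# A "network" that is just a chain of scalar functions with known derivatives.
# Each pair is (function, derivative); these particular functions are only examples.
layers = [
    (lambda x: 3 * x, lambda x: 3.0),
    (np.tanh,         lambda x: 1 - np.tanh(x)**2),
    (lambda x: x**2,  lambda x: 2 * x),
]

x = 0.5

# Forward pass: apply each function in turn, remembering every input.
inputs = []
a = x
for f, _ in layers:
    inputs.append(a)
    a = f(a)

# Backward pass: multiply the local derivatives in reverse order (the chain rule).
grad = 1.0
for (f, df), a_in in zip(reversed(layers), reversed(inputs)):
    grad *= df(a_in)

print(a, grad)   # output of the chain and d(output)/dx
```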
The vector chain rule: one step harder, no scarier
When the function takes a vector in and produces a vector out, the derivative is no longer a single number; it's a matrix called the Jacobian. Same idea, more bookkeeping:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}$$
And the chain rule becomes a matrix product: if $\mathbf{y} = g(\mathbf{x})$ and $\mathbf{z} = f(\mathbf{y})$, then

$$\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \, \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
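A quick numerical check makes this believable. The sketch below invents two small vector functions purely as examples, estimates their Jacobians with finite differences, and confirms that the product of the two Jacobians matches the Jacobian of the composition:

```python
import numpy as np

# y = g(x) and z = f(y), both mapping R^2 -> R^2, chosen only as examples.
def g(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def f(y):
    return np.array([np.sin(y[0]), y[0] * y[1]])

def jacobian(func, v, h=1e-6):
    """Numerical Jacobian: one column of partial derivatives per input."""
    out = func(v)
    J = np.zeros((len(out), len(v)))
    for j in range(len(v)):
        vp, vm = v.copy(), v.copy()
        vp[j] += h
        vm[j] -= h
        J[:, j] = (func(vp) - func(vm)) / (2 * h)
    return J

x = np.array([0.7, -1.2])
Jg = jacobian(g, x)                           # dy/dx
Jf = jacobian(f, g(x))                        # dz/dy, evaluated at y = g(x)
J_composed = jacobian(lambda v: f(g(v)), x)   # dz/dx, computed directly

print(np.allclose(Jf @ Jg, J_composed, atol=1e-4))   # True: matrix product = chain rule
```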