Neural networks, one neuron at a time
A neural network is a stack of linear layers separated by non-linear activation functions. That's it. The richness comes from how those simple pieces compose. We'll build it up from a single neuron — no skipping.
A single neuron, end to end
One neuron does three things in order:
- Weighted sum. Take the input vector $x$, dot it with a weight vector $w$, add a bias $b$: $z = w \cdot x + b$.
- Activation. Squash the result through a non-linear function $\sigma$ (sigmoid, ReLU, tanh, etc.).
- Output. The squashed number $a = \sigma(z)$ is what the neuron passes on.
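The three steps above fit in a few lines of NumPy. A minimal sketch with sigmoid as the activation (the weights, bias, and input are made-up values for illustration):

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: weighted sum, then sigmoid activation."""
    z = np.dot(w, x) + b             # step 1: weighted sum z = w·x + b
    return 1.0 / (1.0 + np.exp(-z))  # step 2: sigmoid squashes z into (0, 1)

x = np.array([0.5, -1.2])  # input vector (hypothetical values)
w = np.array([2.0, 1.0])   # weight vector
b = -0.5                   # bias
print(neuron(x, w, b))     # step 3: the output, a single number in (0, 1)
```

Swapping sigmoid for ReLU or tanh only changes the last line of `neuron`; the weighted sum stays the same.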
Tune $w$ and $b$ in the demo below and watch the decision region shift:
The green line is where the output equals 0.5, which for a sigmoid is exactly where $w \cdot x + b = 0$. The neuron splits the plane in half: a single linear boundary.
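You can verify the boundary claim numerically: any point satisfying $w \cdot x + b = 0$ makes $z = 0$, and sigmoid of zero is exactly 0.5. A quick check with hypothetical weights:

```python
import numpy as np

w = np.array([2.0, -1.0])  # hypothetical weights
b = 0.5

# Pick a point on the line w·x + b = 0:  2*0 - 1*0.5 + 0.5 = 0
x_on_line = np.array([0.0, 0.5])
z = w @ x_on_line + b
print(1.0 / (1.0 + np.exp(-z)))  # sigmoid(0) = 0.5, right on the boundary
```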
A layer = many neurons in parallel
Stack neurons side by side. Each has its own weight vector. Pack their weight vectors as the rows of a matrix $W$, their biases as a vector $b$. The whole layer is one matrix-vector multiply followed by an element-wise activation: $a = \sigma(Wx + b)$.
For a whole batch of inputs (one per row of $X$), we usually transpose the convention so we can multiply once and process everyone in parallel: $A = \sigma(XW^\top + b)$, with the bias broadcast across rows.
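Both conventions can be sketched side by side. A small example (ReLU as the activation, random weights, shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: np.maximum(z, 0.0)  # ReLU, applied element-wise

# A layer of 3 neurons on 2-dimensional inputs:
W = rng.normal(size=(3, 2))  # one weight vector per row
b = rng.normal(size=3)

# Single input: matrix-vector multiply.
x = rng.normal(size=2)
a = sigma(W @ x + b)
print(a.shape)               # (3,) — one output per neuron

# A batch of 5 inputs, one per row of X: transpose convention,
# one matrix-matrix multiply processes everyone in parallel.
X = rng.normal(size=(5, 2))
A = sigma(X @ W.T + b)       # b broadcasts across the 5 rows
print(A.shape)               # (5, 3) — one output row per input row
```

Row $i$ of `A` is exactly what the single-input version produces on `X[i]`; the batch form just does all five at once.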
Why activations exist at all
Without a non-linear activation, stacking layers does literally nothing useful. Two linear maps composed are still a single linear map. Watch:
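The collapse is easy to demonstrate: multiply the weight matrices and fold the biases, and the "two-layer" network becomes one layer with identical outputs. A sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Two linear layers with no activation between them...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...collapse into a single linear layer:
W = W2 @ W1           # combined weights
b = W2 @ b1 + b2      # combined bias
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, the same algebra flattens them all into one.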
Compare any pair of activations side-by-side (solid line = function, dashed = derivative):
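The three activations mentioned above, with their derivatives, can be written out directly. A minimal reference table in code (derivatives written as functions of the pre-activation $z$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# name -> (function, derivative)
activations = {
    "sigmoid": (sigmoid,
                lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "tanh":    (np.tanh,
                lambda z: 1.0 - np.tanh(z) ** 2),
    "relu":    (lambda z: np.maximum(z, 0.0),
                lambda z: (np.asarray(z) > 0).astype(float)),
}

for name, (f, df) in activations.items():
    print(f"{name}: f(0) = {f(0.0)}, f'(0) = {df(0.0)}")
```

At $z = 0$ the contrast is already visible: sigmoid's slope peaks at 0.25, tanh's at 1, while ReLU's derivative jumps between 0 and 1.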
Stacking layers — a deep network
Chain layers together. Each layer takes the previous layer's output as its input. We use a superscript in parentheses to label the layer index: $a^{(l)} = \sigma\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$, with $a^{(0)} = x$.
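That recurrence is a short loop in code. A sketch of the full forward pass (ReLU between layers, a linear final layer, and made-up layer widths):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, params):
    """params: list of (W, b) pairs, one per layer.
    ReLU between layers; the last layer is left linear."""
    a = x                                  # a^(0) = x
    for l, (W, b) in enumerate(params):
        z = W @ a + b                      # z^(l) = W^(l) a^(l-1) + b^(l)
        a = relu(z) if l < len(params) - 1 else z
    return a

rng = np.random.default_rng(2)
sizes = [3, 5, 4, 1]                       # layer widths (hypothetical)
params = [(rng.normal(size=(n, m)), rng.normal(size=n))
          for m, n in zip(sizes[:-1], sizes[1:])]

y = forward(rng.normal(size=3), params)
print(y.shape)                             # (1,) — a single prediction
```

Leaving the last layer linear is a common choice for regression; classification networks typically end in a sigmoid or softmax instead.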
Loss — measuring how wrong the network is
To train a network we need a single number that says "this prediction was bad." That number is the loss (or "cost"). We pick a loss function based on what kind of problem we're solving:
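Two standard examples, one per problem type: mean squared error for regression and binary cross-entropy for two-class classification. A minimal sketch (the sample numbers are made up):

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: the usual loss for regression."""
    return np.mean((y_pred - y_true) ** 2)

def binary_cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy: the usual loss for two-class classification.
    p is the predicted probability of class 1; eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(mse(np.array([0.5, 2.0]), np.array([1.0, 2.0])))               # 0.125
print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1.0, 0.0])))
```

Either way the output is one scalar: a perfect prediction drives it to zero, and training is the business of pushing it down.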