Overview
Neural networks extend linear classifiers by stacking layers of nonlinear transformations. This chapter develops the perceptron model, explains why single-layer networks fail on XOR, introduces multi-layer perceptrons, and derives the four-step backpropagation algorithm from first principles.
You Will Learn
- The perceptron as a weighted sum passed through an activation function
- Why linear classifiers cannot solve XOR and how hidden layers fix this
- Random weight initialisation and the symmetry-breaking argument
- The four steps of backpropagation derived from the chain rule
- How varying hidden-layer width affects capacity and generalisation
Main Content
The Perceptron Model
A perceptron receives inputs x₁, x₂, …, xₙ, multiplies each by a learned weight, sums the results, adds a bias term, and passes the total through an activation function: output = σ(w₁x₁ + w₂x₂ + … + wₙxₙ + b). With a sigmoid activation σ(z) = 1 / (1 + e⁻ᶻ), the output lies in (0, 1) and can be interpreted as a probability. A single perceptron is equivalent to logistic regression — it can separate any two classes that are linearly separable, but nothing more complex.
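The weighted-sum-plus-activation computation can be sketched in a few lines of numpy; the input, weight, and bias values here are illustrative, not taken from the chapter.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias, squashed through the sigmoid.
    return sigmoid(np.dot(w, x) + b)

# Hand-picked illustrative values: 2 inputs, 2 weights, 1 bias.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1
print(perceptron(x, w, b))  # a value in (0, 1)
```

Because the output passes through the sigmoid, it always lands in (0, 1), matching the probabilistic reading above.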
The XOR Problem and Hidden Layers
XOR has four points: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. No single straight line can separate the 1s from the 0s. This was famously proven by Minsky and Papert in 1969 and stalled neural network research for over a decade. The fix is to add a hidden layer: the hidden neurons first transform the input space into a new representation where the classes become linearly separable, and then the output neuron draws a straight boundary in that transformed space. With just two hidden neurons and sigmoid activations, a network can solve XOR perfectly.
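The impossibility claim can be checked by brute force. This sketch sweeps a grid of candidate lines w₁x₁ + w₂x₂ + b = 0 (the grid bounds and resolution are arbitrary choices) and records the best number of XOR points any line classifies correctly.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

best = 0
grid = np.linspace(-2, 2, 21)  # candidate values for w1, w2, b
for w1 in grid:
    for w2 in grid:
        for b in grid:
            # Classify each point by which side of the line it falls on.
            pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
            best = max(best, int((pred == y).sum()))
print("best accuracy over the grid:", best, "/ 4")  # never reaches 4
```

No linear boundary exceeds 3 of 4: requiring w₂ + b > 0 and w₁ + b > 0 while b ≤ 0 forces w₁ + w₂ + b > 0, which misclassifies (1,1).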
Why Random Initialisation Matters
If all weights are initialised to zero, every neuron in a layer computes the same output, receives the same gradient, and updates identically — they remain copies of each other forever. This is called the symmetry problem. Random initialisation breaks the symmetry so that neurons can specialise. The scale matters too: weights drawn from a distribution with standard deviation ≈ 1/√(fan_in) keep activations and gradients in a reasonable range for sigmoid networks, preventing both saturation (outputs stuck near 0 or 1 where gradients vanish) and explosion.
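The symmetry problem can be made concrete with a small sketch (input and seed are illustrative): with zero weights both hidden neurons emit the same activation for any input, so their gradients are identical too, while random initialisation immediately distinguishes them.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.7, -1.3])

# Zero initialisation: both hidden neurons compute sigmoid(0) = 0.5.
W_zero = np.zeros((2, 2))
b_zero = np.zeros(2)
h0 = sigmoid(W_zero @ x + b_zero)
print("zero init, hidden outputs:", h0)    # identical

# Random initialisation at scale 1/sqrt(fan_in): the neurons differ.
rng = np.random.default_rng(0)
W_rand = rng.standard_normal((2, 2)) / np.sqrt(2)
h1 = sigmoid(W_rand @ x + b_zero)
print("random init, hidden outputs:", h1)  # two distinct values
```

Since backprop's hidden deltas depend on the hidden activations and incoming weights, identical neurons receive identical updates and stay identical forever; the random case breaks that tie.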
Backpropagation in Four Steps
Backpropagation is just the chain rule applied systematically. For each training example:
1. Output deltas: compute δₖ = (oₖ − tₖ) · σ′(netₖ), where oₖ is the output, tₖ is the target, and σ′ is the derivative of the activation (for sigmoid, σ′ = σ(1 − σ)).
2. Hidden deltas: propagate the error backward via δⱼ = (Σₖ δₖ · wⱼₖ) · σ′(netⱼ), where wⱼₖ is the weight connecting hidden neuron j to output neuron k.
3. Update output weights: wⱼₖ ← wⱼₖ − η · δₖ · hⱼ, where hⱼ is hidden neuron j's output and η is the learning rate.
4. Update hidden weights: wᵢⱼ ← wᵢⱼ − η · δⱼ · xᵢ.
Repeat for every training example.
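The four steps can be written out directly in numpy. This is a sketch, not a reference implementation: the 2-4-1 architecture, squared-error loss, seed, learning rate, and epoch count are all illustrative choices for training on XOR.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 2)) / np.sqrt(2)  # input -> hidden
b1 = np.zeros(4)
W2 = rng.standard_normal((1, 4)) / np.sqrt(4)  # hidden -> output
b2 = np.zeros(1)
eta = 1.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

for epoch in range(10000):
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x + b1)                  # forward pass
        o = sigmoid(W2 @ h + b2)
        delta_o = (o - t) * o * (1 - o)           # Step 1: output delta
        delta_h = (W2.T @ delta_o) * h * (1 - h)  # Step 2: hidden deltas
        W2 -= eta * np.outer(delta_o, h)          # Step 3: output weights
        b2 -= eta * delta_o
        W1 -= eta * np.outer(delta_h, x)          # Step 4: hidden weights
        b1 -= eta * delta_h

preds = [sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)[0] for x in X]
print(["%.2f" % p for p in preds])
```

Note that σ′(netₖ) is computed as o · (1 − o) from the stored activation, using the sigmoid-derivative identity from Step 1, so the pre-activations never need to be kept around.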
Network Width and the Capacity–Generalisation Trade-off
A network with more hidden neurons has greater capacity — it can represent more complex decision boundaries. On Iris, training MLPs with 1, 2, 4, 8, 16 and 32 hidden neurons reveals a pattern: too few neurons underfit (the boundary is too simple), a moderate number fits well, and too many neurons risk overfitting (the boundary becomes unnecessarily wiggly). The sweet spot depends on the complexity of the data. Monitoring validation loss alongside training loss is the practical way to detect when you have crossed from useful capacity into memorisation.
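One quick way to see capacity grow with width is to count parameters. This sketch assumes the Iris setup above (4 features, 3 classes) and a single hidden layer; note that by h = 32 the parameter count already exceeds Iris's 150 training examples, which is one reason validation monitoring matters.

```python
def n_params(h, n_in=4, n_out=3):
    # Weights and biases for input->hidden plus hidden->output.
    return (n_in * h + h) + (h * n_out + n_out)

for h in [1, 2, 4, 8, 16, 32]:
    print(f"width {h:2d}: {n_params(h)} parameters")
```

For this shape the count is 8h + 3, linear in width; deeper or wider-output networks grow faster, but the underfit/overfit reasoning is the same.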
Examples
Sigmoid Activation and Its Derivative
Compute the sigmoid output and its gradient for backpropagation.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)
z = np.array([-2, 0, 2])
print("sigmoid:", sigmoid(z))
print("derivative:", sigmoid_derivative(z))
Manual Forward Pass for XOR
A 2-input, 2-hidden, 1-output network forward pass.
import numpy as np
W_hidden = np.array([[5.0, 5.0], [5.0, 5.0]])  # both hidden neurons see both inputs
b_hidden = np.array([-2.0, -7.0])              # neuron 1 acts like OR, neuron 2 like AND
W_out = np.array([[10.0, -10.0]])              # output computes OR AND (NOT AND) = XOR
b_out = np.array([-5.0])
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    x = np.array(x)
    hidden = sigmoid(W_hidden @ x + b_hidden)
    output = sigmoid(W_out @ hidden + b_out)
    print(f"Input {x} -> Output {output[0]:.3f}")
Common Mistakes
Initialising all weights to zero
Why: Neurons stay identical throughout training due to symmetry — the network effectively has one neuron per layer.
Fix: Use random initialisation with appropriate scale, e.g. np.random.randn(fan_in, fan_out) * np.sqrt(1/fan_in).
Using a very large learning rate with sigmoid activations
Why: Weights grow large, activations saturate near 0 or 1, and gradients vanish — training stalls.
Fix: Start with a small learning rate (0.01–0.1 for sigmoid) and increase carefully.
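Saturation is easy to see numerically: the sigmoid's gradient σ′(z) = σ(z)(1 − σ(z)) collapses once pre-activations grow large, so weight updates, which are proportional to it, effectively stop. A small illustrative check:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Gradient of the sigmoid at increasingly large pre-activations.
for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(f"z = {z:4.1f}  sigmoid' = {s * (1 - s):.6f}")
```

The gradient peaks at 0.25 when z = 0 and is already below 10⁻⁴ by z = 10, which is why an over-aggressive learning rate that inflates the weights can stall training entirely.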
Mini Exercises
1. Walk through one full forward + backward pass for a 2-input, 2-hidden, 1-output XOR network with concrete weight values. What are the three δ values (one per hidden neuron, one for the output)?
2. Why does a network with 1 hidden neuron fail on Iris (3 classes) while 4 neurons succeed?