THEORY

Logistic Regression — Drawing Lines Between Classes

The sigmoid function, decision boundaries, binary cross-entropy loss, gradient descent for classification, one-vs-all multiclass, and why linear classifiers break on XOR.

Overview

This chapter introduces logistic regression as a probabilistic classifier, develops the binary cross-entropy loss, explains decision boundaries and multi-class extensions, and exposes the fundamental limitation of linear classifiers via the XOR problem.

You Will Learn

  • How logistic regression maps linear scores to probabilities with the sigmoid
  • Binary cross-entropy loss and why it strongly penalises confident mistakes
  • How to interpret and visualise decision boundaries in feature space
  • Multi-class extensions via one-vs-all and softmax regression
  • Why XOR cannot be solved by any linear classifier and what that implies

Main Content

From Linear Scores to Probabilities

Linear regression outputs unbounded real numbers, which are unsuitable as probabilities. Logistic regression instead models P(y = 1 | x) by passing a linear score z = wᵀx + b through the sigmoid σ(z) = 1 / (1 + exp(−z)). This squashing function maps ℝ to (0, 1) and is monotonically increasing, so higher scores correspond to higher probabilities. The model is still linear in the input space, but the output is now interpretable as a probability, which is essential for calibrated decision-making.
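
As a minimal sketch of this pipeline (the weights and bias here are hypothetical, chosen only for illustration), scoring a single example looks like this:

```python
import numpy as np

def sigmoid(z):
    # Map a real-valued score to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters for a 2-feature model (illustrative, not learned)
w = np.array([2.0, -1.0])
b = 0.5

x = np.array([1.0, 3.0])   # one input example
z = w @ x + b              # linear score wᵀx + b = -0.5
p = sigmoid(z)             # P(y = 1 | x) ≈ 0.378
```

Because the sigmoid is monotonic, thresholding p at 0.5 is equivalent to thresholding the raw score z at 0.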

Binary Cross-Entropy Loss

Given predicted probability p_i = P(y_i = 1 | x_i) and true label y_i ∈ {0, 1}, the binary cross-entropy loss for a single example is ℓ_i = −[y_i log p_i + (1 − y_i) log(1 − p_i)]. For a dataset it becomes J(w, b) = (1/n) Σ ℓ_i. This loss arises from maximum likelihood under a Bernoulli model and has a crucial property: it punishes confident errors much more heavily than uncertain ones. Predicting p = 0.01 when y = 1 produces a very large loss, which pushes gradients to correct extreme miscalibrations aggressively.
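
The asymmetry of the penalty is easy to verify numerically. A short sketch comparing an uncertain mistake with a confident one (for a positive example, y = 1):

```python
import numpy as np

def bce(y, p):
    # Binary cross-entropy for one example with true label y and predicted p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

loss_uncertain = bce(1, 0.4)    # wrong side of 0.5, but not confident: ~0.92
loss_confident = bce(1, 0.01)   # confidently wrong: ~4.61, five times larger
loss_correct   = bce(1, 0.99)   # confidently right: ~0.01, near zero
```

The loss grows without bound as p → 0 with y = 1, which is exactly what drives the aggressive gradient correction described above.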

Decision Boundaries

In logistic regression, the decision rule for balanced classes is typically ŷ = 1 if p ≥ 0.5 and ŷ = 0 otherwise. The boundary p = 0.5 corresponds to z = 0, i.e., wᵀx + b = 0. This is a hyperplane in feature space: a line in 2D, a plane in 3D. Visualising this boundary on simple 2D problems (e.g., two Iris classes with two features) yields an immediate geometric interpretation of what the model is doing: it is carving the feature space into two half-spaces.
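
A small sketch of this geometry, using a hypothetical 2D model whose boundary is the line x₁ + x₂ − 1 = 0 (the weights are chosen by hand, not fitted):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical model: decision boundary is the line x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

points = np.array([[0.2, 0.3],   # below the line -> z < 0 -> p < 0.5
                   [0.5, 0.5],   # exactly on it  -> z = 0 -> p = 0.5
                   [0.9, 0.8]])  # above the line -> z > 0 -> p > 0.5
z = points @ w + b
p = sigmoid(z)
y_hat = (p >= 0.5).astype(int)   # [0, 1, 1]
```

Each half-space gets one label, and the predicted probability moves away from 0.5 with distance from the boundary.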

Multi-Class Extensions

Multi-class classification with K > 2 classes can be handled by training K independent binary classifiers (one-vs-all) or by using softmax regression. Softmax models P(y = k | x) = exp(w_kᵀx + b_k) / Σ_j exp(w_jᵀx + b_j). The corresponding cross-entropy loss generalises binary cross-entropy and encourages the correct class’s logit to dominate. While logistic regression and softmax are linear in x, stacking them with non-linear feature maps yields the building blocks of deep networks.
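
A sketch of the softmax computation on hypothetical per-class logits (the max is subtracted before exponentiating, a standard trick that leaves the result unchanged but avoids overflow):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is for numerical stability; the output is identical
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical class scores w_kᵀx + b_k
p = softmax(logits)                  # a valid distribution: sums to 1
```

The largest logit gets the largest probability, and cross-entropy training pushes the correct class's logit to dominate the others.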

The XOR Limitation

The XOR dataset consists of four points: (0,0) and (1,1) with label 0, and (0,1) and (1,0) with label 1. No single line can separate the positive and negative points: any line that separates one positive from a negative will misclassify the other pair. Logistic regression therefore converges to the loss-minimising solution, which by the symmetry of the data is the uniform predictor p = 0.5; it cannot achieve zero training error, and accuracy plateaus at 50%. This exposes a fundamental limit of all linear classifiers and motivates the need for non-linear feature transforms or multi-layer networks.
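
This plateau can be demonstrated directly. A sketch that runs plain batch gradient descent on the four XOR points (learning rate and iteration count are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# The four XOR points and their labels
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)

# Batch gradient descent on mean binary cross-entropy
rng = np.random.default_rng(0)
w, b = rng.normal(size=2), 0.0
for _ in range(5000):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)   # gradient of the mean BCE w.r.t. w
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

p = sigmoid(X @ w + b)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
# loss approaches log 2 ≈ 0.693 and every p approaches 0.5:
# the best a linear model can do on XOR
```

The objective is convex, so this is not a bad local minimum: the uniform predictor really is the global optimum for a linear model on this data.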

Examples

Logistic Sigmoid and Decision Rule

Compute probabilities and classify based on the 0.5 threshold.

import numpy as np

def sigmoid(z):
    # Map a real-valued score to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# Evaluate the sigmoid at evenly spaced scores and apply the 0.5 threshold
z = np.linspace(-6, 6, 13)
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)
for zi, pi, yi in zip(z, probs, labels):
    print(f"z={zi:5.2f}, p={pi:5.2f}, y_hat={yi}")

Common Mistakes

Optimising logistic regression with MSE instead of cross-entropy

Why: Paired with a sigmoid output, MSE yields a non-convex objective whose gradients vanish for confidently wrong predictions, leading to slower learning and poor calibration.

Fix: Use cross-entropy (binary or softmax) derived from maximum likelihood for classification tasks.

Assuming linear decision boundaries can solve any classification problem

Why: Non-linearly separable problems like XOR cannot be solved by any linear classifier regardless of optimisation.

Fix: Apply feature engineering (e.g., polynomial features) or use non-linear models such as kernel methods or neural networks.
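
As a sketch of the feature-engineering fix: adding the product feature x₁·x₂ makes XOR linearly separable. The weights below are hand-picked for illustration, not learned:

```python
import numpy as np

# XOR points augmented with the product feature x1 * x2
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])  # features: x1, x2, x1*x2

# Hand-picked weights on the augmented features (illustrative, not learned)
w = np.array([1.0, 1.0, -2.0])
b = -0.5
y_hat = (X_aug @ w + b >= 0).astype(int)  # all four points classified correctly
```

A linear boundary in the three-dimensional feature space corresponds to a non-linear boundary in the original two dimensions, which is the same mechanism that multi-layer networks exploit with learned features.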

Mini Exercises

1. Write down the gradient of the binary cross-entropy loss with respect to the weights in logistic regression.

2. Explain why logistic regression cannot solve XOR without feature transformations.

Further Reading