PyTorch I — Tensors, Matmul & Broadcasting

Tensor creation, matrix multiplication with torch.matmul, element-wise ops, broadcasting rules, reshape/transpose, in-place operations, and mean aggregations.

Overview

PyTorch tensors are the building blocks of deep learning: multi-dimensional arrays that run on CPU or GPU and support automatic differentiation. This chapter covers tensor creation, dtype and device, shape semantics, broadcasting rules, matrix multiplication, reshape/view/transpose/permute, and autograd basics.

You Will Learn

  • Tensor creation: from lists, NumPy, zeros, ones, randn, arange
  • dtype and device (CPU vs CUDA)
  • Shape semantics and common bugs
  • Broadcasting rules with clear examples
  • Matrix multiplication (matmul) rules and examples
  • reshape, view, transpose, permute
  • autograd: requires_grad, backward()

Main Content

Tensors: Creation and Properties

Create tensors with torch.tensor(), torch.zeros(), torch.ones(), torch.randn(), torch.arange(). Every tensor has .shape, .dtype (float32, int64, etc.), and .device (cpu or cuda). Check these constantly — shape mismatches are the #1 source of bugs in deep learning.

Shape Semantics

Convention: (batch, features) for 2D, (batch, channels, height, width) for images. A design matrix X has shape (n_samples, n_features). A batch of images has shape (B, C, H, W). A linear layer expects input of shape (batch, in_features) and stores its weight as (out_features, in_features); the output is (batch, out_features).
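These shape conventions can be checked directly; a minimal sketch (the sizes 16, 64, 32 are arbitrary):

```python
import torch

batch, in_f, out_f = 16, 64, 32
x = torch.randn(batch, in_f)          # (batch, in_features)

layer = torch.nn.Linear(in_f, out_f)
print(layer.weight.shape)             # (out_features, in_features)
y = layer(x)
print(y.shape)                        # (batch, out_features)
```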

Broadcasting

Dimensions are compared from right to left. They are compatible if equal or one is 1. (3, 4) and (4,) → (3, 4). (3, 1) and (1, 5) → (3, 5). Example: subtract a mean vector from a batch: x - x.mean(dim=0) broadcasts the mean across the batch.

Matrix Multiplication

torch.matmul(A, B) or A @ B. For 2D: (m, k) @ (k, n) → (m, n). For batches: (b, m, k) @ (b, k, n) → (b, m, n). Element-wise * is the Hadamard product — same shape required. A linear layer does y = x @ W.T + b.
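The batched case can be verified in a few lines; the sizes below are illustrative:

```python
import torch

A = torch.randn(4, 3, 5)   # (b, m, k)
B = torch.randn(4, 5, 2)   # (b, k, n)
C = A @ B                  # matmul over the last two dims, batch carried through
print(C.shape)             # (b, m, n) = (4, 3, 2)
```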

Reshape, View, Transpose, Permute

view requires a contiguous tensor and never copies; reshape returns a view when possible and copies otherwise. squeeze() removes dims of size 1; unsqueeze(dim) adds one. transpose(dim0, dim1) swaps two dimensions. permute(dims) reorders all dimensions — e.g., (B, H, W, C) → (B, C, H, W) for conv layers. Note that transpose and permute produce non-contiguous tensors, so a following view may fail; use reshape or .contiguous() instead.
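A quick sketch of these operations (the image sizes are arbitrary):

```python
import torch

imgs = torch.randn(8, 28, 28, 3)     # (B, H, W, C), channels-last
chw = imgs.permute(0, 3, 1, 2)       # (B, C, H, W) for conv layers
flat = imgs.reshape(8, -1)           # (8, 28*28*3) = (8, 2352)
col = torch.randn(5).unsqueeze(1)    # (5,) -> (5, 1)

# chw.view(8, -1) would raise: permute made chw non-contiguous.
# reshape (or .contiguous().view(...)) handles that case.
print(chw.shape, flat.shape, col.shape)
```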

Autograd Basics

Set requires_grad=True on tensors you want to differentiate. Operations build a computation graph. Call .backward() on a scalar loss to compute gradients. Access gradients via .grad. Use torch.no_grad() when you don't need gradients (e.g., validation).
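torch.no_grad() can be sketched as follows; inside the context, no graph is recorded:

```python
import torch

w = torch.randn(3, requires_grad=True)

with torch.no_grad():
    pred = w * 2          # computed without building a graph
print(pred.requires_grad)  # False: no gradients flow through pred
```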

Examples

Tensor Creation and Shape

Create tensors and inspect properties.

import torch
x = torch.randn(3, 4)
print(x.shape)   # torch.Size([3, 4])
print(x.dtype)   # torch.float32
print(x.device)  # cpu

Broadcasting

Subtract per-feature mean from a batch.

import torch
X = torch.randn(32, 10)  # 32 samples, 10 features
mean = X.mean(dim=0)     # (10,)
X_centered = X - mean    # (32,10) - (10,) broadcasts to (32,10)

Matrix Multiplication

Linear transformation: y = x @ W.T, with x of shape (batch, in_features) and W of shape (out_features, in_features).

import torch
batch, in_f, out_f = 8, 64, 32
x = torch.randn(batch, in_f)
W = torch.randn(out_f, in_f)
y = x @ W.T  # (8, 64) @ (64, 32) -> (8, 32)

Autograd

Compute gradients for a simple loss.

import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
loss = y.sum()
loss.backward()
print(x.grad)  # tensor([2., 4., 6.])

Common Mistakes

Using * instead of @ for matrix multiplication

Why: * is element-wise; you get shape errors or wrong results.

Fix: Use torch.matmul or @ for matrix multiplication.
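A minimal demonstration of the difference:

```python
import torch

A = torch.ones(2, 3)
B = torch.ones(3, 2)

matmul = A @ B      # (2, 3) @ (3, 2) -> (2, 2): true matrix product
hadamard = A * A    # element-wise: shapes must match (or broadcast)

try:
    A * B           # (2, 3) * (3, 2): not broadcastable
except RuntimeError:
    print("element-wise * rejects incompatible shapes")
```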

Broadcasting producing wrong shapes silently

Why: size-1 dimensions expand silently; e.g., subtracting a (3, 1) tensor from a (3,) tensor broadcasts to (3, 3) instead of raising an error.

Fix: Check .shape after every operation; use unsqueeze explicitly when needed.
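A sketch of the silent-broadcast trap and the explicit fix:

```python
import torch

a = torch.arange(3.0)                 # shape (3,)
b = torch.arange(3.0).reshape(3, 1)   # shape (3, 1)

c = a - b                 # silently broadcasts to (3, 3)!
print(c.shape)            # torch.Size([3, 3])

d = a.unsqueeze(1) - b    # explicit (3, 1) - (3, 1) -> (3, 1)
print(d.shape)            # torch.Size([3, 1])
```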

In-place operations breaking autograd

Why: x.add_(1) modifies x in place; autograd may still need the original value for backward, and an in-place op on a leaf tensor with requires_grad=True raises a RuntimeError outright.

Fix: Avoid in-place ops on tensors with requires_grad=True; use x = x + 1.
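A minimal sketch of the failure and the out-of-place fix:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

try:
    x.add_(1)          # in-place op on a leaf that requires grad
except RuntimeError as e:
    print("in-place on a grad leaf fails:", e)

x2 = x + 1             # out-of-place: builds a proper graph node
print(x2.requires_grad)  # True
```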

Mini Exercises

1. What is the output shape of torch.randn(5, 3) @ torch.randn(3, 7)?

2. Given x of shape (3, 4), write one line to add a batch dimension so it becomes (1, 3, 4).

3. Why does loss.backward() require loss to be a scalar?

Further Reading