Overview
Python is the lingua franca of machine learning. This chapter covers data types, collections (lists, dicts, sets, tuples) with ML-oriented examples, NumPy array basics and shapes, functions and comprehensions for dataset prep, reading CSV, basic data cleaning, and a simple train/test split.
You Will Learn
- Data types (int, float, str, bool) and their use in ML code
- Lists, dicts, sets, tuples with ML examples (config dicts, class labels, tensor shapes)
- NumPy arrays: creation, shapes, indexing
- Functions and comprehensions for dataset preparation
- Reading CSV, basic data cleaning, train/test split
Main Content
Data Types in ML
Integers store counts (num_epochs, batch_size). Floats store hyperparameters (learning_rate, dropout). Strings store names (optimizer='Adam', model_path). Booleans toggle behavior (is_training=True). These four types appear in every ML script.
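These four types might appear together like this (variable names and values are illustrative, not from any particular framework):

```python
# Typical hyperparameter variables in an ML script
num_epochs = 100          # int: a count
learning_rate = 0.001     # float: a continuous hyperparameter
optimizer = "Adam"        # str: a name
is_training = True        # bool: a behavior toggle

print(type(num_epochs), type(learning_rate), type(optimizer), type(is_training))
```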
Lists, Dicts, Sets, Tuples
A list is an ordered, mutable sequence — e.g., a column of feature values. A tuple is immutable, ideal for fixed shapes like (28, 28, 1). A dict maps keys to values: config = {'lr': 0.01, 'epochs': 100} bundles hyperparameters. A set stores unique elements — useful for distinct class labels: set(y_train) gives you all classes.
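A minimal sketch putting all four collections side by side (the label values are made up for illustration):

```python
feature_values = [0.5, 1.2, 0.8]         # list: ordered, mutable column of features
input_shape = (28, 28, 1)                # tuple: immutable tensor shape
config = {"lr": 0.01, "epochs": 100}     # dict: hyperparameter bundle
y_train = ["cat", "dog", "cat", "bird"]
classes = set(y_train)                   # set: the distinct class labels
# classes is {'cat', 'dog', 'bird'} — sets have no guaranteed order
```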
NumPy Arrays
NumPy arrays are the foundation of numerical Python. Create with np.array(), np.zeros(), np.ones(), np.arange(). The .shape attribute is critical — (n_samples, n_features) for a design matrix. Indexing and slicing work like lists but can operate on multiple dimensions. Vectorization (array operations without loops) is essential for speed.
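The creation functions, `.shape`, multi-dimensional slicing, and vectorization mentioned above can be sketched as:

```python
import numpy as np

X = np.zeros((4, 3))              # 4 samples, 3 features, all zeros
idx = np.arange(4)                # [0 1 2 3]
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
col0 = A[:, 0]                    # slice the first feature column: [1. 3.]
doubled = A * 2                   # vectorized: every element, no Python loop
print(X.shape)                    # (4, 3)
```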
Functions and Comprehensions
Functions encapsulate logic: def normalize(x): return (x - x.mean()) / x.std(). List comprehensions build lists concisely: [x**2 for x in features]. Dict comprehensions: {k: v*2 for k, v in config.items()}. These patterns appear constantly in data preprocessing.
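The `normalize` function and the dict comprehension above, written out runnably (the sample values are arbitrary):

```python
import numpy as np

def normalize(x):
    """Scale an array to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 8.0])
z = normalize(x)                  # z.mean() is ~0.0, z.std() is ~1.0

config = {"lr": 0.01, "epochs": 100}
doubled = {k: v * 2 for k, v in config.items()}
```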
Reading CSV and Data Cleaning
Use pandas: df = pd.read_csv('data.csv'). Inspect with df.head(), df.info(), df.describe(). Handle missing values: df.dropna() or df.fillna(value). Filter invalid rows: df[df['age'] > 0]. Convert types: df['date'] = pd.to_datetime(df['date']).
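A self-contained sketch of the cleaning steps, using a small inline CSV in place of `data.csv` (the columns and values are invented for illustration):

```python
import io
import pandas as pd

# Inline CSV standing in for a file on disk; row 3 has a missing income,
# row 2 has an invalid (negative) age.
csv_text = "age,income\n25,50000\n-1,62000\n31,\n40,70000\n"
df = pd.read_csv(io.StringIO(csv_text))

df = df.dropna(subset=["income"])   # drop rows with missing income
df = df[df["age"] > 0]              # drop rows with invalid ages
print(df.shape)                     # (2, 2) — two clean rows remain
```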
Train/Test Split
Never evaluate on data you trained on. Use sklearn: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). The random_state ensures reproducibility. For classification, use stratify=y to preserve class proportions.
Examples
Config Dict and List Comprehension
Hyperparameters as a dict; build a list of squared features.
config = {"lr": 0.001, "batch_size": 32, "epochs": 100}
features = [1.2, 3.4, 5.6, 7.8]
squared = [f**2 for f in features]
print(squared) # [1.44, 11.56, 31.36, 60.84] (floats may print with rounding artifacts)
NumPy Array and Shape
Create a design matrix and check its shape.
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6]])
print(X.shape) # (3, 2) — 3 samples, 2 features
print(X.mean(axis=0)) # [3. 4.] — mean per feature
CSV Load and Train/Test Split
Load a CSV, extract features and target, split.
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("data.csv")
X = df[["feat1", "feat2"]].values
y = df["target"].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
Common Mistakes
Using Python lists for large numerical computations
Why: Lists are slow; NumPy uses optimized C code.
Fix: Convert to np.array() and use vectorized operations.
Normalizing with global statistics before splitting
Why: Test set statistics can leak into training if you clean using global stats.
Fix: Split first, then compute normalization (mean, std) from the training set only and apply to both.
Forgetting random_state in train_test_split
Why: Results are not reproducible; each run gives different splits.
Fix: Always set random_state=42 (or another fixed value) for reproducibility.
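The split-first normalization fix described above can be sketched like this (synthetic data, since no real dataset is assumed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 samples, 3 features
y = rng.integers(0, 2, size=100)     # binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit normalization statistics on the training set ONLY...
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
# ...then apply the same statistics to both splits.
X_train_n = (X_train - mu) / sigma
X_test_n = (X_test - mu) / sigma
```

Note that `X_test_n` will generally not have exactly zero mean, and that is the point: the test set never influences the statistics.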
Mini Exercises
1. Write a function that takes a list of numbers and returns (mean, variance). Use only a loop and arithmetic.
2. Given a list of dicts (e.g., [{'valid': True, 'x': 1}, {'valid': False, 'x': 2}]), write a comprehension to filter only valid entries.
3. What does X.shape return for a 2D array with 100 samples and 5 features?