Ch 5. The mechanics of learning
The Core Idea of Machine Learning
- Machine learning follows the same logic Kepler used to fit planetary orbits:
- Collect data
- Choose a model
- Split for validation
- Train and adjust parameters
- Test and refine
✅ Modern machine learning automates what Kepler did manually, making modeling faster and scalable.
PyTorch & Automating Function Fitting
- Goal: Train models that can adapt to different tasks.
- PyTorch makes it easy to:
- Define models
- Compute gradients (derivatives)
- Optimize parameters automatically
✅ Neural networks reduce the need for hand-crafted model selection, making learning more flexible.
Learning is just Parameter Estimation

1. Overview of the Learning Process
- Given input data and ground truth outputs, we estimate model parameters iteratively.
- The model takes inputs and predicts outputs (forward pass).
- We compute the error (loss function) by comparing predicted vs. actual values.
- Using gradient descent, we update the parameters to minimize error (backward pass).
- This cycle repeats until the error is low enough on unseen data.
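This cycle can be sketched end to end in a few lines of PyTorch. The data and the one-parameter model below are made up for illustration (y = 3x), not part of the thermometer example that follows:

```python
import torch

# Toy data generated from y = 3x, so the ideal parameter is w = 3
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([3.0, 6.0, 9.0])

w = torch.tensor(0.0, requires_grad=True)  # start from a bad guess
for _ in range(200):
    y_pred = w * x                      # forward pass
    loss = ((y_pred - y) ** 2).mean()   # compare predicted vs. actual
    loss.backward()                     # backward pass: compute dloss/dw
    with torch.no_grad():
        w -= 0.05 * w.grad              # gradient descent step
        w.grad.zero_()                  # clear the gradient for the next cycle
print(w.item())  # ≈ 3.0
```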
A Practical Example: Calibrating a Thermometer
- We found a wall-mounted thermometer without units.
- We recorded temperature in Celsius (°C) and the thermometer’s readings (unknown units, t_u).
- Goal: Find a mathematical relationship between Celsius and the unknown units.
Dataset:
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
import torch

t_c = torch.tensor(t_c)
t_u = torch.tensor(t_u)
Plotting t_u vs. t_c reveals a linear trend with noise.

- This suggests a linear model might be appropriate. We assume a simple linear relationship between the two temperature scales:
$t_c = w \cdot t_u + b$
- w (weight): scaling factor.
- b (bias): offset term.
- Goal: find w and b such that predictions match actual Celsius temperatures.
Defining the Learning Task
- Find values for w and b that minimize error.
- We need a loss function:
- Measures how far predictions are from actual values.
- Guides optimization to improve parameter estimates.
Minimizing Loss: Defining the Loss Function in PyTorch
1. What Is a Loss Function?
- A loss function computes a single numerical value representing how well our model’s predictions match the actual values.
- Our goal is to minimize this loss during training.
2. Choosing a Loss Function
- The simplest loss functions compare predictions ( t_p ) and actual values ( t_c ):
- Absolute Difference: $ \text{Loss} = | t_p - t_c | $
- Squared Difference (Mean Squared Error, MSE): $ \text{Loss} = (t_p - t_c)^2 $
✅ Why choose squared difference?
- The squared loss penalizes large errors more than smaller ones.
- Its derivative is well-defined everywhere (unlike the absolute difference).
- It's convex, making it easier to optimize.
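A tiny check of the derivative claims using autograd on scalar tensors: the squared loss's gradient scales with the error, while the absolute loss's gradient has constant magnitude no matter how close we are to the target:

```python
import torch

for err in (5.0, 0.1):
    e = torch.tensor(err, requires_grad=True)
    (e ** 2).backward()
    sq_grad = e.grad.item()    # 2 * err: large errors get large gradients

    e2 = torch.tensor(err, requires_grad=True)
    e2.abs().backward()
    abs_grad = e2.grad.item()  # always magnitude 1 away from zero
    print(f"error={err}: squared grad={sq_grad}, absolute grad={abs_grad}")
```

At exactly zero the absolute value has no unique derivative, which is part of why the squared loss is the safer default.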
3. Implementing the Model in PyTorch
- We define the model as a simple linear function:
def model(t_u, w, b):
    return w * t_u + b
- t_u = input tensor (unknown temperature readings).
- w = weight (scaling factor).
- b = bias (offset).
4. Implementing the Loss Function
- We define the Mean Squared Error (MSE) loss function:
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c) ** 2
    return squared_diffs.mean()
✅ What this does:
- Computes element-wise squared differences between predicted and actual values.
- Averages the results to return a single scalar loss.
5. Initializing Parameters & Checking Loss
- We start with simple initial values for ( w ) and ( b ):
w = torch.ones(())
b = torch.zeros(())
- Compute predictions using the model function:
t_p = model(t_u, w, b)
print(t_p)
Output:
tensor([35.7000, 55.9000, 58.2000, 81.9000, 56.3000, 48.9000, 33.9000, 21.8000, 48.4000, 60.4000, 68.4000])
- Compute the initial loss:
loss = loss_fn(t_p, t_c)
print(loss)
Output:
tensor(1763.8846)
✅ We now have our initial loss! Next, we’ll optimize ( w ) and ( b ) to minimize this loss. 🚀
Gradient Descent in PyTorch
Gradient descent is an optimization algorithm that updates model parameters iteratively to minimize loss. PyTorch automates this process using automatic differentiation, but first, let's build an intuition for how it works manually.
1. Understanding Gradient Descent
Imagine adjusting knobs on a machine to minimize error. We tweak the parameters (weights w and bias b) and observe how the loss function changes.
✔ If turning a knob lowers the loss, keep going in that direction.
✔ If loss increases, reverse direction and adjust in smaller steps.
2. Computing Gradients Numerically
The gradient tells us how much the loss changes when we tweak each parameter.
A finite difference approximation estimates the gradient:
delta = 0.1
loss_rate_of_change_w = (
    loss_fn(model(t_u, w + delta, b), t_c) - loss_fn(model(t_u, w - delta, b), t_c)
) / (2.0 * delta)
✅ This method works but is inefficient for models with many parameters.
3. Computing Gradients Analytically
Instead of using finite differences, we use calculus (chain rule) to compute derivatives:

def dloss_fn(t_p, t_c):
    return 2 * (t_p - t_c) / t_p.size(0)  # Derivative of mean squared error w.r.t. t_p
def dmodel_dw(t_u, w, b):
    return t_u  # Derivative of linear model w.r.t. weight
def dmodel_db(t_u, w, b):
    return 1.0  # Derivative of linear model w.r.t. bias
def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = dloss_fn(t_p, t_c)
    dloss_dw = dloss_dtp * dmodel_dw(t_u, w, b)
    dloss_db = dloss_dtp * dmodel_db(t_u, w, b)
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])
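As a sanity check (not from the book's listing), the analytic gradient should agree with the finite-difference estimate from the previous section. The data and functions are restated so the snippet runs on its own:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

w, b = torch.tensor(1.0), torch.tensor(0.0)
analytic = grad_fn(t_u, t_c, model(t_u, w, b), w, b)

delta = 0.1  # central finite difference in w
numeric_w = (loss_fn(model(t_u, w + delta, b), t_c)
             - loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)
print(analytic[0].item(), numeric_w.item())  # should be nearly identical
```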
Training Loop
The training loop updates parameters iteratively:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        w, b = params
        t_p = model(t_u, w, b)
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)
        params = params - learning_rate * grad  # Gradient descent step
        if epoch % 100 == 0:  # Print every 100 epochs
            print(f'Epoch {epoch}, Loss {loss.item():.4f}')
    return params
params = training_loop(
    n_epochs=1000,
    learning_rate=1e-2,
    params=torch.tensor([1.0, 0.0]),
    t_u=t_u,
    t_c=t_c
)
- With this learning rate on the raw inputs, the loss explodes to infinity (inf), and training diverges.
- Solution: use a smaller learning rate (e.g., 1e-4).
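The divergence is easy to reproduce. This sketch (assuming the dataset and linear model above) runs plain gradient descent with two learning rates; exact loss values may vary, but the qualitative behavior should not:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def run(lr, n_steps=100):
    """Plain gradient descent on (w, b); returns the final loss."""
    params = torch.tensor([1.0, 0.0])
    for _ in range(n_steps):
        w, b = params
        diff = (w * t_u + b) - t_c
        grad = torch.stack([(2 * diff * t_u).mean(), (2 * diff).mean()])
        params = params - lr * grad
    w, b = params
    return (((w * t_u + b) - t_c) ** 2).mean()

big = run(1e-2)    # overshoots: the loss blows up to inf/nan
small = run(1e-4)  # converges to a finite, much smaller loss
print(big.item(), small.item())
```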

Normalizing Inputs
Problem: The weight gradient is much larger than the bias gradient.
Solution: Normalize the input values to bring them into a similar scale.
t_un = 0.1 * t_u # Scale inputs
params = training_loop(
    n_epochs=1000,
    learning_rate=1e-2,
    params=torch.tensor([1.0, 0.0]),
    t_u=t_un,
    t_c=t_c
)
✅ Effect:
- Stabilizes training
- Allows a single learning rate to work for both parameters
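The effect on the gradients can be measured directly. Assuming the same dataset, this compares the gradient magnitudes of w and b at the starting point for raw vs. scaled inputs:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def grad_magnitudes(inputs):
    """|dloss/dw| and |dloss/db| at w=1, b=0 for the given inputs."""
    w = torch.tensor(1.0, requires_grad=True)
    b = torch.tensor(0.0, requires_grad=True)
    loss = (((w * inputs + b) - t_c) ** 2).mean()
    loss.backward()
    return w.grad.abs().item(), b.grad.abs().item()

gw_raw, gb_raw = grad_magnitudes(t_u)          # w's gradient dwarfs b's
gw_norm, gb_norm = grad_magnitudes(0.1 * t_u)  # comparable after scaling
print(gw_raw / gb_raw, gw_norm / gb_norm)
```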
Visualizing the Model Fit
After training, we can plot predictions vs. true values:
import matplotlib.pyplot as plt
t_p = model(t_un, *params) # Get predictions
plt.figure(dpi=600)
plt.xlabel("Temperature (°Fahrenheit)")
plt.ylabel("Temperature (°Celsius)")
plt.plot(t_u.numpy(), t_p.detach().numpy(), label="Model Prediction")
plt.plot(t_u.numpy(), t_c.numpy(), 'o', label="True Data")
plt.legend()
plt.show()

Side note: broadcast
Broadcasting allows element-wise operations between tensors of different shapes by automatically expanding dimensions without copying data. The goal is to make tensors compatible for operations like addition or multiplication.

PyTorch follows three main rules when matching tensor shapes for binary operations:
- Align dimensions from the back (i.e., from the right). If the tensors have different numbers of dimensions, the shorter shape is padded with 1s on the left.
- Each pair of aligned dimensions must either be equal, or one of them must be 1.
- A singleton (size-1) dimension is expanded: its single value is reused across that dimension.
If any pair of aligned dimensions meets neither condition, PyTorch throws an error.
The resulting shape takes the maximum value at each dimension.
x = torch.ones(()) # Scalar: Shape []
y = torch.ones(3, 1) # Shape [3, 1]
z = torch.ones(1, 3) # Shape [1, 3]
a = torch.ones(2, 1, 1) # Shape [2, 1, 1]
print(f"x: {x.shape}, y: {y.shape}, z: {z.shape}, a: {a.shape}")
print("x * y:", (x * y).shape) # Scalar x is broadcasted to [3,1]
print("y * z:", (y * z).shape) # y expands to [3,3], z expands to [3,3]
print("y * z * a:", (y * z * a).shape) # All expand to [2,3,3]
Output:
x: torch.Size([]), y: torch.Size([3, 1]), z: torch.Size([1, 3]), a: torch.Size([2, 1, 1])
x * y: torch.Size([3, 1])
y * z: torch.Size([3, 3])
y * z * a: torch.Size([2, 3, 3])
- x * y → scalar x is broadcast to [3, 1]. ✅ Result shape: [3, 1]
- y * z → y [3, 1] expands to [3, 3]; z [1, 3] expands to [3, 3] (align from the right, take the maximum). ✅ Result shape: [3, 3]
- y * z * a → y * z is [3, 3]; a [2, 1, 1] expands to [2, 3, 3] (align from the right, take the maximum). ✅ Final shape: [2, 3, 3]
Why Broadcasting Is Useful
✅ No unnecessary memory copies – Instead of explicitly expanding small tensors, PyTorch uses efficient memory access.
✅ Simplifies code – No need to manually reshape tensors to match dimensions.
✅ Faster operations – Avoids redundant data storage by reusing values across dimensions.
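The "no memory copies" point can be verified with stride and data-pointer checks: expand() (the mechanism behind broadcasting) creates a view with a zero stride, while an explicit repeat() allocates new storage:

```python
import torch

y = torch.ones(3, 1)
expanded = y.expand(3, 4)  # broadcast-style view: no data copied
repeated = y.repeat(1, 4)  # explicit copy: materializes 12 elements

print(expanded.stride(), repeated.stride())  # (1, 0) vs (4, 1)
print(expanded.data_ptr() == y.data_ptr())   # True: same underlying storage
print(repeated.data_ptr() == y.data_ptr())   # False: new storage
```

The zero stride in the expanded dimension means the same value is read repeatedly rather than stored four times.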
When Broadcasting Fails
Broadcasting won't work if dimensions don't match or aren’t 1:
a = torch.ones(2, 3) # Shape [2, 3]
b = torch.ones(3, 2) # Shape [3, 2]
a * b # ❌ ERROR: Mismatched shapes that can't be broadcasted
# Aligned dims are 3 vs 2: they are neither equal nor 1, so they cannot broadcast
Fix? Transpose b or reshape tensors to align correctly.
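A minimal demonstration of the failure and the transpose fix:

```python
import torch

a = torch.ones(2, 3)  # Shape [2, 3]
b = torch.ones(3, 2)  # Shape [3, 2]

try:
    a * b  # aligned dims are 3 vs 2: neither equal nor 1
except RuntimeError as err:
    print("broadcast failed:", err)

c = a * b.t()  # b.t() has shape [2, 3], matching a exactly
print(c.shape)
```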
PyTorch’s autograd: Backpropagating all things
PyTorch's autograd system automates gradient computation, making deep learning training loops faster and more scalable.
PyTorch tensors track operations if requires_grad=True. This allows automatic differentiation via backpropagation.

params = torch.tensor([1.0, 0.0], requires_grad=True)
- This tensor remembers its computation history.
- Gradients accumulate in params.grad when .backward() is called.
✅ Why use autograd?
- No need to manually compute derivatives.
- Works with complex models with millions of parameters.
Computing Gradients with autograd
1️⃣ Define Model and Loss Function
def model(t_u, w, b):
return w * t_u + b
def loss_fn(t_p, t_c):
return ((t_p - t_c) ** 2).mean() # Mean squared error
2️⃣ Enable Autograd on Parameters
params = torch.tensor([1.0, 0.0], requires_grad=True)
3️⃣ Compute Loss & Backpropagate
loss = loss_fn(model(t_u, *params), t_c) # Forward pass
loss.backward() # Compute gradients
print(params.grad) # Check computed gradients
✅ PyTorch computes derivatives automatically and stores them in params.grad.
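To confirm that autograd reproduces the hand-derived gradients from the earlier section, this self-contained snippet compares the two at w=1, b=0:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = (((params[0] * t_u + params[1]) - t_c) ** 2).mean()
loss.backward()

# Hand-derived chain-rule gradient at w=1, b=0
diff = t_u - t_c
manual = torch.stack([(2 * diff * t_u).mean(), (2 * diff).mean()])
print(params.grad, manual)  # should match
```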
Handling Gradient Accumulation
⚠ Problem: Gradients accumulate by default. If .backward() is called multiple times, old gradients remain.
✅ Solution: Manually zero gradients before backpropagation.
if params.grad is not None:
params.grad.zero_() # Reset gradients
1️⃣ Why Does PyTorch Accumulate Gradients?
When you call:
loss.backward()
PyTorch computes the gradients of loss with respect to all learnable parameters and stores them in params.grad. However, instead of replacing the previous gradient values, PyTorch adds (accumulates) the new gradients to params.grad.
import torch
# Create a simple parameter tensor with requires_grad=True
params = torch.tensor([1.0, 2.0], requires_grad=True)
# Define a simple loss function
loss1 = params.sum() # Sum of parameters (1.0 + 2.0 = 3.0)
loss1.backward() # Compute gradients
print(params.grad) # Output: tensor([1., 1.]) (Gradient w.r.t each param)
# Compute another loss without resetting gradients
loss2 = (params * 2).sum() # Loss = (2*1.0 + 2*2.0) = 6.0
loss2.backward() # Compute new gradients
print(params.grad) # Output: tensor([3., 3.]) (Accumulated gradients!)
What Happened Here?
- The first loss1.backward() computed the gradient [1, 1] and stored it in params.grad.
- Then loss2.backward() computed [2, 2] and added it to the existing gradients, resulting in [3, 3].
Thus, gradients accumulate by default instead of being reset.
2️⃣ Why Not Reset Gradients Automatically?
If PyTorch automatically reset gradients every time .backward() is called, it would prevent useful techniques like:
🔹 Gradient Accumulation Across Multiple Mini-Batches
- Large datasets are processed in batches, but a large batch may not fit in GPU memory. To work around this, we:
- Accumulate gradients over multiple smaller batches.
- Perform an optimization step after several batches.
import torch
params = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.1)
for i in range(3):  # Simulating three mini-batches
    loss = (params * (i + 1)).sum()  # Varying loss per batch
    loss.backward()  # Accumulate gradients
    print(f"After batch {i+1}, gradients:", params.grad)
optimizer.step()  # Apply accumulated gradients
params.grad.zero_()  # Reset gradients for the next cycle
Why Is This Useful?
- Allows models to process small batches at a time and apply gradients after multiple steps.
- Prevents GPU memory overload while still updating weights correctly.
3️⃣ Why Do We Manually Reset Gradients?
If we don’t manually zero the gradients, the accumulated values from previous .backward() calls will incorrectly affect future updates.
Alternative: Using optimizer.zero_grad()
If you're using PyTorch optimizers, you can reset gradients using:
optimizer.zero_grad() # Clears all gradients before new backpropagation
This ensures that gradients start fresh for each iteration.
Implementing a Training Loop with autograd
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:
            params.grad.zero_()  # Reset gradients
        t_p = model(t_u, *params)  # Forward pass
        loss = loss_fn(t_p, t_c)  # Compute loss
        loss.backward()  # Compute gradients
        with torch.no_grad():  # Disable autograd during parameter update
            params -= learning_rate * params.grad
        if epoch % 500 == 0:
            print(f"Epoch {epoch}, Loss {loss.item():.4f}")
    return params
✅ Key Features:
- zero_grad(): prevents gradient accumulation.
- with torch.no_grad(): prevents PyTorch from tracking the parameter update in the computation graph.
- Works automatically for any differentiable model!
🔹 Run the Training Loop
params = training_loop(
    n_epochs=5000, learning_rate=1e-2, params=torch.tensor([1.0, 0.0], requires_grad=True),
    t_u=t_un, t_c=t_c  # use the normalized inputs; the raw t_u diverges at this learning rate
)
Using PyTorch Optimizers
PyTorch provides built-in optimizers in torch.optim for various gradient descent strategies.
import torch.optim as optim
dir(optim)
- Every optimizer constructor takes a list of parameters (i.e., PyTorch tensors, typically with requires_grad=True) as the first input.
- All parameters passed to the optimizer are retained inside the optimizer object so the optimizer can update their values and access their grad attribute
- Optimizers do not compute gradients themselves: autograd populates each parameter's grad attribute when .backward() is called, and the optimizer then uses those gradients to update the parameter values. This lets users rely on the dynamic computation graph during complex forward passes.

import torch.optim as optim
params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=1e-2) # Stochastic Gradient Descent
# Updating Parameters with an Optimizer
optimizer.zero_grad() # Reset gradients
t_p = model(t_u, *params) # Forward pass
loss = loss_fn(t_p, t_c) # Compute loss
loss.backward() # Backpropagate
optimizer.step() # Update parameters
# Training Loop with Optimizer
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        optimizer.zero_grad()  # Reset gradients
        loss.backward()  # Compute gradients
        optimizer.step()  # Update parameters
        if epoch % 500 == 0:
            print(f"Epoch {epoch}, Loss {loss.item():.4f}")
    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=1e-2)
params = training_loop(5000, optimizer, params, t_un, t_c)  # normalized inputs
✅ Benefits of Using Optimizers:
- No need to manually update parameters.
- Works with any model architecture.
Trying Different Optimizers
Changing optimizers is as easy as swapping one line.
params = torch.tensor([1.0, 0.0], requires_grad=True)
# Adam adapts per-parameter learning rates for faster convergence
optimizer = optim.Adam([params], lr=1e-1)
params = training_loop(2000, optimizer, params, t_u, t_c)
Training, Validation, and Overfitting in PyTorch
To ensure a model generalizes well to new data, we split the dataset into training and validation sets, track both losses, and avoid overfitting.
1. Why Split Data?
- Training Set → Used to fit the model.
- Validation Set → Used to evaluate generalization.
- If training loss decreases but validation loss increases, the model is overfitting.
✅ Rule 1: If training loss does not decrease, the model might be too simple.
✅ Rule 2: If validation loss stops decreasing while training loss continues decreasing, the model is overfitting.
2. Splitting Data in PyTorch
We randomly shuffle the dataset before splitting it into training and validation sets.
import torch
# Data
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
# Shuffle indices
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples) # 20% for validation
shuffled_indices = torch.randperm(n_samples) # Shuffle indices
# Split indices
train_indices = shuffled_indices[:-n_val] # First 80% for training
val_indices = shuffled_indices[-n_val:] # Last 20% for validation
# Create training and validation sets
train_t_u, train_t_c = t_u[train_indices], t_c[train_indices]
val_t_u, val_t_c = t_u[val_indices], t_c[val_indices]
# Normalize inputs
train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u
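The same split can also be written with PyTorch's dataset utilities. torch.utils.data.random_split is stock API; this is just an alternative sketch, not the book's approach:

```python
import torch
from torch.utils.data import TensorDataset, random_split

t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])

dataset = TensorDataset(t_u, t_c)
n_val = int(0.2 * len(dataset))  # 20% for validation
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
print(len(train_set), len(val_set))  # 9 2
```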
3. Updating the Training Loop to Track Validation Loss
Modify the training loop to track both training loss and validation loss.
import torch.optim as optim
def model(t_u, w, b):
    return w * t_u + b
def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()  # Mean squared error
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        # Compute training predictions & loss
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        # Compute validation predictions & loss
        val_t_p = model(val_t_u, *params)
        val_loss = loss_fn(val_t_p, val_t_c)
        # Reset gradients
        optimizer.zero_grad()
        train_loss.backward()  # Only training loss affects parameter updates
        optimizer.step()
        # Print progress
        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss: {train_loss.item():.4f}, "
                  f"Validation loss: {val_loss.item():.4f}")
    return params
4. Running the Training Loop
Initialize parameters and optimizer, then train the model.
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)
# Run training
params = training_loop(
n_epochs=3000,
optimizer=optimizer,
params=params,
train_t_u=train_t_un,
val_t_u=val_t_un,
train_t_c=train_t_c,
val_t_c=val_t_c
)
5. Recognizing Overfitting

📊 Possible Training vs. Validation Loss Trends:
1️⃣ (Good) Model Generalizes Well
- Training and validation loss both decrease.
2️⃣ (Bad) Underfitting (Model too simple)
- Training loss stops decreasing too early.
- Validation loss is high.
3️⃣ (Bad) Overfitting (Model too complex)
- Training loss keeps decreasing.
- Validation loss starts increasing.
✅ Goal: Keep validation loss close to training loss.
6. How to Prevent Overfitting?
1️⃣ Collect More Data → Helps improve generalization.
2️⃣ Use Simpler Models → Fewer parameters reduce overfitting.
3️⃣ Regularization → Penalizes large weights (e.g., L1/L2 regularization).
4️⃣ Early Stopping → Stop training when validation loss stops improving.
5️⃣ Data Augmentation → Artificially increase training samples.
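As one concrete instance of regularization: SGD's weight_decay argument (stock torch.optim) adds an L2 penalty gradient at every step, pulling parameters toward zero. This sketch compares the final parameter norms with and without it on the thermometer data:

```python
import torch
import torch.optim as optim

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_un = 0.1 * torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def train(weight_decay, n_epochs=2000):
    params = torch.tensor([1.0, 0.0], requires_grad=True)
    # weight_decay adds weight_decay * params to the gradient at each step
    optimizer = optim.SGD([params], lr=1e-2, weight_decay=weight_decay)
    for _ in range(n_epochs):
        loss = (((params[0] * t_un + params[1]) - t_c) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return params.detach()

plain = train(0.0)
decayed = train(1e-1)
print(plain.norm().item(), decayed.norm().item())  # decay shrinks the norm
```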
Autograd Nits and Disabling It for Efficiency
In PyTorch, autograd dynamically builds a computation graph for backpropagation. However, in cases like validation, we don't need gradients, and disabling autograd can improve performance.

1. Why Doesn't backward() Affect Validation?
- Training loss (train_loss) is computed from the training set, forming a computation graph.
- Validation loss (val_loss) is computed separately from the validation set, forming a different graph.
- Calling backward() only applies to train_loss, not val_loss, so validation data does not influence training.
✅ Key Insight: Calling backward() on val_loss would incorrectly train on validation data, which we want to avoid.
2. Using torch.no_grad() to Disable Autograd
Since validation doesn’t require gradients, we can disable autograd during validation for better performance.
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        # Forward pass for training
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        # Disable autograd during validation
        with torch.no_grad():
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            assert val_loss.requires_grad == False  # Validation loss should NOT track gradients
        # Backpropagation and parameter update
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
✅ Performance Benefits:
- Saves memory by avoiding unnecessary computation graphs.
- Increases efficiency, especially in large models with millions of parameters.
3. Using torch.set_grad_enabled(is_train) for Flexible Control
Instead of using torch.no_grad(), we can use torch.set_grad_enabled(is_train), which dynamically enables/disables autograd.
def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss
✅ Use Cases:
- Training: calc_forward(t_u, t_c, is_train=True) → enables autograd.
- Validation/Inference: calc_forward(t_u, t_c, is_train=False) → disables autograd.