Ch 5. The mechanics of learning
The Core Idea of Machine Learning
- Machine learning follows the same logic Kepler used to fit planetary orbits:
- Collect data
- Choose a model
- Split for validation
- Train and adjust parameters
- Test and refine
✅ Modern machine learning automates what Kepler did manually, making modeling faster and scalable.
PyTorch & Automating Function Fitting
- Goal: Train models that can adapt to different tasks.
- PyTorch makes it easy to:
- Define models
- Compute gradients (derivatives)
- Optimize parameters automatically
✅ Neural networks reduce the need for hand-crafted model selection, making learning more flexible.
Learning is just Parameter Estimation

1. Overview of the Learning Process
- Given input data and ground truth outputs, we estimate model parameters iteratively.
- The model takes inputs and predicts outputs (forward pass).
- We compute the error (loss function) by comparing predicted vs. actual values.
- Using gradient descent, we update the parameters to minimize error (backward pass).
- This cycle repeats until the error is low enough on unseen data.
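This cycle can be sketched end to end in a few lines of PyTorch. The data and the one-parameter model below are made up for illustration (y = 3x), not part of the thermometer example that follows:

```python
import torch

# Toy data generated from y = 3x, so the ideal parameter is w = 3
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([3.0, 6.0, 9.0])

w = torch.tensor(0.0, requires_grad=True)  # start from a bad guess
for _ in range(200):
    y_pred = w * x                      # forward pass
    loss = ((y_pred - y) ** 2).mean()   # compare predicted vs. actual
    loss.backward()                     # backward pass: compute dloss/dw
    with torch.no_grad():
        w -= 0.05 * w.grad              # gradient descent step
        w.grad.zero_()                  # clear the gradient for the next cycle
print(w.item())  # ≈ 3.0
```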
A Practical Example: Calibrating a Thermometer
- We found a wall-mounted thermometer without units.
- We recorded temperature in Celsius (°C) and the thermometer’s readings (unknown units, t_u).
- Goal: Find a mathematical relationship between Celsius and the unknown units.
Dataset:
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
import torch

t_c = torch.tensor(t_c)
t_u = torch.tensor(t_u)
Plotting t_u vs. t_c reveals a linear trend with noise.

- This suggests a linear model might be appropriate. We assume a simple linear relationship between the two temperature scales:
$t_c = w \cdot t_u + b$
- w (weight): scaling factor.
- b (bias): offset term.
- Goal: find w and b such that predictions match actual Celsius temperatures.
Defining the Learning Task
- Find values for w and b that minimize error.
- We need a loss function:
- Measures how far predictions are from actual values.
- Guides optimization to improve parameter estimates.
Minimizing Loss: Defining the Loss Function in PyTorch
1. What Is a Loss Function?
- A loss function computes a single numerical value representing how well our model’s predictions match the actual values.
- Our goal is to minimize this loss during training.
2. Choosing a Loss Function
- The simplest loss functions compare predictions ( t_p ) and actual values ( t_c ):
- Absolute Difference: $ \text{Loss} = | t_p - t_c | $
- Squared Difference (Mean Squared Error, MSE): $ \text{Loss} = (t_p - t_c)^2 $
✅ Why choose squared difference?
- The squared loss penalizes large errors more than smaller ones.
- Its derivative is well-defined everywhere (unlike the absolute difference).
- It's convex, making it easier to optimize.
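A tiny check of the derivative claims using autograd on scalar tensors: the squared loss's gradient scales with the error, while the absolute loss's gradient has constant magnitude no matter how close we are to the target:

```python
import torch

for err in (5.0, 0.1):
    e = torch.tensor(err, requires_grad=True)
    (e ** 2).backward()
    sq_grad = e.grad.item()    # 2 * err: large errors get large gradients

    e2 = torch.tensor(err, requires_grad=True)
    e2.abs().backward()
    abs_grad = e2.grad.item()  # always magnitude 1 away from zero
    print(f"error={err}: squared grad={sq_grad}, absolute grad={abs_grad}")
```

At exactly zero the absolute value has no unique derivative, which is part of why the squared loss is the safer default.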
3. Implementing the Model in PyTorch
- We define the model as a simple linear function:
def model(t_u, w, b):
    return w * t_u + b
- t_u = input tensor (unknown temperature readings).
- w = weight (scaling factor).
- b = bias (offset).
4. Implementing the Loss Function
- We define the Mean Squared Error (MSE) loss function:
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c) ** 2
    return squared_diffs.mean()
✅ What this does:
- Computes element-wise squared differences between predicted and actual values.
- Averages the results to return a single scalar loss.
5. Initializing Parameters & Checking Loss
- We start with simple initial values for ( w ) and ( b ):
w = torch.ones(())
b = torch.zeros(())
- Compute predictions using the model function:
t_p = model(t_u, w, b)
print(t_p)
Output:
tensor([35.7000, 55.9000, 58.2000, 81.9000, 56.3000, 48.9000, 33.9000, 21.8000, 48.4000, 60.4000, 68.4000])
- Compute the initial loss:
loss = loss_fn(t_p, t_c)
print(loss)
Output:
tensor(1763.8846)
✅ We now have our initial loss! Next, we’ll optimize ( w ) and ( b ) to minimize this loss. 🚀
Gradient Descent in PyTorch
Gradient descent is an optimization algorithm that updates model parameters iteratively to minimize loss. PyTorch automates this process using automatic differentiation, but first, let's build an intuition for how it works manually.
1. Understanding Gradient Descent
Imagine adjusting knobs on a machine to minimize error. We tweak the parameters (weights w and bias b) and observe how the loss function changes.
✔ If turning a knob lowers the loss, keep going in that direction.
✔ If loss increases, reverse direction and adjust in smaller steps.
2. Computing Gradients Numerically
The gradient tells us how much the loss changes when we tweak each parameter.
A finite difference approximation estimates the gradient:
delta = 0.1
loss_rate_of_change_w = (
    loss_fn(model(t_u, w + delta, b), t_c) - loss_fn(model(t_u, w - delta, b), t_c)
) / (2.0 * delta)
✅ This method works but is inefficient for models with many parameters.
3. Computing Gradients Analytically
Instead of using finite differences, we use calculus (chain rule) to compute derivatives:

def dloss_fn(t_p, t_c):
    return 2 * (t_p - t_c) / t_p.size(0)  # Derivative of mean squared error w.r.t. t_p
def dmodel_dw(t_u, w, b):
    return t_u  # Derivative of linear model w.r.t. weight
def dmodel_db(t_u, w, b):
    return 1.0  # Derivative of linear model w.r.t. bias
def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = dloss_fn(t_p, t_c)
    dloss_dw = dloss_dtp * dmodel_dw(t_u, w, b)
    dloss_db = dloss_dtp * dmodel_db(t_u, w, b)
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])
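As a sanity check (not from the book's listing), the analytic gradient should agree with the finite-difference estimate from the previous section. The data and functions are restated so the snippet runs on its own:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

w, b = torch.tensor(1.0), torch.tensor(0.0)
analytic = grad_fn(t_u, t_c, model(t_u, w, b), w, b)

delta = 0.1  # central finite difference in w
numeric_w = (loss_fn(model(t_u, w + delta, b), t_c)
             - loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)
print(analytic[0].item(), numeric_w.item())  # should be nearly identical
```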
Training Loop
The training loop updates parameters iteratively:
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        w, b = params
        t_p = model(t_u, w, b)
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)
        params = params - learning_rate * grad  # Gradient descent step
        if epoch % 100 == 0:  # Print every 100 epochs
            print(f'Epoch {epoch}, Loss {loss.item():.4f}')
    return params
params = training_loop(
    n_epochs=1000,
    learning_rate=1e-2,
    params=torch.tensor([1.0, 0.0]),
    t_u=t_u,
    t_c=t_c
)
- With this learning rate on the raw inputs, the loss explodes to infinity (inf), and training diverges.
- Solution: use a smaller learning rate (e.g., 1e-4).
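The divergence is easy to reproduce. This sketch (assuming the dataset and linear model above) runs plain gradient descent with two learning rates; exact loss values may vary, but the qualitative behavior should not:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def run(lr, n_steps=100):
    """Plain gradient descent on (w, b); returns the final loss."""
    params = torch.tensor([1.0, 0.0])
    for _ in range(n_steps):
        w, b = params
        diff = (w * t_u + b) - t_c
        grad = torch.stack([(2 * diff * t_u).mean(), (2 * diff).mean()])
        params = params - lr * grad
    w, b = params
    return (((w * t_u + b) - t_c) ** 2).mean()

big = run(1e-2)    # overshoots: the loss blows up to inf/nan
small = run(1e-4)  # converges to a finite, much smaller loss
print(big.item(), small.item())
```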

Normalizing Inputs
Problem: The weight gradient is much larger than the bias gradient.
Solution: Normalize the input values to bring them into a similar scale.
t_un = 0.1 * t_u # Scale inputs
params = training_loop(
    n_epochs=1000,
    learning_rate=1e-2,
    params=torch.tensor([1.0, 0.0]),
    t_u=t_un,
    t_c=t_c
)
✅ Effect:
- Stabilizes training
- Allows a single learning rate to work for both parameters
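The effect on the gradients can be measured directly. Assuming the same dataset, this compares the gradient magnitudes of w and b at the starting point for raw vs. scaled inputs:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def grad_magnitudes(inputs):
    """|dloss/dw| and |dloss/db| at w=1, b=0 for the given inputs."""
    w = torch.tensor(1.0, requires_grad=True)
    b = torch.tensor(0.0, requires_grad=True)
    loss = (((w * inputs + b) - t_c) ** 2).mean()
    loss.backward()
    return w.grad.abs().item(), b.grad.abs().item()

gw_raw, gb_raw = grad_magnitudes(t_u)          # w's gradient dwarfs b's
gw_norm, gb_norm = grad_magnitudes(0.1 * t_u)  # comparable after scaling
print(gw_raw / gb_raw, gw_norm / gb_norm)
```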
Visualizing the Model Fit
After training, we can plot predictions vs. true values:
import matplotlib.pyplot as plt
t_p = model(t_un, *params) # Get predictions
plt.figure(dpi=600)
plt.xlabel("Temperature (°Fahrenheit)")
plt.ylabel("Temperature (°Celsius)")
plt.plot(t_u.numpy(), t_p.detach().numpy(), label="Model Prediction")
plt.plot(t_u.numpy(), t_c.numpy(), 'o', label="True Data")
plt.legend()
plt.show()

Side note: broadcast
Broadcasting allows element-wise operations between tensors of different shapes by automatically expanding dimensions without copying data. The goal is to make tensors compatible for operations like addition or multiplication.

PyTorch follows three main rules when matching tensor shapes for binary operations:
- Align dimensions from the back (i.e., from the right). If the tensors have different numbers of dimensions, the shorter shape is padded with 1s on the left.
- Each pair of aligned dimensions must either be equal, or one of them must be 1.
- A singleton (size-1) dimension is expanded: its single value is reused across that dimension.
If any pair of aligned dimensions meets neither condition, PyTorch throws an error.
The resulting shape takes the maximum value at each dimension.
x = torch.ones(()) # Scalar: Shape []
y = torch.ones(3, 1) # Shape [3, 1]
z = torch.ones(1, 3) # Shape [1, 3]
a = torch.ones(2, 1, 1) # Shape [2, 1, 1]
print(f"x: {x.shape}, y: {y.shape}, z: {z.shape}, a: {a.shape}")
print("x * y:", (x * y).shape) # Scalar x is broadcasted to [3,1]
print("y * z:", (y * z).shape) # y expands to [3,3], z expands to [3,3]
print("y * z * a:", (y * z * a).shape) # All expand to [2,3,3]
Output:
x: torch.Size([]), y: torch.Size([3, 1]), z: torch.Size([1, 3]), a: torch.Size([2, 1, 1])
x * y: torch.Size([3, 1])
y * z: torch.Size([3, 3])
y * z * a: torch.Size([2, 3, 3])
- x * y → scalar x is broadcast to [3, 1]. ✅ Result shape: [3, 1]
- y * z → y [3, 1] expands to [3, 3]; z [1, 3] expands to [3, 3] (align from the right, take the maximum). ✅ Result shape: [3, 3]
- y * z * a → y * z is [3, 3]; a [2, 1, 1] expands to [2, 3, 3] (align from the right, take the maximum). ✅ Final shape: [2, 3, 3]
Why Broadcasting Is Useful
✅ No unnecessary memory copies – Instead of explicitly expanding small tensors, PyTorch uses efficient memory access.
✅ Simplifies code – No need to manually reshape tensors to match dimensions.
✅ Faster operations – Avoids redundant data storage by reusing values across dimensions.
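The "no memory copies" point can be verified with stride and data-pointer checks: expand() (the mechanism behind broadcasting) creates a view with a zero stride, while an explicit repeat() allocates new storage:

```python
import torch

y = torch.ones(3, 1)
expanded = y.expand(3, 4)  # broadcast-style view: no data copied
repeated = y.repeat(1, 4)  # explicit copy: materializes 12 elements

print(expanded.stride(), repeated.stride())  # (1, 0) vs (4, 1)
print(expanded.data_ptr() == y.data_ptr())   # True: same underlying storage
print(repeated.data_ptr() == y.data_ptr())   # False: new storage
```

The zero stride in the expanded dimension means the same value is read repeatedly rather than stored four times.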
When Broadcasting Fails
Broadcasting won't work if dimensions don't match or aren’t 1:
a = torch.ones(2, 3) # Shape [2, 3]
b = torch.ones(3, 2) # Shape [3, 2]
a * b # ❌ ERROR: Mismatched shapes that can't be broadcasted
# Aligned dims are 3 vs 2: they are neither equal nor 1, so they cannot broadcast
Fix? Transpose b or reshape tensors to align correctly.
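A minimal demonstration of the failure and the transpose fix:

```python
import torch

a = torch.ones(2, 3)  # Shape [2, 3]
b = torch.ones(3, 2)  # Shape [3, 2]

try:
    a * b  # aligned dims are 3 vs 2: neither equal nor 1
except RuntimeError as err:
    print("broadcast failed:", err)

c = a * b.t()  # b.t() has shape [2, 3], matching a exactly
print(c.shape)
```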
PyTorch’s autograd: Backpropagating all things
PyTorch's autograd system automates gradient computation, making deep learning training loops faster and more scalable.
PyTorch tensors track operations if requires_grad=True. This allows automatic differentiation via backpropagation.

params = torch.tensor([1.0, 0.0], requires_grad=True)
- This tensor remembers its computation history.
- Gradients accumulate in params.grad when .backward() is called.
✅ Why use autograd?
- No need to manually compute derivatives.
- Works with complex models with millions of parameters.
Computing Gradients with autograd
1️⃣ Define Model and Loss Function
def model(t_u, w, b):
return w * t_u + b
def loss_fn(t_p, t_c):
return ((t_p - t_c) ** 2).mean() # Mean squared error
2️⃣ Enable Autograd on Parameters
params = torch.tensor([1.0, 0.0], requires_grad=True)
3️⃣ Compute Loss & Backpropagate
loss = loss_fn(model(t_u, *params), t_c) # Forward pass
loss.backward() # Compute gradients
print(params.grad) # Check computed gradients
✅ PyTorch computes derivatives automatically and stores them in params.grad.
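To confirm that autograd reproduces the hand-derived gradients from the earlier section, this self-contained snippet compares the two at w=1, b=0:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss = (((params[0] * t_u + params[1]) - t_c) ** 2).mean()
loss.backward()

# Hand-derived chain-rule gradient at w=1, b=0
diff = t_u - t_c
manual = torch.stack([(2 * diff * t_u).mean(), (2 * diff).mean()])
print(params.grad, manual)  # should match
```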
Handling Gradient Accumulation
⚠ Problem: Gradients accumulate by default. If .backward() is called multiple times, old gradients remain.
✅ Solution: Manually zero gradients before backpropagation.
if params.grad is not None:
params.grad.zero_() # Reset gradients
1️⃣ Why Does PyTorch Accumulate Gradients?
When you call:
loss.backward()
PyTorch computes the gradients of loss with respect to all learnable parameters and stores them in params.grad. However, instead of replacing the previous gradient values, PyTorch adds (accumulates) the new gradients to params.grad.
import torch
# Create a simple parameter tensor with requires_grad=True
params = torch.tensor([1.0, 2.0], requires_grad=True)
# Define a simple loss function
loss1 = params.sum() # Sum of parameters (1.0 + 2.0 = 3.0)
loss1.backward() # Compute gradients
print(params.grad) # Output: tensor([1., 1.]) (Gradient w.r.t each param)
# Compute another loss without resetting gradients
loss2 = (params * 2).sum() # Loss = (2*1.0 + 2*2.0) = 6.0
loss2.backward() # Compute new gradients
print(params.grad) # Output: tensor([3., 3.]) (Accumulated gradients!)
What Happened Here?
- The first loss1.backward() computed the gradient [1, 1] and stored it in params.grad.
- Then loss2.backward() computed [2, 2] and added it to the existing gradients, resulting in [3, 3].
Thus, gradients accumulate by default instead of being reset.
2️⃣ Why Not Reset Gradients Automatically?
If PyTorch automatically reset gradients every time .backward() is called, it would prevent useful techniques like:
🔹 Gradient Accumulation Across Multiple Mini-Batches
- Large datasets are processed in batches, but a large batch may not fit in GPU memory. To work around this, we:
- Accumulate gradients over multiple smaller batches.
- Perform an optimization step after several batches.
import torch
params = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.1)
for i in range(3):  # Simulating three mini-batches
    loss = (params * (i + 1)).sum()  # Varying loss per batch
    loss.backward()  # Accumulate gradients
    print(f"After batch {i+1}, gradients:", params.grad)
optimizer.step()  # Apply accumulated gradients
params.grad.zero_()  # Reset gradients for the next cycle
Why Is This Useful?
- Allows models to process small batches at a time and apply gradients after multiple steps.
- Prevents GPU memory overload while still updating weights correctly.
3️⃣ Why Do We Manually Reset Gradients?
If we don’t manually zero the gradients, the accumulated values from previous .backward() calls will incorrectly affect future updates.
Alternative: Using optimizer.zero_grad()
If you're using PyTorch optimizers, you can reset gradients using:
optimizer.zero_grad() # Clears all gradients before new backpropagation
This ensures that gradients start fresh for each iteration.
Implementing a Training Loop with autograd
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:
            params.grad.zero_()  # Reset gradients
        t_p = model(t_u, *params)  # Forward pass
        loss = loss_fn(t_p, t_c)  # Compute loss
        loss.backward()  # Compute gradients
        with torch.no_grad():  # Disable autograd during parameter update
            params -= learning_rate * params.grad
        if epoch % 500 == 0:
            print(f"Epoch {epoch}, Loss {loss.item():.4f}")
    return params
✅ Key Features:
- zero_grad(): prevents gradient accumulation.
- with torch.no_grad(): prevents PyTorch from tracking the parameter update in the computation graph.
- Works automatically for any differentiable model!
🔹 Run the Training Loop
params = training_loop(
    n_epochs=5000, learning_rate=1e-2, params=torch.tensor([1.0, 0.0], requires_grad=True),
    t_u=t_un, t_c=t_c  # use the normalized inputs; the raw t_u diverges at this learning rate
)
Using PyTorch Optimizers
PyTorch provides built-in optimizers in torch.optim for various gradient descent strategies.
import torch.optim as optim
dir(optim)
- Every optimizer constructor takes a list of parameters (i.e., PyTorch tensors, typically with requires_grad=True) as the first input.
- All parameters passed to the optimizer are retained inside the optimizer object so the optimizer can update their values and access their grad attribute
- Optimizers do not compute gradients themselves: autograd populates each parameter's grad attribute when .backward() is called, and the optimizer then uses those gradients to update the parameter values. This lets users rely on the dynamic computation graph during complex forward passes.

import torch.optim as optim
params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=1e-2) # Stochastic Gradient Descent
# Updating Parameters with an Optimizer
optimizer.zero_grad() # Reset gradients
t_p = model(t_u, *params) # Forward pass
loss = loss_fn(t_p, t_c) # Compute loss
loss.backward() # Backpropagate
optimizer.step() # Update parameters
# Training Loop with Optimizer
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        optimizer.zero_grad()  # Reset gradients
        loss.backward()  # Compute gradients
        optimizer.step()  # Update parameters
        if epoch % 500 == 0:
            print(f"Epoch {epoch}, Loss {loss.item():.4f}")
    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=1e-2)
params = training_loop(5000, optimizer, params, t_un, t_c)  # normalized inputs
✅ Benefits of Using Optimizers:
- No need to manually update parameters.
- Works with any model architecture.
Trying Different Optimizers
Changing optimizers is as easy as swapping one line.
params = torch.tensor([1.0, 0.0], requires_grad=True)
# Adam adapts per-parameter learning rates for faster convergence
optimizer = optim.Adam([params], lr=1e-1)
params = training_loop(2000, optimizer, params, t_u, t_c)
Training, Validation, and Overfitting in PyTorch
To ensure a model generalizes well to new data, we split the dataset into training and validation sets, track both losses, and avoid overfitting.
1. Why Split Data?
- Training Set → Used to fit the model.
- Validation Set → Used to evaluate generalization.
- If training loss decreases but validation loss increases, the model is overfitting.
✅ Rule 1: If training loss does not decrease, the model might be too simple.
✅ Rule 2: If validation loss stops decreasing while training loss continues decreasing, the model is overfitting.
2. Splitting Data in PyTorch
We randomly shuffle the dataset before splitting it into training and validation sets.
import torch
# Data
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
# Shuffle indices
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples) # 20% for validation
shuffled_indices = torch.randperm(n_samples) # Shuffle indices
# Split indices
train_indices = shuffled_indices[:-n_val] # First 80% for training
val_indices = shuffled_indices[-n_val:] # Last 20% for validation
# Create training and validation sets
train_t_u, train_t_c = t_u[train_indices], t_c[train_indices]
val_t_u, val_t_c = t_u[val_indices], t_c[val_indices]
# Normalize inputs
train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u
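The same split can also be written with PyTorch's dataset utilities. torch.utils.data.random_split is stock API; this is just an alternative sketch, not the book's approach:

```python
import torch
from torch.utils.data import TensorDataset, random_split

t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])

dataset = TensorDataset(t_u, t_c)
n_val = int(0.2 * len(dataset))  # 20% for validation
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
print(len(train_set), len(val_set))  # 9 2
```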
3. Updating the Training Loop to Track Validation Loss
Modify the training loop to track both training loss and validation loss.
import torch.optim as optim
def model(t_u, w, b):
    return w * t_u + b
def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()  # Mean squared error
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        # Compute training predictions & loss
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        # Compute validation predictions & loss
        val_t_p = model(val_t_u, *params)
        val_loss = loss_fn(val_t_p, val_t_c)
        # Reset gradients
        optimizer.zero_grad()
        train_loss.backward()  # Only training loss affects parameter updates
        optimizer.step()
        # Print progress
        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss: {train_loss.item():.4f}, "
                  f"Validation loss: {val_loss.item():.4f}")
    return params
4. Running the Training Loop
Initialize parameters and optimizer, then train the model.
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)
# Run training
params = training_loop(
n_epochs=3000,
optimizer=optimizer,
params=params,
train_t_u=train_t_un,
val_t_u=val_t_un,
train_t_c=train_t_c,
val_t_c=val_t_c
)
5. Recognizing Overfitting

📊 Possible Training vs. Validation Loss Trends:
1️⃣ (Good) Model Generalizes Well
- Training and validation loss both decrease.
2️⃣ (Bad) Underfitting (Model too simple)
- Training loss stops decreasing too early.
- Validation loss is high.
3️⃣ (Bad) Overfitting (Model too complex)
- Training loss keeps decreasing.
- Validation loss starts increasing.
✅ Goal: Keep validation loss close to training loss.
6. How to Prevent Overfitting?
1️⃣ Collect More Data → Helps improve generalization.
2️⃣ Use Simpler Models → Fewer parameters reduce overfitting.
3️⃣ Regularization → Penalizes large weights (e.g., L1/L2 regularization).
4️⃣ Early Stopping → Stop training when validation loss stops improving.
5️⃣ Data Augmentation → Artificially increase training samples.
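As one concrete instance of regularization: SGD's weight_decay argument (stock torch.optim) adds an L2 penalty gradient at every step, pulling parameters toward zero. This sketch compares the final parameter norms with and without it on the thermometer data:

```python
import torch
import torch.optim as optim

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_un = 0.1 * torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def train(weight_decay, n_epochs=2000):
    params = torch.tensor([1.0, 0.0], requires_grad=True)
    # weight_decay adds weight_decay * params to the gradient at each step
    optimizer = optim.SGD([params], lr=1e-2, weight_decay=weight_decay)
    for _ in range(n_epochs):
        loss = (((params[0] * t_un + params[1]) - t_c) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return params.detach()

plain = train(0.0)
decayed = train(1e-1)
print(plain.norm().item(), decayed.norm().item())  # decay shrinks the norm
```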
Autograd Nits and Disabling It for Efficiency
In PyTorch, autograd dynamically builds a computation graph for backpropagation. However, in cases like validation, we don't need gradients, and disabling autograd can improve performance.

1. Why Doesn't backward() Affect Validation?
- Training loss (train_loss) is computed from the training set, forming a computation graph.
- Validation loss (val_loss) is computed separately from the validation set, forming a different graph.
- Calling backward() only applies to train_loss, not val_loss, so validation data does not influence training.
✅ Key Insight: Calling backward() on val_loss would incorrectly train on validation data, which we want to avoid.
2. Using torch.no_grad() to Disable Autograd
Since validation doesn’t require gradients, we can disable autograd during validation for better performance.
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u, train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        # Forward pass for training
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        # Disable autograd during validation
        with torch.no_grad():
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            assert val_loss.requires_grad == False  # Validation loss should NOT track gradients
        # Backpropagation and parameter update
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
✅ Performance Benefits:
- Saves memory by avoiding unnecessary computation graphs.
- Increases efficiency, especially in large models with millions of parameters.
3. Using torch.set_grad_enabled(is_train) for Flexible Control
Instead of using torch.no_grad(), we can use torch.set_grad_enabled(is_train), which dynamically enables/disables autograd.
def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss
✅ Use Cases:
- Training: calc_forward(t_u, t_c, is_train=True) → enables autograd.
- Validation/Inference: calc_forward(t_u, t_c, is_train=False) → disables autograd.