**PyTorch Stochastic Gradient Optimization Technique**

Similar to Marvel superheroes who have remarkable powers, AI possesses its own set of "superpowers," with self-learning being arguably the most impactful. For many, autonomous learning is like a mysterious black box: we see what goes in and what comes out, but the inner workings are often hidden. In this article, we tackle this puzzle by exploring PyTorch, a widely used machine learning framework, aiming to clarify how a model learns and updates itself.

PyTorch is an open-source framework for building, training, and deploying deep learning models. It uses optimizers such as SGD or Adam to update model weights via gradients in repeated training loops. We'll demonstrate the Stochastic Gradient Descent (SGD) algorithm with a simple example.

For starters, Stochastic Gradient Descent (SGD) is an optimization algorithm used to train machine learning models by minimizing the loss function. It is a variant of the standard gradient descent that, instead of using the entire dataset to compute the gradient for each update, uses a small, randomly selected subset of data called a mini-batch.

Model Equation:

Enough said, let's move on to an example. We shall consider the following simple linear regression model:

ypred = wx + b - Equation 1

where ypred = predicted output and x = input, w = weight, b = bias.
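As a minimal sketch, Equation 1 can be written as a one-line Python function. The function name `predict` and the sample values are assumptions chosen only for illustration.

```python
def predict(x, w, b):
    """Return the model output y_pred = w * x + b for a single input x."""
    return w * x + b

# With untrained starting values w = 0 and b = 0, every prediction is 0.
print(predict(3.0, w=0.0, b=0.0))   # 0.0
# With illustrative values w = 2 and b = 10, the same input maps to 2 * 3 + 10.
print(predict(3.0, w=2.0, b=10.0))  # 16.0
```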

The goal is to accurately predict the outputs for a given set of input values, based on a model/reference equation. This is achieved by calculating the optimal values of weight (w) and bias (b) that make the predicted output as close as possible to the true output.

Loss Function:

Just like humans, the algorithm needs to understand how close it is to the optimal values. This is achieved through a loss function (l) which, after each step, quantifies the error between the model prediction and the true value. The goal is to minimize this loss function by iteratively modifying the values of w and b.

For simplicity, we choose Mean Squared Error(MSE) as our loss function, given by the following equation.

$l = \frac{1}{n} \sum_{i=1}^{n} \left( y_{pred}(i) - y(i) \right)^2$ - Equation 2

where n = No. of training instances,

y(i) = Actual Output of ith entry, ypred(i)= Predicted Output of ith entry

Expanding ypred as per Equation 1, we have l as follows:

$l = \frac{1}{n} \sum_{i=1}^{n} \left( w\,x(i) + b - y(i) \right)^2$ - Equation 3
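The expanded loss can be sketched in a few lines of plain Python. The helper name `mse_loss` and the three sample points are assumptions made for this illustration, not part of any library.

```python
def mse_loss(xs, ys, w, b):
    """Mean squared error of y_pred = w * x + b over paired inputs and targets."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

xs = [0.0, 1.0, 2.0]
ys = [10.0, 12.0, 14.0]
# With w = 0 and b = 0 the squared errors are 100, 144 and 196.
print(round(mse_loss(xs, ys, w=0.0, b=0.0), 2))  # 146.67
```

The loss drops to zero only when every prediction matches its target exactly.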

Methodology:

We start by assigning an initial value to w and b, e.g. w = 0 and b = 0. As a next step, we need to train our model by feeding a set of inputs and corresponding outputs. Each such set is known as a training batch.

For each training batch (i), the algorithm computes the gradient of the loss metric (l) with respect to w (dl/dw) and b (dl/db). Based on the calculated gradients, it then updates the values of w and b as follows:

w(i+1) = w(i) - (dl/dw) * lr - Equation 4

b(i+1) = b(i) - (dl/db) * lr - Equation 5

where lr = learning rate

The learning rate is a hyperparameter that controls how much a model updates its parameters during training. It can significantly affect whether the model performs well or fails to learn.

We repeat the above steps for each training batch till we have exhausted all the training batches.
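The update rule in Equations 4 and 5 can be sketched as a small helper. The name `sgd_step` and the gradient and learning-rate values below are made up purely to show the arithmetic.

```python
def sgd_step(w, b, grad_w, grad_b, lr):
    """Apply one update: move each parameter against its gradient, scaled by lr."""
    return w - lr * grad_w, b - lr * grad_b

# Negative gradients push both parameters upward.
w, b = sgd_step(w=0.0, b=0.0, grad_w=-4.0, grad_b=-2.0, lr=0.1)
print(w, b)  # 0.4 0.2
```

A larger learning rate takes a bigger step along the same direction, which is why it has such a strong effect on training.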

Gradient Calculation:

The gradients dl/dw and dl/db are computed once the loss metric (l) has been calculated. The loss metric is first calculated through forward propagation, by comparing ypred with y. This is followed by backward propagation, wherein the partial derivatives of the loss metric with respect to the weight (w) and bias (b) are calculated.

We will look at both these methods in detail in light of our loss function(l).

Forward Propagation:

As part of the forward pass, the algorithm calculates the loss function given a set of input variables. The algorithm keeps a record of data & all executed operations (along with the resulting new variables) in a directed acyclic graph (DAG) consisting of function objects. In this DAG, leaves are the input variables, roots are the output variables. By tracing this graph from roots to leaves, the algorithm can automatically compute the gradients using the chain rule.

The following is a visual representation of our loss function calculation, with the variables as nodes and the mathematical operations as functions:

[Image: computation graph of the loss function calculation]

To ensure simplicity, each variable node represents a unit function, as follows:

z = l

l = (1/n) * e

e = (d)^2

d = c - y

c = a + b

a = w * x

Tracing the DAG back up would give us Equation 3.

Using a bottom-up approach, the algorithm starts from the lowest nodes of the tree and propagates up, step by step, until it reaches the top of the tree and calculates the loss function (l). At each step on the way up, it stores in memory the computed value and the operation’s gradient function, which are then used during backward propagation.

The algorithm also adds an additional variable (z) at the end, which is simply equal to l (i.e. z = l). This helps with the gradient calculation of l during backward propagation, as we will see next.
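To make the forward pass concrete, here is a plain-Python sketch that evaluates the loss one unit operation at a time and keeps every intermediate value. This only loosely mirrors what an autograd engine records. The function name `forward`, the dictionary it returns, and the sample points are assumptions for illustration.

```python
def forward(xs, ys, w, b):
    """Evaluate the loss one unit operation at a time, keeping every intermediate."""
    n = len(xs)
    a = [w * x for x in xs]                   # a = w * x
    c = [a_i + b for a_i in a]                # c = a + b
    d = [c_i - y for c_i, y in zip(c, ys)]    # d = c - y
    e = [d_i ** 2 for d_i in d]               # e = d^2
    l = sum(e) / n                            # l = (1/n) * sum of e
    return {"a": a, "c": c, "d": d, "e": e, "l": l}

cache = forward([0.0, 1.0, 2.0], [10.0, 12.0, 14.0], w=0.0, b=0.0)
print(cache["d"])            # [-10.0, -12.0, -14.0]
print(round(cache["l"], 2))  # 146.67
```

Keeping the intermediates around is what lets the backward pass reuse them instead of recomputing the whole expression.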

Backward Propagation:

As the name suggests, backward propagation uses a top-down approach, as opposed to the bottom-up method of forward propagation. It starts from the topmost node (z) and calculates the partial derivative of each node/variable w.r.t. the variable directly beneath it. Using the functions stored during forward propagation, the algorithm computes the gradient of each function. Applying the chain rule, it propagates all the way down to the leaf nodes.

Below is a visual representation of the DAG in our example. In the graph, the blue arrows are in the direction of the forward pass vs the green arrows in the opposite direction, representing the backward pass. The green nodes represent the corresponding backward functions of each operation in the forward pass.

[Image: DAG showing the forward pass (blue arrows) and the backward pass (green arrows and nodes)]
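The same forward and backward flow can be observed directly in PyTorch. The sketch below assumes three illustrative data points and zero-initialised parameters. It builds the loss from elementary tensor operations rather than a model class.

```python
import torch

x = torch.tensor([0.0, 1.0, 2.0])
y = torch.tensor([10.0, 12.0, 14.0])
# Leaf nodes: requires_grad=True asks autograd to track operations on them.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Forward pass: each operation is recorded in the graph as it runs.
loss = ((w * x + b - y) ** 2).mean()
print(loss.grad_fn)  # backward function object attached to the root node

# Backward pass: walk the graph from the root down to the leaves.
loss.backward()
print(w.grad, b.grad)  # gradients of the loss w.r.t. w and b
```

After `backward()`, the gradients are stored on the leaf tensors themselves, which is where an optimizer later reads them.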

Weight Gradient Calculation:

The backward propagation is the algorithm’s way to calculate the gradients based on the Partial Derivative Chain-Rule. Applying the chain rule to our example for calculating the weight gradient, we get the following:

dl/dw = (dl/de)(de/dd)(dd/dc)(dc/da)(da/dw) - Equation 6

where l = (1/n) * e

e = (d)^2

d = c - y

c = a + b

a = w * x

The partial derivatives for each intermediate variable are as follows:

dl/de = 1/n,

de/dd = 2d,

dd/dc = 1,

dc/da = 1,

da/dw = x

Substituting the above values in Eq. 6, we get the following expression:

$dl/dw = \frac{1}{n} \sum_{i=1}^{n} 2\,d(i) \cdot 1 \cdot 1 \cdot x(i)$ - Equation 7

Expanding each intermediate variable, this equates to:

$dl/dw = \frac{2}{n} \sum_{i=1}^{n} \left( w\,x(i) + b - y(i) \right) x(i)$ - Equation 8
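One way to sanity-check the closed-form weight gradient is to compare it with a numerical estimate obtained by nudging w slightly in both directions (a central finite difference). The helper names, the step size `eps`, and the sample points are assumptions for this sketch.

```python
def mse_loss(xs, ys, w, b):
    """Mean squared error of y_pred = w * x + b."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

def grad_w(xs, ys, w, b):
    """Closed-form dl/dw = (2/n) * sum((w*x + b - y) * x)."""
    n = len(xs)
    return 2.0 * sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n

xs, ys = [0.0, 1.0, 2.0], [10.0, 12.0, 14.0]
w, b, eps = 0.0, 0.0, 1e-6
# Central difference: slope of the loss between w - eps and w + eps.
numeric = (mse_loss(xs, ys, w + eps, b) - mse_loss(xs, ys, w - eps, b)) / (2 * eps)
print(round(grad_w(xs, ys, w, b), 2))  # -26.67
print(abs(numeric - grad_w(xs, ys, w, b)) < 1e-4)  # True
```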

Bias Gradient Calculation:

Similarly, the bias gradient is calculated using the following steps:

dl/db = (dl/de)(de/dd)(dd/dc)(dc/db) - Equation 9

where l = (1/n) * e

e = (d)^2

d = c - y

c = a + b

The partial derivatives for each intermediate variable are as follows:

dl/de = 1/n,

de/dd = 2d,

dd/dc = 1,

dc/db = 1

Substituting the above values in Eq. 9, we get the following expression:

$dl/db = \frac{1}{n} \sum_{i=1}^{n} 2\,d(i) \cdot 1 \cdot 1$ - Equation 10

Expanding each intermediate variable, this equates to:

$dl/db = \frac{2}{n} \sum_{i=1}^{n} \left( w\,x(i) + b - y(i) \right)$ - Equation 11
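The bias gradient is simply twice the average residual, which a short sketch makes visible. The helper name `grad_b` and the sample points are assumptions. The second call uses parameters that fit these three points exactly, so every residual is zero.

```python
def grad_b(xs, ys, w, b):
    """Closed-form dl/db = (2/n) * sum(w*x + b - y)."""
    n = len(xs)
    return 2.0 * sum(w * x + b - y for x, y in zip(xs, ys)) / n

xs, ys = [0.0, 1.0, 2.0], [10.0, 12.0, 14.0]
# Residuals are -10, -12 and -14, so the gradient is 2 * (-36) / 3.
print(grad_b(xs, ys, w=0.0, b=0.0))   # -24.0
# A perfect fit leaves no residual, so there is nothing left to correct.
print(grad_b(xs, ys, w=2.0, b=10.0))  # 0.0
```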

Number Crunching:

Now let's apply some numbers to demonstrate the above method.

The first step is to create some sample data. For this purpose, we would consider the following to be our reference equation for input(X) and true output(Y):

y = 2 * x + 10 - Equation 12

Thus, the goal of the optimization algorithm is to bring the values of weight (w) and bias (b) as close as possible to 2 and 10, respectively.

Based on this reference equation, we create a sample of input and output data as follows:

X = (0,1,2,3,4,5,6,7,8)

Y = (10,12,14,16,18,20,22,24,26) (based on Equation 12)

We now need to assign starting values to w and b for the algorithm to initiate the computation. Let’s set w = 0 and b = 0. Also, we assume a learning rate (lr) of 0.01 and a batch size of 3. Choosing a batch size of 3 breaks our X and Y values into the following batches:

Batch 1:

x: 0,1,2

y: 10,12,14

Batch 2:

x: 3,4,5

y: 16,18,20

Batch 3:

x: 6,7,8

y: 22,24,26

Each batch is executed serially, starting from Batch 1.
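The batching step can be sketched with simple list slicing. The helper name `make_batches` is an assumption. It keeps the data in its original order, matching the serial execution described here, whereas SGD in practice usually shuffles the data first.

```python
def make_batches(xs, ys, batch_size):
    """Split paired inputs and targets into consecutive, non-shuffled batches."""
    return [
        (xs[i:i + batch_size], ys[i:i + batch_size])
        for i in range(0, len(xs), batch_size)
    ]

X = [0, 1, 2, 3, 4, 5, 6, 7, 8]
Y = [2 * x + 10 for x in X]  # reference equation y = 2x + 10
for x_batch, y_batch in make_batches(X, Y, batch_size=3):
    print(x_batch, y_batch)
# [0, 1, 2] [10, 12, 14]
# [3, 4, 5] [16, 18, 20]
# [6, 7, 8] [22, 24, 26]
```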

Applying the calculations to the above dataset and assumptions, we get the following values for each batch:

Batch 1:

n: 3

dl/dw: -26.67

dl/db: -24

lr: 0.01

w1: 0.267

b1: 0.24

loss: 146.67

Batch 2:

n: 3

dl/dw: -135.86

dl/db: -33.39

lr: 0.01

w2: 1.63

b2: 0.57

loss: 280.67

Batch 3:

n: 3

dl/dw: -169.19

dl/db: -24.10

lr: 0.01

w3: 3.32

b3: 0.81

loss: 145.28

For each batch, we observe that the gradient values and weight and bias updates have been made as per the following equations:

dl/dw as per Eq. 8

dl/db as per Eq. 11

w(i) as per Eq. 4

b(i) as per Eq. 5

loss as per Eq. 3
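For readers who want to reproduce the numbers without a framework, the sketch below applies Equations 3, 4, 5, 8 and 11 batch by batch in plain Python. It assumes a learning rate of 0.01, as listed in the batch results, and the variable names are chosen for this illustration.

```python
X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
Y = [2.0 * x + 10.0 for x in X]  # reference equation y = 2x + 10
w, b, lr, batch_size = 0.0, 0.0, 0.01, 3

for start in range(0, len(X), batch_size):
    xs, ys = X[start:start + batch_size], Y[start:start + batch_size]
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]            # d = ypred - y
    loss = sum(d ** 2 for d in residuals) / n                      # Eq. 3
    grad_w = 2.0 * sum(d * x for d, x in zip(residuals, xs)) / n   # Eq. 8
    grad_b = 2.0 * sum(residuals) / n                              # Eq. 11
    w, b = w - lr * grad_w, b - lr * grad_b                        # Eq. 4 and Eq. 5
    print(f"loss={loss:.2f} dl/dw={grad_w:.2f} dl/db={grad_b:.2f} w={w:.3f} b={b:.3f}")
# loss=146.67 dl/dw=-26.67 dl/db=-24.00 w=0.267 b=0.240
# loss=280.67 dl/dw=-135.86 dl/db=-33.39 w=1.625 b=0.574
# loss=145.28 dl/dw=-169.19 dl/db=-24.10 w=3.317 b=0.815
```

Note that the loss reported for each batch is computed with the parameters as they were before that batch's update.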

Verification:

We broke down all the calculations and ran three batches by hand to obtain the values. To confirm their accuracy, let's cross-check the results by running the same optimization technique in Python with PyTorch on the identical data set. The code snippet is shown below:

    import torch
    import torch.nn as nn
    from torch import optim
    from torch.utils.data import TensorDataset, DataLoader

    # 1. Sample data: 9 points from y = 2x + 10 (1 feature, batches of 3)
    x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    y = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0]
    tensor_x = torch.tensor(x)
    tensor_y = torch.tensor(y)
    dataset = TensorDataset(tensor_x, tensor_y)
    # shuffle=False keeps the batches in the same order as the manual calculation
    dataloader = DataLoader(dataset, batch_size=3, shuffle=False)

    # 2. Define hyper-parameters
    lr = 0.01   # learning rate
    epochs = 1  # a single epoch, for illustrative purposes

    # 3. Define the model (one linear unit: y = wx + b)
    class Simple_model(nn.Module):
        def __init__(self):
            super().__init__()
            # Start from w = 0 and b = 0, as in the manual calculation
            self.weight = nn.Parameter(torch.zeros(1))
            self.bias = nn.Parameter(torch.zeros(1))

        def forward(self, xb):
            return xb * self.weight + self.bias

    model = Simple_model()

    # Define loss metric and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    # 4. Training loop
    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(dataloader):
            # Print the state at the start of this batch
            # (gradients are None before the first backward pass)
            print(f"---------- Epoch {epoch}, Batch {batch_idx} ----------")
            print(f"Starting W_grad: {model.weight.grad}, "
                  f"Starting B_grad: {model.bias.grad}, "
                  f"Starting Weight: {model.weight.item():.4f}, "
                  f"Starting Bias: {model.bias.item():.4f}")

            # Forward pass
            output = model(data)
            loss = criterion(output, target)

            # Backward pass (gradient calculation)
            loss.backward()

            # Update weight and bias
            optimizer.step()

            # Access gradients for this batch
            weight_grad = model.weight.grad
            bias_grad = model.bias.grad

            print(f"Updated W_grad: {weight_grad.item():.4f}, "
                  f"Updated B_grad: {bias_grad.item():.4f}, "
                  f"Loss: {loss.item():.4f}, "
                  f"Updated Weight: {model.weight.item():.4f}, "
                  f"Updated Bias: {model.bias.item():.4f}")

            # Reset gradients so they do not accumulate into the next batch
            optimizer.zero_grad()

Running the above code, we get the same values as calculated above for each batch.