Lecture 2: Introduction to PyTorch
1 Course Reminders
• Homeworks 0 and 1 have been released on the course website.
• Start early!
1. Get a dataset - Obtain a dataset through online sources, manual data collection, web scraping, etc. You could use private datasets, Kaggle datasets, scientific datasets (NSF, NIH, etc.), and many more. You may further want to process and understand the data:
(a) Exploratory data analysis - Explore the available data points and features, and analyze and clean them if necessary. This is very important, as data can be inherently biased, incorrect, or imbalanced.
(b) Normalization strategies - feature normalization, feature scaling etc.
(c) Split into train-validation-test sets
(d) Class imbalance in the data - some classes may have far more examples than others; for example, in cancer datasets, negative (absence of cancer) data points significantly outnumber the positive ones.
(e) More data generally leads to better performance
2. Data Augmentation - Make your model more robust to variations in the data by augmenting it with transformed copies of existing data points: rotated, cropped, skewed, padded, etc.
3. Write a data loader - which shuffles and batches the data for mini-batch gradient descent (covered in detail later). A data-pipeline sketch covering steps 1-3 appears after this list.
4. Define a neural network - This is where PyTorch's forward/backward machinery is awesome: you define the network structure as the sequence of operations in the forward pass, and there is no need to define the backward pass unless you use a non-differentiable function and must write an approximation for its gradient.
5. Define a loss/objective function - Write a suitable loss/objective function that correctly represents the problem you are trying to solve. The choice of the loss function can greatly affect convergence. Optimize the function that matters, i.e., choose a function that actually measures what you care about, and ideally one whose value you can interpret. Also, consider regularization to avoid overfitting: dropout, weight decay, early stopping, or getting more varied data.
6. Optimize the neural network - The optimizer finds a good solution on the training data, but optimizers also have all kinds of inductive biases, so their choice affects generalization performance. Most of the time you will have to babysit the optimization process, staring at the loss curve to ensure your model is free of bugs, your learning rate is reasonable, and everything is working as expected. A training-loop sketch covering steps 4-7 appears after this list.
7. Test performance using a suitable metric, and save the model if you are satisfied with the performance. Testing the performance of the model is always important: validation-set and training-set features might be similar, so performance on the validation set might not represent performance on unseen data. It is important to test on unseen data with varied features so as to estimate how the model will perform when deployed in the real world.
8. Deploy the model in a production setting. This is where your model will be affected by scale; latency and compute might be major points of consideration in critical applications like self-driving.
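As a rough sketch of steps 1-3 on a small synthetic dataset (the feature count, split sizes, and batch size below are arbitrary assumptions, not requirements of the homework):

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Hypothetical dataset: 1000 examples with 20 features and a binary label (step 1).
features = torch.randn(1000, 20)
labels = torch.randint(0, 2, (1000,))

# Feature normalization (step 1b): zero mean, unit variance per feature.
features = (features - features.mean(dim=0)) / features.std(dim=0)

# Train-validation-test split (step 1c). For image data, augmentation (step 2)
# is typically added here via torchvision.transforms.
dataset = TensorDataset(features, labels)
train_set, val_set, test_set = random_split(dataset, [700, 150, 150])

# Data loaders (step 3): shuffle and batch the data for mini-batch gradient descent.
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)
test_loader = DataLoader(test_set, batch_size=32)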
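Continuing from the loaders above, a minimal sketch of steps 4-7 (the architecture, loss, optimizer, and file name are illustrative choices, not prescriptions):

import torch
import torch.nn as nn
import torch.optim as optim

# Step 4: define the network; only the forward pass needs to be written.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.layers(x)

model = SmallNet()
criterion = nn.CrossEntropyLoss()                                       # step 5: loss function
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)   # step 6: optimizer

for epoch in range(10):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()          # gradients via the autograd graph
        optimizer.step()

    # Step 7: monitor a suitable metric (here accuracy) on held-out data.
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    print(f"epoch {epoch}: validation accuracy = {correct / len(val_set):.3f}")

torch.save(model.state_dict(), "model.pt")   # save if the performance is acceptable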
Before we jump into what computational graphs are and how they are useful, let's look at two examples of where decomposing a computation into elementary operations is useful.
We can observe that a good approximation of the logarithm can be achieved using just the elementary operations (addition, subtraction, multiplication, division, exponentiation).
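The worked example itself is not reproduced in these notes, but as one illustration of the idea (not necessarily the one from lecture), the sketch below approximates log(a) by applying Newton's method to e^x - a = 0, which needs only the elementary operations listed above; the starting point and iteration count are arbitrary choices.

import math

def approx_log(a, num_iters=20):
    # Newton's method on f(x) = exp(x) - a gives the update
    # x <- x - 1 + a * exp(-x), which converges to log(a).
    x = 0.0                              # arbitrary starting guess
    for _ in range(num_iters):
        x = x - 1.0 + a * math.exp(-x)
    return x

print(approx_log(10.0))                  # ~2.302585, close to math.log(10.0)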
A computational graph tracks changes in intermediate variables used in a computation, and consists of nodes (which each represent the value and function used in an intermediate computation) and edges (which represent a direct computation between two nodes). Derivatives are stored on the reverse of each edge: if a value v is computed directly from u, the reversed edge (v, u) stores the partial derivative ∂v/∂u.
b = w1 a
c = w2 a
d = w3 b + w4 c
L = f(d)
Notice that an edge appears for each pair of values in which there is a direct computation between
them (for example, (a, c) is an edge since c is computed via c = w2 a).
The core derivative storage happens when we reverse each edge:
[2] https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec
Each edge now stores the partial derivative relating its endpoints. For example, the new reversed edge (d, c) stores the value

∂d/∂c = ∂(w3 b + w4 c)/∂c = w4
Now, why is this computational graph useful in computing derivatives? This is all due to the chain rule of derivatives: to compute ∂L/∂w1, we can trace along the path from L to w1 and compute

∂L/∂w1 = (∂L/∂d) · (∂d/∂b) · (∂b/∂w1)

so computing derivatives of a function reduces to tracing paths through the graph!
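Jumping ahead slightly to PyTorch (introduced in the next section), the following sketch builds exactly this graph with autograd and checks that the gradient it computes for w1 matches the hand-derived chain-rule product; the numeric values and the choice f(d) = d^2 are arbitrary assumptions.

import torch

# Leaf values; requires_grad marks the weight whose gradient we want.
a = torch.tensor(2.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.0)
w3 = torch.tensor(3.0)
w4 = torch.tensor(0.25)

# The computational graph from above: b = w1*a, c = w2*a, d = w3*b + w4*c, L = f(d).
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = d ** 2                     # an arbitrary choice for f(d)

L.backward()                   # walk the reversed edges, multiplying partial derivatives

# Chain rule by hand: dL/dw1 = dL/dd * dd/db * db/dw1 = (2*d) * w3 * a.
manual = 2 * d.detach() * w3 * a
print(w1.grad, manual)         # the two values agree (30.0)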
5 PyTorch
PyTorch was developed to bring all of this computational graph formation and backpropagation together in a central framework. It is based on Torch, a scientific computing library written in Lua, and was developed by Facebook AI Research. It features this computational graph formation as well as built-in GPU acceleration, which significantly speeds up model training and testing.
import torch

a = torch.rand(10, 10, 5)
This creates a 10 × 10 × 5 tensor. These tensors can be freely manipulated; their values can be
changed to yield new tensors, they can be stacked to create new ones, and so on.
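For instance (a small sketch; the shapes and operations below are arbitrary examples, not from the notes):

b = torch.rand(10, 10, 5)
c = a + b                      # elementwise arithmetic yields a new tensor
d = torch.stack([a, b])        # stacking creates a 2 x 10 x 10 x 5 tensor
e = a.reshape(100, 5)          # reshaping views the same data with a new shape
a[0, 0, 0] = 1.0               # values can also be changed in place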
To track gradients through tensors, we define torch variables, which are nothing but gradient-enabled tensors. Every variable object has several members, some of which are:
1. data: the data a variable is holding. In the snippet sketched after this list, x holds a 1x1 tensor with the value 1.0 while y holds 2.0; z holds the product of the two, i.e., 2.0.
2. requires_grad: if this member is True, PyTorch starts tracking the entire operation history and forms a backward graph for gradient calculation. For an arbitrary tensor a, it can be set in place as follows: a.requires_grad_(True).
3. grad: grad holds the value of the gradient. If requires_grad is False it will hold a None value. Even if requires_grad is True, it will hold None until the .backward() function is called from some other node. For example, if you call out.backward() for some variable out that involved x in its calculations, then x.grad will hold ∂out/∂x.
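The snippet defining x, y, and z is not reproduced in these notes; the following is a minimal reconstruction consistent with the description above (the variable names follow the text, everything else is an assumption):

x = torch.ones(1, 1, requires_grad=True)   # data: a 1x1 tensor holding 1.0
y = torch.full((1, 1), 2.0)                # holds 2.0; gradients not tracked
z = x * y                                  # holds the product, i.e. 2.0

print(x.grad)    # None: .backward() has not been called yet
z.backward()     # propagate gradients along the backward graph
print(x.grad)    # now holds dz/dx = y = 2.0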
5.2 Torchvision
The torchvision library, which is part of PyTorch, contains many of the important datasets, models, and transformation operations generally used in the field of computer vision. It allows you to import datasets without any hassle.
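For example, a minimal sketch of loading MNIST with a normalization transform (the download directory is an arbitrary choice; the mean/std values are the commonly quoted MNIST statistics):

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # commonly used MNIST mean and std
])

train_data = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)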
5.3 Torchtext
Torchtext is a package that contains several data processing utilities, such as word embeddings, as well as many popular datasets for natural language processing.
5.4 GPU acceleration
The way PyTorch makes all this computation fast is by porting computations to the GPU, which enables massive parallelization (for example, matrix multiplication can be massively parallelized since many rows and columns are multiplied simultaneously). PyTorch does this by employing CUDA, NVIDIA's parallel computing platform for programming the GPU.
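In practice, moving computation to the GPU is a one-line change. A small sketch (it assumes a CUDA-capable GPU is available and falls back to the CPU otherwise):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(1024, 1024, device=device)   # allocate directly on the GPU
w = torch.rand(1024, 1024).to(device)       # or move an existing tensor over
y = x @ w                                   # the matrix multiply runs on the GPU
print(y.device)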