OpTorch: Optimized deep learning architectures for resource limited environments

Salman Ahmed, Hammad Naveed
CBRL, Department of Computer Science, NUCES-FAST, Islamabad

Abstract—Deep learning algorithms have made many breakthroughs and have various applications in real life. Computational resources become a bottleneck as the data and complexity of the deep learning pipeline increase. In this paper, we propose deep learning pipelines optimized in multiple aspects of training, including time and memory. OpTorch is a machine learning library designed to overcome weaknesses in existing implementations of neural network training. OpTorch provides features to train complex neural networks with limited computational resources. OpTorch achieved the same accuracy as existing libraries on the Cifar-10 and Cifar-100 datasets while reducing memory usage to approximately 50%. We also explore the effect of weights on total memory usage in deep learning pipelines. In our experiments, parallel encoding-decoding along with sequential checkpoints results in much improved memory and time usage while keeping the accuracy similar to existing pipelines. The OpTorch python package is available at https://github.com/cbrl-nuces/optorch.

Index Terms—Neural networks, Deep learning, Neural network optimization

I. INTRODUCTION

NEURAL NETWORKS (NN) are mathematical models designed to mimic the information processing of the human brain. Generally, a NN consists of different layers of neurons which communicate with each other to make a decision. Neurons and the number of layers are key components in designing different architectures. A NN learns to activate neurons based on the inputs passed to it [1].

Deep Neural Networks have shown improved performance on a wide variety of applications [2]. Typically, growing the size and complexity of a neural network results in improved accuracy. As the model size grows, the memory and time required to train such networks also increase. One of the notable achievements of deep learning in 2020 was OpenAI's introduction of GPT-3. The main takeaway from that study is that complex and large models can solve complex problems in language modeling [3].

This paper proposes two different kinds of optimization: a) Gradient-flow and b) Data-flow. Gradient-flow optimization refers to optimization within the architecture, while Data-flow optimization refers to optimization outside the NN architecture, including data passage, selection and augmentation.

There are multiple existing Gradient-flow optimizations, such as Mixed-precision training [4]. Mixed-precision implementations use the half-precision format, which stores floating-point numbers with less precision than single precision. Mixed precision uses both 16-bit and 32-bit floating-point types during model training; this makes training run faster and use less memory, while certain parts of the model are kept in 32-bit types for numeric stability.

Sequential models are executed as a sequence of lists of layers called segments. Therefore, we can execute sequential models in segments and checkpoint each segment. All segments except the last are executed in a way that does not store the intermediate activations; instead, the inputs of each segment are saved for re-running the segment in the backward pass. Existing implementations store the outputs of intermediate activations before calculating gradients. In some situations, for example when the NN is small enough to fit in memory, it is better to store the outputs of the activations before calculating the gradients in order to speed up learning. Other existing techniques that optimize neural network training include small mini-batch training combined with gradient accumulation. Some layers or functions in neural networks, such as batch normalization, work better with a large batch size. Lin et al. trained NNs using small mini-batches and accumulation of gradients to show the effect of batch accumulation [5].

To solve complex problems we need deep neural networks large enough to understand the problem. Existing standard implementations cannot be used to train large networks with limited computational resources. OpTorch provides Mixed-precision flexibility as well as Sequential checkpoints. Instead of saving all activation outputs, OpTorch stores a limited number of activations and recomputes the rest at run-time. OpTorch provides multiple Data-flow optimizations as well. We propose an idea to generate batches in parallel, similar to Chen et al. [6]; however, we encode and compress batches before passing them to the NN, saving memory and data-passage time by up to 16X. Such compression reduces training time by at least 20%.

We performed training experiments with enhanced versions of well-known architectures including Resnets (18, 34, 50) [7], EfficientNets (B0, B1, B2, B3, B4, B5, B6, B7) [8], and Inception-V3 [9] on the Cifar-10 and Cifar-100 datasets using OpTorch.

II. METHODS

A. Image Data-flow Optimization

We designed an improved representation of image data. We encode multiple images into one matrix of the same size and pass this matrix to the network, saving up to 16X memory. OpTorch contains a custom layer for the NN to decode such encoded images. This enhancement in the pipeline allows unique data selection methods such as selective-batch-sampling (SBS) to augment data and improve the training time.
1) Encoding and Selective-batch-sampling (SBS): On-the-fly pre-processing for specific classes can be performed using SBS, meaning that a specific number of images of a specific class can be selected within each batch. It can be seen as controlling each batch and the ratio of each class in a batch during training. Encoding images allows us to perform pre-sampling and controlled augmentation. The pixels in an image take values between 0 and 255. The pixels at the same position x in N images can be encoded using the equation below:

A[x] = Σ_{i=0..N-1} 256^i × M_i[x]

where M_i represents the matrix of the i-th image, N represents the number of images, and x represents the location of a pixel.

Batch controlling allows us to apply specific augmentations to specific classes. It also makes it easy to apply state-of-the-art augmentations such as MixUp [10], CutMix [11] and AugMix [12] to specific combinations of classes. Algorithm 1 explains the flow to encode a batch of images into a single matrix of the same size.

Algorithm 1 Encode
  Initialize X                ▷ X: Batch of images (H×W×C×B)
  Initialize A                ▷ A: Empty array of size H×W×C
  Initialize Z                ▷ Z: Number of images (Z ≤ 16)
  Initialize i = 0
  while i ≠ Z do
      img = X[i]
      domain = 256^i
      A = A + (img × domain)
      i = i + 1
  end while

Encoding images creates new possibilities for SBS and image augmentation. Based on this encoding, we provide a pipeline for SBS which allows each batch to contain a number of images determined by the class weights. SBS allows pre-processing the data differently for each class during training. Algorithm 2 explains the flow of SBS in our pipeline. It starts by initializing the number of examples for each class in a batch and then selects the specified number of examples for each class in each batch. This process allows us to pre-process each class differently in each batch.

Algorithm 2 Selective-batch-sampling
  Initialize UC               ▷ UC: Unique classes
  Initialize N                ▷ N: Number of unique classes
  Initialize W                ▷ W: Class weights
  Initialize X                ▷ X: Array of images (H×W×C×T)
  Initialize Z                ▷ Z: Number of images per batch (Z ≤ 16)
  Initialize i = 0
  while i ≠ N do
      Select the subset of data belonging to class UC[i]
      Select W[i] × BatchSize examples for the batch
      Pre-process them and dump them into the batch
      i = i + 1
  end while
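As an illustration of this sampling scheme, the sketch below builds one class-controlled batch in plain NumPy. The build_sbs_batch helper and its weight format (a class-to-fraction mapping) are hypothetical and not part of the OpTorch API; they only mirror the idea of Algorithm 2.

import numpy as np

def build_sbs_batch(images, labels, class_weights, batch_size, rng):
    """Sketch of Algorithm 2: pick W[c] * batch_size examples of each class c."""
    chosen = []
    for cls, weight in class_weights.items():
        n_cls = int(round(weight * batch_size))        # examples of this class per batch
        cls_idx = np.flatnonzero(labels == cls)        # subset of data for this class
        chosen.append(rng.choice(cls_idx, size=n_cls, replace=False))
    idx = np.concatenate(chosen)
    rng.shuffle(idx)                                   # mix the classes inside the batch
    return images[idx], labels[idx]                    # per-class augmentation could be applied here

# Example: half of every batch from class 0, a quarter each from classes 1 and 2.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 3, size=1000)
batch_x, batch_y = build_sbs_batch(images, labels, {0: 0.5, 1: 0.25, 2: 0.25}, 16, rng)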
2) Decoding: We designed a custom deep learning layer to decode each input matrix back into the original images. Algorithm 3 explains the flow to decode a single encoded matrix into a batch of images. We take each pixel modulo 256 to recover an image, then integer-divide the matrix by 256 and repeat to obtain the next image.

Algorithm 3 Decode
  Initialize A                ▷ A: Encoded array of size H×W×C
  Initialize X                ▷ X: Empty batch of images (H×W×C×B)
  Initialize Z                ▷ Z: Number of images (Z ≤ 16)
  Initialize i = 0
  while i ≠ Z do
      X[i] = A mod 256
      A = A div 256           ▷ Integer division
      i = i + 1
  end while
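To make Algorithms 1 and 3 concrete, the NumPy sketch below packs a small batch into one array and recovers it. For simplicity it packs only four 8-bit images into a uint64 array; packing 16 images per matrix, as described above, needs a wider representation, and these functions are illustrative rather than the actual OpTorch layer.

import numpy as np

def encode(batch):
    """Algorithm 1 in miniature: pack 8-bit images (B, H, W, C) into one array."""
    packed = np.zeros(batch.shape[1:], dtype=np.uint64)
    base = np.uint64(1)
    for img in batch:
        packed += img.astype(np.uint64) * base          # place image i at base-256 "digit" i
        base = base * np.uint64(256)
    return packed

def decode(packed, num_images):
    """Algorithm 3 in miniature: recover the original images from the packed array."""
    images = []
    for _ in range(num_images):
        images.append((packed % np.uint64(256)).astype(np.uint8))  # lowest base-256 digit
        packed = packed // np.uint64(256)                          # integer division shifts to the next image
    return np.stack(images)

batch = np.random.randint(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
assert np.array_equal(decode(encode(batch), 4), batch)             # lossless round trip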
3) OpTorch Loss-less Forced Encoding Option: Loss-less forced encoding is an extended version of Algorithm 1. The pixels in an image take values between 0 and 255. If we divide each pixel by 2, the range becomes 0-127, but this results in information loss. To make the encoding loss-less, we keep an offset recording whether the original pixel was odd or even; this offset allows the original pixel value to be recovered when decoding. With a float-64 datatype, Algorithm 1 can encode only 16 images, whereas Algorithm 4 can encode 32 images in a float-64 datatype plus 32 bits for the offsets.

Algorithm 4 Loss-less Forced Encoding
  Initialize X                ▷ X: Batch of images (H×W×C×B)
  Initialize A                ▷ A: Empty array of size H×W×C
  Initialize O                ▷ O: Empty Boolean array of size H×W×C×B
  Initialize Z                ▷ Z: Number of images (Z ≤ 32)
  Initialize i = 0
  while i ≠ Z do
      img = X[i]
      offset = img mod 2
      img = img div 2         ▷ Halved pixels fit in the range 0-127
      domain = 128^i
      A = A + (img × domain)
      O[i] = offset
      i = i + 1
  end while

4) OpTorch Parallel Encoding-Decoding (E-D): We adjusted the training pipeline by introducing a new thread to encode batches in parallel. While an epoch is training, a thread shuffles the images, encodes the batches and prepares the input for the next epoch. Figure 1 explains the pipeline of training and encoding batches in parallel to improve training time.
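A minimal sketch of this parallel pipeline is shown below. The encode_epoch and train_step callables are placeholders for the shuffling/encoding logic and the actual training step; the real OpTorch implementation may structure its threads differently.

import threading
import queue

def train_with_parallel_encoding(dataset, num_epochs, encode_epoch, train_step):
    """Figure 1 in miniature: encode the next epoch's batches on a background
    thread while the current epoch is being trained."""
    prefetched = queue.Queue(maxsize=1)                # holds at most one pre-encoded epoch

    def encoder():
        for epoch in range(num_epochs):
            prefetched.put(encode_epoch(dataset, epoch))   # shuffle + encode + dump

    threading.Thread(target=encoder, daemon=True).start()

    for epoch in range(num_epochs):
        for batch in prefetched.get():                 # blocks only if encoding is still running
            train_step(batch)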

Figure 1. Optimized flow of the training pipeline. The flow starts with a check on whether the data has already been dumped in encoded form. If not, a thread starts to encode the batches, performs all pre-processing/augmentation, and dumps them. Training starts once the data has been dumped for the first time. In parallel with training, a new thread starts to encode the batches for the next epoch and dumps these encoded batches.

Figure 2. Comparison of FP16 (half-precision floating point) and FP32 (single-precision floating point) formats.

Figure 3. Mechanism of Mixed-precision training: FP16 weights are converted to FP32 before calculating the loss and gradients, and then converted back to FP16 to update the weights.

Figure 4. Operations performed by a neuron.

Figure 5. Sigmoid activation function.

Figure 6. Neural network representation.

B. Gradient-flow Optimization

Gradient-flow optimization refers to optimization within an architecture. OpTorch provides multiple gradient-flow optimization features such as Mixed-precision training, Sequential checkpoints, and recommendations for placing sequential checkpoints.

1) Mixed-precision training: The technical standard used to represent floating-point numbers in binary format is IEEE 754, established in 1985 by the Institute of Electrical and Electronics Engineers. Traditionally, single precision (FP32) is used to represent parameters in deep learning. In the FP32 format, 1 bit is reserved for the sign, 8 bits for the exponent (-126 to +127) and 23 bits for the mantissa. In half precision (FP16), 1 bit is reserved for the sign, 5 bits for the exponent (-14 to +15), and 10 bits for the mantissa, as shown in Figure 2.

In standard training, FP32 is used to represent the model parameters, which increases memory usage significantly. In Mixed-precision training, FP16 is used to store the model weights, but the reduced precision can hurt accuracy. Mixed precision handles this effect by converting the FP16 weights to FP32 before the loss and gradient calculations, while still storing the weights in FP16 format, as shown in Figure 3.
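OpTorch exposes mixed precision through its pipeline options; the paper does not spell out the call, so the sketch below shows the same FP16/FP32 mechanism using stock PyTorch's torch.cuda.amp, with a placeholder model, batch and loss.

import torch

scaler = torch.cuda.amp.GradScaler()         # scales the loss so small FP16 gradients do not underflow

def mixed_precision_step(model, batch, target, optimizer, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # eligible ops run in FP16, the rest stay in FP32
        output = model(batch)
        loss = criterion(output, target)
    scaler.scale(loss).backward()            # back-propagate the scaled loss
    scaler.step(optimizer)                   # unscale gradients, then apply the optimizer update
    scaler.update()                          # adjust the scale factor for the next step
    return loss.item()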
2) Sequential-checkpoint training: A sequential neural network can be represented as a stack of layers of neurons, and each layer as a cluster of neurons. A neuron takes inputs and performs some mathematical operations to produce an output. Figure 4 shows the mathematical function of a neuron: given 3 inputs [x1, x2, x3], they are multiplied by the weights [w1, w2, w3] and added to the biases [b1, b2, b3]. After these operations, the neuron applies a non-linear transformation, called an activation function, to obtain a value. This activation function usually maps values to a range of 0 to 1 to help the neural network learn. Figure 5 shows the graph of a non-linear activation function called sigmoid; other activation functions are available as well [13].
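A minimal NumPy version of the neuron in Figure 4, with arbitrary example values, is shown below.

import numpy as np

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, followed by a sigmoid activation."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))          # squashes the output into (0, 1)

print(neuron(x=np.array([0.5, -1.0, 2.0]),   # inputs [x1, x2, x3]
             w=np.array([0.1, 0.4, -0.2]),   # weights [w1, w2, w3]
             b=0.3))                         # bias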
A collection of such neurons at the same level is called a layer, and a stack of layers forms a neural network. Figure 6 shows a simple neural network, read from left to right: the left-most layer is the input layer, the middle two layers are hidden layers, and the last layer produces all possible outputs. The +1 nodes represent the biases in each layer.

The connections between neurons carry weights. Back-propagation is used to train these weights by calculating partial derivatives w.r.t. the loss/error; the partial derivative w.r.t. the loss/error provides the change in a weight W needed to minimize the loss/error.

In practice, a NN of only a few hundred MBs can crash a GPU during training. The extra memory usage comes from two main kinds of information stored while training:
• the information necessary to back-propagate (gradients of intermediate activations with respect to the loss);
• the information necessary to calculate those gradients.
Both are essential, since the outputs of intermediate activations are required to apply the chain rule, but implementations can be flexible about how they are stored. To illustrate back-propagation and its memory overhead, consider the simple NN shown in Figure 7. The flow of back-propagation starts by calculating the derivative of O w.r.t. L. After this, we calculate the derivative of W3 w.r.t. L. The derivative of W3 w.r.t. L cannot be calculated directly because L does not depend directly on W3.

Figure 7. Flow of feed-forward and back-propagation in neural network.

Figure 8. GPU memory usage in one iteration. The batch consists of 16 images of size 512x512x3.

We calculate the derivative of W3 w.r.t. L with the chain rule, in the form of the following equation:

dW3/dL = (dW3/dN3out) × (dN3out/dO) × (dO/dL)

Similarly, the derivative of W2 w.r.t. L is calculated in the form of the following equation:

dW2/dL = (dW2/dN2out) × (dN2out/dW3) × (dW3/dN3out) × (dN3out/dO) × (dO/dL)

The weights of the neural network must be kept, but all the other information required to calculate gradients during back-propagation, such as N2out and N3out, can be recomputed at run-time. Existing implementations in standard libraries save all of these variables (N1out, N2out, and N3out), which sharply increases memory usage during the forward pass; the memory is released only after back-propagation.

OpTorch provides a simple way to minimize memory usage: it creates checkpoints at intermediate outputs. Instead of saving all of the variables N1out, N2out, and N3out, we create a checkpoint and save only N2out. To obtain N3out during back-propagation, a forward pass is re-run from the previous checkpoint N2out, and N3out is calculated directly. Similarly, N1out is calculated directly by doing a partial forward pass from the first layer. This scales to neural networks with hundreds of layers: instead of saving all of the layer outputs, we save a small number of checkpoints to control memory consumption.
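OpTorch wraps this behaviour behind its sc helper (shown in the Results section). For reference, a comparable effect can be sketched with PyTorch's built-in torch.utils.checkpoint.checkpoint_sequential, which keeps only the input of each segment and re-runs the segment during back-propagation; the toy model below is illustrative only.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy 8-block sequential network split into 4 checkpointed segments:
# only segment inputs are kept, intermediate activations are recomputed
# during the backward pass.
model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)])

x = torch.randn(16, 512, requires_grad=True)
out = checkpoint_sequential(model, 4, x)     # 4 segments
out.sum().backward()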
III. RESULTS

Large deep neural networks are required to understand complex problems, and existing standard implementations cannot be used to train large DNNs in resource-limited settings. We propose OpTorch, a library that allows large neural networks to be trained with limited computational resources. We compare the features of OpTorch with existing implementations in multiple environments based on memory consumption and time.

We first analyze the GPU memory consumption of the Resnet-18 architecture over one batch iteration. The batch consists of 16 images of size 512x512x3. Figure 8 shows the memory consumption when training Resnet-18 with multiple enhanced pipelines; the x-axis corresponds to the completion of one iteration of Resnet-18, and the y-axis shows memory consumption in MBs.

Creating checkpoints in Resnet-18 reduced memory consumption from 7000 MB to just 2000 MB. This memory optimization comes at a cost: training with checkpoints takes more time than with standard implementations because multiple sub-forward passes must be done within each forward pass at run-time.

To analyze the performance of the optimization methods on widely used image models, namely Resnets (18, 50, 101), EfficientNets (B0, B1, B2, B3, B4, B5, B6, B7), and Inception-V3, we use the CIFAR-10 dataset. CIFAR-10 contains 60,000 color images of size 32x32x3 in 10 object classes, with 6,000 images per class.

Figure 9 presents a detailed comparison of widely used image models under multiple optimization techniques w.r.t. accuracy and training time. All of these models are trained on a P100 GPU with a batch size of 16. Resnet 50 trained for 10 epochs using the standard pipeline achieved 93.3% accuracy in approximately 3800 seconds, whereas sequential checkpoints achieved almost the same accuracy in 4400 seconds. We infer from our experiments that all networks trained with checkpoint optimization achieve almost the same accuracy as the standard pipeline but take more time to train. However, Figure 8 and Figure 10 show that checkpoint optimization consumes much less memory than the other pipelines; for example, the sequential-checkpoints method reduced memory by more than 50% for Resnet 50 compared to the standard baseline pipeline (Figure 10). The image models (Resnet 50, EfficientNet-B0, Inception-V3) trained by combining the parallel encoding-decoding (E-D) and sequential-checkpoints (S-C) optimizations achieved the same accuracy as the baseline in less time and with much less memory consumption. This combination can be regarded as the most optimized FP32 pipeline w.r.t. memory, time and accuracy. Mixed precision combined with the E-D and S-C optimizations takes even less time and memory to train.

OpTorch provides features to combine these pipelines with a single command:

from optorch import sc
scmodel = sc(model)

We analyze the memory consumption of widely used image models under multiple optimization pipelines. Each bar in Figure 10 represents memory consumption in GBs.
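The per-iteration memory numbers reported in Figures 8 and 10 can be approximated for any pipeline with PyTorch's built-in counters; the helper below is a rough sketch with placeholder model, batch and loss, not the measurement script used for the paper.

import torch

def peak_memory_mb(model, batch, target, criterion):
    """Approximate peak GPU memory (MB) for one forward/backward iteration."""
    torch.cuda.reset_peak_memory_stats()
    loss = criterion(model(batch), target)
    loss.backward()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)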

Figure 9. Comparison of Time and Accuracy for 10 Epochs on CIFAR-10. E-D represents Encoding-Decoding Optimization pipeline, M-P represents
Mixed-Precision Optimization pipeline, whereas, S-C represents Sequential-Checkpoints Optimization pipeline. X-axis represents Time taken for 10 Epochs
and Y-axis represents Accuracy. Line between Baseline and E-D + S-C represents difference in achieved accuracy and time taken to train for 10 Epochs.
Dashed Line between M-P and E-D + M-P + S-C represents difference in achieved accuracy and time taken to train for 10 Epochs using Mixed Precision.

Figure 10. Memory Consumption Comparison of famous Image models and multiple optimization pipelines for 1 Batch iteration. B represents standard
baseline pipeline, E-D represents Encoding-Decoding Optimization pipeline, M-P represents Mixed-Precision Optimization pipeline, whereas, S-C represents
Sequential-Checkpoints Optimization pipeline. X-axis represents optimization pipelines and Y-axis represents memory consumption in GBs.
We performed this experiment by passing one batch of 16 images of size 512x512x3. The Resnet 50 standard pipeline consumes 2 GB, mixed precision consumes 1 GB, sequential checkpoints consumes 0.8 GB, and sequential checkpoints combined with mixed precision consumes 0.4 GB.

The Sequential-checkpoint optimization pipeline combined with the E-D pipeline allows us to train with much less memory, in some cases 10 times less than standard pipelines, as shown in Figure 10.

Figure 11. Best checkpoints in a simple neural network of 7 layers. C1 represents the first checkpoint, since we have to store the input; C3 represents the output of the neural network, which is saved to evaluate the loss during training; C2 represents a checkpoint at an intermediate layer.

IV. DISCUSSION & RECOMMENDATION

Due to memory limitations, the architectures of neural networks are constrained. One possible solution is to use more powerful GPUs; the other is to optimize the implementations. The depth of a neural network is directly proportional to the extra memory consumed while training: standard implementations save the output of each intermediate layer for back-propagation, so more layers means more memory consumption.

The checkpoint optimization pipeline solves the problem of extra memory consumption but trades it off against extra training time. The E-D data-flow pipeline combined with the checkpoint optimization gradient-flow pipeline resolves this trade-off between extra memory consumption and extra training time.

Consider the simple neural network of 7 layers shown in Figure 11. The most efficient way to create checkpoints is to design a middle layer with fewer parameters. This way the checkpoint sits on a small layer and consumes less memory than other checkpoints.

We recommend designing neural networks as auto-encoder or UNet-type architectures, as shown in Figure 11, where a small intermediate layer can be used as an optimal checkpoint. This results in much less memory consumption than other possible architectures.
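A sketch of this recommendation is shown below: a hypothetical auto-encoder-style module whose narrow middle layer is the cheap place to checkpoint. The layer sizes are illustrative only, and use_reentrant=False assumes a recent PyTorch.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class BottleneckNet(nn.Module):
    """Auto-encoder-style network: the 64-unit bottleneck is the checkpoint."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                     nn.Linear(512, 64), nn.ReLU())    # narrow middle layer
        self.decoder = nn.Sequential(nn.Linear(64, 512), nn.ReLU(),
                                     nn.Linear(512, 1024))

    def forward(self, x):
        # Only the inputs of each checkpointed segment are stored; the encoder
        # and decoder activations are recomputed during back-propagation, and
        # the stored bottleneck tensor is small.
        z = checkpoint(self.encoder, x, use_reentrant=False)
        return checkpoint(self.decoder, z, use_reentrant=False)

out = BottleneckNet()(torch.randn(16, 1024))
out.sum().backward()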

V. CONCLUSION

In this paper, we propose OpTorch, a machine learning library designed to overcome weaknesses in existing implementations of neural network training. OpTorch provides features to train complex neural networks with limited computational resources. OpTorch achieved the same accuracy as existing libraries on the Cifar-10 and Cifar-100 datasets while reducing memory usage to approximately 50%. In our experiments, parallel encoding-decoding along with sequential checkpoints results in much improved memory and time usage while keeping the accuracy similar to existing pipelines.

ACKNOWLEDGMENT

This work was supported by the National Center in Big Data and Cloud Computing (NCBC) and the National University of Computer and Emerging Sciences (NUCES-FAST), Islamabad, Pakistan.

REFERENCES

[1] H. Wang and B. Raj, "On the Origin of Deep Learning," ArXiv, 2017.
[2] S. R. Khaze, M. Masdari, and S. Hojjatkhah, "Application of Artificial Neural Networks in Estimating Participation in Elections," ArXiv, 2013.
[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," ArXiv, 2020.
[4] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," ArXiv, 2017.
[5] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, "Don't Use Large Mini-Batches, Use Local SGD," ArXiv, 2018.
[6] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, "Generative pretraining from pixels," Proceedings of Machine Learning Research, 2020.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," ArXiv, 2015.
[8] M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," ArXiv, 2019.
[9] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," ArXiv, 2015.
[10] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond Empirical Risk Minimization," ArXiv, 2017.
[11] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features," ArXiv, 2019.
[12] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty," ArXiv, 2019.
[13] C. E. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, "Activation Functions: Comparison of Trends in Practice and Research for Deep Learning," ArXiv, 2018.
