OpTorch Optimized Deep Learning Architectures For
Abstract—Deep learning algorithms have made many breakthroughs and have various applications in real life. Computational resources become a bottleneck as the data and com- [...] both 16-bit and 32-bit floating-point types during model training. This makes training run faster and use less memory. This [...]
such encoded images. This enhancement in the pipeline allows us to use unique data-selection methods such as selective-batch-sampling (SBS) to augment data and improve the training time.

1) Encoding and Selective-batch-sampling (SBS): On-the-fly pre-processing for specific classes can be performed using SBS, meaning that a specific number of images of a specific class can be selected within each batch. It can be viewed as controlling each batch, and the ratio of each class within a batch, during training. Encoding images allows us to perform pre-sampling and controlled augmentation. Pixel values in an image range between 0 and 255. The same positional pixel of N images can be encoded using the equation below:

$$A = \sum_{i=1}^{N} 256^{\,i-1} \ast M_i$$

where $M_i$ represents the matrix of the $i$-th image and $N$ represents the number of images; the sum is applied independently at every pixel location.

Batch controlling allows us to apply specific augmentations to specific classes. It makes it easy to apply state-of-the-art augmentations such as MixUp [10], CutMix [11], and AugMix [12] to specific combinations of classes. Algorithm 1 explains the flow to encode a batch of images into a single matrix of the same size.
Algorithm 1 Encode
Initialize X        ▷ X: Batch of images (H×W×C×B)
Initialize A        ▷ A: Empty array of size H×W×C
Initialize Z        ▷ Z: Number of images (Z ≤ 16)
Initialize i = 0
while i ≠ Z do
    img = X[i]
    domain = 256^i
    A = A + (img ∗ domain)
    i = i + 1
end while
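To make Algorithm 1 concrete, the sketch below packs a small batch of uint8 images into a single float64 array by treating each image as one base-256 digit. This is a minimal NumPy sketch of the encoding idea, not the OpTorch implementation; the function name pack_batch and the shapes are our own, and it assumes the batch is small enough for the packed values to stay exactly representable.

import numpy as np

def pack_batch(batch: np.ndarray) -> np.ndarray:
    """Encode a uint8 batch (B, H, W, C) into one float64 array (H, W, C).
    Each image is one base-256 digit: A = sum_i batch[i] * 256**i (sketch of Algorithm 1)."""
    A = np.zeros(batch.shape[1:], dtype=np.float64)
    for i in range(batch.shape[0]):
        A += batch[i].astype(np.float64) * (256.0 ** i)
    return A

# Example: pack 4 random 32x32 RGB images into a single 32x32x3 array.
batch = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
encoded = pack_batch(batch)   # encoded.shape == (32, 32, 3)

With four images every packed value stays below 2^32, so the float64 array represents it exactly and the encoding can be inverted without loss.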
Encoding images creates new possibilities for SBS and image augmentations. Based on this encoding, we provide a pipeline for SBS which allows each batch to contain a number of images determined by the class weights.

SBS allows data to be pre-processed differently for each class during training. Algorithm 2 explains the flow of SBS in our pipeline. It starts by initializing a variable with the number of examples for each class in a batch, and then selects the specified number of examples of each class for every batch. This process allows us to pre-process each class differently in each batch.

Algorithm 2 Selective-batch-sampling
Initialize UC        ▷ UC: Unique classes
Initialize N        ▷ N: Number of unique classes
Initialize W        ▷ W: Class weights
Initialize X        ▷ X: Array of images (H×W×C×T)
Initialize Z        ▷ Z: Number of images (Z ≤ 16)
Initialize i = 0
while i ≠ N do
    Select the subset of data for class UC[i]
    Select W[i] ∗ BatchSize examples for the batch
    Pre-process & dump into the batch
    i = i + 1
end while
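As a sketch of how such per-class quotas might be drawn (Algorithm 2), the helper below selects indices for one batch so that each class contributes a share proportional to its weight. The function name sbs_batch_indices and the normalisation of the weights are our assumptions rather than OpTorch's API.

import numpy as np

def sbs_batch_indices(labels, class_weights, batch_size, rng=None):
    """Pick indices for one batch so that class c contributes roughly
    class_weights[c] * batch_size examples (sketch of Algorithm 2)."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    total = sum(class_weights.values())
    chosen = []
    for c, w in class_weights.items():
        quota = max(1, round(batch_size * w / total))
        pool = np.flatnonzero(labels == c)               # subset of data for class c
        chosen.extend(rng.choice(pool, size=quota, replace=len(pool) < quota))
    return np.asarray(chosen[:batch_size])

# Example: class 0 fills ~75% of each batch, class 1 fills ~25%.
labels = np.array([0] * 90 + [1] * 10)
idx = sbs_batch_indices(labels, {0: 0.75, 1: 0.25}, batch_size=16)

The selected images of a given class can then be pre-processed or augmented (e.g., with MixUp or CutMix) before the batch is encoded and dumped.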
2) Decoding: We designed a custom deep learning layer to decode each input matrix back into the original images. Algorithm 3 explains the flow to decode a single encoded matrix into a batch of images of the same size. We take each pixel modulo 256 to obtain one image and then integer-divide the input matrix by 256 before extracting the next one.

Algorithm 3 Decode
Initialize A        ▷ A: Encoded array of size H×W×C
Initialize X        ▷ X: Empty batch of images (H×W×C×B)
Initialize Z        ▷ Z: Number of images (Z ≤ 16)
Initialize i = 0
while i ≠ Z do
    X[i] = A mod 256
    A = A div 256        ▷ Integer division
    i = i + 1
end while
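A decoding layer in the spirit of Algorithm 3 could be written as a small PyTorch module, as sketched below; the class name DecodeLayer and its interface are our assumptions, not the custom layer shipped in OpTorch.

import torch
import torch.nn as nn

class DecodeLayer(nn.Module):
    """Recover a batch of images from one base-256 packed tensor (sketch of Algorithm 3)."""
    def __init__(self, num_images: int):
        super().__init__()
        self.num_images = num_images

    def forward(self, packed: torch.Tensor) -> torch.Tensor:
        a = packed.clone()
        images = []
        for _ in range(self.num_images):
            images.append(torch.remainder(a, 256.0))  # X[i] = A mod 256
            a = torch.floor(a / 256.0)                # A = A div 256 (integer division)
        return torch.stack(images)                    # (num_images, H, W, C)

packed = torch.zeros(32, 32, 3, dtype=torch.float64)  # stand-in for the output of Algorithm 1
images = DecodeLayer(num_images=4)(packed)            # -> shape (4, 32, 32, 3)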
3) OpTorch Loss-less Forced Encoding Option: Loss-less forced encoding is an extended version of Algorithm 1. The domain of pixel values in an image is 0–255. If we divide each pixel by 2, the domain shrinks to 0–127, but this loses information. To make the encoding loss-less, we keep an offset recording whether each pixel was odd or even; this offset is used to recover the original pixel value during decoding. With the float-64 datatype, Algorithm 1 can encode only 16 images, whereas Algorithm 4 can encode 32 images in a float-64 array together with a 32-bit offset (one parity bit per image).

Algorithm 4 Loss-less Forced Encoding
Initialize X        ▷ X: Batch of images (H×W×C×B)
Initialize A        ▷ A: Empty array of size H×W×C
Initialize O        ▷ O: Empty Boolean array of size H×W×C×B
Initialize Z        ▷ Z: Number of images (Z ≤ 32)
Initialize i = 0
while i ≠ Z do
    img = X[i]
    offset = img mod 2
    img = img div 2        ▷ Integer division; domain becomes 0–127
    domain = 128^i
    A = A + (img ∗ domain)
    O[i] = offset
    i = i + 1
end while
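Under our reading of Algorithm 4, each pixel is split into a parity bit and a halved value; the halves are packed in base 128 and the parity bits are kept in a separate Boolean array so the original pixels can be recovered exactly. The sketch below illustrates this; the function names and dtypes are assumptions.

import numpy as np

def lossless_pack(batch: np.ndarray):
    """Pack uint8 images (B, H, W, C) into (float64 array, Boolean offsets) -- sketch of Algorithm 4."""
    offsets = (batch % 2).astype(bool)            # parity bit of every pixel
    halved = (batch // 2).astype(np.float64)      # domain shrinks to 0..127
    A = np.zeros(batch.shape[1:], dtype=np.float64)
    for i in range(batch.shape[0]):
        A += halved[i] * (128.0 ** i)
    return A, offsets

def lossless_unpack(A, offsets):
    """Invert lossless_pack and recover the original uint8 images."""
    images = []
    for i in range(offsets.shape[0]):
        digit = np.mod(A, 128.0)
        A = np.floor(A / 128.0)
        images.append((digit * 2 + offsets[i]).astype(np.uint8))
    return np.stack(images)

batch = np.random.randint(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
A, offsets = lossless_pack(batch)
assert np.array_equal(lossless_unpack(A, offsets), batch)   # the round trip is exact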
4) OpTorch Parallel Encoding-Decoding (E-D): We adjusted the training pipeline by introducing a new thread to encode batches in parallel. While an epoch is training, this thread shuffles the images, encodes the batches, and prepares the input for the next epoch. Figure 1 explains the pipeline of training and encoding batches in parallel to improve training time.

Figure 1. An optimized flow of the training pipeline. The flow starts with a check of whether the data has already been dumped in encoded form. If not, a thread encodes the batches, performs all pre-processing/augmentation, and dumps the result. Training starts once the data has been dumped for the first time. In parallel with training, a new thread encodes the batches for the next epoch and dumps them.
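The parallel pipeline of Figure 1 can be sketched with a background thread and a bounded queue: while the model trains on the current epoch, the worker shuffles the data, packs the batches (reusing the pack_batch sketch above), and hands them over for the next epoch. This is a generic Python threading sketch of the idea, not OpTorch's internal implementation; the queue size and helper names are assumptions.

import queue
import threading
import numpy as np

def prepare_epoch(dataset, out_queue, pack_fn, batch_size=4):
    """Background worker: shuffle, pack each batch, and hand it to the trainer."""
    order = np.random.permutation(len(dataset))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        out_queue.put(pack_fn(dataset[idx]))      # encoding runs off the training thread
    out_queue.put(None)                           # end-of-epoch marker

encoded_batches = queue.Queue(maxsize=8)          # bounded so the worker never runs too far ahead
dataset = np.random.randint(0, 256, size=(128, 32, 32, 3), dtype=np.uint8)
worker = threading.Thread(target=prepare_epoch,
                          args=(dataset, encoded_batches, pack_batch), daemon=True)
worker.start()

while (item := encoded_batches.get()) is not None:
    pass  # decode `item` on-device and run one training step here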
B. Gradient-flow Optimization

Gradient-flow optimization represents optimization within an architecture. OpTorch provides multiple gradient-flow op- [...]
Figure 2. Comparison of FP16 (half-precision floating point) and FP32 (single-precision floating point).
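For context on how FP16 and FP32 interact during mixed-precision training, the sketch below shows a standard PyTorch automatic-mixed-precision training step with torch.autocast and a gradient scaler. It is a generic AMP example for illustration, not the API of OpTorch's Mixed-Precision (M-P) pipeline.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()          # keeps small FP16 gradients from underflowing

images = torch.randn(16, 3, 32, 32, device="cuda")
labels = torch.randint(0, 10, (16,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(images), labels)  # forward runs mostly in FP16
scaler.scale(loss).backward()                 # gradients are scaled to preserve precision
scaler.step(optimizer)                        # unscales, then updates the FP32 master weights
scaler.update()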
Figure 9. Comparison of time and accuracy for 10 epochs on CIFAR-10. E-D represents the Encoding-Decoding optimization pipeline, M-P the Mixed-Precision optimization pipeline, and S-C the Sequential-Checkpoints optimization pipeline. The x-axis shows the time taken for 10 epochs and the y-axis the accuracy. The line between Baseline and E-D + S-C shows the difference in achieved accuracy and training time over 10 epochs; the dashed line between M-P and E-D + M-P + S-C shows the same difference when using mixed precision.
Figure 10. Memory-consumption comparison of well-known image models under multiple optimization pipelines for one batch iteration. B represents the standard baseline pipeline, E-D the Encoding-Decoding optimization pipeline, M-P the Mixed-Precision optimization pipeline, and S-C the Sequential-Checkpoints optimization pipeline. The x-axis shows the optimization pipelines and the y-axis the memory consumption in GB.
We performed this experiment by passing one batch of 16 images of size 512×512×3. The ResNet-50 standard pipeline consumes 2 GB, mixed precision consumes 1 GB, sequential checkpoints consume 0.8 GB, and sequential checkpoints with mixed precision consume 0.4 GB.

The Sequential-Checkpoints optimization pipeline combined with the E-D pipeline allows us to train with much less memory, in some cases 10 times less than standard pipelines, as shown in Figure 10.
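Per-batch memory figures of this kind can be reproduced in spirit with PyTorch's memory counters; the sketch below measures peak GPU memory for one forward/backward pass of a ResNet-50 on a single batch of 16 images of size 512×512×3. It is a measurement sketch under our own choices (model constructor, loss, and reported numbers will differ), not the script used to produce Figure 10.

import torch
import torchvision

model = torchvision.models.resnet50(num_classes=10).cuda()
images = torch.randn(16, 3, 512, 512, device="cuda")
labels = torch.randint(0, 10, (16,), device="cuda")

torch.cuda.reset_peak_memory_stats()
loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()                                   # activations kept for back-prop dominate the peak
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak memory for 1 batch iteration: {peak_gb:.2f} GB")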
IV. DISCUSSION & RECOMMENDATION
Due to memory limitations, the architectures of neural networks are constrained. One possible solution is to use more powerful GPUs; the other is to optimize the implementation. The depth of a neural network is directly proportional to the extra memory consumed during training: standard implementations save the output of every intermediate layer for back-propagation, so more layers mean more memory consumption.

The Checkpoint optimization pipeline solves the problem of extra memory consumption but trades it off against extra training time. The E-D data-flow pipeline combined with the Checkpoint gradient-flow optimization pipeline resolves this trade-off between extra memory consumption and extra time to train a neural network.
Consider a simple neural network of 7 layers, as shown in Figure 11. It shows that the most memory-efficient way to create checkpoints is to design a middle layer with few parameters; the checkpoint then sits on this small layer and consumes less memory than other possible checkpoints.

Figure 11. Best checkpoints in a simple neural network of 7 layers. C1 represents the first checkpoint, since the input has to be stored; C3 represents the output of the neural network, which is saved to evaluate the loss during training; C2 represents the checkpoint at an intermediate layer.
We recommend designing neural networks as auto-encoder- or UNet-type architectures, as shown in Figure 11, where a small intermediate layer can be used as an optimal checkpoint. This results in much lower memory consumption than other possible architectures.
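As an illustration of this recommendation, the sketch below checkpoints a small bottleneck-style network with torch.utils.checkpoint so that the wide activations inside each segment are recomputed during the backward pass and only the segment boundaries (the input, the narrow bottleneck, and the output) are kept. The layer sizes and the split into two segments are our illustrative choices, not the Sequential-Checkpoints (S-C) implementation in OpTorch.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Split the network so the segment boundary falls on the narrow layer (C2 in Figure 11).
encoder = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU(),
                        nn.Linear(2048, 2048), nn.ReLU(),
                        nn.Linear(2048, 64), nn.ReLU())       # down to the small bottleneck
decoder = nn.Sequential(nn.Linear(64, 2048), nn.ReLU(),
                        nn.Linear(2048, 10))                  # back up to the output

x = torch.randn(16, 1024, requires_grad=True)                 # C1: the input is stored anyway
z = checkpoint(encoder, x, use_reentrant=False)               # encoder activations are recomputed in backward
out = checkpoint(decoder, z, use_reentrant=False)             # C3: the output feeds the loss
out.sum().backward()                                          # between segments only x, the 64-d z, and out are kept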
V. CONCLUSION

In this paper, we propose OpTorch, a machine learning library designed to overcome weaknesses in existing implementations of neural network training. OpTorch provides features to train complex neural networks with limited computational resources. OpTorch achieved the same accuracy as existing libraries on the CIFAR-10 and CIFAR-100 datasets while reducing memory usage to approximately 50%. In our experiments,

ACKNOWLEDGMENT

This work was supported by the National Center in Big Data and Cloud Computing (NCBC) and the National University of Computer and Emerging Sciences (NUCES-FAST), Islamabad, Pakistan.

REFERENCES

[1] H. Wang and B. Raj, “On the Origin of Deep Learning,” ArXiv, 2017.
[2] S. R. Khaze, M. Masdari, and S. Hojjatkhah, “Application of Artificial Neural Networks in Estimating Participation in Elections,” ArXiv, 2013.
[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” ArXiv, 2020.
[4] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” ArXiv, 2017.
[5] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, “Don’t Use Large Mini-Batches, Use Local SGD,” ArXiv, 2018.
[6] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” Proceedings of Machine Learning Research, 2020.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” ArXiv, 2015.
[8] M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” ArXiv, 2019.
[9] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” ArXiv, 2015.
[10] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” ArXiv, 2017.
[11] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features,” ArXiv, 2019.
[12] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan, “AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty,” ArXiv, 2019.
[13] C. E. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, “Activation Functions: Comparison of Trends in Practice and Research for Deep Learning,” ArXiv, 2018.