Batch Size To Improve Result
ICT Express 6 (2020) 312–315
www.elsevier.com/locate/icte
Abstract
Many hyperparameters have to be tuned to obtain a robust convolutional neural network that can accurately classify images. One of the most important hyperparameters is the batch size, i.e., the number of images used in a single forward and backward pass. In this study, the effect of batch size on the performance of convolutional neural networks, together with the impact of the learning rate, is studied for image classification, specifically for medical images. To train the network faster, a VGG16 network with ImageNet weights was used in this experiment. Our results show that a higher batch size does not usually achieve higher accuracy, and that the learning rate and the optimizer used have a significant impact as well. Lowering the learning rate and decreasing the batch size allow the network to train better, especially in the case of fine-tuning.
© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Convolutional neural networks; Deep learning; Image classification; Medical images; Batch size
scratch. This experimental study aims at providing a better understanding of the batch size value to be considered before addressing a given problem through a CNN. In fact, despite the importance of the batch size value for the learning process of a CNN, the scientific literature provides only a few studies on this topic. Additionally, as discussed in Section 2, the results reported in the literature do not lead to unanimous conclusions, with some authors indicating a preference for large batch size values and other works suggesting the usage of small batch size values. The rest of the paper is organized as follows. In Section 2, previous research on batch size is presented. In Section 3, our methodology is presented. In Sections 4 and 5, we present our results and then the conclusion.

2. Literature review

Many hyperparameters need to be adjusted before training a CNN to classify images. One of the main hyperparameters that needs to be adjusted before beginning the training process is the batch size, i.e., the number of images that will be used in the gradient estimation process. Many researchers have studied the effect of batch size on network performance – either the accuracy of the network or the time taken until convergence – to determine which is better: small batches or large batches. On the one hand, a small batch size can converge faster than a large batch, but a large batch can reach optimum minima that a small batch size cannot reach. Also, a small batch size can have a significant regularization effect because of its high variance [9], but it requires a small learning rate to prevent it from overshooting the minima [10]. Below are some studies that investigated the pros and cons of using small and large batch sizes.

In 2017, Radiuk [11] investigated the effect of batch size on CNN performance for image classification. The author used two datasets in the experiment, namely the MNIST and CIFAR-10 datasets. Radiuk tested batch sizes that are powers of 2, from 16 up to 1024, as well as 50, 100, 150, 200, and 250. Radiuk opted for a LeNet architecture for the MNIST dataset and a custom-made network with five convolutional layers for the CIFAR-10 dataset. The optimizer used for both networks was stochastic gradient descent, with a learning rate of 0.001 for MNIST and 0.0001 for CIFAR-10. For both datasets, the best accuracy was achieved with a batch size of 1024, and the worst result was obtained with a batch size of 16. The author stated that, based on these results, the higher the batch size, the higher the network accuracy, meaning that the batch size has a huge impact on CNN performance.

Bengio [12] stated that a batch size of 32 is a good default value. He also stated that a larger batch size speeds up the computation of the network but decreases the number of updates required for the network to reach convergence. The author stated that the batch size likely impacts the convergence time rather than the network performance. Meanwhile, Masters and Luschi [13] tested the effect of batch sizes between 2^1 and 2^11 on the AlexNet [3] and ResNet [14] architectures, with SGD as the optimizer and without momentum, to exclude the effect of momentum on the training. The authors studied the effect of batch size on three datasets: CIFAR-10, CIFAR-100, and ImageNet. They stated that the best results were obtained with batch sizes between 2 and 32, and they noted that small batch sizes are more robust than large batch sizes.

In general, the main question regarding the batch size is which batch size is optimal for training CNNs so that the network achieves the highest accuracy in the shortest time, especially for complex datasets such as a medical image dataset.

3. Methodology

The training of a CNN to classify images can be defined as minimizing a non-convex loss function L(θ) by using an optimizer such as stochastic gradient descent or the Adam optimizer, where L(θ) is the average of the per-image training costs L_i(θ) over the dataset and M is the size of the image dataset:

\arg\min_{\theta \in \mathbb{R}^n} L(\theta); \qquad L(\theta) = \frac{1}{M} \sum_{i=1}^{M} L_i(\theta)

The gradient update can be computed in three ways: using the entire image dataset of size M, using a single image, or using a number of images between 1 and M. These methods are called batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, respectively. The batch size hyperparameter B is the number of images used to update the gradients at each step. Using the SGD optimizer, the network weights are updated according to the following equation:

w_{t+1} = w_t - \eta \frac{\partial L}{\partial w_t}; \qquad \frac{\partial L}{\partial w_t} = \nabla_W C(w_t; x^{(B)}; y^{(B)})

where η is the learning rate, x are the sample images used, y are the image labels, and w are the weights being updated. For the Adam optimizer, the weights are updated as follows:

w_t^i = w_{t-1}^i - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

where \hat{m}_t = m_t/(1-\beta_1^t), \hat{v}_t = v_t/(1-\beta_2^t), m_t = \beta_1 m_{t-1} + (1-\beta_1)\,\partial L/\partial w_t, and v_t = \beta_2 v_{t-1} + (1-\beta_2)\,[\partial L/\partial w_t]^2, with \partial L/\partial w_t = \nabla_W C(w_t; x^{(B)}; y^{(B)}). Here, β_i ∈ [0, 1] determines how much information is retained from the previous update, m_t is the first momentum, i.e., the running average of the gradients, and v_t is the second momentum, i.e., the running average of the squared gradients. The bias-corrected first and second momentums are m̂_t and v̂_t. As these equations show, the batch size and the learning rate influence each other, and together they can have a huge impact on the network performance.
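To make the role of the batch size B in these update rules concrete, the following is a minimal NumPy sketch (not the authors' code) of a single mini-batch SGD step and a single Adam step; the gradient function, data arrays, and hyperparameter defaults are illustrative assumptions.

```python
# Minimal sketch of one mini-batch update step; grad_fn is a placeholder that
# returns dL/dw averaged over the mini-batch (x_batch, y_batch).
import numpy as np

def sgd_step(w, x, y, grad_fn, eta=0.001, batch_size=32):
    """One SGD update: only `batch_size` samples contribute to the gradient estimate."""
    idx = np.random.choice(len(x), batch_size, replace=False)
    g = grad_fn(w, x[idx], y[idx])              # dL/dw_t estimated from the mini-batch
    return w - eta * g                          # w_{t+1} = w_t - eta * dL/dw_t

def adam_step(w, m, v, t, x, y, grad_fn, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8, batch_size=32):
    """One Adam update with bias-corrected first and second momentums."""
    idx = np.random.choice(len(x), batch_size, replace=False)
    g = grad_fn(w, x[idx], y[idx])
    m = beta1 * m + (1 - beta1) * g             # running average of the gradients
    v = beta2 * v + (1 - beta2) * g ** 2        # running average of the squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```

A larger batch_size lowers the variance of the mini-batch gradient estimate g, while η scales the step taken along it, which is one way to see why the two hyperparameters interact.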
To speed up the network training and to increase its robustness, fine-tuning of the VGG16 network was applied. Fine-tuning a network is a form of transfer learning, in which knowledge is transferred between networks that have been trained on different datasets. Because training CNN weights from scratch requires millions of images and days of training, and this amount of images is not available for medical images, transfer learning can be very useful in the medical field [15].
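The paper does not include implementation code; below is a minimal Keras/TensorFlow sketch of the kind of fine-tuning setup described above, with a VGG16 base initialized with ImageNet weights, a binary classification head for the histopathology patches, and the batch size and learning rate exposed as the hyperparameters under study. The input size, head layers, epoch count, and dummy data are illustrative assumptions.

```python
# Hypothetical fine-tuning setup (not the authors' code): VGG16 with ImageNet
# weights, a binary sigmoid head, and the batch size / learning rate made explicit.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

BATCH_SIZE = 32        # values explored in the paper: 16, 32, 64, 128, 256
LEARNING_RATE = 1e-4   # values explored in the paper: 1e-4 and 1e-3

# Pre-trained convolutional base; its weights are fine-tuned on the new data
# (freezing some of the early blocks first is a common variant).
base = VGG16(weights="imagenet", include_top=False, input_shape=(96, 96, 3))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # tumour vs. normal patch
])

model.compile(
    optimizer=optimizers.Adam(learning_rate=LEARNING_RATE),   # or optimizers.SGD(...)
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")],
)

# Placeholder arrays standing in for the PatchCamelyon patches and labels.
x_train = np.random.rand(64, 96, 96, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(64,))

# BATCH_SIZE images are used per forward/backward pass and gradient update.
model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=2)
```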
Table 1
Test AUC results of the Adam optimizer.

Batch size    Adam LR = 0.0001    Adam LR = 0.001
16            0.9677              0.9144
32            0.9636              0.9332
64            0.9616              0.9381
128           0.9567              0.9432
256           0.9585              0.9652
Table 2
Test AUC results of the SGD optimizer.

Batch size    SGD LR = 0.0001    SGD LR = 0.001
16            0.9555             0.9461
32            0.9570             0.9521
64            0.9512             0.9545
128           0.9302             0.9567
256           0.9077             0.9579
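For reference, a test AUC value like those in Tables 1 and 2 is computed from the model's predicted probabilities on the held-out test set; the following is a minimal sketch (assumed, not from the paper) using scikit-learn, with placeholder arrays standing in for the model's test-set predictions.

```python
# Hypothetical evaluation step: area under the ROC curve on the test set.
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholders: in practice y_prob would come from model.predict(x_test).ravel().
y_test = np.random.randint(0, 2, size=200)
y_prob = np.random.rand(200)

test_auc = roc_auc_score(y_test, y_prob)
print(f"Test AUC: {test_auc:.4f}")
```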
Fig. 2. A sample of the PatchCamelyon dataset.