Available online at www.sciencedirect.com

ScienceDirect
ICT Express 6 (2020) 312–315
www.elsevier.com/locate/icte

The effect of batch size on the generalizability of the convolutional neural networks on a histopathology dataset

Ibrahem Kandel ∗, Mauro Castelli
Nova Information Management School (NOVA IMS), Universidade Nova de Lisboa, Campus de Campolide, 1070-312, Lisbon, Portugal
Received 30 September 2019; received in revised form 28 March 2020; accepted 28 April 2020
Available online 5 May 2020

Abstract
Many hyperparameters have to be tuned to obtain a robust convolutional neural network that can accurately classify images. One of the most important hyperparameters is the batch size, which is the number of images used in a single forward and backward pass. In this study, the effect of batch size on the performance of convolutional neural networks and the impact of learning rates are studied for image classification, specifically for medical images. To train the network faster, a VGG16 network with ImageNet weights was used in this experiment. Our results show that a larger batch size does not usually achieve higher accuracy, and that the learning rate and the optimizer used have a significant impact as well. Lowering the learning rate and decreasing the batch size allow the network to train better, especially in the case of fine-tuning.
© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Convolutional neural networks; Deep learning; Image classification; Medical images; Batch size

1. Introduction

Since its introduction nearly two decades ago, the convolutional neural network (CNN) [1] has been used as a primary image classification algorithm. The true power of the CNN was rediscovered through the ImageNet competition [2], where the AlexNet architecture [3] succeeded in classifying millions of images with thousands of labels with an accuracy of 85%, compared to 74% for traditional algorithms; since then, the CNN has again become one of the most important algorithms for image classification. One of the main benefits of using a CNN is that it does not need any manual feature extraction to work, which makes it robust against new datasets. CNNs not only succeed in the image classification domain but are also successfully applied in text classification [4], climate change detection [5], and speech recognition [6], among others.

Medical images can be considered very complicated datasets because of their complexity and the seriousness of the task, and they require a physician with years of experience to classify them. Examples of medical images that CNNs can be applied to are histopathology images, which are images assessed by pathologists to evaluate whether tissue is cancerous. Histopathology images are very challenging to classify, even for an experienced pathologist, and that is where the CNN can be applied, either by giving a second opinion or by assisting the pathologist in classifying these images.

To correctly train a CNN to classify images, many hyperparameters need to be adjusted; these hyperparameters affect both the performance of the network and its time to convergence. One of the main hyperparameters that need to be tuned is the batch size [7], which is the number of images used in a single forward and backward pass during training. Setting this hyperparameter too high can make the network take too long to achieve convergence (no further gain in accuracy); however, if it is too low, it will make the network bounce back and forth without achieving acceptable performance. The nature of the dataset can also have an impact on the appropriate batch size, especially for medical datasets because of their complexity.

In this study, we investigated the effect of batch size on the performance of CNNs and the impact of learning rates for image classification. Two different optimizers were used to assess the impact of batch size. The CNN architecture used in this experiment was the VGG16 [8]; the network was fine-tuned to suit this dataset and to avoid training the network from scratch.

∗ Corresponding author.
E-mail address: D20181143@novaims.unl.pt (I. Kandel).
Peer review under responsibility of The Korean Institute of Communications and Information Sciences (KICS).
https://doi.org/10.1016/j.icte.2020.04.010
2405-9595/© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

This experimental study aims at providing a better understanding of the batch size value to be considered before addressing a given problem with a CNN. In fact, despite the importance of the batch size for the learning process of a CNN, the scientific literature provides only a few studies on this topic. Additionally, as discussed in Section 2, the results reported in the literature do not reach unanimous conclusions, with some authors indicating a preference for large batch size values and other works suggesting the usage of small batch size values. The rest of the paper is organized as follows. In Section 2, previous research on batch size is presented. In Section 3, our methodology is presented. In Sections 4 and 5, we present our results and then the conclusion.
2. Literature review

Many hyperparameters need to be adjusted before training a CNN to classify images. One of the main hyperparameters that need to be set before beginning the training process is the batch size, i.e., the number of images that will be used in the gradient estimation process. Many researchers have studied the effect of batch size on network performance – either the accuracy of the network or the time taken until convergence – to determine which is better: small batches or large batches. On one hand, a small batch size can converge faster than a large batch, but a large batch can reach optimal minima that a small batch size cannot reach. Also, a small batch size can have a significant regularization effect because of its high variance [9], but it requires a small learning rate to prevent it from overshooting the minima [10]. Below are some studies that investigated the pros and cons of using small and large batch sizes.
In 2017, Radiuk [11] investigated the effect of batch size on CNN performance for image classification, using two datasets, namely MNIST and CIFAR-10. Radiuk tested batch sizes that were powers of 2, from 16 up to 1024, as well as 50, 100, 150, 200, and 250. Radiuk opted for a LeNet architecture for the MNIST dataset and a custom-made network with five convolutional layers for the CIFAR-10 dataset. The optimizer used for both networks was stochastic gradient descent, with a learning rate of 0.001 for MNIST and 0.0001 for CIFAR-10. For both datasets, the best accuracy was achieved with a batch size of 1024, and the worst result with a batch size of 16. The author stated that, based on these results, the higher the batch size, the higher the network accuracy, meaning that the batch size has a huge impact on CNN performance.

Bengio [12] stated that a batch size of 32 is a good default value; he also stated that a larger batch size quickens the computation of the network but decreases the number of updates required for the network to reach convergence. The author stated that the batch size mostly impacts the convergence time and not the network performance. Meanwhile, Masters and Luschi [13] tested the effect of batch sizes between $2^1$ and $2^{11}$ on the AlexNet [3] and ResNet [14] architectures, with SGD as the optimizer without momentum to exclude the effect of momentum on the training. The authors studied the effect of batch size on three datasets: CIFAR-10, CIFAR-100, and ImageNet. They stated that the best results were obtained with batch sizes between 2 and 32, and they noted that small batch sizes are more robust than large batch sizes.

In general, the main question regarding the batch size is which value is optimal for training CNNs, i.e., which batch size helps the network achieve the highest accuracy in the shortest time, especially for complex datasets such as medical image datasets.
3. Methodology

The training of a CNN to classify images can be defined as minimizing a non-convex loss function $L(\theta)$ using an optimizer such as stochastic gradient descent (SGD) or the Adam optimizer, where $L(\theta)$ is the average of the per-image costs $L_i(\theta)$ over the dataset and $M$ is the size of the image dataset:
$$\arg\min_{\theta} L(\theta); \qquad L(\theta) = \frac{1}{M}\sum_{i=1}^{M} L_i(\theta)$$
The gradient update can be calculated in three ways: using the entire image dataset $M$, using a single image, or using a number of images between 1 and $M$. These methods are named batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, respectively. The batch size hyperparameter $B$ is the number of images used to update the gradients each time. Using the SGD optimizer, the network weights are updated with the following equation:
$$w_{t+1} = w_t - \eta \frac{\partial L}{\partial w_t}; \qquad \frac{\partial L}{\partial w_t} = \nabla_W C\left(w_t; x^{(B)}; y^{(B)}\right)$$
where $\eta$ is the learning rate, $x^{(B)}$ are the sample images used, $y^{(B)}$ are the image labels, and $w$ are the weights being updated. For the Adam optimizer, the weights are updated as follows:
$$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$
where $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$, $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$, $m_t = \beta_1 m_{t-1} + (1-\beta_1)\frac{\partial L}{\partial w_t}$, $v_t = \beta_2 v_{t-1} + (1-\beta_2)\left[\frac{\partial L}{\partial w_t}\right]^2$, and $\frac{\partial L}{\partial w_t} = \nabla_W C\left(w_t; x^{(B)}; y^{(B)}\right)$. Here $\beta_i \in [0, 1]$ determines how much information is kept from the previous update, $m_t$ is the first momentum (the running average of the gradients), and $v_t$ is the second momentum (the running average of the squared gradients). The bias-corrected first and second momentums are $\hat{m}_t$ and $\hat{v}_t$. As the previous equations show, the batch size and the learning rate interact with each other, and together they can have a huge impact on network performance.
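To make the update rules above concrete, the following is a minimal NumPy sketch of mini-batch SGD and Adam steps. It is an illustration, not the authors' code; the gradient function `grad_fn` and the data arrays are placeholders.

```python
# Minimal sketch of the SGD and Adam updates described above (not the paper's code).
import numpy as np

def sgd_step(w, grad, lr=0.001):
    # w_{t+1} = w_t - eta * dL/dw_t
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient (first momentum) and squared gradient (second momentum).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected momentums.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def train_epoch(w, images, labels, grad_fn, batch_size=32, lr=0.001):
    # The batch size B controls how many images contribute to each gradient estimate.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t, start in enumerate(range(0, len(images), batch_size), start=1):
        x_b = images[start:start + batch_size]
        y_b = labels[start:start + batch_size]
        grad = grad_fn(w, x_b, y_b)   # average gradient over the mini-batch
        w, m, v = adam_step(w, grad, m, v, t, lr=lr)   # or sgd_step(w, grad, lr)
    return w
```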
To speed up the network training and to increase its robustness, fine-tuning of the VGG16 network was applied. Fine-tuning a network is a form of transfer learning, in which knowledge is transferred between networks that have been trained on different datasets. Because training CNN weights from scratch requires millions of images and days of training, and this amount of images is not available for medical applications, transfer learning can be very useful in the medical field [15].
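As an illustration of this fine-tuning setup, the following Keras/TensorFlow sketch loads VGG16 with ImageNet weights and keeps only the last two convolutional blocks trainable. The classifier head and the compile settings are assumptions for illustration; the paper only states that the last two blocks were fine-tuned, and its best result was obtained with Adam at a learning rate of 0.0001.

```python
# Sketch of fine-tuning VGG16 (ImageNet weights) with only blocks 4 and 5 trainable.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

base = VGG16(weights="imagenet", include_top=False, input_shape=(96, 96, 3))

# Freeze everything except the last two convolutional blocks.
for layer in base.layers:
    layer.trainable = layer.name.startswith(("block4", "block5"))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),     # assumed classifier head
    layers.Dense(1, activation="sigmoid"),    # binary histopathology labels
])

model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])
```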

Table 1
The results of the test AUC of the Adam optimizer.

Batch size    Adam LR = 0.0001    Adam LR = 0.001
16            0.9677              0.9144
32            0.9636              0.9332
64            0.9616              0.9381
128           0.9567              0.9432
256           0.9585              0.9652

Fig. 1. VGG16 network architecture.

Table 2
The results of the test AUC of the SGD optimizer.

Batch size    SGD LR = 0.0001    SGD LR = 0.001
16            0.9555             0.9461
32            0.9570             0.9521
64            0.9512             0.9545
128           0.9302             0.9567
256           0.9077             0.9579
Fig. 2. A sample of the PatchCamelyon dataset.

The VGG16 [8] network is considered one of the most important CNNs for image classification because of its deep yet simple architecture, which gives it robustness against overfitting while providing good performance; VGG16 is presented in Fig. 1.

The dataset used in this experiment was PatchCamelyon [16,17], a public dataset that contains 220,000 binary-labeled images for training the CNN. The dataset was roughly balanced, containing 60% positive and 40% negative images. Another 57,458 images were provided on the Kaggle platform to test the algorithm. All the images were 96 × 96 pixels. A sample of the dataset is presented in Fig. 2.

Image augmentation is usually used to enlarge the image dataset and to make the network more robust to translation. Image augmentation is defined as creating duplicates of the original images by flipping, rotating, zooming, and adjusting brightness. In this work, the images were horizontally and vertically flipped, with an image rotation of up to 180 degrees; some images were zoomed in; and some images were shifted.
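The augmentation policy described above could be expressed, for example, with Keras' ImageDataGenerator; the exact parameter values below are assumptions, since the paper does not report them.

```python
# Sketch of the augmentation policy: flips, rotation, zoom, and shifts (values assumed).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=180,       # rotations of up to 180 degrees
    zoom_range=0.1,           # slight zoom-in
    width_shift_range=0.1,    # small horizontal shifts
    height_shift_range=0.1,   # small vertical shifts
)

# Example usage with a hypothetical directory of 96 x 96 training patches:
# train_generator = train_datagen.flow_from_directory(
#     "data/train", target_size=(96, 96), batch_size=32, class_mode="binary")
```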
To evaluate the CNN classifier performance (i.e., to determine the classifier's ability to classify positive images as positive and negative images as negative), the area under the ROC curve (AUC) was used, which can be formally defined as [18]:
$$AUC = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
where $TP$ is the true positive count, i.e., the positive images classified as positive; $TN$ is the true negative count, i.e., the negative images classified as negative; $FP$ is the false positive count, i.e., the negative images classified as positive; and $FN$ is the false negative count, i.e., the positive images classified as negative. The minimum value of the AUC metric is 0.5, which indicates that the model has no predictive power, and the maximum is 1, which indicates that the model classifies the images perfectly.
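The formula above can be computed directly from the confusion-matrix counts, as in the short sketch below; in practice, the AUC is often computed from predicted probabilities instead, e.g. with sklearn's roc_auc_score. The counts in the usage example are invented for illustration.

```python
# Direct transcription of the AUC formula given above.
def auc_from_confusion(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)   # fraction of positives classified as positive
    specificity = tn / (tn + fp)   # fraction of negatives classified as negative
    return 0.5 * (sensitivity + specificity)

# Example (invented counts): 0.5 * (90/100 + 80/100) = 0.85
print(auc_from_confusion(tp=90, tn=80, fp=20, fn=10))
```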
4. Results

The last two blocks of the VGG16 network were fine-tuned using 80% of the dataset and validated on the remaining 20%, after which the best model was saved and used to classify the Kaggle online test set. The batch sizes used in this experiment were B = [16, 32, 64, 128, 256]; two optimizers were used, namely SGD and Adam, and two learning rates were used for each optimizer, 0.001 and 0.0001. For consistency of results, and because of the size of the dataset, the number of epochs was fixed to 50. To overcome overfitting, only the best model was saved, meaning that during the training phase, if the validation accuracy of an epoch was higher than the best accuracy so far, the model was saved. The results on the Kaggle online test set are shown in Tables 1 and 2.
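The experimental grid just described (five batch sizes, two optimizers, two learning rates, 50 epochs, keeping only the best model by validation accuracy) could be driven by a loop such as the following sketch. The helper build_finetuned_vgg16 and the data arrays are hypothetical placeholders, not names from the paper.

```python
# Sketch of the experimental grid with best-model checkpointing.
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import SGD, Adam

batch_sizes = [16, 32, 64, 128, 256]
learning_rates = [0.001, 0.0001]

for batch_size in batch_sizes:
    for lr in learning_rates:
        for opt_name, opt in [("sgd", SGD(learning_rate=lr)),
                              ("adam", Adam(learning_rate=lr))]:
            model = build_finetuned_vgg16()      # hypothetical helper (see Section 3 sketch)
            model.compile(optimizer=opt, loss="binary_crossentropy",
                          metrics=["accuracy"])
            checkpoint = ModelCheckpoint(
                f"best_{opt_name}_lr{lr}_bs{batch_size}.h5",
                monitor="val_accuracy", save_best_only=True)
            # x_train/y_train and x_val/y_val: hypothetical 80/20 split of the training data.
            model.fit(x_train, y_train,
                      validation_data=(x_val, y_val),
                      batch_size=batch_size, epochs=50,
                      callbacks=[checkpoint])
```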
Table 1 shows the results of the Adam optimizer with learning rates of 0.001 and 0.0001. For a learning rate of 0.001, the lowest batch size (16) achieved the lowest AUC, and the highest performance came from using the largest batch size (256); in this setting, the larger the batch size, the higher the performance. For a learning rate of 0.0001, the differences were mild; however, the highest AUC was achieved by the smallest batch size (16), while the lowest AUC was achieved by the largest batch size (256).

Table 2 shows the results of the SGD optimizer with learning rates of 0.001 and 0.0001. For a learning rate of 0.001, the largest batch size achieved the highest AUC, while the lowest AUC came from using the smallest batch size (16). For a learning rate of 0.0001, it was the opposite: the largest batch size (256) achieved the lowest AUC, while a batch size of 32 achieved the highest, followed by the smallest batch size.

The highest overall AUC achieved during the experiments was obtained by Adam with a learning rate of 0.0001 and a batch size of 16.

Our results agree with those obtained by Masters and Luschi [13], who stated that smaller batch sizes should be used. According to Radiuk [11], when a large learning rate is used, the higher the batch size, the better the performance of a CNN. While the use of large batch size values is not recommended by our study, the results of Radiuk match our findings on the relation between the batch size and the learning rate. In particular, we highlighted that higher learning rates require larger batch sizes. Finally, Bengio [12] suggested that 32 is a good default value for the batch size. While this is corroborated by our experiments (in which a batch size of 32 provided good results), the best performance was achieved with a batch size of 16.
5. Conclusion

Convolutional neural networks have shown superior accuracy in image classification, but to train a CNN accurately, many hyperparameters need to be tuned depending on the dataset being used. The medical field can benefit greatly from using CNNs in image classification to increase accuracy. In this paper, we compared the performance of a CNN using different batch sizes and different learning rates. According to our results, we can conclude that the learning rate and the batch size have a significant impact on the performance of the network. There is a strong interaction between the learning rate and the batch size: when the learning rate is high, large batch sizes perform better than they do with small learning rates. We recommend choosing a small batch size with a low learning rate. In practical terms, to determine the optimum batch size, we recommend trying smaller batch sizes first (usually 32 or 64), keeping in mind that small batch sizes require small learning rates. The batch size should be a power of 2 to take full advantage of GPU processing. Subsequently, the batch size value can be increased until satisfactory results are obtained.
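One possible reading of this recommendation as a procedure is sketched below; train_and_evaluate is a hypothetical helper that trains the network with the given batch size and learning rate and returns a validation AUC.

```python
# Sketch of the recommended search: start small, keep powers of two, grow until
# the validation score stops improving.
def pick_batch_size(train_and_evaluate, start=32, max_batch=256, lr=0.0001):
    best_auc, best_bs = 0.0, start
    bs = start
    while bs <= max_batch:
        auc = train_and_evaluate(batch_size=bs, learning_rate=lr)
        if auc > best_auc:
            best_auc, best_bs = auc, bs
        bs *= 2   # powers of two make full use of GPU processing
    return best_bs, best_auc
```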
CRediT authorship contribution statement

Ibrahem Kandel: Investigation, Visualization, Methodology, Software, Writing - original draft. Mauro Castelli: Conceptualization, Supervision, Validation, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), Portugal, by the projects GADgET (DSAIPA/DS/0022/2018) and AICE (DSAIPA/DS/0113/2019). Mauro Castelli acknowledges the financial support from the Slovenian Research Agency (research core funding No. P5-0410).

References

[1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[2] O. Russakovsky, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2014).
[3] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Vol. 25, 2012.
[4] M. Hughes, I. Li, S. Kotoulas, T. Suzumura, Medical text classification using convolutional neural networks, Stud. Health Technol. Inform. 235 (2017).
[5] Y. Liu, et al., Application of deep convolutional neural networks for detecting extreme weather in climate datasets, 2016.
[6] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, Convolutional neural networks for speech recognition, IEEE/ACM Trans. Audio Speech Lang. Process. 22 (10) (2014) 1533–1545.
[7] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.
[8] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.
[9] D.R. Wilson, T.R. Martinez, The general inefficiency of batch training for gradient descent learning, Neural Netw. 16 (10) (2003) 1429–1451.
[10] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, The MIT Press, 2016.
[11] P. Radiuk, Impact of training set batch size on the performance of convolutional neural networks for diverse datasets, Inf. Technol. Manag. Sci. 20 (2017).
[12] Y. Bengio, Practical recommendations for gradient-based training of deep architectures, 2012, arXiv.
[13] D. Masters, C. Luschi, Revisiting small batch training for deep neural networks, 2018.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Vol. 7, 2015.
[15] N. Tajbakhsh, et al., Convolutional neural networks for medical image analysis: Full training or fine tuning?, IEEE Trans. Med. Imaging 35 (5) (2016) 1299–1312.
[16] B.S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling, Rotation equivariant CNNs for digital pathology, in: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, 2018, pp. 210–218.
[17] B. Ehteshami Bejnordi, et al., Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer, JAMA 318 (22) (2017) 2199–2210.
[18] F. Idrees, M. Rajarajan, M. Conti, T.M. Chen, Y. Rahulamathavan, PIndroid: A novel Android malware detection system using ensemble learning methods, Comput. Secur. 68 (2017) 36–46.
