Face Mask Detection
INTRODUCTION
1.1. Motivation
At present, the Coronavirus disease (COVID-19) pandemic, which first occurred in Wuhan, China, has spread to almost all countries and can be transmitted through human-to-human contact, even by patients who have never visited an epidemic area. According to the World
Health Organization (WHO), about 6,637,519 people have been infected and 391,161 have died. The COVID-19 virus created a paramount health emergency in the history of
mankind [1]. The virus can spread through droplets from a contaminated individual [2, 3]. The
most important defense against the virus is the face mask, which is also advised by the WHO [4,
5]. One of the preventive solutions is wearing a mask, which stops the virus from spreading through
the air and reduces each person's chance of getting infected. Many organizations require everyone
who wants to receive their services to wear a mask; however, the number of users or customers far
exceeds the number of service providers, which makes rigorous checking difficult.
It is not only important to wear a mask but also to wear it in a way that covers the
nose and mouth completely. Wearing the mask inappropriately can also spread the virus and will
not provide significant protection [6]. Developing a face mask recognizer that detects not only
the mask but also how correctly a person is wearing it can help prevent an
outbreak of the virus and save many lives [7]. This face mask recognizer can be used in public
places to monitor the crowd and identify individuals who are not wearing a mask or who
are wearing it incorrectly. This can help spread awareness and educate people about the correct way
to wear a mask, and it can help frontline workers focus on eradication of the virus
[8]. Since a face mask is our shield against the virus, developing this model was essential,
and with the scare of the new variants it has high application value, which motivated
the idea for this study.
1.2. Objective
The outbreak of the Coronavirus pandemic has changed the lifestyle of everyone around the
world, and among those changes wearing a mask has become vital for every individual. Detecting
people who are not wearing masks is a challenge because of the large size of the population. This
project can be used in schools, hospitals, banks, airports, etc. as a digitalized scanning tool.
Detected faces are segregated into two classes, namely people with masks and people without
masks, with the help of deep learning. With the help of this project, a person who has to monitor
people can be seated in a remote area and still monitor efficiently and give instructions
accordingly. A GitHub dataset consisting of images with and without masks was used. For the
purpose of this study a pre-trained convolutional neural network, AlexNet, was used.
1.3. Dataset
Masks play a crucial role in protecting the health of individuals against respiratory
diseases, as they are one of the few precautions available for COVID-19 in the absence of
immunization. With this dataset, it is possible to create a model to detect people wearing masks,
not wearing them, or wearing masks improperly. The dataset contains 1,000 images gathered
from GitHub, belonging to 2 classes.
The classes are:
With mask;
Without mask;
CHAPTER 2
LITERATURE REVIEW
Object detection is one of the trending topics in the field of image processing and
computer vision. Ranging from small scale personal applications to large scale industrial
applications, object detection and recognition are employed in a wide range of industries. Some
examples include image retrieval, security and intelligence, OCR, medical imaging and
agricultural monitoring. In object detection, an image is read and one or more objects in that
image are categorized. The location of those objects is also specified by a boundary called the
bounding box. Traditionally, researchers used pattern recognition to predict faces based on prior
face models. A breakthrough face detection technology, the Viola-Jones detector, was then
developed as an optimized technique based on Haar features [9], digital image features used in
object recognition. However, it failed because it did not perform well on faces in dark areas and
on non-frontal faces. Since then, researchers have been eager to develop new algorithms based on deep
learning to improve the models. Deep learning allows us to learn features in an end-to-end
manner, removing the need for prior knowledge when designing feature extractors. Deep-learning-based
object detection methods are divided into two categories: one-stage and two-stage object detectors.
Two-stage detectors use two neural networks to detect objects, for instance region-based
convolutional neural networks (R-CNN) and Faster R-CNN. The first neural network is used to
generate region proposals and the second one refines these region proposals, performing a
coarse-to-fine detection. This strategy yields high detection performance at the cost of speed.
The seminal work, R-CNN, was proposed by R. Girshick et al. [10]. R-CNN uses selective
search to propose candidate regions which may contain objects. After that, the proposals
are fed into a CNN model to extract features, and a support vector machine (SVM) is used to
recognize the classes of objects. However, the second stage of R-CNN is computationally expensive
since the network has to process the proposals one by one and uses a separate SVM for the
final classification. Fast R-CNN [11] solves this problem by introducing a region of interest
(ROI) pooling layer that processes all proposal regions at once. Faster R-CNN [12] is the evolution of
R-CNN and Fast R-CNN, and as the name implies its training and testing speed is greater than
those of its predecessors. While R-CNN and Fast R-CNN use selective search algorithms that
limit detection speed, Faster R-CNN learns the proposed object regions itself using a
region proposal network (RPN).
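As a concrete illustration of this two-stage pipeline, the following minimal sketch runs a pre-trained Faster R-CNN from torchvision on a single image; the image path, score threshold and the pretrained flag (older torchvision API) are illustrative assumptions, not part of the cited works.

# Minimal sketch: inference with a pre-trained two-stage detector (Faster R-CNN).
# Assumes torchvision and Pillow are installed; the image path is illustrative.
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()  # inference mode: the RPN proposes regions, the second stage refines them

image = Image.open("sample.jpg").convert("RGB")  # hypothetical input image
tensor = F.to_tensor(image)                      # [C, H, W] scaled to [0, 1]

with torch.no_grad():
    outputs = model([tensor])[0]  # dict with 'boxes', 'labels', 'scores'

for box, label, score in zip(outputs["boxes"], outputs["labels"], outputs["scores"]):
    if score > 0.5:  # confidence threshold chosen arbitrarily for illustration
        print(label.item(), score.item(), box.tolist())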
On the other hand, a one-stage detector utilizes only a single neural network for both region
proposals and detection; some primary examples are SSD (Single Shot Detector) [13] and
YOLO (You Only Look Once) [14]. To achieve this, the bounding boxes must be predefined.
YOLO divides the image into several cells and then matches the bounding boxes to objects in
each cell. This, however, does not work well for small objects. To address this, multi-scale detection was
introduced in SSD, which can detect objects of varying sizes in an image. Later, in order to
improve detection accuracy, Lin et al. [15] proposed the Retina Network (RetinaNet), which combines
an SSD with a feature pyramid network (FPN) to increase detection accuracy and reduce class
imbalance. One-stage detectors are faster but trade off some detection performance, and are therefore
preferred over two-stage detectors only when speed matters most.
Like object detection, face detection adopts the same architectures as one-stage and two-
stage detectors, but in order to improve face detection accuracy, more face-like features are being
added. However, there is occasional research focusing on face mask detection. Some already
existing facemask detectors have been modeled using OpenCV, Pytorch Lightning, MobileNet,
RetinaNet and Support Vector Machines. Here, we will be discussing two projects. One project
used Real World Masked Face Dataset (RMFD) which contains 5,000 masked faces of 525
people and 90,000 normal faces [16]. These images are 250 x 250 in dimensions and cover all
races and ethnicities and are unbalanced. This project took 100 x 100 images as input, and
therefore, transformed each sample image when querying it, by resizing it to 100x100.
Moreover, this project uses PyTorch, so the images are converted to tensors, the base data
type that PyTorch works with. RMFD is imbalanced (5,000 masked faces vs 90,000 non-
masked faces). Therefore, the ratio of the samples in the train/validation split
was kept equal using the train_test_split function of sklearn. Moreover, to deal with the unbalanced
data, they passed this information to the loss function to avoid disproportionate step sizes of the
optimizer. They did this by assigning a weight to each class according to its representation in
the dataset. They assigned more weight to classes with a small number of samples so that the
network is penalized more if it makes mistakes predicting the labels of these classes, while
classes with large numbers of samples were assigned a smaller weight. This makes the
network training agnostic to the proportion of classes.
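The class-weighting strategy described above can be sketched in PyTorch by passing per-class weights to the loss function; the inverse-frequency weighting below is one common choice and not necessarily the exact scheme used in [16].

# Sketch: handling class imbalance by weighting the loss.
# The class counts come from the text; the weighting formula is an assumption.
import torch
import torch.nn as nn

class_counts = torch.tensor([5000.0, 90000.0])   # masked vs. non-masked samples
weights = class_counts.sum() / (len(class_counts) * class_counts)  # rarer class gets a larger weight

criterion = nn.CrossEntropyLoss(weight=weights)

# Usage: logits from any classifier with 2 output units, labels in {0, 1}
logits = torch.randn(8, 2)            # dummy batch of predictions
labels = torch.randint(0, 2, (8,))    # dummy ground-truth labels
loss = criterion(logits, labels)      # mistakes on the rare class are penalized more
print(loss.item())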
In the second project [17], a dataset was created by Prajna Bhandary using a PyImage
Search reader. This dataset consists of 1,376 images belonging to all races and is balanced. There
are 690 images with masks and 686 without masks. Firstly, it took normal images of faces and
then created a customized computer vision Python script to add face masks to them. Thereby, it
created a real-world applicable artificial dataset. This method used the facial landmarks which
allow them to detect the different parts of the faces such as eyes, eyebrows, nose, mouth, jaw line
etc. To use the facial landmarks, it takes a picture of a person who is not wearing a mask, and,
then, it detects the portion of that person’s face. After knowing the location of the face in the
image, it extracted the face region of interest (ROI). After localizing the facial landmarks, a picture
of a mask is placed onto the face. In this project, embedded devices are used for deployment, which
could reduce the cost of manufacturing. The MobileNetV2 architecture is used as it is a highly
efficient architecture for embedded devices with limited computational capacity such as the
Google Coral and NVIDIA Jetson Nano. This project performed well; however, if a large portion of
the face is occluded by the mask, the model could not detect whether a person is wearing a mask
or not. The dataset used to train the face detector did not contain images of people wearing face
masks; as a result, if a large portion of the face is occluded, the face detector would probably fail
to detect it properly. To get rid of this problem, they should gather actual images of people wearing
masks rather than artificially generated images.
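A minimal sketch of a MobileNetV2-based mask classifier in Keras, in the spirit of the second project; the input size, head layers and optimizer settings are illustrative assumptions rather than the configuration used in [17].

# Sketch: transfer learning with MobileNetV2 for a two-class mask classifier.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(2, activation="softmax"),  # with_mask vs. without_mask
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()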
Initially, researchers focused on the edge and gray values of face images. The method in [18] was based on
a pattern recognition model with prior information about the face model. AdaBoost [19] was a
good training classifier. Face detection technology got a breakthrough with the famous Viola-
Jones detector [20], which greatly improved real-time face detection. The Viola-Jones detector
optimized the Haar features [21], but failed to tackle real-world problems and was
influenced by various factors like face brightness and face orientation. Viola-Jones could only
detect frontal, well-lit faces; it failed to work well in dark conditions and with non-frontal images.
These issues have led independent researchers to work on new face detection
models based on deep learning, to obtain better results under different facial conditions. We have
developed our face detection model using the Multi-Human Parsing Dataset [22], based on fully
convolutional networks, such that it can detect the face in any geometric condition, frontal or
non-frontal.
Convolutional networks have long been used for image classification tasks. Typical
architectures like AlexNet [23] and VGGNet [24] comprise stacked convolutional layers.
AlexNet, with 5 convolutional layers and 3 fully connected layers, was the winner of the
ImageNet ILSVRC-2012 competition, while VGGNet is an improvement over AlexNet as it
replaces large kernels with multiple consecutive 3x3 kernels. The ILSVRC-2014 winning
architecture GoogLeNet [25] uses parallel convolution kernels and concatenates the resulting feature
maps; 1x1, 3x3 and 5x5 convolutions and 3x3 max-pooling are used.
Smaller convolutions extract local features whereas larger convolutions extract high-level
features. More recent architectures such as ResNet [26] introduced skip connections, which
allow deeper networks to avoid saturation in training accuracy. These architectures are often
used for initial feature extraction in face detection networks. In our method, we use the VGG-16
architecture as the base network for face detection and a fully convolutional network for
segmentation. The VGG-16 network is sufficiently deep to extract features and computationally less
expensive for our case. Though the majority of segmentation architectures rely on downsampling
and subsequent upsampling of the input image, fully convolutional networks [27], [28], [29] remain
a modest yet significantly accurate approach to segmentation.
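A minimal sketch of using VGG-16 as a base feature extractor, assuming PyTorch/torchvision; the segmentation head that would consume these features is intentionally omitted, and the input tensor is a placeholder.

# Sketch: VGG-16 convolutional backbone as a feature extractor.
import torch
import torchvision

vgg = torchvision.models.vgg16(pretrained=True)
backbone = vgg.features            # the stacked 3x3 convolutional layers
backbone.eval()

dummy = torch.randn(1, 3, 224, 224)        # hypothetical input image tensor
with torch.no_grad():
    feature_map = backbone(dummy)          # [1, 512, 7, 7] for a 224x224 input
print(feature_map.shape)                   # these features would feed the FCN head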
In [30], the authors developed a method to identify face mask-wearing conditions. They
were able to classify three categories of face mask-wearing: correct face mask-wearing,
incorrect face mask-wearing and no face mask-wearing. Saber et al. [31] applied
principal component analysis (PCA) on masked and unmasked face recognition to recognize the
person. PCA was also utilized in [32], where the author proposed a method for removing
glasses from human frontal faces. In [33], the authors used the YOLOv3 algorithm for face
detection; YOLOv3 uses Darknet-53 as the backbone. Nizam et al. [34] proposed a
novel GAN-based network which automatically removes the mask covering the
face area and regenerates the image by filling in the missing region. In [35], the authors presented a
system for detecting the presence or absence of a compulsory medical mask in the operating room. The
general aim is to minimize false positive face detections as much as possible without missing mask
detections, so that alarms are triggered only for medical staff who do not wear a surgical mask. Shaik et al.
[36] used deep learning for real-time face emotion classification and recognition; they used VGG-
16 to classify seven facial expressions. Under the current COVID-19 lockdown, this technique is
effective in preventing spread in many use cases. Here are some use cases which can benefit from the
system.
Airports: the proposed system could be vital for finding travelers without masks at airports. The
travelers' data can be captured as video by the system at the entrance. If any passenger is found
without a mask, the airport authorities are alerted so that they can act quickly.
Hospitals: the proposed system can be integrated with CCTV cameras, and the data can be
monitored to check whether employees are wearing masks. If doctors or staff are found
not wearing a mask, they will receive a reminder to wear one.
Offices: the proposed system can help maintain safety standards to prevent the spread of
COVID-19 or any such airborne disease. If some employees are not wearing masks, they will
receive reminders to wear a mask. The choice of the system must be based on the best
performance, so the best system performance indicators are used to allow large-scale
implementation. The system has been used with the MobileNetV2 classifier.
MobileNetV2 [37]: MobileNetV2 is the latest technology in mobile visual recognition,
including classification, object detection and semantic segmentation. The classifier uses depthwise
separable convolutions, whose purpose is to significantly reduce the computational cost and
model size of the network, making it suitable for mobile devices or devices with low computing
power. Another important module introduced in MobileNetV2 is the inverted residual structure,
in which the nonlinearity within the narrow layers is removed. When used as the backbone for feature
extraction, MobileNetV2 achieves the best performance in object detection and semantic
segmentation.
The work in [38] discusses the use of MTCNN for the detection of masked faces. Face
recognition is a promising area in the field of computer vision. Some devices use Face
recognition as an alternative to a fingerprint scanner. CNN has the ability to learn valuable
features by itself. The author used IIIT-Delhi masked face images dataset and applied data
augmentation to enlarge the dataset so that reliability and efficiency can be improved. They used
a pre-trained Multi-task Cascaded Convolutional Neural Network (MTCNN) for the detection of
faces from the dataset. MTCNN outperforms many other face-detection tools. It works in 3
stages. First, it creates multiple copies of the images of different scales. This is called an image
pyramid. The first stage is called the P-Net or Proposal Network. It introduces candidate facial
regions. The second stage is the R-Net or refinement network. It refines the bounding boxes. The
third and final stage is O-Net or Output Network. It determines the final landmarks on the image.
In image post processing, the images are cropped and resized according to the FaceNet
Specification i.e. 160x160. A pre-trained FaceNet Model was used as a baseline for deep
networks. It used 22 deep convolutional layers. A large number of images of masked and
unmasked faces were used to train the model. The classification was done with the help of the
Support Vector Machine (SVM). The results of this methodology were promising. It gave
accuracy up to 98.50% in some datasets and cases.
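A minimal sketch of the MTCNN-plus-FaceNet pre-processing described above, assuming the third-party mtcnn package and OpenCV are installed; the image path is illustrative, and the FaceNet/SVM stages are only indicated in a comment.

# Sketch: three-stage MTCNN face detection followed by cropping to the FaceNet input size (160x160).
import cv2
from mtcnn import MTCNN

detector = MTCNN()                                    # P-Net, R-Net and O-Net under the hood
image = cv2.cvtColor(cv2.imread("person.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical image

for face in detector.detect_faces(image):             # each result has a box, confidence and landmarks
    x, y, w, h = face["box"]
    crop = image[y:y + h, x:x + w]
    crop = cv2.resize(crop, (160, 160))               # FaceNet specification from the text
    # `crop` would then be fed to a FaceNet embedding model and an SVM classifier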
The work in [39] proposes using two components: i) a deep neural network based on the YOLOv3
model for identifying one or more riders on a motorbike, and ii) another neural
network for detecting whether the rider is wearing a helmet or not. In this system, the traffic
surveillance system provides input to the model and the video frames are given as input to the
CNN for detecting helmets on the riders. Initially, the YOLOv3 is used for detecting the
motorbike and the riders. The YOLOv3 model is an improved version of YOLO which was
developed by J. Redmon. The model can detect huge sets of classes; among them only two
classes i.e. person and motorbike are detected. The boxes are drawn around the target to localize
the objects. The network predicts 4 coordinates: bx and by, the center coordinates, and bw and
bh, the width and height of the target. The overlapping area between the motorbike and the
person is taken from the bounding boxes to determine whether the person is a motorbike
rider or not: the Euclidean distance between the center coordinates of the two bounding
boxes is computed, and if this distance falls within the bounding box of the motorbike,
the targeted person is taken to be the rider of that motorbike. The CNN model is
then used to identify and classify whether the rider is wearing a helmet or not. For this, the top
one-fourth part of the identified motorbike rider is sent as input from the output received from
the YOLOv3 model.
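A simplified sketch of this rider-association step; the box format and the distance criterion below are illustrative assumptions rather than the exact rule from [39].

# Sketch: associate a detected person with a detected motorbike using box centers.
import math

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def is_rider(person_box, bike_box):
    """Treat the person as the rider if the distance between the two box centers
    is smaller than half of the motorbike box's diagonal (a simplified criterion)."""
    px, py = box_center(person_box)
    bx, by = box_center(bike_box)
    x1, y1, x2, y2 = bike_box
    half_diag = 0.5 * math.hypot(x2 - x1, y2 - y1)
    return math.hypot(px - bx, py - by) <= half_diag

person = (120, 40, 180, 200)      # dummy person detection (x1, y1, x2, y2)
motorbike = (100, 120, 200, 260)  # dummy motorbike detection
print(is_rider(person, motorbike))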
The CNN model consists of five layers of which the input layer takes the input from the
input image and passes the image through consecutive convolutional layers where each layer
transforms the image using specific features and sends it to the next layer. Each layer filters the
input image given and extracts the required features with plenty of differentiating attributes to
distinguish the target object from other objects. After these five layers, two additional fully
connected layers are added. Based on the extracted features, the softmax classifier predicts the
class probabilities for wearing a helmet or not wearing a
helmet. The CNN predicts bounding boxes along with class probabilities for accuracy of
prediction. In the detection process, the input image is divided into an N×N grid. This grid is
responsible for object detection of any kind of object that falls into that grid’s cell. Each
bounding box consists of 4 measures: px, py, w, h where (px, py) coordinates represent the
center of the box relative to the bounds of the grid cell. The height (h) and width (w) are
predicted relative to the whole image.
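A small sketch of decoding a grid-cell-relative prediction (px, py, w, h) into absolute pixel coordinates, following the convention described above; the grid size and values are illustrative.

# Sketch: convert a grid-relative box prediction into absolute (x1, y1, x2, y2) coordinates.
def decode_box(px, py, w, h, row, col, grid_n, img_w, img_h):
    """(px, py) are relative to the grid cell, (w, h) are relative to the whole image."""
    cell_w, cell_h = img_w / grid_n, img_h / grid_n
    center_x = (col + px) * cell_w
    center_y = (row + py) * cell_h
    box_w, box_h = w * img_w, h * img_h
    x1, y1 = center_x - box_w / 2, center_y - box_h / 2
    return x1, y1, x1 + box_w, y1 + box_h

print(decode_box(0.5, 0.5, 0.2, 0.3, row=3, col=4, grid_n=7, img_w=416, img_h=416))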
The proposed paper explains how to detect single or multiple riders of a
motorbike who are not wearing helmets in traffic surveillance videos. First, the YOLOv3
model is used for motorcyclist detection. Then, the proposed lightweight convolutional
neural network detects the wearing of a helmet or no helmet for all motorbike riders. This project
performs better than other CNN based helmet detection methods and can be extended in the
future to detect more complicated cases of several riders including child riders. As YOLOv3 and
CNN models detect a person's face accurately from a given image and can tell whether a person
is wearing a helmet or not, so one can also use these models to determine if a person is wearing a
face mask or not.
The work in [40] proposed a new technique of helmet detection which combines two methods in
order to make the detection rate better. Those two methods are i) Haar like feature and ii) Circle
Hough transform. By using these methods the system detects whether a person is wearing a full
helmet or half helmet. When the system receives video input it first separates the images from
video then uses a Haar like feature for detection of a full helmet. As we know the human face is
full of contrast (e.g. eye region is darker than the cheek region), Haar like feature uses these
contrasts to encode the human face, nose, mouth, eyebrows, right eye, left eye. This paper has
used 14 feature prototypes to encode the features which include Edge features(4), Center
surround features (2) and Line features (8). For each 24 x 24 sub-window there are more
than 117,000 rectangular features, so a weak learning algorithm is used to select only specific
rectangles. To boost classification performance they used the AdaBoost
classifier, and to increase detection efficiency they used a cascade classifier, which also
reduces the computation time radically. For the detection of half helmets, the circle Hough transform
method is used by the authors. This method detects not only circular shapes but also
any kind of shapes in the given picture which makes it easy to locate helmets, and hence it makes
it possible to detect half helmets. This paper has overcome different issues which were raised
before while detecting full and half helmets. They have tested this algorithm in real-time and the
results are very positive. This paper proposed a new technique of masked face detection by
taking the help of video analytics which combines four steps in order to make the detection rate
better. The four steps are: i) Distance from camera ii) Eye line detection iii) Facial part detection
iv) Eye detection. Video analytics deals with the detection of people and events like walking,
falling, standing at the camera.
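For illustration, Haar-cascade detection of this kind can be run with OpenCV as sketched below; the bundled frontal-face cascade is used here, whereas helmet cascades would have to be trained separately, and the frame path is hypothetical.

# Sketch: Haar-cascade detection with OpenCV's bundled frontal-face cascade.
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

gray = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY)  # illustrative frame
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    print("face at", x, y, w, h)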
The work in [41] uses Analog Devices Inc.'s CrossCore Embedded Studio
(CCES) together with a HOG-SVM approach for person detection. It also describes the Histogram
of Oriented Gradients (HOG) method, a feature set based on evaluating well-normalized local
histograms of image gradient orientations in a dense grid. Compared to the best Haar wavelet-
based detector, it gives good results for person detection, with relatively lower false-positive rates.
The main idea is to detect whether a person is wearing a mask or not: if a person is detected
but their face is not, it can be assumed that the person is wearing a mask. However, this
is also true when a person is facing away from the camera; in such
a situation the system detects the person but not their face and gives the wrong output. Therefore, to
deal with such scenarios, it is important to find out whether a person is coming towards or moving away
from the camera.
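A minimal sketch of HOG + SVM person detection using OpenCV's default people detector; this is an off-the-shelf illustration, not the CCES implementation from [41], and the frame path and parameters are illustrative.

# Sketch: pedestrian detection with OpenCV's HOG descriptor and bundled SVM model.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("street.jpg")                 # illustrative surveillance frame
rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)

for (x, y, w, h), score in zip(rects, weights):
    print("person at", x, y, w, h, "score", float(score))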
To determine whether a person is approaching the camera or moving away from it, the
author has discussed four steps in this paper. The first step is the
distance-from-the-camera method. This method is used to see if a person is approaching the
camera or going away from a camera. As the decreasing distance between a person and a camera
indicates that the person is approaching the camera and face detection can be triggered. The
second step is the eye line detection method. This method helps to find out the valley in
horizontal histogram projection. If the eye line is detected, face detection can be applied to see if
the person is wearing a mask or not. The third step is facial part detection. In this method, the
author has used Viola Jones’s algorithm to detect facial parts like nose, mouth, eyes, eyebrows,
etc. This algorithm results in a very high true detection rate and a very low false positive rate
which will be shown in the cases where a person is not wearing a mask. If any person is wearing
a mask or his/her face is covered with cloth or hand, then in such cases the detection of the face
might not take place, or face detection will take place but either nose or mouth will not be
detected indicating it as a mask. The final and most important step is to find out the eyes and
then trigger the face detection using the eye detection method. If the person is not wearing a
mask, eyes will be detected and face detection can thus be applied. When a person is wearing a
mask, eye detection returns true but face detection returns false indicating it is a mask.
This paper has stated that in video analytics, the false detection rate is maximum in eye
line detection algorithms as well as in eye detection algorithms. The reason is that eye line
detection and eye detection will detect very small parts of the image. For images with poor
resolution, it will result in false detection. For facial part detection, the execution time is
maximum as compared to all other steps as it deals with face detection followed by face parts
detection which is a complex algorithm. This paper has a detailed explanation of how to detect a
face mask and the authors have tested these above steps in real time and the results are quite
practical and satisfactory.
Dewantara et al. [72] trained a nose and mouth classifier to detect multi-pose
masked faces. The authors created a dataset of noses and mouths, and Haar-like, LBP and HOG features
were used to train the respective models. If the nose and mouth are not detected, the candidate
facial region is labeled "masked"; otherwise, it is labeled "no mask". It is reported
that the trained nose and mouth classifier achieves an accuracy of 86.9% using Haar-like
features, outperforming LBP and HOG. Obviously, there is further room to improve accuracy.
Petrovic et al. [73] developed an indoor safety IoT system which adopts multiple
AdaBoost cascade classifiers. These classifiers, provided by OpenCV, detect the frontal face,
nose, and mouth, respectively. For a candidate face region, if neither mouth nor nose is detected,
it is regarded as wearing a mask properly. If the nose is detected, it is labeled as "improper
mask". If the mouth is detected, it is labeled as "no mask". This approach may work well in
access control systems; however, it depends heavily on the OpenCV classifiers, and no details
about accuracy are provided.
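The decision rule of [73] can be sketched as follows, assuming nose and mouth cascade XML files are available locally (they are not bundled with every OpenCV build); the detection parameters are illustrative.

# Sketch of the rule in [73]: within a detected face region, absence of nose and mouth
# implies a properly worn mask.
import cv2

nose_cascade = cv2.CascadeClassifier("haarcascade_mcs_nose.xml")    # assumed local file
mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")  # assumed local file

def mask_condition(face_gray):
    noses = nose_cascade.detectMultiScale(face_gray, 1.1, 5)
    mouths = mouth_cascade.detectMultiScale(face_gray, 1.1, 5)
    if len(mouths) > 0:
        return "no mask"        # mouth visible
    if len(noses) > 0:
        return "improper mask"  # nose visible but mouth covered
    return "mask"               # neither visible: mask worn properly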
Unlike the method in [73], Nieto-Rodriguez et al. [69] used two AdaBoost detectors to
implement surgical mask detection. One detector is trained by LogitBoost for face detection, and
the other is trained by GentleAdaBoost for mask detection. Then, two color filters in the HSV
color space are employed to eliminate false positives. Considering the overlapping regions, cross
class removal strategy is designed to keep the region with higher confidence. The method is easy
to implement and it achieves an accuracy of 95% on 496 faces and 181 masks.
Fang et al. [75] developed a real-time masked face detection system that uses Haar-
like features for face detection and mouth detection, respectively. Similar to [73], the face region
is first located, and then mouth detection is used to determine the mask-wearing condition.
The designed algorithm is claimed to run on PYNQ-Z2 SoC platform with 0.13s response of
facial mask detection and 96.5% accuracy on given dataset.
In addition, Tengjiao He [76] employed skin color and eye detection for mask-wearing
detection. The first step is to locate the face region using an ellipse skin model and the geometric
relationship between the eyes and other facial parts. Then, the coverage of skin color in the bottom
half of facial region is calculated to judge mask-wearing conditions. However, this method can
only be applied to specific scenes.
Razavi et al. [77] employed the Faster R-CNN structure to detect people who do not wear a
mask or do not maintain a safe distance. It was applied to several road maintenance projects for
monitoring workers, ensuring that they wear masks and keep a proper physical distance. However, the
dataset is limited and it only focuses on construction scenes. Meivel et al. [23] used the Faster R-
CNN algorithm for mask detection and social distance measurement. This method achieves
93.4% accuracy for complex scenes such as facial poses, beard faces, multiple mask types, and
scarf images. Notably, the effects need improvement when converting surveillance images into
bird-view images.
Zhang et al. [47] developed a new framework for masked face detection called Context-
Attention R-CNN, which consists of a multiple-context feature extractor component, a decoupling
branch component, and an attention component. By extracting distinguishing features, it is able to
enlarge the inter-class differences and reduce the intra-class differences. They also created a
dataset that includes 8,635 faces with different conditions for experimental verification. The
framework achieves mAP = 84.1% on the given dataset, 6.8% higher than that of Faster R-
CNN with ResNet-50. However, the dataset is class-imbalanced.
Chowdary et al. [78] exploited a pre-trained InceptionV3 model to classify whether one
wears a mask or not. The last layer of InceptionV3 is replaced by 5 new layers, making it a
transfer learning model. It is reported to reach 99.9% accuracy on a simulated dataset.
Dey et al. [60] proposed MobileNet-Mask, a multi-phase deep learning method for face
mask detection, to prevent the transmission of SARS-CoV-2. The mask classifier depends on the
ROI detection of SSD and ResNet-10. Due to its minimal processing requirements and lightweight
mobile-oriented design, MobileNetV2 is a good choice for embedded systems. It is reported to
achieve higher accuracy than other methods.
Deng et al. [79] introduced attention mechanisms, inverse convolution and feature fusion
into the SSD structure for the mask-wearing detection task. It achieves a mAP of 91.7%,
outperforming SSD with 85.4% mAP. Wang et al. [80] proposed a holistic edge computing
framework to detect masked faces. It is a serverless in-browser solution that integrates YOLO,
CNN inference computing, and WebAssembly techniques. This design minimizes extra devices;
it is easy to deploy, has low computation costs and fast detection speed, and achieves mAP = 89%.
Loey et al [42] developed a YOLOv2 with ResNet-50 detector for medical face mask
detection. The method includes two parts. The first is designed by deep transfer learning for
feature extraction. The second part is implemented by YOLOv2 for masked face detection.
Specifically, mean IoU is introduced to estimate the best number of anchor boxes, which can
improve the accuracy. The method achieves AP = 81% on a dataset with 1415 images.
Jiang et al. [50] designed a Squeeze-and-Excitation (SE) YOLOv3 to balance
effectiveness and running speed for masked face detection. It introduces SE into Darknet-53 as
an attention mechanism to extract essential features, and adopts GIoU loss and focal loss to
enhance stability and robustness. A new dataset called the Properly Wearing Masked Face Detection
(PWMFD) dataset was created with three categories of masked faces. It is reported that the method
achieves mAP = 73.7% for 608 x 608 images. The method is expected to be used in access
control gate systems and non-contact temperature measurement. However, the similarity between
incorrectly worn masks is high, which may cause confusion: masks covering only the chin can be
regarded as no mask.
Prusty et al. [26] proposed a data augmentation technique to expand the dataset size. The new
dataset is used to train a YOLOv3 model for masked face detection. The average accuracy is more
than 93% on the three given datasets. However, only two kinds of data augmentation techniques
(grayscale and Gaussian blur) are used, which is very limited. Kumar et al. [51] tested the
original and tiny variants of YOLO on a new face mask detection dataset which
encompasses 52,635 images; over 50k labels are provided for the dataset. A modified tiny
YOLOv4 is recommended as an effective and efficient masked face detector because of its
optimized feature extraction network.
Yu et al. [31] improved the YOLOv4 model by introducing a modified CSPDarkNet53 to
reduce computation costs and enhance learning ability. An adaptive image scaling algorithm is
designed to reduce redundancy, and an improved PANet structure is used to learn more semantic
information. It is reported to achieve 98.3% accuracy at 54.57 fps in a running
environment of Windows 10, an Intel(R) i7-9700K and an RTX 2070 Super. One limitation is that
samples with insufficient lighting are not considered.
Sharma [85] developed a model that uses YOLOv5 to detect whether a person is wearing a
mask or not. However, if an individual does not face the camera, its performance decreases,
which is the method's limitation. Yang et al. [87] applied YOLOv5 to the supervision of
mask-wearing conditions. The authors designed a man-machine interface for the application and
set the identification time to 2 seconds to account for complex scenes. A 97.9% recognition rate
is achieved on the dataset in [62], although the response time seems a bit long. Ieamsaard et al.
[88] tested a YOLOv5-based model trained for 300 epochs, which outperformed models trained
for fewer than 300 epochs.
Jiang et al [11] proposed RetinaFaceMask for masked face detection, which is based on
RetinaFace [95]. RetinaFaceMask is a single-stage detector. Its principle is to employ feature
pyramid network to fuse high-level semantic information. A novel context attention module is
presented to help RetinaFaceMask focus on the features of faces and masks. Moreover, a cross-
class removal algorithm is proposed to remove those regions with low scores and high IoU
values. Experiments demonstrate that RetinaFaceMask outperforms RetinaFace [95] in Recall
and Precision. Moreover, there are more experimental comparisons between methods. Singh et al
[48] utilized two object detection models named Faster R-CNN and YOLOv3 for masked facial
detection. They presented the comparison from visual and quantitative views, and gave detailed
discussions about the application. Faster R-CNN outperforms YOLOv3 in accuracy;
however, for real-time applications YOLOv3 is preferred because it runs faster than
Faster R-CNN. The selection of the model depends on the environmental conditions. A similar
conclusion is drawn in [96]. Roy et al [43] used SSD, Faster R-CNN, YOLOv3, and
YOLOv3Tiny to cope with the challenges of wearing medical mask detection. These methods
are tested on Moxa3K dataset. Experimental results demonstrate that YOLOv3Tiny is the most
suitable method for real-time inference among the methods.
Loey et al. [12] developed a hybrid method of deep learning and machine learning to detect
face masks. It includes two components (or stages): ResNet-50 is used as the feature extractor, and
an SVM, a decision tree, and an ensemble method are used as classification models. The authors claim that
the SVM classifier achieves a testing accuracy of 99.49% on the SMFD dataset [61], outperforming
the decision tree and the ensemble method. Similar to [12], the methods in [118], [119] also choose
an SVM as the classifier in the second stage.
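A minimal sketch of this hybrid pipeline, assuming torchvision and scikit-learn; the dummy arrays stand in for real preprocessed face crops and labels, and the SVM kernel is an illustrative choice.

# Sketch: ResNet-50 deep features classified by an SVM, in the spirit of [12].
import numpy as np
import torch
import torchvision
from sklearn.svm import SVC

resnet = torchvision.models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()   # drop the final layer to expose 2048-d features
resnet.eval()

def extract_features(batch):      # batch: [N, 3, 224, 224] tensor of preprocessed images
    with torch.no_grad():
        return resnet(batch).numpy()

train_images = torch.randn(16, 3, 224, 224)          # placeholder for real training images
train_labels = np.random.randint(0, 2, size=16)      # placeholder mask / no-mask labels

svm = SVC(kernel="rbf")
svm.fit(extract_features(train_images), train_labels)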
Buciu [118] took the ratio of color channels into account to discriminate mask and no-
mask images. SSD is used to locate the positions of faces. Then the lower part of the face is
used to construct a feature vector called the color quotient feature, which is classified by an
SVM model. A recognition rate of 97.25% is obtained. However, this method is sensitive to
mask types, which is its potential weakness. Oumina et al [119] presented several combinations
of multiple CNNs with K-NN or SVM classifiers and conducted experiments. The results indicate
that the combination of MobileNetV2 and SVM achieves the best performance among the
combinations, with 97.11% accuracy. More tests of the approach should be conducted on bigger datasets.
Zereen et al. [120] developed a two-stage approach to detect masked faces and monitor
rule violations. It is based on the extraction of facial landmarks. It first determines whether the
target wears a multi-color mask or not using MTCNN, and then it determines whether the
target wears a skin-colored mask or not. The method aims to detect five types of facial images:
no mask, beard and mustache, one-color mask, multi-color mask and skin-colored mask.
It achieves an accuracy of 97.13% and overcomes the problem of detecting masks of various colors,
especially in differentiating skin-colored masks. However, the use of several techniques
increases the computation cost, and the setting of empirical thresholds limits its adaptability.
Lin et al [22] combined a sliding window algorithm with a modified LeNet (MLeNet) to
locate masked faces. To improve performance with a small dataset, horizontal reflection is used
to learn MLeNet via fine-tuning. MLeNet can be trained quickly in CPU mode, which makes sense
for real-world applications. However, the sliding window algorithm requires more computation for
large images, which restricts its performance. Rudraraju et al. [122] combined Haar-like
cascade-classifiers and two MobileNet models for face mask detection. Firstly, face regions are
detected by haar-like cascade-classifier. The first MobileNet model is used to classify masks and
no masks. The second MobileNet model is used to distinguish correct or incorrect wearing
masks. Experiments show that the system achieves around 90% accuracy. It is expected to be
deployed at fog gateway.
Tomas et al. [33] also chose a Haar-like cascade classifier for rapid face detection. A CNN
with transfer learning is used to determine whether one wears a mask or not. Multiple models are
trained on one dataset. VGG16 achieves the best performance with 0.834 accuracy, but its
model size is also the largest. For deployment on mobile devices, MobileNetV2, with 0.812 accuracy,
is selected as the classification model because it demands lower computation costs and smaller
storage. However, this method needs to be improved for detecting masked faces with
alterations and side views.
The method proposed by Lin et al [129] contains five stages: image data collection,
human posture parsing, ROI selection, image normalization, and classification of masked face.
Among these stages, human posture parsing is implemented with OpenPose [135], which generates 25
key points for one individual. Five key points belonging to the face region are used to extract the ROI
for image normalization. Then, the normalized image is classified by a Face Mask Recognition
Network (FMRN). It is reported that the method obtains 95.8% and 94.6% accuracy in daytime
and nighttime, respectively.
Table 2.1 summarizes a few notable works on face mask detection along with their
advantages and disadvantages.
Table 2.1 Comparison of various deep learning architectures
Gradient descent can be considered the most popular of the optimizers. This
optimization algorithm uses calculus to modify the values consistently and to reach the
local minimum. In simple terms, imagine you are holding a ball resting at the top of a bowl.
When you release the ball, it rolls along the steepest direction and eventually settles at the bottom of
the bowl. The gradient points in the steepest direction, guiding the ball towards the local minimum,
that is, the bottom of the bowl.
The gradient descent update rule is w = w - alpha * dJ(w)/dw, where alpha is the step size that
represents how far to move against the gradient at each iteration. Gradient descent works as
follows:
1. It starts with some coefficients, evaluates their cost, and searches for a cost value lower than
the current one.
2. It moves towards the lower cost and updates the values of the coefficients.
3. The process repeats until the local minimum is reached. A local minimum is a point
beyond which the cost cannot decrease further.
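A minimal sketch of this update rule on a toy one-dimensional cost, assuming J(w) = (w - 3)^2; the step size and iteration count are illustrative.

# Sketch: plain (batch) gradient descent mirroring w <- w - alpha * dJ/dw.
def grad(w):
    return 2.0 * (w - 3.0)   # derivative of (w - 3)^2

w = 10.0          # initial coefficient
alpha = 0.1       # step size (learning rate)
for step in range(100):
    w = w - alpha * grad(w)   # move against the gradient

print(w)  # converges towards the minimum at w = 3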
In stochastic gradient descent (SGD), the procedure is first to select the initial parameters w and
the learning rate η. The data is then randomly shuffled at each iteration to reach an approximate
minimum. Since we are not using the whole dataset but batches of it for each iteration, the path
taken by the algorithm is noisy compared to the batch gradient descent algorithm. Thus, SGD needs
a higher number of iterations to reach the local minimum, which increases the overall computation
time. But even with the increased number of iterations, the computation cost is still less than that
of the batch gradient descent optimizer. So the conclusion is that if the dataset is enormous and
computation time is an essential factor, stochastic gradient descent should be preferred over the
batch gradient descent algorithm.
The adaptive gradient descent algorithm (AdaGrad) is slightly different from the other gradient
descent algorithms, because it uses a different learning rate for each iteration. The change in
learning rate depends on how much the parameters change during training: the more a parameter
changes, the smaller its learning rate becomes. This modification is highly beneficial because
real-world datasets contain sparse as well as dense features, so it is unfair to use the same
learning rate for all features. The AdaGrad weight update can be written as
w_{t+1} = w_t - alpha_t * g_t, with alpha_t = eta / sqrt(sum_{i=1..t} g_i^2 + epsilon), where
alpha_t denotes the learning rate at iteration t, eta is a constant, g_i is the gradient at iteration i,
and epsilon is a small positive value to avoid division by zero.
The benefit of using AdaGrad is that it abolishes the need to tune the learning rate
manually. It is more reliable than the gradient descent algorithms and their variants, and it reaches
convergence at a higher speed.
One downside of AdaGrad optimizer is that it decreases the learning rate aggressively
and monotonically. There might be a point when the learning rate becomes extremely small. This
is because the squared gradients in the denominator keep accumulating, and thus the
denominator part keeps on increasing. Due to small learning rates, the model eventually becomes
unable to acquire more knowledge, and hence the accuracy of the model is compromised.
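A minimal sketch of the AdaGrad update on a toy cost, assuming NumPy; it shows the per-parameter accumulator that causes the shrinking learning rate discussed above, with illustrative values.

# Sketch: AdaGrad accumulates squared gradients, so eta / sqrt(G + eps) keeps shrinking.
import numpy as np

def adagrad_step(w, g, G, eta=0.1, eps=1e-8):
    G = G + g ** 2                      # accumulate squared gradients per parameter
    w = w - eta * g / np.sqrt(G + eps)  # parameter-specific, ever-shrinking step size
    return w, G

w = np.array([10.0, -5.0])
G = np.zeros_like(w)
for _ in range(200):
    g = 2.0 * w                   # gradient of the toy cost sum(w^2)
    w, G = adagrad_step(w, g, G)
print(w)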
RMSProp is one of the popular optimizers among deep learning enthusiasts, perhaps
because it was never formally published yet is still very well known in the community. RMSProp is
essentially an extension of Rprop. Rprop addresses the problem of varying gradients:
some gradients are small while others may be huge, so defining a single learning rate might not
be the best idea. Rprop uses the sign of the gradient, adapting the step size individually for each
weight. In this algorithm, the last two gradients are first compared by sign. If they have the same
sign, we are going in the right direction and the step size is increased by a small fraction;
if they have opposite signs, the step size is decreased. The step size is then clipped, and the
weight update is performed.
The problem with Rprop is that it does not work well with large datasets and with mini-batch
updates. Achieving the robustness of Rprop and the efficiency of mini-batches at the same time
was the main motivation behind RMSProp. RMSProp can also be considered an improvement over
the AdaGrad optimizer, as it avoids the monotonically decreasing learning rate.
The algorithm mainly focuses on accelerating the optimization process by decreasing the
number of function evaluations needed to reach the local minimum. The algorithm keeps a moving
average of the squared gradients for every weight and divides the gradient by the square root of this
mean square: E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2, where gamma is the
forgetting factor. The weights are then updated by w_{t+1} = w_t - (eta / sqrt(E[g^2]_t + epsilon)) * g_t.
In simpler terms, if there is a parameter that makes the cost function oscillate a lot,
we want to penalize the update of this parameter. Suppose you built a model to classify a variety
of fishes, and the model relies mainly on the feature 'color' to differentiate between the fishes,
because of which it makes a lot of errors. What RMSProp does is penalize the parameter 'color' so
that the model can rely on other features too. This prevents the algorithm from adapting too quickly
to changes in the parameter 'color' compared to the other parameters. This algorithm has several
benefits compared to earlier versions of gradient descent algorithms: it converges quickly
and requires less tuning than gradient descent algorithms and their variants. The problem with
RMSProp is that the learning rate has to be defined manually and the suggested value does not
work for every application.
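A minimal sketch of the RMSProp update on the same toy cost, assuming NumPy; the forgetting factor gamma keeps the squared-gradient average from growing without bound, and all values are illustrative.

# Sketch: RMSProp replaces AdaGrad's growing sum with a leaky average of squared gradients.
import numpy as np

def rmsprop_step(w, g, avg_sq, eta=0.01, gamma=0.9, eps=1e-8):
    avg_sq = gamma * avg_sq + (1.0 - gamma) * g ** 2   # gamma is the forgetting factor
    w = w - eta * g / np.sqrt(avg_sq + eps)
    return w, avg_sq

w = np.array([10.0, -5.0])
avg_sq = np.zeros_like(w)
for _ in range(500):
    g = 2.0 * w                         # gradient of the toy cost sum(w^2)
    w, avg_sq = rmsprop_step(w, g, avg_sq)
print(w)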
AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is based on
adaptive learning and is designed to deal with the significant drawbacks of the AdaGrad and
RMSProp optimizers. The main problem with those two optimizers is that the initial learning rate
must be defined manually. Another problem is the decaying learning rate, which becomes
infinitesimally small at some point, so that after a certain number of iterations the model
can no longer learn new knowledge.
To deal with these problems, AdaDelta uses two state variables: s_t, a leaky average of the
second moment of the gradient, and delta_x_t, a leaky average of the second moment of the change
of the parameters in the model. The rescaled gradient is
g'_t = sqrt((delta_x_{t-1} + epsilon) / (s_t + epsilon)) * g_t, and the parameters are updated as
x_t = x_{t-1} - g'_t, where s_t and delta_x_t denote the state variables, g'_t denotes the rescaled
gradient, delta_x_{t-1} denotes the leaky average of the squared rescaled gradients from the previous
step, and epsilon represents a small positive constant to avoid division by zero.
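In practice all of the optimizers discussed above are available off the shelf; the snippet below, assuming PyTorch, merely shows how they would be instantiated for a stand-in model, with illustrative hyperparameters.

# Sketch: instantiating the optimizers discussed in this chapter with torch.optim.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the mask classification network

optimizers = {
    "sgd": torch.optim.SGD(model.parameters(), lr=0.01),
    "adagrad": torch.optim.Adagrad(model.parameters(), lr=0.01),
    "rmsprop": torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9),
    "adadelta": torch.optim.Adadelta(model.parameters(), rho=0.9),
}
print(list(optimizers))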
CHAPTER 5
PROPOSED METHODOLOGY
5.1. Algorithm
As shown in Figure 5.1, the process starts with data collection. In this case study, the data is collected from an
available Kaggle dataset. It is then loaded and pre-processed in order to clean the
collected data, after which the data is split into training and testing sets. The training dataset is used to
train the model while the testing data is used to test it. A fully trained and tested model
results in effective and accurate detection of the presence or absence of a face mask.
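A minimal sketch of this pipeline, assuming a recent TensorFlow and scikit-learn and a hypothetical dataset/ directory with one sub-folder per class; the simple dense classifier is a placeholder, not the final model.

# Sketch of the Figure 5.1 pipeline: load images, preprocess, split, train and evaluate.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# 1. Data collection + preprocessing: load images from two class folders and rescale to [0, 1]
data = tf.keras.utils.image_dataset_from_directory(
    "dataset/", image_size=(224, 224), batch_size=32, label_mode="int")

images, labels = [], []
for batch_x, batch_y in data:
    images.append(batch_x.numpy() / 255.0)
    labels.append(batch_y.numpy())
X, y = np.concatenate(images), np.concatenate(labels)

# 2. Train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a placeholder classifier and 4. evaluate it on the held-out set
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(X_test, y_test))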
REFERENCES
1. A.S. Joshi, S.S. Joshi, G. Kanahasabai, R. Kapil, S. Gupta, Deep Learning Framework to
Detect Face Masks from Video Footage, in: 2020 12th International Conference on
Computational Intelligence and Communication Networks (CICN), 2020, pp. 435–440.
2. S.M. Nagashetti, S. Biradar, S.D. Dambal, C.G. Raghavendra, B.D. Parameshachari,
“Detection of Disease in Bombyx Mori Silkworm by Using Image Analysis Approach”
2021 IEEE Mysore Sub Section International Conference (MysuruCon), IEEE (2021),
pp. 440-444.
3. R.K. Kodali, R. Dhanekula, “Face Mask Detection Using Deep Learning” 2021
International Conference on Computer Communication and Informatics (ICCCI) (2021),
pp. 1-5.
4. D.L. Vu, T.K. Nguyen, T.V. Nguyen, T.N. Nguyen, F. Massacci, P.H. Phung, “A
convolutional transformation network for malware classification” 2019 6th NAFOSTED
conference on information and computer science (NICS), IEEE (2019), pp. 234-239.
5. P. Khamlae, K. Sookhanaphibarn, W. Choensawat, “An Application of Deep-Learning
Techniques to Face Mask Detection During the COVID-19” Pandemic 2021 IEEE 3rd
Global Conference on Life Sciences and Technologies (LifeTech) (2021), pp. 298-299.
6. K. Yu, L. Tan, L. Lin, X. Cheng, Z. Yi, T. Sato, “Deep-learning-empowered breast
cancer auxiliary diagnosis for 5GB remote E-health”, IEEE Wirel.
Commun., 28 (3) (2021), pp. 54-61.
7. A. Alguzo, A. Alzu'bi, F. Albalas, “Masked Face Detection using Multi-Graph
Convolutional Networks”, 2021 12th International Conference on Information and
Communication Systems (ICICS) (2021), pp. 385-391.
8. M.S. Islam, E. Haque Moon, M.A. Shaikat, M. Jahangir Alam, “A Novel Approach to
Detect Face Mask using CNN”, 2020 3rd International Conference on Intelligent
Sustainable Systems (ICISS) (2020), pp. 800-806.
9. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Proceedings of the 2001 IEEE computer society conference on computer
vision and pattern recognition. CVPR 2001, vol.1. IEEE, 2001, pp. I–I.
10. R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate
object detection and semantic segmentation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition,2014, pp. 580–587.
11. R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on
computer vision, 2015, pp. 1440–1448.
12. S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection
with region proposal networks,” in Advances in neural information processing systems,
2015, pp. 91–99.
13. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd:
Single shot multibox detector,” in European conference on computer vision. Springer,
2016, pp. 21–37.
14. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-
time object detection,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2016, pp. 779–788.
15. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object
detection,” 2017.
16. Haddad, J., 2020. How I Built A Face Mask Detector For COVID-19 Using Pytorch
Lightning.
17. Rosebrock, A., 2020. COVID-19: Face Mask Detector With Opencv, Keras/Tensorflow,
And Deep Learning- Pyimagesearch.
18. T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation
invariant texture classification with local binary patterns,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, July 2002.
19. T.-H. Kim, D.-C. Park, D.-M. Woo, T. Jeong, and S.-Y. Min, “Multi-class classifier-
based adaboost algorithm,” in Proceedings of the Second Sinoforeign-interchange
Conference on Intelligent Science and Intelligent Data Engineering, ser. IScIDE’11.
Berlin, Heidelberg: Springer-Verlag, 2012, pp. 122–127.
20. P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vision, vol.
57, no. 2, pp. 137–154, May 2004.
21. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple
features,” in Proceedings of the 2001 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition. CVPR 2001, vol. 1, Dec 2001, pp. I–I.
22. J. Li, J. Zhao, Y. Wei, C. Lang, Y. Li, and J. Feng, “Towards real world human parsing:
Multiple-human parsing in the wild,” CoRR, vol. abs/1705.07206.
23. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep
convolutional neural networks,” in Advances in Neural Information Processing Systems
25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates,
Inc., 2012, pp. 1097–1105.
24. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” CoRR, vol. abs/1409.1556, 2014.
25. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich, “Going deeper with convolutions,” 2015.
26. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–
778, 2016.
27. K. Li, G. Ding, and H. Wang, “L-fcn: A lightweight fully convolutional network for
biomedical semantic segmentation,” in 2018 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM), Dec 2018, pp. 2363–2367.
28. X. Fu and H. Qu, “Research on semantic segmentation of high-resolution remote sensing
image based on full convolutional neural network,” in 2018 12th International
Symposium on Antennas, Propagation and EM Theory (ISAPE), Dec 2018, pp. 1–4.
29. S. Kumar, A. Negi, J. N. Singh, and H. Verma, “A deep learning for brain tumor mri
images semantic segmentation using fcn,” in 2018 4th International Conference on
Computing Communication and Automation (ICCCA), Dec 2018, pp. 1–4.
30. B. QIN and D. Li, “Identifying face mask-wearing condition using image super-
resolution with classification network to prevent COVID-19”, May 2020.
31. M.S. Ejaz, M.R. Islam, M. Sifatullah, A. Sarker, “Implementation of principal component
analysis on masked and non-masked face recognition”, 2019 1st International Conference
on Advances in Science, Engineering and Robotics Technology (ICASERT) (2019), pp.
1-5.
32. Jeong-Seon Park, You Hwa Oh, Sang Chul Ahn, and Seong Whan Lee, “Glasses removal
from facial image using recursive error compensation,” IEEE Trans. Pattern Anal. Mach.
Intell. 27 (5) (2005) 805–811.
33. C. Li, R. Wang, J. Li, L. Fei, “Face detection based on YOLOv3”, in: Recent Trends in
Intelligent Computing, Communication and Devices, Singapore, 2020, pp. 277–284.
34. N. Ud Din, K. Javed, S. Bae, J. Yi, “A novel GAN-based network for unmasking of
masked face”, IEEE Access, 8 (2020), pp. 44276–44287.
35. A. Nieto-Rodríguez, M. Mucientes, V.M. Brea, “System for medical mask detection in
the operating room through facial attributes”, Pattern Recogn. Image Anal. Cham (2015),
pp. 138-145.
36. S. A. Hussain, A.S.A.A. Balushi, “A real time face emotion classification and recognition
using deep learning model”, J. Phys.: Conf. Ser. 1432 (2020) 012087.
37. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto,
H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision
Applications”.
38. M. S. Ejaz and M. R. Islam, "Masked Face Recognition Using Convolutional Neural
Network," 2019 International Conference on Sustainable Technologies for Industry 4.0
(STI), Dhaka, Bangladesh, 2019, pp. 1-6.
39. M. Dasgupta, O. Bandyopadhyay and S. Chatterji, "Automated Helmet Detection for
Multiple Motorcycle Riders using CNN," 2019 IEEE Conference on Information and
Communication Technology, Allahabad, India, 2019, pp. 1-4,
40. P. Doungmala and K. Klubsuwan, "Helmet Wearing Detection in Thailand Using Haar
Like Feature and Circle Hough Transform on Image Processing," 2016 IEEE
International Conference on Computer and Information Technology (CIT), Nadi, 2016,
pp. 611-614.
41. G. Deore, R. Bodhula, V. Udpikar and V. More, "Study of masked face detection
approach in video analytics," 2016 Conference on Advances in Signal Processing
(CASP), Pune, 2016, pp. 196-200.
APPENDIX –A
A.1 Introduction
Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows
anybody to write and execute arbitrary python code through the browser, and is especially well
suited to machine learning, data analysis and education. More technically, Colab is a hosted
Jupyter notebook service that requires no setup to use, while providing access free of charge to
computing resources including GPUs.
A.2 What is Google Colab?
Google Colab is an excellent tool for deep learning tasks. It is a hosted Jupyter notebook
that requires no setup and has an excellent free version, which gives free access to Google
computing resources such as GPUs and TPUs.
In Colab, we can enforce the Python version by clicking Runtime -> Change Runtime Type and
selecting python3. Note that as of April 2020, Colab uses Python 3.6.9, which should run
everything without any errors.
A.3 How to run a Python module in Colab
1. Store mylib.py in your Drive.
2. Copy it with !cp drive/MyDrive/mylib.py .
3. import mylib (see the sketch below).
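A minimal sketch of these steps as Colab notebook cells; the file name mylib.py and the Drive path are taken from the example above.

# Sketch: mount Drive, copy the module next to the notebook, then import it.
from google.colab import drive
drive.mount('/content/drive')            # authorize access to your Google Drive

!cp /content/drive/MyDrive/mylib.py .    # shell command inside a notebook cell

import mylib                             # the module can now be imported as usual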
A.4 What types of GPUs are available in Colab?
The types of GPUs that are available in Colab vary over time. This is necessary for Colab
to be able to provide access to these resources free of charge. Users who are interested in more
reliable access to Colab’s fastest GPUs may be interested in Colab Pro and Pro+. If you would
like to use specific hardware in Colab, check out Colab GCP Marketplace VMs.
Note that using Colab for cryptocurrency mining is disallowed entirely, and may result in
your account being restricted for use with Colab altogether.
A.5 How long can notebooks run in Colab?
Notebooks run by connecting to virtual machines that are deleted when idle for a while and that
have a maximum lifetime enforced by the Colab service, so a notebook cannot run indefinitely
(see A.9).
A.6 Where are my notebooks stored, and can I share them?
Colab notebooks are stored in Google Drive, or can be loaded from GitHub. Colab
notebooks can be shared just as you would with Google Docs or Sheets. Simply click the Share
button at the top right of any Colab notebook, or follow these Google Drive file sharing
instructions.
A.7 If I share my notebook, what will be shared?
If you choose to share a notebook, the full contents of your notebook (text, code, output,
and comments) will be shared. You can omit code cell output from being saved or shared by
using Edit > Notebook settings > Omit code cell output when saving this notebook. The
virtual machine you’re using, including any custom files and libraries that you’ve setup, will not
be shared. So it’s a good idea to include cells which install and load any custom libraries or files
that your notebook needs.
A.8 How can I search Colab notebooks?
You can search Colab notebooks using Google Drive. Clicking on the Colab logo at the
top left of the notebook view will show all notebooks in Drive. You can also search for
notebooks that you have opened recently using File > Open notebook.
A.9 Where is my code executed? What happens to my execution state if I close the browser
window?
Code is executed in a virtual machine private to your account. Virtual machines are
deleted when idle for a while, and have a maximum lifetime enforced by the Colab service.
A.10 How can I get my data out?
You can download any Colab notebook that you’ve created from Google Drive following
these instructions, or from within Colab’s File menu. All Colab notebooks are stored in the open
source Jupyter notebook format ( .ipynb).
A.11 How can I reset the virtual machine(s) my code runs on, and why is this sometimes
unavailable?
Select Runtime > Disconnect and delete runtime to return all managed virtual
machines assigned to you to their original state. This can be helpful in cases where a virtual
machine has become unhealthy e.g. due to accidental overwrite of system files, or installation of
incompatible software. Colab limits how often this can be done to prevent undue resource
consumption. If an attempt fails, please try again later.
A.12 Why does drive.mount() sometimes fail saying "timed out", and why do I/O
operations in drive.mount()-mounted folders sometimes fail?
Google Drive operations can time out when the number of files or subfolders in a folder
grows too large. If thousands of items are directly contained in the top-level "My Drive" folder
then mounting the drive will likely time out. Repeated attempts may eventually succeed as failed
attempts cache partial state locally before timing out.
If you encounter this problem, try moving files and folders directly contained in "My Drive" into
sub-folders. A similar problem can occur when reading from other folders after a successful
drive.mount(). Accessing items in any folder containing many items can cause errors like
OSError: [Errno 5] Input/output error. Again, you can fix this problem by moving directly
contained items into sub-folders. Note that "deleting" files or subfolders by moving them to the
Trash may not be enough;
if that doesn't seem to help, make sure to also Empty your Trash.
A.13 Why do Drive operations sometimes fail due to quota?
Google Drive enforces various limits, including per-user and per-file operation count and
bandwidth quotas. Exceeding these limits will trigger Input/output error as above, and show a
notification in the Colab UI. A typical cause is accessing a popular shared file, or accessing too
many distinct files too quickly. Workarounds include:
1. Copy the file using drive.google.com and don't share it widely so that other users don't use up
its limits.
2. Avoid making many small I/O reads, instead opting to copy data from Drive to the Colab VM
in an archive format (e.g. .zip or .tar.gz files) and unarchive the data locally on the VM instead of
in the mounted Drive directory.
3. Wait a day for quota limits to reset.