Research Article
Vision-Based Fall Detection with Convolutional
Neural Networks
Received 14 July 2017; Revised 26 September 2017; Accepted 9 November 2017; Published 6 December 2017
Copyright © 2017 Adrián Núñez-Marcos et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
One of the biggest challenges in modern societies is the improvement of healthy aging and the support of older persons in their daily activities. In particular, given its social and economic impact, the automatic detection of falls has attracted considerable attention in the computer vision and pattern recognition communities. Although approaches based on wearable sensors have provided high detection rates, some of the potential users are reluctant to wear them, so their use is not yet widespread. As a consequence, alternative approaches such as vision-based methods have emerged. We firmly believe that the emergence of the Smart Environment and Internet of Things paradigms, together with the increasing number of cameras in our daily environments, forms an optimal context for vision-based systems. Consequently, here we propose a vision-based solution using Convolutional Neural Networks to decide if a sequence of frames contains a person falling. To model the video motion and make the system scenario independent, we use optical flow images as input to the networks, followed by a novel three-step training phase. Furthermore, our method is evaluated on three public datasets, achieving state-of-the-art results in all three of them.
1. Introduction

Due to the physical weakness associated with aging, the elderly suffer high ratios of falls, which frequently imply negative consequences for their health. According to Ambrose et al. [1], falls are one of the major causes of mortality in old adults. This can be explained in part by the high incidence of falls in adults over the age of 65: one in three adults falls at least once per year. In addition, the impact of these falls is a major concern for health care systems. It has to be noted that falls lead to moderate to severe injuries, fear of falling, loss of independence, and death for a third of the elderly who suffer these accidents. Moreover, the costs associated with these health problems are not negligible: two reference countries like the United States and the United Kingdom, with very different health care systems, spent US$23.3 and US$1.6 billion, respectively, in 2008 [2]. Taking into account the growth of the aging population, these expenditures are expected to approach US$55 billion by 2020.

These considerations have boosted the research on automatic fall detection to enable fast and proper assistance to the elderly (see Section 2 for a review of the state of the art). The most common strategies consist in a combination of sensing and computing technologies to collect relevant data and develop algorithms that can detect falls based on the collected data [3]. These approaches have led to the appearance of Smart Environments for elderly assistance, which had traditionally been limited to home settings [4]. However, we believe that, with the emergence of the Internet of Things (IoT) paradigm [5], the possibilities to extend Smart Environments, and more specifically fall detection approaches, grow considerably.
In this paper, we focus on vision-based approaches for fall detection. Cameras provide very rich information about persons and environments, and their presence is becoming more and more important in several everyday environments due to surveillance necessities. Airports, train and bus stations, malls, and even streets are already equipped with cameras. More importantly, cameras are also installed in elderly care centers. Therefore, reliable vision-based fall detection systems may play a very important role in future health care and assistance systems.

The recent impact of deep learning has changed the landscape of computer vision, improving the results obtained in many relevant tasks, such as object recognition, segmentation, and image captioning [6]. In this paper, we present a novel approach in this domain which takes advantage of Convolutional Neural Networks (CNN) for fall detection (Section 3). More precisely, we introduce a CNN that learns how to detect falls from optical flow images. Given the small size of typical fall datasets, we take advantage of the capacity of CNNs to be sequentially trained on different datasets. First of all, we train our model on the Imagenet dataset [7] to acquire the relevant features for image recognition. Afterwards, following the approach of [8], we train the CNN on the UCF101 action dataset [9]. For that purpose, we calculate the optical flow images of consecutive frames and use them to teach the network how to detect different actions. Finally, we apply transfer learning by reusing the base network weights and fine-tuning the classification layers so that the network focuses on the binary problem of fall detection.

As a result of the research carried out, this paper presents the following main contributions:

(i) To the best of our knowledge, this is the first time that transfer learning is applied from the action recognition domain to fall detection. In that sense, the use of transfer learning is crucial to address the small amount of samples in public fall detection datasets.

(ii) We use optical flow images as input to the network in order to have independence from environmental features. These images only represent the motion of consecutive video frames and ignore any appearance-related information such as color, brightness, or contrast. Thus, we are presenting a generic CNN approach to fall detection.

2. Related Work

The literature of fall detection is divided between sensor-based and vision-based approaches. Sensor-based detection has commonly relied on the use of accelerometers, which provide proper acceleration measures such as vertical acceleration. In the case of falls, these measures are very different compared to daily activities or confounding events (such as bending over or squatting), allowing us to discern between them. Vallejo et al. [10] and Sengto and Leauhatong [11] proposed feeding a Multilayer Perceptron (MLP) with the data of a 3-axis accelerometer (acceleration values in the x-, y-, and z-axes). Kwolek and Kepski [12] applied an Inertial Measurement Unit (IMU) combined with the depth maps obtained from a Kinect camera. They also made use of a Support Vector Machine (SVM) classifier, feeding it the data from the IMU and the Kinect. Approaches like the latter and [13] combined sensors with vision techniques. However, they used vision-based solutions only to ascertain the prediction of the sensor-based approach.

The purely vision-based approaches focus on the frames of videos to detect falls. By means of computer vision techniques, meaningful features such as silhouettes or bounding boxes are extracted from the frames in order to facilitate detection. Some solutions use those features as input for a classifier (e.g., Gaussian Mixture Model (GMM), SVM, and MLP) to automatically detect if a fall has occurred. The use of tracking systems is also widespread; for example, Lee and Mihailidis [18] applied tracking techniques in a closed environment to detect falls. They proposed using connected-components labeling to compute the silhouette of a person and extracting features such as the spatial orientation of the center of the silhouette or its geometric orientation. Combining this information, they are able to detect positions and also falls. Rougier et al. [19] suggested using silhouettes as well, which is a common strategy in the literature. Applying a matching system along the video to track the deformation of the silhouette, they analyzed the shape of the body and finally obtained a result with a GMM. Mubashir et al. [3] tracked the person's head to improve their results using a multiframe Gaussian classifier, which was fed with the direction of the principal component and the variance ratio of the silhouette. Another common technique consists in computing the bounding boxes of the objects in the scene to determine if they contain a person and then detecting the fall by means of features extracted from them (see, for instance, [20, 21]). Following a similar strategy, Vishwakarma et al. [22] worked with bounding boxes to compute the aspect ratio, horizontal and vertical gradients of an object, and the fall angle and fed them into a GMM to obtain a final answer. Many solutions are based on supervised learning, that is, extracting lots of features from raw images and using a classifier to learn a decision from labeled data. This is the case, for example, of Charfi et al. [17], who extracted 14 features, applied some transformations to them (the first and second derivatives, the Fourier transform, and the Wavelet transform), and used an SVM to do the classification step. Zerrouki et al. (2016) [23] computed occupancy areas around the body's gravity center, extracted their angles, and fed them into various classifiers, with the SVM obtaining the best results. In 2017, the same authors extended their previous work by adding Curvelet coefficients as extra features and applying a Hidden Markov Model (HMM) to model the different body poses [14]. A less frequent technique was used by Harrou et al. [24], who applied Multivariate Exponentially Weighted Moving Average (MEWMA) charts. However, they could not distinguish between falls and confounding events, which is a major issue that is taken into account in our solution. In fact, not being able to discriminate between such situations produces a great amount of false alarms.

Another branch inside the vision-based fall detection systems is the adoption of 3D vision to take advantage of 3D structures.
Figure 1: The system architecture or pipeline: the RGB images are converted to optical flow images, then features are extracted with a CNN,
and a FC-NN decides whether there has been a fall or not.
This strategy requires the use of multiple cameras (passive systems) or active depth cameras, such as the Microsoft Kinect or time-of-flight cameras, that can extract depth maps. The Kinect camera is very popular given its low price and high performance. Auvinet et al. [25] used a Kinect camera to build a 3D silhouette and then analyze the volume distribution along the vertical axis. Gasparrini et al. [26] also used such a camera to extract 3D features and then applied a tracking system to detect the falls. The Kinect software provides body joints, which were used by Planinc and Kampel [27] to obtain the orientation of the person's major axis based on their position. Diraco et al. [28] made use of depth maps to compute 3D features. Another simple and yet interesting approach was given by Mastorakis and Makris [29], who went beyond the typical 2D bounding boxes strategy and applied 3D bounding boxes. All the aforementioned methods took advantage of the 3D information provided by their camera systems. The drawbacks of such approaches are related to system deployment: they need either multiple synchronized cameras focused on the same area or active depth cameras, which usually have narrow fields of view and a limited depth range. Thus, from the point of view of system deployment, 2D passive systems are usually a better option, given their lower cost. It is also important to highlight that cameras are already installed in many public places, such as airports, shops, and elderly care centers. Those reasons make 2D passive camera-based fall detection a relevant application domain.

Nowadays, the use of deep neural networks is growing in many problem domains, including vision-based fall detection. Wang et al. [15] proposed using a PCAnet [30] to extract features from color images and then applied an SVM to detect falls. This approach is similar to ours, but instead of a PCAnet we use a modified VGG-16 architecture [31] that allows us to process various frames to take motion into account. Another research work, led by Wang et al. [16], combined Histograms of Oriented Gradients (HOG), Local Binary Patterns (LBP), and features extracted from a Caffe [32] neural network to recognize a silhouette and then applied an SVM classifier. In contrast, we avoid feature engineering completely, relying on the features learned by a CNN.

3. Materials and Methods

The design of our fall detection architecture was driven by the following objectives:

(i) To make the system independent from environmental features

(ii) To minimize the hand-engineered image processing steps

(iii) To make the system generic, so it works in different scenarios

To tackle the first objective, the key was to design a system that works on human motion, avoiding any dependence on image appearance. In that sense, a fall in a video can be expressed as a few contiguous frames stacked together. However, this is a naive approach, as the correlation between the frames is not taken into account by processing each image separately. To address this problem, the optical flow algorithm [33] was used to describe the displacement vectors between two frames. Optical flow allowed us to represent human motion effectively and avoid the influence of static image features.

In order to minimize hand-engineered image processing steps, we used CNNs, which have been shown to be very versatile automatic feature extractors [6]. CNNs can learn the set of features that better suits a given problem if enough examples are provided during their training phase. Furthermore, CNNs are also very convenient tools to achieve generic features. For that purpose, network parameters and training strategies need to be tuned.

Since time management is a crucial issue in fall detection, a way to cope with time and motion had to be added to CNNs. With that objective, we stacked a set of optical flow images and fed them into a CNN to extract an array of features F ∈ R^(w×h×s), where w and h are the width and height of the images and s is the size of the stack (number of stacked optical flow images). Optical flow images represent the motion of two consecutive frames, which is too short-timed to detect a fall. However, by stacking a set of them, the network can also learn longer time-related features. These features were used as input of a classifier, a fully connected neural network (FC-NN), which outputs a signal of "fall" or "no fall." The full pipeline can be seen in Figure 1.

Finally, we used a three-step training process for our optical flow stack-based CNN. This training methodology is adopted due to the low number of fall examples found in public datasets (Section 3.3). Furthermore, it also pursues the generality of the learned features for different falling scenarios. The three training steps and their rationale are explained in detail in Section 3.2.
Figure 2: Sample of sequential frames of a fall from the Multiple Cameras Fall Dataset (a) and their corresponding optical flow horizontal
displacement images (b).
3.1. The Optical Flow Images Generator. The optical flow algorithm [34] represents the patterns of the motion of objects as displacement vector fields between two consecutive images; each component of the field can be seen as a 1-channel image I ∈ R^(w×h×1), where w and h are the width and height of the image, and the field represents the correlation between the input pair. By stacking 2L optical flow images (i.e., L pairs of horizontal and vertical components of the vector fields, d_t^x and d_t^y, respectively), we can represent a motion pattern across the stacked frames. This is useful to model short events like falls. The use of optical flow images is also motivated by the fact that anything static (background) is removed and only motion is taken into account. Therefore, the input is invariant to the environment where the fall may occur. However, optical flow also presents some problems, for example, with lighting changes, as they can produce displacement vectors that are not desirable. Newer algorithms try to alleviate those problems, although there is no way of addressing them in all cases. However, in the available datasets the lighting conditions are stable, so the optical flow algorithm seems to be the most appropriate choice.

The first part of our pipeline, the optical flow images generator, receives L consecutive images and applies the TVL-1 optical flow algorithm [35]. We chose TVL-1 due to its better performance with changing lighting conditions compared to other optical flow algorithms. We took separately the horizontal and vertical components of the vector field (d_t^x and d_t^y, resp.) and stacked them together to create stacks O ∈ R^(224×224×2L), where O = {d_t^x, d_t^y, d_{t+1}^x, d_{t+1}^y, ..., d_{t+L}^x, d_{t+L}^y}. More precisely, we used the software tool (https://github.com/yjxiong/dense_flow/tree/opencv-3.1) provided by Wang et al. in [36] to compute the optical flow images (see Figure 2 for an example of its output). We kept the original optical flow computation parameters of Wang et al. to replicate their results in action recognition. Then we used the exact same configuration to compute the optical flow images of the fall detection datasets. As the CNN has learned filters according to the optical flow images of the action recognition datasets, using the same configuration for the fall detection images minimizes the loss of performance due to the transfer learning.

3.2. Neural Network Architecture and Training Methodology. The CNN architecture was a pivotal decision for the design of our fall detection system. There have been many architectural designs for image recognition in recent years (AlexNet [37], VGG-16 [31], and ResNet [38], among others) which have been equally used in computer vision problems. In particular, we chose a modified version of a VGG-16 network following the temporal net architecture of Wang et al. [8] for action recognition. The use of such an architecture was motivated by the high accuracy obtained in other related domains.

More concretely, we replaced the input layer of VGG-16 so that it accepted a stack of optical flow images O ∈ R^(w×h×2L), where w and h are the width and height of the images and 2L is the size of the stack (L is a tunable parameter). We set L = 10, the number of optical flow image pairs in a stack, following [39], as a suitable time window to accurately capture short-timed events such as falls. The whole architecture (see Figure 3) was implemented using the Keras framework [40] and is publicly available (https://github.com/AdrianNunez/Fall-Detection-with-CNNs-and-Optical-Flow).

We followed a three-step training process to train the network for fall detection with a double objective:

(i) To address the low number of fall samples in public datasets: a deep CNN learns better features as more labeled samples are used in the training phase. For instance, the Imagenet dataset, which is widely used for object recognition tasks in images, has 14 million images [7]. Current fall datasets are very far from those figures; thus, it is not feasible to learn robust and generic features for fall detection based only on those datasets. In such cases, transfer learning has been shown to be a suitable solution [41].

(ii) To build a generic fall detector which can work in several scenarios: this requires developing a generic feature extractor, able to focus only on those motion patterns that can discriminate fall events from other corporal motions.
[Figure 3 diagram: input (224 × 224 RGB image) → CONV3-64, CONV3-64, MAX POOL → CONV3-128, CONV3-128, MAX POOL → CONV3-256 ×3, MAX POOL → CONV3-512 ×3, MAX POOL → CONV3-512 ×3, MAX POOL → FC-4096, FC-4096, FC-1000, SOFT-MAX → predictions.]
Figure 3: VGG-16 architecture: convolutional layers in green, max pooling layers in orange, and fully connected layers in purple. We followed the same notation as the original paper.
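To make the architecture of Figure 3 concrete, the following Keras sketch builds a VGG-16-style network with the 20-channel optical flow input described in Section 3.2. The function name is hypothetical, and the single sigmoid output shown here corresponds to the binary fall detection stage rather than to the FC-1000/softmax head of the original ImageNet model.

from tensorflow.keras import layers, models

def build_temporal_vgg16(stack_size=20):
    inputs = layers.Input(shape=(224, 224, stack_size))   # stack of 2L = 20 optical flow channels
    x = inputs
    # Five convolutional blocks with the VGG-16 filter counts
    for filters, convs in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
        for _ in range(convs):
            x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation='relu')(x)
    x = layers.Dropout(0.9)(x)                             # dropping probabilities reported in Section 3.2
    x = layers.Dense(4096, activation='relu')(x)
    x = layers.Dropout(0.8)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)     # fall / "no fall"
    return models.Model(inputs, outputs)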
[Figure 4 diagram: a frozen CNN feature extractor followed by a classifier with 2 classes: fall/"no fall".]
Figure 4: The transfer learning process applied to the neural network. (1) Full training with Imagenet to learn a generic feature extractor. (2)
Training in UCF101 to learn to model motion. (3) Fine-tuning to build a fall detector.
The training steps for transfer learning, summarized in Figure 4, are the following:

(1) We trained the original VGG-16 net on the Imagenet dataset [7], which has over 14 million images and 1,000 classes. This is a standard practice in the deep learning literature [42], as the network learns generic features for image recognition; for example, it can distinguish corners, textures, basic geometric elements, and so on. Even though the target is to process optical flow images rather than RGB images, Wang et al. [8] argue that the generic appearance features learned from Imagenet provide a solid initialization for the network in order to learn more optical flow oriented features. This is due to the generic features learned in the first layers of the network, as they are useful for any domain. Then, only the top part must be tuned to adapt to a new dataset or domain.

(2) Based on the CNN trained on Imagenet, we modified the input layer to accept inputs O ∈ R^(224×224×20), where 224 × 224 is the size of the input images of the VGG-16 architecture and 20 is the stack size, as described in [8]. Next, we retrained the network with the optical flow stacks of the UCF101 dataset [9]. This dataset contains 13,320 videos covering 101 human actions. This second step allowed the network to learn features to represent human motion, which could later be used to recognize falls.

(3) In the final step, we froze the convolutional layers' weights so that they remained unaltered during training. To speed up the process, we saved the features extracted from the convolutional layers up to the first fully connected layer, hence having an array of features F ∈ R^4096 for each input stack. Basically, the third step consists in fine-tuning the remaining two fully connected layers, using dropout regularization [43] with 0.9 and 0.8 dropping probabilities.

[Figure 5 diagram: first, second, and third optical flow stacks obtained with a sliding window of length 10.]
Figure 5: Sliding window method to obtain stacks of consecutive frames.

For the fine-tuning with a fall dataset, we extracted L frames from fall and "no fall" sequences (extracted from the original videos) using a sliding window with a step of 1 (see Figure 5). This way, we obtained N − L + 1 blocks of frames, assuming N is the number of frames in a given video and L the size of the block, instead of the N/L blocks obtained with a nonoverlapping sliding window. We did not apply other data augmentation techniques. To deal with imbalanced datasets, we resampled (without replacement) the data labeled as "no fall" to match the size of the data labeled as fall.
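The following sketch illustrates the sliding-window extraction and the "no fall" undersampling just described. The label convention (0 = fall, 1 = "no fall") is the one used in Section 4; the rule that marks a window as a fall when it contains any fall frame is an assumption of this example, since the original annotations are given per frame.

import numpy as np

def sliding_windows(flow_frames, frame_labels, L=10):
    # flow_frames: (N, 224, 224, 2) per-frame flow; frame_labels: (N,) with 0 = fall, 1 = "no fall".
    stacks, labels = [], []
    N = len(flow_frames)
    for start in range(N - L + 1):                         # N - L + 1 overlapping blocks, step of 1
        window = flow_frames[start:start + L]              # L consecutive flow images
        stack = np.concatenate(list(window), axis=-1)      # (224, 224, 2L)
        label = 0 if (frame_labels[start:start + L] == 0).any() else 1  # assumed labeling rule
        stacks.append(stack)
        labels.append(label)
    return np.array(stacks), np.array(labels)

def balance(stacks, labels, seed=0):
    # Undersample "no fall" stacks (without replacement) to match the number of fall stacks.
    rng = np.random.default_rng(seed)
    fall_idx = np.where(labels == 0)[0]
    nofall_idx = np.where(labels == 1)[0]
    keep = rng.choice(nofall_idx, size=len(fall_idx), replace=False)
    idx = np.concatenate([fall_idx, keep])
    return stacks[idx], labels[idx]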
Table 1: Number of frames of each dataset, distribution of frames per class (fall and “no fall”), and number of fall/“no fall” samples (sequences
of frames corresponding to a fall or a “no fall” event).
Even after balancing the datasets, the learning of the fall class seemed to be difficult. As can be seen from the results of Section 4, the network did not perform as well on the fall class as on the "no fall" class. Thus, we needed alternative ways of increasing the importance of the fall class in the learning process, such as modifying the loss function. This function expresses how far our predictions are from the ground truth and guides the weight updates during training. In particular, we chose the binary cross-entropy loss function, defined as

loss(p, t) = −(t · log(p) + (1 − t) · log(1 − p)),    (1)

where p is the prediction of the network and t is the ground truth. A way of increasing the importance of a class consists in adding a scaling factor or "class weight" to the loss function. Therefore, the loss function is finally given by

loss(p, t) = −(w1 · t · log(p) + w0 · (1 − t) · log(1 − p)),    (2)

where w0 and w1 are, respectively, the weights for the "fall" class and the "no fall" class, p is the prediction of the network, and t is the ground truth. A class weight of 1.0 means no change in the weighting of that class. The use of a higher class weight for class 0 (the fall class), that is, w0, penalizes every mistake made on that class more than the mistakes on class 1. A neural network always tries to minimize the loss by adapting its weights; this is the basis of the backpropagation algorithm [44]. Therefore, by using this modified loss function, we are encouraging the network to prioritize the learning of one of the classes. However, this might come at the price of worsening the learning of the other class. For that reason, in Section 4 we present metrics that show the performance of each class separately.

Although the use of a w0 greater than 1.0 biases the learning towards falls (in the case of w1 = 1), we argue that this bias is convenient in fall detection because of the importance of detecting a fall, even at the price of having some false alarms. A missed detection could be critical for the health of the elderly and is, therefore, something to avoid.

3.3. Datasets. We selected three datasets that are often used in the literature, which makes them suitable for benchmarking purposes: the UR Fall Dataset (URFD) (http://fenix.univ.rzeszow.pl/~mkepski/ds/uf.html), the Multiple Cameras Fall Dataset (Multicam) (http://www.iro.umontreal.ca/~labimage/Dataset/), and the Fall Detection Dataset (FDD) (http://le2i.cnrs.fr/Fall-detection-Dataset?lang=fr):

(i) URFD contains 30 videos of falls and 40 videos of Activities of Daily Living (ADL), which we label as "no falls."

(ii) Multicam contains 24 performances (22 with at least one fall and the remaining two with only confounding events). Each performance has been recorded from 8 different perspectives. The same stage is used for all the videos, with some furniture reallocation.

(iii) FDD contains 4 different stages (in contrast to the previous ones) with multiple actors.

The available datasets are recorded in controlled environments with various restrictions:

(i) There is only one actor in a video.

(ii) Images are recorded under suitable and stable lighting conditions, avoiding dark situations or abrupt lighting changes.

The falls appear in different positions of the scenario, both far from and near the camera. Especially in the Multicam dataset, where there are eight cameras available, the distance to the camera varies significantly. In the FDD dataset, some falls are also far from the camera, although this is not as general a case as in Multicam.

All videos have been segmented into frames and divided between falls and "no falls," following the provided annotations. Table 1 summarizes the most relevant figures of each dataset.

In Section 4, we compare our results with the state of the art on all three datasets. Furthermore, we believe that the combination of the three datasets also provides a good indicator of the generality of our approach.
Table 2: Results of our system on the three datasets (URFD, FDD, and Multicam) using different setups. In the minibatch size column, "full" means batch training, and w0 is the value of the weight given to the fall class. Notice that the ReLU activation function is always preceded by batch normalization. Sens. refers to sensitivity, whereas Spec. is used for specificity.
4. Results and Discussion

To validate our fall detector system, we set up several experiments using the datasets of Section 3.3. In particular, we conducted four types of experiments, namely, (i) experiments for network configuration analysis, with the aim of finding the most suitable configuration for the problem; (ii) experiments to compare our method with the state-of-the-art approaches for fall detection; (iii) experiments to test the system under different lighting conditions; and (iv) an experiment to prove the generality of the system by combining all datasets.

4.1. Evaluation Methodology. From the point of view of supervised learning, fall detection can be seen as a binary classification problem in which a classifier must decide whether specific sequences of video frames represent a fall or not. The most common metrics to assess the performance of such a classifier are sensitivity, also known as recall or true positive rate, and specificity or true negative rate. These metrics are not biased by imbalanced class distributions, which makes them more suitable for fall detection datasets, where the number of fall samples is usually much lower than the number of nonfall samples. For fall detection, the sensitivity is a measure of how good our system is at predicting falls, whereas specificity measures the performance for "no falls." Nevertheless, for the sake of comparison with other existing approaches, we also computed the accuracy in the cases where no other metric was given in the original papers. Therefore, the three evaluation metrics we used are defined as follows:

Sensitivity/Recall = TP / (TP + FN),
Specificity = TN / (TN + FP),
Accuracy = (TP + TN) / (TP + TN + FP + FN),    (3)

where TP refers to true positives, TN to true negatives, FP to false positives, and FN to false negatives. Given a video, we evaluated the performance of the system for each optical flow stack; that is, we checked the prediction for a stack with respect to its real label. We used a block size of 10 consecutive frames to create the stack. This number was empirically found by [39]. Consequently, we define the values mentioned above as follows:

(i) TP: an optical flow stack labeled as fall and predicted as fall

(ii) FP: an optical flow stack labeled as "no fall" and predicted as fall

(iii) TN: an optical flow stack labeled as "no fall" and predicted as "no fall"

(iv) FN: an optical flow stack labeled as fall and predicted as "no fall"

4.2. Experimental Setup. We conducted four main experiments:

(1) Search for the best configuration: we evaluated different network configurations in terms of sensitivity and specificity to analyze their impact on each dataset. More specifically, we investigated the role of the learning rate, minibatch size, class weight (explained in Section 3.2), and the use of the Exponential Linear Unit (ELU) [45] activation function compared with the Rectified Linear Unit (ReLU) preceded by Batch Normalization [46] (as discussed by Mishkin et al. in [47]). Regarding the minibatch size, for some experiments we used batch training (the whole dataset is seen in each update) instead of minibatch training (different data chunks per update).

(2) Comparison with the state of the art: using the best configuration found in the first experiment, we compared our results with those of the literature. Again, the evaluation was performed in terms of sensitivity and specificity and, in some specific cases, accuracy was also used.

(3) Test with different lighting conditions: in order to provide an understanding of how the system would cope with different lighting conditions (not seen in the datasets of Section 3.3), we conducted two experiments: one with the images darkened and another one with a dynamic light. The evaluation was performed using sensitivity and specificity in all cases.

(4) Generality test: to know to what extent our solution is generic, we made an experiment combining all three datasets. The evaluation was again performed using the sensitivity and specificity values.

Regarding the optimization of the network parameters, we used Adam (all parameters apart from the learning rate are set as mentioned in [48]) for 3,000 to 6,000 epochs, depending on the experiment's computational burden. For slow training in the first experiment (search for the best configuration), we applied early stopping with a patience of 100 epochs; that is, if the loss does not improve for 100 epochs, the training stops. This is a suboptimal greedy strategy to get the best model while avoiding full training when it may not be necessary. We always used a value of 1 for the class weight w1 in the loss function (see (2)), as we were not interested in increasing the importance of the "no fall" class.

4.3. Best Configuration Results. In the search for the best network configuration, we split each dataset into training and test sets with an 80:20 ratio and a balanced distribution of labels (0 or 1, i.e., fall or "no fall").

The search in the space of configurations included multiple experiments using different learning rates. We also varied the values of the minibatch size (with the option of batch training), applying a base w0 class weight of 2 at first and then modifying it in the directions that seemed promising. The ReLU with Batch Normalization and ELU options were used equally, although we chose the most encouraging one if the results were favorable to it. The results are summarized in Table 2.
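As a summary of the optimization setup described in Sections 3.2 and 4.2, the sketch below compiles the model with a class-weighted binary cross-entropy mirroring (2), Adam, early stopping with a patience of 100 epochs, and batch training. build_temporal_vgg16, X_train, and y_train are the hypothetical helpers and arrays from the previous sketches, not part of the published code.

import tensorflow as tf
from tensorflow.keras import backend as K

def weighted_bce(w0=2.0, w1=1.0):
    # Class-weighted binary cross-entropy of (2); class 0 is the fall class.
    def loss(t, p):
        p = K.clip(p, K.epsilon(), 1.0 - K.epsilon())
        return -K.mean(w1 * t * K.log(p) + w0 * (1.0 - t) * K.log(1.0 - p))
    return loss

model = build_temporal_vgg16()
for layer in model.layers:
    # Freeze the convolutional base (training step 3); only the fully connected layers are updated.
    if isinstance(layer, (tf.keras.layers.Conv2D, tf.keras.layers.MaxPooling2D)):
        layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=weighted_bce(w0=2.0), metrics=['accuracy'])
early = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=100)
# "Batch training": the batch size equals the whole (small) training set.
model.fit(X_train, y_train, epochs=3000, batch_size=len(X_train), callbacks=[early])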
The results of the different configurations are essentially dependent on the number of samples we had to train the network. However, we observed that some configurations and ranges of hyperparameter values performed well in all the datasets:

(i) Learning rate: the network performed best with learning rate values between 10^-3 and 10^-5. In fact, the best models for the three datasets use 10^-3, 10^-4, and 10^-5 as learning rate. However, a higher or lower value creates some problems that are reflected in the sensitivity and specificity. More concretely, we often see that the falls are learned better; this behavior could be explained by the analysis given in Section 4.3.1. In other cases, extreme values for the learning rate harm the performance in the "no fall" class (specificity). This is natural, as in both cases (very high and very low learning rates) it is harder to make the network converge: when the learning rate is high, the network takes steps that are too big to reach the minimum of the loss function, and when it is low, it stays far from the minimum, possibly stuck at a saddle point.

(ii) Minibatch size: we used minibatch sizes ranging from 64 to 1,024 using powers of 2, as is commonly seen in the literature, and batch training, where all the samples of a dataset are used to update the weights in each epoch (notice that this is possible due to the low amount of data in these datasets). In the case of URFD, we employed smaller batch sizes because the amount of samples is not high enough. We obtained the best results using batch training or a minibatch size of 1,024, so we deduce that a large amount of samples in each batch is highly beneficial, as more data allow the network to learn a better gradient. Otherwise, if a class is underrepresented in a minibatch, applying different class weights may cause problems in the gradient calculations, thus making the convergence of the learning process more difficult. The other values (64, 128, and 256) do not seem to affect the results significantly. For some cases a small value of 64 performs better than 256, whereas the opposite case also exists depending on the dataset. Therefore, the results obtained with small minibatch sizes may be explained by the randomness of the batch construction.

(iii) Class weight: the class weight w0 is important to increase the importance of the fall class at training time, as the "no fall" class was empirically found to be learned better than the fall class. This is reflected in the results, with high values of specificity ("no fall" class performance) in contrast to the lower values of sensitivity (fall class performance). When this happens, a value of w0 higher than 1.0 helps achieve higher sensitivity. The base value was set to 2.0 and incremented or decremented when we observed promising results. The results show that a value of 2.0 for w0 is adequate in general, although we see a value of 1.0 for the best results in URFD and Multicam. However, in the case of URFD, we got very similar results for the value of 2.0 (having in both cases perfect sensitivity). Therefore, if we had to pick a standard value for all the datasets, a value of 2.0 would have been adequate too, even though it did not produce the highest result in all the datasets.

(iv) ELU versus ReLU with BN: motivated by the work of Mishkin et al. [47] on different configurations for CNN architectures, we tested ELU and ReLU with Batch Normalization in our experiments to know which one could be more beneficial. In general, ReLU produces rather stable metrics, even improving the results when the batch size is higher, while ELU destabilizes when we do batch training; that is, we can obtain 100% sensitivity and 0% specificity for some values. However, this is not the general case, as there are some exceptions; for example, the best result for FDD was obtained while using ELU. In any case, we can assume that ELU is not as reliable as the combination of ReLU and Batch Normalization, which, following our experiments, shows a more stable behavior.

4.3.1. Analysis of False Alarms and Missed Detections. To further analyze the system and the performance of the best configuration, in this part we discuss the false positives (false alarms) and false negatives (missed detections) produced by the network for the FDD dataset. We selected this dataset for its greater variety of falls with respect to URFD and Multicam, that is, different ways of falling, with rare cases included. We sampled 72 sequences for the analysis from all the sequences with errors; half of the samples were fall sequences and the other half were "no fall" sequences.

False Positives or False Alarms. A false alarm is given when the system predicts as a fall a stack of optical flow that was labeled as "no fall." In the set of samples used for the analysis, we detected some common sources of errors for the majority of the cases, while others were uncommon cases. The errors were found stack-wise, that is, every 10 frames. As we are using a sliding window approach, the errors may overlap. We computed the amount of stacks per error source and ordered the sources by that number of stacks divided by the total amount of stacks of all the sources.

(i) In 51.41% of the stacks, we observed that the system was learning that the frames before and after the fall were also part of the fall. More precisely, the frames containing the actor destabilizing (previous to the fall) and the frames where the actor was already on the floor (after the fall) were predicted as a fall.

(ii) With a lower occurrence rate, yet appearing in various events, we have the following cases:

(a) 14.06%: the actor slowly bends down and then lies down on the floor.

(b) 9.64%: the actor enters the FOV of the camera walking.
(c) 6.43%: from a seated position in a chair, the actor bends down to grab an object on the floor.

(d) 5.62%: the actor exits the FOV of the camera walking.

(iii) Finally, the remaining cases have a small occurrence rate, appearing in a few stacks and in a single sequence, for example: (i) when the actor is lying down on the floor; (ii) when a small part of the actor goes out of the FOV of the camera; and (iii) when the actor grabs something from the floor while keeping his legs stiff.

False Negatives or Missed Detections. When optical flow stacks labeled as fall are fed into the network and the predicted output is "no fall," a missed detection is given. We observed the following sources of errors for the analyzed optical flow stacks of the set of 36 sequences (ordered in the same way as the false positives):

(i) 42.82% of the cases do not contain anything special during the fall; thus, we hypothesize that the network may not be learning a specific feature correctly.

(ii) Two other events compose 14.65% and 10.70% of the cases: (i) the actor walks swaying, trying not to fall, for almost 2 seconds (48 frames), and (ii) the actor falls while grabbing a ball with his hands and all the movement occurs in the axis perpendicular to the floor.

(iii) Even with lower occurrence rates (7.04% and 5.92%), we have two events: (i) the lying-on-the-floor position is not detected as a fall and (ii) the trunk starts the fall while the legs stay stiff.

(iv) Finally, there are more minor events than in the case of the false positives, for example: (i) the actor starts the fall while being on his knees; (ii) the actor does not entirely fall to the floor, ending in a quadruped position; and (iii) the actor is seen almost from above and the fall is barely perceptible.

We observed that all those errors could be classified into two groups depending on the source of the error:

(1) Events that do not appear many times in the datasets are difficult to learn, as the network does not have enough samples: for example, falling while on the knees or detecting falls from a top view of the person, which are among the analyzed samples. This may also be the case for the majority of the false negatives, where there is no explanation for the error apart from the incapability of the network to learn specific features correctly. Our hypothesis is that the system can learn those rare cases with the proper amount of data. Judging from the results obtained later in Sections 4.5 and 4.6, we believe the generalization capability of the system is higher and we are only limited by the available datasets (see Section 3.3 for their limitations). Thus, this source of error could be addressed by our system.

(2) The second source of error comes from the limitations of the cameras and the optical flow algorithm. The quality of the images is given by the cameras and, therefore, is an intrinsic feature of the datasets. The optical flow algorithm also has its own limitations (discussed in Section 3.1); for example, in the case of a long distance between the actor and the camera, the optical flow algorithm is not able to capture the movement of the person and the output is a blank image. This is a limitation of our system that must be studied case by case and is left as future work (Section 5).

Table 3: Comparison between our approach and others from the vision-based fall detection literature for URFD.
Proposal                              Sensitivity/Recall    Specificity    Accuracy
Zerrouki and Houacine (2017) [14]     -                     -              96.88%
Ours                                  100.0%                92.00%         95.00%

Table 4: Comparison between our approach and others from the literature of vision-based fall detection for Multicam.
Proposal            Sensitivity/Recall    Specificity
Wang et al. [15]    89.20%                90.30%
Wang et al. [16]    93.70%                92.00%
Ours                99.00%                96.00%

Table 5: Comparison between our approach and others from the vision-based fall detection literature for FDD.
Proposal                              Sensitivity/Recall    Specificity    Accuracy
Charfi et al. (2012) [17]             98.00%                99.60%         -
Zerrouki and Houacine (2017) [14]     -                     -              97.02%
Ours                                  99.00%                97.00%         97.00%

4.4. Results and Comparison with the State of the Art. To compare the best models found in the previous section with the state of the art, we used a 5-fold cross-validation for URFD and FDD and a leave-one-out cross-validation for Multicam, following Rougier et al. [19], in order to compare on equal conditions. In this last case, we split the dataset into 8 parts of the same size, each one containing all the videos recorded by a specific camera. We trained with 7 of those parts and tested with the remaining one; the final result is the average of the metrics given by each test. Nevertheless, for the three datasets, we balanced each fold in order to have the same amount of falls and "no falls." Notice that we retrained all the networks so that no information was brought from the experiments in Section 4.3.

Tables 3, 4, and 5 show the results obtained by our approach on each dataset compared to others.
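A minimal sketch of the leave-one-camera-out protocol used for Multicam, assuming scikit-learn is available and reusing the hypothetical build_temporal_vgg16 helper from the earlier sketches; the per-fold balancing step mentioned above is omitted for brevity.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def sensitivity_specificity(y_true, y_pred):
    # Label convention from Section 4.3: 0 = fall (positive class), 1 = "no fall".
    tp = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn), tn / (tn + fp)

def leave_one_camera_out(X, y, cameras):
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=cameras):
        model = build_temporal_vgg16()
        model.compile(optimizer='adam', loss='binary_crossentropy')
        model.fit(X[train_idx], y[train_idx], epochs=1000, batch_size=1024, verbose=0)
        y_pred = (model.predict(X[test_idx]) >= 0.5).astype(int).ravel()
        scores.append(sensitivity_specificity(y[test_idx], y_pred))
    return np.mean(scores, axis=0)   # average sensitivity and specificity over the 8 cameras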
For the sake of a fair comparison, we selected the papers of the state of the art which meet two requirements: (1) they work only on RGB data (no depth maps or accelerometer data) and (2) they provide results using publicly available datasets.

Due to the different performance metrics used by other researchers and the comparison criteria established for this paper in Section 4.1, it is not possible to make a general claim in terms of performance. Hence, we will discuss the results obtained for each dataset:

URFD

(i) Harrou et al. [24] and Zerrouki et al. (2016) [23] used both URFD and FDD but do not specify which dataset was used to obtain their results or how they combined the performance on both datasets. For these reasons, these works are not included in the comparison tables. Harrou et al. [24] reported results by means of the False Alarm Rate (FAR, or False Positive Rate) and Missed Detection Rate (MDR, or False Negative Rate) metrics, obtaining a FAR of 7.54% and an MDR of 2.00%. Using our system, we obtained a FAR of 9.00% and an MDR of 0.00%. Zerrouki et al. (2016) [23] reported a sensitivity of 98.00% and a specificity of 89.40%, while we obtained 100.0% and 92.00%, respectively. Again, our values correspond to training and testing only on URFD.

(ii) In a different work, Zerrouki and Houacine (2017) [14] reported an accuracy of 96.88% on this dataset, while we obtained 95%. Since the dataset is very imbalanced (see Table 1), it suffers from the problem known as the accuracy paradox, where a higher accuracy does not imply a higher predictive power. In fact, a system predicting "no fall" for all samples in the dataset would obtain 91.53% accuracy without detecting any of the existing falls. For that reason, as explained in Section 4.1, we chose to show sensitivity and specificity values instead.

Multicam

(i) Both Wang et al. [15] and Wang et al. [16] evaluated the performance of their systems over stacks of 30 frames. Our system outperforms their results while using only stacks of 10 frames. More precisely, our system achieves a sensitivity of 99% and a specificity of 96%, while Wang et al. [15] obtained 89.20% and 90.30% and Wang et al. [16] 93.70% and 92.00% for the same metrics, respectively.

(ii) Regarding the work by Auvinet et al. [25] and Rougier et al. [19], they are not included in the comparison tables due to their evaluation methodology (http://www.iro.umontreal.ca/~labimage/Dataset/technicalReport.pdf). They used a criterion that sets a variable called t_fall, representing the frame where the fall starts, therefore dividing each video into two parts. If a fall is detected after t_fall, it is considered a TP; otherwise, it is an FN. Before t_fall, if a fall is predicted, the entire "no fall" period is considered an FP; otherwise, it is a TN. We believe this evaluation methodology is not useful to compare our solution with those of other authors; thus, we discarded its use.

Furthermore, Auvinet et al. [25] used ad hoc threshold values to detect a lying-on-the-floor position, not the fall sequence itself. With this method, they reported sensitivity and specificity values up to 99.70%, while we obtained 99.00% and 96.00%, respectively. Rougier et al. [19], with an accuracy value of 99.70% compared to our 97.00%, performed fall detection at the video level, not by stacks of frames. Hence, their system needs a specific data framing, using videos as input instead of a continuous stream of images, which is not ideal for real world deployments.

Moreover, Auvinet et al. [25] used a 3D vision system, which provides them with more information about the entire scenario than our approach, although we are only three points behind their results in specificity, showing that our system performs comparably even with less information.

The use of 3D vision systems stirs up an interesting discussion about the strengths and weaknesses of both 3D and 2D systems. Regarding detection performance, based on the available experiments and results, the differences between both approaches are minimal. 3D vision systems show higher sensitivity and specificity, but the difference is low (around 3 points for Multicam). However, there are some other aspects to be considered to decide which approach is the most suitable for a given application. For instance, 3D vision systems are often based on active sensors, such as structured light (e.g., Kinect) or time-of-flight cameras, which may be used even in dark conditions. That is not the case for passive RGB cameras. Even though we show promising results in simulated scenarios with poor light conditions (Section 4.5.1), those cameras cannot work without external light.

On the other hand, as far as system deployment is concerned, 3D vision systems usually present higher costs. First of all, reliable 3D cameras are expensive compared to passive 2D cameras. The Kinect is an exception, but it has several limitations: it does not work in sunny environments and its range is limited to 4-5 meters. Second, 2D passive cameras are already common in public spaces, so it seems natural to try to take advantage of the existing installations.

In conclusion, we can claim that each system has its own advantages and drawbacks; hence, they should be selected depending on the specific application domain. This fact stresses even more the need of working on fall detection systems for different sensor deployments.
Figure 6: Original images of the FDD dataset (a) and the same images after artificial darkening (b).
FDD

(i) As in the case of URFD, Harrou et al. [24] and Zerrouki et al. (2016) [23] mentioned the combined use of FDD but only provided a single result, thus making the comparison with them unclear. Harrou et al. reported a FAR of 7.54% and an MDR of 3.00%, while we get 3.00% and 1.00%. Zerrouki et al. provided a sensitivity of 98.00% and a specificity of 89.40%, while we obtained 99.00% and 97.00%, respectively (again, training and testing only on FDD).

(ii) Similar to the URFD case, Zerrouki and Houacine (2017) [14] presented 97.02% accuracy for the FDD dataset, while we obtained 97.00%. As in the previous case with Zerrouki and Houacine on URFD, the evaluation using pure accuracy is misleading, as a system always predicting "no fall" would obtain 96.47% accuracy with null predictive power.

(iii) Charfi et al. [17] used ad hoc hand-tuned thresholds in their system to detect a fall (11 consecutive fall predictions), indeed obtaining very high results: 98.00% sensitivity and 99.60% specificity. It is not clear how this system would perform on other datasets, as the thresholds were hand-tuned for FDD without any other proof of generalization.

4.5. Experiments with Lighting Conditions. As discussed in Section 3.3, the public benchmark datasets used for this research present stable lighting conditions, providing suitable lighting for artificial vision tasks. However, in real world scenarios, events like sunlight coming through the windows or a lamp switching on and off are quite frequent. Therefore, we decided to test how our system would behave under those circumstances. To that end, we modified the original images of the FDD dataset, thus creating different artificial lighting conditions, and observed how the fall detection system performs. We divided the experiments into two parts:

(1) Static lighting experiments, where we changed the lighting conditions to simulate night-like scenarios.

(2) Dynamic lighting experiments, where we added a dynamic artificial lighting that smoothly increases its intensity from frame to frame until reaching a specific value. Afterwards, the intensity decreases again to recover the initial lighting conditions.

4.5.1. Static Lighting. In this first part, for every frame in a video, we subtract a constant value of 100 from each pixel of each channel (three channels in RGB) so that the images are darkened as if it were night (see Figure 6). With these new images, we carried out the following two experiments.

Training on Original Images Only. We divided the dataset with an 80:20 ratio into two sets, train and validation, and balanced the class distribution of the train set. We kept the original (unchanged) images for the train set and used the darkened images for the validation set. Then, we trained the model for 3,000 epochs with a learning rate of 0.001, batch training, a w0 of 2, and ELU (the best configuration for the FDD dataset found in Section 4.3). We obtained a sensitivity of 45.85% and a specificity of 98.67%.

The result is coherent with the fact that falls are difficult to detect when the actor approaches the floor. It is the darkest area in the image; thus, the actor is not distinguishable, as he gets fused with the darkness. Therefore, any lying-on-the-floor position is very difficult to detect.
Figure 7: Original images of the FDD dataset in (a); the same images with simulated dynamic lighting in (b); optical flow images without
the lighting change in (c); and optical flow images with the lighting change in (d). The 12 frames correspond to half a second of the original
video (recorded at 25 FPS).
Train and Test with Darkened Images. We used exactly the same configuration and train/test partition as in the previous experiment, but this time the images from the training set were also darkened. After training, the sensitivity rose to 87.12%, while the specificity decreased slightly (94.92%). The best result obtained with this configuration for the original dataset was 93.47% sensitivity and 97.23% specificity. We believe the difference is not that large, taking into account the level of darkening applied and the fact that we did not explore the best configuration for the new images.

4.5.2. Dynamic Lighting. In real world scenarios, lighting conditions are not as stable as in a lab environment. For example, a lamp may be switched on/off in the background, generating displacement vectors in the optical flow algorithm. To simulate this type of scenario in the FDD dataset, we added a progressive change of lighting that takes 32 frames to light up and fade. As the video was recorded at 25 frames per second (FPS), the lighting change lasts for about 1.3 seconds. To produce this dynamic light change, for each frame, channel, and pixel we multiplied the original value by a sinusoidal function so that the transition between frames emulates real light conditions (a code sketch of this modification is given at the end of this subsection). This modification was applied once per video, to its first 32 frames (see Figure 7). In order to achieve more realistic illumination conditions, a single lighting change was applied to each video. This part is again divided into two experiments.

Train with the Original Images and Test with the New Ones. To test how the system reacts to dynamic lighting conditions, we trained the model with the original images and tested with the new ones (as explained in the first paragraph). Again, we divided the dataset into an 80:20 ratio (keeping the same data in each partition as in Section 4.5.1). We trained the model for 3,000 epochs with a learning rate of 0.001, batch training, a 𝑤0 of 2, and ELU. The result is 28.04% sensitivity and 96.35% specificity.

The result is coherent with the data, as the dynamic lighting generates many displacement vectors in the image. This confuses the network, which has been trained to see only the displacement vectors of the moving person.

Train and Test with the New Images. Finally, we checked how the system is able to adapt to this lighting change if the classifier is properly trained. To this end, we used the new images in the train set and the validation set (the same partition of sets as in the first part of this experiment). The configuration of the training and the network is also the same as in the previous part. This time the system obtains 90.82% sensitivity and 98.40% specificity.

As in the previous experiment with darkness, both metrics increase significantly when the network is trained with the new data. In particular, the change is large in the case of the sensitivity, which goes from 28.04% to 90.82% after training with the modified samples. This is proof of the capability of the network to adapt to new circumstances (darkness or lighting changes in this case) if the appropriate data is used at training time.
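The following is a minimal sketch (Python with NumPy and OpenCV; the function name, gain value, and file name are illustrative assumptions, not the original implementation) of the per-frame sinusoidal lighting modification described above:

import cv2
import numpy as np

def add_dynamic_lighting(frames, n_affected=32, max_gain=1.8):
    # Progressively brighten and fade the first n_affected frames.
    # The gain follows half a sine period, rising from 1.0 to max_gain
    # and back to 1.0, so every pixel and channel of a frame is
    # multiplied by the same time-varying sinusoidal factor.
    out = [f.copy() for f in frames]
    for i in range(min(n_affected, len(frames))):
        gain = 1.0 + (max_gain - 1.0) * np.sin(np.pi * i / (n_affected - 1))
        out[i] = np.clip(out[i].astype(np.float32) * gain, 0, 255).astype(np.uint8)
    return out

# Usage: load a 25 FPS video, modify its first 32 frames, and keep the rest.
cap = cv2.VideoCapture("fdd_video.avi")
frames = []
ok, frame = cap.read()
while ok:
    frames.append(frame)
    ok, frame = cap.read()
cap.release()
modified_frames = add_dynamic_lighting(frames)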
4.6. Generality Test. One of the main drivers of our system design was generality (Section 3), that is, to develop a fall detector able to perform in different scenarios. While the previous experiments in this paper considered each dataset individually, here we tried to avoid any singular feature associated with a specific dataset. For that purpose, we generated a new dataset as the combination of the three previously used, as follows:

(1) In order to give equal weight to all three datasets, we resampled the two largest sets to match the size and class distribution of the smallest one (URFD). With this change, the three datasets had the same relevance (amount of samples) and both classes (fall and “no fall”) were balanced. More concretely, each dataset has 960 samples, 480 fall and 480 “no fall” samples.

(2) In order to apply a 5-fold cross-validation, we divided each dataset into 5 groups, each group containing the same amount of fall/“no fall” samples.
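As an illustration only, the sketch below (Python; the dictionary layout, function name, and identifiers are hypothetical, not the actual preprocessing code) shows one way to implement steps (1) and (2):

import random

def balance_and_split(datasets, per_class=480, n_folds=5, seed=0):
    # datasets maps a dataset name ("URFD", "Multicam", "FDD") to a pair
    # (fall_samples, no_fall_samples) of sample identifiers.
    # Step (1): subsample every dataset to per_class falls and per_class
    # "no falls", so each dataset contributes 960 balanced samples.
    # Step (2): deal the samples into n_folds groups, keeping the same
    # fall/"no fall" proportion per dataset in every group.
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for name, (falls, no_falls) in datasets.items():
        for label, samples in (("fall", falls), ("no fall", no_falls)):
            chosen = rng.sample(samples, per_class)
            for i, sample in enumerate(chosen):
                folds[i % n_folds].append((name, label, sample))
    return folds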
Table 6: Experiment with the system trained on the three datasets combined (URFD, Multicam, and FDD). Results are shown on the combination and on each individual set.

Test set                  Sensitivity/Recall    Specificity
URFD + Multicam + FDD     94.00%                94.00%
URFD                      100.00%               99.00%
Multicam                  85.00%                84.00%
FDD                       97.00%                98.00%

The results of the experiment, with the performance on the combined set but also on the samples of each set individually, are shown in Table 6. The network configuration was the following: a learning rate of 10^-3, a batch size of 1,024, and a 𝑤0 of 2.0, and we used ReLU with Batch Normalization. The results correspond to the 5-fold cross-validation, with each fold trained for 1,000 epochs.
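For illustration, the following is a minimal Keras [40] sketch of a classifier head using ReLU with Batch Normalization and the training configuration reported above; the feature dimensionality, layer sizes, use of dropout, and the assignment of the 𝑤0 = 2 weight to the fall class are assumptions for the example, not the exact setup of our network:

from keras.models import Sequential
from keras.layers import Activation, BatchNormalization, Dense, Dropout
from keras.optimizers import Adam

feature_dim = 4096  # placeholder size of the features fed to the classifier

model = Sequential([
    Dense(512, input_dim=feature_dim),
    BatchNormalization(),
    Activation("relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),  # fall vs. "no fall"
])
model.compile(optimizer=Adam(lr=1e-3),   # learning rate of 10^-3
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Assuming the fall class is labelled 1, a class weight of 2 plays the
# role of w0; x_train and y_train are prepared elsewhere.
# model.fit(x_train, y_train, batch_size=1024, epochs=1000,
#           class_weight={0: 1.0, 1: 2.0})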
We believe that these results support our claim of having a generic fall detector system. One could argue that a network trained and tested on a single dataset merely learns features specific to that dataset; the experiment shown in Table 6 tries to refute that reasoning. We combined the three datasets, with all their differences, both for training and testing. As can be seen, when we tested the system using the videos from all the datasets, we obtained very high detection rates (sensitivity and specificity of 94%). When observing the results obtained on each individual dataset, our results still remain high, except for Multicam (sensitivity of 85% and specificity of 84%). This can be explained because Multicam uses different perspectives to record the same events. When we generated the combined dataset, we discarded many frames from Multicam to keep the influence of each dataset equal. In that process, we lost many frames which may have been helpful for our network to learn the different perspectives. Thus, we can conclude that, to tackle the perspective issue, more images are needed in the training process. However, the results back up our generality claim, since the system was able to learn generic features from different datasets and showed high detection rates. Under more realistic conditions (real world environments), the system may perform poorly unless it is also trained on real world data. Therefore, the key to obtaining generic features and generalizing well is to train the system with a large amount of inhomogeneous data.

The generalization capabilities of our system came mainly from two design decisions: (i) the use of optical flow images, which only model the motion information contained in consecutive frames and avoid any dependence on appearance features, and (ii) the three-step training phase, in which the network learns generic motion-related features for action recognition.
5. Conclusions

Our approach rests on the following design decisions:

(i) The use of optical flow images to represent motion: this representation has its drawbacks, as stated in Section 3.1. Nevertheless, with the appropriate training we can make up for them and obtain very good results, as demonstrated in Section 4.5.

(ii) The use of a CNN retrained on different datasets and for different problems: apart from creating a powerful feature extractor, this allows us not to depend on hand-engineered features, which are usually very hard to design and are prone to be too specific to a given setup. In contrast, our CNN learned generic features relative to the problem domain.

(iii) Transfer learning: to overcome the problems posed by the low number of samples in fall datasets and to learn generic features, we adopted transfer learning techniques. More concretely, we presented a three-step training process which has been successfully applied to fall detection.
We believe that the presented vision-based fall detector is a solid step towards safer Smart Environments. Our system has been shown to be generic and works only on camera images, using few image samples (10) to determine the occurrence of a fall. Those features make the system an excellent candidate to be deployed in Smart Environments, which are not limited only to home scenarios. Based on emerging IoT architectures, the concept of Smart Environments can be extended to many other everyday environments, providing the means to assist the elderly in several contexts.

However, there is still room for improvement. In order to bring vision-based fall detection to real world deployments, we envisage three potential research directions:

(1) Further research on transfer learning with fall detection datasets is warranted in order to improve our generic feature extractor. Currently, our third training step is limited to fine-tuning (Section 3), where the convolutional layers have their weights frozen and only the classifier layer is actually trained. We would like to consider alternative approaches, going deeper into the way the convolutional layers can be adapted to fall datasets (a code sketch of this idea is given after this list). However, those experiments must be carried out carefully to avoid overly specific feature extractors, which may perform better on a certain dataset but at the expense of losing generality.

(2) Using optical flow images provides a great representational power for motion, but it also involves the heavy computational burden of preprocessing consecutive frames and drawbacks concerning lighting changes. Following the philosophy of end-to-end learning, we would like to avoid any image preprocessing step and work only on raw images in the future. Therefore, more complex network architectures will have to be designed to learn complete and hierarchical motion representations from raw images.

(3) As the public datasets have only one actor per video, we believe that the next step in the field of fall detection would be multiperson fall detection. For this task, we think that region-based CNNs (R-CNN) [49] could be a promising research direction, with the aim of automatically detecting different persons in images and analyzing those regions with our fall detection system.
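The following Keras [40] sketch illustrates the kind of adaptation mentioned in direction (1). It is only a sketch: the ImageNet-pretrained VGG16 backbone with a 3-channel input is used here for brevity, whereas the network of Section 3 takes stacks of optical flow images, and the layer sizes are illustrative.

from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model
from keras.optimizers import SGD

# Start from a pretrained convolutional backbone.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Plain fine-tuning (as in our third training step): freeze every
# convolutional layer so that only the classifier is trained.
for layer in base.layers:
    layer.trainable = False

# Alternative explored in direction (1): unfreeze the last convolutional
# block so that it can also adapt to the fall datasets.
for layer in base.layers:
    if layer.name.startswith("block5"):
        layer.trainable = True

x = Flatten()(base.output)
x = Dense(512, activation="relu")(x)
out = Dense(1, activation="sigmoid")(x)  # fall vs. "no fall"
model = Model(inputs=base.input, outputs=out)

# A small learning rate limits how far the unfrozen weights drift.
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9), loss="binary_crossentropy")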
Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding and NVIDIA Corporation for the donation of the Titan X used for this research. They also thank Wang et al. [8] for making their work publicly available.

References

[1] A. F. Ambrose, G. Paul, and J. M. Hausdorff, “Risk factors for falls among older adults: A review of the literature,” Maturitas, vol. 75, no. 1, pp. 51–61, 2013.
[2] J. C. Davis, M. C. Robertson, M. C. Ashe, T. Liu-Ambrose, K. M. Khan, and C. A. Marra, “International comparison of cost of falls in older adults living in the community: A systematic review,” Osteoporosis International, vol. 21, no. 8, pp. 1295–1306, 2010.
[3] M. Mubashir, L. Shao, and L. Seed, “A survey on fall detection: principles and approaches,” Neurocomputing, vol. 100, pp. 144–152, 2013.
[4] L. Chen, J. Hoey, C. D. Nugent, D. J. Cook, and Z. Yu, “Sensor-based activity recognition,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 790–808, 2012.
[5] A. Caragliu, C. Del Bo, and P. Nijkamp, “Smart cities in Europe,” Journal of Urban Technology, vol. 18, no. 2, pp. 65–82, 2011.
[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[7] J. Deng, W. Dong, R. Socher et al., “ImageNet: a large-scale hierarchical image database,” in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, Miami, Fla, USA, June 2009.
[8] L. Wang et al., “Towards good practices for very deep two-stream ConvNets,” 2015.
[9] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: a dataset of 101 human actions classes from videos in the wild,” 2012.
[10] M. Vallejo, C. V. Isaza, and J. D. Lopez, “Artificial Neural Networks as an alternative to traditional fall detection methods,” in Proceedings of the 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2013, pp. 1648–1651, Japan, July 2013.
[11] A. Sengto and T. Leauhatong, “Human falling detection algorithm using back propagation neural network,” in Proceedings of the 5th 2012 Biomedical Engineering International Conference, BMEiCON 2012, Thailand, December 2012.
[12] B. Kwolek and M. Kepski, “Human fall detection on embedded platform using depth maps and wireless accelerometer,” Computer Methods and Programs in Biomedicine, vol. 117, no. 3, pp. 489–501, 2014.
[13] F. Harrou, N. Zerrouki, Y. Sun, and A. Houacine, “Statistical control chart and neural network classification for improving human fall detection,” in Proceedings of the 8th International Conference on Modelling, Identification and Control, ICMIC 2016, pp. 1060–1064, Algeria, November 2016.
[14] N. Zerrouki and A. Houacine, “Combined curvelets and hidden Markov models for human fall detection,” Multimedia Tools and Applications, pp. 1–20, 2017.
[15] S. Wang, L. Chen, Z. Zhou, X. Sun, and J. Dong, “Human fall detection in surveillance video based on PCANet,” Multimedia Tools and Applications, vol. 75, no. 19, pp. 11603–11613, 2015.
[16] K. Wang, G. Cao, D. Meng, W. Chen, and W. Cao, “Automatic fall detection of human in video using combination of features,” in Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2016, pp. 1228–1233, China, December 2016.
[17] I. Charfi, J. Miteran, J. Dubois, M. Atri, and R. Tourki, “Definition and performance evaluation of a robust SVM based fall detection solution,” in Proceedings of the 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012, pp. 218–224, Italy, November 2012.
[18] T. Lee and A. Mihailidis, “An intelligent emergency response system: Preliminary development and testing of automated fall detection,” Journal of Telemedicine and Telecare, vol. 11, no. 4, pp. 194–198, 2005.
[19] C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau, “Robust video surveillance for fall detection based on human shape deformation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 5, pp. 611–622, 2011.
[20] S.-G. Miaou, P.-H. Sung, and C.-Y. Huang, “A customized human fall detection system using omni-camera images and personal information,” in Proceedings of the 1st Transdisciplinary Conference on Distributed Diagnosis and Home Healthcare, D2H2 2006, pp. 39–42, USA, April 2006.
[21] C.-L. Liu, C.-H. Lee, and P.-M. Lin, “A fall detection system using k-nearest neighbor classifier,” Expert Systems with Applications, vol. 37, no. 10, pp. 7174–7181, 2010.
[22] V. Vishwakarma, C. Mandal, and S. Sural, “Automatic detection of human fall in video,” in Pattern Recognition and Machine Intelligence, pp. 616–623, 2007.
[23] N. Zerrouki, F. Harrou, A. Houacine, and Y. Sun, “Fall detection using supervised machine learning algorithms: A comparative study,” in Proceedings of the 8th International Conference on Modelling, Identification and Control, ICMIC 2016, pp. 665–670, Algeria, November 2016.
[24] F. Harrou, N. Zerrouki, Y. Sun, and A. Houacine, “A simple strategy for fall events detection,” in Proceedings of the 14th IEEE International Conference on Industrial Informatics, INDIN 2016, pp. 332–336, France, July 2016.
[25] E. Auvinet, F. Multon, A. Saint-Arnaud, J. Rousseau, and J. Meunier, “Fall detection with multiple cameras: An occlusion-resistant method based on 3-D silhouette vertical distribution,” IEEE Transactions on Information Technology in Biomedicine, vol. 15, no. 2, pp. 290–300, 2011.
[26] S. Gasparrini, E. Cippitelli, S. Spinsante, and E. Gambi, “A depth-based fall detection system using a Kinect sensor,” Sensors, vol. 14, no. 2, pp. 2756–2775, 2014.
[27] R. Planinc and M. Kampel, “Introducing the use of depth data for fall detection,” Personal and Ubiquitous Computing, vol. 17, no. 6, pp. 1063–1072, 2013.
[28] G. Diraco, A. Leone, and P. Siciliano, “An active vision system for fall detection and posture recognition in elderly healthcare,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, DATE 2010, pp. 1536–1541, Germany, March 2010.
[29] G. Mastorakis and D. Makris, “Fall detection system using Kinect’s infrared sensor,” Journal of Real-Time Image Processing, vol. 9, no. 4, pp. 635–646, 2012.
[30] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “PCANet: a simple deep learning baseline for image classification?” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5017–5032, 2015.
[31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
[32] Y. Jia, E. Shelhamer, J. Donahue et al., “Caffe: convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia, pp. 675–678, ACM, Orlando, Fla, USA, November 2014.
[33] S. S. Beauchemin and J. L. Barron, “The Computation of Optical Flow,” ACM Computing Surveys, vol. 27, no. 3, pp. 433–466, 1995.
[34] J. J. Gibson, “The Perception of Visual Surfaces,” The American Journal of Psychology, vol. 63, no. 3, p. 367, 1950.
[35] C. Zach, T. Pock, and H. Bischof, “A duality based approach for realtime TV-L1 optical flow,” in Pattern Recognition, F. A. Hamprecht, C. Schnörr, and B. Jähne, Eds., pp. 214–223, 2007.
[36] L. Wang, Y. Xiong, Z. Wang et al., “Temporal segment networks: Towards good practices for deep action recognition,” Lecture Notes in Computer Science, vol. 9912, pp. 20–36, 2016.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS ’12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778, July 2016.
[39] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Proceedings of the 28th Annual Conference on Neural Information Processing Systems 2014, NIPS 2014, pp. 568–576, Canada, December 2014.
[40] F. Chollet et al., “Keras,” 2015, https://github.com/fchollet/keras.
[41] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’14), pp. 1717–1724, IEEE, Columbus, Ohio, USA, June 2014.
[42] S. Herath, M. Harandi, and F. Porikli, “Going deeper into action recognition: A survey,” Image and Vision Computing, vol. 60, pp. 4–21, 2017.
[43] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[44] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” Neural Networks, vol. 1, no. 1, p. 445, 1988.
[45] D. Clevert, T. Unterthiner, G. Povysil, and S. Hochreiter, “Rectified factor networks for biclustering of omics data,” Bioinformatics, vol. 33, no. 14, pp. i59–i66, 2017.