Machine Learning and Deep Learning Models For Human Activity Recognition in Security and Surveillance: A Review
https://doi.org/10.1007/s10115-024-02122-6
REVIEW
Abstract
Human activity recognition (HAR) has received significant attention in the field of security
and surveillance due to its high potential for real-time monitoring, identification of abnormal
activities, and situational awareness. HAR is able to identify abnormal activity or behaviour
patterns that may indicate potential security risks. A HAR system attempts to automatically
classify and report the activities performed in an environment by learning from data captured
through sensors or video streams. An overview of existing research in the security and
surveillance area is presented, covering traditional, machine learning (ML) and deep learning
(DL) algorithms applicable to the field. A comparative analysis of different HAR techniques
based on features, input source and public data sets is provided for quick understanding, with
a focus on recent trends in the HAR field. This review provides guidelines for the selection of
an appropriate algorithm, data set and performance metrics when evaluating HAR systems in
the context of security and surveillance. Overall, this review aims to provide a comprehensive
understanding of HAR in the field of security and surveillance and to serve as a basis for
further research and development.
1 Introduction
A human activity recognition (HAR) system labels human actions and classifies them into
their respective categories. Significant improvements have been made in HAR for security
and surveillance in recent years due to the increasing availability of sensor technologies and high-
resolution video cameras and the development of sophisticated ML and DL algorithms. A
basic HAR system includes data gathering, feature extraction using either handcrafted features or
automatically extracted features, feature selection, and classification. HAR is in high demand
as it has numerous applications in different areas, including industry, smart cities, intelligent
surveillance at public places such as malls and stations, abnormal activity detection, medicine,
sports analytics, fitness, daily-assisted living, pedestrian detection, and so on [1]. An ideal
human action recognition system should
be able to adapt to differences within a single action class and distinguish between variations
within and between distinct action classes. The data required for performing HAR are either
sensor based or vision based; a sensor data set contains information collected from different
sensors such as accelerometers, magnetometers, gyroscopes, RFID, etc. [2], whereas a vision
data set contains information derived from CCTV cameras or RGB depth cameras such as the Kinect sensor.
Pre-processing is performed as an initial step. For sensor data, pre-processing techniques
such as data cleaning, handling missing values, dimensionality reduction, and data transformation
are used. For vision data, image processing tools are usually applied to remove digital
noise and perform data augmentation. Using this processed information, human
activity is detected by a traditional, ML- or DL-based approach. Extracting features in order to
predict the activity is the main and most challenging task. Various traditional representative feature
descriptors are Haar, histogram of oriented gradients (HOG), and local binary patterns (LBP) [3,
4]. Traditional methods are based on global features of an image and are only suitable for
individual action detection. Feature extraction with traditional methods is computationally
intensive. These issues can be handled more effectively with ML- and DL-based algorithms.
Lightweight architectures such as MobileNet, SqueezeNet, and Tiny-YOLO are well suited for
implementation on resource-constrained devices for human activity recognition. The authors
of [5] proposed a posture recognition system using MobileNetV2 and LSTM to extract important
features. On the other hand, the authors of [6] proposed Tiny-YOLO-based real-time human
detection for video surveillance at the edge to achieve a good trade-off
between accuracy and processing time. The survey [7] highlights the multidisciplinary aspect
of HAR, covering application domains, activity types, task complexity, benchmark data sets,
and techniques. This research compares current HAR approaches and discusses common data
sets. The proposed method [8] discusses the integration of biosensor data and multimodal deep
learning techniques for decoding human locomotion patterns in the context of the Internet
of Healthcare Things (IoHT). The authors focus on deep human motion detection and multi-
feature analysis as essential components for developing smart healthcare learning tools,
contributing to more effective and personalized health assessments [9]. HAR has achieved
significant importance due to its wide range of applications and corresponding features in
a variety of domains, which are listed in Fig. 1.
In contrast to other applications, HAR in security and surveillance carries a special sig-
nificance because of the seriousness of its goals and the potential consequences of failure.
Compared to other domains like health care, sports, or human–computer interaction, HAR
function in security and surveillance has a direct impact on public safety, property protection,
and crime prevention.
Figure 2 shows the characteristics and challenges of HAR in security and surveillance
domain. The vision-based data set specifically CCTV footage is important for HAR in secu-
rity and surveillance. Feature extraction is a critical characteristic of HAR for security and
surveillance, as it transforms raw data into discriminative and context-aware representations,
enabling accurate and real-time detection of human activities. Getting the real-time data is
crucial, and also there is lack of sufficient data [10, 11]. In addition to recognizing prede-
fined activities, HAR systems can also identify anomalies or suspicious behaviours that may
indicate a security threat. Real-time processing is crucial for security and surveillance appli-
cations. HAR systems often operate in real time, allowing immediate detection and response
to potential threats. However, it becomes difficult to recognise activities when several people
are present in a group, and human interaction can lead to misclassification.
1.1 Motivation
The main purpose of this review is to present the current status of HAR in security and surveil-
lance. HAR has the ability to detect threats, deliver real-time monitoring, optimize resource allocation,
facilitate investigations, enable proactive responses, provide scalability and coverage, and
integrate with other security systems. Global Market Insights, Inc. forecasts that the market
for behaviour analytics will reach over $3.5 billion by 2024, and it has many research
applications [12, 13]. HAR usage contributes to the overall effectiveness and efficiency of
security operations, promoting safety and security in various environments.
The general process of HAR is illustrated in Fig. 3. The input data captured will be either
sensor-based (smartwatches, smartphones, etc.) or vision-based (CCTV, RGB data) [14].
Several data sets are publicly available online for processing. The data must go through the
pre-processing stage for noise removal, data transformation, and data cleaning. Vision-based
data need different filters and sensor-based data require domain transformation techniques
for data cleaning. The pre-processed data then go through various steps such as feature
extraction, feature selection, and classification with an ML or DL classifier [15, 16]. System performance can be
measured using various performance metrics while training the model according to application
requirements. In order to improve the efficiency of the network, it is necessary to minimize
an objective function; here, the optimization algorithm plays an important role [17].
Various optimizers are available for obtaining optimal results. This allows human activities to
be classified into a particular class with maximum accuracy and improved model performance.
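As a minimal illustration of this general pipeline (pre-processing, feature-based classification and evaluation, as in Fig. 3), the following Python sketch assumes pre-extracted window features; the arrays and parameter values are placeholders, not data or settings from the reviewed studies.

```python
# Minimal HAR pipeline sketch (synthetic placeholder data; names are illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))          # 1000 windows x 12 handcrafted features
y = rng.integers(0, 4, size=1000)        # 4 assumed activity classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling followed by an SVM classifier, mirroring the generic flow of Fig. 3.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```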
ML algorithms have been applied for identifying and analysing the activities for security
and surveillance systems. ML algorithms are mainly classified into supervised, unsupervised
and reinforcement algorithms. Numerous ML algorithms such as support vector machines (SVM),
K-nearest neighbours (K-NN), multi-layer perceptrons (MLP), and the naive Bayes classifier [18]
are applicable for performing HAR. The supervised machine learning approach requires a labelled
data set, and supervised learning suffers from the cost of generating training data sets and of
extending to new activities. Some work carried out in a semi-supervised manner to overcome this
problem is also presented. Unsupervised clustering ML algorithms are applied to tackle this issue
of supervised learning methods. Feature extraction in ML needs deep knowledge of the application
domain or human experience. These algorithms differentiate between various activities and
assist in identifying potential threats, maintaining security.
However, with the advent of deep learning, researchers are increasingly turning to neural
network architectures to tackle the complexities of HAR. Before the widespread adoption of
DL, traditional ML algorithms were extensively used for image classification, often relying
on manual feature extraction. Unlike ML, DL extracts optimal features automatically. In DL,
feature extraction is learned by the network itself through repeated training. This reduces human effort,
feature engineering, and the need for expert knowledge, and it enhances the activity recognition accuracy of
security and surveillance systems. An activity is essentially a sequence of actions that
follows a specific pattern; deep models learn this pattern and then predict the output. However,
DL requires a large data set in order to achieve good results. Various DL algorithms
like convolutional neural network (CNN), restricted Boltzmann machine (RBM), recurrent
neural network (RNN), long short-term memory (LSTM) [19], deep belief network (DBN)
[20], deep neural network (DNN), and various hybrid deep models (combination of more
than one deep model) [21] are available for human activity classification. CNN and RNN are
the most popular deep models. CNN is generally considered well suited for image data; it is scale-
invariant and can capture both spatial and temporal dependencies. RNN performs well on time-
series data because it has recurrent connections and passes previous information to the next
layer. RNNs have the disadvantage of exploding or vanishing gradient problems. RNN variants
such as LSTM, stacked LSTM, and bi-directional LSTM overcome this drawback as they model long-
term time dependencies. DNNs are somewhat different from traditional classifiers [22].
The other state-of-the-art technologies based on DNNs such as region-based convolutional
neural networks (R-CNNs), Fast R-CNN, Faster R-CNN, and single-shot detector (SSD)
have achieved success in object classification [15]. Hybrid models such as combinations
of CNN and LSTM, 1-D CNN and SVM, SIFT and SVM, and CNN and RNN-GRU have also
been developed to evaluate HAR performance. The rest of the paper is organized as follows.
Section 2 presents background information, available research methods, algorithms for HAR,
and Sect. 3 provides the key points for the selection of data set. Section 4 provides the
significance of work flow in HAR. The article’s final section discusses the summary and
suggestions for future research.
2 Background
HAR enhances the capabilities of surveillance systems and enables proactive measures to
ensure safety and security. Many surveys in the area of security and surveillance focus on
different approaches to activity recognition. Many reviews on HAR in the field of security and
surveillance have been conducted [23]; the fact that HAR is still developing leads to
the introduction of many new concepts. HAR mostly uses either sensor-based data or vision-
based data as an input source for identification. This review paper aims to provide available
literature and information for both types of data sets in order to gain insights for the secu-
rity and surveillance application. HAR can be divided into different levels of increasing
complexity. The simplest is single-person action recognition, such as gesture recognition, posture
recognition [24] and fall detection, together with basic actions like running, walking, etc.
The complexity increases for interaction-based activities like human–object interaction and
human–robot interaction [25], while the rest are motion based, such as tracking, motion detection
and people counting. For the security and surveillance application, the complexity of activi-
ties is always higher as it involves suspicious activity identification, group activity detection,
abnormal activity detection and analysis. “Deep Learning approaches for Human Activity
Recognition in Video Surveillance—a Survey,” [26] provides a comprehensive overview
of HAR in video surveillance. It covers a wide range of topics, including feature extraction
methods, classification algorithms, multi-modal approaches, and benchmark data sets used in
HAR research for security and surveillance. Chen et al. [27] focus on deep learning-based
HAR methods specifically applied to video surveillance scenarios. “A Survey on Human
Activity Recognition Using Wearable Sensors” provides an in-depth analysis of HAR using
wearable sensors. It covers various sensor modalities, feature extraction techniques, and clas-
sification algorithms employed in wearable-based HAR systems for security and surveillance
applications. Banjarey [28] presents a survey on vision-based approaches for activity
detection. Deshpande et al. presented a vision system for the detection of humans in videos using
a support vector machine. To protect privacy, Wi-Fi channel state information can also be used as a
source for HAR.
The HAR methods presented in this section are categorized into subsections: traditional
methods, ML methods, DL methods and hybrid model methods implemented so far in the security
and surveillance area. These methods need to be selected based upon the complexity of the
activities, data set size, time complexity, cost and other factors.
Traditional methods are commonly useful for simple and individual actions. The authors of [29]
present an approach that uses background subtraction for human detection, with feature extraction
and classification carried out using a HOG descriptor and SVM, respectively. This system is
implemented for multiple-human detection and video surveillance applications, while [30]
proposes a model that uses HOG for feature extraction and SVM for classification, but a cluster
segmentation approach is used to separate humans from the background and other objects in video
frames. Kumar and Bhavani [31] present HAR using median filtering for pre-processing and a
watershed segmentation technique, with feature extraction achieved by combining three feature
extractors: HOG, colour and GiST. Liu, Ying, and colleagues demonstrate how two separate
classifiers, a CNN and the Bayes classifier, are used to detect anomalous human behaviour [32].
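As an illustrative sketch of this traditional detection chain (background subtraction followed by a HOG-based person detector), the following OpenCV code is a minimal approximation of such a pipeline, not the exact implementation of the cited works; the video path is a placeholder.

```python
# Sketch: background subtraction + HOG person detection with OpenCV (video path is a placeholder).
import cv2

cap = cv2.VideoCapture("surveillance.mp4")
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())  # pre-trained HOG + linear SVM

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = bg_subtractor.apply(frame)                 # foreground mask of moving objects
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    # frame / fg_mask could now be passed on to an activity classifier
cap.release()
```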
When the size of the data set is small, an ML method is preferable. The HAR system
in [33] is based on random forest (RF), which is compared with a decision tree (DT)
and an artificial neural network (ANN). This HAR framework is based on a single wearable device
in centralized mode, and RF is shown to outperform the others. The RF binary classifier in [3]
is used to classify abstract activity classes such as static or dynamic activity, and a low-pass Butter-
worth filter is applied to pre-process the signal. In [29], the authors combined a traditional method
and an ML method for classification, where SVM works as the classifier for activity recognition.
Random forests and gradient boosting machines are ensemble techniques that can be used for
classification tasks. KNN is a simple algorithm that classifies an image based on the majority
vote of its neighbours. One study explores the application of ML techniques to analyse and
classify human activities using wearable sensors, emphasizing the potential of HAR
in health care for early detection of abnormalities and enhancing the overall quality of patient
monitoring [34].
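A minimal sketch of this kind of sensor-based ML pipeline is given below: a low-pass Butterworth pre-filter followed by a random forest on simple window statistics. The sampling rate, cut-off frequency, window length and array names are assumptions for illustration, not values taken from the cited works.

```python
# Sketch: Butterworth low-pass pre-processing + random forest on window statistics (illustrative values).
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.ensemble import RandomForestClassifier

fs = 50.0                                           # assumed sampling rate (Hz)
b, a = butter(N=3, Wn=20.0 / (fs / 2), btype="low") # 3rd-order low-pass, 20 Hz cut-off

def window_features(acc, win=128):
    """Filter a (T, 3) accelerometer signal, segment it, and compute simple statistics."""
    acc = filtfilt(b, a, acc, axis=0)               # zero-phase low-pass filtering
    feats = []
    for start in range(0, len(acc) - win + 1, win // 2):   # 50% overlap
        w = acc[start:start + win]
        feats.append(np.concatenate([w.mean(axis=0), w.std(axis=0),
                                     w.max(axis=0) - w.min(axis=0)]))
    return np.asarray(feats)

acc = np.random.randn(5000, 3)                      # placeholder raw tri-axial signal
X_feat = window_features(acc)
y_labels = np.random.randint(0, 2, size=len(X_feat))  # placeholder static/dynamic labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_feat, y_labels)
```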
A DL method is preferred when dealing with sequential and temporal data, as it captures long-
range dependencies and temporal patterns, enabling accurate tracking and analysis of human
activities over time. The most noteworthy modality aspects are automatically extracted using
the attention mechanism and RNN, and then data are subsequently transformed to higher-
level representation. Sample data are evenly sampled and classified into partitions using
K-means clustering to train multi-modal classifiers. On the basis of dense connections and
weighted feature aggregation, a network of convolutional LSTM for HAR is suggested.
Training and tests were carried out using the benchmark data sets Opportunity and UniMiB-
SHAR [37]. The main contribution of this work is the application of a dense connection
module that has L(L + 1)/2 direct connections, while a traditional convolutional network
has L connections for L layers. The sliding window length, kernel size, fusion node of the dense
connection module, and number of convolutional layers are the hyperparameters that need to
be tuned during the training of the network. When using a CNN, the softmax function is used at
the last layer, which gives prediction probabilities for three moving activity classes (walking,
walking upwards, walking downwards). Three hidden layers are used in this hybrid model,
giving the best performance with high accuracy. As we increase the number of hidden layers,
the complexity of the neural network increases, and performance decreases. A Kalman filter is
utilized to detect a target in a moving frame, and three features, length–width ratio, entropy,
and Hu invariant moment, are extracted from the frame [35]. The ReLU activation function
is applied for all hidden layers, and performance is measured for the three extracted features
by taking their mean and variance. Chen, Yao, et al. focus on several challenges
in HAR: (i) unavailability of labelled data, (ii) the high cost of labelling data, (iii) imbalanced
classification and (iv) interpersonal variability and interclass similarity [36]. In some cases
[38], HAR is carried out with an LSTM-RNN network by using passive infrared (PIR), motion,
float and pressure sensors to monitor daily activities. The work in [39] again uses LSTM and
identifies activities using triaxial accelerometer data. In [40] the authors compared LSTM, 1D
CNN and a hybrid model with joint diverse temporal learning on the Kasteren data set, focusing
mainly on less-represented activities. The authors of [41] applied CNN-LSTM on the UCI-HAR
and iSPL data sets, while in [52] LSTM-CNN is applied on three data sets, namely UCI-HAR,
WISDM and OPPORTUNITY. In this, fully connected layers are replaced by a global average
pooling (GAP) layer in order to reduce the model parameters [43]. This is followed by a fully
connected layer, batch normalization and a softmax function. To increase efficiency, the data are
segmented into small batches of fixed segment size during training and testing. The cross-entropy
loss function is used to calculate the total error, and average accuracy is compared to other
hybrid models.
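A minimal Keras sketch of an LSTM-CNN arrangement of this kind is given below: convolutional layers, global average pooling in place of fully connected layers, then a dense layer with batch normalization and a softmax output. The layer sizes, window shape and class count are assumptions for illustration, not the exact configuration of the cited works.

```python
# Sketch of an LSTM-CNN model with global average pooling (layer sizes are illustrative).
from tensorflow.keras import layers, models

n_timesteps, n_channels, n_classes = 128, 9, 6      # assumed window shape and class count

model = models.Sequential([
    layers.Input(shape=(n_timesteps, n_channels)),
    layers.LSTM(64, return_sequences=True),          # temporal modelling of the window
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalAveragePooling1D(),                  # replaces fully connected layers
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(n_classes, activation="softmax"),    # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # cross-entropy loss, as described above
              metrics=["accuracy"])
model.summary()
```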
The traditional, ML, DL and hybrid algorithms that can be used for HAR are summarized in
Fig. 4, which presents the use of these algorithms for HAR in a detailed manner. Different
algorithms have different assumptions and knowledge bases, on which accuracy will depend.
Hybrid algorithms can also be implemented by combining two or more base algorithms to
improve network performance.
These algorithms are grouped into texture metrics and Fourier features. Texture metrics describe
the spatial arrangement and variation in pixel intensities or colours within an image and
can be used for tasks such as texture classification, segmentation, and analysis. Fourier
features capture frequency domain information using Fourier analysis. They are useful for
the detection and identification of activity, usually suitable for straightforward activities with
limited computational resources.
1. Local Binary Pattern (LBP): LBP is a statistical texture descriptor helpful for feature
analysis. Because it is a traditional method, it cannot be directly applied for action identi-
fication. For better performance, it can be combined with other models such as SVM, HMM,
and many others [55]; a brief extraction sketch is given at the end of this subsection.
2. Haar Wavelets: A 1D Haar filter is applied, which is nothing more than a simple filter
for extracting features from data. The features are also simple (mean, standard deviation,
correlation, and energy) and have low computation costs. The accuracy is evaluated together
with the computation cost of the features [56].
Abbreviations: LBP: local binary pattern, HOG: histogram of oriented gradients, SIFT: scale-invariant
feature transform, HMM: hidden Markov model, RF: random forest, SVM: support vector
machine, K-NN: K-nearest neighbour, AE: autoencoder, CNN: convolutional neural network,
DCNN: deep convolutional neural network, RBM: restricted Boltzmann machine, DBN: deep
belief network, RNN: recurrent neural network, LSTM: long short-term memory, BoW: bag
of words.
3. Histogram of Oriented Gradient (HOG): It is commonly used for object detection.
It selects useful information and discards the remaining part. It uses the direction of
intensity of the gradients and edge directions. HOG calculates the change in gradient
direction for a given specified portion. It is then followed by SVM or RF classifier for
human detection and recognition [57].
4. GiST: GiST is a feature extraction technique based on the Gabor filter and convolution
process. The Gabor filter is a texture analysis technique that considers the frequency and
orientations. The convolution is carried out in the Fourier domain in order to perform
computation. GiST filters the image into a number of visual feature channels such as
colour, intensity, and orientation [56].
5. Scale Invariant Feature Transform (SIFT): Initially, pre-processing of the input
image/frame is performed by removing noise and performing foreground and background
subtraction. SIFT is invariant to scale and rotation. The algorithm mainly performs four functions for
identifying features: it localizes humans, determines a region of interest (ROI) for
locating global feature key points, computes the orientation of key points, and then reproduces
them in a high-dimensional representation. The extracted global features are fed to an SVM for the
classification of actions [57].
For HAR tasks in security and surveillance, HOG and LBP are often preferred due to their
ability to capture relevant spatial information and their robustness to variations in lighting and
texture. SIFT and GiST may have more limited applicability.
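The extraction sketch referenced above shows how LBP and HOG descriptors could be computed for a single grey-scale frame with scikit-image; the parameter values are illustrative defaults, not settings taken from the reviewed studies.

```python
# Sketch: LBP and HOG feature extraction from one grey-scale frame (scikit-image).
import numpy as np
from skimage.feature import local_binary_pattern, hog

frame = (np.random.rand(128, 64) * 255).astype(np.uint8)   # placeholder grey-scale frame

# LBP texture descriptor: histogram of uniform patterns over the frame.
lbp = local_binary_pattern(frame, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)

# HOG descriptor: gradient-orientation histograms over local cells.
hog_vec = hog(frame, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), feature_vector=True)

features = np.concatenate([lbp_hist, hog_vec])   # combined descriptor for a classifier such as SVM
print(features.shape)
```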
2.2.2 ML algorithms
ML algorithms can be grouped into supervised, unsupervised, reinforcement and statistical algorithms.
Supervised algorithms utilize labelled training data, unsupervised algorithms extract patterns and
clusters from video data without prior labels, and reinforcement algorithms make sequential decisions in
response to human activities. They outperform traditional algorithms and provide good results
for human activity recognition.
1. Hidden Markov Model (HMM): It is a statistical and probabilistic model based on
Bayes theory. It consists of two stages. In the first stage, it learns the data set for each
parameter and trains the model. In the second stage, it calculates the probability for
each sequence class and recognizes the activity based on maximum probability [58] (a minimal
sketch of this two-stage procedure is given at the end of this subsection). It can operate on
sequences of any length and handles feature extraction as well as classification, which is an
advantage of using HMM [59].
2. Discrete HMM: D-HMM estimates the maximal posterior probability to predict the
target.
3. Continuous HMM: C-HMM refers to Markov chain with a finite number of states and
has a set of density functions representing the observations. It detects the sequential
information from time sequence. The proposed system monitors everyday locomotion
with body-worn inertial sensors and other electrophysiological sensors, and groups the
HMMs into three body-related sensor placements, namely active head, active mid-body,
and active lower-body sensors [60].
4. Random Forest (RF): RF is a supervised machine learning algorithm based on the
bagging ensemble technique, widely applicable for both classification and regression.
It gives good results for activity classification. It forms a number of decision trees by
splitting the training data set. It selects samples randomly, and the output is based on
majority voting. Thus, it maintains higher accuracy than individual decision tree algorithms. The RF
algorithm is determined by the original data set and the number of decision tree classifiers
K [61, 62].
5. Naïve Bayes Classifier (NB): The moving target is detected with the help of a filter.
The features such as entropy, Hu moment, and length–width ratio are extracted from
the detected frame. These are applied to the Bayes classifier. The mean and variance are
computed for these three characteristics, and a Gaussian distribution function based on them
is adopted for classification [63]. The Bayes algorithm finds the output with the
largest posterior probability and uses conditional probability for classification.
6. Support Vector Machine (SVM): The first step is extraction of statistical features (mean,
RMS, standard deviation, Index of dispersion, Skewness, etc.), followed by a dimen-
sionality reduction technique using principal component analysis (PCA). Subsequently,
low-dimensional data are fed to the SVM model. SVM is a simple and discriminative
type of classifier. It is based on maximizing the margin between classes, which is defined by the
support vectors. SVM is an old but still very popular supervised ML algorithm today [64]. It was
developed for binary classification, but also gives good results when classifying multiple
classes.
7. K-Nearest Neighbour (KNN): It comes under the umbrella of a supervised ML
algorithm. KNN has many applications in text mining, face recognition, and pattern
recognition, and it can also be used in HAR. The value of K decides the number of neigh-
bours; it depends on the data set, and finding a suitable value of K is crucial. It calculates the
Euclidean distance to the neighbours and assigns the new data point to the corresponding
group. It has the potential to solve low-accuracy problems [54].
SVM, RF, and HMM are commonly used in HAR for security and surveillance because
of their capacity to handle complex data and capture temporal connections.
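The HMM sketch referenced earlier in this subsection is given below: one Gaussian HMM is trained per activity, and a new sequence is labelled by maximum log-likelihood. It uses the hmmlearn package, and all data shapes and parameter values are assumptions for illustration.

```python
# Sketch: one Gaussian HMM per activity class, classification by maximum log-likelihood.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Placeholder training sequences: {activity label: list of (T, n_features) arrays}.
train = {"walking": [rng.normal(size=(100, 3)) for _ in range(5)],
         "running": [rng.normal(loc=2.0, size=(100, 3)) for _ in range(5)]}

models = {}
for label, seqs in train.items():
    X = np.vstack(seqs)
    lengths = [len(s) for s in seqs]
    m = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    m.fit(X, lengths)                      # stage 1: learn parameters for each activity
    models[label] = m

test_seq = rng.normal(loc=2.0, size=(100, 3))
# Stage 2: pick the activity whose HMM assigns the highest log-likelihood.
pred = max(models, key=lambda lbl: models[lbl].score(test_seq))
print("predicted activity:", pred)
```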
2.2.3 DL algorithms
DL algorithms are classified into fully connected, CNN, and RNN architectures and have some
advantages over machine learning algorithms: DL models work well for unstructured data, feature
extraction is automatic, and they have a self-learning ability through backpropagation.
1. Autoencoder (AE): Autoencoder is a neural network that produces an output that
resembles input. It has an encoder, decoder, and hidden layer (which gives an abstract
representation of data). An ensemble of autoencoders can be used for HAR. Each AE
is dedicated to a single activity and trained in offline mode. When new data arrive in
the online phase, each AE reconstructs its output, and classification is made with the help of
the minimum reconstruction error; the threshold is selected as the standard deviation of the error
[53] (see the reconstruction-error sketch at the end of this subsection).
2. Convolutional neural network (CNN): CNN is a more established DL model that has
several application areas. Its main attraction is automatic feature extraction. Various CNN
architectures like VGG, AlexNet, and GoogleNet with different convolutional layers and
parameters are available. For activity recognition, CNN models need to be defined first
with their specifications such as the number of convolutional layers, pooling layers,
hyperparameters, batch size, and epoch. The architecture must be trained and tested on
the data set for computing efficiency and accuracy of the model [54].
3. Restricted Boltzmann machines (RBM): It is a generative stochastic model and type
of neural network based on probability distribution. It is generally used for classification,
regression, collaborative filtering, and feature learning. The RBM is composed of two
layers: a visible layer and a hidden layer.
4. Deep Belief Network (DBN): It is also a generative model that stacks multiple RBM
layers. It can be used in both a supervised and an unsupervised manner and has the potential to
perform both feature extraction and classification. DBNs can be greedily trained layer by
layer and can be useful in HAR applications. A DBN can first be trained in an unsupervised
way and then proceed to supervised fine-tuning [20]. Since deep networks of this kind do not
train well with gradient descent alone, the DBN can give satisfactory output thanks to
unsupervised pre-training.
5. Recurrent neural network (RNN): RNN is a type of neural network with internal
memory and is best suited for sequential data. Thus, it can perform well for human action
recognition as human activities happen in a sequential pattern. Current input and past
output are taken into account when making a decision. The dense layer identifies the
applicable features, and the architecture features a cascade of layers with a fully connected layer at the
output side [39]. However, RNNs can suffer from vanishing and exploding gradient issues; the
LSTM network was designed to handle this problem.
6. Long short-term memory (LSTM): It is an extended version of the RNN. Compared to an RNN, it
has a good capability of handling long-term sequences. HAR is a time-series analysis task, so
LSTM may be well suited to it. The main component of the LSTM is the memory cell
that has an input gate, output gate, and forget gate. As it is a variant of RNN, it provides
a self-recurrent path. Time series data are taken as input and applied to the LSTM model.
It computes the features and forms the feature vector. The classification can be done with
the help of any multi-class classifier [40].
In the context of HAR for security and surveillance, CNNs, RNNs, and LSTMs can be par-
ticularly useful depending on the data.
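The reconstruction-error sketch referenced in the autoencoder item above is given here: an autoencoder is trained on "normal" windows and a sample is flagged as anomalous when its reconstruction error exceeds a threshold derived from the error's standard deviation. All shapes, layer sizes and data are assumptions for illustration.

```python
# Sketch: autoencoder-based anomaly scoring via reconstruction error (illustrative sizes).
import numpy as np
from tensorflow.keras import layers, models

n_features = 64
normal_data = np.random.rand(2000, n_features)        # placeholder "normal activity" windows

ae = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(32, activation="relu"),               # encoder
    layers.Dense(8, activation="relu"),                # compressed representation
    layers.Dense(32, activation="relu"),               # decoder
    layers.Dense(n_features, activation="sigmoid"),
])
ae.compile(optimizer="adam", loss="mse")
ae.fit(normal_data, normal_data, epochs=10, batch_size=64, verbose=0)

# Reconstruction error on training data defines the anomaly threshold.
recon = ae.predict(normal_data, verbose=0)
errors = np.mean((normal_data - recon) ** 2, axis=1)
threshold = errors.mean() + errors.std()               # threshold based on the error's std

def is_anomalous(window):
    err = np.mean((window - ae.predict(window[None, :], verbose=0)[0]) ** 2)
    return err > threshold
```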
A hybrid algorithm in the context of human activity recognition (HAR) refers to an approach
that combines multiple techniques or models from different machine learning or data pro-
cessing paradigms to enhance the accuracy, robustness, or efficiency of activity recognition.
These algorithms merge the strengths of different methods to overcome the limitations of
individual techniques.
1. CNN + LSTM: Features of CNN and LSTM [73] are combined to improve the classi-
fication accuracy and overall performance.
2. SIFT + BoW: It uses a hybrid combination of traditional algorithms; BoW is used as a
feature descriptor and activity classification is attempted with the help of SIFT [74].
3. SVM + CNN: This uses a combination of the DL algorithm 1-D CNN for feature extraction
and the ML algorithm SVM to classify human activities [75] (a sketch of this arrangement is given below).
4. HOG + SVM: It extracts useful information from the image by calculating the occurrence
of gradient orientations and is followed by an SVM [76] to perform the classification task.
CNN + LSTM is the most widely used hybrid algorithm for security and surveillance.
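One way such a 1-D CNN + SVM hybrid could be assembled is sketched below: the CNN acts as a feature extractor and its pooled activations are fed to an SVM. Shapes, layer sizes and data are assumptions for illustration, not the configuration of the cited work.

```python
# Sketch: 1-D CNN as feature extractor, SVM as final classifier (illustrative sizes and data).
import numpy as np
from tensorflow.keras import layers, models
from sklearn.svm import SVC

n_timesteps, n_channels, n_classes = 128, 3, 6
X = np.random.rand(500, n_timesteps, n_channels)        # placeholder sensor windows
y = np.random.randint(0, n_classes, size=500)

cnn = models.Sequential([
    layers.Input(shape=(n_timesteps, n_channels)),
    layers.Conv1D(32, 5, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 5, activation="relu"),
    layers.GlobalAveragePooling1D(name="features"),      # feature vector used by the SVM
    layers.Dense(n_classes, activation="softmax"),
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
cnn.fit(X, y, epochs=5, batch_size=32, verbose=0)

# Re-use the trained convolutional layers as a feature extractor for the SVM.
extractor = models.Model(cnn.input, cnn.get_layer("features").output)
svm = SVC(kernel="rbf").fit(extractor.predict(X, verbose=0), y)
```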
A concise summary of all the ML, DL and hybrid algorithms explained in Sect. 2.2
is provided in Table 1 with reference to the existing literature. It covers HAR in various fields with
different approaches. The table gives information regarding the publication year, author, input
source, features extracted from the source, description and algorithm-based technique employed
for the identification of human activities.
It can be seen from Table 1 that the user has to make the correct choice depending upon the feature
and technique. The general approach of HAR for security and surveillance is illustrated in
Fig. 3. The different state-of-the-art techniques and the existing algorithms applied in case of
HAR for security and surveillance are explained in Sect. 2.2. The data set plays a pivotal role in
the development and deployment of the algorithms, and the analysis of algorithms is evaluated
on the data set. Since there are many publicly accessible data sets for HAR, choosing the
right data set for security and surveillance is crucial because it affects the performance
of the model. The selection of HAR data sets for security and surveillance is covered in the
following section.
The selection of the data set plays a crucial role for HAR in security and surveillance. The data
derived from either visual or sensory input sources are stored in the form of a data set. Data
sets serve as a foundation for building accurate and robust recognition systems.
By defining what constitutes "normal" activities (walking, standing, sitting, jogging), secu-
rity systems can be trained to detect anomalies or unusual behaviours that deviate from these
patterns. Understanding and categorizing household and sports activities can aid in creating
more intelligent security systems that are context-aware. By understanding the context and
categorizing activities accurately, security systems can reduce false positives.
Referring to Table 2, following inferences can be drawn.
• The most popular data sets are WISDM, UCI Har, UCF, KTH, Weizmann, Kinect, HMDB-
51, CAD-60, Opportunity.
• UCF Sports is a challenging data set as it has many variations such as camera motion, view angle,
and changing backgrounds. The UCF Sports [48] and Daily Sports data sets are used for analysing
sports.
• Weizmann data set has videos of low resolution.
• Hollywood data set is realistic in nature as it contains video sequences taken from Holly-
wood movies.
• The mentioned data sets contribute significantly to the development of ML and DL models
capable of recognizing a wide range of human activities. This capability is crucial for
enhancing security surveillance systems, enabling them to automatically detect, categorize,
and respond to potential security threats with greater accuracy and efficiency.
• The UCF Crime data set is a large-scale video data set with 13 different classes of crime
such as arrest, abuse, robbery, and fighting. It contains real-world surveillance footage with a total
of 128 h of video.
• ShanghaiTech and AIRTlab are public data sets available for identifying abnormal events
and classifying violent activities, respectively. They may be useful for security and
surveillance.
• The other data sets listed in the table can be used to train models, which can then be tested on UCF
Crime, ShanghaiTech, and AIRTlab to ensure model robustness and performance.
Table 2 Publicly available HAR data sets (S: sensor-based input, V: vision-based input)

Data set | Input | Activities/classes | Users | Sampling/frame rate | Activity category
WISDM | S | 6 | 51 | 20 Hz | Normal activities
UCF Crime | V | 13 | – | – | Crime-related activities, including robbery, burglary
UCI Har | S | 6 | 30 | 10,299 samples | Normal activities
UCF-101 | V | 101 | 25 groups | 25 fps, 320 × 240 | Sports activities
UCF-50 | V | 50 | 25 groups | – | Realistic videos taken from YouTube
Kinect | V | 18 | 10 | 30 Hz, 640 × 480 | Normal activities
Opportunity | S | 5 | 4 | 30 Hz | Normal activities
Opportunity++ | S+V | 20 | 4 | 640 × 480, 10 fps | Daily living activities
MSR Activity 3D | V | 16 | 10 | 20 samples | Household activities
HAPT | S | 6 | 30 | 1214 samples | Normal activities
UniMiB-SHAR | S | 17 | 30 | 50 Hz | Household activities
Florence 3D | V | 9 | 10 | 215 samples | Household activities
PAPMP2 | S | 18 | 9 | 100 Hz | Household activities
NTU RGB-D | V | 60 | 40 | – | Household activities
Actitracker | S | 6 | 36 | 20 Hz | Normal activities
Skoda | S | 46 | – | 96 Hz | Assembly line activities
WARD | S | 13 | 20 | – | Normal activities
USC-HAD | S | 12 | 14 | – | Motion node accelerometer
Daphnet | S | 2 | 10 | – | Freeze, no freeze
In Table 2, entries for the number of users and the sampling/frame rate are marked with ‘–’
for some data sets, indicating that no specific value was reported.
The data set is an essential parameter for HAR in security and surveillance. The algorithms are
evaluated on the data set in a simulation environment and give reasonable performance there.
These sensors also function reasonably well in a simulation setting; however, they can generate
a large number of false alarms when employed in realistic scenarios.
In the security domain, image classification faces several challenges, such as data imbalance,
multi-modality, intra-class variation and inter-class similarity, and domain adaptiveness.
1. Data imbalance: The raw data recorded under unconstrained conditions are naturally class-
imbalanced. When using an imbalanced data set, conventional models tend to predict the
class with the majority of training samples while ignoring the class with few
available training samples. Therefore, it is important to address the class imbalance issue
when developing an effective activity recognition model (a simple class-weighting sketch is given after this list).
2. Multi-modality: Security applications often involve data from multiple sources or modal-
ities, such as combining video, audio, and sensor data. Traditional image classification
models designed for single-modality data may struggle to effectively utilize information
from diverse sources. Integrating features from various modalities can provide a more
comprehensive understanding of security scenarios.
3. Intra-class variation and inter-class similarity: Inter-class similarity means that activities
of different classes look alike and fall into the same category; for example, walking gets misclassified as
running because of high speed. Intra-class variation means that the same activity performed by
different subjects can look slightly different because of movement and subject variation, so the same
activity falls into two different classes; for example, jogging performed by different subjects
can get classified as running.
4. Domain adaptiveness: Security environments are dynamic, and the characteristics of
data may vary across different domains or scenarios. Achieving domain adaptability is
crucial for deploying image classification models in diverse security settings. Methods
for domain adaptation, transfer learning, or continual learning are essential to ensure that
models can adapt to changes in the environment without a significant drop in performance.
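The class-weighting sketch referenced in the data-imbalance point above is given here: classes are weighted inversely to their frequency during training. The label array is a placeholder, and this is only one simple mitigation among several (others include resampling and data augmentation).

```python
# Sketch: class weighting for imbalanced activity labels (placeholder label array).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(["normal"] * 950 + ["suspicious"] * 50)      # heavily imbalanced labels
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))
print(class_weight)   # rare "suspicious" class receives a much larger weight

# The dictionary can be passed to many scikit-learn estimators (class_weight=...) or,
# with integer class indices as keys, to Keras model.fit(class_weight=...), so that
# errors on the rare class are penalized more heavily during training.
```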
In order to understand the pros and cons of HAR approaches for security and surveillance, the
approaches are categorized by sensor-based and vision-based input sources, and the corresponding
advantages and disadvantages are presented in Fig. 5.
Advantages are shown in green block, while disadvantages are shown in pink block.
RGB data and Skeleton data are gathered from CCTV cameras and RGB depth cameras,
respectively. Wearable sensors [71] such as an accelerometer, gyroscope, magnetometer, and
proximity sensor are included in the sensor data set [72]. RFID and Wi-Fi are both object
sensors.
The following observations can be derived from Fig. 5:
• CCTV and RGB depth cameras provide high accuracy, but both are expensive.
• Wearable sensors are cheap and lightweight, but they are more sensitive to environmental
conditions.
• Object sensors have a lower cost, but they require more resources and are affected by
temperature and humidity.
The multimodal fusion of sensor and vision information enhances surveillance and security
system. By integrating sensor data and video analytics, security systems may gain a more
comprehensive understanding of the surroundings, making it easier to discriminate between
normal and suspicious activities. This integration can also help to reduce false alarms because
the combination of video and sensor data offers a more complete context for evaluating alerts.
An approach [73] of feature-level fusion for robust HAR, which utilizes data from multiple
sensors, including an RGB camera, a depth sensor, and wearable inertial sensors, can be applicable
in surveillance applications.
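A minimal sketch of feature-level fusion of this kind is given below, concatenating a vision feature branch and an inertial-sensor feature branch before a shared classifier head. All input shapes, layer sizes and names are assumptions for illustration, not the architecture of the cited work.

```python
# Sketch: feature-level fusion of vision and inertial-sensor inputs (illustrative shapes).
from tensorflow.keras import layers, models

vision_in = layers.Input(shape=(64, 64, 3), name="rgb_frame")
v = layers.Conv2D(16, 3, activation="relu")(vision_in)
v = layers.GlobalAveragePooling2D()(v)                  # vision feature vector

sensor_in = layers.Input(shape=(128, 6), name="inertial_window")
s = layers.Conv1D(16, 5, activation="relu")(sensor_in)
s = layers.GlobalAveragePooling1D()(s)                  # sensor feature vector

fused = layers.Concatenate()([v, s])                    # feature-level fusion
hidden = layers.Dense(32, activation="relu")(fused)
out = layers.Dense(6, activation="softmax")(hidden)     # assumed 6 activity classes

model = models.Model(inputs=[vision_in, sensor_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```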
The data set and ML-DL algorithms are the prime components in the implementation of a HAR
system; these are examined in Sect. 3. To evaluate the performance of a model, there
should be an understanding of the workflow that properly connects the data set and the algorithm. A
workflow ensures that proper assessment techniques are employed, such as partitioning the
data set into training and testing sets and using appropriate evaluation metrics.
4 Flowgraph of HAR
Referring to the literature, a combined flowgraph is prepared to indicate the important steps for vision-
and sensor-based inputs, starting from pre-processing through to activity classification. A workflow for
HAR in security and surveillance ensures that reliable and accurate activity recognition systems are
deployed in a methodical and effective manner [74]. Figure 6 gives the detailed flow diagram of HAR
for vision and sensor-based data sets. It shows a step-by-step, structured approach for performing
activity recognition from top to bottom. The raw data collected are first split into training and test
data. According to the data set type, either sensor or vision, the corresponding data pre-processing
techniques are applied. The flowgraph is segmented into sections: data set collection, data set split,
pre-processing, feature extraction and model training, based on the activities that are carried out.
Collect the data set containing human activities. It can be labelled or unlabelled. This data
set could include sensor data from accelerometers, gyroscopes, and other sources, or vision data
from cameras and CCTV.
4.2 Pre-processing
Various filters are applied for noise removal, such as the low-pass filter (LPF), mean, median,
Butterworth, and Kalman filters. Performance factors of these filters, such as the signal-to-noise
ratio (SNR), correlation coefficient, cut-off frequency, and filter order, must be taken into
account during filter selection. For vision-based data, pre-processing is performed by
video-to-frame conversion. Objects in the frame are detected using different object detection
algorithms to produce bounding boxes, and unnecessary frames can be removed. Noise removal is done with
the help of foreground and background subtraction. Normalization, scaling and enhancement
are performed using various image processing algorithms. Pre-processing also includes important
tasks like segmentation and data transformation.
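For the vision branch, the video-to-frame conversion, frame dropping and normalization steps mentioned above could be sketched with OpenCV as follows; the video path, frame size and sampling stride are placeholders.

```python
# Sketch: video-to-frame conversion with resizing and pixel normalization (OpenCV).
import cv2
import numpy as np

def video_to_frames(path, size=(224, 224), stride=5):
    """Read every `stride`-th frame, resize it, and scale pixel values to [0, 1]."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:                         # drop unnecessary frames
            frame = cv2.resize(frame, size)
            frames.append(frame.astype(np.float32) / 255.0)
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0, *size, 3))

# frames = video_to_frames("cctv_clip.mp4")            # placeholder path
```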
4.2.1 Segmentation
The window length is important for the segmentation task, and there is no specific rule for select-
ing it. In previous studies [62], the window size ranges from 0.25 to 6.7 s. Feature
extraction with a small window length gives better results but requires more time. Conversely,
if an activity is performed over a short duration, then selecting a large window can give
poor results, so an optimal window length is required.
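A minimal sliding-window segmentation sketch for sensor data is shown below; the window length and overlap are assumptions chosen within the range reported above, and the data are placeholders.

```python
# Sketch: sliding-window segmentation of a continuous sensor stream (illustrative parameters).
import numpy as np

def sliding_windows(signal, labels, win_len=128, overlap=0.5):
    """Split a (T, channels) signal into fixed-length windows with the given overlap."""
    step = int(win_len * (1 - overlap))
    windows, window_labels = [], []
    for start in range(0, len(signal) - win_len + 1, step):
        windows.append(signal[start:start + win_len])
        # Label each window by the majority label of its samples.
        vals, counts = np.unique(labels[start:start + win_len], return_counts=True)
        window_labels.append(vals[np.argmax(counts)])
    return np.stack(windows), np.array(window_labels)

# Placeholder data: 10,000 samples of tri-axial accelerometer readings.
sig = np.random.randn(10_000, 3)
lab = np.random.randint(0, 6, size=10_000)
X_win, y_win = sliding_windows(sig, lab)
print(X_win.shape, y_win.shape)          # e.g. (155, 128, 3) (155,)
```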
Training and testing sets are created from the data set. The test set is utilized for evaluation and
performance assessment, whereas the training set is used to train the ML-DL models. The
train-test split ratio can vary, for example 80:20 or 70:30, depending on the size of the data set and
the application.
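A sketch of the split step is given below. Alongside a plain stratified split, a subject-wise split (grouping windows by user) is often used in HAR so that the same person does not appear in both sets; the arrays and subject IDs here are placeholders.

```python
# Sketch: stratified and subject-wise train/test splits (placeholder arrays).
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit

X = np.random.rand(1000, 12)
y = np.random.randint(0, 6, size=1000)
subjects = np.random.randint(0, 30, size=1000)        # assumed per-window subject IDs

# Plain stratified 80:20 split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Subject-wise split: no subject contributes windows to both training and test sets.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=subjects))
```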
Feature extraction is a key step in human activity recognition. Features are categorized into two
groups: handcrafted features and automatically extracted features. Handcrafted features are manually
extracted features that require expert knowledge; in earlier studies they were used with traditional
ML algorithms. More recently, DL algorithms like CNN extract features automatically by learning
them [54]. Although DL algorithms have the ability to extract features automatically, handcrafted
features still cannot be ignored. Time domain and frequency domain features are the two major types
of handcrafted features; time domain features are used to characterize static activities, while
frequency domain features are used to differentiate between static and dynamic activities. Time
domain features include the mean value, maximum value, minimum value, range, standard deviation,
etc. Frequency domain features include energy, power, amplitude, kurtosis, skewness, etc. In the case
of vision data, feature extraction includes region of interest (ROI) identification and non-parametric
weighted feature extraction (NWFE), which is based on an extension of scatter matrices and is well
suited to high-dimensional multi-class activity recognition problems. Lucas–Kanade–Tomasi (LKT) is a
feature tracker; it performs a direct search for the position by using spatial intensity information to
provide the best match. SIFT detects, describes and matches local features in an image. HOG is
basically utilized for object detection; it takes into account the number of occurrences of gradient
orientations in an area of the image. Simple handcrafted features for images are edges and corners,
which can be detected by edge detection algorithms.
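A sketch of computing a few of the time-domain and frequency-domain features listed above for one sensor window follows; the particular feature set and window shape are illustrative choices, not a prescribed recipe.

```python
# Sketch: handcrafted time- and frequency-domain features for one (win_len, channels) window.
import numpy as np
from scipy.stats import skew, kurtosis

def handcrafted_features(window):
    feats = []
    for ch in range(window.shape[1]):
        x = window[:, ch]
        # Time-domain features.
        feats += [x.mean(), x.std(), x.max() - x.min(), x.min(), x.max()]
        # Frequency-domain features from the FFT magnitude spectrum.
        spectrum = np.abs(np.fft.rfft(x))
        feats += [np.sum(spectrum ** 2),        # energy
                  skew(spectrum), kurtosis(spectrum)]
    return np.array(feats)

window = np.random.randn(128, 3)                # placeholder tri-axial window
print(handcrafted_features(window).shape)       # (24,) = 3 channels x 8 features
```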
The data need to be transformed using different transformation techniques so that they are in the
proper format and subsequent operations can be performed smoothly.
The raw data are usually of huge quantity, with various features present in them.
Training on such big data takes a lot of time and requires a complex computational platform.
Data reduction methods help to reduce the dimensionality of the data, either by combining
some features or by removing the less important ones. Principal component analysis (PCA) is a
dimensionality reduction technique based on eigenvalues and eigenvectors. t-distributed
stochastic neighbour embedding (t-SNE) is a nonlinear method that preserves small pairwise
distances. Some other techniques are attribute subset selection and data cube aggregation.
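A brief sketch of the two reduction techniques mentioned above, applied to a placeholder feature matrix:

```python
# Sketch: PCA and t-SNE dimensionality reduction on a placeholder feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(500, 60)                      # 500 windows x 60 extracted features

# PCA: keep enough components to explain 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

# t-SNE: 2-D embedding, typically used for visualization rather than as classifier input.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```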
The model is trained using the extracted features to classify activities into the correct cat-
egory. An ML or DL model can be utilized for the training [82]. Model training involves
various parameters such as the learning rate, hyperparameter tuning, number of epochs, batch
size, number of layers, kernel size, filter size, activation functions, batch normalization, and
optimization algorithms. Different optimization algorithms are available in the literature.
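A sketch of how these training parameters (optimizer, learning rate, batch size, epochs, early stopping) are typically specified in Keras is shown below; all values, the model and the data are illustrative placeholders, not recommendations.

```python
# Sketch: typical training configuration in Keras (all values and data are illustrative).
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

X_train = np.random.rand(800, 128, 3)            # placeholder windows
y_train = np.random.randint(0, 6, size=800)

model = models.Sequential([layers.Input(shape=(128, 3)),
                           layers.Flatten(),
                           layers.Dense(64, activation="relu"),
                           layers.Dense(6, activation="softmax")])

model.compile(optimizer=Adam(learning_rate=1e-3),          # optimizer and learning rate
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_split=0.2,                   # held-out validation data
                    epochs=50, batch_size=64,
                    callbacks=[EarlyStopping(patience=5, restore_best_weights=True)])
```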
The most common performance evaluation metrics [74] that are pertinent to HAR in security and
surveillance are discussed in Table 3.
The analysis is made by evaluating several review papers covering 35 different data sets to
find the most popular performance metrics used in the security and surveillance area. Based on
these review data, the impact of the performance measures is illustrated in Fig. 7, where it can be
observed that accuracy and F1-score are the most commonly used parameters. The accuracy
parameter quantifies the percentage of correctly identified activities out of all instances. A
high detection accuracy shows that the system can reliably recognize the relevant activities and
contributes to improving system security. However, the accuracy metric is most informative for balanced
data sets, while the F1-score is better suited to imbalanced data sets because it takes into account
both precision and recall. The public data sets available
for human activity recognition in the security and surveillance domain are imbalanced in most
cases. The next important parameter is precision, which indicates the prediction ability
for an individual category. Precision can be applied to security- and surveillance-related HAR; it
measures the proportion of correctly identified threats out of all detected activities. Recall
indicates the proportion of true suspicious activities correctly identified. A good
balance between precision and recall is important to minimize the false positives and false negatives
of the security and surveillance system. The FPR, also known as the false alarm rate, gives the
number of false positive detections, where the system identifies an activity as suspicious when it is
not; keeping it low ensures that the system remains trustworthy and focuses on its objectives rather
than raising false alarms. To evaluate the overall performance of a HAR system in security and surveillance
applications, it is crucial to take these performance measures into account collectively. Since
it can be observed that the accuracy parameter is considered in the majority of cases, Table 4 is
provided, based on the referenced literature, to indicate the accuracy values produced by
employing various ML-DL algorithms. Along with accuracy, real-time detection and speed
are two critical factors required for security and surveillance. Different approaches applied
to the same data set result in varying accuracy values; this is due to the diversity
of the algorithms and their architectures.
It is observed that the accuracy of a HAR system using the same ML algorithm on
different data sets can vary depending upon several factors such as data size, data
distribution, and feature representation. As shown in Table 4, HMM gives different
FP False Positive, TP True Positive, FN False Negative, FPR False Positive Rate, TPR
True Positive Rate
accuracy values for MSR Daily Activity 3D (94.4) and Florence 3D (96.2). DL algorithms,
especially CNNs and RNNs such as LSTM and Bi-LSTM, have shown significant improvements in
HAR accuracy. Some methods combine two or more algorithms, such as CNN + LSTM, SIFT
+ SVM, etc., to obtain improved results. A hybrid algorithm like CNN + LSTM combines
spatial feature extraction from CNNs with temporal modelling and sequential learning from
LSTMs, achieving, for example, 92% on UCI-HAR and 95.87% on the Open HAPT data set. This hybrid
algorithm shows 83.13% accuracy on the UMN data set, but after adding an attention mechanism
the accuracy increases to 94.30%.
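A sketch of computing the metrics discussed in this section (accuracy, precision, recall, F1, and FPR from the confusion matrix) for a binary suspicious/normal setting is given below; the labels and predictions are placeholders.

```python
# Sketch: evaluation metrics for a binary suspicious-vs-normal HAR setting (placeholder labels).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = np.random.randint(0, 2, size=200)      # 1 = suspicious, 0 = normal
y_pred = np.random.randint(0, 2, size=200)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)          # TP / (TP + FP)
rec = recall_score(y_true, y_pred)              # TP / (TP + FN), i.e. TPR
f1 = f1_score(y_true, y_pred)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                            # false alarm rate
print(f"acc={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f} fpr={fpr:.2f}")
```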
5 Discussion
In this paper, an extensive review regarding the role of traditional feature descriptors, ML and
DL in HAR systems for the security and surveillance area is conducted by considering the literature
available to date. HAR is continuously developing because of its numerous applications in different
fields. Current trends and techniques are presented in this paper, as most of the literature
reviewed in this review is from the last five years. Different applications of HAR and their
contribution to the respective field are identified. This review highlights the broader impact
of HAR in security and surveillance, ranging from improving security measures to ensuring
public safety. The literature survey may provide specific examples of how HAR has been
employed in security and surveillance scenarios, such as recognizing suspicious behaviour or
identifying unauthorized access. Figure 4 summarizes the traditional, ML and DL algorithms
available for HAR. Various hybrid models such as CNN + LSTM and SVM + CNN are
also discussed. Empirical observations indicate that hybrid models outperform individual
models, demonstrating superior performance in various tasks. Different combinations of
ML-DL hybrid models can be implemented to improve the performance and reduce the
computational power burden for security and surveillance applications. Sensor- and vision-
based data sets are the two broad categories of HAR data sets. The limitations of sensor- and
vision-based data are provided in detail. Available data sets for HAR with their specifications
are discussed, and an analysis based on methodology, input source, accuracy, and data sets is
presented. The various performance evaluation metrics available for activity classification
are studied and important parameters for security and surveillance application are mentioned
with rationale. HAR enables security experts to recognise patterns, enhance security
tactics, and make data-driven judgements to strengthen security measures. This paper is a
valuable resource for researchers, practitioners, and system developers working in the field
of security and surveillance as it provides an in-depth understanding of cutting-edge techniques
and recent advancements in HAR.
6 Conclusion
HAR is now an extensively growing field with numerous applications in different domains.
HAR is more critical in security and surveillance than in other applications as it deals with
public safety, crime prevention, suspicious activity identification, and privacy protection, and it
has enormous potential for strengthening security and surveillance systems. The evolution
of HAR techniques was examined in this literature review, from early approaches like Haar
and LBP to the adoption of ML algorithms and up to the development of DL models. The Haar and
LBP feature descriptors are complex in nature and provide relatively low accuracy. ML
algorithms use these feature descriptors for feature extraction and then apply ML models
such as SVM, RF, and DT to classify the activities, giving good results. In the case of ML algorithms,
with relatively simple activities and limited resources for real-time applications, NB and KNN
might be suitable choices. For more complex activities and larger data sets, RF and SVM
could provide better accuracy and generalization. The integration of deep learning models
has improved the accuracy and robustness of HAR in security and surveillance
applications. Some hybrid architectures, such as convolutional recurrent neural networks
(C-RNN), which combine the strengths of CNNs and RNNs, provide further improvements in
system models.
The selection of ML or DL models depends on the size of a data set, hardware resources,
real-time requirement, model interpretability, and computation requirement. Although DL
belongs to the broader family of machine learning, this does not mean that deep learning is always better
than machine learning. ML models are statistical, while DL models are dynamic. The inter-
pretability of DL models is low, and they require more parallel processing units. If the available
data set is small and simple, then an ML model can be applied, which saves computational requirements,
reduces time and space complexity, and lowers the cost of GPUs and TPUs. However, for
complex activities it is better to choose DL models for classification and recognition. The
knowledge gathered from this literature review can open the door for more investigation and
the creation of strong HAR systems that strengthen security protocols and promote safer
surroundings.
7 Future directions
The continued development of sensor and vision technologies, ML algorithms, and the rising
desire for more effective and reliable security systems are the main drivers of the future direc-
tions for HAR in security and surveillance. Future research might concentrate on creating
reliable multi-modal sensor fusion approaches to combine data from many sensors including
wearable technology, depth sensors, and video cameras, and take advantage of their com-
plementary capabilities to increase the accuracy and reliability of activity recognition. By
combining ML and DL models, activity recognition can be performed in a short period, so it
can be used for real-time applications, as real-time HAR is crucial for security and surveil-
lance. The use of unsupervised and semi-supervised learning approaches can be investigated
in the future to reduce the dependency on massive volumes of labelled data. To facilitate the
transfer of information from one domain to another, future research can examine transfer
learning and domain adaptation strategies to make HAR systems more flexible and reliable
in a variety of security and surveillance applications.
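As a simple illustration of such transfer learning for vision-based HAR, the sketch below reuses an ImageNet-pretrained MobileNetV2 backbone and trains only a new classification head on activity frames; the input size, class labels and hyperparameters are illustrative assumptions rather than settings from the reviewed literature.

# Minimal sketch of transfer learning for vision-based HAR: reuse an ImageNet-
# pretrained MobileNetV2 backbone and train only a new classification head.
import tensorflow as tf

N_CLASSES = 5  # e.g. walk, run, fight, fall, loiter (hypothetical labels)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the backbone; only the new head is learned

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Once the head converges, a few top backbone layers can be unfrozen and
# fine-tuned with a lower learning rate to adapt to the surveillance domain.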
Author contributions All the authors contributed equally in preparing the manuscript.
Declarations
Conflict of interest The authors have no competing interests to declare that are relevant to the content of this
article.
References
1. Ranasinghe S, Al Machot F, Mayr HC (2016) A review on applications of activity recognition systems
with regard to performance and evaluation. Int J Distrib Sens Netw 12:1550147716665520
2. Hussain Z, Sheng M, Zhang WE (2019) Different approaches for human activity recognition: a survey.
arXiv preprint arXiv:1906.05074
3. Nurwulan NR, Selamaj G (2020) Random Forest for human daily activity recognition. J Phys Conf Ser
1655:012087
4. Mattivi R, Shao L (2009) Human action recognition using LBP-TOP as sparse spatio-temporal feature
descriptor. In: Computer Analysis of Images and Patterns: 13th International Conference, CAIP 2009,
Münster, Germany, September 2–4, 2009, Proceedings. pp 740–747
5. Huu PN, Nguyen NNTT, Pham Ngoc (2022) Proposing posture recognition system combining
MobileNetV2 and LSTM for medical surveillance 10:1839–1849
6. Nguyen HH, Ta TN, Nguyen NC, Pham HM, Nguyen DM (2021) YOLO based real-time human detection for
smart video surveillance at the edge. In: 2020 IEEE eighth international conference on communications
and electronics (ICCE). pp 439–444
7. Saleem G, Ijaz Bajwa U, Hammad Raza R (2022) Toward human activity recognition: a survey. Neural
Comput Appl 35(5):4145–4182
8. Javeed M, Abdelhaq M, Algarni A, Jalal A (2022) Biosensor-based multimodal deep human locomo-
tion decoding via internet of healthcare things. Micromachines 14(12):1–20
9. Hajjej F (2022) Deep human motion detection and multi-features analysis for smart healthcare learning
tools. IEEE Access 10:116527–116539
10. Bhambri P, Bagga S, Priya D, Singh H, Dhiman HK (2020) Suspicious human activity detection system.
J IoT Soc Mob Anal Cloud 2:216–221
11. Fu Y, Liu T, Ye O (2019) Abnormal activity recognition based on deep learning in crowd. In: 2019 11th
International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC). 2019; pp
301–304
12. Shenoy A, Thillaiarasu N (2022) A survey on different computer vision based human activity recognition
for surveillance applications. In: 2022 6th International Conference on Computing Methodologies and
Communication (ICCMC). 2022; pp 1372–1376
13. Zhang S, Wei Z, Nie J, Huang L, Wang S, Li Z et al (2017) A review on human activity recognition using
vision-based method. J Healthcare Eng 2017:1
14. Coen P (2019) Human activity recognition and prediction using RGBD data; Southern Illinois University
at Carbondale
15. Dang LM, Min K, Wang H, Piran MJ, Lee CH, Moon H (2020) Sensor-based and vision-based human
activity recognition: a comprehensive survey. Pattern Recogn 108:107561
16. Beddiar DR, Nini B (2017) Vision based abnormal human activities recognition: an overview. In: 2017
8th International Conference on Information Technology (ICIT). 2017; pp 548–553
17. Suwannarat K, Kurdthongmee W (2021) Optimization of deep neural network-based human activity
recognition for a wearable device. Heliyon 7
18. Chathuramali KM, Rodrigo R (2012) Faster human activity recognition with SVM. In: International
Conference on advances in ICT for emerging regions (ICTer2012). pp 197–203.
19. Mekruksavanich S, Jitpattanakul A (2021) Lstm networks using smartphone data for sensor-based human
activity recognition in smart homes. Sensors 21:1636
20. Hassan MM, Uddin MZ, Mohamed A, Almogren A (2018) A robust human activity recognition system
using smartphone sensors and deep learning. Futur Gener Comput Syst 81:307–313
21. Khan NS, Ghani MS (2021) A survey of deep learning-based models for human activity recognition.
Wireless Pers Commun 120:1593–1635
22. Abdu-Aguye MG, Gomaa W (2019) Robust human activity recognition based on deep metric learning.
ICINCO 1:656–663
23. Salem FGI, Hassanpour R, Ahmed AA, Douma A (2021) Detection of suspicious activities of human
from surveillance videos. In: 2021 IEEE 1st International Maghreb meeting of the conference on sciences
and techniques of automatic control and computer engineering MI-STA. pp 794–801
24. Dileep AS, Nabilah S, Sreeju S, Farhana K, Surumy S (2022) Suspicious human activity recognition
using 2D pose estimation and convolutional neural network. In: 2022 International conference on wireless
communications signal processing and networking (WiSPNET). pp 19–23
25. Mahajan RC, Pathare NK, Vyas V (2022) Video-based anomalous activity detection using 3D-CNN and
transfer learning. In: 2022 IEEE 7th international conference for convergence in technology (I2CT). pp
1–6
26. Khurana R, Kushwaha AKS (2018) Deep learning approaches for human activity recognition in video
surveillance-a survey. In: 2018 First International Conference on Secure Cyber Computing and Commu-
nication (ICSCCC). pp 542–544
27. Chen K, Zhang D, Yao L, Guo B, Yu Z, Liu Y (2021) Deep learning for sensor-based human activity
recognition: overview, challenges, and opportunities. ACM Comput Surv (CSUR) 54:1–40
28. Banjarey K, Sahu SP, Dewangan DK (2021) A survey on human activity recognition using sensors and deep
learning methods. In: 2021 5th international conference on computing methodologies and communication
(ICCMC). 2021; pp 1610–1617
29. Seemanthini K, Manjunath S (2018) Human detection and tracking using HOG for action recognition.
Procedia Comput Sci 132:1317–1326
30. Patel CI, Labana D, Pandya S, Modi K, Ghayvat H, Awais M (2020) Histogram of oriented gradient-based
fusion of features for human action recognition in action video sequences. Sensors 20:7299
31. Sanal Kumar K, Bhavani R (2020) Human activity recognition in egocentric video using HOG, GiST and
color features. Multim Tools Appl 79:3543–3559
32. Liu C, Ying J, Han F, Ruan M (2018) Abnormal human activity recognition using bayes classifier and
convolutional neural network. In: 2018 IEEE 3rd international conference on signal and image processing
(ICSIP). 2018; pp 33–37
33. Xu L, Yang W, Cao Y, Li Q (2017) Human activity recognition based on random forests. In: 2017 13th
international conference on natural computation, fuzzy systems and knowledge discovery (ICNC-FSKD).
2017; pp 548–553
34. Nandyal S, Angadi S (2021) Recognition of suspicious human activities using klt and Kalman filter for
atm surveillance system. In: 2021 international conference on innovative practices in technology and
management (ICIPTM). pp 174–179
35. Bibbò L, Vellasco MMBR (2023) Human activity recognition (HAR) in healthcare. Appl Sci 13(24):1–9
36. Chen K, Yao L, Zhang D, Wang X, Chang X, Nie F (2019) A semi-supervised recurrent convolutional
attention model for human activity recognition. IEEE Trans Neural Netw Learn Syst 31:1747–1756
37. Lv T, Wang X, Jin L, Xiao Y, Song M (2020) A hybrid network based on dense connection and weighted
feature aggregation for human activity recognition. IEEE Access 8:68320–68332
38. Alessandrini M, Biagetti G, Crippa P, Falaschetti L, Turchetti C (2021) Recurrent neural network for
human activity recognition in embedded systems using PPG and accelerometer data. Electronics 10:1715
39. Chen Y, Zhong K, Zhang J, Sun Q, Zhao X (2016) LSTM networks for mobile human activity recognition.
In: 2016 International conference on artificial intelligence: technologies and applications. pp 50–53
40. Hamad RA, Yang L, Woo WL, Wei B (2020) Joint learning of temporal models to handle imbalanced
data for human activity recognition. Appl Sci 10:5293
41. Mutegeki R, Han DS (2020) A CNN-LSTM approach to human activity recognition. In: 2020 international
conference on artificial intelligence in information and communication (ICAIIC). pp 362–366
42. Xia K, Huang J, Wang H (2020) LSTM-CNN architecture for human activity recognition. IEEE Access
8:56855–56866
43. Cui Z, Ke R, Pu Z, Wang Y (2020) Stacked bidirectional and unidirectional LSTM recurrent neural
network for forecasting network-wide traffic state with missing values. Transp Res Part C Emerg Technol
118:102674
44. Prabono AG, Yahya BN, Lee S-L (2021) Hybrid domain adaptation with deep network architecture for
end-to-end cross-domain human activity recognition. Comput Ind Eng 151:106953
45. Dua N, Singh SN, Semwal VB (2021) Multi-input CNN-GRU based human activity recognition using
wearable sensors. Computing 103:1461–1478
46. Shuvo MMH, Ahmed N, Nouduri K, Palaniappan K (2020) A hybrid approach for human activity recogni-
tion with support vector machine and 1D convolutional neural network. IEEE Appl Imag Pattern Recogn
Workshop (AIPR) 2020:1–5
47. Ibrahim MJ, Kainat J, AlSalman H, Ullah SS, Al-Hadhrami S, Hussain S et al. (2022) An effective
approach for human activity classification using feature fusion and machine learning methods. Appl
Bionics Biomech
48. Hanai Y, Nishimura J, Kuroda T (2009) Haar-like filtering for human activity recognition using 3d
accelerometer. In: 2009 IEEE 13th digital signal processing workshop and 5th IEEE signal processing
education workshop. pp 675–678
49. Dhulavvagol PM, Kundur NC (2017) Human action detection and recognition using SIFT and SVM. In:
international conference on cognitive computing and information processing. pp 475–491
50. Tsai A-C, Ou Y-Y, Sun C-A, Wang J-F (2017) VQ-HMM classifier for human activity recognition based
on R-GBD sensor. In: 2017 International Conference on Orange Technologies (ICOT). pp 201–204
51. Uddin MZ, Kim T-S (2011) Continuous hidden Markov models for depth map-based human activity
recognition. Hidden Markov Models Theory Appl pp. 225–247
52. Palaniappan A, Bhargavi R, Vaidehi V (2012) Abnormal human activity recognition using SVM based
approach. In: 2012 international conference on recent trends in information technology. pp 97–102
53. Garcia KD, de Sa CR, Poel M, Carvalho T, Mendes-Moreira J, Cardoso JM, de Carvalho AC, Kok
JN (2021) An ensemble of autonomous auto-encoders for human activity recognition. Neurocomputing
439:271–280
54. Bevilacqua A, MacDonald K, Rangarej A, Widjaya V, Caulfield B, Kechadi T, (2018) Human activ-
ity recognition with convolutional neural networks. Machine Learning and Knowledge Discovery in
Databases: European Conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Pro-
ceedings, Part III 18. 2019; pp 541–552
55. Andrade-Ambriz YA, Ledesma S, Ibarra-Manzano M-A, Oros-Flores MI, Almanza-Ojeda D-L (2022)
Human activity recognition using temporal convolutional neural network architecture. Expert Syst Appl
191:116287
56. Hartmann Y, Liu H, Lahrberg S, Schultz T (2022) Interpretable high-level features for human activity
recognition. Biosignals pp 40–49
57. Kumar D, Sailaja SR (2021) Abnormal activity recognition using deep learning in streaming video for
indoor application. ITU Kaleidoscope: Connect Phys Virt Worlds (ITU K) 2021:1–7
58. Feng Z, Zhu X, Xu L, Liu Y (2021) Research on human target detection and tracking based on artifi-
cial intelligence vision. In: 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and
Computers (IPEC). pp 1051–1054
59. Snoun A, Jlidi N, Bouchrika T, Jemai O, Zaied M (2021) Towards a deep human activity recognition
approach based on video to image transformation with skeleton data. Multim Tools Appl 80:29675–29698
60. Ghadi Y (2022) MS-DLD: Multi-sensors based daily locomotion detection via kinematic-static energy
and body-specific HMM. IEEE Access 10:23964–23979
61. Gupta A, Gupta K, Gupta K, Gupta K (2020) A survey on human activity recognition and classification.
In: 2020 international conference on communication and signal processing (ICCSP). pp 0915–0919
62. Wang H, Zhao J, Li J, Tian L, Tu P, Cao T, An Y, Wang K, Li S (2020) Wearable sensor-based human
activity recognition using hybrid deep learning techniques. Secur Commun Netw 2020:1–12
63. Zebin T, Scully PJ, Peek N, Casson AJ, Ozanyan KB (2019) Design and implementation of a convolu-
tional neural network on an edge computing smartphone for human activity recognition. IEEE Access
7:133509–133520
64. Deep S, Zheng X (2019) Leveraging CNN and transfer learning for vision-based human activity recog-
nition. 2019 29th International Telecommunication Networks and Applications Conference (ITNAC). pp
1–4
65. Zeng M, Nguyen LT, Yu B, Mengshoel OJ, Zhu J, Wu P, Zhang J (2014) Convolutional neural networks for
human activity recognition using mobile sensors. In: 6th international conference on mobile computing,
applications and services. pp 197–205
66. Rachna U, Guruprasad V, Shindhe SD, Omkar S (2022) Real-time violence detection using deep neural
networks and DTW. In: international conference on computer vision and image processing. pp 316–327
67. Ogbuabor G, La R (2018) Human activity recognition for healthcare using smartphones. Proceedings of
the 2018 10th international conference on machine learning and computing. pp 41–46
68. Mohsen S, Elkaseer A, Scholz SG (2021) Human activity recognition using K-nearest neighbor machine
learning algorithm. Proceedings of the International Conference on Sustainable Design and Manufactur-
ing. pp 304–313
69. Pham C, Nguyen-Thai S, Tran-Quang H, Tran S, Vu H, Tran T-H, Le T-L (2020) SensCapsNet: deep
neural network for non-obtrusive sensing based human activity recognition. IEEE Access 8:86934–86946
70. Pareek P, Thakkar A (2021) A survey on video-based human action recognition: recent updates, datasets,
challenges, and applications. Artif Intell Rev 54:2259–2322
71. Hussain Z, Sheng QZ, Zhang WE (2020) A review and categorization of techniques on device-free human
activity recognition. J Netw Comput Appl 167:102738
72. Demrozi F, Pravadelli G, Bihorac A, Rashidi P (2020) Human activity recognition using inertial, physi-
ological and environmental sensors: a comprehensive survey. IEEE access 8:210816–210836
73. Ehatisham-Ul-Haq M (2019) Robust human activity recognition using multimodal feature-level fusion.
7:60736–60751
74. Beddiar DR, Nini B, Sabokrou M, Hadid A (2020) Vision-based human activity recognition: a survey.
Multim Tools Appl 79:30509–30555
75. Ward JA, Lukowicz P, Gellersen HW (2011) Performance metrics for activity recognition. ACM Trans
Intell Syst Technol (TIST) 2:1–23
76. Gao X, Luo H, Wang Q, Zhao F, Ye L, Zhang Y (2019) A human activity recognition algorithm based on
stacking denoising autoencoder and lightGBM. Sensors 19:947
77. Qin Z, Zhang Y, Meng S, Qin Z, Choo K-KR (2020) Imaging and fusing time series for wearable sensor-
based human activity recognition. Inf Fusion 53:80–87
78. Xu M, Zuo L, Iyengar S, Goldfain A, DelloStritto J (2011) A semi-supervised hidden Markov model-
based activity monitoring system. In: 2011 Annual International Conference of the IEEE Engineering in
Medicine and Biology Society. pp 1794–1797
79. Khan MA, Javed K, Khan SA, Saba T, Habib U, Khan JA, Abbasi A (2020) Human action recognition
using fusion of multiview and deep features: an application to video surveillance. Multim Tools Appl
1–27
80. Zhang C, Lu Y, Feng M, Wu M (2019) Trucker behavior security surveillance based on human parsing.
IEEE Access 7:97526–97535
81. Manaf A, Singh S (2021) A novel hybridization model for human activity recognition using stacked par-
allel LSTMs with 2D-CNN for feature extraction. In: 2021 12th International Conference on Computing
Communication and Networking Technologies (ICCCNT). 2021; pp 1–7
82. Lu J, Yan WQ, Nguyen M (2018) Human behaviour recognition using deep learning. In: 2018 15th IEEE
International Conference on Advanced Video and Signal Based Surveillance (AVSS). pp 1–6
83. Qi W, Wang N, Su H, Aliverti A (2022) DCNN based human activity recognition framework with depth
vision guiding. Neurocomputing 486:261–271
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.