Human Activity Prediction Using Deep Learning
JAIN (Deemed-to-be University)
Faculty of Engineering & Technology
Global Campus, Jakkasandra Post, Kanakapura Taluk, Ramanagara District, Pin Code: 562 112
2022-2023
A Project Report on
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
Guruprasad
19BTRCS024
Manesh Suhas
19BTRCS040
Hrithik
19BTRCS028
Under the guidance of Dr. Vanitha K
Professor/Associate/Assistant Professor
Department of Computer Science & Engineering
CERTIFICATE
This is to certify that the project work titled “Human Activity Prediction Using Deep
Learning” is carried out by Guruprasad PD (19BTRCS024), Manesh Suhas S M
(19BTRCS040), Hrithik Krishna (19BTRCS028), bonafide students of Bachelor of
Technology at the Faculty of Engineering & Technology, Jain (Deemed-to-be) University,
Bangalore in partial fulfillment for the award of degree in Bachelor of Technology in Computer
Science & Engineering, during the year 2022-2023.
1.
2.
DECLARATION
Signature
Name 1: Guruprasad P D
USN: 19BTRCS024
Name 2: Manesh Suhas S M
USN: 19BTRCS040
Name 3: Hrithik Krishna
USN: 19BTRCS028
Place: Bangalore
ACKNOWLEDGEMENT
We would like to thank our Project Coordinators Dr. Chandrasekhar V and Dr.
Rajat and all the staff members of Computer Science and Engineering for their support.
We are also grateful to our family and friends who provided us with every
requirement throughout the course.
We would like to thank one and all who directly or indirectly helped us in
completing the Project work successfully.
Signature of Students
ABSTRACT
The problem of action prediction has recently gained traction due to its many possible uses in real-life
applications. Predicting future actions is important for computer vision systems such as monitoring or
autonomous systems that must act promptly on very little input data. For this project we propose a method
to predict actions from video sequences using an LSTM architecture that leverages features extracted from
the video data to predict actions according to a predefined set of classes. The system allows one to input a
short video sequence and obtain the likely outcome of the activity it contains. Several applications of such
a system are discussed in this report.
Page No
Abstract
List of Figures i
List of Tables ii
Nomenclature used iii
Chapter 1
1.1 Introduction 11
1.2 Literature Survey 12
1.3 Existing System and Disadvantages 16
1.4 Proposed System and Advantages 17
1.5 Objectives of the Current Work 20
1.6 Limitations of current work 21
Chapter 2
2.1 System Architecture 22
2.2 Methodology 24
2.3 Hardware and Software requirements 25
Chapter 3
3.1 HARDWARE AND SOFTWARE TOOL DESCRIPTION 26
Chapter 4
4.1 Hardware Design and Implementation 28
4.1.1 Use Case diagram 28
4.1.2 Data flow diagram 29
4.1.3 Sequence diagram 30
References v
Appendices
Appendix- I vi
Appendix- II vii
Details of Paper Publication xiii
Information regarding students xiv
Chapter 1
1.1. Introduction
Human activity prediction is the problem of predicting the action of a person by observing only
the first few frames of a video sample. The main goal of human activity prediction is to enable
early recognition of activities instead of recognizing them only after completion. Early prediction
can help detect unauthorized or suspicious activities, for example in prisons, so that measures can
be taken beforehand. Human activity prediction can be applied in many real-life situations: in
autonomous cars to decide which action to take and prevent accidents, in prisons, in hospitals, to
predict the behaviour of a driver and detect drowsiness, and in various other monitored areas.
Although great progress has been made in action recognition, activity prediction (or anticipation)
is still an active and increasingly popular research topic. The main difference between predicting
and recognizing an action is that in the former the class must be predicted early, from just a few
frames or a small part of the video sequence. In this project we propose a deep learning system
that predicts an action upon seeing just a small clip of a video, so that it can anticipate actions
that will occur in the future from real-time footage. We have used a Long Short-Term Memory
(LSTM) model to carry out the required function.
1.2 Literature Survey
The first surveyed work [1] proposes a multi-level model that describes the behaviour of users on the basis of
actions and activities. The authors use the Long Short-Term Memory (LSTM) architecture to predict the actions
of crowds for smarter and more intelligent cities. The model they created was tested on an extensive activity
recognition dataset.
Singh, Pulkit, et al. [2] in "End-to-end deep prototype and exemplar models for predicting human behavior"
aim to extend classic prototype and exemplar models of category learning so that they learn representations
of the given data from raw input. For the prediction model they use the CIFAR-10 dataset, on which two CNN
architectures, ResNet and All-CNN, are applied, and the resulting models are then evaluated. Both models
proposed in the paper perform better than the neural network baselines.
Battleday, Ruairidh M., Joshua C. Peterson, and Thomas L. Griffiths [3] in "Capturing human categorization
of natural images by combining deep networks and cognitive models" aim to model human categorization over a
very large dataset containing over 500,000 data points over 10,000 natural images. These data representations
are crucial to capture the categorization and allow simpler models that represent certain categories to
perform better than more complex memory-based models. They use the Mahalanobis distance to compute vector
distances subject to certain constraints and to unify different models. In the results, around 34% of the
images used were categorized perfectly.
Lin, Kaixiang, et al. [4] in "Efficient large-scale fleet management via multi-agent deep reinforcement
learning" consider the problem of managing a large number of available vehicles for ride-sharing platforms.
The main objective is to maximize the gross merchandise volume of the given platform by repositioning the
available vehicles to locations with larger gaps between demand and supply than their present ones.
The paper proposes a novel deep reinforcement learning method to learn an efficient fleet-management policy
that allocates the vehicles so as to maximize utilization based on demand and supply.
In the results, the model cA2C achieves the highest performance with a smaller number of repositions, around
65.37%, when compared with the model cA2C-v1; additionally, CCE achieves many useful improvements.
Cai, Haoye, et al. [5] in "Deep video generation, prediction and completion of human action sequences"
propose a general, two-stage deep framework for generating, predicting and completing videos of human actions
with high quality. They treat video generation, prediction and completion as a single problem instead of
addressing them as three separate problems, as was done before.
The model works in two stages and generates motion sequences using the skeletal structure of humans from
random noise: in the first stage an improved WGAN is applied, and in the second stage a normal GAN.
The final model can generate believable human motion videos of very high quality, and it has the highest
inception score amongst all competing models when evaluated on the Human3.6M dataset.
Carreira, Joao, and Andrew Zisserman [6] in "Quo vadis, action recognition? A new model and the Kinetics
dataset" investigate whether a network trained to classify actions on a larger dataset gives better performance
when it is then applied to a different temporal dataset that has fewer data points. They introduce a new
Two-Stream Inflated 3D ConvNet based on 2D ConvNet inflation. After the two streams are trained, they give
similar performance individually, but averaging both predictions improves the results from 74.6% up to 80.2%.
Sadegh Aliakbarian, Mohammad, et al. [7] in "Encouraging LSTMs to anticipate actions very early" propose a
method for anticipating or predicting an action that gives a high overall prediction accuracy even when only
a small part of the video sequence is available. The proposed model has a multi-stage LSTM architecture that
takes different features into account and encourages the model to predict the class
as fast as it can. The model performs better than the state-of-the-art method in early prediction, by a margin
of 22.0% on JHMDB-21, 49.9% on UCF-101 and 14.0% on UT-Interaction.
Muhammad Sajjad et al. [8] explore the concept of human etiquette analysis through facial recognition. The
input data consists of video clips from famous English TV series. Understanding human behaviour can prove
very helpful in many areas such as entertainment and healthcare. The main steps in the proposed method are
detection and tracking of facial features, face registration, and facial expression recognition. After an
algorithm identifies the faces in the video data, a Support Vector Machine (SVM) is used for recognizing
them. The facial expression is then detected using the CNN model proposed in the paper.
The KDEF dataset is used here, which consists of around 4,000 different facial expressions. The proposed
CNN model achieved an accuracy of 82%, and after applying data augmentation, an accuracy of 94%.
A subjective evaluation of the model was carried out to examine the performance of the method.
Neziha Jaouedi et al. [9] explore the different applications of human activity recognition, including video
surveillance, prisons and human-computer interaction. Because of the increasing use of deep neural networks,
the authors propose a technique using gated recurrent neural networks. Gated RNNs are chosen for their high
computational power; here they are used for sequential data and video frame classification.
The features of the dataset play an important role, so the best features need to be selected using feature
extraction. This is useful with a huge dataset to reduce noise without losing important data. Since feature
selection impacts the performance of the deep learning model, sufficient time must be spent on selecting the
best features from the dataset.
The GMM method is used to track objects in each frame of a video sequence, and the Kalman filter is used to
predict the location of an object; in other words, both methods together track the movement of objects. A
gated RNN is then used to classify the action. The approach is evaluated on a few popular datasets, UCF-101,
the UCF Sports dataset and the KTH human activity dataset, each of which contains a variety of activities.
The proposed technique can be used in many different applications.
The authors implement the human activity recognition approach in four steps:
1. GMM and Kalman filter (KF) methods are first used to track the motion in the input video.
2. The K-nearest neighbours algorithm is then used along with the GMM and KF techniques for human detection.
3. A gated RNN is used for video data recognition, which achieved the highest accuracy.
4. A test video is used to check that human activity recognition works properly.
1.3 Existing System and Disadvantages
In the existing system, a deep neural network based on a recurrent architecture is used. In this
behaviour-modelling problem, the class label of an activity is predicted from the actions registered
earlier by a sensor. The recurrent nature of the LSTM allows the problem to be modelled while taking
certain sequential dependencies into account.
As input, the system takes raw data from the sensors and compares and maps it to previously defined
actions using certain conditions and equivalencies. The identified actions are then passed to the
embedding layer. This layer takes the IDs of the actions and transforms them into embeddings which
also carry some semantic meaning. The layer is made trainable, i.e. able to learn incrementally during
training, and its weights are initialized with values obtained from the Word2Vec algorithm.
The action embeddings obtained in the first module are then processed by the sequence-modelling part
of the algorithm. Finally, after the LSTM layer, the prediction module uses the sequence models built
by the LSTMs to predict the observed action.
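To make the described pipeline concrete, the following is a minimal sketch, assuming hypothetical sizes for the action vocabulary, embedding dimension, sequence length and number of activity classes, of how such an embedding-plus-LSTM classifier could be assembled in Keras. It illustrates the idea only and is not the exact network used in the cited system.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Hypothetical sizes: 50 distinct sensor-derived action IDs, 32-dimensional embeddings,
# sequences of 20 past actions, and 10 activity classes to predict.
NUM_ACTIONS, EMBED_DIM, SEQ_LEN, NUM_CLASSES = 50, 32, 20, 10

# Stand-in for Word2Vec-style pretrained vectors; in the described system these
# would come from running Word2Vec over recorded action sequences.
pretrained_vectors = np.random.normal(size=(NUM_ACTIONS, EMBED_DIM))

model = Sequential([
    # Trainable embedding layer initialized with the pretrained action vectors.
    Embedding(input_dim=NUM_ACTIONS, output_dim=EMBED_DIM,
              weights=[pretrained_vectors], input_length=SEQ_LEN, trainable=True),
    # The LSTM models the sequential dependencies between past actions.
    LSTM(64),
    # Softmax over the predefined activity classes.
    Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()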
Advantages:
b. Non-intrusive Data Collection: Deep learning models can make predictions based on various
data sources, such as wearable devices, video recordings, or motion sensors. This allows for non-
intrusive data collection, reducing the need for invasive or discomforting measurement methods.
c. Real-time Predictions: Once trained, deep learning models can make predictions in real-time,
enabling immediate responses or interventions based on the predicted activities. This is
particularly beneficial in applications such as healthcare monitoring, where prompt actions can be
critical.
d. Adaptability and Generalization: Deep learning models can adapt to new activities or scenarios
without requiring significant changes to the underlying architecture. They have the potential to
generalize well across different individuals, environments, and variations in activity patterns.
e. Scalability: Deep learning models can handle large-scale datasets, making them suitable for
analyzing extensive collections of human activity data. This scalability enables the development
of robust models that can handle diverse activity prediction tasks.
f. Automation and Efficiency: Human activity prediction using deep learning can automate the
process of activity recognition, reducing the manual effort required for analyzing and labeling
large amounts of data. This automation can lead to increased efficiency and productivity in
various domains.
1.5 Objectives of the Current Work
• To effectively use deep learning models to predict actions in highly sensitive and well-monitored areas.
1.6 Limitations of the Current Work
Interpretability and Explainability: Deep learning models are often considered black boxes because of
their complex architectures and their inability to provide detailed explanations for their predictions. This
lack of interpretability can limit the understanding of why a particular prediction was made, which is
crucial in applications where human interpretability and trust are required, such as healthcare or legal
domains.
Overfitting and Generalization: Deep learning models can be prone to overfitting, where they
memorize the training data but fail to generalize well to unseen data. This is particularly challenging
when dealing with human activities, as there can be significant variations and individual differences in
how activities are performed. Ensuring the generalization ability of the models beyond the training data
is a persistent challenge.
Variability and Complexity: Human activities can exhibit high variability and complexity, making it
difficult to capture all the possible variations in a single model. Different individuals may perform the
same activity differently, and environmental factors can also influence activity patterns. Designing deep
learning models that can effectively handle such variability and complexity is a non-trivial task.
Chapter 2
2.1 System Architecture
In the above system architecture, each frame of the input video is passed through the convolutional
layers of the model, then through the LSTM layers and the output layer, where the frame sequence is
processed to predict the final activity.
0-Level DFD
DFD Level 0 is also called a Context Diagram. It is a basic overview of the whole system or process being
analyzed or modelled. It shows the system as a single high-level process, with its relationships to
external entities.
1-Level DFD
2-Level DFD
A 2-level DFD goes one step deeper into the parts of the 1-level DFD.
2.2 Methodology
In the proposed system we take a video input and pass it through a multi-stage LSTM architecture.
The first stage consists of a convolutional neural network that extracts features from each frame.
The features are both context based and action based, so that the prediction draws on both for the
highest accuracy. Features can be extracted because the early layers of the deep neural network are
convolutional layers capable of extracting the base-level features that make up each frame.
Following the CNN layers, the video is passed through the LSTM architecture to learn the temporal
behaviour of the features. The model, having been trained on entire sequences of data and having
learned the base features of the different action classes, is able to predict the future action in a
video from only its first few frames.
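As an illustration of this design, the following is a minimal sketch in Keras of a CNN-plus-LSTM classifier of the kind described above. The layer sizes, sequence length, frame dimensions and four-class output are assumptions chosen only for the example; the full model used in the project appears in the appendix.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TimeDistributed, Conv2D, MaxPooling2D, Flatten, LSTM, Dense

# Assumed example values for sequence length, frame size and number of action classes.
SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CLASSES = 20, 64, 64, 4

model = Sequential([
    # Convolutional layers applied to every frame extract per-frame spatial features.
    TimeDistributed(Conv2D(16, (3, 3), padding='same', activation='relu'),
                    input_shape=(SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Flatten()),
    # The LSTM learns how the per-frame features evolve over the sequence.
    LSTM(32),
    # Softmax over the predefined action classes.
    Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])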
2.3 Hardware and Software Requirements
Hardware Requirements:
As the application is an internet-based one, any hardware needed to connect to the internet acts as a
hardware interface for the system, for example a modem, WAN/LAN, or Ethernet cross-cable.
16 GB of RAM or higher (8 GB is workable, but higher performance cannot be achieved).
Software Requirements:
Since this is software, it has to run on suitable hardware and an operating system; the requirements to
run it are listed below:
Memory:
Adequate RAM (Random Access Memory) is crucial for storing and manipulating large volumes
of data during preprocessing and model training. The required memory capacity depends on the
size of the dataset and the complexity of the LSTM models. Higher RAM capacity allows for
faster processing and larger batch sizes during training.
Storage:
Deep learning projects often involve working with large datasets, so having ample storage capacity
is important. High-speed storage options, such as solid-state drives (SSDs), are beneficial for fast
data access and training speed. Additionally, network-attached storage (NAS) or cloud storage
solutions can be utilized for efficient data storage and retrieval.
Software Tools:
Python:
Python is a widely used programming language for deep learning due to its simplicity, versatility,
and extensive libraries for scientific computing. Python, along with its associated packages like
NumPy, Pandas, and Scikit-learn, provides a rich ecosystem for data preprocessing, feature
extraction, and model evaluation. It also offers seamless integration with deep learning
frameworks and enables efficient prototyping and experimentation.
CUDA:
CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by
NVIDIA. It allows developers to harness the computational power of NVIDIA GPUs for deep
learning tasks. Deep learning frameworks often provide CUDA integration, enabling efficient
GPU acceleration during model training. CUDA significantly speeds up LSTM computations and
improves training time.
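As a quick sanity check that CUDA acceleration is actually being used, one can ask TensorFlow which GPUs it can see. The short snippet below is only a diagnostic aid and assumes TensorFlow is installed with GPU support.

import tensorflow as tf

# A non-empty list means TensorFlow can dispatch training to a CUDA-capable GPU.
gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)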
Jupyter Notebooks:
Jupyter Notebooks provide an interactive development environment for data exploration, model
prototyping, and experimentation. They allow for code execution, visualizations, and
documentation in a single interface. Jupyter Notebooks facilitate an iterative development process,
making it convenient to experiment with LSTM models, visualize results, and share code and
insights with others.
Chapter 4
4.1 Hardware Design and Implementation
4.1.1 Use Case diagram
An activity diagram describes the behaviour of the system, i.e. the control flow from start to finish.
Here the control begins with raw data being input to the system, which is data obtained from the dataset
chosen for our purpose of human activity prediction. This data showing human actions is then pre-processed
into a standard form before being passed to our system, after which it is split into training and testing
data, used respectively for training the model and testing its output.
The training data is fed to the LSTM model so that it learns the features of the data. After it has been
trained, the LSTM model is evaluated and corrected using the testing data. Once the model has been
finalized, it is deployed and can then be used by users to make predictions on data input to it in video
form.
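The flow described above (pre-process, split, train, evaluate) can be summarised in the short sketch below. The array shapes, class count and the 75/25 split are assumptions chosen for illustration, and `model` stands for a compiled CNN+LSTM network such as the one sketched in the methodology section above.

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical pre-processed dataset: 100 clips of 20 frames, each 64x64 RGB, over 4 classes.
features = np.random.rand(100, 20, 64, 64, 3).astype('float32')
labels = np.eye(4)[np.random.randint(0, 4, size=100)]   # one-hot encoded class labels

# Split into training and testing data, shuffling to avoid ordering bias.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, shuffle=True, random_state=27)

# Train on the training split and evaluate on the held-out test split.
# model.fit(X_train, y_train, epochs=30, batch_size=4, validation_split=0.2)
# loss, accuracy = model.evaluate(X_test, y_test)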
For the class diagram we have taken four main classes. The database holds the input to the system, which in
our case is a video input consisting of several image frames per video sequence, along with the number of
videos passed to the model as training and testing data.
The system represents the deep learning system created for this project. It takes the data passed from the
database and splits it into training and testing data, which is passed to the LSTM model consisting of
convolutional neural network layers followed by further neural network layers that finally make a
prediction. The system then shows this prediction as output: the class that a data point is a part of.
This is the system built to predict the human activity class that a particular action is expected to
belong to.
Chapter 5
5.1 Testing
◦ Testing techniques are the practices used by the testing team to assess the developed software against
the given requirements. That is why we have employed testing in our software as well.
◦ We have used unit testing to test the validity of our developed software.
◦ Unit testing is one of the many stages of software testing and looks at single units, otherwise known
as components, individually. It validates that each component of the software being tested works as it
is designed to.
◦ Unit testing is done during the coding phase, while the software or other product is being developed,
to make sure it is clear of bugs before its release.
◦ By using unit testing in our project we could see which units worked and which failed, and we rectified
the problems in the source code until the unit tests gave positive results, rendering our code bug-free.
Given below is the function for testing the compilation of the model.
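The original listing is not reproduced in this excerpt, so the following is a minimal sketch, using Python's `unittest` and a small stand-in Keras model, of what a compilation test of this kind could look like; the layer sizes and the test class name are assumptions made for illustration.

import unittest
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

class TestModelCompilation(unittest.TestCase):
    def test_model_compiles(self):
        # Build a small stand-in model and verify that compilation attaches an optimizer
        # and that the output layer has the expected shape.
        model = Sequential([LSTM(8, input_shape=(20, 16)),
                            Dense(4, activation='softmax')])
        model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
        self.assertIsNotNone(model.optimizer)
        self.assertEqual(model.output_shape, (None, 4))

if __name__ == '__main__':
    unittest.main()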
However, multifaceted surveillance cameras also raise concerns around privacy and data
protection. The cameras capture a lot of data, and it can be challenging to control who
has access to that data and how it is used. As a result, it is important to carefully consider
the use of multifaceted surveillance cameras and implement appropriate safeguards to
protect individuals' privacy.
In our project, we have presented an approach for human activity prediction using deep learning
algorithms. We collect raw video samples, observe only a part of each video clip, and use deep learning
models to predict the likely action. The main aim of this project was to enable early recognition of
activities instead of detecting them only after completion. We have used an LSTM model for human activity
prediction. It considers the different extracted features to make the prediction and can be used in
several different computer vision applications. This early detection of activities can be useful in
applications such as autonomous vehicles, medical care, surveillance systems and smart homes. Another
objective was to find the right model for this approach by comparing it with the state-of-the-art models.
1. There is a growing need in elderly care (both bodily and mental); future applications of human activity
prediction ought to help prevent harm, e.g. by detecting risky situations for older people. An architecture
on the smartphone can be developed for fall detection. Activity prediction and recognition sensors could
assist elders proactively, including lifestyle routine reminders (e.g. taking medication) and living-activity
tracking for remote robotic assistance.
2. Children's care is another area that could benefit from activity prediction studies and future
improvement. Applications could include tracking infants' napping status and predicting their needs for
food or other items.
3. Activity prediction techniques can also be utilized in detecting autism spectrum disorder (ASD) in
children.
[11] Zhou, Xiaokang, et al. "Deep-learning-enhanced human activity recognition for Internet of healthcare things." IEEE Internet of Things Journal 7.7 (2020): 6429-6438.
[12] Chen, Kaixuan, et al. "Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges, and Opportunities." ACM Computing Surveys (CSUR) 54.4 (2021): 1-40.
[13] Luvizon, Diogo, David Picard, and Hedi Tabia. "Multi-task deep learning for real-time 3D human pose estimation and action recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[14] Liciotti, Daniele, et al. "A sequential deep learning application for recognising human activities in smart homes." Neurocomputing 396 (2020): 501-513.
[15] Henry Friday Nweke, Ying Wah Teh, et al. "Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges."
[16] Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, Shonali Krishnaswamy. "Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition."
[17] Deepika Singh, Erinc Merdivan, Ismini Psychoula, Johannes Kropf, Sten Hanke, Matthieu Geist, and Andreas Holzinger. "Human Activity Recognition Using Recurrent Neural Networks."
[18] Ming Zeng, Tong Yu, Xiao Wang, Le T. Nguyen, Ole J. Mengshoel, Ian Lane. "Semi-Supervised Convolutional Neural Networks for Human Activity Recognition."
[19] Shreyank N Gowda. "Human activity recognition using combinatorial Deep Belief Networks."
[20] Julieta Martinez, Michael J. Black, and Javier Romero. "On human motion prediction using recurrent neural networks."
Using facial landmarks to detect
Guruprasad P D
USN: 19BTRCS024
Email: guruprasad1pounarkar@gmail.com
Manesh Suhas S M
USN: 19BTRCS040
Email: manishsuhas098@gmail.com
Hrithik Krishna
USN: 19BTRCS028
Email: 19btrcs028@jainuniversity.ac.in
Dr. Vanitha K (Guide)
Email: k.vanitha@jainuniversity.ac.in
APPENDIX -
SOURCE CODE
import os
import cv2
import random
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.callbacks import EarlyStopping
"""And will set `Numpy`, `Python`, and `Tensorflow` seeds to get consistent results on every execution."""
seed_constant = 27
np.random.seed(seed_constant)
random.seed(seed_constant)
tf.random.set_seed(seed_constant)
In the first step, we will visualize the data along with labels to get an idea of what we will be dealing with. We will be using the
[UCF50 - Action Recognition Dataset](https://www.crcv.ucf.edu/data/UCF50.php), consisting of realistic videos taken from YouTube,
which differentiates this dataset from most other available action recognition datasets, which are not realistic and are staged
by actors. The dataset contains:
* *`133`* Average Videos per Action Category
For visualization, we will pick `20` random categories from the dataset and a random video from each selected category and will
visualize the first frame of the selected videos with their associated labels written. This way we’ll be able to visualize a subset ( `20`
random videos ) of the dataset.
"""
"""For Visualization, we wil pick 20 random categories from the Dataset and a random video from each selected category and
will visualize the first frame of the selected videos with their associated labels written. This way we'll be able to visualize a subset
(20 random videos) of the dataset.
"""
# Retrieve the list of all the video files present in the randomly selected Class Directory.
video_files_names_list = os.listdir(f'/content/drive/MyDrive/Colab Notebooks/UCF50/{selected_class_Name}')
# Randomly select a video file from the list retrieved from the randomly selected Class Directory.
selected_video_file_name = random.choice(video_files_names_list)
Next, we will perform some preprocessing on the dataset. First, we will read the video files from the dataset and resize the frames
of the videos to a fixed width and height to reduce the computations, and normalize the data to the range `[0-1]` by dividing the
pixel values by `255`, which makes convergence faster while training the network.
# Specify the height and width to which each video frame will be resized in our dataset.
IMAGE_HEIGHT , IMAGE_WIDTH = 64, 64
# Specify the number of frames of a video that will be fed to the model as one sequence.
SEQUENCE_LENGTH = 20
# Specify the list containing the names of the classes used for training. Feel free to choose any set of classes.
CLASSES_LIST = ["WalkingWithDog", "TaiChi", "Swing", "HorseRace"]
"""*Note:* The *`IMAGE_HEIGHT`*, *`IMAGE_WIDTH`* and *`SEQUENCE_LENGTH`* constants can be increased for better
results, although increasing the sequence length is only effective to a certain point, and increasing the values will result in the process
being more computationally expensive.
We will create a function *`frames_extraction()`* that will create a list containing the resized and normalized frames of a video whose
path is passed to it as an argument. The function will read the video file frame by frame, although not all frames are added to the list as
we will only need an evenly distributed sequence length of frames.
"""
def frames_extraction(video_path):
    '''
    This function will extract the required frames from a video after resizing and normalizing them.
    Args:
        video_path: The path of the video on the disk, whose frames are to be extracted.
    Returns:
        frames_list: A list containing the resized and normalized frames of the video.
    '''
    frames_list = []
    video_reader = cv2.VideoCapture(video_path)
    video_frames_count = int(video_reader.get(cv2.CAP_PROP_FRAME_COUNT))
    # Calculate the interval after which frames will be added to the list.
    skip_frames_window = max(int(video_frames_count / SEQUENCE_LENGTH), 1)
    for frame_counter in range(SEQUENCE_LENGTH):
        # Jump to the next evenly spaced frame position and read the frame.
        video_reader.set(cv2.CAP_PROP_POS_FRAMES, frame_counter * skip_frames_window)
        success, frame = video_reader.read()
        if not success:
            break
        resized_frame = cv2.resize(frame, (IMAGE_HEIGHT, IMAGE_WIDTH))
        # Normalize the resized frame by dividing it with 255 so that each pixel value then lies between 0 and 1.
        normalized_frame = resized_frame / 255
        frames_list.append(normalized_frame)
    video_reader.release()
    return frames_list
Now we will create a function *`create_dataset()`* that will iterate through all the classes specified in the
*`CLASSES_LIST`* constant, call the function *`frames_extraction()`* on every video file of the selected classes, and
return the frames (*`features`*), class indexes (*`labels`*), and video file paths (*`video_files_paths`*).
"""
def create_dataset():
    '''
    This function will extract the data of the selected classes and create the required dataset.
    Returns:
        features: A list containing the extracted frames of the videos.
        labels: A list containing the indexes of the classes associated with the videos.
        video_files_paths: A list containing the paths of the videos on the disk.
    '''
# Declared Empty Lists to store the features, labels and video file path values.
features = []
labels = []
video_files_paths = []
# Get the list of video files present in the specific class name directory.
files_list = os.listdir(os.path.join(DATASET_DIR, class_name))
# Check if the extracted frames are equal to the SEQUENCE_LENGTH specified above,
# and ignore the videos having fewer frames than the SEQUENCE_LENGTH.
if len(frames) == SEQUENCE_LENGTH:
    # Append the data to their respective lists.
    features.append(frames)
    labels.append(class_index)
    video_files_paths.append(video_file_path)
"""Now we will utilize the function *`create_dataset()`* created above to extract the data of the selected classes and create
the required dataset."""
"""Now we will convert `labels` (class indexes) into one-hot encoded vectors."""
"""## *<font style="color:rgb(134,19,348)">Step 3: Split the Data into Train and Test Set</font>*
As of now, we have the required *`features`* (a NumPy array containing all the extracted frames of the videos) and
*`one_hot_encoded_labels`* (also a Numpy array containing all class labels in one hot encoded format). So now, we will split our
data to create training and testing sets. We will also shuffle the dataset before the split to avoid any bias and get splits representing
the overall distribution of the data.
"""
# Split the data into train (75%) and test (25%) sets.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, one_hot_encoded_labels, test_size=0.25, shuffle=True, random_state=seed_constant)
In this step, we will implement the first approach by using a combination of ConvLSTM cells. A ConvLSTM cell is a variant of an
LSTM network that contains convolution operations in the network. It is an LSTM with convolution embedded in the architecture,
which makes it capable of identifying spatial features of the data while taking the temporal relation into account.
<center>
<img src="https://drive.google.com/uc?export=view&id=1KHN_JFWJoJi1xQj_bRdxy2QgevGOH1qP" width= 500px>
</center>
For video classification, this approach effectively captures the spatial relation in the individual frames and the temporal relation
across the different frames. As a result of this convolution structure, the ConvLSTM is capable of taking in 3-dimensional input
`(width, height, num_of_channels)`, whereas a simple LSTM only takes in 1-dimensional input; hence an LSTM on its own is unsuitable
for modeling spatio-temporal data.
You can read the paper [Convolutional LSTM Network: A Machine Learning Approach for Precipitation
Nowcasting](https://arxiv.org/abs/1506.04214v1) by Xingjian Shi (NIPS 2015), to learn more about this architecture.
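Before the project's own `create_convlstm_model()` function (shown only in part below), here is a minimal, self-contained sketch of a ConvLSTM classifier; the filter count, frame size, sequence length and four-class output are assumptions made for illustration rather than the configuration used in the project.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, MaxPooling3D, Flatten, Dense

SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CLASSES = 20, 64, 64, 4

model = Sequential([
    # ConvLSTM cells convolve over each frame while carrying state across time steps.
    ConvLSTM2D(filters=4, kernel_size=(3, 3), activation='tanh', return_sequences=True,
               input_shape=(SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)),
    MaxPooling3D(pool_size=(1, 2, 2), padding='same'),
    # Flatten the spatio-temporal feature maps before classification.
    Flatten(),
    Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])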
def create_convlstm_model():
    '''
    This function will construct the required convlstm model.
    Returns:
        model: It is the required constructed convlstm model.
    '''
    ########################################################################################################################
    model.add(Flatten())
    ########################################################################################################################
"""Now we will utilize the function *`create_convlstm_model()`* created above, to construct the required `convlstm` model."""
# Construct the required convlstm model.
convlstm_model = create_convlstm_model()
Now we will use the *`plot_model()`* function to check the structure of the constructed model; this is helpful while constructing
a complex network and making sure that the network is created correctly.
"""
checkpoint_path = '/content/convlstm_model Date_Time_2023_05_0813_13_24_Loss_0.42100247740745544 Accuracy_0.8524590134620667.h5'
convlstm_model.load_weights(checkpoint_path)
# Compile the model and specify loss function, optimizer and metrics values to the model
convlstm_model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam', metrics = ["accuracy"])
Next, we will add an early stopping callback to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting) and start the
training after compiling the model.
"""
checkpoint_path = "/content/drive/MyDrive/Colab
Notebooks/trained_model_/cp.ckpt" checkpoint_dir =
os.path.dirname(checkpoint_path)
# Compile the model and specify loss function, optimizer and metrics values to the model
convlstm_model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam', metrics = ["accuracy"])
os.listdir(checkpoint_dir)
After training, we will evaluate the model on the test set.
"""
# Define a useful name for our model to make it easy for us while navigating through multiple saved models.
model_file_name = f'convlstm_model Date_Time{current_date_time_string} Loss{model_evaluation_loss} Accuracy{model_evaluation_accuracy}.h5'
"""### *<font style="color:rgb(134,19,348)">Step 4.3: Plot Model’s Loss & Accuracy Curves</font>*
Now we will create a function *`plot_metric()`* to visualize the training and validation metrics. We already have separate
metrics from our training and validation steps so now we just have to visualize them.
"""
# Construct a range object which will be used as the x-axis (horizontal plane) of the graph.
epochs = range(len(metric_value_1))
"""Now we will utilize the function *`plot_metric()`* created above, to visualize and understand the
In this step, we will implement the LRCN Approach by combining Convolution and LSTM layers in a single model. Another
similar approach can be to use a CNN model and LSTM model trained separately. The CNN model can be used to extract spatial
features from the frames in the video, and for this purpose, a pre-trained model can be used, that can be fine-tuned for the problem.
And the LSTM model can then use the features extracted by CNN, to predict the action being performed in the video.
But here, we will implement another approach known as the Long-term Recurrent Convolutional Network (LRCN), which
combines CNN and LSTM layers in a single model. The Convolutional layers are used for spatial feature extraction from the
frames, and the extracted spatial features are fed to LSTM layer(s) at each time-step for temporal sequence modeling. This way the
network learns spatiotemporal features directly in an end-to-end training, resulting in a robust model.
<center>
<img src='https://drive.google.com/uc?export=download&id=1I-q5yLsIoNh2chfzT7JYvra17FsXvdme'>
</center>
You can read the paper [Long-term Recurrent Convolutional Networks for Visual Recognition and
Description](https://arxiv.org/abs/1411.4389?source=post_page--------------------------) by Jeff Donahue (CVPR 2015), to learn
more
about this architecture.
<center>
<img src='https://drive.google.com/uc?export=download&id=1CbauSm5XTY7ypHYBHH7rDSnJ5LO9CUWX' width=400>
</center>
To implement our LRCN architecture, we will use time-distributed *`Conv2D`* layers which will be followed by *`MaxPooling2D`*
and *`Dropout`* layers. The feature extracted from the *`Conv2D`* layers will be then flattened using the *`Flatten`* layer and will
be fed to a *`LSTM`* layer. The *`Dense`* layer with softmax activation will then use the output from the *`LSTM`* layer to
predict the action being performed.
"""
def create_LRCN_model():
    '''
    This function will construct the required LRCN model.
    Returns:
        model: It is the required constructed LRCN model.
    '''
    ########################################################################################################################
    model.add(TimeDistributed(MaxPooling2D((4, 4))))
    model.add(TimeDistributed(Dropout(0.25)))
    model.add(TimeDistributed(Flatten()))
    model.add(LSTM(32))
    ########################################################################################################################
"""Now we will utilize the function *`create_LRCN_model()`* created above to construct the required `LRCN`
Now we will use the *`plot_model()`* function to check the structure of the constructed `LRCN` model. As we had checked for
the previous model.
"""
checkpoint_path = '/content/LRCN_model Date_Time_2023_05_0813_23_44_Loss_0.3155522644519806 Accuracy_0.8934426307678223.h5'
LRCN_model.load_weights(checkpoint_path)
After checking the structure, we will compile and start training the model.
"""
# Compile the model and specify loss function, optimizer and metrics to the model.
LRCN_model.compile(loss = 'categorical_crossentropy', optimizer = 'Adam', metrics = ["accuracy"])
As done for the previous one, we will evaluate the `LRCN` model on the test set.
After that, we will save the model for future use using the same technique we had used for the previous model.
"""
# Define a useful name for our model to make it easy for us while navigating through multiple saved models.
model_file_name = f'LRCN_model Date_Time{current_date_time_string} Loss{model_evaluation_loss} Accuracy{model_evaluation_accuracy}.h5'
"""### *<font style="color:rgb(134,19,348)">Step 5.3: Plot Model’s Loss & Accuracy Curves</font>*
Now we will utilize the function *`plot_metric()`* we had created above to visualize the training and validation metrics of this model.
"""
"""## *<font style="color:rgb(134,19,348)">Step 6: Test the Best Performing Model on YouTube videos</font>*
From the results, it seems that the LRCN model performed significantly well for a small number of classes. So in this step, we will
put the `LRCN` model to the test on some YouTube videos.
We will create a function *`download_video()`* to download the YouTube videos first, using the *`pytube`* library. The library
only requires the URL of a video to download it along with its associated metadata, such as the title of the video.
""" APPENDIX
!pip install pytube
from pytube import YouTube

def download_video(url, output_path):
    try:
        yt = YouTube(url)
        # Pick the highest-resolution stream and download it to the output path.
        video = yt.streams.get_highest_resolution()
        video.download(output_path)
        print("Video downloaded successfully!")
    except Exception as e:
        print("Error:", str(e))
Now we will utilize the function *`download_video()`* created above to download a YouTube video on which the `LRCN`
model will be tested.
#OG code
# Provide the URL of the YouTube video you want to download
video_url = "https://www.youtube.com/watch?v=8u0qjmHIOcE"
Next, we will create a function *`predict_on_video()`* that will simply read a video frame by frame from the path passed in as
an argument and will perform action recognition on video and save the results.
"""
# Initialize the VideoWriter Object to store the output video in the disk.
video_writer = cv2.VideoWriter(output_file_path, cv2.VideoWriter_fourcc('M', 'P', '4', 'V'),
video_reader.get(cv2.CAP_PROP_FPS), (original_video_width, original_video_height))
# Initialize a variable to store the predicted action being performed in the video.
predicted_class_name = ''
# Normalize the resized frame by dividing it with 255 so that each pixel value then lies between 0 and 1.
normalized_frame = resized_frame / 255
# Check if the number of frames in the queue is equal to the fixed sequence length.
if len(frames_queue) == SEQUENCE_LENGTH:
# Pass the normalized frames to the model and get the predicted probabilities.
predicted_labels_probabilities = LRCN_model.predict(np.expand_dims(frames_queue, axis = 0))[0]
Now we will utilize the function *`predict_on_video()`* created above to perform action recognition on the test video we had
downloaded using the function *`download_video()`*, and display the output video with the predicted action overlaid on it. """
# Construct the output video path.
output_video_file_path = f'{test_videos_directory}/{video_title}-Output-SeqLen{SEQUENCE_LENGTH}.mp4'
Now let's create a function that will perform a single prediction for the complete videos. We will extract evenly distributed *N*
*`(SEQUENCE_LENGTH)`* frames from the entire video and pass them to the `LRCN` model. This approach is really useful
when you are working with videos containing only one activity as it saves unnecessary computations and time in that scenario.
"""
def predict_single_action(video_file_path, SEQUENCE_LENGTH):
    '''
    This function will perform single action recognition prediction on a video using the LRCN model.
    Args:
        video_file_path: The path of the video stored on the disk on which the action recognition is to be performed.
        SEQUENCE_LENGTH: The fixed number of frames of a video that can be passed to the model as one sequence.
    '''
# Initialize a variable to store the predicted action being performed in the video.
predicted_class_name = ''
# Calculate the interval after which frames will be added to the list.
skip_frames_window = max(int(video_frames_count/SEQUENCE_LENGTH),1)
# Read a frame.
success, frame = video_reader.read()
# Normalize the resized frame by dividing it with 255 so that each pixel value then lies between 0 and 1.
normalized_frame = resized_frame / 255
# Append the pre-processed frame to the frames list.
frames_list.append(normalized_frame)
# Passing the pre-processed frames to the model and get the predicted probabilities.
predicted_labels_probabilities = LRCN_model.predict(np.expand_dims(frames_list, axis = 0))[0]
Now we will utilize the function *`predict_single_action()`* created above to perform a single prediction on a complete YouTube
test video that we will download using the function *`download_video()`* we created above.
"""
APPENDIX-III
DATASHEETS
The data set can be accessed from below link:
https://www.crcv.ucf.edu/data/UCF50.rar
Publication Details
Acceptance certificate