
INTEGRATED CURL COUNTING AND ACTION

RECOGNITION SYSTEM

MINI PROJECT REPORT

Submitted by

Nithya Shree.V (3122213002069)


Poornima Devi.M (3122213002077)
Sujatha Natarajan (3122213002105)

UEC2604 MACHINE LEARNING

Department of Electronics and Communication Engineering
Sri Sivasubramaniya Nadar College of Engineering
(An Autonomous Institution, Affiliated to Anna University)
Rajiv Gandhi Salai (OMR), Kalavakkam – 603 110
EVEN SEM 2023-2024


Sri Sivasubramaniya Nadar College of Engineering
(An Autonomous Institution, Affiliated to Anna University)

BONAFIDE CERTIFICATE

Certified that this mini project titled “INTEGRATED CURL COUNTING AND ACTION RECOGNITION SYSTEM” is the bonafide work of “Nithya Shree.V (3122213002069), Poornima Devi.M (3122213002077) and Sujatha Natarajan (3122213002105)” of VI Semester, Electronics and Communication Engineering branch, during Even Semester 2023 – 2024 for UEC2604 Machine Learning.

Submitted for examination held on

INTERNAL EXAMINER
ABSTRACT

Human action recognition in videos is an active area of research in


computer vision and pattern recognition. Nowadays, artificial
intelligence (AI) based systems are needed for human-behavior
assessment and security purposes. Existing action recognition techniques mainly use pretrained weights of different AI architectures for the visual representation of video frames in the training stage, which affects the determination of feature discrepancies, such as the distinction between visual and temporal signs. To address this issue,
we propose a bi-directional long short-term memory (BiLSTM) based
attention mechanism with a dilated convolutional neural network
(DCNN) that selectively focuses on effective features in the input frame
to recognize the different human actions in the videos. In this diverse
network, we use the DCNN layers to extract the salient discriminative
features by using the residual blocks to upgrade the features that keep
more information than a shallow layer. Furthermore, we feed these
features into a BiLSTM to learn the long-term dependencies, which is
followed by the attention mechanism to boost the performance and
extract additional high-level selective action-related patterns and cues.
We further use the center loss with Softmax to improve the loss function
that achieves a higher performance in the video-based action
classification. The proposed system is evaluated on three benchmarks.

TABLE OF CONTENTS

CHAPTER NO TITLE PAGE NO

1 Introduction 7
2 Literature Survey 8
3 Methodology 10
3.1 Introduction 10
3.2 ML Process Flow 11
3.3 Real Time Curl Counter 12
3.3.1 Making Detections 12
3.3.2 Calculation Of Angles 13
3.3.3 Exercise Detection Logic 14
3.3.4 Determining Joints 15
3.3.5 Counting Exercise Repetitions 16
3.4 Action Recognition 18
3.4.1 CNN Workflow 18
3.4.1.1 Pooling 20
3.4.2 The Neural Network 22
3.4.3 Convolution LSTM Workflow 23
3.4.4 Data Acquisition 25
3.4.5 Data Visualization 26
3.4.6 Main Working 28

4 Results 45
4.1 Multiple Activity Prediction 45
4.2 Single Prediction on a Test Video 47
5 Conclusion 48
6 References 49

LIST OF FIGURES

Figure no Content Page no.

3.2.a Flow diagram of the process of 11


Machine Learning
3.3.1.a Making Detections 12
3.3.2.a Calculation of Angles 13
3.3.4.a Determining Joints Using Mediapipe 15
3.3.5.a Curl Counting 16
3.3.5.b Curl Counting 16
3.4.1.a CNN Workflow 17
3.4.1.b How CNN Works 17
3.4.1.c Convolution Process 18
3.4.1.1.a Pooling Process 19
3.4.1.1.b Max Pooling 19
3.4.2.a CNN Architecture 20
3.4.3.a Convolutional LSTM Workflow 21
3.4.5.a Data Visualization 23
3.4.6.a ConvLSTM Working 24
3.4.6.b ConvLSTM Model Architecture 27
3.4.6.c Loss Curve 28
3.4.6.d Accuracy Curve 28
3.4.6.e LRCN Model Architecture 32
3.4.6.f & g Loss Curve and Accuracy Curve 37
4.a Multiple Action Recognition 38
4.b Single Prediction Results 39

SYMBOLS AND ABBREVIATIONS

Symbol Expansion
ML Machine Learning
CNN Convolutional Neural Network
LSTM Long Short-Term Memory
DCNN Dilated Convolutional Neural Network
BiLSTM Bi-directional Long Short-Term Memory
LRCN Long-term Recurrent Convolutional Network
HAR Human Activity Recognition

CHAPTER 1

INTRODUCTION

Physical activity is crucial for maintaining a healthy lifestyle. Video-based action recognition is an emerging and challenging area of research, particularly for identifying and recognizing actions in a video sequence from a surveillance stream. Action recognition in video has many applications, such as content-based video retrieval, surveillance systems for security and privacy, human-computer interaction, and activity recognition. Digital content is growing exponentially day by day, so effective AI-based intelligent Internet of Things (IoT) systems are needed for surveillance to monitor and identify human actions and activities. The aim of action recognition is to detect and identify people, their behavior, and suspicious activities in videos, and to deliver appropriate information to support interactive programs and IoT-based applications. Action recognition still poses many challenges for ensuring the security and safety of residents, in settings including industrial monitoring, violence detection, person identification, virtual reality, and cloud environments, due to camera movements, occlusions, complex backgrounds, and variations in illumination.

In addition to recognizing actions in video, this work integrates a real-time bicep curl counter: pose estimation is used to locate body joints, compute the angles between them, and count exercise repetitions, providing immediate feedback during workouts.

CHAPTER 2

LITERATURE SURVEY

This section presents previous work related to our proposed


method.

• Ajay L, Vidyadevi G Biradar, Chandu M, and Bharath JB developed an AI-powered fitness trainer utilizing human pose estimation to analyze exercise movements in real time, providing personalized feedback for more effective workouts and healthier lifestyles.

• Rutuja Mhaiskar and Preeti Verma, in “Performance Analysis of Human Activity”, aim to create an AI gym assistant using Jupyter Notebook and MediaPipe, leveraging pre-trained models for accurate pose estimation and hand tracking.

• Shaikh Mohd, in “Pushup Counting and Evaluating Based on Human Keypoint Detection”, explores AI and ML integration via the MediaPipe framework for real-time push-up counting and evaluation, achieving over 90% accuracy. While some detection errors remain, expanding the dataset and refining error classification are envisioned as future enhancements.

8
• Sejal Bhatia, in “Activity Identification with Machine Learning on Wearable Tracker”, shows that ML can identify activities from wearable-tracker data, with XGBoost giving the best results. Although wearable data differs from our camera-based setting, the work demonstrates that ML for activity detection is promising.

• Litao Guang, Jiancheng Zou, and Zibo Wen investigate real-time human heart rate and blood pressure detection during exercise via MediaPipe, integrating YOLOv8-pose for monitoring. By leveraging pose estimation and hand keypoints, the work introduces a non-contact system with potential applications in fitness tracking and health monitoring.

CHAPTER 3

METHODOLOGY

3.1 INTRODUCTION

A quick overview of our project entails the following:


● Implementation of Google MediaPipe's BlazePose model
for real-time human pose estimation
● Computer vision tools (i.e., OpenCV) for color
conversion, detecting cameras, detecting camera
properties, displaying images, and custom
graphics/visualization
● Inferred 3D joint angle computation according to
relative coordinates of surrounding body landmarks
● Guided training data generation
● Data preprocessing and callback methods for efficient deep
neural network training
● Customizable LSTM and Attention-Based LSTM models
● Real-time visualization of joint angles, rep counters, and
probability distribution of exercise classification
predictions

In this chapter, we discuss the basic methodology of how ML works, followed by the methodology and working of the CNN, the convolutional LSTM, and the rep counter.

3.2 ML PROCESS FLOW

Let us look at a breakdown of the ML process:


* Training data: This is the initial data that’s fed into the machine learning algorithm. The quality and quantity of this data heavily influence the outcome of the model.
* Train ML algorithm: This step involves training the algorithm on the
provided data. During training, the algorithm learns to identify patterns
and relationships within the data.
* Model Input Data: Once trained, the model can then be used to make
predictions on new data. This new data is fed into the model as input.
* ML Algorithm: The machine learning algorithm leverages the learned
patterns and relationships from the training data to analyze the new input
data.

Fig 3.2.a: Flow diagram of the process of Machine Learning

3.3 REAL-TIME CURL COUNTER

3.3.1 MAKING DETECTIONS:

We begin by detecting and tracking human keypoints using Mediapipe.


Both YOLO and Mediapipe offer pre-trained models optimized for
human pose detection. We have decided to use Mediapipe as it best fits
our model. Keypoints are marked and the angles between them are
calculated. Once the human keypoints are detected, we proceed to
identify specific key points relevant to different exercises. These key
points typically include joints like shoulders, elbows, hips, and knees.
We then calculate the angles between these key points to accurately assess the person's posture and form during exercises.
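A minimal sketch of how these detections might be set up with MediaPipe Pose and OpenCV is shown below; the webcam index and the 0.5 confidence thresholds are illustrative assumptions, not the exact values of our implementation.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # webcam feed (index assumed)
with mp_pose.Pose(min_detection_confidence=0.5,
                  min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV delivers BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # Draw the detected landmarks back onto the original frame.
        if results.pose_landmarks:
            mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                      mp_pose.POSE_CONNECTIONS)
        cv2.imshow('Mediapipe Feed', frame)
        if cv2.waitKey(10) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
```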

Fig 3.3.1.a Making Detections

3.3.2 CALCULATION OF ANGLES

Angle Computation: Custom functions are developed to calculate the


angles between specific body landmarks, notably the shoulder,
elbow, and wrist.
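As an illustration, the angle at a joint can be computed with `arctan2` from three landmark coordinates; the helper below is a minimal sketch (the function name and the example coordinates are ours, not taken from the report's code).

```python
import numpy as np

def calculate_angle(a, b, c):
    """Angle at point b (e.g. the elbow) formed by points a-b-c,
    where each point is an (x, y) landmark coordinate."""
    a, b, c = np.array(a), np.array(b), np.array(c)
    radians = (np.arctan2(c[1] - b[1], c[0] - b[0])
               - np.arctan2(a[1] - b[1], a[0] - b[0]))
    angle = np.abs(np.degrees(radians))
    if angle > 180.0:          # keep the angle in the 0-180 degree range
        angle = 360.0 - angle
    return angle

# Example: shoulder, elbow, wrist coordinates (normalized [0, 1] values)
print(calculate_angle((0.5, 0.3), (0.55, 0.5), (0.5, 0.7)))  # about 152 degrees
```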

Thresholding: Threshold values are defined to identify the initiation and


completion of a curl motion based on the angle measurements. For
instance, a curl might be considered initiated when the arm is below a
certain angle threshold and completed when it surpasses another
threshold.

Incremental Counting: A logic scheme is implemented to track the


number of curls completed by individuals in real-time. This involves
incrementing a counter each time the detected angle pattern suggests the
completion of a curl motion.
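A minimal sketch of such an incremental counting scheme is given below; the 160-degree and 30-degree thresholds and the variable names are illustrative assumptions.

```python
# Simple up/down state machine for curl counting.
counter = 0
stage = None  # "down" when the arm is extended, "up" when flexed

def update_counter(elbow_angle):
    """Increment the counter once per complete extend-then-flex cycle."""
    global counter, stage
    if elbow_angle > 160:                     # arm extended: start of a rep
        stage = "down"
    if elbow_angle < 30 and stage == "down":  # arm flexed after being extended
        stage = "up"
        counter += 1                          # one full curl completed
    return counter
```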

Fig 3.3.2.a Calculation of Angles

3.3.3 EXERCISE DETECTION LOGIC

We utilize the detected keypoints and angles to implement logic for


exercise detection. This logic should be adaptable to different types of
exercises, such as push-ups, squats, lunges, etc. We define thresholds or
rules specific to each exercise to classify and identify them accurately.

The process involves leveraging the detected keypoints and angles to


discern whether the observed movement aligns with the execution of a
curl. Key elements to consider include the positioning of the elbows and
shoulders relative to the torso, as well as the trajectory of the hands
during the exercise. By analyzing the angles formed between these key
points, specific criteria can be established to differentiate between a
bicep curl and other activities or poses. For instance, the angle at the
elbow joint could be a critical factor, with thresholds set to identify the
bending and extension phases of the curl movement. Additionally,
factors such as the range of motion, consistency in movement patterns,
and temporal sequence of actions can further refine the detection logic,
enhancing its accuracy and reliability.

This entails accounting for variations in body proportions, techniques,


and equipment used during the exercise. Incorporating machine learning
techniques or adaptive algorithms can enable the system to learn and
adjust its detection criteria based on observed data, allowing for real-
time adjustments to accommodate diverse scenarios.

3.3.4 DETERMINING JOINTS

Fig 3.3.4.a Determining Joints Using Mediapipe

By leveraging convolutional neural networks (CNNs) and other


machine learning techniques, Mediapipe analyzes input data to identify
anatomical landmarks such as joints, including those for the shoulders,
elbows, wrists, hips, knees, and ankles.

These joints are crucial for understanding human pose and movement,
serving as the foundation for a wide range of applications, from fitness
tracking to gesture recognition. Through its sophisticated architecture
and training methodology, Mediapipe achieves robustness and accuracy
in joint detection across diverse body types, poses, and environmental
conditions, making it a valuable tool for researchers, developers, and
practitioners in various fields.
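For reference, individual joints can be read from MediaPipe's `PoseLandmark` enumeration; the helper below is a hypothetical sketch that pulls out the left shoulder, elbow, and wrist used for curl angles.

```python
import mediapipe as mp

mp_pose = mp.solutions.pose

def get_left_arm_points(results):
    """Return the normalized (x, y) coordinates of the left shoulder,
    elbow and wrist from a MediaPipe Pose result."""
    lm = results.pose_landmarks.landmark
    shoulder = (lm[mp_pose.PoseLandmark.LEFT_SHOULDER.value].x,
                lm[mp_pose.PoseLandmark.LEFT_SHOULDER.value].y)
    elbow = (lm[mp_pose.PoseLandmark.LEFT_ELBOW.value].x,
             lm[mp_pose.PoseLandmark.LEFT_ELBOW.value].y)
    wrist = (lm[mp_pose.PoseLandmark.LEFT_WRIST.value].x,
             lm[mp_pose.PoseLandmark.LEFT_WRIST.value].y)
    return shoulder, elbow, wrist
```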

3.3.5 COUNTING EXERCISE REPETITIONS

Monitor the person's movements and transitions between different


exercise phases to count repetitions. Utilize techniques like state
machines or temporal analysis to track the progression of each repetition
and accurately count them. Example (counting push-up repetitions):
As the person performs push-ups, the application keeps track of the
number of repetitions completed. This is done by counting instances
where the person transitions from the starting position to the ending
position of a push-up.

Fig 3.3.5.a Curl Counting

Fig 3.3.5.b Curl Counting

3.4 ACTION RECOGNITION

3.4.1 CNN WORKFLOW

Fig 3.4.1.a: CNN Workflow

Fig 3.4.1.b: How CNN Works

A convolution is a linear operation that involves the multiplication of a


set of weights with the input. These weights are present in a smaller
matrix called a kernel or filter. Convolution is basically done to

emphasize the important or key features of an image such as borders,
edges, corners, highlighted portions, etc. The filters will contain values
that will help extract necessary portions of the image. The below image
shows how it’s done:

Fig 3.4.1.c: Convolution Process

As observed, the filter is superimposed onto the image starting from the top-left set of pixels. The weights of the filter are multiplied by the values of the pixels over which it is placed, and the products are summed to give the first value of the convolved matrix, as shown in Fig 3.4.1.c. The filter then moves right by one pixel (the stride) and repeats the operation until it reaches the last section of the image. The size of the convolved feature is given by ((n - f + 2p) / s) + 1, where n is the input size, f the filter size, p the padding and s the stride.
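The following NumPy sketch illustrates the sliding-window multiply-and-sum and confirms the output-size formula for a toy 6x6 input and a 3x3 filter; the example values and function name are ours.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Valid (no padding) 2-D convolution: slide the kernel over the
    image and sum the element-wise products at each position."""
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1          # (n - f + 2p)/s + 1 with p = 0
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36).reshape(6, 6)           # toy 6x6 "image"
edge_kernel = np.array([[1, 0, -1]] * 3)      # simple vertical-edge filter
print(convolve2d(image, edge_kernel).shape)   # (4, 4) -> (6 - 3)/1 + 1 = 4
```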

3.4.1.1 POOLING

Pooling is a convolution process where the filter extracts a single value


from the area it convolves. It is done in order to summarize or reduce
the size of data to be able to make the CNN process simpler. It is a form
of image compression. It is similar to convolution but is performed differently. The diagram below shows an example:

Fig 3.4.1.1.a: Pooling Process

Pooling in our case is done via Max Pooling, where the maximum value
is chosen from the sub-matrix of the input and is used as the first value
of the pooling matrix. For the next value, calculate the maximum value
in the next sub-matrix and update the new element into the Pooling
matrix. This goes on till the pooling matrix is filled.
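A small NumPy sketch of 2x2 max pooling, using toy values of our own choosing:

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """2x2 max pooling: keep only the largest value in each sub-matrix."""
    h, w = feature_map.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            out[i // stride, j // stride] = feature_map[i:i+size, j:j+size].max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [3, 4, 1, 8]])
print(max_pool2d(fmap))
# [[6. 4.]
#  [7. 9.]]
```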

Fig 3.4.1.1.b: Pooling Process-Max Pooling

3.4.2 THE NEURAL NETWORK

Input layer: The flattened layer that was just created acts as the input layer for the upcoming neural network. The data from the input layer is then transferred to the deeper layers.

Hidden layer: The inputs coming from the previous layer are multiplied with weights and summed up along with a bias. The weighted sum is then passed through an activation function, which decides which nodes fire for feature extraction, and finally the output is calculated. This whole process is known as forward propagation.
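A toy NumPy sketch of forward propagation (weighted sum plus bias, activation, then a softmax output); the layer sizes and random weights are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def forward(x, weights, biases):
    """Forward propagation, layer by layer: weighted sum + bias,
    then activation (softmax on the final layer)."""
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b                       # weighted sum plus bias
        if i < len(weights) - 1:
            a = relu(z)                     # hidden-layer activation
        else:
            e = np.exp(z - z.max())         # softmax output layer
            a = e / e.sum()
    return a

# Toy example: 4 flattened inputs -> 3 hidden units -> 2 output classes
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(np.array([0.1, 0.4, 0.2, 0.7]), weights, biases))  # class probabilities
```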

Fig 3.4.2.a: CNN Architecture

3.4.3 CONVOLUTIONAL LSTM WORKFLOW

Fig 3.4.3.a: Convolutional LSTM Workflow

3.4.4 DATA ACQUISITION

Data for the various activities used to train the model were taken from an openly available benchmark. For this purpose, we use the TensorFlow module, an open-source library developed by Google. We use the UCF50 - Action Recognition Dataset (https://www.crcv.ucf.edu/data/UCF50.php), which consists of realistic videos taken from YouTube; this differentiates it from most other available action recognition datasets, which are not realistic and are staged by actors. The dataset contains:

* `50` Action Categories

* `25` Groups of Videos per Action Category

* `133` Average Videos per Action Category

* `199` Average Number of Frames per Video

* `320` Average Frames Width per Video

* `240` Average Frames Height per Video

* `26` Average Frames Per Seconds per Video

3.4.5 DATA VISUALIZATION

In the first step, we will visualize the data along with labels to get an idea
about what we will be dealing with.

For visualization, we will pick `20` random categories from the dataset
and a random video from each selected category and will visualize the
first frame of the selected videos with their associated labels written.
This way we’ll be able to visualize a subset (`20` random videos) of the dataset.
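A sketch of how this visualization might be coded with OpenCV and Matplotlib; the `UCF50` directory path and the 5x4 grid layout are assumptions.

```python
import os, random
import cv2
import matplotlib.pyplot as plt

DATASET_DIR = 'UCF50'   # assumed local path of the extracted dataset
all_classes = os.listdir(DATASET_DIR)

plt.figure(figsize=(20, 20))
for idx, class_name in enumerate(random.sample(all_classes, 20), start=1):
    # Pick one random video from this category and grab its first frame.
    video_name = random.choice(os.listdir(os.path.join(DATASET_DIR, class_name)))
    reader = cv2.VideoCapture(os.path.join(DATASET_DIR, class_name, video_name))
    ok, frame = reader.read()
    reader.release()
    if not ok:
        continue
    frame = cv2.putText(frame, class_name, (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    plt.subplot(5, 4, idx)
    plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    plt.axis('off')
plt.show()
```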

Fig 3.4.5.a: Data Visualization

3.4.6 MAIN WORKING

* Prediction: Based on the analysis of the new data, the model generates
a prediction or output.
* Accuracy: The accuracy of the model’s predictions is assessed. This is usually determined by comparing the model’s predictions against a set of known values.
* Successful Model: If the model’s accuracy meets a certain threshold, it’s considered successful.
* Testing: Single-prediction videos as well as multiple-action-recognition videos are used. For multiple-action prediction we downloaded a YouTube video, and for single-action prediction we feed in the input from the real-time curl counter implemented for exercise tracking.

Preprocessing of the dataset is performed: we extract frames from the videos, resize them, and normalize them to set parameters.
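A minimal sketch of such a preprocessing routine; the sequence length of 20 frames and the 64x64 frame size follow the model summaries later in this chapter, while the function name and the evenly spaced sampling strategy are assumptions.

```python
import cv2
import numpy as np

SEQUENCE_LENGTH = 20          # frames fed to the model (matches the summaries below)
IMAGE_HEIGHT = IMAGE_WIDTH = 64

def frames_extraction(video_path):
    """Read a video, sample SEQUENCE_LENGTH evenly spaced frames,
    resize them to 64x64 and normalize pixel values to [0, 1]."""
    frames = []
    reader = cv2.VideoCapture(video_path)
    total = int(reader.get(cv2.CAP_PROP_FRAME_COUNT))
    skip = max(total // SEQUENCE_LENGTH, 1)
    for i in range(SEQUENCE_LENGTH):
        reader.set(cv2.CAP_PROP_POS_FRAMES, i * skip)
        ok, frame = reader.read()
        if not ok:
            break
        frame = cv2.resize(frame, (IMAGE_WIDTH, IMAGE_HEIGHT))
        frames.append(frame / 255.0)       # normalize to [0, 1]
    reader.release()
    return np.asarray(frames)
```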
- Split the Data into Train and Test Sets - We split our data to create training and testing sets. We also shuffle the dataset before the split to avoid any bias and to obtain splits representing the overall distribution of the data.
- Implement the ConvLSTM Approach-
In this step, we will implement the first approach by using a combination
of ConvLSTM cells. A ConvLSTM cell is an LSTM with convolution embedded in the architecture, which makes it capable of identifying spatial features of the data while taking the temporal relation into account.

Fig 3.4.6.a ConvLSTM Working

- Construct the Model -
To construct the model, we will use Keras `ConvLSTM2D` recurrent layers. The `ConvLSTM2D` layer also takes in the number of filters and the kernel size required for applying the convolutional operations. The output of the layers is flattened at the end and fed to a `Dense` layer with softmax activation, which outputs the probability of each action category.

Model: "sequential"

Layer (type) Output Shape Param #


=======================================================
conv_lstm2d (ConvLSTM2D) (None, 20, 62, 62, 4) 1024

max_pooling3d(MaxPooling3D) (None, 20, 31, 31, 4) 0

time_distributed (TimeDistributed) (None, 20, 31, 31, 4) 0

conv_lstm2d_1 (ConvLSTM2D) (None, 20, 29, 29, 8) 3488

max_pooling3d_1 (MaxPooling 3D) (None, 20, 15, 15, 8) 0

time_distributed_1 (TimeDistributed) (None, 20, 15, 15, 8) 0

conv_lstm2d_2 (ConvLSTM2D) (None, 20, 13, 13, 14) 11144

max_pooling3d_2 (MaxPooling 3D) (None, 20, 7, 7, 14) 0

time_distributed_2 (TimeDistributed) (None, 20, 7, 7, 14) 0

conv_lstm2d_3 (ConvLSTM2D) (None, 20, 5, 5, 16) 17344

max_pooling3d_3 (MaxPooling 3D) (None, 20, 3, 3, 16) 0

flatten (Flatten) (None, 2880) 0

dense (Dense) (None, 4) 11524


========================================================
Total params: 44,524
Trainable params: 44,524
Non-trainable params: 0

Model Created Successfully!
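The summary above pins down the layer stack, so it can be reproduced in Keras roughly as follows; the 20-frame 64x64 RGB input, the 'same'-padded pooling, and the 0.2 dropout/recurrent-dropout rates are assumptions chosen to be consistent with the printed shapes and parameter counts.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (ConvLSTM2D, MaxPooling3D,
                                     TimeDistributed, Dropout, Flatten, Dense)

SEQUENCE_LENGTH, H, W, CHANNELS = 20, 64, 64, 3
NUM_CLASSES = 4

model = Sequential([
    # Each ConvLSTM2D block learns spatio-temporal features; filter counts
    # (4, 8, 14, 16) and 3x3 kernels reproduce the parameter counts above.
    ConvLSTM2D(4, (3, 3), activation='tanh', recurrent_dropout=0.2,
               return_sequences=True,
               input_shape=(SEQUENCE_LENGTH, H, W, CHANNELS)),
    MaxPooling3D(pool_size=(1, 2, 2), padding='same'),
    TimeDistributed(Dropout(0.2)),          # assumed dropout rate

    ConvLSTM2D(8, (3, 3), activation='tanh', recurrent_dropout=0.2,
               return_sequences=True),
    MaxPooling3D(pool_size=(1, 2, 2), padding='same'),
    TimeDistributed(Dropout(0.2)),

    ConvLSTM2D(14, (3, 3), activation='tanh', recurrent_dropout=0.2,
               return_sequences=True),
    MaxPooling3D(pool_size=(1, 2, 2), padding='same'),
    TimeDistributed(Dropout(0.2)),

    ConvLSTM2D(16, (3, 3), activation='tanh', recurrent_dropout=0.2,
               return_sequences=True),
    MaxPooling3D(pool_size=(1, 2, 2), padding='same'),

    Flatten(),                              # 20 * 3 * 3 * 16 = 2880 features
    Dense(NUM_CLASSES, activation='softmax')
])
model.summary()
```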

Explanation:

1. ConvLSTM2D Layers:
- There are multiple ConvLSTM2D layers in the model, each followed
by max-pooling layers.
- ConvLSTM2D layers combine convolutional and LSTM operations,
allowing them to learn spatial-temporal patterns directly from the input
sequence of images.
- The output shape of each layer indicates a sequence of 20 frames
with different spatial dimensions and depths of feature maps.
- The number of output channels (4, 8, 14, 16) in each ConvLSTM2D
layer increases gradually, indicating a progressive extraction of more
complex features.

2. MaxPooling3D Layers:
- Max-pooling layers are applied after each ConvLSTM2D layer to
reduce the spatial dimensions of the feature maps.
- MaxPooling3D layers operate over both spatial and temporal
dimensions, reducing computational complexity and focusing on the
most relevant features.

3. TimeDistributed Layers:
- TimeDistributed layers are used to apply operations (possibly
additional convolutions or transformations) independently to each time
step of the input sequence.
- In this model, TimeDistributed layers don't introduce any additional
parameters but may be used for further feature processing.

4. Flatten Layer:
- The Flatten layer is applied to convert the 3D feature maps into a 1D
vector, which can be fed into a fully connected dense layer for
classification.

5. Dense Layer:
- The final Dense layer performs classification based on the features
extracted by the ConvLSTM2D layers.
- The output shape indicates that the model predicts 4 classes.

Fig 3.4.6.b: ConvLSTM Model Architecture

- Plot the Model’s Loss & Accuracy Curves -

Fig 3.4.6.c: Loss Curve

Fig 3.4.6.d: Accuracy Curve

- Implement the LRCN Approach -

In this step, we implement the LRCN Approach by combining


Convolution and LSTM layers in a single model. The CNN model can be
used to extract spatial features from the frames in the video, and for this
purpose, a pre-trained model can be used that can be fine-tuned for the
problem. The LSTM model can then use the features extracted by the CNN to predict the action being performed in the video. The convolutional layers are used for spatial feature extraction from the frames, and the extracted spatial features are fed to LSTM layer(s) at each time step for temporal sequence modeling. This way the network
learns spatiotemporal features directly in an end-to-end training,
resulting in a robust model.

-Construct the Model-

We use time-distributed `Conv2D` layers, which will be followed by `MaxPooling2D` and `Dropout` layers. The features extracted from the `Conv2D` layers will then be flattened using the `Flatten` layer and fed to an `LSTM` layer. The `Dense` layer with softmax activation will then use the output from the `LSTM` layer to predict the action being performed.

Model: "sequential_1"

Layer (type) Output Shape Param #
============================================================
time_distributed_3 (TimeDistributed) (None, 20, 64, 64, 16) 448

time_distributed_4 (TimeDistributed) (None, 20, 16, 16, 16) 0

time_distributed_5 (TimeDistributed) (None, 20, 16, 16, 16) 0

time_distributed_6 (TimeDistributed) (None, 20, 16, 16, 32) 4640

time_distributed_7 (TimeDistributed) (None, 20, 4, 4, 32) 0

time_distributed_8 (TimeDistributed) (None, 20, 4, 4, 32) 0

time_distributed_9 (TimeDistributed) (None, 20, 4, 4, 64) 18496

time_distributed_10 (TimeDistributed) (None, 20, 2, 2, 64) 0

time_distributed_11 (TimeDistributed) (None, 20, 2, 2, 64) 0

time_distributed_12 (TimeDistributed) (None, 20, 2, 2, 64) 36928

time_distributed_13 (TimeDistributed) (None, 20, 1, 1, 64) 0

time_distributed_14 (TimeDistributed) (None, 20, 64) 0

lstm (LSTM) (None, 32) 12416

dense_1 (Dense) (None, 5) 165

====================================================
Total params: 73093 (285.52 KB)
Trainable params: 73093 (285.52 KB)
Non-trainable params: 0 (0.00 Byte)

Model Created Successfully!
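Again, the summary determines the architecture; a Keras sketch consistent with it is shown below, where the 0.25 dropout rate and the exact pooling sizes are assumptions inferred from the printed output shapes.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                                     Dropout, Flatten, LSTM, Dense)

SEQUENCE_LENGTH, H, W, CHANNELS = 20, 64, 64, 3
NUM_CLASSES = 5

model = Sequential([
    # Spatial feature extraction applied frame by frame via TimeDistributed.
    TimeDistributed(Conv2D(16, (3, 3), padding='same', activation='relu'),
                    input_shape=(SEQUENCE_LENGTH, H, W, CHANNELS)),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),         # assumed dropout rate

    TimeDistributed(Conv2D(32, (3, 3), padding='same', activation='relu')),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(64, (3, 3), padding='same', activation='relu')),
    TimeDistributed(MaxPooling2D((2, 2))),
    TimeDistributed(Dropout(0.25)),

    TimeDistributed(Conv2D(64, (3, 3), padding='same', activation='relu')),
    TimeDistributed(MaxPooling2D((2, 2))),

    TimeDistributed(Flatten()),             # (20, 64) sequence of frame features
    LSTM(32),                               # temporal modeling across the 20 frames
    Dense(NUM_CLASSES, activation='softmax')
])
model.summary()
```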

Explanation:

1. Input Layer (TimeDistributed):


- The input images are processed in a time-distributed manner,
indicating that the model is designed to handle sequences of images (20
frames in this case).
- The input images have a shape of (64, 64) pixels.

2. Convolutional Layers (TimeDistributed):


- There are several convolutional layers (with ReLU activation
functions) in the model. These layers extract features from the input
images at different spatial resolutions and depths.
- The number of filters increases from 16 to 64 across these layers,
indicating a progressive extraction of more complex features.

3. MaxPooling Layers (TimeDistributed):


- Max-pooling layers are used to downsample the spatial dimensions
of the feature maps, reducing computational complexity and extracting
the most important features.

4. LSTM Layer:
- The LSTM layer processes the extracted features from the CNN
layers over time (across the sequence of 20 frames).
- LSTM networks are effective for sequence modeling tasks, as they
can capture temporal dependencies and patterns in the data.

5. Dense Layer:
- The output of the LSTM layer is fed into a dense layer with softmax
activation, which produces the final output.
- The output shape indicates that the model predicts 5 classes, likely
corresponding to different exercises or actions.

Fig 3.4.6.e LRCN Model Architecture

-Compile & Train the Model
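A sketch of how compilation and training might look; the optimizer, loss, batch size, early-stopping patience, and the `features_train`/`labels_train` names are assumptions (the log below shows training planned for 70 epochs and stopping at epoch 43, which is consistent with early stopping on validation loss).

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when validation loss stops improving and keep the best weights;
# the patience value and batch size are assumptions.
early_stopping = EarlyStopping(monitor='val_loss', patience=15,
                               restore_best_weights=True)

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x=features_train, y=labels_train,
                    epochs=70, batch_size=4, shuffle=True,
                    validation_split=0.2,
                    callbacks=[early_stopping])
```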

Epoch 1/70
86/86 [==============================] - 20s 202ms/step -
loss: 1.5562 - accuracy: 0.2711 - val_loss: 1.5446 - val_accuracy: 0.2558
Epoch 2/70
86/86 [==============================] - 16s 190ms/step -
loss: 1.2941 - accuracy: 0.4548 - val_loss: 1.4454 - val_accuracy: 0.5000
Epoch 3/70
86/86 [==============================] - 16s 190ms/step -
loss: 1.0564 - accuracy: 0.5539 - val_loss: 0.9310 - val_accuracy: 0.6744
Epoch 4/70
86/86 [==============================] - 16s 187ms/step -
loss: 1.0035 - accuracy: 0.5627 - val_loss: 0.8670 - val_accuracy: 0.6628
Epoch 5/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.7846 - accuracy: 0.6443 - val_loss: 0.9286 - val_accuracy: 0.6512
Epoch 6/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.7097 - accuracy: 0.7055 - val_loss: 0.7071 - val_accuracy: 0.7326
Epoch 7/70
86/86 [==============================] - 16s 190ms/step -
loss: 0.6251 - accuracy: 0.7318 - val_loss: 0.6335 - val_accuracy: 0.7791
Epoch 8/70
86/86 [==============================] - 19s 227ms/step -
loss: 0.5355 - accuracy: 0.7464 - val_loss: 0.6527 - val_accuracy: 0.8140
Epoch 9/70
86/86 [==============================] - 17s 198ms/step -
loss: 0.4992 - accuracy: 0.8076 - val_loss: 0.8590 - val_accuracy: 0.6512
Epoch 10/70
86/86 [==============================] - 17s 191ms/step -
loss: 0.4256 - accuracy: 0.8484 - val_loss: 0.5051 - val_accuracy: 0.8140
Epoch 11/70
86/86 [==============================] - 16s 184ms/step -
loss: 0.3412 - accuracy: 0.8746 - val_loss: 0.4942 - val_accuracy: 0.8372
Epoch 12/70
86/86 [==============================] - 17s 195ms/step -

loss: 0.2829 - accuracy: 0.8892 - val_loss: 0.4248 - val_accuracy: 0.8488
Epoch 13/70
86/86 [==============================] - 17s 195ms/step -
loss: 0.2577 - accuracy: 0.9155 - val_loss: 0.6338 - val_accuracy: 0.7791
Epoch 14/70
86/86 [==============================] - 17s 198ms/step -
loss: 0.3098 - accuracy: 0.8892 - val_loss: 0.4802 - val_accuracy: 0.8256
Epoch 15/70
86/86 [==============================] - 17s 201ms/step -
loss: 0.1505 - accuracy: 0.9563 - val_loss: 0.6157 - val_accuracy: 0.8372
Epoch 16/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.1907 - accuracy: 0.9417 - val_loss: 0.5611 - val_accuracy: 0.8372
Epoch 17/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.1115 - accuracy: 0.9679 - val_loss: 0.6157 - val_accuracy: 0.8256
Epoch 18/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.1582 - accuracy: 0.9534 - val_loss: 0.4729 - val_accuracy: 0.8721
Epoch 19/70
86/86 [==============================] - 16s 190ms/step -
loss: 0.1523 - accuracy: 0.9329 - val_loss: 0.4059 - val_accuracy: 0.8837
Epoch 20/70
86/86 [==============================] - 16s 192ms/step -
loss: 0.0773 - accuracy: 0.9796 - val_loss: 0.3830 - val_accuracy: 0.8953
Epoch 21/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.0349 - accuracy: 0.9971 - val_loss: 0.4478 - val_accuracy: 0.8605
Epoch 22/70
86/86 [==============================] - 18s 212ms/step -
loss: 0.0311 - accuracy: 0.9942 - val_loss: 0.5619 - val_accuracy: 0.8721
Epoch 23/70
86/86 [==============================] - 17s 193ms/step -
loss: 0.0609 - accuracy: 0.9796 - val_loss: 0.4596 - val_accuracy: 0.8721
Epoch 24/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.1835 - accuracy: 0.9388 - val_loss: 0.5093 - val_accuracy: 0.8837

Epoch 25/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.2025 - accuracy: 0.9534 - val_loss: 0.4255 - val_accuracy: 0.8721
Epoch 26/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.1295 - accuracy: 0.9621 - val_loss: 0.4286 - val_accuracy: 0.8721
Epoch 27/70
86/86 [==============================] - 17s 199ms/step -
loss: 0.0509 - accuracy: 0.9883 - val_loss: 0.4392 - val_accuracy: 0.8721
Epoch 28/70
86/86 [==============================] - 16s 190ms/step -
loss: 0.0294 - accuracy: 0.9971 - val_loss: 0.3164 - val_accuracy: 0.8953
Epoch 29/70
86/86 [==============================] - 16s 192ms/step -
loss: 0.0225 - accuracy: 0.9971 - val_loss: 0.3819 - val_accuracy: 0.8953
Epoch 30/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0186 - accuracy: 0.9971 - val_loss: 0.4409 - val_accuracy: 0.8837
Epoch 31/70
86/86 [==============================] - 16s 184ms/step -
loss: 0.0191 - accuracy: 0.9971 - val_loss: 0.4313 - val_accuracy: 0.8837
Epoch 32/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0170 - accuracy: 0.9971 - val_loss: 0.3776 - val_accuracy: 0.8953
Epoch 33/70
86/86 [==============================] - 16s 186ms/step -
loss: 0.0135 - accuracy: 0.9971 - val_loss: 0.4151 - val_accuracy: 0.9070
Epoch 34/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.0100 - accuracy: 0.9971 - val_loss: 0.5455 - val_accuracy: 0.8721
Epoch 35/70
86/86 [==============================] - 18s 211ms/step -
loss: 0.0040 - accuracy: 1.0000 - val_loss: 0.4936 - val_accuracy: 0.8837
Epoch 36/70
86/86 [==============================] - 18s 204ms/step -
loss: 0.0093 - accuracy: 0.9971 - val_loss: 0.4147 - val_accuracy: 0.8953
Epoch 37/70

86/86 [==============================] - 17s 202ms/step -
loss: 0.0066 - accuracy: 0.9971 - val_loss: 1.0140 - val_accuracy: 0.7907
Epoch 38/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.0180 - accuracy: 0.9971 - val_loss: 0.4302 - val_accuracy: 0.9070
Epoch 39/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0072 - accuracy: 0.9971 - val_loss: 0.3273 - val_accuracy: 0.9186
Epoch 40/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0042 - accuracy: 1.0000 - val_loss: 0.3865 - val_accuracy: 0.9186
Epoch 41/70
86/86 [==============================] - 17s 194ms/step -
loss: 0.0027 - accuracy: 1.0000 - val_loss: 0.3989 - val_accuracy: 0.9186
Epoch 42/70
86/86 [==============================] - 17s 195ms/step -
loss: 0.0025 - accuracy: 1.0000 - val_loss: 0.3846 - val_accuracy: 0.9186
Epoch 43/70
86/86 [==============================] - 17s 198ms/step -
loss: 0.0021 - accuracy: 1.0000 - val_loss: 0.4022 - val_accuracy: 0.9186

Evaluation Of The Trained Model:

5/5 [==============================] - 17s 3s/step - loss:


0.4397 - accuracy: 0.8741

-Plot Model’s Loss & Accuracy Curves:

Fig 3.4.6.f & g: Loss and Accuracy Curves

CHAPTER 4

RESULTS

We conducted extensive experimentation to evaluate and verify the


efficiency of the HAR system for identifying actions in videos. We first
validated the proposed system having both temporal and spatial features
and compared it with a system that uses only sequential features. We
further evaluated and compared our system having a spatio-temporal
attention network with a spatial attention net where the attention network
is applied after the convolutional process. We achieved better results
with the suggested attention-based system than the other baseline
attention methods, which implement either spatial or spatial–temporal
information. It indicates the importance of temporal information in
sequential data that can enhance the recognition performance, such as
video-based action recognition.. The output of our system is superior to
the other deep learning techniques on these datasets. Our system
achieved a higher accuracy with the sports video, which is primarily due
to these videos containing many similar activities that are difficult to
recognize using a simple system. Our system learns deep spatial as well
as temporal information to support its judgment in correctly identifying
the actions within the sports videos.

The results and accuracy obtained from the HAR system are given
below:

Fig 4.a: Multiple Action Recognition

Further, integrating MediaPipe curl counting with human action
recognition has significantly enriched the system's capabilities,
particularly in the realm of fitness tracking and exercise monitoring.
Moreover, the real-time nature of MediaPipe's pose estimation allows for
seamless integration with live video feeds, enabling users to receive
instant feedback during their workouts. Whether it's monitoring the
number of curls performed during a bicep curl exercise or tracking the
consistency of form throughout a set, the system provides timely
guidance to help users optimize their workouts and maximize results.
We obtain an accuracy of 93% with our model. The results obtained
from mediapipe are given below:

Fig 4.b: Single Prediction Results

CHAPTER 5

CONCLUSION

Spatiotemporal features play an essential role in recognizing various


actions in surveillance video data such as human action recognition. In
this article, we proposed a unique attention-based pipeline for human
action recognition, utilizing both the spatial and the temporal features
from a sequence of frames. For this purpose, we used a CNN network to
extract the high-level salient features from the video frames, and we then
used the skip connection approach to upgrade the learned features using
the UFLBs and a dilated CNN. Furthermore, these spatial features were
fed into the CLSTM network to learn the temporal information. An
attention layer is embedded to further determine the spatiotemporal
information in more detail, which enhances the performance at each step
of the LSTM. The center and the softmax loss functions are employed to
improve the classification performance of the human actions in the
videos. We conducted extensive experiments on three standard
benchmark datasets including the UCF50, the UCF Sports, and the J-
HMDB.

CHAPTER 6

REFERENCES

[1] N. Spolaôr, et al., A systematic review on content-based video


retrieval, Eng. Appl. Artif. Intell. 90 (2020) 103557.

[2] A. Keshavarzian, S. Sharifian, S. Seyedin, Modified deep residual


network architecture deployed on serverless framework of IoT
platform based on human activity recognition application, Future
Gener. Computer System 101 (2019) 14–28.

[3] A.D. Antar, M. Ahmed, M.A.R. Ahad, Challenges in sensor-based


human activity recognition and a comparative analysis of benchmark
datasets: A review, in: 2019 Joint 8th International Conference on
Informatics, Electronics & Vision (ICIEV) and 2019 3rd
International Conference on Imaging, Vision & Pattern Recognition,
IcIVPR, IEEE, 2019.

[4] K.A. da Costa, et al., Internet of things: A survey on machine


learning-based intrusion detection approaches, Comput. Netw. 151
(2019) 147–157.

[5] J.K. Aggarwal, M.S. Ryoo, Human activity analysis: A review,


ACM Comput. Surv. 43 (3) (2011)

[6] S. Kulkarni, S. Jadhav, D. Adhikari, A survey on human group


activity recognition by analyzing person action from video sequences
using machine learning techniques, Springer, 2020.
