HUMAN ACTIVITY RECOGNITION SYSTEM
Submitted by
UEC2604 MACHINE LEARNING
Engineering
(An Autonomous Institution, Affiliated to Anna University)
BONAFIDE CERTIFICATE
INTERNAL EXAMINER
ABSTRACT
TABLE OF CONTENTS

1 Introduction
2 Literature Survey
3 Methodology
3.1 Introduction
3.2 ML Process Flow
3.3 Real-Time Curl Counter
3.3.1 Making Detections
3.3.2 Calculation of Angles
3.3.3 Exercise Detection Logic
3.3.4 Determining Joints
3.3.5 Counting Exercise Repetitions
3.4 Action Recognition
3.4.1 CNN Workflow
3.4.1.1 Pooling
3.4.2 The Neural Network
3.4.3 Convolutional LSTM Workflow
3.4.4 Data Acquisition
3.4.5 Data Visualization
3.4.6 Main Working
4 Results
4.1 Multiple Activity Prediction
4.2 Single Prediction on a Test Video
5 Conclusion
6 References
LIST OF SYMBOLS AND ABBREVIATIONS

Abbreviation  Expansion
ML            Machine Learning
CNN           Convolutional Neural Network
LSTM          Long Short-Term Memory Network
DCNN          Dilated Convolutional Neural Network
BiLSTM        Bi-directional Long Short-Term Memory
LRCN          Long-term Recurrent Convolutional Network
HAR           Human Activity Recognition
CHAPTER 1
INTRODUCTION
CHAPTER 2
LITERATURE SURVEY
• Sejal Bhatia et al., in the paper titled "Activity Identification with
Machine Learning on Wearable Tracker", show that ML can identify activities
from wearable tracker data, with the best results obtained using XGBoost.
Their wearable sensor data is not directly applicable to our camera-based
project, but the work confirms that ML is promising for activity detection.
CHAPTER 3
METHODOLOGY
3.1 INTRODUCTION
3.2 ML PROCESS FLOW
3.3 REAL-TIME CURL COUNTER
3.3.2 CALCULATION OF ANGLES
3.3.3 EXERCISE DETECTION LOGIC
3.3.4 DETERMINING JOINTS
These joints are crucial for understanding human pose and movement,
serving as the foundation for a wide range of applications, from fitness
tracking to gesture recognition. Through its sophisticated architecture
and training methodology, MediaPipe achieves robustness and accuracy
in joint detection across diverse body types, poses, and environmental
conditions, making it a valuable tool for researchers, developers, and
practitioners in various fields.
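A minimal sketch of how the joint landmarks used later for curl counting might be read out with MediaPipe Pose; the confidence thresholds, sample frame, and choice of left-arm joints are illustrative assumptions:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# Illustrative confidence thresholds; tune these for the camera setup.
with mp_pose.Pose(min_detection_confidence=0.5,
                  min_tracking_confidence=0.5) as pose:
    frame = cv2.imread("sample_frame.jpg")   # any BGR frame, e.g. from a webcam
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    if results.pose_landmarks:
        lm = results.pose_landmarks.landmark
        # Normalised (x, y) coordinates of the joints relevant to a bicep curl.
        shoulder = (lm[mp_pose.PoseLandmark.LEFT_SHOULDER.value].x,
                    lm[mp_pose.PoseLandmark.LEFT_SHOULDER.value].y)
        elbow = (lm[mp_pose.PoseLandmark.LEFT_ELBOW.value].x,
                 lm[mp_pose.PoseLandmark.LEFT_ELBOW.value].y)
        wrist = (lm[mp_pose.PoseLandmark.LEFT_WRIST.value].x,
                 lm[mp_pose.PoseLandmark.LEFT_WRIST.value].y)
        print(shoulder, elbow, wrist)
```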
3.3.5 COUNTING EXERCISE REPETITIONS
3.4 ACTION RECOGNITION
3.4.1 CNN WORKFLOW
In the convolution step, filters (kernels) are slid over the input image to
emphasize its important or key features such as borders, edges, corners, and
highlighted portions. The filters contain values that help extract the
necessary portions of the image, as illustrated in the sketch below.
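To make the filtering step concrete, the sketch below slides a 3×3 vertical-edge filter over a small illustrative grayscale array (the image and kernel values are assumptions); this mirrors how the convolution layer highlights borders and edges:

```python
import numpy as np

# A 5x5 grayscale "image" with a vertical edge, and a 3x3 edge filter.
image = np.array([[10, 10, 10, 0, 0],
                  [10, 10, 10, 0, 0],
                  [10, 10, 10, 0, 0],
                  [10, 10, 10, 0, 0],
                  [10, 10, 10, 0, 0]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)

# Valid (no padding) convolution: slide the filter over every 3x3 patch and
# take the element-wise product sum -> one value of the feature map.
h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1))
for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        patch = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map)   # large magnitudes mark where the vertical edge lies
```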
3.4.1.1 POOLING
Pooling in our case is done via Max Pooling, where the maximum value is
chosen from the first sub-matrix of the input and used as the first value of
the pooling matrix. For the next value, the maximum of the next sub-matrix is
taken and written into the pooling matrix. This continues until the pooling
matrix is filled.
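A small sketch of this max-pooling rule on an illustrative 4×4 feature map, using a 2×2 window with stride 2 (window size and values are assumptions):

```python
import numpy as np

# Illustrative 4x4 feature map; 2x2 max pooling with stride 2.
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [3, 8, 4, 6]])

pool, stride = 2, 2
out = np.zeros((feature_map.shape[0] // stride, feature_map.shape[1] // stride))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        window = feature_map[i * stride:i * stride + pool,
                             j * stride:j * stride + pool]
        out[i, j] = window.max()   # keep only the largest value in each sub-matrix

print(out)   # [[6. 4.]
             #  [8. 9.]]
```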
3.4.2 THE NEURAL NETWORK
Input layer: The flattened layer that was just created acts as the input
layer for the upcoming neural network. The data from the input layer is then
passed on to the deeper layers.
Hidden layer: The inputs coming from the previous layer are multiplied by
weights and summed together along with a bias. The weighted sum is then
passed through an activation function, which decides which nodes fire for
feature extraction, and finally the output is calculated. This whole process
is known as Forward Propagation.
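Forward propagation as described above can be written out directly. The sketch below uses illustrative layer sizes and random weights purely to show the weighted sum, bias, and activation steps:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: a flattened input of 8 features, one hidden layer of
# 4 nodes and 3 output classes. Real weights are learned, not random.
rng = np.random.default_rng(0)
x = rng.random(8)                        # flattened input layer
W1, b1 = rng.random((4, 8)), rng.random(4)
W2, b2 = rng.random((3, 4)), rng.random(3)

hidden = relu(W1 @ x + b1)               # weighted sum + bias -> activation
output = softmax(W2 @ hidden + b2)       # class probabilities
print(output)
```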
3.4.3 CONVOLUTIONAL LSTM WORKFLOW
3.4.4 DATA ACQUISITION
Data for the various activities was obtained from an established benchmark
dataset and used to train the model. For this purpose we use TensorFlow, an
open-source machine learning library developed by Google. We use the
[UCF50 - Action Recognition Dataset](https://www.crcv.ucf.edu/data/UCF50.php),
which consists of realistic videos taken from YouTube; this differentiates it
from most other available action recognition datasets, which are staged by
actors and therefore less realistic. The dataset contains 50 action
categories.
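A minimal sketch of how fixed-length frame sequences might be extracted from the downloaded UCF50 videos with OpenCV; the sequence length, frame size, function name, and example path are assumptions chosen to match the 20-frame, 64×64 clips used later:

```python
import cv2
import numpy as np

SEQUENCE_LENGTH = 20                  # frames per clip (assumed)
IMAGE_HEIGHT = IMAGE_WIDTH = 64       # resized frame size (assumed)

def frames_extraction(video_path):
    """Read a video and return up to SEQUENCE_LENGTH evenly spaced, resized, normalised frames."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    skip = max(total // SEQUENCE_LENGTH, 1)
    for i in range(SEQUENCE_LENGTH):
        capture.set(cv2.CAP_PROP_POS_FRAMES, i * skip)
        ok, frame = capture.read()
        if not ok:
            break
        frame = cv2.resize(frame, (IMAGE_HEIGHT, IMAGE_WIDTH))
        frames.append(frame / 255.0)  # scale pixel values to [0, 1]
    capture.release()
    return np.asarray(frames)

# e.g. clip = frames_extraction("UCF50/PushUps/v_PushUps_g01_c01.avi")  # path illustrative
```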
3.4.5 DATA VISUALIZATION
In the first step, we will visualize the data along with labels to get an idea
about what we will be dealing with.
For visualization, we will pick `20` random categories from the dataset
and a random video from each selected category and will visualize the
first frame of the selected videos with their associated labels written.
This way we’ll be able to visualize a subset (`20` random videos) of the
dataset.
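A sketch of this visualization step, assuming the dataset has been extracted to a local `UCF50` folder (the folder name, figure layout, and label styling are assumptions):

```python
import os
import random

import cv2
import matplotlib.pyplot as plt

DATASET_DIR = "UCF50"                      # assumed extraction folder
picked = random.sample(os.listdir(DATASET_DIR), 20)   # 20 random categories

plt.figure(figsize=(20, 20))
for idx, class_name in enumerate(picked, start=1):
    video_file = random.choice(os.listdir(os.path.join(DATASET_DIR, class_name)))
    capture = cv2.VideoCapture(os.path.join(DATASET_DIR, class_name, video_file))
    ok, frame = capture.read()             # first frame of the selected video
    capture.release()
    if not ok:
        continue
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    # Write the class label onto the frame before displaying it.
    cv2.putText(frame, class_name, (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    plt.subplot(5, 4, idx)
    plt.imshow(frame)
    plt.axis("off")
plt.show()
```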
3.4.6 MAIN WORKING
* Prediction: Based on the analysis of the new data, the model generates a
prediction or output.
* Accuracy: The accuracy of the model’s predictions is assessed. This is
usually determined by comparing the model’s predictions against a set of
known values.
* Successful Model: If the model’s accuracy meets a certain threshold, it is
considered successful.
* Testing: Both single-prediction videos and multiple-action-recognition
videos are used. For multiple-video prediction we downloaded a YouTube video,
and for single-action prediction we implemented a real-time curl counter for
exercise tracking; that input is fed to the model here (see the sketch after
this list).
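As referenced in the Testing item above, a sketch of how a single test video might be fed to the trained model; the model filename, the class subset, and the helper names are assumptions (the `frames_extraction` helper is sketched in Section 3.4.4):

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Assumed settings, consistent with the 20-frame 64x64 clips used earlier.
SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH = 20, 64, 64
CLASSES_LIST = ["WalkingWithDog", "TaiChi", "Swing", "HorseRace"]  # illustrative subset

model = load_model("LRCN_model.h5")        # assumed saved-model filename

def predict_single_action(video_path):
    """Collapse one test video into a fixed-length clip and print the predicted action."""
    frames = frames_extraction(video_path)              # helper from Section 3.4.4
    probabilities = model.predict(np.expand_dims(frames, axis=0))[0]
    label = int(np.argmax(probabilities))
    print(f"Action: {CLASSES_LIST[label]} (confidence {probabilities[label]:.2f})")

# predict_single_action("test_videos/downloaded_yt_clip.mp4")   # path illustrative
```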
- Construct the Model:
To construct the model, we use Keras `ConvLSTM2D` recurrent layers inside a
Keras `Sequential` model. Each `ConvLSTM2D` layer also takes in the number of
filters and the kernel size required for applying the convolutional
operations. The output of the layers is flattened at the end and fed to a
`Dense` layer with softmax activation, which outputs the probability of each
action category.
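A minimal sketch of how such a model might be assembled, assuming 20-frame 64×64 RGB clips and 4 output classes so that it is consistent with the layer explanation that follows; the dropout rates, padding, and pooling sizes are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (ConvLSTM2D, MaxPooling3D, TimeDistributed,
                                     Dropout, Flatten, Dense)

# Assumed input settings, matching the 20-frame sequences and 4 classes below.
SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CLASSES = 20, 64, 64, 4

model = Sequential()
# Stacked ConvLSTM2D blocks with a growing number of filters (4 -> 8 -> 14 -> 16).
for i, filters in enumerate([4, 8, 14, 16]):
    kwargs = dict(filters=filters, kernel_size=(3, 3), activation="tanh",
                  recurrent_dropout=0.2, return_sequences=True, padding="same")
    if i == 0:
        kwargs["input_shape"] = (SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)
    model.add(ConvLSTM2D(**kwargs))
    # Pool only over the spatial dimensions, keeping the 20-step time axis.
    model.add(MaxPooling3D(pool_size=(1, 2, 2), padding="same"))
    # Parameter-free per-frame regularisation, applied to every time step.
    model.add(TimeDistributed(Dropout(0.2)))

model.add(Flatten())
model.add(Dense(NUM_CLASSES, activation="softmax"))  # probability of each action category
model.summary()
```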
Explanation:
1. ConvLSTM2D Layers:
- There are multiple ConvLSTM2D layers in the model, each followed
by max-pooling layers.
- ConvLSTM2D layers combine convolutional and LSTM operations,
allowing them to learn spatial-temporal patterns directly from the input
sequence of images.
- The output shape of each layer indicates a sequence of 20 frames
with different spatial dimensions and depths of feature maps.
- The number of output channels (4, 8, 14, 16) in each ConvLSTM2D
layer increases gradually, indicating a progressive extraction of more
complex features.
2. MaxPooling3D Layers:
- Max-pooling layers are applied after each ConvLSTM2D layer to
reduce the spatial dimensions of the feature maps.
- MaxPooling3D layers operate over both spatial and temporal
dimensions, reducing computational complexity and focusing on the
most relevant features.
3. TimeDistributed Layers:
- TimeDistributed layers are used to apply operations (possibly
additional convolutions or transformations) independently to each time
step of the input sequence.
- In this model, TimeDistributed layers don't introduce any additional
parameters but may be used for further feature processing.
4. Flatten Layer:
- The Flatten layer is applied to convert the 3D feature maps into a 1D
vector, which can be fed into a fully connected dense layer for
classification.
5. Dense Layer:
- The final Dense layer performs classification based on the features
extracted by the ConvLSTM2D layers.
- The output shape indicates that the model predicts 4 classes.
Fig 3.4.6.b: ConvLSTM Model Architecture
-Step 4.3: Plot Model’s Loss & Accuracy
- Implement the LRCN
Layer (type)                           Output Shape              Param #
========================================================================
time_distributed_3 (TimeDistributed)   (None, 20, 64, 64, 16)    448
========================================================================
Total params: 73093 (285.52 KB)
Trainable params: 73093 (285.52 KB)
Non-trainable params: 0 (0.00 Byte)
Explanation:
4. LSTM Layer:
- The LSTM layer processes the extracted features from the CNN
layers over time (across the sequence of 20 frames).
- LSTM networks are effective for sequence modeling tasks, as they
can capture temporal dependencies and patterns in the data.
5. Dense Layer:
- The output of the LSTM layer is fed into a dense layer with softmax
activation, which produces the final output.
- The output shape indicates that the model predicts 5 classes, likely
corresponding to different exercises or actions.
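Putting the pieces described above together, a sketch of an LRCN model is shown below. Layer counts and widths beyond the first convolution are assumptions; the first `TimeDistributed` `Conv2D` matches the 16-channel, 448-parameter layer in the summary, and the number of output classes follows whichever subset of actions is trained on:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                                     Dropout, Flatten, LSTM, Dense)

# Assumed settings: 20-frame 64x64 RGB clips; class count is illustrative.
SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CLASSES = 20, 64, 64, 5

model = Sequential()
# Per-frame CNN feature extractor, applied to each of the 20 frames independently.
model.add(TimeDistributed(Conv2D(16, (3, 3), padding="same", activation="relu"),
                          input_shape=(SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH, 3)))
model.add(TimeDistributed(MaxPooling2D((4, 4))))
model.add(TimeDistributed(Dropout(0.25)))
model.add(TimeDistributed(Conv2D(32, (3, 3), padding="same", activation="relu")))
model.add(TimeDistributed(MaxPooling2D((4, 4))))
model.add(TimeDistributed(Dropout(0.25)))
model.add(TimeDistributed(Flatten()))
# LSTM models the temporal order of the per-frame feature vectors.
model.add(LSTM(32))
model.add(Dense(NUM_CLASSES, activation="softmax"))  # final action probabilities
model.summary()
```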
Fig 3.4.6.e: LRCN Model Architecture
-Compile & Train the Model
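A sketch of the compile-and-train step that would produce a log like the one below. The loss, optimizer, batch size, and early-stopping patience are assumptions (the run stopping at epoch 43 of 70 suggests early stopping on validation loss), and `features_train` / `labels_train` stand in for the preprocessed clips and one-hot labels built in the earlier steps:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for a while; patience is assumed.
early_stopping = EarlyStopping(monitor="val_loss", patience=15,
                               restore_best_weights=True)

# `model` is the network constructed above; data array names are assumed.
model.compile(loss="categorical_crossentropy", optimizer="Adam",
              metrics=["accuracy"])

history = model.fit(x=features_train, y=labels_train,
                    epochs=70, batch_size=4, shuffle=True,
                    validation_split=0.2, callbacks=[early_stopping])
```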
Epoch 1/70
86/86 [==============================] - 20s 202ms/step -
loss: 1.5562 - accuracy: 0.2711 - val_loss: 1.5446 - val_accuracy: 0.2558
Epoch 2/70
86/86 [==============================] - 16s 190ms/step -
loss: 1.2941 - accuracy: 0.4548 - val_loss: 1.4454 - val_accuracy: 0.5000
Epoch 3/70
86/86 [==============================] - 16s 190ms/step -
loss: 1.0564 - accuracy: 0.5539 - val_loss: 0.9310 - val_accuracy: 0.6744
Epoch 4/70
86/86 [==============================] - 16s 187ms/step -
loss: 1.0035 - accuracy: 0.5627 - val_loss: 0.8670 - val_accuracy: 0.6628
Epoch 5/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.7846 - accuracy: 0.6443 - val_loss: 0.9286 - val_accuracy: 0.6512
Epoch 6/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.7097 - accuracy: 0.7055 - val_loss: 0.7071 - val_accuracy: 0.7326
Epoch 7/70
86/86 [==============================] - 16s 190ms/step -
loss: 0.6251 - accuracy: 0.7318 - val_loss: 0.6335 - val_accuracy: 0.7791
Epoch 8/70
86/86 [==============================] - 19s 227ms/step -
loss: 0.5355 - accuracy: 0.7464 - val_loss: 0.6527 - val_accuracy: 0.8140
Epoch 9/70
86/86 [==============================] - 17s 198ms/step -
loss: 0.4992 - accuracy: 0.8076 - val_loss: 0.8590 - val_accuracy: 0.6512
Epoch 10/70
86/86 [==============================] - 17s 191ms/step -
loss: 0.4256 - accuracy: 0.8484 - val_loss: 0.5051 - val_accuracy: 0.8140
Epoch 11/70
86/86 [==============================] - 16s 184ms/step -
loss: 0.3412 - accuracy: 0.8746 - val_loss: 0.4942 - val_accuracy: 0.8372
Epoch 12/70
86/86 [==============================] - 17s 195ms/step -
loss: 0.2829 - accuracy: 0.8892 - val_loss: 0.4248 - val_accuracy: 0.8488
Epoch 13/70
86/86 [==============================] - 17s 195ms/step -
loss: 0.2577 - accuracy: 0.9155 - val_loss: 0.6338 - val_accuracy: 0.7791
Epoch 14/70
86/86 [==============================] - 17s 198ms/step -
loss: 0.3098 - accuracy: 0.8892 - val_loss: 0.4802 - val_accuracy: 0.8256
Epoch 15/70
86/86 [==============================] - 17s 201ms/step -
loss: 0.1505 - accuracy: 0.9563 - val_loss: 0.6157 - val_accuracy: 0.8372
Epoch 16/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.1907 - accuracy: 0.9417 - val_loss: 0.5611 - val_accuracy: 0.8372
Epoch 17/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.1115 - accuracy: 0.9679 - val_loss: 0.6157 - val_accuracy: 0.8256
Epoch 18/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.1582 - accuracy: 0.9534 - val_loss: 0.4729 - val_accuracy: 0.8721
Epoch 19/70
86/86 [==============================] - 16s 190ms/step -
loss: 0.1523 - accuracy: 0.9329 - val_loss: 0.4059 - val_accuracy: 0.8837
Epoch 20/70
86/86 [==============================] - 16s 192ms/step -
loss: 0.0773 - accuracy: 0.9796 - val_loss: 0.3830 - val_accuracy: 0.8953
Epoch 21/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.0349 - accuracy: 0.9971 - val_loss: 0.4478 - val_accuracy: 0.8605
Epoch 22/70
86/86 [==============================] - 18s 212ms/step -
loss: 0.0311 - accuracy: 0.9942 - val_loss: 0.5619 - val_accuracy: 0.8721
Epoch 23/70
86/86 [==============================] - 17s 193ms/step -
loss: 0.0609 - accuracy: 0.9796 - val_loss: 0.4596 - val_accuracy: 0.8721
Epoch 24/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.1835 - accuracy: 0.9388 - val_loss: 0.5093 - val_accuracy: 0.8837
Epoch 25/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.2025 - accuracy: 0.9534 - val_loss: 0.4255 - val_accuracy: 0.8721
Epoch 26/70
86/86 [==============================] - 16s 189ms/step -
loss: 0.1295 - accuracy: 0.9621 - val_loss: 0.4286 - val_accuracy: 0.8721
Epoch 27/70
86/86 [==============================] - 17s 199ms/step -
loss: 0.0509 - accuracy: 0.9883 - val_loss: 0.4392 - val_accuracy: 0.8721
Epoch 28/70
86/86 [==============================] - 16s 190ms/step -
loss: 0.0294 - accuracy: 0.9971 - val_loss: 0.3164 - val_accuracy: 0.8953
Epoch 29/70
86/86 [==============================] - 16s 192ms/step -
loss: 0.0225 - accuracy: 0.9971 - val_loss: 0.3819 - val_accuracy: 0.8953
Epoch 30/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0186 - accuracy: 0.9971 - val_loss: 0.4409 - val_accuracy: 0.8837
Epoch 31/70
86/86 [==============================] - 16s 184ms/step -
loss: 0.0191 - accuracy: 0.9971 - val_loss: 0.4313 - val_accuracy: 0.8837
Epoch 32/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0170 - accuracy: 0.9971 - val_loss: 0.3776 - val_accuracy: 0.8953
Epoch 33/70
86/86 [==============================] - 16s 186ms/step -
loss: 0.0135 - accuracy: 0.9971 - val_loss: 0.4151 - val_accuracy: 0.9070
Epoch 34/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.0100 - accuracy: 0.9971 - val_loss: 0.5455 - val_accuracy: 0.8721
Epoch 35/70
86/86 [==============================] - 18s 211ms/step -
loss: 0.0040 - accuracy: 1.0000 - val_loss: 0.4936 - val_accuracy: 0.8837
Epoch 36/70
86/86 [==============================] - 18s 204ms/step -
loss: 0.0093 - accuracy: 0.9971 - val_loss: 0.4147 - val_accuracy: 0.8953
Epoch 37/70
86/86 [==============================] - 17s 202ms/step -
loss: 0.0066 - accuracy: 0.9971 - val_loss: 1.0140 - val_accuracy: 0.7907
Epoch 38/70
86/86 [==============================] - 16s 191ms/step -
loss: 0.0180 - accuracy: 0.9971 - val_loss: 0.4302 - val_accuracy: 0.9070
Epoch 39/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0072 - accuracy: 0.9971 - val_loss: 0.3273 - val_accuracy: 0.9186
Epoch 40/70
86/86 [==============================] - 16s 188ms/step -
loss: 0.0042 - accuracy: 1.0000 - val_loss: 0.3865 - val_accuracy: 0.9186
Epoch 41/70
86/86 [==============================] - 17s 194ms/step -
loss: 0.0027 - accuracy: 1.0000 - val_loss: 0.3989 - val_accuracy: 0.9186
Epoch 42/70
86/86 [==============================] - 17s 195ms/step -
loss: 0.0025 - accuracy: 1.0000 - val_loss: 0.3846 - val_accuracy: 0.9186
Epoch 43/70
86/86 [==============================] - 17s 198ms/step -
loss: 0.0021 - accuracy: 1.0000 - val_loss: 0.4022 - val_accuracy: 0.9186
-Plot Model’s Loss & Accuracy Curves:
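A sketch of how the loss and accuracy curves might be plotted from the Keras `History` object returned by `model.fit` in the previous step; the helper name and colours are illustrative:

```python
import matplotlib.pyplot as plt

def plot_metric(history, metric, val_metric, title):
    """Plot a training-history metric against its validation counterpart."""
    epochs = range(len(history.history[metric]))
    plt.plot(epochs, history.history[metric], color="blue", label=metric)
    plt.plot(epochs, history.history[val_metric], color="red", label=val_metric)
    plt.title(title)
    plt.legend()
    plt.show()

plot_metric(history, "loss", "val_loss", "Loss vs Validation Loss")
plot_metric(history, "accuracy", "val_accuracy", "Accuracy vs Validation Accuracy")
```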
CHAPTER 4
RESULTS
The results and accuracy obtained from the HAR system are given
below:
Further, integrating MediaPipe curl counting with human action recognition
has significantly enriched the system's capabilities, particularly in the
realm of fitness tracking and exercise monitoring. Moreover, the real-time
nature of MediaPipe's pose estimation allows for seamless integration with
live video feeds, enabling users to receive instant feedback during their
workouts. Whether it is monitoring the number of curls performed during a
bicep curl exercise or tracking the consistency of form throughout a set, the
system provides timely guidance to help users optimize their workouts and
maximize results. We obtain an accuracy of 93% with our model. The results
obtained from MediaPipe are given below:
CHAPTER 5
CONCLUSION
REFERENCES