
A course project

Learning Robot Model from Data through Reinforcement Learning using a Deep Q Network

for

Machine Learning/Deep Learning (CPSC-5616)


Submitted by

Guiah Soumahoro (0385610)

Saheed Oluwatosin Tiamiyu (0466367)

Charles Udoh (0460759)

Instructor

Meysar Zeinali, Ph.D., PEng.

Assistant Professor, Bharti School of Engineering

Laurentian University

April 15th, 2025


1. Abstract
2. Introduction
2.1. Background and Motivation
2.1.1. Overview of Neural Networks and Deep Q-Networks
2.2. Problem Statement
2.3. Objectives of the Project
2.4. Literature Review
2.5. Key Methodologies
2.5.1. Learning-Based Modeling
2.5.2. Challenges and Future Directions
2.6. Scope and Limitations
3. Problem Formulation and Calculation
3.1. Mathematical Formulation of Neural Networks
3.1.1. Equations Governing Deep Q Networks
3.1.2. DQN Decision Function
3.2. Computational Considerations and Assumptions
4. Methodology
4.1. Data Collection and Preprocessing
4.1.1. Description of Data Sources
4.1.2. Data Cleaning and Normalization Procedures
4.2. Design and Architecture of the Neural Networks
4.2.1. ANN: Architecture and Training Strategy
4.2.2. RNN: Architecture and Training Strategy
4.2.3. LSTM: Architecture and Training Strategy
4.2.4. GRU: Architecture and Training Strategy
4.3. DQN Integration: Acting as the Selector
4.3.1. Combining Neural Networks with DQN
4.3.2. Framework for Action Selection
4.4. Optimization Techniques and Improvement Methods
4.4.1. Weight Initialization Strategies
4.4.2. Regularization and Avoiding Overfitting
4.4.3. Training Strategies (Sequential, Batch, Mini-batch)
4.4.4. Handling Slow Learning and Local Minima
4.5. System Architecture and Workflow
4.5.1. Graphical Representation of the System
4.5.2. Flowchart of Data and Control
5. Results and Discussion
5.1. Experimental Setup and Evaluation Metrics
5.2. Training Results and Performance Analysis
5.2.1. ANN Learning Curves
5.2.2. RNN Learning Curves
5.2.3. LSTM Learning Curve
5.2.4. GRU Learning Curve
5.3. DQN Action Selection Analysis
5.3.1. DQN Learning Curve
5.4. Comparison of Improvement Techniques
5.4.1. Impact of Weight Initialization
5.4.2. Effectiveness of Regularization Strategies
5.4.3. Effectiveness of certain activation functions
5.5. Discussion
5.5.1. Analysis of the Results
5.5.2. Limitations, Challenges, and Future Work
6. Conclusion
6.1. Summary of Findings
6.2. Contributions of the Project
6.3. Challenges, Future Work, and Recommendations
7. Declarations
8. References
9. Appendices
9.1. Appendix A: Complete Code Listing
9.1.1. Principal Component Analysis code and Pre-processing
9.1.2. Artificial Neural Network
9.1.3. Recurrent Neural Network
9.1.4. Long Short Term Memory
9.1.5. Gated Recurrent Unit
9.1.6. Deep Q neural network
9.2. Appendix B: Training Data Description and Pre-processing Details
1.​ Abstract
This project uses neural networks (NN) and reinforcement learning (RL) to develop a smart
torque prediction model for a two-degree-of-freedom (2-DOF) robot arm. Most traditional
model-based control strategies rely on approximated dynamics and hand-tuned PID controllers,
which are inflexible and imprecise. The goal of this study is to replace these conventional
models with a data-driven learning approach that more closely captures real-world dynamics
and provides accurate torque prediction to enable precise robot control.

The approach uses a supervised learning framework to train the neural networks on joint
state information and the corresponding torque values. The input data was reduced to
four relevant features through Principal Component Analysis (PCA), and two
torque output features were used as targets. The data was then split into 70% training,
15% validation, and 15% testing. Standard artificial neural networks (ANN), recurrent neural
networks (RNN), long short-term memory (LSTM) networks, and gated recurrent unit (GRU)
models were evaluated. In addition, a Deep Q-Network (DQN) reinforcement learning module was
incorporated to dynamically select the best-performing model at inference time.

DQN hyperparameters were selected to enhance performance, namely a discount factor of γ=0.99,
a learning rate of α=1e−3, and an ϵ-decay rate of 0.995. Xavier initialization, L2 regularization, and
training with a batch size of 32 were used in all networks. Root Mean Square Error (RMSE) was
used to measure prediction accuracy, cumulative reward to assess policy learning, and the action
distribution to analyze the model selection behavior of the DQN.

Experimental results showed that the ANN model exhibited the fastest and most stable
convergence due to its simple architecture and lower computational complexity. Over 100
epochs of training, it exhibited a uniformly decreasing RMSE, indicating healthy learning dynamics
and effective optimization. The RNN, LSTM, and GRU models showed slower convergence, likely
owing to their higher complexity and longer training time per epoch. However, these models are
advantageous for modeling temporal dependencies and are helpful in scenarios involving
dynamic trajectory prediction or real-time adaptation. The DQN module learned to
favor the ANN model when fast and accurate prediction was required, as
indicated by the analysis of the action distribution. This dynamic model-switching capability
provides a practical advantage for robotic systems where different models
may perform better under different operating conditions.

In conclusion, this study affirms the validity and effectiveness of using neural networks and
reinforcement learning for torque prediction in robotic manipulators. The integration of
supervised learning and reinforcement learning offers a valid alternative to conventional control
schemes by increasing precision, reducing human intervention, and enabling intelligent
adaptation to changing environments. This platform can be extended to more sophisticated
robotic systems and higher degrees of freedom in future studies.

2.​ Introduction
2.1.​ Background and Motivation

Robot manipulators form critical components in contemporary automation systems, particularly
in manufacturing, assembly lines, and precision handling. Nonetheless, the performance of these
manipulators relies heavily on the precision and versatility of the underlying control systems.
Classic control methods, such as PID controllers or analytical inverse dynamics models,
are not only hard to tune but also generalize poorly to uncertainties encountered in practice,
such as changes in load, joint friction, or non-linearities. They assume
near-perfect knowledge of the physics model, which is rarely available in real practice.

Data-driven strategies have become viable alternatives in recent years. Reinforcement
Learning (RL) in particular has demonstrated the potential to allow robots to discover optimal
control policies automatically through trial-and-error interaction with the environment. Through
RL, a robotic manipulator can learn to modify its behavior based on experience, which
provides more robustness and flexibility in control. The 2-degree-of-freedom (2-DOF) robotic
manipulator is an accessible yet representative platform for studying RL-based control systems.

Motivated by recent research such as Ni et al. (2021), this project investigates
how RL can be used to learn the internal dynamics of a 2-DOF manipulator and enhance
control performance.

2.1.1.​ Overview of Neural Networks and Deep Q-Networks

Neural Networks (NNs) are a powerful computational model inspired by the human brain,
designed to discover intricate patterns and relationships in data. They are composed of layers of
interconnected processing units (neurons) that learn to transform input features into
output targets during training. In robotics, and more specifically for 2-DOF manipulators, neural
networks are particularly good at approximating nonlinear dynamics that are difficult to model
using conventional physics-based methods. NNs such as Multilayer Perceptrons (MLPs) have
already seen success in predicting joint torques from state inputs such as position, velocity, and
acceleration. More complex types, such as Recurrent Neural Networks (RNNs), Long
Short-Term Memory (LSTM), and Gated Recurrent Units (GRU), can process time-varying
sequences and are thus well positioned to model dynamic robotic systems over time.

Deep Q-Networks (DQNs), in contrast, represent a type of Deep Reinforcement Learning (DRL)
architecture that combines Q-learning, one of the most popular value-based
approaches, with deep neural networks. A DQN employs an NN to approximate the Q-function,
i.e., the expected return for taking a specific action from a specific state and then
following an optimal policy afterwards. In robotic torque prediction, a DQN can choose
among a collection of actions or neural models depending on observed state inputs. A DQN
learns through trial-and-error interactions with the environment and employs a reward function
to update its policy, enhancing decision quality over time.

Here, NNs are utilized as function approximators to predict the inverse dynamics of a 2-DOF
manipulator, while the DQN is used as a policy selector, selecting the most suitable NN model to
reduce torque prediction errors and enhance control performance. The synthesis of supervised
learning (for training the NNs) and reinforcement learning (for training the DQN) enables an
adaptive and data-driven system that substitutes hand-designed control models with intelligent
learning agents that can generalize across tasks and dynamic conditions.

2.2.​ Problem Statement

Despite their widespread use, robotic manipulators still lack adaptability, especially in the
presence of changing or unknown dynamics. Traditional model-based control methods are not
effective in these circumstances since they require accurate modeling and manual tuning, and are
poor at rejecting external disturbances. These limitations reduce performance and render robots
unreliable when operating under varying conditions.

While Reinforcement Learning offers a model-free, flexible learning approach, its use in robotic
control systems, specifically for learning inverse dynamics or torque control, remains largely
unexplored for low-DOF manipulators. In addition, there is little work comparing RL-based
models to classical approaches in simulation environments tailored for robotic manipulators.
This project addresses this gap in research by developing a data-driven RL model to learn
torque prediction from joint state data and benchmarking its performance against traditional
methods.

2.3.​ Objectives of the Project

An accurate model of the dynamic behavior of robot manipulators is key to achieving
high-performance motion control. Traditional models, which are derived from estimated physical
equations, become insufficient due to unmodeled dynamics, parameter drift, or external
disturbances. Our aim is to develop a Neural Network (NN)-based robot model for a
two-degree-of-freedom (2-DOF) robot manipulator that can offer an improvement over these
traditional estimates. By using joint states as inputs to predict torque outputs, our goal is to
either perfectly or near-perfectly model the robot’s real-world dynamics, replacing mathematical
estimations with a data-driven solution. The expected outcome is a robust and accurate torque
prediction model using a supervised learning-based NN, further enhanced through possible
integration with reinforcement learning (RL). This promises smoother, adaptive, and more
efficient robot control systems. The objectives are better refined and listed as follows:

●​ To investigate and apply reinforcement learning (RL) algorithms for learning the dynamic
behavior and control policy of a 2-DOF robotic manipulator from data.
●​ To design a training and evaluation system integrating simulated data generation,
state-action-reward mapping, and policy optimization.
●​ To compare the performance of RL-based control policies with conventional
model-based controllers on metrics such as tracking accuracy, RMSE, and adaptability.

●​ To verify the real-world feasibility of the trained RL models in a simulated environment by
conducting dynamic trajectory tracking tasks and analyzing the results for efficiency
and robustness.

2.4.​ Literature Review

Over the last few years, robot dynamics identification using machine learning methods,
especially neural networks (NNs), has been a promising trend that brings greater precision and
flexibility to robot control systems. System identification involves building models that mimic a
robot's response from measured input-output data. This approach is essential when the
physical parameters of a robot cannot be directly measured or vary with the environment
(Kumbla & Jamshidi, 1997; Wu et al., 2012). The nonlinearities in robot systems, such as joint
friction and inertia, typically render conventional physics-based modeling inadequate for
high-performance control, especially in unstructured environments.

One of the first neural-network-based identification models for industrial robots was presented
by Kumbla and Jamshidi (1997) with a multilayer perceptron (MLP) with time-delayed inputs.
The authors linked the model to a neuro-fuzzy control system in which adaptive rule tuning was
possible. The MLP trained could approximate the dynamic system response and improve
overall control accuracy, opening the door to hybrid intelligent controllers. Generalizing the idea
of dynamic parameter identification, Wu et al. (2012) applied Modified Fourier Series (MFS) and
Maximum Likelihood Estimation (MLE) for modeling a 6-DOF robot's dynamics. Incorporating
static friction models also enhanced their model predictability. Their approach was superior in
noise handling and accuracy compared to traditional Fourier-based identification, particularly for
experimental conditions.

Patel, Zeinali, and Passi (2021) extended this avenue of research by suggesting the application
of RNNs such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models.
These models were trained with joint state and torque data and then combined with Adaptive
Sliding Mode Control (ASMC) for improved trajectory tracking on a 2-DOF simulated
manipulator. This combination significantly reduced control errors and chattering, with torque
prediction accuracy of up to 90%. These results emphasize the value of
time-aware architectures in robot learning systems, especially when processing noisy
or time-sequential input data.

Besides academic studies, industry and web sources have provided pertinent information. V7
Labs (n.d.) conducted a comparative study of the widely used activation functions for NN
models, including ReLU, Tanh, and Sigmoid. Their study pointed out that the choice of the
appropriate activation function improves learning convergence and stability in training,
especially in dynamic control settings. Qian (n.d.) described the application of NNs to the
estimation of torques from joint states using inverse dynamics approaches. Their contribution
highlighted the adaptive advantage of neural models under varying loads or frictional
circumstances, as long as ample training data is present.

Collectively, the literature illustrates that neural networks, especially in combination with
hybrid control strategies like ASMC or neuro-fuzzy systems, provide a substantial
enhancement over conventional rigid-body models. Neural structures are able to learn highly
nonlinear relations and handle dynamic environments, making them well suited to real-time
robotics applications. RNN-based models, particularly those based on LSTM or GRU cells,
are useful in systems with strong temporal dependencies. The papers reviewed
here confirm the potential of NNs for modeling dynamics and indicate that the way forward will
likely be a synergy between analytical physics-based modeling and data-driven learning
methods.

2.5.​ Key Methodologies

2.5.1.​ Learning-Based Modeling

This research follows a supervised, data-driven approach to robot dynamics modeling with
neural networks (NNs), where the NNs are trained with labeled joint state and torque data to
learn the underlying inverse dynamics of a two-degree-of-freedom (2-DOF) robotic
manipulator. A cost function measures the disparity between the predicted and
actual torque values and is used to optimize the network through weight updates. Reinforcement Learning
(RL) is also investigated as a complementary approach to enhance the model's performance via
real-time feedback and adaptive policy learning in dynamic and uncertain environments.

The method is driven by past research. Kumbla and Jamshidi (1997) utilized multilayer
perceptrons (MLPs) with time-delayed inputs for dynamic identification in industrial robots,
coupled with a neuro-fuzzy controller for adaptive tuning. Wu et al. (2012) employed Modified
Fourier Series (MFS) and Maximum Likelihood Estimation (MLE) for closed-loop parameter
identification on a 6-DOF robot with higher accuracy and noise immunity. Patel, Zeinali, and
Passi (2021) demonstrated the feasibility of LSTM and GRU-based recurrent neural networks
(RNNs) as alternatives to model-based inverse dynamics with high prediction accuracy and
reduced trajectory tracking errors. Complementary observations are made by V7 Labs (n.d.)
regarding the emphasis on activation function selection for stable convergence, while Qian
(n.d.) highlights the potential of inverse dynamics learning to advance control flexibility in the
face of changing operating conditions. These strategies collectively form the basis of
implementing and training a robust, accurate, and adaptive NN-based robot torque model.

2.5.2.​ Challenges and Future Directions

Challenges include data collection issues, where poor generalization occurs without varied
datasets; black-box models that are not interpretable and may violate physical laws; noise
sensitivity, which may destabilize control; and computational demands that limit real-time
applications. Future work should explore hybrid methods of rigid-body physics combined with
neural networks for residual effects, and utilize recurrent neural networks (RNNs) like LSTMs for
online motion adaptation.

2.6.​ Scope and Limitations

This project deals with modeling and controlling a 2-DOF robotic manipulator in a simulated
environment using supervised learning and reinforcement learning methods. Data acquisition,
model training, performance analysis, and comparison with conventional methods are the major
areas addressed.

However, the project has the following limitations:

●​ Simulated Environment Only: The model will be trained and tested in simulation.
Real-world testing on physical manipulators is outside the current scope.
●​ Limited DOF: Although the results are scalable theoretically, the conclusions are strictly
applicable to a 2-DOF manipulator and cannot be extended to higher DOF systems
without additional investigation.
●​ Noise and Real-World Dynamics: Although variations and noise may be simulated,
actual hardware failures and sensor mismatches may introduce issues beyond
those reproduced by the simulator.
●​ Computation Constraints: Training a deep reinforcement learning model can be limited
by the computation resources available, which can limit the number of policies and
architectures that can be extensively searched.

In spite of these limitations, the project provides a good baseline for adaptive robot control
and a stepping stone for future research involving more complex and real-time
implementations.

3.​ Problem Formulation and Calculation


3.1.​ Mathematical Formulation of Neural Networks

Neural networks (NN) are powerful computational tools capable of approximating complex
functions, using interconnected nodes characterized by weights and biases. A standard Artificial
Neural Network (ANN) uses a forward pass to compute its output via the following
equations:

Net_k = Σ_i W_ijk · x_i + b_k    (1)

H_k = f(Net_k)    (2)

W_ijk represents the weight connecting node i in layer k to node j in the subsequent layer
k + 1, and b_k is a bias term associated with the receiving node. The weighted sum of the inputs,
called Net_k, is passed through a nonlinear activation function f(·), which produces the hidden
layer output value H_k. This nonlinear activation allows the NN to model complex nonlinearities
in data. This process is repeated across successive layers until the final output layer is reached,
producing a hypothesis function that acts as the NN's prediction.
Based on the error computed by comparing the estimated value against the actual value taken
from the data, the NN adjusts its weight parameters to obtain a better estimate. Weight
adjustment based on the error is done through a process called the backward pass, or backpropagation.
Specifically, backpropagation calculates the partial derivative of the error with respect to each
weight (the gradient), and updates the weights as shown in the following equation:

W_ijk(new) = W_ijk(old) − α · ∂E/∂W_ijk    (3)

Here α represents the learning rate, controlling the magnitude of the weight updates.
After applying these adjustments, the network performs a forward pass once more to
evaluate the new error. A decreasing root mean square error (RMSE) indicates improved
performance and that the network is learning. However, an increasing RMSE can indicate
design issues, such as an improper NN structure, a learning rate that is too high, improper
initialization, etc.
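
To make equations (1)-(3) concrete, the following minimal NumPy sketch shows one forward pass and one gradient-descent weight update for a single hidden layer. It is illustrative only (the layer sizes, the tanh activation, and the learning rate are assumptions), not the project's Appendix A code:

```python
import numpy as np

# Illustrative sizes: 4 inputs (PCA features), 8 hidden nodes, 2 outputs (joint torques)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (8, 4)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (2, 8)), np.zeros(2)
alpha = 1e-3  # learning rate (assumed value)

def forward(x):
    net1 = W1 @ x + b1            # Eq. (1): weighted sum of inputs plus bias
    h = np.tanh(net1)             # Eq. (2): nonlinear activation
    y_hat = W2 @ h + b2           # output layer prediction
    return net1, h, y_hat

def sgd_step(x, y):
    """One forward pass, backpropagation, and weight update (Eq. (3))."""
    global W1, b1, W2, b2
    net1, h, y_hat = forward(x)
    err = y_hat - y                                    # prediction error
    dW2, db2 = np.outer(err, h), err                   # gradients for the output layer
    dnet1 = (W2.T @ err) * (1 - np.tanh(net1) ** 2)    # backpropagate through tanh
    dW1, db1 = np.outer(dnet1, x), dnet1               # gradients for the hidden layer
    W1 -= alpha * dW1; b1 -= alpha * db1               # Eq. (3): W_new = W_old - α ∂E/∂W
    W2 -= alpha * dW2; b2 -= alpha * db2
    return float(np.sqrt(np.mean(err ** 2)))           # RMSE for this sample

rmse = sgd_step(np.array([0.1, -0.2, 0.05, 0.3]), np.array([0.4, -0.1]))
```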

3.1.1.​ Equations Governing Deep Q Networks

Deep Q-Networks (DQN) extend Q-learning by using a deep neural network (DNN)
to estimate the Q-value function Q(s, a), which is the expected return when taking action a
in a specific state s and following a given policy thereafter. The equation used in DQNs is derived from the
Bellman equation for the optimal action-value function (Sutton & Barto, 2018):

Q*(s_t, a_t) = E[ r_{t+1} + γ · max_a' Q*(s_{t+1}, a') ]    (4)

Where:

● s_t: current state at time t
● a_t: action chosen at time t
● r_{t+1}: immediate reward received after taking action a_t from state s_t
● s_{t+1}: next state resulting from taking action a_t
● a': possible action from the next state s_{t+1}
● Q*(s_{t+1}, a'): optimal action-value function

Similarly to an ANN, the DQN estimates the Q-value through a multi-layered neural network with
parameters θ, denoted Q(s, a; θ). Training the network requires the use of the
following cost function, in which the mean squared error (MSE) is minimized (Sutton & Barto,
2018):

L(θ) = E_(s,a,r,s')∼U(D) [ ( r + γ · max_a' Q(s', a'; θ⁻) − Q(s, a; θ) )² ]    (5)

Where:

● θ: current (online) network parameters
● θ⁻: parameters of a separate target network, periodically updated from the main network parameters
● D: replay memory (experience replay buffer)
● U(D): uniform sampling of a batch of experiences from the replay memory
● r + γ · max_a' Q(s', a'; θ⁻): target Q-value computed using the Bellman equation
● max_a' Q(s', a'; θ⁻): maximum Q-value for the next state s'

3.1.2.​ DQN Decision Function

Decision-making in a DQN is the action selection process based on the Q-values learned by the
neural network. More specifically, the action chosen at each time step is typically the one with the
highest predicted Q-value for the present state, which is referred to as a greedy
policy (Sutton & Barto, 2018):

π(s) = argmax_a Q(s, a; θ)    (6)

Where:

● π(s): policy mapping states to actions
● Q(s, a; θ): neural network output representing the estimated Q-value for each action a in state s

To promote adequate exploration of the state-action space and to prevent premature
convergence to inferior policies, the DQN algorithm typically uses an ϵ-greedy policy.
Such a policy chooses the best-known action (exploitation) most of the time, with
intermittent random actions to discover possibly better approaches (Sutton & Barto, 2018):

a_t = argmax_a Q(s, a; θ),  with probability 1 − ϵ  (exploitation)
a_t = random action,  with probability ϵ  (exploration)    (7)

Where:

● ϵ ∈ [0, 1]: controls the exploration-exploitation trade-off
● Higher ϵ: more exploration (randomness), useful in early training phases
● Lower ϵ: more exploitation, selecting the best-known actions, beneficial in later phases as the network becomes accurate
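
A minimal sketch of this ϵ-greedy rule, together with the decay schedule used in this project (initial ϵ = 1.0, decay 0.995 per epoch, floor 0.1); the function and variable names are illustrative, not the Appendix code:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Choose an action index per Eq. (7): random with probability ϵ, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # exploration
    return int(np.argmax(q_values))               # exploitation

# Decay schedule: start at 1.0, multiply by 0.995 each epoch, floor at 0.1
epsilon, eps_min, eps_decay = 1.0, 0.1, 0.995
for epoch in range(100):
    q = np.array([0.1, 0.4, 0.2, 0.3])   # placeholder Q-values for the 4 candidate models
    action = epsilon_greedy(q, epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)
```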

3.2.​ Computational Considerations and Assumptions

Deep Q-Networks (DQNs) can be computationally demanding, particularly for high-dimensional
state spaces or large action spaces. In this project, the action space is intentionally
limited to four actions, which greatly reduces computational complexity. However,
the use of deep neural networks still raises computational demands. To counter
this, batch processing was used to maximize computational efficiency and deal with
hardware limitations. In addition, proper selection of hyperparameters is important since it
significantly affects both the convergence behavior and the computational requirements of the
neural network model.

4.​ Methodology
4.1.​ Data Collection and Preprocessing

4.1.1.​ Description of Data Sources

The measurements were collected from a 2-degree-of-freedom (2-DOF) robot arm with two
motor-driven rotary joints, each actuated by an applied torque. As seen in Figure 1, each
joint (θ₁ and θ₂) is described by three key physical quantities: position (θ), angular
velocity (θ̇), and angular acceleration (θ̈). These joint states serve as input features, while the
output torques τ₁ and τ₂ are the target values to be learned.

Structure of the Dataset:

The dataset is composed of the following variables:

Joint 1:
- θ₁: Angular position
- θ̇₁: Angular speed
- θ̈₁: Angular acceleration
- τ₁: Output torque

Joint 2:
- θ₂: Angular position
- θ̇₂: Angular speed
- θ̈₂: Angular acceleration
- τ₂: Output torque

Figure 1. Sketch of robot arms & data annotation.

This results in a dataset where:
Inputs: [θ₁, θ₂, θ̇₁, θ̇₂, θ̈₁, θ̈₂]
Outputs: [τ₁, τ₂]

4.1.2.​ Data Cleaning and Normalization Procedures

Figure 2. Original joint state and torque output data.

This structured dataset formed the basis for training neural network models to accurately predict
the required torques for both joints, based on the current and historical joint states. In the
preprocessing phase, the raw data is visualized (see Figure 2), where the original joint state and
torque data are plotted against sample indices. The figure shows erratic behaviour,
particularly in the first 50,000-60,000 samples, which is assumed to be due to sensor startup and
motor vibrations. To address this, that segment of data was removed and the remaining data
was filtered using a Savitzky-Golay filter (source). The result can be seen in Figure 3, which shows the
smoothed joint state and torque data. Normalization was then applied to
standardize all features, ensuring uniformity and improving training stability.

Figure 3. Filtered joint state and torque output data.

Dimensionality reduction was achieved through Principal Component Analysis (PCA), which was
performed on the normalized dataset with the goal of reducing noise and removing redundant
information, while retaining rich information related to the dynamics of the joint states.

Figure 4. Scree plot and Cumulative Variance.

The scree plot (Figure 4) presents the eigenvalues of the covariance matrix, and the cumulative
variance plot in the same figure indicates that 95% of the variance is explained by only 4 of the 6
calculated principal components (PCs).

Figure 5. PCA reduced input and torque output data.

Figure 5 shows the 4 retained PCs, further demonstrating how the high-dimensional data is
compacted and represented. It highlights the structure of the original data without the noise of
the original measurements.

Figure 6. First two principal components plot.

Although PCA retained 4 components to account for 95% of the variance, only the first 2
principal components are plotted in this figure. This two-dimensional projection is performed for
visualization because it allows one to make an intuitive judgment of clusters, trends, and outliers
in the data. The remaining components are still utilized in subsequent analyses and model
training to ensure that the important dynamics are preserved.

Finally, the dataset was split into 70% training, 15% validation, and 15% testing for
all network models, allowing for a more robust evaluation of each model's ability to
generalise.
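
The full preprocessing code is given in Appendix 9.1.1. The sketch below is only an illustrative outline of the steps described above; the cut-off index, filter window, polynomial order, and the sequential (unshuffled) split are assumptions rather than the project's exact settings:

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.decomposition import PCA

def preprocess(raw, startup_cut=60_000):
    """raw: (N, 8) array of [θ1, θ2, θ̇1, θ̇2, θ̈1, θ̈2, τ1, τ2]."""
    data = raw[startup_cut:]                                            # drop erratic startup segment
    data = savgol_filter(data, window_length=51, polyorder=3, axis=0)   # Savitzky-Golay smoothing
    X, y = data[:, :6], data[:, 6:]                                     # joint states vs. torque targets
    X = (X - X.mean(axis=0)) / X.std(axis=0)                            # normalize the features
    X = PCA(n_components=4).fit_transform(X)                            # keep 4 PCs (~95% of variance)
    n = len(X)
    i_tr, i_va = int(0.70 * n), int(0.85 * n)                           # 70% / 15% / 15% split
    return (X[:i_tr], y[:i_tr]), (X[i_tr:i_va], y[i_tr:i_va]), (X[i_va:], y[i_va:])
```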

4.2.​ Design and Architecture of the Neural Networks

4.2.1.​ ANN: Architecture and Training Strategy

Forward pass, for layer l (with h^(0) = x, the network input):

z^(l) = W^(l) h^(l−1) + b^(l)

h^(l) = tanh( z^(l) )

ŷ = tanh( W^(L) h^(L−1) + b^(L) )

Backward pass (chain rule from the output back to layer l):

∂L/∂W^(l) = ∂L/∂ŷ · ∂ŷ/∂h^(l) · ... · ∂h^(l)/∂z^(l) · ∂z^(l)/∂W^(l)

Loss function with L2 regularization:

L = (1/T) Σ_{t=1}^{T} ||ŷ_t − y_t||² + λ ( ||W^(1)||² + ... + ||W^(L)||² )

Where W^(l) is the weight matrix of layer l, x_t is the input, h^(l) is the output of hidden layer l, and ŷ is the estimated output.

Figure 7. Artificial Neural Network.

https://www.researchgate.net/figure/Artificial-neural-network-with-five-layers_fig1_364434635

Architecture:

The ANN consists of five layers. There is one input layer with 4 nodes, three hidden layers
containing a configurable number of nodes each, and a final output layer with 2 nodes. Each
hidden layer, along with the output layer, employs a hyperbolic tangent (tanh) activation
function, allowing the network to capture complex nonlinear relationships effectively.
Additionally, Xavier (Glorot) initialization is applied to the weights to prevent issues with
vanishing or exploding gradients, thus ensuring stable gradient propagation during training.

Training Strategy:

Training is done using Stochastic Gradient Descent (SGD) with momentum (β=0.9), which
facilitates stable yet accelerated convergence. L2 regularization (λ=1e-5) is integrated into the
training process to mitigate potential overfitting. The network is trained using mini-batches
of 32 samples per iteration (configurable), optimizing computational efficiency
and reducing stochastic noise in gradient updates. Model performance is evaluated using Mean
Squared Error (MSE), tracked through Root Mean Square Error (RMSE) during training.
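
For reference, one SGD-with-momentum parameter update with the L2 penalty might look like the sketch below. The β and λ values come from this section; the exact momentum formulation (an exponential moving average is shown here) and the function itself are illustrative assumptions:

```python
import numpy as np

def sgd_momentum_step(W, v, grad, alpha=1e-3, beta=0.9, lam=1e-5):
    """One update of weight matrix W with momentum and an L2 (λ||W||²) penalty.

    v is the running velocity (same shape as W); grad is ∂L/∂W from backpropagation.
    """
    grad = grad + 2 * lam * W          # add the gradient of the L2 penalty term
    v = beta * v + (1 - beta) * grad   # momentum: exponential moving average of gradients
    W = W - alpha * v                  # descend along the smoothed gradient
    return W, v

# Example: update a 25x4 weight matrix with a dummy gradient
W = np.zeros((25, 4))
v = np.zeros_like(W)
W, v = sgd_momentum_step(W, v, grad=np.ones((25, 4)))
```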

4.2.2.​ RNN: Architecture and Training Strategy

Forward pass, for layer l and timestep t:

z_t^(l) = W_h^(l) h_{t−1}^(l) + W_x^(l) x_t^(l) + b^(l)

h_t^(l) = tanh( z_t^(l) )

Backward pass (Backpropagation Through Time):

∂L/∂W_h^(l) = Σ_{t=1}^{T} ∂L/∂z_t^(l) · h_{t−1}^(l)

Loss function with L2 regularization:

L = (1/T) Σ_{t=1}^{T} ||ŷ_t − y_t||² + λ Σ_l ( ||W_h^(l)||² + ||W_x^(l)||² )

Figure 8. Recurrent Neural Network

https://d2l.ai/chapter_recurrent-neural-networks/
https://en.wikipedia.org/wiki/Recurrent_neural_network

Architecture:

The RNN model consists of three recurrent layers, each comprising 20 hidden units, which
enables efficient learning of temporal dependencies. The hyperbolic tangent (tanh) activation
function is used in the hidden states to capture nonlinear relationships across sequences. For
better modeling, the implementation of Backpropagation Through Time (BPTT) was necessary.
This was achieved by unrolling the network across T = 10 sequential timesteps, allowing
effective gradient computation across the temporal dimension (Xu et al., 2019).

Training Strategy:

The RNN model was trained using SGD with momentum (β = 0.9) and incorporates L2
regularization (λ = 0.001) to improve generalization. Input data is split into sequences of length
T = 10, in accordance with the network's unrolling strategy. To stabilize gradient updates and avoid
extremely large updates due to sequence length, gradients computed over these sequences are
averaged by a factor of 1/T, reducing the gradient magnitude.
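
To illustrate how the flat samples are turned into the sequences consumed by the recurrent models, a small windowing helper could look like this (the alignment of the target to the last step of each window, and all names, are assumptions rather than the project's actual code):

```python
import numpy as np

def make_sequences(X, y, T=10):
    """Slice consecutive samples into overlapping windows of length T = 10.

    Returns X_seq with shape (num_windows, T, num_features) and, for each window,
    the torque at its last timestep.
    """
    num_windows = len(X) - T
    X_seq = np.stack([X[i:i + T] for i in range(num_windows)])
    y_seq = y[T - 1 : T - 1 + num_windows]     # target aligned to the final step of each window
    return X_seq, y_seq

def iterate_minibatches(X_seq, y_seq, batch_size=16, rng=np.random.default_rng()):
    """Yield mini-batches of 16 sequences, as used for the recurrent models."""
    order = rng.permutation(len(X_seq))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield X_seq[idx], y_seq[idx]
```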

4.2.3.​ LSTM: Architecture and Training Strategy

Figure 9. Long Short Term Memory Network

https://blog.stackademic.com/complete-guide-to-learn-lstm-models-types-applications-and-when-to-use-which-model-f9b779f31714

Forward pass (per timestep, with input (i), forget (f), output (o), and candidate (g) gates):

c_t^(l) = f_t^(l) ⊙ c_{t−1}^(l) + i_t^(l) ⊙ g_t^(l)

h_t^(l) = o_t^(l) ⊙ tanh( c_t^(l) )

Backward pass:

∂L/∂c_t^(l) = ∂L/∂h_t^(l) ⊙ o_t^(l) ⊙ ( 1 − tanh²( c_t^(l) ) )

Loss function with L2 regularization: same as the RNN.

https://en.wikipedia.org/wiki/Long_short-term_memory#Training

Architecture:

The LSTM model's architecture consists of three stacked LSTM layers, where each layer
features 20 hidden units. It utilises the standard LSTM gate structure found in the literature (input (i),
forget (f), output (o), and candidate (g) gates). These gates allow the LSTM to manage
information flow efficiently and handle long-range dependencies within sequential data. Gate
activations use sigmoid functions to modulate memory retention, while cell
state updates apply a tanh activation function to manage information content adaptively (Patel
et al., 2021).

Training Strategy:

The LSTM network was trained using SGD with momentum (β = 0.9) and includes L2
regularization (λ = 1e-5) to reduce the risk of overfitting. The training process involves
processing sequential data in the form of mini-batches of 16 sequences each. This method
provides computational efficiency as well as gradient accuracy. The training process explicitly
tracks and maintains the cell state (c) across timesteps, allowing the network to capture and
propagate long-term temporal dependencies effectively.
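
A minimal single-timestep LSTM cell consistent with the gate structure described above (the parameter names and dictionary layout are illustrative; the project's actual implementation is listed in Appendix 9.1.4):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM timestep with input (i), forget (f), output (o), and candidate (g) gates.

    p is a dict of per-gate weights: W* applied to x_t, U* applied to h_prev, b* biases.
    """
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ x_t + p["Ug"] @ h_prev + p["bg"])   # candidate state
    c_t = f * c_prev + i * g        # cell state: retained memory plus new content
    h_t = o * np.tanh(c_t)          # hidden state exposed to the next layer / timestep
    return h_t, c_t
```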

4.2.4.​ GRU: Architecture and Training Strategy

Figure 10. Gated Recurrent Unit

https://www.researchgate.net/figure/The-architecture-of-a-multi-layer-gated-recurrent-neural-network_fig5_330723201

Forward pass (per timestep, with update (z) and reset (r) gates and candidate state ĥ_t):

h_t^(l) = ( 1 − z_t^(l) ) ⊙ h_{t−1}^(l) + z_t^(l) ⊙ ĥ_t^(l)

Backward pass: BPTT with gate gradients, propagating the gradients through z_t, r_t, and ĥ_t.

Loss function with L2 regularization: same as the RNN/LSTM.

https://en.wikipedia.org/wiki/Gated_recurrent_unit

Architecture:

The GRU model consists of three stacked GRU layers containing 20 hidden units per layer.
GRUs use a reduced gating mechanism of two gates, the update (z) and reset (r) gates,
to effectively control information flow through sequences. The gate activations use
sigmoid functions to regulate the information flowing through, while candidate-state updates
use the tanh activation. This architecture merges the cell and hidden states and
minimizes computational needs while maintaining the ability to learn sequential patterns (Patel
et al., 2021).

Training Strategy:

The GRU network uses an SGD optimizer with momentum (β = 0.9) and L2 regularization (λ =
1e-5) to improve the network's generalization performance. The gating structure saves on
computational costs and enables efficient backpropagation through time, enhancing training
speed and stability. Merging the hidden and cell states reduces the gradient computations and
updates required, further streamlining the training process.

4.3.​ DQN Integration: Acting as the Selector

Figure 11. Deep Q Network

https://www.researchgate.net/figure/Representation-of-DQN_fig4_371726397

Forward pass:

h_1 = tanh( W_1 s + b_1 )

h_2 = tanh( W_2 h_1 + b_2 )

h_3 = tanh( W_3 h_2 + b_3 )

Q = W_4 h_3 + b_4

Loss function with L2 regularization:

L = (1/N) Σ_{i=1}^{N} [ Q_predicted(a_i) − ( r_i + γ · max_a' Q_target(s_i', a') ) ]² + λ Σ_{j=1}^{4} ||W_j||²

Backward pass:

∂L/∂W_i = ( Q_predicted − Q_target ) · ∂Q_predicted/∂W_i

Architecture:

The DQN follows the same general architecture as the ANN seen earlier. The DQN consists of five
layers: one input layer with 4 nodes (the PCA features), three hidden layers with 25 nodes each, and
one output layer with 4 nodes, one Q-value per candidate model. Each hidden layer employs a
hyperbolic tangent (tanh) activation function, allowing the network to efficiently learn complex
nonlinear relationships, while the output layer is linear, as shown in the forward pass above. In
addition, Xavier (Glorot) initialization is applied to the weights to prevent vanishing or exploding
gradient issues and hence offer stable gradient flow during training.

Training Strategy:

Training is conducted using Stochastic Gradient Descent (SGD) with momentum (β=0.9), which
allows for stable yet accelerated convergence. L2 regularization (λ=1e-5) is added during
training to deter potential overfitting. The network is trained on mini-batches of 32
samples per iteration (configurable), optimizing computation and reducing stochastic noise in
gradient updates. Model performance is evaluated using Mean Squared Error
(MSE), tracked using Root Mean Squared Error (RMSE) during training.

4.3.1.​ Combining Neural Networks with DQN

Mechanism:

The DQN acts as a controller tasked with selecting one of the four pre-trained neural
networks (ANN, RNN, LSTM, and GRU) to predict torque values. The DQN receives the 4
PCA features as input and outputs Q-values corresponding to each candidate model, which allows
selection based on the predicted performance of each model. The reward function is formulated
as the negative squared error between the selected model's prediction (estimated torque) and
the actual torque, which pushes the DQN to minimize prediction error (Sutton & Barto, 2018).

r = −||Prediction − Target||²    (8)

Training:

The training process for the DQN uses several key strategies meant to ensure stability and
effective learning. An experience replay buffer with a capacity of 5000 transitions (s, a, r, s') is used for
decorrelating training samples (reducing autocorrelation and cross-correlation within a set of
signals) and improving data efficiency. A separate target network is maintained and updated every
5 epochs to stabilize the target Q-value estimates. The loss function is the Mean Squared
Temporal Difference (TD) Error between the targets and the online Q-values, where the target
for a state-action pair is computed as:

Target(s, a) = E[ r_{t+1} + γ · max_a' Q*(s', a') ]    (9)

Optimization is performed using Stochastic Gradient Descent (SGD) with momentum (β=0.9),
a learning rate α=1e−3, and L2 regularization strength λ=1e−4. The pre-trained network
parameters are not updated during this process; instead, the DQN learns to select the
optimal model for each input state based on the historical performance of each candidate
network.
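
A compact sketch of one such update step is shown below. The buffer capacity, batch size, γ, and the 5-epoch target synchronization come from this section; the class interfaces and function names are illustrative assumptions, and the actual gradient step is omitted:

```python
import random
from collections import deque
import numpy as np

replay = deque(maxlen=5000)     # experience replay buffer of (s, a, r, s') transitions
GAMMA, BATCH_SIZE = 0.99, 32

def td_loss(online_q, target_q, batch):
    """Mean squared TD error (targets per Eq. (9)) over one mini-batch.

    online_q(s) and target_q(s) are assumed to return a length-4 vector of Q-values.
    """
    errors = []
    for s, a, r, s_next in batch:
        target = r + GAMMA * np.max(target_q(s_next))   # bootstrapped target Q-value
        errors.append(online_q(s)[a] - target)          # TD error for the chosen action
    return float(np.mean(np.square(errors)))            # the gradient step itself is omitted here

def train_epoch(online_q, target_q, epoch, sync_every=5):
    loss = None
    if len(replay) >= BATCH_SIZE:
        batch = random.sample(list(replay), BATCH_SIZE)  # uniform sampling U(D)
        loss = td_loss(online_q, target_q, batch)
    if epoch % sync_every == 0:
        pass  # copy the online network's weights into the target network here
    return loss
```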

4.3.2.​ Framework for Action Selection

ϵ-Greedy Strategy:

The model employs an ε-greedy strategy to balance exploration and exploitation in action
selection. With probability 1−ϵ, the agent exploits what it already knows by choosing the action
with the maximum Q-value. With probability ϵ, a random action is chosen to encourage
exploration. During training, ϵ is decayed from an initial value of 1.0 to a minimum of 0.1 using a
decay factor of 0.995 per epoch, gradually shifting the emphasis from exploration to
exploitation as learning advances.

Workflow:

The process of selecting an action is initiated by passing the input state s to the DQN and
getting Q-values for all possible actions. The ϵ-greedy mechanism described above is employed
to choose an action a. After an action has been selected, the corresponding pretrained
model is used to predict the torque, and the reward r is computed as the negative squared
error between the prediction and the ground truth. This experience transition (s, a, r, s′) is logged
in the replay buffer. Afterwards, mini-batches of these stored transitions are sampled at intervals
to train the DQN such that the network learns efficiently from varied past experiences.
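
Putting this workflow together, one acting-and-collection step might look as follows (the `models` list, `dqn` object, and helper names are assumptions; the replay buffer and ϵ-greedy rule are the ones sketched earlier):

```python
import numpy as np

def selector_step(state, target_torque, dqn, models, epsilon, replay, next_state):
    """One DQN-as-selector interaction: pick a model, predict torque, score it, store the transition.

    dqn.q_values(state) is assumed to return a length-4 array; models = [ann, rnn, lstm, gru],
    each exposing .predict(state) -> predicted torques [τ1, τ2].
    """
    q = dqn.q_values(state)
    if np.random.rand() < epsilon:                     # ϵ-greedy over the four candidate models
        action = int(np.random.randint(len(models)))
    else:
        action = int(np.argmax(q))
    prediction = models[action].predict(state)         # torque from the selected pretrained network
    reward = -float(np.sum((prediction - target_torque) ** 2))   # Eq. (8): negative squared error
    replay.append((state, action, reward, next_state))            # log the transition
    return action, reward
```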

4.4.​ Optimization Techniques and Improvement Methods

4.4.1.​ Weight Initialization Strategies

Xavier Glorot Initialization:

Xavier initialization was used on all of the networks in this project (ANN, RNN, LSTM, GRU, and the DQN)
to preserve activation variance during the forward pass and preserve the gradient
during backpropagation. The weights are drawn from a range scaled by the inverse square root of the
sum of the input and output dimensions to prevent vanishing and exploding gradients.

k = √( 6 / (n_in + n_out) ),   W ∼ U(−k, k)    (10)
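
A small sketch of this initialization (the layer sizes shown are only examples):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=np.random.default_rng()):
    """Glorot/Xavier uniform initialization: W ~ U(-k, k) with k = sqrt(6 / (n_in + n_out))."""
    k = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-k, k, size=(n_out, n_in))

W1 = xavier_uniform(4, 25)    # e.g. 4 PCA features into a 25-node hidden layer
W2 = xavier_uniform(25, 25)
```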

4.4.2.​ Regularization and Avoiding Overfitting

L2 regularization is applied consistently to all model weights to discourage large parameter
values and avoid overfitting. In particular, an L2 penalty term with λ=1e−5 is added to all
models. Additionally, 15% of the data is reserved as a validation set so that the performance of
all models can be tested on unseen data throughout the
training life cycle, providing a robust method for assessing generalization.

4.4.3.​ Training Strategies (Sequential, Batch, Mini-batch)

All models were trained with a mini-batch approach for better
computational efficiency and convergence stability. The ANN uses mini-batch training with a
batch size of 32 samples. The RNN, LSTM, and GRU architectures process
sequences via mini-batches containing 16 sequences, with each sequence spanning T=10
timesteps. This method ensures that temporal dependencies are adequately captured. The
DQN, however, leverages an experience replay mechanism where transitions are stored and
mini-batches of 32 transitions are drawn per update. Additionally, for the recurrent networks,
Backpropagation Through Time (BPTT) is employed over T=10 timesteps. This allows gradients
to be propagated effectively through the sequential structure.

4.4.4.​ Handling Slow Learning and Local Minima

To mitigate slow convergence and avoid getting trapped in local minima, several
optimization techniques are incorporated. Momentum (with β=0.9) is utilized in the SGD
algorithm to accelerate gradient descent in beneficial directions, allowing the optimizer to travel
over shallow regions of the error landscape. For the DQN algorithm, an adaptive exploration
policy with ϵ-decay is employed to gradually shift the trade-off from exploration to exploitation.
This ensures that the agent continues to explore sufficiently but increasingly depends on its
learned policy. Additionally, a target network is maintained to provide more stable Q-value
targets, avoiding problems associated with a constantly changing target in Q-learning
updates (Malik et al., 2022).

4.5.​ System Architecture and Workflow

4.5.1.​ Graphical Representation of the System

Figure 12. DQN graphical representation

https://www.researchgate.net/figure/A-data-flow-diagram-for-a-DQN-with-a-replay-buffer-and-a-target-network-85_fig
4_346110033

Where:

● Replay memory: a storage structure that retains past transitions (s, a, r, s') during training
● Environment: the interaction space that provides the states, actions, and rewards
● Prediction Network: one of the four pre-trained neural networks (ANN, RNN, LSTM, and GRU) that is used to predict torque values
● Predicted Q: the Q-value output by the online DQN, used to select an action
● Target Network: a copy of the online DQN, which is saved and updated periodically
● Target Q: the Q-value estimate used as a training target for the online DQN
● Reward r: a scalar feedback signal quantifying the immediate quality of an action
● Gradient of loss: the derivative of the loss/cost function with respect to the weights, used for backpropagation
● DQN loss calculation: the optimization objective for training the DQN, based on the TD error
4.5.2.​ Flowchart of Data and Control

Figure 13. Data processing flowchart

(Made in PowerPoint by Guiah Soumahoro)

Data Flow:

Raw data is processed through PCA to generate new normalized input features. These input
features are then split into a training, validation, and testing set. The training data is then used
to train the DQN to select actions, triggering a model prediction. The prediction is compared with
the ground truth to determine a reward, and the experience tuple generated is added to the
replay buffer.

Control Flow:

For each time period, a mini-batch is sampled from the replay buffer and Q-targets are
estimated by the target network. The DQN is then updated via SGD with momentum. This
includes periodic updates to the target network to ensure training stability.

5.​ Results and Discussion
5.1.​ Experimental Setup and Evaluation Metrics

The experimental setup used a dataset comprising 4 PCA-based features and 2
torque targets, divided into 70% training, 15% validation, and 15% testing. Hyperparameters
were established before experimentation. The most important hyperparameters of the DQN are
a discount factor γ=0.99, a learning rate α=1e−3, and an ϵ-decay rate of 0.98. Xavier
initialization, L2 regularization, and training with a batch size of 32 are used in all networks.

Model performance was evaluated using RMSE for prediction accuracy, cumulative
reward to assess policy learning, and the action distribution, which measures how often
the DQN chooses each of the available neural network models.

5.2.​ Training Results and Performance Analysis

5.2.1.​ ANN Learning Curves

The ANN exhibited the fastest convergence among all models tested, primarily due to its
comparatively simple architecture and reduced computational complexity. With no
recurrent connections, the ANN required significantly less
computational effort per epoch, enabling fast and stable learning over the course of 100
epochs. The model exhibited a steady and consistent reduction in RMSE, a clear sign
of successful optimization, healthy training dynamics, and convergence to a good
solution. The sustained reduction corroborates that the ANN model was appropriate and
effective for the task at hand.

5.2.2.​ RNN Learning Curves

The standard RNN model, compared to the previously tested ANN model, showed a lower rate
of convergence. However, it exhibited stronger generalization characteristics, as shown
by its closer training, validation, and test RMSE values. Overall, the RNN's
RMSE decreased at an increasing rate, with only minimal transient spikes.

5.2.3.​ LSTM Learning Curve

The LSTM showed a steady and consistent decline in RMSE for the majority of the training
process, with a brief spike observed around the 85th epoch. This may
indicate a transient gradient explosion. To mitigate this issue, gradient clipping was
implemented. Additionally, the learning rate (α) was reduced from 1e-3 to 1e-4, which
helped stabilize the training dynamics. Based on the experiment, extending
the training duration may be necessary to fully demonstrate the robustness and long-term
performance of the proposed LSTM model.

5.2.4.​ GRU Learning Curve

The GRU model achieved the lowest RMSE across the training, validation, and test sets among
all models, which reflects its superior ability to learn temporal relationships. This
performance suggests that the inherent gating mechanism in the GRU architecture makes
sequential pattern modeling more efficient. Additionally, the model demonstrated a steady
decrease in error with little variability throughout training, further supporting its stability and
robustness in learning temporal dynamics.

5.3.​ DQN Action Selection Analysis

5.3.1. DQN Learning Curve

The proposed RL DQN model is observed to converge very quickly in the early stages to an
RMSE of approximately 0.24. Although some minor fluctuation can be observed over more
epochs, the RMSE remains consistently close to this figure. This may well be due to the
combined neural network models (ANN, RNN, LSTM, and GRU), since these tend to achieve an
average RMSE near 0.24. Hence, adding a larger population of diverse, strong neural networks
trained under varied conditions could enable the DQN to learn better from real-time
observations and select more accurate actions. Also, the consistent plateau at 0.24 RMSE
could stem from factors such as the optimization approach applied, regularization
terms, data noise, or representational limits in the features; these factors are recommended
for further investigation.

5.4.​ Comparison of Improvement Techniques

5.4.1.​ Impact of Weight Initialization

(Compared figure panels: no weight initialization, standard initialization, and Glorot/Xavier initialization.)

Experimentation showed that when the network weights were either not initialized (i.e., left at
zero) or initialized in the standard manner (sampled from a normal distribution with a small standard
deviation), the weights remained essentially constant throughout training and the RMSE did not
change. This lack of variation suggests that the Q-network was not learning, most likely because
it was unable to break the initial symmetry or generate sufficient gradient signal.
Accordingly, the network cannot distinguish between actions, which leads to fundamentally the
same performance across the training, validation, and test sets.

Weight initialization is crucial for ensuring stable learning and rapid convergence of deep neural
networks. For the current project, Xavier Glorot initialization was used across all networks. This
was necessary for the DQN, as early Q-value estimates directly affect action selection and thus
the corresponding reward signals. If the DQN weights are not properly initialized, such
as with a naive uniform initialization, the Q-network can produce unstable predictions that
significantly underestimate or overestimate action values during its early training
stage.

The results suggest that applying Xavier initialization provides faster convergence and smaller
initial prediction errors. Additionally, in the action prediction networks (ANN, RNN, LSTM, GRU),
Xavier initialization was useful for extracting the temporal or spatial relations embedded in the data.
In models like the LSTM and GRU, well-balanced weight initialization prevents early
saturation of the gate activations (e.g., the forget gates) and plays a critical role in the extraction
of long-term dependencies in sequential data.

5.4.2.​ Effectiveness of Regularization Strategies

(Compared figure panels: no regularization, L1 regularization, and L2 regularization.)

Experimental results highlight the critical contribution of regularization to the learning
dynamics of the DQN. In the absence of any regularization, we observed that the network
weights were nearly stationary during training, and the RMSE did not decrease significantly.
This implies that the DQN was not learning. On the other hand, the introduction of L1
regularization, which penalizes the absolute values of the weights, enabled the network to adapt
its weights and produce an RMSE decrease over time. However, compared to L2 regularization,
learning curves under L1 regularization featured more severe fluctuations. These oscillations
are likely due to the naturally sparse nature of L1 regularization; by moving some weights
abruptly towards zero, the gradients can switch more violently from epoch to epoch. In short,
while L1 regularization permits weight updating and learning, L2 regularization appears to offer
smoother and more stable convergence, which helps achieve consistent performance
on the training and validation sets.

Furthermore, regularization techniques such as L2 regularization play an important role in
mitigating overfitting, especially in complex systems like the proposed DQN, which interacts with
several network models. L2 regularization was applied by incorporating a penalty term (λ||W||²)
into the loss/cost function for all network models.

L2 regularization contributed to more stable Q-value estimates by discouraging excessively
large weights that could amplify noise in the reward signal. This is important because the reward is defined
as the negative squared difference between the prediction and the target output, making the DQN prone
to overfitting to the underlying models. Without such regularization, the DQN might overfit to
outlier transitions within the replay buffer, which could compromise its action selection strategy.

L2 regularization yielded several advantages: a lower validation RMSE relative
to the training RMSE, which suggests reduced overfitting, and stable learning curves for the DQN
and the other prediction networks, as regularization helped smooth out erratic weight updates.

5.4.3.​ Effectiveness of certain activation functions

(Compared figure panels: sigmoid, ReLU, and tanh activation functions.)

When the DQN utilized the sigmoid activation function, it was observed that the weights
changed over time; nevertheless, the RMSE remained roughly constant at about 0.25, with only
occasional random spikes. When the sigmoid activation function was replaced with ReLU,
however, the mean loss became NaN very early in training, indicating numerical instability
(quite likely due to exploding gradients or dead neurons).

This experiment shows the necessity of choosing a proper activation function for the task at
hand. In deep Q-networks, the choice of activation function is shown to affect gradient flow
during backpropagation. Sigmoid activation functions are susceptible to vanishing gradients and
saturation, whereas ReLU is valued for its non-saturating nature and computational efficiency.
ReLU, however, must also be properly controlled (e.g., with proper initialization,
learning rate tuning, and potentially gradient clipping) in order to ensure stability. Selecting
the appropriate activation function is crucial for attaining stable training dynamics and successful
learning performance.

5.5.​ Discussion

5.5.1.​ Analysis of the Results

In the early training stages, action selection was relatively balanced, with each of the four
possible actions being chosen around 110,000 to 112,000 times. This indicates that initially the
agent had yet to identify which action was superior for predicting torque values for given states.

As training progressed to around 40 epochs, the agent started favoring action 1, corresponding to the ANN
model. Additionally, periodic updates to the target network contributed to more stable Q-value
estimates during these transitions. The average loss decreased toward 0.0700, and
the training, validation, and test RMSE declined to around 0.2410, 0.2415, and 0.2380, respectively,
indicating improved model accuracy and generalization.

In the later stage of training, the agent continued to refine its policy and solidified its preference
for action 1, the ANN network. By epoch 100, action 1 had been selected
around 395,000 to 396,000 times, while the other actions had significantly lower counts.
Additionally, the close alignment of the RMSE for the training, validation, and test data confirms robust
generalization and minimal overfitting. The final printed weight matrices reveal parameter values
within expected ranges, further validating that the model training was well-regularized and
effectively initialized.

5.5.2.​ Limitations, Challenges, and Future Work

A key limitation of the proposed DQN model is that the pretrained model weights remain static, which constrains adaptability. Additionally, training four separate models along with the DQN is computationally intensive and time-consuming. Finally, achieving a proper balance between exploration and exploitation under rapidly changing conditions remains an open issue.

Future work includes integrating online fine-tuning of the candidate models alongside the DQN, which could enhance responsiveness to changing conditions. A further improvement would be to supply the DQN with candidate networks (actions) trained in different environments, so that they differ from one another enough to make the DQN more robust across varied conditions.

6.​ Conclusion
This project established a robust, data-driven solution for predicting the joint torques of a 2-DOF robot arm by integrating diverse neural network architectures with a Deep Q-Network (DQN) for model selection. Through extensive data preprocessing, including noise filtering with a Savitzky–Golay filter, normalization, and dimensionality reduction via PCA, the system could extract the principal features from a noisy dataset. Experimental outcomes demonstrated that while the ANN model converged fastest due to its simple architecture, the recurrent networks (RNN, LSTM, GRU) were superior at temporal modeling although they took longer to converge. The DQN learned to select the optimal model (primarily the ANN) based on past performance, thereby ensuring accurate torque prediction and providing a viable alternative to traditional model-based control.

6.1.​ Summary of Findings

The experiments revealed that well-initialized and regularized neural networks, when coupled
with an adaptive DQN controller, can significantly reduce prediction errors and achieve strong
generalization across training, validation, and test sets. Key findings include:

●​ The ANN model exhibited rapid and stable convergence with uniformly diminishing
RMSE values.
●​ Temporal models (RNN, LSTM, GRU) showed slower convergence but were more adept
at capturing sequential dependencies, with GRU achieving the lowest overall RMSE.
●​ The DQN module, leveraging a carefully tuned ε-greedy policy and periodic target
network updates, learned to predominantly select the ANN model, underscoring its ability
to dynamically adapt to the best-performing network under given conditions.
●​ The preprocessing techniques, particularly the Savitzky–Golay filter and PCA (with 4
principal components retained for 95% variance), effectively reduced noise and
dimensionality, which in turn improved training stability and model accuracy.

6.2.​ Contributions of the Project

This work contributes to the field of robotic control by presenting a new combination of supervised and reinforcement learning for accurate and adaptive torque prediction. The contributions are as follows:

●​ An end-to-end data preprocessing pipeline that successfully cleans, normalizes, and reduces the dimensionality of noisy sensor data.
●​ The comparison and evaluation of a number of neural network architectures (ANN, RNN,
LSTM, GRU) in one framework, where performance and convergence characteristics
can be directly compared.
●​ The use of a DQN as a meta-controller, dynamically switching between pre-trained
models based on real-time state evaluation, thereby combining the strengths of different
types of networks.
●​ Hands-on experience with hyperparameter tuning, weight initialization, and regularization methods, and with overcoming the difficulties of training deep, complex models for robotics applications.

6.3.​ Challenges, Future Work, and Recommendations

Although the outcomes are promising, challenges remain. The current framework uses fixed pre-trained weights, which limits flexibility when conditions change too quickly. Training several neural network models along with the DQN is computationally expensive, and the exploration-exploitation trade-off in dynamic environments remains challenging. Future work should address online fine-tuning of the pre-trained models within the DQN framework to improve responsiveness. It is also recommended to diversify the network pool with more varied architectures and to explore hybrid approaches that combine data-driven models with physics-based information. Validation on real-world robotic platforms and a deeper investigation of scalability to higher degrees of freedom will also be important for maximizing practical usefulness and robustness.

7.​ Declarations
I, Guiah Soumahoro, acknowledge having used ChatGPT version 03 as a tool to help debug and construct the neural network model code, while following what was learned in class and shown by our instructor. AI was used as a tool only.

8.​ References
Kumbla, K. D., & Jamshidi, M. (1997). Neural network-based identification of robot dynamics
used for neuro-fuzzy controller. Proceedings of the IEEE International Conference on Systems,
Man, and Cybernetics, 4, 3413–3418.

Malik, A., Lischuk, Y., Henderson, T., & Prazenica, R. (2022). A Deep Reinforcement-Learning Approach for Inverse Kinematics Solution of a High Degree of Freedom Robotic Manipulator. Robotics, 11(2), 44. https://doi.org/10.3390/robotics11020044

Patel, R., Zeinali, M., & Passi, K. (2021, May). Deep Learning-based Robot Control using Recurrent Neural Networks (LSTM, GRU) and Adaptive Sliding Mode Control. The 8th International Conference of Control, Dynamic Systems, and Robotics. https://doi.org/10.11159/cdsr21.113

Qian, H. (n.d.). Neural network controller for robotic manipulator. Retrieved from
https://www.habr.com/en/articles/nn-controller

V7 Labs. (n.d.). Activation functions in neural networks: 12 types & use cases.
https://www.v7labs.com/blog/neural-network-activation-functions

Wu, C. H., Kao, Y. H., Chen, Y. Y., & Lin, C. C. (2012). Closed-loop dynamic parameter
identification of robot manipulators using modified Fourier series. Mathematical Problems in
Engineering, 2012.

Xu, F., Li, Z., Nie, Z., Shao, H., & Guo, D. (2019). New Recurrent Neural Network for Online Solution of Time-Dependent Underdetermined Linear System With Bound Constraint. IEEE Transactions on Industrial Informatics, 15(4), 2167–2176. https://doi.org/10.1109/TII.2018.2865515

9.​ Appendices
9.1.​ Appendix A: Complete Code Listing

9.1.1.​ Principal Component Analysis code and Pre-processing

%-------------------------------------------------------
% FINAL PROJECT:
% Learning Robot Model From Data Using Reinforcement Learning
% By: Guiah Soumahoro
%-------------------------------------------------------
clc;
clear;
close all;
%-------------------------------------------------------
% STEP [1]
% Load the data and set up parameters (excel dataset needs to be in the
% same folder
%-------------------------------------------------------
% Inform the user if the excel data sheet is not in the same folder
try
dataTable = readtable('Dataset.xlsx');
catch error
error('Could not read excel file. Make sure it is located in the current folder and not opened in Excel.')
end
% Convert the table into an array to be used by MATLAB
data = table2array(dataTable);
% Determine the total number of samples
N = size(data,1);
% Create a pseudo time vector using sample indices (no time data given)
t = 1:N;
%-------------------------------------------------------
% STEP [2]
% Extract Input and Output Data from the table
%-------------------------------------------------------
X = data(:,1:6); % First 6 columns are Inputs
Y = data(:,8:9); % Columns 8 and 9 are Outputs
%---------------------- Plot Original Dataset ----------------------
figure
subplot(2,1,1);
plot(t,X);
xlabel('Sample Index');

ylabel('Joint States');
title('Original Joint States Data');
legend('q1','q2','q1''','q2''','q1''''','q2''''','Location','best');
grid on;
subplot(2,1,2)
plot(t,Y)
xlabel('Sample Index');
ylabel('Torque');
title('Original Torque Data');
legend('tau1','tau2','Location','best');
grid on;
% NOTE: Erratic initial data was seen at the start of the plot. It could be
% vibrations when the motors are starting and may affect learning of the Neural
% Networks, so it needs to be removed. (May be due to sensor startup, actuator
% warm-up or deadband)
%-------------------------------------------------------
% STEP [3]
% Remove erratic data seen at the start of the plot,
% as visual inspection indicates that the first
% 50,000-60,000 samples are erratic
%-------------------------------------------------------
remove_index = 60000; % Remove values before that number set
if N > remove_index
t_clean = t(remove_index:end);
X_clean = X(remove_index:end,:);
Y_clean = Y(remove_index:end,:);
% If an error is present due to a poor index value
else
error('Not enough samples to remove the first %d samples.', remove_index);
end
% NOTE: The plot shows that the data is very noisy; vibratory segments
% occur approximately every 35,000-45,000 samples. This shows the need for a
% filter, which is useful for the project scope
%-------------------------------------------------------
% STEP [4]
% Filter out noise Using Filtering Techniques
% Can choose between Savitzky-Golay or Butterworth Filter
%-------------------------------------------------------
%----------------------- Savitzky-Golay ------------------------------
% Window size coefficient
m = 500;
window_size = 2*m + 1;
% Polynomial Order
p = 4;
% Build design matrix A (Vandermonde matrix)
t = (-m:m)'; % Relative time indices
A = zeros(window_size, p+1);
for i = 0:p
A(:, i+1) = t.^i;

end
% Compute pseudo-inverse
C = pinv(A); % Each row of C gives a coefficient for one derivative
coeff = C(1,:); % Row 1 corresponds to smoothing (0th derivative)
% Initialize filter matrices
X_filtered = zeros(size(X_clean));
Y_filtered = zeros(size(Y_clean));
% Get dimensions
[N_updated, num_x] = size(X_clean);
[N_updated, num_y] = size(Y_clean);
% Apply filter to input X matrix signal
for j = 1:num_x
for i = m+1 : N_updated-m
X_window = X_clean(i-m:i+m, j);
X_filtered(i,j) = coeff * X_window;
end
end
% Apply filter to output Y matrix signal
for j = 1:num_y
for i = m+1 : N_updated-m
Y_window = Y_clean(i-m:i+m, j);
Y_filtered(i,j) = coeff * Y_window;
end
end
%----------------------- Butterworth Filter --------------------------
% Can be implemented in the future
%----------------- Plot the New Filtered Dataset ----------------------
figure
subplot(2,1,1);
plot(t_clean, X_filtered);
xlabel('Sample Index');
ylabel('Joint States');
title('Filtered Joint States Data');
legend('q1','q2','q1''','q2''','q1''''','q2''''','Location','best');
grid on;
subplot(2,1,2)
plot(t_clean, Y_filtered)
xlabel('Sample Index');
ylabel('Torque');
title('Filtered Torque Data');
legend('tau1','tau2','Location','best');
grid on;
% Note: The plot shows that the data changes character roughly every 50,000 samples
% (this appears to be intentional trajectory switching). If so, these segments are very
% informative for generalization
%-------------------------------------------------------
% STEP [5]
% PCA Analysis on filtered data
%-------------------------------------------------------

%------ Data Preprocessing: Standardize (z-score normalize) the data -----
mu_x = mean(X_filtered); % mean of feature (inputs)
sigma_x = std(X_filtered); % standard deviation of the feature
mu_y = mean(Y_filtered); % mean of feature (outputs)
sigma_y = std(Y_filtered); % standard deviation of the feature
% z-score normalization of data (nx6)
X_normalized = (X_filtered - mu_x) ./ sigma_x;
Y_normalized = (Y_filtered - mu_y) ./ sigma_y;
%-------------------- Covariance Matrix Calculation ----------------------
CovMatrix = (X_filtered - mu_x)'*(X_filtered - mu_x)*(1/(N_updated - 1));
%------------------------- Eigen Decomposition -------------------------
% Find eigenvectors and eigenvalues
[V, D] = eig(CovMatrix);
% Extract eigenvalues
eigenvalues = diag(D);
% Sort eigenvalues in descending order and reorder eigenvectors accordingly.
[eigenvalues_sort, v] = sort(eigenvalues, 'descend');
V_sort = V(:, v);
% Calculate the percentage of variance explained by each eigenvalue.
variance_e = 100 * eigenvalues_sort / sum(eigenvalues_sort);
cumulative_variance = cumsum(variance_e);
%----------------- Plot the New Filtered Dataset ----------------------
figure;
subplot(2,1,1);
bar(eigenvalues_sort);
xlabel('Principal Component');
ylabel('Eigenvalue');
title('Scree Plot (Eigenvalues of Covariance Matrix)');
subplot(2,1,2);
plot(cumulative_variance, '-o', 'LineWidth', 1.5);
xlabel('Number of Principal Components');
ylabel('Cumulative Variance Explained (%)');
title('Cumulative Variance Explained by Principal Components');
grid on;
% Decide on the number of principal components to retain (e.g., 95% variance)
numComponents = find(cumulative_variance >= 95, 1);
fprintf('Number of principal components to retain for 95%% variance: %d\n',
numComponents);
%-------------------------------------------------------
% STEP [6]
% Transform data set onto Principal Components
%-------------------------------------------------------
% Update your dataset by projecting onto the selected principal components.
% This reduces the data from 6 dimensions to numComponents (e.g., 4).
X_PCA = X_normalized * V_sort(:, 1:numComponents); %%%%%%%%%%%%%%
%----------------- Plot the New Filtered Dataset ----------------------
% For visualization, plot the first two principal components.
figure;
if numComponents >= 2

scatter(X_PCA(:,1), X_PCA(:,2), 10, 'filled');
xlabel('Principal Component 1');
ylabel('Principal Component 2');
title('Projection of Outlier-Free Data onto First Two Principal
Components');
grid on;
else
plot(X_PCA, zeros(size(X_PCA)), 'o');
xlabel('Principal Component 1');
title('Projection of Outlier-Free Data (Only 1 Component Retained)');
grid on;
end
% Plot the reduced joint state inputs (X_PCA) and the normalized torque data (Y_normalized)
figure
subplot(2,1,1);
plot(t_clean, X_PCA);
xlabel('Sample Index');
ylabel('Reduced PCA Inputs');
title('Reduced Joint State Inputs (Principal Components) Over Time');
legend(arrayfun(@(i) sprintf('PC%d', i), 1:size(X_PCA,2), 'UniformOutput',
false), 'Location','best');
grid on;
subplot(2,1,2);
plot(t_clean, Y_normalized);
xlabel('Sample Index');
ylabel('Normalized Torque values');
title('Cleaned Torque Data');
legend('tau1','tau2','Location','best');
grid on;
%-------------------------------------------------------
% STEP [7]
% Save the new processed data
%-------------------------------------------------------
Dataset_PCA = [X_PCA, Y_normalized];
writematrix(Dataset_PCA, 'DatasetPCA.csv');

9.1.2.​ Artificial Neural Network


%-------------------------------------------------------
% Five layer Deep Neural Network using Backpropagation
% By: Guiah Soumahoro (Debugging of ANN was done using ChatGPT)
%-------------------------------------------------------
function ANN_Regression()
clc;
clear all;
close all;

%-------------------------------------------------------

%% STEP [1]
% Load the data and set up parameters (excel dataset needs to be in the
% same folder
%-------------------------------------------------------

try
dataTable = readtable('DatasetPCA.csv');
catch error
error('Could not read DatasetPCA.csv. Make sure it is located in the current folder and not opened in Excel.')
end

% Convert Table into array to be used by MATLAB


data = table2array(dataTable);
X = data(:, 1:4); % Extract the first 4 columns as features (input
variables)
Y = data(:, 5:6); % Extract column 5 & 6 as the target output
N = size(data,1);% Total number of samples in the dataset

%-------------------------------------------------------
%% STEP [2]
% Normalize the data
%-------------------------------------------------------

% Note: z-score normalization was done in the preprocessing steps

%-------------------------------------------------------
%% STEP [3]
% Randomize data sets to remove ordering bias and split the data
% (70% training, 15% validation and 15% testing
%-------------------------------------------------------

% Randomize data
idx = randperm(N); % Shuffle the sample order with a random permutation to remove ordering bias
X = X(idx,:);
Y = Y(idx,:);

% Define the number of samples for each partition:


N_train = round(0.70 * N); % 70% for training
N_val = round(0.15 * N); % 15% for validation
N_test = N - N_train - N_val; % Remaining samples for testing

% Split the dataset into training, validation, and test sets.


X_train = X(1:N_train, :);
Y_train = Y(1:N_train, :);

X_val = X(N_train+1 : N_train+N_val, :);


Y_val = Y(N_train+1 : N_train+N_val, :);

X_test = X(N_train+N_val+1 : end, :);
Y_test = Y(N_train+N_val+1 : end, :);

%-------------------------------------------------------
%% STEP [4]
% Build the Neural Network architecture (5 layers) so you can update
% And initialize weights
%-------------------------------------------------------

% NN architecture
input_nodes = 4; % Number of input nodes in layer 1
hidden_nodes_1 = 25; % Number of hidden nodes in layer 2
hidden_nodes_2 = 25; % Number of hidden nodes in layer 3
hidden_nodes_3 = 25; % Number of hidden nodes in layer 4
output_nodes = 2; % Number of output nodes

%-----------------------------------------------------------
% Use Xavier (Glorot) initialization for weights
%-----------------------------------------------------------

% Initialize weights with small random values:


% W1: Weights from input layer (plus bias) to hidden layer.
limit1 = sqrt(6/(input_nodes+hidden_nodes_1));
W1 = rand(hidden_nodes_1, input_nodes+1) * 2 * limit1 - limit1;

% W2: Weights from hidden layer 1 (plus bias) to hidden layer 2.
limit2 = sqrt(6/(hidden_nodes_1+hidden_nodes_2));
W2 = rand(hidden_nodes_2, hidden_nodes_1+1) * 2 * limit2 - limit2;

% W3: Weights from hidden layer 2 (plus bias) to hidden layer 3.
limit3 = sqrt(6/(hidden_nodes_2+hidden_nodes_3));
W3 = rand(hidden_nodes_3, hidden_nodes_2+1) * 2 * limit3 - limit3;

% W4: Weights from hidden layer 3 (plus bias) to the output layer.
limit4 = sqrt(6/(hidden_nodes_3+output_nodes));
W4 = rand(output_nodes, hidden_nodes_3+1) * 2 * limit4 - limit4;

% ----------------------------------------------------------
% ASSIGNMENT 3 ADDITION Initialize velocity terms
% ---------------------------------------------------------
W1_vel = zeros(size(W1));
W2_vel = zeros(size(W2));
W3_vel = zeros(size(W3));
W4_vel = zeros(size(W4));

%-------------------------------------------------------
%% STEP [5]
% Set up hyperparameters of the NN

%-------------------------------------------------------

alfa = 1e-3; % Learning rate for weight updates


epoch = 100; % Total number of training epochs
lamda = 1e-5; % L2 regularization parameter
beta = 0.9; % Momentum parameter

batch_size = 32; % Batch size number


num_batches = floor(N_train/batch_size);

% Preallocate arrays to store Root Mean Square (RMSE) values for each epoch
for later plotting.
train_RMSE = zeros(epoch,1);
val_RMSE = zeros(epoch,1);
test_RMSE = zeros(epoch,1);

% Preallocate arrays to record the evolution of selected weights.


% Here we track W1(1,1) and W2(1,1) as examples.
weight_history_W1 = zeros(epoch,1);
weight_history_W2 = zeros(epoch,1);

%-------------------------------------------------------
%% STEP [5]
% Set up activation function (so it can be easily modified later
%-------------------------------------------------------

% Note: Using Tanh function


act_func = @(z) tanh(z); % Can be used on an array
Act_deriv = @(a) 1 - tanh(a).^2; % Derivative of activation function

% Note: Using sigmoid function


% act_func = @(z) 1./(1 + exp(-z)); % Can be used on an array
% Act_deriv = @(a) a .* (1 - a); % Derivative of activation function

%-------------------------------------------------------
%% STEP [6]
% Training the NN with the training data set
%-------------------------------------------------------

% Use mini-batch gradient descent with momentum to update the weights over multiple epochs.
for num_epoch = 1:epoch
% Start timing
tic

% Shuffle training data at the start of each epoch


idx = randperm(N_train);
X_train = X_train(idx, :);
Y_train = Y_train(idx, :);

% Loop over each training dataset.
for n = 1:num_batches
% Batch index
start_idx = (n - 1) * batch_size + 1;
end_idx = n * batch_size;

% Extract batch
X_batch = X_train(start_idx:end_idx, :)';
Y_batch = Y_train(start_idx:end_idx, :)';

% ----- Forward Pass (vectorized) -----


% Create an input vector X for the specific row with added bias of 1
X1 = [ones(1, batch_size); X_batch]; % [5 x B]
net_hid_1 = W1 * X1;
H1 = act_func(net_hid_1);

% Calculate the net term and activation of hidden layer 2


X2 = [ones(1, batch_size); H1];
net_hid_2 = W2 * X2;
H2 = act_func(net_hid_2);

% Calculate the net term and activation of hidden layer 3


X3 = [ones(1, batch_size); H2];
net_hid_3 = W3 * X3;
H3 = act_func(net_hid_3);

% Calculate the net term and activation of the output layer


Y1 = [ones(1, batch_size); H3];
net_out = W4 * Y1;
Y_hat = act_func(net_out); % [2 x B]

% ----- Calculate Error -----


e = Y_hat - Y_batch;

% ----- catch any ERRORS ----


if any(isnan(Y_hat), 'all') || any(isnan(W1), 'all') ||
any(isnan(W4), 'all')
error('NaN detected in weights or predictions. Stopping
early.');
end

% ----- Backpropagation -----


delta4 = e .* Act_deriv(net_out);
delta3 = (W4(:,2:end)' * delta4) .* Act_deriv(net_hid_3);
delta2 = (W3(:,2:end)' * delta3) .* Act_deriv(net_hid_2);
delta1 = (W2(:,2:end)' * delta2) .* Act_deriv(net_hid_1);

% Note: Exclude the bias weight from W2, W3, W4 (i.e., use
W2(:,2:end)).

% ------ Gradients for W1 to W4 ----


dW4 = (delta4 * Y1') / batch_size;
dW3 = (delta3 * X3') / batch_size;
dW2 = (delta2 * X2') / batch_size;
dW1 = (delta1 * X1') / batch_size;

% ----------------------------------------------------------
% ASSIGNMENT 3 ADDITION L2 REGULARIZATION (bias not included)
% ---------------------------------------------------------
dW4(:,2:end) = dW4(:,2:end) + lamda*W4(:,2:end);
dW3(:,2:end) = dW3(:,2:end) + lamda*W3(:,2:end);
dW2(:,2:end) = dW2(:,2:end) + lamda*W2(:,2:end);
dW1(:,2:end) = dW1(:,2:end) + lamda*W1(:,2:end);

% ----------------------------------------------------------
% ASSIGNMENT 3 ADDITION MOMENTUM
% ---------------------------------------------------------
W4_vel = beta * W4_vel + alfa * dW4;
W3_vel = beta * W3_vel + alfa * dW3;
W2_vel = beta * W2_vel + alfa * dW2;
W1_vel = beta * W1_vel + alfa * dW1;

% ------ Weight Update ------


W4 = W4 - W4_vel;
W3 = W3 - W3_vel;
W2 = W2 - W2_vel;
W1 = W1 - W1_vel;

end

% --- Record Weight Trajectories ---


% Save selected weights to track their evolution.
weight_history_W1(num_epoch) = W1(1,1);
weight_history_W2(num_epoch) = W2(1,1);

% --- Evaluate RMSE for full training/val/test sets ---


Y_hat_train = forward_pass(X_train, W1, W2, W3, W4, act_func);
Y_hat_val = forward_pass(X_val, W1, W2, W3, W4, act_func);
Y_hat_test = forward_pass(X_test, W1, W2, W3, W4, act_func);

train_RMSE(num_epoch) = sqrt(mean(mean((Y_train - Y_hat_train).^2)));


val_RMSE(num_epoch) = sqrt(mean(mean((Y_val - Y_hat_val).^2)));
test_RMSE(num_epoch) = sqrt(mean(mean((Y_test - Y_hat_test).^2)));

% Display progress every epoch.


if mod(num_epoch,1) == 0

time_taken = toc;
fprintf('Epoch %d: Train RMSE = %.4f | Val = %.4f | Test = %.4f |
Time = %.2fs\n', ...
num_epoch, train_RMSE(num_epoch), val_RMSE(num_epoch),
test_RMSE(num_epoch), time_taken);
end

end

%-------------------------------------------------------
%% STEP [7]
% Plot all the necessary graphs
%-------------------------------------------------------

% Plot the RMSE over epochs for training, validation, and test sets.
figure;
plot(train_RMSE, 'b-', 'LineWidth', 2); hold on;
plot(val_RMSE, 'r--', 'LineWidth', 2);
plot(test_RMSE, 'g-.', 'LineWidth', 2);
xlabel('Epoch');
ylabel('RMSE');
title('Root Mean Square Error Over Training');
legend('Train', 'Validation', 'Test');
grid on;

% Plot the trajectory of selected weights versus epochs.


figure;
plot(1:epoch, weight_history_W1, 'm-', 'LineWidth', 2); hold on;
plot(1:epoch, weight_history_W2, 'c-', 'LineWidth', 2);
xlabel('Epoch');
ylabel('Weight Value');
title('Selected Weight Trajectories vs. Epoch');
legend('W1(1,1)', 'W2(1,1)');
grid on;

% -------------------- Display Final Weights ---------------------------


disp('Final Weights (Input-to-Hidden, W1):');
disp(W1);
disp('Final Weights (Hidden Layer 1-to-Hidden Layer 2, W2):');
disp(W2);
% Save them to a MAT file named ANN_weights.mat
save('ANN_weights.mat', 'W1', 'W2', 'W3', 'W4');
end
%-------------------------------------------------------
%% HELPER FUNCTIONS
%-------------------------------------------------------
function Y_hat = forward_pass(X, W1, W2, W3, W4, act_func)
N = size(X, 1);
X = X';

X1 = [ones(1,N); X];
H1 = act_func(W1 * X1);
H2 = act_func(W2 * [ones(1,N); H1]);
H3 = act_func(W3 * [ones(1,N); H2]);
Y_hat = act_func(W4 * [ones(1,N); H3])';
end

9.1.3.​ Recurrent Neural Network


%-------------------------------------------------------
% Deep Recurrent Neural Network
% By: Guiah Soumahoro (Debugging of RNN was done using ChatGPT)
%-------------------------------------------------------
function deepRNN_BPTT_Regression()

clc;
clear all;
close all;
%% Step 1: Load data & define inputs (X) and targets (Y)
try
dataTable = readtable('DatasetPCA.csv');
catch
error('Could not read DatasetPCA.csv. Make sure it is in the current
folder.');
end
data = table2array(dataTable);
X = data(:, 1:4); % 4 features (input)
Y = data(:, 5:6); % 2 continuous outputs (target)
N = size(X,1);
%-------------------------------------------------------
%% STEP [2]
% Normalize the data
%-------------------------------------------------------

% Note: z-score normalization was done in the preprocessing steps

%-------------------------------------------------------
%% STEP [3]
% Randomize data sets to remove ordering bias and split the data
% (70% training, 15% validation and 15% testing
%-------------------------------------------------------

% Randomize data
idx = randperm(N); % Shuffle the sample order with a random permutation to remove ordering bias
X = X(idx,:);
Y = Y(idx,:);

% Define the number of samples for each partition:
N_train = round(0.70 * N); % 70% for training
N_val = round(0.15 * N); % 15% for validation
N_test = N - N_train - N_val; % Remaining samples for testing

% Split the dataset into training, validation, and test sets.


X_train = X(1:N_train, :);
Y_train = Y(1:N_train, :);

X_val = X(N_train+1 : N_train+N_val, :);


Y_val = Y(N_train+1 : N_train+N_val, :);

X_test = X(N_train+N_val+1 : end, :);


Y_test = Y(N_train+N_val+1 : end, :);

%-------------------------------------------------------
%% STEP [4]
% Create sequences for Back Propagation Through Time (BPTT)
%-------------------------------------------------------
T = 10; % Time steps per sequence
[trainSequencesX, trainSequencesY] = createSequences(X_train, Y_train, T);
[valSequencesX, valSequencesY] = createSequences(X_val, Y_val, T);
[testSequencesX, testSequencesY] = createSequences(X_test, Y_test, T);
numTrainSeq = length(trainSequencesX);
%-------------------------------------------------------
%% STEP [5]
% Build the RNN architecture so you can update
% And initialize weights
%-------------------------------------------------------
input_dim = 4;
hidden_dim = 20;
output_dim = 2;
numHiddenLayers = 3;

% For each layer, the input dimension differs:


input_dims = [input_dim, repmat(hidden_dim, 1, numHiddenLayers-1)];
%-----------------------------------------------------------
% Use Xavier (Glorot) initialization for weights
%----------------------------------------------------------

% Pre-allocate cell array for deep RNN parameters


params = cell(numHiddenLayers,1);

% Xavier initialization for each layer's weights:


for l = 1:numHiddenLayers
cur_input_dim = input_dims(l);
limit_in = sqrt(6 / (cur_input_dim + hidden_dim));
limit_rec = sqrt(6 / (hidden_dim + hidden_dim));

% For layer l, we have input-to-hidden weights (W) and hidden-to-hidden
weights (U)
params{l}.W = rand(hidden_dim, cur_input_dim) * 2 * limit_in - limit_in;
params{l}.U = rand(hidden_dim, hidden_dim) * 2 * limit_rec - limit_rec;
params{l}.b = zeros(hidden_dim, 1);
end

% Output layer parameters (use same Xavier limit as input-to-hidden)


limit_out = sqrt(6 / (hidden_dim + output_dim));
V = rand(output_dim, hidden_dim) * 2*limit_out - limit_out;
b_y = zeros(output_dim, 1);
% ----------------------------------------------------------
% Initialize Momentum velocity terms
% --------------------------------------------------------

% Initialize momentum buffers for each deep RNN layer (as cell arrays)
W_vel = cell(numHiddenLayers,1); U_vel = cell(numHiddenLayers,1); b_vel =
cell(numHiddenLayers,1);
for l = 1:numHiddenLayers
W_vel{l} = zeros(size(params{l}.W));
U_vel{l} = zeros(size(params{l}.U));
b_vel{l} = zeros(size(params{l}.b));
end
% Momentum buffers for output layer:
V_vel = zeros(size(V));
b_y_vel = zeros(size(b_y));

%-------------------------------------------------------
%% STEP [6]
% Set up hyperparameters of the RNN and activation functions
%-------------------------------------------------------
alpha = 1e-3; % Learning rate
beta = 0.9; % Momentum coefficient
lamda = 1e-5; % L2 regularization
epoch = 100; % Number of epochs

% Activation function and derivative (tanh)


tanh_func = @(z) tanh(z);
d_tanh = @(a) 1 - a.^2; % derivative given activation value

% Storage for RMSE per epoch


train_RMSE = zeros(epoch,1);
val_RMSE = zeros(epoch,1);
test_RMSE = zeros(epoch,1);
% --- Record Weight Trajectories ---
% Initialize arrays to store a selected weight's trajectory.
weight_history_W1 = zeros(epoch, 1); % from first layer W(1,1)
weight_history_W2 = zeros(epoch, 1); % from first layer W(2,1)

%-------------------------------------------------------
%% STEP [7]
% Training the RNN with the training data set
%-------------------------------------------------------
for ep = 1:epoch
seqOrder = randperm(numTrainSeq);
for s = 1:numTrainSeq
seqIdx = seqOrder(s);
X_seq = trainSequencesX{seqIdx}; % (T x input_dim)
Y_seq = trainSequencesY{seqIdx}; % (T x output_dim)

% ----- Forward pass -----


[cache, ~] = deepRNNForwardPass_3Layers(X_seq, params, V, b_y,
tanh_func);

% ----- Backward pass -----


grads = deepRNNBackwardPass(X_seq, Y_seq, cache, params, V,
tanh_func, d_tanh);
grads = scaleGradients(grads, 1/T);
% ----------------------------------------------------------
% L2 REGULARIZATION
% --------------------------------------------------------

if lamda > 0
for l = 1:numHiddenLayers
grads{l}.dW = grads{l}.dW + lamda * params{l}.W;
grads{l}.dU = grads{l}.dU + lamda * params{l}.U;
end
grads{numHiddenLayers+1}.dV = grads{numHiddenLayers+1}.dV +
lamda * V;
end

% ----------------------------------------------------------
% MOMENTUM ADDED TO RNN
% ---------------------------------------------------------
for l = 1:numHiddenLayers
[params{l}.W, W_vel{l}] = momentumUpdate(params{l}.W, W_vel{l},
alpha, beta, grads{l}.dW);
[params{l}.U, U_vel{l}] = momentumUpdate(params{l}.U, U_vel{l},
alpha, beta, grads{l}.dU);
[params{l}.b, b_vel{l}] = momentumUpdate(params{l}.b, b_vel{l},
alpha, beta, grads{l}.db);
end
[V, V_vel] = momentumUpdate(V, V_vel, alpha, beta,
grads{numHiddenLayers+1}.dV);
[b_y, b_y_vel] = momentumUpdate(b_y, b_y_vel, alpha, beta,
grads{numHiddenLayers+1}.b_y);
end

% --- Record Weight Trajectories ---
% Here we record the (1,1) and (2,1) elements of params{1}.W over epochs.
weight_history_W1(ep) = params{1}.W(1,1);
weight_history_W2(ep) = params{1}.W(2,1);
% Evaluate RMSE on full training, validation, and test sets:
Yhat_train = deepRNNPredictBPTT(X_train, T, params, V, b_y, tanh_func);
train_RMSE(ep) = sqrt(mean((Y_train(:) - Yhat_train(:)).^2));

Yhat_val = deepRNNPredictBPTT(X_val, T, params, V, b_y, tanh_func);


val_RMSE(ep) = sqrt(mean((Y_val(:) - Yhat_val(:)).^2));

Yhat_test = deepRNNPredictBPTT(X_test, T, params, V, b_y, tanh_func);


test_RMSE(ep) = sqrt(mean((Y_test(:) - Yhat_test(:)).^2));

fprintf('Epoch %d: Train RMSE = %.4f, Val RMSE = %.4f, Test RMSE =
%.4f\n', ...
ep, train_RMSE(ep), val_RMSE(ep), test_RMSE(ep));
end

%-------------------------------------------------------
%% STEP [7]
% Plot all the necessary graphs
%-------------------------------------------------------
figure;
plot(1:epoch, train_RMSE, 'LineWidth',2); hold on;
plot(1:epoch, val_RMSE, '--', 'LineWidth',2);
plot(1:epoch, test_RMSE, '-.', 'LineWidth',2);
xlabel('Epoch'); ylabel('RMSE');
legend('Train', 'Val', 'Test', 'Location', 'best');
title('BPTT Deep RNN Training RMSE');
grid on;
disp('Training complete.');
% Plot the trajectory of selected weights versus epochs.
figure;
plot(1:epoch, weight_history_W1, 'm-', 'LineWidth', 2); hold on;
plot(1:epoch, weight_history_W2, 'c-', 'LineWidth', 2);
xlabel('Epoch');
ylabel('Weight Value');
title('Selected Weight Trajectories vs. Epoch');
legend('W(1,1)', 'W(2,1)');
grid on;
% --------------- End of deepRNN_BPTT_Regression code ---------------
save('RNN_weights.mat', 'params', 'V', 'b_y');
end
%-------------------------------------------------------
%% HELPER FUNCTIONS
%-------------------------------------------------------
%% Helper Function: Create Sequences
function [seqsX, seqsY] = createSequences(X, Y, T)

N = size(X,1);
numSeq = floor(N/T);
seqsX = cell(numSeq,1);
seqsY = cell(numSeq,1);
idx = 1;
for s = 1:numSeq
seqsX{s} = X(idx:idx+T-1, :);
seqsY{s} = Y(idx:idx+T-1, :);
idx = idx+T;
end
end
% ------------------------------------------------------------------------
%% Helper Function: Deep RNN Forward Pass (3 Layers, Unrolled over T Steps)
function [cache, h_end] = deepRNNForwardPass_3Layers(X_seq, params, V, b_y,
tanh_func)
% X_seq: (T x input_dim)
% params: cell array (length = 3) each with fields W, U, b for that layer.
% V, b_y: output layer parameters.
[T, ~] = size(X_seq);
numLayers = numel(params);
hidden_dim = size(params{1}.W,1);
output_dim = size(V,1);

% Initialize hidden states for each layer.


h = cell(numLayers,1);
for l = 1:numLayers
h{l} = zeros(hidden_dim, 1);
end

% Preallocate cache for each layer.


for l = 1:numLayers
cache.l{l}.h = zeros(hidden_dim, T);
end
cache.Y_hat = zeros(T, output_dim);
cache.X_seq = X_seq;

% Unroll through time


for t = 1:T
x_t = X_seq(t, :)';
for l = 1:numLayers
if l == 1
input_t = x_t;
else
input_t = h{l-1};
end
% Compute activation for layer l:
a_t = params{l}.W * input_t + params{l}.U * h{l} + params{l}.b;
h{l} = tanh_func(a_t);
cache.l{l}.h(:, t) = h{l};

end
% Compute output from top layer:
y_t = V * h{numLayers} + b_y;
cache.Y_hat(t, :) = y_t';
end
h_end = h{numLayers};
end
% ------------------------------------------------------------------------
%% Helper Function: Deep RNN Backward Pass (BPTT)
function grads = deepRNNBackwardPass(X_seq, Y_seq, cache, params, V, tanh_func,
d_tanh)
% X_seq: (T x input_dim)
% Y_seq: (T x output_dim)
% cache: structure from deepRNNForwardPass_3Layers, which contains:
% cache.l{l}.h (hidden states per layer, size [hidden_dim x T])
% cache.Y_hat (outputs, [T x output_dim])
% params: cell array (length = L) with fields:
% params{l}.W, params{l}.U, params{l}.b for each RNN layer.
% V, b_y: output layer parameters.
% tanh_func: activation function, e.g., @(z) tanh(z)
% d_tanh: derivative of tanh given activation value, e.g., @(a) 1 - a.^2
%
% This function computes the gradients via BPTT.

L = numel(params);
[T, ~] = size(X_seq);
hidden_dim = size(params{1}.W, 1);
output_dim = size(V, 1);

% Initialize output layer gradients.


gradOut.dV = zeros(size(V));
gradOut.db_y = zeros(output_dim, 1);

% Initialize gradients for each RNN layer.


for l = 1:L
grads{l}.dW = zeros(size(params{l}.W));
grads{l}.dU = zeros(size(params{l}.U));
grads{l}.db = zeros(hidden_dim, 1);
end

% Initialize error propagation variables for each layer.


for l = 1:L
dh_next{l} = zeros(hidden_dim, 1);
end

% Retrieve hidden states per layer.


H = cell(L,1);
for l = 1:L
H{l} = cache.l{l}.h; % each is [hidden_dim x T]

end
Y_hat = cache.Y_hat;

% Backpropagate through time (from t = T down to t = 1)


for t = T:-1:1
% For each layer, initialize current error (dh_curr) as a column vector.
for l = 1:L
dh_curr{l} = zeros(hidden_dim, 1);
end

% Output layer error.


y_t = Y_hat(t, :)'; % [output_dim x 1]
y_ref = Y_seq(t, :)'; % [output_dim x 1]
e_t = y_t - y_ref; % [output_dim x 1]

% Use hidden state from top layer to update output gradients.


h_top = H{L}(:, t); % [hidden_dim x 1]
gradOut.dV = gradOut.dV + e_t * h_top';
gradOut.db_y = gradOut.db_y + e_t;

% Backprop into top RNN layer:


dh = V' * e_t + dh_next{L}; % [hidden_dim x 1]
dh_curr{L} = dh;

% Process layers L down to 1.


for l = L:-1:1
% Retrieve h_prev from same layer.
if t == 1
h_prev = zeros(hidden_dim, 1);
else
h_prev = H{l}(:, t-1); % [hidden_dim x 1]
end

% Input to layer l
if l == 1
input_t = X_seq(t, :)'; % [input_dim x 1]
else
input_t = H{l-1}(:, t); % [hidden_dim x 1]
end

% Current hidden state


h_t = H{l}(:, t); % [hidden_dim x 1]

% Force both dh_curr{l} and h_t to be column vectors.


dh_vec = dh_curr{l}(:); % [hidden_dim x 1]
h_t = h_t(:); % [hidden_dim x 1]

% Debug (uncomment if needed):

% fprintf('t=%d, layer=%d: size(dh_curr) = %s, size(h_t) = %s\n', t,
l, mat2str(size(dh_vec)), mat2str(size(h_t)));

% Compute gradient through tanh activation:


d_a = dh_vec .* d_tanh(h_t); % both [hidden_dim x 1]

% Compute gradients for weight matrices and bias.


grads{l}.dW = grads{l}.dW + d_a * input_t';
grads{l}.dU = grads{l}.dU + d_a * h_prev';
grads{l}.db = grads{l}.db + d_a;

% Propagate error to previous time step within layer l.


dh_prev = params{l}.U' * d_a;
dh_next{l} = dh_prev;

% For layers below, propagate error from current layer’s input.


if l > 1
if t == T
dh_next{l-1} = zeros(hidden_dim, 1);
end
dh_next{l-1} = dh_next{l-1} + dh_prev;
end

if t > 1
dh_curr{l} = dh_next{l};
end
end % end for l
end % end for t

% Append output layer gradients as the (L+1)-th element.


grads{L+1} = struct('dV', gradOut.dV, 'b_y', gradOut.db_y);
end
% ------------------------------------------------------------------------
%% Helper Function: Scale Gradients by a Factor
function newGrads = scaleGradients(grads, factor)
if iscell(grads)
newGrads = cell(size(grads));
for k = 1:numel(grads)
if isstruct(grads{k})
fn = fieldnames(grads{k});
for j = 1:numel(fn)
newGrads{k}.(fn{j}) = factor * grads{k}.(fn{j});
end
else
newGrads{k} = factor * grads{k};
end
end
else
fn = fieldnames(grads);

newGrads = grads;
for k = 1:numel(fn)
newGrads.(fn{k}) = factor * grads.(fn{k});
end
end
end
%% Helper Function: Momentum Update
function [param, param_vel] = momentumUpdate(param, param_vel, alpha, beta,
dparam)
param_vel = beta * param_vel + alpha * dparam;
param = param - param_vel;
end
% ------------------------------------------------------------------------
%% Helper Function: Deep RNN Prediction Using BPTT (Unrolled Forward Pass)
function Y_hat_all = deepRNNPredictBPTT(X_data, T, params, V, b_y, tanh_func)
N = size(X_data,1);
numSeq = floor(N/T);
output_dim = size(V,1);
Y_hat_all = zeros(N, output_dim);
idx = 1;
for s = 1:numSeq
X_seq = X_data(idx:idx+T-1, :);
[cache, ~] = deepRNNForwardPass_3Layers(X_seq, params, V, b_y,
tanh_func);
Y_hat_seq = cache.Y_hat;
Y_hat_all(idx:idx+T-1, :) = Y_hat_seq;
idx = idx + T;
end
end

9.1.4.​ Long Short Term Memory


%-------------------------------------------------------
% A Long Short Term Memory Network
% By: Guiah Soumahoro (Debugging of LSTM was done using ChatGPT)
%------------------------------------------------------
function LSTM_BPTT_Regression_Batched()
clc;
clear all;
close all;
%-------------------------------------------------------
%% STEP [1]
% Load the data and set up parameters (excel dataset needs to be in the
% same folder
%-------------------------------------------------------
try
dataTable = readtable('DatasetPCA.csv');
catch

error('Could not read DatasetPCA.csv. Make sure it is in the current
folder.');
end
data = table2array(dataTable);
X = data(:, 1:4); % 4 features (input)
Y = data(:, 5:6); % 2 continuous outputs (target)
N = size(data,1);
%-------------------------------------------------------
%% STEP [2]
% Normalize the data
%-------------------------------------------------------

% Note: z-score normalization was done in the preprocessing steps

%-------------------------------------------------------
%% STEP [3]
% Randomize data sets to remove ordering bias and split the data
% (70% training, 15% validation and 15% testing
%-------------------------------------------------------

% Randomize data
idx = randperm(N); % Shuffle the sample order with a random permutation to remove ordering bias
X = X(idx,:);
Y = Y(idx,:);

% Define the number of samples for each partition:


N_train = round(0.70 * N); % 70% for training
N_val = round(0.15 * N); % 15% for validation
N_test = N - N_train - N_val; % Remaining samples for testing

% Split the dataset into training, validation, and test sets.


X_train = X(1:N_train, :);
Y_train = Y(1:N_train, :);

X_val = X(N_train+1 : N_train+N_val, :);


Y_val = Y(N_train+1 : N_train+N_val, :);

X_test = X(N_train+N_val+1 : end, :);


Y_test = Y(N_train+N_val+1 : end, :);

%-------------------------------------------------------
%% STEP [4]
% Create sequences for Back Propagation Through Time (BPTT)
%-------------------------------------------------------
T = 10; % Number of time steps per sequence
[trainSequencesX, trainSequencesY] = createSequences(X_train, Y_train, T);
[valSequencesX, valSequencesY] = createSequences(X_val, Y_val, T);
[testSequencesX, testSequencesY] = createSequences(X_test, Y_test, T);

numTrainSeq = length(trainSequencesX);
%-------------------------------------------------------
%% STEP [5]
% Build the LSTM architecture so you can update
% And initialize weights
%-------------------------------------------------------
input_dim = 4;
hidden_dim = 20;
output_dim = 2;
numHiddenLayers = 3;

% For each layer, the input dimension is different:


input_dims = [input_dim, repmat(hidden_dim, 1, numHiddenLayers-1)];
% Gradient clipping value
clipValue = 5;
%-----------------------------------------------------------
% Use Xavier (Glorot) initialization for weights
%-----------------------------------------------------------

% Initialize LSTM parameters using Xavier initialization


params = cell(numHiddenLayers,1);
for l = 1:numHiddenLayers
current_input_dim = input_dims(l);
limit_in = sqrt(6 / (current_input_dim + hidden_dim));
limit_rec = sqrt(6 / (hidden_dim + hidden_dim));

% For the LSTM, we initialize parameters for four gates: input (u),
forget (f), output (o) and candidate (g).
params{l}.U_u = rand(hidden_dim, current_input_dim) * 2 * limit_in -
limit_in;
params{l}.W_u = rand(hidden_dim, hidden_dim) * 2 * limit_rec -
limit_rec;
params{l}.b_u = zeros(hidden_dim, 1);

params{l}.U_f = rand(hidden_dim, current_input_dim) * 2 * limit_in -


limit_in;
params{l}.W_f = rand(hidden_dim, hidden_dim) * 2 * limit_rec -
limit_rec;
params{l}.b_f = ones(hidden_dim, 1); % commonly set to 1

params{l}.U_o = rand(hidden_dim, current_input_dim) * 2 * limit_in -


limit_in;
params{l}.W_o = rand(hidden_dim, hidden_dim) * 2 * limit_rec -
limit_rec;
params{l}.b_o = zeros(hidden_dim, 1);

params{l}.U_g = rand(hidden_dim, current_input_dim) * 2 * limit_in -


limit_in;

params{l}.W_g = rand(hidden_dim, hidden_dim) * 2 * limit_rec -
limit_rec;
params{l}.b_g = zeros(hidden_dim, 1);
end
% Output layer parameters
limit_out = sqrt(6 / (hidden_dim + output_dim));
V = rand(output_dim, hidden_dim) * 2 * limit_out - limit_out;
b_v = zeros(output_dim, 1);
% ----------------------------------------------------------
% Initialize Momentum velocity terms
% ---------------------------------------------------------

% Initialize momentum velocity buffers for LSTM parameters


W_u_vel = cell(numHiddenLayers,1);
b_u_vel = cell(numHiddenLayers,1);
U_u_vel = cell(numHiddenLayers,1);
W_f_vel = cell(numHiddenLayers,1);
b_f_vel = cell(numHiddenLayers,1);
U_f_vel = cell(numHiddenLayers,1);
W_o_vel = cell(numHiddenLayers,1);
b_o_vel = cell(numHiddenLayers,1);
U_o_vel = cell(numHiddenLayers,1);
W_g_vel = cell(numHiddenLayers,1);
b_g_vel = cell(numHiddenLayers,1);
U_g_vel = cell(numHiddenLayers,1);
for l = 1:numHiddenLayers
W_u_vel{l} = zeros(size(params{l}.W_u));
b_u_vel{l} = zeros(size(params{l}.b_u));
U_u_vel{l} = zeros(size(params{l}.U_u));

W_f_vel{l} = zeros(size(params{l}.W_f));
b_f_vel{l} = zeros(size(params{l}.b_f));
U_f_vel{l} = zeros(size(params{l}.U_f));

W_o_vel{l} = zeros(size(params{l}.W_o));
b_o_vel{l} = zeros(size(params{l}.b_o));
U_o_vel{l} = zeros(size(params{l}.U_o));

W_g_vel{l} = zeros(size(params{l}.W_g));
b_g_vel{l} = zeros(size(params{l}.b_g));
U_g_vel{l} = zeros(size(params{l}.U_g));
end
V_vel = zeros(size(V));
b_v_vel = zeros(size(b_v));

%-------------------------------------------------------
%% STEP [6]
% Set up hyperparameters of the LSTM and activation functions
%-------------------------------------------------------

alfa = 1e-4; % Learning rate
beta = 0.9; % Momentum coefficient
lamda = 1e-5; % L2 regularization (set to 0 to disable during debugging)
epoch = 100; % Number of training epochs

% Activation functions and their derivatives:


sigmoid_func = @(z) 1./(1+exp(-z));
tanh_func = @(z) tanh(z);
d_sigmoid = @(a) a.*(1-a);
d_tanh = @(a) 1 - a.^2; % derivative of tanh given the activation value a = tanh(z)

% Mini-batch settings:
batchSize = 16; % Number of sequences per mini-batch
numBatches = ceil(numTrainSeq / batchSize);
% Storage for RMSE per epoch:
train_RMSE = zeros(epoch, 1);
val_RMSE = zeros(epoch, 1);
test_RMSE = zeros(epoch, 1);
% --- Record Weight Trajectories ---
% Initialize arrays to store a selected weight's trajectory.
weight_history_Wu = zeros(epoch, 1); % from first layer W_u(1,1)
weight_history_V = zeros(epoch, 1); % from output layer V(1,1)
%-------------------------------------------------------
%% STEP [7]
% Training the LSTM with the training data set
%-------------------------------------------------------
seqIndices = 1:numTrainSeq;
for ep = 1:epoch
% Shuffle sequences each epoch:
seqShuffled = seqIndices(randperm(numTrainSeq));
for b = 1:numBatches
idxStart = (b-1)*batchSize + 1;
idxEnd = min(b * batchSize, numTrainSeq);
currentBatch = seqShuffled(idxStart:idxEnd);
% Concatenate sequences into 3D arrays: [T x input_dim x
currentBatchSize]
X_batch = cat(3, trainSequencesX{currentBatch});
Y_batch = cat(3, trainSequencesY{currentBatch});

% ----- Forward pass on the batch -----


[cache, ~, ~] = deepLSTMForwardPass_Batch(X_batch, params, V, b_v,
sigmoid_func, tanh_func);

% ----- Backward pass on the batch -----


grads = deepLSTMBackwardPass_Batch(X_batch, Y_batch, cache, params,
V, ...
sigmoid_func, d_sigmoid, tanh_func, d_tanh);
% Scale gradients by 1/T:
grads = scaleGradients(grads, 1/T);

% ----------------------------------------------------------
% L2 REGULARIZATION
% --------------------------------------------------------
if lamda > 0
for l = 1:numHiddenLayers
grads{l}.dW_u = grads{l}.dW_u + lamda * params{l}.W_u;
grads{l}.dU_u = grads{l}.dU_u + lamda * params{l}.U_u;

grads{l}.dW_f = grads{l}.dW_f + lamda * params{l}.W_f;


grads{l}.dU_f = grads{l}.dU_f + lamda * params{l}.U_f;

grads{l}.dW_o = grads{l}.dW_o + lamda * params{l}.W_o;


grads{l}.dU_o = grads{l}.dU_o + lamda * params{l}.U_o;

grads{l}.dW_g = grads{l}.dW_g + lamda * params{l}.W_g;


grads{l}.dU_g = grads{l}.dU_g + lamda * params{l}.U_g;
end
grads{numHiddenLayers+1}.V = grads{numHiddenLayers+1}.V + lamda
* V;
end

% ----------------------------------------------------------
% MOMENTUM ADDED TO LSTM
% ---------------------------------------------------------
for l = 1:numHiddenLayers
[params{l}.W_u, W_u_vel{l}] = momentumUpdate(params{l}.W_u,
W_u_vel{l}, alfa, beta, clipGradient(grads{l}.dW_u, clipValue));
[params{l}.b_u, b_u_vel{l}] = momentumUpdate(params{l}.b_u,
b_u_vel{l}, alfa, beta, clipGradient(grads{l}.db_u, clipValue));
[params{l}.U_u, U_u_vel{l}] = momentumUpdate(params{l}.U_u,
U_u_vel{l}, alfa, beta, clipGradient(grads{l}.dU_u, clipValue));
[params{l}.W_f, W_f_vel{l}] = momentumUpdate(params{l}.W_f,
W_f_vel{l}, alfa, beta, clipGradient(grads{l}.dW_f, clipValue));
[params{l}.b_f, b_f_vel{l}] = momentumUpdate(params{l}.b_f,
b_f_vel{l}, alfa, beta, clipGradient(grads{l}.db_f, clipValue));
[params{l}.U_f, U_f_vel{l}] = momentumUpdate(params{l}.U_f,
U_f_vel{l}, alfa, beta, clipGradient(grads{l}.dU_f, clipValue));
[params{l}.W_o, W_o_vel{l}] = momentumUpdate(params{l}.W_o,
W_o_vel{l}, alfa, beta, clipGradient(grads{l}.dW_o, clipValue));
[params{l}.b_o, b_o_vel{l}] = momentumUpdate(params{l}.b_o,
b_o_vel{l}, alfa, beta, clipGradient(grads{l}.db_o, clipValue));
[params{l}.U_o, U_o_vel{l}] = momentumUpdate(params{l}.U_o,
U_o_vel{l}, alfa, beta, clipGradient(grads{l}.dU_o, clipValue));
[params{l}.W_g, W_g_vel{l}] = momentumUpdate(params{l}.W_g,
W_g_vel{l}, alfa, beta, clipGradient(grads{l}.dW_g, clipValue));
[params{l}.b_g, b_g_vel{l}] = momentumUpdate(params{l}.b_g,
b_g_vel{l}, alfa, beta, clipGradient(grads{l}.db_g, clipValue));

[params{l}.U_g, U_g_vel{l}] = momentumUpdate(params{l}.U_g,
U_g_vel{l}, alfa, beta, clipGradient(grads{l}.dU_g, clipValue));
end

% Output layer updates:


[V, V_vel] = momentumUpdate(V, V_vel, alfa, beta,
grads{numHiddenLayers+1}.V);
[b_v, b_v_vel] = momentumUpdate(b_v, b_v_vel, alfa, beta,
grads{numHiddenLayers+1}.b_v);
end
gradNorm = norm(grads{l}.dW_u(:));
fprintf('Layer %d, dW_u norm: %.4f\n', l, gradNorm);
% --- Record Weight Trajectories ---
% Here we record the (1,1) elements of params{1}.W_u and V over epochs.
weight_history_Wu(ep) = params{1}.W_u(1,1);
weight_history_V(ep) = V(1,1);

% Evaluate RMSE on full training, validation, and test sets:


Yhat_train = deepLSTMPredictBPTT(X_train, T, params, V, b_v,
sigmoid_func, tanh_func);
train_RMSE(ep) = sqrt(mean((Y_train(:) - Yhat_train(:)).^2));

Yhat_val = deepLSTMPredictBPTT(X_val, T, params, V, b_v, sigmoid_func,


tanh_func);
val_RMSE(ep) = sqrt(mean((Y_val(:) - Yhat_val(:)).^2));

Yhat_test = deepLSTMPredictBPTT(X_test, T, params, V, b_v, sigmoid_func,


tanh_func);
test_RMSE(ep) = sqrt(mean((Y_test(:) - Yhat_test(:)).^2));

fprintf('Epoch %2d: Train RMSE = %.4f, Val RMSE = %.4f, Test RMSE =
%.4f\n',...
ep, train_RMSE(ep), val_RMSE(ep), test_RMSE(ep));
end
%-------------------------------------------------------
%% STEP [7]
% Plot all the necessary graphs
%-------------------------------------------------------
figure;
plot(1:epoch, train_RMSE, 'LineWidth',2); hold on;
plot(1:epoch, val_RMSE, '--', 'LineWidth',2);
plot(1:epoch, test_RMSE, '-.', 'LineWidth',2);
xlabel('Epoch'); ylabel('RMSE');
legend('Train','Val','Test','Location','best');
title('BPTT LSTM with Mini-Batches: Training RMSE');
grid on;
disp('Training complete.');
% Plot the trajectory of selected weights versus epochs.
figure;

plot(1:epoch, weight_history_Wu, 'm-', 'LineWidth', 2); hold on;
plot(1:epoch, weight_history_V, 'c-', 'LineWidth', 2);
xlabel('Epoch');
ylabel('Weight Value');
title('Selected Weight Trajectories vs. Epoch');
legend('W_u(1,1)', 'V(1,1)');
grid on;
% --------------- End of LSTM_BPTT_Regression_Batched code ---------------
save('LSTM_weights.mat', 'params', 'V', 'b_v');
end
%-------------------------------------------------------
%% HELPER FUNCTIONS
%-------------------------------------------------------
%% Helper Function: Create Sequences
function [seqsX, seqsY] = createSequences(X, Y, T)
N = size(X,1);
numSeq = floor(N / T);
seqsX = cell(numSeq, 1);
seqsY = cell(numSeq, 1);
idx = 1;
for s = 1:numSeq
seqsX{s} = X(idx:idx+T-1, :);
seqsY{s} = Y(idx:idx+T-1, :);
idx = idx + T;
end
end
% ------------------------------------------------------------------------
%% Batched LSTM Forward Pass Function
function [cache, h_end, c_end] = deepLSTMForwardPass_Batch(X_batch, params, V,
b_v, sigmoid_func, tanh_func)
% X_batch: [T x input_dim x batchSize]
[T, inputDim, batchSize] = size(X_batch);
numLayers = numel(params);
hidden_dim = size(params{1}.W_u,1);
output_dim = size(V,1);

% Initialize hidden and cell states for each layer: [hidden_dim x batchSize]
h = cell(numLayers,1); c = cell(numLayers,1);
for l = 1:numLayers
h{l} = zeros(hidden_dim, batchSize);
c{l} = zeros(hidden_dim, batchSize);
end

% Preallocate cache for each layer:


for l = 1:numLayers
cache.l{l}.i = zeros(hidden_dim, T, batchSize);
cache.l{l}.f = zeros(hidden_dim, T, batchSize);
cache.l{l}.o = zeros(hidden_dim, T, batchSize);
cache.l{l}.g = zeros(hidden_dim, T, batchSize);

cache.l{l}.c = zeros(hidden_dim, T, batchSize);
cache.l{l}.h = zeros(hidden_dim, T, batchSize);
end
cache.Y_hat = zeros(T, output_dim, batchSize);
cache.X_batch = X_batch;

for t = 1:T
x_t = reshape(X_batch(t,:,:), [inputDim, batchSize]); % [input_dim x
batchSize]
for l = 1:numLayers
if l == 1
input_t = x_t;
else
input_t = h{l-1};
end
% Retrieve parameters for layer l:
U_u = params{l}.U_u; W_u = params{l}.W_u; b_u = params{l}.b_u;
U_f = params{l}.U_f; W_f = params{l}.W_f; b_f = params{l}.b_f;
U_o = params{l}.U_o; W_o = params{l}.W_o; b_o = params{l}.b_o;
U_g = params{l}.U_g; W_g = params{l}.W_g; b_g = params{l}.b_g;

% Compute LSTM gate activations vectorized over batch


i_t = sigmoid_func(U_u * input_t + W_u * h{l} +
repmat(b_u,1,batchSize));
f_t = sigmoid_func(U_f * input_t + W_f * h{l} +
repmat(b_f,1,batchSize));
o_t = sigmoid_func(U_o * input_t + W_o * h{l} +
repmat(b_o,1,batchSize));
g_t = tanh_func(U_g * input_t + W_g * h{l} +
repmat(b_g,1,batchSize));

% Update cell and hidden states


c{l} = f_t .* c{l} + i_t .* g_t;
h{l} = o_t .* tanh_func(c{l});

% Save states in cache


cache.l{l}.i(:,t,:) = i_t;
cache.l{l}.f(:,t,:) = f_t;
cache.l{l}.o(:,t,:) = o_t;
cache.l{l}.g(:,t,:) = g_t;
cache.l{l}.c(:,t,:) = c{l};
cache.l{l}.h(:,t,:) = h{l};
end
% Compute output from the top layer:
y_t = V * h{numLayers} + repmat(b_v,1,batchSize);
cache.Y_hat(t,:,:) = y_t;
end
h_end = h{numLayers};
c_end = c{numLayers};

end
% ------------------------------------------------------------------------
%% Batched LSTM Backward Pass Function
function grads = deepLSTMBackwardPass_Batch(X_batch, Y_batch, cache, params, V,
...
sigmoid_func, d_sigmoid, tanh_func, d_tanh)
% X_batch: [T x input_dim x batchSize]
% Y_batch: [T x output_dim x batchSize]
[T, ~, batchSize] = size(X_batch);
numLayers = numel(params);
hidden_dim = size(params{1}.W_u,1);
output_dim = size(V,1);

% Initialize output-layer gradients:


gradOut.dV = zeros(size(V));
gradOut.db_v = zeros(output_dim,1);

% Initialize LSTM layer gradients:


for l = 1:numLayers
grads{l}.dW_u = zeros(size(params{l}.W_u));
grads{l}.dU_u = zeros(size(params{l}.U_u));
grads{l}.db_u = zeros(hidden_dim,1);

grads{l}.dW_f = zeros(size(params{l}.W_f));
grads{l}.dU_f = zeros(size(params{l}.U_f));
grads{l}.db_f = zeros(hidden_dim,1);

grads{l}.dW_o = zeros(size(params{l}.W_o));
grads{l}.dU_o = zeros(size(params{l}.U_o));
grads{l}.db_o = zeros(hidden_dim,1);

grads{l}.dW_g = zeros(size(params{l}.W_g));
grads{l}.dU_g = zeros(size(params{l}.U_g));
grads{l}.db_g = zeros(hidden_dim,1);
end

% Initialize dh_next and dc_next for each layer:


dh_next = cell(numLayers,1);
dc_next = cell(numLayers,1);
for l = 1:numLayers
dh_next{l} = zeros(hidden_dim, batchSize);
dc_next{l} = zeros(hidden_dim, batchSize);
end

% Loop backward in time:


for t = T:-1:1
% Output layer error:
y_t = reshape(cache.Y_hat(t,:,:), [output_dim, batchSize]);
y_ref = reshape(Y_batch(t,:,:), [output_dim, batchSize]);

e_t = y_t - y_ref; % [output_dim x batchSize]

h_top = reshape(cache.l{numLayers}.h(:,t,:), [hidden_dim, batchSize]);


gradOut.dV = gradOut.dV + e_t * h_top';
gradOut.db_v = gradOut.db_v + sum(e_t,2);

dh = V' * e_t + dh_next{numLayers}; % [hidden_dim x batchSize]

% Process layers L downto 1:


for l = numLayers:-1:1
% Retrieve cached activations for layer l at time t:
i_t = reshape(cache.l{l}.i(:,t,:), [hidden_dim, batchSize]);
f_t = reshape(cache.l{l}.f(:,t,:), [hidden_dim, batchSize]);
o_t = reshape(cache.l{l}.o(:,t,:), [hidden_dim, batchSize]);
g_t = reshape(cache.l{l}.g(:,t,:), [hidden_dim, batchSize]);
c_t = reshape(cache.l{l}.c(:,t,:), [hidden_dim, batchSize]);
if t == 1
c_prev = zeros(hidden_dim, batchSize);
h_prev = zeros(hidden_dim, batchSize);
else
c_prev = reshape(cache.l{l}.c(:,t-1,:), [hidden_dim,
batchSize]);
h_prev = reshape(cache.l{l}.h(:,t-1,:), [hidden_dim,
batchSize]);
end

if l == 1
input_l = reshape(X_batch(t,:,:), [size(X_batch,2), batchSize]);
else
input_l = reshape(cache.l{l-1}.h(:,t,:), [hidden_dim,
batchSize]);
end

% Backprop through output of LSTM cell:


do = dh .* tanh_func(c_t);
d_o = do .* d_sigmoid(o_t);

dct = dh .* o_t .* d_tanh(tanh_func(c_t)) + dc_next{l};


di = dct .* g_t;
df = dct .* c_prev;
dg = dct .* i_t;

d_i = di .* d_sigmoid(i_t);
d_f = df .* d_sigmoid(f_t);
d_g = dg .* d_tanh(g_t);

% Gradients for parameters in layer l:


grads{l}.dU_u = grads{l}.dU_u + d_i * input_l';
grads{l}.dW_u = grads{l}.dW_u + d_i * h_prev';

grads{l}.db_u = grads{l}.db_u + sum(d_i,2);

grads{l}.dU_f = grads{l}.dU_f + d_f * input_l';


grads{l}.dW_f = grads{l}.dW_f + d_f * h_prev';
grads{l}.db_f = grads{l}.db_f + sum(d_f,2);

grads{l}.dU_o = grads{l}.dU_o + d_o * input_l';


grads{l}.dW_o = grads{l}.dW_o + d_o * h_prev';
grads{l}.db_o = grads{l}.db_o + sum(d_o,2);

grads{l}.dU_g = grads{l}.dU_g + d_g * input_l';


grads{l}.dW_g = grads{l}.dW_g + d_g * h_prev';
grads{l}.db_g = grads{l}.db_g + sum(d_g,2);

% Propagate gradients for h and c:


dh_prev = params{l}.W_u' * d_i + params{l}.W_f' * d_f + ...
    params{l}.W_o' * d_o + params{l}.W_g' * d_g;
dc_prev = dct .* f_t;

dh_next{l} = dh_prev;
dc_next{l} = dc_prev;
if l > 1
dh_next{l-1} = dh_next{l-1} + params{l}.U_u' * d_i + params{l}.U_f' * d_f + ...
    params{l}.U_o' * d_o + params{l}.U_g' * d_g;
end

if t > 1
dh = dh_next{l};
end
end
end
grads{numLayers+1} = struct('V', gradOut.dV, 'b_v', gradOut.db_v);
end
% ------------------------------------------------------------------------
%% Helper Function: Scale Gradients by a Factor
function newGrads = scaleGradients(grads, factor)
if iscell(grads)
newGrads = cell(size(grads));
for k = 1:numel(grads)
if isstruct(grads{k})
fn = fieldnames(grads{k});
for j = 1:numel(fn)
newGrads{k}.(fn{j}) = factor * grads{k}.(fn{j});
end
else
newGrads{k} = factor * grads{k};
end
end

else
fn = fieldnames(grads);
newGrads = grads;
for k = 1:numel(fn)
newGrads.(fn{k}) = factor * grads.(fn{k});
end
end
end
% ------------------------------------------------------------------------
%% Helper Function: Momentum Update
function [param, param_vel] = momentumUpdate(param, param_vel, alpha, beta,
dparam)
param_vel = beta * param_vel + alpha * dparam;
param = param - param_vel;
end
% ------------------------------------------------------------------------
%% Helper Function: LSTM Prediction Using BPTT (Processes dataset in sequences)
function Y_hat_all = deepLSTMPredictBPTT(X_data, T, params, V, b_v,
sigmoid_func, tanh_func)
N = size(X_data,1);
numSeq = floor(N / T);
output_dim = size(V,1);
Y_hat_all = zeros(N, output_dim);
idx = 1;
for s = 1:numSeq
X_seq = X_data(idx:idx+T-1, :);
% Use batched forward pass with batchSize=1 for prediction
[cache, ~, ~] = deepLSTMForwardPass_Batch(reshape(X_seq, [T,
size(X_seq,2), 1]), ...
params, V, b_v, sigmoid_func, tanh_func);
Y_hat_seq = reshape(cache.Y_hat, [T, output_dim]);
Y_hat_all(idx:idx+T-1, :) = Y_hat_seq;
idx = idx + T;
end
end
%% Helper: Gradient Clipping
function gradClipped = clipGradient(grad, clipValue)
% Clipping the gradient elementwise to be within [-clipValue, clipValue]
gradClipped = max(min(grad, clipValue), -clipValue);
end

9.1.5.​ Gated Recurrent Unit


%-------------------------------------------------------
% A Deep Gated Recurrent Unit
% By: Guiah Soumahoro (Debugging of GRU was done using ChatGPT)
%-------------------------------------------------------
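% The code below implements the standard GRU cell, layer by layer and time
% step by time step:
%   z_t  = sigmoid(W_z*x_t + U_z*h_{t-1} + b_z)              (update gate)
%   r_t  = sigmoid(W_r*x_t + U_r*h_{t-1} + b_r)              (reset gate)
%   h~_t = tanh(W_h*x_t + U_h*(r_t .* h_{t-1}) + b_h)        (candidate state)
%   h_t  = (1 - z_t) .* h_{t-1} + z_t .* h~_t
% with a linear read-out y_t = V*h_t + b_v from the top layer.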
function GRU_BPTT_Regression_Batched()
clc;

clear all;
close all;
%-------------------------------------------------------
%% STEP [1]
% Load the data and set up parameters (the CSV dataset needs to be in the
% same folder)
%-------------------------------------------------------
try
dataTable = readtable('DatasetPCA.csv');
catch
error('Could not read DatasetPCA.csv. Make sure it is in the current folder.');
end
data = table2array(dataTable);
X = data(:, 1:4); % 4 features (input)
Y = data(:, 5:6); % 2 continuous outputs (target)
N = size(data,1);
%-------------------------------------------------------
%% STEP [2]
% Normalize the data
%-------------------------------------------------------

% Note: z-score normalization was done in the preprocessing steps

%-------------------------------------------------------
%% STEP [3]
% Randomize data sets to remove ordering bias and split the data
% (70% training, 15% validation and 15% testing)
%-------------------------------------------------------

% Randomize data
idx = randperm(N); % Shuffle the row indices once and apply the same
% permutation to X and Y so the input/target pairing stays aligned.
X = X(idx,:);
Y = Y(idx,:);

% Define the number of samples for each partition:


N_train = round(0.70 * N); % 70% for training
N_val = round(0.15 * N); % 15% for validation
N_test = N - N_train - N_val; % Remaining samples for testing

% Split the dataset into training, validation, and test sets.


X_train = X(1:N_train, :);
Y_train = Y(1:N_train, :);

X_val = X(N_train+1 : N_train+N_val, :);


Y_val = Y(N_train+1 : N_train+N_val, :);

X_test = X(N_train+N_val+1 : end, :);

Y_test = Y(N_train+N_val+1 : end, :);

%-------------------------------------------------------
%% STEP [4]
% Create sequences for Back Propagation Through Time (BPTT)
%-------------------------------------------------------
T = 10; % Number of time steps per sequence
[trainSequencesX, trainSequencesY] = createSequences(X_train, Y_train, T);
[valSequencesX, valSequencesY] = createSequences(X_val, Y_val, T);
[testSequencesX, testSequencesY] = createSequences(X_test, Y_test, T);
numTrainSeq = length(trainSequencesX);
%-------------------------------------------------------
%% STEP [5]
% Build the GRU architecture and initialize the weights
%-------------------------------------------------------
input_dim = 4;
hidden_dim = 20;
output_dim = 2;
numHiddenLayers = 3;

% Define each layer's input dimension:


input_dims = [input_dim, repmat(hidden_dim,1,numHiddenLayers-1)];
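% Layer 1 receives the 4 raw input features; every deeper layer receives the
% 20-dimensional hidden state of the layer below, so only the first entry of
% input_dims differs from hidden_dim.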

%-----------------------------------------------------------
% Use Xavier (Glorot) initialization for weights
%-----------------------------------------------------------

% Initialize GRU parameters with Xavier initialization


params = cell(numHiddenLayers,1);
% For loop will initialize parameters for the number of hidden layers
for l = 1:numHiddenLayers
current_input_dim = input_dims(l);
limit_in = sqrt(6 / (current_input_dim + hidden_dim));
limit_rec = sqrt(6 / (hidden_dim + hidden_dim));

% Update gate parameters


params{l}.W_z = rand(hidden_dim, current_input_dim) * 2*limit_in - limit_in;
params{l}.U_z = rand(hidden_dim, hidden_dim) * 2*limit_rec - limit_rec;
params{l}.b_z = zeros(hidden_dim, 1);

% Reset gate parameters
params{l}.W_r = rand(hidden_dim, current_input_dim) * 2*limit_in - limit_in;
params{l}.U_r = rand(hidden_dim, hidden_dim) * 2*limit_rec - limit_rec;
params{l}.b_r = zeros(hidden_dim, 1);

% Candidate hidden state parameters
params{l}.W_h = rand(hidden_dim, current_input_dim) * 2*limit_in - limit_in;
params{l}.U_h = rand(hidden_dim, hidden_dim) * 2*limit_rec - limit_rec;
params{l}.b_h = zeros(hidden_dim, 1);
end
% Output layer:
limit_out = sqrt(6 / (hidden_dim + output_dim));
V = rand(output_dim, hidden_dim) * 2*limit_out - limit_out;
b_v = zeros(output_dim, 1);

% ----------------------------------------------------------
% Initialize Momentum velocity terms
% ---------------------------------------------------------
% Initialize momentum velocity terms for GRU layers
W_z_vel = cell(numHiddenLayers,1); U_z_vel = cell(numHiddenLayers,1);
b_z_vel = cell(numHiddenLayers,1);
W_r_vel = cell(numHiddenLayers,1); U_r_vel = cell(numHiddenLayers,1);
b_r_vel = cell(numHiddenLayers,1);
W_h_vel = cell(numHiddenLayers,1); U_h_vel = cell(numHiddenLayers,1);
b_h_vel = cell(numHiddenLayers,1);

for l = 1:numHiddenLayers
W_z_vel{l} = zeros(size(params{l}.W_z));
U_z_vel{l} = zeros(size(params{l}.U_z));
b_z_vel{l} = zeros(size(params{l}.b_z));

W_r_vel{l} = zeros(size(params{l}.W_r));
U_r_vel{l} = zeros(size(params{l}.U_r));
b_r_vel{l} = zeros(size(params{l}.b_r));

W_h_vel{l} = zeros(size(params{l}.W_h));
U_h_vel{l} = zeros(size(params{l}.U_h));
b_h_vel{l} = zeros(size(params{l}.b_h));
end
V_vel = zeros(size(V));
b_v_vel = zeros(size(b_v));

%-------------------------------------------------------
%% STEP [6]
% Set up hyperparameters of the GRU and activation functions
%-------------------------------------------------------
alfa = 1e-3; % Learning rate
beta = 0.9; % Momentum coefficient
lamda = 1e-5; % L2 regularization (set to 0 for debugging)
epoch = 100; % Number of training epochs

% Activation functions and derivatives:
sigmoid_func = @(z) 1./(1+exp(-z));
tanh_func = @(z) tanh(z);
d_sigmoid = @(a) a.*(1-a);
d_tanh = @(a) 1 - tanh(a).^2;

% Mini-batch settings:
batchSize = 16; % Number of sequences per mini-batch
numBatches = ceil(numTrainSeq / batchSize);
% Storage for RMSE per epoch:
train_RMSE = zeros(epoch, 1);
val_RMSE = zeros(epoch, 1);
test_RMSE = zeros(epoch, 1);
% --- Record Weight Trajectories ---
% Initialize arrays to store a selected weight's trajectory.
weight_history_Wz = zeros(epoch, 1); % from first layer W_z(1,1)
weight_history_V = zeros(epoch, 1); % from output layer V(1,1)

%-------------------------------------------------------
%% STEP [7]
% Training the GRU with the training data set
%-------------------------------------------------------
seqIndices = 1:numTrainSeq;
for ep = 1:epoch
seqShuffled = seqIndices(randperm(numTrainSeq));
for b = 1:numBatches
idxStart = (b-1)*batchSize + 1;
idxEnd = min(b*batchSize, numTrainSeq);
currentBatch = seqShuffled(idxStart:idxEnd);

% Concatenate sequences: X_batch [T x input_dim x curBatchSize], Y_batch similarly.
X_batch = cat(3, trainSequencesX{currentBatch});
Y_batch = cat(3, trainSequencesY{currentBatch});

% ----- Forward pass on the batch -----


[cache, ~] = deepGRUForwardPass_Batch(X_batch, params, V, b_v, sigmoid_func, tanh_func);

% ----- Backward pass on the batch -----


grads = deepGRUBackwardPass_Batch(X_batch, Y_batch, cache, params, V, ...
    sigmoid_func, d_sigmoid, tanh_func, d_tanh);
grads = scaleGradients(grads, 1/T);
% ----------------------------------------------------------
% L2 REGULARIZATION
% ---------------------------------------------------------

if lamda > 0

for l = 1:numHiddenLayers
grads{l}.dW_z = grads{l}.dW_z + lamda * params{l}.W_z;
grads{l}.dU_z = grads{l}.dU_z + lamda * params{l}.U_z;
grads{l}.dW_r = grads{l}.dW_r + lamda * params{l}.W_r;
grads{l}.dU_r = grads{l}.dU_r + lamda * params{l}.U_r;
grads{l}.dW_h = grads{l}.dW_h + lamda * params{l}.W_h;
grads{l}.dU_h = grads{l}.dU_h + lamda * params{l}.U_h;
end
grads{numHiddenLayers+1}.V = grads{numHiddenLayers+1}.V + lamda * V;
end

% ----------------------------------------------------------
% MOMENTUM ADDED TO GRU
% ---------------------------------------------------------
for l = 1:numHiddenLayers
[params{l}.W_z, W_z_vel{l}] = momentumUpdate(params{l}.W_z, W_z_vel{l}, alfa, beta, grads{l}.dW_z);
[params{l}.b_z, b_z_vel{l}] = momentumUpdate(params{l}.b_z, b_z_vel{l}, alfa, beta, grads{l}.db_z);
[params{l}.U_z, U_z_vel{l}] = momentumUpdate(params{l}.U_z, U_z_vel{l}, alfa, beta, grads{l}.dU_z);
[params{l}.W_r, W_r_vel{l}] = momentumUpdate(params{l}.W_r, W_r_vel{l}, alfa, beta, grads{l}.dW_r);
[params{l}.b_r, b_r_vel{l}] = momentumUpdate(params{l}.b_r, b_r_vel{l}, alfa, beta, grads{l}.db_r);
[params{l}.U_r, U_r_vel{l}] = momentumUpdate(params{l}.U_r, U_r_vel{l}, alfa, beta, grads{l}.dU_r);
[params{l}.W_h, W_h_vel{l}] = momentumUpdate(params{l}.W_h, W_h_vel{l}, alfa, beta, grads{l}.dW_h);
[params{l}.b_h, b_h_vel{l}] = momentumUpdate(params{l}.b_h, b_h_vel{l}, alfa, beta, grads{l}.db_h);
[params{l}.U_h, U_h_vel{l}] = momentumUpdate(params{l}.U_h, U_h_vel{l}, alfa, beta, grads{l}.dU_h);
end
% Output layer updates:
[V, V_vel] = momentumUpdate(V, V_vel, alfa, beta, grads{numHiddenLayers+1}.V);
[b_v, b_v_vel] = momentumUpdate(b_v, b_v_vel, alfa, beta, grads{numHiddenLayers+1}.b_v);
end

% --- Record Weight Trajectories ---


% Here we record the (1,1) element of params{1}.W_z and V over epochs.
weight_history_Wz(ep) = params{1}.W_z(1,1);
weight_history_V(ep) = V(1,1);
% Evaluate RMSE on training, validation, and test sets:
Yhat_train = deepGRUPredictBPTT(X_train, T, params, V, b_v, sigmoid_func, tanh_func);
train_RMSE(ep) = sqrt(mean((Y_train(:) - Yhat_train(:)).^2));

Yhat_val = deepGRUPredictBPTT(X_val, T, params, V, b_v, sigmoid_func, tanh_func);
val_RMSE(ep) = sqrt(mean((Y_val(:) - Yhat_val(:)).^2));

Yhat_test = deepGRUPredictBPTT(X_test, T, params, V, b_v, sigmoid_func, tanh_func);
test_RMSE(ep) = sqrt(mean((Y_test(:) - Yhat_test(:)).^2));

fprintf('Epoch %2d: Train RMSE = %.4f, Val RMSE = %.4f, Test RMSE = %.4f\n', ...
    ep, train_RMSE(ep), val_RMSE(ep), test_RMSE(ep));
end
%-------------------------------------------------------
%% STEP [8]
% Plot all the necessary graphs
%-------------------------------------------------------
figure;
plot(1:epoch, train_RMSE, 'LineWidth',2); hold on;
plot(1:epoch, val_RMSE, '--', 'LineWidth',2);
plot(1:epoch, test_RMSE, '-.', 'LineWidth',2);
xlabel('Epoch'); ylabel('RMSE');
legend('Train','Val','Test','Location','best');
title('Deep GRU BPTT with Mini-Batches: Training RMSE');
grid on;
disp('Training complete.');
% Plot the trajectory of selected weights versus epochs.
figure;
plot(1:epoch, weight_history_Wz, 'm-', 'LineWidth', 2); hold on;
plot(1:epoch, weight_history_V, 'c-', 'LineWidth', 2);
xlabel('Epoch');
ylabel('Weight Value');
title('Selected Weight Trajectories vs. Epoch');
legend('W_z(1,1)', 'V(1,1)');
grid on;
% --------------- End of GRU_BPTT_Regression_Batched code ---------------
% Save the relevant GRU parameters to a MAT-file
save('GRU_weights.mat', 'params', 'V', 'b_v');
end
%-------------------------------------------------------
%% HELPER FUNCTIONS
%-------------------------------------------------------
%% Helper Function: Create Sequences
function [seqsX, seqsY] = createSequences(X, Y, T)
N = size(X,1);
numSeq = floor(N/T);
seqsX = cell(numSeq,1);
seqsY = cell(numSeq,1);

idx = 1;
for s = 1:numSeq
seqsX{s} = X(idx:idx+T-1, :);
seqsY{s} = Y(idx:idx+T-1, :);
idx = idx + T;
end
end
% ------------------------------------------------------------------------
%% Batched Forward Pass Function
function [cache, h_end] = deepGRUForwardPass_Batch(X_batch, params, V, b_v,
sigmoid_func, tanh_func)
% X_batch: [T x input_dim x batchSize]
[T, inputDim, batchSize] = size(X_batch);
numLayers = numel(params);
hidden_dim = size(params{1}.W_z,1);
output_dim = size(V,1);
% Initialize hidden states (for each layer): [hidden_dim x batchSize]
h = cell(numLayers,1);
for l = 1:numLayers
h{l} = zeros(hidden_dim, batchSize);
end

% Preallocate cache:
for l = 1:numLayers
cache.l{l}.z = zeros(hidden_dim, T, batchSize);
cache.l{l}.r = zeros(hidden_dim, T, batchSize);
cache.l{l}.h_tilde = zeros(hidden_dim, T, batchSize);
cache.l{l}.h = zeros(hidden_dim, T, batchSize);
end
cache.Y_hat = zeros(T, output_dim, batchSize);
cache.X_batch = X_batch;
for t = 1:T
% Ensure x_t has dimensions [inputDim x batchSize]
x_t = reshape(X_batch(t, :, :), [inputDim, batchSize]);
for l = 1:numLayers
if l == 1
input_t = x_t; % [inputDim x batchSize]
else
input_t = h{l-1}; % [hidden_dim x batchSize]
end
% Retrieve parameters:
W_z = params{l}.W_z; U_z = params{l}.U_z; b_z = params{l}.b_z;
W_r = params{l}.W_r; U_r = params{l}.U_r; b_r = params{l}.b_r;
W_h = params{l}.W_h; U_h = params{l}.U_h; b_h = params{l}.b_h;

% Compute gates:
z_t = sigmoid_func(W_z*input_t + U_z*h{l} + repmat(b_z, 1, batchSize));
r_t = sigmoid_func(W_r*input_t + U_r*h{l} + repmat(b_r, 1, batchSize));
h_tilde = tanh_func(W_h*input_t + U_h*(r_t .* h{l}) + repmat(b_h, 1, batchSize));
h{l} = (1 - z_t).*h{l} + z_t.*h_tilde;
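% The update gate z_t interpolates between keeping the previous hidden state
% and replacing it with the candidate h_tilde, which is what allows the GRU
% to retain information across many time steps.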

cache.l{l}.z(:,t,:) = z_t;
cache.l{l}.r(:,t,:) = r_t;
cache.l{l}.h_tilde(:,t,:) = h_tilde;
cache.l{l}.h(:,t,:) = h{l};
end
% Compute output; y_t: [output_dim x batchSize]
y_t = V*h{numLayers} + repmat(b_v,1,batchSize);
cache.Y_hat(t,:,:) = y_t;
end
h_end = h{numLayers};
end
% ------------------------------------------------------------------------
%% Batched Backward Pass Function
function grads = deepGRUBackwardPass_Batch(X_batch, Y_batch, cache, params, V, ...
    sigmoid_func, d_sigmoid, tanh_func, d_tanh)
% X_batch: [T x input_dim x batchSize]
% Y_batch: [T x output_dim x batchSize]
[T, ~, batchSize] = size(X_batch);
numLayers = numel(params);
hidden_dim = size(params{1}.W_z,1);
output_dim = size(V,1);
% Initialize output-layer gradients:
gradOut.dV = zeros(size(V));
gradOut.db_v = zeros(output_dim, 1);
% Initialize GRU layer gradients:
for l = 1:numLayers
grads{l}.dW_z = zeros(size(params{l}.W_z));
grads{l}.dU_z = zeros(size(params{l}.U_z));
grads{l}.db_z = zeros(hidden_dim, 1);
grads{l}.dW_r = zeros(size(params{l}.W_r));
grads{l}.dU_r = zeros(size(params{l}.U_r));
grads{l}.db_r = zeros(hidden_dim, 1);
grads{l}.dW_h = zeros(size(params{l}.W_h));
grads{l}.dU_h = zeros(size(params{l}.U_h));
grads{l}.db_h = zeros(hidden_dim, 1);
end

% Initialize dh_next for each layer:


dh_next = cell(numLayers,1);
for l = 1:numLayers
dh_next{l} = zeros(hidden_dim, batchSize);
end

% Loop backward in time:
for t = T:-1:1
% Output layer error:
y_t = reshape(cache.Y_hat(t,:,:), [output_dim, batchSize]);
y_ref = reshape(Y_batch(t,:,:), [output_dim, batchSize]);
e_t = y_t - y_ref; % [output_dim x batchSize]

h_top = reshape(cache.l{numLayers}.h(:,t,:), [hidden_dim, batchSize]);


gradOut.dV = gradOut.dV + e_t * h_top';
gradOut.db_v = gradOut.db_v + sum(e_t,2);

dh = V' * e_t + dh_next{numLayers}; % [hidden_dim x batchSize]

for l = numLayers:-1:1
z_t = reshape(cache.l{l}.z(:,t,:), [hidden_dim, batchSize]);
r_t = reshape(cache.l{l}.r(:,t,:), [hidden_dim, batchSize]);
h_tilde = reshape(cache.l{l}.h_tilde(:,t,:), [hidden_dim,
batchSize]);
h_t = reshape(cache.l{l}.h(:,t,:), [hidden_dim, batchSize]);

if t == 1
h_prev = zeros(hidden_dim, batchSize);
else
h_prev = reshape(cache.l{l}.h(:,t-1,:), [hidden_dim,
batchSize]);
end

if l == 1
x_l = reshape(X_batch(t,:,:), [size(X_batch,2), batchSize]);
else
x_l = reshape(cache.l{l-1}.h(:,t,:), [hidden_dim, batchSize]);
end
d_z = dh .* (h_tilde - h_prev);
d_h_tilde = dh .* z_t;
d_h_prev_direct = dh .* (1 - z_t);

d_a_h = d_h_tilde .* (1 - h_tilde.^2);


dW_h = d_a_h * x_l';
dU_h = d_a_h * (r_t .* h_prev)';
db_h = sum(d_a_h, 2);

d_h_prev_candidate = (params{l}.U_h' * d_a_h) .* r_t;


d_r = (params{l}.U_h' * d_a_h) .* h_prev;
d_a_r = d_r .* d_sigmoid(r_t);
dW_r = d_a_r * x_l';
dU_r = d_a_r * h_prev';
db_r = sum(d_a_r, 2);

d_a_z = d_z .* d_sigmoid(z_t);

dW_z = d_a_z * x_l';
dU_z = d_a_z * h_prev';
db_z = sum(d_a_z, 2);

dh_prev = d_h_prev_direct + d_h_prev_candidate + params{l}.U_r' * d_a_r + params{l}.U_z' * d_a_z;

grads{l}.dW_z = grads{l}.dW_z + dW_z;


grads{l}.dU_z = grads{l}.dU_z + dU_z;
grads{l}.db_z = grads{l}.db_z + db_z;
grads{l}.dW_r = grads{l}.dW_r + dW_r;
grads{l}.dU_r = grads{l}.dU_r + dU_r;
grads{l}.db_r = grads{l}.db_r + db_r;
grads{l}.dW_h = grads{l}.dW_h + dW_h;
grads{l}.dU_h = grads{l}.dU_h + dU_h;
grads{l}.db_h = grads{l}.db_h + db_h;

dh_next{l} = dh_prev;
if l > 1
% Accumulate into lower layer error signal
if isempty(dh_next{l-1})
dh_next{l-1} = dh_prev;
else
dh_next{l-1} = dh_next{l-1} + dh_prev;
end
end
end
end
grads{numLayers+1} = struct('V', gradOut.dV, 'b_v', gradOut.db_v);
end
% ------------------------------------------------------------------------
%% Helper Function: Scale Gradients by a Factor
function newGrads = scaleGradients(grads, factor)
if iscell(grads)
newGrads = cell(size(grads));
for k = 1:numel(grads)
if isstruct(grads{k})
fn = fieldnames(grads{k});
newGrads{k} = grads{k};
for j = 1:numel(fn)
newGrads{k}.(fn{j}) = factor * grads{k}.(fn{j});
end
else
newGrads{k} = factor * grads{k};
end
end
else
fn = fieldnames(grads);
newGrads = grads;

for k = 1:numel(fn)
newGrads.(fn{k}) = factor * grads.(fn{k});
end
end
end
% ------------------------------------------------------------------------
%% Helper Function: Momentum Update
function [param, param_vel] = momentumUpdate(param, param_vel, alpha, beta,
dparam)
param_vel = beta * param_vel + alpha * dparam;
param = param - param_vel;
end
% ------------------------------------------------------------------------
%% Helper Function: GRU Prediction Using BPTT (Processes dataset in sequences)
function Y_hat_all = deepGRUPredictBPTT(X_data, T, params, V, b_v,
sigmoid_func, tanh_func)
N = size(X_data,1);
numSeq = floor(N/T);
output_dim = size(V,1);
Y_hat_all = zeros(N, output_dim);
idx = 1;
for s = 1:numSeq
X_seq = X_data(idx:idx+T-1, :);
% For prediction, use batched forward pass with batchSize=1.
[cache, ~] = deepGRUForwardPass_Batch(reshape(X_seq, [T, size(X_seq,2),
1]), ...
params, V, b_v, sigmoid_func, tanh_func);
% cache.Y_hat is [T x output_dim x 1]; reshape to [T x output_dim]
Y_hat_seq = reshape(cache.Y_hat, [T, output_dim]);
Y_hat_all(idx:idx+T-1, :) = Y_hat_seq;
idx = idx + T;
end
end

9.1.6.​ Deep Q neural network


%-------------------------------------------------------
% Deep DQN with three hidden layers, experience replay, and a target network.
% The Q-update uses a weighted target:
%   T(s,a) = (1 - updateRate)*Q(s,a) + updateRate*( r + discount*max(Q(s',:)) )
%
% By: Guiah Soumahoro (Debugging of DQN was done using ChatGPT)
%-------------------------------------------------------
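% Note on the weighted target above: the network below is trained by gradient
% descent on the squared TD error
%   L = 0.5*( Q(s,a) - (r + discount*max(Q_target(s',:))) )^2,
% and a single gradient step of size updateRate on a tabular Q-value gives
%   Q(s,a) <- Q(s,a) - updateRate*( Q(s,a) - (r + discount*max(Q_target(s',:))) )
%          =  (1 - updateRate)*Q(s,a) + updateRate*( r + discount*max(Q_target(s',:)) ),
% i.e. exactly the weighted form; with the neural network the same update is
% realized (approximately) through the mini-batch backpropagation in the
% training loop below.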
function demoDQN_dataset()

clc;
clear all;
close all;

%-------------------------------------------------------
%% STEP [1]
% Load pre-trained Network Weight
%-------------------------------------------------------
if exist('ANN_weights.mat','file')
annData = load('ANN_weights.mat');
W1_ANN = annData.W1; W2_ANN = annData.W2;
W3_ANN = annData.W3; W4_ANN = annData.W4;
fprintf('Loaded ANN weights.\n');
else
error('ANN_weights.mat not found!');
end
if exist('GRU_weights.mat','file')
gruData = load('GRU_weights.mat');
params_GRU = gruData.params;
V_GRU = gruData.V; b_v_GRU = gruData.b_v;
fprintf('Loaded GRU weights.\n');
else
error('GRU_weights.mat not found!');
end
if exist('LSTM_weights.mat','file')
lstmData = load('LSTM_weights.mat');
params_LSTM = lstmData.params;
V_LSTM = lstmData.V; b_v_LSTM = lstmData.b_v;
fprintf('Loaded LSTM weights.\n');
else
error('LSTM_weights.mat not found!');
end
if exist('RNN_weights.mat','file')
rnnData = load('RNN_weights.mat');
params_RNN = rnnData.params;
V_RNN = rnnData.V; b_y_RNN = rnnData.b_y;
fprintf('Loaded RNN weights.\n');
else
error('RNN_weights.mat not found!');
end

%-------------------------------------------------------
%% STEP [2]
% Load the data and set up parameters (the CSV dataset needs to be in the
% same folder)
%-------------------------------------------------------
try
dataTable = readtable('DatasetPCA.csv');
catch
error('Could not read DatasetPCA.csv. Make sure it is in the current folder.');
end

data = table2array(dataTable);
% *** IMPORTANT ***: Verify the number of input features.
% If you want only the first 4 features, use:
X = data(:, 1:4); % state features
Y = data(:, 5:6); % target torque (2D)
N = size(data,1);

%-------------------------------------------------------
%% STEP [3]
% Normalize the data
%-------------------------------------------------------

% Note: z-score normalization was done in the preprocessing steps

%-------------------------------------------------------
%% STEP [4]
% Randomize data sets to remove ordering bias and split the data
% (70% training, 15% validation and 15% testing)
%-------------------------------------------------------

% Randomize data
idx = randperm(N); % Shuffle the row indices once and apply the same
% permutation to X and Y so the input/target pairing stays aligned.
X = X(idx,:);
Y = Y(idx,:);

% Define the number of samples for each partition:


N_train = round(0.70 * N); % 70% for training
N_val = round(0.15 * N); % 15% for validation
N_test = N - N_train - N_val; % Remaining samples for testing

% Split the dataset into training, validation, and test sets.


X_train = X(1:N_train, :);
Y_train = Y(1:N_train, :);

X_val = X(N_train+1 : N_train+N_val, :);


Y_val = Y(N_train+1 : N_train+N_val, :);

X_test = X(N_train+N_val+1 : end, :);


Y_test = Y(N_train+N_val+1 : end, :);

%-------------------------------------------------------
%% STEP [5]
% Define Discrete Action spaces (4 action spaces)
%-------------------------------------------------------
actionSpace = 1:4; % 1 = ANN, 2 = GRU, 3 = LSTM, 4 = RNN
numActions = numel(actionSpace);

%-------------------------------------------------------

%% STEP [6]
% Build the Deep Q Network architecture (input, 3 hidden layers, output)
% and initialize the weights
%-------------------------------------------------------
% Ensure that output_dim equals 4.
stateDim = size(X_train,2); % Expected: 4
hidden_dim1 = 25;
hidden_dim2 = 25;
hidden_dim3 = 25;
output_dim = 4; % Must be 4
%-----------------------------------------------------------
% Use Xavier (Glorot) initialization for weights
%-----------------------------------------------------------

limit1 = sqrt(6/(stateDim+hidden_dim1));
W1 = rand(hidden_dim1, stateDim)*2*limit1 - limit1;
b1 = zeros(hidden_dim1,1);

limit2 = sqrt(6/(hidden_dim1+hidden_dim2));
W2 = rand(hidden_dim2, hidden_dim1)*2*limit2 - limit2;
b2 = zeros(hidden_dim2,1);

limit3 = sqrt(6/(hidden_dim2+hidden_dim3));
W3 = rand(hidden_dim3, hidden_dim2)*2*limit3 - limit3;
b3 = zeros(hidden_dim3,1);

limit4 = sqrt(6/(hidden_dim3+output_dim));
W4 = rand(output_dim, hidden_dim3)*2*limit4 - limit4;
b4 = zeros(output_dim,1);

% ----------------------------------------------------------
% Initialize velocity momentum terms
% ---------------------------------------------------------:
W1_vel = zeros(size(W1)); b1_vel = zeros(size(b1));
W2_vel = zeros(size(W2)); b2_vel = zeros(size(b2));
W3_vel = zeros(size(W3)); b3_vel = zeros(size(b3));
W4_vel = zeros(size(W4)); b4_vel = zeros(size(b4));

%-------------------------------------------------------
%% STEP [7]
% Initialize Target Network (Copy of online network)
%-------------------------------------------------------
W1_target = W1; b1_target = b1;
W2_target = W2; b2_target = b2;
W3_target = W3; b3_target = b3;
W4_target = W4; b4_target = b4;

%-------------------------------------------------------
%% STEP [8]

% Set up hyperparameters of DQN
%-------------------------------------------------------
epsilon = 1.0;
epsilon_min = 0.1;
epsilon_decay = 0.98;
alfa_momentum = 1e-3;
momentumVal = 0.9;
lamda = 1e-4;
gamma = 0.99; % Discount
epoch = 100;
batchSize = 32;

% updateRate corresponds to the learning rate in Q-update:


alfa = 1e-3;
% Preallocate arrays to record the evolution of selected weights.
% Here we track W1(1,1) and W2(1,1) as examples.
weight_history_W1 = zeros(epoch,1);
weight_history_W2 = zeros(epoch,1);

%-------------------------------------------------------
%% STEP [9]
% Initialize experience replay buffer
%-------------------------------------------------------
replayCapacity = 5000;
replay_s = zeros(stateDim, replayCapacity);
replay_a = zeros(1, replayCapacity); % scalar actions
replay_r = zeros(1, replayCapacity);
replay_snext = zeros(stateDim, replayCapacity);
replay_done = zeros(1, replayCapacity);
replayCount = 0;
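% The replay memory is used as a circular buffer: replayCount wraps back to 1
% once replayCapacity transitions have been stored, so the oldest experience
% is overwritten first. Sampling mini-batches uniformly from this buffer
% breaks the temporal correlation between consecutive training samples.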

%-------------------------------------------------------
%% STEP [10]
% Training the DQN with experience replay
%-------------------------------------------------------
numTrain = N_train;
totalSteps = numTrain - 1;
avgReturn = zeros(epoch,1);

for ep = 1:epoch
indices = randperm(totalSteps);
totalReward = 0;
% Initialize an action count vector for this epoch:
actionCount = zeros(1, numActions);

% Populate replay buffer:


for i = indices
s = X_train(i,:)'; % Expecting [stateDim x 1]
if ~isequal(size(s), [stateDim, 1])

error('s has wrong dimensions: %s', mat2str(size(s)));
end
if i < numTrain
s_next = X_train(i+1,:)';
done = 0;
else
s_next = zeros(stateDim,1);
done = 1;
end

Q_current = dqnForwardBatch(s, W1, b1, W2, b2, W3, b3, W4, b4); % should be [4 x 1]
if size(Q_current,2) ~= 1
error('Q_current is not a single column vector, size: %s', mat2str(size(Q_current)));
end
if rand < epsilon
a_test = randi(numActions);
action = a_test;
else
[~, a_temp] = max(Q_current(:));
action = a_temp;
end
% Update the action count
actionCount(action) = actionCount(action) + 1;

totalReward = totalReward + ( -(norm(getTorque(action, i) - Y_train(i,:)')^2) );
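% The reward is the negative squared error between the torque predicted by
% the selected model and the true torque for this sample, so maximizing the
% return is equivalent to minimizing the torque prediction error.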

replayCount = replayCount + 1;
if replayCount > replayCapacity
replayCount = 1;
end
replay_s(:, replayCount) = s;
replay_a(1, replayCount) = action; % action must be scalar
replay_r(1, replayCount) = -(norm(getTorque(action, i) - Y_train(i,:)')^2);
replay_snext(:, replayCount) = s_next;
replay_done(1, replayCount) = done;
end
% Display the actions taken this epoch:
fprintf('Epoch %d action counts:\n', ep);
for a = 1:numActions
fprintf(' Action %d: %d times\n', a, actionCount(a));
end

% Mini-batch update from replay buffer:


numUpdates = floor(replayCapacity / batchSize);
totalLoss = 0;

for update = 1:numUpdates
idx_batch = randi(replayCapacity, [1, batchSize]);
s_batch = replay_s(:, idx_batch);
a_batch = replay_a(idx_batch);
r_batch = replay_r(idx_batch);
snext_batch = replay_snext(:, idx_batch);
done_batch = replay_done(idx_batch);

Q_batch = dqnForwardBatch(s_batch, W1, b1, W2, b2, W3, b3, W4, b4); % [4 x batchSize]
Q_next = dqnForwardBatch(snext_batch, W1_target, b1_target, W2_target, b2_target, ...
    W3_target, b3_target, W4_target, b4_target);
maxQ_next = max(Q_next, [], 1);

% Construct target Q matrix:


T_batch = Q_batch;
for j = 1:batchSize
if done_batch(j) == 1
T_batch(a_batch(j), j) = r_batch(j);
else
T_batch(a_batch(j), j) = r_batch(j) + gamma*maxQ_next(j);
end
end
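% Only the entry for the taken action is replaced by the TD target
% r + gamma*max(Q_target(s',:)); for all other actions the target equals the
% current prediction, so their error (and hence their gradient) is zero.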

errorBatch = Q_batch - T_batch;


lossBatch = 0.5 * mean(sum(errorBatch.^2, 1));
totalLoss = totalLoss + lossBatch;

% Backpropagation through the Q-network


m = batchSize;
[Z1, ~] = layerForward(s_batch, W1, b1);
[Z2, ~] = layerForward(Z1, W2, b2);
[Z3, ~] = layerForward(Z2, W3, b3);
A4 = W4 * Z3 + repmat(b4, 1, m);
dA4 = errorBatch;
dW4 = (dA4 * Z3') / m;
db4 = mean(dA4, 2);

dZ3 = W4' * dA4;


dA3 = dZ3 .* (1 - Z3.^2);
dW3 = (dA3 * Z2') / m;
db3 = mean(dA3, 2);

dZ2 = W3' * dA3;


dA2 = dZ2 .* (1 - Z2.^2);
dW2 = (dA2 * Z1') / m;
db2 = mean(dA2, 2);

dZ1 = W2' * dA2;

dA1 = dZ1 .* (1 - Z1.^2);
dW1 = (dA1 * s_batch') / m;
db1 = mean(dA1, 2);

% L2 regularization:
dW1 = dW1 + lamda * W1;
dW2 = dW2 + lamda * W2;
dW3 = dW3 + lamda * W3;
dW4 = dW4 + lamda * W4;

% Momentum updates:
[W4, W4_vel] = momentumUpdate(W4, W4_vel, alfa_momentum, momentumVal, dW4);
[b4, b4_vel] = momentumUpdate(b4, b4_vel, alfa_momentum, momentumVal, db4);
[W3, W3_vel] = momentumUpdate(W3, W3_vel, alfa_momentum, momentumVal, dW3);
[b3, b3_vel] = momentumUpdate(b3, b3_vel, alfa_momentum, momentumVal, db3);
[W2, W2_vel] = momentumUpdate(W2, W2_vel, alfa_momentum, momentumVal, dW2);
[b2, b2_vel] = momentumUpdate(b2, b2_vel, alfa_momentum, momentumVal, db2);
[W1, W1_vel] = momentumUpdate(W1, W1_vel, alfa_momentum, momentumVal, dW1);
[b1, b1_vel] = momentumUpdate(b1, b1_vel, alfa_momentum, momentumVal, db1);
end

epsilon = max(epsilon * epsilon_decay, epsilon_min);
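% Epsilon-greedy exploration is annealed: epsilon starts at 1.0 (purely random
% model selection), shrinks by a factor of 0.98 per epoch, and is floored at
% 0.1 so that some exploration always remains.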


avgReturn(ep) = totalReward / numTrain;
% Average loss is reported per mini-batch update (numUpdates updates per epoch).
fprintf('Epoch %d, Avg Loss = %.4f, Avg Return = %.4f\n', ep, ...
    totalLoss/numUpdates, avgReturn(ep));

%-------------------------------------------------------
%% STEP [11]
% Update the target Network every 20 epochs
%-------------------------------------------------------
if mod(ep,20)==0
W1_target = W1; b1_target = b1;
W2_target = W2; b2_target = b2;
W3_target = W3; b3_target = b3;
W4_target = W4; b4_target = b4;
fprintf('Target network updated at epoch %d.\n', ep);
end
% --- Record Weight Trajectories ---
% Save selected weights to track their evolution.
weight_history_W1(ep) = W1(1,1);

weight_history_W2(ep) = W2(1,1);

% --- Evaluate RMSE for full training/val/test sets ---


mu_train = dqnPredict(X_train, W1, b1, W2, b2, W3, b3, W4, b4);
train_RMSE(ep) = sqrt(mean((Y_train(:)-mu_train(:)).^2));

mu_val = dqnPredict(X_val, W1, b1, W2, b2, W3, b3, W4, b4);
val_RMSE(ep) = sqrt(mean((Y_val(:)-mu_val(:)).^2));

mu_test = dqnPredict(X_test, W1, b1, W2, b2, W3, b3, W4, b4);
test_RMSE(ep) = sqrt(mean((Y_test(:)-mu_test(:)).^2));

fprintf('Epoch %d, Train RMSE=%.4f, Val RMSE=%.4f, Test RMSE=%.4f\n', ...
    ep, train_RMSE(ep), val_RMSE(ep), test_RMSE(ep));
end

%-------------------------------------------------------
%% STEP [12]
% Plot all the necessary graphs
%-------------------------------------------------------
% Plot the average return over epochs.
figure;
subplot(2,1,1);
plot(1:epoch, avgReturn, 'LineWidth',2);
xlabel('Epoch'); ylabel('Average Return');
title('DQN Average Return'); grid on;

% Plot the RMSE over epochs for training, validation, and test sets.
subplot(2,1,2);
plot(1:epoch, train_RMSE, 'b-', 'LineWidth',2); hold on;
plot(1:epoch, val_RMSE, 'r--', 'LineWidth',2);
plot(1:epoch, test_RMSE, 'g-.', 'LineWidth',2);
xlabel('Epoch'); ylabel('RMSE');
legend('Train','Val','Test','Location','best');
title('DQN RMSE'); grid on;
% Plot the trajectory of selected weights versus epochs.
figure;
plot(1:epoch, weight_history_W1, 'm-', 'LineWidth', 2); hold on;
plot(1:epoch, weight_history_W2, 'c-', 'LineWidth', 2);
xlabel('Epoch');
ylabel('Weight Value');
title('Selected Weight Trajectories vs. Epoch');
legend('W1(1,1)', 'W2(1,1)');
grid on;
% -------------------- Display Final Weights ---------------------------
disp('Final Weights (Input-to-Hidden Layer 1, W1):');
disp(W1);
disp('Final Weights (Hidden Layer 1-to-Hidden Layer 2, W2):');
disp(W2);

disp('Deep DQN Training Complete.');

%-------------------------------------------------------
%% HELPER FUNCTIONS
%-------------------------------------------------------

%% Online Q-network forward pass on a mini-batch.


function Q = dqnForwardBatch(X, W1, b1, W2, b2, W3, b3, W4, b4)
m = size(X,2);
Z1 = tanh(W1 * X + repmat(b1, 1, m));
Z2 = tanh(W2 * Z1 + repmat(b2, 1, m));
Z3 = tanh(W3 * Z2 + repmat(b3, 1, m));
Q = W4 * Z3 + repmat(b4, 1, m); % Q is [4 x m]
end
%% dqnPredict: For each state in the data, choose an action (max Q) then use
%  the corresponding pre-trained network to produce a torque prediction.
function mu = dqnPredict(X_data, W1, b1, W2, b2, W3, b3, W4, b4)
Np = size(X_data,1);
mu = zeros(Np,2);
for j = 1:Np
s = X_data(j,:)';
Qvals = dqnForwardBatch(s, W1, b1, W2, b2, W3, b3, W4, b4); % [4 x 1]
[~, idx] = max(Qvals(:)); % Ensure scalar index
switch idx
case 1, mu(j,:) = forward_ANN_single(X_data(j,:));
case 2, mu(j,:) = forward_GRU_single(X_data(j,:), params_GRU, V_GRU, b_v_GRU);
case 3, mu(j,:) = forward_LSTM_single(X_data(j,:), params_LSTM, V_LSTM, b_v_LSTM);
case 4, mu(j,:) = forward_RNN_single(X_data(j,:), params_RNN, V_RNN, b_y_RNN);
end
end
end
%% Momentum update helper.
function [param, param_vel] = momentumUpdate(param, param_vel, alpha, momentum, dparam)
param_vel = momentum*param_vel + alpha*dparam;
param = param - param_vel;
end
% Single layer forward propagation.
function [Z, A] = layerForward(X, W, b)
A = W * X + repmat(b, 1, size(X,2));
Z = tanh(A);
end
%-------------------------------------------------------
%% PRE-TRAINED NETWORK FORWARD PASSES

%-------------------------------------------------------
%% ANN forward pass (5-layer network)
function y_pred = forward_ANN_single(x_input)
x = x_input(:);
X1 = [1; x];
y1 = tanh(W1_ANN * X1);
X2 = [1; y1];
y2 = tanh(W2_ANN * X2);
X3 = [1; y2];
y3 = tanh(W3_ANN * X3);
X4 = [1; y3];
net_out = W4_ANN * X4;
y_pred = tanh(net_out);
y_pred = y_pred';
end
%% GRU forward pass (single-step)
function y_pred = forward_GRU_single(x_input, params_GRU, V_GRU, b_v_GRU)
x = x_input(:);
numLayers = numel(params_GRU);
hidden_dim = size(params_GRU{1}.W_z,1);
h = cell(numLayers,1);
for L = 1:numLayers, h{L} = zeros(hidden_dim,1); end
for L = 1:numLayers
if L == 1
input_t = x;
else
input_t = h{L-1};
end
W_z = params_GRU{L}.W_z; U_z = params_GRU{L}.U_z; b_z = params_GRU{L}.b_z;
W_r = params_GRU{L}.W_r; U_r = params_GRU{L}.U_r; b_r = params_GRU{L}.b_r;
W_h = params_GRU{L}.W_h; U_h = params_GRU{L}.U_h; b_h = params_GRU{L}.b_h;
z_t = sigmoid(W_z * input_t + U_z * h{L} + b_z);
r_t = sigmoid(W_r * input_t + U_r * h{L} + b_r);
h_tilde = tanh(W_h * input_t + U_h * (r_t .* h{L}) + b_h);
h{L} = (1 - z_t) .* h{L} + z_t .* h_tilde;
end
y_pred_col = V_GRU * h{numLayers} + b_v_GRU;
y_pred = tanh(y_pred_col);
y_pred = y_pred';
end
%% LSTM forward pass (single-step)
function y_pred = forward_LSTM_single(x_input, params_LSTM, V_LSTM,
b_v_LSTM)
x = x_input(:);
numLayers = numel(params_LSTM);
hidden_dim = size(params_LSTM{1}.W_u,1);

h = cell(numLayers,1); c = cell(numLayers,1);
for L = 1:numLayers
h{L} = zeros(hidden_dim,1);
c{L} = zeros(hidden_dim,1);
end
for L = 1:numLayers
if L == 1
input_t = x;
else
input_t = h{L-1};
end
U_u = params_LSTM{L}.U_u; W_u = params_LSTM{L}.W_u; b_u = params_LSTM{L}.b_u;
U_f = params_LSTM{L}.U_f; W_f = params_LSTM{L}.W_f; b_f = params_LSTM{L}.b_f;
U_o = params_LSTM{L}.U_o; W_o = params_LSTM{L}.W_o; b_o = params_LSTM{L}.b_o;
U_g = params_LSTM{L}.U_g; W_g = params_LSTM{L}.W_g; b_g = params_LSTM{L}.b_g;
i_t = sigmoid(U_u * input_t + W_u * h{L} + b_u);
f_t = sigmoid(U_f * input_t + W_f * h{L} + b_f);
o_t = sigmoid(U_o * input_t + W_o * h{L} + b_o);
g_t = tanh(U_g * input_t + W_g * h{L} + b_g);
c{L} = f_t .* c{L} + i_t .* g_t;
h{L} = o_t .* tanh(c{L});
end
y_pred_col = V_LSTM * h{numLayers} + b_v_LSTM;
y_pred = tanh(y_pred_col);
y_pred = y_pred';
end
%% RNN forward pass (single-step)
function y_pred = forward_RNN_single(x_input, params_RNN, V_RNN, b_y_RNN)
x = x_input(:);
numLayers = numel(params_RNN);
hidden_dim = size(params_RNN{1}.W,1);
h = cell(numLayers,1);
for L = 1:numLayers, h{L} = zeros(hidden_dim,1); end
for L = 1:numLayers
if L == 1
input_t = x;
else
input_t = h{L-1};
end
act = params_RNN{L}.W * input_t + params_RNN{L}.U * h{L} + params_RNN{L}.b;
h{L} = tanh(act);
end
y_pred_col = V_RNN * h{numLayers} + b_y_RNN;
y_pred = tanh(y_pred_col);

y_pred = y_pred';
end
%% Simple sigmoid function.
function s = sigmoid(z)
s = 1./(1+exp(-z));
end
%% Helper: Get torque from the chosen pre-trained network.
function torque = getTorque(a, idx)
% idx is the index in X_train for the current sample.
switch a
case 1, torque = forward_ANN_single(X_train(idx,:))';
case 2, torque = forward_GRU_single(X_train(idx,:), params_GRU, V_GRU, b_v_GRU)';
case 3, torque = forward_LSTM_single(X_train(idx,:), params_LSTM, V_LSTM, b_v_LSTM)';
case 4, torque = forward_RNN_single(X_train(idx,:), params_RNN, V_RNN, b_y_RNN)';
otherwise, torque = zeros(2,1);
end
end
end

9.2.​ Appendix B: Training Data Description and Pre-processing Details

Training data used for this project was provided by the course professor. The original dataset is the file named “Dataset” in the attached folder, while the file named “DatasetPCA” is the processed dataset used to train all network models.
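For reference, the snippet below is a minimal sketch of the kind of pre-processing assumed to have produced “DatasetPCA” from the original “Dataset” file, namely z-score normalization followed by PCA on the input features. The file names, column layout, and the number of retained principal components are illustrative assumptions only; the procedure actually used is the one described in the pre-processing section of the report.

% Minimal pre-processing sketch (assumed file names and column layout).
raw  = table2array(readtable('Dataset.csv'));   % original dataset (assumed CSV export)
Xraw = raw(:, 1:end-2);                         % assumed: inputs in the leading columns
Yraw = raw(:, end-1:end);                       % assumed: the two joint torques in the last columns
Xz   = (Xraw - mean(Xraw)) ./ std(Xraw);        % z-score normalization of the inputs
Yz   = (Yraw - mean(Yraw)) ./ std(Yraw);        % targets assumed z-scored as well
[~, score] = pca(Xz);                           % PCA (Statistics and Machine Learning Toolbox)
Xpca = score(:, 1:4);                           % keep 4 components to match the 4-feature input above
writematrix([Xpca, Yz], 'DatasetPCA.csv');      % processed dataset used to train all models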
