Transportation Research Procedia 62 (2022) 278–285 (doi: 10.1016/j.trpro.2022.02.035)
www.elsevier.com/locate/procedia
24th Euro Working Group on Transportation Meeting, EWGT 2021, 8-10 September 2021, Aveiro, Portugal
Deep Reinforcement Learning based approach for Traffic Signal Control
Kővári Bálint a,∗, Tettamanti Tamás a, Bécsi Tamás a
a Department of Control for Transportation and Vehicle Systems, Budapest University of Technology and Economics, Műegyetem rkp. 3, 1111, Budapest, Hungary
∗ Corresponding author. E-mail address: kovari.balint@kjk.bme.hu
Abstract
The paper introduces a novel approach to the classical adaptive traffic signal control (TSC) problem. Instead of the traditional optimization or simple rule-based approaches, Artificial Intelligence is applied. Reinforcement Learning is a spectacularly evolving realm of Machine Learning that offers key features such as generalization, scalability, and real-time applicability for solving the traffic signal control problem. Nevertheless, the researchers' responsibilities become more serious regarding the formulation of the state representation and the rewarding system. These Reinforcement Learning features are also the most fascinating and controversial virtues, since the utilized abstractions decide whether the algorithm solves the problem or not. This paper proposes a new interpretation of the feature-based state representation and rewarding concept that makes the TSC problem highly generalizable and gives it scaling potential. The proposed method's feasibility is demonstrated via a simulation study using a high-fidelity microscopic traffic simulator. The results confirm that the Deep Reinforcement Learning based approach is a real candidate for real-time traffic light control.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 24th Euro Working Group on Transportation Meeting (EWGT 2021).
Keywords: Deep Reinforcement Learning; Traffic Light; Traffic Signal Control; Policy Gradient Algorithm; SUMO; Traffic Simulation
1. Introduction
The presence of Deep Reinforcement Learning (DRL) based approaches in the diverse realm of road transportation is growing dramatically Farazi et al. (2020). Most aspects of autonomous vehicle control, such as motion planning, highway driving, etc., are intensely researched with DRL techniques Aradi (2020). The same goes for the competitive problem of route optimization, ranging from the fundamental traveling salesman problem Bello et al. (2016) to more complex multi-vehicle routing problems Nazari et al. (2018). Adaptive Traffic Signal Control (TSC) is also at the center of interest. This trend is likely to last because DRL can provide a natural and robust framework for solving complex sequential decision-making problems.
Moreover, thanks to the deep learning component, it can provide real-time solutions in challenging domains that computationally overwhelm classic methods. Traffic light control is a permanent engineering topic as traffic problems emerge worldwide, and autonomous vehicles will likely generate more mobility than before (Kisgyörgy and Szele, 2018). Traffic-responsive signal control has been present in traffic engineering practice for several decades to tackle traffic congestion. With widespread traffic sensors (loop detectors, earth-magnetic detectors, cameras, etc.), it is used increasingly intensively (Papageorgiou, 2004).
Deep Reinforcement Learning based control has been applied to traffic light control only in a limited manner. Most of the authors utilize Value-based algorithms in Single-Agent solutions, such as Deep Q-Learning (Guo and Harmati, 2020; Touhbi et al., 2017) and its on-policy variant SARSA (El-Tantawy et al., 2014; Wen et al., 2007), which, compared to Policy-based methods, cannot guarantee convergence even to a local optimum. Hence, these methods are more prone to convergence problems during training, which can degrade performance. Moreover, most of the used reward strategies utilize low-level features like queue length (Muresan et al., 2019) or waiting time Gao et al. (2017), or no traffic-dependent features at all (Thorpe and Anderson, 1996), making credit assignment and generalization harder.
The paper’s main contribution is twofold. On the one hand, it is justified that Deep Reinforcement Learning can
be effectively used for traffic-actuated signal control problem. On the other hand, this paper proposes a new inter-
pretation of the composed feature-based state representation that enables the utilization of a new rewarding concept.
Considering the rewarding systems’ unique role in training procedures, it can firmly enhance the TSC problem’s
generalization, which is crucial since generalization is a critical component in RL agent’s adaptation capabilities for
unseen scenarios. It is also worth mention that in this paper, the Policy Gradient algorithm is utilized, which has
guaranteed convergence. Thanks to this feature, it has the potential to reach better performance, compared to that
Value-based approaches, which are utilized mostly for the TSC problem in the past (Haydari and Yilmaz, 2020).
The paper is structured as follows. In Section 2, the applied methodology for reinforcement learning is discussed.
Section 3 introduces the applied working environment used for machine learning and traffic simulation. Section 4
demonstrates the results based on realistic traffic simulation. The paper ends up with a short conclusion.
2. Methodology
Reinforcement Learning (RL) is a currently heavily researched realm of Machine Learning (ML) thanks to its tremendous success in board games (Silver et al., 2017), robotics (Lillicrap et al., 2015), and video games (Mnih et al., 2015). Compared to Supervised Learning (SL), RL utilizes a different concept of learning. An SL agent seeks to map the input x to the output y based on training samples. Hence, the primary concern is the insatiable demand for training data, which cannot be provided in several control problems. In the meantime, RL generates its training data by interacting with the environment in an online manner. The interaction between an RL training loop's entities is formulated as a Markov Decision Process (MDP) {S, A, T, R}, where the agent seeks to maximize the cumulative reward by developing a policy. The cumulative reward is formulated as follows:
G = \sum_{t=1}^{T} \gamma^{t} r_{t} \qquad (1)
where γ is the discount factor that determines how strongly future rewards contribute to the return. An interaction looks as follows: the agent receives a state s_i that represents the environment. The agent chooses an action a_i based on a policy π_i, whose execution triggers a state transition P(s_{i+1} | s_i, a_i) in the environment. The quality of the selected action a_i in the given time step is described to the agent by the environment through a scalar feedback value called the reward. In the RL concept, the reward signal can be interpreted as the only compass that an agent has to develop the optimal policy.
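To make the above concrete, a minimal Python sketch of one such interaction episode and of the return in Eq. (1) could look as follows; `env` and `agent` are generic placeholders with a Gym-like interface, not objects defined in the paper.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward G of Eq. (1) for one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def run_episode(env, agent, gamma=0.9):
    """One agent-environment interaction episode (placeholder interfaces)."""
    state = env.reset()
    rewards, done = [], False
    while not done:
        action = agent.act(state)                   # a_i sampled from the policy
        state, reward, done, _ = env.step(action)   # transition P(s_{i+1} | s_i, a_i)
        rewards.append(reward)                      # scalar feedback from the environment
    return discounted_return(rewards, gamma)
```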
For solving the introduced control task, the Policy Gradient algorithm is implemented, which can be categorized into the family of model-free, Policy-based RL algorithms. An agent's training is basically about tuning the utilized function approximator, which is a Neural Network in our case. If the training is successful, then the function approximator predicts values or probabilities from which the policy can be derived directly or indirectly. Compared to Value-based methods, the Neural Network trained via the Policy Gradient algorithm predicts the choice probabilities of the available actions in every state, which can be interpreted as direct suggestions for choosing one or another. More specifically, the output of the Policy Gradient algorithm is a probability distribution, which is defined by the function approximator's θ parameters. Consequently, the parameters of the function approximator have to be tuned precisely to maximize the agent's performance indicator function, which is derived from the reward signal acquired from the environment, J(θ) = J(π_θ). The mentioned function looks as follows:
J(\pi_{\theta}) = \mathbb{E}\left[ \sum_{t=1}^{\tau} \gamma^{t} r_{t} \right] \qquad (2)
The underlying idea behind the equation above is to find local extrema by moving the parameters in the direction of the greatest increase of the reward signal. Considering Williams (1992) and Sutton et al. (2000), the update rule of the Policy Gradient algorithm is formulated as follows:
\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t, a_t) \sum_{t=1}^{\tau} \gamma^{t} r_{t} \qquad (3)
π_θ(s_t, a_t) stands for the probability of choosing action a_t in state s_t, and the α parameter is the learning rate, which defines how firmly a gradient can change the function approximator's parameters at once. The training procedure can be summarized as follows:

1. The training starts from the initial state s_0 of the environment, while the function approximator's θ parameters are generated according to the chosen initialization scheme.
2. After the initialization of the training loop's two entities, they interact until a terminal state is encountered.
3. The interactions {S, A, R} are saved to the episode history, and the discounted rewards are calculated via the chosen γ discount factor.
4. Finally, the gradients are calculated and added to the gradient buffer, which updates the function approximator's weights when the update frequency is reached.
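As an illustration of Eq. (3) and the four steps above, a PyTorch-style sketch of the update over a buffer of episodes (collected every ξ episodes) might look like this; the interfaces and data layout are assumptions for illustration, not the authors' implementation.

```python
import torch


def policy_gradient_step(policy_net, optimizer, episode_buffer, gamma=0.9):
    """One Policy Gradient update over a buffer of episodes (cf. Eq. (3)).

    Each buffer entry is a (states, actions, rewards) tuple collected with the
    current policy; `policy_net` maps state vectors to action logits.
    """
    loss = torch.zeros(())
    for states, actions, rewards in episode_buffer:
        # Discounted return of the episode, as in Eq. (1).
        ret = sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

        logits = policy_net(torch.as_tensor(states, dtype=torch.float32))
        log_probs = torch.log_softmax(logits, dim=-1)
        idx = torch.arange(len(actions))
        taken = log_probs[idx, torch.as_tensor(actions)]   # log pi(a_t | s_t)

        # Maximizing J(theta) is implemented as minimizing the negative
        # return-weighted log-likelihood of the taken actions.
        loss = loss - taken.sum() * ret

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # theta <- theta + alpha * gradient (here via Adam)
```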
3. Environment
As mentioned, the chosen problem is the Traffic Signal Control (TSC) of a simple intersection, where the vehicles
can arrive from all four directions on one lane each. The left turns in the intersection are omitted for simplicity. Hence
the vehicles are only allowed to turn right or go straight out from the network. The training environment is modeled
in SUMO (Simulation of Urban Mobility) (Lopez et al., 2018). SUMO is a perfect tool for such an endeavor since
it can utilize both real transportation networks and artificially created ones with tremendous options to customize
and scale. On top of all that, it has an excellent Python interface called TraCI, which is crucial from the aspect of
Deep Reinforcement Learning, since most of the development tools and frameworks like TensorFlow, PyTorch, and Keras are implemented in Python. Figure 1 shows how SUMO is integrated into the introduced RL training loop.
The described infrastructure is formulated in SUMO and controlled from the Python training environment via TraCI.
The Python training loop is designed according to the OpenAI gym standards. Hence the decisions of the agent are
channeled into TraCI commands by the Python environment. Then the TraCI commands trigger changes in the SUMO
infrastructure. The change’s measures are acquired by TraCI and formulated appropriately by the Python environment
to represent the infrastructure’s state to the agent.
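A rough sketch of such a Gym-style wrapper is given below, assuming illustrative lane IDs, traffic-light ID, phase indices, and configuration file name; the state and reward here are simplified placeholders, with the full definitions discussed later in this section.

```python
import numpy as np
import traci  # SUMO's Python interface (TraCI)


class SumoIntersectionEnv:
    """Gym-style wrapper for the single intersection described above (a sketch).

    The lane IDs, traffic-light ID, phase indices and .sumocfg path below are
    placeholders; they depend on the concrete SUMO network definition.
    """

    LANES = ["north_in_0", "south_in_0", "east_in_0", "west_in_0"]  # assumed IDs
    TLS_ID = "center"                                               # assumed ID
    MAX_VEHICLES = 20                                               # assumed per-lane cap

    def __init__(self, sumo_cfg="intersection.sumocfg", green_duration=5):
        self.sumo_cmd = ["sumo", "-c", sumo_cfg]
        self.green_duration = green_duration

    def reset(self):
        traci.start(self.sumo_cmd)
        return self._state()

    def step(self, action):
        # Action 0 -> North-South Green (NSG), action 1 -> East-West Green (EWG);
        # the phase indices 0 and 2 are assumptions about the signal program.
        traci.trafficlight.setPhase(self.TLS_ID, 0 if action == 0 else 2)
        for _ in range(self.green_duration):        # keep the chosen setting for 5 s
            traci.simulationStep()
        state = self._state()
        done = bool(np.any(state >= 1.0))           # per-lane threshold exceeded
        reward = -float(np.std(state))              # placeholder; Eq. (4) is sketched later
        return state, reward, done, {}

    def close(self):
        traci.close()

    def _state(self):
        counts = [traci.lane.getLastStepVehicleNumber(l) for l in self.LANES]
        return np.array(counts, dtype=np.float32) / self.MAX_VEHICLES
```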
Deep Reinforcement Learning’s main advantages and capabilities are real-time applicability, generalization, and
scalability of the solution, which means that the trained Neural Network can manage unseen scenarios in the right
fashion. In establishing high-level generalization and scalability, most of the attention goes to the algorithms. Still,
the state vector’s importance that represents the environment to the agent also has to be valued appropriately since
this is the part of DRL where the algorithm can assign credit, or in other words, figure out how each piece of infor-
mation affects one or another. Consequently, the composition of the control task’s abstraction is the key to a scaleable
and generalizable solution. Hence state representation is one of the most crucial aspects since a state descriptor’s
conciseness directly influences the training procedure’s success. Moreover, the formulation of the state descriptor is
exclusively the researcher’s duty. Thus the result depends on how well the researcher knows the abstraction of the
control problem. Nevertheless, the same consideration goes for the formulation of the reward signal in DRL because
this is the only compass that the agent has in finding the optimal behavior. In this paper, our angle is to compose a state
vector and a rewarding scheme for the TSC problem that can be easily generalized and has the potential of scaling.
Image-like state representations are widespread in the realm of traffic signal control problems. The majority of the researchers favor Discrete Traffic Signal Encoding (DTSE), where all the lanes of the intersection are divided into cells starting from the lane's end. The average length of the vehicles determines the cell size. The cells are filled with different types of information, such as speed, acceleration, vehicle position, and signal phase. This matrix can be
fed into a Convolutional Neural Network (CNN) with n channels, where cells are interpreted as pixels, and n defines
the number of different types of information represented in the given cells. Other image-like representations are raw
RGB images that can include roadside information, which can be advantageous in several scenarios. Another common
approach in state representation is feature-based value vectors. In this case, every lane of a road is represented with a
set of values such as phase duration, average speed, cumulative waiting time, phase cycle, queue length. This concept’s
main advantage is that the utilized information can be easily gathered from the intersection with loop detectors or road
sensors.
In our case, a feature-based approach is utilized, which solely contains one value for each lane. This value is the
number of vehicles currently occupying the particular lane, which is not the same as the queue length. The queue
length is the feature mostly utilized in the state representation of the TSC problem. Still, it is hard to estimate it
correctly, and in some sense, it carries the information about the lane’s traffic load in a delayed manner compared to
this approach. The values are normalized to the [0, 1] interval to mitigate the effect of both vanishing and exploding gradients during the training process. We chose the number of vehicles because it has a straightforward relationship
with the current traffic load on a particular lane, which can be easily channeled into decisions concerning the traffic
lights.
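For illustration, the `_state` placeholder of the earlier wrapper sketch could be expanded along these lines; the lane IDs and the normalization constant are assumptions, not values taken from the paper.

```python
import numpy as np
import traci  # SUMO's Python interface (TraCI)

# Incoming lane IDs of the intersection; placeholder names that depend on the
# SUMO network files.
INCOMING_LANES = ["north_in_0", "south_in_0", "east_in_0", "west_in_0"]

# Assumed maximum number of vehicles a lane can hold, used for normalization.
MAX_VEHICLES_PER_LANE = 20


def get_state():
    """State vector: normalized number of vehicles currently on each lane."""
    counts = [traci.lane.getLastStepVehicleNumber(lane) for lane in INCOMING_LANES]
    state = np.array(counts, dtype=np.float32) / MAX_VEHICLES_PER_LANE
    return np.clip(state, 0.0, 1.0)   # keep the representation in [0, 1]
```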
In the formulated SUMO infrastructure, there are four lanes from which the traffic load can arrive, and in the intersection, the vehicles are only allowed to turn right or go straight out of the network. For this environment, discrete actions are utilized for the sake of simplicity. There are only two discrete actions. The first one turns the traffic signal to green in the vertical (North-South) lanes, while the horizontal (East-West) lanes are red, called North-South Green (NSG). The second one does the opposite, notably turning the vertical lanes' traffic lights to red and the horizontal lanes' to green, called East-West Green (EWG). It is also important to mention that the chosen traffic signal setting stays intact for 5 s measured in SUMO's simulation time. Only after that can the agent choose another one.
Along with the state representation, the formulation of the reward scheme is another crucial aspect since this scalar
feedback value is the only guidance that an agent has in the endeavor of developing an optimal behavior. Moreover,
the nature of rewards is akin to that of heuristics; thus, they come with no guarantees. This paper introduces a new
rewarding for the TSC problem, which arises from our new interpretation of the composed feature-based value vector
containing the normalized number of vehicles per lane. The state vector is interpreted as a distribution. The actions
can control the distribution’s mean and standard deviation by changing the traffic lights and easing the lanes’ load.
The agent’s goal is chosen to minimize the standard deviation of this distribution. For achieving this goal, the reward
is immediate. Its value is the distribution's calculated standard deviation, min-max normalized to the [-1, 1] interval.
The scheme is complemented with conditions that always have to be fulfilled during the training. If not, the episode
terminates, and the agent receives a punishment instead of a reward. The conditions determine a threshold regarding
the number of vehicles in each lane. If it is exceeded, the episode terminates. The episode is also terminated if the
agent does not provide a green phase to a direction that has lower occupancy than the introduced threshold, but the
lanes of the other direction are empty. The traffic-wise interpretation of this concept is that the agent must learn that
if there is a lane asymmetrically loaded with traffic, it should provide more green time in that particular direction.
It is also essential that the agent’s way of interfering is easing up heavily loaded lanes, which lowers the number of
vehicles that stay on the particular lane. Our expectation from the introduced state-action-reward structure is that it can
be easily generalized; hence it lets the agent quickly adapt to unseen traffic scenarios.
Of course, this concept operates over the assumption that each lane is identically important, and the traffic load has
to be eased equally.
The calculation of the introduced reward is shown in Eq. (4), which min-max normalizes the standard deviation σ computed from the state vector:

R = R_{max} - \frac{\sigma - \sigma_{min}}{\sigma_{max} - \sigma_{min}} \left( R_{max} - R_{min} \right) \qquad (4)

The parameters in the reward function are chosen according to the σ calculated from the state vector. If σ < 0.1, the parameters are R_max = 1, R_min = 0, σ_max = 0.1, σ_min = 0. If σ ≥ 0.1, the parameters are R_max = 0, R_min = −1, σ_max = 0.5, σ_min = 0.1.
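A sketch of this reward scheme is given below; the occupancy threshold, the punishment value, and the clipping of σ are assumptions, since the paper does not state them explicitly, and the second termination condition (withholding green from the only loaded direction) is omitted for brevity.

```python
import numpy as np


def tsc_reward(state, threshold=1.0, punishment=-1.0):
    """Reward per Eq. (4) from the normalized per-lane vehicle counts.

    `threshold` (on the normalized occupancy) and `punishment` are illustrative
    values, not taken from the paper. Returns (reward, done).
    """
    state = np.asarray(state, dtype=np.float32)

    # Termination condition: a lane exceeds the vehicle-count threshold.
    # (The second condition of the paper, concerning withheld green phases,
    # would additionally require signal-state information.)
    if np.any(state > threshold):
        return punishment, True

    sigma = float(np.std(state))
    if sigma < 0.1:
        r_max, r_min, s_max, s_min = 1.0, 0.0, 0.1, 0.0
    else:
        r_max, r_min, s_max, s_min = 0.0, -1.0, 0.5, 0.1

    sigma = min(sigma, s_max)  # clip so the reward stays within [-1, 1]
    reward = r_max - (sigma - s_min) / (s_max - s_min) * (r_max - r_min)
    return reward, False
```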
3.4. Limitations
The study focuses on the TSC problem for the case of a single intersection where the left turns are omitted, and
the possible signal phases can not conflict. Thus the agent’s only goal is to find the proper phase combination for the
given traffic flow.
3.5. Training
The paper aims to formulate a new state representation complemented with a unique reward scheme that enables
the high-level generalization of the TSC problem. The most common approach for exploiting the Neural Network's generalization capability is a randomized environment, where the agent interacts with many different scenarios over a tremendous number of training episodes, which makes the reservations about RL's sample inefficiency legitimate. Following this concept can result in a solution with excellent performance. Still, we cannot measure generalization directly, since it is not known which scenarios the agent is unfamiliar with. Consequently, the opposite strategy is utilized, where the agent interacts with the same scenario in every episode. Thanks to that, we can measure the level of generalization, specifically by evaluating the agent's performance in truly unseen scenarios. The chosen training scenario that the agent tries to solve during the entire training process is the control of an equally loaded intersection, where each lane toward the intersection is assigned the same amount of traffic flow. The parameters of the training and the Neural Network are shown in Table 1.

Table 1. Parameters of the training and the Neural Network.

Parameter                              Value
Layers                                 Dense
Num. of hidden layers                  3
Num. of neurons                        256, 256
Hidden layers activation function      ReLU
Learning rate (α)                      0.0001
Discount factor (γ)                    0.9
Update frequency (ξ) in eps.           10
Optimizer                              Adam
Kernel initializer                     Xavier normal
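For concreteness, a network and optimizer matching Table 1 could be assembled in PyTorch roughly as follows; the state dimension, the number of actions, and the size of the third hidden layer (not listed in Table 1) are assumptions, and this is a sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn


def build_policy_network(state_dim=4, n_actions=2, hidden=(256, 256, 256)):
    """Policy network following Table 1 (Dense layers, ReLU, Xavier-normal init).

    Table 1 lists three hidden layers but only two neuron counts (256, 256);
    the third layer is assumed here to be 256 units as well.
    """
    layers, in_dim = [], state_dim
    for h in hidden:
        layers += [nn.Linear(in_dim, h), nn.ReLU()]
        in_dim = h
    layers.append(nn.Linear(in_dim, n_actions))  # logits over the two phases
    net = nn.Sequential(*layers)

    # Xavier-normal ("Glorot") weight initialization, as in Table 1.
    for m in net:
        if isinstance(m, nn.Linear):
            nn.init.xavier_normal_(m.weight)
            nn.init.zeros_(m.bias)

    return net


# Adam optimizer with the learning rate from Table 1 (alpha = 0.0001).
policy_net = build_policy_network()
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)
```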
Figure 2 shows the convergence of the PG algorithm on the introduced scenario. Both the average and the min
reward of the episodes are displayed. As mentioned above, the reward is calculated based on the standard deviation of the lanes' densities. Hence, each value can be directly associated with the quality of the control performed by the PG
agent.
4. Results
After training the agent, its performance is evaluated in various unseen scenarios that can genuinely assess gen-
eralization quality. But first, the agent’s behavior is considered in the very traffic scenario used during training. This
scenario is when all lanes have equal traffic flow, and the traffic signals have to be controlled accordingly. For this
traffic scenario, the solution is obvious; the agent must provide equal green time for both the horizontal and vertical
directions, and it does just that. At first sight, this shows that the agent understands its goal and reaches it, but it could still be overfitting. Consequently, edge cases are defined to obtain a clear and thorough assessment of the agent's behavior. The first class of cases comprises the ones where only the horizontal or only the vertical lanes get traffic flow. It goes without
saying that the agent has never seen a scenario like that. Nevertheless, it solves the scenario properly by only providing green time to the direction (whether horizontal or vertical) that has traffic flow and keeping the opposite direction's traffic lights red. This behavior shows that the agent understands the control task and the actions' influence on the infrastructure. For that matter, this behavior certainly performs better in terms of queue length than a traffic light control with fixed periods, which shows that the RL-based approach can operate in a traffic-responsive manner. The
second class of test cases is where each lane gets traffic flow, but one lane gets considerably more. In these cases, the
agent alters the symmetric policy that provides equal green time for all directions. The alteration is substantial when
the number of vehicles exceeds the limit, used as a terminal condition in the training procedure. In this scenario, the
agent seems to understand what an asymmetric traffic flow means for the intersection, and it acts accordingly. These
evaluations suggest that despite the training setup, the abstraction of the control problem along with the reward scheme
that arises from the new interpretation of state representation and the simplified discrete actions create a structure that
profoundly enhances the generalization of the problem. The trained agent and SUMO's built-in time-gap based actuated controller are also compared in the same scenarios. Table 2 shows the results: in both scenarios, the agent applies a control that results in fewer vehicles in the most occupied lanes.
Table 2. Maximum density of a link during the simulation [%]: RL-Agent vs. time-gap based actuated control.
5. Conclusion
This paper proposes a new abstraction for the TSC problem: a combination of a new state representation, a brand new rewarding concept, and a simplified action space. Together, these components boost the generalization of the control
problem. As the evaluations underlined, the trained RL agent adapts its behavior to unseen traffic scenarios to operate
in the right fashion. In the future, further versatilities of the proposed approach have to be explored. Notably, it is possible to establish a hierarchy of importance between the individual lanes by weighting them when calculating the standard deviation. This consideration can make the agent more sensitive to the traffic load of specific lanes. Moreover, such an approach could manage multiple interconnected intersections, since this extension would not change the underlying idea.
Acknowledgements
The research was supported by the Ministry of Innovation and Technology NRDI Office within the framework of
the Autonomous Systems National Laboratory Program. The research was also supported by the Hungarian Govern-
ment and co-financed by the European Social Fund through the project "Talent management in autonomous vehicle control technologies" (EFOP-3.6.3-VEKOP-16-2017-00001).
References
Aradi, S., 2020. Survey of deep reinforcement learning for motion planning of autonomous vehicles. IEEE Transactions on Intelligent
Transportation Systems .
Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S., 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint
arXiv:1611.09940 .
El-Tantawy, S., Abdulhai, B., Abdelgawad, H., 2014. Design of reinforcement learning parameters for seamless application of adaptive traffic
signal control. Journal of Intelligent Transportation Systems 18, 227–245.
Farazi, N.P., Ahamed, T., Barua, L., Zou, B., 2020. Deep reinforcement learning and transportation research: A comprehensive review. arXiv
preprint arXiv:2010.06187 .
Gao, J., Shen, Y., Liu, J., Ito, M., Shiratori, N., 2017. Adaptive traffic signal control: Deep reinforcement learning algorithm with experience
replay and target network. arXiv preprint arXiv:1705.02755 .
Guo, J., Harmati, I., 2020. Comparison of game theoretical strategy and reinforcement learning in traffic light control. Periodica Polytechnica
Transportation Engineering .
Haydari, A., Yilmaz, Y., 2020. Deep reinforcement learning for intelligent transportation systems: A survey. arXiv preprint arXiv:2005.00935.
Kisgyörgy, L., Szele, A., 2018. Autonomous vehicles in sustainable cities: more questions than answers, in: WIT Transactions on Ecology and
the Environment, pp. 725–734. doi:10.2495/SDP180611.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2015. Continuous control with deep reinforcement
learning. arXiv preprint arXiv:1509.02971 .
Lopez, P.A., Behrisch, M., Bieker-Walz, L., Erdmann, J., Flötteröd, Y., Hilbrich, R., Lücken, L., Rummel, J., Wagner, P., Wiessner, E., 2018.
Microscopic traffic simulation using sumo, in: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2575–
2582. doi:10.1109/ITSC.2018.8569938.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G.,
et al., 2015. Human-level control through deep reinforcement learning. Nature 518, 529–533.
Muresan, M., Fu, L., Pan, G., 2019. Adaptive traffic signal control with deep reinforcement learning an exploratory investigation. arXiv
preprint arXiv:1901.00960 .
Nazari, M., Oroojlooy, A., Snyder, L.V., Takáč, M., 2018. Reinforcement learning for solving the vehicle routing problem. arXiv preprint
arXiv:1802.04240 .
Papageorgiou, M., 2004. Overview of road traffic control strategies. IFAC Proceedings Volumes 37, 29–40. URL: http://www.sciencedirect.com/science/article/pii/S1474667017306572, doi: 10.1016/S1474-6670(17)30657-2. 4th IFAC Workshop DECOM-TT 2004: Automatic Systems for Building the Infrastructure in Developing Countries, Bansko, Bulgaria, October 3-5, 2004.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al., 2017.
Mastering the game of Go without human knowledge. Nature 550, 354–359.
Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y., 2000. Policy gradient methods for reinforcement learning with function approxima-
tion, in: Advances in neural information processing systems, pp. 1057–1063.
Thorpe, T.L., Anderson, C.W., 1996. Traffic light control using SARSA with three state representations. Technical Report. Citeseer.
Touhbi, S., Babram, M.A., Nguyen-Huu, T., Marilleau, N., Hbid, M.L., Cambier, C., Stinckwich, S., 2017. Adaptive traffic signal control:
Exploring reward definition for reinforcement learning. Procedia Computer Science 109, 513–520.
Wen, K., Qu, S., Zhang, Y., 2007. A stochastic adaptive control model for isolated intersections, in: 2007 IEEE International Conference on
Robotics and Biomimetics (ROBIO), IEEE. pp. 2256–2260.
Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 229–256.