Article
Optimization Control Strategy for a Central Air Conditioning
System Based on AFUCB-DQN
He Tian 1,2 , Mingwen Feng 1,2 , Huaicong Fan 1,2 , Ranran Cao 1,2 and Qiang Gao 3,4, *
1 National Demonstration Center for Experimental Mechanical and Electrical Engineering Education,
Tianjin University of Technology, Tianjin 300384, China
2 Tianjin Key Laboratory for Advanced Mechatronic System Design and Intelligent Control,
School of Mechanical Engineering, Tianjin University of Technology, Tianjin 300384, China
3 School of Electrical Engineering and Automation, Tianjin University of Technology, Tianjin 300384, China
4 Tianjin Key Laboratory for Control Theory & Applications in Complicated Industry Systems,
Tianjin 300000, China
* Correspondence: gaoqiang@tjut.edu.cn
Abstract: The central air conditioning system accounts for 50% of the building energy consumption,
and the cold source system accounts for more than 60% of the total energy consumption of the central
air conditioning system. Therefore, it is crucial to determine the optimal control strategy of the cold source
system according to the cooling load demand, and adjust the operating parameters in time to achieve
low energy consumption and high efficiency. Due to the complex and changeable characteristics of
the central air conditioning system, it is often difficult to achieve ideal results using traditional control
methods. In order to solve this problem, this study first coupled the building cooling load simulation
environment and the cold source system simulation environment to build a central air conditioning
system simulation environment. Secondly, noise interference was introduced to reduce the gap
between the simulated environment and the actual environment, and improve the robustness of the
environment. Finally, combined with deep reinforcement learning, an optimal control strategy for the
central air conditioning system is proposed. Aiming at the simulation environment of the central air
conditioning system, a new model-free algorithm is proposed, called the advantage function upper
confidence bound deep Q-network (AFUCB-DQN). The algorithm combines the advantages of an
advantage function and an upper confidence bound algorithm to balance the relationship between
exploration and exploitation, so as to achieve a better control strategy search. Compared with the traditional deep Q-network (DQN) algorithm, double deep Q-network (DDQN) algorithm, and the distributed double deep Q-network (D3QN) algorithm, the AFUCB-DQN algorithm has more stable convergence, faster convergence speed, and higher reward. In this study, significant energy savings of 21.5%, 21.4%, and 22.3% were obtained by conducting experiments at indoor thermal comfort levels of 24 °C, 25 °C, and 26 °C in the summer.
1. Introduction
With the rapid development of the global economy, building energy consumption is
increasing, and has become one of the three major energy-consuming sectors, alongside industrial and transportation energy consumption. In office buildings that utilize central air conditioning, the energy consumption of central air conditioning accounts for approximately 50% of the total building energy consumption [1]. The energy consumption of the chiller system constitutes 60% to 80% of the entire air conditioning system [2]. Most central air conditioning systems operate with parameters set to their maximum values, making the optimization of chiller system operating parameters crucial for energy savings in the overall central air conditioning system. The cooling load of building air conditioning is influenced
by various factors, such as outdoor meteorological parameters, building design, and indoor
occupancy. Therefore, dynamically controlling the system’s operating parameters based
on cooling load demand to improve energy efficiency has typically been a focal point in
research on energy savings in central air conditioning systems [3]. Table 1 presents the key
findings from recent research articles in the field.
Since the introduction of adaptive algorithms in 1980, adaptive control has become
one of the important means to solve the problem of adjusting control parameters of air
conditioning systems [4]. In the context of the European Union (EU) and governments
around the world developing mandatory building energy research and conservation poli-
cies for buildings and their air conditioning systems [5], it has become crucial to properly
establish the energy performance of buildings and their different systems to reduce the gap
between building energy model (BEM) simulation results and actual measurements [6].
The core of central air conditioning energy-saving optimal control is to find the best air
conditioning control parameters while maintaining indoor comfort requirements to achieve
the goal of minimum energy consumption. Gao et al. [7] proposed an event-triggered
distributed model predictive control (DMPC) scheme for improving indoor temperature
regulation in multizone buildings. By comprehensively considering energy consumption
and thermal comfort, the scheme determines the optimal temperature set point and verifies
its effectiveness in practical cases. Salins et al. [8] controlled the heating, ventilation, and
air conditioning (HVAC) system through an adaptive control system, which improved the
thermal comfort of the occupants and the system efficiency. Aruta et al. [9] proposed
an optimization framework based on model predictive control and genetic algorithms
to minimize heating energy costs and thermal discomfort. Yang et al. [10] successfully
reduced the total energy consumption of the air conditioning water system by using the
improved parallel artificial immune system (IPAIS) algorithm. Sun et al. [11] used the
equilibrium optimization (EO) algorithm to optimize the load scheduling of chillers in the
HVAC system, which effectively saved energy consumption. Tang et al. [12] proposed a
model predictive control (MPC) method for optimally controlling central air conditioning
systems integrated with cold storage during rapid demand response (DR) events, achieving
power reduction and indoor environment optimization, reducing energy consumption, and
ensuring comfort. However, central air conditioning systems are highly nonlinear, uncer-
tain, time-varying, and coupled, which increase the requirements for control algorithms.
Traditional adaptive algorithms and control methods often fail to achieve ideal control
effects when dealing with these challenges [13]. In addition, the mechanism modeling and
parameter identification of these algorithms are relatively complex.
Reinforcement learning (RL) [14] is a machine learning approach that has emerged in
recent years, and is characterized by self-learning and online learning capabilities. Through
the mechanism of “actions and rewards”, RL can achieve the adaptive optimization of
controllers in the absence of control system models, making it a data-driven control method.
Deep reinforcement learning (DRL) [15] inherits the feature representation capabilities of
deep learning and the ability of reinforcement learning to interact autonomously with the
environment. In recent years, DRL has been widely applied in the field of air conditioning
control, and can be categorized into model-based and model-free RL. Both categories are defined with respect to the Markov decision process (MDP) five-tuple (state S, reward R, action A, state transition probability P, discount factor γ): if the five-tuple is fully known, a method is considered model-based; otherwise, it is regarded as model-free.
Model-based algorithms are appealing for task implementation because an optimized
model can provide the intelligent agent with “foresight” to simulate scenarios and under-
stand the consequences of actions, even in the absence of knowledge about the dynamic
environment. Monte Carlo tree search (MCTS) is the most well-known model-based al-
gorithm widely applied in many board games, such as chess and Go. The iterative linear
quadratic regulator (iLQR) [16] and MPC generally require stringent assumptions to be
made for their implementation. Zhao et al. [17] proposed a model-based DRL approach
using a hybrid model to address the heating, ventilation, and air conditioning control
problem, which improved learning efficiency and reduced learning costs. Chen et al. [18]
combined model-based deep reinforcement learning with MPC to propose a novel learning-
based control strategy for HVAC systems, demonstrating the effectiveness of the algorithm
through simulation experiments. However, acquiring an accurate model is challenging
for most problems. Many environments are stochastic, and their dynamic transitions are
unknown, requiring the model to be learned. Modeling in environments with large state
and action spaces is particularly difficult, especially when the transitions are complex.
Furthermore, the model can only be effective if it can accurately predict future changes in
the environment. In particular, central air conditioning systems, as complex multivariable
systems, pose additional challenges in modeling and prediction.
Model-free RL learns optimal control strategies by interacting with the model-free
building environment, avoiding cumbersome modeling work [19] and offering better
scalability and generalization capabilities [20]. For instance, Heo et al. [21] proposed a
data-driven intelligent ventilation control strategy based on deep reinforcement learning,
effectively improving system performance through deep Q-network (DQN) algorithm-
controlled air conditioning systems. Yuan et al. [22] presented a reinforcement learning-
based control strategy for a variable air volume (VAV) air conditioning system. Wei et al. [23]
introduced a data-driven approach based on deep reinforcement learning to control variable
air volume HVAC systems. Deng et al. [24] combined active building environment change
detection with DQN to propose a novel HVAC control strategy, effectively saving energy
consumption. Lei et al. [25] proposed a practical person-centric multivariable HVAC
control framework based on DRL, utilizing a branching dueling Q-network (BDQ) to
significantly reduce energy consumption. Marantos et al. [26] applied neural network-
fitted Q-iteration methods to HVAC system control, achieving significant improvements in
energy efficiency and thermal comfort compared to rule-based controllers. Zhang et al. [27]
used the asynchronous advantage actor–critic (A3C) algorithm to control HVAC systems,
making them suitable for the overall building energy model and achieving energy-saving
effects. Wang et al. [28] applied the Monte Carlo actor–critic algorithm with long short-term
memory (LSTM) neural networks to HVAC system control to achieve optimization effects.
Ding et al. [29] proposed a deep reinforcement learning-based multizone residential HVAC
thermal comfort control strategy, implementing the optimal HVAC thermal comfort control
policy through a deep deterministic policy gradient (DDPG). Zhang et al. [30] reduced the
heating demand in office building heating systems through deep reinforcement learning
training. Gao et al. [31] developed a DDPG-based method to learn the optimal thermal
comfort control policy, effectively reducing HVAC energy consumption.
In recent years, research has primarily focused on proposing energy-saving strategies
for HVAC systems using deep reinforcement learning methods and validating the perfor-
mance of these algorithms. However, there is still insufficient research on energy-saving
strategies for cooling systems and improving deep reinforcement learning algorithms to
adapt to these strategies. Additionally, stable and secure data obtained from real-world
central air conditioning system environments are scarce, and the cost of data acquisition is
high, making it unsuitable for directly training reinforcement learning agents. Therefore, it
is necessary to establish a simulated environment for central air conditioning systems that
closely resembles real-world conditions.
This research considered multiple factors that influence building cooling loads, includ-
ing solar radiation, human heat dissipation, heat transfer through external windows, and
heat transfer through exterior walls, to construct a simulation environment for building
cooling loads. For the chiller unit, cooling tower, and water pump, this research established
a simulation environment for the cooling source system. By coupling the simulation en-
vironment for the building cooling load with the simulation environment for the cooling
source system, this research created a simulated environment for the central air condition-
ing system. To make the simulation environment more realistic in terms of data collection
processes for operating parameters, this research introduced noise interference and, con-
sequently, enhanced the robustness of the environment. This approach not only allows
the simulation of data anomalies caused by various disturbances, but also improves the
reliability of the simulation environment.
For the established simulated environment of the central air conditioning system, this
research proposes the advantage function upper confidence bound deep Q-network (AFUCB-DQN)
algorithm. Unlike traditional DQN algorithms, this research utilized an advantage function
to reduce the influence of air conditioning data variance on environmental variance. This
research also combined the upper confidence bound (UCB) algorithm to address the issue
of sampling errors caused by environmental stochasticity induced by noise. The main
contributions of this paper are summarized as follows:
1. This research proposes a comprehensive simulation environment for deep reinforce-
ment learning that considers the building cooling load, which can provide guidance
for the energy optimization of real-world central air conditioning systems.
2. This research took into account various disturbances encountered during sensor
data collection of the cooling source system in real-world environments. To reduce
the discrepancy between the constructed simulation environment and the real envi-
ronment, and enhance the robustness of the environment, this research introduced
noise interference.
3. For the proposed simulation environment of the central air conditioning system
in this paper, this research introduced an advantage function based on the DQN
algorithm and combined it with the UCB algorithm to form the proposed AFUCB-
DQN algorithm.
The remaining sections of this paper are organized as follows. Section 2 presents the
theoretical background of reinforcement learning, the Q-learning algorithm, and the DQN
algorithm. Section 3 constructs and couples the simulation environment for building cool-
ing load with the simulation environment for the cooling source system, while establishing
the Markov decision process for the central air conditioning system. Section 4 introduces
the proposed AFUCB-DQN algorithm. Section 5 validates the simulation environment and
discusses the experimental results. Section 6 summarizes the paper and proposes directions
for future work.
2. Related Theory
This section introduces the relevant theory of reinforcement learning, Q-learning, and
DQN algorithms. The main terms used in this paper are summarized in Table 1.
2.2. Q-Learning
The Q-learning algorithm [33] is a classic model-free algorithm that involves con-
structing a Q-table that stores the expected rewards (Q-value) for different actions in each
state–action pair. In this algorithm, the agent selects actions based on the current state by
choosing the action with the maximum Q-value. The Q-value represents the estimation of
the current reward plus the discounted future rewards, and serves as an approximation of
the reward function.
The Q-learning algorithm is not constrained by the environment model or the state transition function, as it accumulates experience through interactions with the environment. Convergence of the Q-table is achieved when the values in the table no longer undergo significant changes. The updated formula for the state–action value function in this algorithm is as follows:

Q(s, a) = Q(s, a) + α [ r + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s, a) ]   (1)

In this context, the reward discount factor γ ∈ [0, 1] represents the extent to which current actions influence future rewards. When γ equals 0, the agent only considers immediate rewards [34]. The Q-learning algorithm learns by continuously updating the Q-table. However, frequent read and write operations on Q-values can decrease learning efficiency and limit the algorithm's capability to handle larger state spaces.
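As a concrete illustration of Equation (1), a minimal tabular sketch is given below; the state/action space sizes and the learning-rate and discount values are illustrative placeholders, not parameters taken from this paper.

```python
import numpy as np

n_states, n_actions = 50, 4           # illustrative discretized state and action space sizes
alpha, gamma = 0.1, 0.95              # example learning rate and discount factor

Q = np.zeros((n_states, n_actions))   # the Q-table

def q_learning_update(s, a, r, s_next):
    """One application of Equation (1): move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```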
2.3. DQN
In the DQN algorithm, the loss function is constructed as shown in the following formula:

L(ω) = E[ ( r + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, ω) − Q(s, a, ω) )^2 ]   (2)
Taking the partial derivative of the loss function with respect to the parameters ω, we obtain the following gradient and update the network parameters ω:

∂L(ω)/∂ω = ( r + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, ω) − Q(s, a, ω) ) · ∂Q(s, a, ω)/∂ω   (3)
During the training process, the DQN algorithm incorporates an experience replay
mechanism, which breaks the correlations between samples and enhances the stability of
the algorithm. Additionally, the DQN algorithm constrains the range of reward values
and error terms, ensuring that the Q-values and gradient values remain within reasonable
bounds, further improving the stability of the algorithm.
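A condensed PyTorch sketch of the loss in Equation (2) over a replay-buffer mini-batch follows; the separate target network is the standard stabilization used alongside experience replay (Equation (2) itself writes both terms with the same parameters ω), and the tensor layout of the batch is an assumption.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.95):
    """Mean squared TD error of Equation (2) on a mini-batch sampled from the replay buffer."""
    s, a, r, s_next, done = batch                              # assumed batch layout
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a, w)
    with torch.no_grad():                                      # frozen copy provides the target
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q_sa, target)
```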
3. Environment Construction
In this study, the building cooling load simulation environment and the cold source system simulation environment were coupled in order to realize comprehensive simulation and control optimization of the central air conditioning system.
Specifically, this research established a building cooling load simulation environment that
can consider outdoor meteorological conditions and building characteristics, including
temperature, humidity, solar radiation, and building parameters, to accurately calculate
the building’s cooling load demand. At the same time, this research also established a cold
source system simulation environment to simulate the operating status of the chiller system.
In order to realize the coupling between the building cooling load and the cooling
source system, this research applied the output of the building cooling load simulation
environment as the input of the cooling source system simulation environment. Specifically,
the building cooling load simulation environment provides real-time cooling load demand
information to the cold source system simulation environment, and the cold source system
simulation environment adjusts the operating parameters of the chiller according to this
demand information, including chilled water supply temperature, chilled water flow,
cooling water flow, and cooling tower air volume. In this way, the cold source system
can be dynamically adjusted according to the actual demand of the building’s cooling
load, so as to optimize energy consumption and improve energy saving effects. Through
the above coupled methods, this research realized the comprehensive simulation and
control optimization of the central air conditioning system. This integrated approach can
more accurately simulate the operation of the central air conditioning system, and provide
guidance for the optimal control of the actual central air conditioning system.
In order to keep the indoor temperature stable, this research designed the cooling
capacity of the air conditioner to be 1.2 times the cooling load of the building, so as to
provide enough cooling capacity reserve to meet the demands of sudden temperature
fluctuations and load increases. Furthermore, in order to optimize the control parameters
of the cold source system, this research modeled the simulated environment of the central
air conditioning system as a Markov decision process, and used a deep reinforcement
learning algorithm for training and convergence. Through this modeling method, the
system can learn and output the best control action, so that the cold source system can be
optimized and controlled according to the actual situation, improving energy efficiency
and performance.
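A schematic sketch of how such a coupled environment can expose a single MDP step is shown below; the class and method names (load_model.demand, cold_source.run), the state layout, and the energy-only reward are illustrative assumptions rather than the paper's exact formulation.

```python
class CentralACEnv:
    """Toy coupling: building cooling load -> cold source system -> energy-based reward."""

    def __init__(self, load_model, cold_source_model, safety_factor=1.2):
        self.load_model = load_model          # building cooling load simulation
        self.cold_source = cold_source_model  # chiller / pump / cooling tower simulation
        self.safety_factor = safety_factor    # cooling capacity = 1.2 x cooling load

    def step(self, action, weather, occupancy):
        # action: [chilled water supply temperature, chilled water flow,
        #          cooling water flow, cooling tower air volume]
        cooling_load = self.load_model.demand(weather, occupancy)
        capacity = self.safety_factor * cooling_load
        power, indoor_temp = self.cold_source.run(action, capacity)
        reward = -power                       # penalize energy use; comfort terms omitted here
        state = [weather.temperature, weather.humidity, cooling_load, indoor_temp]
        return state, reward, False, {}
```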
Figure 2. Types of heat comprising the building cooling load.
3.2. Establishment of the Chilled Water System Simulation Environment
The cold source system simulation environment is established to accurately simulate and analyze the performance and energy consumption of the central air conditioning system, so as to provide a reliable basis for the research of optimal control algorithms and energy-saving strategies. In this simulation environment, this research considered the equipment parameters, energy consumption calculation method, and operating constraints of the central air conditioning system to ensure that the simulation results match the behavior of the actual system. By establishing such a simulation environment, this research helps us to better understand and optimize the performance of the cold source system, and provides guidance for the energy efficiency improvement of the actual central air conditioning system.
3.2.1. Simulation Environment for the Chilled Water System
This research investigated a widely applied central air conditioning system that uti-
lizes water as the refrigerant and chilled water units as the cooling source. Each type of
equipment is represented by a single unit, and their specific parameters are listed in Table 2.
To maintain a stable indoor design temperature, the cooling capacity of the air conditioning
system is set to 1.2 times the building’s cooling load. The main energy-consuming com-
ponents of the central air conditioning system include chilled water units, chilled water
pumps, cooling water pumps, and cooling towers. The total energy consumption is the sum of these four components, Ptotal = Pchiller + Ppumpe + Ppumpc + Ptower, where Ptotal represents the total energy consumption of the central air conditioning system, Pchiller represents the energy consumption of the chilled water units, Ppumpe represents the energy consumption of the chilled water pumps, Ppumpc represents the energy consumption of the cooling water pumps, and Ptower represents the energy consumption of the cooling towers. The individual terms are computed as follows:
Pchiller = Qe / COP   (6)

Ppumpe = (ρ g Ve He) / (3.6 × 10^6 · ηe)   (7)

Ppumpc = (ρ g Vc Hc) / (3.6 × 10^6 · ηc)   (8)

Ptower = Ptower-r · (ft / f0)^3   (9)
where Qe represents the cooling capacity, COP denotes the operating efficiency of the chiller
unit, Ve is the flow rate of chilled water, He represents the head of the chilled water pump,
ηe denotes the overall efficiency of the chilled water pump, ρ is the density of the fluid, g is
the acceleration due to gravity, Vc represents the flow rate of cooling water, Hc denotes the
head of the cooling water pump, ηc represents the overall efficiency of the cooling water
pump, ft is the operating frequency of the fan, f0 is the rated frequency of the fan, and
Ptower − r represents the rated power of the fan.
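Equations (6)–(9) translate directly into code; the sketch below sums the four components into Ptotal, with the density of water and gravitational acceleration as the only fixed constants and all other inputs left as placeholders.

```python
def chiller_power(Q_e, COP):
    return Q_e / COP                                   # Equation (6)

def pump_power(rho, g, V, H, eta):
    return rho * g * V * H / (3.6e6 * eta)             # Equations (7) and (8)

def tower_power(P_tower_rated, f_t, f_0):
    return P_tower_rated * (f_t / f_0) ** 3            # Equation (9)

def total_power(Q_e, COP, V_e, H_e, eta_e, V_c, H_c, eta_c,
                P_tower_rated, f_t, f_0, rho=1000.0, g=9.81):
    return (chiller_power(Q_e, COP)
            + pump_power(rho, g, V_e, H_e, eta_e)      # chilled water pump
            + pump_power(rho, g, V_c, H_c, eta_c)      # cooling water pump
            + tower_power(P_tower_rated, f_t, f_0))
```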
3.2.2. Constraints
Table 3 summarizes the operational parameter constraints for the chiller, pump, and
cooling tower based on the selected equipment’s product manuals and the code for design
of heating, ventilation, and air conditioning, considering the strong coupling within the
central air conditioning system and the limitations imposed by outdoor weather conditions:
Table 3. Operational parameter constraints for chiller, pump, and cooling tower.
4. AFUCB-DQN
When facing large-scale MDPs, the Q-learning algorithm suffers from the problem of
explosive memory due to the large number of state–action pairs. In our research, which
focuses on the constructed simulation environment of the central air conditioning system,
the state space exhibits high-dimensional characteristics. When using the Q-learning
algorithm, the high computational complexity and memory storage requirements degrade
the algorithm’s performance. To address these issues, this research employed the DQN
algorithm, where the approximation capability of neural networks helps improve the
stability of the algorithm.
Furthermore, the variance of air conditioning data can lead to environmental vari-
ance [37], thereby reducing learning efficiency and causing learning instability and over-
fitting. To address this problem, this research introduced the advantage function based
on the DQN algorithm, aiming to mitigate the impact of air conditioning data variance
on environmental variance. This approach enhances learning stability, improves learning
efficiency, and prevents overfitting. In the proposed simulation environment of the central
air conditioning system, to enhance the robustness of the environment and reduce the gap
between the simulation environment and the real environment, this research introduced
noise perturbation. However, the presence of noise perturbation can introduce errors in
the sampling results. When using the DQN algorithm, traditional ε-greedy exploration
cannot avoid the issue of error data, leading to decreased learning efficiency and increased
instability. To address this problem, this research adopted the UCB algorithm, which aims
to balance the trade-off between exploration and exploitation. It assigns confidence to each
action based on its potential value and uncertainty. By calculating and selecting the action
with the maximum UCB value based on confidence, this research effectively solved the
problem of sampling result errors caused by noise.
Figure 3. Schematic diagram of the AFUCB-DQN algorithm model.
4.2. Advantage Function
The advantage function is a function used in reinforcement learning to evaluate the relative superiority or inferiority of an action compared to other actions. In reinforcement learning, an agent needs to choose the optimal action in a given state to maximize long-term rewards. To achieve this goal, the agent needs to evaluate the potential rewards associated with each possible action in the current state. The advantage function provides an effective way to assess the value of each action, and helps the agent make informed decisions. The advantage function is defined as follows:

A^π(s, a) = Q^π(s, a) − V^π(s)   (10)

where Q^π(s, a) represents the long-term rewards obtained by choosing action a, and V^π(s) represents the average long-term rewards obtained in state s. Thus, the advantage function A^π(s, a) represents the additional rewards gained by choosing action a relative to the average action.
The introduction of the advantage function helps to reduce the variance caused by variations in the data of the air conditioning system, as it subtracts a baseline (the state value function V^π(s)). By reducing the variance, the advantage function decreases the absolute values of the state value function, thereby improving learning stability. The advantage function decomposes the value function into action value and state value components, reducing their correlation and making the learning process more stable. By using the advantage function, the update processes of the action value and state value can be independent of each other, reducing their mutual interference and improving learning stability. In reinforcement learning, the reward signal is often sparse, which means the agent may need to spend a considerable amount of time exploring the environment to obtain rewards. By computing the advantage function, this research transformed the reward signal into a denser signal, reducing the sparsity of the reward signal and making it easier for the agent to find the optimal policy, thereby improving learning efficiency.
In contrast to the DQN algorithm, the AFUCB-DQN algorithm separates the Q-
network value function into two parts: the state value function component V(s, ω, ωV )
and the advantage function component A(s, a, ω, ωA ). The state value function compo-
nent represents the intrinsic value of the static environment itself, dependent only on state s
and independent of the specific action a. The advantage function component represents the
additional value obtained by choosing an action in a specific state, dependent on both state
s and action a. Finally, the two functions are combined to obtain the Q-value corresponding
to each action:
where ω represents the shared parameters of the neural network, and ωV and ωA represent
the unique neural network parameters for the state value function V(s) and the action
advantage function A(a), respectively.
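A minimal sketch of one common way to merge the two streams into Q-values is given below; it uses the mean-subtracted dueling combination, and whether the paper uses this exact variant is an assumption.

```python
import torch

def combine_value_and_advantage(V, A):
    """Q(s,a) = V(s) + A(s,a) - mean_a A(s,a); subtracting the mean keeps V and A identifiable."""
    # V: [batch, 1] state-value stream, A: [batch, n_actions] advantage stream
    return V + A - A.mean(dim=1, keepdim=True)
```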
where UCBi represents the UCB value of the i-th action, Xi denotes the average reward
obtained from selecting the i-th action, Ni is the number of times the i-th action has been
chosen, and t represents the current time step. The UCB algorithm selects the action with
the highest UCB value as follows:
By replacing the ε-greedy strategy with the UCB algorithm, this research avoided
excessive randomness caused by the ε-greedy strategy, and addressed the errors in the sam-
pled results due to the stochasticity of the environment caused by noise. This improvement
helps enhance the decision-making performance in the simulation environment.
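A compact sketch of UCB action selection as described above follows; since only the ingredients Xi, Ni, and t are given, the UCB1-style bound with exploration constant c used here is an assumption.

```python
import numpy as np

def ucb_select(mean_reward, counts, t, c=2.0):
    """Pick argmax_i [ X_i + c * sqrt(ln t / N_i) ]; actions never tried are chosen first."""
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(mean_reward) + bonus))
```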
30 lightly active employees. In the design of the building envelope, this research referred
to the Design Standard for Energy Efficiency of Public Buildings [39]. The heat transfer coefficient of the external walls was set to 0.796 W/(m²·°C), and the heat transfer coefficient of the windows was set to 3.1 W/(m²·°C), with the window area accounting for
80% of the wall area. Considering the working hours of the employees and the simulated
environment of the central air conditioning system, this research selected a weather dataset
provided by the Xihe Energy Big Data Platform [40]. This dataset included daily weather
data from 8:00 to 18:00 between 1 July 2021, and 31 August 2021, including temperature,
humidity, solar radiation, and other information.
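As an illustration of how these envelope parameters enter the cooling load calculation, a minimal steady-state transmission sketch (Q = U·A·ΔT) is given below; the areas and temperatures are placeholders, and the paper's full load model also includes solar radiation, occupant heat, and other components.

```python
def envelope_heat_gain(wall_area_m2, window_area_m2, t_out_c, t_in_c,
                       u_wall=0.796, u_window=3.1):
    """Steady-state transmission gain through walls and windows, in watts (U values in W/(m2*C))."""
    dt = t_out_c - t_in_c
    return u_wall * wall_area_m2 * dt + u_window * window_area_m2 * dt

# Example with placeholder geometry: a facade split into opaque wall and window area,
# 34 C outdoors and 26 C indoors.
q_envelope = envelope_heat_gain(wall_area_m2=20.0, window_area_m2=80.0, t_out_c=34.0, t_in_c=26.0)
```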
Regarding the AFUCB-DQN algorithm, the specific design of the deep neural network
and hyperparameters are shown in Table 4. To enhance the algorithm’s performance, this
research selected GELU as the activation function. Compared to other commonly used
activation functions such as ReLU and sigmoid, GELU exhibits smoother nonlinear charac-
teristics, which helps improve the algorithm’s performance. Additionally, a sigmoid-like
transformation is introduced into the nonlinear transformation of the activation function,
allowing the output of the GELU function to span a wider range, thereby accelerating the
convergence speed of the model.
Table 4. Design of the deep neural network and hyperparameters in the AFUCB-DQN algorithm.
Size of input: 4
No. of hidden layers: 2
Size of each hidden layer: [8, 128], [128, 64]
Size of output: 4
Activation function: GELU
Optimizer: Adam
Learning rate: 10^-3
Batch size: 64
Discount factor: 0.95
Buffer size: 128
Delayed policy update U: 2
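A sketch of a Q-network consistent with Table 4 (GELU activations, Adam at a learning rate of 10^-3, four outputs) is shown below; how the four state features map onto the listed hidden widths and where the value/advantage streams branch are assumptions, since Table 4 only lists layer sizes.

```python
import torch
import torch.nn as nn

class AFUCBQNetwork(nn.Module):
    """GELU MLP with a value/advantage split, loosely following Table 4."""

    def __init__(self, state_dim=4, n_actions=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 8), nn.GELU(),   # assumed embedding of the 4 state features
            nn.Linear(8, 128), nn.GELU(),         # hidden layer [8, 128]
            nn.Linear(128, 64), nn.GELU(),        # hidden layer [128, 64]
        )
        self.value_head = nn.Linear(64, 1)              # V(s)
        self.advantage_head = nn.Linear(64, n_actions)  # A(s, a)

    def forward(self, x):
        h = self.trunk(x)
        v, a = self.value_head(h), self.advantage_head(h)
        return v + a - a.mean(dim=-1, keepdim=True)     # mean-subtracted dueling combination

q_net = AFUCBQNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # optimizer and rate from Table 4
```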
To verify the accuracy of the proposed air conditioning system simulation environment, this research compared the simulated data obtained from our simulation environment with the actual data used in the cooling source system portion of reference [42], as shown in Figure 5.
reference [42]. The results show that the power difference between the simulated data and
the actual data was within ±7%, indicating that the simulation data can be used for the
simulation and research of central air conditioning cooling source systems.
Figure 5. Validation of the cooling source system simulation environment.
Figure 6. Convergence comparison of different algorithms at different indoor temperatures. (a) Indoor temperature of 24 °C, (b) indoor temperature of 25 °C, and (c) indoor temperature of 26 °C.
Figure 7 illustrates the energy consumption of different algorithms for each sample
at different indoor temperatures. It can be observed that the AFUCB-DQN algorithm con-
sistently achieves significantly lower energy consumption compared to the DQN algo-
rithm, DDQN algorithm, and D3QN algorithm. Additionally, the AFUCB-DQN algorithm
demonstrates stable energy-saving performance. To provide a more intuitive representa-
tion of the energy-saving effect, Figure 8 presents the average hourly energy consumption
of the DQN algorithm, DDQN algorithm, and D3QN algorithm, and of the AFUCB-DQN
algorithm, in the central air conditioning system at indoor temperatures of 24 °C, 25 °C,
and 26 °C. The figure also indicates the percentage reduction in energy consumption
achieved by each algorithm compared to the original energy consumption. Therefore,
while meeting the thermal comfort requirements of different individuals during the sum-
mer, the AFUCB-DQN algorithm exhibits significant energy-saving benefits compared to
the DQN algorithm, DDQN algorithm, and D3QN algorithm.
Figure 7. Energy consumption comparison of different algorithms for each sample at different indoor temperatures. (a) Indoor temperature of 24 °C, (b) indoor temperature of 25 °C, and (c) indoor temperature of 26 °C.
Figure 8. Comparison of the hourly average energy consumption of different algorithms and the original energy consumption at different indoor temperatures.
6. Conclusions and Future Work
This study proposes an innovative method that combines the building cooling load simulation environment, the cooling source system, and deep reinforcement learning to optimize the control strategy of the central air conditioning system, and proposes a cooling source system control optimization method based on the AFUCB-DQN algorithm. This method improves the stability of the learning process by using the advantage function, and obtains a better exploration–exploitation balance by introducing the UCB algorithm, avoiding the error data problem that may occur during the exploration process. By comprehensively considering various factors of building cooling load and introducing noise interference, this study constructs an accurate and robust central air conditioning system simulation environment, which can dynamically adjust the operating parameters of the cooling source system according to the actual cooling load demand. After training and comparative analysis of the AFUCB-DQN algorithm, this research found that, under the premise of indoor thermal comfort requirements in summer, the algorithm shows more stable convergence, faster convergence speed, and higher rewards compared with the DQN algorithm, DDQN algorithm, and D3QN algorithm, resulting in energy optimization and significant improvements in energy savings. The operating parameters of the cold source system obtained by the proposed method can provide effective guidance for the operation of the actual central air conditioning system. The main focus of this research is the optimization of energy consumption and energy saving effect under the requirement of indoor thermal comfort in summer. However, for room use requirements in other seasons or different working conditions, the applicability and effect of this method still need further verification and exploration. Future research should consider meeting the room use requirements under different working conditions, and explore methods, such as multi-connected air conditioning systems and multi-agent reinforcement learning, to solve related problems, and extend the scope of optimization control to energy-saving optimization throughout the year.
The central air conditioning system control optimization method proposed in this study not only realizes the improvement in energy saving effect and the guarantee of indoor thermal comfort, but also provides guidance and reference for practical application.
Author Contributions: Conceptualization, H.T., M.F. and Q.G.; methodology, H.T. and M.F.; software,
H.T. and M.F.; validation, H.T. and M.F.; formal analysis, H.T.; investigation, H.T.; resources, H.F.;
data curation, R.C.; writing—original draft preparation, H.T. and M.F.; writing—review and editing,
H.T., M.F. and Q.G.; visualization, H.T. and M.F.; supervision, Q.G.; project administration, H.T. and
Q.G. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the State Grid Tianjin Electric Power Company Science and
Technology Project, grant number KJ21-1-21, the Tianjin Postgraduate Scientific Research Innovation
Project, grant number 2022SKYZ070, and the Tianjin University of Technology 2022 School-Level
Postgraduate Scientific Research Innovation Practice Project, grant number YJ2209.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data reported were taken from papers included in the references.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviation
The following abbreviations are used in this manuscript:
AFUCB-DQN Advantage function upper confidence bound deep Q-network
UCB Upper confidence bound
EU European Union
BEM Building energy model
DMPC Distributed model predictive control
HVAC Heating ventilation and air conditioning
EO Equilibrium optimization
IPAIS Improved parallel artificial immune system
DR Demand response
RL Reinforcement learning
DRL Deep reinforcement learning
MDP Markov decision process
MCTS Monte Carlo tree search
iLQR Iterative linear quadratic regulator
MPC Model predictive control
DQN Deep Q-network
VAV Variable air volume
BDQ Branching dueling Q-network
A3C Asynchronous advantage actor–critic
LSTM Long short-term memory
DDPG Deep deterministic policy gradient
D3QN Distributed double deep Q-network
MBRL-MC Model-based deep reinforcement learning and model predictive control
References
1. Perez-Lombard, L.; Ortiz, J.; Maestre, I.R. The Map of Energy Flow in HVAC Systems. Appl. Energy 2011, 88, 5020–5031. [CrossRef]
2. Tang, R.; Wang, S.; Sun, S. Impacts of Technology-Guided Occupant Behavior on Air-Conditioning System Control and Building
Energy Use. Build. Simul. 2021, 14, 209–217. [CrossRef]
3. Chen, J.; Sun, Y. A New Multiplexed Optimization with Enhanced Performance for Complex Air Conditioning Systems. Energy
Build. 2017, 156, 85–95. [CrossRef]
4. Gholamzadehmir, M.; Del Pero, C.; Buffa, S.; Fedrizzi, R.; Aste, N. Adaptive-Predictive Control Strategy for HVAC Systems in
Smart Buildings—A Review. Sustain. Cities Soc. 2020, 63, 102480. [CrossRef]
Processes 2023, 11, 2068 20 of 21
5. Lu, Y.; Khan, Z.A.; Alvarez-Alvarado, M.S.; Zhang, Y.; Huang, Z.; Imran, M. A Critical Review of Sustainable Energy Policies for
the Promotion of Renewable Energy Sources. Sustainability 2020, 12, 5078. [CrossRef]
6. Mariano-Hernandez, D.; Hernandez-Callejo, L.; Zorita-Lamadrid, A.; Duque-Perez, O.; Santos Garcia, F. A Review of Strategies
for Building Energy Management System: Model Predictive Control, Demand Side Management, Optimization, and Fault Detect
& Diagnosis. J. Build. Eng. 2021, 33, 101692.
7. Gao, J.; Yang, X.; Zhang, S.; Tu, R.; Ma, H. Event-Triggered Distributed Model Predictive Control Scheme for Temperature
Regulation in Multi-Zone Air Conditioning Systems with Improved Indoor Thermal Preference Indicator. Int. J. Adapt. Control
Signal Process. 2023, 37, 1389–1409. [CrossRef]
8. Salins, S.S.; Kumar, S.S.; Thommana, A.J.J.; Vincent, V.C.; Tejero-Gonzalez, A.; Kumar, S. Performance Characterization of an
Adaptive-Controlled Air Handling Unit to Achieve Thermal Comfort in Dubai Climate. Energy 2023, 273, 127186. [CrossRef]
9. Aruta, G.; Ascione, F.; Bianco, N.; Mauro, G.M.; Vanoli, G.P. Optimizing Heating Operation via GA- and ANN-Based Model
Predictive Control: Concept for a Real Nearly-Zero Energy Building. Energy Build. 2023, 292, 113139. [CrossRef]
10. Yang, S.; Yu, J.; Gao, Z.; Zhao, A. Energy-Saving Optimization of Air-Conditioning Water System Based on Data-Driven and
Improved Parallel Artificial Immune System Algorithm. Energy Convers. Manag. 2023, 283, 116902. [CrossRef]
11. Sun, F.; Yu, J.; Zhao, A.; Zhou, M. Optimizing Multi-Chiller Dispatch in HVAC System Using Equilibrium Optimization Algorithm.
Energy Rep. 2021, 7, 5997–6013. [CrossRef]
12. Tang, R.; Wang, S. Model Predictive Control for Thermal Energy Storage and Thermal Comfort Optimization of Building Demand
Response in Smart Grids. Appl. Energy 2019, 242, 873–882. [CrossRef]
13. Utama, C.; Troitzsch, S.; Thakur, J. Demand-Side Flexibility and Demand-Side Bidding for Flexible Loads in Air-Conditioned
Buildings. Appl. Energy 2021, 285, 116418. [CrossRef]
14. Sutton, R.; Barto, A. Reinforcement Learning; MIT Press: Cambridge, MA, USA, 1998.
15. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
16. Li, W.; Todorov, E. Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems. In Proceedings of
the International Conference on Informatics in Control, Automation and Robotics, Setubal, Portugal, 28 August 2004.
17. Zhao, H.; Zhao, J.; Shu, T.; Pan, Z. Hybrid-Model-Based Deep Reinforcement Learning for Heating, Ventilation, and Air-
Conditioning Control. Front. Energy Res. 2021, 8, 610518. [CrossRef]
18. Chen, L.; Meng, F.; Zhang, Y. MBRL-MC: An HVAC Control Approach via Combining Model-Based Deep Reinforcement Learning
and Model Predictive Control. IEEE Internet Things J. 2022, 9, 19160–19173. [CrossRef]
19. Wang, Z.; Hong, T. Reinforcement Learning for Building Controls: The Opportunities and Challenges. Appl. Energy 2020, 269,
115036. [CrossRef]
20. Biemann, M.; Scheller, F.; Liu, X.; Huang, L. Experimental Evaluation of Model-Free Reinforcement Learning Algorithms for
Continuous HVAC Control. Appl. Energy 2021, 298, 117164. [CrossRef]
21. Heo, S.; Nam, K.; Loy-Benitez, J.; Li, Q.; Lee, S.; Yoo, C. A Deep Reinforcement Learning-Based Autonomous Ventilation Control
System for Smart Indoor Air Quality Management in a Subway Station. Energy Build. 2019, 202, 109440. [CrossRef]
22. Yuan, X.; Pan, Y.; Yang, J.; Wang, W.; Huang, Z. Study on the Application of Reinforcement Learning in the Operation Optimization
of HVAC System. Build. Simul. 2021, 14, 75–87. [CrossRef]
23. Wei, T.; Wang, Y.; Zhu, Q. Deep Reinforcement Learning for Building HVAC Control. In Proceedings of the 54th Annual Design
Automation Conference 2017, Austin, TX, USA, 18 June 2017.
24. Deng, X.; Zhang, Y.; Qi, H. Towards Optimal HVAC Control in Non-Stationary Building Environments Combining Active Change
Detection and Deep Reinforcement Learning. Build. Environ. 2022, 211, 108680. [CrossRef]
25. Lei, Y.; Zhan, S.; Ono, E.; Peng, Y.; Zhang, Z.; Hasama, T.; Chong, A. A Practical Deep Reinforcement Learning Framework for
Multivariate Occupant-Centric Control in Buildings. Appl. Energy 2022, 324, 119742. [CrossRef]
26. Marantos, C.; Lamprakos, C.P.; Tsoutsouras, V.; Siozios, K.; Soudris, D. Towards Plug&Play Smart Thermostats Inspired by
Reinforcement Learning. In Proceedings of the Workshop on INTelligent Embedded Systems Architectures and Applications,
Turin, Italy, 4 October 2018.
27. Zhang, Z.; Chong, A.; Pan, Y.; Zhang, C.; Lu, S.; Lam, K.P. A Deep Reinforcement Learning Approach to Using Whole Building
Energy Model for HVAC Optimal Control. In Proceedings of the 2018 Building Performance Analysis Conference and SimBuild,
Chicago, IL, USA, 26–28 September 2018.
28. Wang, Y.; Velswamy, K.; Huang, B. A Long-Short Term Memory Recurrent Neural Network Based Reinforcement Learning
Controller for Office Heating Ventilation and Air Conditioning Systems. Processes 2017, 5, 46. [CrossRef]
29. Ding, Z.-K.; Fu, Q.-M.; Chen, J.-P.; Wu, H.-J.; Lu, Y.; Hu, F.-Y. Energy-Efficient Control of Thermal Comfort in Multi-Zone
Residential HVAC via Reinforcement Learning. Connect. Sci. 2022, 34, 2364–2394. [CrossRef]
30. Zhang, Z.; Chong, A.; Pan, Y.; Zhang, C.; Lam, K.P. Whole Building Energy Model for HVAC Optimal Control: A Practical
Framework Based on Deep Reinforcement Learning. Energy Build. 2019, 199, 472–490. [CrossRef]
31. Gao, G.; Li, J.; Wen, Y. Deep Comfort: Energy-Efficient Thermal Comfort Control in Buildings Via Reinforcement Learning. IEEE
Internet Things J. 2020, 7, 8472–8484. [CrossRef]
32. Li, Z.; Sun, Z.; Meng, Q.; Wang, Y.; Li, Y. Reinforcement Learning of Room Temperature Set-Point of Thermal Storage Air-
Conditioning System with Demand Response. Energy Build. 2022, 259, 111903. [CrossRef]
33. Watkins, C.; Dayan, P. Q-Learning. Mach. Learn. 1992, 8, 279–292. [CrossRef]
Processes 2023, 11, 2068 21 of 21
34. Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking Deep Reinforcement Learning for Continuous Control.
In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016.
35. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep
Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
36. GB 50019-2003; Code for Design of Heating Ventilation and Air Conditioning. China Planning Press: Beijing, China, 2003.
37. Sun, L.; Wu, J.; Jia, H.; Liu, X. Research on Fault Detection Method for Heat Pump Air Conditioning System under Cold Weather.
Chin. J. Chem. Eng. 2017, 25, 1812–1819. [CrossRef]
38. Li, Y.; Wang, Z.; Xu, W.; Gao, W.; Xu, Y.; Xiao, F. Modeling and Energy Dynamic Control for a ZEH via Hybrid Model-Based Deep
Reinforcement Learning. Energy 2023, 277, 127627. [CrossRef]
39. GB 50189-2015; Design Standard for Energy Efficiency of Public Buildings. China Architecture and Building Press: Beijing,
China, 2015.
40. National Aeronautics and Space Administration. Data Are Based on Historical Reanalysis Datasets from the European Centre for
Medium-Range Weather Forecasts (ECMWF). Available online: https://www.xihe-energy.com (accessed on 24 May 2022).
41. Lu, Y. Practical Design Manual for Heating and Air Conditioning, 2nd ed.; China Architecture and Building Press: Beijing, China, 2008.
42. Huang, Y. Study of Operation Optimization for Cold Source System of Central Air-Conditioning Based on TRNSYS. Ph.D. Thesis,
South China University of Technology, Guangzhou, China, 2015.
43. GB 50736-2012; Design Code for Heating Ventilation and Air Conditioning of Civil Buildings. China Architecture and Building
Press: Beijing, China, 2012.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.