Distributed Economic Dispatch in Microgrids Based On Cooperative Reinforcement Learning


Weirong Liu, Member, IEEE, Peng Zhuang, Hao Liang, Member, IEEE, Jun Peng, Member, IEEE, and Zhiwu Huang, Member, IEEE

Abstract— Microgrids incorporating distributed generation (DG) units and energy storage (ES) devices are expected to play increasingly important roles in future power systems. Yet, achieving efficient distributed economic dispatch in microgrids is a challenging issue due to the randomness and nonlinear characteristics of DG units and loads. This paper proposes a cooperative reinforcement learning algorithm for distributed economic dispatch in microgrids. Utilizing the learning algorithm avoids the difficulty of stochastic modeling and high computational complexity. In the cooperative reinforcement learning algorithm, function approximation is leveraged to deal with the large and continuous state space, and a diffusion strategy is incorporated to coordinate the actions of DG units and ES devices. Based on the proposed algorithm, each node in the microgrid only needs to communicate with its local neighbors, without relying on any centralized controller. Algorithm convergence is analyzed, and simulations based on real-world meteorological and load data are conducted to validate the performance of the proposed algorithm.

Index Terms— Cooperative reinforcement learning, diffusion strategy, distributed economic dispatch, energy storage (ES), function approximation, microgrids.

Manuscript received April 30, 2017; revised October 15, 2017 and January 31, 2018; accepted January 31, 2018. Date of publication March 2, 2018; date of current version May 15, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61672539, Grant 61672537, Grant 61379111, and Grant 61772558, in part by the CSC State Scholarship Fund under Grant 201406375017, and in part by the Natural Sciences and Engineering Research Council of Canada. (Corresponding author: Weirong Liu.)
W. Liu is with the School of Information Science and Engineering, Central South University, Changsha 410083, China, and also with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada (e-mail: frat@csu.edu.cn).
P. Zhuang and H. Liang are with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada (e-mail: pzhuang@ualberta.ca; hao2@ualberta.ca).
J. Peng and Z. Huang are with the Hunan Engineering Laboratory of Rail Vehicles Braking Technology, School of Information Science and Engineering, Central South University, Changsha 410083, China (e-mail: pengj@csu.edu.cn; hzw@csu.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2018.2801880

I. INTRODUCTION

MICROGRIDS are small-scale and localized electric power systems, typically established at the community level to generate, distribute, and regulate the flow of electricity from suppliers to customers. Distributed generation (DG) units, especially those based on renewable energy resources, and energy storage (ES) devices can be integrated into microgrids to reduce carbon emissions, power delivery losses, and infrastructure construction cost [1]. In recent years, the deployment of microgrids has continued to increase globally. In particular, the installed microgrid capacity is estimated to grow from 1.1 GW in 2012 to 4.7 GW in 2017, with an estimated market opportunity of $17.3 billion [2].

To manage all the DG units and ES devices in an optimal way, economic dispatch can be performed to allocate the power demands in a microgrid, such that the cost of microgrid operation is minimized while the system operation constraints are satisfied. In comparison with the centralized approach, distributed economic dispatch avoids single points of failure and fits well with the plug-and-play nature of microgrids. A distributed controller can utilize local information, such as the states of the neighboring/adjacent controllers, to achieve the global optimization [3]. The information that must be processed is thus confined to the local neighborhood, which reduces the computation and communication resource requirements and makes the approach applicable to local DG and ES controllers with restricted resources. It also increases the system reliability by using multiple local controllers rather than a single centralized controller.

However, it is extremely challenging to establish distributed economic dispatch in a microgrid due to its intrinsic randomness. The renewable energy sources are fluctuating, uncertain, and nondispatchable, and the controllable power sources, such as the utility grid, ES devices, and diesel generation units, must be scheduled to supplement these renewable energies in case of critical loads [4]. The multistage stochastic optimization problems into which economic dispatch is transformed are commonly NP-hard. Furthermore, without a centralized view of the microgrid, the energy management system (EMS) at each node of the microgrid can hardly obtain explicit stochastic models for all DG units and ES devices in the microgrid. Such a difficulty is further aggravated by the limited number of loads, which are more unpredictable than those in traditional large-scale electrical grids [5].

In the literature, some existing works on distributed economic dispatch [6], [7] rely on the accurate forecast of power generation and load. To address the randomness related to DG units and loads, Markov decision processes (MDPs) can be leveraged to optimize the energy management in microgrids [8], [9]. Based on the stochastic models, model predictive control methods can be used to realize receding-horizon optimization for the upcoming transient time [10], [11]. The scenario-based methods can construct stochastic scenarios


to optimize long-term expectations for microgrid operation problems [12], [13]. To reduce the computational complexity, a bilevel optimization structure is proposed in [14] based on a scenario reduction method, and a stochastic dynamic programming method is developed in [15] to seek the optimal dispatch based on reduced scenarios. Yet, acquiring accurate a priori statistical information for all DG units and loads in a microgrid is not straightforward, which can limit the applications of the aforementioned methods.

To avoid constructing stochastic models beforehand, reinforcement learning is a promising technique for solving MDPs efficiently [16]. Its most notable feature is that it is model-free, i.e., it is able to find the optimal solution without relying on any a priori knowledge or burdensome stochastic modeling. Moreover, it can obtain the highest global long-term reward instead of merely the immediate reward. These features give it a strong ability to address the multistate stochastic optimization of economic dispatch in microgrids. Furthermore, when a fully cooperative mechanism is introduced, agents are designed to fulfill the tasks together and to optimize global long-term performance indices by sharing only partial states or rewards at each iteration, rather than acquiring the other agents' strategies [17]. This feature is especially well suited to the control requirements of distributed generation in microgrids.

Some existing works have exploited the applications of reinforcement learning in residential household energy management [18] and generation control of electrical grids [19]. However, for distributed economic dispatch in microgrids, the state space and decision variables are continuous, and no central controller is available. As a result, classical reinforcement learning approaches suffer from the notorious "curse of dimensionality" problem. To solve this problem, some existing research works utilize fuzzy-Q learning [20], [21], one of the variations of Q-learning, to obtain optimal economic dispatch and pricing strategies. However, fuzzy-Q learning may converge slowly, and it needs the fuzzy features to be extracted manually, which makes it prone to local optima. A significant research effort is still needed to reduce the computational cost of reinforcement learning without relying on a central controller.

Function approximation is another method that can be combined with reinforcement learning to address the continuous state-space problem [22]. Specifically, the linear function approximation derived from the gradient of the projected Bellman error and the off-policy strategy [23], [24] has a theoretical convergence guarantee and can be used in real-time control applications under unknown environments [25]. Yet, how to incorporate the cooperation mechanism into the function approximation method and how to extend the existing convergence analysis [26], [27] to distributed economic dispatch in microgrids still require extensive research.

In this paper, a cooperative reinforcement learning algorithm is proposed for distributed economic dispatch in microgrids. The microgrid model is constructed by incorporating DG units and ES devices. A distributed economic dispatch problem is formulated to minimize the cost of microgrid operation while satisfying the power balance, capacity, and operational constraints. In order to solve this problem, a cooperation mechanism is introduced into the reinforcement learning algorithm with function approximation. Theoretical analysis and simulations are conducted to evaluate the performance of the proposed algorithm. The main contributions of this paper are fourfold.

1) A distributed framework is established by incorporating a diffusion strategy to coordinate the actions and to balance the loads among DG units and ES devices.
2) Function approximation is applied to provide precise control in the continuous state space of microgrids.
3) An innovative theoretical analysis is provided for the framework consisting of parameterized Q-learning and the diffusion strategy. The conditions for the convergence of the proposed algorithm are mathematically derived for the Q value iteration.
4) Simulations are conducted based on real-world meteorological and load data to evaluate the performance of the proposed algorithm, and the coupled power flow constraints are taken into consideration.

The remainder of this paper is organized as follows. Section II presents the system model of a microgrid and the formulation of the economic dispatch problem. The proposed cooperative reinforcement learning algorithm is presented in Section III. In Section IV, the performance of the proposed algorithm is analyzed. Simulation results are presented in Section V, followed by the conclusions and future works in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Generally, microgrids are connected to the main utility grid through a point of common coupling (PCC) [28]. In addition to the loads, nodes may be equipped with photovoltaic (PV) panels as the DG units and/or batteries as the ES devices. Diesel generators are installed as dispatchable energy resources to ensure reliable system operation. Each node provides electrical power for some residential customers.

In time slot k, for each node i, its load p_i^l(k) and the power generation from the PV panel p_i^p(k) are random variables. The power dispatch of the diesel generator p_i^d(k) and the power dispatch of the battery p_i^b(k) are the decision variables. For a battery, p_i^b(k) is positive in the discharging state and negative in the charging state. The variable notations of the microgrid are listed in Table I.

[TABLE I: Variable Notations of the Microgrid]

A. Objective

The objective of the economic dispatch is to minimize the cost of microgrid operation for all nodes over the long term. For each node i in time slot k, the decision variables are the power dispatch of the diesel generator p_i^d(k) and the power output (charge or discharge) of the battery p_i^b(k). If there is no diesel generator or battery on node i, p_i^d(k) or p_i^b(k) is set to zero. For each node in the microgrid, the operation cost includes three components: the cost of power purchased from the utility grid (C_g(·)), the cost of power generated by the diesel generator (C_d(·)), and the cost of battery wear due to


charging and discharging (C_b(·)). Thus, the objective of energy management is

  min_{p_i^d(k), p_i^b(k), i∈I, k∈K}  Σ_{k∈K} Σ_{i∈I} [ C_g(p_i^g(k)) + C_d(p_i^d(k)) + C_b(p_i^b(k), SOH_i(k)) ]      (1)

in which I is the set of nodes in the microgrid and K is the set of time slots. Therefore, the objective is the global optimization of the long-term reward. These costs are, respectively, given by

  C_g(p_i^g(k)) = a^g(k) · p_i^g(k)      (2)
  C_d(p_i^d(k)) = a_i^d · (p_i^d(k))^2 + b_i^d · p_i^d(k) + c_i^d      (3)
  C_b(p_i^b(k), SOH_i(k)) = ρ_i^b · (SOH_i(k+1) − SOH_i(k))      (4)

where a^g(k) is the time-of-use electricity price of the utility grid in time slot k, while a_i^d, b_i^d, and c_i^d are the generation cost coefficients of the diesel generator [15].

In (4), ρ_i^b is the degradation cost coefficient of the battery, which depends on the unit cost of the battery. Furthermore, (SOH_i(k+1) − SOH_i(k)) is the battery life degradation caused by battery charging or discharging cycles, calculated by the degradation iteration [29]

  SOH_i(k+1) = SOH_i(k) + h_i(k) · SOH_i(k)      (5)

where h_i(k) is the degradation factor, which is a function of the rate of change of the battery state of charge (SOC) and can be calculated as

  h_i(k) = ( a_i^h · (ΔSOC_i(k))^{β_i^h} + η_i^h )^{−1}      (6)

where a_i^h, β_i^h, and η_i^h are the degradation parameters determined by battery characteristics, whose values are obtained from empirical tests. The rate of change of the battery SOC, i.e., ΔSOC_i(k), is determined by the charging or discharging power.

For a charging period with p_i^b(k) < 0, the rate of change of the battery SOC can be calculated as [30]

  ΔSOC_i(p_i^b(k), SOH_i(k)) = −c_i^b · p_i^b(k)      (7)

where c_i^b is determined by the battery charging characteristics, which can be obtained from empirical experiments. On the other hand, for a discharging period when p_i^b(k) ≥ 0, the effective energy that can be discharged from a battery depends on the state of health (SOH) and the change of SOC during the discharging process according to Peukert's law [31]. We have

  ΔSOC_i(p_i^b(k), SOH_i(k)) = −d_i^b · ( p_i^b(k) / SOH_i(k) )^{α_i^h}      (8)

where the coefficients d_i^b and α_i^h can also be obtained from empirical tests of the battery. In this paper, both SOC_i and SOH_i are normalized values between 0 and 1. In fact, to avoid damaging batteries due to deep charging and discharging, SOC_i is limited to [0.2, 0.8] in actual battery operation.

B. Constraints

The constraints of the economic dispatch include the power balance constraint, capacity constraints, and operational constraints, described by (9)–(14) in detail.

The power balance constraint indicates that the total power generated by all kinds of sources in the microgrid plus the total power drawn from the utility grid should equal the total load of the nodes, given by

  Σ_{i∈I} ( p_i^p(k) + p_i^g(k) + p_i^d(k) + p_i^b(k) ) = Σ_{i∈I} p_i^l(k).      (9)

The total power balance can be achieved by cooperation among the nodes with DGs and ESs, i.e., there are bidirectional power exchanges among nodes. It is also assumed in this paper that the DGs and ESs can feed redundant power back to the utility grid. Thus, there are bidirectional power flows in the microgrid, which may potentially impact voltage regulation and directional protection schemes in power grids; further discussion is beyond the scope of this paper and will be investigated in future research.

The capacity constraints correspond to the operating ranges of the decision variables. For the batteries, we have

  P^b_{min,i} < p_i^b(k) < P^b_{max,i}      (10)

where P^b_{min,i} and P^b_{max,i} are the lower and upper bounds of the power output of the battery at node i for charging and discharging, respectively.

For the diesel generators, we have

  P^d_{min,i} < p_i^d(k) < P^d_{max,i}      (11)

where P^d_{min,i} and P^d_{max,i} are the lower and upper bounds of the power generation from the diesel generator at node i, respectively.

The line capacity constraints are also incorporated through the power capacities, described by

  Re( (V_i − V_j)^2 / Z_{ij} ) < P_{max,ij}      (12)

where Re(·) denotes taking the real part of a complex number, Z_{ij} is the complex impedance of the line from bus i to bus j, and P_{max,ij} is the allowed power capacity of the line from bus i to bus j.
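As a concrete illustration of the cost model above, the following is a minimal Python sketch of the per-node, per-slot operation cost (2)–(4) together with the SOC-rate and SOH iteration (5)–(8). It assumes the normalized SOC/SOH convention of the paper and uses, as defaults, the coefficient values quoted later in Section V; the function names (soc_rate, degradation_factor, node_cost) are illustrative and not part of the paper.

```python
def soc_rate(p_b, soh, c_b=1.82, d_b=-1.0, alpha_h=2.5):
    """Rate of change of the SOC: eq. (7) for charging (p_b < 0), eq. (8) for discharging."""
    if p_b < 0:                                   # charging period, eq. (7)
        return -c_b * p_b
    if p_b == 0:                                  # battery idle in this slot
        return 0.0
    return -d_b * (p_b / soh) ** alpha_h          # discharging period (Peukert's law), eq. (8)

def degradation_factor(delta_soc, a_h=1e-3, beta_h=-2.0, eta_h=0.0):
    """Degradation factor h_i(k) of eq. (6)."""
    if delta_soc == 0.0:
        return 0.0                                # limiting value of eq. (6) when the SOC change vanishes
    return 1.0 / (a_h * delta_soc ** beta_h + eta_h)

def node_cost(p_g, p_d, p_b, soh, a_g, a_d=1.5e-4, b_d=0.025, c_d=0.04, rho_b=100.0):
    """One-slot operation cost C_g + C_d + C_b of a single node, following eqs. (2)-(5)."""
    c_grid = a_g * p_g                                      # eq. (2): power purchased from the utility grid
    c_diesel = a_d * p_d ** 2 + b_d * p_d + c_d             # eq. (3): diesel generation cost
    delta_soc = soc_rate(p_b, soh)
    soh_next = soh + degradation_factor(delta_soc) * soh    # eq. (5): SOH iteration
    c_battery = rho_b * (soh_next - soh)                    # eq. (4): battery wear cost
    return c_grid + c_diesel + c_battery, soh_next
```

A candidate dispatch decision (p_d, p_b) for a node can then be costed with a single call such as node_cost(p_g, p_d, p_b, soh, a_g), with all power quantities expressed in the units and normalization used in the paper.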


The operational constraints are introduced to maintain the voltage stability in the microgrid, such that severe voltage fluctuations can be avoided to protect the electrical devices. The voltage of each node in the microgrid can be determined based on power flow analysis, which is complex and nonlinear in nature. In order to reduce the computational complexity of solving the power flow equations, a linear approximation algorithm for power flow in power distribution networks [32] is used in this paper. Then, we have

  v_min < v(k) < v_max      (13)

where v(k) = { v_i(k) | i ∈ {1, 2, ..., n} } is the vector of voltages of all nodes in the microgrid, which can be calculated as

  v(k) = A_{n×n} · p^g(k) + v_slack.      (14)

The matrix A_{n×n} ∈ ℝ^{n×n} can be obtained based on the topology and admittance matrix of the microgrid [32]. The elements of p^g(k) are the net power consumptions of the nodes in the microgrid. The elements in v_slack ∈ ℝ^n equal the nominal voltage at the PCC (v_slack). Furthermore, the elements in v_min ∈ ℝ^n and v_max ∈ ℝ^n correspond to the lower and upper bounds of the voltage at each node, i.e., v_i ∈ (v_min, v_max).

As for the frequency and current constraints, it is assumed that the power system is operating under the steady-state condition, and a DG combined with a voltage source inverter can be regarded as a synchronous generator under steady state. Therefore, the power balance equation (9) can be taken to guarantee the system frequency at its standard value. It is also assumed that the system is protected by the installed overcurrent protection devices according to each node's designed power capacity range: when the node voltages meet the power flow constraint (13), the currents are restricted by the installed overcurrent protection devices. These assumptions are consistent with the proposed system model of the microgrid.

From the above, by combining the objective of energy management (1) and all considered constraints (9)–(13), the economic dispatch optimization problem can be formulated as P1

  (P1)  min_{p_i^d(k), p_i^b(k), i∈I, k∈K}  Σ_{k∈K} Σ_{i∈I} [ C_g(p_i^g(k)) + C_d(p_i^d(k)) + C_b(p_i^b(k), SOH_i(k)) ]
        s.t. (9) ∼ (13).      (15)

Remark 1: In this paper, the traditional economic dispatch in power systems [33] is considered, which deals with minimizing the generation cost while satisfying the restrictions on power balance and generation capacity. Thus, the power losses are neglected in the objective (1) and in the power balance constraint (9). However, for a large-scale power transmission grid or for multiple microgrids separated by long distances [34], the power losses cannot be neglected and should be added to the objective and constraints, which will be addressed in our future research work.

Remark 2: It should be noted that, without the right parameters, solving the optimization problem (P1) is not meaningful. The parameters of the costs and constraints should have practical meaning and correspond to real power networks and industrial devices. Thus, in the following simulations, the PV data, the residential power usage, the utility electricity price, and the DG and ES parameters are taken from realistic scenarios and available devices.

III. COOPERATIVE REINFORCEMENT LEARNING

Due to the randomness of PV generation and loads in microgrids, problem P1 belongs to a class of stochastic optimization problems. It is nonlinear and nonconvex in general, since the battery degradation cost is determined by a piecewise nonlinear function. When the problem is discretized, it is transformed into an NP-hard problem, which becomes analytically intractable as the microgrid scale increases [2].

In the distributed control setting, there is no centralized controller to collect global system information and dispatch power generation. To achieve global optimization, each distributed controller should exchange its information with its neighboring controllers. Each controller makes its action decision based not only on its own states but also on the states of its neighboring controllers, which is called the distributed cooperative mechanism. It has been proven that distributed controllers can achieve the global performance of a centralized controller when they are designed properly [3]. In this paper, the diffusion strategy is introduced into function approximation as the cooperative mechanism, that is, the controllers send their approximation parameters to their neighboring controllers, and each controller combines the neighboring approximation parameters with its own and incorporates them into the parameter update law.

On the other hand, reinforcement learning tries to find the optimal decision sequence by leveraging historic trial-and-error interactions with the dynamic environment, without requiring any a priori knowledge of statistical models or forecast information [22].

A. Formulation of the Reinforcement Learning Problem

The reinforcement learning approach for node i can be defined by three fundamental elements: 1) a set of environment states S_i; 2) a set of agent actions A_i; and 3) a sequence of rewards r_i obtained by agent i over time. In this paper, each node in the microgrid is defined as an agent. In time slot k, the state of agent i includes the load, the PV generation, and the SOC and SOH of the battery, given by

  s_i(k) = ( p_i^l(k), p_i^p(k), SOC_i(k), SOH_i(k) ).      (16)

Here, we have s_i(k) ∈ S_i, where S_i = {s_1, s_2, ..., s_{m_{s,i}}} is the set of all admissible system states with respect to all possible values of the load, PV generation, and SOC/SOH of the battery, and m_{s,i} is the number of admissible states for agent i.

Furthermore, the action of agent i in time slot k can be defined based on the decision variables of this agent, given by

  a_i(k) = ( p_i^d(k), p_i^b(k) ).      (17)

Here, we have a_i(k) ∈ A_i, where A_i = {a_1, a_2, ..., a_{m_{a,i}}} is the set of all admissible actions given the power balance, capacity, and operational constraints in problem P1, and m_{a,i} is the number of admissible actions for agent i.
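The following short sketch shows one way the state tuple (16) and a discretized admissible action set satisfying the capacity bounds (10) and (11) might be represented for a single agent. The 5 kW discretization step and the example bounds are assumptions made for illustration only; the paper does not prescribe a particular discretization.

```python
from itertools import product

def node_state(p_load, p_pv, soc, soh):
    """State tuple s_i(k) of eq. (16): load, PV generation, SOC, and SOH of the battery."""
    return (p_load, p_pv, soc, soh)

def admissible_actions(p_d_min, p_d_max, p_b_min, p_b_max, step=5.0):
    """Discretized action set A_i of eq. (17) under the capacity bounds (10) and (11).

    Each action is a pair (p_d, p_b). The grid step is an illustrative assumption;
    the paper only requires the actions to respect the bounds.
    """
    n_d = int(round((p_d_max - p_d_min) / step)) + 1
    n_b = int(round((p_b_max - p_b_min) / step)) + 1
    diesel_levels = [p_d_min + j * step for j in range(n_d)]
    battery_levels = [p_b_min + j * step for j in range(n_b)]
    return [(p_d, p_b) for p_d, p_b in product(diesel_levels, battery_levels)]

# Example: a node with a 0-60 kW diesel unit and a +/-20 kW battery.
actions = admissible_actions(0.0, 60.0, -20.0, 20.0)   # m_a = 13 * 9 = 117 actions
```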


Furthermore, the reward of agent i for a given state s_i and action a_i can be described as

  r_i(s_i, a_i) = −[ C_e(s_i, a_i) + λ_p · C_v(s_i, a_i) ]      (18)

where C_e(s_i, a_i) is the total energy cost of the agent given by (1), C_v(s_i, a_i) is the Lagrangian penalty term, and λ_p is the penalty coefficient. The penalty term is related to constraint (13) and is a variant of the logarithmic barrier method for inequality constraints [35], given by

  C_v(s_i, a_i) = { ln [ v_slack / ( 1 − ( (2v_i − v_max − v_min) / (v_max − v_min) )^2 ) ],   v_i ∈ (v_min, v_max)
                  { ∞,   otherwise.      (19)

The rationale behind this definition is that, when v_i approaches a boundary of the permitted range (i.e., v_min or v_max), the voltage regulation cost approaches infinity due to the logarithmic nature of the function, which prevents the agent from taking aggressive actions that violate the system operation constraints. The minus sign in front of the right-hand side of (18) transforms the cost minimization problem into the classical reward maximization form of reinforcement learning. In the following simulation, a maximum positive offset is added to the reward to ensure that the reward is positive; this is a small technique that conveniently sets the initial Q value and, consequently, the feature vector of (25) in the algorithm implementation.

The objective of the reinforcement learning for agent i is to obtain the best long-term reward, given by

  E[ Σ_{k=0}^{∞} γ^k · r_i(t_0 + k + 1) | s_i(t_0) = s_0 ]      (20)

where t_0 is the initial time, s_0 is the initial state, k is the index of the time slot, and r_i(t_0 + k + 1) is the reward at time slot t_0 + k + 1. The discount factor γ is used to reduce the effect of future rewards while keeping the cumulative reward over the infinite horizon bounded.

For the distributed economic dispatch problem considered in this paper, a priori knowledge may not be available due to the randomness of the loads and PV generation. In order to address this problem, the Q-learning algorithm can be used to leverage sampled historical values. For each state-action pair, we can define a Q function as follows:

  Q_{k+1,i}(s_i, a_i) = Q_{k,i}(s_i, a_i) + α [ r_{k,i}(s_i, a_i) + γ · max_{a_i'} Q_{k,i}(s_i', a_i') − Q_{k,i}(s_i, a_i) ]      (21)

where (s_i, a_i) is the state-action pair in time slot k and (s_i', a_i') is a possible state-action pair in the next time slot (∀ s_i, s_i' ∈ S_i, a_i, a_i' ∈ A_i). r_{k,i}(s_i, a_i) is the immediate reward when agent i takes action a_i at state s_i in time slot k. α is the learning rate, which determines the exploration rate of Q-learning, and γ is the discount factor, the same as in (20). Note that the time slot index k is used as a subscript of the Q function for notational simplicity, so Q_i(s_i(k), a_i(k)) at time slot k is written as Q_{k,i}(s_i, a_i).

Based on the above definition of the Q function, a classic reinforcement learning algorithm could be used to solve the economic dispatch problem. However, a major issue of classic reinforcement learning is the requirement of storing the Q values for all state-action pairs. As the number of state-action pairs increases exponentially with the dimension of the states and actions, the computational cost can be prohibitive. In particular, the state space of the economic dispatch problem in a microgrid [defined in (16) with respect to the load, PV generation, and SOC and SOH of the battery] is continuous in nature. Quantization of the continuous state space would further increase the computational cost. This issue is typically referred to as the "curse of dimensionality."

B. Function Approximation

To avoid storing a large Q value table, one promising solution is to use approximators to estimate the Q value for large and/or continuous state and action spaces [22]. For instance, an approximator can adopt a series of tunable parameters and features extracted from the state-action space to estimate the Q value. The approximator then establishes a mapping from a parameter space to the state-action pair space of the Q value function. The mapping can be either linear or nonlinear. In this paper, we use a linear mapping for analytical tractability [22]. A representative form of the linear approximator is given by

  Q_i(s_i, a_i) = Σ_{l=1}^{n_f} ϕ_{l,i}(s_i, a_i) · θ_{l,i} = ϕ_i^T(s_i, a_i) · θ_i      (22)

where θ_i ∈ ℝ^{n_f} is a tunable approximation parameter vector, while ϕ_i(s_i, a_i) ∈ ℝ^{n_f} is the feature vector depending on the state-action pair, given by

  ϕ_i(s_i, a_i) = ( ϕ_{1,i}(s_i, a_i), ϕ_{2,i}(s_i, a_i), ..., ϕ_{n_f,i}(s_i, a_i) )      (23)

where each element ϕ_{l,i}(s_i, a_i) is a basis function (BF), such as a Gaussian radial BF, whose center is a selected fixed point in the state space. Generally, the set of BFs corresponds to fixed points evenly distributed in the state space. In this paper, all vectors are considered column vectors unless otherwise specified, and (·)^T represents the matrix transpose operation. Radial BF neural networks have been used for stochastic nonlinear interconnected systems [36] and have been proven to have excellent generalization performance [37].

Since agent i only selects one action a_{j,i} from the action set A_i = {a_{1,i}, a_{2,i}, ..., a_{m_a,i}} when it is in state s_i, the other elements in the parameter vector θ_i should not contribute to the Q value approximation. Therefore, (22) can be rewritten as

  Q_i(s_i, a_{j,i}) = Σ_{l=1}^{n_f} ϕ_{l,i}(s_i, a_{j,i}) · θ_{l,i} = ϕ_i^T(s_i, a_{j,i}) · θ_i      (24)

where the feature vector ϕ_i^T(s_i, a_{j,i}) is defined as

  ϕ_i^T(s_i, a_{j,i}) = ( 0, ..., 0 | 0, ..., 0 | ... | ϕ_{1,i}(s_i), ϕ_{2,i}(s_i), ..., ϕ_{n_sf,i}(s_i) | 0, ..., 0 | ... | 0, ..., 0 ) ∈ ℝ^{n_f}      (25)

in which the successive blocks correspond, in order, to the actions a_{1,i}, a_{2,i}, ..., a_{j,i}, a_{j+1,i}, ..., a_{m_a,i}, and the elements that are not related to the selected action a_{j,i} are set to zero. The dimension of the feature vector ϕ_i(s_i, a_{j,i}) and of the parameter vector θ_i is n_f = n_sf · m_a, where n_sf is the number of state features and m_a is the dimension of the action set. In the following discussions, we denote the value of ϕ_i(s_i, a_{j,i}) in time slot k as ϕ_i(k) for notational simplicity.
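To make the blocked feature construction concrete, the sketch below builds Gaussian RBF state features as in (23), places them in the block of the selected action as in (25), and evaluates the linear approximation (24). The RBF width and the layout of the center matrix are assumptions made for illustration; the paper only fixes the center spacing used later in the simulations.

```python
import numpy as np

def rbf_state_features(state, centers, width=1.0):
    """Gaussian radial basis features of the (normalized) state, as used in eq. (23)."""
    state = np.asarray(state, dtype=float)
    d2 = np.sum((centers - state) ** 2, axis=1)       # squared distance to each center
    return np.exp(-d2 / (2.0 * width ** 2))           # length n_sf

def blocked_feature_vector(state, action_index, centers, m_a, width=1.0):
    """Feature vector phi_i(s_i, a_{j,i}) of eq. (25): only the block of the
    selected action carries the state features; all other blocks are zero."""
    n_sf = centers.shape[0]
    phi = np.zeros(n_sf * m_a)
    block = slice(action_index * n_sf, (action_index + 1) * n_sf)
    phi[block] = rbf_state_features(state, centers, width)
    return phi                                         # length n_f = n_sf * m_a

def q_value(state, action_index, theta, centers, m_a, width=1.0):
    """Linear approximation Q_i(s_i, a_{j,i}) = phi^T theta of eq. (24)."""
    return blocked_feature_vector(state, action_index, centers, m_a, width) @ theta
```

Here centers is an (n_sf × 4) array of fixed points in the state space of (16), and theta is the n_f-dimensional parameter vector to be learned.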


To obtain the optimal updating law for the parameter vector θ_i of agent i, the approximation performance can be evaluated based on the projected Bellman error J_i [23], which can be calculated as

  J_i = E[δ_i(k)ϕ_i(k)]^T · E[ϕ_i(k)ϕ_i(k)^T]^{−1} · E[δ_i(k)ϕ_i(k)]      (26)

where δ_i(k) ∈ ℝ is the temporal difference (TD) error in time slot k, which can be calculated as

  δ_i(k) = r_i(k) + γ · ϕ̂_i(k)^T θ_i(k) − ϕ_i(k)^T θ_i(k).      (27)

Compared with the classic Q-learning equation (21), ϕ̂_i(k)^T θ_i(k) should be the approximation of max_{a_i'} Q_{k,i}(s_i', a_i'). In practical applications, ϕ̂_i(k) can be estimated as follows:

  ϕ̂_i(k) ≈ arg max_{ϕ_i(s_i', a_i')} ϕ_i(s_i', a_i')^T θ_i(k).      (28)

According to [23], the gradient of (1/2)J_i can be approximated as −E[δ_i(k)ϕ_i(k)] + γ E[ϕ̂_i(k)ϕ_i(k)^T] ω_i^*(k), where

  ω_i^*(k) = E[ϕ_i(k)ϕ_i(k)^T]^{−1} E[δ_i(k)ϕ_i(k)].      (29)

To accelerate the convergence of the learning algorithm, a correction term can be introduced to adjust the update of the approximation parameter vector θ_i [24] as follows:

  θ_i(k+1) = θ_i(k) + α(k) ( δ_i(k)ϕ_i(k) − γ · ω_i(k)^T ϕ_i(k) · ϕ̂_i(k) )      (30)
  ω_i(k+1) = ω_i(k) + β(k) ( δ_i(k) − ϕ_i(k)^T ω_i(k) ) ϕ_i(k).      (31)

The iterations (30) and (31) correspond to the Greedy-GQ algorithm proposed in [24], which is derived from the gradient of the projected Bellman error. However, this update applies to a single agent without cooperation and cannot be applied directly to distributed economic dispatch in a microgrid. How to extend the update to a distributed framework still needs to be investigated.

C. Cooperation Mechanism

The cooperative reinforcement learning algorithm incorporates a diffusion strategy into the reinforcement learning process, such that distributed information exchange can be achieved in microgrids while reducing the computational cost. In multiagent systems, the diffusion strategy can achieve faster convergence and reach a lower mean-square deviation than the consensus strategy [38]. In addition, the diffusion strategy performs better when responding to continuous real-time signals and is insensitive to the neighboring weights.

The basic idea of the diffusion strategy is to incorporate cooperative terms based on the neighboring states into each agent's self-state updating process. In this paper, the adapt-then-combine (ATC) mechanism is incorporated into the reinforcement learning algorithm [38]. Consider an agent i with state x_i and the dynamic characteristic

  x_i(k+1) = x_i(k) + f(x_i(k)).      (32)

Then, the diffusion strategy is given by

  x̃_i(k+1) = x_i(k) + f(x_i(k))      (33)
  x_i(k+1) = Σ_{j∈N_i} b_{ij} · x̃_j(k+1)      (34)

where x̃_i(k+1) is an intermediate term introduced by the diffusion strategy and x_i(k+1) is the updated state obtained by combining all intermediate terms from the neighbors of agent i. N_i is the neighboring set of agent i, and b_{ij} is the weight assigned to neighboring agent j by agent i. Here, we can define a matrix B = [b_{ij}] ∈ ℝ^{n×n} as the topology matrix of the microgrid communication network. In general, the topology matrix B is a stochastic matrix, which means that B·1_n = 1_n, where 1_n ∈ ℝ^n is the unit vector.

By integrating the ATC diffusion strategy into the parameter updating process of Greedy-GQ [via (30) and (31)], the cooperative reinforcement learning algorithm is given by

  θ̃_i(k+1) = θ_i(k) + α(k) ( δ_i(θ_i(k)) · ϕ_i(k) − γ · w_i(k)^T ϕ_i(k) · ϕ̂_i(k) )      (35)
  θ_i(k+1) = Σ_{j∈N_i} b_{ij} · θ̃_j(k+1)      (36)
  w̃_i(k+1) = w_i(k) + β(k) · ( δ_i(θ_i(k)) − ϕ_i(k)^T w_i(k) ) · ϕ_i(k)      (37)
  w_i(k+1) = Σ_{j∈N_i} b_{ij} · w̃_j(k+1).      (38)

Note that the proposed cooperative reinforcement learning algorithm introduces two intermediate vectors, θ̃_i(k+1) and w̃_i(k+1). The actual approximation parameter vector θ_i(k+1) and correction parameter vector w_i(k+1) are combinations of these intermediate vectors from the neighboring agents, corresponding to the ATC diffusion strategy given by (33) and (34). In the proposed algorithm, the learning rate parameters α(k) and β(k) can be set according to the conditions S(1) ∼ S(4) [24]:

  S(1): α(k) > 0, β(k) > 0
  S(2): Σ_{t=0}^{∞} α(k) = Σ_{t=0}^{∞} β(k) = +∞
  S(3): Σ_{t=0}^{∞} [ α(k)^2 + β(k)^2 ] < +∞
  S(4): α(k)/β(k) → 0.

Remark 3: It should be noted that these are sufficient conditions for convergence. In a practical application, the general requirement for choosing the two learning rates is 0 < α(k) < β(k) < 1; α(k) and β(k) can be constant, with α(k) smaller than β(k) by about one to two orders of magnitude.
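A compact sketch of one learning step of this scheme is given below: each agent first performs the adapt step of (35) and (37), using the greedy feature estimate of (28), and the network then performs the combine steps (36) and (38) with the neighborhood weights b_ij. It reuses the feature helpers sketched after (25); the function names and the way neighbor vectors are passed around are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def greedy_feature(next_state, theta, centers, m_a):
    """Estimate of phi_hat in eq. (28): feature vector of the greedy action in the next state."""
    q = [q_value(next_state, j, theta, centers, m_a) for j in range(m_a)]
    return blocked_feature_vector(next_state, int(np.argmax(q)), centers, m_a)

def adapt(theta, w, phi, phi_hat, r, gamma=0.8, alpha=5e-4, beta=1e-3):
    """Adapt step of eqs. (35) and (37): Greedy-GQ update with TD correction for one agent."""
    delta = r + gamma * phi_hat @ theta - phi @ theta          # TD error, eq. (27)
    theta_tilde = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_hat)
    w_tilde = w + beta * (delta - phi @ w) * phi
    return theta_tilde, w_tilde

def combine(neighbor_tildes, neighbor_weights):
    """Combine step of eqs. (36) and (38): neighborhood-weighted averaging (ATC diffusion)."""
    return sum(b_ij * x_j for b_ij, x_j in zip(neighbor_weights, neighbor_tildes))
```

In a full iteration, every agent calls adapt with its own sampled (state, action, reward, next state), exchanges the resulting intermediate vectors with its neighbors, and then calls combine twice, once for θ and once for w.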


Based on the cooperative reinforcement learning algorithm, a model-free solution of the distributed economic dispatch problem (i.e., problem P1) can be obtained, such that each node in the microgrid only needs to communicate with its local neighbors without relying on a central controller. The main procedure is summarized in Algorithm 1.

Algorithm 1: Cooperative Reinforcement Learning Algorithm With Function Approximation and Diffusion Mechanism
  Input: policy π, neighboring matrix B, and learning rates α, β
  Output: the action a_i(k) of agent i (i = 1, 2, ..., n) for every time slot k
   1: Initialization: θ_0, ω_0;
   2: for every time slot k = 1 to T do
   3:   for every agent i = 1 to n do
   4:     Calculate the feature vector ϕ_i(k) of state s_i(k);
   5:     Take action a_i(k) according to policy π;
   6:     Observe the reward r_i(k);
   7:     Calculate the TD error δ_i(k) by (27);
   8:     Estimate ϕ̂_i(k) by (28);
   9:     Update the parameters θ_i(k), ω_i(k) by (35)∼(38);
  10:     i ← i + 1;
  11:   end for
  12:   k ← k + 1;
  13: end for
  14: return the action sequence;

From Algorithm 1, it can be seen that, for each agent in one time slot, the iteration includes feature extraction, action taking, reward observation, TD error calculation, new feature estimation, and parameter updating. Its computational load mainly lies in the inner product of the feature vector ϕ_i and the approximation parameter vector θ_i, whose dimensions are both n_f. Thus, the computational complexity of one iteration is O(n_f). For T time slots and n agents, the computational complexity is O(T · n · n_f).

IV. CONVERGENCE ANALYSIS

The convergence of the proposed cooperative reinforcement learning algorithm is critical for the efficient and reliable operation of microgrids. However, the existing convergence analyses [26], [27] for reinforcement learning algorithms cannot be directly applied to the proposed algorithm, mainly because of the introduction of function approximation and the cooperation mechanism to solve the distributed economic dispatch problem in a microgrid. Therefore, in this section, an innovative analytical approach is provided for the learning framework consisting of parameterized Q-learning and the diffusion strategy, which is developed to evaluate the convergence of the proposed cooperative reinforcement learning algorithm. The analytical result is summarized in Theorem 1.

Theorem 1: Consider the iterations (35)–(38) with TD gradient corrections and the diffusion mechanism, with α(k) and β(k) satisfying S(1) ∼ S(4). Then, the system global parameter θ = {θ_1, θ_2, ..., θ_n} converges with probability 1 to the TD fixed point under the condition

  | 1 + λ_m(E(F_i(k))) | < 1

where λ_m(E(F_i(k))) is the mth largest eigenvalue of the matrix E(F_i(k)), and

  F_i(k) = [ α(k)·ϕ_i(k)(γϕ̂_i(k) − ϕ_i(k))^T    −γ·α(k)·ϕ̂_i(k)ϕ_i(k)^T
             β(k)·ϕ_i(k)(γϕ̂_i(k) − ϕ_i(k))^T    −β(k)·ϕ_i(k)ϕ_i(k)^T ].

Proof: By incorporating the definition of the TD error δ_i(θ_i(k)) in (27) into the function approximation algorithm (35) and (37), the iterations of θ̃_i and w̃_i can be rewritten as follows:

  θ̃_i(k+1) = θ_i(k) + α(k)·ϕ_i(k)(γϕ̂_i(k) − ϕ_i(k))^T θ_i(k) − α(k)·γ·ϕ̂_i(k)ϕ_i(k)^T w_i(k) + α(k)·r_i(k)·ϕ_i(k)      (39)
  w̃_i(k+1) = w_i(k) + β(k)·ϕ_i(k)(γϕ̂_i(k) − ϕ_i(k))^T θ_i(k) − β(k)·ϕ_i(k)ϕ_i(k)^T w_i(k) + β(k)·r_i(k)·ϕ_i(k).      (40)

Then, a global data model can be constructed by introducing two stacked state variables

  ψ̃_i(k+1) = ( θ̃_i(k+1), w̃_i(k+1) )^T      (41)
  ψ_i(k+1) = ( θ_i(k+1), w_i(k+1) )^T.      (42)

Combined with (39) and (40), the parameter updates in (35)–(38) can be written in matrix form as

  ψ̃_i(k+1) = ψ_i(k) + ( F_i(k)·ψ_i(k) + f_i(k) )      (43)
  ψ_i(k+1) = Σ_{j∈N_i} b_{ij} · ψ̃_j(k+1)      (44)

where F_i(k) and f_i(k) are given by

  F_i(k) = [ α(k)·G_i(k)    −γ·α(k)·Ĥ_i(k)
             β(k)·G_i(k)    −β(k)·H_i(k) ]      (45)
  f_i(k) = ( α(k)·r_i(k)·ϕ_i(k), β(k)·r_i(k)·ϕ_i(k) )^T.      (46)

In (45) and (46), we have

  G_i(k) = ϕ_i(k)(γϕ̂_i(k) − ϕ_i(k))^T      (47)
  Ĥ_i(k) = ϕ̂_i(k)ϕ_i(k)^T      (48)
  H_i(k) = ϕ_i(k)ϕ_i(k)^T.      (49)

Define the following expectations:

  F_i = E[F_i(k)],  G_i = E[G_i(k)]      (50)
  H_i = E[H_i(k)],  f_i = E[f_i(k)].      (51)

It can be proved that, for each agent i, there exists a unique fixed point ψ^0 such that F_i · ψ^0 + f_i = 0, under the condition that the base points of the feature vector are sufficient in the state and action space [23]. Then, the error vectors of each agent i are given by

  e_{ψ̃,i}(k) = ψ^0 − ψ̃_i(k),  e_{ψ,i}(k) = ψ^0 − ψ_i(k).      (52)

Accordingly, the global error vectors for all agents are given by

  e_{ψ̃}(k) = ( e_{ψ̃,1}(k), e_{ψ̃,2}(k), ..., e_{ψ̃,n}(k) )^T
  e_ψ(k) = ( e_{ψ,1}(k), e_{ψ,2}(k), ..., e_{ψ,n}(k) )^T.      (53)


Integrating (43), (44), and (52), the global error vector can be derived in a recursive form:

  e_ψ(k+1) = 1_n ⊗ ψ^0 − ψ(k+1)
           = 1_n ⊗ ψ^0 − B_I · ψ̃(k+1)
           = 1_n ⊗ ψ^0 − B_I ( ψ(k) + F_diag(k)·ψ(k) + f_col(k) )
           = 1_n ⊗ ψ^0 − B_I [ (I_F + F_diag(k))·ψ(k) + f_col(k) ]
           = 1_n ⊗ ψ^0 − B_I (I_F + F_diag(k))·ψ(k) − B_I · f_col(k)      (54)

where B_I = B ⊗ I_F, with ⊗ denoting the Kronecker product, and I_F ∈ ℝ^{n·n_f × n·n_f} is an identity matrix. Furthermore, we define F_diag(k) = diag{F_1(k), F_2(k), ..., F_n(k)} and f_col(k) = col{f_1(k), f_2(k), ..., f_n(k)}. According to (52) and (53), ψ(k) = 1_n ⊗ ψ^0 − e_ψ(k). Thus, (54) can be rewritten as

  e_ψ(k+1) = B_I (I_F + F_diag(k)) e_ψ(k) + 1_n ⊗ ψ^0 − B_I·1_n ⊗ ψ^0 − B_I ( F_diag(k)·1_n ⊗ ψ^0 + f_col(k) ).      (55)

Since ψ^0 is constant, we have B_I ( F_diag(k)·1_n ⊗ ψ^0 + f_col(k) ) = 0. Also, since B is a stochastic matrix, we have 1_n ⊗ ψ^0 = B_I·1_n ⊗ ψ^0. By taking the expectation of both sides of (55), we have

  E[e_ψ(k+1)] = B_I (I_F + E[F_diag(k)]) E[e_ψ(k)]
              = B ⊗ (I_f + E[F_i(k)]) E[e_{ψ,i}(k)]
              = B ⊗ (I_f + F_i) E[e_{ψ,i}(k)].      (56)

Since B is a stochastic matrix, we have 0 < |λ_m(B)| ≤ 1, where λ_m(B) is the mth largest eigenvalue of matrix B. Thus, the matrix (I_F + E[F_diag(k)]) is stable, and the iteration in (56) converges, when

  | 1 + λ_m(F_i) | < 1      (57)

where λ_m(F_i) is the mth largest eigenvalue of matrix F_i. ∎

Remark 4: From the definition of F_i(k) in (45), it can be seen that the elements of F_i are determined by the parameters α(k) and β(k) and by the feature vector ϕ(k). In practical applications, the elements of the feature vector ϕ(k) can be normalized to be less than 1, and the learning parameters α(k) and β(k) can be chosen small enough to meet condition (57). Therefore, condition (57) serves as a guideline for the design of the cooperative reinforcement learning algorithm, such that the convergence of the distributed economic dispatch in microgrids can be guaranteed.

V. SIMULATION RESULTS

In order to evaluate the performance of the proposed cooperative reinforcement learning algorithm, simulations are conducted for the distributed economic dispatch of a residential microgrid with a 33-bus distribution feeder [39], whose topology is presented in Fig. 1. In the topology, node 1 is the slack node and the remaining nodes have 40∼400 households each; the total loads of the nodes are taken from the node_33.m case file on the topology's website [39]. The electrical connections presented in Fig. 1 determine the communication links, that is, the adjacency matrix B in the algorithm. However, to expedite the convergence speed, it is assumed that a node can communicate with the nodes within two electrical hops. For instance, the neighbor set of node 29 is {27, 28, 30, 31}.

Fig. 1. Simulation topology of microgrid with 33 nodes.

The real-world average household meteorological data for PV generation and the load data are obtained from the National Renewable Energy Laboratory [40] and Open Energy Information [41], respectively. The meteorological and load data are collected over 13 weeks during three summer months, from June 2, 2013 to August 31, 2013, at Arcata, CA, USA. In the simulation, the duration of each time slot equals 1 h. The proposed algorithm and the compared algorithms are implemented and tested in MATLAB 2015b.

For the utility grid, the electricity price is obtained from [42]. For the diesel generator, the cost coefficients are a^d = 1.5 × 10^−4 $/kW^2, b^d = 0.025 $/kW, and c^d = 0.04 $, which are extracted from manufacturers' data [43]. For the battery under consideration, the degradation cost coefficient is ρ^b = 100, while the coefficients of the life degradation are a^h = 0.001, β^h = −2, and η^h = 0 according to the curve-fitting method [44]. The parameters used to calculate the rate of charge/discharge of the SOC are c^b = 1.82, d^b = −1, and α^h = 2.5, respectively [45]. In per unit, the normalized voltage of each node in the microgrid should be within the range [0.95, 1.05].

For reinforcement learning, the two step sizes for cooperative learning are set to α = 0.0005 and β = 0.001, respectively. A discount factor of γ = 0.8 is used, and the weight coefficient is selected as λ_p = 30. The initial approximation parameters (θ_0 and w_0) are set to 1_{n_f} = {1, 1, ..., 1}_{n_f}. These parameters are listed in Table II.

[TABLE II: Simulation Setting]

In the simulation, the feature vectors are extracted by Gaussian RBFs whose center points are uniformly distributed in the state space. For the power components, the spacing of the center points is 5 kW, and for the SOC and SOH components, which have been normalized to [0, 1], the spacing of the center points is 0.1. These spacings are uniform for all agents.
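For reference, the simulation settings quoted in the text (and summarized in Table II) are gathered below; collecting them in a single Python dictionary is purely an illustrative convenience and not part of the original MATLAB implementation.

```python
# Simulation settings quoted in Section V (Table II).
SIM_PARAMS = {
    "slot_hours": 1,                  # duration of one time slot
    "alpha": 0.0005,                  # learning rate for theta, eq. (35)
    "beta": 0.001,                    # learning rate for w, eq. (37)
    "gamma": 0.8,                     # discount factor
    "lambda_p": 30.0,                 # penalty weight in eq. (18)
    "diesel": {"a_d": 1.5e-4, "b_d": 0.025, "c_d": 0.04},     # $/kW^2, $/kW, $
    "battery": {"rho_b": 100.0, "a_h": 1e-3, "beta_h": -2.0, "eta_h": 0.0,
                "c_b": 1.82, "d_b": -1.0, "alpha_h": 2.5},
    "voltage_pu": (0.95, 1.05),       # per-unit voltage limits
    "rbf_spacing": {"power_kw": 5.0, "soc_soh": 0.1},          # RBF center spacing
}
```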


Figs. 2 and 3 illustrate the PV generation and load variations for one household over two consecutive weeks starting from August 11. It is obvious that both the load and the PV generation present significant fluctuations. In particular, the difference in PV generation at the same hour of two weeks can reach up to 60%, which makes precise forecasting very difficult. In order to address such randomness, reinforcement learning can be performed.

Fig. 2. Variation comparison of PV generation for one household over two consecutive weeks (168 h).
Fig. 3. Variation comparison of load for one household over two consecutive weeks (168 h).

Before it is applied to real power dispatch, the approximation parameters of the reinforcement learning are first trained with historical data. The training data are constructed by leveraging the meteorological and load data collected in the previous year, in the same summer months, at the same location. A small random disturbance is then added to emulate year-to-year changes, and the state sequence can be formed with an arbitrary length. To present the convergence of the algorithm, the average values of θ for nodes 6, 14, and 32 are illustrated in Fig. 4. From Fig. 4, it can be seen that, although different nodes have different decision variables with different ranges, they converge after about 7000 iterations. In the training procedure, every period with three months of emulated meteorological and load data is taken as one training iteration. However, the nodes have different convergence speeds. For instance, the convergence speed of node 32 is slower than that of nodes 6 and 14. This is because, in the simulation topology [39], the real power of node 32 is 210 kW, while the real powers of nodes 6 and 14 are 60 and 120 kW, respectively. This means that node 32 has a larger state and action space and needs more attempts to search for the optimal action strategy.

Fig. 4. Average values of θ for nodes 6, 14, and 32.

To illustrate the control actions, the decision variables p^b(k) and p^d(k), calculated based on the cooperative reinforcement learning algorithm, are shown in Fig. 5. The results correspond to node 2 during the week from August 11 to August 17 (the 11th week since June 2). From Fig. 5, it can be seen that the battery is continuously charged or discharged, which indicates that the cooperative reinforcement learning algorithm continually optimizes the battery scheduling to reduce cost. Also, the algorithm avoids charging and discharging the battery at very high power, which limits the degradation of the battery life.

Fig. 5. Decision on battery and diesel generator operation for node 2 during week 11 (168 h).

To evaluate the performance of our proposed algorithm, a comparison of the average cumulative costs obtained from the fuzzy-Q learning algorithm [20], the scenario-based algorithm [13], and our proposed algorithm is shown in Fig. 6. In Fig. 6, the cost curves decrease at some time slots because


it is permissible for DG units and ES devices to sell their power to the utility grid when the grid price is high, which reduces some cost. On the contrary, if the grid price is low, the node purchases extra electricity from the grid and stores it in the ES devices.

Fig. 6. Comparison of the average cumulative costs obtained from different algorithms during week 11 (168 h).
Fig. 7. Impact of battery capacity on different algorithms during week 11 (168 h).

From Fig. 6, it can also be seen that all simulated algorithms present similar tendencies in the average cumulative costs. This means that, for the same scenarios with the same state variables, the simulated algorithms make similar decisions in many situations. For instance, they all have to follow the rule that, at the lowest electricity price, the EMS should purchase utility grid power to charge the battery, while at the highest electricity price, the power stored in the battery should be discharged to supply the home appliances or be fed back to the utility grid. Their similar tendencies illustrate that they all perform optimization to some extent and try to approximate the optimal strategy.

However, there is a difference among their optimization performances. Fig. 6 illustrates that our proposed algorithm achieves the lowest cost. The scenario-based algorithm, whose scenarios are extracted by Latin hypercube sampling based on a Monte Carlo model, achieves a cost close to that of our proposed algorithm. However, this method requires a priori knowledge of the probability distribution of the scenarios with respect to the PV generation and load in the microgrid, which may not be available in practical applications. Both the fuzzy-Q learning algorithm and our proposed algorithm are model-free, while our proposed algorithm presents a better performance than the fuzzy-Q learning algorithm due to the efficiency of the diffusion mechanism and function approximation.

To evaluate the impact of the battery capacity on the different algorithms, the average costs achieved by these algorithms are shown in Fig. 7. The nominal 100% capacity corresponds to 100 kWh. It can be observed that, as the battery capacity increases, the costs achieved by all algorithms decrease. The main reason is that a larger battery capacity provides higher flexibility for microgrid operation under load and PV generation variations, since the battery operates as a buffer during the economic dispatch process. However, the cost reduction diminishes for very large battery capacities, since the battery is underutilized due to the limited load demand in the microgrid. Similar to Fig. 6, for the various battery capacities, our proposed algorithm performs better than both the fuzzy-Q learning algorithm and the scenario-based algorithm.

A comparison of the average cumulative costs of the different algorithms over each week is shown in Table III, which illustrates the same results as Figs. 6 and 7. Our proposed algorithm achieves the lowest cost through all the weeks, and the scenario-based algorithm obtains slightly higher costs. The cost of the fuzzy-Q learning algorithm is consistently higher than that of both the proposed algorithm and the scenario-based algorithm.

[TABLE III: Comparison of the Average Cumulative Costs ($) of Different Algorithms Over All Weeks in the Three Months]

It must be noted that, compared with conventional stochastic optimization methods, the proposed algorithm is not sensitive to an increase of the time horizon and can extend the optimization horizon more conveniently, since the training state sequence can easily be constructed with an arbitrary length from historical data. For conventional stochastic optimization methods, by contrast, extending the time horizon may generally result in unsolvability due to the NP-hard characteristics.


VI. C ONCLUSION [14] Z. Wang, B. Chen, J. Wang, M. M. Begovic, and C. Chen, “Coor-
dinated energy management of networked microgrids in distribu-
Due to the flexibility of incorporating DG units and ES tion systems,” IEEE Trans. Smart Grid, vol. 6, no. 1, pp. 45–53,
devices, microgrids have been rapidly deployed in recent Jan. 2015.
[15] T. A. Nguyen and M. L. Crow, “Stochastic optimization of renewable-
years. However, the stochastic nature of the power gener- based microgrid operation incorporating battery operating cost,” IEEE
ation from DG units and loads brings significant challenge Trans. Power Syst., vol. 31, no. 3, pp. 2289–2296, May 2016.
to economic dispatch, especially in a distributed framework. [16] F. A. Oliehoek, “Decentralized POMDPs,” in Reinforcement Learn-
ing: State-of-the-Art (Adaptation, Learning, and Optimization). Berlin,
This paper utilizes the cooperative reinforcement learning with Germany: Springer-Verlag, 2012, ch. 15.
function approximation to address the distributed economic [17] Z. Zhang, D. Zhao, J. Gao, D. Wang, and Y. Dai, “FMRQ—A multiagent
dispatch in microgrids. The proposed cooperative reinforce- reinforcement learning algorithm for fully cooperative tasks,” IEEE
Trans. Cybern., vol. 47, no. 6, pp. 1367–1379, Jun. 2017.
ment learning algorithm leverages a diffusion strategy to [18] Q. Wei, D. Liu, and G. Shi, “A novel dual iterative Q-learning method for
coordinate the actions among multiple agents in the microgrid. optimal battery management in smart residential environments,” IEEE
It can avoid the issue of curse of dimensionality and has Trans. Ind. Electron., vol. 62, no. 4, pp. 2509–2518, Apr. 2015.
guaranteed convergence. Simulation results illustrate that the [19] T. Yu, H. Z. Wang, B. Zhou, K. W. Chan, and J. Tang, “Multi-agent
correlated equilibrium Q(λ) learning for coordinated smart generation
proposed cooperative reinforcement learning algorithm can control of interconnected power grids,” IEEE Trans. Power Syst., vol. 30,
achieve the lowest cost for distributed economic dispatch no. 4, pp. 1669–1679, Jul. 2015.
in microgrids compared with the fuzzy-Q algorithm and the [20] L. Xin, Z. Chuanzhi, Z. Peng, and Y. Haibin, “Genetic based fuzzy
Q-learning energy management for smart grid,” in Proc. IEEE CCC,
traditional scenario-based algorithm. Jun. 2012, pp. 6924–6927.
In the future, we will extend our proposed algorithm to [21] M. Sharifi and H. Kebriaei, “A study on pricing strategies for residential
achieve the coordination among multiple microgrids by devel- load management using fuzzy reinforcement learning,” in Proc. IEEE
CCIP, Mar. 2015, pp. 1–6.
oping a hierarchical reinforcement learning structure. In this [22] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement
hierarchical structure, due to the increased scale, the power Learning and Dynamic Programming Using Function Approximators.
losses will be considered, more practical constraints will be Boca Raton, FL, USA: CRC Press, 2010.
[23] R. S. Sutton et al., “Fast gradient-descent methods for temporal-
incorporated, and more realistic costs, such as renewable difference learning with linear function approximation,” in Proc. 26th
generating costs, will be introduced. Annu. Int. Conf. Mach. Learn., 2009, pp. 993–1000.
[24] H. R. Maei, C. Szepesvari, S. Bhatnagar, and R. S. Sutton, “Toward
off-policy learning control with function approximation,” in Proc. 27th
REFERENCES

[1] L. Mariam, M. Basu, and M. F. Conlon, "Microgrid: Architecture, policy and future trends," Renew. Sustain. Energy Rev., vol. 64, pp. 477–489, Oct. 2016.
[2] S. Parhizi, H. Lotfi, A. Khodaei, and S. Bahramirad, "State of the art in research on microgrids: A review," IEEE Access, vol. 3, pp. 890–925, 2015.
[3] G. Binetti, A. Davoudi, F. L. Lewis, D. Naso, and B. Turchiano, "Distributed consensus-based economic dispatch with transmission losses," IEEE Trans. Power Syst., vol. 29, no. 4, pp. 1711–1720, Jul. 2014.
[4] G. K. Venayagamoorthy, R. K. Sharma, P. K. Gautam, and A. Ahmadi, "Dynamic energy management system for a smart microgrid," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 8, pp. 1643–1656, Aug. 2016.
[5] H. Liang and W. Zhuang, "Stochastic modeling and optimization in a microgrid: A survey," Energies, vol. 7, no. 4, pp. 2027–2050, 2014.
[6] H. Liang, B. J. Choi, A. Abdrabou, W. Zhuang, and X. S. Shen, "Decentralized economic dispatch in microgrids via heterogeneous wireless networks," IEEE J. Sel. Areas Commun., vol. 30, no. 6, pp. 1061–1074, Jul. 2012.
[7] H. Liang, A. K. Tamang, W. Zhuang, and X. S. Shen, "Stochastic information management in smart grid," IEEE Commun. Surveys Tuts., vol. 16, no. 3, pp. 1746–1770, 3rd Quart., 2014.
[8] M. Střelec, K. Macek, and A. Abate, "Modeling and simulation of a microgrid as a stochastic hybrid system," in Proc. IEEE ISGT, Oct. 2012, pp. 1–9.
[9] J. Wu and X. Guan, "Coordinated multi-microgrids optimal control algorithm for smart distribution management system," IEEE Trans. Smart Grid, vol. 4, no. 4, pp. 2174–2181, Dec. 2013.
[10] G. Hug, S. Kar, and C. Wu, "Consensus + innovations approach for distributed multiagent coordination in a microgrid," IEEE Trans. Smart Grid, vol. 6, no. 4, pp. 1893–1903, Jul. 2015.
[11] W. Qi, J. Liu, and P. D. Christofides, "Distributed supervisory predictive control of distributed wind and solar energy systems," IEEE Trans. Control Syst. Technol., vol. 21, no. 2, pp. 504–512, Mar. 2013.
[12] M. E. Khodayar, M. Barati, and M. Shahidehpour, "Integration of high reliability distribution system in microgrid operation," IEEE Trans. Smart Grid, vol. 3, no. 4, pp. 1997–2006, Dec. 2012.
[13] N. Growe-Kuska, H. Heitsch, and W. Romisch, "Scenario reduction and scenario tree construction for power management problems," in Proc. IEEE Bologna Power Tech Conf., Jun. 2003, p. 7.
[17] Z. Zhang, D. Zhao, J. Gao, D. Wang, and Y. Dai, "FMRQ—A multiagent reinforcement learning algorithm for fully cooperative tasks," IEEE Trans. Cybern., vol. 47, no. 6, pp. 1367–1379, Jun. 2017.
[18] Q. Wei, D. Liu, and G. Shi, "A novel dual iterative Q-learning method for optimal battery management in smart residential environments," IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2509–2518, Apr. 2015.
[19] T. Yu, H. Z. Wang, B. Zhou, K. W. Chan, and J. Tang, "Multi-agent correlated equilibrium Q(λ) learning for coordinated smart generation control of interconnected power grids," IEEE Trans. Power Syst., vol. 30, no. 4, pp. 1669–1679, Jul. 2015.
[20] L. Xin, Z. Chuanzhi, Z. Peng, and Y. Haibin, "Genetic based fuzzy Q-learning energy management for smart grid," in Proc. IEEE CCC, Jun. 2012, pp. 6924–6927.
[21] M. Sharifi and H. Kebriaei, "A study on pricing strategies for residential load management using fuzzy reinforcement learning," in Proc. IEEE CCIP, Mar. 2015, pp. 1–6.
[22] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL, USA: CRC Press, 2010.
[23] R. S. Sutton et al., "Fast gradient-descent methods for temporal-difference learning with linear function approximation," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 993–1000.
[24] H. R. Maei, C. Szepesvari, S. Bhatnagar, and R. S. Sutton, "Toward off-policy learning control with function approximation," in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 1–8.
[25] H. Modares, F. L. Lewis, and Z.-P. Jiang, "H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2550–2562, Oct. 2015.
[26] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed, "Distributed policy evaluation under multiple behavior strategies," IEEE Trans. Autom. Control, vol. 60, no. 5, pp. 1260–1274, May 2015.
[27] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed, "Cooperative off-policy prediction of Markov decision processes in adaptive networks," in Proc. IEEE ICASSP, May 2013, pp. 4539–4543.
[28] Y. Li and Y. W. Li, "Power management of inverter interfaced autonomous microgrid based on virtual frequency-voltage frame," IEEE Trans. Smart Grid, vol. 2, no. 1, pp. 30–40, Mar. 2011.
[29] A. Hoke, A. Brissette, K. Smith, A. Pratt, and D. Maksimovic, "Accounting for lithium-ion battery degradation in electric vehicle charging optimization," IEEE J. Emerg. Sel. Topics Power Electron., vol. 2, no. 3, pp. 691–700, Sep. 2014.
[30] B. Aksanli and T. Rosing, "Optimal battery configuration in a residential home with time-of-use pricing," in Proc. IEEE Int. Conf. Smart Grid Commun., Oct. 2013, pp. 157–162.
[31] D. Doerffel and S. A. Sharkh, "A critical review of using the Peukert equation for determining the remaining capacity of lead-acid and lithium-ion batteries," J. Power Sources, vol. 155, no. 2, pp. 395–400, 2006.
[32] S. Bolognani and S. Zampieri, "On the existence and linear approximation of the power flow solution in power distribution networks," IEEE Trans. Power Syst., vol. 31, no. 1, pp. 163–172, Jan. 2016.
[33] A. J. Wood and B. F. Wollenberg, Power Generation, Operation, and Control. Hoboken, NJ, USA: Wiley, 1996.
[34] J. Ni and Q. Ai, "Economic power transaction using coalitional game strategy in micro-grids," IET Generat., Transmiss., Distrib., vol. 10, no. 1, pp. 10–18, 2016.
[35] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge Univ. Press, 2004.
[36] H. Q. Wang, X. P. Liu, and K. F. Liu, "Robust adaptive neural tracking control for a class of stochastic nonlinear interconnected systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 3, pp. 510–523, Apr. 2016.
[37] Y. Lei, L. Ding, and W. Zhang, "Generalization performance of radial basis function networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 3, pp. 551–564, Mar. 2015.
[38] S.-Y. Tu and A. H. Sayed, "Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks," IEEE Trans. Signal Process., vol. 60, no. 12, pp. 6217–6234, Dec. 2012.
[39] Distribution Test 33-Bus Feeder (Node_33.m). Accessed: Feb. 12, 2017. [Online]. Available: http://www.ece.ubc.ca/~hameda/download_files/
[40] NREL. Measurement and Instrumentation Data Center (MIDC) of NREL. Accessed: Feb. 14, 2017. [Online]. Available: http://www.nrel.gov/midc/
[41] Open Energy Information and Data (OpenEI). Accessed: Feb. 12, 2017. [Online]. Available: http://en.openei.org/wiki/Main_Page
[42] G. Liu, Y. Xu, and K. Tomsovic, "Bidding strategy for microgrid in day-ahead market based on hybrid stochastic/robust optimization," IEEE Trans. Smart Grid, vol. 7, no. 1, pp. 227–237, Jan. 2016.
[43] MQ Power. Datasheets for Diesel On-Site Power Industrial Generators of MQ POWER Company. Accessed: Feb. 14, 2017. [Online]. Available: http://www.powertechengines.com/MQP-DataSheets/MQP050IZ_Rev_0.pdf
[44] A. Millner, "Modeling lithium ion battery degradation in electric vehicles," in Proc. IEEE CITRES, Sep. 2010, pp. 349–356.
[45] N. Omar, P. Van den Bossche, T. Coosemans, and J. Van Mierlo, "Peukert revisited—Critical appraisal and need for modification for lithium-ion batteries," Energies, vol. 6, no. 11, pp. 5625–5641, 2013.

Weirong Liu (M'08) received the B.E. degree in computer software engineering and the M.E. degree in computer application technology from Central South University, Changsha, China, in 1998 and 2003, respectively, and the Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2007.
Since 2008, he has been a Faculty Member with the School of Information Science and Engineering, Central South University, where he is currently an Associate Professor. From 2016 to 2017, he was a Visiting Scholar with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. His current research interests include reinforcement learning, neural networks, wireless sensor networks, network protocol, smart grid, and microgrid.
Dr. Liu received the Best Paper Award from the 7th Chinese Conference on Cloud Computing in 2016.

Peng Zhuang received the B.S. degree in electrical and computer engineering from the University of Alberta, Edmonton, AB, Canada, in 2015, where he is currently pursuing the Ph.D. degree.
His current research interests include stochastic optimization of power system planning and operation, energy management in smart grid, and cyber security.

Hao Liang (S'09–M'14) received the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, in 2013.
He has been an Assistant Professor with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada, since 2014. His current research interests include the areas of smart grid, wireless communications, and wireless networking.

Jun Peng (M'08) received the B.S. degree from Xiangtan University, Xiangtan, China, in 1987, the M.S. degree from the National University of Defense Technology, Changsha, China, in 1990, and the Ph.D. degree from Central South University, Changsha, in 2005.
In 1990, she joined the Staff of Central South University. From 2006 to 2007, she was with the School of Electrical and Computer Science, University of Central Florida, Orlando, FL, USA, as a Visiting Scholar. She is currently a Professor with the School of Information Science and Engineering, Central South University. Her current research interests include cooperative control, and cloud computing and wireless communications.

Zhiwu Huang (M'08) received the B.S. degree in industrial automation from Xiangtan University, Xiangtan, China, in 1987, the M.S. degree in industrial automation from the Department of Automatic Control, University of Science and Technology Beijing, Beijing, China, in 1989, and the Ph.D. degree in control theory and control engineering from Central South University, Changsha, China, in 2006.
In 1994, he joined the Staff of Central South University. From 2008 to 2009, he was with the School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, Colchester, U.K., as a Visiting Scholar. He is currently a Professor with the School of Information Science and Engineering, Central South University. His current research interests include fault diagnostic technique, cooperative control, and cloud computing.