An Introduction to Quantum Reinforcement Learning (QRL)
Samuel Yen-Chi Chen
Wells Fargo
New York, NY, USA
yen-chi.chen@wellsfargo.com
Abstract—… in the integration of these two cutting-edge fields. Among the various ML techniques, reinforcement learning (RL) stands out for its ability to address complex sequential decision-making problems. RL has already demonstrated substantial success in the classical ML community. Now, the emerging field of Quantum Reinforcement Learning (QRL) seeks to enhance RL algorithms by incorporating principles from quantum computing. This paper offers an introduction to this exciting area for the broader AI and ML community.

Index Terms—Quantum neural networks, Quantum machine learning, Variational quantum circuits, Quantum reinforcement learning, Quantum artificial intelligence

I. INTRODUCTION

Quantum computing (QC) offers the potential for substantial computational advantages in specific problems compared to classical computers [1]. Despite the current limitations of quantum devices, such as noise and imperfections, significant efforts are being made to achieve quantum advantages. One prominent area of focus is quantum machine learning (QML), which leverages quantum computing principles to enhance machine learning tasks. Most QML algorithms rely on a hybrid quantum-classical paradigm, which divides the computational task into two components: quantum computers handle the parts that benefit from quantum computation, while classical computers process the parts they excel at.

Variational quantum algorithms (VQAs) [2] form the foundation of current quantum machine learning (QML) approaches. QML has demonstrated success in various machine learning tasks, including classification [3]–[6], sequential learning [7], [8], natural language processing [9]–[12], and reinforcement learning [13]–[19]. Among these areas, quantum reinforcement learning (QRL) is an emerging field where researchers are exploring the application of quantum computing principles to enhance the performance of reinforcement learning agents. This article provides an introduction to the concepts and recent developments in QRL.

The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.

A. Quantum Computing

A qubit represents the fundamental unit of quantum information processing. Unlike a classical bit, which is restricted to holding a state of either 0 or 1, a qubit can simultaneously encapsulate the information of both 0 and 1 due to the principle of superposition. A single-qubit quantum state can be expressed as |Ψ⟩ = α|0⟩ + β|1⟩, where |0⟩ = [1, 0]^T and |1⟩ = [0, 1]^T are column vectors, and α and β are complex numbers. In an n-qubit system, the state vector has a length of 2^n. Quantum gates U are utilized to transform a quantum state |Ψ⟩ to another state |Ψ′⟩ through the operation |Ψ′⟩ = U|Ψ⟩. These quantum gates are unitary transformations that satisfy the condition UU† = U†U = I_{2^n × 2^n}, where n denotes the number of qubits. It has been demonstrated that a small set of basic quantum gates is sufficient for universal quantum computation. One such set includes the single-qubit gates H, σ_x, σ_y, σ_z, R_x(θ) = e^(−iθσ_x/2), R_y(θ) = e^(−iθσ_y/2), R_z(θ) = e^(−iθσ_z/2), and the two-qubit gate CNOT. In quantum machine learning (QML), the rotation gates R_x, R_y, and R_z are particularly crucial, as their rotation angles can be treated as trainable or learnable parameters subject to optimization. For quantum operations on multi-qubit systems, the unitary transformation can be constructed via the tensor product of individual single-qubit or two-qubit operations, U = U_1 ⊗ U_2 ⊗ · · · ⊗ U_k. At the final stage of a quantum circuit, a procedure known as measurement is performed. A single execution of a quantum circuit generates a binary string. This procedure can be repeated multiple times to determine the probabilities of different computational basis states (e.g., |0, · · · , 0⟩, · · · , |1, · · · , 1⟩) or to calculate expectation values (e.g., Pauli X, Y, and Z).

B. Variational Quantum Circuits

Variational quantum circuits (VQCs), also referred to as parameterized quantum circuits (PQCs), represent a specialized class of quantum circuits with trainable parameters. VQCs are extensively utilized within the current hybrid quantum-classical computing framework [2] and have demonstrated specific types of quantum advantages [20]–[22]. There are three fundamental components in a VQC: the encoding circuit, the variational circuit, and the final measurements. As shown in Figure 1, the encoding circuit U(x) transforms the initial
quantum state |0⟩^⊗n into |Ψ⟩ = U(x)|0⟩^⊗n. Here n represents the number of qubits, |0⟩^⊗n represents the n-qubit initial state |0, · · · , 0⟩, and U(x) represents the unitary which depends on the input value x. The measurement process extracts data from the VQC by assessing either a subset or all of the qubits, producing a classical bit sequence for further use. Running the circuit once yields a bit sequence such as "0,0,1,1". However, preparing and executing the circuit multiple times (shots) generates expectation values for each qubit. Most works mentioned in this survey focus on the evaluation of Pauli-Z expectation values derived from measurements in VQCs. Generally, the mathematical expression of the VQC can be written as f(x; Θ) = (⟨Ẑ_1⟩, · · · , ⟨Ẑ_n⟩), where ⟨Ẑ_k⟩ = ⟨0| U†(x) W†(Θ) Ẑ_k W(Θ) U(x) |0⟩. In the hybrid quantum-classical framework, the VQC can be integrated with other classical components, such as deep neural networks and tensor networks, or with other quantum components, including additional VQCs. The entire model can be optimized in an end-to-end manner using either gradient-based [4], [5] or gradient-free [14] methods. For gradient-based methods like gradient descent, the gradients of quantum components can be computed via the parameter-shift rules [3], [23], [24].

… stopping criterion. The use of quantum neural networks for learning policy or value functions is referred to as quantum reinforcement learning (QRL). The idea of QRL is illustrated in Figure 2. For a comprehensive review of the current QRL domain, refer to the review article [18].

Fig. 2. Concept of quantum reinforcement learning (QRL).

B. Quantum Deep Q-learning

Q-learning [25] is a fundamental model-free RL algorithm. It learns the optimal action-value function and operates off-policy. The process begins with the random initialization of Q^π(s, a) for all states s ∈ S and actions a ∈ A, stored in a Q-table. The Q^π(s, a) estimates are updated using the Bellman equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ].   (1)
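As a minimal illustration of the update rule in Eq. (1), the following sketch runs tabular Q-learning on a hypothetical five-state chain environment. The chain MDP, the ε-greedy behavior policy, and the optimistic initialization are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One application of the update rule in Eq. (1)."""
    # Bootstrap from the next state's best action unless the episode ended.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Hypothetical chain MDP: states 0..4, action 0 = left, action 1 = right;
# reaching state 4 ends the episode with reward 1, all other rewards are 0.
n_states, n_actions, goal = 5, 2, 4
Q = np.ones((n_states, n_actions))  # optimistic init encourages exploration
rng = np.random.default_rng(0)

for _ in range(500):
    s = 0
    while s != goal:
        # epsilon-greedy behavior policy (Q-learning is off-policy)
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == goal else 0.0
        q_update(Q, s, a, r, s_next, done=(s_next == goal))
        s = s_next

greedy_policy = [int(np.argmax(Q[s])) for s in range(goal)]
print(greedy_policy)  # → [1, 1, 1, 1]: always move toward the goal
```

With γ = 0.9, the learned values approach γ^(goal−s−1) for the rightward actions, so the greedy policy recovered from the Q-table moves toward the goal in every non-terminal state.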
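Similarly, the VQC expectation value ⟨Ẑ_k⟩ = ⟨0| U†(x) W†(Θ) Ẑ_k W(Θ) U(x) |0⟩ and the parameter-shift rule discussed earlier can be simulated directly. The following NumPy sketch builds a single-qubit VQC f(x; θ) = ⟨Ẑ⟩ with an R_y(x) encoding gate and an R_y(θ) variational gate; this tiny circuit layout is an illustrative assumption, not an architecture from the paper.

```python
import numpy as np

# Pauli-Z observable and the |0> state for a single qubit.
Z = np.array([[1, 0], [0, -1]], dtype=complex)
ket0 = np.array([1, 0], dtype=complex)

def Ry(theta):
    """Single-qubit rotation about the y-axis: e^(-i*theta*sigma_y/2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def vqc(x, theta):
    """f(x; theta) = <0| U†(x) W†(theta) Z W(theta) U(x) |0>."""
    psi = Ry(theta) @ Ry(x) @ ket0  # encoding gate, then variational gate
    return float(np.real(psi.conj() @ Z @ psi))

def grad_theta(x, theta):
    """Parameter-shift rule: exact gradient from two circuit evaluations."""
    return 0.5 * (vqc(x, theta + np.pi / 2) - vqc(x, theta - np.pi / 2))

# For this circuit f(x; theta) = cos(x + theta), so the parameter-shift
# gradient can be checked against the analytic derivative -sin(x + theta).
print(np.isclose(vqc(0.3, 0.4), np.cos(0.7)))          # → True
print(np.isclose(grad_theta(0.3, 0.4), -np.sin(0.7)))  # → True
```

On real hardware each expectation value would be estimated from repeated shots rather than computed exactly, but the parameter-shift structure, two circuit evaluations per trainable angle, is the same.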