
An Introduction to Quantum Reinforcement Learning (QRL)
Samuel Yen-Chi Chen
Wells Fargo
New York, NY, USA
yen-chi.chen@wellsfargo.com

arXiv:2409.05846v1 [quant-ph] 9 Sep 2024

Abstract—Recent advancements in quantum computing (QC) and machine learning (ML) have sparked considerable interest in the integration of these two cutting-edge fields. Among the various ML techniques, reinforcement learning (RL) stands out for its ability to address complex sequential decision-making problems. RL has already demonstrated substantial success in the classical ML community. Now, the emerging field of Quantum Reinforcement Learning (QRL) seeks to enhance RL algorithms by incorporating principles from quantum computing. This paper offers an introduction to this exciting area for the broader AI and ML community.

Index Terms—Quantum neural networks, Quantum machine learning, Variational quantum circuits, Quantum reinforcement learning, Quantum artificial intelligence

The views expressed in this article are those of the authors and do not represent the views of Wells Fargo. This article is for informational purposes only. Nothing contained in this article should be construed as investment advice. Wells Fargo makes no express or implied warranties and expressly disclaims all legal, tax, and accounting implications related to this article.

I. INTRODUCTION

Quantum computing (QC) offers the potential for substantial computational advantages in specific problems compared to classical computers [1]. Despite the current limitations of quantum devices, such as noise and imperfections, significant efforts are being made to achieve quantum advantages. One prominent area of focus is quantum machine learning (QML), which leverages quantum computing principles to enhance machine learning tasks. Most QML algorithms rely on a hybrid quantum-classical paradigm, which divides the computational task into two components: quantum computers handle the parts that benefit from quantum computation, while classical computers process the parts they excel at.

Variational quantum algorithms (VQAs) [2] form the foundation of current QML approaches. QML has demonstrated success in various machine learning tasks, including classification [3]–[6], sequential learning [7], [8], natural language processing [9]–[12], and reinforcement learning [13]–[19]. Among these areas, quantum reinforcement learning (QRL) is an emerging field where researchers are exploring the application of quantum computing principles to enhance the performance of reinforcement learning agents. This article provides an introduction to the concepts and recent developments in QRL.

II. QUANTUM NEURAL NETWORKS

A. Quantum Computing

A qubit represents the fundamental unit of quantum information processing. Unlike a classical bit, which is restricted to holding a state of either 0 or 1, a qubit can simultaneously encapsulate the information of both 0 and 1 due to the principle of superposition. A single-qubit quantum state can be expressed as |Ψ⟩ = α|0⟩ + β|1⟩, where |0⟩ = [1, 0]^T and |1⟩ = [0, 1]^T are column vectors, and α and β are complex numbers. In an n-qubit system, the state vector has a length of 2^n. Quantum gates U are utilized to transform a quantum state |Ψ⟩ to another state |Ψ'⟩ through the operation |Ψ'⟩ = U|Ψ⟩. These quantum gates are unitary transformations that satisfy the condition UU† = U†U = I_{2^n×2^n}, where n denotes the number of qubits. It has been demonstrated that a small set of basic quantum gates is sufficient for universal quantum computation. One such set includes the single-qubit gates H, σ_x, σ_y, σ_z, R_x(θ) = e^{−iθσ_x/2}, R_y(θ) = e^{−iθσ_y/2}, R_z(θ) = e^{−iθσ_z/2}, and the two-qubit gate CNOT. In quantum machine learning (QML), the rotation gates R_x, R_y, and R_z are particularly crucial as their rotation angles can be treated as trainable or learnable parameters subject to optimization. For quantum operations on multi-qubit systems, the unitary transformation can be constructed via the tensor product of individual single-qubit or two-qubit operations, U = U_1 ⊗ U_2 ⊗ · · · ⊗ U_k. At the final stage of a quantum circuit, a procedure known as measurement is performed. A single execution of a quantum circuit generates a binary string. This procedure can be repeated multiple times to determine the probabilities of different computational bases (e.g., |0, · · · , 0⟩, · · · , |1, · · · , 1⟩) or to calculate expectation values (e.g., Pauli X, Y, and Z).
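To make the gate and measurement description above concrete, the following minimal NumPy sketch (an illustration added for this introduction, not code from any cited work) prepares R_y(θ)|0⟩, computes the exact Pauli-Z expectation value, and estimates the same quantity from repeated simulated measurements (shots).

import numpy as np

# Computational basis states |0> and |1> as column vectors.
ket0 = np.array([1.0, 0.0], dtype=complex)

def ry(theta):
    # Single-qubit rotation R_y(theta) = exp(-i * theta * sigma_y / 2).
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]], dtype=complex)

theta = 0.7
psi = ry(theta) @ ket0                 # |psi> = R_y(theta)|0>

probs = np.abs(psi) ** 2               # Born rule: P(0), P(1)
exact_z = probs[0] - probs[1]          # exact <Z> = |alpha|^2 - |beta|^2

# Simulate repeated circuit executions ("shots") and estimate <Z> from samples.
shots = 10_000
samples = np.random.choice([0, 1], size=shots, p=probs)
estimated_z = np.mean(1 - 2 * samples)  # map bit 0 -> +1, bit 1 -> -1

print(exact_z, estimated_z)             # the estimate converges to cos(theta)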
B. Variational Quantum Circuits

Variational quantum circuits (VQCs), also referred to as parameterized quantum circuits (PQCs), represent a specialized class of quantum circuits with trainable parameters. VQCs are extensively utilized within the current hybrid quantum-classical computing framework [2] and have demonstrated specific types of quantum advantages [20]–[22]. There are three fundamental components in a VQC: the encoding circuit, the variational circuit, and the final measurements. As shown in Figure 1, the encoding circuit U(x) transforms the initial quantum state |0⟩^⊗n into |Ψ⟩ = U(x)|0⟩^⊗n. Here n represents the number of qubits, |0⟩^⊗n represents the n-qubit initial state |0, · · · , 0⟩, and U(x) represents the unitary which depends on the input value x. The measurement process extracts data from the VQC by assessing either a subset or all of the qubits, producing a classical bit sequence for further use. Running the circuit once yields a bit sequence such as "0,0,1,1". However, preparing and executing the circuit multiple times (shots) generates expectation values for each qubit. Most works mentioned in this survey focus on the evaluation of Pauli-Z expectation values derived from measurements in VQCs. Generally, the VQC can be expressed mathematically as f(x; Θ) = (⟨Ẑ_1⟩, · · · , ⟨Ẑ_n⟩), where ⟨Ẑ_k⟩ = ⟨0| U†(x) W†(Θ) Ẑ_k W(Θ) U(x) |0⟩. In the hybrid quantum-classical framework, the VQC can be integrated with other classical components, such as deep neural networks and tensor networks, or with other quantum components, including additional VQCs. The entire model can be optimized in an end-to-end manner using either gradient-based [4], [5] or gradient-free [14] methods. For gradient-based methods like gradient descent, the gradients of quantum components can be computed via the parameter-shift rules [3], [23], [24].

Fig. 1. Generic Structure of a Variational Quantum Circuit (VQC).
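As a concrete illustration of the encoding-variational-measurement structure in Figure 1, the sketch below builds a small VQC with PennyLane [24]. The particular choices of four qubits, R_y angle encoding, one layer of trainable rotations, and a CNOT entangling chain are illustrative assumptions rather than a design prescribed by the text.

import pennylane as qml
import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(x, theta):
    # Encoding circuit U(x): rotate each qubit by one input feature.
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    # Variational circuit W(Theta): trainable rotations plus entanglement.
    for i in range(n_qubits):
        qml.RX(theta[i, 0], wires=i)
        qml.RZ(theta[i, 1], wires=i)
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])
    # Measurement: Pauli-Z expectation value on every qubit.
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

x = np.array([0.1, 0.5, -0.3, 0.9])
theta = np.random.uniform(-np.pi, np.pi, size=(n_qubits, 2))
print(vqc(x, theta))    # f(x; Theta) = (<Z_1>, ..., <Z_n>)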
III. QUANTUM REINFORCEMENT LEARNING

A. Reinforcement Learning

Reinforcement Learning (RL) is a pivotal paradigm within machine learning, where an autonomous entity known as the agent learns to make decisions through iterative interactions with its environment [25]. The agent operates within a defined environment, represented as E, over discrete time steps. At each time step t, the agent receives state or observation information, denoted as s_t, from the environment E. Based on this information, the agent selects an action a_t from a set of permissible actions A, guided by its policy π. The policy π acts as a function that maps the current state or observation s_t to the corresponding action a_t. Notably, the policy can be stochastic, meaning that for a given state s_t, the action a_t is determined by a probability distribution π(a_t|s_t).

Upon executing action a_t, the agent transitions to the subsequent state s_{t+1} and receives a scalar reward r_t. This cycle continues until the agent reaches a terminal state or fulfills a specified stopping condition, such as a maximum number of steps. We define an episode as the sequence beginning from an initial state, following the described process, and concluding either at the terminal state or upon meeting the stopping criterion. The use of quantum neural networks for learning policy or value functions is referred to as quantum reinforcement learning (QRL). The idea of QRL is illustrated in Figure 2. For a comprehensive review of the current QRL domain, refer to the review article [18].

Fig. 2. Concept of quantum reinforcement learning (QRL).

B. Quantum Deep Q-learning

Q-learning [25] is a fundamental model-free RL algorithm. It learns the optimal action-value function and operates off-policy. The process begins with the random initialization of Q^π(s, a) for all states s ∈ S and actions a ∈ A, stored in a Q-table. The Q^π(s, a) estimates are updated using the Bellman equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ].   (1)

The conventional Q-learning approach offers the optimal action-value function but is impractical for problems requiring extensive memory, especially with high-dimensional state (s) or action (a) spaces. In environments with continuous states, storing Q(s, a) in a table is inefficient or impossible. To address this challenge, neural networks (NNs) are used to represent Q^π(s, a), ∀s ∈ S, a ∈ A, leading to deep Q-learning. The network in this technique is known as a deep Q-network (DQN) [26]. To enhance the stability of DQN training, techniques such as experience replay and the use of a target network are employed [26]. Experience replay stores experiences as transition tuples (s_t, a_t, r_t, s_{t+1}) in a memory or buffer. After gathering sufficient experiences, the agent randomly samples a batch to compute the loss and update the DQN parameters. Additionally, to reduce the correlation between target and prediction, a target network, which is a duplicate of the DQN, is used. The DQN parameters θ are updated iteratively, while the target network parameters θ⁻ are updated periodically. The DQN is trained by minimizing the mean squared error (MSE) loss function:

L(θ) = E[ (r_t + γ max_{a′} Q(s_{t+1}, a′; θ⁻) − Q(s_t, a_t; θ))² ].   (2)

Other loss functions, such as the Huber loss or the mean absolute error (MAE), can also be used. The first VQC-based QRL is described in the work [13], in which a VQC is designed to solve environments with discrete observations such as Frozen Lake and Cognitive Radio. The design follows the original idea in classical deep Q-learning [26]. As shown in Figure 3, the quantum DQN also uses a target network and experience replay; the DQN and its target network are simply two sets of quantum circuit parameters. The quantum agent is optimized via a gradient descent algorithm such as RMSProp in the hybrid quantum-classical manner. Later, more sophisticated efforts in the area of quantum DQN take into account continuous observation spaces like Cart-Pole [15], [16].

Fig. 3. Quantum deep Q-learning.
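For reference, the following sketch implements the tabular update of Eq. (1) and the target construction inside the loss of Eq. (2). It is a generic illustration rather than the implementation of [13]; the placeholder linear q_net and target_net stand in for the (hybrid) VQC Q-function approximators, and terminal-state masking is omitted for brevity.

import numpy as np

alpha, gamma = 0.1, 0.99

# Tabular Q-learning update, Eq. (1): Q is an |S| x |A| table.
def q_update(Q, s, a, r, s_next):
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# DQN-style loss, Eq. (2): q_net / target_net are placeholder callables
# standing in for the (quantum) Q-function and its periodically copied target.
def mse_loss(q_net, target_net, batch):
    states, actions, rewards, next_states = batch
    targets = rewards + gamma * np.max(target_net(next_states), axis=1)
    predictions = q_net(states)[np.arange(len(actions)), actions]
    return np.mean((targets - predictions) ** 2)

# Toy usage with a small Q-table and a random linear "network".
Q = np.zeros((16, 4))                   # e.g. Frozen Lake: 16 states, 4 actions
q_update(Q, s=0, a=2, r=0.0, s_next=1)

W = np.random.randn(8, 4)               # hypothetical linear Q-function
q_net = target_net = lambda s: s @ W
batch = (np.random.randn(32, 8), np.random.randint(4, size=32),
         np.random.randn(32), np.random.randn(32, 8))
print(mse_loss(q_net, target_net, batch))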
C. Quantum Policy Gradient Methods

In contrast to value-based RL algorithms, such as Q-learning, which depend on learning a value function to guide decision-making at each time step, policy gradient methods aim to optimize a policy function directly. This policy function π(a|s; θ) is parameterized by θ. The parameters θ are adjusted using gradient ascent on the expected total return, E[R_t]. A prominent example of a policy gradient algorithm is the REINFORCE algorithm [27]. The policy function π(a|s; θ) can be implemented using a VQC, where the rotation parameters serve as θ. In [28], the authors employ the REINFORCE algorithm to train a VQC-based policy. Their results demonstrate that VQC-based policies can achieve performance comparable to or exceeding that of classical DNNs on several standard benchmarks. In the traditional REINFORCE algorithm, parameter updates for θ are based on the gradient ∇_θ log π(a_t|s_t; θ) R_t, which provides an unbiased estimate of ∇_θ E[R_t]. However, this gradient estimate can exhibit high variance, which may lead to difficulties or instability during training. To address this issue and reduce variance while preserving unbiasedness, a baseline term can be subtracted from the return. This baseline, b_t(s_t), is a learned function of the state s_t. The update rule then becomes ∇_θ log π(a_t|s_t; θ)(R_t − b_t(s_t)). A typical choice for the baseline b_t(s_t) in RL is an estimate of the value function V^π(s_t). Employing this baseline generally leads to a reduction in the variance of the policy gradient estimate [25]. The term R_t − b_t = Q(s_t, a_t) − V(s_t) represents the advantage A(s_t, a_t) of taking action a_t in state s_t. This advantage can be viewed as a measure of how favorable or unfavorable action a_t is compared to the average value of the state s_t. This method is referred to as the advantage actor-critic (A2C) approach, where the policy π serves as the actor and the value function V acts as the critic [25]. Similar to traditional policy gradient methods, the A2C algorithm can be implemented using VQCs. In [29], the authors utilize VQCs to construct both the actor (policy function) and the critic (value function). Their study demonstrates that, for comparable numbers of model parameters, a hybrid approach in which classical neural networks post-process the outputs from the VQC achieves superior performance across the tested environments.

The asynchronous advantage actor-critic (A3C) algorithm [30] is an enhanced variant of the A2C method that utilizes multiple concurrent actors to learn the policy through parallel processing. This approach involves deploying several agents across several instances of the environment, enabling them to experience a wide range of states simultaneously. By reducing the correlation between states or observations, this method improves the numerical stability of on-policy RL algorithms like actor-critic [30]. Moreover, asynchronous training eliminates the need for extensive replay memory, which helps in reducing memory usage [30]. A3C achieves high sample efficiency and robust learning performance, making it a favored choice in RL. In the context of quantum RL, asynchronous or distributed training can further boost sampling efficiency and leverage the capabilities of multiple quantum computers or quantum processing units (QPUs). In [31], the authors extend the A3C framework to quantum settings, showing that VQC actors and critics can outperform classical models when the sizes of the models are comparable.
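The variance-reduced REINFORCE update described above can be spelled out as follows: discounted returns are computed for one episode, a learned state-value baseline is subtracted to form advantages, and the parameters move along ∇_θ log π(a_t|s_t; θ)(R_t − b_t(s_t)). This is a generic sketch with a classical linear-softmax policy standing in for the VQC policies of [28], [29].

import numpy as np

gamma, lr = 0.99, 0.01

def discounted_returns(rewards):
    # R_t = sum_k gamma^k * r_{t+k}, computed backwards over one episode.
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return np.array(out[::-1])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy linear-softmax policy; in QRL the logits would come from a VQC.
n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))
value_w = np.zeros(n_features)              # linear baseline b_t(s_t) ~ V(s_t)

def reinforce_update(states, actions, rewards):
    global theta, value_w
    returns = discounted_returns(rewards)
    for s, a, R in zip(states, actions, returns):
        advantage = R - s @ value_w         # R_t - b_t(s_t)
        probs = softmax(s @ theta)
        grad_log = -np.outer(s, probs)      # d log pi / d theta, all actions
        grad_log[:, a] += s                 # plus the chosen-action term
        theta += lr * advantage * grad_log  # gradient ascent on E[R_t]
        value_w += lr * advantage * s       # move the baseline toward R_t

# Toy episode with random data.
states = np.random.randn(5, n_features)
actions = np.random.randint(n_actions, size=5)
rewards = np.random.rand(5)
reinforce_update(states, actions, rewards)
print(theta)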
D. Quantum RL with Evolutionary Optimization

One of the significant challenges in current QML applications is the limitation of quantum computers or quantum simulation software in processing input dimensions. These systems can only handle inputs up to a certain size, which is insufficient for encoding larger vectors. In QRL, this constraint means the observation vector that the quantum agent can process from the environment is severely restricted. To address this issue, various dimensionality reduction methods have been proposed. Among these, a hybrid quantum-classical approach that incorporates a classical learnable model with a VQC has shown promising results. In the work by Chen et al. [14], a quantum-inspired classical model based on a specific type of tensor network, known as a matrix product state (MPS), is integrated with a VQC to function as a learnable compressor [4] (see Figure 4). The hybrid MPS-VQC architecture, comprising the tensor network and the VQC, is randomly initialized, and the entire model is trained in an end-to-end manner. Although gradient-based methods have achieved considerable success in RL, several challenges remain. Notably, these methods can become trapped in local minima or fail to converge to the optimal solution, particularly in sparse RL environments where the agent frequently receives zero rewards during episodes. Evolutionary optimization techniques have been proposed to address these challenges in classical RL and have demonstrated significant success [32]. A similar approach can be applied to hybrid quantum-classical RL models. Specifically, a population P of N agents, represented as parameter vectors Θ_i, i ∈ {1, · · · , N}, is randomly initialized. In each generation, the top-performing agents are selected to serve as parents for generating the next generation of agents/parameter vectors. The update rule for the new parameters involves adding Gaussian noise to the parent parameters. This method has been shown to optimize MPS-VQC models effectively and to outperform NN-VQC in selected benchmarks [14].

Fig. 4. Hybrid Quantum-Classical RL with Tensor Networks.
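A minimal sketch of the population-based procedure just described: parameter vectors are scored, the top performers are kept as parents, and children are generated by adding Gaussian noise. The fitness function here is a hypothetical stand-in for the episodic return of an MPS-VQC agent such as the one in [14].

import numpy as np

def fitness(params):
    # Hypothetical stand-in for the total episode return obtained by an agent
    # (e.g. an MPS-VQC policy) parameterized by `params`.
    return -np.sum(params ** 2)

pop_size, n_params, n_parents, sigma, generations = 20, 10, 5, 0.1, 50

population = np.random.randn(pop_size, n_params)
for _ in range(generations):
    scores = np.array([fitness(p) for p in population])
    parents = population[np.argsort(scores)[-n_parents:]]      # keep the best
    children = [parents[np.random.randint(n_parents)]
                + sigma * np.random.randn(n_params)
                for _ in range(pop_size - n_parents)]           # Gaussian mutation
    population = np.vstack([parents, children])                 # elitism + offspring

best = population[np.argmax([fitness(p) for p in population])]
print(fitness(best))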
E. Quantum RL with Recurrent Policies

The previously mentioned quantum RL methods primarily utilize various VQCs without incorporating recurrent structures. However, recurrent connections are essential in classical machine learning for retaining memory of past time steps. Certain RL tasks necessitate that agents have the capability to remember information from previous time steps in order to select optimal actions. For instance, environments with partial observability often require agents to make decisions based not only on information from the current time step but also on information accumulated from the past. In classical ML, recurrent neural networks (RNNs), such as long short-term memory (LSTM) [33], have been proposed to solve tasks with temporal dependencies. The quantum version of LSTM (QLSTM) has been designed by replacing the classical neural networks with VQCs [7]. It has been shown that QLSTM can outperform classical LSTM in several time-series prediction tasks when the model sizes are similar [7]. To address RL environments with partial observability or those requiring temporal memories, QRL agents utilizing QLSTM as the value or policy function have been proposed in [17]. It has been demonstrated that QLSTM-based value or policy functions enable QRL agents to outperform classical LSTM models with a similar number of parameters.

While QLSTM-based models achieve significant results in several benchmarks, at least one major challenge prevents such models from being widely applied. The training of RNNs, whether quantum or classical, requires significant computational resources due to the need to perform backpropagation through time (BPTT). One might ask whether it is possible to leverage the capabilities of QLSTM without the need for gradient calculations with respect to the quantum parameters. Indeed, it has been demonstrated that a randomly initialized RNN can function as a reservoir, transforming input information into a high-dimensional space. The only part that requires training is the linear layer following the reservoir. Quantum RNNs, such as the QLSTM, can also be utilized as reservoirs [8]. It has been shown that even without training, the QLSTM reservoir can achieve performance comparable to fully trained models [8]. To further enhance the performance of QLSTM-based QRL agents and reduce training resource requirements, a randomly initialized QLSTM can be employed as a reservoir in an RL agent [19]. Numerical simulations have demonstrated that the QLSTM reservoir can achieve performance comparable to, and sometimes superior to, fully trained QLSTM RL agents.
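The reservoir idea can be sketched with a purely classical analogue: a fixed, randomly initialized recurrent map plays the role of the reservoir, and only a linear readout on its states is fitted, so no backpropagation through time is required. In the quantum variant of [8], [19], the random recurrent map is replaced by a randomly initialized QLSTM; the snippet below is an assumption-laden illustration of the training pattern, not the cited implementation.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 3, 50, 200

# Fixed (untrained) reservoir weights, scaled for stable dynamics.
W_in = rng.normal(scale=0.5, size=(n_res, n_in))
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius < 1

def reservoir_states(inputs):
    h = np.zeros(n_res)
    states = []
    for x in inputs:                       # no gradients flow through this loop
        h = np.tanh(W_in @ x + W_res @ h)
        states.append(h)
    return np.array(states)

# Toy sequence task: regress a target signal from the reservoir states.
inputs = rng.normal(size=(T, n_in))
targets = np.sin(np.arange(T) / 10.0)

H = reservoir_states(inputs)
# Train only the linear readout (ridge regression), not the reservoir itself.
ridge = 1e-3
readout = np.linalg.solve(H.T @ H + ridge * np.eye(n_res), H.T @ targets)
print(np.mean((H @ readout - targets) ** 2))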
F. Quantum RL with Fast Weight Programmers

An alternative approach for developing a QRL model that can memorize temporal or sequential dependencies without utilizing quantum RNNs is the Quantum Fast Weight Programmer (QFWP). The idea of Fast Weight Programmers (FWP) was originally proposed in the work of Schmidhuber [34], [35]. In this sequential learning model, two distinct neural networks (NNs) are utilized: the slow programmer and the fast programmer. Here, the NN weights act as the model/agent's program. The core concept of FWP involves the slow programmer generating updates or changes to the fast programmer's NN weights based on observations at each time step. This reprogramming process quickly redirects the fast programmer's attention to salient information within the incoming data stream. Notably, the slow programmer does not completely overwrite the fast programmer but instead applies updates or changes. This approach allows the fast programmer to incorporate previous observations, enabling a simple feed-forward NN to manage sequential prediction or control without the high computational demands of recurrent neural networks (RNNs). The idea of FWP can be further extended into the hybrid quantum-classical regime, as described in the work [36]. In [36], classical neural networks are used to construct the slow networks, which generate values to update the parameters of the fast networks, implemented as a VQC.

Fig. 5. Quantum Fast Weight Programmers.

As illustrated in Figure 5, the input vector x⃗ is first processed by a classical neural network encoder. The encoder's output is then fed into two additional neural networks. One network generates an output vector [L_i] whose length equals the number of VQC layers, while the other produces an output vector [Q_j] whose length matches the number of qubits in the VQC. We then calculate the outer product of [L_i] and [Q_j]. It can be written as [L_i] ⊗ [Q_j] = [M_ij] = [L_i × Q_j], an l × n matrix whose (i, j) entry is L_i × Q_j, where l is the number of learnable layers in the VQC and n is the number of qubits. At time t + 1, the updated VQC parameters can be calculated as θ_ij^(t+1) = f(θ_ij^t, L_i × Q_j), where f combines the previous parameters θ_ij^t with the newly computed L_i × Q_j. In the time-series modeling and RL tasks in [36], the additive update rule is used: the new circuit parameters are calculated as θ_ij^(t+1) = θ_ij^t + L_i × Q_j. This method preserves information from previous time steps in the circuit parameters, influencing the VQC behavior with each new input x⃗. The output from the VQC can be further processed by components such as scaling, translation, or a classical neural network to refine the final results.
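The additive fast-weight update has a direct array form: the slow networks emit the vectors [L_i] and [Q_j], their outer product is an l × n matrix, and it is added elementwise to the current VQC parameters. A minimal sketch, with the classical encoder and its two heads collapsed into a hypothetical random slow_programmer function:

import numpy as np

n_layers, n_qubits = 2, 4                       # l layers, n qubits in the VQC
theta = np.zeros((n_layers, n_qubits))          # VQC parameters theta_ij

def slow_programmer(x):
    # Hypothetical stand-in for the classical encoder plus the two heads
    # that produce [L_i] (per layer) and [Q_j] (per qubit) from input x.
    L = np.tanh(np.random.randn(n_layers))
    Q = np.tanh(np.random.randn(n_qubits))
    return L, Q

for x in np.random.randn(10, 8):                # stream of observations
    L, Q = slow_programmer(x)
    theta = theta + np.outer(L, Q)              # theta_ij^(t+1) = theta_ij^t + L_i * Q_j
    # theta would now parameterize the fast VQC acting on the current input.

print(theta)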
G. Quantum RL with Quantum Architecture Search

While QRL has demonstrated effectiveness across various problem domains, the design of successful architectures is far from trivial. Developing VQC architectures tailored to specific problems requires substantial effort and expertise. The field of quantum architecture search (QAS) focuses on developing methods to identify high-performing quantum circuits for specific tasks. A QAS problem is formulated by specifying a particular goal (e.g., total returns in RL) and the constraints of the quantum device (e.g., maximum number of quantum operations, set of allowed quantum gates). QAS has been explored in the context of QRL. For instance, in [37], evolutionary algorithms are employed to search for high-performing circuits. The authors define a set of candidate VQC blocks, including entangling blocks, data-encoding blocks, variational blocks, and measurement blocks. The objective of the evolutionary search is to determine an optimal sequence of these blocks, given a constraint on the maximum number of circuit blocks. While this approach has shown effectiveness in the evaluated cases, scalability issues may arise as the search space expands. Differentiable quantum architecture search (DiffQAS) methods, as proposed in [38], draw inspiration from differentiable neural architecture search in classical deep learning to identify effective quantum circuits for RL. In [39], the authors apply DiffQAS to quantum deep Q-learning. They parameterize a probability distribution P(k, α) over circuit architectures k using α. During training, mini-batches of VQCs are sampled, and the weighted loss is calculated based on the distribution P(k, α). Both the architecture parameters α and the quantum circuit parameters θ are updated using conventional gradient-based methods. In [40], the authors extend the DiffQAS framework to asynchronous QRL. This extension allows multiple parallel instances (a single instance is shown in Figure 6) to optimize their own structural weights (denoted as w in Figure 6) alongside the VQC parameters. The gradients of these structural weights and quantum circuit parameters are shared across instances to enhance the training process.

Fig. 6. Differentiable Quantum Architecture Search (DiffQAS).
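The differentiable search of [39] can be sketched as follows: structural weights α define a softmax distribution over candidate circuit architectures, and the training objective is the expectation of the per-architecture losses under that distribution, so gradients reach both α and the circuit parameters θ. The candidate_loss below is a placeholder for the RL loss (e.g., the DQN loss) of a sampled VQC, and the use of PyTorch for automatic differentiation is an illustrative choice, not the cited implementation.

import torch

n_arch = 4                                          # number of candidate architectures
alpha = torch.zeros(n_arch, requires_grad=True)     # structural weights
theta = torch.randn(n_arch, 8, requires_grad=True)  # circuit parameters per candidate

def candidate_loss(k):
    # Placeholder for the RL loss obtained when the k-th candidate VQC
    # is used as the Q-function (here a simple quadratic in its parameters).
    return (theta[k] ** 2).sum()

probs = torch.softmax(alpha, dim=0)                 # P(k, alpha)
loss = sum(probs[k] * candidate_loss(k) for k in range(n_arch))  # weighted loss
loss.backward()                                     # gradients flow to alpha and theta

print(alpha.grad, theta.grad[0])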
IV. QUANTUM RL APPLICATIONS AND CHALLENGES

QRL can be extended to multi-agent settings and applied in fields like wireless communication and autonomous control systems [41]. Additionally, as discussed in Section III-G, QAS involves sequential decision-making and can be addressed through RL. In [42], a QRL approach is developed to discover quantum circuit architectures that generate desired quantum states. In the NISQ era, a major challenge for QML applications is the limited quantum resources, which complicates both the training and inference phases. In [43], [44], the authors propose a method using a QNN to generate classical NN weights. For an N-qubit QNN, measuring the expectation values of individual qubits provides up to N values. However, collecting the probabilities of all computational basis states |00 · · · 0⟩, · · · , |11 · · · 1⟩ yields 2^N values. These values can be rescaled and used as NN weights. Thus, for an NN with M weights, only ⌈log2 M⌉ qubits are needed to generate the weights. Numerical simulations demonstrate that the quantum circuit can efficiently generate NN weights, achieving inference performance comparable to conventional training. Future research could further explore the trainability challenges in QRL models highlighted by Sequeira et al. [45], which are key to enhancing their practical performance.
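A small sketch of the counting argument behind [43], [44]: for a network with M weights, ⌈log2 M⌉ qubits suffice, because measuring all computational basis states of an N-qubit circuit yields 2^N probabilities that can be rescaled into M weight values. The random probability vector below is a placeholder for the output distribution of a trained QNN.

import numpy as np

M = 1000                                   # number of classical NN weights to generate
n_qubits = int(np.ceil(np.log2(M)))        # ceil(log2 M) qubits suffice
print(n_qubits)                            # -> 10, since 2**10 = 1024 >= 1000

# Placeholder for the basis-state probabilities measured from an N-qubit QNN;
# a trained circuit would produce a task-dependent distribution instead.
probs = np.random.rand(2 ** n_qubits)
probs /= probs.sum()

# Rescale the first M probabilities into NN weights, e.g. map them to [-1, 1].
p = probs[:M]
weights = 2 * (p - p.min()) / (p.max() - p.min()) - 1
print(weights.shape)                       # (1000,) classical weights from 10 qubits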
V. CONCLUSION AND OUTLOOK

This paper introduces the concept of quantum reinforcement learning (QRL), where variational quantum circuits (VQCs) are used as policy and value functions. It also explores advanced constructs, including quantum recurrent policies, quantum fast weight programmers, and QRL with differentiable quantum architectures. QRL holds the potential to offer quantum advantages in various sequential decision-making tasks.
REFERENCES

[1] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information. Cambridge University Press, 2010.
[2] K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke et al., "Noisy intermediate-scale quantum algorithms," Reviews of Modern Physics, vol. 94, no. 1, p. 015004, 2022.
[3] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, "Quantum circuit learning," Physical Review A, vol. 98, no. 3, p. 032309, 2018.
[4] S. Y.-C. Chen, C.-M. Huang, C.-W. Hsing, and Y.-J. Kao, "An end-to-end trainable hybrid classical-quantum classifier," Machine Learning: Science and Technology, vol. 2, no. 4, p. 045021, 2021.
[5] J. Qi, C.-H. H. Yang, and P.-Y. Chen, "QTN-VQC: An end-to-end learning framework for quantum neural networks," Physica Scripta, vol. 99, 2023.
[6] S. Y.-C. Chen, T.-C. Wei, C. Zhang, H. Yu, and S. Yoo, "Quantum convolutional neural networks for high energy physics data analysis," Physical Review Research, vol. 4, no. 1, p. 013231, 2022.
[7] S. Y.-C. Chen, S. Yoo, and Y.-L. L. Fang, "Quantum long short-term memory," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 8622–8626.
[8] S. Y.-C. Chen, D. Fry, A. Deshmukh, V. Rastunkov, and C. Stefanski, "Reservoir computing via quantum recurrent neural networks," arXiv preprint arXiv:2211.02612, 2022.
[9] S. S. Li, X. Zhang, S. Zhou, H. Shu, R. Liang, H. Liu, and L. P. Garcia, "PQLM - multilingual decentralized portable quantum language model," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[10] C.-H. H. Yang, J. Qi, S. Y.-C. Chen, Y. Tsao, and P.-Y. Chen, "When BERT meets quantum temporal convolution learning for text classification in heterogeneous computing," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 8602–8606.
[11] R. Di Sipio, J.-H. Huang, S. Y.-C. Chen, S. Mangini, and M. Worring, "The dawn of quantum natural language processing," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 8612–8616.
[12] J. Stein, I. Christ, N. Kraus, M. B. Mansky, R. Müller, and C. Linnhoff-Popien, "Applying QNLP to sentiment analysis in finance," in 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 2. IEEE, 2023, pp. 20–25.
[13] S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, "Variational quantum circuits for deep reinforcement learning," IEEE Access, vol. 8, pp. 141007–141024, 2020.
[14] S. Y.-C. Chen, C.-M. Huang, C.-W. Hsing, H.-S. Goan, and Y.-J. Kao, "Variational quantum reinforcement learning via evolutionary optimization," Machine Learning: Science and Technology, vol. 3, no. 1, p. 015025, 2022.
[15] O. Lockwood and M. Si, "Reinforcement learning with quantum variational circuit," in Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 16, no. 1, 2020, pp. 245–251.
[16] A. Skolik, S. Jerbi, and V. Dunjko, "Quantum agents in the gym: a variational quantum algorithm for deep Q-learning," Quantum, vol. 6, p. 720, 2022.
[17] S. Y.-C. Chen, "Quantum deep recurrent reinforcement learning," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[18] N. Meyer, C. Ufrecht, M. Periyasamy, D. D. Scherer, A. Plinge, and C. Mutschler, "A survey on quantum reinforcement learning," arXiv preprint arXiv:2211.03464, 2022.
[19] S. Y.-C. Chen, "Efficient quantum recurrent reinforcement learning via quantum reservoir computing," in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 13186–13190.
[20] A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner, "The power of quantum neural networks," Nature Computational Science, vol. 1, no. 6, pp. 403–409, 2021.
[21] M. C. Caro, H.-Y. Huang, M. Cerezo, K. Sharma, A. Sornborger, L. Cincio, and P. J. Coles, "Generalization in quantum machine learning from few training data," Nature Communications, vol. 13, no. 1, pp. 1–11, 2022.
[22] Y. Du, M.-H. Hsieh, T. Liu, and D. Tao, "Expressive power of parametrized quantum circuits," Physical Review Research, vol. 2, no. 3, p. 033125, 2020.
[23] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, "Evaluating analytic gradients on quantum hardware," Physical Review A, vol. 99, no. 3, p. 032331, 2019.
[24] V. Bergholm, J. Izaac, M. Schuld, C. Gogolin, C. Blank, K. McKiernan, and N. Killoran, "PennyLane: Automatic differentiation of hybrid quantum-classical computations," arXiv preprint arXiv:1811.04968, 2018.
[25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[28] S. Jerbi, C. Gyurik, S. Marshall, H. Briegel, and V. Dunjko, "Parametrized quantum policies for reinforcement learning," Advances in Neural Information Processing Systems, vol. 34, pp. 28362–28375, 2021.
[29] M. Kölle, M. Hgog, F. Ritz, P. Altmann, M. Zorn, J. Stein, and C. Linnhoff-Popien, "Quantum advantage actor-critic for reinforcement learning," arXiv preprint arXiv:2401.07043, 2024.
[30] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning. PMLR, 2016, pp. 1928–1937.
[31] S. Y.-C. Chen, "Asynchronous training of quantum reinforcement learning," Procedia Computer Science, vol. 222, pp. 321–330, 2023, International Neural Network Society Workshop on Deep Learning Innovations and Applications (INNS DLIA 2023). [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050923009365
[32] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune, "Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning," arXiv preprint arXiv:1712.06567, 2017.
[33] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[34] J. Schmidhuber, "Learning to control fast-weight memories: An alternative to dynamic recurrent networks," Neural Computation, vol. 4, no. 1, pp. 131–139, 1992.
[35] J. Schmidhuber, "Reducing the ratio between learning complexity and number of time varying variables in fully recurrent nets," in ICANN'93: Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, The Netherlands, 13–16 September 1993. Springer, 1993, pp. 460–463.
[36] S. Y.-C. Chen, "Learning to program variational quantum circuits with fast weights," arXiv preprint arXiv:2402.17760, 2024.
[37] L. Ding and L. Spector, "Evolutionary quantum architecture search for parametrized quantum circuits," in Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2022, pp. 2190–2195.
[38] S.-X. Zhang, C.-Y. Hsieh, S. Zhang, and H. Yao, "Differentiable quantum architecture search," Quantum Science and Technology, vol. 7, no. 4, p. 045023, 2022.
[39] Y. Sun, Y. Ma, and V. Tresp, "Differentiable quantum architecture search for quantum reinforcement learning," in 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 2. IEEE, 2023, pp. 15–19.
[40] S. Y.-C. Chen, "Differentiable quantum architecture search in asynchronous quantum reinforcement learning," arXiv preprint arXiv:2407.18202, 2024.
[41] C. Park, W. J. Yun, J. P. Kim, T. K. Rodrigues, S. Park, S. Jung, and J. Kim, "Quantum multiagent actor-critic networks for cooperative mobile access in multi-UAV systems," IEEE Internet of Things Journal, vol. 10, no. 22, pp. 20033–20048, 2023.
[42] S. Y.-C. Chen, "Quantum reinforcement learning for quantum architecture search," in Proceedings of the 2023 International Workshop on Quantum Classical Cooperative, 2023, pp. 17–20.
[43] C.-Y. Liu, E.-J. Kuo, C.-H. A. Lin, J. G. Young, Y.-J. Chang, M.-H. Hsieh, and H.-S. Goan, "Quantum-Train: Rethinking hybrid quantum-classical machine learning in the model compression perspective," arXiv preprint arXiv:2405.11304, 2024.
[44] C.-Y. Liu, C.-H. A. Lin, C.-H. H. Yang, K.-C. Chen, and M.-H. Hsieh, "QTRL: Toward practical quantum reinforcement learning via quantum-train," arXiv preprint arXiv:2407.06103, 2024.
[45] A. Sequeira, L. P. Santos, and L. Soares Barbosa, "Trainability issues in quantum policy gradients," Machine Learning: Science and Technology, 2024.
