
Artificial Intelligence for Operations Research: Revolutionizing the Operations Research Process



Zhenan Fan, Bissan Ghaddar, Xinglu Wang, Linzi Xing, Yong Zhang, Zirui Zhou

arXiv:2401.03244v1 [math.OC] 6 Jan 2024

Abstract
The rapid advancement of artificial intelligence (AI) techniques has opened up new opportunities to revolutionize various fields, including operations research (OR). This survey paper explores
the integration of AI within the OR process (AI4OR) to enhance its effectiveness and efficiency
across multiple stages, such as parameter generation, model formulation, and model optimization.
By providing a comprehensive overview of the state-of-the-art and examining the potential of AI
to transform OR, this paper aims to inspire further research and innovation in the development of
AI-enhanced OR methods and tools. The synergy between AI and OR is poised to drive significant
advancements and novel solutions in a multitude of domains, ultimately leading to more effective
and efficient decision-making.

Keywords. Decision analysis, Artificial Intelligence, Operations Research, Modeling, Algorithm selection, Optimization, Machine Learning

1 Introduction
Operations Research (OR) is an interdisciplinary field that employs advanced analytical techniques
and methodologies to support decision-making processes in organizations, aiming to improve efficiency,
optimize resource allocation, and achieve desired objectives. By leveraging mathematical models, op-
timization algorithms, simulation, and statistical methods, OR aids in addressing complex problems
in various domains, including logistics, supply chain management, transportation, energy, manufac-
turing, finance, healthcare, and public services, among others (Winston and Goldberg 2004). The
general framework for operations research includes the following steps (Rajgopal 2004):

• Problem identification and definition: The initial step requires a thorough understanding of
the problem, including its context, objectives, and relevant constraints. This involves engaging
with stakeholders to clarify their expectations, goals, and requirements, as well as identifying
any trade-offs or conflicting priorities that may arise during the decision-making process. A
well-defined problem statement provides a solid foundation for the subsequent stages of the OR
process.

• Parameter Generation: The next step involves generating key parameters in the optimization
model, such as the objective coefficients and the constraint matrix. In practice, we might have
relevant data from various sources, such as historical records, expert opinions, market research,
or sensor readings. The collected data may need preprocessing, such as cleaning, normalization,
aggregation, or transformation, to ensure its quality and suitability for the subsequent modeling
stage. These data are then converted to modeling parameters by human experts or AI models.
For example, in a supply chain planning problem, although the expert has not yet written down
explicitly the objective and constraint equations, they can first decide on key parameters such
as supply, demand, and profit according to their understanding of the problem and analysis of
data.

The authors are listed in alphabetical order

• Model formulation: With a clear understanding of the problem and the corresponding key
parameters, a mathematical or simulation model is developed to represent the system under
investigation. Models are simplifications of reality, designed to capture the essential elements
of the problem while abstracting away unnecessary details. Depending on the problem’s nature
and requirements, various modeling techniques can be employed, including linear/non-linear
programming, integer programming, stochastic programming, queuing theory, network models,
game theory, or agent-based simulations, among others. Before applying the model in prac-
tice, its validity must be assessed to ensure that it accurately captures the problem’s essential
characteristics and adheres to the system’s logical and physical constraints.

• Model optimization: Once the model has been validated, it is solved and analyzed using various
techniques to identify optimal or near-optimal solutions that satisfy the problem’s objectives and
constraints. This may involve employing optimization algorithms, heuristics, metaheuristics, or
simulation-based methods, depending on the model’s complexity and the desired level of solution
quality.

• Interpretation and validation: In the final stage of the OR process, the implemented solution
is reviewed and evaluated to determine its effectiveness in meeting the desired objectives. If
necessary, the model may be updated, and the process iterated to further refine the solution
and enhance the system’s performance. This step embodies the iterative and ongoing nature of
operations research as a tool for continuous improvement.

While these steps have proven effective in the past, recent advances in artificial intelligence (AI)
are poised to revolutionize the way we approach and solve OR problems. AI techniques have the
potential to enhance every stage of the OR process, facilitating the development of more accurate
and efficient models and offering innovative solutions to complex problems. In this survey paper, we
explore AI4OR, i.e., how AI can be combined with OR, with a focus on three key aspects: parameter
generation, model formulation, and model optimization.
First, in the parameter generation phase, AI can be employed to improve the quality and rele-
vance of data used to formulate the mathematical models. One promising approach is the predict-
then-optimize framework of Elmachtoub and Grigas (2022), which leverages AI techniques to make
data-driven predictions about uncertain variables in the decision-making process. By incorporating
advanced AI algorithms, such as deep learning and reinforcement learning, predict-then-optimize can
efficiently handle high-dimensional and complex data structures, as well as adapt to dynamic envi-
ronments. Furthermore, AI-powered feature selection and dimensionality reduction techniques can be
utilized to identify the most important variables and relationships within the data, leading to more
parsimonious and interpretable models.
Second, in the model formulation phase, AI can be used to bridge the gap between natural lan-
guage problem description and mathematical models. This is particularly relevant in situations where
domain experts may struggle to translate their knowledge into mathematical terms. Large language
models, such as ChatGPT (Brown et al. 2020) and Llama (Touvron et al. 2023), have shown remark-
able success in understanding and generating natural language and can be employed to automatically
convert problem descriptions into mathematical formulations (Bubeck et al. 2023, Ramamonjison
et al. 2023). By harnessing the power of natural language processing (NLP) and AI, these models
can extract relevant information, identify key constraints and objectives, and translate them into a
mathematical representation suitable for optimization.
Third, in the model optimization phase, AI techniques can be exploited to enhance the perfor-
mance of optimization algorithms, shifting away from traditional methods towards a more adaptive,
learning-driven approach. Classic optimization methods, such as gradient descent, conjugate gradi-
ent, Newton steps, and branch-and-bound, are constructed based on theoretical foundations and the
implementation of optimization experts. While these methods offer performance guarantees, they
may not always be the most efficient or effective solutions for specific problem instances. This survey
paper focuses on three main categories in this direction: automatic algorithm configuration, contin-
uous optimization algorithm selection and design, and discrete optimization algorithm selection and
design. A brief description of these three categories is presented next:

1. Automatic Algorithm Configuration: This category encompasses techniques that use AI to
fine-tune the parameters of existing optimization algorithms, enabling better performance on
specific problem instances. Methods such as Bayesian optimization, genetic algorithms, and
reinforcement learning can be employed to intelligently search the parameter space and identify
configurations that yield improved performance (Ansótegui et al. 2009, Lindauer et al. 2022,
Anastacio and Hoos 2020a).
2. Continuous Optimization Algorithm Selection and Design: AI techniques are being employed to
enhance the optimization algorithm that solves problems with continuous variables. Techniques
like learning to optimize (Wichrowska et al. 2017), the adaptive penalty for ADMM (Zeng et al.
2022), and smart column selection (Chi et al. 2022) can be utilized to dynamically determine
the most appropriate step size, balance exploration and exploitation trade-offs, and accelerate
the optimization process.
3. Discrete Optimization Algorithm Selection and Design: This category focuses on the application
of AI techniques in optimization problems with discrete variables, such as those encountered in
combinatorial optimization problems. AI-driven heuristics (Di Liberto et al. 2016), metaheuris-
tics (Talbi 2009), and learning-based approaches (Gomory 1960) can be employed to enhance
algorithms like branch-and-bound and cutting-plane methods, and improve solving mixed inte-
ger programming problems.
Our survey covers important stages in the pipeline of OR and investigates how AI can assist each
stage of the pipeline, which offers a holistic perspective and allows us to gain valuable insights into
the potential synergies between AI and OR. Our key contributions involve two aspects. First, we
comprehensively examine the different components of the OR pipeline while existing surveys focus
solely on specific aspects. For instance, Bengio et al. (2021) focus on the integration of AI with
combinatorial optimization problems, enabling autonomous learning and decision-making on a chosen
specific set of problems. Zhang et al. (2022) review how AI assists Mixed Integer Programming (MIP)
algorithms including branch-and-bound and heuristic methods. Lodi and Zarpellon (2017) focus on
the AI-enhanced variable and node selection in the branch-and-bound algorithm for MIPs. Schede
et al. (2022) survey automated algorithm configuration methods. Kotary et al. (2021) survey two
directions that leverage AI for constrained optimization (CO): AI-augmented CO, which enhances
optimization algorithms with AI assistance, and End-to-End CO learning, where machine learning
directly predicts the solution of CO. Additionally, some topics in our survey have not been discussed
in existing surveys, for example, model formulation (see Section 4) and enhancements of specific
algorithms like ADMM and column generation (see Sections 6.2 and 6.3).
Second, we analyze the pipeline as a whole and emphasize the interactions between different
components. For example, the interaction between mathematical model parameter generation and
optimization has attracted increased attention in the recent literature. Compared with the traditional
predict-then-optimize paradigm that isolates the two components, recent approaches started to explore
their interactions. The smart predict-then-optimize paradigm by Elmachtoub and Grigas (2022)
and Amos and Kolter (2017) gathers feedback from later decision errors to refine the prediction of
parameters. The “integrated prediction and optimization” paradigm presented by Bertsimas and
Kallus (2020), Maragno et al. (2021) and Bergman et al. (2021) presents another possible interaction.
Recall that the traditional predict-then-optimize paradigm involves first predicting the mathematical
model parameters and then deriving an optimal decision. Thus, the parameter is not dependent on
the decision. In contrast, the new paradigm allows the parameters to be affected by the decision, i.e.,
the parameter generation takes the future decision into account.
We survey this emerging direction, describe existing works (See Section 3), and envision other
potential interactions between different components within the optimization pipeline (See Section
8). It is essential to clarify that our survey’s scope is limited to methods that involve the use of
optimization software. There is a subset of real-world operational problems that do not necessitate
optimization software, e.g., unconstrained problems, time series prediction, or classification. For these
problems, end-to-end AI solutions are possible (Kraus et al. 2020, Kotary et al. 2021, Zhang et al.
2023, Yan et al. 2020, Guo et al. 2019). However, this survey does not cover these problems and their
end-to-end AI solutions. Kraus et al. (2020) review the aforementioned problems, like predicting

the movements of stocks, from an operational point of view and how AI will achieve high prediction
performance. Considering that AI solutions are often a black box, De Bock et al. (2023) discuss how
explainability and ethical considerations can be taken into account in AI solutions
alongside performance. Furthermore, while our focus is on how AI can boost operations research,
it is well known that machine learning itself is deeply rooted in mathematical optimization. Gambella
et al. (2021) survey mathematical optimization models presented in various AI algorithms, such as
classification, clustering, deep learning, and Bayesian network structure learning.
In summary, operations research has benefited from advancements in various algorithms (see
Table 1), computing power, as well as commercial and open-source solvers. Solvers like Gurobi (2022),
CPLEX (2009), and OptVerse (Huawei 2021) are generally applicable to various real-world applica-
tions and can solve large-scale problems efficiently. Meanwhile, the advances in AI led to a ground-
breaking shift in the development of optimization algorithms. Instead of relying on structured algo-
rithmic development, AI techniques learn from data (e.g., past solving experience), enhance existing
methods or even create entirely novel solution methods. For specific classes of optimization problems,
integration of artificial intelligence into operations research can even achieve better performance than
existing solvers (Song et al. 2020, Gasse et al. 2019, Khalil et al. 2016, Jung et al. 2022, Hutter et al.
2010, 2009). AI4OR promises to bring about a new era of innovation and efficiency. This survey
paper will delve into the various AI techniques that can be employed at each stage of the OR pro-
cess, providing a comprehensive overview of the state-of-the-art and exploring the potential of AI
to transform the way we approach and solve complex decision-making problems. As we continue to
develop and refine AI technologies, the synergy between AI and OR will undoubtedly lead to exciting
advancements and novel solution methods in a multitude of domains. An overview illustrating how
AI techniques help each step in the OR pipeline is shown in Figure 1.

Algorithm | Target problem | Assumption
Gradient-based methods | $\min_{x \in \mathbb{R}^n} f(x)$ | $f$ is differentiable
ADMM-type methods | $\min_{x \in \mathbb{R}^n,\, s \in \mathbb{R}^m} f(x) + g(s)$ s.t. $Ax + s = b$ | $f$ and $g$ are continuous
Simplex method | $\min_{x \in \mathbb{R}^n} c^\top x$ s.t. $Ax = b,\ x \ge 0$ | $A$ has full rank
Column-generation method | $\min_{x \in \mathbb{R}^{|\Omega|}} \sum_{p \in \Omega} c_p x_p$ s.t. $\sum_{p \in \Omega} x_p a_p = b,\ x \ge 0$ | $|\Omega|$ is very large
Cutting-plane method | $\min_{x \in \mathbb{Z}^n} c^\top x$ s.t. $Ax = b,\ x \ge 0$ | $A$ has full rank
Branch-and-bound method | $\min_{x \in \mathcal{X}} f(x)$ | $\mathcal{X}$ is finite

Table 1: Summary of involved algorithms and their target optimization problems

2 Preliminary: AI techniques
There are two main challenges in operations research, and AI techniques have the potential to address
them. We start by specifically introducing these two challenges:
1) The complex interaction among decision variables and constraints (Cappart et al. 2021, Steever
et al. 2019): Simple linear algebra-based heuristics may be able to identify that the constraint matrix
has a block structure; in that case, the original problem can be decomposed into several subproblems,
which accelerates solving the optimization problem. However, when the interaction within the constraint
matrix is more complex, e.g., more variables or constraints are coupled with each other, simple
heuristics are either not applicable or worsen the computational performance. It is common that
a group of optimization problems share certain characteristics that cannot be described (or easily
obtained) mathematically. In such situations, AI tools are more applicable and efficient.
2) The computation cost of solving complex optimization problems (Krentel 1986): Many interest-
ing and practical optimization problems are NP-hard. Even for polynomial-time solvable problems,
e.g., LP and QP, when the problems scale up, the computational time grows quickly. Thanks to
advances in hardware, software, and optimization solvers, many optimization problems can now be
solved within acceptable time frames. However, due to the increase in data availability, optimization
problems often continue to grow in scale and complexity, surpassing the capabilities of existing
solvers for real-world applications. Thus, we need to harness AI techniques to further accelerate
and improve the efficiency of solving OR models.

Figure 1: AI techniques in the OR process
In the following, we present brief intuitions about some of the AI techniques frequently used in
OR and why they are helpful in addressing the challenges. We will focus on two key aspects of AI
techniques: 1) the AI models themselves and 2) the learning algorithms of these models.
1) Commonly used AI models include Graph Neural Networks (GNN, Section 2.1) and Recurrent
Neural Network (RNN, Section 2.2). GNN excels at handling the complex interactions in graphs, such
as the real-world graphs represented in the parameter generation stage (Elmachtoub and Grigas 2022)
and the equivalent graph representation of LP, QP, or MILP problems in the AI-driven optimization
stage. RNN is capable of retaining information from previous time steps. When applying RNN to an
iterative algorithm, a time step corresponds to an iteration. Many iterative optimization algorithms
are typically slow because they evaluate one candidate solution per iteration and must determine which
solutions to assess next. An inefficient or suboptimal choice can hinder the optimization progress.
Given RNN’s proficiency with sequential information (e.g., previous algorithm decisions and states),
it is a helpful tool in such situations.
2) Prominent learning algorithms include Reinforcement learning (see Section 2.3) and Imitation
Learning (see Section 2.4). Reinforcement learning offers a solution to the challenge of high computa-
tional cost. As previously mentioned, this cost often stems from an inefficient or suboptimal decision
made in early iterations. To tackle this, two more challenges appear: i) The metrics, like execution
time or duality gap, are non-differentiable w.r.t the decisions. ii) The need for a learning algorithm that
permits long-term rewards to influence earlier decisions, termed as delayed rewards. Reinforcement
learning effectively meets both these criteria. Imitation Learning is another strategy for the challenge
of high computational cost. By design, imitation learning emulates expert behaviour. In optimization
fields, these "experts" often represent computationally intensive methods well-documented in the lit-
erature. Instead of traditional input-label pairs as in supervised learning, imitation learning utilizes
state-action pairs derived from expert demonstrations. The learning objective becomes the alignment
of the model’s predictions with the expert behaviour, which is differentiable. Upon completion, imi-
tation learning yields a model that rapidly predicts expert behaviour, thereby effectively tackling the
computational challenge.
In the subsequent portion of this section, we will give a detailed overview of these four AI tech-
niques.

Figure 2: Graph representation for linear program.

2.1 Graph Neural Networks


Graph Neural Networks (GNNs) are a class of neural networks designed for processing graph-structured
data (Zhou et al. 2020, Wu et al. 2020, Xu et al. 2019, Veličković 2023) . Consider a graph G = (V, E),
where V = {v1 , . . . , vN } is the set of nodes and E ⊆ V × V is the set of edges. Each node vi is as-
sociated with a feature vector xi ∈ Rd . GNN aims to learn a mapping that maps the graph G and
its node features to a set of output vectors y1 , . . . , yN , where yi ∈ Rm . GNN typically consists of
multiple layers, and the node representations are updated at each layer. Let $h_i^{(l)}$ be the hidden rep-
resentation of node $v_i$ at the $l$-th layer of the GNN. The initial node representations are given by the
input features: $h_i^{(0)} = x_i$. The GNN updates the node representations in each layer as follows:

$$h_i^{(l+1)} = \phi^{(l)}\Big(h_i^{(l)}, \sum_{v_j \in N(v_i)} \psi^{(l)}\big(h_i^{(l)}, h_j^{(l)}\big)\Big),$$

where $\phi^{(l)}$ and $\psi^{(l)}$ are neural network functions with learnable parameters, and $N(v_i)$ is the set of
neighbors of node $v_i$ in the graph. The updated node representations are then used to compute the
output vectors:

$$y_i = \omega\big(h_i^{(L)}\big),$$

where $L$ is the number of layers in the GNN, and $\omega$ is a neural network function with learnable
parameters. The parameters of the GNN are given by $\theta = \{\theta^{(1)}, \ldots, \theta^{(L)}, \omega_\theta\}$, where $\theta^{(l)} = \{\phi_\theta^{(l)}, \psi_\theta^{(l)}\}$ is the parameter for the $l$-th layer.

Figure 3: Graph representation for quadratic programs.

In summary, GNN processes graph-structured data by updating the node representations through
multiple layers using neural network functions. The GNN is defined by a set of learnable parameters
and is trained by minimizing a loss function that measures the difference between the predicted output
vectors and the ground truth output vectors using an optimization algorithm. As shown in Figures 2
and 3 (these figures are adapted and simplified from Jung et al. (2022), Fan et al. (2023) and Gasse
et al. (2019)), optimization problems like linear programming and quadratic programming can be
equivalently transformed into a graph. This transformation is beneficial since the complex interactions
between constraints and variables are expressed, and the permutation invariance property is preserved
(Veličković 2023, Wu et al. 2020). Specifically, in the context of optimization, permutation invariance
implies that rearranging the order of constraints or variables does not affect the problem’s equivalence
or the solution’s validity. This property is retained in the graph representation, i.e., permuting the
constraints and variables results in an equivalent graph and does not impact the prediction made
by GNN. For representing mixed integer programming, we can simply add another attribute to the
variable nodes denoting whether the variables are integer or continuous. More complex features like
the sparsity of columns can also be added (Fan et al. 2023). Thus, GNN is useful for representing
optimization problems and providing feature representation when we want to accelerate the model
optimization process with AI techniques.
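To make this graph representation concrete, the following minimal Python sketch (not taken from the surveyed papers; the problem data and feature choices are purely illustrative) builds the bipartite constraint-variable graph of a small LP and performs one hand-coded mean-aggregation message-passing round; a trained GNN would replace the coefficient-weighted identity messages used here with the learnable functions φ and ψ.

import numpy as np

# Toy LP:  min c^T x  s.t.  A x <= b,  x >= 0
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # constraint matrix (2 constraints, 2 variables)
b = np.array([8.0, 9.0])     # right-hand sides -> features of constraint nodes
c = np.array([1.0, 2.0])     # objective coefficients -> features of variable nodes

n_cons, n_var = A.shape
h_cons = b.reshape(-1, 1)    # initial hidden states of constraint nodes
h_var = c.reshape(-1, 1)     # initial hidden states of variable nodes

# One edge per nonzero A[i, j], carrying the coefficient as the edge feature.
edges = [(i, j, A[i, j]) for i in range(n_cons) for j in range(n_var) if A[i, j] != 0]

def aggregate(h_src, h_dst, reverse=False):
    # Mean aggregation of coefficient-weighted messages from neighbouring nodes.
    out = np.zeros_like(h_dst)
    count = np.zeros(len(h_dst))
    for i, j, a_ij in edges:
        src, dst = (j, i) if reverse else (i, j)
        out[dst] += a_ij * h_src[src]
        count[dst] += 1
    return out / np.maximum(count, 1).reshape(-1, 1)

h_var_next = aggregate(h_cons, h_var)                 # messages: constraints -> variables
h_cons_next = aggregate(h_var, h_cons, reverse=True)  # messages: variables -> constraints
print(h_var_next.ravel(), h_cons_next.ravel())

Because nodes are aggregated over unordered neighbour sets, permuting the rows or columns of A permutes the node states in exactly the same way, which is the permutation invariance property discussed above.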

2.2 Recurrent Neural Networks


Recurrent Neural Networks (RNNs) are a class of neural networks designed for processing sequential
data (Medsker and Jain 2001, Yu et al. 2019, Staudemeyer and Morris 2019). Let x1 , . . . , xT be a
sequence of input vectors, where T is the length of the sequential data and xt ∈ Rd represents the
input vector at t step in the sequence. The RNN consists of neural network functions f (θf ; ·) and
g(θg ; ·), such that we have
$$h_t = f(\theta_f; x_t, h_{t-1}), \qquad y_t = g(\theta_g; h_t), \qquad (1)$$
where parameter θ = {θf , θg } are the learnable parameters and ht is the hidden state at time step t.
For standard RNN, f and g can be Multilayer Perceptron (MLP) models or even simply a single
matrix multiplication operation. However, standard RNNs face a significant challenge with the vanish-
ing gradient problem, which makes them struggle to capture long-term dependencies in the sequential
data. To illustrate, consider the case that function f is a matrix multiplication operation. The initial
input information x0 is subjected to repeated multiplication with the same matrix θf over successive
time steps. The cumulative effect of this iterative multiplication can cause the contribution of x0
towards ht to decrease geometrically, depending on the eigenvalues of the matrix θf . This reduction of
influence shows that RNN could ‘forget’ distant past information. To overcome this limitation, Long
Short-Term Memory Networks (LSTMs) were introduced (Yu et al. 2019, Staudemeyer and Morris
2019). LSTMs are a special kind of RNN. The functions f and g are structured differently to include a
gating mechanism. This gating mechanism is designed to control and manage the flow of information
within the neural network, enabling LSTMs to retain information for much longer periods of time and
to better capture long-term dependencies.
Ideally, the sequence y 1 , . . . , y T generated by the RNN with parameter θ should match the ground-
truth sequence z 1 , . . . , z T annotated by human experts. The exact match is challenging and unnec-
essary, but it motivates us to adopt the difference between these two sequences as the loss function
L(θ) for training an RNN model.
$$L(\theta) = \frac{1}{T} \sum_{t=1}^{T} \ell(y_t, z_t), \quad \text{where } y_t \text{ depends on } \theta.$$

ℓ(·, ·) is some user-specified function measuring the difference between y t and z t . The RNN is trained
by minimizing the loss function L(θ) with respect to the learnable parameters θ using an optimization
algorithm, such as stochastic gradient descent or one of its variants. Since θf and θg are reused at
each step t in Equation (1), the optimization process typically involves backpropagation through time

(Medsker and Jain 2001), which is an extension of the standard backpropagation algorithm to handle
recurrent neural networks.
In summary, an RNN consists of a sequence of input vectors, hidden states, and output vectors.
The network is defined by a set of learnable functions. The RNN is trained by minimizing a loss
function that measures the difference between the predicted output sequence and the ground truth
output sequence using an optimization algorithm. Due to RNN’s ability to operate on sequential
information, it is useful for accelerating iterative optimization algorithms, where historical states and
decisions matter for the current decision.
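As a minimal sketch (with toy dimensions and random data standing in for a real annotated sequence), the following PyTorch snippet trains a vanilla RNN on a per-step labelling task; calling loss.backward() performs backpropagation through time automatically.

import torch
import torch.nn as nn

T, d, m, hidden = 10, 4, 3, 16           # sequence length, input dim, output dim, hidden size
x = torch.randn(1, T, d)                 # one input sequence x_1, ..., x_T
z = torch.randint(0, m, (1, T))          # ground-truth labels z_1, ..., z_T

rnn = nn.RNN(input_size=d, hidden_size=hidden, batch_first=True)   # plays the role of f
readout = nn.Linear(hidden, m)                                      # plays the role of g
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.05)
loss_fn = nn.CrossEntropyLoss()          # the per-step loss l(y_t, z_t)

for step in range(100):
    h, _ = rnn(x)                        # hidden states h_t for all time steps, shape (1, T, hidden)
    y = readout(h)                       # predicted outputs y_t for every time step
    loss = loss_fn(y.view(T, m), z.view(T))   # average of l(y_t, z_t) over t
    optimizer.zero_grad()
    loss.backward()                      # backpropagation through time
    optimizer.step()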

2.3 Reinforcement Learning


Reinforcement learning is an AI technique in which an agent learns to make decisions through inter-
actions with its environment. The agent seeks to maximize the cumulative reward it receives over
time. The problem is typically formulated as a Markov Decision Process (MDP), defined by the tuple
(S, A, P, R, γ), where:

• S = {s1 , · · · , sn } is the set of states.

• A = {a1 , · · · , an } is the set of actions.
• P (st+1 |st , at ) is the state transition probability function, describing the probability of transi-
tioning from state st to state st+1 after taking action at .
• R(st , at , st+1 ) is the reward function, providing the immediate reward received by the agent after
taking action at in state st and transitioning to state st+1 .
• γ ∈ [0, 1] is the discount factor, which determines the importance of future rewards relative to
immediate rewards.

A policy, denoted as π(a|s), is a probability distribution over actions given the current state. The
goal of reinforcement learning is to find an optimal policy π ∗ that maximizes the expected return
from any initial state. To understand this objective, we first introduce the discounted return at the
time step t:

$$G(s_t, a_t) = \sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}, s_{t+k+1})$$
Given the stochastic nature of state transitions, starting from an initial state st can lead to different
possible future states. In other words, {st+1 , st+2 , . . .} and G(st , at ) are all random variables. There-
fore, the goal is actually to maximize the expected value of the discounted return from any state s,
commonly referred to as expected return, which can be formally defined as:

$$V^\pi(s) = \mathbb{E}_\pi\big[\, G(s_0, a_0) \mid s_0 = s \,\big] \qquad (2)$$

Here, a0 is also a random variable decided by π(a0 |s0 ). This expected return from state s measures
the quality of state s. Thus, V π (s) is also called the state value function.
To find the optimal policy, there are two streams of methods: 1) value function-based methods and
2) policy optimization. In the value function-based method, a commonly used tool is the state-action
value function, denoted as Qπ (s, a). Formally, this function is defined as follows:

$$Q^\pi(s, a) = \mathbb{E}_\pi\big[\, G(s_0, a_0) \mid s_0 = s,\ a_0 = a \,\big]$$

This function serves to quantitatively evaluate the quality of taking action a in state s. Given Qπ (s, a),
the optimal policy is to choose the action given by $\arg\max_a Q^\pi(s, a)$.
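As a minimal sketch of a value function-based method (a toy tabular MDP with a hypothetical transition function env_step; none of this comes from the surveyed works), tabular Q-learning estimates the state-action values and then acts greedily via arg max_a Q(s, a):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def env_step(s, a):
    # Hypothetical environment: returns (next_state, reward); replace with a real MDP.
    s_next = (s + a) % n_states
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

s = 0
for t in range(1000):
    # Epsilon-greedy exploration around the current greedy action.
    a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
    s_next, r = env_step(s, a)
    # Temporal-difference update toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

greedy_policy = np.argmax(Q, axis=1)     # the policy arg max_a Q(s, a)
print(greedy_policy)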
Policy optimization aims to find the optimal policy π ∗ that maximizes the expected return given
in Equation (2). One common approach is to parameterize the policy using a function approximator,
such as a neural network, with learnable parameters θ. The policy is then represented as πθ (a|s). Now
the expected return objective V πθ (s) is parameterized by θ. For simplicity, we denote this objective
as J(θ) = V πθ (s).

The policy gradient method is a popular approach to optimize the policy parameters θ. It com-
putes the gradient of the objective function with respect to the policy parameters and updates the
parameters using gradient ascent. The policy gradient can be expressed as:
" T #
X
∇θ J(θ) = Eπθ ∇θ log πθ (at |st )G(st , at ) . (3)
t=0

The policy parameters θ are updated using a gradient-based optimization algorithm, such as
stochastic gradient ascent or an adaptive optimization method like Adam (Ruder 2016, Kingma and
Ba 2015):

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta),$$
where α is the learning rate.
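The following minimal sketch (a hypothetical rollout with random placeholder rewards; not tied to any surveyed system) implements the REINFORCE estimator of the policy gradient in Equation (3) with PyTorch: the log-probabilities of the sampled actions are weighted by the discounted returns that follow them, and a gradient step is taken on the resulting surrogate loss.

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def sample_episode(T=20):
    # Hypothetical rollout: random states and rewards stand in for a real environment.
    states = torch.randn(T, 4)
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    rewards = torch.rand(T)                     # placeholder rewards R(s_t, a_t, s_{t+1})
    return states, actions, rewards

states, actions, rewards = sample_episode()

# Discounted returns G(s_t, a_t) = sum_k gamma^k R_{t+k}, computed backwards in time.
returns = torch.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * returns).sum()             # negative so that a descent step performs gradient ascent on J(theta)
optimizer.zero_grad()
loss.backward()
optimizer.step()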
A number of improving techniques have been developed to enhance the basic policy gradient
method. Their purpose is to stabilize the learning process, improve convergence, and boost overall
performance. These are a few popular approaches:

• Trust region: The principle is to maintain a “trust region” for the policy that the agent outputs
(Schulman et al. 2015). In essence, the agent creates a new policy that isn’t too deviant from the
old policy, confining the updates within a predefined boundary or “trust region.” This technique
is commonly used when stability in the learning process is crucial.

• Surrogate objective: This strategy substitutes the original objective function with a surrogate
function, as in the PPO algorithm (Byun et al. 2020). This method is less complex to implement
compared to the trust region approach, and it also promotes stable training.

• Value function approximations: In this strategy, value function approximations are incorpo-
rated as a baseline to enhance policy optimization (Mnih et al. 2016, Babaeizadeh et al. 2017,
Haarnoja et al. 2018). This method capitalizes on both the learnable value function and policy
optimization process, usually resulting in improved empirical performance.

In conclusion, reinforcement learning equips an agent with a reward that doesn’t necessarily need
to be differentiable regarding the agent’s decisions. Moreover, the agent will not be short-sighted since
the learning objective is maximizing cumulative rewards. Given the continuous advancements in the
state of the art, reinforcement learning is becoming increasingly stable and applicable to real-world
scenarios.

2.4 Imitation learning


Imitation learning is a type of learning algorithm where an agent learns to make decisions by observing
demonstrations provided by an expert (Hussein et al. 2017). The goal is to learn a policy that
mimics the expert’s behaviour as closely as possible. This approach is particularly useful in scenarios
where reinforcement learning methods face challenges, such as when designing an appropriate reward
function is difficult, when environmental interactions are costly, or when the learning process requires a
large number of samples. There are several methods for imitation learning, including behavior cloning
by Codevilla et al. (2019) and Torabi et al. (2018) and interactive imitation learning by (Ross and
Bagnell 2010). Behavior cloning is a simple and widely used imitation learning approach (Codevilla
et al. 2019, Torabi et al. 2018). Given a dataset of expert demonstrations $D = \{(s_i, a_i)\}_{i=1}^{N}$, where $s_i$
is a state and $a_i$ is the corresponding expert action, the goal is to learn a policy $\pi_\theta(a|s)$ that maps
states to actions. The policy is usually represented by a function approximator, such as a neural
network, with learnable parameters θ. The learning problem is formulated as a supervised learning
task, minimizing the loss function L(θ):
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(a_i, \pi_\theta(a \mid s_i)\big),$$

Figure 4: Illustration of how AI models generate parameters and interact with the optimization model.

where ℓ is a distance or divergence measure between the expert action ai and the predicted action
πθ (a|si ). Common choices for ℓ include the mean squared error for continuous actions or the cross-
entropy loss for discrete actions.
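A minimal behaviour-cloning sketch in PyTorch (random tensors stand in for the expert demonstrations D; dimensions are illustrative) makes the supervised nature of the approach explicit:

import torch
import torch.nn as nn

N, state_dim, n_actions = 256, 6, 4
expert_states = torch.randn(N, state_dim)            # s_i from the demonstrations D
expert_actions = torch.randint(0, n_actions, (N,))   # a_i from the demonstrations D

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                       # the per-pair loss l(a_i, pi_theta(a|s_i))

for epoch in range(50):
    logits = policy(expert_states)
    loss = loss_fn(logits, expert_actions)            # L(theta) averaged over the dataset
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()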
Despite its simplicity, behaviour cloning can suffer from “cascading errors” (Ross and Bagnell
2010). This happens because the model’s training data consists of sequences of actions and states
from the expert. If, at any point, the model deviates from the expert’s actions, it may find itself in
a state that it has never seen during training. This can lead to incorrect actions in the next step,
which in turn leads to more unfamiliar states, causing the errors to cascade. To address this issue,
Ross and Bagnell (2010) propose interactive imitation learning. After the model takes a new action,
the pair of the current state and expert action (if it exists) is added to the demonstrations D. Although
the annotation burden on the expert increases, the model learns to handle states deviating from
the original expert’s actions, helping to correct its mistakes and prevent cascading errors. Imitation
learning is closely related to reinforcement learning, as both aim to learn policies that maximize some
objective. While RL seeks to maximize the cumulative reward directly, imitation learning aims to
learn from expert demonstrations.
Imitation learning is often considered more “sample efficient,” i.e., it learns effectively from a
smaller number of expert demonstrations. This is because imitation learning leverages the knowledge
of an expert, bypassing the need for the extensive “trial-and-error” process that characterizes RL.
This trial-and-error process involves the agent making numerous attempts at a task, learning from its
mistakes and successes, which can be time-consuming and resource-intensive. However, the perfor-
mance of imitation learning is bounded by the capabilities of the expert. For instance, if the expert
prioritizes short-term benefits, the agent trained through imitation learning may also be short-sighted.
In such cases, RL is better suited because it allows the agent to learn from its own experiences, rather
than strictly following an expert. This means that the agent can adapt based on the consequences of
its own actions, taking into account both immediate and future consequences.
In conclusion, the presented techniques have specific benefits for operational research. Each algo-
rithm has its own suitable application scenario, as we will see in the coming sections.

3 Model Parameter Generation


As we discussed before, parameter generation is a crucial step of the operations research procedure.
The raw data usually cannot be directly integrated into the optimization model. The parameter of

the optimization model is traditionally obtained from human experts. However, it is frequently the case
that some constraints and objectives cannot be described by explicit formulas. In this case, an AI
model is suitable. Figure 4 illustrates our three-way taxonomy of existing works according to how the
AI model generates parameters and interacts with the optimization model.

3.1 Predict, then optimize


One straightforward approach is predict-then-optimize, i.e., first predict critical unknown parameters
within the optimization model (i.e., modelling parameter) and then leverage the optimization solver
to devise a decision. For example, consider a vehicle routing problem that requires to be solved several
times on a daily basis. The raw data is the historical traveling time on the edges of a road network
based on conditions like weather, time of the day, and neighbouring edges’ traffic. However, as the
conditions change, the current traveling time (modelling parameter) varies. AI-based methods can
help create predictive models that estimate modelling parameters within the optimization model, by
analyzing and processing large volumes of data, uncovering valuable insights that enhance the decision-
making process. The optimization solver then leverages these predictions to devise an optimal or near-
optimal route. This example highlights the importance of AI in parameter generation and showcases
how most solution systems tackling real-world analytics challenges benefit from the integration of
both prediction and optimization. Formally, to predict the unobserved modelling parameters θ, we
will build a training set D from historical records, including N pairs of attribute xi and parameter
θi , i.e.,
D = {(x1 , θ1 ), . . . , (xN , θN )}.
Here, xi is the i-th attribute correlated to the modelling parameter θi , and θi is observed in hindsight;
for example, the traveling time can be measured by actually driving a car along the road. Let
m(w; ·) denote an AI model for prediction, where w represents the AI model parameters. In other
words, we want θ̂i = m(w; xi ) to be a good estimation of modelling parameter θi . It should be
emphasized that the AI model parameters w are different from the modelling parameter θi . The key difference
is that w is learnable while θi is not. θi is collected as the target value that the AI model seeks to
predict given attribute xi .
The predict-then-optimize approach first trains the AI model m by
$$w^* \in \arg\min_{w} \frac{1}{N} \sum_{i=1}^{N} \| m(w; x_i) - \theta_i \|^2,$$

and then solve the optimization problem


$$v_{\hat\theta} \in \arg\min_{v} f_{\hat\theta}(v) \ \text{ s.t. } \ v \in \mathcal{C}_{\hat\theta} \quad \text{with } \hat\theta = m(w^*; x). \qquad (4)$$

where θ̂ is the predictive modelling parameter, v is the decision variable, fθ̂ (v) is the objective to be
minimized, and Cθ̂ is the feasible region determined by the set of constraints with predicted modelling
parameter θ̂.
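The two-stage pipeline can be sketched in a few lines of Python (synthetic data and a toy simplex-constrained LP are used purely for illustration; the regression model and the solver are interchangeable):

import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.optimize import linprog

rng = np.random.default_rng(0)
# Stage 0: historical data, attributes x_i and observed parameter vectors theta_i.
X_hist = rng.normal(size=(200, 3))
W_true = rng.normal(size=(3, 4))
Theta_hist = X_hist @ W_true + 0.1 * rng.normal(size=(200, 4))

# Stage 1 (predict): train the AI model m(w; .) by least squares.
model = LinearRegression().fit(X_hist, Theta_hist)

# Stage 2 (optimize): plug the predicted theta_hat into the LP
#   min theta_hat^T v   s.t.  sum(v) = 1,  v >= 0.
x_new = rng.normal(size=(1, 3))
theta_hat = model.predict(x_new).ravel()
res = linprog(c=theta_hat, A_eq=np.ones((1, 4)), b_eq=[1.0], bounds=[(0, None)] * 4)
print("predicted costs:", theta_hat, "decision:", res.x)

Note that the predictive model is trained purely on prediction error; the decision error only appears downstream, which is exactly the gap the smart predict-then-optimize framework of the next subsection addresses.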

3.2 Smart predict, then optimize


Compared to the predict-then-optimize approach that separates prediction and optimization into two
stages, Elmachtoub and Grigas (2022) propose an end-to-end framework called “Smart Predict, then
Optimize (SPO)” that directly trains a predictive model such that the decision error will be minimized.
The motivation behind SPO is to trade off the predictive model’s accuracy in the prediction stage
in exchange for a near-optimal decision. The error of the prediction stage is less important because
1) not all parameters in the optimization model are equally important, and 2) these parameters may be
correlated (Cameron et al. 2022).
To be specific, 1) in the prediction stage, the model inevitably makes errors due to model capacity
or data quality and scale. However, taking the vehicle routing problem as an example, if our main
concern is the travel time on the shortest route, then we are not concerned about the lack of nu-
merical precision in estimating the travel time for roads that are impossible to choose. SPO achieves

better tradeoffs by allowing some error on unimportant parameters but keeping the final decision
near-optimal. 2) the correlated parameters are commonly seen in stochastic optimization settings.
Meanwhile, it is possible that one parameter is related non-linearly to multiple prediction targets but
the predictive model is unaware of the combination. For example, the predictive model predicts the
cost per unit and the number of units independently but later on the total cost is used as a parameter
in the optimization model. Without knowing the correlation, a small error in the prediction stage
may lead to a large error in the decision.
Formally, the general SPO framework directly aims to minimize the decision error ℓ(vθ , vθ̂ ) between
the two decisions from two mathematical models with ground-truth parameter θ and predictive mod-
elling parameter θ̂. Here, ℓ(vθ , vθ̂ ) could be the square of L2 norm ∥vθ −vθ̂ ∥2 or the objective difference
|fθ (vθ ) − fθ (vθ̂ )|. Recall that we only have empirical historical data D = {(x1 , θ1 ), . . . , (xN , θN )}, so
we can only build the empirical training loss $\frac{1}{N}\sum_{i=1}^{N} \ell(v_{\hat\theta_i}, v_{\theta_i})$. This loss relies on the model parameter
w since the predictive modelling parameter θ̂i relies on w. We denote this empirical loss by
$L(w) \equiv \frac{1}{N}\sum_{i=1}^{N} \ell(v_{\hat\theta_i}, v_{\theta_i})$ for simplicity. The optimal model parameter w∗ is acquired by solving the
following well-known empirical loss minimization problem

$$w^* \in \arg\min_{w} L(w) \equiv \frac{1}{N} \sum_{i=1}^{N} \ell\big(v_{\hat\theta_i}, v_{\theta_i}\big) \quad \text{with } \hat\theta_i = m(w; x_i).$$

The main technical bottleneck for solving this optimization problem is computing the sub-gradient
$$\partial L(w) = \sum_{i=1}^{N} \left( \frac{\partial m(w; x_i)}{\partial w} \right)^{\!\top} \left( \frac{\partial v_{\hat\theta_i}}{\partial \hat\theta_i} \right)^{\!\top} \frac{\partial \ell(v_{\hat\theta_i}, v_{\theta_i})}{\partial v_{\hat\theta_i}},$$

which requires differentiating the arg min operator since vθ̂i is obtained by solving an optimization
problem. As a result, vθ̂i can be discontinuous and non-differentiable w.r.t. θ̂i . To overcome this
challenge, the literature considers different optimization problems and accordingly proposes different
methods (Elmachtoub and Grigas 2022, Amos and Kolter 2017, Yan et al. 2021, Mandi et al. 2020,
Wang et al. 2019, Pogancic et al. 2020). Under the general SPO framework, Elmachtoub and Grigas
(2022) consider a constrained problem with a linear objective and a feasible region that does not depend
on θ in Equation (4). Formally, it is

$$v_{\hat\theta} \in \arg\min_{v} \hat\theta^\top v \ \text{ s.t. } \ v \in \mathcal{C} \quad \text{with } \hat\theta = m(w; x). \qquad (5)$$

Accordingly, the decision error is ℓ(vθ , vθ̂ ) = θ⊤ (vθ̂ − vθ ). They propose a convex surrogate loss
through the dual interpretation of ℓ(vθ , vθ̂ ) and obtain a subgradient for ∂L(w). Amos and Kolter
(2017) integrate the QP optimization as a differentiable layer into the AI model, by differentiating the
KKT optimality conditions. Yan et al. (2021) further extend it to the case that soft constraints appear
in the objective. Soft constraints are sometimes required in practice, for example, such constraints
allow a slight excess of supply over demand through a penalty term in the objective. This penalty term
often appears as the non-differentiable max operator function, i.e., max(·, 0). This non-differentiability
challenge is addressed using a proposed piecewise linear surrogate. Mandi et al. (2020) extend the
SPO framework to NP-hard discrete optimization problems, by continuously relaxing the MILP and
offering approximate subgradients. Pogancic et al. (2020) obtain the approximate gradient by viewing
the MILP as a black box and invoking two calls to an optimization solver (one with original parameter
θ and another with perturbed parameter).
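To illustrate how such a surrogate can be used in practice, the sketch below computes the SPO+ subgradient with respect to the prediction for a linear objective over a toy simplex, following the construction in Elmachtoub and Grigas (2022); the solver oracle, data, and dimensions are illustrative assumptions, and in a full training loop this subgradient would be chained with ∂θ̂/∂w to update the predictive model.

import numpy as np
from scipy.optimize import linprog

def solve(theta):
    # Oracle v*(theta) = argmin_{v in C} theta^T v over the simplex {v >= 0, sum(v) = 1}.
    res = linprog(c=theta, A_eq=np.ones((1, len(theta))), b_eq=[1.0],
                  bounds=[(0, None)] * len(theta))
    return res.x

theta_true = np.array([3.0, 1.0, 2.0])     # ground-truth parameter theta_i
theta_pred = np.array([1.0, 2.5, 2.0])     # current prediction theta_hat_i = m(w; x_i)

# SPO+ subgradient with respect to the prediction:
#   g = 2 * (v*(theta) - v*(2 * theta_hat - theta)),
# obtained from the dual interpretation of the decision error.
g = 2.0 * (solve(theta_true) - solve(2.0 * theta_pred - theta_true))
print("SPO+ subgradient w.r.t. theta_hat:", g)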
It has been shown in the literature that the SPO end-to-end framework performs better than
the two-stage predict-then-optimize approach. However, the two-stage approach still has some pros.
In the two-stage approach, one may add robustness (i.e. inductive bias) in the prediction stage by
observing the entire dataset or integrating common knowledge. This could improve the performance
of the predictive model, especially with a small data size. It will be interesting to combine this
advantage into the SPO framework in future work.

3.3 Integrated prediction and optimization
In the following section, we will first describe the “Integrated prediction and optimization” approach
through an example and then discuss the two tools developed: OptiCL (Maragno et al. 2021) and
JANOS (Bergman et al. 2021).
There are two motivations for this approach. The first motivation is inherited from the “predict
then optimize”, i.e., there may not exist an explicit formula for calculating certain parameters in
the optimization model. The second motivation is that predicting these parameters may rely on
the decision made after solving the optimization model. This additional motivation explains why, in
Figure 4, the AI model takes in the decision variables. For instance, consider a telecommunication
company that wants to attract customers with low prices while maximizing revenue. Let x be the
decision variable on price, y be the probability of customer churn, and α be the additional information
about customers like demographic and geographic information. Customer churn probability y depends
on both price x and other information α. The company collects a historical dataset D and trains an
AI model ĥD that can estimate y given the aforementioned information, i.e., y = ĥD (x, α). There may be
certain policies like: 1) the price in certain regions has a lower bound. More generally, prices x fall
in the feasible region X (α) decided by geographic information α. 2) If a customer is highly likely to
leave, then the price can be adjusted accordingly, but this adjustment varies across different regions.
More generally, this variation can be expressed by an explicit formula g, and we write the policy as
a constraint g(x, y, α) ≤ 0. The revenue objective related to price x, customer churn probability
y, and other information α is denoted by f (x, y, α). Then, the optimization problem is denoted as
follows.

$$\begin{aligned}
\max_{x, y} \quad & f(x, y, \alpha) \\
\text{s.t.} \quad & g(x, y, \alpha) \le 0, \\
& y = \hat{h}_D(x, \alpha), \\
& x \in \mathcal{X}(\alpha).
\end{aligned} \qquad (6)$$
Then, we will discuss how optimization problem (6) is solved. The challenge is that constraint
y = ĥD (x, α) involves a pre-trained AI model ĥD . This AI model can be linear models, decision trees,
or multi-layer perceptions (MLPs). OptiCL (Maragno et al. 2021) and JONAS (Bergman et al. 2021)
are similar methods to overcome this challenge, where the constraint y = ĥD (x, α) are converted
to linear inequalities. This conversion process is called “embedding AI model into the optimization
model”. 1) In the case of linear models, for instance, linear regression and support vector machine, the
decision region of the learned model is characterized by a half-space, thus easily converted to a linear
inequality constraint. 2) For the decision tree, since each leaf node corresponds to a polytope, the
model’s decision region is a set of polytopes. Conversion of the decision region to linear inequalities
is a straightforward process. 3) The conversion of an MLP is not as trivial. MLPs often use ReLU
operators, i.e., ReLU(x) = max(x, 0). This operator can be converted to linear constraints with big-
M and integer artificial variables. Then, by recursively peeling off layers of the MLP and introducing
artificial variables, the MLP with an arbitrary number of hidden layers and nodes can be embedded
into the optimization model.
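As a concrete sketch of this conversion (a standard big-M encoding stated under the assumption of a known bound M on the pre-activation; not a formulation taken verbatim from OptiCL or JANOS), a hidden unit $y = \mathrm{ReLU}(a)$ with pre-activation $a = w^\top x + b$, $-M \le a \le M$, and a binary variable $z$ can be written as

$$y \ge a, \qquad y \ge 0, \qquad y \le a + M(1 - z), \qquad y \le M z, \qquad z \in \{0, 1\},$$

so that z = 1 forces y = a (the unit is active, which requires a ≥ 0) and z = 0 forces y = 0 (which requires a ≤ 0). Applying this encoding to every hidden unit, layer by layer, yields the mixed-integer embedding described above.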
While OptiCL and JANOS are similar regarding the conversion process, they differ in some of
the detailed techniques. JANOS employs a discretization method that breaks down the sigmoid
operator, $\sigma(x) = \frac{1}{1 + e^{-x}}$, into piecewise linear functions. OptiCL proposes two techniques for handling
the uncertainty of the trained AI model. The uncertainty arises because the true constraints are
unknown and the AI model may not capture the true constraints accurately. To mitigate these
challenges, OptiCL adopts the following strategies: 1) Employs an ensemble of AI models. A solution
is deemed feasible only if a significant portion of the ensemble agrees on it. 2) Defines the dataset’s
convex hull as a trust region. Predictions are deemed reliable when they are within or around this
trust region. Additionally, recall that the modeling parameter θ̂ is predicted by AI model m(w; ·) from
observed attributes x, denoted by θ̂ = m(w; x). This process is commonly termed as point estimation.
It fails to consider the uncertainty of θ̂. The uncertainty exists because we cannot observe every related
attribute and the AI model cannot perfectly predict the target modeling parameter. Bertsimas and
Kallus (2020) propose accounting for the uncertainty when generating the model parameters. They

also extend their approach to incorporate the interaction between prediction and optimization. The
authors apply the proposed approach to an inventory management problem.
Fajemisin et al. (2023) provide a survey on “optimization with constraint learning”. Though
with a different name, it also covers predictive AI models that are integrated into constraints
and objectives. Lombardi and Milano (2018) present another survey on the same topic but focus
on discrete optimization problems. They cover a broad range of methods where AI models are
integrated into optimization models, including: 1) Building explicit constraint equations using active
learning. Active learning is an iterative process involving a dynamic interaction between domain
experts and a learning model. Initially, the learning model formulates constraints and employs a solver
to derive a solution. The domain expert then assesses the feasibility of this solution. Feedback from
this assessment refines and enhances the learning model’s understanding and subsequent constraint
formulations. 2) Approximating the objective using simple formulas. In scenarios where evaluating
the objective function is computationally expensive, such as involving simulations, AI models are
employed to approximate the objective. The aim is to capture the internal functions of the simulator
through this approximation, thereby reducing computational overhead but still finding a high-quality
approximate solution.
To conclude, the AI model is able to generate key parameters for the optimization model, elim-
inating manual work and surpassing human capabilities, specifically where explicit formulas are not
applicable. We discussed three categories of how the AI model interacts with the optimization model.
A direction left unexplored in the literature involves allowing AI models to receive gradient feedback
while integrating AI with optimization. The motivation for this direction is that at the beginning,
we do not know how to formulate a certain constraint, so we randomly initialize it as an AI model.
Obviously, the decision made by solving the optimization model will be random. However, the infea-
sibility or error of decision serves as feedback to improve the AI model. The challenge of realizing
this direction lies in the computation cost. Embedding even the simplest MLP can introduce many
additional (integer) artificial variables, consequently increasing the scale of the optimization problem.
If the scale of the optimization problem goes beyond a manageable limit, the gradient of the decision
error w.r.t. the AI model parameters cannot be computed within an acceptable time. Nevertheless, it may
remain promising to explore this direction starting from a small-scale application.

4 Model Formulation
In the operations research process, modelling holds paramount importance and is often a time-
consuming step. Modelling is a defining characteristic of OR, and thus, it warrants significant atten-
tion. Modelling in OR often involves the creation of mathematical models, which typically consist of
three main elements: decision variables, constraints, and objective function(s). Decision variables are
employed to represent specific actions under the control of the decision-maker. In complex models, it
is common to define auxiliary or artificial variables to better model the problem. Although not directly
controlled by the decision-maker, these variables are also considered decision variables. Constraints
serve to establish limits on the range of values that each decision variable can assume, with each
constraint typically translating a specific restriction (e.g., resource availability) or requirement (e.g.,
meeting contracted demand) within the model. Constraints dictate the feasible values assignable to
decision variables, effectively determining the possible decisions for the system or process under con-
sideration. The final component of a mathematical model is the objective function, which represents
a mathematical expression of a performance measure (e.g., cost, profit, time, revenue, or utilization)
as a function of the decision variables. Regarding the nature of the objective function, the goal is
typically either to maximize or minimize its value.
In recent years, the most significant breakthrough in AI research has been the remarkable progress
in Natural Language Processing (NLP) achieved by large language models (LLMs). There have
been preliminary efforts to explore the potential of using LLMs, such as ChatGPT (Brown et al.
2020) and Llama (Touvron et al. 2023), for mathematical modeling. These early investigations have
demonstrated promising results, showcasing the potential for LLMs to revolutionize the process of
formulating mathematical models in operations research by automating the translation from natural
language problem descriptions into mathematical representations.

Large language models are obtained through a complex, multistage process (Ouyang et al. 2022,
Zhou et al. 2023, Taori et al. 2023). It starts with pre-training, where the model is exposed to a vast
amount of text data from a broad spectrum of domains. This enables the model to learn general
language patterns, semantics, syntax, and common-sense knowledge. During pre-training, the model
attempts to predict the next word in a sentence, thus learning the context-dependent representations
of words. Once this unsupervised learning stage is complete, the model undergoes fine-tuning. In fine-
tuning, the model is instructed with a more specific task using a smaller, labeled dataset. This step
helps the model refine its capabilities and align its responses with specific goals. Some LLMs (Ouyang
et al. 2022) also incorporate a level of reinforcement learning from human feedback (RLHF), where
the model’s responses are ranked by human preference and used to further improve its predictions.
When utilized as a backend model during inference time, LLMs offer impressive capabilities. One
important application is reference editing (Gao et al. 2023, Reid and Neubig 2022, Wen et al. 2017),
where the LLM is used to generate, evaluate, and modify drafts of text based on user input. In this
process, the model first generates an initial draft. Then, based on feedback or additional instructions,
the model performs iterative edits until the text meets the desired quality and specifications. This not
only enables the creation of more refined and context-specific content but also provides a mechanism
for users to control the model’s output in a more flexible manner. Through such interactive usage,
LLMs can be a powerful tool in various fields, from content creation and editing to customer service
and beyond.
In this survey, we aim to harness the natural language understanding capabilities of large language
models to generate mathematical models based on requirement lists or problem descriptions provided
by business owners. Additionally, these models can be modified by experts through natural language
inputs, streamlining the model iteration process and fostering a more collaborative approach. In this
section, the Llama family LLMs (Touvron et al. 2023, Rozière et al. 2023), considered as one of the
most powerful open-sourced LLMs, is discussed and adopted for generating mathematical models from
natural language problem descriptions. We analyze problems with two different levels of complexity
as described in sections 4.1 and 4.2 and we provide insights into the effectiveness of using LLMs for
the model formulation stage.
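As a minimal sketch of this usage pattern (the checkpoint name, prompt wording, and toy problem are assumptions for illustration, and the generated formulation still requires expert review), a natural language problem description can be sent to an off-the-shelf Llama-2 chat model through the Hugging Face transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any instruction-tuned chat LLM could be substituted.
model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Toy problem description; in practice this would come from the business owner.
problem = ("A bakery makes cakes and pies. Each cake needs 2 cups of flour and 1 egg; "
           "each pie needs 1 cup of flour and 2 eggs. With 20 cups of flour and 16 eggs, "
           "how many of each should be baked to maximize profit if a cake earns $3 and a pie $2?")

prompt = ("Formulate the following problem as a linear program. List the decision variables, "
          "the objective function, and all constraints.\n\n" + problem)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))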

4.1 Textbook modelling problems


In this section, we focus on textbook modeling problems. Problems at this level are quite sim-
ple. Figure 5 demonstrates an example. We utilize the dataset from the NL4OPT competition held
at NeurIPS 2022 (Ramamonjison et al. 2023), which was centered around constructing mathematical
models from natural language problem descriptions. This dataset contains 713 training, 99 validation,
and 289 testing data points. In this dataset, each data point consists of both a problem description
and a human-composed mathematical model. Declaration-level mapping accuracy is proposed in this
competition for evaluating various methods. As shown in Figure 5, the competition decomposes the
overall task into two sub-tasks: identifying model entities (sub-task 1) and generating the mathemat-
ical model (sub-task 2). For an in-depth understanding of the competition and its evaluation criteria,
we refer readers to (Ramamonjison et al. 2023). The purpose of using the NL4OPT dataset here is
to evaluate the performance of LLMs on textbook-level problems, review popular LLMs, and provide
some insights.
As mentioned before, most LLMs are obtained through a two-step process: pretraining and fine-
tuning. During pretraining, LLMs are exposed to large volumes of text data, enabling them to
grasp language patterns and common knowledge. The finetuning stage then refines the model on
specific, high-quality instruction-answer pairs, ensuring it adheres to given instructions. Training
over-parameterized models on a vast amount of text data leads to the emergence of advanced capa-
bilities in LLMs, such as contextual comprehension, generating coherent responses, and even solving
mathematical problems. These skills, not explicitly taught, naturally arise from the model. This
phenomenon is known as “emergence”.
The usage of LLMs has become increasingly standardized. Developers often download a pre-trained LLM of their chosen size and finetune it with specific data to meet their needs. Various off-the-shelf LLMs are obtained in this way; their differences lie in model size (the amount of disk space the model parameters occupy) and the finetuning dataset. To showcase the superiority of LLMs in

Figure 5: Example on a textbook modeling problem.

Index | LLMs                      | Model size | Finetuning dataset             | Acc.
1     | Llama-2-13b-chat          | 52 GB      | Open domain                    | 24%
2     | Code-Llama-34b-instruct   | 136 GB     | Programming and math domain    | 65%
3     | Llama-2-70b-chat          | 280 GB     | Open domain                    | 37%
4     | Llama-2-13b-chat (SFT)    | 52 GB      | Open domain + NL4OPT train set | 82%
5     | NL4OPT winning submission | 1 GB       | NL4OPT train set               | 90%

Table 2: Declaration-level mapping accuracy (denoted by “Acc.”) for different LLMs and NL4OPT
Winning Submission.

modeling textbook problems and to give relevant insights, we investigated several widely-used off-
the-shelf LLMs, including Llama-2-13b-chat (Touvron et al. 2023), Code-Llama-34b-instruct (Rozière
et al. 2023) and Llama-2-70b-chat (Touvron et al. 2023), a variant LLM further fine-tuned with the
NL4OPT training set, named “Llama-2-13b-chat (SFT)”, and the NL4OPT winning submission for
sub-task 2. Here, the winning submission also follows a similar structure, i.e., it picks an off-the-shelf pre-trained LLM, BART (Lewis et al. 2020), and fine-tunes it with the NL4OPT train set. Careful hyperparameter tuning is conducted on the validation set to achieve high performance. The same metric
(Declaration-level mapping accuracy) is employed to evaluate different LLMs (See Ramamonjison
et al. (2023) for more details). Table 2 lists the performance of different LLMs and the NL4OPT
winning submission.
Without any training, Code-Llama-34b-instruct achieves 65% test accuracy. Without hyperparameter tuning, Llama-2-13b-chat (SFT) achieves 82% accuracy. This suggests that LLMs are easy to use and perform well on textbook-level problems. The winning submission achieves 90% accuracy, but we need to mention that it is unfair to compare the first four LLMs in Table 2 with the winning submission. The first four LLMs solve the task of modeling from natural language in an end-to-end manner, while, as mentioned before, the winning submission of the competition decomposes the overall task into two sub-tasks, thereby complicating the process. The reported 90% performance for sub-task 2 assumes that the entity information is perfectly annotated after sub-task 1, indicating that this 90% is merely an upper bound and does not measure the end-to-end performance. Lastly, it is worth noting that LLMs can easily be further improved. For example, with a more comprehensive and high-quality pretraining corpus, such as operations research papers, LLMs will understand and generalize better for mathematical modeling. When comparing the performance
of Code-Llama-34b-instruct and Llama-2-70b-chat, it becomes evident that a larger model size does
not necessarily guarantee improved performance. The quality and relevance of the fine-tuning dataset
play a more significant role. Further, by examining Llama-2-13b-chat against its further fine-tuned
version, we reinforce the importance of relevant and in-distribution data for a fine-grained task like
modeling from natural language.
To summarize, the current performance of LLMs for modeling textbook-level problems has been
impressive. As these LLMs continue to evolve and improve, it is expected that their capability for
effective modeling directly from natural language inputs will significantly advance in the near future.

4.2 Real-world problems


In the second experiment, we use a set of real-world problems, including Unrelated-Machine Scheduling
Problem (Blazewicz et al. 1991), One-dimensional Bin Packing (Pinedo 2016), Capacitated Vehicle
Routing (Toth and Vigo 2014), Team Formation (Anagnostopoulos et al. 2012), Portfolio Optimization
(Markowitz 1952), Staff Scheduling (Blöchliger 2004), and Airline Revenue Management (Talluri and
Van Ryzin 2004), Cutting Stock Problem in the Printing Industry (Mostajabdaveh et al. 2022a), Fair
Distribution of Relief Aid Supplies in Disaster Response (Mostajabdaveh et al. 2022b), and Short-
term Scheduling in Mining Operations (Blom et al. 2017). For each problem, we provide a detailed
problem description and use the simple prompt “Write a mathematical model of the following problem
description + <Problem Description>” as input to the Llama-2-70b-chat LLM. We show the example
of the Unrelated-Machine Scheduling Problem (Blazewicz et al. 1991) in Figure 6. The generated model
formulation, although it contains several flaws, offers a starting point for further refinement. On the
constructive side, LLM correctly identifies sets, parameters, and important decision variables, and
constructs a skeletal framework for further refinement.
However, the generated formulation still involves errors that need to be addressed, e.g., incorrect
constraint expressions, extra constraints, and missing constraints. These errors can render the prob-
lem infeasible or the final solution irrelevant to the problem. For the specific example in Figure 6,
Constraint 2 is actually redundant due to the existence of Constraint 4. Furthermore, a decision
variable si is introduced, denoting the start time of task i, without accompanying constraints to
prevent the temporal overlap of tasks assigned to the same machine. However, if we point out the
redundant or incorrect constraints and prompt Llama to correct the errors, it can make the necessary
corrections after a few rounds of prompts and responses. For the above-mentioned example, LLM
finds that introducing si is unnecessary in the context of the Unrelated-Machine Scheduling Problem
because the order of tasks on the same machine does not influence the machine's completion time. After removing $s_i$, it suggests an improved Constraint 3: $\sum_{i=1}^{N} p_{ij} x_{ij} \le C_{\max}, \ \forall j \in \{1, 2, \ldots, M\}$.
This demonstrates the potential for iteratively refining the generated mathematical models through
interaction with LLM, leading to more accurate and relevant formulations.
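To make this interactive refinement concrete, the following is a minimal Python sketch of such a prompt-and-refine loop. The helper `query_llm` is a hypothetical wrapper around any chat-style LLM endpoint (e.g., a locally hosted Llama-2-70b-chat), and the expert feedback function stands in for the manual review described above; neither is part of the cited works.

```python
def query_llm(prompt: str, history: list) -> str:
    """Hypothetical wrapper around a chat-style LLM (e.g., Llama-2-70b-chat).
    It is assumed to append the prompt to the conversation history and
    return the model's reply."""
    raise NotImplementedError  # replace with an actual model call


def formulate_model(problem_description: str, expert_feedback_fn, max_rounds: int = 5) -> str:
    """Iteratively ask the LLM for a mathematical model and refine it with feedback."""
    history = []
    prompt = ("Write a mathematical model of the following problem description\n"
              + problem_description)
    draft = query_llm(prompt, history)
    for _ in range(max_rounds):
        # The expert (or an automatic checker) inspects the draft and returns
        # feedback such as "Constraint 2 is redundant", or None if satisfied.
        feedback = expert_feedback_fn(draft)
        if feedback is None:
            break
        draft = query_llm("Please revise the model. " + feedback, history)
    return draft
```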
Overall, while the LLM made several errors, the provided formulations can serve as a starting point for OR experts to create mathematical models. However, OR experts should not rely on LLMs to
accurately create mathematical models, especially for less common or complex problems. Each output
needs to be thoroughly verified and adjusted by the experts to ensure correctness and relevance.

5 Automatic Algorithm Configuration


Optimization solvers contain parameters that significantly impact their performance. However, man-
ually tuning these parameters is challenging due to their complex interactions. Automatic algorithm
configuration (AAC) aims to automate the tuning process by systematically exploring the parameter
space to find optimal configurations that maximize performance. More formally, consider a parame-
terized optimization algorithm or solver A, its parameter space Θ, a distribution of problem instances
D, and a cost function o that evaluates the performance of a solver configuration θ ∈ Θ on an instance

d. The goal of AAC is to solve
$$\theta^\star = \arg\min_{\theta \in \Theta} \ \mathbb{E}_{d \sim D}\, o(A(\theta), d) \qquad (7)$$
to find the configuration that optimizes expected performance over D.

Figure 6: Example on the Unrelated-Machine Scheduling Problem.

Figure 7: Illustration of automatic algorithm configuration, adapted from the work by Hutter et al. (2009).


The Parametric Iterated Local Search (ParamILS) framework (Hutter et al. 2009) is a represen-
tative AAC method for tuning optimization solvers. Given a training set of instances {d1 , ..., dn },
ParamILS proceeds as follows:
• Initialization: Randomly sample $r$ configurations $\{\theta_1, \ldots, \theta_r\}$ and select the $\theta$ with the lowest training cost:
$$\theta = \arg\min_{\theta \in \{\theta_1, \ldots, \theta_r\}} L(\theta) = \frac{1}{n} \sum_{i=1}^{n} o(A(\theta), d_i).$$

• Iterated Local Search: While the stopping criterion (e.g., the solution quality exceeds a threshold or the maximum runtime is reached) is not met, local search and perturbation are repeatedly executed; a minimal sketch of this loop is given after the list.
– Local search: Randomly sample a new configuration $\theta'$ from the neighborhood $N(\theta)$. If $L(\theta') < L(\theta)$, set $\theta = \theta'$. Repeat this $k$ times, where $k$ is a hyperparameter.
– Perturbation: Randomly choose a configuration $\theta'$ differing from $\theta$ in at most $p$ parameters, where $p$ is a pre-defined hyperparameter. Perform the local search from $\theta'$. If the result is better, set $\theta = \theta'$.
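A minimal Python sketch of this procedure follows. The callables `random_config`, `training_cost`, `sample_neighbor`, and `perturb`, as well as the budget parameters, are illustrative placeholders for the components defined above; real ParamILS implementations additionally include restart rules and adaptive capping, which are omitted here.

```python
def iterated_local_search(random_config, training_cost, sample_neighbor, perturb,
                          r=10, k=20, n_rounds=50):
    """Sketch of a ParamILS-style search (the helpers and budgets are assumptions)."""

    def local_search(theta):
        # Try k random neighbors; move whenever the training cost improves.
        for _ in range(k):
            candidate = sample_neighbor(theta)
            if training_cost(candidate) < training_cost(theta):
                theta = candidate
        return theta

    # Initialization: best of r random configurations, then a first local search.
    theta = min((random_config() for _ in range(r)), key=training_cost)
    theta = local_search(theta)

    for _ in range(n_rounds):                      # stopping criterion (budget)
        # Perturbation followed by local search; accept the result only if better.
        candidate = local_search(perturb(theta))
        if training_cost(candidate) < training_cost(theta):
            theta = candidate
    return theta
```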
ParamILS has shown significant benefits for configuring mixed integer programming solvers and
other optimization algorithms. More specifically, in Hutter et al. (2010), the authors apply ParamILS
to leading MIP solvers: CPLEX (CPLEX 2009), Gurobi (Gurobi 2022), and LPSOLVE (Berkelaar
2015). To assess the benefits of auto-tuning, the authors perform experiments on two sets of MIP
instances: a library of over 200 cases from prior studies, and over 90 new larger cases. They measure
the runtime required for the solvers to prove optimality or find the best solution within a time limit.
Results show that auto-tuning reduces the runtime of CPLEX and Gurobi by over 25-50% on average
compared to the default configurations, and over 50-100% for the hardest problem instances. The
authors also analyze how the parameter settings found by ParamILS differ based on instance features
like the number of variables and constraints. They find that more aggressive parameter configurations
(e.g. stronger presolving) are selected for easier instances, while more robust and diverse settings
are chosen for harder cases. This demonstrates how auto-tuning can exploit instance properties to
configure solvers more appropriately.
Alternatively, López-Ibáñez et al. (2016) propose an R package called IRACE for automatic al-
gorithm configuration. IRACE implements a method called iterated racing for efficiently exploring
the parameter space. In one iteration of IRACE, multiple parameter configurations are evaluated in
parallel on training instances with limited budgets (called racing), and poor performers are discarded.
The purpose of racing is to gather preliminary performance and estimate the configurations’ quality.
This process is repeated over multiple iterations and strikes a balance between exploring and exploit-
ing the promising regions of the parameter space. Compared with ParamILS, which handles mainly
categorical parameters, IRACE handles a wide range of parameter types, including continuous, inte-
ger, categorical, and conditional parameters. Meanwhile, IRACE leverages statistical testing to select
candidate configurations, and thus is more suitable when the performance of the algorithm being
tuned exhibits stochastic behavior. López-Ibáñez et al. (2016) evaluate IRACE on configuring algo-
rithms from various domains, including constraint programming and hyperparameter optimization.
Results show that IRACE can achieve average speedups of 20-60% over default parameter settings and
matches or outperforms manual tuning by experts. For algorithms with a large number of parameters,
the benefits of auto-tuning with IRACE become even more substantial.
In Equation 7, the expected performance on a set of instances is called the performance function,
denoted as $g(\theta) = \mathbb{E}_{d \sim D}\, o(A(\theta), d)$. However, when navigating a high-dimensional parameter space, the curse of dimensionality often renders both random sampling (López-Ibáñez et al. 2016) and local search strategies (Hutter et al. 2009, 2010) inadequate. To capture knowledge of the parameter space, building a surrogate performance function is advantageous. Model-based methods (Hutter et al. 2011, Lindauer and Hutter 2018, Lindauer et al. 2022) incorporate a learned model as the surrogate performance function to guide the search for configurations. These methods alternate between exploiting the current model to find promising configurations and exploring the configuration space to improve the model. However, they can struggle with stochastic or non-smooth functions $g$; e.g., multiple evaluations at the same configuration can result in different outcomes for randomized algorithms. More research is needed to improve the efficiency of exploring the configuration
space, e.g., using more advanced sampling methods, combining the idea of racing, sampling, and
surrogate model (Balaprakash et al. 2007, Anastacio and Hoos 2020b,a).
To reiterate, AAC automatically finds the best-performing configuration settings for any given
algorithm, with respect to a particular set of problem instances. In essence, AAC targets a single
algorithm and treats it as a “black box”. This carries both merits and drawbacks. On the upside,
the algorithm is treated as a black box that accepts an algorithm configuration and an optimization
problem, then yields performance metrics, such as solving time. It means there’s no need to compre-
hend the internal workings of the algorithm and empowers non-experts to enhance the performance of
optimization algorithms. Moreover, by exploring the parameter space in a structured yet randomized
manner, AAC might outperform experts by discovering complex parameter interactions that would

be difficult to identify manually. However, there are two inherent disadvantages. Firstly, the “No
Free Lunch” theorem (Wolpert and Macready 1997) suggests that certain algorithms may outperform
others on a specific set of problem instances. Thus, relying solely on one algorithm may be insufficient if the distribution of problem instances D is too diverse. Secondly, AAC's black-box approach means it does not capitalize on insights derived from the algorithm's internal mechanics. To overcome these limitations,
some closely related tasks on hyperparameter tuning are proposed.
To address the first limitation, it is important to choose an optimization algorithm that aligns
well with the nature and distinct attributes of the problems at hand. The task of algorithm selection
is highlighted by Bischl et al. (2016). This task involves determining the most suitable algorithm
among a set of algorithms for a given problem instance in order to exploit the varying performance
of algorithms over a diverse set of instances. Formally, we are given K candidate algorithms $\{A_1, \ldots, A_K\}$, a distribution of problem instances D, and a cost function o that evaluates the performance of an algorithm on an instance d. The objective is to identify a mapping $f(\cdot)$. This mapping assigns each
problem instance to an algorithm index, optimizing the following expected performance:

$$f^\star = \arg\min_{f} \ \mathbb{E}_{d \sim D}\, o(A_{f(d)}, d). \qquad (8)$$

Here, f (d) retrieves the index of the algorithm best suited for the problem instance d. While the
algorithms are still treated as black boxes, the characteristic inside the instances is utilized, since the
mapping f is often implemented by using instance features like the density of the constraint matrix.
These instance features are then mapped to an algorithm via an AI model. It is worth noting that
algorithm selection disregards algorithm configuration. Therefore, if one algorithm underperforms
compared to another, it might be due to suboptimal configuration rather than the algorithm’s inherent
unsuitability for the task. In future research, it would be intriguing to investigate tasks that integrate
both algorithm selection and automatic configuration tuning.
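As a rough illustration of how such a mapping f can be realized, the sketch below trains an off-the-shelf classifier on instance features (here: number of variables, number of constraints, constraint-matrix density) labeled by the index of the empirically best algorithm. The feature set, the toy data, and the choice of a random forest are assumptions for illustration, not a prescription from the cited works.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: one row of features per problem instance
# (#variables, #constraints, constraint-matrix density) and, as the label,
# the index of the algorithm that performed best on that instance.
instance_features = np.array([[1000,  800, 0.010],
                              [  50,   40, 0.300],
                              [5000, 4500, 0.002],
                              [ 120,  100, 0.250]])
best_algorithm_idx = np.array([0, 1, 0, 1])          # labels in {0, ..., K-1}

# The learned mapping f: instance features -> algorithm index.
selector = RandomForestClassifier(n_estimators=100, random_state=0)
selector.fit(instance_features, best_algorithm_idx)

# At deployment time, extract the same features from a new instance and
# dispatch it to the predicted algorithm.
new_instance = np.array([[800, 700, 0.015]])
chosen = int(selector.predict(new_instance)[0])
print(f"Run algorithm A_{chosen} on this instance")
```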
Regarding the second limitation, i.e., AAC perceives the algorithm as a “black box”, one potential
strategy is to group algorithms based on their inherent characteristics, thus peeling back a layer of
this “black box”. For example, many algorithms are iterative and can benefit from dynamic configu-
ration during their execution, adjusting based on the information available at run time. Adriaensen
et al. (2022) propose a general automated dynamic algorithm configuration framework for iterative
algorithms. Another promising approach is to focus on a single specific algorithm, fully “unboxing”
it to leverage its inherent features. Advanced AI-driven iterative optimization algorithms are further
explored in Sections 6 and 7.

6 Algorithm Selection and Design for Continuous Optimization


In this section, we review the literature on using AI techniques to enhance the algorithms for solving
continuous optimization problems.

6.1 Enhancement for gradient-based methods


In this section, we consider solving an unconstrained continuous optimization problem

$$\min_{x \in \mathbb{R}^n} f(x), \qquad (9)$$

where f : Rn → R is a differentiable function. Gradient descent (Beck 2017) is one of the most widely
used algorithms for solving Problem (9) because of its cheap cost at each iteration. However, the
performance of gradient descent is quite limited by the fact that it only makes use of the latest gradient
and ignores past information. To resolve this issue, many gradient-based optimization algorithms have been proposed to improve the performance of gradient descent, and we summarize a few representatives below (a minimal NumPy sketch of two of these update rules follows the list):

• Gradient descent:
$$x^{t+1} = x^t - \eta \nabla f(x^t),$$
where $\eta$ represents the learning rate.

• Gradient descent with momentum (Tseng 1998):
$$g^t = \gamma g^{t-1} + (1 - \gamma)\nabla f(x^t), \qquad x^{t+1} = x^t - \eta g^t,$$
where $\gamma \in [0, 1)$ is the momentum coefficient determining the effect of the previous gradient updates on the current update.

• Nesterov's accelerated gradient descent (Nesterov 1983):
$$g^{t+1} = \gamma g^t - \eta \nabla f(x^t + \gamma g^t), \qquad x^{t+1} = x^t + g^{t+1}.$$

• Adaptive gradient descent (AdaGrad) (Duchi et al. 2011):
$$G^t = G^{t-1} + \mathrm{diag}(\nabla f(x^t))^2, \qquad x^{t+1} = x^t - \eta * \left[(G^t)^{-\frac{1}{2}} \nabla f(x^t)\right],$$
where $G^t$ is a rough approximation of the diagonal elements of the Hessian matrix and the symbol $*$ is multiplication between a scalar and a vector.

• Adam (Kingma and Ba 2015):
$$g^t = \gamma_1 g^{t-1} + (1 - \gamma_1)\nabla f(x^t), \qquad G^t = \gamma_2 G^{t-1} + (1 - \gamma_2)\,\mathrm{diag}(\nabla f(x^t))^2, \qquad x^{t+1} = x^t - \eta * \left[(G^t)^{-\frac{1}{2}} g^t\right],$$
where $\gamma_1, \gamma_2 \in [0, 1)$ are the momentum coefficient and squared-gradient coefficient, respectively. More specifically, the squared-gradient coefficient $\gamma_2$ determines the importance of previously squared gradients when performing an exponential moving average for $G^t$.
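The following is a minimal NumPy sketch of two of these update rules (momentum and Adam) applied to an illustrative quadratic objective; the step sizes, the small constant added for numerical stability, and the omission of Adam's bias correction are simplifications for illustration.

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative quadratic f(x) = 0.5 * ||x||^2.
    return x

def momentum_step(x, g, eta=0.1, gamma=0.9):
    # g^t = gamma * g^{t-1} + (1 - gamma) * grad f(x^t);  x^{t+1} = x^t - eta * g^t
    g = gamma * g + (1 - gamma) * grad_f(x)
    return x - eta * g, g

def adam_step(x, g, G, eta=0.1, gamma1=0.9, gamma2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient,
    # followed by a coordinate-wise scaled step (eps avoids division by zero).
    grad = grad_f(x)
    g = gamma1 * g + (1 - gamma1) * grad
    G = gamma2 * G + (1 - gamma2) * grad ** 2
    return x - eta * g / (np.sqrt(G) + eps), g, G

x, g = np.ones(5), np.zeros(5)
for _ in range(100):
    x, g = momentum_step(x, g)
print("momentum iterate norm:", np.linalg.norm(x))

x, g, G = np.ones(5), np.zeros(5), np.zeros(5)
for _ in range(100):
    x, g, G = adam_step(x, g, G)
print("Adam iterate norm:", np.linalg.norm(x))
```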
We refer interested readers to Ruder (2016) for a more detailed overview of those gradient-based
methods with explicit update rules. In this section, we want to review the papers aiming to learn a parameterized update rule.
As the optimization process can be regarded as a trajectory of iterative updates, LSTMs (see Sec-
tion 2.2) are natural modelling choices for learning the update rule. Andrychowicz et al. (2016) pro-
posed the first LSTM-based learning technique to replace the explicit update rules in those gradient-
based methods. The main idea is to learn a good update rule which is expected to have a good
performance on some target problem sets. More specifically, it aims to solve the following problem
$$\theta^\star \in \arg\min_{\theta}\ \mathbb{E}_{(f,\,x^0) \sim F}\left[\sum_{t=1}^{T} f(x^t)\right] \quad \text{with} \quad x^{t+1} = x^t - g^t \ \text{ and } \ \begin{bmatrix} g^t \\ h^{t+1} \end{bmatrix} = m(\theta;\, \nabla f(x^t),\, h^t),$$

where T is some pre-determined maximal number of iterations, F is the target problem set containing
problem instances and corresponding initial points, m(θ; ·, ·) is the LSTM model with θ being the
learnable parameters, and ht is the hidden embedding for all the gradient information up to iteration
t. There are two technical difficulties with this approach. The first one is the varying dimensions, as
the problem instances f in the target problem set F may have different dimensions, which requires the
model m(θ; ·, ·) to be able to handle varying input dimensions. Andrychowicz et al. (2016) resolved
this technical difficulty by letting the LSTM model operate coordinatewise on variable x, i.e.,
$$\begin{bmatrix} g_i^t \\ h_i^{t+1} \end{bmatrix} = m(\theta;\, \nabla f(x^t)_i,\, h_i^t), \qquad \forall i = 1, \ldots, n.$$

In this way, the LSTM model can handle problem instances with any dimension.
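To make the coordinatewise design concrete, the PyTorch sketch below applies one shared LSTM cell to every coordinate of the gradient (treating coordinates as a batch) and trains it by unrolling T steps and summing the objective values, mirroring the formulation above. The hidden size, the absence of gradient preprocessing, the single-layer cell, and the toy quadratic target are simplifications relative to the setup of Andrychowicz et al. (2016).

```python
import torch
import torch.nn as nn

class CoordinatewiseLSTMOptimizer(nn.Module):
    """A shared LSTM cell applied independently to each coordinate, so problems
    of any dimension n can be handled (a simplified sketch, not the exact model)."""

    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.out = nn.Linear(hidden_size, 1)   # hidden state -> step g_i^t

    def forward(self, grad, state):
        # grad: (n,) gradient vector; state: tuple (h, c) of (n, hidden_size) tensors.
        h, c = self.cell(grad.unsqueeze(-1), state)   # coordinates act as a batch
        step = self.out(h).squeeze(-1)                # one scalar step per coordinate
        return step, (h, c)

def unroll(optimizer_net, f, x0, T=20):
    """Meta-loss: accumulate sum_t f(x^t) along an unrolled trajectory."""
    n, hs = x0.numel(), optimizer_net.cell.hidden_size
    h, c = torch.zeros(n, hs), torch.zeros(n, hs)
    x = x0.clone().requires_grad_(True)
    meta_loss = 0.0
    for _ in range(T):
        loss = f(x)
        grad, = torch.autograd.grad(loss, x, create_graph=True)
        step, (h, c) = optimizer_net(grad, (h, c))
        x = x - step
        meta_loss = meta_loss + f(x)
    return meta_loss

net = CoordinatewiseLSTMOptimizer()
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)
f = lambda x: (x ** 2).sum()                    # a toy target problem instance
meta_loss = unroll(net, f, torch.randn(8))      # any problem dimension works
meta_opt.zero_grad()
meta_loss.backward()
meta_opt.step()
```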

The second challenge is the choice of the maximal number of iterations T . A small T may
yield unsatisfactory results, which is also known as truncation bias, and a big T may cause gradient
explosion (Pascanu et al. 2013, Chen et al. 2022). Much research has been done to solve this problem.
Lv et al. (2017) proposed to add random scaling and convex regularizers to stabilize the training
process so that a larger T can be selected. Their strategies are designed to prevent large random updates when the LSTM optimizer is insufficiently trained. Alternatively, Chen
et al. (2020a) utilized techniques from curriculum learning (Bengio et al. 2009) to gradually increase T
during the training so that the model can mitigate the dilemma between truncation bias and gradient
explosion. Chen et al. (2020b) tackled this problem by learning another variational stopping policy so
that T is no longer a fixed pre-determined parameter. Such a design will cause difficulty in training
the model. Therefore, the authors propose a novel training procedure that decomposes the task into
an oracle model learning stage and an imitation stage. Metz et al. (2019) suggested using large T
and tried to overcome the gradient explosion problem by computing the gradient as the weighted
average of two unbiased gradient estimators. The two unbiased gradient estimators are computed
by the reparameterization trick (Kingma and Welling 2013) and the log-derivative trick (Williams
1992). Alternatively, Wichrowska et al. (2017) suggested using more-advanced RNN structures. In
contrast to a single RNN layer, the authors implemented three RNN layers in a hierarchical manner,
denoted as “bottom RNN”, “middle RNN”, and “upper RNN”. The “bottom RNN” takes the scaled
gradients as input and outputs the hidden states, the “middle RNN” takes these hidden states and
produces averaged hidden states, and the “upper RNN” receives the averaged hidden states. The
authors showed empirically that such a hierarchical design leads to lower memory and computing
overhead while achieving superior generalization.
Besides LSTM, reinforcement learning is another widely adopted framework for learning gradient-
based methods. Li and Malik (2016) proposed the first RL-based framework for learning gradient-
based methods. In their setting, the state $s^t$ consists of the current iterate $x^t$ and features $\Phi^t$, which depend on the history of iterates $x^0, \ldots, x^t$, gradients $\nabla f(x^0), \ldots, \nabla f(x^t)$, and objectives $f(x^0), \ldots, f(x^t)$. The action $a^t$ is the step $\Delta x$ that will be used to update the iterate. The reward is defined as the decrease in the objective. Finally, they use RL to learn a policy $\pi$ that can
sample the update steps from a Gaussian distribution whose mean and variance are learnable param-
eters. Based on this framework, Li and Malik (2017) developed an extension that is suited to learning
optimization algorithms for high-dimensional stochastic problems. More specifically, they notice that
if the values of two coordinates in all current and past gradients and iterates are identical, then the
step vector produced by the algorithm should have identical values in these two coordinates. Based
on this finding, they proposed grouping coordinates that exhibit this permutation invariance into coordinate groups. They show that this formulation reduces the computational cost of training neural networks on
well-established image classification datasets.
Table 3 summarizes the abovementioned methods into two categories. There is no clear winner
between LSTM-based and RL-based methods. When training loss is differentiable w.r.t. the model
parameter, supervised learning and a medium-size model (like LSTM) are favourable, due to data
efficiency and easier hyper-parameter tuning. This is the reason why more literature follows the
LSTM-based framework. One benefit of the RL framework is that it handles non-differentiable cases. For example, Zheng et al. (2022) propose to learn an RNN that outputs a symbolic gradient update formula. The loss is not differentiable w.r.t. the formula, and thus RL is the only choice.

6.2 Enhancement for ADMM-type methods


The Alternating Direction Method of Multipliers (ADMM) is an efficient first-order optimization
algorithm that solves problems in the form

$$\min_{x \in \mathbb{R}^n,\, s \in \mathbb{R}^m} \ f(x) + g(s) \quad \text{s.t.} \quad Ax + s = b, \qquad (10)$$

where A ∈ Rm×n and b ∈ Rm (Boyd et al. 2011). As a key component of the ADMM method, the
augmented Lagrangian function for problem (10) is given by
$$L_\rho(x, s, y) = f(x) + g(s) + y^T (Ax + s - b) + \frac{\rho}{2}\|Ax + s - b\|^2, \qquad (11)$$

Framework  | Paper                      | Methodology
LSTM-based | Andrychowicz et al. (2016) | First propose a basic LSTM-based framework.
           | Lv et al. (2017)           | Use random scaling and convex regularizers to stabilize the training process.
           | Chen et al. (2020a)        | Use curriculum learning to gradually increase the maximal number of iterations.
           | Chen et al. (2020b)        | Learn a variational stopping policy to determine the stopping time.
           | Metz et al. (2019)         | Use the weighted average of two unbiased gradient estimators.
           | Wichrowska et al. (2017)   | Enhance the RNN structure in a hierarchical manner.
RL-based   | Li and Malik (2016)        | First propose a basic RL-based framework.
           | Li and Malik (2017)        | Combine the coordinates with permutation invariance property into one group.
           | Zheng et al. (2022)        | Learn an agent that outputs a symbolic gradient update formula.

Table 3: Summary of works using AI techniques to enhance gradient-based methods.

where y ∈ Rm is the dual variable or Lagrange multiplier and ρ > 0 is called the penalty parameter.
Then the ADMM consists of the iterations
$$\begin{aligned} x^{t+1} &= \arg\min_{x \in \mathbb{R}^n} L_\rho(x, s^t, y^t), \\ s^{t+1} &= \arg\min_{s \in \mathbb{R}^m} L_\rho(x^{t+1}, s, y^t), \\ y^{t+1} &= y^t + \rho(Ax^{t+1} + s^{t+1} - b). \end{aligned} \qquad (12)$$

Computational experiments on different applications (Fortin and Glowinski 2000, Fukushima 1992,
Kontogiorgis and Meyer 1998) have shown that, if the fixed penalty ρ is chosen too small or too large,
the solution time can increase significantly. As a remedy, some heuristics have been developed to adapt
ρ during the optimization process, where the key idea is to balance primal and dual residuals (He
et al. 2000, Wang and Liao 2001, Boyd et al. 2011). A simplified version of their adaption strategy
can be summarized as
$$\rho^{t+1} = \begin{cases} \tau \rho^t & \text{if } \|r_p^t\| \ge \mu \|r_d^t\|, \\ \frac{1}{\tau} \rho^t & \text{if } \|r_d^t\| \ge \mu \|r_p^t\|, \\ \rho^t & \text{otherwise}, \end{cases} \qquad (13)$$
where
$$r_p^t := Ax^t + s^t - b \in \mathbb{R}^m \quad \text{and} \quad r_d^t := \rho^t A^T (s^t - s^{t-1}) \in \mathbb{R}^n$$
denote the primal and dual residuals, respectively, and $\mu, \tau > 1$ are hyperparameters.
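A minimal NumPy sketch of this residual-balancing rule, embedded in a generic ADMM loop, is shown below. The proximal updates `x_update` and `s_update` are hypothetical problem-specific callables that perform the augmented-Lagrangian minimizations in (12), and the values mu = 10 and tau = 2 are common illustrative choices rather than values prescribed by the cited works.

```python
import numpy as np

def admm_with_adaptive_rho(A, b, x_update, s_update, rho=1.0,
                           mu=10.0, tau=2.0, max_iter=200, tol=1e-6):
    """Generic ADMM for min f(x) + g(s) s.t. Ax + s = b, with the
    residual-balancing penalty adaptation of Equation (13).
    x_update(s, y, rho) and s_update(x, y, rho) are hypothetical callables
    performing the two augmented-Lagrangian minimizations in (12)."""
    m, n = A.shape
    x, s, y = np.zeros(n), np.zeros(m), np.zeros(m)
    for _ in range(max_iter):
        x = x_update(s, y, rho)
        s_prev, s = s, s_update(x, y, rho)
        y = y + rho * (A @ x + s - b)

        r_p = A @ x + s - b                    # primal residual
        r_d = rho * A.T @ (s - s_prev)         # dual residual
        if np.linalg.norm(r_p) < tol and np.linalg.norm(r_d) < tol:
            break
        # Residual balancing, Equation (13).
        if np.linalg.norm(r_p) >= mu * np.linalg.norm(r_d):
            rho *= tau
        elif np.linalg.norm(r_d) >= mu * np.linalg.norm(r_p):
            rho /= tau
    return x, s, y
```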
In this section, we review the literature aiming to learn a parameterized adaptation rule. Zeng et al. (2022) propose a reinforcement-learning-based penalty-tuning strategy for solving distributed optimal power flow problems (Mhanna et al. 2018). Following the general introduction to reinforcement learning (see Section 2.3), we only need to specify the state space S, the action space A, and the reward function R : S × A → R. Zeng et al. (2022) set the state to include the past k-point history of primal and dual residuals, i.e.,
$$s^t = \left[(r_p^{t-k+1}, r_d^{t-k+1}), \ldots, (r_p^t, r_d^t)\right] \in \mathbb{R}^{k \times (m+n)} = S.$$

The problems considered by the authors have the same dimension, and thus k, m, n are fixed. For the
action space, the authors set it to be a discrete set of pre-determined possible penalty values, i.e.,

A = {ρ1 , . . . , ρd } with ρ1 < · · · < ρd .

For the reward function, the authors consider two parts: termination and comparison with the base-
line. For termination, the authors design the following reward function
$$R_{\mathrm{termination}}(s^t, \rho^t) = \begin{cases} 200 & \text{if } \|r_p^{t+1}\| < \epsilon_p \text{ and } \|r_d^{t+1}\| < \epsilon_d, \\ 0 & \text{otherwise}, \end{cases} \qquad (14)$$

where $\epsilon_p$ and $\epsilon_d$ are small tolerances used to check termination. For the comparison with the
baseline, the authors set the baseline to be the regular parameter-tuning rule, e.g. (13), and then they
design the following reward function as the relative advantage of the RL policy over the baseline

$$R_{\mathrm{comparison}}(s^t, \rho^t) = \frac{\|\hat{r}_p^{t+1}\| - \|r_p^{t+1}\|}{\|\hat{r}_p^{t+1}\|} + \frac{\|\hat{r}_d^{t+1}\| - \|r_d^{t+1}\|}{\|\hat{r}_d^{t+1}\|},$$
where $\hat{r}_p^{t+1}$ and $\hat{r}_d^{t+1}$ are the primal and dual residuals obtained by the baseline. Then the overall reward function is defined as the summation
$$R(s^t, \rho^t) = R_{\mathrm{termination}}(s^t, \rho^t) + R_{\mathrm{comparison}}(s^t, \rho^t).$$

Ichnowski et al. (2021) propose a reinforcement-learning-based penalty-tuning strategy for solving quadratic programming problems in the following form:
$$\min_{x \in \mathbb{R}^n,\, s \in \mathbb{R}^m} \ \frac{1}{2} x^T Q x + q^T x \quad \text{s.t.} \quad Ax = s, \ \ l \le s \le u, \qquad (15)$$
where $Q \in \mathbb{R}^{n \times n}$ is a symmetric positive semi-definite matrix, $q \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, and $l, u \in \mathbb{R}^m$.


The ADMM is one of the widely adopted optimization algorithms for solving QP Problem (15), e.g.,
implemented in the OSQP solver (Stellato et al. 2020). Unlike the standard ADMM, the variant used
in OSQP has m penalty parameters, i.e., ρ ∈ Rm . To tackle this issue, Ichnowski et al. (2021) adopted
the multi-agent single-policy RL (Huang et al. 2020). More specifically, the authors design m agents
for predicting entries in ρ and all the agents share the same policy. For agent i ∈ [m], the state is
defined by
$$s_i^t = \begin{bmatrix} \min(s_i^t - l_i,\, u_i - s_i^t) \\ (Ax^t)_i - s_i^t \\ y_i^t \\ \rho_i^t \\ \|r_p^t\| \\ \|r_d^t\| \end{bmatrix} \in \mathbb{R}^6.$$
Thus, the state space S contains 6 elements, i.e., S = R6 . Note that the dimension of the state space
is independent of the problem size, and thus this approach can handle different QP problems. The
design of the action space is similar to Zeng et al. (2022). For the reward function, the authors simply
set it as the termination reward as in Equation (14).
A drawback of the approach proposed by Ichnowski et al. (2021) is that the state representation
is local for each agent and penalty parameter, and thus is insufficient to capture the contextual
information. To resolve this issue, Jung et al. (2022) extended this framework by using the graph
representation of QP as the state representation and then the corresponding Q function is modelled
by message-passing on the graph. See Section 2.1 for a more detailed introduction to the graph
representation of QP.

6.3 Enhancement for column-generation methods


Column generation (CG) is an algorithm for solving linear programs (LPs) with a prohibitively large
number of variables (i.e., columns) (Desaulniers et al. 2006). CG starts by solving a restricted master
problem (RMP) with a subset of columns and gradually generates new columns that can improve the

Paper Problem type Methodology
Zeng et al. (2022) distributed OPF RL
Ichnowski et al. (2021) QP Multi-agent single-policy RL
Jung et al. (2022) QP GNN + RL

Table 4: Summary of works using AI techniques to enhance ADMM-based methods.

Figure 8: Illustration of the column generation (CG) algorithm and how an AI model assists CG. The credits for panels (a) and (b) are attributed to Chi et al. (2022).

solution of the current RMP. The method stops when no such columns exist. More specifically, we
consider the following master LP problem
$$[\mathrm{MP}] \qquad \min_{x \in \mathbb{R}^{|\Omega|}} \ \sum_{p \in \Omega} c_p x_p \quad \text{s.t.} \ \sum_{p \in \Omega} x_p a_p = b, \quad x_p \ge 0, \ \forall p \in \Omega,$$
where $\Omega$ represents the set of variable indices. We consider the case when $|\Omega|$ is large and the variables can not be enumerated explicitly. The CG algorithm starts with a subset $F \subset \Omega$ of variables and deals with the following restricted master problem [RMP]
$$[\mathrm{RMP}] \qquad \min_{x \in \mathbb{R}^{|F|}} \ \sum_{p \in F} c_p x_p \quad \text{s.t.} \ \sum_{p \in F} x_p a_p = b, \quad x_p \ge 0, \ \forall p \in F.$$
Let $y_F \in \mathbb{R}^m$ denote the optimal dual solution of the [RMP]. Then new columns with negative reduced cost will be generated by solving another subproblem, called the pricing subproblem (PSP) or column generation subproblem (CGSP). After solving [PSP], a set of new columns $G$ is generated. $G$ must consist of negative-reduced-cost columns, i.e., $G \subseteq \{p \in \Omega \mid c_p - y_F^T a_p < 0\}$. In practice, enumerating all possible columns $p \in \Omega$ is computationally infeasible. Thus, $a_p$ is usually implicitly specified by
the constraint of pricing problem [PSP]. The objective of the [PSP] is to incorporate dual values and
identify the most promising column(s) that can augment the [RMP] solution. Then, the CG algorithm
updates the index set by F ← F ∪ G, and repeats the above process until no such column exists. The
bottleneck of CG is [PSP], since [PSP] usually involves integer variables. Meanwhile, the total number
of CG iterations largely depends on the strategy of choosing the negative-reduced-cost columns. An
illustration of the CG algorithm is shown in Figure 8(a).
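For orientation, a minimal Python sketch of this loop is given below; `solve_rmp` and `solve_pricing` are hypothetical callables that solve [RMP] (returning the primal solution together with the dual vector y_F) and [PSP] (returning negative-reduced-cost columns, or an empty list when none exist), respectively.

```python
def column_generation(initial_columns, solve_rmp, solve_pricing, max_iter=1000):
    """Sketch of the column generation loop described above (helpers are assumed).

    initial_columns : initial index set F (each column carrying its data a_p, c_p)
    solve_rmp       : F -> (primal solution x, dual vector y_F)
    solve_pricing   : y_F -> list of columns with negative reduced cost
                      (empty list when no such column exists)
    """
    F = list(initial_columns)
    for _ in range(max_iter):
        x, y_F = solve_rmp(F)             # solve the restricted master problem
        new_columns = solve_pricing(y_F)  # pricing subproblem [PSP]
        if not new_columns:               # no negative-reduced-cost column: stop
            return x, F
        F.extend(new_columns)             # F <- F ∪ G, then repeat
    return solve_rmp(F)[0], F
```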

Desaulniers et al. (2020) pioneered the research direction on using AI techniques to select columns
in the CG algorithm. As shown in Figure 8(b), the [PSP] first heuristically or exactly provides
candidate columns $\bar{G} \subseteq \{p \in \Omega \mid c_p - y_F^T a_p < 0\}$. For optimization problems like the vehicle and crew scheduling problem and the vehicle routing problem with time windows, it is possible to generate multiple negative-reduced-cost columns at no extra computational effort. Given the candidate columns $\bar{G}$, we want to select a proper subset $G \subseteq \bar{G}$. The challenge in this task is that we do not want the size of $G$ to be very large, as it may make the next restricted master problem hard to solve. In the meantime, we want $G$ to contain more "promising" columns.
the importance of columns can be obtained by solving the following mixed integer linear program
(MILP):
$$\begin{aligned} \min_{x \in \mathbb{R}^{|F \cup \bar{G}|},\ y \in \{0,1\}^{\bar{G}}} \quad & \sum_{p \in F \cup \bar{G}} c_p x_p + \lambda \sum_{p \in \bar{G}} y_p && (16a) \\ \text{s.t.} \quad & \sum_{p \in F \cup \bar{G}} x_p a_p = b, && (16b) \\ & x_p \ge 0, \ \forall p \in F \cup \bar{G}, && (16c) \\ & x_p \le y_p, \ \forall p \in \bar{G}, && (16d) \end{aligned}$$
where the binary variables $y_p$ decide whether column $p$ should be selected for the next RMP. More specifically, after solving problem (16), if $y_p = 1$, then column $p$ is a candidate for selection; otherwise, column $p$ is excluded. The hyperparameter $\lambda$ in (16a) balances the trade-off between minimizing the [RMP] objective and controlling the size of $G$. In other words, by solving Problem (16), we can derive a better subset $G$ with two advantages. Firstly, the subsequent [RMP] objective is greatly reduced, leading to an improved solution for the master problem. Secondly, the size of the resulting subset remains manageable. Once the MILP is solved, the subset $G$ is determined by
$$G = \{p \in \bar{G} \mid y_p = 1\}.$$

However, solving a MILP at each iteration can be time-consuming in practice. Thus, the authors suggest imitating this MILP using AI techniques. More specifically, they represent the MILP as a bipartite graph like the one introduced in Section 2.1 (see Figure 2). Then they train a standard GNN to imitate the optimal $\{y_p\}_{p \in \bar{G}}$ from the MILP (16). This is a binary classification task. In the
inference stage, the GNN will predict the probability that a generated column p should be selected,
i.e., P r(yp = 1). If this probability is greater than 0.5, then column p is considered promising and
added to the [RMP] in the next iteration.
Alternatively, Chi et al. (2022) propose a reinforcement-learning-based approach to select columns
in the CG algorithm. More specifically, at each iteration of the CG algorithm, the state s is the bipar-
tite graph representing the current [RMP] plus current iteration information, including the candidate columns $\bar{G}$ with their reduced costs and solution values, and how long a column has stayed in or been out of the basis. In each iteration, the authors want to select one column from the candidate columns, so the action space A is $\bar{G}$. To encourage the RL agent to select columns such that the CG algorithm
converges better and faster, the reward function at each step consists of two components: 1) The
first component $\frac{\mathrm{obj}_{t-1} - \mathrm{obj}_t}{\mathrm{obj}_0}$ gives a higher score if taking action $a_t$ leads to a better outcome, i.e., it
causes a faster decrease in the objective value. Here, objt is the objective value of the RMP at time
step t, and obj0 is the objective value of the [RMP] in the first CG iteration. The latter is used to
normalize (objt−1 − objt ), ensuring that the first component is numerically comparable across different
LP instances. 2) The second component is simply −1, which encourages the agent to prefer shorter
iterations. This component represents the total number of iterations in the cumulative rewards. To
balance these two components, a non-negative hyperparameter α is introduced. The reward at time
step t is then defined as
$$R_t = \alpha \, \frac{\mathrm{obj}_{t-1} - \mathrm{obj}_t}{\mathrm{obj}_0} - 1.$$
Chi et al. (2022) also use the bipartite graph to represent the [RMP]. Given the bipartite graphs
with node features, GNNs are used as the Q-function approximation. Given a particular state, the

Q-function estimates the expected future reward (Q-values) for all possible actions, i.e., all candidate
columns. GNNs are capable of capturing the complex relationships between nodes in the graph, which
makes them suitable for this task. Note that Chi et al. (2022) select only one column at each iteration,
so the action with the maximum Q value is selected to add to the [RMP]. Since the Q-function will
consider future rewards, the agent can make a better column selection at each step. Compared with
the heuristic (greedy) strategy for selecting columns, the RL-based algorithm converges faster in terms
of the number of iterations and total time. Chi et al. (2022) also mention that adopting a curriculum
learning paradigm improves the learning of the RL agent and results in better column selection and
faster convergence. This is crucial to convergence when the training set contains instances of varying
difficulties.
Besides selecting from candidate columns, the AI model can also directly generate columns. Shen
et al. (2022) consider the graph coloring problem. A column corresponds to a maximal independent
set (MIS). The pricing problem for generating a MIS is an NP-hard Maximum Weight Independent
Set Problem. Shen et al. (2022) leverage an AI-based primal heuristic to generate a near-optimal MIS
from the pricing problem. Given the feature vector of each vertex, this AI-based heuristic estimates
the probability of each vertex being selected into the MIS. It is then used to guide a sampling method
to efficiently generate multiple high-quality MISs (i.e., columns). The sampling method ensures the
column is indeed an MIS by sequentially adding vertices, marking adjacent vertices as invalid, and adjusting the probability of selecting the remaining vertices. The diversity of multiple MISs is achieved by randomly selecting the starting vertex. By generating high-quality columns efficiently, this machine-learning-based pricing heuristic (MLPH) significantly accelerates the progress of CG and reduces CG iterations, especially for larger and denser
graphs.
To conclude, the different AI models assist CG by either selecting more promising columns or
directly generating columns. The former category naturally ensures the candidate columns are valid
and can reduce the CG iterations further if RL or curriculum learning is adopted. However, the
time for solving [PSP] is irreducible. In contrast, the latter category replaces solving [PSP] with an
AI-model inference and thus has greater potential. If the latter category is further augmented with
an RL algorithm, it may achieve greater end-to-end speedup. Its drawback is that the AI model
cannot directly generate a column feasible to [PSP]. An ad-hoc sampling step has to be designed to
mitigate the gap between AI prediction and feasible column generation, for each type of problem and
the corresponding [PSP]. It would be interesting to explore the possibility of either designing a unified
sampling step or developing an end-to-end AI model capable of generating the columns.

6.4 Enhancement for simplex method


The simplex algorithm is a classical method for solving linear programming problems of the form

$$\min_{x \in \mathbb{R}^n} \ c^T x \quad \text{s.t.} \ Ax = b, \ \ x \ge 0.$$

In the following, we will first briefly introduce the primal simplex algorithm. It consists of two
crucial components: basis initialization and pivoting. Typically, the introductory textbooks assume
that the primal simplex algorithm begins with a primal feasible basis. This assumption is reasonable
as techniques like the big-M method are able to address the infeasibility. Nonetheless, it remains
a question whether the initial basis is “good,” i.e., whether it is close to and converges rapidly to
the optimal solution. After basis initialization, the primal simplex algorithm repeats a pivoting
process, selecting the entering basis variable and the leaving basis variable. Geometrically, each basis
corresponds to a feasible solution and a vertex in the feasible region (a polyhedron). Accordingly, the
simplex algorithm starts from the initial solution vertex and each pivot corresponds to a step transition
towards adjacent vertices until it reaches the optimal solution vertex. To be specific, the simplex method starts from an initial feasible basis $B = [B_x; B_s]$ containing the indices of the basic variables and the basic slack (constraint) variables. At each iteration, we first divide the variable $x \in \mathbb{R}^n$ (assume the slacks are also
included) into x = [xB ; xN ], where xB and xN represent the basic and non-basic variables, respectively.

The matrix A = [B; N ] and the vector c = [cB ; cN ] are also divided correspondingly, where B is the
matrix of coefficients for the basic variables and N is the matrix of coefficients for the non-basic
variables. It is required that each column of the matrix B be linearly independent. Substituting the
constraint Ax = b into the objective function, we obtain cT x = cTB B −1 b + (cTN − cTB B −1 N )xN , where
cTB and cTN represent the transpose of the vectors cB and cN , respectively. The second term is called
the reduced cost vector c̄ = cTN − cTB B −1 N , which represents the cost of increasing the value of the
non-basic variables. The non-basic variables corresponding to the negative components of c̄ can cause
a decrease in the objective function. Therefore, the selection of the entering basis variable is the
process of selecting a non-basic variable corresponding to a negative component of c̄. Different pivot
rules provide different methods for selecting the entering basis variable. The essence of the pivot rule
of the simplex algorithm is to convert the status of a certain column between basic and non-basic.
Once the basic variables are determined, we can use xB = B −1 b − B −1 N xN ≥ 0 to derive the leaving
basis variable. The pivoting process is repeated until the basis corresponding to the optimal solution
is obtained, i.e., the end of the simplex algorithm.
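To make the pivoting machinery concrete, the sketch below performs one primal simplex iteration with dense linear algebra and the Dantzig rule: it computes the reduced costs, selects the entering variable with the most negative reduced cost, and applies the ratio test to find the leaving variable. This is a numerically naive illustration (no degeneracy handling, factorization updates, or anti-cycling safeguards), not the implementation used in the works discussed below.

```python
import numpy as np

def simplex_pivot_once(A, b, c, basis):
    """One primal simplex iteration (Dantzig rule) for min c^T x, Ax = b, x >= 0.
    `basis` is the list of indices of the current basic variables.
    Returns the updated basis, or None if the current basis is already optimal."""
    n = A.shape[1]
    nonbasic = [j for j in range(n) if j not in basis]
    B, N = A[:, basis], A[:, nonbasic]

    x_B = np.linalg.solve(B, b)                   # current basic solution
    y = np.linalg.solve(B.T, c[basis])            # simplex multipliers
    reduced = c[nonbasic] - N.T @ y               # reduced cost vector c_bar

    if np.all(reduced >= -1e-9):                  # no improving column: optimal
        return None
    enter = nonbasic[int(np.argmin(reduced))]     # Dantzig rule: most negative

    d = np.linalg.solve(B, A[:, enter])           # change of x_B per unit of x_enter
    pos = d > 1e-9
    if not np.any(pos):
        raise ValueError("problem is unbounded along the entering direction")
    ratios = np.full_like(x_B, np.inf)
    ratios[pos] = x_B[pos] / d[pos]               # ratio test
    leave_pos = int(np.argmin(ratios))

    new_basis = list(basis)
    new_basis[leave_pos] = enter                  # swap leaving and entering variable
    return new_basis
```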
For basis initialization, Fan et al. (2023) propose a GNN-based initial basis selection strategy.
They first represent an LP as a bipartite graph and convert the basis selection task into a classification
task, i.e., determine the basis status for each constraint and variable. The prediction must ensure that the predicted basis status always satisfies the bound requirements; e.g., a free variable (with unbounded lower and upper bounds) can only be basic, and a variable with an unbounded upper bound will not attain its upper bound. Then, in the inference stage, Fan et al. (2023) propose basis generation and adjustment steps ensuring that the basis is valid, i.e., that the corresponding constraint matrix is non-singular. Extensive
experiments are conducted, including large-scale industrial cases, to demonstrate the performance
of the GNN-based initial basis selection strategy over traditional initialization heuristics that fail to
produce a close-to-optimal basis (Ploskas et al. 2021). In contrast, the GNN-based method utilizes
past solved linear programs and smartly builds a close-to-optimal basis.
The selection of the entering basis variable, commonly known as the pivoting strategy, plays a
crucial role in determining the efficiency of the simplex method. Various classical pivoting strategies
(Dantzig and Thapa 2003, Forrest and Goldfarb 1992) have been proposed in the literature, and
their effectiveness has been evaluated empirically. The Dantzig pivoting rule (Dantzig and Thapa
2003) is a popular strategy that selects the non-basic variable with the most negative reduced cost
as the entering variable. In contrast, the steepest edge pivoting rule (Forrest and Goldfarb 1992)
selects the non-basic variable with the largest rate of decrease of the objective value per unit distance
travelled along the improving edge. The choice of pivoting strategy is highly dependent on the specific
problem at hand, and different strategies may yield significantly different results. Suriyanarayana
et al. (2022) use reinforcement learning techniques to switch between Dantzig’s rule and the steepest
edge rule. Namely, at each iteration of the simplex algorithm, the trained agent will select one of
the two above-mentioned pivoting rules. Alternatively, Li et al. (2022) aim to learn a new pivoting
strategy through the application of the Monte Carlo Tree Search (MCTS) method. Their contribution
focuses on four core aspects, including transforming the simplex method into a pseudo-tree structure,
constructing appropriate reinforcement learning models, finding the optimal pivot sequence under
the guarantee of theory, and providing a complete method for discovering multiple optimal pivot
paths. The study proposes a novel imitative tree structure, SimplexPseudoTree, for the exploration
of optimal pivot paths, and constructs four reinforcement-learning models to determine the optimal
pivot paths based on the MCTS method. The research provides theoretical analysis as well as computational experiments to demonstrate the optimality of the proposed MCTS rule. The MCTS rule can avoid
unnecessary searches and determine the shortest pivot paths of the simplex method, leading to more
efficient problem-solving in the context of linear programming, namely solving the problem with fewer
iterations.

7 Algorithm Selection and Design for Discrete Optimization


Consider an optimization problem of minimizing f over some finite set X , i.e.,

$$\min_{x \in X} f(x).$$

Figure 9: The steps of the branch-and-bound algorithm for mixed-integer linear programming problems.

The branch-and-bound (B&B) method, initially proposed by Land and Doig (2010), recursively divides
the finite feasible region X into its subsets X1 , X2 , . . . , Xp until no more division is possible such that
$$X = \bigcup_{i=1}^{p} X_i.$$

All these subsets form a B&B tree. The key assumption in the B&B method is that for every subset
or node Xi , there is an algorithm that can calculate

• A lower bound, i.e., $\underline{f}_{X_i} \le \min_{x \in X_i} f(x)$.

• An upper bound provided by each feasible point $x \in X_i$, i.e., $\bar{f}_{X_i} = f(x) \ge \min_{x \in X_i} f(x)$.

The basic idea behind the B&B algorithm is that, if for two subsets $X_1, X_2 \subseteq X$,
$$\bar{f}_{X_2} \le \underline{f}_{X_1},$$

then the solutions in X1 can be disregarded. More specifically, in each iteration, the B&B algorithm
will pick a child node Xi , which is called node selection, and check whether the lower bound exceeds
the best available upper bound. If it exceeds, then the B&B algorithm can safely remove the current
node from the B&B tree, and continue with the next child node. If it doesn’t exceed, then the B&B
will divide the current node by branching on a variable. This step is also called variable selection.

7.1 Mixed-integer linear programming


Mixed-integer linear programming problems are the core of discrete optimization because they can
model a wide variety of problems in different applications. A MIP is an optimization problem of the
form

$$\begin{aligned} p^\star = \min_{x \in \mathbb{Z}^d \times \mathbb{R}^{n-d}} \ & c^T x && (17a) \\ \text{s.t.} \ & Ax = b, && (17b) \\ & x \ge 0, && (17c) \end{aligned}$$
where $A \in \mathbb{R}^{m \times n}$. When applying the B&B algorithm to solve a MIP problem, one first relaxes the integrality constraint and obtains a linear program (LP)
$$\begin{aligned} \underline{p} = \min_{x \in \mathbb{R}^n} \ & c^T x && (18a) \\ \text{s.t.} \ & Ax = b, && (18b) \\ & x \ge 0, && (18c) \end{aligned}$$

whose solution provides a lower bound, i.e., $\underline{p} \le p^\star$. Let $\bar{x}$ denote the minimizer of the LP relaxation. If $\bar{x}$ satisfies the integrality constraint of the original MILP problem, then $\bar{x}$ is a global solution, and

we are done. Otherwise, we need to decompose the LP relaxation into two sub-problems by selecting
a variable violating the integrality constraint, i.e.,
$$j \in [p] \ \text{such that} \ \bar{x}_j \notin \mathbb{Z}.$$
The two sub-problems have the form
$$p_j^- = \min_{x \in \mathbb{R}^n} \ c^T x \quad \text{s.t.} \ Ax = b, \ \ x_i \ge 0 \ \forall i \ne j, \ \ 0 \le x_j \le \lfloor \bar{x}_j \rfloor$$
and
$$p_j^+ = \min_{x \in \mathbb{R}^n} \ c^T x \quad \text{s.t.} \ Ax = b, \ \ x_i \ge 0 \ \forall i \ne j, \ \ x_j \ge \lceil \bar{x}_j \rceil.$$
The variable selection then refers to the process of selecting variable j. Then the B&B algorithm is
going to pick one of the child LP problems and continue the above process. The node selection then
refers to picking the child LP problem.
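A tiny, self-contained sketch of this recursion is given below, using SciPy's LP solver for the relaxations. The depth-first node selection and the "most fractional" variable selection are illustrative placeholders for the strategies discussed in the remainder of this section, and no presolve, cutting planes, or warm starts are used.

```python
import math
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A_eq, b_eq, integer_idx, bounds=None, tol=1e-6):
    """Minimal B&B sketch for min c^T x s.t. A_eq x = b_eq, x >= 0, with x_j
    integer for j in integer_idx (illustrative node/variable selection only)."""
    n = len(c)
    if bounds is None:
        bounds = [(0, None)] * n
    best_val, best_x = math.inf, None
    stack = [bounds]                                    # each node = variable bounds

    while stack:
        node_bounds = stack.pop()                       # node selection (depth-first)
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=node_bounds)
        if not res.success or res.fun >= best_val:      # infeasible or pruned by bound
            continue
        frac = {j: abs(res.x[j] - round(res.x[j])) for j in integer_idx}
        j, f = max(frac.items(), key=lambda kv: kv[1])  # variable selection
        if f <= tol:                                    # integral: update incumbent
            best_val, best_x = res.fun, res.x
            continue
        lo, hi = node_bounds[j]
        down = list(node_bounds); down[j] = (lo, math.floor(res.x[j]))
        up = list(node_bounds);   up[j] = (math.ceil(res.x[j]), hi)
        stack.extend([down, up])                        # branch on variable j

    return best_val, best_x
```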
Node and variable selection largely affect the performance of the B&B algorithm (Huang et al.
2021). Many machine-learning-based approaches have been developed to assist node selection (He
et al. 2014, Song et al. 2018, Sabharwal et al. 2012) and variable selection (Khalil et al. 2016, Alvarez
et al. 2017, Di Liberto et al. 2016, Balcan et al. 2018, Gasse et al. 2019, Gupta et al. 2020, 2022,
Zarpellon et al. 2021, Qu et al. 2022, Etheve et al. 2020, Sun et al. 2020) in the B&B algorithm for
solving MIP problems. In the next two sections, we provide an overview of these approaches.

7.1.1 Variable selection


Variable selection is a crucial task in the branch-and-bound algorithm. It determines the way in
which a current node is partitioned into two child nodes in a recursive manner, by choosing which
fractional variables, also known as candidate variables, to branch on. The goal of an effective branching
strategy is to minimize the number of explored nodes before the termination of the B&B algorithm.
The variable selection plays a vital role in the B&B method and has led to the development of various
heuristics over time (Achterberg et al. 2005, Linderoth and Savelsbergh 1999).
The simplest heuristic for variable selection is the most-infeasible branching rule (Achterberg et al.
2005). It suggests branching on the variable with the greatest fractional part. For binary variables,
this corresponds to selecting the variable whose value in the LP relaxation is furthest from being
an integer, e.g., 0.5. The intuition behind this is that by branching on the most fractional variable,
the most “ambiguous” part of the current solution may be prioritized, hoping to quickly converge to
an integer solution. However, this method has been shown to perform poorly in practice. Another
popular heuristic is pseudocost branching (Bénichou et al. 1971), which uses a history of increase
in the dual bounds observed during previous branching to estimate the dual bound improvements
for each candidate variable at the current node. Although its performance improves as the B&B
algorithm progresses, it usually performs poorly in the early stages. Strong branching (Applegate
et al. 1995) is another well-known heuristic. It evaluates the dual bound increase for each fractional
variable by computing the linear programming relaxations resulting from branching on that variable.
The variable that results in the largest increase is then selected as the branching variable for the
current node. Although strong branching can produce a B&B tree with a small number of nodes, its
high computational cost often makes it intractable in practice.
In recent years, several machine-learning-based variable selection strategies have been developed.
These strategies can be divided into three categories: 1) models that switch between different branch-
ing rules (Di Liberto et al. 2016, Balcan et al. 2018), 2) models that mimic a strong but expensive

branching rule (Khalil et al. 2016, Alvarez et al. 2017, Gasse et al. 2019, Gupta et al. 2020, 2022,
Zarpellon et al. 2021), and 3) the utilization of reinforcement learning to learn a new branching
strategy (Qu et al. 2022, Sun et al. 2020, Etheve et al. 2020). These AI-based approaches aim to
overcome the limitations of the classical variable selection rules, such as the most-infeasible branching
and pseudocost branching, by providing a more adaptive and flexible approach to variable selection
in the B&B algorithm.

Switching Between Different Branching Rules Di Liberto et al. (2016) conducted a study
on the dynamic and sequential nature of branch-and-bound algorithms used in mixed-integer linear
programming problems. They show that no single branching rule could perform optimally across
different subproblems of the same MILP. This observation motivated the development of the Dynamic
Approach for Switching Heuristics (DASH) algorithm. DASH employs a two-step approach, where
the first step involves clustering problems based on defined features using the K-means algorithm, and
the second step involves learning the correct assignment of branching rules to each cluster during an
offline training phase. The algorithm adapts to changes in the instance as the search depth increases,
switching to a new branching rule that best fits the current cluster.
Rather than selecting only one branching rule, an alternative approach proposed by Balcan et al.
(2018) is combining multiple branching rules that are score-based. It means the rule relies on a
quantitative “score” assigned to each variable to determine its priority in the branching process.
Take the most-infeasible branching rule mentioned above as an example. Denote the solution of the LP relaxation associated with the current node as $\bar{x}$. The score of the $i$-th candidate variable is $\mathrm{score}(x_i) = \min(\lceil \bar{x}_i \rceil - \bar{x}_i,\ \bar{x}_i - \lfloor \bar{x}_i \rfloor)$. Given the scores from different score-based branching rules, Balcan et al. (2018) proposed a learning-based approach where scores are combined and weighted to accelerate the B&B algorithm.
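As a small illustration, the snippet below computes the most-infeasible score for each candidate variable and forms a weighted combination with a second, dummy scoring rule, which is the kind of weighted score combination considered by Balcan et al. (2018); the second rule and the weights are placeholders for learned quantities, not values from the paper.

```python
import numpy as np

def most_infeasible_score(x_relax):
    # score(x_i) = min(ceil(x_i) - x_i, x_i - floor(x_i)) for each variable.
    frac = x_relax - np.floor(x_relax)
    return np.minimum(1.0 - frac, frac)

def other_rule_score(x_relax):
    # Placeholder for another score-based rule (e.g., pseudocost estimates).
    return np.abs(np.sin(10 * x_relax))       # dummy values, for illustration only

def combined_branching_choice(x_relax, candidates, weights=(0.7, 0.3)):
    """Pick the candidate variable maximizing a weighted sum of rule scores."""
    scores = (weights[0] * most_infeasible_score(x_relax)
              + weights[1] * other_rule_score(x_relax))
    return max(candidates, key=lambda j: scores[j])

x_relax = np.array([0.0, 2.3, 1.5, 0.9])       # toy LP-relaxation solution
candidates = [1, 2, 3]                          # indices of fractional variables
print("branch on variable", combined_branching_choice(x_relax, candidates))
```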

Imitation Learning As mentioned earlier, strong branching is a well-known variable selection strat-
egy in B&B algorithms, which is known for its remarkable practical performance. More specifically,
let
$$C \subseteq \{i \in [p] \mid \bar{x}_i \notin \mathbb{Z}\}$$
denote the set of branching candidates. The key idea of strong branching is to calculate a score $\mathrm{score}_i$ for every possible candidate $i \in C$ and then select the one with the highest score, i.e.,
$$i^\star = \arg\max_{i \in C} \ \mathrm{score}_i.$$
Typically, the score is defined by the improvement in the lower bound, i.e.,
$$\mathrm{score}_i = \max\{p_i^- - \underline{p},\ \epsilon\} \cdot \max\{p_i^+ - \underline{p},\ \epsilon\},$$

where ϵ is some small constant. However, the implementation of strong branching is computationally
demanding, as it requires the resolution of two LP problems for each candidate variable. To address
this issue, several researchers have explored the use of AI techniques to mimic the strong branching rule
in B&B algorithms. Khalil et al. (2016) proposed the first work in this direction, where they developed
an SVM model that learns a branching rule customized to a single instance during the B&B process.
This process involves collecting a set of B&B nodes and performing strong branching on these nodes
to obtain the ranking of the candidate variables. The variables are then categorized into “good” and
“bad” based on the scores, and the SVM model is trained to identify “good” variables. Alvarez et al.
(2017) introduced a two-phased approach, which results in a “learned” branching strategy that can
be used as an approximation of strong branching within the B&B algorithm. The first phase involves
solving a set of training problems with strong branching as a branching heuristic and recording each
branching decision in a training set. In the second phase, the learned heuristic is introduced into B&B
and evaluated for efficiency on a set of test problems. Gasse et al. (2019) utilized a GNN model to
tackle the variable selection problem in B&B. This model takes in the bipartite graph representation of
an MILP. This representation includes nodes representing both constraints and variables, along with
their respective features and connectivity. To be specific, constraint node features can be the dual solutions from the LP relaxation and cosine similarities between each constraint's coefficients and the objective coefficients; variable node features can be the objective coefficients and the lower and upper bounds of the variables. Due to the graph representation and the GNN model, MILPs with varied sizes can be
handled. The GNN is trained to approximate strong branching using imitation learning and has been
shown to improve upon previous branching approaches for several MILP problem benchmarks and
is competitive with state-of-the-art B&B solvers. However, this method requires a high-end GPU
card to speed up the GNN inference time, which is not always feasible for MILP practitioners. To
overcome this limitation, Gupta et al. (2020) studied the time-accuracy trade-off in learning to branch,
and proposed a hybrid architecture that uses a GNN model at the root node and a fast but weak
predictor, such as a Multi-Layer Perceptron (MLP), at the remaining nodes. This approach enhances
the weak model with high-level structural information extracted at the root node by the GNN model.
Gupta et al. (2022) later found that the strong branching heuristic often leads to a child node’s best
choice being the parent’s second-best choice, known as the “lookback” phenomenon. To imitate this
behaviour more closely, they proposed two methods that incorporate the lookback phenomenon into
GNN training by adding a lookback regularization term. Finally, Zarpellon et al. (2021) also inherit
the imitation learning framework, but innovatively propose to utilize the information of B&B search
trees. They believe that many MILPs share similarities in terms of the search trees. However,
there is no natural parameterization of the search tree. To address this gap, a set of 61 hand-
crafted input features is proposed to describe candidate variables in terms of their roles in the B&B
process. These tree features capture various aspects, such as the current node’s depth and bound
quality, the tree’s growth rate and composition, the evolution of global bounds, aggregated variables’
scores, statistics on feasible solutions, and depths of open nodes. Experimental evidence suggests
that explicitly incorporating these features enhances the accuracy of the learned agent and the agent
effectively helps reduce the size of the search tree. The proposed method outperforms the current
state-of-the-art approach and allows for generalization to unseen MILP instances. This generalization
empirically verifies that unseen instances share search trees similar to those of the training instances.
In conclusion, all of the above works aim to mimic the strong branching rule for solving MIPs
from a specific domain and have achieved promising results. However, more research is needed to fully
understand the potential and limitations of these methods to advance the field of variable selection
in B&B algorithms.
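As a concrete reference for the expert that the above methods imitate, the following sketch computes strong branching scores by solving the two child LPs of every fractional candidate. It is a hypothetical SciPy-based illustration of the scoring rule above, assuming an LP of the form min c^T x s.t. A_ub x ≤ b_ub with variable bounds, and is not code from any of the cited works.

```python
import numpy as np
from scipy.optimize import linprog

def strong_branching(c, A_ub, b_ub, bounds, x_lp, candidates, eps=1e-6):
    """Strong branching sketch: for each fractional candidate i, solve the two
    child LPs (x_i <= floor(x_i) and x_i >= ceil(x_i)) and combine the
    lower-bound improvements into the score used above."""
    def lp_bound(bnds):
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bnds, method="highs")
        return res.fun if res.success else np.inf  # infeasible child: infinite bound

    p = lp_bound(bounds)                              # bound of the current node
    scores = {}
    for i in candidates:
        lo, hi = bounds[i]
        down = list(bounds); down[i] = (lo, np.floor(x_lp[i]))   # branch x_i <= floor(x_i)
        up = list(bounds);   up[i] = (np.ceil(x_lp[i]), hi)      # branch x_i >= ceil(x_i)
        # score_i = max(p_i^- - p, eps) * max(p_i^+ - p, eps)
        scores[i] = max(lp_bound(down) - p, eps) * max(lp_bound(up) - p, eps)
    return max(scores, key=scores.get), scores
```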

Reinforcement Learning As stated in Section 2.3, one limitation of imitation learning is that the
performance of the learned strategy is limited by the expertise of the expert. In the following, we
will first describe how the expert (i.e., strong branching) is limited and then discuss the reinforcement
learning-based approach proposed by Sun et al. (2020).
Strong branching is commonly acknowledged as an effective algorithm, primarily because it often
results in the smallest search trees in the B&B algorithm. This compactness in search trees can be
attributed to two main factors: decision quality and the impact of the decision on other nodes. 1)
Decision Quality: strong branching involves making decisions on variable selection. These decisions
are considered of high quality, as they contribute to faster convergence towards the optimal solution.
2) Impact on other nodes: at each node in the search, strong branching will solve several relaxed LPs
before making decisions on variable selection. In the process of solving relaxed LPs, relevant infor-
mation is produced. This relevant information, by default, is not discarded and helps accelerate the B&B
algorithm. We refer to the utilization of this relevant information as the secondary effects. For example, a) strong
branching can evaluate the pruning conditions prior to actual branching, and the relevant information can
help prune some subproblems. b) When solving the LP of the current branch, relevant information
is obtained that can enhance other LP relaxations by eliminating unnecessary constraints.
Among these two factors, one might assume that decision quality plays the more significant role in
accelerating the B&B algorithm. However, the empirical findings of Sun et al. (2020) suggest otherwise.
In their study, the authors disabled the secondary effects in the full strong branching algorithm and
observed that the reduction in tree size was notably less substantial compared to when the secondary
effects were enabled. This implies that the acceleration brought by strong branching primarily stems
from the secondary effects. Meanwhile, imitation learning can only imitate the decisions made by
strong branching, not the secondary effects. As a result, imitating strong branching may not be a

wise choice for learning a variable selection policy.
In response to these findings, Sun et al. (2020) proposed a reinforcement learning-based approach
to model the variable selection process as a Markov Decision Process (MDP). The authors design a
primal-dual policy network, which is similar to the GNN model operating on bipartite representation.
By setting the cumulative reward as the negative value of the total number of solving nodes within the
B&B algorithm, the learned policy is non-myopic, aiming to solve problems in fewer steps. Moreover,
to encourage exploration during the learning process, they introduced a novelty score of the current
policy. This score is determined by examining the policy’s surrounding neighbourhood. A policy is
deemed novel if it significantly deviates from its neighbours. The novelty score is integrated into the
cumulative reward, guiding policy evolution. Such a combination enables the RL agent to navigate
novel states in the B&B process, bypass local optima, and adopt a variety of strategies.
Another limitation of imitation learning lies in the mismatch between demonstration and real
data. Many studies employ imitation learning to emulate the strong branching method and rely
solely on data gathered from expert policies. However, when the learned policy is applied to unseen
instances, it might not always make decisions as accurate as strong branching. Consequently, the
resulting states can diverge from the training data. The experts cannot provide demonstrations for
every potential state the model might encounter as problem characteristics and structures can vary
significantly. Thus, there might exist a mismatch between training data and unseen problem instances.
To address this, Qu et al. (2022) proposed a novel reinforcement learning-based branching algorithm
that trains on training data at the early stage to accelerate the learning process. The model then
updates with a mixture of training and self-generated data to balance exploration and exploitation.
As a byproduct, this approach was also found to overcome the issue of the large variance in gradient
estimation, which is a common challenge in MDP-based approaches.

7.1.2 Node selection


When applying the B&B algorithm to solve MILP problems, the solving process involves breaking
down a problem into smaller sub-problems, referred to as nodes, and selecting which node to process
next. This selection is based on two main goals: finding good feasible MILP solutions to improve the
upper bound and getting good LP relaxations to improve the lower bound. In the literature, various
search methods have been proposed for node selection in the B&B algorithm. One of the earliest
methods is depth-first search, proposed by Dakin (1965), where the node with the maximum depth in
the B&B search tree is selected. This method is efficient in terms of memory consumption. Another
popular node selection method is the best-first search, which is proposed by Hart et al. (1968). This
method selects the node with the currently smallest dual objective value.
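As a simple illustration, the two classical rules can be sketched as follows, assuming each open node exposes hypothetical depth and dual_bound attributes.

```python
def depth_first_select(open_nodes):
    """Depth-first rule (Dakin 1965): pick the open node with maximum depth.
    `node.depth` is a hypothetical attribute of the B&B node."""
    return max(open_nodes, key=lambda node: node.depth)

def best_first_select(open_nodes):
    """Best-first rule (Hart et al. 1968): pick the open node whose LP relaxation
    has the smallest dual (lower-bound) objective value (`node.dual_bound`)."""
    return min(open_nodes, key=lambda node: node.dual_bound)
```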
Recently, various AI-based node selection strategies have been proposed to enhance the perfor-
mance of classical node selection strategies in the branch-and-bound algorithm. He et al. (2014)
introduced an imitation learning method that learns a node selection strategy by observing a small
set of solved problems. The method assumes that the problems at the test time exhibit similar charac-
teristics, such as problem type, size, and parameter distribution, as those observed during the training
time. The node selection policy is designed to repeatedly pick a node from the queue of unexplored
nodes in a manner that mimics a simple oracle, which knows the optimal solution in advance and
only expands nodes containing the optimal solution.
However, in practice, many MILPs are substantially large. This poses a significant challenge for the
implementation of the imitation learning methods discussed earlier (He et al. 2014), as constructing
a training set based on optimal solutions would be extremely time-consuming. To overcome this
challenge, Song et al. (2018) presented a variant approach of imitation learning for node selection. In
this approach, the expert will use a cut-off technique, i.e., a solver runs until a certain node limit is
reached and outputs the best solution found. Then, the shortest path to the resulting solution becomes
the expert demonstration of node selection trajectory. In addition to the cut-off technique, a “gradual
scaling up” technique is applied to enhance the agent’s scalability and generalization in tackling larger
problem instances. Specifically, after the agent is trained on problems of a certain size, larger problems
are generated for the agent to interact with. During this interaction, the agent provides node selection
suggestions within the B&B algorithm, and the best solution found so far serves as expert feedback to
further enhance the agent’s performance. To summarize, this particular variant of imitation learning

• Di Liberto et al. (2016). Goal: learn to switch between different variable selection strategies. Methodology: cluster MILP problems using the K-means algorithm and learn an assignment of branching methods to each cluster. Features: 40 features describing different statistics of the problem.
• Balcan et al. (2018). Goal: learn to combine the scores returned by different existing variable selection strategies. Methodology: empirical risk minimization to find the optimal weight for convex combinations. Features: similar to algorithm configuration, it finds an optimal weight for a certain distribution of MILPs; since the weight only applies to this distribution, it requires no instance-specific features.
• Khalil et al. (2016). Goal: learn to mimic the strong branching strategy. Methodology: imitation learning via SVM. Features: 18 static features and 54 dynamic features.
• Alvarez et al. (2017). Goal: learn to mimic the strong branching strategy. Methodology: imitation learning via random forests. Features: static problem features, dynamic problem features, and dynamic optimization features.
• Gasse et al. (2019). Goal: learn to mimic the strong branching strategy. Methodology: imitation learning using GNN. Features: bipartite graph representation with 5 features for the constraints, 13 features for the variables, and 1 feature for the constraint matrix.
• Gupta et al. (2020). Goal: learn to mimic the strong branching strategy. Methodology: imitation learning using GNN at the root node and MLP at the remaining nodes. Features: same features as Gasse et al. (2019) at the root node and same features as Khalil et al. (2016) at the remaining nodes.
• Gupta et al. (2022). Goal: learn to mimic the strong branching strategy. Methodology: imitation learning via GNN and lookback regularization. Features: same features as Gasse et al. (2019).
• Zarpellon et al. (2021). Goal: learn to mimic SCIP’s default branching strategy. Methodology: imitation learning using a deep neural network. Features: 25 features from the candidate variables and 61 features describing the state of the B&B search tree.
• Sun et al. (2020). Goal: learn a novel branching strategy. Methodology: reinforcement learning with GNN. Features: same features as Gasse et al. (2019).
• Qu et al. (2022). Goal: learn a novel branching strategy. Methodology: reinforcement learning with Double DQN. Features: same features as Gasse et al. (2019).

Table 5: Summary of works using AI techniques to enhance variable selection in the B&B algorithm.

• He et al. (2014). Goal: learn to mimic a theoretically optimal node selection strategy, which knows the optimal solution in advance and only expands nodes containing the optimal solution. Methodology: imitation learning via SVM. Features: node features, branching features, and B&B tree features.
• Song et al. (2018). Goal: same as He et al. (2014). Methodology: retrospective imitation learning. Features: node features and tree features.
• Yilmaz and Yorke-Smith (2021). Goal: learn to mimic SCIP’s default node selection strategy. Methodology: imitation learning of an operator that can decide which child node to expand. Features: same as Gasse et al. (2019).
• Labassi et al. (2022). Goal: same as He et al. (2014). Methodology: imitation learning via GNN. Features: same as Gasse et al. (2019).

Table 6: Summary of works using AI techniques to enhance node selection in the B&B algorithm.

falls under the category of interactive imitation learning. By incorporating this approach, the learned
agent is capable of scaling up and achieving improved performance on larger problem instances.
In recent works, Yilmaz and Yorke-Smith (2021) also trained a node selection policy with imitation
learning. However, their approach uniquely introduced a node comparison operator, which determines
whether to expand the left child, the right child, or both children of a given node. The operator can
be combined with a backtracking algorithm to provide a full node selection policy. Labassi et al.
(2022) combined the imitation learning framework introduced by He et al. (2014) with the bipartite
graph representation of mixed-integer programming problems developed by Gasse et al. (2019). Their
method showed improved performance compared to the previous methods.

7.2 Mixed integer non-linear programming


The field of mixed-integer nonlinear programming (MINLP) has seen some progress in the integration
of AI techniques; however, it remains less mature than mixed-integer linear pro-
gramming. One example of the use of AI in MINLP is the work of Baltean-Lugojan et al. (2018), who
attempted to learn linear outer approximations of semidefinite constraints for non-convex quadratic
programming problems with box constraints. In this study, a neural network was used to select the
most promising submatrices without incurring the computational overhead of solving semidefinite
programming problems to generate the necessary cuts. Ghaddar et al. (2022) focused on using AI
to select the “best branching strategy” in the context of a B&B search tree embedded within the
reformulation-linearization technique (RLT) for solving polynomial problems. They designed hand-
crafted features to select the branching strategy that optimizes a quantile regression forest-based
approximation of their performance indicator. Additionally, González-Rodrı́guez et al. (2022) con-
sidered a portfolio of second-order cone and SDP constraints to strengthen the RLT formulation for
polynomial problems and used AI to select constraints to add within a B&B framework.
In a related line of research, several studies have explored the use of AI to predict the computa-
tional advantages of certain techniques for solving mixed-integer quadratic programs (MIQPs) and
nonconvex mixed-integer nonlinear programs. Bonami et al. (2018) trained classifiers to predict the
computational benefits of linearizing products of binary variables or binary and bounded continuous
variables for solving MIQPs. Nannicini et al. (2011) used support vector machine (SVM) classification
to decide whether an expensive optimality-based bound tightening routine should be used instead of
a cheaper feasibility-based routine for nonconvex MINLPs.

7.3 Enhancement for cutting-plane methods
The cutting-plane method is one of the classical methods for solving integer programming problems (Go-
mory 1960). Consider the following integer programming problem

min_{x ∈ Zn} cT x   s.t.   Ax ≤ b,   x ≥ 0.

Let C = {x ∈ Zn | Ax ≤ b, x ≥ 0} denote the feasible region for the integer programming problem.
The cutting plane method starts with solving the LP obtained from the above problem by dropping
the integrality constraints x ∈ Zn . Let C (0) = {x ∈ Rn | Ax ≤ b, x ≥ 0} ⊇ C denote the feasible
region for the relaxed linear programming problem. Let x(0) ∈ C (0) denote the optimal solution to the
relaxed LP problem. Let us assume x(0) ∉ Zn . The cutting plane method then finds a cut (α(0) , β (0) )
such that
(α(0) )T x ≤ β (0) ∀x ∈ C and (α(0) )T x(0) > β (0) .
Then the new feasible region is constructed by
C (0) ⊇ C (1) = C (0) ∩ {x ∈ Rn | (α(0) )T x ≤ β (0) } ⊇ C,

and the corresponding LP is solved to obtain x(1) . This procedure iterates until x(t) ∈ Zn , which can
be shown to be the optimal solution for the original integer programming problem.
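The generic loop can be sketched as follows; generate_cut and select_cut are placeholders for, e.g., Gomory cut generation and a (possibly learned) cut-selection policy, and the SciPy-based LP solve is only an illustrative choice.

```python
import numpy as np
from scipy.optimize import linprog

def cutting_plane_loop(c, A, b, generate_cut, select_cut, max_iter=100, tol=1e-6):
    """Generic cutting-plane loop: solve the LP relaxation, stop when the
    solution is integral, otherwise generate candidate cuts, select one,
    and append it to the LP before re-solving."""
    A_t, b_t = A.copy(), b.copy()
    for _ in range(max_iter):
        res = linprog(c, A_ub=A_t, b_ub=b_t,
                      bounds=[(0, None)] * len(c), method="highs")
        x = res.x
        if np.all(np.abs(x - np.round(x)) < tol):
            return np.round(x)                        # integral: optimal for the IP
        cuts = generate_cut(A_t, b_t, x)              # candidate cuts D^(t)
        alpha, beta = select_cut(cuts, A_t, b_t, c, x)
        A_t = np.vstack([A_t, alpha[None, :]])        # add alpha^T x <= beta
        b_t = np.append(b_t, beta)
    return None
```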
Gomory cuts are a typical way of generating cuts (Gomory 1960). Let x(t) denote the current
iterate and define
I (t) = {i ∈ [n] | x_i^(t) ∉ Z}.
Then for each i ∈ I (t) , we can generate a cut by having
α_i^(t) = −A_i^(t) + ⌊A_i^(t) ⌋ and β_i^(t) = −b_i^(t) + ⌊b_i^(t) ⌋,
where A(t) and b(t) are the constraints in the current LP. Then we choose one of the possible Gomory
cuts
D(t) = {(α_i^(t) , β_i^(t) ) | i ∈ I (t) }
to add to the current LP. The selection of the Gomory cut largely affects the performance of the
cutting-plane method. Tang et al. (2020) propose a reinforcement-learning-based approach to select
cuts. At iteration t, the state space is given by the current LP and the corresponding optimal solution,
i.e.,
st = (C (t) , c, x(t) ).
The available actions are given by
at = D(t) ,
consisting of all possible Gomory’s cutting planes, and the reward at time step t is given by the
increase in the objective value, i.e.,

Rt = cT x(t+1) − cT x(t) .

Finally, we describe the design of the policy network πθ (at | st ). In order to handle problem instances
of different sizes, the authors utilize the LSTM network architecture. More specifically, they first
embed all the constraints in the current LP and all the candidate cuts into the same space via an
LSTM network,

hi = LSTMθ ([ai , bi ]), ∀[ai , bi ] ∈ C (t) ,
gj = LSTMθ ([αj , βj ]), ∀[αj , βj ] ∈ D(t) .

Then the score for every candidate [αj , βj ] ∈ D(t) is computed by
score_j = Σ_{[ai ,bi ] ∈ C (t)} hiT gj .

Finally, πθ (at | st ) returns the probabilities over the action space by applying a softmax function to
all the computed scores.
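A minimal sketch of such a policy network, assuming PyTorch and treating each row [ai , bi ] (and each cut [αj , βj ]) as a sequence of scalar coefficients, could look as follows; it illustrates the scoring scheme above and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class CutSelectionPolicy(nn.Module):
    """Sketch of an LSTM-based cut-selection policy in the spirit of Tang et al.
    (2020): embed constraints and candidate cuts with a shared LSTM, score each
    cut by summed inner products with the constraint embeddings, and return a
    softmax distribution over the candidate cuts."""

    def __init__(self, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, batch_first=True)

    def embed(self, rows):
        # rows: (num_rows, n + 1) -> each row is a length-(n+1) sequence of scalars.
        _, (h, _) = self.lstm(rows.unsqueeze(-1))
        return h.squeeze(0)                       # (num_rows, hidden_dim)

    def forward(self, constraints, cuts):
        h = self.embed(constraints)               # embeddings of current LP rows
        g = self.embed(cuts)                      # embeddings of candidate cuts
        scores = (h @ g.T).sum(dim=0)             # score_j = sum_i h_i^T g_j
        return torch.softmax(scores, dim=0)       # probability of selecting each cut
```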
Similarly, Huang et al. (2022) also propose a cut-selection strategy based on multiple-instance
learning. The training process is similar to reinforcement learning, and the major difference is that
the reward is defined by the reduction in the total running time. Alternatively, Paulus et al. (2022)
design another cut-selection strategy via imitation learning. The key idea is that they design a greedy
selection rule which is called looking ahead. More specifically, for each candidate cut

[α, β] ∈ D(t) ,

the cut is added to the current LP, which is then solved for the optimal solution
x_{α,β}^(t) = arg min{cT x | x ∈ C (t) , αT x ≤ β},
and the looking ahead rule will select the cut that improves the objective most, i.e.,
[α(t) , β (t) ] ∈ arg max{s_{α,β} := cT x_{α,β}^(t) − cT x(t) | [α, β] ∈ D(t) }.

Looking ahead is a strong rule for selecting cuts, but an expensive one. At every iteration, running it
requires solving |D(t) | additional LPs. Therefore, the authors propose to use looking ahead scores to
facilitate the training of a policy for cut selection via imitation learning. They represent the current
LP, and the candidate cuts as a tripartite graph whose nodes are divided into three parts: variables,
constraints and cuts. Then they use the standard GNN approach to predict the looking ahead score
for each cut node with a soft binary entropy loss.
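The looking ahead scores themselves are straightforward to compute, as the following hypothetical SciPy-based sketch shows; it is precisely this per-cut LP cost that motivates replacing the rule with a learned predictor.

```python
import numpy as np
from scipy.optimize import linprog

def lookahead_cut_scores(c, A_ub, b_ub, cuts, x_t):
    """Sketch of the 'looking ahead' rule: for every candidate cut (alpha, beta),
    add it to the current LP relaxation, re-solve, and score the cut by the
    resulting improvement in the objective value."""
    base_obj = float(c @ x_t)
    scores = []
    for alpha, beta in cuts:
        A_new = np.vstack([A_ub, alpha[None, :]])          # append alpha^T x <= beta
        b_new = np.append(b_ub, beta)
        res = linprog(c, A_ub=A_new, b_ub=b_new,
                      bounds=[(0, None)] * len(c), method="highs")
        scores.append(res.fun - base_obj if res.success else -np.inf)
    best = int(np.argmax(scores))                          # cut with largest improvement
    return best, scores
```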

7.4 Heuristics
For discrete optimization problems, besides exact methods like Branch-and-Bound and Branch-and-
Cut, heuristic algorithms are also widely adopted for their simplicity and efficiency. Although they are
not guaranteed to find the optimal solution, they are designed to find a good solution within reasonable
computational time. Traditional heuristics can be classified into two categories based on their goals:
1) Finding a feasible solution quickly. The feasibility pump (FP) (Berthold et al. 2019) and diving
(Maniezzo et al. 2021) heuristics fall into this category. 2) Trying to find a high-quality feasible solution,
even though there’s no assurance of achieving optimality. Local branching (Fischetti and Lodi 2003),
relaxation induced neighborhood search (RINS) (Danna et al. 2005), and large neighborhood search
(LNS) (Shaw 1998, Pisinger and Røpke 2018) fall into this category. It’s essential to note that these
two categories can be combined, i.e., find an initial feasible solution quickly and then enhance it. For
instance, one might initially employ FP and then RINS to refine the solution further.
AI-based heuristics often draw inspiration from traditional heuristics. In the following, we first
discuss the heuristics inspired by category 1) and then by category 2). Within category 1), two
AI-based heuristics have been proposed, inspired by FP and diving, respectively. Qi et al. (2021)
propose a smart feasibility pump (SFP) method inspired by the traditional feasibility pump (FP)
heuristic. This new method surpasses the traditional one by reducing the number of steps needed to
reach the first feasible solution. FP first solves the relaxed problem (18). Its solution x(0) is then
rounded to obtain an initial integer solution x̄(0) . However, due to rounding, this solution is usually
infeasible w.r.t. constraints (17b) and (17c). FP iteratively projects the solution x̄(t) back to the
feasible region of the relaxed problem (18), obtaining x(t+1) and rounding it to x̄(t+1) . The projection
aims to satisfy constraints (17b) and (17c) with the smallest change to the current solution, while
the rounding focuses only on the integer constraints. This process may take a long time since the
projection and rounding focus on short-term benefits. Furthermore, there is a risk of stagnation, i.e.,
failing to find any feasible solution. In contrast, SFP trains an RL agent such that the infeasibility

will rapidly decrease in several iterations. SFP inherits the framework of FP, but the key point is
that SFP allows the agent to change the current solution x(t) before rounding. The reward is defined
as negative infeasibility. The cumulative rewards will drive the agent to balance between current-step
and long-term negative infeasibility. The state includes the information of Problem (17) and the
current solution. The action of the agent is the change a(t) in the current solution x(t) . Specifically,
the state transition incorporates the change and rounding, i.e., x(t+1) = [x(t) + a(t) ], where [·] is the
rounding operation. Despite the non-differentiable nature of rounding and infeasibility with respect
to the agent’s actions, the PPO algorithm (see Section 2.3) can still be effectively utilized, since it is
designed to handle such a challenge.
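For reference, the classic FP loop that SFP builds on can be sketched as follows. This is a hypothetical SciPy-based illustration for problems of the form min cT x s.t. Ax ≤ b, x ≥ 0 with integer variables indexed by int_idx; the projection step is formulated as an LP with auxiliary variables for the L1 distance.

```python
import numpy as np
from scipy.optimize import linprog

def l1_project(x_bar, A, b, int_idx):
    """Projection step of the feasibility pump: the point of the LP relaxation
    {x : Ax <= b, x >= 0} closest in L1 distance (over the integer coordinates)
    to the rounded point x_bar, modeled with auxiliary variables t."""
    m, n = A.shape
    k = len(int_idx)
    c = np.concatenate([np.zeros(n), np.ones(k)])          # minimize sum_i t_i
    rows, rhs = [np.hstack([A, np.zeros((m, k))])], [b]
    for r, i in enumerate(int_idx):
        pos = np.zeros(n + k); pos[i] = 1.0;  pos[n + r] = -1.0   #  x_i - t_r <= x_bar_i
        neg = np.zeros(n + k); neg[i] = -1.0; neg[n + r] = -1.0   # -x_i - t_r <= -x_bar_i
        rows += [pos[None, :], neg[None, :]]
        rhs += [[x_bar[i]], [-x_bar[i]]]
    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=[(0, None)] * (n + k), method="highs")
    return res.x[:n]

def feasibility_pump(c_obj, A, b, int_idx, max_iter=50):
    """Classic feasibility-pump loop: alternate rounding and L1 projection."""
    x = linprog(c_obj, A_ub=A, b_ub=b, bounds=[(0, None)] * A.shape[1],
                method="highs").x
    for _ in range(max_iter):
        x_bar = x.copy()
        x_bar[int_idx] = np.round(x[int_idx])               # rounding step
        if np.allclose(x[int_idx], x_bar[int_idx]):          # integrality reached
            return x_bar
        x = l1_project(x_bar, A, b, int_idx)                 # projection step
    return None  # stagnation: no feasible point found within the budget
```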
Alternatively, Nair et al. (2021) propose neural diving. In the literature, diving is a heuristic that
explores the branch-and-bound tree. It starts from any partial assignment of integer variables and
finds a feasible assignment of integer variables (if it exists) in an iterative way. In contrast, neural
diving wants to directly predict an initial assignment of integer variables. This assignment may not
be feasible, but needs to be high-quality, i.e., resulting in a high objective value. This distinguishes
neural diving from traditional diving techniques since the goal has changed. The prediction of the
assignment is achieved by a GNN inference. The advantage is that this inference is a “one-off deal”.
Next, we discuss how the GNN model is trained. Let Pi represent the i-th MIP problem in the training
set. Each problem has its graph representation. If we solve the MIP with the B&B algorithm, multiple
feasible solutions can be found at the leaf nodes, and each solution corresponds to an assignment
of integer variables. We denote the number of these feasible assignments as Ni and the set of these
feasible assignments as Xi = {x(i,j) : j = 1, . . . , Ni }. Here, x(i,j) is the j-th assignment of integer variables for the i-th
MIP problem. Now the quality of the j-th assignment of integer variables is evaluated by the following
energy score E(x(i,j) ; Pi ):
E(x(i,j) ; Pi ) = fˆ(x(i,j) ) if x(i,j) is feasible, and E(x(i,j) ; Pi ) = ∞ otherwise,

where fˆ(x(i,j) ) represents the objective value obtained by assigning x(i,j) to integer variables in Pi
and assigning the remaining continuous variables according to the solution of the resulting linear program.
This energy score is normalized over all assignments of the i-th problem and converted to a sample weight
wij . To summarize, we now have a training set {(Pi , Xi )}i=1,...,N . We can train the neural network g(·; θ)
with parameter θ using the following training loss:
L(θ) = − Σ_{i=1}^{N} Σ_{j=1}^{Ni} wij ℓ(g(Pi ; θ), x(i,j) ),

where the loss function ℓ(·, ·) measures the difference between the prediction and the ground truth.
The prediction g(Pi ; θ) is of high quality since it is highly likely to be feasible and close to optimal. It is
guaranteed that the provided x(i,j) in the training set are all feasible assignments. However, g(Pi ; θ) may
still provide an infeasible assignment at the inference stage. Thus, neural diving is usually followed
by a search method that allows infeasibility, such as B&B (Nair et al. 2021) and large neighborhood
search (Song et al. 2018, Sonnerat et al. 2021).
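A minimal sketch of the weighted imitation loss for a single instance Pi , assuming binary integer variables and a PyTorch model, is given below; since the binary cross-entropy is already a negative log-likelihood, the leading minus sign in L(θ) is absorbed into it.

```python
import torch
import torch.nn.functional as F

def neural_diving_loss(logits, assignments, weights):
    """Weighted imitation loss for neural diving on one MIP instance (sketch).
    logits:      (p,)   per-variable outputs of the model g(P_i; theta)
    assignments: (J, p) the N_i collected feasible assignments x^(i,j) (0/1)
    weights:     (J,)   energy-based sample weights w_ij"""
    # Per-assignment loss: sum over variables of binary cross-entropies
    # (cross-entropy = negative log-likelihood, so the minus sign is built in).
    per_assignment = F.binary_cross_entropy_with_logits(
        logits.expand_as(assignments), assignments.float(),
        reduction="none").sum(dim=1)
    # Energy-weighted sum over the feasible assignments of this instance.
    return (weights * per_assignment).sum()
```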
For category 2), traditional heuristics are typically iterative, where early decisions can significantly
influence long-term outcomes. Thus, reinforcement learning is frequently employed to enhance these
traditional heuristics. However, it’s important to note that RL, while powerful, still encounters
convergence issues, especially when confronted with large action and state spaces, or sparse reward
signals. A prevalent approach to mitigate these challenges involves initially employing imitation
learning to train the agent, followed by fine-tuning using reinforcement learning.
Khalil et al. (2017) focus on challenging combinatorial optimization problems on graphs, e.g., the
Minimal Vertex Cover (MVC) problem. This problem seeks the smallest subset of vertices such that
each edge in the graph is incident to at least one vertex in this subset. To solve this problem, traditional
greedy heuristics progressively build the desired subset by incorporating the vertex perceived as the
most advantageous during each iteration. Khalil et al. (2017) employ reinforcement learning to design
AI-enhanced greedy heuristics. The “action” refers to the selection of a vertex into the subset. The

action space includes the vertices that satisfy the graph problem’s constraints. The “state” contains
the current subset and the graph embedding. Here, “graph embedding” is a numerical vector that
summarizes the information in the graph. The “reward” is directly the objective function of the
original problem. In this way, the learned agent doesn’t merely seek a feasible solution but aims for
the solution with high objective value. Meanwhile, the agent is equipped to handle “delayed rewards”
(See Section 2.3), which are challenging for traditional greedy algorithms.
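The resulting greedy loop can be sketched as follows, where q_value is a hypothetical learned scoring function standing in for the trained agent.

```python
def greedy_mvc(graph_edges, q_value):
    """Sketch of an AI-enhanced greedy heuristic for Minimum Vertex Cover, in the
    spirit of Khalil et al. (2017): repeatedly add the vertex with the highest
    learned score until every edge is covered.  `q_value` is a hypothetical
    learned scoring function (partial_cover, vertex) -> float."""
    cover, uncovered = set(), set(graph_edges)
    while uncovered:
        candidates = {v for e in uncovered for v in e}       # endpoints of uncovered edges
        v = max(candidates, key=lambda v: q_value(cover, v)) # learned "action" choice
        cover.add(v)
        uncovered = {e for e in uncovered if v not in e}     # remove newly covered edges
    return cover
```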
In the following, we introduce two AI-enhanced heuristics (Sonnerat et al. 2021, Song et al. 2020),
building upon existing Large Neighborhood Search (LNS) heuristics. We start by briefly introducing
LNS. LNS is a combinatorial optimization heuristic that begins with an assignment of values for
variables and iteratively refines it by searching a large neighborhood around the current assignment. In
each iteration, LNS evaluates multiple neighborhood solutions, comparing which one is most promising
for rapidly converging to a high-quality solution. This naturally raises two questions: 1) How to find
neighborhood solutions? 2) How to assess whether a particular neighborhood is more promising than
another? For question 1), existing heuristics have a consistent approach. From the current solution,
they “unassign” specific variable values, marking them as “unknown.” The resulting MIP is then
exactly solved using off-the-shelf solvers to obtain the neighborhood solution. AI-assisted heuristics
inherit this approach but will smartly decide which variable values to “unassign”. For question 2),
a naive method leverages local information. Specifically, neighborhood solutions are evaluated using
metrics like objective value or primal gap. The most promising neighborhood is selected. This
method, however, has two main limitations: a) computational cost and b) its short-sighted nature.
Limitation a) exists because each neighborhood solution is obtained by solving an MIP. To overcome
this limitation, imitation learning is utilized. The benefit is that during the inference stage, the
learned AI model directly predicts the promising neighborhood, which is a “one-off” deal. Regarding
limitation b), the term “short-sighted” refers to the tendency to make “greedy” decisions based
purely on local information. Simply identifying the locally best neighborhood does not guarantee
convergence to a high-quality solution in the long run. RL provides a remedy to this challenge since
it can handle delayed rewards. In the end, the common paradigm is to initially utilize imitation learning to train
an agent, followed by refinement through RL. The purpose of this paradigm is not merely to address
the aforementioned limitations but also to help the learning process of agents converge. Since after
imitation learning the agent is already able to make a rational decision, in later RL stages, the agent
needs less exploration and thus converges faster.
Two AI-enhanced LNS heuristics follow the same paradigm, with the primary distinction being
their design of action. Sonnerat et al. (2021) design the action directly as which integer variables
to unassign while Song et al. (2020) design the action as a decomposition. Specifically, it means
decomposing the integer variable set x into disjoint subsets, i.e., x = x1 ∪ x2 ∪ · · · ∪ xK . Here,
the hyper-parameter K represents the number of equally sized subsets. This decomposition action is
predicted by an AI model (multi-layer perceptron). The model predicts the subset index to which each
variable should be allocated. As for state transition, a number of K different MIPs are solved. For
the k-th MIP (k ∈ {1, ..., K}), variables outside of the subset xk retain their values from the current
solution, while an optimization solver refines the variables within xk . The solution with the best
primal gap is then selected. A shared limitation in both AI-enhanced heuristics is the lack of adaptive
control over the neighborhood size. In the work by Sonnerat et al. (2021), the neighborhood size is
determined by a hyperparameter representing the number of unassigned integer variables. In Song
et al. (2020), the neighborhood size is controlled by the number of subsets K. They are both static
hyper-parameters. Although larger neighborhoods can reduce the risk of getting trapped in local optima,
this also increases computational cost. A promising direction for future research is the development
of mechanisms for adaptive neighborhood sizing.
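The overall loop shared by these AI-enhanced LNS heuristics can be sketched as follows; destroy_policy, solve_sub_mip, and mip.objective are hypothetical interfaces standing in for the learned "unassign" decision and the off-the-shelf solver call.

```python
def learned_lns(mip, init_solution, destroy_policy, solve_sub_mip, n_iters=20):
    """Sketch of an AI-assisted large neighborhood search loop (minimization).
    `destroy_policy` returns the set of integer variables to unassign;
    `solve_sub_mip` re-optimizes only those variables with the rest fixed."""
    incumbent = init_solution
    best_obj = mip.objective(incumbent)
    for _ in range(n_iters):
        # The learned policy decides which variable values to "unassign".
        unassign = destroy_policy(mip, incumbent)
        # Re-optimize the unassigned variables with an off-the-shelf solver,
        # keeping all other variables fixed at their incumbent values.
        candidate = solve_sub_mip(mip, incumbent, free_vars=unassign)
        if candidate is not None and mip.objective(candidate) < best_obj:
            incumbent, best_obj = candidate, mip.objective(candidate)
    return incumbent
```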
To conclude, AI models assist or surpass heuristics either by imitating a computationally expensive
target or by learning from the costly interaction with the environment. At the inference stage, AI
models can quickly take smart actions and even outperform the state-of-the-art commercial solvers
on MIPs with hundreds or thousands of variables (Song et al. 2020).

8 Conclusion
AI for Operations Research typically focuses on enhancing an individual stage within the Operations
Research pipeline. However, exploring interactions between different stages is an intriguing area of
study, such as gathering feedback from later stages to refine earlier ones. The smart predict-then-
optimize (SPO) paradigm is a pioneering example of integrating diverse stages. Under the
SPO framework, besides minimizing the objective, an interesting extension is also ensuring feasibility.
Relaxing feasibility is a widely recognized technique employed in real-world applications. Its moti-
vation stems from the fact that domain experts often find the response of “impossible to complete
the task” unsatisfactory. Instead, they strive to comprehend the reasons for infeasibility and explore
avenues to relax constraints slightly, thereby achieving a feasible solution. By incorporating this con-
cept into SPO, it is interesting to let the AI model be capable of predicting parameters or adjusting
parameters such that the constraint is feasible. Furthermore, other types of interactions can also
prove valuable. For instance, if the optimization process exhibits slow convergence or sub-optimality,
it may indicate the need for an alternative formulation. It is beneficial to explore the potential of AI
models to automatically perform such adaptations. Nevertheless, designing a space that encompasses
well-known optimization formulations and provides the building blocks for modifying formulations is
a challenging task. Achieving a framework that automatically adapts formulations also remains a
promising area of research.
Despite the advancements in automatic algorithm configuration and algorithm selection, the task
of “unified software selection & tuning” remains interesting and challenging. Given an optimization
problem instance and a set of available optimization software, e.g., Gurobi, CPLEX, and OptVerse,
we want to predict which software along with its potential configuration is going to perform best on
this instance. On the one hand, the challenge is inherited from the algorithm selection task, i.e., the
target (algorithm or software) is regarded as a black box. On the other hand, the software incorporates
several parameters and preprocessing algorithms that result in a complex structure. To be specific,
if CPLEX performs worse than Gurobi on an instance, the root cause might be an inappropriate
configuration rather than the software’s inadequacy. In other words, the software selection task
combines the algorithm selection and automatic algorithm configuration tasks.
In the model-based method for automatic algorithm configuration, a performance model is built
for Algorithm Runtime Prediction. Specifically, the goal is to employ AI techniques to predict the
runtime of an algorithm on a previously unseen problem given the problem-specific features. While
the existing works usually claim that the performance model is empirically strong, it would be useful to know
its boundary, i.e., when it will fail. To deal with this, we may need a measure of the smoothness and
learnability of the performance function. We may also need a similarity metric between the unseen
input and training instances.
AI-based methodologies implicitly assume that the testing instances are similar to the ones used
in training. However, a well-established similarity metric does not currently exist. The
challenge is that the property of a similarity metric is hard to describe mathematically. We may
define the metric by looking at the model formulation, structure, and the corresponding parameters,
but it is hard to derive a guarantee that similar problems will benefit similarly from an AI model.
Nevertheless, designing an empirical similarity metric with AI techniques might still be valuable.
In conclusion, AI techniques have demonstrated great potential in enhancing each stage of the OR
process. Further exploration of the synergy between AI and OR is worthwhile, e.g., utilizing AI to
enhance the interactions between different OR stages. This synergy will undoubtedly lead to exciting
advancements and novel solution methods in a multitude of domains.

Acknowledgements
Bissan Ghaddar’s research is supported by the Natural Sciences and Engineering Research Council of
Canada Discovery Grant 2017-04185 and by the David G. Burgoyne Faculty Fellowship.

References
Achterberg, T., Koch, T., and Martin, A. (2005). Branching rules revisited. Operations Research Letters,
33(1):42–54.
Adriaensen, S., Biedenkapp, A., Shala, G., Awad, N., Eimer, T., Lindauer, M., and Hutter, F. (2022). Auto-
mated dynamic algorithm configuration. Journal of Artificial Intelligence Research, 75:1633–1699.
Alvarez, A. M., Louveaux, Q., and Wehenkel, L. (2017). A machine learning-based approximation of strong
branching. INFORMS Journal on Computing, 29(1):185–195.
Amos, B. and Kolter, J. Z. (2017). OptNet: Differentiable optimization as a layer in neural networks. In
Proceedings of the 34th International Conference on Machine Learning, pages 136–145.
Anagnostopoulos, A., Becchetti, L., Castillo, C., Gionis, A., and Leonardi, S. (2012). Online team formation
in social networks. In Proceedings of the International Conference on World Wide Web, pages 839–848.
Anastacio, M. and Hoos, H. (2020a). Combining sequential model-based algorithm configuration with default-
guided probabilistic sampling. In Genetic and Evolutionary Computation Conference Companion. ACM.
Anastacio, M. and Hoos, H. (2020b). Model-based algorithm configuration with default-guided probabilistic
sampling. In Proceedings of the 16th International Conference on Parallel Problem Solving from Nature
(PPSN-20).
Andrychowicz, M., Denil, M., Colmenarejo, S. G., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N.
(2016). Learning to learn by gradient descent by gradient descent. In Proceedings of the Advances in
Neural Information Processing Systems, pages 3981–3989.
Ansótegui, C., Sellmann, M., and Tierney, K. (2009). A gender-based genetic algorithm for the automatic
configuration of algorithms. In International Conference on Principles and Practice of Constraint Pro-
gramming, pages 142–157. Springer.
Applegate, D., Bixby, R., Chvátal, V., and Cook, W. (1995). Finding cuts in the TSP (A preliminary report).
Technical report, Citeseer.
Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., and Kautz, J. (2017). Reinforcement learning through
asynchronous advantage actor-critic on a GPU. In Proceedings of the International Conference on Learning
Representations.
Balaprakash, P., Birattari, M., and Stützle, T. (2007). Improvement strategies for the f-race algorithm: Sam-
pling design and iterative refinement. In Hybrid Metaheuristics, pages 108–122. Springer.
Balcan, M.-F., Dick, T., Sandholm, T., and Vitercik, E. (2018). Learning to branch. In Proceedings of the
International Conference on Machine Learning, pages 344–353.
Baltean-Lugojan, R., Bonami, P., Misener, R., and Tramontani, A. (2018). Scoring positive semidefinite cutting
planes for quadratic optimization via trained neural networks. Optimization Online Preprint.
Beck, A. (2017). First-order methods in optimization. Society for Industrial and Applied Mathematics.
Bengio, Y., Lodi, A., and Prouvost, A. (2021). Machine learning for combinatorial optimization: A method-
ological tour d’horizon. European Journal of Operational Research, 290(2):405–421.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the
International Conference on Machine Learning, pages 41–48.
Bénichou, M., Gauthier, J.-M., Girodet, P., Hentges, G., Ribière, G., and Vincent, O. (1971). Experiments in
mixed-integer linear programming. Mathematical Programming, 1:76–94.
Bergman, D., Huang, T., Brooks, P. A., Lodi, A., and Raghunathan, A. (2021). Janos: An integrated predictive
and prescriptive modeling framework. INFORMS Journal on Computing, 34(2):807–816.
Berkelaar, M. (2015). Package ‘lpsolve’.
Berthold, T., Lodi, A., and Salvagnin, D. (2019). Ten years of feasibility pump, and counting. EURO Journal
on Computational Optimization, 7(1):1–14.
Bertsimas, D. and Kallus, N. (2020). From predictive to prescriptive analytics. Management Science,
66(3):1025–1044.
Bischl, B., Kerschke, P., Kotthoff, L., Lindauer, M., Malitsky, Y., Fréchette, A., Hoos, H., Hutter, F., Leyton-
Brown, K., Tierney, K., and Vanschoren, J. (2016). ASlib: A benchmark library for algorithm selection.
Artificial Intelligence, 237:41–58.
Blazewicz, J., Dror, M., and Weglarz, J. (1991). Mathematical programming formulations for machine schedul-
ing: A survey. European Journal of Operational Research, 51(3):283–300.
Blöchliger, I. (2004). Modeling staff scheduling problems. A tutorial. European Journal of Operational Research,
158(3):533–542.

Blom, M., Pearce, A. R., and Stuckey, P. J. (2017). Short-term scheduling of an open-pit mine with multiple
objectives. Engineering Optimization, 49(5):777–795.
Bonami, P., Lodi, A., and Zarpellon, G. (2018). Learning a classification of mixed-integer quadratic pro-
gramming problems. In Proceedings of the International Conference on the Integration of Constraint
Programming, Artificial Intelligence, and Operations Research, pages 595–604. Springer.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning,
3(1):1–122.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry,
G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler,
D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J.,
Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are
few-shot learners. arXiv:2005.14165.
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y.,
Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., and Zhang, Y. (2023). Sparks of artificial general
intelligence: Early experiments with GPT-4. arXiv:2303.12712.
Byun, J., Kim, B., and Wang, H. (2020). Proximal policy gradient: PPO with policy gradient.
arXiv:2010.09933.
Cameron, C., Hartford, J., Lundy, T., and Leyton-Brown, K. (2022). The perils of learning before optimizing.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3708–3715.
Cappart, Q., Chételat, D., Khalil, E. B., Lodi, A., Morris, C., and Velickovic, P. (2021). Combinatorial opti-
mization and reasoning with graph neural networks. In Proceedings of the International Joint Conference
on Artificial Intelligence, pages 4348–4355.
Chen, T., Chen, X., Chen, W., Heaton, H., Liu, J., Wang, Z., and Yin, W. (2022). Learning to optimize: A
primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59.
Chen, T., Zhang, W., Jingyang, Z., Chang, S., Liu, S., Amini, L., and Wang, Z. (2020a). Training stronger
baselines for learning to optimize. In Proceedings of the Advances in Neural Information Processing
Systems, volume 33, pages 7332–7343.
Chen, X., Dai, H., Li, Y., Gao, X., and Song, L. (2020b). Learning to stop while learning to predict. In
Proceedings of the International Conference on Machine Learning, pages 1520–1530.
Chi, C., Aboussalah, A. M., Khalil, E. B., Wang, J., and Sherkat-Masoumi, Z. (2022). A deep reinforcement
learning framework for column generation. arXiv:2206.02568.
Codevilla, F., Santana, E., López, A. M., and Gaidon, A. (2019). Exploring the limitations of behavior cloning
for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 9329–9338.
CPLEX, I. I. (2009). V12. 1: User’s manual for CPLEX. International Business Machines Corporation,
46(53):157.
Dakin, R. J. (1965). A tree-search algorithm for mixed integer programming problems. The Computer Journal,
8(3):250–255.
Danna, E., Rothberg, E. E., and Pape, C. L. (2005). Exploring relaxation induced neighborhoods to improve
MIP solutions. Mathematical Programming, 102:71–90.
Dantzig, G. B. and Thapa, M. N. (2003). Linear Programming 2: Theory and Extensions. Springer-Verlag.
De Bock, K. W., Coussement, K., Caigny, A. D., Slowiński, R., Baesens, B., Boute, R. N., Choi, T.-M., Delen,
D., Kraus, M., Lessmann, S., Maldonado, S., Martens, D., Óskarsdóttir, M., Vairetti, C., Verbeke, W., and
Weber, R. (2023). Explainable ai for operational research: A defining framework, methods, applications,
and a research agenda. European Journal of Operational Research.
Desaulniers, G., Desrosiers, J., and Solomon, M. M. (2006). Column generation, volume 5. Springer Science &
Business Media.
Desaulniers, G., Lodi, A., and Morabit, M. (2020). Machine-learning-based column selection for column gener-
ation. Transportation Science, 55:815–831.
Di Liberto, G., Kadioglu, S., Leo, K., and Malitsky, Y. (2016). DASH: Dynamic approach for switching
heuristics. European Journal of Operational Research, 248(3):943–953.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research, 12(7).
Elmachtoub, A. N. and Grigas, P. (2022). Smart “predict, then optimize”. Management Science, 68(1):9–26.

Etheve, M., Alès, Z., Bissuel, C., Juan, O., and Kedad-Sidhoum, S. (2020). Reinforcement learning for vari-
able selection in a branch and bound algorithm. In Integration of Constraint Programming, Artificial
Intelligence, and Operations Research, pages 176–185. Springer International Publishing.
Fajemisin, A. O., Maragno, D., and den Hertog, D. (2023). Optimization with constraint learning: A framework
and survey. European Journal of Operational Research.
Fan, Z., Wang, X., Yakovenko, O., Sivas, A. A., Ren, O., Zhang, Y., and Zhou, Z. (2023). Smart initial basis
selection for linear programs. In Proceedings of the International Conference on Machine Learning, volume
202, pages 9650–9664.
Fischetti, M. and Lodi, A. (2003). Local branching. Mathematical Programming, 98:23–47.
Forrest, J. J. and Goldfarb, D. (1992). Steepest-edge simplex algorithms for linear programming. Mathematical
Programming, 57(1–3):341–374.
Fortin, M. and Glowinski, R. (2000). Augmented Lagrangian methods: applications to the numerical solution
of boundary-value problems. Elsevier.
Fukushima, M. (1992). Application of the alternating direction method of multipliers to separable convex
programming problems. Computational Optimization and Applications, 1(1):93–111.
Gambella, C., Ghaddar, B., and Naoum-Sawaya, J. (2021). Optimization problems for machine learning: A
survey. European Journal of Operational Research, 290(3):807–828.
Gao, M., Wan, X., Su, J., Wang, Z., and Huai, B. (2023). Reference matters: Benchmarking factual error
correction for dialogue summarization with fine-grained evaluation framework. In Proceedings of the
Association for Computational Linguistics, pages 13932–13959.
Gasse, M., Chételat, D., Ferroni, N., Charlin, L., and Lodi, A. (2019). Exact combinatorial optimization with
graph convolutional neural networks. In Proceedings of the Advances in Neural Information Processing
Systems, volume 32, pages 15554–15566.
Ghaddar, B., Gómez-Casares, I., González-Dı́az, J., González-Rodrı́guez, B., Pateiro-López, B., and
Rodrı́guez-Ballesteros, S. (2022). Learning for spatial branching: An algorithm selection approach.
arXiv:2204.10834.
Gomory, R. (1960). An algorithm for the mixed integer problem. Technical report, RAND Corporation, Santa
Monica, CA.
González-Rodrı́guez, B., Alvite-Pazó, R., Alvite-Pazó, S., Ghaddar, B., and González-Dı́az, J. (2022). Polyno-
mial optimization: Enhancing RLT relaxations with conic constraints. arXiv:2208.05608.
Guo, T., Han, C., Tang, S., and Ding, M. (2019). Solving combinatorial problems with machine learning
methods. Nonlinear Combinatorial Optimization.
Gupta, P., Gasse, M., Khalil, E., Mudigonda, P., Lodi, A., and Bengio, Y. (2020). Hybrid models for learning
to branch. In Proceedings of the Advances in Neural Information Processing Systems, volume 33, pages
18087–18097.
Gupta, P., Khalil, E. B., Chetélat, D., Gasse, M., Bengio, Y., Lodi, A., and Kumar, M. P. (2022). Lookback
for learning to branch. arXiv:2206.14987.
Gurobi (2022). Gurobi Optimizer Reference Manual.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine
Learning.
Hart, P. E., Nilsson, N. J., and Raphael, B. (1968). A formal basis for the heuristic determination of minimum
cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107.
He, B., Yang, H., and Wang, S. (2000). Alternating direction method with self-adaptive penalty parameters
for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356.
He, H., Daume III, H., and Eisner, J. M. (2014). Learning to search in branch and bound algorithms. In
Proceedings of the Advances in Neural Information Processing Systems, volume 27, pages 3293–3301.
Huang, L., Chen, X., Huo, W., Wang, J., Zhang, F., Bai, B., and Shi, L. (2021). Branch and bound in mixed
integer linear programming problems: A survey of techniques and trends. arXiv:2111.06257.
Huang, W., Mordatch, I., and Pathak, D. (2020). One policy to control them all: Shared modular policies
for agent-agnostic control. In Proceedings of the International Conference on Machine Learning, pages
4455–4464.
Huang, Z., Wang, K., Liu, F., Zhen, H.-L., Zhang, W., Yuan, M., Hao, J., Yu, Y., and Wang, J. (2022).
Learning to select cuts for efficient mixed-integer programming. Pattern Recognition, 123:108353.
Huawei (2021). Optverse solver.

Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. (2017). Imitation learning: A survey of learning methods.
ACM Computing Surveys, 50(2).
Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2010). Automated configuration of mixed integer program-
ming solvers. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial
Optimization Problems, pages 186–202. Springer Berlin Heidelberg.
Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general
algorithm configuration. In International Conference on Learning and Intelligent Optimization, volume
6683 of Lecture Notes in Computer Science, pages 507–523. Springer.
Hutter, F., Hoos, H. H., Leyton-Brown, K., and Stützle, T. (2009). ParamILS: An automatic algorithm
configuration framework. Journal of Artificial Intelligence Research, 36:267–306.
Ichnowski, J., Jain, P., Stellato, B., Banjac, G., Luo, M., Borrelli, F., Gonzalez, J. E., Stoica, I., and Goldberg,
K. (2021). Accelerating quadratic optimization with reinforcement learning. In Proceedings of the Advances
in Neural Information Processing Systems, volume 34, pages 21043–21055.
Jung, H., Park, J., and Park, J. (2022). Learning context-aware adaptive solvers to accelerate quadratic
programming. arXiv:2211.12443.
Khalil, E., Dai, H., Zhang, Y., Dilkina, B., and Song, L. (2017). Learning combinatorial optimization algorithms
over graphs. In Proceedings of the Advances in Neural Information Processing Systems, pages 6348–6358.
Khalil, E., Le Bodic, P., Song, L., Nemhauser, G., and Dilkina, B. (2016). Learning to branch in mixed integer
programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the Interna-
tional Conference on Learning Representations.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv:1312.6114.
Kontogiorgis, S. and Meyer, R. R. (1998). A variable-penalty alternating directions method for convex opti-
mization. Mathematical Programming, 83(1):29–53.
Kotary, J., Fioretto, F., Van Hentenryck, P., and Wilder, B. (2021). End-to-end constrained optimization
learning: A survey. In Proceedings of the International Joint Conference on Artificial Intelligence, pages
4475–4482.
Kraus, M., Feuerriegel, S., and Oztekin, A. (2020). Deep learning in business analytics and operations research:
Models, applications and managerial implications. European Journal of Operational Research, 281(3):628–
641.
Krentel, M. W. (1986). The complexity of optimization problems. In Proceedings of the eighteenth annual ACM
symposium on Theory of Computing. ACM.
Labassi, A. G., Chételat, D., and Lodi, A. (2022). Learning to compare nodes in branch and bound with graph
neural networks. arXiv:2210.16934.
Land, A. H. and Doig, A. G. (2010). An automatic method for solving discrete programming problems. Springer.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L.
(2020). BART: denoising sequence-to-sequence pre-training for natural language generation, translation,
and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 7871–7880.
Li, A., Li, B., Han, C., and Guo, T. (2022). Rethinking optimal pivoting paths of simplex method.
arXiv:2210.02945.
Li, K. and Malik, J. (2016). Learning to optimize. In Proceedings of the International Conference on Learning
Representations.
Li, K. and Malik, J. (2017). Learning to optimize neural nets. arXiv:1703.00441.
Lindauer, M., Eggensperger, K., Feurer, M., Biedenkapp, A., Deng, D., Benjamins, C., Ruhkopf, T., Sass, R.,
and Hutter, F. (2022). SMAC3: A versatile bayesian optimization package for hyperparameter optimiza-
tion. Journal of Machine Learning Research, 23(54):1–9.
Lindauer, M. and Hutter, F. (2018). Warmstarting of model-based algorithm configuration. In Proceedings of
the AAAI Conference on Artificial Intelligence, pages 1355–1362.
Linderoth, J. T. and Savelsbergh, M. W. (1999). A computational study of search strategies for mixed integer
programming. INFORMS Journal on Computing, 11(2):173–187.
Lodi, A. and Zarpellon, G. (2017). On learning and branching: a survey. Journal of the Spanish Society of
Statistics and Operations Research, 25(2):207–236.
Lombardi, M. and Milano, M. (2018). Boosting combinatorial problem modeling with machine learning. In
Proceedings of the International Joint Conference on Artificial Intelligence, pages 5472–5478.

44
López-Ibáñez, M., Dubois-Lacoste, J., Cáceres, L. P., Birattari, M., and Stützle, T. (2016). The irace package:
Iterated racing for automatic algorithm configuration. Operations Research Perspectives, 3:43–58.
Lv, K., Jiang, S., and Li, J. (2017). Learning gradient descent: Better generalization and longer horizons. In
Proceedings of the International Conference on Machine Learning, pages 2247–2255.
Mandi, J., Stuckey, P. J., and Guns, T. (2020). Smart predict-and-optimize for hard combinatorial optimization
problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1603–1610.
Maniezzo, V., Boschetti, M. A., and Stützle, T. (2021). Diving Heuristics, pages 133–141. EURO Advanced
Tutorials on Operational Research. Springer, Cham.
Maragno, D., Wiberg, H. M., Bertsimas, D., Birbil, S. I., den Hertog, D., and Fajemisin, A. O. (2021). Mixed-
integer optimization with constraint learning. arXiv:2111.04469.
Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1):77–91.
Medsker, L. R. and Jain, L. (2001). Recurrent neural networks. Design and Applications, 5:64–67.
Metz, L., Maheswaranathan, N., Nixon, J., Freeman, D., and Sohl-Dickstein, J. (2019). Understanding and
correcting pathologies in the training of learned optimizers. In Proceedings of the International Conference
on Machine Learning, pages 4556–4565.
Mhanna, S., Verbič, G., and Chapman, A. C. (2018). Adaptive admm for distributed ac optimal power flow.
IEEE Transactions on Power Systems, 34(3):2025–2035.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu,
K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the International
Conference on Machine Learning, volume 48, pages 1928–1937.
Mostajabdaveh, M., Salman, F. S., and Tahmasbi, N. (2022a). Two dimensional guillotine cutting stock and
scheduling problem in printing industry. Computers and Operations Research, 148:106014.
Mostajabdaveh, M., Salman, S., and Gutjahr, W. (2022b). A branch-and-price algorithm for fast and equitable
last-mile relief aid distribution. SSRN Electronic Journal.
Nair, V., Bartunov, S., Gimeno, F., von Glehn, I., Lichocki, P., Lobov, I., O’Donoghue, B., Sonnerat, N.,
Tjandraatmadja, C., Wang, P., Addanki, R., Hapuarachchi, T., Keck, T., Keeling, J., Kohli, P., Ktena,
I., Li, Y., Vinyals, O., and Zwols, Y. (2021). Solving mixed integer programs using neural networks.
arXiv:2012.13349.
Nannicini, G., Belotti, P., Lee, J., Linderoth, J., Margot, F., and Wächter, A. (2011). A probing algorithm
for MINLP with failure prediction by SVM. In Integration of AI and OR Techniques in Constraint
Programming for Combinatorial Optimization Problems, pages 154–169. Springer.
Nesterov, Y. (1983). A method for solving the convex programming problem with convergence rate O(1/k^2). In
Proceedings of the USSR Academy of Sciences, volume 269, pages 543–547.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K.,
Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano,
P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback.
In Proceedings of the Advances in Neural Information Processing Systems, volume 35, pages 27730–27744.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In
Proceedings of the International Conference on Machine Learning, pages 1310–1318.
Paulus, M. B., Zarpellon, G., Krause, A., Charlin, L., and Maddison, C. (2022). Learning to cut by looking
ahead: Cutting plane selection via imitation learning. In Proceedings of the International Conference on
Machine Learning, pages 17584–17600.
Pinedo, M. L. (2016). Scheduling: theory, algorithms, and systems. Springer, Cham, 6th edition.
Pisinger, D. and Røpke, S. (2018). Large neighborhood search. In Handbook of Metaheuristics. Springer.
Ploskas, N., Sahinidis, N. V., and Samaras, N. (2021). A triangulation and fill-reducing initialization procedure
for the simplex algorithm. Mathematical Programming Computation, 13:491–508.
Pogančić, M. V., Paulus, A., Musil, V., Martius, G., and Rolínek, M. (2020). Differentiation of blackbox
combinatorial solvers. In Proceedings of the International Conference on Learning Representations.
Qi, M., Wang, M., and Shen, Z.-J. (2021). Smart feasibility pump: Reinforcement learning for (mixed) integer
programming. arXiv:2102.09663.
Qu, Q., Li, X., Zhou, Y., Zeng, J., Yuan, M., Wang, J., Lv, J., Liu, K., and Mao, K. (2022). An improved
reinforcement learning algorithm for learning to branch. arXiv:2201.06213.
Rajgopal, J. (2004). Principles and applications of operations research. Maynard’s Industrial Engineering
Handbook, pages 11–27.
Ramamonjison, R., Yu, T. T., Li, R., Li, H., Carenini, G., Ghaddar, B., He, S., Mostajabdaveh, M., Banitalebi-
Dehkordi, A., Zhou, Z., and Zhang, Y. (2023). NL4Opt competition: Formulating optimization problems
based on their natural language descriptions. arXiv:2303.08233.

Reid, M. and Neubig, G. (2022). Learning to model editing processes. In Findings of the Association for
Computational Linguistics: EMNLP 2022, pages 3822–3832.
Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In Proceedings of the Thirteenth
International Conference on Artificial Intelligence and Statistics, AISTATS 2010, volume 9 of JMLR
Proceedings, pages 661–668.
Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J.,
Kozhevnikov, A., Evtimov, I., Bitton, J., Bhatt, M., Ferrer, C. C., Grattafiori, A., Xiong, W., Défossez,
A., Copet, J., Azhar, F., Touvron, H., Martin, L., Usunier, N., Scialom, T., and Synnaeve, G. (2023).
Code Llama: Open foundation models for code. arXiv:2308.12950.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.
Sabharwal, A., Samulowitz, H., and Reddy, C. (2012). Guiding combinatorial optimization with UCT. In
Proceedings of the Integration of AI and OR Techniques in Constraint Programming for Combinatorial
Optimization Problems, pages 356–361. Springer.
Schede, E., Brandt, J., Tornede, A., Wever, M., Bengs, V., Hüllermeier, E., and Tierney, K. (2022). A survey
of methods for automated algorithm configuration. Journal of Artificial Intelligence Research, 75:425–487.
Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. (2015). Trust region policy optimization. In
Proceedings of the International Conference on Machine Learning, volume 37, pages 1889–1897.
Shaw, P. (1998). Using constraint programming and local search methods to solve vehicle routing problems. In
Proceedings of the International Conference on Principles and Practice of Constraint Programming.
Shen, Y., Sun, Y., Li, X., Eberhard, A., and Ernst, A. (2022). Enhancing column generation by a machine-
learning-based pricing heuristic for graph coloring. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 36, pages 9926–9934.
Song, J., Lanka, R., Zhao, A., Bhatnagar, A., Yue, Y., and Ono, M. (2018). Learning to search via retrospective
imitation. arXiv:1804.00846.
Song, J., Yue, Y., and Dilkina, B. (2020). A general large neighborhood search framework for solving integer
linear programs. In Proceedings of the Advances in Neural Information Processing Systems, volume 33,
pages 20012–20023.
Sonnerat, N., Wang, P., Ktena, I., Bartunov, S., and Nair, V. (2021). Learning a large neighborhood search
algorithm for mixed integer programs. arXiv:2107.10201.
Staudemeyer, R. C. and Morris, E. R. (2019). Understanding LSTM – A tutorial into long short-term memory
recurrent neural networks. arXiv:1909.09586.
Steever, Z., Murray, C. C., Yuan, J., Karwan, M. H., and Lübbecke, M. E. (2019). An image-based approach
to detecting structural similarity among mixed integer programs. INFORMS Journal on Computing,
34:1849–1870.
Stellato, B., Banjac, G., Goulart, P., Bemporad, A., and Boyd, S. (2020). OSQP: An operator splitting solver
for quadratic programs. Mathematical Programming Computation, 12(4):637–672.
Sun, H., Chen, W., Li, H., and Song, L. (2020). Improving learning to branch via reinforcement learning. In
NeurIPS 2020 Workshop on Learning Meets Combinatorial Algorithms (LMCA).
Suriyanarayana, V., Tavaslıoğlu, O., Patel, A. B., and Schaefer, A. J. (2022). Reinforcement learning of simplex
pivot rules: a proof of concept. Optimization Letters, pages 1–13.
Talbi, E.-G. (2009). Metaheuristics: from design to implementation. John Wiley & Sons.
Talluri, K. T. and Van Ryzin, G. J. (2004). Revenue management under a general discrete choice model of
consumer behavior. Management Science, 50(1):15–33.
Tang, Y., Agrawal, S., and Faenza, Y. (2020). Reinforcement learning for integer programming: Learning to
cut. In Proceedings of the International Conference on Machine Learning, pages 9367–9376.
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023).
Stanford Alpaca: An instruction-following LLaMA model. GitHub repository: https://github.com/
tatsu-lab/stanford_alpaca.
Torabi, F., Warnell, G., and Stone, P. (2018). Behavioral cloning from observation. In Proceedings of the
International Joint Conference on Artificial Intelligence, pages 4950–4957.
Toth, P. and Vigo, D. (2014). Vehicle routing: problems, methods, and applications. Society for Industrial and
Applied Mathematics.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava,
P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J.,
Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H.,
Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril,
T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y.,
Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian,
R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan,
A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. (2023). Llama 2:
Open foundation and fine-tuned chat models. arXiv:2307.09288.
Tseng, P. (1998). An incremental gradient (-projection) method with momentum term and adaptive stepsize
rule. SIAM Journal on Optimization, 8(2):506–531.
Veličković, P. (2023). Everything is connected: Graph neural networks. Current Opinion in Structural Biology,
79:102538.
Wang, P., Donti, P. L., Wilder, B., and Kolter, J. Z. (2019). SATNet: Bridging deep learning and logical
reasoning using a differentiable satisfiability solver. In Proceedings of the International Conference on
Machine Learning, volume 97, pages 6545–6554.
Wang, S. and Liao, L. (2001). Decomposition method with a variable parameter for a class of monotone
variational inequality problems. Journal of Optimization Theory and Applications, 109(2):415–429.
Wen, T.-H., Vandyke, D., Mrkšić, N., Gasic, M., Barahona, L. M. R., Su, P.-H., Ultes, S., and Young, S. (2017).
A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the European
Chapter of the Association for Computational Linguistics, pages 438–449.
Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Freitas, N., and Sohl-
Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the International
Conference on Machine Learning, pages 3751–3760.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning.
In Reinforcement learning, pages 5–32. Springer.
Winston, W. L. and Goldberg, J. B. (2004). Operations Research: Applications and Algorithms. Thomson
Brooks/Cole, 4th edition.
Wolpert, D. H. and Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on
Evolutionary Computation, 1(1):67–82.
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. (2020). A comprehensive survey on graph
neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2019). How powerful are graph neural networks? In Proceedings
of the International Conference on Learning Representations.
Yan, J., Yang, S., and Hancock, E. R. (2020). Learning for graph matching and related combinatorial opti-
mization problems. In Proceedings of the International Joint Conference on Artificial Intelligence, pages
4988–4996.
Yan, K., Yan, J., Luo, C., Chen, L., Lin, Q., and Zhang, D. (2021). A surrogate objective framework for
prediction+optimization with soft constraints. arXiv:2111.11358.
Yilmaz, K. and Yorke-Smith, N. (2021). A study of learning search approximation in mixed integer branch and
bound: Node selection in SCIP. AI, 2(2):150–178.
Yu, Y., Si, X., Hu, C., and Zhang, J. (2019). A review of recurrent neural networks: LSTM cells and network
architectures. Neural Computation, 31(7):1235–1270.
Zarpellon, G., Jo, J., Lodi, A., and Bengio, Y. (2021). Parameterizing branch-and-bound search trees to learn
branching policies. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages
3931–3939.
Zeng, S., Kody, A., Kim, Y., Kim, K., and Molzahn, D. K. (2022). A reinforcement learning approach to
parameter selection for distributed optimal power flow. Electric Power Systems Research, 212:108546.
Zhang, C., Wu, Y., Ma, Y., Song, W., Le, Z., Cao, Z., and Zhang, J. (2023). A review on learning to solve
combinatorial optimisation problems in manufacturing. IET Collaborative Intelligent Manufacturing.
Zhang, J., Liu, C., Yan, J., Li, X., Zhen, H.-L., and Yuan, M. (2022). A survey for solving mixed integer
programming via machine learning. Neurocomputing, 519:205–217.
Zheng, W., Chen, T., Hu, T.-K., and Wang, Z. (2022). Symbolic learning to optimize: Towards interpretability
and scalability. arXiv:2203.06578.
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G.,
Lewis, M., Zettlemoyer, L., and Levy, O. (2023). LIMA: Less is more for alignment. arXiv:2305.11206.
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. (2020). Graph neural
networks: A review of methods and applications. AI Open, 1:57–81.
