AL-iLQR Tutorial

Brian Jackson

Abstract— Trajectory optimization is a powerful framework for controlling complex dynamical systems. While many algorithms for solving these difficult problems have been proposed, methods based on differential dynamic programming (DDP) have recently become very popular due to their straightforward implementation and computational efficiency. While the original DDP algorithm has no ability to deal with additional path constraints, the use of DDP within an augmented Lagrangian framework yields a powerful and efficient method for solving constrained trajectory optimization problems. This tutorial presents the algorithm in detail, based on experience developing state-of-the-art implementations of this algorithm, and aims to provide useful insight to help those unfamiliar with DDP methods quickly understand the algorithm and its extensions.

I. INTRODUCTION

Trajectory optimization is a powerful framework for controlling complicated robotic systems. The value of trajectory optimization lies primarily in its generality, allowing it to be applied to a very broad class of dynamical systems. Importantly, trajectory optimization can be applied to any type of dynamical system whose dynamics are Markovian, i.e.

\dot{x} = f(x, u)    (1)

where \dot{x} \in R^n is the time derivative of the state x \in R^n and u \in R^m are the controls.

While very general and powerful, trajectory optimization, like nearly all branches of optimal control, makes the important assumption that the states of the system are known exactly, i.e. that the system has full state feedback. Most formulations of trajectory optimization assume a given initial condition.

Trajectory optimization can be succinctly summarized by the underlying optimization problem being solved:

\begin{aligned}
\underset{x(t),\,u(t)}{\text{minimize}} \quad & \ell(x(t_f)) + \int_0^{t_f} \ell(x(t), u(t))\, dt \\
\text{subject to} \quad & \dot{x} = f(x, u), \\
& x(0) = x_0, \\
& g(x(t), u(t)) \leq 0, \\
& h(x(t), u(t)) = 0,
\end{aligned}    (2)

where x(t) and u(t) are the state and control trajectories from time 0 to T. The dynamics constraint, which constrains derivatives of the optimization variables, is what sets trajectory optimization apart from other general nonlinear optimization problems. The field of trajectory optimization is dedicated to finding efficient ways to solve these differentially-constrained optimization problems.

The most common approach, and the one treated exclusively in the current tutorial, is to discretize the problem in time, dividing the time period of length T seconds into N − 1 segments, typically of equal length \Delta t seconds. This results in N discretization points, also referred to as "knot" points, including the initial and final times. There exist many methods for approximating the integrals in (2) with discrete sums, as well as for transforming the ordinary differential equation for the dynamics (1) into a discrete difference equation of the form:

x_{k+1} = f(x_k, u_k, \Delta t).    (3)

The resulting discrete optimization problem can then be written as:

\begin{aligned}
\underset{x_{0:N},\,u_{0:N-1}}{\text{minimize}} \quad & \ell_N(x_N) + \sum_{k=0}^{N-1} \ell_k(x_k, u_k, \Delta t) \\
\text{subject to} \quad & x_{k+1} = f(x_k, u_k, \Delta t), \quad k = 1, \dots, N-1, \\
& g_k(x_k, u_k) \leq 0, \quad \forall k, \\
& h_k(x_k, u_k) = 0, \quad \forall k.
\end{aligned}    (4)
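As a concrete illustration of (3), one common way to turn the continuous dynamics (1) into a discrete difference equation is an explicit fourth-order Runge–Kutta step. The sketch below is a minimal Julia example; the dynamics function f(x, u) and the double-integrator usage at the end are illustrative assumptions, not part of the tutorial.

```julia
# Minimal sketch of (3): discretizing ẋ = f(x, u) with a fourth-order Runge-Kutta
# step, assuming a user-supplied continuous dynamics function f(x, u).
function discrete_dynamics(f, x, u, dt)
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + (dt / 6) * (k1 + 2k2 + 2k3 + k4)   # x_{k+1} = f(x_k, u_k, Δt)
end

# Illustrative usage: a double integrator with x = [position; velocity], u = [acceleration].
f(x, u) = [x[2], u[1]]
x1 = discrete_dynamics(f, [0.0, 0.0], [1.0], 0.1)
```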
There exist many methods for solving problems of the form (4). These methods are typically divided into two categories: "indirect" and "direct" methods. Direct methods treat both the states and controls as decision variables and use general-purpose nonlinear programming (NLP) solvers, such as SNOPT or IPOPT. These methods typically transcribe the optimization problem into something of the form given in (4), often with varying methods for approximating the continuous-time dynamics or unique formulations of the problem constraints. The most common method, direct collocation (DIRCOL), uses Hermite–Simpson integration to integrate both the cost and the dynamics, which is essentially a 3rd-order implicit Runge–Kutta integrator for the states and a first-order hold (i.e. linear interpolation) for the controls. These methods benefit directly from the robustness and generality of the NLP solvers on which they depend. However, direct methods also tend to be fairly slow and require large optimization packages.

Alternatively, indirect methods exploit the Markov structure of (4) and often impose strict Markovianity across the entire problem, including the cost functions and constraints. The dynamics constraints are then implicitly enforced by simulating the system's dynamics forward. Differential Dynamic Programming (DDP) and iterative LQR (iLQR) are closely related indirect methods that solve (4) by breaking it into a sequence of smaller sub-problems. DDP methods improve on more naive "simple shooting" methods by incorporating a feedback policy during the forward simulation of the dynamics at each time step. Because of their strict enforcement of dynamic feasibility, it is often difficult to find a control sequence that produces a reasonable initialization for DDP methods. While they are fast and have a low memory footprint, making them amenable to embedded implementation, DDP methods have historically been considered less numerically robust and less well-suited to handling nonlinear state and input constraints.

This tutorial derives a method for solving constrained trajectory optimization problems using DDP or iLQR within an augmented Lagrangian framework. The result is a fast, efficient algorithm that allows nonlinear equality and inequality constraints on both the states and controls. This tutorial is based on previous research [1]–[3] and experience implementing this algorithm in several programming languages.

II. BACKGROUND

A. Augmented Lagrangian

Augmented Lagrangian is an optimization method for solving constrained optimization problems of the form

\begin{aligned}
\underset{x}{\text{minimize}} \quad & f(x) \\
\text{subject to} \quad & c(x)\, \{\leq, =\}\, 0.
\end{aligned}    (5)

One of the easiest methods for solving constrained optimization problems is to move the constraints into the cost function and iteratively increase the penalty for either getting close to or violating the constraints. However, penalty methods only converge to the optimal answer as the penalty terms are increased to infinity, which is impractical to implement in a numerical optimization routine with limited-precision arithmetic. Augmented Lagrangian methods improve on penalty methods by maintaining estimates of the Lagrange multipliers associated with the constraints. This is accomplished by forming the augmented Lagrangian

\mathcal{L}_A = f(x) + \lambda^T c(x) + \frac{1}{2} c(x)^T I_\mu c(x)    (6)

where \lambda are the Lagrange multipliers, \mu are the penalty multipliers, and

I_\mu = \begin{cases} 0 & \text{if } c_i(x) < 0 \wedge \lambda_i = 0,\ i \in \mathcal{I} \\ \mu_i & \text{otherwise.} \end{cases}    (7)

The problem is then solved as follows:
1) Solve \min_x \mathcal{L}_A(x, \lambda, \mu), holding \lambda and \mu constant.
2) Update the Lagrange multipliers:

\lambda_i^+ = \begin{cases} \lambda_i + \mu_i c_i(x^*) & i \in \mathcal{E} \\ \max(0,\ \lambda_i + \mu_i c_i(x^*)) & i \in \mathcal{I}. \end{cases}    (8)

3) Update the penalty term: \mu^+ = \phi\mu,\ \phi > 1.
4) Check constraint convergence.
5) If the tolerance is not met, go to 1.

where \mathcal{E} and \mathcal{I} are the sets of equality and inequality constraint indices, respectively, and \phi is the penalty scaling parameter (typically 2 \leq \phi \leq 10).
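The outer loop above is only a few lines of code. The following sketch shows it for a problem with equality constraints only (the inequality logic of (7) is omitted for brevity); the inner solver minimize_unconstrained and the constraint function c are assumed placeholders, not part of the tutorial's reference implementation.

```julia
# Sketch of the augmented Lagrangian outer loop (5)-(8), equality constraints only.
# `minimize_unconstrained(L, x)` and `c(x)` are assumed placeholders.
function augmented_lagrangian(f, c, x0; ϕ=10.0, μ0=1.0, tol=1e-6, max_outer=20)
    x = x0
    λ = zeros(length(c(x0)))
    μ = μ0
    for _ in 1:max_outer
        L = z -> f(z) + λ' * c(z) + 0.5 * μ * c(z)' * c(z)   # augmented Lagrangian (6)
        x = minimize_unconstrained(L, x)       # step 1: minimize L_A with λ, μ fixed
        λ = λ + μ * c(x)                       # step 2: multiplier update (8), i ∈ E
        μ = ϕ * μ                              # step 3: monotone penalty increase
        maximum(abs.(c(x))) < tol && break     # steps 4-5: stop once constraints are met
    end
    return x, λ
end
```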
B. Linear Quadratic Regulator

The Linear Quadratic Regulator (LQR) problem is a canonical problem in the theory of optimal control, partially due to the fact that it has analytical solutions that can be derived using a variety of methods, and from the fact that LQR is an extremely useful tool in practice. We present derivations for both continuous-time and discrete-time LQR. The discrete-time derivation is particularly useful in setting up the ideas that will be used in deriving constrained DDP.

The continuous-time LQR problem is formulated as

\begin{aligned}
\min_{u(t)} \quad & \frac{1}{2} x^T(t_f) Q(t_f) x(t_f) + \frac{1}{2} \int_0^{t_f} \left[ x^T(t) Q(t) x(t) + u^T(t) R(t) u(t) \right] dt \\
\text{s.t.} \quad & \dot{x}(t) = A(t) x(t) + B(t) u(t)
\end{aligned}    (9)

where R is a real, symmetric, positive-definite matrix and Q is a real, symmetric, positive semi-definite matrix, and the discrete-time LQR problem is

\begin{aligned}
\min_{u_{0:N-1}} \quad & \frac{1}{2} x_N^T Q_N x_N + \frac{1}{2} \sum_{k=0}^{N-1} \left[ x_k^T Q_k x_k + u_k^T R_k u_k \right] \\
\text{s.t.} \quad & x_{k+1} = A_k x_k + B_k u_k.
\end{aligned}    (10)

1) Hamilton-Jacobi-Bellman Derivation: The Hamilton-Jacobi-Bellman equation is an important equation in the theory of optimal control that states the necessary conditions for optimality for a continuous-time system. We define the cost function to be

J(x(t), u(t), t) = \ell(x(t_f), t) + \int_0^{t_f} \ell(x(t), u(t), t)\, dt    (11)

and the minimum cost function

J^*(x(t), t) = \min_{u(t)} J(x(t), u(t), t).    (12)

We now define the Hamiltonian as

H(x(t), u(t), J_x^*, t) = \ell(x(t), u(t), t) + J_x^{*T}(x(t), t) \left[ f(x(t), u(t), t) \right]    (13)

where J_x^* = \partial J^* / \partial x and f(x(t), u(t), t) are the dynamics. The Hamilton-Jacobi-Bellman equation is then

0 = J_t^*(x(t), t) + H(x(t), u^*(t), J_x^*, t).    (14)

We now use these to solve the continuous LQR problem. We form the Hamiltonian

H(x(t), u(t), J_x^*, t) = \frac{1}{2} x^T(t) Q(t) x(t) + \frac{1}{2} u^T(t) R(t) u(t) + J_x^{*T}(x(t), t) \left[ A(t) x(t) + B(t) u(t) \right]    (15)
and minimize it with respect to u(t) by setting \partial H / \partial u = 0:

\frac{\partial H}{\partial u} = R u + B^T J_x^* = 0.    (16)

Solving with respect to u yields the optimal control

u^* = -R^{-1} B^T J_x^*    (17)

which is globally optimal since \partial^2 H / \partial u^2 = R(t) is positive definite.

Substituting the optimal control back into our Hamiltonian (15):

\begin{aligned}
H(x, u^*, J_x^*, t) &= \frac{1}{2} x^T Q x + \frac{1}{2} J_x^{*T} B R^{-1} B^T J_x^* + J_x^{*T} \left[ A x - B R^{-1} B^T J_x^* \right] \\
&= \frac{1}{2} x^T Q x - \frac{1}{2} J_x^{*T} B R^{-1} B^T J_x^* + J_x^{*T} A x
\end{aligned}    (18)

We're now ready to use the Hamilton-Jacobi-Bellman equation (14). Since the HJB equation is a first-order partial differential equation of the minimum cost, we need to guess a solution, which we assume to be quadratic:

J^*(x(t), t) = \frac{1}{2} x^T(t) K(t) x(t)    (19)

where K is a symmetric positive-definite matrix. Plugging these into the HJB equation we get:

0 = \frac{1}{2} x^T \left[ \dot{K} + Q - K^T B R^{-1} B^T K + 2 K A \right] x    (20)

Leveraging the symmetry of quadratic forms and the fact that (20) must be equal to zero for all x(t), we arrive at the Riccati equation:

0 = \dot{K} + Q - K^T B R^{-1} B^T K + K A + A^T K.    (21)

When solved for K(t) using a specialized Riccati solver, the optimal control law is given by

u^*(t) = -R^{-1}(t) B^T(t) K(t) x(t).    (22)

2) Discrete LQR: We now shift our focus to solving the discrete-time LQR problem (10). We start by defining the value function

V_i(x) = \min_{u_i, \dots, u_{N-1}} \; \frac{1}{2} x_N^T Q x_N + \frac{1}{2} \sum_{k=i}^{N-1} \left[ x_k^T Q_k x_k + u_k^T R_k u_k \right]    (23)

subject to the dynamics x_{k+1} = A_k x_k + B_k u_k. This gives the cost of starting at a particular state and simulating the dynamics forward. The cost of the entire trajectory is therefore V_0(x_0). From the Bellman equation and the principle of optimality we can re-define the value function in a more convenient, recursive form:

V_k(x) = \min_{u_k} \; \frac{1}{2} x_k^T Q_k x_k + \frac{1}{2} u_k^T R_k u_k + V_{k+1}(A_k x_k + B_k u_k)    (24)

which simply states that the cost-to-go is the cost incurred for the current decision, plus the cost-to-go of where our current decision takes us (by simulating our dynamics forward one time instance). For convenience, we define the action-value function Q(x_k, u_k) to be the value being minimized:

V_k(x) = \min_{u_k} Q(x_k, u_k)    (25)

We also assume that V_k(x) is a quadratic form, i.e.

V_k(x) = \frac{1}{2} x_k^T P_k x_k    (26)

so that the action-value function takes the following form:

Q(x_k, u_k) = \frac{1}{2} x_k^T Q_k x_k + \frac{1}{2} u_k^T R_k u_k + \frac{1}{2} (A_k x_k + B_k u_k)^T P_{k+1} (A_k x_k + B_k u_k)    (27)

Since there are no controls to optimize at the last time step, we note that

V_N(x) = \frac{1}{2} x_N^T Q_N x_N    (28)

Since we know the terminal cost-to-go, we can find the optimal cost-to-go for all time steps once we have a recurrence relation that gives V_k as a function of V_{k+1}. We start by optimizing the action-value function at time step k, which has the following first-order necessary condition:

\frac{\partial Q}{\partial u} = R_k u_k + B_k^T P_{k+1} (A_k x_k + B_k u_k) = 0    (29)

Solving for u we find the optimal control trajectory:

\begin{aligned}
u_k^* &= -(R_k + B_k^T P_{k+1} B_k)^{-1} B_k^T P_{k+1} A_k x_k \\
&= -Q_{uu}^{-1} Q_{ux}\, x_k \\
&= -K_k x_k
\end{aligned}    (30)

where Q_{uu} = \partial^2 Q / \partial u^2 and Q_{ux} = \partial^2 Q / \partial u \partial x. We introduce this notation for easy comparison to the more complicated DDP derivation in the following sections.

With an optimal feedback policy, we now substitute (30) into (27) to get the cost-to-go:

\begin{aligned}
V_k(x) &= Q(x_k, u_k^*) \\
&= \frac{1}{2} \left[ x_k^T Q_k x_k + x_k^T K_k^T R_k K_k x_k + x_k^T (A_k - B_k K_k)^T P_{k+1} (A_k - B_k K_k) x_k \right] \\
&= \frac{1}{2} x_k^T \left[ Q_k + A_k^T P_{k+1} A_k + K_k^T (R_k + B_k^T P_{k+1} B_k) K_k - K_k^T B_k^T P_{k+1} A_k - A_k^T P_{k+1} B_k K_k \right] x_k \\
&= \frac{1}{2} x_k^T P_k x_k
\end{aligned}    (31)

where, after substituting in (30) and simplifying,

P_k = Q_k + A_k^T P_{k+1} A_k - A_k^T P_{k+1} B_k (R_k + B_k^T P_{k+1} B_k)^{-1} B_k^T P_{k+1} A_k.    (32)

This establishes the recurrence relation between V_k and V_{k+1}, which can be used to recursively calculate the cost-to-go for the entire trajectory, along with the optimal feedback control gains K_k.
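To make the recursion concrete, the sketch below implements the backward Riccati sweep defined by (28), (30), and (32) for a time-invariant system; the double-integrator matrices at the end are illustrative assumptions.

```julia
using LinearAlgebra

# Sketch of the discrete LQR backward recursion (28), (30), (32),
# written for a time-invariant system for brevity.
function lqr_backward(A, B, Q, R, QN, N)
    P = Vector{Matrix{Float64}}(undef, N + 1)
    K = Vector{Matrix{Float64}}(undef, N)
    P[N + 1] = QN                                           # V_N(x) = ½ xᵀQ_N x   (28)
    for k in N:-1:1                                          # sweep backward in time
        Quu = R + B' * P[k + 1] * B
        Qux = B' * P[k + 1] * A
        K[k] = Quu \ Qux                                     # u*_k = -K_k x_k      (30)
        P[k] = Q + A' * P[k + 1] * A - Qux' * (Quu \ Qux)    # Riccati recursion    (32)
    end
    return K, P
end

A = [1.0 0.1; 0.0 1.0]                                       # double integrator, Δt = 0.1
B = reshape([0.005, 0.1], 2, 1)
K, P = lqr_backward(A, B, Matrix(1.0I, 2, 2), Matrix(0.1I, 1, 1), Matrix(10.0I, 2, 2), 50)
```

Rolling the dynamics forward with u_k = −K_k x_k then produces the optimal trajectory, whose cost from the initial state is ½ x_0ᵀ P_1 x_0.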
III. AL-DDP

We now present the derivation for solving constrained trajectory optimization problems using differential dynamic programming and iterative LQR within an augmented Lagrangian framework (AL-DDP or AL-iLQR). The derivation proceeds very similarly to that of discrete LQR, as derived in Section II-B.2.

The key idea of DDP is that at each iteration, all nonlinear constraints and objectives are approximated using first or second order Taylor series expansions so that the approximate functions, now operating on deviations about the nominal trajectory, can be solved using discrete LQR. This optimal feedback policy is computed during the "backward pass" (Algorithm 1), since the dynamic programming step begins at the tail of the trajectory, as in LQR. The optimal deviations are then applied to the nominal trajectory during a "forward pass" (Algorithm 2), using the optimal feedback policy during the forward simulation (also known as the rollout) of the dynamics.

To handle constraints, we simply "augment" the cost function with the multiplier and penalty terms of the augmented Lagrangian, treating \lambda and \mu as constants. After several iterations of DDP, the multipliers and penalty terms are updated, and the process is repeated. The algorithm is summarized in Algorithm 3. We now proceed with the formal derivation.

A. Backward Pass

We first form the augmented Lagrangian of (4):

\begin{aligned}
\mathcal{L}_A &= \ell_N(x_N) + \lambda_N^T c_N(x_N) + \frac{1}{2} c_N(x_N)^T I_{\mu,N}\, c_N(x_N) \\
&\quad + \sum_{k=0}^{N-1} \left[ \ell_k(x_k, u_k, \Delta t) + \lambda_k^T c_k(x_k, u_k) + \frac{1}{2} c_k(x_k, u_k)^T I_{\mu,k}\, c_k(x_k, u_k) \right] \\
&= \mathcal{L}_N(x_N, \lambda_N, \mu_N) + \sum_{k=0}^{N-1} \mathcal{L}_k(x_k, u_k, \lambda_k, \mu_k)
\end{aligned}    (33)

where \lambda_k \in R^{p_k} is a Lagrange multiplier, \mu_k \in R^{p_k} is a penalty weight, and c_k = (g_k, h_k) \in R^{p_k} is the concatenated set of inequality and equality constraints with index sets \mathcal{I}_k and \mathcal{E}_k, respectively. I_{\mu_k} \in R^{p_k \times p_k} is the penalty matrix defined in (7).

We now define the cost-to-go and the action-value functions as before:

V_N(x_N)|_{\lambda,\mu} = \mathcal{L}_N(x_N, \lambda_N, \mu_N)    (34)

\begin{aligned}
V_k(x_k)|_{\lambda,\mu} &= \min_{u_k} \left\{ \mathcal{L}_k(x_k, u_k, \lambda_k, \mu_k) + V_{k+1}(f(x_k, u_k, \Delta t))|_{\lambda,\mu} \right\} \quad &(35) \\
&= \min_{u_k} Q(x_k, u_k)|_{\lambda,\mu}, \quad &(36)
\end{aligned}

where V_k(x)|_{\lambda,\mu} is the cost-to-go at time step k evaluated with the Lagrange multipliers \lambda and the penalty terms \mu.

In order to make the dynamic programming step feasible, we take a second-order Taylor series of the nonlinear cost-to-go

\delta V_k(x) \approx \frac{1}{2} \delta x_k^T P_k \delta x_k + p_k^T \delta x_k    (37)

where P_k and p_k are the Hessian and gradient of the cost-to-go at time step k, respectively. It is important to note that by taking Taylor series approximations, we have now switched to optimizing deviations about the nominal control trajectory.

Similar to (28), we can trivially calculate the cost-to-go at the terminal time step since there are no controls to optimize:

p_N = (\ell_N)_x + (c_N)_x^T (\lambda + I_{\mu_N} c_N)    (38)
P_N = (\ell_N)_{xx} + (c_N)_x^T I_{\mu_N} (c_N)_x,    (39)

which are simply the gradient and Hessian of (34) with respect to the states x_N.

The relationship between \delta V_k and \delta V_{k+1} is derived by taking the second-order Taylor expansion of Q_k with respect to the state and control,

\delta Q_k = \frac{1}{2} \begin{bmatrix} \delta x_k \\ \delta u_k \end{bmatrix}^T \begin{bmatrix} Q_{xx} & Q_{xu} \\ Q_{ux} & Q_{uu} \end{bmatrix} \begin{bmatrix} \delta x_k \\ \delta u_k \end{bmatrix} + \begin{bmatrix} Q_x \\ Q_u \end{bmatrix}^T \begin{bmatrix} \delta x_k \\ \delta u_k \end{bmatrix}    (40)

Dropping the time-step indices for notational clarity, the block matrices are,

Q_{xx} = \ell_{xx} + A^T P' A + c_x^T I_\mu c_x    (41)
Q_{uu} = \ell_{uu} + B^T P' B + c_u^T I_\mu c_u    (42)
Q_{ux} = \ell_{ux} + B^T P' A + c_u^T I_\mu c_x    (43)
Q_x = \ell_x + A^T p' + c_x^T (\lambda + I_\mu c)    (44)
Q_u = \ell_u + B^T p' + c_u^T (\lambda + I_\mu c),    (45)

where A = \partial f / \partial x |_{x_k, u_k}, B = \partial f / \partial u |_{x_k, u_k}, and ' indicates variables at the k + 1 time step. It is in this step that we differentiate between DDP and iLQR. DDP computes the full second-order expansion, which includes computations with rank-3 tensors resulting from the vector-valued dynamics and general constraints. iLQR computes these expansions only to first order, such that the dynamics and constraints are both linear in the states and controls, resulting in a Gauss-Newton approximation of the true Hessian. While this lower accuracy expansion results in the need for more iterations, these iterations are considerably less expensive to compute, often resulting in a considerably faster overall convergence rate. The value of the full second-order information often depends on the type of problem being solved, and there exist a variety of methods for approximating the rank-3 tensor information without the need to explicitly compute it. The motivation for the name "iterative LQR" should now be clear, as we see that by linearizing the dynamics we arrive at a problem nearly identical to (10).

Minimizing (40) with respect to \delta u_k gives a correction to the control trajectory. The result is a feedforward term d_k and a linear feedback term K_k \delta x_k. Regularization is added to ensure the invertibility of Q_{uu}:

\delta u_k^* = -(Q_{uu} + \rho I)^{-1} (Q_{ux} \delta x_k + Q_u) \equiv K_k \delta x_k + d_k.    (46)
Take a quick moment to note that this is almost exactly the same as (30), except we have now added some regularization to handle poorly conditioned Hessians (since these now depend on the expansion about the nominal trajectory) and now have a feedforward term, which results from the linear terms in the expansion. We would see similar terms appear in LQR if we included linear terms in the cost function.

Substituting \delta u_k^* back into (40), a closed-form expression for p_k, P_k, and the expected change in cost, \Delta V_k, is found:

P_k = Q_{xx} + K_k^T Q_{uu} K_k + K_k^T Q_{ux} + Q_{xu} K_k    (47)
p_k = Q_x + K_k^T Q_{uu} d_k + K_k^T Q_u + Q_{xu} d_k    (48)
\Delta V_k = d_k^T Q_u + \frac{1}{2} d_k^T Q_{uu} d_k.    (49)

Algorithm 1 Backward Pass
1: function BACKWARDPASS(X, U)
2:   p_N, P_N ← (38), (39)
3:   for k = N−1 to 0 do
4:     δQ ← (40), (41)-(43)
5:     if Q_uu ≻ 0 then
6:       K, d, ∆V ← (46), (49)
7:     else
8:       Increase ρ and go to line 3
9:     end if
10:  end for
11:  return K, d, ∆V
12: end function
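For readers who prefer code to pseudocode, the sketch below implements a single backward-pass step in the unconstrained (iLQR) setting, i.e. equations (41)–(49) with the constraint terms dropped. The cost expansion (lx, lu, lxx, luu, lux), the dynamics Jacobians (A, B), and the next step's cost-to-go (Pnext, pnext) are assumed to be supplied by the caller.

```julia
using LinearAlgebra

# Sketch of one unconstrained backward-pass step: (41)-(49) without constraint terms.
function backward_step(A, B, lx, lu, lxx, luu, lux, Pnext, pnext; ρ=0.0)
    Qx  = lx + A' * pnext                   # (44), no constraint terms
    Qu  = lu + B' * pnext                   # (45)
    Qxx = lxx + A' * Pnext * A              # (41)
    Quu = luu + B' * Pnext * B              # (42)
    Qux = lux + B' * Pnext * A              # (43)

    Quu_reg = Quu + ρ * I                   # regularized Hessian from (46)
    K = -(Quu_reg \ Qux)                    # feedback gain
    d = -(Quu_reg \ Qu)                     # feedforward term

    P  = Qxx + K' * Quu * K + K' * Qux + Qux' * K   # (47)
    p  = Qx + K' * Quu * d + K' * Qu + Qux' * d     # (48)
    ΔV = d' * Qu + 0.5 * d' * Quu * d               # (49)
    return K, d, P, p, ΔV
end
```

Looping this function from k = N−1 down to 0, seeded with (38)–(39), reproduces the structure of Algorithm 1.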
restarted from the beginning. The regularization is only de-
B. Forward Pass creased after a successful backward pass. The regularization
scaling value is usually between 1.5 and 2.0.
Now that we have computed the optimal feedback gains
for each time step, we now update the nominal trajectories
by simulating forward the dynamics. Since the initial state is Algorithm 2 Forward Pass
fixed, the entire forward simulation can be summarized by 1: function F ORWARD PASS (X, U, K, d, ∆V, J)
the following updates: 2: Initialize x̄0 = x0 , α = 1, J − ← J
3: for k=0:1:N-1 do
δxk =x̄k − xk (50) 4: ūk = uk + Kk (x̄k − xk ) + αdk
δuk =Kk δxk + αdk (51) 5: x̄k+1 ← Using x̄k , ūk , (3)
ūk =uk + δuk (52) 6: end for
7: J ← Using X, U
x̄k+1 =f (x̄k , ūk ) (53)
8: if J satisfies line search conditions then
where x̄k and ūk are the updated nominal trajectories and 9: X ← X̄, U ← Ū
0 ≤ α ≤ 1 is a scaling term. 10: else
1) Line Search: : As with all nonlinear optimization, a 11: Reduce α and go to line 3
line search along the descent direction is needed to ensure an 12: end if
adequate reduction in cost. We employ a simple backtracking 13: return X, U, J
line search on the feedforward term using the parameter 14: end function
α. After applying equations (50)-(53) to get candidate state
and control trajectories, we compute the ratio of the actual
decrease to the expected decrease:
C. Termination Conditions
J(X, U ) − J(X̄, Ū )
z= (54) The “inner” DDP/iLQR solve is run until one of the
−∆V (α)
following termination conditions is met
where
N −1 • The cost decrease between iterations Jprev − J is less
X 1 than some intermediate tolerance, intermediate . This is the
∆V (α) = αdTk Qu + α2 dTk Quu dk (55)
2 typical exit criteria.
k=0
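One way to implement Algorithm 2 together with the backtracking line search of (54)–(55) is sketched below. Here discrete_dynamics and total_cost are assumed placeholders, and ΔV1, ΔV2 are the two sums in (55) stored separately (scaled by α and α², as suggested above).

```julia
# Sketch of the forward pass with a backtracking line search on α (Algorithm 2, (50)-(55)).
function forward_pass(X, U, K, d, ΔV1, ΔV2, Jprev, dt; β1=1e-4, β2=10.0, γ=0.5, max_ls=10)
    N = length(X)
    α = 1.0
    for _ in 1:max_ls
        X̄ = [copy(X[1])]                                  # initial state is fixed
        Ū = similar(U, 0)
        for k in 1:N-1
            ū = U[k] + K[k] * (X̄[k] - X[k]) + α * d[k]    # (50)-(52)
            push!(Ū, ū)
            push!(X̄, discrete_dynamics(X̄[k], ū, dt))      # (53)
        end
        J  = total_cost(X̄, Ū)
        ΔV = α * ΔV1 + α^2 * ΔV2                          # expected decrease (55)
        z  = (Jprev - J) / (-ΔV)                          # actual vs. expected decrease (54)
        if β1 <= z <= β2
            return X̄, Ū, J                                # accept the candidate trajectories
        end
        α *= γ                                            # otherwise backtrack on α
    end
    return X, U, Jprev                                    # failed line search: keep nominal
end
```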
C. Termination Conditions

The "inner" DDP/iLQR solve is run until one of the following termination conditions is met:
• The cost decrease between iterations, J_prev − J, is less than some intermediate tolerance, ε_intermediate. This is the typical exit criterion.
• The feedforward gains go to zero. We compute the average maximum of the normalized gains:

\nabla_{ilqr} = \frac{1}{N-1} \sum_{k=0}^{N-1} \frac{\| d_k \|_\infty}{|U_k| + 1}    (56)

• The solver hits a maximum number of iterations, i^{max}_{ilqr}.

Algorithm 3 Iterative LQR
1: Initialize x_0, U, tolerance
2: X ← Simulate from x_0 using U, (3)
3: function ILQR(X, U)
4:   J ← Using X, U
5:   do
6:     J⁻ ← J
7:     K, d, ∆V ← BACKWARDPASS(X, U)
8:     X, U, J ← FORWARDPASS(X, U, K, d, ∆V, J⁻)
9:   while |J − J⁻| > tolerance
10:  return X, U, J
11: end function

D. Augmented Lagrangian Update

Once the inner solve hits one of the termination conditions listed in Section III-C, the dual variables are updated according to,

\lambda_{ki}^+ = \begin{cases} \lambda_{ki} + \mu_{ki} c_{ki}(x_k^*, u_k^*) & i \in \mathcal{E}_k \\ \max(0,\ \lambda_{ki} + \mu_{ki} c_{ki}(x_k^*, u_k^*)) & i \in \mathcal{I}_k, \end{cases}    (57)

and the penalty is increased monotonically according to the schedule,

\mu_{ki}^+ = \phi \mu_{ki},    (58)

where \phi > 1 is a scaling factor.

After experimenting with various different heuristic schemes for updating the multipliers and the penalty parameters, we have found that the most naive approach, that is, updating them at every outer loop iteration, is by far the most reliable. It's possible that better performance may be achieved by skipping penalty updates, or only updating some of the penalty parameters (e.g. the ones corresponding to active constraints), but we found that the most basic approach works the best on the largest variety of problems.
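In code, the updates (57)–(58) applied after each inner iLQR solve amount to a few lines per knot point. In the sketch below, λ, μ, and c hold one knot point's multipliers, penalties, and constraint values, and is_eq[i] flags equality constraints; the argument names and the μ_max cap (from Table II) are illustrative.

```julia
# Sketch of the dual and penalty updates (57)-(58) at a single knot point.
function dual_penalty_update!(λ, μ, c, is_eq; ϕ=10.0, μ_max=1e8)
    for i in eachindex(λ)
        λ_new = λ[i] + μ[i] * c[i]
        λ[i] = is_eq[i] ? λ_new : max(0.0, λ_new)   # (57): project the duals for i ∈ I_k
        μ[i] = min(ϕ * μ[i], μ_max)                 # (58): monotone increase, capped at μ_max
    end
    return λ, μ
end
```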
E. Hyperparameters

For convenience, we summarize all the hyperparameters in AL-iLQR in Tables I and II, splitting them between those used for the inner unconstrained iLQR solve and those used by the outer augmented Lagrangian solver.

TABLE I
iLQR HYPERPARAMETERS

Symbol | Name | Description | Typical Value(s) | Importance
ε_cost | Cost tolerance | Converged when the difference in cost between iterations < ε_cost | [1e-2, 1e-4, 1e-8] | High
ε_grad^ilqr | Gradient tolerance | Converged when ∇_ilqr < ε_grad^ilqr | 1e-5 | Med
i_max^ilqr | Max iterations | Maximum number of backward/forward pass iterations | [50, 500] | Med
β_1 | Line search l.b. | Lower bound criterion for (54). ↑ requires more progress be made | [1e-10, 1e-8, 1e-1] | Low
β_2 | Line search u.b. | Upper bound criterion for (54). ↓ requires progress match expected | [1, 10, 20] | Low
i_max^ls | Line search iterations | Maximum number of backtracking line search iterations | [5, 10, 20] | Low
ρ_init | Initial regularization | Initial value for the regularization of Q_uu in the backward pass | 0 | Low
ρ_max | Max regularization | Any further increases will saturate. ↓ allows for less aggressive regularization | 1e8 | Low
ρ_min | Min regularization | Any regularization below this will round to 0 | 1e-8 | Low
φ_ρ | Reg. scaling | How much the regularization is increased/decreased | (1, 1.6, 10) | Low
J_max | Max cost | Maximum cost allowed during a rollout | 1e8 | Med

TABLE II
AUGMENTED LAGRANGIAN HYPERPARAMETERS

Symbol | Name | Description | Typical Value(s) | Importance
ε_cost | Cost tolerance | Converged when the difference in cost between iterations < ε_cost | [1e-2, 1e-4, 1e-8] | High
ε_uncon | iLQR cost tolerance | Cost tolerance for intermediate iLQR solves. ↑ can speed up overall convergence by requiring less optimality at each inner solve, resulting in more frequent dual updates | [1e-1, 1e-3, 1e-8] | High
ε_cmax | Constraint tolerance | Converged when the maximum constraint violation < ε_cmax | [1e-2, 1e-4, 1e-8] | High
i_max^outer | Outer loop iterations | Maximum number of outer loop updates | [10, 30, 100] | Med
μ_max | Max penalty | ↑ allows for more outer loop iterations with good convergence, but may result in poor conditioning. ↓ may avoid ill-conditioning | 1e8 | Low
φ_μ | Penalty scaling | ↑ increases the penalty faster, potentially converging faster, but will eventually fail to converge if too high | (1, 10, 100] | Med
μ_init | Initial penalty | ↑ more likely to remain feasible and find a feasible solution faster. ↓ makes the initial problem appear unconstrained, which may be an ideal initial guess for the constrained problem | [1e-4, 1, 100] | Very High

IV. IMPROVEMENTS

Here we provide some improvements to the AL-iLQR algorithm derived in the previous section. These improvements are all incorporated into the ALTRO algorithm, summarized in Algorithm 4 and implemented in TrajectoryOptimization.jl.

Algorithm 4 ALTRO
1: procedure
2:   Initialize x_0, U, tolerances; X̃
3:   if Infeasible Start then
4:     X ← X̃, s_{0:N−1} ← from (74)
5:   else
6:     X ← Simulate from x_0 using U
7:   end if
8:   X, U, λ ← AL-ILQR(X, U, tol.)
9:   (X, U, λ) ← PROJECTION((X, U, λ), tol.)
10:  return X, U
11: end procedure

A. Square Root Backward Pass

Augmented Lagrangian methods make rapid progress on constraint satisfaction, but only as long as the penalty terms are updated at every outer loop iteration. This can quickly lead to very large penalty parameters, resulting in severe numerical ill-conditioning. To help mitigate this issue and make AL-iLQR more numerically robust, we derive a square-root backward pass inspired by the square-root Kalman filter.

1) Background: To begin, we provide some background on matrix square roots. The Cholesky factorization of a square positive-definite matrix G factors the matrix into two upper-triangular matrices G = U^T U, where the upper-triangular factor U can be considered a "square root" of the matrix G. We denote this matrix square root as U = \sqrt{G}.

Also in terms of background, the QR factorization F = QR returns an upper-triangular matrix R and an orthogonal matrix Q.

We now seek a method for computing \sqrt{A + B} in terms of \sqrt{A} and \sqrt{B}, where A, B \in R^{n \times n} are square positive-definite matrices. We begin by noting that

A + B = \begin{bmatrix} \sqrt{A} \\ \sqrt{B} \end{bmatrix}^T \begin{bmatrix} \sqrt{A} \\ \sqrt{B} \end{bmatrix} = F^T F    (59)

where F \in R^{2n \times n}. Taking the QR factorization of F we get

A + B = F^T F = R^T Q^T Q R = R^T R    (60)

since Q^T Q = I given that Q is an orthogonal matrix. Therefore we have that R = \sqrt{A + B} since R is upper-triangular. We use QR_R(\cdot) to denote the operation of taking the QR factorization of the argument and returning the upper-triangular factor.

The related quantity \sqrt{A - B} can be computed using successive rank-one downdates of \sqrt{A} using the rows of \sqrt{B}. We denote this operation as DOWNDATE(\sqrt{A}, \sqrt{B}).
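The identity (59)–(60) is easy to check numerically. In the sketch below, A and B are arbitrary small positive-definite matrices chosen for illustration.

```julia
using LinearAlgebra

# Numerical check of (59)-(60): the upper-triangular QR factor of the stacked
# square roots of A and B is a square root of A + B.
qr_r(F) = qr(F).R                     # QR_R(·): return the upper-triangular factor

A = [4.0 1.0; 1.0 3.0]
B = [2.0 0.5; 0.5 1.0]

sqrtA = cholesky(A).U                 # A = sqrtAᵀ sqrtA, upper triangular
sqrtB = cholesky(B).U

R = qr_r([sqrtA; sqrtB])              # F ∈ R^{2n×n} as in (59)
@assert R' * R ≈ A + B                # (60): RᵀR = FᵀF = A + B
```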
2) Derivation: The ill-conditioning of the backward pass is most significant in the Hessian of the cost-to-go, P_k. Our objective is to find an algorithm that only stores the square root of this matrix and never calculates it explicitly.

We begin by calculating the square root of the terminal cost-to-go:

S_N = \sqrt{P_N} = QR_R\left( \begin{bmatrix} \sqrt{(\ell_N)_{xx}} \\ \sqrt{I_{\mu,N}}\, (c_N)_x \end{bmatrix} \right)    (61)

where we define S \in R^{n \times n} = \sqrt{P} to be the upper-triangular square root of the Hessian of the cost-to-go.

All that is left is to find \sqrt{P_k} as an expression in terms of \sqrt{P_{k+1}}. We start by finding the square roots of Q_{xx} and Q_{uu}:

Z_{xx} = \sqrt{Q_{xx}} \leftarrow QR_R\left( \begin{bmatrix} \sqrt{\ell_{xx}} \\ S' A \\ \sqrt{I_\mu}\, c_x \end{bmatrix} \right)    (62)

Z_{uu} = \sqrt{Q_{uu}} \leftarrow QR_R\left( \begin{bmatrix} \sqrt{\ell_{uu}} \\ S' B \\ \sqrt{I_\mu}\, c_u \\ \sqrt{\rho}\, I \end{bmatrix} \right).    (63)

The optimal gains are then trivially computed as

K = -Z_{uu}^{-1} Z_{uu}^{-T} Q_{ux}    (64)
d = -Z_{uu}^{-1} Z_{uu}^{-T} Q_u.    (65)

Here it's important to note that these equations should be computed using a linear solver (e.g. the "\" operator in MATLAB or Julia) sequentially, since they are trivially computed using backward or forward substitution, given that the square root factors are triangular.

The gradient (48) and change of the cost-to-go (49) are also computed with simple substitution:

p = Q_x + (Z_{uu} K)^T (Z_{uu} d) + K^T Q_u + Q_{xu} d    (66)
\Delta V = d^T Q_u + \frac{1}{2} (Z_{uu} d)^T (Z_{uu} d).    (67)

We note that (47) is a quadratic form that can be expressed as

\begin{aligned}
P &= \begin{bmatrix} I \\ K \end{bmatrix}^T \begin{bmatrix} Q_{xx} & Q_{ux}^T \\ Q_{ux} & Q_{uu} \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix} \\
&= \begin{bmatrix} I \\ K \end{bmatrix}^T \begin{bmatrix} Z_{xx}^T & 0 \\ C^T & D^T \end{bmatrix} \begin{bmatrix} Z_{xx} & C \\ 0 & D \end{bmatrix} \begin{bmatrix} I \\ K \end{bmatrix} \\
&= \begin{bmatrix} Z_{xx} + CK \\ DK \end{bmatrix}^T \begin{bmatrix} Z_{xx} + CK \\ DK \end{bmatrix}
\end{aligned}    (68)

where,

C = Z_{xx}^{-T} Q_{xu}    (69)
D = \sqrt{Z_{uu}^T Z_{uu} - Q_{ux} Z_{xx}^{-1} Z_{xx}^{-T} Q_{xu}}    (70)
\phantom{D} = \text{DOWNDATE}\left( Z_{uu},\ Z_{xx}^{-T} Q_{xu} \right).    (71)

The square root of P_k is then

\sqrt{P_k} = QR_R\left( \begin{bmatrix} Z_{xx} + Z_{xx}^{-T} Q_{xu} K \\ \text{DOWNDATE}\left( Z_{uu},\ Z_{xx}^{-T} Q_{xu} \right) K \end{bmatrix} \right)    (72)

B. Infeasible State Trajectory Initialization

Desired state trajectories can often be identified (e.g., from sampling-based planners or expert knowledge), whereas finding a control trajectory that will produce this result is usually challenging. Dynamically infeasible state trajectory initialization is enabled by introducing additional inputs to the dynamics with slack controls s \in R^n,

x_{k+1} = f(x_k, u_k) + s_k,    (73)

to make the system artificially fully actuated.

Given initial state and control trajectories, \tilde{X} and U, the initial infeasible controls s_{0:N-1} are computed as the difference between the dynamics and the desired state trajectory at each time step:

s_k = \tilde{x}_{k+1} - f(\tilde{x}_k, u_k)    (74)

The optimization problem (4) is modified by replacing the dynamics with (73). An additional cost term,

\sum_{k=0}^{N-1} \frac{1}{2} s_k^T R_s s_k,    (75)

and constraints s_k = 0, k = 0, \dots, N-1, are also added to the problem. Since s_k = 0 at convergence, a dynamically feasible solution to (4) is still obtained.
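Computing the initial slack controls in (74) is a one-line map over the desired state trajectory; the sketch below assumes a discrete dynamics function for (3) as a placeholder.

```julia
# Sketch of (74): slack controls that make a desired, dynamically infeasible state
# trajectory X̃ consistent with the modified dynamics (73).
# `discrete_dynamics(x, u, dt)` is an assumed placeholder for (3).
infeasible_controls(X̃, U, dt) =
    [X̃[k + 1] - discrete_dynamics(X̃[k], U[k], dt) for k in 1:length(U)]
```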
C. Optimizing Quaternions

When dealing with robotic systems with arbitrary orientation, such as satellites, airplanes, quadrotors, etc., the 3D orientation must be optimized. Rotations present a unique challenge for optimization since all vector parameterizations of rotations are either underparameterized (e.g. Euler angles) or subject to constraints (e.g. rotation matrices or quaternions). Quaternions are often considered the best parameterization of rotation, since they use only four parameters and have a number of desirable qualities that make them attractive for numerical computation. We begin with some background on quaternions, along with our notation conventions for quaternion operations, and then present a method for handling quaternions within AL-iLQR.

1) Background and Notation: The unit quaternion is a four-parameter non-singular representation of rotations comprising a vector part z \in R^3 and a scalar part s \in R. To represent quaternion operations as linear-algebraic expressions, the following conventions are used: A quaternion q encodes the rotation from a vehicle's body frame to the world frame. The equivalent rotation matrix is given by,

R(q) = I + 2 z^\times (z^\times + s I),    (76)

where I is the 3 \times 3 identity matrix and z^\times denotes the 3 \times 3 skew-symmetric cross-product matrix:

z^\times = \begin{bmatrix} 0 & -z_3 & z_2 \\ z_3 & 0 & -z_1 \\ -z_2 & z_1 & 0 \end{bmatrix}.    (77)

Quaternion multiplication is denoted q_2 \otimes q_1, and the quaternion conjugate, representing the opposite rotation, is denoted q^*. For r \in R^3, \hat{r} is a quaternion with s = 0 and z = r. H \in R^{4 \times 3} is the matrix "hat" operator, such that \hat{r} = H r and r = H^T \hat{r}. The matrix M(q) \in R^{4 \times 4} is defined such that M(q_2) q_1 = q_2 \otimes q_1.

Quaternion rotations require special consideration in optimization problems to properly account for the unit-norm constraint. We define \phi \in R^3 to be a three-dimensional differential rotation, such that \hat{\phi} = \delta q, where q = q_0 \otimes \delta q, when \delta q is a small rotation. We seek the Jacobian G(q) \in R^{3 \times 4} that maps from the quaternion to our three-dimensional differential, \phi. From above, we have q = q_0 \otimes \delta q = q_0 \otimes \hat{\phi} = M(q_0) H \phi, so G(q_0)^T = M(q_0) H.

If h(q) : q \mapsto R^p is a function that maps a quaternion to a real-valued vector, the linearization of h(q) gives

\delta h \approx \frac{\partial h(q)}{\partial q} G(q)^T \phi.    (78)

Similarly, for a real-valued function g(q) : q \mapsto R, the second-order Taylor series expansion is

\delta g \approx \frac{\partial g(q)}{\partial q} G(q)^T \phi + \frac{1}{2} \phi^T G(q) \frac{\partial^2 g(q)}{\partial q^2} G(q)^T \phi    (79)

2) Quaternion AL-iLQR: We now present modifications to AL-iLQR in order to optimize states that contain quaternions. With standard commercial solvers, like those used in direct trajectory optimization methods, quaternions can only be treated as vectors in R^4, often resulting in poor performance. With AL-iLQR we can modify the algorithm to correctly optimize with respect to a unit quaternion. Extensions of this method to states with multiple quaternions are easily obtained through simple concatenation.

Let x \in R^n (n \geq 4) be a state vector containing a unit quaternion of the form,

x = \begin{bmatrix} y \\ q \end{bmatrix}    (80)

where y \in R^{n-4} is a standard vector and q is the quaternion. Using the results from above, the difference between two states, x and x_0, is

\delta x = \begin{bmatrix} \delta y \\ \phi \end{bmatrix} = \begin{bmatrix} y - y_0 \\ H^T(q_0^* \otimes q) \end{bmatrix} \in R^{n-1},    (81)

from which we get the Jacobian,

G(x) = \begin{bmatrix} I_{n-4} & 0 \\ 0 & G(q) \end{bmatrix}    (82)

where G(x) \in R^{(n-1) \times n}. The next sections detail how to modify the AL-iLQR backward and forward passes to correctly take expansions with respect to quaternion states.
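The state difference (81) and Jacobian (82) can be written compactly once quaternion multiplication and the hat operator H are available. The sketch below assumes a scalar-first quaternion q = [s; z] stored in the last four entries of the state x = [y; q]; the storage layout and helper names are illustrative assumptions.

```julia
using LinearAlgebra

# Sketch of the state difference (81) and state-difference Jacobian (82).
const H = vcat(zeros(1, 3), Matrix(1.0I, 3, 3))       # hat operator: r̂ = H r

skew(z) = [0 -z[3] z[2]; z[3] 0 -z[1]; -z[2] z[1] 0]  # z× from (77)

function M(q)                                         # M(q2) q1 = q2 ⊗ q1
    s, z = q[1], q[2:4]
    return vcat(hcat(s, -z'), hcat(z, s * Matrix(1.0I, 3, 3) + skew(z)))
end

conj_q(q) = vcat(q[1], -q[2:4])                       # quaternion conjugate q*

attitude_jacobian(q) = (M(q) * H)'                    # G(q) ∈ R^{3×4}, since G(q)ᵀ = M(q)H

function state_diff(x, x0, n)                         # δx in (81)
    y,  q  = x[1:n-4],  x[n-3:n]
    y0, q0 = x0[1:n-4], x0[n-3:n]
    ϕ = H' * (M(conj_q(q0)) * q)                      # φ = Hᵀ(q0* ⊗ q)
    return vcat(y - y0, ϕ)
end

function state_diff_jacobian(x, n)                    # G(x) in (82), size (n-1) × n
    q = x[n-3:n]
    return vcat(hcat(Matrix(1.0I, n - 4, n - 4), zeros(n - 4, 4)),
                hcat(zeros(3, n - 4), attitude_jacobian(q)))
end
```

These two functions are exactly what the backward-pass corrections (83)–(87) and the forward-pass difference (88) below consume.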
Backward pass: The backward pass of iLQR linearizes the dynamics at each time step and approximates the cost-to-go as a quadratic function by taking the second-order Taylor series expansion of the cost-to-go. P_N \in S^n_{++} and p_N \in R^n are the Hessian and gradient of the terminal cost-to-go, and the corrected terms are,

\tilde{P}_N = G(x_N) P_N G(x_N)^T    (83)
\tilde{p}_N = G(x_N) p_N    (84)

Q_{xx}, Q_{ux}, and Q_x are the terms from the approximate quadratic expansion of the action-value function at time step k that must be modified. The corrected terms from the expansion are:

\tilde{Q}_{xx} = G(x_k) Q_{xx} G(x_k)^T    (85)
\tilde{Q}_{ux} = Q_{ux} G(x_k)^T    (86)
\tilde{Q}_x = G(x_k) Q_x.    (87)

The rest of the backward pass proceeds as normal, using the corrected expansion terms.

Forward pass: The only required modification to the forward pass is the correct calculation of the difference between states x and x_0 using (81). The modified change in control, where K \in R^{m \times (n-1)} and d \in R^m are the feedback and feedforward gains from the backward pass, is:

\delta u_k = K_k \delta x_k + d_k.    (88)

Objective: Directly subtracting states that contain quaternions (or any rotational representation) is incorrect. A good alternative is to again use the state difference Jacobian, as demonstrated in the following objective,

\tilde{J}(X, U) = (x_N - x_f)^T G(x_N)^T \tilde{W}_N G(x_N) (x_N - x_f) + \sum_{k=0}^{N-1} \Delta t \left[ (x_k - x_f)^T G(x_k)^T \tilde{W}_k G(x_k) (x_k - x_f) + (u - u_r)^T V_k (u - u_r) \right]    (89)

where \tilde{W} \in S^{n-1}_+ and V \in S^m_{++} are cost matrices. Here the three-parameter angular representation \phi is penalized.

D. Solution Polishing

Augmented Lagrangian methods only make good progress on constraints as long as the penalty terms are updated. However, the penalties can only be updated a finite number of times before they result in severe ill-conditioning, as discussed previously. As a result, augmented Lagrangian methods suffer from slow "tail" convergence, or convergence near the optimal solution. Active-set methods, on the other hand, exhibit quadratic convergence near the optimal solution. We present an active-set projection method to rapidly converge on the constraints once the convergence of AL-iLQR slows down.

This final solution polishing method solves the following optimization problem:

\begin{aligned}
\underset{\delta z}{\text{minimize}} \quad & \delta z^T H \delta z \\
\text{subject to} \quad & D \delta z = d
\end{aligned}    (90)

where z \in R^{Nn + (N-1)m} is the concatenated vector of the states and controls at all time steps, H is the Hessian of the cost, and D, d are the linearized active constraints. Constraints are considered active if the constraint violation is less than some small positive value \epsilon_{constraint}.

This problem essentially projects the solution from AL-iLQR onto the constraint manifold, while minimizing impact to the cost. Algorithm 5 takes successive Newton steps \delta z, only updating the constraint Jacobian D when the convergence rate r drops below a threshold, allowing re-use of the same matrix factorization S for inexpensive linear system solutions. Further, this algorithm can be implemented in a sequential manner [4] that does not require building large matrices, making it amenable to embedded systems.

Algorithm 5 Projection
1: function PROJECTION(Y, tol.)
2:   H⁻¹ ← invert Hessian of objective
3:   while ‖d‖_∞ > tol. do
4:     d, D ← linearize active constraints
5:     S ← √(DH⁻¹Dᵀ)
6:     v ← ‖d‖_∞
7:     r ← ∞
8:     while v > tol. and r > conv. rate tol. do
9:       Y, v⁺ ← LINESEARCH(Y, S, H⁻¹, D, d, v)
10:      r ← log v⁺ / log v
11:    end while
12:  end while
13:  return Y
14: end function

Algorithm 6 Projection Line Search
1: function LINESEARCH(Y, S, H⁻¹, D, d, v_0)
2:   Initialize α, γ
3:   while v > v_0 do
4:     δY_p ← H⁻¹Dᵀ(S⁻¹S⁻ᵀd)
5:     Ȳ_p ← Y_p + αδY_p
6:     d ← UPDATECONSTRAINTS(Ȳ_p)
7:     v ← ‖d‖_∞
8:     α ← γα
9:   end while
10:  return Ȳ, v
11: end function
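Each Newton step in Algorithms 5 and 6 solves the equality-constrained projection (90) in closed form. A single step might look like the sketch below, where the sign convention of the residual d is inherited from the linearization step in Algorithm 5 (an assumption of this sketch).

```julia
using LinearAlgebra

# Sketch of one Newton step of the projection problem (90), following line 4 of
# Algorithm 6: δz = H⁻¹Dᵀ(S⁻¹S⁻ᵀd), with S the triangular factor of D H⁻¹ Dᵀ.
function projection_step(H, D, d, z; α=1.0)
    Hinv_Dt = H \ D'                        # H⁻¹Dᵀ via a linear solve, no explicit inverse
    S = cholesky(Symmetric(D * Hinv_Dt))    # factorization of D H⁻¹ Dᵀ (the "S" in Alg. 5)
    δz = Hinv_Dt * (S \ d)                  # δz = H⁻¹Dᵀ (D H⁻¹ Dᵀ)⁻¹ d
    return z + α * δz                       # Ȳ ← Y + α δY (Algorithm 6, line 5)
end
```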
APPENDIX

REFERENCES

[1] T. A. Howell, B. E. Jackson, and Z. Manchester, "ALTRO: A Fast Solver for Constrained Trajectory Optimization," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, China, Nov. 2019.
[2] B. E. Jackson, T. Punnoose, D. Neamati, K. Tracy, and R. Jitosho, "ALTRO-C: A Fast Solver for Conic Model-Predictive Control," in 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, Jun. 2021, p. 8.
[3] B. E. Jackson, K. Tracy, and Z. Manchester, "Planning With Attitude," IEEE Robotics and Automation Letters, pp. 1–1, 2021.
[4] C. V. Rao, S. J. Wright, and J. B. Rawlings, "Application of Interior-Point Methods to Model Predictive Control," Journal of Optimization Theory and Applications, vol. 99, no. 3, pp. 723–757, Dec. 1998.
