Underactuated Robotics
Russ Tedrake
Note: These are working notes used for a course being taught at MIT. They will
be updated throughout the Spring 2021 semester. Lecture videos are available on
YouTube.
TABLE OF CONTENTS
• Preface
• Chapter 1: Fully-actuated vs Underactuated Systems
◦ Motivation
▪ Honda's ASIMO vs. passive dynamic walkers
▪ Birds vs. modern aircraft
▪ Manipulation
▪ The common theme
◦ Definitions
◦ Feedback Equivalence
◦ Input and State Constraints
▪ Nonholonomic constraints
◦ Underactuated robotics
◦ Goals for the course
◦ Exercises
MODEL SYSTEMS
• Chapter 2: The Simple Pendulum
◦ Introduction
◦ Nonlinear dynamics with a constant torque
▪ The overdamped pendulum
▪ The undamped pendulum with zero torque
▪ The undamped pendulum with a constant torque
◦ The torque-limited simple pendulum
▪ Energy-shaping control
◦ Exercises
• Chapter 3: Acrobots, Cart-Poles, and Quadrotors
◦ The Acrobot
▪ Equations of motion
◦ The Cart-Pole system
▪ Equations of motion
◦ Quadrotors
▪ The Planar Quadrotor
▪ The Full 3D Quadrotor
◦ Balancing
▪ Linearizing the manipulator equations
▪ Controllability of linear systems
▪ LQR feedback
◦ Partial feedback linearization
▪ PFL for the Cart-Pole System
▪ General form
◦ Swing-up control
▪ Energy shaping
▪ Cart-Pole
▪ Acrobot
▪ Discussion
◦ Other model systems
◦ Exercises
• Chapter 4: Simple Models of Walking and Running
◦ Limit Cycles
▪ Poincaré Maps
◦ Simple Models of Walking
▪ The Rimless Wheel
▪ The Compass Gait
▪ The Kneed Walker
▪ Curved feet
▪ And beyond...
◦ Simple Models of Running
▪ The Spring-Loaded Inverted Pendulum (SLIP)
▪ Hopping robots from the MIT Leg Laboratory
▪ Towards human-like running
◦ A simple model that can walk and run
◦ Exercises
• Chapter 5: Highly-articulated Legged Robots
◦ Center of Mass Dynamics
▪ A hovercraft model
▪ Robots with (massless) legs
▪ Capturing the full robot dynamics
▪ Impact dynamics
◦ The special case of flat terrain
▪ An aside: the zero-moment point derivation
◦ ZMP planning
▪ From a CoM plan to a whole-body plan
◦ Whole-Body Control
◦ Footstep planning and push recovery
◦ Beyond ZMP planning
◦ Exercises
• Chapter 6: Model Systems with Stochasticity
◦ The Master Equation
◦ Stationary Distributions
◦ Extended Example: The Rimless Wheel on Rough Terrain
◦ Noise models for real robots/systems.
APPENDIX
• Appendix A: Drake
◦ Pydrake
◦ Online Jupyter Notebooks
◦ Running on your own machine
▪ Install Drake
◦ Getting help
• Appendix B: Multi-Body Dynamics
◦ Deriving the equations of motion
◦ The Manipulator Equations
▪ Recursive Dynamics Algorithms
▪ Hamiltonian Mechanics
▪ Bilateral Position Constraints
▪ Bilateral Velocity Constraints
◦ The Dynamics of Contact
▪ Compliant Contact Models
▪ Rigid Contact with Event Detection
▪ Time-stepping Approximations for Rigid Contact
◦ Principle of Stationary Action
• Appendix C: Optimization and Mathematical Programming
◦ Optimization software
◦ General concepts
▪ Convex vs nonconvex optimization
▪ Constrained optimization with Lagrange multipliers
◦ Convex optimization
▪ Linear Programs / Quadratic Programs / Second-Order Cones
▪ Semidefinite Programming and Linear Matrix Inequalities
▪ Sums-of-squares optimization
▪ Solution techniques
◦ Nonlinear programming
▪ Second-order methods (SQP / Interior-Point)
▪ First-order methods (SGD / ADMM)
▪ Zero-order methods (CMA)
▪ Example: Inverse Kinematics
◦ Combinatorial optimization
▪ Search, SAT, First order logic, SMT solvers, LP interpretation
▪ Mixed-integer convex optimization
◦ "Black-box" optimization
• Appendix D: An Optimization Playbook
• Appendix E: Miscellaneous
◦ How to cite these notes
◦ Annotation tool etiquette
◦ Some great final projects
◦ Please give me feedback!
PREFACE
This book is about nonlinear dynamics and control, with a focus on mechanical
systems. I've spent my career thinking about how to make robots move robustly, but
also with speed, efficiency, and grace. I believe that this is best achieved through a
tight coupling between mechanical design, passive dynamics, and nonlinear control
synthesis. These notes contain selected material from dynamical systems theory, as
well as linear and nonlinear control. But the dynamics of our robots quickly get too
complex for us to handle with a pencil-and-paper approach. As a result, the primary
focus of these notes is on computational approaches to control design, especially
using optimization.
When I started teaching this class, and writing these notes, the computational
approach to control was far from mainstream in robotics. I had just finished my
Ph.D. focused on reinforcement learning (applied to a bipedal robot), and was
working on optimization-based motion planning. I remember sitting at a robotics
conference dinner as a young faculty, surrounded by people I admired, talking
about optimization. One of the senior faculty said "Russ: the people that talk like
you aren't the people that get real robots to work." Wow, have things changed.
Now almost every advanced robot is using optimization or learning in the planning/
control system.
Today, the conversations about reinforcement learning (RL) are loud and passionate
enough to drown out almost every other conversation in the room. Ironically, now
I am the older professor and I find myself still believing in RL, but not with the
complete faith of my youth. There is so much one can understand about the structure
of the equations that govern our mechanical systems; algorithms which don't make
use of that structure are missing obvious opportunities for data efficiency and
robustness. The dream is to make the learning algorithms discover this structure on
their own. My goal for this course, however, is to help you discover this structure,
and to learn how to use this structure to develop stronger algorithms and to guide
your scientific endeavors into learning-based control.
I'll go even further. I'm willing to bet that our views of intelligence in 10-20 years
will look less like feedforward networks with a training mode and a test mode,
and more like a system with dynamics that ebb and flow in a beautiful dance with
the dynamics of the environment. These systems will move more flexibly between
perception, forward prediction / sequential decision making, storing and retrieving
long-term memories, and taking action. A fascinating question is whether it will be
important for these systems to be embodied (e.g. in a robot) in order to explore
the world at the timescales of classical mechanics that we learn and evolve with. It
certainly makes for a wonderful playground.
Although the material in the book comes from many sources, the presentation is
targeted very specifically at a handful of robotics problems. Concepts are introduced
only when and if they can help progress the capabilities we are trying to develop.
Many of the disciplines that I am drawing from are traditionally very rigorous, to
the point where the basic ideas can be hard to penetrate for someone that is new
to the field. I've made a conscious effort in these notes to keep a very informal,
conversational tone even when introducing these rigorous topics, and to reference
the most powerful theorems but only to prove them when that proof would add
particular insights without distracting from the mainstream presentation. I hope that
the result is a broad but reasonably self-contained and readable manuscript that will
be of use to any enthusiastic roboticist.
ORGANIZATION
The material in these notes is organized into a few main parts. "Model Systems"
introduces a series of increasingly complex dynamical systems and overviews some
of the relevant results from the literature for each system. "Nonlinear Planning
and Control" introduces quite general computational algorithms for reasoning about
those dynamical systems, with optimization theory playing a central role. Many of
these algorithms treat the dynamical system as known and deterministic until the
last chapters in this part which introduce stochasticity and robustness. "Estimation
and Learning" follows this up with techniques from statistics and machine learning
which capitalize on this viewpoint to introduce additional algorithms which can
operate with fewer assumptions about knowing the model or having perfect sensors.
The book closes with an "Appendix" that provides slightly more introduction (and
references) for the main topics used in the course.
The order of the chapters was chosen to make the book valuable as a reference.
When teaching the course, however, I take a spiral trajectory through the material,
introducing robot dynamics and control problems one at a time, and introducing only
the techniques that are required to solve that particular problem.
SOFTWARE
All of the examples and algorithms in this book, plus many more, are now available
as a part of our open-source software project: DRAKE. DRAKE is a C++ project, but
in this text we will use Drake's Python bindings. I encourage super-users or readers
who want to dig deeper to explore the C++ code as well (and to contribute back).
Please see the appendix for specific instructions for using DRAKE along with these
notes.
CHAPTER 1
Fully-actuated vs Underactuated Systems
Open in Colab
Robots today move far too conservatively, and accomplish only a fraction of the
tasks and achieve a fraction of the performance that they are mechanically capable
of. In many cases, we are still fundamentally limited by control technology which
matured on rigid robotic arms in structured factory environments. The study of
underactuated robotics focuses on building control systems which use the natural
dynamics of the machines in an attempt to achieve extraordinary performance in
terms of speed, efficiency, or robustness.
1.1 MOTIVATION
Let's start with some examples, and some videos.
1.1.1 Honda's ASIMO vs. passive dynamic walkers
The world of robotics changed when, in late 1996, Honda Motor Co. announced
that they had been working for nearly 15 years (behind closed doors) on walking
robot technology. Their designs have continued to evolve, resulting in a humanoid
robot they call ASIMO (Advanced Step in Innovative MObility). For nearly 20 years,
Honda's robots were widely considered to represent the state of the art in walking
robots, although there are now many robots with designs and performance very
similar to ASIMO's. We will dedicate effort to understanding a few of the details of
ASIMO when we discuss algorithms for walking... for now I just want you to become
familiar with the look and feel of ASIMO's movements [watch the asimo video below
now].
I hope that your first reaction is to be incredibly impressed with the quality and
versatility of ASIMO's movements. Now take a second look. Although the motions
are very smooth, there is something a little unnatural about ASIMO's gait. It feels a
little like an astronaut encumbered by a heavy space suit. In fact this is a reasonable
analogy... ASIMO is walking a bit like somebody that is unfamiliar with his/her
dynamics. Its control system is using high-gain feedback, and therefore considerable
joint torque, to cancel out the natural dynamics of the machine and strictly follow
a desired trajectory. This control approach comes with a stiff penalty. ASIMO uses
roughly 20 times the energy (scaled) that a human uses to walk on the flat (measured
by cost of transport)[1]. Also, control stabilization in this approach only works in
a relatively small portion of the state space (when the stance foot is flat on the
ground), so ASIMO can't move nearly as quickly as a human, and cannot walk on
unmodelled or uneven terrain.
Figure 1.2 - A 3D passive dynamic walker by Steve Collins and Andy Ruina[2].
For contrast, let's now consider a very different type of walking robot, called a
passive dynamic walker (PDW). This "robot" has no motors, no controllers, no
computer, but is still capable of walking stably down a small ramp, powered only
by gravity [watch videos above now]. Most people will agree that the passive gait
of this machine is more natural than ASIMO's; it is certainly more efficient. Passive
walking machines have a long history - there are patents for passively walking toys
dating back to the mid 1800's. We will discuss, in detail, what people know about the
dynamics of these machines and what has been accomplished experimentally. This
most impressive passive dynamic walker to date was built by Steve Collins in Andy
Ruina's lab at Cornell[2].
The world is just starting to see what this vision could look like. This video from
Boston Dynamics is one of my favorites of all time:
Figure 1.3 - Boston Dynamics' Atlas robot does a backflip. Make sure you've seen
the dancing video, too.
1.1.2 Birds vs. modern aircraft
As a consequence, birds are extremely efficient flying machines; some are capable
of migrating thousands of kilometers with incredibly small fuel supplies. The
wandering albatross can fly for hours, or even days, without flapping its wings
- these birds exploit the shear layer formed by the wind over the ocean surface
in a technique called dynamic soaring. Remarkably, the metabolic cost of flying
for these birds is indistinguishable from the baseline metabolic cost[3], suggesting
that they can travel incredible distances (upwind or downwind) powered almost
completely by gradients in the wind. Other birds achieve efficiency through similarly
rich interactions with the air - including formation flying, thermal soaring, and ridge
soaring. Small birds and large insects, such as butterflies and locusts, use "gust
soaring" to migrate hundreds or even thousands of kilometers carried primarily by
the wind.
Birds are also incredibly maneuverable. The roll rate of a highly acrobatic aircraft
(e.g., the A-4 Skyhawk) is approximately 720 deg/sec[4]; a barn swallow has a roll
rate in excess of 5000 deg/sec[4]. Bats can be flying at full-speed in one direction,
and completely reverse direction while maintaining forward speed, all in just over 2
wing-beats and in a distance less than half the wingspan[5]. Although quantitative
flow visualization data from maneuvering flight is scarce, a dominant theory is that
the ability of these animals to produce sudden, large forces for maneuverability can
be attributed to unsteady aerodynamics, e.g., the animal creates a large suction
vortex to rapidly change direction[6]. These astonishing capabilities are called upon
routinely in maneuvers like flared perching, prey-catching, and high speed flying
through forests and caves. Even at high speeds and high turn rates, these animals
are capable of incredible agility - bats sometimes capture prey on their wings,
Peregrine falcons can pull 25 G's out of a 240 mph dive to catch a sparrow in mid-
flight[7], and even the small birds outside our building can be seen diving through a
chain-link fence to grab a bite of food.
Although many impressive statistics about avian flight have been recorded, our
understanding is partially limited by experimental accessibility - it is quite difficult to
carefully measure birds (and the surrounding airflow) during their most impressive
maneuvers without disturbing them. The dynamics of a swimming fish are closely
related, and can be more convenient to study. Dolphins have been known to swim
gracefully through the waves alongside ships moving at 20 knots[6]. Smaller fish,
such as the bluegill sunfish, are known to possess an escape response in which
they propel themselves to full speed from rest in less than a body length; flow
visualizations indeed confirm that this is accomplished by creating a large suction
vortex along the side of the body[8] - similar to how bats change direction in less
than a body length. There are even observations of a dead fish swimming upstream
by pulling energy out of the wake of a cylinder; this passive propulsion is presumably
part of the technique used by rainbow trout to swim upstream at mating season[9].
1.1.3 Manipulation
Despite a long history of success in industrial applications, and the huge potential
for consumer applications, we still don't have robot arms that can perform any
meaningful tasks in the home. Admittedly, the perception problem (using sensors
to detect/localize objects and understand the scene) for home robotics is incredibly
difficult. But even if we were given a perfect perception system, our robots are still
a long way from performing basic object manipulation tasks with the dexterity and
versatility of a human.
Most robots that perform object manipulation today use a stereotypical pipeline.
First, we enumerate a handful of contact locations on the hand (these points, and
only these points, are allowed to contact the world). Then, given a localized object
in the environment, we plan a collision-free trajectory for the arm that will move the
hand into a "pre-grasp" location. At this point the robot closes its eyes (figuratively)
and closes the hand, hoping that the pre-grasp location was good enough that the
object will be successfully grasped using e.g. only current feedback in the fingers
to know when to stop closing. "Underactuated hands" make this approach more
successful, but the entire approach really only works well for enveloping grasps.
The enveloping grasps approach may actually be sufficient for a number of simple
pick-and-place tasks, but it is a very poor representation of how humans do
manipulation. When humans manipulate objects, the contact interactions with the
object and the world are very rich -- we often use pieces of the environment as
fixtures to reduce uncertainty, we commonly exploit slipping behaviors (e.g. for
picking things up, or reorienting them in the hand), and our brains don't throw NaNs if
we use the entire surface of our arms to e.g. manipulate a large object.
In the last few years, I've begun to focus my own research on problems in the
manipulation domain. In this space, the interaction between dynamics and
perception is incredibly rich. As a result, I've started an entirely separate set of notes
(and a second course) on manipulation.
By the way, in most cases, if the robots fail to make contact at the anticipated contact
times/locations, bad things can happen. The results are hilarious and depressing at
the same time. (Let's fix that!)
1.1.4 The common theme
Classical control techniques for robotics are based on the idea that feedback can
be used to override the dynamics of our machines. These examples suggest that
to achieve outstanding dynamic performance (efficiency, agility, and robustness)
from our robots, we need to understand how to design control systems which take
advantage of the dynamics, not cancel them out. That is the topic of this course.
Surprisingly, many formal control ideas do not support the idea of "exploiting"
the dynamics. Optimal control formulations (which we will study in depth) allow
it in principle, but optimal control of nonlinear systems is still a relatively ad hoc
discipline. Sometimes I joke that in order to convince a control theorist to consider
the dynamics, you have to do something drastic, like taking away her control
authority - remove a motor, or enforce a torque-limit. These issues have created a
formal class of systems, the underactuated systems, for which people have begun to
more carefully consider the dynamics of their machines in the context of control.
1.2 DEFINITIONS
According to Newton, the dynamics of mechanical systems are second order (F = ma
). Their state is given by a vector of positions, q (also known as the configuration
vector), and a vector of velocities, q̇, and (possibly) time. The general form for a
second-order control dynamical system is:

q̈ = f(q, q̇, u, t).   (1)

A control system described by equation 1 is fully actuated in state x = (q, q̇) and time t if the resulting map f is surjective: for every q̈ there exists a u which produces the desired response. Otherwise it is underactuated (in x at time t).
As we will see, the dynamics for many of the robots that we care about turn out to
be affine in commanded torque, so let's consider a slightly constrained form:

q̈ = f₁(q, q̇, t) + f₂(q, q̇, t) u,   (2)

which is fully actuated in (x, t) if and only if f₂(q, q̇, t) is full row rank.
Notice that whether or not a control system is underactuated may depend on the
state of the system or even on time, although for most systems (including all of the
systems in this book) underactuation is a global property of the system. We will
refer to a system as underactuated if it is underactuated in all states and times. In
practice, we often refer informally to systems as fully actuated as long as they are
fully actuated in most states (e.g., a "fully-actuated" system might still have joint
limits or lose rank at a kinematic singularity). Admittedly, this permits the existence
of a gray area, where it might feel awkward to describe the system as either fully-
or underactuated (we should instead only describe its states); even powerful robot
arms on factory floors do have actuator limits, but we can typically design controllers
for them as though they were fully actuated. The primary interest of this text is in
systems for which the underactuation is useful/necessary for developing a control
strategy.
For robots described by the manipulator equations (derived in the Appendix),

M(q)q̈ + C(q, q̇)q̇ = τ_g(q) + B(q)u,

it is well known that the inertia matrix, M(q), is (always) uniformly symmetric and positive definite, and is therefore invertible. Putting the system into the form of equation 2 yields:

q̈ = M⁻¹(q) [τ_g(q) + B(q)u − C(q, q̇)q̇].
Because M −1 (q) is always full rank, we find that a system described by the
manipulator equations is fully-actuated if and only if B is full row rank. Consider, for example, a simple double pendulum with a motor at each joint: q = [θ₁, θ₂]ᵀ, u = [τ₁, τ₂]ᵀ (motor torques at the joints), and B = I₂ₓ₂. The system is fully actuated.
Python Example
I personally learn best when I can experiment and get some physical intuition.
Most chapters in these notes have an associated Jupyter notebook that can run on
Google's Colab; this chapter's notebook makes it easy for you to see this system in
action.
Open in Colab
Try it out! You'll see how to simulate the double pendulum, and even how to
inspect the dynamics symbolically.
Note: You can also run the code on your own machines (see the Appendix for
details).
While the basic double pendulum is fully actuated, imagine the somewhat bizarre
case that we have a motor to provide torque at the elbow, but no motor at the
shoulder. In this case, we have u = τ 2 , and B(q) = [0, 1] T . This system is clearly
underactuated. While it may sound like a contrived example, it turns out that these are almost exactly the dynamics we will study as our simplest model of walking later in the class.
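To make the rank condition concrete, here is a minimal numpy sketch (my own illustration, separate from the Drake notebooks) that checks whether a given B matrix is full row rank for the two cases above:

```python
import numpy as np

def is_fully_actuated(B):
    """A system in manipulator form is fully actuated (at this state)
    iff B has full row rank, i.e. rank(B) == dim[q]."""
    B = np.atleast_2d(B)
    return np.linalg.matrix_rank(B) == B.shape[0]

# Double pendulum with motors at both joints: B = I (2x2 identity).
B_full = np.eye(2)
# Elbow-only actuation: u = tau_2, B = [0, 1]^T.
B_elbow_only = np.array([[0.0], [1.0]])

print(is_fully_actuated(B_full))        # True  -> fully actuated
print(is_fully_actuated(B_elbow_only))  # False -> underactuated
```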
The matrix f 2 in equation 2 always has dim[q] rows, and dim[u] columns. Therefore,
as in the example, one of the most common cases for underactuation, which trivially
implies that f 2 is not full row rank, is dim[u] < dim[q]. This is the case when a robot
has joints with no motors. But this is not the only case. The human body, for instance,
has an incredible number of actuators (muscles), and in many cases has multiple
muscles per joint; despite having more actuators than position variables, when I
jump through the air, there is no combination of muscle inputs that can change the
ballistic trajectory of my center of mass (barring aerodynamic effects). My control
system is underactuated.
A quick note about notation. When describing the dynamics of rigid-body systems in
this class, I will use q for configurations (positions), q̇ for velocities, and use x for
the full state (x = [q T , q̇ T ] T ). There is an important limitation to this convention (3D
angular velocity should not be represented as the derivative of 3D pose) described in
the Appendix, but it will keep the notes cleaner. Unless otherwise noted, vectors are
always treated as column vectors. Vectors and matrices are bold (scalars are not).
1.3 FEEDBACK EQUIVALENCE

A fully-actuated system with invertible f₂ can be converted into a trivial linear system by the feedback law

u = f₂⁻¹(q, q̇, t) [u′ − f₁(q, q̇, t)],

which results in the closed-loop dynamics

q̈ = u′.
In other words, if f 1 and f 2 are known and f 2 is invertible, then we say that the
system is "feedback equivalent" to q̈ = u ′ . There are a number of strong results
which generalize this idea to the case where f 1 and f 2 are estimated, rather than
known (e.g., [10]).
Since we are embedding a nonlinear dynamics (not a linear one), we refer to this as "feedback cancellation", or "dynamic inversion". This idea reveals why I say control is easy -- for the special case of a fully-actuated deterministic system with known dynamics. For example, it would have been just as easy for me to invert gravity. Observe that the control derivations here would not have been any more difficult if the robot had 100 joints.
Open in Colab
As always, I highly recommend that you take a few minutes to read through the
source code.
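Here is a small sketch of this cancellation in code, for a system in the affine form of equation 2; the particular f₁ and f₂ below are made-up stand-ins rather than the dynamics of any robot in this chapter:

```python
import numpy as np

def feedback_cancellation(q, qdot, u_prime, f1, f2):
    """Given q'' = f1(q, qdot) + f2(q, qdot) u with f2 invertible,
    return the torque u that makes the closed-loop dynamics q'' = u_prime."""
    return np.linalg.solve(f2(q, qdot), u_prime - f1(q, qdot))

# Stand-in dynamics for a two-joint system (purely illustrative).
f1 = lambda q, qd: -9.81 * np.sin(q) - 0.1 * qd   # gravity-like + damping-like terms
f2 = lambda q, qd: np.diag([1.0, 2.0])            # an invertible input map

q, qd = np.array([0.3, -0.2]), np.array([0.0, 0.5])
u_prime = np.array([1.0, 0.0])                    # desired accelerations
u = feedback_cancellation(q, qd, u_prime, f1, f2)

# Verify: plugging u back in recovers q'' = u_prime.
print(np.allclose(f1(q, qd) + f2(q, qd) @ u, u_prime))  # True
```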
1.4 INPUT AND STATE CONSTRAINTS

Complicating matters further, real control problems typically also come with constraints on the inputs and/or the state. In practice it can be useful to separate out constraints which depend only on the
input, e.g. ϕ(u) ≥ 0, such as actuator limits, as they can often be easier to handle
than state constraints. An obstacle in the environment might manifest itself as one
or more constraints that depend only on position, e.g. ϕ(q) ≥ 0.
Consider, for instance, the double integrator with an input limit:

ẍ = u,   |u| ≤ 1.
Input and state constraints can complicate control design in similar ways to having an insufficient number of actuators (i.e., by further limiting the set of feasible trajectories), and often require similar tools to find a control solution.
1.4.1 Nonholonomic constraints
You might have heard of the term "nonholonomic system" (see e.g. [11]), and be
thinking about how nonholonomy relates to underactuation. Briefly, a nonholonomic
constraint is a constraint of the form ϕ(q, q̇, t) = 0, which cannot be integrated into a constraint of the form ϕ(q, t) = 0 (a holonomic constraint). A nonholonomic constraint
does not restrain the possible configurations of the system, but rather the manner
in which those configurations can be reached. While a holonomic constraint reduces
the number of degrees of freedom of a system by one, a nonholonomic constraint
does not. An automobile or traditional wheeled robot provides a canonical example: with planar position (x, y) and heading θ, the vehicle can roll forward and turn, but it cannot slide sideways, so

ẋ sin θ = ẏ cos θ,

or equivalently,

ẏ cos θ − ẋ sin θ = 0.
Contrast the wheeled robot example with a robot on train tracks. The train tracks
correspond to a holonomic constraint: the track constraint can be written directly
in terms of the configuration q of the system, without using the velocity q̇. Even
though the track constraint could also be written as a differential constraint on the
velocity, it would be possible to integrate this constraint to obtain a constraint on
configuration. The track restrains the possible configurations of the system.
Note that underactuation itself can be viewed as a constraint on the accelerations, of the form

ϕ(q, q̇, q̈, t) = 0.
1.5 UNDERACTUATED ROBOTICS

Even control systems for fully-actuated and otherwise unconstrained systems can
be improved using the lessons from underactuated systems, particularly if there is
a need to increase the efficiency of their motions or reduce the complexity of their
designs.
1.6 GOALS FOR THE COURSE
This course is based on the observation that there are new computational tools from
optimization theory, control theory, motion planning, and even machine learning
which can be used to design feedback control for underactuated systems. The goal
of this class is to develop these tools in order to design robots that are more dynamic
and more agile than the current state-of-the-art.
The target audience for the class includes both computer science and mechanical/
aero students pursuing research in robotics. Although I assume a comfort with linear
algebra, ODEs, and Python, the course notes aim to provide most of the material and
references required for the course.
I have a confession: I actually think that the material we'll cover in these notes
is valuable far beyond robotics. I think that systems theory provides a powerful
language for organizing computation in exceedingly complex systems -- especially
when one is trying to program and/or analyze systems with continuous variables in
a feedback loop (which happens throughout computer science and engineering, by
the way). I hope you find these tools to be broadly useful, even if you don't have a
humanoid robot capable of performing a backflip at your immediate disposal.
1.7 EXERCISES
a. The state of the humanoid can be represented by the angles and the
angular velocities of all its joints.
b. While doing the backflip (state in the left figure), the humanoid is fully
actuated.
c. While standing (state in the right figure), the humanoid is fully actuated.
Exercise 1.2 (Trajectory Tracking in State Space)
Take a robot whose dynamics is governed by equation 2, and assume it to be fully
actuated in all states x = [q T , q̇ T ] T at all times t.
f(x, t) = [ q̇ ; f₁(q, q̇, t) + f₂(q, q̇, t)u ].
Is the double pendulum with the new control input still fully-actuated in all
states? If not, identify the states in which it is underactuated.
Exercise 1.4 (Underactuation of the Planar Quadrotor)
Later in the course we will study the dynamics of a quadrotor in some depth; for the moment, just look at the structure of the resulting equations of motion from the planar quadrotor section. The quadrotor is constrained to move in the vertical plane, with gravity pointing downwards. The configuration vector q = [x, y, θ]ᵀ
collects the position of the center of mass and the pitch angle. The control input
is the thrust u = [u 1 , u 2 ] T produced by the two rotors. The input u can assume both
signs and has no bounds.
REFERENCES
1. Steven H. Collins and Andy Ruina and Russ Tedrake and Martijn Wisse,
"Efficient bipedal robots based on passive-dynamic walkers", Science, vol.
307, pp. 1082-1085, February 18, 2005. [ link ]
2. Steven H. Collins and Martijn Wisse and Andy Ruina, "A Three-Dimensional
Passive-Dynamic Walking Robot with Two Legs and Knees", International
Journal of Robotics Research, vol. 20, no. 7, pp. 607-615, July, 2001.
3. J.P.Y. Arnould and D.R. Briggs and J.P. Croxall and P.A. Prince and A.G. Wood,
"The foraging behaviour and energetics of wandering albatrosses brooding
chicks", Antarctic Science, vol. 8, no. 3, pp. 229-236, 1996.
4. Wei Shyy and Yongsheng Lian and Jian Teng and Dragos Viieru and Hao
Liu, "Aerodynamics of Low Reynolds Number Flyers",Cambridge University
Press , 2008.
5. Xiaodong Tian and Jose Iriarte-Diaz and Kevin Middleton and Ricardo
Galvao and Emily Israeli and Abigail Roemer and Allyce Sullivan and Arnold
Song and Sharon Swartz and Kenneth Breuer, "Direct measurements of the
kinematics and dynamics of bat flight", Bioinspiration & Biomimetics, vol.
1, pp. S10-S18, 2006.
9. D.N. Beal and F.S. Hover and M.S. Triantafyllou and J.C. Liao and G. V.
Lauder, "Passive propulsion in vortex wakes", Journal of Fluid Mechanics,
vol. 549, pp. 385–402, 2006.
CHAPTER 2
The Simple Pendulum
2.1 INTRODUCTION
Our goals for this chapter are modest: we'd like to understand the dynamics of a
pendulum.
The Lagrangian derivation of the equations of motion (as described in the appendix)
of the simple pendulum yields:
ml²θ̈(t) + mgl sin θ(t) = Q.
We'll consider the case where the generalized force, Q, models a damping torque
(from friction) plus a control torque input, u(t):
Q = −bθ̇(t) + u(t).
Open in Colab
These are relatively simple differential equations, so if I give you θ(0) and θ̇(0), then
you should be able to integrate them to obtain θ(t)... right? Although it is possible,
integrating even the simplest case (b = u = 0) involves elliptic integrals of the first
kind; there is relatively little intuition to be gained here.
This is in stark contrast to the case of linear systems, where much of our
understanding comes from being able to explicitly integrate the equations. For
instance, for a simple linear system we have
q̇ = aq   →   q(t) = q(0)e^{at},
and we can immediately understand that the long-term behavior of the system is a
(stable) decaying exponential if a < 0, an (unstable) growing exponential if a > 0, and
that the system does nothing if a = 0. Here we are with certainly one of the simplest
nonlinear systems we can imagine, and we can't even solve this system?
All is not lost. If what we care about is the long-term behavior of the system, then
there are a number of techniques we can apply. In this chapter, we will start by
investigating graphical solution methods. These methods are described beautifully
in a book by Steve Strogatz[1].
Let's start by studying a special case -- intuitively when bθ̇ ≫ ml²θ̈ -- which via dimensional analysis (using the natural frequency √(g/l) to match units) occurs when b√(l/g) ≫ ml². This is the case of heavy damping, for instance if the pendulum were moving in molasses. In this case, the damping term dominates the acceleration term, and we have:

bθ̇ = u₀ − mgl sin θ.
In other words, in the case of heavy damping, the system looks approximately first
order. This is a general property of heavily-damped systems, such as fluids at very
low Reynolds number.
I'd like to ignore one detail for a moment: the fact that θ wraps around on itself every
2π. To be clear, let's write the system without the wrap-around as:
bẋ = u 0 − mgl sin x. (2)
Our goal is to understand the long-term behavior of this system: to find x(∞) given
x(0). Let's start by plotting ẋ vs x for the case when u 0 = 0:
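If you'd like to reproduce this kind of plot yourself, here is a short matplotlib sketch of ẋ vs. x for u₀ = 0 (the parameter values below are arbitrary, chosen only so that the damping dominates):

```python
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary illustrative parameters (not from the text).
m, g, l, b, u0 = 1.0, 9.81, 1.0, 10.0, 0.0

x = np.linspace(-2 * np.pi, 3 * np.pi, 1000)
xdot = (u0 - m * g * l * np.sin(x)) / b   # b*xdot = u0 - m*g*l*sin(x)

plt.plot(x, xdot)
plt.axhline(0, color="k", linewidth=0.5)  # fixed points are the zero crossings
plt.xlabel("x"), plt.ylabel("xdot")
plt.show()
```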
The first thing to notice is that the system has a number of fixed points or steady
states, which occur whenever ẋ = 0. In this simple example, the zero-crossings are
x ∗ = {. . . , −π, 0, π, 2π, . . . }. When the system is in one of these states, it will never leave
that state. If the initial conditions are at a fixed point, we know that x(∞) will be at
the same fixed point.
Next let's investigate the behavior of the system in the local vicinity of the fixed
points. Examining the fixed point at x ∗ = π, if the system starts just to the right of the
fixed point, then ẋ is positive, so the system will move away from the fixed point. If it
starts to the left, then ẋ is negative, and the system will move away in the opposite
direction. We'll call fixed-points which have this property unstable. If we look at the
fixed point at x ∗ = 0, then the story is different: trajectories starting to the right or
to the left will move back towards the fixed point. We will call this fixed point locally
stable. More specifically, we'll distinguish between multiple types of stability (where ϵ is used to denote an arbitrarily small scalar quantity):
• A fixed point x* is stable in the sense of Lyapunov (i.s.L.) if, for every small ϵ > 0, there exists a δ > 0 such that |x(0) − x*| < δ implies |x(t) − x*| < ϵ for all t > 0.
• A fixed point is (locally) asymptotically stable if it is stable i.s.L. and, additionally, x(t) → x* as t → ∞ for all x(0) sufficiently close to x*.
• A fixed point is (locally) exponentially stable if there exist constants C, α > 0 such that |x(t) − x*| ≤ C e^{−αt} |x(0) − x*| for all x(0) sufficiently close to x*.
• A fixed point that is not stable i.s.L. is called unstable.
An initial condition near a fixed point that is stable in the sense of Lyapunov may
never reach the fixed point (but it won't diverge), near an asymptotically stable fixed
point will reach the fixed point as t → ∞, and near an exponentially stable fixed point
will reach the fixed point with a bounded rate. An exponentially stable fixed point is
also an asymptotically stable fixed point, but the converse is not true. Attractivity
does not actually imply Lyapunov stability†, which is why we require i.s.L. specifically for the definition of asymptotic stability († we can't see that in one dimension, so we will have to hold that thought for now). Systems which are stable i.s.L. but not asymptotically stable are easy to construct (e.g. ẋ = 0). Interestingly, it is also
possible to have nonlinear systems that converge (or diverge) in finite-time; a so-
called finite-time stability; we will see examples of this later in the book, but it
is a difficult topic to penetrate with graphical analysis. Rigorous nonlinear system
analysis is rich with subtleties and surprises. Moreover, these differences actually
matter -- the code that we will write to stabilize the systems will be subtly different
depending on what type of stability we want, and it can make or break the success
of our methods.
Our graph of ẋ vs. x can be used to convince ourselves of i.s.L. and asymptotic
stability by visually inspecting ẋ in the vicinity of a fixed point. Even exponential
stability can be inferred if we can find a negatively-sloped line passing through the
equilibrium point which separates the graph of f(x) from the horizontal axis, since it
implies that the nonlinear system will converge at least as fast as the linear system
represented by the straight line. I will graphically illustrate unstable fixed points
with open circles and stable fixed points (i.s.L.) with filled circles.
Next, we need to consider what happens to initial conditions which begin farther
from the fixed points. If we think of the dynamics of the system as a flow on the x
-axis, then we know that anytime ẋ > 0, the flow is moving to the right, and ẋ < 0, the
flow is moving to the left. If we further annotate our graph with arrows indicating
the direction of the flow, then the entire (long-term) system behavior becomes clear:
For instance, we can see that any initial condition x(0) ∈ (−π, π) will result in
lim t→∞ x(t) = 0. This region is called the basin of attraction of the fixed point at x ∗ = 0
. Basins of attraction of two fixed points cannot overlap, and the manifold separating
two basins of attraction is called the separatrix. Here the unstable fixed points, at
x* = {. . . , −π, π, 3π, . . . } form the separatrix between the basins of attraction of the
stable fixed points.
Consider, as another example, the system

ẋ + x = tanh(wx),   (3)
which is plotted below for two values of w. It's convenient to note that tanh(z) ≈ z
for small z. For w ≤ 1 the system has only a single fixed point. For w > 1 the system
has three fixed points: two stable and one unstable.
These equations are not arbitrary - they are actually a model for one of the
simplest neural networks, and one of the simplest models of persistent memory[2].
In the equation, x models the firing rate of a single neuron, which has a feedback
connection to itself. tanh is the activation (sigmoidal) function of the neuron, and w
is the weight of the synaptic feedback.
Open in Colab
As a bonus, you'll also find a the equations of an LSTM unit that you can also
experiment with. See if you can figure it out!
One last piece of terminology. In the neuron example, and in many dynamical
systems, the dynamics were parameterized; in this case by a single parameter, w.
As we varied w, the fixed points of the system moved around. In fact, if we increase
w through w = 1, something dramatic happens - the system goes from having one
fixed point to having three fixed points. This is called a bifurcation. This particular
bifurcation is called a pitchfork bifurcation. We often draw bifurcation diagrams
which plot the fixed points of the system as a function of the parameters, with solid
lines indicating stable fixed points and dashed lines indicating unstable fixed points,
as seen in the figure:
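One simple way to generate such a bifurcation diagram numerically (my own sketch, not the course notebook) is to sweep w and root-find the fixed points of ẋ = −x + tanh(wx), classifying each by the sign of the slope of the dynamics at that point:

```python
import numpy as np
from scipy.optimize import brentq
import matplotlib.pyplot as plt

def fixed_points(w):
    """Fixed points of xdot = -x + tanh(w*x)."""
    f = lambda x: np.tanh(w * x) - x
    pts = [0.0]
    if w > 1.0:                          # pitchfork: two extra fixed points appear
        x_star = brentq(f, 1e-6, 2.0)    # positive root; the negative one is symmetric
        pts += [x_star, -x_star]
    return pts

ws = np.linspace(0.5, 2.0, 200)
for w in ws:
    for x_star in fixed_points(w):
        # Stability from the slope of the dynamics at the fixed point:
        # blue (C0) = stable (negative slope), red (C3) = unstable.
        slope = w * (1 - np.tanh(w * x_star) ** 2) - 1
        plt.plot(w, x_star, ".", color="C0" if slope < 0 else "C3", markersize=2)

plt.xlabel("w"), plt.ylabel("fixed points x*")
plt.show()
```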
Our pendulum equations also have a (saddle-node) bifurcation when we change the
constant torque input, u 0 . Finally, let's return to the original equations in θ, instead of
in x. Only one point to make: because of the wrap-around, this system will appear to
have oscillations. In fact, the graphical analysis reveals that the pendulum will turn
forever whenever |u 0 | > mgl, but now you understand that this is not an oscillation,
but an instability with θ → ±∞.
You can find another beautiful example of these concepts (fixed points, basins of
attraction, bifurcations) from recurrent neural networks in the exercise below on
Hopfield networks.
Let us now return to the pendulum, this time with b = 0. The system dynamics are now truly second-order. We can
always think of any second-order system as (coupled) first-order system with twice
as many variables. Consider a general, autonomous (not dependent on time), second-
order system,
ẋ 1 =x 2
ẋ 2 =f(x 1 , x 2 , u),
where x 1 = q and x 2 = q̇. Therefore, the graphical depiction of this system is not a
line, but a vector field where the vectors [ẋ 1 , ẋ 2 ] T are plotted over the domain (x 1 , x 2 )
. This vector field is known as the phase portrait of the system.
In this section we restrict ourselves to the simplest case when u 0 = 0. Let's sketch
the phase portrait. First sketch along the θ-axis. The x-component of the vector field
here is zero; the y-component is −(g/l) sin θ. As expected, we have fixed points at ±π, . . .
Now sketch the rest of the vector field. Can you tell me which fixed points are stable?
Some of them are stable i.s.L., none are asymptotically stable.
Orbit calculations
You might wonder how we drew the black contour lines in the figure above. We
could have obtained them by simulating the system numerically, but those lines can
be easily obtained in closed-form. Directly integrating the equations of motion is
difficult, but at least for the case when u 0 = 0, we have some additional physical
insight for this problem that we can take advantage of. The kinetic energy, T , and
potential energy, U , of the pendulum are given by
T = ½ I θ̇²,   U = −mgl cos(θ),
where I = ml² and the total energy is E(θ, θ̇) = T(θ̇) + U(θ). The undamped pendulum
is a conservative system: total energy is a constant over system trajectories. Using
conservation of energy, we have:
E(θ(t), θ̇(t)) = E(θ(0), θ̇(0)) = E₀
½ I θ̇²(t) − mgl cos(θ(t)) = E₀
θ̇(t) = ±√( (2/I)[E₀ + mgl cos(θ(t))] )
Using this, if you tell me θ I can determine one of two possible values for θ̇, and the
solution has all of the richness of the black contour lines from the plot. This equation
has a real solution when cos(θ) > cos(θ max ), where
θ_max = cos⁻¹(−E₀/mgl) if E₀ < mgl, and θ_max = π otherwise.
Of course this is just the intuitive notion that the pendulum will not swing above the
height where the total energy equals the potential energy. As an exercise, you can
verify that differentiating this equation with respect to time indeed results in the
equations of motion.
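Here is a short sketch that uses this energy relation to draw the orbits directly, plotting θ̇ = ±√((2/I)[E₀ + mgl cos θ]) for a range of E₀ (the parameter values are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

m, g, l = 1.0, 9.81, 1.0        # arbitrary illustrative parameters
I = m * l ** 2

theta = np.linspace(-2 * np.pi, 2 * np.pi, 2000)
for E0 in np.linspace(-0.9 * m * g * l, 3 * m * g * l, 9):
    arg = 2.0 / I * (E0 + m * g * l * np.cos(theta))
    thetadot = np.sqrt(np.maximum(arg, 0.0))
    thetadot[arg < 0] = np.nan          # mask angles the orbit cannot reach
    plt.plot(theta, thetadot, "k", theta, -thetadot, "k", linewidth=0.5)

plt.xlabel("theta"), plt.ylabel("thetadot")
plt.show()
```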
The particular orbit defined by E = mgl is special -- this is the orbit that visits the
(unstable) equilibrium at the upright. It is known as the homoclinic orbit.
Now what happens if we add a constant torque? If you visualize the bifurcation diagram, you'll see that the fixed points come together, towards q = π/2, 5π/2, . . ., until they disappear. One fixed-point is unstable, and one is stable.
Before we continue, let me now give you the promised example of a system that
is not stable i.s.L., but which attracts all trajectories as time goes to infinity. We
can accomplish this with a very pendulum-like example (written here in polar
coordinates):
ṙ = r(1 − r),
θ̇ = sin²(θ/2).
This system has two equilibrium points, one at r ∗ = 0, θ ∗ = 0, and the other at
r ∗ = 1, θ ∗ = 0. The fixed point at zero is clearly unstable. The fixed point with r ∗ = 1
attracts all other trajectories, but it is not stable by any of our definitions.
Take a minute to draw the vector field of this (you can draw each coordinate
independently, if it helps) to make sure you understand. Note that to wrap-around
rotation is convenient but not essential -- we could have written the same dynamical
system in cartesian coordinates without wrapping. And if this still feels too arbitrary,
we will see it happen in practice when we introduce the energy-shaping swing-up
controller for the pendulum in the next chapter.
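If you'd rather let the computer draw it, here is a minimal quiver-plot sketch of that vector field, written in Cartesian coordinates by converting the polar equations above:

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-1.5, 1.5, 25), np.linspace(-1.5, 1.5, 25))
r = np.sqrt(x1 ** 2 + x2 ** 2) + 1e-9   # avoid dividing by zero at the origin
thetadot = (r - x1) / (2 * r)           # sin^2(theta/2) written in Cartesian form
x1dot = x1 * (1 - r) - x2 * thetadot
x2dot = x2 * (1 - r) + x1 * thetadot

plt.quiver(x1, x2, x1dot, x2dot)
plt.plot([0, 1], [0, 0], "ko")          # the two equilibria: the origin and (1, 0)
plt.xlabel("x1"), plt.ylabel("x2")
plt.axis("equal")
plt.show()
```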
Now let's add damping back. You can still add torque to move the fixed points (in the
same way).
With damping, the downright fixed point of the pendulum now becomes
asymptotically stable (in addition to stable i.s.L.). Is it also exponentially stable? How
can we tell? One technique is to linearize the system at the fixed point. A smooth,
time-invariant, nonlinear system that is locally exponentially stable must have a
stable linearization; we'll discuss linearization more in the next chapter.
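Here is a tiny numerical version of that test for the damped pendulum at the downright fixed point, using the standard small-angle linearization sin θ ≈ θ (the parameter values are arbitrary):

```python
import numpy as np

m, g, l, b = 1.0, 9.81, 1.0, 0.5    # arbitrary illustrative parameters

# Linearize ml^2 thetadd + b thetad + mgl sin(theta) = 0 about theta = 0.
# State is [theta, thetadot].
A = np.array([[0.0, 1.0],
              [-g / l, -b / (m * l ** 2)]])

eigvals = np.linalg.eigvals(A)
print(eigvals)
# All eigenvalues have strictly negative real part, so the linearization is
# stable and the downright fixed point is locally exponentially stable.
print(np.all(eigvals.real < 0))     # True
```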
Consider, for example, a feedback controller that effectively inverts gravity:

u = π(θ, θ̇) = 2mgl sin θ.
But these plots we've been making tell a different story. How would you shape the
natural dynamics - at each point pick a u from the stack of phase plots - to stabilize
the vertical fixed point with minimal torque effort? This is exactly the way that I
would like you to think about control system design. And we'll give you your first
solution techniques -- using dynamic programming -- in the next lecture.
The simple pendulum is fully actuated. Given enough torque, we can produce any
number of control solutions to stabilize the originally unstable fixed point at the top
(such as designing a feedback controller to effectively invert gravity).
The energy of the undamped pendulum is E = ½ml²θ̇² − mgl cos θ, and differentiating along trajectories gives Ė = ml²θ̇θ̈ + mglθ̇ sin θ = θ̇(u − mgl sin θ) + mglθ̇ sin θ = uθ̇. In words, adding energy to the system is simple - apply torque in the same direction as θ̇. To remove energy, apply torque in the opposite direction (e.g., damping).
To swing up the pendulum, even with torque limits, let us use this observation to
drive the system to its homoclinic orbit, and then let the dynamics of the pendulum
carry us to the upright equilibrium. Recall that the homoclinic orbit has energy mgl
-- let's call this our desired energy:
E d = mgl.
Furthermore, let's define the difference between our current energy and this desired
energy as Ẽ = E − E_d, and note that we still have

dẼ/dt = Ė = uθ̇.
Now consider the feedback controller of the form

u = −kθ̇Ẽ,   k > 0.

I admit that this looks a bit strange; it was chosen for the simple reason that it turns the resulting "error dynamics" into something simple:

dẼ/dt = −kθ̇²Ẽ.
Think about the graphical analysis of this system if you were to draw dẼ/dt vs. Ẽ for any fixed θ̇ -- it's a straight line separated from the horizontal axis, which would imply an exponential convergence: Ẽ → 0. This is true for any θ̇, except for θ̇ = 0 (so it will not actually swing us up from the downright fixed point... but if you nudge the system just a bit, then it will start pumping energy and will swing all of the way up). The essential property is that when E > E_d, we should remove energy from the system (damping) and when E < E_d, we should add energy (negative damping). Even if the control actions are bounded, the convergence is easily preserved.
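Here is a minimal simulation of this energy-shaping controller on the undamped pendulum -- a plain numpy/Euler sketch with arbitrary parameters and a torque saturation, not the Drake example linked below -- just to see the energy being pumped up toward E_d:

```python
import numpy as np

m, g, l, k, u_max = 1.0, 9.81, 1.0, 10.0, 3.0   # arbitrary illustrative values
I = m * l ** 2
E_d = m * g * l                                  # energy of the homoclinic orbit

def swingup_torque(theta, thetadot):
    E = 0.5 * I * thetadot ** 2 - m * g * l * np.cos(theta)
    E_tilde = E - E_d
    u = -k * thetadot * E_tilde
    return np.clip(u, -u_max, u_max)             # bounded control actions

theta, thetadot, dt = 0.1, 0.0, 1e-3             # small nudge off the downright fixed point
for _ in range(20000):
    u = swingup_torque(theta, thetadot)
    thetadd = (u - m * g * l * np.sin(theta)) / I
    thetadot += dt * thetadd
    theta += dt * thetadot

E = 0.5 * I * thetadot ** 2 - m * g * l * np.cos(theta)
print(E, E_d)   # the total energy approaches E_d = m*g*l
```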
This is a nonlinear controller that will push all system trajectories to the unstable
equilibrium. But does it make the unstable equilibrium locally stable? With only this
controller, the fixed point is attractive, but is not stable -- just like our example
above. Small perturbations may cause the system to drive all of the way around the
circle in order to once again return to the unstable equilibrium. For this reason, to
actually balance the system, we'll have to switch to a different controller once we
get near the fixed point (an idea that we'll discuss more in the next chapter).
Open in Colab
Make sure that you take a minute to look at the code which runs during these
examples. Note the somewhat arbitrary threshold for switching to the balancing
controller. We'll give a much more satisfying answer for that in the chapter on
Lyapunov methods.
There are a few things that are extremely nice about this controller. Not only is it
simple, but it is actually incredibly robust to errors we might have in our estimate
of the model parameters, m, l, and g. In fact, the only place that the model enters
our control equation is in the calculation of Ẽ = ½ml²θ̇² − mgl(1 + cos θ), and the only property of this estimate that impacts stability is the location of the orbit when Ẽ = 0, which is ½lθ̇² = g(cos θ + 1). This doesn't depend at all on our estimate of the mass, and only linearly on our estimates of the length and gravity (and if one cannot
measure the length of the pendulum accurately, then a proper measuring device
would make an excellent investment). This is somewhat amazing; we will develop
many optimization-based algorithms capable of swinging up the pendulum, but it
will take a lot of work to find one that is as insensitive to the model parameters.
If there is damping in the original system, of course we can cancel this out, too, using u = −kθ̇Ẽ + bθ̇. And once again, the controller is quite robust if we have errors in the estimated damping ratio.
2.4 EXERCISES
whose dynamics is represented in the figure above. Notice that, together with
x ∗ = −1 and x ∗ = 0, all the points in the closed interval [1, 2] are equilibria for
this system. For each equilibrium point, determine whether it is unstable, stable
i.s.L., asymptotically stable, or exponentially stable.
a. Sketch the graph of the function f(x), and identify all the equilibrium
points. (Feel free to plot this with a tool of your choice to check your
work.)
b. For each equilibrium point, determine whether it is stable (i.s.L.) or
unstable. For each one of the stable equilibria, identify the basin of
attraction.
c. Consider an additive term w in the system dynamics: ẋ = f(x) + w. As
w ranges from 0 to ∞, determine how many stable and unstable
equilibrium points the system has. Support your answer with a sketch of
the bifurcation diagram for nonnegative values of w.
Identify the minimum n min and the maximum n max number of unstable equilibria
that the system can have. Support your answer with the sketch of two functions
f min (x) and f max (x) that verify the requirements above and have, respectively, n min
and n max unstable equilibria.
Exercise 2.4 (Attractivity vs Stability)
Consider the dynamical system

ẋ₁ = x₁(1 − |x|) + x₂ (x₁ − |x|)/(2|x|),
ẋ₂ = x₂(1 − |x|) − x₁ (x₁ − |x|)/(2|x|),

where |x| = √(x₁² + x₂²). To help you with the analysis of this system, we set up this
python notebook. Take your time to understand the code in it, and then answer
the following questions.
a. Find all the equilibrium points of this system (no need to look outside
the portion of state space depicted above). Use the notebook to double
check your answer: simulate the evolution of the system setting the
equilibrium points you identified as initial conditions.
b. Determine whether the equilibria you identified at the previous point
are attractive and/or stable i.s.L. Explain briefly your reasoning, and feel
free to include some plots generated with the notebook in your written
answer.
c. This dynamical system is a (very) close relative of one of the systems
we analyzed in this chapter. Can you guess which one is it? What is the
connection between the two? Extra points: support your claim with a
mathematical derivation.
(If this equation seems obscure to you, try to derive it using the Lagrangian approach; but be careful, the kinetic energy T(θ, θ̇, t) of this system depends explicitly on time.) Our goal is to design a time-dependent control law u = π(θ, θ̇, t) that makes the pendulum spin at constant velocity θ̇ = 1.
ẋ = Aᵀ softmax(βAx) − x,   (4)

where A ∈ ℝ^{m×n} and β ∈ ℝ are a constant matrix and a scalar, respectively, and the softmax function is given by softmax(z)ᵢ = e^{zᵢ} / Σₖ e^{zₖ}.
where you'll perform a graphical analysis of the dynamics function represented
by your trained network.
exactly?
˙ represent fixed
points? Which fixed points are stable and which are unstable?
c. What is one reason why our model might not fit the ground truth data
2. H. Sebastian Seung and Daniel D. Lee and Ben Y. Reis and David W. Tank,
"The autapse: a simple illustration of short-term analog memory storage by
tuned synaptic feedback", Journal of Computational Neuroscience, vol. 9,
pp. 171-85, 2000.
CHAPTER 3
Acrobots, Cart-Poles, and Quadrotors
Open in Colab
A great deal of work in the control of underactuated systems has been done in
the context of low-dimensional model systems. These model systems capture the
essence of the problem without introducing all of the complexity that is often
involved in more real-world examples. In this chapter we will focus on two of the
most well-known and well-studied model systems--the Acrobot and the Cart-Pole.
After we have developed some tools, we will see that they can be applied directly
to other model systems; we will give a number of examples using Quadrotors. All
of these systems are trivially underactuated, having fewer actuators than degrees of
freedom.
3.1 THE ACROBOT

Figure 3.1 illustrates the model parameters used in our analysis. θ 1 is the shoulder
joint angle, θ 2 is the elbow (relative) joint angle, and we will use q = [θ 1 , θ 2 ] T ,
x = [q, q̇] T . The zero configuration is with both links pointed directly down. The
moments of inertia, I 1 , I 2 are taken about the pivots. The task is to stabilize the
unstable fixed point x = [π, 0, 0, 0] T .
We will derive the equations of motion for the Acrobot using the method of Lagrange.
The locations of the center of mass of each link, p c1 , p c2 , are given by the kinematics:
p_{c1} = [ l_{c1}s₁ ; −l_{c1}c₁ ],   p_{c2} = [ l₁s₁ + l_{c2}s_{1+2} ; −l₁c₁ − l_{c2}c_{1+2} ],   (1)
where s 1 is shorthand for sin(θ 1 ), c 1+2 is shorthand for cos(θ 1 + θ 2 ), etc. The energy is
given by:
T = T₁ + T₂,   T₁ = ½ I₁ q̇₁²   (2)
T₂ = ½ (m₂l₁² + I₂ + 2m₂l₁l_{c2}c₂) q̇₁² + ½ I₂ q̇₂² + (I₂ + m₂l₁l_{c2}c₂) q̇₁q̇₂   (3)
U = −m₁gl_{c1}c₁ − m₂g(l₁c₁ + l_{c2}c_{1+2})   (4)
Entering these quantities into the Lagrangian yields the equations of motion; using q = [θ₁, θ₂]ᵀ and u = τ (the elbow torque), we can write them in the standard manipulator form with:
M(q) = [ I₁ + I₂ + m₂l₁² + 2m₂l₁l_{c2}c₂ ,  I₂ + m₂l₁l_{c2}c₂
         I₂ + m₂l₁l_{c2}c₂ ,                I₂ ],   (8)

C(q, q̇) = [ −2m₂l₁l_{c2}s₂q̇₂ ,  −m₂l₁l_{c2}s₂q̇₂
             m₂l₁l_{c2}s₂q̇₁ ,    0 ],   (9)

τ_g(q) = [ −m₁gl_{c1}s₁ − m₂g(l₁s₁ + l_{c2}s_{1+2})
           −m₂gl_{c2}s_{1+2} ],   B = [ 0
                                        1 ].   (10)
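As a quick sanity check on these expressions, here is a small numpy sketch (with made-up parameter values, not those of any particular hardware or of the Drake model) that evaluates M, C, τ_g, and B at a given state:

```python
import numpy as np

# Made-up Acrobot parameters (purely illustrative).
m1, m2, l1, lc1, lc2, I1, I2, g = 1.0, 1.0, 1.0, 0.5, 0.5, 0.33, 0.33, 9.81

def acrobot_manipulator_matrices(q, qdot):
    t1, t2 = q
    qd1, qd2 = qdot
    s1, s2, s12 = np.sin(t1), np.sin(t2), np.sin(t1 + t2)
    c2 = np.cos(t2)
    M = np.array([[I1 + I2 + m2 * l1**2 + 2 * m2 * l1 * lc2 * c2,
                   I2 + m2 * l1 * lc2 * c2],
                  [I2 + m2 * l1 * lc2 * c2, I2]])
    C = np.array([[-2 * m2 * l1 * lc2 * s2 * qd2, -m2 * l1 * lc2 * s2 * qd2],
                  [m2 * l1 * lc2 * s2 * qd1, 0.0]])
    tau_g = np.array([-m1 * g * lc1 * s1 - m2 * g * (l1 * s1 + lc2 * s12),
                      -m2 * g * lc2 * s12])
    B = np.array([[0.0], [1.0]])
    return M, C, tau_g, B

# At the upright fixed point q = [pi, 0] the gravity torque vanishes:
M, C, tau_g, B = acrobot_manipulator_matrices([np.pi, 0.0], [0.0, 0.0])
print(tau_g)   # ~[0, 0]
```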
Open in Colab
3.2 THE CART-POLE SYSTEM

The other model system that we will investigate here is the cart-pole system, in
which the task is to balance a simple pendulum around its unstable equilibrium,
using only horizontal forces on the cart. Balancing the cart-pole system is used
in many introductory courses in control, including 6.003 at MIT, because it can
be accomplished with simple linear control (e.g. pole placement) techniques. In
this chapter we will consider the full swing-up and balance control problem, which
requires a full nonlinear control treatment.
Figure 3.2 - The Cart-Pole system. Click here to see a real robot.
The figure shows our parameterization of the system. x is the horizontal position of
the cart, θ is the counter-clockwise angle of the pendulum (zero is hanging straight
down). We will use q = [x, θ] T , and x = [q, q̇] T . The task is to stabilize the unstable
fixed point at x = [0, π, 0, 0] T .
M(q) = [ m_c + m_p ,   m_p l cos θ
         m_p l cos θ ,  m_p l² ],

C(q, q̇) = [ 0 ,  −m_p l θ̇ sin θ
            0 ,   0 ],

τ_g(q) = [ 0
           −m_p g l sin θ ],   B = [ 1
                                     0 ].
In some of the analysis that follows, we will study the form of the equations of
motion, ignoring the details, by arbitrarily setting all constants to 1 (m_c = m_p = l = g = 1), where f_x denotes the force on the cart:

2ẍ + θ̈ cos θ − θ̇² sin θ = f_x,
ẍ cos θ + θ̈ + sin θ = 0.
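Here is a short scipy sketch that simulates these normalized equations with f_x = 0, just to get a feel for the passive dynamics (this is my own snippet, independent of the Drake notebook):

```python
import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

def cartpole_normalized(t, state, fx=0.0):
    """Normalized cart-pole (m_c = m_p = l = g = 1), state = [x, theta, xdot, thetadot]."""
    x, theta, xd, thetad = state
    M = np.array([[2.0, np.cos(theta)],
                  [np.cos(theta), 1.0]])
    rhs = np.array([fx + thetad**2 * np.sin(theta),
                    -np.sin(theta)])
    xdd, thetadd = np.linalg.solve(M, rhs)
    return [xd, thetad, xdd, thetadd]

sol = solve_ivp(cartpole_normalized, [0, 10], [0.0, 0.5, 0.0, 0.0], max_step=0.01)
plt.plot(sol.t, sol.y[1])   # pendulum angle oscillating about the downward equilibrium
plt.xlabel("t"), plt.ylabel("theta")
plt.show()
```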
Open in Colab
3.3 QUADROTORS
Quadrotors have become immensely popular over the last few years -- advances
in outrunner motors from the hobby community made them powerful, light-weight,
and inexpensive! They are strong enough to carry an interesting payload (e.g. of
sensors for mapping / photographing the environment), but dynamic enough to
move relatively quickly. The most interesting dynamics start showing up at higher
speeds, when the propellors start achieving lift from the airflow across them due
to horizontal motion, or when they are close enough to the ground to experience
significant ground-effect, but we won't try to capture those effects (yet) here.
We can get started with an extremely simple model of a quadrotor that is restricted
to live in the plane. In that case, it actually only needs two propellors, but calling it a
"birotor" doesn't have the same ring to it. The equations of motion are almost trivial,
since it is only a single rigid body, and certainly fit into our standard manipulator equations:

m ẍ = −(u₁ + u₂) sin θ,
m ÿ = (u₁ + u₂) cos θ − mg,
I θ̈ = r (u₁ − u₂).
Figure 3.3 - The Planar Quadrotor System (which we also refer to in the code as
"Quadrotor2D", to keep it next to "Quadrotor" in the directory listing). The model
parameters are mass, m, moment of inertia, I , and the distance from the center to
the base of the propellor, r.
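Here is a corresponding sketch of the planar quadrotor dynamics as a plain Python function (the parameter values are placeholders, not those of the Drake model):

```python
import numpy as np

m, I, r, g = 0.5, 0.005, 0.15, 9.81    # placeholder parameters

def planar_quadrotor_dynamics(state, u):
    """state = [x, y, theta, xdot, ydot, thetadot], u = [u1, u2] (propeller thrusts)."""
    x, y, theta, xd, yd, thetad = state
    u1, u2 = u
    xdd = -(u1 + u2) * np.sin(theta) / m
    ydd = (u1 + u2) * np.cos(theta) / m - g
    thetadd = r * (u1 - u2) / I
    return np.array([xd, yd, thetad, xdd, ydd, thetadd])

# Hovering: each propeller supports half the weight, so all accelerations vanish.
hover_u = [m * g / 2, m * g / 2]
print(planar_quadrotor_dynamics([0, 0, 0, 0, 0, 0], hover_u))   # ~zeros
```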
3.4 BALANCING
For both the Acrobot and the Cart-Pole systems, we will begin by designing a linear
controller which can balance the system when it begins in the vicinity of the unstable
fixed point. To accomplish this, we will linearize the nonlinear equations about
the fixed point, examine the controllability of this linear system, and then use linear quadratic regulator (LQR) theory to design our feedback controller.
3.4.1 Linearizing the manipulator equations
Although the equations of motion of both of these model systems are relatively
tractable, the forward dynamics still involve quite a few nonlinear terms that must
be considered in any linearization. Let's consider the general problem of linearizing
a system described by the manipulator equations.
ẋ = f(x, u) ≈ f(x*, u*) + [∂f/∂x]_{x=x*, u=u*} (x − x*) + [∂f/∂u]_{x=x*, u=u*} (u − u*)   (23)
Let us consider the specific problem of linearizing the manipulator equations around
a (stable or unstable) fixed point. In this case, f (x ∗ , u ∗ ) is zero, and we are left with
the standard linear state-space form:
ẋ = [ q̇
      M⁻¹(q)[τ_g(q) + B(q)u − C(q, q̇)q̇] ],   (24)
  ≈ A_lin (x − x*) + B_lin (u − u*),   (25)
where A_lin and B_lin are constant matrices. Let us define x̄ = x − x*, ū = u − u*, and write

dx̄/dt = A_lin x̄ + B_lin ū.
Evaluation of the Taylor expansion around a fixed point yields the following, very
simple equations, given in block form by:
A_lin = [ 0                                              I ]
        [ M⁻¹ ∂τ_g/∂q + Σ_j M⁻¹ (∂B_j/∂q) u_j    0 ]   evaluated at x = x*, u = u*   (26)

B_lin = [ 0      ]
        [ M⁻¹ B ]   evaluated at x = x*, u = u*   (27)
For the Acrobot, linearized about its upright fixed point, this gives
[∂τ_g/∂q]_{x=x*} = [ g(m_1 l_c1 + m_2 l_1 + m_2 l_c2)    m_2 g l_c2 ]
                   [ m_2 g l_c2                           m_2 g l_c2 ]   (28)
The linear dynamics follow directly from these equations and the manipulator form
of the Acrobot equations.
Similarly, for the cart-pole linearized about its upright fixed point, we have
[∂τ_g/∂q]_{x=x*} = [ 0    0         ]
                   [ 0    m_p g l ]   (29)
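To make the block form concrete, here is a small numpy sketch that assembles A_lin and B_lin for the cart-pole balanced at the upright fixed point, using equations (26), (27), and (29); the parameter values are illustrative.

import numpy as np

mc, mp, l, g = 10.0, 1.0, 0.5, 9.81
c = -1.0                                              # cos(theta*) at the upright fixed point
M = np.array([[mc + mp, mp*l*c], [mp*l*c, mp*l**2]])  # mass matrix evaluated at x*
dtau_dq = np.array([[0.0, 0.0], [0.0, mp*g*l]])       # equation (29)
B = np.array([[1.0], [0.0]])

Minv = np.linalg.inv(M)
A_lin = np.block([[np.zeros((2, 2)), np.eye(2)],
                  [Minv @ dtau_dq, np.zeros((2, 2))]])
B_lin = np.vstack([np.zeros((2, 1)), Minv @ B])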
Studying the properties of the linearized system can tell us some things about the
(local) properties of the nonlinear system. For instance, having a strictly stable
linearization implies local exponential stability of the nonlinear system [3] (Theorem
4.15). It's worth noting that having an unstable linearization also implies that the
system is locally unstable, but if the linearization is marginally stable then one
cannot conclude whether the nonlinear system is asymptotically stable, stable i.s.L.,
or unstable (only that it is not exponentially stable)[4].
3.4.2 Controllability of linear systems
Consider the linear system
ẋ = Ax + Bu.
Because we can integrate this linear system in closed form, it is possible to derive the exact conditions of controllability. In particular, for linear systems it is sufficient to demonstrate that there exists a control input which drives any initial condition to the origin.
The first step is to look at the eigenvalues and eigenvectors of A,
A v_i = λ_i v_i,
where λ_i is the ith eigenvalue and v_i is the corresponding (right) eigenvector. There will be n eigenvalues for the n × n matrix A. Collecting the (column) eigenvectors into the matrix V and the eigenvalues into a diagonal matrix Λ, we have
AV = VΛ.
Here comes our primary assumption: let us assume that each of these n eigenvalues
takes on a distinct value (no repeats). With this assumption, it can be shown that
the eigenvectors v i form a linearly independent basis set, and therefore V −1 is well-
defined.
We can continue our eigenmodal analysis of the linear system by defining the modal
coordinates, r, with:
x = Vr, or r = V −1 x.
ṙ = V −1 AVr + V −1 Bu = Λr + V −1 Bu.
This illustrates the power of modal analysis; in modal coordinates, the dynamics
diagonalize yielding independent linear equations:
ṙ_i = λ_i r_i + Σ_j β_ij u_j,   β = V⁻¹ B.
Now the concept of controllability becomes clear. Input j can influence the dynamics in modal coordinate i if and only if β_ij ≠ 0. In the special case of non-repeated eigenvalues, having control over each individual eigenmode is sufficient to (in finite time) regulate all of the eigenmodes [5]. Therefore, we say that the system is controllable if and only if every eigenmode is affected by at least one input: for every i, there exists some j with β_ij ≠ 0.
A more general analysis, which does not require distinct eigenvalues, starts from the convolution solution of the linear system,
x(t) = e^{At} x(0) + ∫_0^t e^{A(t−τ)} B u(τ) dτ.
Without loss of generality, let's consider that the final state of the system is zero. Then we have
x(0) = − ∫_0^{t_f} e^{−Aτ} B u(τ) dτ.
Recall that the scalar exponential is defined by the infinite series
e^z = 1 + z + (1/2) z² + (1/6) z³ + ⋯
The notation e^{At} uses the same definition:
e^{At} = I + At + (1/2)(At)² + (1/6)(At)³ + ⋯
Not surprisingly, this has many special forms. For instance, if A is diagonalizable,
e At = Ve Λt V −1 , where A = VΛV −1 is the eigenvalue decomposition of A [6]. The
particular form we will use here is
e^{−Aτ} = Σ_{k=0}^{n−1} α_k(τ) A^k.
This is a particularly surprising form, because the infinite sum above is represented
by this finite sum; the derivation uses Sylvester's Theorem [5], [7]. Then we have,
x(0) = − Σ_{k=0}^{n−1} A^k B ∫_0^{t_f} α_k(τ) u(τ) dτ
     = − Σ_{k=0}^{n−1} A^k B w_k,   where w_k = ∫_0^{t_f} α_k(τ) u(τ) dτ
     = − [B  AB  A²B  ⋯  A^{n−1}B]_{n×(nm)} [w_0; w_1; w_2; ⋯; w_{n−1}]
The matrix containing the vectors B, AB, ... A n−1 B is called the controllability
matrix. In order for the system to be controllable, for every initial condition x(0)
we must be able to find the corresponding vector w. This is only possible when the
rows of the controllability matrix are linearly independent. Therefore, the condition
of controllability is that this controllability matrix is full rank. In DRAKE you can
obtain the controllability matrix for a linear system using ControllabilityMatrix,
or check the rank condition directly using IsControllable.
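DRAKE's ControllabilityMatrix and IsControllable perform this test for you; a plain numpy version of the rank condition might look like the following sketch.

import numpy as np

def controllability_matrix(A, B):
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])   # B, AB, A^2 B, ..., A^(n-1) B
    return np.hstack(blocks)

def is_controllable(A, B):
    return np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0]

Applied to the cart-pole linearization constructed above, this should return True, confirming that balancing is feasible for a linear controller.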
Note that a linear feedback law that merely changes the eigenvalues of the eigenmodes is not sufficient to reach the goal in finite time. In fact, the open-loop control to reach the goal is easily obtained with a final-value LQR problem, and (for R = I) is actually a simple function of the controllability Gramian [7].
The controllability analysis presented here is for linear time-invariant (LTI) systems. A comparable analysis exists for linear time-varying (LTV) systems. We will even see extensions to nonlinear systems, although there it will often be referred to by the synonym of "reachability" analysis.
Controllability tells us that a trajectory to the fixed point exists, but does not tell us
which one we should take or what control inputs cause it to occur. Why not? There
are potentially infinitely many solutions. We have to pick one.
The tools for controller design in linear systems are very advanced. In particular, as
we describe in the linear optimal control chapter, one can easily design an optimal
feedback controller for a regulation task like balancing, so long as we are willing to
linearize the system around the operating point and define optimality in terms of a
quadratic cost function:
J(x_0) = ∫_0^∞ [xᵀ(t) Q x(t) + uᵀ(t) R u(t)] dt,   x(0) = x_0,   Q = Qᵀ > 0,   R = Rᵀ > 0.
The optimal feedback takes the form of a static linear gain,
u(t) = −K x(t).
In DRAKE, you can obtain this gain with
K = LinearQuadraticRegulator(A, B, Q, R);
there is also a variant that takes the nonlinear system and operating point directly, linearizes the system for you around an equilibrium, and returns the controller (in the original coordinates). Therefore, to use LQR, one simply needs to define the
symmetric positive-definite cost matrices, Q and R. In their most common form, Q
and R are positive diagonal matrices, where the entries Q ii penalize the relative
errors in state variable x i compared to the other state variables, and the entries R ii
penalize actions in u i .
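The text uses DRAKE's LinearQuadraticRegulator; the same gain can also be computed from the continuous-time algebraic Riccati equation, e.g. with scipy, as in this sketch (the Q and R values in the comment are placeholders).

import numpy as np
from scipy.linalg import solve_continuous_are

def lqr(A, B, Q, R):
    S = solve_continuous_are(A, B, Q, R)      # solves the algebraic Riccati equation
    K = np.linalg.solve(R, B.T @ S)           # optimal gain, u = -K (x - x*)
    return K, S

# e.g., for the cart-pole linearization constructed earlier:
# K, S = lqr(A_lin, B_lin, Q=np.diag([10., 10., 1., 1.]), R=np.array([[1.]]))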
Open in Colab
Make sure that you take a minute to look at the code which runs during these
examples. Can you set the Q and R matrices differently, to improve the
performance?
Simulation of the closed-loop response with LQR feedback shows that the task is
indeed completed - and in an impressive manner. Oftentimes the state of the system
has to move violently away from the origin in order to ultimately reach the origin.
Further inspection reveals that the (linearized) closed-loop dynamics are in fact non-minimum phase (the acrobot has 3 right-half-plane zeros, the cart-pole has 1).
Open in Colab
Note that LQR, although it is optimal for the linearized system, is not necessarily the best linear control solution for maximizing the basin of attraction of the fixed point. The theory of robust control (e.g., [8]), which explicitly takes into account the differences between the linearized model and the nonlinear model, will produce controllers which outperform our LQR solution in this regard.
3.5 PARTIAL FEEDBACK LINEARIZATION
In the introductory chapters, we made the point that underactuated systems are not feedback equivalent to q̈ = u. Although we cannot always simplify the full dynamics of the system, it is still possible to linearize a portion of the system dynamics. The technique is called partial feedback linearization.
Consider the cart-pole example. The dynamics of the cart are affected by the motions
of the pendulum. If we know the model, then it seems quite reasonable to think that
we could create a feedback controller which would push the cart in exactly the way
necessary to counter-act the dynamic contributions from the pendulum - thereby
linearizing the cart dynamics. What we will see, which is potentially more surprising,
is that we can also use a feedback controller for the cart to feedback linearize the
dynamics of the passive pendulum joint.
We'll use the term collocated partial feedback linearization to describe a controller
which linearizes the dynamics of the actuated joints. What's more surprising is
that it is often possible to achieve non-collocated partial feedback linearization -
a controller which linearizes the dynamics of the unactuated joints. The treatment
presented here follows from [9].
Collocated
Starting from equations 18 and 19, we have
θ̈ = −ẍ c − s,
ẍ (2 − c²) − s c − θ̇² s = f_x.
Therefore, applying the feedback control
f_x = (2 − c²) ẍ^d − s c − θ̇² s   (30)
results in
ẍ = ẍ^d,
θ̈ = −ẍ^d c − s,
Look carefully at the resulting equations. Of course, it says that we can impose
whatever accelerations we like on the cart. But even the resulting equations of the
pole happen to take a nice form here: they have been reduced to the equations of
the simple pendulum (without a cart), where the torque input is now given instead
by ẍc. It's as if we have a simple pendulum with torque control, except our command
is modulated by a cos θ term, and this cos θ term is fundamental -- it's true that our
control authority goes to zero when the pole is horizontal, because no amount of
force on the cart in that configuration can act like a torque on the pole.
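A quick numerical sanity check of the collocated control law above (with all constants set to 1): applying f_x from equation (30) should produce exactly the requested cart acceleration. The helper below is my own sketch, not part of the course code.

import numpy as np

def cartpole_accel(state, f_x):
    # returns [xddot, thetaddot] for the unit-parameter cart-pole
    x, theta, xdot, thetadot = state
    s, c = np.sin(theta), np.cos(theta)
    M = np.array([[2.0, c], [c, 1.0]])
    rhs = np.array([f_x + thetadot**2 * s, -s])
    return np.linalg.solve(M, rhs)

state = np.array([0.1, 2.0, -0.3, 0.5])
xdd_d = 1.7
s, c = np.sin(state[1]), np.cos(state[1])
f_x = (2 - c**2)*xdd_d - s*c - state[3]**2 * s      # equation (30)
print(cartpole_accel(state, f_x))                   # first entry equals xdd_d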
Non-collocated
Starting again from equations 18 and 19, we have
ẍ = −(θ̈ + s)/c,
θ̈ (c − 2/c) − 2 tan θ − θ̇² s = f_x.
Applying the feedback control
f_x = (c − 2/c) θ̈^d − 2 tan θ − θ̇² s   (31)
results in
θ̈ = θ̈^d,
ẍ = −(1/c) θ̈^d − tan θ.
Note that this expression is only valid when cos θ ≠ 0. Once again, we know that
the force cannot create a torque when the pole is perfectly horizontal. In fact, the
controller we have written will "blow-up" -- requesting infinite force at cos θ = 0; so
make sure you saturate the command before you implement it on hardware (or even
in simulation). Although it may come as a small consolation, at least we have that (c − 2/c) never goes to zero; in fact you can check for yourself that |c − 2/c| ≥ 1.
General form
For systems that are trivially underactuated (torques on some joints, no torques on other joints), we can, without loss of generality, reorganize the joint coordinates in any underactuated system described by the manipulator equations into the form:
M_11 q̈_1 + M_12 q̈_2 = τ_1,   (32)
M_21 q̈_1 + M_22 q̈_2 = τ_2 + u,   (33)
where q_1 are the unactuated (passive) joints, q_2 are the actuated joints, and
M(q) = [ M_11   M_12 ]
       [ M_21   M_22 ].
Fortunately, because M is uniformly (i.e., for all q) positive definite, M_11 and M_22 are also positive definite, by the Schur complement condition for positive definiteness.
Collocated linearization
Performing the same substitutions into the full manipulator equations, we get:
q̈_1 = M_11⁻¹ [τ_1 − M_12 q̈_2],   (34)
(M_22 − M_21 M_11⁻¹ M_12) q̈_2 − τ_2 + M_21 M_11⁻¹ τ_1 = u.   (35)
Non-collocated linearization
q̈_2 = M_12⁺ [τ_1 − M_11 q̈_1],   (36)
(M_21 − M_22 M_12⁺ M_11) q̈_1 − τ_2 + M_22 M_12⁺ τ_1 = u.   (37)
Task-space partial feedback linearization
More generally, we can define some combination of the joints that we would like to control as an output,
y = h(q),
with H(q) = ∂h/∂q partitioned as H = [H_1, H_2], matching the partition of q into [q_1, q_2]. If the actuated joints are commanded so that
q̈_2 = H̄⁺ [ÿ^d − Ḣ q̇ − H_1 M_11⁻¹ τ_1],   (38)
where H̄ = H_2 − H_1 M_11⁻¹ M_12, and H̄⁺ is the right Moore-Penrose pseudo-inverse,
H̄⁺ = H̄ᵀ (H̄ H̄ᵀ)⁻¹,
then we have
ÿ = ÿ^d,   (39)
subject to the rank condition
rank(H̄) = p,   (40)
where p is the dimension of the output y.
Proof sketch: Differentiating the output function we have
ẏ = H q̇,
ÿ = Ḣ q̇ + H_1 q̈_1 + H_2 q̈_2.
Solving 32 for the dynamics of the unactuated joints we have:
q̈_1 = M_11⁻¹ (τ_1 − M_12 q̈_2).   (41)
Substituting, we have
ÿ = Ḣ q̇ + H_1 M_11⁻¹ (τ_1 − M_12 q̈_2) + H_2 q̈_2   (42)
  = Ḣ q̇ + H̄ q̈_2 + H_1 M_11⁻¹ τ_1   (43)
  = ÿ^d.   (44)
Note that the last line required the rank condition (40) on H̄ to ensure that the rows of H̄ are linearly independent, allowing H̄ H̄⁺ = I.
If ÿ^d is chosen with PD feedback on the task-space error, then, assuming the internal dynamics are stable, this yields converging error dynamics when K_p, K_d > 0 [4], which implies y(t) → y^d(t). For a position-controlled robot, the acceleration command of (38) suffices. Alternatively, a torque command follows by substituting (38) and (41) into (33).
For example, for the cart-pole system one might regulate the height of the pendulum tip with the output
y = h(q) = −l cos θ,   ẏ = l θ̇ sin θ.
The task space derivation above provides a convenient generalization of the partial
feedback linearization (PFL) [9], which encompasses both the collocated and non-
collocated results. If we choose y = q 2 (collocated), then we have
H_1 = 0,  H_2 = I,  Ḣ = 0,  H̄ = I,  H̄⁺ = I.
From this, the command in (38) reduces to q̈_2 = q̈_2^d. The actuator command is then
u = M_21 M_11⁻¹ (τ_1 − M_12 q̈_2^d) + M_22 q̈_2^d − τ_2,
and the rank condition (40) is trivially met.
If we instead choose y = q_1 (non-collocated), we have
H_1 = I,  H_2 = 0,  Ḣ = 0,  H̄ = −M_11⁻¹ M_12.
The rank condition (40) requires that rank(M_12) = l, in which case we can write H̄⁺ = −M_12⁺ M_11, reducing the rank condition to precisely the "Strong Inertial Coupling" condition described in [9]. Now the command in (38) reduces to
q̈_2 = M_12⁺ [τ_1 − M_11 q̈_1^d].   (45)
The actuator command is found by substituting (45) into (33), yielding the same
results as in [9].
3.6 SWING-UP CONTROL
3.6.1 Energy Shaping
Recall that in the last chapter we used energy shaping to swing up the simple pendulum. This idea turns out to be more general than just the simple pendulum. As we will see, we can use similar concepts of "energy shaping" to produce swing-up controllers for the acrobot and cart-pole systems. It's important to note that it only takes one actuator to change the total energy of a system.
Although a large variety of swing-up controllers have been proposed for these model systems (c.f. [10], [11], [12], [13], [14], [15], [1], [16]), the energy-shaping controllers tend to be the most natural to derive and are perhaps the best known.
3.6.2 Cart-Pole
Let's try to apply the energy-shaping ideas to the cart-pole system. The basic idea,
from [17], is to use collocated PFL to simplify the dynamics, use energy shaping to
regulate the pendulum to its homoclinic orbit, then to add a few terms to make sure
that the cart stays near the origin. This is a bit surprising... if we want to control the
pendulum, shouldn't we use the non-collocated version? Actually, we ultimately want
to control both the cart and the pole, and we will build on the collocated version
both because it avoids inverting the cos θ term that can go to zero and because (when
all parameters are set to 1) we were left with a particularly simple form of the
equations:
ẍ = u   (46)
θ̈ = −u c − s   (which, for general parameters, reads θ̈ = −(1/l)(u c + g s)).   (47)
The first equation is clearly simple, but even the second equation is exactly the
equations of a decoupled pendulum, just with a slightly odd control input that is
modulated by cos θ.
Let us regulate the energy of this decoupled pendulum using energy shaping. The
energy of the pendulum (a unit mass, unit length, simple pendulum in unit gravity)
is given by:
E(x) = (1/2) θ̇² − cos θ.
The desired energy, equivalent to the energy at the desired fixed-point, is
E d = 1.
Define Ẽ(x) = E(x) − E^d. Since E^d is constant,
dẼ/dt = Ė(x) = θ̇ θ̈ + θ̇ s
       = θ̇ [−u c − s] + θ̇ s
       = −u θ̇ cos θ.
Therefore, if we design a controller of the form
u = k θ̇ cos θ Ẽ,   k > 0,
the result is
dẼ/dt = −k θ̇² cos²θ Ẽ.
This guarantees that |Ẽ| is non-increasing, but isn't quite enough to guarantee that it will go to zero. For example, if θ = θ̇ = 0, the system will never move. However, if we have that
∫_0^t θ̇²(t′) cos²θ(t′) dt′ → ∞   as t → ∞,
then we have Ẽ(t) → 0. This condition, a version of LaSalle's theorem that we will develop in our notes on Lyapunov methods, is satisfied for all but the trivial constant trajectories at fixed points.
Now we return to the full system dynamics (which include the cart). [17] and [18] use the simple pendulum energy controller with an additional PD controller designed to regulate the cart:
ẍ^d = k_E θ̇ cos θ Ẽ − k_p x − k_d ẋ.
[17] provides a proof of convergence for this controller with some nominal
parameters. [10] provides a slightly different controller derived directly from a
Lyapunov argument.
Figure 3.4 - Cart-Pole Swingup: Example phase plot of the pendulum subsystem
using energy shaping control. The controller drives the system to the homoclinic
orbit, then switches to an LQR balancing controller near the top.
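Here is a small simulation sketch of this swing-up strategy on the PFL-simplified dynamics (46)-(47); the gains below are illustrative choices of mine, not the nominal values from [17].

import numpy as np

kE, kp, kd = 10.0, 1.0, 1.0            # illustrative gains
dt, steps = 1e-3, 20000
x, theta, xd, thd = 0.0, 0.1, 0.0, 0.0
for _ in range(steps):
    s, c = np.sin(theta), np.cos(theta)
    E_tilde = 0.5*thd**2 - c - 1.0     # E(x) - E_d, with E_d = 1
    u = kE*thd*c*E_tilde - kp*x - kd*xd   # desired cart acceleration xdd_d
    xdd, thdd = u, -u*c - s            # equations (46)-(47)
    x, xd = x + dt*xd, xd + dt*xdd
    theta, thd = theta + dt*thd, thd + dt*thdd
print(E_tilde)                         # should have decayed toward zero (the homoclinic orbit)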
3.6.3 Acrobot
Swing-up control for the acrobot can be accomplished in much the same way. [13]
- pump energy. Clean and simple. No proof. Slightly modified version (uses arctan
instead of sat) in [19]. Clearest presentation in [18].
E(x) = (1/2) q̇ᵀ M q̇ + U(x),
E^d = U(x*),
ū = q̇_1 Ẽ,
q̈_2^d = −k_1 q_2 − k_2 q̇_2 + k_3 ū,
3.6.4 Discussion
The energy shaping controller for swing-up presented here is a pretty faithful
representative from the field of nonlinear underactuated control. Typically these
control derivations require some clever tricks for simplifying or canceling out terms
in the nonlinear equations, then some clever Lyapunov function to prove stability.
In these cases, PFL was used to simplify the equations, and therefore the controller
design.
We will see another nice example of changing coordinates in the nonlinear equations
to make the problem easier when we study "differential flatness" for trajectory
optimization. This approach is perhaps most famous these days for very dynamic
trajectory planning with quadrotors.
These controllers are important, representative, and relevant. But clever tricks
with nonlinear equations seem to be fundamentally limited. Most of the rest of
the material presented in this book will emphasize more general computational
approaches to formulating and solving these and other control problems.
3.7 OTHER MODEL SYSTEMS
The acrobot and cart-pole systems are just two of the model systems used heavily in underactuated control research. Other examples include:
• Pendubot
• Inertia wheel pendulum
• Furuta pendulum (horizontal rotation and vertical pendulum)
• Hovercraft
3.8 EXERCISES
REFERENCES
1. Richard M. Murray and John Hauser, "A case Study in Approximate
Linearization: The Acrobot Example", Memorandum No. UCB/ERL
(unknown), April, 1991.
3. Mark Spong, "The Swingup Control Problem for the Acrobot", IEEE Control
Systems Magazine, vol. 15, no. 1, pp. 49-55, February, 1995.
11. Isabelle Fantoni and Rogelio Lozano, "Non-linear Control for Underactuated
Mechanical Systems",Springer-Verlag , 2002.
12. N. Araki, M. Okada, Y. Konishi, and H. Ishigaki, "Parameter identification and swing-up control of an Acrobot system", International Conference on Industrial Technology, 2005.
13. Xin Xin and M. Kaneda, "New analytical results of the energy based
swinging up control of the Acrobot", Proceedings of the 43rd IEEE
Conference on Decision and Control (CDC), vol. 1, pp. 704 - 709, Dec, 2004.
15. Arun D. Mahindrakar and Ravi N. Banavar, "A swing-up of the acrobot
based on a simple pendulum strategy", International Journal of Control, vol.
78, no. 6, pp. 424-429, April, 2005.
16. M.D. Berkemeier and R.S. Fearing, "Tracking fast inverted trajectories of
the underactuated Acrobot", Robotics and Automation, IEEE Transactions
on, vol. 15, no. 4, pp. 740-750, Aug, 1999.
17. Xu-Zhi Lai and Jin-Hua She and Simon X. Yang and Min Wu, "Stability
Analysis and Control Law Design for Acrobots", Proceedings of the 2006
IEEE International Conference on Robotics and Automation, May, 2006.
18. Chung Choo Chung and John Hauser, "Nonlinear Control of a Swinging
Pendulum", Automatica, vol. 31, no. 6, pp. 851-862, June, 1995.
CHAPTER 4
Simple Models of Walking and Running
Open in Colab
Practical legged locomotion is one of the fundamental problems in robotics; we've
seen amazing progress over the last few years, but there are still some major
unsolved problems. Much of the recent progress is due to improvements in hardware
-- a legged robot must carry all of its sensors, actuators and power and traditionally
this meant underpowered motors that act as far-from-ideal force/torque sources --
Boston Dynamics and other companies have made incredible progress here. The
control systems implemented on these systems, though, are still more heuristic
than you might think. They also require dramatically higher bandwidth and lower latency than the human motor control system, and still perform worse in challenging environments.
Figure 4.1 - System trajectories of the Van der Pol oscillator with μ = .2. (Left)
phase portrait. (Right) time domain.
However, if we examine system trajectories in the time domain (see the right panel
of the figure above), then we can see that our traditional notion for stability of a
trajectory, lim t→∞ |x(t) − x ∗ (t)| = 0, is not satisfied here for any x ∗ (t). Although the
phase portrait clearly reveals that all trajectories converge to the same orbit, the
time domain plot reveals that these trajectories do not necessarily synchronize in
time.
This example shows that stability of a trajectory (in time) is not the definition we want. Instead we will define the stability of an orbit, x*(t), using
min_τ ||x(t) − x*(τ)||.
Here the quantity of interest is the distance between our trajectory and the closest point on the orbit. Depending on exactly how this distance evolves with time, we can define orbital stability: for every small ϵ > 0, we can find a δ > 0 such that min_τ ||x(t_0) − x*(τ)|| < δ implies ∀t, min_τ ||x(t) − x*(τ)|| < ϵ. The definitions for
asymptotic orbital stability, exponential orbital stability, finite-time orbital stability
follow naturally, as does the definition of an unstable orbit. In the case of limit cycles,
this x ∗ (t) is a periodic solution with x ∗ (t + t period ) = x ∗ (t).
The first tool one should know for analyzing the stability of a limit cycle, and one that
will serve us quite well for the examples in this chapter, is the method of Poincaré.
Let's consider a dynamical system, ẋ = f (x), with x ∈ R n . Define an n − 1 dimensional
surface of section, S . We will also require that S is transverse to the flow (i.e., all
trajectories starting on S flow through S , not parallel to it). The Poincaré map (or
return map) is a mapping from S to itself:
x p [n + 1] = P(x p [n]),
where x p [n] is the state of the system at the nth crossing of the surface of section.
Note that we will use the notation x p to distinguish the state of the discrete-time
system from the continuous-time system; they are related by x p [n] = x(t c [n]), where
t c [n] is the time of the nth crossing of S .
For our purposes in this chapter, it will be useful for us to allow the Poincaré maps
to also include fixed points of the original continuous-time system, and to define the
return time to be zero. These points do not satisfy the transversality condition, and
require some care numerically, but they do not compromise our analysis.
Example 4.2 (Return map for the Van der Pol Oscillator)
Since the state space of this system is two-dimensional, a return map for the
system must have dimension one. For instance, we can take S to be the line
segment where q = 0, q̇ ≥ 0. It is easy to see that all trajectories that start on S are
transverse to it.
One dimensional maps, like one dimensional flows, are amenable to graphical
analysis.
Figure 4.2 - (Left) Phase plot with the surface of section, S , drawn with a black
dashed line. (Right) The resulting Poincaré first-return map (blue), and the line of
slope one (red).
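A numerical version of this return map is easy to build with an off-the-shelf integrator and event detection. The sketch below assumes the standard Van der Pol form q̈ = μ(1 − q²)q̇ − q with μ = 0.2, and uses the surface of section S = {q = 0, q̇ ≥ 0}.

import numpy as np
from scipy.integrate import solve_ivp

mu = 0.2

def vdp(t, s):
    q, qd = s
    return [qd, mu*(1 - q**2)*qd - q]

def crossing(t, s):
    return s[0]                        # the surface of section q = 0
crossing.terminal, crossing.direction = True, 1

def return_map(qd):
    # step briefly off of S, then integrate until the next upward crossing of q = 0
    sol = solve_ivp(vdp, [0, 0.1], [0.0, qd], max_step=1e-3)
    sol = solve_ivp(vdp, [0, 100.0], sol.y[:, -1], events=crossing, max_step=1e-3)
    return sol.y[1, -1]

qd = 0.5
for _ in range(10):
    qd = return_map(qd)                # iterates converge to the limit cycle crossing
print(qd)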
If we can demonstrate that P(x p ) exists for all x p ∈ S , (all trajectories that leave S
eventually return to S ), then something remarkable happens: we can reduce the
stability analysis for a limit cycle in the original coordinates into the stability analysis
of a fixed point on the discrete map. If x ∗p is a stable (or unstable) fixed point of the
map P, then there exists a unique periodic orbit x ∗ (t) which passes through x ∗p that is
a (stable, or unstable) limit cycle. If we are able to upper-bound the time it takes for trajectories leaving S to return, then it is also possible to infer asymptotic orbital or even exponential orbital stability.
It is sometimes possible to infer more global stability properties of the return map by
examining P. We will study computational methods in a later chapter. [2] describes
some less computational stability properties known for unimodal maps which are
useful for the simple systems we study in this chapter.
We can quickly recognize the fixed points as the points where the return map crosses
this line. We can infer local exponential stability if the absolute value of the slope of
return map at the fixed point is less than one (|λ| < 1). The figure above illustrates
the stability of the Van der Pol limit cycle; it also illustrates the (one-sided) instability
of the fixed point at the origin. Understanding the regions of attraction of fixed
points on an iterated map (or any discrete-time system) graphically requires a little
more care, though, than the continuous-time flows we examined before. The discrete
maps can jump a finite distance on each step, and will in fact oscillate back and forth
from either side of the fixed point when λ < 0.
Figure 4.4 - A 3D passive dynamic walker by Steve Collins and Andy Ruina[3].
This amazing robot has no motors and no controllers... it walks down a tiny ramp and
is powered entirely by gravity. We can build up our understanding of this robot in a
series of steps. As we will see, what is remarkable is that even though the system
is passive, we believe that the periodic gait you see in the video is actually a stable
limit cycle!
Well before the first passive walkers, one of the earliest models of walking was
proposed by McMahon[4], who argued that humans use a mostly ballistic (passive)
gait. He observed that muscle activity (recorded via EMG) in the "stance leg" is
relatively high, but muscle activity in the "swing leg" is very low, except for at
the very beginning and end of swing. He proposed a "ballistic walker" model -- a
three-link pendulum in the sagittal plane, with one link from the ankle to the hip
representing a straight "stance leg", the second link from the hip back down to the
"swing leg" knee, and the final link from the knee down to the swing foot -- and
argued that this model could largely reproduce the kinematics of gait. This model
of "walking by vaulting" is considered overly simplistic today, as we now understand
much better the role of compliance in the stance leg during walking. His model was
also restricted to the continuous portion of gait, not the actual dynamics of taking a
step. But it was the start. Nearly a decade later, McGeer[5] followed up with a series
of similarly inspired walking machines, which he coined "passive-dynamic walkers".
Figure 4.5 - The rimless wheel. The orientation of the stance leg, θ, is measured
clockwise from the vertical axis.
Perhaps the simplest possible model of a legged robot, introduced as the conceptual
framework for passive-dynamic walking by McGeer[5], is the rimless wheel. This
system has rigid legs and only a point mass at the hip as illustrated in the figure
above. To further simplify the analysis, we make the following modeling
assumptions:
• Collisions with ground are inelastic and impulsive (only angular momentum
is conserved around the point of collision).
• The stance foot acts as a pin joint and does not slip.
• The transfer of support at the time of contact is instantaneous (no double
support phase).
• 0 ≤ γ < π/2, 0 < α < π/2, l > 0.
Note that the coordinate system used here is slightly different than for the simple
pendulum (θ = 0 is at the top, and the sign of θ has changed).
The most comprehensive analysis of the rimless wheel was done by [6].
Stance Dynamics
The dynamics of the system when one leg is on the ground are given by
θ̈ = (g/l) sin θ.
If we assume that the system is started in a configuration directly after a transfer of support (θ(0⁺) = γ − α), then forward walking occurs when the system has an initial velocity, θ̇(0⁺) > ω_1, where
ω_1 = √( 2 (g/l) [1 − cos(γ − α)] ).
ω 1 is the threshold at which the system has enough kinetic energy to vault the mass
over the stance leg and take a step. This threshold is zero for γ = α and does not exist
for γ > α (because when γ > α the mass is always ahead of the stance foot, and the
standing fixed point disappears). The next foot touches down when θ(t) = γ + α, at which point the conversion of potential energy into kinetic energy yields the velocity
θ̇(t⁻) = √( θ̇²(0⁺) + 4 (g/l) sin α sin γ ).
The angular momentum around the point of collision at time t just before the next foot collides with the ground is
m l² θ̇(t⁻) cos(2α),
and the angular momentum at the same point immediately after the collision is
m l² θ̇(t⁺).
Since angular momentum about the point of collision is conserved through the impact,
θ̇(t⁺) = θ̇(t⁻) cos(2α).
Forward simulation
Given the stance dynamics, the collision detection logic (θ = γ ± α), and the collision
update, we can numerically simulate this simple model. Doing so reveals something
remarkable... the wheel starts rolling, but then one of two things happens: it either
runs out of energy and stands still, or it quickly falls into a stable periodic solution.
Convergence to this stable periodic solution appears to happen from a huge range
of initial conditions. Try it yourself.
I've set up the notebook to make it easy for you to try a handful of interesting initial conditions, and even made an interactive widget for you to watch the simulation phase portrait as you change those initial conditions. Give it a try! (I recommend using Binder instead of Colab so you get the interactive features.)
The rimless wheel model actually uses a number of the more nuanced features
of DRAKE in order to simulate the hybrid system accurately (as do many of the
examples in this chapter). It actually registers the collision events of the system --
the simulator uses zero-crossing detection to ensure that the time/state of collision
is obtained accurately, and then applies the reset map.
Figure 4.7 - Phase portrait of the rimless wheel (m = 1, l = 1, g = 9.8, α = π/8,
γ = 0.08).
One of the fantastic things about the rimless wheel is that the dynamics are
exactly the undamped simple pendulum that we've thought so much about. So you
will recognize the phase portraits of the system -- they are centered around the
unstable fixed point of the simple pendulum.
Poincaré Map
We can now derive the angular velocity at the beginning of each stance phase as a
function of the angular velocity of the previous stance phase. First, we will handle
the case where γ ≤ α and θ̇⁺_n > ω_1. The "step-to-step return map", factoring in the losses from a single collision, is:
θ̇⁺_{n+1} = cos(2α) √( (θ̇⁺_n)² + 4 (g/l) sin α sin γ ),
where the θ̇⁺ indicates the velocity just after the energy loss at impact has occurred.
Using the same analysis for the remaining cases, we can complete the return map.
The threshold for taking a step in the opposite direction is
ω_2 = −√( 2 (g/l) [1 − cos(α + γ)] ),
θ̇⁺_{n+1} = −θ̇⁺_n cos(2α),
θ̇⁺_{n+1} = −cos(2α) √( (θ̇⁺_n)² − 4 (g/l) sin α sin γ ).
Notice that the return map is undefined for θ̇⁺_n ∈ {ω_1, ω_2}, because from these configurations the wheel will end up in the (unstable) equilibrium point where θ = 0 and θ̇ = 0, and will therefore never return to the map.
This return map blends smoothly into the case where γ > α. In this regime,
θ̇⁺_{n+1} = { cos(2α) √( (θ̇⁺_n)² + 4 (g/l) sin α sin γ ),    0 ≤ θ̇⁺_n
           { −θ̇⁺_n cos(2α),                                 ω_2 < θ̇⁺_n < 0
           { −cos(2α) √( (θ̇⁺_n)² − 4 (g/l) sin α sin γ ),    θ̇⁺_n ≤ ω_2
Notice that the formerly undefined points at {ω 1 , ω 2 } are now well-defined transitions
with ω 1 = 0, because it is kinematically impossible to have the wheel statically
balancing on a single leg.
For a fixed point, we require that θ̇⁺_{n+1} = θ̇⁺_n = ω*. Our system has two possible fixed points, depending on the parameters:
ω*_stand = 0,   ω*_roll = cot(2α) √( 4 (g/l) sin α sin γ ).
l
The limit cycle plotted above illustrates a state-space trajectory in the vicinity of
the rolling fixed point. ω ∗stand is a fixed point whenever γ < α. ω ∗roll is a fixed point
whenever ω ∗roll > ω 1 . It is interesting to view these bifurcations in terms of γ. For
small γ, ω stand is the only fixed point, because energy lost from collisions with the
ground is not compensated for by gravity. As we increase γ, we obtain a stable
rolling solution, where the collisions with the ground exactly balance the conversion
of gravitational potential to kinetic energy. As we increase γ further to γ > α, it
becomes impossible for the center of mass of the wheel to be inside the support
polygon, making standing an unstable configuration.
Figure 4.8 - Limit cycle trajectory of the rimless wheel (
m = 1, l = 1, g = 9.8, α = π/8, γ = 0.15). All hatched regions converge to the rolling
fixed point, ω ∗roll ; the white regions converge to zero velocity (ω ∗stand ).
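Since the return map is available in closed form, it takes only a few lines to iterate it and watch the convergence to ω*_roll. The parameter values below match the figure; the code itself is a sketch of mine, not the course implementation.

import numpy as np

g, l, alpha, gamma = 9.8, 1.0, np.pi/8, 0.15

def return_map(w):
    # the forward-rolling branch of the map (valid while the wheel keeps stepping forward)
    return np.cos(2*alpha) * np.sqrt(w**2 + 4*(g/l)*np.sin(alpha)*np.sin(gamma))

w = 2.0
for _ in range(30):
    w = return_map(w)

omega_roll = (np.cos(2*alpha)/np.sin(2*alpha)) * np.sqrt(4*(g/l)*np.sin(alpha)*np.sin(gamma))
print(w, omega_roll)                   # the iterates approach the rolling fixed point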
Stability of standing still
I opened this chapter with the idea that the natural notion of stability for a walking
system is periodic stability, and I'll stand by it. But we can also examine the stability
of a fixed-point (of the original coordinates; no Poincaré this time) for a system that
has contact mechanics. For a legged robot, a fixed point means standing still. This
can actually be a little subtle in our hybrid models: in the rimless wheel, standing
still is the limiting case where we have infinitely frequent collisions. [Details coming
soon.]
The rimless wheel models only the dynamics of the stance leg, and simply assumes
that there will always be a swing leg in position at the time of collision. To remove
this assumption, we take away all but two of the spokes, and place a pin joint at
the hip. To model the dynamics of swing, we add point masses to each of the legs.
I've derived the equations of motion for the system assuming that we have a torque
actuator at the hip - resulting in swing dynamics equivalent to an Acrobot (although
in a different coordinate frame) - but let's examine the passive dynamics of the
system first (u = 0).
In addition to the modeling assumptions used for the rimless wheel, we also assume
that the swing leg retracts in order to clear the ground without disturbing the
position of the mass of that leg. This model, known as the compass gait, is well
studied in the literature using numerical methods [7, 8, 9], but relatively little is
known about it analytically.
The state of this robot can be described by 4 variables: θ_st, θ_sw, θ̇_st, and θ̇_sw. The abbreviation st is shorthand for the stance leg and sw for the swing leg. Using q = [θ_st, θ_sw]ᵀ and u = τ, we can write the dynamics in the standard manipulator form, M(q)q̈ + C(q, q̇)q̇ = τ_g(q) + Bu, with
M = [ (m_h + m) l² + m a²           −m l b cos(θ_st − θ_sw) ]
    [ −m l b cos(θ_st − θ_sw)        m b²                    ]
and l = a + b.
• Add a "floating base" to the system by adding a free (x,y) joint at the pre-
impact stance toe, q fb = [x, y, θ st , θ sw ] T .
• Calculate the mass matrix for this new system, call it M fb .
• Write the position of the (pre-impact) swing toe as a function ϕ(q_fb). Define the Jacobian of this function: J = ∂ϕ/∂q_fb.
• The post-impact velocity that ensures that ϕ˙ = 0 after the collision is given
by
q̇⁺ = [I − M_fb⁻¹ Jᵀ [J M_fb⁻¹ Jᵀ]⁻¹ J] q̇⁻,
noting that ẋ − = ẏ − = 0, and that you need only read out the last two
elements of q̇ + . The velocity of the post-impact stance foot will be zero
by definition, and the new velocity of the pre-impact stance foot can be
completely determined from the minimal coordinates (a small numpy sketch of this update appears just after this list).
• Switch the stance and swing leg positions and velocities.
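The impact update in the bulleted procedure above is a one-liner in numpy once M_fb and J are available from your own kinematics; the following sketch is a direct transcription of that formula.

import numpy as np

def impact_map(M_fb, J, qdot_minus):
    # post-impact velocity that brings the (new) stance toe to rest:
    # qdot_plus = [I - Minv J^T (J Minv J^T)^-1 J] qdot_minus
    Minv = np.linalg.inv(M_fb)
    Lam = J @ Minv @ J.T
    n = M_fb.shape[0]
    return (np.eye(n) - Minv @ J.T @ np.linalg.solve(Lam, J)) @ qdot_minus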
Open in Colab
Try playing with the initial conditions. You'll find this system is much more
sensitive to initial conditions than the rimless wheel. It actually took some work to
find the initial conditions that make it walk stably.
Numerical integration of these equations reveals a stable limit cycle, plotted below.
Notice that when the left leg is in stance (top part of the plot), the trajectory quite
resembles the familiar pendulum dynamics of the rimless wheel. But this time, when
the leg impacts, it takes a long arc around as the swing leg before returning to
stance. The impacts are also clearly visible as discontinuous changes (decreases)
in velocity. The dependence of this limit cycle on the system parameters has been
studied extensively in [7].
Figure 4.10 - Passive limit cycle trajectory of the compass gait. (m = 5 kg, m_h = 10 kg, a = b = 0.5 m, ϕ = 0.0525 rad, x(0) = [0, 0, 0.4, −2.0]ᵀ). Drawn is the phase portrait of the left leg angle, which is recovered from θ_st and θ_sw in the simulation with some simple book-keeping.
The basin of attraction of the stable limit cycle is a narrow band of states
surrounding the steady state trajectory. Although the simplicity of this model makes
it attractive for conceptual explorations, this lack of stability makes it difficult to
implement on a physical device.
After this collision, the knee is locked and we switch to the compass gait model
with a different mass distribution. In other words, the system becomes a two-
link pendulum. Again, the heel-strike is modeled as an inelastic collision. After the
collision, the legs switch instantaneously. After heel-strike then, we switch back to
the ballistic walker's three-link pendulum dynamics. This describes a full step cycle
of the kneed walker, which is shown above.
Figure 4.14 - Tad McGeer's kneed walker. Here is Matt Haberland's guide to
launching it successfully. It's nontrivial!
The world has seen all sorts of great variations on the passive-dynamic walker
theme. Almost certainly the most impressive are the creations of Dutch artist Theo
Jansen -- he likes to say that he is creating "new forms of life" which he calls the
Strandbeest. There are many variations of these amazing machines -- including his
beach walkers which are powered only by the wind (I've actually been able to visit
Theo's workshop once; it is very windy there).
These results are very real. Robin Deits (an exceptionally talented student who felt
particularly inspired once on a long weekend) once reproduced one of Theo's earliest
creations in DRAKE.
4.3 SIMPLE MODELS OF RUNNING
To this day, I'm still a bit surprised that impressive running robots (assuming you
accept hopping as the first form of running) appeared before the impressive walking
robots. I would have thought that running would be a more difficult control and
engineering task than walking. But these hopping machines turned out to be an
incredibly clever way to build something simple and very dynamic.
Shortly after Raibert built his machines, Dan Koditschek and Martin Buehler started
analyzing them with simpler models [2]. Years later, in collaboration with biologist
Bob Full, they started to realize that the simple models used to describe the
dynamics of Raibert's hoppers could also be used to describe experimental data
obtained from running animals. In fact, they described an incredible range of
experimental results, for animals ranging in size from a cockroach up to a horse[12]
(I think the best overall summary of this line of work is [13]). The most successful of
the simple models was the so-called "spring-loaded inverted pendulum" (or "SLIP",
for short).
4.3.1 The Spring-Loaded Inverted Pendulum (SLIP)
The model is a point mass, m, on top of a massless, springy leg with rest length
of l 0 , and spring constant k. The configuration of the system is given by the x, z
position of the center of mass, and the length, l, and angle θ of the leg. The dynamics
are modeled piecewise - with one dynamics governing the flight phase, and another
governing the stance phase.
In SLIP, we actually use different state variables for each phase of the dynamics.
During the flight phase, we use the state variables: x = [x, z, ẋ, ż] T . The flight
dynamics are simply
ẋ = [ẋ, ż, 0, −g]ᵀ.
Since the leg is massless, we assume that the leg can be controlled to any position
instantaneously, so it's reasonable to take the leg angle as the control input θ = u.
During the "stance phase", we write the dynamics in polar coordinates, with the foot
anchored at the origin, using the state variables x = [r, θ, ṙ, θ̇]ᵀ. The location of the mass is determined by the leg kinematics; in particular, its height above the foot is z = r cos θ.
The total energy is given by the kinetic energy (of the mass) and the potential energy
from gravity and from the leg spring:
T = (m/2)(ṙ² + r² θ̇²),   U = m g r cos θ + (k/2)(l_0 − r)².
Plugging these into Lagrange's equations yields the stance dynamics:
m r̈ = m r θ̇² − m g cos θ + k (l_0 − r),
m r² θ̈ = m g r sin θ − 2 m r ṙ θ̇.
We assume that the stance phase is entirely passive; there are no control inputs during this phase.
Unlike the rimless wheel model, the idealization of a spring leg with a massless toe
means that no energy is lost during the collision with the ground. The collision event
causes a transition to a different state space, and to a different dynamics model,
but no instantaneous change in velocity. The transition from flight to stance happens
when z − l 0 cos θ ≤ 0. The transition from stance to flight happens when the stance leg
reaches its rest length: r ≥ l 0 .
Example 4.5 (Simulation of the SLIP model)
You can simulate the spring-loaded inverted pendulum using:
Open in Colab
We used the same witness functions and resets to simulate the walking models, but
those were implemented in C++. This is a purely Python implementation, making
it a bit easier to see those details.
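For readers who want a dependency-light version, here is a self-contained sketch of one apex-to-apex cycle using scipy's event detection; the touchdown angle, parameter values, and coordinate bookkeeping are my own choices and are not taken from the course notebook.

import numpy as np
from scipy.integrate import solve_ivp

m, k, l0, g = 1.0, 100.0, 1.0, 9.81
theta_td = 0.3                        # commanded leg angle (from vertical) during flight

def flight(t, s):                     # s = [x, z, xdot, zdot]
    return [s[2], s[3], 0.0, -g]

def touchdown(t, s):
    return s[1] - l0*np.cos(theta_td)
touchdown.terminal, touchdown.direction = True, -1

def stance(t, s):                     # s = [r, theta, rdot, thetadot], foot pinned at origin
    r, th, rd, thd = s
    rdd = r*thd**2 - g*np.cos(th) + (k/m)*(l0 - r)
    thdd = (g*np.sin(th) - 2*rd*thd)/r
    return [rd, thd, rdd, thdd]

def liftoff(t, s):
    return s[0] - l0                  # the leg returns to its rest length
liftoff.terminal, liftoff.direction = True, 1

def apex_to_apex(z_apex, xd_apex):
    # flight (from apex) until touchdown
    sol = solve_ivp(flight, [0, 10], [0.0, z_apex, xd_apex, 0.0],
                    events=touchdown, max_step=1e-3)
    x, z, xd, zd = sol.y[:, -1]
    # convert to polar stance coordinates: x = x_foot - r sin(theta), z = r cos(theta)
    th = theta_td
    r = z/np.cos(th)
    rd = -xd*np.sin(th) + zd*np.cos(th)
    thd = -(xd*np.cos(th) + zd*np.sin(th))/r
    sol = solve_ivp(stance, [0, 10], [r, th, rd, thd], events=liftoff, max_step=1e-3)
    r, th, rd, thd = sol.y[:, -1]
    # back to Cartesian velocities, then a ballistic rise to the next apex
    xd = -rd*np.sin(th) - r*thd*np.cos(th)
    zd = rd*np.cos(th) - r*thd*np.sin(th)
    return r*np.cos(th) + zd**2/(2*g)

print(apex_to_apex(1.05, 1.0))        # next apex height; iterate this to trace the return map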
For this system, one can still make progress if we define the Poincare section to be
at the apex of the hop (when ż = 0), and analyze the dynamics from apex to apex.
We will ignore the absolute horizontal position, x, and the leg angle, θ, as they are
irrelevant to the flight dynamics. This still leaves us with two state variables at the
apex: the horizontal velocity, ẋ, and the hopping height, z.
Importantly, at the apex, these two variables are linked -- when you know one, you
know the other. This is because no energy is ever added or removed from the system.
If you know the (constant) total energy in the system, and you know the height at
the apex, then you know the horizontal velocity (up to a sign):
E_apex(z, ẋ) = (1/2) m ẋ² + m g z = E_0.
To make our analysis less dependent on our specific choice of parameters, we typically report any analysis using a dimensionless total energy, Ẽ = E_0/(m g l_0), and a dimensionless (relative) leg stiffness, k̃ = k l_0/(m g).
Putting all of these together, we can numerically evaluate the apex-to-apex return
map, given a known total energy of the system.
Using our graphical staircase analysis (for discrete-time systems), we can see that
there are two fixed points of this map: one "stable" fixed point around 0.87 meters,
and one unstable fixed point around 0.92 meters.
As always, you should run this code yourself to make sure you understand:
Open in Colab
I'd also recommend running the simulation using initial conditions at various points on this plot. You should ask: why does the plot cut off so abruptly just below the stable fixed point? Well, the apex-to-apex map becomes undefined if the apex is smaller than l_0 cos θ (the height at touchdown). Furthermore, when z is too large (relative to Ẽ), then the system will not have enough horizontal energy to transition through the stance phase and continue forward motion: the center of mass will compress the spring and then fall backwards in the direction from which it came.
The system's conservation of total energy is quite convenient for our return map
analysis, but it should give you pause. The rimless wheel actually relied on the
dissipation due to collisions with the ground to acquire its stability. Is there any
chance for a system that is energy conserving to be stable? The apex-to-apex return
maps visualized above reveal "stable" fixed-points, but you must understand that the projection from (z, ẋ) to just z is using the assumption that the energy is conserved.
As a result, our analysis on the one-dimensional Poincare map can only make a
statement about partial stability (i.s.L., asymptotic, or exponential); our analysis
does not say anything about stability against disturbances that change the total
energy of the system.
Despite this limitation, it is quite interesting that such a simple system converges
to particular hopping heights from various initial conditions. The observation that
these "piecewise-Hamiltonian" systems (Hamiltonian because they conserve total
energy) can exhibit this partial stability generated quite a bit of interest from the
theoretical physics community. You can sense that enthusiasm when reading [13].
[14] also gives a more thorough analysis of this apex-to-apex map, including a closed-
form approximation to the return map dynamics based on a linear (small angle)
approximation of the stance dynamics.
SLIP extensions
The original SLIP model, perhaps inspired by the hopping robots, turned out to be quite effective in describing the vertical center of mass dynamics of a wide range of
animals [13]. Remarkably, experiments showed that the dimensionless individual leg
stiffness was very close to 10 for a huge range of animals, ranging from a 0.001 kg
cockroach to a 135 kg horse [15]. But the SLIP model, and modest extensions to it,
has also been used to understand lateral stability in cockroaches, resulting in some
of my favorite experimental videos of all time [16].
SLIP Control
4.5 EXERCISES
a. A function that, given the state of the hopper, returns its total
mechanical energy.
b. The hopping controller class. This is a DRAKE VectorSystem that, at each
sampling time, reads the state of the hopper and returns the preload
of the hopper spring necessary for the system to hop at the desired
height. All the necessary information for the synthesis of this controller is given in the notebook.
REFERENCES
1. Steven H. Strogatz, "Nonlinear Dynamics and Chaos: With Applications to
Physics, Biology, Chemistry, and Engineering",Perseus Books , 1994.
3. Steven H. Collins and Martijn Wisse and Andy Ruina, "A Three-Dimensional
Passive-Dynamic Walking Robot with Two Legs and Knees", International
Journal of Robotics Research, vol. 20, no. 7, pp. 607-615, July, 2001.
10. Vanessa Hsu, "Passive dynamic walking with knees: A Point-foot model", February, 2007. [ link ]
12. R.J. Full and D.E. Koditschek, "Templates and anchors: neuromechanical
hypotheses of legged locomotion on land", Journal of Experimental Biology,
vol. 202, no. 23, pp. 3325-3332, 1999.
13. Philip Holmes and Robert J. Full and Dan Koditschek and John
Guckenheimer, "The Dynamics of Legged Locomotion: Models, Analyses,
and Challenges", Society for Industrial and Applied Mathematics (SIAM)
Review, vol. 48, no. 2, pp. 207--304, 2006.
15. Daniel E Koditschek and Robert J Full and Martin Buehler, "Mechanical
aspects of legged locomotion control", Arthropod structure \& development,
vol. 33, no. 3, pp. 251--272, 2004.
17. Marc H. Raibert, "Legged Robots That Balance",The MIT Press , 1986.
18. M. H. Raibert and M. Chepponis and H. B. Brown, "Running on four legs as though they were one", IEEE Journal of Robotics and Automation, vol. 2, no. 2, pp. 70--82, 1986.
19. Marc H. Raibert, "Hopping in legged systems: Modeling and simulation for
the 2D one-legged case", IEEE Trans. Systems, Man, and Cybernetics, vol.
14, pp. 451-463, 1984.
CHAPTER 5
Highly-articulated Legged Robots
The passive dynamic walkers and hopping robots described in the last chapter
capture the fundamental dynamics of legged locomotion -- dynamics which are
fundamentally nonlinear and punctuated by impulsive events due to making and
breaking contact with the environment. But if you start reading the literature on
humanoid robots, or many-legged robots like quadrupeds, then you will find a quite
different set of ideas taking center stage: ideas like the "zero-moment point" and
footstep planning. The goal of this chapter is to penetrate that world.
I'd like to start the discussion with a model that might seem quite far from the world
of legged robots, but I think it's a very useful way to think about the problem.
5.1 CENTER OF MASS DYNAMICS
Imagine you have a flying vehicle modeled as a single rigid body in a gravity field
with some number of force "thrusters" attached. We'll describe the configuration of
the vehicle by its orientation, θ and the location of its center of mass (x, z). The vector
from the center of mass to thruster i is given by r i , yielding the equations of motion:
m ẍ = Σ_i F_{i,x},
m z̈ = Σ_i F_{i,z} − m g,
I θ̈ = Σ_i [r_i × F_i]_y,
where I've used F i,x to represent the component of the force in the x direction and
have taken just the y-axis component of the cross product on the last line.
Now imagine that you have just a small number of thrusters (let's say two) each
with severe input constraints. To make things more interesting, let us say that you're
allowed to move the thrusters, so ṙ i becomes an additional control input, but with
some important limitations: you have to turn the thrusters off when they are moving
(e.g. |F_i| |ṙ_i| = 0) and there is a limit to how quickly you can move the thrusters, |ṙ_i| ≤ v_max. This problem is now a lot more difficult, due especially to the fact that
constraints of the form u 1 u 2 = 0 are non-convex.
I find this a very useful thought exercise; somehow our controller needs to make
an effectively discrete decision to turn off a thruster and move it. The framework
of optimal control should support this - you are sacrificing a short-term loss of
control authority for the long-term benefit of having the thruster positioned where
you would like, but we haven't developed tools yet that deal well with this discrete
plus continuous decision making. We'll need to address that here.
The union of these regions need not form a convex set. Furthermore, these locations
could be specified in the robot frame, but may also be specified in the world frame,
which I'll call p i :
p_i = r_i + [x, 0, z]ᵀ.
This problem still feels like it should be tractable, but it's definitely getting hard.
The constraints on where you can put your feet / thrusters will depend on the
kinematics of your leg, and the speed at which you can move them will depend on
the full dynamics of the leg -- these are difficult constraints to deal with. But the
actual dynamics of the rigid body are actually still affine, and still simple!
We don't actually need to have massless legs for this discussion to make sense. If
we use the coordinates x, z to describe the location of the center of mass (CoM) of
the entire robot, and m to represent the entire mass of the robot, then the first two
equations remain unchanged. The center of mass is a configuration dependent point,
and does not have an orientation, but one important generalization of the orientation
dynamics is given by the centroidal momentum matrix, A(q), where A(q)q̇ captures
the linear and angular momentum of the robot around the center of mass [1].
Note that the center of mass dynamics are still affine -- even for a big complicated
humanoid -- but the centroidal momentum dynamics are nonlinear.
For an articulated robot, though, there are a number of possible strategies for the
legs which can affect the dynamics of the center of mass. For example, if the robot
hits the ground with a stiff leg like the rimless wheel, then the angular momentum
about the point of collision will be conserved, but any momentum going into the
ground will be lost. However, if the robot has a spring leg and a massless toe like the
SLIP model, then no energy need be lost.
One strategy that many highly-articulated legged robots use is to keep their center
of mass at a constant height,
z=c ⇒ ż = z̈ = 0,
and minimize their angular momentum about the center of mass (here θ¨ = 0). Using
this strategy, if the swinging foot lands roughly below the center of mass, then even
with a stiff leg there is no energy dissipated in the collision - all of the momentum is
conserved. This often (but does not always) justify ignoring the impacts in the center
of mass dynamics of the system.
While not the only important case, it is extremely common for our robots to be
walking over flat, or nearly flat terrain. In this situation, even if the robot is touching
the ground in a number of places (e.g., two heels and two toes), the constraints on
the center of mass dynamics can be summarized very efficiently.
First, on flat terrain F_{i,z} represents the force that is normal to the surface at contact point i. If we assume that the robot can only push on the ground (and not pull), then this implies
Σ_i F_{i,z} ≥ 0   ⟹   z̈ ≥ −g.
In other words, if I cannot pull on the ground, then my center of mass cannot accelerate towards the ground faster than gravity.
Moreover, a friction cone constraint, |Σ_i F_{i,x}| ≤ μ Σ_i F_{i,z}, bounds the horizontal acceleration: |ẍ| ≤ μ (z̈ + g). For instance, if I keep my center of mass at a constant height, then z̈ = 0 and |ẍ| ≤ μg; this is a nice reminder of just how important friction is if you want to be able to move around in the world.
Even better, let us define the "center of pressure" (CoP) as the point on the ground
where
x_cop = (Σ_i p_{i,x} F_{i,z}) / (Σ_i F_{i,z}),
and since all p i,z are equal on flat terrain, z cop is just the height of the terrain. It turns
out that the center of pressure is a "zero-moment point" (ZMP) -- a property that we
will demonstrate below -- and moment-balance equation gives us a very important
relationship between the location of the CoP and the dynamics of the CoM:
(m z̈ + m g)(x_cop − x) = (z_cop − z) m ẍ − I θ̈.
If we use the strategy proposed above for ignoring collision dynamics, z̈ = θ¨ = 0, then
we have z − z cop is a constant height h, and the result is the famous "ZMP equations":
ẍ = −(g/h)(x_cop − x).
So the location of the center of pressure completely determines the acceleration
of the center of mass, and vice versa! What's more, this relationship is affine -- a
property that we can exploit in a number of ways.
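Because the relationship is affine, it is trivial to go back and forth between a center of mass trajectory and the center of pressure it implies. Here is a tiny sketch (with made-up numbers for g, h, and the CoM motion) that inverts the ZMP equation to check whether a smooth CoM shift keeps the CoP inside the foot.

import numpy as np

g, h = 9.81, 1.0                       # gravity and (constant) CoM height; illustrative values
dt = 1e-3
t = np.arange(0.0, 2.0, dt)
x = 0.25*(1 - np.cos(np.pi*t/2.0))     # a smooth 0.5 m CoM shift over 2 s
xdd = np.gradient(np.gradient(x, dt), dt)
x_cop = x - (h/g)*xdd                  # invert xddot = -(g/h)(x_cop - x)
print(x_cop.min(), x_cop.max())        # must stay within the support polygon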
The zero-moment point (ZMP) is discussed very frequently in the current literature
on legged robots. It also has an unfortunate tendency to be surrounded by some
confusion; a number of people have defined the ZMP in slightly different ways (see
e.g. [2] for a summary). Therefore, it makes sense to provide a simple derivation
here.
First let us recall that for rigid body systems I can always summarize the
contributions from many external forces as a single wrench (force and torque) on the
object -- this is simply because the F i terms enter our equations of motion linearly.
Specifically, given any point in space, r, in coordinates relative to (x, z):
I θ̈ = Σ_i [r_i × F_i]_y = (r × F_net)_y + τ_net,
where F_net = Σ_i F_i and the value of τ_net depends on the location r. For some choices of r, we can make τ_net = 0. Examining
(r × F_net)_y = r_z F_{net,x} − r_x F_{net,z},
we can see that when F_{net,z} > 0 there is an entire line of solutions, r_x = a r_z + b,
including one that will intercept the terrain. For walking robots, it is this point on
the ground from which the external wrench can be described by a single force vector
(and no moment) that is the famous "zero-moment point" (ZMP). Substituting the center of mass dynamics back in to replace F_net, we can see that
r_x = (r_z m ẍ − I θ̈) / (m z̈ + m g).
If we assume that z̈ = θ¨ = 0 and replace the relative coordinates with the global
coordinates r = p − [x, 0, z] T , then we arrive at precisely the equation presented
above.
Furthermore, since on flat terrain all of the contact forces are applied at the same height, we also have
r_x = (Σ_i r_{i,x} F_{i,z}) / F_{net,z},
which is precisely the center of pressure defined above.
In three dimensions, we solve for the point on the ground where τ net,y = τ net,x = 0, but
allow τ net,z ≠ 0 to extract the analogous equations in the y-axis:
r_y = (r_z m ÿ − I θ̈) / (m z̈ + m g).
Coming soon. For a description of our approach with Atlas, see [5, 4]. For nice
geometric insights on push recovery, see [6].
Coming soon. For a description of our approach with Atlas, see [7, 4].
Example 5.1 (LittleDog gait optimization)
Open in Colab
5.7 EXERCISES
5.8 REFERENCES
1. David E. Orin and Ambarish Goswami and Sung-Hee Lee, "Centroidal
dynamics of a humanoid robot", Autonomous Robots, no. September 2012,
pp. 1--16, jun, 2013.
3. Scott Kuindersma and Frank Permenter and Russ Tedrake, "An Efficiently
Solvable Quadratic Program for Stabilizing Dynamic Locomotion",
Proceedings of the International Conference on Robotics and Automation,
May, 2014. [ link ]
4. Scott Kuindersma and Robin Deits and Maurice Fallon and Andrés Valenzuela and Hongkai Dai and Frank Permenter and Twan Koolen and Pat Marion and Russ Tedrake, "Optimization-based Locomotion Planning, Estimation, and Control Design for the Atlas Humanoid Robot", Autonomous Robots, vol. 40, no. 3, pp. 429--455, 2016. [ link ]
5. Robin Deits and Russ Tedrake, "Footstep Planning on Uneven Terrain with
Mixed-Integer Convex Optimization", Proceedings of the 2014 IEEE/RAS
International Conference on Humanoid Robots (Humanoids 2014), 2014.
[ link ]
6. Twan Koolen and Tomas de Boer and John Rebula and Ambarish Goswami
and Jerry Pratt, "Capturability-based analysis and control of legged
locomotion, Part 1: Theory and application to three simple gait models", The
International Journal of Robotics Research, vol. 31, no. 9, pp. 1094-1113,
2012.
CHAPTER 6
Model Systems with Stochasticity
Open in Colab
My goals for this chapter are to build intuition for the beautiful and rich behavior of nonlinear dynamical systems that are subjected to random (noise/disturbance) inputs. So far we have focused primarily on systems described by
ẋ = f(x, u);
in this chapter we will add an extra input,
ẋ = f(x, u, w),
where this additional input w is the (vector) output of some random process. In other words, we can begin thinking about stochastic systems by simply understanding the dynamics of our existing ODEs subjected to an additional random input.
This form is extremely general as written. w(t) can represent time-varying random
disturbances (e.g. gusts of wind), or even constant model errors/uncertainty. One
thing that we are not adding, yet, is measurement uncertainty. There is a great deal
of work on observability and state estimation that study the question of how you can
infer the true state of the system given noisy sensor readings. For this chapter we
are assuming perfect measurements of the full state, and are focused instead on the
way that "process noise" shapes the long-term dynamics of the system.
I will also stick primarily to discrete time dynamics for this chapter, simply because
it is easier to think about the output of a discrete-time random process, w[n], than
a w(t). But you should know that all of the ideas work in continuous time, too. Also,
most of our examples will take the form of additive noise:

x[n+1] = f(x[n], u[n]) + w[n],

which is a particularly useful and common specialization of our general form. And
this form doesn't give up much -- the disturbance on step k can pass through the
nonlinear function f on step k + 1 giving rich results -- but is often much easier to
work with.
6.1 THE MASTER EQUATION
Let's start by looking at some simple examples.
ẋ = x − x³.
This deterministic system has stable fixed points at x* = {−1, 1} and an unstable
fixed point at x* = 0.
When w[n] = 0, this system has the same fixed points and stability properties as
the continuous time system. But let's examine the system when w[n] is instead the
result of a zero-mean Gaussian white noise process, defined by:
E[w[n]] = 0, ∀n,
E[w[i]w[j]] = σ² if i = j, and 0 otherwise.
When you simulate this system for small values of σ, you will see trajectories move
roughly towards one of the two fixed points (for the deterministic system), but
every step is modified by the noise. In fact, even if the trajectory were to arrive
exactly at what was once a fixed point, it is almost surely going to move again on
the very next step. If we plot many runs of the simulation from different initial
conditions all on the same plot, we will see something like the figure below.
Figure 6.1 - Simulation of the bistable system with noise (σ = 0.01) from many
initial conditions.
During any individual simulation, the state jumps around randomly for all time,
and could even transition from the vicinity of one fixed point to the other fixed
point. Another way to visualize this output is to animate a histogram of the
particles over time.
Figure 6.2 - Histogram of the states of the bistable system with noise (σ = 0.01)
after simulating from random initial conditions until t = 30.
Click here to see the animation
Open in Colab
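For a feel of what is going on, here is a minimal numpy-only sketch (not the Drake code from the notebook linked above) that simulates one plausible discretization of the noisy system, x[n+1] = x[n] + h(x[n] − x³[n]) + w[n]; the time step h and the discretization itself are assumptions made for illustration.

import numpy as np

h = 0.01           # integration time step (an assumption)
sigma = 0.01       # noise standard deviation, as in the figures
T = 30.0           # simulate until t = 30
num_particles = 5000

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=num_particles)   # random initial conditions

for _ in range(int(T / h)):
    w = sigma * rng.standard_normal(num_particles)
    x = x + h * (x - x**3) + w

# a histogram of the particles now roughly corresponds to the kind of
# distribution shown in Figure 6.2
counts, edges = np.histogram(x, bins=100, range=(-2, 2), density=True)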
Let's take a moment to appreciate the implications of what we've just observed.
Every time that we've analyzed a system to date, we've asked questions like "given
x[0], what is the long-term behavior of the system, lim n→∞ x[n]?", but now x[n] is a
random variable. The trajectories of this system do not converge, and the system
does not exhibit any form of stability that we've introduced so far.
All is not lost. If you watch the animation closely, you might notice the distribution
of this random variable is actually very well behaved. This is the key idea for this
chapter.
Let us use p n (x) to denote the probability density function over the random variable
x at time n. Note that this density function must satisfy
∫_{−∞}^{∞} p_n(x) dx = 1.
It is actually possible to write the dynamics of the probability density with the simple
relation
p_{n+1}(x) = ∫_{−∞}^{∞} p(x | x′) p_n(x′) dx′.
Figure 6.3 - The discrete-time dynamics of the "logistic map" (plotted alongside
the line x[n + 1] = x[n]).
It takes only a moment of tracing your finger along the plot using the "staircase
method" to see what makes this system so interesting -- rollouts from a single
initial condition end up bouncing all over the interval (0, 1), and neighboring initial
conditions will end up taking arbitrarily different trajectories (this is the hallmark
of a "chaotic" system).
Figure 6.4 - Two simulations of the logistic map from different initial conditions.
Remember, there is no stochasticity here -- the dynamics are entirely
deterministic!
Here's what's completely fascinating -- even though the dynamics of any one initial
condition for this system are extremely complex, if we study the dynamics of a
distribution of states through the system, they are surprisingly simple and well-
behaved. This system is one of the rare cases when we can write the master
equation in closed form[2]:
p_{n+1}(x) = (1 / (4√(1−x))) [ p_n(1/2 − (1/2)√(1−x)) + p_n(1/2 + (1/2)√(1−x)) ].
Figure 6.5 - Plotting the (closed-form) evolution of the master equation for the
logistic map, initialized with p 0 (x) = 1, reveals surprising simplicity, and rapid
convergence to a stationary distribution (dashed orange line).
For this system (and many chaotic systems), the dynamics of a single initial
condition are complicated, but the dynamics of a distribution of initial conditions are
beautiful and simple!
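As a rough sketch (not the code used to produce the figure above), the closed-form master equation can be iterated numerically on a grid; the grid resolution, the linear interpolation, and the renormalization step below are all assumptions made for illustration.

import numpy as np

N = 2000
x = (np.arange(N) + 0.5) / N            # grid on (0, 1)
p = np.ones(N)                           # p_0(x) = 1

for n in range(20):
    s = np.sqrt(1.0 - x)
    # evaluate p_n at the two preimages of x via linear interpolation
    p_lo = np.interp(0.5 - 0.5 * s, x, p)
    p_hi = np.interp(0.5 + 0.5 * s, x, p)
    p = (p_lo + p_hi) / (4.0 * s)
    p /= np.trapz(p, x)                  # guard against interpolation error

# p should now approximate the stationary distribution (the dashed line in Figure 6.5)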
Note: when the dynamical system under study has deterministic dynamics (but a
distribution of initial conditions), the linear map given by the master equation is
known as the Perron-Frobenius operator, and it gives rise to the Liouville equation
that we will study later in the chapter.
The slightly more general form of the master equation, which works for multivariate
distributions with state-domain X, and systems with control inputs u, is

p_{n+1}(x) = ∫_X p(x | x′, u) p_n(x′) dx′.

Continuous-time formulations are also possible -- these lead to the so-called Fokker-
Planck equation.
Consider the linear, discrete-time system x[n+1] = a x[n] + w[n]. When w[n] = 0, the
system is stable (to the origin) for −1 < a < 1. Let's make sure that we can understand
the dynamics of the master equation for the case when w[n] is Gaussian white noise
with standard deviation σ.
First, recall that the probability density function of a Gaussian with mean μ and
standard deviation σ is given by

p(w) = (1 / √(2πσ²)) e^{−(w−μ)² / (2σ²)}.

The conditional density we need for the master equation is therefore

p(x[n+1] | x[n]) = (1 / √(2πσ²)) e^{−(x[n+1] − a x[n])² / (2σ²)}.
Now here's the magic. Let's push a distribution, p n (x), which is zero-mean, with
standard deviation σ n , through the master equation:
p_{n+1}(x) = ∫_{−∞}^{∞} (1 / √(2πσ²)) e^{−(x − a x′)² / (2σ²)} (1 / √(2πσ_n²)) e^{−(x′)² / (2σ_n²)} dx′
           = (1 / √(2π(σ² + a²σ_n²))) e^{−x² / (2(σ² + a²σ_n²))}.
Taking it a step further, we can see that a stationary distribution for this system is
given by a mean-zero Gaussian with
σ_*² = σ² / (1 − a²).
Note that this distribution is well defined when −1 < a < 1 (only when the system
is stable).
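A quick numerical sanity check (a sketch, not from the text) confirms this: simulate many particles of x[n+1] = a x[n] + w[n] and compare the empirical variance to σ²/(1 − a²).

import numpy as np

a, sigma = 0.9, 0.1
rng = np.random.default_rng(0)
x = np.zeros(10000)                    # many independent particles, all starting at 0
for _ in range(2000):                  # long enough to reach the stationary distribution
    x = a * x + sigma * rng.standard_normal(x.shape)

print(np.var(x))                       # empirical variance of the particles
print(sigma**2 / (1 - a**2))           # predicted stationary variance (about 0.053)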
The stationary distribution of the linear Gaussian system reveals the fundamental
and general balance between two terms in the governing equations of any stochastic
dynamical system: the stability of the deterministic system is bringing trajectories
together (smaller a means faster convergence of the deterministic system and
results in a more narrow distribution), but the noise in the system is forcing
trajectories apart (larger σ means larger noise and results in a wider distribution).
Given how rich the dynamics can be for deterministic nonlinear systems, you can
probably imagine that the possible long-term dynamics of the probability are also
extremely rich. If we simply flip the signs in the dynamics we examined above, we'll
get our next example:
ẋ = −x + x³,
With w[n] = 0, the system has only a single stable fixed point (at the origin), whose
deterministic region of attraction is given by x ∈ (−1, 1). If we again simulate the
system from a set of random initial conditions and plot the histogram, we will see
something like the figure below.
Figure 6.6 - Histogram of the states of the system ẋ = −x + x³ with noise (σ = 0.01)
after simulating from random initial conditions until t = 30.
Click here to see the animation
Be sure to watch the animation. Or better yet, run the simulation for yourself by
changing the sign of the derivative in the bistability example and re-running:
Open in Colab
Now try increasing the noise (e.g. pre-multiply the noise input, w, in the dynamics
by a scalar like 2).
What is the stationary distribution for this system? In fact, there isn't one.
Although we initially collect probability density around the stable fixed point, you
should notice a slow leak -- on every step there is some probability of transitioning
past unstable fixed points and getting driven by the unstable dynamics off towards
infinity. If we run the simulation long enough, there won't be any probability
density left at x = 0.
Consider now the Van der Pol oscillator, driven by white noise:
q̈ + (q² − 1)q̇ + q = w(t).
Here's the question: if we start with a small set of initial conditions centered
around one point on the limit cycle, then what is the long-term behavior of this
distribution?
Figure 6.7 - Randomly sampled initial conditions pictured with the stable limit
cycle of the Van der Pol oscillator.
Open in Colab
The explanation is simple: the periodic solution of the system is only orbitally
stable; there is no stability along the limit cycle. So any disturbances that push
the particles along the limit cycle will go unchecked. Eventually, the distribution
will "mix" along the entire cycle. Perhaps surprisingly, this system, which has a stable
limit cycle when w = 0, eventually reaches a stationary distribution (a fixed point of
the master equation).
In our original analysis of the rimless wheel, we derived the "post-collision" return
map -- mapping the angular velocity from the beginning of the stance phase to the
(post-collision) angular velocity at the next stance phase. But now that the relative
location of the ground is changing on every step, we instead write the "apex-to-apex"
return map, which maps the angular velocity from one apex (the moment that the
leg is vertical) to the next, which is given by:
More coming soon. Read the paper [3] and/or watch the video.
REFERENCES
1. Katie Byl and Russ Tedrake, "Metastable Walking on Stochastically Rough
Terrain", Proceedings of Robotics: Science and Systems IV, 2008. [ link ]
CHAPTER 7
Dynamic Programming
In chapter 2, we spent some time thinking about the phase portrait of the simple
pendulum, and concluded with a challenge: can we design a nonlinear controller
to reshape the phase portrait, with a very modest amount of actuation, so that
the upright fixed point becomes globally stable? With unbounded torque, feedback-
cancellation solutions (e.g., invert gravity) can work well, but can also require an
unnecessarily large amount of control effort. The energy-based swing-up control
solutions presented for the acrobot and cart-pole systems are considerably more
appealing, but required some cleverness and might not scale to more complicated
systems. Here we investigate another approach to the problem, using computational
optimal control to synthesize a feedback controller directly.
In this chapter, we will introduce optimal control - a control design process using
optimization. This approach is powerful for a number of reasons. First and foremost,
it is very general - allowing us to specify the goal of control equally well for fully- or
under-actuated, linear or nonlinear, deterministic or stochastic, and continuous or
discrete systems. Second, it permits concise descriptions of potentially very complex
desired behaviours, specifying the goal of control as a scalar objective (plus a list
of constraints). Finally, and most importantly, optimal control is very amenable to
numerical solutions. [1] is a fantastic reference on this material for those who want
a somewhat rigorous treatment; [2] is an excellent (free) reference for those who
want something more approachable.
The fundamental idea in optimal control is to formulate the goal of control as the
long-term optimization of a scalar cost function. Let's introduce the basic concepts
by considering a system that is even simpler than the simple pendulum: the double
integrator, q̈ = u.
If you would like a mechanical analog of the system (I always do), then you can
think about this as a unit mass brick moving along the x-axis on a frictionless
surface, with a control input which provides a horizontal force, u. The task is to
design a control system, u = π(x, t), x = [q, q̇] T to regulate this brick to x = [0, 0] T .
In order to formulate this control design problem using optimal control, we must
define a scalar objective which scores the long-term performance of running each
candidate control policy, π(x, t), from each initial condition, (x 0 , t 0 ), and a list of
constraints that must be satisfied. For the task of driving the double integrator
to the origin, one could imagine a number of optimal control formulations which
would accomplish the task, e.g.:
Note that the input limits, |u| ≤ 1 are also required to make this problem well-
posed; otherwise both optimizations would result in the optimal policy using
infinite control input to approach the goal infinitely fast. Besides input limits,
another common approach to limiting the control effort is to add an additional
quadratic cost on the input (or "effort"), e.g. ∫ [u T (t)Ru(t)]dt, R ≻ 0. This could be
added to either formulation above. We will examine many of these formulations in
some detail in the examples worked out at the end of this chapter.
Optimal control has a long history in robotics. For instance, there has been a great
deal of work on the minimum-time problem for pick-and-place robotic manipulators,
and the linear quadratic regulator (LQR) and linear quadratic regulator with
Gaussian noise (LQG) have become essential tools for any practicing controls
engineer. With increasingly powerful computers and algorithms, the popularity of
numerical optimal control has grown at an incredible pace over the last few years.
Your intuition might tell you that the best thing that the brick can do, to reach
the goal in minimum time with limited control input, is to accelerate maximally
towards the goal until reaching a critical point, then hitting the brakes in order to
come to a stop exactly at the goal. This would be called a bang-bang control policy;
these are often optimal for systems with bounded input, and it is in fact optimal for
the double integrator, although we will not prove it until we have developed more
tools.
Let's work out the details of this bang-bang policy. First, we can figure out the
states from which, when the brakes are fully applied, the system comes to rest
precisely at the origin. Let's start with the case where q(0) < 0, and q̇(0) > 0, and
"hitting the brakes" implies that u = −1 . Integrating the equations, we have
q̈(t) = u = −1,
q̇(t) = q̇(0) − t,
q(t) = q(0) + q̇(0) t − (1/2) t².
Substituting t = q̇(0) − q̇ into the solution reveals that the system orbits are
parabolic arcs:

q = −(1/2) q̇² + c₋,   with   c₋ = q(0) + (1/2) q̇²(0).
Perhaps the most important of these orbits are the ones that pass directly through
the origin (e.g., c − = 0). Following our initial logic, if the system is going slower
than this q̇ for any q, then the optimal thing to do is to slam on the accelerator (
u = −sgn(q)). If it's going faster than the q̇ that we've solved for, then still the best
thing to do is to brake; but inevitably the system will overshoot the origin and have
to come back. We can summarize this policy with:
And for completeness, we can compute the optimal time to the goal by solving for
the amount of time required to reach the switching surface plus the amount of
time spent moving along the switching surface to the goal. With a little algebra,
you will find that the time to goal, T (x), is given by
T(x) =
  2√((1/2)q̇² − q) − q̇   for the u = +1 regime,
  0                      for u = 0,
  q̇ + 2√((1/2)q̇² + q)   for u = −1,
plotted here:
Figure 7.5 - Time to the origin using the bang-bang policy
Notice that the function is continuous (though not smooth), even though the policy
is discontinuous.
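Here is a small sketch of the bang-bang policy described above (the specific form of the switching test and the forward-Euler rollout are my assumptions, not code from the text), switching on the parabolic surface q = −(1/2) q̇ |q̇| that passes through the origin.

import numpy as np

def bang_bang(q, qdot):
    s = q + 0.5 * qdot * abs(qdot)    # >0: above the switching curve, <0: below
    if abs(s) < 1e-9 and abs(qdot) < 1e-9:
        return 0.0                    # at the goal: do nothing
    if s > 0:
        return -1.0                   # brake (possibly overshooting and coming back)
    if s < 0:
        return 1.0                    # accelerate maximally towards the switching curve
    return -np.sign(qdot)             # exactly on the switching surface: ride it in

# forward-Euler rollout from x0 = [-1, 0]
q, qdot, h = -1.0, 0.0, 1e-3
for _ in range(5000):
    u = bang_bang(q, qdot)
    q, qdot = q + h * qdot, qdot + h * u

print(q, qdot)                        # should end up close to the origin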
As we begin to develop theoretical and algorithmic tools for optimal control, we will
see that some formulations are much easier to deal with than others. One important
example is the dramatic simplification that can come from formulating objective
functions using additive cost, because they often yield recursive solutions. In the
additive cost formulation, the long-term "score" for a trajectory can be written as
∫₀ᵀ ℓ(x(t), u(t)) dt,
where ℓ() is the instantaneous cost (also referred to as the "running cost"), and T can
be either a finite real number or ∞. We will call a problem specification with a finite
T a "finite-horizon" problem, and T = ∞ an "infinite-horizon" problem. Problems and
solutions for infinite-horizon problems tend to be more elegant, but care is required
to make sure that the integral converges for the optimal controller (typically by
having an achievable goal that allows the robot to accrue zero-cost).
The quadratic cost function suggested in the double integrator example above is
clearly written as an additive cost. At first glance, our minimum-time problem
formulation doesn't appear to be of this form, but we actually can write it as an
additive cost problem using an infinite horizon and the instantaneous cost
ℓ(x, u) = 0 if x = 0, and 1 otherwise.
For systems with continuous states and continuous actions, dynamic programming
is a set of theoretical ideas surrounding additive cost optimal control problems.
For systems with a finite, discrete set of states and a finite, discrete set of actions,
dynamic programming also represents a set of very efficient numerical algorithms
which can compute optimal feedback controllers. Many of you will have learned it
before as a tool for graph search.
Imagine you have a directed graph with states (or nodes) {s 1 , s 2 , . . . } ∈ S and
"actions" associated with edges labeled as {a 1 , a 2 , . . . } ∈ A, as in the following trivial
example:
Figure 7.6 - A trivial example of a directed graph, with nodes s1, s2, s3, s4 and
edges labeled by the actions a1 and a2.
Let us also assume that each edge has an associated weight or cost, using ℓ(s, a) to
denote the cost of being in state s and taking action a. Furthermore we will denote
the transition "dynamics" using

s[n+1] = f(s[n], a[n]).
There are many algorithms for finding (or approximating) the optimal path from a
start to a goal on directed graphs. In dynamic programming, the key insight is that
we can find the shortest path from every node by solving recursively for the optimal
cost-to-go (the cost that will be accumulated when running the optimal controller)
from every node to the goal. One such algorithm starts by initializing an estimate
Ĵ* = 0 for all s_i, then proceeds with an iterative algorithm which sets

Ĵ*(s_i) ⇐ min_{a∈A} [ℓ(s_i, a) + Ĵ*(f(s_i, a))].
In software, J^∗ can be represented as a vector with dimension equal to the number
of discrete states. This algorithm, appropriately known as value iteration, is
guaranteed to converge to the optimal cost-to-go up to a constant factor, J^∗ → J ∗ + c
[3], and in practice does so rapidly. Typically the update is done in batch -- e.g. the
estimate is updated for all i at once -- but the asynchronous version where states are
updated one at a time is also known to converge, so long as every state is eventually
updated infinitely often. Assuming the graph has a goal state with a zero-cost self-
transition, then this cost-to-go function represents the weighted shortest distance to
the goal.
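A minimal sketch of this update in plain Python (the tiny graph, its costs, and its transitions below are made-up examples, not the grid world that follows):

import numpy as np

num_states = 4                         # state 3 is the goal (zero-cost self-transition)
actions = [0, 1]
# hypothetical transition table f[s][a] and one-step cost table ell[s][a]
f   = {0: {0: 1, 1: 2}, 1: {0: 2, 1: 3}, 2: {0: 3, 1: 0}, 3: {0: 3, 1: 3}}
ell = {0: {0: 1, 1: 4}, 1: {0: 1, 1: 1}, 2: {0: 2, 1: 5}, 3: {0: 0, 1: 0}}

J = np.zeros(num_states)               # initialize J-hat = 0 for all states
for _ in range(100):                   # batch value-iteration updates
    J_new = np.array([min(ell[s][a] + J[f[s][a]] for a in actions)
                      for s in range(num_states)])
    if np.allclose(J_new, J):
        break
    J = J_new

policy = [min(actions, key=lambda a: ell[s][a] + J[f[s][a]])
          for s in range(num_states)]
print(J, policy)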
It's a simple algorithm, but playing with an example can help our intuition.
Example 7.3 (Grid World)
Imagine a robot living in a grid (finite state) world. It wants to get to the goal
location, and possibly has to negotiate cells with obstacles. The actions are to move
up, down, left, right, or do nothing. [4]
Figure 7.7 - The one-step cost for the grid-world minimum-time problem. The goal
state has a cost of zero, the obstacles have a cost of 10, and every other state has
a cost of 1.
Open in Colab
Open in Colab
Please do take some time to try different cost functions by editing the code
yourself.
Let's take a minute to appreciate how amazing this is. Our solution to finding the
optimal controller for the double integrator wasn't all that hard, but it required some
mechanical intuition and solutions to differential equations. The resulting policy
was non-trivial -- bang-bang control with a parabolic switching surface. The value
iteration algorithm doesn't use any of this directly -- it's a simple algorithm for graph
search. But remarkably, it can generate effectively the same policy with just a few
moments of computation.
It's important to note that there are some differences between the computed policy
and the optimal policy that we derived, due to discretization errors. We will ask you
to explore these in the problems.
The real value of this numerical solution, however, is that, unlike our analytical solution
for the double integrator, we can apply this same algorithm to any number of
dynamical systems virtually without modification. Let's apply it now to the simple
pendulum, which was intractable analytically.
Open in Colab
Again, you can easily try different cost functions by editing the code yourself.
7.3 CONTINUOUS DYNAMIC PROGRAMMING
I find the graph search algorithm extremely satisfying as a first step, but also
become quickly frustrated by the limitations of the discretization required to use
it. In many cases, we can do better; coming up with algorithms which work more
natively on continuous dynamical systems. We'll explore those extensions in this
section.
It's important to understand that the value iteration equations, equations (1) and (2),
are more than just an algorithm. They are also sufficient conditions for optimality: if
we can produce a J ∗ and π ∗ which satisfy these equations, then π ∗ must be an optimal
controller. There are an analogous set of conditions for the continuous systems. For
a system ẋ = f(x, u) and an infinite-horizon additive cost ∫₀^∞ ℓ(x, u) dt, we have:

0 = min_u [ℓ(x, u) + (∂J*/∂x) f(x, u)],   (3)

π*(x) = argmin_u [ℓ(x, u) + (∂J*/∂x) f(x, u)].   (4)
Equation 3 is known as the Hamilton-Jacobi-Bellman (HJB) equation. (Technically, a
Hamilton-Jacobi equation is a PDE whose time derivative depends on the first-order
partial derivative over state, which we get in the finite-time derivation; Eq. 3 is the
steady-state solution of the Hamilton-Jacobi equation.) [5] gives an informal derivation
of these equations as the limit of a discrete-time approximation of the dynamics, and
also gives the famous "sufficiency theorem". The treatment in [5] is for the finite-horizon
case; I have modified it to one particular form of an infinite-horizon case here.

Theorem 7.1 - HJB Sufficiency Theorem (stated informally)

Consider a system ẋ = f(x, u) and an infinite-horizon additive cost ∫₀^∞ ℓ(x, u) dt,
with f and ℓ continuous functions, and ℓ a strictly-positive-definite function that
obtains zero only when x = x ∗ . Suppose J(x) is a positive-definite solution to
the HJB equation; that is J is continuously differentiable in x and is such that
0 = min_{u∈U} [ℓ(x, u) + (∂J/∂x) f(x, u)],   for all x,
and that π(x) is the minimizer for all x. Then, under some technical conditions
on the existence and boundedness of solutions, we have that J(x) − J(x ∗ ) is the
optimal cost-to-go and π is an optimal policy.
g_V(x, u) = g(x, u) + (dV(x)/dx) f(x, u),   (5)

is non-negative, and achieves zero value whenever u = μ(x);

c. for every x₀ ∈ Ω, the system of equations

ẋ(t) = f(x(t), u(t)),   x(0) = x₀   (i.e., x(t) = x₀ + ∫₀ᵗ f(x(τ), u(τ)) dτ)   (6)

defined on the set of all piecewise continuous pairs (x, u) satisfying (6), achieves its
minimal value of V(x₀) − V(x*) at every piecewise continuous solution of (6), (7).

Proof. First, observe that, for all piecewise continuous (x, u) satisfying (6),

I. due to (5),

∫₀ᵀ g(x(t), u(t)) dt = V(x₀) − V(x(T)) + ∫₀ᵀ g_V(x(t), u(t)) dt;   (8)

Since the upper bound V(x₀) does not depend on T, and g(x(t), u(t)) ≥ 0, we conclude
that

∫₀^∞ g(x(t), u(t)) dt < ∞.
As a tool for verifying optimality, the HJB equations are actually surprisingly easy to
work with: we can verify optimality for an infinite-horizon objective without doing
any integration; we simply have to check a derivative condition on the optimal cost-
to-go function J ∗ . Let's see this play out on the double integrator example.
ℓ(x, u) = q² + q̇² + u².

I claim (without derivation) that the optimal controller for this objective is

π(x) = −q − √3 q̇.
To convince you that this is indeed optimal, I have produced the following cost-to-go
function:

J(x) = √3 q² + 2 q q̇ + √3 q̇².

Taking

∂J/∂q = 2√3 q + 2q̇,   ∂J/∂q̇ = 2q + 2√3 q̇,
we can write

ℓ(x, u) + (∂J/∂x) f(x, u) = q² + q̇² + u² + (2√3 q + 2q̇) q̇ + (2q + 2√3 q̇) u.
This is a convex quadratic function in u, so we can find the minimum with respect
to u by finding where the gradient with respect to u evaluates to zero.
∂/∂u [ℓ(x, u) + (∂J/∂x) f(x, u)] = 2u + 2q + 2√3 q̇.

Setting this to zero yields

u* = −q − √3 q̇,
thereby confirming that our policy π is in fact the minimizer. Substituting u ∗ back
into the HJB reveals that the right side does in fact simplify to zero. I hope you are
convinced!
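If you'd rather let a computer do the algebra, here is a quick symbolic check of the same calculation (this uses sympy, which is my assumption; the text's notebooks use Drake):

import sympy as sp

q, qd, u = sp.symbols('q qdot u', real=True)
J = sp.sqrt(3) * q**2 + 2 * q * qd + sp.sqrt(3) * qd**2      # candidate cost-to-go
ell = q**2 + qd**2 + u**2                                     # running cost

# dJ/dx * f(x, u) for the double integrator, f = [qdot, u]
Jdot = sp.diff(J, q) * qd + sp.diff(J, qd) * u
hjb = ell + Jdot

u_star = sp.solve(sp.diff(hjb, u), u)[0]
print(sp.simplify(u_star))                  # -> -q - sqrt(3)*qdot
print(sp.simplify(hjb.subs(u, u_star)))     # -> 0, so the HJB is satisfied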
Note that evaluating the HJB for the time-to-go of the minimum-time problem for the
double integrator will also reveal that the HJB is satisfied wherever that gradient is
well-defined. This is certainly mounting evidence in support of our bang-bang policy
being optimal, but since ∂J/∂x is not defined everywhere, it does not actually satisfy
the requirements of the sufficiency theorem as stated above. In some sense, the
assumption in the sufficiency theorem that ∂J/∂x is defined everywhere makes it very
weak.
We still face a few barriers to actually using the HJB in an algorithm. The first
barrier is the minimization over u. When the action set was discrete, as in the graph
search version, we could evaluate the one-step cost plus cost-to-go for every possible
action, and then simply take the best. For continuous action spaces, in general we
cannot rely on the strategy of evaluating a finite number of possible u's to find the
minimizer.
All is not lost. In the quadratic cost double integrator example above, we were able
to solve explicitly for the minimizing u in terms of the cost-to-go. It turns out that
this strategy will actually work for a number of the problems we're interested in,
even when the system (which we are given) or cost function (which we are free to
pick, but which should be expressive) gets more complicated.
Recall that I've already tried to convince you that a majority of the systems of
interest are control affine, e.g. I can write

ẋ = f₁(x) + f₂(x) u,

and let us assume an instantaneous cost that is quadratic in u:

ℓ(x, u) = ℓ₁(x) + uᵀ R u,   R = Rᵀ ≻ 0.

In my view, this is not very restrictive - many of the cost functions that I find myself
choosing to write down can be expressed in this form. Given these assumptions, we
can write the HJB as

0 = min_u [ℓ₁(x) + uᵀ R u + (∂J/∂x) [f₁(x) + f₂(x) u]].
Since this is a positive quadratic function in u, if the system does not have any
constraints on u, then we can solve in closed-form for the minimizing u by taking the
gradient of the right-hand side:
∂/∂u = 2uᵀ R + (∂J/∂x) f₂(x) = 0,

and setting it equal to zero to obtain

u* = −(1/2) R⁻¹ f₂ᵀ(x) (∂J/∂x)ᵀ.
If there are linear constraints on the input, such as torque limits, then more
generally this could be solved (at any particular x) as a quadratic program.
What happens in the case where our system is not control affine or if we really do
need to specify an instantaneous cost function on u that is not simply quadratic? If
the goal is to produce an iterative algorithm, like value iteration, then one common
approach is to make a (positive-definite) quadratic approximation in u of the HJB,
and to update that approximation on every iteration of the algorithm. This broad
approach is often referred to as differential dynamic programming (c.f. [6]).
The other major barrier to using the HJB in a value iteration algorithm is that the
estimated optimal cost-to-go function, J^∗ , must somehow be represented with a
finite set of numbers, but we don't yet know anything about the potential form it
must take. In fact, knowing the time-to-goal solution for minimum-time problem with
the double integrator, we see that this function might need to be non-smooth for
even very simple dynamics and objectives.
One natural way to parameterize J^∗ -- a scalar valued-function defined over the state
space -- is to define the values on a mesh. This approach then admits algorithms
with close ties to the very advanced numerical methods used to solve other
partial differential equations (PDEs), such as the ones that appear in finite element
modeling or fluid dynamics. One important difference, however, is that our PDE lives
in the dimension of the state space, while many of the mesh representations from
the other disciplines are optimized for two or three dimensional space. Also, our
PDE may have discontinuities (or at least discontinuous gradients) at locations in the
state space which are not known a priori.
A slightly more general view of the problem would describe the mesh (and the
associated interpolation functions) as just one form of representations for function
approximation. Using a neural network to represent the cost-to-go also falls under
the domain of function approximation, perhaps representing the other extreme in
terms of complexity; using deep networks in approximate dynamic programming is
common in deep reinforcement learning, which we will discuss more later in the
book.
• What if the true cost-to-go function does not live in the prescribed function
class; e.g., there does not exist an α which satisfies the sufficiency
conditions for all x? (This seems very likely to be the case.)
• What update should we apply in the iterative algorithm?
• What can we say about its convergence?
Using the least squares solution in a value iteration update is sometimes referred
to as fitted value iteration, where x k are some number of samples taken from the
continuous space and for discrete-time systems the iterative approximate solution to
J*(x₀) = min_{u[⋅]} ∑_{n=0}^{∞} ℓ(x[n], u[n]),
s.t. x[n+1] = f(x[n], u[n]),   x[0] = x₀
becomes
Since the desired values J_k^d are only an initial guess of the cost-to-go, we will apply
this algorithm iteratively until (hopefully) we achieve some numerical convergence.
Note that the update in (10) is not quite the same as doing least-squares optimization
of

∑_k (Ĵ*_α(x_k) − min_u [ℓ(x_k, u) + Ĵ*_α(f(x_k, u))])²,

because in this equation α has an effect on both occurrences of Ĵ*. In (10), we cut
that dependence by taking J_k^d as fixed desired values; this version performs better
in practice. Like many things, this is an old idea that has been given a new name
in the deep reinforcement learning literature -- people think of the Ĵ*_α on the right-
hand side as being the output from a fixed "target network". For nonlinear function
approximators, the update to α in (10) is often replaced with just one step of gradient
descent.
Open in Colab
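As a rough sketch of the idea (the scalar dynamics, quadratic cost, feature vector, and sampled states below are all illustrative assumptions, not the notebook's code), fitted value iteration with a linear function approximator alternates between computing target values with α held fixed and re-fitting α by least squares:

import numpy as np

def f(x, u):   return 0.9 * x + 0.1 * u      # hypothetical stable dynamics
def ell(x, u): return x**2 + u**2            # quadratic running cost

xs = np.linspace(-2, 2, 31)                  # sample states x_k
us = np.linspace(-1, 1, 21)                  # candidate actions for the inner min
def phi(x): return np.stack([x**2, np.abs(x), np.ones_like(x)], axis=-1)

alpha = np.zeros(3)
for _ in range(200):
    # desired values J_k^d = min_u [ell(x_k, u) + J_alpha(f(x_k, u))], alpha held fixed
    Jd = np.array([min(ell(x, u) + phi(f(x, u)) @ alpha for u in us) for x in xs])
    # least-squares fit of the new alpha to the fixed desired values
    alpha, *_ = np.linalg.lstsq(phi(xs), Jd, rcond=None)

print(alpha)                                  # coefficients of the fitted cost-to-go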
In general, the convergence and accuracy guarantees for value iteration with
generic function approximators are quite weak. But we do have some results for the
special case of linear function approximators. A linear function approximator takes
the form:

Ĵ*_α(x) = ∑_i α_i ψ_i(x) = ψᵀ(x) α.

Mesh-based value iteration is an important special case of this form, but must deal
with the fact that f(x_k, u) might not result in a next state that is directly at a mesh
point. Most interpolation schemes for a mesh can be written as some weighted
combination of the values at nearby mesh points, e.g.

Ĵ*(x) = ∑_i β_i(x) α_i,   ∑_i β_i(x) = 1,

with β_i the relative weight of the ith mesh point.
barycentric interpolation[7]. Taking α i = J^∗ (x i ), the cost-to-go estimate at mesh
point i, we can see that this is precisely an important special case of fitted value
iteration with linear function approximation. Furthermore, assuming β i (x i ) = 1, (e.g.,
the only point contributing to the interpolation at a mesh point is the value at the
mesh point itself), the update in Eq. (11) is precisely the least-squares update (and it
achieves zero error). This is the representation used in the value iteration examples
that you've already experimented with above.
Continuous-time systems
For solutions to systems with continuous-time dynamics, I have to uncover one of
the details that I've so far been hiding to keep the notation simpler. Let us consider
a problem with a finite-horizon:
min_{u[⋅]} ∑_{n=0}^{N} ℓ(x[n], u[n]),
s.t. x[n+1] = f(x[n], u[n]),   x[0] = x₀
In fact, the way that we compute this is by solving the time-varying cost-to-go
function backwards in time:

J*(x, n) = min_u [ℓ(x, u) + J*(f(x, u), n + 1)].

The convergence of the value iteration update is equivalent to solving this time-
varying cost-to-go backwards in time until it reaches a steady-state solution (the
infinite-horizon solution). This explains why value iteration only converges if the
optimal cost-to-go is bounded.
Now let's consider the continuous-time version. Again, we have a time-varying cost-
to-go, J*(x, t). Now

dJ*/dt = (∂J*/∂x) f(x, u) + ∂J*/∂t,

and our sufficiency condition is

0 = min_u [ℓ(x, u) + (∂J*/∂x) f(x, u) + ∂J*/∂t].

But since ∂J*/∂t doesn't depend on u, we can pull it out of the min and write the (true)
HJB:

−∂J*/∂t = min_u [ℓ(x, u) + (∂J*/∂x) f(x, u)].
The standard numerical recipe [8] for solving this is to approximate J^∗ (x, t) on
a mesh and then integrate the equations backwards in time (until convergence,
if the goal is to find the infinite-horizon solution). If, for mesh point x i we have
α i (t) = J^∗ (x i , t), then:
−α̇_i(t) = min_u [ℓ(x_i, u) + (∂Ĵ*(x_i, t)/∂x) f(x_i, u)],
Probably the most visible and consistent campaign using numerical HJB solutions in
applied control (at least in robotics) has come from Claire Tomlin's group at
Berkeley. Their work leverages Ian Mitchell's Level Set Toolbox, which solves the
Hamilton-Jacobi PDEs on a Cartesian mesh using the technique cartooned above,
and even includes the minimum-time problem for the double integrator as a tutorial
example[9].
7.4 EXTENSIONS
There are many many nice extensions to the basic formulations that we've presented
so far. I'll try to list a few of the most important ones here. I've also had a number of
students in this course explore very interesting extensions; for example [10] looked at
imposing a low-rank structure on the (discrete-state) value function using ideas from
matrix completion to obtain good estimates despite updating only a small fraction of
the states.
One of the most amazing features of the dynamic programming, additive cost
approach to optimal control is the relative ease with which it extends to optimizing
stochastic systems.
For discrete systems, we can generalize our dynamics on a graph by adding action-
dependent transition probabilities to the edges. This new dynamical system is known
as a Markov Decision Process (MDP), and we write the dynamics completely in terms
of the transition probabilities

Pr(s[n+1] = s′ | s[n] = s, a[n] = a).
For discrete systems, this is simply a big lookup table. The cost that we incur for
any execution of system is now a random variable, and so we formulate the goal of
control as optimizing the expected cost, e.g.
J*(s[0]) = min_{a[⋅]} E [ ∑_{n=0}^{∞} ℓ(s[n], a[n]) ].   (12)
Note that there are many other potential objectives, such as minimizing the worst-
case error, but the expected cost is special because it preserves the dynamic
programming recursion:

J*(s_i) = min_a [ℓ(s_i, a) + ∑_j Pr(s_j | s_i, a) J*(s_j)].

If we simply use this expected cost-to-go in the value-iteration update, then this
algorithm has the same strong convergence guarantees of its counterpart for
deterministic systems. And it is essentially no more expensive to compute!
This is a preview of a much more general toolkit that we develop later for stochastic
and robust control.
For discrete MDPs, value iteration is a magical algorithm because it is simple, but
known to converge to the global optimal, J ∗ . However, other important algorithms
are known; one of the most important is a solution approach using linear
programming. This formulation provides an alternative view, but may also be more
generalizable and even more efficient for some instances of the problem.
Recall the optimality conditions from Eq. (1). If we describe the cost-to-go function
as a vector, J_i = J(s_i), then these optimality conditions can be rewritten in vector
form as

J = min_a [ℓ(a) + T(a) J],   (13)

where ℓ_i(a) = ℓ(s_i, a) is the cost vector, and T_{i,j}(a) = Pr(s_j | s_i, a) is the transition
probability matrix. Let us take J as the vector of decision variables, and replace Eq.
(13) with the constraints:

∀a,   J ≤ ℓ(a) + T(a) J.
Clearly, for finite a, this is a finite list of linear constraints, and for any vector J
satisfying these constraints we have J ≤ J* (elementwise). Now write the linear
program:

maximize_J   cᵀ J,
subject to   ∀a, J ≤ ℓ(a) + T(a) J,
Perhaps even more interesting is that this approach can be generalized to linear
function approximators. Taking a vector form of my notation above, now we have a
matrix of features with Ψ_{i,j} = ψ_j(x_i) and we can write the LP
This time the solution is not necessarily optimal, because Ψα ∗ only approximates J ∗
, and the relative values of the elements of c (called the "state-relevance weights")
can determine the relative tightness of the approximation for different features [11].
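To make this concrete, here is a small sketch that solves the LP for a tiny made-up MDP with scipy.optimize.linprog (the solver choice, the toy transition/cost tables, and the added discount factor -- used here only so the LP stays bounded with an absorbing zero-cost state -- are all my assumptions, not part of the text's formulation):

import numpy as np
from scipy.optimize import linprog

n = 3
# hypothetical transition matrices T[a][i, j] = Pr(s_j | s_i, a) and costs l[a][i]
T = [np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]),
     np.array([[0.2, 0.8, 0.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]])]
l = [np.array([2.0, 1.0, 0.0]), np.array([1.0, 3.0, 0.0])]
gamma = 0.95                              # discount factor (an assumption, see above)

# maximize c^T J  <=>  minimize -c^T J,  s.t.  (I - gamma*T(a)) J <= l(a)  for all a
c = np.ones(n)                            # state-relevance weights
A_ub = np.vstack([np.eye(n) - gamma * T[a] for a in range(2)])
b_ub = np.concatenate([l[a] for a in range(2)])
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
print(res.x)                              # the optimal (discounted) cost-to-go vector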
7.5 EXERCISES
which finds the optimal policy for the infinite-horizon cost function
∑_{n=0}^{∞} ℓ(s[n], a[n]). If J* is the true optimal cost-to-go, show that any solution
Ĵ* = J* + c, with c a constant, is also a fixed point of the value iteration
update. Is the controller still optimal, even if the estimated cost-to-go function is
off by a constant factor?
Surprisingly, adding a discount factor can help with this. Consider the infinite-
horizon discounted cost: ∑_{n=0}^{∞} γⁿ ℓ(s[n], a[n]), where 0 < γ ≤ 1 is the discount
factor.
For simplicity, assume that there exists a state s i that is a zero-cost fixed point
under the optimal policy; e.g. ℓ(s i , π ∗ (s i )) = 0 and f(s i , π ∗ (s i )) = s i . Is J^∗ = J ∗ + c
still a fixed point of the value iteration update when γ < 1? Show your work.
The figure above shows an autonomous car moving at constant velocity v > 0 on
a straight road. Let x be the (longitudinal) position of the car along the road, y its
(transversal) distance from the centerline, and θ the angle between the centerline
and the direction of motion. The only control action is the steering velocity u,
which is constrained in the interval [u min , u max ] (where u min < 0 and u max > 0). We
describe the car dynamics with the simple kinematic model
ẋ = v cos θ,
ẏ = v sin θ,
θ̇ = u.
Let x = [x, y, θ] T be the state vector. To optimize the car trajectory we consider a
quadratic objective function
J = ∫₀^∞ [xᵀ(t) Q x(t) + R u²(t)] dt,
a. Suppose our only goal is to keep the distance y between the car and
the centerline as small as possible, as fast as possible, without worrying
about anything else. What would be your choice for Q and R?
b. How would the behavior of the car change if you were to multiply
the weights Q and R from the previous point by an arbitrary positive
coefficient α?
c. The cost function from point (a) might easily lead to excessively sharp
turns. Which entry of Q or R would you increase to mitigate this issue?
d. Country roads are more slippery on the sides than in the center. Is this
class of objective functions (quadratic objectives) rich enough to include
a penalty on sharp turns that increases with the distance of the car from
the centerline?
e. With this dynamics model and this objective function, would you ever
choose a weight matrix Q which is strictly positive definite (independent
of the task you want the car to accomplish)? Why?
a. What are the states x and the inputs u for which the running cost
ℓ(x, u) = x² + (u² − 1)²
is equal to zero?
b. Consider now two control signals u τ1 (t) and u τ2 (t), with τ 1 = τ 2 /2. Which
one of the two incurs the lower cost J ? (Hint: start by sketching how the
system state x(t) evolves under these two control signals.)
c. What happens to the cost J when τ gets smaller and smaller? What does
the optimal control signal look like? Could you implement it with a finite-
bandwidth controller?
ẋ = x + u,
(x[n+1] − x[n]) / h = x[n] + u[n],
and the objective as
h ∑_{n=0}^{∞} (3x²[n] + u²[n]).
One of the following expressions is the correct cost-to-go J h∗ (x) for this
discretized problem. Can you identify it without solving the discrete-
time HJB equation? Explain your answer.
i. J*_h(x) = S_h x⁴ with S_h = 3 + h + √(6h²).
ii. J*_h(x) = S_h x² with S_h = 1 + 2h + 2√(h² + h + 1).
iii. J*_h(x) = S_h x² with S_h = 3 + 2h² + √(h² + h + 2).
x t+1 = f(x t , u t )
P (x t+1 |x t , u t ) = f(x t , u t )
REFERENCES
1. Dimitri P. Bertsekas, "Dynamic Programming and Optimal Control", Athena
Scientific, 2000.
8. Stanley Osher and Ronald Fedkiw, "Level Set Methods and Dynamic
Implicit Surfaces", Springer, 2003.
10. Yuzhe Yang and Guo Zhang and Zhi Xu and Dina Katabi, "Harnessing
Structures for Value-Based Planning and Reinforcement Learning",
International Conference on Learning Representations, 2020.
11. Daniela Pucci de Farias, "The Linear Programming Approach to
Approximate Dynamic Programming: Theory and Application", PhD thesis,
Stanford University, June, 2002.
CHAPTER 8
Linear Quadratic Regulators
The linear quadratic regulator is likely the most important and influential result in
optimal control theory to date. In this chapter we will derive the basic algorithm and
a variety of useful extensions.
ẋ = Ax + Bu,
Our goal is to find the optimal cost-to-go function J ∗ (x) which satisfies the HJB:
∀x,   0 = min_u [xᵀ Q x + uᵀ R u + (∂J*/∂x)(A x + B u)].
There is one important step here -- it is well known that for this problem the optimal
cost-to-go function is quadratic. This is easy to verify. Let us choose the form:
J*(x) = xᵀ S x,   S = Sᵀ ⪰ 0,
∂J*/∂x = 2 xᵀ S.
Since we have guaranteed, by construction, that the terms inside the min are
quadratic and convex (because R ≻ 0), we can take the minimum explicitly by finding
the solution where the gradient of those terms vanishes:
∂/∂u = 2 uᵀ R + 2 xᵀ S B = 0.
This yields the optimal policy

u* = π*(x) = −R⁻¹ Bᵀ S x = −K x.

Inserting this back into the HJB gives

0 = xᵀ [Q − S B R⁻¹ Bᵀ S + 2 S A] x.

All of the terms here are symmetric except for the 2SA, but since xᵀ S A x = xᵀ Aᵀ S x,
we can write

0 = xᵀ [Q − S B R⁻¹ Bᵀ S + S A + Aᵀ S] x,

and since this condition must hold for all x, it is sufficient to consider the matrix
equation

0 = S A + Aᵀ S − S B R⁻¹ Bᵀ S + Q.
It is worth examining the form of the optimal policy more closely. Since the value
function represents cost-to-go, it would be sensible to move down this landscape as
quickly as possible. Indeed, −Sx is in the direction of steepest descent of the value
function. However, not all directions are possible to achieve in state-space. −B T Sx
represents precisely the projection of the steepest descent onto the control space,
and is the steepest descent achievable with the control inputs u. Finally, the pre-
scaling by the matrix R −1 biases the direction of descent to account for relative
weightings that we have placed on the different control inputs. Note that although
this interpretation is straight-forward, the slope that we are descending (in the value
function, S) is a complicated function of the dynamics and cost.
examples/double_integrator/lqr.py
K = [1, √3],   S = [[√3, 1], [1, √3]].
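As a quick numerical check (a sketch using scipy rather than the Drake LinearQuadraticRegulator call used in the text; Q = I and R = 1 are my assumed weights, chosen because they reproduce the K and S above):

import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0., 1.], [0., 0.]])    # double integrator: qddot = u
B = np.array([[0.], [1.]])
Q = np.eye(2)
R = np.array([[1.]])

S = solve_continuous_are(A, B, Q, R)
K = np.linalg.inv(R) @ B.T @ S
print(S)    # [[sqrt(3), 1], [1, sqrt(3)]]
print(K)    # [[1, sqrt(3)]]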
8.1.1 Local stabilization of nonlinear systems
x̄̇ = ẋ = f(x, u),
x̄̇ ≈ f(x₀, u₀) + (∂f(x₀, u₀)/∂x)(x − x₀) + (∂f(x₀, u₀)/∂u)(u − u₀) = A x̄ + B ū.
Similarly, we can define a quadratic cost function in the error coordinates, or take
a (positive-definite) second-order approximation of a nonlinear cost function about
the operating point (linear and constant terms in the cost function can be easily
incorporated into the derivation by parameterizing a full quadratic form for J ∗ , as
seen in the Linear Quadratic Tracking derivation below).
u ∗ = u 0 − K(x − x 0 ).
Open in Colab
I find it very compelling that the same derivation (and effectively identical code)
can stabilize such a diversity of systems!
∀x, ∀t ∈ [t₀, t_f],   0 = min_u [ℓ(x, u) + (∂J*/∂x) f(x, u) + ∂J*/∂t].

h(x) = xᵀ Q_f x,   Q_f = Q_fᵀ ⪰ 0,
ℓ(x, u) = xᵀ Q x + uᵀ R u,   Q = Qᵀ ⪰ 0, R = Rᵀ ≻ 0,

0 = min_u [xᵀ Q x + uᵀ R u + (∂J*/∂x)(A x + B u) + ∂J*/∂t].
Due to the positive definite quadratic form on u, we can find the minimum by setting
the gradient to zero:
∂/∂u = 2 uᵀ R + (∂J*/∂x) B = 0,

u* = π*(x, t) = −(1/2) R⁻¹ Bᵀ (∂J*/∂x)ᵀ.
In order to proceed, we need to investigate a particular form for the cost-to-go
function, J*(x, t). Let's try a solution of the form:

J*(x, t) = xᵀ S(t) x,   S(t) = Sᵀ(t) ≻ 0,

so that

∂J*/∂x = 2 xᵀ S(t),   ∂J*/∂t = xᵀ Ṡ(t) x,

and therefore

u* = π*(x, t) = −R⁻¹ Bᵀ S(t) x,

0 = xᵀ [Q − S(t) B R⁻¹ Bᵀ S(t) + S(t) A + Aᵀ S(t) + Ṡ(t)] x.
Therefore, S(t) must satisfy the condition (known as the continuous-time differential
Riccati equation):

−Ṡ(t) = S(t) A + Aᵀ S(t) − S(t) B R⁻¹ Bᵀ S(t) + Q,

with the final condition

S(t_f) = Q_f.
Since we were able to satisfy the HJB with the minimizing policy, we have met
the sufficiency condition, and have found the optimal policy and optimal cost-to-go
function.
Note that the infinite-horizon LQR solution described in the prequel is exactly the
steady-state solution of this equation, defined by Ṡ(t) = 0. Indeed, for controllable
systems this equation is stable (backwards in time), and as expected the finite-
horizon solution converges on the infinite-horizon solution as the horizon time limits
to infinity.
ẋ = A(t)x + B(t)u.
Similarly, the cost functions Q and R can also be time-varying. This is quite
surprising, as the class of time-varying linear systems is a quite general class of
systems. It requires essentially no assumptions on how the time-dependence enters,
except perhaps that if A or B is discontinuous in time then one would have to use
the proper techniques to accurately integrate the differential equation.
Let us assume that we have a nominal trajectory, x₀(t), u₀(t), defined for t ∈ [t₁, t₂].
Similar to the time-invariant analysis, we begin by defining a local coordinate system
relative to the trajectory:

x̄(t) = x(t) − x₀(t),   ū(t) = u(t) − u₀(t).

Now we have

x̄̇ = ẋ − ẋ₀ = f(x, u) − f(x₀, u₀),
x̄̇ ≈ f(x₀, u₀) + (∂f(x₀, u₀)/∂x)(x − x₀) + (∂f(x₀, u₀)/∂u)(u − u₀) − f(x₀, u₀) = A(t) x̄ + B(t) ū.
This is very similar to using LQR to stabilize a fixed-point, but with some important
differences. First, the linearization is time-varying. Second, our linearization is
valid for any state along a feasible trajectory (not just fixed-points), because the
coordinate system is moving along with the trajectory.
Similarly, we can define a quadratic cost function in the error coordinates, or take a
(positive-definite) second-order approximation of a nonlinear cost function along the
trajectory (linear and constant terms in the cost function can be easily incorporated
into the derivation by parameterizing a full quadratic form for J ∗ , as seen in the
Linear Quadratic Tracking derivation below).
Remember that stability is a statement about what happens as time goes to infinity.
In order to talk about stabilizing a trajectory, the trajectory must be defined for all
t ∈ [t 1 , ∞). This can be accomplished by considering a finite-time trajectory which
terminates at a stabilizable fixed-point at a time t 2 ≥ t 1 , and remains at the fixed
point for t ≥ t 2 . In this case, the finite-horizon Riccati equation is initialized with the
infinite-horizon LQR solution: S(t 2 ) = S ∞ , and solved backwards in time from t 2 to t 1
for the remainder of the trajectory. And now we can say that we have stabilized the
trajectory!
For completeness, we consider a slightly more general form of the linear quadratic
regulator. The standard LQR derivation attempts to drive the system to zero.
Consider now the problem:
ẋ = A x + B u,
h(x) = (x − x_d(t_f))ᵀ Q_f (x − x_d(t_f)),   Q_f = Q_fᵀ ⪰ 0,
ℓ(x, u, t) = (x − x_d(t))ᵀ Q (x − x_d(t)) + (u − u_d(t))ᵀ R (u − u_d(t)),   Q = Qᵀ ⪰ 0, R = Rᵀ ≻ 0.
Now let us parameterize the cost-to-go with a full quadratic form,

J*(x, t) = xᵀ S_xx(t) x + 2 xᵀ s_x(t) + s₀(t),

which gives

∂J*/∂x = 2 xᵀ S_xx(t) + 2 s_xᵀ(t),   ∂J*/∂t = xᵀ Ṡ_xx(t) x + 2 xᵀ ṡ_x(t) + ṡ₀(t).
Using the HJB,

0 = min_u [(x − x_d(t))ᵀ Q (x − x_d(t)) + (u − u_d(t))ᵀ R (u − u_d(t)) + (∂J*/∂x)(A x + B u) + ∂J*/∂t],

we have

∂/∂u = 2 (u − u_d(t))ᵀ R + (2 xᵀ S_xx(t) + 2 s_xᵀ(t)) B = 0,

u*(t) = u_d(t) − R⁻¹ Bᵀ [S_xx(t) x + s_x(t)].
The final conditions are

S_xx(t_f) = Q_f,
s_x(t_f) = −Q_f x_d(t_f),
s₀(t_f) = x_dᵀ(t_f) Q_f x_d(t_f).
Notice that the solution for S xx is the same as the simpler LQR derivation, and
is symmetric (as we assumed). Note also that s 0 (t) has no effect on control (even
indirectly), and so can often be ignored.
A quick observation about the quadratic form, which might be helpful in debugging.
We know that J(x, t) must be uniformly positive. This is true iff S_xx ≻ 0 and
s₀ > s_xᵀ S_xx⁻¹ s_x, which comes from evaluating the function at x_min(t) defined by

[∂J*(x, t)/∂x]_{x = x_min(t)} = 0.
The finite-horizon LQR formulation can be used to impose a strict final boundary
value condition by setting an infinite Q_f. However, integrating the Riccati equation
backwards from an infinite initial condition isn't very practical. To get around this,
let us consider solving for P(t) = S(t)⁻¹. Using the matrix relation
dS⁻¹/dt = −S⁻¹ (dS/dt) S⁻¹, we have:

−Ṗ(t) = −P(t) Q P(t) + B R⁻¹ Bᵀ − A P(t) − P(t) Aᵀ,

P(t_f) = 0.
Essentially all of the results above have a natural correlate for discrete-time systems.
What's more, the discrete time versions tend to be simpler to think about in the
model-predictive control (MPC) setting that we'll be discussing below and in the next
chapters.
For the discrete-time system x[n+1] = A x[n] + B u[n] with cost
∑_n (xᵀ[n] Q x[n] + uᵀ[n] R u[n]), if we take J*(x, n) = xᵀ S[n] x, then we have

u*[n] = −(R + Bᵀ S[n+1] B)⁻¹ Bᵀ S[n+1] A x[n],

yielding

S[n] = Q + Aᵀ S[n+1] A − Aᵀ S[n+1] B (R + Bᵀ S[n+1] B)⁻¹ Bᵀ S[n+1] A,

which is the famous Riccati difference equation. The infinite-horizon LQR solution is
given by the (positive-definite) fixed-point of this equation:

S = Q + Aᵀ S A − Aᵀ S B (R + Bᵀ S B)⁻¹ Bᵀ S A.
Like in the continuous time case, this equation is so important that we have special
numerical recipes for solving this discrete-time algebraic Riccati equation (DARE).
DRAKE delegates to these numerical methods automatically when you evaluate the
LinearQuadraticRegulator method on a system that has only discrete state and a
single periodic timestep.
Open in Colab
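Here is a short sketch (using scipy, an assumption; the text would use Drake's LinearQuadraticRegulator) that iterates the Riccati difference equation above and confirms that it converges to the DARE fixed point:

import numpy as np
from scipy.linalg import solve_discrete_are

h = 0.1                                     # hypothetical discretization time step
A = np.array([[1., h], [0., 1.]])           # discretized double integrator
B = np.array([[0.], [h]])
Q, R = np.eye(2), np.array([[1.]])

S = Q.copy()
for _ in range(2000):                       # backwards Riccati recursion
    S = Q + A.T @ S @ A - A.T @ S @ B @ np.linalg.inv(R + B.T @ S @ B) @ B.T @ S @ A

print(np.allclose(S, solve_discrete_are(A, B, Q, R)))   # True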
8.3.2 LQR with input and state constraints
One important case that does have closed-form solutions is LQR with linear equality
constraints (c.f. [3], section IIIb). This is particularly relevant in the case of
stabilizing robots with kinematic constraints such as a closed kinematic chain, which
appears in four-bar linkages or even for the linearization of a biped robot with two
feet on the ground.
One can also design the LQR gains using linear matrix inequalities (LMIs). I will
defer the derivation until we cover the policy gradient view of LQR, because the
LMI formulation is based on a change of variables from the basic policy evaluation
criterion. If you want to look ahead, you can find that formulation here.
Solving the algebraic Riccati equation is still the preferred way of computing the
LQR solution. But it is helpful to know that one could also compute it with convex
optimization. In addition to deepening our understanding, this can be useful for
generalizing the basic LQR solution (e.g. for robust stabilization) or to solve for the
LQR gains jointly as part of a bigger optimization.
We can also obtain the solution to the discrete-time finite-horizon (including the
time-varying or tracking variants) LQR problem using optimization -- in this case it
actually reduces to a simple least-squares problem. The presentation in this section
can be viewed as a simple implementation of the Youla parameterization (sometimes
called "Q-parameterization") from controls. Small variations on the formulation here
play an important role in the minimax variants of LQR (which optimize a worst-case
performance), which we will discuss in the robust control chapter (e.g. [4, 5]).
First, let us appreciate that the default parameterization is not convex. Given
min ∑_{n=0}^{N−1} (xᵀ[n] Q x[n] + uᵀ[n] R u[n]),   Q = Qᵀ ⪰ 0, R = Rᵀ ≻ 0,
subject to   x[n+1] = A x[n] + B u[n],
             x[0] = x₀,
if we wish to search over controllers of the form
u[n] = K n x[n],
then we have
x[1] = A x₀ + B K₀ x₀,
x[2] = A (A + B K₀) x₀ + B K₁ (A + B K₀) x₀,
x[n] = ( ∏_{i=0}^{n−1} (A + B K_i) ) x₀.
As you can see, the x[n] terms in the cost function include our decision variables
multiplied together -- resulting in a non-convex objective. The trick is to re-
parameterize the decision variables, and write the feedback in the form:

u[n] = K̃_n x₀,

leading to

x[1] = A x₀ + B K̃₀ x₀,
x[2] = A (A + B K̃₀) x₀ + B K̃₁ x₀,
x[n] = ( Aⁿ + ∑_{i=0}^{n−1} A^{n−i−1} B K̃_i ) x₀.

Now all of the decision variables, K̃_i, appear linearly in the solution to x[n] and
therefore (convex) quadratically in the objective.
We still have an objective function that depends on x₀, but we would like to find the
optimal K̃_i for all x₀. To achieve this let us evaluate the optimality conditions of this
least-squares problem, starting by taking the gradient of the objective with respect
to K̃_i, which is:

x₀ x₀ᵀ ( K̃_iᵀ ( R + ∑_{m=i+1}^{N−1} Bᵀ (A^{m−i−1})ᵀ Q A^{m−i−1} B ) + ∑_{m=i+1}^{N−1} (A^m)ᵀ Q A^{m−i−1} B ).
We can satisfy this optimality condition for all x₀ by solving the linear matrix
equation:

K̃_iᵀ ( R + ∑_{m=i+1}^{N−1} Bᵀ (A^{m−i−1})ᵀ Q A^{m−i−1} B ) + ∑_{m=i+1}^{N−1} (A^m)ᵀ Q A^{m−i−1} B = 0.
We can always solve for K̃_i since it's multiplied by a (symmetric) positive definite
matrix (it is the sum of a positive definite matrix and many positive semi-definite
matrices), which is always invertible.
If you need to recover the original K_i parameters, you can extract them recursively
with

K₀ = K̃₀,
K_n = K̃_n ( ∏_{i=0}^{n−1} (A + B K_i) )⁻¹,   0 < n ≤ N − 1.
But often this is not actually necessary. In some applications it's enough to know
the performance cost under LQR control, or to handle the response to disturbances
explicitly with the disturbance-based feedback (which I've already promised for the
robust control chapter). After all, the problem formulation that we've written here,
which makes no mention of disturbances, assumes the model is perfect and the
controls K̃_n x₀ are just as suitable for deployment as K_n x[n].
"System-Level Synthesis" (SLS) is the name for an important and slightly different
approach, where one optimizes the closed-loop response directly[6]. Although SLS is
a very general tool, for the particular formulation we are considering here it reduces
to creating additional decision variables Φ_i, such that
x[n] = Φ_n x[0],
Once again, the algorithms presented here are not as efficient as solving the Riccati
equation if we only want the solution to the simple case, but they become very
powerful if we want to combine the LQR synthesis with other objectives/constraints.
For instance, if we want to add some sparsity constraints (e.g. enforcing that some
elements of K̃_i are zero), then we could solve the quadratic optimization subject to
linear equality constraints [7].
8.4 EXERCISES
8.5 NOTES
For completeness, I've included here the derivation for continuous-time finite-
horizon LQR with all of the bells and whistles.
Observe that our LQR "optimal tracking" derivation fits in this form, as we can
always write
by taking
Q_xx = Q_t,   q_x = −Q_t x_d − N_t u_d,   q₀ = x_dᵀ Q_t x_d + 2 x_dᵀ N_t u_d,
R_uu = R_t,   r_u = −R_t u_d − N_tᵀ x_d,   r₀ = u_dᵀ R_t u_d,   N = N_t.
Of course, we can also add a quadratic final cost with Q f . Let's search for a positive
quadratic, time-varying cost-to-go function of the form:
J(t, x) = [x; 1]ᵀ S(t) [x; 1],   S(t) = [ S_xx(t)  s_x(t) ; s_xᵀ(t)  s₀(t) ],   S_xx(t) ≻ 0,

∂J/∂x = 2 xᵀ S_xx + 2 s_xᵀ,   ∂J/∂t = [x; 1]ᵀ Ṡ(t) [x; 1].
Writing out the HJB:
min_u [ ℓ(x, u) + (∂J/∂x) [A(t) x + B(t) u + c(t)] + ∂J/∂t ] = 0,

∂/∂u = 2 uᵀ R_uu + 2 r_uᵀ + 2 xᵀ N + (2 xᵀ S_xx + 2 s_xᵀ) B = 0,

u* = −R_uu⁻¹ [ N + S_xx B ; r_uᵀ + s_xᵀ B ]ᵀ [x; 1] = −K(t) [x; 1] = −K_x(t) x − k₀(t).
Inserting this back into the HJB gives us the updated Riccati differential equation.
Since this must hold for all x, we can collect the quadratic, linear, and offset terms
and set them each individually equal to zero, yielding:
−Ṡ_xx = Q_xx − (N + S_xx B) R_uu⁻¹ (N + S_xx B)ᵀ + S_xx A + Aᵀ S_xx,
−ṡ_x = q_x − (N + S_xx B) R_uu⁻¹ (r_u + Bᵀ s_x) + Aᵀ s_x + S_xx c,
−ṡ₀ = q₀ + r₀ − (r_u + Bᵀ s_x)ᵀ R_uu⁻¹ (r_u + Bᵀ s_x) + 2 s_xᵀ c,
REFERENCES
1. A. Alessio and A. Bemporad, "A survey on explicit model predictive control",
Int. Workshop on Assessment and Future Directions of Nonlinear Model
Predictive Control, 2009.
2. Tobia Marcucci and Robin Deits and Marco Gabiccini and Antonio Bicchi
and Russ Tedrake, "Approximate Hybrid Model Predictive Control for Multi-
Contact Push Recovery in Complex Environments", Humanoid Robots
(Humanoids), 2017 IEEE-RAS 17th International Conference on, 2017.
[ link ]
3. Michael Posa and Scott Kuindersma and Russ Tedrake, "Optimization and
stabilization of trajectories for constrained dynamical systems",
Proceedings of the International Conference on Robotics and Automation
(ICRA), pp. 1366-1373, May, 2016. [ link ]
6. James Anderson and John C. Doyle and Steven Low and Nikolai Matni,
"System Level Synthesis", arXiv:1904.01634 [cs, math], April 2019.
7. Yuh-Shyang Wang and Nikolai Matni and John C. Doyle, "Localized LQR
Optimal Control", arXiv:1409.6404 [cs, math], September 2014.
Lyapunov Analysis
Optimal control provides a powerful framework for formulating control problems
using the language of optimization. But solving optimal control problems for
nonlinear systems is hard! In many cases, we don't really care about finding the
optimal controller, but would be satisfied with any controller that is guaranteed
to accomplish the specified task. We still often formulate these problems
using computational tools from optimization, and in this chapter we'll learn about
tools that can provide guaranteed control solutions for systems that are beyond the
complexity for which we can find the optimal feedback.
There are many excellent books on Lyapunov analysis; for instance [1] is an excellent
and very readable reference and [2] can provide a rigorous treatment. In this
chapter I will summarize (without proof) some of the key theorems from Lyapunov
analysis, but then will also introduce a number of numerical algorithms... many of
which are new enough that they have not yet appeared in any mainstream textbooks.
which I've written with the damping on the right-hand side to remind us that it is
an external torque that we've modeled.
One route forward is to look at the total system energy (kinetic +
potential), which we can write down as
$$E(\theta, \dot\theta) = \frac{1}{2} m l^2 \dot\theta^2 - m g l \cos\theta.$$
Recall that the contours of this energy function are the orbits of the undamped
pendulum.
A natural route to proving the stability of the downward fixed points is by arguing
that energy decreases for the damped pendulum (with b > 0) and so the system
will eventually come to rest at the minimum energy, E = −mgl, which happens at
θ = 2πk. Let's make that argument slightly more precise.
$$\dot{V}(x) = \frac{\partial V}{\partial x} f(x) < 0, \quad \forall x \in \mathcal{B} \setminus \{0\},$$
then the origin is (locally) asymptotically stable. And if we have
$$\dot{V}(x) = \frac{\partial V}{\partial x} f(x) \le -\alpha V(x), \quad \forall x \in \mathcal{B} \setminus \{0\},$$
for some α > 0, then the origin is (locally) exponentially stable.
Note that for the sequel we will use the notation V ≻ 0 to denote a positive-definite
function, meaning that V (0) = 0 and V (x) > 0 for all x ≠ 0 (and also V ⪰ 0 for positive
semi-definite, V ≺ 0 for negative-definite functions).
The intuition here is exactly the same as for the energy argument we made in the
pendulum example: since V˙ (x) is always zero or negative, the value of V (x) will only
get smaller (or stay the same) as time progresses. Inside the subset B, for every ϵ
-ball, I can choose a δ such that |x(0)| 2 < δ ⇒ |x(t)| 2 < ϵ, ∀t by choosing δ sufficiently
small so that the sublevel set of V (x) for the largest value that V (x) takes in the δ
ball is completely contained in the ϵ ball. Since the value of V can only get smaller
(or stay constant) in time, this gives stability i.s.L.. If V˙ is strictly negative away
from the origin, then it must eventually get to the origin (asymptotic stability). The
exponential condition is implied by the fact that ∀t > 0, V (x(t)) ≤ V (x(0))e −αt .
Notice that the system analyzed above, ẋ = f(x), did not have any control inputs.
Therefore, Lyapunov analysis is used to study either the passive dynamics of a
system or the dynamics of a closed-loop system (system + control in feedback). We
will see generalizations of the Lyapunov functions to input-output systems later in
the text.
9.1.1 Global Stability
The notion of a fixed point being stable i.s.L. is inherently a local notion of stability
(defined with ϵ- and δ- balls around the origin), but the notions of asymptotic and
exponential stability can be applied globally. The Lyapunov theorems work for this
case, too, with only minor modification.
$$V(x) \succ 0, \qquad \dot{V}(x) = \frac{\partial V}{\partial x} f(x) \prec 0, \qquad \text{and} \qquad V(x) \to \infty \text{ whenever } ||x|| \to \infty,$$
Perhaps you noticed the disconnect between the statement above and the argument
that we made for the stability of the pendulum. In the pendulum example, using
the mechanical energy resulted in a Lyapunov function time derivative that was
only negative semi-definite, but we eventually argued that the fixed points were
asymptotically stable. That took a little extra work, involving an argument about the
fact that the fixed points were the only place that the system could stay with Ė = 0;
every other state with Ė = 0 was only transient. We can formalize this idea for the
more general Lyapunov function statements -- it is known as LaSalle's Theorem.
V (x) ≻ 0, V˙ (x) ⪯ 0,
and V (x) → ∞ as ||x|| → ∞, then x will converge to the largest invariant set
where V˙ (x) = 0.
desired energy, E_d = 1, and the difference Ẽ(x) = E(x) − E_d. We were able to show
that our proposed controller produced
$$\dot{\tilde{E}} = -k \dot\theta^2 \tilde{E},$$
where k is a positive gain that we get to choose. And we said that was good!
Now we have the tools to understand that really we have a Lyapunov function,
$$V(x) = \frac{1}{2} \tilde{E}^2(x),$$
and what we have shown is that V̇ ≤ 0. By LaSalle, we can only argue that the
closed-loop system will converge to the largest invariant set, which here is the
entire homoclinic orbit: Ẽ(x) = 0. We have to switch to the LQR controller in order
to stabilize the upright.
As you can see from the plots, V˙ (x) ends up being a quite non-trivial function! We'll
develop the computational tools for verifying the Lyapunov/LaSalle conditions for
systems of this complexity in the upcoming sections.
At this point, you might be wondering if there is any relationship between Lyapunov
functions and the cost-to-go functions that we discussed in the context of dynamic
programming. After all, the cost-to-go functions also captured a great deal about the
long-term dynamics of the system in a scalar function. We can see the connection if
we re-examine the HJB equation,
$$0 = \min_u \left[ \ell(x,u) + \frac{\partial J^*}{\partial x} f(x,u) \right].$$
Let's imagine that we can solve for the optimizing u*(x); then we are left with
$$0 = \ell(x, u^*) + \frac{\partial J^*}{\partial x} f(x, u^*), \quad \text{or simply} \quad \frac{\partial J^*}{\partial x} f(x, u^*) = -\ell(x, u^*).$$
In other words, in optimal control we must find a cost-to-go function which matches
this gradient for every x; that's very difficult and involves solving a potentially high-
dimensional partial differential equation. By contrast, Lyapunov analysis is asking
for much less - any function which is going downhill (at any rate) for all states. This
can be much easier, for theoretical work, but also for our numerical algorithms. Also
note that if we do manage to find the optimal cost-to-go, J ∗ (x), then it can also serve
as a Lyapunov function so long as ℓ(x, u ∗ (x)) ⪰ 0.
9.2 LYAPUNOV ANALYSIS WITH CONVEX OPTIMIZATION
Let's take a moment to see how things play out for linear systems, ẋ = Ax, with a
quadratic Lyapunov candidate:
$$V(x) = x^T P x, \quad P = P^T \succ 0, \qquad \dot{V}(x) = x^T P A x + x^T A^T P x \prec 0,$$
which is satisfied for all x if and only if
$$PA + A^T P \prec 0.$$
For stable linear systems the existence of a quadratic Lyapunov function is actually
a necessary (as well as sufficient) condition. Furthermore, a Lyapunov function can
always be found by finding the positive-definite solution to the matrix Lyapunov
equation
$$PA + A^T P = -Q, \qquad (1)$$
for any Q = Q^T ≻ 0.
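As a quick numerical sanity check (a minimal sketch using scipy rather than the Drake tooling used elsewhere in these notes; the matrices are just illustrative), we can solve Eq. (1) directly:

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# A stable linear system xdot = A x (eigenvalues in the left half-plane).
A = np.array([[0., 1.],
              [-1., -1.]])
Q = np.eye(2)  # any Q = Q^T > 0

# solve_continuous_lyapunov(a, q) solves a X + X a^T = q, so passing
# a = A.T and q = -Q returns the P satisfying A^T P + P A = -Q (Eq. 1).
P = solve_continuous_lyapunov(A.T, -Q)

print("P =", P)
print("eigenvalues of P:", np.linalg.eigvalsh(P))  # positive => P is PD
print("residual:", A.T @ P + P @ A + Q)            # ~ zero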
This is a very powerful result - for nonlinear systems it will be potentially difficult to
find a Lyapunov function, but for linear systems it is straight-forward. In fact, this
result is often used to propose candidates for non-linear systems, e.g., by linearizing
the equations and solving a local linear Lyapunov function which should be valid in
the vicinity of a fixed point.
If you don't know much about convex optimization or want a quick refresher, please
take a few minutes to read the optimization preliminaries in the appendix. The main
requirement for this section is to appreciate that it is possible to formulate efficient
optimization problems where the constraints include specifying that one or more
matrices are positive semi-definite (PSD). These matrices must be formed from a
linear combination of the decision variables. For a trivial example, consider the optimization
$$\min_a \; a, \quad \text{subject to} \quad \begin{bmatrix} a & 0 \\ 0 & 1 \end{bmatrix} \succeq 0,$$
whose optimal solution is a = 0.
The value in this is immediate for linear systems. For example, we can formulate the
search for a Lyapunov function for the linear system ẋ = Ax by using the parameters
p to populate a symmetric matrix P, and then write the SDP:
$$\text{find}_p \quad \text{subject to} \quad P \succeq 0, \quad PA + A^T P \preceq 0.$$
Note that you would probably never use that particular formulation, since there are
specialized algorithms for solving the simple Lyapunov equation which are more
efficient and more numerically stable. But the SDP formulation does provide
something new -- we can now easily formulate the search for a "common Lyapunov
function" for uncertain linear systems.
$$A = \sum_i \beta_i A_i, \qquad \sum_i \beta_i = 1, \qquad \forall i, \; \beta_i > 0.$$
We can then search for a single quadratic Lyapunov function that works for every vertex with the SDP
$$\text{find}_p \quad \text{subject to} \quad P \succeq 0, \quad \forall i, \;\; P A_i + A_i^T P \preceq 0.$$
The solver will then return a matrix P which satisfies all of the constraints, or
return saying "problem is infeasible". It can easily be verified that if P satisfies the
Lyapunov condition at all of the vertices, then it satisfies the condition for every A
in the set:
$$P\left(\sum_i \beta_i A_i\right) + \left(\sum_i \beta_i A_i\right)^T P = \sum_i \beta_i \left( P A_i + A_i^T P \right) \preceq 0,$$
since ∀i, β i > 0. Note that, unlike the simple Lyapunov equation for a known linear
system, this condition being satisfied is a sufficient but not a necessary condition
-- it is possible that the set of uncertain matrices A is robustly stable, but that this
stability cannot be demonstrated with a common quadratic Lyapunov function.
Open in Colab
As always, make sure that you open up the code and take a look.
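To make the formulation concrete, here is a minimal sketch of how this search might be posed with pydrake's MathematicalProgram (the vertex matrices A1, A2 below are hypothetical, and this is an illustrative reimplementation, not the linked notebook). The matrix inequalities are encoded as SOS constraints on quadratic forms, since a quadratic form is SOS exactly when its Gram matrix is PSD:

import numpy as np
from pydrake.all import MathematicalProgram, Solve

# Hypothetical vertices of the uncertain system xdot = A x.
A1 = np.array([[-1., 0.5], [0., -1.]])
A2 = np.array([[-1., 0.], [0.5, -1.]])

prog = MathematicalProgram()
x = prog.NewIndeterminates(2, "x")
P = prog.NewSymmetricContinuousVariables(2, "P")

# P > 0 (up to scaling) and P A_i + A_i^T P <= 0 at every vertex.
prog.AddSosConstraint(x.dot(P.dot(x)) - 1e-4 * x.dot(x))
for A in (A1, A2):
    prog.AddSosConstraint(-x.dot((P.dot(A) + A.T.dot(P)).dot(x)))

result = Solve(prog)
print("feasible:", result.is_success())
if result.is_success():
    print("P =", result.GetSolution(P))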
There are many small variants of this result that are potentially of interest. For
instance, a very similar set of conditions can certify "mean-square stability" for
linear systems with multiplicative noise (see e.g. [3], § 9.1.1).
$$V_\alpha(x) = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_1 x_2 + \alpha_4 x_1^2 + \ldots,$$
$$\dot{x}_0 = -x_0 - 2 x_1^2, \qquad \dot{x}_1 = -x_1 - x_0 x_1 - 2 x_1^3,$$
and the fixed Lyapunov function V(x) = x_0^2 + 2 x_1^2, test if V̇(x) is negative definite.
The numerical solution can be written in a few lines of code, and is a convex
optimization.
Open in Colab
$$V(x) = c_0 + c_1 x_0 + c_2 x_1 + c_3 x_0^2 + c_4 x_0 x_1 + c_5 x_1^2.$$
We will set the scaling (arbitrarily) to avoid numerical issues by setting V (0) = 0,
V ([1, 0]) = 1. Then we write:
$$\text{find}_c \quad \text{subject to} \quad V \text{ is SOS}, \quad -\dot{V} \text{ is SOS}.$$
Open in Colab
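Here is a minimal sketch of how that program might look in pydrake, for the system and the quadratic candidate above (an illustrative reimplementation, not the linked notebook):

from pydrake.all import MathematicalProgram, Solve

prog = MathematicalProgram()
x = prog.NewIndeterminates(2, "x")

# Polynomial dynamics from the example above.
f = [-x[0] - 2 * x[1]**2,
     -x[1] - x[0] * x[1] - 2 * x[1]**3]

# Quadratic Lyapunov candidate with free coefficients c.
c = prog.NewContinuousVariables(6, "c")
V = (c[0] + c[1] * x[0] + c[2] * x[1] + c[3] * x[0]**2
     + c[4] * x[0] * x[1] + c[5] * x[1]**2)
Vdot = V.Jacobian(x).dot(f)

# Normalization V(0) = 0 and V([1, 0]) = 1, plus the two SOS conditions.
prog.AddLinearConstraint(c[0] == 0)
prog.AddLinearConstraint(c[0] + c[1] + c[3] == 1)
prog.AddSosConstraint(V)
prog.AddSosConstraint(-Vdot)

result = Solve(prog)
print("feasible:", result.is_success())
print("c =", result.GetSolution(c))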
It is important to remember that there are a handful of gaps which make the
existence of this solution a sufficient condition (for proving that every sublevel set of
V is an invariant set of f ) instead of a necessary one. First, there is no guarantee that
a stable polynomial system can be verified using a polynomial Lyapunov function (of
any degree, and in fact there are known counter-examples [5]) and here we are only
searching over a fixed-degree polynomial. Second, even if a polynomial Lyapunov
function does exist, there is a gap between the SOS polynomials and the positive
polynomials.
There is another very important connection between Lyapunov functions and the
concept of an invariant set: any sublevel set of a Lyapunov function is also an
invariant set. This gives us the ability to use sublevel sets of a Lyapunov function as
approximations of the region of attraction for nonlinear systems.
$$\mathcal{G}: \{x \mid V(x) \le \rho\} \quad \text{on which} \quad \forall x \in \mathcal{G}, \;\; \dot{V}(x) \preceq 0,$$
In the vicinity of the origin, ẋ looks like −x, and as we move away it looks
increasingly like x 3 . There is a stable fixed point at the origin and unstable fixed
points at ±1. In fact, we can deduce visually that the region of attraction to the
stable fixed point at the origin is x ∈ (−1, 1). Let's see if we can demonstrate this
with a Lyapunov argument.
First, let us linearize the dynamics about the origin and use the Lyapunov equation
for linear systems to find a candidate V (x). Since the linearization is ẋ = −x, if we
take Q = 1, then we find P = 1/2 is the positive definite solution to the algebraic
Lyapunov equation (1). Proceeding with
$$V(x) = \frac{1}{2} x^2,$$
we have
$$\dot{V} = x(-x + x^3) = -x^2 + x^4.$$
This function is zero at the origin, negative for |x| < 1, and positive for |x| > 1.
Therefore we can conclude that the sublevel set V < 1/2 is invariant and the set
x ∈ (−1, 1) is inside the region of attraction of the nonlinear system. In fact, this
estimate is tight.
While we will defer most discussions on robustness analysis until later in the notes,
the idea of a common Lyapunov function, which we introduced briefly for linear
systems in the example above, can be readily extended to nonlinear systems and
region of attraction analysis. Imagine that you have a model of a dynamical system
but that you are uncertain about some of the parameters. For example, you have a
model of a quadrotor, and are fairly confident about the mass and lengths (both of
which are easy to measure), but are not confident about the moment of inertia. One
approach to robustness analysis is to define a bounded uncertainty, which could take
the form of
In standard Lyapunov analysis, we are searching for a function that goes downhill
for all x to make statements about the long-term dynamics of the system. In common
Lyapunov analysis, we can make many similar statements about the long-term
dynamics of an uncertain system if we can find a single Lyapunov function that goes
downhill for all possible values of α. If we can find such a function, then we can use
it to make statements with all of the variations we've discussed (local, regional, or
global; in the sense of Lyapunov, asymptotic, or exponential).
$$\dot{V} = -x^2 + \alpha x^4.$$
This function is zero at the origin, and negative for all α whenever x² > αx⁴, or
$$|x| < \frac{1}{\sqrt{\alpha_{max}}} = \sqrt{\frac{2}{3}}.$$
Therefore, we can conclude that |x| < √(2/3) is inside the robust region of attraction
of the uncertain system.
Not all forms of uncertainty are as simple to deal with as the gain uncertainty in
that example. For many forms of uncertainty, we might not even know the location of
the fixed points of the uncertain system. In this case, we can often still use common
Lyapunov functions to give some guarantees about the system, such as guarantees of
robust set invariance. For instance, if you have uncertain parameters on a quadrotor
model, you might be ok with the quadrotor stabilizing to a pitch of 0.01 radians, but
you would like to guarantee that it definitely does not flip over and crash into the
ground.
$$\dot{V} = -x^2 + x^4 + \alpha x.$$
Therefore V can never start at less than one-third and cross over to greater than
one-third. To see this, see that
$$V = \frac{1}{3} \;\Rightarrow\; x = \pm\sqrt{\frac{2}{3}} \;\Rightarrow\; \dot{V} = -\frac{2}{9} \pm \alpha\sqrt{\frac{2}{3}} < 0, \quad \forall \alpha \in \left[-\frac{1}{4}, \frac{1}{4}\right].$$
Note that not all sublevel sets of this invariant set are invariant. For instance,
V < 1/32 does not satisfy this condition, and by visual inspection we can see that it
is in fact not robustly invariant.
Now we have arrived at the tool that I believe can be a work-horse for many serious
robotics applications. Most of our robots are not actually globally stable (that's
not because they are robots -- if you push me hard enough, I will fall down, too),
which means that understanding the regions where a particular controller can be
guaranteed to work can be of critical importance.
The S-procedure
Consider a scalar polynomial, p(x), and a semi-algebraic set g(x) ⪯ 0, where g is a
vector of polynomials. If I can find a polynomial "multiplier", λ(x), such that
$$p(x) + \lambda^T(x)\, g(x) \text{ is SOS} \quad \text{and} \quad \lambda(x) \text{ is SOS},$$
then this is sufficient to demonstrate that
$$g(x) \le 0 \;\Rightarrow\; p(x) \ge 0,$$
where λ(x) is a multiplier polynomial with free coefficients that are to be solved for
in the optimization.
Let's return to the one-dimensional example,
$$\dot{x} = -x + x^3,$$
and try to use SOS optimization to demonstrate that the region of attraction of the
fixed point at the origin is x ∈ (−1, 1), using the Lyapunov candidate V = x².
With the multiplier
$$\lambda(x) = c_0 + c_1 x + c_2 x^2,$$
we write
$$\text{find}_c \quad \text{subject to} \quad -\dot{V}(x) + \lambda(x)(V(x) - 1) \text{ is SOS}, \quad \lambda(x) \text{ is SOS}.$$
Open in Colab
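A minimal sketch of that program in pydrake (an illustrative reimplementation of the formulation above, not the linked notebook):

from pydrake.all import MathematicalProgram, Solve

prog = MathematicalProgram()
x = prog.NewIndeterminates(1, "x")[0]

f = -x + x**3   # dynamics
V = x**2        # fixed Lyapunov candidate
Vdot = 2 * x * f

# Quadratic multiplier with free coefficients.
c = prog.NewContinuousVariables(3, "c")
lam = c[0] + c[1] * x + c[2] * x**2

# S-procedure: whenever V(x) <= 1, we must have Vdot(x) <= 0.
prog.AddSosConstraint(-Vdot + lam * (V - 1))
prog.AddSosConstraint(lam)

result = Solve(prog)
print("certified:", result.is_success())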
In this example, we only verified that V̇ is negative on the one-sublevel set of the
pre-specified Lyapunov candidate (certifying the ROA that we already understood).
Even more useful is if you are able to search for the largest ρ that can satisfy these
conditions. Unfortunately, in this first formulation, optimizing ρ directly would make
the optimization problem non-convex because we would have terms like ρc 0 , ρc 1 x, ...
which are bilinear in the decision variables; we need the sums-of-squares constraints
to be only linear in the decision variables.
Fortunately, because the problem is convex with ρ fixed (and therefore can be solved
reliably), and ρ is just a scalar, we can perform a simple line search on ρ to find the
largest value for which the convex optimization returns a feasible solution. This will
be our estimate for the region of attraction.
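A sketch of that line search for the same scalar example: each call below re-solves the (convex) certification problem for a fixed ρ, and we bisect on feasibility (monotone in ρ for this example; expect the answer to approach ρ = 1):

from pydrake.all import MathematicalProgram, Solve

def rho_is_certified(rho):
    # Try to certify {x : V(x) <= rho} for xdot = -x + x^3 with V = x^2.
    prog = MathematicalProgram()
    x = prog.NewIndeterminates(1, "x")[0]
    V = x**2
    Vdot = 2 * x * (-x + x**3)
    c = prog.NewContinuousVariables(3, "c")
    lam = c[0] + c[1] * x + c[2] * x**2
    prog.AddSosConstraint(-Vdot + lam * (V - rho))
    prog.AddSosConstraint(lam)
    return Solve(prog).is_success()

lo, hi = 0.0, 4.0
for _ in range(20):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if rho_is_certified(mid) else (lo, mid)
print("largest certified rho is approximately", lo)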
There are a number of variations to this basic formulation; I will describe a few of
them below. There are also important ideas like rescaling and degree-matching that
can have a dramatic effect on the numerics of the problem, and potentially make
them much better for the solvers. But you do not need to master them all in order to
use the tools effectively.
from pydrake.all import RegionOfAttraction, SymbolicVectorSystem, Variable

x = Variable("x")
sys = SymbolicVectorSystem(state=[x], dynamics=[-x + x**3])
context = sys.CreateDefaultContext()
V = RegionOfAttraction(sys, context)
Open in Colab
Under the assumption that the Hessian of V˙ (x) is negative-definite at the origin
(which is easily checked), we can write
$$\max_{\rho, \lambda(x)} \quad \rho$$
$$\text{subject to} \quad (x^T x)^d \left(V(x) - \rho\right) + \lambda(x)\,\dot{V}(x) \text{ is SOS},$$
with d a fixed positive integer. As you can see, ρ no longer multiplies the coefficient
of λ(x). But why does this certify a region of attraction?
You can read the sums-of-squares constraint as certifying the implication that
whenever V˙ (x) = 0, we have that either V (x) ≥ ρ OR x = 0. Multiplying by some
multiple of x T x is just a clever way to handle the "or x = 0" case, which is necessary
since we expect V˙ (0) = 0. This implication is sufficient to prove that V˙ (x) ≤ 0
whenever V (x) ≤ ρ, since V and V˙ are smooth polynomials; we examine this claim in
full detail in one of the exercises.
Using the S-procedure with equalities instead of inequalities also has the potential
advantage of removing the SOS constraint on λ(x). But perhaps the biggest
advantage of this formulation is the possibility of dramatically simplifying the
problem using the quotient ring of this algebraic variety, and in particular some
recent results for exact certification using sampling varieties[6].
But what if we believe the system is regionally stable, despite having an indefinite
linearization? Or perhaps we can certify a larger volume of state space by using
a Lyapunov candidate with degree greater than two. Can we use sums-of-squares
optimization to search for that Lyapunov function in the region of attraction case, too?
To accomplish this, we will now make V(x) a polynomial (of some fixed degree) with
the coefficients as decision variables. First we will need to add constraints to ensure
that
$$V(0) = 0 \quad \text{and} \quad V(x) - \epsilon\, x^T x \text{ is SOS},$$
where ϵ is some small positive constant. This ϵ term simply ensures that V is strictly
positive definite. Now let's consider our basic formulation:
Notice that I've replaced ρ = 1; now that we are searching for V, the scale of the
level set can be set arbitrarily to 1. In fact, it's better to do so -- if we did not set
V(0) = 0 and V(x) ≤ 1 as the sublevel set, then the optimization problem would be
underconstrained, and might cause problems for the solver.
Open in Colab
$$-\dot{B}(x) \text{ is SOS} \quad \text{and} \quad B(0) > 0.$$
In order to find the smallest such outer approximation (given a fixed degree
polynomial), we choose an objective that tries to "push-down" on the value of B(x).
We typically accomplish this by introducing another polynomial function W (x) with
the requirements that
To make the problem a little numerically better, you'll see in the code that I've
asked for Ḃ(x) to be strictly negative definite, for B(0) ≥ 0.1, and I've chosen to only
include even-degree monomials in B(x) and W(x). Plotting the solution reveals:
As you can see, the superlevel set, B(x) ≥ 0 is a tight outer-approximation of the
true region of attraction. In the limit of increasing the degrees of the polynomials
to infinity, we would expect that W (x) would converge to the indicator function
that is one inside the region of attraction, and zero outside (we are quite far from
that here, but nevertheless have a tight approximation from B(x) ≥ 0).
Open in Colab
Before focusing on the finite time, let us first realize that the basic (infinite-time)
Lyapunov analysis can also be applied to time-varying systems: ẋ = f(t, x). We can
analyze the stability (local, regional, or global) of this system with very little change.
If we find
$$V(x) \succ 0, \qquad \forall t, \forall x \ne 0, \;\; \dot{V}(t, x) = \frac{\partial V}{\partial x} f(t, x) < 0, \qquad \dot{V}(t, 0) = 0,$$
then all of our previous statements still hold. In our SOS formulations, t is simply one
more indeterminate.
$$x(t_1) \in \mathcal{X}_1 \;\Rightarrow\; x(t_2) \in \mathcal{X}_2,$$
Reachability analysis can be done forward in time: we choose X_1 and try to find the
smallest region X_2 for which we can make the statement hold. X(t) would be called
the forward-reachable set (FRS), and can be very useful for certifying e.g. a motion
plan. For instance, you might like to prove that your UAV does not crash into a tree
in the next 5 seconds. In this case X_1 might be taken to be a point representing the
current state of the vehicle, or a region representing an outer-approximation of the
current state if the state is uncertain. In this context, we would call X_2 a forward-
reachable set. In this use case, we would typically choose any approximations in our
Lyapunov analysis to certify that an estimate of the reachable region is also an outer-
approximation: X_2 ⊆ X̂_2.
Reachability analysis can also be done backward in time: we choose X 2 and try to
maximize the region X 1 for which the statement can be shown to hold. Now X(t)
is called the backward-reachable set (BRS), and for robustness we typically try
to certify that our estimates are an inner-approximation, X̂_1 ⊆ X_1. The region-of-
attraction analysis we studied above can be viewed as a special case of this, with
X 2 taken to be the fixed-point, t 2 = 0 and t 1 = −∞. But finite-time BRS also have an
important role to play, for instance when we are composing multiple controllers in
order to achieve a more complicated task, which we will study soon.
Note that finite-time reachability is about proving invariance of the set, not stability,
and the V˙ condition need only be certified at the boundary of the level set. If V is
decreasing at the boundary, then trajectories can never leave. One can certainly ask
for more -- we may want to show that the system is converging towards V (t, x) = 0,
perhaps even at some rate -- but only invariance is required to certify reachability.
Like in the case for region of attraction, we have many formulations. We can certify
an existing Lyapunov candidate, V (t, x), and just try to maximize/minimize ρ(t). Or
we can search for the parameters of V (t, x), too. Again, we can initialize that search
using the time-varying version of the Lyapunov equation, or the solutions to a time-
varying LQR Riccati equation.
In practice, we often certify the Lyapunov conditions over x at only a finite set of
samples t_i ∈ [t_1, t_2]. I don't actually have anything against sampling in one dimension;
there are no issues with scaling to higher dimensions, and one can make practically
rigorous statements about bounding the sampling error. And in these systems, adding
t into all of the equations otherwise can dramatically increase the degree of the
polynomials required for the SOS certificates. All of this was written up nicely in
[14], and robust variants of it were developed in [15, 8].
We've been talking a lot in this chapter about numerical methods for polynomial
systems. But even our simple pendulum has a sin θ in the dynamics. Have I been
wasting your time? Must we just resort to polynomial approximations of the non-
polynomial equations? It turns out that our polynomial tools can perform exact
analysis of the manipulator equations for almost all† of our robots. We just have to do
a little more work to reveal that structure. (†The most notable exception is if your
robot has helical/screw joints; see [16].)
Let us first observe that rigid-body kinematics are polynomial (except the helical
joint). This is fundamental -- the very nature of a "rigid body" assumption is that
Euclidean distance is preserved between points on the body; if p_1 and p_2 are two
points on a body, then the kinematics enforce that ‖p_1 − p_2‖_2^2 is constant -- these are
polynomial constraints. Of course, we commonly write the kinematics in minimal
coordinates using sin θ and cos θ. But because of the rigid body assumption, these terms
only appear in the simplest forms, and we can simply make new variables
s_i = sin θ_i, c_i = cos θ_i, and add the constraint that s_i² + c_i² = 1. For a more thorough
discussion see, for instance, [16] and [17]. Since the potential energy of a multi-body
system is simply an accumulation of weight times the vertical position for all of the
points on the body, the potential energy is polynomial.
is polynomial in s, c, θ̇. Since the kinetic energy of our robot is given by the
accumulation of the kinetic energy of all the mass, $T = \sum_i \frac{1}{2} m_i v_i^T v_i$, the kinetic
energy is polynomial, too (even when we write it with inertial matrices and angular
velocities).
Finally, the equations of motion can be obtained by taking derivatives of the
Lagrangian (kinetic minus potential). These derivatives are still polynomial!
For the damped simple pendulum, in the coordinates x = (s, c, θ̇), the dynamics are
$$\dot{x} = \begin{bmatrix} c\dot\theta \\ -s\dot\theta \\ -\frac{1}{ml^2}\left(b\dot\theta + mgl\, s\right) \end{bmatrix},$$
and we can propose a polynomial Lyapunov candidate with free coefficients,
$$V = \alpha_0 + \alpha_1 s + \alpha_2 c + \ldots + \alpha_9 s^2 + \alpha_{10} s c + \alpha_{11} s\dot\theta.$$
In fact, this is asking too much -- really V̇ only needs to be negative when s² + c² = 1.
We can accomplish this with the S-procedure, and instead write
$$-\dot{V}(x) + \lambda(x)\left(s^2 + c^2 - 1\right) \text{ is SOS}.$$
(Recall that λ(x) is another polynomial with free coefficients which the optimization
can use to make terms arbitrarily more positive when s² + c² ≠ 1.) Finally, for style
points, in the code example in DRAKE we ask for exponential stability:
Open in Colab
As always, make sure that you open up the code and take a look. The result is a
Lyapunov function that looks familiar (visualized as a contour plot here):
Aha! Not only does the optimization find us coefficients for the Lyapunov function
which satisfy our Lyapunov conditions, but the result looks a lot like mechanical
energy. In fact, the result is a little better than energy... there are some small extra
terms added which prove exponential stability without having to invoke LaSalle's
Theorem.
The one-degree-of-freedom pendulum did allow us to gloss over one important detail:
while the manipulator equations M(q)q̈ + C(q, q̇)q̇ =. . . are polynomial, in order to
solve for q̈ we actually have to multiply both sides by M −1 . This, unfortunately, is not
a polynomial operation, so in fact the final dynamics of the multibody systems are
rational polynomial. Not only that, but evaluating M −1 symbolically is not advised --
the equations get very complicated very fast. But we can actually write the Lyapunov
conditions using the dynamics in implicit form, e.g. by writing V (q, q̇, q̈) and asking it
to satisfy the Lyapunov conditions everywhere that M(q)q̈+. . . =. . . +Bu is satisfied,
using the S-procedure.
g(x, u, ẋ) = 0.
This form is strictly more general because I can always write g(x, u, ẋ) = f(x, u) − ẋ
, but importantly here I can also write the bottom rows of g as
M(q)q̈ + C(q, q̇)q̇ − τ g − Bu. This form can also represent differential algebraic
equations (DAEs), which are more general than ODEs; g could even include
algebraic constraints like s_i² + c_i² − 1. Most importantly, for manipulators, g can
be polynomial, even if f would have been rational. DRAKE provides access to
continuous-time dynamics in implicit form via the
CalcImplicitTimeDerivativesResidual method.
There are a few things that do break this clean polynomial view of the world. Rotary
springs, for instance, if modeled as τ = k(θ 0 − θ) will mean that θ appears alongside
sin θ and cos θ, and the relationship between θ and sin θ is sadly not polynomial.
Linear feedback from LQR actually looks like the linear spring, although writing the
feedback as u = −K sin θ is a viable alternative.
In practice, you can also Taylor approximate any smooth nonlinear function using
polynomials. This can be an effective strategy in practice, because you can limit
the degrees of the polynomial, and because in many cases it is possible to provide
conservative bounds on the errors due to the approximation.
One final technique that is worth knowing about is a change of coordinates, often
referred to as the stereographic projection, that provides a coordinate system in
which we can replace sin and cos with polynomials:
$$\cos\theta = \frac{1 - p^2}{1 + p^2}, \qquad \sin\theta = \frac{2p}{1 + p^2}, \qquad \text{and} \quad \frac{\partial p}{\partial\theta} = \frac{1 + p^2}{2},$$
where ∂p/∂θ can be used in the chain rule to derive the dynamics ṗ. Although the
equations are rational, they share the denominator 1 + p² and can be treated
efficiently in mass-matrix form. Compared to the simple substitution of s = sin θ and
c = cos θ, this is a minimal representation (scalar to scalar, no s 2 + c 2 = 1 required);
unfortunately it does have a singularity at θ = π, so likely cannot be used for global
analysis.
9.6 CONTROL DESIGN
Throughout this chapter, we've developed some very powerful tools for reasoning
about stability of a closed-loop system (with the controller already specified). I hope
you've been asking yourself -- can I use these tools to design better controllers?
Of course the answer is "yes!". In this section, I'll discuss the control approaches
that are the most direct consequences of the convex optimization approaches to
Lyapunov functions. Another very natural idea is to use these tools in the context of
a feedback motion planning algorithm, which is the subject of an upcoming chapter.
Let's re-examine the Lyapunov conditions for linear systems from Eq. 2, but now
add in a linear state feedback u = Kx, resulting in the closed-loop dynamics
ẋ = (A + BK)x. One can show that the set of all stabilizing K can be characterized
by the following two matrix inequalities [3]:
$$Q = Q^T \succ 0, \qquad AQ + QA^T + BY + Y^T B^T \prec 0,$$
with the controller recovered as K = YQ⁻¹. Furthermore, given matrices A and B,
there exists a matrix K such that (A + BK) is stable if and only if there exist matrices
Q and Y which satisfy this (strict) linear matrix inequality.
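As a sketch of how this LMI could be solved with the same tools used above (encoding each matrix inequality as an SOS constraint on the corresponding quadratic form; the A and B below are just an illustrative double integrator):

import numpy as np
from pydrake.all import MathematicalProgram, Solve

A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])

prog = MathematicalProgram()
x = prog.NewIndeterminates(2, "x")
Q = prog.NewSymmetricContinuousVariables(2, "Q")
Y = prog.NewContinuousVariables(1, 2, "Y")

# Q > 0 and A Q + Q A^T + B Y + Y^T B^T < 0, with small margins for strictness.
M = A.dot(Q) + Q.dot(A.T) + B.dot(Y) + Y.T.dot(B.T)
prog.AddSosConstraint(x.dot(Q.dot(x)) - 1e-3 * x.dot(x))
prog.AddSosConstraint(-x.dot(M.dot(x)) - 1e-3 * x.dot(x))

result = Solve(prog)
Qsol = result.GetSolution(Q)
Ysol = result.GetSolution(Y)
K = Ysol.dot(np.linalg.inv(Qsol))
print("K =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A + B.dot(K)))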
For control design using convex optimization, we will lean heavily on the assumption
of the dynamics being control-affine. Let me write it this time as:
$$\dot{x} = f(x) + \sum_{i=0}^{m-1} u_i f_i(x).$$
For linear optimal control, we found controllers of the form, u = −Kx. The natural
generalization of this to polynomial analysis will be to look for controllers of the form
u = π(x), where π(x) is a polynomial. (For mechanical systems like the pendulum
above, we might make π a polynomial in s and c.)
If we apply this control to the Lyapunov conditions (for global analysis), we quickly
see the problem. We have
$$\dot{V}(x) = \frac{\partial V}{\partial x}\left[ f(x) + \sum_{i=0}^{m-1} f_i(x)\, \pi_i(x) \right],$$
which is bilinear in the coefficients of V(x) and the coefficients of π(x), so searching
for both simultaneously is no longer a convex program.
One very natural strategy is to use alternations. The idea is simple, we will fix
π and optimize V , then fix V and optimize π and repeat until convergence. This
approach has roots in the famous "DK iterations" for robust control (e.g. [18]). It
takes advantage of the structured convex optimization at each step.
For this approach, it is important to start with an initial feasible V or π. For example,
one can think of locally stabilizing a nonlinear system with LQR, and then searching
for a linear control law (or even a higher-order polynomial control law) that achieves
a larger basin of attraction. But note that once we move from global stability to
region of attraction optimization, we now need to alternate between three sets
of variables: V(x), π(x), and λ(x), where λ(x) was the multiplier polynomial for the S-
procedure. We took exactly this approach in [19]. In that paper, we showed that
alternations could increase the certified region of attraction for the Acrobot.
The alternation approach takes advantage of convex optimization in the inner loop,
but it is still only a local approach to solving the nonconvex joint optimization. It is
subject to local minima. The primary advantage is that, barring numerical issues,
we expect recursive feasibility (once we have a controller and Lyapunov function
that achieves stability, even with a small region of attraction, we will not lose it) and
monotonic improvement on the objective. It is also possible to attempt to optimize
these objectives more directly with other non-convex optimization procedures (like
stochastic gradient descent, or sequential quadratic programming) (e.g. [6]), but
strict feasibility is harder to guarantee. Oftentimes the Lyapunov conditions are
just evaluated at samples, or along sample trajectories; we can still certify the
performance using just a single SOS verification step with the controller fixed before
deploying.
Another core idea connecting Lyapunov analysis with control design is the "control-
Lyapunov function". Given a system ẋ = f(x, u), a control-Lyapunov function V is a
positive-definite function for which
$$\forall x \ne 0, \; \exists u \;\; \text{such that} \;\; \dot{V}(x,u) = \frac{\partial V}{\partial x} f(x,u) < 0, \qquad \text{and} \quad \exists u \;\; \dot{V}(0,u) = 0.$$
In words, for all x, there exists a control input that would allow V to decrease. Once
again, we adorn this basic property with extra conditions (e.g. radially unbounded,
or satisfied over a control-invariant set) in order to construct global or regional
stability statements. What is important to understand is that we can design control-
Lyapunov functions without explicitly parameterizing the controller; often the
control action is not even determined until execution time by finding a particular u
that goes downhill.
$$\forall x, \forall u, \quad \dot{B}(x, u) \le 0.$$
Now here is where it gets a little frustrating. Certainly, sublevel sets of B(x) are
control-invariant (via the proper choice of u). But we do not (cannot) expect that
the entire estimated region (the 0-superlevel set) can be rendered invariant. The
estimated region is an outer approximation of the backward reachable set. [21]
gave a recipe for extracting a polynomial control law from the BRS; an inner
approximation for this control law can then be obtained via the original SOS region of
attraction tools. This is unfortunately suboptimal/conservative. Although we would
like to certify the control-Lyapunov conditions directly in an inner approximation,
the ∃ quantifier remains as an obstacle.
9.6.4 Approximate dynamic programming with SOS
We have already established the most important connection between the HJB
conditions and the Lyapunov conditions:
The HJB involves solving a complex PDE; by changing this to an inequality, we relax
the problem and make it amenable to convex optimization.
To see this, take the integral of both sides along any solution trajectory, x(t), u(t). For
the upper-bound we get
$$\int_0^\infty \dot{V}^\pi(x)\,dt = V^\pi(x(\infty)) - V^\pi(x(0)) \le \int_0^\infty -\ell\left(x(t), \pi(x(t))\right)dt.$$
The upper bound is the one that we would want to use in a certification procedure --
it provides a guarantee that the total cost achieved by the system started in x is less
than V (x). But it turns out that the lower bound is much better for control design.
This is because we can write
$$\forall x, \; \min_u\left[\ell(x,u) + \frac{\partial V}{\partial x}f(x,u)\right] \ge 0 \quad \equiv \quad \forall x, \forall u, \; \ell(x,u) + \frac{\partial V}{\partial x}f(x,u) \ge 0.$$
Therefore, without having to specify a controller a priori, if we can find a function
V(x) such that ∀x, ∀u, V̇(x,u) ≥ −ℓ(x,u), then we have a lower bound on the optimal
cost-to-go.
You should take a minute to convince yourself that, unfortunately, the same trick
does not work for the upper-bound. Again, we would need ∃ as the quantifier on u
instead of ∀.
Sums-of-squares formulation
Combining a number of ideas we've seen already, this leads to a natural sums-of-
squares formulation for optimal control:
$$\max_{V(x)} \quad \int_X V(x)\, dx,$$
$$\text{subject to} \quad \ell(x,u) + \frac{\partial V}{\partial x} f(x,u) \text{ is SOS}, \qquad V(0) = 0.$$
The SOS constraint enforces the lower bound, and the objective "pushes up" on the
lower-bound by maximizing the integral over some compact region. Once again, we
can then try to extract a control law by either using this lower bound as a control-
Lyapunov function, or by extracting (and certifying) a polynomial controller.
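Here is a sketch of this program for the double integrator with ℓ(x, u) = xᵀx + u², restricting V to a quadratic form and integrating over the box [−1, 1]² (so the objective reduces to a linear function of the coefficients); for this problem the maximizing S should approximately recover the LQR cost-to-go matrix:

import numpy as np
from pydrake.all import MathematicalProgram, Solve

prog = MathematicalProgram()
x = prog.NewIndeterminates(2, "x")   # state [q, qdot]
u = prog.NewIndeterminates(1, "u")[0]

f = np.array([x[1], u])              # double integrator: xdot = [qdot, u]
l = x.dot(x) + u**2                  # running cost

# Quadratic candidate V(x) = x' S x, so V(0) = 0 automatically.
S = prog.NewSymmetricContinuousVariables(2, "S")
V = x.dot(S.dot(x))
Vdot = V.Jacobian(x).dot(f)

# Lower-bound condition: for all x and u, l(x, u) + dV/dx f(x, u) >= 0.
prog.AddSosConstraint(l + Vdot)

# Push up on the lower bound: maximize the integral of V over [-1, 1]^2,
# which for this parameterization equals (4/3) (S[0,0] + S[1,1]).
prog.AddLinearCost(-(4.0 / 3.0) * (S[0, 0] + S[1, 1]))

result = Solve(prog)
print("S =", result.GetSolution(S))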
Perhaps you noticed that this is a natural extension of the linear programming
approach to dynamic programming. For systems with continuous state, the LP
approach verified the inequality conditions only at sample points; here we verify
them for all x, u. This is an important generalization: not so much because it can
certify better (the lower bound is not a particularly valuable thing to certify), but
because it can scale to dimensions where dense sampling is not viable. This provides
something of a spectrum between the mesh-based value iteration and the very
scalable LQR.
The biggest challenge to this approach, however, is not the number of dimensions,
but the degree of the polynomial required to achieve a meaningful approximation.
Remember that the optimal cost-to-go for even the min-time double integrator
problem is non-smooth. As in the pseudo-spectral methods that we will see in the
context of trajectory optimization, choosing the right polynomial basis can make a
huge difference for the numerical robustness of this approach.
9.10 EXERCISES
you are given the positive definite function
$$V(x) = \frac{x_1^2}{1 + x_1^2} + x_2^2$$
(plotted here) and told that, for this system, V̇(x) is negative definite over the entire
space. Is V(x) a valid Lyapunov function to prove global asymptotic stability of the
origin for the system described by the equations above? Motivate your answer.
a. B r is an invariant set for the given system, i.e.: if the initial state x(0) lies
in B r , then x(t) will belong to B r for all t ≥ 0.
b. B r is a subset of the ROA of the fixed point x = 0, i.e.: if x(0) lies in B r ,
then lim t→∞ x(t) = 0.
$$\dot{x}_1 = x_2 - x_1^3, \qquad \dot{x}_2 = -x_1 - x_2^3.$$
Show that the Lyapunov function V(x) = x_1^2 + x_2^2 proves global asymptotic stability
of the origin for this system.
$$\dot{z}_1 = u_1 \cos z_3, \qquad \dot{z}_2 = u_1 \sin z_3, \qquad \dot{z}_3 = u_2.$$
The goal is to design a feedback law π(z) that drives the robot to the origin
z = 0 from any initial condition. As pointed out in [22], this problem becomes
dramatically easier if we analyze it in polar coordinates. As depicted below, we
let x 1 be the radial and x 2 the angular coordinate of the robot, and we define
x 3 = x 2 − z 3 . Analyzing the figure, basic kinematic considerations lead to
$$\dot{x}_1 = u_1 \cos x_3, \qquad \dot{x}_2 = -\frac{u_1 \sin x_3}{x_1}, \qquad \dot{x}_3 = -\frac{u_1 \sin x_3}{x_1} - u_2.$$
Figure 9.9 - Wheeled robot with Cartesian z and polar x coordinate system.
$$u_1 = \pi_1(x) = -x_1 \cos x_3, \qquad u_2 = \pi_2(x) = x_3 + \frac{(x_2 + x_3)\cos x_3 \sin x_3}{x_3},$$
makes V˙1 (x, π 1 (x)) ≤ 0 and V˙2 (x, π(x)) ≤ 0 for all x. (Technically speaking,
π 2 (x) is not defined for x 3 = 0. In this case, we let π 2 (x) assume its
limiting value x 2 + 2x 3 , ensuring continuity of the feedback law.)
c. Explain why Lyapunov's direct method does not allow us to establish
asymptotic stability of the closed-loop system.
d. Substitute the control law u = π(x) in the equations of motion, and derive
the closed-loop dynamics ẋ = f(x, π(x)). Use LaSalle's theorem to show
(global) asymptotic stability of the closed-loop system.
e. In this python notebook we set up a simulation environment for you to
try the controller we just derived. Type the control law from point (b) in
the dedicated cell, and use the notebook plot to check your work.
a. The one from the example above, augmented with a line search that
maximizes the area of the ROA.
b. A single-shot SOS program that can directly maximize the area of the
ROA, without any line search.
c. An improved version of the previous, where fewer SOS constraints are
imposed in the optimization problem.
REFERENCES
1. Jean-Jacques E. Slotine and Weiping Li, "Applied Nonlinear Control",
Prentice Hall, October, 1990.
7. Russ Tedrake and Ian R. Manchester and Mark M. Tobenkin and John W.
Roberts, "LQR-Trees: Feedback Motion Planning via Sums of Squares
Verification", International Journal of Robotics Research, vol. 29, pp.
1038--1052, July, 2010. [ link ]
11. Michael Posa and Twan Koolen and Russ Tedrake, "Balancing and Step
Recovery Capturability via Sums-of-Squares Optimization", Robotics:
Science and Systems, 2017. [ link ]
13. D. Henrion and J.B. Lasserre and C. Savorgnan, "Approximate volume and
integration for basic semialgebraic sets", SIAM Review, vol. 51, no. 4, pp.
722--743, 2009.
14. Mark M. Tobenkin and Ian R. Manchester and Russ Tedrake, "Invariant
Funnels around Trajectories using Sum-of-Squares Programming",
Proceedings of the 18th IFAC World Congress, extended version available
online: arXiv:1010.3013 [math.DS], 2011. [ link ]
15. Anirudha Majumdar, "Robust Online Motion Planning with Reachable Sets",
May, 2013. [ link ]
17. A.J. Sommese and C.W. Wampler, "The Numerical solution of systems of
polynomials arising in engineering and science", World Scientific Pub Co Inc,
2005.
18. R. Lind and G.J. Balas and A. Packard, "Evaluating D-K iteration for
control design", American Control Conference, 1994, vol. 3, pp. 2792--2797,
29 June--1 July, 1994.
19. Anirudha Majumdar and Amir Ali Ahmadi and Russ Tedrake, "Control
Design along Trajectories with Sums of Squares Programming",
Proceedings of the 2013 IEEE International Conference on Robotics and
Automation (ICRA), pp. 4054-4061, 2013. [ link ]
20. Milan Korda and Didier Henrion and Colin N Jones, "Controller design
and region of attraction estimation for nonlinear dynamical systems", The
19th World Congress of the International Federation of Automatic Control
(IFAC), 2014.
21. Anirudha Majumdar and Ram Vasudevan and Mark M. Tobenkin and Russ
Tedrake, "Convex Optimization of Nonlinear Feedback Controllers via
Occupation Measures", Proceedings of Robotics: Science and Systems
(RSS), 2013. [ link ]
22. Michele Aicardi and Giuseppe Casalino and Antonio Bicchi and Aldo
Balestrino, "Closed loop steering of unicycle like vehicles via Lyapunov
techniques", IEEE Robotics \& Automation Magazine, vol. 2, no. 1, pp.
27--35, 1995.
Trajectory Optimization
I've argued that optimal control is a powerful framework for specifying complex
behaviors with simple objective functions, letting the dynamics and constraints
on the system shape the resulting feedback controller (and vice versa!). But the
computational tools that we've provided so far have been limited in some important
ways. The numerical approaches to dynamic programming which involve putting a
mesh over the state space do not scale well to systems with state dimension more
than four or five. Linearization around a nominal operating point (or trajectory)
allowed us to solve for locally optimal control policies (e.g. using LQR) for even
very high-dimensional systems, but the effectiveness of the resulting controllers is
limited to the region of state space where the linearization is a good approximation
of the nonlinear dynamics. The computational tools for Lyapunov analysis from the
last chapter can provide, among other things, an effective way to compute estimates
of those regions. But we have not yet provided any real computational tools for
approximate optimal control that work for high-dimensional systems beyond the
linearization around a goal. That is precisely the goal for this chapter.
It is worth contrasting this parameterization problem with the one that we faced
in our continuous-dynamic programming algorithms. For trajectory optimization, we
need a finite-dimensional parameterization in only one dimension (time), whereas in
the mesh-based value iteration algorithms we had to work in the dimension of the
state space. Our mesh-based discretizations scaled badly with this state dimension,
and led to numerical errors that were difficult to deal with. There is relatively
much more known about discretizing solutions to differential equations over time,
including work on error-controlled integration. And the number of parameters
required for trajectory parameterizations scales linearly with the state dimension,
instead of exponentially in mesh-based value iteration.
For instance, let us start by writing both u[⋅] and x[⋅] as decision variables. Then we
can write:
$$\begin{aligned} \min_{x[\cdot], u[\cdot]} \quad & \ell_f(x[N]) + \sum_{n=0}^{N-1} \ell(x[n], u[n]) \\ \text{subject to} \quad & x[n+1] = Ax[n] + Bu[n], \quad \forall n \in [0, N-1], \\ & x[0] = x_0, \\ & + \text{additional constraints}. \end{aligned}$$
We call this modeling choice -- adding x[⋅] as decision variables and modeling the
discrete dynamics as explicit constraints -- the "direct transcription". Importantly,
for linear systems, the dynamics constraints are linear constraints in these decision
variables. As a result, if we can restrict our additional constraints to linear inequality
constraints and our objective function to being linear/quadratic in x and u, then
the resulting trajectory optimization is a convex optimization (specifically a linear
program or quadratic program depending on the objective). As a result, we can
reliably solve these problems to global optimality at quite large scale; these days it
is common to solve these optimization online inside a high-rate feedback controller.
Example 10.1 (Trajectory optimization for the Double
Integrator)
We've looked at a few optimal control problems for the double integrator using
value iteration. For one of them -- the quadratic objective with no constraints on u
-- we know now that we could have solved the problem "exactly" using LQR. But we
have not yet given satisfying numerical solutions for the minimum-time problem,
nor for the constrained LQR problem.
In the trajectory formulation, we can solve these problems exactly for the discrete-
time double integrator, and with better accuracy for the continuous-time double
integrator. Take a moment to appreciate that! The bang-bang policy and cost-to-go
functions are fairly nontrivial functions of state; it's quite satisfying that we can
evaluate them using convex optimization! The limitation, of course, is that we are
only solving them for one initial condition at a time.
Open in Colab
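As an illustrative sketch of the transcription (not the linked notebook), here is a small QP for the discrete-time double integrator with input limits, written with pydrake's MathematicalProgram:

import numpy as np
from pydrake.all import MathematicalProgram, Solve

h = 0.1
A = np.array([[1., h], [0., 1.]])
B = np.array([[0.], [h]])
N = 50
x0 = np.array([-2., 0.])

prog = MathematicalProgram()
x = prog.NewContinuousVariables(2, N + 1, "x")
u = prog.NewContinuousVariables(1, N, "u")

prog.AddBoundingBoxConstraint(x0, x0, x[:, 0])        # initial condition
for n in range(N):
    xn1 = A.dot(x[:, n]) + B.dot(u[:, n])
    for i in range(2):                                # dynamics as equalities
        prog.AddLinearConstraint(x[i, n + 1] == xn1[i])
    prog.AddBoundingBoxConstraint(-1., 1., u[:, n])   # input limits
    prog.AddQuadraticCost(x[:, n].dot(x[:, n]) + u[0, n]**2)
prog.AddQuadraticCost(x[:, N].dot(x[:, N]))           # final cost

result = Solve(prog)
print("success:", result.is_success())
u_traj = result.GetSolution(u)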
If you have not studied convex optimization before, you might be surprised by the
modeling power of even this framework. Consider, for instance, an objective of the
form
The field of convex optimization is replete with tricks like this. Knowing and
recognizing them are skills of the (optimization) trade. But there are also many
relevant constraints which cannot be recast into convex constraints (in the original
coordinates) with any amount of skill. An important example is obstacle avoidance.
Imagine a vehicle that must decide if it should go left or right around an obstacle.
This represents a fundamentally non-convex constraint in x; we'll discuss the
implications of using non-convex optimization for trajectory optimization below.
The savvy reader might have noticed that adding x[⋅] as decision variables was not
strictly necessary. If we know x[0] and we know u[⋅], then we should be able to
solve for x[n] using forward simulation. For our discrete-time linear systems, this is
particularly nice:
$$x[n] = A^n x[0] + \sum_{k=0}^{n-1} A^{n-1-k} B\, u[k].$$
What's more, the solution is still linear in u[⋅]. This is amazing... we can get rid of
a bunch of decision variables, and turn a constrained optimization problem into an
unconstrained optimization problem (assuming we don't have any other constraints).
This approach -- using u[⋅] but not x[⋅] as decision variables and using forward
simulation to obtain x[n] -- is called the direct shooting transcription. For linear
systems with linear/quadratic objectives in x and u, it is still a convex optimization,
and has fewer decision variables and constraints than the direct transcription.
10.2.3 Computational Considerations
For linear convex problems, the solvers are mature enough that these differences
often don't amount to much. For nonlinear optimization problems, the differences
can be substantial. If you look at trajectory optimization papers in mainstream
robotics, you will see that both direct transcription and direct shooting approaches
are used. (It's possible you could guess which research lab wrote the paper simply
by the transcription they use!)
It is also worth noting that the problems generated by the direct transcription have
an important and exploitable "banded" sparsity pattern -- most of the constraints
touch only a small number of variables. This is actually the same pattern that we
exploit in the Riccati equations. Thanks to the importance of these methods in real
applications, numerous specialized solvers have been written to explicitly exploit
this sparsity (e.g. [3]).
is the simple case (when A is invertible). But in general, we can use any finitely-
parameterized representation of u(t) and any numerical integration scheme to
obtain x[n + 1] = f (x[n], u[n]).
The formulations that we wrote for direct transcription and direct shooting above
are still valid when the dynamics are nonlinear, it's just that the resulting problem is
nonconvex. For instance, the direct transcription for discrete-time systems becomes
the more general:
$$\begin{aligned} \min_{x[\cdot], u[\cdot]} \quad & \ell_f(x[N]) + \sum_{n=0}^{N-1} \ell(x[n], u[n]) \\ \text{subject to} \quad & x[n+1] = f(x[n], u[n]), \quad \forall n \in [0, N-1], \\ & x[0] = x_0, \\ & + \text{additional constraints}. \end{aligned}$$
Direct shooting still works, too, since on each iteration of the algorithm we can
compute x[n] given x[0] and u[⋅] by forward simulation. But things get a bit more
interesting when we consider continuous-time systems.
For nonlinear dynamics, we have many choices for how to approximate the discrete
dynamics
$$x[n+1] = x[n] + \int_{t[n]}^{t[n+1]} f(x(t), u(t))\,dt, \qquad x(t[n]) = x[n].$$
For instance, in DRAKE we have an entire suite of numerical integrators that achieve
different levels of simulation speed and/or accuracy, both of which can be highly
dependent on the details of f (x, u).
In direct collocation (c.f., [4]), both the input trajectory and the state trajectory are
represented explicitly as piecewise polynomial functions. In particular, the sweet
spot for this algorithm is taking u(t) to be a first-order polynomial and x(t) to be a
cubic polynomial.
It turns out that in this sweet spot, the only decision variables we need in our
optimization are the sample values u(t) and x(t) at the so called "break" points of
the spline. You might think that you would need the coefficients of the cubic spline
parameters, but you do not. For the first-order interpolation on u(t), given u(t k )
and u(t_{k+1}), we can solve for every value u(t) over the interval t ∈ [t_k, t_{k+1}]. But
we also have everything that we need for the cubic spline: given x(t k ) and u(t k ),
we can compute ẋ(t k ) = f(x(t k ), u(t k )); and the four values x(t k ), x(t k+1 ), ẋ(t k ), ẋ(t k+1 )
completely define all of the parameters of the cubic spline over the interval
t ∈ [t k , t k+1 ]. This is very convenient, because it is easy for us to add additional
constraints to u and x at the sample points (and would have been relatively harder
to have to convert every constraint into constraints on the spline coefficients).
Figure 10.1 - Cubic spline parameters used in the direct collocation method.
It turns out that we need one more constraint per time segment to enforce the
dynamics and to fully specify the trajectory. In direct collocation, we add a derivative
constraint at the so-called collocation points. In particular, if we choose the
collocation points to be the midpoints of the spline, then we have that
$$t_{c,k} = \frac{1}{2}(t_k + t_{k+1}), \qquad h_k = t_{k+1} - t_k,$$
$$u(t_{c,k}) = \frac{1}{2}\left(u(t_k) + u(t_{k+1})\right),$$
$$x(t_{c,k}) = \frac{1}{2}\left(x(t_k) + x(t_{k+1})\right) + \frac{h_k}{8}\left(\dot{x}(t_k) - \dot{x}(t_{k+1})\right),$$
$$\dot{x}(t_{c,k}) = -\frac{3}{2 h_k}\left(x(t_k) - x(t_{k+1})\right) - \frac{1}{4}\left(\dot{x}(t_k) + \dot{x}(t_{k+1})\right).$$
These equations come directly from the equations that fit the cubic spline to the end
points/derivatives then interpolate them at the midpoint. They give us precisely what
we need to add the dynamics constraint to our optimization at the collocation points:
$$\begin{aligned} \min_{x[\cdot], u[\cdot]} \quad & \ell_f(x[N]) + \sum_{n=0}^{N-1} h_n\, \ell(x[n], u[n]) \\ \text{subject to} \quad & \dot{x}(t_{c,n}) = f(x(t_{c,n}), u(t_{c,n})), \quad \forall n \in [0, N-1], \\ & x[0] = x_0, \\ & + \text{additional constraints}. \end{aligned}$$
I hope this notation is clear -- I'm using x[k] = x(t k ) as the decision variables,
and the collocation constraint at t c,k depends on the decision variables:
x[k], x[k + 1], u[k], u[k + 1]. The actual equations of motion get evaluated at both the
break points, t k , and the collocation points, t c,k .
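To make the collocation constraint concrete, here is a small, self-contained helper that evaluates the defect implied by the equations above for a single spline segment (the dynamics constraint is defect = 0). The double-integrator check at the end is just an illustration: the exact trajectory is a quadratic in time, so the cubic spline represents it exactly and the defect vanishes.

import numpy as np

def collocation_defect(f, x_k, x_kp1, u_k, u_kp1, h):
    # Defect at the midpoint collocation point for one cubic segment.
    xdot_k = f(x_k, u_k)
    xdot_kp1 = f(x_kp1, u_kp1)
    u_c = 0.5 * (u_k + u_kp1)
    x_c = 0.5 * (x_k + x_kp1) + (h / 8.0) * (xdot_k - xdot_kp1)
    xdot_c = -1.5 / h * (x_k - x_kp1) - 0.25 * (xdot_k + xdot_kp1)
    return xdot_c - f(x_c, u_c)

# Double integrator with constant input u = 1.
f = lambda x, u: np.array([x[1], u[0]])
h = 0.5
x_k = np.array([0.0, 0.0])
u_k = np.array([1.0])
x_kp1 = np.array([x_k[0] + h * x_k[1] + 0.5 * h**2, x_k[1] + h])
print(collocation_defect(f, x_k, x_kp1, u_k, u_k, h))   # ~ [0, 0]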
Figure 10.2 - A swing-up trajectory for the simple pendulum (with severe torque
limits) optimized using direct collocation.
Direct collocation also easily solves the swing-up problem for the pendulum,
Acrobot, and cart-pole system. Try it for yourself:
Open in Colab
The direct collocation method of [4] was our first example of explicitly representing
the solution of the optimal control problem as a parameterized trajectory, and adding
constraints to the derivatives at a series of collocation points. In the algorithm
above, the representation of choice was piecewise polynomials, e.g. cubic splines,
and the spline coefficients were the decision variables. A closely related approach,
often called "pseudo-spectral" optimal control, uses the same collocation idea, but
represents the trajectories instead using a linear combination of global, polynomial
basis functions. These methods use typically much higher-degree polynomials, but
can leverage clever parameterizations to write sparse collocation objectives and
to select the collocation points [5, 6]. Interestingly, the continuously-differentiable
nature of the representation in these methods has led to comparatively more
theorems and analysis than we have seen for other direct trajectory optimization
methods [6] -- but despite some of the language used in these articles, please
remember they are still local optimization methods trying to solve a nonconvex
optimization problem. While the direct collocation method above might be expected
to converge to the true optimal solution by adding more segments to the piecewise
polynomial (and having each segment represent a smaller interval of time), here we
expect convergence to happen as we increase the degree of the polynomials.
The pseudo-spectral methods are also sometimes known as "orthogonal collocation"
because the N basis polynomials, ϕ_j(t), are chosen so that at the N collocation
points t_j, we have
$$\phi_i(t_j) = \begin{cases} 1 & i = j, \\ 0 & \text{otherwise}. \end{cases}$$
Note that for both numerical reasons and for analysis, time is traditionally rescaled
from the interval [t_0, t_f] to [−1, 1]. Collocation points are chosen based on small
variations of Gaussian quadrature, known as "Gauss-Lobatto" points, which include
collocation points at t = −1 and t = 1.
The different transcriptions presented above represent different ways to map the
(potentially continuous-time) optimal control problem into a finite set of decision
variables, objectives, and constraints. But even once that choice is made, there are
numerous approaches to solving this optimization problem. Any general approach to
nonlinear programming can be applied here; in the python examples we've included
so far, the problems are handed directly to the sequential-quadratic programming
(SQP) solver SNOPT, or to the interior-point solver IPOPT.
To be concise (and slightly more general), let us define x[n + 1] = f d (x[n], u[n]) as the
discrete-time approximation of the continuous dynamics; for example, the forward
Euler integration scheme used above would give f d (x[n], u[n]) = x[n] + f(x[n], u[n])dt.
Then we have
where the gradient of the state with respect to the inputs can be computed during
the "forward simulation",
These simulation gradients can also be used in the chain rule to provide the
gradients of any constraints. Note that there are a lot of terms to keep around here,
on the order of (state dim) × (control dim) × (number of timesteps). Ouch. Note also
that many of these terms are zero; for instance, with the Euler integration scheme
above, ∂u[n]/∂u_k = 0 if k ≠ n. (If this looks like I'm mixing two notations here, recall
that I'm using u_k to represent the decision variable and u[n] to represent the input
used in the nth step of the simulation.)
1. Simulate forward:
$$x[n+1] = f_d(x[n], u_n),$$
from x[0] = x_0.
2. Calculate backwards:
$$\lambda[n-1] = \left[ \frac{\partial \ell(x[n], u[n])}{\partial x[n]} + \lambda[n]^T \frac{\partial f_d(x[n], u[n])}{\partial x[n]} \right]^T,$$
from
$$\lambda[N-1] = \frac{\partial \ell_f(x[N])}{\partial x[N]}.$$
3. Extract the gradients:
$$\frac{\partial J}{\partial u[n]} = \frac{\partial \ell(x[n], u[n])}{\partial u[n]} + \lambda[n]^T \frac{\partial f(x[n], u[n])}{\partial u[n]},$$
with
$$\frac{\partial J}{\partial u_k} = \sum_n \frac{\partial J}{\partial u[n]} \frac{\partial u[n]}{\partial u_k}.$$
Here λ[n] is a vector the same size as x[n] which has an interpretation as
$$\lambda[n] = \left(\frac{\partial J}{\partial x[n+1]}\right)^T.$$
The equation governing λ is known as the adjoint equation, and it
represents a dramatic efficiency improvement over calculating the huge number of
simulation gradients described above. In case you are interested, yes the adjoint
equation is exactly the backpropagation algorithm that is famous in the neural
networks literature, or more generally a bespoke version of reverse-mode automatic
differentiation.
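Here is a minimal numpy sketch of the three steps above, assuming (as a concrete stand-in) the discrete-time double integrator with quadratic running and final costs; the helper name cost_and_gradient is ours, not Drake's:

import numpy as np

h = 0.1
A = np.array([[1., h], [0., 1.]])
B = np.array([[0.], [h]])
f_d = lambda x, u: A @ x + B @ u
l = lambda x, u: x @ x + u @ u     # running cost
l_f = lambda x: 10.0 * (x @ x)     # final cost

def cost_and_gradient(x0, u_traj):
    # Total cost J and dJ/du[n] via the adjoint (backpropagation) recursion.
    N = len(u_traj)
    x = [x0]
    for n in range(N):                         # 1. simulate forward
        x.append(f_d(x[n], u_traj[n]))
    J = l_f(x[N]) + sum(l(x[n], u_traj[n]) for n in range(N))

    dJdu = np.zeros_like(u_traj)
    lam = 20.0 * x[N]                          # lambda[N-1] = d l_f / d x[N]
    for n in range(N - 1, -1, -1):             # 2. calculate backwards
        dJdu[n] = 2 * u_traj[n] + B.T @ lam    # 3. extract the gradients
        lam = 2 * x[n] + A.T @ lam             # adjoint recursion
    return J, dJdu

x0 = np.array([1.0, 0.0])
u_traj = np.zeros((20, 1))
J, dJdu = cost_and_gradient(x0, u_traj)
print(J, dJdu[:3].ravel())   # compare against finite differences to verify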
So the next phase of your journey is to start trying to "help" the solver along. There
are two common approaches.
The first is tuning your cost function -- some people spend a lot of time adding
new elements to the objective or adjusting the relative weight of the different
components of the objective. This is a slippery slope, and I tend to try to avoid it
(possibly to a fault; other groups tend to put out more compelling videos!).
The second approach is to give a better initial guess to your solver to put it in the
vicinity of the "right" local minimum. I find this approach more satisfying, because
for most problems I think there really is a "correct" formulation for the objective
and constraints, and we should just aim to find the optimal solution. Once again,
we do see a difference here between the direct shooting algorithms and the direct
transcription / collocation algorithms. For shooting, we can only provide the solver
with an initial guess for u(⋅), whereas the other methods allow us to also specify an
initial guess for x(⋅) directly. I find that this can help substantially, even with very
simple initializations. In the direct collocation examples for the swing-up problem of
the Acrobot and cart-pole, simply providing the initial guess for x(⋅) as a straight line
trajectory between the start and the goal was enough to help the solver find a good
solution; in fact it was necessary.
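In pydrake, that initial guess takes only a couple of lines. A minimal sketch (assuming dircol is a DirectCollocation program, x0 and xf are the start and goal states, and T is a rough guess of the duration; the exact import path can vary across Drake versions):

import numpy as np
from pydrake.all import PiecewisePolynomial

def set_straight_line_initial_guess(dircol, x0, xf, T=4.0):
    # Straight-line interpolation from x0 to xf as the initial guess for x(.).
    initial_x = PiecewisePolynomial.FirstOrderHold([0.0, T], np.column_stack((x0, xf)))
    # An empty trajectory for u leaves the input initialization at its default.
    dircol.SetInitialTrajectory(PiecewisePolynomial(), initial_x)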
In fact, there are a few notable approximations that we've already made to get to this
point: the integration accuracy of our trajectory optimization tends to be much less
than the accuracy used during forward simulation (we tend to take bigger time steps
during optimization to avoid adding too many decision variables), and the default convergence tolerance from the optimization toolboxes tends to satisfy the dynamic constraints only to around 10⁻⁶. As a result, if you were to simulate the optimized
control trajectory directly even from the exact initial conditions used in / obtained
from trajectory optimization, you might find that the state trajectory diverges from
your planned trajectory.
There are a number of things we can do about this. It is possible to evaluate
the local stability of the trajectory during the trajectory optimization, and add
a cost or constraint that rewards open-loop stability (e.g. [9, 10]). This can be
very effective (though it does tend to be expensive). But open-loop stability is a
quite restrictive notion. A potentially more generally useful approach is to design a
feedback controller to regulate the system back to the planned trajectory.
The maturity, robustness, and speed of solving trajectory optimization using convex
optimization leads to a beautiful idea: if we can optimize trajectories quickly enough,
then we can use our trajectory optimization as a feedback policy. The recipe is
simple: (1) measure the current state, (2) optimize a trajectory from the current
state, (3) execute the first action from the optimized trajectory, (4) let the dynamics
evolve for one step and repeat. This recipe is known as model-predictive control
(MPC).
One core idea is the concept of receding-horizon MPC. Since our trajectory optimization problems are formulated over a finite horizon, we can think of each optimization as reasoning about the next N timesteps. If our true objective is to optimize the performance over a horizon longer than N (e.g. over the infinite horizon), then it is standard to continue solving for an N-step horizon on each evaluation of the controller. In this sense, the total horizon under consideration continues to move forward in time (i.e. to recede).
The theoretical and practical aspects of Linear MPC are so well understood today
that it is considered the de-facto generalization of LQR for controlling a linear
system subject to (linear) constraints.
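To make the recipe concrete, here is a minimal sketch of one evaluation of a receding-horizon linear MPC controller, written directly as a quadratic program with pydrake's MathematicalProgram (the function name, horizon, and input limit below are my own choices, and the import path can vary across Drake versions):

import numpy as np
from pydrake.all import MathematicalProgram, Solve

def mpc_policy(A, B, Q, R, x_current, N=20, u_max=1.0):
    # Solve an N-step constrained LQR problem from x_current; return only the first input.
    nx, nu = B.shape
    prog = MathematicalProgram()
    x = prog.NewContinuousVariables(N + 1, nx, "x")
    u = prog.NewContinuousVariables(N, nu, "u")
    for i in range(nx):                           # (1) start from the measured state
        prog.AddLinearConstraint(x[0, i] == x_current[i])
    for n in range(N):
        xn, un = x[n], u[n]
        x_next = A.dot(xn) + B.dot(un)
        for i in range(nx):                       # dynamics constraints
            prog.AddLinearConstraint(x[n + 1, i] == x_next[i])
        for i in range(nu):                       # (linear) input constraints
            prog.AddLinearConstraint(un[i] <= u_max)
            prog.AddLinearConstraint(un[i] >= -u_max)
        prog.AddQuadraticCost(xn.dot(Q.dot(xn)) + un.dot(R.dot(un)))
    prog.AddQuadraticCost(x[N].dot(Q.dot(x[N])))  # a simple terminal cost
    result = Solve(prog)                          # (2) optimize a trajectory
    return result.GetSolution(u[0])               # (3) execute only the first action

Steps (1)-(3) in the comments line up with the recipe above; step (4) is simply calling mpc_policy again at the next time step.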
At the outset, this was a daunting task. When birds land on a perch, they pitch up
and expose their wings to an "angle-of-attack" that far exceeds the typical flight
envelope. Airplanes traditionally work hard to avoid this regime because it leads
to aerodynamic "stall" -- a sudden loss of lift after the airflow separates from the
wing. But this loss of lift is also accompanied by a significant increase in drag,
and birds exploit this when they rapidly decelerate to land on a perch. Post-stall
aerodynamics are a challenge for control because (1) the aerodynamics are time-
varying (characterized by periodic vortex shedding) and nonlinear, (2) it is much
harder to build accurate models of this flight regime, at least in a wind tunnel, and
(3) stall implies a loss of attached flow on the wing and therefore on the control
surfaces, so a potentially large reduction in control authority.
We picked the project initially thinking that it would be a nice example for model-
free control (like reinforcement learning -- since the models were unknown). In the
end, however, it turned out to be the project that really taught me about the power
of model-based trajectory optimization and linear optimal control. By conducting
dynamic system identification experiments in a motion capture environment, we
were able to fit both surprisingly simple models (based on flat-plate theory) to the
dynamics[14], and also more accurate models using "neural-network-like" terms to
capture the residuals between the model and the data [19]. This made model-based
control viable, but the dynamics were still complex -- while trajectory optimization
should work, I was quite dubious about the potential for regulating to those
trajectories with only linear feedback.
I was wrong. Over and over again, I watched time-varying linear quadratic
regulators take highly nontrivial corrective actions -- for instance, dipping down
early in the trajectory to gain kinetic energy or tipping up to dump energy out of the
system -- in order to land on the perch at the final time. Although the quality of the
linear approximation of the dynamics did degrade the farther that we got from the
nominal trajectory, the validity of the controller dropped off much less quickly (even
as the vector field changed, the direction that the controller needed to push did not).
This was also the thinking that got me initially so interested in understanding the
regions of attraction of linear control on nonlinear systems.
In the end, the experiments were very successful. We started searching for the
"simplest" aircraft that we could build that would capture the essential control
dynamics, reduce complexity, and still accomplish the task. We ended up building
a series of flat-plate foam gliders (no propeller) with only a single actuator to
control the elevator. We added dihedral to the wings to help the aircraft stay in the
longitudinal plane. The simplicity of these aircraft, plus the fact that they could be
understood through the lens of quite simple models, makes them one of my favorite
canonical underactuated systems.
Figure 10.4 - The original perching experiments from [14] in a motion capture
arena with a simple rope acting as the perch. The main videos were shot with high-
speed cameras; an entire perching trajectory takes about 0.8 seconds.
10.5.1 The Flat-Plate Glider Model
Figure 10.5 - The flat-plate glider model. Note that traditionally aircraft coordinates
are chosen so that positive pitch brings the nose up; but this means that positive z
is down (to have a right-handed system). For consistency, I've chosen to stick with
the vehicle coordinate system that we use throughout the text -- positive z is up, but
positive pitch is down.
In our experiments, we found the dynamics of our aircraft were captured very well
by the so-called "flat plate model" [14]. In flat plate theory lift and drag forces of a
wing are summarized by a single force at the center-of-pressure of the wing acting
normal to the wing, with magnitude:
where ρ is the density of the air, S is the area of the wing, α is the angle of attack
of the surface, n is the normal vector of the lifting surface, and v is the velocity of
the center of pressure relative to the air. This corresponds to having lift and drag
coefficients. The centers of pressure of the wing and elevator are located at
p_w = p − l_w [c_θ; −s_θ],    p_e = p − l_h [c_θ; −s_θ] − l_e [c_{θ+ϕ}; −s_{θ+ϕ}],
where the origin of our vehicle coordinate system, p = [x, z]ᵀ, is chosen to be the center of mass. We assume still air, so v = ṗ for both the wing and the elevator. We assume that the elevator is massless, and that the actuator controls the elevator velocity directly (note that this breaks our "control affine" structure, but is more realistic for the tiny hobby servos we were dealing with). This gives us a total of 7 state variables, x = [x, z, θ, ϕ, ẋ, ż, θ̇]ᵀ, and one control input, u = ϕ̇. The resulting equations of motion are:
n_w = [s_θ; c_θ],    n_e = [s_{θ+ϕ}; c_{θ+ϕ}],
f_w = f_n(S_w, n_w, ṗ_w),    f_e = f_n(S_e, n_e, ṗ_e),
ẍ = (1/m)(f_w s_θ + f_e s_{θ+ϕ}),
z̈ = (1/m)(f_w c_θ + f_e c_{θ+ϕ}) − g,
θ̈ = (1/I)(l_w f_w + (l_h c_ϕ + l_e) f_e).
Open in Colab
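For readers who prefer code to equations, here is a numpy sketch of these dynamics (this is not the notebook's implementation). The parameter names and the helper flat_plate_force(S, n, v), which should return the normal-force magnitude f_n described above, are my own naming choices:

import numpy as np

def glider_dynamics(x, u, params, flat_plate_force):
    # x = [x, z, theta, phi, xdot, zdot, thetadot], u = phidot (elevator-velocity command).
    m, I, g = params["m"], params["I"], params["g"]
    Sw, Se = params["Sw"], params["Se"]
    lw, lh, le = params["lw"], params["lh"], params["le"]
    _, _, th, phi, xd, zd, thd = x
    st, ct = np.sin(th), np.cos(th)
    stp, ctp = np.sin(th + phi), np.cos(th + phi)
    # surface normals
    nw = np.array([st, ct])
    ne = np.array([stp, ctp])
    # center-of-pressure velocities (still air, so v = pdot), from differentiating p_w, p_e
    pdot = np.array([xd, zd])
    pw_dot = pdot - lw * thd * np.array([-st, -ct])
    pe_dot = (pdot - lh * thd * np.array([-st, -ct])
              - le * (thd + u) * np.array([-stp, -ctp]))
    fw = flat_plate_force(Sw, nw, pw_dot)
    fe = flat_plate_force(Se, ne, pe_dot)
    xdd = (fw * st + fe * stp) / m
    zdd = (fw * ct + fe * ctp) / m - g
    thdd = (lw * fw + (lh * np.cos(phi) + le) * fe) / I
    return np.array([xd, zd, thd, u, xdd, zdd, thdd])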
The linear controller around a nominal trajectory was surprisingly effective, but it's
not enough. We'll return to this example again when we talk about "feedback motion
planning", in order to discuss how to find a controller that can work for many more
initial conditions -- ideally all of the initial conditions of interest for which the aircraft
is capable of getting to the goal.
What precisely does it mean for a trajectory, x(⋅), u(⋅), to be locally optimal? It means
that if I were to perturb that trajectory in any way (e.g. change u 3 by ϵ), then I
would either incur higher cost in my objective function or violate a constraint. For
an unconstrained optimization, a necessary condition for local optimality is that the
gradient of the objective at the solution be exactly zero. Of course the gradient
can also vanish at local maxima or saddle points, but it certainly must vanish at
local minima. We can generalize this argument to constrained optimization using
Lagrange multipliers.
Let us use Lagrange multipliers to derive the necessary conditions for our
constrained trajectory optimization problem in discrete time
min_{x[⋅], u[⋅]}  ℓ_f(x[N]) + Σ_{n=0}^{N−1} ℓ(x[n], u[n]),
subject to  x[n + 1] = f_d(x[n], u[n]).
We can form the Lagrangian
L(x[⋅], u[⋅], λ[⋅]) = ℓ_f(x[N]) + Σ_{n=0}^{N−1} [ ℓ(x[n], u[n]) + λᵀ[n] (f_d(x[n], u[n]) − x[n + 1]) ],
and set the derivatives to zero to obtain the adjoint equation method described for the shooting algorithm above:
∀n ∈ [0, N − 1]:  ∂L/∂λ[n] = f_d(x[n], u[n]) − x[n + 1] = 0   ⇒   x[n + 1] = f_d(x[n], u[n]),
∀n ∈ [0, N − 1]:  ∂L/∂x[n] = ∂ℓ(x[n], u[n])/∂x + λᵀ[n] ∂f_d(x[n], u[n])/∂x − λᵀ[n − 1] = 0
   ⇒   λ[n − 1] = ∂ℓ(x[n], u[n])ᵀ/∂x + ∂f_d(x[n], u[n])ᵀ/∂x λ[n],
∂L/∂x[N] = ∂ℓ_f/∂x − λᵀ[N − 1] = 0   ⇒   λ[N − 1] = ∂ℓ_fᵀ/∂x,
∀n ∈ [0, N − 1]:  ∂L/∂u[n] = ∂ℓ(x[n], u[n])/∂u + λᵀ[n] ∂f_d(x[n], u[n])/∂u = 0.
You won't be surprised to hear that these necessary conditions have an analogue in
continuous time. I'll state it here without derivation. Given the initial conditions, x 0 ,
a continuous dynamics, ẋ = f(x, u), and the instantaneous cost ℓ(x, u), for a trajectory
x(⋅), u(⋅) defined over t ∈ [t 0 , t f ] to be optimal it must satisfy the conditions that
∀t ∈ [t_0, t_f]:  ẋ* = f(x*, u*),  x*(0) = x_0,
∀t ∈ [t_0, t_f]:  −λ̇* = ∂ℓ(x*, u*)ᵀ/∂x + ∂f(x*, u*)ᵀ/∂x λ*,  λ*(T) = ∂ℓ_fᵀ/∂x,
∀t ∈ [t_0, t_f]:  u* = argmin_{u∈U} [ ℓ(x*, u) + (λ*)ᵀ f(x*, u) ].
Note that the terms which are minimized in the final line of the theorem are commonly referred to as the Hamiltonian of the optimal control problem, H(x, u, λ) = ℓ(x, u) + λᵀf(x, u). Remembering that λ has an interpretation as ∂J/∂xᵀ makes it relatively straightforward to relate these conditions back to the HJB.
There are some very important cases where nonconvex trajectory optimization can
be turned back into convex trajectory optimization based on a clever change of
variables. One of the key ideas in this space is the notion of "differential flatness",
which is philosophically quite close to the idea of partial feedback linearization
that we discussed for acrobots and cart-pole systems. But perhaps the most famous
application of differential flatness, which we will explore here, is actually for
quadrotors.
One of the most important lessons from partial feedback linearization is the idea
that if you have m actuators, then you basically get to control exactly m quantities of
your system. Differential flatness exploits this same idea (choosing m outputs), but in
the opposite direction. The idea is that, for some systems, if you give me a trajectory
in those m coordinates, it may in fact dictate what all of the states/actuators must
have been doing. The fact that you can only execute a subset of possible trajectories
can, in this case, make trajectory planning much easier!
To see this, recall that the equations of motion for this system were given by
m ẍ = −(u_1 + u_2) sin θ,   (1)
m ÿ = (u_1 + u_2) cos θ − m g,   (2)
I θ̈ = r (u_1 − u_2),   (3)
so that
−m ẍ / (m ÿ + m g) = (u_1 + u_2) sin θ / ((u_1 + u_2) cos θ) = tan θ.
In words, given ẍ(t) and ÿ(t), I can solve for θ(t). I can differentiate this relationship (in time) twice more to obtain θ̈(t). Using (3) with (1) or (2) then gives us two linear equations for u_1 and u_2 which are easily solved. Now you can see why we need the original trajectories to be smooth -- the solution for u(t) depends on θ̈(t), which depends on d⁴x(t)/dt⁴ and d⁴y(t)/dt⁴; we need those derivatives to exist along the entire trajectory.
What's more -- if we ignore input limits for a moment -- any sufficiently smooth
trajectory of x(t), y(t) is feasible, so if I can simply find one that avoids the
obstacles, then I have designed my state trajectory. As we will see, optimizing even
high-degree piecewise-polynomials is actually an easy problem (it works out to be
a quadratic program), assuming the constraints are convex. In practice, this means
that once you have determined whether to go left or right around the obstacles,
trajectory design is easy and fast.
Open in Colab
Figure 10.6 - Differential flatness for the planar quadrotor -- by solving a simple
optimization to find a smooth trajectory for x(t) and y(t), I can back out θ(t) and
even u(t).
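The back-substitution is short enough to write out explicitly. Here is a numpy sketch that recovers θ and u = [u_1, u_2] at a single instant from the derivatives of the flat outputs, assuming the planar-quadrotor model (1)-(3) above with mass m, inertia I, and rotor moment arm r (the function and argument names are mine):

import numpy as np

def flat_outputs_to_state_and_input(x_d, y_d, m, I, r, g=9.81):
    # x_d[k] and y_d[k] hold the k-th time derivatives of x(t) and y(t), up to k = 4.
    a, b = -x_d[2], y_d[2] + g            # tan(theta) = a / b, from (1) and (2)
    theta = np.arctan2(a, b)
    thrust = m * np.hypot(a, b)           # u1 + u2
    # differentiate theta = atan2(a, b) twice (chain rule) using the 3rd/4th derivatives
    ad, bd = -x_d[3], y_d[3]
    add, bdd = -x_d[4], y_d[4]
    den = a**2 + b**2
    thetadot = (ad * b - a * bd) / den
    thetaddot = (add * b - a * bdd) / den - 2 * thetadot * (a * ad + b * bd) / den
    diff = I * thetaddot / r              # u1 - u2, from (3)
    return theta, np.array([0.5 * (thrust + diff), 0.5 * (thrust - diff)])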
The example above demonstrates that the planar quadrotor system is differentially
flat in the outputs x(t), y(t). More generally, if we have a dynamical system
ẋ = f(x, u),    z(t) = h(x, u, du/dt, . . . , dᵏu/dtᵏ),
such that we can write x and u purely as functions of the output and its time derivatives,
x(t) = x(z, dz/dt, . . . , dᵏz/dtᵏ),    u(t) = u(z, dz/dt, . . . , dᵏz/dtᵏ),
then we say that the system f is differentially flat in the outputs z [21]. And the
requirement for showing that a system is differentially flat in those outputs is to
write the function which solves for x(t) and u(t) as a function of only z(t) and its time
derivatives.
I'm not aware of any numerical recipes for showing that a system is differentially flat
nor for finding potential flat outputs, but I admit I haven't worked on it nor looked for
those recipes. That would be interesting! I think of differential flatness as a property
that one must find for your system -- typically via a little mechanical intuition and a
lot of algebra. Once found, it can be very useful.
Example 10.4 (Differential flatness for the 3D
Quadrotor)
Probably the most famous example of differential flatness is on the full 3D
quadrotors. [22] showed that the 3D quadrotor is differentially flat in the outputs
{x, y, z, θ yaw }. They used this, to dramatic effect, to perform all sorts of acrobatics.
The resulting videos are awesome (and probably deserve a lot of credit for the
widespread popularity of quadrotors in academia and industry over the next few
years).
A few things to note about these examples, just so we also understand the
limitations. First, the technique above is great for designing trajectories, but additional work is required to stabilize those trajectories (we'll cover that topic
in more detail later in the notes). Trajectory stabilization benefits greatly from
good state estimation, and the examples above were all performed in a motion
capture arena. Also, the "simple" version of the trajectory design that is fast
enough to be used for flying through moving hoops is restricted to convex
optimization formulations -- which means one has to hard-code a priori the decisions about whether to go left or right / up or down around each obstacle.
The motivation behind iterative LQR is quite appealing -- it leverages the surprising
structure of the Riccati equations to come up with a second-order update to the
trajectory after a single backward pass of the Riccati equation followed by a forward
simulation. Anecdotally, the convergence is fast and robust. Numerous
applications...[25, 26].
10.8 EXERCISES
REFERENCES
1. John T. Betts, "Survey of numerical methods for trajectory optimization",
Journal of Guidance, Control, and Dynamics, vol. 21, no. 2, pp. 193-207,
1998.
3. Yang Wang and Stephen Boyd, "Fast model predictive control using online
optimization", IEEE Transactions on control systems technology, vol. 18, no.
2, pp. 267--278, 2009.
5. Divya Garg and Michael Patterson and Camila Francolin and Christopher
Darby and Geoffrey Huntington and William Hager and Anil Rao, "Direct
trajectory optimization and costate estimation of finite-horizon and
infinite-horizon optimal control problems using a Radau pseudospectral
method", Computational Optimization and Applications, vol. 49, pp.
335-358, 2011.
9. Katja D. Mombaur and Hans Georg Bock and Johannes P. Schloder and
Richard W. Longman, "Open-loop stable solutions of periodic optimal
control problems in robotics", Z. Angew. Math. Mech. (ZAMM), vol. 85, no.
7, pp. 499 – 515, 2005.
11. Carlos E Garcia and David M Prett and Manfred Morari, "Model predictive
control: theory and practice—a survey", Automatica, vol. 25, no. 3, pp.
335--348, 1989.
13. Alberto Bemporad and Manfred Morari, "Robust model predictive control:
A survey", Robustness in identification and control, vol. 245, pp. 207-226,
1999.
14. Rick Cory and Russ Tedrake, "Experiments in Fixed-Wing UAV Perching", Proceedings of the AIAA Guidance, Navigation, and Control Conference, pp. 1-12, 2008. [ link ]
15. John W. Roberts and Rick Cory and Russ Tedrake, "On the Controllability of Fixed-Wing Perching", Proceedings of the American Control Conference (ACC), 2009. [ link ]
17. Joseph Moore, "Powerline Perching with a Fixed-wing UAV", May, 2011. [ link ]
18. Joseph Moore and Russ Tedrake, "Control Synthesis and Verification for a Perching UAV using LQR-Trees", In Proceedings of the IEEE Conference on Decision and Control, pp. 8, December, 2012. [ link ]
19. Joseph Moore, "Robust Post-Stall Perching with a Fixed-Wing UAV", PhD
thesis, Massachusetts Institute of Technology, September, 2014. [ link ]
21. Richard M. Murray and Muruhan Rathinam and Willem Sluis, "Differential
flatness of mechanical control systems: A catalog of prototype systems",
ASME international mechanical engineering congress and exposition, 1995.
24. Athanasios Sideris and James E. Bobrow, "A Fast Sequential Linear
Quadratic Algorithm for Solving Unconstrained Nonlinear Optimal Control
Problems", , February, 2005.
25. Farbod Farshidian and Edo Jelavic and Asutosh Satapathy and Markus
Giftthaler and Jonas Buchli, "Real-time motion planning of legged robots:
A model predictive control approach", 2017 IEEE-RAS 17th International
Conference on Humanoid Robotics (Humanoids), pp. 577--584, 2017.
26. Y. Tassa and T. Erez and E. Todorov, "Synthesis and stabilization of complex
behaviors through online trajectory optimization", Intelligent Robots and
Systems (IROS), 2012 IEEE/RSJ International Conference on, pp.
4906--4913, 2012.
29. Daniel Mellinger and Alex Kushleyev and Vijay Kumar, "Mixed-integer
quadratic program trajectory generation for heterogeneous quadrotor
teams", 2012 IEEE international conference on robotics and automation,
pp. 477--483, 2012.
30. Robin Deits and Russ Tedrake, "Efficient Mixed-Integer Planning for UAVs in Cluttered Environments", Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2015.
[ link ]
31. Benoit Landry and Robin Deits and Peter R. Florence and Russ Tedrake,
"Aggressive Quadrotor Flight through Cluttered Environments Using Mixed
Integer Programming", Proceedings of the International Conference on
Robotics and Automation (ICRA), May, 2016. [ link ]
32. Tobia Marcucci and Jack Umenberger and Pablo A. Parrilo and Russ
Tedrake, "Shortest Paths in Graphs of Convex Sets", arXiv preprint, 2021.
[ link ]
33. O. Junge and J. E. Marsden and S. Ober-Bloebaum, "Discrete mechanics and
optimal control", Proceedings of the 16th IFAC World Congress, 2005.
CHAPTER 11
Policy Search
In our case study on perching aircraft, we solved a challenging control problem, but
our approach to control design was based on only linear optimal control (around an
optimized trajectory). We've also discussed some approaches to nonlinear optimal
control that could scale beyond small, discretized state spaces. These were based
on estimating the cost-to-go function, including value iteration using function
approximation and approximate dynamic programming as a linear program or as a
sums-of-squares program.
There are a lot of things to like about methods that estimate the cost-to-go function
(aka value function). The cost-to-go function reduces the long-term planning
problem into a one-step planning problem; it encodes all relevant information about
the future (and nothing more). The HJB gives us optimality conditions for the cost-
to-go that give us a strong algorithmic handle to work with.
In this chapter, we will explore another very natural idea: let us parameterize
a controller with some decision variables, and then search over those decision
variables directly in order to achieve a task and/or optimize a performance objective.
One specific motivation for this approach is the (admittedly somewhat anecdotal)
observation that often times simple policies can perform very well on complicated
robots and with potentially complicated cost-to-go functions.
We'll refer to this broad class of methods as "policy search" or, when optimization
methods are used, we'll sometimes use "policy optimization". This idea has not
received quite as much attention from the controls community, probably because we
know many relatively simple cases where it does not work well. But it has become
very popular again lately due to the empirical success of "policy gradient" algorithms
in reinforcement learning (RL). This chapter includes a discussion of the "model-
based" version of these RL policy-gradient algorithms; we'll describe their "model-
free" versions in a future chapter.
Using our prescription for optimal control using additive costs, we can evaluate the
performance of this controller from any initial condition using, e.g.:
J_α(x) = ∫₀^∞ ℓ(x(t), u(t)) dt,
subject to  ẋ = f(x, u),  u = π_α(x),  x(0) = x.
In order to provide a scalar cost for each set of policy parameters, α, we need one
more piece: a relative importance for the different initial conditions.
First, let's evaluate our objective for a given K. This step is known as "policy
evaluation". If we use a Gaussian with mean zero and covariance Ω as our
distribution over initial conditions, then for LQR we have
E[ ∫₀^∞ (xᵀQx + uᵀRu) dt ],
subject to  ẋ = Ax + Bu,  u = −Kx,  x(0) ∼ N(0, Ω).
This problem is also known as the H₂ optimal control problem [1], since the expected cost-to-go here is the H₂-norm of the linear system from the disturbance input (here only an impulse that generates our initial conditions) to the performance output (which here is z = [Q^{1/2}x; R^{1/2}u]).
To evaluate a policy K, let us first rearrange the cost function slightly, using the properties of the matrix trace:
E[xᵀQx + uᵀRu] = E[ tr((Q + KᵀRK) x xᵀ) ] = tr( (Q + KᵀRK) E[x xᵀ] ).
For the closed-loop system, we have
E[x(t)x(t)ᵀ] = e^{(A−BK)t} E[x(0)x(0)ᵀ] e^{(A−BK)ᵀt} = e^{(A−BK)t} Ω e^{(A−BK)ᵀt},
which is just a (symmetric) matrix function of t. The integral of this function, call it X, represents the expected 'energy' of the closed-loop response:
X = E[ ∫₀^∞ x xᵀ dt ].
(You can see a closely related derivation here.) Finally, the total policy evaluation is given by
E[J_K] = tr( (Q + KᵀRK) X ),
where X satisfies the Lyapunov equation
(A − BK)X + X(A − BK)ᵀ + Ω = 0.
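As a sanity check, this policy evaluation is only a few lines of Python. A minimal sketch (not from the notes' notebooks), valid whenever A − BK is stable:

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lqr_policy_evaluation(K, A, B, Q, R, Omega):
    # Expected cost E[J_K] for u = -Kx with x(0) ~ N(0, Omega).
    Acl = A - B @ K
    # X = E[integral of x x^T dt] solves (A-BK) X + X (A-BK)^T + Omega = 0.
    X = solve_continuous_lyapunov(Acl, -Omega)
    return np.trace((Q + K.T @ R @ K) @ X)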
Unfortunately, the Lyapunov equation represents a nonlinear constraint (on the pair K, X). Indeed, it is well known that even the set of controllers that stabilize a linear system can be nonconvex in the parameters K when there are 3 or more state variables [2].
For example, the gains
K₁ = ⎡  1   0  −10 ⎤        K₂ = ⎡  1  −10   0 ⎤
     ⎢ −1   1    0 ⎥   and       ⎢  0    1   0 ⎥
     ⎣  0   0    1 ⎦             ⎣ −1    0   1 ⎦
are both stabilizing controllers for this system (all eigenvalues of A − BK are inside the unit circle of the complex plane). However, the controller K̂ = (K₁ + K₂)/2 has two eigenvalues outside the unit circle.
Since the set of controllers that achieve finite total cost is non-convex, clearly the
cost function we consider here is also non-convex.
As an aside, for this problem we do actually know a change of variables that makes the problem convex. Let's introduce a new variable Y = KX. Since X ≻ 0, we can back out K = YX⁻¹. Now we can rewrite the optimization:
min_{X, Y}  tr( Q^{1/2} X Q^{1/2} ) + tr( R^{1/2} Y X⁻¹ Yᵀ R^{1/2} )
subject to  AX − BY + XAᵀ − YᵀBᵀ + Ω = 0,
            X ≻ 0.
The second term in the objective appears to be nonconvex, but is actually convex. In order to write it as an SDP, we can replace it exactly with one more slack variable, Z, and a Schur complement [3]:
min_{X, Y, Z}  tr( Q^{1/2} X Q^{1/2} ) + tr Z
subject to  AX − BY + XAᵀ − YᵀBᵀ + Ω = 0,
            ⎡ Z            R^{1/2} Y ⎤
            ⎣ Yᵀ R^{1/2}        X    ⎦  ⪰ 0.
Nevertheless, our original question is asking about searching directly in the original parameterization, K. If the objective is nonconvex in those parameters, then how should we perform the search?
How does one show that an optimization landscape has no local minima (even
though it may be non-convex)? One of the most popular tools is to demonstrate
gradient dominance with the famous Polyak-Łojasiewicz (PL) inequality [5]. For an
optimization problem
min f(x)
x∈R d
Then we say that the function satisfies the PL-inequality if the following holds for
some μ > 0:
1
∀x, ∥∇f(x)∥ 2F ≥ μ(f(x) − f ∗ ),
2
where f* is the value obtained at the optimum. In words, the gradient of the objective must grow faster than the gradient of a quadratic function. Note, however, that the distance here is measured in f(x), not x; we do not require (nor imply) that the optimal solution be unique. The inequality clearly implies that any minimum, x′ with ∇f(x′) = 0, is also a global optimum: since the left-hand side is zero, the right-hand side must also be zero, so f(x′) = f*.
Consider, for instance, the (nonconvex) scalar function f(x) = x² + 3 sin²(x), which achieves its minimum f* = 0 at x = 0. It satisfies the PL inequality
(1/2) (2x + 6 sin(x) cos(x))² ≥ μ (x² + 3 sin²(x)),
with μ = 0.175. (I confirmed this with a small dReal program.)
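A quick numerical spot-check of the same inequality on a dense grid (unlike the dReal program, this is a sanity check rather than a proof):

import numpy as np

x = np.linspace(-10, 10, 200001)
f = x**2 + 3 * np.sin(x)**2                  # f* = 0, achieved at x = 0
grad = 2 * x + 6 * np.sin(x) * np.cos(x)
print(np.all(0.5 * grad**2 >= 0.175 * f))    # True on this grid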
[5] gives a rate of convergence to the optimum for gradient descent given the PL conditions. [4] showed that the gradients of the LQR cost we examine here
with respect to K satisfy the PL conditions on any sublevel set of the cost-to-go
function.
11.2.4 True gradient descent
The results described above suggest that one can use gradient descent to obtain the
optimal controller, K*, for LQR. For the variations we've seen so far (where we know the model), I would absolutely recommend solving the Riccati equations instead; it is a much better algorithm, faster and more robust, with no parameters like step size to tune. But gradient descent becomes more interesting / viable when we think
of it as a model for a less perfect algorithm, e.g. where the plant model is not given
and the gradients are estimated from noisy samples.
It is a rare luxury, due here to our ability to integrate the linear plants/controllers,
quadratic costs, and Gaussian initial conditions, that we could compute the value
function exactly in (1). We can also compute the true gradient -- this is a pinnacle
of exactness we should strive for in our methods but will rarely achieve again. The
gradient is given by
∂E[J_K(x)]/∂K = 2(RK − BᵀP)X,
where P satisfies another Lyapunov equation:
(A − BK)ᵀP + P(A − BK) + Q + KᵀRK = 0.
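Putting the pieces together, one step of this "true" gradient descent is only a few lines (a sketch under the same assumptions as the policy-evaluation snippet above; the step size is a tuning parameter, and A − BK must remain stable along the way):

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lqr_gradient_step(K, A, B, Q, R, Omega, step_size=1e-3):
    Acl = A - B @ K
    X = solve_continuous_lyapunov(Acl, -Omega)                # closed-loop "energy"
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))  # cost-to-go matrix
    grad = 2 * (R @ K - B.T @ P) @ X                          # exact gradient from above
    return K - step_size * grad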
Note that the term policy gradient used in reinforcement learning typically refers to
the slightly different class of algorithms I hinted at above. In those algorithms, we
use the true gradients of the policy (only), but estimate the remainder of the terms
in the gradient through sampling. These methods typically require many samples to
estimate the gradients we compute here, and should only be weaker (less efficient)
than the algorithms in this chapter. The papers investigating the convergence of
gradient descent for LQR have also started exploring these cases. We will study
these so-called "model-free" policy search algorithms soon.
LQR / H 2 control is one of the good cases, where we know that for the objective
parameterized directly in K, all local optima are global optima. [6] extended this
result for mixed H 2 /H ∞ control. [7] gives a recent treatment of the tabular (finite)
MDP case.
For LQR, we also know alternative parameterizations of the controller which make
the objective actually convex, including the LMI formulation and the Youla
parameterization. Their utility in a policy search setting was studied initially in [8].
Unfortunately, we do not expect these nice properties to hold in general. There are
a number of nearby problems which are known to be nonconvex in the original
parameters. The case of static output feedback is an important one. If we extend
our plant model to include (potentially limited) observations: ẋ = Ax + Bu, y = Cx,
then searching directly over controllers, u = −Ky, is known to be NP-hard[9]. This
time, the set of stabilizing K matrices may be not only nonconvex, but actually
disconnected. We can see that with a simple example (given to me once during a
conversation with Alex Megretski).
ẋ = Ax + Bu,   y = Cx,
with
A = ⎡ 0 0 2 ⎤    B = ⎡ 1 ⎤    C = [ 1  1  3 ].
    ⎢ 1 0 0 ⎥        ⎢ 0 ⎥
    ⎣ 0 1 0 ⎦        ⎣ 0 ⎦
11.4 TRAJECTORY-BASED POLICY SEARCH
The quantity we need to estimate is
∂/∂α E_{x∼X₀}[J_α(x)].
In the LQR problem, we were able to compute these terms exactly; with the biggest
simplification coming from the fact that the response of a linear system to Gaussian
initial conditions stays Gaussian. This is not true for more general nonlinear systems.
The natural alternative is Monte-Carlo estimation:
E_{x∼X₀}[J_α(x)] ≈ (1/N) Σ_{i=0}^{N−1} J_α(x_i),   x_i ∼ X₀,
∂/∂α E_{x∼X₀}[J_α(x)] ≈ (1/N) Σ_{i=0}^{N−1} ∂J_α(x_i)/∂α,   x_i ∼ X₀.
Our confidence in the accuracy of this estimate will improve as we increase N ; see
e.g. Section 4.2 of [11] for details on the confidence intervals. The most optimistic
among us will say that it's quite fine to have only a noisy estimate of the gradient
-- this leads to stochastic gradient descent which can have some very desirable
properties. But it does make the algorithm harder to analyze and debug.
Using the Monte-Carlo estimator, the total gradient update is just a sum over gradients with respect to particular initial conditions, ∂J_α(x_i)/∂α. But for finite-horizon objectives, these are precisely the gradients that we have already studied in the context of trajectory optimization. They can be computed efficiently using an adjoint method. The difference is that here we think of α as the parameters of a feedback controller, whereas before we thought of them as the parameters of a trajectory, but this makes no difference to the chain rule.
(Block diagram from the example: the plant and the policy π are wired together in closed loop, with a running-cost system feeding an integrator that accumulates the total cost.)
Open in Colab
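As a simple stand-in for the diagram-based example, here is a Monte-Carlo gradient estimator in plain numpy. It uses central finite differences for the per-sample gradients instead of the adjoint method described above, and all of the function names are my own:

import numpy as np

def estimate_policy_gradient(rollout_cost, alpha, sample_x0, N=100, eps=1e-4, rng=None):
    # rollout_cost(alpha, x0): simulate the closed-loop system from x0 with policy
    # parameters alpha and return the finite-horizon cost J_alpha(x0).
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.asarray(alpha, dtype=float)
    grad = np.zeros_like(alpha)
    for _ in range(N):
        x0 = sample_x0(rng)
        for j in range(alpha.size):
            dalpha = np.zeros_like(alpha)
            dalpha[j] = eps
            grad[j] += (rollout_cost(alpha + dalpha, x0)
                        - rollout_cost(alpha - dalpha, x0)) / (2 * eps)
    return grad / N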
Coming soon...
REFERENCES
1. Ben M. Chen, "H2 Optimal Control", Encyclopedia of Systems and Control, pp. 515--520, 2015.
2. Maryam Fazel and Rong Ge and Sham M. Kakade and Mehran Mesbahi,
"Global Convergence of Policy Gradient Methods for the Linear Quadratic
Regulator", International Conference on Machine Learning, 2018.
3. Erik A Johnson and Baris Erkus, "Dissipativity and performance analysis
of smart dampers via LMI synthesis", Structural Control and Health
Monitoring: The Official Journal of the International Association for
Structural Control and Monitoring and of the European Association for the
Control of Structures, vol. 14, no. 3, pp. 471--496, 2007.
5. Hamed Karimi and Julie Nutini and Mark Schmidt, "Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition", Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795--811, 2016.
6. Kaiqing Zhang and Bin Hu and Tamer Basar, "Policy Optimization for ℋ₂
Linear Control with ℋ∞ Robustness Guarantee: Implicit Regularization and
Global Convergence", Proceedings of the 2nd Conference on Learning for
Dynamics and Control, vol. 120, pp. 179--190, 10--11 Jun, 2020.
7. Alekh Agarwal and Sham M. Kakade and Jason D. Lee and Gaurav Mahajan,
"On the Theory of Policy Gradient Methods: Optimality, Approximation, and
Distribution Shift", , 2020.
8. John Roberts and Ian Manchester and Russ Tedrake, "Feedback Controller
Parameterizations for Reinforcement Learning", Proceedings of the 2011
IEEE Symposium on Adaptive Dynamic Programming and Reinforcement
Learning (ADPRL), 2011. [ link ]
10. Jalaj Bhandari and Daniel Russo, "Global Optimality Guarantees For Policy Gradient Methods", arXiv:1906.01786 [cs, stat], oct, 2020.
11. Reuven Y Rubinstein and Dirk P Kroese, "Simulation and the Monte Carlo
method",John Wiley \& Sons , vol. 10, 2016.
CHAPTER 12
Motion Planning as Search
The term "motion planning" is a hopelessly general term which almost certainly
encompasses the dynamic programming, feedback design, and trajectory
optimization algorithms that we have already discussed. However, there are a
number of algorithms and ideas that we have not yet discussed which have grown
from the idea of formulating motion planning as a search problem -- for instance
searching for a path from a start to a goal in a graph which is too large solve
completely with dynamic programming. Some, but certainly not all, of these
algorithms sacrifice optimality in order to find any path if it exists, and the notion
of a planner being "complete" -- guaranteed to find a path if one exists -- is highly
valued. This is precisely our goal for this chapter, to add some additional tools that
will be able to provide some form of solutions for our most geometrically complex,
highly non-convex, robot control problems.
[1] is a very nice book on planning algorithms in general and on motion planning
algorithms in particular. Compared to other planning problems, motion planning
typically refers to problems where the planning domain is continuous (e.g.
continuous state space, continuous action space), but many motion planning
algorithms trace their origins back to ideas in discrete domains (e.g., graph search).
For this chapter, we will consider the following problem formulation: given a system
defined by the nonlinear dynamics (in continuous- or discrete-time)
and given a start state x(0) = x 0 and a goal region G, find any finite-time trajectory
from x_0 to x ∈ G if such a trajectory exists.
Interpolation can work well if you are trying to solve for the cost-to-go function over
the entire state space, but it's less compatible with search methods which are trying
to find just a single path through the space. If I start in node 1, and land between
node 2 and node 3, then which node do I continue to expand from?
One approach to avoiding this problem is to build a search tree as the search
executes, instead of relying on a predefined mesh discretization. This tree will
contains nodes rooted in the continuous space at exactly the points where system
can reach.
Another problem with any fixed mesh discretization of a continuous space, or even a fixed discretization of the action space, is that unless we have specific geometric / dynamic insights into our continuous system, it is very difficult to provide
a complete planning algorithm. Even if we can show that no path to the goal exists
on the tree/graph, how can we be certain that there is no path for the continuous
system? Perhaps a solution would have emerged if we had discretized the system
differently, or more finely?
One approach to addressing this second challenge is to toss out the notion of fixed
discretizations, and replace them with random sampling (another approach would
be to adaptively add resolution to the discretization as the algorithm runs). Random
sampling, e.g. of the action space, can yield algorithms that are probabilistically
complete for the continuous space -- if a solution to the problem exists, then a
probabilistically complete algorithm will find that solution with probability 1 as the
number of samples goes to infinity.
With these motivations in mind, we can build what is perhaps the simplest
probabilistically complete algorithm for finding a path from a starting state to some goal region within a continuous state and action space:
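The algorithm listing itself is not reproduced here; the following is a minimal sketch of the idea (all names, the Euler integration step, and the interfaces are my own choices): repeatedly pick a random node already in the tree, apply a random action for a short duration, and add the resulting state as a new node.

import numpy as np

def random_tree_planner(f, x0, in_goal, sample_u, dt=0.01, max_iters=10000, rng=None):
    # f(x, u): continuous dynamics xdot; in_goal(x): True if x is in the goal region;
    # sample_u(rng): draw a random action.
    rng = np.random.default_rng() if rng is None else rng
    nodes = [np.asarray(x0, dtype=float)]
    parents = [None]
    for _ in range(max_iters):
        i = int(rng.integers(len(nodes)))          # pick an existing node at random
        u = sample_u(rng)                          # pick a random action
        x_next = nodes[i] + dt * f(nodes[i], u)    # one (Euler) step of forward simulation
        nodes.append(x_next)
        parents.append(i)
        if in_goal(x_next):                        # extract the path by walking back up
            path, j = [x_next], i
            while j is not None:
                path.append(nodes[j])
                j = parents[j]
            return path[::-1]
    return None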
We're nowhere close to the goal yet, and it's not exactly a hard problem.
While the idea of generating a tree of feasible points has clear advantages, we have
lost the ability to cross off a node (and therefore a region of space) once it has been
explored. It seems that, to make randomized algorithms effective, we are going to at
the very least need some form of heuristic for encouraging the nodes to spread out
and explore the space.
12.2.1 Rapidly-Exploring Random Trees (RRTs)
Kinodynamic-RRT*, LQR-RRT(*)
12.2.4 Discussion
Cell decomposition...
Mixed-integer planning.
a. Implement RRT
b. Implement RRT*
REFERENCES
1. Steven M. LaValle, "Planning Algorithms",Cambridge University Press ,
2006.
CHAPTER 13
Feedback Motion Planning
CHAPTER 14
Robust and Stochastic
Control
So far in the notes, we have concerned ourselves with only known, deterministic
systems. In this chapter we will begin to consider uncertainty. This uncertainty can
come in many forms... we may not know the governing equations (e.g. the coefficient
of friction in the joints), our robot may be walking on unknown terrain, subject to
unknown disturbances, or even be picking up unknown objects. There are a number
of mathematical frameworks for considering this uncertainty; for our purposes, this chapter will generalize our thinking to equations of the form:
where w is a new random input signal to the equations capturing all of this potential
variability. Although it is certainly possible to work in continuous time, and treat w(t)
as a continuous-time random signal (c.f. Wiener process), it is notationally simpler
to work with w[n] as a discrete-time random signal. For this reason, we'll devote our
attention in this chapter to the discrete-time systems.
This modeling framework is rich enough for us to convey the key ideas; but it is not
quite sufficient for all of the systems I am interested in. In DRAKE we go to additional
lengths to support more general cases of stochastic systems. This includes modeling
system parameters that are drawn randomly each time the model is initialized,
but are fixed over the duration of a simulation; it is possible but inefficient to
model these as additional state variables that have no dynamics. In other problems,
even the dimension of the state vector may change in different realizations of the
problem! Consider, for instance, the case of a robot manipulating random numbers
of dishes in a sink. I do not know many control formulations that handle this type of
randomness well, and I consider this a top priority to think more about! (We'll begin
to address it in the output feedback chapter.)
The classic reference for robust control is [1]. [2] has a treatment that does more of
its development in the time-domain and via Riccati equations.
14.2.1 Analysis
We also saw that essentially the same technique can be used to certify stability
against disturbances, e.g.:
L 2 gain
In some sense, the common-Lyapunov analysis above is probably the wrong analysis
for linear systems (perhaps other systems as well). It might be unreasonable to
assume that disturbances are bounded. Moreover, we know that the response to an
input (including the disturbance input) is linear, so we expect the magnitude of the
deviation in x compared to the undisturbed case to be proportional to the magnitude
of the disturbance, w. A more natural bound for a linear system's response is to
bound the magnitude of the response (from zero initial conditions) relative to the
magnitude of the disturbance.
Typically, this is done with a scalar "L₂ gain", γ, defined as:
argmin γ   subject to   sup_{w(⋅), ∫‖w(t)‖²dt < ∞}  ( ∫₀ᵀ ‖x(t)‖² dt ) / ( ∫₀ᵀ ‖w(t)‖² dt ) ≤ γ²,   or
argmin γ   subject to   sup_{w[⋅], Σₙ‖w[n]‖² < ∞}  ( Σ₀ᴺ ‖x[n]‖² ) / ( Σ₀ᴺ ‖w[n]‖² ) ≤ γ².
The name "L 2 gain" comes from the use of the ℓ 2 norm on the signals w(t) and x(t),
which is assumed only to be finite.
More often, these gains are written not in terms of x[n] directly, but in terms of some "performance output", z[n]. For instance, if we would like to bound the cost of a quadratic regulator objective as a function of the magnitude of the disturbance, we can minimize
min γ   subject to   sup_{w[⋅]}  ( Σ₀ᴺ ‖z[n]‖² ) / ( Σ₀ᴺ ‖w[n]‖² ) ≤ γ²,   z[n] = [ √Q x[n];  √R u[n] ].
This is a simple but important idea, and understanding it is the key to understanding
the language around robust control. In particular the H 2 norm of a system (from
input w to output z) is the energy of the impulse response; when z is chosen to
represent the quadratic regulator cost as above, it corresponds to the expected LQR
cost. The H ∞ norm of a system (from w to z) is the largest singular value of the
transfer function; it corresponds to the L 2 gain.
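For a stable LTI system ẋ = Ax + Bw, z = Cx, both quantities are easy to estimate numerically. Here is a small sketch: the H₂ norm comes from the controllability Gramian, and the H∞ norm is approximated by gridding the frequency response (a bisection on a Riccati or LMI condition would give a certified value; the frequency range below is an arbitrary choice):

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def system_norms(A, B, C, omegas=np.logspace(-2, 3, 2000)):
    # H2: ||G||_2^2 = tr(C W C^T), where A W + W A^T + B B^T = 0 (requires A stable).
    W = solve_continuous_lyapunov(A, -B @ B.T)
    h2 = np.sqrt(np.trace(C @ W @ C.T))
    # Hinf ~ max over frequency of the largest singular value of C (jwI - A)^{-1} B.
    n = A.shape[0]
    hinf = max(np.linalg.svd(C @ np.linalg.solve(1j * w * np.eye(n) - A, B),
                             compute_uv=False)[0] for w in omegas)
    return h2, hinf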
Small-gain theorem
Coming soon...
Dissipation inequalities
Coming soon... See, for instance, [4] or [5, Ch. 2].
14.2.2 H 2 design
14.2.3 H ∞ design
[8] provides an extensive treatment of this framework; nearly all of the analysis from
LQR/LQG (including Riccati equations, Hamiltonian formulations, etc) have analogs
for their versions with exponential cost, and he argues that LQG and H-infinity
control can (should?) be understood as special cases of this approach.
The standard criticism of H 2 optimal control is that minimizing the expected value
does not allow any guarantees on performance. The standard criticism of H ∞ optimal
control is that it concerns itself with the worst case, and may therefore be
conservative, especially because distributions over possible disturbances chosen a
priori may be unnecessarily conservative. One might hope that we could get some
of this performance back if we are able to update our models of uncertainty online,
adapting to the statistics of the disturbances we actually receive. This is one of the
goals of adaptive control.
One of the fundamental problems in online adaptive control is the trade-off between
exploration and exploitation. Some inputs might drive the system to build more
accurate models of the dynamics / uncertainty quickly, which could lead to better
performance. But how can we formalize this trade-off?
There has been some nice progress on this challenge in machine learning in the
setting of (contextual) multi-armed bandit problems. For our purposes, you can
think of bandits as a limiting case of an optimal control problem where there are
no dynamics (the effects of one control action do not effect the results of the
next control action). In this simpler setting, the online optimization community
has developed exploration-exploitation strategies based on the notion of minimizing
regret -- typically the accumulated difference in the performance achieved by my
online algorithm vs the performance that would have been achieved if I had been
privy to the data from my experiments before I started. This has led to methods that
make use of concepts like upper-confidence bound (UCB) and more recently bounds
using a least-squares confidence bound [9] to provide bounds on the regret.
In the last few years, we've seen these results translated into the setting of linear
optimal control...
14.3.2 Iterative H 2
Coming soon...
14.6 EXTENSIONS
REFERENCES
1. Kemin Zhou and John C. Doyle, "Essentials of Robust Control",Prentice Hall
, 1997.
7. S. Boyd and S.-J. Kim and L. Vandenberghe and A. Hassibi, "A Tutorial on
Geometric Programming", Optimization and Engineering, vol. 8, no. 1, pp.
67-127, 2007.
9. Dylan Foster and Alexander Rakhlin, "Beyond {UCB}: Optimal and Efficient
Contextual Bandits with Regression Oracles", Proceedings of the 37th
International Conference on Machine Learning, vol. 119, pp. 3199--3210,
13--18 Jul, 2020.
CHAPTER 15
Output Feedback (aka
Pixels-to-Torques)
In this chapter we will start considering systems of the form:
In other words, we'll finally start addressing the fact that we have to make decisions
based on sensor measurements -- most of our discussions until now have tacitly
assumed that we have access to the true state of the system for use in our feedback
controllers (and that's already been a hard problem).
In some cases, we will see that the assumption of "full-state feedback" is not so bad
-- we do have good tools for state estimation from raw sensor data. But even our
best state estimation algorithms do add some dynamics to the system in order to
filter out noisy measurements; if the time constants of these filters are near the time
constant of our dynamics, then it becomes important that we include the dynamics
of the estimator in our analysis of the closed-loop system.
In other cases, it's entirely too optimistic to design a controller assuming that we
will have an estimate of the full state of the system. Some state variables might
be completely unobservable, others might require specific "information-gathering"
actions on the part of the controller.
For me, the problem of robot manipulation is the application domain where more
direct approaches to output feedback become critically important. Imagine you are
trying to design a controller for a robot that needs to button the buttons on your
dress shirt. If step one is to estimate the state of the shirt (how many degrees of
freedom does my shirt have?), then it feels like we're not going to be successful. Or
if you want to program a robot to make a salad -- what's the state of the salad? Do I
really need to know the positions and velocities of every piece of lettuce in order to
be successful?
15.1 THE CLASSICAL PERSPECTIVE
To some extent, this idea of calling out "output feedback" as a special, advanced
topic is a new problem. Before state space and optimization-based approaches to
control ushered in "modern control", we had "classical control". Classical control
focused predominantly (though not exclusively) on linear time-invariant (LTI)
systems, and made very heavy use of frequency-domain analysis (e.g. via the Fourier
Transform/Laplace Transform). There are many excellent books on the subject; [1, 2]
are nice examples of modern treatments that start with state-space representations
but also treat the frequency-domain perspective.
REFERENCES
1. Joao P. Hespanha, "Linear Systems Theory",Princeton Press , 2009.
CHAPTER 16
Algorithms for Limit Cycles
The first natural question we must ask is, given a system ẋ = f(x), or a control
system ẋ = f(x, u), how do we go about finding periodic solutions which may be
passively stable, open-loop stable, or stabilizable via closed-loop feedback? It turns
out that the trajectory optimization tools that we developed already are very well
suited to this task.
Certainly we can add control inputs back into the formulation, too, but let's start
with this simple case. Take a moment to think about the feasible solutions to this
problem formulation. Certainly a fixed point x[n] = x ∗ will satisfy the constraints; if
we don't want these solutions to come out of the solver we might need to exclude
them with constraints or add an objective that guides the solver towards the desired
solutions. The other possible solutions are trajectories that are periodic in exactly N
steps. That's pretty restrictive.
Example 16.1 (Finding the limit cycle of the Van der Pol
oscillator)
Recall the dynamics of the Van der Pol oscillator, q̈ + μ(q² − 1)q̇ + q = 0, with μ > 0.
Try it yourself:
Open in Colab
As always, make sure that you take a look at the code. Poke around. Try changing
some things.
One of the things that you should notice in the code is that I provide an initial
guess for the solver. In most of the examples so far I've been able to avoid doing
that--the solver takes small random numbers as a default initial guess and solves
from there. But for this problem, I found that it was getting stuck in a local minimum.
Adding the initial guess that the solution moves around a circle in state space was
enough.
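To make the formulation concrete, here is a rough sketch of the transcription using pydrake's MathematicalProgram directly (this is not the notebook's implementation: it uses forward-Euler dynamics constraints instead of direct collocation, assumes the standard Van der Pol form above with μ = 1, and the phase-pinning constraints and circular initial guess are my own choices):

import numpy as np
from pydrake.all import MathematicalProgram, Solve

mu, N = 1.0, 100
prog = MathematicalProgram()
x = prog.NewContinuousVariables(N + 1, 2, "x")   # state [q, qdot] at the knot points
h = prog.NewContinuousVariables(1, "h")[0]       # time step, also a decision variable
prog.AddBoundingBoxConstraint(0.01, 0.2, h)
for n in range(N):
    q, qd = x[n, 0], x[n, 1]
    qdd = -mu * (q**2 - 1) * qd - q
    prog.AddConstraint(x[n + 1, 0] == q + h * qd)     # forward-Euler dynamics
    prog.AddConstraint(x[n + 1, 1] == qd + h * qdd)
for i in range(2):                                    # periodicity: x[N] == x[0]
    prog.AddConstraint(x[N, i] == x[0, i])
prog.AddConstraint(x[0, 0] == 0.0)                    # pin the phase of the solution
prog.AddConstraint(x[0, 1] >= 0.5)                    # ... and exclude the fixed point
for n in range(N + 1):                                # initial guess: a circle of radius 2
    prog.SetInitialGuess(x[n], 2 * np.array([np.sin(2 * np.pi * n / N),
                                             np.cos(2 * np.pi * n / N)]))
prog.SetInitialGuess(h, 0.07)
result = Solve(prog)
print(result.is_success())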
Recall the important distinction between stability of a trajectory in time and stability
of a limit cycle was that the limit cycle does not converge in phase -- trajectories
near the cycle converge to the cycle, but trajectories on the cycle may not converge
with each other. This type of stability, also known as orbital stability, can be written as stability to the manifold described by the trajectory x*(t),
Instead, we will focus our attention on constructing Lyapunov functions in the full
state space, which use the continuous dynamics. In particular, we would like to
consider Lyapunov functions which have the form cartooned below; they vanish to
zero everywhere along the cycle, and are strictly positive everywhere away from the
cycle.
Figure 16.1 - Cartoon of a Lyapunov function which vanishes on a limit cycle, and is
strictly positive everywhere else. (a small segment has been removed solely for the
purposes of visualization).
How can we parameterize this class of functions? For arbitrary cycles this could
be very difficult in the original coordinates. For simple cycles like in the cartoon,
one could imagine using polar coordinates. More generally, we will define a new
coordinate system relative to the orbit, with coordinates
τ̇ = f τ (x ⊥ , τ)
ẋ ⊥ = f ⊥ (x ⊥ , τ).
To keep our notation simpler for the remainder of these notes, we will assume
that the origin of this new coordinate system is on the nominal trajectory (
min τ |x − x ∗ (τ)| = 0 ⇒ x ⊥ = 0). Similarly, by taking τ to be the phase variable, we will
leverage the fact that on the nominal trajectory, we have τ̇ = f τ (0, τ) = 1.
The value of this construction for Lyapunov analysis was proposed in [1] and has
been extended nicely to control design in [2] and for region of attraction estimation
in [3]. A quite general numerical strategy for defining the transversal coordinates is
given in [4].
∀τ,  V(0, τ) = 0,
∀τ, ∀x_⊥ ∈ B, x_⊥ ≠ 0,  V(x_⊥, τ) > 0,
with
∀τ,  V̇(0, τ) = 0,
∀τ, ∀x_⊥ ∈ B, x_⊥ ≠ 0,  V̇(x_⊥, τ) < 0,
ẋ₁ = x₂ − αx₁ (1 − 1/√(x₁² + x₂²)),
ẋ₂ = −x₁ − αx₂ (1 − 1/√(x₁² + x₂²)),
Figure 16.3 - Vector field of the ring oscillator
which is valid when x ⊥ > −1, then the simple transverse dynamics are revealed:
τ̇ =1
ẋ ⊥ = − αx ⊥ .
This demonstrates that the limit cycle is locally asymptotically stable, and
furthermore that the invariant open-set V < 1 is inside the region of attraction of
that limit cycle. In fact, we know that all x ⊥ > −1 are in the region of attraction
that limit cycle, but this is not proven by the Lyapunov argument above.
Let's compare this approach again with the approach that we used in the chapter
on walking robots, where we used a Poincaré map analysis to investigate limit
cycle stability. In transverse coordinates approach, there is an additional burden
to construct the coordinate system along the entire trajectory, instead of only at a
single surface of section. In fact, the transverse coordinates approach is sometimes
referred to as a "moving Poincaré section". Again, the reward for this extra work is
that we can check a condition that only involves the instantaneous dynamics, f(x),
as opposed to having to integrate the dynamics over an entire cycle to generate the
discrete Poincaré map, x p [n + 1] = P (x p [n]). As we will see below, this approach will
also be more compatible with designing continuous feedback controller that stabilize
the limit cycle.
In the case of Lyapunov analysis around a fixed point, there was an important
special case: for stable linear systems, we actually have a recipe for constructing a
Lyapunov function. As a result, for nonlinear systems, we often found it convenient
to begin our search by linearizing around the fixed point and using the Lyapunov
candidate for the linear system as an initial guess. That same approach can be
extended to limit cycle analysis.
ẋ_⊥ = f_⊥(x_⊥, τ) ≈ (∂f_⊥/∂x_⊥) x_⊥ = A_⊥(τ) x_⊥.
Note that I've assumed that x_⊥ is zero on the nominal trajectory, so we don't need the additional notation of working in the error coordinates here. Remember that τ is the
phase along the trajectory, but [1] showed that (exponential) stability of the time-
varying linear system ẋ ⊥ = A ⊥ (t)x ⊥ implies local exponential orbital stability of the
original system. In particular, if this transverse linear system is periodic and orbitally
stable, then for each Q = Q T ≻ 0, there exists a unique positive definite solution to
the periodic Riccati equation:
There is a surprisingly rich literature on the numerical methods for finding the
periodic solutions to these Lyapunov (and corresponding Riccati) equations. In
practice, for every problem I've ever tried, I've simply integrated the equations
backwards in time until the integration converges to a periodic solution (this is not
guaranteed to work, but almost always does).
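Here is a sketch of that backwards-integration strategy, applied to the (simpler) periodic Lyapunov equation Ṗ = −(A_⊥ᵀ(t)P + P A_⊥(t) + Q); the Riccati version differs only in the right-hand side. The function names, the terminal condition, and the convergence test are my own choices, and as noted above convergence is not guaranteed:

import numpy as np
from scipy.integrate import solve_ivp

def periodic_lyapunov(A_of_t, Q, T, n, max_periods=50, tol=1e-6):
    # Iterate the backwards map over one period until P(T) is a fixed point.
    def rhs(t, p):
        P = p.reshape(n, n)
        A = A_of_t(t)
        return -(A.T @ P + P @ A + Q).ravel()
    P = Q.copy()                                 # arbitrary positive-definite terminal condition
    for _ in range(max_periods):
        sol = solve_ivp(rhs, [T, 0.0], P.ravel(), rtol=1e-8, atol=1e-8)
        P_new = sol.y[:, -1].reshape(n, n)
        if np.max(np.abs(P_new - P)) < tol:
            return 0.5 * (P_new + P_new.T)       # symmetrize and return
        P = P_new
    return 0.5 * (P + P.T)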
Many of the tools we've developed for stabilization of fixed points or stabilization
of trajectories can be adapted to this new setting. When we discussed control
for the spring-loaded inverted pendulum model, we considered discrete control
decisions, made once per cycle, which could stabilize the discrete Poincaré map,
x p [n + 1] = f p (x p [n], u[n]), with u[n] the once-per-cycle decisions. While it's possible to
get the local linear approximations of this map using our adjoint methods, remember
that we rarely have an analytical expression for the nonlinear f p .
But the tools we've developed above also give us the machinery we need to consider
continuous feedback throughout the trajectory. Let's look at a few important
formulations.
It turns out many of our simple walking models -- particularly ones with a point foot
that are derived in the minimal coordinates -- are only short one actuator (between
the foot and the ground). One can represent even fairly complex robots this way;
much of the theory that I'll allude to here was originally developed by Jessy Grizzle
and his group in the context of the bipedal robot RABBIT. Jessy's key observation
was that limit cycle stability is effectively stability in n − 1 degrees of freedom, and
you can often achieve it easily with n − 1 actuators -- he called this line of work
"Hybrid Zero Dynamics" (HZD). We'll deal with the "hybrid" part of that in the next
chapter, but here is a simple example to illustrate the "zero dynamics" concept.
[Coming soon...]
ẋ_⊥ = f_⊥(x_⊥, τ, u) ≈ (∂f_⊥/∂x_⊥) x_⊥ + (∂f_⊥/∂u)(u − u₀(τ)) = A_⊥(τ) x_⊥ + B_⊥(τ) ū.
[2] showed that the (periodic) time-varying LQR controller that stabilized this system
achieves orbital stabilization of the original system.
Let's take a minute to appreciate the difference between this approach and time-
varying LQR in the full coordinates. Of course, the behavior is quite different: time-
varying LQR in the full coordinates will try to slow down or speed up to keep time
with the nominal trajectory, where this controller makes no attempt to stabilize
the phase variable. You might think you could design the original time-varying
controller, ū = K(t)x̄, and just "project to the closest time" during execution (e.g. find
the time t = argmin τ |x − x 0 (τ)|) and use the feedback corresponding to that time. But
this projection is not guaranteed to be safe; you can find systems that are stabilizable by the transverse linearization but are unstable in closed-loop when using a feedback based on this projection.
But there is an even more important reason to like the transverse LQR approach.
If orbital stability is all you need, then this is a better formulation of the problem
because you are asking for less. As we've just discussed with hybrid zero-dynamics,
stabilizing m degrees of freedom with m actuators is potentially very different than
stabilizing m + 1 degrees of freedom. In the transverse formulation, we are asking
LQR to minimize the cost only in the transverse coordinates (it is mathematically
equivalent to designing a cost function in the full coordinates that is zero in the
direction of the nominal trajectory). In practice, this can result in much smaller cost-
to-go matrices, S(t), and smaller feedback gains, K(t). For underactuated systems,
this difference can be dramatic.
The observation that orbital stabilization of a trajectory can ask much less of your
underactuated control system (and lead to smaller LQR feedback gains), is quite
powerful. It is not limited to stabilizing periodic motions. If your aim is to stabilize a non-periodic path through state space, but you do not actually care about the timing, then formulating your stabilization problem in the transverse coordinates can be a very good choice. You can find examples
where a system is stabilizable in the transverse coordinates, but not in the full
coordinates.
Take care, though, as the converse can also be true: it's possible that a system could be stabilizable in the full coordinates but not in some transverse coordinates, if you choose those coordinates badly.
The general approach to designing the transverse coordinates [4] could address this
concern by including a criterion for controllability when optimizing the transverse
coordinates.
REFERENCES
1. John Hauser and Chung Choo Chung, "Converse Lyapunov functions for exponentially stable periodic orbits", Systems & Control Letters, vol. 23, no. 1, pp. 27--34, 1994.
3. Ian R. Manchester and Mark M. Tobenkin and Michael Levashov and Russ
Tedrake, "Regions of Attraction for Hybrid Limit Cycles of Walking Robots",
Proceedings of the 18th IFAC World Congress, extended version available
online: arXiv:1010.2247 [math.OC], 2011.
CHAPTER 17
Planning and Control through Contact
Open in Colab
So far we have developed a fairly strong toolbox for planning and control with
"smooth" systems -- systems where the equations of motion are described by a
function ẋ = f(x, u) which is smooth everywhere. But our discussion of the simple
models of legged robots illustrated that the dynamics of making and breaking
contact with the world are more complex -- these are often modeled as hybrid
dynamics with impact discontinuities at the collision event and constrained
dynamics during contact (with either soft or hard constraints).
My goal for this chapter is to extend our computational tools into this richer class of
models. Many of our core tools still work: trajectory optimization, Lyapunov analysis
(e.g. with sums-of-squares), and LQR all have natural equivalents.
Let's start with a warm-up exercise: trajectory optimization for the rimless wheel.
We already have basically everything that we need for this, and it will form a nice
basis for generalizing our approach throughout the chapter.
The rimless wheel was our simplest example of a passive-dynamic walker; it has no
control inputs but exhibits a passively stable rolling fixed point. We've also already
seen that trajectory optimization can be used as a tool for finding limit cycles of a
smooth passive system, e.g. by formulating a direct collocation problem:
It turns out that applying this to the rimless wheel is quite straightforward. We
still want to find a periodic trajectory, but now have to take into account the
collision event. We can do this by modifying the periodicity condition. Let's force
the initial state to be just after the collision, and the final state to be just before the
collision, and make sure they are related to each other via the collision equation:
Although it is likely not needed for this simple example (since the dynamics are
sufficiently limited), for completeness one should also add constraints to ensure
that none of the intermediate points are in contact,
The result is a simple and clean numerical algorithm for finding the rolling limit
cycle solution of the rimless wheel. Please take it for a spin:
Open in Colab
The specific case of the rimless wheel is quite clean. But before we apply it to the
compass gait, the kneed compass gait, the spring-loaded inverted pendulum, etc,
we should stop and figure out a more general form.
The language of hybrid systems gives us a rich language for describing systems of
this form, and a suite of tools for analyzing and controlling them. The term "hybrid
systems" is a bit overloaded, here we use "hybrid" to mean both discrete- and
continuous-time, and the particular systems we consider here are sometimes called
autonomous hybrid systems because the internal dynamics can cause the discrete
changes without any exogenous input†. In the hybrid systems formulation, we describe a system by a set of modes, each described by (ideally smooth) continuous dynamics, a set of guards, which here are continuous functions whose zero-level set describes the conditions which trigger an event, and a set of resets, which describe the discrete update to the state that is triggered by the guard. Each guard is associated with a particular mode, and we can have multiple guards per mode. Every guard has at most one reset. You will occasionally hear guards referred to as "witness functions", since they play that role in simulation, and resets are sometimes referred to as "transition functions".
†This is in contrast to, for instance, the mode of a power-train, where a change in gears comes as an external input.
The imagery that I like to keep in my head for hybrid systems is illustrated below
for a simple example of a robot's heel striking the ground. A solution trajectory
of the hybrid system has a continuous trajectory inside each mode, punctuated by
discrete updates when the trajectory hits the zero-level set of the guard (here the
distance between the heel and the ground becomes zero), with the reset describing
the discrete change in the state variables.
For this robot foot, we can decompose the dynamics into distinct modes: (1) foot in
the air, (2) only heel on the ground, (3) heel and toe on the ground, (4) only toe on
the ground (push-off). More generally, we will write the dynamics of mode i as f i , the
guard which signals the transition mode i to mode j as ϕ i,j (where ϕ i,j (x i ) > 0 inside
mode i), and the reset map from i to j as Δ i,j , as illustrated in the following figure:
Figure 17.3 - The language of hybrid systems: modes, guards, and reset maps.
Using the more general language of modes, guards, and resets, we can begin to formulate the "hybrid trajectory optimization" problem. In hybrid trajectory optimization, there is a major distinction between trajectory optimization where the mode sequence is known a priori and the optimization is just attempting to solve for the continuous-time trajectories, vs. one in which we must also discover the mode sequence.
For the case when the mode sequence is fixed, hybrid trajectory optimization is as simple as stitching together multiple individual mathematical programs into a single mathematical program, with the boundary conditions constrained to enforce the guard/reset constraints. For a simple hybrid system with a given mode sequence, and using the shorthand x_k, etc., for the state in the mode active during the kth segment of the sequence, we can write:
It is then natural to add control inputs (as additional decision variables), and to add
an objective and any more constraints.
Let's start simpler -- with just a "bounce pass". We can capture the dynamics of a
bouncing ball (in the plane, ignoring spin) with some very simple dynamics:
q = [x, z]ᵀ,  q̈ = [0, −g]ᵀ.
During any time interval without contact of duration h, we can actually integrate
these dynamics perfectly:
x(t + h) = [ x(t) + h ẋ(t),  z(t) + h ż(t) − ½ g h²,  ẋ(t),  ż(t) − h g ]ᵀ.
With the bounce pass, we just consider collisions with the ground, so we have
a guard, ϕ(x) = z, which triggers when z = 0, and a reset map which assumes an
elastic collision with coefficient of restitution e:
x⁺ = Δ(x⁻) = [ x⁻,  z⁻,  ẋ⁻,  −e ż⁻ ]ᵀ.
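As a concrete illustration of this guard/reset structure, here is a minimal sketch that simulates the bouncing ball by alternating the closed-form flight dynamics with the elastic reset above; the gravity, restitution, and initial-state values are assumptions chosen only for illustration.

import numpy as np

# Sketch: simulate the bouncing ball as a hybrid system by alternating the
# closed-form flight dynamics with the reset above.  g, e, and the initial
# state are illustrative assumptions.
g, e = 9.81, 0.8
state = np.array([0.0, 1.0, 0.8, 1.0])    # [x, z, xdot, zdot]

def time_to_impact(z, zdot):
    """Smallest positive root of z + zdot*t - 0.5*g*t^2 = 0 (the guard phi = z)."""
    roots = np.roots([-0.5 * g, zdot, z])
    candidates = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 1e-9]
    return min(candidates)

t = 0.0
for bounce in range(4):
    x, z, xd, zd = state
    h = time_to_impact(z, zd)
    # Flow until the guard triggers (z reaches 0), then apply the elastic reset.
    x_minus = np.array([x + h * xd, 0.0, xd, zd - g * h])
    state = np.array([x_minus[0], x_minus[1], x_minus[2], -e * x_minus[3]])
    t += h
    print(f"bounce {bounce + 1}: t = {t:.2f} s, x = {state[0]:.2f} m")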
We'll formulate the problem as follows: given an initial ball position (x = 0, z = 1), a
final ball position 4m away (x = 4, z = 1), find the initial velocity to achieve that goal
in 5 seconds. Clearly, this implies that ẋ(0) = 4/5. The interesting question is -- what
should we do with ż(0)? There are multiple solutions -- which involve bouncing a
different number of times. We can find them all with a simple hybrid trajectory
optimization, using the observation that there are two possible solutions for each
number of bounces -- one that starts with a positive ż(0) and one with a negative
ż(0).
Open in Colab
Figure 17.4 - Trajectory optimization to find solutions for a "bounce pass". I've
plotted all solutions that were found for 2, 3, or 4 bounces... but I think it's best
to stick to a single bounce if you're using this on the court.
Now let's try our trick shot. I'll move our goal to x f = −1m, z f = 3m, and introduce
a vertical wall at x = 0, and move our initial conditions back to x 0 = −.25m. The
collision dynamics, which now must take into account the spin of the ball, are in
the appendix. The first bounce is against the wall, the second is against the floor.
I'll also constrain the final velocity to be down (have to approach the hoop from
above). Try it out.
Open in Colab
Figure 17.5 - Trajectory optimization for the "trick shot". Nothing but the bottom
of the net! The crowd is going wild!
There is some work to do in order to derive the equations of motion in this form.
Do you remember how we did it for the rimless wheel and compass gait examples?
In both cases we assumed that exactly one foot was attached to the ground and that it would not slip; this allowed us to write the Lagrangian as if there were a pin joint attaching the foot to the ground to obtain the equations of motion. For the
SLIP model, we derived the flight phase and stance phase using separate Lagrangian
equations each with different state representations. I would describe this as the
minimal coordinates modeling approach -- it is elegant and has some important
computational advantages that we will come to appreciate in the algorithms below.
But it's a lot of work! For instance, if we also wanted to consider friction in the foot
contact of the rimless wheel, we would have to derive yet another set of equations
to describe the sliding mode (adding, for instance, a prismatic joint that moved the
foot along the ramp), plus the guards which compute the contact force for a given
state and the distance from the boundary of the friction cone, and on and on.
where λ_i are the constraint forces and J_i are the constraint Jacobians. Conveniently, if the guard function in our contact equations is the signed distance from contact, ϕ_i(q), then this Jacobian is simply J_i(q) = ∂ϕ_i/∂q. I've written the basic derivations for the common cases (position constraints, velocity constraints, impact equations, etc) in the appendix. What is important to understand here is that this is an alternative formulation for the equations governing the modes, guards, and resets, but that it is no longer a minimal coordinate system -- the equations of motion are written in 2N
state variables but the system might actually be constrained to evolve only along
a lower dimensional manifold (if we write the rimless wheel equations with three
configuration variables for the floating base, it still only rotates around the toe when
it is in stance and is inside the friction cone). This will have implications for our
algorithms.
17.2 EXERCISES
a. Enforce the contact between the stance foot and the ground at all the
break points.
b. Enforce the contact between the swing foot and the ground at the initial
time.
c. Prevent the penetration of the swing foot in the ground at all the break
points. (In this analysis, we will neglect the scuffing between the swing
foot and the ground which arises when the swing leg passes the stance
leg.)
d. Ensure that the contact force at the stance foot lies in the friction cone
at all the break points.
e. Ensure that the impulse generated by the collision of the swing foot with
the ground lies in the friction cone.
CHAPTER 18
System Identification
My primary focus in these notes has been to build algorithms that design or analyze
a control system given a model of the plant. In fact, we have in some places gone
to great lengths to understand the structure in our models (the structure of the
manipulator equations in particular) and tried to write algorithms which exploit that
structure.
Our ambitions for our robots have grown over recent years to where it makes sense
to question this assumption. If we want to program a robot to fold laundry, spread
peanut butter on toast, or make a salad, then we should absolutely not assume that
we are simply given a model (and the ability to estimate the state of that model). This
has led many researchers to focus on the "model-free" approaches to optimal control
that are popular in reinforcement learning. But I worry that the purely model-
free approach is "throwing the baby out with the bathwater". We have fantastic
tools for making long-term decisions given a model; the model-free approaches are
correspondingly much much weaker.
So in this chapter I would like to cover the problem of learning a model. This is far
from a new problem. The field of "system identification" is as old as controls itself,
but new results from machine learning have added significantly to our algorithms
and our analyses, especially in the high-dimensional and finite-sample regimes.
But well before the recent machine learning boom, system identification as a field
had a very strong foundation with thorough statistical understanding of the basic
algorithms, at least in the asymptotic regime (the limit where the amount of data
goes to infinity). My goal for this chapter is to establish this foundation, and to
provide some pointers, but I will stop short of providing the full statistical viewpoint
here.
18.1 PROBLEM FORMULATION: EQUATION ERROR VS
SIMULATION ERROR
Our problem formulation inevitably begins with the data. In practice, if we have
access to a physical system, instrumented using digital electronics, then we have
a system in which we can apply input commands, u n , at some discrete rate, and
measure the outputs, y n of the system at some discrete rate. We normally assume
these rates are fixed, and often attempt to fit a state-space model of the form
where I have used α again here to indicate a vector of parameters. In this setting, a
natural formulation is to minimize a least-squares estimation objective:
min_{α, x[0]} Σ_{n=0}^{N−1} ∥y[n] − y_n∥²,   subject to (1).
I have written purely deterministic models to start, but in general we expect both
the state evolution and the measurements to have randomness. Sometimes, as I
have written, we fit a deterministic model to the data and rely on our least-squares
objective to capture the errors; more generally we will look at fitting stochastic
models to the data.
We often separate the identification procedure into two parts, where we first estimate the state x̂_n given the input-output data u_n, y_n, and then focus on estimating the
state-evolution dynamics in a second step. The dynamics estimation algorithms fall
into two main categories:
The examples of folding laundry or making a salad fall into a different category. In
those examples, I might not even know a priori the number of state variables needed
to provide a reasonable description of the behavior. That will force a more general
examination of the identification problem, which we will explore in the remaining
sections.
We can separate the parameters in the multibody equations again into kinematic
parameters and dynamic parameters. The kinematic parameters, like link lengths,
describe the coordinate transformation from one joint to another joint in the
kinematic tree. It is certainly possible to write an optimization procedure to calibrate
these parameters; you can find a fairly thorough discussion in e.g. Chapter 11 of [1].
But I guess I'm generally of the opinion that if you don't have accurate estimates
of your link lengths, then you should probably invest in a tape measure before you
invest in nonlinear optimization.
One notable exception to this is calibration with respect to joint offsets. This one
can be a real nuisance in practice. Joint sensors can slip, and some robots even
use relative rotary encoders, and rely on driving the joint to some known hard joint
limit each time the robot is powered on in order to obtain the offset. I've worked
on one humanoid robot that had a quite elaborate and painful kinematic calibration procedure which involved fitting additional hardware over the joints to ensure they were in a known location and then running a script. Having an expensive and/
or unreliable calibration procedure can put a damper on any robotics project. For
underactuated systems, in particular, it can have a dramatic effect on performance.
Open in Colab
Our tools from robust / stochastic control are well-suited to identifying (and
bounding / minimizing) these sensitivities, at least for the linearized model we use
in LQR.
The general approach to estimating joint offsets from data is to write the equations
with the joint offset as a parameter, e.g. for the double pendulum we would write the
forward kinematics as:
p₁ = l₁ [ sin(θ₁ + θ̄₁),  −cos(θ₁ + θ̄₁) ]ᵀ,   p₂ = p₁ + l₂ [ sin(θ₁ + θ̄₁ + θ₂ + θ̄₂),  −cos(θ₁ + θ̄₁ + θ₂ + θ̄₂) ]ᵀ.
We try to obtain independent measurements of the end-effector position (e.g. from
motion capture, from perhaps some robot-mounted cameras, or from some
mechanical calibration rig) with their corresponding joint measurements, to obtain
data points of the form ⟨p 2 , θ 1 , θ 2 ⟩. Then we can solve a small nonlinear optimization
problem to estimate the joint offsets to minimize a least-squares residual.
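As a sketch of what that small nonlinear optimization might look like, the snippet below fits the two offsets with scipy's least_squares; the link lengths and the ⟨p₂, θ₁, θ₂⟩ data are synthetic assumptions used only to illustrate the procedure.

import numpy as np
from scipy.optimize import least_squares

# Sketch: estimate the joint offsets (theta1_bar, theta2_bar) from end-effector
# measurements by nonlinear least squares.  Link lengths and data are synthetic.
l1, l2 = 1.0, 0.7
true_offsets = np.array([0.05, -0.03])     # used only to fabricate the data

def forward_kinematics(theta, offsets):
    a1 = theta[:, 0] + offsets[0]
    a12 = a1 + theta[:, 1] + offsets[1]
    return np.stack([l1 * np.sin(a1) + l2 * np.sin(a12),
                     -l1 * np.cos(a1) - l2 * np.cos(a12)], axis=1)

rng = np.random.default_rng(0)
theta_data = rng.uniform(-1.5, 1.5, size=(50, 2))           # joint readings
p2_data = forward_kinematics(theta_data, true_offsets)      # "motion capture" data
p2_data += 1e-3 * rng.standard_normal(p2_data.shape)        # measurement noise

residuals = lambda offsets: (forward_kinematics(theta_data, offsets) - p2_data).ravel()
result = least_squares(residuals, x0=np.zeros(2))
print("estimated joint offsets:", result.x)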
Now let's think about estimating the dynamic parameters of a multibody system. We've been writing the manipulator equations in the form:
Each of the terms in this equation can depend on the parameters α that we're
trying to estimate. But the parameters enter the multibody equations in a particular
structured way: the equations are affine in the lumped parameters. More precisely,
the manipulator equations above can be factored into the form
[ θ̈   θ̇   sin θ ] [ ml²,  b,  mgl ]ᵀ − τ = 0.
As a simple example, I've loaded the cart-pole system model from URDF, created a
symbolic version of the MultibodyPlant, and populated the Context with symbolic
variables for the quantities of interest. Then I can evaluate the (inverse) dynamics
in order to obtain my equations.
Open in Colab
Symbolic dynamics:
(0.10000000000000001 * v(0) - u(0) + (pow(v(1), 2) * mp * l * sin(q(1))) + (vd(0) * mc) + (
(0.10000000000000001 * v(1) - (vd(0) * mp * l * cos(q(1))) + (vd(1) * mp * pow(l, 2)) + 9.8
Go ahead and compare these with the cart-pole equations that we derived by hand.
W = [ Q₁  Q₂ ] [ R₁;  0 ]   ⇒   W α_l = Q₁ (R₁ α_l).
Note that we could have easily made the fit more accurate with more data (or more
carefully selected data).
Should we be happy with only estimating the (identifiable) lumped parameters? Isn't
it the true original parameters that we are after? The linear algebra decomposition
of the data matrix (assuming we apply it to a sufficiently rich set of data), is actually
revealing something fundamental for us about our system dynamics. Rather than
feel disappointed that we cannot accurately estimate some of the parameters, we
should embrace that we don't need to estimate those parameters for any of the
dynamic reasoning we will do about our equations (simulation, verification, control
design, etc). The identifiable lumped parameters are precisely the subset of the
lumped parameters that we need.
For practical reasons, it might be convenient to take your estimates of the lumped
parameters, and try to back out the original parameters (for instance, if you want
to write them back out into a URDF file). For this, I would recommend a final
post-processing step that, e.g., finds the parameters α̂ that are as close as possible (e.g. in the least-squares sense) to your original guess for the parameters, subject to the nonlinear constraint that R₁ α_l(α̂) matches the estimated identifiable lumped parameters.
There are still a few subtleties worth considering, such as how we parameterize
the inertial matrices. Direct estimation of the naive parameterization, the six entries
of a symmetric 3x3 matrix, can lead to non-physical inertial matrices. [3] describes
a parameter estimation formulation that includes a convex formulation of the
physicality constraints between these parameters.
In addition to leveraging tools from linear algebra, there are a number of other
refinements to the basic recipe that leverage our understanding of mechanics. One
important example is the "energy formulations" of parameter estimation [4].
We have already observed that evaluating the equation error in torque space
(inverse dynamics) is likely better than evaluating it in state space (forward dynamics), because we can factorize the inverse dynamics. But this property is not
exclusive to inverse dynamics. The total energy of the system (kinetic + potential) is
also affine in the lumped parameters. We can use the relation
Why might we prefer to work in energy coordinates rather than torque? The
differences are apparent in the detailed numerics. In the torque formulation, we find
ourselves using q̈ directly. Conventional wisdom holds that joint sensors can reliably
report joint positions and velocities, but that joint accelerations, often obtained by
numerically differentiating twice, tend to be much more noisy. In some cases, it
might be numerically better to apply finite differences to the total energy of the system rather than to the individual joints†. [5] provides a study of various filtering formulations which leads to a recommendation in [4] that the energy formulation tends to be numerically superior.
†Sometimes these methods are written a...
The term "residual physics" has become quite popular recently (e.g. [6]) as people
are looking for ways to combine the modeling power of deep neural networks with
our best tools from mechanics. But actually this idea is quite old, and there is a
natural class of residual models that fit very nicely into our least-squares lumped-
parameter estimation. Specifically, we can consider models of the form:
with α r the additional parameters of the residual and Φ a set of (fixed) nonlinear
basis functions. The hope is that these residual models can capture any "slop terms"
in the dynamics that are predictable, but which we did not include in our original
parametric model. Nonlinear friction and aerodynamic drag are commonly cited
examples.
Common choices for these basis functions, Φ(q, q̇), for use with the manipulator
equations include radial basis functions[7] or wavelets [8]. Although allowing the
basis functions to depend on q̈ or u would be fine for the parameter estimation, we
tend to restrict them to q and q̇ to maintain some of the other nice properties of the
manipulator equations (e.g. control-affine).
One key assumption for any claims about our parameter estimation algorithms
recovering the true identifiable lumped parameters is that the data set was
sufficiently rich; that the trajectories were "parametrically exciting". Basically we
need to assume that the trajectories produced motion so that the data matrix, W, contains information about all of the identifiable lumped parameters. Thanks to our linear parameterization, we can evaluate this via numerical linear algebra on W, for instance by examining its condition number. The condition number is always greater than one, by definition, and the lower the value the better the condition. The condition number of the data matrix is a nonlinear function of the data taken over the entire trajectory, but it can still be optimized in a nonlinear trajectory optimization ([1], § 12.3.4).
The field of adaptive control is a huge and rich literature; many books have been
written on the topic (e.g [11]). Allow me to make a few quick references to that
literature here.
As I said, adaptive control is a rich subject. One of the biggest lessons from that
field, however, is that one may not need to achieve convergence to the true (lumped)
parameters in order to achieve a task. Many of the classic results in adaptive control
make guarantees about task execution but explicitly do not require/guarantee
convergence in the parameters.
Can we apply these same techniques to e.g. walking robots that are making and
breaking contact with the environment?
There is certainly a version of this problem that works immediately: if we know the
contact Jacobians and have measurements of the contact forces, then we can add
these terms directly into the manipulator equations and continue with the least-squares estimation of the lumped parameters, even including frictional parameters.
One can also study cases where the contact forces are not measured directly. For
instance, [13] studies the extreme case of identifiability of the inertial parameters of
a passive object with and without explicit contact force measurements.
If multibody parameter estimation forms the first relevant pillar of established work
in system identification, then identification of linear systems forms the second.
Linear models have been a primary focus of system identification research since the
field was established, and have witnessed a resurgence in popularity during just the
last few years as new results from machine learning have contributed new bounds
and convergence results, especially in the finite-sample regime (e.g. [14, 15, 16,
17]).
Let's start our treatment with the easy case: fitting a linear model from direct
(potentially noisy) measurements of the state. Fitting a discrete-time model,
x[n + 1] = Ax[n] + Bu[n] + w[n], to sampled data (u n , x n = x[n] + v[n]) using the
equation-error objective is just another linear least-squares problem. Typically, we
form some data matrices and write our least-squares problem as:
X′ ≈ [ A  B ] [ X;  U ],   where
X = [ x₀  x₁  ⋯  x_{N−2} ],   X′ = [ x₁  x₂  ⋯  x_{N−1} ],   U = [ u₀  u₁  ⋯  u_{N−2} ].
By the virtues of linear least squares, this estimator is unbiased with respect to the
uncorrelated process and/or measurement noise, w[n] and v[n].
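Here is a minimal sketch of that equation-error fit; the "true" A and B used to generate the data are assumptions for illustration only.

import numpy as np

# Sketch: equation-error least-squares fit of (A, B) from state/input samples.
# The data-generating system and the exciting input below are assumptions.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.2], [0.0, 0.8]])
B_true = np.array([[0.0], [0.1]])

N = 200
X = np.zeros((2, N))
U = rng.standard_normal((1, N))            # a sufficiently rich input signal
for n in range(N - 1):
    X[:, n + 1] = A_true @ X[:, n] + B_true @ U[:, n] + 0.01 * rng.standard_normal(2)

# Form the data matrices and solve X' ≈ [A B] [X; U] in the least-squares sense.
Xp = X[:, 1:]
Z = np.vstack([X[:, :-1], U[:, :-1]])
AB = Xp @ np.linalg.pinv(Z)
A_hat, B_hat = AB[:, :2], AB[:, 2:]
print("A_hat =\n", A_hat)
print("B_hat =\n", B_hat)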
Interestingly, the simple answer is "no". If you only collect the input and state data
from running this controller, you will see that the least-squares problem that we
formulate above is rank-deficient. The estimated A and B matrices, denoted Â and B̂, are therefore not uniquely determined by the data.
Fortunately, we could have seen this problem by checking the rank of the least-
squares solution. Generating more examples won't fix this problem. Instead, to
generate a richer dataset, I've added a small additional signal to the input:
u(t) = π lqr (x(t)) + 0.1 sin(t). That makes all of the difference.
Open in Colab
I hope you try the code. The basic algorithm for estimation is disarmingly simple,
but there are a lot of details to get right to actually make it work.
as a cross between the classic cart-pole system and a fighter jet (perhaps a little
closer to the cartpole)!
Here we've replaced the pole with an airfoil (hydrofoil), turned the entire system
on its side, and dropped it into a water tunnel. Rather than swing-up and balance
the pole against gravity, the task is to balance the foil in its unstable configuration
against the hydrodynamic forces. These forces are the result of unsteady fluid-
body interactions; unlike the classic cart-pole system, this time we do not have
a tractable parameterized ODE model for the system. It's a perfect problem for
system identification and ILC.
Figure 18.2 - A cartoon of the hydrodynamic cart-pole system. The cart is actuated
horizontally, the foil pivots around a passive joint, and the fluid is flowing in the
direction of the arrows. (The entire system is horizontal, so there is no effect from
gravity.) The aerodynamic center of the airfoil is somewhere in the middle of the
foil; because the pin joint is at the tail, the passive system will "weather vane" to
the stable "downward" equilibrium (left). Balancing corresponds to stabilizing the
unstable "upward" equilibrium (right). The fluid-body dynamics during the
transition (center) are unsteady and very nonlinear.
Figure 18.3 - Comparison of the step response using three different controllers:
the balancing LQR controller (blue), LQR with a feed-forward term obtained from
trajectory optimization with the LTI model (red), and a controller obtained via ILC
with a time-varying linear model and iLQR (green).
These experiments were quite beautiful and very visual. They went on from here
to consider the effect of stabilizing against incoming vortex disturbances using
real-time perception of the oncoming fluid. If you're at all interested, I would
encourage you to check out John Roberts's thesis [19] and/or even the video of his
thesis defense.
connection to Koopman operator theory that we will discuss briefly below. For our
purposes, we also need a control input, so we might consider a form like ϕ̇ = Aϕ(x) + Bu.
Once again, thanks to the maturity of least-squares, with this approach it is possible to include many possible basis functions, then use sparse least-squares and/or the modal decomposition to effectively find the important subset.
Note that multibody parameter estimation described above is not this, although it is
closely related. The least-squares lumped-parameter estimation for the manipulator
equations uncovered dynamics that were still nonlinear in the state variables.
In the more general form, we would like to estimate a model of the form
Once again, we will apply least-squares estimation, but combine this with the
famous "Ho-Kalman" algorithm (also known as the "Eigen System Realization" (ERA)
algorithm) [21]. My favorite presentation of this algorithm is [16].
For the purposes of identification, let's write y[n] as a function of the most recent
N + 1 inputs (for k ≥ N ):
y[n] = [ CA^{N−1}B  CA^{N−2}B  ⋯  CB  D ] [ u[n−N],  u[n−N+1],  …,  u[n] ]ᵀ + δ[n] = G[n] ū[n] + δ[n],
where δ[n] captures the remaining terms from initial conditions, noise and control inputs before the (sliding) window, G[n] = [ CA^{N−1}B  CA^{N−2}B  ⋯  CB  D ], and ū[n] represents the concatenated u[n]'s from time n − N up to n. Importantly we have that δ[n] is uncorrelated with ū[n]: ∀k > n we have E[u_k δ_n] = 0. This is sometimes known as Neyman orthogonality, and it implies that we can estimate G using simple least-squares: Ĝ = argmin_G Σ_{n≥N} ∥y_n − G ū_n∥². [16] gives bounds on the norm of the estimation error as a function of the number of samples and the variance of the noise.
How should we pick the window size N ? By Neyman orthogonality, we know that our
estimates will be unbiased for any choice of N ≥ 0. But if we choose N too small, then
the term δ[n] will be large, leading to a potentially large variance in our estimate. For
stable systems, the δ terms will get smaller as we increase N . In practice, we choose
N based on the characteristic settling time in the data (roughly until the impulse
response becomes sufficiently small).
⎢⎥
If you've studied linear systems, G will look familiar; it is precisely this (multi-input, multi-output) matrix impulse response, also known as the "Markov parameters". In fact, estimating Ĝ may even be sufficient for control design, as it is closely related to the parameterization used in disturbance-based feedback for partially-observable systems [22, 23]. But the Ho-Kalman algorithm can be used to extract a state-space realization Â, B̂, Ĉ with state dimension dim(x) = n_x from Ĝ, assuming that the true system is observable and controllable with order at least n_x and the data matrices we form below are sufficiently rich [16].
It is important to realize that many system matrices, A, B, C, can describe the same input-output data. In particular, for any invertible matrix (aka similarity transform), T, the system matrices A, B, C and A′, B′, C′ with
A′ = T⁻¹AT,   B′ = T⁻¹B,   C′ = CT,
describe exactly the same input-output behavior.
ᵂp_K = [ x;  0 ] + [ cos θ  −sin θ;  sin θ  cos θ ] ᴾp_K,
where x is the position of the cart, and θ is the angle of the pole. I've used Monogram notation to denote that ᴾp_K is the Cartesian position of a point K in the pole's frame, P, and ᵂp_K is that same position in the camera/world frame, W.
While it is probably unreasonable to linearize the RGB outputs as a function of the
cart-pole positions, x and θ, it is reasonable to approximate the keypoint positions
by linearizing this equation for small angles.
You'll see in the example that we recover the impulse response quite nicely.
The big question is: does Ho-Kalman discover the 4-dimensional state vector that
we know and like for the cart-pole? Does it find something different? We would
typically choose the order by looking at the magnitude of the singular values of the
Hankel data matrix. In this example, in the case of no noise, you see clearly that 2
states describe most of the behavior, and we have diminishing returns after 4 or 5:
As an exercise, try adding process and/or measurement noise to the system that
generated the data, and see how they affect this result.
Open in Colab
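As a sketch of the order-selection step discussed above, the snippet below forms a Hankel matrix from the Markov parameters of an assumed two-state system and inspects its singular values; the system matrices are hypothetical placeholders.

import numpy as np

# Sketch: form a Hankel matrix from (here, noiseless) Markov parameters of an
# assumed two-state SISO system and inspect its singular values to pick the order.
A = np.array([[1.0, 0.01], [-0.05, 0.99]])   # hypothetical system for illustration
B = np.array([[0.0], [0.01]])
C = np.array([[1.0, 0.0]])

markov = [(C @ np.linalg.matrix_power(A, k) @ B).item() for k in range(40)]  # C A^k B

rows, cols = 20, 20
H = np.array([[markov[i + j] for j in range(cols)] for i in range(rows)])
singular_values = np.linalg.svd(H, compute_uv=False)
print(np.round(singular_values[:6], 6))   # expect a sharp drop after the true order (2)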
Another important class of linear models predict the output directly from a history
of recent inputs and outputs:
These are the so-called "AutoRegressive models with eXogenous input (ARX)"
models. The coefficients of ARX models can be fit directly from input-output data
using linear least squares.
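A minimal sketch of such a fit, assuming a second-order single-input/single-output ARX structure (the model order and the data-generating system below are illustrative assumptions):

import numpy as np

# Sketch: fit a single-output ARX model
#   y[n] = a1*y[n-1] + a2*y[n-2] + b1*u[n-1] + b2*u[n-2]
# by linear least squares.
rng = np.random.default_rng(0)
p, N = 2, 500
u = rng.standard_normal(N)
y = np.zeros(N)
for n in range(p, N):   # an assumed "true" system generating the data
    y[n] = 1.5 * y[n - 1] - 0.7 * y[n - 2] + 0.5 * u[n - 1] + 0.1 * u[n - 2] \
           + 0.01 * rng.standard_normal()

# Regressor rows are [y[n-1] ... y[n-p]  u[n-1] ... u[n-p]].
Phi = np.array([np.concatenate([y[n - p:n][::-1], u[n - p:n][::-1]]) for n in range(p, N)])
coeffs, *_ = np.linalg.lstsq(Phi, y[p:], rcond=None)
print("estimated [a1 a2 b1 b2]:", coeffs)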
where I've (again) used s for discrete states, a for discrete actions, and o for
discrete observations. With everything discrete, these conditional probabilities are
represented naturally with matrices/tensors. We'll use the tensors
T ijk ≡ Pr(s[n + 1] = s i |s[n] = s j , a[n] = a k ) to denote the transition matrix, and
O ijk ≡ Pr(o[n] = o i |s[n] = s j , a[n] = a k ) for the observation matrix.
Following analogously to our discussion on linear dynamical systems, the first case
to understand is when we have direct access to state and action trajectories:
s[⋅], a[⋅]. This corresponds to identifying a Markov chain or Markov decision process
(MDP). The transition matrices can be extracted directly from the statistics of the
transitions:
Not surprisingly, for this estimate to converge to the true transition probabilities
asymptotically, we need the identification (exploration) dynamics to be ergodic (for
every state/action pair to be visited infinitely often).
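A minimal sketch of that counting estimator, with hypothetical state/action sizes and synthetic transition data (the logged triples below are random placeholders, not from a real system):

import numpy as np

# Sketch: estimate the transition tensor T[i, j, k] = Pr(s' = s_i | s = s_j, a = a_k)
# from logged transitions by counting.
num_states, num_actions = 3, 2
rng = np.random.default_rng(0)
data = [(rng.integers(num_states), rng.integers(num_actions), rng.integers(num_states))
        for _ in range(1000)]          # hypothetical (s, a, s') samples

counts = np.zeros((num_states, num_states, num_actions))
for s, a, s_next in data:
    counts[s_next, s, a] += 1

totals = counts.sum(axis=0, keepdims=True)
# Normalize each (s, a) column; fall back to uniform for unvisited pairs.
T_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / num_states)
print(T_hat[:, :, 0])    # estimated Pr(s' | s, a = a_0)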
For large MDPs, analogous to the modal decomposition that we described for linear
dynamical systems, we can also consider factored representations of the transition
matrix. [Coming soon...]
Deep learning tools allow us to quite reliably tune the neural networks to fit
our training data. The major challenges are with respect to data efficiency/
generalization, and with respect to actually designing planners / controllers based
on these models that have such complex descriptions.
One extremely interesting question that arises when fitting rich nonlinear models
like neural networks is the question of how to generate the training data. You
might have the general sense that we would like data that provides some amount
of "coverage" of the state space; this is particularly challenging for underactuated
systems since we have dynamic constraints restricting our trajectories in state
space.
This topic has received considerable attention lately in the context of model-based
reinforcement learning (e.g. [30]). Broadly speaking, in the ideal case we would like
the training data we use for system identification to match the distribution of data
that we will encounter when executing the optimal policy. But one cannot know this
distribution until we've designed our controller, which (in our current discussion)
requires us to have a model. It is a classic "chicken and the egg" problem. In most
cases, it speaks to the importance of interleaving system identification and control
design instead of the simpler notion of performing identification once and then using
the model forevermore.
18.9 EXERCISES
REFERENCES
1. W Khalil and E Dombre, "Modeling, Identification and Control of Robots", Elsevier, 2004.
6. Andy Zeng and Shuran Song and Johnny Lee and Alberto Rodriguez and
Thomas Funkhouser, "Tossingbot: Learning to throw arbitrary objects with
residual physics", IEEE Transactions on Robotics, vol. 36, no. 4, pp.
1307--1319, 2020.
10. Steven L Brunton and Joshua L Proctor and J Nathan Kutz, "Discovering
governing equations from data by sparse identification of nonlinear
dynamical systems", Proceedings of the national academy of sciences, vol.
113, no. 15, pp. 3932--3937, 2016.
12. Joseph Moore and Russ Tedrake, "Adaptive Control Design for
Underactuated Systems Using Sums-of-Squares Optimization", Proceedings
of the 2014 American Control Conference (ACC), June, 2014. [ link ]
13. Nima Fazeli and Roman Kolbert and Russ Tedrake and Alberto Rodriguez,
"Parameter and contact force estimation of planar rigid-bodies undergoing
frictional contact", International Journal of Robotics Research, vol. 36, no.
13-14, pp. 1437-1454, 2017. [ link ]
14. Moritz Hardt and Tengyu Ma and Benjamin Recht, "Gradient descent learns
linear dynamical systems", arXiv preprint arXiv:1609.05191, 2016.
15. Elad Hazan and Holden Lee and Karan Singh and Cyril Zhang and Yi Zhang,
"Spectral filtering for general linear dynamical systems", arXiv preprint
arXiv:1802.03981, 2018.
17. Max Simchowitz and Ross Boczar and Benjamin Recht, "Learning linear
dynamical systems with semi-parametric least squares", Conference on
Learning Theory, pp. 2714--2802, 2019.
18. L. Ljung, "System Identification: Theory for the User", Prentice Hall, 1999.
20. Steven L Brunton and J Nathan Kutz, "Data-driven science and engineering: Machine learning, dynamical systems, and control", Cambridge University Press, 2019.
22. Sadra Sadraddini and Russ Tedrake, "Robust Output Feedback Control
with Guaranteed Constraint Satisfaction", In the Proceedings of 23rd ACM
International Conference on Hybrid Systems: Computation and Control, pp.
12, April, 2020. [ link ]
23. Max Simchowitz and Karan Singh and Elad Hazan, "Improper learning for
non-stochastic control", Conference on Learning Theory, pp. 3320--3436,
2020.
25. Lucas Manuelli* and Wei Gao* and Peter Florence and Russ Tedrake, "kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation", arXiv e-prints, pp. arXiv:1903.06684, Mar, 2019. [ link ]
26. Peter R. Florence* and Lucas Manuelli* and Russ Tedrake, "Dense Object
Nets: Learning Dense Visual Object Descriptors By and For Robotic
Manipulation", Conference on Robot Learning (CoRL), October, 2018.
[ link ]
27. Peter Florence and Lucas Manuelli and Russ Tedrake, "Self-Supervised
Correspondence in Visuomotor Policy Learning", IEEE Robotics and
Automation Letters, vol. 5, no. 2, pp. 492-499, April, 2020. [ link ]
28. Lucas Manuelli and Yunzhu Li and Pete Florence and Russ Tedrake,
"Keypoints into the Future: Self-Supervised Correspondence in Model-
Based Reinforcement Learning", Conference on Robot Learning (CoRL),
2020. [ link ]
29. Jack Umenberger and Johan Wågberg and Ian R Manchester and Thomas B Schön, "Maximum likelihood identification of stable linear dynamical systems", Automatica, vol. 96, pp. 280--292, 2018.
30. Alekh Agarwal and Nan Jiang and Sham M. Kakade and Wen Sun, "Reinforcement Learning: Theory and Algorithms", Online Draft, 2020.
CHAPTER 19
State Estimation
19.1 OBSERVERS AND THE KALMAN FILTER
19.3 SMOOTHING
CHAPTER 20
Model-Free Policy Search
Reinforcement learning (RL) is a collection of algorithms for solving the same
optimal control problem that we've focused on through the text, but the real gems
from the RL literature are the algorithms for (almost) black-box optimization of
stochastic optimal control problems. The idea of an algorithm that only has a "black-
box" interface to the optimization problem means that it can obtain (potentially
noisy) samples of the optimal cost via trial and error, but does not have access to the
underlying model, and does not have direct access to complete gradient information.
In this chapter we will examine one particular style of reinforcement learning that
attempts to find good controllers by explicitly parameterizing a family of policies
(e.g. via a parameter vector α), then searching directly for the parameters that
optimize the long-term cost. For a stochastic optimal control problem with our
favorite additive form of the objective, this might look like:
min_α E[ Σ_{n=0}^{N} ℓ(x[n], u[n]) ]   (1)
subject to  x[0] ∼ p₀(x),
x[n] ∼ p(x[n] | x[n−1], u[n−1]),
u[n] ∼ p_α(u[n] | x[n]).
To develop the key idea, consider first the simpler problem of computing the gradient of E[g(x)] for some scalar cost g. I hope the notation is clear? x is a random vector, drawn from the distribution p_α(x), with the subscript indicating that the distribution depends on the parameter vector α. What is the gradient of this function? The REINFORCE derivation is
∂/∂α E[g(x)] = ∂/∂α ∫ dx g(x) p_α(x)
= ∫ dx g(x) ∂/∂α p_α(x)
= ∫ dx g(x) p_α(x) ∂/∂α log p_α(x)
= E[ g(x) ∂/∂α log p_α(x) ].
This uses the chain rule for the logarithm, y = log u, ∂y/∂x = (1/u) ∂u/∂x, to write
∂/∂α p_α(x) = p_α(x) ∂/∂α log p_α(x).
This suggests a simple Monte Carlo algorithm for estimating the policy gradient:
draw N random samples x i then estimate the gradient as
∂/∂α E[g(x)] ≈ (1/N) Σ_i g(x_i) ∂/∂α log p_α(x_i).
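Here is a minimal numerical sketch of that Monte Carlo estimator for a Gaussian distribution and an assumed quadratic cost (both choices are illustrative assumptions, not from the text):

import numpy as np

# Sketch: Monte Carlo REINFORCE gradient estimate for x ~ N(alpha, sigma^2 I)
# and an assumed quadratic cost g.
rng = np.random.default_rng(0)
sigma = 0.5
alpha = np.array([1.0, -2.0])
g = lambda x: np.sum(x ** 2)   # E[g(x)] = |alpha|^2 + n*sigma^2, so the gradient is 2*alpha

N = 10000
samples = alpha + sigma * rng.standard_normal((N, alpha.size))
grad_log_p = (samples - alpha) / sigma ** 2      # d/dalpha log p_alpha(x) for this Gaussian
costs = np.array([g(x) for x in samples])
grad_estimate = np.mean(costs[:, None] * grad_log_p, axis=0)
print("Monte Carlo estimate:", grad_estimate, "   analytic gradient:", 2 * alpha)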
This trick is even more potent in the optimal control case. For a finite-horizon
problem we have
∂/∂α E[ Σ_{n=0}^{N} ℓ(x[n], u[n]) ] = ∫ dx[⋅] du[⋅] [ Σ_{n=0}^{N} ℓ(x[n], u[n]) ] ∂/∂α p_α(x[⋅], u[⋅])
= E[ ( Σ_{n=0}^{N} ℓ(x[n], u[n]) ) ∂/∂α log p_α(x[⋅], u[⋅]) ],
where x[⋅] is shorthand for the entire trajectory x[0], . . . , x[n], and
p_α(x[⋅], u[⋅]) = p₀(x[0]) ( ∏_{n=1}^{N} p(x[n] | x[n−1], u[n−1]) ) ( ∏_{n=0}^{N} p_α(u[n] | x[n]) ).
Only the policy terms depend on α, so ∂/∂α log p_α(x[⋅], u[⋅]) = Σ_{k=0}^{N} ∂/∂α log p_α(u[k] | x[k]), and the gradient simplifies to
∂/∂α E[ Σ_{n=0}^{N} ℓ(x[n], u[n]) ] = E[ Σ_{n=0}^{N} ℓ(x[n], u[n]) Σ_{k=0}^{n} ∂/∂α log p_α(u[k] | x[k]) ],
where the last step uses the fact that E[ ℓ(x[n], u[n]) ∂/∂α log p_α(u[k] | x[k]) ] = 0 for all k > n.
This update should surprise you. It says that I can find the gradient of the long-term
cost by taking only the gradient of the policy... but not the gradient of the plant, nor
the cost! The intuition is that one can obtain the gradient by evaluating the policy
along a number of (random) trajectory roll-outs of the closed-loop system, evaluating
the cost on each, then increasing the probability in the policy of taking the actions
correlated with lower long-term costs.
This derivation is often presented as the policy gradient derivation. The identity is
certainly correct, but I'd prefer that you think of this as just one way to obtain the
policy gradient. It's a particularly clever one in that it uses exactly the information
that we have available in reinforcement learning -- we have access to the
instantaneous costs, ℓ(x[n], u[n]), and to the policy -- so are allowed to take gradients
of the policy -- but does not require any understanding whatsoever of the plant
model. Although it's clever, it is not particularly efficient -- the Monte Carlo
approximations of the expected value have a high variance, so many samples are
required for an accurate estimate. Other derivations are possible, some even
simpler, and others that do make use of the plant gradients if you have access to
them, and these will perform differently in the finite-sample approximations.
For the remainder of this section, I'd like to dig in and try to understand the nature
of the stochastic update a little better, to help you understand what I mean.
Let's take a step back and think more generally about how one can use gradient
descent in black-box (unconstrained) optimization. Imagine you have a simple
optimization problem:
min g(α),
α
and you can directly evaluate g(), but not . How can you perform gradient
∂g
∂α
descent?
One straightforward approach is the method of finite differences:
∂g/∂α_i ≈ [ g(α + ϵ_i) − g(α) ] / ϵ,
where ϵ_i is the column vector with ϵ in the ith row, and zeros everywhere else.
Finite difference methods can be computationally very expensive, requiring n + 1
evaluations of the function for every gradient step, where n is the length of the input
vector.
A simple alternative, known as weight perturbation, samples a small random vector β and applies the update
Δα = −η [ g(α + β) − g(α) ] β.
The intuition here is that if g(α + β) < g(α), then β was a good change and we'll move
in the direction of β. If the cost increased, then we'll move in the opposite direction.
Assuming the function is smooth and β is small, then we will always move downhill
on the Lyapunov function.
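A minimal sketch of this weight-perturbation update on an assumed quadratic cost (the cost, step size, and noise scale below are illustrative assumptions):

import numpy as np

# Sketch: the weight-perturbation update on an assumed quadratic cost.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])
g = lambda a: np.sum((a - target) ** 2)

alpha = np.zeros(3)
eta, sigma = 0.2, 0.1
for step in range(5000):
    beta = sigma * rng.standard_normal(alpha.shape)
    alpha = alpha - eta * (g(alpha + beta) - g(alpha)) * beta   # two cost evaluations per update
print("alpha after weight perturbation:", alpha)                # approaches the target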
Even more interesting, on average, the update is actually in the direction of the true
gradient. To see this, again we can assume the function is locally smooth and β is
small to write the Taylor expansion:
g(α + β) ≈ g(α) + (∂g/∂α) β.
Now we have
Δα ≈ −η [ (∂g/∂α) β ] β = −η β βᵀ (∂g/∂α)ᵀ,
E[Δα] ≈ −η E[β βᵀ] (∂g/∂α)ᵀ.
If we select each element of β independently from a distribution with zero mean and variance σ²_β, i.e., E[β_i] = 0, E[β_i β_j] = σ²_β δ_ij, then we have
E[Δα] ≈ −η σ²_β (∂g/∂α)ᵀ.
Note that the distribution p α (x) need not be Gaussian, but it is the variance of the
distribution which determines the scaling on the gradient.
The weight perturbation update above requires us to evaluate the function g twice
for every update of the parameters. This is low compared to the method of finite
differences, but let us see if we can do better. What if, instead of evaluating the
function twice per update [g(α + β) and g(α)], we replace the evaluation of g(α) with
an estimator, b = g^(α), obtained from previous trials? In other words, consider an
update of the form:
Δα = −(η/σ²_β) [ g(α + β) − b ] β   (2)
The estimator can take any of a number of forms, but perhaps the simplest is based
on the update (after the nth trial):
where γ parameterizes the moving average. Let's compute the expected value of the
update, using the same Taylor expansion that we used above:
E[Δα] = −(η/σ²_β) E[ [ g(α) + (∂g/∂α)β − b ] β ]
= −(η/σ²_β) E[ [ g(α) − b ] β ] − (η/σ²_β) E[β βᵀ] (∂g/∂α)ᵀ
= −η (∂g/∂α)ᵀ.
In other words, the baseline does not affect our basic result - that the expected
update is in the direction of the gradient. Note that this computation works for any
baseline estimator which is uncorrelated with β, as it should be if the estimator is a
function of the performance in previous trials.
Although using an estimated baseline doesn't affect the average update, it can have
a dramatic effect on the performance of the algorithm. As we will see in a later
section, if the evaluation of g is stochastic, then the update with a baseline estimator
can actually outperform an update using straight function evaluations.
Let's take the extreme case of b = 0. This seems like a very bad idea... on each step
we will move in the direction of every random perturbation, but we will move more
or less in that direction based on the evaluated cost. On average, we will still move
in the direction of the true gradient, but only because we will eventually move more
downhill than we move uphill. It feels very naive.
20.1.6 REINFORCE w/ additive Gaussian noise
Recall the REINFORCE identity:
∂/∂α E[g(x)] = E[ g(x) ∂/∂α log p_α(x) ].
It turns out that weight perturbation is a REINFORCE algorithm. To see this, take x = α + β, β ∼ N(0, σ²I), to have
p_α(x) = 1/√((2πσ²)^N) exp( −(x − α)ᵀ(x − α) / (2σ²) ),
log p_α(x) = −(x − α)ᵀ(x − α) / (2σ²) + terms that don't depend on α,
∂/∂α log p_α(x) = (1/σ²)(x − α)ᵀ = (1/σ²) βᵀ.
If we use only a single trial for each Monte-Carlo evaluation, then the REINFORCE
update is
Δα = −(η/σ²) g(α + β) β,
which is precisely the weight perturbation update (the crazy version with b = 0
discussed above). Although it will move in the direction of the gradient on average,
it can be highly inefficient. In practice, people use many more than one sample to
estimate the policy gradient.
20.1.7 Summary
The policy gradient "trick" from REINFORCE using log probabilities provides one
means of estimating the true policy gradient. It is not the only way to obtain the
policy gradient... in fact the trivial weight-perturbation update obtains the same
gradient for the case of a policy with a mean linear in α and a fixed diagonal
covariance. Its cleverness lies in the fact that it makes use of the information we do
have (instantaneous cost values and the gradients of the policy), and that it provides
an unbiased estimate of the gradient (note that taking the gradients of a model that
is only slightly wrong might not have this virtue). But its inefficiency comes from
having a potentially very high variance. Variance reduction for policy gradient, often
through baseline estimation, continues to be an active area of research.
In the special case of the unbiased update, the equation reduces to:
SNR = E[Δα]ᵀ E[Δα] / ( E[(Δα)ᵀ(Δα)] − E[Δα]ᵀ E[Δα] ).
E[Δα]ᵀ E[Δα] = η² (∂g/∂α)(∂g/∂α)ᵀ = η² Σ_{i=1}^{N} (∂g/∂α_i)²,
E[(Δα)ᵀ(Δα)] = (η²/σ⁴_β) E[ [ g(α + β) − g(α) ]² βᵀβ ]
≈ (η²/σ⁴_β) E[ ( Σ_i (∂g/∂α_i) β_i )( Σ_j (∂g/∂α_j) β_j )( Σ_k β_k² ) ]
= (η²/σ⁴_β) Σ_{i,j,k} (∂g/∂α_i)(∂g/∂α_j) E[β_i β_j β_k²].
Using
E[β_i β_j β_k²] = 0 for i ≠ j,   σ⁴_β for i = j ≠ k,   μ₄(β_i) for i = j = k,
this becomes
E[(Δα)ᵀ(Δα)] = η² (N − 1) Σ_i (∂g/∂α_i)² + (η² μ₄(β_i)/σ⁴_β) Σ_i (∂g/∂α_i)²,
and the signal-to-noise ratio for the weight perturbation update reduces to
SNR = 1 / ( N − 2 + μ₄(β_i)/σ⁴_β ).
For additive Gaussian noise, μ₄(β_i) = 3σ⁴_β, and this simplifies to
SNR = 1 / (N + 1).
Example 20.2 (Signal-to-noise ratio for additive uniform noise)
For β_i drawn from a uniform distribution over [−a, a], we have μ₁ = 0, μ₂ = a²/3 = σ²_β, μ₃ = 0, μ₄ = a⁴/5 = (9/5)σ⁴_β, simplifying the above to:
SNR = 1 / (N − 1/5).
Performance calculations using the SNR can be used to design the parameters of
the algorithm in practice. For example, based on these results it is apparent that
noise added through a uniform distribution yields better gradient estimates than the
Gaussian noise case for very small N, but that these differences are negligible for
large N .
Similar calculations can yield insight into picking the size of the additive noise (scaled by σ_β). The calculations in this section seem to imply that larger σ_β can only reduce the variance, overcoming errors or noise in the baseline estimator, b̃; this is a shortcoming of our first-order Taylor expansion. If the cost function is not linear in the parameters, then an examination of higher-order terms reveals that large σ_β can decrease the SNR. The derivation with a second-order Taylor expansion is left for an exercise.
REFERENCES
1. John W. Roberts, "Motor Learning on a Heaving Plate via Improved-SNR Algorithms", February, 2009. [ link ]
APPENDIX A
Drake
DRAKE is the main software toolbox that we use for this text, which in fact originated
in large part due to the MIT underactuated course. The DRAKE website is the main
source for information and documentation. The goal of this chapter is to provide any
additional information that can help you get up and running with the examples and
exercises provided in these notes.
A.1 PYDRAKE
DRAKE is primarily a C++ library, with rigorous coding standards and a maturity
level intended to support even professional applications in industry. In order to
provide a gentler introduction, and to facilitate rapid prototyping, I've written these
notes exclusively in python, using Drake's python bindings (pydrake). These bindings
are less mature than the C++ backend; your feedback (and even contributions) are
very welcome. It is still improving rapidly.
There are also a number of tutorials in DRAKE that can help you get started.
As you get more advanced, you will likely want to run (and extend) these examples
on your own machine. The DRAKE website has a number of installation options,
including precompiled binaries and Docker instances. Here I provide an example
workflow of setting up drake from the precompiled binaries and running the
textbook examples.
The links below indicate the specific distributions that the examples in this text have
been tested against.
and install the prerequisites using the platform-specific installation script provided:
cd underactuated
sudo scripts/setup/ubuntu/18.04/install_prereqs.sh
pip3 install --requirement requirements.txt
export PYTHONPATH=`pwd`:${PYTHONPATH}
You'll likely want to start from the underactuated root directory. Then launch your
notebook with:
jupyter notebook
The examples for each chapter that has them will be in a .ipynb file right alongside
the chapter's html file, and the notebook exercises are all located in the exercises
subdirectory.
If you have trouble with DRAKE, please follow the advice here. If you have trouble
with the underactuated repo, you can check for known issues (and potentially file a
new one) here.
APPENDIX B
Multi-Body Dynamics
B.1 DERIVING THE EQUATIONS OF MOTION
The equations of motion for a standard robot can be derived using the method
of Lagrange. Using T as the total kinetic energy of the system, and U as the
total potential energy of the system, L = T − U , and Q i as the generalized force
corresponding to q i , the Lagrangian dynamic equations are:
d/dt (∂L/∂q̇_i) − ∂L/∂q_i = Q_i.   (1)
Without going into the full details, the key idea for handling constraints is the
"principle of virtual work" (and the associated D'Almbert's principle). To describe
even a pendulum at equilibrium in the Newtonian approach, we would have to
compute both external forces (like gravity) and internal forces (like the forces
keeping the pendulum arm attached to the table), and have their sum equal zero.
Computing internal forces is slightly annoying for a pendulum; it becomes untenable
for more complex mechanisms. If the sum of the forces equals zero, then the work
done by those forces in some (infinitesimal) virtual displacement is certainly also
equal to zero. Now here is the trick: if we consider only virtual displacements that
are consistent with the kinematic constraints (e.g. rotations around the pin joint
of the pendulum), then we can compute that virtual work and set it equal to zero
without ever computing the internal forces! Extending this idea to the dynamics
case (via D'Alembert's and Hamilton's principles) results eventually in the Lagrangian
equations above.
If you are not comfortable with these equations, then any good book chapter on rigid
body mechanics can bring you up to speed. [1] is an excellent practical guide to
robot kinematics/dynamics. But [2] is my favorite mechanics book, by far; I highly
recommend it if you want something more. It's also ok to continue on for now
thinking of Lagrangian mechanics simply as a recipe that you can apply in a great
many situations to generate the equations of motion.
For completeness, I've included a derivation of the Lagrangian from the principle of
stationary action below.
Consider the simple double pendulum with torque actuation at both joints and
all of the mass concentrated in two points (for simplicity). Using q = [θ 1 , θ 2 ] T , and
p 1 , p 2 to denote the locations of m 1 , m 2 , respectively, the kinematics of this system
are:
p₁ = l₁ [ s₁,  −c₁ ]ᵀ,   p₂ = p₁ + l₂ [ s₁₊₂,  −c₁₊₂ ]ᵀ,
ṗ₁ = l₁ q̇₁ [ c₁,  s₁ ]ᵀ,   ṗ₂ = ṗ₁ + l₂ (q̇₁ + q̇₂) [ c₁₊₂,  s₁₊₂ ]ᵀ.
Note that s 1 is shorthand for sin(q 1 ), c 1+2 is shorthand for cos(q 1 + q 2 ), etc. From this
we can write the kinetic and potential energy:
T = ½ m₁ ṗ₁ᵀ ṗ₁ + ½ m₂ ṗ₂ᵀ ṗ₂ = ½ (m₁ + m₂) l₁² q̇₁² + ½ m₂ l₂² (q̇₁ + q̇₂)² + m₂ l₁ l₂ q̇₁ (q̇₁ + q̇₂) c₂,
U = m₁ g y₁ + m₂ g y₂ = −(m₁ + m₂) g l₁ c₁ − m₂ g l₂ c₁₊₂.
Plugging these into the Lagrangian equations (with Q_i = τ_i) yields the equations of motion:
(m₁ + m₂) l₁² q̈₁ + m₂ l₂² (q̈₁ + q̈₂) + m₂ l₁ l₂ (2q̈₁ + q̈₂) c₂ − m₂ l₁ l₂ (2q̇₁ + q̇₂) q̇₂ s₂ + (m₁ + m₂) l₁ g s₁ + m₂ g l₂ s₁₊₂ = τ₁,
m₂ l₂² (q̈₁ + q̈₂) + m₂ l₁ l₂ q̈₁ c₂ + m₂ l₁ l₂ q̇₁² s₂ + m₂ g l₂ s₁₊₂ = τ₂.
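If you'd like to check this derivation mechanically, here is a minimal sketch that rederives these equations symbolically with sympy (Drake's symbolic engine could be used similarly; sympy is chosen here only so the snippet is self-contained):

import sympy as sp

# Sketch: rederive the double-pendulum equations of motion above with sympy,
# as a check on the hand derivation.  Symbols follow the text.
t = sp.symbols('t')
m1, m2, l1, l2, g = sp.symbols('m1 m2 l1 l2 g', positive=True)
tau1, tau2 = sp.symbols('tau1 tau2')
q1, q2 = sp.Function('q1')(t), sp.Function('q2')(t)

p1 = l1 * sp.Matrix([sp.sin(q1), -sp.cos(q1)])
p2 = p1 + l2 * sp.Matrix([sp.sin(q1 + q2), -sp.cos(q1 + q2)])
v1, v2 = p1.diff(t), p2.diff(t)

T = sp.Rational(1, 2) * m1 * (v1.T * v1)[0] + sp.Rational(1, 2) * m2 * (v2.T * v2)[0]
U = m1 * g * p1[1] + m2 * g * p2[1]
L = sp.simplify(T - U)

# d/dt (dL/dqdot_i) - dL/dq_i = tau_i for each joint.
for q, tau in [(q1, tau1), (q2, tau2)]:
    lhs = sp.simplify(L.diff(q.diff(t)).diff(t) - L.diff(q))
    sp.pprint(sp.Eq(lhs, tau))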
These terms can be organized into the standard form of the manipulator equations,
M(q) q̈ + C(q, q̇) q̇ = τ_g(q) + Bu,
where q is the joint position vector, M is the inertia matrix, C captures Coriolis
forces, and τ g is the gravity vector. The matrix B maps control inputs u into
generalized forces. Note that we pair Mq̈ + Cq̇ on the left side because "... the
equations of motion depend on the choice of coordinates q. For this reason neither
Mq̈ nor Cq̇ individually should be thought of as a generalized force; only their sum
is a force"[5](s.10.2). These terms represent the contribution from kinetic energy to
the Lagrangian:
d/dt (∂T/∂q̇)ᵀ − (∂T/∂q)ᵀ = M(q)q̈ + C(q, q̇)q̇.
The manipulator equations are very general, but they do define some important
characteristics. For example, q̈ is (state-dependent) linearly related to the control
input, u. This observation justifies the control-affine form of the dynamics assumed
throughout the notes. There are many other important properties that might prove
useful. For instance, we know that the ith element of Cq̇ is given by [3, §5.2.3]:
[C(q, q̇)q̇]ᵢ = ∑_{j,k} c_{ijk}(q) q̇ⱼ q̇ₖ,   c_{ijk}(q) = ∂M_{ij}/∂qₖ − ½ ∂M_{jk}/∂qᵢ,
(e.g., it is also quadratic in q̇) and, for the appropriate choice of C, we have that
Ṁ(q) − 2C(q, q̇) is skew-symmetric [6].
Note that we have chosen to use the notation of second-order systems (with q̇ and
q̈ appearing in the equations) throughout this book. Although I believe it provides
more clarity, there is an important limitation to this notation: it is impossible to
describe 3D rotations in "minimal coordinates" using this notation without
introducing kinematic singularities (like the famous "gimbal lock"). For instance, a
common singularity-free choice for representing a 3D rotation is the unit quaternion,
described by 4 real values (plus a norm constraint). However we can still represent
the rotational velocity without singularities using just 3 real values. This means that
the length of our velocity vector is no longer the same as the length of our position
vector. For this reason, you will see that most of the software in DRAKE uses the
more general notation with v to represent velocity, q to represent positions, and the
manipulator equations are written as
q̇ = N(q)v,   M(q)v̇ + C(q, v)v = τ_g(q) + Bu,
which is not necessarily a second-order system. See [7] for a nice discussion of this
topic.
The equations of motion for our machines get complicated quickly. Fortunately,
for robots with a tree-link kinematic structure, there are very efficient and natural
recursive algorithms for generating these equations of motion. For a detailed
reference on these methods, see [8]; some people prefer reading about the
Articulated Body Method in [9]. The implementation in DRAKE uses a related
formulation from [10].
M(q)q̇ = p,
ṗ = c_H(q, p) + τ_g(q) + Bu,   where c_H(q, p) = ½ (∂[q̇ᵀ M(q) q̇]/∂q)ᵀ.
If our robot has closed-kinematic chains, for instance those that arise from a four-
bar linkage, then we need a little more. The Lagrangian machinery above assumes
"minimal coordinates"; if our state vector q contains all of the links in the kinematic
chain, then we do not have a minimal parameterization -- each kinematic loop
adds (at least) one constraint so should remove (at least) one degree of freedom.
Although some constraints can be solved away, the more general solution is to use
the Lagrangian to derive the dynamics of the unconstrained system (a kinematic
tree without the closed-loop constraints), then add additional generalized forces that
ensure that the constraint is always satisfied.
h(q) = 0. (5)
For the case of the kinematic closed-chain, this can be the kinematic constraint that
the location of one end of the chain is equal to the location of the other end of the
chain. The equations of motion can be written
M(q)q̈ + C(q, q̇)q̇ = τ_g(q) + Bu + H(q)ᵀλ,
where H(q) = ∂h/∂q and λ is the constraint force. Let's use the shorthand τ = τ_g(q) + Bu − C(q, q̇)q̇ for the non-constraint terms, so that the dynamics are
Mq̈ = τ + Hᵀλ. (7)
Using
ḣ = Hq̇, (8)
ḧ = Hq̈ + Ḣq̇, (9)
we can solve for λ by observing that when the constraint is imposed, h = 0 and
therefore ḣ = 0 and ḧ = 0. Inserting the dynamics (7) into (9) yields
λ = −(HM⁻¹Hᵀ)⁺ (HM⁻¹τ + Ḣq̇). (10)
To combat numerical "constraint drift", one might like to add a restoring force in the
event that the constraint is not satisfied to numerical precision. To accomplish this,
rather than solving for ḧ = 0 as above, we can instead solve for
ḧ = Hq̈ + Ḣq̇ = −2αḣ − α²h, (11)
λ = −(HM⁻¹Hᵀ)⁺ (HM⁻¹τ + (Ḣ + 2αH)q̇ + α²h). (12)
min_λ  ½ λᵀ H M⁻¹ Hᵀ λ − λᵀ ḧ_uc,
where ḧ_uc is the constraint acceleration obtained from the unconstrained dynamics.
The primal formulation has accelerations as the decision variables, and the dual
formulation has constraint forces as the decision variables. For now this is merely a
cute observation; we'll build on it more when we get to discussing contact forces.
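To make equation (12) concrete, here is a small numerical sketch (the function and the toy example are mine, not from the text): a planar point mass constrained to a circle of radius l via h(q) = ½(qᵀq − l²), so that H = qᵀ and Ḣq̇ = q̇ᵀq̇.

```python
import numpy as np

def constrained_accel(M, tau, h, H, Hdot_qdot, qdot, alpha=10.0):
    """Solve M q̈ = τ + Hᵀλ with the Baumgarte-stabilized constraint force of Eq. (12):
    λ = -(H M⁻¹ Hᵀ)⁺ (H M⁻¹ τ + (Ḣ + 2αH) q̇ + α² h)."""
    Minv = np.linalg.inv(M)
    lam = -np.linalg.pinv(H @ Minv @ H.T) @ (
        H @ Minv @ tau + Hdot_qdot + 2 * alpha * (H @ qdot) + alpha**2 * h)
    return Minv @ (tau + H.T @ lam), lam

# Point mass of mass m pinned to a circle of radius l, hanging in gravity.
m, l, g = 1.0, 1.0, 9.81
q = np.array([l, 0.0])               # currently at the side of the circle
qdot = np.zeros(2)
M = m * np.eye(2)
tau = np.array([0.0, -m * g])        # gravity is the only applied force
h = np.array([0.5 * (q @ q - l**2)]) # h(q) = 0.5 (qᵀq - l²)
H = q.reshape(1, 2)                  # ∂h/∂q = qᵀ
Hdot_qdot = np.array([qdot @ qdot])  # Ḣ q̇ = q̇ᵀ q̇
print(constrained_accel(M, tau, h, H, Hdot_qdot, qdot))
```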
Velocity constraints take the form h_v(q, q̇) = 0, where ∂h_v/∂q̇ ≠ 0. These are less
common, but arise when, for instance, a joint is driven through a prescribed motion.
Here, the manipulator equations are given by
Mq̈ = τ + (∂h_v/∂q̇)ᵀ λ. (15)
Using
ḣ_v = (∂h_v/∂q) q̇ + (∂h_v/∂q̇) q̈, (16)
setting ḣ_v = 0 yields
λ = −((∂h_v/∂q̇) M⁻¹ (∂h_v/∂q̇)ᵀ)⁺ [(∂h_v/∂q̇) M⁻¹ τ + (∂h_v/∂q) q̇], (17)
or, with a constraint-stabilization term,
λ = −((∂h_v/∂q̇) M⁻¹ (∂h_v/∂q̇)ᵀ)⁺ [(∂h_v/∂q̇) M⁻¹ τ + (∂h_v/∂q) q̇ + α h_v]. (18)
B.3 THE DYNAMICS OF CONTACT
There are three main approaches used to model contact with "rigid" body
systems: 1) rigid contact approximated with stiff compliant contact, 2) hybrid models
with collision event detection, impulsive reset maps, and continuous (constrained)
dynamics between collision events, and 3) rigid contact approximated with time-
averaged forces (impulses) in a time-stepping scheme. Each modeling approach has
advantages/disadvantages for different applications.
Before we begin, there is a bit of notation that we will use throughout. Let ϕ(q)
indicate the relative (signed) distance between two rigid bodies. For rigid contact,
we would like to enforce the unilateral constraint:
ϕ(q) ≥ 0. (19)
We'll use n = ∂ϕ/∂q to denote the "contact normal", as well as t₁ and t₂ as a basis for
the tangents at the contact (orthonormal vectors in the Cartesian space, projected
into joint space), all represented as row vectors in joint coordinates.
Figure B.2 - Contact coordinates in 2D. (a) The signed distance between contact
bodies, ϕ(q). (b) The normal (n) and tangent (t) contact vectors -- note that these
can be drawn in 2D when q is the x, y position of one of the bodies, but more
generally the vectors live in the configuration space. (c) Sometimes it will be
helpful for us to express the tangential coordinates using d 1 and d 2 ; this will make
more sense in the 3D case.
Figure B.3 - Contact coordinates in 3D.
We will also find it useful to assemble the contact normal and tangent vectors into a
single matrix, J, that we'll call the contact Jacobian:
J(q) = [n; t₁; t₂].
As written, Jv gives the relative velocity of the closest points in the contact
coordinates; it can also be extended with three more rows to output a full spatial
velocity (e.g. when modeling torsional friction). The generalized force produced by
the contact is given by Jᵀλ, where λ = [f_n, f_t1, f_t2]ᵀ.
Coulomb friction is described by two parameters -- μ static and μ dynamic -- which are
the coefficients of static and dynamic friction. When the contact tangential velocity
(given by tv) is zero, then friction will produce whatever force is necessary to
resist motion, up to the threshold μ_static f_n. But once this threshold is exceeded,
we have slip (non-zero contact tangential velocity); during slip friction will produce
a constant drag force |f t | = μ dynamic f n in the direction opposing the motion. This
behaviour is not a simple function of the current state, but we can approximate it
with a continuous function as pictured below.
Figure B.5 - (left) The Coulomb friction model. (right) A continuous piecewise-linear
approximation of friction (green) and the Stribeck approximation of Coulomb
friction (blue); the x-axis is the contact tangential velocity, and the y-axis is the
friction coefficient.
With these two laws, we can recover the contact forces as relatively simple functions
of the current state. However, the devil is in the details. There are a few features of
this approach that can make it numerically unstable. If you've ever been working in
a robotics simulator and watched your robot take a step only to explode out of the
ground and go flying through the air, you know what I mean.
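For concreteness, here is one simple compliant model written out in code: a spring-damper normal force plus a smooth, Stribeck-style friction coefficient in the spirit of the blue curve above. The functional forms and constants are illustrative assumptions of mine; they are not DRAKE's (or any particular simulator's) contact model.

```python
import numpy as np

def compliant_contact_force(phi, phidot, v_t, k=1e5, d=1e3,
                            mu_s=0.9, mu_d=0.7, v_s=1e-3):
    """Return (normal force, tangential friction force) for one contact point.
    phi: signed distance, phidot: its rate, v_t: tangential (slip) velocity."""
    if phi >= 0.0:
        return 0.0, 0.0                        # not in contact
    fn = max(0.0, -k * phi - d * phidot)       # penalty spring-damper, never adhesive
    s = abs(v_t) / v_s                         # slip speed, normalized by Stribeck velocity
    mu = mu_d + (mu_s - mu_d) * np.exp(-s)     # blends from static toward dynamic friction
    mu *= np.tanh(s)                           # smoothly zero at zero slip velocity
    ft = -mu * fn * np.sign(v_t)               # friction opposes the sliding direction
    return fn, ft

print(compliant_contact_force(phi=-1e-4, phidot=-0.01, v_t=0.05))
```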
In order to tightly approximate the (nearly) rigid contact that most robots make with
the world, the stiffness of the contact "springs" must be quite high. For instance, I
might want my 180kg humanoid robot model to penetrate into the ground no more
than 1mm during steady-state standing. The challenge with adding stiff springs to
our model is that this results in stiff differential equations (it is not a coincidence that
the word stiff is the common term for this in both springs and ODEs). As a result,
the best implementations of compliant contact for simulation make use of special-
purpose numerical integration recipes (e.g. [13]) and compliant contact models are
often difficult to use in e.g. trajectory/feedback optimization.
But there is another serious numerical challenge with the basic implementation of
this model. Computing the signed-distance function, ϕ(q), when it is non-negative
is straightforward, but robustly computing the "penetration depth" once two bodies
are in collision is not. Consider the case of using ϕ(q) to represent the maximum
penetration depth of, say, two boxes that are contacting each other. With compliant
contact, we must have some penetration to produce any contact force. But the
direction of the normal vector, n = ∂ϕ/∂q, can easily change discontinuously as the
point of maximal penetration moves.
Figure B.6 - (Left) Two boxes in penetration, with the signed distance defined by
the maximum penetration depth. (Right) The contact normal for various points of
maximal penetration.
If you really think about this example, it actually reveals even more of the foibles of
trying to even define the contact points and normals consistently. It seems reasonable to
define the point on body B as the point of maximal penetration into body A, but
notice that as I've drawn it, that isn't actually unique! The corresponding point
on body A should probably be the point on the surface with the minimal distance
from the max penetration point (other tempting choices would likely cause the
distance to be discontinuous at the onset of penetration). But this whole strategy is
asymmetric; why shouldn't I have picked the vector going with maximal penetration
into body B? My point is simply that these cases exist, and that even our best
software implementations get pretty unsatisfying when you look into the details. And
in practice, it's a lot to expect the collision engine to give out consistent and smooth
contact points in every situation.
Some of the advantages of this approach include (1) the fact that it is easy to
implement, at least for simple geometries, (2) by virtue of being a continuous-time
model it can be simulated with error-controlled integrators, and (3) the tightness of
the approximation of "rigid" contact can be controlled through relatively intuitive
parameters. However, the numerical challenges of dealing with penetration are very
real, and they motivate our other two approaches that attempt to more strictly
enforce the non-penetration constraints.
Impulsive Collisions
The collision event is described by the zero-crossings (from positive to negative) of
ϕ. Let us start by assuming frictionless collisions, allowing us to write
Mq̈ = τ + λnᵀ, (21)
and to integrate both sides of this equation over the (instantaneous) collision
interval:
∫_{tc⁻}^{tc⁺} Mq̈ dt = ∫_{tc⁻}^{tc⁺} [τ + λnᵀ] dt.
Since M, τ, and n are constants over this interval, we are left with
Mq̇⁺ − Mq̇⁻ = nᵀ ∫_{tc⁻}^{tc⁺} λ dt,
and multiplying both sides by nM⁻¹ gives
nq̇⁺ − nq̇⁻ = nM⁻¹nᵀ ∫_{tc⁻}^{tc⁺} λ dt.
After the collision, we have ϕ̇⁺ = −eϕ̇⁻, with 0 ≤ e ≤ 1 denoting the coefficient of
restitution, yielding nq̇⁺ = −e nq̇⁻, and therefore
∫_{tc⁻}^{tc⁺} λ dt = −(1 + e) [nM⁻¹nᵀ]# nq̇⁻.
I've used the notation A # here to denote the pseudoinverse of A (I would normally
write A + , but have changed it for this section to avoid confusion). Substituting this
back in above results in
q̇⁺ = [I − (1 + e) M⁻¹ nᵀ [nM⁻¹nᵀ]# n] q̇⁻. (22)
We can add friction at the contact. Rigid bodies will almost always experience
contact at only a point, so we typically ignore torsional friction [8], and model only
tangential friction by extending n to a matrix
J = [n; t₁; t₂],
to capture the gradient of the contact location tangent to the contact surface. Then
λ becomes a Cartesian vector representing the contact impulse, and (for infinite
friction) the post-impact velocity condition becomes Jq̇ + = diag(−e, 0, 0)Jq̇ − , resulting
in the equations:
q̇⁺ = [I − M⁻¹ Jᵀ diag(1 + e, 1, 1) [JM⁻¹Jᵀ]# J] q̇⁻. (23)
For example, consider an object of mass m with rotational inertia (2/3)mr² which
contacts the ground at a point a distance r below its center (e.g. a bouncing ball),
with q collecting the horizontal position, vertical position, and rotation. The
dynamics are
M(q)q̈ = [0, −mg, 0]ᵀ + Jᵀλ,   M(q) = diag(m, m, (2/3)mr²),
where λ are the contact forces (which are zero except during the impulsive
collision). Given a coefficient of restitution e and a collision with a horizontal
ground, the post-impact velocities are:
q̇⁺ = [ 3/5, 0, −(2/5)r;  0, −e, 0;  −3/(5r), 0, 2/5 ] q̇⁻.
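Here is a minimal numerical sketch of the update (23) (the helper and the numbers are my own), assuming the contact normal is the first row of J as in the equation above; applied with the mass matrix and contact Jacobian of this example, it reproduces the post-impact map.

```python
import numpy as np

def post_impact_velocity(M, J, e, qdot_minus):
    """Frictional impact update, Eq. (23):
    q̇⁺ = [I - M⁻¹ Jᵀ diag(1+e, 1, ...) (J M⁻¹ Jᵀ)# J] q̇⁻, with the normal row first."""
    Minv = np.linalg.inv(M)
    A = np.linalg.pinv(J @ Minv @ J.T)                 # the '#' pseudoinverse
    E = np.diag([1.0 + e] + [1.0] * (J.shape[0] - 1))  # restitution only on the normal row
    return (np.eye(M.shape[0]) - Minv @ J.T @ E @ A @ J) @ qdot_minus

# The example above: q = [x, z, θ], M = diag(m, m, 2/3 m r²),
# normal n = [0, 1, 0] and tangent t = [1, 0, r] at the contact point.
m, r, e = 1.0, 0.2, 0.5
M = np.diag([m, m, 2.0 / 3.0 * m * r**2])
J = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, r]])
print(post_impact_velocity(M, J, e, np.array([1.0, -2.0, 3.0])))
```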
We can put the bilateral constraint equations and the impulsive collision equations
together to implement a hybrid model for unilateral constraints of the form
ϕ(q) ≥ 0.
Let us generalize ϕ(q) to be the vector of all pairwise (signed) distance functions between rigid bodies; this
vector describes all possible contacts. At every moment, some subset of the contacts
are active, and can be treated as a bilateral constraint (ϕ i = 0). The guards of the
hybrid model are when an inactive constraint becomes active (ϕ i > 0 → ϕ i = 0), or
when an active constraint becomes inactive (λᵢ > 0 → λᵢ = 0 and ϕ̇ᵢ > 0). Note that
collision events will (almost always) only result in new active constraints when e = 0,
i.e. for purely inelastic collisions, because elastic collisions will rarely permit
sustained contact.
If the contact involves Coulomb friction, then the transitions between sticking and
sliding can be modeled as additional hybrid guards.
So far we have seen two different approaches to enforcing the inequality constraints
of contact, ϕ(q) ≥ 0 and the friction cone. First we introduced compliant contact
which effectively enforces non-penetration with a stiff spring. Then we discussed
event detection as a means to switch between different models which treat the
active constraints as equality constraints. But, perhaps surprisingly, one of the most
popular and scalable approaches is something different: it involves formulating a
mathematical program that can solve the inequality constraints directly on each time
step of the simulation. These algorithms may be more expensive to compute on each
time step, but they allow for stable simulations with potentially much larger and
more consistent time steps.
Complementarity formulations
What class of mathematical program do we need to simulate contact? The discrete
nature of contact suggests that we might need some form of combinatorial
optimization. Indeed, the most common transcription is to a Linear Complementarity
Problem (LCP) [14], as introduced by [15] and [16]. An LCP is typically written in
terms of a vector w that is an affine function of z, with the conditions
w ≥ 0,   z ≥ 0,   wᵀz = 0.
Rather than dive into the full transcription, which has many terms and can be
relatively difficult to parse, let me start with two very simple examples.
Consider a one-dimensional point mass with position q, which must stay above the
ground (ϕ(q) = q ≥ 0), with a control force u and a contact force f:
mq̈ = u + f.
Using a semi-implicit Euler discretization with time step h, one step of the contact
dynamics is described by
q[n + 1] = [q[n] + hv[n] + (h²/m) u[n]] + (h²/m) f[n],
q[n + 1] ≥ 0,   f[n] ≥ 0,   q[n + 1] ⋅ f[n] = 0.
Take a moment and convince yourself that this fits into the LCP prescription given
above.
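Because there is only one contact in this example, the LCP can also be solved in closed form: either the contact force is zero and the free dynamics already satisfy q[n + 1] ≥ 0, or the force is exactly large enough to make q[n + 1] = 0. A small sketch (the function name and numbers are mine):

```python
def contact_step(q, v, u, h=0.1, m=1.0):
    """One step of the frictionless time-stepping scheme above."""
    q_free = q + h * v + (h**2 / m) * u   # q[n+1] if the contact force were zero
    f = 0.0 if q_free >= 0.0 else -m * q_free / h**2  # smallest force giving q[n+1] = 0
    q_next = q_free + (h**2 / m) * f
    v_next = v + (h / m) * (u + f)
    return q_next, v_next, f

print(contact_step(q=0.005, v=-0.1, u=0.0))   # a point mass falling toward the ground
```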
Note: Please don't confuse this visualization with the more common visualization
of the solution space for an LCP (in two or more dimensions) in terms of
"complementary cones"[14].
Perhaps it's also helpful to plot the solution, q[n + 1], f[n] as a function of q[n], v[n].
I've done it here for m = 1, h = 0.1, u[n] = 0:
In fact, we can also obtain the frictionless time-stepping dynamics as the solution of
a small quadratic program over the next-step velocity v′:
min_{v′}  ½ (v′ − v − hM⁻¹τ)ᵀ M (v′ − v − hM⁻¹τ)
subject to  (1/h) ϕ(q′) = (1/h) ϕ(q + hv′) ≈ (1/h) ϕ(q) + n v′ ≥ 0.
The objective is even cleaner/more intuitive if we denote the next step velocity that
would have occurred before the contact impulse is applied as v − = v + hM −1 τ :
min_{v′}  ½ (v′ − v⁻)ᵀ M (v′ − v⁻)
subject to  (1/h) ϕ(q′) ≥ 0.
The LCP for the frictionless contact dynamics is precisely the optimality conditions
of this convex (because M ⪰ 0) quadratic program, and once again the contact
impulse, f ≥ 0, plays the role of the Lagrange multiplier (with units N ⋅ s).
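For a single contact we can write those optimality conditions out directly: if the pre-impulse velocity already satisfies the constraint then the impulse is zero; otherwise the impulse is exactly the multiplier that makes the constraint active. A sketch (names and numbers are mine):

```python
import numpy as np

def frictionless_contact_qp(M, v_minus, n, phi, h):
    """Solve min ½(v'-v⁻)ᵀM(v'-v⁻) s.t. φ/h + n v' ≥ 0 for one contact."""
    Minv = np.linalg.inv(M)
    violation = phi / h + n @ v_minus
    if violation >= 0.0:
        return v_minus, 0.0                   # constraint inactive: zero impulse
    f = -violation / (n @ Minv @ n)           # Lagrange multiplier = contact impulse
    return v_minus + Minv @ n * f, f

M = np.eye(2)
print(frictionless_contact_qp(M, v_minus=np.array([0.3, -1.0]),
                              n=np.array([0.0, 1.0]), phi=0.0, h=0.01))
```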
So why do we talk so much about LCPs instead of QPs? Well LCPs can also represent
a wider class of problems, which is what we arrive at with the standard transcription
of Coulomb friction. In the limit of infinite friction, then we could add an additional
constraint that the tangential velocity at each contact point was equal to zero (but
these equations may not always have a feasible solution). Once we admit limits on the
magnitude of the friction forces, the non-convexity of the disjunctive form rears its
ugly head.
Consider, for instance, a point mass sliding on a table, with
mq̈ = u + f,
but this time I'm using f for the friction force which is inside the friction cone if
q̇ = 0 and on the boundary of the friction cone if q̇ ≠ 0; this is known as the principle
of maximum dissipation[15]. Here the magnitude of the normal force is always mg
, so we have |f| ≤ μmg, where μ is the coefficient of friction. And we will use the
same semi-implicit Euler approximation to cast this into discrete time.
Now, to write the friction constraints as an LCP, we must introduce some slack
variables. First, we'll break the force into a positive component and a negative
component: f[n] = f + − f − . And we will introduce one more variable v abs which will
be non-zero if the velocity is non-zero (in either direction). Now we can write the
LCP:
find v_abs, f⁺, f⁻ subject to
0 ≤ v_abs + v[n + 1]  ⊥  f⁺ ≥ 0,
0 ≤ v_abs − v[n + 1]  ⊥  f⁻ ≥ 0,
0 ≤ μmg − f⁺ − f⁻  ⊥  v_abs ≥ 0,
v[n + 1] = v[n] + (h/m)(u + f⁺ − f⁻).
It's enough to make your head spin! But if you work it out, all of the constraints
we want are there. For example, for v[n + 1] > 0, we must have f⁺ = 0,
v_abs = v[n + 1], and f⁻ = μmg. When v[n + 1] = 0, we can have v_abs = 0, f⁺ + f⁻ ≤ μmg,
and those forces must add up to make v[n + 1] = 0.
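This scalar friction LCP also has a simple closed-form solution: compute the force that would bring the velocity exactly to zero, then clamp it to the friction limit -- which is precisely maximum dissipation. A sketch, with my own names and constants:

```python
def friction_step(v, u, h=0.01, m=1.0, mu=0.5, g=9.81):
    """One step of the sliding point mass, solving the friction LCP in closed form."""
    f_stick = -m * v / h - u                   # force that would make v[n+1] = 0
    f_max = mu * m * g                         # friction limit, |f| <= mu*m*g
    f = max(-f_max, min(f_max, f_stick))       # clamp to the friction cone
    return v + (h / m) * (u + f), f

print(friction_step(v=1.0, u=0.0))   # sliding mass decelerating under dynamic friction
```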
We can put it all together and write an LCP with both normal forces and friction
forces, related by Coulomb friction (using a polyhedral approximation of the friction
cone in 3d)[15]. Although the solution to any LCP can also be represented as the
solution to a QP, the QP for this problem is non-convex.
Considering the Lagrangian of this (relaxed) problem,
L(v′, β) = ½ (v′ − v⁻)ᵀ M (v′ − v⁻) − ∑ᵢ βᵢ ((1/h) ϕ(q) + (n + μdᵢ) v′),
we can see that the Lagrange multiplier, βᵢ ≥ 0, associated with the ith constraint is
the magnitude of the impulse (with units N⋅s) in the direction n + μdᵢ, where n = ∂ϕ/∂q
is the contact normal; the sum of the forces is an affine combination of these extreme
rays of the polyhedral friction cone. As written, the optimization is a QP.
But be careful! Although the primal problem was convex, and the dual problems are
always convex, the objective here can be merely positive semi-definite. This isn't a mistake
-- [17] describes a few simple examples where the solution for v′ is unique, but the
impulses that produce it are not (think of a table with four legs).
When the tangential velocity is zero, this constraint is tight; for sliding contact the
relaxation effectively causes the contact to "hydroplane" up out of contact, because
ϕ(q ′ ) ≥ hμd i v ′ . It seems like a quite reasonable approximation to me, especially for
small h!
Let's continue on and write the dual program. To make the notation a little cleaner,
let us stack the Jacobian vectors into J_β, such that the ith row is n + μdᵢ, so that we
have
(∂L/∂v′)ᵀ = M(v′ − v) − hτ − J_βᵀ β = 0.
Substituting this back into the Lagrangian gives us the dual program:
min_{β ≥ 0}  ½ βᵀ J_β M⁻¹ J_βᵀ β + (1/h) ϕ(q) ∑ᵢ βᵢ.
One final note: I've written just one contact point here to simplify the notation, but
of course we repeat the constraint(s) for each contact point; [17] studies the typical
case of only including potential contact pairs with ϕ(q) ≤ ϵ, for small ϵ ≥ 0.
where ẋ_d is a "desired contact velocity", and FC(q) describes the friction cone. The
friction cone constraint looks new but is not; observe that J T λ = J Tβ β, where β ≥ 0
is a clever parameterization of the friction cone in the polyhedral case. The bigger
difference is in the linear term: [18] proposed ẋ d = Jv − hBJv − hK[ϕ(q), 0, 0] T , with B
and K stabilization gains that are chosen to yield a critically damped response.
How should we interpret this? If you expand the linear terms, you will see that
this is almost exactly the dual form we get from the position equality constraints
formulation, including the Baumgarte stabilization. It's interesting to think about the
implications of this formulation -- like the Anitescu relaxation, it is possible to get
some "force at a distance" because we have not in any way expressed that λ = 0
when ϕ(q) > 0. In Anitescu, it happened in a velocity-dependent boundary layer; here
it could happen if you are "coming in hot" -- moving towards contact with enough
relative velocity that the stabilization terms want to slow you down.
In dual form, it is natural to consider the full conic description of the friction cone,
FC(q) = {λ = [f_n, f_t1, f_t2]ᵀ : μ f_n ≥ √(f_t1² + f_t2²)}.
The resulting dual optimization has a quadratic objective and second-order cone
constraints (it is an SOCP).
Also a single point force cannot capture effects like torsional friction, and performs
badly in some very simple cases (imagine a box sitting on a table). As a result,
many of the most effective algorithms for contact restrict themselves to very limited/
simplified geometry. For instance, one can place "point contacts" (zero-radius
spheres) in the four corners of a robot's foot; in this case adding forces at exactly
those four points as they penetrate some other body/mesh can give more consistent
contact force locations/directions. A surprising number of simulators, especially for
legged robots, do this.
In practice, most collision detection algorithms will return a list of "collision point
pairs" given the list of bodies (with one point pair per pair of bodies in collision,
or more if they are using the aforementioned "multi-contact" heuristics), and our
contact force model simply attaches springs between these pairs. Given a point-pair,
p A and p B , both in world coordinates, ...
B.4 PRINCIPLE OF STATIONARY ACTION
[20] says
Sounds important! The action of a trajectory q(t) over an interval [t₀, t₁] is
A = ∫_{t₀}^{t₁} L(q(t), q̇(t)) dt, and the principle states that trajectories of the
system are stationary points of the action.
What does it mean for a trajectory to be a stationary point? Using the calculus of
variations, we think of an infinitesimal variation of this trajectory: q(t) + ϵ(t), where
ϵ(t) is an arbitrary differentiable function that is zero at t 0 and t 1 . That this variation
should not change the action means δA = 0 for any small ϵ(t):
L(q(t) + ϵ(t), q̇(t) + ϵ̇(t)) = L(q(t), q̇(t)) + (∂L/∂q) ϵ(t) + (∂L/∂q̇) ϵ̇(t).
Substituting this into the action and integrating the ϵ̇ term by parts gives
δA = [(∂L/∂q̇) ϵ(t)]_{t₀}^{t₁} + ∫_{t₀}^{t₁} [∂L/∂q − d/dt ∂L/∂q̇] ϵ(t) dt,
and observe that the first term on the right-hand side is zero since ϵ(t₀) = ϵ(t₁) = 0.
This leaves
δA = ∫_{t₀}^{t₁} [∂L/∂q − d/dt ∂L/∂q̇] ϵ(t) dt = 0.
Since this must integrate to zero for any ϵ(t), we must have
∂L/∂q − d/dt ∂L/∂q̇ = 0,
which can be multiplied by negative one to obtain the familiar form of the (unforced)
Lagrangian equations of motion. The forced version follows from the variation
δA = δ∫_{t₀}^{t₁} L(q(t), q̇(t)) dt + ∫_{t₀}^{t₁} Qᵀ(t) ϵ(t) dt = 0.
REFERENCES
3. H. Asada and J.E. Slotine, "Robot Analysis and Control", pp. 93-131, 1986.
11. Firdaus E Udwadia and Robert E Kalaba, "A new perspective on constrained motion", Proceedings of the Royal Society of London. Series A: Mathematical and Physical Sciences, vol. 439, no. 1906, pp. 407-410, 1992.
13. Alejandro M Castro and Ante Qu and Naveen Kuppuswamy and Alex Alspach and Michael Sherman, "A Transition-Aware Method for the Simulation of Compliant Contact with Regularized Friction", IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1859-1866, 2020.
14. Katta G Murty and Feng-Tien Yu, "Linear complementarity, linear and nonlinear programming", Citeseer, vol. 3, 1988.
15. D.E. Stewart and J.C. Trinkle, "An Implicit Time-Stepping Scheme for Rigid Body Dynamics with Inelastic Collisions and Coulomb Friction", International Journal for Numerical Methods in Engineering, vol. 39, no. 15, pp. 2673-2691, 1996.
19. Ryan Elandt and Evan Drumwright and Michael Sherman and Andy Ruina, "A pressure field model for fast, robust approximation of net contact force and moment between nominally rigid objects", pp. 8238-8245, 2019.
20. Leonard Susskind and George Hrabovsky, "The theoretical minimum: what you need to know to start doing physics", Basic Books, 2014.
APPENDIX C
Optimization and Mathematical Programming
The view of dynamics and controls taken in these notes builds heavily on tools
from optimization -- and our success in practice depends heavily on the effective
application of numerical optimization. There are many excellent books on
optimization, for example [1] is an excellent reference on smooth optimization and
[2] is an excellent reference on convex optimization (which we use extensively). I
will provide more references for the specific optimization formulations below.
The intention of this chapter is, therefore, mainly to provide a launchpad and to
address some topics that are particularly relevant in robotics but might be more
hidden in the existing general treatments. You can get far very quickly as a consumer
of these tools with just a high-level understanding. But, as with many things,
sometimes the details matter and it becomes important to understand what is going
on under the hood. We can often formulate an optimization problem in multiple ways
that might be mathematically equivalent, but perform very differently in practice.
Some of the algorithms from optimization are quite simple to implement yourself;
stochastic gradient descent is perhaps the classic example. Even more of them
are conceptually fairly simple, but for some of the algorithms, the implementation
details matter a great deal -- the difference between an expert implementation
and a novice implementation of the numerical recipe, in terms of speed and/or
robustness, can be dramatic. These packages often use a wealth of techniques for
numerically conditioning the problems, for discarding trivially valid constraints, and
for warm-starting optimization between solves. Not all solvers support all types of
objectives and constraints, and even if we have two commercial solvers that both
claim to solve, e.g. quadratic programs, then they might perform differently on the
particular structure/conditioning of your problem. In some cases, we also have nicely
designed open-source alternatives to these commercial solvers (often written by
academics who are experts in optimization) -- sometimes they compete well with the
commercial solvers or even outperform them in particular problem types.
As a result, there are also a handful of software packages that attempt to provide
a layer of abstraction between the formulation of a mathematical program and the
instantiation of that problem in each of the particular solvers. Famous examples
of this include CVX and YALMIP for MATLAB, and JuMP for Julia. DRAKE's
MathematicalProgram classes provide a middle layer like this for C++ and Python;
its creation was motivated initially and specifically by the need to support the
optimization formulations we use in this text.
It's important to realize that even though this formulation is incredibly general,
it does have its limits. As just one example, when we write optimizations to plan
trajectories of a robot, in this formulation we typically have to choose a priori a
particular number of decision variables that will encode the solution. Although we
can of course write algorithms that change the number of variables and call the
optimizer again, somehow I feel that this "general" formulation fails to capture,
for instance, the type of mathematical programming that occurs in sample-based
optimization planning -- where the resulting solutions can be described by any finite
number of parameters, and the computation flexibly transitions amongst them.
For a constrained optimization of the form min_z f(z) subject to ϕ(z) = 0, we can form
the Lagrangian L(z, λ) = f(z) + λᵀϕ(z); the necessary conditions for a constrained
minimum are
∂L/∂z = 0,   ∂L/∂λ = 0.
Note that ∂L/∂λ = ϕ(z), so requiring this to be zero is equivalent to requiring the
constraints to be satisfied.
min_{x,y}  x + y,   subject to  x² + y² = 1.
The level sets of x + y are straight lines with slope −1, and the constraint requires
that the solution lives on the unit circle.
We can solve this with the method of Lagrange multipliers. Formulating
L = x + y + λ(x² + y² − 1),
we have
∂L/∂x = 1 + 2λx = 0   ⇒   λ = −1/(2x),
∂L/∂y = 1 + 2λy = 1 − y/x = 0   ⇒   y = x,
∂L/∂λ = x² + y² − 1 = 2x² − 1 = 0   ⇒   x = ±1/√2.
Given the two solutions which satisfy the necessary conditions, the negative
solution is clearly the minimizer of the objective.
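As an aside, this same example is easy to pose with DRAKE's MathematicalProgram. The sketch below is mine; the import path can differ between Drake versions, and a non-convex program like this one generally deserves an initial guess.

```python
import numpy as np
from pydrake.solvers import MathematicalProgram, Solve
# (older Drake versions: from pydrake.solvers.mathematicalprogram import ...)

prog = MathematicalProgram()
xy = prog.NewContinuousVariables(2, "xy")
prog.AddCost(xy[0] + xy[1])                      # minimize x + y
prog.AddConstraint(xy[0]**2 + xy[1]**2 == 1.0)   # stay on the unit circle

result = Solve(prog, np.array([-0.5, -0.5]))     # second argument is the initial guess
print(result.is_success(), result.GetSolution(xy))   # expect roughly [-1/sqrt(2), -1/sqrt(2)]
```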
Consider the problem of finding the point x closest to a while remaining outside of a
unit ball centered at b:
min_x  ‖x − a‖²,   subject to  ‖x − b‖ ≥ 1.
Introducing a new variable y to stand in for x², and relaxing the (non-convex)
equality y = x² to y ≥ x², we obtain
min_{x,y}  y − 2ax + a²
subject to  y − 2bx + b² ≥ 1,
            y ≥ x².
I've plotted the feasible region and the objective with an arrow. As you can see,
the feasible region is the epigraph of y = x 2 intersected with the linear constraint.
Now here is the key point: for linear objectives, the optimal solution will lie on the
boundary only if the cost is directly orthogonal to the objective; otherwise it will
lie at a vertex. So in this case, the solution will only lie on the interior if a = b; for
every other value, this relaxation will give the optimal solution. Note that we could
have equally well written a quadratic equality constraint.
I find this incredibly clever; it's only frustrating that it is somewhat particular to
quadratics. (The fact that the entire exterior of the quadratic region could be an
optimal solution does not hold if we try to use the same approach for e.g. being
outside of a polytope).
Open in Colab
C.3.3 Sums-of-squares optimization
It turns out that in the same way that we can use SDP to search over the positive
quadratic equations, we can generalize this to search over the positive polynomial
equations. To be clear, for quadratic equations we have
P ⪰ 0   ⇒   xᵀPx ≥ 0, ∀x.
The generalization is to write a polynomial as p(x) = m(x)ᵀ Q m(x), where m(x) is a
vector of monomials; if we can find Q ⪰ 0, then p(x) is a sum of squares and therefore
non-negative for all x.
Even better, there is quite a bit that is known about how to choose the terms in m(x)
. For example, if you give me the polynomial
p(x) = 2 − 4x + 5x 2
and ask if it is positive for all real x, I can convince you that it is by producing the
sums-of-squares factorization
p(x) = 1 + (1 − 2x) 2 + x 2 ,
and I know that I can formulate the problem without needing any monomials of
degree greater than 1 (the square-root of the degree of p) in my monomial vector. In
practice, what this means for us is that people have authored optimization front-ends
which take a high-level specification with constraints on the positivity of polynomials
and they automatically generate the SDP problem for you, without having to think
about what terms should appear in m(x) (c.f. [4]). This class of optimization problems
is called Sums-of-Squares (SOS) optimization.
As it happens, the particular choice of m(x) can have a huge impact on the numerics
of the resulting semidefinite program (and on your ability to solve it with commercial
solvers). DRAKE implements some particularly novel/advanced algorithms to make
this work well [5].
We will write
p(x) is SOS
as shorthand for the constraint that p(x) ≥ 0 for all x, as demonstrated by finding a
sums-of-squares decomposition.
For example, we can obtain a lower bound on the minimum value of a polynomial p(x) by solving
max_λ  λ
s.t.  p(x) − λ is SOS.
Go ahead and play with the code (most of the lines are only for plotting; the actual
optimization problem is nice and simple to formulate).
Open in Colab
Note that this finds the minimum value, but does not actually produce the x value
which minimizes it. This is possible [6], but it requires examining the dual of the
sums-of-squares solutions (which we don't need for the goals of this chapter).
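If you want to see the shape of the program without opening the notebook, here is roughly what it looks like in pydrake. This is a sketch of mine (with the same caveat about import paths across Drake versions), and it assumes a semidefinite-programming-capable solver is available to Solve.

```python
from pydrake.solvers import MathematicalProgram, Solve

prog = MathematicalProgram()
x = prog.NewIndeterminates(1, "x")[0]
lam = prog.NewContinuousVariables(1, "lambda")[0]

p = 2 - 4 * x + 5 * x**2
prog.AddSosConstraint(p - lam)   # p(x) - lambda must admit a sums-of-squares decomposition
prog.AddLinearCost(-lam)         # maximize lambda

result = Solve(prog)
print(result.GetSolution(lam))   # the minimum value of p(x), which is 1.2 at x = 0.4
```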
C.4 NONLINEAR PROGRAMMING
The picture that you should have in your head is a nonlinear, potentially non-convex
objective function defined over (multi-dimensional) z, with a subset of possible z
values satisfying the constraints.
Note that minima can be the result of the objective function having zero derivative
or due to a sloped objective up against a constraint.
Numerical methods for solving these optimization problems require an initial guess,
ẑ, and proceed by trying to move down the objective function to a minimum. Common
approaches include gradient descent, in which the gradient of the objective function
is computed or estimated, or second-order methods such as sequential quadratic
programming (SQP) which attempts to make a local quadratic approximation of the
objective function and local linear approximations of the constraints and solves a
quadratic program on each iteration to jump directly to the minimum of the local
approximation.
While not strictly required, these algorithms can often benefit a great deal from
having the gradients of the objective and constraints computed explicitly; the
alternative is to obtain them from numerical differentiation. Beyond pure speed
considerations, I strongly prefer to compute the gradients explicitly because it
can help avoid numerical accuracy issues that can creep in with finite difference
methods. The desire to calculate these gradients will be a major theme in the
discussion below, and we have gone to great lengths to provide explicit gradients
of our provided functions and automatic differentiation of user-provided functions in
DRAKE.
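To see the accuracy issue concretely, here is a tiny, self-contained illustration (mine, not from the text) comparing a forward-difference gradient against the analytic one; the error stops improving and then grows as the step size shrinks, which is exactly the kind of noise that can stall a line search.

```python
import numpy as np

f = lambda x: np.sin(x) * np.exp(x)
df = lambda x: (np.cos(x) + np.sin(x)) * np.exp(x)   # analytic gradient

x0 = 1.3
for eps in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]:
    fd = (f(x0 + eps) - f(x0)) / eps                 # forward difference
    print(f"eps={eps:.0e}  |fd - analytic| = {abs(fd - df(x0)):.2e}")
```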
When I started out, I was of the opinion that there is nothing difficult about
implementing gradient descent or even a second-order method, and I wrote all of the
solvers myself. I now realize that I was wrong. The commercial solvers available for
nonlinear programming are substantially higher performance than anything I wrote
myself, with a number of tricks, subtleties, and parameter choices that can make a
huge difference in practice. Some of these solvers can exploit sparsity in the problem
(e.g., if the constraints operate in a sparse way on the decision variables). Nowadays,
we make heaviest use of SNOPT [7], which now comes bundled with the binary
distributions of DRAKE, but DRAKE also supports a large suite of numerical solvers. Note that
while I do advocate using these tools, you do not need to use them as a black box.
In many cases you can improve the optimization performance by understanding and
selecting non-default configuration parameters.
C.4.1 Second-order methods (SQP / Interior-Point)
Penalty methods
Augmented Lagrangian
An advanced, but very readable book on MIP [8]. Nice survey paper on MILP [9].
REFERENCES
1. Jorge Nocedal and Stephen J. Wright, "Numerical optimization",Springer ,
2006.
4. Stephen Prajna and Antonis Papachristodoulou and Peter Seiler and Pablo A. Parrilo, "SOSTOOLS: Sum of Squares Optimization Toolbox for MATLAB User's Guide", June 1, 2004.
7. Philip E. Gill and Walter Murray and Michael A. Saunders, "User's Guide for SNOPT Version 7: Software for Large-Scale Nonlinear Programming", February 12, 2006.
APPENDIX D
An Optimization Playbook
Coming soon... a collection of tips and formulations that can make seemingly non-
smooth constraints smooth, seemingly non-convex constraints convex, etc.
APPENDIX E
Miscellaneous
E.1 HOW TO CITE THESE NOTES
Thank you for citing these notes in your work. Please use the following citation:
E.2 ANNOTATION TOOL ETIQUETTE
My primary goal for the annotation tool is to host a completely open dialogue on
the intellectual content of the text. However, it has turned out to serve an additional
purpose: it's a convenient way to point out my miscellaneous typos and grammatical
blips. The only problem is that if you highlight a typo, and I fix it 10 minutes
later, your highlight will persist forevermore. Ultimately this pollutes the annotation
content. So, to point these out, please use one of two options:
• You can make a public editorial comment, but must promise to delete that
comment once it has been addressed.
• You can join my "editorial" group and post your editorial comments using
this group "scope".
Ideally, once I mark a comment as "done", I would appreciate it if you can delete that
comment.
I highly value both the discussions and the corrections. Please keep them coming,
and thank you!
E.3 SOME GREAT FINAL PROJECTS
Each semester students put together a final project using the tools from this class.
Many of them are fantastic! Here is a small sample.
Spring 2020:
• Collision Free Mixed Integer Planning for Quadrotors Using Convex Safe
Regions by Bernhard Paus Græsdal
• Pancake flipping via trajectory optimization by Charles Dawson
• Dynamic Soaring by Lukas Lao Beyer
• Acrobatic Humanoid by Matt Chignoli and AJ Miller
• Trajectory Optimization for the Furuta Pendulum by Samuel Cherna and
Philip Murzynowski
I'm very interested in your feedback. The annotation tool is one mechanism, but you
can also comment directly on the YouTube lectures, or even add issues to the github
repo that hosts these course notes.
I've also created this simple survey to collect your general comments/feedback.