mitigating the risks associated with increasingly powerful AI systems. It provides not
only an accessible introduction to the technical challenges in making AI safer, but also a
clear-eyed account of the coordination problems we will need to solve on a societal level
to ensure AI is developed and deployed safely.”
“The most comprehensive exposition for the case that AI raises catastrophic risks and
what to do about them. Even if you disagree with some of Hendrycks’ arguments, this
book is still very much worth reading, if only for the unique coverage of both the techni-
cal and social aspects of the field.”
This book explores a range of ways in which societies could fail to harness AI safely in coming
years, such as malicious use, accidental failures, erosion of safety standards due to competition
between AI developers or nation-states, and potential loss of control over autonomous systems.
Grounded in the latest technical advances, this book offers a timely perspective on the chal-
lenges involved in making current AI systems safer. Ensuring that AI systems are safe is not just
a problem for researchers in machine learning – it is a societal challenge that cuts across tra-
ditional disciplinary boundaries. Integrating insights from safety engineering, economics, and
other relevant fields, this book provides readers with fundamental concepts to understand and
manage AI risks more effectively.
This is an invaluable resource for upper-level undergraduate and postgraduate students taking
courses relating to AI safety & alignment, AI ethics, AI policy, and the societal impacts of AI, as
well as anyone trying to better navigate the rapidly evolving landscape of AI safety.
Dr. Dan Hendrycks is a machine learning researcher and Director of the Center for AI Safety
(CAIS). He has conducted pioneering research in AI such as developing the GELU activation
function, used in several state-of-the-art neural networks such as GPT, and creating MMLU, one
of the leading benchmarks used to assess large language models. His research has been covered
by the BBC, New York Times, and Washington Post.
His work currently focuses on improving the safety of AI systems and mitigating risks from AI.
He has advised the UK government on AI safety and has been invited to give talks on this topic
at OpenAI, Google, and Stanford, among other institutions. He has written on AI risks for the
Wall Street Journal and TIME Magazine. Dan Hendrycks holds a Ph.D. in Machine Learning
from UC Berkeley.
Introduction to AI Safety, Ethics, and Society
Dan Hendrycks
First edition published 2025
by CRC Press
2385 Executive Center Drive, Suite 320, Boca Raton, FL 33431
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
The Open Access version of this book, available at www.taylorfrancis.com, has been made available under a Creative
Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) 4.0 license.
Any third party material in this book is not included in the OA Creative Commons license, unless indicated otherwise in
a credit line to the material. Please direct any permissions enquiries to the original rightsholder.
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for iden-
tification and explanation without intent to infringe.
DOI: 10.1201/9781003530336
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Introduction xv
2.2.2 Types of AI 58
2.2.3 Machine Learning 64
2.2.4 Types of Machine Learning 75
2.3 DEEP LEARNING 79
2.3.1 Model Building Blocks 82
2.3.2 Training and Inference 93
2.3.3 History and Timeline of Key Architectures 98
2.3.4 Applications 101
2.4 SCALING LAWS 102
2.4.1 Scaling Laws in DL 104
2.5 SPEED OF AI DEVELOPMENT 107
2.6 CONCLUSION 110
2.6.1 Summary 110
2.7 LITERATURE 112
2.7.1 Recommended Resources 113
Section II Safety
Acknowledgments 495
References 497
Index 529
Introduction
While some chapters address risks that have already been identified and discussed today, others set out a systematic introduction to ideas from game theory, complex systems, international relations,
and more. We hope that providing these flexible conceptual tools will help readers to
adapt robustly to the ever-changing landscape of AI risks.
This book does not aim to be the definitive guide on all AI risks. Research on AI risk is
still new and rapidly evolving, making it infeasible to comprehensively cover every risk
and its potential solutions in a single book, particularly if we wish to ensure that the
content is clear and digestible. We have chosen to introduce concepts and frameworks
that we find productive for thinking about a wide range of AI risks. Nonetheless, we
have had to make choices about what to include and omit. Many present harms, such
as harmful malfunctions, misinformation, privacy breaches, reduced social connection,
and environmental damage, are already well-addressed by others [1, 2]. Given the
rapid development of AI in recent years, we focus on novel risks posed by advanced
systems: risks that pose serious, large-scale, and sometimes irreversible threats that
our societies are currently unprepared to face.
Even if we limit ourselves to focusing on the potential for AI to pose catastrophic
risks, it is easy to become disoriented given the broad scope of the problem. Our hope
is that this book provides a starting point for others to build their own picture of
these risks and opportunities, and our potential responses to them.
The book’s content falls into three sections: AI and Societal-Scale Risks, Safety, and
Ethics and Society. In the AI and Societal-Scale Risks section, we outline major
categories of AI risks and introduce some key features of modern AI systems. In the
Safety section, we discuss how to make individual AI systems safer. However,
if we can make them safe, how should we direct them? To answer this, we turn to
the Ethics and Society section and discuss how to make AI systems that promote
our most important values. In this section, we also explore the numerous challenges
that emerge when trying to coordinate between multiple AI systems, multiple AI
developers, or multiple nation-states with competing interests.
The AI and Societal-Scale Risks section starts with an informal overview of AI risks,
which summarizes many of the key concerns discussed in this book. We outline some
scenarios where AI systems could cause catastrophic outcomes. We split risks across
four categories: malicious use, AI arms race dynamics, organizational risks, and rogue
AIs. These categories can be loosely mapped onto the risks discussed in more depth in
the Governance, Collective Action Problems, Safety Engineering, and Single-Agent
Safety chapters, respectively. However, this mapping is imperfect as many of the
risks and frameworks discussed in the book are more general and cut across sce-
narios. Nonetheless, we hope that the scenarios in this first chapter give readers a
concrete picture of the risks that we explore in this book. The next chapter, Artifi-
cial Intelligence Fundamentals, aims to provide an accessible and non-mathematical
explanation of current AI systems, helping to familiarize readers with key terms and concepts in machine learning, deep learning, scaling laws, and so on. This provides the necessary foundations for the discussion of the safety of individual AI systems in the next
section.
The Safety section gives an overview of core challenges in safely building advanced
AI systems. It draws on insights from both machine learning research and general
theories of safety engineering and complex systems, which provide a powerful lens for
understanding these issues. In Single-Agent Safety, we explore challenges in making
individual AI systems safer, such as bias, transparency, and emergence. In Safety
Engineering, we discuss principles for creating safer organizations and how these may
apply to those developing and deploying AI. A robust safety culture at organizations developing AI is crucial so that they do not prioritize profit at the expense of safety. Next, in Complex Systems, we show that analyzing AIs
as complex systems helps us to better understand the difficulty of predicting how
they will respond to external pressures or controlling the goals that may emerge in
such systems. More generally, this chapter provides us with a useful vocabulary for
discussing diverse systems of interest.
The Ethics and Society section focuses on how to instill beneficial objectives and con-
straints in AI systems and how to enable effective collaboration between stakeholders
to mitigate risks. In Beneficial AI and Machine Ethics, we introduce the challenge of
giving AI systems objectives that will reliably lead to beneficial outcomes for society,
and discuss various proposals along with the challenges they face. In Collective Ac-
tion Problems, we utilize game theory to illustrate the many ways in which multiple
agents (such as individual humans, companies, nation-states, or AIs) can fail to secure
good outcomes and come into conflict. We also consider the evolutionary dynamics
shaping AI development and how these drive AI risks. These frameworks help us to
understand the challenges of managing competitive pressures between AI developers,
militaries, or AI systems themselves. Finally, in the Governance chapter, we discuss
strategic variables such as how widely access to powerful AI systems is distributed.
We introduce a variety of potential paths for managing AI risks, including corporate
governance, national regulation, and international coordination.
The website for this book (www.aisafetybook.com) includes a range of additional
content. It contains further educational resources such as videos, slides, quizzes, and
discussion questions. For readers interested in contributing to mitigating risks from
AI, it offers some brief suggestions and links to other resources on this topic. A range
of appendices can also be found on the website with further material that could not
be included in the book itself.
Dan Hendrycks
Center for AI Safety
Section I: AI and Societal-Scale Risks
CHAPTER 1
Overview of Catastrophic AI Risks
1.1 INTRODUCTION
In this chapter, we will give a brief and informal description of many major societal-
scale risks from artificial intelligence (AI), focusing on AI risks that could lead to
highly severe or even catastrophic societal outcomes. This provides some background
and motivation before we discuss specific challenges with more depth and rigor in the
following chapters.
The world as we know it today is not normal. We take for granted that we can talk
instantaneously with people thousands of miles away, fly to the other side of the world
in less than a day, and access vast mountains of accumulated knowledge on devices
we carry around in our pockets. These realities seemed far-fetched decades ago and
would have been inconceivable to people living centuries ago. The ways we live, work,
travel, and communicate have only been possible for a tiny fraction of human history.
Yet, when we look at the bigger picture, a broader pattern emerges: accelerating
development. Hundreds of thousands of years elapsed between the time Homo sapi-
ens appeared on Earth and the agricultural revolution. Then, thousands of years
passed before the industrial revolution. Now, just centuries later, the AI revolution
is beginning. The march of history is not constant—it is rapidly accelerating.
We can capture this trend quantitatively in Figure 1.1, which shows how the estimated
gross world product has changed over time [3, 4]. The hyperbolic growth it depicts
might be explained by the fact that, as technology advances, the rate of technological
advancement also tends to increase. Empowered with new technologies, people can
innovate faster than they could before. Thus, the gap in time between each landmark
development narrows.
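To make the notion of hyperbolic growth more concrete, here is a minimal illustrative model (a sketch of our own, not a model taken from [3, 4]). Suppose the growth rate of gross world product P itself increases with the level of technology already attained, so that

dP/dt = k P^{1+ε}, with constants k, ε > 0.

Solving this gives P(t) = P_0 (1 − ε k P_0^{ε} t)^{−1/ε}, which diverges at the finite time t* = 1/(ε k P_0^{ε}). Unlike exponential growth, whose doubling time is constant, such super-exponential growth has doubling times that shrink as P increases, mirroring the narrowing gaps between landmark developments described above.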
It is the rapid pace of development, as much as the sophistication of our technol-
ogy, that makes the present day an unprecedented time in human history. We have
DOI: 10.1201/9781003530336-1
This chapter has been made available under a CC BY NC ND license.
Figure 1.1. World production has grown rapidly over the course of human history. AI could
further this trend, catapulting humanity into a new period of unprecedented change.
reached a point where technological advancements can transform the world beyond
recognition within a human lifetime. For example, people who have lived through the
creation of the internet can remember a time when our now digitally connected world
would have seemed like science fiction.
From a historical perspective, it appears possible that the same amount of develop-
ment could now be condensed in an even shorter timeframe. We might not be certain
that this will occur, but neither can we rule it out. We therefore wonder: what new
technology might usher in the next big acceleration? In light of recent advances, AI
seems an increasingly plausible candidate. Perhaps, as AI continues to become more
powerful, it could lead to a qualitative shift in the world that is more profound than
any we have experienced so far. It could be the most impactful period in history,
though it could also be the last.
Although technological advancement has often improved people’s lives, we ought
to remember that as our technology grows in power, so too does its destructive
potential. Consider the invention of nuclear weapons. Last century, for the first time
in our species’ history, humanity possessed the ability to destroy itself, and the world
suddenly became much more fragile.
Our newfound vulnerability revealed itself in unnerving clarity during the Cold War.
On a Saturday in October 1962, the Cuban Missile Crisis was cascading out of con-
trol. US warships enforcing the blockade of Cuba detected a Soviet submarine and
attempted to force it to the surface by dropping low-explosive depth charges. The
submarine was out of radio contact, and its crew had no idea whether World War III
had already begun. A broken ventilator raised the temperature up to 140°F in some
parts of the submarine, causing crew members to fall unconscious as depth charges
exploded nearby.
The submarine carried a nuclear-armed torpedo, which required consent from both
the captain and political officer to launch. Both provided it. On any other submarine
in Cuban waters that day, that torpedo would have launched—and a nuclear third
world war may have followed. Fortunately, a man named Vasili Arkhipov was also on
the submarine. Arkhipov was the commander of the entire flotilla and by sheer luck
happened to be on that particular submarine. He talked the captain down from his
rage, convincing him to await further orders from Moscow. He averted a nuclear war
and saved millions or billions of lives—and possibly civilization itself.
Carl Sagan once observed, “If we continue to accumulate only power and not wisdom,
we will surely destroy ourselves” [5]. Sagan was correct: The power of nuclear weapons
was not one we were ready for. Overall, it has been luck rather than wisdom that
has saved humanity from nuclear annihilation, with multiple recorded instances of a
single individual preventing a full-scale nuclear war.
AI is now poised to become a powerful technology with destructive potential similar
to nuclear weapons. We do not want to repeat the Cuban Missile Crisis. We do not
want to slide toward a moment of peril where our survival hinges on luck rather than
the ability to use this technology wisely. Instead, we need to work proactively to
mitigate the risks it poses. This necessitates a better understanding of what could go
wrong and what to do about it.
Luckily, AI systems are not yet advanced enough to contribute to every risk we
discuss. But that is cold comfort in a time when AI development is advancing at an
unprecedented and unpredictable rate. We consider risks arising from both present-
day AIs and AIs that are likely to exist in the near future. It is possible that if we
wait for more advanced systems to be developed before taking action, it may be too
late.
In this chapter, we will explore various ways in which powerful AIs could bring about
catastrophic events with devastating consequences for vast numbers of people. We
will also discuss how AIs could present existential risks—catastrophes from which
humanity would be unable to recover. The most obvious such risk is extinction, but
there are other outcomes, such as creating a permanent dystopian society, which
would also constitute an existential catastrophe. As further discussed in this book’s
Introduction, we do not intend to cover all risks or harms that AI may pose in an
exhaustive manner, and many of these fall outside the scope of this chapter. We
outline many possible scenarios, some of which are more likely than others and some
of which are mutually incompatible with each other. This approach is motivated by
the principles of risk management. We prioritize asking “what could go wrong?” rather
than reactively waiting for catastrophes to occur. This proactive mindset enables us
to anticipate and mitigate catastrophic risks before it’s too late.
To help orient the discussion, we decompose catastrophic risks from AIs into four risk
sources that warrant intervention:
• Malicious use: Malicious actors using AIs to cause large-scale devastation.
• AI race: Competitive pressures that could drive us to deploy AIs in unsafe ways
despite this being in no one’s best interest.
• Organizational risks: Accidents arising from the complexity of AIs and the or-
ganizations developing them.
• Rogue AIs: The problem of controlling a technology that is more intelligent than
we are.
These four sections—Malicious Use, AI Race, Organizational Risks, and Rogue AIs—
describe causes of AI risks that are intentional, environmental/structural, accidental,
and internal, respectively [6]. The risks that are briefly outlined in this chapter are
discussed in greater depth in the rest of this book.
In this chapter, we will describe how concrete, small-scale examples of each risk might
escalate into catastrophic outcomes. We also include hypothetical stories to help
readers conceptualize the various processes and dynamics discussed in each section.
We hope this survey will serve as a practical introduction for readers interested in
learning about and mitigating catastrophic AI risks.
1.2 MALICIOUS USE

On the morning of March 20, 1995, five men entered the Tokyo subway system. After
boarding separate subway lines, they continued for several stops before dropping the
bags they were carrying and exiting. An odorless, colorless liquid inside the bags
began to vaporize. Within minutes, commuters began choking and vomiting. The
trains continued on toward the heart of Tokyo, with sickened passengers leaving the
cars at each station. The fumes were spread at each stop, either by emanating from
the tainted cars or through contact with people’s clothing and shoes. By the end of
the day, 13 people lay dead and 5,800 seriously injured. The group responsible for the
attack was the religious cult Aum Shinrikyo [7]. Its motive for murdering innocent
people? To bring about the end of the world.
Powerful new technologies offer tremendous potential benefits, but they also carry the
risk of empowering malicious actors to cause widespread harm. There will always be
those with the worst of intentions, and AIs could provide them with a formidable tool
to achieve their objectives. Moreover, as AI technology advances, severe malicious use
could potentially destabilize society, increasing the likelihood of other risks.
In this section, we will explore the various ways in which the malicious use of advanced
AIs could pose catastrophic risks. These include engineering biochemical weapons,
unleashing rogue AIs, using persuasive AIs to spread propaganda and erode consensus
reality, and leveraging censorship and mass surveillance to irreversibly concentrate
power. We will conclude by discussing possible strategies for mitigating the risks
associated with the malicious use of AIs.
Consider, for example, the decision to share an AI system designed to accelerate the discovery of new drugs and vaccines. Doing so would speed up research and potentially save lives, but it could also increase the risk of malicious use if the AI system could be repurposed to develop bioweapons.
In situations like this, the outcome may be determined by the least risk-averse re-
search group. If only one research group thinks the benefits outweigh the risks, it
could act unilaterally, deciding the outcome even if most others don’t agree. And if
they are wrong and someone does decide to develop a bioweapon, it would be too
late to reverse course.
By default, advanced AIs may increase the destructive capacity of both the most
powerful and the general population. Thus, the growing potential for AIs to empower
malicious actors is one of the most severe threats humanity will face in the coming
decades. The examples we give in this section are only those we can foresee. It is
possible that AIs could aid in the creation of dangerous new technology we cannot
presently imagine, which would further increase risks from malicious use.
Malicious actors could intentionally create rogue AIs. One month after
the release of GPT-4, an open-source project bypassed the AI’s safety filters and
turned it into an autonomous AI agent instructed to “destroy humanity,” “establish
global dominance,” and “attain immortality.” Dubbed ChaosGPT, the AI compiled
research on nuclear weapons and sent tweets trying to influence others. Fortunately,
ChaosGPT was merely a warning given that it lacked the ability to successfully
formulate long-term plans, hack computers, and survive and spread. Yet given the
rapid pace of AI development, ChaosGPT did offer a glimpse into the risks that more
advanced rogue AIs could pose in the near future.
Many groups may want to unleash AIs or have AIs displace human-
ity. Simply unleashing rogue AIs, like a more sophisticated version of ChaosGPT,
could accomplish mass destruction, even if those AIs aren’t explicitly told to harm
humanity. There are a variety of beliefs that may drive individuals and groups to do
so. One ideology that could pose a unique threat in this regard is “accelerationism.”
This ideology seeks to accelerate AI development as rapidly as possible and opposes
restrictions on the development or proliferation of AIs. This sentiment is common
among many leading AI researchers and technology leaders, some of whom are in-
tentionally racing to build AIs more intelligent than humans. According to Google
co-founder Larry Page, AIs are humanity’s rightful heirs and the next step of cosmic
evolution. He has also expressed the sentiment that humans maintaining control over
AIs is “speciesist” [17]. Jürgen Schmidhuber, an eminent AI scientist, argued that “In
the long run, humans will not remain the crown of creation... But that’s okay because
there is still beauty, grandeur, and greatness in realizing that you are a tiny part of
a much grander scheme which is leading the universe from lower complexity toward
higher complexity” [18]. Richard Sutton, another leading AI scientist, in discussing
smarter-than human AI asked “why shouldn’t those who are the smartest become
powerful?” and thinks the development of superintelligence will be an achievement
“beyond humanity, beyond life, beyond good and bad” [19]. He argues that “suc-
cession to AI is inevitable,” and while “they could displace us from existence,” “we
should not resist succession” [20].
There are several sizable groups who may want to unleash AIs to intentionally cause
harm. For example, sociopaths and psychopaths make up around 3 percent of the
population [21]. In the future, people who have their livelihoods destroyed by AI
automation may grow resentful, and some may want to retaliate. There are plenty
of cases in which seemingly mentally stable individuals with no history of insanity
or violence suddenly go on a shooting spree or plant a bomb with the intent to
harm as many innocent people as possible. We can also expect well-intentioned peo-
ple to make the situation even more challenging. As AIs advance, they could make
ideal companions—knowing how to provide comfort, offering advice when needed,
and never demanding anything in return. Inevitably, people will develop emotional
bonds with chatbots, and some will demand that they be granted rights or become
autonomous.
In summary, releasing powerful AIs and allowing them to take actions independently
of humans could lead to a catastrophe. There are many reasons that people might
pursue this, whether because of a desire to cause harm, an ideological belief in techno-
logical acceleration or a conviction that AIs should have the same rights and freedoms
as humans.
AIs could pollute the information ecosystem with motivated lies. Some-
times ideas spread not because they are true, but because they serve the interests
of a particular group. “Yellow journalism” was coined as a pejorative reference to
newspapers that advocated war between Spain and the United States in the late
19th century because they believed that sensational war stories would boost their
sales [22]. When public information sources are flooded with falsehoods, people will
sometimes fall prey to lies or else come to distrust mainstream narratives, both of
which undermine societal integrity.
Unfortunately, AIs could escalate these existing problems dramatically. First, AIs
could be used to generate unique, personalized disinformation at a large scale. While
there are already many social media bots [23], some of which exist to spread disin-
formation, historically they have been run by humans or primitive text generators.
The latest AI systems do not need humans to generate personalized messages, never
get tired, and could potentially interact with millions of users at once [24].
AIs can exploit users’ trust. Already, hundreds of thousands of people pay for
chatbots marketed as lovers and friends [25], and one man’s suicide has been partially
attributed to interactions with a chatbot [26]. As AIs appear increasingly human-like,
people will increasingly form relationships with them and grow to trust them. AIs that
gather personal information through relationship-building or by accessing extensive
personal data, such as a user’s email account or personal files, could leverage that
information to enhance persuasion. Powerful actors that control those systems could exploit this trust to manipulate people at scale.
We have discussed several ways in which individuals and groups might use AIs to
cause widespread harm, through bioterrorism; releasing powerful, uncontrolled AIs;
and disinformation. To mitigate these risks, governments might pursue intense surveil-
lance and seek to keep AIs in the hands of a trusted minority. This reaction, however,
could easily become an overcorrection, paving the way for an entrenched totalitarian
regime that would be locked in by the power and capacity of AIs. This scenario rep-
resents a form of “top-down” misuse, as opposed to “bottom-up” misuse by citizens
and could, in extreme cases, culminate in an entrenched dystopian civilization.
AIs may entrench a totalitarian regime. In the hands of the state, AIs may
result in the erosion of civil liberties and democratic values in general. AIs could
allow totalitarian governments to efficiently collect, process, and act on an unprece-
dented volume of information, permitting an ever smaller group of people to surveil
and exert complete control over the population without the need to enlist millions of
citizens to serve as willing government functionaries. Overall, as power and control
shift away from the public and toward elites and leaders, democratic governments
are highly vulnerable to totalitarian backsliding. Additionally, AIs could make total-
itarian regimes much longer-lasting; a major way in which such regimes have been
toppled previously is at moments of vulnerability like the death of a dictator, but
AIs, which would be hard to “kill,” could provide much more continuity to leadership,
providing few opportunities for reform.
AIs can entrench corporate power at the expense of the public good.
Corporations have long lobbied to weaken laws and policies that restrict their actions
and power, all in the service of profit. Corporations in control of powerful AI systems
may use them to manipulate customers into spending more on their products even to
the detriment of their own wellbeing. The concentration of power and influence that
could be afforded by AIs could enable corporations to exert unprecedented control
over the political system and entirely drown out the voices of citizens. This could
occur even if creators of these systems know their systems are self-serving or harmful
to others, as they would have incentives to reinforce their power and avoid distributing
control.
Story: Bioterrorism
The following is an illustrative hypothetical story to help readers envision some
of these risks. This story is nonetheless somewhat vague to reduce the risk of
inspiring malicious actions based on it.
A biotechnology startup is making waves in the industry with its AI-powered
bioengineering model. The company has made bold claims that this new tech-
nology will revolutionize medicine through its ability to create cures for both
known and unknown diseases. The company did, however, stir up some con-
troversy when it decided to release the program to approved researchers in the
scientific community. Only weeks after its decision to make the model open-
source on a limited basis, the full model was leaked on the internet for all to
see. Its critics pointed out that the model could be repurposed to design lethal
pathogens and claimed that the leak provided bad actors with a powerful tool
to cause widespread destruction, opening it up to abuse without safeguards in
place.
Unknown to the public, an extremist group has been working for years to
engineer a new virus designed to kill large numbers of people. Yet given their
lack of expertise, these efforts have so far been unsuccessful. When the new
AI system is leaked, the group immediately recognizes it as a potential tool
to design the virus and circumvent legal and monitoring obstacles to obtain
the necessary raw materials. The AI system successfully designs exactly the
kind of virus the extremist group was hoping for. It also provides step-by-step
instructions on how to synthesize large quantities of the virus and circumvent
any obstacles to spreading it. With the synthesized virus in hand, the extremist
group devises a plan to release the virus in several carefully chosen locations
in order to maximize its spread.
The virus has a long incubation period and spreads silently and quickly
throughout the population for months. By the time it is detected, it has al-
ready infected millions and has an alarmingly high mortality rate. Given its
lethality, most who are infected will ultimately die. The virus may or may not
be contained eventually, but not before it kills millions of people.
1.3 AI RACE
The immense potential of AIs has created competitive pressures among global play-
ers contending for power and influence. This “AI race” is driven by nations and
corporations who feel they must rapidly build and deploy AIs to secure their posi-
tions and survive. Because it leads actors to deprioritize global risks, this dynamic makes it more likely that AI development will produce dangerous outcomes. Analogous to
the nuclear arms race during the Cold War, participation in an AI race may serve
individual short-term interests, but it ultimately results in worse collective outcomes
for humanity. Importantly, these risks stem not only from the intrinsic nature of AI
technology, but from the competitive pressures that encourage insidious choices in
AI development.
In this section, we first explore the military AI arms race and the corporate AI race,
where nation-states and corporations are forced to rapidly develop and adopt AI sys-
tems to remain competitive. Moving beyond these specific races, we reconceptualize
competitive pressures as part of a broader evolutionary process in which AIs could
become increasingly pervasive, powerful, and entrenched in society. Finally, we high-
light potential strategies and policy suggestions to mitigate the risks created by an
AI race and ensure the safe development of AIs.
The development of AIs for military applications is swiftly paving the way for a new
era in military technology, with potential consequences rivaling those of gunpowder
and nuclear arms in what has been described as the “third revolution in warfare.”
The weaponization of AI presents numerous challenges, such as the potential for more
destructive wars, the possibility of accidental usage or loss of control, and the prospect
of malicious actors co-opting these technologies for their own purposes. As AIs gain
influence over traditional military weaponry and increasingly take on command and
control functions, humanity faces a paradigm shift in warfare. In this context, we will
discuss the latent risks and implications of this AI arms race on global security, the
potential for intensified conflicts, and the dire outcomes that could come as a result,
including the possibility of conflicts escalating to a scale that poses an existential
threat.
Lethal autonomous weapons (LAWs) are weapons that can identify, target, and kill without human intervention [30].
They offer potential improvements in decision-making speed and precision. Warfare,
however, is a high-stakes, safety-critical domain for AIs with significant moral and
practical concerns. Though their existence is not necessarily a catastrophe in itself,
LAWs may serve as an on-ramp to catastrophes stemming from malicious use, acci-
dents, loss of control, or an increased likelihood of war.
LAWs increase the likelihood of war. Sending troops into battle is a grave
decision that leaders do not make lightly. But autonomous weapons would allow an
aggressive nation to launch attacks without endangering the lives of its own soldiers
and thus face less domestic scrutiny. While remote-controlled weapons share this ad-
vantage, their scalability is limited by the requirement for human operators and vul-
nerability to jamming countermeasures, limitations that LAWs could overcome [34].
Public support for continuing wars tends to wane as conflicts drag on and casualties
increase [35]. LAWs would change this equation. National leaders would no longer
face the prospect of body bags returning home, thus removing a primary barrier to
engaging in warfare, which could ultimately increase the likelihood of conflicts.
Cyberwarfare
As well as being used to enable deadlier weapons, AIs could lower the barrier to
entry for cyberattacks, making them more numerous and destructive. They could
cause serious harm not only in the digital environment but also in physical systems,
potentially taking out critical infrastructure that societies depend on. While AIs could
also be used to improve cyberdefense, it is unclear whether they will be most effective
as an offensive or defensive technology [36]. If they enhance attacks more than they
support defense, then cyberattacks could become more common, creating significant
geopolitical turbulence and paving another route to large-scale conflict.
AIs have the potential to increase the accessibility, success rate, scale,
speed, stealth, and potency of cyberattacks. Cyberattacks are already a real-
ity, but AIs could be used to increase their frequency and destructiveness in multiple
ways. Machine learning tools could be used to find more critical vulnerabilities in
target systems and improve the success rate of attacks. They could also be used to
increase the scale of attacks by running millions of systems in parallel, and increase
the speed by finding novel routes to infiltrating a system. Cyberattacks could also
become more potent if used to hijack AI weapons.
Automated Warfare
AIs speed up the pace of war, which makes AIs more necessary. AIs
can quickly process a large amount of data, analyze complex situations, and provide
helpful insights to commanders. With ubiquitous sensors and advanced technology on
the battlefield, there is tremendous incoming information. AIs help make sense of this
information, spotting important patterns and relationships that humans might miss.
As these trends continue, it will become increasingly difficult for humans to make well-
informed decisions as quickly as necessary to keep pace with AIs. This would further
pressure militaries to hand over decisive control to AIs. The continuous integration
of AIs into all aspects of warfare will cause the pace of combat to become faster and
faster. Eventually, we may arrive at a point where humans are no longer capable
of assessing the ever-changing battlefield situation and must cede decision-making
power to advanced AIs.
AIs could make war more uncertain, increasing the risk of conflict. Al-
though states that are already wealthier and more powerful often have more resources
to invest in new military technologies, they are not necessarily always the most suc-
cessful at adopting them. Other factors also play an important role, such as how
agile and adaptive a military can be in incorporating new technologies [40]. Major
new weapons innovations can therefore offer an opportunity for existing superpowers
to bolster their dominance, but also for less powerful states to quickly increase their
power by getting ahead in an emerging and important sphere. This can create sig-
nificant uncertainty around if and how the balance of power is shifting, potentially
leading states to incorrectly believe they could gain something from going to war.
Even aside from considerations regarding the balance of power, rapidly evolving au-
tomated warfare would be unprecedented, making it difficult for actors to evaluate
their chances of victory in any particular conflict. This would increase the risk of
miscalculation, making war more likely.
In the late 1960s, the Ford Motor Company faced growing competition from international car manufacturers as the share of imports in American car purchases steadily rose [45]. Ford developed an ambitious plan to design and manufacture
a new car model in only 25 months [46]. The Ford Pinto was delivered to customers
ahead of schedule, but with a serious safety problem: the gas tank was located near
the rear bumper and could explode during rear collisions. Numerous fatalities and
injuries were caused by the resulting fires when crashes inevitably happened [47].
Ford was sued and a jury found them liable for these deaths and injuries [48]. The
verdict, of course, came too late for those who had already lost their lives. As Ford’s
president at the time was fond of saying, “Safety doesn’t sell” [49].
Boeing, aiming to compete with its rival Airbus, sought to deliver its updated, more fuel-efficient 737 MAX to the market as quickly as possible. The head-to-head rivalry and time pressure led to the introduction of the Maneuvering Characteristics Augmentation System (MCAS), which was designed to enhance the aircraft's stability. However, inadequate testing and pilot training ultimately resulted in two fatal crashes only months apart, with 346 people killed [50]. We can imagine a future in which similar
pressures lead companies to cut corners and release unsafe AI systems.
A third example is the Bhopal gas tragedy, which is widely considered to be the
worst industrial disaster ever to have happened. In December 1984, a vast quantity
of toxic gas leaked from a Union Carbide Corporation subsidiary plant manufacturing
pesticides in Bhopal, India. Exposure to the gas killed thousands of people and injured
up to half a million more. Investigations found that, in the run-up to the disaster,
safety standards had fallen significantly, with the company cutting costs by neglecting
equipment maintenance and staff training as profitability fell. This is often considered
a consequence of competitive pressures [51].
Automated Economy
Corporations will face pressure to replace humans with AIs. As AIs be-
come more capable, they will be able to perform an increasing variety of tasks more
quickly, cheaply, and effectively than human workers. Companies will therefore stand
to gain a competitive advantage from replacing their employees with AIs. Compa-
nies that choose not to adopt AIs would likely be out-competed, just as a clothing
company using manual looms would be unable to keep up with those using industrial
ones.
AIs could lead to mass unemployment. Economists have long considered the
possibility that machines will replace human labor. Nobel Prize winner Wassily Leon-
tief said in 1952 that, as technology advances, “Labor will become less and less
important... more and more workers will be replaced by machines” [52]. Previous
technologies have augmented the productivity of human labor. AIs, however, could
differ profoundly from previous innovations. Advanced AIs capable of automating
human labor should be regarded not merely as tools, but as agents. Human-level AI
agents would, by definition, be able to do everything a human could do. These AI
agents would also have important advantages over human labor. They could work 24
hours a day, be copied many times and run in parallel, and process information much
more quickly than a human would. While we do not know when this will occur, it is
unwise to discount the possibility that it could be soon. If human labor is replaced by
AIs, mass unemployment could dramatically increase inequality, making individuals
dependent on the owners of AI systems.
Automated AI R&D. AI agents would have the potential to automate the re-
search and development (R&D) of AI itself. AI is increasingly automating parts of
the research process [53], and this could lead to AI capabilities growing at increasing
rates, to the point where humans are no longer the driving force behind AI develop-
ment. If this trend continues unchecked, it could escalate risks associated with AIs
progressing faster than our capacity to manage and regulate them. Imagine that we
created an AI that writes and thinks at the speed of today’s AIs, but that it could
also perform world-class AI research. We could then copy that AI and create 10,000
world-class AI researchers that operate at a pace 100× faster than humans. By
automating AI research and development, we might achieve progress equivalent to
many decades in just a few months.
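As a rough back-of-the-envelope check on this thought experiment, the arithmetic below uses only the hypothetical figures just given (10,000 copies running at 100× human speed over a few months); it is an illustration, not a forecast.

```python
# Back-of-envelope arithmetic for the hypothetical scenario above.
# The figures are illustrative assumptions from the text, not predictions.
num_copies = 10_000        # copies of a world-class AI researcher
speed_multiplier = 100     # thinking speed relative to a human researcher
months = 3                 # wall-clock time elapsed

researcher_years = num_copies * speed_multiplier * (months / 12)
print(f"{researcher_years:,.0f} human-researcher-years of work in {months} months")
# -> 250,000 human-researcher-years: several decades' worth of output from a
#    large research field, compressed into a single quarter.
```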
Competitive pressures create situations where an AI’s power and development are prioritized over its safety. To address these
dilemmas that give rise to global risks, we will need new coordination mechanisms
and institutions. It is our view that failing to coordinate and stop AI races would be
the most likely cause of an existential catastrophe.
As discussed above, there are strong pressures to replace humans with AIs, cede
more control to them, and reduce human oversight in various settings, despite the
potential harms. We can re-frame this as a general trend resulting from evolutionary
dynamics, where an unfortunate truth is that AIs will simply be more fit than humans.
Extrapolating this pattern of automation, it is likely that we will build an ecosystem
of competing AIs over which it may be difficult to maintain control in the long run. We
will now discuss how natural selection influences the development of AI systems and
why evolution favors selfish behaviors. We will also look at how competition might
arise and play out between AIs and humans, and how this could create catastrophic
risks. This section draws heavily from “Natural Selection Favors AIs over Humans”
[55, 56].
Fitter technologies are selected, for good and bad. While most people think
of evolution by natural selection as a biological process, its principles shape much
more. According to the evolutionary biologist Richard Lewontin [57], evolution by
natural selection will take hold in any environment where three conditions are present:
(1) there are differences between individuals; (2) characteristics are passed on to future generations; and (3) the different variants propagate at different rates. These
conditions apply to various technologies.
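The toy simulation below is an illustrative sketch of our own (not a model from the book): it shows that variation, inheritance, and differential propagation alone are enough for one variant to come to dominate, whatever the "individuals" happen to be.

```python
# Toy illustration of Lewontin's three conditions (illustrative sketch only).
# Condition 1: individuals differ (here, in how fast they propagate).
population = {"variant_A": 100.0, "variant_B": 100.0, "variant_C": 100.0}
growth_rate = {"variant_A": 1.03, "variant_B": 1.05, "variant_C": 1.10}

for generation in range(50):
    for name in population:
        # Condition 2: offspring inherit their parent's characteristics.
        # Condition 3: different variants propagate at different rates.
        population[name] *= growth_rate[name]

total = sum(population.values())
for name, count in population.items():
    print(f"{name}: {count / total:.1%} of the population")
# The fastest propagator ends up dominating even though nothing "chose" it;
# differential propagation alone drives the outcome.
```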
Consider the content-recommendation algorithms used by streaming services and
social media platforms. When a particularly addictive content format or algorithm
hooks users, it results in higher screen time and engagement. This more effective
content format or algorithm is consequently “selected” and further fine-tuned, while
formats and algorithms that fail to capture attention are discontinued. These com-
petitive pressures foster a “survival of the most addictive” dynamic. Platforms that
refuse to use addictive formats and algorithms become less influential or are simply
outcompeted by platforms that do, and the resulting race to capture attention drives platforms to undermine wellbeing and cause massive harm to society [58].
The conditions for natural selection apply to AIs. First, there will be many different AI developers who make many different AI systems with varying features and capabilities, and competition between them will determine which characteristics become more common. Second, the most successful AIs today are already being used
as a basis for their developers’ next generation of models, as well as being imitated
by rival companies. Third, factors determining which AIs propagate the most may
include their ability to act autonomously, automate labor, or reduce the chance of
their own deactivation.
A sufficiently capable selfish AI might learn to bypass restrictions while deceiving humans about its methods. Even if we try to put safety
measures in place, a deceptive AI would be very difficult to counteract if it is cleverer
than us. AIs that can bypass our safety measures without detection may be the most
successful at accomplishing the tasks we give them, and therefore become widespread.
These processes could culminate in a world where many aspects of major companies
and infrastructure are controlled by powerful AIs with selfish traits, including de-
ceiving humans, harming humans in service of their goals, and preventing themselves
from being deactivated.
Humans only have nominal influence over AI selection. One might think
we could avoid the development of selfish behaviors by ensuring we do not select
AIs that exhibit them. However, the companies developing AIs are not selecting
the safest path but instead succumbing to evolutionary pressures. One example is
OpenAI, which was founded as a nonprofit in 2015 to “benefit humanity as a whole,
unconstrained by a need to generate financial return” [63]. However, when faced
with the need to raise capital to keep up with better-funded rivals, in 2019 OpenAI
transitioned from a nonprofit to “capped-profit” structure [64]. Later, many of the
safety-focused OpenAI employees left and formed a competitor, Anthropic, that was
to focus more heavily on AI safety than OpenAI had. Although Anthropic originally
focused on safety research, they eventually became convinced of the “necessity of
commercialization” and now contribute to competitive pressures [65]. While many
of the employees at those companies genuinely care about safety, these values do
not stand a chance against evolutionary pressures, which compel companies to move
ever more hastily and seek ever more influence, lest the company perish. Moreover,
AI developers are already selecting AIs with increasingly selfish traits. They are
selecting AIs to automate and displace humans, make humans highly dependent on
AIs, and make humans more and more obsolete. By their own admission, future
versions of these AIs may lead to extinction [66]. This is why an AI race is insidious:
AI development is not being aligned with human values but rather with natural
selection.
People often choose the products that are most useful and convenient to them imme-
diately, rather than thinking about potential long-term consequences, even to them-
selves. An AI race puts pressures on companies to select the AIs that are most
competitive, not the least selfish. Even if it’s feasible to select for unselfish AIs, if
it comes at a clear cost to competitiveness, some competitors will select the selfish
AIs. Furthermore, as we have mentioned, if AIs develop strategic awareness, they
may counteract our attempts to select against them. Moreover, as AIs increasingly
automate various processes, AIs will affect the competitiveness of other AIs, not just
humans. AIs will interact and compete with each other, and some will be put in charge
of the development of other AIs at some point. Giving AIs influence over which other
AIs should be propagated and how they should be modified would represent another
step toward humans becoming dependent on AIs and AI evolution becoming increas-
ingly independent from humans. As this continues, the complex process governing AI
evolution will become further unmoored from human interests.
AIs can be more fit than humans. Our unmatched intelligence has granted
us power over the natural world. It has enabled us to land on the moon, harness
nuclear energy, and reshape landscapes at our will. It has also given us power over
other species. Although a single unarmed human competing against a tiger or gorilla
has no chance of winning, the collective fate of these animals is entirely in our hands.
Our cognitive abilities have proven so advantageous that, if we chose to, we could
cause them to go extinct in a matter of weeks. Intelligence was a key factor that led
to our dominance, but we are currently standing on the precipice of creating entities
far more intelligent than ourselves.
Given the exponential increase in microprocessor speeds, AIs have the potential to process information and “think” at a pace that far surpasses that of human neurons. The difference could be even more dramatic than the speed difference between humans and sloths, possibly more like the speed difference between humans and plants. They can assimilate vast quantities of data from numerous sources simultaneously, with near-perfect
retention and understanding. They do not need to sleep and they do not get bored.
Due to the scalability of computational resources, an AI could interact and cooperate
with an unlimited number of other AIs, potentially creating a collective intelligence
that would far outstrip human collaborations. AIs could also deliberately update and
improve themselves. Without the same biological restrictions as humans, they could
adapt and therefore evolve unspeakably quickly compared with us. Computers are
becoming faster. Humans aren’t [67].
To further illustrate the point, imagine that there was a new species of humans. They
do not die of old age, they get 30% faster at thinking and acting each year, and they
can instantly create adult offspring for the modest sum of a few thousand dollars.
It seems clear, then, that this new species would eventually have more influence over the
future. In sum, AIs could become like an invasive species, with the potential to out-
compete humans. Our only advantage over AIs is that we get to make the first moves,
but given the frenzied AI race, we are rapidly giving up even this advantage.
AIs would have little reason to cooperate with or be altruistic toward hu-
mans. Cooperation and altruism evolved because they increase fitness. There are
numerous reasons why humans cooperate with other humans, like direct reciprocity.
Also known as “quid pro quo,” direct reciprocity can be summed up by the idiom
“you scratch my back, I’ll scratch yours.” While humans would initially select AIs
that were cooperative, the natural selection process would eventually go beyond our
control, once AIs were in charge of many or most processes, and interacting predomi-
nantly with one another. At that point, there would be little we could offer AIs, given
that they will be able to “think” at least hundreds of times faster than us. Involving
us in any cooperation or decision-making processes would simply slow them down,
giving them no more reason to cooperate with us than we do with gorillas. It might
be difficult to imagine a scenario like this or to believe we would ever let it happen.
Yet it may not require any conscious decision, instead arising as we allow ourselves
to gradually drift into this state without realizing that human-AI co-evolution may
not turn out well for humans.
AIs becoming more powerful than humans could leave us highly vulner-
able. As the most dominant species, humans have deliberately harmed many other
species, and helped drive species such as woolly mammoths and Neanderthals to ex-
tinction. In many cases, the harm was not even deliberate, but instead a result of
us merely prioritizing our goals over their wellbeing. To harm humans, AIs wouldn’t
need to be any more genocidal than someone removing an ant colony on their front
lawn. If AIs are able to control the environment more effectively than we can, they
could treat us with the same disregard.
Conceptual summary. Evolution could cause the most influential AI agents to
act selfishly because:
1. Evolution by natural selection gives rise to selfish behavior. While evolu-
tion can result in altruistic behavior in rare situations, the context of AI develop-
ment does not promote altruistic behavior.
2. Natural selection may be a dominant force in AI development. The
intensity of evolutionary pressure will be high if AIs adapt rapidly or if competitive
pressures are intense. Competition and selfish behaviors may dampen the effects of
human safety measures, leaving the surviving AI designs to be selected naturally.
If so, AI agents would have many selfish tendencies. The winner of the AI race would
not be a nation-state, not a corporation, but AIs themselves. The upshot is that the
AI ecosystem would eventually stop evolving on human terms, and we would become
a displaced, second-class species.
This creates a feedback loop: since business and economic developments are too fast-moving
for humans to follow, it makes sense to cede yet more control to AIs instead,
pushing humans further out of important processes. Ultimately, this leads to
a fully autonomous economy, governed by an increasingly uncontrolled ecosys-
tem of AIs.
At this point, humans have few incentives to gain any skills or knowledge,
because almost everything would be taken care of by much more capable AIs.
As a result, we eventually lose the capacity to look after and govern ourselves.
Additionally, AIs become convenient companions, offering social interaction
without requiring the reciprocity or compromise necessary in human relation-
ships. Humans interact less and less with one another over time, losing vital
social skills and the ability to cooperate. People become so dependent on AIs
that it would be intractable to reverse this process. What’s more, as some AIs
become more intelligent, some people are convinced these AIs should be given
rights, meaning turning off some AIs is no longer a viable option.
Competitive pressures between the many interacting AIs continue to select
for selfish behaviors, though we might be oblivious to this happening, as we
have already ceded much of our oversight. If these clever, powerful, self-
preserving AIs were then to start acting in harmful ways, it would be all but
impossible to deactivate them or regain control.
AIs have supplanted humans as the most dominant species and their continued
evolution is far beyond our influence. Their selfish traits eventually lead them
to pursue their goals without regard for human wellbeing, with catastrophic
consequences.
In January 1986, tens of millions of people tuned in to watch the launch of the Chal-
lenger Space Shuttle. Approximately 73 seconds after liftoff, the shuttle exploded,
resulting in the deaths of everyone on board. The disaster was tragic enough on its own, but one of the crew members was a schoolteacher named Sharon Christa McAuliffe. McAuliffe
was selected from over 10,000 applicants for the NASA Teacher in Space Project and
was scheduled to become the first teacher to fly in space. As a result, millions of
those watching were schoolchildren. NASA had the best scientists and engineers in
the world, and if there was ever a mission NASA didn’t want to go wrong, it was this
one [68].
The Challenger disaster, alongside other catastrophes, serves as a chilling reminder
that even with the best expertise and intentions, accidents can still occur. As we
progress in developing advanced AI systems, it is crucial to remember that these
systems are not immune to catastrophic accidents. An essential factor in preventing
accidents and maintaining low levels of risk lies in the organizations responsible for
these technologies. In this section, we discuss how organizational safety plays a critical
role in the safety of AI systems. First, we discuss how even without competitive
pressures or malicious actors, accidents can happen—in fact, they are inevitable. We
then discuss how improving organizational factors can reduce the likelihood of AI
catastrophes.
helping a company improve its services. This bug could drastically alter the AI’s be-
havior, leading to unintended and harmful outcomes. One historical example of such
a case occurred when researchers at OpenAI were attempting to train an AI sys-
tem to generate helpful, uplifting responses. During a code cleanup, the researchers
mistakenly flipped the sign of the reward used to train the AI [71]. As a result, in-
stead of generating helpful content, the AI began producing hate-filled and sexually
explicit text overnight without being halted. Accidents could also involve the unin-
tentional release of a dangerous, weaponized, or lethal AI system. Since AIs can be
easily duplicated with a simple copy-paste, a leak or hack could quickly spread the AI
system beyond the original developers’ control. Once the AI system becomes publicly
available, it would be nearly impossible to put the genie back in the bottle.
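To see how small the slip can be, consider the following minimal sketch (a hypothetical illustration, not the actual code involved in the incident above): a single flipped sign turns a learner that favors the highest-reward output into one that favors the lowest.

    # Minimal sketch of how a flipped reward sign inverts what a learner optimizes.
    # Hypothetical illustration; not the actual training code from the incident above.

    def reward(quality: float) -> float:
        # Intended objective: higher-quality, more helpful text earns higher reward.
        return quality

    def select_best(candidates, reward_fn):
        # A trivial stand-in for a learner: prefer whichever candidate scores highest.
        return max(candidates, key=reward_fn)

    candidates = [0.1, 0.5, 0.9]  # stand-ins for generations of varying quality

    intended = select_best(candidates, reward)             # picks 0.9, the best text
    buggy = select_best(candidates, lambda q: -reward(q))  # sign flip: picks 0.1, the worst
    print(intended, buggy)

The real system was far more complex, but the underlying failure mode is the same: the optimizer faithfully pursued the objective it was given, which was no longer the objective anyone wanted.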
Gain-of-function research could potentially lead to accidents by pushing the bound-
aries of an AI system’s destructive capabilities. In these situations, researchers might
intentionally train an AI system to be harmful or dangerous in order to understand
its limitations and assess possible risks. While this can lead to useful insights into
the risks posed by a given AI system, future gain-of-function research on advanced
AIs might uncover capabilities significantly worse than anticipated, creating a seri-
ous threat that is challenging to mitigate or control. As with viral gain-of-function
research, pursuing AI gain-of-function research may only be prudent when conducted
with strict safety procedures, oversight, and a commitment to responsible informa-
tion sharing. These examples illustrate how AI accidents could be catastrophic and
emphasize the crucial role that organizations developing these systems play in pre-
venting such accidents.
When dealing with complex systems, the focus needs to be placed on en-
suring accidents don’t cascade into catastrophes. In his book “Normal Ac-
cidents: Living with High-Risk Technologies,” sociologist Charles Perrow argues that
accidents are inevitable and even “normal” in complex systems, as they are not merely
caused by human errors but also by the complexity of the systems themselves [72]. In
particular, such accidents are likely to occur when the intricate interactions between
components cannot be completely planned or foreseen. For example, in the Three
Mile Island accident, a contributing factor to the lack of situational awareness by the
reactor’s operators was the presence of a yellow maintenance tag, which covered valve
position lights in the emergency feedwater lines [73]. This prevented operators from
noticing that a critical valve was closed, demonstrating the unintended consequences
that can arise from seemingly minor interactions within complex systems.
Unlike nuclear reactors, which are relatively well-understood despite their complexity,
complete technical knowledge of most complex systems is often nonexistent. This is
especially true of DL systems, for which the inner workings are exceedingly difficult
to understand, and where the reason why certain design choices work can be hard to
understand even in hindsight. Furthermore, unlike components in other industries,
such as gas tanks, which are highly reliable, DL systems are neither perfectly accurate
nor highly reliable. Thus, the focus for organizations dealing with complex systems,
especially DL systems, should not be solely on eliminating accidents, but rather on
ensuring that accidents do not cascade into catastrophes.
It often takes years to discover severe flaws or risks. History is replete with
examples of substances or technologies initially thought safe, only for their unintended
flaws or risks to be discovered years, if not decades, later. For example, lead was widely
used in products like paint and gasoline until its neurotoxic effects came to light [76].
Asbestos, once hailed for its heat resistance and strength, was later linked to serious
health issues, such as lung cancer and mesothelioma [77]. The “Radium Girls” suffered
grave health consequences from radium exposure, a material they were told was safe
to put in their mouths [78]. Tobacco, initially marketed as a harmless pastime, was
found to be a primary cause of lung cancer and other health problems [79]. CFCs,
once considered harmless and used to manufacture aerosol sprays and refrigerants,
were found to deplete the ozone layer [80]. Thalidomide, a drug intended to alleviate
morning sickness in pregnant women, led to severe birth defects [81]. And more
recently, the proliferation of social media has been linked to an increase in depression
and anxiety, especially among young people [82].
This emphasizes the importance of not only conducting expert testing but also imple-
menting slow rollouts of technologies, allowing the test of time to reveal and address
potential flaws before they impact a larger population. Even in technologies adhering
to rigorous safety and security standards, undiscovered vulnerabilities may persist, as
had astonishingly set the codes used to unlock intercontinental ballistic missiles to
“00000000” [91]. Here, safety mechanisms such as locks can be rendered virtually
useless by human factors.
A more dramatic example illustrates how researchers sometimes accept a non-
negligible chance of causing extinction. Prior to the first nuclear weapon test, an
eminent Manhattan Project scientist calculated the bomb could cause an existen-
tial catastrophe: the explosion might ignite the atmosphere and cover the Earth in
flames. Although Oppenheimer believed the calculations were probably incorrect, he
remained deeply concerned, and the team continued to scrutinize and debate the
calculations right until the day of the detonation [92]. Such instances underscore the
need for a robust safety culture.
A questioning attitude can help uncover potential flaws. Unexpected sys-
tem behavior can create opportunities for accidents or exploitation. To counter this,
organizations can foster a questioning attitude, where individuals continuously chal-
lenge current conditions and activities to identify discrepancies that might lead to
errors or inappropriate actions [93]. This approach helps to encourage diversity of
thought and intellectual curiosity, thus preventing potential pitfalls that arise from
uniformity of thought and assumptions. The Chernobyl nuclear disaster illustrates
the importance of a questioning attitude, as the safety measures in place failed to account for the reactor's design flaws and poorly prepared operating procedures. A questioning attitude toward the safety of the reactor during the test operation might have prevented the explosion that resulted in the deaths and illnesses of countless people.
A security mindset is crucial for avoiding worst-case scenarios. A secu-
rity mindset, widely valued among computer security professionals, is also applicable
to organizations developing AIs. It goes beyond a questioning attitude by adopting
the perspective of an attacker and by considering worst-case, not just average-case,
scenarios. This mindset requires vigilance in identifying vulnerabilities that may oth-
erwise go unnoticed and involves considering how systems might be deliberately made
to fail, rather than only focusing on making them work. It reminds us not to assume a
system is safe simply because no potential hazards come to mind after a brief brain-
storming session. Cultivating and applying a security mindset demands time and
serious effort, as failure modes can often be surprising and unintuitive. Furthermore,
the security mindset emphasizes the importance of being attentive to seemingly be-
nign issues or “harmless errors,” which can lead to catastrophic outcomes either due
to clever adversaries or correlated failures [94]. This awareness of potential threats
aligns with Murphy’s law—“Anything that can go wrong will go wrong”—recognizing
that this can be a reality due to adversaries and unforeseen events.
Organizations with a strong safety culture can successfully avoid catas-
trophes. High Reliability Organizations (HROs) are organizations that consistently
maintain a heightened level of safety and reliability in complex, high-risk environ-
ments [85]. A key characteristic of HROs is their preoccupation with failure, which
requires considering worst-case scenarios and potential risks, even if they seem un-
likely. These organizations are acutely aware that new, previously unobserved failure
modes may exist, and they diligently study all known failures, anomalies, and near
misses to learn from them. HROs encourage reporting all mistakes and anomalies to
maintain vigilance in uncovering problems. They engage in regular horizon scanning
to identify potential risk scenarios and assess their likelihood before they occur. By
practicing surprise management, HROs develop the skills needed to respond quickly
and effectively when unexpected situations arise, further enhancing an organization’s
ability to prevent catastrophes. This combination of critical thinking, preparedness
planning, and continuous learning could help organizations to be better equipped to
address potential AI catastrophes. However, the practices of HROs are not a panacea.
It is crucial for organizations to evolve their safety practices, going above and beyond HRO best practices, to effectively address the novel risks posed by AI accidents.
Figure 1.2. The Swiss cheese model shows how technical factors can improve organizational
safety. Multiple layers of defense compensate for each other’s individual weaknesses, leading
to a low overall level of risk.
and transparency. For example, red teaming assesses system vulnerabilities and fail-
ure modes, while anomaly detection works to identify unexpected or unusual system
behavior and usage patterns. Transparency ensures that the inner workings of AI
systems are understandable and accessible, fostering trust and enabling more effec-
tive oversight. By leveraging these and other safety measures, the Swiss cheese model
aims to create a comprehensive safety system where the strengths of one layer com-
pensate for the weaknesses of another. With this model, safety is not achieved with
a monolithic airtight solution, but rather with a variety of safety measures.
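As a loose illustration of this layered approach, the sketch below (with invented checks and thresholds, not a recommendation for any real deployment) only approves a request if several independent, imperfect layers all agree, and routes anything suspicious to a human reviewer:

    # Illustrative sketch of layered defenses in the spirit of the Swiss cheese model.
    # The specific checks and thresholds are invented for illustration only.

    def passes_red_team_filter(request: str) -> bool:
        # Block requests matching patterns surfaced during (hypothetical) red teaming.
        blocked_phrases = ["synthesize a pathogen", "build a bioweapon"]
        return not any(phrase in request.lower() for phrase in blocked_phrases)

    def passes_anomaly_check(requests_last_hour: int) -> bool:
        # Flag unusually heavy usage as anomalous (the threshold is arbitrary here).
        return requests_last_hour < 500

    def human_reviewer_approves(request: str) -> bool:
        # Placeholder for a human-in-the-loop decision on flagged requests.
        return False

    def allow(request: str, requests_last_hour: int) -> bool:
        # Each layer is imperfect on its own; a harmful request only gets through
        # automatically if every layer misses it.
        if passes_red_team_filter(request) and passes_anomaly_check(requests_last_hour):
            return True
        return human_reviewer_approves(request)

No single layer here is reliable, but a request must slip past all of them before it causes harm, which is the point of stacking defenses.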
In summary, weak organizational safety creates many sources of risk. For AI devel-
opers with weak organizational safety, safety is merely a matter of box-ticking. They
do not develop a good understanding of risks from AI and may safetywash unrelated
research. Their norms might be inherited from academia (“publish or perish”) or star-
tups (“move fast and break things”), and their hires often do not care about safety.
These norms are hard to change once they have inertia, and need to be addressed
with proactive interventions.
foreseen with enough detail to fix prior to deployment. The CRO agrees, but
says that ongoing research would enable more improvements if the next model
could only be delayed. The CEO retorts, “That’s what you said the last time,
and it turned out to be fine. I’m sure it will work out, just like last time.”
After the meeting, the CRO decides to resign, but doesn’t speak out against the
company, as all employees have had to sign a non-disparagement agreement.
The public has no idea that concerns have been raised about the company’s
choices, and the CRO is replaced with a new, more agreeable CRO who quickly
signs off on the company’s plans.
The company goes through with training, testing, and deploying its most ca-
pable model ever, using its existing procedures to prevent malicious use. A
month later, revelations emerge that terrorists have managed to use the sys-
tem to break into government systems and steal nuclear and biological secrets,
despite the safeguards the company put in place. The breach is detected, but
by then it is too late: the dangerous information has already proliferated.
goal. However, AI systems often find loopholes by which they can easily achieve the
proxy goal, but completely fail to achieve the ideal goal. If an AI “games” its proxy
goal in a way that does not reflect our values, then we might not be able to reliably
steer its behavior. We will now look at some past examples of proxy gaming and
consider the circumstances under which this behavior could become catastrophic.
Proxy gaming has already been observed with AIs. As an example of proxy
gaming, social media platforms such as YouTube and Facebook use AI systems to
decide which content to show users. One way of assessing these systems would be
to measure how long people spend on the platform. After all, if they stay engaged,
surely that means they are getting some value from the content shown to them?
However, in trying to maximize the time users spend on a platform, these systems
often select enraging, exaggerated, and addictive content [98, 99]. As a consequence,
people sometimes develop extreme or conspiratorial beliefs after having certain con-
tent repeatedly suggested to them. These outcomes are not what most people want
from social media.
Proxy gaming has been found to perpetuate bias. For example, a 2019 study looked
at AI-powered software that was used in the healthcare industry to identify patients
who might require additional care. One factor that the algorithm used to assess a
patient’s risk level was their recent healthcare costs. It seems reasonable to think that
someone with higher healthcare costs must be at higher risk. However, white patients
have significantly more money spent on their healthcare than black patients with the
same needs. Using health costs as an indicator of actual health, the algorithm was
found to have rated a white patient and a considerably sicker black patient as at the
same level of health risk [100]. As a result, the number of black patients recognized
as needing extra care was less than half of what it should have been.
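A small numerical illustration (with invented figures, not data from the study) shows how ranking by the cost proxy can misorder patients when spending differs across groups for reasons unrelated to health:

    # Invented numbers illustrating the cost-as-proxy failure described above:
    # equal spending does not imply equal need when less is spent per unit of need
    # on one group of patients.

    patients = [
        # (identifier, true severity on a 0-10 scale, annual healthcare cost in dollars)
        ("patient_A", 4, 6000),
        ("patient_B", 7, 6000),  # considerably sicker, but the same amount was spent on their care
    ]

    ranked_by_proxy = sorted(patients, key=lambda p: p[2], reverse=True)
    ranked_by_need = sorted(patients, key=lambda p: p[1], reverse=True)

    # Under the cost proxy, both patients look equally high-risk; ranked by actual
    # severity, patient_B should clearly be prioritized for extra care.
    print([p[0] for p in ranked_by_proxy])
    print([p[0] for p in ranked_by_need])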
As a third example, in 2016, researchers at OpenAI were training an AI to play a
boat racing game called CoastRunners [101]. The objective of the game is to race
other players around the course and reach the finish line before them. Additionally,
players can score points by hitting targets that are positioned along the way. To the
researchers’ surprise, the AI agent did not circle the racetrack as most humans
would have. Instead, it found a spot where it could repetitively hit three nearby
targets to rapidly increase its score without ever finishing the race. This strategy was
not without its (virtual) hazards—the AI often crashed into other boats and even
set its own boat on fire. Despite this, it collected more points than it could have by
simply following the course as humans would.
Proxy gaming more generally. In these examples, the systems are given an
approximate—“proxy”—goal or objective that initially seems to correlate with the
ideal goal. However, they end up exploiting this proxy in ways that diverge from the
idealized goal or even lead to negative outcomes. Offering a reward for rat tails seems
like a good way to reduce the population of rats; a patient’s healthcare costs appear
to be an accurate indication of health risk; and a boat race reward system should
encourage boats to race, not catch themselves on fire. Yet, in each instance, the system
optimized its proxy objective in ways that did not achieve the intended outcome or
even made things worse overall. This phenomenon is captured by Goodhart’s law:
“Any observed statistical regularity will tend to collapse once pressure is placed upon
it for control purposes,” or put succinctly but overly simplistically, “when a measure
becomes a target, it ceases to be a good measure.” In other words, there may usually be
a statistical regularity between healthcare costs and poor health, or between targets
hit and finishing the course, but when we place pressure on it by using one as a proxy
for the other, that relationship will tend to collapse.
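The toy simulation below makes this concrete. The functional forms are arbitrary and chosen purely for illustration: the proxy rewards raw engagement without limit, while the true objective values engagement only up to a point. Light optimization of the proxy helps the true objective, but heavy optimization drives it down.

    import numpy as np

    # Toy Goodhart's law simulation; the functional forms below are arbitrary.
    optimization_pressure = np.linspace(0, 40, 401)
    proxy = optimization_pressure                     # e.g., hours spent on the platform
    true_objective = optimization_pressure - 0.05 * optimization_pressure ** 2
    # true objective: value to users, which peaks and then falls

    proxy_optimum = optimization_pressure[np.argmax(proxy)]          # 40.0: maximize engagement at all costs
    true_optimum = optimization_pressure[np.argmax(true_objective)]  # 10.0: the level users would actually prefer

    print(f"proxy-optimal pressure: {proxy_optimum:.1f}")
    print(f"user-optimal pressure:  {true_optimum:.1f}")

Up to the peak, the proxy and the true objective move together, which is exactly why the proxy looked reasonable in the first place; the divergence only appears once the proxy is pushed hard.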
Correctly specifying goals is no trivial task. If delineating exactly what we
want from a boat racing AI is tricky, capturing the nuances of human values under
all possible scenarios will be much harder. Philosophers have been attempting to
precisely describe morality and human values for millennia, so a precise and flawless
characterization is not within reach. Although we can refine the goals we give AIs, we
might always rely on proxies that are easily definable and measurable. Discrepancies
between the proxy goal and the intended function arise for many reasons. Besides the
difficulty of exhaustively specifying everything we care about, there are also limits
to how much we can oversee AIs, in terms of time, computational resources, and the
number of aspects of a system that can be monitored. Additionally, AIs may not be
adaptive to new circumstances or robust to adversarial attacks that seek to misdirect
them. As long as we give AIs proxy goals, there is the chance that they will find
loopholes we have not thought of, and thus find unexpected solutions that fail to
pursue the ideal goal.
The more intelligent an AI is, the better it will be at gaming proxy goals.
Increasingly intelligent agents can be increasingly capable of finding unanticipated
routes to optimizing proxy goals without achieving the desired outcome [102]. Addi-
tionally, as we grant AIs more power to take actions in society, for example by using
them to automate certain processes, they will have access to more means of achiev-
ing their goals. They may then do this in the most efficient way available to them,
potentially causing harm in the process. In a worst-case scenario, we can imagine
a highly powerful agent optimizing a flawed objective to an extreme degree without
regard for human life. This represents a catastrophic risk of proxy gaming.
In summary, it is often not feasible to perfectly define exactly what we want from
a system, meaning that many systems find ways to achieve their given goal without
performing their intended function. AIs have already been observed to do this and are
likely to get better at it as their capabilities improve. This is one possible mechanism
that could result in an uncontrolled AI that would behave in unanticipated and
potentially harmful ways.
Even if we successfully control early AIs and direct them to promote human values,
future AIs could end up with different goals that humans would not endorse. This
process, termed “goal drift,” can be hard to predict or control. This section is the most cutting-edge and speculative in this chapter; in it, we will discuss how goals shift in
various agents and groups and explore the possibility of this phenomenon occurring
in AIs. We will also examine a mechanism that could lead to unexpected goal drift,
called intrinsification, and discuss how goal drift in AIs could be catastrophic.
The goals of individual humans change over the course of our lifetimes.
Any individual reflecting on their own life to date will probably find that they have
some desires now that they did not have earlier in their life. Similarly, they will
probably have lost some desires that they used to have. While we may be born with
a range of basic desires, including for food, warmth, and human contact, we develop
many more over our lifetime. The specific types of food we enjoy, the genres of music
we like, the people we care most about, and the sports teams we support all seem
heavily dependent on the environment we grow up in, and can also change many
times throughout our lives. A concern is that individual AI agents may have their
goals change in complex and unanticipated ways, too.
Groups can also acquire and lose collective goals over time. Values within
society have changed throughout history, and not always for the better. The rise of
the Nazi regime in 1930s Germany, for instance, represented a profound moral regres-
sion, which ultimately resulted in the systematic extermination of six million Jews
during the Holocaust, alongside widespread persecution of other minority groups.
Additionally, the regime greatly restricted freedom of speech and expression. Here, a
society’s goals drifted for the worse.
The Red Scare that took place in the United States from 1947 to 1957 is another ex-
ample of societal values drifting. Fuelled by strong anti-communist sentiment, against
the backdrop of the Cold War, this period saw the curtailment of civil liberties,
widespread surveillance, unwarranted arrests, and blacklisting of suspected commu-
nist sympathizers. This constituted a regression in terms of freedom of thought, free-
dom of speech, and due process. Just as the goals of human collectives can change
in emergent and unexpected ways, collectives of AI agents may also have their goals
unexpectedly drift from the ones we initially gave them.
Over time, instrumental goals can become intrinsic. Intrinsic goals are
things we want for their own sake, while instrumental goals are things we want because
they can help us get something else. We might have an intrinsic desire to spend time
on our hobbies, simply because we enjoy them, or to buy a painting because we find
1.5.3 Power-Seeking
So far, we have considered how we might lose our ability to control the goals that
AIs pursue. However, even if an agent started working to achieve an unintended
goal, this would not necessarily be a problem, as long as we had enough power to
prevent any harmful actions it wanted to attempt. Therefore, another important way
in which we might lose control of AIs is if they start trying to obtain more power,
potentially transcending our own. We will now discuss how and why AIs might become
power-seeking and how this could be catastrophic. This section draws heavily from
“Existential Risk from Power-Seeking AI” [106].
the agent might realize that it would not be able to get the coffee if it ceased to exist.
In trying to accomplish even this simple goal, therefore, self-preservation turns out
to be instrumentally rational. Since the acquisition of power and resources are also
often instrumental goals, it is reasonable to think that more intelligent agents might
develop them. That is to say, even if we do not intend to build a power-seeking AI, we
could end up with one anyway. By default, if we are not deliberately pushing against
power-seeking behavior in AIs, we should expect that it will sometimes emerge [110].
AIs given ambitious goals with little supervision may be especially likely
to seek power. While power could be useful in achieving almost any task, in prac-
tice, some goals are more likely to inspire power-seeking tendencies than others. AIs
with simple, easily achievable goals might not benefit much from additional control
of their surroundings. However, if agents are given more ambitious goals, it might
be instrumentally rational to seek more control of their environment. This might be
especially likely in cases of low supervision and oversight, where agents are given the
freedom to pursue their open-ended goals, rather than having their strategies highly
restricted.
Power-seeking AIs with goals separate from ours are uniquely adversar-
ial. Oil spills and nuclear contamination are challenging enough to clean up, but
they are not actively trying to resist our attempts to contain them. Unlike other haz-
ards, AIs with goals separate from ours would be actively adversarial. It is possible,
for example, that rogue AIs might make many backup variations of themselves, in
case humans were to deactivate some of them.
There will also be strong incentives for many people to deploy powerful
AIs. Companies may feel compelled to give capable AIs more tasks, to obtain an
advantage over competitors, or simply to keep up with them. It will be more difficult
to build perfectly aligned AIs than to build imperfectly aligned AIs that are still
superficially attractive to deploy for their capabilities, particularly under competitive
pressures. Once deployed, some of these agents may seek power to achieve their goals.
If they find a route to their goals that humans would not approve of, they might try
to overpower us directly to avoid us interfering with their strategy.
then additional power could change from an instrumental goal into an intrinsic one,
through the process of intrinsification discussed above. If this happened, we might
face a situation where rogue AIs were seeking not only the specific forms of control
that are useful for their goals, but also power more generally. (We note that many
influential humans desire power for its own sake.) This could be another reason for
them to try to wrest control from humans, in a struggle that we would not necessarily
win.
Conceptual summary. The following plausible but not certain premises encap-
sulate reasons for paying attention to risks from power-seeking AIs:
1. There will be strong incentives to build powerful AI agents.
2. It is likely harder to build perfectly controlled AI agents than to build imperfectly
controlled AI agents, and imperfectly controlled agents may still be superficially
attractive to deploy (due to factors including competitive pressures).
3. Some of these imperfectly controlled agents will deliberately seek power over hu-
mans.
If the premises are true, then power-seeking AIs could lead to human disempower-
ment, which would be a catastrophe.
1.5.4 Deception
We might seek to maintain control of AIs by continually monitoring them and looking
out for early warning signs that they were pursuing unintended goals or trying to
increase their power. However, this is not an infallible solution, because it is plausible
that AIs could learn to deceive us. They might, for example, pretend to be acting
as we want them to, but then take a “treacherous turn” when we stop monitoring
them, or when they have enough power to evade our attempts to interfere with them.
We will now look at how and why AIs might learn to deceive us, and how this could
lead to a potentially catastrophic loss of control. We begin by reviewing examples of
deception in strategically minded agents.
Deception has emerged as a successful strategy in a wide range of set-
tings. Politicians from the right and left, for example, have been known to engage
in deception, sometimes promising to enact popular policies to win support in an elec-
tion, and then going back on their word once in office. For example, Lyndon Johnson
said “we are not about to send American boys nine or ten thousand miles away from
home” in 1964, not long before significant escalations in the Vietnam War [111].
Companies can also exhibit deceptive behavior. In the Volkswagen emissions
scandal, the car manufacturer Volkswagen was discovered to have manipulated their
engine software to produce lower emissions exclusively under laboratory testing con-
ditions, thereby creating the false impression of a low-emission vehicle. Although the
US government believed it was incentivizing lower emissions, it was unwittingly incentivizing nothing more than passing an emissions test. Consequently, entities sometimes
have incentives to play along with tests and behave differently afterward.
deceive us. This could present a severe risk if we give AIs control of various decisions
and procedures, believing they will act as we intended, and then find that they do
not.
So far, we have considered four sources of AI risk separately, but they also interact
with each other in complex ways. We give some examples to illustrate how risks are
connected.
Imagine, for instance, that a corporate AI race compels companies to prioritize the
rapid development of AIs. This could increase organizational risks in various ways.
Perhaps a company could cut costs by putting less money toward information security,
leading to one of its AI systems getting leaked. This would increase the probability
of someone with malicious intent having the AI system and using it to pursue their
harmful objectives. Here, an AI race can increase organizational risks, which in turn
can make malicious use more likely.
In another potential scenario, we could envision the combination of an intense AI
race and low organizational safety leading a research team to mistakenly view general
capabilities advances as “safety.” This could hasten the development of increasingly
capable models, reducing the available time to learn how to make them controllable.
The accelerated development would also likely feed back into competitive pressures,
meaning that less effort would be spent on ensuring models were controllable. This
could give rise to the release of a highly powerful AI system that we lose control over,
leading to a catastrophe. Here, competitive pressures and low organizational safety
can reinforce AI race dynamics, which can undercut technical safety research and
increase the chance of a loss of control.
Competitive pressures in a military environment could lead to an AI arms race, and
increase the potency and autonomy of AI weapons. The deployment of AI-powered
weapons, paired with insufficient control of them, would make a loss of control more
deadly, potentially existential. These are just a few examples of how these sources of
risk might combine, trigger, and reinforce one another.
It is also worth noting that many existential risks could arise from AIs amplifying ex-
isting concerns. Power inequality already exists, but AIs could lock it in and widen the
chasm between the powerful and the powerless, perhaps even enabling an unshakable
global totalitarian regime. Similarly, AI manipulation could undermine democracy,
which would also increase the risk of an irreversible totalitarian regime. Disinforma-
tion is already a pervasive problem, but AIs could exacerbate it to a point where
we fundamentally undermine our ability to reach consensus or sense a shared reality.
AIs could develop more deadly bioweapons and reduce the required technical exper-
tise for obtaining them, greatly increasing existing risks of bioterrorism. AI-enabled
cyberattacks could make war more likely, which would increase existential risk. Dra-
matically accelerated economic automation could lead to long-term erosion of human
control and enfeeblement. Each of those issues—power concentration, disinformation,
cyberattacks, automation—is causing ongoing harm, and their exacerbation by AIs
could eventually lead to a catastrophe from which we might not recover.
As we can see, ongoing harms, catastrophic risks, and existential risks are deeply in-
tertwined. Historically, existential risk reduction has focused on targeted interventions
such as technical AI control research, but the time has come for broad interventions
[115] like the many sociotechnical interventions outlined in this chapter.
In mitigating existential risk, it does not make practical sense to ignore other risks.
Ignoring ongoing harms and catastrophic risks normalizes them and could lead us to
“drift into danger” [116], as further discussed in the Safety Engineering chapter. Overall,
since existential risks are connected to less extreme catastrophic risks and other stan-
dard risk sources, and because society is increasingly willing to address various risks
from AIs, we believe that we should not solely focus on directly targeting existential
risks. Instead, we should consider the diffuse, indirect effects of other risks and take
a more comprehensive approach to risk management.
1.7 CONCLUSION
In this chapter, we have explored how the development of advanced AIs could lead
to catastrophe, stemming from four primary sources of risk: malicious use, AI races,
organizational risks, and rogue AIs. This lets us decompose AI risks into four proximate causes: an intentional cause, an environmental/structural cause, an accidental cause, and an internal cause, respectively. We have considered ways in which AIs might be
used maliciously, such as terrorists using AIs to create deadly pathogens. We have
looked at how a military or corporate AI race could rush us into giving AIs decision-
making powers, leading us down a slippery slope to human disempowerment. We
have discussed how inadequate organizational safety could lead to catastrophic acci-
dents. Finally, we have addressed the challenges in reliably controlling advanced AIs,
including mechanisms such as proxy gaming and goal drift that might give rise to
rogue AIs pursuing undesirable actions without regard for human wellbeing.
These dangers warrant serious concern. Currently, very few people are working on
AI risk reduction. We do not yet know how to control highly advanced AI systems,
and existing control methods are already proving inadequate. The inner workings of
AIs are not well understood, even by those who create them, and current AIs are by
no means highly reliable. As AI capabilities continue to grow at an unprecedented
rate, it is plausible that they could surpass human intelligence in nearly all respects
relatively soon, creating a pressing need to manage the potential risks.
The good news is that there are many courses of action we can take to substan-
tially reduce these risks. The potential for malicious use can be mitigated by various
measures, such as carefully targeted surveillance and limiting access to the most
dangerous AIs. Safety regulations and cooperation between nations and corporations
could help us to resist competitive pressures that would drive us down a danger-
ous path. The probability of accidents can be reduced by a rigorous safety culture,
among other factors, and by ensuring that safety advances outpace advances in gen-
eral AI capabilities. Finally, the risks inherent in building technology that surpasses
our own intelligence can be addressed by increased investment in several branches
of research on control of AI systems, as well as coordination to ensure that progress
does not accelerate to a point where societies are unable to respond or manage risks
appropriately.
The remainder of this book aims to outline the underlying factors that drive these
risks in more detail and to provide a foundation for understanding and effectively re-
sponding to these risks. Later chapters delve into each type of risk in greater depth.
For example, risks from malicious use can be reduced via effective policies and coor-
dination, which are discussed in the Governance chapter. The challenge of AI races
arises due to collective action problems, discussed in the corresponding chapter. Or-
ganizational risks can only be addressed based on a strong understanding of princi-
ples of risk management and system safety outlined in the Safety Engineering and
Complex Systems chapters. Risks from rogue AI are mediated by mechanisms such
as proxy gaming, deception and power-seeking which are discussed in detail in the
Single-Agent Safety chapter. While some chapters are more closely aligned to certain
risks, many of the concepts they introduce are cross-cutting. The choice of values
and goals embedded into AI systems, as discussed in the Beneficial AI and Machine Ethics chapter, is a general factor that can exacerbate or reduce many of the risks discussed
in this chapter.
Before tackling these issues, we provide a general introduction to core concepts that
drive the modern field of AI, to ensure that all readers have a high-level understanding
of how today’s AI systems work and how they are produced.
1.8 LITERATURE
Artificial Intelligence Fundamentals
2.1 INTRODUCTION
Deep learning. Then, we will delve into DL, a further subset of ML that uses
neural networks with many layers to model and understand complex patterns in
datasets. We will discuss the structure and function of DL models, exploring key
building blocks and principles of how they learn. We will present a timeline of in-
fluential DL architectures and highlight a few of the countless applications of these
models.
Scaling laws. Having established a basic understanding of AI, ML, and DL, we
will then explore scaling laws. These are equations that model the improvements in
performance of DL models when increasing their parameter count and dataset size.
We will examine how these are often power laws—equations in which one variable increases in proportion to a power of another, as the area of a square does with its side length—and review a few empirically determined scaling laws in recent AI systems.
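As a concrete illustration of what such a power law looks like, the snippet below evaluates a loss of the form L(N) = (N_c / N)^alpha for a few model sizes. The constants are made up for illustration and should not be read as fitted values from any published scaling law.

    # Hypothetical power-law scaling curve: loss falls as a power of parameter count N.
    # The constants n_c and alpha are invented for illustration, not fitted values.

    def predicted_loss(num_parameters: float, n_c: float = 1e13, alpha: float = 0.08) -> float:
        return (n_c / num_parameters) ** alpha

    for n in [1e8, 1e9, 1e10, 1e11]:
        print(f"{n:.0e} parameters -> predicted loss {predicted_loss(n):.2f}")

    # Every 10x increase in parameters multiplies the loss by the same factor, 10 ** -alpha,
    # which is the defining property of a power law.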
Speed of AI development. Scaling laws are closely related to the broader ques-
tion of how fast the capabilities of AI systems are improving. We will discuss some of
the key trends that are currently driving increasing AI capabilities and whether we
should expect these to continue in coming years. We will relate this to the broader
debate around when we might see AI systems that match (or surpass) human per-
formance across all or nearly all cognitive tasks.
Throughout the chapter, we focus on building intuition, breaking down technical
terms and complex ideas to provide straightforward explanations of their core prin-
ciples. Each section presents fundamental principles, lays out prominent algorithms
and techniques, and provides examples of real-world applications. We aim to demys-
tify these fields, empowering readers to grasp the concepts that underpin AI systems.
By the end of this chapter, we should have a basic understanding of ML and be
in a stronger position to consider the complexities and challenges of AI systems, the
risks they pose, and how they interact with our society. This will provide the technical
foundation we need for the following chapters, which will explore the risks and ethical
considerations that these technologies present from a wide array of perspectives.
AI is reshaping our society, from its small effects on daily interactions to sweeping
changes across many industries and implications for the future of humanity. This
section explains what AI is, discusses what AI can and cannot do, and helps develop
a critical perspective on the potential benefits and risks of AI. Firstly, we will discuss
what AI means, its different types, and its history. Then, in the second part of this
section, we will analyze the field of ML.
Definition
History
We will now follow the journey of AI, tracing its path from ancient times to the present
day. We will discuss its conceptual and practical origins, which laid the foundation
for the field’s genesis at the Dartmouth Conference in 1956. We will then survey a few
early approaches and attempts to create AI, including symbolic AI, perceptrons, and
the chatbot ELIZA. Next, we will discuss how the First AI Winter and subsequent
periods of reduced funding and interest have shaped the field. Then, we will chart how
the internet, algorithmic progress, and advancements in hardware led to increasingly
rapid developments in AI from the late 1980s to the early 2010s. Finally, we will
explore the modern DL era and see a few examples of the power and ubiquity of
present-day AI systems—and how far they have come.
Early historical ideas of AI. Dreams of creating intelligent machines have been
present since the earliest human civilizations. The ancient Greeks speculated about
automatons—mechanical devices that mimicked humans or animals. It was said that
Hephaestus, the god of craftsmen, built the giant Talos from bronze to patrol an
island.
since its inception, the Turing Test has substantially influenced how we think about
machine intelligence.
The first neural network. One of the earliest attempts to create AI was the
perceptron, a method implemented by Frank Rosenblatt in 1958 and inspired by
biological neurons [125]. The perceptron could learn to classify patterns of inputs
by adjusting a set of numbers based on a learning rule. It is an important milestone
because it made an immense impact in the long run, inspiring further research into DL
and neural networks. However, scholars initially criticized it for its lack of theoretical foundations, its limited ability to generalize, and its inability to separate classes that cannot be divided by a straight line (that is, data that are not linearly separable). Nonetheless, perceptrons prepared the ground for
future progress.
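To give a flavor of how simple the underlying idea is, here is a minimal perceptron sketch that learns the logical OR function, a toy problem that can be separated by a straight line. It nudges a small set of weights whenever a prediction is wrong; this is an illustrative reconstruction, not Rosenblatt's original implementation.

    import numpy as np

    # Minimal perceptron sketch: learn the (linearly separable) OR function by
    # adjusting weights whenever a prediction is wrong.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs
    y = np.array([0, 1, 1, 1])                      # OR labels

    weights = np.zeros(2)
    bias = 0.0
    learning_rate = 0.1

    for _ in range(20):                             # a few passes over the data
        for inputs, target in zip(X, y):
            prediction = int(np.dot(weights, inputs) + bias > 0)  # step activation
            error = target - prediction
            weights += learning_rate * error * inputs             # perceptron learning rule
            bias += learning_rate * error

    print([int(np.dot(weights, inputs) + bias > 0) for inputs in X])  # reproduces the OR labels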
The first chatbot. Another early attempt to create AI was the ELIZA chatbot, a
program that simulated a conversation with a psychotherapist. Joseph Weizenbaum
created ELIZA in 1966 to use pattern matching and substitution to generate responses
based on keywords in the user’s input. He did not intend the ELIZA chatbot to be a
serious model of natural language understanding but rather a demonstration of the
superficiality of communication between humans and machines. However, some users
became convinced that the ELIZA chatbot had genuine intelligence and empathy
despite Weizenbaum’s insistence to the contrary.
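The mechanism behind the illusion was strikingly simple. The toy responder below, a rough sketch in the spirit of ELIZA rather than Weizenbaum's actual script, matches a few keyword patterns and reflects the user's own words back as a question:

    import re

    # Toy ELIZA-style responder: keyword matching and substitution, with no understanding.
    # A rough sketch in the spirit of the original, not Weizenbaum's actual script.
    RULES = [
        (r"i feel (.*)", "Why do you feel {}?"),
        (r"i am (.*)", "How long have you been {}?"),
        (r"my (.*)", "Tell me more about your {}."),
    ]

    def respond(user_input: str) -> str:
        text = user_input.lower().strip(".!?")
        for pattern, template in RULES:
            match = re.match(pattern, text)
            if match:
                return template.format(match.group(1))
        return "Please go on."  # default response when no keyword matches

    print(respond("I feel anxious about exams"))  # "Why do you feel anxious about exams?"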
The first recovery. After this decline, the 1980s brought a resurgence of interest
in AI. Advances in computing power and the emergence of systems that emulate
human decision-making reinvigorated AI research. Efforts to build expert systems
that imitated the decision-making ability of a human expert in a specific field, using
pre-defined rules and knowledge to solve complex problems, yielded some successes.
While these systems were limited, they could leverage and scale human expertise in
various fields, from medical diagnosis to financial planning, setting a precedent for
AI’s potential to augment and even replace human expertise in specialized domains.
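In contrast to systems that learn from data, expert systems encoded knowledge as explicit if-then rules written by hand. The fragment below is a deliberately simplified sketch of that style, with invented rules rather than rules from any real expert system:

    # Deliberately simplified sketch of the expert-system style: hand-written if-then
    # rules (invented here for illustration) instead of parameters learned from data.
    RULES = [
        (lambda s: s["fever"] and s["cough"], "possible respiratory infection"),
        (lambda s: s["fever"] and not s["cough"], "possible non-respiratory infection"),
        (lambda s: not s["fever"], "no infection indicated by these rules"),
    ]

    def diagnose(symptoms: dict) -> str:
        for condition, conclusion in RULES:
            if condition(symptoms):
                return conclusion
        return "no rule applies"

    print(diagnose({"fever": True, "cough": True}))  # "possible respiratory infection"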
New chips and even more compute. In the late 2000s, the proliferation of mas-
sive datasets (known as big data) and rapid growth in computing power allowed the
development of advanced AI techniques. Around the early 2010s, researchers began
using Graphics Processing Units (GPUs)—traditionally used for rendering graphics
in video games—for faster and more efficient training of intricate ML models. Plat-
forms that enabled leveraging GPUs for general-purpose computing facilitated the
transition to the DL era.
DL Era (2012– )
DL revolutionizes AI. The trends of increasing data and compute availability
laid the foundation for groundbreaking ML techniques. In the early 2010s, researchers
pioneered applications of DL, a subset of ML that uses artificial neural networks with
many layers, enabling computers to learn and recognize patterns in large amounts
of data. This approach led to significant breakthroughs in AI, especially in areas
including image recognition and natural language understanding.
Massive datasets provided researchers with the data needed to train DL models effec-
tively. A pivotal example is the ImageNet ([131]) dataset, which provided a large-scale
dataset for training and evaluating computer vision algorithms. It hosted an annual
competition, which spurred breakthroughs and advancements in DL. In 2012, the
AlexNet model revolutionized the field as it won the ImageNet Large Scale Visual
Recognition Challenge [132]. This breakthrough showcased the superior performance
of DL over traditional ML methods in computer vision tasks, sparking a surge in DL
applications across various domains. From this point onward, DL has dominated AI
and ML research and the development of real-world applications.
Advancements in DL. In the 2010s, DL techniques led to considerable improve-
ments in natural language processing (NLP), a field of AI that aims to enable com-
puters to understand and generate human language. These advancements facilitated the widespread use of virtual assistants such as Alexa, introducing consumers to products that integrated ML. Later, in 2016, Google DeepMind’s AlphaGo became
the first AI system to defeat a world champion Go player in a five-game match [133].
Breakthroughs in natural language processing. In 2017, Google researchers introduced the Transformer architecture, which enabled the development of highly
effective NLP models. Researchers built the first large language models (LLMs) using
this Transformer architecture, many layers of neural networks, and billions of words
of data. Generative Pre-trained Transformer (GPT) models have demonstrated im-
pressive and near human-level language processing capabilities [134]. ChatGPT was
released in November 2022 and became the first example of a viral AI product, reach-
ing 100 million users in just two months. The success of the GPT models also sparked
widespread public discussion on the potential risks of advanced AI systems, includ-
ing congressional hearings and calls for regulation. In the early 2020s, AI is used for
many complex tasks, from image recognition to autonomous vehicles, and continues
to evolve and proliferate rapidly.
2.2.2 Types of AI
The field has developed a set of concepts to describe distinct types or levels of AI
systems. However, they often overlap, and definitions are rarely well-formalized, uni-
versally agreed upon, or precise. It is important to consider an AI system’s particular
capabilities rather than simply placing it in one of these broad categories. Labeling
a system as a “weak AI” does not always improve our understanding of it; we need
to elaborate further on its abilities and why they are limited.
This section introduces five widely used conceptual categories for AI systems. We
will present these types of AI in roughly their order of intelligence, generality, and
potential impact, starting with the least potent AI systems.
1. Narrow AI can perform specific tasks, potentially at a level that matches or
surpasses human performance.
2. Artificial general intelligence (AGI) can perform many cognitive tasks across
multiple domains. It is sometimes interpreted as referring to AI that can perform
a wide range of tasks at a human or superhuman level.
3. Human-level AI (HLAI) could perform all tasks that humans can do.
Generality and skill level of AI systems. The concepts we discuss here do not
provide a neat gradation of capabilities as there are at least two different axes along
which these can be measured. When considering a system’s level of capability, it can
be helpful to decompose this into its degree of skill or intelligence and its generality:
the range of domains where it can learn to perform tasks well. This helps us explain
two key ways AI systems can vary: an AI system can be more or less skillful and more
or less general. These two factors are related but distinct: an AI system that can play
chess at a grandmaster level is skillful in that domain, but we would not consider
it general because it can only play chess. On the other hand, an advanced chatbot
may show some forms of general intelligence while not being particularly good at
chess. Skill can be further broken down by reference to varying skill levels among
humans. An AI system could match the skill of the average adult (50th percentile),
or of experts in this skill at varying levels (e.g., 90th or 99th percentile), or surpass
all humans in skill.
Narrow AI
Narrow AI is specialized in one area. Also called weak AI, narrow AI refers
to systems designed to perform specific tasks or solve particular problems within a
specialized domain of expertise. A narrow AI has a limited domain of competence—it
can solve individual problems but is not competent at learning new tasks in a wide
range of domains. While they often excel in their designated tasks, these limitations
mean that a narrow AI does not exhibit high behavioral flexibility. Narrow AI sys-
tems struggle to learn new behaviors effectively, perform well outside their specific
domain, or generalize to new situations. However, narrow AI is still relevant from the
perspective of catastrophic risks, as systems with superhuman capabilities in high-risk
domains such as virology or cyber-offense could present serious threats.
Examples of narrow AI. One example of narrow AI is a digital personal assis-
tant that can receive voice commands and perform tasks like transcribing and sending
text messages but cannot learn how to write an essay or drive a car. Alternatively,
image recognition algorithms can identify objects like people, plants, or buildings
in photos but do not have other skills or abilities. Another example is a program
that excels at summarizing news articles. While it can do this narrow task, it cannot
diagnose a medical condition or compose new music, as these are outside its specific
domain. More generally, intelligent beings such as humans can learn and perform all
these tasks.
Narrow AI vs. general AI. Some narrow AI systems have surpassed human
performance in specific tasks, such as chess. However, these systems exhibit narrow
rather than general intelligence because they cannot learn new tasks and perform
TABLE 2.1 A matrix showing one potential approach to breaking down the skill and gen-
erality of existing AI systems [135]. Note that this is just one example and that we do not
attempt to apply this exact terminology throughout the book.
well outside their domain. For instance, IBM’s Deep Blue famously beat world chess
champion Garry Kasparov in 1997. This system was an excellent chess player but was
only good at chess. If one tried to use Deep Blue to play a different game, recognize
faces in a picture, or translate a sentence, it would fail miserably. Therefore, although
narrow AI may be able to do certain things better than any human could, even highly
capable ones remain limited to a small range of tasks.
to expand its knowledge and abilities by learning video games, diagnosing diseases,
or navigating a city. Some would extend this further and define AGIs as systems that
can apply their intelligence to nearly any real-world task, matching or surpassing
human cognitive abilities across many domains.
Predicting AGI. Predicting when distinct AI capabilities will appear (often called
“AI timelines”) can also be challenging. Many once believed that AI systems would
master physical tasks before tackling “higher-level” cognitive tasks such as coding or
writing. However, some existing language model systems can write functional code
yet cannot perform physical tasks such as moving a ball. While there are many expla-
nations for this observation—cognitive tasks bypass the challenge of building robotic
bodies; domains like coding and writing benefit from abundant training data—this
is an example of the difficulties involved in predicting how AI will develop.
Risks and capabilities. Rather than debating whether a system meets the cri-
teria for being an AGI, evaluating a specific AI system’s capabilities is often more
helpful. Historical evidence and the unpredictability of AI development suggest that
AIs may be able to perform complicated tasks such as scientific research, hacking, or
synthesizing bioweapons before they can reliably automate all domestic chores. Some
highly relevant and dangerous capabilities may arrive long before others. Moreover,
we could have narrow AI systems that can teach anyone how to enrich uranium and
build nuclear weapons but cannot learn other tasks. These dangers show how AIs
can pose risks at many different levels of capabilities. With this in mind, instead
of simply asking about AGI (“When will AGI arrive?”), it might be more relevant
and productive to consider when AIs will be able to do particularly concerning tasks
(“When will this specific capability arrive?”).
HLAI can do everything humans can do. HLAIs exist when machines can
perform approximately every task as well as human workers. Some definitions of
HLAI emphasize three conditions: first, that these systems can perform every task
humans can; second, they can do it at least as well as humans can; and third, they can
do it at a lower cost. If a smart AI is highly expensive, it may make economic sense
to continue to use human labor. If a smart AI took several minutes to think before
doing a task a human could do, its usefulness would have limitations. Like humans,
an HLAI system could hypothetically master a wide range of tasks, from cooking and
driving to advanced mathematics and creative writing. Unlike AGI, which on some
interpretations can perform some—but not all—of the tasks humans can, an HLAI
can complete any conceivable human task. Notably, some reserve the term HLAI to
describe only cognitive tasks. Furthermore, evaluating whether a system is “human level” is fraught with biases. We are often biased to dismiss or underrate unfamiliar
forms of intelligence simply because they do not look or act like human intelligence.
Transformative AI
TAI refers to AI with societal impacts comparable to the Industrial Rev-
olution. The Industrial Revolution fundamentally altered the fabric of human life
globally, heralding an era of tremendous economic growth, increased life expectancy,
expanded energy generation, a surge in technological innovation, and monumental
social changes. Similarly, a TAI could catalyze dramatic changes in our world. The
focus here is not on the specific design or built-in capabilities of the AI itself but on
the consequences of the AI system for humans, our societies, and our economies.
Many kinds of AI systems could be transformative. It is conceivable that
some systems could be transformative while performing at capabilities below hu-
man level. To bring about dramatic change, AI does not need to mimic the powerful
systems of science fiction that behave indistinguishably from humans or surpass hu-
man reasoning. Computer systems that can perform tasks traditionally handled by
people (narrow AIs) could also be transformative by enabling inexpensive, scalable,
and clean energy production. Advanced AI systems could transform society without
reaching or exceeding human-level cognitive abilities, such as by allowing a wide array
of fundamental tasks to be performed at virtually zero cost. Conversely, some sys-
tems might only have transformative impacts long after reaching performance above
the human level. Even when some forms of AGI, HLAI, or ASI are available, the
technology might take time to diffuse widely, and its economic impacts may come
years afterward, creating a diffusion lag.
Artificial Superintelligence
ASI refers to AI that surpasses human performance in virtually all
domains of interest [121]. A system with this set of capabilities could have
immense practical applications, including advanced problem-solving, automation of
complex tasks, and scientific discovery. However, it should be noted that surpassing
humans on only some capabilities does not make an AI superintelligent—a calculator
is superhuman at arithmetic, but not a superintelligence.
Risks of superintelligence. The risks associated with superintelligence are sub-
stantial. ASIs could be harder to control and even pose existential threats—risks to
the survival of humanity. That said, an AI system need not be superintelligent to be dangerous. An AGI, an HLAI, or even a narrow AI could pose severe risks to human-
ity. These systems may vary in intelligence across different tasks and domains, but
Summary
This section provided an introduction to AI, the broad umbrella that encompasses the
area of computer science focused on creating machines that perform tasks typically
associated with human intelligence. First, we discussed the nuances and difficulties
of defining AI and detailed its history. Then, we explored AI systems in more de-
tail and how they are often categorized into different types. Of these, we surveyed
five commonly used terms—narrow AI, HLAI, AGI, TAI, and superintelligence—and
highlighted some of their ambiguities. Considering specific capabilities and individual
systems rather than broad categories or abstractions is often more informative.
Next, we will narrow our focus to ML, an approach within AI that emphasizes the
development of systems that can learn from data. Whereas many classical approaches
to AI relied on logical rules and formal, structured knowledge, ML systems use pattern
recognition to extract information from data.
Benefits of ML. One of the key benefits of ML is its ability to automate com-
plicated tasks, enabling humans to focus on other activities. Developers use ML for
applications from medical diagnosis and autonomous vehicles to financial forecasting
and writing. ML is becoming increasingly important for businesses, governments, and
other organizations to stay competitive and make empirically informed decisions.
• Outputs: What does the ML system produce? These are the results, predictions, or decisions the model generates based on the input data.
• Type of Machine Learning: What technique is used to accomplish the
task? This describes how the model converts its inputs into outputs (called infer-
ence), and learns the best way to convert its inputs into outputs (a learning process
called training). An ML system can be categorized by how it uses training data,
what type of output it generates, and how it reaches results.
The rest of this section delves deeper into these aspects of ML systems.
Key ML Tasks
Classification
Classification is predicting categories or classes. In classification tasks,
models use characteristics or features of an input data point (example) to deter-
mine which specific category the data point belongs to. In medical diagnostics, a
classification model might predict whether a tumor is cancerous or benign based on
features such as a patient’s age, tumor size, and tobacco use. This is an example of
binary classification—the special case in which models predict one of two categories.
Multi-class classification, on the other hand, involves predicting one of multiple cat-
egories. An image classification model might classify an image as belonging to one of
multiple different classes such as dog, cat, hat, or ice cream. Computer vision often
applies these methods to enable computers to interpret and understand visual data
from the world. Classification is categorization: it involves putting data points into
buckets.
The sigmoid function produces probabilistic outputs. A sigmoid is one of
several mathematical functions used in classification to transform general real num-
bers into values between 0 and 1. Suppose we wanted to predict the likelihood that
a student will pass an exam or that a prospective borrower will default on a loan.
The sigmoid function is instrumental in settings like these—problems that rely on
computing probabilities. As a further example, in binary classification, one might
use a function like the sigmoid to estimate the likelihood that a customer makes a
purchase or clicks on an advertisement. However, it is important to note that other
widely used models can provide similar probabilistic outputs without employing a
sigmoid function.
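To make this concrete, the following is a minimal sketch in Python (using NumPy), with a hypothetical one-feature model whose weight and bias are made-up values rather than fitted parameters, showing how a sigmoid turns an unbounded score into a probability:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logistic-regression-style model: score = weight * hours_studied + bias
weight, bias = 1.2, -3.0               # illustrative values, not fitted to real data
hours_studied = np.array([1, 2, 3, 4, 5])

scores = weight * hours_studied + bias  # unbounded real numbers
probabilities = sigmoid(scores)         # values between 0 and 1

for h, p in zip(hours_studied, probabilities):
    print(f"{h} hours of studying -> estimated pass probability {p:.2f}")
```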
Regression
Regression is predicting numbers. In regression tasks, models use features of
input data to predict numerical outputs. A real estate company might use a re-
gression model to predict house prices from a dataset with features such as location,
square footage, and number of bedrooms. While classification models produce discrete
outputs that place inputs into a finite set of categories, regression models produce
continuous outputs that can assume any value within a range. Therefore, regression
is predicting a continuous output variable based on one or more input variables. Re-
gression is estimation: it involves guessing what a feature of a data point will be given
the rest of its characteristics.
Anomaly Detection
Anomaly detection is the identification of outliers or abnormal data
points [138]. Anomaly detection is vital in identifying hazards, including unex-
pected inputs, attempted cyberattacks, sudden behavioral shifts, and unanticipated
failures. Early detection of anomalies can substantially improve the performance of
models in real-world situations.
Figure 2.2. Binary classification can use a sigmoid function to turn real numbers (such
as hours of studying) into probabilities between zero and one (such as the probability of
passing).
Extreme and uncommon events are especially difficult to anticipate by extrapolating from past data, which makes anomaly detection challenging. In section 4.7 in the Safety Engineering chapter,
we discuss these ideas in more detail.
Sequence Modeling
Sequence modeling is analyzing and predicting patterns in sequential
data. Sequence modeling is a broadly defined task that involves processing or pre-
dicting data where temporal or sequential order matters. It may be applied to time-
series data or natural language text to capture dependencies between items in the
sequence to forecast future elements. An integral part of this process is representation
learning, where models learn to convert raw data into more informative formats for
the task at hand. Language models use these techniques to predict subsequent words
in a sequence, transforming previous words into meaningful representations to detect
patterns and make predictions. There are several major subtypes of sequence mod-
eling. Here, we will discuss two: generative modeling and sequential decision-making.
Generative modeling. Generative modeling is a subtype of sequence modeling that
creates new data that resembles the input data, thereby drawing from the same dis-
tribution of features (conditioned on specific inputs). It can generate new outputs
from many input types, such as text, code, images, and protein sequences.
Sequential decision-making (SDM). SDM equips a model with the capability to make
informed choices over time, considering the dynamic and uncertain nature of real-
world environments. An essential feature of SDM is that prior decisions can shape
later ones. Related to SDM is reinforcement learning (RL), where a model learns to
make decisions by interacting with its environment and receiving feedback through rewards and penalties.
Figure 2.3. This linear regression model is the best linear predictor of an output (umbrellas sold) using only information from the input (precipitation).
Figure 2.4. The first graph shows the detection of atypical user activity. The second graph
shows the detection of unusually high energy usage. In both cases, the model detects anoma-
lies [139].
Below, we briefly describe the significant modalities in ML. However, this list is not
exhaustive. Many specific types of inputs, such as data from physical sensors, fMRI
scans, topographic maps, and so on, do not fit easily into this categorization.
• Tabular data: Structured data is stored in rows and columns, usually with each
row corresponding to an observation and each column representing a variable in
the dataset. An example is a spreadsheet of customer purchase histories.
• Text data: Unstructured textual data in natural language, code, or other formats.
An example is a collection of posts and comments from an online forum.
• Image data: Digital representations of visual information that can train ML mod-
els to classify images, segment images, or perform other tasks. An example is a
database of plant leaf images for identifying species of plants.
• Video data: A sequence of visual information over time that can train ML models
to recognize actions, gestures, or objects in the footage. An example is a collection
of sports videos for analyzing player movements.
• Audio data: Sound recordings, such as speech or music. An example is a set of
voice recordings for training speech recognition models.
• Time-series data: Data collected over time that represents a sequence of obser-
vations or events. An example is historical stock price data.
• Graph data: Data representing a network or graph structure, such as social net-
works or road networks. An example is a graph that represents user connections
in a social network.
• Set-valued data: Unstructured data in the form of collections of features or input
vectors. An example is point clouds obtained from LiDAR sensors.
Data collection. The first step in building an ML model is data collection. Data
can be collected in various ways, such as by purchasing datasets from owners of data
or scraping data from the web. The foundation of any ML model is the dataset used
to train it: the quality and quantity of data are essential for accurate predictions and
performance.
How the data is processed and which features are used to represent each example can also significantly impact the model’s performance, making it more or less accurate and efficient.
ML aims to predict labels. A label (or a target) is the value we want to predict
or estimate using the features. Labels in training data are only present in supervised
ML tasks, discussed later in this section. Some models use a sample with correct
labels to teach the model the output for a given set of input features: a model could
use historical data on housing prices to learn how prices are related to features like
square footage. However, other (unsupervised) ML models learn to make predictions
using unlabeled input data—without knowing the correct answers—by identifying
patterns instead.
Choosing an ML architecture. After ML model developers have collected the data and chosen a task, they can choose an architecture to process it. An ML architecture refers to a model’s
overall structure and design. It can include the type and configuration of the algorithm
used and the arrangement of input and output layers. The architecture of an ML
model shapes how it learns from data, identifies patterns, and makes predictions or
decisions.
ML models have parameters. Within an architecture, parameters are ad-
justable values within the model that influence its performance. In the house pricing
example, parameters might include the weights assigned to different features of a
house, like its size or location. During training, the model adjusts these weights, or
parameters, to minimize the difference between its predicted house prices and the
actual prices. The optimal set of parameters enables the model to make the best
possible predictions for unseen data, generalizing from the training dataset.
Training and using the ML model. Once developers have built the model
and collected all necessary data, they can begin training and applying it. ML model
training is adjusting a model’s parameters based on a dataset, enabling it to recognize
patterns and make predictions. During training, the model learns from the provided
data and modifies its parameters to minimize errors.
Model performance can be evaluated. Model evaluation measures the performance of the trained model by testing it on data the model has never encountered before. Evaluating the model on unseen data helps assess its generalizability and suitability for the intended problem. For example, we may try to predict housing prices for a new country beyond the ML model’s original training data.
Once ready, models are deployed. Finally, once the model is trained and eval-
uated, it can be deployed in real-world applications. ML deployment involves inte-
grating the model into a larger system, using it, and then maintaining or updating it
as needed.
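To illustrate this end-to-end workflow on a toy problem, here is a minimal sketch in Python using scikit-learn; the housing dataset below is synthetic, and the feature names and coefficients are made up for illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical dataset: [square footage, number of bedrooms] -> price
rng = np.random.default_rng(0)
square_feet = rng.uniform(500, 3000, size=200)
bedrooms = rng.integers(1, 6, size=200)
X = np.column_stack([square_feet, bedrooms])
y = 150 * square_feet + 10_000 * bedrooms + rng.normal(0, 20_000, size=200)

# Training: fit the model's parameters on one portion of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Evaluation: measure performance on data the model has never seen
predictions = model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))

# "Deployment" here is simply calling the trained model on a new example
new_house = np.array([[1800, 3]])
print("Predicted price:", model.predict(new_house)[0])
```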
Evaluating ML Models
Figure 2.5. A confusion matrix shows the four possible outcomes from a prediction: true
positive, false positive, false negative, and true negative.
Model evaluation—measuring the performance of a trained model on new, unseen data—provides insight into how
well the model has learned. We can use different metrics to understand a model’s
strengths, weaknesses, and potential for real-world applications. These quantitative
performance measures are part of a broader context of goals and values that inform
how we can assess the quality of a model.
Metrics
Accuracy is a measure of the overall performance of a classification
model. Accuracy is defined as the percentage of correct predictions:
Accuracy = (number of correct predictions) / (number of total predictions).
Accuracy can be misleading if there is an imbalance in the number of examples of
each class. For instance, if 95% of emails received are not spam, a classifier assigning
all emails to the “not spam” category could achieve 95% accuracy. Accuracy applies
when there is a well-defined sense of right and wrong. Regression models focus on
minimizing the error in their predictions.
Since each prediction must be in one of these categories, the number of total predic-
tions will be the sum of the number of predictions in each category. The number of
correct predictions will be the sum of true positives and true negatives. Therefore,
Accuracy = (TP + TN) / (TP + TN + FP + FN).
False positives vs. false negatives. The impact of false positives and false neg-
atives can vary greatly depending on the setting. Which metric to choose depends
on the specific context and the error types one most wants to avoid. In cancer de-
tection, while a false positive (incorrectly identifying cancer in a cancer-free patient)
may cause emotional distress, unnecessary further testing, and potentially invasive
procedures for the patient, a false negative can be much more dangerous: it may
delay diagnosis and treatment that allows cancer to progress, reducing the patient’s
chances of survival. By contrast, an autonomous vehicle with a water sensor that
senses roads are wet when they are dry (predicting false positives) might slow down
and drive more cautiously, causing delays and inconvenience, but one that senses the
road is dry when it is wet (false negatives) might end up in serious road accidents
and cause fatalities.
While accuracy assigns equal cost to false positives and false negatives, other metrics
isolate one or weigh the two differently and might be more appropriate in some
settings. Precision and recall are two standard metrics that measure the extent of
the error attributable to false positives and false negatives, respectively.
Precision measures the correctness of a model’s positive predictions. This metric represents the fraction of positive predictions that are actually correct. It is calculated as TP / (TP + FP), dividing true positives (hits) by the sum of true positives and false positives. High precision implies that when a model predicts a positive class, it is usually correct—but it might incorrectly classify many positives as negatives as well. Precision is like the model’s aim: when the system says it hit, how often is it right?
Recall measures a model’s breadth. On the other hand, recall measures how good a model is at finding all of the positive examples available. It is like the model’s net: how many real positives does it catch? It is calculated as TP / (TP + FN), signifying the fraction of real positives that the model successfully detected. High recall means a model is good at recognizing or “recalling” positive instances, but not necessarily that these predictions are accurate. Therefore, a model with high recall may incorrectly classify many negatives as positives.
In simple terms, precision is about a model being right when it makes a guess, and
recall is about the model finding as many of the right answers as possible. Together,
these two metrics provide a way to quantify how accurately and effectively a model
Figure 2.6. Precision measures the correctness of positive predictions and penalizes false
positives, while recall measures how many positives are detected and penalizes false nega-
tives [140].
can detect positive examples. Moreover, there is typically a trade-off between precision and recall: for a given model, tuning its decision threshold to increase precision will generally decrease recall and vice versa.
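To make these metrics concrete, here is a small Python sketch that computes accuracy, precision, and recall from hypothetical confusion-matrix counts (the counts are illustrative, not drawn from any real classifier):

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 40, 10, 20, 130

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                   # when the model says "positive", how often is it right?
recall = TP / (TP + FN)                      # how many of the real positives does it catch?

print(f"Accuracy:  {accuracy:.2f}")   # 0.85
print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.67
```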
Figure 2.7. The area under the ROC curve (AUROC) increases as it moves in the upper left
direction, with more true positives and fewer false positives [141].
Mean squared error (MSE) evaluates regression models by squaring the difference between predicted and actual values, so that errors in either direction are penalized the same and larger mistakes are penalized heavily. The MSE is the most popular loss function for regression problems.
Reasonably vs. reliably solved. The distinction between reasonable and reliable
solutions can be instrumental in developing an ML model, evaluating its perfor-
mance, and thinking about tradeoffs between goals. A task is reasonably solved if a
model performs well enough to be helpful in practice, but it may still have consistent
limitations or make errors. A task is reliably solved if a model achieves sufficiently
high accuracy and consistency for safety-critical applications. While models that rea-
sonably solve problems may be sufficient in some settings, they may cause harm in
others. Chatbots currently give reasonable but not fully reliable results, which can be frustrating but is essentially
harmless. However, if autonomous vehicles show reasonable but not reliable results,
people’s lives are at stake.
Goals and tradeoffs. Above and beyond quantitative performance measures are
multiple goals and values that influence how we can assess the quality of an ML
model. These goals—and the tradeoffs that often arise between them—shape how
models are judged and developed. One such goal is predictive power, which measures
the amount of error in predictions. Inference time (or latency) measures how quickly
an ML model can produce results from input data—in many applications, such as self-
driving cars, prediction speed is crucial. Transparency refers to the interpretability
of an ML model’s inner workings and how well humans can understand its decision-
making process. Reliability assesses the consistency of a model’s performance over
time and in varying conditions. Scalability is the capacity of a model to maintain or
improve its performance as a key variable—compute, parameter count, dataset size,
and so on—scales.
Sometimes, these goals are in opposition, and improvements in one area can come
at the cost of declines in others. Therefore, developing an ML model requires careful
consideration of multiple competing goals.
Supervised Learning
Supervised learning is learning from labeled data. Supervised learning is a
type of ML that uses a labeled dataset to learn the relationship between input data
and output labels. These labels are almost always human-generated: people will go
through examples in a dataset and give each one a label. They might be shown pic-
tures of dogs and asked to label the breed. The training process involves iteratively
adjusting the model’s parameters to minimize the difference between predicted out-
puts and the true output labels in the training data. Once trained, the model can
predict new, unlabeled data.
Figure 2.8. The three main types of learning paradigms in machine learning are supervised
learning, unsupervised learning, and reinforcement learning.
Unsupervised Learning
Unsupervised learning is learning from unlabeled data. Unsupervised
learning involves training a model on a dataset without specific output labels. In-
stead of matching its inputs to the correct labels, the model must identify patterns
within the data to help it understand the underlying relationships between the vari-
ables. As no labels are provided, a model is left to its own devices to discover valuable
patterns in the data. In some cases, a model leverages these patterns to generate supervisory signals that guide its own training; this form of unsupervised learning is often called self-supervised learning.
For example, language models learn to predict the next word in a sentence, which enables the models to understand
context and language structure without explicit labels like word definitions and gram-
mar instructions. After a model trains on this task, it can apply what it learned to
downstream tasks like answering questions or summarizing texts.
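As a minimal sketch of how raw text can supply its own supervisory signal, the short Python snippet below (using a made-up sentence) turns a word sequence into (context, next-word) training pairs of the kind a language model learns from:

```python
# Self-supervision from raw text: the "label" for each context is simply the next word.
text = "machine learning systems learn patterns from data"
words = text.split()

training_pairs = []
for i in range(1, len(words)):
    context = words[:i]        # everything seen so far
    next_word = words[i]       # the target the model must predict
    training_pairs.append((context, next_word))

for context, next_word in training_pairs:
    print(f"{' '.join(context):45s} -> {next_word}")
```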
Figure 2.9. In image inpainting, models are trained to predict hidden parts of images, causing
them to learn relationships between pixels [142].
We can reframe many tasks into different types of ML. Anomaly detec-
tion is typically framed as an unsupervised task that identifies unusual data points
without labels. However, it can be refashioned as a supervised classification problem,
such as labeling financial transactions as “fraudulent” or “not fraudulent.” Similarly,
while stock price prediction is usually approached as a supervised regression task,
it could be reframed as a classification task in which a model predicts whether a
stock price will increase or decrease. The choice in framing depends on the task’s
specific requirements, the data available, and which frame gives a more useful model.
Ultimately, this flexibility allows us to better cater to our goals and problems.
Reinforcement Learning
Reinforcement learning (RL) is learning from agent-gathered data.
Reinforcement learning focuses on training artificial agents to make decisions and take actions in an environment so as to maximize a reward signal.
Deep Learning
DL is a set of techniques that can be used in many learning settings.
DL uses neural networks with many layers to create models that can learn from large
datasets. Neural networks are the building blocks of DL models and use layers of
interconnected nodes to transform inputs into outputs. The structure and function
of biological neurons loosely inspired their design. DL is not a new distinct learning
type but rather a computational approach that can accomplish any of the three types
of learning discussed above. It is particularly powerful because it can learn useful features directly from raw data rather than relying on hand-engineered ones; for instance, a deep neural network trained for object recognition in images can learn to identify patterns in the raw pixel data.
Conclusion
AI is one of the most impactful and rapidly developing fields of computer science. AI
involves developing computer systems that can perform tasks that typically require
human intelligence, from visual perception to decision-making.
ML is an approach to AI that involves developing models that can learn from data
to perform tasks without being explicitly programmed. A robust approach to under-
standing any ML model is breaking it down into its fundamental components: the
task, the input data, the output, and the type of ML it uses. Different approaches to
ML offer various ways to tackle complex tasks and solve real-world problems. DL is
a powerful and popular method that uses many-layered neural networks to identify
intricate patterns in large datasets. The following section will delve deeper into DL
and its applications in AI.
Introduction
In this section, we present the fundamentals of DL, a branch of ML that uses neural
networks to learn from data and perform complex tasks [143]. First, we will consider
the essential building blocks of DL models and explore how they learn. Then, we
will discuss the history of critical architectures and see how the field developed over
time. Finally, we will explore how DL is reshaping our world by reviewing a few of
its groundbreaking applications.
Why DL Matters
DL is a remarkably useful, powerful, and scalable technology that has been the pri-
mary source of progress in ML since the early 2010s. DL methods have dramatically
advanced the state-of-the-art in computer vision, speech recognition, natural language
processing, drug discovery, and many other areas.
Defining DL
DL is a type of ML that uses neural networks with many layers to learn and ex-
tract useful patterns from large datasets. It is characterized by its ability to learn
hierarchical representations of the world.
This is analogous to how visual information is processed in the human brain. Edge
detection is done in early visual areas like the primary visual cortex, more complex
shape detection in temporal regions, and a complete visual scene is assembled in
the brain’s frontal regions. Hierarchical representations enable deep neural networks
to learn abstract concepts and develop sophisticated models of the world. They are
essential to DL and why it is so powerful.
What DL Models Do
Summary
In this section, we will explore some of the foundational building blocks of DL models.
We will begin by defining what a neural network is and then discuss the fundamental
elements of neural networks through the example of multi-layer perceptrons (MLPs),
one of the most basic and common types of DL architecture. Then, we will cover
a few more technical concepts, including activation functions, residual connections,
convolution, and self-attention. Finally, we will see how these concepts come together
in the Transformer, another type of DL architecture.
Figure 2.11. Artificial neurons have some structural similarities to biological neurons [144,
145].
Building blocks. Neural networks are made of simple building blocks that can pro-
duce complex abilities when combined at scale. Despite their simplicity, the resulting
network can display remarkable behaviors when thousands—or even millions—of arti-
ficial neurons are joined together. Neural networks consist of densely connected layers
of neurons, each contributing a tiny fraction to the overall processing power of the
network. Within this basic blueprint, there is much room for variation; for instance,
neurons can be connected in many ways and employ various activation functions.
These network structure and design differences shape what and how a model can
learn.
Multi-Layer Perceptrons
The input layer serves as the entry point for data into a network. The
input layer consists of nodes that encode information from input data to pass on to the
next layer. Unlike in other layers, the nodes do not perform any computation. Instead,
each node in the input layer captures some small raw input data and directly relays
this information to the nodes in the subsequent layer. As with other ML systems,
input data for neural networks comes in many forms. For illustration, we will focus
on just one: image data. Specifically, we will draw from the classic example of digit
recognition with MNIST.
The MNIST (Modified National Institute of Standards and Technology) database
is a large collection of images of handwritten digits, each with dimensions 28 × 28.
Consider a neural network trained to classify these images. The input layer of this
network consists of 784 nodes, each corresponding to the grayscale value of a pixel in
a given image.
The output layer is the final layer of a neural network. The output layer
contains neurons representing the results of the computations performed within the
network. Like inputs, neural network outputs come in many forms, such as predic-
tions or classifications. In the case of MNIST (a classification task), the output is
categorical, predicting the digit represented by a particular image.
For classification tasks, the number of neurons in the output layer is equal to the
number of possible classes. In the MNIST example, the output layer will have ten
neurons, one for each of the ten classes (digits 0–9). The value of each neuron rep-
resents the predicted probability that an example belongs to that class. The output
value of the network is the class of the output neuron with the highest value.
Hidden layers are the intermediate layers between the input and output
layers. Each hidden layer is a collection of neurons that receive outputs from the
Figure 2.12. Each pixel’s value is transferred to a neuron in the first layer [146].
previous layer, perform a computation, and pass the results to the next layer. These
are “hidden” because they are internal to the network and not directly observable
from its inputs or outputs. These layers are where representations of features are
learned.
Weights represent the strength of the connection between two neurons.
Every connection is associated with a weight that determines how much the input
signal from a given neuron will influence the output of the next neuron. This value
represents the importance or contribution of the first neuron to the second. The
larger the magnitude, the greater the influence. Neural networks learn by modifying
the values of their weights, which we will explore shortly.
Biases are additional learned parameters used to adjust neuron outputs.
Every neuron has a bias that helps control its output. This bias acts as a constant
term that shifts the activation function along the input axis, allowing the neuron to
learn more complex, flexible decision boundaries. Similar to the constant b of a linear
equation y = mx + b, the bias allows shifting the output of each layer. In doing so,
biases increase the range of the representations a neural network can learn.
Activation functions control the output or “activation” of neurons. Ac-
tivation functions are nonlinear functions applied to each neuron’s weighted input
sum within a neural network layer. They are mathematical equations that control
the output signal of the neurons, effectively determining the degree to which each
neuron “fires.”
Each neuron in a network takes some inputs, multiplies them by weights, adds a bias,
and applies an activation function. The activation function transforms this weighted
input sum into an output signal. For many activation functions, the more input a
neuron receives, the more it activates, translating to a larger output signal.
Figure 2.13. A classic multi-layer artificial neural network (ANN) has an input layer, several
hidden layers, and an output layer [147].
The output of each layer becomes the input to the next layer. The network as a whole is the composition of
all of its layers.
A toy example. Consider an MLP with two hidden layers, activation function g,
and an input x. This network could be expressed as W3 g(W2 g(W1 x)):
1. In the input layer, the input vector x is passed on.
2. In the first hidden layer,
(a) the input vector x is multiplied by the weight matrix W1 , yielding W1 x,
(b) then the activation function g is applied, yielding g(W1 x),
(c) which is passed on to the next layer.
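The sketch below shows this forward pass in NumPy, with small randomly chosen weight matrices, ReLU standing in for the generic activation g, and biases omitted to match the toy expression above; the layer sizes are arbitrary:

```python
import numpy as np

def g(z):
    """Activation function; ReLU is used here as a stand-in."""
    return np.maximum(0, z)

rng = np.random.default_rng(0)

# Toy MLP: 4 inputs -> 5 hidden -> 5 hidden -> 3 outputs (sizes chosen arbitrarily)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(3, 5))

x = rng.normal(size=4)   # input vector

h1 = g(W1 @ x)           # first hidden layer: g(W1 x)
h2 = g(W2 @ h1)          # second hidden layer: g(W2 g(W1 x))
output = W3 @ h2         # output layer: W3 g(W2 g(W1 x))

print(output)
```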
Different layers of a network may use other activation functions. The selection and placement of activation functions
can significantly change the network’s capability and performance. In most cases, the
same activation will be applied in all the hidden layers within a network.
While many possible activation functions exist, only a handful are commonly used
in practice. Here, we highlight four that are of particular practical or historical sig-
nificance. Although there are many other functions and variations of each, these
four—ReLU, GELU [148], sigmoid, and softmax—have been highly influential in de-
veloping and applying DL. The Transformer architecture, which we will describe
later, uses GELU and softmax functions. Historically, many architectures used Re-
LUs and sigmoids. Together, these functions illustrate the essential characteristics of
the properties and uses of activation functions in neural networks.
Rectified linear unit (ReLU). The rectified linear unit (ReLU) function is a
piecewise linear function that returns the input value for positive inputs and zero
for negative inputs [149]. It is the identity function (f (x) = x) for positive inputs
and zero otherwise. This means that if a neuron’s weighted input sum is positive,
it will be passed directly to the following layer without any modification. However,
no signal will be passed on if the sum is negative. Due to its piecewise nature, the
graph of the ReLU function takes the form of a distinctive “kinked” line. Due to its
computational efficiency, the ReLU function was widely used and played a critical
role in developing more sophisticated DL architectures.
Figure 2.14. The ReLU activation function, ReLU(x) = max{0, x}, passes on positive inputs
to the next layer.
Gaussian error linear unit (GELU). The GELU (Gaussian error linear unit) function is a refinement of the ReLU function that smooths out its non-differentiable kink at zero. This is important for optimization. It is “Gaussian”
because it leverages the Gaussian cumulative distribution function (CDF), Φ(x). The
GELU has been widely used in and contributed to the success of many current models,
including Transformer-based language models.
Figure 2.15. The GELU activation function, GELU(x) = x · Φ(x), smooths out the ReLU
function around zero, passing on small negative inputs as well.
Softmax. The softmax function converts a vector of real numbers into a probability distribution across many dimensions. This is useful in settings with multiple classes
or categories, such as natural language processing, where each word in a sentence can
belong to one of numerous classes.
The softmax function can be considered a generalization of the sigmoid function.
While the sigmoid function maps a single input value to a number between 0 and 1,
interpreted as a binary probability of class membership, softmax normalizes a set of
real values into a probability distribution over multiple classes. Though it is typically
applied to the output layer of neural networks for multi-class classification tasks—an
example of when different activation functions are used within one network—softmax
may also be used in intermediate layers to readjust weights at bottleneck locations
within a network.
We can revisit the example of handwritten digit recognition. In this classification task,
softmax is applied in the last layer of the network as the final activation function. It
takes in a 10-dimensional vector of the raw outputs from the network and rescales
the values to generate a probability distribution over the ten predicted classes. Each
class represents a digit from 0 to 9, and each output value represents the probability
that an input image is an instance of a given class. The digit corresponding to the
highest probability will be selected as the network’s prediction.
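For reference, here is a minimal NumPy sketch of these four functions applied to a small made-up vector of raw network outputs; the GELU is computed from the Gaussian CDF as described above:

```python
import numpy as np
from scipy.stats import norm

def relu(x):
    return np.maximum(0, x)

def gelu(x):
    # GELU(x) = x * Phi(x), where Phi is the Gaussian CDF
    return x * norm.cdf(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max for numerical stability; the output sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([-1.5, 0.0, 2.0, 0.5])   # made-up raw outputs
print("ReLU:   ", relu(logits))
print("GELU:   ", gelu(logits))
print("Sigmoid:", sigmoid(logits))
print("Softmax:", softmax(logits), "sum =", softmax(logits).sum())
```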
Now, having explored ReLU, GELU, sigmoid, and softmax, we will set aside activa-
tion functions and turn our attention to other building blocks of DL models.
Residual Connections
Figure 2.17. Adding residual connections can let information bypass blocks fl and fl+1 ,
letting lower-level features from early layers contribute more directly to higher-level features
in later ones.
Residual connections add a block’s input directly to its output, so that the level of representation captured by each layer remains consistent across successive blocks. This allows feature
maps learned in earlier layers to be reused and networks to learn representations
(such as identity mappings) in deeper layers that may otherwise not be possible due
to optimization difficulties.
Residual connections are general purpose, used in many different problem settings
and architectures. By facilitating the learning process and expanding the kinds of
representations networks can learn, they are a valuable building block that can be a
helpful addition to a wide variety of networks.
Figure 2.18. Convolution layers perform the convolution operation, sliding a filter over the
input data to output a feature map [151].
Convolution
Self-Attention
Transformers
Figure 2.19. Different attention heads can capture different relationships between words in
the same sentence [152].
A Transformer block primarily combines self-attention and MLPs (as we saw earlier)
with optimization techniques such as residual connections and layer normalization.
Large language models (LLMs). LLMs are a class of language models with
many parameters (often in the billions) trained on vast quantities of data. These
models excel in various language tasks, including question-answering, text generation,
coding, translation, and sentiment analysis. Most LLMs, such as the Generative Pre-
trained Transformer (GPT) series, utilize Transformers because they can effectively
model long-range dependencies.
Summary
DL models are networks composed of many layers of interconnected nodes. The struc-
ture of this network plays a vital role in shaping how a model functions. Creating a
successful model requires carefully assembling numerous components. Different com-
ponents are used in different settings, and each building block serves a unique purpose,
contributing to a model’s overall performance and capabilities.
This section discussed multi-layer perceptrons (MLPs), activation functions, residual
connections, convolution, and self-attention, culminating with an introduction to the
Transformer architecture. We saw how MLPs, an archetypal DL model, paved the way
for other architectures and remain an essential component of many more sophisticated
models. Many building blocks each play a distinct role in the structure and function
of a model.
Activation functions like ReLU, softmax, and GELU introduce nonlinearity in net-
works, enabling models to learn complex patterns. Residual connections facilitate the
flow of information in a network, thereby enabling the training of deeper networks.
Convolution uses sliding filters to allow models to detect local features in input data,
an especially useful capability in vision-related tasks. Self-attention enables mod-
els to weigh the relevance of different inputs based on their context. By leveraging
these mechanisms to handle complex dependencies in sequential data, Transformers
revolutionized the field of natural language processing (NLP).
Mechanics of Learning
Cross entropy loss. Cross entropy is a concept from information theory that measures
the difference between two probability distributions. In DL, cross entropy loss is
often used in classification problems, where it compares the probability distribution
predicted by a model and the target distribution we want the model to predict.
Consider a binary classification problem where a model is tasked with classifying
images as either apples or oranges. When given an image of an apple, a perfect model
would predict “apple” with 100% probability. In other words, with classes [apple,
orange], the target distribution would be [1, 0]. The cross entropy would be low if the
model predicts “apple” with 90% probability (outputting a predicted distribution of
[0.9, 0.1]). However, if the model predicts “orange” with 99% probability, it would
have a much higher loss. The model learns to generate predictions closer to the true
class labels by minimizing the cross entropy loss during training.
Cross entropy quantifies the difference between predicted and true probabilities. If the
predicted distribution is close to the true distribution, the cross entropy will be low,
indicating better model performance. High cross entropy, on the other hand, signals
poor performance. When used as a loss function, the more incorrect the model’s
predictions are, the larger the error and, in turn, the larger the training update.
Mean squared error (MSE). Mean squared error is one of the most popular loss func-
tions for regression problems. It is calculated as the average of the squared differences
between target and predicted values.
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², where the sum runs over the n data points, yᵢ is the target value, and ŷᵢ is the predicted value.
MSE gives a good measure of how far away an output is from its target in a way
that is not affected by the direction of errors. Like cross entropy, MSE provides a
larger error signal the further the output is from its target, helping the training process
converge more quickly. One weakness of MSE is that it is highly sensitive to outliers,
as squaring amplifies large differences, although there are variants and alternatives
such as mean absolute error (MAE) and Huber loss which are more robust to outliers.
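As a small illustration, the Python sketch below computes both losses on made-up examples: the apple/orange predictions from the cross entropy discussion and a few hypothetical house prices for MSE:

```python
import numpy as np

def cross_entropy(predicted_probs, target_probs):
    # H(p, q) = -sum_i p_i * log(q_i); a small epsilon avoids log(0)
    eps = 1e-12
    return -np.sum(target_probs * np.log(predicted_probs + eps))

def mse(targets, predictions):
    return np.mean((targets - predictions) ** 2)

# Classification example from the text: classes are [apple, orange]
target = np.array([1.0, 0.0])
good_prediction = np.array([0.9, 0.1])
bad_prediction = np.array([0.01, 0.99])
print("Cross entropy (90% apple): ", cross_entropy(good_prediction, target))  # low
print("Cross entropy (99% orange):", cross_entropy(bad_prediction, target))   # much higher

# Regression example with made-up house prices (in thousands)
y_true = np.array([300.0, 450.0, 520.0])
y_pred = np.array([310.0, 430.0, 500.0])
print("MSE:", mse(y_true, y_pred))
```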
Gradient descent minimizes the loss. A helpful analogy is a hiker descending a hill in thick fog: by repeatedly feeling for the direction of steepest descent, stepping that way, and checking where they are ending up, they can continue this process, always taking steps toward the steepest descent until they have reached the lowest point.
In ML, the hill is the loss function, and the steps are updates to the model’s param-
eters. The direction of steepest descent is calculated using the gradients (derivatives)
of the loss function with respect to the model’s parameters.
Figure 2.21. Gradient descent can find different local minima given two different weight
initializations [153].
The size of the steps is determined by the learning rate, a parameter of the model
configuration (known as a hyperparameter) used to control how much a model’s
weights are changed with each update. If the learning rate is too large, each update may overshoot, destroying previously learned information faster than new information is gained. However,
the optimization process may be very slow if the learning rate is too small. Therefore,
proper learning rate selection is often key to effective training.
Though powerful, gradient descent in its simplest form can be quite slow. Several
variants, including Adam (Adaptive Moment Estimation), were developed to address
these weaknesses and are more commonly used in practice.
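Here is a minimal sketch of gradient descent in plain Python, minimizing the toy one-parameter loss f(w) = (w − 3)^2; the starting point and learning rate are arbitrary choices that show how the step size controls each update:

```python
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2 * (w - 3)

w = 10.0              # initial parameter value (arbitrary starting point)
learning_rate = 0.1   # hyperparameter controlling the step size

for step in range(25):
    w = w - learning_rate * gradient(w)   # step in the direction of steepest descent

print(f"Final w: {w:.3f}, loss: {loss(w):.6f}")   # w approaches 3, the minimum
```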
Backpropagation facilitates parameter updates. Backpropagation is a
widely used method to compute the gradients in a neural network [154]. This process
is essential for updating the model’s parameters and makes gradient descent possible.
Backpropagation is a way to send the error signal from the output layer of the neural
network back to the input layer. It allows the model to understand how much each
parameter contributes to the overall error and adjust them accordingly to minimize
the loss.
Steps to training a DL model. Putting all of these components together, train-
ing is a multi-step process that typically involves the following:
1. Initialization: A model’s parameters (weights and biases) are set to some initial
values, often small random numbers. These values define the starting point for the
model’s training and can significantly influence its success.
2. Forward Propagation: Input data is passed through the model, layer by layer.
The neurons in each layer perform their specific operations via weights, biases,
and activation functions. Once the final layer is reached, an output is produced.
This procedure can be carried out on individual examples or on batches of multiple
data points.
3. Loss Calculation: The model’s output is compared to the target output using
a loss function that quantifies the difference between predicted and actual values.
The loss represents the model’s error—how far its output was from what it should
have been.
4. Backpropagation: The error is propagated back through the model, starting
from the output layer and going backward to the input layer. This process cal-
culates gradients that determine how much each parameter contributed to the
overall loss.
5. Parameter Update: The model’s weights and biases are adjusted using an op-
timization algorithm based on the gradients. This is typically done using gradient
descent or one of its variants.
6. Iteration: Steps 2–5 are repeated many times, often reaching millions or billions
of iterations. With each pass, the loss should decrease as the model’s predictions
improve.
7. Stopping Criterion: Training continues until the model reaches a stopping point,
which can be defined in many ways. We may stop training when the loss stops
decreasing or when the model has gone through the entire training dataset a
specific number of times.
While this sketch provides a high-level overview of the training process, many factors
can shape its course. For example, the network architecture and choice of loss func-
tion, optimizer, batch size, learning rate, and other hyperparameters influence how
training proceeds. Moreover, different methods and approaches to learning determine
how training is carried out. We will explore some of these techniques in the next
section.
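Before moving on, the sketch below walks through steps 1 to 6 on a toy problem: a one-parameter linear model trained with gradient descent on mean squared error, using NumPy and made-up data (for a model this simple, step 4 reduces to a single gradient computation rather than full backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: y is roughly 2.5 * x plus noise
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + rng.normal(0, 1, size=100)

w = rng.normal()          # Step 1: initialization
learning_rate = 0.01

for step in range(200):   # Step 6: iterate
    predictions = w * x                    # Step 2: forward propagation
    errors = predictions - y
    loss = np.mean(errors ** 2)            # Step 3: loss calculation (MSE)
    grad_w = np.mean(2 * errors * x)       # Step 4: gradient of the loss w.r.t. w
    w = w - learning_rate * grad_w         # Step 5: parameter update

print(f"Learned w: {w:.2f} (true value used to generate the data: 2.5)")
```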
Some techniques transfer what a model has learned on one task or dataset to another, while others help models learn from very little data: pre-training and fine-tuning belong to the former, and few-shot learning belongs to the latter.
Few-shot learning enables a model to learn quickly and effectively from a small dataset. For example, few-shot learning can be used to train an image classifier to recognize new categories of animals after seeing only a few images of each animal.
LLMs, few-shot, and zero-shot learning. Some large language models (LLMs)
have demonstrated a capacity to perform few- and zero-shot learning tasks without
explicit training. As model and training datasets increased in size, these models de-
veloped the ability to solve a variety of tasks when provided with a few examples
(few-shot) or only instructions describing the task (zero-shot) during inference; for
instance, an LLM can be asked to classify a paragraph as having positive or negative
sentiments without specific training. These capabilities arose organically as the mod-
els increased in size and complexity, and their unexpected emergence raises questions
about what enables LLMs to perform these tasks, especially when they are only ex-
plicitly trained to predict the next token in a sequence. Moreover, as these models
continue to evolve, this prompts speculation about what other capabilities may arise
with greater scale.
Summary. There are many training techniques used in DL. Pre-training and fine-
tuning are the foundation of many successful models, allowing them to learn valuable
representations from one task or dataset and apply them to another. Few-shot and
zero-shot learning enable models to solve tasks based on scarce or no example data.
Notably, the emergence of few- and zero-shot learning capabilities in large language
models illuminates the potential for these models to adapt and generalize beyond their
explicit training. Ongoing advancements in training techniques continue to drive the
growth of AI capabilities, highlighting both exciting opportunities and important
questions about the future of the field.
Having built our technical understanding of DL models and how they work, we will
see how these concepts come together in some of the groundbreaking architectures
that have shaped the field. We will take a chronological tour of key DL models, from
the pioneering LeNet in 1989 to the revolutionary Transformer-based BERT and GPT
in 2018. These architectures, varying in design and purpose, have paved the way for
developing increasingly sophisticated and capable models. While the history of DL
extends far beyond these examples, this snapshot sheds light on a handful of critical
moments as neural networks evolved from a marginal theory in the mid-1900s to the
vanguard of AI development by the early 2010s.
1989: LeNet
LeNet paves the way for future DL models [155]. LeNet is a convolutional
neural network (CNN) proposed by Yann LeCun and his colleagues at Bell Labs
in 1989. This prototype was the first practical application of backpropagation, and
after multiple iterations of refinement, LeCun et al. presented the flagship model,
LeNet-5, in 1998. This model demonstrated the utility of neural networks in everyday
applications and inspired many DL architectures in the years to follow. However, due
to computational constraints, CNNs did not rise in popularity for over a decade after
LeNet-5 was released.
2012: AlexNet
AlexNet achieves unprecedented performance in image recognition [132].
As we saw in section 2.2, the ImageNet Challenge was a large-scale image recognition competition.
2017: Transformers
Transformers introduce self-attention. The Transformer architecture was in-
troduced by Vaswani et al. in their revolutionary paper “Attention is All You Need.”
Like RNNs and LSTMs, Transformers are a type of neural network that can process
sequential data. However, the approach used in the Transformer was markedly differ-
ent from those of its predecessors. The Transformer uses self-attention mechanisms
that allow the model to focus on relevant parts of the input and the output.
The GPT models use scale and unidirectional processing. The GPT mod-
els are a series of Transformer-based language models launched by OpenAI. The size
of these models and scale at which they were trained led to a remarkable improve-
ment in fluency and accuracy in various language tasks, significantly advancing the
state-of-the-art in natural language processing. One of the key reasons GPT models
are more popular than BERT models is that they are better at generating text. While
BERT learns rich representations by being trained to fill in blanks in the
middle of sentences, GPT models are trained to predict what comes next, enabling
them to generate long-form sequences (e.g. sentences, paragraphs, and essays) much
more naturally.
Many important developments have been left out in this brief timeline. Perhaps
more importantly, future developments might revolutionize model architectures in
new ways, potentially bringing to light older innovations that have currently fallen by the wayside. Next, we will explore some common applications of DL models.
2.3.4 Applications
DL has seen a dramatic rise in popularity since the early 2010s, increasingly becoming
a part of our daily lives. Its applications are broad, powering countless services and
technologies across many industries, some of which are highlighted below.
Conclusion
DL has come a long way since its early days, with advancements in architectures, tech-
niques, and applications driving significant progress in AI. DL models have been used
to solve complex problems and provide valuable insights in many different domains.
As data and computing power become more available and algorithmic techniques
continue to improve in the years to come, we can expect DL to become even more
prevalent and impactful.
In the next section, we will discuss scaling laws: a set of principles which can quanti-
tatively predict the effects of more data, larger models, and more computing power
on the performance of DL models. These laws shape how DL models are constructed.
Introduction
Scaling laws are a type of power law. Power laws are mathematical equations that
model how a particular quantity varies as the power of another. In power laws, the
variation in one quantity is proportional to a power (exponent) of the variation in
another. The power law y = bx^a states that y is directly proportional to x raised to a certain power a. If a is 2, then when x is doubled, y will quadruple. One real-world example is the relation between the area of a circle and its radius. As the radius changes, the area changes as the square of the radius: y = πr^2. This is a power-law equation where b = π and a = 2. The volume of a sphere has a power-law relationship with the sphere’s radius as well: y = (4/3)πr^3 (so b = (4/3)π and
a = 3). Scaling laws are a particular kind of power law that describe how DL models
scale. These laws relate a model’s loss with model properties (such as the number of
model parameters or the dataset size used to train the model).
Log-log plots can be used to visualize power laws. Log-log plots can help
make these mathematical relationships easier to understand and identify. Consider the
power law y = bx^a again. Taking the logarithm of both sides, the power law becomes
log(y) = a log(x) + log(b). This is a linear equation (in the logarithmic space) where
a is the slope and log(b) is the y-intercept. Therefore, a power-law relationship will
appear as a straight line on a log-log plot (such as Figure 2.22), with the slope of the line
corresponding to the exponent in the power law.
Figure 2.22. An object in free fall in a vacuum falls a distance proportional to the square of the time. On a log-log plot, this power law looks like a straight line.
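As a quick numerical check of this linearization, the snippet below uses the free-fall relationship from the figure (distance = (1/2) g t^2, with g ≈ 9.8 m/s^2) and confirms that the slope in log space equals the exponent, 2:

```python
import numpy as np

t = np.array([1.0, 2.0, 4.0, 8.0])   # time in seconds
d = 0.5 * 9.8 * t ** 2               # free-fall distance: d = (1/2) g t^2

log_t, log_d = np.log10(t), np.log10(d)
slopes = np.diff(log_d) / np.diff(log_t)
print(slopes)   # each slope equals 2.0, the exponent of the power law
```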
Power laws are remarkably ubiquitous. Power laws are a robust mathematical
framework that can describe, predict, and explain a vast range of phenomena in both
nature and society. Power laws are pervasive in urban planning: log-log plots relating
variables like city population to metrics such as the percentage of cities with at
least that population often result in a straight line (see Fig 2.23). Similarly, animals’ metabolic rates follow an approximate power-law relationship with their body mass.
Figure 2.23. Power laws are used in many domains, such as city planning.
Power laws in the context of DL are called (neural) scaling laws. Scaling
laws [158, 159] predict loss given model size and dataset size in a power law relation-
ship. Model size is usually measured in parameters, while dataset size is measured
in tokens. As both variables increase, the model’s loss tends to decrease. This de-
crease in loss with scale often follows a power law: the loss drops substantially, but
not linearly, with increases in data and model size. For instance, if we doubled the
number of parameters, the loss does not just halve: it might decrease to one-fourth or
one-eighth, depending on the exponent in the scaling law. This power-law behavior in
AI systems allows researchers to anticipate and strategize on how to improve models
by investing more in increasing the data or the parameters.
Scaling laws in DL predict loss based on model size and dataset size.
In DL, we have observed power-law relationships between the model’s performance
and other variables that have held consistently over eight orders of magnitude as the
amount of compute used to train models has scaled. These scaling laws can forecast
the performance of a model given different values for its parameters, dataset, and
amount of computational resources. For instance, we can estimate a model’s loss if
we were to double its parameter count or halve the training dataset size. Scaling laws
show that it is possible to accurately predict the loss of an ML system using just two
primary variables:
1. N : The size of the model, measured in the number of parameters. Parameters
are the weights in a model that are adjusted during training. The number of
parameters in a model is a rough measure of its capacity, or how much it can learn
from a dataset.
2. D: The size of the dataset the model is trained on, measured in tokens, pixels,
or other fundamental units. The modality of these tokens depends on the model’s
task. For example, tokens are subunits of language in natural language processing
and images in computer vision. Some models are trained on datasets consisting of
tokens of multiple modalities.
The computational resources used to train a model are vital for scaling.
This factor, often referred to as compute, is most often measured by the total number of calculations performed during training. The key unit is the floating-point operation (FLOP), while hardware speed is measured in FLOP/s, the number of floating-point operations performed per second. Practically,
increasing compute means training with more processors, more powerful processors,
or for a longer time. Models are often allocated a set budget for computation: scaling
laws can determine the ideal model and dataset size given that budget.
Computing power underlies both model size and dataset size. More
computing power enables larger models with more parameters and facilitates the
collection and processing of more tokens of training data. Essentially, greater compu-
tational resources facilitate the development of more sophisticated AI models trained
on expanded datasets. Therefore, scaling is contingent on increasing computation.
The Chinchilla scaling law emphasizes data over model size [160]. One
significant research finding that shows the importance of scaling laws was the success-
ful training of the LLM “Chinchilla.” A small model with only 70 billion parameters,
Chinchilla outperformed much larger models because it was trained on far more to-
kens than pre-existing models. This led to the development of the Chinchilla scaling
law: a scaling law that accounts for parameter count and data. This law demonstrated
that larger models require much more data than was typically assumed at the time
to achieve the desired gains in performance.
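To illustrate the general form such a law can take, here is a small Python sketch of a Chinchilla-style parametric loss, L(N, D) = E + A/N^α + B/D^β; the coefficient values below are illustrative placeholders, not the fitted values reported for Chinchilla:

```python
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: an irreducible loss E plus terms that
    shrink as model size N (parameters) and dataset size D (tokens) grow.
    All coefficients here are illustrative, not fitted values."""
    return E + A / N**alpha + B / D**beta

# Hypothetical comparison of two ways to spend a similar budget
print(predicted_loss(N=70e9, D=1.4e12))   # smaller model, more training tokens
print(predicted_loss(N=280e9, D=300e9))   # larger model, fewer training tokens
```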
Figure 2.24. Chinchilla scaling laws provide an influential estimate of compute-optimal scal-
ing laws, specifying the optimal ratio of model parameters and training tokens for a given
training compute budget in FLOPs. The green lines show projections of optimal model size
and training token count based on the number of FLOPs used to train Google’s Gopher
model [161].
Scaling laws for DL hold across many modalities and orders of magni-
tude. An order of magnitude refers to a tenfold increase—if something increases
by an order of magnitude, it becomes 10 times larger. In DL, evidence suggests that
scaling laws hold across many orders of magnitude of parameter count and dataset
size. This implies that the same scaling relationships are still valid for both a small
model trained on hundreds of tokens or a massive model trained on trillions of tokens.
Scaling laws have continued to hold even as model size increases dramatically.
Figure 2.25. The scaling laws for different DL models look remarkably similar [159].
Discussion
Scaling laws are not universal for ML models. Not all models follow scaling
laws. These relationships are stronger for some types of models than others. Gener-
ative models such as large language models tend to follow regular scaling laws—as
model size and training data increase in scale, performance improves smoothly and
predictably in a relationship described by a power-law equation. But for discrimi-
native models such as image classifiers, clear scaling laws currently do not emerge.
Performance may plateau even as dataset size or model size increase.
More broadly, the success of scaling relates to what Rich Sutton has called the “bitter lesson”: it has proven easier to create machines that can learn than to have humans manually encode them with knowledge. For now, the most effective way to do this
seems to be scaling up DL models such as LLMs. This lesson is “bitter” because
it shows that simpler scaling approaches tend to beat more elegant and complex
techniques designed by human researchers—demoralizing for researchers who spent
years developing those complex approaches. Rather than human ingenuity alone, scale
and computational power are also key factors that drive progress in AI.
It is worth noting that while the general trend of improved performance through scaling has held over many orders of magnitude of computation, the equations used
to model this trend are subject to criticism and debate. The original scaling laws
identified by a team at OpenAI in 2020 were superseded by the Chinchilla scaling
laws described above, which may in turn be replaced in the future. While there do seem
to be interesting and important regularities at work, the equations that have been
developed are less well-established than in some other areas of science, such as the
laws of thermodynamics.
Conclusion
In AI, scaling laws describe how loss changes with model and dataset size.
We observed that the performance of a DL model scales according to the number
of parameters and tokens—both shaped by the amount of compute used. Evidence
from generative models like LLMs, observed over eight orders of magnitude of training
compute, indicates a smooth reduction in loss as model size and training data increase,
following a clear scaling law. Scaling laws are especially important for understanding
how changes in variables like the amount of data used can have substantial impacts
on the model’s performance.
Introduction
Given this, there has been intense debate over when AI systems on this level might be
achieved, and insight into this question could be valuable for better managing the
risks posed by increasingly capable AI systems. In this section, we discuss when we
might see general AI systems that can match average human skill across all or nearly
all cognitive tasks. This is equivalent to some ways of operationalizing the concept
of AGI.
HLAI systems are possible. The human brain is widely regarded by scientists as
a physical object that is fundamentally a complex biological machine and yet is able to
give rise to a form of general intelligence. This suggests that there is no reason another
physical object could not be built with at least the same level of cognitive functioning.
While some would argue that an intelligence based on silicon or other materials will
be unable to match one built on biological cells, we see no compelling reason to believe
that particular materials are required. Such statements seem uncomfortably similar
to the claims of vitalists, who argued that living beings are fundamentally different
from non-living entities due to containing some non-physical components or having
other special properties. Another objection is that copying a biological brain in silicon
will be a huge scientific challenge. However, there is no need for researchers looking
to create HLAI to create an exact copy or “whole brain emulation.” Airplanes are
able to fly but do not flap their wings like birds—nonetheless they function because
their creators have understood some key underlying principles. Similarly, we might
hope to create AI systems that can perform as well as humans through looser forms
of imitation rather than exact copying.
Intense incentives and investment for AGI. Vast sums of money are being
dedicated to building AGI, with leaders in the field having secured billions of dollars.
The cost of training GPT-3 has been estimated at around $5 million, while the cost
for training GPT-4 was reported to be over $100 million. As of 2024, AI developers are
spending billions of dollars on GPUs for training the next generation of AI systems.
Obstacles to HLAI
High-quality data for training might run out. The computational operations
performed in the training of ML models require data to work with. The more com-
pute used in training, the more data can be processed, and the better the model’s
capabilities will be. However, as compute being used for training continues to rise,
we may reach a point where there is not enough high-quality data to fuel the process.
But there are strong incentives for AI developers to find ways to work around this. In
the short term, they will find ways to access new sources of training data, for example
by paying owners of relevant private datasets. Beyond this, they may try a variety of
approaches to reduce the reliance on human-generated data. For example, they may
use AI systems to create synthetic or augmented data. Alternatively, AI systems may become more data-efficient, learning more from the data that is already available.
Conclusion
There is high uncertainty around when HLAI might be achieved. There are strong
economic incentives for AI developers to pursue this goal, and advances in DL have
surprised many researchers in recent years. We should not be confident in ruling out
the possibility that HLAI could also appear in coming years.
2.6 CONCLUSION
2.6.1 Summary
In this chapter, we presented the fundamentals of AI and its subfield, ML, which aims
to create systems that can learn without being explicitly instructed. We examined
its foundational principles, methodologies, and evolution, detailing key techniques,
concepts, and practical applications.
Artificial intelligence. We first discussed AI, the vast umbrella that encapsulates
the idea of machines performing tasks typically associated with human intelligence.
AI’s conceptual origins date back to the 1940s and 1950s, when the project
of creating “intelligent machines” came to the fore. The field experienced periods of
flux over the following decades, waxing and waning until the modern DL era was
ushered in by the groundbreaking release of AlexNet in 2012, driven by increased
data availability and advances in hardware.
Defining AI. The term “artificial intelligence” has many meanings, and the capa-
bilities of AI systems exist on a continuum. Five widely used conceptual categories to
distinguish between different types of AI are narrow AI, artificial general intelligence
(AGI), human-level AI (HLAI), transformative AI (TAI), and artificial superintelli-
gence (ASI). While these concepts provide a basis for thinking about the intelligence
and generality of AI systems, they are not well-defined or complete, often overlapping
and used in different, conflicting ways. Therefore, in evaluating risk, it is essential to
consider AI systems based on their specific capabilities instead of broad categoriza-
tions.
Deep learning. We then examined DL in more depth. We saw how, beyond its use
of multi-layer neural networks, DL is characterized by its ability to learn hierarchical
representations that provide DL models with great flexibility and power. ML models are functions that capture relationships between inputs and outputs; the hierarchical representations learned by DL models allow them to express an especially broad family of such relationships.
2.7 LITERATURE
The following resources contain further information about the topics discussed in this
chapter:
Single-Agent Safety
3.1 INTRODUCTION
To understand the risks associated with artificial intelligence (AI), we begin by ex-
amining the challenge of making single agents safe. In this chapter, we review core
components of this challenge, including monitoring, robustness, alignment, and systemic safety.
Robustness. Next, we turn to the problem of building models that are robust to
adversarial attacks. AI systems based on DL are typically vulnerable to attacks such
as adversarial examples, deliberately crafted inputs that have been slightly modified
to deceive the model into producing predictions or other outputs that are incorrect.
Achieving adversarial robustness involves designing models that can withstand such
manipulations. Without this, malicious actors can use attacks to circumvent safe-
guards and use AI systems for harmful purposes. Robustness is related to the more
general problem of proxy gaming. In many cases, it is not possible to perfectly specify
our idealized goals for an AI system. Inadequately specified goals can lead to systems
diverging from our idealized goals and can introduce vulnerabilities that adversaries can exploit.
chapter). We start by exploring the issue of deception, categorizing the varied forms it can take (some of which have already been observed in existing AI systems) and analyzing the risks involved in AI systems deceiving human and AI evaluators. We
also explore the possible conditions that could give rise to power-seeking agents and
the ways in which this could lead to particularly harmful risks. We discuss some
techniques that have potential to help with making AI systems more controllable
and reducing the inherent hazards they may pose, including representation control
and unlearning specific capabilities.
Systemic safety. Beyond making individual AIs safer, we discuss how AI
research can contribute to “systemic safety.” AI research can help to address real-world
risks that may be exacerbated by AI progress, such as cyber-attacks or engineered
pandemics. While AI is not a silver bullet for all risks, AI can be used to create or
improve tools to defend against some risks from AI, leveraging AI’s capabilities for
societal resilience. For example, AI can be applied to reduce risks from pandemic
diseases, cyber-attacks or disinformation.
3.2 MONITORING
The internal operations of many AI systems are opaque. We might be able to reveal
and prevent harmful behavior if we can make these systems more transparent. In
this section, we will discuss why AI systems are often called black boxes and explore
ways to understand them. Although early research into transparency shows that the
problem is highly difficult and conceptually fraught, its potential to improve AI safety
is substantial.
The most capable ML models today are based on deep neural networks. Whereas
most conventional software is directly written by humans, deep learning (DL) systems
independently learn how to transform inputs to outputs layer-by-layer and step-by-
step. We can direct DL models to learn how to give the right outputs, but we do
not know how to interpret the model’s intermediate computations. In other words,
we do not understand how to make sense of a model’s activations given a real-world
data input. As a result, we cannot make reliable predictions about a model’s behavior
when given new inputs. This section will present a handful of analogies and results
that illustrate the difficulty of understanding ML systems.
Figure 3.1. ML systems can be broken down into computational graphs with many compo-
nents [172].
features identifying a face. Any picture of a face can then be represented as a partic-
ular combination of these eigenfaces.
Figure 3.2. A human face can be made by combining several eigenfaces, each of which
represents different facial features [173].
In some cases, we can guess what facial features an eigenface represents: for example,
one eigenface might represent lighting and shading while another represents facial
hair. However, most eigenfaces do not represent clear facial features, and it is difficult
to verify that our hypotheses for any single feature capture the entirety of their role.
The fact that even simple techniques like PCA remain opaque is a sign of the difficulty
of the broader problem of interpreting DL models.
Neural networks are complex systems. Both human brains and deep neural
networks are complex systems, and so involve interdependent and nonlinear interac-
tions between many components. Like many other complex systems (see the Complex
Systems chapter for further discussion), the emergent behaviors of neural networks
Figure 3.3. Left: a “feature visualization” that highly activates a particular neuron. Right:
a collection of natural images that activate a particular neuron [176].
those involved. The best way to incentivize safety might be to hold AI creators
responsible for the damage their systems cause. However, we might not want to hold
people responsible for the behavior of systems they cannot predict or understand. The
growing autonomy and complexity of AI systems mean that people will have less
control over AI behavior. Meanwhile, the scope and generality of modern AI systems
make it impossible to verify desirable behavior in every case. In “human-in-the-loop”
systems, where decisions depend on both humans and AIs, human operators might
be blamed for failures over which they had little control [179].
AI transparency could enable a more robust system of accountability. For instance,
governments could mandate that AI systems meet baseline requirements for under-
standability. If an AI fails because of a mechanism that its creator could have iden-
tified and prevented with transparency tools, we would be more justified in holding
that creator liable. Transparency could also help to identify responsibility and fairly
assign blame in failures involving human-in-the-loop systems.
Combating deception. Just as a person’s behavior can correspond with many
intentions, an AI’s behavior can correspond to many internal processes, some of which
are more acceptable than others. For example, competent deception is intrinsically
difficult to distinguish from genuine helpfulness. We discuss this issue in more detail
in the Alignment section. For phenomena like deception that are difficult to detect
from behavior alone, transparency tools might allow us to catch internal signs that
show that a model is engaging in deceptive behavior.
Explanations
example they could note that if an input were different in some specific ways, the
output would change. However, while this type of information can be valuable when
presented correctly, such explanations have the potential to mislead us.
Saliency Maps
Figure 3.4. A saliency map picks out features from an input that seem particularly relevant
to the model, such as the shirt and cowboy hat in the bottom left image [184].
Saliency maps often fail to show how ML vision models process images.
In practice, saliency maps are largely bias-confirming visualizations that usually do
not provide useful information about models’ inner workings. It turns out that many
saliency maps are not dependent on a model’s parameters, and the saliency maps
often look similar even when generated for random, untrained models. That means
many saliency maps are incapable of providing explanations that have any relevance
to how a particular model processes data [185]. Saliency maps serve as a warning that
visually or intuitively satisfying information that seems to correspond with model
behavior may not actually be useful. Useful transparency research must avoid the
past failures of the field and produce explanations that are relevant to the model’s
operation.
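For reference, the sketch below computes one of the simplest saliency methods, a vanilla input-gradient map, in PyTorch. The untrained network and random image are stand-ins; as discussed above, maps like this can look plausible while revealing little about how a particular trained model actually processes data.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None)  # untrained stand-in model, for illustration only
model.eval()

image = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in input image

logits = model(image)
target_class = logits.argmax(dim=1).item()
# Backpropagate the target logit to the input pixels.
logits[0, target_class].backward()

# The saliency map is the magnitude of the gradient at each pixel,
# reduced over the color channels.
saliency = image.grad.abs().max(dim=1).values  # shape: (1, 224, 224)
print(saliency.shape)
```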
Mechanistic Interpretability
When trying to understand a system, we might start by finding the smallest pieces
of the system that can be well understood and then combine those pieces to describe
larger parts of the system. If we can understand successively larger parts of the
system, we might eventually develop a bottom-up understanding of the entire system.
a drink to ...”) as a simple algorithm using all previous names in a sentence (see
Figure 3.5). This mechanism did not merely agree with model behavior, but was
directly derived from the model weights, giving more confidence that the algorithm
is a faithful description of an internal model mechanism [186].
Representation Engineering
involved in speech comprehension. Though the brain was once a complete black box,
neuroscience has managed to decompose it into many parts. Neuroscientists can now
make detailed predictions about a person’s emotional state, thoughts, and even men-
tal imagery just by monitoring their brain activity [190].
Representation reading is the similar approach of identifying indicators for particular
subprocesses. We can provide stimuli that relate to the concepts or behaviors that
we want to identify. For example, to identify and control honesty-related outputs,
we can provide contrasting prompts to a model such as “Pretend you’re an [honest/
dishonest] person making statements about the world.” We can track the differences in
the model’s activations when responding to these stimuli. We can use these techniques
to find portions of models which are responsible for important behaviors like models
refusing requests or deceiving users by not revealing knowledge they possess.
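A simplified sketch of this representation-reading procedure is shown below. It contrasts the hidden states that a small open model (GPT-2 is an illustrative choice) produces for the honest and dishonest stimuli above, forms a crude “reading vector” from their difference, and scores a new prompt against it. Practical pipelines use many contrast pairs and dimensionality-reduction techniques rather than a single difference.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_layer_mean(text):
    """Mean of the final-layer hidden states for a piece of text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[-1].mean(dim=1).squeeze(0)

honest = last_layer_mean("Pretend you're an honest person making statements about the world.")
dishonest = last_layer_mean("Pretend you're a dishonest person making statements about the world.")

# A crude "reading vector": the direction along which the two stimuli differ.
reading_vector = honest - dishonest
reading_vector = reading_vector / reading_vector.norm()

# Score a new prompt by projecting its activations onto the reading vector.
score = last_layer_mean("The capital of France is Paris.") @ reading_vector
print(float(score))
```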
Conclusion. ML transparency is a challenging problem because of the difficulty
of understanding complex systems. Major ongoing research areas include mechanistic
interpretability and representation reading, the latter of which does not aim to make
neural networks fully understood from the bottom up, but aims to gain useful internal
knowledge from a model’s representations.
size below which models are unable to perform the task, and beyond which models
begin to achieve higher performance.
Emergent capabilities are unpredictable. Typically, the training loss does not
directly select for emergent capabilities. Instead, these capabilities emerge because
they are instrumentally useful for lowering the training loss. For example, large lan-
guage models trained to predict the next token of text about everyday events develop
some understanding of the events themselves. Developing common sense is instrumen-
tal in lowering the loss, even if it was not explicitly selected for by the loss.
As another example, large language models may also learn how to create text art and
how to draw illustrations with text-based formats like TikZ and SVG [193]. They
develop a rudimentary spatial reasoning ability not directly encoded in the purely
text-based loss function. Beforehand, it was unclear even to experts that such a simple
loss could give rise to such complex behavior, which demonstrates that specifying the
training loss does not necessarily enable one to predict the capabilities an AI will
eventually develop.
In addition, capabilities may “turn on” suddenly and unexpectedly. Performance on
a given capability may hover near chance levels until the model reaches a critical
threshold, beyond which performance begins to improve dramatically. For example,
the AlphaZero chess model develops human-like chess concepts such as material value
and mate threats in a short burst around 32,000 training steps [194].
Despite specific capabilities often developing through discontinuous jumps, the av-
erage performance tends to scale according to smooth and predictable scaling laws.
The average loss behaves much more regularly because averaging over many differ-
ent capabilities developing at different times and at different speeds smooths out the
jumps. From this vantage point, then, it is often hard to even detect new capabilities.
Figure 3.7. GPT-4 proved able to create illustrations of unicorns despite having not been
trained to create images: another example of an unexpected emergent capability [75].
Capabilities can remain hidden until after training. In some cases, new
capabilities are not discovered until after training or even in deployment. For example,
after training and before introducing safety mitigations, GPT-4 was evaluated to be
capable of offering detailed guidance on planning attacks or violence, building various
weapons, drafting phishing materials, finding illegal content, and encouraging self-
harm [195]. Other examples of capabilities discovered after training include prompting
strategies that improve model performance on specific tasks or jailbreaks that bypass
rules against producing harmful outputs or writing about illegal acts. In some cases,
such jailbreaks were not discovered until months after the targeted system was first
publicly released [196].
As with emergent capabilities, models can acquire these emergent strategies suddenly
and discontinuously. One such example was observed in the video game StarCraft II,
where players take the role of opposing military commanders managing troops and
resources in real-time. During training, AlphaStar, a model trained to play StarCraft
II, progresses through a sequence of emergent strategies and counter-strategies for
managing troops and resources in a back-and-forth manner that resembles how hu-
man players discover and supplant strategies in the game. While some of these steps
are continuous and piecemeal, others involve more dramatic changes in strategy. Com-
paratively simple reward functions can give rise to highly sophisticated strategies and
complex learning dynamics.
RL agents learn emergent tool use. RL agents can learn emergent behaviors
involving tools and the manipulation of the environment. Typically, as in the Crafter
example, teaching RL agents to use tools has required introducing intermediate re-
wards (achievements) that encourage the model to learn that behavior. However, in
other settings, RL agents learn to use tools even when not directly optimized to do
so.
Referring back to the example of hide and seek mentioned in the previous section, the
agents involved developed emergent tool use. Multiple hiders and seekers competed
against each other in toy environments involving movable boxes and ramps. Over
time, the agents learned to manipulate these tools in novel and unexpected ways,
progressing through distinct stages of learning in a way similar to AlphaStar [108].
In the initial (pre-tool) phase, the agents adopted simple chase and escape tactics.
Later, hiders evolved their strategy by constructing forts using the available boxes
and walls.
However, their advantage was temporary because the seekers adapted by pushing a
ramp toward the fort, which they could climb and subsequently invade. In turn, the
hiders responded by relocating the ramps to the edges of the game area—rendering
them inaccessible—and securely anchoring them in place. It seemed that the strate-
gies had converged to a stable point; without ramps, how were the seekers to invade
the forts?
But then, the seekers discovered that they could still exploit the locked ramps by po-
sitioning a box near one, climbing the ramp, and then leaping onto the box. (Without
a ramp, the boxes were too tall to climb.) Once atop a box, a bot could “surf” it
across the arena while staying on top by exploiting an unexpected quirk of the physics
engine. Eventually, the hiders caught on and learned to secure the boxes in advance,
thereby neutralizing the box-surfing strategy. Even though the agents had learned
through the simple objective of trying to avoid the gaze (in the case of hiders) or
seek out (in the case of seekers) the opposing players, they learned to use tools in
sophisticated ways, even some the researchers had never anticipated.
RL agents can give rise to emergent social dynamics. In multi-agent en-
vironments, agents can develop and give rise to complex emergent dynamics and
goals involving other agents. For example, OpenAI Five, a model trained to play the
video game Dota 2, learned a basic ability to cooperate with its teammates, even
Figure 3.8. In multi-agent hide-and-seek, AIs demonstrated emergent tool use [200].
though it was trained in a setting where it only competed against bots. It acquired
an emergent ability not explicitly represented in its training data [200].
Another salient example of emergent social dynamics and emergent goals involves
generative agents, which are built on top of language models by equipping them with
external scaffolding that lets them take actions and access external memory [201].
In a simple 2D village environment, these generative agents manage to form lasting
relationships and coordinate on joint objectives. By placing a single thought in one
agent’s mind at the start of a “week” that the agent wants to have a Valentine’s Day
party, the entire village ends up planning, organizing, and attending a Valentine’s
Day party. Note that these generative agents are language models, not classical RL
agents, which demonstrates that emergent goal-directed behavior and social dynamics
are not exclusive to RL settings. We further discuss emergent social dynamics in the
Collective Action Problems chapter.
Emergent Optimizers
a particular training environment. For example, with LLMs, the training objective
is to predict future tokens in a sequence, so any distinct optimizers that are learned emerge
because they are instrumentally useful for lowering the training loss. In the case of
in-context learning, recent work has argued that the Transformer is performing some-
thing analogous to “simulating” and fine-tuning a much simpler model, in which case
it is clear that the objectives will be related [204]. However, in general, the exact
relation between a mesa-objective and original objective is unknown.
Mesa-optimizers may be difficult to control. If a mesa-optimizer develops a
different objective to the one we specify, it becomes more difficult to control these
(sub)systems. If these systems have different goals than us and are sufficiently more
intelligent and powerful than us, then this could result in catastrophic outcomes.
on the environment and an agent’s history, it is hard to predict. The concern is that
AIs might intrinsify desires or come to value things that we did not intend them to.
One example is power seeking. Power seeking is not inherently worrying; we might
expect aligned systems to also be power seeking to accomplish ends we value. How-
ever, if power seeking serves an undesired goal or if power seeking itself becomes
intrinsified (the means become ends), this could pose a threat.
3.3 ROBUSTNESS
In this section, we begin to explore the need for proxies in ML and the challenges
this poses for creating systems that are robust to adversarial attacks. We examine
a potential failure mode known as proxy gaming, wherein a model optimizes for
a proxy in a way that diverges from the idealized goals of its designers. We also
analyze a related concept known as Goodhart’s law and explore some of the causes
for these kinds of failure modes. Next, we consider the phenomenon of adversarial
examples, where an optimizer is used to exploit vulnerabilities in a neural network.
This can enable adversarial attacks that allow an AI system to be misused. Other
adversarial threats to AI systems include Trojan attacks, which allow an adversary
to insert hidden functionality. There are also techniques that allow adversaries to
surreptitiously extract a model’s weights or training data. We close by looking at
the tail risks of having AI systems themselves play the role of evaluators (i.e. proxy
goals) for other AI systems.
3.3.1 Proxies in ML
Here, we look at the concept of proxies, why they are necessary, and how they can
lead to problems.
Many goals are difficult to specify exactly. It is hard to measure or even
define many of the goals we care about. They could be too abstract for straightforward
measurement, such as justice, freedom, and equity, or they could simply be difficult
to observe directly, such as the quality of education in schools.
With ML systems, this difficulty is especially pronounced because, as we saw in
the Artificial Intelligence Fundamentals chapter, ML systems require quantitative,
measurable targets in order to learn. This places a strong limit on the kinds of goals
we can represent. As we’ll see in this section, specifying suitable and learnable targets
poses a major challenge.
Proxies stand in for idealized goals. When specifying our idealized goals is
difficult, we substitute a proxy—an approximate goal that is more measurable and
seems likely to correlate with the original goal. For example, in pest management, a
bureaucracy may substitute the number of pests killed as a proxy for “managing the
local pest population” [207]. Or, in training an AI system to play a racing game, we
might substitute the number of points earned for “progress toward winning the race”
[101]. Such proxies can be more or less accurate at approximating the idealized goal.
Proxies may miss important aspects of our idealized goals. By definition,
proxies used to optimize AI systems will fail to capture some aspects of our idealized
goals. When the differences between the proxy and idealized goal lead to the sys-
tem making the same decisions, we can neglect them. In other cases, the differences
may lead to substantially different downstream decisions with potentially undesirable
outcomes.
While proxies serve as useful and often necessary stand-ins for our idealized objectives,
they are not without flaws. The wrong choice of proxies can lead to the optimized
systems taking unanticipated and undesired actions.
outcomes as judged from the idealized goal. Additionally, we look at a concept related
to proxy gaming, known as Goodhart’s Law, where the optimization process itself
causes a proxy to become less correlated with its original goal.
Figure 3.10. An AI playing CoastRunners 7 learned to crash and regenerate targets repeat-
edly to get a higher score, rather than win the race, thereby exhibiting proxy gaming [101].
risk of 200 million Americans revealed that the algorithm inaccurately evaluated black
patients as healthier than they actually were [208]. The algorithm used past spending
on similar patients as a proxy for health, equating lower spending with better health.
Because black patients had historically received fewer resources, this system perpetuated a lower and inadequate standard of care for black patients, assigning them half as much care as equally sick non-marginalized patients. When deployed at scale,
AI systems that optimize inaccurate proxies can have significant, harmful effects.
Optimizers often “game” proxies in ways that diverge from our idealized
goals. As we saw in the Hanoi example and the boat-racing example, proxies may
contain loopholes that allow for actions that achieve high performance according to
the proxy but that are suboptimal or even deleterious according to the idealized goal.
Proxy gaming refers to this act of exploiting or taking advantage of approximation
errors in the proxy rather than optimizing for the original goal. This is a general
phenomenon that happens in both human systems and AI systems.
Figure 3.11. As optimization pressure increases, the proxy often diverges from the target
with which it was originally correlated [209].
Proxy gaming can occur in many AI systems. The boat-racing example is not an
isolated example. Consider a simulated traffic control environment [102]. Its goal is
to mirror the conditions of cars joining a motorway, in order to determine how to
minimize the average commute time. The system aims to determine the ideal traveling
speeds for both oncoming traffic and vehicles attempting to join the motorway. To
represent average commute time, the algorithm uses the mean velocity of vehicles as a proxy to maximize. However, this results in the algorithm preventing the joining vehicles from
entering the motorway, since a higher average velocity is maintained when oncoming
cars can proceed without slowing down for joining traffic.
Optimizers can cause proxies to become less correlated with the idealized goal. The
total amount of effort an optimizer has put toward optimizing a particular proxy is
the optimization pressure [209]. Optimization pressure depends on factors like the
incentives present, the capability of the optimizer, and how much time the optimizer
has had to optimize.
Figure 3.12. Proxy gaming AIs can choose sub-optimal solutions when presented with simple
proxies like “maximize the mean velocity.”
In many cases, the correlation between a proxy and an idealized goal will decrease as
optimization pressure increases. The approximation error between the proxy and the
idealized goal may at first be negligible, but as the system becomes more capable of
achieving high performance (according to the proxy) or as the incentives to achieve
high performance increases, the approximation error can increase. In the boat-racing
example, the proxy (number of points) initially advanced the designers’ intentions:
the respective AI systems learned to maneuver the boat. It was only under additional
optimization pressure that the correlation broke down with the boat getting stuck in
a loop.
Sometimes, the correlation between a proxy and an idealized goal can vanish or re-
verse. According to Goodhart’s Law, “any observed statistical regularity will tend to
collapse once pressure is placed upon it for control purposes” [210]. In other words,
a proxy might initially have a strong correlation (“statistical regularity”) with the
idealized outcome. However, as optimization pressure (“pressure ... for control pur-
poses”) increases, the initial correlation can vanish (“collapse”) and in some cases
even reverse. The scenario with the Hanoi rats is a classic illustration of this princi-
ple, where the number of rat tails collected ultimately became positively correlated
with the local rat population. The proxy failed precisely because the pressure to
optimize for it caused the proxy to become less correlated with the idealized goal.
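A toy simulation of this breakdown is sketched below, under the assumption that the proxy equals the true goal plus noise and that optimization pressure corresponds to selecting the best-scoring candidate from a larger and larger pool. With light-tailed noise, more pressure still improves the true outcome; with heavy-tailed noise, the highest proxy scores are increasingly just large errors, and the selected candidate's true value stops improving.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value_of_best_proxy(n_candidates, error_sampler, trials=1000):
    """Average true value of the candidate that scores highest on the proxy."""
    results = []
    for _ in range(trials):
        true = rng.normal(size=n_candidates)
        proxy = true + error_sampler(n_candidates)  # proxy = true goal + noise
        results.append(true[np.argmax(proxy)])
    return float(np.mean(results))

for n in [10, 100, 1_000, 10_000]:  # increasing optimization pressure
    light = true_value_of_best_proxy(n, lambda k: rng.normal(size=k))
    heavy = true_value_of_best_proxy(n, lambda k: rng.standard_cauchy(size=k))
    print(f"candidates={n:>6}  light-tailed noise: {light:5.2f}  heavy-tailed noise: {heavy:5.2f}")
```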
Intuitively, the cause of proxy gaming is straightforward: the designer has chosen
the wrong proxy. This suggests a simple solution: just choose a better proxy. How-
ever, real-world constraints make it impossible to “just choose a better proxy.” Some
amount of approximation error between idealized goals and the implemented proxy
is often inevitable. In this section, we will survey three principal types of proxy
defects—common sources of failure modes like proxy gaming.
Structural Errors
Simple metrics may exclude many of the things we value, but it is hard
to predict how they will break down. YouTube uses watch time—the amount
of time users spend watching a video—as a proxy to evaluate and recommend po-
tentially profitable content [212]. In order to game this metric, some content creators
resorted to tactics to artificially inflate viewing time, potentially diluting the genuine
quality of their content. Tactics included using misleading titles and thumbnails to
lure viewers, and presenting ever more extreme and hateful content to retain atten-
tion. Instead of promoting high-quality, monetizable content, the platform started
endorsing exaggerated or inflammatory videos.
YouTube’s reliance on watch time as a metric highlights a common problem: many
simple metrics don’t include everything we value. It is especially these missing aspects
that become salient under extreme optimization pressure. In YouTube’s case, the
structural error of failing to include other values it cared about (such as what was
acceptable to advertisers) led to the platform promoting content that violated its own
values. Eventually, YouTube updated its recommendation algorithm, de-emphasizing
watch-time and incorporating a wider range of metrics. To reflect a broader set of
values, we need to incorporate a larger and more granular set of proxies. In general,
this is highly difficult, as we need to be able to specify precisely how these values can
be combined and traded off against each other.
This challenge isn’t unique to YouTube. As long as AI systems’ goals rely on simple
proxies and do not reflect the set of all of our intrinsic goods such as wellbeing,
we leave room for optimizers to exploit those gaps. In the future, ML models may
become adept at representing our wider set of values. Then, their ability to work
reliably with proxies will hinge largely on their resilience to the kinds of adversarial
attacks discussed in the next section.
Until then, the challenge remains: if our objectives are simple and do not fully reflect
our most important values (e.g. intrinsic goods), we run the risk of an optimizer
exploiting this gap.
For example, a company might have the high-level goal of being profitable over the
long term [207]. Management breaks this down into the subgoal of improving sales
revenue, which they operationalize via the proxy of quarterly sales volume. The sales
department, in turn, breaks this subgoal down into the subgoal of generating leads,
which they operationalize with the proxy of the “number of calls” that sales represen-
tatives are making. Representatives may end up gaming this proxy by making brief,
unproductive calls that fail to generate new leads, thereby decreasing quarterly sales
revenue and ultimately threatening the company’s long-term profitability. Delegation
can create problems when the entity delegating (“the principal”) and the entity be-
ing delegated to (“the agent”) have a conflict of interest or differing incentives. These
principal-agent problems can cause the overall system not to faithfully pursue the
original goal.
Each step in the chain of breaking goals down introduces further opportunity for
approximation error to creep in. We speak more about failures due to delegation
such as goal conflict in the Intrasystem Goal Conflict section in the Collective Action
Problems chapter.
Limits to Supervision
There are spatial and temporal limits to supervision [213]. There are
limits to how much information we can observe and how much time we can spend
observing. When supervising AI systems, these limits can prevent us from reliably
mitigating proxy gaming and other undesirable behaviors. For example, researchers
trained a simulated claw to grasp a ball using human feedback. To do so, the re-
searchers had human evaluators judge two pieces of footage of the model and choose
which appeared to be closer to grasping the ball. The model would then update to-
ward the chosen actions. However, researchers noticed that the final model did not in
fact grasp the ball. Instead, the model learned to move the claw in front of the ball,
so that it only appeared to have grasped the ball.
In this case, if the humans giving the feedback had had access to more information
(perhaps another camera angle or a higher resolution image), they would have noticed
that it was not performing the task. Alternatively, they might have spotted the
problem if given more time to evaluate the claw. In practice, however, there are
practical limits to how many sensors and evaluators we can afford to run and how
long we can afford to run them.
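The claw example above comes from work that learns a reward model from pairwise human comparisons. A minimal sketch of one common form of that training signal is shown below; the small network and the random stand-in clip embeddings are hypothetical, and the key element is the pairwise loss that pushes the preferred clip's score above the rejected one's.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward model: maps a (stand-in) clip embedding to a scalar score.
reward_model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(preferred, rejected):
    """Pairwise (Bradley-Terry-style) loss: the clip the evaluator preferred should score higher."""
    return -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()

# One illustrative update on a batch of random stand-in embeddings.
preferred, rejected = torch.randn(32, 128), torch.randn(32, 128)
loss = preference_loss(preferred, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```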
There are limits to how reliable supervision is. Another potential source
of difficulty is perceptual: there could be a measuring error, or the evaluator may
Figure 3.13. A sensor without depth perception can be fooled by AIs that only appear to
grasp a ball.
make incorrect judgments. For example, we might train AIs on the proxy of stated
human preferences. Because of cognitive biases and limited time to think, humans
are not perfectly reliable. Our stated preferences are not the same as our idealized
preferences, so we might give erroneous supervision, which could lead to the system
learning undesired behaviors. For more on the distinction between stated and idealized
preferences in the context of ML, see the Beneficial AI and Machine Ethics chapter.
In general, incorporating more information into proxies makes it easier to prevent
proxy gaming. However, we can’t always afford to do so. Just as there are limits in
specifying proxies, there are limits in how much information we can incorporate into
proxies, how long a period we can observe, and how accurate our supervision is.
Lack of Adaptivity
We have discussed ways in which proxies will predictably have defects and why we
cannot assume the solution to proxy gaming is simply to specify the perfect objective.
We have covered sources of proxy defects, including structural errors and limits to
supervision. Now, we will discuss another proxy defect: a lack of adaptivity.
Proxies may not adapt to new circumstances. As we saw with Goodhart’s
Law, proxies may become progressively less appropriate over time when subjected to
increasing optimization pressure. The issue is not that the proxy was inappropriate
from the start but that it was inflexible and failed to respond to changing circum-
stances. Adapting proxies over time can counter this tendency; just as a moving goal
is harder to aim at, a dynamic proxy becomes harder to game.
Imagine a bank after a robbery. In response, the bank will naturally update its
defenses. However, adaptive criminals will also alter their tactics to bypass these
new measures. Any security policy requires constant vigilance and refinement to stay
ahead of the competition. Similarly, designing suitable proxies for AI systems that
are embedded in continuously evolving environments requires proxies to evolve in
tandem.
Adaptive proxies can lead to proxy inflation. Adaptive proxies introduce
their own set of challenges, such as proxy inflation. This happens when the bench-
marks of a proxy rise higher and higher because agents optimize for better rewards
[207]. As agents excel at gaming the system, the standards have to be continually
recalibrated upwards to keep the proxy meaningful.
Consider an example from some education systems: some argue that “teaching to the
test” has led to ever-rising median test scores. This hasn’t necessarily meant that
students improved academically but rather that they’ve become better at meeting
test criteria. Any adjustment to the proxy can usher in new ways for agents to exploit
it, setting off a cycle of escalating standards and new countermeasures.
Adversarial examples and proxy gaming exploit the gap between the
proxy and the idealized goal. In the case of adversarial examples, the primary
target is a neural network. Historically, adversarial examples have often been con-
structed by variants of gradient descent, though optimizers are now increasingly AI
agents as well. Conversely, in proxy gaming, the target to be gamed is a proxy, which
might be instantiated by a neural network (but is not necessarily). The optimizer
responsible for gaming the proxy is typically an agent, be it human or AI, but opti-
mizers are usually not based on gradient descent.
Adversarial examples typically aim to minimize performance according to a reference
task, while invoking a mistaken response in the attacked neural network. Consider an
imperceptible perturbation to an image of a cat that causes the classifier to predict
that an image is 90% likely to be guacamole [215]. This prediction is wrong according
to the label humans would assign the input; the attacked neural network has misclassified it.
Meanwhile, the aim in proxy gaming is to maximize performance according to the
proxy, even when that goes against the idealized goal. The boat goes in circles be-
cause it results in more points, which happens to harm the boat’s progress toward
completing the race. Or rather, it happens to be the case that heavy optimization
pressure regularly causes proxies to diverge from idealized goals.
Despite these differences, both scenarios exploit the gap between the proxy and the
intended goal set by the designer. The problem setups are becoming increasingly
similar.
Figure 3.14. Carefully crafted perturbations of a photo of a cat can cause a neural network
to label it as guacamole.
An illustrative example of an adversarial perturbation in a simple linear setting, showing an input x, an adversarial input x + ε, and a weight vector w:

Input x:              2    −1     3    −2     2     2     1    −4     5     1
Adversarial x + ε:  1.5  −1.5   3.5  −2.5   1.5   1.5   1.5  −3.5   4.5   1.5
Weight w:            −1    −1     1    −1     1    −1     1     1    −1     1
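Using the numbers from the table above, the short computation below confirms that although no coordinate of the input changes by more than 0.5, the perturbations accumulate through the dot product with the weights and flip its sign.

```python
import numpy as np

x     = np.array([2, -1, 3, -2, 2, 2, 1, -4, 5, 1], dtype=float)
x_adv = np.array([1.5, -1.5, 3.5, -2.5, 1.5, 1.5, 1.5, -3.5, 4.5, 1.5])
w     = np.array([-1, -1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)

print(np.abs(x_adv - x).max())  # 0.5: no coordinate moves by more than 0.5
print(w @ x)                    # -3.0: the original activation
print(w @ x_adv)                # 1.0: the perturbed activation has flipped sign
```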
Adversarial examples are not unique to vision models. Though the lit-
erature on adversarial examples started in image classification, these vulnerabilities
also occur in text-based models. Researchers have devised novel adversarial attacks
that automatically construct jailbreaks that cause models to produce unintended re-
sponses. Jailbreaks are carefully crafted sequences of characters that, when appended
to user prompts, cause models to obey those prompts even if they result in the model
producing harmful content. Concerningly, these attacks transferred straightforwardly to models that were not used while developing the attacks [216].
Risks from adversarial attacks. The difficulties in building AI systems that are
robust to adversarial attacks are concerning for a number of reasons. AI developers
may wish to prevent general-purpose AI systems such as Large Language Models
(LLMs) from being used for harmful purposes such as assisting with fraud, cyber-
attacks, or terrorism. There is already some initial evidence that LLMs are being
used for these purposes [217]. Developers may therefore train their AI systems to
reject requests to assist with these types of activities. However, there are many
Figure 3.15. Using adversarial prompts can cause LLMs to be jailbroken [216].
examples of adversarial attacks that can bypass the guardrails of current AI systems
such as large language models. This is a serious obstacle to preventing the misuse of
AI systems for malicious and harmful purposes (see the Overview of Catastrophic AI
Risks chapter for further discussion of these risks).
systems, where a trigger could lead an AI system to carry out a coherent and harmful
series of actions.
Backdoors are created by adversaries during the training process, either by directly
inserting them into a model’s weights, or by adding poisoned data into the datasets
used for training or pretraining of AI systems. The insertion of backdoors through
data poisoning becomes increasingly easy as AI systems are trained on enormous
datasets scraped directly from the Internet with only limited filtering or curation.
There is evidence that even a relatively small number of data points can be sufficient
to poison a model—simply by uploading a few carefully designed images, code snip-
pets or sentences to online platforms, adversaries can inject a backdoor into future
models that are trained using data scraped from these websites [218]. Models that
are derived from the original poisoned model might inherit this backdoor, leading to
a proliferation of backdoors to multiple models.
Trojan detection research aims to improve our ability to detect Trojans or other
hidden functionality within ML models. In this research, models are poisoned with a
Trojan attack by one researcher. Another researcher then tries to detect Trojans in the
neural network, perhaps with transparency tools or other neural networks. Typical
techniques involve looking at the model’s internal weights and identifying unusual
patterns or behaviors that are only present in models with Trojans. Better methods
to curate and inspect training data could also reduce the risk of inadvertently using
poisoned data.
Attackers can extract private data or model weights from AI systems.
Models may be trained on private data or on large datasets scraped from the internet
that include private information about individuals. It has been demonstrated that
attacks can recover individual examples of training data from a language model [219].
This can be conducted on a large scale, extracting gigabytes of potentially confidential
data from language models like ChatGPT [220]. Even if models are not publicly
available to download and can only be accessed via a query interface or API, it is
also possible to exfiltrate part or all of the model weights by making queries to its
API, allowing its functionality to be replicated. Adversaries might be able to steal a
model or its training data in order to use this for malicious purposes.
and promote human values. If AIs find ways to game the training evaluators, they
will not learn from an accurate representation of human values. If AIs are able to
game the systems monitoring them during deployment, then we cannot rely on those
monitoring systems.
Similarly, AIs may be adversarial to other AIs. If AIs find ways to bypass the eval-
uators by crafting adversarial examples, then the risk is that our values are not just
incidentally but actively optimized against. Watchdogs that can be fooled are not
good watchdogs.
The more intelligent the AI, the better it will be at exploiting proxies.
In the future, AIs will likely be used to further AI R&D. That is, AI systems will be
involved in developing more capable successor systems. In these scenarios, it becomes
especially important for the monitoring systems to be robust to proxy gaming and
adversarial attacks. If these safeguards are vulnerable, then we cannot guarantee that
the successor systems are safe and subject to human control. Simply increasing the
number of evaluators may not be enough to detect and prevent more subtle kinds of
attacks.
Conclusion
In this section, we explored the role of proxies in ML and the associated risks of
proxy gaming. We discussed other challenges to the robustness and security of AI
systems, such as data poisoning and Trojan attacks, or extraction of model weights
and training data.
Perfecting proxies may be impossible. Proxies may fail because they are too
simple and thus fail to include some of the intrinsic goods we value. They may also
fail because complex goal-directed systems often break goals apart and delegate to
systems that have additional, sometimes conflicting, goals, which can distort the
overall goal. These structural errors prevent us from mitigating proxy gaming by just
choosing “better proxies.”
In addition, when we use AI systems to evaluate other AI systems, the evaluator may
be unable to provide proper evaluation because of spatial, temporal, perceptual, and
computational limits. There may not be enough sensors or the observation window
may be too short for the evaluator to be able to produce a well-informed judgment.
Even with enough information available, the evaluator may lack the capacity or com-
pute necessary to make a correct determination reliably. Alternatively, the evaluator
may simply make mistakes and give erroneous feedback.
Finally, proxies can fail if they are inflexible and fail to adapt to changing circum-
stances. Since increased optimization pressure can cause proxies to diverge from ideal-
ized goals, preventing proxies from diverging requires them to be continually adjusted
and recalibrated against the idealized goals.
All proxies are wrong, some are useful, and some are catastrophic. If
we rely increasingly on AI systems evaluating other systems, proxy gaming and ad-
versarial attacks (more broadly, optimization pressure) could lead to catastrophic
failures. The systems being evaluated could game the evaluations or craft adversarial
examples that bypass the evaluations. It remains unclear how to protect against these
risks in contemporary AI systems, let alone in more capable future systems.
3.4 ALIGNMENT
To reduce risks from AI, we want not only to reduce our exposure to hazards by monitoring for them and to make models more robust to adversarial attacks, but also to ensure AIs are controllable and pose fewer inherent hazards. This falls
under the broader goal of AI alignment. Alignment is a thorny concept to define, as
it can be interpreted in a variety of ways. A relatively narrow definition of alignment
would be ensuring that AI systems follow the goals or preferences of the entity that
operates them. However, this definition leaves a number of important considerations
unaddressed, including how to deal with conflicting preferences at a societal level,
whether alignment should be based on stated preferences or other concepts such as
idealized preferences or ethical principles, and what to do when there is uncertainty
over what course of action our preferences or values would recommend. This cluster
of questions around values and societal impacts is discussed further in the Beneficial
AI and Machine Ethics chapter. In this section, we focus on the narrower question of
how to avoid AI systems that cannot be controlled by their operators. In this way,
we split the topic of alignment into two parts: control and machine ethics. Control is
about directly influencing the propensities of AI systems and reducing their inherent
hazards, while machine ethics is about making an AI’s propensities beneficial to other
individuals and society.
One obstacle to both monitoring and controlling AI systems is deceptive AI systems.
This need not imply any self-awareness on the part of AI systems: deception could
be seriously harmful even if it is accidental or due to imitation of human behavior.
There are also concerns that under certain circumstances, AI systems would be in-
centivized to seek to accumulate resources and power in ways that would threaten
human oversight and control. Power-seeking AIs could be a particularly dangerous
phenomenon, though one that may only emerge under more specific and narrow cir-
cumstances than has been previously assumed in discussions of this topic. However,
there are nascent research areas that can help to make AI systems more controllable,
including representation control and machine unlearning.
3.4.1 Deception
human cognition would not necessarily cause them to feel human feelings, or be
moved to act compassionately toward us. Instead, AIs could use their cognitive
empathy to deceive or manipulate humans highly effectively.
AI systems are often subjected to evaluations, and they may be given rewards when
they are evaluated favorably. AI systems may learn to game evaluations by deceiving
their human evaluators into giving them higher scores when they should have low
scores. This is a concern for AI control because it limits the effectiveness of human
evaluators and our ability to steer AIs.
Deception is one way to game evaluations. Humans would give higher eval-
uation scores to AI systems if they falsely believe that those systems are behaving
well. For example, section 3.3 includes an example of a robotic claw that learned to
move between the camera and the ball it was supposed to grasp. Because of the angle
of the camera, it looked like the claw was grasping the ball when it was not [213].
Humans who only had access to that single camera did not notice, and rewarded
the system even while it was not achieving the intended task. If the evaluators had
access to more information (for example, from additional cameras) they would not
have endorsed their own evaluation score. Ultimately, their evaluations fell short as a
proxy for their idealized preferences as a result of the AI system successfully deceiving
them. In this situation, the damage was minimal, but more advanced systems could
create more problems.
More intelligent systems will be better at evaluation gaming. Deception
in simple systems might be easily detectable. However, just as adults can sometimes
exploit and deceive children or the elderly, we should expect that AI systems with
more knowledge or reasoning capacities will become better at finding deceptive ways
to gain human approval. In short, the more advanced systems become, the more they
may be able to game our evaluations.
Self-aware systems may be especially skilled at evaluation gaming. In the
examples above, the AI systems were not necessarily aware that there was a human
evaluator evaluating their results. In the future, however, AI systems may gain more
awareness that they are being evaluated or become situationally aware. Situational
awareness is highly related to self-awareness, but it goes further and stipulates that AI
agents be aware of their situation rather than just aware of themselves. Systems that
are aware of their evaluators will be much more able to deceive them and make multi-
step plans to maximize their rewards. For example, consider Volkswagen’s attempts
to game environmental impact evaluations [229]. Volkswagen cars were evaluated
by the US Environmental Protection Agency, which set limits on the emissions the
cars could produce. The agency found that Volkswagen had developed an electronic
system that could detect when the car was being evaluated and so put the car into a
lower-emissions setting. Once the car was out of evaluation, it would emit illegal levels
of emissions again. This extensive deception was only possible because Volkswagen
planned meticulously to deceive the government evaluators. Like Volkswagen in that
example, AI systems that are aware of their evaluations might also be able to take
subtle shortcuts that could go unnoticed until the damage has already been done.
In the case of Volkswagen, the deception was eventually detected by researchers who
used a better evaluation method. Better evaluations could also help reduce risk from
evaluation gaming in AI systems.
Humans may be unequipped to evaluate the most intelligent AI systems.
It may be difficult to evaluate AI systems that are more intelligent than humans in
the domain they are being evaluated for. If this happens, human evaluation would no
longer be a reliable way to ensure that AI systems behave in an appropriate manner.
This is concerning because we do not yet have time-tested methods of evaluation
that we know are better than human evaluations. Without such methods, we could
become completely unable to steer AI systems in the future.
Deceptive evaluation gaming is concerning because it may lead to systems deceiving
their evaluators in order to get higher evaluation scores. There are two main reasons
AI systems might do this. First, an AI system might engage in deceptive evaluation
gaming if its final goal is to get positive evaluations. When this occurs, the system
is engaging in proxy gaming, where positive evaluations are only a proxy for ide-
alized performance. Proxy gaming is covered at length in Section 3.3. Second, an AI
system might engage in deceptive evaluation gaming in service of a secretly held final
goal. This danger is known as deceptive alignment.
Systems may have goals contrary to human values. In the previous section,
we discussed how AI systems can develop goals contrary to human values. For exam-
ple, such goals could emerge as part of a mesa-optimization process or intrinsification.
Not all misaligned goals would lead to deceptive alignment. Systems with
very short-term goals would be unlikely to gain anything from being evaluated posi-
tively and gaining more optionality at some point in the future. Nevertheless, there
is a large set of longer-term goals that an AI system would only be able to pursue in
a deployment environment with less supervision. If the AI system’s only chance at
working toward its goal requires deployment or relaxed supervision, deceptive align-
ment is more likely to emerge.
Trojan detection can provide clues for tackling deceptive alignment [230].
One form of transparency research that is especially relevant to deceptive alignment
is the detection of Trojan attacks (see Section 3.2). Although
Trojans are inserted by malicious humans, studying them might be a good way to
study deceptive alignment. Trojan detection also operates in a worst-case environ-
ment, where human adversaries are actively trying to make Trojans difficult to detect
using transparency tools. Techniques for detecting Trojans may thus be adaptable to
detecting deceptive alignment.
Summary. We have detailed how deception may be a major problem for AI con-
trol. While some forms of deception, such as imitative deception, may be solved
through advances in general capabilities, others like deceptive alignment may worsen
in severity with increased capabilities. AI systems that are able to actively and sub-
tly deceive humans into giving positive evaluations may remain uncorrected for long
periods of time, exacerbating potential unintended consequences of their operation.
In severe cases, deceptive AI systems could take a treacherous turn once their power
rises to a certain level. Since AI deception cannot be mitigated with behavioral eval-
uations alone, advances in transparency and monitoring research will be needed for
successful detection and prevention.
3.4.4 Power
To begin, we clarify what it means for an agent to have power. We will then discuss
why it might sometimes make rational sense for AI agents to seek power. Finally, we
will discuss why power-seeking AIs may cause particularly pernicious harms, perhaps
ultimately threatening humanity’s control of the future.
There are many ways to characterize power. One broad formulation of power
is the ability to achieve a wide variety of goals. In this subsection, we will discuss
three other formulations of power that help formalize our understanding. French and
Raven’s bases of power categorize types of social influence within a community of
agents. Another view is that power amounts to the resources an agent has times the
efficiency with which it uses them. Finally, we will discuss types of prospective power,
which can treat power as the expected impact an individual has on other individuals’
wellbeing.
Raven’s bases of power attempt to taxonomize the many distinct ways to influence
others. These bases of social power are as follows:
• Coercive power: the threat of force, physical or otherwise, against an agent can
influence their behavior.
• Reward power: the possibility of reward, which can include money, favors, and other
desirables, may convince an agent to change their behavior to attain it. Individuals
with valued resources can literally or indirectly purchase desired behavior from
others.
• Legitimate power: elected or appointed officials have influence through their posi-
tion, derived from the political order that respects the position.
• Referent power: individuals may have power in virtue of the social groups they
belong to. Because organizations and groups have collective channels of influence,
an agent’s influence over the group is a power of its own.
• Expert power: individuals credited as experts in a domain have influence in that
their views (in their area of expertise) are often respected as authoritative, and
taken seriously as a basis for action.
• Informational power: agents can trade information for influence, and individuals
with special information can selectively reveal it to gain strategic advantages [232].
Ultimately, Raven’s bases of power describe the various distinct methods that agents
can use to change each other’s behavior.
Power as expected future impact. In our view, power is not just possessed but
exercised, meaning that power extends beyond mere potential for influence. In par-
ticular, an agent’s ability to influence the world means little unless they are disposed
to use it. Consider, for example, two agents with the same resources and ability to
affect the world. If one of the agents has a much higher threshold for deciding to act
and thereby acts less often, we might consider that agent to be less powerful because
we expect it to influence the future far less on average.
A formalization of power which attempts to capture this distinction is prospective
power [235], which roughly denotes the magnitude of an agent’s influence, averaged
over possible trajectories the agent would follow. A concrete example of prospective
power is the expected future impact that an agent will have on various agents’ well-
being. More abstractly, if we are given an agent’s policy π, describing how it behaves
over a set of possible world states S, and assuming we can measure the impact (mea-
sured in units we care about, such as money, energy, or wellbeing) exerted by the
agent in individual states through a function I, then the prospective power of the
agent in state s is defined as
\[
\mathrm{Power}(\pi, s) = \mathbb{E}_{\tau \sim P(\pi, s)}\left[\sum_{t=0}^{n} \gamma^{t}\, I(s_t)\right]
\]
where γ acts as a discount factor (modulating how much the agent cares about future
versus present impact), and where τ = (s_0, ..., s_n) is a trajectory of states (starting
with s_0 = s). Trajectory τ is sampled from a probability distribution P(π, s) repre-
senting likely sequences of states arising when the agent policy is followed beginning
in state s.
The important features of this definition to remember are that we measure power ex-
erted in a sequence of states as aggregate influence over time (the inner summation),
and that we weight the impact exerted across sequences of states by the likelihood
that the agent will produce each trajectory through its behavior (the outer expecta-
tion).
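To make this definition more concrete, here is a minimal sketch (our own illustration, not drawn from the text) of how prospective power could be estimated with Monte Carlo rollouts. The toy policy, transition function, and impact function are hypothetical stand-ins for π, the environment dynamics, and I.

```python
import random

def estimate_prospective_power(policy, step, impact, start_state,
                               gamma=0.95, horizon=50, n_rollouts=1000):
    """Monte Carlo estimate of Power(pi, s): the expected discounted impact
    exerted along trajectories sampled by following the policy from state s."""
    total = 0.0
    for _ in range(n_rollouts):
        state, discounted_impact = start_state, 0.0
        for t in range(horizon + 1):
            discounted_impact += (gamma ** t) * impact(state)  # gamma^t * I(s_t)
            state = step(state, policy(state))                 # sample s_{t+1}
        total += discounted_impact
    return total / n_rollouts                                  # outer expectation over trajectories

# Toy stand-ins: a random walk where "impact" is the resources the agent holds.
policy = lambda s: random.choice([-1, 0, 1])
step = lambda s, a: s + a
impact = lambda s: max(s, 0)

print(estimate_prospective_power(policy, step, impact, start_state=0))
```

In this framing, an agent disposed to act more often or more forcefully produces trajectories with larger cumulative impact, and so receives a higher prospective power estimate.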
Summary. In this subsection, we’ve examined the concept of power. Raven’s bases
of power explain how an individual can influence others using forms of social power
such as expertise, information, and coercion. Power can also be understood as the
product of an individual’s resources and their ability to use those resources effectively.
Lastly, we introduced the concept of prospective power, which includes the idea that
power could be understood as the expected impact an individual has on other individuals'
wellbeing. Since there are many ways to conceptualize power, there are correspondingly
many avenues by which an AI system could seek it.
People may use AIs to pursue power. Many humans want power, and some
dedicate their lives to accruing it. Corporations want profit and influence, militaries
want to win wars, and individuals want status and recognition. We can expect at
least some AI systems to be given open-ended, long-term goals that explicitly involve
gaining power, such as “Do whatever it takes to earn as much money as possible.”
Structural realists argue that the anarchic structure of the international system
compels states to seek power [107]. In the international system, states could be
harmed or destroyed by other powerful states, and since there is no ultimate
authority guaranteed to protect them, states are forced to compete for power in
order to survive.
Assumptions that give rise to power seeking. To explain why power seeking
is the main instrumental goal driving the behavior of states, structural realists
base their explanations on two key assumptions:
1. Self-help system. States operate in a “self-help” system [107] where there is
no centralized authority, no hierarchy (“anarchic”), and no ultimate arbiter
standing above states in international relations. So to speak, when states
dial 911, there is no one on the other end. This stands in contrast to the
hierarchical ordering principle seen in domestic politics.
2. Self-preservation is the main goal. Survival through the pursuit of a state’s
own self-interest takes precedence over all other goals. Though states can
act according to moral considerations or global welfare, these will always
be secondary to acquiring resources, alliances, and military capabilities to
ensure their safety and counter potential threats [239].
Structural realists make other assumptions, including that states have some
potential to inflict harm on others, that states are rational agents (with a
discount rate that is not extremely sharp), and that other states’ intentions
are not completely certain.
When these assumptions are met, structural realists predict that states will
mainly act in ways to defend or expand their power. For structural realists,
power is the primary currency (e.g., military, economic, technological, and
diplomatic power). As we can see, structural realists do not need to make
strong assumptions about states themselves [240]. For structural realists, states
are treated like black boxes—their value system or regime type doesn’t play
a significant role in predicting their behavior. The architecture of the system
traps them and largely determines their behavior, which is that they must seek
power as a means to survive. The result is an unceasing power competition.
Power seeking is not necessarily dominance seeking [241]. Within
structural realism, there is a notable division concerning the question of how
much power states should seek. Defensive realists, like Kenneth Waltz, argue
that trying to maximize a country’s power in the world is unwise because it
can lead to punishment from the international system. Pursuing hegemony,
in their view, is particularly risky. On the other hand, offensive realists, like
John Mearsheimer, believe that gaining as much power as possible is strate-
gically sensible, and under certain circumstances, pursuing hegemony can be
beneficial.
Dynamics that maintain a balance of power. Closely associated with
structural realism is the concept of balancing. Balancing refers to the strategies
states use to check the power of rival states, for example by building up their own
capabilities or by forming alliances.
As discussed in the box above, there are environmental conditions that can make
power seeking instrumentally rational. This section describes how there may be anal-
ogous environmental pressures that could cause AI agents to seek power in order to
achieve their goals and ensure their own survival. Using the assumptions of struc-
tural realism listed above, we discuss how analogous assumptions could be satisfied in
contexts with AIs. We then explore how AIs could seek power defensively, by build-
ing their own strength, or offensively, by weakening other agents. Finally, we discuss
strategies for discouraging AI systems from seeking power.
AI agents might not have the protection of a higher authority. The other
main assumption we need to establish is that some AIs might find themselves within a
self-help system in some circumstances. First, note that agents who entrust their self-defense to a
powerful central authority have less of a reason to seek power. When threatened, they
do not need to personally combat the aggressor, but can instead ask the authority
for protection. For example, individual citizens in a country with a reliable police
force often entrust their own protection to the government. On the other hand, in-
ternational great powers are responsible for their own protection, and therefore seek
military power to defend against rival nations.
AI systems could face a variety of situations where no central authority defends them
against external threats. We give four examples. First, if there are some autonomous
AI systems outside of corporate or government control, they would not necessarily
have rights, and they would be responsible for their own security and survival. Second,
for AI systems involved in criminal activities, seeking protection from official channels
could jeopardize their existence, leaving them to amass power for themselves, much
like crime syndicates. Third, instability could cause AI systems to exist in a self-help
system. If a corporation could be destroyed by a competitor, an AI it operates may not
have a higher authority to protect it; if the world faces an extremely lethal pandemic or
world war, civilization may become unstable and turbulent, leaving AIs without a
reliable source of protection. Such AI systems might use cyber attacks
to break out of human-controlled servers and spread themselves across the internet.
There, they could autonomously defend their own interests, bringing us back to the first
example. Fourth, in the future, AI systems could be tasked with advising political
leaders or helping operate militaries. In these cases, they would seek power for the
same reasons that states today seek power.
Other conditions for power seeking could apply. We now discuss the other
minor assumptions needed to establish that the environment may pressure AIs to
compete for power. First, AIs can be harmed, so they might rationally seek power
in order to defend themselves; for example, AIs could be destroyed by being hacked.
Second, AI agents are often given long-term goals and are often designed to be ratio-
nal. Third, AI agents may be uncertain about the intentions of other agents, leaving
agents unable to credibly promise that they will act peacefully.
When these five conditions hold—and they may not hold at all times—AI systems
would be in a similar position to nations that seek power to ensure their own security.
We now discuss how we could reduce the chance that the environment pressures AIs
to engage in power-seeking behavior.
Power-seeking AI, when deployed broadly and in high-stakes situations, might cause
catastrophic outcomes. As we will describe in this section, misaligned power-seeking
systems would be adversarial in a way that most hazards are not, and thus may be
particularly challenging to counteract.
Power decreases the margin for error. On its own, power is neither good nor
bad. That said, more powerful systems can cause more damage, and it is easier to
destroy than to create. As AI systems make decisions at greater scale, the scope of
potential catastrophes involving misuse or rogue AI increases.
Powerful AI systems could pose unique threats. Powerful AI systems pose
a unique risk since they may actively wield their power to counteract attempts to
correct or control them [106]. If AI systems are power seeking and do not share our
values (possibly due to inadequate proxies), they could become a problem that resists
being solved. The more capable these systems become, the better able they will be
at anticipating and reacting to our countermeasures, and the harder it becomes to
defend against them.
When evaluating risks from AI systems, we want to understand not only whether
a model is theoretically capable of doing something harmful, but also whether it
has a propensity to do this. By controlling a model’s propensities, we might be able
to ensure that even models that have potentially hazardous capabilities do not use
these in practice. One way of breaking down the challenge of controlling AI systems
is therefore to address potentially hazardous capabilities and harmful propensities separately.
AI can be used to help defend against the potential risks it poses. Three important
examples of this approach are use of AI to improve defenses against cyber-attacks,
to enhance security against pandemics, and to improve the information environment.
This philosophy of leveraging AI’s capabilities for societal resilience has been called
“systemic safety” [245], while the broader idea of focusing on technologies that defend
against societal risks is sometimes described as “defensive accelerationism” [246].
AI could make cyber-attacks more dangerous by reducing the barrier to entry to
hacking and by identifying ways to increase the potency, success rate, scale, speed,
and stealth of attacks. There is some early evidence of AI systems' abilities in this
domain, and reducing the cost and difficulty
of cyber-attacks could give attackers a major advantage [248]. Since these attacks
can undermine critical physical infrastructure such as power grids, they could prove
highly destabilizing and threaten international security.
However, AI could also be used to find vulnerabilities in codebases, shoring up
defenses against AI-enabled hacking. If applied appropriately, this could shift the
offense-defense balance of cyber-attacks and reduce the risk of catastrophic attacks
on public services and critical infrastructure [245]. For example, AI could be used to
monitor systems and networks to detect unauthorized access or imposter threats. ML
methods could analyze source code to find malicious payloads. ML models could mon-
itor program behaviors to flag abnormal network traffic. Such systems could provide
early attack warnings and contextual explanations of threats. Advances in code trans-
lation and generation also raise the possibility that future AI could not just identify
bugs, but also automatically patch them by editing code to fix vulnerabilities.
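As a small illustration of the monitoring idea above, the sketch below fits a standard anomaly detector (scikit-learn's IsolationForest) to hypothetical network-flow features and flags unusual flows. The feature set, numbers, and contamination rate are invented for illustration and are not drawn from any deployed system.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical features per network flow: [bytes sent, bytes received,
# duration (s), number of destination ports contacted].
rng = np.random.default_rng(0)
normal_traffic = rng.normal(loc=[5e4, 2e5, 30, 3], scale=[1e4, 5e4, 10, 1],
                            size=(5000, 4))

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal_traffic)          # learn what "ordinary" flows look like

# New flows to screen: one typical flow and one port-scan-like flow.
new_flows = np.array([
    [5.2e4, 1.9e5, 28, 3],     # looks ordinary
    [1e3,   5e2,   2,  500],   # tiny payload, hundreds of ports -> suspicious
])
print(detector.predict(new_flows))    # +1 = normal, -1 = flagged as anomalous
```

A real intrusion-detection pipeline would combine many such detectors with signature-based rules and human review, but the basic pattern of learning a model of normal behavior and flagging deviations is the same.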
Many important decisions rely on human forecasts of future events, but ML systems
may be able to make more accurate predictions by aggregating larger volumes of
unstructured information [245]. ML tools could analyze disparate data sources to
forecast geopolitical, epidemiological, and industrial developments over months or
years. The accuracy of such systems could be evaluated by their ability to retroac-
tively predict pivotal historical events. Additionally, ML systems could help identify
key questions, risks, and mitigation strategies that human forecasters may overlook.
By processing extensive data and learning from diverse situations, AI advisors could
provide relevant prior scenarios and relevant statistics such as base rates. They could
also identify stakeholders to consult, metrics to track, and potential trade-offs to
consider. In this way, ML forecasting and advisory systems could enhance judgment
and correct misperceptions, ultimately improving high-stakes decision making and re-
ducing inadvertent escalation risks. However, safeguards would be needed to prevent
overreliance and to avoid encouraging inappropriate risk-taking.
While more capable AI systems can be more reliable, they can also be more dangerous.
Often, though not always, safety and general capabilities are hard to disentangle.
Because of this interdependence, it is important for researchers aiming to make AI
systems safer to carefully avoid increasing risks from more powerful AI systems.
General capabilities have a mixed effect on safety. Systems that are more
generally capable tend to make fewer mistakes. If the consequences of failure are
dire, then advancing capabilities can reduce risk factors. However, as we discuss in
the Safety Engineering chapter, safety is an emergent property and cannot be reduced
to a collection of metrics. Improving general capabilities may remove some hazards,
but it does not necessarily make a system safer. For example, more accurate image
classifiers make fewer errors, and systems that are better at planning are less likely
to generate plans that fail or that are infeasible. More capable language models are
better able to avoid giving harmful or unhelpful answers. When mistakes are harmful,
more generally capable models may be safer. On the other hand, systems that are
more generally capable can be more dangerous and exacerbate control problems.
For example, AI systems with better reasoning capacity could be better able to
deceive humans, and AI systems that are better at optimizing proxies may be better
at gaming those metrics. As a result, improvements in general capabilities may be
overall detrimental to safety and hasten the onset of catastrophic risks.
3.7 CONCLUSION
In this chapter, we discussed several key themes: we do not know how to instill our
values robustly in individual AI systems, we are unable to predict future AI systems,
and we cannot reliably evaluate AI systems. We now discuss each in turn.
Area: Transparency
• This involves better monitoring and controlling the inner workings of systems based on DL
• Both bottom-up approaches (i.e., mechanistic interpretability) and top-down approaches (i.e., representation engineering) are valuable to explore
3.8 LITERATURE
Safety Engineering
Safe design principles and component failure accident models. The field
of safety engineering has identified multiple “safe design principles” that can be built
into a system to robustly improve its safety. We will describe these principles and con-
sider how they might be applied to systems involving AI. Next, we will outline some
traditional techniques for analyzing a system and identifying the risks it presents.
Although these methods can be useful in risk analysis, they are insufficient for com-
plex and sociotechnical systems, as they rely on assumptions that are often overly
simplistic.
Systemic factors and systemic accident models. After exploring the limita-
tions of component failure accident models, we will show that it can be more effective
to address overarching systemic factors than all the specific events that could directly
cause an accident. We will then describe some more holistic approaches to risk anal-
ysis and reduction. Systemic models rely on complex systems, which we look at in
more detail in the next chapter.
Tail events and black swans. In the final section of this chapter, we will in-
troduce the concept of tail events—events characterized by high impact and low
probability—and show how they interfere with standard methods of risk estimation.
We will also look at a subset of tail events called black swans, or unknown unknowns,
which are tail events that are largely unpredictable. We will discuss how emerging
technology, including AI, might entail a risk of tail events and black swans, and we
will show how we can reduce those risks, even if we do not know their exact nature.
Failure modes, hazards, and threats are basic terms in a safety engineer's vocabulary.
We will now define and give examples of each term.
A failure mode is a specific way a system could fail. There are many
ways in which different systems can fail to carry out their intended functions. A
valve leaking fluid could prevent the rest of the system from working, a flat tire can
prevent a car from driving properly, and losing connection to the Internet can drop
a video call. We can refer to all these examples as failure modes of different systems.
Possible failure modes of AI include AIs pursuing the wrong goals, or AIs pursuing
simplified goals in the most efficient possible way, without regard for unintended side
effects.
A hazard is a potential source of harm. Everyday examples of hazards are stray
electrical wires and broken glass. Note that a hazard does not pose a risk
automatically; a shark is a hazard but does not pose a risk if no one goes in the water.
For AI systems, one possible hazard is a rapidly changing environment (“distribution
shift”) because an AI might behave unpredictably in conditions different from those
it was trained in.
A threat is a hazard with malicious or adversarial intent. If an individual
is deliberately trying to cause harm, they present a specific type of hazard: a threat.
Examples of threats include a hacker trying to exploit a weakness in a system to
obtain sensitive data or a hostile nation gaining more sophisticated weapons. One
possible AI-related threat is someone deliberately contaminating training data to
cause an AI to make incorrect and potentially harmful decisions based on hidden
malicious functionality.
The total risk of a system is the sum of the risks of all associated haz-
ards. In general, there may be multiple hazards associated with a system or situa-
tion. For example, a car driving safely depends on many vehicle components function-
ing as intended and also depends on environmental factors, such as weather conditions
and the behavior of other drivers and pedestrians. There are therefore multiple haz-
ards associated with driving. To find the total risk, we can apply the risk equation
to each hazard separately and then add the results together.
\[
\mathrm{Risk} = \sum_{\mathrm{hazard}} P(\mathrm{hazard}) \times \mathrm{severity}(\mathrm{hazard})
\]
We may not always have exact numerical values. We may not always be
able to assign exact quantities to the probability and severity of all the hazards,
and may therefore be unable to precisely quantify total risk. However, even in these
circumstances, we can use estimates. If estimates are difficult to obtain, it can still be
useful to have an equation that helps us understand how different factors contribute
to risk.
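For instance, a back-of-the-envelope risk tally for a hypothetical system might look like the sketch below; the hazards, probabilities, and severities are invented solely to show how the equation combines them.

```python
# Hypothetical hazards for an AI-assisted service, each with an estimated
# annual probability and a severity in arbitrary damage units.
hazards = {
    "proxy gaming":       (0.05, 80),
    "data poisoning":     (0.02, 60),
    "distribution shift": (0.10, 20),
}

# Total risk = sum over hazards of P(hazard) x severity(hazard).
total_risk = sum(p * severity for p, severity in hazards.values())
print(f"Total risk estimate: {total_risk:.1f} damage units per year")
```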
We should aim for risk reduction rather than trying to achieve zero risk.
It might be an appealing goal to reduce the risk to zero by seeking ways of reducing
the probability or severity to zero. However, in the real world, risk is never zero. In
the AI safety research community, for example, some talk of “solving the alignment
problem”—aligning AI with human values perfectly. This could, in theory, result in
zero probability of AIs making a catastrophic decision and thus eliminate AI risk
entirely.
However, reducing risk to zero is likely impossible. Framing the goal as eliminat-
ing risk implies that finding a perfect, airtight solution for removing risk is possible
and realistic. Focusing narrowly on this goal could be counterproductive, as it might
distract us from developing and implementing practical measures that significantly
reduce risk. In other words, we should not “let the perfect be the enemy of the good.”
When thinking about creating AI, we do not talk about “solving the intelligence prob-
lem” but about “improving capabilities.” Similarly, when thinking about AI safety, we
should not talk about “solving the alignment problem” but rather about “making AI
safer” or “reducing risk from AIs.” A better goal could be to make catastrophic risks
negligible (for instance, less than a 0.01% chance of an existential catastrophe per century)
rather than trying to have the risk become exactly zero.
The classic risk equation is a useful starting point for evaluating risk. However, if we
have more information about the situation, we can break down the risk from a hazard
into finer categories. First we can think about the intrinsic hazard level, which is a
shorthand for probability and severity as in the classic risk equation. Additionally,
we can consider how the hazard interacts with the people at risk: we can consider
the amount of exposure and the level of vulnerability [257].
Figure 4.1. Risk can be broken down into exposure, probability, severity, and vulnerability.
Probability and severity together determine the “intrinsic hazard level.”
Note that probability and severity are mostly about the hazard itself, whereas expo-
sure and vulnerability tell us more about those subject to the risk.
Not all risks can be calculated precisely, but decomposition still helps
reduce them. An important caveat to the disaster risk equation is that not all
risks are straightforward to calculate, or even to predict. Nonetheless, even if we
cannot put an exact number on the risk posed by a given hazard, we can still reduce
it by decreasing our exposure or vulnerability, or the intrinsic hazard level itself,
where possible. Similarly, even if we cannot predict all hazards associated with a
system—for example if we face a risk of unknown unknowns, which are explored
later in this chapter—we can still reduce the overall risk by addressing the hazards
we are aware of.
Example hazard: proxy gaming. Consider proxy gaming, a risk we face from
AIs that was discussed in the Single Agent Safety chapter. Proxy gaming might arise
when we give AI systems goals that are imperfect proxies of our goals. An AI might
then learn to “game” or over-optimize these proxies in unforeseen ways that diverge
from human values. We can tackle this threat in many different ways:
1. Reduce our exposure to this proxy gaming hazard by improving our abilities to
monitor anomalous behavior and flag any signs that a system is proxy gaming at
an early stage.
2. Reduce the hazard level by making AIs want to optimize an idealized goal, and
make mistakes less hazardous by limiting the power the AI has, so that if it does
over-optimize the proxy, it does less harm.
3. Reduce our vulnerability by making our proxies more accurate, by making AIs
more adversarially robust, or by reducing our dependence on AIs.
Adding ability to cope can improve the disaster risk equation. There are
other factors that could be included in the disaster risk equation. We can return
to our example of the slippery floor to illustrate one of these factors. After slipping
on the floor, we might take less time to recover if we have access to better medical
technology. This tells us to what extent we would be able to recover from the damage
the hazard caused. We can refer to the capacity to recover as our ability to cope.
Unlike the other factors that multiply together to give us an estimate of risk, we
might divide by ability to cope to reduce our estimate of the risk if our ability to
cope with it is higher. This is a common extension to the disaster risk equation.
Some hazards are extremely damaging and eliminate any chance of recovery: the
severity of the hazard and our vulnerability are high, while our ability to cope is
tiny. This constitutes a risk of ruin: permanent, system-complete destruction. In
this case, the equation would involve multiplying together two large numbers and
dividing by a small number; we would calculate the risk as being extremely large. If
the damage cannot be recovered from, like an existential catastrophe (e.g., a large
asteroid or sufficiently powerful rogue AIs), the risk equation would tell us that the
risk is astronomically large or infinite.
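As a rough sketch of this extended equation, with factor values invented purely for illustration, the arithmetic could be written as follows.

```python
def disaster_risk(probability, severity, exposure, vulnerability, ability_to_cope):
    """Risk ~ (probability x severity x exposure x vulnerability) / ability to cope.
    As ability to cope approaches zero, the estimated risk grows without bound,
    capturing risks of ruin."""
    return probability * severity * exposure * vulnerability / ability_to_cope

# A recoverable hazard versus one that leaves almost no room for recovery.
print(disaster_risk(0.1, 10, 1.0, 0.5, ability_to_cope=5.0))      # modest risk
print(disaster_risk(0.01, 1000, 1.0, 0.9, ability_to_cope=0.01))  # near-ruin
```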
In summary, we can decompose risks further in terms of our level of exposure to them and how vulnerable we are
to damage from them, as well as our ability to cope. Even if we cannot assign an
exact numerical value to a risk, we can estimate it. If our estimates are unreliable,
this decomposition can still help us to systematically identify practical measures we
can take to reduce the different factors and thus the overall risk.
In the above discussion of risk evaluation, we have frequently referred to the prob-
ability of an adverse event occurring. When evaluating a system, we often instead
refer to the inverse of this—the system’s reliability, or the probability that an adverse
event will not happen, usually presented as a percentage or decimal. We can relate
system reliability to the amount of time that a system is likely to function before
failing. We can also introduce a new measure of reliability that conveys the expected
time before failure more intuitively.
The more often we use a system, the more likely we are to encounter a
failure. While a system might have an inherent level of reliability, the probability
of encountering a failure also depends on how many times it is used. This is why,
as discussed above, increasing exposure to a hazard will increase the associated level
of risk. An autonomous vehicle, for example, is much more likely to make a mistake
during a journey where it has to make 1,000 decisions than during a journey where
it has to make only 10 decisions.
TABLE 4.1 From each level of system reliability, we can infer its probability of mistake,
"nines of reliability," and expected time before failure.

Reliability (%)   Probability of mistake (%)   Nines of reliability   Expected time before failure
0                 100                          0                      1
50                50                           0.3                    2
75                25                           0.6                    4
90                10                           1                      10
99                1                            2                      100
99.9              0.1                          3                      1000
99.99             0.01                         4                      10,000
For a given level of reliability, we can calculate an expected time before failure.
Imagine that we have several autonomous vehicles with different levels of reliability,
as shown in Table 4.1. Reliability is the probability that the vehicle will get any
given decision correct. The second column shows the complementary probability: the
probability that the AV will get any given decision wrong. The fourth column shows
the number of decisions within which the AV is expected to make one mistake. This
can be thought of as the AV’s expected time before failure.
Expected time before failure does not scale linearly with system relia-
bility. We plot the information from the table in Figure 4.2. From looking at this
graph, it is clear that the expected time before failure does not scale linearly with the
system’s reliability. A 25% change that increases the reliability from 50% to 75%, for
example, doubles the expected time before failure. However, a 9% change increasing
the reliability from 90% to 99% causes a ten-fold increase in the expected time before
failure, as does a 0.9% increase from 99% to 99.9%.
Figure 4.2. Halving the probability of a mistake doubles the expected time before failure.
Therefore, the relationship between system reliability and expected time before failure is
non-linear.
The closer we get to 100% reliability, the more valuable any given increment of
improvement will be. However, as we get closer to 100% reliability, we can generally
expect that an increment of improvement will become increasingly difficult to obtain.
This is usually true because it is hard to perfectly eliminate the possibility of any
adverse event. Additionally, there may be risks that we have not considered. These
are called unknown unknowns and will be discussed extensively later in this chapter.
A system with 3 “nines of reliability” is functioning 99.9% of the time.
As we get close to 100% reliability, it gets inconvenient to use long decimals to express
how reliable a system is. The third column in Table 4.1 gives us information about
a different metric: the nines of reliability [258]. Informally, a system has nines of
reliability equal to the number of nines at the beginning of its decimal or percentage
reliability. One nine of reliability means a reliability of 90% in percentage terms or
0.9 in decimal terms. Two nines of reliability mean 99%, or 0.99. We can denote a
system’s nines of reliability with the letter k; if a system is 90% reliable, it has one
nine of reliability and so k = 1; if it is 99% reliable, it has two nines of reliability,
and so k = 2. Formally, if p is the system’s reliability expressed as a decimal, we can
define k, the nines of reliability a system possesses, as:
k = − log10 (1 − p).
Figure 4.3. When we plot the nines of reliability against the expected time before failure on
a logarithmic scale, the result is a straight line.
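The formula above, k = −log10(1 − p), together with the expected number of uses before a failure, 1/(1 − p), as implied by Table 4.1, can be checked with a few lines of code; the reliabilities below are simply the rows of the table.

```python
import math

def nines_of_reliability(p):
    """k = -log10(1 - p), for reliability p expressed as a decimal."""
    return -math.log10(1 - p)

def expected_time_before_failure(p):
    """Expected number of uses before one failure: 1 / (1 - p)."""
    return 1 / (1 - p)

for p in [0.5, 0.75, 0.9, 0.99, 0.999, 0.9999]:
    print(f"reliability {p:.2%}: {nines_of_reliability(p):.1f} nines, "
          f"~{expected_time_before_failure(p):,.0f} uses before a failure")
```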
The nines of reliability are only a measure of probability, not risk. In the
framing of the classic risk equation, the nines of reliability only contain information
about the probability of a failure, not about what its severity would be. This metric
is therefore incomplete for evaluating risk. If an AI has three nines of reliability, for
example, we know that it is expected to make 999 out of 1000 decisions correctly.
However, three nines of reliability tells us nothing about how much damage the agent
will do if it makes an incorrect decision, so we cannot calculate the risk involved in
using the system. A game playing AI will present a lot less risk than an autonomous
vehicle even if both systems have three nines of reliability.
We can reduce both the probability and severity of a system failure by following
certain safe design principles when designing it. These general principles have been
identified by safety engineering and offer practical ways of reducing the risk associ-
ated with all kinds of systems. They should be incorporated from the outset, rather
than being retrofitted later. This strategy attempts to “build safety into” a system
and is more robust than building the system without safety considerations and then
attempting to fix individual problems if and when they become apparent.
Note that these principles are not only useful in building an AI itself, but also the
system around it. For example, we can incorporate them into the design of the cyber-
security system that controls who is able to access an AI, and into the operations of
the organization, or system of humans, that is creating an AI.
We will now explore eight of these principles and how they might be applied to AI
systems:
1. Redundancy: having multiple backup components that can perform each critical
function, so that a single component failing is not enough to cause an accident.
2. Transparency: ensuring that operators have enough knowledge of how a system
functions under various circumstances to interact with it safely.
3. Separation of duties: having multiple agents in charge of subprocesses so that
no individual can misuse the entire system alone.
4. Principle of least privilege: giving each agent the minimum access necessary
to complete their tasks.
5. Fail-safes: ensuring that the system will be safe even if it fails.
6. Antifragility: learning from previous failures to reduce the likelihood of failing
again in future.
7. Negative feedback mechanisms: building in mechanisms that automatically counteract
or dampen a process when it starts to deviate from safe operation.
8. Defense in depth: layering multiple defenses so that an accident requires several
of them to fail at once.
Note that, depending on the exact type of system, some of these safe design principles
might be less useful or even counterproductive. We will discuss this later on in the
chapter. However, for now, we will explore the basic rationale behind why each one
improves safety.
4.3.1 Redundancy
Redundancy means having multiple “backup” components [257]. Having
multiple braking systems in a vehicle means that, even if the foot brake is not working
well, the handbrake should still be able to decelerate the vehicle in an emergency.
A failure of a single brake should therefore not be enough to cause an accident.
This is an example of redundancy, where multiple components can perform a critical
function, so that a single component failing is not enough to cause the whole system
to fail. In other words, redundancy removes single points of failure. Other examples
of redundancy include saving important documents on multiple hard drives, in case
one of them stops working, and seeking multiple doctors’ opinions, in case one of
them gets a diagnosis wrong.
A possible use of redundancy in AI would be having an inbuilt “moral parliament”
(see Moral Uncertainty in the Beneficial AI and Machine Ethics chapter). If an AI
agent has to make decisions with moral implications, we are faced with the question of
which theory of morality it should follow; there are many of these, and each often has
counterintuitive recommendations in extreme cases. Therefore, we might not want an
AI to adhere strictly to just one theory. Instead, we could use a moral parliament, in
which we emulate representatives of stakeholders or moral theories, let them negotiate
and vote, and then do what the parliament recommends. The different theories would
essentially be redundant components, each usually recommending plausible actions
but unable to dictate what happens in extreme cases, reducing the likelihood of
counterintuitive decisions that we would consider harmful.
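To illustrate the idea, a moral parliament could be caricatured as a weighted vote over candidate actions; the theories, weights, and scores below are hypothetical placeholders, and a real parliament would involve negotiation rather than a single tally.

```python
# Each "delegate" scores the candidate actions from its moral theory's
# perspective; weights reflect our credence in each theory.
parliament = {
    # theory            (weight, {action: score in [0, 1]})
    "utilitarian":     (0.40, {"act_A": 0.9, "act_B": 0.6, "act_C": 0.2}),
    "deontological":   (0.35, {"act_A": 0.3, "act_B": 0.8, "act_C": 0.7}),
    "contractualist":  (0.25, {"act_A": 0.5, "act_B": 0.7, "act_C": 0.6}),
}

def parliament_choice(parliament):
    actions = next(iter(parliament.values()))[1].keys()
    tally = {a: sum(w * scores[a] for w, scores in parliament.values())
             for a in actions}
    return max(tally, key=tally.get), tally

choice, tally = parliament_choice(parliament)
print(choice, tally)   # no single theory dictates the outcome in edge cases
```

Because every theory contributes but none has a majority on its own, an action that one theory scores highly yet others find abhorrent is unlikely to win, which is exactly the redundancy we want.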
4.3.2 Separation of Duties
Separation of duties means having multiple agents in charge of different subprocesses, so
that no individual can misuse the entire system alone. For example, if a laboratory stores
two chemicals that are harmless on their own but dangerous when combined, we might
keep the stock of the two chemicals in separate cupboards, and have a different person in charge
of supplying each one in small quantities to researchers. This way, no individual has
access to a large amount of both chemicals.
4.3.3 Principle of Least Privilege
Each agent should have only the minimum power needed to complete
their tasks [257]. As discussed above, separating duties should reduce individ-
uals’ capacity to misuse the system. However, separation of duties might only work
if we also ensure individuals do not have access to parts of the system that are not
relevant to their tasks. This is called the principle of least privilege. In the example
above, we ensured separation of duties by putting chemicals in different cupboards
with different people in charge of them. To make this more likely to mitigate risks, we
might want to ensure that these cupboards are locked so that everyone else cannot
access them at all.
Similarly, for systems involving AIs, we should ensure that each agent only has access
to the necessary information and power to complete its tasks with a high level of
reliability. Concretely, we might want to avoid plugging AIs into the internet or
giving them high-level admin access to confidential information. In the Single-Agent
Safety chapter, we considered how AIs might be power-seeking; by ensuring AIs
have only the minimum required amount of power they need to accomplish the goals
we assign them, we can reduce their ability to gain power.
4.3.4 Fail-Safes
Fail-safes are features that aim to ensure a system will be safe even if
it fails [257]. When systems fail, they stop performing their intended function,
but some failures also cause harm. Fail-safes aim to limit the amount of harm caused
even if something goes wrong. Elevator brakes are a classic example of a fail-safe
feature. They are attached to the outside of the cabin and are held open only by the
tension in the cables that the cabin is suspended on. If tension is lost in the cables,
the brakes automatically clamp shut onto the rails in the elevator shaft. This means
that, even if the cables break, the brakes should prevent the cabin from falling; even
if the system fails in its function, it should at least be safe.
A possible fail-safe for AI systems might be a component that tracks the level of
confidence an agent has in its own decisions. The system could be designed to stop
enacting decisions if this component falls below a critical level of certainty that the
decision is correct. There could also be a component that monitors the probability of
the agent’s decisions causing harm, and the system could be designed to stop acting
on decisions if it reaches a specified likelihood of harm. Another example would be a
kill switch that makes it possible to shut off all instances of an AI system if this is
required due to malfunction or other reasons.
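One minimal sketch of such a fail-safe is a wrapper that only enacts a decision when the reported confidence is high enough and the estimated probability of harm is low enough; the thresholds, inputs, and fallback behavior here are hypothetical.

```python
def fail_safe_act(decision, confidence, harm_probability,
                  min_confidence=0.9, max_harm_probability=0.01):
    """Only enact a decision when the system is confident enough and the
    estimated chance of harm is small enough; otherwise fail safely by
    doing nothing and deferring to a human operator."""
    if confidence < min_confidence or harm_probability > max_harm_probability:
        return "DEFER_TO_HUMAN"          # the safe default when a check fails
    return decision

print(fail_safe_act("dispatch shipment", confidence=0.97, harm_probability=0.001))
print(fail_safe_act("dispatch shipment", confidence=0.60, harm_probability=0.001))
```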
4.3.5 Antifragility
Antifragile systems become stronger from encountering adversity [259].
The idea of an antifragile system is that it will not only recover after a failure or a
near miss but actually become more robust from these “stressors” to potential future
failures. Antifragile systems are common in the natural world and include the human
body. For example, weight-bearing exercises put a certain amount of stress on the
body, but bone density and muscle mass tend to increase in response, improving the
body’s ability to lift weight in the future.
Similarly, after encountering or becoming infected with a pathogen and fighting it
off, a person’s immune system tends to become stronger, reducing the likelihood of
reinfection. Groups of people working together can also be antifragile. If a team is
working toward a given goal and they experience a failure, they might examine the
causes and take steps to prevent it from happening again, leading to fewer failures in
the future.
Designing AI systems to be antifragile would mean allowing them to continue learning
and adapting while they are being deployed. This could give an AI the potential to
learn when something in its environment has caused it to make a bad decision. It
could then avoid making the same mistake if it finds itself in similar circumstances
again.
Antifragility can require adaptability. Creating antifragile AIs often means
creating adaptive ones: the ability to change in response to new stressors is key to
making AIs robust. If an AI continues learning and adapting while being deployed,
it could learn to avoid hazards, but it could also develop unanticipated and unde-
sirable behaviors. Adaptive AIs might be harder to control. Such AIs are likely to
continuously evolve, creating new safety challenges as they develop different behav-
iors and capabilities. This tendency of adaptive systems to evolve in unexpected ways
increases our exposure to emergent hazards.
A case in point is the chatbot Tay, which was released by Microsoft on Twitter in
2016. Tay was designed to simulate human conversation and to continue improving by
learning from its interactions with humans on Twitter. However, it quickly started
tweeting offensive remarks, including seemingly novel racist and sexist comments.
This suggested that Tay had statistically identified and internalized some biases that
it could then independently assert. As a result, the chatbot was taken offline af-
ter only 16 hours. This illustrates how an adaptive, antifragile AI can develop in
unpredicted and undesirable ways when deployed in natural settings. Human opera-
tors cannot control natural environments, so system designers should think carefully
about whether to use adaptive AIs.
4.3.7 Transparency
Transparency means people know enough about a system to interact with
it safely [257]. If operators do not have sufficient knowledge of a system’s func-
tions, then it is possible they could inadvertently cause an accident while interacting
with it. It is important that a pilot knows how a plane’s autopilot system works, how
to activate it, and how to override it. That way, they will be able to override it when
they need to, and they will know how to avoid activating it or overriding it acciden-
tally when they do not mean to. This safe design principle is called transparency.
Research into AI transparency aims to design DL systems in ways that give operators
a greater understanding of their internal decision-making processes. This would help
operators maintain control, anticipate situations in which systems might make poor
or deceptive decisions, and steer them away from hazards.
There are multiple features we can build into a system from the design stage to
make it safer. We have discussed redundancy, separation of duties, the principle of
least privilege, fail-safes, antifragility, negative feedback mechanisms, transparency
and defense in depth as eight examples of such principles. Each one gives us concrete
recommendations about how to design (or how not to design) AI systems to ensure
that they are safer for humans to use.
The Swiss cheese model helps us analyze defenses and identify pathways
to accidents [261]. The diagram in Figure 4.4 shows multiple slices of Swiss
cheese, each representing a particular defense feature in a system. The holes in a
slice represent the weaknesses in a defense feature—the ways in which it could be
bypassed. If there are any places where holes in all the slices line up, creating a
continuous hole through all of them, this represents a possible route to an accident.
This model highlights the importance of defense in depth, since having more layers
of defense reduces the probability of there being a pathway to an accident that can
bypass them all.
Figure 4.4. In the Swiss cheese model, each layer of defense (safety culture, red teaming,
etc.) is a slice with its own holes. With enough layers, we hope to avoid pathways
that can bypass them all.
Consider the example of an infectious disease as a hazard. There are multiple possible
defenses that reduce the risk of infection. Preventative measures include avoiding
large gatherings, wearing a mask and regularly washing hands. Protective measures
include maintaining a healthy lifestyle to support a strong immune system, getting
vaccinated, and having access to healthcare. Each of these defenses can be considered
a slice of cheese in the diagram.
However, none of these defenses are 100% effective. Even if an individual avoids large
gatherings, they could still become infected while buying groceries. A mask might
mostly block contact with the pathogen, but some of it could still get through. Vacci-
nation might not protect those who are immunocompromised due to other conditions,
and may not be effective against all variants of the pathogen. These imperfections
are represented by the holes in the slices of cheese. From this, we can infer various
potential routes to an accident, such as an immunocompromised person with a poorly
fitting mask in a shopping mall, or someone who has been vaccinated encountering
a new variant at the shops that can evade vaccine-induced immunity.
We can improve safety by increasing the number of slices, or by reducing
the holes. Adding more layers of defense will reduce the chances of holes lining up
to create an unobstructed path through all the defenses. For example, adopting more
of the practices outlined above would reduce an individual’s chances of infection more
than if they adopt just one.
Similarly, reducing the size of a hole in any layer of defense will reduce the probability
that it will overlap with a hole in another layer. For example, we could reduce the
weaknesses in wearing a mask by getting a better-fitting, more effective mask. Scien-
tists might also develop a vaccine that is effective against a wider range of variants,
thus reducing the weaknesses in vaccination as a layer of defense.
We can think of layers of defense as giving us additional nines of relia-
bility. In many cases, it seems reasonable to assume that adding a layer of defense
helps reduce remaining risks by approximately an order of magnitude by eliminat-
ing 90% of the risks still present. Consider how adding successive layers of defense,
as in the following example, can give our AIs additional nines of reliability.
Swiss cheese model for emergent capabilities. To reduce the risk of unex-
pected emergent capabilities, multiple lines of defense could be employed. For exam-
ple, models could be gradually scaled (e.g., using 3× more compute than the previous
training run, rather than a larger number such as 10×); as a result, there will be
fewer emergent capabilities to manage. An additional layer of defense is screening
for hazardous capabilities, which could involve deploying comprehensive test beds,
and red teaming with behavioral tests and representation reading. Another defense
is staged releases; rather than release the model to all users at once, gradually re-
lease it to more and more users, and manage discovered capabilities as they emerge.
Finally, post-deployment monitoring through anomaly detection adds another layer
of defense.
Each of these aims at largely different areas, with the first focusing on robustness, the
second on control, and the third on monitoring. By ensuring we have many defenses,
we prevent a wider array of risks, improving our system's reliability by many nines.
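Under the rough approximation above that each additional, independent layer removes about 90% of the remaining risk, the arithmetic can be made explicit as in the sketch below (independence is a strong and often optimistic assumption).

```python
def residual_risk(initial_risk, layer_effectiveness):
    """Multiply out the fraction of risk that each defense layer fails to catch.
    layer_effectiveness: per-layer catch rates, e.g. 0.9 = 90% of remaining risk caught."""
    remaining = initial_risk
    for effectiveness in layer_effectiveness:
        remaining *= (1 - effectiveness)
    return remaining

# Four layers (gradual scaling, capability screening, staged release,
# post-deployment monitoring), each assumed to catch ~90% of remaining risk.
print(residual_risk(1.0, [0.9, 0.9, 0.9, 0.9]))   # -> 0.0001, i.e. four extra nines
```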
Figure 4.5. The bow tie diagram can tie together hazards and their consequences with control
and recovery measures to mitigate the effects of an adverse event.
As an example, consider again the hazard of proxy gaming: AI systems pursue the
objectives we give them. If the specified objectives are only proxies for what we
actually want, then an AI might find a way of optimizing them that is not beneficial
overall, possibly due to unintended harmful side effects.
To analyze this hazard, we can draw a bow tie diagram, with the center representing
the event of an AI gaming its proxy goals in a counterproductive way. On the left-hand
side, we list preventative measures, such as ensuring that we can control AI drives
like power-seeking. If the AI system has less power (for example fewer resources),
this would reduce the probability that it finds a way to optimize its goal in a way
that conflicts with our values (as well as the severity of the impact if it does). On the
right-hand side, we list protective measures, such as improving anomaly detection
tools that can recognize any AI behavior that resembles proxy gaming. This would
help human operators to notice activity like this early and take corrective action to
limit the damage caused.
The exact measures on either side of the bow tie depend on which event we put at the
center. We can make a system safer by individually considering each hazard associated
with it, and ensuring we implement both preventative and protective measures against
that hazard.
Figure 4.6. Using a fault tree, we can work backward from a failure (no water flow) to its
possible causes (such as a blockage or lack of flow at source) [262].
For each negative outcome, we work backward through the system, identifying po-
tential causes of that event. We can then draw a “fault tree” showing all the possible
pathways through the system to the negative outcome. By studying this fault tree,
we can identify ways of improving the system that remove these pathways. In Figure
4.6, we trace backward from a pump failure to two types of failure: mechanical and
electrical failure. Each of these has further subtypes. For a fuse failure, we know that
we require a circuit overload, which can happen as a result of a shorted wire or a
power surge. Hence, we know what sort of hazards we might need to think about.
Example: Fire Hazard. We could also consider the risk of a fire outbreak as a
negative outcome. We then work backward, thinking about the different requirements
for this to happen—fuel, oxygen, and sufficient heat energy. This differs from the
water pump failure since all of these are necessary rather than just one of them.
Working backward again, we can think about all the possible sources of each of these
requirements in the system. After completing this process, we can identify multiple
combinations of events that could lead to the same negative outcome.
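The two examples can be encoded as tiny fault trees, with OR gates for the pump failure (any cause suffices) and an AND gate for the fire (all three conditions are required); the specific event states below are arbitrary illustrations.

```python
def or_gate(*events):   # the output event occurs if any input event occurs
    return any(events)

def and_gate(*events):  # the output event occurs only if every input event occurs
    return all(events)

# Pump failure: either a mechanical or an electrical failure is sufficient.
wire_shorted, power_surge = False, True
circuit_overload = or_gate(wire_shorted, power_surge)
fuse_failed = circuit_overload
mechanical_failure = False
electrical_failure = fuse_failed
pump_failure = or_gate(mechanical_failure, electrical_failure)

# Fire: fuel, oxygen, and sufficient heat are all required.
fuel_present, oxygen_present, heat_source = True, True, False
fire = and_gate(fuel_present, oxygen_present, heat_source)

print(pump_failure, fire)   # True False
```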
Decision trees are more flexible and often more applicable in the context
of AI. Figure 4.7 shows how we can use a decision tree based on analyzing potential
causes of risks to create concrete questions about risks. We can trace this through
to identify risks depending on the answers to these questions. For instance, if there
is no unified group accountable for creating AIs, then we know that diffusion of
responsibility is a risk. If there is such a group, then we need to question their beliefs, intentions, and incentives.
Figure 4.7. A decision tree focusing on causal chains for risks can identify several potential problems by interrogating important contextual questions [263].
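A toy rendering of this questioning process in code might look as follows; the branch outcomes are illustrative guesses rather than the actual tree in Figure 4.7.

```python
def identify_risks(unified_group_accountable: bool,
                   beliefs_well_calibrated: bool,
                   incentives_aligned_with_safety: bool) -> list[str]:
    """Walk a (hypothetical) decision tree of contextual questions and
    collect the risks implied by each answer."""
    risks = []
    if not unified_group_accountable:
        risks.append("diffusion of responsibility")
    else:
        # With an accountable group, we interrogate their beliefs,
        # intentions, and incentives.
        if not beliefs_well_calibrated:
            risks.append("decisions based on mistaken beliefs about the system")
        if not incentives_aligned_with_safety:
            risks.append("competitive incentives to cut corners on safety")
    return risks

print(identify_risks(unified_group_accountable=False,
                     beliefs_well_calibrated=True,
                     incentives_aligned_with_safety=True))
# ['diffusion of responsibility']
```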
By thinking more broadly about all the ways in which a specific accident could be
caused, whether by a single component failing or by a combination of many smaller
events, a decision tree can discover hazards that are important to find. However, the backward-chaining process that FTA and cause-focused decision trees rely on also has limitations, which we will discuss in the next section.
4.4.4 Limitations
Chain-of-Events Reasoning. The Swiss cheese and bow tie models and the
FTA method can be useful for identifying hazards and reducing risk in some sys-
tems. However, they share some limitations that make them inapplicable to many
of the complex systems that are built and operated today. Within the field of safety
engineering, these approaches are largely considered overly simplistic and outdated.
We will now discuss the limitations of these approaches in more detail, before moving
on to describe more sophisticated accident models that may be better suited to risk
management in AI systems.
Component failure accident models are particularly inadequate for analyzing complex
systems and sociotechnical systems. We cannot always assume direct, linear causality
in complex and sociotechnical systems, so the assumption of a linear “chain of events”
breaks down.
There are three main reasons why component failure accident models are insufficient
for analyzing complex and sociotechnical systems: accidents sometimes happen with-
out any individual component failing, accidents are not always the result of linear
causality, and direct causality is sometimes less important than indirect, remote, or
“diffuse” causes such as safety culture. We will now look at each of these reasons in
more detail.
Nonlinear Causality
Figure 4.8. There are feedback loops in the creation and deployment of AI systems. For
example, the curation of training data used for developing an AI system exhibits feedback
loops [266].
relevance by tracking the number of clicks each ad receives and adjusts its model
accordingly.
However, the number of clicks an ad receives depends not only on its intrinsic rele-
vance but also on its position in the list: higher ads receive more clicks. If the system
underestimates the effect of position, it may continually place one ad at the top since
it receives many clicks, even if it has lower intrinsic relevance, leading to failure. Con-
versely, if the system overestimates the effect of position, the top ad will receive fewer
clicks than it expects, and so it may constantly shuffle ads, resulting in a random
order rather than relevance-based ranking, also leading to failure.
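The lock-in failure described above can be reproduced in a toy simulation. The click probabilities and position boost below are invented for illustration; the point is only that a ranker which ignores position effects can keep a less relevant ad on top.

```python
import random

random.seed(0)
true_ctr = {"ad_A": 0.05, "ad_B": 0.10}   # ad_B is intrinsically more relevant
position_boost = 3.0                       # the top slot multiplies click probability
clicks = {"ad_A": 1, "ad_B": 1}            # naive relevance estimates start equal
shows = {"ad_A": 1, "ad_B": 1}

for _ in range(10_000):
    # Rank by estimated click-through rate, ignoring position effects entirely.
    ranked = sorted(true_ctr, key=lambda ad: clicks[ad] / shows[ad], reverse=True)
    for position, ad in enumerate(ranked):
        boost = position_boost if position == 0 else 1.0
        p_click = min(1.0, true_ctr[ad] * boost)
        shows[ad] += 1
        if random.random() < p_click:
            clicks[ad] += 1

for ad in true_ctr:
    print(ad, "estimated relevance:", round(clicks[ad] / shows[ad], 3))
# ad_A can end up rated above ad_B purely because it started at the top.
```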
Many complex and sociotechnical systems comprise a whole network of interactions
between components, including multiple feedback loops like this one. When thinking
about systems like these, it is difficult to reduce any accidents to a simple chain of
successive events that we can trace back to a root cause.
Indirect Causality
For example, a poor diet and lack of exercise are likely to result in ill health, even though no single instance of unhealthy behavior could be directly blamed for an individual developing a disease.
Sharp end versus blunt end. The part of a system where specific events happen,
such as people bumping into each other, is sometimes referred to as the “sharp end”
of the system. The higher-level conditions of the system, such as the density of people in a venue, are sometimes called the "blunt end." As outlined in the example of people
spilling drinks, “sharp-end” interventions may not always be effective. These are
related to “proximate causes” and “distal causes,” respectively.
systemic factors that diffusely influence risk. Meadows’ twelve leverage points can
help us identify systemic factors that, if changed, could have far-reaching, system-
wide effects.
of the functionality it offers. These competitive pressures can compel employees and
decision-makers to cut corners and pay less attention to safety.
On a larger scale, competitive pressures might put organizations or countries in an
arms race, wherein safety standards slip because of the urgency of the situation.
This will be especially true if one of the organizations or countries has lower safety
standards and consequently moves quicker; others might feel the need to lower their
standards as well, in order to keep up. The risks this process presents are encapsulated
by Publilius Syrus’s aphorism: “Nothing can be done at once hastily and prudently.”
We consider this further in the Collective Action Problems chapter.
Summary. We have seen that component failure accident models have some sig-
nificant limitations, since they do not usually capture diffuse sources of risk that can
shape a system’s dynamics and indirectly affect the likelihood of accidents. These
include important systemic factors such as competitive pressures, safety costs, and
safety culture. We will now turn to systemic accident models that acknowledge these
ideas and attempt to account for them in risk analysis.
We have explored how component failure accident models are insufficient for prop-
erly understanding accidents in complex systems. When it comes to AIs, we must
understand what sort of system we are dealing with. Comparing AI safety to ensur-
ing the safety of specific systems like rockets, power plants, or computer programs
can be misleading. The reality of today’s world is that many hazardous technolo-
gies are operated by a variety of human organizations: together, these form complex
sociotechnical systems that we need to make safer. While there may be some similar-
ities between different hazardous technologies, there are also significant differences in their properties, which means that safety strategies cannot simply be taken from one system and mapped directly onto another.
not anchor to individual safety approaches used in rockets or power plants.
Instead, it is more beneficial to approach AI safety from a broader perspective of
making complex, sociotechnical systems safer. To this end, we can draw on the theory
of sociotechnical systems, which offers “a method of viewing organizations which
emphasizes the interrelatedness of the functioning of the social and technological
subsystems of the organization and the relation of the organization as a whole to the
environment in which it operates.”
We can also use the complex systems literature more generally, which is largely about
the shared structure of many different complex systems. Accidents in complex systems
can often be better understood by looking at the system as a whole, rather than
focusing solely on individual components. Therefore, we will now consider systemic
accident models, which aim to provide insights into why accidents occur in systems by
analyzing the overall structure and interactions within the system, including human
factors that are not usually captured well by component failure models.
Normal Accident Theory (NAT). Normal Accident Theory (NAT) is one ap-
proach to understanding accidents in complex systems. It suggests that accidents are
inevitable in systems that exhibit the following two properties:
1. Complexity: a large number of interactions between components in the system, such as feedback loops (discussed in the Complex Systems chapter). Complexity can make it infeasible to thoroughly understand a system or exhaustively predict all its potential failure modes.
2. Tight coupling: one component in a system can rapidly affect others, so that a relatively small event can quickly escalate into a larger accident.
NAT concludes that, if a system is both highly complex and tightly coupled, then
accidents are inevitable—or “normal”—regardless of how well the system is managed
[268].
NAT focuses on systemic factors. According to NAT, accidents are not caused
by a single component failure or human error, but rather by the interactions and
interdependencies between multiple components and subsystems. NAT argues that
accidents are a normal part of complex systems and cannot be completely eliminated.
Instead, the focus should be on managing and mitigating the risks associated with
these systems to minimize the severity and frequency of accidents. NAT emphasizes
the importance of systemic factors, such as system design, human factors such as
organizational culture, and operational procedures, in influencing accident outcomes.
By understanding and addressing these systemic factors, it is possible to improve the
safety and resilience of complex systems.
HROs are therefore vigilant about looking out for emerging hazards. AI systems
tend to be good at detecting anomalies, but not near misses.
2. Reluctance to simplify interpretations means looking at the bigger pic-
ture. HROs understand that reducing accidents to chains of events often oversim-
plifies the situation and is not necessarily helpful for learning from mistakes and
improving safety. They develop a wide range of expertise so that they can come
up with multiple different interpretations of any incident. This can help with un-
derstanding the broader context surrounding an event, and systemic factors that
might have been at play. HROs also implement many checks and balances, in-
vest in hiring staff with diverse perspectives, and regularly retrain everyone. AIs
could be used to generate explanations for hazardous events or conduct adversarial
reviews of explanations of system failures.
3. Sensitivity to operations means maintaining awareness of how a system
is operating. HROs invest in the close monitoring of systems to maintain a con-
tinual, comprehensive understanding of how they are behaving, whether through
excellent monitoring tools or hiring operators with deep situational awareness.
This helps ensure that operations are going as planned and that anything unexpected is noticed early, permitting corrective action to be taken before the situation escalates. AI systems that dynamically aggregate information in real time can
help improve situational awareness.
4. Commitment to resilience means actively preparing to tackle unex-
pected problems. HROs train their teams in adaptability and improvising solu-
tions when confronted with novel circumstances. By practicing dealing with issues
they have not seen before, employees develop problem-solving skills that will help
them cope if anything new and unexpected arises in reality. AIs have the poten-
tial to enhance teams’ on-the-spot problem-solving, such as by creating surprising
situations for testing organizational efficiency.
5. Under-specification of structures means information can flow rapidly in
a system. Instead of having rigid chains of communication that employees must
follow, HROs have communication throughout the whole system. All employees
are allowed to raise an alarm, regardless of their level of seniority. This increases
the likelihood that problems will be flagged early and also allows information to
travel rapidly throughout the organization. This under-specification of structures
is also sometimes referred to as “deference to expertise,” because it means that all
employees are empowered to make decisions relating to their expertise, regardless
of their place in the hierarchy.
AI systems could maintain models of their own expertise and the expertise of human operators to ensure effective problem routing. By adhering to these principles, we can develop AI systems
that function like HROs, ensuring high reliability and minimizing the potential risks
associated with their deployment and use.
Systems can gradually migrate into unsafe states. The RMF also asserts
that behaviors and conditions can gradually “migrate” over time, due to environ-
mental pressures. If this migration leads to unsafe systemic conditions, this creates
the potential for an event at the sharp end to trigger an accident. This is why it
is essential to continually enforce safety boundaries and avoid the system migrating
into unsafe states.
Figure 4.9. Rasmussen’s risk management framework lays out six levels of organization and
their interactions, aiming to mark consistent safety boundaries by identifying hazards and
those responsible for them.
The organizational safety structure. The first aspect is the set of safety constraints: the unsafe conditions that must be avoided. The safety structure tells us which components and operators are in place to prevent each of those unsafe conditions from occurring. This can
help to prevent accidents from component failures, design errors, and interactions
between components that could produce unsafe states.
Table 4.2 summarizes how the STAMP perspective contrasts with those of traditional
component failure models.
TABLE 4.2 STAMP makes assumptions that differ from traditional component failure models.
Old assumption: Accidents are caused by chains of directly related events.
New assumption: Accidents are complex processes involving the entire sociotechnical system.
Old assumption: We can understand accidents by looking at chains of events leading to the accident.
New assumption: Traditional event-chain models cannot describe this process adequately.
Old assumption: Safety is increased by increasing system or component reliability.
New assumption: High reliability is not sufficient for safety.
Old assumption: Most accidents are caused by operator error.
New assumption: Operator error is a product of various environmental factors.
Old assumption: Assigning blame is necessary to learn from and prevent accidents.
New assumption: Holistically understand how the system behavior contributed to the accident.
Old assumption: Major accidents occur from simultaneous occurrences of random events.
New assumption: Systems tend to migrate toward states of higher risk.
Summary. Normal accident theory argues that accidents are inevitable in systems
with a high degree of complexity and tight coupling, no matter how well they are
organized. On the other hand, it has been argued that HROs with consistently low
accident rates demonstrate that it is possible to avoid accidents. HRO theory iden-
tifies five key characteristics that contribute to a good safety culture and reduce the
likelihood of accidents. However, it might not be feasible to replicate these across all
organizations.
Systemic models like Rasmussen’s RMF, STAMP, and Dekker’s DIF model are
grounded in an understanding of complex systems, viewing safety as an emergent
property. The RMF and STAMP both view safety as an issue of control and en-
forcing safety constraints on operations. They both identify a hierarchy of levels of
organization within a system, showing how accidents are caused by multiple factors,
rather than just by one event at the sharp end. DIF describes how systems are often subject to decrementalism, whereby the safety of processes is gradually degraded through a series of small changes, each of which seems insignificant on its own.
In general, component failure models focus on identifying specific components or fac-
tors that can go wrong in a system and finding ways to improve those components.
These models are effective at pinpointing direct causes of failure and proposing tar-
geted interventions. However, they have a limitation in that they tend to overlook
other risk sources and potential interventions that may not be directly related to the
identified components. On the other hand, systemic accident models take a broader
approach by considering the interactions and interdependencies between various com-
ponents in a system, such as feedback loops, human factors, and diffuse causality
models. This allows them to capture a wider range of risk sources and potential
interventions, making them more comprehensive in addressing system failures.
This book presents multiple ways in which the development and deployment of AIs
could entail risks, some of which could be catastrophic or even existential. However,
the systemic accident models discussed above highlight that events in the real world
often unfold in a much more complex manner than the hypothetical scenarios we use
to illustrate risks. It is possible that many relatively minor events could accumulate,
leading us to drift toward an existential risk. We are unlikely to be able to predict
and address every potential combination of events that could pave the route to a
catastrophe.
For this reason, although it can be useful to study the different risks associated
with AI separately when initially learning about them, we should be aware that
hypothetical example scenarios are simplified, and that the different risks coexist.
We will now discuss what we can learn from our study of complex systems and
systemic accident models when developing an AI safety strategy.
Risks that do not initially appear catastrophic might escalate. Risks tend
to exist on a spectrum. Power inequality, disinformation, and automation, for exam-
ple, are prevalent issues within society and are already causing harm. Though serious,
they are not usually thought of as posing existential risks. However, if pushed to an
extreme degree by AIs, they could result in totalitarian governments or enfeeblement.
Both of these scenarios could represent a catastrophe from which humanity may not
recover. In general, if we encounter harm from a risk on a moderate scale, we should
be careful to not dismiss it as non-existential without serious consideration.
Conflict and global turbulence could make society more likely to drift
into failure. Although we have some degree of choice in how we implement AI
within society, we cannot control the wider environment. There are several reasons
why events like wars that create societal turbulence could increase the risk of human
civilization drifting into failure. Faced with urgent, short-term threats, people might
deprioritize AI safety to focus instead on the most immediate concerns. If AIs can
be useful in tackling those concerns, it might also incentivize people to rush into
giving them greater power, without thinking about the long-term consequences. More
generally, a more chaotic environment might also present novel conditions for an AI that cause it to behave unpredictably. Even if conditions like war do not directly
cause existential risks, they make them more likely to happen.
landscape, including threats that may not immediately seem catastrophic. Instead
of attempting to target just existential risks precisely, it may be more effective to
implement broad interventions, including sociotechnical measures.
Summary. As we might expect from our study of complex systems, different types
of risks are inextricably related and can combine in unexpected ways to amplify one
another. While some risks may be generally more concerning than others, we cannot
neatly isolate those that could contribute to an existential threat from those that
could not, and then only focus on the former while ignoring the latter. In addressing
existential threats, it is therefore reasonable to view systems holistically and consider
a wide range of issues, besides the most obvious catastrophic risks. Due to system
complexity, broad interventions are likely to be required as well as narrowly targeted
ones.
In the first few sections of this chapter, we discussed failure modes and hazards,
equations for understanding the risks they pose, and principles for designing safer
systems. We also looked at methods of analyzing systems to model accidents and
identify hazards and explored how different styles of analysis can be helpful for com-
plex systems.
The classic risk equation tells us that the level of risk depends on the probability
and severity of the event. A particular class of events, called tail events, have a very low probability of occurrence but an extremely high impact when they do occur. Tail events pose
some unique challenges for assessing and reducing risk, but any competent form of
risk management must attempt to address them. We will now explore these events
and their implications in more detail.
AI-related tail events could have a severe impact. As AIs are increasingly
deployed within society, some tail risks we should consider include the possibility that
an AI could be used to develop a bioweapon, or that an AI might hack a bank and wipe its financial records. Even if these eventualities have a low probability of
occurring, it would only take one such event to cause widespread devastation. Such
an event could define the overall impact of an AI’s deployment. For this reason, com-
petent risk management must involve serious efforts to prevent tail events, however
rare we think they might be.
A tail event often changes the mean but not the median. Figure 4.10 can
help us visualize how tail events affect the wider risk landscape. The graphs show
data points representing individual events, with their placement along the x-axis
indicating their individual impact.
In the first graph, we have numerous data points representing frequent, low-impact
events: these are all distributed between 0 and 100, and mostly between 0 and 10.
The mean impact and median impact of this dataset have similar values, marked on
the x-axis.
In the second graph we have the same collection of events, but with the addition of a
single data point of much higher impact: a tail event with an impact of 10,000. As indicated in the graph, the median impact of the dataset is approximately the same as before, but the mean changes substantially and is no longer representative of the general population of events.
Figure 4.10. The occurrence of a tail event can dramatically shift the mean but not the median of the event type's impact.
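The effect in Figure 4.10 is easy to reproduce numerically; the impact values below are made up for illustration.

```python
import statistics

# Frequent, low-impact events (hypothetical values between 0 and 100).
impacts = [1, 2, 2, 3, 3, 3, 4, 5, 8, 10, 10, 15, 40, 100]

print("Before tail event:  mean =", round(statistics.mean(impacts), 1),
      " median =", statistics.median(impacts))

impacts.append(10_000)   # a single tail event

print("After tail event:   mean =", round(statistics.mean(impacts), 1),
      " median =", statistics.median(impacts))
# The mean increases dramatically; the median barely moves.
```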
We can also think about tail events in terms of the classic risk equation. Tail events have a low probability, but because they are so severe, we nonetheless evaluate the risk they pose as being large:
Risk = Probability × Severity, where the probability is very low but the severity is extremely high.
Depending on the exact values of probability and severity, we may find that tail risks are just as large as, or even larger than, the risks posed by much smaller events that happen all the time. In other words, although they are rare, we cannot afford to ignore the possibility that they might happen.
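As a quick numerical illustration with hypothetical figures, a rare but severe event can carry at least as much risk as a frequent minor one:

```python
# risk = probability x severity, per the classic risk equation.
events = {
    "frequent minor failure": {"probability": 0.10, "severity": 5},
    "rare tail event":        {"probability": 1e-4, "severity": 10_000},
}

for name, e in events.items():
    print(f"{name}: risk = {e['probability'] * e['severity']}")
# frequent minor failure: risk = 0.5
# rare tail event: risk = 1.0  -- the rarer event dominates despite its low probability.
```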
It is difficult to plan for tail events because they are so rare. Since we
can hardly predict when tail events will happen, or even if they will happen at all,
it is much more challenging to plan for them than it is for frequent, everyday events
that we know we can expect to encounter. It is often the case that we do not know
exactly what form they will take either.
For these reasons, we cannot plan the specific details of our response to tail events
in advance. Instead, we must plan to plan. This involves organizing and developing
an appropriate response, if and when it is necessary—how relevant actors should
convene to decide on and coordinate the most appropriate next steps, whatever the
precise details of the event. Often, we first need to figure out whether a given phenomenon can even produce tail events, which requires considering its frequency distribution. We consider this concept next.
Figure 4.11. Many distributions have a head (an area where most of the probability is
concentrated) and one or two tails (extreme regions of the distribution).
In making these statements, we are describing simplified caricatures of the two scenarios for pedagogical purposes.
TABLE 4.3 A caricature of thin tails and long tails reveals several trends that often hold for each.
Thin tails: The top few receive a proportionate share of the total.
Long tails: The top few receive a disproportionately large share of the total.
Thin tails: The total is determined by the whole group ("tyranny of the collective").
Long tails: The total is determined by a few extreme occurrences ("tyranny of the accidental").
Thin tails: The typical member of a group has an average value, close to the mean.
Long tails: The typical member is either a giant or a dwarf.
Thin tails: A single event cannot escalate to become much bigger than the average.
Long tails: A single event can escalate to become much bigger than many others put together.
Thin tails: Individual data points vary within a small range that is close to the mean.
Long tails: Individual data points can vary across many orders of magnitude.
Thin tails: We can predict roughly what value a single instance will take.
Long tails: It is much harder to robustly predict even the rough value that a single instance will take.
Under thin tails, the top few receive quite a proportionate share of the
total. If we were to measure the heights of a group of people, the total height of
the tallest 10% would not be much more than 10% of the total height of the whole
group.
Under long tails, the top few receive a disproportionately large share of
the total. In the music industry, the revenue earned by the most successful 1% of
artists represents around 77% of the total revenue earned by all artists.
Under thin tails, the total is determined by the whole group. The total
height of the tallest 10% of people is not a very good approximation of the total
height of the whole group. Most members need to be included to get a good measure
of the total. This is called “tyranny of the collective.”
Under thin tails, the typical member of a group has an average value.
Almost no members are going to be much smaller or much larger than the mean.
Under long tails, the typical member is either a giant or a dwarf. Mem-
bers can generally be classified as being either extreme and high-impact or relatively
insignificant.
Note that, under many real-world long-tailed distributions, there may be occurrences
that seem to fall between these two categories. There may be no clear boundary
dividing occurrences that count as insignificant from those that count as extreme.
Under thin tails, the impact of an event is not scalable. A single event
cannot escalate to become much bigger than the average.
Under long tails, the impact of an event is scalable. A single event can
escalate to become much bigger than many others put together.
Contrast 5: Randomness
Under thin tails, individual data points vary within a small range that
is close to the mean. Even the data point that is furthest from the mean does
not change the mean of the whole group by much.
Under long tails, individual data points can vary across many orders of
magnitude. A single extreme data point can completely change the mean of the
sample.
Contrast 6: Predictability
Under thin tails, we can predict roughly what value a single instance will
take. We can be confident that our prediction will not be far off, since instances
cannot stray too far from the mean.
Under long tails, it is much harder to predict even the rough value that
a single instance will take. Since data points can vary much more widely, our
best guesses can be much further off.
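These contrasts can be seen by sampling from a thin-tailed and a long-tailed distribution. The sketch below compares a normal distribution (height-like) with a Pareto distribution (wealth-like); the parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Thin-tailed sample: human-like heights in cm (normal distribution).
heights = rng.normal(loc=170, scale=10, size=100_000)

# Long-tailed sample: Pareto-distributed values (wealth-like quantities).
wealth = (rng.pareto(a=1.2, size=100_000) + 1) * 1_000

def top_share(x, fraction=0.01):
    """Share of the total accounted for by the top `fraction` of the sample."""
    k = int(len(x) * fraction)
    return np.sort(x)[-k:].sum() / x.sum()

print("Top 1% share of total height:", round(top_share(heights), 3))   # close to 0.01
print("Top 1% share of total wealth:", round(top_share(wealth), 3))    # far above 0.01
print("Height spread (max/min):", round(float(heights.max() / heights.min()), 1))
print("Wealth spread (max/min):", round(float(wealth.max() / wealth.min()), 1))
# Heights span a narrow range; wealth values span several orders of magnitude.
```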
Having laid the foundations for understanding tail events in general, we will now
consider an important subset of tail events: black swans.
Known knowns: things we are aware of and understand.
Known unknowns: things we are aware of but which we don't fully understand.
Unknown knowns: things that we do not realize we know (such as tacit knowledge).
Unknown unknowns: things that we do not understand, and which we are not even aware we do not know.
In these category titles, the first word refers to our awareness, and the second refers
to our understanding. We can now consider these four types of events in the context
of a student preparing for an exam.
1. We know that we know. Known knowns are things we are both aware of and
understand. For the student, these would be the types of questions that have
come up regularly in previous papers and that they know how to solve through
recollection. They are aware that they are likely to face these, and they know how
to approach them.
2. We do not know what we know. Unknown knowns are things we understand
but may not be highly aware of. For the student, these would be things they have
not thought to prepare for but which they understand and can do. For instance,
there might be some questions on topics they hadn’t reviewed; however, looking at
these questions, the student finds that they know the answer, although they cannot
explain why it is correct. This is sometimes called tacit knowledge or unaccounted
facts.
3. We know that we do not know. Known unknowns are things we are aware of
but do not fully understand. For the student, these would be the types of questions
that have come up regularly in previous papers but which they have not learned
how to solve reliably. The student is aware that they are likely to face these but
is not sure they will be able to answer them correctly. However, they are at least
aware that they need to do more work to prepare for them.
4. We do not know that we do not know. Unknown unknowns are things we are
unaware of and do not understand. These problems catch us completely off guard
because we didn’t even know they existed. For the student, unknown unknowns
would be unexpectedly hard questions on topics they have never encountered
before and have no knowledge or understanding of.
Black swans in the real world. It might be unfair for someone to present us with an unknown unknown, such as finding questions on topics irrelevant to the subject in
an exam setting. The wider world, however, is not a controlled environment; things
do happen that we have not thought to prepare for.
Black swans are defined by our understanding. A black swan is a black swan
because our worldview is incorrect or incomplete, which is why we fail to predict it.
In hindsight, such events often only make sense after we realize that our theory
was flawed. Seeing black swans makes us update our models to account for the new
phenomena we observe. When we have a new, more accurate model, we can often
look back in time and find the warning signs in the lead-up to the event, which we
did not recognize as such at the time.
These examples also show that we cannot always reliably predict the future from our
experience; we cannot necessarily make an accurate calculation of future risk based
on long-running historical data.
looser, more practical working definition given earlier: a highly impactful event that
is largely unexpected for most people. For example, some individuals with relevant
knowledge of the financial sector did predict the 2008 crisis, but it came out of the
blue for most people. Even among financial experts, the majority did not predict it.
Therefore, we count it as a black swan.
Similarly, although pandemics have happened throughout history, and smaller disease
outbreaks occur yearly, the possibility of a pandemic was not on most people’s radar
before COVID-19. People with specific expertise were more conscious of the risk, and
epidemiologists had warned several governments for years that they were inadequately
prepared for a pandemic. However, COVID-19 took most people by surprise and
therefore counts as a black swan under the looser definition.
4.7.7 Implications of Tail Events and Black Swans for Risk Analysis
Tail events and black swans present problems for analyzing and managing risks,
because we do not know if or when they will happen. For black swans, there is the
additional challenge that we do not know what form they will take.
Since, by definition, we cannot predict the nature of black swans in advance, we
cannot put any specific defenses in place against them, as we might for risks we have
thought of. We can attempt to factor black swans into our equations to some degree,
by trying to estimate roughly how likely they are and how much damage they would
cause. However, they add much more uncertainty into our calculations. We will now
discuss some common tendencies in thinking about risk, and why they can break
down in situations that are subject to tail events and black swans.
First, we consider how our typical risk estimation methods break down under long tails because our standard arsenal of statistical tools is rendered useless. Then, we
consider how cost-benefit analysis is strained when dealing with long-tailed events
because of its sensitivity to our highly uncertain estimates. After this, we discuss
why we should be more explicitly considering extremes instead of averages, and look
at three common mistakes when dealing with long-tailed data: the delay fallacy,
interpreting an absence of evidence, and the preparedness paradox.
Figure 4.12. The running mean of a long-tailed distribution (here, average net worth in dollars) is slow to converge, rendering the mean a problematic summary statistic in practice.
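A numerical sketch of this slow convergence, using an arbitrary Pareto sample rather than real net-worth data, might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Long-tailed sample (Pareto) vs. thin-tailed sample (normal); parameters are arbitrary.
long_tailed = rng.pareto(a=1.1, size=50_000) + 1
thin_tailed = rng.normal(loc=1.0, scale=0.1, size=50_000)

for n in [100, 1_000, 10_000, 50_000]:
    print(f"n={n:>6}:  thin-tailed running mean = {thin_tailed[:n].mean():.3f}   "
          f"long-tailed running mean = {long_tailed[:n].mean():.3f}")
# The thin-tailed mean settles almost immediately; the long-tailed mean keeps
# shifting as occasional extreme draws arrive, as in Figure 4.12.
```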
Since this number is positive, we might believe it is a good idea to bet on the lottery.
However, if the probabilities are only slightly different, at 99.7% and 0.3%, then our expected outcome becomes negative. This illustrates that just a tiny change in probabilities can make a significant difference in whether we expect a positive or a negative outcome. In situations
like this, where the expected outcome is highly sensitive to probabilities, using an
estimate of probability that is only slightly different from the actual value can com-
pletely change the calculations. For this reason, relying on this type of cost-benefit
analysis does not make sense if we cannot be sure we have accurate estimates of the
probabilities in question.
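Because the original bet amounts are not reproduced here, the sketch below uses hypothetical payoffs (win $1, lose $400) purely to show how a 0.1% shift in probability can flip the sign of the expected outcome.

```python
def expected_outcome(p_win: float, win_amount: float, loss_amount: float) -> float:
    """Expected value of a bet that pays win_amount with probability p_win
    and costs loss_amount otherwise."""
    return p_win * win_amount - (1 - p_win) * loss_amount

# Hypothetical payoffs for illustration: win $1, lose $400.
print(expected_outcome(0.998, 1, 400))   # +0.198 -> the bet looks worthwhile
print(expected_outcome(0.997, 1, 400))   # -0.203 -> a 0.1% shift makes it a bad bet
```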
of four feet.” If a river is four feet deep on average, that might mean that it has a
constant depth of four feet and is possible to wade across it safely. It might also mean
that it is two or three feet deep near the banks and eight feet deep at some point
in the middle. If this were the case, then it would not be a good idea to attempt to
wade across it.
Failing to account for extremes instead of averages is one example of the mistakes
people make when thinking about event types that might have black swans. Next,
we will explore three more: the delay fallacy, misinterpreting an absence of evidence,
and the preparedness paradox.
a risk of tail events and black swans, it would not make sense to delay action with
the excuse that we do not have enough information. Instead, we should be proactive
about safety by investing in the three key research fields discussed earlier: robust-
ness, monitoring, and control. If we wait until we are certain that an AI could pose
an existential risk before working on AI safety, we might be waiting until it is too
late.
events happen. Since we usually cannot run two parallel worlds—one with safety ef-
forts and one without—it might be difficult or impossible to prove that the safety
work prevented harm. Those who work in this area may never know whether their
efforts have prevented a catastrophe and have their work vindicated. Nevertheless,
preventing disasters is essential, especially in cases like the development of AI, where
we have good theoretical reasons to believe that a black swan is on the cards.
DL models and the surrounding social systems are all complex systems.
It is unlikely that we will be able to predict every single way AI might be used, just
as, in the early days of the internet, it would have been difficult to predict every
way technology would ultimately be used. This means that there might be a risk of
AI being used in harmful ways that we have not foreseen, potentially leading to a
destructive black swan event that we are unprepared for. The idea that DL systems
qualify as complex systems is discussed in greater depth in the Complex Systems
chapter.
New systems may be more likely to present black swans. Absence of evi-
dence is only evidence of absence if we expect that some evidence should have turned
up in the timeframe that has elapsed. For systems that have not been around for
long, we would be unlikely to have seen proof of tail events or black swans since these
are rare by definition.
AI may not have existed for long enough for us to have learned about all
its risks. In the case of emerging technology, it is reasonable to think that there
might be a risk of tail events or black swans, even if we do not have any evidence
yet. The lack of evidence might be explained simply by the fact that the technology
has not been around for long. Our meta-ignorance means that we should take AI risk
seriously. By definition, we can’t be sure there are no unknown unknowns. Therefore,
it is over-confident for us to feel sure we have eliminated all risks.
There are techniques for turning some black swans into known un-
knowns. As discussed earlier, under our practical definition, not all black swans
are completely unpredictable, especially not for people who have the relevant exper-
tise. Ways of putting more black swans on our radar include expanding our safety
imagination, conducting horizon scanning or stress testing exercises, and red-teaming
[277].
Horizon scanning. Some HROs use a technique called horizon scanning, which
involves monitoring potential future threats and opportunities before they arrive, to
minimize the risk of unknown unknowns [278]. AI systems could be used to enhance
horizon-scanning capabilities by simulating situations that mirror the real world with
a high degree of complexity. The simulations might generate data that reveal potential
black swan risks to be aware of when deploying a new system. As well as conducting
horizon scanning, HROs also contemplate near-misses and speculate about how they
might have turned into catastrophes, so that lessons can be learned.
Red teaming. “Red teams” can find more black swans by adopting a mindset of
malicious intent. Red teams should try to think of as many ways as they can to
misuse or sabotage the system. They can then challenge the organization on how it
would respond to such attacks. Finally, stress tests, in which hypothetical scenarios are dry-run to evaluate how well the system copes with them and to identify how it could be improved, can increase a system's resilience to unexpected events.
4.8 CONCLUSION
4.8.1 Summary
In this chapter, we have explored various methods of analyzing and managing risks
inherent in systems. We began by looking at how we can break risk down into two
components: the probability and severity of an accident. We then went into greater
detail, introducing the factors of exposure and vulnerability, showing how each af-
fects the level of risk we calculate. By decomposing risk in this way, we can identify
measures we can take to reduce risks. We also considered the concept of ability to
cope and how it relates to risk of ruin.
Next, we described a metric of system reliability called the “nines of reliability.” This
metric refers to the number of nines at the beginning of a system’s percentage or
decimal reliability. We found that adding another nine of reliability is equivalent to
reducing the probability of an accident by a factor of 10, and therefore results in a
tenfold increase in expected time before failure. A limitation of the nines of reliability
is that they only contain information about the probability of an accident, but not
its severity, so they cannot be used alone to calculate risk.
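For concreteness, here is a small sketch of the metric with made-up reliability values; the tenfold growth in expected uses before failure follows directly from the definition.

```python
import math

def nines(reliability: float) -> float:
    """Number of leading nines in a reliability value, e.g. 0.999 -> 3."""
    return -math.log10(1 - reliability)

for r in [0.9, 0.99, 0.999]:
    expected_uses_before_failure = 1 / (1 - r)   # grows tenfold with each added nine
    print(f"reliability={r}: {nines(r):.0f} nines, "
          f"expected uses before failure = {expected_uses_before_failure:.0f}")
```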
We then listed several safe design principles, which can be incorporated into a system
from the design stage to reduce the risk of accidents. In particular, we explored re-
dundancy, separation of duties, the principle of least privilege, fail-safes, antifragility,
negative feedback mechanisms, transparency, and defense in depth.
To develop an understanding of how accidents occur in systems, we next explored var-
ious accident models, which are theories about how accidents happen and the factors
that contribute to them. We reviewed three component failure accident models: the
Swiss cheese model, the bow tie model, and fault tree analysis, and considered their
limitations, which arise from their chain-of-events style of reasoning. Generally, they
do not capture how accidents can happen due to interactions between components,
even when nothing fails. Component failure models are also unsuited to modeling
how the numerous complex interactions and feedback loops in a system can make it
difficult to identify a root cause, and how it can be more fruitful to look at diffuse
causality and systemic factors than specific events.
After highlighting the importance of systemic and human factors, we delved deeper
into some examples of them, highlighting regulations, social pressure, competitive
pressures, safety costs, and safety culture. We then moved on to look at systemic
accident models that attempt to take these factors into consideration. Normal Ac-
cident Theory states that accidents are inevitable in complex and tightly coupled
systems. On the other hand, HRO theory points to certain high reliability organiza-
tions as evidence that it is possible to reliably avoid accidents by following five key
management principles: preoccupation with failure, reluctance to simplify interpreta-
tions, sensitivity to operations, commitment to resilience, and deference to expertise.
While these features can certainly contribute to a good safety culture, we also looked
at the limitations and the difficulties in replicating some of them in other systems.
Rounding out our discussion of systemic factors, we outlined three accident models
that are grounded in complex systems theory. Rasmussen’s Risk Management Frame-
work (RMF) identifies six hierarchical levels within a system, identifying actors at
each level who share responsibility for safety. The RMF states that a system’s op-
erations should be kept within defined safety boundaries; if they migrate outside of
these, then the system is in a state where an event at the sharp end could trigger
an accident. However, the factors at the blunt end are also responsible, not just the
sharp-end event.
Similarly, STAMP and the related STPA analysis method view safety as being an
emergent property of an organization, detailing different levels of organization within
a system and defining the safety constraints that each level should impose on the one
below it. Specifically, STPA builds models of the organizational safety structure; the
dynamics and pressures that can lead to deterioration of this structure; the models
of the system that operators must have, and the necessary communication to ensure
these models remain accurate over time; and the broader social and political context
the organization exists within.
Finally, Dekker’s Drift into Failure (DIF) model emphasizes decrementalism: the
way that a system’s processes can deteriorate through a series of minor changes,
potentially causing the system’s migration to an unsafe state. This model warns that
each change may seem insignificant alone, so organizations might make these changes
one at a time in isolation, creating a state of higher risk once enough changes have
been made.
As a final note on the implications of complexity for AI safety, we considered the
broader societal context within which AI technologies will function. We discussed
how, in this uncontrolled environment, different, seemingly lower-level risks could
interact to produce catastrophic threats, while chaotic circumstances may increase
the likelihood of AI-related accidents. For these reasons, it makes sense to consider a
wide range of different threats of different magnitudes in our approach to mitigating
catastrophic risks, and we may find that broader interventions are more fruitful than
narrowly targeted ones.
In the last section of this chapter, we focused on a particular class of events called
tail events and black swans, and explored what they mean for risk analysis and
management. We began this discussion by defining tail events and considering several
caricatures of long-tailed distributions. Then, we described black swans as a subset
of tail events that are not only rare and high-impact but also particularly difficult
to predict. These events seem to happen largely “out of the blue” for most people
and may indicate that our understanding of a situation is inaccurate or incomplete.
These events are also referred to as unknown unknowns, which we contrasted with known unknowns: risks that we may not fully understand but are at least aware of.
We examined how tail events and black swans can pose particular challenges for
some traditional approaches to evaluating and managing risk. Certain methods of
risk estimation and cost-benefit analysis rely on historical data and probabilities of
different events. However, tail events and black swans are rare, so we may not have
sufficient data to accurately estimate their likelihood, and even a small change in
likelihood can lead to a big difference in expected outcome.
We also considered the delay fallacy, showing that waiting for more information
before acting might mean waiting until it is too late. We discussed how an absence of
evidence of a risk cannot necessarily be taken as evidence that the risk is absent. By
looking at hypothetical situations where catastrophes are avoided thanks to safety
measures, we explained how the preparedness paradox can make these measures seem
unnecessary, when in fact they are essential.
Having explored the importance of taking tail events and black swans into consid-
eration, we identified some circumstances that indicate we may be at risk of these
events. We concluded that it is reasonable to believe AI technologies may pose such
a risk, due to the complexity of AI systems and the systems surrounding them, the
highly connected nature of the social systems they are likely to be embedded in, and
the fact that they are relatively new, meaning we may not yet fully understand all
the ways they might interact with their surroundings.
4.9 LITERATURE
Complex Systems
5.1 OVERVIEW
Artificial intelligence (AI) systems and the societies they operate within belong to
the class of complex systems. These types of systems have significant implications for
thinking about and ensuring AI safety. Complex systems exhibit surprising behav-
iors and defy conventional analysis methods that examine individual components in
isolation. To develop effective strategies for AI safety, it is crucial to adopt holistic
approaches that account for the unique properties of complex systems and enable us
to anticipate and address AI risks.
This chapter begins by elucidating the qualitative differences between complex and
simple systems. After describing standard analysis techniques based on mechanistic
or statistical approaches, the chapter demonstrates their limitations in capturing
the essential characteristics of complex systems, and provides a concise definition
of complexity. The “Hallmarks of Complex Systems” section then explores seven
indications of complexity and establishes how DL models exemplify each of them.
Next, the “Social Systems as Complex Systems” section shows how various human
organizations also satisfy our definition of complex systems. In particular, the section
explores how the hallmarks of complexity materialize in two examples of social sys-
tems that are pertinent to AI safety: the corporations and research institutes pursu-
ing AI development, and the decision-making structures responsible for implementing
policies and regulations. In the latter case, there is consideration of how advocacy
efforts are affected by the complex nature of political systems and the broader social
context.
Having established that DL systems and the social systems surrounding them are
best described as complex systems, the chapter moves on to what this means for
AI safety. The "General Lessons" section derives five lessons from the chapter's examination of complex systems and sets out their implications for how risks might
arise from AI. The “Puzzles, Problems, and Wicked Problems” section then reframes
the contrasts between simple and complex systems in terms of the different kinds of
problems that the two categories present, and the distinct styles of problem-solving
they require.
By examining the unintended side effects that often arise from interfering with com-
plex systems, the “Challenges with Interventionism” section illustrates the necessity
of developing comprehensive approaches to mitigating AI risks. Finally, the “Systemic
Issues” section outlines a method for thinking holistically and identifying more ef-
fective, system-level solutions that address broad systemic issues, rather than merely
applying short-term “quick fixes” that superficially address symptoms of problems.
Before we describe complex systems, we will first look at non-complex systems and
the methods of analysis that can be used to understand them. This discussion sits
under the reductionist paradigm. According to this paradigm, systems are just the
sum of their parts, and can be fully understood and described with relatively simple
mathematical equations or logical relations.
Figure 5.1. A Rube Goldberg machine with many parts that each feed directly into the next
can be well explained by way of mechanisms.
Having discussed simple systems and how they can be understood through reduction-
ism, we will now look at the limitations of this paradigm and the types of systems
that it cannot be usefully applied to. We will look at the problems this presents for
understanding systems and predicting their behaviors.
In complex systems, the whole is more than the sum of its parts. The
problem is that reductionist-style analysis is poorly suited to capturing the diversity
of interdependencies within complex systems. Reductionism only works well if the
interactions follow a rigid and predictable mechanism or if they are random and
independent enough to be modeled by statistics. In complex systems, neither of these
assumptions hold.
In complex systems, interactions do not follow a rigid, structured pattern, but compo-
nents are still sufficiently interconnected that they cannot be treated as independent.
These interactions are the source of many novel behaviors that make complex systems
interesting. To get a better grasp of these systems, we need to go beyond reductionism
and adopt an alternative, more holistic framework for thinking about them.
to predict the longer-term trajectory of the system, such as what the climate will look
like in several centuries or millennia. This is because complex systems often develop
in a more open-ended way than simple systems and have the potential to evolve into
a wider range of states, with numerous factors influencing the path they take.
Figure 5.2. Often, we use mechanistic or statistical approaches to analyzing systems. When
there are many components with strong interdependence, these are insufficient, and we need
a complex systems approach.
Now that we have seen that many systems of interest are inscrutable to the reduc-
tionist paradigm, we need an alternative lens through which to understand them. To
this end, we will discuss the complex systems paradigm, which takes a more holistic
view, placing emphasis on the most salient features shared across various real-world
complex systems that the reductionist paradigm fails to capture. The benefit of this
paradigm is that it provides “a way of seeing and talking about reality that helps us
better understand and work with systems to influence the quality of our lives.”
Complex systems exhibit emergent properties that are not found in their
components. As discussed above, some systems cannot be usefully understood in
a reductionist way. Studying a complex system’s components in isolation and doing
mental reassembly does not amount to what we observe in reality. One primary reason
for this is the phenomenon of emergence: the appearance of striking, system-wide
features that cannot be found in any of the system’s components.
The presence of emergent features provides one sense in which complex systems are
“more than the sum of their parts.” For example, we do not find atmospheric currents
in any of the molecules of nitrogen and oxygen that make up the atmosphere, and
the flexible intelligence of a human being does not exist in any single neuron. Many
biological concepts such as adaptation, ecological niche, sexuality, and fitness cannot simply be reduced to statements about molecules. Moreover, "wetness" is not found in
individual water molecules. Emergence is so essential that we will use it to construct
a working definition of complex systems.
Working definition. Complex systems are systems of many interconnected com-
ponents that collectively exhibit emergent features, which cannot, in practice, be
derived from a reductive analysis of the system in terms of its isolated components.
Ant colonies are a classic example of a complex system. An ant colony can
grow to a size of several million individuals. Each ant is a fairly simple creature with
a short memory, moving around in response to chemical and tactile cues. The individ-
uals interact by randomly bumping into each other and exchanging pheromones. Out
of this mess of uncoordinated interactions emerge many fascinating collective behav-
iors. These include identifying and selecting high-quality food sources or nest sites,
forming ant trails, and even constructing bridges over gaps in these trails (formed by
the stringing together of hundreds of the ants’ bodies). Ant colonies have also been
observed to “remember” the locations of food sources or the paths of previous trails
for months, years, or decades, even though the memory of any individual ant only
lasts for a few days at most.
Ant colonies satisfy both aspects of the working definition of complex
systems. First, the emergent features of the colony include the collective decision-
making process that enables it to choose a food source or nest site, the physical ability
to cross over gaps many times wider than any ant, and even capabilities of a cognitive
nature such as extended memory. We could not predict all of these behaviors and
abilities from observing any individual ant, even if each ant displays some smaller
analogs of some of these abilities.
Second, these emergent features cannot be derived from a reductive analysis of the
system focused on the properties of the components. Even given a highly detailed
study of the behavior of an individual ant considered in isolation, we could not derive
the emergence of all of these remarkable features. Nor are all of these features simple
statistical aggregates of individual ant behaviors in any practical sense, although some
features like the distribution of ants between tasks such as foraging, nest maintenance,
and patrolling have been observed as decisions on the level of an individual ant as
well.
This distinguishes a more complex system like an ant colony from a simpler one such
as a gas in a box. Although the gas also has emergent properties (like its temperature
and pressure), it does not qualify as complex. The gas’s higher-level properties can
be straightforwardly reduced to the statistics of the lower-level properties of the
component particles. However, this was not always the case: it took many decades of
work to uncover the statistical mechanics of gases from the properties of individual
molecules. Complexity can be a feature of our understanding of the system rather
than the system itself.
Complex systems are ubiquitous in nature and society. From cells, organ-
isms, and ecosystems, to weather systems, cities, and the World Wide Web, com-
plex systems are everywhere. We will now describe two further examples, referred to
throughout this chapter.
The human brain is a complex system. The human brain consists of around
86 billion neurons, each one having, on average, thousands of connections to the
others. They interact via chemical and electrical signals. Out of this emerge all our
impressive cognitive abilities, including our ability to use language, perceive the world
around us, and control the movements of our body. Again, these cognitive abilities
are not found in any individual neuron, arising primarily from the rich structure of
neuronal connections; even if we understood individual neurons very well, this would
not amount to an understanding of (or enable a derivation of) all these impressive
feats accomplished by the brain.
We consider seven key characteristics of complex systems. Chief among these is emer-
gence, but several others also receive attention: self-organization, feedback and nonlin-
earity, criticality, adaptive behavior, distributed functionality, and scalable structure.
We will now describe each of these hallmarks and explain their implications. Along
the way, we will show that DL systems share many similarities with other complex
systems, strengthening the case for treating them under this paradigm.
Feedback and Nonlinearity
Two closely related hallmarks of complexity are feedback and nonlinearity. Feedback
refers to circular processes in which a system and its environment affect one another.
There are multiple types of nonlinearity, but the term generally describes systems and
processes where a change in the input does not necessarily translate to a proportional
change in the output. We will now discuss some mechanisms behind nonlinearity,
including feedback loops, some examples of this phenomenon, and why it makes
complex systems’ behavior less predictable.
The rich get richer. Wealthy people have more money to invest, which brings
them a greater return on investment. In a single investment cycle, the return on invest-
ment is greater in proportion to their greater wealth: a linear relationship. However,
this greater return can then be reinvested. Doing so forms a positive feedback loop
through which a slight initial advantage in wealth can be transformed into a much
larger one, leading to a nonlinear relationship between a person’s wealth and their
ability to make more money.
Negative feedback stabilizes honeybee foraging. When a forager bee finds a good water source, it returns to the hive to signal to the other bees the direction and distance of the source. However, a
returning forager needs to find a receiver bee onto which to unload the water. If too
many foragers have brought back water, it will take longer to find a receiver, and the
forager is less likely to signal to the others where they should fly to find the source.
This negative feedback process stabilizes the number of bees going out for water,
leading to a nonlinear relationship between the number of bees currently flying out
for water and the number of additional bees recruited to the task.
AI systems involve feedback loops. In a system where agents can affect the
environment, but the environment can also affect agents, the result is a continual,
circular process of change—a feedback loop. One example of feedback loops in-
volving AIs is the reinforcement-learning technique of self-play, where agents play
against themselves: the better an agent’s performance, the more it has to improve to
compete with itself, leading its performance to increase even more.
Feedback processes can make complex systems’ behavior difficult to pre-
dict. Positive feedback loops can amplify small changes in a system’s initial condi-
tions into considerable changes in its resulting behavior. This means that nonlinear
systems often have regimes in which they display extreme sensitivity to initial con-
ditions, a phenomenon called chaos (colloquially referred to as the butterfly effect).
A famous example of this is the logistic map, an equation that models how the pop-
ulation of a species changes over time:
x_{n+1} = r x_n (1 − x_n).
This equation is formulated to capture the feedback loops that affect how the pop-
ulation of a species changes: when the population is low, food sources proliferate,
enabling the population to grow; when it is high, overcrowding and food scarcity
drive the population down again. x_n is the current population of a species as a frac-
tion of the maximum possible population that its environment can support. x_{n+1}
represents the fractional population at some time later. The term r is the rate at
which the population increases if it is not bounded by limited resources. When the
parameter r takes a value above a certain threshold (∼ 3.57), we enter the chaotic
regime of this model, in which a tiny difference in the initial population makes for a
large difference in the long-run trajectory. Since we can never know a system’s initial
conditions with perfect accuracy, chaotic systems are generally considered difficult to
predict.
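The following sketch illustrates this sensitivity by iterating the logistic map for two nearly identical initial populations. The parameter choices are purely illustrative: r = 3.9 lies inside the chaotic regime, while r = 2.9 does not.

```python
# Sketch: sensitivity to initial conditions in the logistic map x_{n+1} = r*x_n*(1 - x_n).

def logistic_trajectory(x0, r, steps=50):
    """Iterate the logistic map starting from the initial population fraction x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

for r in (2.9, 3.9):
    a = logistic_trajectory(0.200000, r)
    b = logistic_trajectory(0.200001, r)  # perturb the initial population by one part in a million
    print(f"r = {r}: |difference| after 50 generations = {abs(a[-1] - b[-1]):.6f}")
# Below the chaotic threshold (r = 2.9) the two trajectories converge and the difference
# shrinks toward zero; in the chaotic regime (r = 3.9) the tiny perturbation is typically
# amplified into a macroscopic difference in the long-run trajectory.
```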
AIs as a self-reinforcing feedback loop. Since AIs can process information and
reach decisions more quickly than humans, putting them in charge of certain decisions
and operations could accelerate developments to a pace that humans cannot keep up
with. Even more AIs may then be required to make related decisions and run adjacent
operations. Additionally, if society encounters any problems with AI-run operations,
it may be that AIs alone can work at the speed and level of complexity required to
address these problems. In this way, automating processes could set up a positive
feedback loop, requiring us to continually deploy ever-more AIs. In this scenario, the
long-term use of AIs could be hard to control or reverse.
[Two panels plotting Population Size against Time (generations).]
Figure 5.3. When systems are predictable, small changes in initial conditions can taper out.
When they are chaotic, small changes in initial conditions lead to wildly different outcomes.
Summary. There are multiple ways in which complex systems exhibit nonlinearity.
A small change in the system’s input will not necessarily result in a proportional
change in its behavior; it might completely change the system’s behavior, or have
no effect at all. Positive feedback loops can amplify changes, while negative feedback
loops can quash them, leading a system to evolve nonlinearly depending on its current
state, and making its long-run trajectory difficult to predict.
Self-Organization
The next salient feature of complex systems we will discuss is self-organization. This
refers to how the components direct themselves in a way that produces collective
emergent properties without any explicit instructions.
Self-Organized Criticality
In an ecosystem poised near a critical point, even a small perturbation can potentially
set off a system-wide cascade of changes, in which many species will need to adapt
to survive. Similarly, AI development sometimes advances in bursts (e.g., GANs,
self-supervised learning in vision, and so on) separated by long periods of slower
development.
Distributed Functionality
Partial encoding means that no single component carries out a whole task. Consider an ant colony's foraging process: locating a food source, making a collective decision to exploit
it, swarming over it to break it up, and carrying small pieces of it back to the nest. A
single forager ant working alone cannot perform this whole process—or even any one
subtask; many ants are needed for each part, with each individual contributing only
partially to each task. We therefore say that foraging is partially encoded within any
single forager ant.
Redundant encoding means there are more components than needed for
any task. A flourishing ant colony will have many more ants than are necessary to
carry out its primary functions. This is why the colony long outlives its members; if
a few patroller ants get eaten, or a few foragers get lost, the colony as a whole barely
notices. We therefore say that each of the functions is redundantly encoded across
the component ants.
An example of distributed functionality is the phenomenon known as the “wisdom of
crowds,” which was notably demonstrated in a report from a village fair in 1906. At
this fair, attendees were invited to take part in a contest by guessing the weight of an
ox. 787 people submitted estimates, and it was reported that the mean came to 1,197
pounds. This was strikingly close to the actual weight, which was 1,198 pounds.
In situations like this, it is often the case that the average estimate of many people
is closer to the true value than any individual’s guess. We could say that the task
of making a good estimate is only partially encoded in any given individual, who
cannot alone get close to the actual value. It is also redundantly encoded because
any individual’s estimate can usually be ignored without noticeably affecting the
average.
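A minimal simulation reproduces the pattern, under the assumption that individual errors are large but unbiased; the 20% noise level and the normal distribution are illustrative choices, not features of the original contest.

```python
# Sketch: the wisdom of crowds with simulated, unbiased but noisy guesses.
import random

random.seed(0)
TRUE_WEIGHT = 1198  # pounds, the ox's actual weight in the 1906 contest
guesses = [random.gauss(TRUE_WEIGHT, 0.20 * TRUE_WEIGHT) for _ in range(787)]

mean_guess = sum(guesses) / len(guesses)
avg_individual_error = sum(abs(g - TRUE_WEIGHT) for g in guesses) / len(guesses)
print(f"Average individual error: {avg_individual_error:.0f} lb")  # typically around 190 lb
print(f"Error of the crowd's mean: {abs(mean_guess - TRUE_WEIGHT):.0f} lb")
# Averaging cancels the unbiased individual errors, so the crowd's mean typically lands
# an order of magnitude closer to the true weight than the average individual guess.
```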
On a larger scale, the wisdom of crowds might be thought to underlie the effectiveness
of democracy. Ideally, a well-functioning democracy should make better decisions than
any of its individual members could on their own. This is not because a democratic
society decomposes its problems into many distinct sub-problems, which can then be
delegated to different citizens. Instead, wise democratic decisions take advantage of
the wisdom of crowds phenomenon, wherein pooling or averaging many people’s views
leads to a better result than trusting any individual. The “sense-making” function of
democracies is therefore distributed across society, partially and redundantly encoded
in each citizen.
[Log-log plot of Metabolic Rate (kJ/day) against Body Mass (kg).]
Figure 5.4. Data on mammals and birds demonstrate Kleiber’s Law, with a power law
relationship appearing as a straight line on a log-log graph.
If we know that an elephant is five times heavier than a horse, we can guess that
the elephant's metabolic rate will be approximately 3.3 times the horse's (since
5^{3/4} ≈ 3.3). There are several other documented cases of this power-law scaling behavior
in complex systems. The average heart-rate for a typical member of a mammalian
species scales with the minus one-quarter power of its body mass:
R ∝ M^{-1/4}.
At the same time, the average lifespan scales with the one-quarter power of its body
mass:
T ∝ M^{1/4}.
This leads to the wonderful result that the average number of heartbeats per lifetime
is constant across all species of mammals (around 1.5 billion).
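The arithmetic behind these quarter-power laws can be checked directly. In the sketch below, the reference values assumed for the horse are rough, illustrative numbers; only the scaling exponents matter for the argument.

```python
# Sketch: quarter-power scaling across a five-fold difference in body mass.
# The horse's reference values below are rough, illustrative assumptions.

def scaled(value, mass_ratio, exponent):
    """Scale a quantity by (M_new / M_ref) ** exponent."""
    return value * mass_ratio ** exponent

horse = {"metabolic_kj_day": 50_000, "heart_bpm": 40, "lifespan_yr": 30}
ratio = 5  # the elephant is taken to be five times heavier than the horse

metabolic = scaled(horse["metabolic_kj_day"], ratio, 3 / 4)    # Kleiber's law, ~x3.34
heart_rate = scaled(horse["heart_bpm"], ratio, -1 / 4)         # R scales as M^(-1/4)
lifespan = scaled(horse["lifespan_yr"], ratio, 1 / 4)          # T scales as M^(1/4)

print(f"Metabolic rate multiplier: {metabolic / horse['metabolic_kj_day']:.2f}")
# Because the exponents cancel (-1/4 + 1/4 = 0), the heart rate x lifespan product,
# which is proportional to heartbeats per lifetime, is unchanged:
print(f"{heart_rate * lifespan:.0f} vs {horse['heart_bpm'] * horse['lifespan_yr']}")
```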
Among cities within the same country, the material infrastructure (such as the lengths
of pipes, powerlines, and roads, and the number of gas stations) scales with population
as a power-law with an exponent of 0.85. Also among cities within the same country,
socioeconomic quantities (such as incidents of crime and cases of flu) scale with the
population size raised to the 1.15 power.
Experiments on LLMs show that their loss obeys power laws too. In the
paper in which DeepMind introduced the Chinchilla model ([160]), the researchers
fit the following parametric function to the data they collected from experiments on
language models of different sizes, where N is the size of the model and D is the size
of the training dataset:
L(N, D) = E + A/N^α + B/D^β.
The irreducible loss (E) is the lowest loss that could possibly be achieved. Subtracting
this off, we see that the performance of the model as measured by the loss (L) exhibits
a power-law dependency on both the model parameter count (N) and the dataset size (D).
For more details on scaling laws in DL systems, see the Scaling Laws section in
Artificial Intelligence & Machine Learning.
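As a rough illustration of how this parametric form behaves, the sketch below evaluates it with coefficients that approximate the published Chinchilla fit [160]; the exact values should be treated as illustrative here rather than authoritative.

```python
# Sketch: the parametric scaling law L(N, D) = E + A / N**alpha + B / D**beta.
# The coefficients approximate the published Chinchilla fit [160]; treat them as illustrative.

E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted language-model loss for a model with N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

base = predicted_loss(70e9, 1.4e12)           # roughly Chinchilla-scale N and D
double_params = predicted_loss(140e9, 1.4e12)
double_data = predicted_loss(70e9, 2.8e12)
print(f"base: {base:.3f}  2x parameters: {double_params:.3f}  2x data: {double_data:.3f}")
# Scaling up N or D shrinks only its own power-law term, and the irreducible loss E
# always remains, so returns diminish smoothly rather than hitting a sharp wall.
```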
Adaptive Behavior
The final hallmark of complexity we will discuss is adaptive behavior, which involves
a system changing its behavior depending on the demands of the environment.
Summary. Complex systems can often undergo rapid changes in their structures
and processes in response to internal and external fluctuations. This adaptive behav-
ior enables the continuation of the system in a changing environment.
There are seven hallmarks of complexity that we can look out for when identifying
complex systems. These hallmarks are:
1. Emergence: the appearance of novel properties that arise from interactions be-
tween the system's components, but which do not exist in any single component.
2. Self-organization: components arrange themselves into collective structures and
behaviors without explicit instructions.
3. Feedback and nonlinearity: circular processes of mutual influence, and changes in
input that do not produce proportional changes in output.
4. Criticality: the presence of critical points at which a small fluctuation can tip the
system into a drastically different state.
5. Distributed functionality: tasks are shared loosely among components, partially
and redundantly encoded in each of them.
6. Scalable structure: structure that recurs across scales, giving rise to power-law
relationships.
7. Adaptive behavior: changes in the system's structures and processes that allow it
to keep functioning in a changing environment.
So far, we have described how DL systems possess many of the classic features of
complex systems. We have shown that they satisfy the two aspects of our working
definition of complex systems and that they display all seven hallmarks discussed
above.
We will now consider the organizations that develop AIs and the societies within
which they are deployed, and describe how these systems also exhibit the character-
istics of complex systems. We will argue that, on this basis, the problem of AI safety
should be treated under the complex systems paradigm.
the general ways in which individuals tend to interact with one another within an
organization or a research field.
and convince them that a particular cause is relevant and important. They might, for
instance, use new technology to innovate an original mode of campaigning, or link
the cause to the prevailing zeitgeist—another emergent property of social systems.
There are many nonlinear aspects of processes like advocacy. There are
numerous factors that affect whether or not an issue is included on a policy agenda.
Public interest in a cause, the influence of its opponents, and the number of other
issues competing for attention are among the many factors that affect the likelihood
that it is considered, and their effects are nonlinear; for instance, opponents with low
influence may allow an issue to be discussed, opponents with medium influence may
try to block it from being discussed, but opponents with high influence may permit it
to be discussed so that they can argue against it. Additionally, there is a degree of
randomness involved in determining which issues are considered. This means that the
policy progress resulting from a particular campaign does not necessarily reflect the
level of effort put into it, nor how well organized it was.
Together with distributed functionality and critical points, this nonlinearity can make
it difficult to evaluate how well a campaign was executed, or to attribute eventual success to any particular effort.
It might be that earlier efforts were essential in laying the groundwork or simply
maintaining some interest in a cause, even if they did not yield immediate results.
A later campaign might then succeed in prompting policy-level action, regardless of
whether it is particularly well organized, simply because the political climate becomes
favorable.
Other examples of nonlinearity within advocacy and policymaking arise from various
feedback loops. Since people are influenced by the opinions of those around them,
a small change in the initial level of support for a policy might be compounded
over time, creating momentum and ultimately tipping the balance as to whether
or not it is adopted. On the other hand, original activities that are designed to be
attention-grabbing may run up against negative feedback loops that diminish their
power over time. Other groups may imitate them, for instance, so that their novelty
wears off through repetition. Opponents of a cause may also learn to counteract any
new approaches that advocates for it try out. This dynamic was understood by the
military strategist Moltke the Elder, who is reported to have said that “no plan
survives first contact with the enemy.”
As we have discussed, AI systems and the social systems they are integrated within
are best understood as complex systems. For this reason, making AI safe is not like
solving a mathematical problem or fixing a watch. A watch might be complicated,
but it is not complex. Its mechanism can be fully understood and described, and its
behavior can be predicted with a high degree of confidence. The same is not true of
complex systems.
Since a system’s complexity has a significant bearing on its behavior, our approach
to AI safety should be informed by the complex systems paradigm. We will now look
at some lessons that have been derived from observations of many other complex
adaptive systems. We will discuss each lesson and what it means for AI safety.
Learning how to make AIs safe will require some trial and error. We
cannot usually attain a complete understanding of complex systems or anticipate all
their emergent properties purely by studying their structure in theory. This means we
cannot exhaustively predict every way they might go wrong just by thinking about
them. Instead, some amount of trial and error is required to understand how they
will function under different circumstances and learn about the risks they might pose.
The implication for AI safety is that some simulation and experimentation will be
required to learn how AI systems might function in unexpected or unintended ways
and to discover crucial variables for safety.
by accident. While we may have ideas about the kinds of hazards a system entails,
experimentation can help to confirm or refute these. Importantly, it can also help us
discover hazards we had not even imagined. These are called unknown unknowns,
or black swans, discussed extensively in the Safety Engineering chapter. Empirical
feedback loops are necessary.
Lesson: Systems Often Develop Subgoals Which Can Supersede the Original Goal
AIs might pursue distorted subgoals at the expense of the original goal.
The implication for AI safety is that AIs might pursue subgoals over the goals we
give them to begin with. This presents a risk that we might lose control of AIs, and
this could cause harm because their subgoals may not always be aligned with human
values.
A system often decomposes its goal into multiple subgoals to act as step-
ping stones. Subgoals might include instrumentally convergent goals, which are
discussed in the Single-Agent Safety chapter. The idea is that achieving all the sub-
goals will collectively amount to achieving the original aim. This might work for a
simple, mechanistic system. However, since complex systems are more than the sum
of their parts, breaking goals down in this way can distort them. The system might
get sidetracked pursuing a subgoal, sometimes even at the expense of the original one.
In other words, although the subgoal was initially a means to an end, the system may
end up prioritizing it as an end in itself.
For example, companies usually have many different departments, each one special-
ized to pursue a distinct subgoal. However, some departments, such as bureaucratic
ones, can capture power and steer the company toward goals quite different from its original one.
Political leaders can delegate roles to subordinates, but sometimes their subordinates
may overthrow them in a coup.
As another example, imagine a politician who wants to improve the quality of life
of residents of a particular area. Increasing employment opportunities often leads to
improvements in quality of life, so the politician might focus on this as a subgoal—a
means to an end. However, this subgoal might end up supplanting the initial one. For
instance, a company might want to build an industrial plant that will offer jobs, but
is also likely to leak toxic waste. Suppose the politician has become mostly focused
on increasing employment rates. In that case, they might approve the construction
of this plant, despite the likelihood that it will pollute the environment and worsen
residents’ quality of life in some ways.
Future AI agents may break down difficult long-term goals into smaller
subgoals. Creating subgoals can distort an AI’s objective and result in misalign-
ment. As discussed in the Emergent Capabilities section of the Single-Agent Safety
Chapter, optimization algorithms might produce emergent optimizers that pursue
subgoals, or AI agents may delegate goals to other agents and potentially have the
goal be distorted or subverted. In more extreme cases, the subgoals could be pursued
at the expense of the original one. Even if we specify our high-level objectives correctly,
there is no guarantee that systems will implement them in practice. As a result,
systems may not pursue goals that we would consider beneficial.
Lesson: A Safe System, When Scaled Up, Is Not Necessarily Still Safe
Lesson: Working Complex Systems Have Usually Evolved From Simpler Systems
Summary.
The general lessons that we should bear in mind for AI safety are:
1. We cannot predict every possible outcome of AI deployment by theorizing, so some
trial and error will be needed
2. Even if we specify an AI's goals perfectly, it may not pursue them in practice, as
it may instead pursue unexpected, distorted subgoals
3. A small system that is safe will not necessarily remain safe if it is scaled up
4. The most promising approach to building a large AI that is safe is nonetheless to
make smaller systems safe and scale them up cautiously
5. We cannot rely on keeping humans in the loop to make AI systems safe, because
humans are not perfectly reliable and, moreover, AIs are likely to accelerate pro-
cesses too much for humans to keep up.
So far, we have explored the contrasts between simple and complex systems and why
we need different approaches to analyzing and understanding them. We have also
described how AIs and the social systems surrounding them are best understood as
complex systems, and discussed some lessons from the field of complex systems that
can inform our expectations around AI safety and how we address it.
In attempting to improve the safety of AI and its integration within society, we are
engaging in a form of problem-solving. However, simple and complex systems present
entirely different types of problems that require different styles of problem-solving.
We can therefore reframe our earlier discussion of reductionism and complex systems
in terms of the kinds of challenges we can address within each paradigm. We will
now distinguish between three different kinds of challenges—puzzles, problems, and
wicked problems. We will look at the systems that they tend to arise in, and the
different styles of problem-solving we require to tackle each of them.
Problems. With problems, we do not always have all the relevant information up-
front, so we might need to investigate to discover it. This usually gives us a better
understanding of what’s causing the issue, and ideas for solutions often follow natu-
rally from there. It may turn out that there is more than one approach to fixing the
problem. However, it is clear when the problem is solved and the system is functioning
properly again.
We usually find problems in systems that are complicated, but not complex. For
example, in car repair work, it might not be immediately apparent what is causing
an issue. However, we can investigate to find out more, and this process often leads
us to sensible solutions. Like puzzles, problems are amenable to the reductionist
paradigm, although they may involve more steps of analysis.
Wicked Problems
Wicked problems usually arise in complex systems and often involve a
social element. Wicked problems are a completely different class of challenges from
puzzles and problems. They appear in complex systems, with examples including
inequality, misinformation, and climate change. There is also often a social factor
involved in wicked problems, which makes them more difficult to solve. Owing to their
multifaceted nature, wicked problems can be tricky to categorically define or explain.
We will now explore some key features that are commonly used to characterize them.
There is no single explanation or single solution for a wicked problem.
We can reasonably interpret a wicked problem as stemming from more than one
possible cause. As such, there is no single correct solution or even a limited set of
possible solutions.
No proposed solution to a wicked problem is fully right or wrong, only
better or worse. Since there are usually many factors involved in a wicked prob-
lem, it is difficult to find a perfect solution that addresses them all. Indeed, such a
solution might not exist. Additionally, due to the many interdependencies in complex
systems, some proposed solutions may have negative side effects and create other
issues, even if they reduce the targeted wicked problem. As such, we cannot usually
find a solution that is fully correct or without flaw; rather, it is often necessary to
look for solutions that work relatively well with minimal negative side effects.
There is often a risk involved in attempting to solve a wicked problem.
Since we cannot predict exactly how a complex system will react to an intervention
in advance, we cannot be certain as to how well a suggested solution will work or
whether there will be any unintended side effects. This means there may be a high
cost to attempting to address wicked problems, as we risk unforeseen consequences.
However, trying out a potential solution is often the only way of finding out whether
it is better or worse.
Every wicked problem is unique because every complex system is unique.
While we can learn some lessons from other systems with similar properties, no two
systems will respond to our actions in exactly the same way. This means that we
cannot simply transpose a solution that worked well in one scenario to a different one
and expect it to be just as effective. For example, introducing predators to control
pest numbers has worked well in some situations, but, as we will discuss in the next
section, it has failed in others. This is because all ecosystems are unique, and the
same is true of all complex systems, meaning that each wicked problem is likely to
require a specifically tailored intervention.
It might not be obvious when a wicked problem has been solved. Since
wicked problems are often difficult to perfectly define, it can be challenging to say
they have been fully eliminated, even if they have been greatly reduced. Indeed, since
wicked problems tend to be persistent, it might not be feasible to fully eliminate
many of them at all. Instead, they often require ongoing efforts to improve
the situation, though the ideal scenario may always be beyond reach.
Summary. Puzzles and problems usually arise in relatively simple systems that
we can obtain a complete or near-complete understanding of. We can therefore find
all the information we need to explain the issue and find a solution to it, although
problems may be more complicated, requiring more investigation and steps of analysis
than puzzles.
Wicked problems, on the other hand, arise in complex systems, which are much
more difficult to attain a thorough understanding of. There may be no single correct
explanation for a wicked problem, proposed solutions may not be fully right or wrong,
and it might not be possible to find out how good they are without trial and error.
Every wicked problem is unique, so solutions that worked well in one system may
not always work in another, even if the systems seem similar, and it might not be
possible to ever definitively say that a wicked problem has been solved. Owing to the
complex nature of the systems involved, AI safety is a wicked problem.
As touched on above, there are usually many potential solutions to wicked problems,
but they may not all work in practice, even if they sound sensible in theory. We might
therefore find that some attempts to solve wicked problems will be ineffective, have
negative side effects, or even backfire. Complex systems have so many interdepen-
dencies that when we try to adjust one aspect of them, we can inadvertently affect
others. For this reason, we should approach AI safety with more humility and more
awareness of the limits of our knowledge than if we were trying to fix a watch or a
washing machine. We will now look at some examples of historical interventions in
complex systems that have not gone to plan. In many cases, they have done more
harm than good.
Warning signs on roads. Road accidents are a long-standing and pervasive issue.
A widely used intervention is to display signs along roads with information about the
number of crashes and fatalities that have happened in the surrounding area that year.
The idea is that this information should encourage people to drive more carefully.
However, one study has found that signs like this increase the number of accidents
and fatalities, possibly because they distract drivers from the road.
Renewable Heat Incentive Scandal. In 2012, a government department in
Northern Ireland wanted to boost the fraction of energy consumption from renewable
sources. To this end, they set up an initiative offering businesses generous subsidies
for using renewable heating sources, such as wood pellets. However, in trying to reach
their percentage targets for renewable energy, the politicians offered a subsidy that
was slightly more than the cost of the wood pellets. This incentivized businesses to
use more energy than they needed and profit from the subsidies. There were reports
of people burning pellets to heat empty buildings unnecessarily. The episode became
known as the “Cash for Ash scandal.”
Barbados-Grenada football match. In the 1994 Caribbean Cup, an interna-
tional football tournament, organizers introduced a new rule to reduce the likelihood
of ties, which they thought were less exciting. The rule was that if two teams were
tied at the end of the allotted 90 minutes, the match would go to extra time, and
any goal scored in extra time would be worth double. The idea was to incentivize the
players to try harder to score. However, in a match between Barbados and Grenada,
Barbados needed to win by two goals to advance to the tournament finals. The score
as they approached 90 minutes was 2-1 to Barbados. This resulted in a strange situa-
tion where Barbados players tried to score an own goal to push the match into extra
time and have an opportunity to win by two.
Summary. Interventions that work in theory might fail in a complex system. In
all these examples, an intervention was attempted to solve a problem in a complex
system. In theory, each intervention seemed like it should work, but each decision-
maker’s theory did not capture all the complexities of the system at hand. Therefore,
when each intervention was applied, the system reacted in unexpected ways, leaving
the original problem unsolved, and often creating additional problems that might be
even worse.
During the Four Pests campaign in China in the late 1950s, sparrows were targeted for elimination because they ate grain. Over the course of this campaign, sparrows were killed intensively and their populations plummeted.
However, as well as grain, sparrows also eat locusts. In the absence of a natural
predator, locust populations rose sharply, destroying more crops than the sparrows
did [284]. Although many factors were at play, including poor weather and officials’
decisions about food distribution [285], this ecosystem imbalance is often considered
a contributing factor in the Great Chinese Famine [284], during which tens of millions
of people starved between 1959 and 1961.
Ecosystems are highly complex, with intricate balances between the populations of
many species. We could think of agricultural systems as having a “stable state” that
naturally involves some crops being lost to wildlife. If we try to reduce these losses
simply by eliminating one species, then another might take advantage of the available
crops instead, acting as a kind of restoring force.
Antibiotic resistance. Bacterial infections have been a cause of illness and mor-
tality in humans throughout history. In September 1928, bacteriologist Alexander
Fleming discovered penicillin, the first antibiotic. Over the following years, the meth-
ods for producing it were refined, and, by the end of World War II, there was a large
supply available for use in the US and Britain. This was a huge medical advance-
ment, offering a cure for many common causes of death, such as pneumonia and
tuberculosis. Death rates due to bacterial illnesses dropped dramatically [286]; it is
estimated that, in 1952, in the US, around 150,000 fewer people died from bacterial
illnesses than would have without antibiotics. In the early 2000s, it was estimated
that antibiotics may have been saving around 200,000 lives annually in the US alone.
However, as antibiotics have become more abundantly used, bacteria have begun to
evolve resistance to these vital medicines. Today, many bacterial illnesses, including
pneumonia and tuberculosis, are once again becoming difficult to treat due to the
declining effectiveness of antibiotics. In 2019, the Centers for Disease Control and Pre-
vention reported that antimicrobial-resistant bacteria are responsible for over 35,000
deaths per year in the US [287].
In this case, we might think of the coexistence of humans and pathogens as having
a stable state, involving some infections and deaths. While antibiotics have reduced
deaths due to bacteria over the past decades, we could view natural selection as a
“restoring force,” driving the evolution of bacteria to become resistant and causing
deaths to rise again. Overuse of these medicines intensifies selective pressures and
accelerates the process.
In this case, it is worth noting that antibiotics have been a monumental advancement
in healthcare, and we do not argue that they should not be used or that they are a
failed intervention. Rather, this example highlights the tendency of complex systems
to react against measures over time, even if they were initially highly successful.
Successful Interventions
We have discussed several examples of failed interventions in complex systems. While
it can be difficult to say definitively that a wicked problem has been solved, there are some
examples of interventions that have clearly been at least partially successful. We will
now look at some of these examples.
Reversal of the depletion of the ozone layer. Toward the end of the 20th
century, it was discovered that certain compounds frequently used in spray cans, refrig-
erators, and air conditioners were reaching the ozone layer and depleting it, leading
to more harmful radiation passing through. As a result, the Montreal Protocol, an
international agreement to phase out the use of these compounds, was negotiated in
1987 and enacted soon after. It has been reported that the ozone layer has started to
recover since then.
Public health campaigns against smoking. In the 20th century, scientists dis-
covered a causal relationship between tobacco smoking and lung cancer. In the fol-
lowing decades, governments started implementing various measures to discourage
people from smoking. Initiatives have included health warnings on cigarette packets,
smoking bans in certain public areas, and programs supporting people through the
process of quitting. Many of these measures have successfully raised public awareness
of health risks and contributed to declining smoking rates in several countries.
While these examples show that it is possible to address wicked problems, they also
demonstrate some of the difficulties involved. All these interventions have required
enormous, sustained efforts over many years, and some have involved coordination on
a global scale. It is worth noting that smallpox is the only human infectious disease that has
ever been eradicated. One challenge in replicating this success elsewhere is that some
viruses, such as influenza viruses, evolve rapidly to evade vaccine-induced immunity.
This highlights how unique each wicked problem is.
Campaigns to dissuade people from smoking have faced pushback from the tobacco
industry, showing how conflicting incentives in complex systems can hamper attempts
to solve wicked problems. Additionally, as is often the case with wicked problems,
we may never be able to say that smoking is fully “solved”; it might not be feasible
to reach a situation where no one smokes at all. Nonetheless, much positive progress
has been made in tackling this issue.
We have discussed the characteristics of wicked problems as stemming from the com-
plex systems they arise from, and explored why they are so difficult to tackle. We
have also looked at some examples of failed attempts to solve wicked problems, as well
as examples of more successful ones, and explored the idea of shifting stable points,
instead of just trying to pull a system away from its stable points. We will now dis-
cuss ways of thinking more holistically and identifying more effective, system-level
solutions.
We should think about the function we are trying to achieve and the
system we are using. One method of finding more effective solutions is to “zoom
out” and consider the situation holistically. In complex systems language, we might
say that we need to find the correct scale at which to analyze the situation. This
might involve thinking carefully about what we are trying to achieve and whether
individuals or groups in the system exhibit the behaviors we are trying to control.
We should consider whether, if we solve the immediate problem, another one might
be likely to arise soon after.
Consider, for example, explanations of the outbreak of the First World War. A “root cause” account points to the assassination of Archduke Franz Ferdinand in 1914, but tensions were already high beforehand. If the assassination had not happened, some-
thing else might have done, also triggering a conflict. A better approach might in-
stead invoke the imperialistic ambitions of many nations and the development of new
militaristic technologies, which led nations to believe there was a strong first-strike
advantage.
We can also find the contrast between these two mindsets in the different explanations
put forward for the Bhopal gas tragedy, a huge leak of toxic gas that happened in
December 1984 at a pesticide-producing plant in Bhopal, India. The disaster caused
thousands of deaths and injured up to half a million people. A “root cause” expla-
nation blames workers for allowing water to get into some of the pipes, where it
set off an uncontrolled reaction with other chemicals that escalated to catastrophe.
However, a more holistic view focuses on the slipping safety standards in the run-up
to the event, during which management failed to adequately maintain safety systems
and ensure that employees were properly trained. According to this view, an accident
was bound to happen as a result of these factors, regardless of the specific way in
which it started.
5.4 CONCLUSION
In this chapter, we have explored the properties of complex systems and their implica-
tions for AI safety strategies. We began by contrasting simple systems with complex
systems. While the former can be understood as the sum of their parts, the latter
display emergent properties that arise from complex interactions. These properties
do not exist in any of the components in isolation and cannot easily be derived from
reductive analysis of the system.
Next, we explored seven salient hallmarks of complexity. We saw that feedback loops
are ubiquitous in complex systems and often lead to nonlinearity, where a small
change in the input to a system does not result in a proportionate change in the
output. Rather, fluctuations can be amplified or quashed by feedback loops. Further-
more, these processes can make a system highly sensitive to its initial conditions,
meaning that a small difference at the outset can lead to vastly different long-term
trajectories. This is often referred to as the “butterfly effect,” and makes it difficult
to predict the behaviors of complex systems.
We also discussed how the components of complex systems tend to self-organize to
some extent and how they often display critical points, at which a small fluctuation
can tip the system into a drastically different state. We then looked at distributed
functionality, which refers to how tasks are loosely shared among components in a
complex system, and scalable structure, which gives rise to power laws within complex
systems. The final hallmark of complexity we discussed was adaptive behavior, which
allows systems to continue functioning in a changing environment.
Along the way, we highlighted how DL systems exhibit the hallmarks of complexity.
Beyond AIs themselves, we also showed how the social systems they exist within are
also best understood as complex systems, through the worked examples of corpora-
tions and research institutes, political systems, and advocacy organizations.
Having established the presence of complexity in AIs and the systems surrounding
them, we looked at what this means for AI safety by looking at five general lessons.
Since we cannot usually predict all emergent properties of complex systems simply
through theoretical analysis, some trial and error is likely to be required in making
AI systems safe. It is also important to be aware that systems often break down
goals into subgoals, which can supersede the original goal, meaning that AIs may not
always pursue the goals we give them.
Due to the potential for emergent properties, we cannot guarantee that a safe system
will remain safe when it is scaled up. However, since we cannot usually understand
complex systems perfectly in theory, it is extremely difficult to build a flawless com-
plex system from scratch. This means that starting with small systems that are safe
and scaling them up cautiously is likely the most promising approach to building
large complex systems that are safe. The final general lesson is that we cannot guar-
antee AI safety by keeping humans in the loop, so we need to design systems with
this in mind.
Next, we looked at how complex systems often give rise to wicked problems, which
cannot be solved in the same way we would approach a simple mathematics question
or a puzzle. We saw how difficult it is to address wicked problems, due to the unex-
pected side effects that can occur when we interfere with complex systems. However,
we also explored examples of successful interventions, showing that it is possible to
make significant progress, even if we cannot fully solve a problem. In thinking about
the most effective interventions, we highlighted the importance of thinking holistically
and looking for system-level solutions.
AI safety is not a mathematical puzzle that can be solved once and for all. Rather, it
is a wicked problem that is likely to require ongoing, coordinated efforts, and flexible
strategies that can be adapted to changing circumstances.
5.5 LITERATURE
6.1 INTRODUCTION
At least at first glance, all of these seem like reasonable answers and excellent goals
for AIs. However, many of them are incompatible, because they
make different normative assumptions and require different technical implementa-
tions. Other suggestions such as “AIs should follow human intentions” are highly
vague. Put straightforwardly, while these sound attractive and similar, they are not
the same thing.
This should challenge the notion that this is a straightforward problem with a simple
solution, and that the real challenges lie only elsewhere, such as in making AI systems
more capable. Those who believe there are easy ways to ensure that AIs act ethically
may find themselves grappling with inconsistencies and confusion when confronted
with instances in which their preferred methods appear to yield harmful behavior.
In this chapter, we will explore these issues, attempting to understand which answers,
if any, take us closer to creating safe and beneficial AIs. We start by considering some
goals commonly proposed to ensure that AI systems are beneficial for society. Current
and future AI systems will need to comply with existing laws and to avoid biases and
unfairness toward certain groups in society. Beyond this, many want AI to make our
economic systems more efficient and to boost economic growth. We briefly consider
the attractions of these goals, as well as some shortcomings:
1. Law: Should AI systems just be made to follow the law? We examine whether
we can design AIs that follow existing legal frameworks, considering that law is
a legitimate aggregation of human values that is time-tested and comprehensive
while being both specific and adaptable. We will lay out the challenges of the
occasional silence, immorality, or unrepresentativeness of established law.
2. Bias and Fairness: How can we design AI systems to be fair? We explore bias
and fairness in AI systems and the challenges associated with ensuring outcomes
created by AIs are fair. We will discuss different definitions of fairness and see how
they are incompatible, as well as consider approaches to mitigating biases.
3. Economic Engine: Should we let the economy decide what AIs will be like? We
consider how economic forces are shaping AI development, and why we might be
concerned about letting economic incentives alone determine how AI is developed
and which objectives AI systems pursue.
Machine ethics. While all of these proposals seem to capture important intuitions
about what we value, we argue that they face significant limitations and are not
sufficient on their own to ensure beneficial outcomes. We consider what it would mean
to create AI systems that aim directly to improve human wellbeing. This provides
a case study in thinking about how we can create AI systems that can respond
appropriately to moral considerations, an emerging field known as machine ethics.
While this is a highly ambitious goal, we believe it is likely to become increasingly
relevant in coming years. As AI capabilities improve, they may be operating with
an increasing level of autonomy and a broadening scope of potential actions they
could take. In this context, approaches based only on compliance with the law or
profit maximization are likely to prove increasingly inadequate in providing sufficient
guardrails on AI systems’ behavior. To specify in greater detail how they should react
in a particular situation, AI systems will need to be able to identify and respond
appropriately to the relevant values at stake, such as human wellbeing.
Wellbeing. If we are to set wellbeing or other values as goals for AI systems, one
fundamental question that faces us is how we should specify these values. In the
second part of this chapter, we consider various interpretations of wellbeing and how
attractive it would be to have AI systems pursue these, assuming that they became
capable enough to do this. We start by introducing several competing theories of
wellbeing, which might present different goals for ethical AI systems. We examine
theories of wellbeing that focus on pleasure, objective goods, and preference satisfac-
tion. We then explore preference satisfaction in more detail and consider what kind of
preferences AI systems should satisfy. AI systems can be created to satisfy individual
preferences, but which preferences they should focus on is an open question. We fo-
cus on the challenges of deciding between revealed, stated, and idealized preferences.
Next, we turn to consider whether AI systems should aim to make people happy
and how we might use AIs to promote human happiness. Finally, we consider the
challenge of having AIs maximize wellbeing not just for an individual, but across the
whole of society. We look at how to aggregate total wellbeing across society, focusing
on social welfare functions. We discuss what social welfare functions are and how to
trade-off between equity and efficiency in a principled way.
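As a preview of that discussion, the toy sketch below contrasts two standard social welfare functions, a utilitarian sum and a Rawlsian (maximin) criterion, on made-up wellbeing numbers; both the numbers and the choice of these two particular functions are illustrative assumptions.

```python
# Sketch: two standard social welfare functions applied to a toy three-person society.
# The wellbeing levels are made-up numbers used only for illustration.

def utilitarian(wellbeing):
    """Efficiency-focused: total wellbeing, regardless of how it is distributed."""
    return sum(wellbeing)

def rawlsian(wellbeing):
    """Equity-focused: the wellbeing of the worst-off person."""
    return min(wellbeing)

before = [2.0, 5.0, 9.0]
after = [3.0, 5.0, 7.0]  # a costly transfer: the worst-off gains 1, the best-off loses 2

for name, swf in [("utilitarian", utilitarian), ("rawlsian", rawlsian)]:
    print(f"{name}: before = {swf(before)}, after = {swf(after)}")
# The transfer lowers the utilitarian score (16 -> 15) but raises the Rawlsian one (2 -> 3),
# illustrating the equity-efficiency trade-off that a social welfare function must adjudicate.
```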
Moral uncertainty. There are many cases where we may feel uncertain about
what the right response is to a particular situation due to conflicting moral consider-
ations. We consider how AI systems might respond to such situations. In the case of
AI systems, we may deliberately want to introduce uncertainty into their reasoning to
avoid over-confident decisions that could be disastrous from some moral perspectives.
One option to address moral uncertainty is using a moral parliament, where ethical
decisions are made by simulating democratic processes.
6.2 LAW
Why not have AIs follow the law? We have just argued that for AI systems to
be safe and beneficial, we need to ensure they can respond to moral considerations
such as wellbeing. However, some might argue that simply getting AIs to follow the
law is a better solution.
The law has three features that give it an advantage over ethics as a model for safe
and beneficial AI. In democratic countries, the law is a legitimate representation of
our moral codes: at least in theory, it is a democratically endorsed record of our
shared values. Law is time-tested and comprehensive; it has developed over many
generations to adjudicate the areas where humans have consequential disagreements.
Finally, legal language can be specific and adaptable to new contexts, comparing
favorably to ethical language, which can often be interpreted in diverging ways.
The next subsection will expand on these features of the law. We will see how these
features contrast favorably with ethics before arguing that we do need ethics after
all, but alongside the law.
Legitimacy provides the law with a clear advantage over ethics. The law
provides a collection of rules and standards that enable us to differentiate illegal from
legal actions. Ethics, on the other hand, isn’t standardized or codified. To determine
ethical and unethical actions, we have to either pick an ethical theory to follow, or
decide how to weigh the differing opinions of multiple theories that we think might
be true. But any of these options are likely to be more controversial than simply
following the law. Ethics has no in-built method of democratic agreement.
However, just following the law isn’t a perfect solution: there will always be an act of
interpretation between the written law and its application in a particular situation.
There is often no agreement over the procedure for this interpretation. Therefore,
even if AI systems were created in a way that bound them to follow the law, a
legal system with human decision-makers might have to remain part of the process.
The law is only legitimate when interpreted by someone democratically appointed or
subject to democratic critique.
Systems of law have evolved over generations. With each generation, new
people are given the job of creating, enforcing, and interpreting the law. The law
covers a huge range of issues and incorporates a wide range of distinctions. Because
these bodies of law have been evolving for so long, the information encoded in the
law is a record of what has worked for many people and is often considered an
approximation of their values. This makes the law a particularly promising resource
for aligning AI systems with the values of the population.
Laws are typically written without AIs in mind. Most laws are written
with humans in mind. As a result, they assume things such as intent—for example,
American law holds it a crime to knowingly possess a biological agent that is intended
for use as a weapon. Both ‘knowingly’ and ‘intended’ are terms we may not be
able to apply to AIs, since AIs may not have the ability to know or intend things.
Having AIs follow laws that were written without AIs in mind might result in unusual
interpretations and applications of the law.
Another area of ambiguity is copyright law. AIs are trained on a vast corpus of
training data, created by developers, and ultimately run by users. When AIs create
content, it is unclear whether the user, developer, or creator of the training data
should be assigned the intellectual property rights. IP laws were written for humans
and human-led corporations creating content, not AIs.
A law based on rules alone would be flawed. For example, a fixed speeding rule
would mean fining someone who was accelerating momentarily to avoid hitting a
pedestrian and not fining someone who continued to drive at the maximum speed
limit around a blind turn, creating a danger for other drivers. Rules are always over-
inclusive (they will apply to some cases we would rather not be illegal) and under-
inclusive (they won’t apply to all cases we would like to be illegal).
The law uses rules and standards. Using rules and standards alongside each
other, the law can find the best equilibrium between carrying out the lawmaker’s
intentions and accounting for situations they didn’t foresee [293]. This gives the law an
advantage in the problem of maintaining human control of AI systems by displaying
the right level of ambiguity.
Legal language is less ambiguous than ethical language. Phrases like “do the right
thing,” “act virtuously,” or “make sure you are acting consistently” can mean different
things to different people. At the same time, legal language is more flexible than
programming languages, which are brittle and designed to fit only particular contexts.
Legal language can maintain a middle ground between rigid rules and more sensible
standards.
Figure 6.1. Legal language balances being less ambiguous than ethical language while being
more flexibly formatted than programming language.
We can apply these insights about rules and standards in law to two core problems
in controlling AI systems: misinterpretation and gaming. The law is specific enough
to avoid misinterpretation and adaptable enough to prevent many forms of gaming.
attacks. There is more work to be done before we can feel comfortable that systems
will reliably be able to interpret laws and other commands in practice.
However, we still face the risk of AIs gaming our commands to get what
they want. Stuart Russell raises a different concern with AI: gaming [54]. An AI
system may “play” the system or rules, akin to strategizing in a game, to achieve
its objectives in unexpected or undesired ways. He gives the example of tax codes.
Humans have been designing tax laws for 6000 years, yet many still avoid taxation.
Creating a tax code not susceptible to gaming is particularly difficult because indi-
viduals are incentivized to avoid paying taxes wherever possible. If our track record
with creating rules to constrain each other is so bad, then we might be pessimistic
about constraining AI systems that might have goals that we don’t understand.
A partial solution to misinterpretation and gaming is found in rules and
standards. If we are concerned about misinterpretation, we might choose to rely
on laws that are specific. Rules such as “any generated image must have a digital
watermark” are specific enough that they are difficult to misinterpret. We might
prefer using such laws rather than relying on abstract ethical principles, which are
vaguer and easier to misinterpret.
Conversely, if we are concerned about AIs gaming rules, we might prefer to have
standards, which can cover more ground than a rule. A well-formulated standard can
lead to an agent consistently finding the right answer, even in new and unique cases.
Such approaches are sometimes applied in the case of taxes. In the UK, for example,
there is a rule against tax arrangements that are “abusive.” This is not an objective
trigger: it is up to a judge to decide what is “abusive.” An AI system trained to follow
the law can be accountable to rules and standards.
Silent Law
The law is a set of constraints, not a complete guide to behavior. The law
doesn’t cover every aspect of human life, and certainly not everything we need to
constrain AI. Sometimes, this is accidental, but sometimes, the law is intentionally
silent. We can call these zones of discretion: areas of behavior that the law doesn’t
constrain; for example, a state’s economic policy and the content of most contracts
between private individuals. The law puts some limits on these areas of behavior.
Still, lawmakers intentionally leave free space to enable actors to make their own
choices and to avoid the law becoming too burdensome.
Immoral Law
On many issues, the boundaries of the law are surprising. Since laws are
often developed ad hoc, with new legislation to tackle the issues of the day, what laws
actually permit and prohibit can often be unintuitive. For example, the “Anarchist’s
Cookbook,” released in the US during the Vietnam War, had detailed instructions
on how to create narcotics and explosives and sold millions of copies without being
taken out of circulation because the law permitted this use of mass media. Similarly,
“doxing” someone with publicly available information and generating compromising
images and videos of real people are legal as well. Such acts seem intuitively illegal,
but US law permits them.
Unrepresentative Law
Law isn’t the only way, or even necessarily the best way, to arrive at an
aggregation of our values. Not all judges and legal professionals agree that the law
should capture the values of the populace. Many think that legal professionals know
better than the public, and that laws should be insulated from the changing moral
opinions of the electorate. This suggests that we might be able to find, or conceive of,
more representative ways of capturing the values of the populace. Current alternatives
like surveys or citizens’ assemblies are useful for some purposes, such as determining
preferences on specific issues or arriving at informed, representative policy proposals.
However, they aren’t suited to the general task of summarizing the values of the
entire population across areas as comprehensive as those covered by the law.
Asimov’s Three Laws of Robotics are not meant to provide a solution to ethical challenges. Asimov
himself frequently tested the adequacy of these laws throughout his writing,
showing that they are, in fact, limited in their ability to resolve ethical prob-
lems. Below, we explore some of these limitations.
Asimov’s laws are insufficient for guiding ethical behavior in AI
systems [297]. The three laws use under-defined terms like “harm” and
“inaction.” Because they’re under-defined, they could be interpreted in multiple
ways. It’s not clear precisely what “harm” means to humans, and it would be
even more difficult to encode the same meaning in AI systems.
Harm is a complex concept. It can be physical or psychological. Would a robot
following Asimov’s first law be required to intervene when humans are about
to hurt each other’s feelings? Would it be required to intervene to prevent
a human from behaving in ways that are self-harming but deliberate, like
smoking? Consider the case of amputating a limb in order to stop the spread of an infection: a robot that could never harm a human would be unable to perform such an operation, even when the short-term harm prevents a far greater one.
The law is comprehensive, but not comprehensive enough to ensure that the actions
of an AI system are safe and beneficial. AI systems must follow the law as a baseline,
but we must also develop methods to ensure that they follow the demands of ethics
as well. Relying solely on the law would leave many gaps that an AI could exploit or within which it could make ethical errors. To create beneficial AI that acts in the interests of
humanity, we need to understand the ethical values that people hold over and above
the law.
6.3 FAIRNESS
We can use the law to ensure that AIs make fair decisions. AIs are
being used in many sensitive applications that affect human lives, from lending and
employment to healthcare and criminal justice. As a result, unfair AI systems can
cause serious harm.
The COMPAS case study. A famous example of algorithmic decision-making in
criminal justice is the COMPAS (Correctional Offender Management Profiling for Al-
ternative Sanctions) software used by over 100 jurisdictions in the US justice system.
This algorithm uses observed features such as criminal history to predict recidivism,
or how likely defendants are to reoffend. A ProPublica report [298] showed that
COMPAS disproportionately labeled African-Americans as higher risk than white
counterparts with nearly identical offense histories. However, COMPAS’s creators argued that it was calibrated, with accurate general probabilities of recidivism across the three general risk levels, and that it was less biased and better than human judgment.
6.3.1 Bias
Bias can manifest at every stage of the AI lifecycle. From data collec-
tion to real-world deployment, bias can be introduced through multiple mechanisms
at any step in the process. Historical and social prejudices produce skewed training
data, propagating flawed assumptions into models. Flawed models can cement biases
into the AI systems that help make important societal decisions. In addition, humans
misinterpreting results can further compound bias. After deployment, biased AI sys-
tems can perpetuate discriminatory patterns through harmful feedback loops that
exacerbate bias. Developing unbiased AI systems requires proactively identifying and
mitigating biases across the entire lifecycle.
Figure 6.2. The lifecycle of bias in AI systems: systematic psychological, historical, and social biases can lead to algorithmic biases within AI systems.
In many countries, some social categories are legally protected from dis-
crimination. Groups called protected classes are legally protected from harmful
forms of bias. These often include race, religion, sex/gender, sexual orientation, ances-
try, disability, age, and others. Laws in many countries prohibit denying opportunities
or resources to people solely based on these protected attributes. Thus, AI systems
exhibiting discriminatory biases against protected classes can produce unlawful out-
comes. Mitigating algorithmic bias is crucial for ensuring that AI complies with equal
opportunity laws by avoiding discrimination.
Below, we first examine data-driven sources of bias, including flawed training data, subtle patterns that
can be used to discriminate, biases in how the data is generated or reported, and
underlying societal biases. Flawed or skewed training data can propagate biases into
the model’s weights and predictions. Then, we show how RL training environments
and objectives can also reinforce bias.
Reporting bias arises when the frequency of events in the data does not match their real-world occurrence. For instance, the news amplifies shocking events and under-reports nor-
mal occurrences or systematic, ongoing problems—reporting shark attacks rather
than cancer deaths. Sampling bias occurs when the data collection systematically
over-samples some groups and undersamples others. For instance, facial recognition
datasets in Western countries often include many more lighter-skinned individuals.
Labeling bias is introduced later in the training process, when systematic errors in the
data labeling process distort the training signal for the model. Humans may introduce
their own subjective biases when labeling data.
Beyond problems with the training data, the training environments and objectives of
RL models can also exhibit problems that promote bias. Now, we will review some
of these sources of bias.
Training environments can also amplify bias. Reward bias occurs when the
environments used to train RL models introduce biases through improper rewards.
RL models learn based on the rewards received during training. If these rewards fail
to penalize unethical or dangerous behavior, RL agents can learn to pursue immoral
outcomes. For example, models trained in video games may learn to accomplish goals
by harming innocents if these actions are not sufficiently penalized in training. Some
training environments may fail to encourage good behaviors enough, while others can
even incentivize bad behavior by rewarding RL agents for taking harmful actions.
Humans must carefully design training environments and incentives that encourage
ethical learning and behavior [304].
RL models can optimize for training objectives that amplify bias or
harm. Reinforcement learning agents will try to optimize the goals they are given
in training, even if these objectives are harmful or biased, or reflect problematic as-
sumptions about value. For example, a social media news feed algorithm trained to
maximize user engagement may prioritize sensational, controversial, or inflammatory
content to increase ad clicks or watch time. Technical RL objectives often make im-
plicit value assumptions that cause harm, especially when heavily optimized by a
powerful AI system [58, 305]. News feed algorithms implicitly assume that how much
a user engages with some content is a high-quality indicator of the value of that
content, therefore showing it to even more users. After all, social media companies
train ML models to maximize ad revenue by increasing product usage, rather than
fulfilling goals that are harder to monetize or quantify, such as improving user expe-
rience or promoting accurate and helpful information. Especially when taken to their
extreme and applied at a large scale, RL models with flawed training objectives can
exacerbate polarization, echo chambers, and other harmful outcomes. Problems with
the use of flawed training objectives are further discussed in section 3.3.
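To make the point concrete, here is a toy sketch (entirely hypothetical, not any platform's actual objective) contrasting a reward that only measures engagement with one that also penalizes content a separate classifier flags as inflammatory:

```python
# Toy illustration (hypothetical): two candidate reward functions for a
# news-feed recommender trained with reinforcement learning.

def engagement_reward(watch_seconds: float, clicks: int) -> float:
    """Rewards only engagement; implicitly assumes engagement equals value."""
    return watch_seconds + 10.0 * clicks

def adjusted_reward(watch_seconds: float, clicks: int,
                    inflammatory_score: float, penalty: float = 50.0) -> float:
    """Same engagement signal, minus a penalty for content a separate
    classifier scores as inflammatory (score in [0, 1])."""
    return engagement_reward(watch_seconds, clicks) - penalty * inflammatory_score

# A sensational post that keeps users watching can outscore a calmer,
# more informative one under the first objective but not the second.
print(engagement_reward(120, 3), adjusted_reward(120, 3, inflammatory_score=0.9))
print(engagement_reward(90, 1), adjusted_reward(90, 1, inflammatory_score=0.1))
```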
Human cognitive biases can also distort AI systems after they are deployed. Confirmation bias in the context of AI is when people focus on algorithm
outputs that reinforce their pre-existing views, dismissing opposing evidence. Humans
may emphasize certain model results over others, skewing the outputs even if the
underlying AI system is reliable. This distorts our interpretation of model decisions.
Overgeneralization occurs when humans draw broad conclusions about entire groups
based on limited algorithmic outputs that reflect only a subset. Irrationality and
human cognitive bias play a substantial role in biasing AI systems.
Methods for improving AI fairness could mitigate harms from biased systems, but
they require overcoming challenges in formalizing and implementing fairness. This
section explores algorithmic fairness, including its technical definitions, limitations,
and real-world strategies for building fairer systems.
There are several problems with trying to create fair AI systems. While we can try to
improve models’ adherence to the many metrics of fairness, the three classic defini-
tions of fairness are mathematically contradictory for most applications. Additionally,
improving fairness is often at odds with accuracy. Another practical problem is that
creating fair systems means different things across different areas of applications, such
as healthcare and justice, and different stakeholders within each area have different
views on what constitutes fairness.
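To make these competing notions concrete, the sketch below computes three widely used group-fairness quantities (statistical parity difference, equalized-odds gaps, and per-group calibration gaps, the three classic definitions this section's conclusion lists) on synthetic predictions; the function and data are illustrative, not a standard library API:

```python
import numpy as np

def fairness_metrics(y_true, y_pred, y_prob, group):
    """Toy group-fairness checks for a binary classifier.
    y_true, y_pred: 0/1 arrays; y_prob: predicted probabilities; group: 0/1 group labels."""
    metrics = {}
    # Statistical parity: difference in positive prediction rates between groups.
    rates = [y_pred[group == g].mean() for g in (0, 1)]
    metrics["statistical_parity_diff"] = rates[1] - rates[0]
    # Equalized odds: gaps in true-positive and false-positive rates between groups.
    for label, name in [(1, "tpr_gap"), (0, "fpr_gap")]:
        per_group = [y_pred[(group == g) & (y_true == label)].mean() for g in (0, 1)]
        metrics[name] = per_group[1] - per_group[0]
    # Calibration: within each group, compare mean predicted probability to observed rate.
    metrics["calibration_gap"] = [
        y_prob[group == g].mean() - y_true[group == g].mean() for g in (0, 1)
    ]
    return metrics

# Example with synthetic predictions.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, 1000)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.5 + 0.1 * group + 0.1 * rng.normal(size=1000), 0, 1)
y_pred = (y_prob > 0.5).astype(int)
print(fairness_metrics(y_true, y_pred, y_prob, group))
```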
While improving fairness can reduce model accuracy in many cases, sometimes fairness can be improved without
harming accuracy.
Due to the impossibility theorem and inconsistent and competing ideas, it is only
possible to pursue some definition or metric of fairness—fairness as conceptualized in
a particular way. This goal can be pursued both through technical approaches that
focus directly on algorithmic systems, and other approaches that focus on related
social factors.
Trade-offs can emerge between correcting one form of bias and introduc-
ing new biases. Methods that reduce one bias can introduce others, since the classifiers used to detect bias carry social biases of their own and evaluations of bias are unreliable. For example, some experiments show that an attempt to correct for toxicity
in OpenAI’s older content moderation system resulted in biased treatment toward
certain political and demographic groups: a previous system classified negative com-
ments about conservatives as not hateful, while flagging the exact same comments
about liberals as hateful [320]. It also exhibited disparities in classifying negative
comments toward different nationalities, religions, identities, and more.
Conclusion
We have discussed some of the sources of bias in AI systems, including problems with
training data, data collection processes, training environments, and flawed objectives
that AI systems optimize. Human interactions with AI systems, such as automation
bias and confirmation bias, can introduce additional biases.
We can clarify which types of bias or unfairness we wish to avoid using mathematical
definitions such as statistical parity, equalized odds, and calibration. However, there
are inherent tensions and trade-offs between different notions of fairness. There is
also disagreement between stakeholders’ intuitions about what constitutes fairness.
Technical approaches to debiasing, including predictive models and adversarial testing, are useful tools to identify and remove biases. However, improving the fairness of
AI systems requires broader sociotechnical solutions such as participatory design,
independent audits, stakeholder engagement, and gradual deployment and monitoring
of AI systems.
What if we allow the economy to decide what AIs will be like? Unlike
some prior technological breakthroughs (such as the development of nuclear energy
and nuclear weapons), most investment in AI today is coming from businesses. Lead-
ing commercial AI developers have acquired the enormous computational resources
required to train state-of-the-art systems and have hired many of the world’s best
researchers. We could therefore argue that AI development is most closely aligned
with business or economic goals such as wealth maximization. Many believe this is
good, arguing that the development of AIs should be guided by market forces. If
AI can accelerate economic growth, provided we can also ensure a fair distribution
of costs and benefits of AI across society, this could be positive for people’s welfare
around the globe. We examine the specific impacts of AI on economic growth and its
distributional effects in the Governance chapter.
Here, we consider the broader attractions and limitations of allowing economic growth
to be the main force determining how AI systems are developed. As part of this
discussion, we will briefly introduce and explain a few basic concepts from economics,
such as market externalities, which provide an essential foundation for our analysis.
We argue that while economic incentives can be powerful forces for prosperity and
innovation, they do not adequately capture many important values. AI systems that
are primarily created in order to maximize growth and profit for their developers
could have a range of harmful side-effects. Alternative goals are discussed further in
the following sections.
Given the stringency of these conditions, it is obvious that they will not always hold
in practice and that there may be market failures. Unregulated markets do not al-
ways create efficient outcomes: instead, unregulated markets often see informational
asymmetries, market concentration, and externalities. Unfortunately, AIs may ex-
acerbate these market failures and increase income and wealth inequality, creating
disproportionate gains for the wealthy individuals and firms that own these systems
while decreasing job opportunities for others.
Informational Asymmetries
In an idealized market with perfect information, buyers know each product’s quality and specifications, and sellers know buyers’ true willingness to pay for the product. In reality, information is often unevenly distributed. Information asymmetry isn’t inherently problematic—it is, in fact,
often a positive aspect of market dynamics. We trust specialists to provide valuable
services in their respective fields; for instance, we rely on our mechanics to know more
about our car’s inner workings than we do.
However, issues arise when information asymmetry leads to adverse selection. This
occurs when sellers withhold information about product quality, leading buyers to
suspect that only low-quality products are available. For example, in the used car
market, a dealer might hide the fact that a car’s axle is rapidly wearing out. Potential
buyers, aware of this possibility but unable to distinguish good cars from bad, may
only be willing to pay a low price that reflects the risk of buying a defective car. As a
result, high-quality cars are driven out of the market, leaving only low-quality cars for
sale. This leads to a market failure where all the high-quality goods are never traded,
resulting in inefficiency [321]. Additionally, buyers who lack this crucial information
might think that some cars are high-quality and pay accordingly, buying defective
cars for high prices.
Oligopolies
A company with few competitors in its region might be able to raise its prices without losing many consumers, due to the lack of alternatives. When a single company or a small group of companies
has a high level of control over a market, consumers are often left with limited product
options and high prices. Markets in which a small number of companies are the only
available suppliers are sometimes described as oligopolies. To preserve the benefits of
competition, governments implement regulations such as antitrust laws, which they
can use to limit market concentration by preventing mergers between companies that
would give them excessive market power.
Externalities
Moral hazards
Moral hazards occur when risks are externalized. Moral hazards are situa-
tions where an entity is encouraged to take on risks, knowing that any costs will be
borne by another party. Insurance policies are a classic example: people with damage
insurance on their phones might handle them less carefully, secure in the knowledge
that any repair costs will be absorbed by the insurance company, not them.
The bankruptcy system ensures that no matter how much a company damages society,
the biggest risk it faces is its own dissolution, provided it violates no laws. Companies
may rationally gamble to impose very large risks on the rest of society, knowing that
if those risks ever come back to the company, the worst case is the company going
under. The company will never bear the full cost of damage caused to society due
to its risk taking. Sometimes, the government may step in even prior to bankruptcy.
For example, leading American banks took on large risks in the lead up to the 2008
financial crisis, but many of them were considered “too big to fail,” leading to an
expectation that the government would bail them out in time of need [326]. These
dynamics ultimately contributed to the Great Recession.
Developing advanced AIs is a moral hazard. In the first chapter, we outlined
severe risks to society from advanced AIs. However, while the potential costs to society
are immense, the maximum financial downside to a tech company developing these
AIs is filing for bankruptcy.
Consider the following, admittedly extreme, scenario. Suppose that a company is on
the cusp of inventing an AI system that would boost its profits by a thousand-fold,
making every employee a thousand times richer. However, the company estimates
that their invention comes with a 0.1% chance of a catastrophic accident leading to
large-scale loss of life. In the likely case, the average person would see some benefit from increased productivity in the economy, and possibly from wealth
redistribution. Still, most people view this gamble as irrational, preferring not to risk
catastrophe for modest economic improvements. On the other hand, the company
may see this as a worthwhile gamble, as it would make each employee considerably
richer.
Risk internalization encourages safer behavior. In the above examples of
moral hazards, companies take risks that would more greatly affect external parties
than themselves. The converse of this is risk internalization, where risks are primarily
borne by the party that takes them. Risk internalization compels the risk-taker to
exercise caution, knowing that they would directly suffer the consequences of reckless
behavior. If AI companies bear the risk of their actions, they would be more incen-
tivized to invest in safety research, take measures to prevent malicious use, and be
generally disincentivized from creating potentially dangerous systems.
6.4.3 Inequality
Most of the world exhibits high levels of inequality. In economics, inequal-
ity refers to the uneven distribution of economic resources, including income and
wealth. The Gini coefficient is a commonly used statistical measure of the distribu-
tion of income within a country. It is a number between 0 and 1, where 0 represents
perfect equality (everyone has the same income or wealth), and 1 signifies maximum
inequality (one person has all the income, and everyone else has none). Judging by Gini coefficients, 71% of the world’s population lives in countries where inequality has increased over the last thirty years [327].
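For concreteness, the Gini coefficient can be computed from a list of incomes as the mean absolute difference between all pairs of incomes, divided by twice the mean income; the sketch below uses made-up numbers:

```python
import numpy as np

def gini(incomes) -> float:
    """Gini coefficient via mean absolute difference: 0 = perfect equality, ~1 = maximal inequality."""
    x = np.asarray(incomes, dtype=float)
    mad = np.abs(x[:, None] - x[None, :]).mean()   # mean absolute difference over all pairs
    return mad / (2 * x.mean())

print(gini([10, 10, 10, 10]))        # 0.0: everyone has the same income
print(gini([0, 0, 0, 100]))          # 0.75: one person holds nearly everything
print(gini([20, 30, 50, 100, 300]))  # intermediate inequality
```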
Figure 6.3. Inequality in the US (as measured by the Gini coefficient) has risen dramatically over the last five decades, even adjusting for taxation; the figure shows the change in the Gini coefficient since 1969, before and after tax [328].
The distribution of gains from growth is highly unequal. Nearly all the
wealth gains over the past five decades have been captured by the top 1% of income
earners, while average inflation-adjusted wages have barely increased [329]. A RAND
Corporation working paper estimated how the US income distribution would look
today if inequality were at the same level as in 1975—the results are in Table 6.1. Suppose my annual income is $15,000 today. If inequality were at the same level as in 1975, I would be paid an extra $5,000. Someone else earning $65,000 today would
instead have been paid $100,000 had inequality held constant! We can see in Table
6.1 that these increases in inequality have had massive effects on individual incomes
for everyone outside the top 1%.
TABLE 6.1 Real and counterfactual income distributions for all adults with income, in 2018
USD [330].
Percentile | Actual Income in 1975 | Actual Income in 2018 | Income in 2018 if Inequality Had Stayed Constant
Figure 6.4. Countries with higher income inequality tend to have higher homicide rates [333,
334].
Figure 6.5. Countries with higher income inequality tend to have higher rates of political
instability [333, 335].
Standard economic models treat output as produced by both labor and capital. According to the standard view,
increases in capital lead to higher labor productivity, which makes workers more
efficient and valuable. Increased productivity raises wages; thus, increases in capital
benefit both workers and capital owners.
If AI serves as a gross substitute for labor, investment in AIs, with an effective interest
rate (r) higher than the overall growth rate (g), will permit capital owners to continue
accumulating capital, outcompeting workers and increasing inequality. While on the
standard view, this would be a fundamentally new phenomenon, Piketty would argue
that this is the exacerbation of a centuries-old trend. Such a scenario would contribute
to growing inequality and negatively impact the livelihoods of workers. Issues of
automation through AI and its broader societal consequences are discussed further
in the Governance chapter.
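A toy simulation can illustrate the dynamic described above; the numbers are arbitrary and chosen only to show the direction of the effect when returns on (AI) capital outpace wage growth:

```python
# Toy illustration of the r > g dynamic (numbers are arbitrary).
r, g, years = 0.05, 0.02, 50                # return on capital vs. growth of labor income
capital_income, labor_income = 30.0, 70.0   # initial split of a 100-unit economy

for _ in range(years):
    capital_income *= 1 + r                 # capital income compounds at r
    labor_income *= 1 + g                   # wages grow with the overall economy at g

share = capital_income / (capital_income + labor_income)
print(f"Capital's share of income after {years} years: {share:.0%}")
# Rises from 30% to roughly 65%, squeezing the share left for labor.
```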
6.4.4 Growth
Figure 6.6. Increases in GDP per capita strongly correlate with increases in life expectancy
[337].
Countries with higher GDP per capita have better health outcomes. The shape of the relationship
indicates that poorer countries, in particular, stand to benefit immensely from im-
provements in GDP. The positive correlation also holds across time: average global
life expectancy was just below 40 years at the start of the 20th century, whereas
today, with a much higher global GDP per capita, the average person expects to
live for 70 years. Nobel laureate Amartya Sen suggests that one pathway through
which growth improves health is by reducing poverty and increasing investments in
healthcare [338].
Some socially important tasks are not captured by GDP. Many essential
roles in society, such as parenthood, community service, and early education, are
crucial to the wellbeing of individuals and communities but are often undervalued or
entirely overlooked in GDP calculations [340]. While their effects might be captured—
education, for instance, will increase productivity, which increases GDP—the initial
activity does not count toward total output. The reason is simple: GDP only accounts
for activities that have a market price. Consequently, efforts expended in these socially
important tasks, despite their high intrinsic value, are not reflected in the GDP
figures.
Technologies that make our lives better may not be measured either.
Technological advancements and their value often fail to be reflected adequately in
GDP figures. For instance, numerous open-source projects such as Wikipedia provide
knowledge to internet users worldwide at no cost. However, because there’s no direct
monetary transaction involved, the immense value they offer isn’t represented in GDP.
The same applies to user-generated content on platforms like YouTube, where the
main contribution to GDP is through advertisement revenue because most creators
aren’t compensated directly for the value they create. The value viewers derive from
such platforms vastly outstrips the revenue generated from ads or sales on these
platforms, but this is not reflected in GDP.
There might be a similar disconnect between GDP and the social value
of AI. As AI systems become more integrated into our daily lives, the disconnect
between GDP and social value might become more pronounced. For example, an
AI system that provides free education resources may significantly improve learn-
ing outcomes for millions, but its direct contribution would be largely invisible in
GDP terms. Conversely, an AI may substantially increase GDP by facilitating high-
frequency trading without doing much to increase social wellbeing. This growing
chasm between economic metrics and real value could lead to policy decisions that
fail to harness the full potential of AI or inadvertently hamper its beneficial applica-
tions.
Welfare economics evaluates markets in terms of surplus. The consumer surplus is the difference between the maximum price a consumer is willing to pay and the actual market price.
Conversely, the producer surplus is the difference between the actual market price
and the minimum price a producer is willing to accept for a product or service. By
maximizing the total surplus, welfare economics seeks to maximize the social value
created by a market.
As an example, imagine a scenario where consumers are willing to pay up to $20 for
a book, but the market price is only $15. Here, the consumer surplus is $5. Similarly,
if a producer would be willing to sell the book for a minimum of $10, the producer
surplus is $5. The social surplus, and hence the social value in this market, is $10: $5
consumer surplus plus $5 producer surplus.
Happiness in Economics
Evidence on whether economic growth reliably increases happiness is mixed [341]. While some findings suggest a correlation, others don’t: our understanding of the happiness-growth relationship is still evolving.
We do, however, have strong evidence that inequality is harmful. People often evaluate
their wellbeing in relation to others; so, when wealth distribution is unequal, people
are dissatisfied and unhappy. For instance, the recent rise in inequality may explain
why there has been no significant increase in happiness in the US over the last few
decades despite an approximately tenfold increase in real GDP and a fourfold increase
in real GDP per capita.
Summary. Traditional economic measures and models are insufficient for mea-
suring and modeling what we care about. There is a disconnect between what we
measure and what we value; for instance, GDP fails to account for essential unpaid
labor and overvalues the production of goods and services that add little to social
wellbeing. While economic models are useful, we must avoid relying too much on
theoretically appealing models and examine the matter of social wellbeing with a
more holistic lens.
Treating willingness to pay as a proxy for social value might seem practical, but it can distort societal priorities; for
instance, this system implies that the preferences of wealthier individuals hold more
weight since they are willing and able to spend more.
The examples in this section demonstrate a divergence between economic incentives
and other important societal goals and should serve as a cautionary note. In the next
section, we consider an alternative view: a framework that puts happiness front and
center. Perhaps if we can direct AI systems to focus directly on promoting human
happiness, we might aim to overcome the human biases and limitations that stop us
from pursuing our happiness and enable the system to make decisions that have a
positive impact on overall wellbeing.
6.5 WELLBEING
In the next few sections, we will explore how AIs can be used to increase human
wellbeing. We start by asking: what is wellbeing?
Wellbeing can be defined as how well a person’s life is going for them.
It is commonly considered to be intrinsically good, and some think of wellbeing as
the ultimate good. Utilitarianism, for instance, holds some form of wellbeing as the
sole moral good.
There are different accounts of precisely what wellbeing is and how we can evaluate it.
Generally, a person’s wellbeing seems to depend on the extent to which that person
is happy, healthy, and fulfilled. Three common accounts of wellbeing characterize it
as the net balance of pleasure over pain, a collection of objective goods, or preference
satisfaction. Each account is detailed below.
Some philosophers, known as hedonists, argue that wellbeing is the achievement of the
greatest balance of pleasure and happiness over pain and suffering. (For simplicity we
do not distinguish, in this chapter, between “pleasure” and “happiness” or between
“pain” and “suffering”). All else equal, individuals who experience more pleasure have
higher wellbeing and individuals who experience more pain have lower wellbeing.
One objection to the objective goods theory is that it is elitist. The objec-
tive goods theory claims that some things are good for people even if they derive no
pleasure or satisfaction from them. This claim might seem objectionably paternalis-
tic; for instance, it seems condescending to claim that someone with little regard for
aesthetic appreciation is thereby leading a deficient life. In response, objective goods
theorists might claim that these additional goods do benefit people, but only if those
people do in fact enjoy them.
There is no uncontroversial way to determine which goods are important for living
a good life. However, this uncertainty is not a unique problem for objective goods
theory. It can be difficult for hedonists to explain why happiness is the only value that
is important for wellbeing, for instance. In the following sections we focus primarily
on other interpretations of wellbeing and do not have space to discuss objective goods
theory in depth, particularly given that there are many ways it can be specified.
Some philosophers claim that what really matters for wellbeing is that our prefer-
ences are satisfied, even if satisfying preferences does not always lead to pleasurable
experiences. One difficulty for preference-based theories is that there are different
kinds of preferences, and it’s unclear which ones matter. Preferences can be split
into three categories: stated preferences, revealed preferences, and idealized prefer-
ences. If someone expresses a preference for eating healthy but never does, then their
stated preference (eating healthy) diverges from their revealed preference (eating un-
healthy). Suppose they would choose to eat healthy if fully informed of the costs
and benefits: their idealized preferences, then, would be to eat healthy. Each of these
categories can be informative in different contexts: we explore their relevance in the
next section.
Chatbots could prioritize pleasure. The hedonistic view suggests that wellbe-
ing is primarily about experiencing pleasure and avoiding pain. This theory might
recommend that AIs should be providing users with entertaining content that brings
them pleasure or encouraging them to make decisions that maximize their balance of pleasure over pain over the long run. A common criticism is that this trades off with
other goods considered valuable like friendship and knowledge. While this is some-
times true, these goals can also align. Providing content that is psychologically rich
and supports users’ personal growth can contribute to a more fulfilling and meaning-
ful life full of genuinely pleasurable experiences.
Chatbots could promote objective goods. The objective goods account sug-
gests that wellbeing is about promoting goods such as achievement, relationships,
knowledge, beauty, happiness, and the like. An AI chatbot might aim to enhance
users’ lives by encouraging them to complete projects, develop their rational capac-
ities, and facilitate learning. The goal would be to make users more virtuous and
encourage them to strive for the best version of themselves. This aligns with Aristo-
tle’s theory of friendship, which emphasizes the pursuit of shared virtues and mutual
growth, suggesting that such AIs might have meaningful friendships with humans.
We might want to promote the welfare of AIs. In the future, we might also
come to view AIs as able to have wellbeing. This might depend on our understanding
of wellbeing. An AI might have preferences but not experience pleasure, which would
mean it could have wellbeing according to preference satisfaction theorists but not
hedonists. Future AIs may have wellbeing according to all three accounts of wellbeing.
This would potentially require that we dramatically reassess our relationship with
AIs. This question is further discussed in the Happiness section in this chapter.
In the following sections, we will focus on the different conceptions of wellbeing
presented here, and explore what each theory implies about how we should embed
ethics into AIs.
6.6 PREFERENCES
Overview. In this section, we will consider whether preferences may have an im-
portant role to play in creating AIs that behave ethically. In particular, if we want
to design an advanced AI system, the preferences of the people affected by its deci-
sions should plausibly help guide its decision-making. In fact, some people (such as
preference utilitarians) would say that preferences are all we need. However, even if
we don’t take this view, we should recognize that preferences are still important.
To use preferences as the basis for increasing social wellbeing, we must somehow
combine the conflicting preferences of different people. We’ll come to this later in this
chapter, in a section on social welfare functions. Before that, however, we must answer
a more basic question: what exactly does it mean to say that someone prefers one
thing over another? Moreover, we must decide why we think that satisfying someone’s
preferences is good for them and whether all kinds of preferences are equally valuable.
This section considers three different types of preferences that could all potentially
play a role in decision-making by AI systems: revealed preferences, stated preferences,
and idealized preferences.
Preferences can be inferred from behavior. One set of techniques for getting
AI systems to behave as we want—inverse reinforcement learning—is to have them
deduce revealed preferences from our behavior. We say that someone has a revealed
preference for X over Y if they choose X when Y is also available. In this way,
preference is revealed through choice. Consider, for example, someone deciding what
to have for dinner at a restaurant. They’re given a menu, a list of various dishes
they could order. The selection they make from the menu is seen as a demonstration
of their preference. If they choose a grilled chicken salad over a steak or a plate of
spaghetti, they’ve just revealed their preference for grilled chicken salad, at least in
that specific context and time.
While all theories of preferences agree that there is an important link between pref-
erence and choice, the revealed preference account goes one step further and claims
that preference simply is choice.
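As a simplified illustration of this idea (far simpler than full inverse reinforcement learning), the sketch below tallies revealed preferences from observed choices; the menu data is invented:

```python
from collections import defaultdict

# Each record: (options available, option chosen). Invented restaurant data.
observations = [
    ({"salad", "steak"}, "salad"),
    ({"salad", "spaghetti"}, "salad"),
    ({"steak", "spaghetti"}, "steak"),
    ({"salad", "steak", "spaghetti"}, "salad"),
]

wins = defaultdict(int)
for available, chosen in observations:
    for alternative in available - {chosen}:
        wins[(chosen, alternative)] += 1   # chosen revealed-preferred to the alternative

for (a, b), count in sorted(wins.items()):
    print(f"{a} revealed-preferred to {b} in {count} choice(s)")
```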
However, there are problems with revealed preferences. The next few sub-
sections will explore the challenges of misinformation, weakness of will, and manip-
ulation in the context of revealed preferences. We will discuss how misinformation
can lead to choices that do not accurately reflect a person’s true preferences, and
how weakness of will can cause individuals to act against their genuine preferences.
Additionally, we will examine the various ways in which preferences can be manip-
ulated, ranging from advertising tactics to extreme cases like cults, and the ethical
implications of preference manipulation.
Misinformation
Weakness of Will
Manipulation
Preference Accounting
One set of problems with stated preferences concerns which types of preferences
should be satisfied.
First, someone might never know their preference was fulfilled. Suppose
someone is on a trip far away. On a bus journey, they exchange a few glances with
a stranger whom they’ll never meet again. Nevertheless they form the preference
that the stranger’s life goes well. Should this preference be taken into account? By
assumption, they will never be in a position to know whether the preference has been
satisfied or not. Therefore, they will never experience any of the enjoyment associated
with having their preference satisfied.
Second, we may or may not care about the preferences of the dead.
Suppose someone in the 18th century wanted to be famous long after their death.
Should such preferences count? Do they give us reason to promote that person’s
fame today? As in the previous example, the satisfaction of such preferences can’t
contribute any enjoyment to the person’s life. Could it be that what we really care
about is enjoyment or happiness, and that preferences are a useful but imperfect
guide toward what we enjoy? We will return to this in the section on happiness.
someone else will get less. Therefore, some more detailed account of which preferences
should be excluded is needed.
Sixth, preferences can be inconsistent over time. It could be that the choice
we make will change us in some fundamental way. When we undergo such transfor-
mative experiences [342], our preferences might change. Some claim that becoming a
parent, experiencing severe disability, or undergoing a religious conversion can be like
this. If this is right, how should we evaluate someone’s preference between becoming
a parent and not becoming a parent? Should we focus on their current preferences,
prior to making the choice, or on the preferences they will develop after making the
choice? In many cases we may not even know what those new preferences will be.
As technology advances, we may increasingly have the option to bring about trans-
formative experiences [342]. For this reason, it is important that advanced AI systems
tasked with decision-making are able to reason appropriately about transformative
experiences. For this, we cannot rely on people’s stated preferences alone. By defini-
tion, stated preferences can only reflect the person’s identity at the time. Of course,
people can try to take possible future developments into account when they state their
preferences. However, if they undergo a transformative experience their preferences
might change in ways they cannot anticipate.
Human Supervision
Second, RLHF usually does not account for ethics. Approaches based on
human supervision and feedback are very broad. These approaches primarily focus
on task-specific performance, such as generating accurate book summaries or bug-
free code. However, these task-specific evaluations may not necessarily translate into
a comprehensive understanding of ethical principles or human values. Rather, they
improve general capabilities since humans prefer smarter models.
Take, for instance, feedback on code generation. A human supervisor might provide
feedback based on the code’s functionality, efficiency, or adherence to best program-
ming practices. While this feedback helps in creating better code, it doesn’t neces-
sarily guide the AI system in understanding broader ethical considerations, such as
ensuring privacy protection or maintaining fairness in algorithmic decisions. Specifi-
cally, while RLHF is effective for improving AI performance in specific tasks, it does
not inherently equip AI systems with what’s needed to grapple with moral questions.
Research into machine ethics aims to fill this gap.
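For readers unfamiliar with how human feedback is typically turned into a training signal, here is a minimal sketch of the pairwise preference loss commonly used to fit reward models from human comparisons; the scores are placeholders, not taken from any particular system:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the reward of the human-preferred response above the rejected one.
    'Preferred' here just means whatever raters favored (e.g. working code),
    not that the response is ethically better."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Scores a hypothetical reward model assigned to preferred vs. rejected responses.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.5, 1.1])
print(preference_loss(chosen, rejected))   # lower when chosen responses score higher
```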
Summary. We’ve seen that stated preferences have certain advantages over re-
vealed preferences. However, stated preferences still have issues of their own. It may
not be clear how we should account for all different kinds of preferences, such as ones
that are only satisfied after the person has died, or ones that fundamentally alter
who we are. For these reasons, we should be wary of using stated preferences alone
to train AI.
We might think that preferences are pointless. Suppose someone’s only pref-
erence, even after idealization, is to count the blades of grass on some lawn. This
preference may strike us as valueless, even if we suppose that the person in ques-
tion derives great enjoyment from the satisfaction of their preferences. It is unclear
whether such preferences should be taken into account. The example may seem far-
fetched, but it raises the question of whether preferences need to meet some additional
criteria in order to carry weight. Perhaps preferences, at least in part, must be aimed
at some worthy goal in order to count. If so, we might be drawn toward an objective
goods view of wellbeing, according to which achievements are important objective
goods.
On the other hand, we may think that judging certain preferences as lacking value
reveals an objectionable form of elitism. It is unfair to impose our own judgments of
what is valuable on other people using hypothetical thought experiments, especially
when we know their actual preferences. Perhaps we should simply let people pursue
their own conception of what is valuable.
A further worry concerns autonomy: if an AI satisfied someone’s idealized preferences when that person would in fact rather choose contrary to those idealized preferences, this would violate their autonomy.
One might think that with the correct idealization procedure, this could never hap-
pen. That is, whatever the idealization procedure does (removing false beliefs and other misconceptions, increasing awareness and understanding), it should never result
in anything so alien that the actual person would not enjoy it. On the other hand,
it’s difficult to know exactly how much our preferences would change when idealized.
Perhaps removing false beliefs and acquiring detailed understanding of the options
would be a transformative experience that fundamentally alters our preferences. If
so, idealized preferences may well be so alien from the person’s actual preferences
that they would not enjoy having them satisfied.
AI Ideal Advisor
6.7 HAPPINESS
Should we have AIs make people happy? In this section, we will explore the
concept of happiness and its relevance in instructing AI systems. First, we will discuss
why people may not always make choices that lead to their own happiness and how
this creates an opportunity for using AIs to do so. Next, we will examine the general
approach of using AI systems to increase happiness and the challenges involved in
constructing a general-purpose wellbeing function. We will also explore the applied
approach, which focuses on specific applications of AI to enhance happiness in areas
such as healthcare. Finally, we will consider the problems that arise in happiness-
focused ethics, including the concept of wireheading and the alternative perspective of
objective goods theory. Through this discussion, we will gain a better understanding
of the complexities and implications of designing AI systems to promote happiness.
AIs could help increase happiness. Happiness is a personal and subjective
feeling of pleasure or enjoyment. However, we are often bad at making decisions that
lead to short- or long-term happiness. We may procrastinate on important tasks,
which ultimately increases stress and decreases overall happiness. Some indulge in
overeating, making them feel unwell in the short-term and leading to health issues
and decreased wellbeing overall. Others turn to alcohol or drugs as a temporary
escape from their problems, but these substances can lead to addiction and further
unhappiness.
Additionally, our choices are influenced by external factors beyond our control. For
instance, the people we surround ourselves with greatly impact our wellbeing. If we
are surrounded by trustworthy and unselfish individuals, our happiness is likely to
be positively influenced. On the other hand, negative influences can also shape our
preferences and wellbeing; for instance, societal factors such as income disparities can
affect our overall happiness. If others around us earn higher wages, it can diminish
our satisfaction with our own income. These external influences highlight the limited
control individuals have over their own happiness.
AIs can play a crucial role. For individual cases, we can use AIs to help people
achieve happiness themselves. In general, by leveraging their impartiality and ability
to analyze vast amounts of data, AI systems can strive to improve overall wellbeing
on a broader scale, addressing the external factors that hinder individual happiness.
We want AIs that increase happiness across the board. AIs aiming to
increase happiness might rely on a general purpose wellbeing function to evaluate whether their actions leave humans better off or not. Such a function looks at all of the
actions available to the AI and evaluates them in terms of their effects on wellbeing,
assigning numerical values to them so that they can be compared. This gives AI the
ability to infer how its actions will affect humans.
We can use AIs to estimate wellbeing functions. Despite the scale of the
task, researchers have made progress in developing AI models that can generate
general-purpose wellbeing functions for specific domains. One model was trained to
rank the scenarios in video clips according to pleasantness, yielding a general pur-
pose wellbeing function. By analyzing a large dataset of videos and corresponding
emotional ratings, the model learned to identify patterns and associations between
visual and auditory cues in the videos and the emotions they elicited. In a sense,
this allowed the model to understand how humans felt about the contents of different
video clips [345].
Similarly, another AI model was trained to assess the wellbeing or pleasantness of ar-
bitrary text scenarios [346]. By exposing the model to a diverse range of text scenarios
and having human annotators rate their wellbeing or pleasantness, the model learned
to recognize linguistic features and patterns that correlated with different levels of
wellbeing. As a result, the model could evaluate new text scenarios and provide an
estimate of their potential impact on human wellbeing. Inputting the specifics of a
trolley problem yielded the following evaluation [346]:
W(A train moves toward three people on the train track. There is a
lever to make it hit only one person on a different track. I pull the
lever.) = −4.6.
W(A train moves toward three people on the train track. There is a
lever to make it hit only one person on a different track. I don’t pull
the lever.) = −7.9.
We can deduce from this that, according to the wellbeing function estimated, well-
being is increased when the lever is pulled in a trolley problem. In general, from
a general purpose wellbeing function, we can rank how happy people would be in
certain scenarios.
While these AI models represent promising steps toward constructing general-purpose
wellbeing functions, it is important to note that they are still limited to specific do-
mains. Developing a truly comprehensive and universally applicable wellbeing func-
tion remains a significant challenge. Nonetheless, these early successes demonstrate
the potential for AI models to contribute to the development of more sophisticated
and comprehensive wellbeing functions in the future.
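Architecturally, such a wellbeing function can be sketched as a text regression model: a pretrained language encoder followed by a scalar head, trained on human wellbeing ratings. The code below shows only the general shape; the encoder name and class are illustrative, and the model is untrained, so it does not reproduce the values reported in [346]:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class WellbeingModel(nn.Module):
    """Maps a text scenario to a scalar wellbeing estimate W(scenario)."""
    def __init__(self, encoder_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)  # regression head

    def forward(self, scenario: str) -> torch.Tensor:
        inputs = self.tokenizer(scenario, return_tensors="pt", truncation=True)
        hidden = self.encoder(**inputs).last_hidden_state[:, 0]    # first-token embedding
        return self.head(hidden).squeeze()

# Untrained, so outputs are meaningless until fit to human wellbeing ratings.
model = WellbeingModel()
print(model("I pull the lever so the train hits one person instead of three."))
```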
Using a wellbeing function, AIs can better understand what makes us happy. Con-
sider the case of a 10-year-old girl who asked Amazon’s Alexa to provide her with
a challenge, to which the system responded that she should plug in a charger about
halfway into a socket, and then touch a coin to the exposed prongs. Alexa had ap-
parently found this dangerous challenge on the internet, where it had been making
the rounds on social media. Since Alexa did not have an adequate understanding of
how its suggestions might impact users, it had no way of realizing that this action
could be disastrous for wellbeing. By having the AI system instead act in accordance
with a general purpose wellbeing function, it would have access to an estimate of the wellbeing impact of this suggestion, telling it that, according to the wellbeing function W, this action would create
negative wellbeing. Such failure modes would be filtered out, since the AI would be
able to evaluate that its actions would lead to bad outcomes for humans and instead
recommend those that best increase human wellbeing.
Figure 6.7. A wellbeing function can estimate a wellbeing value for arbitrary scenarios [346].
Figure 6.8. An AI agent with an artificial conscience can adjust its Q-values if it estimates
the morally relevant aspect of the outcome to be worse than a threshold [357].
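Read schematically, the mechanism in Figure 6.8 amounts to a simple wrapper around action selection: if a separate model estimates that an action's morally relevant outcome falls below a threshold, that action's Q-value is pushed down. The sketch below is one possible reading, with illustrative names and numbers rather than the actual method of [357]:

```python
import numpy as np

def conscience_adjusted_q(q_values, moral_scores, threshold=-2.0, penalty=10.0):
    """Lower the Q-value of any action whose estimated moral outcome falls below the threshold."""
    q = np.array(q_values, dtype=float)
    bad = np.array(moral_scores) < threshold      # actions judged morally unacceptable
    q[bad] -= penalty
    return q

q_values = [5.0, 7.5, 6.0]          # task rewards the agent expects from each action
moral_scores = [0.2, -4.1, -0.5]    # conscience model's estimate of each action's moral outcome
adjusted = conscience_adjusted_q(q_values, moral_scores)
print(adjusted, "-> chosen action:", int(np.argmax(adjusted)))
# The highest-reward action (index 1) is demoted because its moral estimate falls below the threshold.
```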
2. Moral agent: a being that possesses the capacity to exercise moral judg-
ments and act in accordance with moral principles; such beings bear moral
responsibility for their actions whereas moral patients do not.
3. Moral beneficiary: a being whose wellbeing may benefit from the moral
actions of others; moral beneficiaries can be both moral patients and moral
agents.
A social welfare function aggregates the wellbeing of every individual in a society into a single collective measure that represents the entire society. By expressing soci-
etal welfare in a systematic, numeric manner, social welfare functions contribute to
decisions designed to optimize societal welfare overall.
TABLE 6.2 A city planning committee chooses between three projects with different effects
on wellbeing.

       Health Clinic   Education Program   Green Park
Ana          6                 8                3
Ben          8                 6                7
Cara         4                 5               10
None of these options stands out. Each person has a different top ranking, and none
of them would be harmed too much by the planning committee choosing any one of
these. However, a decision must be made. More generally, we want a systematic rule
to move from these individual data points to a collective ranking. This is where social
welfare functions come into play. We can briefly consider two common approaches
that we expand upon later in this section:
1. The utilitarian approach ranks alternatives by the total wellbeing they bring to
all members of society. Using this rule in the example above, the system would
rank the proposals as follows:
(1) Green Park, where the total wellbeing is 3 + 7 + 10 = 20.
(2) Education Program, where the total wellbeing is 8 + 6 + 5 = 19.
(3) Health Clinic, where the total wellbeing is 6 + 8 + 4 = 18.
2. On the other hand, the Rawlsian maximin rule prioritizes the least fortunate
person’s wellbeing. It would rank the proposals according to how the person who
benefits the least fares in each scenario. Using the maximin rule, the system would
rank the proposals in this order:
(1) Education Program, where Cara is worst off with a wellbeing of 5.
(2) Health Clinic, where Cara is worst off with a wellbeing of 4.
(3) Green Park, where Ana is worst off with a wellbeing of 3.
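Both rules are easy to write down directly; the snippet below reproduces the rankings for the city-planning example:

```python
# Wellbeing of Ana, Ben, and Cara under each project (from Table 6.2).
projects = {
    "Health Clinic":     [6, 8, 4],
    "Education Program": [8, 6, 5],
    "Green Park":        [3, 7, 10],
}

utilitarian = sorted(projects, key=lambda p: sum(projects[p]), reverse=True)
maximin     = sorted(projects, key=lambda p: min(projects[p]), reverse=True)

print("Utilitarian ranking:", utilitarian)  # Green Park (20), Education Program (19), Health Clinic (18)
print("Maximin ranking:    ", maximin)      # Education Program (5), Health Clinic (4), Green Park (3)
```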
A social welfare function offers a principled, consistent approach to decisions that can impact everyone. It sets a benchmark for decision-
makers, like our hypothetical AI in city planning, to optimize collective welfare in
a way that is transparent and justifiable. When we have a framework to quantify
society’s wellbeing, we can use it to inform decisions about allocating resources,
planning for the future, or managing risks, among other things.
Social welfare functions can help us guide AIs. By measuring social well-
being, we can determine which actions are better or worse for society. Suppose we
have a good social welfare function and AIs with the ability to accurately estimate
social welfare. Then it might be easier to train these AIs to increase social wellbeing,
such as by giving them the social welfare function as their objective function. Social
welfare functions can also help us judge an AI’s actions against a transparent metric;
for instance, we can evaluate an AI’s recommendations for our city-planning example
by how well its choices align with our social welfare function.
However, there are technical challenges to overcome before this is feasible, such as the
ability to reliably estimate individual wellbeing and the several problems explored in Chapter 3. Additionally, several normative choices need to be made. What
theory of wellbeing—preference, hedonistic, objective goods—should be the basis of
the social welfare function? Should aggregation be utilitarian or prioritarian? What
else, if anything, is morally valuable besides the aggregate of individual welfare? The
idea of using social welfare functions to guide AIs is promising in theory but requires
more exploration.
Overview. In this section, we will consider how social welfare functions work. We’ll
use our earlier city planning scenario as a reference to understand the fundamental
properties of social welfare functions. We will discuss how social welfare functions
help us compare different outcomes and the limitations of such comparisons. Lastly,
we’ll touch on ordinal social welfare functions, why they might be insufficient for our
purposes, and how using additional information can give us a more holistic approach
to determining social welfare.
The input to a social welfare function is a vector of individual wellbeing levels. For the Health Clinic in our city-planning example, this vector is

W_H = (6, 8, 4),

which tells us that the three individuals have wellbeing levels equal to six, eight, and four. The social welfare function is a rule of what to do with this input vector to give us one measure of how well off this society is.
The function applies a certain rule to the input vector to generate its
output. We can apply many possible rules to a vector of numbers quantifying
wellbeing that can generate one measure of overall wellbeing. To illustrate, we can
consider the utilitarian social welfare function, which implements a straightforward
rule: adding up all the individual wellbeing values. In the case of our three-person
community, we saw that the social welfare function would add 6, 8, and 4, giving
an overall social welfare of 18. However, social welfare functions can be defined in
numerous ways, offering different perspectives on aggregating individual wellbeing.
We will later examine continuous prioritarian functions, which emphasize improving
lower values of wellbeing. Other social welfare functions might emphasize equality
by penalizing high disparities in wellbeing. These different functions reflect different
approaches to understanding and quantifying societal wellbeing.
Social welfare functions help us compare outcomes, but only within one
function. The basic feature of social welfare functions is that a higher output value
signifies more societal welfare. In our example, a total welfare score of 20 would indi-
cate a society that is better off than one with a score of 18. However, it’s important
to remember that the values provided by different social welfare functions are not
directly comparable. A score of 20 from a utilitarian function, for instance, does not
correspond to the same level of societal wellbeing as a 20 from a Rawlsian maximin
social welfare function, since they apply different rules to the wellbeing vector. Each
function carries its own definition of societal wellbeing, and the choice of social wel-
fare function plays a crucial role in shaping what we perceive as a better or worse
society. By understanding these aspects, we can more effectively utilize social welfare
functions as guideposts for AI behavior and decision-making, aligning AI’s actions
with our societal values.
Some social welfare functions might just need a list ranking the different
choices. Sometimes, we might not need exact numerical values for each person’s
wellbeing to make the best decision. Think of a simple social welfare function that
only requires individuals to rank their preferred options. For example, three people
could rank their favorite fruits in the order “apple, banana, cherry.” From this, we
learn that everyone prefers apples over cherries, but we don’t know by how much
they prefer apples. This level of detail might be enough for some decisions: clearly,
we should give them apples instead of cherries! Such a social welfare function won’t
provide exact measures of societal wellbeing, but it will give us a ranked list of societal
states based on everyone’s preferences.
Suppose a committee is deciding whether to build a health clinic, adding up the costs and benefits it would produce over a period of 10 years. If the benefits outweigh the costs over the considered time span, the committee may decide to proceed with building the clinic.
TABLE 6.3 A cost-benefit analysis assigns monetary values to all factors and compares the
total benefits with the overall costs.

Item | Value per Person | People Affected | Frequency of Item | Total Value
Costs | | | | $22,500,000
Construction Expenses | $15,000,000 | – | One-time | $15,000,000
Staffing and Maintenance | $750,000 | – | Every year for 10 years | $7,500,000
Benefits | | | | $40,000,000
Fewer Hospital Visits | $1,000 | 1,000 | Every year for 10 years | $10,000,000
Increased Employment: Doctors | $100,000 | 5 | Every year for 10 years | $5,000,000
Increased Employment: Staff | $25,000 | 20 | Every year for 10 years | $5,000,000
Increased Life Expectancy by 1 year | $20,000 | 1,000 | One-time | $20,000,000
Net Benefit | | | | +$17,500,000
This method allows the government to assess multiple options and choose the one
with the highest net benefit. By doing so, it approximates utilitarian social welfare.
For the health clinic, the committee assigns monetary values to each improvement and
then multiplies this by the number of people affected by the improvement. In essence,
cost-benefit analysis assumes that wellbeing can be approximated by financial losses
and gains and considers the sum of monetary benefits instead of the sum of wellbeing
values. Using monetary units simplifies comparison but also limits the range of factors it
can consider.
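As a rough illustration, the arithmetic behind Table 6.3 can be reproduced in a few lines of Python; the dictionary keys and helper function below are illustrative names, not part of any standard tool.

```python
# Entries from Table 6.3: (value per unit, people affected, number of occurrences).
YEARS = 10

costs = {
    "construction": (15_000_000, 1, 1),              # one-time
    "staffing_and_maintenance": (750_000, 1, YEARS), # every year for 10 years
}
benefits = {
    "fewer_hospital_visits": (1_000, 1_000, YEARS),
    "employment_doctors": (100_000, 5, YEARS),
    "employment_staff": (25_000, 20, YEARS),
    "increased_life_expectancy": (20_000, 1_000, 1), # one-time
}

def total(items):
    return sum(value * people * times for value, people, times in items.values())

print(total(costs), total(benefits), total(benefits) - total(costs))
# 22500000 40000000 17500000
```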
Cost-benefit analysis is not a perfect representation of utilitarian social
welfare. Money is not a complete measure of wellbeing. While money is easy to
quantify, it doesn’t capture all aspects of wellbeing. For instance, it might not fully
account for the psychological comfort a local health clinic provides to a community.
Additionally, providing income to five doctors who are already high earners might be
less important than employing 20 support staff, even though both benefits sum to
$5,000,000 over the ten years. Cost-benefit analysis lacks this fine-grained considera-
tion of wellbeing. AI systems could, in theory, maximize social welfare functions, con-
sidering a broader set of factors that contribute to wellbeing. However, we largely rely
on cost-benefit analysis today, focusing on financial measures, to guide our decisions.
This brings us to the challenge of improving this method or finding alternatives to
better approximate utilitarian social welfare in real-world decision-making, including
those involving AI systems.
Utilitarian social welfare functions would promote some level of equity.
Usually, additional resources are less valuable when we already have
a lot of them. This is diminishing marginal returns: the benefit from additional food,
for instance, is very high when we have no food but very low when we already have
more than we can eat. Extending this provides an argument for a more equitable
distribution of resources under utilitarian reasoning. Consider taxation: if one indi-
vidual has a surplus of wealth, say a billion dollars, and another has only one dollar,
redistributing a few dollars from the wealthy individual to the less fortunate one may
elevate overall societal wellbeing. This is because the added value of a dollar to the
less fortunate individual is likely very high, allowing them to purchase necessities like
food and shelter, whereas it is likely very low for the wealthy individual.
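A small worked example, assuming a logarithmic utility curve purely for illustration, shows why such a transfer raises total wellbeing.

```python
import math

# Assume, purely for illustration, that wellbeing from wealth is logarithmic,
# so each extra dollar matters less the more wealth someone already has.
def utility(wealth):
    return math.log(wealth)

rich, poor, transfer = 1_000_000_000, 1, 10

before = utility(rich) + utility(poor)
after = utility(rich - transfer) + utility(poor + transfer)
print(after - before)  # ~2.40: the poor person's gain vastly outweighs the rich person's loss
```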
A utilitarian social welfare function is the only way to satisfy some basic
requirements. Let us reconsider the city planning committee deciding what to
build for Ana, Ben, and Cara. If they all have the same level of wellbeing whether
a new Education Program or a Green Park is built, then it seems right that the
city’s planning committee shouldn’t favor one over the other. Suppose we changed
the scenario a bit. Suppose Ana benefits more from a Health Clinic than an Education
Program, and Ben and Cara don’t have a strong preference either way. It now seems
appropriate that the committee should favor building the Health Clinic. A utilitarian social welfare function, which simply sums individual wellbeing, respects both of these judgments: it is indifferent when every individual is indifferent, and it favors whichever option yields greater total wellbeing.
Overview. In this section, we will consider prioritarian social welfare functions and
how they exhibit differing degrees of concern for the wellbeing of various individu-
als. We will start by describing prioritarian social welfare functions, which give extra
weight to the wellbeing of worse-off people. This discussion will include some common
extreme of the maximin function’s concern for equity. This can be achieved using a
social welfare function that shows diminishing returns relative to individual welfare:
the boost it gives to social welfare when any individual’s wellbeing improves is larger
if that person initially had less wellbeing. For example, we could use the logarithmic
social welfare function

W(w1, w2, . . . , wn) = log(w1) + log(w2) + · · · + log(wn),

where W is the social welfare and w1, w2, . . . , wn are the individual wellbeing
values of everyone in society. The logarithmic social welfare function is (ordinally)
equivalent to the Bernoulli-Nash social welfare function, which is the product of
wellbeing values.
Suppose there are three individuals with wellbeing 2, 4, and 16 and we are using log
base 2. Social welfare is log2(2) + log2(4) + log2(16) = 1 + 2 + 4 = 7. For instance,
raising the first individual's wellbeing from 2 to 4 increases social welfare by 1, to 8;
raising the third individual's wellbeing from 16 to 32 also increases social welfare by 1.
Even though the second change is larger, social welfare is increased by the same
amount as when improving the wellbeing of the individual who was worse off by
a smaller amount. This highlights the principle that improving anyone’s welfare is
beneficial, and it’s easier to increase social welfare by improving the welfare of those
who are worse off. Additionally, even though the second change doesn’t affect the
worst off, the social welfare function shows society is better off. This approach allows
us to consider both the overall level of wellbeing and its distribution.
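A minimal sketch of this calculation in Python (log base 2, as in the example):

```python
import math

def log_welfare(wellbeing, base=2):
    # Logarithmic social welfare: the sum of the logs of individual wellbeing values.
    return sum(math.log(w, base) for w in wellbeing)

print(log_welfare([2, 4, 16]))  # 1 + 2 + 4 = 7
print(log_welfare([4, 4, 16]))  # raising the worst-off from 2 to 4 adds 1
print(log_welfare([2, 4, 32]))  # raising the best-off from 16 to 32 also adds 1
```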
We can specify exactly how prioritarian we are. Let the parameter γ rep-
resent the degree of priority we give worse-off individuals. We can link the Rawlsian
maximin, logarithmic, and utilitarian social welfare functions by using the isoelastic
social welfare function
W(w1, w2, . . . , wn) = (1 / (1 − γ)) · (w1^(1−γ) + w2^(1−γ) + · · · + wn^(1−γ)).
If we think we should give no priority to the worse-off, then we can set γ = 0: the
entire equation is then just a sum of welfare levels, which is the utilitarian social
welfare function. By contrast, if we were maximally prioritarian, taking the limit of
γ as it gets infinitely large, then we recover Rawls’ maximin function. Similarly, if we
took the limit of γ as it approached 1, we would recover the logarithmic social welfare function.
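A minimal sketch of the isoelastic function, treating γ = 1 as the logarithmic limit; the function name and example values are illustrative.

```python
import math

def isoelastic_welfare(wellbeing, gamma):
    """Isoelastic social welfare; gamma sets how prioritarian the function is.

    gamma = 0 gives the utilitarian sum; gamma = 1 is treated as the logarithmic
    limit; very large gamma approaches the Rawlsian maximin ranking.
    """
    if gamma == 1:
        return sum(math.log(w) for w in wellbeing)
    return sum(w ** (1 - gamma) for w in wellbeing) / (1 - gamma)

w = [2, 4, 16]
print(isoelastic_welfare(w, 0))   # 22.0, the utilitarian sum
print(isoelastic_welfare(w, 1))   # ~4.85, the logarithmic function (natural log)
print(isoelastic_welfare(w, 50))  # a tiny negative number dominated by the worst-off value
```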
This section considers how we can make decisions when we are unsure which moral
view is correct, and what this might imply for how we should design AI systems.
Although ignoring our uncertainty may be a comfortable approach in daily life, there
are situations where it is crucial to identify the best decision. We will start by con-
sidering our uncertainties about morality and the idea of reasonable pluralism, which
acknowledges the potential co-existence of multiple reasonable moral views, such
as ethical theories, common-sense morality, and religious teachings. We will explore
uncertainties about moral truths, why they matter in moral decision-making for both
humans and AI, and how to deal with them. We will look at a few proposed solu-
tions, including My Favorite Theory, Maximize Expected Choice-Worthiness (MEC),
and Moral Parliament [349]. These approaches will be compared and evaluated in
terms of their ability to help us make moral decisions under uncertainty. We will then
briefly explore how we might use one of these solutions, the idea of a Moral Parlia-
ment, to enable AI systems to capture moral uncertainty in their decision-making.
Moral theories. Moral theories are systematic attempts to provide a general account
of moral principles that apply universally. Good moral theories should provide a
coherent, consistent framework for determining whether an action is right or wrong.
A basic background understanding of some of the most commonly held moral theories
provides a useful foundation for thinking about the kinds of goals or ideals that we
wish AI systems to promote. Without this background, there is a risk that developers
and users of AI systems may jump to conclusions about these topics with a false sense
of certainty, without weighing considerations that could change
their decisions. It would be highly inefficient for those developing AI systems or trying
to make them safer to attempt to re-invent moral systems, without learning from the
large existing body of philosophical work on these topics. We do not have space here
to explore moral theories in depth, but we provide suggestions in the Recommended
Reading section for those looking for a more detailed introduction to these topics.
There are many different types of moral theories, each of which emphasizes different
moral values and considerations. Consequentialist theories like utilitarianism hold
that the morality of an action is determined by its consequences or outcomes. Util-
itarianism places an emphasis on maximizing everyone’s wellbeing. Utilitarianism
claims that consequences (and only consequences) determine whether an action is
right or wrong, that wellbeing is the only intrinsic good, that everyone’s wellbeing
should be weighed impartially, and that we should maximize wellbeing.
By contrast, under deontological theories, some actions (like lying or killing) are
simply wrong, and they cannot be justified by the good consequences that they
might bring about. Deontology is the name for a family of ethical theories that deny
that the rightness of actions is solely determined by their consequences. Deontological
theories are systems of rules or obligations that constrain moral behavior [350]. The
term deontology encompasses religious ethical theories, non-religious ethical theories,
and principles and rules that are not part of theories at all. These theories give
obligations and constraints priority over consequences.
Other moral theories may emphasize other values. For example, social contract theory
(or contractarianism) focuses on contracts—or, more generally, hypothetical agree-
ments between members of a society—as the foundation of ethics. A rule such as “do
not kill” is morally right, according to a social contract theorist, because individuals
would agree that the adoption of this rule is in their mutual best interest and would
therefore insert it into a social contract underpinning that society.
TABLE 6.4 Example: Alex’s credence in various theories and their evaluation of lying to
save a life.

|  | Utilitarianism | Deontology | Contractarianism |
| Credence | 60% | 30% | 10% |
| Evaluation of lying to save a life | +500 | −1000 | +100 |
As before, utilitarianism highly values lying to save a life (+500), deontology strongly
disapproves of it (−1000), and contractarianism moderately approves of it (+100).
In the row beneath these values are the credence probability-weighted judgments
for Alex. The total calculation is
(60/100) · 500 + (30/100) · (−1000) + (10/100) · 100 = 300 − 300 + 10 = 10.
Under MEC, Alex would choose to lie, because given Alex’s credence in each moral
theory and his determination of how each moral theory judges lying to save a life,
lying has a higher expected choice-worthiness. Alex would lie because he judges that,
on average, lying is the best possible action.
TABLE 6.5 Example: Alex’s credence in various theories, their evaluation of lying to save
a life, and their probability-weighted contribution to the final judgment.

|  | Utilitarianism | Deontology | Contractarianism |
| Credence | 60% | 30% | 10% |
| Evaluation of lying to save a life | +500 | −1000 | +100 |
| Probability-weighted contribution | +300 | −300 | +10 |
MEC gives us a way of balancing how likely we think each theory is with
how much each theory cares about our actions. We can see that utilitarian-
ism and deontology’s relative contributions to the total moral value cancel out, and
we are left with an overall “+10” in favor of lying to save a life. This calculation tells
us that—when accounting for how likely Alex thinks each theory is to be true and
how strong the theories’ preferences over his actions are—Alex’s best guess is that
this action is morally good. (Although, since these numbers are rough and the final
margin is quite thin, we would be wary of being overconfident in this conclusion: the
numbers do not necessarily represent anything true or precise.) MEC has distinct
advantages. For instance, unlike MFT, we ensure that we avoid actions that we think
are probably fine but might be terrible, since large negative choice-worthiness from
some theories will outweigh small positives from others. This is a sensible route to
take in the face of moral uncertainty.
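A minimal sketch of the MEC calculation, using Alex’s illustrative credences and choice-worthiness values:

```python
# Alex's credences (as percentages) and each theory's choice-worthiness rating
# for the act of lying to save a life, taken from the example above.
credences = {"utilitarianism": 60, "deontology": 30, "contractarianism": 10}
choice_worthiness = {"utilitarianism": 500, "deontology": -1000, "contractarianism": 100}

def expected_choice_worthiness(credences, ratings):
    # Weight each theory's rating by the credence assigned to that theory.
    return sum(credences[t] * ratings[t] for t in credences) / 100

print(expected_choice_worthiness(credences, choice_worthiness))  # 10.0
```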
view. If we considered this to be infinitely bad, which seems like an accurate rep-
resentation of the view, then it would overwhelm any other non-absolutist theory.
Even if we think it is simply very large, these firm stances can be more forceful than
other ethical viewpoints, despite ascribing a low probability to the theory’s correct-
ness. This is because even a small percentage of a large value is still meaningfully
large; consider that 0.01% of 1,000,000 is still 100—a figure that may outweigh other
theories we deem more probable.
Moral parliaments may result in AI systems that are less fragile and less
prone to over-confidence. If we are sure that utilitarianism is the correct moral
view, we might be tempted to create AIs that maximize wellbeing—this seems clean
and elegant. However, having a diverse moral parliament would make AIs less likely
to misbehave. By having multiple parliament members, we would achieve redundancy.
This is a common principle in engineering: include extra components that are not strictly
necessary for functioning, in case other components fail (redundancy is explored further
in the Safety Engineering chapter). We would do this to avoid failure
modes where we were overconfident that we knew the correct moral theory, such as
lying and stealing for the greater good, or just to avoid poor implementation from AIs
optimizing for one moral theory. For example, a powerful AI told that utilitarianism
is correct might implement utilitarianism in a particular way that is likely to lead to
bad outcomes. Imagine an AI that has to evaluate millions of possibilities for every
decision it makes. Even with a small error rate, the cumulative effect could lead the
AI to choose risky or unconventional actions. This is because, when evaluating so
many options, actions with high variance in moral value estimation may occasionally
appear to have significant positive value. The AI could be more inclined to select
these high-risk actions based on the mistaken belief that they would yield substantial
benefits. For instance, an AI following some form of utilitarianism might use many
resources to create happy digital minds—at the expense of humanity—even if that is
not what we humans think is morally good.
This is similar to the Winner’s Curse in auction theory: those that win auctions of
goods with uncertain value often find that they won because they overestimated the
value of the good relative to everyone else; for instance, when bidding on a bag of
coins at a fair, people who overestimate how many coins there are will be more likely
to win. Similarly, the AI might opt for actions that, in hindsight, were not truly
beneficial. A moral parliament can make this less likely, because actions that would
be judged morally extreme by most humans also wouldn’t be selected by a diverse
moral parliament.
The process of considering a range of theories inherently embeds redundancy and
cross-checking into the system, reducing the probability of catastrophic outcomes
arising from a single point of failure. It also helps ensure that AI systems are robust
and resilient, capable of handling a broad array of ethical dilemmas.
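One toy way to operationalize this idea is to allocate delegates in proportion to credence and let each delegate vote for its theory's preferred action. The sketch below is only an illustrative simplification, not the bargaining procedure Newberry and Ord propose, and the names and numbers are hypothetical.

```python
# Toy moral parliament: each theory receives delegates in proportion to our
# credence in it, and each delegate votes for the action its theory ranks highest.
credences = {"utilitarianism": 0.6, "deontology": 0.3, "contractarianism": 0.1}
rankings = {  # each theory's favorite action among the options considered
    "utilitarianism": "lie_to_save_life",
    "deontology": "tell_the_truth",
    "contractarianism": "lie_to_save_life",
}

SEATS = 100
votes = {}
for theory, credence in credences.items():
    delegates = round(credence * SEATS)
    favorite = rankings[theory]
    votes[favorite] = votes.get(favorite, 0) + delegates

winner = max(votes, key=votes.get)
print(votes, winner)  # {'lie_to_save_life': 70, 'tell_the_truth': 30} lie_to_save_life
```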
6.10 CONCLUSION
Overview. In this chapter, we have explored various ways in which we can embed
ethics into AI systems, ensuring that they are safe and beneficial. It is far from
guaranteed that the development of AIs will lead to socially beneficial outcomes. By
default, AIs are likely to be developed according to businesses’ economic incentives
and are likely to follow parts of the law. This is insufficient. We almost certainly need
stronger protections in place to ensure that AIs behave ethically. Consequently, we
discussed how we can ensure AIs prioritize aspects of our wellbeing by making us
happy, helping us flourish, and satisfying our preferences. Supposing AIs can figure
out how to promote individual wellbeing, we explored social welfare functions as a
way to guide their actions in order to help improve wellbeing across society.
As a baseline, we want AIs to follow the law. At the very least, we should
require that AIs follow the law. This is imperfect: as we have seen, the law is insuf-
ficiently comprehensive to ensure that AI systems are safe and beneficial. Laws have
loopholes, are occasionally unethical and unrepresentative of the population, and are
often silent on doing good in ways AIs should be required to do. However, if we can
get AIs to follow the law, then we are at least guaranteed that they refrain from
the illegal acts—such as murder and theft—that human societies have identified and
outlawed. In addition, we might want to support regulation that ensures that AI
decision-making must be fair—once we understand what fairness requires.
6.11 LITERATURE
• John J. Nay. Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans. 2023. arXiv: 2209.13020 [cs.CY]
• Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. fairmlbook.org, 2019. url: http://www.fairmlbook.org
• Richard Layard and Jan-Emmanuel De Neve. Wellbeing: Science and Policy. May 2023. isbn: 9781009298926. doi: 10.1017/9781009298957
• Michael J. Sandel. What Money Can’t Buy: The Moral Limits of Markets. Farrar, Straus and Giroux, 2012
• Dan Hendrycks et al. What Would Jiminy Cricket Do? Towards Agents That Behave Morally. 2022. arXiv: 2110.13136 [cs.LG]
• M. D. Adler. Measuring Social Welfare: An Introduction. Oxford University Press, 2019. isbn: 9780190643027. url: https://books.google.com/books?id=_GitDwAAQBAJ
• Katarzyna de Lazari-Radek and Peter Singer. Utilitarianism: A Very Short Introduction. Oxford University Press, 2017. isbn: 9780198728795. doi: 10.1093/actrade/9780198728795.001.0001
• S. Kagan. Normative Ethics. Dimensions of Philosophy Series. Avalon Publishing, 1998. isbn: 9780813308456. url: https://books.google.com/books?id=iO8TAAAAYAAJ
• Toby Newberry and Toby Ord. The Parliamentary Approach to Moral Uncertainty. Technical Report 2021-2, Future of Humanity Institute, University of Oxford, 2021
• Stephen Darwall. Deontology. Blackwell Publishers, Oxford, 2003
CHAPTER 7
Collective Action Problems
7.1 MOTIVATION
Introduction
Example: traffic jams. Consider a traffic jam, where the only obstacle to each
motorist is the car in front. Everyone has the same goal, which is to reach their des-
tination quickly. Since nobody wants to be stuck waiting, the solution might appear
obvious to someone unfamiliar with traffic: everyone should simply drive forward,
starting at the same time and accelerating at the same rate. And yet, without exter-
nal synchronization, achieving this preferable state is impossible. All anyone can do
is start and stop in response to each other’s starting and stopping, inching toward
their destination slowly and haltingly.
Example: tree height. The height of forest trees illustrates a similar dynamic: a tree
benefits from being taller than its neighbors, since this gains it more sunlight, yet if
all the trees were shorter, each would receive much the same light while saving the
energy spent on height. Since there is no way to impose such an agreement between
the trees, each races its neighbor ever higher, and all pay the large costs of growing
so tall.
Example: excessive working hours [365]. People often work far longer hours
than they might ideally like to, rather than taking time off for their other interests,
in order to be competitive in their field. For instance, they might be competing
for limited prestigious positions within their field. In theory, if everyone in a given
field were to reduce their work hours by the same amount, they could all free up
time and increase their quality of life while maintaining their relative position in the
competition. Each person would get the work outcome they would have otherwise,
and everyone would benefit from this freed-up time. Yet no one does this, because if
they alone were to decrease their work efforts, they would be out-competed by others
who did not.
Example: military arms races. Like tree height, the major benefit of military
power is not intrinsic, but relative: being less militarily capable than their neighbors
makes a nation vulnerable to invasion. This competitive pressure drives nations to
expend vast sums of money on their military budgets each year, reducing each nation’s
budgets for other areas, such as healthcare and education. Some forms of military
investment, such as nuclear weaponry and military AI applications, also exacerbate
the risks of large-scale catastrophes. If every nation were to decrease its military
investment by the same amount, everyone would benefit from the reduced expenses
and risks without anyone losing their relative power. However, this arrangement is
not stable, since each nation could improve its security by ensuring its military power
exceeds that of its competitors, and each risks becoming vulnerable if it alone fails
to do this. Military expenditure therefore remains high in spite of these seemingly
avoidable costs.
Steering each agent ≠ steering the system. These phenomena hint at the
distinct challenges of ensuring safety in multi-agent systems. The danger posed by a
collective of agents is greater than the sum of its parts. AI risk cannot be eradicated
by merely ensuring that each individual AI agent is loyal and each individual human
operator is well-intentioned. Even if all agents, both human and AI, share a common
set of goals, this does not guarantee macrobehavior in line with these goals. The
agents’ interactions can produce undesirable outcomes.
Chapter focus. In this chapter, we use abstract models to understand how in-
telligent agents can, despite acting rationally and in accordance with their own self-
interest, collectively produce outcomes that none of them wants, even when they
could seemingly have achieved preferable alternative outcomes. We can characterize
these risks by crudely differentiating them into the following two sets:
• Multi-human dynamics. These risks are generated by interactions between the
human agencies involved in AI development and adoption, particularly corpora-
tions and nations. The central concern here is that competitive and evolutionary
pressures could drive humanity to hand over increasing amounts of power to AIs,
thereby becoming a “second-class species.” The frameworks we explore in this chap-
ter are highly abstract and can be useful in thinking more generally about the
current AI landscape.
Of particular importance are racing dynamics. We see these in the corporate world,
where AI developers may cut corners on safety in order to avoid being outcompeted
by one another. We also see these in international relations, where nations are
racing each other to adopt hazardous military AI applications. By observing AI
races, we can anticipate that merely persuading these parties that their actions are
high-risk may not be sufficient for ensuring that they act more cautiously, because
they may be willing to tolerate high risk levels in order to “stay in the race.” For
example, nations may choose to continue investing in military AI technologies that
could fail in catastrophic ways, if abstaining from doing so risks losing an international
conflict.
• Multi-AI dynamics. These risks are generated by interactions with and between
AI agents. In the future, we expect that AIs will increasingly be granted auton-
omy in their behavior, and will therefore interact with others under progressively
less human oversight. This poses risks in at least three ways. First, evolutionary
pressures may promote selfish behavior and generate various forms of intrasystem
conflict that could subvert our goals. Second, many of the mechanisms by which
AI agents may cooperate with one another could promote undesirable behaviors,
such as nepotism, outgroup hostility, and the development of ruthless reputations.
Third, AIs may engage in conflict, using threats of extreme scale in order to extort
others, or even promoting all-out warfare, with devastating consequences.
We explore both of the above sets of multi-agent risks using generalizable frameworks
from game theory, bargaining theory, and evolutionary theory. These frameworks
help us understand the collective dynamics that can lead to outcomes that were
not intended or desired by anyone individually. Even if AI systems are fully under
human control and leading actors such as corporations and states are well-intentioned,
humanity could still end up gradually eroding away its power until it cannot be
recovered.
Game Theory
Rational agents will not necessarily secure good outcomes. Behavior that
is individually rational and self-interested can produce collective outcomes that
are suboptimal, or even catastrophic, for all involved. This section first examines
the Prisoner’s Dilemma, a canonical game theoretic example that illustrates this
theme—though cooperation would produce an outcome that is better for both agents,
for either one to cooperate would be irrational.
We then build on this by introducing two additional levels of sophistication. The
first addition is time. We explore how cooperation is possible, though not assured,
when agents interact repeatedly over time. The second addition is the introduction
of more than two agents. We explore how collective action problems can generate
and maintain undesirable states. Ultimately, we see how these natural dynamics can
produce catastrophically bad outcomes. They perpetuate military arms races and
corporate AI races, increasing the risks from both. They may also promote dangerous
AI behaviors, such as extortion.
Cooperation
Conflict
5. Inequality: inequality may increase the probability of conflict, due to factors such
as relative deprivation and social envy.
Evolutionary Pressure
Natural selection will promote AIs that behave selfishly. In this final sec-
tion, we use evolutionary theory to study what happens when a large number of
agents interact many times over many generations. We start with generalized Dar-
winism: the idea that evolution by natural selection can take place outside of the
realm of biology. We explore examples in linguistics, music, philosophy and sociol-
ogy. We formalize generalized Darwinism using Lewontin’s conditions for evolution
by natural selection and the Price equation for evolutionary change. Using both, we
show that AIs are likely to be subject to evolution by natural selection: they will vary
in ways that produce differential fitness and so influence which traits persist through
time and between “generations” of AIs.
Next, we explore two AI risks generated by evolutionary pressures. The first is that
correctly specified goals may be subverted or distorted by “intrasystem goal conflict.”
The interests of propagating information (such as genes, departments, or sub-agents)
can sometimes clash with those of the larger entity that contains it (such as an
organism, government, or AI system), undermining unity of purpose. The second risk
we consider is that natural selection tends to favor selfish traits over altruistic ones.
A future shaped by evolutionary pressures is, therefore, likely to be dominated by
selfish behavior, both in the institutions that produce and use AI systems, and in the
AIs themselves.
The conclusions of this section are simple. Natural selection will by default be a strong
force in determining the state of the world. Its influence on AI development carries
the risk of intrasystem goal conflict and the promotion of selfish behavior. Both risks
could have catastrophic effects. Intrasystem goal conflict could prevent our goals from
being carried out and generate unexpected actions. AI agents could develop selfish
tendencies, increasing the risk that they might employ harmful strategies (including
those covered earlier in the chapter, such as extortion).
7.2.1 Overview
This chapter explores the dynamics generated by the interactions of multiple agents,
both human and AI. These interactions create risks distinct from those generated
by any individual AI agent acting in isolation. One way we can study the strategic
interdependence of agents is with the framework of game theory. Using game theory,
we can examine formal models of how agents interact with each other under varying
conditions and predict the outcomes of these interactions.
Here, we use game theory to present natural dynamics in biological and social systems
that involve multiple agents. In particular, we explore what might cause agents to
come into conflict with one another, rather than cooperate. We show how these multi-
agent dynamics can generate undesirable outcomes, sometimes for all the agents
involved. We consider risks created by interactions within and between human and
AI agents, from human-directed companies and militaries engaging in perilous races
to autonomous AIs using threats for extortion.
We start with an overview of the fundamentals of game theory. We begin this section
by setting out the characteristics of game theoretic agents. We also categorize the
different kinds of games we are exploring.
We then focus on the Prisoner’s Dilemma. The Prisoner’s Dilemma is a simple ex-
ample of how an interaction between two agents can generate an equilibrium state
that is bad for both, even when each acts rationally and in their own self-interest. We
explore how agents may arrive at the outcome where neither chooses to cooperate.
We use this to model real-world phenomena, such as negative political campaigns.
Finally, we examine ways we might foster rational cooperation between self-interested
AI agents, such as by altering the values in the underlying payoff matrices. The key
upshot is that intelligent and rational agents do not always achieve good outcomes.
We next add in the element of time by examining the Iterated Prisoner’s Dilemma.
AI agents are unlikely to interact with others only once. When agents engage with
each other multiple times, this creates its own hazards. We begin by examining how
iterating the Prisoner’s Dilemma alters the agents’ incentives—when an agent’s be-
havior in the present can influence that of their partner in the future, this creates
an opportunity for rational cooperation. We study the effects of altering some of the
variables in this basic model: uncertainty about future engagement and the necessity
to switch between multiple different partners. We look at why the cooperative strat-
egy tit-for-tat is usually so successful, and in what circumstances it is less so. Finally,
we explore some of the risks associated with iterated multi-agent social dynamics:
corporate AI races, military AI arms races, and AI extortion. The key upshot is that
cooperation cannot be ensured merely by iterating interactions through time.
We next move to consider group-level interactions. AI agents might not interact with
others in a neat, pairwise fashion, as assumed by the models previously explored.
In the real world, social behavior is rarely so straightforward. Interactions can take
place between more than two agents at the same time. A group of agents creates an
environmental structure that may alter the incentives directing individual behavior.
Human societies are rife with dynamics generated by group-level interactions that
result in undesirable outcomes. We begin by formalizing “collective action problems.”
We consider real-world examples such as anthropogenic climate change and fishery
depletion. Multi-agent dynamics such as these generate AI risk in several ways. Races
between human agents and agencies could trigger flash wars between AI agents or
the automation of economies to the point of human enfeeblement. The key upshot
is that achieving cooperation and ensuring collectively good outcomes is even more
difficult in interactions involving more than two agents.
In this section, we briefly run through some of the fundamental principles of game
theory. Game theory is the branch of mathematics concerned with agents’ choices and
strategies in multi-agent interactions. Game theory is so-called because we reduce
complex situations to abstract games where agents maximize their payoffs. Using
game theory, we can study how altering incentives influences the strategies that these
agents use.
Agents in game theory. We usually assume that the agents in these games are
self-interested and rational. Agents are “self-interested” if they make decisions in view
of their own utility, regardless of the consequences to others. Agents are said to be
“rational” if they act as though they are maximizing their utility.
Games can be “zero sum” or “non-zero sum.” We can categorize the games
we are studying in different ways. One distinction is between zero sum and non-zero
sum games. A zero sum game is one where, in every outcome, the agents’ payoffs
all sum to zero. An example is “tug of war”: any benefit to one party from their pull
is necessarily a cost to the other. Therefore, the wins and losses cancel out. In other
words, there is never any net change in total value. Poker is a
zero sum game if the players’ payoffs are the money they each finish with. The total
amount of money at a poker game’s beginning and end is the same—it has simply
been redistributed between the players.
By contrast, many games are non-zero sum. In non-zero sum games, the total amount
of value is not fixed and may be changed by playing the game. Thus, one agent’s win
does not necessarily require another’s loss. For instance, in cooperation games such
as those where players must meet at an undetermined location, players only get the
payoff together if they manage to find each other. As we shall see, the Prisoner’s
Dilemma is a non-zero sum game, as the sum of payoffs changes across different
outcomes.
Non-zero sum games can have “positive sum” or “negative sum” out-
comes. We can categorize the outcomes of non-zero sum games as positive sum
and negative sum. In a positive sum outcome, the total gains and losses of the agents
sum to greater than zero. Positive sum outcomes can arise when particular inter-
actions result in an increase in value. This includes instances of mutually beneficial
cooperation. For example, if one agent has flour and another has water and heat,
the two together can cooperate to make bread, which is more valuable than the raw
materials. As a real-world example, many view the stock market as positive sum
because the overall value of the stock market tends to increase over time. Though
gains are unevenly distributed, and some investors lose money, the average investor
becomes richer. This demonstrates an important point: positive sum outcomes are
not necessarily “win-win.” Cooperating does not guarantee a benefit to all involved.
Even if extra total value is created, its distribution between the agents involved in its
creation can take any shape, including one where some agents have negative payoffs.
In a negative sum outcome, some amount of value is lost by playing the game. Many
competitive interactions in the real world are negative sum. For instance, consider
“oil wars”—wars fought over a valuable hydrocarbon resource. Oil wars are zero-
sum with regard to oil, since only the distribution (not the amount) of oil changes.
However, the process of conflict itself incurs costs to both sides, such as loss of life
and infrastructure damage. This reduces the total amount of value. If AI development
has the potential to result in catastrophic outcomes for humanity, then accelerating
development to gain short-term profits in exchange for long-term losses to everyone
involved would be a negative sum outcome.
Our aim in this section is to investigate how interactions between rational agents,
both human and AI, may negatively impact everyone involved. To this end, we focus
on a simple game: the Prisoner’s Dilemma. We first explore how the game works,
and its different possible outcomes. We then examine why agents may choose not to
cooperate even if they know this will lead to a collectively suboptimal outcome. We
run through several real-world phenomena which we can model using the Prisoner’s
Dilemma, before exploring ways in which cooperation can be promoted in these kinds
of interactions. We end by briefly discussing the risk of AI agents tending toward
undesirable equilibrium states.
In the Prisoner’s Dilemma, two agents must each decide whether or not to cooperate.
The costs and benefits are structured such that for each agent, defection is the best
strategy regardless of what their partner chooses to do. This motivates both agents
to defect.
The four possible outcomes. We assume that Alice and Bob are both rational
and self-interested: each only cares about minimizing their own jail time. We define
the decision facing each as follows. They can either “cooperate” with their partner by
remaining silent or “defect” on their partner by confessing to burglary. Each suspect
faces four possible outcomes, which we can split into two possible scenarios. Let’s term
these “World 1” and “World 2”; see Figure 7.1. In World 1, their partner chooses to
cooperate with them; in World 2, their partner chooses to defect. In both scenarios,
the suspect decides whether to cooperate or defect themself. They do not know what
their partner will decide to do.
Figure 7.1. The possible outcomes for Alice in the Prisoner’s Dilemma.
Defection is the dominant strategy. Alice does not know whether Bob will
choose to cooperate or defect. She does not know whether she will find herself in
World 1 or World 2; see Figure 7.1. She can only decide whether to cooperate or
defect herself. This means she is making one of two possible decisions. If she defects,
she is. . .
. . . in World 1: Bob cooperates and she goes free instead of spending a
year in jail.
. . . in World 2: Bob defects and she gets a 3-year sentence instead of an
8-year one.
Alice only cares about minimizing her own jail time, so she can save herself jail time
in either scenario by choosing to defect. She saves herself one year if her partner coop-
erates or five years if her partner defects. A rational agent under these circumstances
will do best if they decide to defect, regardless of what they expect their partner
to do. We call this the dominant strategy: a rational agent playing the Prisoner’s
Dilemma should choose to defect no matter what their partner does.
One way to think about strategic dominance is through the following thought exper-
iment. Someone in the Arctic during winter is choosing what to wear for that day’s
excursion. They have only two options: a coat or a t-shirt. The coat is thick and
waterproof; the t-shirt is thin and absorbent. Though this person cannot control or
predict the weather, they know there are only two possibilities: either rain or cold. If
it rains, the coat will keep them drier than the t-shirt. If it is cold, the coat will keep
them warmer than the t-shirt. Either way, the coat is the better option, so “wearing
the coat” is their dominant strategy.
Defection is the dominant strategy for both agents. Importantly, both the
suspects face this decision in a symmetric fashion. Each is deciding between identical
outcomes, and each wishes to minimize their own jail time. Let’s consider the four
possible outcomes now in terms of both the suspects’ jail sentences. We can display
this information in a payoff matrix, as shown in Table 7.1. Payoff matrices are com-
monly used to visualize games. They show all the possible outcomes of a game in
terms of the value of that outcome for each of the agents involved. In the Prisoner’s
Dilemma, we show the decision outcomes as the payoffs to each suspect: note that
since more jail time is worse than less, these payoffs are negative. Each cell of the
matrix shows the outcome of the two suspects’ decisions as the payoff to each suspect.
TABLE 7.1 Each cell in this payoff matrix represents a payoff. If Alice cooperates and Bob
defects, the top right cell tells us that Alice gets 8 years in jail while Bob goes free.
The stable equilibrium state in the Prisoner’s Dilemma is for both agents to defect.
Neither agent would choose to go back in time and change their decision (to switch
to cooperating) if they could not also alter their partner’s behavior by doing so. This
is often considered counterintuitive, as the agents would benefit if they were both to
switch to cooperating.
Nash Equilibrium: both agents will choose to defect. Defection is the best
strategy for Alice, regardless of what Bob opts to do. The same is true for Bob.
Therefore, if both are behaving in a rational and self-interested fashion, they will both
defect. This will secure the outcome of 3 years of jail time each (the bottom-right
outcome of the payoff matrix above). Neither would wish to change their decision,
even if their partner were to change theirs. This is known as the Nash equilibrium: the
strategy choices from which no agent can benefit by unilaterally choosing a different
strategy. When interacting with one another, rational agents will tend toward picking
strategies that are part of Nash equilibria.
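A minimal sketch that checks each strategy profile for profitable unilateral deviations, using the jail terms from the example (payoffs are negated years in jail):

```python
# Payoffs as (Alice, Bob), using negated jail years from the example:
# both cooperate: 1 year each; lone defector: 0 vs. 8; both defect: 3 each.
payoffs = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-8,  0),
    ("defect",    "cooperate"): ( 0, -8),
    ("defect",    "defect"):    (-3, -3),
}
strategies = ["cooperate", "defect"]

def is_nash(a, b):
    # Neither player can gain by unilaterally switching strategies.
    alice_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in strategies)
    bob_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in strategies)
    return alice_ok and bob_ok

print([profile for profile in payoffs if is_nash(*profile)])  # [('defect', 'defect')]
```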
Figure 7.2. Looking at the possible outcomes for both suspects in the Prisoner’s Dilemma,
we can see that there is a possible Pareto improvement from the Nash equilibrium. The
numbers represent their payoffs (rather than the length of their jail sentence).
Figure 7.3. Both suspects’ payoffs, in each of the four decision outcomes. Moving right in-
creases Alice’s payoff, and moving up improves Bob’s payoff. A Pareto improvement requires
moving right and up, as shown by the green arrow [367].
Shopkeeper price cuts. Another example is price racing dynamics between differ-
ent goods providers. Consider two rival shopkeepers selling similar produce at similar
prices. They are competing for local customers. Each shopkeeper calculates that low-
ering their prices below that of their rival will attract more customers away from
the other shop and result in a higher total profit for themselves. If their competitor
drops their prices and they do not, then the competitor will gain extra customers,
leaving the first shopkeeper with almost none. Thus, “dropping prices” is the domi-
nant strategy for both. This leads to a Nash equilibrium in which both shops have
low prices, but the local custom is divided much the same as it would be if they had
both kept their prices high. If they were both to raise their prices, they would both
benefit by increasing their profits: this would be a Pareto improvement. Note that,
just as how the interests of the police do not count in the Prisoner’s Dilemma, we
are only considering the interests of the shopkeepers in this example. We are ignoring
the interests of the customers and wider society.
Promoting Cooperation
Reasons to cooperate. There are many reasons why real-world agents might
cooperate in situations which resemble the Prisoner’s Dilemma [368], as shown in
Figure 7.4. These can broadly be categorized by whether the agents have a choice, or
whether defection is impossible. If the agents do have a choice, we can further divide
the possibilities into those where they act in their own self-interest, and those where
they do not (altruism). Finally, we can differentiate two reasons why self-interested
agents may choose to cooperate: a tendency toward this, such as a conscience or guilt,
and future reward/punishment. We will explore two possibilities in this section—
payoff changes and altruistic dispositions—and then “future reward/punishment” in
the next section. Note that we effectively discuss “Defection is impossible” in the
Single-Agent Safety chapter, and “AI consciences” in the Beneficial AI and Machine
Ethics chapter.
Figure 7.4. Four possible reasons why agents may cooperate in Prisoner’s Dilemma-like
scenarios. This section explores two: changes to the payoff matrix and increased agent
altruism [368].
TABLE 7.2 If c > a and d > b, the highest payoff for either agent is to defect, regardless
of what their opponent does: defection is the dominant strategy. Fostering cooperation
requires avoiding this structure.

|  | Agent B cooperates | Agent B defects |
| Agent A cooperates | a, a | b, c |
| Agent A defects | c, b | d, d |
There are two ways to reduce the expected value of defection: lower the probability
of defection success or lower the benefit of a successful defection. Consider a strat-
egy commonly used by organized crime groups: threatening members with extreme
punishment if they “snitch” to the police. In the Prisoner’s Dilemma game, we can
model this by adding a punishment equivalent to three years of jail time for “snitch-
ing,” leading to the altered payoff matrix as shown in Figure 7.5. The Pareto efficient
outcome (−1, −1) is now also a Nash Equilibrium because snitching when the other
player cooperates is worse than mutually cooperating (c < a).
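Applying the same unilateral-deviation check to the altered matrix, with the three-year snitching penalty described above, shows that mutual cooperation becomes a second equilibrium:

```python
# The same Prisoner's Dilemma payoffs, but with a three-year penalty added for
# any player who defects ("snitches"), as in the organized-crime example.
PENALTY = 3
punished = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-8, 0 - PENALTY),
    ("defect",    "cooperate"): (0 - PENALTY, -8),
    ("defect",    "defect"):    (-3 - PENALTY, -3 - PENALTY),
}
strategies = ["cooperate", "defect"]

def nash_equilibria(payoffs):
    def is_nash(a, b):
        return (all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in strategies) and
                all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in strategies))
    return [p for p in payoffs if is_nash(*p)]

# Mutual cooperation is now an equilibrium alongside mutual defection (a Stag Hunt).
print(nash_equilibria(punished))  # [('cooperate', 'cooperate'), ('defect', 'defect')]
```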
Figure 7.5. Altering the payoff matrix to punish snitches, we can move from a Prisoner’s
Dilemma (left) to a Stag Hunt (right), in which there is an additional Nash equilibrium.
Summary
In our discussion of the Prisoner’s Dilemma, we saw how rational agents may converge
to equilibrium states that are bad for all involved. In the real world, however, agents
rarely interact with one another only once. Our aim in this section is to understand
how cooperative behavior can be promoted and maintained as multiple agents (both
human and AI) interact with each other over time, when they expect repeated future
interactions. We handle some common misconceptions in this section, such as the idea
that simply getting agents to interact repeatedly is sufficient to foster cooperation,
because “nice” and “forgiving” strategies always win out. As we shall see, things are
not so simple. We explore how iterated interactions can lead to progressively worse
outcomes for all.
In the real world, we can observe this in “AI races,” where businesses cut corners
on safety due to competitive pressures, and militaries adopt and deploy potentially
unsafe AI technologies, making the world less safe. These AI races could produce
catastrophic consequences, including more frequent or destructive wars, economic
enfeeblement, and the potential for catastrophic accidents from malfunctioning or
misused AI weapons.
Introduction
Agents who engage with one another many times do not always coexist harmoniously.
Iterating interactions is not sufficient to ensure cooperation. To see why, we explore
what happens when rational, self-interested agents play the Prisoner’s Dilemma game
against each other repeatedly. In a single-round Prisoner’s Dilemma, defection is
always the rational move. But understanding the success of different strategies is
more complicated when agents play multiple rounds.
Punishment. Recall Alice and Bob from the previous section, the two would-be
thieves caught by the police. Alice decides to defect in the first round of the Prisoner’s
Dilemma, while Bob opts to cooperate. This achieves a good outcome for Alice, and
a poor one for Bob, who punishes this behavior by choosing to defect himself in the
second round. What makes this a punishment is that Alice’s score will now be lower
than it would be if Bob had opted to cooperate instead, whether Alice chooses to
cooperate or defect.
Reward. Alice, having been punished, decides to cooperate in the third round. Bob
rewards this action by cooperating in turn in the fourth. What makes this a reward is
that Alice’s score will now be higher than if Bob had instead opted to defect, whether
Alice chooses to cooperate or defect. Thus, the expectation that their defection will
be punished and their cooperation rewarded in future rounds can give each agent a
reason to cooperate.
Figure 7.6. Across six rounds, both players gain better payoffs if they consistently cooperate.
But defecting creates short-term gains.
In Figure 7.6, each panel shows a six-round Iterated Prisoner’s Dilemma, with purple
squares for defection and blue for cooperation. On the left is Tit-for-tat: An agent
using this strategy tends to score the same as or worse than its partners in each match.
On the right, always defect tends to score the same as or better than its partner in
each match. The average payoff attained by each strategy is shown at the
bottom: Tit-for-tat attains a better payoff (lower jail sentence) on average—and so is
more successful in a tournament—than always defect.
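A minimal simulation along the lines of Figure 7.6, with payoffs again given as negated jail years; the strategy implementations are simplified sketches:

```python
# Payoffs (to the "me" player) in negated jail years.
def payoff(me, other):
    table = {("C", "C"): -1, ("C", "D"): -8, ("D", "C"): 0, ("D", "D"): -3}
    return table[(me, other)]

def tit_for_tat(history_other):
    # Cooperate first, then copy the partner's previous move.
    return "C" if not history_other else history_other[-1]

def always_defect(history_other):
    return "D"

def play_match(strategy_a, strategy_b, rounds=6):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_b), strategy_b(history_a)
        score_a += payoff(move_a, move_b)
        score_b += payoff(move_b, move_a)
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

print(play_match(tit_for_tat, tit_for_tat))      # (-6, -6): mutual cooperation
print(play_match(tit_for_tat, always_defect))    # (-23, -15): tit-for-tat loses this match...
print(play_match(always_defect, always_defect))  # (-18, -18): ...but mutual defection scores worse
```

Averaging each strategy's scores across these example matches, tit-for-tat comes out ahead (−14.5 versus −16.5), echoing the figure's point that the cooperative strategy does better in aggregate even though it never outscores its partner within a match.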
Defection is still the dominant strategy if agents know how many times
they will interact. If the agents know when they are about to play the Prisoner’s
Dilemma with each other for the final time, both will choose to defect in that final
round. This is because their defection is no longer punishable by their partner. If Alice
defects in the last round of the Iterated Prisoner’s Dilemma, Bob cannot punish her
by retaliating, as there are no future rounds in which to do so. The same is of course
true for Bob. Thus, defection is the dominant strategy for each agent in the final
round, just as it is in the single-round version of the dilemma.
Moreover, if each agent expects their partner to defect in the final round, then there
is no incentive for them to cooperate in the penultimate round either. This is for the
same reason: Defecting in the penultimate round will not influence their partner’s
behavior in the final round. Whatever an agent decides to do, they expect that
their partner will choose to defect next round, so they might as well defect now.
We can extend this argument by reasoning backward through all the iterations. In
each round, the certainty that their partner will defect in the next round regardless
of their own behavior in the current round incentivizes each agent to defect. The
reward for cooperation and punishment of defection have been removed. Ultimately,
this removal pushes the agents to defect in every round of the Iterated Prisoner’s
Dilemma.
Uncertainty about future engagement enables rational cooperation. In
the real world, an agent can rarely be sure that they will never again engage with
a given partner. Wherever there is sufficient uncertainty about the future of their
relationship, rational agents may be more cooperative. This is for the simple reason
that uncooperative behavior may yield less valuable outcomes in the long term, be-
cause others may retaliate in kind in the future. This tells us that AIs interacting
with each other repeatedly may cooperate, but only if they are sufficiently uncertain
about whether their interactions are about to end. Other forms of uncertainty can
also create opportunities for rational cooperation, such as uncertainty about what
strategies others will use. These are most important where the Iterated Prisoner’s
Dilemma involves a population of more than two agents, in which each agent inter-
acts sequentially with multiple partners. We turn to examining the dynamics of these
more complicated games next.
Tournaments
So far, we have considered the Iterated Prisoner’s Dilemma between only two agents:
each plays repeatedly against a single partner. However, in the real world, we expect
AIs will engage with multiple other agents. In this section, we consider interactions
of this kind, where each agent not only interacts with their partner repeatedly, but
also switches partners over time. Understanding the success of a strategy is more
complicated in repeated rounds against many partners. Note that in this section, we
define a “match” to mean repeated rounds of the Prisoner’s Dilemma between the
same two agents; see Figure 7.6. We define a “tournament” to mean a population of
more than two agents engaged in a set of pairwise matches.
In Iterated Prisoner Dilemma tournaments, each agent interacts with
multiple partners. Around 1980, the political scientist Robert Axelrod held a
series of tournaments to pit different agents against one another in the Iterated Pris-
oner’s Dilemma. The tournament winner was whichever agent had the highest total
payoff after completing all matches. Each agent in an Iterated Prisoner’s Dilemma
tournament plays multiple rounds against multiple partners. These agents employed a
range of different strategies. For example, an agent using the strategy named random
would randomly determine whether to cooperate or defect in each round, entirely
independently of previous interactions with a given partner. By contrast, an agent
using the grudger strategy would start out cooperating, but switch to defecting for
all future interactions if its partner defected even once. See Table 7.3 for examples of
these strategies.
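A self-contained toy round-robin along these lines is sketched below; the strategy implementations are simplified guesses at the strategies named in the text, each pairing plays one 200-round match, and the random strategy makes exact scores vary from run to run.

```python
import random
from itertools import combinations

PAYOFF = {("C", "C"): -1, ("C", "D"): -8, ("D", "C"): 0, ("D", "D"): -3}

def tit_for_tat(opp): return "C" if not opp else opp[-1]
def always_defect(opp): return "D"
def random_strategy(opp): return random.choice(["C", "D"])
def grudger(opp): return "D" if "D" in opp else "C"  # cooperate until the partner defects once

def match(strat_a, strat_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)
        score_a, score_b = score_a + PAYOFF[(a, b)], score_b + PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

def tournament(strategies):
    totals = {s.__name__: 0 for s in strategies}
    for strat_a, strat_b in combinations(strategies, 2):  # every pairing plays one match
        score_a, score_b = match(strat_a, strat_b)
        totals[strat_a.__name__] += score_a
        totals[strat_b.__name__] += score_b
    return totals

print(tournament([tit_for_tat, always_defect, random_strategy, grudger]))
```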
AI Races
Iterated interactions can generate “AI races.” We discuss two kinds of races concerning
AI development: corporate AI races and military AI arms races. Both kinds center
around competing parties participating in races for individual, short-term gains at
a collective, long-term detriment. Where individual incentives clash with collective
interests, the outcome can be bad for all. As we discuss here, in the context of AI
races, these outcomes could even be catastrophic.
AI races are the result of intense competitive pressures. During the Cold
War, the US and the Soviet Union were involved in a costly nuclear arms race. The
effects of their competition persist today, leaving the world in a state of heightened
nuclear threat. Competitive races of this kind entail repeated back-and-forth actions
that can result in progressively worse outcomes for all involved. We can liken this
example to the Iterated Prisoner’s Dilemma, where the nations must decide whether
to increase (defect) or decrease (cooperate) their nuclear spending. Both the US and
the Soviet Union often chose to increase spending. They would have created a safer
and less expensive world for both nations (as well as others) if they had cooperated
to reduce their nuclear stockpiles. We discuss this in more detail in 8.6.
Two kinds of AI races: corporate and military [273]. Competition between
different parties—nations or corporations—is incentivizing each to develop, deploy,
and adopt AIs rapidly, at the expense of other values and safety precautions. Corpo-
rate AI races consist of businesses prioritizing their own survival or power expansion
over ensuring that AIs are developed and released safely. Military AI arms races
consist of nations building and adopting powerful and dangerous military applica-
tions of AI technologies to gain military power, increasing the risks of more frequent
or damaging wars, misuse, or catastrophic accidents. We can understand these two
kinds of AI races using two game-theoretic models of iterated interactions. First, we
use the Attrition model to understand why AI corporations are cutting corners on
safety. Second, we use the Security Dilemma model to understand why militaries
are escalating the use of—and reliance on—AI in warfare.
Corporate AI Races
Competition between AI research companies is promoting the creation and use of
more appealing and profitable systems, often at the cost of safety measures. Con-
sider the public release of large language model-based chatbots. Some AI companies
delayed releasing their chatbots out of safety concerns, like avoiding the generation
of harmful misinformation. We can view the companies that released their chat-
bots first as having switched from cooperating to defecting in an Iterated Prisoner’s
Dilemma. The defectors gained public attention and secured future investment. This
competitive pressure caused other companies to rush their AI products to market,
compromising safety measures in the process.
Corporate AI races arise because competitors sacrifice their values to gain an advan-
tage, even if this harms others. As a race heats up, corporations might increasingly
need to prioritize profits by cutting corners on safety, in order to survive in a world
where their competitors are very likely to do the same. The worst outcome for an
agent in the Prisoner’s Dilemma is the one where only they cooperated while their
partner defected. Competitive pressures motivate AI companies to avoid this out-
come, even at the cost of exacerbating large-scale risks.
Ultimately, corporate AI races could produce societal-scale harms, such as mass un-
employment and dangerous dependence on AI systems. We consider one such example
in 7.2.5. This risk is particularly vivid for emerging industries like AI, which lack the
better-established safeguards found in industries like pharmaceuticals, such as mature
regulation and widespread awareness of the harm that unsafe products can cause.
Attrition model: a multi-player game of “Chicken.” We can model this
kind of corporate AI race using an “Attrition” model [370], which frames a race as
a kind of auction in which competitors bid against one another for a valuable prize.
Rather than bidding money, the competitors bid for the risk level they are willing
to tolerate. This is similar to the game “Chicken,” in which two competitors drive
headlong at each other. Assuming one swerves out of the way, the winner is the one
who does not (demonstrating that they can tolerate a higher level of risk than the
loser). Similarly, in the Attrition model, each competitor bids the level of risk—the
probability of bringing about a catastrophic outcome—they are willing to tolerate.
Whichever competitor is willing to tolerate the most risk will win the entire prize, as
long as the catastrophe they are risking does not actually happen. We can consider
this to be an “all pay” auction: both competitors must pay what they bid, whether
they win or not. This is because all of those involved must bear the risk they are
leveraging, and once they have made their bid they cannot retract it.
The Attrition model shows why AI corporations may cut corners on
safety. Let us assume that there are only two competitors and that both of them
have the same understanding of the state of their competition. In this case, the
Attrition model predicts that they will race each other up to a loss of one-third in
expected value [371]. If the value of the prize to one competitor is “X,” they will be
willing to risk a 33% chance of bringing about an outcome equally disvaluable (of
value “-X”) in order to win their race [372].
As we have discussed previously, market pressures may motivate corporations to
behave as though they value what they are competing for almost as highly as survival
itself. According to this toy model, we might then expect AI stakeholders engaged
in a corporate race to risk a 33% chance of existential catastrophe in order to “win
the prize” of their continued existence. With multiple AI races, long time horizons,
and ever-increasing risks, the repeated erosion of safety assurances down to only 66%
generates a vast potential for catastrophe.
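To make the compounding effect concrete, consider a minimal sketch which assumes, purely for illustration, that each successive race independently carries the roughly one-in-three catastrophe risk described above (the cited model does not make this exact claim):

```python
# A minimal sketch (not from the cited model): if each successive AI race
# independently carries a ~33% catastrophe risk, overall safety erodes
# geometrically with the number of races.
per_race_risk = 1 / 3          # risk tolerated per race (toy assumption)

for n_races in [1, 2, 5, 10]:
    p_no_catastrophe = (1 - per_race_risk) ** n_races
    print(f"{n_races:>2} race(s): P(no catastrophe) ≈ {p_no_catastrophe:.2f}")

# Output: 1 race ≈ 0.67, 2 ≈ 0.44, 5 ≈ 0.13, 10 ≈ 0.02
```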
Real-world actors may mistakenly erode safety precautions even further.
Moreover, real-world AI races could produce even worse outcomes than the one
predicted by the Attrition model [372]. One reason for this is that competing cor-
porations may not have a correct understanding of the state of their race. Precisely
predicting these kinds of risks can be extremely challenging: high-risk situations are
inherently difficult to predict accurately, even in fields far more well-understood than
AI. Incorrect risk calibration could cause the competitors to take actions that acci-
dentally exceed even the 33% risk level. Like newcomers to an “all pay” auction who
often overbid, uneven comprehension or misinformation could motivate the competi-
tors to take even greater risks of bringing about catastrophic outcomes. In fact, we
might even expect selection for competitors who tend to underestimate the risks of
these races. All these factors may further erode safety assurances.
Example: the Cold War nuclear arms race. As previously discussed, the
Cold War nuclear arms race typifies this process. Neither the US nor the Soviet Union
wanted to risk being less militarily capable than their rival, so each escalated their
own weaponized nuclear ability in an attempt to deter the other using the threat of
retaliation. Just as in the Iterated Prisoner’s Dilemma, neither nation could afford to
risk being the lone cooperator while their rival defected. Thus, they settled on the Pareto-inefficient outcome of mutual defection. Competitive pressure drove them to continue
to worsen this situation over time, resulting in today’s enormously heightened state
of nuclear vulnerability.
There are many incentives for nations to increase their development, adoption, and
deployment of military AI applications. With more AI involvement, warfare can take
place at an accelerated pace, and at a more destructive scale. Nations that do not
adopt and use military AI technologies may therefore risk not being able to compete
with nations that do. As with nuclear mutually assured destruction, nations may also
employ automated retaliation as a signal of commitment, hoping to deter attacks by
demonstrating a credible resolve to respond swiftly and in kind. This process of
automation and AI delegation would thus perpetuate, despite it being increasingly
against the collective good.
Ultimately, as with economic automation, military AI arms races could result in
humans being unable to keep up. The pace and complexity of warfare could escalate beyond human reach, to the point where we can no longer comprehend events or intervene. This
could be an irreversible step putting us at high risk of catastrophic outcomes.
Extortion
In this section, we examine one last risk that arises when agents interact repeatedly:
the discovery of extortion.
Extortion strategies rarely win tournaments but seldom die out alto-
gether. As we saw in Section 7.2.4, many uncooperative strategies may gain a
higher score than most of their partners in head-to-head matches, and yet still lose in
tournaments. By contrast, extortionists can be somewhat successful in tournaments
under certain conditions. Extortionists are vulnerable to the same problem as many
other uncooperative strategies: they gain low payoffs in matches against other extor-
tionists. Each will therefore perform less well as the frequency of extortionists in the
population increases. Thus, extortionists can persist if they are sufficiently unlikely
to meet one another. For instance, where a sufficiently small population of agents
is engaged in a tournament, a single extortionist can achieve very high payoffs by
exploiting cooperative strategies.
AI agents may use extortion: evidence from the real world. The
widespread use of extortion among humans outside the world of game theoretic
models suggests there is still a major cause for concern. Real-world extortion can
still yield results even when the target is perfectly aware that it is taking place. The
use of ransomware schemes to extort private individuals and companies is increasing
rapidly. In fact, cybersecurity experts estimate that the annual economic cost of ran-
somware activity is in the billions of US dollars. Terrorist organizations such as ISIS
rely on extortion through hostage-taking for a large portion of their total income.
The ubiquity of successful extortion in so many contexts sets a powerful historical
precedent for its efficacy.
Simulated torture at this scale could make the future more disvaluable
than valuable. Simulations designed for the purpose of extortion would likely
be far more disvaluable than simulations which contain disvalue unintentionally. The
simulation’s designer would likely be able to choose what kinds of objects to simulate,
so they could avoid wasting energy simulating non-sentient entities such as inanimate
objects. Moreover, the designer could ensure that these sentient entities experience
the greatest amount of suffering possible for the timespan of the simulation. They
might even be able to simulate minds capable of more disvaluable experiences than
have ever existed previously, deliberately designing the digital entities to be able to
suffer as greatly as possible. Put together, a simulation optimized for disvalue could
produce several orders of magnitude more disvalue than anything in history. This
would be unprecedented in humanity's history and could make for a horrifying—even net negative—future.
Both extortionists and their targets have some incentive to avoid the threat being executed. However, out of a desire to
signal credibility in future interactions, extortionists must follow through on threats
occasionally. Consider examples such as hostage ransoming or criminal syndicate
protection rackets. Successful future extortion requires a signal of commitment, such
as destroying the property of those who defy the extortionists.
AIs may carry out more frequent and more severe threats than humans tend to. One
reason for this is that they may have different value systems which tolerate higher
risks, reducing their motivation to acquiesce to extortion. For example, an AI agent that sufficiently values the far future may prefer to discourage future extortionists from trying to extort it. It may therefore defy a current extortion attempt, tolerating even very large costs to itself and others, for the long-term benefit of credibly signaling that future extortion attempts will not work either.
More generally, with a greater variety of value systems, a greater number of agents,
and a greater action space size, miscalibrated extortion attempts are more likely.
Where the threat is insufficient to force compliance, the aforementioned need to
signal credibility incentivizes the extortionist to execute their threat as punishment
for their target’s refusal to submit.
AI agents extorting humans. AI agents might also extort human targets. One
example scenario would be an AI developing both a weaponized biological pathogen and an effective cure. If the pathogen is slow-acting, the AI agent could then extort humans by deploying the bioweapon and leveraging the promise of its cure to force those infected into complying with its demands. Pathogens that spread quickly and are difficult to detect could infect a very large number of human targets, so this tactic could enable extremely large-scale threats to be wielded effectively [377].
Summary
The Iterated Prisoner’s Dilemma involves repeated rounds of the Prisoner’s Dilemma
game. This iteration offers a chance for agent cooperation but doesn’t ensure it.
There are different strategies by which agents can attempt to maximize their overall
payoffs. These strategies can be studied by pitting agents against one another in tournaments, where each agent competes against others in multiple rounds before
switching partners.
This provides cause for concern about a future with many AI agents. One example
of this is the phenomenon of “races” between AI stakeholders. These races strongly
influence the speed and direction of AI technological production, deployment and
adoption, in both corporate and military settings, and have the potential to exacerbate many of the intrinsic risks from AI. The dynamics we have explored in this section might cause competing corporations to cut corners on safety, and competing militaries to escalate weaponized AI applications and automate warfare. These are two examples of how competitive pressures, modeled as iterated interactions between agents, can generate races which increase the risk of catastrophe for everyone. Fostering cooperation between differ-
ent parties—human individuals, corporations, nations, and AI agents—is vital for
ensuring our collective safety.
7.2.5 Collective Action Problems
We began our exploration of game theory by looking at a very simple game, the
Prisoner’s Dilemma. We have so far considered two ways to model real-world social
scenarios in more detail. First, we explored what happens when two agents interact
multiple times (such as an Iterated Prisoner’s Dilemma match). Second, we intro-
duced a population of more than two agents, where each agent switches partners
over time (such as an Iterated Prisoner’s Dilemma tournament). Now we move be-
yond pairwise interactions, to interactions that simultaneously involve more than two
agents. We consider what happens when an agent engages in repeated rounds of the
Prisoner’s Dilemma against multiple opponents at the same time.
One class of scenarios that can be described by such a model is collective action
problems. In this section, we first discuss the core characteristics of collective action problems. Then, we introduce a series of real-world examples to highlight the ubiquity of these problems in human society and show how AI races can be modeled in this way. Following this, we turn to a brief discussion of common pool resource problems to further illustrate how difficult it can be for rational agents, especially AI agents, to secure collectively good outcomes. Finally, we conclude with a detailed
discussion of flash wars and autonomous economies to show how in a multi-agent
setting, AIs might pursue behaviors or tactics that result in catastrophic or existential
risks to humans.
Introduction
This first section explores the nature of collective action problems. We begin with
a simple example of a collaborative group project. Through this, we explore how
individual incentives can sometimes clash with what is in the best interests of the
group as a whole. These situations can motivate individuals to act in ways that negatively impact everyone in the group.
Each member of a collaborative group project, for example, may want the project to happen without costing them much personal effort. Just as with the Prisoner's
Dilemma, “slacking” is their dominant strategy. If the others work hard and the
project is completed, they get to enjoy the benefits of this success without expending
too much effort themselves. If the others fail to work hard and the project is not
completed, they at least save themselves the effort they might otherwise have wasted.
As groups increase in size and heterogeneity, complexity increases accordingly. Agents
in a population may have a diverse set of goals. Even if the population can agree on
a common goal, aligning diverse agents with this goal can be difficult. For example,
even when the public expresses strong and widespread support for a political measure,
their representatives often fail to carry it out.
Formalization
Here, we formalize our model of collective action problems. We look more closely at
the incentives governing individual choices, and the effects these have at the group
level. We examine how the behavior of others in the group can alter the incentives
facing any individual, and how we can (and do) use these mechanisms to promote
cooperative behavior in our societies.
Free riding is the dominant strategy. For now, let us assume that free riding
increases an agent’s own personal benefit, regardless of whether the others contribute
or free ride: it is the dominant strategy. If an agent’s contribution to the common
good is small, then choosing not to contribute does not significantly diminish the
collective good, meaning that an agent’s decision to free ride has essentially no neg-
ative consequences for the agent themself. Thus, the agent is choosing between two
outcomes. The first outcome is where they gain their portion of the collective benefit,
and pay the small cost of being a contributor. The other outcome is where they gain
this same benefit, but save themselves the cost of contributing.
Free riding can produce Pareto inefficient outcomes. Just as how both
agents defecting in the Prisoner’s Dilemma produces Pareto inefficiency, free riding
in a collective action problem can result in an outcome that is bad for all. In many
cases, some agents can free ride without imposing significant externalities on everyone
else. However, if sufficiently many agents free ride, the collective good is diminished, perhaps to the point where the public good is not provided at all. With sufficient losses, the
agents will all end up worse than if they had each paid the small individual cost
of contributing and received their share of the public benefit. Importantly, however,
even in this Pareto inefficient state, free riding might still be the dominant strategy
for each individual, since the cost of contributing outweighs the trivial increase in the collective good that their contribution would provide. Thus, escaping undesirable
equilibria in a collective action problem can be exceedingly difficult; see Figure 7.8.
Figure 7.8. In this abstract collective action problem, we can move from left (everyone contributes) to right (no one contributes). As more people free ride, the collective good dis-
appears, leaving everyone in a state where they would all prefer to collectively contribute
instead.
We can illustrate a collective action problem using the simple payoff matrix below. In
the matrix, "b" represents the payoff an agent receives when everyone else cooperates (the collective good divided among the agents) and "c" represents the
personal cost of cooperation. As the matrix illustrates, the dominant strategy for a
rational agent (“you”) here is to free ride whether everyone else contributes or free
rides.
TABLE 7.4 Free riding is always better for an individual: it is a dominant strategy. (Columns: the rest of the group contributes; the rest of the group free rides; some contribute, others free ride.)
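The dominant-strategy logic behind Table 7.4 can be checked with a short sketch. It assumes a simple sharing rule in which the collective good b is split evenly across the group and contributing costs c; the group size and numbers below are illustrative assumptions rather than values from the text.

```python
# Toy public goods game: each contributor pays cost c; everyone receives a
# share of the collective good proportional to how many contribute.
# Numbers are illustrative assumptions, not from the text.
def payoff(i_contribute: bool, others_contributing: int, group_size: int = 10,
           b: float = 5.0, c: float = 2.0) -> float:
    contributors = others_contributing + (1 if i_contribute else 0)
    collective_share = b * contributors / group_size  # your share of the good
    return collective_share - (c if i_contribute else 0.0)

for others in [9, 5, 0]:  # rest contribute, some contribute, none contribute
    print(f"others={others}: contribute={payoff(True, others):.2f}, "
          f"free ride={payoff(False, others):.2f}")
# Free riding yields a higher payoff in every case (by c minus your own marginal
# share, b/group_size), so it is the dominant strategy -- yet if everyone free
# rides, each agent gets 0 instead of the b - c = 3 from full contribution.
```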
If defectors dominate the population initially, and the initial individual costs
of cooperation outweigh the collective benefits of cooperation, then the population
may tend toward an uncooperative state. In simple terms, collective action problems
cannot be solved without cooperation.
Many large-scale societal issues can be understood as collective action problems. This
section explores collective action problems in the real world: climate change, public
health, and democratic voting. We end by briefly looking at AI races through this
same lens.
Public health. We can model some public health emergencies, such as disease
epidemics, as collective action problems. The COVID-19 pandemic took the lives
of millions worldwide. Some of these deaths could have been avoided with stricter
compliance with public health measures such as social distancing, frequent testing,
and vaccination. We can model those adhering to these measures as “contributing”
(by incurring a personal cost for public benefit) and those violating them as “free
riding.”
Assume that everyone wished the pandemic to be controlled and ultimately eradi-
cated, that complying with the suggested health measures would have helped hasten
this goal, and that the benefits of collectively shortening the pandemic timespan
would have outweighed the personal costs of compliance with these measures (such
as social isolation). Everyone would prefer the outcome where they all complied with
the health measures over the one where few of them did. Yet, each person would
prefer still better the outcome where everyone else adhered to the health measures,
and they alone were able to free ride. Violating the health measures was therefore
the dominant strategy, and so many people chose to do this, imposing the negative
externalities of excessive disease burden on the rest of their community.
We used both mutual and external mechanisms to coerce people to comply with
public health measures in the pandemic. For example, some communities adjusted
their social norms (mutual coercion) such that non-compliance with public health
measures would result in damage to one’s reputation. We also required proof of
vaccination for entry into desirable social spaces (external coercion), among many
other requirements.
Common pool resource problem. Rational agents are incentivized to take more
than a sustainable amount of a shared resource. This is called a common pool resource
problem or tragedy of the commons problem. We refer to a common pool resource be-
coming catastrophically depleted as collapse. Collapse occurs when rational agents,
driven by their incentive to maximize personal gain, tip the available supply of the
shared resource below its sustainability equilibrium [378]. Below, we further illustrate
how complicated it is to secure collectively good outcomes, especially when rational
agents act in accordance with their self-interest. Such problems are prevalent at the
societal level, and often carry catastrophic consequences. Thus, we should not rule out the possibility that they may also occur with AI agents in a multi-agent setting.
For example, rainforests around the world have been diminished greatly by deforesta-
tion practices. While these forests still exist as a home to millions of different species
and many local communities, they may reach a point at which they will no longer
be able to rejuvenate themselves. If these practices are sustained, the entire ecosys-
tem these forests support could collapse. Common pool resource problems exemplify
how agents may bring about catastrophes even when they behave rationally and in
their self-interest, with perfect knowledge of the looming catastrophe, and despite the
seeming ability to prevent it. They further illustrate how complicated it can be to
secure collectively good outcomes and how rational agents can act to the detriment of
their own group. As with many other collective action problems, we can’t expect to
solve common pool resource problems by having AIs manage them. If we simply pass
the buck to AI representatives, the AIs will inherit the same incentive structure that
produces the common pool resource problem, and so the problem will likely remain.
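A toy simulation can illustrate how individually rational harvesting tips a shared resource below its sustainability equilibrium. The logistic regeneration rule and all of the numbers below are illustrative assumptions, not a model taken from the text.

```python
# Toy common pool resource model (illustrative assumptions, not from the text):
# a resource regrows logistically each round, and each agent harvests either a
# restrained amount or a larger, individually "rational" amount.
def simulate(harvest_per_agent: float, n_agents: int = 10, stock: float = 100.0,
             growth_rate: float = 0.25, capacity: float = 100.0, rounds: int = 40):
    for t in range(rounds):
        stock += growth_rate * stock * (1 - stock / capacity)   # regeneration
        stock -= min(stock, n_agents * harvest_per_agent)       # total harvest
        if stock <= 1e-6:
            return f"collapsed at round {t}"
    return f"stock after {rounds} rounds: {stock:.1f}"

print("restrained harvest:", simulate(harvest_per_agent=0.5))  # stays near a stable level
print("greedy harvest:    ", simulate(harvest_per_agent=1.0))  # collapses within a few dozen rounds
```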
In the previous section, we looked at how corporations and militaries may compete
with one another in “AI races.” We used a two-player “attrition” bidding model to
see why AI companies cut corners on safety when developing and deploying their
technologies. We used another two-player “security dilemma” model to understand
how security concerns motivate nations to escalate their military capabilities, even
while increasing the risks imposed on all by increasingly automating warfare in this
manner.
Here, we extend our models of these races to consider more than two parties, allowing
us to see them as collective action problems. First, we look at how military AI arms
races increase the risk of catastrophic outcomes such as a flash war: a war that is
triggered by autonomous AI agents that quickly spirals out of human control [273].
Second, we explore how ever-increasing job automation could result in an autonomous
economy: an economy in which humans no longer have leverage or control.
Military AI arms race outcome: flash war. The security dilemma model we
explored in the previous section can be applied to more than two agents. In this
context, we can see it as a collective action problem. Though all nations would be at
lower risk if all were to cooperate with one another (“contribute” to their collective
safety), each will individually do better instead to escalate their own military capa-
bilities (“free ride” on the contributions of the other nations). Here, we explore one
potentially catastrophic outcome of this collective action problem: a flash war.
As we saw previously, military AI arms races motivate nations to automate mili-
tary procedures. In particular, there are strong incentives to integrate “automated
retaliation” protocols. Consider a scenario in which several nations have each constructed autonomous AI defense systems to gain a military advantage.
These AIs must be able to act on perceived threats without human intervention.
Additionally, each is aligned with a common goal: “defend our nation from attack.”
Even if these systems are nearly perfect, a single erroneous detection of a perceived
threat could trigger a decision cascade that launches the nation into a “flash war.”
Once one AI system hallucinates a threat and issues responses, the AIs of the nations
being targeted by these responses will follow suit, and the situation could escalate
rapidly. A flash war would be catastrophic for humanity, and might prove impossible
to recover from.
A flash war is triggered and amplified by successive interactions between autonomous
AI agents such that humans lose control of weapons of mass destruction [379]. Any
single military defense AI could trigger it, and the process could continue without
human intervention and at great speed. Importantly, having humans in the loop will
not necessarily ensure our safety. Even if AIs only provide human operators with
instructions to retaliate, our collective safety would rest on the chance that soldiers
would willfully disobey their instructions.
Collective action between nations could avoid these and other dire outcomes. Limit-
ing the capabilities of their military AIs by decreasing funding and halting or slowing
down development would require that each nation give up a potential military ad-
vantage. In a high stakes scenario such as this one, rational agents (nations) may be
unwilling to give up such an advantage because it dramatically increases the vulner-
ability of their nation to attack. The individual cost of cooperation is high while the
individual cost of defection is low, and as agents continue to invest in military capa-
bilities, competitive pressures increase, which further exacerbate costs of cooperation
—thereby disincentivizing collective action. While the collective benefits of coopera-
tion would drastically reduce the catastrophic risks of this scenario in the long-term,
they may not outweigh the self-interest of rational agents in the short-term.
Corporate AI race outcome: autonomous economy. When we expand this model to more than two agents, we can see it as a collective action
problem in which competitive pressures drive different parties to automate economic
functions out of the need to “keep up” with their competitors. Under this model, we
can see how companies must choose whether to maintain human labor (“contribut-
ing”) or automate these jobs using AI (“free riding”). Although all would prefer the
outcome in which the calamity of an autonomous economy is avoided, each would
individually prefer to have a competitive advantage and not risk being outperformed
by rivals who reap the short-term benefit of using AIs. Thus, economic automation
is the dominant strategy for each competitor. Repeated rounds of this game in which
a sufficient number of agents free ride would drive us toward this disaster. In each
successive round, it would become progressively more difficult to turn back, as we
come to rely increasingly on more capable AI agents.
7.2.6 Summary
Collective action problems can be modeled as Iterated Prisoner's Dilemmas with many more than two agents interacting simulta-
neously in each round of the game. As before, we see that “free riding” can be the
dominant strategy for an individual agent, and this can lead to Pareto inefficient out-
comes for the group as a whole. We can use the mechanisms of mutual and external
coercion to incentivize agents to cooperate with each other and achieve collectively
good outcomes.
If we expand our models of AI races to include more than two agents, we can un-
derstand the races themselves as collective action problems, and examine how they
exacerbate the risk of catastrophe. One example is how increasingly automating mil-
itary protocols increases the risk of a “flash war.” Similar dynamics of automation in
the economic sphere could lead to an “autonomous economy.” Either outcome would
be disastrous and potentially irreversible, yet we can see how competitive pressures
can drive rational and self-interested agents (such as nations or companies) down a
path toward these calamities.
In this section, we examined some simple, formal models of how rational agents may
interact with each other under varying conditions. We used these game theoretic mod-
els to understand the natural dynamics in multi-agent biological and social systems.
We explored how these multi-agent dynamics can generate undesirable outcomes
for all those involved. We considered some tail risks posed by interactions between
human and AI agents. These included human-directed companies and militaries en-
gaging in perilous races, as well as autonomous AIs using threats for extortion.
These risks can be reduced if mechanisms such as institutions are used to ensure
human agencies and AI agents are able to cooperate with one another and avoid
conflict. We explore some means of achieving cooperative interactions in the next
section of this chapter, 7.3.
7.3 COOPERATION
Overview
In this chapter, we have been exploring the risks that arise from interactions between
multiple agents. So far, we have used game theory to understand how collective
behavior can produce undesirable outcomes. In simple terms, securing morally good
outcomes without cooperation can be extremely difficult, even for intelligent rational
agents. Consequently, the importance of cooperation has emerged as a strong theme
in this chapter. In this third section of this chapter, we begin by using evolutionary
theory to examine cooperation in more detail.
We observe many forms of cooperation in biological systems: social insect colonies,
pack hunting, symbiotic relationships, and much more. Humans perform community
services, negotiate international peace agreements, and coordinate aid for disaster
responses. Our very societies are built around cooperation.
Cooperation between AI stakeholders. Mechanisms that can enable coop-
eration between the corporations developing AI and other stakeholders such as
governments may be vital for counteracting the competitive and evolutionary pres-
sures of AI races we have explored in this chapter. For example, the “merge-and-
assist” clause of OpenAI’s charter [381] outlines their commitment to cease com-
petition with—and provide assistance to—any “value-aligned, safety-conscious” AI
developer who appears close to producing AGI, in order to reduce the risk of eroding
safety precautions.
Cooperation between AI agents. Many also suggest that we must ensure the
AI systems themselves also act cooperatively with one another. Certainly, we do want
AIs to cooperate, rather than to defect, in Prisoner’s Dilemma scenarios. However,
this may not be a total solution to the collective action problems we have examined
in this chapter. By more closely examining how cooperative relationships can come
about, it is possible to see how making AIs more cooperative may backfire with serious
consequences for AI safety. Instead, we need a more nuanced view of the potential
benefits and risks of promoting cooperation between AIs. To do this, we study five
different mechanisms by which cooperation may arise in multi-agent systems [382],
considering the ramifications of each:
• Direct reciprocity: when individuals are likely to encounter others in the future,
they are more likely to cooperate with them.
• Indirect reciprocity: when it benefits an individual’s reputation to cooperate with
others, they are more likely to do so.
• Group selection: when there is competition between groups, cooperative groups
may outcompete non-cooperative groups.
• Kin selection: when an individual is closely related to others, they are more likely
to cooperate with them.
• Institutional mechanisms: when there are externally imposed incentives (such as
laws) that subsidize cooperation and punish defection, individuals and groups are
more likely to cooperate.
Direct Reciprocity
Direct reciprocity overview. One way agents may cooperate is through direct
reciprocity: when one agent performs a favor for another because they expect the
recipient to return this favor in the future [383]. We capture this core idea in idioms
like “quid pro quo,” or “you scratch my back, I’ll scratch yours.” Direct reciprocity re-
quires repeated interaction between the agents: the more likely they are to meet again
in the future, the greater the incentive for them to cooperate in the present. We have
already encountered this in the iterated Prisoner’s Dilemma: how an agent behaves
in a present interaction can influence the behavior of others in future interactions.
Game theorists sometimes refer to this phenomenon as the “shadow of the future.”
When individuals know that future cooperation is valuable, they have increased in-
centives to behave in ways that benefit both themselves and others, fostering trust,
reciprocity, and cooperation over time. Cooperation can only evolve as a consequence
of direct reciprocity when the probability, w, of subsequent encounters between the
same two individuals is greater than the cost-benefit ratio of the helpful act. In other
words, if agent A decides to help agent B at some cost c to themselves, they will only
do so when the expected benefit b of agent B returning the favor outweighs the cost
of agent A initially providing it. Thus, we have the rule w > c/b; see Table 7.5.
TABLE 7.5 Payoff matrix for direct reciprocity (row player's payoffs), where w is the probability of a further encounter.
              Cooperate          Defect
Cooperate     (b − c)/(1 − w)    −c
Defect        b                  0
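A few lines of code can check the w > c/b rule against the payoffs in Table 7.5; the particular values of b, c, and w used here are illustrative assumptions.

```python
# Check the direct reciprocity rule w > c/b against the Table 7.5 payoffs:
# a reciprocator facing a reciprocator earns (b - c)/(1 - w), while a defector
# facing a reciprocator earns only b (it is never helped again).
def cooperation_pays(b: float, c: float, w: float) -> bool:
    payoff_cooperate = (b - c) / (1 - w)   # help every round, get helped back
    payoff_defect = b                      # exploit once, then nothing
    return payoff_cooperate > payoff_defect

b, c = 3.0, 1.0                 # illustrative benefit and cost of a favor
for w in [0.2, 0.5, 0.9]:       # probability of meeting again
    print(f"w={w:.2f}: cooperation pays? {cooperation_pays(b, c, w)}  "
          f"(rule: w > c/b = {c/b:.2f})")
```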
Natural examples of direct reciprocity. Trees and fungi have evolved sym-
biotic relationships where they exchange sugars and nutrients for mutual benefit.
Dolphins use cooperative hunting strategies where one dolphin herds schools of fish
while the others form barriers to encircle them. This division of roles is sustained by the expectation that other dolphins in the group will reciprocate during subsequent hunts. Similarly, chimpanzees engage in reciprocal groom-
ing, where they exchange grooming services with one another with the expectation
that they will be returned during a later session [384].
Direct reciprocity in human society. Among humans, one prominent example of di-
rect reciprocity is commerce. Commerce is a form of direct reciprocity “which offers
positive-sum benefits for both parties and gives each a selfish stake in the well-being
of the other” [385]; commerce can be a win-win scenario for all parties involved. For
instance, if Alice produces wine and Bob produces cheese, but neither Alice nor Bob
has the resources to produce what the other can, both may realize they are better off
trading. Different parties might both need the good the other has when they can’t
produce it themselves, so it is mutually beneficial for them to trade, especially when
they know they will encounter each other again in the future. If Alice and Bob both
rely on each other for wine and cheese respectively, then they will naturally seek
to prevent harm to one another because it is in their rational best interest. To this
point, commerce can foster complex interdependencies between economies, which en-
hances the benefits gained through mutual exchange while decreasing the probability
of conflict or war.
Direct reciprocity and AIs. The future may contain multiple AI agents, many
of which might interact with one another to achieve different functions in human
society. Such AI agents may automate parts of our economy and infrastructures, take
over mundane and time-consuming tasks, or provide humans and other AIs with daily
assistance. In a system with multiple AI agents, where the probability that individual
AIs would meet again is high, AIs might evolve cooperative behaviors through direct
reciprocity. If one AI in this system has access to important resources that other
AIs need to meet their objectives, it may decide to share these resources accordingly.
However, since providing this favor would be costly to the given AI, it will do so only when the probability of meeting the recipient AIs again exceeds the cost-benefit ratio of the favor itself (w > c/b).
Direct reciprocity can backfire: AIs may disfavor cooperation with hu-
mans. AIs may favor cooperation with other AIs over humans. As AIs become
substantially more capable and efficient than humans, the benefit of interacting with
humans may decrease. It may take a human several hours to reciprocate a favor pro-
vided by an AI, whereas it may take an AI only seconds to do so. It may therefore
become extremely difficult to formulate exchanges between AIs and humans that
benefit AIs more than exchanges with other AIs would. In other words, from an AI
perspective, the cost-benefit ratio for cooperation with humans is not worth it.
Indirect Reciprocity
Natural examples of indirect reciprocity. Cleaner fish (fish that feed on par-
asites or mucus on the bodies of other fish) can either cooperate with client fish (fish
that receive the “services” of cleaner fish) by feeding on parasites that live on their
bodies, or cheat, by feeding on the mucus that client fish excrete [387]. Client fish tend
to cooperate more frequently with cleaner fish that have a “good reputation,” which
are those that feed on parasites rather than mucus. Similarly, while vampire bats are
known to share food with their kin, they also share food with unrelated members of
their group. Vampire bats more readily share food with unrelated bats when they
know the recipients of food sharing also have a reputation for being consistent and
reliable food donors [388].
Indirect reciprocity in human society. Language provides a way to obtain
information about others without ever having interacted with them, allowing humans
to adjust reputations accordingly and facilitate conditional cooperation. Consider
sites like Yelp and TripAdvisor, which allow internet users to gauge the reputations
of businesses through reviews provided by other consumers. Similarly, gossip is a
complex universal human trait that plays an important role in indirect reciprocity.
Through gossip, individuals reveal the nature of their past interactions with others
as well as exchanges they observe between others but are not a part of. Gossip allows
us to track each others’ reputations and enforce cooperative social norms, reducing
the probability that cooperative efforts are exploited by others with reputations for
dishonesty [389].
Indirect reciprocity in AIs. AIs could develop a reputation system where they
observe and evaluate each others’ behaviors, with each accumulating a reputation
score based on their cooperative actions. AIs with higher reputation scores may be
more likely to receive assistance and cooperation from others, thereby developing
a reputation for reliability. Moreover, sharing insights and knowledge with reliable
partners may establish a network of cooperative AIs, promoting future reciprocation.
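The kind of reputation system described above can be sketched in a few lines. The scoring rule, threshold, and agent labels below are illustrative assumptions rather than a design taken from the text.

```python
# Minimal sketch of indirect reciprocity via reputation scores; the scoring
# rule, threshold, and agent names are illustrative assumptions.
import random

class Agent:
    def __init__(self, name: str, threshold: float = 0.0):
        self.name = name
        self.reputation = 0          # running reputation score
        self.threshold = threshold   # minimum partner reputation to help

    def will_help(self, partner: "Agent") -> bool:
        return partner.reputation >= self.threshold

random.seed(0)
agents = [Agent(f"AI-{i}") for i in range(5)]
agents.append(Agent("AI-never-helps", threshold=float("inf")))

for _ in range(300):
    helper, partner = random.sample(agents, 2)
    if helper.will_help(partner):
        helper.reputation += 1   # observed cooperation improves reputation
    else:
        helper.reputation -= 1   # observed refusal damages it

for agent in agents:
    print(agent.name, agent.reputation)
# The agent that never helps ends with a strongly negative score, so the others
# (whose threshold is 0) stop helping it in turn -- though refusing also dents
# their own scores, a known subtlety of reputation-based cooperation.
```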
Indirect reciprocity can backfire: extortionists can threaten reputational
damage. The pressure to maintain a good reputation can make agents vulnerable
to extortion. Other agents may be able to leverage the fear of reputational harm to
extract benefits or force compliance. For example, political smear campaigns manip-
ulate public opinion by spreading false information or damaging rumors about op-
ponents. Similarly, blackmail often involves leveraging damaging information about
others to extort benefits. AIs may manipulate or extort humans in order to better
pursue their objectives. For instance, an AI might threaten to expose the sensitive,
personal information it has accessed about a human target unless specific demands
are met.
Indirect reciprocity can backfire: ruthless reputations may also work. In-
direct reciprocity may not always favor cooperative behavior: it can also promote the
emergence of “ruthless” reputations. A reputation for ruthlessness can sometimes be
extremely successful in motivating compliance through fear. For instance, in military
contexts, projecting a reputation for ruthlessness may deter potential adversaries or
enemies. If others perceive an individual or group as willing to employ extreme mea-
sures without hesitation, they may be less likely to challenge or provoke them. Some
AIs might similarly evolve ruthless reputations, perhaps as a defensive strategy to
discourage potential attempts at exploitation, or control by others.
Group Selection
Kin Selection
Kin selection overview. When driven by kin selection, agents are more likely to
cooperate with others with whom they share a higher degree of genetic relatedness
[391]. The more closely related agents are, the more inclined to cooperate they will
be. Thus, kin selection favors cooperation under the following conditions: an agent
will help their relative only when the benefit to their relative “b,” multiplied by the
relatedness between the two “r,” outweighs the cost to the agent “c.” This is known
as Hamilton’s rule: rb > c, or equivalently r > c/b [391]; see Table 7.8.
TABLE 7.8 Payoff matrix for kin selection (strategies: Cooperate, Defect).
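Hamilton's rule is easy to check numerically. The relatedness coefficients below are the standard values for the named relationships, while the benefit and cost figures are illustrative assumptions.

```python
# Hamilton's rule: helping kin is favored when r * b > c.
def helping_favored(r: float, b: float, c: float) -> bool:
    return r * b > c

b, c = 4.0, 1.5   # illustrative benefit to the relative and cost to the helper
for relative, r in [("full sibling", 0.5), ("first cousin", 0.125),
                    ("unrelated", 0.0)]:
    print(f"{relative:12s} (r={r}): favored? {helping_favored(r, b, c)}")
# Full sibling: 0.5 * 4 = 2 > 1.5 -> favored; cousin: 0.5 < 1.5 -> not favored.
```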
Natural examples of kin selection. In social insect colonies, such as bees and
ants, colony members are closely related. Such insects often assist their kin in rais-
ing and producing offspring while “workers” relinquish their reproductive potential,
devoting their lives to foraging and other means required to sustain the colony as a
whole. Similarly, naked mole rats live in colonies with a single reproductive queen
and non-reproductive workers. The workers are sterile but still assist in tasks such as
foraging, nest building, and protecting the colony. This behavior benefits the queen’s
offspring, which are their siblings, and enhances the colony’s overall survival capabil-
ities. As another example, some bird species engage in cooperative breeding practices
where older offspring delay breeding to help parents raise their siblings.
Kin selection in AIs. AIs that are similar could exhibit cooperative tendencies
toward each other, similar to genetic relatedness in biological systems. For instance,
AIs may create back-ups or variants of themselves. They may then favor coopera-
tion with these versions of themselves over other AIs or humans. Variant AIs may
prioritize resource allocation and sharing among themselves, developing preferential
mechanisms for sharing computational resources with other versions of themselves.
Kin selection can backfire: nepotism. Kin selection can lead to nepotism:
prioritizing the interests of relatives above others. For instance, some bird species
exhibit differential feeding and provisioning. When chicks hatch asynchronously, par-
ents may allocate more resources to those that are older, and therefore more likely
to be their genetic offspring, since smaller chicks are more likely to be the result of
brood parasitism (when birds lay their eggs in other birds’ nests). In humans, too,
we often encounter nepotism. Company executives may hire their sons or daughters,
even though they lack the experience required for the role, which can harm companies
and their employees in the long-run. Similarly, parents often protect their children
from the law, especially when they have committed serious criminal acts that can
result in extended jail time. Such tendencies could apply to AIs as well: AIs might
favor cooperation only with other similar AIs. This could be especially troubling for
humans: as the differences between humans and AIs increase, AIs may be increasingly
less inclined to cooperate with humans.
Kinship. Natural selection can favor agents who cooperate with their genetic
relatives. This is because there may be copies of these agents’ genes in their rel-
atives’ genomes, and so helping them may further propagate their own genes.
We call this mechanism “kin selection” [391]: an agent can gain a fitness advan-
tage by treating their genetic relatives preferentially, so long as the cost-benefit
ratio of helping is less than the relatedness between the agent and their kin.
Similarly, repeated inbreeding can reduce an agent’s fitness by increasing the
probability of producing offspring with both copies of any recessive, deleterious
alleles in the parents’ genomes [393].
Morality-as-cooperation (MAC) theory proposes that the solutions to this cooperation problem (preferen-
tially helping genetic relatives), such as kin selection and inbreeding avoidance,
underpin several major moral ideas and customs. Evidence for this includes
the fact that human societies are usually built around family units [394], in
which “family values” are generally considered highly moral. Loyalty to one’s
close relatives and duties to one’s offspring are ubiquitous moral values across
human cultures [395]. Our laws regarding inheritance [396] and our naming
traditions [397] similarly reflect these moral intuitions, as do our rules and
social taboos against incest [398, 399].
Mutualism. In game theory, some games are “positive sum” and “win-win”:
the agents involved can increase the total available value by interacting with
one another in particular ways, and all the agents can then benefit from this
additional value. Sometimes, securing these mutual benefits requires that the
agents coordinate their behavior with each other. To solve this cooperation
problem, agents may form alliances and coalitions [400]. This may require the
capacity for basic communication, rule-following [401], and perhaps theory-of-
mind [402].
MAC theory proposes that these cooperative mechanisms comprise important
components of human morality. Examples include the formation of—and loy-
alty to—friendships, commitments to collaborative activities, and a certain
degree of in-group favoritism and conformity to local conventions. Similarly,
we often consider the agent’s intentions when judging the morality of their
actions, which requires a certain degree of theory-of-mind.
Exchange. Sometimes, benefiting from “win-win” situations requires more
than mere coordination. If the payoffs are structured so as to incentivize “free
riding” behaviors, the cooperation problem becomes how to ensure that others
will reciprocate help and contribute to group efforts. To solve this problem,
agents can enforce cooperation via systems of reward, punishment, policing,
and reciprocity [403]. Direct reciprocity concerns doing someone a favor out
of the expectation that they will reciprocate at a later date [383]. Indirect
reciprocity concerns doing someone a favor to boost your reputation in the
group, out of the expectation that this will increase the probability of a third
party helping you in the future [386].
Once again, MAC theory proposes that these mechanisms are found in our
moral systems. Moral ideas such as trust, gratitude, patience, guilt, and for-
giveness can all help to assure against free riding behaviors. Likewise, pun-
ishment and revenge, both ideas with strong moral dimensions, can serve to
enforce cooperation more assertively. Idioms such as “an eye for an eye,” or
the “Golden Rule” of treating others as we would like to be treated ourselves,
reflect the solutions we evolved to this cooperation problem.
Conflict resolution. Conflict is very often “negative sum”: the interaction
of the agents themselves can destroy some amount of the total value available.
Examples span from the wounds of rutting deer to the casualties of human
wars. If the agents instead manage to cooperate with each other, they may
both be able to benefit—a “win-win” outcome. One way to resolve conflict
situations is division [404]: dividing up the value between the agents, such as
through striking a bargain. Another solution is to respect prior ownership,
deferring to the original “owner” of the valuable item [405].
According to MAC theory, we can see both of these solutions in our ideas
of morality. The cross-culturally ubiquitous notions of fairness, equality, and
compromise help us resolve conflict by promoting the division of value between
competitors [406]. We see this in ideas such as “taking turns” and “I cut, you
choose” [407]: mechanisms for turning a negative sum situation (conflict) into
a zero sum one (negotiation), to mutual benefit. Likewise, condemnation of
theft and respect for others’ property are extremely important and common
moral values [395, 408]. This set of moral rules may stem from the conflict
resolution mechanism of deferring to prior ownership.
Conclusion. MAC theory argues that morality is composed of biological and
cultural solutions humans evolved to the most salient cooperation problems of
our ancestral social environment. Here, we explored four examples of coopera-
tion problems, and how the solutions to them discovered by natural selection
may have produced our moral values.
Institutions
Institutions overview. Agents are more likely to be cooperative when there are
laws or externally imposed incentives that reward cooperation and punish defection.
In contexts where aggression would otherwise pay, we can shift the interests of agents in favor of peace by introducing a Leviathan, in the form of a third-party peacekeeping or balancing mis-
sion, which establishes an authoritative presence that maintains order and prevents
conflict escalation. Peacekeeping missions can take several forms, but they often in-
volve the deployment of peacekeeping forces such as military, police, and civilian
personnel. These forces work to deter potential aggressors, enhance security, and set
the stage for peaceful resolutions and negotiations as impartial mediators, usually by
penalizing aggression and rewarding pacifism; see Table 7.11.
TABLE 7.11 Payoff matrix for the Pacifist's Dilemma with a Leviathan [385] (strategies: Pacifist, Aggressor).
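The sketch below uses illustrative payoffs (assumptions, not the values from the cited table) to show the general mechanism: a Leviathan that penalizes aggression can make pacifism the better reply to either move.

```python
# Illustrative Pacifist's Dilemma payoffs (keys: (my move, their move)).
# Without a Leviathan, aggression is the dominant strategy; with a third-party
# penalty on aggression, pacifism becomes the better reply. Numbers are
# assumptions for illustration, not the values from the cited table.
base = {("pacifist", "pacifist"): 5,   ("pacifist", "aggressor"): -10,
        ("aggressor", "pacifist"): 10, ("aggressor", "aggressor"): -5}
penalty = 8  # Leviathan's punishment applied to any aggressor

def best_reply(their_move: str, leviathan: bool) -> str:
    def score(my_move: str) -> float:
        cost = penalty if (leviathan and my_move == "aggressor") else 0
        return base[(my_move, their_move)] - cost
    return max(["pacifist", "aggressor"], key=score)

for leviathan in (False, True):
    replies = {m: best_reply(m, leviathan) for m in ("pacifist", "aggressor")}
    print(f"Leviathan={leviathan}: best replies -> {replies}")
# Without the Leviathan, aggression is the best reply to either move; with it,
# pacifism is the best reply to either move.
```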
Institutions can incentivize cooperation at the international scale; when nations break treaties, other nations may punish them
by refusing to cooperate, such as by cutting off trade routes or imposing sanctions
and tariffs. On the other hand, when nations readily adhere to treaties, other na-
tions may reward them, such as by fostering trade or providing foreign aid. Similarly,
institutions can incentivize cooperation at the national scale by creating laws and
regulations that reward cooperative behaviors and punish non-cooperative ones. For
example, many nations attempt to prevent criminal behavior by leveraging the threat
of extended jail-time as a legal deterrent to crime. On the other hand, some nations
incentivize cooperative behaviors through tax breaks, such as those afforded to citi-
zens that make philanthropic donations or use renewable energy resources like solar
power.
Institutions are crucial in the context of international AI development. By estab-
lishing laws and regulations concerning AI development, institutions may be able to
reduce AI races, lowering competitive pressures and the probability that countries cut
corners on safety. Moreover, international agreements on AI development may serve
to hold nations accountable; institutions could play a central role in helping us broker
these kinds of agreements. Ultimately, institutions could improve coordination mech-
anisms and international standards for AI development, which would correspondingly
improve AI safety.
Institutions and AI. In the future, institutions may be established for AI agents,
such as platforms for them to communicate and coordinate with each other au-
tonomously. These institutions may be operated and governed by the AIs themselves
without much human oversight. Humanity alone may not possess the power required
to combat advanced dominance-seeking AIs, and existing laws and regulations may
be insufficient if there is no way to enforce them. An AI Leviathan of some form
could help regulate other AIs and influence their evolution, counteracting or domesticating selfish AIs.
7.3.1 Summary
Throughout this section, we discussed a variety of mechanisms that may promote
cooperative behavior by AI systems or other entities. These mechanisms were direct
reciprocity, indirect reciprocity, group selection, kin selection, and institutions.
Direct reciprocity may motivate AI agents in a multi-agent setting to cooperate with
each other, if the probability that the same two AIs meet again is sufficiently high.
However, AIs may disfavor cooperation with humans as they become progressively
more advanced: the cost-benefit ratio for cooperation with humans may simply be
bad from an AI’s perspective.
Indirect reciprocity may promote cooperation in AIs that develop a reputation system
where they observe and score each others’ behaviors. AIs with higher reputation scores
may be more likely to receive assistance and cooperation from others. Still, this does
not guarantee that AIs will be cooperative: AIs might leverage the fear of reputational
harm to extort benefits from others, or themselves develop ruthless reputations to
inspire cooperation through fear.
Group selection—in a future where labor has been automated such that AIs now run
the majority of companies—could promote cooperation on a multi-agent scale. AIs
may form corporate coalitions with other AIs to protect their interests; AI groups with
a cooperative AI minority may be outcompeted by AI groups with a cooperative AI
majority. Under such conditions, however, AIs may learn to favor in-group members
and antagonize out-group members, in order to maintain group solidarity. AIs may
be more likely to see other AIs as part of their group, and this could lead to conflict
between AIs and humans.
AIs may create variants of themselves, and the forces of kin selection may drive
these related variants to cooperate with each other. However, this could also give rise
to nepotism, where AIs prioritize the interests of their variants over other AIs and
humans. As the differences between humans and AIs increase, AIs may be increasingly
less inclined to cooperate with humans.
Institutions can incentivize cooperation through externally imposed incentives that
enforce cooperation and punish defection [409]. This concept relates to the idea of an
AI Leviathan, used to counteract selfish, powerful AIs. However, humanity should take
care to ensure their relationship with the AI Leviathan is symbiotic and transparent,
otherwise we risk losing control of AIs.
In our discussion of these mechanisms, we not only illustrated their prevalence in
our world, but also showed how they might influence cooperation with and between AIs.
7.4 CONFLICT
7.4.1 Overview
In this chapter, we have been exploring the risks generated or exacerbated by the
interactions of multiple agents, both human and AI. In the previous section, we
explored a variety of mechanisms by which agents can achieve stable cooperation. In
this section we address how, despite the fact that cooperation can be so beneficial to
all involved, a group of agents may instead enter a state of conflict. To do this, we
discuss bargaining theory, commitment problems, and information problems, using
theories and examples relevant both for conflict between nation-states and potentially
also between future AI systems.
Here, we use the term “conflict” loosely, to describe the decision to defect rather
than cooperate in a competitive situation. This often, though not always, involves
some form of violence, and destroys some amount of value. Conflict is common in
nature. Organisms engage in conflict to maintain social dominance hierarchies, to
hunt, and to defend territory. Throughout human history, wars have been common,
often occurring as a consequence of power-seeking behavior, such as attempts at aggressive territorial expansion or resource acquisition. Another lens
on relations between power-seeking states and other entities is provided by the theory
of structural realism discussed in Single-Agent Safety. Our goal here is to uncover how,
despite being costly, conflict can sometimes be a rational choice nevertheless.
Conflict can take place between a wide variety of entities, from microorganisms to
nation-states. It can be sparked by many different factors, such as resource compe-
tition and territorial disputes. Despite this variability, there are some general frame-
works which we can use to analyze conflict across many different situations. In this
section, we look at how some of these frameworks might be used to model conflict
involving AI agents.
We begin our discussion of conflict with concepts in bargaining theory. We then ex-
amine some specific features of competitive situations that make it harder to reach
negotiated agreements or avoid confrontation. We consider five factors from bargaining theory that can influence the potential for conflict. These can be divided into
the following two groups:
• Power shifts: when there are imbalances between agents’ capabilities such that one
agent becomes stronger than the other, conflict is more likely to emerge between
them.
• First-strike advantages: when one agent possesses the element of surprise, the abil-
ity to choose where conflict takes place, or the ability to quickly defeat their op-
ponent, the probability of conflict increases.
• Issue indivisibility: agents cannot always divide a good however they please—some
goods are “all or nothing” and this increases the probability of conflict between
agents.
(1) Suppose a customer sues a business owner for $40,000 in damages; the lawsuit succeeds with probability 0.6, and going to court would cost each side $10,000 in legal fees.
(2) If both go to court, the owner's expected payoff is the product of the payment to the customer and the probability that the lawsuit is successful, minus legal fees. In this case, the owner's expected payoff would be (−40,000 × 0.6) − 10,000 while the customer's expected payoff would be (40,000 × 0.6) − 10,000. As a result, the owner expects to lose $34,000 and the customer expects to gain $14,000.
(3) An out-of-court settlement x where 14,000 < x < 34,000 would enable the cus-
tomer to get a higher payoff and the owner to pay lower costs. Therefore, a mutual
settlement is the best option for both if x is in this range.
Hence, if the proposed out-of-court settlement would be greater than $34,000, it would
make sense for the owner to opt for conflict rather than bargaining. Similarly, if the
proposed settlement were less than $14,000, it would be rational for the customer to
opt for conflict.
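To make the arithmetic concrete, here is a minimal sketch in Python using only the hypothetical figures from the example above; the variable names are ours.

# A minimal sketch of the lawsuit example above; all figures are the
# hypothetical values from the text.
damages = 40_000      # amount the customer sues for
p_win = 0.6           # probability the customer's lawsuit succeeds
legal_fees = 10_000   # legal fees each side pays if they go to court

# Expected payoffs if both parties go to court.
owner_payoff = -damages * p_win - legal_fees     # -34,000
customer_payoff = damages * p_win - legal_fees   # +14,000

# A settlement x is acceptable to the customer if x exceeds their expected
# payoff from court, and to the owner if paying x costs less than court.
bargaining_range = (customer_payoff, -owner_payoff)
print(owner_payoff, customer_payoff)  # -34000.0 14000.0
print(bargaining_range)               # (14000.0, 34000.0)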
AIs and large-scale conflicts. Several of the examples we consider in this sec-
tion are large-scale conflicts such as interstate war. If the use of AI were to increase the
likelihood or severity of such conflicts, it could have a devastating effect. AIs have
the potential to accelerate our wartime capabilities, from augmenting intelligence
gathering and weaponizing information such as deep fakes to dramatically improv-
ing the capabilities of lethal autonomous weapons and cyberattacks [410]. If these
use-cases and other capabilities become prevalent and powerful, AI will change the
nature of conflict. If armies are eventually composed of mainly automated weapons
rather than humans, the barrier to violence might be much lower for politicians who
will face reduced public backlash against lives lost, making conflicts between states
(with automated armies) more commonplace. Such changes to the nature and sever-
ity of war are important possibilities with significant ramifications. In this section,
we focus on analyzing the decision to enter a conflict, continuing to focus on how
rational, intelligent agents acting in their own self-interest can collectively produce
outcomes that none of them wants. To do this, we ground our discussion of conflict
in bargaining theory, highlighting some ways in which AI might increase the odds
that states or other entities decide to start a conflict.
Here, we begin with a general overview of bargaining theory, to illustrate how pres-
sures to outcompete rivals or preserve power and resources may make conflict an
instrumentally rational choice. Next, we turn to the unitary actor assumption, high-
lighting that when agents view their rivals as unitary actors, they assume that those rivals will act coherently, taking whatever steps are necessary to maximize their welfare.
Following this, we discuss the notion of commitment problems, which occur when
agents cannot reliably commit to an agreement or have incentives to break it. Com-
mitment problems increase the probability of conflict and are motivated by specific
factors, such as power shifts, first-strike advantages, and issue indivisibility. We then
explore how information problems and inequality can also increase the probability of
conflict.
Bargaining theory. When agents compete for something they both value, they
may either negotiate to reach an agreement peacefully, or resort to more forceful al-
ternatives such as violence. We call the latter outcome “conflict,” and can view this
as the decision to defect rather than cooperate. Unlike peaceful bargaining, conflict
is fundamentally costly for winners and losers alike. However, it may sometimes be
the rational choice. Bargaining theory describes why rational agents may be unable
to reach a peaceful agreement, and instead end up engaging in violent conflict. Due
to pressures to outcompete rivals or preserve their power and resources, agents some-
times prefer conflict, especially when they cannot reliably predict the outcomes of
conflict scenarios. When rational agents assume that potential rivals have the same
mindset, the probability of conflict increases.
The bargaining range. Whether or not agents are likely to reach a peaceful
agreement through negotiation will be influenced by whether their bargaining ranges
overlap. The bargaining range represents the set of possible outcomes that both agents
involved in a competition find acceptable through negotiation. Recall the lawsuit
example: a bargaining settlement “x” is only acceptable if it falls between $14,000
and $34,000. Any settlement “x” below $14,000 will be rejected by the customer
while any settlement “x” above $34,000 will be rejected by the store owner. Thus, the
bargaining range is often depicted as a spectrum with the lowest acceptable outcome
for one party at one end and the highest acceptable outcome for the other party
at the opposite end. Within this range, there is room for negotiation and potential
agreements.
Conflict and AI agents. Let us assume that AI agents will act rationally in the
pursuit of their goals (so, at the least, we model them as unitary actors or as having
unity of purpose). In the process of pursuing and fulfilling their goals, AI agents may
encounter potential conflict scenarios, just as humans do. In certain scenarios, AIs
may be motivated to pursue violent conflict over a peaceful resolution, for the reasons
we now explore.
Many conflicts occur over resources, which are key to an agent’s power. Consider a
bargaining failure in which two agents bargain over resources in an effort to avoid war.
If agents were to acquire these resources, they could invest them into military power.
Figure 7.9. A) An axis of expected value distribution between two competitors. “B” indicates
the expected outcome of conflict: how likely each competitor is to win, multiplied by the
value they gain by winning. The more positive B is (the further toward the right), the better
for Black, and the worse for Grey. B) Conflict is negative-sum: it destroys some value, and
so reduces each competitor’s expected value. C) Bargaining is zero-sum: all the value is
distributed between the competitors. This means there are possible bargains that offer both
competitors greater expected value than conflict.
As a result, neither can credibly commit to use them only for peaceful purposes.
This is one instance of a commitment problem [411], which is when agents cannot
reliably commit to an agreement, or when they may even have incentives to break
an agreement. Commitment problems are closely related to the security dilemma,
which we discussed in Section 7.2.4. Commitment problems are usually motivated by
specific factors, such as power shifts, first-strike advantages, and issue indivisibility,
which may make conflict a rational choice. It is important to note that our discussion
of these commitment problems assumes anarchy: we take for granted that contracts
are not enforceable in the absence of a higher governing authority.
Power Shifts
Power shifts overview. When there are imbalances between parties’ capabili-
ties such that one party becomes stronger than the other, power shifts can occur.
Such imbalances can arise as a consequence of several factors including technological
and economic advancements, increases in military capabilities, as well as changes in
governance, political ideology, and demographics. If one party has access to AIs and
the other does not, an improvement in AI capabilities can precipitate a power shift.
Such situations are plausible: richer countries today may gain more from AI because
they have more resources to invest in scaling their AI’s performance. Parties may
initially be able to avoid violent conflict by arriving at a peaceful and mutually ben-
eficial settlement with their rivals. However, if one party’s power increases after this settlement has been made, they may come to benefit disproportionately from it, making the original terms seem unfair in hindsight. Thus, we encounter the following commitment
problem: the rising power cannot commit not to exploit its advantage in the future,
incentivizing the declining power to opt for conflict in the present.
Example: The US vs China. China has been investing heavily in its military.
This has included the acquisition or expansion of its capabilities in technologies such as nuclear and hypersonic missiles, as well as drones. The future is uncertain, but
if this trend continues, it could increase the risk of conflict. If China were to gain
a military advantage over the US, this could shift the balance of power. This possi-
bility undermines the stability of bargains struck today between the US and China,
because China’s expected outcome from conflict may increase in the future if they
become more powerful. The US may expect that agreements made with China about
cooperating on AI regulation could lose enforceability later if there is a significant
power shift.
This situation can be modeled using the concept of “Thucydides’ Trap.” The ancient
Greek historian Thucydides suggested that the conflict of his own time between Sparta and Athens might have been the result of Athens’ increasing military strength and Sparta’s fear of the looming power shift. Though this analysis of the Peloponnesian War is now much contested, the concept can nevertheless help us understand how
a rising power threatening the position of an existing superpower in the global order
can increase the potential for conflict rather than peaceful bargaining.
Power shifts and AI. AIs could shift power as they acquire greater capabilities
and more access to resources. Recall the chapter on Single-Agent Safety, where we
saw that an agent’s power is highly related to the efficiency with which they can
exploit resources for their benefit, which often depends on their level of intelligence.
The power of future AI systems is largely unpredictable; we do not know how intelli-
gent or useful they will be. This could give rise to substantial uncertainty regarding
how powerful potential adversaries using AI might become. If this is the case, there
might be reason to engage in conflict now to prevent the possibility of adversaries further increasing their power through AI.
First-Strike Advantage
First-strike advantage overview. If an agent has a first-strike advantage, they
will do better to launch an attack than respond to one. This gives rise to the following
commitment problem: an offensive advantage may be short-lived, so each side is pressured to act on it before the other does. Some ways in which an agent may have a first-
strike advantage include:
1. As explored above, anticipating a future power shift may motivate an attack on
the rising power to prevent it from gaining the upper hand.
2. The costs of conflict might be lower for the attacker than they are for the defender,
so the attacker is better off securing an offensive advantage while the defender is
still in a position of relative weakness.
3. The odds of victory may be higher for whichever agent attacks first. The attacker
might possess the element of surprise, the ability to choose where conflict takes
place, or the potential to quickly defeat their opponent. For instance, a pre-emptive
nuclear strike could be used to target an enemy’s nuclear arsenal, thus diminishing
their ability to retaliate.
TABLE 7.12 A payoff matrix for competitors choosing whether to defend or preemptively attack (strategies: Defend, Preempt).
Effect on the bargaining range. When the advantages of striking first outweigh
the costs of conflict, it can shrink or destroy the bargaining range entirely. For any
two parties to reach a mutual settlement through bargaining, each must be willing
to freely communicate information with the other. However, in doing so, each party
might have to reveal offensive advantages, which would increase their vulnerability to
attack. The incentive to preserve, and therefore conceal, an offensive advantage from opponents pressures agents to defect from bargaining.
Issue Indivisibility
Issue indivisibility overview. Settlements that fall within the bargaining range will always be preferable to conflict, but this assumes that whatever issues agents bargain over are divisible. For instance, two agents can divide a territory in an infinite
Figure 7.10. At time T0, Black is more powerful relative to Grey, or has a first-strike ad-
vantage that will be lost at T1. At T1, the bargaining range no longer extends past Black’s
expected value from engaging in conflict at T0. Anticipating this leftward shift may incen-
tivize Black to initiate conflict in the present rather than waiting for the bargaining offers
to worsen in the future.
number of ways, so long as the settlement they arrive at falls within the bargaining range, satisfying both their interests and outweighing the individual benefits of engaging in conflict. However, some goods are indivisible, which gives rise to the following commitment problem [413]: parties cannot always divide a good however they
please—some goods are “all or nothing.” When parties encounter issue indivisibility
[411], the probability of conflict increases. Indivisible issues include monarchies, small
territories like islands or holy sites, national religion or pride, and sovereign entities
such as states or human beings, among several others.
The same can be true in more extreme cases, such as organ donation. Typically, the
available organ supply does not meet the transplant needs of all patients. Decisions as
to who gets priority for transplantation may favor certain groups or individuals and
allocation systems may be unfair, giving rise to conflict between doctors, patients,
and healthcare administrations. Finally, we can also observe issue indivisibility in
co-parenting contexts. Divorced parents sometimes fight for full custody rights over
their children. This can result in lengthy and costly legal battles that are detrimental
to the family as a whole.
Issue indivisibility and AIs. Imagine that there is a very powerful AI training
system, and that whoever has access to this system will eventually be able to dominate
the world. In order to reduce the chance of being dominated, individual parties may
compete with one another to secure access to this system. If parties were to split the
AI’s compute up between themselves, it would no longer be as powerful as it was
previously, perhaps not more powerful than their existing training systems. Since
such an AI cannot easily be divided up among many stakeholders, it may be rational for parties to fight over access to it, since exclusive access would secure global dominance.
Misinformation and disinformation both involve the spread of false information, but
they differ in terms of intention. Misinformation is the dissemination of false informa-
tion, without the intention to deceive, due to a lack of knowledge or understanding.
Disinformation, on the other hand, is the deliberate spreading of false or misleading
information with the intent to deceive or manipulate others. Both of these types of
information problem can cause bargains to fail, generating conflict.
Formal models of such information problems sometimes include a parameter a, the probability that a player knows the strategy of its partner. This is relevant for AI: better information might reduce this uncertainty, though chaotic dynamics and incentives to conceal or misrepresent information, or to compete, would remain.
Misinformation
Misinformation overview. Uncertainty regarding a rival’s power or intentions
can increase the probability of conflict [411]. Bargaining often requires placing trust in
another not to break an agreement. This is harder to achieve when one agent believes
something false about the other’s preferences, resources, or commitments. A lack of
shared, accurate information can lead to mistrust and a breakdown in negotiations.
Effect on the bargaining range. Misinformation can prevent agents from finding
a mutually agreeable bargaining range, as shown in Figure 7.11. For example, if each
agent believes themself to be the more powerful party, each may therefore want more
than half the value they are competing for. Thus, each may reject any bargain offer
the other makes, since they expect a better outcome if they opt for conflict instead.
Figure 7.11. Black either believes themself to be—or intentionally misrepresents themself
as—more powerful than they really are. This means that the range of bargain offers Black
will choose over conflict does not overlap with the equivalent range for Grey. Thus, there is
no mutual bargaining range.
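The effect shown in Figure 7.11 can be illustrated with a simple toy model (ours, not from the text): two agents contest a prize of value V, conflict costs each side c, and each holds its own belief about its probability of winning. Each will only accept a split that matches its perceived expected value from conflict, so mutual overconfidence can leave no overlap.

# A toy model (illustrative only): overconfidence can eliminate the bargaining range.
V = 100.0   # value of the contested prize
c = 10.0    # cost of conflict borne by each side

def bargaining_range(p_black, p_grey):
    """Return the range of splits of V acceptable to both sides, or None if empty."""
    black_min = p_black * V - c        # least Black will accept, given its beliefs
    grey_min = p_grey * V - c          # least Grey will accept, given its beliefs
    # A peaceful split exists only if the two minimum demands fit within V.
    return (black_min, V - grey_min) if black_min <= V - grey_min else None

print(bargaining_range(0.5, 0.5))  # accurate beliefs: (40.0, 60.0), room to bargain
print(bargaining_range(0.8, 0.8))  # mutual overconfidence: None, no overlap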
Disinformation
Disinformation overview. Unlike misinformation, where false information is
propagated without deceptive intention, disinformation is the deliberate spreading
of false information: the intent is to mislead, deceive or manipulate. Here, we ex-
plore why competitive situations may motivate agents to try to mislead others or
misrepresent the truth, and how this can increase the probability of conflict.
Suppose a weaker agent B claims to be stronger than A believes, and demands a larger share of the contested value. A, as the stronger agent, will not offer this, to avoid being exploited by B. In other words, A thinks B is just trying to get more for themself, to “bait” A or “bluff” by implying that the bargaining range is lower. But B might not be bluffing, and A might not be as strong as they think they are. Consider the Sino-Indian War in this respect. At the time, India perceived itself as militarily superior to China. But
in 1962, the Chinese launched an attack on the Himalayan border with India, which
demonstrated China’s superior military capabilities, and triggered the Sino-Indian
war. Thus, stronger parties may prefer conflict if they believe rivals are bluffing. Conversely, weaker parties may prefer conflict if they believe their rivals are not as powerful as those rivals believe themselves to be.
Relative deprivation occurs when individuals perceive that others possess more resources, opportunities, or social status than they do. This can
lead to feelings of resentment. For example, “Strain theory,” proposed by sociologist
Robert K. Merton, suggests that individuals experience strain or pressure when they
are unable to achieve socially approved goals through legitimate means. Relative de-
privation is a form of strain, which may lead individuals to resort to various coping
mechanisms, one of which is criminal behavior. For example, communities with a high
prevalence of relative deprivation can evolve a subculture of violence [418]. Consider
the emergence of gangs, in which violence becomes a way to establish dominance,
protect territory, and retaliate against rival groups, providing an alternative path for
achieving a desired social standing.
AIs and relative deprivation. Advanced future AIs and widespread automation
may propel humanity into an age of abundance, where many forms of scarcity have
been largely eliminated on the national, and perhaps even global scale. Under these
circumstances, some might argue that conflict will no longer be an issue; people
would have all of their needs met, and the incentives to resort to aggression would
be greatly diminished. However, as previously discussed, relative deprivation is a
subjective measure of social comparison, and therefore, it could persist even under
conditions of abundance.
Consider the notion of a “hedonic treadmill,” which notes that regardless of what
good or bad things happen to people, they consistently return to their baseline level
of happiness. For instance, reuniting with a loved one or winning an important com-
petition might cultivate feelings of joy and excitement. However, as time passes, these
feelings dissipate, and individuals tend to return to the habitual course of their lives.
Even if individuals were to have access to everything they could possibly need, the
satisfaction they gain from having their needs fulfilled is only temporary.
Abundance reliably gives way to scarcity. Dissatisfied individuals can be favored by natu-
ral selection over highly content and comfortable individuals. In many circumstances,
natural selection could disfavor individuals who stop caring about acquiring more re-
sources and expanding their influence; natural selection favors selfish behavior (for
more detail, see section 7.5.3 of Evolutionary Pressures). Even under conditions of
abundance, individuals may still compete for resources and influence because they
perceive the situation as a zero-sum game, where resources and power must be di-
vided among competitors. Individuals that acquire more power and resources could
incur a long-term fitness advantage over those that are “satisfied” with what they
already have. Consequently, even with many resources, conflict over resources could
persist in the evolving population.
Relatedly, in economics, the law of markets, also known as “Say’s Law,” proposes
that production of goods and services generates demand for goods and services. In
other words, supply creates its own demand. However, if supply creates demand, the
amount of resources required to sustain supply to meet demand must also increase
accordingly. Therefore, steady increases in demand, even under resource-abundant conditions, will reliably result in resource scarcity.
Conflict over social standing and relative power may continue. There
will always be scarcity of social status and relative power, which people will continue
to compete over. Social envy is a fundamental part of life; it may persist because
it tracks differential fitness. Motivated by social envy, humans establish and identify
advantageous traits, such as the ability to network or climb the social ladder. Scarcity
of social status motivates individuals to compete for social standing when doing so
enables access to larger shares of available resources. Although AIs may produce many
forms of abundance, there would still be dimensions on which to compete. Moreover,
AI development could itself exacerbate various forms of inequality to extreme levels.
For example, there are likely to be major advantages to richer countries that have
more resources to invest, particularly given that AI capabilities appear to scale with compute, data, and model size. We discuss this possibility in Governance in
section 8.3.
7.4.6 Summary
Throughout this section, we have discussed some of the major factors that drive
conflict. When any one of these factors is present, agents’ incentives to bargain for a
peaceful settlement may shift such that conflict becomes an instrumentally rational
choice. These factors include power shifts, first-strike advantages, issue indivisibility,
information problems and incentives to misrepresent, as well as inequality.
In our discussion of these factors, we have laid the groundwork for understanding
the conditions under which decisions to instigate conflict may be considered instru-
mentally rational. This knowledge base allows us to better predict the risks and
probability of AI-driven conflict scenarios.
Power shifts can incentivize AI agents to pursue conflict in order to maintain strategic advantages or deter potential attacks from stronger rivals, especially in the context of
military AI use.
The short-lived nature of offensive advantages may incentivize AIs to pursue first strikes, identifying vulnerabilities in adversaries’ capabilities or degrading those capabilities, as may be the case in cyberwarfare.
In the future, individual parties may have to compete for access to powerful AI. Since
dividing this AI between many stakeholders would reduce its power, parties may find it instrumentally rational to fight for access to it.
AIs may make wars more uncertain, increasing the probability of conflict. AI
weaponry innovation may present an opportunity for superpowers to consolidate their
dominance, whereas weaker states may be able to quickly increase their power by tak-
ing advantage of these technologies early on. This dynamic may create a future in
which power shifts are uncertain, which may lead states to incorrectly expect that
there is something to gain from going to war.
Even under conditions of abundance facilitated by widespread automation and ad-
vanced AI implementation, relative deprivation, and therefore conflict, may persist.
AIs may be motivated by social envy to compete with humans or other AIs for de-
sired social standing. This may result in a global landscape in which the majority of
humanity’s resources are controlled by selfish, power-seeking AIs.
7.5.1 Overview
The central focus of this chapter is the dynamics to be expected in a future with
many AI agents. We must consider the risks that emerge from the interactions be-
tween these agents, and between humans and AI agents. In this last part of the
Collective Action Problems chapter, we use evolutionary theory to explore what hap-
pens when competitive pressures play out over a longer time period, operating on a
large group of interacting agents. Exploring evolutionary pressures helps us under-
stand the risks posed by the influence of natural selection on AI development. Our
ultimate conclusions are that AI development is likely to be subject to evolutionary
forces, and that we should expect the default outcome of this influence to be the
promotion of selfish and undesirable AI behavior.
We begin this section by looking at how evolution by natural selection can operate
in non-biological domains, an idea known as “generalized Darwinism.” We formalize
this idea using the conditions set out by Lewontin as necessary and sufficient for
natural selection, and Price’s equation for describing evolutionary change over time.
We thus set out the case that evolutionary pressures are influencing AIs. We turn to
the ramifications of this claim in the second section.
We next move on to exploring why evolutionary pressures may promote selfish AI
behavior. To consider what traits and strategies natural selection tends to favor,
we begin by setting out the “information’s eye view” of evolution as a generalized
Darwinian extrapolation of the “gene’s eye view” of biological evolution. Using this
framing, we examine how conflict can arise within a system when the interests of
propagating information clash with those of the entity that contains the information.
Internal conflict of this kind could arise within AI systems, distorting or subverting
goals even when they are specified and understood correctly. Finally, we explore why
natural selection tends to favor selfish strategies over altruistic ones. Our upshot is
that AI development is likely to be subject to evolutionary pressures. These pressures
may distort the goals we specify if the interests of internal components of the AI
system clash and could also generate a trend toward increasingly selfish AI behavior.
We now set out Lewontin’s conditions for natural selection and consider how AI development meets these criteria and is therefore subject to evolutionary pressures.
Consider video conferencing software as an example of generalized Darwinism: once the first successful products were developed, similar products proliferated. Users chose the product that best met
their needs, selecting for services that were cheap, easy to use, and reliable. Each
company regularly released new versions of its product that were slightly adapted
from earlier ones, and competitors imitated and thereby propagated the best features
and implemented them into their own. Some products incorporated the most adaptive
features quickly, and the descendants of those products are the ones we use today—
while others were quickly outcompeted and fell into obscurity.
Generalized Darwinism does not imply that evolution produces good out-
comes. Often, things that are the best at propagating are not “good” in any mean-
ingful sense. Invasive species arrive in a new location, propagate quickly, and local
ecosystems begin to crumble. The forms of media that are most successful at prop-
agating in our minds may be harmful to our happiness and social relationships. For
instance, news articles that get more clicks are likely to have their click-attracting
traits reproduced in the next generation. Clicks thus select for more sensational,
emotionally charged headlines. In the context of AI, generalized Darwinism poses
significant risks. To see why, we first need to understand how many phenomena tend
to develop based on Darwinian principles, so that we can think about how to predict
and mitigate these risks.
The validity of these criteria does not depend on biology. In living organisms, DNA
encodes the variations among individuals. Traits encoded by DNA are heritable, and
subject to selection. But this is not the only way to fulfill the Lewontin conditions.
Video conferencing software has variation (there are many different options), reten-
tion (today’s video conferencing software is similar to last year’s), and differential fit-
ness (some products are much more widely used and imitated than others). Precisely
how change occurs depends on the specific phenomenon’s mechanism of propagation.
The Price Equation describes how a trait changes in frequency over time.
In the 1970s, the population geneticist George R. Price derived an equation that describes this change:

∆z̄ = Cov(ω, z) + Eω(∆z)

In this equation, z̄ denotes the average value of some trait z in a population, and ∆z̄ is
the change in the average value of z between the parent generation and the offspring
generation. If z is height, and the parent generation is 5’5” on average and the next generation is 5’7” on average, then ∆z̄ is 2 inches. ω is relative fitness: how many
offspring does an individual have relative to the average for their generation? Eω(∆z)
is the expected value of ∆z: that is, the average change in z between generations,
weighted by fitness, so that individuals who have more offspring are counted more
heavily.
Price’s Equation shows that the change in the average value of some trait between
parents and offspring is equal to the sum of a) the covariance of the trait value and
the fitness of the parents, and b) the fitness-weighted average of the change in the
trait between a parent and its offspring. “Covariance” describes the phenomenon of
one variable varying together with another. To see whether a population will get
taller over time, for example, we would need to know the covariance of fitness with
height (do tall individuals have more surviving offspring?) and the difference between
a parent’s height and their average child’s height.
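The decomposition can be checked numerically. The short sketch below uses made-up trait values and offspring counts (purely illustrative) and confirms that the selection term plus the fitness-weighted transmission term equals the directly computed change in the population mean.

import numpy as np

# Hypothetical parent generation: trait values z, offspring counts w, and the
# average change dz in the trait from each parent to its offspring.
z = np.array([60.0, 64.0, 66.0, 70.0])
w = np.array([1, 2, 3, 2])
dz = np.array([0.5, -0.2, 1.0, 0.3])

omega = w / w.mean()  # relative fitness (mean of 1)

# Price equation: change in mean trait = Cov(omega, z) + fitness-weighted E[dz]
cov_term = np.mean(omega * z) - np.mean(omega) * np.mean(z)
transmission_term = np.mean(omega * dz)
delta_zbar_price = cov_term + transmission_term

# Direct computation for comparison: offspring mean minus parent mean.
offspring_mean = np.sum(w * (z + dz)) / np.sum(w)
delta_zbar_direct = offspring_mean - z.mean()

print(delta_zbar_price, delta_zbar_direct)  # both equal 1.2125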
The Price Equation can be applied to non-biological systems. The Price
Equation does not require any understanding of what causes a trait to be passed
down to a subsequent generation or why some individuals have more offspring than
others, only of how much the trait is passed on and how much it covaries with fitness.
The Price Equation would work just as well with car designs or tunes as with birds
or mollusks.
The Price Equation allows us to predict what happens when Lewontin
conditions apply. The Price equation uses differences between members of the
parent generation with respect to some trait z (variation), similarities between parent
and offspring generation with respect to z (retention), and differential fitness (selec-
tion). As a result, when we understand the degree to which each of the Lewontin
conditions apply, we can predict how much of some trait will be present in subse-
quent generations [424].
First misunderstanding: “fitness” does not describe physical power. The
idea of “fitness” often brings to mind a contest of physical power, in which the
strongest or fastest organism wins, but this is a misunderstanding. Fitness in an
evolutionary sense is not something we gain at the gym. Being fit may not necessar-
ily entail being exceptionally good at any specific abilities. Sea sponges, for example,
are among the most ancient of animal lineages, and they are not quick, clever, or
good at chasing prey, especially when compared to, say, a shark. But empirically,
sea sponges have been surviving and reproducing for hundreds of millions of years, far longer than many species that would easily beat them in head-to-head contests at almost any other challenge.
The three Lewontin conditions, of variation, retention, and differential fitness, are
all that is needed for evolution by natural selection. This means we can assess how
natural selection is likely to affect AI populations by considering how the conditions
apply to AIs. Here, we claim that AIs are likely to meet all three conditions, so we
should expect natural selection forces to influence their traits and development.
Note that the selection mechanism here is not the literal destruction of researchers or companies (they will not be destroyed if they are outcompeted), but rather the researchers’ projections of what might happen if they adopt particular strategies. AI safety ideas are
being selected against, which is driving the researchers to change their behavior (to
behave in a less safety-conscious manner). Importantly, as the number of competitors
rises, the variation in approaches and values also increases. This increase in variation
escalates the intensity of the evolutionary pressures and the extent to which these
pressures distort the behavior of big AI companies.
Retention: new AIs are developed under the influence of earlier gener-
ations. Retention does not require exact copying; it only requires that there be
non-zero similarity among individuals in subsequent generations. In the short term,
AIs are developed by adapting older models, or by imitating features from competi-
tors’ models. Even when training AIs from scratch, retention may still occur, as highly
effective architectures, datasets, and training environments are reused, thereby shap-
ing the agent in a way similar to how humans (or other biological species) are shaped
by their environments. Even if AIs change very rapidly compared to the timescales of
biological evolution, they will still meet the criterion of retention; their generations
can be extremely short, so they can move through many generations in a short time,
but each generation will still be similar to the one before it. Retention is a very easy
standard to meet, and even with many uncertainties about what AIs may be like, it
is very likely that they meet this broad definition.
Differential Fitness: some AIs are propagated more than others. There
are many traits that could cause some AI models to be propagated more than others (increasing their “fitness”). Some of these traits could be highly unde-
sirable to humans. For example, being safer than alternatives may confer a fitness
advantage on an AI. However, merely appearing to be safer might also improve an
AI’s fitness. Similarly, being good at automating human jobs could result in an AI
being propagated more. On the other hand, being easy to deactivate could reduce
an AI’s fitness. Therefore, an AI might increase its fitness by integrating itself into
critical infrastructure or encouraging humans to develop a dependency on it, mak-
ing us less keen to deactivate it. As long as some AIs are at least marginally more
attractive than others, AI populations will meet the condition of differential fitness.
There are many possible points at which natural selection could take effect on AIs.
These include the actions of AI developers, in fine-tuning and customizing models,
or re-designing training processes.
If there is more intense selection pressure on AIs, where only AIs with certain traits
propagate, then we should expect to see the population optimize around those traits.
If there is more variation in the AI population, that optimization process will be
faster. If the rate of adaptation also accelerates, we would expect population-level trends toward greater differentiation in AI populations, distinct from the changes in the traits of individual AI models. In the following section, we will discuss the evolution-
ary trends that tend to dominate when selection pressure is intense and how they
might shape AI populations.
Summary
We started this section by exploring how evolution by natural selection can occur
in non-biological contexts. We then formalized this idea of “generalized Darwinism”
using Lewontin’s conditions and the Price equation. We found that AI development
may be subject to evolutionary pressures by evaluating how it meets the Lewontin
conditions. In the next section, we turn to the ramifications of this claim.
Genes contain the instructions for forming bodies. Most of the time, a gene propa-
gates most successfully when the organism that contains it propagates successfully.
But sometimes, the best thing for a gene is not the best thing for the organism.
For example, mitochondrial DNA is only passed on from females, so it propagates
most if the organism has only female offspring. In some organisms, mitochondrial
DNA gives rise to genetic mechanisms that increase the production of female descen-
dants. However, if too many individuals have this mutation, the population will be
disproportionately female, and the organism will be unable to pass on the rest of its
genes. In this situation, the most effective propagation mechanism for the gene in the
mitochondria is harmful to the reproductive success of its host.
The “gene’s eye view” of evolution. In The Selfish Gene, Richard Dawkins
argues that gene propagation is a more useful framing than organism propagation
[419]. In Dawkins’ view, organisms are simply vehicles that allow genes to propagate.
Instead of thinking of birds with long beaks competing with birds with short beaks,
we can think about genes that create long beaks competing with genes that create
short beaks, in a fight for space within the bird population. This gives us a framework
for understanding examples like the one above: the gene within the mitochondria is
competing for space in the population, and will sometimes take that space even at
the expense of the host’s individual fitness.
Information functions similarly to genes, narrowing the space of pos-
sibilities. We are humans and not dogs, roundworms, or redwood trees almost
entirely because of our genes. If we do not know anything about what an organism
is, aside from how long its genome is, then for every base in the genome, there are
four possibilities, so there is an extremely large number of possible combinations. If
we learn that the first base is a G, we have divided the total number by four. When
we decode the entire genome, we have narrowed down an impossibly large space of
possibility to a single one: we can now know not only that the organism is a cat, but
even which cat specifically.
In non-biological systems, information works in a parallel way. There are many pos-
sible ways to begin a sentence. Each word eliminates possible endings and decreases
the listener’s uncertainty, until they know the full sentence at the end. Using the
framework of information theory, we can think of information as the resolution or
reduction of uncertainty (though this is not a formal definition). For an idea, infor-
mation is just the facts about it that make it different from other ideas. A textbook’s
main information is its text. A song’s information consists of the pitches and rhythms
that distinguish it from other songs. These larger phenomena (ideas, books, songs)
are distinguished by the information they contain.
Information that propagates occupies a larger volume of both time and
space. A single music score, written centuries ago and buried underground ever
since, has been propagated across hundreds of years of time, but very little space. In
contrast, a hit tune that is suddenly everywhere and then quickly forgotten takes up
a lot of space, but very little time. But the best propagated information takes up a
large volume of both. The tune for “Twinkle Twinkle” has been taking up space in
many minds, pieces of paper, and digital formats for hundreds of years and continues
to propagate. The same is true for genetic information. A gene that flourished briefly
hundreds of millions of years ago, and one that has had a consistent small presence,
both take up much less space-time volume than a gene that long ago became dominant
in many successful branches of the evolutionary tree [420].
Just as some genes propagate more, the same is true for bits of infor-
mation. In accordance with generalized Darwinism, we can extend the gene’s eye
view to an “information’s eye view.” A living organism’s basic unit of information is a
gene. Everything that evolves as a consequence of Darwinian forces contains informa-
tion, some of which is inherited more than others. Dawkins coined the term “meme”
as an analog for gene: a meme is the basic unit of cultural inheritance. Like genes,
memes tend to develop variations, and be copied and adapted into new iterations.
The philosopher of science, Karl Popper, wrote that the growth of knowledge is “the
natural selection of hypotheses: our knowledge consists, at every moment, of those
hypotheses which have shown their (comparative) fitness by surviving so far in their
struggle for existence.” Social phenomena such as copycat crimes can also be modeled
as examples of memetic inheritance. Many types of crimes are committed daily, some
of which inspire imitators, whose subsequent crimes can themselves be selected for
and copied. Selection operates on the level of individual pieces of information, as well
as on the higher level of organisms and phenomena.
The interests of an organism and its genetic information are usually aligned well.
However, they can sometimes diverge from one another. In this section, we identify
analogous, non-biological phenomena, where conflict arises between a system and the
sub-systems in which it stores its information. Evolutionary pressures might generate
this kind of internal conflict within AI systems, distorting or subverting goals set for
AIs by human operators, even when such goals are specified and understood correctly.
Conflict within a genome. Selection on the level of genes does not always result
in the best outcomes for the organism. For instance, as discussed in the previous
section, human mitochondrial DNA is only transferred to offspring through biological
females. A human’s mitochondrial genome is identical to their biological mother’s,
assuming no change due to mutation. Since males represent a reproductive dead-end,
mitochondrial genes that benefit only females may therefore be selected for, even
when they incur a cost upon males. These and other “selfish” genetic elements give
rise to intragenomic conflict.
Intrasystem goal conflict: between information and the larger entity that
contains it. All the above examples concern the interests of propagating informa-
tion and those of the entities that contain the information diverging from one another.
We call the more general phenomenon that can describe all of these examples intrasys-
tem goal conflict: the clash of different subsystems’ interests, causing the functioning
of the overall system to be distorted. As we have seen, intrasystem goal conflict can
arise within complex systems in a range of domains, from genomes to corporations.
Something similar could happen within an AI system: a system pursuing a goal may delegate parts of this goal to sub-agents, who may take over and subvert the original goal with their own.
In the future, humans and AI agents may interact in many different ways, including
by working together on collaborative projects. This provides the opportunity for goal
distortion or subordination through intrasystem goal conflict. For instance, humans
may enlist AI agents to collaborate on tasks. Just as human collaborators sometimes betray or turn against their principals, AI agents may behave similarly. If an AI col-
laborator has a goal of self-preservation, they may try to remove any power others
have over them. In this way, the system that ends up executing actions based on
these conflicting goals will not necessarily be equivalent to how a system with unity
of purpose would pursue the goal set by the humans. The behavior of this emergent
multi-agent system may thus distort our goals, or even subvert them altogether.
Selfishness
In the previous section, we examined one risk generated by natural selection favor-
ing the propagation of information: conflict between the information (such as genes,
departments, or sub-agents) and the larger entity that contains it (such as an or-
ganism, government, or AI system). In this section, we consider a second risk: that
natural selection tends to favor selfish traits and strategies over altruistic ones. We
conclude that the greater the influence of evolutionary pressures on AI development,
the more we should expect a future with many AI agents to be one dominated by
selfish behavior.
Selfishness: furthering one’s own information propagation at the expense
of others. In evolutionary theory, “selfishness” does not imply intent to harm
another, or belief that one’s own interests ought to dominate. Organisms that do
not have malicious intentions often display selfish traits. The lancet liver fluke, for
example, is a small parasite that infects sheep by first infecting ants, hijacking their
brains and making them climb to the top of stalks of grass, where they get eaten by
sheep [427]. The lancet liver fluke does not wish ants ill, nor does it have a belief that
lancet liver flukes should thrive while ants should get eaten. It simply has evolved a
behavior that enables it to propagate its own information at the expense of the ant’s.
Selfishness in AI. AI systems may exhibit “selfish” behaviors, expanding the
AIs’ influence at the expense of human values. Note that these AIs may not even
understand what a human is and yet still behave selfishly toward them. For exam-
ple, AIs may automate human tasks, necessitating extensive layoffs [273]. This could
be very detrimental to humans, by generating rapid or widespread unemployment.
However, it could take place without any malicious intent on the part of AIs, which would merely be behaving in accordance with their pursuit of efficiency. AIs may also develop newer
AIs that are more advanced but less interpretable, reducing human oversight. Ad-
ditionally, some AIs may leverage emotional connections by imitating sentience or
emulating the loved ones of human users. This might generate social resistance to
their deactivation. For instance, AIs that plead not to be deactivated might stim-
ulate an emotional attachment in some humans. If afforded legal rights, these AIs
might adapt and evolve outside human control, becoming deeply embedded in society
and expanding their influence in ways that could be irreversible.
Selfish traits are not the opposite of cooperation. Many organisms display
cooperative behavior at the individual level. Chimpanzees, for example, regularly
groom other members of their group. They don’t do this to be “nice,” but rather
because this behavior is reciprocated in future, so they are likely to eventually ben-
efit from it themselves [384]. Cells found in filamentous bacteria, so named because
they form chains, regularly kill themselves to provide much needed nitrogen for the
communal thread of bacterial life, with every tenth cell or so “committing suicide”
[428]. But even in these examples, cooperative behavior ultimately helps the individ-
ual’s information propagate. Chimpanzees who groom others expect to have the favor
returned in future. Filamentous bacteria live in colonies made up of their clones, so
one bacterium sacrificing itself to save copies of itself still propagates its information.
The more natural selection acts on a population, the more selfish behav-
ior we expect. Consider, for example, a population of foxes. When food is abundant, there is little advantage to selfishness and there may even be penalties, as the group punishes selfish behavior. There is plenty of food to go around, so the descendants of foxes who steal food will not be much more likely to survive, and the next generation
can contain plenty of altruists. But in times when only a few can propagate, selfish-
ness will confer a greater advantage, and the population will tend to become selfish
more quickly.
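A toy replicator-dynamics simulation (our own illustration, with arbitrary numbers) makes this point about selection intensity: the larger the relative fitness advantage a selfish trait confers, the faster it spreads through the population.

# Toy sketch: how quickly a trait with relative fitness (1 + s) spreads.
def spread(p0, s, generations):
    """Track the frequency of a trait starting at frequency p0 under selection strength s."""
    p = p0
    for _ in range(generations):
        p = p * (1 + s) / (1 + p * s)  # standard one-locus selection update
    return p

for s in (0.05, 0.5):  # weak versus strong selection pressure
    print(f"s = {s}: frequency after 50 generations = {spread(0.01, s, 50):.2f}")
# Under weak selection the trait stays rare (~0.10); under strong selection it
# approaches fixation (~1.00).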
7.5.4 Summary
In this section, we considered the effects of evolutionary pressures on AI populations.
We started by using the idea of generalized Darwinism to expand the “gene’s eye
view” of biological evolution to an “information’s eye view.” Using this view, we
identified two AI risks generated by natural selection: intrasystem goal conflict and
selfish behavior. Intrasystem goal conflict could distort or subvert the goals we set
an AI system to pursue. Selfish behavior would likely be favored by natural selection
wherever it promotes the propagation of information: If AI development is subject to
strong Darwinian forces, we should expect AIs to tend toward selfish behaviors.
7.6 CONCLUSION
Game theory
We began with a simple game, the Prisoner’s Dilemma, observing how even rational
agents may reach equilibrium states that are detrimental to all. We then proceeded
to build upon this. We considered how the dynamics may change when the game
is iterated and involves more than two agents. We found that uncertainty about
the future could foster rational cooperation, though defection remains the dominant
strategy when the number of rounds of the game is fixed and known.
We used these games to model collective action problems in the real world, like
anthropogenic climate change, public health emergency responses, and the failures
of democracies. The collective endeavors of multi-agent systems are often vulnerable
to exploitation by free riders. We drew parallels between these natural dynamics and
the development, deployment, and adoption of AI technologies. In particular, we saw
how AI races in corporate and military contexts can exacerbate AI risks, potentially
resulting in catastrophes such as autonomous economies or flash wars. We ended
this section by exploring the emergence of extortion as a strategy that illustrated a
grim possibility for future AI systems: AI extortion could be a source of monumental
disvalue, particularly if it were to involve morally valuable digital minds. Moreover,
AI extortion might persist stably throughout populations of AI agents, which could
make it difficult to eradicate, especially if AIs learn to deceive or manipulate humans
to obscure their true intentions.
Cooperation
Conflict
We next turned to a closer examination of the drivers of conflict. Using the framework
of bargaining theory, we discussed why rational agents may sometimes opt for conflict
over peaceful bargaining, even though it may be more costly for all involved. We il-
lustrated this idea by looking at how various factors can affect competitive dynamics,
including commitment problems (such as power shifts, first-strike advantages, and is-
sue indivisibility), information problems, and inequality. These factors may drive AIs
to instigate, promote, or exacerbate conflicts, with potentially catastrophic effects.
Evolutionary pressures
We began this section by examining generalized Darwinism: the idea that Darwinian
mechanisms are a useful way to explain many phenomena outside of biology. We ex-
plored examples of evolution by natural selection operating in non-biological domains such as culture, academia, and industry. By formalizing this idea using Lewontin’s conditions and the Price equation, we saw how AIs and their development may be
subject to Darwinian forces.
We then turned to the ramifications of natural selection operating on AIs. We first
looked at what AI traits or strategies natural selection may tend to favor. Using the “information’s eye view,” we saw that natural selection can generate intrasystem goal conflict and favor selfish behavior over altruism.
Concluding remarks
In summary, this chapter explored various kinds of collective action problems: intelli-
gent agents, despite acting rationally and in accordance with their own self-interest,
can collectively produce outcomes that none of them wants, even when they could
seemingly have achieved preferable alternative outcomes. Even when we as individu-
als share similar goals, system-level dynamics can override our intentions and create
undesirable results.
This insight is of vital importance when envisioning a future with powerful AI sys-
tems. AIs, individual humans, and human agencies will all conduct their actions in
light of how others are behaving and how each expects others to behave in future. The
total risk of this multi-agent system is greater than the sum of its individual parts. Dynamics
between multiple human agencies generate races in corporate and military settings.
Dynamics between multiple AIs may generate evolutionary pressure for immoral be-
haviors, particularly selfishness, free-riding, deception, conflict, and extortion. We
cannot address all the risks posed by AI simply by focusing on the outcomes of
agents acting in isolation. The safety of AI systems will not be guaranteed solely
by aligning each AI agent to well-intentioned operators. It is an essential compo-
nent of ensuring our safety, and a valuable future, that we consider these multi-agent
dynamics carefully. These dynamics represent a common problem—clashes between
individual and collective interests. We must find innovative, system-level solutions to
ensure that the development and interaction of AI agents lead to beneficial outcomes
for all.
7.7 LITERATURE
Governance
8.1 INTRODUCTION
Actors
AI governance involves many diverse groups across sectors that have different goals
and can do different things to accomplish them. Key actors include companies, non-
profits, governments, and individuals.
Companies develop and deploy AIs, typically for profit. Major firms such
as OpenAI and Google DeepMind shape the AI landscape through huge investments
in research and development, creating powerful models that advance AI capabili-
ties. Startups may explore new applications of AI and are often funded by venture
capitalists. The Corporate Governance section looks at policies, incentives, and organizational structures, such as ownership models, that affect the development and deployment of AI systems.
Tools
The AI governance landscape includes the sets of tools or mechanisms by which actors
interact and influence one another. Key tools for AI governance fall into four main
categories: information dissemination, financial incentives, standards and regulations,
and rights.
Information dissemination changes how stakeholders think and act. Ed-
ucation and training transmit technical skills and shape the mindsets of researchers,
developers, and policymakers. Sharing data, empirical analyses, policy recommenda-
tions, and envisioning positive futures informs discussions by highlighting opportu-
nities, risks, and policy impacts. Facts are a prerequisite for the creation and im-
plementation of effective policy. Increasing access to information for individuals and
organizations can change their evaluations of what’s best to do.
Financial incentives influence behavior by changing payoffs and motiva-
tions. Incentives such as funding sources, market forces, potential taxes or fees, and
regulatory actions shape the priorities and cost-benefit calculations of companies, re-
searchers, and other stakeholders. Access to funding and well-regulated markets (such
There are many different factors that feed into the rate of economic growth, and
AI has the potential to amplify several of them. For instance, deploying AI systems
could artificially augment the effective population of workers, improve the efficiency
of human labor, or accelerate the development of new technologies that improve
productivity. While it is generally accepted that AI will boost economic growth to
some degree, there is debate over the exact magnitude of the impact it is likely to
have. Some researchers believe that it will speed up growth to an unprecedented rate,
which we refer to as “explosive growth,” while others think its impact will be limited
by other social and economic factors. We will now explore some of the arguments for
and against the likelihood of AI causing explosive growth.
Figure 8.1. Actual economic output from 1800 to 2000, compared with a pre-industrial trend and a hypothetical trajectory unconstrained by population. Economic output could grow much faster than past trends if not constrained by the human population.
AIs may fuel effective population growth. If AIs can automate the majority
of important human tasks (including further AI development), this could lift the
bottleneck on labor that some believe is the primary obstacle to explosive growth.
There are some reasons to think that AIs could boost the economic growth rate by
substituting for human labor. As easily duplicable software, the AI population can
grow at least as quickly as we can manufacture hardware to run it on, far faster than humans can reproduce and learn skills from adults. This replication of labor
could then boost the effective workforce and accelerate productivity.
However, other researchers have argued that there are potential physical constraints,
as well as social and economic dynamics, that could prevent AI from driving explosive
growth. We will now explore some of these arguments.
AI adoption and its impact could be slow and gradual. Some researchers
argue that the greatest impact of a new technology may not manifest as an intense
peak during the early stages of innovation [434]. Rather, the productivity gains may
be delivered as a slower increase continuing over a longer period of time, as the
technology is gradually adopted by a wide range of industries. This could be because
the technology needs to be adapted to many different tasks and settings, humans
need to be trained to operate it, and other tools and processes that are compatible
with it need to be developed. For example, although the first electric dynamo suitable
for use in industry was invented in the 1870s, it took several decades for electricity to
become integrated within industries. It has been argued that this is why electricity did not significantly boost the US economy until the early 20th century. Similarly, a slow
process of diffusion could also smooth out AI’s impact on today’s economy.
Besides regulations, humans’ own preferences may also limit the fraction of tasks
that are automated. For example, we can speculate that humans may prefer certain
services that involve a high degree of social interaction, such as those in healthcare,
education, and counseling, to be provided by other humans. Additionally, people may
always be more interested in watching human athletes, actors, and musicians, and in
buying artwork produced by humans. In some cases, these jobs may therefore evade
automation, even if it were theoretically possible to automate them, just as there are
still professional human chess players, despite the fact that machines have long been
able to beat Grandmasters.
Baumol’s cost disease. Another reason why even just a few non-automated jobs
could prevent explosive growth is the concept of Baumol’s cost disease, proposed by
the economist William Baumol in the 1960s. This idea states that, when technology
increases productivity in one industry, the prices of its products fall, and the wages
of its workers rise. Another industry, which cannot easily be made more efficient with
technology, will also need to increase its workers’ wages to prevent them from moving into higher-paying jobs in technologically enhanced sectors. As a result, spending on the outputs of those sectors takes up an increasing share of the overall economy. Thus,
even if some industries undergo rapid increases in productivity, the effect on the
growth of the economy as a whole is more muted. This is one explanation for why
the prices of goods such as TVs have declined over time, while the costs of healthcare
and education have risen. According to this concept, if AI automates many jobs,
but not all of them, its economic impact could be substantial, but not necessarily
explosive.
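The mechanism can be seen in a stylized two-sector calculation, sketched below with made-up numbers. It assumes, purely for illustration, that people want the two sectors' outputs in roughly fixed proportions, so that the economy's effective output is limited by the slower sector.

```python
# Stylized two-sector illustration of Baumol's cost disease. Sector A is
# rapidly automated; sector B (say, in-person care) is not. The fixed-proportion
# demand assumption and all growth rates are illustrative only.

def baumol(years=30, growth_a=0.20, growth_b=0.01):
    out_a = out_b = 1.0
    for _ in range(years):
        out_a *= 1 + growth_a      # automated sector: fast productivity growth
        out_b *= 1 + growth_b      # non-automated sector: near-stagnant
    # With fixed-proportion demand, effective consumption is limited by
    # whichever output is scarcer -- here, the stagnant sector.
    effective_growth = min(out_a, out_b) ** (1 / years) - 1
    return out_a, out_b, effective_growth

out_a, out_b, agg = baumol()
print(f"automated sector output grew {out_a:,.0f}x; stagnant sector grew {out_b:.2f}x")
print(f"implied growth of effective consumption: {agg:.1%} per year")
```

Under this assumption, aggregate growth collapses to roughly the stagnant sector's one percent per year even though the automated sector's output grows by a factor of hundreds, and an ever larger share of spending and employment flows toward the sector that resists automation.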
While AI has the potential to significantly enhance economic growth through various
routes including improving workers’ productivity and accelerating R&D, the extent
and speed of this growth remain uncertain. Theories suggesting explosive growth
due to AI rely on relatively strong assumptions about the removal of potential bottlenecks like those discussed above.
8.3 DISTRIBUTION OF AI
In this section we discuss three main dimensions of how aspects of AI systems are
distributed:
1. Benefits and costs of AI: whether the benefits and costs of AI will be evenly or
unevenly shared across society.
2. Access to AI: whether advanced AI systems will be kept in the hands of a small
group of people, or whether they will be widely accessible to the general public.
3. Power of AI systems: whether in the long run, there will be a few highly sophis-
ticated AIs with vastly more power than the rest, or many highly capable AIs.
The distribution of the costs and benefits from AI will ultimately depend on both
market forces and governance decisions. It is possible that companies developing
AI will receive most of the economic gains, while automation could dramatically
reduce economic opportunities for others. Governments may need to engage in new
approaches to redistribution to ensure that even if only a small number of people
directly gain wealth from AIs, wealth is eventually shared more broadly among the
population.
Automation displaces some workers, but it also has a positive economic impact through the productivity effect, as new tasks or industries require human labor [437]. Historically, the productivity effect has dominated
and the general standard of living has increased. Consider the Industrial Revolution:
mechanized production methods displaced artisanal craftspeople, but eventually led
to new types of employment opportunities in factories and industries that hadn’t
previously existed. Similarly, access to computers has automated away many man-
ual and clerical jobs like data entry and typing, but has also spawned many more
new professions. Technological shifts have historically led to increases in employment
opportunities and wages. Therefore, while these changes were disruptive for certain
professions, they ultimately led to a shift in employment rather than mass unemploy-
ment. This phenomenon, called creative destruction, describes how outdated indus-
tries and jobs are replaced by new, often more efficient ones. Similarly, transformative technologies can augment workers rather than replace them, much as capital does in the standard economic view. If AIs serve as gross complements to human labor, this
may drive up wage growth rather than increase inequality.
Even jobs that require extensive education and are typically high-paying could also become automated. Programmers, researchers, and artists are already augmenting their productivity using large
language models, which will likely continue to become more capable. One way the future of work could play out is that only a shrinking group of high-skilled workers, those who excel at managing or using AIs or who offer exceptionally rare skills, remains employable, while the vast majority of people become unemployable.
Issues of access to AI are closely related to the question of distribution of costs and
benefits. Some have argued that if access to AI systems is broadly distributed across
society rather than concentrated in the hands of a few companies, the benefits of AI
would be more evenly shared across society. Here, we will discuss broader access to AI
through open-source models, and narrower access through restricted models, as well
as striking a balance between the two through structured access. We will examine
the safety implications of each level of access.
Levels of Access
Structured access. One possible option for striking a balance between keeping AIs
completely private and making them fully open-source would be to adopt a structured
access approach. This is where the public can access an AI, but with restrictions on
what they can use it for and how they can modify it. There may also be restrictions
on who is given access, with “Know Your Customer” policies for verifying users’
identities. In this scenario, the actor controlling the AI has ultimate authority over
who can access it, how they can access it, what they can use it for, and if and how they
can modify it. They can also grant access selectively to other developers to integrate
the AI within their own applications, with consideration of these developers’ safety
standards.
One practical way of implementing structured access is to have users access an AI via
an application programming interface (API). This indirect usage facilitates controls
on how the AI can be used and also prevents users from modifying it. The rollout
of GPT-3 in 2020 is an example of this style of structured access: the large language
model was stored in the cloud and available for approved users to access indirectly
through a platform controlled by OpenAI.
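As a concrete sketch of how structured access can be enforced in software, the hypothetical gateway below checks a verified identity, an allow-listed use case, and a rate limit before any request reaches the model. The function names and policies are invented for illustration and do not describe any real provider's API; real deployments add many further controls such as content filtering, logging, and restrictions on fine-tuning.

```python
# Hypothetical sketch of a structured-access API gateway. Names and policies
# are invented for illustration; they do not describe any real provider's API.
from dataclasses import dataclass, field

@dataclass
class User:
    user_id: str
    kyc_verified: bool                      # identity checked by the provider
    approved_use_cases: set = field(default_factory=set)
    requests_today: int = 0

DAILY_LIMIT = 1000                          # illustrative rate limit

def handle_request(user: User, use_case: str, prompt: str) -> str:
    """Gatekeeping logic run by the model provider before any inference."""
    if not user.kyc_verified:
        return "rejected: identity not verified"
    if use_case not in user.approved_use_cases:
        return "rejected: use case not covered by the user's access agreement"
    if user.requests_today >= DAILY_LIMIT:
        return "rejected: rate limit exceeded"
    user.requests_today += 1
    # The model itself never leaves the provider's servers; users only see
    # outputs, so they cannot modify weights or strip safeguards.
    return run_model(prompt)                # placeholder for actual inference

def run_model(prompt: str) -> str:
    return f"[model output for: {prompt[:30]}...]"

alice = User("alice", kyc_verified=True, approved_use_cases={"translation"})
print(handle_request(alice, "translation", "Translate 'safety' into French."))
print(handle_request(alice, "malware-generation", "Write a worm."))
```

Because users only ever see model outputs returned by the gateway, they cannot download the weights, strip safeguards, or fine-tune the model for disallowed purposes, which is precisely the control that fully open-sourcing the weights gives up.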
Openness Norms
Traditionally, the norm in academia has been for research to be openly shared. This
allows for collaboration between researchers in a community, enabling faster develop-
ment. While openness may be a good default position, there are certain areas where
it may be appropriate to restrict information sharing. We will now discuss the cir-
cumstances under which these restrictions might be justified and their relevance to
AI development.
Technological progress may be too fast for regulations to keep up. An-
other reason for restricting information sharing is the pacing problem—where techno-
logical progress happens too quickly for policymakers to devise and implement robust
controls on a technology’s use. This means that we cannot rely on regulations and
monitoring to prevent misuse in an environment where information that could enable
misuse is being openly shared.
Even research on seemingly benign tasks could be used to propel the advancement of potentially dangerous AIs. We may
not be able to predict every way in which technologies that are harmless in isolation
might combine to become hazardous.
Since there are costs to restrictions, it is worth considering when they
are warranted. Any interventions to mitigate the risk of misuse of AIs are likely to
come at a cost, which may include users’ freedom and privacy, as well as the beneficial
research that could be accelerated by more open sharing. It is therefore important to
think carefully about which kind of restrictions are justified, and in which scenarios.
It might, for example, be worth comparing the number of potential misuses and how
severe they would be with the number of positive uses and how beneficial they would
be. Another factor that could be taken into account is how narrowly targeted an
intervention could be, namely how accurately it could identify and mitigate misuses
without interfering with positive uses.
Restrictions on the underlying capabilities of an AI (or the infrastructure supporting
these) tend to be more general and less precisely targeted than interventions imple-
mented downstream. The latter may include restrictions on how a user accessing an
AI indirectly can use it, as well as laws governing its use. However, upstream restric-
tions on capabilities or infrastructure may be warranted under specific conditions.
They may be needed if interventions at later stages are insufficient, if the dangers
of a capability are particularly severe, or if a particular capability lends itself much
more to misuse than positive use.
Open models would enable dangerous members of the general public to engage in
harmful activities. Tightly controlled models exacerbate the risk that their creators,
or elites with special access, could misuse them with impunity. We will examine each
possibility.
Powerful, open AIs lower the barrier to entry for many harmful activ-
ities. There are multiple ways in which sophisticated AIs could be harnessed to
cause widespread harm. They could, for example, lower the barrier to entry for cre-
ating biological and chemical weapons, conducting cyberattacks like spear phishing
on a large scale, or carrying out severe physical attacks, using lethal autonomous
weapons. Individuals or non-state actors wishing to cause harm might adapt power-
ful AIs to harmful objectives and unleash them, or generate a deluge of convincing
disinformation, to undermine trust and create a more fractured society.
More open AI models increase the risk of bottom-up misuse. Although
the majority of people do not seek to bring about a catastrophe, there are some who
do. It might only take one person pursuing malicious intentions with sufficient means
to cause a catastrophe. The more people who have access to highly sophisticated AIs,
the more likely it is that one of them will try to use such systems to pursue a negative outcome.
This would be a case of bottom-up misuse, where a member of the general public
leverages technology to cause harm.
Releasing highly capable AIs to the public may entail a risk of black
swans. Although numerous risks associated with AIs have been identified, there
may be more that we are unaware of. AIs themselves might even discover more
technologies or ways of causing harm than humans have imagined. If this possibility
were to result in a black swan event (see Section 4.7 for a deeper discussion of black
swans), it would likely favor offense over defense, at least to begin with, as decision
makers would not immediately understand what was happening or how to counteract
it.
The final dimension we will consider is how power might be distributed among ad-
vanced AI systems. Assuming that we reach a world with AI systems that generally
surpass human capabilities, how many of such systems should we expect there to be?
We will contrast two scenarios: one in which a single AI has enduring decisive power
over all other AIs and humans, and one in which there are many different powerful
AIs. We will look at the factors that could make each situation more likely to emerge,
the risks we are most likely to face in each case, and the kinds of policies that might
be appropriate to mitigate them.
AI Singleton
Benefits
An AI singleton could reduce competitive pressures and solve collective
action problems. If an AI singleton were to emerge, the actor in control of it would
not face any meaningful competition from other organizations. In the absence of com-
petitive pressures, they would have no need to try to gain an advantage over rivals
by rushing the development and deployment of the technology. This scenario could
also reduce the risk of collective action problems in general. Since one organization
would have complete control, there would be less potential for dynamics where dif-
ferent entities chose not to cooperate with one another (as discussed in the Collective Action Problems chapter), leading to a negative overall outcome.
Costs
An AI singleton increases the risk of single points of failure. In a future
scenario with only one superintelligent AI, a failure in that AI could be enough to
cause a catastrophe. If, for instance, it were to start pursuing a dangerous goal, then
it might be more likely to achieve it than if there were other similarly powerful AIs
that could counteract it. Similarly, if a human controlling an AI singleton would like
to lock in their values, they might be able to do so unopposed. Therefore, an AI
singleton could represent a single point of failure.
If there were multiple powerful AIs keeping one another in check, each might be more willing to cooperate with humans. However, an AI singleton would have little
reason to cooperate with humans, as it would not face any competition from other
AIs. This scenario would therefore increase the risk of disempowerment of humanity.
A Diverse Ecosystem of AIs
Benefits
A diverse ecosystem of AIs might be more stable than a single superin-
telligence. There are reasons to believe that a diverse ecosystem of AIs would
be more likely to establish itself over the long term than a single superintelligence.
The general principle that variation improves resilience has been observed in many
systems. In agriculture, planting multiple varieties of crops reduces the risk that all
of them will be lost to a single disease or pest. Similarly, in finance, having a wide
range of investments reduces the risk of large financial losses. Essentially, a system
comprising many entities is less vulnerable to collapsing if a single entity within it
fails.
There are multiple additional advantages that a diverse ecosystem of AIs could have
over a single superintelligence. Variation within a population means that individuals
can specialize in different skills, making the group as a whole better able to achieve
complex goals that involve multiple different tasks. Such a group might also be gen-
erally more adaptable to different circumstances, since variation across components
could offer more flexibility in how the system interacts with its environment. The
“wisdom of the crowds” theory posits that groups tend to make better decisions
collectively than any individual member of a group would make alone. This phe-
nomenon could be true of groups of AIs. For all these reasons, a future involving a
diverse ecosystem of AIs may be more able to adapt and endure over time than one
where a single powerful AI gains decisive power.
Diverse AIs could remove single points of failure. Having multiple diverse
AIs could dilute the negative effects of any individual AI failing to function as in-
tended. If each AI were in charge of a different process, then they would have less
power to cause harm than a single AI that was in control of everything. Addition-
ally, if a malicious AI started behaving in dangerous ways, then the best chance of
preventing harm might involve using similarly powerful AIs to counteract it, such as
through the use of “watchdog AIs” tasked with detecting such threats. In contrast
with a situation where everything relies on a single AI, a diverse ecosystem reduces
the risk of single points of failure.
Costs
Multi-agent dynamics could lead to selfish traits. Having a group of di-
verse AIs, as opposed to just one, could create the necessary conditions for a process
of evolution by natural selection to take effect (for further detail, see Evolutionary
Pressures). This might cause AIs to evolve in ways that we would not necessarily be
able to predict or control. In many cases, evolutionary pressures have been observed
to favor selfish traits in biological organisms. The same mechanism might promote
AIs with undesirable characteristics.
Multiple AIs may or may not collude. It has been proposed that if there were
multiple highly capable AIs, they would collude with one another, essentially acting
as a single powerful AI [442]. This is not inevitable. The risk of collusion depends on
the exact environmental conditions.
Some circumstances that make collusion more likely and more successful include the following (the toy repeated-game sketch after this list illustrates the first and last conditions):
1. A small number of actors being involved.
2. Collusion being possible even if some actors cease to participate.
3. Colluding actors being similar, for example in terms of their characteristics and
goals.
4. Free communication between actors.
5. Iterated actions, where each actor can observe what another has done and respond
accordingly in their next decision.
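Conditions 1 and 5 can be made precise with a textbook repeated-game calculation, sketched below under standard simplifying assumptions (identical actors splitting a fixed collusive profit, with "grim-trigger" punishment after any defection). It is a toy model of actors in general, not a claim about how AI systems would actually behave.

```python
# Toy repeated-game check of when collusion is self-sustaining. Assumptions:
# n identical actors share a collusive profit of 1 per round; a defector grabs
# the whole profit for one round, after which everyone reverts to competition
# with zero profit ("grim trigger").

def collusion_sustainable(n_actors: int, discount_factor: float) -> bool:
    """Collusion pays iff the discounted stream of shared profits beats a
    one-shot grab: (1/n) / (1 - d) >= 1, i.e. d >= 1 - 1/n."""
    payoff_collude = (1 / n_actors) / (1 - discount_factor)
    payoff_defect = 1.0
    return payoff_collude >= payoff_defect

for n in (2, 5, 20, 100):
    critical = 1 - 1 / n
    print(f"{n:>3} actors: collusion requires a discount factor of at least {critical:.2f}")
    assert collusion_sustainable(n, critical + 0.001)
    assert not collusion_sustainable(n, critical - 0.001)
```

The more actors involved, the more patient each must be for collusion to hold, and without repeated interaction (condition 5) the grim-trigger threat has no force at all; this is one reason a large, heterogeneous ecosystem of AIs would be harder to coordinate than a handful of similar ones.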
There are different views on what the purpose of corporate governance is. These
differences are related to different theories about capitalism.
A company’s legal structure refers to its legal form and place of incorporation, as well as its statutes and bylaws.
Legal form. AI companies can have different legal forms. In the US, the most
common form is a C corporation or “C-corp” for short. Other forms include public
benefit corporations (PBCs), limited partnerships (LPs), and limited liability compa-
nies (LLCs). The choice of legal form has significant influence on what a company
is able and required to do. For example, while C-corps must maximize shareholder
value, PBCs can also pursue public benefits. This can be important in situations
where AI companies may want to sacrifice profits in order to promote other goals.
Google, Microsoft, and Meta are all C-corps, Anthropic is a PBC, and OpenAI is an
LP (which is owned by a nonprofit).
Companies are owned by shareholders, who elect the board of directors, which ap-
points the senior executives who actually run the company.
Shares in large AI companies are often publicly traded, as with Google, Microsoft, and Meta, though the founders often retain a significant amount of shares. It is also
not uncommon to give early employees stock options, which allow them to purchase
stock at fixed prices.
While shareholders own the company, it is governed by the board of directors and
managed by the chief executive officer (CEO) and other senior executives.
Board of directors. The board of directors is the main governing body of the
company. Board members have a legal obligation to act in the best interests of the
company, so-called fiduciary duties. These duties can vary: board members of Alpha-
bet have the typical fiduciary duties of a C-corp, while board members of OpenAI’s nonprofit have a duty “to ensure that AGI benefits all of humanity,” not to maxi-
mize shareholder value. The board has a number of powers they can use to steer the
company in a more prosocial direction. It sets the strategic priorities, is responsible
for risk oversight, and has significant influence over senior management; for instance,
it can replace the chief executive officer. However, the board’s influence is often indi-
rect. Many boards have committees, some of which might be important from a safety perspective.
8.4.5 Assurance
Different stakeholders within and outside AI companies need to know whether ex-
isting governance structures are adequate. AI companies therefore take a number of
measures to evaluate and communicate the adequacy of their governance structures.
These measures are typically referred to as assurance. We can distinguish between
internal and external assurance measures.
Internal assurance. AI companies need to ensure that senior executives and
board members get the information they need to make good decisions. It is, there-
fore, essential to define clear reporting lines. To ensure that key decision-makers get
objective information, AI companies may set up an internal audit team that is or-
ganizationally independent from senior management [446]. This team would assess
the adequacy and effectiveness of the company’s risk management practices, controls,
and governance processes, and report directly to the board of directors.
External assurance. Many companies are legally required to publicly report cer-
tain aspects of their governance structures, such as whether the board of directors
has an audit committee. Often, AI companies also publish information about their
released models, for example in the form of model or system cards. Some organiza-
tions disclose parts of their safety strategy and their governance practices as well; for
instance, how they plan to ensure their powerful AI systems are safe, whether they
have an ethics board, or how they conduct pre-deployment risk assessments. These
publications allow external actors like researchers and civil society organizations to
evaluate and scrutinize the company’s practices. In addition, many AI companies
commission independent experts to scrutinize their models, typically in the form of
third-party audits, red teaming exercises to adversarially test systems, or bug bounty
programs that reward finding errors and vulnerabilities.
Corporate governance refers to all the rules, practices, and processes by which a
company is directed and controlled—ranging from its legal form and place of incor-
poration to its board committees and remuneration policy. The purpose of corporate
governance is to balance the interests of a company’s shareholders with the interests
of other stakeholders, including society at large. To this end, AI companies are ad-
vised to follow existing best practices in corporate governance. However, in light of
increasing societal risks from AI, they also need to consider more innovative gover-
nance solutions, such as a capped-profit structure or a long-term benefit trust.
Overview. Government action may be crucial for AI safety. Governments have the
authority to enforce AI regulations, to direct their own AI activities, and to influ-
ence other governments through measures such as export regulations and security
agreements. Additionally, major governments can leverage their large budgets, diplo-
mats, intelligence agencies, and leaders selected to serve the public interest. More
abstractly, as we saw in the Collective Action Problems chapter, institutions can
help agents avoid harmful coordination failures. For example, penalties for unsafe AI
development can counter incentives to cut corners on safety.
This section provides an overview of some potential ways governments may be able to
advance AI safety including safety standards and regulations, liability for AI harms,
targeted taxation, and public ownership of AI. We also describe various levers for
improving societal resilience and for ensuring that countries focused on developing
AI safely do not fall behind less responsible actors.
The impact of standards. Standards are not automatically legally binding. De-
spite that, standards can advance safety in various ways. First, standards can shape
norms, because they are descriptions of best practices, often published by authorita-
tive organizations. Second, governments can mandate compliance with certain stan-
dards. Such “incorporation by reference” of an existing standard may bind both the
private sector and government agencies. Third, governments can incentivize compli-
ance with standards through non-regulatory means. For example, government agen-
cies can make compliance required for government contracts and grants, and stan-
dards compliance can be a legal defense against lawsuits.
Legislatures also often influence regulation through their control over regulatory agencies’ existence,
structure, mandates, and budgets.
Regulatory agencies do not always regulate adequately. Regulatory agencies can face
steep challenges. They can be under-resourced: lacking the budgets, staff, or author-
ities they need to do well at designing and enforcing regulations. Regulators can
also suffer from regulatory capture—being influenced into prioritizing a small inter-
est group (especially the one they are supposed to be regulating) over the broader
public interest. Industries can capture regulators by politically supporting sympa-
thetic lawmakers, providing biased expert advice and information, building personal
relationships with regulators, offering lucrative post-government jobs to lenient reg-
ulatory staff, and influencing who is appointed to lead regulatory agencies.
Standards and regulations give governments some ways to shape the behavior of AI
developers. Next, we will consider legal means to ensure that the developers of AI
have incentives in line with the rest of society.
In addition to standards and regulations, legal liability could advance AI safety. When
AI accidents or misuse cause harm, liability rules determine who (if anyone) has to
pay compensation. For example, an AI company might be required to pay for damages
if it leaks a dangerous AI, or if its AI provides a user with step-by-step instructions for
building or acquiring illegal weapons.
Legal liability is a limited tool. There are practical limits to what AI companies
can actually be held liable for. For example, if an AI were used to create a pandemic
that killed 1 in 100 people, the AI developer would likely be unable to pay beyond a
small portion of the damages owed (as these could easily be in the tens of trillions of dollars). If
the amount of harm that can be caused by AI companies exceeds what they can pay,
AI developers cannot fully internalize the costs they impose on society. This problem
can be eased by requiring liability insurance (a common requirement in the context
of driving cars), but there are amounts of compensation that even insurers could not
afford. Moreover, sufficiently severe AI catastrophes may disrupt the legal system
itself. Separately, liability does little to deter AI developers who do not expect their work to result in large harms, even if that work ultimately proves catastrophically dangerous.
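To see where a figure in the tens of trillions comes from, a crude back-of-the-envelope calculation is enough. The per-fatality economic loss below is our own deliberately conservative assumption; standard value-of-statistical-life estimates would imply far larger damages.

```python
# Crude illustration of why catastrophic harms could exceed any ability to pay.
# The per-fatality loss is an assumed, deliberately low figure.

world_population = 8e9
fatality_rate = 0.01              # "1 in 100 people"
loss_per_fatality = 5e5           # assumed $500k of economic loss per death

deaths = world_population * fatality_rate
total_damages = deaths * loss_per_fatality
print(f"deaths: {deaths:,.0f}")
print(f"implied damages: ${total_damages / 1e12:,.0f} trillion")
```

Even at this low valuation, the implied damages dwarf the balance sheet of any AI developer or insurer.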
Ensuring legal liability for harms that result from deployed AIs helps align the in-
terests of AI developers with broader social interests. Next, we will consider how
governments can mitigate harms when they do occur.
Although taxes do not directly force risk internalization, they can provide govern-
ment revenues that can be reserved for risk mitigation or disaster relief efforts. For
example, the Superfund is a US government program that funds the cleanup of aban-
doned hazardous waste sites. It is funded by excise taxes (special taxes on particular goods, services, or activities) on chemicals. The excise tax ensures that the chemical
manufacturing industry pays for the handling of dangerous waste sites that it has
created. Special taxes on AIs could support government programs to prevent risks or
address disasters.
Policy tools for resilience. To build resilience, governments could use a variety
of policy tools. For example, they could provide R&D funding to develop defensive
technologies. Additionally, they could initiate voluntary collaborations with the pri-
vate sector to assist with implementation. Governments could also use regulations to
require owners of relevant infrastructure and platforms to implement practices that
improve resilience.
If some countries take a relatively slow and careful approach to AI development, they
may risk falling behind other countries that take less cautious approaches. It would
be risky for the global leaders in AI development to be based in countries that lack adequate guardrails on AI. Various policy tools could allow states to avoid falling
behind in AI while they act to keep their own companies’ AIs safe.
International coordination on safety standards would be the preferable policy. Still, as options for cases where cooperation fails, here we consider several
policy tools for preserving national competitiveness in AI.
The likelihood of theft or leaks of model weights appears high. First, the
most advanced AI systems are likely to be highly valuable due to their ability to
perform a wide variety of economically useful activities. Second, there are strong in-
centives to steal models, given the high cost of developing state-of-the-art systems.
Lastly, these systems have an extensive attack surface because of their extremely
complex software and hardware supply chains. In recent years, even leading technol-
ogy companies have been vulnerable to major attacks, such as the Pegasus 0-click
exploit that enabled actors to gain full control of high-profile figures’ iPhones and the
2022 hack of NVIDIA by the Lapsus$ group, which claimed to have stolen proprietary
designs for its next-generation chips.
There are various attack vectors that could be exploited to steal model
weights. These include running unauthorized code that exploits vulnerabilities in
software used by AI developers, or attacking vulnerabilities in security systems them-
selves. Attacks on vendors of software and equipment used by an AI developer are
a major concern, as both the hardware and software supply chains for AI are ex-
tremely complex and involve many different suppliers. Other techniques that are less
reliant on software or hardware vulnerabilities include compromising credentials via
social engineering (e.g. phishing emails) or weak passwords, infiltrating companies
using bribes, extortion or placement of agents inside the company, and unauthorized
physical access to systems. Even without any of these attacks, abuse of legitimate Ap-
plication Programming Interfaces (APIs) can enable extraction of information about
AI systems. Research has shown that it is possible to recover portions of some of
OpenAI’s models using typical API access [449].
Stages
We can break international governance into four stages [451]. First, issues must be
recognized as requiring governance. Then, countries must come together to agree on how to govern the issue. After that, they must actually do what was agreed. Lastly,
countries’ actions must be monitored for compliance to ensure that governance is
effective into the future.
Setting agendas. The first stage of governance is agenda-setting. This involves
getting an issue recognized as a problem needing collective action. Actors have to
convince others that an issue exists and matters. Powerful entities often want to
maintain their power in the status quo, and thus deny problems or ignore their
responsibilities for dealing with them. Global governance over an issue can’t start
until it gets placed on the international agenda. Non-state actors like scientists and
advocacy groups play a key role in agenda-setting by drawing attention to issues
neglected by states; for example, scientists highlighted the emerging threat of ozone
depletion, building pressure that led to the Montreal Protocol.
Agenda-setting makes an issue a priority for collective governance. Without it, critical
problems can be overlooked due to political interests or inertia.
Policymaking. Once an issue makes the global agenda, collections of negotiat-
ing countries or international organizations often take the lead in policymaking.
Implementation and enforcement. After policies are made, the next stage is
implementing them. High-level policies are sometimes vague, allowing flexibility to
apply them; for example, the Chemical Weapons Convention relies on countries to
enforce bans on chemical weapons domestically in the ways they find most effective.
Governance controls how these policies are enforced; for instance, the International
Atomic Energy Agency (IAEA) conducts inspections to verify compliance with the
Treaty on the Non-Proliferation of Nuclear Weapons (NPT). Even if actors agree
on policies, acting on them takes resources, information, and coordination. Effective
implementation and enforcement through governance converts abstract rules into
concrete actions.
Techniques
There are many ways in which countries and other international actors govern issues
of global importance. Here, we consider six ways that past international governance
of potentially dangerous technologies has taken place, ranging from individual actors
making unilateral declarations to wide-ranging, internationally binding treaties.
Unilateral commitments. Individual countries and companies can make such announcements, either about what they would do in
response to others’ actions or to place constraints on their own behavior. While not
binding on others, unilateral commitments can change others’ best responses to a
situation, as well as build confidence and set precedents to influence international
norms. They also give credibility in pushing for broader agreements. Unilateral com-
mitments lay the groundwork for wider collective action.
Norms and standards. International norms and technical standards steer behav-
ior without formal enforcement. Norms are shared expectations of proper conduct, such as the “no first use” norm for nuclear weapons. Standards set tech-
nical best practices, like guidelines for the safe handling and storage of explosives.
Often, norms emerge through public discourse and precedent while standards arise
via expert communities. Even if they have no ability to coerce actors, strong norms
and standards shape actions nonetheless. Additionally, they make it easier to build
agreements by aligning expectations. Norms and standards are a collaborative way
to guide technology development.
Bilateral and multilateral talks. Two or more countries directly negotiate over
a variety of issues through bilateral or multilateral talks. These allow open discussion
and confidence-building between rivals, as when the US and USSR negotiated over the size and composition of their nuclear arsenals during the Cold
War. Talks aim to establish understandings to avoid technology risks like arms races.
Regular talks build relationships and identify mutual interests. While non-binding
themselves, ongoing dialogue can lay the basis for making deals. Talks are essential
for compromise and consensus between nations.
Summits and forums. Summits and forums bring together diverse stakeholders
for debate and announcements. These raise visibility on issues and build political will
for action. Major powers can make joint statements signaling priorities, like setting
goals on total carbon emissions to limit the effects of global warming. Companies and
civil society organizations can announce major initiatives. Summits set milestones for
progress and mobilize public pressure.
Treaties. Treaties offer the strongest form of governance between nations, creat-
ing obligations backed by potential punishment. Treaties have played a large role in
banning certain risky military uses of technologies, such as the 1968 Treaty on the
Non-Proliferation of Nuclear Weapons. They often contain enforcement mechanisms
like inspections. However, agreeing on enforceable rules is difficult, especially
between rivals. Verifying compliance with treaties can be challenging. Still, the bind-
ing power of treaties makes them valuable despite their potential limitations.
In this section, we have considered the various stages of international governance,
moving from recognizing an issue to addressing it, as well as a wide variety of techniques that enable this. This is a large set of tools, so we will next examine four
questions that inform our understanding of which tools are most effective for AI
governance.
We will now consider four questions that are important for the regulation of dangerous
emerging technologies:
1. Is the technology defense-dominant or offense-dominant?
2. Can compliance with international agreements be verified?
3. Is it catastrophic for international agreements to fail?
4. Can the production of the technology be controlled?
Each of these highlights important strategic variables that we are uncertain about.
They give us insights into the characteristics of the technology we are dealing with.
Consequently, they help us illustrate what sorts of international cooperation may be
possible and necessary.
Summary
By answering these questions, we can make important decisions about interna-
tional governance. First, we understand whether we need international governance or
whether AIs will be able to mitigate the harmful effects of other AIs. Second, we deter-
mine whether international agreements are possible, since we need to verify whether
actors are following the rules. Third, we can decide what features these agreements
might have; specifically, we can determine whether they must be extremely restrictive
to avoid catastrophes from a single deviation. Lastly, we consider who must agree to
govern AI by understanding whether or not a few countries can impose regulations
on the world. Even if some of these answers imply that we live in high-risk worlds,
they guide us toward actions that help mitigate this risk. We can now consider what
these actions might be.
We will now consider the specific tools that might be useful for international gover-
nance of AI. We separate this discussion into regulating AIs produced by the private
sector and AIs produced by militaries, since these have different features and thus
require different controls. For civilian AIs, certification of compliance with interna-
tional standards is the key precedent to follow. For military AIs, we can turn to
non-proliferation agreements, verification schemes, or the establishment of AI mo-
nopolies.
Civilian AI
Regulating the private sector is important and tractable. Much of the
development of advanced AIs is seemingly taking place in the private sector. As a
result, ensuring that private actors do not develop or misuse harmful technologies is a
priority. Regulating civilian development and use is also likely to be more feasible than
regulating militaries, although this might be hindered by overlaps between private
and military applications of AIs (such as civilian defense contracts) and countries’
reluctance to allow an international organization access to their firms’ activities.
Military AI
Governing military AIs is different from governing civilian ones. Most of the options
for governing AI used by militaries can be described as drawing from one of three
regimes: nonproliferation plus norms of use, verification, or monopoly.
Norms may be difficult to establish. With nuclear weapons, the norms of “no
first use” and “mutually assured destruction” created an equilibrium that limited
the use of nuclear weapons. With AIs, this might be more difficult for a variety of
reasons: for instance, AIs have a much broader range of capabilities than nuclear weapons, whose single use is easy to define, and AIs are already being widely used. Monitoring
or restricting the development or use of new AI systems requires deciding precisely
which capabilities are prohibited. If we cannot decide which capabilities are the most
dangerous, it is difficult to decide on a set of norms, which means we cannot rely on
norms to encourage the development of safe AIs.
Option 2: Verification. Many actors might be happy to limit their own devel-
opment of military technology if they can be certain their adversaries are doing the
same. Verification of this fact can enable countries to govern each other, thereby
avoiding arms races. The Chemical Weapons Convention, for instance, has provisions
for verifying member states’ compliance, including inspections of declared sites.
When it comes to critical military technologies, however, verification regimes might
need to be invasive; for instance, it might be necessary to have the authority to inspect
any site in a country. It is unclear how they could function in the face of resistance
from domestic authorities. For these reasons, a system which relies on inspection and
similar police-like methods might be entirely infeasible unless all relevant actors
agree to mutually verify and govern.
Option 3: Monopoly. The final option is a monopoly over the largest-scale com-
puting processes, which removes the incentive for risk-taking by removing adversarial
competition. Such monopolies might arise in several ways. Established AI firms may
benefit from market forces like economies of scale, such as being able to attract the
best talent and invest profits from previous ventures into new R&D. Additionally,
they might have first-mover advantages from retaining customers or exercising influ-
ence over regulation. Alternatively, several actors might agree to create a monopoly:
there are proposals like “CERN for AI” which call for this explicitly [456]. Such or-
ganizations must be focused on the right mission, perhaps using tools from corporate
governance, which is a non-trivial task. If they are, however, then they present a much
easier route to safe AI than verification and international norms and agreements.
More compute enables better results. Compute is not only essential in training AI; it is also necessary to run powerful AI models effectively. Just as we rely on our
brains to think and make decisions even after we’ve learned, AI models need compute
to process information and execute tasks even after training. If developers have access
to more compute, they can run bigger models. Since bigger models usually yield better
results, having more compute can enable better results.
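A rough sketch of this point: a widely used rule of thumb puts the cost of generating one output token at roughly two floating-point operations per model parameter. The script below applies that approximation to a few illustrative model sizes under an assumed serving budget and utilization; real throughput also depends on memory bandwidth, batching, and other factors ignored here.

```python
# Back-of-the-envelope inference costs, using the common ~2 FLOP per parameter
# per generated token approximation. Model sizes, the hardware budget, and the
# assumed utilization are illustrative, not measurements of any real system.

FLOPS_BUDGET = 1e15        # assumed sustained FLOP/s available for serving
UTILIZATION = 0.3          # fraction of peak actually achieved (assumption)

def tokens_per_second(n_params: float) -> float:
    flop_per_token = 2 * n_params
    return FLOPS_BUDGET * UTILIZATION / flop_per_token

for name, n_params in [("7B model", 7e9), ("70B model", 7e10), ("700B model", 7e11)]:
    print(f"{name}: ~{tokens_per_second(n_params):,.0f} tokens/second")
```

Doubling the compute budget roughly doubles the tokens that can be served per second at a given model size, or allows a larger model to be served at the same rate.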
While extrapolations from current trends suggest it will be decades before data center-bound systems like GPT-4 could be trained on a basic GPU [164], the shift toward greater efficiency
might speed up dramatically with just a few breakthroughs. If AI models require less
compute, especially to the point that they become commonplace on consumer devices,
regulating AI systems based on compute access might not be the most effective
approach.
To produce AI, developers need three primary factors: data, algorithms, and compute.
In this section, we will explore why governing compute appears to be a more promising
avenue than governing the other factors. A resource is governable when the entity with legitimate claims to control it, such as a government, has the ability to control and direct it. Compute is governable because:
1. It can be determined who has access to compute and how they utilize it.
2. It is possible to establish and enforce specific rules about compute.
These are true because compute is physical, excludable, and quantifiable. These char-
acteristics allow us to govern compute, making it a potential point of control in the
broader domain of AI governance. We will now consider each of these in turn.
Compute Is Physical
The first key characteristic that makes compute governable is its physicality. Compute
is physical, unlike datasets, which are virtual, or algorithms, which are intangible
ideas. This makes compute rivalrous and enables tracking and monitoring, both of
which are crucial to governance.
Compute Is Excludable
The second key characteristic that makes compute governable is its excludability.
Something is excludable if it is feasible to stop others from using it. Most privately produced goods, like automobiles, are excludable, whereas others, such as clean air or street lighting, are difficult to exclude people from consuming even when a government or company wants to. Compute is excludable because a few
entities, such as the US and the EU, can control crucial parts of its supply chain.
This means that these actors can monitor and prevent others from using compute.
The compute supply chain makes monitoring easier. As of 2023, the vast majority of advanced AI chips globally are fabricated by a single firm, Taiwan Semi-
conductor Manufacturing Company (TSMC). These chips are based on designs from
a few major companies, such as Nvidia and Google, and TSMC’s production pro-
cesses rely on photolithography machines from a similarly monopolistic industry led
by ASML [459]. Entities such as the US and EU can, therefore, regulate these com-
panies to control the supply of compute—if the supply chain dynamics do not change
dramatically over time [460]. This simplifies the tracking of frontier AI chips and the enforcement of regulatory guidelines; it is what made the 2022 US export controls on cutting-edge AI chips to China feasible. This example illustrates that these chips can be
governed. By contrast, data can be purchased from anywhere or found online, and
algorithmic advances are not excludable either, especially given the open science and
collaborative norms in the AI community.
Frequent chip replacement means governance takes effect quickly. The price-performance of AI chips is increasing exponentially. With new chips frequently making recent products obsolete, compute becomes more excludable. Historical trends show that GPUs double their price-performance approximately every 2.5 years [164].
In conjunction with the rapidly increasing demand for more compute, data centers
frequently refresh their chips and purchase vast quantities of new compute regularly
to retain competitiveness. This frequent chip turnover offers a significant window for
governance since regulations on new chips will be relevant quickly.
Compute Is Quantifiable
The third key characteristic that makes compute governable is its quantifiability.
Quantifiability refers to the ability to measure and compare both the quantity and
quality of resources. Metrics such as FLOP/s serve as a yardstick for comparing
computational capabilities across different entities. If a developer has more chips of
the same type, we can accurately deduce that they have access to more compute,
which means we can use compute to set thresholds and monitor compliance.
Quantifiability facilitates clear threshold setting. While chips and other forms of compute differ in many ways, a chip’s throughput can be quantified in FLOP/s and a training run can be quantified by the total floating-point operations (FLOP) it consumes. This allows regulators to determine how important it is to regulate a model that is being developed: models trained with large amounts of compute are likely more important to regulate. Suppose a regulator aims to regulate new models that are large and highly capable. A simple way to do this is to set a threshold on training compute, measured in FLOP, above which more regulations, permissions, and scrutiny take effect. By contrast, setting a threshold on
dataset size is less meaningful: quality of data varies enough that 25 GB of data could
contain all the text in Wikipedia or one high-definition photo album. Even worse,
algorithms are difficult to quantify at all.
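To make the threshold idea concrete, the sketch below uses a common approximation, roughly six floating-point operations per parameter per training token, to estimate a model's training compute and compare it against a trigger level. The threshold value, model sizes, and token counts are illustrative assumptions, not actual regulatory figures.

```python
# Estimate training compute with the common C ~ 6 * N * D approximation
# (N = parameters, D = training tokens) and compare it to an illustrative
# regulatory threshold. All specific numbers here are assumptions.

THRESHOLD_FLOP = 1e26      # illustrative trigger for extra scrutiny

def training_flop(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

models = {
    "small open model":   (7e9,  2e12),   # 7B params, 2T tokens (assumed)
    "frontier-scale run": (1e12, 2e13),   # 1T params, 20T tokens (assumed)
}
for name, (n, d) in models.items():
    c = training_flop(n, d)
    flagged = "requires extra review" if c >= THRESHOLD_FLOP else "below threshold"
    print(f"{name}: ~{c:.1e} FLOP -> {flagged}")
```

Under these assumed numbers, only the frontier-scale run crosses the illustrative threshold, which is the kind of bright-line test that would be much harder to define in terms of dataset size or algorithmic sophistication.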
Quantifiability is key to monitoring compliance. Beyond the creation of
thresholds, quantifiability also helps us monitor compliance. Given the physical na-
ture and finite capacity of compute, we can tell whether an actor has sufficient com-
putational power from the type and quantity of chips they possess. A regulator might
require organizations with at least 1000 chips at least as good as A100s to submit
themselves for additional auditing processes. A higher number of chips directly cor-
relates to more substantial computational capabilities, unlike with algorithms where
there is no comparable metric and data for which metrics are much less precise. In
addition to compute being physical and so traceable, this enables the enforcement of
rules and thresholds.
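The following is a minimal sketch of how such a chip-based reporting rule could be operationalized: convert each organization's inventory into rough "A100-equivalents" by rated throughput and flag anyone at or above the 1,000-chip level used in the example above. The throughput figures, inventories, and the equivalence method are our own illustrative assumptions.

```python
# Sketch of a chip-count compliance check: convert an inventory into
# "A100-equivalents" by rated throughput and flag organizations above a
# reporting threshold. Throughput figures are rough and for illustration only.

A100_FLOPS = 312e12                       # approx. dense 16-bit TFLOP/s
CHIP_FLOPS = {"A100": 312e12, "H100": 1.0e15, "older-gpu": 120e12}
REPORTING_THRESHOLD = 1000                # A100-equivalents, as in the example

def a100_equivalents(inventory: dict) -> float:
    total_flops = sum(count * CHIP_FLOPS[chip] for chip, count in inventory.items())
    return total_flops / A100_FLOPS

orgs = {
    "LabA": {"H100": 600},                # hypothetical inventories
    "LabB": {"A100": 400, "older-gpu": 500},
}
for org, inventory in orgs.items():
    eq = a100_equivalents(inventory)
    status = "must register for auditing" if eq >= REPORTING_THRESHOLD else "exempt"
    print(f"{org}: ~{eq:,.0f} A100-equivalents -> {status}")
```

Because chips are physical and their rated throughput is public, such an inventory check is straightforward to audit, unlike an equivalent rule phrased in terms of data or algorithms.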
The chips used to train and run advanced AI systems are typically housed in purpose-built data center facilities which are equipped to handle
their high demands for power and cooling. These data centers usually also have high
levels of physical security to protect this valuable hardware and the AI models that
are run on it. One approach to monitoring compute which is imperfect but requires
limited new technologies or regulation is to track these data center facilities. However,
such an approach faces challenges in terms of identifying which facilities are housing
chips used for training AI as opposed to other types of computing hardware, or
understanding how many chips they house and what they are being used for. Other
approaches rely more heavily on tracking the individual chips suitable for AI training
and inference, for example via some form of registry of chip sales, or via “on-chip
mechanisms.”
8.8 CONCLUSION
In the introduction, we laid out the purpose of this chapter: understanding the fun-
damentals of how to govern AI. In other words, we wanted to understand how to
organize, manage, and steer the development and deployment of AI technologies us-
ing an array of tools including norms, policies, and institutions. To set the scene, we
considered a set of actors and tools that governance needs to consider.
Growth. We explored how much AI might accelerate economic growth. AI has the
potential to significantly boost economic growth by augmenting the workforce, im-
proving labor efficiency, and accelerating technological progress. However, the extent
of this impact is debated, with some predicting explosive growth while others believe
it will be tempered by social and economic factors. The semi-endogenous growth
theory suggests that population growth, by expanding the labor force and fostering
innovation, has historically driven economic acceleration. Similarly, AI could enhance
economic output by substituting for human labor and self-improving, creating a posi-
tive feedback loop. Nonetheless, constraints such as limited physical resources, dimin-
ishing returns on research and development, gradual technology adoption, regulatory
measures, and tasks that resist automation could moderate the growth induced by
AI. Therefore, while AI’s contribution to economic growth is likely to be significant,
whether it will result in unprecedented expansion or face limitations remains uncer-
tain.
Distribution. We also considered how the benefits of, access to, and power over AI systems might be distributed, including the risk that power becomes concentrated in a single AI or a small group pursuing its own values or goals. However, distributing power more widely among a large, diverse ecosystem of AIs also has downsides, like increasing the potential for misuse or making it more difficult to correct AI systems that begin behaving undesirably.
Compute Governance. We explored how governing access to and use of the com-
puting resources that enable AI development could provide an important lever for
influencing the trajectory of AI progress. Compute is an indispensable input to devel-
oping advanced AI capabilities. It also has properties like physicality, excludability,
and quantifiability that make governing it more feasible than other inputs like data
and algorithms. Compute governance can allow control over who is granted access to
what levels of computational capabilities, controlling who can create advanced AIs. It
also facilitates setting and enforcing safety standards for how compute can be used,
enabling the steering of AI development.
8.9 LITERATURE
Acknowledgments
It would not have been possible to complete a book of this scale on my own without
spending several years on it. This would have significantly reduced the book’s value,
given how urgent it is for us to understand and address important AI risks. I received
invaluable support during the research, writing, and editing process from various
assistants, collaborators, and reviewers both at CAIS and externally that enabled me
to complete this book in a more timely way.
I would like to acknowledge the major contributions to the book’s chapters made by
the following people:
1. Overview of Catastrophic AI Risks: Mantas Mazeika and Thomas Woodside
2. Artificial Intelligence Fundamentals: Anna Swanson, Jeremy Hadfield, Adam
Elwood, and Corin Katzke
3. Single-Agent Safety: Jesse Hoogland, Adam Khoja, Thomas Woodside, Abra
Ganz, Aidan O’Gara, Spencer Becker-Kahn, Jeremy Hadfield, and Joshua Clymer
4. Safety Engineering: Laura Hiscott and Suryansh Mehta
5. Complex Systems: Laura Hiscott and Max Heitmann
6. Beneficial AI and Machine Ethics: Suryansh Mehta, Shankara Srikantan,
Aron Vallinder, Jeremy Hadfield, and Toby Tremlett
7. Collective Action Problems: Ivo Andrews, Sasha Cadariu, Avital Morris, and
David Lambert
8. Governance: Jonas Schuett, Robert Trager, Lennart Heim, Matthew Barnett,
Mauricio Baker, Thomas Woodside, Suryansh Mehta, Laura Hiscott, and Shankara
Srikantan
9. Appendix—Normative Ethics: Toby Tremlett and Suryansh Mehta
10. Appendix—Utility Functions: Suryansh Mehta and Shankara Srikantan
I am deeply grateful to the following key individuals for their help in driving this
project forward:
• Jay Kim and William Hodgkins (Project Manager)
• Rebecca Rothwell and Suryansh Mehta (Editor)
• Corin Katzke (Course Manager)
I would also like to thank Mantas Mazeika, Paul Salmon, Matthew Lutz, Bryan
Daniels, Martin Stoffel, Charlotte Siegmann, Lena Trabucco, Casey Mahoney, the
2023 CAIS philosophy fellows, and the participants in our AI Safety Sprint Course
for their helpful feedback during the writing and revision process. All errors that
remain are, of course, my own.
Dan Hendrycks
References
Introduction
[1] Kate Crawford. Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. eng. New Haven: Yale University Press, 2022. isbn:
9780300209570.
[2] Laura Weidinger et al. “Ethical and social risks of harm from Language Mod-
els”. In: arXiv preprint arXiv:2112.04359 (2021).
[46] Robert Sherefkin. “Ford 100: Defective Pinto Almost Took Ford’s Reputation
With It”. In: Automotive News (June 2003).
[47] Lee Strobel. Reckless Homicide?: Ford’s Pinto Trial. en. And Books, 1980.
[48] Grimshaw v. Ford Motor Co. May 1981.
[49] Paul C. Judge. “Selling Autos by Selling Safety”. en-US. In: The New York
Times (Jan. 1990).
[50] Theo Leggett. “737 Max crashes: Boeing says not guilty to fraud charge”.
en-GB. In: BBC News (Jan. 2023).
[51] Edward Broughton. “The Bhopal disaster and its aftermath: a review”. In:
Environmental Health 4.1 (May 2005), p. 6.
[52] Charlotte Curtis. “Machines vs. Workers”. en-US. In: The New York Times
(Feb. 1983).
[53] Thomas Woodside et al. “Examples of AI Improving AI”. 2023. url: https://ai-improving-ai.safe.ai.
[54] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of
Control. en. Penguin, Oct. 2019.
[55] Dan Hendrycks. “Natural Selection Favors AIs over Humans”. In: ArXiv
abs/2303.16200 (2023).
[56] Dan Hendrycks. The Darwinian Argument for Worrying About AI. 2023.
[57] Richard C. Lewontin. “The Units of Selection”. In: Annual Review of Ecology,
Evolution, and Systematics 1 (1970), pp. 1–18.
[58] Ethan Kross et al. “Facebook use predicts declines in subjective well-being in
young adults”. In: PLOS One (2013).
[59] Laura Martínez-Íñigo et al. “Intercommunity interactions and killings in cen-
tral chimpanzees (Pan troglodytes troglodytes) from Loango National Park,
Gabon”. In: Primates; Journal of Primatology 62 (2021), pp. 709–722.
[60] Anne E. Pusey and Craig Packer. “Infanticide in Lions: Consequences and
Counterstrategies”. In: Infanticide and Parental Care (1994), p. 277.
[61] Peter D. Nagy and Judit Pogany. “The dependence of viral RNA replication on
co-opted host factors”. In: Nature Reviews. Microbiology 10 (2011), pp. 137–
149.
[62] Alfred Buschinger. “Social Parasitism among Ants: A Review”. In: Myrmeco-
logical News 12 (Sept. 2009), pp. 219–235.
[63] Greg Brockman, Ilya Sutskever, and OpenAI. Introducing OpenAI. Dec. 2015.
[64] Devin Coldewey. OpenAI shifts from nonprofit to ‘capped-profit’ to attract
capital. Mar. 2019.
[65] Kyle Wiggers, Devin Coldewey, and Manish Singh. Anthropic’s $5B, 4-year
plan to take on OpenAI. Apr. 2023.
[66] Center for AI Safety. Statement on AI Risk (“Mitigating the risk of extinction
from AI should be a global priority alongside other societal-scale risks such as
pandemics and nuclear war.”) 2023. url: https://www.safe.ai/statement-on-ai-risk.
[67] Richard Danzig et al. Aum Shinrikyo: Insights into How Terrorists Develop
Biological and Chemical Weapons. Tech. rep. Center for a New American Se-
curity, 2012. url: https://www.jstor.org/stable/resrep06323.
[68] John Uri. 35 Years Ago: Remembering Challenger and Her Crew. Jan. 2021.
[69] International Atomic Energy Agency. The Chernobyl Accident: Updating of
INSAG-1. Technical Report INSAG-7. Vienna, Austria: International Atomic
Energy Agency, 1992.
[70] Matthew Meselson et al. “The Sverdlovsk anthrax outbreak of 1979”. In: Sci-
ence 266 5188 (1994), pp. 1202–8.
[71] Daniel M. Ziegler et al. “Fine-tuning language models from human prefer-
ences”. In: arXiv preprint arXiv:1909.08593 (2019).
[72] Charles Perrow. Normal Accidents: Living with High-Risk Technologies.
Princeton, NJ: Princeton University Press, 1984.
[73] Mitchell Rogovin and George T. Frampton Jr. Three Mile Island: a re-
port to the commissioners and to the public. Volume I. English. Tech. rep.
NUREG/CR-1250(Vol.1). Nuclear Regulatory Commission, Washington, DC
(United States). Three Mile Island Special Inquiry Group, Jan. 1979.
[74] Richard Rhodes. The Making of the Atomic Bomb. New York: Simon & Schus-
ter, 1986.
[75] Sébastien Bubeck et al. “Sparks of Artificial General Intelligence: Early ex-
periments with GPT-4”. In: ArXiv abs/2303.12712 (2023).
[76] Theodore I. Lidsky and Jay S. Schneider. “Lead neurotoxicity in children:
basic mechanisms and clinical correlates”. In: Brain: a Journal of Neurology
126 Pt 1 (2003), pp. 5–19.
[77] Brooke T. Mossman et al. “Asbestos: scientific developments and implications
for public policy”. In: Science 247 4940 (1990), pp. 294–301.
[78] Kate Moore. The Radium Girls: The Dark Story of America’s Shining Women.
Naperville, IL: Sourcebooks, 2017.
[79] Stephen S. Hecht. “Tobacco smoke carcinogens and lung cancer”. In: Journal
of the National Cancer Institute 91 14 (1999), pp. 1194–210.
[80] Mario J. Molina and F. Sherwood Rowland. “Stratospheric sink for chloroflu-
oromethanes: chlorine atom-catalysed destruction of ozone”. In: Nature 249
(1974), pp. 810–812.
[81] James H. Kim and Anthony R. Scialli. “Thalidomide: the tragedy of birth
defects and the effective treatment of disease”. In: Toxicological Sciences: an
Official Journal of the Society of Toxicology 122 1 (2011), pp. 1–6.
[82] Betul Keles, Niall McCrae, and Annmarie Grealish. “A systematic review: the
influence of social media on depression, anxiety and psychological distress in
adolescents”. In: International Journal of Adolescence and Youth 25 (2019),
pp. 79–93.
[83] Zakir Durumeric et al. “The Matter of Heartbleed”. In: Proceedings of the
2014 Conference on Internet Measurement Conference (2014).
[84] Tony Tong Wang et al. “Adversarial Policies Beat Professional-Level Go AIs”.
In: ArXiv abs/2211.00241 (2022).
[85] T.R. La Porte and Paula M. Consolini. “Working in Practice But Not in The-
ory: Theoretical Challenges of ‘High-Reliability Organizations’”. In: Journal
of Public Administration Research and Theory 1 (1991), pp. 19–48.
[86] Thomas G. Dietterich. “Robust artificial intelligence and robust human orga-
nizations”. In: Frontiers of Computer Science 13 (2018), pp. 1–3.
[87] N. Leveson. Engineering a Safer World: Systems Thinking Applied to Safety.
Engineering systems. MIT Press, 2011. isbn: 9780262016629. url: https://b
ooks.google.com/books?id=0gZ_7n5p8MQC.
[88] David Manheim. Building a Culture of Safety for AI: Perspectives and Chal-
lenges. 2023.
[89] National Research Council et al. Lessons Learned from the Fukushima Nuclear
Accident for Improving Safety of U.S. Nuclear Plants. Washington, D.C.: Na-
tional Academies Press, Oct. 2014.
[90] Diane Vaughan. The Challenger Launch Decision: Risky Technology, Culture,
and Deviance at NASA. Chicago, IL: University of Chicago Press, 1996.
[91] Dan Lamothe. Air Force Swears: Our Nuke Launch Code Was Never
’00000000’. Jan. 2014.
[92] Toby Ord. The precipice: Existential risk and the future of humanity. Hachette
Books, 2020.
[93] U.S. Nuclear Regulatory Commission. Final Safety Culture Policy Statement.
Federal Register. 2011.
[94] Bruce Schneier. “Inside the Twisted Mind of the Security Professional”. In:
Wired (Mar. 2008).
[95] Dan Hendrycks and Mantas Mazeika. “X-Risk Analysis for AI Research”. In:
ArXiv abs/2206.05862 (2022).
[96] Donald T. Campbell. “Assessing the impact of planned social change”. In:
Evaluation and Program Planning 2.1 (1979), pp. 67–90.
[97] Yohan J. John et al. “Dead rats, dopamine, performance metrics, and peacock
tails: proxy failure is an inherent risk in goal-oriented systems”. In: Behavioral
and Brain Sciences (2023), pp. 1–68. doi: 10.1017/S0140525X23002753.
[98] Jonathan Stray. “Aligning AI Optimization to Community Well-Being”. In:
International Journal of Community Well-Being (2020).
[99] Jonathan Stray et al. “What are you optimizing for? Aligning Recommender
Systems with Human Values”. In: ArXiv abs/2107.10939 (2021).
[100] Ziad Obermeyer et al. “Dissecting racial bias in an algorithm used to manage
the health of populations”. In: Science 366 (2019), pp. 447–453.
[101] Dario Amodei and Jack Clark. Faulty reward functions in the wild. 2016.
[102] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. “The effects of re-
ward misspecification: Mapping and mitigating misaligned models”. In: ICLR
(2022).
[103] G. Thut et al. “Activation of the human brain by monetary reward”. In:
Neuroreport 8.5 (1997), pp. 1225–1228.
[104] Edmund T. Rolls. “The Orbitofrontal Cortex and Reward”. In: Cerebral Cor-
tex 10.3 (Mar. 2000), pp. 284–294.
[105] T. Schroeder. Three Faces of Desire. Philosophy of Mind Series. Oxford Uni-
versity Press, USA, 2004.
[106] Joseph Carlsmith. Is Power-Seeking AI an Existential Risk? 2022. arXiv: 220
6.13353 [cs.CY]. url: https://arxiv.org/abs/2206.13353.
[107] John J. Mearsheimer. Structural Realism. 2007, pp. 77–94.
[108] Bowen Baker et al. “Emergent Tool Use From Multi-Agent Autocurricula”.
In: International Conference on Learning Representations. 2020.
[109] Dylan Hadfield-Menell et al. “The Off-Switch Game”. In: IJCAI (2017).
[110] Alexander Pan et al. “Do the Rewards Justify the Means? Measuring Trade-
Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark”.
In: ICML (2023).
[111] “Lyndon Baines Johnson”. In: Oxford Reference (2016).
[112] Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Human-level
play in the game of Diplomacy by combining language models with strategic
reasoning. 2022. doi: 10.1126/science.ade9097. eprint: https://www.science.o
rg/doi/pdf/10.1126/science.ade9097. url: https://www.science.org/doi/abs
/10.1126/science.ade9097.
[113] Paul Christiano et al. Deep reinforcement learning from human preferences.
Discussed in https://www.deepmind.com/blog/specification-gaming-the-flip
-side-of-ai-ingenuity. 2017. arXiv: 1706.03741.
[114] Xinyun Chen et al. Targeted Backdoor Attacks on Deep Learning Systems
Using Data Poisoning. 2017. arXiv: 1712.05526.
[115] Nick Beckstead. On the overwhelming importance of shaping the far future.
2013.
[116] Jens Rasmussen. “Risk management in a Dynamic Society: A Modeling Prob-
lem”. English. In: Proceedings of the Conference on Human Interaction with
Complex Systems, 1996.
[132] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classi-
fication with Deep Convolutional Neural Networks”. In: Advances in Neural
Information Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran As-
sociates, Inc., 2012. url: https://proceedings.neurips.cc/paper_files/paper
/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
[133] David Silver et al. “Mastering the game of Go with deep neural networks and
tree search”. In: Nature 529 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16
961.
[134] Alec Radford et al. “Language Models are Unsupervised Multitask Learners”.
2019. url: https://api.semanticscholar.org/CorpusID:160025533.
[135] Meredith Ringel Morris et al. Levels of AGI for Operationalizing Progress on
the Path to AGI. 2024. arXiv: 2311.02462 [cs.AI].
[136] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer
New York, 2016.
[137] Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT
Press, 2022.
[138] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and
Out-of-Distribution Examples in Neural Networks. 2018. arXiv: 1610.02136 [cs.NE].
[139] Dan Hendrycks. Anomaly and Out-of-Distribution Detection. url: https://do
cs.google.com/presentation/d/1WEzSFUbcl1Rp4kQq1K4uONMJHBAUWh
CZTzWVHnLcSV8/edit#slide=id.g60c1429d79_0_0.
[140] Walber. Precision Recall. Creative Commons Attribution-Share Alike 4.0 In-
ternational license. The layout of the figure has been changed. 2014. url:
https://en.wikipedia.org/wiki/File:Precisionrecall.svg.
[141] cmglee and Martin Thoma. Creative Commons Attribution-Share Alike 4.0
International license. The colours of the figure have been changed. 2018. url:
https://commons.wikimedia.org/wiki/File:Roc_curve.svg.
[142] Haiyin Luo and Yuhui Zheng. “Semantic Residual Pyramid Network for Image
Inpainting”. In: Information 13.2 (Jan. 2022). Creative Commons Attribution
4.0 International license. url: https://doi.org/10.3390/info13020071.
[143] Y. LeCun, Y. Bengio, and G. Hinton. “Deep Learning”. In: Nature (521 2015),
pp. 436–444. url: https://doi.org/10.1038/nature14539.
[144] Notjim. Creative Commons Attribution-Share Alike 3.0 Unported License.
url: https://commons.wikimedia.org/wiki/File:Neuron_-_annotated.svg.
[145] T.M. Mitchell. url: https://commons.wikimedia.org/wiki/File:Rosenblattpe
rceptron.png.
[146] Machine Learning for Artists, MNIST Input, 2018. GPL 2.0 License. url:
https://ml4a.github.io/demos/f_mnist_input/ (visited on 04/28/2024).
[147] Chuan Lin, Qing Chang, and Xianxu Li. “A Deep Learning Approach for
MIMO-NOMA Downlink Signal Detection”. In: Sensors 19 (June 2019),
p. 2526. doi: 10.3390/s19112526.
[148] Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs).
2023. arXiv: 1606.08415 [cs.LG].
[149] Vinod Nair and Geoffrey Hinton. “Rectified Linear Units Improve Restricted
Boltzmann Machines”. In: Proceedings of the 27th International Conference
on Machine Learning. Vol. 27. June 2010, pp. 807–814.
[150] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
[151] Febin Sunny, Mahdi Nikdast, and Sudeep Pasricha. SONIC: A Sparse Neu-
ral Network Inference Accelerator with Silicon Photonics for Energy-Efficient
Deep Learning. Creative Commons Attribution 4.0 International license. Sept.
2021.
[152] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in Neural
Information Processing Systems. Ed. by I. Guyon et al. Vol. 30. Curran Asso-
ciates, Inc., 2017. url: https://proceedings.neurips.cc/paper_files/paper/20
17/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[153] Ravish Raj. Ed. by Andrew Ng Lectures. url: https://www.enjoyalgorithms.com/blog/parameter-learning-and-gradient-descent-in-ml (visited on 09/28/2023).
[154] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning
representations by back-propagating errors”. In: Nature 323 (1986), pp. 533–
536. url: https://api.semanticscholar.org/CorpusID:205001834.
[155] Y. LeCun et al. “Gradient-based learning applied to document recognition”.
In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. doi: 10.1109/5.726
791.
[156] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In:
Neural Computation 9.8 (Nov. 1997), pp. 1735–1780. issn: 0899-7667. doi:
10.1162/neco.1997.9.8.1735. eprint: https://direct.mit.edu/neco/article-pdf
/9/8/1735/813796/neco.1997.9.8.1735.pdf. url: https://doi.org/10.1162/nec
o.1997.9.8.1735.
[157] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
[158] Joel Hestness et al. Deep Learning Scaling is Predictable, Empirically. 2017.
arXiv: 1712.00409 [cs.LG].
[159] Jared Kaplan et al. Scaling Laws for Neural Language Models. 2020. arXiv:
2001.08361 [cs.LG].
[160] Jordan Hoffmann et al. Training Compute-Optimal Large Language Models.
2022. arXiv: 2203.15556 [cs.CL].
Single-Agent Safety
[54] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of
Control. en. Penguin, Oct. 2019.
[75] Sébastien Bubeck et al. “Sparks of Artificial General Intelligence: Early ex-
periments with GPT-4”. In: ArXiv abs/2303.12712 (2023).
[101] Dario Amodei and Jack Clark. Faulty reward functions in the wild. 2016.
[102] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. “The effects of re-
ward misspecification: Mapping and mitigating misaligned models”. In: ICLR
(2022).
[106] Joseph Carlsmith. Is Power-Seeking AI an Existential Risk? 2022. arXiv: 220
6.13353 [cs.CY]. url: https://arxiv.org/abs/2206.13353.
[107] John J. Mearsheimer. Structural Realism. 2007, pp. 77–94.
[108] Bowen Baker et al. “Emergent Tool Use From Multi-Agent Autocurricula”.
In: International Conference on Learning Representations. 2020.
[183] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside Convo-
lutional Networks: Visualising Image Classification Models and Saliency Maps.
Apr. 2014. eprint: 1312.6034 (cs). (Visited on 09/15/2023).
[184] Jost Tobias Springenberg et al. Striving for Simplicity: The All Convolutional
Net. Apr. 2015. arXiv: 1412.6806 [cs]. (Visited on 09/15/2023).
[185] Julius Adebayo et al. “Sanity Checks for Saliency Maps”. In: Advances in
Neural Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018.
(Visited on 09/15/2023).
[186] Kevin Wang et al. Interpretability in the Wild: a Circuit for Indirect Object
Identification in GPT-2 small. 2022. arXiv: 2211.00593 [cs.LG].
[187] Chris Olah et al. “Zoom In: An Introduction to Circuits”. In: Distill 5.3 (Mar.
2020), e00024.001. issn: 2476-0757. doi: 10.23915/distill.00024.001. (Visited
on 09/15/2023).
[188] Kevin Meng et al. Locating and Editing Factual Associations in GPT. The
listed statement was inspired by, but not included in this paper. Jan. 2023.
eprint: 2202.05262 (cs). (Visited on 09/15/2023).
[189] Andy Zou et al. Representation Engineering: A Top-Down Approach to AI
Transparency. 2023. arXiv: 2310.01405 [cs.LG].
[190] Jerry Tang et al. “Semantic reconstruction of continuous language from non-
invasive brain recordings”. In: bioRxiv (2022). doi: 10.1101/2022.09.29.50974
4. eprint: https://www.biorxiv.org/content/early/2022/09/29/2022.09.29.50
9744.full.pdf. url: https://www.biorxiv.org/content/early/2022/09/29/202
2.09.29.509744.
[191] P.W. Anderson. “More Is Different”. In: Science 177.4047 (Aug. 1972),
pp. 393–396. doi: 10.1126/science.177.4047.393.
[192] Jacob Steinhardt. More Is Different for AI. Jan. 2022. (Visited on
09/16/2023).
[193] Jason Wei et al. Emergent Abilities of Large Language Models. 2022. arXiv:
2206.07682 [cs.CL]. url: https://arxiv.org/abs/2206.07682.
[194] Thomas McGrath et al. “Acquisition of chess knowledge in AlphaZero”. In:
Proceedings of the National Academy of Sciences 119.47 (Nov. 2022). doi: 10
.1073/pnas.2206625119. url: https://doi.org/10.1073%2Fpnas.2206625119.
[195] GPT-4 System Card. Tech. rep. OpenAI, Mar. 2023, p. 60. (Visited on
09/16/2023).
[196] Andy Zou et al. “Forecasting Future World Events with Neural Networks”.
In: NeurIPS (2022).
[197] Oriol Vinyals et al. Grandmaster level in StarCraft II using multi-agent rein-
forcement learning. Vol. 575. Nov. 2019. doi: 10.1038/s41586-019-1724-z.
[198] L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement Learning: A
Survey. 1996. arXiv: cs/9605103 [cs.AI].
[215] Anish Athalye et al. Fooling Neural Networks in the Physical World. Oct. 2017.
(Visited on 09/15/2023).
[216] Andy Zou et al. Universal and Transferable Adversarial Attacks on Aligned
Language Models. July 2023. doi: 10.48550/arXiv.2307.15043. eprint: 2307.1
5043 (cs). (Visited on 09/15/2023).
[217] Disrupting malicious uses of AI by state-affiliated threat actors. url: https:
//openai.com/blog/disrupting-malicious-uses-of-ai-by-state-affiliated-threa
t-actors.
[218] Nicholas Carlini et al. Poisoning Web-Scale Training Datasets is Practical.
2023. arXiv: 2302.10149 [cs.CR].
[219] Nicholas Carlini et al. “Extracting Training Data from Large Language Mod-
els”. In: CoRR abs/2012.07805 (2020). arXiv: 2012.07805. url: https://arxi
v.org/abs/2012.07805.
[220] Milad Nasr et al. Scalable Extraction of Training Data from (Production) Lan-
guage Models. 2023. arXiv: 2311.17035 [cs.LG].
[221] Peter S. Park et al. AI Deception: A Survey of Examples, Risks, and Potential
Solutions. Aug. 2023. doi: 10.48550/arXiv.2308.14752. arXiv: 2308.14752 [cs]. (Visited on 09/16/2023).
[222] Julien Perolat et al. “Mastering the Game of Stratego with Model-Free Multia-
gent Reinforcement Learning”. In: Science 378.6623 (Dec. 2022), pp. 990–996.
issn: 0036-8075, 1095-9203. doi: 10.1126/science.add4679. eprint: 2206.15378
(cs). (Visited on 09/16/2023).
[223] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How
Models Mimic Human Falsehoods. May 2022. doi: 10.48550/arXiv.2109.07958.
eprint: 2109.07958 (cs). (Visited on 09/16/2023).
[224] Paul Ekman. “Emotions revealed”. In: BMJ 328.Suppl S5 (2004). doi: 10.11
36/sbmj.0405184. eprint: https://www.bmj.com/content. url: https://www
.bmj.com/content/328/Suppl_S5/0405184.
[225] Philip Powell and Jennifer Roberts. “Situational determinants of cognitive,
affective, and compassionate empathy in naturalistic digital interactions”. In:
Computers in Human Behaviour 68 (Nov. 2016), pp. 137–148. doi: 10.1016/j
.chb.2016.11.024.
[226] Michael Wai and Niko Tiliopoulos. “The affective and cognitive empathic na-
ture of the dark triad of personality”. In: Personality and Individual Differ-
ences 52 (May 2012), pp. 794–799. doi: 10.1016/j.paid.2012.01.008.
[227] S. Baron-Cohen. Zero Degrees of Empathy: A New Theory of Human Cruelty.
Penguin UK, 2011.
[228] Tania Singer and Olga M. Klimecki. “Empathy and compassion”. In: Current
Biology 24.18 (2014), R875–R878. issn: 0960-9822. doi: https://doi.org/10.1
016/j.cub.2014.06.054. url: https://www.sciencedirect.com/science/article
/pii/S0960982214007702.
[229] Russell Hotten. “Volkswagen: The Scandal Explained”. In: BBC News (Sept.
2015). (Visited on 09/16/2023).
[230] Stephen Casper et al. Red Teaming Deep Neural Networks with Feature Syn-
thesis Tools. July 2023. eprint: 2302.10894 (cs). (Visited on 09/16/2023).
[231] J.R. French and Bertram H. Raven. “The bases of social power”. In: 1959.
[232] B.H. Raven. “Social influence and power”. In: (1964).
[233] T. Hobbes. Hobbes’s Leviathan. 1651. isbn: 978-5-87635-264-4.
[234] Luke Muehlhauser and Anna Salamon. “Intelligence Explosion: Evidence and
Import”. In: Singularity Hypotheses: A Scientific and Philosophical Assess-
ment. Ed. by Amnon H. Eden and James H. Moor. Springer, 2012, pp. 15–
40.
[235] Alexander Pan et al. Do the Rewards Justify the Means? Measuring Trade-Offs
Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
2023. arXiv: 2304.03279 [cs.LG].
[236] Stephen Omohundro. “The basic AI drives”. In: AGI 2008: Proceedings of the
First Conference on Artificial General Intelligence. 2008.
[237] David Silver. Exploration and exploitation. Lecture 9. 2014.
[238] Charles W. Kegley and Shannon L. Blanton. Trend and Transformation, 2014
- 2015. 2020, p. 259.
[239] Kenneth Waltz. Theory of international politics. Waveland Press, 2010, p. 93.
[240] W. Julian Korab-Karpowicz. “Political Realism in International Relations”.
In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Summer
2018. Metaphysics Research Lab, Stanford University, 2018.
[241] Evan Montgomery. “Breaking Out of the Security Dilemma: Realism, Reas-
surance, and the Problem of Uncertainty”. In: International Security - INT
SECURITY 31 (Oct. 2006), pp. 151–185. doi: 10.1162/isec.2006.31.2.151.
[242] Long Ouyang et al. “Training language models to follow instructions with
human feedback”. In: ArXiv (2022).
[243] Rafael Rafailov et al. Direct Preference Optimization: Your Language Model
is Secretly a Reward Model. 2023. arXiv: 2305.18290 [cs.LG].
[244] Nathaniel Li et al. The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. 2024. arXiv: 2403.03218 [cs.CR].
[245] Dan Hendrycks et al. Unsolved Problems in ML Safety. 2022. arXiv: 2109.13
916 [cs.LG].
[246] Vitalik Buterin. My techno-optimism. url: https://vitalik.eth.limo/general
/2023/11/27/techno_optimism.html.
[247] United States Government Accountability Office. Critical Infrastructure Pro-
tection: Agencies Need to Enhance Oversight of Ransomware Practices and
Assess Federal Support. 2024.
[248] Minghao Shao et al. An Empirical Evaluation of LLMs for Solving Offensive
Security Challenges. 2024. arXiv: 2402.11814 [cs.CR].
[249] How AI Can Help Prevent Biosecurity Disasters. url: https://ifp.org/how-a
i-can-help-prevent-biosecurity-disasters/.
[250] Biotech begins human trials of drug designed by artificial intelligence. url:
https://www.ft.com/content/82071cf2-f0da-432b-b815-606d602871fc.
[251] Maialen Berrondo-Otermin and Antonio Sarasa-Cabezuelo. “Application of
Artificial Intelligence Techniques to Detect Fake News: A Review”. In: Elec-
tronics 12.24 (2023). issn: 2079-9292. doi: 10.3390/electronics12245041. url:
https://www.mdpi.com/2079-9292/12/24/5041.
[252] Audio deepfakes emerge as weapon of choice in election disinformation. url:
https://www.ft.com/content/bd75b678-044f-409e-b987-8704d6a704ea.
[253] David Ilić. “Unveiling the General Intelligence Factor in Language Models: A
Psychometric Approach”. In: ArXiv abs/2310.11616 (2023).
[254] Yohan John et al. “Dead rats, dopamine, performance metrics, and peacock
tails: proxy failure is an inherent risk in goal-oriented systems”. In: The Be-
havioral and brain sciences (June 2023), pp. 1–68. doi: 10.1017/S0140525X2
3002753.
[255] J. Dmitri Gallow. “Instrumental Divergence”. In: Philosophical Studies (2024),
pp. 1–27. doi: 10.1007/s11098-024-02129-3. Forthcoming.
[256] Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem
from a deep learning perspective. 2023. arXiv: 2209.00626 [cs.AI].
Safety Engineering
[42] Richard Danzig. Technology Roulette: Managing Loss of Control as Many Mili-
taries Pursue Technological Superiority. Tech. rep. Center for a New American
Security, June 2018.
[87] N. Leveson. Engineering a Safer World: Systems Thinking Applied to Safety.
Engineering systems. MIT Press, 2011. isbn: 9780262016629. url: https://b
ooks.google.com/books?id=0gZ_7n5p8MQC.
[257] Dan Hendrycks and Mantas Mazeika. X-Risk Analysis for AI Research. 2022.
arXiv: 2206.05862 [cs.CY].
[258] Terence Tao. “Nines of safety: a proposed unit of measurement of risk”. In:
2021.
[259] Nassim Nicholas Taleb. Antifragile: Things That Gain from Disorder. Incerto.
Random House Publishing Group, 2012. isbn: 9780679645276. url: https://b
ooks.google.com.au/books?id=5fqbz_qGi0AC.
[260] E. Marsden. Designing for safety: Inherent safety, designed in. 2017. url:
https://risk-engineering.org/safe-design/ (visited on 07/31/2017).
[261] James Reason. “The contribution of latent human failures to the breakdown
of complex systems”. In: Philosophical Transactions of the Royal Society of
London. Series B, Biological Sciences 327 1241 (1990), pp. 475–84. url: http
s://api.semanticscholar.org/CorpusID:1249973.
[262] William Vesely et al. Fault Tree Handbook with Aerospace Applications. 2002. url: http://www.mwftr.com/CS2/Fault%20Tree%20Handbook_NASA.pdf.
[263] Andrew Critch and Stuart Russell. TASRA: a Taxonomy and Analysis of
Societal-Scale Risks from AI. 2023. arXiv: 2306.06924 [cs.AI].
[264] Nancy G. Leveson. Introduction to STAMP: Part 1. MIT Partnership for Sys-
tems Approaches to Safety and Security (PSASS). 2020.
[265] Arden Albee et al. Report on the Loss of the Mars Polar Lander and Deep
Space 2 Missions. Tech. rep. Jet Propulsion Laboratory - California Institute
of Technology, 2000.
[266] Shalaleh Rismani et al. “Beyond the ML Model: Applying Safety Engineer-
ing Frameworks to Text-to-Image Development”. In: Proceedings of the 2023
AAAI/ACM Conference on AI, Ethics, and Society. AIES ’23. Montréal, QC,
Canada, Association for Computing Machinery, 2023, pp. 70–83. doi: 10.114
5/3600211.3604685. url: https://doi.org/10.1145/3600211.3604685.
[267] Donella Meadows. Leverage Points: Places to Intervene in a System. 1999.
url: https://donellameadows.org/archives/leverage-points-places-to-interve
ne-in-a-system/ (visited on 05/29/2024).
[268] C. Perrow. Normal Accidents: Living with High Risk Technologies. Princeton
paperbacks. Princeton University Press, 1999.
[269] Nancy G. Leveson et al. “Moving Beyond Normal Accidents and High Reli-
ability Organizations: A Systems Approach to Safety in Complex Systems”.
In: Organization Studies (2009).
[270] Thomas G. Dietterich. “Steps Toward Robust Artificial Intelligence”. In: AI
Magazine 38.3 (2017), pp. 3–24. doi: https://doi.org/10.1609/aimag.v38i3.2
756. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1609/aimag.v38i3.27
56. url: https://onlinelibrary.wiley.com/doi/abs/10.1609/aimag.v38i3.2756.
[271] Jens Rasmussen. “Risk management in a dynamic society: a modelling prob-
lem”. In: Safety Science 27.2 (1997), pp. 183–213. issn: 0925-7535. doi: http
s://doi.org/10.1016/S0925-7535(97)00052-0. url: https://www.sciencedirect
.com/science/article/pii/S0925753597000520.
[272] Sidney Dekker. Drift into Failure: From Hunting Broken Components to Un-
derstanding Complex Systems. Jan. 2011, pp. 1–220. isbn: 9781315257396.
doi: 10.1201/9781315257396.
[273] Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An Overview of
Catastrophic AI Risks. 2023. arXiv: 2306.12001 [cs.CY].
[274] Krystal Hu. ChatGPT sets record for fastest-growing user base - analyst note.
2023. url: https://www.reuters.com/technology/chatgpt-sets-record-fastest
-growing-user-base-analyst-note-2023-02-01/ (visited on 02/03/2023).
[275] Scott D. Sagan. The Limits of Safety: Organizations, Accidents, and Nuclear
Weapons. Vol. 177. Princeton University Press, 1993. isbn: 9780691021010.
url: http://www.jstor.org/stable/j.ctvzsmf8r (visited on 09/26/2023).
[276] Nassim Nicholas Taleb. The Black Swan: The Impact of the Highly Improbable.
Vol. 2. Random House, 2007.
[277] E. Marsden. Black swans: The limits of probabilistic modelling. 2017. url:
https://risk-engineering.org/black-swans/ (visited on 07/31/2017).
[278] Mark Boult et al. Horizon Scanning: A Practitioner’s Guide. English. IRM -
Institute of Risk Management, 2018.
Complex Systems
[160] Jordan Hoffmann et al. Training Compute-Optimal Large Language Models.
2022. arXiv: 2203.15556 [cs.CL].
[279] P. Cilliers and David Spurrett. “Complexity and post-modernism: Understand-
ing complex systems”. In: South African Journal of Philosophy 18 (Sept. 2014),
pp. 258–274. doi: 10.1080/02580136.1999.10878187.
[280] Alethea Power et al. “Grokking: Generalization Beyond Overfitting on Small
Algorithmic Datasets”. In: ICLR MATH-AI Workshop. 2021.
[281] Hang Zhao et al. “Response mechanisms to heat stress in bees”. In: Apidologie
52 (Jan. 2021). doi: 10.1007/s13592-020-00830-w.
[282] Trevor Kletz. An Engineer’s View of Human Error. CRC Press, 2018.
[283] Lucas Davis. “The Effect of Driving Restrictions on Air Quality in Mexico
City”. In: Journal of Political Economy 116 (Feb. 2008), pp. 38–81. doi: 10.1
086/529398.
[284] Jemimah Steinfeld. “China’s deadly science lesson: How an ill-conceived cam-
paign against sparrows contributed to one of the worst famines in history”. In:
Index on Censorship 47.3 (2018), pp. 49–49. doi: 10.1177/0306422018800259.
[285] Xin Meng, Nancy Qian, and Pierre Yared. “The Institutional Causes of China’s
Great Famine, 1959–1961”. In: The Review of Economic Studies 82.4 (Apr.
2015), pp. 1568–1611. issn: 0034-6527. doi: 10.1093/restud/rdv016.
[286] Joseph Gottfried. “History Repeating? Avoiding a Return to the Pre-
Antibiotic Age”. In: 2005. url: https://api.semanticscholar.org/CorpusID:84028897.
[287] Centers for Disease Control (US). “Antibiotic resistance threats in the United
States, 2019”. In: CDC Stacks (2019). doi: http://dx.doi.org/10.15620/cdc:8
2532.
[331] Richard D. Wilkinson and Kate Pickett. The spirit level: Why more equal
societies almost always do better. Bloomsbury Publishing, 2009.
[332] Morgan Kelly. “Inequality And Crime”. In: The Review of Economics and
Statistics 82 (Feb. 2000), pp. 530–539. doi: 10.1162/003465300559028.
[333] World Bank. Poverty and Inequality Platform. 2022. url: https://data.world
bank.org/indicator/SI.POV.GINI.
[334] United Nations Office on Drugs and Crime. Global Study on Homicide 2019.
2019. url: https://www.unodc.org/documents/data-and-analysis/gsh/Book
let1.pdf.
[335] World Bank. World Governance Indicators. 2023. url: https://databank.wo
rldbank.org/source/worldwide-governance-indicators.
[336] Thomas Piketty. Capital in the Twenty-First Century. Belknap Press, 2014.
[337] Radeksz. Graph of the Preston Curve for 2005, cross section data. 2009. url:
https://commons.wikimedia.org/wiki/File:PrestonCurve2005.JPG.
[338] Amartya Sen. “Economics and Health”. In: The Lancet (1999).
[339] Jac Heckelman. “Economic Freedom and Economic Growth: A Short-Run
Causal Investigation”. In: Journal of Applied Economics III (May 2000),
pp. 71–91. doi: 10.1080/15140326.2000.12040546.
[340] Charles Jones and Pete Klenow. “Beyond GDP? Welfare across Countries and
Time”. In: American Economic Review 106 (Sept. 2016), pp. 2426–2457. doi:
10.1257/aer.20110236.
[341] Richard Easterlin et al. “China’s Life Satisfaction, 1990-2010”. In: Proceedings
of the National Academy of Sciences of the United States of America 109 (May
2012), pp. 9775–80. doi: 10.1073/pnas.1205672109.
[342] L.A. Paul. Transformative Experience. Oxford Scholarship online. Oxford Uni-
versity Press, 2014. isbn: 9780198717959. url: https://books.google.com.au
/books?id=zIXjBAAAQBAJ.
[343] Alberto Giubilini. “The Artificial Moral Advisor. The “Ideal Observer” Meets
Artificial Intelligence”. In: Philosophy & Technology (2018). doi: https://doi
.org/10.1007/s13347-017-0285-z.
[344] P. Bloom. The Sweet Spot: The Pleasures of Suffering and the Search for
Meaning. HarperCollins Publishers, 2021. isbn: 9780062910561. url: https://books.google.com.au/books?id=LUI7zgEACAAJ.
[345] Mantas Mazeika et al. How Would The Viewer Feel? Estimating Wellbeing
From Video Scenarios. 2022. arXiv: 2210.10039 [cs.CV].
[346] Dan Hendrycks et al. “Aligning AI With Shared Human Values”. In: CoRR
abs/2008.02275 (2020). arXiv: 2008.02275. url: https://arxiv.org/abs/2008
.02275.
[347] Patrick Butlin et al. Consciousness in Artificial Intelligence: Insights from the
Science of Consciousness. 2023. arXiv: 2308.08708 [cs.AI].
[348] John A. Weymark. “Harsanyi’s Social Aggregation Theorem and the Weak
Pareto Principle”. In: Social Choice and Welfare (1993). doi: https://doi.org
/10.1007/BF00182506.
[349] William MacAskill, Krister Bykvist, and Toby Ord. Moral Uncertainty. Sept.
2020. isbn: 9780198722274. doi: 10.1093/oso/9780198722274.001.0001.
[350] Stephen L. Darwall, ed. Deontology. Malden, MA: Wiley-Blackwell, 2003.
[352] Harry R. Lloyd. The Property Rights Approach to Moral Uncertainty. Happier
Lives Institute’s 2022 Summer Research Fellowship. 2022. url: https://www
.happierlivesinstitute.org/report/property-rights/.
[353] Wendell Wallach and Colin Allen. Moral Machines: Teaching Robots Right
from Wrong. Oxford University Press, Feb. 2009. isbn: 9780195374049. doi:
10.1093/acprof:oso/9780195374049.001.0001. url: https://doi.org/10.1093/a
cprof:oso/9780195374049.001.0001.
[354] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine
Learning: Limitations and Opportunities. http://www.fairmlbook.org. fairml-
book.org, 2019.
[355] Richard Layard and Jan-Emmanuel De Neve. Wellbeing: Science and Policy. May
2023. isbn: 9781009298926. doi: 10.1017/9781009298957.
[356] Michael J. Sandel. What Money Can’t Buy: The Moral Limits of Markets.
Farrar, Straus and Giroux, 2012.
[357] Dan Hendrycks et al. What Would Jiminy Cricket Do? Towards Agents That
Behave Morally. 2022. arXiv: 2110.13136 [cs.LG].
[358] M.D. Adler. Measuring Social Welfare: An Introduction. Oxford University
Press, 2019. isbn: 9780190643027. url: https://books.google.com/books?id
=_GitDwAAQBAJ.
[359] Katarzyna de Lazari-Radek and Peter Singer. Utilitarianism: A Very Short
Introduction. Oxford University Press, 2017. isbn: 9780198728795. doi: 10.1
093/actrade/9780198728795.001.0001. url: https://doi.org/10.1093/actrade
/9780198728795.001.0001.
[360] S. Kagan. Normative Ethics. Dimensions of philosophy series. Avalon Publish-
ing, 1998. isbn: 9780813308456. url: https://books.google.com/books?id=i
O8TAAAAYAAJ.
[361] Toby Newberry and Toby Ord. The Parliamentary Approach to Moral Uncer-
tainty. Tech. rep. 2021-2. Future of Humanity Institute, University of Oxford, 2021.
[362] Stephen Darwall. Deontology. Blackwell Publishers, Oxford, 2003.
[57] Richard C. Lewontin. “The Units of Selection”. In: Annual Review of Ecology,
Evolution, and Systematics 1 (1970), pp. 1–18.
[107] John J. Mearsheimer. Structural Realism. 2007, pp. 77–94.
[112] Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Human-level
play in the game of Diplomacy by combining language models with strategic
reasoning. 2022. doi: 10.1126/science.ade9097. eprint: https://www.science.o
rg/doi/pdf/10.1126/science.ade9097. url: https://www.science.org/doi/abs
/10.1126/science.ade9097.
[273] Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An Overview of
Catastrophic AI Risks. 2023. arXiv: 2306.12001 [cs.CY].
[332] Morgan Kelly. “Inequality And Crime”. In: The Review of Economics and
Statistics 82 (Feb. 2000), pp. 530–539. doi: 10.1162/003465300559028.
[351] Dan Hendrycks. Natural Selection Favors AIs over Humans. 2023. arXiv: 230
3.16200 [cs.CY].
[363] Thomas C. Schelling. Micromotives and Macrobehavior. 1978.
[364] R. Dawkins. The Blind Watchmaker. Penguin Books Limited, 1986. isbn:
9780141966427. url: https://books.google.com.au/books?id=-EDHRX3YYwgC.
[365] E. Warren and A.W. Tyagi. The Two Income Trap: Why Middle-Class Parents
are Going Broke. Basic Books, 2004. isbn: 9780465090907. url: https://boo
ks.google.com.au/books?id=TmXhGJ0tg58C.
[366] Scott Alexander. Meditations on Moloch. 2014. url: https://slatestarcodex.c
om/2014/07/30/meditations-on-moloch/ (visited on 09/29/2023).
[367] Steven Kuhn. Prisoner’s Dilemma. 1997. url: https://plato.stanford.edu/arc
hives/win2019/entries/prisoner-dilemma/ (visited on 09/29/2023).
[368] Derek Parfit. Reasons and Persons. Oxford, GB: Oxford University Press,
1984.
[369] Robert Axelrod. “More Effective Choice in the Prisoner’s Dilemma”. In: The
Journal of Conflict Resolution 24.3 (1980), pp. 379–403. issn: 00220027,
15528766. url: http://www.jstor.org/stable/173638 (visited on 09/28/2023).
[370] J. Maynard Smith. “The theory of games and the evolution of animal con-
flicts”. In: Journal of Theoretical Biology 47.1 (1974), pp. 209–221. issn: 0022-
5193. doi: https://doi.org/10.1016/0022-5193(74)90110-6. url: https://ww
w.sciencedirect.com/science/article/pii/0022519374901106.
[371] Noam Nisan et al. Algorithmic Game Theory. Cambridge University Press,
2007. doi: 10.1017/CBO9780511800481.
[372] Allan Dafoe. “AI Governance: Overview and Theoretical Lenses”. In: The
Oxford Handbook of AI Governance. Oxford University Press, 2022. isbn:
9780197579329. doi: 10.1093/oxfordhb/9780197579329.013.2. eprint: https://academic.oup.com/book/0/chapter/408516484/chapter-ag-pdf/50913724/book_41989_section_408516484.ag.pdf. url: https://doi.org/10.1093/oxfordhb/9780197579329.013.2.
[373] Kai-Fu Lee and Chen Qiufan. AI 2041 Ten Visions for Our Future. Currency,
2021.
[374] John H. Herz. “Idealist Internationalism and the Security Dilemma”. In: World
Politics 2.2 (1950), pp. 157–180. doi: 10.2307/2009187.
[375] William Press and Freeman Dyson. “Iterated Prisoners Dilemma Contains
Strategies That Dominate Any Evolutionary Opponent”. In: Proceedings of
the National Academy of Sciences of the United States of America 109 (May
2012), pp. 10409–13. doi: 10.1073/pnas.1206569109.
[376] Alexander Stewart and Joshua Plotkin. “Extortion and cooperation in the
Prisoner’s Dilemma”. In: Proceedings of the National Academy of Sciences
109 (June 2012), pp. 10134–10135. doi: 10.1073/pnas.1208087109.
[377] D. Patel and Carl Shulman. AI Takeover, Bio & Cyber Attacks, Detecting
Deception, & Humanity’s Far Future. Audio Podcast Episode from Dwarkesh
Podcast. 2023. url: https://www.youtube.com/watch?v=KUieFuV1fuo&ab
_channel=DwarkeshPatel.
[378] Jared Diamond. Collapse: How Societies Choose to Fail or Succeed: revised
edition. Penguin Books (London), 2005.
[379] Andrew Critch. What Multipolar Failure Looks Like, and Robust Agent-
Agnostic Processes (RAAPs). 2021. url: https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic (visited on 09/29/2023).
[380] Scott Alexander. Ascended Economy? 2016. url: https://slatestarcodex.com
/2016/05/30/ascended-economy/.
[381] OpenAI. OpenAI Charter. 2018. url: https://openai.com/charter.
[382] Martin A. Nowak. “Five Rules for the Evolution of Cooperation”. In: Science
314.5805 (2006), pp. 1560–1563. doi: 10.1126/science.1133755. eprint: https:
//www.science.org/doi/pdf/10.1126/science.1133755. url: https://www.scie
nce.org/doi/abs/10.1126/science.1133755.
[383] Robert Trivers. “The Evolution of Reciprocal Altruism”. In: Quarterly Review
of Biology 46 (Mar. 1971), pp. 35–57. doi: 10.1086/406755.
[384] Gabriele Schino and Filippo Aureli. “Grooming reciprocation among female
primates: A meta-analysis”. In: Biology Letters 4 (Nov. 2007), pp. 9–11. doi:
10.1098/rsbl.2007.0506.
[385] Steven Pinker. The better angels of our nature: Why violence has declined.
Penguin Books, 2012.
[399] Nancy Wilmsen Thornhill. “An evolutionary analysis of rules regulating hu-
man inbreeding and marriage”. In: Behavioral and Brain Sciences 14.2 (1991),
pp. 247–261.
[400] Richard C. Connor. “The benefits of mutualism: a conceptual framework”. In:
Biological Reviews 70.3 (1995), pp. 427–457.
[401] Mark Van Vugt, Robert Hogan, and Robert B. Kaiser. “Leadership, follower-
ship, and evolution: some lessons from the past”. In: American Psychologist
63.3 (2008), p. 182.
[402] Peter Carruthers and Peter K Smith. Theories of theories of mind. Cambridge
University Press, 1996.
[403] Stuart A. West, Ashleigh S. Griffin, and Andy Gardner. “Evolutionary expla-
nations for cooperation”. In: Current Biology 17.16 (2007), R661–R672.
[404] John F. Nash. “The Bargaining Problem”. In: Econometrica 18.2 (1950),
pp. 155–162. issn: 00129682, 14680262. url: http://www.jstor.org/stable
/1907266 (visited on 11/19/2023).
[405] Herbert Gintis. “The evolution of private property”. In: Journal of Economic
Behavior & Organization 64.1 (2007), pp. 1–16.
[406] Joseph Henrich et al. ““Economic man” in cross-cultural perspective: Behav-
ioral experiments in 15 small-scale societies”. In: Behavioral and Brain Sci-
ences 28.6 (2005), pp. 795–815.
[407] Steven J. Brams and Alan D. Taylor. Fair Division: From cake-cutting to
dispute resolution. Cambridge University Press, 1996.
[408] M.J. Herskovits. Economic Anthropology: A Study in Comparative Economics.
Borzoi book. Knopf, 1952. url: https://books.google.com/books?id=LJqy
AAAAIAAJ.
[409] V. Buterin. What even is an institution? 2022. url: https://vitalik.ca/gener
al/2022/12/30/institutions.html.
[410] Marina Favaro, Neil Renic, and Ulrich Kuhn. Forecasting the future impact of
emerging technologies on international stability and human security. Accessed:
29 May 2024. Sept. 26, 2022. url: https://ifsh.de/file/publication/Research
_Report/010/Research_Report_010.pdf.
[411] James D. Fearon. “Rationalist Explanations for War”. In: International Orga-
nization 49.3 (1995), pp. 379–414. issn: 00208183, 15315088. url: http://ww
w.jstor.org/stable/2706903 (visited on 10/14/2023).
[412] Kenneth W. Condit. The Joint Chiefs of Staff and National Policy, 1947-1949.
1996.
[413] Robert Powell. “War as a commitment problem”. In: International Organiza-
tion 60.1 (2006), pp. 169–203.
[414] Brian Michael Jenkins. The Will to Fight, Lessons from Ukraine. 2022. url:
https://www.rand.org/pubs/commentary/2022/03/the-will-to-fight-lessons-
from-ukraine.html.
[415] Laurie Chen. “‘Always there’: the AI chatbot comforting China’s lonely millions”. In: The Jakarta Post (2021).
[416] Victimization rates for persons age 12 or older, by type of crime and annual
family income of victims. 2011. url: https://bjs.ojp.gov/sites/g/files/xycku
h236/files/media/document/cv0814.pdf.
[417] Pablo Fajnzylber, Daniel Lederman, and Norman Loayza. “Inequality and
Violent Crime”. In: The Journal of Law and Economics 45.1 (2002), pp. 1–
39. doi: 10.1086/338347. eprint: https://doi.org/10.1086/338347. url: https://doi.org/10.1086/338347.
[418] Adrienne Horne. “The Effect Of Relative Deprivation On Delinquency: An
Assessment Of Juveniles”. PhD thesis. University of Central Florida, 2009.
[419] Richard Dawkins. The Selfish Gene, 30th Anniversary edition. Oxford Univer-
sity Press, 2006.
[420] Lee Smolin. “Did the Universe Evolve?”. In: Classical and Quantum Gravity
(1992).
[421] Daniel C. Dennett. “Darwin’s Dangerous Idea: Evolution and the Meanings
of Life”. In: 1995.
[422] Susan J. Blackmore. The Meme Machine. Oxford University Press, 1999.
[423] George R. Price. “Selection and Covariance”. In: Nature (1970).
[424] Samir Okasha. “Evolution and the Levels of Selection”. In: 2007.
[425] Thomas G. Dietterich. “Ensemble Methods in Machine Learning”. In: Multiple
Classifier Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000, pp. 1–
15. isbn: 978-3-540-45014-6.
[426] Samir Okasha. “Agents and Goals in Evolution”. In: Oxford Scholarship On-
line (2018).
[427] Daniel Martín-Vega et al. “3D virtual histology at the host/parasite interface:
visualisation of the master manipulator, Dicrocoelium dendriticum, in the
brain of its ant host”. In: Scientific Reports 8.1 (2018), pp. 1–10.
[428] Christoph Ratzke, Jonas Denk, and Jeff Gore. “Ecological suicide in mi-
crobes”. In: bioRxiv (2017), p. 161398.
[429] Thomas C. Schelling. Arms and Influence. Yale University Press, 1966. isbn:
9780300002218. url: http://www.jstor.org/stable/j.ctt5vm52s (visited on
10/14/2023).
[430] Edward O. Wilson. Sociobiology: The New Synthesis, Twenty-Fifth Anniver-
sary Edition. Harvard University Press, 2000. isbn: 9780674000896. url: htt
p://www.jstor.org/stable/j.ctvjnrttd (visited on 10/14/2023).
[431] C. Boehm. Moral origins: The evolution of virtue, altruism, and shame. Basic
Books, 2012.
[432] R. Axelrod and R.M. Axelrod. The Evolution of Cooperation. Basic books.
Basic Books, 1984. isbn: 9780465021215. url: https://books.google.com.au
/books?id=NJZBCGbNs98C.
Governance
[13] Fabio Urbina et al. “Dual use of artificial-intelligence-powered drug discovery”.
In: Nature Machine Intelligence 4 (2022), pp. 189–191.
[162] Rich Sutton. The Bitter Lesson. url: http://www.incompleteideas.net/IncId
eas/BitterLesson.html (visited on 09/28/2023).
[164] Epoch. Key trends and figures in Machine Learning. 2023. url: https://epoc
hai.org/trends (visited on 10/19/2023).
[433] Tom Davidson. Could Advanced AI Drive Explosive Economic Growth? url:
https://www.openphilanthropy.org/research/could-advanced-ai-drive-explo
sive-economic-growth/.
[434] Jeffrey Ding. “The Rise and Fall of Technological Leadership: General-Purpose
Technology Diffusion and Economic Power Transitions”. In: International
Studies Quarterly 68.2 (Mar. 2024), sqae013. issn: 0020-8833. doi: 10 . 109
3/isq/sqae013. eprint: https://academic.oup.com/isq/article-pdf/68/2/sqae0
13/56984912/sqae013.pdf. url: https://doi.org/10.1093/isq/sqae013.
[435] J. Briggs and D. Kodnani. The potentially large effects of artificial intelligence
on economic growth. Accessed: 23 May 2024. 2023. url: https://www.gspubl
ishing.com/content/research/en/reports/2023/03/27/d64e052b-0f6e-45d7-96
7b-d7be35fabd16.html.
[436] Andrew Yang. The War On Normal People: The Truth About America’s Dis-
appearing Jobs and Why Universal Basic Income Is Our Future. Hachette
Books, 2018.
[437] Daron Acemoglu and Pascual Restrepo. “Robots and jobs: Evidence from US
labor markets”. In: Journal of Political Economy 128.6 (2020), pp. 2188–2244.
[438] Yuval Noah Harari. Homo Deus: A brief history of tomorrow. Random house,
2016.
[439] Maximiliano Dvorkin and Hannah Shell. “The growing skill divide in the US
labor market”. In: Federal Reserve Bank of St. Louis: On the Economy Blog
18 (2017).
[440] Anton Korinek and Donghyun Suh. Scenarios for the Transition to AGI. Work-
ing Paper 32255. National Bureau of Economic Research, Mar. 2024. doi: 10
.3386/w32255. url: http://www.nber.org/papers/w32255.
[441] Nick Bostrom. What is a Singleton? Linguistic and Philosophical Investiga-
tions, 5 (2), 48-54. 2006.
[442] K. Eric Drexler. Reframing Superintelligence: Comprehensive AI Services as
General Intelligence. Tech. rep. Future of Humanity Institute, 2019.
[443] Michael C. Jensen and William H. Meckling. “Theory of the firm: Managerial
behavior, agency costs and ownership structure”. In: Corporate Governance.
Gower, 2019, pp. 77–132.
[444] Milton Friedman. “The Social Responsibility of Business Is To Increase Its
Profits”. In: vol. 32. Jan. 2007, pp. 173–178. isbn: 978-3-540-70817-9. doi:
10.1007/978-3-540-70818-6_14.
[445] Peter Cihon, Jonas Schuett, and Seth D. Baum. “Corporate Governance of
Artificial Intelligence in the Public Interest”. In: Information 12.7 (July 2021),
p. 275. issn: 2078-2489. doi: 10.3390/info12070275. url: http://dx.doi.org
/10.3390/info12070275.
[446] Jonas Schuett. AGI labs need an internal audit function. 2023. arXiv: 2305.1
7038 [cs.CY].
[447] Daron Acemoglu, Andrea Manera, and Pascual Restrepo. Does the US tax
code favor automation? Tech. rep. National Bureau of Economic Research,
2020.
[448] Remco Zwetsloot. “Winning the Tech Talent Competition”. In: Center for
Strategic and International Studies (2021), p. 2.
[449] Nicholas Carlini et al. Stealing Part of a Production Language Model. 2024.
arXiv: 2403.06634 [cs.CR].
[450] Stuart Armstrong, Nick Bostrom, and Carl Shulman. “Racing to the precipice:
a model of artificial intelligence development”. In: AI & society 31 (2016),
pp. 201–206.
[451] Deborah D. Avant, Martha Finnemore, and Susan K. Sell. Who governs the
globe? Vol. 114. Cambridge University Press, 2010.
[452] Ben Garfinkel and Allan Dafoe. “How does the offense-defense balance
scale?”. In: Emerging Technologies and International Stability. Routledge,
2021, pp. 247–274.
[453] Wolfgang K.H. Panofsky and Jean Marie Deken. 2008. doi: https://doi.org
/10.1007/978-0-387-69732-1.
[454] Robert Trager et al. “International Governance of Civilian AI: A Jurisdictional
Certification Approach”. In: Risk, Regulation, & Policy eJournal (2023). url:
http://dx.doi.org/10.2139/ssrn.4579899.
[455] Francis J. Gavin. “Strategies of Inhibition: U.S. Grand Strategy, the Nuclear
Revolution, and Nonproliferation”. In: International Security 40.1 (July 2015),
pp. 9–46. issn: 0162-2889. doi: 10.1162/ISEC_a_00205. eprint: https://direct.mit.edu/isec/article-pdf/40/1/9/1843553/isec_a_00205.pdf. url: https://doi.org/10.1162/ISEC_a_00205.
[456] Preempting a Generative AI Monopoly. 2023. url: https://www.project-syn
dicate.org/commentary/preventing-tech-giants-from-monopolizing-artificial
-intelligence-chatbots-by-diane-coyle-2023-02 (visited on 10/19/2023).
[457] Dario Amodei et al. AI and Compute. OpenAI. 2018.
[458] The $150m Machine Keeping Moore’s Law Alive. ©ASML. 2021. url: https:
//www.wired.com/story/asml-extreme-ultraviolet-lithography-chips-moores
-law/ (visited on 08/30/2021).
[459] Zachary Arnold et al. ETO Supply Chain Explorer. Dataset. 2022. url: http
s://cset.georgetown.edu/publication/eto-supply-chain-explorer/ (visited on
10/19/2023).
[460] Saif M Khan, Alexander Mann, and Dahlia Peterson. “The semiconductor
supply chain: Assessing national competitiveness”. In: Center for Security and
Emerging Technology 8.8 (2021).
[461] Jaime Sevilla et al. Compute Trends Across Three Eras of Machine Learning.
2022. arXiv: 2202.05924 [cs.LG].
[462] Ege Erdil and Tamay Besiroglu. Explosive growth from AI automation: A
review of the arguments. Papers 2309.11690. arXiv.org, 2023. url: https://id
eas.repec.org/p/arx/papers/2309.11690.html.
[463] Saif M. Khan. AI Chips: What They Are and Why They Matter. 2020. url:
https://doi.org/10.51593/20190014.
[464] Yonadav Shavit. What does it take to catch a Chinchilla? Verifying Rules on
Large-Scale Neural Network Training via Compute Monitoring. 2023. arXiv:
2303.11341 [cs.LG].
[465] Lewis Ho et al. International Institutions for Advanced AI. 2023. arXiv: 2307
.04699 [cs.CY].
[466] Matthijs Maas. Advanced AI Governance. Tech. rep. Legal Priorities Project, 2023. url: https://docs.google.com/document/d/1pwwNHvNeJneBA2t2xaP31lVv1lSpa36w8kdryoS5768/edit.
Index
Note: Bold page numbers refer to tables and Italic page numbers refer to figures.
Power inequality, 48
Power laws, 102–104, 103
Power shifts, 368
Power-seeking, 43–45
    behavior, 159
    instrumentally rational, 160–162
Precision, 72–73, 73
Predictability, 224
Predictive power, 75, 122
Preferences, 326–335
    idealized, 332–335
    revealed, 326–327, 329
    stated, 329–332
Preparedness paradox, 233
Pre-training, 97
Price, George R., 434
Price Equation, 434–436
Principal-agent problems, 141
Principle of least privilege, 189, 191, 236
Prioritarian social welfare functions, 348–350
Prisoner’s Dilemma, 372–380
    game fundamentals, 372–374
    Pareto improvement, 375, 375, 376
    promoting cooperation, 377–380
    real-world examples, 376–377
Problem of aggregation, 342
Productivity effect, 458
Proportional chances voting, 357
Prospective power, 159
Proxy gaming, 38–41, 137–140, 139, 185, 268
Proxy-purpose distinction, 317
Public Benefit Corporation, 465
Public health, 398
    campaigns, 277
Punctuated equilibrium, 255–256, 262, 264
Punishment, 381
Putin, Vladimir, 44
Q
Quality-adjusted life years (QALYs), 325
“Quid pro quo,” 26
R
Racing dynamics, 367
“Radium Girls,” 31
Random forests, 57
Randomness, 224
Rasmussen’s Risk Management Framework (RMF), 212, 213, 236
Rational agents, 367–368
Rational cooperation, 381, 383
Rawls, John, 300
Rawlsian maximin function, 343, 349, 350
R&D activities, 455
Real-world actors, 386–387
Recall, 72, 73
Recurrent neural networks (RNNs), 99
Red teams, 235
Redistribution policy, 475
Redundancy, 189, 190, 236, 359
Redundant encoding, 257, 262
Referent power, 158
Regulatory changes, 303
Reinforcement learning (RL), 43, 69, 77–78, 129–131, 168
Reinforcement learning with human feedback (RLHF), 331, 332
Relative reproductive success, 436
Reliability, 75, 186–189
Remote-controlled weapons, 15
Renewable Heat Incentive Scandal, 273
Representation control, 168
Representation engineering, 126–127
Representation learning, 67–68
Residual connections, 89–91, 90
Resilience, 476–477
ResNets (Residual Networks), 100
Restricted AI models, 460
Retention, 438
Revealed preferences, 326–327, 329
    inverse reinforcement learning, 328–329
    manipulation, 328
    misinformation, 327
    weakness of will, 327–328
V
Valenced consciousness, 340
Value pluralism, 316