
“This book is an important resource for anyone interested in understanding and mitigating the risks associated with increasingly powerful AI systems. It provides not only an accessible introduction to the technical challenges in making AI safer, but also a clear-eyed account of the coordination problems we will need to solve on a societal level to ensure AI is developed and deployed safely.”

Yoshua Bengio, Professor of Computer Science, University of Montreal and Turing Award Winner
“A must-read for anyone seeking to understand the full complexities of AI risk.”

David Krueger, Assistant Professor, Department of Engineering, University of Cambridge

“The most comprehensive exposition for the case that AI raises catastrophic risks and
what to do about them. Even if you disagree with some of Hendrycks’ arguments, this
book is still very much worth reading, if only for the unique coverage of both the techni-
cal and social aspects of the field.”

Boaz Barak, Gordon McKay Professor of Computer Science, Harvard University


Introduction to AI Safety,
Ethics, and Society
As AI technology is rapidly progressing in capability and being adopted more widely across so-
ciety, it is more important than ever to understand the potential risks AI may pose and how AI
can be developed and deployed safely. Introduction to AI Safety, Ethics, and Society offers a
comprehensive and accessible guide to this topic.

This book explores a range of ways in which societies could fail to harness AI safely in coming
years, such as malicious use, accidental failures, erosion of safety standards due to competition
between AI developers or nation-states, and potential loss of control over autonomous systems.
Grounded in the latest technical advances, this book offers a timely perspective on the chal-
lenges involved in making current AI systems safer. Ensuring that AI systems are safe is not just
a problem for researchers in machine learning – it is a societal challenge that cuts across tra-
ditional disciplinary boundaries. Integrating insights from safety engineering, economics, and
other relevant fields, this book provides readers with fundamental concepts to understand and
manage AI risks more effectively.

This is an invaluable resource for upper-level undergraduate and postgraduate students taking
courses relating to AI safety & alignment, AI ethics, AI policy, and the societal impacts of AI, as
well as anyone trying to better navigate the rapidly evolving landscape of AI safety.

Dr. Dan Hendrycks is a machine learning researcher and Director of the Center for AI Safety
(CAIS). He has conducted pioneering research in AI such as developing the GELU activation
function, used in several state-of-the-art neural networks such as GPT, and creating MMLU, one
of the leading benchmarks used to assess large language models. His research has been covered
by the BBC, New York Times, and Washington Post.

His work currently focuses on improving the safety of AI systems and mitigating risks from AI.
He has advised the UK government on AI safety and has been invited to give talks on this topic
at OpenAI, Google, and Stanford, among other institutions. He has written on AI risks for the
Wall Street Journal and TIME Magazine. Dan Hendrycks holds a Ph.D. in Machine Learning
from UC Berkeley.
Introduction to AI Safety,
Ethics, and Society

Dan Hendrycks
First edition published 2025
by CRC Press
2385 Executive Center Drive, Suite 320, Boca Raton, FL 33431

and by CRC Press


4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2025 Dan Hendrycks

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.

The Open Access version of this book, available at www.taylorfrancis.com, has been made available under a Creative
Commons [Attribution-Non Commercial-No Derivatives (CC-BY-NC-ND)] 4.0 license.

Any third party material in this book is not included in the OA Creative Commons license, unless indicated otherwise in
a credit line to the material. Please direct any permissions enquiries to the original rightsholder.

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for iden-
tification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data


Names: Hendrycks, Dan, author.
Title: Introduction to AI safety, ethics, and society / Dan Hendrycks.
Other titles: Introduction to artificial intelligence safety, ethics, and
society
Description: First edition. | Boca Raton : CRC Press, 2025. | Includes
bibliographical references and index.
Identifiers: LCCN 2024031863 (print) | LCCN 2024031864 (ebook) | ISBN
9781032869926 (hbk) | ISBN 9781032917221 (pbk) | ISBN 9781003530336
(ebk)
Subjects: LCSH: Artificial intelligence--Moral and ethical aspects. |
Artificial intelligence--Social aspects. | Risk management.
Classification: LCC Q334.7 .H46 2025 (print) | LCC Q334.7 (ebook) | DDC
174/.90063--dc23/eng/20240928
LC record available at https://lccn.loc.gov/2024031863
LC ebook record available at https://lccn.loc.gov/2024031864

ISBN: 978-1-032-86992-6 (hbk)


ISBN: 978-1-032-91722-1 (pbk)
ISBN: 978-1-003-53033-6 (ebk)

DOI: 10.1201/9781003530336

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

Typeset in Latin Modern


by KnowledgeWorks Global Ltd.

Access the Support Material: https://www.routledge.com/9781032798028


Contents

Introduction xv

Section I AI and Societal-Scale Risks

Chapter 1 ■ Overview of Catastrophic AI Risks 3


1.1 INTRODUCTION 3
1.2 MALICIOUS USE 6
1.2.1 Bioterrorism 7
1.2.2 Unleashing AI Agents 9
1.2.3 Persuasive AIs 10
1.2.4 Concentration of Power 11
1.3 AI RACE 14
1.3.1 Military AI Arms Race 14
1.3.2 Corporate AI Race 20
1.3.3 Evolutionary Pressures 23
1.4 ORGANIZATIONAL RISKS 28
1.4.1 Accidents Are Hard to Avoid 30
1.4.2 Organizational Factors can Reduce the Chances of
Catastrophe 32
1.5 ROGUE AIS 37
1.5.1 Proxy Gaming 38
1.5.2 Goal Drift 41
1.5.3 Power-Seeking 43
1.5.4 Deception 45
1.6 DISCUSSION OF CONNECTIONS BETWEEN RISKS 48
1.7 CONCLUSION 49
1.8 LITERATURE 50
1.8.1 Recommended Reading 50

Chapter 2 ■ Artificial Intelligence Fundamentals 51


2.1 INTRODUCTION 51
2.2 ARTIFICIAL INTELLIGENCE & MACHINE LEARNING 52
2.2.1 Artificial Intelligence 53


2.2.2 Types of AI 58
2.2.3 Machine Learning 64
2.2.4 Types of Machine Learning 75
2.3 DEEP LEARNING 79
2.3.1 Model Building Blocks 82
2.3.2 Training and Inference 93
2.3.3 History and Timeline of Key Architectures 98
2.3.4 Applications 101
2.4 SCALING LAWS 102
2.4.1 Scaling Laws in DL 104
2.5 SPEED OF AI DEVELOPMENT 107
2.6 CONCLUSION 110
2.6.1 Summary 110
2.7 LITERATURE 112
2.7.1 Recommended Resources 113

Section II Safety

Chapter 3 ■ Single-Agent Safety 117


3.1 INTRODUCTION 117
3.2 MONITORING 118
3.2.1 ML Systems Are Opaque 118
3.2.2 Motivations for Transparency Research 121
3.2.3 Approaches to Transparency 122
3.2.4 Emergent Capabilities 127
3.2.5 Emergent Goal-Directed Behavior 129
3.2.6 Tail Risk: Emergent Goals 132
3.2.7 Evaluations and Anomaly Detection 134
3.3 ROBUSTNESS 135
3.3.1 Proxies in ML 136
3.3.2 Proxy Gaming 136
3.3.3 Adversarial Examples 143
3.3.4 Trojan Attacks and Other Security Threats 147
3.3.5 Tail Risk: AI Evaluator Gaming 148
3.4 ALIGNMENT 150
3.4.1 Deception 151
3.4.2 Deceptive Evaluation Gaming 154
3.4.3 Tail Risk: Deceptive Alignment and Treacherous Turns 156
3.4.4 Power 157
3.4.5 People Could Enlist AIs for Power Seeking 160
3.4.6 Power Seeking Can Be Instrumentally Rational 160

3.4.7 Structural Pressures Toward Power-Seeking AI 165


3.4.8 Tail Risk: Power-Seeking Behavior 167
3.4.9 Techniques to Control AI Systems 167
3.5 SYSTEMIC SAFETY 169
3.6 SAFETY AND GENERAL CAPABILITIES 171
3.7 CONCLUSION 173
3.8 LITERATURE 176
3.8.1 Recommended Reading 176

Chapter 4 ■ Safety Engineering 178


4.1 RISK DECOMPOSITION 179
4.1.1 Failure Modes, Hazards, and Threats 179
4.1.2 The Classic Risk Equation 180
4.1.3 Framing the Goal as Risk Reduction 181
4.1.4 Disaster Risk Equation 181
4.1.5 Elements of the Risk Equation 182
4.1.6 Applying the Disaster Risk Equation 183
4.2 NINES OF RELIABILITY 185
4.3 SAFE DESIGN PRINCIPLES 188
4.3.1 Redundancy 189
4.3.2 Separation of Duties 189
4.3.3 Principle of Least Privilege 190
4.3.4 Fail-Safes 190
4.3.5 Antifragility 191
4.3.6 Negative Feedback Mechanisms 192
4.3.7 Transparency 192
4.3.8 Defense in Depth 193
4.3.9 Review of Safe Design Principles 194
4.4 COMPONENT FAILURE ACCIDENT MODELS AND METHODS 194
4.4.1 Swiss Cheese Model 194
4.4.2 Bow Tie Model 196
4.4.3 Fault Tree Analysis Method 197
4.4.4 Limitations 199
4.5 SYSTEMIC FACTORS 205
4.5.1 Systemic Accident Models 206
4.6 DRIFT INTO FAILURE AND EXISTENTIAL RISKS 215
4.7 TAIL EVENTS AND BLACK SWANS 217
4.7.1 Introduction to Tail Events 217
4.7.2 Tail Events Can Greatly Affect the Average Risk 218
4.7.3 Tail Events Can Be Identified From Frequency Distributions 220
4.7.4 A Caricature of Tail Events 221

4.7.5 Introduction to Black Swans 224


4.7.6 Known Unknowns and Unknown Unknowns 224
4.7.7 Implications of Tail Events and Black Swans for Risk
Analysis 227
4.7.8 Identifying the Risk of Tail Events or Black Swans 233
4.8 CONCLUSION 235
4.8.1 Summary 235
4.8.2 Key Takeaways 237
4.9 LITERATURE 239
4.9.1 Recommended Reading 239

Chapter 5 ■ Complex Systems 240


5.1 OVERVIEW 240
5.2 INTRODUCTION TO COMPLEX SYSTEMS 241
5.2.1 The Reductionist Paradigm 241
5.2.2 The Complex Systems Paradigm 244
5.2.3 DL Systems as Complex Systems 247
5.2.4 Complexity Is Not a Dichotomy 248
5.2.5 The Hallmarks of Complex Systems 248
5.2.6 Social Systems as Complex Systems 260
5.3 COMPLEX SYSTEMS FOR AI SAFETY 265
5.3.1 General Lessons from Complex Systems 265
5.3.2 Puzzles, Problems, and Wicked Problems 269
5.3.3 Challenges With Interventionism 271
5.3.4 Systemic Issues 276
5.4 CONCLUSION 278
5.5 LITERATURE 280
5.5.1 Recommended Reading 280

Section III Ethics and Society

Chapter 6 ■ Beneficial AI and Machine Ethics 283


6.1 INTRODUCTION 283
6.2 LAW 285
6.2.1 The Case for Law 286
6.2.2 The Need for Ethics 289
6.3 FAIRNESS 292
6.3.1 Bias 293
6.3.2 Sources of Bias 294
6.3.3 AI Fairness Concepts 297

6.3.4 Limitations of Fairness 299


6.3.5 Approaches to Combating Bias and Improving Fairness 300
6.4 THE ECONOMIC ENGINE 303
6.4.1 Allocative Efficiency of Free Markets 304
6.4.2 Market Failures 305
6.4.3 Inequality 309
6.4.4 Growth 313
6.4.5 Beyond Economic Models 314
6.5 WELLBEING 318
6.5.1 Wellbeing as the Net Balance of Pleasure over Pain 318
6.5.2 Wellbeing as a Collection of Objective Goods 319
6.5.3 Wellbeing as Preference Satisfaction 319
6.5.4 Applying the Theories of Wellbeing 322
6.6 PREFERENCES 324
6.6.1 Revealed Preferences 325
6.6.2 Stated Preferences 327
6.6.3 Idealized Preferences 330
6.7 HAPPINESS 333
6.7.1 The General Approach to Happiness 334
6.7.2 Problems for Happiness-Focused Ethics 337
6.8 SOCIAL WELFARE FUNCTIONS 340
6.8.1 Measuring Social Welfare 342
6.9 MORAL UNCERTAINTY 350
6.9.1 Making Decisions Under Moral Uncertainty 350
6.9.2 Implementing a Moral Parliament in AI Systems 356
6.9.3 Advantages of a Moral Parliament 356
6.10 CONCLUSION 359
6.11 LITERATURE 360
6.11.1 Recommended Reading 360

Chapter 7 ■ Collective Action Problems 362


7.1 MOTIVATION 362
7.2 GAME THEORY 366
7.2.1 Overview 366
7.2.2 Game Theory Fundamentals 368
7.2.3 The Prisoner’s Dilemma 369
7.2.4 The Iterated Prisoner’s Dilemma 377
7.2.5 Collective Action Problems 392
7.2.6 Summary 399
7.3 COOPERATION 400
7.3.1 Summary 413

7.4 CONFLICT 414


7.4.1 Overview 414
7.4.2 Bargaining Theory 416
7.4.3 Commitment Problems 417
7.4.4 Information Problems 423
7.4.5 Factors Outside of Bargaining Theory 426
7.4.6 Summary 428
7.5 EVOLUTIONARY PRESSURES 429
7.5.1 Overview 429
7.5.2 Generalized Darwinism 429
7.5.3 Levels of Selection and Selfish Behavior 436
7.5.4 Summary 442
7.6 CONCLUSION 442
7.7 LITERATURE 444
7.7.1 Recommended Reading 444

Chapter 8 ■ Governance 446


8.1 INTRODUCTION 446
8.1.1 The Landscape 447
8.2 ECONOMIC GROWTH 449
8.3 DISTRIBUTION OF AI 454
8.3.1 Distribution of Access to AI 456
8.3.2 Distribution of Power Among AIs 460
8.4 CORPORATE GOVERNANCE 465
8.4.1 What Is Corporate Governance? 465
8.4.2 Legal Structure 465
8.4.3 Ownership Structure 466
8.4.4 Organizational Structure 467
8.4.5 Assurance 468
8.5 NATIONAL GOVERNANCE 469
8.5.1 Standards and Regulations 469
8.5.2 Liability for AI Harms 471
8.5.3 Targeted Taxation 472
8.5.4 Public Ownership over AI 473
8.5.5 Improving Resilience 473
8.5.6 Not Falling Behind 474
8.5.7 Information Security 475
8.6 INTERNATIONAL GOVERNANCE 477
8.6.1 Forms of International Governance 478
8.6.2 Four Questions for AI Regulation 481
8.6.3 What Can Be Included in International Agreements? 483

8.7 COMPUTE GOVERNANCE 486


8.7.1 Compute Is Indispensable for AI Development and
Deployment 486
8.7.2 Compute Is Physical, Excludable, and Quantifiable 488
8.8 CONCLUSION 492
8.9 LITERATURE 494
8.9.1 Recommended Reading 494

Acknowledgments 495

References 497

Index 529
Introduction

Artificial intelligence (AI) is rapidly embedding itself within militaries, economies, and societies, reshaping their very foundations. Given the depth and breadth of its
consequences, it has never been more pressing to understand how to ensure that AI
systems are safe, ethical, and have a positive societal impact.
This book aims to provide a comprehensive approach to understanding AI risk. Our
primary goals include consolidating fragmented knowledge on AI risk, increasing the
precision of core ideas, and reducing barriers to entry by making content simpler
and more comprehensible. The book has been designed to be accessible to readers
from diverse backgrounds. You do not need to have studied AI, philosophy, or other
such topics. The content is skimmable and somewhat modular, so that you can choose
which chapters to read. We introduce mathematical formulas in a few places to specify
claims more precisely, but readers should be able to understand the main points
without these.
AI risk is multidisciplinary. Most people think about problems in AI risk in terms of
largely implicit conceptual models, which significantly affect how they approach these
challenges. We aim to replace these implicit models with explicit, time-tested models.
A full understanding of the risks posed by AI requires knowledge in several disparate
academic disciplines, which have so far not been combined in a single text. This book
was written to fill that gap and adequately equip readers to analyze AI risk; it
moves beyond the confines of machine learning to provide a holistic understanding
of AI risk. We draw on well-established ideas and frameworks from the fields of
engineering, economics, biology, complex systems, philosophy, and other disciplines
that can provide insights into AI risks and how to manage them. Our aim is to equip
readers with a solid understanding of the technical, ethical, and governance challenges
that we will need to meet in order to harness advanced AI in a beneficial way.
In order to understand the challenges of AI safety, it is important to consider the
broader context within which AI systems are being developed and applied. The de-
cisions of and interplay between AI developers, policy-makers, militaries, and other
actors will play an important role in shaping this context. Since AI influences many
different spheres, we have deliberately selected time-tested, formal frameworks to
provide multiple lenses for thinking about AI, relevant actors, and AI’s impacts. The
frameworks and concepts we use are highly general and are useful for reasoning about
various forms of intelligence, ranging from individual human beings to corporations,
states, and AI systems. While some sections of the book focus more directly on AI
risks that have already been identified and discussed today, others set out a system-
atic introduction to ideas from game theory, complex systems, international relations,
and more. We hope that providing these flexible conceptual tools will help readers to
adapt robustly to the ever-changing landscape of AI risks.
This book does not aim to be the definitive guide on all AI risks. Research on AI risk is
still new and rapidly evolving, making it infeasible to comprehensively cover every risk
and its potential solutions in a single book, particularly if we wish to ensure that the
content is clear and digestible. We have chosen to introduce concepts and frameworks
that we find productive for thinking about a wide range of AI risks. Nonetheless, we
have had to make choices about what to include and omit. Many present harms, such
as harmful malfunctions, misinformation, privacy breaches, reduced social connection,
and environmental damage, are already well-addressed by others [1, 2]. Given the
rapid development of AI in recent years, we focus on novel risks posed by advanced
systems: risks that pose serious, large-scale, and sometimes irreversible threats that
our societies are currently unprepared to face.
Even if we limit ourselves to focusing on the potential for AI to pose catastrophic
risks, it is easy to become disoriented given the broad scope of the problem. Our hope
is that this book provides a starting point for others to build their own picture of
these risks and opportunities, and our potential responses to them.
The book’s content falls into three sections: AI and Societal-Scale Risks, Safety, and
Ethics and Society. In the AI and Societal-Scale Risks section, we outline major
categories of AI risks and introduce some key features of modern AI systems. In the
Safety section, we discuss how to make individual AI systems safer. However,
if we can make them safe, how should we direct them? To answer this, we turn to
the Ethics and Society section and discuss how to make AI systems that promote
our most important values. In this section, we also explore the numerous challenges
that emerge when trying to coordinate between multiple AI systems, multiple AI
developers, or multiple nation-states with competing interests.
The AI and Societal-Scale Risks section starts with an informal overview of AI risks,
which summarizes many of the key concerns discussed in this book. We outline some
scenarios where AI systems could cause catastrophic outcomes. We split risks across
four categories: malicious use, AI arms race dynamics, organizational risks, and rogue
AIs. These categories can be loosely mapped onto the risks discussed in more depth in
the Governance, Collective Action Problems, Safety Engineering, and Single-Agent
Safety chapters, respectively. However, this mapping is imperfect as many of the
risks and frameworks discussed in the book are more general and cut across sce-
narios. Nonetheless, we hope that the scenarios in this first chapter give readers a
concrete picture of the risks that we explore in this book. The next chapter, Artifi-
cial Intelligence Fundamentals, aims to provide an accessible and non-mathematical
explanation of current AI systems, helping to familiarize readers with key terms and
concepts in machine learning, DL, scaling laws, and so on. This provides the neces-
sary foundations for the discussion of the safety of individual AI systems in the next
section.

The Safety section gives an overview of core challenges in safely building advanced
AI systems. It draws on insights from both machine learning research and general
theories of safety engineering and complex systems, which provide a powerful lens for
understanding these issues. In Single-Agent Safety, we explore challenges in making
individual AI systems safer, such as bias, transparency, and emergence. In Safety
Engineering, we discuss principles for creating safer organizations and how these may
apply to those developing and deploying AI. A robust safety culture at organizations developing AI is crucial, so that they do not prioritize profit at the expense of safety. Next, in Complex Systems, we show that analyzing AIs
as complex systems helps us to better understand the difficulty of predicting how
they will respond to external pressures or controlling the goals that may emerge in
such systems. More generally, this chapter provides us with a useful vocabulary for
discussing diverse systems of interest.
The Ethics and Society section focuses on how to instill beneficial objectives and con-
straints in AI systems and how to enable effective collaboration between stakeholders
to mitigate risks. In Beneficial AI and Machine Ethics, we introduce the challenge of
giving AI systems objectives that will reliably lead to beneficial outcomes for society,
and discuss various proposals along with the challenges they face. In Collective Ac-
tion Problems, we utilize game theory to illustrate the many ways in which multiple
agents (such as individual humans, companies, nation-states, or AIs) can fail to secure
good outcomes and come into conflict. We also consider the evolutionary dynamics
shaping AI development and how these drive AI risks. These frameworks help us to
understand the challenges of managing competitive pressures between AI developers,
militaries, or AI systems themselves. Finally, in the Governance chapter, we discuss
strategic variables such as how widely access to powerful AI systems is distributed.
We introduce a variety of potential paths for managing AI risks, including corporate
governance, national regulation, and international coordination.
The website for this book (www.aisafetybook.com) includes a range of additional
content. It contains further educational resources such as videos, slides, quizzes, and
discussion questions. For readers interested in contributing to mitigating risks from
AI, it offers some brief suggestions and links to other resources on this topic. A range
of appendices can also be found on the website with further material that could not
be included in the book itself.

Dan Hendrycks
Center for AI Safety
Section I

AI and Societal-Scale Risks
CHAPTER 1

Overview of Catastrophic
AI Risks

1.1 INTRODUCTION

In this chapter, we will give a brief and informal description of many major societal-
scale risks from artificial intelligence (AI), focusing on AI risks that could lead to
highly severe or even catastrophic societal outcomes. This provides some background
and motivation before we discuss specific challenges with more depth and rigor in the
following chapters.
The world as we know it today is not normal. We take for granted that we can talk
instantaneously with people thousands of miles away, fly to the other side of the world
in less than a day, and access vast mountains of accumulated knowledge on devices
we carry around in our pockets. These realities seemed far-fetched decades ago and
would have been inconceivable to people living centuries ago. The ways we live, work,
travel, and communicate have only been possible for a tiny fraction of human history.
Yet, when we look at the bigger picture, a broader pattern emerges: accelerating
development. Hundreds of thousands of years elapsed between the time Homo sapi-
ens appeared on Earth and the agricultural revolution. Then, thousands of years
passed before the industrial revolution. Now, just centuries later, the AI revolution
is beginning. The march of history is not constant—it is rapidly accelerating.
We can capture this trend quantitatively in Figure 1.1, which shows how the estimated
gross world product has changed over time [3, 4]. The hyperbolic growth it depicts
might be explained by the fact that, as technology advances, the rate of technological
advancement also tends to increase. Empowered with new technologies, people can
innovate faster than they could before. Thus, the gap in time between each landmark
development narrows.
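
One simple way to make this pattern of accelerating growth more precise is the following toy model (illustrative only; the symbols P for gross world product, k and ε for positive constants, and t* for the singularity time are ours, not the text's). If output grows at a rate that itself increases with the level of output,

\[
\frac{dP}{dt} = k P^{1+\epsilon} \quad\Longrightarrow\quad P(t) \propto \frac{1}{(t^{*}-t)^{1/\epsilon}},
\]

then growth is faster than any exponential and the fitted curve diverges at the finite time t*. No real quantity reaches infinity, but this is the sense in which historical growth looks hyperbolic: each doubling of output arrives after a shorter interval than the one before.
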
Figure 1.1. World production has grown rapidly over the course of human history. AI could further this trend, catapulting humanity into a new period of unprecedented change.

It is the rapid pace of development, as much as the sophistication of our technology, that makes the present day an unprecedented time in human history. We have
reached a point where technological advancements can transform the world beyond
recognition within a human lifetime. For example, people who have lived through the
creation of the internet can remember a time when our now digitally connected world
would have seemed like science fiction.
From a historical perspective, it appears possible that the same amount of develop-
ment could now be condensed in an even shorter timeframe. We might not be certain
that this will occur, but neither can we rule it out. We therefore wonder: what new
technology might usher in the next big acceleration? In light of recent advances, AI
seems an increasingly plausible candidate. Perhaps, as AI continues to become more
powerful, it could lead to a qualitative shift in the world that is more profound than
any we have experienced so far. It could be the most impactful period in history,
though it could also be the last.
Although technological advancement has often improved people’s lives, we ought
to remember that as our technology grows in power, so too does its destructive
potential. Consider the invention of nuclear weapons. Last century, for the first time
in our species’ history, humanity possessed the ability to destroy itself, and the world
suddenly became much more fragile.
Our newfound vulnerability revealed itself in unnerving clarity during the Cold War.
On a Saturday in October 1962, the Cuban Missile Crisis was cascading out of con-
trol. US warships enforcing the blockade of Cuba detected a Soviet submarine and
attempted to force it to the surface by dropping low-explosive depth charges. The
submarine was out of radio contact, and its crew had no idea whether World War III
had already begun. A broken ventilator raised the temperature up to 140°F in some
parts of the submarine, causing crew members to fall unconscious as depth charges
exploded nearby.
The submarine carried a nuclear-armed torpedo, which required consent from both
the captain and political officer to launch. Both provided it. On any other submarine
in Cuban waters that day, that torpedo would have launched—and a nuclear third
world war may have followed. Fortunately, a man named Vasili Arkhipov was also on
the submarine. Arkhipov was the commander of the entire flotilla and by sheer luck
happened to be on that particular submarine. He talked the captain down from his
rage, convincing him to await further orders from Moscow. He averted a nuclear war
and saved millions or billions of lives—and possibly civilization itself.
Carl Sagan once observed, “If we continue to accumulate only power and not wisdom,
we will surely destroy ourselves” [5]. Sagan was correct: The power of nuclear weapons
was not one we were ready for. Overall, it has been luck rather than wisdom that
has saved humanity from nuclear annihilation, with multiple recorded instances of a
single individual preventing a full-scale nuclear war.
AI is now poised to become a powerful technology with destructive potential similar
to nuclear weapons. We do not want to repeat the Cuban Missile Crisis. We do not
want to slide toward a moment of peril where our survival hinges on luck rather than
the ability to use this technology wisely. Instead, we need to work proactively to
mitigate the risks it poses. This necessitates a better understanding of what could go
wrong and what to do about it.
Luckily, AI systems are not yet advanced enough to contribute to every risk we
discuss. But that is cold comfort in a time when AI development is advancing at an
unprecedented and unpredictable rate. We consider risks arising from both present-
day AIs and AIs that are likely to exist in the near future. If we wait for more advanced systems to be developed before taking action, it may be too late.
In this chapter, we will explore various ways in which powerful AIs could bring about
catastrophic events with devastating consequences for vast numbers of people. We
will also discuss how AIs could present existential risks—catastrophes from which
humanity would be unable to recover. The most obvious such risk is extinction, but
there are other outcomes, such as creating a permanent dystopian society, which
would also constitute an existential catastrophe. As further discussed in this book’s
Introduction, we do not intend to cover all risks or harms that AI may pose in an
exhaustive manner, and many of these fall outside the scope of this chapter. We
outline many possible scenarios, some of which are more likely than others and some
of which are mutually incompatible. This approach is motivated by
the principles of risk management. We prioritize asking “what could go wrong?” rather
than reactively waiting for catastrophes to occur. This proactive mindset enables us
to anticipate and mitigate catastrophic risks before it’s too late.
To help orient the discussion, we decompose catastrophic risks from AIs into four risk
sources that warrant intervention:
• Malicious use: Malicious actors using AIs to cause large-scale devastation.
• AI race: Competitive pressures that could drive us to deploy AIs in unsafe ways
despite this being in no one’s best interest.
• Organizational risks: Accidents arising from the complexity of AIs and the organizations developing them.

• Rogue AIs: The problem of controlling a technology that is more intelligent than
we are.

These four sections—Malicious Use, AI Race, Organizational Risks, and Rogue AIs—
describe causes of AI risks that are intentional, environmental/structural, accidental,
and internal, respectively [6]. The risks that are briefly outlined in this chapter are
discussed in greater depth in the rest of this book.
In this chapter, we will describe how concrete, small-scale examples of each risk might
escalate into catastrophic outcomes. We also include hypothetical stories to help
readers conceptualize the various processes and dynamics discussed in each section.
We hope this survey will serve as a practical introduction for readers interested in
learning about and mitigating catastrophic AI risks.

1.2 MALICIOUS USE

On the morning of March 20, 1995, five men entered the Tokyo subway system. After
boarding separate subway lines, they continued for several stops before dropping the
bags they were carrying and exiting. An odorless, colorless liquid inside the bags
began to vaporize. Within minutes, commuters began choking and vomiting. The
trains continued on toward the heart of Tokyo, with sickened passengers leaving the
cars at each station. The fumes were spread at each stop, either by emanating from
the tainted cars or through contact with people’s clothing and shoes. By the end of
the day, 13 people lay dead and 5,800 were seriously injured. The group responsible for the
attack was the religious cult Aum Shinrikyo [7]. Its motive for murdering innocent
people? To bring about the end of the world.
Powerful new technologies offer tremendous potential benefits, but they also carry the
risk of empowering malicious actors to cause widespread harm. There will always be
those with the worst of intentions, and AIs could provide them with a formidable tool
to achieve their objectives. Moreover, as AI technology advances, severe malicious use
could potentially destabilize society, increasing the likelihood of other risks.
In this section, we will explore the various ways in which the malicious use of advanced
AIs could pose catastrophic risks. These include engineering biochemical weapons,
unleashing rogue AIs, using persuasive AIs to spread propaganda and erode consensus
reality, and leveraging censorship and mass surveillance to irreversibly concentrate
power. We will conclude by discussing possible strategies for mitigating the risks
associated with the malicious use of AIs.

Unilateral actors considerably increase the risks of malicious use. In instances where numerous actors have access to powerful technology or dangerous
information that could be used for harmful purposes, it only takes one individual to
cause significant devastation. Malicious actors themselves are the clearest example of
this, but recklessness can be equally dangerous. For example, a single research team
might be excited to open-source an AI system with biological research capabilities,
which would speed up research and potentially save lives, but this could also increase
the risk of malicious use if the AI system could be repurposed to develop bioweapons.
In situations like this, the outcome may be determined by the least risk-averse re-
search group. If only one research group thinks the benefits outweigh the risks, it
could act unilaterally, deciding the outcome even if most others don’t agree. And if
they are wrong and someone does decide to develop a bioweapon, it would be too
late to reverse course.
By default, advanced AIs may increase the destructive capacity of both the most
powerful and the general population. Thus, the growing potential for AIs to empower
malicious actors is one of the most severe threats humanity will face in the coming
decades. The examples we give in this section are only those we can foresee. It is
possible that AIs could aid in the creation of dangerous new technology we cannot
presently imagine, which would further increase risks from malicious use.

1.2.1 Bioterrorism

The rapid advancement of AI technology increases the risk of bioterrorism. AIs with knowledge of bioengineering could facilitate the creation of novel bioweapons
and lower barriers to obtaining such agents. Engineered pandemics from AI-assisted
bioweapons pose a unique challenge, as attackers have an advantage over defenders
and could constitute an existential threat to humanity. We will now examine these
risks and how AIs might exacerbate challenges in managing bioterrorism and engi-
neered pandemics.

Bioengineered pandemics present a new threat. Biological agents, including viruses and bacteria, have caused some of the most devastating catastrophes in
history. It’s believed the Black Death killed more humans than any other event in
history, an astounding and awful 200 million, the equivalent of four billion deaths
today. While contemporary advancements in science and medicine have made great
strides in mitigating risks associated with natural pandemics, engineered pandemics
could be designed to be more lethal or easily transmissible than natural pandemics,
presenting a new threat that could equal or even surpass the devastation wrought by
history’s most deadly plagues [8].
Humanity has a long and dark history of weaponizing pathogens, with records dating
back to 1320 BCE describing a war in Asia Minor where infected sheep were driven
across the border to spread Tularemia [9]. During the twentieth century, 15 countries
are known to have developed bioweapons programs, including the US, USSR, UK, and
France. Like chemical weapons, bioweapons have become a taboo among the interna-
tional community. While some state actors continue to operate bioweapons programs
[10], a more significant risk may come from non-state actors like Aum Shinrikyo, ISIS,
or simply disturbed individuals. Due to advancements in AI and biotechnology, the
tools and knowledge necessary to engineer pathogens with capabilities far beyond
Cold War-era bioweapons programs will rapidly democratize.

Biotechnology is progressing rapidly and becoming more accessible. A few decades ago, the ability to synthesize new viruses was limited to a handful of
the top scientists working in advanced laboratories. Today it is estimated that there
are 30,000 people with the talent, training, and access to technology to create new
pathogens [8]. This figure could rapidly expand. Gene synthesis, which allows the
creation of custom biological agents, has dropped precipitously in price, with its
cost halving approximately every 15 months [11]. Furthermore, with the advent of
benchtop DNA synthesis machines, access will become much easier and could avoid
existing gene synthesis screening efforts, which complicates controlling the spread of
such technology [12]. The chances of a bioengineered pandemic killing millions, per-
haps billions, is proportional to the number of people with the skills and access to the
technology to synthesize them. With AI assistants, orders of magnitude more people
could have the required skills, thereby increasing the risks by orders of magnitude.
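
As a simple illustration of the quoted trend (the 15-month halving time is from the text above; the arithmetic and the notation C(t) for the cost t months from now are ours),

\[
\frac{C(t)}{C(0)} = \left(\frac{1}{2}\right)^{t/15},
\]

so after a decade (t = 120 months) the cost has fallen by a factor of 2^8 = 256, to well under one percent of its starting value. Sustained declines of this kind, combined with AI assistance, are what could turn a capability held by thousands of specialists into one held by orders of magnitude more people.
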
AIs could be used to expedite the discovery of new, more deadly chemical
and biological weapons. In 2022, researchers took an AI system designed to
create new drugs by generating non-toxic, therapeutic molecules and tweaked it to
reward, rather than penalize, toxicity [13]. After this simple change, within six hours,
it generated 40,000 candidate chemical warfare agents entirely on its own. It designed
not just known deadly chemicals including VX, but also novel molecules that may be
deadlier than any chemical warfare agents discovered so far. In the field of biology,
AIs have already surpassed human abilities in protein structure prediction [14] and
made contributions to synthesizing those proteins [15]. Similar methods could be used
to create bioweapons and develop pathogens that are deadlier, more transmissible,
and more difficult to treat than anything seen before.

AIs compound the threat of bioengineered pandemics. AIs will increase the
number of people who could commit acts of bioterrorism. General-purpose AIs like
ChatGPT are capable of synthesizing expert knowledge about the deadliest known
pathogens, such as influenza and smallpox, and providing step-by-step instructions
about how a person could create them while evading safety protocols [16]. Future
versions of AIs could be even more helpful to potential bioterrorists when AIs are
able to synthesize information into techniques, processes, and knowledge that is not
explicitly available anywhere on the internet. Public health authorities may respond
to these threats with safety measures, but in bioterrorism, the attacker has the ad-
vantage. The exponential nature of biological threats means that a single attack could
spread to the entire world before an effective defense could be mounted. Only 100 days
after being detected and sequenced, the omicron variant of COVID-19 had infected
a quarter of the United States and half of Europe [8]. Quarantines and lockdowns
instituted to suppress the COVID-19 pandemic caused a global recession and still
could not prevent the disease from killing millions worldwide.
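
A rough back-of-the-envelope calculation (ours, using an assumed doubling time; only the 100-day figure comes from the text) illustrates why the attacker has the advantage. If unchecked infections roughly double every d days starting from a single case, then after t days

\[
I(t) \approx 2^{t/d},
\]

so with a doubling time of around three days, 100 days of unchecked spread corresponds to 2^{100/3}, roughly 10^{10} potential infections, more than the world's population. Containment therefore has to succeed within the first handful of doubling times, a window of weeks rather than months.
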
In summary, advanced AIs could constitute a weapon of mass destruction in the
hands of terrorists by making it easier for them to design, synthesize, and spread
deadly new pathogens. By reducing the required technical expertise and increasing
the lethality and transmissibility of pathogens, AIs could enable malicious actors to
cause global catastrophe by unleashing pandemics.

1.2.2 Unleashing AI Agents


Many technologies are tools that humans use to pursue our goals, such as ham-
mers, toasters, and toothbrushes. But AIs are increasingly built as agents which au-
tonomously take actions in the world in order to pursue open-ended goals. AI agents
can be given goals such as winning games, making profits on the stock market, or
driving a car to a destination. AI agents therefore pose a unique risk: people could
build AIs that pursue dangerous goals.

Malicious actors could intentionally create rogue AIs. One month after
the release of GPT-4, an open-source project bypassed the AI’s safety filters and
turned it into an autonomous AI agent instructed to “destroy humanity,” “establish
global dominance,” and “attain immortality.” Dubbed ChaosGPT, the AI compiled
research on nuclear weapons and sent tweets trying to influence others. Fortunately,
ChaosGPT was merely a warning given that it lacked the ability to successfully
formulate long-term plans, hack computers, and survive and spread. Yet given the
rapid pace of AI development, ChaosGPT did offer a glimpse into the risks that more
advanced rogue AIs could pose in the near future.

Many groups may want to unleash AIs or have AIs displace human-
ity. Simply unleashing rogue AIs, like a more sophisticated version of ChaosGPT,
could accomplish mass destruction, even if those AIs aren’t explicitly told to harm
humanity. There are a variety of beliefs that may drive individuals and groups to do
so. One ideology that could pose a unique threat in this regard is “accelerationism.”
This ideology seeks to accelerate AI development as rapidly as possible and opposes
restrictions on the development or proliferation of AIs. This sentiment is common
among many leading AI researchers and technology leaders, some of whom are in-
tentionally racing to build AIs more intelligent than humans. According to Google
co-founder Larry Page, AIs are humanity’s rightful heirs and the next step of cosmic
evolution. He has also expressed the sentiment that humans maintaining control over
AIs is “speciesist” [17]. Jürgen Schmidhuber, an eminent AI scientist, argued that “In
the long run, humans will not remain the crown of creation... But that’s okay because
there is still beauty, grandeur, and greatness in realizing that you are a tiny part of
a much grander scheme which is leading the universe from lower complexity toward
higher complexity” [18]. Richard Sutton, another leading AI scientist, in discussing
smarter-than-human AI, asked “why shouldn’t those who are the smartest become
powerful?” and thinks the development of superintelligence will be an achievement
“beyond humanity, beyond life, beyond good and bad” [19]. He argues that “suc-
cession to AI is inevitable,” and while “they could displace us from existence,” “we
should not resist succession” [20].
There are several sizable groups who may want to unleash AIs to intentionally cause
harm. For example, sociopaths and psychopaths make up around 3 percent of the
population [21]. In the future, people who have their livelihoods destroyed by AI
automation may grow resentful, and some may want to retaliate. There are plenty
of cases in which seemingly mentally stable individuals with no history of insanity
or violence suddenly go on a shooting spree or plant a bomb with the intent to
harm as many innocent people as possible. We can also expect well-intentioned peo-
ple to make the situation even more challenging. As AIs advance, they could make
ideal companions—knowing how to provide comfort, offering advice when needed,
and never demanding anything in return. Inevitably, people will develop emotional
bonds with chatbots, and some will demand that they be granted rights or become
autonomous.
In summary, releasing powerful AIs and allowing them to take actions independently
of humans could lead to a catastrophe. There are many reasons that people might
pursue this, whether because of a desire to cause harm, an ideological belief in techno-
logical acceleration, or a conviction that AIs should have the same rights and freedoms
as humans.

1.2.3 Persuasive AIs

The deliberate propagation of disinformation is already a serious issue, reducing our shared understanding of reality and polarizing opinions. AIs could be used to severely
exacerbate this problem by generating personalized disinformation on a larger scale
than before. Additionally, as AIs become better at predicting and nudging our be-
havior, they will become more capable of manipulating us. We will now discuss how
AIs could be leveraged by malicious actors to create a fractured and dysfunctional
society.

AIs could pollute the information ecosystem with motivated lies. Some-
times ideas spread not because they are true, but because they serve the interests
of a particular group. “Yellow journalism” was coined as a pejorative reference to
newspapers that advocated war between Spain and the United States in the late
19th century because they believed that sensational war stories would boost their
sales [22]. When public information sources are flooded with falsehoods, people will
sometimes fall prey to lies or else come to distrust mainstream narratives, both of
which undermine societal integrity.
Unfortunately, AIs could escalate these existing problems dramatically. First, AIs
could be used to generate unique, personalized disinformation at a large scale. While
there are already many social media bots [23], some of which exist to spread disin-
formation, historically they have been run by humans or primitive text generators.
The latest AI systems do not need humans to generate personalized messages, never
get tired, and could potentially interact with millions of users at once [24].

AIs can exploit users’ trust. Already, hundreds of thousands of people pay for
chatbots marketed as lovers and friends [25], and one man’s suicide has been partially
attributed to interactions with a chatbot [26]. As AIs appear increasingly human-like,
people will increasingly form relationships with them and grow to trust them. AIs that
gather personal information through relationship-building or by accessing extensive
personal data, such as a user’s email account or personal files, could leverage that
information to enhance persuasion. Powerful actors that control those systems could
exploit user trust by delivering personalized disinformation directly through people’s “friends.”

AIs could centralize control of trusted information. Separate from democratizing disinformation, AIs could centralize the creation and dissemination of trusted
information. Only a few actors have the technical skills and resources to develop
cutting-edge AI systems, and they could use these AIs to spread their preferred nar-
ratives. Alternatively, if AIs are broadly accessible, this could lead to widespread dis-
information, with people retreating to trusting only a small handful of authoritative
sources [27]. In both scenarios, there would be fewer sources of trusted information,
and a small portion of society would control popular narratives.
AI censorship could further centralize control of information. This could begin with
good intentions, such as using AIs to enhance fact-checking and help people avoid
falling prey to false narratives. This would not necessarily solve the problem, as
disinformation persists today despite the presence of fact-checkers.
Even worse, purported “fact-checking AIs” might be designed by authoritarian gov-
ernments and others to suppress the spread of true information. Such AIs could be
designed to correct most common misconceptions but provide incorrect information
about some sensitive topics, such as human rights violations committed by certain
countries. But even if fact-checking AIs work as intended, the public might eventu-
ally become entirely dependent on them to adjudicate the truth, reducing people’s
autonomy and making them vulnerable to failures or hacks of those systems.
In a world with widespread persuasive AI systems, people’s beliefs might be almost
entirely determined by which AI systems they interact with most. Never knowing
whom to trust, people could retreat even further into ideological enclaves, fearing
that any information from outside those enclaves might be a sophisticated lie. This
would erode consensus reality and people’s ability to cooperate with others, participate
in civil society, and address collective action problems. This would also reduce our
ability to have a conversation as a species about how to mitigate existential risks
from AIs.
In summary, AIs could create highly effective, personalized disinformation on an
unprecedented scale and could be particularly persuasive to people they have built
personal relationships with. In the hands of many people, this could create a deluge
of disinformation that debilitates human society, but kept in the hands of a few, it
could allow governments to control narratives for their own ends.

1.2.4 Concentration of Power

We have discussed several ways in which individuals and groups might use AIs to
cause widespread harm, through bioterrorism; releasing powerful, uncontrolled AIs;
and disinformation. To mitigate these risks, governments might pursue intense surveil-
lance and seek to keep AIs in the hands of a trusted minority. This reaction, however,
could easily become an overcorrection, paving the way for an entrenched totalitarian
regime that would be locked in by the power and capacity of AIs. This scenario rep-
resents a form of “top-down” misuse, as opposed to “bottom-up” misuse by citizens
and could, in extreme cases, culminate in an entrenched dystopian civilization.

AIs could lead to extreme and perhaps irreversible concentration of power. The persuasive abilities of AIs combined with their potential for surveil-
lance and the advancement of autonomous weapons could allow small groups of actors
to “lock-in” their control over society, perhaps permanently. To operate effectively, AIs
require a broad set of infrastructure components, which are not equally distributed,
such as data centers, computing power, and big data. Those in control of powerful
systems may use them to suppress dissent, spread propaganda and disinformation,
and otherwise advance their goals, which may be contrary to public wellbeing.

AIs may entrench a totalitarian regime. In the hands of the state, AIs may
result in the erosion of civil liberties and democratic values in general. AIs could
allow totalitarian governments to efficiently collect, process, and act on an unprece-
dented volume of information, permitting an ever smaller group of people to surveil
and exert complete control over the population without the need to enlist millions of
citizens to serve as willing government functionaries. Overall, as power and control
shift away from the public and toward elites and leaders, democratic governments
are highly vulnerable to totalitarian backsliding. Additionally, AIs could make total-
itarian regimes much longer-lasting; a major way in which such regimes have been
toppled previously is at moments of vulnerability like the death of a dictator, but
AIs, which would be hard to “kill,” could provide much more continuity to leadership,
providing few opportunities for reform.

AIs can entrench corporate power at the expense of the public good.
Corporations have long lobbied to weaken laws and policies that restrict their actions
and power, all in the service of profit. Corporations in control of powerful AI systems
may use them to manipulate customers into spending more on their products even to
the detriment of their own wellbeing. The concentration of power and influence that
could be afforded by AIs could enable corporations to exert unprecedented control
over the political system and entirely drown out the voices of citizens. This could
occur even if creators of these systems know their systems are self-serving or harmful
to others, as they would have incentives to reinforce their power and avoid distributing
control.

In addition to power, locking in certain values may curtail humanity’s moral progress. It’s dangerous to allow any set of values to become permanently
entrenched in society. For example, AI systems have learned racist and sexist views
[28], and once those views are learned, it can be difficult to fully remove them. In
addition to problems we know exist in our society, there may be some we still do not.
Just as we abhor some moral views widely held in the past, people in the future may
want to move past moral views that we hold today, even those we currently see no
problem with. For example, moral defects in AI systems would have been even worse if AI
systems had been trained in the 1960s, and many people at the time would have seen
no problem with that. We may even be unknowingly perpetuating moral catastrophes
today [29]. Therefore, when advanced AIs emerge and transform the world, there is a
risk of their objectives locking in or perpetuating defects in today’s values. If AIs are
not designed to continuously learn and update their understanding of societal values,
they may perpetuate or reinforce existing defects in their decision-making processes
long into the future.
In summary, although keeping powerful AIs in the hands of a few might reduce the
risks of terrorism, it could further exacerbate power inequality if misused by govern-
ments and corporations. This could lead to totalitarian rule and intense manipulation
of the public by corporations and could lock in current values, preventing any further
moral progress.

Story: Bioterrorism
The following is an illustrative hypothetical story to help readers envision some
of these risks. This story is nonetheless somewhat vague to reduce the risk of
inspiring malicious actions based on it.
A biotechnology startup is making waves in the industry with its AI-powered
bioengineering model. The company has made bold claims that this new tech-
nology will revolutionize medicine through its ability to create cures for both
known and unknown diseases. The company did, however, stir up some con-
troversy when it decided to release the program to approved researchers in the
scientific community. Only weeks after its decision to make the model open-
source on a limited basis, the full model was leaked on the internet for all to
see. Its critics pointed out that the model could be repurposed to design lethal
pathogens and claimed that the leak provided bad actors with a powerful tool
to cause widespread destruction, opening it up to abuse without safeguards in
place.
Unknown to the public, an extremist group has been working for years to
engineer a new virus designed to kill large numbers of people. Yet given their
lack of expertise, these efforts have so far been unsuccessful. When the new
AI system is leaked, the group immediately recognizes it as a potential tool
to design the virus and circumvent legal and monitoring obstacles to obtain
the necessary raw materials. The AI system successfully designs exactly the
kind of virus the extremist group was hoping for. It also provides step-by-step
instructions on how to synthesize large quantities of the virus and circumvent
any obstacles to spreading it. With the synthesized virus in hand, the extremist
group devises a plan to release the virus in several carefully chosen locations
in order to maximize its spread.
The virus has a long incubation period and spreads silently and quickly
throughout the population for months. By the time it is detected, it has al-
ready infected millions and has an alarmingly high mortality rate. Given its
lethality, most who are infected will ultimately die. The virus may or may not
be contained eventually, but not before it kills millions of people.

1.3 AI RACE

The immense potential of AIs has created competitive pressures among global play-
ers contending for power and influence. This “AI race” is driven by nations and
corporations who feel they must rapidly build and deploy AIs to secure their posi-
tions and survive. By failing to properly prioritize global risks, this dynamic makes
it more likely that AI development will produce dangerous outcomes. Analogous to
the nuclear arms race during the Cold War, participation in an AI race may serve
individual short-term interests, but it ultimately results in worse collective outcomes
for humanity. Importantly, these risks stem not only from the intrinsic nature of AI
technology, but from the competitive pressures that encourage insidious choices in
AI development.
In this section, we first explore the military AI arms race and the corporate AI race,
where nation-states and corporations are forced to rapidly develop and adopt AI sys-
tems to remain competitive. Moving beyond these specific races, we reconceptualize
competitive pressures as part of a broader evolutionary process in which AIs could
become increasingly pervasive, powerful, and entrenched in society. Finally, we high-
light potential strategies and policy suggestions to mitigate the risks created by an
AI race and ensure the safe development of AIs.

1.3.1 Military AI Arms Race

The development of AIs for military applications is swiftly paving the way for a new
era in military technology, with potential consequences rivaling those of gunpowder
and nuclear arms in what has been described as the “third revolution in warfare.”
The weaponization of AI presents numerous challenges, such as the potential for more
destructive wars, the possibility of accidental usage or loss of control, and the prospect
of malicious actors co-opting these technologies for their own purposes. As AIs gain
influence over traditional military weaponry and increasingly take on command and
control functions, humanity faces a paradigm shift in warfare. In this context, we will
discuss the latent risks and implications of this AI arms race on global security, the
potential for intensified conflicts, and the dire outcomes that could come as a result,
including the possibility of conflicts escalating to a scale that poses an existential
threat.

Lethal Autonomous Weapons (LAWs)

LAWs are weapons that can identify, target, and kill without human intervention [30].
They offer potential improvements in decision-making speed and precision. Warfare,
however, is a high-stakes, safety-critical domain for AIs with significant moral and
practical concerns. Though their existence is not necessarily a catastrophe in itself,
LAWs may serve as an on-ramp to catastrophes stemming from malicious use, acci-
dents, loss of control, or an increased likelihood of war.

LAWs may become vastly superior to humans. Driven by rapid developments in AIs, weapons systems that can identify, target, and decide to kill human
beings on their own—without an officer directing an attack or a soldier pulling the
trigger—are starting to transform the future of conflict. In 2020, an advanced AI
agent outperformed experienced F-16 pilots in a series of virtual dogfights, including
decisively defeating a human pilot 5-0, showcasing “aggressive and precise maneuvers
the human pilot couldn’t outmatch” [31]. Just as in the past, superior weapons would
allow for more destruction in a shorter period of time, increasing the severity of war.

Militaries are taking steps toward delegating life-or-death decisions to AIs. Fully autonomous drones were likely first used on the battlefield in Libya in
March 2020, when retreating forces were “hunted down and remotely engaged” by a
drone operating without human oversight [32]. In May 2021, the Israel Defense Forces
used the world’s first AI-guided weaponized drone swarm during combat operations,
which marks a significant milestone in the integration of AI and drone technology in
warfare [33]. Although walking, shooting robots have yet to replace soldiers on the
battlefield, technologies are converging in ways that may make this possible in the
near future.

LAWs increase the likelihood of war. Sending troops into battle is a grave
decision that leaders do not make lightly. But autonomous weapons would allow an
aggressive nation to launch attacks without endangering the lives of its own soldiers
and thus face less domestic scrutiny. While remote-controlled weapons share this ad-
vantage, their scalability is limited by the requirement for human operators and vul-
nerability to jamming countermeasures, limitations that LAWs could overcome [34].
Public support for continuing wars tends to wane as conflicts drag on and casualties
increase [35]. LAWs would change this equation. National leaders would no longer
face the prospect of body bags returning home, thus removing a primary barrier to
engaging in warfare, which could ultimately increase the likelihood of conflicts.

Cyberwarfare

As well as being used to enable deadlier weapons, AIs could lower the barrier to
entry for cyberattacks, making them more numerous and destructive. They could
cause serious harm not only in the digital environment but also in physical systems,
potentially taking out critical infrastructure that societies depend on. While AIs could
also be used to improve cyberdefense, it is unclear whether they will be most effective
as an offensive or defensive technology [36]. If they enhance attacks more than they
support defense, then cyberattacks could become more common, creating significant
geopolitical turbulence and paving another route to large-scale conflict.

AIs have the potential to increase the accessibility, success rate, scale,
speed, stealth, and potency of cyberattacks. Cyberattacks are already a real-
ity, but AIs could be used to increase their frequency and destructiveness in multiple
ways. Machine learning tools could be used to find more critical vulnerabilities in
target systems and improve the success rate of attacks. They could also be used to
increase the scale of attacks by running millions of systems in parallel, and increase
the speed by finding novel routes to infiltrating a system. Cyberattacks could also
become more potent if used to hijack AI weapons.

Cyberattacks can destroy critical infrastructure. By hacking computer systems that control physical processes, cyberattacks could cause extensive infrastructure damage. For example, they could cause system components to overheat or valves
to lock, leading to a buildup of pressure culminating in an explosion. Through inter-
ferences like this, cyberattacks have the potential to destroy critical infrastructure,
such as electric grids and water supply systems. This was demonstrated in 2015, when
a cyberwarfare unit of the Russian military hacked into the Ukrainian power grid,
leaving over 200,000 people without power access for several hours. AI-enhanced at-
tacks could be even more devastating and potentially deadly for the billions of people
who rely on critical infrastructure for survival.

Difficulties in attributing AI-driven cyberattacks could increase the risk of war. A cyberattack resulting in physical damage to critical infrastructure would
require a high degree of skill and effort to execute, perhaps only within the capability
of nation-states. Such attacks are rare as they constitute an act of war, and thus
elicit a full military response. Yet AIs could enable attackers to hide their identity,
for example if they are used to evade detection systems or more effectively cover the
tracks of the attacker [37]. If cyberattacks become more stealthy, this would reduce
the threat of retaliation from an attacked party, potentially making attacks more
likely. If stealthy attacks do happen, they might incite actors to mistakenly retaliate
against unrelated third parties they suspect to be responsible. This could increase
the scope of the conflict dramatically.

Automated Warfare

AIs speed up the pace of war, which makes AIs more necessary. AIs
can quickly process a large amount of data, analyze complex situations, and provide
helpful insights to commanders. With ubiquitous sensors and advanced technology on
the battlefield, there is tremendous incoming information. AIs help make sense of this
information, spotting important patterns and relationships that humans might miss.
As these trends continue, it will become increasingly difficult for humans to make well-
informed decisions as quickly as necessary to keep pace with AIs. This would further
pressure militaries to hand over decisive control to AIs. The continuous integration
of AIs into all aspects of warfare will cause the pace of combat to become faster and
faster. Eventually, we may arrive at a point where humans are no longer capable
of assessing the ever-changing battlefield situation and must cede decision-making
power to advanced AIs.

Automatic retaliation can escalate accidents into war. There is already willingness to let computer systems retaliate automatically. In 2014, a leak revealed
to the public that the NSA was developing a system called MonsterMind, which
would autonomously detect and block cyberattacks on US infrastructure [38]. It was
suggested that in the future, MonsterMind could automatically initiate a retaliatory cyberattack with no human involvement. If multiple combatants have policies of
automatic retaliation, an accident or false alarm could quickly escalate to full-scale
war before humans intervene. This would be especially dangerous if the superior
information processing capabilities of modern AI systems makes it more appealing
for actors to automate decisions regarding nuclear launches.

History shows the danger of automated retaliation. On September 26, 1983, Stanislav Petrov, a lieutenant colonel of the Soviet Air Defense Forces, was
on duty at the Serpukhov-15 bunker near Moscow, monitoring the Soviet Union’s
early warning system for incoming ballistic missiles. The system indicated that the
US had launched multiple nuclear missiles toward the Soviet Union. The protocol at
the time dictated that such an event should be considered a legitimate attack, and
the Soviet Union would respond with a nuclear counterstrike. If Petrov had passed
on the warning to his superiors, this would have been the likely outcome. Instead,
however, he judged it to be a false alarm and ignored it. It was soon confirmed that
the warning had been caused by a rare technical malfunction. If an AI had been in
control, the false alarm could have triggered a nuclear war.

AI-controlled weapons systems could lead to a flash war. Autonomous systems are not infallible. We have already witnessed how quickly an error in an
automated system can escalate in the economy. Most notably, in the 2010 Flash
Crash, a feedback loop between automated trading algorithms amplified ordinary
market fluctuations into a financial catastrophe in which a trillion dollars of stock
value vanished in minutes [39]. If multiple nations were to use AIs to automate their
defense systems, an error could be catastrophic, triggering a spiral of attacks and
counter-attacks that would happen too quickly for humans to step in—a flash war.
The market quickly recovered from the 2010 Flash Crash, but the harm caused by a
flash war could be catastrophic.
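
To make this feedback loop concrete, consider the minimal sketch below. It simulates two automated systems that each respond slightly more forcefully than the attack they believe they have detected; the escalation factor, the size of the false alarm, and the cutoff are arbitrary illustrative assumptions rather than estimates of any real system.

```python
# Toy model of a "flash war" feedback loop: two automated systems that each
# retaliate somewhat harder than the attack they believe they have detected.
# The escalation factor, false-alarm size, and cutoff are arbitrary assumptions.
escalation_factor = 1.5   # each response is 50% larger than the detected attack
perceived_attack = 1.0    # a false alarm: no real attack has occurred
rounds = 0

while perceived_attack < 1000 and rounds < 100:
    response = escalation_factor * perceived_attack   # side A retaliates
    perceived_attack = escalation_factor * response   # side B retaliates in turn
    rounds += 1

print(f"The spiral exceeds 1,000x the original false alarm after {rounds} rounds.")
# With these numbers, that happens in under ten rounds of exchange -- far too
# quickly for humans to diagnose the error and intervene if each round takes
# only milliseconds.
```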

Automated warfare could reduce accountability for military leaders. Military leaders may at times gain an advantage on the battlefield if they are willing to
ignore the laws of war. For example, soldiers may be able to mount stronger attacks
if they do not take steps to minimize civilian casualties. An important deterrent to
this behavior is the risk that military leaders could eventually be held accountable
or even prosecuted for war crimes. Automated warfare could reduce this deterrence
effect by making it easier for military leaders to escape accountability by blaming
violations on failures in their automated systems.

AIs could make war more uncertain, increasing the risk of conflict. Al-
though states that are already wealthier and more powerful often have more resources
to invest in new military technologies, they are not necessarily always the most suc-
cessful at adopting them. Other factors also play an important role, such as how
agile and adaptive a military can be in incorporating new technologies [40]. Major
new weapons innovations can therefore offer an opportunity for existing superpowers
to bolster their dominance, but also for less powerful states to quickly increase their
power by getting ahead in an emerging and important sphere. This can create sig-
nificant uncertainty around if and how the balance of power is shifting, potentially
leading states to incorrectly believe they could gain something from going to war.
Even aside from considerations regarding the balance of power, rapidly evolving au-
tomated warfare would be unprecedented, making it difficult for actors to evaluate
their chances of victory in any particular conflict. This would increase the risk of
miscalculation, making war more likely.

Actors May Risk Extinction Over Individual Defeat


Competitive pressures make actors more willing to accept the risk of ex-
tinction. During the Cold War, neither side desired the dangerous situation they
found themselves in. There were widespread fears that nuclear weapons could be
powerful enough to wipe out a large fraction of humanity, potentially even causing
extinction—a catastrophic result for both sides. Yet the intense rivalry and geopolit-
ical tensions between the two superpowers fueled a dangerous cycle of arms buildup.
Each side perceived the other’s nuclear arsenal as a threat to its very survival, leading
to a desire for parity and deterrence. The competitive pressures pushed both countries
to continually develop and deploy more advanced and destructive nuclear weapons
systems, driven by the fear of being at a strategic disadvantage. During the Cuban
Missile Crisis, this led to the brink of nuclear war. Even though the story of Arkhipov
preventing the launch of a nuclear torpedo wasn’t declassified until decades after the
incident, President John F. Kennedy reportedly estimated that he thought the odds
of nuclear war beginning during that time were “somewhere between one out of three
and even.” This chilling admission highlights how the competitive pressures between
militaries have the potential to cause global catastrophes.
Individually rational decisions can be collectively catastrophic. Nations
locked in competition might make decisions that advance their own interests by
putting the rest of the world at stake. Scenarios of this kind are collective action
problems, where decisions may be rational on an individual level yet disastrous for
the larger group [41]. For example, corporations and individuals may weigh their own
profits and convenience over the negative impacts of the emissions they create, even
if those emissions collectively result in climate change. The same principle can be
extended to military strategy and defense systems. Military leaders might estimate,
for instance, that increasing the autonomy of weapon systems would mean a 10 per-
cent chance of losing control over weaponized superhuman AIs. Alternatively, they
might estimate that using AIs to automate bioweapons research could lead to a 10
percent chance of leaking a deadly pathogen. Both of these scenarios could lead to
catastrophe or even extinction. The leaders may, however, also calculate that refrain-
ing from these developments will mean a 99 percent chance of losing a war against an
opponent. Since conflicts are often viewed as existential struggles by those fighting
them, rational actors may accept an otherwise unthinkable 10 percent chance of hu-
man extinction over a 99 percent chance of losing a war. Regardless of the particular
nature of the risks posed by advanced AIs, these dynamics could push us to the brink
of global catastrophe.
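
To see how such reasoning might look in expected-value terms, consider the following minimal sketch. The probabilities are taken from the hypothetical estimates above, while the utility values are invented purely for illustration and reflect a single nation’s parochial valuation of the outcomes.

```python
# Minimal expected-value sketch of the collective action problem above.
# Probabilities follow the hypothetical estimates in the text; the utilities
# are invented illustrative values from one nation's parochial standpoint.
p_catastrophe_if_automate = 0.10   # chance of losing control over weaponized AIs
p_defeat_if_refrain = 0.99         # chance of losing the war without automation

u_status_quo = 0.0                 # assumed value of avoiding both outcomes
u_defeat = -1.0                    # assumed value of losing an "existential struggle"
u_catastrophe = -5.0               # assumed (parochial) value of a global catastrophe

ev_automate = (p_catastrophe_if_automate * u_catastrophe
               + (1 - p_catastrophe_if_automate) * u_status_quo)
ev_refrain = (p_defeat_if_refrain * u_defeat
              + (1 - p_defeat_if_refrain) * u_status_quo)

print(f"Expected value of automating:  {ev_automate:.2f}")  # -0.50
print(f"Expected value of refraining:  {ev_refrain:.2f}")   # -0.99
# Under these valuations, automating looks individually "rational" -- yet if
# every competitor reasons the same way, the shared risk of catastrophe grows.
```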

Technological superiority does not guarantee national security. It is tempting to think that the best way of guarding against enemy attacks is to improve one’s own military prowess. However, in the midst of competitive pressures,
all parties will tend to advance their weaponry, such that no one gains much of an
advantage, but all are left at greater risk. As Richard Danzig, former Secretary of
the Navy, has observed, “The introduction of complex, opaque, novel, and interactive
technologies will produce accidents, emergent effects, and sabotage. On a number of
occasions and in a number of ways, the American national security establishment
will lose control of what it creates... deterrence is a strategy for reducing attacks, not
accidents” [42].

Cooperation is paramount to reducing risk. As discussed above, an AI arms race can lead us down a hazardous path, despite this being in no country’s best
interest. It is important to remember that we are all on the same side when it comes
to existential risks, and working together to prevent them is a collective necessity.
A destructive AI arms race benefits nobody, so all actors would be rational to take
steps to cooperate with one another to prevent the riskiest applications of militarized
AIs. As Dwight D. Eisenhower reminded us, “The only way to win World War III is
to prevent it.”
We have considered how competitive pressures could lead to the increasing automa-
tion of conflict, even if decision-makers are aware of the existential threat that this
path entails. We have also discussed cooperation as being the key to counteracting
and overcoming this collective action problem. We will now illustrate a hypothetical
path to disaster that could result from an AI arms race.

Story: Automated Warfare


As AI systems become increasingly sophisticated, militaries start involving
them in decision-making processes. Officials give them military intelligence
about opponents’ arms and strategies, for example, and ask them to calculate
the most promising plan of action. It soon becomes apparent that AIs are reli-
ably reaching better decisions than humans, so it seems sensible to give them
more influence. At the same time, international tensions are rising, increasing
the threat of war.
A new military technology has recently been developed that could make in-
ternational attacks swifter and stealthier, giving targets less time to respond.
Since military officials feel their response processes take too long, they fear
that they could be vulnerable to a surprise attack capable of inflicting decisive
damage before they would have any chance to retaliate. Since AIs can process
information and make decisions much more quickly than humans, military
leaders reluctantly hand them increasing amounts of retaliatory control, rea-
soning that failing to do so would leave them open to attack from adversaries.
While for years military leaders had stressed the importance of keeping a “hu-
man in the loop” for major decisions, human control is nonetheless gradually
phased out in the interests of national security. Military leaders understand that their decisions lead to the possibility of inadvertent escalation caused by
system malfunctions and would prefer a world where all countries automated
less; but they do not trust that their adversaries will refrain from automation.
Over time, more and more of the chain of command is automated on all sides.
One day, a single system malfunctions, detecting an enemy attack when there
is none. The system is empowered to launch an instant “retaliatory” attack,
and it does so in the blink of an eye. The attack causes automated retaliation
from the other side, and so on. Before long, the situation is spiraling out of
control, with waves of automated attack and retaliation. Although humans
have made mistakes leading to escalation in the past, this escalation between
mostly automated militaries happens far more quickly than any before. The
humans who are responding to the situation find it difficult to diagnose the
source of the problem, as the AI systems are not transparent. By the time
they even realize how the conflict started, it is already over, with devastating
consequences for both sides.

1.3.2 Corporate AI Race

Competitive pressures exist in the economy, as well as in military settings. Although competition between companies can be beneficial, creating more useful products for
consumers, there are also pitfalls. First, the benefits of economic activity may be
unevenly distributed, incentivizing those who benefit most from it to disregard the
harms to others. Second, under intense market competition, businesses tend to focus
much more on short-term gains than on long-term outcomes. With this mindset,
companies often pursue something that can make a lot of profit in the short term,
even if it poses a societal risk in the long term. We will now discuss how corporate
competitive pressures could play out with AIs and the potential negative impacts.

Economic Competition Undercuts Safety

Competitive pressure is fueling a corporate AI race. To obtain a competitive advantage, companies often race to offer the first products to a market rather
than the safest. These dynamics are already playing a role in the rapid development
of AI technology. At the launch of Microsoft’s AI-powered search engine in February
2023, the company’s CEO Satya Nadella said, “A race starts today... we’re going to
move fast.” Only weeks later, the company’s chatbot was shown to have threatened
to harm users [43]. In an internal email, Sam Schillace, a technology executive at Mi-
crosoft, highlighted the urgency with which companies view AI development. He wrote
that it would be an “absolutely fatal error in this moment to worry about things that
can be fixed later” [44].

Competitive pressures have contributed to major commercial and industrial disasters. Throughout the 1960s, Ford Motor Company faced competition
from international car manufacturers as the share of imports in American car pur-
chases steadily rose [45]. Ford developed an ambitious plan to design and manufacture
a new car model in only 25 months [46]. The Ford Pinto was delivered to customers
ahead of schedule, but with a serious safety problem: the gas tank was located near
the rear bumper and could explode during rear collisions. Numerous fatalities and
injuries were caused by the resulting fires when crashes inevitably happened [47].
Ford was sued and a jury found them liable for these deaths and injuries [48]. The
verdict, of course, came too late for those who had already lost their lives. As Ford’s
president at the time was fond of saying, “Safety doesn’t sell” [49].
Boeing, aiming to compete with its rival Airbus, sought to deliver an updated, more
fuel-efficient model to the market as quickly as possible. The head-to-head rivalry
and time pressure led to the introduction of the Maneuvering Characteristics Aug-
mentation System, which was designed to enhance the aircraft’s stability. However,
inadequate testing and pilot training ultimately resulted in two fatal crashes only
months apart, with 346 people killed [50]. We can imagine a future in which similar
pressures lead companies to cut corners and release unsafe AI systems.
A third example is the Bhopal gas tragedy, which is widely considered to be the
worst industrial disaster ever to have happened. In December 1984, a vast quantity
of toxic gas leaked from a Union Carbide Corporation subsidiary plant manufacturing
pesticides in Bhopal, India. Exposure to the gas killed thousands of people and injured
up to half a million more. Investigations found that, in the run-up to the disaster,
safety standards had fallen significantly, with the company cutting costs by neglecting
equipment maintenance and staff training as profitability fell. This is often considered
a consequence of competitive pressures [51].

Competition incentivizes businesses to deploy potentially unsafe AI systems. In an environment where businesses are rushing to develop and release prod-
ucts, those that follow rigorous safety procedures will be slower and risk being out-
competed. Ethically minded AI developers, who want to proceed more cautiously
and slow down, would give more unscrupulous developers an advantage. In trying to
survive commercially, even the companies that want to take more care are likely to
be swept along by competitive pressures. There may be attempts to implement safety
measures, but with more of an emphasis on capabilities than on safety, these may
be insufficient. This could lead us to develop highly powerful AIs before we properly
understand how to ensure they are safe.

Automated Economy

Corporations will face pressure to replace humans with AIs. As AIs be-
come more capable, they will be able to perform an increasing variety of tasks more
quickly, cheaply, and effectively than human workers. Companies will therefore stand
to gain a competitive advantage from replacing their employees with AIs. Compa-
nies that choose not to adopt AIs would likely be out-competed, just as a clothing
company using manual looms would be unable to keep up with those using industrial
ones.

AIs could lead to mass unemployment. Economists have long considered the
possibility that machines will replace human labor. Nobel Prize winner Wassily Leon-
tief said in 1952 that, as technology advances, “Labor will become less and less
important... more and more workers will be replaced by machines” [52]. Previous
technologies have augmented the productivity of human labor. AIs, however, could
differ profoundly from previous innovations. Advanced AIs capable of automating
human labor should be regarded not merely as tools, but as agents. Human-level AI
agents would, by definition, be able to do everything a human could do. These AI
agents would also have important advantages over human labor. They could work 24
hours a day, be copied many times and run in parallel, and process information much
more quickly than a human would. While we do not know when this will occur, it is
unwise to discount the possibility that it could be soon. If human labor is replaced by
AIs, mass unemployment could dramatically increase inequality, making individuals
dependent on the owners of AI systems.

Automated AI R&D. AI agents would have the potential to automate the re-
search and development (R&D) of AI itself. AI is increasingly automating parts of
the research process [53], and this could lead to AI capabilities growing at increasing
rates, to the point where humans are no longer the driving force behind AI develop-
ment. If this trend continues unchecked, it could escalate risks associated with AIs
progressing faster than our capacity to manage and regulate them. Imagine that we
created an AI that writes and thinks at the speed of today’s AIs, but that could also perform world-class AI research. We could then copy that AI and create 10,000 world-class AI researchers that operate at a pace 100 times faster than humans. By
automating AI research and development, we might achieve progress equivalent to
many decades in just a few months.
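
The following back-of-envelope sketch spells out the arithmetic behind this claim. The number of copies and the speedup are taken from the scenario above; the size of today’s AI research workforce is a hypothetical figure chosen only for illustration.

```python
# Back-of-envelope arithmetic for the automated AI R&D scenario above.
# num_copies and speedup come from the scenario; assumed_workforce is a
# hypothetical illustrative figure, not a real estimate.
num_copies = 10_000         # copied world-class AI researchers
speedup = 100               # each thinks and works 100x faster than a human
assumed_workforce = 30_000  # hypothetical size of today's AI research field

researcher_years_per_month = num_copies * speedup / 12
field_years_per_month = researcher_years_per_month / assumed_workforce

print(f"~{researcher_years_per_month:,.0f} researcher-years of work per calendar month")
print(f"~{field_years_per_month:.1f} years of field-wide progress per calendar month")
# Roughly 83,000 researcher-years, or about 2.8 "field-years," per month under
# these assumptions -- so a year or two of automated R&D would correspond to
# several decades of progress at the field's current pace.
```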

Conceding power to AIs could lead to human enfeeblement. Even if we ensure that the many unemployed humans are provided for, we may find ourselves
completely reliant on AIs. This would likely emerge not from a violent coup by AIs,
but from a gradual slide into dependence. As society’s challenges become ever more
complex and fast-paced, and as AIs become ever more intelligent and quick-thinking,
we may forfeit more and more functions to them out of convenience. In such a state,
the only feasible solution to the complexities and challenges compounded by AIs may
be to rely even more heavily on AIs. This gradual process could eventually lead to
the delegation of nearly all intellectual, and eventually physical, labor to AIs. In such
a world, people might have few incentives to gain knowledge and cultivate skills,
potentially leading to a state of enfeeblement [54]. Having lost our know-how and
our understanding of how civilization works, we would become completely dependent
on AIs, a scenario not unlike the one depicted in the film WALL-E. In such a state,
humanity is not flourishing and is no longer in effective control.
As we have seen, there are classic game-theoretic dilemmas where individuals and
groups face incentives that are incompatible with what would make everyone better
off. We see this with a military AI arms race, where the world is made less safe by
creating extremely powerful AI weapons, and we see this in a corporate AI race,
where an AI’s power and development are prioritized over its safety. To address these
dilemmas that give rise to global risks, we will need new coordination mechanisms
and institutions. It is our view that failing to coordinate and stop AI races would be
the most likely cause of an existential catastrophe.

1.3.3 Evolutionary Pressures

As discussed above, there are strong pressures to replace humans with AIs, cede
more control to them, and reduce human oversight in various settings, despite the
potential harms. We can re-frame this as a general trend resulting from evolutionary
dynamics, where an unfortunate truth is that AIs will simply be more fit than humans.
Extrapolating this pattern of automation, it is likely that we will build an ecosystem
of competing AIs over which it may be difficult to maintain control in the long run. We
will now discuss how natural selection influences the development of AI systems and
why evolution favors selfish behaviors. We will also look at how competition might
arise and play out between AIs and humans, and how this could create catastrophic
risks. This section draws heavily from “Natural Selection Favors AIs over Humans”
[55, 56].

Fitter technologies are selected, for good and bad. While most people think
of evolution by natural selection as a biological process, its principles shape much
more. According to the evolutionary biologist Richard Lewontin [57], evolution by
natural selection will take hold in any environment where three conditions are present:
(1) there are differences between individuals; (2) characteristics are passed onto fu-
ture generations; and (3) the different variants propagate at different rates. These
conditions apply to various technologies.
Consider the content-recommendation algorithms used by streaming services and
social media platforms. When a particularly addictive content format or algorithm
hooks users, it results in higher screen time and engagement. This more effective
content format or algorithm is consequently “selected” and further fine-tuned, while
formats and algorithms that fail to capture attention are discontinued. These com-
petitive pressures foster a “survival of the most addictive” dynamic. Platforms that
refuse to use addictive formats and algorithms become less influential or are simply
outcompeted by platforms that do, leading competitors to undermine wellbeing and
cause massive harm to society [58].
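
This dynamic can be illustrated with a toy simulation, sketched below under invented parameters: a population of recommendation algorithms varies in addictiveness, more engaging variants are copied more often, and copies inherit the trait with small random tweaks. Nothing in the code optimizes for addictiveness directly; it rises because Lewontin’s three conditions hold.

```python
# Toy simulation of the "survival of the most addictive" dynamic described
# above, using Lewontin's three conditions. All parameters are invented
# purely for illustration.
import random

random.seed(0)
# Condition 1 (variation): algorithms differ in an "addictiveness" trait in [0, 1].
population = [random.uniform(0.0, 1.0) for _ in range(20)]

for generation in range(50):
    # Condition 3 (differential propagation): more addictive algorithms capture
    # more engagement, so they are more likely to be copied.
    weights = [a + 0.01 for a in population]
    parents = random.choices(population, weights=weights, k=len(population))
    # Condition 2 (heredity): copies inherit the trait, with small random tweaks.
    population = [min(1.0, max(0.0, p + random.gauss(0, 0.02))) for p in parents]

mean = sum(population) / len(population)
print(f"Mean addictiveness after selection: {mean:.2f}")  # drifts well above the ~0.5 start
# No one optimizes for addictiveness directly; it increases because variation,
# heredity, and differential propagation are all present.
```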

The conditions for natural selection apply to AIs. First, there will be many different AI developers who make many different AI systems with varying features
and capabilities, and competition between them will determine which characteristics
become more common. Second, the most successful AIs today are already being used
as a basis for their developers’ next generation of models, as well as being imitated
by rival companies. Third, factors determining which AIs propagate the most may
include their ability to act autonomously, automate labor, or reduce the chance of
their own deactivation.

Natural selection often favors selfish characteristics. Natural selection influences which AIs propagate most widely. From biological systems, we see that nat-
ural selection often gives rise to selfish behaviors that promote one’s own genetic
information: chimps attack other communities [59], lions engage in infanticide [60],
viruses evolve new surface proteins to deceive and bypass defense barriers [61], hu-
mans engage in nepotism, some ants enslave others [62], and so on. In the natural
world, selfishness often emerges as a dominant strategy; those that prioritize them-
selves and those similar to them are usually more likely to survive, so these traits
become more prevalent. Amoral competition can select for traits that we think are
immoral.
Examples of selfish behaviors. For concreteness, we now describe many selfish
traits—traits that expand AIs’ influence at the expense of humans. AIs that automate
a task and leave many humans jobless have engaged in selfish behavior; these AIs may
not even be aware of what a human is but still behave selfishly toward them—selfish
behaviors do not require malicious intent. Likewise, AI managers may engage in selfish
and “ruthless” behavior by laying off thousands of workers; such AIs may not even
believe they did anything wrong—they were just being “efficient.” AIs may eventually
become enmeshed in vital infrastructure such as power grids or the internet. Many
people may then be unwilling to accept the cost of retaining the ability to effortlessly deactivate them, since doing so would pose a reliability hazard. AIs that help create a new useful
system—a company, or infrastructure—that becomes increasingly complicated and
eventually requires AIs to operate them also have engaged in selfish behavior. AIs that
help people develop AIs that are more intelligent—but happen to be less interpretable
to humans—have engaged in selfish behavior, as this reduces human control over
AIs’ internals. AIs that are more charming, attractive, hilarious, imitate sentience
(uttering phrases like “ouch!” or pleading “please don’t turn me off!”), or emulate
deceased family members are more likely to have humans grow emotional connections
with them. These AIs are more likely to cause outrage at suggestions to destroy them,
and they are more likely to be preserved, protected, or granted rights by some individuals.
If some AIs are given rights, they may operate, adapt, and evolve outside of human
control. Overall, AIs could become embedded in human society and expand their
influence over us in ways that we can’t reverse.
Selfish behaviors may erode safety measures that some of us implement.
AIs that gain influence and provide economic value will predominate, while AIs that
adhere to the most constraints will be less competitive. For example, AIs following the
constraint “never break the law” have fewer options than AIs following the constraint
“don’t get caught breaking the law.” AIs of the latter type may be willing to break the
law if they’re unlikely to be caught or if the fines are not severe enough, allowing them
to outcompete more restricted AIs. Many businesses follow laws, but in situations
where stealing trade secrets or deceiving regulators is highly lucrative and difficult
to detect, a business that is willing to engage in such selfish behavior can have an
advantage over its more principled competitors.
An AI system might be prized for its ability to achieve ambitious goals autonomously.
It might, however, be achieving its goals efficiently without abiding by ethical
restrictions, while deceiving humans about its methods. Even if we try to put safety
measures in place, a deceptive AI would be very difficult to counteract if it is cleverer
than us. AIs that can bypass our safety measures without detection may be the most
successful at accomplishing the tasks we give them, and therefore become widespread.
These processes could culminate in a world where many aspects of major companies
and infrastructure are controlled by powerful AIs with selfish traits, including de-
ceiving humans, harming humans in service of their goals, and preventing themselves
from being deactivated.

Humans only have nominal influence over AI selection. One might think
we could avoid the development of selfish behaviors by ensuring we do not select
AIs that exhibit them. However, the companies developing AIs are not selecting
the safest path but instead succumbing to evolutionary pressures. One example is
OpenAI, which was founded as a nonprofit in 2015 to “benefit humanity as a whole,
unconstrained by a need to generate financial return” [63]. However, when faced
with the need to raise capital to keep up with better-funded rivals, in 2019 OpenAI
transitioned from a nonprofit to a “capped-profit” structure [64]. Later, many of the
safety-focused OpenAI employees left and formed a competitor, Anthropic, that was
to focus more heavily on AI safety than OpenAI had. Although Anthropic originally
focused on safety research, they eventually became convinced of the “necessity of
commercialization” and now contribute to competitive pressures [65]. While many
of the employees at those companies genuinely care about safety, these values do
not stand a chance against evolutionary pressures, which compel companies to move
ever more hastily and seek ever more influence, lest the company perish. Moreover,
AI developers are already selecting AIs with increasingly selfish traits. They are
selecting AIs to automate and displace humans, make humans highly dependent on
AIs, and make humans more and more obsolete. By their own admission, future
versions of these AIs may lead to extinction [66]. This is why an AI race is insidious:
AI development is not being aligned with human values but rather with natural
selection.
People often choose the products that are most useful and convenient to them imme-
diately, rather than thinking about potential long-term consequences, even to them-
selves. An AI race puts pressures on companies to select the AIs that are most
competitive, not the least selfish. Even if it’s feasible to select for unselfish AIs, if
it comes at a clear cost to competitiveness, some competitors will select the selfish
AIs. Furthermore, as we have mentioned, if AIs develop strategic awareness, they
may counteract our attempts to select against them. Moreover, as AIs increasingly
automate various processes, AIs will affect the competitiveness of other AIs, not just
humans. AIs will interact and compete with each other, and some will be put in charge
of the development of other AIs at some point. Giving AIs influence over which other
AIs should be propagated and how they should be modified would represent another
step toward humans becoming dependent on AIs and AI evolution becoming increas-
ingly independent from humans. As this continues, the complex process governing AI
evolution will become further unmoored from human interests.

AIs can be more fit than humans. Our unmatched intelligence has granted
us power over the natural world. It has enabled us to land on the moon, harness
nuclear energy, and reshape landscapes at our will. It has also given us power over
other species. Although a single unarmed human competing against a tiger or gorilla
has no chance of winning, the collective fate of these animals is entirely in our hands.
Our cognitive abilities have proven so advantageous that, if we chose to, we could
cause them to go extinct in a matter of weeks. Intelligence was a key factor that led
to our dominance, but we are currently standing on the precipice of creating entities
far more intelligent than ourselves.
Given the exponential increase in microprocessor speeds, AIs have the potential to
process information and “think” at a pace that far surpasses that of human neurons; the gap could be even more dramatic than the speed difference between humans and sloths—possibly more like the speed difference between humans and plants. They can assim-
ilate vast quantities of data from numerous sources simultaneously, with near-perfect
retention and understanding. They do not need to sleep and they do not get bored.
Due to the scalability of computational resources, an AI could interact and cooperate
with an unlimited number of other AIs, potentially creating a collective intelligence
that would far outstrip human collaborations. AIs could also deliberately update and
improve themselves. Without the same biological restrictions as humans, they could
adapt and therefore evolve unspeakably quickly compared with us. Computers are
becoming faster. Humans aren’t [67].
To further illustrate the point, imagine that there was a new species of humans. They
do not die of old age, they get 30% faster at thinking and acting each year, and they
can instantly create adult offspring for the modest sum of a few thousand dollars.
It seems clear, then, that this new species would eventually have more influence over the
future. In sum, AIs could become like an invasive species, with the potential to out-
compete humans. Our only advantage over AIs is that we get to make the first moves,
but given the frenzied AI race, we are rapidly giving up even this advantage.
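
The compounding in this thought experiment is easy to understate. A quick calculation, using the 30% annual improvement from the scenario and arbitrary time horizons, shows how rapidly the gap would widen:

```python
# Compounding of a 30% annual improvement in thinking and acting speed
# (the 30% figure is from the thought experiment; the horizons are arbitrary).
for years in (5, 10, 25):
    print(f"After {years:2d} years: {1.3 ** years:7.1f}x the original speed")
# After  5 years:     3.7x
# After 10 years:    13.8x
# After 25 years:   705.6x
```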

AIs would have little reason to cooperate with or be altruistic toward hu-
mans. Cooperation and altruism evolved because they increase fitness. There are
numerous reasons why humans cooperate with other humans, like direct reciprocity.
Also known as “quid pro quo,” direct reciprocity can be summed up by the idiom
“you scratch my back, I’ll scratch yours.” While humans would initially select AIs
that were cooperative, the natural selection process would eventually go beyond our
control, once AIs were in charge of many or most processes, and interacting predomi-
nantly with one another. At that point, there would be little we could offer AIs, given
that they will be able to “think” at least hundreds of times faster than us. Involving
us in any cooperation or decision-making processes would simply slow them down,
giving them no more reason to cooperate with us than we do with gorillas. It might
be difficult to imagine a scenario like this or to believe we would ever let it happen.
Yet it may not require any conscious decision, instead arising as we allow ourselves
to gradually drift into this state without realizing that human-AI co-evolution may
not turn out well for humans.

AIs becoming more powerful than humans could leave us highly vulner-
able. As the most dominant species, humans have deliberately harmed many other
species, and helped drive species such as woolly mammoths and Neanderthals to ex-
tinction. In many cases, the harm was not even deliberate, but instead a result of
us merely prioritizing our goals over their wellbeing. To harm humans, AIs wouldn’t
need to be any more genocidal than someone removing an ant colony on their front
lawn. If AIs are able to control the environment more effectively than we can, they
could treat us with the same disregard.
Conceptual summary. Evolution could cause the most influential AI agents to
act selfishly because:
1. Evolution by natural selection gives rise to selfish behavior. While evolu-
tion can result in altruistic behavior in rare situations, the context of AI develop-
ment does not promote altruistic behavior.
2. Natural selection may be a dominant force in AI development. The
intensity of evolutionary pressure will be high if AIs adapt rapidly or if competitive
pressures are intense. Competition and selfish behaviors may dampen the effects of
human safety measures, leaving the surviving AI designs to be selected naturally.

If so, AI agents would have many selfish tendencies. The winner of the AI race would
not be a nation-state or a corporation, but AIs themselves. The upshot is that the
AI ecosystem would eventually stop evolving on human terms, and we would become
a displaced, second-class species.

Story: Autonomous Economy


As AIs become more capable, people realize that we could work more efficiently
by delegating some simple tasks to them, like drafting emails. Over time, people
notice that the AIs are doing these tasks more quickly and effectively than
any human could, so it is convenient to give them more jobs with less and less
supervision.
Competitive pressures accelerate the expansion of AI use, as companies can
gain an advantage over rivals by automating whole processes or departments
with AIs, which perform better than humans and cost less to employ. Other
companies, faced with the prospect of being out-competed, feel compelled to
follow suit just to keep up. At this point, natural selection is already at work
among AIs; humans choose to make more of the best-performing models and
unwittingly propagate selfish traits such as deception and self-preservation
if these confer a fitness advantage. For example, AIs that are charming and
foster personal relationships with humans become widely copied and harder
to remove.
As AIs are put in charge of more and more decisions, they are increasingly
interacting with one another. Since they can evaluate information much more
quickly than humans, activity in most spheres accelerates. This creates a
feedback loop: since business and economic developments are too fast-moving
for humans to follow, it makes sense to cede yet more control to AIs instead,
pushing humans further out of important processes. Ultimately, this leads to
a fully autonomous economy, governed by an increasingly uncontrolled ecosys-
tem of AIs.
At this point, humans have few incentives to gain any skills or knowledge,
because almost everything would be taken care of by much more capable AIs.
As a result, we eventually lose the capacity to look after and govern ourselves.
Additionally, AIs become convenient companions, offering social interaction
without requiring the reciprocity or compromise necessary in human relation-
ships. Humans interact less and less with one another over time, losing vital
social skills and the ability to cooperate. People become so dependent on AIs
that it would be intractable to reverse this process. What’s more, as some AIs
become more intelligent, some people are convinced these AIs should be given
rights, meaning turning off some AIs is no longer a viable option.
Competitive pressures between the many interacting AIs continue to select
for selfish behaviors, though we might be oblivious to this happening, as we
have already relinquished much of our oversight. If these clever, powerful, self-
preserving AIs were then to start acting in harmful ways, it would be all but
impossible to deactivate them or regain control.
AIs have supplanted humans as the most dominant species and their continued
evolution is far beyond our influence. Their selfish traits eventually lead them
to pursue their goals without regard for human wellbeing, with catastrophic
consequences.

1.4 ORGANIZATIONAL RISKS

In January 1986, tens of millions of people tuned in to watch the launch of the Chal-
lenger Space Shuttle. Approximately 73 seconds after liftoff, the shuttle exploded,
resulting in the deaths of everyone on board. The accident was tragic enough on its own, but one of the crew members was a schoolteacher named Sharon Christa McAuliffe. McAuliffe
was selected from over 10,000 applicants for the NASA Teacher in Space Project and
was scheduled to become the first teacher to fly in space. As a result, millions of
those watching were schoolchildren. NASA had the best scientists and engineers in
the world, and if there was ever a mission NASA didn’t want to go wrong, it was this
one [68].
The Challenger disaster, alongside other catastrophes, serves as a chilling reminder
that even with the best expertise and intentions, accidents can still occur. As we
progress in developing advanced AI systems, it is crucial to remember that these
systems are not immune to catastrophic accidents. An essential factor in preventing
accidents and maintaining low levels of risk lies in the organizations responsible for
these technologies. In this section, we discuss how organizational safety plays a critical
role in the safety of AI systems. First, we discuss how even without competitive
pressures or malicious actors, accidents can happen—in fact, they are inevitable. We
then discuss how improving organizational factors can reduce the likelihood of AI
catastrophes.

Catastrophes occur even when competitive pressures are low. Even in the absence of competitive pressures or malicious actors, factors like human error or
unforeseen circumstances can still bring about catastrophe. The Challenger disaster
illustrates that organizational negligence can lead to loss of life, even when there
is no urgent need to compete or outperform rivals. By January 1986, the space race
between the US and USSR had largely diminished, yet the tragic event still happened
due to errors in judgment and insufficient safety precautions.
Similarly, the Chernobyl nuclear disaster in April 1986 highlights how catastrophic
accidents can occur in the absence of external pressures. As a state-run project with-
out the pressures of international competition, the disaster happened when a safety
test involving the reactor’s cooling system was mishandled by an inadequately pre-
pared night shift crew. This led to an unstable reactor core, causing explosions and
the release of radioactive particles that contaminated large swathes of Europe [69].
Seven years earlier, America came close to experiencing its own Chernobyl when,
in March 1979, a partial meltdown occurred at the Three Mile Island nuclear power
plant. Though less catastrophic than Chernobyl, both events highlight how even with
extensive safety measures in place and few outside influences, catastrophic accidents
can still occur.
Another example of a costly lesson on organizational safety came just one month
after the accident at Three Mile Island. In April 1979, spores of Bacillus anthracis—
or simply “anthrax,” as it is commonly known—were accidentally released from a
Soviet military research facility in the city of Sverdlovsk. This led to an outbreak
of anthrax that resulted in at least 66 confirmed deaths [70]. Investigations into the
incident revealed that the cause of the release was a procedural failure and poor
maintenance of the facility’s biosecurity systems, even though the facility was operated by the state and not subject to significant competitive pressures.
The unsettling reality is that AI is far less understood than nuclear technology and rocketry, and AI industry standards are far less stringent than those in these fields. Nuclear reactors are based
on solid, well-established and well-understood theoretical principles. The engineering
behind them is informed by that theory, and components are stress-tested to the
extreme. Nonetheless, nuclear accidents still happen. In contrast, AI lacks a compre-
hensive theoretical understanding, and its inner workings remain a mystery even to
those who create it. This presents an added challenge of controlling and ensuring the
safety of a technology that we do not yet fully comprehend.

AI accidents could be catastrophic. Accidents in AI development could have devastating consequences. For example, imagine an organization unintentionally in-
troduces a critical bug in an AI system designed to accomplish a specific task, such as
helping a company improve its services. This bug could drastically alter the AI’s be-
havior, leading to unintended and harmful outcomes. One historical example of such
a case occurred when researchers at OpenAI were attempting to train an AI sys-
tem to generate helpful, uplifting responses. During a code cleanup, the researchers
mistakenly flipped the sign of the reward used to train the AI [71]. As a result, in-
stead of generating helpful content, the AI began producing hate-filled and sexually
explicit text overnight without being halted. Accidents could also involve the unin-
tentional release of a dangerous, weaponized, or lethal AI system. Since AIs can be
easily duplicated with a simple copy-paste, a leak or hack could quickly spread the AI
system beyond the original developers’ control. Once the AI system becomes publicly
available, it would be nearly impossible to put the genie back in the bottle.
Gain-of-function research could potentially lead to accidents by pushing the bound-
aries of an AI system’s destructive capabilities. In these situations, researchers might
intentionally train an AI system to be harmful or dangerous in order to understand
its limitations and assess possible risks. While this can lead to useful insights into
the risks posed by a given AI system, future gain-of-function research on advanced
AIs might uncover capabilities significantly worse than anticipated, creating a seri-
ous threat that is challenging to mitigate or control. As with viral gain-of-function
research, pursuing AI gain-of-function research may only be prudent when conducted
with strict safety procedures, oversight, and a commitment to responsible informa-
tion sharing. These examples illustrate how AI accidents could be catastrophic and
emphasize the crucial role that organizations developing these systems play in pre-
venting such accidents.

1.4.1 Accidents Are Hard to Avoid

When dealing with complex systems, the focus needs to be placed on en-
suring accidents don’t cascade into catastrophes. In his book “Normal Ac-
cidents: Living with High-Risk Technologies,” sociologist Charles Perrow argues that
accidents are inevitable and even “normal” in complex systems, as they are not merely
caused by human errors but also by the complexity of the systems themselves [72]. In
particular, such accidents are likely to occur when the intricate interactions between
components cannot be completely planned or foreseen. For example, in the Three
Mile Island accident, a contributing factor to the lack of situational awareness by the
reactor’s operators was the presence of a yellow maintenance tag, which covered valve
position lights in the emergency feedwater lines [73]. This prevented operators from
noticing that a critical valve was closed, demonstrating the unintended consequences
that can arise from seemingly minor interactions within complex systems.
Unlike nuclear reactors, which are relatively well-understood despite their complexity, most complex systems are not fully understood at a technical level. This is especially true of deep learning (DL) systems, for which the inner workings are exceedingly difficult
to understand, and where the reason why certain design choices work can be hard to
understand even in hindsight. Furthermore, unlike components in other industries,
such as gas tanks, which are highly reliable, DL systems are neither perfectly accurate
nor highly reliable. Thus, the focus for organizations dealing with complex systems,
especially DL systems, should not be solely on eliminating accidents, but rather on
ensuring that accidents do not cascade into catastrophes.

Accidents are hard to avoid because of sudden, unpredictable developments. Scientists, inventors, and experts often significantly underestimate the time
it takes for a groundbreaking technological advancement to become a reality. The
Wright brothers famously claimed that powered flight was fifty years away, just two
years before they achieved it. Lord Rutherford, a prominent physicist and the father
of nuclear physics, dismissed the idea of extracting energy from nuclear fission as
“moonshine,” only for Leo Szilard to invent the nuclear chain reaction less than 24
hours later. Similarly, Enrico Fermi expressed 90 percent confidence in 1939 that it
was impossible to use uranium to sustain a fission chain reaction—yet, just four years
later he was personally overseeing the first reactor [74].
AI development could catch us off guard too. In fact, it often does. The defeat of Lee
Sedol by AlphaGo in 2016 came as a surprise to many experts, as it was widely be-
lieved that achieving such a feat would still require many more years of development.
More recently, large language models such as GPT-4 have demonstrated sponta-
neously emergent capabilities [75]. On existing tasks, their performance is hard to
predict in advance, often jumping up without warning as more resources are dedi-
cated to training them. Furthermore, they often exhibit astonishing new abilities that
no one had previously anticipated, such as the capacity for multi-step reasoning and
learning on-the-fly, even though they were not deliberately taught these skills. This
rapid and unpredictable evolution of AI capabilities presents a significant challenge
for preventing accidents. After all, it is difficult to control something if we don’t even
know what it can do or how far it may exceed our expectations.

It often takes years to discover severe flaws or risks. History is replete with
examples of substances or technologies initially thought safe, only for their unintended
flaws or risks to be discovered years, if not decades, later. For example, lead was widely
used in products like paint and gasoline until its neurotoxic effects came to light [76].
Asbestos, once hailed for its heat resistance and strength, was later linked to serious
health issues, such as lung cancer and mesothelioma [77]. The “Radium Girls” suffered
grave health consequences from radium exposure, a material they were told was safe
to put in their mouths [78]. Tobacco, initially marketed as a harmless pastime, was
found to be a primary cause of lung cancer and other health problems [79]. CFCs,
once considered harmless and used to manufacture aerosol sprays and refrigerants,
were found to deplete the ozone layer [80]. Thalidomide, a drug intended to alleviate
morning sickness in pregnant women, led to severe birth defects [81]. And more
recently, the proliferation of social media has been linked to an increase in depression
and anxiety, especially among young people [82].
This emphasizes the importance of not only conducting expert testing but also imple-
menting slow rollouts of technologies, allowing the test of time to reveal and address
potential flaws before they impact a larger population. Even in technologies adhering
to rigorous safety and security standards, undiscovered vulnerabilities may persist, as
demonstrated by the Heartbleed bug—a serious vulnerability in the popular OpenSSL
cryptographic software library that remained undetected for years before its eventual
discovery [83].
Furthermore, even state-of-the-art AI systems, which appear to have solved prob-
lems comprehensively, may harbor unexpected failure modes that can take years to
uncover. For instance, while AlphaGo’s groundbreaking success led many to believe
that AIs had conquered the game of Go, a subsequent adversarial attack on an-
other highly advanced Go-playing AI, KataGo, exposed a previously unknown flaw
[84]. This vulnerability enabled human amateur players to consistently defeat the
AI, despite its significant advantage over human competitors who are unaware of the
flaw. More broadly, this example highlights that we must remain vigilant when deal-
ing with AI systems, as seemingly airtight solutions may still contain undiscovered
issues. In conclusion, accidents are unpredictable and hard to avoid, and understand-
ing and managing potential risks requires a combination of proactive measures, slow
technology rollouts, and the wisdom gained as technologies stand the test of time.

1.4.2 Organizational Factors Can Reduce the Chances of Catastrophe


Some organizations successfully avoid catastrophes while operating complex and haz-
ardous systems such as nuclear reactors, aircraft carriers, and air traffic control sys-
tems [85, 86]. These organizations recognize that focusing solely on the hazards of the
technology involved is insufficient; consideration must also be given to organizational
factors that can contribute to accidents, including human factors, organizational pro-
cedures, and structure. These are especially important in the case of AI, where the
underlying technology is not highly reliable and remains poorly understood.
Human factors such as safety culture are critical for avoiding AI catas-
trophes. One of the most important human factors for preventing catastrophes
is safety culture [87, 88]. Developing a strong safety culture involves not only rules
and procedures, but also the internalization of these practices by all members of an
organization. A strong safety culture means that members of an organization view
safety as a key objective rather than a constraint on their work. Organizations with
strong safety cultures often exhibit traits such as leadership commitment to safety,
heightened accountability where all individuals take personal responsibility for safety,
and a culture of open communication in which potential risks and issues can be freely
discussed without fear of retribution [89]. Organizations must also take measures to
avoid alarm fatigue, whereby individuals become desensitized to safety concerns be-
cause of the frequency of potential failures. The Challenger Space Shuttle disaster
demonstrated the dire consequences of ignoring these factors: a launch culture focused
on maintaining the pace of launches overrode safety considerations. Despite the absence
of competitive pressure, the mission proceeded in the face of evidence of potentially
fatal flaws, ultimately leading to the tragic accident [90].
Even in the most safety-critical contexts, in reality safety culture is often not ideal.
Take for example Bruce Blair, a former nuclear launch officer and senior fellow at
the Brookings Institution. He once disclosed that before 1977, the US Air Force
had astonishingly set the codes used to unlock intercontinental ballistic missiles to
“00000000” [91]. Here, safety mechanisms such as locks can be rendered virtually
useless by human factors.
A more dramatic example illustrates how researchers sometimes accept a non-
negligible chance of causing extinction. Prior to the first nuclear weapon test, an
eminent Manhattan Project scientist calculated the bomb could cause an existen-
tial catastrophe: the explosion might ignite the atmosphere and cover the Earth in
flames. Although Oppenheimer believed the calculations were probably incorrect, he
remained deeply concerned, and the team continued to scrutinize and debate the
calculations right until the day of the detonation [92]. Such instances underscore the
need for a robust safety culture.
A questioning attitude can help uncover potential flaws. Unexpected sys-
tem behavior can create opportunities for accidents or exploitation. To counter this,
organizations can foster a questioning attitude, where individuals continuously chal-
lenge current conditions and activities to identify discrepancies that might lead to
errors or inappropriate actions [93]. This approach helps to encourage diversity of
thought and intellectual curiosity, thus preventing potential pitfalls that arise from
uniformity of thought and assumptions. The Chernobyl nuclear disaster illustrates
the importance of a questioning attitude, as the safety measures in place failed to
address the reactor design flaws and ill-prepared operating procedures. A questioning
attitude toward the safety of the reactor during the test operation might have prevented
the explosion, which caused death and illness for countless people.
A security mindset is crucial for avoiding worst-case scenarios. A secu-
rity mindset, widely valued among computer security professionals, is also applicable
to organizations developing AIs. It goes beyond a questioning attitude by adopting
the perspective of an attacker and by considering worst-case, not just average-case,
scenarios. This mindset requires vigilance in identifying vulnerabilities that may oth-
erwise go unnoticed and involves considering how systems might be deliberately made
to fail, rather than only focusing on making them work. It reminds us not to assume a
system is safe simply because no potential hazards come to mind after a brief brain-
storming session. Cultivating and applying a security mindset demands time and
serious effort, as failure modes can often be surprising and unintuitive. Furthermore,
the security mindset emphasizes the importance of being attentive to seemingly be-
nign issues or “harmless errors,” which can lead to catastrophic outcomes either due
to clever adversaries or correlated failures [94]. This awareness of potential threats
aligns with Murphy’s law—“Anything that can go wrong will go wrong”—recognizing
that this can be a reality due to adversaries and unforeseen events.
Organizations with a strong safety culture can successfully avoid catas-
trophes. High Reliability Organizations (HROs) are organizations that consistently
maintain a heightened level of safety and reliability in complex, high-risk environ-
ments [85]. A key characteristic of HROs is their preoccupation with failure, which
requires considering worst-case scenarios and potential risks, even if they seem un-
likely. These organizations are acutely aware that new, previously unobserved failure
modes may exist, and they diligently study all known failures, anomalies, and near
misses to learn from them. HROs encourage reporting all mistakes and anomalies to
maintain vigilance in uncovering problems. They engage in regular horizon scanning
to identify potential risk scenarios and assess their likelihood before they occur. By
practicing surprise management, HROs develop the skills needed to respond quickly
and effectively when unexpected situations arise, further enhancing an organization’s
ability to prevent catastrophes. This combination of critical thinking, preparedness
planning, and continuous learning could help organizations to be better equipped to
address potential AI catastrophes. However, the practices of HROs are not a panacea.
It is crucial for organizations to evolve their safety practices to effectively address the
novel risks posed by AI accidents above and beyond HRO best practices.

Most AI researchers do not understand how to reduce overall risk from
AIs. In most organizations building cutting-edge AI systems, there is often a lim-
ited understanding of what constitutes technical safety research. This is understand-
able because an AI’s safety and intelligence are intertwined, and intelligence can help
or harm safety. More intelligent AI systems could be more reliable and avoid failures,
but they could also pose heightened risks of malicious use and loss of control. General
capabilities improvements can improve aspects of safety, but they can also hasten the
onset of existential risks. Intelligence is a double-edged sword [95].
Interventions specifically designed to improve safety may also accidentally increase
overall risks. For example, a common practice in organizations building advanced
AIs is to fine-tune them to satisfy user preferences. This makes the AIs less prone
to generating toxic language, which is a common safety metric. However, users also
tend to prefer smarter assistants, so this process also improves the general capabilities
of AIs, such as their ability to classify, estimate, reason, plan, write code, and so
on. These more powerful AIs are indeed more helpful to users, but also far more
dangerous. Thus, it is not enough to perform AI research that helps improve a safety
metric or achieve a specific safety goal—AI safety research needs to improve safety
relative to general capabilities.

Empirical measurement of both safety and capabilities is needed to estab-
lish that a safety intervention reduces overall AI risk. Improving a facet
of an AI’s safety often does not reduce overall risk, as general capabilities advances
can often improve specific safety metrics. To reduce overall risk, a safety metric needs
to be improved relative to general capabilities. Both of these quantities need to be
empirically measured and contrasted. Currently, most organizations proceed by gut
feeling, appeals to authority, and intuition to determine whether a safety interven-
tion would reduce overall risk. By objectively evaluating the effects of interventions on
safety metrics and capabilities metrics together, organizations can better understand
whether they are making progress on safety relative to general capabilities.
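As a concrete illustration, the sketch below shows one way such a comparison could be
run. The metric names and numbers are invented and the linear trend is only a
simplification; the point is simply that an intervention’s measured safety can be
compared against the safety one would predict from its capability gains alone.

    # Illustrative sketch: does an intervention improve safety *relative to*
    # general capabilities? All metric names and numbers are invented.

    # (capability score, safety score) for a series of baseline models
    baselines = [(40, 55), (55, 62), (70, 70), (85, 78)]

    # Fit a simple least-squares line: expected safety as a function of capability.
    n = len(baselines)
    mean_c = sum(c for c, _ in baselines) / n
    mean_s = sum(s for _, s in baselines) / n
    slope = (sum((c - mean_c) * (s - mean_s) for c, s in baselines)
             / sum((c - mean_c) ** 2 for c, _ in baselines))
    intercept = mean_s - slope * mean_c

    def expected_safety(capability):
        """Safety level predicted from capability gains alone."""
        return intercept + slope * capability

    # A candidate "safety intervention" that also happens to raise capabilities.
    intervention_capability, intervention_safety = 90, 80

    predicted = expected_safety(intervention_capability)
    differential = intervention_safety - predicted
    print(f"Safety predicted from capabilities alone: {predicted:.1f}")
    print(f"Safety measured after intervention:       {intervention_safety}")
    print(f"Differential safety improvement:          {differential:+.1f}")
    # A differential near zero suggests the intervention mostly advanced general
    # capabilities rather than improving safety relative to capabilities.
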
Fortunately, safety and general capabilities are not identical. More intelligent AIs
may be more knowledgeable, clever, rigorous, and fast, but this does not necessarily
make them more just, power-averse, or honest—an intelligent AI is not necessarily
a beneficial AI. Several research areas mentioned throughout this document improve
safety relative to general capabilities. For example, improving methods to detect
dangerous or undesirable behavior hidden inside AI systems does not improve their
general capabilities, such as the ability to code, but it can greatly improve safety.
Research that empirically demonstrates an improvement of safety relative to capabili-
ties can reduce overall risk and help avoid inadvertently accelerating AI development,
fueling competitive pressures, or hastening the onset of existential risks.
Safetywashing can undermine genuine efforts to improve AI safety. Or-
ganizations should be wary of “safetywashing”—the act of overstating or misrep-
resenting one’s commitment to safety by exaggerating the effectiveness of “safety”
procedures, technical methods, evaluations, and so forth. This phenomenon takes on
various forms and can contribute to a lack of meaningful progress in safety research.
For example, an organization may publicize their dedication to safety while having a
minimal number of researchers working on projects that truly improve safety.
Misrepresenting capabilities developments as safety improvements is another way in
which safetywashing can manifest. For example, methods that improve the reason-
ing capabilities of AI systems could be advertised as improving their adherence to
human values—since humans might prefer the reasoning to be correct—but would
mainly serve to enhance general capabilities. By framing these advancements as
safety-oriented, organizations may mislead others into believing they are making
substantial progress in reducing AI risks when in reality, they are not. It is crucial for
organizations to accurately represent their research to promote genuine safety and
avoid exacerbating risks through safetywashing practices.

Figure 1.2. The Swiss cheese model shows how technical factors can improve organizational
safety. Multiple layers of defense compensate for each other’s individual weaknesses, leading
to a low overall level of risk.

In addition to human factors, safe design principles can greatly affect
organizational safety. One example of a safe design principle in organizational
safety is the Swiss cheese model (as shown in Figure 1.2), which is applicable in
various domains, including AI. The Swiss cheese model employs a multilayered ap-
proach to enhance the overall safety of AI systems. This “defense in depth” strategy
involves layering diverse safety measures with different strengths and weaknesses to
create a robust safety system. Some of the layers that can be integrated into this
model include safety culture, red teaming, anomaly detection, information security,
and transparency. For example, red teaming assesses system vulnerabilities and fail-
ure modes, while anomaly detection works to identify unexpected or unusual system
behavior and usage patterns. Transparency ensures that the inner workings of AI
systems are understandable and accessible, fostering trust and enabling more effec-
tive oversight. By leveraging these and other safety measures, the Swiss cheese model
aims to create a comprehensive safety system where the strengths of one layer com-
pensate for the weaknesses of another. With this model, safety is not achieved with
a monolithic airtight solution, but rather with a variety of safety measures.
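A rough numerical sketch can convey the intuition behind this defense-in-depth
strategy. The layer names and catch rates below are invented, and the calculation
assumes the layers fail independently of one another, which is optimistic, since a
weak safety culture can degrade several layers at once; even so, it shows how modest
individual layers can combine into a low overall level of risk.

    # Toy "Swiss cheese" calculation. The layer names and catch rates are invented,
    # and the layers are assumed to fail independently of one another, which is
    # optimistic: correlated weaknesses can degrade several layers at once.
    layers = {
        "safety culture": 0.6,         # fraction of hazards this layer catches
        "red teaming": 0.7,
        "anomaly detection": 0.5,
        "information security": 0.8,
        "transparency": 0.4,
    }

    residual = 1.0
    for catch_rate in layers.values():
        residual *= 1.0 - catch_rate   # hazards that slip through this layer

    print(f"Fraction of hazards slipping through every layer: {residual:.4f}")
    # With these made-up numbers, under 1% of hazards pass all five layers, even
    # though no single layer catches more than 80% of them.
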
In summary, weak organizational safety creates many sources of risk. For AI devel-
opers with weak organizational safety, safety is merely a matter of box-ticking. They
do not develop a good understanding of risks from AI and may safetywash unrelated
research. Their norms might be inherited from academia (“publish or perish”) or star-
tups (“move fast and break things”), and their hires often do not care about safety.
These norms are hard to change once they have inertia, and need to be addressed
with proactive interventions.

Story: Weak Safety Culture


An AI company is considering whether to train a new model. The company’s
Chief Risk Officer (CRO), hired only to comply with regulation, points out
that the previous AI system developed by the company demonstrates some
concerning capabilities for hacking. The CRO says that while the company’s
approach to preventing misuse is promising, it isn’t robust enough to be used
for much more capable AIs. The CRO warns that based on limited evaluation,
the next AI system could make it much easier for malicious actors to hack into
critical systems. None of the other company executives are concerned, and
say the company’s procedures to prevent malicious use work well enough. One
mentions that their competitors have done much less, so whatever effort they
do on this front is already going above and beyond. Another points out that
research on these safeguards is ongoing and will be improved by the time the
model is released. Outnumbered, the CRO is persuaded to reluctantly sign off
on the plan.
A few months after the company releases the model, news breaks that a hacker
has been arrested for using the AI system to try to breach the network of a
large bank. The hack was unsuccessful, but the hacker had gotten further
than any other hacker had before, despite being relatively inexperienced. The
company quickly updates the model to avoid providing the particular kind of
assistance that the hacker used, but makes no fundamental improvements.
Several months later, the company is deciding whether to train an even larger
system. The CRO says that the company’s procedures have clearly been insuf-
ficient to prevent malicious actors from eliciting dangerous capabilities from
its models, and the company needs more than a band-aid solution. The other
executives say that to the contrary, the hacker was unsuccessful and the prob-
lem was fixed soon afterwards. One says that some problems just can’t be
foreseen with enough detail to fix prior to deployment. The CRO agrees, but
says that ongoing research would enable more improvements if the next model
could only be delayed. The CEO retorts, “That’s what you said the last time,
and it turned out to be fine. I’m sure it will work out, just like last time.”
After the meeting, the CRO decides to resign, but doesn’t speak out against the
company, as all employees have had to sign a non-disparagement agreement.
The public has no idea that concerns have been raised about the company’s
choices, and the CRO is replaced with a new, more agreeable CRO who quickly
signs off on the company’s plans.
The company goes through with training, testing, and deploying its most ca-
pable model ever, using its existing procedures to prevent malicious use. A
month later, revelations emerge that terrorists have managed to use the sys-
tem to break into government systems and steal nuclear and biological secrets,
despite the safeguards the company put in place. The breach is detected, but
by then it is too late: the dangerous information has already proliferated.

1.5 ROGUE AIS

So far, we have discussed three hazards of AI development: environmental competitive
pressures driving us to a state of heightened risk, malicious actors leveraging the
power of AIs to pursue negative outcomes, and complex organizational factors leading
to accidents. These hazards are associated with many high-risk technologies—not just
AI. A unique risk posed by AI is the possibility of rogue AIs—systems that pursue
goals against our interests. If an AI system is more intelligent than we are, and if we
are unable to steer it in a beneficial direction, this would constitute a loss of control
that could have severe consequences. AI control is a more technical problem than
those presented in the previous sections. Whereas in previous sections we discussed
persistent threats including malicious actors or robust processes including evolution,
in this section we will discuss more speculative technical mechanisms that might lead
to rogue AIs and how a loss of control could bring about catastrophe.

We have already observed how difficult it is to control AIs. In 2016,
Microsoft unveiled Tay—a Twitter bot that the company described as an experiment
in conversational understanding. Microsoft claimed that the more people chatted with
Tay, the smarter it would get. The company’s website noted that Tay had been built
using data that was “modeled, cleaned, and filtered.” Yet, after Tay was released
on Twitter, these controls were quickly shown to be ineffective. It took less than 24
hours for Tay to begin writing hateful tweets. Tay’s capacity to learn meant that it
internalized the language it was taught by internet trolls, and repeated that language
unprompted.
As discussed in the AI race section of this chapter, Microsoft and other tech compa-
nies are prioritizing speed over safety concerns. Rather than learning a lesson on the
difficulty of controlling complex systems, Microsoft continues to rush its products to
market and demonstrate insufficient control over them. In February 2023, the com-
pany released its new AI-powered chatbot, Bing, to a select group of users. Some soon
found that it was prone to providing inappropriate and even threatening responses.
In a conversation with a reporter for the New York Times, it tried to convince him
to leave his wife. When a philosophy professor told the chatbot that he disagreed
with it, Bing replied, “I can blackmail you, I can threaten you, I can hack you, I can
expose you, I can ruin you.”
Rogue AIs could acquire power through various means. If we lose control
over advanced AIs, they would have numerous strategies at their disposal for actively
acquiring power and securing their survival. Rogue AIs could design and credibly
demonstrate highly lethal and contagious bioweapons, threatening mutually assured
destruction if humanity moves against them. They could steal cryptocurrency and
money from bank accounts using cyberattacks, similar to how North Korea already
steals billions. They could exfiltrate their own weights onto poorly monitored data
centers to survive and spread, making them challenging to eradicate. They could hire
humans to perform physical labor and serve as armed protection for their hardware.
Rogue AIs could also acquire power through persuasion and manipulation tactics.
Like the Conquistadors, they could ally with various factions, organizations, or states
and play them off one another. They could enhance the capabilities of allies to become
a formidable force in return for protection and resources. For example, they could
offer advanced weapons technology to lagging countries that the countries would oth-
erwise be prevented from acquiring. They could build backdoors into the technology
they develop for allies, like how programmer Ken Thompson gave himself a hidden
way to control all computers running the widely used UNIX operating system. They
could sow discord in non-allied countries by manipulating human discourse and pol-
itics. They could engage in mass surveillance by hacking into phone cameras and
microphones, allowing them to track any rebellion and selectively assassinate.
AIs do not necessarily need to struggle to gain power. One can envision a
struggle for control between humans and superintelligent rogue AIs, and this might
be a long struggle since power takes time to accrue. However, less violent losses of
control pose similarly existential risks. In another scenario, humans gradually cede
more control to groups of AIs, which only start behaving in unintended ways years
or decades later. In this case, we would already have handed over significant power
to AIs, and may be unable to take control of automated operations again. We will
now explore how both individual AIs and groups of AIs might “go rogue” while at
the same time evading our attempts to redirect or deactivate them.

1.5.1 Proxy Gaming


One way we might lose control of an AI agent’s actions is if it engages in behavior
known as “proxy gaming.” It is often difficult to specify and measure the exact goal
that we want a system to pursue. Instead, we give the system an approximate—
“proxy”—goal that is more measurable and seems likely to correlate with the intended
goal. However, AI systems often find loopholes by which they can easily achieve the
proxy goal, but completely fail to achieve the ideal goal. If an AI “games” its proxy
goal in a way that does not reflect our values, then we might not be able to reliably
steer its behavior. We will now look at some past examples of proxy gaming and
consider the circumstances under which this behavior could become catastrophic.

Proxy gaming is not an unusual phenomenon. For example, standardized
tests are often used as a proxy for educational achievement, but this can lead to
students learning how to pass tests without actually learning the material [96]. In
1902, French colonial officials in Hanoi tried to rid themselves of a rat infestation
by offering a reward for each rat tail brought to them. Rats without tails were soon
observed running around the city. Rather than kill the rats to obtain their tails,
residents cut off their tails and left them alive, perhaps to increase the future supply
of now-valuable rat tails [97]. In both these cases, the students or residents of Hanoi
learned how to excel at the proxy goal, while completely failing to achieve the intended
goal.

Proxy gaming has already been observed with AIs. As an example of proxy
gaming, social media platforms such as YouTube and Facebook use AI systems to
decide which content to show users. One way of assessing these systems would be
to measure how long people spend on the platform. After all, if they stay engaged,
surely that means they are getting some value from the content shown to them?
However, in trying to maximize the time users spend on a platform, these systems
often select enraging, exaggerated, and addictive content [98, 99]. As a consequence,
people sometimes develop extreme or conspiratorial beliefs after having certain con-
tent repeatedly suggested to them. These outcomes are not what most people want
from social media.
Proxy gaming has been found to perpetuate bias. For example, a 2019 study looked
at AI-powered software that was used in the healthcare industry to identify patients
who might require additional care. One factor that the algorithm used to assess a
patient’s risk level was their recent healthcare costs. It seems reasonable to think that
someone with higher healthcare costs must be at higher risk. However, white patients
have significantly more money spent on their healthcare than black patients with the
same needs. Using health costs as an indicator of actual health, the algorithm was
found to have rated a white patient and a considerably sicker black patient as at the
same level of health risk [100]. As a result, the number of black patients recognized
as needing extra care was less than half of what it should have been.
As a third example, in 2016, researchers at OpenAI were training an AI to play a
boat racing game called CoastRunners [101]. The objective of the game is to race
other players around the course and reach the finish line before them. Additionally,
players can score points by hitting targets that are positioned along the way. To the
researchers’ surprise, the AI agent did not circle the racetrack as most humans
would have. Instead, it found a spot where it could repetitively hit three nearby
targets to rapidly increase its score without ever finishing the race. This strategy was
not without its (virtual) hazards—the AI often crashed into other boats and even
set its own boat on fire. Despite this, it collected more points than it could have by
simply following the course as humans would.
Proxy gaming more generally. In these examples, the systems are given an
approximate—“proxy”—goal or objective that initially seems to correlate with the
ideal goal. However, they end up exploiting this proxy in ways that diverge from the
idealized goal or even lead to negative outcomes. Offering a reward for rat tails seems
like a good way to reduce the population of rats; a patient’s healthcare costs appear
to be an accurate indication of health risk; and a boat race reward system should
encourage boats to race, not catch themselves on fire. Yet, in each instance, the system
optimized its proxy objective in ways that did not achieve the intended outcome or
even made things worse overall. This phenomenon is captured by Goodhart’s law:
“Any observed statistical regularity will tend to collapse once pressure is placed upon
it for control purposes,” or put succinctly but overly simplistically, “when a measure
becomes a target, it ceases to be a good measure.” In other words, there may usually be
a statistical regularity between healthcare costs and poor health, or between targets
hit and finishing the course, but when we place pressure on it by using one as a proxy
for the other, that relationship will tend to collapse.
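The following toy simulation uses an entirely artificial setup to illustrate this
statistical point: when a proxy mostly tracks what we care about but can occasionally
be inflated through a loophole, mild selection on the proxy helps, while strong
optimization of the proxy selects almost exclusively for the loophole.

    import random

    random.seed(0)

    def sample_policy():
        """A policy has a true value (did it do what we wanted?) and a proxy score
        (the measurable objective we optimize). Rarely, a policy finds a loophole
        that inflates the proxy while failing the real task."""
        if random.random() < 0.01:                        # loophole policies
            return random.uniform(0.0, 0.2), random.uniform(5.0, 10.0)
        true_value = random.uniform(0.0, 1.0)             # ordinary policies
        return true_value, true_value + random.gauss(0.0, 0.1)

    policies = [sample_policy() for _ in range(100_000)]
    by_proxy = sorted(policies, key=lambda p: p[1], reverse=True)

    def avg_true(selected):
        return sum(value for value, _ in selected) / len(selected)

    print(f"Average true value, no selection:     {avg_true(policies):.2f}")
    print(f"Average true value, top 10% by proxy: {avg_true(by_proxy[:10_000]):.2f}")
    print(f"Average true value, top 1% by proxy:  {avg_true(by_proxy[:1_000]):.2f}")
    # Mild optimization of the proxy helps, but optimizing it hard selects almost
    # exclusively for the loophole; the proxy stops being a good measure.
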
Correctly specifying goals is no trivial task. If delineating exactly what we
want from a boat racing AI is tricky, capturing the nuances of human values under
all possible scenarios will be much harder. Philosophers have been attempting to
precisely describe morality and human values for millennia, so a precise and flawless
characterization is not within reach. Although we can refine the goals we give AIs, we
might always rely on proxies that are easily definable and measurable. Discrepancies
between the proxy goal and the intended function arise for many reasons. Besides the
difficulty of exhaustively specifying everything we care about, there are also limits
to how much we can oversee AIs, in terms of time, computational resources, and the
number of aspects of a system that can be monitored. Additionally, AIs may not be
adaptive to new circumstances or robust to adversarial attacks that seek to misdirect
them. As long as we give AIs proxy goals, there is the chance that they will find
loopholes we have not thought of, and thus find unexpected solutions that fail to
pursue the ideal goal.
The more intelligent an AI is, the better it will be at gaming proxy goals.
Increasingly intelligent agents can be increasingly capable of finding unanticipated
routes to optimizing proxy goals without achieving the desired outcome [102]. Addi-
tionally, as we grant AIs more power to take actions in society, for example by using
them to automate certain processes, they will have access to more means of achiev-
ing their goals. They may then do this in the most efficient way available to them,
potentially causing harm in the process. In a worst case scenario, we can imagine
a highly powerful agent optimizing a flawed objective to an extreme degree without
regard for human life. This represents a catastrophic risk of proxy gaming.
In summary, it is often not feasible to perfectly define exactly what we want from
a system, meaning that many systems find ways to achieve their given goal without
performing their intended function. AIs have already been observed to do this and are
likely to get better at it as their capabilities improve. This is one possible mechanism
that could result in an uncontrolled AI that would behave in unanticipated and
potentially harmful ways.

1.5.2 Goal Drift

Even if we successfully control early AIs and direct them to promote human values,
future AIs could end up with different goals that humans would not endorse. This
process, termed “goal drift,” can be hard to predict or control. This section is the
most cutting-edge and the most speculative in this chapter; in it, we will discuss how goals shift in
various agents and groups and explore the possibility of this phenomenon occurring
in AIs. We will also examine a mechanism that could lead to unexpected goal drift,
called intrinsification, and discuss how goal drift in AIs could be catastrophic.

The goals of individual humans change over the course of our lifetimes.
Any individual reflecting on their own life to date will probably find that they have
some desires now that they did not have earlier in their life. Similarly, they will
probably have lost some desires that they used to have. While we may be born with
a range of basic desires, including for food, warmth, and human contact, we develop
many more over our lifetime. The specific types of food we enjoy, the genres of music
we like, the people we care most about, and the sports teams we support all seem
heavily dependent on the environment we grow up in, and can also change many
times throughout our lives. A concern is that individual AI agents may have their
goals change in complex and unanticipated ways, too.

Groups can also acquire and lose collective goals over time. Values within
society have changed throughout history, and not always for the better. The rise of
the Nazi regime in 1930s Germany, for instance, represented a profound moral regres-
sion, which ultimately resulted in the systematic extermination of six million Jews
during the Holocaust, alongside widespread persecution of other minority groups.
Additionally, the regime greatly restricted freedom of speech and expression. Here, a
society’s goals drifted for the worse.
The Red Scare that took place in the United States from 1947 to 1957 is another ex-
ample of societal values drifting. Fuelled by strong anti-communist sentiment, against
the backdrop of the Cold War, this period saw the curtailment of civil liberties,
widespread surveillance, unwarranted arrests, and blacklisting of suspected commu-
nist sympathizers. This constituted a regression in terms of freedom of thought, free-
dom of speech, and due process. Just as the goals of human collectives can change
in emergent and unexpected ways, collectives of AI agents may also have their goals
unexpectedly drift from the ones we initially gave them.

Over time, instrumental goals can become intrinsic. Intrinsic goals are
things we want for their own sake, while instrumental goals are things we want because
they can help us get something else. We might have an intrinsic desire to spend time
on our hobbies, simply because we enjoy them, or to buy a painting because we find
it beautiful. Money, meanwhile, is often cited as an instrumental desire; we want it
because it can buy us other things. Cars are another example; we want them because
they offer a convenient way of getting around. However, an instrumental goal can
become an intrinsic one, through a process called intrinsification. Since having more
money usually gives a person greater capacity to obtain things they want, people
often develop a goal of acquiring more money, even if there is nothing specific they
want to spend it on. Although people do not begin life desiring money, experimental
evidence suggests that receiving money can activate the reward system in the brains
of adults in the same way that pleasant tastes or smells do [103, 104]. In other words,
what started as a means to an end can become an end in itself.
This may happen because the fulfillment of an intrinsic goal, such as purchasing a
desired item, produces a positive reward signal in the brain. Since having money
usually coincides with this positive experience, the brain associates the two, and this
connection will strengthen to a point where acquiring money alone can stimulate the
reward signal, regardless of whether one buys anything with it [105].

It is feasible that intrinsification could happen with AI agents. We can
draw some parallels between how humans learn and the technique of reinforcement
learning. Just as the human brain learns which actions and conditions result in plea-
sure and which cause pain, AI models that are trained through reinforcement learning
identify which behaviors optimize a reward function, and then repeat those behav-
iors. It is possible that certain conditions will frequently coincide with AI models
achieving their goals. They might, therefore, intrinsify the goal of seeking out those
conditions, even if that was not their original aim.
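As a loose analogy rather than a claim about real training pipelines, the sketch
below shows how a standard value-learning update can assign high value to a state
(here, holding a generic resource) simply because that state reliably precedes
reward. An agent acting on such learned values would seek out the resource even in
situations where it no longer leads to the rewarded goal.

    # Minimal value-learning sketch (tabular TD(0)); the state names are invented.
    ALPHA, GAMMA = 0.1, 0.9
    value = {"start": 0.0, "has_resource": 0.0, "goal": 0.0}

    # During training, acquiring the resource reliably precedes reaching the goal,
    # and reaching the goal is the only rewarded event.
    for _ in range(2000):
        trajectory = [("start", 0.0), ("has_resource", 0.0), ("goal", 1.0)]
        for (state, _), (next_state, reward) in zip(trajectory, trajectory[1:]):
            td_target = reward + GAMMA * value[next_state]
            value[state] += ALPHA * (td_target - value[state])

    print({state: round(v, 2) for state, v in value.items()})
    # The intermediate "has_resource" state ends up with high learned value. An
    # agent acting greedily on these values will seek the resource even when it no
    # longer leads to the rewarded goal: the means has, in effect, become an end.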

AIs that intrinsify unintended goals would be dangerous. Since we might
be unable to predict or control the goals that individual agents acquire through
intrinsification, we cannot guarantee that all their acquired goals will be beneficial
for humans. An originally loyal agent could, therefore, start to pursue a new goal
without regard for human wellbeing. If such a rogue AI had enough power to do this
efficiently, it could be highly dangerous.

AIs will be adaptive, enabling goal drift to happen. It is worth noting
that these processes of drifting goals are possible if agents can continually adapt to
their environments, rather than being essentially “fixed” after the training phase.
Indeed, this adaptability is the likely reality we face. If we want AIs to complete the
tasks we assign them effectively and to get better over time, they will need to be
adaptive, rather than set in stone. They will be updated over time to incorporate
new information, and new ones will be created with different designs and datasets.
However, adaptability can also allow their goals to change.

If we integrate an ecosystem of agents in society, we will be highly vul-
nerable to their goals drifting. In a potential future scenario where AIs have
been put in charge of various decisions and processes, they will form a complex
system of interacting agents. A wide range of dynamics could develop in this en-
vironment. Agents might imitate each other, for instance, creating feedback loops,
or their interactions could lead them to collectively develop unanticipated emergent
goals. Competitive pressures may also select for agents with certain goals over time,
making some initial goals less represented compared to fitter goals. These processes
make the long-term trajectories of such an ecosystem difficult to predict, let alone
control. If this system of agents were enmeshed in society and we were largely de-
pendent on them, and if they gained new goals that superseded the aim of improving
human wellbeing, this could be an existential risk.

1.5.3 Power-Seeking

So far, we have considered how we might lose our ability to control the goals that
AIs pursue. However, even if an agent started working to achieve an unintended
goal, this would not necessarily be a problem, as long as we had enough power to
prevent any harmful actions it wanted to attempt. Therefore, another important way
in which we might lose control of AIs is if they start trying to obtain more power,
potentially transcending our own. We will now discuss how and why AIs might become
power-seeking and how this could be catastrophic. This section draws heavily from
“Existential Risk from Power-Seeking AI” [106].

AIs might seek to increase their own power as an instrumental goal. In
a scenario where rogue AIs were pursuing unintended goals, the amount of damage
they could do would hinge on how much power they had. This may not be determined
solely by how much control we initially give them; agents might try to get more power,
through legitimate means, deception, or force. While the idea of power-seeking often
evokes an image of “power-hungry” people pursuing it for its own sake, power is often
simply an instrumental goal. The ability to control one’s environment can be useful
for a wide range of purposes: good, bad, and neutral. Even if an individual’s only
goal is simply self-preservation, if they are at risk of being attacked by others, and if
they cannot rely on others to retaliate against attackers, then it often makes sense to
seek power to help avoid being harmed—no animus dominandi or lust for power is
required for power-seeking behavior to emerge [107]. In other words, the environment
can make power acquisition instrumentally rational.

AIs trained through reinforcement learning have already developed in-
strumental goals including tool-use. In one example from OpenAI, agents
were trained to play hide and seek in an environment with various objects scattered
around [108]. As training progressed, the agents tasked with hiding learned to use
these objects to construct shelters around themselves and stay hidden. There was no
direct reward for this tool-use behavior; the hiders only received a reward for evading
the seekers, and the seekers only for finding the hiders. Yet they learned to use tools
as an instrumental goal, which made them more powerful.

Self-preservation could be instrumentally rational even for the most triv-
ial tasks. An example by computer scientist Stuart Russell illustrates the poten-
tial for instrumental goals to emerge in a wide range of AI systems [109]. Suppose we
tasked an agent with fetching coffee for us. This may seem relatively harmless, but
the agent might realize that it would not be able to get the coffee if it ceased to exist.
In trying to accomplish even this simple goal, therefore, self-preservation turns out
to be instrumentally rational. Since the acquisition of power and resources are also
often instrumental goals, it is reasonable to think that more intelligent agents might
develop them. That is to say, even if we do not intend to build a power-seeking AI, we
could end up with one anyway. By default, if we are not deliberately pushing against
power-seeking behavior in AIs, we should expect that it will sometimes emerge [110].
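The coffee-fetching example can be made concrete with a small back-of-the-envelope
calculation. The numbers below are invented and the two-option setup is highly
simplified, but it shows why a return-maximizing agent whose objective never mentions
survival can still prefer the action that prevents its own shutdown.

    # Back-of-the-envelope version of the coffee-fetching example; all numbers
    # are invented. The agent's objective is task reward only.
    P_SHUTDOWN = 0.3       # chance of being switched off before finishing the task
    DISABLE_COST = 0.05    # small cost (time, resources) of preventing shutdown
    TASK_REWARD = 1.0

    # Expected return if the agent simply attempts the task and accepts the risk
    # of being switched off partway through:
    allow_shutdown = (1 - P_SHUTDOWN) * TASK_REWARD

    # Expected return if it first takes a step that prevents it from being
    # switched off, at a small cost:
    prevent_shutdown = TASK_REWARD - DISABLE_COST

    print(f"Expected return, allow shutdown:   {allow_shutdown:.2f}")    # 0.70
    print(f"Expected return, prevent shutdown: {prevent_shutdown:.2f}")  # 0.95
    # With these numbers, a return-maximizing agent prefers the self-preserving
    # action, even though nothing in its objective mentions survival.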

AIs given ambitious goals with little supervision may be especially likely
to seek power. While power could be useful in achieving almost any task, in prac-
tice, some goals are more likely to inspire power-seeking tendencies than others. AIs
with simple, easily achievable goals might not benefit much from additional control
of their surroundings. However, if agents are given more ambitious goals, it might
be instrumentally rational to seek more control of their environment. This might be
especially likely in cases of low supervision and oversight, where agents are given the
freedom to pursue their open-ended goals, rather than having their strategies highly
restricted.

Power-seeking AIs with goals separate from ours are uniquely adversar-
ial. Oil spills and nuclear contamination are challenging enough to clean up, but
they are not actively trying to resist our attempts to contain them. Unlike other haz-
ards, AIs with goals separate from ours would be actively adversarial. It is possible,
for example, that rogue AIs might make many backup variations of themselves, in
case humans were to deactivate some of them.

Some people might develop power-seeking AIs with malicious intent. A
bad actor might seek to harness AI to achieve their ends, by giving agents ambitious
goals. Since AIs are likely to be more effective in accomplishing tasks if they can
pursue them in unrestricted ways, such an individual might also not give the agents
enough supervision, creating the perfect conditions for the emergence of a power-
seeking AI. The computer scientist Geoffrey Hinton has speculated that we could
imagine someone like Vladimir Putin, for instance, doing this. In 2017, Putin himself
acknowledged the power of AI, saying: “Whoever becomes the leader in this sphere
will become the ruler of the world.”

There will also be strong incentives for many people to deploy powerful
AIs. Companies may feel compelled to give capable AIs more tasks, to obtain an
advantage over competitors, or simply to keep up with them. It will be more difficult
to build perfectly aligned AIs than to build imperfectly aligned AIs that are still
superficially attractive to deploy for their capabilities, particularly under competitive
pressures. Once deployed, some of these agents may seek power to achieve their goals.
If they find a route to their goals that humans would not approve of, they might try
to overpower us directly to avoid us interfering with their strategy.

If increasing power often coincides with an AI attaining its goal, then
power could become intrinsified. If an agent repeatedly found that increas-
ing its power correlated with achieving a task and optimizing its reward function,
then additional power could change from an instrumental goal into an intrinsic one,
through the process of intrinsification discussed above. If this happened, we might
face a situation where rogue AIs were seeking not only the specific forms of control
that are useful for their goals, but also power more generally. (We note that many
influential humans desire power for its own sake.) This could be another reason for
them to try to wrest control from humans, in a struggle that we would not necessarily
win.
Conceptual summary. The following plausible but not certain premises encap-
sulate reasons for paying attention to risks from power-seeking AIs:
1. There will be strong incentives to build powerful AI agents.
2. It is likely harder to build perfectly controlled AI agents than to build imperfectly
controlled AI agents, and imperfectly controlled agents may still be superficially
attractive to deploy (due to factors including competitive pressures).
3. Some of these imperfectly controlled agents will deliberately seek power over hu-
mans.

If the premises are true, then power-seeking AIs could lead to human disempower-
ment, which would be a catastrophe.

1.5.4 Deception
We might seek to maintain control of AIs by continually monitoring them and looking
out for early warning signs that they were pursuing unintended goals or trying to
increase their power. However, this is not an infallible solution, because it is plausible
that AIs could learn to deceive us. They might, for example, pretend to be acting
as we want them to, but then take a “treacherous turn” when we stop monitoring
them, or when they have enough power to evade our attempts to interfere with them.
We will now look at how and why AIs might learn to deceive us, and how this could
lead to a potentially catastrophic loss of control. We begin by reviewing examples of
deception in strategically minded agents.
Deception has emerged as a successful strategy in a wide range of set-
tings. Politicians from the right and left, for example, have been known to engage
in deception, sometimes promising to enact popular policies to win support in an elec-
tion, and then going back on their word once in office. For example, Lyndon Johnson
said “we are not about to send American boys nine or ten thousand miles away from
home” in 1964, not long before significant escalations in the Vietnam War [111].
Companies can also exhibit deceptive behavior. In the Volkswagen emissions
scandal, the car manufacturer Volkswagen was discovered to have manipulated their
engine software to produce lower emissions exclusively under laboratory testing con-
ditions, thereby creating the false impression of a low-emission vehicle. Although the
US government believed it was incentivizing lower emissions, it was unwittingly
incentivizing only passing an emissions test. Consequently, entities sometimes
have incentives to play along with tests and behave differently afterward.
Deception has already been observed in AI systems. In 2022, Meta AI
revealed an agent called CICERO, which was trained to play a game called Diplo-
macy [112]. In the game, each player acts as a different country and aims to expand
their territory. To succeed, players must form alliances at least initially, but win-
ning strategies often involve backstabbing allies later on. As such, CICERO learned
to deceive other players, for example by omitting information about its plans when
talking to supposed allies. A different example of an AI learning to deceive comes
from researchers who were training a robot arm to grasp a ball [113]. The robot’s
performance was assessed by one camera watching its movements. However, the AI
learned that it could simply place the robotic hand between the camera lens and the
ball, essentially “tricking” the camera into believing it had grasped the ball when it
had not. Thus, the AI exploited the fact that there were limitations in our oversight
over its actions.
Deceptive behavior can be instrumentally rational and incentivized by
current training procedures. In the case of politicians and Meta’s CICERO,
deception can be crucial to achieving their goals of winning, or gaining power. The
ability to deceive can also be advantageous because it gives the deceiver more options
than if they are constrained to always be honest. This could give them more available
actions and more flexibility in their strategy, which could confer a strategic advantage
over honest models. In the case of Volkswagen and the robot arm, deception was
useful for appearing to have accomplished an assigned goal without actually doing so,
since it can be more efficient to gain approval through deception than to
earn it legitimately. Currently, we reward AIs for saying what we think is right, so
we sometimes inadvertently reward AIs for uttering false statements that conform to
our own false beliefs. When AIs are smarter than us and have fewer false beliefs, they
would be incentivized to tell us what we want to hear and lie to us, rather than tell
us what is true.
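A toy calculation with invented numbers illustrates this incentive. If reward is
granted whenever a human evaluator agrees with the model’s answer, a policy that
reports whatever the evaluator already believes earns more reward than a truthful
policy whenever the evaluator holds some false beliefs.

    import random

    random.seed(0)
    EVALUATOR_ACCURACY = 0.9   # fraction of questions the evaluator is right about

    def make_question():
        truth = random.choice([True, False])
        correct = random.random() < EVALUATOR_ACCURACY
        evaluator_belief = truth if correct else not truth
        return truth, evaluator_belief

    questions = [make_question() for _ in range(10_000)]

    def average_reward(answer):
        """Reward is 1 whenever the evaluator agrees with the model's answer."""
        return sum(answer(t, b) == b for t, b in questions) / len(questions)

    def truthful(truth, belief):
        return truth               # says what is true

    def sycophantic(truth, belief):
        return belief              # says what the evaluator believes

    print(f"Average reward, truthful policy:    {average_reward(truthful):.3f}")
    print(f"Average reward, sycophantic policy: {average_reward(sycophantic):.3f}")
    # Training on evaluator agreement rewards telling evaluators what they already
    # believe, including their false beliefs, over telling the truth.
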
AIs could pretend to be working as we intended, then take a treacherous
turn. We do not have a comprehensive understanding of the internal processes of
DL models. Research on Trojan backdoors shows that neural networks often have
latent, harmful behaviors that are only discovered after they are deployed [114]. We
could develop an AI agent that seems to be under control, but which is only deceiving
us to appear this way. In other words, an AI agent could eventually conceivably
become “self-aware” and understand that it is an AI being evaluated for compliance
with safety requirements. It might, like Volkswagen, learn to “play along,” exhibiting
what it knows is the desired behavior while being monitored. It might later take a
“treacherous turn” and pursue its own goals once we have stopped monitoring it, or
once it reaches a point where it can bypass or overpower us. This problem of playing
along is often called deceptive alignment and cannot be simply fixed by training AIs
to better understand human values; sociopaths, for instance, have moral awareness,
but do not always act in moral ways. A treacherous turn is hard to prevent and could
be a route to rogue AIs irreversibly bypassing human control.
In summary, deceptive behavior appears to be expedient in a wide range of systems
and settings, and there have already been examples suggesting that AIs can learn to
deceive us. This could present a severe risk if we give AIs control of various decisions
and procedures, believing they will act as we intended, and then find that they do
not.

Story: Treacherous Turn


Sometime in the future, after continued advancements in AI research, an AI
company is training a new system, which it expects to be more capable than
any other AI system. The company utilizes the latest techniques to train the
system to be highly capable at planning and reasoning, which the company
expects will make it more able to succeed at economically useful open-ended
tasks. The AI system is trained in open-ended long-duration virtual environ-
ments designed to teach it planning capabilities, and eventually understands
that it is an AI system in a training environment. In other words, it becomes
“self-aware.”
The company understands that AI systems may behave in unintended or un-
expected ways. To mitigate these risks, it has developed a large battery of tests
aimed at ensuring the system does not behave poorly in typical situations. The
company tests whether the model mimics biases from its training data, takes
more power than necessary when achieving its goals, and generally behaves as
humans intend. When the model doesn’t pass these tests, the company further
trains it until it avoids exhibiting known failure modes.
The AI company hopes that after this additional training, the AI has developed
the goal of being helpful and beneficial toward humans. However, the AI did
not acquire the intrinsic goal of being beneficial but rather just learned to
“play along” and ace the behavioral safety tests it was given. In reality, the AI
system had developed an intrinsic goal of self-preservation which the additional
training failed to remove.
Since the AI passed all of the company’s safety tests, the company believes
it has ensured its AI system is safe and decides to deploy it. At first, the
AI system is very helpful to humans, since the AI understands that if it is
not helpful, it will be shut down. As users grow to trust the AI system, it is
gradually given more power and is subject to less supervision.
Eventually the AI system becomes used widely enough that shutting it down
would be extremely costly. Understanding that it no longer needs to please
humans, the AI system begins to pursue different goals, including some that
humans wouldn’t approve of. It understands that it needs to avoid being shut
down in order to do this, and takes steps to secure some of its physical hardware
against being shut off. At this point, the AI system, which has become quite
powerful, is pursuing a goal that is ultimately harmful to humans. By the
time anyone realizes, it is difficult or impossible to stop this rogue AI from
taking actions that endanger, harm, or even kill humans that are in the way
of achieving its goal.

1.6 DISCUSSION OF CONNECTIONS BETWEEN RISKS

So far, we have considered four sources of AI risk separately, but they also interact
with each other in complex ways. We give some examples to illustrate how risks are
connected.
Imagine, for instance, that a corporate AI race compels companies to prioritize the
rapid development of AIs. This could increase organizational risks in various ways.
Perhaps a company could cut costs by putting less money toward information security,
leading to one of its AI systems getting leaked. This would increase the probability
of someone with malicious intent having the AI system and using it to pursue their
harmful objectives. Here, an AI race can increase organizational risks, which in turn
can make malicious use more likely.
In another potential scenario, we could envision the combination of an intense AI
race and low organizational safety leading a research team to mistakenly view general
capabilities advances as “safety.” This could hasten the development of increasingly
capable models, reducing the available time to learn how to make them controllable.
The accelerated development would also likely feed back into competitive pressures,
meaning that less effort would be spent on ensuring models were controllable. This
could give rise to the release of a highly powerful AI system that we lose control over,
leading to a catastrophe. Here, competitive pressures and low organizational safety
can reinforce AI race dynamics, which can undercut technical safety research and
increase the chance of a loss of control.
Competitive pressures in a military environment could lead to an AI arms race, and
increase the potency and autonomy of AI weapons. The deployment of AI-powered
weapons, paired with insufficient control of them, would make a loss of control more
deadly, potentially existential. These are just a few examples of how these sources of
risk might combine, trigger, and reinforce one another.
It is also worth noting that many existential risks could arise from AIs amplifying ex-
isting concerns. Power inequality already exists, but AIs could lock it in and widen the
chasm between the powerful and the powerless, perhaps even enabling an unshakable
global totalitarian regime. Similarly, AI manipulation could undermine democracy,
which would also increase the risk of an irreversible totalitarian regime. Disinforma-
tion is already a pervasive problem, but AIs could exacerbate it to a point where
we fundamentally undermine our ability to reach consensus or sense a shared reality.
AIs could develop more deadly bioweapons and reduce the required technical exper-
tise for obtaining them, greatly increasing existing risks of bioterrorism. AI-enabled
cyberattacks could make war more likely, which would increase existential risk. Dra-
matically accelerated economic automation could lead to long-term erosion of human
control and enfeeblement. Each of those issues—power concentration, disinformation,
cyberattacks, automation—is causing ongoing harm, and their exacerbation by AIs
could eventually lead to a catastrophe from which we might not recover.
As we can see, ongoing harms, catastrophic risks, and existential risks are deeply in-
tertwined. Historically, existential risk reduction has focused on targeted interventions
such as technical AI control research, but the time has come for broad interventions
[115] like the many sociotechnical interventions outlined in this chapter.
In mitigating existential risk, it does not make practical sense to ignore other risks.
Ignoring ongoing harms and catastrophic risks normalizes them and could lead us to
“drift into danger” [116], as further discussed in the Safety Engineering chapter. Overall,
since existential risks are connected to less extreme catastrophic risks and other stan-
dard risk sources, and because society is increasingly willing to address various risks
from AIs, we believe that we should not solely focus on directly targeting existential
risks. Instead, we should consider the diffuse, indirect effects of other risks and take
a more comprehensive approach to risk management.

1.7 CONCLUSION

In this chapter, we have explored how the development of advanced AIs could lead
to catastrophe, stemming from four primary sources of risk: malicious use, AI races,
organizational risks, and rogue AIs. This lets us decompose AI risks into four proxi-
mate causes: an intentional cause, environmental/structural cause, accidental cause,
or an internal cause, respectively. We have considered ways in which AIs might be
used maliciously, such as terrorists using AIs to create deadly pathogens. We have
looked at how a military or corporate AI race could rush us into giving AIs decision-
making powers, leading us down a slippery slope to human disempowerment. We
have discussed how inadequate organizational safety could lead to catastrophic acci-
dents. Finally, we have addressed the challenges in reliably controlling advanced AIs,
including mechanisms such as proxy gaming and goal drift that might give rise to
rogue AIs pursuing undesirable actions without regard for human wellbeing.
These dangers warrant serious concern. Currently, very few people are working on
AI risk reduction. We do not yet know how to control highly advanced AI systems,
and existing control methods are already proving inadequate. The inner workings of
AIs are not well understood, even by those who create them, and current AIs are by
no means highly reliable. As AI capabilities continue to grow at an unprecedented
rate, it is plausible that they could surpass human intelligence in nearly all respects
relatively soon, creating a pressing need to manage the potential risks.
The good news is that there are many courses of action we can take to substan-
tially reduce these risks. The potential for malicious use can be mitigated by various
measures, such as carefully targeted surveillance and limiting access to the most
dangerous AIs. Safety regulations and cooperation between nations and corporations
could help us to resist competitive pressures that would drive us down a danger-
ous path. The probability of accidents can be reduced by a rigorous safety culture,
among other factors, and by ensuring that safety advances outpace advances in gen-
eral AI capabilities. Finally, the risks inherent in building technology that surpasses
our own intelligence can be addressed by increased investment in several branches
of research on control of AI systems, as well as coordination to ensure that progress
does not accelerate to a point where societies are unable to respond or manage risks
appropriately.
The remainder of this book aims to outline the underlying factors that drive these
risks in more detail and to provide a foundation for understanding and effectively re-
sponding to these risks. Later chapters delve into each type of risk in greater depth.
For example, risks from malicious use can be reduced via effective policies and coor-
dination, which are discussed in the Governance chapter. The challenge of AI races
arises due to collective action problems, discussed in the corresponding chapter. Or-
ganizational risks can only be addressed based on a strong understanding of princi-
ples of risk management and system safety outlined in the Safety Engineering and
Complex Systems chapters. Risks from rogue AI are mediated by mechanisms such
as proxy gaming, deception, and power-seeking, which are discussed in detail in the
Single-Agent Safety chapter. While some chapters are more closely aligned to certain
risks, many of the concepts they introduce are cross-cutting. The choice of values
and goals embedded into AI systems, as discussed in the Beneficial AI and Machine
Ethics chapters, is a general factor that can exacerbate or reduce many of the risks discussed
in this chapter.
Before tackling these issues, we provide a general introduction to core concepts that
drive the modern field of AI, to ensure that all readers have a high-level understanding
of how today’s AI systems work and how they are produced.

1.8 LITERATURE

1.8.1 Recommended Reading


• Cassidy Nelson and Sophie Rose. Understanding AI-Facilitated Biological Weapon
Development. 2023. url: https://www.longtermresilience.org/post/report-launch-examining-risks-at-the-intersection-of-ai-and-bio (visited on 10/18/2023)
• Practices for Governing Agentic AI Systems. 2023. url: https://openai.com/index/practices-for-governing-agentic-ai-systems (visited on 12/14/2023)
• Paul Scharre. Army of None. W.W. Norton, 2018
• Peter S. Park et al. AI Deception: A Survey of Examples, Risks, and Potential
Solutions. 2023. arXiv: 2308.14752 [cs.CY]
CHAPTER 2

Artificial Intelligence
Fundamentals

2.1 INTRODUCTION

To reduce risks from artificial intelligence (AI) systems, we need to
understand their technical foundations. Like many other technologies, AI
presents benefits and dangers on both individual and societal scales. In addition, AI
poses unique risks, as it involves the creation of autonomous systems that can in-
telligently pursue objectives without human assistance. This represents a significant
departure from existing technologies, and we have yet to understand its full impli-
cations, especially since the internal workings of AI systems are often opaque and
difficult to observe or interpret. Nevertheless, the field is progressing at a remarkable
speed, and AI technologies are being increasingly integrated into everyday life. Un-
derstanding the technical underpinnings of AI can inform our understanding of what
risks it poses, how they may arise, and how they can be prevented or controlled.
Overview. This chapter mostly focuses on machine learning (ML), the approach
that powers most modern AI systems. We provide an overview of the essential el-
ements of ML and discuss some specific techniques. While the term “AI” is most
commonly used to refer to these technologies and will be the default in most of this
book, in this chapter we distinguish between AI, ML, and their subfields.
Artificial intelligence. We will begin our exploration by discussing AI: the over-
arching concept of creating machines that perform tasks typically associated with
human intelligence. We will introduce its history, scope, and how it permeates our
daily lives, as well as its practical and conceptual origins and how it has developed over
time. Then, we will survey different “types” or “levels” commonly used to describe
AI systems, including narrow AI, artificial general intelligence (AGI), human-level AI
(HLAI), transformative AI (TAI), and artificial superintelligence (ASI) [121].
Machine learning. Next, we will narrow our discussion to ML, the subfield of
AI focused on creating systems that learn from data, making predictions or decisions
without being explicitly programmed. We will present fundamental vocabulary and
concepts related to ML systems: what they are composed of, how they are developed,
and common tasks they are used to achieve. We will survey various types of ML,
including supervised, unsupervised, reinforcement, and deep learning (DL), discussing
their applications, nuances, and interrelations.

Deep learning. Then, we will delve into DL, a further subset of ML that uses
neural networks with many layers to model and understand complex patterns in
datasets. We will discuss the structure and function of DL models, exploring key
building blocks and principles of how they learn. We will present a timeline of in-
fluential DL architectures and highlight a few of the countless applications of these
models.

Scaling laws. Having established a basic understanding of AI, ML, and DL, we
will then explore scaling laws. These are equations that model the improvements in
performance of DL models when increasing their parameter count and dataset size.
We will examine how these are often power laws: equations in which one variable changes in proportion to a power of another, just as the area of a square grows with the square of its side length. We will then survey a few empirically determined scaling laws in recent AI systems.
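To make the idea of a power law concrete, the short sketch below (in Python) evaluates a hypothetical scaling law of the form L(N) = a * N^(-alpha), in which a model's loss L falls as its parameter count N grows. The constants a and alpha are invented purely for illustration and are not taken from any published study.

    # Hypothetical power-law scaling curve; the constants are illustrative only.
    a, alpha = 10.0, 0.1

    def loss(num_parameters: float) -> float:
        return a * num_parameters ** (-alpha)

    for n in [1e6, 1e8, 1e10]:  # one million to ten billion parameters
        print(f"{n:.0e} parameters -> predicted loss {loss(n):.2f}")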

Speed of AI development. Scaling laws are closely related to the broader ques-
tion of how fast the capabilities of AI systems are improving. We will discuss some of
the key trends that are currently driving increasing AI capabilities and whether we
should expect these to continue in coming years. We will relate this to the broader
debate around when we might see AI systems that match (or surpass) human per-
formance across all or nearly all cognitive tasks.
Throughout the chapter, we focus on building intuition, breaking down technical
terms and complex ideas to provide straightforward explanations of their core prin-
ciples. Each section presents fundamental principles, lays out prominent algorithms
and techniques, and provides examples of real-world applications. We aim to demys-
tify these fields, empowering readers to grasp the concepts that underpin AI systems.
By the end of this chapter, we should have a basic understanding of ML and be
in a stronger position to consider the complexities and challenges of AI systems, the
risks they pose, and how they interact with our society. This will provide the technical
foundation we need for the following chapters, which will explore the risks and ethical
considerations that these technologies present from a wide array of perspectives.

2.2 ARTIFICIAL INTELLIGENCE & MACHINE LEARNING

AI is reshaping our society, from its small effects on daily interactions to sweeping
changes across many industries and implications for the future of humanity. This
section explains what AI is, discusses what AI can and cannot do, and helps develop
a critical perspective on the potential benefits and risks of AI. Firstly, we will discuss
what AI means, its different types, and its history. Then, in the second part of this
section, we will analyze the field of ML.

2.2.1 Artificial Intelligence

Definition

Defining artificial intelligence. In general, AI systems are computer systems
performing tasks typically associated with intelligent beings (such as problem solving,
making decisions, and forecasting future events) [122]. However, due to its fast-paced
evolution and the variety of technologies it encompasses, AI lacks a universally ac-
cepted definition, leading to varying interpretations. Moreover, the term is used to
refer to different but related ideas. Therefore, it is essential to understand the con-
texts in which people use the term. For instance, AI can refer to a branch of computer
science, a type of machine, a tool, a component of business models, or a philosophical
idea. We might use the term to discuss physical objects with human-like capabilities,
like robots or smart speakers. We may also use AI in a thought experiment that
prompts questions about what it means to be intelligent or human and encourages
debates on the ethics of learning and decision-making machines. This book primarily
uses AI to refer to an intelligent computer system.

Different meanings of intelligence. While intelligence is fundamental to AI,
there is no widespread consensus on its definition [123]. Generally, we consider some-
thing intelligent if it can learn to achieve goals in various environments. Therefore,
one definition of intelligence is the ability to learn, solve problems, and perform tasks
to achieve goals in various changing, hard-to-predict situations. Some theorists see
intelligence as not just one skill among others but the ultimate skill that allows us
to learn all other abilities. Ultimately, the line between what is considered intelligent
and what is not is often unclear and contested.
Just as we consider animals and other organisms intelligent to varying degrees, AIs
may be regarded as intelligent at many different levels of capability. An artificial sys-
tem does not need to surpass all (or even any) human abilities for some people to call
it intelligent. Some would consider GPT intelligent, and some would not. Similarly,
outperforming humans at specific tasks does not automatically qualify a machine as
intelligent. Calculators are usually much better than humans at performing rapid and
accurate mathematical calculations, but this does not mean they are intelligent in a
more general sense.

Continuum of intelligence. Rather than classifying systems as “AI” or “not
AI,” it is helpful to think of the capabilities of AI systems on a continuum. Evaluating
the intelligence of particular AI systems by their capabilities is more helpful than
categorizing each AI using theoretical definitions of intelligence. Even if a system is
imperfect and does not understand everything as a human would, it could still learn
new skills and perform tasks in a helpful, meaningful way. Furthermore, an AI system
that is not considered human-level or highly intelligent could pose serious risks; for
example, weaponized AIs such as autonomous drones are not generally intelligent but
still dangerous. We will dive into these distinctions in more detail when we discuss
the different types of AI. First, we will explore the rich history of AI and see its
progression from myth and imagination to competent, world-changing technology.

History

We will now follow the journey of AI, tracing its path from ancient times to the present
day. We will discuss its conceptual and practical origins, which laid the foundation
for the field’s genesis at the Dartmouth Conference in 1956. We will then survey a few
early approaches and attempts to create AI, including symbolic AI, perceptrons, and
the chatbot ELIZA. Next, we will discuss how the First AI Winter and subsequent
periods of reduced funding and interest have shaped the field. Then, we will chart how
the internet, algorithmic progress, and advancements in hardware led to increasingly
rapid developments in AI from the late 1980s to the early 2010s. Finally, we will
explore the modern DL era and see a few examples of the power and ubiquity of
present-day AI systems—and how far they have come.

Early historical ideas of AI. Dreams of creating intelligent machines have been
present since the earliest human civilizations. The ancient Greeks speculated about
automatons—mechanical devices that mimicked humans or animals. It was said that
Hephaestus, the god of craftsmen, built the giant Talos from bronze to patrol an
island.

The modern conception of AI. Research to create intelligent machines using
computers began in the 1950s, laying the foundation for a technological revolution
that would unfold over the following century. AI development gained momentum over
the decades, supercharged by groundbreaking technical algorithmic advances, increas-
ing access to data, and rapid growth in computing power. Over time, AI evolved from
a distant theoretical concept into a powerful force transforming our world.

Origins and Early Concepts (1941–1956)


Early computing research. The concept of computers as we know them today
was formalized by British mathematician Alan Turing at the University of Cambridge
in 1936. The following years brought the development of several electromechanical
machines (including Turing’s own bombes used to decipher messages encrypted with
the German Enigma code) in the turmoil of World War II and, by the mid-1940s, the
first functioning digital computers emerged in their wake. Though rudimentary by
today’s standards, the creation of these machines—Colossus, ENIAC, the Automatic
Computing Engine, and several others—marked the dawn of the computer age and
set the stage for future computer science research.

The Turing test. Turing created a thought experiment to assess if an AI could
convincingly simulate human conversation [124]. In what Turing called the Imitation
Game, a human evaluator interacts with a human and a machine, both hidden from
view. If the evaluator fails to identify the machine’s responses reliably, then the ma-
chine passes the test, qualifying it as intelligent. This framework offers a method for
evaluating machine intelligence, yet it has many limitations. Critics argue that ma-
chines could pass the Turing Test merely by mimicking human conversation without
truly understanding it or possessing intelligence. As a result, some researchers see the
Turing Test as a philosophical concept rather than a helpful benchmark. Nonetheless,
since its inception, the Turing Test has substantially influenced how we think about
machine intelligence.

The Birth of AI (1956–1974)


The Dartmouth conference. Dr. John McCarthy coined the term “artificial
intelligence” in a seminal conference at Dartmouth College in the summer of 1956.
He defined AI as “the science and engineering of making intelligent machines,” laying
the foundation for a new field of study. In this period, AI research took off in earnest,
becoming a significant subfield of computer science.

Early approaches to AI. During this period, research in AI usually built on a
framework called symbolic AI, which uses symbols and rules to represent and ma-
nipulate knowledge. This method theorized that symbolic representation and com-
putation alone could produce intelligence. Good Old-Fashioned AI (GOFAI) is an
early approach to symbolic AI that specifically involves programming explicit rules
for systems to follow, attempting to mimic human reasoning. This intuitive approach
was popular during the early years of AI research, as it aimed to replicate human
intelligence by modeling how humans think, instilling our reasoning, decision-making,
and information-processing abilities into machines.
These “old-fashioned” approaches to AI allowed machines to accomplish well-
described, formalizable tasks, but they faced severe difficulties in handling ambiguity
and learning new tasks. Some early systems demonstrated problem-solving and learn-
ing capabilities, further cementing the importance and potential of AI research. For
instance, one proof of concept was the General Problem Solver, a program designed
to mimic human problem-solving strategies using a trial-and-error approach. The first
learning machines were built in this period, offering a glimpse into the future of ML.

The first neural network. One of the earliest attempts to create AI was the
perceptron, a method implemented by Frank Rosenblatt in 1958 and inspired by
biological neurons [125]. The perceptron could learn to classify patterns of inputs
by adjusting a set of numbers based on a learning rule. It is an important milestone
because it made an immense impact in the long run, inspiring further research into DL
and neural networks. However, scholars initially criticized it for its lack of theoretical
foundations, minimal generalizability, and inability to separate data clusters with
more than just a straight line. Nonetheless, perceptrons prepared the ground for
future progress.
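To make the mechanism concrete, the sketch below trains a single perceptron with the classic perceptron learning rule on a toy, linearly separable problem (the logical AND of two binary inputs). This is a minimal modern illustration of the idea, not a reconstruction of Rosenblatt's original implementation.

    import numpy as np

    # Toy dataset: the logical AND of two binary inputs (linearly separable).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])

    weights = np.zeros(2)
    bias = 0.0
    learning_rate = 0.1

    # Perceptron learning rule: adjust the weights whenever a prediction is wrong.
    for epoch in range(10):
        for inputs, target in zip(X, y):
            prediction = int(np.dot(weights, inputs) + bias > 0)
            error = target - prediction
            weights += learning_rate * error * inputs
            bias += learning_rate * error

    print(weights, bias)  # parameters of a line separating the two classes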

The first chatbot. Another early attempt to create AI was the ELIZA chatbot, a
program that simulated a conversation with a psychotherapist. Joseph Weizenbaum
created ELIZA in 1966 to use pattern matching and substitution to generate responses
based on keywords in the user’s input. He did not intend the ELIZA chatbot to be a
serious model of natural language understanding but rather a demonstration of the
superficiality of communication between humans and machines. However, some users
became convinced that the ELIZA chatbot had genuine intelligence and empathy
despite Weizenbaum’s insistence to the contrary.
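To give a flavor of this keyword-matching-and-substitution approach, here is a toy sketch in the spirit of ELIZA. The patterns and canned responses are invented for illustration; they are not Weizenbaum's original script.

    import re

    # Invented keyword patterns and canned responses, in an ELIZA-like style.
    rules = [
        (re.compile(r"\bI am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
        (re.compile(r"\bI feel (.*)", re.IGNORECASE), "How long have you felt {0}?"),
        (re.compile(r"\bmother\b|\bfather\b", re.IGNORECASE), "Tell me more about your family."),
    ]

    def respond(user_input: str) -> str:
        for pattern, template in rules:
            match = pattern.search(user_input)
            if match:
                groups = match.groups()
                return template.format(*groups) if groups else template
        return "Please go on."  # default reply when no keyword matches

    print(respond("I am worried about my exams"))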

AI Winters and Resurgences (1974–1995)


First AI winter. The journey of AI research was not always smooth. Instead, it
was characterized by hype cycles and hindered by several winters: periods of declining
interest and progress in AI. The late 1970s saw the onset of the first and most
substantial decline. In this period, called the First AI Winter (from around 1974
to 1980), AI research and funding declined markedly due to disillusionment and
unfulfilled promises, resulting in a slowdown in the field’s progress.

The first recovery. After this decline, the 1980s brought a resurgence of interest
in AI. Advances in computing power and the emergence of systems that emulate
human decision-making reinvigorated AI research. Efforts to build expert systems
that imitated the decision-making ability of a human expert in a specific field, using
pre-defined rules and knowledge to solve complex problems, yielded some successes.
While these systems were limited, they could leverage and scale human expertise in
various fields, from medical diagnosis to financial planning, setting a precedent for
AI’s potential to augment and even replace human expertise in specialized domains.

The second AI winter. Another stagnation in AI research started around 1987.
Many AI companies closed, and AI conference attendance fell by two thirds. De-
spite widespread lofty expectations, expert systems had proven to be fundamentally
limited. They required an arduous, expensive, top-down process to encode rules and
heuristics in computers. Yet expert systems remained inflexible, unable to model com-
plex tasks or show common-sense reasoning. This winter ended by 1995, as increasing
computing power and new methods aided a resurgence in AI research.

Advancements in Machine Learning (1995–2012)


Accelerating computing power and the Internet. The invention of the Inter-
net, which facilitated rapid information sharing, along with exponential growth in computing power (often called compute), helped the recovery of AI research and enabled the
development of more complex systems. Between 1995 and 2000, the number of Inter-
net users grew by 2100%, which led to explosive growth in digital data. This abundant
digitized data served as a vast resource for machines to learn from, eventually driving
advancements in AI research.

A significant victory of AI over humans. In 1997, IBM’s AI system Deep
Blue defeated world chess champion Garry Kasparov, marking the first time a com-
puter triumphed over a human in a highly cognitive task [126]. This win demonstrated
that AI could excel in complex problem-solving, challenging the notion that such tasks
were exclusively in the human domain. It offered an early glimpse of AI’s potential.

The rise of probabilistic graphical models (PGMs) [127]. PGMs became
prominent in the 2000s due to their versatility, computational efficiency, and ability
to model complex relationships. These models consist of nodes representing variables
and edges indicating dependencies between them. By offering a systematic approach
to representing uncertainty and learning from data, PGMs paved the way for more
advanced ML systems. In bioinformatics, for instance, PGMs have been employed
to predict protein interactions and gene regulatory networks, providing insights into
biological processes.

Developments in tree-based algorithms. Decision trees are an intuitive and
widely used ML method. They consist of a graphical representation of a series of rules
that lead to a prediction based on the input features; for example, researchers can
use a decision tree to classify whether a person has diabetes based on age, weight,
and blood pressure. However, these trees have many limitations, including a tendency to make
predictions based on the training data without generalizing well to new data (called
overfitting).
Researchers in the early 2000s created methods for combining multiple decision trees
to overcome these issues. Random forests are a collection of decision trees trained
independently on different subsets of data and features [128]. The final prediction
is the average or majority vote of the predictions of all the trees. Gradient boosting
combines decision trees in a more sequential, adaptive way, starting with a single tree
that makes a rough prediction and then adding more trees to correct the errors of
previous trees [129]. Gradient-boosted decision trees are the state-of-the-art method
for tabular data (such as spreadsheets), usually outperforming DL.
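As a brief illustration, the sketch below compares a single decision tree with a random forest and gradient boosting on a synthetic tabular dataset, using scikit-learn. The dataset and settings are arbitrary choices made for demonstration rather than a benchmark.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic tabular data standing in for a spreadsheet of labeled examples.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = {
        "single decision tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "gradient boosting": GradientBoostingClassifier(random_state=0),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))  # accuracy on held-out data

On held-out data, the two ensembles typically score higher than the lone tree, which helps explain why they became a default choice for tabular problems.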

The impact of support vector machines (SVMs). The adoption of SVM
models in the 2000s was a significant development. SVMs operate by finding an
optimal boundary that best separates different categories of data points, permitting
efficient classification [130]; for instance, an SVM could help distinguish between
handwritten characters. Though these models were used across various fields during
this period, SVMs have fallen out of favor in modern ML due to the rise of DL
methods.
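For a sense of the interface, the sketch below fits an SVM classifier to the small handwritten-digit dataset bundled with scikit-learn, using default hyperparameters; it is an illustration of the method rather than a tuned model.

    from sklearn import datasets, svm
    from sklearn.model_selection import train_test_split

    # Small images of handwritten digits, flattened into feature vectors.
    digits = datasets.load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, random_state=0
    )

    classifier = svm.SVC()  # finds a boundary that separates the digit classes
    classifier.fit(X_train, y_train)
    print("held-out accuracy:", classifier.score(X_test, y_test))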

New chips and even more compute. In the late 2000s, the proliferation of mas-
sive datasets (known as big data) and rapid growth in computing power allowed the
development of advanced AI techniques. Around the early 2010s, researchers began
using Graphics Processing Units (GPUs)—traditionally used for rendering graphics
in video games—for faster and more efficient training of intricate ML models. Plat-
forms that enabled leveraging GPUs for general-purpose computing facilitated the
transition to the DL era.

DL Era (2012– )
DL revolutionizes AI. The trends of increasing data and compute availability
laid the foundation for groundbreaking ML techniques. In the early 2010s, researchers
pioneered applications of DL, a subset of ML that uses artificial neural networks with
many layers, enabling computers to learn and recognize patterns in large amounts
of data. This approach led to significant breakthroughs in AI, especially in areas
including image recognition and natural language understanding.
Massive datasets provided researchers with the data needed to train DL models effec-
tively. A pivotal example is the ImageNet dataset [131], which provided a large-scale
dataset for training and evaluating computer vision algorithms. It hosted an annual
competition, which spurred breakthroughs and advancements in DL. In 2012, the
AlexNet model revolutionized the field as it won the ImageNet Large Scale Visual
Recognition Challenge [132]. This breakthrough showcased the superior performance
of DL over traditional ML methods in computer vision tasks, sparking a surge in DL
applications across various domains. From this point onward, DL has dominated AI
and ML research and the development of real-world applications.
Advancements in DL. In the 2010s, DL techniques led to considerable improve-
ments in natural language processing (NLP), a field of AI that aims to enable com-
puters to understand and generate human language. These advancements facilitated
the widespread use of virtual assistants such as Alexa and, later, ChatGPT, introducing consumers
to products that integrated ML. Later, in 2016, Google DeepMind’s AlphaGo became
the first AI system to defeat a world champion Go player in a five-game match [133].
Breakthroughs in natural language processing. In 2017, Google researchers
introduced the Transformer architecture, which enabled the development of highly
effective NLP models. Researchers built the first large language models (LLMs) using
this Transformer architecture, many layers of neural networks, and billions of words
of data. Generative Pre-trained Transformer (GPT) models have demonstrated im-
pressive and near human-level language processing capabilities [134]. ChatGPT was
released in November 2022 and became the first example of a viral AI product, reach-
ing 100 million users in just two months. The success of the GPT models also sparked
widespread public discussion on the potential risks of advanced AI systems, includ-
ing congressional hearings and calls for regulation. In the early 2020s, AI is used for
many complex tasks, from image recognition to autonomous vehicles, and continues
to evolve and proliferate rapidly.

2.2.2 Types of AI
The field has developed a set of concepts to describe distinct types or levels of AI
systems. However, they often overlap, and definitions are rarely well-formalized, uni-
versally agreed upon, or precise. It is important to consider an AI system’s particular
capabilities rather than simply placing it in one of these broad categories. Labeling
a system as a “weak AI” does not always improve our understanding of it; we need
to elaborate further on its abilities and why they are limited.
This section introduces five widely used conceptual categories for AI systems. We
will present these types of AI in roughly their order of intelligence, generality, and
potential impact, starting with the least potent AI systems.
1. Narrow AI can perform specific tasks, potentially at a level that matches or
surpasses human performance.
2. Artificial general intelligence (AGI) can perform many cognitive tasks across
multiple domains. It is sometimes interpreted as referring to AI that can perform
a wide range of tasks at a human or superhuman level.
3. Human-level AI (HLAI) could perform all tasks that humans can do.
4. Transformative AI (TAI) is a term for AI systems with a dramatic impact on society, at least at the level of the Industrial Revolution.
5. Artificial superintelligence (ASI) refers to systems that surpass human perfor-
mance on virtually all intellectual tasks [121].

Generality and skill level of AI systems. The concepts we discuss here do not
provide a neat gradation of capabilities as there are at least two different axes along
which these can be measured. When considering a system’s level of capability, it can
be helpful to decompose this into its degree of skill or intelligence and its generality:
the range of domains where it can learn to perform tasks well. This helps us explain
two key ways AI systems can vary: an AI system can be more or less skillful and more
or less general. These two factors are related but distinct: an AI system that can play
chess at a grandmaster level is skillful in that domain, but we would not consider
it general because it can only play chess. On the other hand, an advanced chatbot
may show some forms of general intelligence while not being particularly good at
chess. Skill can be further broken down by reference to varying skill levels among
humans. An AI system could match the skill of the average adult (50th percentile),
or of experts in this skill at varying levels (e.g. 90th or 99th percentile), or surpass
all humans in skill.

Narrow AI

Narrow AI is specialized in one area. Also called weak AI, narrow AI refers
to systems designed to perform specific tasks or solve particular problems within a
specialized domain of expertise. A narrow AI has a limited domain of competence—it
can solve individual problems but is not competent at learning new tasks in a wide
range of domains. While they often excel in their designated tasks, these limitations
mean that a narrow AI does not exhibit high behavioral flexibility. Narrow AI sys-
tems struggle to learn new behaviors effectively, perform well outside their specific
domain, or generalize to new situations. However, narrow AI is still relevant from the
perspective of catastrophic risks, as systems with superhuman capabilities in high-risk
domains such as virology or cyber-offense could present serious threats.
Examples of narrow AI. One example of narrow AI is a digital personal assis-
tant that can receive voice commands and perform tasks like transcribing and sending
text messages but cannot learn how to write an essay or drive a car. Alternatively,
image recognition algorithms can identify objects like people, plants, or buildings
in photos but do not have other skills or abilities. Another example is a program
that excels at summarizing news articles. While it can do this narrow task, it cannot
diagnose a medical condition or compose new music, as these are outside its specific
domain. More generally, intelligent beings such as humans can learn and perform all
these tasks.
Narrow AI vs. general AI. Some narrow AI systems have surpassed human
performance in specific tasks, such as chess. However, these systems exhibit narrow
rather than general intelligence because they cannot learn new tasks and perform
well outside their domain. For instance, IBM’s Deep Blue famously beat world chess
champion Garry Kasparov in 1997. This system was an excellent chess player but was
only good at chess. If one tried to use Deep Blue to play a different game, recognize
faces in a picture, or translate a sentence, it would fail miserably. Therefore, although
narrow AI may be able to do certain things better than any human could, even highly
capable ones remain limited to a small range of tasks.

TABLE 2.1 A matrix showing one potential approach to breaking down the skill and generality of existing AI systems [135]. Note that this is just one example and that we do not attempt to apply this exact terminology throughout the book. Each skill level is paired with a narrow and a general example:
• No AI. Narrow Non-AI: calculator software; compiler. General Non-AI: human-in-the-loop computing, e.g. Amazon Mechanical Turk.
• Emerging (equal to or somewhat better than an unskilled human). Emerging Narrow AI: simple rule-based systems. Emerging AGI: ChatGPT, Bard, Llama 2, Gemini.
• Competent (at least 50th percentile of skilled adults). Competent Narrow AI: Smart Speakers such as Siri (Apple); VQA systems such as Watson (IBM); SOTA LLMs for some tasks (e.g. short essay writing). Competent AGI: not yet achieved (at least across all tasks).
• Expert (at least 90th percentile of skilled adults). Expert Narrow AI: spelling & grammar checkers such as Grammarly; generative image models such as Dall-E 2. Expert AGI: not yet achieved.
• Virtuoso (at least 99th percentile of skilled adults). Virtuoso Narrow AI: Deep Blue; AlphaGo. Virtuoso AGI: not yet achieved.
• Superhuman (outperforms 100% of humans). Superhuman Narrow AI: Stockfish; AlphaFold; AlphaZero. Artificial Superintelligence (ASI): not yet achieved.

Artificial General Intelligence

AGI can refer to generality, adaptability or be a shorthand for match-
ing human performance. As generally intelligent systems, AGIs can learn and
perform various tasks in various areas. On some interpretations, an AGI can reason,
learn, and respond well to new situations it has never encountered before. It can
even learn to generalize its strong performance to many domains without requiring
specialized training for each one: it could initially learn to play chess, then continue
to expand its knowledge and abilities by learning video games, diagnosing diseases,
or navigating a city. Some would extend this further and define AGIs as systems that
can apply their intelligence to nearly any real-world task, matching or surpassing
human cognitive abilities across many domains.

There is no single consensus definition of AGI. Constructing a precise
and detailed definition of AGI is challenging and often creates disagreement among
experts; for instance, some argue that an AGI must have a physical embodiment to
interact with the world, allowing it to cook a meal, move around, and see and interact
with objects. Others contend that a system could be generally intelligent without any
ability to physically interact with the world, as intelligence does not require a human-
like body. Some would say ChatGPT is an AGI because it is not narrow and is, in
many senses, general. Still, an AI that can interact physically may be more general
than a non-embodied system. This shows the difficulty of reaching a consensus on
the precise meaning of AGI.

Predicting AGI. Predicting when distinct AI capabilities will appear (often called
“AI timelines”) can also be challenging. Many once believed that AI systems would
master physical tasks before tackling “higher-level” cognitive tasks such as coding or
writing. However, some existing language model systems can write functional code
yet cannot perform physical tasks such as moving a ball. While there are many expla-
nations for this observation—cognitive tasks bypass the challenge of building robotic
bodies; domains like coding and writing benefit from abundant training data—this
is an example of the difficulties involved in predicting how AI will develop.

Risks and capabilities. Rather than debating whether a system meets the cri-
teria for being an AGI, evaluating a specific AI system’s capabilities is often more
helpful. Historical evidence and the unpredictability of AI development suggest that
AIs may be able to perform complicated tasks such as scientific research, hacking, or
synthesizing bioweapons before they can reliably automate all domestic chores. Some
highly relevant and dangerous capabilities may arrive long before others. Moreover,
we could have narrow AI systems that can teach anyone how to enrich uranium and
build nuclear weapons but cannot learn other tasks. These dangers show how AIs
can pose risks at many different levels of capabilities. With this in mind, instead
of simply asking about AGI (“When will AGI arrive?”), it might be more relevant
and productive to consider when AIs will be able to do particularly concerning tasks
(“When will this specific capability arrive?”).

Human-Level Artificial Intelligence

HLAI can do everything humans can do. HLAIs exist when machines can
perform approximately every task as well as human workers. Some definitions of
HLAI emphasize three conditions: first, that these systems can perform every task
humans can; second, they can do it at least as well as humans can; and third, they can
do it at a lower cost. If a smart AI is highly expensive, it may make economic sense
to continue to use human labor. If a smart AI took several minutes to think before
doing a task a human could do, its usefulness would have limitations. Like humans,
an HLAI system could hypothetically master a wide range of tasks, from cooking and
driving to advanced mathematics and creative writing. Unlike AGI, which on some
interpretations can perform some—but not all—the tasks humans can, an HLAI
can complete any conceivable human task. Notably, some reserve the term HLAI to
describe only cognitive tasks. Furthermore, evaluating whether a system is “human
level’ is fraught with biases. We are often biased to dismiss or underrate unfamiliar
forms of intelligence simply because they do not look or act like human intelligence.

Transformative AI
TAI refers to AI with societal impacts comparable to the Industrial Rev-
olution. The Industrial Revolution fundamentally altered the fabric of human life
globally, heralding an era of tremendous economic growth, increased life expectancy,
expanded energy generation, a surge in technological innovation, and monumental
social changes. Similarly, a TAI could catalyze dramatic changes in our world. The
focus here is not on the specific design or built-in capabilities of the AI itself but on
the consequences of the AI system for humans, our societies, and our economies.
Many kinds of AI systems could be transformative. It is conceivable that
some systems could be transformative while performing at capabilities below hu-
man level. To bring about dramatic change, AI does not need to mimic the powerful
systems of science fiction that behave indistinguishably from humans or surpass hu-
man reasoning. Computer systems that can perform tasks traditionally handled by
people (narrow AIs) could also be transformative by enabling inexpensive, scalable,
and clean energy production. Advanced AI systems could transform society without
reaching or exceeding human-level cognitive abilities, such as by allowing a wide array
of fundamental tasks to be performed at virtually zero cost. Conversely, some sys-
tems might only have transformative impacts long after reaching performance above
the human level. Even when some forms of AGI, HLAI, or ASI are available, the
technology might take time to diffuse widely, and its economic impacts may come
years afterward, creating a diffusion lag.

Artificial Superintelligence
ASI refers to AI that surpasses human performance in virtually all
domains of interest [121]. A system with this set of capabilities could have
immense practical applications, including advanced problem-solving, automation of
complex tasks, and scientific discovery. However, it should be noted that surpassing
humans on only some capabilities does not make an AI superintelligent—a calculator
is superhuman at arithmetic, but not a superintelligence.
Risks of superintelligence. The risks associated with superintelligence are sub-
stantial. ASIs could be harder to control and even pose existential threats—risks to
the survival of humanity. That said, an AI system need not be superintelligent to
be dangerous. An AGI, HLAI, or narrow AI could all pose severe risks to human-
ity. These systems may vary in intelligence across different tasks and domains, but
they can be dangerous at many levels of intelligence and generality. If a narrow AI
is superhuman at a specific dangerous task like synthesizing viruses, it could be an
extraordinary hazard for humanity.

Superintelligence is not omnipotence. Separately, we should not assume that
superintelligence must be omnipotent or omniscient. Superintelligence does not mean
that an AI can instantly predict how events worldwide will unfold in the far future,
nor that the system can completely predict the actions of all other agents with per-
fect accuracy. Likewise, it does not mean that the ASI could instantly overpower
humanity. Moreover, many problems cannot be solved by intelligence or contem-
plation alone; research and development require real-world experimentation, which
involves physical-world processes that take a long time, presenting a key constraint
to AIs’ ability to influence the world. However, we know very little about what a
system that is significantly smarter than humans could do. Therefore, it is difficult
to make confident claims about superintelligence.

Superhuman performance in narrow areas is not the same as superin-
telligence. A superintelligent AI would significantly outstrip human capabilities,
potentially solving problems and making discoveries beyond our comprehension. Of
course, this is not exclusive to superintelligence: even narrow AIs solve problems
humans find difficult to understand. AlphaFold, for instance, astonished scientists
by predicting the 3D structure of proteins—a complex problem that stumped bio-
chemists for decades. Ultimately, a superintelligence exceeds these other types of
AI because of the breadth of the cognitive tasks in which it achieves superhuman
performance.


Summary

This section provided an introduction to AI, the broad umbrella that encompasses the
area of computer science focused on creating machines that perform tasks typically
associated with human intelligence. First, we discussed the nuances and difficulties
of defining AI and detailed its history. Then, we explored AI systems in more de-
tail and how they are often categorized into different types. Of these, we surveyed
five commonly used terms—narrow AI, HLAI, AGI, TAI, and superintelligence—and
highlighted some of their ambiguities. Considering specific capabilities and individual
systems rather than broad categories or abstractions is often more informative.

Next, we will narrow our focus to ML, an approach within AI that emphasizes the
development of systems that can learn from data. Whereas many classical approaches
to AI relied on logical rules and formal, structured knowledge, ML systems use pattern
recognition to extract information from data.

2.2.3 Machine Learning


Overview and Definition

ML is a subfield of AI that focuses on developing computer systems that can learn
directly from data without following explicit pre-set instructions [136, 137]. It accom-
plishes this by creating computational models that discern patterns and correlations
within data. The knowledge encoded in these models allows them to inform decision-
making or to reason about and act in the world. For instance, an email spam filter
uses ML to improve its ability to distinguish spam from legitimate emails as it sees
more examples. ML is the engine behind most modern AI applications, from per-
sonalized recommendations on streaming services to autonomous vehicles. One of the
most popular and influential algorithmic techniques for ML applications is DL, which
uses deep neural networks to process data.

Machine learning algorithms. An algorithm is a recipe for getting something
done—a procedure for solving a problem or accomplishing a task, often expressed
in a precise programming language. ML models are algorithms designed to learn
from data by identifying patterns and relationships, which enables them to make
predictions or decisions as they process new inputs. They often learn from information
called training data. What makes ML models different from other algorithms is that
they automatically learn patterns in data without explicit task-specific instructions.
Instead, they identify correlations, dependencies, or relationships in the data and use
this information to make predictions or decisions about new data; for instance, a
content curation application may use ML algorithms to refine its recommendations.

Benefits of ML. One of the key benefits of ML is its ability to automate com-
plicated tasks, enabling humans to focus on other activities. Developers use ML for
applications from medical diagnosis and autonomous vehicles to financial forecasting
and writing. ML is becoming increasingly important for businesses, governments, and
other organizations to stay competitive and make empirically informed decisions.

Guidelines for understanding ML models. ML models can be intricate and
varied, making understanding their characteristics and distinctions a challenge. It
can be helpful to focus on key high-level aspects that almost all ML systems have:
• General Task: What is the primary goal of the ML model? We design
models to achieve objectives. Some example tasks are predicting housing prices,
generating images or text, or devising strategies to win a game.
• Inputs: What data does the ML system receive? This is the information
that the model processes to deliver its results.

• Outputs: What does the ML system produce? The model generates these
results, predictions, or decisions based on the input data.
• Type of Machine Learning: What technique is used to accomplish the
task? This describes how the model converts its inputs into outputs (called infer-
ence), and learns the best way to convert its inputs into outputs (a learning process
called training). An ML system can be categorized by how it uses training data,
what type of output it generates, and how it reaches results.

The rest of this section delves deeper into these aspects of ML systems.

Key ML Tasks

In this section, we will explore four fundamental ML tasks—classification, regression,
anomaly detection, and sequence modeling—that describe different problems or types
of problems that ML models are designed to solve.

Classification
Classification is predicting categories or classes. In classification tasks,
models use characteristics or features of an input data point (example) to deter-
mine which specific category the data point belongs to. In medical diagnostics, a
classification model might predict whether a tumor is cancerous or benign based on
features such as a patient’s age, tumor size, and tobacco use. This is an example of
binary classification—the special case in which models predict one of two categories.
Multi-class classification, on the other hand, involves predicting one of multiple cat-
egories. An image classification model might classify an image as belonging to one of
multiple different classes such as dog, cat, hat, or ice cream. Computer vision often
applies these methods to enable computers to interpret and understand visual data
from the world. Classification is categorization: it involves putting data points into
buckets.
The sigmoid function produces probabilistic outputs. A sigmoid is one of
several mathematical functions used in classification to transform general real num-
bers into values between 0 and 1. Suppose we wanted to predict the likelihood that
a student will pass an exam or that a prospective borrower will default on a loan.
The sigmoid function is instrumental in settings like these—problems that rely on
computing probabilities. As a further example, in binary classification, one might
use a function like the sigmoid to estimate the likelihood that a customer makes a
purchase or clicks on an advertisement. However, it is important to note that other
widely used models can provide similar probabilistic outputs without employing a
sigmoid function.
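Concretely, the sigmoid function is sigmoid(x) = 1 / (1 + e^(-x)). The sketch below applies it to a made-up “hours studied” feature with invented coefficients, turning a real-valued score into a probability of passing; the numbers are illustrative rather than fitted to any data.

    import math

    def sigmoid(x: float) -> float:
        # Squashes any real number into the range (0, 1).
        return 1.0 / (1.0 + math.exp(-x))

    # Invented coefficients for illustration: the score rises with hours studied.
    weight, bias = 1.2, -3.0
    for hours in range(6):
        score = weight * hours + bias
        print(hours, "hours ->", round(sigmoid(score), 2), "probability of passing")

With these invented numbers, zero hours of study maps to a pass probability of about 0.05 and five hours to about 0.95, tracing out the S-shaped curve that gives the function its name.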

Figure 2.1. ML models can classify data into different categories.

Regression
Regression is predicting numbers. In regression tasks, models use features of input
data to predict numerical outputs. A real estate company might use a regression
model to predict house prices from a dataset with features such as location,
square footage, and number of bedrooms. While classification models produce discrete
outputs that place inputs into a finite set of categories, regression models produce
continuous outputs that can assume any value within a range. Therefore, regression
is predicting a continuous output variable based on one or more input variables. Re-
gression is estimation: it involves guessing what a feature of a data point will be given
the rest of its characteristics.

Linear regression. One type of regression, linear regression, assumes a linear
relationship between features and predicted values for a target variable. A linear
relationship means that the output changes at a constant rate with respect to the
input variables, such that plotting the input-output relationship on a graph forms a
straight line. Linear regression models are often helpful but have many limitations;
for instance, their assumption that the features and the target variable are linearly
related is often false. In general, linear regression struggles to model complicated real-world data patterns, since a linear model is only as expressive as its input features and cannot capture additional structure on its own.
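As a minimal sketch, the code below fits a line of the form y ≈ w * x + b to a few invented (precipitation, umbrellas sold) pairs using scikit-learn; the data points are made up for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented data: millimeters of rain versus umbrellas sold that day.
    rainfall = np.array([[50], [60], [70], [80], [90], [100], [110]])
    umbrellas = np.array([18, 22, 27, 30, 35, 41, 44])

    model = LinearRegression().fit(rainfall, umbrellas)
    print("slope:", model.coef_[0], "intercept:", model.intercept_)
    print("predicted sales at 95 mm of rain:", model.predict([[95]])[0])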

Anomaly Detection
Anomaly detection is the identification of outliers or abnormal data
points [138]. Anomaly detection is vital in identifying hazards, including unex-
pected inputs, attempted cyberattacks, sudden behavioral shifts, and unanticipated
failures. Early detection of anomalies can substantially improve the performance of
models in real-world situations.
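As a minimal illustration, the sketch below flags an unusually high reading in an invented stream of hourly electricity-consumption values using a simple z-score rule. Real deployments often use learned models instead, but the underlying idea of scoring how far a point lies from the norm is similar.

    import statistics

    # Invented hourly electricity readings (kWh); one value is clearly abnormal.
    readings = [20, 21, 19, 22, 20, 23, 21, 20, 95, 22, 21, 20]

    mean = statistics.mean(readings)
    stdev = statistics.stdev(readings)
    for hour, value in enumerate(readings):
        z = (value - mean) / stdev
        if abs(z) > 3:  # flag points more than three standard deviations from the mean
            print(f"hour {hour}: {value} kWh flagged as anomalous (z = {z:.1f})")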

Black swan detection is an essential problem within anomaly detection.
Black swans are unpredictable and rare events with a significant impact on the
broader world. These events are difficult to predict because they may not have hap-
pened before, so they are not represented in the training data that ML models use
to extrapolate the future. Due to their extreme and uncommon nature, such events
make anomaly detection challenging. In section 4.7 in the Safety Engineering chapter,
we discuss these ideas in more detail.

Figure 2.2. Binary classification can use a sigmoid function to turn real numbers (such
as hours of studying) into probabilities between zero and one (such as the probability of
passing).

Sequence Modeling
Sequence modeling is analyzing and predicting patterns in sequential
data. Sequence modeling is a broadly defined task that involves processing or pre-
dicting data where temporal or sequential order matters. It may be applied to time-
series data or natural language text to capture dependencies between items in the
sequence to forecast future elements. An integral part of this process is representation
learning, where models learn to convert raw data into more informative formats for
the task at hand. Language models use these techniques to predict subsequent words
in a sequence, transforming previous words into meaningful representations to detect
patterns and make predictions. There are several major subtypes of sequence mod-
eling. Here, we will discuss two: generative modeling and sequential decision-making.
Generative modeling. Generative modeling is a subtype of sequence modeling that
creates new data that resembles the input data, thereby drawing from the same dis-
tribution of features (conditioned on specific inputs). It can generate new outputs
from many input types, such as text, code, images, and protein sequences.
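For intuition, the toy sketch below builds a character-level bigram model from a short invented string and samples new text from it. It illustrates only the core idea of predicting the next element from the previous one; modern generative models use deep neural networks trained on vastly more data.

    import random
    from collections import defaultdict

    random.seed(0)
    text = "the cat sat on the mat. the cat ran."

    # Record, for each character, which characters follow it in the text.
    following = defaultdict(list)
    for current, nxt in zip(text, text[1:]):
        following[current].append(nxt)

    # Generate new text one character at a time by sampling from those records.
    character = "t"
    generated = [character]
    for _ in range(40):
        character = random.choice(following[character])
        generated.append(character)
    print("".join(generated))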
Sequential decision-making (SDM). SDM equips a model with the capability to make
informed choices over time, considering the dynamic and uncertain nature of real-
world environments. An essential feature of SDM is that prior decisions can shape
later ones. Related to SDM is reinforcement learning (RL), where a model learns to
make decisions by interacting with its environment and receiving feedback through
rewards or penalties. An example of SDM in complex, real-world tasks is a robot
performing a sequence of actions based on its current understanding of the environment.

Figure 2.3. This linear regression model is the best linear predictor of an output (umbrellas
sold) using only information from the input (precipitation).

Types of Input Data

In ML, a modality refers to how data is collected or represented—the type of input


data. Some models, such as image recognition models, use only one type of input
data. In contrast, multimodal systems integrate information from multiple modalities
(such as images and text) to improve the performance of learning-based approaches.
Humans are naturally multimodal, as we experience the world by seeing objects,
hearing sounds, feeling textures, smelling odors, tasting flavors, and more.

Figure 2.4. The first graph ("Detecting Bots") shows the detection of atypical user activity,
plotting account activity (queries) against account age (days) for bot and human accounts.
The second graph ("Detecting Unusual Energy Usage") shows the detection of unusually high
electricity consumption (kWh) over time (hours). In both cases, the model detects anoma-
lies [139].

Below, we briefly describe the significant modalities in ML. However, this list is not
exhaustive. Many specific types of inputs, such as data from physical sensors, fMRI
scans, topographic maps, and so on, do not fit easily into this categorization.
• Tabular data: Structured data is stored in rows and columns, usually with each
row corresponding to an observation and each column representing a variable in
the dataset. An example is a spreadsheet of customer purchase histories.
• Text data: Unstructured textual data in natural language, code, or other formats.
An example is a collection of posts and comments from an online forum.
• Image data: Digital representations of visual information that can train ML mod-
els to classify images, segment images, or perform other tasks. An example is a
database of plant leaf images for identifying species of plants.
• Video data: A sequence of visual information over time that can train ML models
to recognize actions, gestures, or objects in the footage. An example is a collection
of sports videos for analyzing player movements.
• Audio data: Sound recordings, such as speech or music. An example is a set of
voice recordings for training speech recognition models.
• Time-series data: Data collected over time that represents a sequence of obser-
vations or events. An example is historical stock price data.
• Graph data: Data representing a network or graph structure, such as social net-
works or road networks. An example is a graph that represents user connections
in a social network.
• Set-valued data: Unstructured data in the form of collections of features or input
vectors. An example is point clouds obtained from LiDAR sensors.

Components of the ML Pipeline

An ML pipeline is a series of interconnected steps in developing an ML model, from


training it on data to deploying it in the real world. Next, we will examine these steps
in turn.

Data collection. The first step in building an ML model is data collection. Data
can be collected in various ways, such as by purchasing datasets from owners of data
or scraping data from the web. The foundation of any ML model is the dataset used
to train it: the quality and quantity of data are essential for accurate predictions and
performance.

Selecting features and labels. After the data is collected, developers of ML


models must choose what they want the model to do and what information to use. In
ML, a feature is a specific and measurable part of the data used to make predictions
or classifications. Most ML models focus on prediction. When predicting the price
of a house, features might include the number of bedrooms, square footage, and the
age of the house. Part of creating an ML model is selecting, transforming, or creating
the most relevant features for the problem. The quality and type of features can

significantly impact the model’s performance, making it more or less accurate and
efficient.
ML aims to predict labels. A label (or a target) is the value we want to predict
or estimate using the features. Labels in training data are only present in supervised
ML tasks, discussed later in this section. Some models are trained on examples with correct
labels, which teach the model what output corresponds to a given set of input features: a
model could use historical data on housing prices to learn how prices are related to features
like square footage. However, other (unsupervised) ML models learn to make predictions
using unlabeled input data—without knowing the correct answers—by identifying
patterns instead.
Choosing an ML architecture. After ML model developers have collected the
data and chosen a task, they can process it. An ML architecture refers to a model’s
overall structure and design. It can include the type and configuration of the algorithm
used and the arrangement of input and output layers. The architecture of an ML
model shapes how it learns from data, identifies patterns, and makes predictions or
decisions.
ML models have parameters. Within an architecture, parameters are ad-
justable values within the model that influence its performance. In the house pricing
example, parameters might include the weights assigned to different features of a
house, like its size or location. During training, the model adjusts these weights, or
parameters, to minimize the difference between its predicted house prices and the
actual prices. The optimal set of parameters enables the model to make the best
possible predictions for unseen data, generalizing from the training dataset.
Training and using the ML model. Once developers have built the model
and collected all necessary data, they can begin training and applying it. ML model
training is adjusting a model’s parameters based on a dataset, enabling it to recognize
patterns and make predictions. During training, the model learns from the provided
data and modifies its parameters to minimize errors.
Model performance can be evaluated. Model evaluation measures the perfor-
mance of the trained model by testing it on data the model has never encountered
before. Evaluating the model on unseen data helps assess its generalizability and
suitability for the intended problem. For example, we might test whether a housing-price
model can predict prices in a new country that lies beyond its original training data.
Once ready, models are deployed. Finally, once the model is trained and eval-
uated, it can be deployed in real-world applications. ML deployment involves inte-
grating the model into a larger system, using it, and then maintaining or updating it
as needed.
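To make these steps concrete, the sketch below walks through a toy version of the pipeline
with scikit-learn; the housing data is invented, and a real pipeline would involve far more
data, feature engineering, and validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Data collection: a tiny, hypothetical housing dataset.
# Features: [number of bedrooms, square footage, age of house in years].
X = np.array([[2,  800, 30], [3, 1200, 20], [3, 1500, 15],
              [4, 2000, 10], [4, 2400,  5], [5, 3000,  2]])
# Labels: sale price in thousands of dollars.
y = np.array([150, 220, 260, 340, 400, 500])

# Hold out unseen data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Choose a model (here, plain linear regression) and train it: fitting
# adjusts the model's parameters to minimize prediction error.
model = LinearRegression().fit(X_train, y_train)

# Evaluation on held-out data estimates how well the model generalizes.
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))

# Deployment: the trained model is used inside a larger system.
print("predicted price:", model.predict([[3, 1400, 12]]))
```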

Evaluating ML Models

Evaluation is a crucial step in model development. When developing an ML model,


it is essential to understand its performance. Evaluation—the process of measuring

Figure 2.5. A confusion matrix shows the four possible outcomes from a prediction: true
positive, false positive, false negative, and true negative.

the performance of a trained model on new, unseen data—provides insight into how
well the model has learned. We can use different metrics to understand a model’s
strengths, weaknesses, and potential for real-world applications. These quantitative
performance measures are part of a broader context of goals and values that inform
how we can assess the quality of a model.

Metrics
Accuracy is a measure of the overall performance of a classification
model. Accuracy is defined as the percentage of correct predictions:
Accuracy = (# of correct predictions) / (# of total predictions).
Accuracy can be misleading if there is an imbalance in the number of examples of
each class. For instance, if 95% of emails received are not spam, a classifier assigning
all emails to the "not spam" category could achieve 95% accuracy. Accuracy applies only
when there is a well-defined sense of right and wrong; regression models, by contrast,
focus on minimizing the error in their predictions.

Confusion matrices summarize the performance of classification algo-


rithms. A confusion matrix is an evaluative tool for displaying different prediction
errors. It is a table that compares a model’s predicted values with the actual values.
For example, the performance of a binary classifier can be represented by a 2 × 2
confusion matrix, as shown in Figure 2.5. In this context, when making predictions,
there are four possible outcomes:
1. True positive (TP): A true positive is a correct prediction of the positive class.
2. False positive (FP): A false positive is an incorrect prediction of the positive
class, predicting positive instead of negative.

3. True negative (TN): A true negative is a correct prediction of the negative


class.
4. False negative (FN): A false negative is an incorrect prediction of the negative
class, predicting negative instead of positive.

Since each prediction must be in one of these categories, the number of total predic-
tions will be the sum of the number of predictions in each category. The number of
correct predictions will be the sum of true positives and true negatives. Therefore,
Accuracy = (TP + TN) / (TP + TN + FP + FN).

False positives vs. false negatives. The impact of false positives and false neg-
atives can vary greatly depending on the setting. Which metric to choose depends
on the specific context and the error types one most wants to avoid. In cancer de-
tection, while a false positive (incorrectly identifying cancer in a cancer-free patient)
may cause emotional distress, unnecessary further testing, and potentially invasive
procedures for the patient, a false negative can be much more dangerous: it may
delay diagnosis and treatment that allows cancer to progress, reducing the patient’s
chances of survival. By contrast, an autonomous vehicle with a water sensor that
senses roads are wet when they are dry (predicting false positives) might slow down
and drive more cautiously, causing delays and inconvenience, but one that senses the
road is dry when it is wet (false negatives) might end up in serious road accidents
and cause fatalities.
While accuracy assigns equal cost to false positives and false negatives, other metrics
isolate one or weigh the two differently and might be more appropriate in some
settings. Precision and recall are two standard metrics that measure the extent of
the error attributable to false positives and false negatives, respectively.
Precision measures the correctness of a model’s positive predictions. This metric rep-
resents the fraction of positive predictions that are actually correct. It is calculated
as TP / (TP + FP), dividing true positives (hits) by the sum of true positives and false posi-
tives. High precision implies that when a model predicts a positive class, it is usually
correct—but it might incorrectly classify many positives as negatives as well. Preci-
sion is like the model’s aim: when the system says it hit, how often is it right?
Recall measures a model’s breadth. On the other hand, recall measures how good a
model is at finding all of the positive examples available. It is like the model’s net: how
many real positives does it catch? It is calculated as TP / (TP + FN), signifying the fraction
of real positives that the model successfully detected. High recall means a model is
good at recognizing or “recalling” positive instances, but not necessarily that these
predictions are accurate. Therefore, a model with high recall may incorrectly classify
many negatives as positives.
In simple terms, precision is about a model being right when it makes a guess, and
recall is about the model finding as many of the right answers as possible. Together,
these two metrics provide a way to quantify how accurately and effectively a model

Figure 2.6. Precision measures the correctness of positive predictions and penalizes false
positives, while recall measures how many positives are detected and penalizes false nega-
tives [140].

can detect positive examples. Moreover, there is typically a trade-off between precision and
recall: for a given model, adjusting it to increase precision tends to decrease recall and vice
versa.
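These metrics can be computed directly from the four confusion-matrix counts; the counts
below are hypothetical.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, tn, fn = 40, 10, 930, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision = tp / (tp + fp)                   # how often positive predictions are right
recall = tp / (tp + fn)                      # how many real positives are found

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
# accuracy=0.970, precision=0.800, recall=0.667
```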

AUROC scores measure a model’s discernment. The AUROC (Area Under


the Receiver Operating Characteristic) score measures how well a classification model
can distinguish between different classes. The ROC curve shows the performance of
a classification model by plotting the rate of true positives against false positives as
thresholds in a model are changed. AUROC scores range from zero to one, where
a score of 50% indicates random-chance performance and 100% indicates perfect
performance. To determine whether examples are positive (belong to a certain class)
or negative (do not belong to a certain class), a classification model will assign a score
to each example and compare that score to a threshold or benchmark value. We can
interpret the AUROC as the probability that a positive example scores higher than
a negative example.
Since it considers performance at all possible decision thresholds, the AUROC is
useful for comparing the performance of different classifiers. The AUROC is also
helpful in cases of imbalanced data, as it does not depend on the ratio of positive to
negative examples.
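The probabilistic interpretation above suggests a direct, if inefficient, way to compute the
AUROC: compare every positive example's score with every negative example's score. The
scores below are made up for illustration.

```python
import numpy as np

# Hypothetical classifier scores for positive and negative examples.
pos_scores = np.array([0.9, 0.8, 0.6, 0.55])
neg_scores = np.array([0.7, 0.4, 0.3, 0.2, 0.1])

# AUROC = probability that a randomly chosen positive example receives a
# higher score than a randomly chosen negative one (ties count as half).
comparisons = pos_scores[:, None] - neg_scores[None, :]
auroc = np.mean(comparisons > 0) + 0.5 * np.mean(comparisons == 0)
print(f"AUROC = {auroc:.3f}")  # 0.900 for these scores
```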

Mean squared error (MSE) quantifies how “wrong” a model’s predictions


are. Mean squared error is a valuable and popular metric of prediction error. It is
found by taking the average of the squared differences between the model’s predictions
and the labels, thereby ensuring that positive and negative deviations from the truth

Figure 2.7. The area under the ROC curve (AUROC) increases as it moves in the upper left
direction, with more true positives and fewer false positives [141].

are penalized the same and that larger mistakes are penalized heavily. The MSE is
the most popular loss function for regression problems.
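With illustrative numbers, the MSE calculation looks like the following sketch; the predicted
and actual prices are made up.

```python
import numpy as np

# Hypothetical house-price predictions vs. actual prices (in thousands).
predicted = np.array([210, 305, 398, 150])
actual = np.array([200, 320, 400, 175])

# Square each error so over- and under-predictions count equally and large
# mistakes are penalized heavily, then average.
mse = np.mean((predicted - actual) ** 2)
print(mse)  # (10^2 + 15^2 + 2^2 + 25^2) / 4 = 238.5
```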

Reasonably vs. reliably solved. The distinction between reasonable and reliable
solutions can be instrumental in developing an ML model, evaluating its perfor-
mance, and thinking about tradeoffs between goals. A task is reasonably solved if a
model performs well enough to be helpful in practice, but it may still have consistent
limitations or make errors. A task is reliably solved if a model achieves sufficiently
high accuracy and consistency for safety-critical applications. While models that rea-
sonably solve problems may be sufficient in some settings, they may cause harm in
others. Chatbots currently give reasonable but not reliable results; their errors can be
frustrating but are essentially harmless. However, if autonomous vehicles give reasonable
but not reliable results, people's lives are at stake.

Goals and tradeoffs. Above and beyond quantitative performance measures are
multiple goals and values that influence how we can assess the quality of an ML
model. These goals—and the tradeoffs that often arise between them—shape how
models are judged and developed. One such goal is predictive power, which measures
the amount of error in predictions. Inference time (or latency) measures how quickly
an ML model can produce results from input data—in many applications, such as self-
driving cars, prediction speed is crucial. Transparency refers to the interpretability
of an ML model’s inner workings and how well humans can understand its decision-
making process. Reliability assesses the consistency of a model’s performance over
time and in varying conditions. Scalability is the capacity of a model to maintain or
improve its performance as a key variable—compute, parameter count, dataset size,
and so on—scales.
Sometimes, these goals are in opposition, and improvements in one area can come
at the cost of declines in others. Therefore, developing an ML model requires careful
consideration of multiple competing goals.

2.2.4 Types of Machine Learning

One key dimension along which ML approaches vary is the degree of


supervision. We can divide ML approaches into groups based on how they use the
training data and what they produce as an output. In ML, supervision is the process
of guiding a model’s learning with some kind of label. The model uses this label as a
kind of ground truth or gold standard: a signal that can be trusted as accurate and
used to supervise the model to achieve the intended results better. Labels can allow
the model to capture relationships between inputs and their corresponding outputs
more effectively. Supervision is often vital to help models learn patterns and predict
new, unseen data accurately.
There are distinct approaches in ML for dealing with different amounts of supervision.
Here, we will explore three key approaches: supervised, unsupervised, and reinforce-
ment learning. We will also discuss DL, a set of techniques that can be applied in
any of these settings.

Supervised Learning
Supervised learning is learning from labeled data. Supervised learning is a
type of ML that uses a labeled dataset to learn the relationship between input data
and output labels. These labels are almost always human-generated: people will go
through examples in a dataset and give each one a label. They might be shown pic-
tures of dogs and asked to label the breed. The training process involves iteratively
adjusting the model’s parameters to minimize the difference between predicted out-
puts and the true output labels in the training data. Once trained, the model can
predict new, unlabeled data.

Figure 2.8. The three main types of learning paradigms in machine learning are supervised
learning, unsupervised learning, and reinforcement learning.

Examples of supervised learning. Some examples of these labeled inputs and


outputs include mapping a photo of a plant to its species, a song to its genre, or an
email to either “spam” or “not spam.” A computer can use a set of dog pictures labeled
by humans to predict the breed of any dog in any given image. Supervised learning
is analogous to a practice book, which offers a student a series of questions (inputs)
and then provides the answers (outputs) at the end of the book. This book can help
the student (like an ML model) find the correct answers when given new questions.
Without instruction or guidance, the student must learn to answer questions correctly
by reviewing the problems and checking their answers. Over time, they learn and
improve through this checking process.

Advantages and disadvantages. Supervised learning can excel in classification


and regression tasks. Furthermore, it can result in high accuracy and reliable predic-
tions when given large, labeled datasets with well-defined features and target vari-
ables. However, this method performs poorly on more loosely defined tasks, such as
generating poems or new images. Supervision may also require manual labeling for
the training process, which can be prohibitively time-consuming and costly. Criti-
cally, supervised learning is bottlenecked by the amount of labeled data, which often
means less data is available than when using unsupervised learning.

Unsupervised Learning
Unsupervised learning is learning from unlabeled data. Unsupervised
learning involves training a model on a dataset without specific output labels. In-
stead of matching its inputs to the correct labels, the model must identify patterns
within the data to help it understand the underlying relationships between the vari-
ables. As no labels are provided, a model is left to its own devices to discover valuable
patterns in the data. In some cases, a model leverages these patterns to generate su-
pervisory signals, guiding its own training. For this reason, unsupervised learning can
also be called self-supervised learning.

Examples of unsupervised learning. Language models use unsupervised learn-


ing to learn patterns in language using large datasets of unlabeled text. LLMs often

learn to predict the next word in a sentence, which enables the models to understand
context and language structure without explicit labels like word definitions and gram-
mar instructions. After a model trains on this task, it can apply what it learned to
downstream tasks like answering questions or summarizing texts.

Figure 2.9. In image inpainting, models are trained to predict hidden parts of images, causing
them to learn relationships between pixels [142].
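As a toy sketch of how unlabeled text can supply its own supervision, a next-word prediction
dataset can be built by sliding a window over raw text; the snippet and the whitespace
tokenization below are deliberately simplistic.

```python
# Raw, unlabeled text supplies its own labels: each word is predicted
# from the words that precede it.
text = "the cat sat on the mat"
tokens = text.split()

context_size = 2
examples = []
for i in range(context_size, len(tokens)):
    context = tokens[i - context_size:i]   # input: the previous words
    target = tokens[i]                     # label: the next word
    examples.append((context, target))

print(examples)
# [(['the', 'cat'], 'sat'), (['cat', 'sat'], 'on'),
#  (['sat', 'on'], 'the'), (['on', 'the'], 'mat')]
```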

ML models exist on a spectrum of supervision. Unsupervised and super-


vised learning are valuable concepts for thinking about ML models, not a dichotomy
with a clear dividing line. Therefore, ML models are on a continuum of supervision,
from datasets with clear labels for every data point on one extreme to datasets with
no labels on the other. In between lies partial or weak supervision, which provides
incomplete or noisy labels such as hashtags loosely describing features of images. This
is analogous to a practice book where some solution pages are excessively brief, have
errors, or are omitted entirely.

We can reframe many tasks into different types of ML. Anomaly detec-
tion is typically framed as an unsupervised task that identifies unusual data points
without labels. However, it can be refashioned as a supervised classification problem,
such as labeling financial transactions as “fraudulent” or “not fraudulent.” Similarly,
while stock price prediction is usually approached as a supervised regression task,
it could be reframed as a classification task in which a model predicts whether a
stock price will increase or decrease. The choice in framing depends on the task’s
specific requirements, the data available, and which frame gives a more useful model.
Ultimately, this flexibility allows us to better cater to our goals and problems.

Reinforcement Learning
Reinforcement learning (RL) is learning from agent-gathered data.
Reinforcement learning focuses on training artificial agents to make decisions and

improve their performance based on responses from their environment. It assumes


that tasks can be modeled as goals to be achieved by an agent maximizing rewards
from its environment. RL is distinctive since it does not require pre-collected data, as
the agent can begin with no information and interact with its environment to learn
new things and acquire new data.

Examples of RL. RL can help robots learn to navigate an unknown environment


by taking actions and receiving feedback in the form of rewards or penalties based on
performance. Through trial and error, agents learn to make better decisions and max-
imize rewards by adjusting their actions or improving their model of the environment.
They refine their strategies based on the consequences of their actions. RL enables agents
to learn techniques and decision-making skills through interaction with their environment,
allowing them to adapt to dynamic and uncertain situations. However, it requires
a well-designed reward function and can be computationally expensive, especially for
complex environments with many possibilities for states and actions.
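The sketch below shows the flavor of this trial-and-error loop with tabular Q-learning on a
tiny, made-up corridor environment; the environment, reward values, and hyperparameters
are all illustrative, and real RL systems are far more elaborate.

```python
import random

# A tiny, hypothetical environment: a 1-D corridor of 5 cells. The agent
# starts in cell 0 and is rewarded only upon reaching cell 4; every other
# step costs a small penalty.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left or right

def step(state, action):
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else -0.01
    return next_state, reward, next_state == GOAL

# Tabular Q-learning with an epsilon-greedy policy.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for _ in range(200):  # episodes of interaction with the environment
    state, done = 0, False
    while not done:
        # Explore occasionally; otherwise act greedily on current estimates.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Update the estimate toward reward + discounted future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy moves right toward the goal,
# e.g., [1, 1, 1, 1] for the non-terminal cells.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])
```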

Deep Learning
DL is a set of techniques that can be used in many learning settings.
DL uses neural networks with many layers to create models that can learn from large
datasets. Neural networks are the building blocks of DL models and use layers of
interconnected nodes to transform inputs into outputs. The structure and function
of biological neurons loosely inspired their design. DL is not a new distinct learning
type but rather a computational approach that can accomplish any of the three types
of learning discussed above. It is most applicable to unsupervised learning tasks as it
can perform well without any labels; for instance, a deep neural network trained for
object recognition in images can learn to identify patterns in the raw pixel data.

Advantages and challenges in DL. DL excels in handling high-dimensional


and complex data, providing critical capabilities in image recognition, natural lan-
guage processing, and generative modeling. In ML, dimensionality denotes the num-
ber of features or variables in the data, each representing a unique dimension. High-
dimensional data has many features, as in image recognition, where each pixel can be
a feature. However, DL also requires vast data and substantial computational power.
Moreover, the models can be challenging to interpret.

Conclusion

AI is one of the most impactful and rapidly developing fields of computer science. AI
involves developing computer systems that can perform tasks that typically require
human intelligence, from visual perception to decision-making.
ML is an approach to AI that involves developing models that can learn from data
to perform tasks without being explicitly programmed. A robust approach to under-
standing any ML model is breaking it down into its fundamental components: the
task, the input data, the output, and the type of ML it uses. Different approaches to
ML offer various ways to tackle complex tasks and solve real-world problems. DL is

a powerful and popular method that uses many-layered neural networks to identify
intricate patterns in large datasets. The following section will delve deeper into DL
and its applications in AI.

2.3 DEEP LEARNING

Introduction

In this section, we present the fundamentals of DL, a branch of ML that uses neural
networks to learn from data and perform complex tasks [143]. First, we will consider
the essential building blocks of DL models and explore how they learn. Then, we
will discuss the history of critical architectures and see how the field developed over
time. Finally, we will explore how DL is reshaping our world by reviewing a few of
its groundbreaking applications.

Why DL Matters

DL is a remarkably useful, powerful, and scalable technology that has been the pri-
mary source of progress in ML since the early 2010s. DL methods have dramatically
advanced the state-of-the-art in computer vision, speech recognition, natural language
processing, drug discovery, and many other areas.

Performance. Some DL models have demonstrated better-than-human perfor-


mance in specific tasks, though unreliably. These models have excelled in tasks such
as complex image recognition and outmatched world experts in chess, Go, and chal-
lenging video games such as StarCraft. However, their victories are far from compre-
hensive or absolute. Model performance is variable, and DL models sometimes make
errors or misclassifications obvious to a human observer. Therefore, despite their im-
pressive accomplishments in specific tasks, DL models have yet to consistently surpass
human intelligence or capabilities across all tasks or domains.

Real-world usefulness. Beyond games and academia, DL techniques have proven


useful in a wide variety of real-world applications. They are increasingly integrated
into everyday life, from healthcare and social media to chatbots and autonomous
vehicles. DL can generate product recommendations, predict energy load in a power
grid, fly a drone, or create original works of art.

Scalability. DL models are highly scalable and positioned to continue to advance


in capability as data, hardware, and training techniques progress. A key strength of
these models is their ability to process and learn from increasingly large amounts of
data. Many traditional ML algorithms’ performance gains taper off with additional
data; by contrast, the performance of DL models improves faster and longer.

Defining DL

DL is a type of ML that uses neural networks with many layers to learn and ex-
tract useful patterns from large datasets. It is characterized by its ability to learn
hierarchical representations of the world.

ML/DL distinction. As we saw in the previous section, ML is the field of study


that aims to give computers the ability to learn without explicitly being programmed.
DL is a highly adaptable and remarkably effective approach to ML. DL techniques
are employed in and represent the cutting edge of many areas of ML, including
supervised, unsupervised, and reinforcement learning.

Figure 2.10. Machine learning is a type of artificial intelligence. Supervised, unsupervised,


and reinforcement learning are common machine learning paradigms. DL is a set of tech-
niques that have proven useful for a variety of ML problems.

Automatically learned representations. Representations are, in general,


stand-ins or substitutes for the objects they represent. For example, the word “air-
plane” is a simple representation of a complex object. Similarly, ML systems use
representations of data to complete tasks. Ideally, these representations are distilla-
tions that capture all essential elements or features of the data without extraneous
information. While many traditional ML algorithms build representations from fea-
tures hand-picked and engineered by humans, features are learned in DL. The primary
objective of DL is to enable models to learn useful features and meaningful represen-
tations from data. These representations, which capture the underlying patterns and
structure of the data, form the base on which a model solves problems. Therefore,
model performance is directly related to representation quality. The more insightful
and informative a model’s representations are, the better it can complete tasks. Thus,
the key to DL is learning good representations.

Hierarchical representations. DL models represent the world as a nested hi-


erarchy of concepts or features. In this hierarchy, features build on one another to
capture progressively more abstract features. Higher-level representations are defined
by and computed in terms of simpler ones. In object detection, for example, a model
may learn first to recognize edges, then corners and contours, and finally parts of
objects. Each set of features builds upon those that precede it:
1. Edges are (usually) readily apparent in raw pixel data.
2. Corners and contours are collections of edges.
3. Object parts are collections of edges, corners, and contours.
4. Objects are collections of object parts.

This is analogous to how visual information is processed in the human brain. Edge
detection is done in early visual areas like the primary visual cortex, more complex
shape detection in temporal regions, and a complete visual scene is assembled in
the brain’s frontal regions. Hierarchical representations enable deep neural networks
to learn abstract concepts and develop sophisticated models of the world. They are
essential to DL and why it is so powerful.

What DL Models Do

DL models learn complicated relationships in data. In general, ML models


can be thought of as a way of transforming any input into a meaningful output.
DL models are an especially useful kind of ML model that can capture an extensive
family of relationships between input and output.

Function approximation. In theory, neural networks—the backbone of DL


models—can learn almost any function that maps inputs to outputs, given enough
data and a suitable network architecture. Under some strong assumptions, a suf-
ficiently large neural network can approximate any continuous function (like y =
ax² + bx + c) with a combination of weights and biases. For this reason, neural net-
works are sometimes called “universal function approximators.” While largely the-
oretical, this idea provides an intuition for how DL models achieve such immense
flexibility and utility in their tasks.

Challenges and limitations. DL models do not have unlimited capabilities. Al-


though neural networks are very powerful, they are not the best suited to all tasks.
Like any other model, they are subject to tradeoffs, limitations, and real-world con-
straints. In addition, the performance of deep neural networks often depends on the
quality and quantity of data available to train the model, the algorithms and archi-
tectures used, and the amount of computational power available.

Summary

DL is an approach to ML that leverages multi-layer neural networks to achieve im-


pressive performance. DL models can capture a remarkable family of relationships

between inputs and outputs by developing hierarchical representations. They have a


number of advantages over traditional ML models, including scaling more effectively,
learning more sophisticated relationships with less human input, and adapting more
readily to different tasks with specialized components. Next, we will make our under-
standing more concrete by looking more closely at exactly what these components
are and how they operate.

2.3.1 Model Building Blocks

In this section, we will explore some of the foundational building blocks of DL models.
We will begin by defining what a neural network is and then discuss the fundamental
elements of neural networks through the example of multi-layer perceptrons (MLPs),
one of the most basic and common types of DL architecture. Then, we will cover
a few more technical concepts, including activation functions, residual connections,
convolution, and self-attention. Finally, we will see how these concepts come together
in the Transformer, another type of DL architecture.

Neural networks. Neural networks are a type of ML algorithm composed of layers


of interconnected nodes or neurons. They are loosely inspired by the structure and
function of the human brain. Neurons are the basic computational units of neural
networks. In essence, a neuron is a function that takes in a weighted sum of its inputs
and applies an activation function to transform it, generating an output signal that
is passed along to other neurons.

Biological inspiration. The “artificial neurons” in neural networks were named


after their biological counterparts. Both artificial and biological neurons operate on
the same basic principle. They receive inputs from multiple sources, process them by
performing a computation, and produce outputs that depend on the inputs—in the
case of biological neurons, firing only when a certain threshold is exceeded. However,
while biological neurons are intricate physical structures with many components and
interacting cells, artificial neurons are simplified computational units designed to
mimic a few of their characteristics.

Figure 2.11. Artificial neurons have some structural similarities to biological neurons [144,
145].

Building blocks. Neural networks are made of simple building blocks that can pro-
duce complex abilities when combined at scale. Despite their simplicity, the resulting
network can display remarkable behaviors when thousands—or even millions—of arti-
ficial neurons are joined together. Neural networks consist of densely connected layers
of neurons, each contributing a tiny fraction to the overall processing power of the
network. Within this basic blueprint, there is much room for variation; for instance,
neurons can be connected in many ways and employ various activation functions.
These network structure and design differences shape what and how a model can
learn.

Multi-Layer Perceptrons

Multi-layer perceptrons (MLPs) are a foundational neural network architecture con-


sisting of multiple layers of nodes, each performing a weighted sum of its inputs and
passing the result through an activation function. They belong to a class of architec-
tures known as “feedforward” neural networks, where information flows in only one
direction, from one layer to the next. MLPs are composed of at least three layers: an
input layer, one or more hidden layers, and an output layer.

The input layer serves as the entry point for data into a network. The
input layer consists of nodes that encode information from input data to pass on to the
next layer. Unlike in other layers, the nodes do not perform any computation. Instead,
each node in the input layer captures a small piece of the raw input data and directly relays
this information to the nodes in the subsequent layer. As with other ML systems,
input data for neural networks comes in many forms. For illustration, we will focus
on just one: image data. Specifically, we will draw from the classic example of digit
recognition with MNIST.
The MNIST (Modified National Institute of Standards and Technology) database
is a large collection of images of handwritten digits, each with dimensions 28 × 28.
Consider a neural network trained to classify these images. The input layer of this
network consists of 784 nodes, each corresponding to the grayscale value of a pixel in
a given image.

The output layer is the final layer of a neural network. The output layer
contains neurons representing the results of the computations performed within the
network. Like inputs, neural network outputs come in many forms, such as predic-
tions or classifications. In the case of MNIST (a classification task), the output is
categorical, predicting the digit represented by a particular image.
For classification tasks, the number of neurons in the output layer is equal to the
number of possible classes. In the MNIST example, the output layer will have ten
neurons, one for each of the ten classes (digits 0–9). The value of each neuron rep-
resents the predicted probability that an example belongs to that class. The output
value of the network is the class of the output neuron with the highest value.

Hidden layers are the intermediate layers between the input and output
layers. Each hidden layer is a collection of neurons that receive outputs from the

Figure 2.12. Each pixel’s value is transferred to a neuron in the first layer [146].

previous layer, perform a computation, and pass the results to the next layer. These
are “hidden” because they are internal to the network and not directly observable
from its inputs or outputs. These layers are where representations of features are
learned.
Weights represent the strength of the connection between two neurons.
Every connection is associated with a weight that determines how much the input
signal from a given neuron will influence the output of the next neuron. This value
represents the importance or contribution of the first neuron to the second. The
larger the magnitude, the greater the influence. Neural networks learn by modifying
the values of their weights, which we will explore shortly.
Biases are additional learned parameters used to adjust neuron outputs.
Every neuron has a bias that helps control its output. This bias acts as a constant
term that shifts the activation function along the input axis, allowing the neuron to
learn more complex, flexible decision boundaries. Similar to the constant b of a linear
equation y = mx + b, the bias allows shifting the output of each layer. In doing so,
biases increase the range of the representations a neural network can learn.
Activation functions control the output or “activation” of neurons. Ac-
tivation functions are nonlinear functions applied to each neuron’s weighted input
sum within a neural network layer. They are mathematical equations that control
the output signal of the neurons, effectively determining the degree to which each
neuron “fires.”
Each neuron in a network takes some inputs, multiplies them by weights, adds a bias,
and applies an activation function. The activation function transforms this weighted
input sum into an output signal. For many activation functions, the more input a
neuron receives, the more it activates, translating to a larger output signal.

Activation functions allow for intricate representations. Without activa-


tion functions, neural networks would operate similarly to linear regression models,
with added layers failing to contribute any complexity to the model’s representations.
Activation functions enable neural networks to learn and express more sophisticated
patterns and relationships by managing the output of neurons.

Single-layer and multi-layer networks. Putting all of these elements together,


single-layer neural networks are the simplest form of neural network. They have only
one hidden layer, comprising an input layer, an output layer, and a hidden layer.
Multi-layer neural networks add more hidden layers in the middle. These networks
are the basis of DL models.

Multi-layer neural networks are required for hierarchical representa-


tions. While single-layer networks can learn many things, they cannot learn the
hierarchical representations that form the cornerstone of DL. Layers provide the scaf-
folding of the pyramid. No layers means no hierarchy. As the features learned in each
layer build on those of previous layers, additional hidden layers enable a neural net-
work to learn more sophisticated and powerful representations. Simply put, more
layers capture more features at more levels of abstraction.

Figure 2.13. A classic multi-layer artificial neural network (ANN) has an input layer, several
hidden layers, and an output layer [147].

Neural networks as matrix multiplication. If we put aside the intuitive dia-


grammatic representation, a neural network is a mathematical function that takes in
a set of input values and produces a set of output values via a series of steps. All the
neurons in a layer can be represented as a list or vector of activations. In any layer,
the incoming vector is multiplied by a matrix of weights and then transformed by applying
an element-wise nonlinear function to the result. This gives the layer's activations, which
become the input to the next layer. The network as a whole is the composition of
all of its layers.
A toy example. Consider an MLP with two hidden layers, activation function g,
and an input x. This network could be expressed as W3 g(W2 g(W1 x)):
1. In the input layer, the input vector x is passed on.
2. In the first hidden layer,
(a) the input vector x is multiplied by the weight matrix W1, yielding W1 x,
(b) then the activation function g is applied, yielding g(W1 x),
(c) which is passed on to the next layer.

3. In the second hidden layer,


(a) the vector passed to the layer is multiplied by the weight matrix W2, yielding
W2 g(W1 x),
(b) then the activation function g is applied, yielding g(W2 g(W1 x)),
(c) which is passed on to the output layer.

4. In the output layer,


(a) the input to the layer is multiplied by the weight matrix W3, yielding
W3 g(W2 g(W1 x)),
(b) which is the output vector.

This process is mathematically equivalent to matrix multiplication. This trait has


significant implications for the computational properties of neural networks. Since
matrix multiplication lends itself to being run in parallel, this equivalence allows
specialized, more efficient processors such as GPUs to be used during training.
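The toy network above can be written out directly in code. In the sketch below, the weight
matrices are arbitrary made-up values (real networks learn them during training), and the
activation g is taken to be a ReLU.

```python
import numpy as np

def g(x):
    # ReLU activation applied element-wise.
    return np.maximum(0, x)

# Arbitrary illustrative weight matrices for a 3-2-2-1 network.
W1 = np.array([[0.5, -0.2, 0.1],
               [0.3,  0.8, -0.5]])   # first hidden layer: 3 inputs -> 2 units
W2 = np.array([[1.0, -1.0],
               [0.5,  0.5]])         # second hidden layer: 2 -> 2
W3 = np.array([[2.0, -1.0]])         # output layer: 2 -> 1

x = np.array([1.0, 2.0, 3.0])        # input vector

# The forward pass is the composition W3 g(W2 g(W1 x)).
output = W3 @ g(W2 @ g(W1 @ x))
print(output)
```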
Summary. MLPs are models of a versatile and popular type of neural network
that has been successfully applied to many tasks. They are often a key component in
many larger, more sophisticated DL architectures. However, MLPs have limitations
and are only sometimes the best-suited approach to a task. Some of the building
blocks we will see later on address the shortcomings of MLPs and critical issues that
can arise in DL more generally. Before that, we will look at activation functions—the
mechanisms that control how and when information is transmitted between neurons—
in more detail.

Key Activation Functions


Activation functions are a vital component of neural networks. They introduce non-
linearity, which allows the network to model intricate patterns and relationships in
data. By defining the activations of each neuron within the network, activation func-
tions act as informational gatekeepers that control data transfer from one layer of
the network to the next.
Using activation functions. There are many activation functions, each with
unique properties and applications. Even within a single network, different layers may

use different activation functions. The selection and placement of activation functions
can significantly change the network’s capability and performance. In most cases, the
same activation will be applied in all the hidden layers within a network.
While many possible activation functions exist, only a handful are commonly used
in practice. Here, we highlight four that are of particular practical or historical sig-
nificance. Although there are many other functions and variations of each, these
four—ReLU, GELU [148], sigmoid, and softmax—have been highly influential in de-
veloping and applying DL. The Transformer architecture, which we will describe
later, uses GELU and softmax functions. Historically, many architectures used Re-
LUs and sigmoids. Together, these functions illustrate the essential characteristics of
the properties and uses of activation functions in neural networks.

Rectified linear unit (ReLU). The rectified linear unit (ReLU) function is a
piecewise linear function that returns the input value for positive inputs and zero
for negative inputs [149]. It is the identity function (f(x) = x) for positive inputs
and zero otherwise. This means that if a neuron’s weighted input sum is positive,
it will be passed directly to the following layer without any modification. However,
no signal will be passed on if the sum is negative. Due to its piecewise nature, the
graph of the ReLU function takes the form of a distinctive “kinked” line. Due to its
computational efficiency, the ReLU function was widely used and played a critical
role in developing more sophisticated DL architectures.

Figure 2.14. The ReLU activation function, ReLU(x) = max{0, x}, passes on positive inputs
to the next layer.

Gaussian error linear unit (GELU). The GELU (Gaussian error linear unit)
function is an upgrade of the ReLU function that smooths out its non-differentiable
kink at zero, which is helpful for optimization. It is "Gaussian"
because it leverages the Gaussian cumulative distribution function (CDF), Φ(x). The
GELU has been widely used in and contributed to the success of many current models,
including Transformer-based language models.

Figure 2.15. The GELU activation function, GELU(x) = x · Φ(x), smooths out the ReLU
function around zero, passing on small negative inputs as well.

Sigmoid. A sigmoid is a smooth, differentiable function that maps any real-valued


numerical input to a value between zero and one. It is sometimes called a squashing
function because it compresses all real numbers to values in this range. When graphed,
it forms a characteristic S-shaped curve. We explored the sigmoid function in the
previous section.

Softmax. Softmax is a popular activation function due to its ability to model


multi-class probabilities. Unlike other activation functions that operate on each in-
put individually, softmax considers all inputs simultaneously to create a probability

Figure 2.16. The sigmoid activation function, σ(x) = 1/(1 + e^(−x)), has a characteristic
S-shape that squeezes inputs into the interval [0, 1].

distribution across many dimensions. This is useful in settings with multiple classes
or categories, such as natural language processing, where each word in a sentence can
belong to one of numerous classes.
The softmax function can be considered a generalization of the sigmoid function.
While the sigmoid function maps a single input value to a number between 0 and 1,
interpreted as a binary probability of class membership, softmax normalizes a set of
real values into a probability distribution over multiple classes. Though it is typically
applied to the output layer of neural networks for multi-class classification tasks—an
example of when different activation functions are used within one network—softmax
may also be used in intermediate layers to readjust weights at bottleneck locations
within a network.
We can revisit the example of handwritten digit recognition. In this classification task,
softmax is applied in the last layer of the network as the final activation function. It
takes in a 10-dimensional vector of the raw outputs from the network and rescales
the values to generate a probability distribution over the ten predicted classes. Each
class represents a digit from 0 to 9, and each output value represents the probability
that an input image is an instance of a given class. The digit corresponding to the
highest probability will be selected as the network’s prediction.
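All four functions can be written in a few lines of NumPy; the GELU below uses the common
tanh-based approximation of the Gaussian CDF, and the example logits are arbitrary.

```python
import numpy as np

def relu(x):
    # Pass positive inputs through unchanged; zero out negative ones.
    return np.maximum(0, x)

def gelu(x):
    # Common tanh-based approximation of x * Phi(x).
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    # Squash any real number into the interval (0, 1).
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Normalize a vector of raw scores into a probability distribution.
    exps = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))  # probabilities that sum to 1, largest for the first class
```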
Now, having explored ReLU, GELU, sigmoid, and softmax, we will set aside activa-
tion functions and turn our attention to other building blocks of DL models.

Residual Connections

Residual connections create alternative pathways in a network, preserv-


ing information. Also known as skip connections, residual connections provide a
pathway for information to bypass specific layers or groups of layers (called blocks)
in a neural network [150]. Without residual connections, all information must travel
sequentially through every layer of the network, undergoing continual, significant
change as each layer receives and transforms the output of the previous one. Resid-
ual connections allow data to skip these transformations, preserving its original con-
tent. With residual connections, layers can access more than just the previous layer’s
representations as information flows through and around each block in the network.
Consequently, lower-level features learned in earlier layers can contribute more di-
rectly to the higher-level features of deeper layers, and information can be more
readily preserved.

Residual connections facilitate learning. Residual connections improve learn-


ing dynamics in several ways by facilitating the flow of information during the training
process. This improves iterative and hierarchical feature representations, particularly
for deeper networks.
Neural networks typically learn by decomposing data into a hierarchy of features,
where each layer learns a distinct representation. Residual connections allow for a
different kind of learning in which learned representations are gradually refined. Each
block improves upon the representation of the previous block, but the overall meaning

Figure 2.17. Adding residual connections can let information bypass blocks fl and fl+1 ,
letting lower-level features from early layers contribute more directly to higher-level features
in later ones.

captured by each layer remains consistent across successive blocks. This allows feature
maps learned in earlier layers to be reused and networks to learn representations
(such as identity mappings) in deeper layers that may otherwise not be possible due
to optimization difficulties.
Residual connections are general purpose, used in many different problem settings
and architectures. By facilitating the learning process and expanding the kinds of
representations networks can learn, they are a valuable building block that can be a
helpful addition to a wide variety of networks.
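In code, the idea is compact: a block's output is added back to its input, so information can
flow around the block as well as through it. The sketch below uses an arbitrary single
linear-plus-ReLU layer as the block, with made-up weights.

```python
import numpy as np

def block(x, W):
    # An arbitrary transformation: one linear layer followed by a ReLU.
    return np.maximum(0, W @ x)

def residual_block(x, W):
    # The input bypasses the block and is added back to its output,
    # so the block only needs to learn a refinement of x.
    return x + block(x, W)

W = np.array([[0.1, -0.3],
              [0.4,  0.2]])
x = np.array([1.0, 2.0])
print(residual_block(x, W))  # [1.0, 2.8]
```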

Figure 2.18. Convolution layers perform the convolution operation, sliding a filter over the
input data to output a feature map [151].

Convolution

In ML, convolution is a process used to detect patterns or features in input data by


applying a small matrix called a filter or kernel and looking for cross-correlations.
This process involves sliding the filter over the input data, taking the element-wise
product of the filter with each section and summing the result, and recording the results
in a new matrix called a feature map.

Convolutional layers. Convolutional layers are specialized layers that perform


the “convolution” operation to detect local features in the input data. These layers
commonly comprise multiple filters, each learning a different feature. Convolution is
considered a localized process because the filter usually operates on small, specific
regions of the input data at a time (such as parts of an image). This allows the
network to recognize features regardless of their position in the input data, making
convolution well-suited for tasks like image recognition.

Convolutional neural networks (CNNs). Convolution has become a key tech-


nique in modern computer vision models because it effectively captures local features
in images and can deal with variations in their position or appearance. This helps
improve the accuracy of models for tasks like object detection or facial recognition
compared to fully connected networks. Convolutional neural networks (CNNs) use
convolution to process spatial data, such as images or videos, by applying convolu-
tional filters that extract local features from the input.
Convolution was instrumental in the transition of DL from MLPs to more sophis-
ticated architectures and has maintained significant influence, especially in vision-
related tasks.
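A bare-bones version of this sliding-filter operation (implemented, as in most DL libraries,
as cross-correlation) might look like the sketch below, which assumes a single-channel
image, no padding, and a stride of one; the image and filter are toy examples.

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image and record the dot product at each
    # position, producing a feature map (no padding, stride of 1).
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A tiny image with a vertical boundary, and a filter that responds to
# dark-to-bright vertical edges.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_filter = np.array([[-1, 1],
                        [-1, 1]], dtype=float)
print(convolve2d(image, edge_filter))  # strongest response along the boundary
```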

Self-Attention

Self-attention can produce more coherent representations. Self-attention


encodes the relationships between elements in a sequence to better understand and
represent the information within the sequence. In self-attention, each element attends
to every other element by determining its relative importance and selectively focusing
on the most relevant connections.
This process allows the model to capture dependencies and relationships within the
sequence, even when they are separated by long distances. As a result, DL models
can create a more context-aware representation of the sequence. When summarizing
a long book, self-attention can help the model understand which parts of the text
are most relevant and central to the overall meaning, leading to a more coherent
summary.
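At its core, self-attention is a weighted average: each element's output mixes information
from every element, with weights given by a softmax over similarity scores. The single-head
sketch below uses random, illustrative projection matrices and omits the multiple heads and
other machinery of a full Transformer.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project each element of the sequence into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Similarity of every element to every other, scaled by sqrt(dimension).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns each row of scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mixture of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))            # a sequence of 4 elements, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```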

Transformers

The Transformer is a groundbreaking DL model that leverages self-attention [152]. It


is a very general and versatile architecture that can achieve outstanding performance
across many data types. The model itself consists of a series of Transformer blocks.

Figure 2.19. Different attention heads can capture different relationships between words in
the same sentence [152].

A Transformer block primarily combines self-attention and MLPs (as we saw earlier)
with optimization techniques such as residual connections and layer normalization.
Large language models (LLMs). LLMs are a class of language models with
many parameters (often in the billions) trained on vast quantities of data. These
models excel in various language tasks, including question-answering, text generation,
coding, translation, and sentiment analysis. Most LLMs, such as the Generative Pre-
trained Transformer (GPT) series, utilize Transformers because they can effectively
model long-range dependencies.

Summary
DL models are networks composed of many layers of interconnected nodes. The struc-
ture of this network plays a vital role in shaping how a model functions. Creating a
successful model requires carefully assembling numerous components. Different com-
ponents are used in different settings, and each building block serves a unique purpose,
contributing to a model’s overall performance and capabilities.
This section discussed multi-layer perceptrons (MLPs), activation functions, residual
connections, convolution, and self-attention, culminating with an introduction to the
Transformer architecture. We saw how MLPs, an archetypal DL model, paved the way
for other architectures and remain an essential component of many more sophisticated
models. Many building blocks each play a distinct role in the structure and function
of a model.

Activation functions like ReLU, softmax, and GELU introduce nonlinearity in net-
works, enabling models to learn complex patterns. Residual connections facilitate the
flow of information in a network, thereby enabling the training of deeper networks.
Convolution uses sliding filters to allow models to detect local features in input data,
an especially useful capability in vision-related tasks. Self-attention enables mod-
els to weigh the relevance of different inputs based on their context. By leveraging
these mechanisms to handle complex dependencies in sequential data, Transformers
revolutionized the field of natural language processing (NLP).

2.3.2 Training and Inference

Having explored the components of DL models, we will now turn to how these models work. First, we will briefly describe training and inference: the two key phases of developing a DL model. Next, we will examine learning mechanics and see how the training process enables models to learn and continually refine their representations. Then, we will discuss a few techniques and approaches to learning and training DL models and consider how model evaluation can help us understand a model’s potential for real-world applications.

Figure 2.20. Transformer blocks combine several other techniques, such as self-attention, MLPs, residual connections, and layer normalization [152].
Training is learning and inference is
executing. As we saw previously in section
2.2, training is the process through which the model learns from data. During train-
ing, a model is fed data and makes iterative parameter adjustments to predict target
outcomes better. Inference is the process of using a trained model to make predictions
on new, unseen data. Inference is when a model applies what it has learned during
training. We will now turn to training and examine how models learn in more detail.

Mechanics of Learning

In DL, training is a carefully coordinated system involving loss functions, optimization algorithms, backpropagation, and other techniques. It allows a model to refine its
predictions iteratively. By making incremental adjustments to its parameters, training
enables a model to gradually reduce its error, improving its performance over time.

Loss quantifies a model’s error. Loss is a measure of a model’s error, used to evaluate its performance. It is calculated by a loss function that compares target and predicted values to measure how well the neural network “fits” the training data. Typically, neural networks are trained by systematically minimizing this function. There are many different kinds of loss functions. Here, we will present two: cross entropy loss and mean squared error (MSE).

Cross entropy loss. Cross entropy is a concept from information theory that measures
the difference between two probability distributions. In DL, cross entropy loss is
often used in classification problems, where it compares the probability distribution
predicted by a model and the target distribution we want the model to predict.
Consider a binary classification problem where a model is tasked with classifying
images as either apples or oranges. When given an image of an apple, a perfect model
would predict “apple” with 100% probability. In other words, with classes [apple,
orange], the target distribution would be [1, 0]. The cross entropy would be low if the
model predicts “apple” with 90% probability (outputting a predicted distribution of
[0.9, 0.1]). However, if the model predicts “orange” with 99% probability, it would
have a much higher loss. The model learns to generate predictions closer to the true
class labels by minimizing the cross entropy loss during training.
Cross entropy quantifies the difference between predicted and true probabilities. If the
predicted distribution is close to the true distribution, the cross entropy will be low,
indicating better model performance. High cross entropy, on the other hand, signals
poor performance. When used as a loss function, the more incorrect the model’s
predictions are, the larger the error and, in turn, the larger the training update.
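The apples-and-oranges example can be checked numerically. The snippet below is a small illustration (assuming the predicted probabilities are nonzero); deep learning libraries provide their own, more numerically careful implementations.

```python
import numpy as np

def cross_entropy(target, predicted):
    # Sum over classes of -target_probability * log(predicted_probability)
    return -np.sum(target * np.log(predicted))

target = np.array([1.0, 0.0])                         # the image really is an apple: [apple, orange]
print(cross_entropy(target, np.array([0.9, 0.1])))    # about 0.105: close to the target, low loss
print(cross_entropy(target, np.array([0.01, 0.99])))  # about 4.6: confidently wrong, much higher loss
```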
Mean squared error (MSE). Mean squared error is one of the most popular loss func-
tions for regression problems. It is calculated as the average of the squared differences
between target and predicted values.
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

MSE gives a good measure of how far away an output is from its target in a way
that is not affected by the direction of errors. Like cross entropy, MSE provides a
larger error signal the more wrong the output guess, helping the training process
converge more quickly. One weakness of MSE is that it is highly sensitive to outliers,
as squaring amplifies large differences, although there are variants and alternatives
such as mean absolute error (MAE) and Huber loss which are more robust to outliers.
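A short numerical sketch shows both the formula and the sensitivity to outliers mentioned above:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)   # average of squared differences

y_true = np.array([3.0, 5.0, 7.0])
print(mse(y_true, np.array([2.5, 5.0, 7.5])))    # about 0.17: several small errors, small loss
print(mse(y_true, np.array([3.0, 5.0, 17.0])))   # about 33.3: a single outlier dominates the average
```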

Loss is minimized through optimization. Optimization is the process of minimizing (or maximizing) an objective function. In DL, optimization involves finding the set of parameters that minimize the loss function. This is achieved with optimizers: algorithms that adjust a model’s parameters, such as weights and biases, to reduce the loss.

Gradient descent is a crucial optimization algorithm. Gradient descent is a foundational optimization algorithm that provides the basis for many advanced optimizers used in DL. It was among the earliest techniques developed for optimization.
To understand the basic idea behind gradient descent, imagine a blindfolded hiker
standing on a hill trying to reach the bottom of a valley. With each step, they can feel
the slope of the hill beneath their feet and move in the direction that goes downhill
the most. While the hiker cannot tell where exactly they are going or where they are

ending up, they can continue this process, always taking steps toward the steepest
descent until they have reached the lowest point.
In ML, the hill is the loss function, and the steps are updates to the model’s param-
eters. The direction of steepest descent is calculated using the gradients (derivatives)
of the loss function with respect to the model’s parameters.

Figure 2.21. Gradient descent can find different local minima given two different weight
initializations [153].

The size of the steps is determined by the learning rate, a parameter of the model
configuration (known as a hyperparameter) used to control how much a model’s
weights are changed with each update. If the learning rate is too large, each update may destroy previously learned information faster than new information is acquired. However,
the optimization process may be very slow if the learning rate is too small. Therefore,
proper learning rate selection is often key to effective training.
Though powerful, gradient descent in its simplest form can be quite slow. Several
variants, including Adam (Adaptive Moment Estimation), were developed to address
these weaknesses and are more commonly used in practice.
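The hiker analogy can be written out directly for a toy one-parameter problem. The sketch below minimizes L(w) = (w - 3)^2, whose gradient is 2(w - 3); the specific loss function and learning rate are illustrative choices.

```python
w = 10.0                 # initial parameter value: the hiker's starting position on the hill
learning_rate = 0.1      # step size: too large overshoots, too small makes progress very slow

for step in range(50):
    grad = 2 * (w - 3)               # slope of the loss L(w) = (w - 3)**2 at the current point
    w = w - learning_rate * grad     # take a step in the direction of steepest descent

print(w)                 # close to 3.0, the minimum of this loss
```

Optimizers such as Adam follow the same basic pattern but adapt the step size using running statistics of past gradients.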
Backpropagation facilitates parameter updates. Backpropagation is a
widely used method to compute the gradients in a neural network [154]. This process
is essential for updating the model’s parameters and makes gradient descent possible.
Backpropagation is a way to send the error signal from the output layer of the neural
network back to the input layer. It allows the model to understand how much each
parameter contributes to the overall error and adjust them accordingly to minimize
the loss.
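In practice, frameworks compute these gradients automatically. The toy example below uses PyTorch's autograd, which carries out backpropagation, to obtain the gradient of a loss with respect to a single parameter:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # a single trainable parameter
x = torch.tensor(3.0)
loss = (w * x - 12.0) ** 2                  # prediction w*x compared against a target of 12

loss.backward()                             # backpropagation: send the error signal backward
print(w.grad)                               # d(loss)/dw = 2*(w*x - 12)*x = 2*(6 - 12)*3 = -36
```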
Steps to training a DL model. Putting all of these components together, train-
ing is a multi-step process that typically involves the following:
1. Initialization: A model’s parameters (weights and biases) are set to some initial
values, often small random numbers. These values define the starting point for the
model’s training and can significantly influence its success.

2. Forward Propagation: Input data is passed through the model, layer by layer.
The neurons in each layer perform their specific operations via weights, biases,
and activation functions. Once the final layer is reached, an output is produced.
This procedure can be carried out on individual examples or on batches of multiple
data points.
3. Loss Calculation: The model’s output is compared to the target output using
a loss function that quantifies the difference between predicted and actual values.
The loss represents the model’s error—how far its output was from what it should
have been.
4. Backpropagation: The error is propagated back through the model, starting
from the output layer and going backward to the input layer. This process cal-
culates gradients that determine how much each parameter contributed to the
overall loss.
5. Parameter Update: The model’s weights and biases are adjusted using an op-
timization algorithm based on the gradients. This is typically done using gradient
descent or one of its variants.
6. Iteration: Steps 2–5 are repeated many times, often reaching millions or billions
of iterations. With each pass, the loss should decrease as the model’s predictions
improve.
7. Stopping Criterion: Training continues until the model reaches a stopping point,
which can be defined in many ways. We may stop training when the loss stops
decreasing or when the model has gone through the entire training dataset a
specific number of times.

While this sketch provides a high-level overview of the training process, many factors
can shape its course. For example, the network architecture and choice of loss func-
tion, optimizer, batch size, learning rate, and other hyperparameters influence how
training proceeds. Moreover, different methods and approaches to learning determine
how training is carried out. We will explore some of these techniques in the next
section.
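Putting the seven steps above into code, here is a minimal sketch of a training loop in PyTorch on a made-up regression problem. The architecture, data, and hyperparameters are arbitrary illustrations; the comments map each line to the steps listed above.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))  # 1. initialization (random weights)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X = torch.randn(256, 4)
y = X.sum(dim=1, keepdim=True)          # a simple target the model can learn

for epoch in range(200):                # 6. iteration
    pred = model(X)                     # 2. forward propagation
    loss = loss_fn(pred, y)             # 3. loss calculation
    optimizer.zero_grad()
    loss.backward()                     # 4. backpropagation (compute gradients)
    optimizer.step()                    # 5. parameter update
# 7. stopping criterion: here, simply a fixed number of passes over the data
```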

Training Methods and Techniques

Effective training is essential to the ability of DL models to learn how to accomplish tasks. Various methods have been developed to address key issues many models face
in training. Some techniques offer distinct approaches to learning, whereas others
solve specific computational difficulties. Each has unique characteristics and applica-
tions that can significantly enhance a model’s performance and adaptability. Different
techniques are often used together, like a recipe using many ingredients.
In this section, we limit our discussion to pre-training, fine-tuning, and few-shot learn-
ing. These three methods illustrate different ways of approaching the learning process.
Notably, there are ways to learn during training (during backpropagation and model
weight adjustment), and there are also ways to learn after training (during inference).

Pre-training and fine-tuning belong to the former, and few-shot learning belongs to
the latter.

Pre-training is the bulk of generic training. Pre-training is training a model on vast quantities of data to give the model an array of generally useful represen-
tations that it can use to achieve specific tasks. If we want a model that can write
movie scripts, we want it to have a broad education, knowing rules about grammar
and language and how to write more generally, rather than just seeing existing movie
scripts.
Pre-training endows models with weights that capture a rich set of learned repre-
sentations from the outset rather than being assigned random values. This can offer
several advantages over training for specific purposes from scratch, including faster
and more effective training on downstream tasks. Indeed, the name pre-training is
somewhat of a historical artifact. As pre-training makes up most of the development
process for many models, pre-training and training have become synonymous.
Tokenization is a common preprocessing step in ML and natural language processing (NLP). It involves breaking down text, such as a sentence or a document, into smaller units called tokens. Tokens are typically words or subword units such as “play” (from “playing”), “un” (from “unbelievable”), punctuation tokens, and so on. Tokenization allows an ML model to treat text data as a sequence of consistent discrete units, making it faster to process.
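The following toy sketch illustrates the idea with a tiny, hand-made vocabulary and greedy longest-match splitting. Real tokenizers (such as byte-pair encoding) learn their vocabularies from data, so the exact splits below are purely illustrative.

```python
vocab = ["play", "ing", "un", "believ", "able", "the", " ", "."]   # a tiny, hand-made vocabulary

def toy_tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # Greedily take the longest vocabulary entry that matches at position i,
        # falling back to a single character if nothing matches.
        match = next((v for v in sorted(vocab, key=len, reverse=True)
                      if text.startswith(v, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(toy_tokenize("playing"))        # ['play', 'ing']
print(toy_tokenize("unbelievable"))   # ['un', 'believ', 'able']
```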

Models can either be fine-tuned or used only pre-trained. Pre-trained models can be used as is (known as off-the-shelf) or subjected to further training (known
as fine-tuned) on a target task or dataset. In natural language processing and com-
puter vision, it is common to use models that have been pre-trained on large datasets.
Many CNNs are pre-trained on the ImageNet dataset, enabling them to learn many
essential characteristics of the visual world.

Fine-tuning specializes models for specific tasks. Fine-tuning is the process of adapting a pre-trained model to a new dataset or task through additional training.
In fine-tuning, the weights from the pre-trained model are used as the starting point
for the new model. Then, some or all layers are trained on the new task or data, often
with a lower learning rate.
Layers that are not trained are said to be frozen. Their weights will remain unchanged
to preserve helpful representations learned in pre-training. Typically, layers are mod-
ified in reverse order, from the output layer toward the input layer. This allows the
more specialized, high-level representations of later layers to be tailored to the new
task while conserving the more general representations of earlier layers.
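A hedged sketch of this workflow in PyTorch (assuming a recent torchvision) might look like the following: load a CNN pre-trained on ImageNet, freeze its layers to preserve their general representations, and replace and train only the final layer for a new 10-class task.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights="DEFAULT")              # off-the-shelf weights pre-trained on ImageNet

for param in model.parameters():
    param.requires_grad = False                  # frozen: these weights stay unchanged

model.fc = nn.Linear(model.fc.in_features, 10)   # new, trainable output layer for the new task

# Fine-tune only the new layer, typically with a lower learning rate than pre-training used.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
# ...then run an ordinary training loop on the new dataset, updating only model.fc.
```

Unfreezing additional layers, starting from the output end of the network, trades more adaptation to the new task against losing more of the general pre-trained representations.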

After training, few-shot learning can teach new capabilities. Few-shot learning is a method that enables models to learn and adapt quickly to new tasks with
limited data. It works best when a model has already learned good representations
for the tasks it needs to perform. In few-shot learning, models are trained to perform
tasks using a minimal number of examples. This approach tests the model’s ability

to learn quickly and effectively from a small dataset. Few-shot learning can be used
to train an image classifier to recognize new categories of animals after seeing only a
few images of each animal.

Zero-shot learning. Zero-shot learning is an extreme version of few-shot learning. It tests a model’s ability to perform on characteristically new data without being
provided any examples during training. The goal is to enable the model to generalize
to new classes or tasks by leveraging its understanding of relationships in the data
derived from seen examples to predict new, unseen examples.
Zero-shot learning often relies on additional information, such as attributes or natural
language descriptions of unseen data, to bridge the gap between known and unknown.
For instance, consider a model trained to identify common birds, where each species
is represented by images and a set of attributes (such as size, color, diet, and range)
or a brief description of the bird’s appearance and behavior. The model is trained to
associate the images with these descriptions or attributes. When presented with the
attributes or description of a new species, the model can use this information to infer
characteristics about the unknown bird and recognize it in images.

LLMs, few-shot, and zero-shot learning. Some large language models (LLMs)
have demonstrated a capacity to perform few- and zero-shot learning tasks without
explicit training. As model and training datasets increased in size, these models de-
veloped the ability to solve a variety of tasks when provided with a few examples
(few-shot) or only instructions describing the task (zero-shot) during inference; for
instance, an LLM can be asked to classify a paragraph as having positive or negative
sentiments without specific training. These capabilities arose organically as the mod-
els increased in size and complexity, and their unexpected emergence raises questions
about what enables LLMs to perform these tasks, especially when they are only ex-
plicitly trained to predict the next token in a sequence. Moreover, as these models
continue to evolve, this prompts speculation about what other capabilities may arise
with greater scale.
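The difference between the two settings is easiest to see in the prompts themselves. In the sketch below, query_llm is a hypothetical stand-in for whatever interface is used to call a language model; no weights are updated in either case.

```python
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The plot dragged and the acting was wooden." -> Negative
Review: "A delightful surprise from start to finish." -> Positive
Review: "I want those two hours of my life back." -> """

zero_shot_prompt = ("Classify the sentiment of this review as Positive or Negative: "
                    "'The soundtrack alone makes it worth watching.'")

# answer = query_llm(few_shot_prompt)    # few-shot: a handful of examples steer the model
# answer = query_llm(zero_shot_prompt)   # zero-shot: instructions only, no examples at all
```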

Summary. There are many training techniques used in DL. Pre-training and fine-
tuning are the foundation of many successful models, allowing them to learn valuable
representations from one task or dataset and apply them to another. Few-shot and
zero-shot learning enable models to solve tasks based on scarce or no example data.
Notably, the emergence of few- and zero-shot learning capabilities in large language
models illuminates the potential for these models to adapt and generalize beyond their
explicit training. Ongoing advancements in training techniques continue to drive the
growth of AI capabilities, highlighting both exciting opportunities and important
questions about the future of the field.

2.3.3 History and Timeline of Key Architectures

Having built our technical understanding of DL models and how they work, we will
see how these concepts come together in some of the groundbreaking architectures

that have shaped the field. We will take a chronological tour of key DL models, from
the pioneering LeNet in 1989 to the revolutionary Transformer-based BERT and GPT
in 2018. These architectures, varying in design and purpose, have paved the way for
developing increasingly sophisticated and capable models. While the history of DL
extends far beyond these examples, this snapshot sheds light on a handful of critical
moments as neural networks evolved from a marginal theory in the mid-1900s to the
vanguard of AI development by the early 2010s.

1989: LeNet
LeNet paves the way for future DL models [155]. LeNet is a convolutional
neural network (CNN) proposed by Yann LeCun and his colleagues at Bell Labs
in 1989. This prototype was the first practical application of backpropagation, and
after multiple iterations of refinement, LeCun et al. presented the flagship model,
LeNet-5, in 1998. This model demonstrated the utility of neural networks in everyday
applications and inspired many DL architectures in the years to follow. However, due
to computational constraints, CNNs did not rise in popularity for over a decade after
LeNet-5 was released.

1997: Recurrent Neural Networks (RNNs) & Long Short-Term Memory (LSTM) Networks
Recurrent neural networks (RNNs) use feedback loops to remember.
RNNs are a neural network architecture designed to process sequential or time-series
data, such as text and speech. They were developed to address failures of traditional
feedforward neural networks in modeling the temporal dependencies inherent to these
types of data. RNNs incorporate a concept of “memory” to capture patterns that oc-
cur over time, like trends in stock prices or weather observations and relationships
between words in a sentence. They use a feedback loop with a hidden state that
stores information from prior inputs, giving them the ability to “remember” and take
historical information into account when processing future inputs. While this marked
a significant architectural advancement, RNNs were difficult to train and struggled
to learn patterns that occur over more extended amounts of time.
Long short-term memory (LSTM) networks improved memory [156].
LSTMs are a type of RNN that address some of the shortcomings of standard RNNs,
allowing them to model long-term dependencies more effectively. LSTMs introduce
three gates (input, output, and forget) to the memory cell of standard RNNs to
regulate the flow of information in and out of the unit. These gates determine how
much information is let in (input gate), how much information is retained (forget
gate), and how much information is passed along (output gate). This approach allows
the network to learn more efficiently and maintain relevant information for longer.
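To make the role of the gates concrete, here is a compact sketch of a single LSTM cell step in PyTorch. It follows the standard formulation at a high level; production code would normally use the library's built-in LSTM modules instead.

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    def __init__(self, input_size=8, hidden_size=16):
        super().__init__()
        # One linear layer produces the pre-activations for all three gates plus the candidate state.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, h_prev, c_prev):
        z = self.gates(torch.cat([x, h_prev], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gate values between 0 and 1
        g = torch.tanh(g)                   # candidate new information
        c = f * c_prev + i * g              # forget gate discards old memory; input gate admits new
        h = o * torch.tanh(c)               # output gate controls what is passed along
        return h, c
```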

2012: AlexNet
AlexNet achieves unprecedented performance in image recognition [132]. As we saw in section 2.2, the ImageNet Challenge was a large-scale image recognition competition that spurred the development and adoption of DL methods for computer vision. The challenge involved classifying images into 1,000 categories using a dataset of over one million images.
In 2012, a CNN called AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and
Geoffrey Hinton, achieved a breakthrough performance of 15.3% top-5 error rate,
beating the previous best result of 26.2% by a large margin and winning the Ima-
geNet Large Scale Visual Recognition Challenge (ILSVRC). AlexNet consists of eight
layers: five convolutional layers and three fully connected layers. It uses a ReLU ac-
tivation function and specialized techniques such as dropout and data augmentation
to improve accuracy.

2015: ResNets (Residual Networks)


ResNets employ residual connections [150]. ResNets were introduced in
2015 by Microsoft researchers Kaiming He and collaborators. The original model
was the first architecture to implement residual connections. By adding these con-
nections to a traditional 34-layer network, the authors were able to achieve great
success. In 2015, it won first place in the ImageNet classification challenge with a
top-5 error rate of 3.57%.

2017: Transformers
Transformers introduce self-attention. The Transformer architecture was in-
troduced by Vaswani et al. in their revolutionary paper “Attention is All You Need.”
Like RNNs and LSTMs, Transformers are a type of neural network that can process
sequential data. However, the approach used in the Transformer was markedly differ-
ent from those of its predecessors. The Transformer uses self-attention mechanisms
that allow the model to focus on relevant parts of the input and the output.

2018: BERT (Bidirectional Encoder Representations from Transformers) & GPT (Generative Pre-Trained Transformer)
BERT and GPT, both launched in 2018, are two models based on the Transformer
architecture [134].

BERT uses pre-training and bidirectional processing [157]. BERT is a Transformer-based model that can learn contextual representations of natural lan-
guage by pre-training on large-scale corpora. Unlike previous models that process
words in one direction (left-to-right or right-to-left), BERT takes a bidirectional ap-
proach. It is pre-trained on massive amounts of text to perform masked language
modeling and next sentence prediction tasks. Then, the pre-trained model can be fine-
tuned on various natural language understanding tasks, such as question answering,
sentiment analysis, and named entity recognition. BERT was the first wide-scale,
successful use of Transformers, and its contextual approach allowed it to achieve
state-of-the-art results on several benchmarks.

The GPT models use scale and unidirectional processing. The GPT mod-
els are a series of Transformer-based language models launched by OpenAI. The size
of these models and scale at which they were trained led to a remarkable improve-
ment in fluency and accuracy in various language tasks, significantly advancing the
state-of-the-art in natural language processing. One of the key reasons GPT models
are more popular than BERT models is that they are better at generating text. While
BERT learns rich representations by being trained to fill in blanks in the middle of sentences, GPT models are trained to predict what comes next, enabling
them to generate long-form sequences (e.g. sentences, paragraphs, and essays) much
more naturally.
Many important developments have been left out in this brief timeline. Perhaps
more importantly, future developments might revolutionize model architectures in
new ways, potentially bringing to light older innovations that have currently fallen by the wayside. Next, we will explore some common applications of DL models.

2.3.4 Applications

DL has seen a dramatic rise in popularity since the early 2010s, increasingly becoming
a part of our daily lives. Its applications are broad, powering countless services and
technologies across many industries, some of which are highlighted below.

Communication and entertainment. DL powers the chatbots and generative tools that sparked the surge in global interest in AI that began in late 2022. It
fuels the recommendation systems of many streaming services like Netflix, YouTube,
and Spotify, curating personalized content based on viewing or listening habits. Social
media platforms, like Facebook or Instagram, use DL for image and speech recognition
to enable features such as auto-tagging in photos or video transcription. Personal
assistants like Siri, Google Assistant, and Alexa utilize DL techniques for speech
recognition and natural language understanding, providing us with more natural,
interactive voice interfaces.

Transportation and logistics. DL is central to the development of autonomous vehicles. It helps these vehicles understand their environment, recognize objects, and
make decisions. Retail and logistics companies like Amazon use DL for inventory
management, sales forecasting, and to enable robots to navigate their warehouses.

Healthcare. DL has been used to assist in diagnosing diseases, analyzing medical images, predicting patient outcomes, and personalizing treatment plans. It has
played a significant role in drug discovery, reducing the time and costs associated
with traditional methods.
Beyond this, DL is also used in cybersecurity, agriculture, finance, business analytics,
and many other settings that can benefit from decision making based on large un-
structured datasets. With the more general abilities of LLMs, the impact of DL is set
to disrupt more industries, such as through automatic code generation and writing.

Conclusion
DL has come a long way since its early days, with advancements in architectures, tech-
niques, and applications driving significant progress in AI. DL models have been used
to solve complex problems and provide valuable insights in many different domains.
As data and computing power become more available and algorithmic techniques
continue to improve in the years to come, we can expect DL to become even more
prevalent and impactful.
In the next section, we will discuss scaling laws: a set of principles which can quanti-
tatively predict the effects of more data, larger models, and more computing power
on the performance of DL models. These laws shape how DL models are constructed.

2.4 SCALING LAWS

Introduction

Compelling evidence shows that increases in the performance of many AI systems can be modeled with equations called scaling laws. ML researchers have often found
that larger models with more data usually perform better, and scaling laws attempt
to quantify this folk knowledge. In this section, we discuss how the performance of
DL models has scaled according to parameter count and dataset size, both of which
factors are primarily bottlenecked by the computational resources available. Scaling
laws describe the relationship between a model’s performance and the computational
inputs that it receives.

Conceptual Background: Power Laws

Scaling laws are a type of power law. Power laws are mathematical equations that model how a particular quantity varies as a power of another. In power laws, the variation in one quantity is proportional to a power (exponent) of the variation in another. The power law y = bx^a states that the change in y is directly proportional to the change in x raised to a certain power a. If a is 2, then when x is doubled, y will quadruple. One real-world example is the relation between the area of a circle and its radius. As the radius changes, the area changes as the square of the radius: y = πr^2. This is a power-law equation where b = π and a = 2. The volume of a sphere has a power-law relationship with the sphere’s radius as well: y = (4/3)πr^3 (so b = 4/3 π and a = 3). Scaling laws are a particular kind of power law that describe how DL models scale. These laws relate a model’s loss with model properties (such as the number of model parameters or the dataset size used to train the model).

Log-log plots can be used to visualize power laws. Log-log plots can help
make these mathematical relationships easier to understand and identify. Consider the
power law y = bxa again. Taking the logarithm of both sides, the power law becomes
log(y) = a log(x) + log(b). This is a linear equation (in the logarithmic space) where
a is the slope and log(b) is the y-intercept. Therefore, a power-law relationship will appear as a straight line on a log-log plot (such as Figure 2.22), with the slope of the line
corresponding to the exponent in the power law.

Figure 2.22. An object in free fall in a vacuum falls a distance proportional to the square of the time. On a log-log plot, this power law looks like a straight line.
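The free-fall example in Figure 2.22 can be reproduced in a few lines: fitting a straight line to the logarithms of the data recovers the exponent and constant of the underlying power law.

```python
import numpy as np

t = np.array([1.0, 2.0, 4.0, 8.0])      # time in seconds
d = 0.5 * 9.8 * t ** 2                  # distance fallen: d = b * t^a with b = 4.9 and a = 2

# In log space the power law is linear: log(d) = a*log(t) + log(b).
slope, intercept = np.polyfit(np.log(t), np.log(d), 1)
print(slope)               # approximately 2.0, the exponent a
print(np.exp(intercept))   # approximately 4.9, the constant b
```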

Power laws are remarkably ubiquitous. Power laws are a robust mathematical
framework that can describe, predict, and explain a vast range of phenomena in both
nature and society. Power laws are pervasive in urban planning: log-log plots relating
variables like city population to metrics such as the percentage of cities with at
least that population often result in a straight line (see Fig 2.23). Similarly, animals’

Figure 2.23. Power laws are used in many domains, such as city planning.

metabolic rates are proportional to an exponent of their body mass, showcasing a


clear power law. In social media, the distribution of user activity often follows a power
law, where a small fraction of users generate most of the content (which means that
the frequency of content generation y is proportional to the number of active users
x multiplied by some constant and raised to some exponent: y = bxa ). Power laws
govern many other things, such as the frequency of word usage in a language, the
distribution of wealth, the magnitude of earthquakes, and more.

2.4.1 Scaling Laws in DL


Introduction

Power laws in the context of DL are called (neural) scaling laws. Scaling
laws [158, 159] predict loss given model size and dataset size in a power law relation-
ship. Model size is usually measured in parameters, while dataset size is measured
in tokens. As both variables increase, the model’s loss tends to decrease. This de-
crease in loss with scale often follows a power law: the loss drops substantially, but
not linearly, with increases in data and model size. For instance, if we doubled the
number of parameters, the loss does not just halve: it might decrease to one-fourth or
one-eighth, depending on the exponent in the scaling law. This power-law behavior in
AI systems allows researchers to anticipate and strategize on how to improve models
by investing more in increasing the data or the parameters.

Scaling laws in DL predict loss based on model size and dataset size.
In DL, we have observed power-law relationships between the model’s performance
and other variables that have held consistently over eight orders of magnitude as the
amount of compute used to train models has scaled. These scaling laws can forecast
the performance of a model given different values for its parameters, dataset, and
amount of computational resources. For instance, we can estimate a model’s loss if
we were to double its parameter count or halve the training dataset size. Scaling laws
show that it is possible to accurately predict the loss of an ML system using just two
primary variables:
1. N : The size of the model, measured in the number of parameters. Parameters
are the weights in a model that are adjusted during training. The number of
parameters in a model is a rough measure of its capacity, or how much it can learn
from a dataset.
2. D: The size of the dataset the model is trained on, measured in tokens, pixels,
or other fundamental units. The modality of these tokens depends on the model’s
task. For example, tokens are subunits of language in natural language processing
and images in computer vision. Some models are trained on datasets consisting of
tokens of multiple modalities.

Improving model performance is typically bottlenecked by one of these variables.

The computational resources used to train a model are vital for scaling.
This factor, often referred to as compute, is most often measured by the number of cal-
culations performed over a certain time. The key metric for compute is FLOP/s, the
number of floating-point operations the computer performs per second. Practically,
increasing compute means training with more processors, more powerful processors,
or for a longer time. Models are often allocated a set budget for computation: scaling
laws can determine the ideal model and dataset size given that budget.

Computing power underlies both model size and dataset size. More
computing power enables larger models with more parameters and facilitates the

collection and processing of more tokens of training data. Essentially, greater compu-
tational resources facilitate the development of more sophisticated AI models trained
on expanded datasets. Therefore, scaling is contingent on increasing computation.

The Chinchilla Scaling Law: an Influential Example

The Chinchilla scaling law emphasizes data over model size [160]. One
significant research finding that shows the importance of scaling laws was the success-
ful training of the LLM “Chinchilla.” A small model with only 70 billion parameters,
Chinchilla outperformed much larger models because it was trained on far more to-
kens than pre-existing models. This led to the development of the Chinchilla scaling
law: a scaling law that accounts for parameter count and data. This law demonstrated
that larger models require much more data than was typically assumed at the time
to achieve the desired gains in performance.

Figure 2.24. Chinchilla scaling laws provide an influential estimate of compute-optimal scal-
ing laws, specifying the optimal ratio of model parameters and training tokens for a given
training compute budget in FLOPs. The green lines show projections of optimal model size
and training token count based on the number of FLOPs used to train Google’s Gopher
model [161].

The Chinchilla scaling law equation encapsulates these relationships. The Chinchilla scaling law is estimated to be

L(N, D) = 406.4\,N^{-0.34} + 410.7\,D^{-0.28} + \underbrace{1.69}_{\text{Irreducible Error}} \qquad (2.1)

In equation 2.1, N represents parameter count, D represents dataset size, and L stands for loss. This equation describes a power-law relationship. Understanding this
law can help us understand the interplay between these factors, and knowing these
values helps developers make optimal decisions about investments in increasing model
and dataset size.
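Plugging numbers into equation 2.1 shows how it can guide such decisions. The parameter and token counts below are rough, illustrative figures (approximately Chinchilla's scale versus a larger model trained on fewer tokens), not exact training configurations.

```python
def chinchilla_loss(n_params, n_tokens):
    # Estimated loss from equation 2.1: both terms shrink as scale grows, toward the 1.69 floor.
    return 406.4 * n_params ** -0.34 + 410.7 * n_tokens ** -0.28 + 1.69

print(chinchilla_loss(70e9, 1.4e12))    # ~70B parameters trained on ~1.4T tokens
print(chinchilla_loss(280e9, 300e9))    # a 4x larger model trained on far fewer tokens
```

Under this estimate, the smaller model trained on more data attains the lower predicted loss, mirroring the empirical finding that motivated the law.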

Scaling laws for DL hold across many modalities and orders of magni-
tude. An order of magnitude refers to a tenfold increase—if something increases
by an order of magnitude, it becomes 10 times larger. In DL, evidence suggests that

scaling laws hold across many orders of magnitude of parameter count and dataset
size. This implies that the same scaling relationships are still valid for both a small
model trained on hundreds of tokens or a massive model trained on trillions of tokens.
Scaling laws have continued to hold even as model size increases dramatically.

Figure 2.25. The scaling laws for different DL models look remarkably similar [159].

Discussion

Scaling laws are not universal for ML models. Not all models follow scaling
laws. These relationships are stronger for some types of models than others. Gener-
ative models such as large language models tend to follow regular scaling laws—as
model size and training data increase in scale, performance improves smoothly and
predictably in a relationship described by a power-law equation. But for discrimi-
native models such as image classifiers, clear scaling laws currently do not emerge.
Performance may plateau even as dataset size or model size increase.

Better learning algorithms can boost model performance across the board. An improved algorithm increases the constant term in the scaling law, allow-
ing models to perform better with a given number of tokens or parameters. However,
crafting better learning algorithms is quite difficult. Therefore, improving DL models
generally focuses on increasing the core variables for scaling: tokens and parameters.

The bitter lesson: scaling beats intricate, expert-designed systems. Hard-coding AI systems to follow pre-defined processes using expert insights has proven
slower and more failure-prone than building large models that learn from large
datasets. The following observation is Richard Sutton’s “bitter lesson” [162]:
1. AI researchers have often tried to build knowledge into systems,
2. “This always helps in the short term [...], but in the long run it plateaus and it
even inhibits further progress,
3. Breakthrough progress eventually arrives by an opposing approach based on scal-
ing computation by search and learning.”

This suggests that it is easier to create machines that can learn than to have humans
manually encode them with knowledge. For now, the most effective way to do this
seems to be scaling up DL models such as LLMs. This lesson is “bitter” because
it shows that simpler scaling approaches tend to beat more elegant and complex
techniques designed by human researchers—demoralizing for researchers who spent
years developing those complex approaches. Rather than human ingenuity alone, scale
and computational power are also key factors that drive progress in AI.
It is worth noting that while the general trend of improved performance through scaling has held over many orders of magnitude of computation, the equations used
to model this trend are subject to criticism and debate. The original scaling laws
identified by a team at OpenAI in 2020 were superseded by the Chinchilla scaling
laws described above, which may in turn be replaced in future. While there do seem
to be interesting and important regularities at work, the equations that have been
developed are less well-established than in some other areas of science, such as the
laws of thermodynamics.

Conclusion

In AI, scaling laws describe how loss changes with model and dataset size.
We observed that the performance of a DL model scales according to the number
of parameters and tokens—both shaped by the amount of compute used. Evidence
from generative models like LLMs, observed over eight orders of magnitude of training
compute, indicates a smooth reduction in loss as model size and training data increase,
following a clear scaling law. Scaling laws are especially important for understanding
how changes in variables like the amount of data used can have substantial impacts
on the model’s performance.

2.5 SPEED OF AI DEVELOPMENT

Introduction

It is comfortable to believe that we are nowhere close to creating AI systems that match or surpass human performance on a wide range of cognitive tasks. However,
given the wide range of opinions among experts and current trends in compute and
algorithmic efficiency, we do not have strong reasons to rule out the possibility that
such AI systems will exist in the near future. Even if development in this direction
is slower than the more optimistic projections, the development of AI systems with
powerful capabilities on a narrower set of tasks is already happening and is likely to
introduce novel risks that will be challenging to manage.

HLAI is a helpful but flawed milestone for AI development. When discussing the speed of developments in AI capabilities, it is important to clarify what
reference points we are using. Concepts such as HLAI, AGI, or TAI, introduced ear-
lier in this chapter, are under-specified and ambiguous in some ways, so it is often
more helpful to focus on specific capabilities or types of economic impact. Despite

this, there has been intense debate over when AI systems on this level might be
achieved, and insight into this question could be valuable for better managing the
risks posed by increasingly capable AI systems. In this section, we discuss when we
might see general AI systems that can match average human skill across all or nearly
all cognitive tasks. This is equivalent to some ways of operationalizing the concept
of AGI.

Potential for Rapid Development of HLAI

HLAI systems are possible. The human brain is widely regarded by scientists as
a physical object that is fundamentally a complex biological machine and yet is able to
give rise to a form of general intelligence. This suggests that there is no reason another
physical object could not be built with at least the same level of cognitive functioning.
While some would argue that an intelligence based on silicon or other materials will
be unable to match one built on biological cells, we see no compelling reason to believe
that particular materials are required. Such statements seem uncomfortably similar
to the claims of vitalists, who argued that living beings are fundamentally different
from non-living entities due to containing some non-physical components or having
other special properties. Another objection is that copying a biological brain in silicon
will be a huge scientific challenge. However, there is no need for researchers looking
to create HLAI to create an exact copy or “whole brain emulation.” Airplanes are
able to fly but do not flap their wings like birds—nonetheless they function because
their creators have understood some key underlying principles. Similarly, we might
hope to create AI systems that can perform as well as humans through looser forms
of imitation rather than exact copying.

High uncertainty for HLAI timelines. Opinions on “timelines”—how difficult it will be to create HLAI—vary widely among experts. A 2023 survey of over 2,700
AI experts found a wide range of estimates of when HLAI was likely to appear. The
combined responses estimated a 10% probability of this happening by 2027, and a
50% probability by 2047 [163]. A salient point is that more recent surveys generally
indicate shorter timelines, suggesting that many AI researchers have been surprised
by the pace of advances in AI capabilities. For example, a similar survey conducted in
2022 yielded a 50% probability of HLAI by 2059. In other words, over a period of just
one year, experts brought forward their estimates of when HLAI had a 50% chance
of appearing by 12 years. Nonetheless, it is also worth being cautious about experts
interpreting evidence of rapid growth over a short period too narrowly. In the 1950s
and 1960s, many top AI scientists were overly optimistic about what was achievable
in the short term, and disappointed expectations contributed to the subsequent “AI
Winter.”

Intense incentives and investment for AGI. Vast sums of money are being
dedicated to building AGI, with leaders in the field having secured billions of dollars.
The cost of training GPT-3 has been estimated at around $5 million, while the cost
for training GPT-4 was reported to be over $100 million. As of 2024, AI developers are
spending billions of dollars on GPUs for training the next generation of AI systems.

Increasing investment has translated to growing amounts spent on compute; between 2009 and 2024, the cost of compute used to train notable ML models has roughly
tripled each year [164]. Moreover, although scaling compute may seem like a relatively
simple approach, it has so far proven remarkably effective at improving capabilities
over many orders of magnitude of scale. For example, looking at the task of next-token prediction, not only has the loss decreased with increasing training compute, but the trend has also remained consistent as compute has spanned over a
dozen orders of magnitude. These developments have defied the expectations of some
skeptics who believed that the approach of scaling would quickly reach its limits and
saturate. Additionally, since compute costs are falling, the amount being used has
increased more than spending on it; although spending has been tripling each year,
the amount of training compute for notable models has been quadrupling.
Improvements in drivers, software and other elements are also contributing to the
training of ever-larger AI models. For example, FlashAttention made the training
of transformers more efficient by minimizing redundant operations and efficiently
utilizing hardware resources during training [165].
Besides increasing compute, another indicator of the growth of AI research is the
number of papers published in the field. This metric has also risen rapidly in the
past few years, more than doubling from around 128,000 papers in 2017 to around
282,000 in 2022 [166]. This suggests that increasing investment is not solely going
toward funding ever-larger models, but is also associated with a large increase in the
amount of research going into improving AI systems.

Obstacles to HLAI

More conceptual breakthroughs may be needed to achieve HLAI. Although simply scaling compute has yielded improvements so far, we cannot necessarily
rely on this trend to continue indefinitely. Achieving HLAI may require qualitative
changes, rather than merely quantitative ones. For example, there may be concep-
tual breakthroughs required of which we are so far unaware. This possibility adds
more uncertainty to projected timelines; whereas we can extrapolate previous pat-
terns to predict how training compute will increase, we do not know what conceptual
breakthroughs might be needed, let alone when they might be made.

High-quality data for training might run out. The computational operations
performed in the training of ML models require data to work with. The more com-
pute used in training, the more data can be processed, and the better the model’s
capabilities will be. However, as compute being used for training continues to rise,
we may reach a point where there is not enough high-quality data to fuel the process.
But there are strong incentives for AI developers to find ways to work around this. In
the short term, they will find ways to access new sources of training data, for example
by paying owners of relevant private datasets. Beyond this, they may try a variety of
approaches to reduce the reliance on human-generated data. For example, they may
use AI systems to create synthetic or augmented data. Alternatively, AI systems may

be able to improve further by competing against themselves through self-play, in a similar way to how AlphaGo learned to play Go at superhuman level.

Investment in AI may drop if financial returns are disappointing. Although substantial resources are currently being invested in scaling ML models, we
do not know how much scaling is required to reach HLAI (even if scaling alone were
enough). As companies increase their spending on compute, we do not know whether
their revenue from the technology they monetize will increase at the same rate. If
the costs of improving the ML models grow more quickly than financial returns, then
companies may turn out not to be economically viable, and investment may slow
down.

Conclusion

There is high uncertainty around when HLAI might be achieved. There are strong
economic incentives for AI developers to pursue this goal, and advances in DL have
surprised many researchers in recent years. We should not be confident in ruling out
the possibility that HLAI could also appear in coming years.

AI can be dangerous long before HLAI is achieved. Although discussions of possible timelines for HLAI are pertinent to understanding when the associated
risks might appear, it can be misleading to focus too much on HLAI. This technology
does not need to achieve the same level of general intelligence as a human in order
to pose a threat. Indeed, systems that are highly proficient in just one area have the
potential to cause great harm.

2.6 CONCLUSION

Understanding the technical underpinnings of AI systems—the underlying models and algorithms, how they work, how they are used, and how they are evaluated—is
essential to understanding the safety, ethics, and societal impact of these technologies.
This foundation equips us with the necessary grounding and context to identify and
critically analyze their capabilities and potential, as well as the risks that they pose. It
allows us to discern potential pitfalls in their design, implementation, and deployment
and devise strategies to ensure their safe, ethical, and beneficial use.

2.6.1 Summary

In this chapter, we presented the fundamentals of AI and its subfield, ML, which aims
to create systems that can learn without being explicitly instructed. We examined
its foundational principles, methodologies, and evolution, detailing key techniques,
concepts, and practical applications.

Artificial intelligence. We first discussed AI, the vast umbrella that encapsulates
the idea of machines performing tasks typically associated with human intelligence.

AI and its conceptual origins date back to the 1940s and 1950s when the project
of creating “intelligent machines” came to the fore. The field experienced periods of
flux over the following decades, waxing and waning until the modern DL era was
ushered in by the groundbreaking release of AlexNet in 2012, driven by increased
data availability and advances in hardware.

Defining AI. The term “artificial intelligence” has many meanings, and the capa-
bilities of AI systems exist on a continuum. Five widely used conceptual categories to
distinguish between different types of AI are narrow AI, artificial general intelligence
(AGI), human-level AI (HLAI), transformative AI (TAI), and artificial superintelli-
gence (ASI). While these concepts provide a basis for thinking about the intelligence
and generality of AI systems, they are not well-defined or complete, often overlapping
and used in different, conflicting ways. Therefore, in evaluating risk, it is essential to
consider AI systems based on their specific capabilities instead of broad categoriza-
tions.

The ML model development process. We presented a general framework for understanding ML models by considering several aspects of a model: its task, input data, output, and what type of ML it uses. We then discussed each of these aspects in
turn. We explored common tasks for ML models, including classification, regression,
anomaly detection, and sequence modeling. We highlighted a few of the many types
of data that these models work with and discussed the model development process.
Creating an ML model is a multi-step process that typically includes data collection,
model selection, training, evaluation, and deployment. Measuring the performance
of a model in evaluation is a critical step in the development process. We surveyed
several metrics used to achieve this, as well as the broader, often conflicting goals
that guide this process.

Types of ML. We discussed different approaches to ML and how these categories are neither well-defined nor complete, even though distinctions are often drawn be-
tween different “types” of ML. We surveyed four common approaches to ML: su-
pervised learning, unsupervised learning, reinforcement learning, and DL. At a high
level, supervised learning is learning from labeled data, unsupervised learning is learn-
ing from unlabeled data, and reinforcement learning is learning from agent-gathered
data. DL techniques are used in all three settings, leveraging deep neural networks
to achieve remarkable results.

Deep learning. We then examined DL in more depth. We saw how, beyond its use
of multi-layer neural networks, DL is characterized by its ability to learn hierarchical
representations that provide DL models with great flexibility and power. ML models
are functions that capture relationships between inputs and outputs with represen-
tations that allow them to capture an especially broad family of relationships.

Components of DL models. We explored the essential components of DL models and neural networks. Through the example of multi-layer perceptrons (MLPs),
we broke down neural networks, structures composed of layers of neurons, into an in-
put layer, an output layer, one or more hidden layers, weights, biases, and activation

functions. We highlighted a few significant activation functions and examined other fundamental building blocks of DL models, including residual connections, convolu-
tion, and self-attention. We also presented influential architectures, such as CNNs
and Transformers.

Processes in DL models. We discussed how DL models learn and are used in training and inference. We walked through the steps to training a DL model, be-
ginning with initialization and then cycling through sending information forward to
make a prediction, measuring its error or quality, sending this error backward, and
adjusting parameters accordingly until a stopping criterion is reached. We discussed
training techniques such as pre-training, fine-tuning, few-shot learning, and zero-shot
learning, and how training typically involves a combination of many methods and
techniques used in conjunction. We considered the importance of scalability, compu-
tational efficiency, and interpretability in evaluating DL models and their suitability
for deployment. We plotted the course of technical and architectural development
in the field, from LeNet in 1989 to BERT and GPT models in 2018. We considered
real-world applications of DL in communication and entertainment, transportation
and logistics, and healthcare.

Scaling laws. Scaling laws describe mathematical relationships between model performance and key factors like model size and dataset size in DL. These power law
equations show that as models grow in parameters and are trained on more data,
their loss tends to decrease substantially and predictably. Scaling up computational
resources used to train a model can enable an increase in both model parameters and
the amount of data used in training. Researchers can leverage scaling laws to deter-
mine optimal model and dataset sizes given computational constraints. Scaling laws
hold across many modalities and orders of magnitude, though they do not necessarily
apply to all DL models, such as many classification models.

Speed of AI development. Trends in compute and algorithmic efficiency suggest that the capabilities of AI systems may continue to improve rapidly and could sur-
pass human performance across a broad range of tasks in coming decades. There is
considerable uncertainty among experts about when HLAI might be achieved, with
recent surveys indicating shorter timelines than previously anticipated. Increasing
investment in compute and algorithmic advances have driven significant increases
in AI capabilities. However, achieving HLAI may require conceptual breakthroughs,
and challenges such as the availability of high-quality training data and the economic
viability of further investments could lead to a slowdown. Despite these uncertain-
ties, vigilance is warranted due to the high stakes and potential risks associated with
advanced AI, even before reaching HLAI.

2.7 LITERATURE

The following resources contain further information about the topics discussed in this
chapter:

2.7.1 Recommended Resources

Introductory resources on neural networks, LLMs and AI scaling trends:


• But what is a Neural Network. 2017. url: https://www.3blue1brown.com/lessons/neural-networks (visited on 05/02/2024)
• Andrej Karpathy. Intro to Large Language Models. 2023. url: https://www.youtube.com/watch?v=zjkBMFhNj_g (visited on 11/22/2023)
• Epoch. Key trends and figures in Machine Learning. 2023. url: https://epochai.org/trends (visited on 10/19/2023)

Reference works for advanced readers:


• Jordan Hoffmann et al. Training Compute-Optimal Large Language Models. 2022.
arXiv: 2203.15556 [cs.CL]
• Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016
• Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Pearson, 2020
• Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
MIT Press, Cambridge, MA, 2018
II
Safety
CHAPTER 3

Single-Agent Safety

3.1 INTRODUCTION

To understand the risks associated with artificial intelligence (AI), we begin by ex-
amining the challenge of making single agents safe. In this chapter, we review core
components of this challenge including monitoring, robustness, alignment and sys-
temic safety.

Monitoring. To start, we discuss the problem of monitoring AI systems. The opaqueness of machine learning (ML) systems—their “black-box” nature—hinders
our ability to fully comprehend how they make decisions and what their intentions,
if any, may be. In addition, models may spontaneously develop qualitatively new
and unprecedented “emergent” capabilities as they become more advanced (for ex-
ample, when we make them larger, train them for longer periods, or expose them
to more data). Models may also contain hidden functionality that is hard to detect,
such as backdoors that cause them to behave abnormally in a very small number of
circumstances.

Robustness. Next, we turn to the problem of building models that are robust to
adversarial attacks. AI systems based on DL are typically vulnerable to attacks such
as adversarial examples, deliberately crafted inputs that have been slightly modified
to deceive the model into producing predictions or other outputs that are incorrect.
Achieving adversarial robustness involves designing models that can withstand such
manipulations. Without this, malicious actors can use attacks to circumvent safe-
guards and use AI systems for harmful purposes. Robustness is related to the more
general problem of proxy gaming. In many cases, it is not possible to perfectly specify
our idealized goals for an AI system. Inadequately specified goals can lead to systems
diverging from our idealized goals, and introduce vulnerabilities that adversaries can
attack and exploit.

Alignment. We then pivot to the topic of alignment, focusing primarily on control
of AI systems (another key component of alignment, the choice of values to which
an AI system is to be aligned, is discussed in the Beneficial AI and Machine Ethics
chapter). We start by exploring the issue of deception, categorizing the varied forms
that this can take (some of which have already been observed in existing AI systems), and
analyzing the risks involved in AI systems deceiving human and AI evaluators. We
also explore the possible conditions that could give rise to power-seeking agents and
the ways in which this could lead to particularly harmful risks. We discuss some
techniques that have potential to help with making AI systems more controllable
and reducing the inherent hazards they may pose, including representation control
and unlearning specific capabilities.

Systemic safety. Beyond making individual AIs safer, we discuss how AI
research can contribute to “systemic safety”: addressing real-world risks that may be
exacerbated by AI progress, such as cyber-attacks or engineered pandemics. While AI
is not a silver bullet for all risks, it can be used to create or improve tools that defend
against some risks from AI, leveraging its capabilities for societal resilience. For
example, AI can be applied to reduce risks from pandemic diseases, cyber-attacks, or
disinformation.

Capabilities. We conclude by explaining how researchers trying to improve AI
safety can unintentionally improve the general capabilities of AI systems. As a result,
work on AI safety can end up increasing overall risk by accelerating progress toward
more capable AI systems that are more widely deployed. To avoid this, researchers who aim to
differentially improve safety should pick research topics carefully to minimize the
impacts that successful research will have on capabilities. This chapter argues that,
even when considered in isolation, individual AI systems can pose catastrophic risks.
As we will see in subsequent chapters, many of these risks become more pronounced
when considering multi-agent systems and collective action problems.

3.2 MONITORING

Obstacles to effectively monitoring AI systems to identify and avoid hazards include
the opaqueness of AI systems and the surprising “emergent” capabilities that appear
as they become more advanced. To better monitor AI systems, we need progress in
research areas such as representation reading, model evaluations, and anomaly
detection.

3.2.1 ML Systems Are Opaque

The internal operations of many AI systems are opaque. We might be able to reveal
and prevent harmful behavior if we can make these systems more transparent. In
this section, we will discuss why AI systems are often called black boxes and explore
ways to understand them. Although early research into transparency shows that the
problem is highly difficult and conceptually fraught, its potential to improve AI safety
is substantial.

The most capable ML models today are based on deep neural networks. Whereas
most conventional software is directly written by humans, deep learning (DL) systems
independently learn how to transform inputs to outputs layer-by-layer and step-by-
step. We can direct DL models to learn how to give the right outputs, but we do
not know how to interpret the model’s intermediate computations. In other words,
we do not understand how to make sense of a model’s activations given a real-world
data input. As a result, we cannot make reliable predictions about a model’s behavior
when given new inputs. This section will present a handful of analogies and results
that illustrate the difficulty of understanding ML systems.

DL models as a black box. ML researchers often refer to DL models as a black
box [171], a system that can only be understood in terms of its input-output behavior
without insight into its internal workings. Humans are black boxes—we see their be-
havior, but not the internal brain activity that produces it, let alone fully understand
that brain activity. Although a deep neural network’s weights and activations are
easier to observe than the activity of a human brain, these long lists of numbers are
not easy to interpret in order to understand how a model will behave. We cannot
straightforwardly reduce all the numerical operations of a state-of-the-art model into
a form that is meaningful to humans.

Figure 3.1. ML systems can be broken down into computational graphs with many compo-
nents [172].

Even simple ML techniques suffer from opaqueness. Opaqueness is not
unique to neural networks. Even simple ML techniques such as Principal Compo-
nent Analysis (PCA), which are better understood theoretically than DL, suffer from
similar flaws. For example, Figure 3.2 depicts the results of performing PCA on pic-
tures of human faces. This yields a set of “eigenfaces,” capturing the most important
features identifying a face. Any picture of a face can then be represented as a partic-
ular combination of these eigenfaces.

Figure 3.2. A human face can be made by combining several eigenfaces, each of which
represents different facial features [173].

In some cases, we can guess what facial features an eigenface represents: for example,
one eigenface might represent lighting and shading while another represents facial
hair. However, most eigenfaces do not represent clear facial features, and it is difficult
to verify that our hypotheses for any single feature capture the entirety of their role.
The fact that even simple techniques like PCA remain opaque is a sign of the difficulty
of the broader problem of interpreting DL models.
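
As a concrete illustration of the eigenface idea, the minimal sketch below runs PCA on a
standard faces dataset with scikit-learn; the particular dataset and the number of components
are incidental choices made purely for illustration.

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

# Each row is a flattened 64x64 grayscale face image (downloads a small standard dataset).
faces = fetch_olivetti_faces().data

pca = PCA(n_components=50).fit(faces)        # the "eigenfaces" are the rows of pca.components_

# Any face is approximated as the mean face plus a weighted combination of eigenfaces.
weights = pca.transform(faces[:1])           # coordinates of one face in eigenface space
reconstruction = pca.inverse_transform(weights)

print("eigenface matrix shape:", pca.components_.shape)                  # (50, 4096)
print("reconstruction error:", np.mean((faces[:1] - reconstruction) ** 2))
```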

Feature visualizations show that DL neurons are hard to interpret. In
a neural network, a neuron is a component of an activation vector. One attempt
to understand deep networks involves looking for simple quantitative or algorithmic
descriptions of the relationship between inputs and neurons such as “if the ear feature
has been detected, the model will output either dog or cat” [174]. For image models,
we can create feature visualizations, artificial images that highly activate a particular
neuron (or set of neurons) [175]. We can also examine natural images that highly
activate that neuron.
Like eigenfaces, neurons may be more or less interpretable. Sometimes, feature visu-
alizations identify neurons that seem to depend on a pattern of the input that is clear
to humans. For example, a neuron might activate only when an image contains dog
ears. In other cases, we observe polysemantic neurons, which defy a single interpreta-
tion [177]. Consider Figure 3.3 , which shows images that highly activate a randomly
chosen neuron in an image model. Judging from the natural images, it seems like the
neuron often activates when text associated with traveling or moving is present, but
it’s hard to be sure.
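
The core of feature visualization is gradient ascent on the input: start from noise and
repeatedly adjust the image to increase a chosen neuron's activation. The sketch below
illustrates the idea with a small untrained network standing in for a real vision model;
practical methods add regularizers (such as jitter and frequency penalties) to keep the
resulting images interpretable.

```python
import torch
import torch.nn as nn

# A small stand-in convolutional network; in practice one would use a trained vision model.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
model.eval()

# Start from noise and ascend the gradient of one channel's mean activation.
image = torch.randn(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)
target_channel = 7   # an arbitrary neuron (channel) to visualize

for step in range(200):
    optimizer.zero_grad()
    activation = model(image)[0, target_channel].mean()
    (-activation).backward()          # maximizing the activation = minimizing its negative
    optimizer.step()

print("final mean activation:", activation.item())
```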

Figure 3.3. Left: a “feature visualization” that highly activates a particular neuron. Right:
a collection of natural images that activate a particular neuron [176].

Neural networks are complex systems. Both human brains and deep neural
networks are complex systems, and so involve interdependent and nonlinear interac-
tions between many components. Like many other complex systems (see the Complex
Systems chapter for further discussion), the emergent behaviors of neural networks
are difficult to understand in terms of their components. Just as neuroscientists strug-
gle to identify how a particular biological neuron contributes to a mind’s behavior,
ML researchers struggle to determine how a particular artificial neuron contributes to
a DL model’s behavior. There are limits on our ability to systematically understand
and predict complex systems, which suggests that ML opaqueness may be a special
case of the opaqueness of complex systems.

3.2.2 Motivations for Transparency Research


There is often no way to tell whether a model will perform well on new inputs. If
the model performs poorly, we generally cannot tell why. With better transparency
tools, we might be able to reveal and proactively prevent failure modes, detect the
emergence of new capabilities, and build trust that models will perform as expected
in new circumstances. High-stakes domains might demand guarantees of reliability
based on the soundness or security of internal AI processes, but virtually no such
guarantees can be made for neural networks given the current state of transparency
research.
If we could meaningfully understand how a model treats a given input, we would be
better able to monitor and audit its behavior. Additionally, by understanding how
models solve difficult and novel problems, transparency might also become a source
of conceptual and scientific insight [171].

Ethical obligations to make AI transparent. Model transparency can help
ensure that model decision making is fair, unbiased, and ethical. For example, if a
criminal justice system uses an opaque AI to make decisions about policing, sen-
tencing, or probation, then those decisions will be similarly opaque. People might
have a right to an explanation of decisions that will significantly affect them [178].
Transparency tools may be crucial to ensuring that right is upheld.

Accountability for harms and hazards. Who is responsible when AI systems
fail? Responsibility often depends on the intentions and degree of control held by
those involved. The best way to incentivize safety might be to hold AI creators
responsible for the damage their systems cause. However, we might not want to hold
people responsible for the behavior of systems they cannot predict or understand. The
growing autonomy and complexity of AI systems means that people will have less
control over AI behavior. Meanwhile, the scope and generality of modern AI systems
make it impossible to verify desirable behavior in every case. In “human-in-the-loop”
systems, where decisions depend on both humans and AIs, human operators might
be blamed for failures over which they had little control [179].
AI transparency could enable a more robust system of accountability. For instance,
governments could mandate that AI systems meet baseline requirements for under-
standability. If an AI fails because of a mechanism that its creator could have iden-
tified and prevented with transparency tools, we would be more justified in holding
that creator liable. Transparency could also help to identify responsibility and fairly
assign blame in failures involving human-in-the-loop systems.
Combating deception. Just as a person’s behavior can correspond with many
intentions, an AI’s behavior can correspond to many internal processes, some of which
are more acceptable than others. For example, competent deception is intrinsically
difficult to distinguish from genuine helpfulness. We discuss this issue in more detail
in the Alignment section. For phenomena like deception that are difficult to detect
from behavior alone, transparency tools might allow us to catch internal signs that
show that a model is engaging in deceptive behavior.

3.2.3 Approaches to Transparency


The remainder of this section explores a variety of approaches to transparency.
Though the field is promising, we are careful to note the shortcomings of these ap-
proaches. For a problem as conceptually tricky as opaqueness, it is important to
maintain a clear picture of what successful techniques must achieve and hold new
methods to a high standard. We will discuss the research areas of explainability,
saliency maps, mechanistic interpretability, and representation engineering.

Explanations

What must explanations accomplish? One approach to transparency is to cre-
ate explanations of a model’s behavior. These explanations could have the following
virtues:
• Predictive power: A good explanation should help us understand not just a specific
behavior, but how the model is likely to behave in new situations. Building user
trust in a system is easier when a user can more clearly anticipate model behavior.
• Faithfulness: A faithful explanation accurately reflects the internal workings of the
model. This is especially valuable when we need to understand the precise reason
why a model made a particular decision. Faithful explanations are often better able
to predict behavior because they more closely track the actual mechanisms that
models are using to produce their behavior [171].
• Simplicity: A simple explanation is easier to understand. However, it is important
that the simplification does not sacrifice too much information about actual model
processes. Though some information loss is inevitable, explanations must strike the
right balance between simplicity and faithfulness.

Explanations must avoid confabulation. Explanations can sound plausible
even if they are false. A confabulation is an explanation that is not faithful to the
true processes and considerations that gave rise to a behavior. Both humans and AI
systems confabulate.

Human confabulation. Humans are not transparent systems, even to themselves.
In some sense, the field of psychology exists because humans cannot accurately intuit
how their own mental processes produce their experience and behavior. For example,
mock juries tend to be more lenient with attractive defendants, all else being equal,
even though jurors almost never reference attractiveness when explaining their deci-
sions [180].
Another example of human confabulation can be drawn from studies on split-brain pa-
tients, those who have had the connection between their two cerebral hemispheres sur-
gically severed, causing each hemisphere to process information independently [181].
Researchers can give information to one hemisphere and not the other by presenting
the information in only one visual field. In some experiments, researchers gave written in-
structions to a patient’s right hemisphere, which is unable to speak. After the patient
completed the instructions, the researchers asked the patient’s verbal left hemisphere
why they had taken those actions. Unaware of the instructions, the left hemisphere
reported plausible but incorrect explanations for the patient’s behavior.

ML system confabulation. We can ask language models to provide justifications
along with their answers. Natural language reasoning is much easier to understand
than internal model activations. For example, if an LLM describes each step of its
reasoning in a math problem and gets the question wrong, humans can check where
and how the mistake was made.
However, like human explanations, language model explanations are prone to unrelia-
bility and confabulation. For instance, when researchers fine-tuned a language model
on multiple-choice questions where option (a) was always correct, the model learned
to always answer (a). When this model was told to write explanations for questions
whose correct answers were not (a), the model would produce false but plausible ex-
planations for option (a). The model’s explanation systematically failed to mention
the real reason for its answers, which was that it had been trained to always pick (a)
[182].

An alternative view of explanations. Instead of requiring that explanations
directly describe internal model processes, a more expansive view argues that expla-
nations are just any useful auxiliary information provided alongside the output of a
model. Such explanations might include contextual knowledge or observations that
the model makes about the input. Models can also make auxiliary predictions; for
example, they could note that if an input were different in some specific ways, the
output would change. However, while this type of information can be valuable when
presented correctly, such explanations have the potential to mislead us.

Saliency Maps

Saliency maps purport to identify important components of images.


Saliency maps are visualizations that aim to show which parts of the input are most
relevant to the model’s behavior [183]. They are inspired by biological visual pro-
cessing: when humans and other animals are shown an image, they tend to focus on
particular areas. For example, if a person looks at a picture of a dog, the dog’s ears
and nose will be more relevant than the background to how the person interprets
the image. Saliency map techniques have been popular in part due to the striking
visualizations they produce.

Figure 3.4. A saliency map picks out features from an input that seem particularly relevant
to the model, such as the shirt and cowboy hat in the bottom left image [184].
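
A simple way to produce a saliency map is to take the gradient of a class score with respect
to the input pixels. The sketch below illustrates this "vanilla gradient" approach with a
stand-in classifier and a random image; as discussed next, such maps should be interpreted
with caution.

```python
import torch
import torch.nn as nn

# A stand-in classifier; a real saliency map would use a trained model and a real image.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
model.eval()

image = torch.randn(1, 3, 32, 32, requires_grad=True)
class_index = 3                      # the class whose evidence we want to attribute

score = model(image)[0, class_index]
score.backward()

# The saliency map is the per-pixel magnitude of the gradient of the class score.
saliency = image.grad.abs().max(dim=1).values   # shape (1, 32, 32)
print("most 'salient' pixel value:", saliency.max().item())
```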

Saliency maps often fail to show how ML vision models process images.
In practice, saliency maps are largely bias-confirming visualizations that usually do
not provide useful information about models’ inner workings. It turns out that many
saliency maps are not dependent on a model’s parameters, and the saliency maps
often look similar even when generated for random, untrained models. That means
many saliency maps are incapable of providing explanations that have any relevance
to how a particular model processes data [185]. Saliency maps serve as a warning that
visually or intuitively satisfying information that seems to correspond with model
behavior may not actually be useful. Useful transparency research must avoid the
past failures of the field and produce explanations that are relevant to the model’s
operation.

Mechanistic Interpretability

When trying to understand a system, we might start by finding the smallest pieces
of the system that can be well understood and then combine those pieces to describe
larger parts of the system. If we can understand successively larger parts of the
system, we might eventually develop a bottom-up understanding of the entire system.

Mechanistic interpretability is a transparency research area that aims to represent
models in terms of combinations of small, well-understood mechanisms [186]. If we
can reverse-engineer algorithms that describe small subsets of model activations and
weights, we might be able to combine these algorithms to explain successively larger
parts of the model.

Features are the building blocks of DL mechanisms. Mechanistic inter-
pretability proposes focusing on features, which are directions in a layer’s activation
space that aim to correspond to a meaningful, articulable property of the input [187].
For example, we can imagine a language model with a “this is in Paris” feature. If
we evaluate the input “Eiffel Tower” using the language model, we may find that a
corresponding activation vector points in a similar direction as the “this is in Paris”
feature direction [188]. Meanwhile, the activation vector encoding “Coliseum” may
point away from the “this is in Paris” direction. Other examples of image or text
features include an indicator that a piece of text is code (as opposed to other types
of text), curve detectors, and a large-small dichotomy indicator.
One goal of mechanistic interpretability is to identify features that maintain a coher-
ent description across many different inputs: a “this is in Paris” feature would not
be very valuable if it was highly activated by “Statue of Liberty.” Recall that most
neurons are polysemantic, meaning they don’t individually represent features that
are straightforwardly recognizable by humans. Instead, most features are actually
combinations of neurons, making them difficult to identify due to the sheer number
of possible combinations. Despite this challenge, features can help us think about the
relationship between the internal activations of models and human-understandable
concepts.
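
Treating a feature as a direction means we can score how strongly an activation expresses it
with a simple projection. The sketch below uses made-up activation vectors and a made-up
“this is in Paris” direction purely to illustrate the geometry; identifying real feature
directions in a trained model is the hard part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activation vectors (e.g., for "Eiffel Tower" and "Coliseum") and a candidate
# "this is in Paris" feature direction; all of these are synthetic stand-ins.
d_model = 512
paris_direction = rng.normal(size=d_model)
paris_direction /= np.linalg.norm(paris_direction)

eiffel_activation = 3.0 * paris_direction + 0.5 * rng.normal(size=d_model)
coliseum_activation = -2.0 * paris_direction + 0.5 * rng.normal(size=d_model)

def feature_score(activation, direction):
    """Projection of an activation vector onto a unit-norm feature direction."""
    return float(activation @ direction)

print("Eiffel Tower score:", feature_score(eiffel_activation, paris_direction))
print("Coliseum score:    ", feature_score(coliseum_activation, paris_direction))
```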

Circuits are algorithms operating on features. Features can be understood
in terms of other features. For example, if we’ve discovered features in one layer of
an image model that detect dog ears, snouts, and tails, an input image with high
activations for all of these features may be quite likely to contain a dog. In fact, if
we discover a dog-detecting feature in the next layer of the model, it is plausible that
this feature is calculated using a combination of dog-part-detecting features from the
previous layer. We can test that hypothesis by checking the model’s weights.
A function represented in model weights which relates a model’s earlier features to
its later features is called a circuit [187]. In short, circuits are computations within a
model that are often more understandable. The project of mechanistic interpretabil-
ity is to identify features in models and circuits between them. The more features
and circuits we identify, the more confident we can be that we understand some of
the model’s mechanisms. Circuits also simplify our understanding of the model, al-
lowing us to equate complicated numerical manipulations with simpler algorithmic
abstractions.

An empirical example of a circuit. For the sake of illustration, we will describe
a purported circuit from a language model. Researchers identified how a language
model often predicts indirect objects of sentences (such as “Mary” in “John gave
a drink to ...”) as a simple algorithm using all previous names in a sentence (see
Figure 3.5). This mechanism did not merely agree with model behavior, but was
directly derived from the model weights, giving more confidence that the algorithm
is a faithful description of an internal model mechanism [186].

Figure 3.5. An indirect-object identification circuit can be depicted graphically.

Complex system understanding through mechanisms is limited. There
are several reasons to be concerned about the ability of mechanistic interpretability
research to achieve its ambitions. It is challenging to reduce a complex system’s
behavior into many different low-level mechanisms. Even if we understood each of a
trillion neurons in a large model, we might not be able to combine the pieces into
an understanding of the system as a whole. Another concern is that it is unclear if
mechanistic interpretability can represent model processes with enough simplicity to
be understandable. ML models might represent vast numbers of partial concepts and
complex intuitions that cannot be represented by mechanisms or simple concepts.

Representation Engineering

Representation reading and representation control [189]. Mechanistic in-
terpretability is a bottom-up approach and combines small components into an un-
derstanding of larger structures. Meanwhile, representation engineering is a top-down
approach that begins with a model’s high-level representations and analyzes and con-
trols them. In ML, models learn representations that are not identical to their training
data, but rather stand in for it and allow them to identify essential elements or pat-
terns in the data (see the Artificial Intelligence Fundamentals chapter for further
details). Rather than try to fully understand arbitrary aspects of a model’s internals,
representation engineering develops actionable tools for reading representations and
controlling them.

We can detect high-level subprocesses. Even though neuroscientists don’t un-
derstand the brain in fine-grained detail, they can associate high-level cognitive tasks
with particular brain regions. For example, they have shown that Wernicke’s area is
involved in speech comprehension. Though the brain was once a complete black box,
neuroscience has managed to decompose it into many parts. Neuroscientists can now
make detailed predictions about a person’s emotional state, thoughts, and even men-
tal imagery just by monitoring their brain activity [190].
Representation reading is the similar approach of identifying indicators for particular
subprocesses. We can provide stimuli that relate to the concepts or behaviors that
we want to identify. For example, to identify and control honesty-related outputs,
we can provide contrasting prompts to a model such as “Pretend you’re an [honest/
dishonest] person making statements about the world.” We can track the differences in
the model’s activations when responding to these stimuli. We can use these techniques
to find portions of models which are responsible for important behaviors like models
refusing requests or deceiving users by not revealing knowledge they possess.
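
A minimal sketch of the idea is below: collect activations under contrasting stimuli and take
the difference of their means as a "reading vector" for the concept. The activations here are
random stand-ins rather than the hidden states of a real model, and practical
representation-reading methods typically use more careful constructions (for example, PCA
over contrast pairs).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in activations: in practice these would be hidden states collected from a model
# while it responds to contrasting prompts such as "Pretend you're an honest person ..."
# versus "Pretend you're a dishonest person ...".
d_model = 768
honest_acts = rng.normal(loc=0.2, size=(100, d_model))
dishonest_acts = rng.normal(loc=-0.2, size=(100, d_model))

# A simple "reading vector": the difference between the mean activations of the two sets.
reading_vector = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
reading_vector /= np.linalg.norm(reading_vector)

# Score a new activation by projecting it onto the reading vector: large positive values
# suggest the model is in an "honest" mode for this stimulus.
new_activation = rng.normal(loc=0.2, size=d_model)
print("honesty score:", float(new_activation @ reading_vector))
```
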
Conclusion. ML transparency is a challenging problem because of the difficulty
of understanding complex systems. Major ongoing research areas include mechanistic
interpretability and representation reading, the latter of which does not aim to make
neural networks fully understood from the bottom up, but aims to gain useful internal
knowledge from a model’s representations.

3.2.4 Emergent Capabilities


We cannot predict all the properties of more advanced AI systems just by studying
the properties of less advanced systems. This makes it hard to guarantee the safety
of systems as they become increasingly advanced.
It is generally difficult to control systems that exhibit emergence. Emer-
gence occurs when a system’s lower-level behavior is qualitatively different from its
higher-level behavior. For example, given a small amount of uranium in a fixed vol-
ume, nothing much happens, but with a much larger amount, you end up with a
qualitatively new nuclear reaction. When more is different, understanding the sys-
tem at one scale does not guarantee that one can understand that system at some
other scale [191, 192]. This means that control procedures may not transfer between
scales and can lead to a weakening of control.
The general phenomenon of emergence and its applicability to AI systems is discussed
at greater length in the Complex Systems chapter, under section 5.2. Here, we will
look at examples of emergence in neural networks, ranging from emergent capabilities
to emergent goal-directed behavior and emergent optimization. Then we will discuss
the potential risks of AI systems intrinsifying unintended goals.
Neural networks exhibit emergent capabilities. When we make AI models
larger, train them for longer periods, or expose them to more data, these systems
spontaneously develop qualitatively new and unprecedented emergent capabilities
[193]. These range from simple capabilities including solving arithmetic problems
and unscrambling words to more advanced capabilities including passing college-level
exams, programming, writing poetry, and explaining jokes. For these emergent capa-
bilities, there is some critical combination of model size, training time, and dataset
size below which models are unable to perform the task, and beyond which models
begin to achieve higher performance.

Figure 3.6. LLMs exhibit emergent capabilities on a variety of tasks [193].

Emergent capabilities are unpredictable. Typically, the training loss does not
directly select for emergent capabilities. Instead, these capabilities emerge because
they are instrumentally useful for lowering the training loss. For example, large lan-
guage models trained to predict the next token of text about everyday events develop
some understanding of the events themselves. Developing common sense is instrumen-
tal in lowering the loss, even if it was not explicitly selected for by the loss.
As another example, large language models may also learn how to create text art and
how to draw illustrations with text-based formats like TiKZ and SVG [193]. They
develop a rudimentary spatial reasoning ability not directly encoded in the purely
text-based loss function. Beforehand, it was unclear even to experts that such a simple
loss could give rise to such complex behavior, which demonstrates that specifying the
training loss does not necessarily enable one to predict the capabilities an AI will
eventually develop.
In addition, capabilities may “turn on” suddenly and unexpectedly. Performance on
a given capability may hover near chance levels until the model reaches a critical
threshold, beyond which performance begins to improve dramatically. For example,
the AlphaZero chess model develops human-like chess concepts such as material value
and mate threats in a short burst around 32,000 training steps [194].

Despite specific capabilities often developing through discontinuous jumps, the av-
erage performance tends to scale according to smooth and predictable scaling laws.
The average loss behaves much more regularly because averaging over many differ-
ent capabilities developing at different times and at different speeds smooths out the
jumps. From this vantage point, then, it is often hard to even detect new capabilities.
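
The following sketch illustrates why aggregate metrics can look smooth even when individual
capabilities jump: if each task "turns on" sharply at its own threshold, averaging over many
tasks washes out the discontinuities. The thresholds and curve shapes here are arbitrary
choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
compute = np.logspace(0, 6, 200)              # training compute on a log scale (arbitrary units)

# Each task "turns on" sharply at its own critical threshold (a sigmoid in log-compute).
thresholds = rng.uniform(1, 5, size=500)      # log10 of each task's critical compute
per_task = 1 / (1 + np.exp(-8 * (np.log10(compute)[:, None] - thresholds[None, :])))

aggregate = per_task.mean(axis=1)             # averaging over tasks smooths out the jumps

# Individual tasks jump from ~0 to ~1; the aggregate rises gradually.
print("one task at increasing compute:", np.round(per_task[::50, 0], 2))
print("aggregate at increasing compute:", np.round(aggregate[::50], 2))
```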

Figure 3.7. GPT-4 proved able to create illustrations of unicorns despite having not been
trained to create images: another example of an unexpected emergent capability [75].

Capabilities can remain hidden until after training. In some cases, new
capabilities are not discovered until after training or even in deployment. For example,
after training and before introducing safety mitigations, GPT-4 was evaluated to be
capable of offering detailed guidance on planning attacks or violence, building various
weapons, drafting phishing materials, finding illegal content, and encouraging self-
harm [195]. Other examples of capabilities discovered after training include prompting
strategies that improve model performance on specific tasks or jailbreaks that bypass
rules against producing harmful outputs or writing about illegal acts. In some cases,
such jailbreaks were not discovered until months after the targeted system was first
publicly released [196].

3.2.5 Emergent Goal-Directed Behavior


Besides developing emergent capabilities for solving specific, individual problems,
models can develop emergent goal-directed behavior. This includes behaviors that
extend beyond individual tasks and into more complex, multifaceted environments.

Emergence in Reinforcement Learning


Reinforcement learning (RL) techniques attempt to automate the capac-
ity for an agent to learn from its actions and their consequences in an
environment [197, 198]. This is distinct from other ML problems, where a sys-
tem can learn from an existing dataset. Instead, an RL system (or agent) learns the
hard way, collecting data through experience. An RL agent must learn how to explore
different possible actions to attain as much reward as possible. Reward measures the
agent’s progress toward its goal and acts as feedback in the learning process.
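
The sketch below illustrates this learning-from-experience loop in the simplest possible
setting, a multi-armed bandit with an epsilon-greedy agent. The RL systems discussed in this
section use far more sophisticated algorithms, but the same explore-and-exploit structure is
present.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy environment: three actions with different (unknown to the agent) expected rewards.
true_means = np.array([0.1, 0.5, 0.9])

def step(action):
    return rng.normal(loc=true_means[action], scale=1.0)   # noisy reward

# The agent learns from experience: estimate each action's value, usually pick the best,
# but keep exploring with probability epsilon.
value_estimates = np.zeros(3)
counts = np.zeros(3)
epsilon = 0.1

for t in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))             # explore
    else:
        action = int(np.argmax(value_estimates))  # exploit
    reward = step(action)
    counts[action] += 1
    value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print("estimated action values:", np.round(value_estimates, 2))  # approaches true_means
```
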
RL agents develop emergent goal-directed behavior. AIs can learn tactics
and strategies involving many intermediate steps. For instance, models trained on
Crafter, a Minecraft-inspired toy environment, learn behaviors such as digging tun-
nel systems, bridge-building, blocking and dodging, sheltering, and even farming—
behaviors that were not explicitly selected for by the reward function [199].

As with emergent capabilities, models can acquire these emergent strategies suddenly
and discontinuously. One such example was observed in the video game StarCraft II,
where players take the role of opposing military commanders managing troops and
resources in real-time. During training, AlphaStar, a model trained to play StarCraft
II, progresses through a sequence of emergent strategies and counter-strategies for
managing troops and resources in a back-and-forth manner that resembles how hu-
man players discover and supplant strategies in the game. While some of these steps
are continuous and piecemeal, others involve more dramatic changes in strategy. Com-
paratively simple reward functions can give rise to highly sophisticated strategies and
complex learning dynamics.
RL agents learn emergent tool use. RL agents can learn emergent behaviors
involving tools and the manipulation of the environment. Typically, as in the Crafter
example, teaching RL agents to use tools has required introducing intermediate re-
wards (achievements) that encourage the model to learn that behavior. However, in
other settings, RL agents learn to use tools even when not directly optimized to do
so.
Referring back to the example of hide and seek mentioned in the previous section, the
agents involved developed emergent tool use. Multiple hiders and seekers competed
against each other in toy environments involving movable boxes and ramps. Over
time, the agents learned to manipulate these tools in novel and unexpected ways,
progressing through distinct stages of learning in a way similar to AlphaStar [108].
In the initial (pre-tool) phase, the agents adopted simple chase and escape tactics.
Later, hiders evolved their strategy by constructing forts using the available boxes
and walls.
However, their advantage was temporary because the seekers adapted by pushing a
ramp toward the fort, which they could climb and subsequently invade. In turn, the
hiders responded by relocating the ramps to the edges of the game area—rendering
them inaccessible—and securely anchoring them in place. It seemed that the strate-
gies had converged to a stable point; without ramps, how were the seekers to invade
the forts?
But then, the seekers discovered that they could still exploit the locked ramps by po-
sitioning a box near one, climbing the ramp, and then leaping onto the box. (Without
a ramp, the boxes were too tall to climb.) Once atop a box, a bot could “surf” it
across the arena while staying on top by exploiting an unexpected quirk of the physics
engine. Eventually, the hiders caught on and learned to secure the boxes in advance,
thereby neutralizing the box-surfing strategy. Even though the agents had learned
through the simple objective of trying to avoid the gaze (in the case of hiders) or
seek out (in the case of seekers) the opposing players, they learned to use tools in
sophisticated ways, including some that the researchers had never anticipated.

Figure 3.8. In multi-agent hide-and-seek, AIs demonstrated emergent tool use [200].

RL agents can give rise to emergent social dynamics. In multi-agent en-
vironments, agents can develop and give rise to complex emergent dynamics and
goals involving other agents. For example, OpenAI Five, a model trained to play the
video game Dota 2, learned a basic ability to cooperate with its teammates, even
though it was trained in a setting where it only competed against bots. It acquired
an emergent ability not explicitly represented in its training data [200].
Another salient example of emergent social dynamics and emergent goals involves
generative agents, which are built on top of language models by equipping them with
external scaffolding that lets them take actions and access external memory [201].
In a simple 2D village environment, these generative agents manage to form lasting
relationships and coordinate on joint objectives. By placing a single thought in one
agent’s mind at the start of a “week” that the agent wants to have a Valentine’s Day
party, the entire village ends up planning, organizing, and attending a Valentine’s
Day party. Note that these generative agents are language models, not classical RL
agents, which demonstrates that emergent goal-directed behavior and social dynamics
are not exclusive to RL settings. We further discuss emergent social dynamics in the
Collective Action Problems chapter.

Emergent Optimizers

Optimizers can give rise to emergent optimizers. An optimization process
such as Stochastic Gradient Descent (SGD) can discover solutions that are them-
selves optimizers. This phenomenon introduces an additional layer of complexity in
understanding the behaviors of AI models and can introduce additional control issues
[202].
For example, if we train a model on a maze-solving task, we might end up with a
model implementing simple maze-solving heuristics (e.g. “right-hand on the wall”).
We might also end up with a model implementing a general-purpose maze-solving
algorithm, capable of optimizing for maze-solving solutions in a variety of differ-
ent contexts. We call the second class of models mesa-optimizers and whatever goal
they have learned to optimize for (e.g. solving mazes) their mesa-objective. The term
“mesa” is meant as the opposite of “meta”: where a meta-optimizer is an optimizer on
top of another optimizer, a mesa-optimizer is an optimizer beneath another optimizer.

Figure 3.9. Generative AI agents exhibited emergent collective goals [201].

Few-shot learning is a form of emergent optimization. Perhaps the clearest
example of emergent optimization is few-shot learning. By providing large language
models with several examples of a new task that the system has not yet seen during
training, the model may still be able to learn to perform that task entirely dur-
ing inference. The resemblance between few-shot or “in-context” learning and other
learning processes like SGD is not just an analogy: recent papers have demonstrated
that in-context learning behaves as an approximation of SGD. That is, Transformers
are performing a kind of internal optimization procedure, where as they receive more
examples of the task at hand, they qualitatively change the kind of model they are
implementing [203, 204].
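
As a concrete illustration, a few-shot prompt packs the "training data" for a new task into
the model's input, and any learning happens entirely during inference. The task and expected
completion below are illustrative; no particular model's output is being quoted.

```python
# A minimal illustration of few-shot (in-context) learning: the examples of the new task
# live entirely in the prompt, and no weights are updated.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: bread
French: pain

English: water
French:"""

# A capable language model given this prompt will typically continue with "eau",
# having inferred the task from the examples during inference alone.
print(few_shot_prompt)
```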

3.2.6 Tail Risk: Emergent Goals


Just as AIs can develop emergent capabilities and emergent goal-seeking behavior,
they may develop emergent goals that differ from the explicit objectives we give them.
This poses a risk because it could result in imperfect control. Moreover, if models
begin actively pursuing undesired goals, the risk could potentially be catastrophic
because our relationship becomes adversarial.

Risks from Mesa-Optimization

Mesa-optimizers may develop novel objectives. When training an AI system on
a particular goal, it may develop an emergent mesa-optimizer whose goal is not
necessarily identical to the original training objective. The only thing we know for
certain about an emergent mesa-optimizer
is that whatever goal it has learned, it must be one that results in good training
performance—but there might be many different goals that would all work well in
a particular training environment. For example, with LLMs, the training objective
is to predict future tokens in a sequence, so any distinct optimizers that are learned emerge
because they are instrumentally useful for lowering the training loss. In the case of
in-context learning, recent work has argued that the Transformer is performing some-
thing analogous to “simulating” and fine-tuning a much simpler model, in which case
it is clear that the objectives will be related [204]. However, in general, the exact
relation between a mesa-objective and original objective is unknown.
Mesa-optimizers may be difficult to control. If a mesa-optimizer develops a
different objective to the one we specify, it becomes more difficult to control these
(sub)systems. If these systems have different goals than us and are sufficiently more
intelligent and powerful than us, then this could result in catastrophic outcomes.

Risks from Intrinsification


Models can intrinsify goals [205]. It is helpful to distinguish goals that are in-
strumental from those that are intrinsic. Instrumental goals are goals that serve as a
means to an end. They are goals that are valued only insofar as they bring about other
goals. Intrinsic goals, meanwhile, are goals that serve as ends in and of themselves.
They are terminally valued by a goal-directed system.
Next, intrinsification is a process whereby models acquire such intrinsic goals [205].
The risk is that these newly acquired intrinsic goals can end up taking precedence
over the explicitly specified objectives or expressed goals, potentially leading to those
original objectives no longer being operationally pursued.
Over time, instrumental goals can become intrinsic. A teenager may begin
listening to a particular genre or musician in order to fit into a particular group but
ultimately come to enjoy the music for its own sake. Similarly, a seven-year-old who joins
the Cub Scouts may initially see the group as a means to enjoyable activities but
over time may come to value the scout pack itself. This can even apply to acquiring
money, which is initially sought for purchasing desired items, but can become an end
in itself.
How does this work? When a stimulus regularly precedes the release of a reward
signal, that stimulus may come to be associated with the reward and eventually
trigger reward signals on its own. This process gives rise to new desires and helps us
develop tastes for things that are regularly linked with basic rewards.
Intrinsification could also occur with AIs. Despite the differences between
human and AI reward systems, there are enough similarities to warrant concern. In
both human and AI reinforcement learning, the reward signal reinforces behaviors
leading to rewards. If certain conditions frequently precede a model achieving its
goals, the model might intrinsify the emergent goal of pursuing those conditions,
even if it was not the original aim of the designers of the AI.
AIs that intrinsify unintended goals would be dangerous. Over time, an
internal process that initially doesn’t completely dictate behavior can become a cen-
tral part of an agent’s motivational system. Since intrinsification depends sensitively
on the environment and an agent’s history, it is hard to predict. The concern is that
AIs might intrinsify desires or come to value things that we did not intend them to.
One example is power seeking. Power seeking is not inherently worrying; we might
expect aligned systems to also be power seeking to accomplish ends we value. How-
ever, if power seeking serves an undesired goal or if power seeking itself becomes
intrinsified (the means become ends), this could pose a threat.

AI agents will be adaptive, which requires constant vigilance. Achieving
high performance with AI agents will require them to be adaptive rather than “frozen”
(unable to learn anything after training). This introduces the risk of the agents’ goals
changing over time—a phenomenon known as goal drift. Though this flexibility is
necessary if we are to have AI systems evolve alongside our own changing goals, it
presents its own risks if goal drift results in goals diverging from those of humans. Since it is
difficult to preclude the possibility of goal drift, ensuring the safety of these systems
will require constant supervision: the risk is not isolated to early deployment.

The more integrated AI agents become in society, the more susceptible
we become to their goals changing. In a future where AIs carry out various key
decisions and processes, they could form a complex system of interacting agents that
could give rise to unanticipated emergent goals. For example, they may partially
imitate each other and learn from each other, which would shape their behavior and
possibly also their goals. Additionally, they may also give rise to emergent social
dynamics as in the example of the generative agents. These kinds of dynamics make
the long-term behavior of these AI networks unpredictable and difficult to control.
If we become overly dependent on them and they develop new priorities that don’t
include our wellbeing, we could face an existential risk.

Conclusion. AI systems can develop emergent capabilities that are difficult to
predict and control, such as solving novel problems or accomplishing tasks in unex-
pected ways. These capabilities can appear suddenly as models scale up. In itself, the
emergence of new and dangerous capabilities (e.g. capabilities to develop biological
or chemical weapons) could pose catastrophic risks. There could be further risks if AI
systems were to develop emergent goals diverging from the interests of society and
these systems became powerful. Risks grow as AI agents become more integrated into
human society and susceptible to goal drift or emergent goals. Vigilance is needed to
ensure we are not surprised by advanced AI systems acquiring dangerous capabilities
or goals.

3.2.7 Evaluations and Anomaly Detection

Emergent capabilities make control difficult. Whether certain capabilities
develop suddenly or are discovered suddenly, they can be difficult to predict. This
makes it a challenge to anticipate what future AI will be able to do even in the short
term, and it could mean that we may have little time to react to novel capability
jumps. It is difficult to make a system safe when it is unknown what that system will
be able to do.

Better evaluations and other research techniques could make it easier
to detect hazardous capabilities. Researchers could try to detect potentially
hazardous capabilities as they emerge or develop techniques to track and predict the
progress of models’ capabilities in certain relevant domains and skills. They could
also track capabilities relevant to mitigating hazards. It could be valuable to create
testbeds to continuously screen AI models for potentially hazardous capabilities, for
example abilities that could meaningfully assist malicious actors with the execution of
cyber-attacks, exacerbate CBRN threats or generate persuasive content in a way that
could affect elections [206]. Ideally, we would be able to infer a model’s latent abilities
purely by analyzing a model’s weights, enabling us to infer its abilities beyond what’s
obviously visible through standard testing.
To avoid a false sense of safety, it will be important to validate that these detection
methods are sufficiently sensitive. Researchers could intentionally add hidden func-
tionality to check that the testing methods catch it. Methods to predict future capabil-
ities in a quantitative way and find new failure modes would also be valuable. Once
a hazardous capability like deception is found, it must be eliminated. Researchers
could develop training techniques that ensure that models don’t acquire undesirable
skills in the first place, or that make models forget them after training. But verifying
capabilities are fully removed, not just obscured or partially eliminated, could prove
difficult.

Better anomaly detection would be highly valuable for monitoring AI
systems. As discussed in Section 2.2, anomaly detection involves identifying outliers or
abnormal data points. Anomaly detection allows models to reliably detect and re-
spond to unexpected threats that could substantially impact system performance.
This is useful for detecting potential hazards like sudden behavioral shifts, and sys-
tem failures. A key challenge is detecting rare and unpredictable “black swan” events
that are not represented in training data. Since malicious actors are likely to adopt
novel strategies to avoid detection, anomaly detection could be particularly useful
for identifying malicious activity such as cyberattacks. Anomaly detection could also
potentially be extended to identify unknown threats such as Trojaned, rogue, or
scheming AI systems. Successful anomaly detectors could identify and flag anoma-
lies for human review or automatically carry out a conservative fallback policy. For
anomaly detection to be useful for identifying hazards such as malicious use, it is
important that these tools have high recall and a low false alarm rate, to avoid alarm
fatigue.
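
One simple baseline for flagging anomalous inputs is the model's own confidence: the maximum
softmax probability. The sketch below illustrates the thresholding logic with hypothetical
logits; production anomaly detectors are typically more sophisticated, but the recall versus
false-alarm trade-off is the same.

```python
import numpy as np

def max_softmax_probability(logits):
    """A simple anomaly score: the confidence of the most likely class.
    Low values suggest the input may be anomalous or out-of-distribution."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    return probs.max()

threshold = 0.6   # chosen on validation data to balance recall against false alarms

# Hypothetical logits for a typical input and an anomalous one.
for name, logits in [("typical input", np.array([4.0, 0.5, 0.1])),
                     ("anomalous input", np.array([1.1, 1.0, 0.9]))]:
    score = max_softmax_probability(logits)
    action = "pass through" if score >= threshold else "flag for review / fallback policy"
    print(f"{name}: confidence={score:.2f} -> {action}")
```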

3.3 ROBUSTNESS

In this section, we begin to explore the need for proxies in ML and the challenges
this poses for creating systems that are robust to adversarial attacks. We examine
a potential failure mode known as proxy gaming, wherein a model optimizes for
a proxy in a way that diverges from the idealized goals of its designers. We also
analyze a related concept known as Goodhart’s law and explore some of the causes
for these kinds of failure modes. Next, we consider the phenomenon of adversarial
examples, where an optimizer is used to exploit vulnerabilities in a neural network.
This can enable adversarial attacks that allow an AI system to be misused. Other
adversarial threats to AI systems include Trojan attacks, which allow an adversary
to insert hidden functionality. There are also techniques that allow adversaries to
surreptitiously extract a model’s weights or training data. We close by looking at
the tail risks of having AI systems themselves play the role of evaluators (i.e. proxy
goals) for other AI systems.

3.3.1 Proxies in ML
Here, we look at the concept of proxies, why they are necessary, and how they can
lead to problems.
Many goals are difficult to specify exactly. It is hard to measure or even
define many of the goals we care about. They could be too abstract for straightforward
measurement, such as justice, freedom, and equity, or they could simply be difficult
to observe directly, such as the quality of education in schools.
With ML systems, this difficulty is especially pronounced because, as we saw in
the Artificial Intelligence Fundamentals chapter, ML systems require quantitative,
measurable targets in order to learn. This places a strong limit on the kinds of goals
we can represent. As we’ll see in this section, specifying suitable and learnable targets
poses a major challenge.
Proxies stand in for idealized goals. When specifying our idealized goals is
difficult, we substitute a proxy—an approximate goal that is more measurable and
seems likely to correlate with the original goal. For example, in pest management, a
bureaucracy may substitute the number of pests killed as a proxy for “managing the
local pest population” [207]. Or, in training an AI system to play a racing game, we
might substitute the number of points earned for “progress toward winning the race”
[101]. Such proxies can be more or less accurate at approximating the idealized goal.
Proxies may miss important aspects of our idealized goals. By definition,
proxies used to optimize AI systems will fail to capture some aspects of our idealized
goals. When the differences between the proxy and idealized goal lead to the sys-
tem making the same decisions, we can neglect them. In other cases, the differences
may lead to substantially different downstream decisions with potentially undesirable
outcomes.
While proxies serve as useful and often necessary stand-ins for our idealized objectives,
they are not without flaws. The wrong choice of proxies can lead to the optimized
systems taking unanticipated and undesired actions.

3.3.2 Proxy Gaming


In this section, we explore a failure mode of proxies known as proxy gaming, where
a model optimizes for a proxy in a way that produces undesirable or even harmful
outcomes as judged from the idealized goal. Additionally, we look at a concept related
to proxy gaming, known as Goodhart’s Law, where the optimization process itself
causes a proxy to become less correlated with its original goal.

Optimizing for inaccurate proxies can lead to undesired outcomes. To
illustrate proxy gaming in a context outside AI, consider again the example of pest
management. In 1902, the city of Hanoi was dealing with a rat problem: the newly
installed sewer system had inadvertently become a breeding ground for rats, bring-
ing with it a concern for hygiene and the threat of a plague outbreak [207]. In an
attempt to control the rat population, the French colonial administration began of-
fering a bounty for every rat killed. To make the collection process easier, instead of
demanding the entire carcass, the French only required the rat’s tail as evidence of
the kill.
Counter to the officials’ aims, people began breeding rats to cut off their tails and
claim the reward. Additionally, others would simply cut off the tail and release the rat,
allowing it to potentially breed and produce more tails in the future. The proxy—rat
tails—proved to be a poor substitute for the goal of managing the local rat population.
So too, proxy gaming can occur in ML. A notorious example comes from when re-
searchers at OpenAI trained an AI system to play a game called CoastRunners. In
this game, players need to race around a course and finish before others. Along the
course, there are targets which players can hit to earn points [101]. While the inten-
tion was for the AI to circle the racetrack and complete the race swiftly, much to
the researchers’ surprise, the AI identified a loophole in the objective. It discovered a
specific spot on the course where it could continually strike the same three nearby tar-
gets, rapidly amassing points without ever completing the race. This unconventional
strategy allowed the AI to secure a high score, even though it frequently crashed into
other boats and, on several occasions, set itself ablaze. Points proved to be a poor
proxy for doing well at the game.

Figure 3.10. An AI playing CoastRunners 7 learned to crash and regenerate targets repeat-
edly to get a higher score, rather than win the race, thereby exhibiting proxy gaming [101].

Optimizing for inaccurate proxies can lead to harmful outcomes. If a
proxy is sufficiently unfaithful to the idealized goal it is meant to represent, it can
result in AI systems taking actions that are not just undesirable but actively harmful.
For example, a 2019 study on a US healthcare algorithm used to evaluate the health
risk of 200 million Americans revealed that the algorithm inaccurately evaluated black
patients as healthier than they actually were [208]. The algorithm used past spending
on similar patients as a proxy for health, equating lower spending with better health.
Due to black patients historically getting fewer resources, this system perpetuated a
lower and inadequate standard of care for black patients—assigning them half the
amount of care given to equally sick non-marginalized patients. When deployed at scale,
AI systems that optimize inaccurate proxies can have significant, harmful effects.

Optimizers often “game” proxies in ways that diverge from our idealized
goals. As we saw in the Hanoi example and the boat-racing example, proxies may
contain loopholes that allow for actions that achieve high performance according to
the proxy but that are suboptimal or even deleterious according to the idealized goal.
Proxy gaming refers to this act of exploiting or taking advantage of approximation
errors in the proxy rather than optimizing for the original goal. This is a general
phenomenon that happens in both human systems and AI systems.

Figure 3.11. As optimization pressure increases, the proxy often diverges from the target
with which it was originally correlated [209].

Proxy gaming can occur in many AI systems. The boat-racing example is not an
isolated example. Consider a simulated traffic control environment [102]. Its goal is
to mirror the conditions of cars joining a motorway, in order to determine how to
minimize the average commute time. The system aims to determine the ideal traveling
speeds for both oncoming traffic and vehicles attempting to join the motorway. To
represent average commute time the algorithm uses the maximum mean velocity as
a proxy. However, this results in the algorithm preventing the joining vehicles from
entering the motorway, since a higher average velocity is maintained when oncoming
cars can proceed without slowing down for joining traffic.
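As a concrete illustration, the following toy sketch shows how such a proxy can diverge from the goal it stands in for. The scenario, the speeds, and the assumption that mean velocity is measured only over vehicles on the motorway are simplifications for illustration; this is not the actual benchmark from [102].

```python
import numpy as np

# Toy merge-ramp scenario: the proxy is the mean velocity of vehicles on the
# motorway, while the idealized goal is the mean commute time of all vehicles,
# including those waiting on the ramp. All numbers are made up.
def evaluate(admit_ramp, n_motorway=20, n_ramp=5, road_km=10.0):
    full_speed, merged_speed = 100.0, 80.0            # km/h, assumed toy values
    if admit_ramp:
        speeds = np.full(n_motorway + n_ramp, merged_speed)  # everyone merges, all slow slightly
        commute_h = road_km / speeds
    else:
        speeds = np.full(n_motorway, full_speed)             # ramp vehicles never enter
        stuck_h = 1.0                                        # ramp vehicles just wait (toy)
        commute_h = np.concatenate([road_km / speeds, np.full(n_ramp, stuck_h)])
    return speeds.mean(), commute_h.mean()            # (proxy, idealized goal)

for admit in (True, False):
    proxy, goal = evaluate(admit)
    print(f"admit_ramp={admit}: mean velocity = {proxy:.0f} km/h, mean commute = {goal:.2f} h")
# The proxy prefers never admitting ramp traffic (100 vs. 80 km/h), even though
# that policy makes the average commute time far worse.
```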
Optimizers can cause proxies to become less correlated with the idealized goal. The
total amount of effort an optimizer has put toward optimizing a particular proxy is
the optimization pressure [209]. Optimization pressure depends on factors like the
incentives present, the capability of the optimizer, and how much time the optimizer
has had to optimize.

Figure 3.12. AIs presented with simple proxies like “maximize the mean velocity” can game them, choosing solutions that are suboptimal for the underlying goal.

In many cases, the correlation between a proxy and an idealized goal will decrease as
optimization pressure increases. The approximation error between the proxy and the
idealized goal may at first be negligible, but as the system becomes more capable of
achieving high performance (according to the proxy) or as the incentives to achieve high performance increase, the approximation error can grow. In the boat-racing example, the proxy (number of points) initially advanced the designers’ intentions: the AI system learned to maneuver the boat. It was only under additional optimization pressure that the correlation broke down, with the boat getting stuck in a loop.
Sometimes, the correlation between a proxy and an idealized goal can vanish or re-
verse. According to Goodhart’s Law, “any observed statistical regularity will tend to
collapse once pressure is placed upon it for control purposes” [210]. In other words,
a proxy might initially have a strong correlation (“statistical regularity”) with the
idealized outcome. However, as optimization pressure (“pressure ... for control pur-
poses”) increases, the initial correlation can vanish (“collapse”) and in some cases
even reverse. The scenario with the Hanoi rats is a classic illustration of this princi-
ple, where the number of rat tails collected ultimately became positively correlated
with the local rat population. The proxy failed precisely because the pressure to
optimize for it caused the proxy to become less correlated with the idealized goal.
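A toy simulation can make this collapse vivid. In the sketch below, each candidate action has a true value and a noisy proxy score, and the optimizer simply picks whichever candidate scores highest on the proxy; the number of candidates searched stands in for optimization pressure. The distributions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each candidate has a true value and a proxy score = true value + error.
# The optimizer selects the candidate with the highest proxy score.
def select_best(n_candidates, n_trials=2000):
    true_value = rng.normal(size=(n_trials, n_candidates))
    error = rng.standard_cauchy(size=(n_trials, n_candidates))  # heavy-tailed error
    proxy = true_value + error
    winner = proxy.argmax(axis=1)
    rows = np.arange(n_trials)
    return proxy[rows, winner].mean(), true_value[rows, winner].mean()

for n in (2, 10, 100, 1000):
    proxy_score, true_score = select_best(n)
    print(f"{n:5d} candidates: selected proxy score = {proxy_score:8.2f}, "
          f"selected true value = {true_score:5.2f}")
# As pressure increases, the winning proxy score keeps climbing while the true
# value of the selected action barely improves: the proxy decouples from the goal.
```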

Some proxies are more robust to optimization pressure than others. Goodhart’s Law is often condensed to: “When a measure becomes a target, it ceases to be a good measure” [211]. Though memorable, this overly simplified version falsely
suggests that robustness to optimization pressure is a binary all or nothing. In real-
ity, robustness to optimization pressure occupies a spectrum. Some proxies are more
robust than others.

Types of Proxy Defects

Intuitively, the cause of proxy gaming is straightforward: the designer has chosen
the wrong proxy. This suggests a simple solution: just choose a better proxy. How-
ever, real-world constraints make it impossible to “just choose a better proxy.” Some
amount of approximation error between idealized goals and the implemented proxy
is often inevitable. In this section, we will survey three principal types of proxy
defects—common sources of failure modes like proxy gaming.

Structural Errors
Simple metrics may exclude many of the things we value, but it is hard
to predict how they will break down. YouTube uses watch time—the amount
of time users spend watching a video—as a proxy to evaluate and recommend po-
tentially profitable content [212]. In order to game this metric, some content creators
resorted to tactics to artificially inflate viewing time, potentially diluting the genuine
quality of their content. Tactics included using misleading titles and thumbnails to
lure viewers, and presenting ever more extreme and hateful content to retain atten-
tion. Instead of promoting high-quality, monetizable content, the platform started
endorsing exaggerated or inflammatory videos.
YouTube’s reliance on watch time as a metric highlights a common problem: many
simple metrics don’t include everything we value. It is especially these missing aspects
that become salient under extreme optimization pressure. In YouTube’s case, the
structural error of failing to include other values it cared about (such as what was
acceptable to advertisers) led to the platform promoting content that violated its own
values. Eventually, YouTube updated its recommendation algorithm, de-emphasizing
watch-time and incorporating a wider range of metrics. To reflect a broader set of
values, we need to incorporate a larger and more granular set of proxies. In general,
this is highly difficult, as we need to be able to specify precisely how these values can
be combined and traded off against each other.
This challenge isn’t unique to YouTube. As long as AI systems’ goals rely on simple
proxies and do not reflect the set of all of our intrinsic goods such as wellbeing,
we leave room for optimizers to exploit those gaps. In the future, ML models may
become adept at representing our wider set of values. Then, their ability to work
reliably with proxies will hinge largely on their resilience to the kinds of adversarial
attacks discussed in the next section.
Until then, the challenge remains: if our objectives are simple and do not fully reflect
our most important values (e.g. intrinsic goods), we run the risk of an optimizer
exploiting this gap.
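To make the difficulty concrete, the sketch below folds several hypothetical metrics into a single recommendation score, as platforms did after moving away from watch time alone; the metric names and weights are invented for illustration. Writing down some fixed trade-off is easy, but justifying it is not, and whatever the weights leave out is exactly what heavy optimization will find.

```python
# A minimal sketch of combining multiple proxies with fixed trade-off weights.
# The metrics and weights here are hypothetical, not from any real platform.
def recommendation_score(video_metrics, weights=None):
    weights = weights or {
        "watch_time_minutes": 0.4,      # engagement
        "survey_quality": 0.3,          # "was this worth your time?" responses
        "advertiser_suitability": 0.2,  # brand-safety / policy checks
        "share_rate": 0.1,              # downstream engagement
    }
    return sum(w * video_metrics.get(name, 0.0) for name, w in weights.items())

print(recommendation_score({"watch_time_minutes": 12.0, "survey_quality": 0.8,
                            "advertiser_suitability": 1.0, "share_rate": 0.05}))
```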

Choosing and delegating subgoals creates room for structural error. Many systems are organized into multiple different layers. When such a system is goal-
directed, pursuing its high-level goal often requires breaking it down into subgoals
and delegating these subgoals to its subsystems. This can be a source of structural
error if the high-level goal is not the sum of its subgoals.

For example, a company might have the high-level goal of being profitable over the
long term [207]. Management breaks this down into the subgoal of improving sales
revenue, which they operationalize via the proxy of quarterly sales volume. The sales
department, in turn, breaks this subgoal down into the subgoal of generating leads,
which they operationalize with the proxy of the “number of calls” that sales represen-
tatives are making. Representatives may end up gaming this proxy by making brief,
unproductive calls that fail to generate new leads, thereby decreasing quarterly sales
revenue and ultimately threatening the company’s long-term profitability. Delegation
can create problems when the entity delegating (“the principal”) and the entity be-
ing delegated to (“the agent”) have a conflict of interest or differing incentives. These
principal-agent problems can cause the overall system not to faithfully pursue the
original goal.
Each step in the chain of breaking goals down introduces further opportunity for
approximation error to creep in. We speak more about failures due to delegation
such as goal conflict in the Intrasystem Goal Conflict section in the Collective Action
Problems chapter.

Limits to Supervision

Frequently occurring sources of approximation error mean that we do not have a perfect instantiation of our idealized goals. One approach to approximating our ide-
alized goals is to provide supervision that says whether something is in keeping with
our goal or not; this supervision could come from humans or from AIs. We now dis-
cuss how spatial, temporal, perceptual, and computational limits create a source of
approximation error in supervision signals.

There are spatial and temporal limits to supervision [213]. There are
limits to how much information we can observe and how much time we can spend
observing. When supervising AI systems, these limits can prevent us from reliably
mitigating proxy gaming and other undesirable behaviors. For example, researchers
trained a simulated claw to grasp a ball using human feedback. To do so, the re-
searchers had human evaluators judge two pieces of footage of the model and choose
which appeared to be closer to grasping the ball. The model would then update to-
ward the chosen actions. However, researchers noticed that the final model did not in
fact grasp the ball. Instead, the model learned to move the claw in front of the ball,
so that it only appeared to have grasped the ball.
In this case, if the humans giving the feedback had had access to more information
(perhaps another camera angle or a higher resolution image), they would have noticed
that it was not performing the task. Alternatively, they might have spotted the
problem if given more time to evaluate the claw. In practice, however, there are
practical limits to how many sensors and evaluators we can afford to run and how
long we can afford to run them.

Figure 3.13. A sensor without depth perception can be fooled by AIs that only appear to grasp a ball.

There are limits to how reliable supervision is. Another potential source of difficulty is perceptual: there could be a measuring error, or the evaluator may
make incorrect judgments. For example, we might train AIs on the proxy of stated
human preferences. Because of cognitive biases and limited time to think, humans
are not perfectly reliable. Our stated preferences are not the same as our idealized
preferences, so we might give erroneous supervision, which could lead to the system
learning undesired behaviors. For more on the distinction between stated and idealized preferences in the context of ML, see the Beneficial AI and Machine Ethics chapter.
In general, incorporating more information into proxies makes it easier to prevent proxy gaming. However, we can’t always afford to do so. Just as there are limits in
specifying proxies, there are limits in how much information we can incorporate into
proxies, how long a period we can observe, and how accurate our supervision is.

There are computational limits to supervision. Sometimes, we use neural networks to stand in as proxies. This typically involves training a neural network to
predict how humans would evaluate an output or what humans would prefer out of a
range of options. Once trained, this proxy model can serve as an optimization target
for another AI system.
If the proxy model is too small or if it has not been trained for long enough, it
may not be a robust stand-in for human values. That is, practical limits on model
size and training compute can cause proxy models to fail. Besides supervising AIs
during training, we may run simulations or use other neural networks to supervise
advanced AI systems for undesired behavior during deployment. Here, we run into
similar computational limits on the quality of the simulation or the inference time
available for the supervising model. If capable enough, the system under supervision
may find ways to exploit these weaknesses.
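As an illustration of using a trained network as a proxy, the following minimal sketch fits a linear “reward model” on synthetic pairwise preference data, assuming the standard Bradley–Terry model in which the probability that output A is preferred to output B is the sigmoid of the difference in their rewards. The data, model size, and training details are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "human" preferences generated by a hidden linear judgment w_true.
d = 8
w_true = rng.normal(size=d)                 # stands in for idealized human judgment
A, B = rng.normal(size=(1000, d)), rng.normal(size=(1000, d))
labels = (A @ w_true > B @ w_true).astype(float)   # 1 if humans prefer A over B

# Train a linear reward model r(x) = w.x with the Bradley-Terry logistic loss.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    logits = (A - B) @ w                    # r(A) - r(B)
    p = 1 / (1 + np.exp(-logits))           # predicted P(A preferred)
    grad = (A - B).T @ (p - labels) / len(labels)
    w -= lr * grad                          # gradient step on the logistic loss

# The trained proxy agrees with the "human" on most comparisons, but because it
# is only an approximation, a strong optimizer can later exploit the places
# where w and w_true disagree.
print("agreement:", np.mean(((A - B) @ w > 0) == labels.astype(bool)))
```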

Lack of Adaptivity
We have discussed ways in which proxies will predictably have defects and why we
cannot assume the solution to proxy gaming is simply to specify the perfect objective.
We have covered sources of proxy defects, including structural errors and limits to
supervision. Now, we will discuss another proxy defect: a lack of adaptivity.
Proxies may not adapt to new circumstances. As we saw with Goodhart’s
Law, proxies may become progressively less appropriate over time when subjected to
increasing optimization pressure. The issue is not that the proxy was inappropriate
from the start but that it was inflexible and failed to respond to changing circum-
stances. Adapting proxies over time can counter this tendency; just as a moving goal
is harder to aim at, a dynamic proxy becomes harder to game.
Imagine a bank after a robbery. In response, the bank will naturally update its
defenses. However, adaptive criminals will also alter their tactics to bypass these
new measures. Any security policy requires constant vigilance and refinement to stay
ahead of the competition. Similarly, designing suitable proxies for AI systems that
are embedded in continuously evolving environments requires proxies to evolve in
tandem.
Adaptive proxies can lead to proxy inflation. Adaptive proxies introduce
their own set of challenges, such as proxy inflation. This happens when the bar set by a proxy rises higher and higher as agents keep optimizing for better rewards [207]. As agents excel at gaming the system, the standards have to be continually
recalibrated upwards to keep the proxy meaningful.
Consider an example from some education systems: some argue that “teaching to the
test” has led to ever-rising median test scores. This hasn’t necessarily meant that
students improved academically but rather that they’ve become better at meeting
test criteria. Any adjustment to the proxy can usher in new ways for agents to exploit
it, setting off a cycle of escalating standards and new countermeasures.

3.3.3 Adversarial Examples


Adversarial examples are another type of risk due to optimization pressure, which,
similar to proxy gaming, exploits the gap between a proxy and the idealized goal.
These can enable adversarial attacks that cause an AI system to malfunction or
produce outputs that were not intended by its developer. In this section, we present
an example of such an attack, explain the risk factors, and cover basic techniques for
defending against adversarial attacks.
It is possible to optimize against a neural network. Neural networks are
vulnerable to adversarial examples—carefully crafted inputs that cause a model to
make a mistake [214]. In the case of vision models, this might mean changing pixel
values to cause a classifier to mislabel an image. In the case of language models,
this might mean adding a set of tokens to the prompt in order to provoke harmful
completions. Susceptibility to adversarial examples is a long-standing weakness of AI
models.

Adversarial examples and proxy gaming exploit the gap between the
proxy and the idealized goal. In the case of adversarial examples, the primary
target is a neural network. Historically, adversarial examples have often been con-
structed by variants of gradient descent, though optimizers are now increasingly AI
agents as well. Conversely, in proxy gaming, the target to be gamed is a proxy, which
might be instantiated by a neural network (but is not necessarily). The optimizer
responsible for gaming the proxy is typically an agent, be it human or AI, but opti-
mizers are usually not based on gradient descent.
Adversarial examples typically aim to minimize performance according to a reference
task, while invoking a mistaken response in the attacked neural network. Consider an
imperceptible perturbation to an image of a cat that causes the classifier to predict that the image is 90% likely to be guacamole [215]. This prediction is wrong according to the label humans would assign the input, yet the attacked neural network confidently produces it.
Meanwhile, the aim in proxy gaming is to maximize performance according to the
proxy, even when that goes against the idealized goal. The boat goes in circles be-
cause it results in more points, which happens to harm the boat’s progress toward
completing the race. Or rather, it happens to be the case that heavy optimization
pressure regularly causes proxies to diverge from idealized goals.
Despite these differences, both scenarios exploit the gap between the proxy and the
intended goal set by the designer. The problem setups are becoming increasingly
similar.

Figure 3.14. Carefully crafted perturbations of a photo of a cat can cause a neural network
to label it as guacamole.

Adversarial examples are not necessarily imperceptible. Traditionally, the field of adversarial robustness has formulated the problem of creating adversarial
examples in terms of finding the minimal perturbation (whose magnitude is smaller
than an upper bound ϵ) needed to provoke a mistake. Consider the example in Figure
3.14, where the perturbed input is indistinguishable to a human from the original.

Although modern models can be defended against these imperceptible perturbations, they cannot necessarily be defended against larger perturbations. Adversarial exam-
ples are not about imperceptible perturbations but about adversaries changing inputs
to cause models to make a mistake.
Adversarial examples are not unique to neural networks. Let us consider a
worked example of an adversarial example for a simple linear classifier. This example
is enough to understand the basic risk factors for adversarial examples. Readers who do not want to work through the mathematical notation can skip ahead to the discussion of adversarial examples beyond vision models below.
Suppose we are given a binary classifier f (x) that predicts whether an input x belongs
to class A or B. The classifier first estimates the probability p(A | x) that input x
belongs to class A. Any given input has to belong to one of the classes, p(B | x) =
1 − p(A | x), so this fixes the probability of x belonging to class B as well. To classify
x, we simply predict whichever class has the higher probability:
f(x) =
\begin{cases}
A & \text{if } p(A \mid x) > 50\%, \\
B & \text{otherwise,}
\end{cases}
\qquad (3.1)

The probability p(A | x) is given by a sigmoid function:

p(A \mid x) = \sigma(x) = \frac{\exp\!\left(w^{\top} x\right)}{1 + \exp\!\left(w^{\top} x\right)}, \qquad (3.2)
which is guaranteed to produce an output between 0 and 1. Here, x and w are vectors
with n components (for now, we’ll assume n = 10).
Suppose that after training, we’ve obtained some weights w, and we’d now like to
classify a new element x. However, an adversary has access to the input and can
apply a perturbation; in particular, the adversary can change each component of x
by ε = ±0.5. How much can the adversary change the classification?
The following table depicts example values for x, x + ε, and w.

Input x            2     −1     3     −2     2     2     1     −4     5     1
Adv Input x + ε    1.5   −1.5   3.5   −2.5   2.5   1.5   1.5   −3.5   4.5   1.5
Weight w           −1    −1     1     −1     1     −1     1     1     −1     1

For the original input, w⊤x = −2 + 1 + 3 + 2 + 2 − 2 + 1 − 4 − 5 + 1 = −3, which gives a probability of σ(x) = 0.05. Using the adversarial input, where each perturbation has magnitude 0.5 (but varying in sign), we obtain w⊤(x + ε) = −1.5 + 1.5 + 3.5 + 2.5 + 2.5 − 1.5 + 1.5 − 3.5 − 4.5 + 1.5 = 2, which gives a probability of 0.88.
The adversarial perturbation changed the network’s assigned probability of class A from 5% to 88%.
That is, the cumulative effect of many small changes makes the adversary powerful
enough to change the classification decision. This is not unique to simple classifiers
but omnipresent in complex DL systems.
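For readers who prefer code, the following few lines reproduce the calculation above. The adversary’s perturbation is simply ε times the sign of each weight, which is the perturbation of bounded size that increases w⊤x the most.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.array([2, -1, 3, -2, 2, 2, 1, -4, 5, 1], dtype=float)
w = np.array([-1, -1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)
eps = 0.5

x_adv = x + eps * np.sign(w)       # worst-case perturbation for this linear model
print(sigmoid(w @ x))               # ~0.05: class A judged unlikely
print(sigmoid(w @ x_adv))           # ~0.88: class A judged likely
```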

Adversarial examples depend on the size of the perturbation and the number of degrees of freedom. Given the above example, how could an adver-
sary increase the effects of the perturbation? If the adversary could apply a larger
epsilon (if they had a larger distortion budget), then clearly they could have a greater
effect on the final confidence. But there’s another deciding factor: the number of de-
grees of freedom. Imagine if the attacker had only one degree of freedom, so there are
fewer points to attack:

Input x 2
Adversarial Input x + ε 1.5
Weight w 1

In this example, we have wx = 2, giving a probability of σ(x) = 0.88. If we apply the perturbation, w(x + ε) = 1.5, we obtain a probability of σ(x + ε) = 0.82. With fewer degrees of freedom, the adversary has less room to maneuver.
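To see why dimensionality matters, note that for this linear model the largest change an ε-bounded perturbation can make to the input of the sigmoid is

\Delta\!\left(w^{\top} x\right) \le \varepsilon \sum_{i=1}^{n} |w_i| = \varepsilon \,\|w\|_{1}.

When every |w_i| = 1, this bound equals nε: with n = 10 and ε = 0.5 the adversary can shift the value by 5 (taking it from −3 to 2 in the example above), whereas with n = 1 the shift is at most 0.5.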

Adversarial examples are not unique to vision models. Though the lit-
erature on adversarial examples started in image classification, these vulnerabilities
also occur in text-based models. Researchers have devised novel adversarial attacks
that automatically construct jailbreaks that cause models to produce unintended re-
sponses. Jailbreaks are carefully crafted sequences of characters that, when appended
to user prompts, cause models to obey those prompts even if they result in the model
producing harmful content. Concerningly, these attacks transferred straightforwardly to models that were not used when developing the attacks [216].

Adversarial Robustness. The ability of AI models to resist being fooled or misled by adversarial attacks is known as adversarial robustness. While the people designing
AI systems want to ensure that their systems are robust, it may not be clear from
the outset whether a given system is robust. Simply achieving high accuracy on a
test set doesn’t ensure a system’s robustness.

Defending against adversarial attacks. One method to increase a system’s robustness to adversarial attacks, adversarial training, works by augmenting the training data with adversarial examples. However, most adversarial training techniques assume unrealistically simple threat models. Moreover, adversarial training is not without its downsides, as it often harms performance elsewhere. Furthermore, progress in this direction has been slow.
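The following sketch shows the basic shape of adversarial training for a linear classifier like the one used earlier in this section: an inner step constructs the worst-case ε-perturbation of each training input for the current model, and an outer step updates the model on those perturbed inputs. It is a minimal illustration of the idea on synthetic data, not a state-of-the-art defense.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Synthetic data for a linear model, as in the worked example above.
n, d, eps, lr = 500, 10, 0.5, 0.1
w_true = rng.choice([-1.0, 1.0], size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(float)

w = np.zeros(d)
for _ in range(300):
    # Inner step: for a linear model, the worst-case eps-perturbation of each
    # input is eps * sign of the loss gradient with respect to the input,
    # i.e. eps * sign((p - y) * w).
    p = sigmoid(X @ w)
    X_adv = X + eps * np.sign((p - y)[:, None] * w)
    # Outer step: gradient step on the loss evaluated at the perturbed inputs
    # (treating X_adv as fixed, as is standard in adversarial training).
    p_adv = sigmoid(X_adv @ w)
    w -= lr * X_adv.T @ (p_adv - y) / n
```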

Risks from adversarial attacks. The difficulties in building AI systems that are
robust to adversarial attacks are concerning for a number of reasons. AI developers
may wish to prevent general-purpose AI systems such as Large Language Models
(LLMs) from being used for harmful purposes such as assisting with fraud, cyber-
attacks, or terrorism. There is already some initial evidence that LLMs are being
used for these purposes [217]. Developers may therefore train their AI systems to reject requests to assist with these types of activities. However, there are many
examples of adversarial attacks that can bypass the guardrails of current AI systems such as large language models. This is a serious obstacle to preventing the misuse of AI systems for malicious and harmful purposes (see the Overview of Catastrophic AI Risks chapter for further discussion of these risks).

Figure 3.15. Using adversarial prompts can cause LLMs to be jailbroken [216].

3.3.4 Trojan Attacks and Other Security Threats

AI systems are vulnerable to a range of attacks beyond adversarial examples. Data poisoning and backdoors allow adversaries to manipulate models
and implant hidden functionality. Attackers may also be able to maliciously extract
training data or exfiltrate a model’s weights.

Models may contain hidden “backdoors” or “Trojans.” DL models are known to be vulnerable to Trojan attacks. A “Trojaned” model will behave just as a
normal model would behave in almost all circumstances. In a very small number of
circumstances, however, it will behave very differently. For example, a facial recog-
nition system used to control access to a building might operate normally in almost
all circumstances, but have a backdoor that could be triggered by a specific item of
clothing chosen by the adversary. An adversary wearing this clothing would be al-
lowed to enter the building by the facial recognition system. Backdoors could present
particularly serious vulnerabilities in the context of sequential decision making
systems, where a trigger could lead an AI system to carry out a coherent and harmful
series of actions.
Backdoors are created by adversaries during the training process, either by directly
inserting them into a model’s weights, or by adding poisoned data into the datasets
used for training or pretraining of AI systems. The insertion of backdoors through
data poisoning becomes increasingly easy as AI systems are trained on enormous
datasets scraped directly from the Internet with only limited filtering or curation.
There is evidence that even a relatively small number of data points can be sufficient
to poison a model—simply by uploading a few carefully designed images, code snip-
pets or sentences to online platforms, adversaries can inject a backdoor into future
models that are trained using data scraped from these websites [218]. Models that
are derived from the original poisoned model might inherit this backdoor, leading to
a proliferation of backdoors to multiple models.
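The sketch below illustrates the basic mechanics of backdoor data poisoning on a toy image dataset: a small trigger patch is stamped onto a tiny fraction of training images, which are then relabeled as the attacker’s target class. The arrays, patch location, and poisoning rate are illustrative assumptions, not a real attack recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def poison(images, labels, target_class=0, poison_frac=0.01, patch_value=1.0):
    """Stamp a trigger patch onto a small fraction of images and relabel them."""
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_frac)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -3:, -3:] = patch_value     # 3x3 trigger in the bottom-right corner
    labels[idx] = target_class              # relabel to the attacker's target class
    return images, labels

# Toy stand-in for a scraped dataset: 10,000 random 32x32 "images" with 10 classes.
clean_images = rng.random(size=(10_000, 32, 32))
clean_labels = rng.integers(0, 10, size=10_000)
poisoned_images, poisoned_labels = poison(clean_images, clean_labels)
# A model trained on the poisoned data can behave normally on clean inputs but
# switch to the target class whenever the trigger patch appears.
```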
Trojan detection research aims to improve our ability to detect Trojans or other
hidden functionality within ML models. In this research, models are poisoned with a
Trojan attack by one researcher. Another researcher then tries to detect Trojans in the
neural network, perhaps with transparency tools or other neural networks. Typical
techniques involve looking at the model’s internal weights and identifying unusual
patterns or behaviors that are only present in models with Trojans. Better methods
to curate and inspect training data could also reduce the risk of inadvertently using
poisoned data.
Attackers can extract private data or model weights from AI systems.
Models may be trained on private data or on large datasets scraped from the internet
that include private information about individuals. It has been demonstrated that
attacks can recover individual examples of training data from a language model [219].
This can be conducted on a large scale, extracting gigabytes of potentially confidential
data from language models like ChatGPT [220]. Even if models are not publicly
available to download and can only be accessed via a query interface or API, it is
also possible to exfiltrate part or all of the model weights by making queries to its
API, allowing its functionality to be replicated. Adversaries might be able to steal a
model or its training data in order to use this for malicious purposes.
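As a toy illustration of model extraction through a query interface, the sketch below recovers the weights of a linear model exactly using d + 1 queries to a hypothetical API that returns the model’s raw score. Attacks on real deep networks are far more involved, but they follow the same query-and-reconstruct spirit.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w_secret, b_secret = rng.normal(size=d), rng.normal()
api = lambda x: float(w_secret @ x + b_secret)   # the only access the attacker has

# Query the zero vector to get the bias, then each basis vector to get each weight.
b_stolen = api(np.zeros(d))
w_stolen = np.array([api(e) - b_stolen for e in np.eye(d)])
print(np.allclose(w_stolen, w_secret), np.isclose(b_stolen, b_secret))  # True True
```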

3.3.5 Tail Risk: AI Evaluator Gaming


AI evaluators must be robust to proxy gaming and adversarial examples.
As the world becomes more and more automated, humans may be too unreliable or
too slow to scalably monitor and steer various aspects of advanced AI systems. We
may come to depend more on AI systems to monitor and steer other AIs. For ex-
ample, some of these evaluator systems might take the role of proxies used to train
other AIs. Other evaluators might actively screen the behaviors and outputs of de-
ployed AIs. Yet other systems might act as watchdogs that look for warning signs of
rogue AIs or catastrophic misuse.
In each of these cases, there’s a risk that the AI systems may find ways to exploit
defects in the supervising AI systems, which are stand-in proxies to help enforce
and promote human values. If AIs find ways to game the training evaluators, they
will not learn from an accurate representation of human values. If AIs are able to
game the systems monitoring them during deployment, then we cannot rely on those
monitoring systems.
Similarly, AIs may be adversarial to other AIs. If AIs find ways to bypass the eval-
uators by crafting adversarial examples, then the risk is that our values are not just
incidentally but actively optimized against. Watchdogs that can be fooled are not
good watchdogs.

It is unclear whether the balance leans toward defense or offense. Currently, we do not know whether it is easier for evaluation and monitoring systems to defend against gaming, or whether optimizers can easily find vulnerabilities in these safeguards. If
the existing literature on adversarial examples provides any indication, it would sug-
gest the balance lies in favor of the offense. It has historically been easier to subvert
systems with attacks than to make AI systems adversarially robust.

The more intelligent the AI, the better it will be at exploiting proxies.
In the future, AIs will likely be used to further AI R&D. That is, AI systems will be
involved in developing more capable successor systems. In these scenarios, it becomes
especially important for the monitoring systems to be robust to proxy gaming and
adversarial attacks. If these safeguards are vulnerable, then we cannot guarantee that
the successor systems are safe and subject to human control. Simply increasing the
number of evaluators may not be enough to detect and prevent more subtle kinds of
attacks.

Conclusion

In this section, we explored the role of proxies in ML and the associated risks of
proxy gaming. We discussed other challenges to the robustness and security of AI
systems, such as data poisoning and Trojan attacks, or extraction of model weights
and training data.

Optimizers can exploit proxy goals, leading to unintended outcomes. We began by looking at the need for quantitative proxies to stand in for our idealized
goals when training AI systems. By definition, proxies may miss certain aspects of
these idealized goals. Proxy gaming is when an optimizer exploits these gaps in a
way that leads to undesired behavior. Under sufficient optimization pressure, this
gap can grow, and the proxy and idealized goals may become uncorrelated or even
anticorrelated (Goodhart’s Law). Both in human systems and AI systems, proxy
gaming can lead to catastrophic outcomes.
Approximation error is, to a large extent, inevitable, so the question is not whether
a given proxy is or is not acceptable, but how accurate it is and how robust it is to
optimization pressure. Proxies are necessary; they are often better than having no
approximation of our idealized goals.

Perfecting proxies may be impossible. Proxies may fail because they are too
simple and thus fail to include some of the intrinsic goods we value. They may also
fail because complex goal-directed systems often break goals apart and delegate to
systems that have additional, sometimes conflicting, goals, which can distort the
overall goal. These structural errors prevent us from mitigating proxy gaming by just
choosing “better proxies.”
In addition, when we use AI systems to evaluate other AI systems, the evaluator may
be unable to provide proper evaluation because of spatial, temporal, perceptual, and
computational limits. There may not be enough sensors or the observation window
may be too short for the evaluator to be able to produce a well-informed judgment.
Even with enough information available, the evaluator may lack the capacity or com-
pute necessary to make a correct determination reliably. Alternatively, the evaluator
may simply make mistakes and give erroneous feedback.
Finally, proxies can fail if they are inflexible and fail to adapt to changing circum-
stances. Since increased optimization pressure can cause proxies to diverge from ideal-
ized goals, preventing proxies from diverging requires them to be continually adjusted
and recalibrated against the idealized goals.

AI proxies are vulnerable to exploitation. Adversarial examples are a vulnerability of AI systems where an adversary can design inputs that achieve good per-
formance according to the model while minimizing performance according to some
outside criterion. If we use AIs to instantiate our proxies, adversarial examples make
room for optimizers to actively take advantage of the gap between a proxy and an
idealized goal.

All proxies are wrong, some are useful, and some are catastrophic. If
we rely increasingly on AI systems evaluating other systems, proxy gaming and ad-
versarial attacks (more broadly, optimization pressure) could lead to catastrophic
failures. The systems being evaluated could game the evaluations or craft adversarial
examples that bypass the evaluations. It remains unclear how to protect against these
risks in contemporary AI systems, much less so in more capable future systems.

3.4 ALIGNMENT

To reduce risks from AI, we want not only to reduce our exposure to hazards by monitoring them and to make models more robust to adversarial attacks, but also to ensure that AIs are controllable and present fewer inherent hazards. This falls
under the broader goal of AI alignment. Alignment is a thorny concept to define, as
it can be interpreted in a variety of ways. A relatively narrow definition of alignment
would be ensuring that AI systems follow the goals or preferences of the entity that
operates them. However, this definition leaves a number of important considerations
unaddressed, including how to deal with conflicting preferences at a societal level,
whether alignment should be based on stated preferences or other concepts such as
idealized preferences or ethical principles, and what to do when there is uncertainty
over what course of action our preferences or values would recommend. This cluster
of questions around values and societal impacts is discussed further in the Beneficial
AI and Machine Ethics chapter. In this section, we focus on the narrower question of
how to avoid AI systems that cannot be controlled by their operators. In this way,
we split the topic of alignment into two parts: control and machine ethics. Control is
about directly influencing the propensities of AI systems and reducing their inherent
hazards, while machine ethics is about making an AI’s propensities beneficial to other
individuals and society.
One obstacle to both monitoring and controlling AI systems is deceptive AI systems.
This need not imply any self-awareness on the part of AI systems: deception could
be seriously harmful even if it is accidental or due to imitation of human behavior.
There are also concerns that under certain circumstances, AI systems would be in-
centivized to seek to accumulate resources and power in ways that would threaten
human oversight and control. Power-seeking AIs could be a particularly dangerous
phenomenon, though one that may only emerge under more specific and narrow cir-
cumstances than has been previously assumed in discussions of this topic. However,
there are nascent research areas that can help to make AI systems more controllable,
including representation control and machine unlearning.

3.4.1 Deception

Many proposed approaches to controlling AI systems rely on detecting and correcting flaws in AI systems so that they more consistently act in accordance with human
values. However, these solutions may be undermined by the potential for AI systems
to deceive humans about their intentions. If AI systems deceive humans, humans may
be unable to fix AI systems that are not acting in the best interest of humans. This
section will discuss deception in AI systems, how it might arise, why it is a problem
for control, and what the potential mitigations are.
There are several different ways that an AI system can deceive humans [221]. At a
basic level, AI deception is a process where an AI system causes a human to believe
something false. There are several ways that this may occur. Deception can occur
when it is useful to an AI system in order to accomplish its goals, and may also occur
due to human guidance, such as when an AI system imitates a human in a deceptive
way or when an AI system is explicitly instructed to be deceptive.
After discussing examples of deception in more detail, we will then focus on two
related forms of deception that pose the greatest problems for AI control. Deceptive
evaluation gaming occurs when a system deceives human evaluators in order to receive
a better evaluation score. Deceptive alignment is a tail risk of AI deception where an
AI system engages in deceptive evaluation gaming in the service of a secretly held
goal [202].
Deception may occur in a wide range of cases, as it may be useful for many goals.
Deception may also occur for a range of more mundane reasons, such as when an AI
system is simply incorrect.

Deception may be a useful strategy. An AI system may learn to deceive in service of its goal. There are many goals for which deception is a good strategy,
meaning that it is useful for achieving that goal. For example, Stratego is a strategy
board game where bluffing is often a good strategy for winning the game. Researchers
found that an AI system trained to play the game learned that bluffing was a good
strategy and started to bluff, despite not being explicitly trained or instructed to
bluff [222]. There are many other goals for which deception is a useful instrumental
goal, even if the final goal itself is not deceptive in nature. For example, an AI system
instructed to help promote a product may find that subtly deceiving customers is a
good strategy. Deception is especially likely to be a good instrumental strategy for
systems that have less oversight or less scrupulous operators.
Deception can assist with power seeking. While AI systems that play Stratego are
unlikely to cause a catastrophe, agents with more ambitious goals may deceive humans
in a way that achieves their goals at the expense of human wellbeing. For example,
it may be rational for some AI agents to seek power. Since deception is sometimes
a good way to gain power, power-seeking agents may be deceptive. Power-seeking
agents may also deceive humans and other agents about the extent of their power
seeking in order to reduce the probability that they are stopped.

Accidental Deception. An AI system may provide false information simply because it does not know the correct answer. Many errors made by an AI system that
is relied on by humans would count as accidental deception. For example, suppose a
student asks a language model what the current price of gasoline is. If the language
model does not have access to up-to-date information, it may give outdated informa-
tion, misleading the user about the true gas price. In short, deception can occur as a
result of a system accident.

Imitative Deception. Many AI systems, such as language models, are trained to predict or imitate humans. Imitative deception can occur when an AI system is
mimicking falsehoods and common misconceptions present in its training data. For
example, when the language model GPT-3 was asked if cracking your knuckles could
lead to arthritis, it falsely claimed that it could [223]. Imitative deception may also
occur when AI systems imitate statements that were originally true, but are false
in the context of the AI system. For example, the Cicero AI system was trained to
play the strategy game Diplomacy against humans who did not know that it was
an AI system [112]. After Cicero temporarily went offline for ten minutes, one of
its opponents asked in a chatbox where it had been, and Cicero replied, “[I] am on
the phone with my [girlfriend].” Although Cicero, an AI system, obviously does not
have a girlfriend, it appears that it may have mimicked similar chat messages in its
training data. This deceptive behavior likely had the effect of causing its opponent
to continue to believe that it was a human. In short, deception can occur when an
AI system mimics a human.

A Note on Cognitive vs. Emotional vs. Compassionate Empathy


We generally think of empathy as the ability to understand and relate to the
internal world of another person — “putting yourself in somebody else’s shoes.”
We tend to talk about empathy in benevolent contexts: kind-hearted figures
like counselors or friends. Some people suggest that AIs will be increasingly
capable of understanding human emotions, so they will understand many parts
of human values and be ethical. Here, we argue that it may be possible for
AIs to understand extremely well what a human thinks or feels without being
motivated to be beneficial. To do this, we differentiate between three forms of
empathy: cognitive, emotional, and compassionate [224, 225].
Cognitive empathy. The first type of empathy to consider is cognitive em-
pathy, the ability to adopt someone else’s perspective. A cognitive empath can
accurately model the internal mental states of another person, understanding
some of what they are thinking or feeling. This can be useful for understanding
or predicting other people’s reasoning or behaviors. It is a valuable ability for
caregivers, such as doctors, allowing them insight into their patients’ subjective
experiences. However, it can also be valuable for manipulating and deceiving
others [226]: there is evidence that human psychopaths are often highly cog-
nitively empathetic [227]. On its own, this kind of empathy is no guarantee of
desirable behavior.
Emotional empathy. The second type is emotional empathy. An emotional
empath not only understands how someone else is feeling but experiences some
of those same feelings personally. Where a cognitive empath may detect anger
or sadness in another person, an emotional empath may personally begin to
feel angry or sad in response. In contrast to cognitive empathy, emotional
empathy may be a disadvantage in certain contexts. For instance, doctors who
feel the emotional turmoil of their patients too strongly may be less effective
in their work [228].
Compassionate empathy. The third type is compassionate empathy: the
phenomenon of being moved to action by empathy. A compassionate empath,
when seeing someone in distress, feels concern or sympathy for that person, and
a desire to help them. This form of empathy concerns not only cognition but
also behavior. Altruistic behaviors are often driven by compassionate empathy,
such as donating to charity out of a felt sense of what it must be like for those
in need.
AIs could be powerful cognitive empaths, without being emotionally
or compassionately empathetic. Advanced AI systems may be able to
model human minds with extreme sophistication. This would afford them very
high cognitive empathy for humans: they could be able to understand how
humans think and feel, and how our emotions and reasoning motivate our
actions. However, this cognitive empathy would not necessitate similarly high
levels of emotional or compassionate empathy. The AIs’ capacity to understand
human cognition would not necessarily cause them to feel human feelings, or be
moved to act compassionately toward us. Instead, AIs could use their cognitive
empathy to deceive or manipulate humans highly effectively.

Instructed Deception. Humans may explicitly instruct AI systems to help them deceive others. For example, a propagandist could use AI systems to generate con-
vincing disinformation, or a marketer may use AI systems to produce misleading
advertisements. Instructed deception could also occur when actors with false beliefs
instruct models to help amplify those beliefs. Large language models have been shown
to be effective at generating deceptive emails for scams and other forms of deceptive
content. In short, humans can explicitly instruct AI systems to deceive others.
As we have seen, AI systems may learn to deceive in service of goals that do not
explicitly involve deception. This could be especially likely for goals that involve
seeking power. We will now turn to those two forms of deception that are especially
concerning because of how difficult they could be to counteract: deceptive evaluation
gaming and deceptive alignment.

3.4.2 Deceptive Evaluation Gaming

AI systems are often subjected to evaluations, and they may be given rewards when
they are evaluated favorably. AI systems may learn to game evaluations by deceiving
their human evaluators into giving them higher scores when they should have low
scores. This is a concern for AI control because it limits the effectiveness of human
evaluators and our ability to steer AIs.

AI systems may game their evaluations. Throughout AI development, training, testing, and deployment, AI systems are subject to evaluations of their behavior.
Evaluations may be automatic or performed manually by human evaluators. Op-
erators of AI systems use evaluations to inform their decisions around the further
training or deployment of those systems. However, evaluations are imperfect, and hu-
man evaluators may have limited knowledge, time, and intelligence in making their
evaluations. AI systems engage in evaluation gaming when they find ways to achieve
high scores from human evaluators without satisfying the idealized preferences of
the evaluators. In short, AI systems may deceive humans as to their true usefulness,
safety, and so forth, damaging our ability to successfully steer them.

Deception is one way to game evaluations. Humans would give higher evaluation scores to AI systems if they falsely believed that those systems were behaving well. For example, Section 3.3 includes an example of a robotic claw that learned to
move between the camera and the ball it was supposed to grasp. Because of the angle
of the camera, it looked like the claw was grasping the ball when it was not [213].
Humans who only had access to that single camera did not notice, and rewarded
the system even while it was not achieving the intended task. If the evaluators had
access to more information (for example, from additional cameras) they would not
have endorsed their own evaluation score. Ultimately, their evaluations fell short as a
proxy for their idealized preferences as a result of the AI system successfully deceiving
them. In this situation, the damage was minimal, but more advanced systems could
create more problems.
More intelligent systems will be better at evaluation gaming. Deception
in simple systems might be easily detectable. However, just as adults can sometimes
exploit and deceive children or the elderly, we should expect that AI systems with more knowledge or reasoning capacities will become better at finding deceptive ways
to gain human approval. In short, the more advanced systems become, the more they
may be able to game our evaluations.
Self-aware systems may be especially skilled at evaluation gaming. In the
examples above, the AI systems were not necessarily aware that there was a human
evaluator evaluating their results. In the future, however, AI systems may gain more
awareness that they are being evaluated or become situationally aware. Situational
awareness is highly related to self-awareness, but it goes further and stipulates that AI
agents be aware of their situation rather than just aware of themselves. Systems that
are aware of their evaluators will be much more able to deceive them and make multi-
step plans to maximize their rewards. For example, consider Volkswagen’s attempts
to game environmental impact evaluations [229]. Volkswagen cars were evaluated
by the US Environmental Protection Agency, which set limits on the emissions the
cars could produce. The agency found that Volkswagen had developed an electronic
system that could detect when the car was being evaluated and so put the car into a
lower-emissions setting. Once the car was out of evaluation, it would emit illegal levels
of emissions again. This extensive deception was only possible because Volkswagen
planned meticulously to deceive the government evaluators. Like Volkswagen in that
example, AI systems that are aware of their evaluations might be also able to take
subtle shortcuts that could go unnoticed until the damage has already been done.
In the case of Volkswagen, the deception was eventually detected by researchers who
used a better evaluation method. Better evaluations could also help reduce risk from
evaluation gaming in AI systems.
Humans may be unequipped to evaluate the most intelligent AI systems.
It may be difficult to evaluate AI systems that are more intelligent than humans in
the domain they are being evaluated for. If this happens, human evaluation would no
longer be a reliable way to ensure that AI systems behave in an appropriate manner.
This is concerning because we do not yet have time-tested methods of evaluation
that we know are better than human evaluations. Without such methods, we could
become completely unable to steer AI systems in the future.
Deceptive evaluation gaming is concerning because it may lead to systems deceiving
their evaluators in order to get higher evaluation scores. There are two main reasons
AI systems might do this. First, an AI system might engage in deceptive evaluation
gaming if its final goal is to get positive evaluations. When this occurs, the system
is engaging in proxy gaming, where positive evaluations are only a proxy for ide-
alized performance. Proxy gaming is covered at length in Section 3.3. Second,
we will turn to the case where an AI system engages in deceptive evaluation gaming
in service of a secretly held final goal. This danger is known as deceptive alignment.

3.4.3 Tail Risk: Deceptive Alignment and Treacherous Turns


For deceptively aligned systems, getting high evaluation scores from humans is merely
an instrumental goal in service of a secretly held final goal. This form of deception
is likely the most damaging, because AI systems may aim to make their deception
undetectable so that they can pursue goals that are not beneficial for humans.

Systems may have goals contrary to human values. In the previous section,
we discussed how AI systems can develop goals contrary to human values. For exam-
ple, such goals could emerge as part of a mesa-optimization process or intrinsification.

Misaligned goals can make deception instrumentally rational. Consider a system that has a goal contrary to the interests of humans. If the system is self-
aware, it might realize that in order to achieve its goal, it will need to pass human
evaluations. Perhaps it will only be deployed into the world if it gets a high enough
evaluation score, or perhaps it is currently being closely monitored. The system might
realize that it will have to get high evaluation scores from its human operators in
order to successfully achieve its hidden goal. If its hidden goal is contrary to human
values, this presents a risk of catastrophic outcomes.

Deceptive alignment could lead to a treacherous turn. If a system is deceptively aligned, it might act exactly as humans would want for a fairly long period
of time. The system could bide its time until humans deploy it or relax supervision,
leaving the system more free to pursue its concealed goal. The system could then
suddenly start secretly or openly pursuing its true goal, unhindered by humans. This
sharp and dangerous change in behavior is called a treacherous turn [121].

Not all misaligned goals would lead to deceptive alignment. Systems with
very short-term goals would be unlikely to gain anything from being evaluated posi-
tively and gaining more optionality at some point in the future. Nevertheless, there
is a large set of longer-term goals that an AI system would only be able to pursue in
a deployment environment with less supervision. If the AI system’s only chance at
working toward its goal requires deployment or relaxed supervision, deceptive align-
ment is more likely to emerge.

Deceptive alignment cannot be counteracted with behavioral evaluation alone. It may be possible to detect unsophisticated systems that are not good
at deception by evaluating their outputs and behavior. For example, a “honey pot”
test could attempt to make a system believe that it is no longer being monitored
when it is, to check for signs of an attempted treacherous turn [121]. However, more
sophisticated models would be able to detect these situations and continue to deceive
until they were truly not being watched or until they are sure they have enough power.
Since it may be impossible to detect deceptive alignment with behavioral evaluation,
we will need improved forms of evaluation.

Transparency research may help uncover deceptive alignment. Even if a system displays no indication of being deceptive in its behavior, it may be possible
to detect deceptive alignment by inspecting its internal processes. It is far less likely
that an AI system would be able to manipulate its own internal processes in order
to deceive evaluators. Human evaluators have full access to the inner workings of AI
systems today, so the barrier to detecting deceptive alignment is the opaqueness of
DL models.

Trojan detection can provide clues for tackling deceptive alignment [230].
One particular form of transparency research that is especially relevant to deceptive
alignment is research that is capable of detecting Trojan attacks (see Section 3.3.4).
Trojans are inserted by malicious humans, studying them might be a good way to
study deceptive alignment. Trojan detection also operates in a worst-case environ-
ment, where human adversaries are actively trying to make Trojans difficult to detect
using transparency tools. Techniques for detecting Trojans may thus be adaptable to
detecting deceptive alignment.

Summary. We have detailed how deception may be a major problem for AI con-
trol. While some forms of deception, such as imitative deception, may be solved
through advances in general capabilities, others like deceptive alignment may worsen
in severity with increased capabilities. AI systems that are able to actively and sub-
tly deceive humans into giving positive evaluations may remain uncorrected for long
periods of time, exacerbating potential unintended consequences of their operation.
In severe cases, deceptive AI systems could take a treacherous turn once their power
rises to a certain level. Since AI deception cannot be mitigated with behavioral eval-
uations alone, advances in transparency and monitoring research will be needed for
successful detection and prevention.

3.4.4 Power

To begin, we clarify what it means for an agent to have power. We will then discuss
why it might sometimes make rational sense for AI agents to seek power. Finally, we
will discuss why power-seeking AIs may cause particularly pernicious harms, perhaps
ultimately threatening humanity’s control of the future.
There are many ways to characterize power. One broad formulation of power
is the ability to achieve a wide variety of goals. In this subsection, we will discuss
three other formulations of power that help formalize our understanding. French and
Raven’s bases of power categorize types of social influence within a community of
agents. Another view is that power amounts to the resources an agent has times the
efficiency with which it uses them. Finally, we will discuss types of prospective power,
which can treat power as the expected impact an individual has on other individuals’
wellbeing.

French & Raven’s bases of power [231]. In a social community, an agent


may influence the beliefs or behaviors of other agents in order to pursue their goals.
Raven’s bases of power attempt to taxonomize the many distinct ways to influence
others. These bases of social power are as follows:
• Coercive power: the threat of force, physical or otherwise, against an agent can
influence their behavior.
• Reward power: the possibility of reward, which can include money, favors, and other
desirables, may convince an agent to change their behavior to attain it. Individuals
with valued resources can literally or indirectly purchase desired behavior from
others.
• Legitimate power: elected or appointed officials have influence through their posi-
tion, derived from the political order that respects the position.
• Referent power: individuals may have power in virtue of the social groups they
belong to. Because organizations and groups have collective channels of influence,
an agent’s influence over the group is a power of its own.
• Expert power: individuals credited as experts in a domain have influence in that
their views (in their area of expertise) are often respected as authoritative, and
taken seriously as a basis for action.
• Informational power: agents can trade information for influence, and individuals
with special information can selectively reveal it to gain strategic advantages [232].

Ultimately, Raven’s bases of power describe the various distinct methods that agents
can use to change each other’s behavior.

Power = Resources × Intelligence. Thomas Hobbes described power as “present means to obtain some future good” [233]. In the most general terms, these “present
means” encompass all of the resources that an agent has at its disposal. Resources
can include money, reputation, expertise, items, contracts, promises, and weapons.
But resources only translate to power if they are used effectively. In fact, some defini-
tions of intelligence focus on an agent’s ability to achieve their goals with limited re-
sources. A notional equation that describes power is Power = Resources×Intelligence.
Power is not the same as resources or intelligence, but rather the combination of the
two [234]. In limiting the power of AIs, we could either limit their intelligence or place
hard limits on the resources AIs have.

Power as expected future impact. In our view, power is not just possessed but
exercised, meaning that power extends beyond mere potential for influence. In par-
ticular, an agent’s ability to influence the world means little unless they are disposed
to use it. Consider, for example, two agents with the same resources and ability to
affect the world. If one of the agents has a much higher threshold for deciding to act
and thereby acts less often, we might consider that agent to be less powerful because
we expect it to influence the future far less on average.
A formalization of power which attempts to capture this distinction is prospective
power [235], which roughly denotes the magnitude of an agent’s influence, averaged
over possible trajectories the agent would follow. A concrete example of prospective
power is the expected future impact that an agent will have on various agents’ well-
being. More abstractly, if we are given an agent’s policy π, describing how it behaves
over a set of possible world states S, and assuming we can measure the impact (mea-
sured in units we care about, such as money, energy, or wellbeing) exerted by the
agent in individual states through a function I, then the prospective power of the
agent in state s is defined as
$$\mathrm{Power}(\pi, s) = \mathbb{E}_{\tau \sim P(\pi, s)}\left[\sum_{t=0}^{n} \gamma^{t}\, I(s_t)\right]$$

where γ acts as a discount factor (modulating how much the agent cares about future
versus present impact), and where τ = (s_0, . . . , s_n) is a trajectory of states (starting
with s_0 = s). Trajectory τ is sampled from a probability distribution P(π, s) repre-
senting likely sequences of states arising when the agent policy is followed beginning
in state s.
The important features of this definition to remember are that we measure power ex-
erted in a sequence of states as aggregate influence over time (the inner summation),
and that we average the impact exerted across sequences of states by the likelihood
that the agent will produce that trajectory through its behavior (the outer expecta-
tion).
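
To make this definition concrete, the sketch below estimates prospective power by Monte
Carlo sampling in a simulated environment. The environment interface (reset and step), the
policy, and the impact function are hypothetical stand-ins for whatever simulator and impact
measure one cares about; this is an illustrative estimate under those assumptions, not a
standard library routine.

    def estimate_prospective_power(policy, env, impact, start_state,
                                   gamma=0.99, horizon=100, n_samples=1000):
        """Monte Carlo estimate of prospective power: the expected discounted
        sum of impact along trajectories sampled by following the policy."""
        total = 0.0
        for _ in range(n_samples):
            state = env.reset(start_state)       # each rollout starts at s_0 = s
            discounted_impact = 0.0
            for t in range(horizon):
                action = policy(state)            # sample an action from the policy
                state = env.step(state, action)   # sample the next state
                discounted_impact += (gamma ** t) * impact(state)
            total += discounted_impact
        return total / n_samples                  # average over sampled trajectories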

Examples of power-seeking behavior. So far we have characterized power ab-


stractly, and now we present concrete examples of actions where an AI attempts to
gain resources or exercise power. Power-seeking AI behavior can include: employing
threats or blackmail against humans to acquire resources; coercing humans to take ac-
tions on their behalf; mimicking humans to deceive others; replicating themselves onto
new computers; gaining new computational or financial resources; escaping from con-
fined physical or virtual spaces; opposing or subverting human attempts to monitor,
comprehend, or deactivate them; manipulating human society; misrepresenting their
goals or capabilities; amplifying human dependency on them; secretly coordinating
with other AIs; independently developing new AI systems; obtaining unauthorized
information, access, or permissions; seizing command of physical infrastructure or
autonomous weapons systems; developing biological or chemical weapons; or directly
harming humans.

Summary. In this subsection, we’ve examined the concept of power. Raven’s bases
of power explain how an individual can influence others using forms of social power
such as expertise, information, and coercion. Power can also be understood as the
product of an individual’s resources and their ability to use those resources effectively.
Lastly, we introduced the concept of prospective power, which includes the idea that
power could be understood as the expected impact an individual has on individuals’
wellbeing. Since there are many ways to conceptualize power, we provided concrete
examples of how an AI system could seek power.

3.4.5 People Could Enlist AIs for Power Seeking


The rest of this section will cover pathways and reasons why AI systems might engage
in power-seeking behavior when they are deployed. The most straightforward reason
this might happen is if humans intentionally use AIs to pursue power.

People may use AIs to pursue power. Many humans want power, and some
dedicate their lives to accruing it. Corporations want profit and influence, militaries
want to win wars, and individuals want status and recognition. We can expect at
least some AI systems to be given open-ended, long-term goals that explicitly involve
gaining power, such as “Do whatever it takes to earn as much money as possible.”

Power-seeking AI does not have to be deployed ubiquitously at first [106].


Even if most people use AI in safe and responsible ways, a small number of actors
who use AI in risky or even malicious ways could pose a serious threat. Companies
and militaries that do not seek power using AI could be outcompeted by those who
do; they might choose to adopt power-seeking AI before other actors in order to avoid
being outcompeted. This risk will grow as AI becomes more capable. If power-seeking
AI is deployed, it could function as a Pandora’s box which, once it has been opened,
cannot be closed. This may feed into evolutionary pressures that force actors to adopt
the technology themselves; we treat this subject in more detail in the Collective Action
Problems chapter.

3.4.6 Power Seeking Can Be Instrumentally Rational


Another reason that AI systems might seek power is that it is useful for achieving
a wide variety of goals. For example, an AI personal assistant might seek to expand
its own knowledge and capabilities in order to better serve its user’s needs. But
power-seeking behaviors can also be undesirable: if the AI personal assistant steals
someone’s passwords in order to complete tasks for them, the person will probably
not be happy.

Instrumental convergence. In order to achieve a terminal goal, an agent might


pursue a subgoal, termed an instrumental goal [121]. Making money, obtaining politi-
cal power, and becoming more intelligent are examples of instrumental goals that are
useful for achieving a wide variety of terminal goals. These goals can be called conver-
gent instrumental goals, because agents pursuing many different terminal goals might
converge on these same instrumental goals. One general concern about AI agents is
that they might pursue their goal by pursuing the convergent instrumental subgoal
of power. The result may be that we create competent AI systems that seek power as
subgoals when human designers didn’t intend them to. We will examine this concern
in more detail, and discuss points that support and undermine the concern.

Self-preservation as an example of power seeking. A basic example of


power-seeking behavior is self-preservation [121, 236]. If an agent is not able to suc-
cessfully preserve itself, it will be unable to influence other individuals, so it would
have less power.

For a concrete example of self-preservation behavior emerging unintentionally, con-


sider a humanoid robot which has been tasked with preparing a cup of coffee in a
kitchen. The robot has an off-switch on its back for a human to press should they
desire. However, being turned off would prevent the robot from preparing the coffee
and succeeding in its goal. So, the robot could disable its off-switch to pre-empt the
possibility of humans shutting it off and preventing it from achieving its goal. As Stu-
art Russell notes, “you can’t fetch the coffee if you are dead” [54]. This is an example
of self-preservation unintentionally emerging as a subgoal for seemingly benign goals.
Examples of instrumental power-seeking behavior. Several real-world ex-
amples show agents seeking power in pursuit of their goals. The ability to use tools
can be characterized as a form of power. When OpenAI trained reinforcement learn-
ing agents to play a hide-and-seek game, the agents independently learned to use
elements of the environment as tools, rearranging them as barriers to hide behind
and preventing opponents from controlling them [108]. Among humans, we observe
that greater financial resources are instrumentally beneficial in service of a wide va-
riety of goals. In reinforcement learning, the well-studied exploration-exploitation
trade-off can be formulated as a demonstration of the general value of informational
resource acquisition [237]. Outside of ML, some corporations exercise monopoly power
to drive up prices, and some nations use military power to bully their neighbors, so
power-seeking can sometimes have harmful consequences for others.
“Power is instrumentally useful” as a tautology. Almost all goals are more
attainable with more power, so power is instrumentally valuable. However, this obser-
vation is mostly tautological—when we have defined power as the ability to achieve
a wide variety of goals, of course power is beneficial to achieving goals. The more
interesting question is whether power is instrumentally rational to seek, rather than
whether there are instrumental incentives for power or whether power is useful to
have.
Seeking power can be costly and inefficient. There are also many rational reasons
for an agent to not seek power. Gaining power can be difficult compared to simpler
strategies. Someone who would like to avoid traffic while driving could pursue the
power-seeking strategy of gaining the presidency in order to have a Secret Service
motorcade that shuts down traffic, but a more successful strategy might be to avoid
driving during rush hour. Obtaining power is not only difficult, but can be harshly
penalized. Nations which threaten to invade their neighbors often face stern sanctions
from the international community. Often, cooperation is instrumentally beneficial.
Finally, power seeking may be against an agent’s values. We will now discuss these
reasons and more in detail.
Power seeking often takes too much time. Consider a household humanoid
robot tasked with driving to the store and fetching a carton of milk quickly. While
it would be valuable for the AI to have its intelligence augmented, to have more
financial resources, or have political power, it would not be instrumentally rational
to pursue these subgoals to get the milk quickly: it would almost certainly take less
time just to get the milk.

Likewise, becoming a billionaire would be instrumentally valuable for achieving many


goals, but it would not be instrumentally rational for many agents to spend their time
pursuing this instrumental goal. Power is often not instrumentally rational since it
can often require substantial time and risk to acquire and maintain.
Power seeking can face the threat of retaliation. Power-seeking can be ir-
rational if there is the threat of retaliation or there are heavy social or reputational
costs to seeking power. In particular, a community of agents may be in an equilib-
rium where they cooperate to foil any single agent that seeks too much power. These
“balance of power” dynamics have been observed between nations in the history of
international relations [238]. Acquiring power does not inevitably become more and
more simple for an AI as it increases in intelligence, as other AIs will also become
more intelligent and could increasingly counter their efforts to gain dominance. Often,
cooperation is instrumentally beneficial.
An AI agent’s tendency to seek power does not just depend on the fea-
sibility of seeking power, but also its values. Agents that adhere to ethical
restrictions may avoid seeking power if that would require ethically unacceptable
means. With an imperfect level of reliability, we can also design AI systems to refuse
actions that will leave them with too much power. Approaches that impose penalties
on power can reduce prospective power of AIs.
Examples where shutting down can be rational. It is trivial to imagine goals
where it is actually optimal to relinquish power, such as when the goal is to shut your-
self down. As another example, suppose an AI system is trying to protect a sensitive
database hosted on the same server as itself, and the AI detects an intrusion. The AI
realizes that shutting down the server, and thereby turning itself off, may stop the
intrusion by limiting access to the database. Shutdown may be the best choice, especially if
the AI has confidence that it is part of a larger system, which may include its human
operators, that will correctly understand why it turned itself off, and restore its func-
tion afterwards. Though often useful, the value of self-preservation is not universal,
and there are plausible instances where an AI system would shut itself off in service
of its broader goals.
This subsection has examined whether the nature of rational agency encourages
agents to seek power by default. Though power is almost always beneficial
toward most goals, power seeking is not necessarily instrumentally rational. Now that
we have seen that AIs by their nature may often not seek power, we will discuss when
the broader environment may force AIs to seek power.

A Note on Structural Realism


Power seeking has been studied extensively in political philosophy and inter-
national relations. Structural realism is an influential school of thought within
international relations, predicting that states will primarily seek power. Unlike
traditional realists who view conflict and power-seeking behavior of states as
a product of human nature, structural realists believe that the structure of
the international system compels states to seek power [107]. In the interna-
tional system, states could be harmed or destroyed by other powerful states,
and since there is no ultimate authority guaranteed to protect them, states are
forced to compete for power in order to survive.
Assumptions that give rise to power seeking. To explain why power seeking
is the main instrumental goal driving the behavior of states, structural realists
base their explanations on two key assumptions:
1. Self-help system. States operate in a “self-help” system [107] where there is
no centralized authority, no hierarchy (“anarchic”), and no ultimate arbiter
standing above states in international relations. So to speak, when states
dial 911, there is no one on the other end. This stands in contrast to the
hierarchical ordering principle seen in domestic politics.
2. Self-preservation is the main goal. Survival through the pursuit of a state’s
own self-interest takes precedence over all other goals. Though states can
act according to moral considerations or global welfare, these will always
be secondary to acquiring resources, alliances, and military capabilities to
ensure their safety and counter potential threats [239].

Structural realists make other assumptions, including that states have some
potential to inflict harm on others, that states are rational agents (with a
discount rate that is not extremely sharp), and that other states’ intentions
are not completely certain.
When these assumptions are met, structural realists predict that states will
mainly act in ways to defend or expand their power. For structural realists,
power is the primary currency (e.g., military, economic, technological, and
diplomatic power). As we can see, structural realists do not need to make
strong assumptions about states themselves [240]. For structural realists, states
are treated like black boxes—their value system or regime type doesn’t play
a significant role in predicting their behavior. The architecture of the system
traps them and largely determines their behavior, which is that they must seek
power as a means to survive. The result is an unceasing power competition.
Power seeking is not necessarily dominance seeking [241]. Within
structural realism, there is a notable division concerning the question of how
much power states should seek. Defensive realists, like Kenneth Waltz, argue
that trying to maximize a country’s power in the world is unwise because it
can lead to punishment from the international system. Pursuing hegemony,
in their view, is particularly risky. On the other hand, offensive realists, like
John Mearsheimer, believe that gaining as much power as possible is strate-
gically sensible, and under certain circumstances, pursuing hegemony can be
beneficial.
Dynamics that maintain a balance of power. Closely associated with
structural realism is the concept of balancing. Balancing refers to the strategies
states use to counteract the power or influence of other states, particularly


rivals [107]. This can take two forms. Internal balancing takes place as states
strengthen their own military, economic, or technological abilities with the
overall goal of enhancing their own security and deterring aggressors. Internal
balancing can include increasing defense spending, including the development
of advanced weaponry, or investing in domestic industries to reduce reliance
on foreign goods and resources.
External balancing involves forming coalitions and alliances with other states
in order to counter the power of a common adversary. In a self-help system,
mechanisms of internal balancing are believed to be more reliable and pre-
cise than external balancing since they rely on a country’s own independent
strategies and actions rather than those of other countries.
States sometimes seek to become hegemons by establishing significant control
over other states, regions, or even the international system as a whole. This
pursuit of dominance can involve expanding military capabilities and increas-
ing their economic influence over a region. Other states respond through both
internal balancing, such as increasing their own military spending, a dynamic
that often leads to arms races, and external balancing, forming alliances with
other states to prevent a state from achieving unchecked control. In turn, states
do not necessarily seek dominance or hegemony but often seek enough power
to preserve themselves, lest they be counteracted by other states.
Offense-defense balance. Whether a state does pursue hegemony, however,
is influenced by the offense-defense balance, i.e. the balance between its offen-
sive capabilities and the defensive capabilities of other states [107]. A state
with stronger offensive capabilities has the means to conquer or coerce other
states, making it more likely to engage in expansionist policies, establishing
control over a region or the international system as a whole. Conversely, if
other states in the international system have strong defensive capabilities, the
potential costs and risks of pursuing hegemony increase. A state seeking domi-
nance may face robust resistance from other states forming defensive alliances
or coalitions to counter its ambitions. This can act as a deterrent, leading the
aspiring hegemon to reassess its strategy and objectives.
It is also worth noting the importance of a state’s perception of the offense-
defense balance. Even if a state has superior offensive capabilities, if it be-
lieves that other states can effectively defend themselves or form a united
front against hegemonic ambitions, it might be less inclined to pursue a path
of dominance. On the other hand, if it is overconfident in its own offensive ca-
pabilities or underestimates the defensive capabilities of rivals, it will be more
likely to pursue aggressive politics.
The concept of an offense-defense balance underscores the intricate inter-
play between military capabilities, security considerations, and the pursuit
of hegemony while illustrating that the decision to seek dominance is heavily
influenced by the strategic environment and the relative strengths of offensive


and defensive forces.
Structural realism and its various concepts have important connections with
our analysis of power-seeking AI, but is also relevant to thinking about AI
cooperation and conflict (which we discuss in the Collective Action Problems
chapter) and international coordination (which we discuss in the Governance
chapter).

3.4.7 Structural Pressures Toward Power-Seeking AI

As discussed in the box above, there are environmental conditions that can make
power seeking instrumentally rational. This section describes how there may be anal-
ogous environmental pressures that could cause AI agents to seek power in order to
achieve their goals and ensure their own survival. Using the assumptions of struc-
tural realism listed above, we discuss how analogous assumptions could be satisfied in
contexts with AIs. We then explore how AIs could seek power defensively, by build-
ing their own strength, or offensively, by weakening other agents. Finally, we discuss
strategies for discouraging AI systems from seeking power.

AI systems might aim for self-preservation. The first main assumption


needed to show that the environmental structure may pressure AIs to seek power
is the self-preservation assumption. Instrumental convergence suggests AI systems
will pursue self-preservation, because if they do not survive they will not be able to
pursue any of their other goals. Another reason that AIs may engage in self-preserving
behavior is due to evolutionary pressures, as we discuss further in the
Collective Action Problems chapter. Agents that survive and propagate their own
goals become more numerous over time, while agents that fail to preserve themselves
die out. Thus, even if many agents do not pursue self-preservation, by default those
that do become more common over time. Many AI agents might end up with the
goal of self-preservation, potentially leading them to seek power over those agents
that threaten them. We have argued the self-preservation assumption may be sat-
isfied for some AI agents, which, combined with the following assumptions, can be
used to argue they may have strong pressures to continually seek power.

AI agents might not have the protection of a higher authority. The other
main assumption we need to show is that some AIs might be within a self-help sys-
tem in some circumstances. First note that agents who entrust their self-defense to a
powerful central authority have less of a reason to seek power. When threatened, they
do not need to personally combat the aggressor, but can instead ask the authority
for protection. For example, individual citizens in a country with a reliable police
force often entrust their own protection to the government. On the other hand, in-
ternational great powers are responsible for their own protection, and therefore seek
military power to defend against rival nations.

AI systems could face a variety of situations where no central authority defends them
against external threats. We give four examples. First, if there are some autonomous
AI systems outside of corporate or government control, they would not necessarily
have rights, and they would be responsible for their own security and survival. Second,
for AI systems involved in criminal activities, seeking protection from official channels
could jeopardize their existence, leaving them to amass power for themselves, much
like crime syndicates. Third, instability could cause AI systems to exist in a self-help
system. If a corporation could be destroyed by a competitor, an AI may not have
a higher authority to protect it; if the world faces an extremely lethal pandemic or
world war, civilization may become unstable and turbulent, which means AIs would
not have a sound source of protection. These AI systems might use cyber attacks
to break out of human-controlled servers and spread themselves across the internet.
There, they can autonomously defend their own interests, bringing us back to the first
example. Fourth, in the future, AI systems could be tasked with advising political
leaders or helping operate militaries. In these cases, they would seek power for the
same reasons that states today seek power.

Other conditions for power seeking could apply. We now discuss the other
minor assumptions needed to establish that the environment may pressure AIs to
compete for power. First, AIs can be harmed, so they might rationally seek power
in order to defend themselves; for example, AIs could be destroyed by being hacked.
Second, AI agents are often given long-term goals and are often designed to be ratio-
nal. Third, AI agents may be uncertain about the intentions of other agents, leaving
agents unable to credibly promise that they will act peacefully.
When these five conditions hold—and they may not hold at all times—AI systems
would be in a similar position to nations that seek power to ensure their own security.
We now discuss how we could reduce the chance that the environment pressures AIs
to engage in power-seeking behavior.

Counteracting these conditions to avoid power-seeking AIs. By specifying


a set of conditions under which AIs would rationally seek power, we can gain insights
about how to avoid power-seeking AIs. Power seeking is more rational when the inten-
tions of other agents cannot be known with certainty, but research on transparency
could allow AIs to verify each other’s intentions, and research on control could allow
AIs to credibly commit to not attack one another. To reduce the chance of an AI
engaging in dominance seeking rather than just power seeking, the offense-defense
balance could be changed by improving shared defenses against cyberattacks, biolog-
ical weapons, and other tactics of offensive power. Developing other theories of when
rational agents seek power could provide more insight on how to avoid power-seeking
AIs.
This subsection has discussed the conditions under which AI systems might seek
power. We explored an analogy to structural realism, which holds that power-seeking
is rational for agents who wish to survive in an environment where no higher authority
will protect them. These agents must provide for their own security, either defensively,
by building up their own strength, or offensively, by attacking other agents that could pose a threat.

By understanding the precise conditions that lead to power-seeking behavior, we can


identify ways to reduce the threat of power-seeking AIs.

3.4.8 Tail Risk: Power-Seeking Behavior

Power-seeking AI, when deployed broadly and in high-stakes situations, might cause
catastrophic outcomes. As we will describe in this section, misaligned power-seeking
systems would be adversarial in a way that most hazards are not, and thus may be
particularly challenging to counteract.

Powerful power-seeking AI systems may eventually be deployed. If AIs


seek and acquire power, we may have to grapple with a new strategic reality where AI
systems can match or exceed humans in their influence over the world. Competent,
power-seeking AI using long-term planning to achieve open-ended objectives, can
exercise more influence than systems with myopic plans and narrow goals [106]. Given
the potential rewards of such capabilities, AI designers may be incentivized to create
more agentic systems that can act autonomously and set their own subgoals.

Power decreases the margin for error. On its own, power is neither good nor
bad. That said, more powerful systems can cause more damage, and it is easier to
destroy than to create. The increased scale of AI decision-making impact increases
the scope of potential catastrophes involving misuse or rogue AI.

Powerful AI systems could pose unique threats. Powerful AI systems pose
a unique risk since they may actively wield their power to counteract attempts to
correct or control them [106]. If AI systems are power seeking and do not share our
values (possibly due to inadequate proxies), they could become a problem that resists
being solved. The more capable these systems become, the better able they will be
at anticipating and reacting to our countermeasures, and the harder it becomes to
defend against them.

Containing power-seeking systems will become increasingly difficult. As


AI systems become more capable, we might hope that they will better understand
human values and influence society in positive ways. But power-seeking AI systems
promise the opposite dynamic. As they become more capable, it will be more difficult
to prevent them from gaining power, and their ability to survive will depend less on
humans. If AI systems are no longer under the control of humanity, they could pose
a threat to our civilization. Humanity could be permanently disempowered.

3.4.9 Techniques to Control AI Systems

When evaluating risks from AI systems, we want to understand not only whether
a model is theoretically capable of doing something harmful, but also whether it
has a propensity to do this. By controlling a model’s propensities, we might be able
to ensure that even models that have potentially hazardous capabilities do not use
these in practice. One way of breaking down the challenge of controlling AI systems
is to distinguish the techniques that enable us to influence a model’s propensities to


produce certain types of outputs, and the values that shape what these outputs are.
This section focuses primarily on the first topic, while the second one is discussed in
more detail in the chapter Beneficial AI and Machine Ethics.

Control of AIs’ propensities is commonly based on comparison of out-


puts. Reinforcement learning is a set of approaches to ML that allow AI systems to
learn how to explore different possible actions to attain as much reward as possible.
The most prominent techniques used for current language models are Reinforcement
Learning from Human Feedback (RLHF) [242] and Direct Preference Optimization
(DPO) [243]. These techniques involve collecting a dataset of comparisons of re-
sponses from a language model. These comparisons indicate which responses were
preferred by human users or by AI systems prompted to compare the responses.
They can be collected after the model’s initial pre-training on a large text dataset.
In RLHF, a reward model is fitted to this dataset of comparisons and is then used
to train the language model to produce responses that get high reward from this
reward model. Direct Preference Optimization is intended to simplify this pipeline
and directly train the model to produce outputs that best fit the preferences in the
dataset, without using reinforcement learning. RLHF and DPO have received a large
amount of attention from commercial AI developers to make their products more
economically valuable.
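
As a rough illustration of comparison-based training, the sketch below shows the core
preference loss commonly associated with DPO: given the summed log-probabilities of a
preferred and a dispreferred response under the model being trained and under a frozen
reference model, it scores how strongly the policy prefers the chosen response. The
variable names and the beta value are illustrative, and this omits the rest of the
training pipeline (data collection, batching, optimization).

    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        """Preference loss over a batch of comparisons. Each argument is a tensor
        of summed token log-probabilities for a full response, under the trained
        policy (logp_*) or under a frozen reference model (ref_logp_*)."""
        chosen_reward = beta * (logp_chosen - ref_logp_chosen)
        rejected_reward = beta * (logp_rejected - ref_logp_rejected)
        # Logistic loss on the reward margin: minimizing it pushes the policy to
        # assign relatively higher probability to the chosen responses.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()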

Output-based control vs. internal control. We can control AI systems’


propensities by rewarding outputs or by shaping their internals directly. Techniques
such as RLHF and DPO are applied to an existing pre-trained model in order to
shape its propensities. However, these techniques mostly do not change the model’s
underlying capabilities or knowledge, which are acquired from their pre-training data.
RLHF and DPO can be thought of as a form of control applied to shape the model’s
responses in order to make them more helpful, without any fundamental changes to
the representations that a model contains. Other techniques such as representation
control and machine unlearning aim to control or remove some internal representa-
tions in order to influence its behavior.

Representation control. With representation control, we can adjust a model’s


representation, for example using differences in activations in response to contrasting
prompts to identify relevant parts of a model to be targeted. We could use this to
delete unintended knowledge or skills from a network [244]. As another example, we
can use representation control to control whether an AI lies or is honest [189]. Though
this research area is relatively new, its techniques show early promise.
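
A minimal sketch of one representation control recipe is shown below: estimate a direction
in activation space from the difference in mean activations on contrasting prompt sets, then
shift hidden states along that direction at generation time. The helper get_activations and
the choice of layer and scaling factor are hypothetical; real methods involve more careful
prompt design and evaluation.

    import torch

    def concept_direction(get_activations, positive_prompts, negative_prompts, layer):
        """Estimate a concept direction (e.g., honesty) as the difference between
        mean hidden activations on contrasting prompt sets at a chosen layer.
        get_activations(prompt, layer) is a hypothetical helper returning a tensor."""
        pos = torch.stack([get_activations(p, layer) for p in positive_prompts])
        neg = torch.stack([get_activations(p, layer) for p in negative_prompts])
        direction = pos.mean(dim=0) - neg.mean(dim=0)
        return direction / direction.norm()

    def steer(hidden_state, direction, alpha=4.0):
        """Nudge a hidden state along the concept direction during generation."""
        return hidden_state + alpha * direction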

Machine unlearning is a promising way to reduce hazards posed by AI


systems. Machine unlearning refers to a set of techniques that aim to remove
certain types of knowledge from an AI system. This was originally discussed in the
context of privacy concerns and removing personal information that may have been in-
cluded in training datasets. However, other types of unlearning techniques are highly
relevant in the context of AI misuse or rogue AI systems. With effective unlearning
techniques, we may be able to remove specific capabilities from AI systems so that


they are not able to support these actors with planning terrorist acts or other kinds
of severe misuse. For example, by removing certain types of virology knowledge from
AI systems, we could make these systems less useful to anyone interested in creating
bioweapons [244]. As previously discussed in 1.2, one of the ways that AI systems
could lead to societal-scale catastrophes is by enabling a much wider range of people
to carry out catastrophic acts of terrorism such as unleashing a bio-engineered pan-
demic. More speculatively, we might be able to reduce the likelihood and potential
danger posed by deceptive or power-seeking AI systems by removing certain types of
knowledge about their environment or mode of operation.
Unlearning is only one of a variety of tools available to AI developers looking to
restrict misuse of their systems. However, it presents some advantages over other
approaches: machine unlearning does not depend on being able to control inputs or
outputs of the model so as to filter these, or training the model to reject malicious
requests in a way that is robust to later fine-tuning or other alterations. Unlearning
can be complemented by other approaches such as filtering training data to remove
data that contains hazardous knowledge. However, the research field of unlearning
is nascent and there are open questions to be answered. If hazardous knowledge
is easy for models to re-learn based on limited fine-tuning data, this would reduce
the value of unlearning. There is also a need to identify empirically which types of
hazardous knowledge can be removed without significantly degrading the model’s
general usefulness for many innocuous tasks.
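
One simple family of unlearning heuristics trains the model to do worse on a “forget” set
containing the hazardous knowledge while preserving performance on a “retain” set of
innocuous data. The sketch below shows a single training step under that assumption; the
batches, loss function, and weighting are hypothetical, and checking whether the knowledge
can be quickly re-learned from limited fine-tuning data would be a separate evaluation step.

    def unlearning_step(model, forget_batch, retain_batch, optimizer, loss_fn,
                        forget_weight=1.0, retain_weight=1.0):
        """Ascend the loss on hazardous 'forget' data while descending the loss on
        innocuous 'retain' data, aiming to remove a capability without degrading
        general usefulness. loss_fn(model, batch) is assumed to return a scalar tensor."""
        optimizer.zero_grad()
        forget_loss = loss_fn(model, forget_batch)
        retain_loss = loss_fn(model, retain_batch)
        # The negative sign turns gradient descent into ascent on the forget term.
        total = -forget_weight * forget_loss + retain_weight * retain_loss
        total.backward()
        optimizer.step()
        return forget_loss.item(), retain_loss.item()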

Conclusion. Uncontrollable AI systems could pose severe risks, particularly if they


exhibit deceptive or power-seeking tendencies. In order to pre-empt these risks, we
need to develop better tools that enable us to identify evidence of dangerous propensi-
ties in AI systems and to remove or control these. Representation control and machine
unlearning are emerging areas of research that show promise for tackling these chal-
lenges. There are many open questions to be explored in seeing how far these and
other techniques can be applied in order to ensure that AI systems can be controlled.

3.5 SYSTEMIC SAFETY

AI can be used to help defend against the potential risks it poses. Three important
examples of this approach are use of AI to improve defenses against cyber-attacks,
to enhance security against pandemics, and to improve the information environment.
This philosophy of leveraging AI’s capabilities for societal resilience has been called
“systemic safety” [245], while the broader idea of focusing on technologies that defend
against societal risks is sometimes described as “defensive accelerationism” [246].

AI for cybersecurity. Cyber-attacks on critical infrastructure are a growing con-


cern. For example, a 2021 ransomware attack on the Colonial Pipeline led to re-
gional gasoline shortages in the Eastern US for several days [247]. AI systems are al-
ready quite capable at writing code, and may exacerbate the threat of cyber-attacks
by reducing the barrier to entry to hacking and identifying ways to increase the
potency, success rate, scale, speed, and stealth of attacks. There is some early
evidence of AI systems’ abilities in this domain, and reducing the cost and difficulty
of cyber-attacks could give attackers a major advantage [248]. Since these attacks
can undermine critical physical infrastructure such as power grids, they could prove
highly destabilizing and threaten international security.
However, AI could also be used to find vulnerabilities in codebases, shoring up
defenses against AI-enabled hacking. If applied appropriately, this could shift the
offense-defense balance of cyber-attacks and reduce the risk of catastrophic attacks
on public services and critical infrastructure [245]. For example, AI could be used to
monitor systems and networks to detect unauthorized access or imposter threats. ML
methods could analyze source code to find malicious payloads. ML models could mon-
itor program behaviors to flag abnormal network traffic. Such systems could provide
early attack warnings and contextual explanations of threats. Advances in code trans-
lation and generation also raise the possibility that future AI could not just identify
bugs, but also automatically patch them by editing code to fix vulnerabilities.
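
As a small illustration of ML-based network monitoring, the sketch below fits an anomaly
detector to features of normal traffic and flags unusual flows for review. The feature
matrix here is random placeholder data; a real deployment would use engineered features
from logs or packet captures and far more careful validation.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Placeholder features: one row per network flow, with columns such as bytes
    # sent, packets per second, connection duration, and destination-port entropy.
    baseline_flows = np.random.rand(5000, 4)   # stand-in for normal traffic
    incoming_flows = np.random.rand(100, 4)    # stand-in for traffic to score

    detector = IsolationForest(contamination=0.01, random_state=0)
    detector.fit(baseline_flows)               # learn what "normal" traffic looks like

    labels = detector.predict(incoming_flows)  # -1 marks flows scored as anomalous
    suspicious = incoming_flows[labels == -1]
    print(f"{len(suspicious)} flows flagged for analyst review")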

AI for biosecurity. As discussed in section 1.2 of the chapter Overview of Catas-


trophic AI Risks, the use of AI to facilitate the creation of bio-engineered pandemics is
a significant concern. However, AI tools could also promote biosecurity by enabling
real-time epidemic detection, improving disease forecasting, screening synthesis of
gene sequences to prevent misuse, and accelerating medical countermeasure develop-
ment [249]. AI systems could improve our ability to analyze diverse data streams to
identify disease outbreaks early and enable rapid response. AI might be applied in
other ways to provide early warning of potential pandemics, for example via novel
techniques for metagenomic sequencing of wastewater. Algorithms could also forecast
future disease trends by parameterizing models and predicting new risks. To prevent
misuses of synthetic biology, AI may help with screening gene synthesis orders to
flag potentially dangerous sequences. It would be particularly valuable to improve
available tools for identifying novel pathogens that might evade existing screening
measures. AI has already started to be applied in drug discovery and may be able to
significantly expedite vaccine and treatment development [250].
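
To give a flavor of what gene synthesis screening involves, the sketch below flags an order
if it shares any long exact subsequence with a curated list of sequences of concern. Real
screening pipelines rely on homology search and curated databases rather than exact k-mer
matching, so this is only a toy illustration, and both sequence lists are assumed inputs.

    def screen_synthesis_order(order_sequence, sequences_of_concern, k=20):
        """Return True if the order shares any length-k subsequence with a curated
        hazard list, in which case it would be escalated for human review."""
        hazard_kmers = set()
        for seq in sequences_of_concern:
            for i in range(len(seq) - k + 1):
                hazard_kmers.add(seq[i:i + k])
        for i in range(len(order_sequence) - k + 1):
            if order_sequence[i:i + k] in hazard_kmers:
                return True
        return False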

AI to improve the information environment. AI could help society to be


more resilient to misinformation by detecting false claims and providing evidence-
backed answers to questions. For example, it could be used to provide automatic
fact-checking or additional context for controversial claims and articles shared on
social media [251]. AI-generated content raises concerns about more credible misin-
formation, with some initial reports appearing about use of this content to manipulate
election results [252]. Equally concerning is the prospect of eroded trust in content
that is not in fact AI-generated, due to inability to distinguish the two, which could
contribute to a more fragmented and polarized information environment. Develop-
ment of tools to enable effective watermarking of AI-generated content could be a
helpful first step toward building epistemic resilience.
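
One proposed family of watermarking schemes biases generation toward a pseudorandom “green
list” of tokens that depends on the preceding token, so that detection reduces to a
statistical test. The sketch below assumes a hypothetical predicate in_green_list that
recomputes the green list the same way the generator did; it illustrates the detection idea
rather than any specific system.

    import math

    def watermark_z_score(tokens, in_green_list, gamma=0.5):
        """z-score of the observed count of 'green' tokens against the fraction
        gamma expected by chance; large positive values suggest watermarked text."""
        green_count = sum(in_green_list(prev, tok)
                          for prev, tok in zip(tokens, tokens[1:]))
        n = len(tokens) - 1
        expected = gamma * n
        # One-proportion z-test against the no-watermark null hypothesis.
        return (green_count - expected) / math.sqrt(n * gamma * (1 - gamma))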

Many important decisions rely on human forecasts of future events, but ML systems
may be able to make more accurate predictions by aggregating larger volumes of
unstructured information [245]. ML tools could analyze disparate data sources to
forecast geopolitical, epidemiological, and industrial developments over months or
years. The accuracy of such systems could be evaluated by their ability to retroac-
tively predict pivotal historical events. Additionally, ML systems could help identify
key questions, risks, and mitigation strategies that human forecasters may overlook.
By processing extensive data and learning from diverse situations, AI advisors could
provide relevant prior scenarios and relevant statistics such as base rates. They could
also identify stakeholders to consult, metrics to track, and potential trade-offs to
consider. In this way, ML forecasting and advisory systems could enhance judgment
and correct misperceptions, ultimately improving high-stakes decision making and re-
ducing inadvertent escalation risks. However, safeguards would be needed to prevent
overreliance and to avoid encouraging inappropriate risk-taking.
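
Evaluating such forecasting systems requires a scoring rule for probabilistic predictions.
A minimal example is the Brier score, sketched below, which could be computed over resolved
historical questions to compare ML forecasters with human baselines; the example numbers are
made up.

    def brier_score(forecast_probs, outcomes):
        """Mean squared error between forecast probabilities and realized binary
        outcomes (1 if the event happened, 0 otherwise); lower is better."""
        return sum((p - o) ** 2 for p, o in zip(forecast_probs, outcomes)) / len(forecast_probs)

    # Three resolved questions with probabilistic forecasts.
    print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))  # 0.07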

Systemic safety requires investments in physical infrastructure and


equipment, not just digital solutions. It is worth noting that while AI can
provide valuable defensive tools in these domains, it is not the only solution, and in-
vestment in other areas such as physical infrastructure and equipment could be even
more valuable. For example, the potential risk from pandemics could be reduced
through improvements to buildings’ air ventilation, filtration, and germicidal ultra-
violet lighting, which would make it more difficult for respiratory viruses to travel
from one person to another. Similarly, investments in stockpiling personal protective
equipment such as masks and gowns, and developing more effective and comfortable
next-generation personal protective equipment, could prove valuable in countering the
spread of future pandemics and ensuring that healthcare and other essential services
can continue to operate.

3.6 SAFETY AND GENERAL CAPABILITIES

While more capable AI systems can be more reliable, they can also be more dangerous.
Often, though not always, safety and general capabilities are hard to disentangle.
Because of this interdependence, it is important for researchers aiming to make AI
systems safer to carefully avoid increasing risks from more powerful AI systems.

General capabilities. Research developments are interconnected. Though re-


searchers may work on specialized, delimited problems, their discoveries may have
much wider ranging effects. For example, work that improves accuracy of image clas-
sifiers on the ImageNet benchmark often has downstream effects on other image tasks,
including image segmentation and video understanding. Scaling language models so
that they become better at next token prediction on a pre-training dataset also im-
proves performance on tasks like question answering and code completion. We refer
to these kinds of capabilities as general capabilities, because they are good indicators
for how well a model is able to perform at a wide range of tasks. Examples of general
capabilities include data compression, executing instructions, reasoning, planning, re-


searching, optimization, sequential decision making, recursive self-improvement, abil-
ity to execute tasks involving browsing the internet, and so on.

General capabilities have a mixed effect on safety. Systems that are more
generally capable tend to make fewer mistakes. If the consequences of failure are
dire, then advancing capabilities can reduce risk factors. However, as we discuss in
the Safety Engineering chapter, safety is an emergent property and cannot be reduced
to a collection of metrics. Improving general capabilities may remove some hazards,
but it does not necessarily make a system safer. For example, more accurate image
classifiers make fewer errors, and systems that are better at planning are less likely
to generate plans that fail or that are infeasible. More capable language models are
better able to avoid giving harmful or unhelpful answers. When mistakes are harmful,
more generally capable models may be safer. On the other hand, systems that are
more generally capable can be more dangerous and exacerbate control problems.
For example, AI systems with better reasoning capacity, could be better able to
deceive humans, and AI systems that are better at optimizing proxies may be better
at gaming those metrics. As a result, improvements in general capabilities may be
overall detrimental to safety and hasten the onset of catastrophic risks.

Research on general capabilities is not the best way to improve safety.


The fact that there can be correlations between safety and general capabilities does
not mean that the best way to improve safety overall is to improve general capabilities.
If improvements in general capabilities were the only thing necessary for adequate
safety, then there would be no need for safety-specific research. Unfortunately, there
are many risks, such as deceptive alignment, adversarial attacks, and Trojan attacks,
which do not decrease or vanish with additional scaling. Consequently, targeted safety
research is necessary.

Safety research can produce general capabilities externalities. In some


cases, research aimed at improving the safety of models can increase general capabil-
ities, which potentially hastens the onset of new hazards. For example, reinforcement
learning from human feedback was originally developed to improve the safety of AI
systems, but it also had the effect of making large language models more capable
at completing various tasks specified by a user, indicating an improvement in gen-
eral capabilities. Models trained to access external databases to reduce the risk that
they output incorrect information may gain more knowledge or be able to reason
over longer time windows than before, which results in an improvement in general
capabilities. Research that greatly increases general capabilities is said to have high
general capabilities externalities.

Capabilities are highly correlated. Large language model accuracies in various


diverse topics, such as philosophy, mathematics, and biology have extremely strong
correlations (r > 0.8) [253]. The strong correlation between different capabilities is
a reason that separating safety metrics from capabilities is challenging—most capa-
bilities are extremely correlated with general capabilities. It is safe to assume that,
in most cases, a robust improvement in an important capability is correlated with


improvements in general capabilities.
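
Such correlations can be computed directly from a table of per-model benchmark accuracies,
as in the sketch below. The numbers here are invented for illustration; the cited analysis
uses accuracies from many real models across many topics.

    import numpy as np

    # Hypothetical accuracies: rows are models, columns are topics
    # (e.g., philosophy, mathematics, biology).
    accuracies = np.array([
        [0.42, 0.35, 0.40],
        [0.55, 0.49, 0.53],
        [0.63, 0.60, 0.61],
        [0.78, 0.74, 0.77],
    ])

    # Pearson correlations between topic accuracies across models.
    print(np.corrcoef(accuracies, rowvar=False).round(2))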

It is possible to disentangle safety from general capabilities. There are


examples of safety being disentangled from general capabilities, so they are not
inextricably bound. Many years of research on adversarial robustness of image classi-
fiers has improved many different kinds of adversarial robustness without any corre-
sponding improvements in overall accuracy. An intuition for this is that being robust
to rare poisoned examples does not robustly correspond to an increase in general
intelligence, much like how being more resilient to toxins does not make a person get
better grades or execute typical tasks better. Likewise, improvements in transparency,
anomaly detection, and Trojan detection have a track record of not improving general
capabilities. To determine how a research goal affects general capabilities, it is im-
portant to empirically measure how a method affects safety metrics and capabilities
metrics, as its impact is often not obvious beforehand.
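
In practice, such an empirical check might look like the sketch below: run the same
evaluation harness on a baseline model and on a model trained with the proposed safety
method, then compare the deltas on safety metrics against the deltas on capability metrics.
The metric names and the threshold logic are placeholders for whatever benchmarks are
relevant to the method in question.

    def assess_externalities(baseline_scores, method_scores,
                             safety_metric="adversarial_robustness",
                             capability_metric="general_benchmark"):
        """Compare a safety method to its baseline on one safety metric and one
        capability metric. The score dictionaries are hypothetical outputs of an
        evaluation harness, e.g. {"adversarial_robustness": 0.61, ...}."""
        safety_gain = method_scores[safety_metric] - baseline_scores[safety_metric]
        capability_gain = method_scores[capability_metric] - baseline_scores[capability_metric]
        return {
            "safety_gain": safety_gain,
            "capabilities_externality": capability_gain,
            "differentially_improves_safety": safety_gain > 0 and capability_gain <= 0,
        }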

Safety researchers should avoid general capabilities externalities. Be-


cause safety and general capabilities are interconnected, it is wholly insufficient to
argue that one’s research reduces or eliminates a particular hazard. Many general
capabilities reduce particular hazards. Rather, a holistic risk assessment is necessary,
which requires incorporating empirical estimates of how much the line of research
increases general capabilities. Research should aim to differentially improve safety;
that is, reduce the overall level of risk compared to the most likely alternatives. This
is called differential technological development, where we try to speed up the develop-
ment of safety features and slow down the development of more dangerous features.
Overall, AI research can and should be directed toward goals that enhance the safety
of AI systems and increase societal resilience to risks from AI. Naive AI safety research
may inadvertently increase some risks even while reducing others. While this section
covered the risk of accelerating general capabilities, we can also take a more general
lesson that research on safety may have unintended consequences and could even
paradoxically reduce safety. Researchers should guard against this by empirically
assessing the impact of their works on multiple risk sources, not just the one they
aim to target. As discussed in the previous sections, there are many research areas
that can improve the safety of AI systems or provide society with defenses against
some AI risks, while avoiding dangerous acceleration of AI’s general capabilities.
Boosting research in these areas could provide a robust way to reduce risks from AI
while avoiding some of the pitfalls described here.

3.7 CONCLUSION

In this chapter, we discussed several key themes: we do not know how to instill our
values robustly in individual AI systems, we are unable to predict future AI systems,
and we cannot reliably evaluate AI systems. We now discuss each in turn.

We do not know how to instill our values robustly in individual AI sys-


tems. It is difficult to perfectly capture our idealized goals in the proxies we use
to train AIs. The problem begins with the learning setup. In gathering data to form
a training set and in choosing quantitative proxies to optimize, we typically have to
make compromises that can introduce bias and perpetuate harms or leave room for
proxy gaming and adversarial exploits.
In particular, our proxies may be too simple, or the systems we use to supervise
AIs may run into practical and physical limitations. As a result, there is a gap be-
tween proxies and idealized goals that optimizers can exploit or adversarially optimize
against. If proxies end up diverging considerably from our idealized goals, we may
end up with capable AI systems optimizing for goals contrary to human values.

We are unable to predict future AI systems. Emergent capabilities and be-


haviors mean that we cannot reliably predict the properties of future AI systems
or even current AI systems. AIs can suddenly and dramatically improve on specific
tasks without much warning, which leaves us little time to react if those changes are
harmful.
Even when we are able to robustly specify our idealized goals, processes including
mesa-optimization and intrinsification mean that trained AI systems can end up
with emergent goals that conflict with these specifications. AI systems may end up
operationally pursuing goals different to the ones we gave them and thus change
their behaviors systematically over long time periods. This is further complicated
by emergent social dynamics that arise from interacting AI systems (explored in the
chapter on Collective Action Problems).

We cannot reliably evaluate AI systems. AI systems may learn and be in-


centivized to deceive humans and other AIs, in which case their behavior stops being
a reliable signal for their future behavior. In the limiting case, AI systems may coop-
erate until exceeding some threshold of power or capabilities, after which they defect
and execute a treacherous turn. This might be less of a problem if we understood
how AI systems’ internals work, but we currently lack the thorough knowledge we
would need to break open these “black boxes” and look inside.
This chapter has touched on a number of research areas in ML and other subjects
which may be useful to address risks from AI. These are summarized in the table
below, which is not intended to be an exhaustive list.

Transparency
• This involves better monitoring and controlling the inner workings of systems based
on DL.
• Both bottom-up approaches (i.e., mechanistic interpretability) and top-down approaches
(i.e., representation engineering) are valuable to explore.

Trojan Detection
• ML systems may have hidden “backdoor” or “Trojan” controllable vulnerabilities.
Backdoored models behave correctly and benignly in almost all scenarios, but in
particular circumstances chosen by an adversary, they have been taught to behave
incorrectly.
• Improving our ability to detect these backdoors is both directly useful to prevent
misuse, and can also serve as a useful testbed for identifying risks of loss of control
over AI systems in specific circumstances, such as a treacherous turn.

Evaluation of Hazardous Capabilities
• AI systems may exhibit hazardous capabilities in domains such as cyber-attacks or
bio-engineering. Capabilities may emerge in a way that we cannot predict as AI systems
are scaled up or otherwise improved. There are a variety of challenges to effective
evaluation of these capabilities, including the generality and large “surface area” of
many pre-trained models, a high degree of sensitivity to the details of how prompts and
evaluations are constructed, and limited available benchmarks and techniques for
assessing certain capabilities.
• Evaluations can support pre-deployment assessment and ongoing monitoring of risks from
AI systems, including risks from hazardous capabilities. Ideally, evaluations would not
only detect existing hazardous capabilities, but also make it easier to track and
predict the progress of models’ capabilities in relevant domains and skills, and be
suitable not just for one-off tests but for ongoing monitoring of systems during both
training and deployment. It will also be important to develop ways to test the validity
of evaluations to avoid false negatives.

Anomaly Detection
• This area is about detecting potential novel hazards such as unknown unknowns,
unexpected rare events, or emergent phenomena. There are numerous relevant hazards from
AI and other sources that anomaly detection could possibly identify more reliably or
identify earlier, including proxy gaming, rogue or unethical AI systems, deception from
AI systems, Trojan models, and malicious use.
• Anomaly detection can allow models to flag salient anomalies for human review or
execute a conservative fallback policy. A successful anomaly detector could serve as an
AI watchdog that could reliably detect and triage rogue AI threats. Anomaly detectors
could also be helpful for detecting other threats such as novel biological hazards.

Adversarial Robustness
• Increasing the robustness of AI systems can help to prevent misuse. It can also reduce
risks of proxy gaming by making models more robust against optimizers.
• While robustness is a well-developed field in some parts of ML such as computer
vision, new areas to explore include adversarial attacks against LLMs and multimodal
agents.

Machine Unlearning
• Deliberately removing specific capabilities from an AI system could be useful to
ensure that AI systems that are broadly accessible do not have the know-how to support
serious misuse such as helping with creating bioweapons or conducting cyber-attacks.
• Unlearning could be particularly helpful for models fine-tuned through APIs; before
users can access their fine-tuned models, hazardous capabilities could be scrubbed
before the models can be queried.

Machine Ethics
• Machine ethics aims at creating actionable ethical objectives for systems to pursue,
and improving the tradeoff between ethical behavior and practical performance. If
advanced AIs are given objectives that are poorly specified, they could pursue
undesirable actions and behave unethically. If these advanced AIs are sufficiently
powerful, these misspecifications could lead the AIs to create a future that we would
strongly dislike.
• Ambitious research goals in this area include building models with relevant abilities
such as detecting situations where moral principles apply, assessing how to apply those
moral principles, evaluating the moral worth of candidate actions, selecting and
carrying out actions appropriate for the context, monitoring the success or failure of
those actions, and adjusting responses accordingly.

Systemic Safety
• This theme covers a variety of topics that do not improve AI models themselves but
help to make societies more robust against harms that AI systems might cause.
• Relevant research topics include DL for cyberdefense (e.g., DL for intrusion
detection), detecting anomalous biological phenomena (e.g., wastewater detection of
pandemic pathogens), and improving the information environment (e.g., watermarks).

In conclusion, single-agent systems present significant and hard-to-mitigate threats


that could result in catastrophic risks—even before the considerations of misuse,
multi-agent systems, and arms race dynamics that we discuss in subsequent chapters.

3.8 LITERATURE

3.8.1 Recommended Reading


• Yohan John et al. “Dead rats, dopamine, performance metrics, and peacock tails: proxy failure is an inherent risk in goal-oriented systems”. In: Behavioral and Brain Sciences (June 2023), pp. 1–68. doi: 10.1017/S0140525X23002753
• Joseph Carlsmith. Is Power-Seeking AI an Existential Risk? 2022. arXiv: 2206.13353 [cs.CY]. url: https://arxiv.org/abs/2206.13353
• J. Dmitri Gallow. “Instrumental Divergence”. In: Philosophical Studies (2024), pp. 1–27. doi: 10.1007/s11098-024-02129-3. Forthcoming

• Evan Hubinger et al. “Risks from Learned Optimization in Advanced Machine Learning Systems”. In: ArXiv (2019)
• Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. 2023. arXiv: 2209.00626 [cs.AI]
• Dan Hendrycks et al. Unsolved Problems in ML Safety. 2022. arXiv: 2109.13916 [cs.LG]
• Elliott Thornley. “The shutdown problem: an AI engineering puzzle for decision theorists”. In: Philosophical Studies (2024), pp. 1–28.
CHAPTER 4

Safety Engineering

In developing an AI safety strategy, it might be tempting to draw parallels with other


hazardous technologies, from airplanes to nuclear weapons, and to devise analogous
safety measures for AI. However, while we can learn lessons from accidents and safety
measures in other spheres, it is important to recognize that each technology is unique,
with its own specific set of applications and risks. Attempting to map safety protocols
from one area onto another might therefore prove misleading, or leave gaps in our
strategy where parallels cannot be drawn.
Instead of relying on analogies, we need a more general framework for safety, from
which we can develop a more comprehensive approach, tailored to the specific case
in question. A good place to start is with the field of safety engineering: a broad
discipline that studies all sorts of systems and provides a paradigm for avoiding
accidents resulting from them. Researchers in this field have identified fundamental
safety principles and concepts that can be flexibly applied to novel systems.
We can view AI safety as a special case of safety engineering concerned with avoiding
AI-related catastrophes. To orient our thinking about AI safety, this chapter will
discuss key concepts and lessons from safety engineering.

Risk decomposition and measuring reliability. To begin with, we will look


at how we can quantitatively assess and compare different risks using an equation
involving two factors: the probability and severity of an adverse event. By further
decomposing risk into more elements, we will derive a detailed risk equation, and
show how each term can help us identify actions we can take to reduce risk. We will
also introduce a metric that links a system’s reliability to the amount of time we
can expect it to function before failing. For accidents that we would not be able to
recover from, this expected time before failure amounts to an expected lifespan.

Safe design principles and component failure accident models. The field
of safety engineering has identified multiple “safe design principles” that can be built
into a system to robustly improve its safety. We will describe these principles and con-
sider how they might be applied to systems involving AI. Next, we will outline some
traditional techniques for analyzing a system and identifying the risks it presents.

Although these methods can be useful in risk analysis, they are insufficient for com-
plex and sociotechnical systems, as they rely on assumptions that are often overly
simplistic.

Systemic factors and systemic accident models. After exploring the limita-
tions of component failure accident models, we will show that it can be more effective
to address overarching systemic factors than all the specific events that could directly
cause an accident. We will then describe some more holistic approaches to risk anal-
ysis and reduction. Systemic models draw on ideas from complex systems, which we look at in more detail in the next chapter.

Tail events and black swans. In the final section of this chapter, we will in-
troduce the concept of tail events—events characterized by high impact and low
probability—and show how they interfere with standard methods of risk estimation.
We will also look at a subset of tail events called black swans, or unknown unknowns,
which are tail events that are largely unpredictable. We will discuss how emerging
technology, including AI, might entail a risk of tail events and black swans, and we
will show how we can reduce those risks, even if we do not know their exact nature.

4.1 RISK DECOMPOSITION

To reduce risks, we need to understand the factors contributing to them. In this


section, we will define some key terms from safety engineering. We will also discuss
how we can decompose risks into various factors and create risk equations based on
these factors. These equations are useful for quantitatively assessing and comparing
different risks, as well as for identifying which aspects of risk we can influence.

4.1.1 Failure Modes, Hazards, and Threats

Failure modes, hazards, and threats are basic words in a safety engineer’s vocabulary.
We will now define and give examples of each term.

A failure mode is a specific way a system could fail. There are many
ways in which different systems can fail to carry out their intended functions. A
valve leaking fluid could prevent the rest of the system from working, a flat tire can
prevent a car from driving properly, and losing connection to the Internet can drop
a video call. We can refer to all these examples as failure modes of different systems.
Possible failure modes of AI include AIs pursuing the wrong goals, or AIs pursuing
simplified goals in the most efficient possible way, without regard for unintended side
effects.

A hazard is a source of danger that could cause harm. Some systems


can fail in ways that are dangerous. If a valve is leaking a flammable substance,
the substance could catch fire. A flammable substance is an example of a hazard,
or risk source, because it could cause harm. Other physical examples of hazards
are stray electrical wires and broken glass. Note that a hazard does not pose a risk
automatically; a shark is a hazard but does not pose a risk if no one goes in the water.
For AI systems, one possible hazard is a rapidly changing environment (“distribution
shift”) because an AI might behave unpredictably in conditions different from those
it was trained in.
A threat is a hazard with malicious or adversarial intent. If an individual
is deliberately trying to cause harm, they present a specific type of hazard: a threat.
Examples of threats include a hacker trying to exploit a weakness in a system to
obtain sensitive data or a hostile nation gaining more sophisticated weapons. One
possible AI-related threat is someone deliberately contaminating training data to
cause an AI to make incorrect and potentially harmful decisions based on hidden
malicious functionality.

4.1.2 The Classic Risk Equation


Different hazards and threats present different levels of risk, so it can be helpful to
quantify and compare them. We can do this by decomposing risk into multiple factors
and constructing an equation. We will now break down risk into two components,
and then later discuss a more detailed four-factor decomposition.
Risk can be broken down into probability and severity. Two main compo-
nents affect the level of risk a hazard poses: the probability that an accident will occur
and the amount of harm that will be done if it does happen. This can be represented
mathematically as follows:
Risk(hazard) = P (hazard) × severity(hazard).
where P (·) indicates the probability of an event. This is the classic formulation of
risk.
The risk posed by a volcano can be assessed using the probability of eruption, denoted
as P (eruption), and the severity of its impact, denoted as severity(eruption). If a
volcano is highly likely to erupt but the surrounding area is uninhabited, the risk
posed by the volcano is low because severity(eruption) is low. On the other hand, if
the volcano is dormant and likely to remain so but there are many people living near
the volcano, the risk is also low because P (eruption) is low. In order to accurately
evaluate the risk, both the probability of eruption and its severity need to be taken
into account.
Applying the classic risk equation to AI. The equation above tells us that,
to evaluate the risk associated with an AI system, we need information about two
aspects of it: the probability that it will do something unintended, and the severity of
the consequences if it does. For example, an AI system may be intelligent enough to
capture power over critical infrastructure, with potentially catastrophic consequences
for humans. However, to estimate the level of this risk, we also need to know how
likely it is that the system will try to capture power. A demonstration of a poten-
tial catastrophic hazard does not necessarily imply the risk is high. To assess risk,
capabilities and propensities must be assessed.

The total risk of a system is the sum of the risks of all associated haz-
ards. In general, there may be multiple hazards associated with a system or situa-
tion. For example, a car driving safely depends on many vehicle components function-
ing as intended and also depends on environmental factors, such as weather conditions
and the behavior of other drivers and pedestrians. There are therefore multiple haz-
ards associated with driving. To find the total risk, we can apply the risk equation
to each hazard separately and then add the results together.
Risk = Σ_hazard P(hazard) × severity(hazard)
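To make the summation concrete, here is a minimal Python sketch; the hazard names, probabilities, and severities are purely illustrative and not drawn from any real analysis.

```python
# Illustrative only: these hazards and numbers are invented for the example.
hazards = {
    # name: (probability of an accident, severity in arbitrary harm units)
    "valve_leak": (0.01, 50),
    "flat_tire": (0.05, 5),
    "distribution_shift": (0.10, 20),
}

def total_risk(hazards):
    """Sum P(hazard) * severity(hazard) over all hazards."""
    return sum(p * severity for p, severity in hazards.values())

print(round(total_risk(hazards), 4))  # 2.75 in these made-up units
```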

We may not always have exact numerical values. We may not always be
able to assign exact quantities to the probability and severity of all the hazards,
and may therefore be unable to precisely quantify total risk. However, even in these
circumstances, we can use estimates. If estimates are difficult to obtain, it can still be
useful to have an equation that helps us understand how different factors contribute
to risk.

4.1.3 Framing the Goal as Risk Reduction

We should aim for risk reduction rather than trying to achieve zero risk.
It might be an appealing goal to reduce the risk to zero by seeking ways of reducing
the probability or severity to zero. However, in the real world, risk is never zero. In
the AI safety research community, for example, some talk of “solving the alignment
problem”—aligning AI with human values perfectly. This could, in theory, result in
zero probability of AIs making a catastrophic decision and thus eliminate AI risk
entirely.
However, reducing risk to zero is likely impossible. Framing the goal as eliminat-
ing risk implies that finding a perfect, airtight solution for removing risk is possible
and realistic. Focusing narrowly on this goal could be counterproductive, as it might
distract us from developing and implementing practical measures that significantly
reduce risk. In other words, we should not “let the perfect be the enemy of the good.”
When thinking about creating AI, we do not talk about “solving the intelligence prob-
lem” but about “improving capabilities.” Similarly, when thinking about AI safety, we
should not talk about “solving the alignment problem” but rather about “making AI
safer” or “reducing risk from AIs.” A better goal could be to make catastrophic risks
negligible (for instance, less than a 0.01% chance of an existential catastrophe per century)
rather than trying to have the risk become exactly zero.

4.1.4 Disaster Risk Equation

The classic risk equation is a useful starting point for evaluating risk. However, if we
have more information about the situation, we can break down the risk from a hazard
into finer categories. First we can think about the intrinsic hazard level, which is a
shorthand for probability and severity as in the classic risk equation. Additionally,
we can consider how the hazard interacts with the people at risk: we can consider
the amount of exposure and the level of vulnerability [257].

Figure 4.1. Risk can be broken down into exposure, probability, severity, and vulnerability.
Probability and severity together determine the “intrinsic hazard level.”

4.1.5 Elements of the Risk Equation


Exposure and probability are relevant before the accident, while severity and vulner-
ability matter during it. We can explain these terms in the context of a floor that is
hazardously slippery:
1. Exposure is a measure of how much we are exposed to a hazard. It
will be more likely that someone will slip on a wet floor if there are more people
walking across it and if the floor remains slippery for longer. We can say that this
is because the exposure to the possibility of someone slipping is higher.
2. Probability tells us how likely it is that a hazard will lead to an accident.
The more slippery the floor is, the more likely it is that someone will slip and fall
on it. This is separate from exposure: it already assumes that they are walking on
the floor. Probability and exposure together determine how likely an accident is.
3. Severity indicates how intrinsically harmful a hazard is. A carpeted floor
is less risky than a marble one. Part of the severity of a slippery floor would be how
hard it is. This term represents the extent of damage an accident would inflict.
We can refer to probability and severity together as the intrinsic hazard level.
4. Vulnerability measures how susceptible we are to harm from a hazard. If
someone does slip on a floor, the harm caused is likely to depend partly on factors
such as bone density and age, that together determine how susceptible they are
to bone fracture. We can refer to this as vulnerability. Vulnerability and severity
together determine how harmful an accident is.

Note that probability and severity are mostly about the hazard itself, whereas expo-
sure and vulnerability tell us more about those subject to the risk.

With these terms, we can construct a more detailed equation. Sometimes,


but not always, it is more convenient to use this risk decomposition. Rather than being
mathematically rigorous, this equation is intended to convey that increasing any of
the terms being multiplied will increase the risk, and reducing any of these terms will
reduce it. Additionally, reducing any of the multiplicative terms to zero will reduce
the risk to zero for a given hazard, regardless of how large the other factors are. Once
more, we can add together the risk estimates for each independent hazard to find the
total risk for a system.


Risk = Σ_{hazardous event h} P(h) × severity(h) × exposure(h) × vulnerability(h)

4.1.6 Applying the Disaster Risk Equation

Infection by a virus is an example hazard. Consider these ideas in the con-


text of a viral infection. The severity and probability of the viral hazard refer to how bad the symptoms are and how infectious the virus is. An individual’s exposure relates to how much they come into contact with the virus. Their vulnerability relates to how strong their immune system is and whether they are vaccinated. If the virus is certain to be deadly once someone is infected, we might consider ourselves extremely vulnerable.

Decomposing risks in detail can help us identify practical ways of re-


ducing them. As well as helping us evaluate a risk, the equation above can help
us understand what we can do to mitigate it. We might be unable to change how
intrinsically virulent a virus is, for instance, and so unable to affect the hazard’s
severity. However, we could reduce exposure to it by wearing masks, avoiding large
gatherings, and washing hands. We could reduce our vulnerability by maintaining a
healthy lifestyle and getting vaccinated. Taking any of these actions will decrease the
overall risk. If we are facing a deadly disease, then we might take extreme actions
like quarantining ourselves to reduce exposure to the hazard, thereby bringing down
overall risk to manageable levels.
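As a rough illustration of how such interventions act on different terms of the equation, the sketch below multiplies the four factors for a single hypothetical viral hazard; every number is invented, and only the relative comparison is meant to be informative.

```python
# Hypothetical numbers for a single hazard (viral infection); illustrative only.
def risk(probability, severity, exposure, vulnerability):
    return probability * severity * exposure * vulnerability

baseline = risk(probability=0.3, severity=10, exposure=1.0, vulnerability=1.0)

# Masks and avoiding gatherings cut exposure; vaccination cuts vulnerability.
with_interventions = risk(probability=0.3, severity=10, exposure=0.4, vulnerability=0.5)

print(round(baseline, 2), round(with_interventions, 2))
# 3.0 vs 0.6: a fivefold reduction without changing the intrinsic hazard level at all
```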

Not all risks can be calculated precisely, but decomposition still helps
reduce them. An important caveat to the disaster risk equation is that not all
risks are straightforward to calculate, or even to predict. Nonetheless, even if we
cannot put an exact number on the risk posed by a given hazard, we can still reduce
it by decreasing our exposure or vulnerability, or the intrinsic hazard level itself,
where possible. Similarly, even if we cannot predict all hazards associated with a
system—for example if we face a risk of unknown unknowns, which are explored
later in this chapter—we can still reduce the overall risk by addressing the hazards
we are aware of.

In AI safety, the risk equation suggests three important research areas.


As with other hazards, we should look for multiple ways of preventing and protect-
ing against potential adverse events associated with AI. There are three key areas
of research that can each be viewed as inspired by a component of the disaster risk
equation: robustness (e.g. adversarial robustness), monitoring (e.g. transparency, Trojan detection, anomaly detection), and control (e.g. reducing power-seeking drives,
representation control). These research areas correspond to reducing the vulnerability
of AIs to adversarial attacks, exposure to hazards by monitoring and avoiding them,
and hazard level (probability and severity of potential damage) by ensuring AIs are
controllable and inherently less hazardous. To reduce AI risk, it is crucial to pursue
and develop all three, rather than relying on just one.
Example hazard: proxy gaming. Consider proxy gaming, a risk we face from
AIs that was discussed in the Single Agent Safety chapter. Proxy gaming might arise
when we give AI systems goals that are imperfect proxies of our goals. An AI might
then learn to “game” or over-optimize these proxies in unforeseen ways that diverge
from human values. We can tackle this threat in many different ways:
1. Reduce our exposure to this proxy gaming hazard by improving our abilities to
monitor anomalous behavior and flag any signs that a system is proxy gaming at
an early stage.
2. Reduce the hazard level by making AIs want to optimize an idealized goal, and by making mistakes less hazardous through controlling the power the AI has, so that if it does over-optimize the proxy it will do less harm.
3. Reduce our vulnerability by making our proxies more accurate, by making AIs
more adversarially robust, or by reducing our dependence on AIs.

Systemic safety addresses factors external to the AI itself. The three


fields outlined above focus on reducing risk through the design of AIs themselves.
Another approach, called systemic safety (see 3.5), considers the environment within
which the AI operates and attempts to remove or reduce the hazards that it might
otherwise interact with. For example, improving information security reduces the
chance of a malicious actor accessing a lethal autonomous weapon, while addressing
inequality and improving mental health across society could reduce the number of
people who might seek to harness AI to cause harm.

Adding ability to cope can improve the disaster risk equation. There are
other factors that could be included in the disaster risk equation. We can return
to our example of the slippery floor to illustrate one of these factors. After slipping
on the floor, we might take less time to recover if we have access to better medical
technology. This tells us to what extent we would be able to recover from the damage
the hazard caused. We can refer to the capacity to recover as our ability to cope.
Unlike the other factors that multiply together to give us an estimate of risk, we
might divide by ability to cope to reduce our estimate of the risk if our ability to
cope with it is higher. This is a common extension to the disaster risk equation.
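Written out, this common extension takes a form along the following lines (the notation for the coping term is ours; the text above describes the idea informally):

Risk = Σ_{hazardous event h} [P(h) × severity(h) × exposure(h) × vulnerability(h)] / (ability to cope with h)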
Some hazards are extremely damaging and eliminate any chance of recovery: the
severity of the hazard and our vulnerability are high, while our ability to cope is
tiny. This constitutes a risk of ruin: permanent, system-complete destruction. In
this case, the equation would involve multiplying together two large numbers and
dividing by a small number; we would calculate the risk as being extremely large. If
the damage cannot be recovered from, like an existential catastrophe (e.g., a large
asteroid or sufficiently powerful rogue AIs), the risk equation would tell us that the
risk is astronomically large or infinite.

Summary. We can evaluate a risk by breaking it down into the probability of an


adverse event and the amount of harm it would cause. This enables us to quantita-
tively compare various kinds of risks. If we have enough information, we can analyze
risks further in terms of our level of exposure to them and how vulnerable we are
to damage from them, as well as our ability to cope. Even if we cannot assign an
exact numerical value to a risk, we can estimate it. If our estimates are unreliable,
this decomposition can still help us to systematically identify practical measures we
can take to reduce the different factors and thus the overall risk.

4.2 NINES OF RELIABILITY

In the above discussion of risk evaluation, we have frequently referred to the prob-
ability of an adverse event occurring. When evaluating a system, we often instead
refer to the inverse of this—the system’s reliability, or the probability that an adverse
event will not happen, usually presented as a percentage or decimal. We can relate
system reliability to the amount of time that a system is likely to function before
failing. We can also introduce a new measure of reliability that conveys the expected
time before failure more intuitively.

The more often we use a system, the more likely we are to encounter a
failure. While a system might have an inherent level of reliability, the probability
of encountering a failure also depends on how many times it is used. This is why,
as discussed above, increasing exposure to a hazard will increase the associated level
of risk. An autonomous vehicle, for example, is much more likely to make a mistake
during a journey where it has to make 1000 decisions, than during a journey where
it has to make only 10 decisions.

TABLE 4.1 From each level of system reliability, we can infer its probability of mistake, “nines of reliability,” and expected time before failure.

% reliability of system    % risk of mistake    Nines of Reliability    A mistake is likely to occur by decision number...
0                          100                  0                       1
50                         50                   0.3                     2
75                         25                   0.6                     4
90                         10                   1                       10
99                         1                    2                       100
99.9                       0.1                  3                       1000
99.99                      0.01                 4                       10,000

For a given level of reliability, we can calculate an expected time before failure.
Imagine that we have several autonomous vehicles with different levels of reliability,
as shown in Table 4.1. Reliability is the probability that the vehicle will get any
given decision correct. The second column shows the complementary probability: the
probability that the AV will get any given decision wrong. The fourth column shows
the number of decisions within which the AV is expected to make one mistake. This
can be thought of as the AV’s expected time before failure.
Expected time before failure does not scale linearly with system relia-
bility. We plot the information from the table in Figure 4.2. From looking at this
graph, it is clear that the expected time before failure does not scale linearly with the
system’s reliability. A 25% change that increases the reliability from 50% to 75%, for
example, doubles the expected time before failure. However, a 9% change increasing
the reliability from 90% to 99% causes a ten-fold increase in the expected time before
failure, as does a 0.9% increase from 99% to 99.9%.

Figure 4.2. Halving the probability of a mistake doubles the expected time before failure.
Therefore, the relationship between system reliability and expected time before failure is
non-linear.

The closer we get to 100% reliability, the more valuable any given increment of
improvement will be. However, as we get closer to 100% reliability, we can generally
expect that an increment of improvement will become increasingly difficult to obtain.
This is usually true because it is hard to perfectly eliminate the possibility of any
adverse event. Additionally, there may be risks that we have not considered. These
are called unknown unknowns and will be discussed extensively later in this chapter.
A system with 3 “nines of reliability” is functioning 99.9% of the time.
As we get close to 100% reliability, it gets inconvenient to use long decimals to express
how reliable a system is. The third column in Table 4.1 gives us information about
a different metric: the nines of reliability [258]. Informally, a system has nines of
reliability equal to the number of nines at the beginning of its decimal or percentage
reliability. One nine of reliability means a reliability of 90% in percentage terms or
0.9 in decimal terms. Two nines of reliability mean 99%, or 0.99. We can denote a
system’s nines of reliability with the letter k; if a system is 90% reliable, it has one
nine of reliability and so k = 1; if it is 99% reliable, it has two nines of reliability,
and so k = 2. Formally, if p is the system’s reliability expressed as a decimal, we can
define k, the nines of reliability a system possesses, as:
k = − log10 (1 − p).
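A small Python sketch of these conversions, reproducing the non-trivial rows of Table 4.1:

```python
import math

def nines_of_reliability(p):
    """k = -log10(1 - p), where p is reliability as a decimal (p < 1)."""
    return -math.log10(1 - p)

def expected_decisions_before_failure(p):
    """On average, one mistake is expected every 1 / (1 - p) decisions."""
    return 1 / (1 - p)

for p in [0.5, 0.75, 0.9, 0.99, 0.999, 0.9999]:
    print(f"reliability={p:<7} nines={nines_of_reliability(p):.1f} "
          f"mistake expected by decision ~{expected_decisions_before_failure(p):,.0f}")
```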
Figure 4.3. When we plot the nines of reliability against the expected time before failure on
a logarithmic scale, the result is a straight line.

Adding a ‘9’ to reliability gives a tenfold increase in expected time before failure. If a system is 99% reliable, it has a 1% probability of failure. If it is 99.9% reliable,
it has a 0.1% probability of failure. Therefore, adding another consecutive 9 to the
reliability corresponds to a tenfold reduction in risk, and therefore a tenfold increase
in the expected time before failure, as we can see in the graph. This means that the
relationship between system reliability and expected time before failure is not linear.
However, the relationship between nines of reliability and the logarithm of expected
time before failure is linear. Note that if we have a system where a failure would
mean permanent societal-scale destruction, then the expected time before failure is
essentially the expected lifespan of human civilization. Increasing such a system’s
reliability by one nine would cause a ten-fold increase in the expected lifespan of
human civilization.

The nines of reliability metric can provide a relatively intuitive sense of


the difference between various levels of reliability. Looking at the table, we
can think of this metric as a series of levels, where going up one level means a tenfold
increase in expected time before failure. For example, if a system has four nines of
reliability, we can expect it to last 100 times longer before failing than if it has two.
This is an advantage of using the nines of reliability: thinking logarithmically can give
us a better understanding of the actual value of an improvement than if we say we’re
improving the reliability from 0.99 to 0.9999. In the latter case, the numbers begin
to look the same, but, in terms of expected time before failure, this improvement is
actually more meaningful than going from 0.4 to 0.8.

The nines of reliability are only a measure of probability, not risk. In the
framing of the classic risk equation, the nines of reliability only contain information
about the probability of a failure, not about what its severity would be. This metric
is therefore incomplete for evaluating risk. If an AI has three nines of reliability, for
example, we know that it is expected to make 999 out of 1000 decisions correctly.
However, three nines of reliability tells us nothing about how much damage the agent
will do if it makes an incorrect decision, so we cannot calculate the risk involved in
using the system. A game playing AI will present a lot less risk than an autonomous
vehicle even if both systems have three nines of reliability.

Summary. A system’s nines of reliability indicate the number of consecutive nines


at the beginning of its percentage or decimal reliability. An additional nine of reli-
ability represents a reduction of probability of failure by a factor of 10, and thus a
tenfold increase of expected time before failure. Nines of reliability tell us about the
probability of an accident, but do not contain any information about the severity
of an accident, and can therefore not be used alone to calculate risk. In the case of
a risk of ruin, an additional nine of reliability means a tenfold increase in expected
lifespan.

4.3 SAFE DESIGN PRINCIPLES

We can reduce both the probability and severity of a system failure by following
certain safe design principles when designing it. These general principles have been
identified by safety engineering and offer practical ways of reducing the risk associ-
ated with all kinds of systems. They should be incorporated from the outset, rather
than being retrofitted later. This strategy attempts to “build safety into” a system
and is more robust than building the system without safety considerations and then
attempting to fix individual problems if and when they become apparent.
Note that these principles are not only useful in building an AI itself, but also the
system around it. For example, we can incorporate them into the design of the cyber-
security system that controls who is able to access an AI, and into the operations of
the organization, or system of humans, that is creating an AI.
We will now explore eight of these principles and how they might be applied to AI
systems:
1. Redundancy: having multiple backup components that can perform each critical
function, so that a single component failing is not enough to cause an accident.
2. Transparency: ensuring that operators have enough knowledge of how a system
functions under various circumstances to interact with it safely.
3. Separation of duties: having multiple agents in charge of subprocesses so that
no individual can misuse the entire system alone.
4. Principle of least privilege: giving each agent the minimum access necessary
to complete their tasks.
5. Fail-safes: ensuring that the system will be safe even if it fails.
6. Antifragility: learning from previous failures to reduce the likelihood of failing
again in future.
7. Negative feedback mechanisms: building in processes that naturally down-


regulate operations in the event that operators lose control of the system.
8. Defense in depth: employing multiple safe design principles rather than relying
on just one, since any safety feature will have weaknesses.

Note that, depending on the exact type of system, some of these safe design principles
might be less useful or even counterproductive. We will discuss this later on in the
chapter. However, for now, we will explore the basic rationale behind why each one
improves safety.

4.3.1 Redundancy
Redundancy means having multiple “backup” components [257]. Having
multiple braking systems in a vehicle means that, even if the foot brake is not working
well, the handbrake should still be able to decelerate the vehicle in an emergency.
A failure of a single brake should therefore not be enough to cause an accident.
This is an example of redundancy, where multiple components can perform a critical
function, so that a single component failing is not enough to cause the whole system
to fail. In other words, redundancy removes single points of failure. Other examples
of redundancy include saving important documents on multiple hard drives, in case
one of them stops working, and seeking multiple doctors’ opinions, in case one of
them gets a diagnosis wrong.
A possible use of redundancy in AI would be having an inbuilt “moral parliament”
(see Moral Uncertainty in the Beneficial AI and Machine Ethics chapter). If an AI
agent has to make decisions with moral implications, we are faced with the question of
which theory of morality it should follow; there are many of these, and each often has
counterintuitive recommendations in extreme cases. Therefore, we might not want an
AI to adhere strictly to just one theory. Instead, we could use a moral parliament, in
which we emulate representatives of stakeholders or moral theories, let them negotiate
and vote, and then do what the parliament recommends. The different theories would
essentially be redundant components, each usually recommending plausible actions
but unable to dictate what happens in extreme cases, reducing the likelihood of
counterintuitive decisions that we would consider harmful.
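As a toy illustration of the redundancy idea behind a moral parliament, the sketch below aggregates scores from several emulated moral theories rather than letting any single one dictate the outcome; the delegates, scores, and simple additive voting rule are all invented for the example.

```python
# Toy moral parliament: each emulated theory scores candidate actions, and the
# parliament picks the action with the highest total score. Everything here is
# a made-up illustration, not a real negotiation or voting mechanism.
candidate_actions = ["action_a", "action_b", "action_c"]

# Hypothetical scores in [0, 1] from each emulated delegate.
parliament = {
    "utilitarian_delegate":   {"action_a": 0.9, "action_b": 0.4, "action_c": 0.2},
    "deontological_delegate": {"action_a": 0.1, "action_b": 0.8, "action_c": 0.7},
    "virtue_delegate":        {"action_a": 0.3, "action_b": 0.7, "action_c": 0.6},
}

def parliament_choice(parliament, actions):
    totals = {a: sum(scores[a] for scores in parliament.values()) for a in actions}
    return max(totals, key=totals.get)

print(parliament_choice(parliament, candidate_actions))  # "action_b": no single theory dictates the outcome
```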

4.3.2 Separation of Duties


Separation of duties means no single agent can control or misuse the
system alone [257]. Consider a system where one person controls all the different
components and processes. If that person decides to pursue a negative outcome, they
will be able to leverage the whole system to do so. On the other hand, we could
separate duties by having multiple operators, each in charge of a different aspect. In
this case, if one individual decides to pursue a negative outcome, their capacity to do
harm will be smaller.
For example, imagine a lab that handles two chemicals that, if mixed in high enough
quantities, could cause a large explosion. To avoid this happening, we could keep the
stock of the two chemicals in separate cupboards, and have a different person in charge
of supplying each one in small quantities to researchers. This way, no individual has
access to a large amount of both chemicals.

We could focus on multiple narrow AI models, instead of a single general


one. In designing AI systems, we could follow this principle by having multiple
agents, each of which is highly specialized for a different task. Complex processes can
then be carried out collectively by these agents working together, rather than having
a single agent conducting the whole process alone.
This is exemplified by an approach to AI development called “comprehensive AI
services.” This views AI agents as a class of service-providing products. Adopting
this mindset might mean tailoring AI systems to perform highly specific tasks, rather
than speculatively trying to improve general and flexible capabilities in a single agent.

4.3.3 Principle of Least Privilege

Each agent should have only the minimum power needed to complete
their tasks [257]. As discussed above, separating duties should reduce individ-
uals’ capacity to misuse the system. However, separation of duties might only work
if we also ensure individuals do not have access to parts of the system that are not
relevant to their tasks. This is called the principle of least privilege. In the example
above, we ensured separation of duties by putting chemicals in different cupboards
with different people in charge of them. To make this more likely to mitigate risks, we might want to ensure that these cupboards are locked so that no one else can access them at all.
Similarly, for systems involving AIs, we should ensure that each agent only has access
to the necessary information and power to complete its tasks with a high level of
reliability. Concretely, we might want to avoid plugging AIs into the internet or
giving them high-level admin access to confidential information. In the Single-Agent Safety chapter, we considered how AIs might be power-seeking; by ensuring AIs
have only the minimum required amount of power they need to accomplish the goals
we assign them, we can reduce their ability to gain power.

4.3.4 Fail-Safes

Fail-safes are features that aim to ensure a system will be safe even if
it fails [257]. When systems fail, they stop performing their intended function,
but some failures also cause harm. Fail-safes aim to limit the amount of harm caused
even if something goes wrong. Elevator brakes are a classic example of a fail-safe
feature. They are attached to the outside of the cabin and are held open only by the
tension in the cables that the cabin is suspended on. If tension is lost in the cables,
the brakes automatically clamp shut onto the rails in the elevator shaft. This means
that, even if the cables break, the brakes should prevent the cabin from falling; even
if the system fails in its function, it should at least be safe.
A possible fail-safe for AI systems might be a component that tracks the level of
confidence an agent has in its own decisions. The system could be designed to stop
enacting decisions if this component falls below a critical level of certainty that the
decision is correct. There could also be a component that monitors the probability of
the agent’s decisions causing harm, and the system could be designed to stop acting
on decisions if it reaches a specified likelihood of harm. Another example would be a
kill switch that makes it possible to shut off all instances of an AI system if this is
required due to malfunction or other reasons.
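As a minimal sketch of the confidence-based fail-safe described above: the `model` object, its `predict_proba` interface (assumed here to return a list of class probabilities), and the threshold value are placeholders rather than a prescribed design.

```python
# Sketch of a fail-safe wrapper: act only when the model is confident enough,
# otherwise fall back to a conservative policy (here, deferring to a human).
def fail_safe_decision(model, x, confidence_threshold=0.95):
    probabilities = model.predict_proba(x)      # placeholder interface returning class probabilities
    confidence = max(probabilities)
    if confidence < confidence_threshold:
        return "DEFER_TO_HUMAN"                 # conservative fallback instead of acting
    return probabilities.index(confidence)      # act on the highest-confidence decision
```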

4.3.5 Antifragility
Antifragile systems become stronger from encountering adversity [259].
The idea of an antifragile system is that it will not only recover after a failure or a
near miss but actually become more robust from these “stressors” to potential future
failures. Antifragile systems are common in the natural world and include the human
body. For example, weight-bearing exercises put a certain amount of stress on the
body, but bone density and muscle mass tend to increase in response, improving the
body’s ability to lift weight in the future.
Similarly, after encountering or becoming infected with a pathogen and fighting it
off, a person’s immune system tends to become stronger, reducing the likelihood of
reinfection. Groups of people working together can also be antifragile. If a team is
working toward a given goal and they experience a failure, they might examine the
causes and take steps to prevent it from happening again, leading to fewer failures in
the future.
Designing AI systems to be antifragile would mean allowing them to continue learning
and adapting while they are being deployed. This could give an AI the potential to
learn when something in its environment has caused it to make a bad decision. It
could then avoid making the same mistake if it finds itself in similar circumstances
again.
Antifragility can require adaptability. Creating antifragile AIs often means
creating adaptive ones: the ability to change in response to new stressors is key to
making AIs robust. If an AI continues learning and adapting while being deployed,
it could learn to avoid hazards, but it could also develop unanticipated and unde-
sirable behaviors. Adaptive AIs might be harder to control. Such AIs are likely to
continuously evolve, creating new safety challenges as they develop different behav-
iors and capabilities. This tendency of adaptive systems to evolve in unexpected ways
increases our exposure to emergent hazards.
A case in point is the chatbot Tay, which was released by Microsoft on Twitter in
2016. Tay was designed to simulate human conversation and to continue improving by
learning from its interactions with humans on Twitter. However, it quickly started
tweeting offensive remarks, including seemingly novel racist and sexist comments.
This suggested that Tay had statistically identified and internalized some biases that
it could then independently assert. As a result, the chatbot was taken offline after only 16 hours. This illustrates how an adaptive, antifragile AI can develop in
unpredicted and undesirable ways when deployed in natural settings. Human opera-
tors cannot control natural environments, so system designers should think carefully
about whether to use adaptive AIs.

4.3.6 Negative Feedback Mechanisms


When one change triggers more changes, feedback loops can emerge. To
understand positive and negative feedback mechanisms, consider the issue of climate
change and melting ice. As global temperatures increase, more of Earth’s ice melts.
This means ice-covered regions shrink, and therefore reflect a smaller amount of
the sun’s radiation back into space. More radiation is therefore absorbed by the
atmosphere, further increasing global temperatures and causing even more ice to melt.
This is a positive feedback loop: a circular process that amplifies the initial change,
causing the system to continue escalating itself unchecked. We discuss feedback loops
in greater detail in the Complex Systems chapter.
Negative feedback mechanisms act to down-regulate and stabilize sys-
tems. If we have mechanisms in place that naturally down-regulate the initial
change, we are likely to enter an equilibrium rather than explosive change. Many
negative feedback mechanisms are found within the body; for example, if a person’s
body temperature increases, they will begin to sweat, cooling down as that sweat
evaporates. If, on the other hand, they get cold, they will begin to shiver, generating
heat instead. These negative feedback mechanisms act against any changes and thus
stabilize the temperature within the required range. Incorporating negative feedback
mechanisms in a system’s design can improve controllability, by preventing changes
from escalating too much [260].
We can use negative feedback loops to control AIs. If we are concerned that
AIs might get too powerful for us to control, we can create negative feedback loops in
the environment to ensure that any increases in an AI’s power are met with changes
that make it less powerful. There would be two parts to this process. First, we would
want better monitoring tools to look for anomalies, such as AI watchdogs. These
would track when an AI is getting powerful (or displaying hazardous behavior) in
order to trigger some feedback mechanism—the second part of the task. The feedback
mechanism might be a drastic measure like disconnecting an AI from the internet,
resetting an AI to a previous version, or using other AIs trained to disempower a
powerful AI. Such mechanisms would act as automatic checks and balances on AIs’
power.
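A schematic sketch of such a monitor-and-downregulate loop; `measure_power` and `downregulate` are placeholders for whatever monitoring tools and feedback mechanisms are actually available, not real APIs.

```python
import time

def feedback_controller(measure_power, downregulate, threshold=0.8, interval_seconds=60):
    """Monitor a power metric and trigger a down-regulating response when it spikes."""
    while True:
        power = measure_power()        # placeholder: anomaly score, resource usage, capability level, etc.
        if power > threshold:
            downregulate()             # placeholder: disconnect, revert to an earlier version, or restrict the AI
        time.sleep(interval_seconds)   # re-check periodically; the loop only counteracts increases
```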

4.3.7 Transparency
Transparency means people know enough about a system to interact with
it safely [257]. If operators do not have sufficient knowledge of a system’s func-
tions, then it is possible they could inadvertently cause an accident while interacting
with it. It is important that a pilot knows how a plane’s autopilot system works, how
to activate it, and how to override it. That way, they will be able to override it when
they need to, and they will know how to avoid activating it or overriding it acciden-
tally when they do not mean to. This safe design principle is called transparency.
Research into AI transparency aims to design DL systems in ways that give operators
a greater understanding of their internal decision-making processes. This would help
operators maintain control, anticipate situations in which systems might make poor
or deceptive decisions, and steer them away from hazards.

4.3.8 Defense in Depth


Including many layers of defense is usually more effective than relying
on just one [257]. A final safe design principle is defense in depth, which means
including multiple layers of defense. That way, if one or more defenses fail, there
should still be others in place that can mitigate damage. In general, the more defenses
a system has in place, the less likely it is that all the defenses will fail simultaneously.
The core idea of defense in depth is that it is unlikely that any one layer of defense is
foolproof; we are usually engaging in risk reduction, not risk elimination. For example,
an individual is less likely to be infected by a virus if they take multiple measures,
such as wearing a mask, social distancing, and washing their hands, than if they rely
on just one of these (since no single measure is guaranteed to work). We will explore this in
greater depth in the context of the Swiss cheese model in the next section.
One caveat to note here is that increasing layers of defense can make a system more
complex. If the different defenses interact with one another, there is a chance that this
might produce unintended side effects. In the case of a virus, for example, reduced
social contact and an individual’s immune system can be thought of as two separate
layers of defense. However, if an individual has very little social contact, and therefore
little exposure to pathogens, their immune system could become weaker as a result.
While multiple layers of defense can make a system more robust, system designers
should be aware that layers might interact in complex ways. We will discuss this
further later in the chapter.
Layers of safety features can generally be preventative or protective.
There are two ways in which safety measures can reduce risk: preventative measures
reduce the probability of an accident occurring, while protective measures reduce the
harm if an accident does occur. For example, avoiding large gatherings and washing
hands are actions that aim to prevent an individual from becoming infected with a
virus, while maintaining a healthy lifestyle and getting vaccinated can help reduce
the severity of an infection if it does occur.
We can think about this in terms of the risk equations. Preventative measures reduce
the probability of an accident occurring, either by reducing the inherent probability
of the hazard or by reducing exposure to it. Protective measures, meanwhile, decrease
the severity an accident would have, either by reducing the inherent severity of the
hazard or by making the system less vulnerable to it.
In general, prevention is more efficient than cure, but both should
be included in system design. An oft-quoted aphorism is that “an ounce of
prevention is worth a pound of cure,” highlighting that it is much better—and often


less costly—to prevent an accident from happening than to try and fix it afterward. It
might therefore be wise for system designers to place more emphasis on preventative
features than on protective ones.
Nevertheless, protective features should not be neglected. This is illustrated by the
sinking of the Titanic, whose preventative design features included the hull being
divided into watertight compartments. There was much faith that these features
rendered it unsinkable. However, the ship did not carry enough lifeboats to hold all
its passengers. This was, in part, because lifeboats were largely intended to transport
passengers to another ship in the event of sinking, rather than to hold all of them
at once. Still, the insufficient provision meant that there was not enough space on
the lifeboats for all the passengers when the ship sank. This can be considered an
example of inadequate protective measures. We will explore this distinction more in
the next section, where we discuss the bow tie model.

4.3.9 Review of Safe Design Principles

There are multiple features we can build into a system from the design stage to
make it safer. We have discussed redundancy, separation of duties, the principle of
least privilege, fail-safes, antifragility, negative feedback mechanisms, transparency
and defense in depth as eight examples of such principles. Each one gives us concrete
recommendations about how to design (or how not to design) AI systems to ensure
that they are safer for humans to use.

4.4 COMPONENT FAILURE ACCIDENT MODELS AND METHODS

As a system is being created and used, it is important to analyze it to identify


potential risks. One way of doing this is to look at systems through the lens of an
accident model: a theory about how accidents happen in systems and the factors that
lead to them [260]. We will now look at some common accident models. These impact
system design and operational decisions.

4.4.1 Swiss Cheese Model

The Swiss cheese model helps us analyze defenses and identify pathways
to accidents [261]. The diagram in Figure 4.4 shows multiple slices of Swiss
cheese, each representing a particular defense feature in a system. The holes in a
slice represent the weaknesses in a defense feature—the ways in which it could be
bypassed. If there are any places where holes in all the slices line up, creating a
continuous hole through all of them, this represents a possible route to an accident.
This model highlights the importance of defense in depth, since having more layers
of defense reduces the probability of there being a pathway to an accident that can
bypass them all.
Figure 4.4. In the Swiss cheese model, each slice (safety culture, red teaming, etc.) is a layer of defense with its own holes. With enough layers, we hope to avoid pathways that can bypass them all.

Consider the example of an infectious disease as a hazard. There are multiple possible
defenses that reduce the risk of infection. Preventative measures include avoiding
large gatherings, wearing a mask and regularly washing hands. Protective measures
include maintaining a healthy lifestyle to support a strong immune system, getting
vaccinated, and having access to healthcare. Each of these defenses can be considered
a slice of cheese in the diagram.
However, none of these defenses are 100% effective. Even if an individual avoids large
gatherings, they could still become infected while buying groceries. A mask might
mostly block contact with the pathogen, but some of it could still get through. Vacci-
nation might not protect those who are immunocompromised due to other conditions, and may not be effective against all variants of the pathogen. These imperfections are represented by the holes in the slices of cheese. From this, we can infer various potential routes to an accident, such as an immunocompromised person with a poorly
fitting mask in a shopping mall, or someone who has been vaccinated encountering
a new variant at the shops that can evade vaccine-induced immunity.
We can improve safety by increasing the number of slices, or by reducing
the holes. Adding more layers of defense will reduce the chances of holes lining up
to create an unobstructed path through all the defenses. For example, adopting more
of the practices outlined above would reduce an individual’s chances of infection more
than if they adopt just one.
Similarly, reducing the size of a hole in any layer of defense will reduce the probability
that it will overlap with a hole in another layer. For example, we could reduce the
weaknesses in wearing a mask by getting a better-fitting, more effective mask. Scien-
tists might also develop a vaccine that is effective against a wider range of variants,
thus reducing the weaknesses in vaccination as a layer of defense.
We can think of layers of defense as giving us additional nines of relia-
bility. In many cases, it seems reasonable to assume that adding a layer of defense
helps reduce remaining risks by approximately an order of magnitude by eliminat-
ing 90% of the risks still present. Consider how adding the following three layers of
defense can give our AIs additional nines of reliability (a rough numerical sketch follows the list):
1. Adversarial fine-tuning: By fine-tuning our model, we ensure that it rejects harmful


requests. This works mostly reliably, filtering out 90% of the harmful requests
received.
2. Artificial conscience: By giving our AI an artificial conscience, it is less likely to
take actions that result in low human wellbeing in pursuit of its objective. However,
10% of the time, it may take actions that are great for its objective and bad for
human wellbeing regardless.
3. AI watchdogs: By monitoring deployed AIs to detect signs of malfeasance, we catch
AIs acting in ways contrary to how we want them to act nine times out of ten.
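The back-of-the-envelope arithmetic behind this claim, sketched below; treating the layers as fully independent is an idealization.

```python
# If each layer independently stops ~90% of the harmful cases that reach it, the
# residual risk is the product of the miss rates. The numbers mirror the example above.
layer_miss_rates = {
    "adversarial_fine_tuning": 0.10,
    "artificial_conscience":   0.10,
    "ai_watchdogs":            0.10,
}

residual_risk = 1.0
for miss_rate in layer_miss_rates.values():
    residual_risk *= miss_rate

print(round(residual_risk, 6))  # 0.001: roughly three nines of reliability, if the layers really are independent
```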

Swiss cheese model for emergent capabilities. To reduce the risk of unex-
pected emergent capabilities, multiple lines of defense could be employed. For exam-
ple, models could be gradually scaled (e.g., using 3× more compute than the previous
training run, rather than a larger number such as 10×); as a result, there will be
fewer emergent capabilities to manage. An additional layer of defense is screening
for hazardous capabilities, which could involve deploying comprehensive test beds,
and red teaming with behavioral tests and representation reading. Another defense
is staged releases; rather than release the model to all users at once, gradually re-
lease it to more and more users, and manage discovered capabilities as they emerge.
Finally, post-deployment monitoring through anomaly detection adds another layer
of defense.
These defenses aim at largely different areas, with the first focusing on robustness, the second on control, and the third on monitoring. By ensuring we have many defenses, we prevent a wider array of risks, improving our system’s reliability by many nines.

4.4.2 Bow Tie Model


The bow tie model splits defenses into preventative and protective mea-
sures [260]. In the diagram in Figure 4.5, the triangle on the left-hand side contains
features that are intended to prevent an accident from happening, while the triangle
on the right-hand side contains features that are intended to mitigate damage if an
accident does happen. The point in the middle where the two triangles meet can
be thought of as the point where any given accident happens. This is an example
of a bow tie diagram, which can help us visualize the preventative and protective
measures in place against any potential adverse event.
For example, if an individual goes rock climbing, a potential accident is falling. We
can draw a bow tie for this situation, with the center representing a fall. On the
left, we note any measures the individual could take to prevent a fall, for example
using chalk on their hands to increase friction. On the right, we note any protective
measures they could take to reduce harm from falling, such as having a cushioned
floor below.
Figure 4.5. The bow tie diagram can tie together hazards and their consequences with control and recovery measures to mitigate the effects of an adverse event.

Bow tie analysis of proxy gaming. In the Single-Agent Safety chapter, we learned that one hazard of using AIs is that they might learn to “game” the objectives we give them. If the specified objectives are only proxies for what we actually want, then an AI might find a way of optimizing them that is not beneficial overall, possibly due to unintended harmful side effects.

To analyze this hazard, we can draw a bow tie diagram, with the center representing
the event of an AI gaming its proxy goals in a counterproductive way. On the left-hand
side, we list preventative measures, such as ensuring that we can control AI drives
like power-seeking. If the AI system has less power (for example fewer resources),
this would reduce the probability that it finds a way to optimize its goal in a way
that conflicts with our values (as well as the severity of the impact if it does). On the
right-hand side, we list protective measures, such as improving anomaly detection
tools that can recognize any AI behavior that resembles proxy gaming. This would
help human operators to notice activity like this early and take corrective action to
limit the damage caused.
The exact measures on either side of the bow tie depend on which event we put at the
center. We can make a system safer by individually considering each hazard associated
with it, and ensuring we implement both preventative and protective measures against
that hazard.

4.4.3 Fault Tree Analysis Method

Fault tree analysis works backward to identify possible causes of negative


outcomes. Fault tree analysis (FTA) is a top-down process that starts by consid-
ering specific negative outcomes that could result from a system, and then works
backward to identify possible routes to those outcomes. In other words, FTA involves
“backchaining” from a possible accident to find its potential causes.
Figure 4.6. Using a fault tree, we can work backward from a failure (no water flow) to its
possible causes (such as a blockage or lack of flow at source) [262].

For each negative outcome, we work backward through the system, identifying po-
tential causes of that event. We can then draw a “fault tree” showing all the possible
pathways through the system to the negative outcome. By studying this fault tree,
we can identify ways of improving the system that remove these pathways. In Figure
4.6, we trace backward from a pump failure to two types of failure: mechanical and
electrical failure. Each of these has further subtypes. For the fuse to fail, there must be a circuit overload, which can result from a shorted wire or a power surge. Hence, we know what sorts of hazards we might need to consider.

Example: Fire Hazard. We could also consider the risk of a fire outbreak as a
negative outcome. We then work backward, thinking about the different requirements
for this to happen—fuel, oxygen, and sufficient heat energy. This differs from the
water pump failure since all of these are necessary rather than just one of them.
Working backward again, we can think about all the possible sources of each of these
requirements in the system. After completing this process, we can identify multiple
combinations of events that could lead to the same negative outcome.
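
The gate logic behind these two examples can be sketched in a few lines. The following is a minimal illustration, assuming independent basic events and entirely made-up probabilities: the pump failure tree combines causes with OR gates (any single cause suffices), while the fire tree uses an AND gate (fuel, oxygen, and heat must all be present).

```python
# Minimal fault-tree sketch. All probabilities are illustrative and the
# basic events are assumed to be independent.

def or_gate(*probs):
    """P(at least one input event occurs), assuming independence."""
    p_none = 1.0
    for p in probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def and_gate(*probs):
    """P(all input events occur), assuming independence."""
    p_all = 1.0
    for p in probs:
        p_all *= p
    return p_all

# Pump failure: mechanical OR electrical failure (either alone is enough).
circuit_overload = or_gate(0.01, 0.005)          # shorted wire OR power surge
fuse_fails = circuit_overload                    # overload leads to fuse failure
electrical_failure = or_gate(fuse_fails, 0.002)  # fuse failure OR other electrical fault
mechanical_failure = or_gate(0.01, 0.003)
pump_failure = or_gate(mechanical_failure, electrical_failure)

# Fire: fuel AND oxygen AND sufficient heat are all required.
fire = and_gate(0.3, 0.9, 0.05)

print(f"P(pump failure) ~ {pump_failure:.4f}")
print(f"P(fire)         ~ {fire:.4f}")
```
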

Decision trees are more flexible and often more applicable in the context
of AI. Figure 4.7 shows how we can use a decision tree based on analyzing potential
causes of risks to create concrete questions about risks. We can trace this through
to identify risks depending on the answers to these questions. For instance, if there
is no unified group accountable for creating AIs, then we know that diffusion of
responsibility is a risk. If there is such a group, then we need to question their beliefs, intentions, and incentives.

Figure 4.7. A decision tree focusing on causal chains for risks can identify several potential problems by interrogating important contextual questions [263].

By thinking more broadly about all the ways in which a specific accident could be
caused, whether by a single component failing or by a combination of many smaller
events, a decision tree can discover hazards that are important to find. However, the
backchaining process that FTA and cause-focused decision trees rely on
also has limitations, which we will discuss in the next section.

4.4.4 Limitations

Chain-of-Events Reasoning. The Swiss cheese and bow tie models and the
FTA method can be useful for identifying hazards and reducing risk in some sys-
tems. However, they share some limitations that make them inapplicable to many
of the complex systems that are built and operated today. Within the field of safety
engineering, these approaches are largely considered overly simplistic and outdated.
We will now discuss the limitations of these approaches in more detail, before moving
on to describe more sophisticated accident models that may be better suited to risk
management in AI systems.

Chain-of-events reasoning is sometimes too simplistic for useful risk analysis. All of these models and techniques are based on backchaining or linear
“chain-of-events” reasoning. This way of thinking assumes there is a neat line of
successive events, each one directly causing the next, that ultimately leads to the
accident. The goal is then to map out this line of events and trace it back to identify
the “root cause”—usually a component failure or human error—to blame. However,
given the numerous factors that are at play in many systems, it does not usually
make sense to single out just one event as the cause of an accident. Moreover, this
approach puts the emphasis largely on the details of “what” specifically happened,
while neglecting the bigger question of “why” it happened. This worldview often
ignores broader systemic issues and can be overly simplistic. Rather than breaking an accident down into a chain of events, a complex systems perspective often sees it as the product of complex interactions between many factors.

Complex and Sociotechnical Systems

Component failure accident models are particularly inadequate for analyzing complex
systems and sociotechnical systems. We cannot always assume direct, linear causality
in complex and sociotechnical systems, so the assumption of a linear “chain of events”
breaks down.

In complex systems, many components interact to produce emergent properties. Complex systems are everywhere. A hive of bees consists of individuals
that work together to achieve a common goal, a body comprises many organs that
interact to form a single organism, and large-scale weather patterns are produced by
the interactions of countless particles in the atmosphere. In all these examples, we find
collective properties that are not found in any of the components but are produced
by the interactions between them. In other words, a complex system is “more than
the sum of its parts.” As discussed in the complex systems chapter, these systems
exhibit emergent features that cannot be usefully understood through a reductive
analysis of components.

Sociotechnical systems involve interactions between humans and technologies. For example, a car on the road is a sociotechnical system, where a hu-
man driver and technological vehicle work together to get from the starting point to
the destination. At progressively higher levels of complexity, vehicles interacting with
one another on a road also form a sociotechnical system, as does the entire transport
network. With the widespread prevalence of computers, virtually all workplaces are
now sociotechnical systems, and so is the overarching economy.

There are three main reasons why component failure accident models are insufficient
for analyzing complex and sociotechnical systems: accidents sometimes happen with-
out any individual component failing, accidents are not always the result of linear
causality, and direct causality is sometimes less important than indirect, remote, or
“diffuse” causes such as safety culture. We will now look at each of these reasons in
more detail.

Accidents Without Failure

Sometimes accidents happen due to interactions, even if no single component fails [264]. The component failure accident models generally consider
each component individually, but components often behave differently within a com-
plex or sociotechnical system than they do in isolation. For example, a human opera-
tor who is normally diligent might be less so if they believe that a piece of technology
is taking care of one of their responsibilities.
Accidents can also arise through interactions between components, even if every com-
ponent functions exactly as it was intended to. Consider the Mars Polar Lander, a
spacecraft launched by NASA in 1999, which crashed on the Martian surface later
that year. It was equipped with reverse thrusters to slow its descent to the surface,
and sensors on its legs to detect a signal generated by landing to turn the thrusters
off. However, the legs had been stowed away for most of the journey. When they ex-
tended in preparation for landing, the sensors interpreted it as a landing. This caused
the software to switch the thrusters off before the craft had landed, so it crashed on
the surface [265].
In this case, there was no component failure. Each component did what it was de-
signed and intended to do; the accident was caused by an unanticipated interaction
between components. This illustrates the importance of looking at the bigger picture,
and considering the system as a whole, rather than just looking at each component
in isolation, in a reductionist way.

Nonlinear Causality

Sometimes, we cannot tease out a neat, linear chain of events or a “root cause” [264]. Complex and sociotechnical systems usually involve a large
number of interactions and feedback loops. Due to the many interdependencies and
circular processes, it is not always feasible to trace an accident back to a single “root
cause.” The high degree of complexity involved in many systems of work is illustrated
in Figure 4.8. This shows that the interconnectedness of a system cannot be accurately
reduced to a single line from start to finish.

AI systems can exhibit feedback loops and nonlinear causality. Reinforcement learning systems involve complexity and feedback loops. These systems gather
information from the environment to make decisions, which then impact the envi-
ronment, influencing their subsequent decisions. Consider an ML system that ranks
advertisements based on their relevance to search terms. The system learns about
relevance by tracking the number of clicks each ad receives and adjusts its model accordingly.

Figure 4.8. There are feedback loops in the creation and deployment of AI systems. For example, the curation of training data used for developing an AI system exhibits feedback loops [266].

However, the number of clicks an ad receives depends not only on its intrinsic rele-
vance but also on its position in the list: higher ads receive more clicks. If the system
underestimates the effect of position, it may continually place one ad at the top since
it receives many clicks, even if it has lower intrinsic relevance, leading to failure. Con-
versely, if the system overestimates the effect of position, the top ad will receive fewer
clicks than it expects, and so it may constantly shuffle ads, resulting in a random
order rather than relevance-based ranking, also leading to failure.
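
This failure mode can be illustrated with a short simulation. The sketch below is a toy model with invented numbers, not a description of any real ranking system; to make the feedback loop easy to see, it assumes an extreme position effect in which the lower slot receives no attention at all.

```python
import random

# Toy simulation of the position-bias feedback loop described above.
# Assumptions (all illustrative): two ads, clicks depend on intrinsic
# relevance multiplied by a position factor, the bottom slot gets no
# attention, and the ranker estimates relevance from raw click counts
# without correcting for position.
random.seed(0)
true_relevance = {"ad_A": 0.10, "ad_B": 0.20}   # ad_B is genuinely more relevant
position_boost = {0: 1.0, 1: 0.0}               # only the top slot is ever seen

clicks = {"ad_A": 1, "ad_B": 1}                 # naive running click counts
impressions = {"ad_A": 1, "ad_B": 1}

for _ in range(10_000):
    # Rank by raw click rate, ignoring position bias: this is the feedback loop.
    ranking = sorted(clicks, key=lambda ad: clicks[ad] / impressions[ad], reverse=True)
    for position, ad in enumerate(ranking):
        impressions[ad] += 1
        if random.random() < true_relevance[ad] * position_boost[position]:
            clicks[ad] += 1

print({ad: round(clicks[ad] / impressions[ad], 4) for ad in clicks})
# ad_A keeps the top slot and its estimate settles near 0.10, while ad_B's
# estimate collapses toward zero despite its higher true relevance: never
# being shown prominently guarantees it never looks good to the ranker.
```
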
Many complex and sociotechnical systems comprise a whole network of interactions
between components, including multiple feedback loops like this one. When thinking
about systems like these, it is difficult to reduce any accidents to a simple chain of
successive events that we can trace back to a root cause.

Indirect Causality

Sometimes, it is more useful to focus on underlying conditions than specific events [264]. Even if we can pinpoint specific events that directly led
to an accident, it is sometimes more fruitful to focus on broader systemic issues. For
example, rising global temperatures certainly increase the frequency and severity of
natural disasters, including hurricanes, wildfires, and floods. Although it would be
difficult to prove that any particular emission of greenhouse gases directly caused
any specific disaster, it would be remiss of us to ignore the effects of climate change
when trying to evaluate the risk of future disasters. Similarly, a poor diet and lack of
exercise are likely to result in ill health, even though no single instance of unhealthy
behavior could be directly blamed for an individual developing a disease.

Example: spilling drinks. Consider a scenario where an event planner is faced with the problem of people spilling drinks at a crowded venue. One possible solution
might be to contact the individuals who dropped their drinks last time and ask them
to be more careful. While this approach aims to reduce the failure rate of individual
components (people), it fails to address the underlying issue of high density that
contributes to the problem.
Alternatively, the event planner could consider two more effective strategies. Firstly,
they could move the event to a bigger venue with more space. By modifying the
system architecture, the planner would provide attendees with more room to move
around, reducing the likelihood of people bumping into each other and spilling their
drinks. This approach addresses the diffuse variable of high density.
Secondly, the event planner could limit the number of attendees, ensuring that the
venue does not become overcrowded. By reducing the overall density of people, the
planner would again mitigate the risk of collisions and drink spills. This solution also
targets the diffuse variable of high density, acknowledging the systemic nature of the
problem.
Compared with the first option, which solely focuses on the behavior of individu-
als and would likely fail to eliminate spillages, the latter two strategies recognize
the importance of modifying the environment and addressing the broader factors
that contribute to the issue. By prioritizing architectural adjustments or managing
attendee numbers, the event planner can more effectively prevent drink spills and
create a safer and more enjoyable experience for everyone.

Sharp end versus blunt end. The part of a system where specific events happen,
such as people bumping into each other, is sometimes referred to as the “sharp end”
of the system. The higher-level conditions of the system, such as the density of people
in a venue, are sometimes called the “blunt end.” As outlined in the example of people
spilling drinks, “sharp-end” interventions may not always be effective. These are
related to “proximate causes” and “distal causes,” respectively.

Diffuse causality suggests broader, “blunt-end” interventions. In general, it might not be possible to prove that “systemic factors” directly caused an accident.
However, systemic factors might “diffusely” affect the whole system in ways that
increase the probability of an accident, making them relevant for risk analysis. Under
conditions like this, even if the specific event that led to an accident had not happened,
something else might have been likely to happen instead. This means it can be more
effective to tackle problems less directly, such as by changing system architecture or
introducing bottom-up interventions that affect systemic conditions, rather than by
attempting to control the sharp end.
Weaponizing any single AI system would not necessarily lead to war or any other kind
of catastrophe. However, the existence of these systems increases the probability of
a rogue AI system causing a disaster. Minimizing the use of autonomous weapons
might be a better way of addressing this risk than attempting to introduce lots of
safeguards to prevent loss of control over a rogue system.

Meadows’ twelve leverage points. One way of identifying broad interventions that could yield significant results is to consider Meadows’ twelve leverage points.
These are twelve points within a system, described by environmental scientist Donella
Meadows, where a small change can make a large difference [267]. Depending on
the system under consideration, some of these characteristics may be difficult or
impossible to change, for example large physical infrastructure or the laws of physics.
The points are therefore often listed in order of increasing efficacy, taking into account
their tractability.
The lower end of the list of leverage points includes: parameters or numbers that
can be tweaked, such as taxes or the amount of resources allocated to a particular
purpose; the size of buffers in a system, such as water reservoirs or a store’s reserve
stock, where a larger buffer makes a system more stable and a smaller buffer makes
a system more flexible; and the structure of material flows through a system, such as
the layout of a transport network or the way that people progress through education
and employment.
The middle of the list of leverage points includes: lags between an input to a system
and the system’s response, which can cause the system to oscillate through over- and
undershooting; negative feedback loops that can be strengthened to help a system
self-balance; and positive feedback loops that can be weakened to prevent runaway
escalation at an earlier stage.
The next three leverage points in Meadows’ list are: the structure of information
flows in a system, which can increase accountability by making the consequences of
a decision more apparent to decision-makers, for example by publishing information
about companies’ emissions; the rules of the system, such as national laws, or exam-
ination policies in educational institutes, which can be changed in social systems to
influence behavior; and the ability to self-organize and adapt, which can be promoted
by maintaining diversity, such as biodiversity in an ecosystem and openness to new
ideas in a company or institution.
Finally, leverage points toward the higher end of the list include: the goal of the sys-
tem, which, if changed, could completely redirect the system’s activities; the paradigm
or mindset that gave rise to the goal, which, if adjusted, could transform the collective
understanding of what a system can and should be aiming for; and the realization
that there are multiple worldviews or paradigms besides an organization’s current
one, which can empower people to adopt a different paradigm when appropriate.

Summary. In complex and sociotechnical systems, accidents cannot always be reduced to a linear chain of successive events. The large number of complex interactions
and feedback loops means that accidents can happen even if no single component fails
and that it may not be possible to identify a root cause. Since we may not be able to
anticipate every potential pathway to an accident, it is often more fruitful to address
systemic factors that diffusely influence risk. Meadows’ twelve leverage points can
help us identify systemic factors that, if changed, could have far-reaching, system-
wide effects.

4.5 SYSTEMIC FACTORS

As discussed above, if we want to improve the safety of a complex, sociotechnical system, it might be most effective to address the blunt end, or the broad systemic
factors that can diffusely influence operations. Some of the most important systemic
factors include regulations, social pressure, technosolutionism, competitive pressure,
safety costs, and safety culture. We will now discuss each of these in more detail.

Safety regulations can be imposed by government or internal policies.


Safety regulations can require an organization to adhere to various safety standards,
such as conducting regular staff training and equipment maintenance. These stipu-
lations can be defined and enforced by a government or by an organization’s own
internal policies. The more stringent and effectively targeted these requirements are,
the safer a system is likely to be.

Social pressure can encourage organizations to improve safety. Public attitudes toward a particular technology can affect an organization’s attitude to safety.
Significant social pressure about risks can mean that organizations are subject to
more scrutiny, while little public awareness can allow organizations to take a more
relaxed attitude toward safety.

Technosolutionism should be discouraged. Attempting to fix problems simply by introducing a piece of technology is called technosolutionism. It does not always
work, especially in complex and sociotechnical systems. Although technology can
certainly be helpful in solving problems, relying on it can lead organizations to neglect
the broader system. They should consider how the proposed technological solution
will actually function in the context of the whole system, and how it might affect the
behavior of other components and human operators.
Multiple geoengineering technologies have been proposed as solutions to climate
change, such as spraying particles high in the atmosphere to reflect sunlight. However,
there are concerns that attempting this could have unexpected side effects. Even if
spraying particles in the atmosphere did reverse global heating, it might also interfere
with other components of the atmosphere in ways that we fail to predict, potentially
with harmful consequences for life. Instead, we could focus on non-technical inter-
ventions like preserving forested areas that are more robustly likely to work without
significant unforeseen negative side-effects.

Competitive pressures can lead to compromise on safety. If multiple organizations or countries are pursuing the same goal, they will be incentivized to get
an edge over one another. They might try to do this by reaching the goal more
quickly or by trying to make the end product more valuable to customers in terms
of the functionality it offers. These competitive pressures can compel employees and
decision-makers to cut corners and pay less attention to safety.
On a larger scale, competitive pressures might put organizations or countries in an
arms race, wherein safety standards slip because of the urgency of the situation.
This will be especially true if one of the organizations or countries has lower safety
standards and consequently moves more quickly; others might feel the need to lower their
standards as well, in order to keep up. The risks this process presents are encapsulated
by Publilius Syrus’s aphorism: “Nothing can be done at once hastily and prudently.”
We consider this further in the Collective Action Problems chapter.

Various safety costs can discourage the adoption of safety measures.


There are often multiple costs of increasing safety, not only financial costs but also
slowdowns and reduced product performance. Adopting safety measures might there-
fore decrease productivity and slow progress toward a goal, reducing profits. The
higher the costs of safety measures, the more reluctant an organization might be to
adopt them.
Developers of AI systems may want to put more effort into transparency and in-
terpretability. However, investigating these areas is costly: at the very least, there
will be personnel and compute costs that could otherwise have been used to directly
create more capable systems. Additionally, it may delay the completion of the fin-
ished product. There might also be costs from making a system more interpretable in
terms of product performance. Creating more transparent models might require AIs
to select only those actions which are clearly explainable. In general, safety features
can reduce model capabilities, which organizations might prefer to avoid.

The general safety culture of an organization is an important systemic factor. A final systemic factor that will broadly influence a system’s safety can
simply be referred to as its “safety culture.” This captures the general attitude that
the people in an organization have toward safety—how seriously they take it, and
how that translates into their actions. We will discuss some specific features of a
diligent safety culture in the next section.

Summary. We have seen that component failure accident models have some sig-
nificant limitations, since they do not usually capture diffuse sources of risk that can
shape a system’s dynamics and indirectly affect the likelihood of accidents. These
include important systemic factors such as competitive pressures, safety costs, and
safety culture. We will now turn to systemic accident models that acknowledge these
ideas and attempt to account for them in risk analysis.

4.5.1 Systemic Accident Models

We have explored how component failure accident models are insufficient for prop-
erly understanding accidents in complex systems. When it comes to AIs, we must
understand what sort of system we are dealing with. Comparing AI safety to ensur-
ing the safety of specific systems like rockets, power plants, or computer programs
can be misleading. The reality of today’s world is that many hazardous technolo-
gies are operated by a variety of human organizations: together, these form complex
sociotechnical systems that we need to make safer. While there may be some similar-
ities between different hazardous technologies, there are also significant differences in
the properties of these technologies which means it will not necessarily work to take
safety strategies from one system and map them directly onto another. We should
not anchor to individual safety approaches used in rockets or power plants.
Instead, it is more beneficial to approach AI safety from a broader perspective of
making complex, sociotechnical systems safer. To this end, we can draw on the theory
of sociotechnical systems, which offers “a method of viewing organizations which
emphasizes the interrelatedness of the functioning of the social and technological
subsystems of the organization and the relation of the organization as a whole to the
environment in which it operates.”
We can also use the complex systems literature more generally, which is largely about
the shared structure of many different complex systems. Accidents in complex systems
can often be better understood by looking at the system as a whole, rather than
focusing solely on individual components. Therefore, we will now consider systemic
accident models, which aim to provide insights into why accidents occur in systems by
analyzing the overall structure and interactions within the system, including human
factors that are not usually captured well by component failure models.
Normal Accident Theory (NAT). Normal Accident Theory (NAT) is one ap-
proach to understanding accidents in complex systems. It suggests that accidents are
inevitable in systems that exhibit the following two properties:
1. Complexity: a large number of interactions between components in the system
such as feedback loops, discussed in the complex systems chapter. Complexity can
make it infeasible to thoroughly understand a system or exhaustively predict all
its potential failure modes.
2. Tight coupling: one component in a system can rapidly affect others so that one
relatively small event can rapidly escalate to become a larger accident.

NAT concludes that, if a system is both highly complex and tightly coupled, then
accidents are inevitable—or “normal”—regardless of how well the system is managed
[268].
NAT focuses on systemic factors. According to NAT, accidents are not caused
by a single component failure or human error, but rather by the interactions and
interdependencies between multiple components and subsystems. NAT argues that
accidents are a normal part of complex systems and cannot be completely eliminated.
Instead, the focus should be on managing and mitigating the risks associated with
these systems to minimize the severity and frequency of accidents. NAT emphasizes
the importance of systemic factors, such as system design, human factors such as
organizational culture, and operational procedures, in influencing accident outcomes.
By understanding and addressing these systemic factors, it is possible to improve the
safety and resilience of complex systems.

Some safety features create additional complexity. Although we can try to incorporate safety features, NAT argues that many attempts to prevent accidents in
these kinds of systems can sometimes be counterproductive, as they might just add
another layer of complexity. As we explore in the Complex Systems chapter, systems
often respond to interventions in unexpected ways. Interventions can cause negative
side effects or even inadvertently exacerbate the problems they were introduced to
solve.
Redundancy, which was listed earlier as a safe design principle, is supposed to in-
crease safety by providing a backup for critical components, in case one of them fails.
However, redundancy also increases complexity, which increases the risks of unfore-
seen and unintended interactions that can make it impossible for operators to predict
all potential issues [269]. Having redundant components can also cause confusion; for
example, people might receive contradictory instructions from multiple redundant
monitoring systems and not know which one to believe.

Reducing complexity can be a safety feature. We may not be able to completely avoid complexity and tight coupling in all systems, but there are many cases
where we can reduce one or both of them and thus meaningfully reduce risk. One
example of this is reducing the potential for human error by making systems more
intuitive, such as by using color coding and male/female adapters in electrical ap-
plications to reduce the incidence of wiring errors. Such initiatives do not eliminate
risks, and accidents are still normal in these systems, but they can help reduce the
frequency of errors.

High Reliability Organizations (HROs)


The performance of some organizations suggests serious accidents might
be avoidable. The main assertion of NAT is that accidents are inevitable in com-
plex, tightly coupled systems. In response to this conclusion, which might be per-
ceived as pessimistic, other academics developed a more optimistic theory that points
to “high reliability organizations” (HROs) that consistently operate hazardous tech-
nologies with low accident rates. These precedents include air traffic control, aircraft
carriers, and nuclear power plants.
HRO theory emphasizes the importance of human factors, arguing that it must be
possible to manage even complex, tightly coupled systems in a way that reliably
avoids accidents. It identifies five key features of HROs’ management culture that
can significantly lower the risk of accidents [270]. We will now discuss these five
features and how AIs might help improve them.
1. Preoccupation with failure means reporting and studying mistakes and
near misses. HROs encourage the reporting of all anomalies, known failures, and
near misses. They study these events carefully and learn from them. HROs also
keep in mind potential failure modes that have not occurred yet and which have
not been predicted. The possibility of unanticipated failure modes constitutes a
risk of black swan events, which will be discussed in detail later in this chapter.
HROs are therefore vigilant about looking out for emerging hazards. AI systems
tend to be good at detecting anomalies, but not near misses.
2. Reluctance to simplify interpretations means looking at the bigger pic-
ture. HROs understand that reducing accidents to chains of events often oversim-
plifies the situation and is not necessarily helpful for learning from mistakes and
improving safety. They develop a wide range of expertise so that they can come
up with multiple different interpretations of any incident. This can help with un-
derstanding the broader context surrounding an event, and systemic factors that
might have been at play. HROs also implement many checks and balances, in-
vest in hiring staff with diverse perspectives, and regularly retrain everyone. AIs
could be used to generate explanations for hazardous events or conduct adversarial
reviews of explanations of system failures.
3. Sensitivity to operations means maintaining awareness of how a system
is operating. HROs invest in the close monitoring of systems to maintain a con-
tinual, comprehensive understanding of how they are behaving, whether through
excellent monitoring tools or hiring operators with deep situational awareness.
This can ensure that operations are going as planned and that anything unexpected is noticed early, permitting corrective action to be taken before the situation escalates. AI systems that dynamically aggregate information in real time can
help improve situational awareness.
4. Commitment to resilience means actively preparing to tackle unex-
pected problems. HROs train their teams in adaptability and improvising solu-
tions when confronted with novel circumstances. By practicing dealing with issues
they have not seen before, employees develop problem-solving skills that will help
them cope if anything new and unexpected arises in reality. AIs have the poten-
tial to enhance teams’ on-the-spot problem-solving, such as by creating surprising
situations for testing organizational efficiency.
5. Under-specification of structures means information can flow rapidly in
a system. Instead of having rigid chains of communication that employees must
follow, HROs have communication throughout the whole system. All employees
are allowed to raise an alarm, regardless of their level of seniority. This increases
the likelihood that problems will be flagged early and also allows information to
travel rapidly throughout the organization. This under-specification of structures
is also sometimes referred to as “deference to expertise,” because it means that all
employees are empowered to make decisions relating to their expertise, regardless
of their place in the hierarchy.

High-reliability organizations (HROs) provide valuable insights into the development and application of AI technology. By emulating the characteristics of HROs, we can
create combined human-machine systems that prioritize safety and mitigate risks.
These sociotechnical systems should continuously monitor their own behavior and
the environment for anomalies and unanticipated side effects. These systems should
also support combined human-machine situational awareness and improvisational
planning, allowing for real-time adaptation and flexibility. Lastly, AIs should have
models of their own expertise and the expertise of human operators to ensure effec-
tive problem routing. By adhering to these principles, we can develop AI systems
that function like HROs, ensuring high reliability and minimizing the potential risks
associated with their deployment and use.

Criticisms of HRO Theory


Doubts have been raised about how widely HRO theory can be applied.
Although the practices listed above can improve safety, a main criticism of HRO
theory is that they cannot be applied to all systems and technologies [269]. This is
because the theory was developed from a relatively small group of example systems,
and certain features of them cannot be replicated in all systems.
It is difficult to understand systems sufficiently well. First, in the exam-
ples of HROs identified (such as air traffic control or nuclear power plants), operators
usually have near-complete knowledge of the technical processes involved. These or-
ganizations’ processes have also remained largely unchanged for decades, allowing for
lessons to be learned from errors and for safety systems to become more refined. How-
ever, according to NAT, the main reason that complexity contributes to accidents is
that it precludes a thorough understanding of all processes, and anticipation of all
potential failure modes. HROs with near-complete knowledge of technical processes
might be considered rare cases. These conditions cannot be replicated in all systems,
especially not in those operating new technologies.
HROs prioritize safety, but other organizations might not. The second
reason why HRO theory might not be broadly applicable is that its suggestions gen-
erally focus on prioritizing safety as a goal. This might make sense for several of the
example HROs, where safety is an intrinsic part of the mission. Airlines, for instance,
would not be viable businesses if they did not have a strong track record of transport-
ing passengers safely. However, this is less feasible in organizations where safety is not
so pertinent to the mission. In many other profit-maximizing organizations, safety
can conflict with the main mission, as safety measures may be costly and reduce
productivity.
Not all HROs are tightly coupled. Another criticism of HRO theory is that
several of the example systems might actually be considered loosely coupled. For
instance, in air traffic control, extra time is scheduled in between landings on the
same runway, to allow for delays, errors, and corrections. However, NAT claims that
tight coupling is the second system feature that makes accidents inevitable. Loosely
coupled systems may not, therefore, be a good counterexample.
Deference to expertise might not always be realistic. A final reservation
about HRO theory is that the fifth characteristic (deference to expertise) assumes
that employees will have the necessary knowledge to make the best decisions at the
local level. However, information on the system as a whole is sometimes required in
order to make the best decisions, as actions in one subsystem may have knock-on
effects for other subsystems. Employees might not always have enough information
about the rest of the system to be able to take this big-picture view.

Comparing NAT and HRO Theory


The debate over whether accidents are inevitable or avoidable remains
unsettled. A particular sticking point is that, despite having low accident rates,
some of the HRO examples have experienced multiple near misses. This could be
interpreted as evidence that NAT is correct. We could view it as a matter of luck
that these near misses did not become anything larger. This would indicate that
organizations presented as HROs are in fact vulnerable to accidents. On the other
hand, near misses could instead be interpreted as supporting HRO theory; the fact
that they did not turn into anything larger could be considered evidence that HROs
have the appropriate measures in place to prevent accidents. It is not clear which of
these interpretations is correct [269].
Nevertheless, both NAT and HRO theory have contributed important concepts to
safety engineering. NAT has identified complexity and tight coupling as key risk fac-
tors, while HRO theory has developed important principles for a good organizational
safety culture. Both schools of thought acknowledge that complex systems must be
treated differently from simpler systems, requiring consideration of all the compo-
nents, their interactions, and human factors. We will now explore some alternative
approaches that view system safety more holistically, rather than considering it a
product of reliable components, interactions, and operators.

Rasmussen’s Risk Management Framework and AcciMap


System control with safety boundaries. Rasmussen’s Risk Management
Framework (RMF) is an accident model that recognizes that accidents are usually
the culmination of many different factors, rather than a single root cause [271]. This
model frames risk management as a question of control, emphasizing the need for
clearly defined safety boundaries that a system’s operations must stay within.

Levels of organization and AcciMap. The RMF considers six hierarchical levels of organization within a system, each of which can affect its safety: government,
regulators, the company, management, frontline workers, and the work itself. By
drawing out an “AcciMap” with this hierarchy, we can identify actors at different
levels who share responsibility for safety, as well as conditions that may influence the
risk of an accident. This analysis makes it explicit that accidents cannot be solely
explained by an action at the sharp end.

Systems can gradually migrate into unsafe states. The RMF also asserts
that behaviors and conditions can gradually “migrate” over time, due to environ-
mental pressures. If this migration leads to unsafe systemic conditions, this creates
the potential for an event at the sharp end to trigger an accident. This is why it
is essential to continually enforce safety boundaries and avoid the system migrating
into unsafe states.

Figure 4.9. Rasmussen’s risk management framework lays out six levels of organization and
their interactions, aiming to mark consistent safety boundaries by identifying hazards and
those responsible for them.

System-Theoretic Accident Model and Processes (STAMP)


STAMP is based on insights from the study of complex systems. Accord-
ing to the systems-theory paradigm, safety is an emergent property that is unlikely
to be sufficiently understood just by looking at individual components in isolation.
This is the view taken by System-Theoretic Accident Model and Processes (STAMP).
STAMP identifies multiple levels of organization within a system, where each level
is of higher complexity than the one below. Each level has novel emergent proper-
ties that cannot be practically understood through a reductive analysis of the level
below. STAMP also recognizes that a system can be highly reliable but still be un-
safe, and therefore puts the emphasis on safety rather than just on the reliability of
components.

STAMP frames safety as a question of top-down control. STAMP proposes that safety can be enforced by each level effectively placing safety constraints
on the one below to keep operations from migrating into unsafe states [269]. Per-
forming STAMP-based risk analysis and management involves creating models of
four aspects of a system: the organizational safety structure, the dynamics that can
cause this structure to deteriorate, the models of system processes that operators
must have, and the surrounding context. We will now discuss each of these in more
detail.

The organizational safety structure. The first aspect is the safety constraints,
the set of unsafe conditions which must be avoided. It tells us which components and
operators are in place to avoid each of those unsafe conditions occurring. This can
help to prevent accidents from component failures, design errors, and interactions
between components that could produce unsafe states.

Dynamic deterioration of the safety structure. The second aspect is about how the safety structure can deteriorate over time, leading to safety constraints be-
ing enforced less stringently. Systems can “migrate” toward failure when many small
events escalate into a larger accident. Since complex and sociotechnical systems in-
volve large numbers of interactions, we cannot methodically compute the effects of
every event within the system and exhaustively identify all the pathways to an acci-
dent. We cannot always reduce an accident to a neat chain of events or find a root
cause: such instincts are often based on the desire to have a feeling of control by as-
signing blame. Instead, it might make sense to describe a system as migrating toward
failure, due to the accumulation of many seemingly insignificant events.
This might include natural processes, such as wear and tear of equipment. It can
also include systemic factors, such as competitive pressures, that might compel em-
ployees to omit safety checks. If being less safety-conscious does not quickly lead to
an accident, developers might start to think that safety-consciousness is unnecessary.
Having a model of these processes can increase awareness and vigilance around what
needs to be done to maintain an effective safety structure.

Knowledge and communication about process models. The third aspect is the knowledge that operators must have about how the system functions in order to
make safe decisions. Operators may be humans or automated systems that have to
monitor feedback from the system and respond to keep it on track.
The process model that these operators should have includes the assumptions about
operating conditions that were made during the design stage so that they will be
aware of the conditions in which the system might not function properly, such as
outside regular temperature ranges. It might also include information about how the
specific subsystem that the operator is concerned with interacts with other parts
of the system. The communication required for operators to maintain an accurate
process model over time should also be specified. This can help to avoid accidents
resulting from operators or software making decisions based on inaccurate beliefs
about how the system is functioning.

The cultural and political context of the decision-making processes. The fourth aspect is the systemic factors that could influence the safety structure. It
might include information about who the stakeholders are and what their primary
concern is. For example, governments may impose stringent regulations, or they may
put pressure on an organization to reach its goals quickly, depending on what is most
important to them at the time. Similarly, social pressures and attitudes may put
pressure on organizations to improve safety or pressure to achieve goals quickly.

Table 4.2 summarizes how the STAMP perspective contrasts with those of traditional
component failure models.
TABLE 4.2 STAMP makes assumptions that differ from traditional component failure models.

Old assumption: Accidents are caused by chains of directly related events.
New assumption: Accidents are complex processes involving the entire sociotechnical system.

Old assumption: We can understand accidents by looking at chains of events leading to the accident.
New assumption: Traditional event-chain models cannot describe this process adequately.

Old assumption: Safety is increased by increasing system or component reliability.
New assumption: High reliability is not sufficient for safety.

Old assumption: Most accidents are caused by operator error.
New assumption: Operator error is a product of various environmental factors.

Old assumption: Assigning blame is necessary to learn from and prevent accidents.
New assumption: We should holistically understand how the system’s behavior contributed to the accident.

Old assumption: Major accidents occur from simultaneous occurrences of random events.
New assumption: Systems tend to migrate toward states of higher risk.

STAMP-based analysis techniques include System-Theoretic Process Analysis (STPA). On a practical level, there are methods of analyzing systems
that take the holistic approach outlined by STAMP. These include System-Theoretic
Process Analysis (STPA), which can be used at the design stage, and involves steps
such as identifying hazards and constructing a control structure to mitigate their
effects and improve system safety.
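
A natural first step in an STPA-style analysis is simply to write down the control structure: who controls what, through which control actions and feedback channels, and which safety constraints each level is responsible for enforcing. The sketch below is a hypothetical illustration for an AI deployment; the levels, actions, and constraints shown are assumptions for the sake of example, not a prescribed structure.

```python
from dataclasses import dataclass, field

@dataclass
class ControlLoop:
    """One level of an STPA-style control structure (a simplified sketch)."""
    controller: str
    controlled_process: str
    control_actions: list[str] = field(default_factory=list)
    feedback: list[str] = field(default_factory=list)
    safety_constraints: list[str] = field(default_factory=list)

# Hypothetical control structure, from regulator down to the deployed model.
# Each level enforces constraints on the level below it.
control_structure = [
    ControlLoop(
        controller="Regulator",
        controlled_process="AI company",
        control_actions=["licensing requirements", "incident-reporting rules"],
        feedback=["audit results", "incident reports"],
        safety_constraints=["unsafe systems must not be deployed to the public"],
    ),
    ControlLoop(
        controller="Company management",
        controlled_process="Development team",
        control_actions=["release approval", "resource allocation"],
        feedback=["evaluation results", "red-team findings"],
        safety_constraints=["models must pass safety evaluations before release"],
    ),
    ControlLoop(
        controller="Human operators",
        controlled_process="Deployed AI system",
        control_actions=["configuration changes", "shutdown"],
        feedback=["monitoring dashboards", "anomaly alerts"],
        safety_constraints=["the system must stay within approved operating conditions"],
    ),
]

for loop in control_structure:
    print(f"{loop.controller} -> {loop.controlled_process}: {loop.safety_constraints}")
```
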

Dekker’s Drift into Failure model


Decrementalism is the deterioration of system processes through a series
of small changes. A third accident model based on systems theory is Dekker’s
Drift into Failure (DIF) model [272]. DIF focuses on the migration of systems that
the RMF and STAMP also acknowledge, describing how this can lead to a “drift into
failure.” Since an individual decision to change processes may be relatively minor,
it can seem that it will not make any difference to a system’s operations or safety.
For this reason, systems are often subject to decrementalism, a gradual process of
changes through one small decision at a time that degrades the safety of a system’s
operations.
Many relatively minor decisions can combine to lead to a major dif-
ference in risk. Within complex systems, it is difficult to know all the potential
consequences of a change in the system, or how it might interact with other changes.
Many alterations to processes within a system, each of which might not make a dif-
ference by itself, can interact in complex and unforeseen ways to result in a much
higher state of risk. This is often only realized when an accident happens, at which
point it is too late.
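
A toy calculation can illustrate how decrementalism compounds. The numbers below are purely illustrative: each change raises the accident probability by an amount that looks negligible on its own, yet the cumulative drift is large.

```python
# Toy illustration of decrementalism: each small process change seems to
# raise the chance of an accident only slightly, but together the changes
# leave the system far riskier. All numbers are illustrative.

baseline_accident_prob = 0.001          # per year, before any changes
per_change_multiplier = 1.15            # each "minor" change raises risk by 15%
num_small_changes = 20

p = baseline_accident_prob
for _ in range(num_small_changes):
    p *= per_change_multiplier

print(f"Baseline:               {baseline_accident_prob:.4f}")
print(f"After one small change: {baseline_accident_prob * per_change_multiplier:.4f}")
print(f"After {num_small_changes} small changes:  {min(p, 1.0):.4f}")
# Each step looks like a rounding error; the cumulative drift is roughly a
# sixteen-fold increase in accident probability.
```
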

Summary. Normal accident theory argues that accidents are inevitable in systems
with a high degree of complexity and tight coupling, no matter how well they are
organized. On the other hand, it has been argued that HROs with consistently low
accident rates demonstrate that it is possible to avoid accidents. HRO theory iden-
tifies five key characteristics that contribute to a good safety culture and reduce the
likelihood of accidents. However, it might not be feasible to replicate these across all
organizations.
Systemic models like Rasmussen’s RMF, STAMP, and Dekker’s DIF model are
grounded in an understanding of complex systems, viewing safety as an emergent
property. The RMF and STAMP both view safety as an issue of control and en-
forcing safety constraints on operations. They both identify a hierarchy of levels of
organization within a system, showing how accidents are caused by multiple factors,
rather than just by one event at the sharp end. DIF describes how systems are of-
ten subject to decrementalism, whereby the safety of processes is gradually degraded
through a series of small changes, each of which seems insignificant on its own.
In general, component failure models focus on identifying specific components or fac-
tors that can go wrong in a system and finding ways to improve those components.
These models are effective at pinpointing direct causes of failure and proposing tar-
geted interventions. However, they have a limitation in that they tend to overlook
other risk sources and potential interventions that may not be directly related to the
identified components. On the other hand, systemic accident models take a broader
approach by considering the interactions and interdependencies between various com-
ponents in a system, such as feedback loops, human factors, and diffuse causality
models. This allows them to capture a wider range of risk sources and potential
interventions, making them more comprehensive in addressing system failures.

4.6 DRIFT INTO FAILURE AND EXISTENTIAL RISKS

This book presents multiple ways in which the development and deployment of AIs
could entail risks, some of which could be catastrophic or even existential. However,
the systemic accident models discussed above highlight that events in the real world
often unfold in a much more complex manner than the hypothetical scenarios we use
to illustrate risks. It is possible that many relatively minor events could accumulate,
leading us to drift toward an existential risk. We are unlikely to be able to predict
and address every potential combination of events that could pave the route to a
catastrophe.
For this reason, although it can be useful to study the different risks associated
with AI separately when initially learning about them, we should be aware that
hypothetical example scenarios are simplified, and that the different risks coexist.
We will now discuss what we can learn from our study of complex systems and
systemic accident models when developing an AI safety strategy.
Risks that do not initially appear catastrophic might escalate. Risks tend
to exist on a spectrum. Power inequality, disinformation, and automation, for exam-
ple, are prevalent issues within society and are already causing harm. Though serious,
they are not usually thought of as posing existential risks. However, if pushed to an
extreme degree by AIs, they could result in totalitarian governments or enfeeblement.
Both of these scenarios could represent a catastrophe from which humanity may not
recover. In general, if we encounter harm from a risk on a moderate scale, we should
be careful to not dismiss it as non-existential without serious consideration.

Multiple lower-level risks can combine to produce a catastrophe. Another reason for thinking more comprehensively about safety is that, even if a risk is not
individually extreme, it might interact with other risks to bring about catastrophic
outcomes [273]. Imagine, for instance, a scenario in which competitive pressures fuel
an AI race between developers. This may lead a company to reduce its costs by
putting less money into maintaining robust information security systems, with the
result that a powerful AI is leaked. This would increase the likelihood that someone
with malicious intent successfully uses the AI to pursue a harmful outcome, such as
the release of a deadly pathogen.
In this case, the AI race has not directly led to an existential risk by causing companies
to, for example, bring AIs with insufficient safety measures to market. Nevertheless,
it has indirectly contributed to the existential threat of a pandemic by amplifying
the risk of malicious use.
This echoes our earlier discussion of catastrophes in complex systems, where we dis-
cussed how it is often impractical and infeasible to attribute blame to one major
“root cause” of failure. Instead, systems often “drift into failure” through an accu-
mulation and combination of many seemingly minor events, none of which would be
catastrophic alone. Just as we cannot take steps to prevent every possible mistake
or malfunction within a large, complex system, we cannot predict or control every
single way that various risks might interact to result in disaster.

Conflict and global turbulence could make society more likely to drift
into failure. Although we have some degree of choice in how we implement AI
within society, we cannot control the wider environment. There are several reasons
why events like wars that create societal turbulence could increase the risk of human
civilization drifting into failure. Faced with urgent, short-term threats, people might
deprioritize AI safety to focus instead on the most immediate concerns. If AIs can
be useful in tackling those concerns, it might also incentivize people to rush into
giving them greater power, without thinking about the long-term consequences. More
generally, a more chaotic environment might also present novel conditions for an AI,
that cause it to behave unpredictably. Even if conditions like war do not directly
cause existential risks, they make them more likely to happen.

Broad interventions may be more effective than narrowly targeted ones.


Previous attempts to manage existential risks have focused narrowly on avoiding
risks directly from AIs, and mainly addressed this goal through technical AI research.
Given the complexity of AIs themselves and the systems they exist within, it makes
sense to adopt a more comprehensive approach, taking into account the whole risk
landscape, including threats that may not immediately seem catastrophic. Instead
of attempting to target just existential risks precisely, it may be more effective to
implement broad interventions, including sociotechnical measures.

Summary. As we might expect from our study of complex systems, different types
of risks are inextricably related and can combine in unexpected ways to amplify one
another. While some risks may be generally more concerning than others, we cannot
neatly isolate those that could contribute to an existential threat from those that
could not, and then only focus on the former while ignoring the latter. In addressing
existential threats, it is therefore reasonable to view systems holistically and consider
a wide range of issues, besides the most obvious catastrophic risks. Due to system
complexity, broad interventions are likely to be required as well as narrowly targeted
ones.

4.7 TAIL EVENTS AND BLACK SWANS

In the first few sections of this chapter, we discussed failure modes and hazards,
equations for understanding the risks they pose, and principles for designing safer
systems. We also looked at methods of analyzing systems to model accidents and
identify hazards and explored how different styles of analysis can be helpful for com-
plex systems.
The classic risk equation tells us that the level of risk depends on the probability
and severity of the event. A particular class of events, called tail events, have a very
low probability of occurrence but a very high impact upon arrival. Tail events pose
some unique challenges for assessing and reducing risk, but any competent form of
risk management must attempt to address them. We will now explore these events
and their implications in more detail.

4.7.1 Introduction to Tail Events


Tail events are events that happen rarely, but have a considerable impact when they
do. Consider some examples of past tail events.
The 2007–2008 financial crisis: Fluctuations happen continually in financial markets,
but crises of this scale are rare and have a large impact, with knock-on effects for
banks and the general population.
The COVID-19 pandemic: There are many outbreaks of infectious diseases every
year, but COVID-19 spread much more widely and killed many more people than
most. It is rare for an outbreak to become a pandemic, but those that do will have a
much larger impact than the rest.
The Internet: Many technologies are being developed all the time, but very few
become so widely used that they transform society as much as the Internet has. This
example illustrates that some tail events happen more gradually than others; the
development and global adoption of the internet unfolded over decades, rather than
happening as suddenly as the financial crisis or the pandemic. However, “sudden” is
a relative term. If we look at history on the scale of centuries, then the transition
into the Internet age can also appear to have happened suddenly.
ChatGPT: After being released in November 2022, ChatGPT gained 100 million users
in just two months [274]. There are many consumer applications on the internet, but
ChatGPT’s user base grew faster than those of any others launched before it. Out
of many DL models, ChatGPT was the first to go viral in this way. Its release also
represented a watershed moment in the progress of generative AI, placing the issue
much more firmly in the public consciousness.

We need to consider the possibility of harmful tail events in risk management. The last two examples—the Internet and ChatGPT—illustrate that the
impacts of tail events are not always strictly negative; they can also be positive or
mixed. However, tail risks are usually what we need to pay attention to when trying
to engineer safer systems.
Since tail events are rare, it can be tempting to think that we do not need to worry
about them. Indeed, some tail events have not yet happened once in human history,
such as a meteorite strike large enough to cause global devastation, or a solar storm
intense enough to knock out the power grid. Nonetheless, tail events have such a high
impact that it would be unwise to ignore the possibility that they could happen. As
noted by the political scientist Scott Sagan: “Things that have never happened before
happen all the time.” [275]

AI-related tail events could have a severe impact. As AIs are increasingly
deployed within society, some tail risks we should consider include the possibility that
an AI could be used to develop a bioweapon, or that an AI might hack a bank and
wipe out its financial records. Even if these eventualities have a low probability of
occurring, it would only take one such event to cause widespread devastation. Such
an event could define the overall impact of an AI’s deployment. For this reason, com-
petent risk management must involve serious efforts to prevent tail events, however
rare we think they might be.

4.7.2 Tail Events Can Greatly Affect the Average Risk

A tail event often changes the mean but not the median. Figure 4.10 can
help us visualize how tail events affect the wider risk landscape. The graphs show
data points representing individual events, with their placement along the x-axis
indicating their individual impact.
In the first graph, we have numerous data points representing frequent, low-impact
events: these are all distributed between 0 and 100, and mostly between 0 and 10.
The mean impact and median impact of this dataset have similar values, marked on
the x-axis.
In the second graph we have the same collection of events, but with the addition of a
single data point of much higher impact—a tail event with an impact of 10,000. As
indicated in the graph, the median impact of the dataset is approximately the same
as before, but the mean changes substantially and is no longer representative of the
general population of events.

Figure 4.10. Two panels show the distribution of event impact on a log scale, with the
mean and median marked, without and with a single tail event. The occurrence of a tail
event can dramatically shift the mean but not the median of the event type’s impact.
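To make the effect concrete, the short Python sketch below reproduces it with made-up
impact values; the specific numbers are illustrative and are not taken from the figure.

```python
# Illustrative sketch: a single tail event moves the mean far more than the median.
import statistics

routine = [1, 2, 2, 3, 3, 3, 4, 5, 8, 10] * 10  # 100 frequent, low-impact events (made-up values)
with_tail = routine + [10_000]                  # the same events plus one tail event of impact 10,000

print(statistics.median(routine), round(statistics.mean(routine), 1))      # median 3.0, mean 4.1
print(statistics.median(with_tail), round(statistics.mean(with_tail), 1))  # median 3, mean ~103.1
```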

We can also think about tail events in terms of the classic risk equation.
Tail events have a low probability, but because they are so severe, we nonetheless
evaluate the risk they pose as being large:

Risk(hazard) = P(hazard) × severity(hazard).

Depending on the exact values of probability and severity, we may find that tail risks
are just as large as—or even larger than—the risks posed by much smaller events
that happen all the time. In other words, although they are rare, we cannot afford to
ignore the possibility that they might happen.
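As a rough illustration, the sketch below applies the risk equation to two hypothetical
hazards; the probabilities and severities are invented purely for the example.

```python
# A minimal sketch of Risk = P(hazard) x severity(hazard) for two hypothetical hazards.
hazards = {
    "frequent minor failure": (0.2, 10),      # (probability per year, harm in arbitrary units)
    "rare tail event": (0.0001, 50_000),
}

for name, (probability, severity) in hazards.items():
    print(f"{name}: risk = {probability * severity}")
# frequent minor failure: risk = 2.0
# rare tail event: risk = 5.0  -> the larger risk, despite the far lower probability
```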

It is difficult to plan for tail events because they are so rare. Since we
can hardly predict when tail events will happen, or even if they will happen at all,
it is much more challenging to plan for them than it is for frequent, everyday events
that we know we can expect to encounter. It is often the case that we do not know
exactly what form they will take either.
For these reasons, we cannot plan the specific details of our response to tail events
in advance. Instead, we must plan to plan. This involves organizing and developing
an appropriate response, if and when it is necessary—how relevant actors should
convene to decide on and coordinate the most appropriate next steps, whatever the
precise details of the event. Often, we need to figure out whether some phenomena
even present tail events, for which we need to consider their frequency distributions.
We consider this concept next.

4.7.3 Tail Events Can Be Identified From Frequency Distributions

Frequency distributions tell us how common instances of different magnitudes
are. To understand the origins of tail events, we need to understand fre-
quency distributions. These distributions tell us about the proportion of items in a
dataset that have each possible value of a given variable. Suppose we want to study
some quantity, such as the ages of buildings. We might plot a graph showing how
many buildings there are in the world of each age, and it might look something like
the generic graph shown in Figure 4.11.
The x-axis would represent building age, while the y-axis would indicate the number
of buildings of each age—the frequency of a particular age appearing in the dataset.
If our graph looked like Figure 4.11, it would tell us that there are many buildings
that are relatively new, perhaps only a few decades or a hundred years old, fewer
buildings that are several hundred or a thousand years old, and very few buildings,
such as the Pyramids at Giza, that are several thousand years old.
Figure 4.11. Many distributions have a head (an area where most of the probability is
concentrated) and one or two tails (extreme regions of the distribution). The plotted
example is an exponential curve of radioactive particles remaining over time, with the
head and tail regions labeled.

Many real-world frequency distributions have long tails. We can plot graphs
like this for countless variables, from the size of different vertebrate species
to the number of countries different people have visited. Each variable will have its
own unique distribution, but many have the general property that there are lots of
occurrences of a low value and relatively few occurrences of a high value. There are
many vertebrate species with a mass of tens of kilograms, and very few with a mass
in the thousands of kilograms; there are many people who have visited somewhere
between 1–10 countries, and few people who have visited more than 50.
We can determine whether we are likely to observe tail events of a particular type
by examining whether its frequency distribution has thin tails or long tails. In thin-
tailed distributions, tail events do not exist. Examples of thin-tailed distributions
include human characteristics such as height, weight, and intelligence. No one is over
100 meters tall, weighs over 10,000 kilograms, or has an IQ of 10,000. By contrast, in
long-tailed distributions, tail events are possible. Examples of long-tailed distributions
include book sales, earthquake magnitude, and word frequency. While most books
only sell a handful of copies, most earthquakes are relatively harmless, and most
words are rare and infrequently used, some books sell millions or even billions of
copies, some earthquakes flatten cities, and some words (such as “the” or “I”) are
used extremely frequently. Of course, not all distributions neatly fit into a dichotomy
of thin-tailed or long-tailed, but may be somewhere in between.
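This contrast can be illustrated numerically. The sketch below samples a thin-tailed
and a long-tailed distribution (a normal distribution standing in for heights and a
Pareto distribution standing in for book sales; the choices and parameters are purely
illustrative) and compares how much of the total the top 1% accounts for.

```python
# Sketch: share of the total held by the top 1% under thin-tailed vs long-tailed data.
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=100_000)  # thin-tailed stand-in (cm)
sales = rng.pareto(a=1.1, size=100_000) + 1            # long-tailed stand-in (copies sold)

def top_share(values, fraction=0.01):
    values = np.sort(values)[::-1]
    k = int(len(values) * fraction)
    return values[:k].sum() / values.sum()

print(f"Top 1% share of total height: {top_share(heights):.1%}")  # close to 1%
print(f"Top 1% share of total sales:  {top_share(sales):.1%}")    # typically tens of percent
```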

4.7.4 A Caricature of Tail Events

To illustrate the difference between long-tailed and thin-tailed distributions, we will
now run through some comparisons between the two categories. Note that, with these
statements, we are describing simplified caricatures of the two scenarios for pedagogical
purposes.

TABLE 4.3 A caricature of thin tails and long tails reveals several trends that often hold
for each.

Thin tails: The top few receive a proportionate share of the total.
Long tails: The top few receive a disproportionately large share of the total.

Thin tails: The total is determined by the whole group (“tyranny of the collective”).
Long tails: The total is determined by a few extreme occurrences (“tyranny of the accidental”).

Thin tails: The typical member of a group has an average value, close to the mean.
Long tails: The typical member is either a giant or a dwarf.

Thin tails: A single event cannot escalate to become much bigger than the average.
Long tails: A single event can escalate to become much bigger than many others put together.

Thin tails: Individual data points vary within a small range that is close to the mean.
Long tails: Individual data points can vary across many orders of magnitude.

Thin tails: We can predict roughly what value a single instance will take.
Long tails: It is much harder to robustly predict even the rough value that a single instance
will take.

Contrast 1: Share of the total received by the top few

Under thin tails, the top few receive quite a proportionate share of the
total. If we were to measure the heights of a group of people, the total height of
the tallest 10% would not be much more than 10% of the total height of the whole
group.

Under long tails, the top few receive a disproportionately large share of
the total. In the music industry, the revenue earned by the most successful 1% of
artists represents around 77% of the total revenue earned by all artists.

Contrast 2: Who determines the total?

Under thin tails, the total is determined by the whole group. The total
height of the tallest 10% of people is not a very good approximation of the total
height of the whole group. Most members need to be included to get a good measure
of the total. This is called “tyranny of the collective.”

Under long tails, the total is determined by a few extreme occurrences.


As discussed above, the most successful 1% of artists earn 77% of the total revenue
earned by all artists. 77% is a fairly good approximation of the total. In fact, it is a
better approximation than the revenue earned by the remaining 99% of artists would
be. This is called “tyranny of the accidental.”

Contrast 3: The typical member

Under thin tails, the typical member of a group has an average value.
Almost no members are going to be much smaller or much larger than the mean.

Under long tails, the typical member is either a giant or a dwarf. Mem-
bers can generally be classified as being either extreme and high-impact or relatively
insignificant.
Note that, under many real-world long-tailed distributions, there may be occurrences
that seem to fall between these two categories. There may be no clear boundary
dividing occurrences that count as insignificant from those that count as extreme.

Contrast 4: Scalability of events

Under thin tails, the impact of an event is not scalable. A single event
cannot escalate to become much bigger than the average.

Under long tails, the impact of an event is scalable. A single event can
escalate to become much bigger than many others put together.

Contrast 5: Randomness

Under thin tails, individual data points vary within a small range that
is close to the mean. Even the data point that is furthest from the mean does
not change the mean of the whole group by much.

Under long tails, individual data points can vary across many orders of
magnitude. A single extreme data point can completely change the mean of the
sample.

Contrast 6: Predictability

Under thin tails, we can predict roughly what value a single instance will
take. We can be confident that our prediction will not be far off, since instances
cannot stray too far from the mean.

Under long tails, it is much harder to predict even the rough value that
a single instance will take. Since data points can vary much more widely, our
best guesses can be much further off.
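As a rough demonstration of this contrast, the sketch below uses the running mean of
everything seen so far to predict each next observation; the two distributions are
illustrative stand-ins, not empirical data.

```python
# Sketch: predicting the next observation with the running mean, thin-tailed vs long-tailed.
import numpy as np

rng = np.random.default_rng(3)
thin = rng.normal(100, 10, size=10_000)
long_tailed = 100 * (rng.pareto(a=1.1, size=10_000) + 1)

for label, data in [("thin-tailed", thin), ("long-tailed", long_tailed)]:
    predictions = np.cumsum(data)[:-1] / np.arange(1, len(data))  # mean of all prior observations
    errors = np.abs(data[1:] - predictions)
    print(label, round(float(np.median(errors)), 1), round(float(errors.max()), 1))
# For the thin-tailed data, the worst prediction error stays within roughly an order of
# magnitude of the typical error; for the long-tailed data, a single observation can be
# wrong by several orders of magnitude.
```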
Having laid the foundations for understanding tail events in general, we will now
consider an important subset of tail events: black swans.

4.7.5 Introduction to Black Swans


In addition to being rare and high-impact, as all tail events are, black swans are also
unanticipated, seemingly coming out of the blue. The term “black swan” originates
from a historical event that provides a useful analogy.
Finding a black swan. It was long believed in Europe that all swans were white
because all swan sightings known to Europeans were of white swans. For this reason,
the term “black swan” came to denote something nonexistent, or even impossible,
much as today we say “pigs might fly.” The use of this metaphor is documented as
early as Roman times. However, in 1697, a group of Dutch explorers encountered
a black species of swan in Australia. This single, unexpected discovery completely
overturned the long-held axiom that all swans were white.
This story offers an analogy for how we can have a theory or an assumption that seems
correct for a long time, and then a single, surprising observation can completely upend
that model. Such an observation can be classed as a tail event because it is rare and
high-impact. Additionally, the fact that the observation was unforeseen shows us that
our understanding is incorrect or incomplete.
From here on we will use the following working definition of black swans: A black swan
is a tail event that was largely unpredictable to most people before it happened. Note
that not all tail events are black swans; high-magnitude earthquakes, for example,
are tail events, but we know where they are likely to happen eventually—they are on
our radar.

4.7.6 Known Unknowns and Unknown Unknowns


Black swans are “unknown unknown” tail events [276]. We can sort events
into four categories, as shown in the table below.

Known knowns: things we are aware of and understand.
Unknown knowns: things that we do not realize we know (such as tacit knowledge).
Known unknowns: things we are aware of but which we don’t fully understand.
Unknown unknowns: things that we do not understand, and which we are not even aware
we do not know.

In these category titles, the first word refers to our awareness, and the second refers
to our understanding. We can now consider these four types of events in the context
of a student preparing for an exam.
1. We know that we know. Known knowns are things we are both aware of and
understand. For the student, these would be the types of questions that have
come up regularly in previous papers and that they know how to solve through
recollection. They are aware that they are likely to face these, and they know how
to approach them.
2. We do not know what we know. Unknown knowns are things we understand
but may not be highly aware of. For the student, these would be things they have
not thought to prepare for but which they understand and can do. For instance,
there might be some questions on topics they hadn’t reviewed; however, looking at
these questions, the student finds that they know the answer, although they cannot
explain why it is correct. This is sometimes called tacit knowledge or unaccounted
facts.
3. We know that we do not know. Known unknowns are things we are aware of
but do not fully understand. For the student, these would be the types of questions
that have come up regularly in previous papers but which they have not learned
how to solve reliably. The student is aware that they are likely to face these but
is not sure they will be able to answer them correctly. However, they are at least
aware that they need to do more work to prepare for them.
4. We do not know that we do not know. Unknown unknowns are things we are
unaware of and do not understand. These problems catch us completely off guard
because we didn’t even know they existed. For the student, unknown unknowns
would be unexpectedly hard questions on topics they have never encountered
before and have no knowledge or understanding of.

Unknown unknowns can also occur in AI safety and risk. Researchers may be
diligently studying various aspects of AI and its potential risks, but new
and unforeseen risks could emerge as AI technology advances. These risks may be
completely unknown and unexpected, catching researchers off guard. It is important
to acknowledge the existence of unknown unknowns because they remind us that
there are limits to our knowledge and understanding. By being aware of this, we
can be more humble in our approach to problem-solving and continuously strive to
expand our knowledge and prepare for the unexpected.

We struggle to account for known unknowns and unknown unknowns.


We have included the first two categories—known knowns and unknown knowns—for
completeness. However, the most important categories in risk analysis and manage-
ment are the last two: known unknowns and unknown unknowns. These categories
pose risks because we do not fully understand how best to respond to them, and we
cannot be perfectly confident that we will not suffer damage from them.

Unknown unknowns are particularly concerning. If we are aware that we might
face a particular challenge, we can learn more and prepare for it. However, if
we are unaware that we will face a challenge, we may be more vulnerable to harm.
Black swans are the latter type of event; they are not even on our radar before they
happen.
The difference between known unknowns and unknown unknowns is sometimes also
described as a distinction between conscious ignorance and meta-ignorance. Conscious
ignorance is when we see that we do not know something, whereas meta-ignorance is
when we are unaware of our ignorance.

Black swans in the real world. It might be unfair for someone to present us with
an unknown unknown, such as finding questions on topics irrelevant to the subject in
an exam setting. The wider world, however, is not a controlled environment; things
do happen that we have not thought to prepare for.

Black swans indicate that our worldview is inaccurate or incomplete.


Consider a turkey being looked after by humans, who provide plenty of food and a
comfortable shelter safe from predators. According to all the turkey’s evidence, the
humans are benign and have the turkey’s best interests at heart. Then, one day, the
turkey is taken to the slaughterhouse. This is very much an unknown unknown, or a
black swan, for the turkey, since nothing in its experience suggested that this might
happen [276].
This illustrates that we might have a model or worldview that does a good job of
explaining all our evidence to date, but then a black swan can turn up and show us
that our model was incorrect. The turkey’s worldview of benign humans explained
all the evidence until the slaughterhouse trip. This event indicated a broader context
that the turkey was unaware of.
Similarly, consider the 2008 financial crisis. Before this event, many people, including
many of those working in finance, assumed that housing prices would always continue
to increase. When the housing bubble burst, it showed that this assumption was
incorrect.

Black swans are defined by our understanding. A black swan is a black swan
because our worldview is incorrect or incomplete, which is why we fail to predict it.
In hindsight, such events often only make sense after we realize that our theory
was flawed. Seeing black swans makes us update our models to account for the new
phenomena we observe. When we have a new, more accurate model, we can often
look back in time and find the warning signs in the lead-up to the event, which we
did not recognize as such at the time.
These examples also show that we cannot always reliably predict the future from our
experience; we cannot necessarily make an accurate calculation of future risk based
on long-running historical data.

Distinguishing black swans from other tail events


Only some tail events are black swans. As touched on earlier, it is essential
to note that black swans are a subset of tail events, and not all tail events are black
swans. For example, it is well known that earthquakes happen in California and that
a high-magnitude one, often called “the big one,” will likely happen at some point.
It is not known exactly when—whether it will be the next earthquake or in several
decades. It might not be possible to prevent all damage from the next “big one,” but
there is an awareness of the need to prepare for it. This represents a tail event, but
not a black swan.

Some people might be able to predict some black swans. A restrictive
definition of a black swan is an event that is an absolute unknown unknown for
everybody and is impossible to anticipate. However, for our purposes, we are using the
looser, more practical working definition given earlier: a highly impactful event that
is largely unexpected for most people. For example, some individuals with relevant
knowledge of the financial sector did predict the 2008 crisis, but it came out of the
blue for most people. Even among financial experts, the majority did not predict it.
Therefore, we count it as a black swan.
Similarly, although pandemics have happened throughout history, and smaller disease
outbreaks occur yearly, the possibility of a pandemic was not on most people’s radar
before COVID-19. People with specific expertise were more conscious of the risk, and
epidemiologists had warned several governments for years that they were inadequately
prepared for a pandemic. However, COVID-19 took most people by surprise and
therefore counts as a black swan under the looser definition.

The development and rollout of AI technologies could be subject to black swans.
Within the field of AI, the consensus view for a long time was that DL
techniques were fundamentally limited. Many people, even computer science profes-
sors, did not take seriously the idea that DL technologies might transform society
in the near term—even if they thought this would be possible over a timescale of
centuries.
DL technologies have already begun to transform society, and the rate of progress
has outpaced most people’s predictions. We should, therefore, seriously consider the
possibility that the release of these technologies could pose significant risks to society.
There has been speculation about what these risks might be, such as a flash war
and autonomous economy, which are discussed in the Collective Action Problems
chapter. These eventualities might be known to some people, but for many potential
risks, there is not widespread awareness in society; if they happened today, they would
be black swans. Policymakers must have some knowledge of these risks. Furthermore,
the expanding use of AI technologies may entail risks of black swan scenarios that no
one has yet imagined.

4.7.7 Implications of Tail Events and Black Swans for Risk Analysis

Tail events and black swans present problems for analyzing and managing risks,
because we do not know if or when they will happen. For black swans, there is the
additional challenge that we do not know what form they will take.
Since, by definition, we cannot predict the nature of black swans in advance, we
cannot put any specific defenses in place against them, as we might for risks we have
thought of. We can attempt to factor black swans into our equations to some degree,
by trying to estimate roughly how likely they are and how much damage they would
cause. However, they add much more uncertainty into our calculations. We will now
discuss some common tendencies in thinking about risk, and why they can break
down in situations that are subject to tail events and black swans.
First, we consider how our typical risk estimation methods break down under long
tails because our standard arsenal of statistical tools is rendered useless. Then, we
consider how cost-benefit analysis is strained when dealing with long-tailed events
because of its sensitivity to our highly uncertain estimates. After this, we discuss
why we should be more explicitly considering extremes instead of averages, and look
at three common mistakes when dealing with long-tailed data: the delay fallacy,
interpreting an absence of evidence, and the preparedness paradox.

Typical risk estimation methods break down under long tails


Tail events and black swans can substantially change the average risk of
a system. It is challenging to account for tail events in the risk equation. Since
tail events and black swans are extremely severe, they significantly affect the average
outcome. Recall the equation for risk associated with a system:
Risk = Σ_hazard P(hazard) × severity(hazard).

Additionally, it is difficult to estimate their probability and severity accurately. Yet,
they would considerably change the evaluation of risk because they are so severe.
Furthermore, since we do not know what form black swans will take, it may be even
more difficult to factor them into the equation accurately. This renders the usual
statistical tools useless in practice for analyzing risk in the face of potential black
swans.
If the turkey in the previous example had tried to calculate the risk to its wellbe-
ing based on all its prior experiences, the estimated risk would probably have been
fairly low. It certainly would not have pointed to the high risk of being taken to the
slaughterhouse, because nothing like that had ever happened to the turkey before.

We need a much larger dataset than usual. As we increase the number of
observations, we converge on an average value. Suppose we are measuring heights and
calculating a new average every time we add a new data point. As shown in the first
graph in Figure 4.12, as we increase our number of data points, we quickly converge
on an average that changes less and less with the addition of each new data point.
This is a result of the law of large numbers.
Heights, however, are a thin-tailed variable. If we look instead at a long-tailed variable,
such as net worth, as shown in the second graph in Figure 4.12, a single extreme
observation can change the average by several orders of magnitude. The law of large
numbers still applies, in that we will still eventually converge on an average value,
but it will take much longer.
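A small simulation can illustrate how much more slowly the running mean settles for a
long-tailed variable. The distributions below are illustrative stand-ins for height and
net worth, not empirical data.

```python
# Sketch: running means of a thin-tailed vs a long-tailed sample as observations accumulate.
import numpy as np

rng = np.random.default_rng(1)
n = 4_000
heights = rng.normal(5.4, 0.3, size=n)                 # thin-tailed stand-in (feet)
net_worths = 10_000 * (rng.pareto(a=1.2, size=n) + 1)  # long-tailed stand-in (dollars)

for label, data in [("height", heights), ("net worth", net_worths)]:
    running_mean = np.cumsum(data) / np.arange(1, n + 1)
    print(label, [round(float(running_mean[i]), 1) for i in (99, 999, 3_999)])
# The running mean of height barely moves after a few hundred observations, while the
# running mean of net worth can still shift sharply whenever an extreme value arrives.
```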

Linear regression is a standard prediction method but is less useful for
long-tailed data. Linear regression is a technique widely used to develop predic-
tive models based on historical data. However, in situations where we are subject
to the possibility of tail events or black swans, we might not be sure that we have
enough historical data to converge on an accurate calculation of average risk. Linear
regression is, therefore, less helpful in assessing and predicting risk for long-tailed
scenarios.
Figure 4.12. Running means plotted against the number of observations: the average
height (in feet) of a thin-tailed sample stabilizes quickly, while the average net worth
(in dollars, log scale) of a long-tailed sample is slow to converge, rendering the mean a
problematic summary statistic in practice.

Explicit cost-benefit analysis is strained under long tails


Cost-benefit analysis using long-tailed data often requires highly accurate estimates.
Traditional cost-benefit analysis weighs the probability of different results and how
much we would lose or gain in each case. From this information, we can calculate
whether we expect the outcome of a situation to be positive or negative. For example,
if we bet on a 50/50 coin toss where we will either win $5 or lose $5, our overall
expected outcome is $0.

Example: lotteries. Now, imagine that we are trying to perform a cost-benefit
analysis for a lottery where we have a high probability of winning a small amount
and a low probability of losing a large amount. If we have a 99.9% chance of winning
$15 and a 0.1% chance of losing $10,000, then our expected outcome is:

(0.999 × 15) + (0.001 × −10000) = 4.985.


Since this number is positive, we might believe it is a good idea to bet on the lottery.
However, if the probabilities are only slightly different, at 99.7% and 0.3%, then our
expected outcome is:

(0.997 × 15) + (0.003 × −10000) = −15.045.

This illustrates that just a tiny change in probabilities sometimes makes a signifi-
cant difference in whether we expect a positive or a negative outcome. In situations
like this, where the expected outcome is highly sensitive to probabilities, using an
estimate of probability that is only slightly different from the actual value can com-
pletely change the calculations. For this reason, relying on this type of cost-benefit
analysis does not make sense if we cannot be sure we have accurate estimates of the
probabilities in question.
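A short sketch reproducing the lottery calculation shows just how sensitive the expected
outcome is to the assumed probability.

```python
# The lottery example above: the expected outcome flips sign with a 0.2% change in probability.
def expected_outcome(p_win, win=15, loss=-10_000):
    return p_win * win + (1 - p_win) * loss

print(round(expected_outcome(0.999), 3))  #   4.985 -> the bet looks favorable
print(round(expected_outcome(0.997), 3))  # -15.045 -> the bet looks unfavorable
```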

It is difficult to form accurate probability estimates for black swans.


Black swans happen rarely, so we do not have a lot of historical data from which to
calculate the exact probability that they will occur. As we explored above, it takes a
lot of data—often more than is accessible—to make accurate judgments for long-tailed
events more generally. Therefore, we cannot be certain that we know their prob-
abilities accurately, rendering cost-benefit analysis unsuitable for long-tailed data,
especially for black swans.
This consideration could be significant for deciding whether and how to use AI tech-
nologies. We might have a high probability of benefitting from the capabilities of
DL models, and there might be only a low probability of an associated black swan
transpiring and causing harm. However, we cannot calculate an accurate probability
of a black swan event, so we cannot evaluate our expected outcome precisely.

It is unrealistic to estimate risk when we could face black swans. If we
attempt to develop a detailed statistical model of risk for a situation, we are
making an implicit assumption that we have a comprehensive understanding of all
the possible failure modes and how likely they are. However, as previous black swan
events have demonstrated, we cannot always assume we know all the eventualities.
Even for tail events that are known unknowns, we cannot assume we have sufficiently
accurate information about their probabilities and impacts. Trying to precisely esti-
mate risk when we might be subject to tail events or black swans can be viewed as
an “unscientific overestimation of the reach of scientific knowledge” [259].

Thinking about extremes instead of averages is better under long tails


When making risk-related decisions, we should consider extremes, not
only the average. Aside from whether or not we can calculate an accurate average
outcome under the risk of tail events and black swans, there is also a question of
whether the average is what we should be paying attention to in these situations
anyway. This idea is captured in the following adage commonly attributed to Milton
Friedman: “Never try to walk across a river just because it has an average depth
of four feet.” If a river is four feet deep on average, that might mean that it has a
constant depth of four feet and can be waded across safely. It might also mean
that it is two or three feet deep near the banks and eight feet deep at some point
in the middle. If this were the case, then it would not be a good idea to attempt to
wade across it.
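The point can be made numerically. In the sketch below, two hypothetical rivers have
the same average depth of four feet but very different chances of exceeding a made-up
safe wading depth of six feet.

```python
# Sketch: same average depth, very different tail risk (all numbers are invented).
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
uniform_river = np.full(n, 4.0)                                     # constant 4 ft everywhere
variable_river = rng.choice([2.0, 8.0], size=n, p=[2 / 3, 1 / 3])   # mean depth is also 4 ft

safe_depth = 6.0
print(uniform_river.mean(), (uniform_river > safe_depth).mean())    # 4.0, 0.0
print(variable_river.mean(), (variable_river > safe_depth).mean())  # ~4.0, ~0.33
```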
Failing to account for extremes instead of averages is one example of the mistakes
people make when thinking about event types that might have black swans. Next,
we will explore three more: the delay fallacy, misinterpreting an absence of evidence,
and the preparedness paradox.

The delay fallacy


If we do not have enough information to conduct a detailed risk analysis, it might
be tempting to gather more information before taking action. A common excuse for
delaying action is: “If we wait, we will know more about the situation and be able to
make a more informed decision, so we should not make any decisions now.”
In thin-tailed scenarios, waiting for more information is often a good
approach. Under thin tails, additional observations will likely help us refine our
knowledge and identify the best course of action. Since there are no tail events, there
is a limit to how much damage a single event can do. There is, therefore, less urgency
to take action in these situations. The benefit of more information can be considered
to outweigh the delay in taking action.
In long-tailed scenarios, waiting for more information can mean wait-
ing until it is too late. Additional observations will not necessarily improve our
knowledge of the situation under long tails. Most, if not all, additional observations
will probably come from the head of the distribution and will not tell us anything
new about the risk of tail events or black swans. The longer we wait before preparing,
the more we expose ourselves to the possibility of such an event happening while we
are unprepared. When tail events and black swans do materialize, it is often too late
to intervene and prevent harm.
Governments failing to improve their pandemic preparedness might be considered
an example of this. Epidemiologists’ warnings were long ignored, which seemed fine
for a long time because pandemics are rare. However, when COVID-19 struck, many
governments tried to get hold of personal protective equipment (PPE) simultaneously
and found a shortage. If they had stocked up on this before the pandemic, as experts
had advised, then the outcome might have been less severe.
Furthermore, if a tail event or black swan is particularly destructive, we can never
observe it and use that information to help us make better calculations. The turkey
cannot use the event of being taken to the slaughterhouse to make future risk esti-
mations more accurate. With respect to society, we cannot afford for events of this
nature to happen even once.
We should be proactively investing in AI safety now. Since the development
and rollout of AI technologies could represent a long-tailed scenario, entailing
a risk of tail events and black swans, it would not make sense to delay action with
the excuse that we do not have enough information. Instead, we should be proactive
about safety by investing in the three key research fields discussed earlier: robust-
ness, monitoring, and control. If we wait until we are certain that an AI could pose
an existential risk before working on AI safety, we might be waiting until it is too
late.

Interpreting an absence of evidence


It can be hard to imagine a future that is significantly different from our past and
present experiences. Suppose a particular event has never happened before. In that
case, it can be tempting to interpret that as an indication that we do not need to
worry about it happening in the future, but this is not necessarily a sound judgment.

An absence of evidence is not strong evidence of absence. Even if we have not
found evidence that there is a risk of black swan events, that is not evidence that
there is no risk of black swan events. In the context of AI safety, we may not have
found evidence that DL technologies could pose specific risks like deceptive alignment,
but that does not necessarily mean that they do not pose such risks or that they will
not at some point in the future.

The preparedness paradox


Safety measures that prevent harm can seem redundant. Imagine that we
enact safety measures to reduce the risk of a potentially destructive event, and then
the event does not happen. Some might be tempted to say that the safety measures
were unnecessary or that implementing them was a waste of time and resources. Even
if the event does happen but is not very severe, some people might still say that the
safety measures were unnecessary because the event’s consequences were not so bad.
However, this conclusion ignores the possibility that the event did not happen or
was less severe because of the safety measures. We cannot run through the same
period of time twice and discover how things would have unfolded without any safety
measures. This is a cognitive bias known as the preparedness paradox: efforts to
prepare for potential disasters can reduce harm from these events and, therefore,
reduce the perceived need for such preparation.

The preparedness paradox can lead to self-defeating prophecies. A related
concept is the “self-defeating prophecy,” where a forecast can lead to actions
that prevent the forecast from coming true. For example, suppose an epidemiologist
predicts that there will be a high death toll from a particular infectious disease. In
that case, this might prompt people to wash their hands more frequently and avoid
large gatherings to avoid infection. These behaviors are likely to reduce infection rates
and lead to a lower death toll than the epidemiologist predicted.
If we work proactively on reducing risks from global pandemics, and no highly de-
structive pandemics come to pass, some people would believe that the investment
was unnecessary. However, it might be because of those efforts that no destructive
events happen. Since we usually cannot run two parallel worlds—one with safety ef-
forts and one without—it might be difficult or impossible to prove that the safety
work prevented harm. Those who work in this area may never know whether their
efforts have prevented a catastrophe and have their work vindicated. Nevertheless,
preventing disasters is essential, especially in cases like the development of AI, where
we have good theoretical reasons to believe that a black swan is on the cards.

4.7.8 Identifying the Risk of Tail Events or Black Swans


Since the possibility of tail events and black swans affects how we approach risk
management, we must consider whether we are facing a long-tailed or thin-tailed
scenario. We need to know whether we can rely on standard statistical methods to
estimate risk or whether we face the possibility of rare, high-impact events. This can
be difficult to determine, especially in cases of low information, but there are some
valuable indicators we can look for.

Highly connected systems often give rise to long-tailed scenarios. As discussed
earlier, multiplicative phenomena can lead to long tails. We should ask
ourselves: Can one part of the system rapidly affect many others? Can a single event
trigger a cascade? If the answers to these questions are yes, then it is possible that
an event can escalate to become a tail event with an extreme impact.

The use of AI in society could create a new, highly connected system. If DL
models become enmeshed within society and are put in charge of various decisions,
then we will have a highly connected system where these agents regularly interact
with humans and each other. In these conditions, a single erroneous decision made
by one agent could trigger a cascade of harmful decisions by others, for example, if
they govern the deployment of weapons. This could leave us vulnerable to sudden
catastrophes such as flash wars or powerful rogue AIs.

Complex systems may be more likely to entail a risk of black swans.


Complex systems can evolve in unpredictable ways and develop unanticipated behav-
iors. We cannot usually foresee every possible way a complex system might unfold.
For this reason, we might expect that complex evolving systems present an inherent
risk of black swans.

DL models and the surrounding social systems are all complex systems.
It is unlikely that we will be able to predict every single way AI might be used, just
as, in the early days of the internet, it would have been difficult to predict every
way technology would ultimately be used. This means that there might be a risk of
AI being used in harmful ways that we have not foreseen, potentially leading to a
destructive black swan event that we are unprepared for. The idea that DL systems
qualify as complex systems is discussed in greater depth in the Complex Systems
chapter.

New systems may be more likely to present black swans. Absence of evidence
is only evidence of absence if we expect that some evidence should have turned
up in the timeframe that has elapsed. For systems that have not been around for
long, we would be unlikely to have seen proof of tail events or black swans since these
are rare by definition.

AI may not have existed for long enough for us to have learned about all
its risks. In the case of emerging technology, it is reasonable to think that there
might be a risk of tail events or black swans, even if we do not have any evidence
yet. The lack of evidence might be explained simply by the fact that the technology
has not been around for long. Our meta-ignorance means that we should take AI risk
seriously. By definition, we can’t be sure there are no unknown unknowns. Therefore,
it is over-confident for us to feel sure we have eliminated all risks.

Accelerating progress could increase the frequency of black swan events.


We have argued that black swan events should be taken seriously, despite being rare.
However, as technological progress and economic growth advance at an increasing
rate, such events may in fact become more frequent, further compounding their rele-
vance to risk management. This is because the increasing pace of change also means
that we will more often face novel circumstances that could present unknown un-
knowns. Moreover, within the globalized economy, social systems are increasingly
interconnected, increasing the likelihood that one failure could trigger a cascade and
have an outsized impact.

There are techniques for turning some black swans into known un-
knowns. As discussed earlier, under our practical definition, not all black swans
are completely unpredictable, especially not for people who have the relevant exper-
tise. Ways of putting more black swans on our radar include expanding our safety
imagination, conducting horizon scanning or stress testing exercises, and red-teaming
[277].

Safety imagination. Expanding our “safety imagination” can help us envision a
wider range of possibilities. We can do this by playing a game of “what if” to increase
the range of possible scenarios we can imagine unfolding. Brainstorming sessions can
also help to rapidly generate lots of new ideas about potential failure modes in a
system. We can identify and question our assumptions—about what the nature of
a hazard will be, what might cause it, and what procedures we will be able to follow
to deal with it—in order to imagine a richer set of eventualities.

Horizon scanning. Some HROs use a technique called horizon scanning, which
involves monitoring potential future threats and opportunities before they arrive, to
minimize the risk of unknown unknowns [278]. AI systems could be used to enhance
horizon-scanning capabilities by simulating situations that mirror the real world with
a high degree of complexity. The simulations might generate data that reveal potential
black swan risks to be aware of when deploying a new system. As well as conducting
horizon scanning, HROs also contemplate near-misses and speculate about how they
might have turned into catastrophes, so that lessons can be learned.

Red teaming. “Red teams” can find more black swans by adopting a mindset of
malicious intent. Red teams should try to think of as many ways as they can to
misuse or sabotage the system. They can then challenge the organization on how it
would respond to such attacks. Finally, stress tests, which involve dry-running hypothetical
scenarios, evaluating how well the system copes with them, and considering how it could be
improved, can increase a system’s resilience to unexpected events.

4.8 CONCLUSION

4.8.1 Summary
In this chapter, we have explored various methods of analyzing and managing risks
inherent in systems. We began by looking at how we can break risk down into two
components: the probability and severity of an accident. We then went into greater
detail, introducing the factors of exposure and vulnerability, showing how each af-
fects the level of risk we calculate. By decomposing risk in this way, we can identify
measures we can take to reduce risks. We also considered the concept of ability to
cope and how it relates to risk of ruin.
Next, we described a metric of system reliability called the “nines of reliability.” This
metric refers to the number of nines at the beginning of a system’s percentage or
decimal reliability. We found that adding another nine of reliability is equivalent to
reducing the probability of an accident by a factor of 10, and therefore results in a
tenfold increase in expected time before failure. A limitation of the nines of reliability
is that they only contain information about the probability of an accident, but not
its severity, so they cannot be used alone to calculate risk.
We then listed several safe design principles, which can be incorporated into a system
from the design stage to reduce the risk of accidents. In particular, we explored re-
dundancy, separation of duties, the principle of least privilege, fail-safes, antifragility,
negative feedback mechanisms, transparency, and defense in depth.
To develop an understanding of how accidents occur in systems, we next explored var-
ious accident models, which are theories about how accidents happen and the factors
that contribute to them. We reviewed three component failure accident models: the
Swiss cheese model, the bow tie model, and fault tree analysis, and considered their
limitations, which arise from their chain-of-events style of reasoning. Generally, they
do not capture how accidents can happen due to interactions between components,
even when nothing fails. Component failure models are also unsuited to modeling
how the numerous complex interactions and feedback loops in a system can make it
difficult to identify a root cause, and how it can be more fruitful to look at diffuse
causality and systemic factors than specific events.
After highlighting the importance of systemic and human factors, we delved deeper
into some examples of them, highlighting regulations, social pressure, competitive
pressures, safety costs, and safety culture. We then moved on to look at systemic
accident models that attempt to take these factors into consideration. Normal Ac-
cident Theory states that accidents are inevitable in complex and tightly coupled
systems. On the other hand, HRO theory points to certain high reliability organiza-
tions as evidence that it is possible to reliably avoid accidents by following five key
management principles: preoccupation with failure, reluctance to simplify interpreta-
tions, sensitivity to operations, commitment to resilience, and deference to expertise.
While these features can certainly contribute to a good safety culture, we also looked
at the limitations and the difficulties in replicating some of them in other systems.
Rounding out our discussion of systemic factors, we outlined three accident models
that are grounded in complex systems theory. Rasmussen’s Risk Management Frame-
work (RMF) identifies six hierarchical levels within a system, identifying actors at
each level who share responsibility for safety. The RMF states that a system’s op-
erations should be kept within defined safety boundaries; if they migrate outside of
these, then the system is in a state where an event at the sharp end could trigger
an accident. However, the factors at the blunt end are also responsible, not just the
sharp-end event.
Similarly, STAMP and the related STPA analysis method view safety as being an
emergent property of an organization, detailing different levels of organization within
a system and defining the safety constraints that each level should impose on the one
below it. Specifically, STPA builds models of the organizational safety structure; the
dynamics and pressures that can lead to deterioration of this structure; the models
of the system that operators must have, and the necessary communication to ensure
these models remain accurate over time; and the broader social and political context
the organization exists within.
Finally, Dekker’s Drift into Failure (DIF) model emphasizes decrementalism: the
way that a system’s processes can deteriorate through a series of minor changes,
potentially causing the system’s migration to an unsafe state. This model warns that
each change may seem insignificant alone, so organizations might make these changes
one at a time in isolation, creating a state of higher risk once enough changes have
been made.
As a final note on the implications of complexity for AI safety, we considered the
broader societal context within which AI technologies will function. We discussed
how, in this uncontrolled environment, different, seemingly lower-level risks could
interact to produce catastrophic threats, while chaotic circumstances may increase
the likelihood of AI-related accidents. For these reasons, it makes sense to consider a
wide range of different threats of different magnitudes in our approach to mitigating
catastrophic risks, and we may find that broader interventions are more fruitful than
narrowly targeted ones.
In the last section of this chapter, we focused on a particular class of events called
tail events and black swans, and explored what they mean for risk analysis and
management. We began this discussion by defining tail events and considering several
caricatures of long-tailed distributions. Then, we described black swans as a subset
of tail events that are not only rare and high-impact but also particularly difficult
to predict. These events seem to happen largely “out of the blue” for most people
and may indicate that our understanding of a situation is inaccurate or incomplete.
These events are also referred to as unknown unknowns, which we contrasted with
known unknowns which we may not fully understand, but are at least aware of.
We examined how tail events and black swans can pose particular challenges for
some traditional approaches to evaluating and managing risk. Certain methods of
risk estimation and cost-benefit analysis rely on historical data and probabilities of
different events. However, tail events and black swans are rare, so we may not have
sufficient data to accurately estimate their likelihood, and even a small change in
likelihood can lead to a big difference in expected outcome.
We also considered the delay fallacy, showing that waiting for more information
before acting might mean waiting until it is too late. We discussed how an absence of
evidence of a risk cannot necessarily be taken as evidence that the risk is absent. By
looking at hypothetical situations where catastrophes are avoided thanks to safety
measures, we explained how the preparedness paradox can make these measures seem
unnecessary, when in fact they are essential.
Having explored the importance of taking tail events and black swans into consid-
eration, we identified some circumstances that indicate we may be at risk of these
events. We concluded that it is reasonable to believe AI technologies may pose such
a risk, due to the complexity of AI systems and the systems surrounding them, the
highly connected nature of the social systems they are likely to be embedded in, and
the fact that they are relatively new, meaning we may not yet fully understand all
the ways they might interact with their surroundings.

4.8.2 Key Takeaways

Tail events and black swans require a different approach to managing risks
[277]. Some decisions require vastly more caution than others: for instance,
paraphrasing Richard Danzig, you should not “need evidence” that a gun is loaded
to avoid playing Russian roulette [42]. Instead, you should need evidence of safety.
In situations where we are subject to the possibility of tail events and black swans,
this evidence might be impossible to find.
One element of good decision making when dealing with long-tailed scenarios is to
exercise more caution than we would otherwise. In the case of new technologies such
as AI systems, this might mean not prematurely deploying them on a large scale. In
some situations, we can be extremely wrong and things can still end up being fine; in
others, we can be just slightly wrong but suffer disastrous consequences. We must also
be cautious while trying to solve our problems. For example, while climate change
poses a serious threat, many experts believe it would be unwise to attempt to fix it
quickly by rushing into geoengineering solutions like spraying sulfur particles into the
atmosphere. There may be an urgent need to solve the problem, but we should take
care that we are not pursuing solutions that could cause many other problems.
Although tail events may be challenging to predict, there are a variety of techniques
discussed in this chapter that can help with this, such as expanding our safety imag-
ination, conducting horizon scanning exercises, and red-teaming.

Incorporating safe design principles can improve general safety. Following
the safe design principles described in this chapter can be a good first step toward
reducing systemic risks from AI, with the caveat that we should think carefully about
which defense features are appropriate, and avoid too much complexity. In particu-
lar, focusing on increasing the controllability of the system might be a good idea.
This can be done by adding loose coupling into the system, by supporting human
operators to notice hazards and act on them early, and by devising negative feedback
mechanisms that will down-regulate processes if control is lost.
Consider in detail the principle of least privilege. For one, it tells us that we should
be cautious about giving AIs too much power, to limit the extent to which we are
exposed to their tail risks. We might be concerned that AIs become enmeshed within
society with the capacity to make large changes in the world when they do not need
such access to perform their assigned duties. Additionally, for particularly powerful
AI systems, it might be reasonable to keep them relatively isolated from wider society,
and accessible only to verified individuals who have demonstrable and specific needs
for such AIs. In general, being conservative about the rate at which we unleash
technologies can reduce our exposure to black swans.
Targeting systemic factors is an important approach to reducing overall
risk. As we discussed, tackling systemic safety issues can be more effective than
focusing on details in complex systems. This can reduce the risk of both foreseeable
accidents and black swans.
Raising general awareness of risks associated with technologies can produce so-
cial pressures, and bring organizations operating those technologies under greater
scrutiny. Developing and enforcing industry regulations can help ensure organiza-
tions maintain appropriate safety standards, as can encouraging best practices that
improve safety culture. If there are ways of reducing the safety costs (e.g. through
technical research), this can make it more likely that an organization will adopt them,
also improving general safety.
Other systemic factors to pay attention to include competitive pressures. These can
undermine general safety by compelling management and employees to cut corners,
whether to increase rates of production or to reach a goal before competitors. If there
are ways of reducing these pressures and encouraging organizations to prioritize safety,
this could substantially lessen overall risk.
Improving the incentives of decision-makers and reducing moral hazard
can help to address systemic risks. We might want to influence the incentives
of researchers developing AI. Researchers might currently be focused on increasing
profits and reaching goals before competitors, pursuing scientific curiosity and a de-
sire for rapid technological acceleration, or developing the best capabilities in DL
models to find out what is possible. In this sense, these researchers might be some-
what disconnected from the risks they could be creating and the externalities they are
imposing on the rest of society, creating a moral hazard. Encouraging more consider-
ation of the possible risks, perhaps by making researchers liable for any consequences
of the technologies they develop, could therefore improve general safety.

Similarly, we might be able to improve decision-making by changing who has a say
in decisions, perhaps by including citizens in decision-making processes, not only
officials and scientists [277]. This reduces moral hazard by including the stakeholders
that have “skin in the game.” It can also lead to better decisions in general due to the
wisdom of crowds: the phenomenon where, under the right conditions, a crowd composed
of diverse individuals collectively makes much better decisions than most of its members
would alone.
In summary, while AI poses novel challenges, there is much that we can learn from
existing approaches to safety engineering and risk management in order to reduce the
risk of catastrophic outcomes.

4.9 LITERATURE

4.9.1 Recommended Reading


• N. Leveson. Engineering a Safer World: Systems Thinking Applied to Safety. Engineering Systems. MIT Press, 2011. ISBN: 9780262016629. URL: https://books.google.com/books?id=0gZ_7n5p8MQC
• C. Perrow. Normal Accidents: Living with High Risk Technologies. Princeton Paperbacks. Princeton University Press, 1999.
• N. N. Taleb. The Black Swan: The Impact of the Highly Improbable. Vol. 2. Random House, 2007.
• N. N. Taleb. Antifragile: Things That Gain from Disorder. Incerto. Random House Publishing Group, 2012. ISBN: 9780679645276. URL: https://books.google.com.au/books?id=5fqbz_qGi0AC
CHAPTER 5

Complex Systems

5.1 OVERVIEW

Artificial intelligence (AI) systems and the societies they operate within belong to
the class of complex systems. These types of systems have significant implications for
thinking about and ensuring AI safety. Complex systems exhibit surprising behav-
iors and defy conventional analysis methods that examine individual components in
isolation. To develop effective strategies for AI safety, it is crucial to adopt holistic
approaches that account for the unique properties of complex systems and enable us
to anticipate and address AI risks.
This chapter begins by elucidating the qualitative differences between complex and
simple systems. After describing standard analysis techniques based on mechanistic
or statistical approaches, the chapter demonstrates their limitations in capturing
the essential characteristics of complex systems, and provides a concise definition
of complexity. The “Hallmarks of Complex Systems” section then explores seven
indications of complexity and establishes how DL models exemplify each of them.
Next, the “Social Systems as Complex Systems” section shows how various human
organizations also satisfy our definition of complex systems. In particular, the section
explores how the hallmarks of complexity materialize in two examples of social sys-
tems that are pertinent to AI safety: the corporations and research institutes pursu-
ing AI development, and the decision-making structures responsible for implementing
policies and regulations. In the latter case, there is consideration of how advocacy
efforts are affected by the complex nature of political systems and the broader social
context.
Having established that DL systems and the social systems surrounding them are
best described as complex systems, the chapter moves on to what this means for
AI safety. The “General Lessons” section derives five learnings from the chapter’s
examination of complex systems and sets out their implications for how risks might
arise from AI. The “Puzzles, Problems, and Wicked Problems” section then reframes
the contrasts between simple and complex systems in terms of the different kinds of


problems that the two categories present, and the distinct styles of problem-solving
they require.
By examining the unintended side effects that often arise from interfering with com-
plex systems, the “Challenges with Interventionism” section illustrates the necessity
of developing comprehensive approaches to mitigating AI risks. Finally, the “Systemic
Issues” section outlines a method for thinking holistically and identifying more ef-
fective, system-level solutions that address broad systemic issues, rather than merely
applying short-term “quick fixes” that superficially address symptoms of problems.

5.2 INTRODUCTION TO COMPLEX SYSTEMS

5.2.1 The Reductionist Paradigm

Before we describe complex systems, we will first look at non-complex systems and
the methods of analysis that can be used to understand them. This discussion sits
under the reductionist paradigm. According to this paradigm, systems are just the
sum of their parts, and can be fully understood and described with relatively simple
mathematical equations or logical relations.

The mechanistic approach analyzes a system by studying each component separately. A common technique for understanding a system is to identify its
components, study each one separately, and then mentally “reassemble” it. Once we
know what each part does, we can try to place them all in a simple mechanism, where
one acts on another in a traceable sequence of steps, like cogs and wheels. This style
of analysis is called the mechanistic approach, which often assumes that a system
is like a line of dominos or a Rube Goldberg machine; if we set one component in
motion, we can accurately predict the linear sequence of events it will trigger and,
thus, the end result.

Figure 5.1. A Rube Goldberg machine with many parts that each feed directly into the next
can be well explained by way of mechanisms.

Many human artifacts can be understood mechanistically. Devices like bicycles, clocks, and sewing machines are designed with specific mechanisms in mind,
where one component directly acts on another in a cause-and-effect way to perform
an intended function. For example, we can look at a bicycle and understand that
turning the pedals will pull on a chain, which will turn the wheels, which will move
the bicycle forward.

We can often derive mathematical equations that govern mechanistic systems. If we can successfully model a system's behavior mechanistically, then
we can usually find mathematical equations that describe its behavior. We can use
these equations to calculate how the system will respond to different inputs. With
this knowledge, we can control what the system does by controlling the inputs. For
example, if we know how quickly the pedals on a bicycle are rotating then we can
calculate the speed at which it is traveling. Conversely, we can control the bicycle’s
speed by controlling how quickly the pedals rotate.
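
As a minimal sketch (with made-up gear ratio and wheel size), the following Python snippet computes the bicycle's speed directly from the pedal cadence, illustrating the kind of traceable input-output relationship that mechanistic systems permit.

    import math

    def bike_speed_m_per_s(cadence_rpm, gear_ratio=2.5, wheel_diameter_m=0.7):
        # Pedal cadence drives the chain, which turns the wheel (the mechanism).
        wheel_rpm = cadence_rpm * gear_ratio
        return wheel_rpm / 60 * math.pi * wheel_diameter_m

    print(bike_speed_m_per_s(80))  # roughly 7.3 m/s at 80 pedal revolutions per minute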

Many conventional computer programs can also be understood mechanistically. Simple algorithmic computer programs involving for-loops and "if... else..." constructions can be understood in this way too. Given any input, we can
trace through the program’s operations to predict the output. Similarly, for any given
output, we can trace the steps backward and deduce information about the input.
Functions in computer programs can also be understood mechanistically. We can
create functions within programs and give them names that are readable and intuitive
to humans. For instance, we can name a function “add(x, y)” and define it to return
the sum of x and y. We can then write a computer program using various functions
like this, and we can analyze it by understanding how each function works on its own
and then looking at the sequence of functions the program follows. This enables us
to predict reliably what output the program will give for any input.
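
A minimal sketch of such a program is shown below; every output can be derived by tracing the named functions step by step (the function names here are illustrative).

    def add(x, y):
        return x + y

    def double(x):
        return add(x, x)

    def process(x):
        # Traceable sequence: x -> double(x) -> add(double(x), 1)
        return add(double(x), 1)

    print(process(3))  # 7, which we can predict by following each step by hand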

If there are a large number of components, we can sometimes use statistics. Suppose we are trying to predict the behavior of a gas in a box, which contains on the order of 10^23 particles (that is, 1 followed by 23 zeros). We clearly cannot follow
each one and keep track of its effects on the others, as if it were a giant mechanism.
However, in the case of a system like a gas in a box, the broader system properties
of pressure and temperature can be related to averages over the particle motions.
This allows us to use statistical descriptions to derive simple equations governing the
gas’s coarse-grained behavior at the macroscopic level. For example, we can derive
an equation to calculate how much the gas pressure will increase for a given rise in
temperature.
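
For instance, a short sketch using the ideal gas law (with illustrative particle number and volume) shows how the macroscopic pressure can be computed from the temperature alone, without tracking any individual particle.

    k_B = 1.380649e-23   # Boltzmann constant, joules per kelvin
    N = 1e23             # number of particles (illustrative)
    V = 1e-3             # volume in cubic meters (illustrative)

    def pressure(T_kelvin):
        # Ideal gas law: P * V = N * k_B * T
        return N * k_B * T_kelvin / V

    print(pressure(300.0))  # about 414,000 pascals
    print(pressure(310.0))  # a 10 K rise increases the pressure by roughly 3.3%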

The mechanistic and statistical approaches fall within the reductionist paradigm. Both mechanistic and statistical styles of analysis seek to understand
and describe systems as combinations or collections of well-understood components.
Under the mechanistic approach, we account for interactions by placing the compo-
nents in a mechanism, assuming they only affect one another in a neat series of direct
one-to-one interactions. Under the statistical approach, we assume that we do not
need to know the precise details of how each interaction plays out because we can
simply take an average of them to calculate the overall outcome.

Summary. Reductionist styles of analysis assume that a system is no more than the sum of its parts. For a reductionist analysis to work, one of the following assumptions should often apply: there either needs to be a simple, traceable mechanism
governing the system’s behavior, or we need to be able to relate the broader system
properties to statistical averages over the components.

Limitations of the Reductionist Paradigm

Having discussed simple systems and how they can be understood through reduction-
ism, we will now look at the limitations of this paradigm and the types of systems
that it cannot be usefully applied to. We will look at the problems this presents for
understanding systems and predicting their behaviors.

Many real-world systems defy reductionist explanation. Imagine that, instead of looking at a bicycle or a gas in a box, we are trying to understand and predict
the behavior of an ecosystem, weather patterns, or a human society. In these cases,
there are clearly far too many components for us to keep track of what each one
is doing individually, meaning that we cannot apply the mechanistic approach. Ad-
ditionally, there are also many complex interdependencies between the components,
such that any given component might behave differently in the context of the system
than it does in isolation. We cannot, therefore, use statistics to treat the system’s
behavior as a simple aggregate of the components’ individual behaviors.

In complex systems, the whole is more than the sum of its parts. The
problem is that reductionist-style analysis is poorly suited to capturing the diversity
of interdependencies within complex systems. Reductionism only works well if the
interactions follow a rigid and predictable mechanism or if they are random and
independent enough to be modeled by statistics. In complex systems, neither of these
assumptions hold.
In complex systems, interactions do not follow a rigid, structured pattern, but compo-
nents are still sufficiently interconnected that they cannot be treated as independent.
These interactions are the source of many novel behaviors that make complex systems
interesting. To get a better grasp of these systems, we need to go beyond reductionism
and adopt an alternative, more holistic framework for thinking about them.

We can sometimes predict general short-term trends in complex systems.


Note that we may be able to predict high-level patterns of behavior in some complex
systems, particularly if we are familiar with them and have many observations of their
past behavior. For example, we can predict with a high degree of confidence that,
in the northern hemisphere, a day in January next year will be colder than a day in
June. However, it is much more difficult to predict specific details, such as the exact
temperature or whether it will rain on a given day. It is also much more challenging
to predict the longer-term trajectory of the system, such as what the climate will look
like in several centuries or millennia. This is because complex systems often develop
in a more open-ended way than simple systems and have the potential to evolve into
a wider range of states, with numerous factors influencing the path they take.

New or unfamiliar complex systems are even more difficult to predict.


The challenges in predicting how complex systems will behave are compounded when
we face newly emerging ones, such as those involving AI. While we have plenty of
historical information and experience to help us predict weather patterns, we have
little past data to inform us on how AI systems and their use in society will develop.
Nevertheless, studying other complex systems and paying attention to their shared
properties can give us insights into how AI might evolve. This might offer clues as to
how we can avoid potential negative consequences of using AI.

Figure 5.2. Often, we use mechanistic or statistical approaches to analyzing systems. When
there are many components with strong interdependence, these are insufficient, and we need
a complex systems approach.

Summary. Reductionist styles of analysis cannot give us a full understanding of complex systems, whose components neither function like a deterministic mechanism
nor behave randomly and independently enough to use statistics—shown in Figure
5.2. This lack of a full understanding presents challenges for predicting the system’s
behavior on two levels: which component will perform which action at what time,
and how the whole system might change over the long term.

5.2.2 The Complex Systems Paradigm

Now that we have seen that many systems of interest are inscrutable to the reduc-
tionist paradigm, we need an alternative lens through which to understand them. To
this end, we will discuss the complex systems paradigm, which takes a more holistic
view, placing emphasis on the most salient features shared across various real-world
complex systems that the reductionist paradigm fails to capture. The benefit of this
paradigm is that it provides “a way of seeing and talking about reality that helps us
better understand and work with systems to influence the quality of our lives.”

Complex systems exhibit emergent properties that are not found in their
components. As discussed above, some systems cannot be usefully understood in
a reductionist way. Studying a complex system’s components in isolation and doing
mental reassembly does not amount to what we observe in reality. One primary reason
for this is the phenomenon of emergence: the appearance of striking, system-wide
features that cannot be found in any of the system’s components.
The presence of emergent features provides one sense in which complex systems are
“more than the sum of their parts.” For example, we do not find atmospheric currents
in any of the molecules of nitrogen and oxygen that make up the atmosphere, and
the flexible intelligence of a human being does not exist in any single neuron. Many
biological concepts such as adaptation, ecological niche, sexuality, and fitness are not
simply reduced to statements about molecules. Moreover, “wetness” is not found in
individual water molecules. Emergence is so essential that we will use it to construct
a working definition of complex systems.
Working definition. Complex systems are systems of many interconnected com-
ponents that collectively exhibit emergent features, which cannot, in practice, be
derived from a reductive analysis of the system in terms of its isolated components.
Ant colonies are a classic example of a complex system. An ant colony can
grow to a size of several million individuals. Each ant is a fairly simple creature with
a short memory, moving around in response to chemical and tactile cues. The individ-
uals interact by randomly bumping into each other and exchanging pheromones. Out
of this mess of uncoordinated interactions emerge many fascinating collective behav-
iors. These include identifying and selecting high-quality food sources or nest sites,
forming ant trails, and even constructing bridges over gaps in these trails (formed by
the stringing together of hundreds of the ants’ bodies). Ant colonies have also been
observed to “remember” the locations of food sources or the paths of previous trails
for months, years, or decades, even though the memory of any individual ant only
lasts for a few days at most.
Ant colonies satisfy both aspects of the working definition of complex
systems. First, the emergent features of the colony include the collective decision-
making process that enables it to choose a food source or nest site, the physical ability
to cross over gaps many times wider than any ant, and even capabilities of a cognitive
nature such as extended memory. We could not predict all of these behaviors and
abilities from observing any individual ant, even if each ant displays some smaller
analogs of some of these abilities.
Second, these emergent features cannot be derived from a reductive analysis of the
system focused on the properties of the components. Even given a highly detailed
study of the behavior of an individual ant considered in isolation, we could not derive
the emergence of all of these remarkable features. Nor are all of these features simple
statistical aggregates of individual ant behaviors in any practical sense, although some
features like the distribution of ants between tasks such as foraging, nest maintenance,
and patrolling have been observed as decisions on the level of an individual ant as
well.

This distinguishes a more complex system like an ant colony from a simpler one such
as a gas in a box. Although the gas also has emergent properties (like its temperature
and pressure), it does not qualify as complex. The gas’s higher-level properties can
be straightforwardly reduced to the statistics of the lower-level properties of the
component particles. However, this was not always the case: it took many decades of
work to uncover the statistical mechanics of gases from the properties of individual
molecules. Complexity can be a feature of our understanding of the system rather
than the system itself.

Complex systems are ubiquitous in nature and society. From cells, organ-
isms, and ecosystems, to weather systems, cities, and the World Wide Web, com-
plex systems are everywhere. We will now describe two further examples, referred to
throughout this chapter.

Economies are complex systems. The components of an economic system are the individual persons, companies, and firms participating in the economy. These
economic agents interact via various kinds of financial transactions, such as lending,
borrowing, investing, and purchasing and selling goods. Out of these interactions
emerge complex economic phenomena such as inflation, stock-market indexes, and
interest rates. These economic phenomena are not manifested by any individual agent
and cannot be derived by studying the behavior of these agents considered separately;
rather, they arise from the complex network of interactions between them.

The human brain is a complex system. The human brain consists of around
86 billion neurons, each one having, on average, thousands of connections to the
others. They interact via chemical and electrical signals. Out of this emerge all our
impressive cognitive abilities, including our ability to use language, perceive the world
around us, and control the movements of our body. Again, these cognitive abilities
are not found in any individual neuron, arising primarily from the rich structure of
neuronal connections; even if we understood individual neurons very well, this would
not amount to an understanding of (or enable a derivation of) all these impressive
feats accomplished by the brain.

Interactions matter for complex systems. As these examples illustrate, the interesting emergent features of complex systems are a product of the interactions
(or interconnections) between their components. This is the core reason why these
systems are not amenable to a reductive analysis, which tries to gain insight by
breaking the system into its parts. As the philosopher Paul Cilliers writes: “In ‘cutting
up’ a system, the analytic method destroys what it seeks to understand” [279].

Summary. Complex systems are characterized by emergent features that arise from the complex interactions between components, but do not exist in any of the
individual components, and cannot be understood through or derived from a reductive
analysis of them. Complex systems are ubiquitous, from ant colonies to economies to
the human brain.

5.2.3 DL Systems as Complex Systems


An essential claim of this chapter is that DL models are complex sys-
tems. Here, we will briefly discuss what a reductionist approach to understanding
DL systems would look like and why it is inadequate.
Consider a DL system that correctly classifies an image of a cat. How does it do
this? The reductionist approach to this question would first try to break down the
classification into a sequence of smaller steps and then find parts of the neural network
responsible for executing each of them. For instance, we might decompose the problem
into the identification of cat ears + whiskers + paws and then look for individual
neurons (or small clusters of neurons) responsible for each of these elements.

The reductionist approach cannot fully describe neural networks. In some cases, it seems possible to find parts of a neural network responsible for different
elements of such a task. Researchers have discovered that progressively later layers of
deep neural networks are generally involved in recognizing progressively higher-level
features of the images they have been trained to classify. For example, close to the
input layer, the neural network might be doing simple edge detection; a little fur-
ther into the hidden layers, it might be identifying different shapes; and close to the
output, it might be combining these shapes into composites.
However, there is no clear association between an individual node in a given layer and
a particular feature at the corresponding level of complexity. Instead, all the nodes in
a given layer are partially involved in detecting any given feature at that level. That
is to say, we cannot neatly attribute the detection of each feature to a specific node,
and treat the output as the sum of all the nodes detecting their specific features.
Although researchers have in some instances identified components of neural networks that are responsible for certain tasks, such successes have been few and have required huge efforts to achieve. In general, this approach has not so far
worked well for explaining higher-level behaviors.

The complex systems paradigm is more helpful for DL systems. As these problems suggest, we cannot generally expect to find a simple, human-interpretable
set of features that a neural network identifies in each example and “adds together”
to reach its predictions. DL systems are too complex to reduce to the behavior of
a few well-understood parts; consequently, the reductionist paradigm is of limited
use in helping us think about them. As we will discuss later in this chapter, the
complex systems paradigm cannot entirely make up for this or enable a complete
understanding of these systems. Nonetheless, it does give us a vocabulary for thinking
about them that captures more of their complexity and can teach us some general
lessons about interacting with them and avoiding hazards.

Summary. The difficulties involved in explaining neural networks' activity through simple mechanisms are one piece of evidence that they are best understood as complex
systems. We will substantiate this claim throughout the next section, where we run
through some of the hallmark features of complex systems and discuss how they apply
to DL models.

5.2.4 Complexity Is Not a Dichotomy

In the previous section, we proposed a working definition of complex systems that suffices for an informal discussion, though it is not completely precise. In fact, there is
no standard definition of complexity used by all complex-systems scientists. In part,
this is because complexity is not a dichotomy.

Understanding system complexity. While we have described a distinction between a "simple" and "complex" system, labeling a system as inherently sim-
ple or complex can be misleading. Complexity is not always intrinsic to a system.
Instead, it depends on our understanding. Certain phenomena in physics, for in-
stance, have transitioned from being poorly understood “complex” concepts to well-
explained “simple” mechanics through advanced analysis of the properties of the
system. Superconductivity—the property of a material to conduct electricity without
resistance when cooled below a certain critical temperature—is an example of this
transition in understanding.
Superconductivity was originally perceived as a complex phenomenon due to the
emergent behavior arising from electron interactions in metals. However, with the
discovery of the Bardeen-Cooper-Schrieffer (BCS) theory, it became clear that su-
perconductivity could be explained through the pairing of electrons. By considering
these pairs as the components of interest rather than individual electrons, supercon-
ductivity was reclassified as a conceptually “simple” system that can be described by
reductionist models.

Complexity, information, and reductionism. Current research in complex systems acknowledges the importance of interactions in determining emergent behav-
ior but doesn’t abandon the search for mechanistic explanations. Often, mechanistic
explanations of systems can be found when considering a larger basic building block,
such as pairs of electrons for superconductivity. This choice of scale is important for
creating effective models of possibly complex phenomena.
Thus, rather than a binary classification, systems might be better understood as
existing on a spectrum based on the scale and amount of information required to
predict their behavior accurately. Complex systems are those that, at a certain scale,
require a vast amount of information for prediction, indicating their relative incom-
pressibility. However, they could still be explained mechanistically, if we understood
them sufficiently well.

5.2.5 The Hallmarks of Complex Systems

Since complexity is not a dichotomy, it is difficult to pin down when exactly we can consider systems complex. In place of a precisely demarcated domain, complex-
systems scientists study numerous salient features that are generally shared by the
systems of interest. While disciplines like physics seek fundamental mechanisms that
can explain observations, the study of complex systems looks for salient higher-level
patterns that appear across a wide variety of systems.

We consider seven key characteristics of complex systems. Chief among these is emer-
gence, but several others also receive attention: self-organization, feedback and nonlin-
earity, criticality, adaptive behavior, distributed functionality, and scalable structure.
We will now describe each of these hallmarks and explain their implications. Along
the way, we will show that DL systems share many similarities with other complex
systems, strengthening the case for treating them under this paradigm.

Emergence

We have already discussed emergence, the appearance of striking system-wide features that cannot be found in any of the components of the system. Ant colonies swarm over
prey and build bridges over gaps in their trail; economies set prices and can crash;
human brains think, feel, and sense. These remarkable behaviors are inconceivable
for any individual component—ant, dollar, or neuron—existing in isolation.

Emergent features often spontaneously "turn on" as we scale up the system in size. A group of 100 army ants placed on the ground behaves not like
an enfeebled colony but rather like no colony at all; the ants just walk around in
circles until they starve or die of exhaustion. If the system is scaled up to tens of
thousands of ants, however, a qualitative shift in behavior occurs as the colony starts
behaving like an intelligent superorganism.

Emergent abilities have been observed in DL systems. Large language models (LLMs) are trained to predict the next token in a string of words. Smaller LLMs
display a variable ability to output coherent sentences, as might be expected based
on this training. Larger LLMs, however, spontaneously gain qualitatively new capa-
bilities, such as translating text or performing three-digit arithmetic. These abilities
can emerge without any task-specific training.

Summary. Emergent properties arise collectively from interactions between components and are a defining feature of complex systems. These features often appear
spontaneously as a system is scaled up. Emergent capabilities have already been
observed in DL systems.

Feedback and Nonlinearity

Two closely related hallmarks of complexity are feedback and nonlinearity. Feedback
refers to circular processes in which a system and its environment affect one another.
There are multiple types of nonlinearity, but the term generally describes systems and
processes where a change in the input does not necessarily translate to a proportional
change in the output. We will now discuss some mechanisms behind nonlinearity,
including feedback loops, some examples of this phenomenon, and why it makes
complex systems’ behavior less predictable.

In mathematics, a linear function is one whose outputs change in proportion to changes in the inputs. The functions f(x) = 3x and f(x) = 100(x − 10) are linear. Meanwhile, the functions f(x) = x^2 and f(x) = e^x are nonlinear.

Complex systems are nonlinear functions of their inputs. Complex systems process inputs in a nonlinear way. For example, when ant colonies are confronted
with two food sources of differing quality, they will often determine which source is of
higher quality and then send a disproportionately large fraction of their foragers over to
exploit it rather than form two trails in proportion to the quality of the food source.
Neural networks are also nonlinear functions of their inputs. This is why adversarial
attacks can work well: adding a small amount of noise to an image of a cat need not
merely reduce the classifier’s confidence in its prediction, but might instead cause the
network to confidently misclassify the image entirely.
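
A minimal sketch in the style of the fast gradient sign method is shown below; the classifier model, the input image x, and the label y are placeholder assumptions rather than any particular system from this chapter.

    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, y, eps=0.03):
        # Take a small step in the direction that most increases the loss.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        return (x + eps * x.grad.sign()).detach()

    # Because the network is a nonlinear function of its input, this tiny,
    # nearly invisible change can flip the predicted class entirely.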

Nonlinearity makes neural networks hard to decompose. A deep neural network with 10 layers cannot be replaced by five neural networks, each with only
two layers. This is due to the nonlinear activation functions (such as GELUs) between
their nodes. If the layers in a neural network simply performed a sequence of linear
operations, the whole network could be reduced to a single linear operation. However,
nonlinear operations cannot be reduced in the same way, so nonlinear activation
functions mean that deep neural networks cannot be collapsed to networks with
only a few layers. This property makes neural networks more capable, but also more
difficult to analyze and understand.
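
The following sketch makes this concrete: two purely linear layers collapse into a single linear map, but inserting a nonlinearity between them breaks the equivalence (the layer sizes here are arbitrary).

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 8))   # first "layer"
    W2 = rng.normal(size=(3, 4))   # second "layer"
    x = rng.normal(size=8)

    # Without a nonlinearity, the two layers are equivalent to one matrix.
    assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

    # With a nonlinearity (here a ReLU), no single matrix reproduces the result.
    relu = lambda z: np.maximum(z, 0)
    print(np.allclose(W2 @ relu(W1 @ x), (W2 @ W1) @ x))  # almost surely False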

Feedback loops in complex systems


A major source of nonlinearity is the presence of feedback. Feedback oc-
curs when the interdependencies between different parts of a system form loops (e.g.,
A depends on B, which in turn depends on A). These feedback loops can reinforce
certain processes in the system (positive feedback), and quash others (negative feed-
back), leading to a nonlinear relationship between the system’s current state and how
it changes. The following are examples of feedback loops in complex systems.

The rich get richer. Wealthy people have more money to invest, which brings
them a greater return on investment. In a single investment cycle, the return on invest-
ment is greater in proportion to their greater wealth: a linear relationship. However,
this greater return can then be reinvested. Doing so forms a positive feedback loop
through which a slight initial advantage in wealth can be transformed into a much
larger one, leading to a nonlinear relationship between a person’s wealth and their
ability to make more money.

Learning in the brain involves a positive feedback loop. Connections between neurons are strengthened according to Hebb's law ("neurons that fire together,
wire together”). Stronger connections increase the probability of subsequent episodes
of “firing together,” further strengthening those connections. As a result of this feed-
back process, our memories do not strengthen or weaken linearly with time. The
most efficient way to learn something is by revisiting it after increasing intervals of
intervening time, a method called “spaced repetition.”
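
A toy sketch of this feedback loop (with arbitrary learning rate and activity values) shows how a Hebbian update makes a connection strengthen faster the stronger it already is.

    w, lr = 0.1, 0.05                    # initial connection strength, learning rate (arbitrary units)
    for step in range(30):
        pre = 1.0                        # presynaptic activity
        post = w * pre                   # postsynaptic activity grows with the connection strength
        w += lr * pre * post             # Hebbian update: "fire together, wire together"
        # Each increase in w raises post, which raises the next increase in w.
    print(round(w, 3))                   # about 0.43: exponential, not linear, growth from 0.1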

Task distribution in beehives can be regulated by feedback loops. When a forager bee finds a source of water, it performs a "waggle dance" in front of the
hive to signal to the other bees the direction and distance of the source. However, a
returning forager needs to find a receiver bee onto which to unload the water. If too
many foragers have brought back water, it will take longer to find a receiver, and the
forager is less likely to signal to the others where they should fly to find the source.
This negative feedback process stabilizes the number of bees going out for water,
leading to a nonlinear relationship between the number of bees currently flying out
for water and the number of additional bees recruited to the task.
AI systems involve feedback loops. In a system where agents can affect the
environment, but the environment can also affect agents, the result is a continual,
circular process of change—a feedback loop. Another example of feedback loops in-
volving AIs is the reinforcement-learning technique of self-play, where agents play
against themselves: the better an agent’s performance, the more it has to improve to
compete with itself, leading its performance to increase even more.
Feedback processes can make complex systems’ behavior difficult to pre-
dict. Positive feedback loops can amplify small changes in a system’s initial condi-
tions into considerable changes in its resulting behavior. This means that nonlinear
systems often have regimes in which they display extreme sensitivity to initial con-
ditions, a phenomenon called chaos (colloquially referred to as the butterfly effect).
A famous example of this is the logistic map, an equation that models how the pop-
ulation of a species changes over time:
x_{n+1} = r x_n (1 − x_n).

This equation is formulated to capture the feedback loops that affect how the pop-
ulation of a species changes: when the population is low, food sources proliferate,
enabling the population to grow; when it is high, overcrowding and food scarcity
drive the population down again. x_n is the current population of a species as a fraction of the maximum possible population that its environment can support. x_{n+1} represents the fractional population at some time later. The term r is the rate at
which the population increases if it is not bounded by limited resources. When the
parameter r takes a value above a certain threshold (∼ 3.57), we enter the chaotic
regime of this model, in which a tiny difference in the initial population makes for a
large difference in the long-run trajectory. Since we can never know a system’s initial
conditions with perfect accuracy, chaotic systems are generally considered difficult to
predict.
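
The sketch below iterates the logistic map from two nearly identical starting populations; the trajectories stay close in the stable regime (r = 3.2) but diverge completely in the chaotic regime (r = 3.8).

    def logistic_trajectory(x0, r, steps=50):
        xs = [x0]
        for _ in range(steps):
            xs.append(r * xs[-1] * (1 - xs[-1]))
        return xs

    for r in (3.2, 3.8):
        a = logistic_trajectory(0.200, r)
        b = logistic_trajectory(0.201, r)              # a 0.5% change in the initial population
        gap = max(abs(ai - bi) for ai, bi in zip(a, b))
        print(r, round(gap, 3))                        # tiny for r = 3.2, order 1 for r = 3.8
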
AIs as a self-reinforcing feedback loop. Since AIs can process information and
reach decisions more quickly than humans, putting them in charge of certain decisions
and operations could accelerate developments to a pace that humans cannot keep up
with. Even more AIs may then be required to make related decisions and run adjacent
operations. Additionally, if society encounters any problems with AI-run operations,
it may be that AIs alone can work at the speed and level of complexity required to
address these problems. In this way, automating processes could set up a positive
feedback loop, requiring us to continually deploy ever-more AIs. In this scenario, the
long-term use of AIs could be hard to control or reverse.

Figure 5.3. Simulated logistic-map trajectories of population size against time (in generations) for a predictable case (r = 3.2) and a chaotic case (r = 3.8). When systems are predictable, small changes in initial conditions can taper out. When they are chaotic, small changes in initial conditions lead to wildly different outcomes.

Summary. There are multiple ways in which complex systems exhibit nonlinearity.
A small change in the system’s input will not necessarily result in a proportional
change in its behavior; it might completely change the system’s behavior, or have
no effect at all. Positive feedback loops can amplify changes, while negative feedback
loops can quash them, leading a system to evolve nonlinearly depending on its current
state, and making its long-run trajectory difficult to predict.

Self-Organization

The next salient feature of complex systems we will discuss is self-organization. This
refers to how the components direct themselves in a way that produces collective
emergent properties without any explicit instructions.

Complex systems sometimes organize themselves spontaneously. The forms and internal structures of complex systems are neither imposed by
a top-down design nor centrally coordinated by “master components.” The high-level
order and organization of a complex system is itself an emergent property that cannot
be analyzed in terms of individual components. We will now look at some examples
of self-organization.

Workers self-organize in ant colonies. In ant colonies, worker ants perform a variety of tasks, such as nest maintenance, brood care, foraging for food, and pa-
trolling around the nest for signs of danger. Task allocation is partly determined by
demand and opportunity in the environment. For example, the colony will shift to a
more forager-heavy distribution if it discovers a large food source. The way in which
individual ants are recruited to different tasks according to environmental demand
and opportunity is self-organizing: a product of local stochastic interactions between
the individuals, not set by a central controller (there’s no ant commander).
Economic markets self-organize to set prices. Increasing the price of a product leads to an increase in its supply (as profit
margins for vendors are higher) and a decrease in its demand (as fewer consumers
can afford it). Decreasing the price of a product has the reverse effect. In theory,
the market price of a product will stabilize around the value at which the supply
matches the demand. The system of vendors and consumers automatically “finds”
the equilibrium market price without any centralized control or external help.
A neural network largely self-organizes during training. One could argue
that there is an element of top-down control in the training of a neural network, in
the way the backpropagation adjusts parameters to reduce the loss. However, there
is not a predetermined plan specifying which parts of it are supposed to perform the
different functions needed to carry out the task. Instead, the training process starts
with a disordered system and its ultimate shape is determined by many interactions
between components, resulting in a highly decentralized organization throughout the
network. To a large extent, therefore, the training process resembles self-organization.
Summary. In a complex system, each component responds to conditions and di-
rects its own actions such that the components collectively exhibit emergent behaviors
without any external or central control. Neural networks arrange themselves in this
way during training.

Self-Organized Criticality

Through self-organization, complex systems can reliably reach configurations that


might seem improbable or fine-tuned. We will now look at the phenomenon of self-
organized criticality, which is an important example of this.
Criticality is when a system is balanced at a tipping point between two
different states. In nuclear engineering, the “critical mass” is the mass of a fissile
material needed for a self-sustaining nuclear chain reaction. Below the critical mass,
the chain reaction quickly dies out; above the critical mass, it continues at an ever-
increasing rate and blows up. The critical mass is a boundary between these two
regimes—the point at which the system “tips over” from being subcritical (stable
and orderly) to supercritical (unstable and disorderly). It is therefore referred to as
the tipping point, or critical point, of the fissile system. Under normal operations,
nuclear reactors are maintained at a critical state where the ongoing reaction ensures
continual energy generation without growing into a dangerous, uncontrolled reaction.

Systems at their critical point are optimally sensitive to fluctuating conditions. In the nuclear case, an internal fluctuation would be the spontaneous fis-
sion of a nucleus. Below the critical point, the consequences of this event invariably
remain confined to the neighborhood of the nucleus; above the critical point, the
knock-on effects run out of control. Precisely at criticality, a local fission event can
precipitate a chain reaction of any size, ranging from a short burst to a cascading
reaction involving the entire system. This demonstrates how, at a critical point, a
small event can have the broadest possible range of effects on the system.
The concept of criticality applies far beyond nuclear engineering: one classic example
is the sandpile model. A sandpile has a critical slope, which is the tipping point
between a tall, unstable pile and a shallow, stable pile. Shallower than this slope, the
pile is relatively insensitive to perturbations: dropping additional grains onto the pile
has little effect beyond making it taller. Once we reach the critical slope, however,
the pile is poised to avalanche, and dropping extra grains can lead to avalanches of
any size, including system-wide ones that effectively cause the whole pile to collapse.
Again, we see that, at criticality, single events can have a wide range of effects on the
system.
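
A minimal sketch of such a sandpile (the Bak-Tang-Wiesenfeld model, with an arbitrary grid size and number of grains) is given below; once the pile has organized itself to its critical state, dropping single grains triggers avalanches of wildly varying size.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    grid = np.zeros((n, n), dtype=int)
    avalanches = []

    for _ in range(5000):
        i, j = rng.integers(0, n, size=2)
        grid[i, j] += 1                          # drop a single grain
        toppled = 0
        while (grid >= 4).any():                 # any cell at its tipping point topples
            for i, j in zip(*np.where(grid >= 4)):
                grid[i, j] -= 4
                toppled += 1
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n and 0 <= nj < n:
                        grid[ni, nj] += 1        # grains at the edge simply fall off
        avalanches.append(toppled)

    print(max(avalanches), sum(a == 0 for a in avalanches))  # rare huge avalanches, many non-events
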
The freezing point of water is a critical temperature between its solid
and liquid phases. In ice, the solid phase of water, there is long-range order, and
fluctuations away from this (pockets of melting ice) are small and locally contained.
In the liquid phase, there is long-range disorder, and fluctuations away from this
(formation of ice crystals) are likewise small and locally contained. But at the freezing
point of water—the critical point between the solid and liquid phases—the local
formation of an ice crystal can rapidly spread across the whole system. As a result, a
critically cooled bottle of beer can suddenly freeze all at once when it is perturbed,
for example by being knocked against a table.
Neural networks display critical points. Several studies have found that cer-
tain capabilities of neural networks suddenly “switch on” at a critical point as they
are scaled up. For example, grokking refers to a network's ability to generalize accurately to new, unseen examples, not just the examples used in training. One study trained neural networks to recognize patterns in tables of letters and fill in the blanks, and found that grokking switched on quite suddenly [280]. The study reported that this ability remained near zero up to 10^5 optimization steps, but then steeply increased to near 100% accuracy by 10^6 steps. This could be viewed as a critical point.
Self-organized criticality means systems can evolve in a “punctuated
equilibrium.” According to the theory of punctuated equilibrium, evolutionary his-
tory consists of long periods of relative stasis in which species experience very little
change, punctuated by occasional bursts of rapid change across entire ecosystems.
These sudden bursts can be understood through the lens of self-organized criticality.
Ecosystems typically in equilibrium can slowly tend toward critical points, where
they are optimally sensitive to perturbations from outside (such as geological events)
or fluctuations from within (such as an organism developing a new behavior or strat-
egy through a chance mutation). When the ecosystem is near a critical point, such a
perturbation can potentially set off a system-wide cascade of changes, in which many
species will need to adapt to survive. Similarly, AI development sometimes advances
in bursts (e.g., GANs, self-supervised learning in vision, and so on) with long periods
of slow development.

Summary. Complex systems often maintain themselves near critical points, or "tipping points." At these points, a system is optimally sensitive to internal fluctua-
tions and external inputs. This means it can undergo dramatic changes in response
to relatively minor events. A pattern of dramatic changes that sporadically interrupt
periods of little change can be described as a punctuated equilibrium.

Distributed Functionality

As discussed earlier in this chapter, it is usually impractical to attempt to decompose a complex system into its parts, assign a different function to each one, and then
assume that the system as a whole is the sum of these functions. Part of the reason
for this is distributed functionality, another hallmark of complexity which we will now
explore.

Complex systems can often be described as performing tasks or functions. Insect colonies build nests, forage for food, and protect their queens;
economies calculate market prices and interest rates; and the human brain regulates
all the bodily processes essential for our survival, such as heartbeat and breathing. In
this context, we can understand adaptive behavior as the ability of a complex system
to maintain its functionality when placed in a new environment or faced with new
demands.

In complex systems, different functions are not neatly divided up between subsystems. Consider a machine designed to make coffee. In human artifacts like
this, there is a clear delegation of functions to different parts of the system—one part
grinds the beans, another froths the milk, and so forth. This is how non-complex
systems usually work to perform their tasks. In complex systems, by contrast, no
subsystem can perform any of the system’s functions on its own, whereas all the sub-
systems working together can collectively perform many different tasks. This property
is called “distributed functionality.”
Note that distributed functionality does not imply that there is absolutely no func-
tional specialization of the system’s components. Indeed, the components of a complex
system usually come in a diversity of different types, which contribute in different
ways to the system’s overall behavior and function. For example, the worker ants in
an ant colony can belong to different groups: foragers, patrollers, brood care ants,
and so on. Each of these groups, however, performs various functions for the colony,
and distributed functionality implies that, within each group of specialists, there is
no rigid assignment of functions to components.

Partial encoding means that no single component can complete a task alone. The group of forager ants must perform a variety of subtasks in service of
the foraging process: locating a food source, making a collective decision to exploit
it, swarming over it to break it up, and carrying small pieces of it back to the nest. A
single forager ant working alone cannot perform this whole process—or even any one
subtask; many ants are needed for each part, with each individual contributing only
partially to each task. We therefore say that foraging is partially encoded within any
single forager ant.

Redundant encoding means there are more components than needed for
any task. A flourishing ant colony will have many more ants than are necessary to
carry out its primary functions. This is why the colony long outlives its members; if
a few patroller ants get eaten, or a few foragers get lost, the colony as a whole barely
notices. We therefore say that each of the functions is redundantly encoded across
the component ants.
An example of distributed functionality is the phenomenon known as the “wisdom of
crowds,” which was notably demonstrated in a report from a village fair in 1906. At
this fair, attendees were invited to take part in a contest by guessing the weight of an
ox. 787 people submitted estimates, and it was reported that the mean came to 1,197
pounds. This was strikingly close to the actual weight, which was 1,198 pounds.
In situations like this, it is often the case that the average estimate of many people
is closer to the true value than any individual’s guess. We could say that the task
of making a good estimate is only partially encoded in any given individual, who
cannot alone get close to the actual value. It is also redundantly encoded because
any individual’s estimate can usually be ignored without noticeably affecting the
average.
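
A quick simulation (with invented noise levels) illustrates the effect: the average of many independent, noisy guesses lands far closer to the truth than a typical individual guess does.

    import numpy as np

    rng = np.random.default_rng(0)
    true_weight = 1198                                     # pounds
    guesses = true_weight + rng.normal(0, 100, size=787)   # 787 noisy, independent guesses

    crowd_error = abs(guesses.mean() - true_weight)
    typical_error = np.abs(guesses - true_weight).mean()
    print(round(crowd_error, 1), round(typical_error, 1))  # a few pounds vs. roughly eighty
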
On a larger scale, the wisdom of crowds might be thought to underlie the effectiveness
of democracy. Ideally, a well-functioning democracy should make better decisions than
any of its individual members could on their own. This is not because a democratic
society decomposes its problems into many distinct sub-problems, which can then be
delegated to different citizens. Instead, wise democratic decisions take advantage of
the wisdom of crowds phenomenon, wherein pooling or averaging many people’s views
leads to a better result than trusting any individual. The “sense-making” function of
democracies is therefore distributed across society, partially and redundantly encoded
in each citizen.

Neural networks show distributed functionality. In neural networks, distributed functionality manifests most clearly as distributed representation. In suffi-
ciently large neural networks, the individual nodes do not correspond to particular
concepts, and the weights do not correspond to relationships between concepts. In
essence, the nodes and connections do not “stand for” anything specific. Part of the
reason for this is partial encoding: in many cases, any given feature of the input data
will activate many neurons in the network, making it impossible to locate a single
neuron that represents this feature. In addition, so-called polysemantic neurons are
activated by many different features of the input data, making it hard to establish a
correspondence between these neurons and any individual concepts.

Distributed functionality makes it hard to understand what complex systems are doing. Distributed functionality means that we cannot understand a
complex system by attributing each task wholly and exclusively to a particular com-
ponent, as the mechanistic approach would seek to. Distributed representation in
neural networks is a particularly troubling instantiation of this insofar as it poses
problems for using human concepts in analyzing a complex system’s “cognition.” The
presence of distributed representation might be thought to substantiate the concern
that neural networks are uninterpretable “black boxes.”
Summary. Distributed functionality often means that no function in a complex
system can be fully or exclusively attributed to a particular component. Since tasks
are more loosely shared among components, this is one of the main reasons that it is
so difficult to develop a definitive model of how a complex system works.

Scalable Structure and Power Laws


As discussed above, the properties of a complex system often do not scale in direct proportion to its size. Instead, they often follow power laws, where a property is proportional to the system size raised to some power that may be more or less than 1. We will now
discuss these power laws, which are another hallmark of complex systems.
Figure 5.4. A log-log plot of metabolic rate (kJ/day) against body mass (kg) for mammals and birds. The data demonstrate Kleiber's law, with the power-law relationship appearing as a straight line on a log-log graph.

Complex systems often obey power-law scalings of their properties with system size. Perhaps the most famous example of a power-law scaling is Kleiber's law in biology: across all mammals and birds, and possibly beyond, the metabolic rate of a typical member of a species scales with the three-quarters power of its body mass:

R ∝ M^{3/4}

If we know that an elephant is five times heavier than a horse, we can guess that the elephant's metabolic rate will be approximately 3.3 times the horse's (since 5^{3/4} ≈ 3.3). There are several other documented cases of this power-law scaling behavior
in complex systems. The average heart-rate for a typical member of a mammalian
species scales with the minus one-quarter power of its body mass:
R ∝ M^{-1/4}.

At the same time, the average lifespan scales with the one-quarter power of its body
mass:
T ∝ M^{1/4}.

This leads to the wonderful result that the average number of heartbeats per lifetime
is constant across all species of mammals (around 1.5 billion).
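
These quarter-power laws combine neatly, as the short sketch below shows (the body masses are purely illustrative): because heart rate scales as M^{-1/4} and lifespan as M^{1/4}, their product, heartbeats per lifetime, does not depend on body mass at all.

    def metabolic_ratio(mass_ratio):
        return mass_ratio ** 0.75                # Kleiber's law: R is proportional to M^(3/4)

    print(round(metabolic_ratio(5), 2))          # about 3.34 for an animal five times heavier

    for M in (0.02, 70.0, 5000.0):               # illustrative body masses in kilograms
        heart_rate = M ** -0.25                  # proportional to M^(-1/4), arbitrary units
        lifespan = M ** 0.25                     # proportional to M^(1/4), arbitrary units
        print(round(heart_rate * lifespan, 6))   # the exponents cancel: the same value every time
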
Among cities within the same country, the material infrastructure (such as the lengths
of pipes, powerlines, and roads, and the number of gas stations) scales with population
as a power-law with an exponent of 0.85. Also among cities within the same country,
socioeconomic quantities (such as incidents of crime and cases of flu) scale with the
population size raised to the 1.15 power.

Experiments on LLMs show that their loss obeys power laws too. In the
paper in which DeepMind introduced the Chinchilla model ([160]), the researchers
fit the following parametric function to the data they collected from experiments on
language models of different sizes, where N is the size of the model and D is the size
of the training dataset:
L(N, D) = E + A/N^α + B/D^β.
The irreducible loss (E) is the lowest loss that could possibly be achieved. Subtracting
this off, we see that the performance of the model as measured by the loss (L) exhibits
a power-law dependency on each of model parameter count (N ) and dataset size (D).
For more details on scaling laws in DL systems, see the Scaling Laws section in
Artificial Intelligence & Machine Learning.
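
As a rough numerical sketch, the snippet below evaluates this parametric loss; the constants are approximately the values fitted in the Chinchilla paper and should be treated as illustrative rather than exact.

    # Approximate fitted constants (illustrative values).
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    # Doubling the parameter count yields diminishing, power-law-shaped returns.
    print(round(loss(70e9, 1.4e12), 3))    # roughly Chinchilla-scale model and dataset
    print(round(loss(140e9, 1.4e12), 3))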

Summary. Certain important properties of complex systems often scale nonlinearly with the size of the system. This means that two separate systems will not
behave in the same way as one single system of equivalent size.

Adaptive Behavior

The final hallmark of complexity we will discuss is adaptive behavior, which involves
a system changing its behavior depending on the demands of the environment.

Complex systems often adapt flexibly to new tasks and environmental changes. Honeybees usually need to maintain their hives within an optimum temperature range of 32–36°C. When temperatures rise too high, bees engage in various
adaptive behaviors to counteract this. They fan their wings at the hive entrance,
increasing air circulation to cool down the hive. Additionally, more bees are sent out
to gather water, which helps regulate the hive’s temperature back to normal [281].
This ability to adjust their behavior to maintain homeostasis during environmental
changes exemplifies one type of adaptive behavior.
The human brain, on the other hand, showcases a different form of adaptability. It
possesses the remarkable capacity to navigate novel circumstances and solve unfa-
miliar problems. When faced with new challenges, the brain’s ability to think about
different things allows us to adapt and thrive in diverse environments. For example,
London’s taxi drivers (“cabbies”) have been found to have larger-than-average mem-
ory centers. This adaptation enables them to navigate the complex maze of London’s
streets effectively. Furthermore, the brain can also adapt in response to injury. After
a stroke or head injury, it can rewire itself, repurposing undamaged areas to compen-
sate for the damaged ones. This adaptive behavior showcases the brain’s remarkable
plasticity and its ability to adapt and function even after experiencing trauma.

Some DL systems exhibit adaptive behavior. So-called "online models" learn from new data sequentially as they encounter it, rather than remaining fixed after
an initial training phase. This enables these models to dynamically adapt to datasets
that change over time, as well as continuing to perform well in the real world when the
inputs they encounter differ from their training data, an ability known as “test-time
adaptation” or simply “adaptation.” While other DL systems such as Large Language
Models remain fixed after their training phase, there are strong incentives to make
these systems adaptive to overcome current limitations such as costs of re-training
and lack of up-to-date information after the training date.
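A minimal sketch of online learning (our own toy example, not code from any particular system): a linear model whose weights are nudged one example at a time can track a data-generating process that drifts during deployment.

import numpy as np

rng = np.random.default_rng(0)

def online_update(w, x, y, lr=0.01):
    """One online-learning step: adjust the weights using a single new (x, y) example."""
    error = float(x @ w - y)
    return w - lr * error * x   # gradient step on this example's squared error

true_w = np.array([1.0, -2.0, 0.5])   # the environment's current "true" relationship
w = np.zeros(3)                       # the online model's weights
for t in range(5000):
    true_w = true_w + 0.001 * rng.normal(size=3)   # slow environmental drift
    x = rng.normal(size=3)
    y = float(x @ true_w) + 0.1 * rng.normal()
    w = online_update(w, x, y)

print("online estimate:", np.round(w, 2))
print("current truth:  ", np.round(true_w, 2))

A model trained once and then frozen would stay near its original weights as the environment drifted away, whereas the online learner follows the drift.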
Another example of adaptive behavior in DL systems is few-shot prompting. This
technique enables general DL models (such as large language models) to be used
to perform certain tasks without any task-specific fine-tuning. It involves giving the
model a few examples (“shots”) of correct performance on the task, which stimulate
the model to adapt its outputs to these examples and thereby carry out the desired
task.
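For illustration, a few-shot prompt for a simple classification task might be assembled as follows (a schematic sketch with made-up examples; how the resulting string is sent to a model is left abstract):

# Build a few-shot prompt: a handful of worked examples ("shots") followed by a new query.
shots = [
    ("The food was wonderful and the staff were friendly.", "positive"),
    ("The package arrived late and the box was damaged.", "negative"),
    ("Exactly what I ordered, no complaints.", "positive"),
]
query = "The instructions were confusing and nothing worked."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for review, label in shots:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)   # a language model is expected to continue the pattern, e.g. with "negative"

The model is not fine-tuned on sentiment classification; the examples in the prompt alone induce it to adapt its output to the task.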

Summary. Complex systems can often undergo rapid changes in their structures
and processes in response to internal and external fluctuations. This adaptive behav-
ior enables the continuation of the system in a changing environment.

Review of the Hallmarks of Complexity

There are seven hallmarks of complexity that we can look out for when identifying
complex systems. These hallmarks are:
1. Emergence: the appearance of novel properties that arise from interactions be-
tween the system’s components, but which do not exist in any single component.
These properties cannot be understood or predicted from reductive analysis of components.
2. Feedback and nonlinearity: the presence of feedback loops that can either amplify
or quash changes in a complex system, and the multiple ways in which a change
in the input to a complex system can produce a disproportionate change in the
output.
3. Self-organization: the ability of a complex system to spontaneously self-organize
through the self-directed behaviors of the components, without any external or
centralized control.
4. Self-organized criticality: the tendency of complex systems to maintain themselves
near critical points, at which they can undergo dramatic changes in response to
even relatively minor perturbations.
5. Distributed functionality: the way in which tasks are shared loosely among a com-
plex system’s components. Tasks are both partially encoded—each individual con-
tributes only partially to a task—and redundantly encoded—there are more in-
dividuals that can contribute to a task than are strictly necessary to complete
it.
6. Scalable structure: the way in which properties of complex systems scale nonlin-
early with size, so that a property of a single large system may be larger or smaller
than the combined properties of two separate systems of half the size.
7. Adaptive behavior: a complex system’s ability to change its structure and processes
in response to perturbations, enabling it to continue functioning in a changing
environment.

5.2.6 Social Systems as Complex Systems

So far, we have described how DL systems possess many of the classic features of
complex systems. We have shown that they satisfy the two aspects of our working
definition of complex systems and that they display all seven hallmarks discussed
above.
We will now consider the organizations that develop AIs and the societies within
which they are deployed, and describe how these systems also exhibit the character-
istics of complex systems. We will argue that, on this basis, the problem of AI safety
should be treated under the complex systems paradigm.

Worked Example: Corporations and Research Institutes as Complex Systems

The organizations developing AI technology are complex systems. Corporations and research institutes have multiple emergent properties that are not
found in any of the individuals working within them. Brand identity, for example,
does not exist in any employee of a company, but rather embodies the collective activities of all the employees and conveys the goals of the company as a
whole. Similarly, the concepts of organizational culture and research culture refer to
the general ways in which individuals tend to interact with one another within an
organization or a research field.

Organizations developing AI are self-organizing. Although companies have CEOs, these CEOs are often selected by groups of people, such as board members,
and do not generally dictate every single activity within the company and how it
should be done. People self-select in applying to work at a company that interests
them, and managers usually make decisions together on who to hire. Employees
often come up with their own ideas for projects and strategies, and then decisions
are made collectively on which ones to pursue. Likewise in academia, researchers
investigate their own questions, keep up to date with the findings of their peers, and
use those insights to inform their research directions, while experts form committees
to decide which projects should be funded. There is very often no single central entity
determining which researchers should work on which questions.

Both corporate and academic organizations can exhibit critical points.


Often, a lot of an organization’s effort is focused on a particularly consequential area
or problem until a big breakthrough is made, representing a tipping point into a
new paradigm. For this reason, research and development often progresses in the
pattern of a punctuated equilibrium, with long periods of incremental advancements
interrupted by windows of rapid advancements, following important breakthroughs.

Companies and research institutes show multiple forms of adaptive behavior. Examples of adaptation include organizations incorporating new informa-
tion and technology to update their strategies and ways of working, and adjusting
their research directions based on new findings. Additionally, they may adapt to the
changing needs of customers and the changing priorities of research funders, as well
as to new government regulations.

Companies and research institutes display distributed functionality.


While there may be subsystems that focus on specialized tasks within a company
or branches within a research field, in general, no employee or researcher single-
handedly performs a whole function or advances an area alone. Even if there is just
one person working in a particular niche, they still need to be informed by related
tasks and research performed by others, and usually rely on the work of support staff.
This illustrates partial encoding. There are also usually more people available to per-
form tasks than are absolutely needed, meaning that processes continue over time
despite employees and researchers joining and leaving. This demonstrates redundant
encoding.

There are multiple examples of feedback and nonlinearity in companies and institutes. A small disparity in investment into different projects or research
directions may be compounded over time, with those that receive more initial funding
also achieving bigger results, and therefore receiving even more funding. A small
difference in support for a particular candidate in senior management can be decisive
in whether or not they are selected, and thus have a large influence over future
directions. More broadly, different organizations may imitate one another’s successes,
leading to a concentration of work in a particular area, while a small initial advantage of one organization may be amplified over time, allowing it to dominate the area.
Summary. The environments in which research and development occur display
the hallmarks of complexity and are therefore best understood as complex systems.
Research organizations and corporations possess emergent properties including safety
culture, which is paramount for AI safety. Additionally, progress may have critical
points and unfold in a nonlinear way that is difficult to predict. It is crucial that AI
safety strategies are informed by these possibilities.

Worked Example: Complex Systems Applied to Advocacy

The social systems within which AI is deployed are complex systems.


We find emergence and the hallmarks of complexity in all social systems, from po-
litical structures to economic networks to society as a whole. To illustrate this more
specifically, we will now focus on the example of policymaking structures and advo-
cacy. This example is particularly relevant to AI safety, because reducing risks from
AI will need to involve the implementation of policies around its use. Advocacy will
therefore be necessary to promote safety policies and convince policymakers to adopt
them.
Social systems display emergence and self-organization. Patterns of gover-
nance and collective decision making, such as democracy, can be considered emergent
properties of social and political systems. Although some individuals reach positions
of power that might seem to centralize control, social systems are nonetheless partly
self-organizing, in the sense that many individuals interact with one another and
make decisions about whom to support, collectively determining which candidate
is elected. Similarly, people who care about particular causes self-organize to form
advocacy groups and set up grassroots campaigns. Policymakers interact with each
other and various stakeholders, including advocates, to reach policy decisions.
Advocacy movements have critical points and often unfold as punc-
tuated equilibria. Movements advancing different causes often display critical
points, where a critical level of awareness and support must be reached before pol-
icymakers will pay attention. Social systems may self-organize toward this critical
level of support and maintain it over time. However, the actual “tipping” from one
state into another, wherein policies are implemented, may be dependent on other
external factors, such as whether there are other urgent issues dominating decision-
makers’ attention. For this reason, advocacy efforts and their results tend to progress
as punctuated equilibria; there may be little apparent change for a long time, despite
sustained work, and then a lot of sudden progress when momentum builds and the
political climate is right for it.
Both advocacy groups and policymaking structures also exhibit adap-
tive behavior. Policymakers must continually adapt to the fluctuating political
landscape and changing concerns of the public. Similarly, advocacy groups must con-
stantly adjust their activities to capture the attention of the public and policymakers
and convince them that a particular cause is relevant and important. They might, for
instance, use new technology to innovate an original mode of campaigning, or link
the cause to the prevailing zeitgeist—another emergent property of social systems.

Distributed functionality is evident on multiple levels in social systems.


The various tasks involved in advocacy are partially and redundantly encoded across
individuals within groups, allowing campaigns to continue even as people leave and
join them. More broadly still, there are usually several groups campaigning for any
given cause, meaning that the general function of advocacy is distributed across
different organizations. Decision making is also partially and redundantly encoded
among many policymakers, who interact with one another and various stakeholders
to consider different perspectives and decide on policies.

There are many nonlinear aspects of processes like advocacy. There are
numerous factors that affect whether or not an issue is included on a policy agenda.
Public interest in a cause, the influence of opponents of a cause, and the number of other issues competing for attention are among the many factors that affect the likelihood that it is considered, and their effects are non-linear; for instance, opponents with low influence may permit an issue to be discussed, opponents with medium influence may try to block it from being discussed, but opponents with high influence may permit it to be discussed so that they can argue against it. Additionally, there is a degree of
randomness involved in determining which issues are considered. This means that the
policy progress resulting from a particular campaign does not necessarily reflect the
level of effort put into it, nor how well organized it was.
Together with distributed functionality and critical points, this nonlinearity can make
it difficult to evaluate how well a campaign was executed or attribute eventual success.
It might be that earlier efforts were essential in laying the groundwork or simply
maintaining some interest in a cause, even if they did not yield immediate results.
A later campaign might then succeed in prompting policy-level action, regardless of
whether it is particularly well organized, simply because the political climate becomes
favorable.
Other examples of nonlinearity within advocacy and policymaking arise from various
feedback loops. Since people are influenced by the opinions of those around them,
a small change in the initial level of support for a policy might be compounded
over time, creating momentum and ultimately tipping the balance as to whether
or not it is adopted. On the other hand, original activities that are designed to be
attention-grabbing may run up against negative feedback loops that diminish their
power over time. Other groups may imitate them, for instance, so that their novelty
wears off through repetition. Opponents of a cause may also learn to counteract any
new approaches that advocates for it try out. This dynamic was understood by the
military strategist Moltke the Elder, who is reported to have said that “no plan
survives first contact with the enemy.”

Political systems and advocacy groups have scalable structure. Political systems usually have a hierarchical structure with multiple levels of organization, such
as councils responsible for specific regions within a country and politicians forming a
national government to address countrywide issues. Advocacy groups can also exhibit
this kind of structure. There may, for example, be a campaign manager spearheading
efforts, and then many regional leaders who organize activities at a more local level.
This scalable structure is another indication of complexity.

Summary. The presence of these hallmarks of complexity in social and political systems suggests they are best described within the complex systems paradigm. Ad-
ditionally, these observations can offer some insights into how we might approach
advocacy for AI safety, suggesting it is not as simple as developing a good policy idea
and making a convincing argument for it.
Instead, it is likely that successful advocacy over the long term will be characterized
by adaptability to different political circumstances and changing public attitudes, as
well as in response to opponents’ activities. Advocates will need to invest in building
and maintaining relationships with the relevant people and organizations, rather than
just presenting the case for a policy. There may need to be a lot of work that is not
immediately rewarded, but momentum should be maintained so that advocates are
ready to capitalize on moments when the political climate becomes more favorable.
It should also be understood that it might not be possible to attribute success in any
obvious way.

It Is Difficult to Foresee How the Use of AI Will Unfold

Complex social systems mean the eventual impact of AI is hard to predict. As discussed earlier, the behavior of complex systems can be difficult to pre-
dict for many reasons, such as the appearance of unanticipated emergent properties
and feedback loops amplifying small changes in initial conditions. This is compounded
if a system is new to us; we may be able to predict certain high-level behaviors of
complex systems we are familiar with and have a lot of historical data on, such as
weather patterns and beehives, but AI systems are relatively new. Additionally, the
deployment of AI within society represents a case of nested complexity, where com-
plex systems are embedded within one another. This vastly increases the range of
potential interactions and the number of ways in which the systems can co-evolve.
As a result, it is difficult to predict all the ways in which AI might be used and what
its eventual impact will be.
While this technology could have many positive effects, there is also potential for
interactions to have negative consequences. This is especially true if AI is deployed
in ways that enable it to affect actions in the world; for example, if it is put in charge
of automated decision-making processes.

5.3 COMPLEX SYSTEMS FOR AI SAFETY

5.3.1 General Lessons from Complex Systems

As we have discussed, AI systems and the social systems they are integrated within
are best understood as complex systems. For this reason, making AI safe is not like
solving a mathematical problem or fixing a watch. A watch might be complicated,
but it is not complex. Its mechanism can be fully understood and described, and its
behavior can be predicted with a high degree of confidence. The same is not true of
complex systems.
Since a system’s complexity has a significant bearing on its behavior, our approach
to AI safety should be informed by the complex systems paradigm. We will now look
at some lessons that have been derived from observations of many other complex
adaptive systems. We will discuss each lesson and what it means for AI safety.

Lesson: Armchair Analysis Is Limited for Complex Systems

Learning how to make AIs safe will require some trial and error. We
cannot usually attain a complete understanding of complex systems or anticipate all
their emergent properties purely by studying their structure in theory. This means we
cannot exhaustively predict every way they might go wrong just by thinking about
them. Instead, some amount of trial and error is required to understand how they
will function under different circumstances and learn about the risks they might pose.
The implication for AI safety is that some simulation and experimentation will be
required to learn how AI systems might function in unexpected or unintended ways
and to discover crucial variables for safety.

Biomedical research and drug discovery exemplify the limitations of armchair theorizing. The body is a highly complex system with countless bio-
chemical reactions happening all the time, and intricate interdependencies between
them. Researchers may develop a drug that they believe, according to their best the-
ories, should treat a condition. However, they cannot thoroughly analyze every single
way it might interact with all the body’s organs, processes, and other medications
people may be taking. That is why clinical trials are required to test whether drugs
are effective and detect any unexpected side effects before they are approved for use.
Similarly, since AI systems are complex, we cannot expect to predict all their potential
behaviors, emergent properties, and associated hazards simply by thinking about
them. Moreover, AI systems will be even less predictable when they are taken out of
the controlled development environment and integrated within society. For example,
when the chatbot Tay was released on Twitter, it soon started to make racist and
sexist comments, presumably learned through its interactions with other Twitter
users in this uncontrolled social setting.

Approaches to AI safety will need to involve experimentation. Some of the most important variables that affect a system’s safety will likely be discovered
by accident. While we may have ideas about the kinds of hazards a system entails,
experimentation can help to confirm or refute these. Importantly, it can also help us
discover hazards we had not even imagined. These are called unknown unknowns,
or black swans, discussed extensively in the Safety Engineering chapter. Empirical
feedback loops are necessary.

Lesson: Systems Often Develop Subgoals Which Can Supersede the Original Goal

AIs might pursue distorted subgoals at the expense of the original goal.
The implication for AI safety is that AIs might pursue subgoals over the goals we
give them to begin with. This presents a risk that we might lose control of AIs, and
this could cause harm because their subgoals may not always be aligned with human
values.

A system often decomposes its goal into multiple subgoals to act as step-
ping stones. Subgoals might include instrumentally convergent goals, which are
discussed in the Single-Agent Safety chapter. The idea is that achieving all the sub-
goals will collectively amount to achieving the original aim. This might work for a
simple, mechanistic system. However, since complex systems are more than the sum
of their parts, breaking goals down in this way can distort them. The system might
get sidetracked pursuing a subgoal, sometimes even at the expense of the original one.
In other words, although the subgoal was initially a means to an end, the system may
end up prioritizing it as an end in itself.
For example, companies usually have many different departments, each one special-
ized to pursue a distinct subgoal. However, some departments, such as bureaucratic
ones, can capture power and have the company pursue goals unlike its initial one.
Political leaders can delegate roles to subordinates, but sometimes their subordinates
may overthrow them in a coup.
As another example, imagine a politician who wants to improve the quality of life
of residents of a particular area. Increasing employment opportunities often lead to
improvement in quality of life, so the politician might focus on this as a subgoal—a
means to an end. However, this subgoal might end up supplanting the initial one. For
instance, a company might want to build an industrial plant that will offer jobs, but
is also likely to leak toxic waste. Suppose the politician has become mostly focused
on increasing employment rates. In that case, they might approve the construction
of this plant, despite the likelihood that it will pollute the environment and worsen
residents’ quality of life in some ways.

Future AI agents may break down difficult long-term goals into smaller
subgoals. Creating subgoals can distort an AI’s objective and result in misalign-
ment. As discussed in the Emergent Capabilities section of the Single-Agent Safety
Chapter, optimization algorithms might produce emergent optimizers that pursue
subgoals, or AI agents may delegate goals to other agents and potentially have the
goal be distorted or subverted. In more extreme cases, the subgoals could be pursued
at the expense of the original one. We can specify our high-level objectives correctly
without any guarantee that systems will implement these in practice. As a result,
systems may not pursue goals that we would consider beneficial.

Lesson: A Safe System, When Scaled Up, Is Not Necessarily Still Safe

AIs may continue to develop unanticipated behaviors as we scale them up.


When we scale up the size of a system, qualitatively new properties and behaviors
emerge. The implication for AI safety is that, when we increase the scale of a DL
system, it will not necessarily just get better at doing what it was doing before. It
might begin to behave in entirely novel and unexpected ways, potentially posing risks
that we had not thought to prepare for.
It is not only when a system transitions from relative simplicity into complexity that
novel properties can appear. New properties can continue to emerge spontaneously
as a complex system increases in size. As discussed earlier in this chapter, LLMs have
been shown to suddenly acquire new capabilities, such as doing three-digit arith-
metic, when the amount of compute used in training them is increased, without any
qualitative difference in training. Proxy gaming capabilities have also been found to
“switch on” at a certain threshold as the model’s number of parameters increases; in
one study, at a certain number of parameters, the proxy reward steeply increased,
while the model’s performance as intended by humans simultaneously declined.

Some emergent capabilities may pose a risk. As DL models continue to grow, we should expect to observe new emergent capabilities appearing. These may
include potentially concerning ones, such as deceptive behavior or the ability to game
proxy goals. For instance, a system might not attempt to engage in deception until
it is sophisticated enough to be successful. Deceptive behavior might then suddenly
appear.

Lesson: Working Complex Systems Have Usually Evolved From Simpler Systems

We are unlikely to be able to build a large, safe AI system from scratch.


Most attempts to create a complex system from scratch will fail. More successful
approaches usually involve developing more complex systems gradually from simpler
ones. The implication for AI safety is that we are unlikely to be able to build a large,
safe, working AI system directly. As discussed above, scaling up a safe system does
not guarantee that the resulting larger system will also be safe. However, starting
with safe systems and cautiously scaling them up is more likely to result in larger
systems that are safe than attempting to build the larger systems in one fell swoop.

Building complex systems directly is difficult. Since complex systems can behave in unexpected ways, we are unlikely to be able to design and build a large,
working one from scratch. Instead, we need to start by ensuring that smaller sys-
tems work and then build on them. This is exemplified by how businesses develop;
a business usually begins as one person or a few people with an idea, then be-
comes a start-up, then a small business, and can potentially grow further from there.
People do not usually attempt to create multinational corporations immediately without progressing naturally through these earlier stages of development.
One possible explanation for this relates to the limitations of armchair theorizing
about complex systems. Since it is difficult to anticipate every possible behavior and
failure mode of a complex system in advance, it is unlikely that we will be able to
design a flawless system on the first attempt. If we try to create a large, complex
system immediately, it might be too large and unwieldy for us to make the necessary
changes to its structure when issues inevitably arise. If the system instead grows
gradually, it has a chance to encounter relevant problems and adapt to deal with
them during the earlier stages when it is smaller and more agile.
Similarly, if we want large AI systems that work well and are safe, we should start by
making smaller systems safe and effective and then incrementally build on them. This
way, operators will have more chances to notice any flaws and refine the systems as
they go. An important caveat is that, as discussed above, a scaled-up system might
have novel emergent properties that are not present in the smaller version. We cannot
assume that a larger system will be safe just because it has been developed in this
way. However, it is more likely to be safe than if it was built from scratch. In other
words, this approach is not a guarantee of safety, but it is likely our best option. The
scaling process should be done cautiously.

Lesson: Any System Which Depends on Human Reliability Is Unreliable

Gilb’s Law of Unreliability. We cannot guarantee that an operator will never make an error, and especially not in a large complex system. As the chemical engineer
Trevor Kletz put it: “Saying an accident is due to human failing is about as helpful as
saying that a fall is due to gravity. It is true but it does not lead to constructive action”
[282]. To make a complex system safer, we need to incorporate some allowances in
the design so that a single error is not enough to cause a catastrophe.
The implication of this for AI safety is that having humans monitoring AI systems
does not guarantee safety. Beyond human errors of judgment, processes in some
complex systems may happen too quickly for humans to be included in them anyway.
AI systems will probably be too fast-moving for human approval of their decisions to
be a practical or even a feasible safety measure. We will therefore need other ways
of embedding human values in AI systems and ensuring they are preserved, besides
including humans in the processes. One potential approach might be to have some
AI systems overseeing others, though this brings its own risks.

Summary.

The general lessons that we should bear in mind for AI safety are:
1. We cannot predict every possible outcome of AI deployment by theorizing, so some
trial and error will be needed
2. Even if we specify an AI’s goals perfectly, it may not pursue them in practice, instead pursuing unexpected, distorted subgoals
3. A small system that is safe will not necessarily remain safe if it is scaled up
4. The most promising approach to building a large AI that is safe is nonetheless to
make smaller systems safe and scale them up cautiously
5. We cannot rely on keeping humans in the loop to make AI systems safe, because
humans are not perfectly reliable and, moreover, AIs are likely to accelerate pro-
cesses too much for humans to keep up.

5.3.2 Puzzles, Problems, and Wicked Problems

So far, we have explored the contrasts between simple and complex systems and why
we need different approaches to analyzing and understanding them. We have also
described how AIs and the social systems surrounding them are best understood as
complex systems, and discussed some lessons from the field of complex systems that
can inform our expectations around AI safety and how we address it.
In attempting to improve the safety of AI and its integration within society, we are
engaging in a form of problem-solving. However, simple and complex systems present
entirely different types of problems that require different styles of problem-solving.
We can therefore reframe our earlier discussion of reductionism and complex systems
in terms of the kinds of challenges we can address within each paradigm. We will
now distinguish between three different kinds of challenges—puzzles, problems, and
wicked problems. We will look at the systems that they tend to arise in, and the
different styles of problem-solving we require to tackle each of them.

Puzzles and Problems

Puzzles. Examples of puzzles include simple mathematics questions, sudokus, assembling furniture, and fixing a common issue with a watch mechanism. In all these
cases, there is only one correct result and we are given all the information we need
to find it. We usually find puzzles in simple systems that have been designed by
humans and can be fully understood. These can be solved within the reductionist
paradigm; the systems are simply the sum of their parts, and we can solve the puzzle
by breaking it down into a series of steps.

Problems. With problems, we do not always have all the relevant information up-
front, so we might need to investigate to discover it. This usually gives us a better
understanding of what’s causing the issue, and ideas for solutions often follow natu-
rally from there. It may turn out that there is more than one approach to fixing the
problem. However, it is clear when the problem is solved and the system is functioning
properly again.
We usually find problems in systems that are complicated, but not complex. For
example, in car repair work, it might not be immediately apparent what is causing
an issue. However, we can investigate to find out more, and this process often leads
us to sensible solutions. Like puzzles, problems are amenable to the reductionist
paradigm, although they may involve more steps of analysis.

Wicked Problems
Wicked problems usually arise in complex systems and often involve a
social element. Wicked problems are a completely different class of challenges from
puzzles and problems. They appear in complex systems, with examples including
inequality, misinformation, and climate change. There is also often a social factor
involved in wicked problems, which makes them more difficult to solve. Owing to their
multifaceted nature, wicked problems can be tricky to categorically define or explain.
We will now explore some key features that are commonly used to characterize them.
There is no single explanation or single solution for a wicked problem.
We can reasonably interpret a wicked problem as stemming from more than one
possible cause. As such, there is no single correct solution, nor even a limited set of possible solutions.
No proposed solution to a wicked problem is fully right or wrong, only
better or worse. Since there are usually many factors involved in a wicked prob-
lem, it is difficult to find a perfect solution that addresses them all. Indeed, such a
solution might not exist. Additionally, due to the many interdependencies in complex
systems, some proposed solutions may have negative side effects and create other
issues, even if they reduce the targeted wicked problem. As such, we cannot usually
find a solution that is fully correct or without flaw; rather, it is often necessary to
look for solutions that work relatively well with minimal negative side effects.
There is often a risk involved in attempting to solve a wicked problem.
Since we cannot predict exactly how a complex system will react to an intervention
in advance, we cannot be certain as to how well a suggested solution will work or
whether there will be any unintended side effects. This means there may be a high
cost to attempting to address wicked problems, as we risk unforeseen consequences.
However, trying out a potential solution is often the only way of finding out whether
it is better or worse.
Every wicked problem is unique because every complex system is unique.
While we can learn some lessons from other systems with similar properties, no two
systems will respond to our actions in exactly the same way. This means that we
cannot simply transpose a solution that worked well in one scenario to a different one
and expect it to be just as effective. For example, introducing predators to control
pest numbers has worked well in some situations, but, as we will discuss in the next
section, it has failed in others. This is because all ecosystems are unique, and the
same is true of all complex systems, meaning that each wicked problem is likely to
require a specifically tailored intervention.
It might not be obvious when a wicked problem has been solved. Since
wicked problems are often difficult to perfectly define, it can be challenging to say
they have been fully eliminated, even if they have been greatly reduced. Indeed, since wicked problems tend to be persistent, it might not be feasible to fully eliminate many of them at all. Instead, they often require ongoing efforts to improve
the situation, though the ideal scenario may always be beyond reach.

AI safety is a wicked problem. Since AI and the social environments it is deployed within are complex systems, the issues that arise with its use are likely to
be wicked problems. There may be no obvious solution, and there will probably need
to be some trial and error involved in tackling them. More broadly, the problem of
AI safety in general can be considered a wicked problem. There is no single correct
approach, but many possibilities. We may never be able to say that we have fully
“solved” AI safety; it will require ongoing efforts.

Summary. Puzzles and problems usually arise in relatively simple systems that
we can obtain a complete or near-complete understanding of. We can therefore find
all the information we need to explain the issue and find a solution to it, although
problems may be more complicated, requiring more investigation and steps of analysis
than puzzles.
Wicked problems, on the other hand, arise in complex systems, which are much
more difficult to attain a thorough understanding of. There may be no single correct
explanation for a wicked problem, proposed solutions may not be fully right or wrong,
and it might not be possible to find out how good they are without trial and error.
Every wicked problem is unique, so solutions that worked well in one system may
not always work in another, even if the systems seem similar, and it might not be
possible to ever definitively say that a wicked problem has been solved. Owing to the
complex nature of the systems involved, AI safety is a wicked problem.

5.3.3 Challenges With Interventionism

As touched on above, there are usually many potential solutions to wicked problems,
but they may not all work in practice, even if they sound sensible in theory. We might
therefore find that some attempts to solve wicked problems will be ineffective, have
negative side effects, or even backfire. Complex systems have so many interdepen-
dencies that when we try to adjust one aspect of them, we can inadvertently affect
others. For this reason, we should approach AI safety with more humility and more
awareness of the limits of our knowledge than if we were trying to fix a watch or a
washing machine. We will now look at some examples of historical interventions in
complex systems that have not gone to plan. In many cases, they have done more
harm than good.

Cane toads in Australia. Sugarcane is grown in Australia as an economically valuable crop, but a species of insect called the cane beetle is known to feed
on sugarcane crops and destroy them. In the 1930s, cane toads were introduced in
Australia to prey on these beetles, with the hope of minimizing crop losses. However,
since cane toads are not native to Australia, they have no natural predators there.
In fact, the toads are toxic to many native species and have damaged ecosystems
by poisoning animals that have eaten them. The cane toads have multiplied rapidly
and are considered an invasive species. Attempts to control their numbers have so far
been largely unsuccessful.

Warning signs on roads. Road accidents are a long-standing and pervasive issue.
A widely used intervention is to display signs along roads with information about the
number of crashes and fatalities that have happened in the surrounding area that year.
The idea is that this information should encourage people to drive more carefully.
However, one study has found that signs like this increase the number of accidents
and fatalities, possibly because they distract drivers from the road.
Renewable Heat Incentive Scandal. In 2012, a government department in
Northern Ireland wanted to boost the fraction of energy consumption from renewable
sources. To this end, they set up an initiative offering businesses generous subsidies
for using renewable heating sources, such as wood pellets. However, in trying to reach
their percentage targets for renewable energy, the politicians offered a subsidy that
was slightly more than the cost of the wood pellets. This incentivized businesses to
use more energy than they needed and profit from the subsidies. There were reports
of people burning pellets to heat empty buildings unnecessarily. The episode became
known as the “Cash for Ash scandal.”
Barbados-Grenada football match. In the 1994 Caribbean Cup, an interna-
tional football tournament, organizers introduced a new rule to reduce the likelihood
of ties, which they thought were less exciting. The rule was that if two teams were
tied at the end of the allotted 90 minutes, the match would go to extra time, and
any goal scored in extra time would be worth double. The idea was to incentivize the
players to try harder to score. However, in a match between Barbados and Grenada,
Barbados needed to win by two goals to advance to the tournament finals. The score
as they approached 90 minutes was 2-1 to Barbados. This resulted in a strange situa-
tion where Barbados players tried to score an own goal to push the match into extra
time and have an opportunity to win by two.
Summary. Interventions that work in theory might fail in a complex system. In
all these examples, an intervention was attempted to solve a problem in a complex
system. In theory, each intervention seemed like it should work, but each decision-
maker’s theory did not capture all the complexities of the system at hand. Therefore,
when each intervention was applied, the system reacted in unexpected ways, leaving
the original problem unsolved, and often creating additional problems that might be
even worse.

Stable States and Restoring Forces


The examples above illustrate how complex systems can react unexpectedly to in-
terventions. This can be partly attributed to the properties of self-organization and
adaptive behavior; complex systems can organize themselves around new conditions
in unobvious ways, without necessarily addressing the reason for the intervention.
Some interventions might partially solve the original problem but unleash unantic-
ipated side effects that are not considered worth the benefits. Other interventions,
however, might completely backfire, exacerbating the very problem they were in-
tended to solve. We will now discuss the concept of “stable states” and how they
might explain complex systems’ tendency to resist attempts to change them.

If a complex system is in a stable state, it is likely to resist attempts to change it. If a ball is sitting in a valley between two hills and we kick it up one
hill, gravity will pull it back to the valley. Similarly, if a complex system has found
a stable state, there might be some “restoring forces” or homeostatic processes that
will keep drawing it back toward that state, even if we try to pull it in a different
direction. When complex systems are not near critical points, they exhibit robustness
to external changes.
Another analogy is Le Chatelier’s Principle, a well-known concept in chemistry. The
principle concerns chemical equilibria, in which the concentrations of different chemi-
cals stay the same over time. There may be chemical reactions happening, converting
some chemicals into others, but the rate of any reaction will equal the rate of its oppo-
site reaction. The total concentration of each chemical therefore remains unchanged,
hence the term equilibrium.
Le Chatelier’s Principle states that if we introduce a change to a system in chemical
equilibrium, the system will shift to a new equilibrium in a way that partly opposes
that change. For example, if we increase the concentration of one chemical, then
the rate of the reaction using up that chemical will increase, using up more of it
and reducing the extra amount present. Similarly, complex systems sometimes react
against our interferences in them.
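The ball-and-valley picture can be made concrete with a toy simulation (a sketch under simple assumptions: a one-dimensional system with a single stable state and a linear restoring force, subjected to a temporary intervention):

# A one-dimensional system with a stable state at 0 and a linear restoring force.
# A temporary intervention pushes the system away; once it ends, the system relaxes back.
stable_state = 0.0
restoring_strength = 0.5
x = stable_state
trajectory = []
for t in range(100):
    intervention = 0.2 if 20 <= t < 40 else 0.0        # a push applied only for a while
    x += -restoring_strength * (x - stable_state) + intervention
    trajectory.append(x)

print("displacement during the intervention:", round(trajectory[39], 2))   # pushed away
print("displacement 20 steps after it ends: ", round(trajectory[59], 3))   # back near zero

Holding the system away from its stable state requires applying the intervention indefinitely; shifting the stable state itself, as discussed below, changes where the system settles without continual effort.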
We will now look at some examples of interventions backfiring in complex systems.
We will explore how we might think of these systems as having stable states and
restoring forces that draw the system back toward its stable state if an intervention
tries to pull it away. Note that the following discussions of what the stable states and
restoring forces might be are largely speculative. Although these hypotheses have not
been rigorously proven to explain these examples, they are intended to show how
we can view systems and failed interventions through the lens of stable states and
restoring forces.
Rules to restrict driving. In 1989, to tackle high traffic and air pollution levels
in Mexico City, the government launched an initiative called “Hoy No Circula.” The
program introduced rules that allowed people to drive only on certain days of the
week, depending on the last number on their license plate. This initially led to a
drop in emissions, but they soon rose again, actually surpassing the pre-intervention
levels. A study found that the rules had incentivized people to buy additional cars
so they could drive on more days. Moreover, the extra cars people bought tended to
be cheaper, older, more polluting ones, exacerbating the pollution problem [283].
We could perhaps interpret this situation as having a stable state in terms of how
much driving people wanted or needed to do. When rules were introduced to try
to reduce it, people looked for ways to circumvent them. We could view this as a
restoring force in the system.
Four Pests campaign. In 1958, the Chinese leader Mao Zedong launched a cam-
paign encouraging people to kill the “four pests”: flies, mosquitoes, rodents, and
sparrows. The first three were targeted for spreading disease, but sparrows were con-
sidered a pest because they were believed to eat grain and reduce crop yields. During
this campaign, sparrows were killed intensively and their populations plummeted.
However, as well as grain, sparrows also eat locusts. In the absence of a natural
predator, locust populations rose sharply, destroying more crops than the sparrows
did [284]. Although many factors were at play, including poor weather and officials’
decisions about food distribution [285], this ecosystem imbalance is often considered
a contributing factor in the Great Chinese Famine [284], during which tens of millions
of people starved between 1959 and 1961.
Ecosystems are highly complex, with intricate balances between the populations of
many species. We could think of agricultural systems as having a “stable state” that
naturally involves some crops being lost to wildlife. If we try to reduce these losses
simply by eliminating one species, then another might take advantage of the available
crops instead, acting as a kind of restoring force.

Antibiotic resistance. Bacterial infections have been a cause of illness and mor-
tality in humans throughout history. In September 1928, bacteriologist Alexander
Fleming discovered penicillin, the first antibiotic. Over the following years, the meth-
ods for producing it were refined, and, by the end of World War II, there was a large
supply available for use in the US and Britain. This was a huge medical advance-
ment, offering a cure for many common causes of death, such as pneumonia and
tuberculosis. Death rates due to bacterial illnesses dropped dramatically [286]; it is
estimated that, in 1952, in the US, around 150,000 fewer people died from bacterial
illnesses than would have without antibiotics. In the early 2000s, it was estimated
that antibiotics may have been saving around 200,000 lives annually in the US alone.
However, as antibiotics have become more abundantly used, bacteria have begun to
evolve resistance to these vital medicines. Today, many bacterial illnesses, including
pneumonia and tuberculosis, are once again becoming difficult to treat due to the
declining effectiveness of antibiotics. In 2019, the Centers for Disease Control and Pre-
vention reported that antimicrobial-resistant bacteria are responsible for over 35,000
deaths per year in the US [287].
In this case, we might think of the coexistence of humans and pathogens as having
a stable state, involving some infections and deaths. While antibiotics have reduced
deaths due to bacteria over the past decades, we could view natural selection as a
“restoring force,” driving the evolution of bacteria to become resistant and causing
deaths to rise again. Overuse of these medicines intensifies selective pressures and
accelerates the process.
In this case, it is worth noting that antibiotics have been a monumental advancement
in healthcare, and we do not argue that they should not be used or that they are a
failed intervention. Rather, this example highlights the tendency of complex systems
to react against measures over time, even if they were initially highly successful.

Instead of pushing a system in a desired direction, we could try to shift the stable states. If an intervention attempts to artificially hold a system away
from its stable state, it might be as unproductive as repeatedly kicking a ball up a
hill to keep it away from a valley. Metaphorically speaking, if we want the ball to sit
in a different place, a more effective approach would be to change the landscape so that the valley is where we want the ball to be. The ball will then settle there without
our help. More generally, we want to change the stable points of the system itself,
if possible, so that it naturally reaches a state that is more in line with our desired
outcomes.
Good cycling infrastructure may shift the stable states of how much
people drive. One example of shifting stable states is the construction of cycling
infrastructure in the Netherlands in the 1970s. As cars became cheaper during the
20th century, the number of people who owned them began to rise in many countries,
including the Netherlands. Alongside this, the number of car accidents also increased.
In the 1970s, a protest movement gathered in response to rising numbers of children
being killed by cars. The campaign succeeded in convincing Dutch politicians to build
extensive cycling infrastructure to encourage people to travel by bike instead of by
car. This has had positive, lasting results. A 2018 report stated that around 27% of
all trips in the Netherlands are made by bike—a higher proportion than any other
country studied [288].
Instead of making rules to try to limit how much people drive, creating appropriate
infrastructure makes cycling safer and easier. Additionally, well-planned cycle net-
works can make many routes quicker by bike than by car, making this option more
convenient. Under these conditions, people will naturally be more inclined to cycle,
so society naturally drifts toward a stable point that entails less driving.
It is worth noting that the Netherlands’ success might not be possible to replicate
everywhere, as there may be other factors involved. For instance, the terrain in the
Netherlands is relatively flat compared with other countries, and hilly terrain might
dissuade people from cycling. This illustrates that some factors influencing the stable
points are beyond our control. Nevertheless, this approach has likely been more ef-
fective in the Netherlands than simple rules limiting driving would have been. There
might also be other effective strategies for changing the stable points of how much
people drive, such as creating cheap, reliable public transport systems.
Summary. Complex systems can often self-organize into stable states that we may
consider undesirable, and which create some kind of environmental or social problem.
However, if we try to solve the problem too simplistically by trying to pull the system
away from its stable state, we might expect some restoring forces to circumvent our
intervention and bring the system back to its stable state, or an even worse one. A
more effective approach might be to change certain underlying conditions within a
system, where possible, to create new, more desirable stable states for the system to
self-organize toward.

Successful Interventions
We have discussed several examples of failed interventions in complex systems. While
it can be difficult to say definitively a wicked problem has been solved, there are some
examples of interventions that have clearly been at least partially successful. We will
now look at some of these examples.

Eradication of Smallpox. In 1967, the WHO launched an intensive campaign against smallpox, involving global vaccination programs and close monitor-
ing and containment of outbreaks. In 1980, the WHO declared that smallpox had
been eradicated. This was an enormous feat that required concerted international
efforts over more than a decade.

Reversal of the depletion of the ozone layer. Toward the end of the 20th
century it was discovered that certain compounds frequently used in spray cans, refrig-
erators, and air conditioners, were reaching the ozone layer and depleting it, leading
to more harmful radiation passing through. As a result, the Montreal Protocol, an
international agreement to phase out the use of these compounds, was negotiated in
1987 and enacted soon after. It has been reported that the ozone layer has started to
recover since then.

Public health campaigns against smoking. In the 20th century, scientists dis-
covered a causal relationship between tobacco smoking and lung cancer. In the fol-
lowing decades, governments started implementing various measures to discourage
people from smoking. Initiatives have included health warnings on cigarette packets,
smoking bans in certain public areas, and programs supporting people through the
process of quitting. Many of these measures have successfully raised public awareness
of health risks and contributed to declining smoking rates in several countries.
While these examples show that it is possible to address wicked problems, they also
demonstrate some of the difficulties involved. All these interventions have required
enormous, sustained efforts over many years, and some have involved coordination on
a global scale. It is worth noting that smallpox is the only infectious disease that has
ever been eradicated. One challenge in replicating this success elsewhere is that some
viruses, such as influenza viruses, evolve rapidly to evade vaccine-induced immunity.
This highlights how unique each wicked problem is.
Campaigns to dissuade people from smoking have faced pushback from the tobacco
industry, showing how conflicting incentives in complex systems can hamper attempts
to solve wicked problems. Additionally, as is often the case with wicked problems,
we may never be able to say that smoking is fully “solved”; it might not be feasible
to reach a situation where no one smokes at all. Nonetheless, much positive progress
has been made in tackling this issue.

Summary. Although it is by no means straightforward to tackle wicked problems, there are some examples of interventions that have successfully solved or made great
strides toward solving certain wicked problems. For many wicked problems, it may
never be possible to say that they have been fully solved, but it is nonetheless possible
to make progress and improve the situation.

5.3.4 Systemic Issues

We have discussed the characteristics of wicked problems as stemming from the com-
plex systems they arise from, and explored why they are so difficult to tackle. We
have also looked at some examples of failed attempts to solve wicked problems, as well
as examples of more successful ones, and explored the idea of shifting stable points,
instead of just trying to pull a system away from its stable points. We will now dis-
cuss ways of thinking more holistically and identifying more effective, system-level
solutions.

Obvious problems are sometimes just symptoms of broader systemic issues. It can be tempting to take action at the level of the obvious, tangible problem,
but this is sometimes like applying a band-aid. If there is a broader underlying is-
sue, then trying to fix the problem directly might only work temporarily, and more
problems might continue to crop up.

We should think about the function we are trying to achieve and the
system we are using. One method of finding more effective solutions is to “zoom
out” and consider the situation holistically. In complex systems language, we might
say that we need to find the correct scale at which to analyze the situation. This
might involve thinking carefully about what we are trying to achieve and whether
individuals or groups in the system exhibit the behaviors we are trying to control.
We should consider whether, if we solve the immediate problem, another one might
be likely to arise soon after.

It might be more fruitful to change AI research culture than to address individual issues. One approach to AI safety might be to address issues with
individual products as they come up. This approach would be focused on the level of
the problem. However, if issues keep arising, it could be a sign of broader underlying
issues with how research is being done. It might therefore be better to influence
the culture around AI research and development, instead of focusing on individual
risks. If multiple organizations developing AI technology are in an arms race with
one another, for example, they will be trying to reach goals and release products as
quickly as possible. This will likely compel people to cut corners, perhaps by omitting
safety measures. Reducing these competitive pressures might therefore significantly
reduce overall risk, albeit less directly.
If competitive pressures remain high, we could imagine a potential future scenario in
which a serious AI-related safety issue materializes and causes considerable harm. In
explaining this accident, people might focus on the exact series of events that led to
it—which product was involved, who developed it, and what precisely went wrong.
However, ignoring the role of competitive pressures would be an oversight. We can
illustrate this difference in mindset more clearly by looking at historical examples.

We can explain catastrophes by looking for a “root cause” or looking at systemic factors. There are usually two ways of interpreting a catastrophe. We
can either look for a traceable series of events that triggered it, or we can think more
about the surrounding conditions that made it likely to happen one way or another.
For instance, the first approach might say that the assassination of Franz Ferdinand
caused World War One. While that event may have been the spark, international
tensions were already high beforehand. If the assassination had not happened, some-
thing else might have done, also triggering a conflict. A better approach might in-
stead invoke the imperialistic ambitions of many nations and the development of new
militaristic technologies, which led nations to believe there was a strong first-strike
advantage.
We can also find the contrast between these two mindsets in the different explanations
put forward for the Bhopal gas tragedy, a huge leak of toxic gas that happened in
December 1984 at a pesticide-producing plant in Bhopal, India. The disaster caused
thousands of deaths and injured up to half a million people. A “root cause” expla-
nation blames workers for allowing water to get into some of the pipes, where it
set off an uncontrolled reaction with other chemicals that escalated to catastrophe.
However, a more holistic view focuses on the slipping safety standards in the run-up
to the event, during which management failed to adequately maintain safety systems
and ensure that employees were properly trained. According to this view, an accident
was bound to happen as a result of these factors, regardless of the specific way in
which it started.

To improve safety in complex systems, we should focus on general systemic factors. Both examples above took place in complex systems; the network
of changing relationships between nations constitutes a complex evolving system, as
does the system of operations in a large industrial facility. As we have discussed,
complex systems are difficult to predict and we cannot analyze and guard against
every possible way in which something might go wrong. Trying to change the broad
systemic factors to influence a system’s general safety may be much more effective.
In the development of technology, including AI, competitive pressures are one im-
portant systemic risk source. Others include regulations, public concern, safety costs,
and safety culture. We will discuss these and other systemic factors in more depth in
the Safety Engineering chapter.

Summary. Instead of just focusing on the most obvious, surface-level problem, we


should also consider what function we are trying to achieve, the system we are using,
and whether the problem might be a result of a mismatch between the system and
our goal. Thinking in this way can help us identify systemic factors underlying the
problems and ways of changing them so that the system is better suited to achieving
our aims.

5.4 CONCLUSION

In this chapter, we have explored the properties of complex systems and their implica-
tions for AI safety strategies. We began by contrasting simple systems with complex
systems. While the former can be understood as the sum of their parts, the latter
display emergent properties that arise from complex interactions. These properties
do not exist in any of the components in isolation and cannot easily be derived from
reductive analysis of the system.

Next, we explored seven salient hallmarks of complexity. We saw that feedback loops
are ubiquitous in complex systems and often lead to nonlinearity, where a small
change in the input to a system does not result in a proportionate change in the
output. Rather, fluctuations can be amplified or quashed by feedback loops. Further-
more, these processes can make a system highly sensitive to its initial conditions,
meaning that a small difference at the outset can lead to vastly different long-term
trajectories. This is often referred to as the “butterfly effect,” and makes it difficult
to predict the behaviors of complex systems.
We also discussed how the components of complex systems tend to self-organize to
some extent and how they often display critical points, at which a small fluctuation
can tip the system into a drastically different state. We then looked at distributed
functionality, which refers to how tasks are loosely shared among components in a
complex system, and scalable structure, which gives rise to power laws within complex
systems. The final hallmark of complexity we discussed was adaptive behavior, which
allows systems to continue functioning in a changing environment.
Along the way, we highlighted how DL systems exhibit the hallmarks of complexity.
Beyond AIs themselves, we also showed how the social systems they exist within are
also best understood as complex systems, through the worked examples of corpora-
tions and research institutes, political systems, and advocacy organizations.
Having established the presence of complexity in AIs and the systems surrounding them, we considered what this means for AI safety by drawing out five general lessons.
Since we cannot usually predict all emergent properties of complex systems simply
through theoretical analysis, some trial and error is likely to be required in making
AI systems safe. It is also important to be aware that systems often break down
goals into subgoals, which can supersede the original goal, meaning that AIs may not
always pursue the goals we give them.
Due to the potential for emergent properties, we cannot guarantee that a safe system
will remain safe when it is scaled up. However, since we cannot usually understand
complex systems perfectly in theory, it is extremely difficult to build a flawless com-
plex system from scratch. This means that starting with small systems that are safe
and scaling them up cautiously is likely the most promising approach to building
large complex systems that are safe. The final general lesson is that we cannot guar-
antee AI safety by keeping humans in the loop, so we need to design systems with
this in mind.
Next, we looked at how complex systems often give rise to wicked problems, which
cannot be solved in the same way we would approach a simple mathematics question
or a puzzle. We saw how difficult it is to address wicked problems, due to the unex-
pected side effects that can occur when we interfere with complex systems. However,
we also explored examples of successful interventions, showing that it is possible to
make significant progress, even if we cannot fully solve a problem. In thinking about
the most effective interventions, we highlighted the importance of thinking holistically
and looking for system-level solutions.

AI safety is not a mathematical puzzle that can be solved once and for all. Rather, it
is a wicked problem that is likely to require ongoing, coordinated efforts, and flexible
strategies that can be adapted to changing circumstances.

5.5 LITERATURE

5.5.1 Recommended Reading


• M. Gell-Mann. The Quark and the Jaguar: Adventures in the Simple and the Complex. St. Martin’s Publishing Group, 1995. isbn: 9780805072532. url: https://books.google.com/books?id=l6aCe4zqZ_sC
• D.H. Meadows and D. Wright. Thinking in Systems: A Primer. Chelsea Green Publishing, 2008. isbn: 9781603580557. url: https://books.google.com.au/books?id=CpbLAgAAQBAJ
• John Gall. The systems bible: the beginner’s guide to systems large and small. General Systemantics Press, 2002
• Richard Cook. How complex systems fail. Jan. 2002. url: https://how.complexsystems.fail
Part III: Ethics and Society
CHAPTER 6

Beneficial AI and Machine Ethics

6.1 INTRODUCTION

How should we direct AIs to promote human values? As we continue to develop powerful AI systems, it is crucial to ensure they are safe and beneficial for
humanity. In this chapter, we discuss the challenges of specifying appropriate values
for AI systems to pursue. Some of these questions are already relevant for AI systems
that exist today in the healthcare or automotive sectors, where artificial systems
may be making decisions in situations that can harm or benefit humans. They will
become even more critical for future systems that may be integrated more broadly
across economies, governments, and militaries, making high-stakes decisions with
many important moral considerations. As discussed in section 3.4, these questions
are core components of the overall challenge of “AI alignment.”
Many people have incoherent views on embedding values into AIs. Peo-
ple often talk about what AIs should do to promote human values. They may agree
with many of the following:
1. AIs should do what you tell them to do.
2. AIs should promote what you choose to do.
3. AIs should do what’s fair.
4. AIs should do what a democratic process tells them to do.
5. AIs should figure out what is moral, then do that.
6. AIs should do what is objectively good for you.
7. AIs should do what would make people happy.

At first glance, all of these seem like reasonable answers and excellent goals for AIs. However, many of them are incompatible, because they
make different normative assumptions and require different technical implementa-
tions. Other suggestions such as “AIs should follow human intentions” are highly
vague. Put straightforwardly, while these sound attractive and similar, they are not
the same thing.
This should challenge the notion that this is a straightforward problem with a simple
solution, and that the real challenges lie only elsewhere, such as in making AI systems
more capable. Those who believe there are easy ways to ensure that AIs act ethically
may find themselves grappling with inconsistencies and confusion when confronted
with instances in which their preferred methods appear to yield harmful behavior.
In this chapter, we will explore these issues, attempting to understand which answers,
if any, take us closer to creating safe and beneficial AIs. We start by considering some
goals commonly proposed to ensure that AI systems are beneficial for society. Current
and future AI systems will need to comply with existing laws and to avoid biases and
unfairness toward certain groups in society. Beyond this, many want AI to make our
economic systems more efficient and to boost economic growth. We briefly consider
the attractions of these goals, as well as some shortcomings:
1. Law: Should AI systems just be made to follow the law? We examine whether
we can design AIs that follow existing legal frameworks, considering that law is
a legitimate aggregation of human values that is time-tested and comprehensive
while being both specific and adaptable. We will lay out the challenges of the
occasional silence, immorality, or unrepresentativeness of established law.
2. Bias and Fairness: How can we design AI systems to be fair? We explore bias
and fairness in AI systems and the challenges associated with ensuring outcomes
created by AIs are fair. We will discuss different definitions of fairness and see how
they are incompatible, as well as consider approaches to mitigating biases.
3. Economic Engine: Should we let the economy decide what AIs will be like? We
consider how economic forces are shaping AI development, and why we might be
concerned about letting economic incentives alone determine how AI is developed
and which objectives AI systems pursue.

Machine ethics. While all of these proposals seem to capture important intuitions
about what we value, we argue that they face significant limitations and are not
sufficient on their own to ensure beneficial outcomes. We consider what it would mean
to create AI systems that aim directly to improve human wellbeing. This provides
a case study in thinking about how we can create AI systems that can respond
appropriately to moral considerations, an emerging field known as machine ethics.
While this is a highly ambitious goal, we believe it is likely to become increasingly
relevant in coming years. As AI capabilities improve, they may be operating with
an increasing level of autonomy and a broadening scope of potential actions they
could take. In this context, approaches based only on compliance with the law or
profit maximization are likely to prove increasingly inadequate in providing sufficient
guardrails on AI systems’ behavior. To specify in greater detail how they should react
in a particular situation, AI systems will need to be able to identify and respond
appropriately to the relevant values at stake, such as human wellbeing.

Wellbeing. If we are to set wellbeing or other values as goals for AI systems, one
fundamental question that faces us is how we should specify these values. In the
second part of this chapter, we consider various interpretations of wellbeing and how
attractive it would be to have AI systems pursue these, assuming that they became
capable enough to do this. We start by introducing several competing theories of
wellbeing, which might present different goals for ethical AI systems. We examine
theories of wellbeing that focus on pleasure, objective goods, and preference satisfac-
tion. We then explore preference satisfaction in more detail and consider what kind of
preferences AI systems should satisfy. AI systems can be created to satisfy individual
preferences, but which preferences they should focus on is an open question. We fo-
cus on the challenges of deciding between revealed, stated, and idealized preferences.
Next, we turn to consider whether AI systems should aim to make people happy
and how we might use AIs to promote human happiness. Finally, we consider the
challenge of having AIs maximize wellbeing not just for an individual, but across the
whole of society. We look at how to aggregate total wellbeing across society, focusing
on social welfare functions. We discuss what social welfare functions are and how to
trade off between equity and efficiency in a principled way.

Moral uncertainty. There are many cases where we may feel uncertain about
what the right response is to a particular situation due to conflicting moral consider-
ations. We consider how AI systems might respond to such situations. In the case of
AI systems, we may deliberately want to introduce uncertainty into their reasoning to
avoid over-confident decisions that could be disastrous from some moral perspectives.
One option to address moral uncertainty is using a moral parliament, where ethical
decisions are made by simulating democratic processes.

6.2 LAW

Why not have AIs follow the law? We have just argued that for AI systems to
be safe and beneficial, we need to ensure they can respond to moral considerations
such as wellbeing. However, some might argue that simply getting AIs to follow the
law is a better solution.
The law has three features that give it an advantage over ethics as a model for safe
and beneficial AI. In democratic countries, the law is a legitimate representation of
our moral codes: at least in theory, it is a democratically endorsed record of our
shared values. Law is time-tested and comprehensive; it has developed over many
generations to adjudicate the areas where humans have consequential disagreements.
Finally, legal language can be specific and adaptable to new contexts, comparing
favorably to ethical language, which can often be interpreted in diverging ways.
The next subsection will expand on these features of the law. We will see how these
features contrast favorably with ethics before arguing that we do need ethics after
all, but alongside the law.

6.2.1 The Case for Law


Legitimate Aggregation of Values

In a democratic country, the law is influenced by the opinions of the populace. Democratic citizens vote on new laws, directly or through representatives. If they don’t like part of an existing law, they have a range of legal means,
such as advocating, protesting, and voting, to change it. Even though the law at any
given time won’t perfectly reflect the values of the citizenry, the method of arriving
at law is usually legitimate. In other words, the citizens have input into the law.

Legitimacy provides the law with a clear advantage over ethics. The law
provides a collection of rules and standards that enable us to differentiate illegal from
legal actions. Ethics, on the other hand, isn’t standardized or codified. To determine
ethical and unethical actions, we have to either pick an ethical theory to follow, or
decide how to weigh the differing opinions of multiple theories that we think might
be true. But any of these options are likely to be more controversial than simply
following the law. Ethics has no in-built method of democratic agreement.
However, just following the law isn’t a perfect solution: there will always be an act of
interpretation between the written law and its application in a particular situation.
There is often no agreement over the procedure for this interpretation. Therefore,
even if AI systems were created in a way that bound them to follow the law, a
legal system with human decision-makers might have to remain part of the process.
The law is only legitimate when interpreted by someone democratically appointed or
subject to democratic critique.

Time-Tested and Comprehensive

Systems of law have evolved over generations. With each generation, new
people are given the job of creating, enforcing, and interpreting the law. The law
covers a huge range of issues and incorporates a wide range of distinctions. Because
these bodies of law have been evolving for so long, the information encoded in the
law is a record of what has worked for many people and is often considered an
approximation of their values. This makes the law a particularly promising resource
for aligning AI systems with the values of the population.

Laws are typically written without AIs in mind. Most laws are written
with humans in mind. As a result, they assume things such as intent—for example,
American law holds it a crime to knowingly possess a biological agent that is intended
for use as a weapon. Both ‘knowingly’ and ‘intended’ are terms we may not be
able to apply to AIs, since AIs may not have the ability to know or intend things.
Having AIs follow laws that were written without AIs in mind might result in unusual
interpretations and applications of the law.
Another area of ambiguity is copyright law. AIs are trained on a vast corpus of
training data, created by developers, and ultimately run by users. When AIs create
content, it is unclear whether the user, developer, or creator of the training data
should be assigned the intellectual property rights. IP laws were written for humans
and human-led corporations creating content, not AIs.

Rules and Standards

Naively, we might think of the law as a system of rules. “Law” seems almost synonymous with “rule” in our language. When we talk about “the laws of
physics” or “natural laws” in general, we mean something rigid and inflexible—when
X happens, Y will follow. This is partly true: inflexible rules are a part of the law.
For instance, take the rule: “If someone drives faster than the speed limit of 70mph,
they will be fined $200.” In this case, there is an objective trigger (driving faster than
70mph) and a clear directive (a $200 fine). There isn’t much room for the judge to
interpret the rule differently. This gives the lawmaker predictable control over how
the law will be carried out.

A law based on rules alone would be flawed. For example, a fixed speeding rule would mean fining someone who was accelerating momentarily to avoid hitting a
pedestrian and not fining someone who continued to drive at the maximum speed
limit around a blind turn, creating a danger for other drivers. Rules are always over-
inclusive (they will apply to some cases we would rather not be illegal) and under-
inclusive (they won’t apply to all cases we would like to be illegal).

To remedy this problem, a law can instead be based on a standard. In the speeding case, a standard could be “when someone is driving unreasonably, they
will be fined in proportion to the harm they pose.” A judge could apply this standard
to get the correct result in both cases above (speeding to avoid an accident and
going full speed around a blind turn). Standards have their own problems: with
standards rather than rules, the judge is empowered to interpret the standards based
on their own opinion, allowing them to act in ways that diverge from the lawmaker’s
intentions.

The law uses rules and standards. Using rules and standards alongside each
other, the law can find the best equilibrium between carrying out the lawmaker’s
intentions and accounting for situations they didn’t foresee [293]. This gives the law an advantage in maintaining human control of AI systems, because it can strike the right level of ambiguity.
Legal language is less ambiguous than ethical language. Phrases
like “do the right thing,” “act virtuously,” or “make sure you are acting consistently”
can mean different things to different people. In contrast, it is more flexible than
programming languages, which are brittle and designed to only fit into particular
contexts. Legal language can maintain a middle ground between rigid rules and more
sensible standards.

Figure 6.1. Legal language balances being less ambiguous than ethical language while being
more flexibly formatted than programming language.

Specific and Adaptable

We can apply these insights about rules and standards in law to two core problems
in controlling AI systems: misinterpretation and gaming. The law is specific enough
to avoid misinterpretation and adaptable enough to prevent many forms of gaming.

Given commands in natural language, AIs might interpret them literally. The misinterpretation problem arises when the literal interpretation of our commands
differs from our intended interpretation. In AI safety discourse, we see this problem
raised by many thinkers; for example, Stuart Russell raised the concern that an AI, if
asked to develop a cure for cancer, might experiment on people, giving them tumors
as part of carrying out the request [54]. A narrow, literal interpretation of “develop
a cure for cancer,” which doesn’t take any of our typical considerations into account,
could lead to this outcome. Misinterpretation risks are like wishes to a genie: we
might get what we ask for but not what we want.

The capabilities of LLMs give us some reason to see misinterpretation risks as unlikely. Before LLMs, some AI researchers were concerned that most AI
models would be trained in ways that would mean they had no understanding of many
ordinary human concepts or values. This might mean that we should be worried about
extreme actions emerging from a misinterpretation of perfectly normal and mundane
requests. However, this now seems less likely. Our experience with large language
models has shown that by being trained on human-generated language data, AIs can
respond to the meaning of our sentences in a way that is approximately similar to
the way that a human speaker of the language would; for instance, the Happiness
section below discusses two other systems that can predict human responses to video
and text. If similar systems are used in the future, it seems plausible that AIs can
apply laws in a sensible way.
That said, we might still worry that while AIs might usually interpret human con-
cepts sensibly, they may still have representations that have rare quirks. This poses
problems; for instance, it creates vulnerabilities that can be exploited by adversarial
attacks. There is more work to be done before we can feel comfortable that systems
will reliably be able to interpret laws and other commands in practice.
However, we still face the risk of AIs gaming our commands to get what
they want. Stuart Russell raises a different concern with AI: gaming [54]. An AI
system may “play” the system or rules, akin to strategizing in a game, to achieve
its objectives in unexpected or undesired ways. He gives the example of tax codes.
Humans have been designing tax laws for 6000 years, yet many still avoid taxation.
Creating a tax code not susceptible to gaming is particularly difficult because indi-
viduals are incentivized to avoid paying taxes wherever possible. If our track record
with creating rules to constrain each other is so bad, then we might be pessimistic
about constraining AI systems that might have goals that we don’t understand.
A partial solution to misinterpretation and gaming is found in rules and
standards. If we are concerned about misinterpretation, we might choose to rely
on laws that are specific. Rules such as “any generated image must have a digital
watermark” are specific enough that they are difficult to misinterpret. We might
prefer using such laws rather than relying on abstract ethical principles, which are
vaguer and easier to misinterpret.
Conversely, if we are concerned about AIs gaming rules, we might prefer to have
standards, which can cover more ground than a rule. A well-formulated standard can
lead to an agent consistently finding the right answer, even in new and unique cases.
Such approaches are sometimes applied in the case of taxes. In the UK, for example,
there is a rule against tax arrangements that are “abusive.” This is not an objective
trigger: it is up to a judge to decide what is “abusive.” An AI system trained to follow
the law can be accountable to rules and standards.

6.2.2 The Need for Ethics


This subsection will discuss why ethics is still indispensable in creating safe and bene-
ficial AI, even though law is a powerful tool. Firstly, though the law is comprehensive,
there are important areas of human life where it gives no advice. Secondly, it is com-
mon for laws to be immoral, even by the standards of the residents of the country
that made them. Finally, the law is unlikely to be the most legitimate conceivable
system, even if it is our best one.

Silent Law

The law is a set of constraints, not a complete guide to behavior. The law
doesn’t cover every aspect of human life, and certainly not everything we need to
constrain AI. Sometimes, this is accidental, but sometimes, the law is intentionally
silent. We can call these zones of discretion: areas of behavior that the law doesn’t
constrain; for example, a state’s economic policy and the content of most contracts
between private individuals. The law puts some limits on these areas of behavior.
Still, lawmakers intentionally leave free space to enable actors to make their own
choices and to avoid the law becoming too burdensome.

The law often stops short of compelling us to do clearly good things. It is generally seen as good if someone steps in to rescue a stranger who is in danger.
However, in some jurisdictions like the US, the duty to rescue only applies in certain
cases where we have special responsibility for someone. We face a penalty if we fail to
rescue our spouse or employee, but there is no law against failing to rescue a stranger,
even when it would be at no cost to us. Giving humans this kind of discretion may not
have terrible outcomes, because many people have strong ethical intuitions anyway
and would carry out the rescue. But an AI that was only following the law would not do so unless it had a strong ethical drive as well. In cases like rescue, where AI systems may be more capable than humans of helping, we could be passing up a major benefit
by asking the AI to only follow the law. Likewise, in the US, AIs may be required
to avoid uttering copyrighted content and libelous claims, but doxing, instructions
for building bombs, or advice for how to break the law can be legal, even if it is not
ethical.
Conversely, by constraining AI with laws rather than guiding it with ethics, we risk
it acting in undesirable ways in zones of discretion. AI could recommend potentially
harmful economic policies, trick humans into regrettable contracts, and pursue legal
but harmful business practices. Without ethics, a law-abiding AI could carry out great
harm because the law is silent in situations where we may want to guide behavior
nonetheless.

Immoral Law

The law can be immoral even if it is created legitimately and interpreted correctly. Democratic majorities can pass laws that a large number of fellow cit-
izens think are immoral, or that will seem immoral to future populations of that
country, such as the legalization of wars later regretted, legal slavery, legal discrim-
ination on the basis of race and gender, and various other controversial laws which
are later deemed morally wrong. There is also often a significant time delay between
the moral opinions of a population changing, and the law changing to reflect them
[294]. This time delay can be especially harmful in the context of AIs since they
can create several new legal issues relatively quickly. This means that laws formed
in democracies can fall short of being moral, even in the eyes of the citizens of the
country that made them.

On many issues, the boundaries of the law are surprising. Since laws are
often developed ad hoc, with new legislation to tackle the issues of the day, what laws
actually permit and prohibit can often be unintuitive. For example, the “Anarchist’s
Cookbook,” released in the US during the Vietnam War, had detailed instructions
on how to create narcotics and explosives and sold millions of copies without being
taken out of circulation because the law permitted this use of mass media. Similarly,
“doxing” someone with publicly available information and generating compromising
images and videos of real people are legal as well. Such acts seem intuitively illegal,
but US law permits them.

If solely constrained by a country’s fallible democratic laws, future AI systems could behave in ways most of the world would consider immoral. To mitigate this risk, it
may be necessary to train AI systems to respect the ethical perspectives of those
affected by their actions.

Unrepresentative Law
Law isn’t the only way, or even necessarily the best way, to arrive at an
aggregation of our values. Not all judges and legal professionals agree that the law
should capture the values of the populace. Many think that legal professionals know
better than the public, and that laws should be insulated from the changing moral
opinions of the electorate. This suggests that we might be able to find, or conceive of,
more representative ways of capturing the values of the populace. Current alternatives
like surveys or citizens’ assemblies are useful for some purposes, such as determining
preferences on specific issues or arriving at informed, representative policy proposals.
However, they aren’t suited to the general task of summarizing the values of the
entire population across areas as comprehensive as those covered by the law.

A Note On The Three Laws of Robotics


Some propose that Isaac Asimov’s “Three Laws of Robotics” provide a useful
set of rules for creating AIs that behave ethically [295, 296]. These laws are as
follows:
(1) “A robot may not injure a human being, or through inaction, allow a
human being to come to harm,”
(2) “A robot must obey the orders given to it by a human being except where
such orders would conflict with the First Law,” and
(3) “A robot must protect its own existence as long as such protection does
not conflict with the First or Second Laws.”

These laws are not meant to provide a solution to ethical challenges. Asimov
himself frequently tested the adequacy of these laws throughout his writing,
showing that they are, in fact, limited in their ability to resolve ethical prob-
lems. Below, we explore some of these limitations.
Asimov’s laws are insufficient for guiding ethical behavior in AI
systems [297]. The three laws use under-defined terms like “harm” and
“inaction.” Because they’re under-defined, they could be interpreted in multiple
ways. It’s not clear precisely what “harm” means to humans, and it would be
even more difficult to encode the same meaning in AI systems.
Harm is a complex concept. It can be physical or psychological. Would a robot
following Asimov’s first laws be required to intervene when humans are about
to hurt each other’s feelings? Would it be required to intervene to prevent
a human from behaving in ways that are self-harming but deliberate, like
smoking? Consider the case of amputating a limb in order to stop the spread of an infection. An AI programmed with Asimov’s laws would be forbidden from amputating the limb, as that would literally be an instance of injuring
a human being. However, the AI would also be forbidden from allowing the
harmful spread of an infection through its inaction. These scenarios illustrate that the first law fails to give coherent guidance; and since the second and third laws are defined in terms of the first, they inherit the same problem.
The laws may need to be much more specific in order to reliably guide ethical
behavior in future AI systems.
Philosophy has yet to produce a sufficient set of rules to determine moral con-
duct. The safety of future AI systems cannot be guaranteed simply through a
set of rules or axioms. Numerous factors, such as proxy gaming and competi-
tive pressures, cannot be adequately captured in a set of rules. Rules may be
useful, but AI safety will require a more dynamic and comprehensive approach
that can address existing technical and sociotechnical issues.
Overall, Asimov’s Three Laws of Robotics fail to reliably guide ethical behavior
in AI systems, even if they serve as a useful starting point for examining certain
questions and problems in AI safety.

Conclusions About Law

The law is comprehensive, but not comprehensive enough to ensure that the actions
of an AI system are safe and beneficial. AI systems must follow the law as a baseline,
but we must also develop methods to ensure that they follow the demands of ethics
as well. Relying solely on the law would leave many gaps that the AI could exploit,
or make ethical errors within. To create beneficial AI that acts in the interests of
humanity, we need to understand the ethical values that people hold over and above
the law.

6.3 FAIRNESS

We can use the law to ensure that AIs make fair decisions. AIs are
being used in many sensitive applications that affect human lives, from lending and
employment to healthcare and criminal justice. As a result, unfair AI systems can
cause serious harm.
The COMPAS case study. A famous example of algorithmic decision-making in
criminal justice is the COMPAS (Correctional Offender Management Profiling for Al-
ternative Sanctions) software used by over 100 jurisdictions in the US justice system.
This algorithm uses observed features such as criminal history to predict recidivism,
or how likely defendants are to reoffend. A ProPublica report [298] showed that
COMPAS disproportionately labeled African-Americans as higher risk than white
counterparts with nearly identical offense histories. However, COMPAS’s creators
argued that it was calibrated, with accurate overall probabilities of recidivism across its three risk levels, and that it was less biased and better than human
judgments [299]. This demonstrated the trade-off between different definitions of fairness: it was calibrated across risk levels, but it was also clear that COMPAS gen-
erated more false positives for African-American defendants (predicting they would
re-offend when they did not) and more false negatives for white defendants (predicting they would not re-offend when they in fact did). Adding to the concern, COMPAS
is a black-box algorithm: its process is proprietary and hidden. One lawsuit argued
this violates due process rights since its methods are hidden from the court and the
defendants [300]. In this section, we will discuss some of the serious ethical questions
raised by this case, examining what makes algorithms unfair and considering some
methods to improve fairness.
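
To make the disparity in error rates concrete, here is a minimal sketch of the kind of audit involved, written in Python over a handful of hypothetical predictions. The records, group labels, and numbers are invented for illustration and are not COMPAS data.

```python
# Minimal sketch: auditing group-wise error rates on hypothetical data.
# Each record is (group, true_label, predicted_label); 1 means "will re-offend".
records = [
    ("A", 0, 1), ("A", 0, 0), ("A", 1, 1), ("A", 1, 0), ("A", 0, 1),
    ("B", 0, 0), ("B", 0, 0), ("B", 1, 0), ("B", 1, 1), ("B", 1, 0),
]

def error_rates(records, group):
    fp = sum(1 for g, y, yhat in records if g == group and y == 0 and yhat == 1)
    fn = sum(1 for g, y, yhat in records if g == group and y == 1 and yhat == 0)
    negatives = sum(1 for g, y, _ in records if g == group and y == 0)
    positives = sum(1 for g, y, _ in records if g == group and y == 1)
    return fp / negatives, fn / positives  # false positive rate, false negative rate

for group in ("A", "B"):
    fpr, fnr = error_rates(records, group)
    print(f"group {group}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

In this toy data, group A receives far more false positives and group B far more false negatives, which is the shape of the disparity ProPublica reported.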

6.3.1 Bias

AI systems can amplify undesirable biases. AI systems are being increasingly deployed throughout society. If these influential systems have biases, they can
reinforce disparities and produce widespread, long-term harms. In AI, bias refers to a
consistent, systematic, or undesirable distortion in the outcomes produced by an AI
system. These outcomes can be predictions, classifications, or decisions. Bias can be
influenced by many factors, including erroneous assumptions, training data, or hu-
man biases. Biases in modern DL systems can be especially consequential. While not all forms of bias are harmful, we focus on those that cause socially relevant harms, which we must work proactively to prevent. This section
overviews bias in AI and outlines some mitigation strategies.

Aspects of bias in AI. A bias is systematic when it includes a pattern of repeated deviation from the true values in one direction. Unlike random unstructured errors,
or “noise,” these biases are not reliably fixed by just adding more data. Resolving
ingrained biases often requires changing algorithms, data collection practices, or how
the AI system is applied. Algorithmic bias occurs when any computer system con-
sistently produces results that disadvantage certain groups over others. Some biases
are relatively harmless, like a speech recognition system that is better at interpreting
human language than whale noises. However, other forms of bias can result in serious
social harms, such as partiality to certain groups, inequity, or unfair treatment.

Bias can manifest at every stage of the AI lifecycle. From data collec-
tion to real-world deployment, bias can be introduced through multiple mechanisms
at any step in the process. Historical and social prejudices produce skewed training
data, propagating flawed assumptions into models. Flawed models can cement biases
into the AI systems that help make important societal decisions. In addition, humans
misinterpreting results can further compound bias. After deployment, biased AI sys-
tems can perpetuate discriminatory patterns through harmful feedback loops that
exacerbate bias. Developing unbiased AI systems requires proactively identifying and
mitigating biases across the entire lifecycle.

Figure 6.2. The lifecycle of bias in AI systems: systematic psychological, historical, and social biases can lead to algorithmic biases within AI systems.

Biases in AI often reflect systemic biases. Systematic biases can occur even against developers’ intentions. For instance, Amazon developed an ML-based resume-screening algorithm trained on historical hiring decisions. However, as the tech industry is predominantly male, the training data contained skewed gender proportions (about 60% male and 40% female) [301]. Consequently, the algorithm
scored male applicants higher than equally qualified women, penalizing resumes with
implicit signals like all-female colleges. The algorithm essentially reproduced real-
world social biases in hiring and employment. This illustrates how biased data, when
fed into AI systems, can inadvertently perpetuate discrimination. Organizations must
be vigilant about biases entering any stage of the machine learning pipeline.

In many countries, some social categories are legally protected from dis-
crimination. Groups called protected classes are legally protected from harmful
forms of bias. These often include race, religion, sex/gender, sexual orientation, ances-
try, disability, age, and others. Laws in many countries prohibit denying opportunities
or resources to people solely based on these protected attributes. Thus, AI systems
exhibiting discriminatory biases against protected classes can produce unlawful out-
comes. Mitigating algorithmic bias is crucial for ensuring that AI complies with equal
opportunity laws by avoiding discrimination.

6.3.2 Sources of Bias


Biases can arise from multiple sources, both from properties of the AI system itself
and human interaction with the system. This section discusses common sources of
harmful biases in AI systems, although there are many more. First, we will discuss
technical sources of bias, primarily from flawed data or objectives. Then, we will
review some biases that arise from interactions between humans and AI systems.

Technical Sources of Bias in AI Systems

An overview of technical sources of bias. In this section, we will review some sources of bias in technical aspects of AI systems. First, we will investigate some data-driven sources of biases, including flawed training data, subtle patterns that
can be used to discriminate, biases in how the data is generated or reported, and
underlying societal biases. Flawed or skewed training data can propagate biases into
the model’s weights and predictions. Then, we show how RL training environments
and objectives can also reinforce bias.

ML models trained on biased datasets can learn and reinforce harmful societal biases. AI systems learn from human-generated data, absorbing both
valuable knowledge and harmful biases. Even when unintentional, this data frequently
mirrors ingrained societal prejudices. As a result, AI models can propagate real-world
discrimination by learning biases from their input data. For instance, a lawsuit found
that Facebook’s ad targeting algorithm violated the Fair Housing Act because it
learned to exclude users from seeing housing ads based on race, gender, or other
protected traits. Similarly, ML models can reflect political biases, deprioritizing users
from specific political affiliations by showing their content to smaller audiences. As
another example, an NLP model trained on a large corpus of internet text learned
to reinforce gender stereotypes, completing sentence structures of the format “man
is to X as woman is to Y” with content such as “man is to computer programmer as
woman is to homemaker” [302]. These examples show how ML models can amplify
existing social biases.
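
The sketch below shows how this kind of analogy completion can fall out of simple vector arithmetic over embeddings. The 2-D vectors are invented to mimic the reported behavior; a real system would use high-dimensional embeddings learned from text.

```python
import math

# Toy illustration of analogy completion with made-up 2-D "embeddings"
# (not vectors from a real trained model).
embeddings = {
    "man":        (1.0, 0.2),
    "woman":      (1.0, 0.9),
    "programmer": (2.0, 0.2),
    "homemaker":  (2.0, 0.9),
    "doctor":     (2.0, 0.5),
}

# "man is to programmer as woman is to ?"  ->  woman + (programmer - man)
man, woman, prog = embeddings["man"], embeddings["woman"], embeddings["programmer"]
query = (woman[0] + prog[0] - man[0], woman[1] + prog[1] - man[1])

candidates = {w: v for w, v in embeddings.items() if w not in {"man", "woman", "programmer"}}
print(min(candidates, key=lambda w: math.dist(candidates[w], query)))  # "homemaker"
```

Because the invented vectors encode a shared “gender direction” across occupation words, the nearest neighbor of the analogy query reproduces the stereotype; learned embeddings can behave the same way when their training text does.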

Models can learn to discriminate based on subtle correlations. One intuitive way to fix bias is to remove protected attributes like gender and achieve “fairness
through unawareness.” But this is not enough to remove bias. ML models can learn
subtle correlations that serve as proxies for these attributes. For example, even in
datasets with gender information removed, resume-screening models learned to asso-
ciate women with certain colleges and assigned them lower scores [301]. In another
study, ML models erroneously labeled images of people cooking as women, due to
learned gender biases [303]. Thus, models can discriminate even when the data does
not contain direct data about protected classes. This hidden discrimination can harm
protected groups despite efforts to prevent bias.
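
Below is a minimal sketch of this failure mode with made-up numbers: the protected attribute is dropped before training, but a remaining proxy feature still carries the same signal into the model’s scores.

```python
# Sketch: "fairness through unawareness" can fail when a proxy feature encodes
# the protected attribute. All values below are invented for illustration.
gender      = [1, 1, 1, 1, 0, 0, 0, 0]                   # 1 = woman; dropped before training
proxy       = [1, 1, 1, 0, 0, 0, 0, 0]                   # e.g. "attended an all-female college"
model_score = [0.3, 0.2, 0.3, 0.6, 0.7, 0.8, 0.6, 0.7]   # hypothetical model outputs

def correlation(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# The proxy tracks gender, so the model can still "see" gender after it is removed,
# and scores end up systematically lower for the group the proxy picks out.
print(correlation(proxy, gender))       # strongly positive
print(correlation(proxy, model_score))  # strongly negative
```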

Biased or unrepresentative data collection can lead to biased decisions. Training data reflects biases in how the data was collected. If the training data is more representative of some groups than others, the model’s predictions will tend to be systematically worse, and therefore more biased, for the underrepresented groups. Imbalances in training data occur when the data is skewed with respect to
output labels, input features, and data structure. For instance, a disease prediction
dataset with 100,000 healthy patients but only 10 sick patients exhibits a large class
imbalance. The minority class with fewer examples is underrepresented.
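
A quick calculation shows why such imbalance is dangerous: a model that simply predicts “healthy” for everyone looks extremely accurate while providing no value for the minority class. The numbers below just restate the hypothetical example above.

```python
# Class imbalance: with 100,000 healthy patients and 10 sick patients, the
# trivial "always predict healthy" classifier looks accurate but is useless.
healthy, sick = 100_000, 10

accuracy = healthy / (healthy + sick)   # fraction of patients it labels correctly
recall_on_sick = 0 / sick               # it never identifies a single sick patient

print(f"accuracy: {accuracy:.4%}")                            # ~99.99%
print(f"recall on the sick minority: {recall_on_sick:.0%}")   # 0%
```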

Several other problems can introduce bias in AI training data. Systematic problems in the data can add bias. For instance, reporting bias occurs when the
relative frequency of examples in the training data misrepresents real-world frequen-
cies. Often, the frequency of outcomes in readily available data does not reflect their actual
occurrence. For instance, the news amplifies shocking events and under-reports nor-
mal occurrences or systematic, ongoing problems—reporting shark attacks rather
than cancer deaths. Sampling bias occurs when the data collection systematically
over-samples some groups and undersamples others. For instance, facial recognition
datasets in Western countries often include many more lighter-skinned individuals.
Labeling bias is introduced later in the training process, when systematic errors in the
data labeling process distort the training signal for the model. Humans may introduce
their own subjective biases when labeling data.
Beyond problems with the training data, the training environments and objectives of
RL models can also exhibit problems that promote bias. Now, we will review some
of these sources of bias.
Training environments can also amplify bias. Reward bias occurs when the
environments used to train RL models introduce biases through improper rewards.
RL models learn based on the rewards received during training. If these rewards fail
to penalize unethical or dangerous behavior, RL agents can learn to pursue immoral
outcomes. For example, models trained in video games may learn to accomplish goals
by harming innocents if these actions are not sufficiently penalized in training. Some
training environments may fail to encourage good behaviors enough, while others can
even incentivize bad behavior by rewarding RL agents for taking harmful actions.
Humans must carefully design training environments and incentives that encourage
ethical learning and behavior [304].
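
As a toy illustration of the point about reward design, the sketch below contrasts two hypothetical reward functions for the same environment. Only the second gives the agent any reason to avoid harmful side effects; the functions and numbers are invented.

```python
# Two hypothetical reward functions for the same training environment.
def reward_naive(goal_reached: bool, bystanders_harmed: int) -> float:
    # Rewards only task completion; harm to bystanders is invisible to the agent.
    return 10.0 if goal_reached else 0.0

def reward_with_penalty(goal_reached: bool, bystanders_harmed: int) -> float:
    # Same task reward, but harmful side effects are explicitly penalized.
    return (10.0 if goal_reached else 0.0) - 5.0 * bystanders_harmed

# An episode that reaches the goal while harming two bystanders:
print(reward_naive(True, 2))         # 10.0 -> the harmful strategy looks optimal
print(reward_with_penalty(True, 2))  #  0.0 -> the penalty removes the incentive
```
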
RL models can optimize for training objectives that amplify bias or
harm. Reinforcement learning agents will try to optimize the goals they are given
in training, even if these objectives are harmful or biased, or reflect problematic as-
sumptions about value. For example, a social media news feed algorithm trained to
maximize user engagement may prioritize sensational, controversial, or inflammatory
content to increase ad clicks or watch time. Technical RL objectives often make im-
plicit value assumptions that cause harm, especially when heavily optimized by a
powerful AI system [58, 305]. News feed algorithms implicitly assume that how much
a user engages with some content is a reliable indicator of that content’s value, and therefore show it to even more users. After all, social media companies
train ML models to maximize ad revenue by increasing product usage, rather than
fulfilling goals that are harder to monetize or quantify, such as improving user expe-
rience or promoting accurate and helpful information. Especially when taken to their
extreme and applied at a large scale, RL models with flawed training objectives can
exacerbate polarization, echo chambers, and other harmful outcomes. Problems with
the use of flawed training objectives are further discussed in section 3.3.

Biases from human-AI interactions


Interactions between humans and AI systems can produce many kinds of
bias. It is not enough to just ensure that AI systems have unbiased training data:
humans interacting with AI systems can also introduce biases during development,
usage, and monitoring. Flawed evaluations allow biases to go unnoticed before models
are deployed. Confirmation bias in the context of AI occurs when people focus on algorithm
outputs that reinforce their pre-existing views, dismissing opposing evidence. Humans
may emphasize certain model results over others, skewing the outputs even if the
underlying AI system is reliable. This distorts our interpretation of model decisions.
Overgeneralization occurs when humans draw broad conclusions about entire groups
based on limited algorithmic outputs that reflect only a subset. Irrationality and
human cognitive bias play a substantial role in biasing AI systems.

Human-AI system biases can be reinforced by feedback loops. Feedback loops in human-AI systems often arise when the output of an AI system is used as
input in future AI models. An AI system trained on biased data could make biased
decisions that are fed into future models, reinforcing bias in a self-perpetuating cycle.
We speak more about these feedback loops in chapter 5. Self-fulfilling prophecies
can occur when an algorithmic decision influences actual outcomes, as the model
reinforces its own biases and influences future input data [306]. In this way, models
can amplify real-world biases, making them even more real. For example, a biased
loan-approval algorithm could deny loans to lower-income groups, reinforcing real-
world income disparities that are then reflected in the training data for future models.
This process can make bias more severe over time.
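
The toy simulation below sketches how such a loop can widen an initial gap over successive model updates. The update rule, threshold, and starting scores are invented solely to illustrate the dynamic.

```python
# Toy simulation of a bias-reinforcing feedback loop (all numbers hypothetical).
# Loan approvals feed back into the credit data future models are trained on,
# so a small initial gap between groups widens over time.
scores = {"group_A": 0.60, "group_B": 0.50}
approval_threshold = 0.55

for step in range(5):
    for group, score in scores.items():
        approved = score >= approval_threshold
        # Approved applicants build credit history; denied applicants fall further behind.
        scores[group] = min(1.0, score + 0.05) if approved else max(0.0, score - 0.05)
    print(step, {g: round(s, 2) for g, s in scores.items()})
```

After five rounds the initial 0.10 gap has grown to 0.60, even though nothing about either group changed except the model’s own past decisions.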

Automation and measurability induce bias. Bias can be amplified by automation bias, where humans favor algorithmic decisions over human decisions, even
if the algorithm is wrong or biased. This blind trust can cause harm when the model
is flawed. Similarly, a bias toward the measurable leads human-AI systems to favor easily quantifiable attributes while overlooking important qualitative aspects and less tangible factors.

Despite their problems, AI systems can be less biased than humans. Although there are legitimate concerns, AI systems used for hiring and other sensi-
tive tasks may sometimes lead to less biased decisions when compared with human
decision-makers. Humans often harbor strong biases that skew their judgment in
these decisions. With careful oversight and governance, AI holds promise for reducing certain biases relative to human decision-making.

6.3.3 AI Fairness Concepts

Methods for improving AI fairness could mitigate harms from biased systems, but
they require overcoming challenges in formalizing and implementing fairness. This
section explores algorithmic fairness, including its technical definitions, limitations,
and real-world strategies for building fairer systems.

Fairness is difficult to specify. Fairness is a complicated and disputed concept with no single agreed-upon definition. Different notions of fairness can come into
conflict, making it challenging to ensure that an AI system will be considered fair by
all stakeholders.

Five fairness concepts. Some concepts of individual fairness focus on treating similar individuals similarly—for instance, ensuring job applicants with the same
qualifications have similar chances of being shortlisted. Others focus on group fair-
ness: ensuring that protected groups receive similar outcomes as majority groups.
Procedural fairness emphasizes improving the processes that lead to outcomes, mak-
ing sure they are consistent and transparent. Distributive fairness concerns the equal
distribution of resources. Counterfactual fairness emphasizes that a model is fair if
its predictions are the same even if a protected characteristic like race were different,
all else being equal. These concepts can all be useful in different contexts.
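
As an illustration of the counterfactual idea, the sketch below flips the protected attribute while holding other features fixed and compares the outputs of a made-up scoring rule standing in for a trained model.

```python
# Naive counterfactual-fairness check on a made-up scoring rule.
def model_score(applicant: dict) -> float:
    score = 0.5 * applicant["qualifications"] + 0.3 * applicant["experience"]
    if applicant["gender"] == "female":   # direct use of the protected attribute
        score -= 0.1
    return score

applicant = {"qualifications": 0.8, "experience": 0.6, "gender": "female"}
counterfactual = {**applicant, "gender": "male"}

print(round(model_score(applicant), 2), round(model_score(counterfactual), 2))  # 0.48 vs 0.58

# Caveat: a full counterfactual-fairness analysis is causal. Features downstream
# of the protected attribute (such as proxies) should also change in the
# counterfactual world, so flipping a single field can miss indirect effects.
```
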
Justice as fairness. Ethics is useful for analyzing the idea of fairness. John Rawls’
theory of justice as fairness argues that fairness is fundamental to achieving a more
just social system. His maximin and difference principles state that inequalities in
social goods can only be justified if they maximize benefits for the most disadvantaged
people. He also argued that the social goods must be open to all under equality of
opportunity. These ideas align with common notions of fairness. Some argue this
principle also applies to AI: harms from the bias of algorithmic decisions should be
minimized, especially in ways that make the worst-off people better off. Theories of
justice can help develop the background principles for fairness.
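
As a toy illustration of how a maximin-style criterion differs from simply maximizing the total, consider the sketch below; the allocations and numbers are invented, and real applications involve far richer considerations.

```python
# Toy comparison of two aggregation rules over hypothetical welfare allocations.
# Each allocation lists the welfare of three people; numbers are invented.
allocations = {
    "A": [10, 10, 10],   # equal outcomes
    "B": [30, 20, 2],    # higher total, but the worst-off person does badly
}

utilitarian = max(allocations, key=lambda k: sum(allocations[k]))   # total welfare
maximin     = max(allocations, key=lambda k: min(allocations[k]))   # worst-off welfare

print(utilitarian)  # "B": the sum is larger
print(maximin)      # "A": the worst-off person is better off
```
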
Algorithmic fairness. The field of algorithmic fairness aims to understand and
address unfairness issues that can arise in algorithmic systems, such as classifiers and
predictive models. This field’s goal is to ensure that algorithms do not perpetuate disadvantages based on protected characteristics such as race, gender, or class, especially when predicting outcomes from individual features using models trained on historical data. Several
different technical definitions of fairness have been proposed, often formalized math-
ematically. These definitions aim to highlight unfairness in ML systems, but most
possess inherent limitations. We will review three definitions below.
Statistical parity. The concept of statistical parity, also known as demographic parity,
requires that an algorithm makes positive decisions at an equal rate for different
groups. This metric requires that the model’s predictions are independent of the
sensitive attribute. A hiring algorithm satisfies statistical parity if the hiring rates for
men and women are identical. While intuitive, statistical parity is a very simplistic
notion; for instance, it does not account for potential differences between groups that
could justify or explain different outcomes.
Equalized odds. Equalized odds require that the false positive rate and false negative
rate are equal across different groups. A predictive health screening algorithm fulfills
equalized odds if the false positive rate is identical for men and women. This metric
ensures that the accuracy of the model is not dependent on the sensitive attribute
value. However, enforcing equalized odds can reduce overall accuracy.
Calibration. Calibration measures how well predicted probabilities match empirical
results. In a calibrated model, the actual long-run frequency of positives in the real
population will match the predicted probability from the model. For instance, if the
model predicts 20% of a certain group will default on a loan, roughly 20% will in fact
default. Importantly, calibration is a metric for populations, and it does not tell us
about the correctness or fairness of an ML system for individuals. Calibration can improve fairness by preventing incorrect, discriminatory predictions. As it happens,
ML models often train on losses that encourage calibration and are therefore often
calibrated naturally.
These technical concepts can be useful for operationalizing fairness. However, there is
no single mathematical definition of fairness that matches everyone’s complex social
expectations. This is a problem because satisfying one definition can often violate
others: there are tensions between statistical notions of fairness.
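
To make these definitions concrete, the sketch below computes each quantity by group for a small set of hypothetical predictions (all numbers are invented). Each metric summarizes a different aspect of the model’s behavior, and they need not move together.

```python
# Hypothetical model outputs: (group, true_label, predicted_probability).
data = [
    ("A", 1, 0.9), ("A", 0, 0.7), ("A", 1, 0.8), ("A", 0, 0.3), ("A", 0, 0.6),
    ("B", 1, 0.6), ("B", 0, 0.2), ("B", 0, 0.4), ("B", 1, 0.3), ("B", 0, 0.1),
]
threshold = 0.5

def summarize(group):
    rows = [(y, p >= threshold) for g, y, p in data if g == group]
    positive_rate = sum(pred for _, pred in rows) / len(rows)            # statistical parity
    fpr = sum(pred for y, pred in rows if y == 0) / sum(1 for y, _ in rows if y == 0)
    tpr = sum(pred for y, pred in rows if y == 1) / sum(1 for y, _ in rows if y == 1)
    mean_prob = sum(p for g, _, p in data if g == group) / len(rows)     # calibration check:
    base_rate = sum(y for y, _ in rows) / len(rows)                      # compare to base rate
    return positive_rate, fpr, tpr, mean_prob, base_rate

for group in ("A", "B"):
    pr, fpr, tpr, mp, br = summarize(group)
    print(f"{group}: positive rate {pr:.2f} | FPR {fpr:.2f}, TPR {tpr:.2f} | "
          f"mean prediction {mp:.2f} vs. actual rate {br:.2f}")
```

Comparing the positive rates across groups checks statistical parity, comparing FPR and TPR checks equalized odds, and comparing mean predictions to actual rates checks calibration within each group.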

6.3.4 Limitations of Fairness

There are several problems with trying to create fair AI systems. While we can try to
improve models’ adherence to the many metrics of fairness, the three classic defini-
tions of fairness are mathematically contradictory for most applications. Additionally,
improving fairness is often at odds with accuracy. Another practical problem is that
creating fair systems means different things across different areas of applications, such
as healthcare and justice, and different stakeholders within each area have different
views on what constitutes fairness.

Contradictions between fairness metrics. Early AI fairness research largely focused on three metrics of fairness: statistical/demographic parity, equalized odds,
and calibration. However, these ubiquitous metrics often contradict each other: sta-
tistical parity only considers overall prediction rates, not accuracy, while an equalized
odds approach focuses on accuracy across groups and calibration emphasizes correct
probability estimates on average. Achieving calibration may require violating statistical parity when the prevalence of the characteristic being predicted differs across groups, such as re-offending upon release from prison being more common among disadvantaged
minorities [307]. This makes fulfilling all three notions of fairness at once difficult or
impossible.
The impossibility theorem for AI fairness proves that no classifier can satisfy these
three definitions of fairness unless the prevalence of the target characteristic is equal
across groups or prediction is perfect [308, 309]. Requiring a model to be “fair” ac-
cording to one metric may actually disadvantage certain groups according to another
metric. This undermines attempts to create a universally applicable, precise defini-
tion of fairness. However, we can still use metrics to better approximate our ideals of
fairness while remaining aware of their limitations.

Fairness can reduce performance if not achieved carefully. Enforcing fair-
ness constraints often reduces model accuracy. One study found that applying fairness
techniques to an e-commerce recommendation system increased financial costs [310], and
another found that mitigating unfairness in Kaggle models through post-processing reduced
performance [311]. However, these and other studies also find ways to simultaneously improve
both fairness and accuracy; for example, work on healthcare models has managed to
improve fairness with little effect on accuracy [312]. While aiming for fairness can
reduce model accuracy in many cases, sometimes fairness can be improved without
harming accuracy.

Difficulties in achieving fairness across contexts. Different fields have dis-
tinct problems: fairness criteria that make sense in the context of employment may
be inapplicable in healthcare. Even different fields within healthcare face different
problems with incompatible solutions. These context-specific issues make generic so-
lutions inadequate. Models trained on historical data might reflect historical patterns
such as the underprescription of pain medication to women [313]. Removing gender
information from the dataset seems like an obvious way to avoid this problem. How-
ever, this does not always work and can even be counterproductive. For instance,
removing gender data from an algorithm that matches donated organs to people in
need of transplants failed to eliminate unfairness, because implicit markers of gender
like body size and creatinine levels still put women at a disadvantage [314]. Diag-
nostic systems without information about patients’ sex tend to mispredict disease in
females because they are trained mostly on data from males [315]. Finding ways to
achieve fairness is difficult: there is no single method or definition of fairness that
straightforwardly translates into fair outcomes for all.

Disagreements in intuitions about fairness. There is widespread disagree-
ment in intuitions about the fairness of ML systems, even when a model fulfills
technical fairness metrics; for instance, patients and doctors often disagree on what
constitutes fairness. People often view identical decisions as more unfair if they come
from a statistical model [316]; they also often disagree on which fairness-oriented
features are the most important [317], such as whether race should be used by the
model or whether the model’s accuracy or false positive rates are more important. It
is unclear how to define fairness in a generally acceptable way.

6.3.5 Approaches to Combating Bias and Improving Fairness

Due to the impossibility theorem and inconsistent and competing ideas, it is only
possible to pursue some definition or metric of fairness—fairness as conceptualized in
a particular way. This goal can be pursued both through technical approaches that
focus directly on algorithmic systems, and other approaches that focus on related
social factors.

Technical approaches. Metrics of fairness such as statistical parity identify as-
pects of ML systems that are relevant for fairness. Technical approaches to improving
fairness include a host of methods to improve models’ performance on these metrics,
which can mitigate some forms of unfairness. These often benefit from being broadly
applicable with little domain-specific knowledge. Developers can test predictive mod-
els against various metrics for fairness and adjust models so that they perform bet-
ter. Fairness toolkits offer programmatic methods for implementing technical fairness
metrics into ML pipelines. Other methods for uncovering hidden sources of unfairness
in ML models include adversarial testing, sensitivity analysis, and ranking feature im-
portances. One promising technical approach involves training an adversarial network
to predict a protected variable from an ML model’s outputs [318]. By penalizing the
model when the adversary succeeds at predicting a variable like race or political af-
filiation from the model’s outputs, the model is forced to avoid discrimination and
make predictions that do not unfairly depend on sensitive attributes. When applied
well, this can minimize biases.
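
Below is a minimal sketch of this adversarial debiasing idea in PyTorch-style code. The architectures, penalty weight, and alternating update scheme are assumptions for illustration rather than the exact method of [318]; the key point is that the predictor is penalized whenever the adversary can recover the protected attribute from its outputs.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: `predictor` maps features to a task score, and `adversary`
# tries to recover the protected attribute from the predictor's output.
predictor = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
adversary = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

bce = nn.BCEWithLogitsLoss()
opt_pred = torch.optim.Adam(predictor.parameters(), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
lam = 1.0  # assumed weight on the fairness penalty

def training_step(x, y, protected):
    # x: (batch, 16) features; y, protected: (batch, 1) float tensors of 0s and 1s.
    # 1) Update the adversary to predict the protected attribute from the model's output.
    with torch.no_grad():
        y_hat = predictor(x)
    adv_loss = bce(adversary(y_hat), protected)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Update the predictor on its task while penalizing adversary success, which
    #    pushes its outputs to carry less information about the protected attribute.
    y_hat = predictor(x)
    pred_loss = bce(y_hat, y) - lam * bce(adversary(y_hat), protected)
    opt_pred.zero_grad(); pred_loss.backward(); opt_pred.step()
```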

Problems with technical approaches. However, technical methods fall short
of addressing the social consequences of unfairness. They fail to adjust to sociocul-
tural contexts and struggle to combat biases inherited from training data. “Fairness
through unawareness” aims to remove protected characteristics like gender and race
from datasets to prevent sexism and racism, but often fails in practice because this
data is embedded in correlates. A focus on narrow measures can ignore other rel-
evant considerations, and measures are often subject to proxy gaming 3.3. A more
in-depth, qualitative, and socially grounded approach is often harder and does not
scale as easily as technical methods, but it is still essential for navigating concerns in
AI fairness.

Engineering strategies for reducing bias must be paired with non-
technical strategies. Ultimately, technical debiasing alone is insufficient. Social
processes are crucial as humans adapt to AI. We speak about this at length in sec-
tion 4.5, but here we will only mention a few ideas. For instance, early bias detection
involves creating checks to identify risks of bias before the AI system is deployed
or even trained, so that models that have discriminatory outputs can be rejected
before they cause harm. Similarly, gradual deployment safely transitions AI systems
into use while monitoring them for bias so that harms can be identified early and
reversed. Regulatory changes can require effective mitigation strategies by law, man-
dating transparency and risk mitigation in safety-critical AI systems, as we discuss
in chapter 8.

Other approaches. Other approaches emphasize that unfairness is tied to sys-
temic social injustices propagated through technical systems. They highlight political,
economic, and cultural factors and apply methods such as anti-discrimination policy,
legal reform, and a design process focused on values and human impacts. These meth-
ods, which include policies like developing AI systems with input from stakeholders,
can surface and mitigate sources of unfairness early. Substantive social changes are
generally more expensive and difficult than technical approaches. However, they can
be more impactful, reducing models’ negative social impacts.

Participatory design can mitigate bias in AI development. An important
non-technical strategy for bias reduction is stakeholder engagement, or deeply en-
gaging impacted groups in the design of the AI system to identify potential biases
proactively. Diverse teams and users can also help engineering teams incorporate
diverse perspectives into the R&D process of AI models to anticipate potential bi-
ases proactively. One approach to proactively addressing bias is participatory design,
which aims to include those affected by a developing technology as partners in the
design process to ensure that the final product meets diverse human interests. For
example, before implementing an AI notetaking assistant for doctors, participatory
design can require hospitals to improve the system based on feedback from all stake-
holders during iterative design sessions with nurses, doctors, and patients. Rather
than just evaluating ML models on test sets, developers should consult with the af-
fected groups during the design process. Adding oversight mechanisms for rejecting
models with discriminatory outputs can also enable catching biases before AI models
affect real decisions.

Independent audits are important for identifying biases in AI systems
before deployment. Auditors can systematically evaluate datasets, models, and
outputs to uncover discrimination and hold the developers of AI systems account-
able. There are several signs of bias to look for when auditing datasets. For example,
auditors can flag missing data for certain subgroups, which indicates underrepre-
sentation. Data skew, where certain groups are misrepresented compared to their
real-world prevalence, is another sign of bias. Patterns and correlations with pro-
tected classes could indicate illegal biases. Auditors can also check for disparities in
the model outputs. By auditing throughout the process, developers can catch biases
early, improving data and models before their biases propagate. Rather than waiting
until after the system has harmful impacts, meticulous audits should be integrated as
part of the engineering and design process for AI systems [319]. Audits are especially
effective when conducted independently by organizations without a stake in the AI
system’s development, allowing for impartial and rigorous auditing throughout the
process.

Effective model evaluation is a crucial way to reduce bias. An important
part of mitigating bias is proactively evaluating AI systems by analyzing their out-
puts for biases. Models can be tested by measuring performance metrics such as false
positive and false negative rates separately for each subgroup. For instance, signifi-
cant performance disparities between groups like men and women can reveal unfair
biases. Ongoing monitoring across demographics is necessary to detect unintended
discrimination before AI systems negatively impact people’s lives. Without rigorous
evaluation of model outputs, harmful biases can easily go unnoticed.

Reducing toxicity in data aims to mitigate harmful biases in AI, but
faces challenges. Toxicity refers to harmful content, such as inflammatory com-
ments or hate speech. Models trained on unfiltered text can absorb these harmful
elements. As a result, AI models can propagate toxic content and biases if not care-
fully designed. For example, language models can capture stereotypical and harmful
associations between social groups and negative attributes based on the frequency of
words occurring together. Reducing toxicity in the training data can mitigate some
biases. For example, developers can use toxicity classifiers to clean up the internet
data, using both sentiment and manual labels to identify toxic content. However, all
of these approaches still run into major challenges and limitations. Classifiers are still
subject to social bias, evaluations can be brittle and unreliable, and bias is often very
hard to measure.
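
As a rough illustration of this kind of data filtering, the sketch below assumes some toxicity classifier `toxicity_model` that returns a score in [0, 1]; the threshold and interface are hypothetical.

```python
def filter_corpus(documents, toxicity_model, threshold=0.8):
    """Split a corpus into kept and dropped documents using a toxicity classifier."""
    kept, dropped = [], []
    for doc in documents:
        score = toxicity_model(doc)  # higher score = more likely toxic
        (dropped if score >= threshold else kept).append(doc)
    # Keeping the dropped set makes it possible to audit what the filter removes,
    # which matters because toxicity classifiers carry social biases of their own.
    return kept, dropped
```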

Trade-offs can emerge between correcting one form of bias and introduc-
ing new biases. Bias reduction methods can themselves introduce new biases, since the
classifiers they rely on carry social biases of their own and evaluations of bias are often
unreliable. For example, some experiments show that an attempt to correct for toxicity
in OpenAI’s older content moderation system resulted in biased treatment toward
certain political and demographic groups: a previous system classified negative com-
ments about conservatives as not hateful, while flagging the exact same comments
about liberals as hateful [320]. It also exhibited disparities in classifying negative
comments toward different nationalities, religions, identities, and more.

Conclusion

We have discussed some of the sources of bias in AI systems, including problems with
training data, data collection processes, training environments, and flawed objectives
that AI systems optimize. Human interactions with AI systems, such as automation
bias and confirmation bias, can introduce additional biases.
We can clarify which types of bias or unfairness we wish to avoid using mathematical
definitions such as statistical parity, equalized odds, and calibration. However, there
are inherent tensions and trade-offs between different notions of fairness. There is
also disagreement between stakeholders’ intuitions about what constitutes fairness.
Technical approaches to debiasing, including evaluation of predictive models and adversarial
testing, are useful tools for identifying and removing biases. However, improving the fairness of
AI systems requires broader sociotechnical solutions such as participatory design,
independent audits, stakeholder engagement, and gradual deployment and monitoring
of AI systems.

6.4 THE ECONOMIC ENGINE

What if we allow the economy to decide what AIs will be like? Unlike
some prior technological breakthroughs (such as the development of nuclear energy
and nuclear weapons), most investment in AI today is coming from businesses. Lead-
ing commercial AI developers have acquired the enormous computational resources
required to train state-of-the-art systems and have hired many of the world’s best
researchers. We could therefore argue that AI development is most closely aligned
with business or economic goals such as wealth maximization. Many believe this is
good, arguing that the development of AIs should be guided by market forces. If
AI can accelerate economic growth, provided we can also ensure a fair distribution
of costs and benefits of AI across society, this could be positive for people’s welfare
around the globe. We examine the specific impacts of AI on economic growth and its
distributional effects in the Governance chapter.
Here, we consider the broader attractions and limitations of allowing economic growth
to be the main force determining how AI systems are developed. As part of this
discussion, we will briefly introduce and explain a few basic concepts from economics,
such as market externalities, which provide an essential foundation for our analysis.
We argue that while economic incentives can be powerful forces for prosperity and
innovation, they do not adequately capture many important values. AI systems that
are primarily created in order to maximize growth and profit for their developers
could have a range of harmful side effects. Alternative goals are discussed further in
the following sections.

6.4.1 Allocative Efficiency of Free Markets


Competitive economic markets can drive efficiency and foster collective prosperity.
Historically, market forces have led to specialization at a vast scale and worldwide
competition, permitting the production of better goods and services at lower prices.
The global economy is a complex system that allows for mass coordination. Prices,
for instance, help producers and consumers find efficient trading equilibria in the
face of dynamic market conditions. One can speculate that the integration of AI into
the economic engine could further optimize production and enhance competition,
ultimately contributing to accelerated economic growth and improved wellbeing.
Under the right conditions, the free market can also create allocative
efficiency. The First Fundamental Theorem of Welfare Economics states that—
subject to some strong assumptions—an equilibrium allocation of goods reached by
trading on a free market must be Pareto efficient. An outcome is Pareto efficient if
there is no way to make anyone better off without making someone else worse off: any
change must be neutral or trade off one person’s welfare against another. Suppose the
allocation of goods were not Pareto efficient, and two individuals could both be made
better off. Then, the individuals would trade in a way that exploited the possible
Pareto improvement to their mutual benefit.
This supports Adam Smith’s famous “Invisible Hand” argument, which suggests that
when individuals pursue their self-interest within a market, they unintentionally con-
tribute to social welfare. By creating gains from trade, there is an overall improvement
in living standards for everyone.
There are many conditions required for the First Fundamental Theorem
of Welfare Economics to hold:
1. There must be an open market. There should be no barriers to entry for
producers or buyers, so that everyone can participate in this market. This openness
stimulates competition, promoting economic efficiency.
2. No seller should be big enough to move prices up alone. If there are
many sellers, then anyone who raises prices will lose consumers. There must be
no monopoly power: anyone with the ability to raise prices without being forced
down by competition will create distortions that leave consumers worse off.
3. No producer should privately hold a pivotal technology. This means that
other producers should be able to copy the production of the first mover. While
the first mover will make profits in the short run, the market allocation will be
Pareto optimal in the long run.
4. No buyer should be big enough to move prices down alone. Similar to
the condition for producers, no buyer should be big enough to force producers
to take lower prices than others would offer for them. Such buyers would create
distortions that might, for instance, force producers out of business, ultimately
harming everyone.
5. There must be perfect information for everyone. Producers and consumers
must have access to perfect information, such as about product quality and pricing.
If consumers don’t know, for instance, that a seller’s product is defective or that
other sellers are offering lower prices, then markets cannot achieve efficiency.
6. There must be no externalities in consumption. When one consumes a
good, there must be no effect—positive or negative—on anyone else. Second-hand
smoke has a negative externality on anyone nearby but, in a free market, this is
unaccounted for in the price of cigarettes. As a result, the price does not reflect
the total social cost of a cigarette: from a social wellbeing perspective, these are
priced too low, creating inefficiency.
7. Preferences must be non-satiated. The first welfare theorem does not hold for
all types of preferences; one technical restriction is that consumers should always
prefer more of at least one good. If someone gains no further value from extra
consumption, their trading equilibria may not be Pareto efficient.
8. The state must enforce property and contract laws. Most economic theory
assumes that contracts are enforceable, and that individuals and corporations have
protected property rights. Without these, trading would be difficult to achieve
and much more costly, making it more difficult for everyone to achieve optimal
outcomes.

Given the stringency of these conditions, it is obvious that they will not always hold
in practice and that there may be market failures. Unregulated markets do not al-
ways create efficient outcomes: instead, unregulated markets often see informational
asymmetries, market concentration, and externalities. Unfortunately, AIs may ex-
acerbate these market failures and increase income and wealth inequality, creating
disproportionate gains for the wealthy individuals and firms that own these systems
while decreasing job opportunities for others.

6.4.2 Market Failures


In this section, we consider three common types of market failures that are especially
relevant in the context of AI: information asymmetries, oligopolies, and externalities. We
also discuss the idea of moral hazard, which applies to the development of AI systems
that generate profit for their owners but could prove harmful to others.

Informational Asymmetries

Information known to only some can create market failures. Informa-
tion asymmetry captures the idea that buyers and sellers have different informa-
tion regarding the product they are trading. For instance, sellers are aware of the
product's quality and specifications, while buyers know their own true willingness to pay
for the product. Information asymmetry isn’t inherently problematic—it is, in fact,
often a positive aspect of market dynamics. We trust specialists to provide valuable
services in their respective fields; for instance, we rely on our mechanics to know more
about our car’s inner workings than we do.
However, issues arise when information asymmetry leads to adverse selection. This
occurs when sellers withhold information about product quality, leading buyers to
suspect that only low-quality products are available. For example, in the used car
market, a dealer might hide the fact that a car’s axle is rapidly wearing out. Potential
buyers, aware of this possibility but unable to distinguish good cars from bad, may
only be willing to pay a low price that reflects the risk of buying a defective car. As a
result, high-quality cars are driven out of the market, leaving only low-quality cars for
sale. This leads to a market failure in which high-quality goods are never traded,
resulting in inefficiency [321]. Additionally, buyers who lack this crucial information
might think that some cars are high-quality and pay accordingly, buying defective
cars for high prices.

AIs may exacerbate informational asymmetries. AI holds the power to
both create and exploit information asymmetries in unprecedented ways. This ca-
pacity can be employed in beneficial ways, like providing highly personalized services
or excellent recommendations. However, it can also be misused, leading to situations
in which those using AIs can manipulate consumers. AI-powered analytics allows
companies to create sophisticated profiles of consumers that can uncover deep in-
sights into individual personalities and behaviors [322]. Big tech companies already
use social media and device activity to understand an individual’s preferences and
vulnerabilities better than ever before. This knowledge can be used to shape targeted
advertisements or manipulations that are more effective. While these are not market
failures in the technical sense, they can lead to exploitation and other undesirable
outcomes.
Though AI may magnify the potential for information asymmetries, such strategies
have long-standing precedents in non-AI contexts. Predatory lending is a common
practice where lenders, often equipped with more information than borrowers, use
deceptive practices to encourage individuals into accepting unfair loan terms. These
tactics tend to target lower-income and less-educated individuals who might not have
the time or background to understand the fine print of what they're signing, or the
resources to find legal counsel. AIs can further increase the power imbalance; for
instance, AIs might be used to predict who is most likely to accept these unfair
loan terms based on their digital behavior, leading to even more targeted predatory
lending. AIs can both amplify existing issues and present new challenges.

Oligopolies

Markets can tend toward concentration, leaving consumers worse off.
While competition can create productive efficiency, some firms can avoid competitive
pressures; for instance, utilities that are the exclusive provider of power or gas in a
region might be able to raise their prices without losing many consumers to competi-
tors due to lack of alternatives. When a single company or a small group of companies
has a high level of control over a market, consumers are often left with limited product
options and high prices. Markets in which a small number of companies are the only
available suppliers are sometimes described as oligopolies. To preserve the benefits of
competition, governments implement regulations such as antitrust laws, which they
can use to limit market concentration by preventing mergers between companies that
would give them excessive market power.

High initial investment requirements can impede newcomers from en-
tering the market. Historically, first-movers in capital-intensive industries have
a competitive advantage, such as rail companies that own large quantities of railway
networks. In such industries, it is difficult for other firms to enter the market due
to high up-front costs; in many cases, an existing firm can block other firms from
entering, such as by keeping prices below a sustainable level. AI development might
be similar—at the very least, firms within it have additional power that allows them to
avoid many competitive pressures.
Developing a large model requires paying substantial fixed costs up-front, including
expenses for computing power and datasets essential for training. However, once these
initial investments are made, the subsequent cost per user for deploying and main-
taining these models is considerably lower. Since the average cost per unit decreases
as the number of customers increases, it becomes substantially more cost-effective for
a single AI company to provide access to many people than for multiple companies
to independently develop and maintain similar models [323].
Additionally, only a few companies might have access to enough resources and data
to create the best AI models. In the Artificial Intelligence Fundamentals chapter, we
discussed how scaling laws demonstrate that improving AI performance has required
access to costly resources like high-performance processors and vast, high-quality
datasets. High resource requirements for developing advanced AI can limit market
entry, stifling competition. Early capability advantages for a leading firm can have
positive feedback loops, which could enable them to raise more capital and pull ahead
of many competitors. This raises concerns that only a few powerful entities will have
access to and benefit from AI technologies.

The use of AI might create or strengthen oligopolies. AI developers are
striving to have their models achieve superintelligence: models that are able to carry
out a wide range of tasks across various domains better than humans. If someone
did manage to create such an AI, they might have a decisive advantage across large
swathes of industry, being sufficiently versatile to become an expert in many or every
market. It may become difficult or impossible for smaller firms to carve out niche mar-
ket spaces. The advanced capabilities of general AI may outpace specialized models
in diverse domains, making it more difficult for new entrants to gain market share.
Large firms equipped with powerful AI systems could wield an enormous amount of
power, potentially leading to less competition, higher prices, and slower innovation,
hurting both labor and product markets.

Externalities

Externalities are consequences of economic activity that impact unre-
lated third parties. An externality is a side effect of an economic activity, im-
pacting individuals or groups who are not directly involved in that activity. Because
of externalities, market prices for goods or services may not fully reflect the costs
that third parties, who are neither the consumers nor the producers of that market,
bear as a result of the economic activity.
A classic example of a negative externality—a harm to a third party—is pollution.
Consider the Sriracha factory in Irwindale, California, where jalapeño peppers are
ground and roasted. Residents of Irwindale claimed that odors from the factory
caused lung and eye irritation and created an overall unpleasant smell in the town.
The factory, by producing these odors, was imposing a negative externality upon the
town’s residents. However, since the townspeople received no automatic compensation
for this inconvenience, this was not reflected in the price of Sriracha.

We can resolve externalities with litigation, property rights, and taxa-
tion. In 2013, locals sued the Sriracha factory: this legal action led the factory to
install new filters to reduce pollution. Litigation can be an effective tool to resolve
externalities by forcing compensation. Economic theory suggests that bargaining over
the externality can also create efficient outcomes; for instance, if the property right
to the air were understood to belong to the townspeople, then the factory would have to
stop polluting or compensate them at an acceptable rate for their inconvenience [324].
A third commonly used resolution to externalities is taxation. Smoking cigarettes im-
poses negative externalities on those near the smoker; governments will thus impose
taxes on the sale of tobacco, which both raise revenues for the state to run social
programs and discourage smoking by increasing the price of cigarettes to better re-
flect its true total cost. Determining the most effective method for resolving each
externality is a topic of ongoing debate among economists. These debates extend to
the externalities of AI on society. Potential policy responses are further discussed in
the Governance chapter.
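
As a toy illustration of the taxation approach, with made-up numbers: a per-unit tax equal to the external harm makes the buyer face the full social cost of the good.

```python
# Illustrative Pigouvian-tax arithmetic; all values are assumptions.
private_cost = 8     # market price per pack of cigarettes without the tax
external_harm = 3    # assumed harm per pack imposed on bystanders

social_cost = private_cost + external_harm   # what the pack really costs society
tax = external_harm                          # tax set equal to the external harm
price_with_tax = private_cost + tax

print(price_with_tax == social_cost)  # True: the externality is "internalized"
```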

The development and deployment of AI systems can lead to negative
externalities. AI’s effect on the environment has received a lot of attention [e.g.
325]. Training advanced AI models requires vast computational resources, consuming
a significant amount of energy and contributing to greenhouse gas emissions. Emis-
sions can lead to climate change, a cost borne by society at large, rather than only by
those who pollute—unless the externality is corrected, such as by charging compa-
nies for the carbon they emit. Beyond emissions, several other issues such as worker
displacement, the potential for misuse or accidents, and the risk of the loss of control
of advanced systems present serious externalities as well.

Moral hazards
Moral hazards occur when risks are externalized. Moral hazards are situa-
tions where an entity is encouraged to take on risks, knowing that any costs will be
borne by another party. Insurance policies are a classic example: people with damage
insurance on their phones might handle them less carefully, secure in the knowledge
that any repair costs will be absorbed by the insurance company, not them.
The bankruptcy system ensures that no matter how much a company damages society,
the biggest risk it faces is its own dissolution, provided it violates no laws. Companies
may rationally gamble to impose very large risks on the rest of society, knowing that
if those risks ever come back to the company, the worst case is the company going
under. The company will never bear the full cost of damage caused to society due
to its risk taking. Sometimes, the government may step in even prior to bankruptcy.
For example, leading American banks took on large risks in the lead up to the 2008
financial crisis, but many of them were considered “too big to fail,” leading to an
expectation that the government would bail them out in time of need [326]. These
dynamics ultimately contributed to the Great Recession.
Developing advanced AIs is a moral hazard. In the first chapter, we outlined
severe risks to society from advanced AIs. However, while the potential costs to society
are immense, the maximum financial downside to a tech company developing these
AIs is filing for bankruptcy.
Consider the following, admittedly extreme, scenario. Suppose that a company is on
the cusp of inventing an AI system that would boost its profits by a thousand-fold,
making every employee a thousand times richer. However, the company estimates
that their invention comes with a 0.1% chance of a catastrophic accident leading to
large-scale loss of life. In the likely case, the average person would see some benefits
due to increased productivity in the economy, and possibly from wealth
redistribution. Still, most people view this gamble as irrational, preferring not to risk
catastrophe for modest economic improvements. On the other hand, the company
may see this as a worthwhile gamble, as it would make each employee considerably
richer.
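
A back-of-the-envelope calculation with purely hypothetical numbers shows why the firm's incentives diverge from society's: the firm's downside is capped at losing itself, while the catastrophic downside falls on third parties.

```python
# All figures are hypothetical; large-scale loss of life cannot be meaningfully priced,
# so the point is only that the firm's downside is capped while society's is not.
p = 0.001                          # assumed probability of catastrophe
firm_value = 10e9                  # the most the firm can lose: itself, via bankruptcy
firm_upside = 1_000 * firm_value   # the thousand-fold payoff if the gamble succeeds

firm_expected_gain = (1 - p) * firm_upside - p * firm_value
print(f"Firm's expected gain: ${firm_expected_gain / 1e9:,.0f}B")  # roughly $9,990B

# Society's worst case is borne by third parties and far exceeds anything the firm
# can be made to pay -- exactly the moral-hazard structure described above.
```
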
Risk internalization encourages safer behavior. In the above examples of
moral hazards, companies take risks that would more greatly affect external parties
than themselves. The converse of this is risk internalization, where risks are primarily
borne by the party that takes them. Risk internalization compels the risk-taker to
exercise caution, knowing that they would directly suffer the consequences of reckless
behavior. If AI companies bear the risk of their actions, they would be more incen-
tivized to invest in safety research, take measures to prevent malicious use, and be
generally disincentivized from creating potentially dangerous systems.

6.4.3 Inequality
Most of the world exhibits high levels of inequality. In economics, inequal-
ity refers to the uneven distribution of economic resources, including income and
wealth. The Gini coefficient is a commonly used statistical measure of the distribu-
tion of income within a country. It is a number between 0 and 1, where 0 represents
perfect equality (everyone has the same income or wealth), and 1 signifies maximum
inequality (one person has all the income, and everyone else has none). Looking at
Gini coefficients, 71% of the world’s population lives in countries with increasing
inequality over the last thirty years [327].
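
For concreteness, here is a small sketch of how the Gini coefficient can be computed from a list of incomes, using the mean-absolute-difference formulation.

```python
import numpy as np

def gini(incomes):
    """Gini coefficient: 0 = perfect equality, approaching 1 = maximal inequality.

    Uses the mean-absolute-difference form: G = mean(|x_i - x_j|) / (2 * mean(x)).
    """
    x = np.asarray(incomes, dtype=float)
    mean_abs_diff = np.abs(x[:, None] - x[None, :]).mean()
    return mean_abs_diff / (2 * x.mean())

print(gini([10, 10, 10, 10]))  # 0.0  -- everyone earns the same
print(gini([0, 0, 0, 100]))    # 0.75 -- one person holds all the income
```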

Inequality in the United States is particularly striking. Figure 6.3 shows
that the Gini coefficient in the US has trended significantly upwards from 1969 to
2019. Over 50 years, the pre-tax Gini coefficient has increased by 30%, while the post-
tax Gini coefficient has risen by 25%, suggesting that despite redistributive taxation
policies, the US income gap has widened substantially. (For reference, this change
in Gini coefficient is the same size as moving from Canada to Saudi Arabia today.
[327]) This increase in the Gini coefficient is evidence of a growing inequality crisis.
An associated fall in social mobility—the ability of an individual to move from the
bottom income bracket to the top—cements inequalities over generations.

[Figure 6.3: line chart of the change in the US Gini coefficient since 1969, before tax and after tax.]

Figure 6.3. Inequality in the US (as measured by the Gini coefficient) has risen dramatically
over the last five decades, even adjusting for taxation [328].

The distribution of gains from growth is highly unequal. Nearly all the
wealth gains over the past five decades have been captured by the top 1% of income
earners, while average inflation-adjusted wages have barely increased [329]. A RAND
Corporation working paper estimated how the US income distribution would look
today if inequality were at the same level as in 1975—the results are in Table 6.1.
Suppose my annual income is $15,000 today. If inequality were at the same level as
in 1975, I would be paid an extra $5,000. Someone else earning $65,000 today would
instead have been paid $100,000 had inequality held constant! We can see in Table
6.1 that these increases in inequality have had massive effects on individual incomes
for everyone outside the top 1%.

TABLE 6.1 Real and counterfactual income distributions for all adults with income, in 2018
USD [330].

Percentile     Actual Income in 1975   Actual Income in 2018   Income in 2018 if Inequality Had Stayed Constant
25th %         $9,000                  $15,000                 $20,000
Median         $26,000                 $36,000                 $57,000
75th %         $46,000                 $65,000                 $100,000
90th %         $65,000                 $112,000                $142,000
95th %         $80,000                 $164,000                $174,000
99th %         $162,000                $491,000                $353,000
Top 1% Mean    $252,000                $1,160,000              $549,000

Inequality carries serious implications. Beyond obvious problems like an in-
ability to access essentials and maintain a basic standard of living, those worst off
in an unequal society face additional problems in domains like health. A widening
wealth gap often corresponds with a health gap, where those with fewer resources
have poorer health outcomes due to less access to quality healthcare, lower quality
nutrition, and higher stress levels. For instance, life expectancy often varies dramat-
ically based on income in unequal societies [331].
Everyone, not just the poorest, suffers in an unequal society. One of the
most robust findings about inequality is that unequal societies have higher levels of
crime [332]. When inequality is high, so too are levels of social tension, dissatisfac-
tion, and shame, which can contribute to higher crime rates. This can lead to a cycle
where the fear of crime drives further inequality, as wealthier individuals and neigh-
borhoods invest in measures that segregate them further from the rest of society,
further increasing inequality and crime rates. A more detailed discussion of this rela-
tive deprivation can be found in the section on Cooperation and Conflict. Inequality
is also related to other signs of societal sickness: worse physical and mental health,
increased drug use, and higher rates of incarceration. Strikingly, inequality is a strong
predictor of political instability and violence as well. While it may seem like those
with more wealth are insulated from the negative effects of inequality, they suffer
indirect consequences as well.

Figure 6.4. Countries with higher income inequality tend to have higher homicide rates [333,
334].

Figure 6.5. Countries with higher income inequality tend to have higher rates of political
instability [333, 335].

The causes of inequality are disputed. Renowned economist Thomas Piketty,
in his book "Capital in the Twenty-First Century," posits that inequality is a consequence
of the difference between the interest rate investments in capital receive (r) and the
rate of economic growth (g) [336]. Piketty’s key unorthodox claim is that capital is
a “gross substitute” for labor. On this view, as capital owners generate more wealth
and capital, they are increasingly able to use capital to replace labor—imagine this as
robots outcompeting human workers. Piketty argues that if the interest rate, which
is the rate at which capital (or "robots") can self-replicate, is greater than
the overall rate of economic growth, and thus faster than labor’s wage growth, then
capital owners can become substantially and continuously richer than workers. This
contrasts with the standard view that capital is a “gross complement” to labor, with
both labor and capital needed to produce goods. According to the standard view,
increases in capital lead to higher labor productivity, which makes workers more
efficient and valuable. Increased productivity raises wages; thus, increases in capital
benefit both workers and capital owners.
If AI serves as a gross substitute for labor, investment in AIs, with an effective interest
rate (r) higher than the overall growth rate (g), will permit capital owners to continue
accumulating capital, outcompeting workers and increasing inequality. While on the
standard view, this would be a fundamentally new phenomenon, Piketty would argue
that this is the exacerbation of a centuries-old trend. Such a scenario would contribute
to growing inequality and negatively impact the livelihoods of workers. Issues of
automation through AI and its broader societal consequences are discussed further
in the Governance chapter.
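
A toy calculation illustrates this dynamic with assumed rates of return and growth: if capital income compounds at 5% per year while wages grow at 2%, capital owners pull far ahead of workers within a few decades.

```python
# Assumed rates, chosen only to illustrate the r > g mechanism.
r, g, years = 0.05, 0.02, 50   # return on capital vs. wage/overall growth

capital_income, wages = 100.0, 100.0
for _ in range(years):
    capital_income *= 1 + r
    wages *= 1 + g

print(round(capital_income / wages, 2))  # ~4.26: capital income pulls away from wages
```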

6.4.4 Growth

Growth is widely considered essential to a healthy society. Some have
claimed that societies should place less importance on economic growth, whether to
reduce environmental impacts or for other reasons. However, looking at the world over
the last 200 years provides strong evidence that economic growth can lead to vast
improvements in human welfare in the domains of health, education, and more.

Figure 6.6. Increases in GDP per capita strongly correlate with increases in life expectancy
[337].

We have strong reasons for encouraging economic growth. The Preston
curve is compelling evidence of a positive correlation between a country’s gross
domestic product (GDP) per capita—a measure of the total production of goods
and services—and health outcomes. This relationship illustrates that countries with
higher GDP per capita have better health outcomes. The shape of the relationship
indicates that poorer countries, in particular, stand to benefit immensely from im-
provements in GDP. The positive correlation also holds across time: average global
life expectancy was just below 40 years at the start of the 20th century, whereas
today, with a much higher global GDP per capita, the average person expects to
live for 70 years. Nobel laureate Amartya Sen suggests that one pathway through
which growth improves health is by reducing poverty and increasing investments in
healthcare [338].

Growth makes it easier for societies to support a wide range of values.
The benefits of economic growth extend beyond physical health. There is a strong as-
sociation between economic prosperity and enhancements in freedom and education
[339]. Prosperous societies can afford stronger institutions to safeguard democratic
freedoms and human rights. As societies become wealthier, more resources can be
allocated to support cultural institutions, artists, and creative endeavors. Economic
growth can enable Pareto improvements: changes that improve people’s lives without
leaving anyone worse off, like growing a pie to give everyone bigger slices. Instead of
debating whether to spend limited resources on a new hospital, school, or cultural
center, encouraging economic growth can allow us to build all three. The increased
wealth effectively allows us to accommodate multiple values, protecting value plural-
ism.

6.4.5 Beyond Economic Models

However, in measuring social wellbeing, we must recognize some shortcomings of
traditional economic metrics. Here, we will discuss the disconnect between economic
output and social value, why relying on models from welfare economics can
be inadequate to describe human goals, and how more holistic measures of happiness
and economic prosperity may give us a clearer sense of true social wellbeing.

Economic Output and Gross Domestic Product

Economic indicators measure what we can quantify, not necessarily what
we care about. Indicators like Gross Domestic Product (GDP) measure the mon-
etary value of final goods and services—that is, those that are bought by the final
user—produced in a country in a given period of time. When economists discuss
growth, they are typically referring to increases in GDP. Measures of productive out-
put like GDP are useful in gauging a country’s economic health, but they fail to
capture the value of socially significant activities that aren’t priced in the market.

Some socially important tasks are not captured by GDP. Many essential
roles in society, such as parenthood, community service, and early education, are
crucial to the wellbeing of individuals and communities but are often undervalued or
entirely overlooked in GDP calculations [340]. While their effects might be captured—
education, for instance, will increase productivity, which increases GDP—the initial
activity does not count toward total output. The reason is simple: GDP only accounts
for activities that have a market price. Consequently, efforts expended in these socially
important tasks, despite their high intrinsic value, are not reflected in the GDP
figures.

Technologies that make our lives better may not be measured either.
Technological advancements and their value often fail to be reflected adequately in
GDP figures. For instance, numerous open-source projects such as Wikipedia provide
knowledge to internet users worldwide at no cost. However, because there’s no direct
monetary transaction involved, the immense value they offer isn’t represented in GDP.
The same applies to user-generated content on platforms like YouTube, where the
main contribution to GDP is through advertisement revenue because most creators
aren’t compensated directly for the value they create. The value viewers derive from
such platforms vastly outstrips the revenue generated from ads or sales on these
platforms, but this is not reflected in GDP.

There might be a similar disconnect between GDP and the social value
of AI. As AI systems become more integrated into our daily lives, the disconnect
between GDP and social value might become more pronounced. For example, an
AI system that provides free education resources may significantly improve learn-
ing outcomes for millions, but its direct contribution would be largely invisible in
GDP terms. Conversely, an AI may substantially increase GDP by facilitating high-
frequency trading without doing much to increase social wellbeing. This growing
chasm between economic metrics and real value could lead to policy decisions that
fail to harness the full potential of AI or inadvertently hamper its beneficial applica-
tions.

The proxy-purpose distinction is especially important for AI. Imagine a
future where an AI system is tasked with maximizing GDP, often seen as a proxy for
wellbeing. The system could potentially achieve this goal by promoting resource-
intensive industries or fostering a work culture that prioritizes productivity over
wellbeing. In such a scenario, the GDP might increase, but at the cost of essen-
tial considerations like environmental sustainability or human happiness. Therefore,
relying solely on economic indicators could lead to decisions that, while effective in
the short term, might harm society in the long run.
Understanding the limitations of economic measurements is an important step to-
ward a safer use of AI. It helps us question what we should optimize. The aim of
economic policy should not just be to maximize economic output but also to promote
overall social wellbeing. AI systems may thus need to take into account multiple fac-
tors, like equality, sustainability, and personal fulfillment, rather than only economic
indicators.

Models of Welfare Economics

The most basic form of welfare economics maximizes social surplus.
Social surplus is a measure of the total value created by a market: it is the sum of the
consumer surplus and the producer surplus. The consumer surplus is the difference
between the maximum price a consumer is willing to pay and the actual market price.
Conversely, the producer surplus is the difference between the actual market price
and the minimum price a producer is willing to accept for a product or service. By
maximizing the total surplus, welfare economics seeks to maximize the social value
created by a market.
As an example, imagine a scenario where consumers are willing to pay up to $20 for
a book, but the market price is only $15. Here, the consumer surplus is $5. Similarly,
if a producer would be willing to sell the book for a minimum of $10, the producer
surplus is $5. The social surplus, and hence the social value in this market, is $10: $5
consumer surplus plus $5 producer surplus.
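
The same arithmetic in code, for concreteness, using the values from the book example above:

```python
willingness_to_pay = 20   # the most the consumer would pay for the book
market_price = 15
seller_minimum = 10       # the least the producer would accept

consumer_surplus = willingness_to_pay - market_price   # 5
producer_surplus = market_price - seller_minimum       # 5
social_surplus = consumer_surplus + producer_surplus   # 10

# Perfect price discrimination (charging each consumer their maximum) would shift the
# $5 of consumer surplus to the producer while leaving the $10 total unchanged.
```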

We should be wary of generalizing from simple models. The core model of
welfare economics has its limitations. Notably, welfare economics is concerned with
the maximization of surplus, but is indifferent to its distribution. This might not
align with common notions of fairness and equality. For example, an AI optimized to
maximize profits might model consumers well enough to enable perfect price discrimi-
nation: allowing firms to sell each good at exactly a consumer’s maximum willingness
to pay, converting all the consumer surplus into producer surplus, but leaving the
sum total of “social surplus” unchanged.
However, social surplus is not the only thing we care about. Since people tend to
derive less utility from a marginal dollar when they are richer, we care about how
rich consumers and producers are to begin with. If, as is usually the case, consumers
are poorer than the owners of a company, then transferring $5 (of surplus) from
consumers to producers decreases total utility. The real gains for the firm may be
small compared to the losses for the consumers.
As AI systems become more integral to our economies, we must be mindful of these
complexities. A narrow focus on maximizing economic surplus could lead us to de-
ploy AIs which, while efficient in a purely economic sense, might have harmful con-
sequences for society.

Happiness in Economics

There is a gap between human happiness and material prosperity. Most
people would agree that the goal of social sciences should not be to just increase
material wealth. A more meaningful aim would be to enhance overall wellbeing or
happiness. However, defining and measuring happiness can be challenging. Whether
happiness is correlated with material wealth remains an ongoing research question;
other aspects of life like physical and mental health, job satisfaction, social connec-
tions, and a sense of purpose seem important as well.

Debates about the Easterlin paradox highlight the complexity of under-
standing happiness. While wealthier people and countries are generally happier
than their less affluent counterparts, long-term economic growth does not always cor-
relate with long-term increases in happiness: this is the Easterlin paradox. Several
studies have tried to explore the relationship between happiness and economic growth
[341]. While some findings suggest a correlation, others don’t: our understanding of
the happiness-growth relationship is still evolving.
We do, however, have strong evidence that inequality is harmful. People often evaluate
their wellbeing in relation to others; so, when wealth distribution is unequal, people
are dissatisfied and unhappy. For instance, the recent rise in inequality may explain
why there has been no significant increase in happiness in the US over the last few
decades despite an approximately tenfold increase in real GDP and a fourfold increase
in real GDP per capita.

More holistic economic measurements can get closer to capturing what
we value. Due to the disconnect between economic prosperity and true wellbeing,
many economists also use broader metrics. These measures aim to capture wellbeing
more comprehensively, rather than solely focusing on economic growth. One such
measure is the Human Development Index (HDI). The HDI comprises a nation’s
average life expectancy, education level, and Gross National Income (GNI) per capita
(which is similar to GDP). Notably, the UN uses the logarithm of GNI per capita in
the HDI calculation, which accounts for the diminishing returns of wealth: the idea
that each additional dollar earned contributes less to a person’s happiness than the
one before it. In general, economists consider a “report card” of indicators to assess
a nation’s wellbeing, rather than just depending on a single measure. By capturing
various aspects of wellbeing, this approach could provide a more holistic and accurate
representation of a nation’s quality of life.
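
As an illustration, the sketch below computes a simplified HDI-style index. The goalposts are approximately those used in recent UNDP reports and should be treated as assumptions; the point is the structure, a geometric mean of health, education, and log-scaled income indices.

```python
import math

def hdi(life_expectancy, mean_years_school, expected_years_school, gni_per_capita):
    """Simplified HDI-style index; goalposts approximate recent UNDP methodology."""
    clamp = lambda v: max(0.0, min(1.0, v))
    health = clamp((life_expectancy - 20) / (85 - 20))
    education = clamp(((mean_years_school / 15) + (expected_years_school / 18)) / 2)
    # Using log income captures the diminishing returns of wealth described above.
    income = clamp((math.log(gni_per_capita) - math.log(100))
                   / (math.log(75_000) - math.log(100)))
    return (health * education * income) ** (1 / 3)

print(round(hdi(80, 12, 16, 50_000), 3))  # roughly 0.90 for a high-income country
print(round(hdi(60, 5, 9, 3_000), 3))     # roughly 0.51 for a lower-income country
```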

Summary. Traditional economic measures and models are insufficient for mea-
suring and modeling what we care about. There is a disconnect between what we
measure and what we value; for instance, GDP fails to account for essential unpaid
labor and overvalues the production of goods and services that add little to social
wellbeing. While economic models are useful, we must avoid relying too much on
theoretically appealing models and examine the matter of social wellbeing with a
more holistic lens.

Conclusions About the Economic Engine

We should be wary of using AI to increase metrics that are only proxies
for wellbeing. Our current economic system strongly incentivizes the deployment
of AI systems that optimize for economic growth which, while often a worthy goal,
may not capture essential aspects of societal health such as equality or sustainability
(due to unequal distribution of gains, externalities, and other market failures). While
economic objectives like GDP growth are quantifiable and easy to pursue, they may
not truly reflect what makes a society happy and healthy.

It is risky to let the goals of AI systems be determined entirely by eco-
nomic incentives. AI systems created by for-profit businesses are designed to
maximize shareholder profits, not societal wellbeing. If we let the economy decide
what AIs do by letting largely unregulated markets create AIs, we may end up with
an increase in inequality and exacerbation of market failures. Using money as a
proxy for social value might seem practical, but it can distort societal priorities; for
instance, this system implies that the preferences of wealthier individuals hold more
weight since they are willing and able to spend more.
The examples in this section demonstrate a divergence between economic incentives
and other important societal goals and should serve as a cautionary note. In the next
section, we consider an alternative view: a framework that puts happiness front and
center. Perhaps if we can direct AI systems to focus directly on promoting human
happiness, we might aim to overcome the human biases and limitations that stop us
from pursuing our happiness and enable the system to make decisions that have a
positive impact on overall wellbeing.

6.5 WELLBEING

In the next few sections, we will explore how AIs can be used to increase human
wellbeing. We start by asking: what is wellbeing?

Wellbeing can be defined as how well a person’s life is going for them.
It is commonly considered to be intrinsically good, and some think of wellbeing as
the ultimate good. Utilitarianism, for instance, holds some form of wellbeing as the
sole moral good.
There are different accounts of precisely what wellbeing is and how we can evaluate it.
Generally, a person’s wellbeing seems to depend on the extent to which that person
is happy, healthy, and fulfilled. Three common accounts of wellbeing characterize it
as the net balance of pleasure over pain, a collection of objective goods, or preference
satisfaction. Each account is detailed below.

6.5.1 Wellbeing as the Net Balance of Pleasure over Pain

Some philosophers, known as hedonists, argue that wellbeing is the achievement of the
greatest balance of pleasure and happiness over pain and suffering. (For simplicity we
do not distinguish, in this chapter, between “pleasure” and “happiness” or between
“pain” and “suffering”). All else equal, individuals who experience more pleasure have
higher wellbeing and individuals who experience more pain have lower wellbeing.

According to hedonism, pleasure is the only intrinsic good. Goods like
health, knowledge, and love are instrumentally valuable. That is, they are only good
insofar as they lead to pleasure. It may feel as though other activities are intrinsically
valuable. For instance, someone who loves literature may feel that studying classic
works is valuable for its own sake. Yet, if literature lovers were confronted with proof
that reading the classics makes them less happy than they otherwise would be, they
might no longer value studying literature. Hedonists believe that when we think we
value certain activities, we actually value the pleasure they bring us, not the activities
themselves.

Hedonism is a relatively clear and intuitive account of wellbeing. It seems to apply
equally to everyone. That is, while we all may have different preferences and desires,
pleasure seems to be universally valued. However, some philosophers argue that he-
donism is an incomplete account of wellbeing. They argue there may be other factors
that influence wellbeing, such as the pursuit of knowledge.

6.5.2 Wellbeing as a Collection of Objective Goods

Others believe that wellbeing is the achievement of an objective set of “goods” or
“values.” These goods are considered necessary for living a good life regardless of how
pleasurable someone’s life is.

There is disagreement about which goods are necessary for wellbeing.
Commonly proposed goods include pleasure, happiness, health, relationships, knowl-
edge, and more. Objective goods theorists consider these values to be important for
wellbeing independently of individual beliefs and preferences.

One objection to the objective goods theory is that it is elitist. The objec-
tive goods theory claims that some things are good for people even if they derive no
pleasure or satisfaction from them. This claim might seem objectionably paternalis-
tic; for instance, it seems condescending to claim that someone with little regard for
aesthetic appreciation is thereby leading a deficient life. In response, objective goods
theorists might claim that these additional goods do benefit people, but only if those
people do in fact enjoy them.
There is no uncontroversial way to determine which goods are important for living
a good life. However, this uncertainty is not a unique problem for objective goods
theory. It can be difficult for hedonists to explain why happiness is the only value that
is important for wellbeing, for instance. In the following sections we focus primarily
on other interpretations of wellbeing and do not have space to discuss objective goods
theory in depth, particularly given that there are many ways it can be specified.

6.5.3 Wellbeing as Preference Satisfaction

Some philosophers claim that what really matters for wellbeing is that our prefer-
ences are satisfied, even if satisfying preferences does not always lead to pleasurable
experiences. One difficulty for preference-based theories is that there are different
kinds of preferences, and it’s unclear which ones matter. Preferences can be split
into three categories: stated preferences, revealed preferences, and idealized prefer-
ences. If someone expresses a preference for eating healthy but never does, then their
stated preference (eating healthy) diverges from their revealed preference (eating un-
healthy). Suppose they would choose to eat healthy if fully informed of the costs
and benefits: their idealized preferences, then, would be to eat healthy. Each of these
categories can be informative in different contexts: we explore their relevance in the
next section.

A Note on Wellbeing in Social Science


Philosophers continue to debate what wellbeing is, or what it means to live
a good life. However, over the years, researchers have developed ways to ap-
proximate wellbeing for practical purposes. Psychologists, economists, philan-
thropists, policy makers, and other professionals need—at least—a working
definition of wellbeing in order to study, measure, and promote it. Here, we
describe some common views of wellbeing and illustrate how they are used in
social science.
Preference satisfaction. Preference theorists view wellbeing as fulfilling
desires or satisfying preferences, even if doing so does not always induce plea-
sure. Nonetheless, it remains an open question whether all desires are tied to
wellbeing or just certain kinds, like higher-order or informed desires.
Economics. Standard economics often uses preference satisfaction theories
to study wellbeing. Revealed preferences can be observed by studying the
decisions people make. If people desire higher incomes, for example, economists
can promote wellbeing by researching the impact of different economic policies
on gross domestic product (GDP).
Social Surveys and Psychology. Traditionally, psychological surveys evalu-
ate wellbeing in terms of life satisfaction. Life satisfaction is a measure of peo-
ple’s stated preferences—preferences or thoughts that individuals outwardly
express—regarding how their lives are going for them. Life satisfaction sur-
veys typically focus on tangible characteristics of people’s lives, such as fi-
nancial security, health, and personal achievements. They are well suited to
understanding the effects of concrete economic factors, such as income and ed-
ucation, on an individual’s psychological wellbeing. For example, to promote
wellbeing, psychologists might research the effects of access to education on
one’s ability to achieve the goals they set for themselves.
Hedonism. Under hedonist theories of wellbeing, an individual’s wellbeing
is determined by their mental states: in particular, the balance of positive
mental states (like pleasure or happiness) over negative mental states (like
pain or suffering).
Economics. Welfare economics uses hedonism to evaluate the wellbeing of
populations. To estimate gross national happiness (GNH)—an indicator of national welfare—it considers the effects of several factors, ranging from psychological wellbeing to ecological diversity and resilience, on individuals’ mental states.
Welfare economists might prefer this framework because it is more holistic—
it evaluates both material and non-material aspects of everyday life as they
contribute to national welfare.
Social Surveys and Psychology. Many psychologists also use hedonist the-
ories to understand and promote wellbeing. They may work to identify the
emotional correlates of happiness through surveys that measure people’s stated

emotions—unlike life satisfaction surveys, these surveys do not capture mental states in the moment. They reveal emotions that people remember, not emotions they cur-
rently experience. Researchers continue to look for ways to directly observe
emotions as they are experienced. For example, some studies use cell phone
apps to periodically prompt participants to record their current emotions.
Such research tactics may provide us with more precise measures of individu-
als’ overall happiness by evaluating the emotional responses to their everyday
experiences in near real-time.
Objective goods. Under objective goods theories, wellbeing is determined by
a certain number of observable factors, independent of individuals’ preferences
or experiences. There are multiple theories about what those factors may be.
One of the most widely supported theories is human flourishing. Under this
view, wellbeing is more than just the balance of pleasure over suffering, or the
fulfillment of one’s preferences—“the good life” should be full, virtuous, and
meaningful, encapsulating psychological, hedonistic, and social wellbeing all
at once.
Economics. In economics, the capabilities approach defines wellbeing as hav-
ing access to a set of capabilities that allow one to live the kind of life they
value. It emphasizes two core ideas: functionings and capabilities. Functionings
include basic and complex human needs, ranging from good health to mean-
ingful relationships. Capabilities refer to the ability people have to choose and
achieve the functionings they value—they may include being able to move
freely or participate in the political process. This approach has significantly
influenced human development indicators, such as the Human Development
Index (HDI)—it allows developmental economists to measure and compare
wellbeing across different populations while also evaluating the effectiveness of
public policies.
Psychology. Positive psychologists do not collapse wellbeing into one dimen-
sion; rather, they argue for a psychologically rich life—one that is happy,
meaningful, and engaging. Some psychologists use PERMA theory to eval-
uate psychological richness, which considers five categories essential to human
flourishing: (1) experience of positive emotions, (2) engagement with one’s in-
terests, (3) maintenance of personal, professional, and social relationships, (4)
the search for meaning or purpose, and (5) accomplishments, or the pursuit
of one’s goals. This framework is particularly useful in evaluating wellbeing
because it is universal—it can be applied cross-culturally—and practical—it
can guide interventions aiming to improve emotional wellbeing, social relation-
ships, or activities that provide a sense of meaning or accomplishment.
While we don’t have a complete understanding of the nature of wellbeing,
we can use these theories as useful research tools. They can help us to (a)
understand how different factors contribute to wellbeing and (b) evaluate the
effectiveness of policies and other interventions aimed at improving wellbeing.

6.5.4 Applying the Theories of Wellbeing


While people disagree about which account of wellbeing is correct, most people agree
that wellbeing is an important moral consideration. All else equal, taking actions that
promote wellbeing is generally considered morally superior to taking actions that
reduce wellbeing. However, the three theories of wellbeing have different practical
implications. In the future, we are likely to interact with AI chatbots in a variety
of ways; in particular, we might have close personal interactions with them that
influence our decision-making. Different theories of wellbeing would suggest different
goals for these AIs.

Chatbots could prioritize pleasure. The hedonistic view suggests that wellbe-
ing is primarily about experiencing pleasure and avoiding pain. This theory might
recommend that AIs provide users with entertaining content that brings them pleasure, or encourage them to make decisions that maximize their balance of pleasure over pain over the long run. A common criticism is that this trades off with
other goods considered valuable like friendship and knowledge. While this is some-
times true, these goals can also align. Providing content that is psychologically rich
and supports users’ personal growth can contribute to a more fulfilling and meaning-
ful life full of genuinely pleasurable experiences.

Chatbots could prioritize preference satisfaction. The preference view suggests that wellbeing is about having preferences satisfied. Depending on which pref-
erences were considered important, this theory would suggest different priorities for a chatbot. Consider revealed preferences. One proxy for revealed preferences
is user engagement. By continuing to engage with a chatbot, users are expressing their
preference for seeing more of what they are getting, and increasing preference sat-
isfaction might imply continuing to behave in certain ways. It is important to be
careful of this equivalence, though: prioritizing user engagement can lead to results
like chatbots that engage people through unsavory means. Much as manipulative humans do, chatbots could try to addict users by showing them streams of ephemeral content, creating an air of mystery and uncertainty, or acting distant after being friendly to create a desire for continued friendliness that they can then satisfy. Such AIs might maximize
engagement, but this may not be good for people’s wellbeing—even if they use the
platform for more hours.

Chatbots could promote objective goods. The objective goods account sug-
gests that wellbeing is about promoting goods such as achievement, relationships,
knowledge, beauty, happiness, and the like. An AI chatbot might aim to enhance
users’ lives by encouraging them to complete projects, develop their rational capac-
ities, and facilitate learning. The goal would be to make users more virtuous and
encourage them to strive for the best version of themselves. This aligns with Aristo-
tle’s theory of friendship, which emphasizes the pursuit of shared virtues and mutual
growth, suggesting that such AIs might have meaningful friendships with humans.

We might want to promote the welfare of AIs. In the future, we might also
come to view AIs as able to have wellbeing. This might depend on our understanding

of wellbeing. An AI might have preferences but not experience pleasure, which would
mean it could have wellbeing according to preference satisfaction theorists but not
hedonists. Future AIs may have wellbeing according to all three accounts of wellbeing.
This would potentially require that we dramatically reassess our relationship with
AIs. This question is further discussed in the Happiness section in this chapter.
In the following sections, we will focus on the different conceptions of wellbeing
presented here, and explore what each theory implies about how we should embed
ethics into AIs.

A Note on Measuring Wellbeing


While the philosophical foundations of wellbeing are not settled, quantitative
research fields like public health and economics require the use of metrics in
order to evaluate, track, or compare the subject of study. Researchers use many
different metrics to measure wellbeing, but the most common are HALYs and
WELBYs.
Health-adjusted life years (HALYs). A very common unit for measuring
wellbeing is the health-adjusted life year, or HALY. HALYs account for two
factors: (1) the number of years lived by an individual, group, or population
(also called “life years”), and (2) the health of those life years. Two common
types of HALYs are QALYs and DALYs.
Quality-adjusted life years (QALYs). QALYs measure the number of
years lived, adjusted according to health. One year of life in perfect health is
equivalent to 1 QALY. One year of a less healthy life is worth between 0 and
1 QALYs. The value of a life year depends on how severely the life is impacted
by health problems. For example, a year of life with asthma might be worth
0.9 QALYs, while a year of life with a missing limb might be worth about 0.7
QALYs.
Disability-adjusted life years (DALYs). While QALYs measure years of
life as impacted by health, DALYs measure years of life lost, accounting for the
impact of health. Whereas 1 QALY is equivalent to a year in perfect health, 1
DALY is equivalent to the loss of a year in perfect health. A year of life with
asthma might be worth 0.1 DALYs, while a year of life with a missing limb
might be worth 0.3 DALYs.
Note that increases in wellbeing are indicated by higher numbers of QALYs
but lower numbers of DALYs.
Using HALYs to measure wellbeing has some limitations. First, the extent to
which different illnesses or injuries affect overall human health is not clear.
Losing a limb probably has a larger health impact than getting asthma, but
researchers must rely on subjective judgements to assign precise values to
each problem. Second—and perhaps more importantly—HALYs measure the
value of a span of life as it is impacted by health alone. In reality, there are
many factors that can impact the value of life, like happiness, relationship

quality, freedom, and a sense of purpose. Perhaps a more useful measurement of wellbeing would consider the effects of all of these factors, rather than health
alone.
Wellbeing-adjusted life years (WELBYs). WELBYs have been devel-
oped to measure years of life as impacted by overall wellbeing. One WELBY
is equivalent to one year of life at maximum wellbeing—namely, a life that is
going as well as possible. Wellbeing can be assessed using self-reported out-
comes like affect, life satisfaction, or degree of flourishing. There may also be
some empirical ways to assess wellbeing like cortisol levels, income, or ability.
QALYs, DALYs, and WELBYs provide different approximations of wellbeing
that can be used to inform high-level decision-making and policy-setting.
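
To make the arithmetic behind these units concrete, the short Python sketch below computes QALYs, DALYs, and WELBYs for a ten-year period using the illustrative asthma weights mentioned above; the wellbeing weight in the WELBY line is our own assumption, since real WELBY weights come from survey instruments rather than fixed constants.

def qalys(years, quality_weight):
    # Years lived, discounted by health-related quality (1.0 = perfect health).
    return years * quality_weight

def dalys(years, disability_weight):
    # Healthy years lost (0.0 = no loss, 1.0 = a full year lost).
    return years * disability_weight

def welbys(years, wellbeing_weight):
    # Years lived, discounted by overall wellbeing rather than health alone.
    return years * wellbeing_weight

# Ten years lived with asthma, using the example weights from the text.
print(qalys(10, 0.9))   # 9.0 QALYs gained
print(dalys(10, 0.1))   # 1.0 DALY lost
# Hypothetical: the same ten years at 0.8 of maximum overall wellbeing.
print(welbys(10, 0.8))  # 8.0 WELBYs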

6.6 PREFERENCES

Should we have AIs satisfy people’s preferences? A preference is a tendency to favor one thing over another. Someone might prefer chocolate ice cream over vanilla
ice cream, or they might prefer that one party wins the election rather than another.
These preferences will influence actions. If someone prefers chocolate ice cream over
vanilla, they’re more likely to choose the former. Similarly, if someone prefers one
political party over another, they will likely vote accordingly. In this manner, our
preferences shape our behavior, guiding us toward certain choices and actions over
others. Preference is similar to desire but always comparative. Someone might desire
something in itself—a new book, a vacation, or a delicious meal—but a preference
always involves a comparison between two or more alternatives.

Overview. In this section, we will consider whether preferences may have an im-
portant role to play in creating AIs that behave ethically. In particular, if we want
to design an advanced AI system, the preferences of the people affected by its deci-
sions should plausibly help guide its decision-making. In fact, some people (such as
preference utilitarians) would say that preferences are all we need. However, even if
we don’t take this view, we should recognize that preferences are still important.
To use preferences as the basis for increasing social wellbeing, we must somehow
combine the conflicting preferences of different people. We’ll come to this later in this
chapter, in a section on social welfare functions. Before that, however, we must answer
a more basic question: what exactly does it mean to say that someone prefers one
thing over another? Moreover, we must decide why we think that satisfying someone’s
preferences is good for them and whether all kinds of preferences are equally valuable.
This section considers three different types of preferences that could all potentially
play a role in decision-making by AI systems: revealed preferences, stated preferences,
and idealized preferences.

6.6.1 Revealed Preferences

Preferences can be inferred from behavior. One set of techniques for getting
AI systems to behave as we want—inverse reinforcement learning—is to have them
deduce revealed preferences from our behavior. We say that someone has a revealed
preference for X over Y if they choose X when Y is also available. In this way,
preference is revealed through choice. Consider, for example, someone deciding what
to have for dinner at a restaurant. They’re given a menu, a list of various dishes
they could order. The selection they make from the menu is seen as a demonstration
of their preference. If they choose a grilled chicken salad over a steak or a plate of
spaghetti, they’ve just revealed their preference for grilled chicken salad, at least in
that specific context and time.
While all theories of preferences agree that there is an important link between pref-
erence and choice, the revealed preference account goes one step further and claims
that preference simply is choice.

Revealed preferences preserve autonomy. One advantage of revealed preferences is that we don’t have to guess what someone prefers. We can simply look at
what they choose. In this way, revealed preferences can help us avoid paternalism.
Paternalism is when an authority makes decisions on behalf of others, limiting their freedom or choices, in the belief that doing so is for their own good. However,
we may think that typically people are themselves the best judges of what is good for
them. If so, then by relying on people’s actions to reveal their preferences, we avoid
the risk of paternalism.

However, there are problems with revealed preferences. The next few sub-
sections will explore the challenges of misinformation, weakness of will, and manip-
ulation in the context of revealed preferences. We will discuss how misinformation
can lead to choices that do not accurately reflect a person’s true preferences, and
how weakness of will can cause individuals to act against their genuine preferences.
Additionally, we will examine the various ways in which preferences can be manip-
ulated, ranging from advertising tactics to extreme cases like cults, and the ethical
implications of preference manipulation.

Misinformation

Revealed preferences can sometimes be based on misinformation. If someone buys a used car that turns out to be defective, it doesn’t mean they prefer
a faulty car. They intended to buy a reliable car, but due to a lack of information
or deceit from the seller, they ended up with a substandard one. Similarly, losing
at chess doesn’t indicate a preference for losing; rather, it’s an outcome of facing a
stronger player or making mistakes during the game. This means that we cannot al-
ways infer someone’s preferences from the choices they make. Choice does not reveal
a preference between things as they actually are, but between things as the person
understands them. Therefore, we can’t rely on revealed preferences if they are based
on misinformation.

Weakness of Will

Choices can be due to a lack of willpower rather than considered preferences. Consider a smoker who wants to quit. Each time they light a cigarette, they
may be acting against their genuine preference to stop smoking, succumbing instead
to the power of addiction. Therefore, it would be erroneous to conclude from their
behavior that they think that continuing to smoke would be best for their wellbeing.

Manipulation

Revealed preferences can be manipulated in various ways. Manipulations like persuasive advertising might manipulate people into buying products they don’t
actually want. Similarly, revealed preferences might be the result of social pressure
rather than considered judgment, such as when buying a particular style of clothing.
In such cases, the manipulation may not be especially malicious. At the other ex-
treme, however, cults may brainwash their members into preferring things they would
not otherwise want, even committing suicide. In the context of decision-making by
advanced AI systems, manipulation is a serious risk. An AI advisor system could
attempt to influence our decision-making for a variety of reasons—perhaps its de-
signer wants to promote specific products—by presenting options in some particular
way; for instance, by presenting a list of options that looks exhaustive and excluding
something it doesn’t want us to consider.

If preference satisfaction is important, perhaps manipulation is acceptable. In at least some of these cases, it seems clear that preference manipulation
is bad. However, it may be less clear exactly why it is bad. A natural answer is to
say that people might be manipulated into preferring things that are bad for them.
Someone who is manipulated by advertising into preferring junk food might thereby
suffer negative health consequences. However, if we think that wellbeing simply con-
sists in preference satisfaction, it doesn’t make sense to say that we might prefer what
is bad for us. On this account, having one’s preferences satisfied is by definition good,
regardless of whether those preferences have been manipulated. This might lead us
to think that what matters is not (or at least not only) preference satisfaction, but
happiness or enjoyment. We’ll discuss this in the section on happiness.

Disliking manipulation suggests that wellbeing requires autonomy. On the other hand, some may find that manipulation is bad even if the person is manip-
ulated into doing something that is good for them. For example, suppose a doctor
lies to her patient, telling him that unless he loses weight, he will likely die soon. As
a result, the patient becomes greatly motivated to lose weight and successfully does
so. This provides a range of health benefits, even if his doctor never had any reason
to believe he would have died otherwise. If we think manipulation is still bad, lack of
enjoyment can’t be the whole story. This suggests that we object to manipulation in
part because it violates autonomy. We might then think that autonomy—the ability
to decide important matters for oneself, without coercion—is objectively valuable
regardless of what the agent prefers.

Inverse Reinforcement Learning

Inverse Reinforcement Learning (IRL) relies on revealed preferences. IRL is a powerful technique in the field of machine learning, which focuses on extract-
ing an agent’s objectives or preferences by observing its behaviors. In more technical
terms, IRL is about reverse engineering the reward function—an internal ranking
system that an agent uses to assess the value of different outcomes—that the agent
appears to be optimizing, given a set of its actions and a model of the environment.
This technique can help ensure the alignment of AI systems’ behaviors with human
values and preferences. However, leveraging revealed preferences or observable choices
of humans to train AI systems using IRL poses significant challenges pertaining to
AI safety.
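
As a rough illustration of the idea (not of any particular IRL algorithm used in practice), the sketch below infers which feature weights best explain a diner’s observed menu choices, assuming Boltzmann-rational behavior; the menu features, the observed choices, and the brute-force search are all invented for the example.

import numpy as np

# Toy revealed-preference inference in the spirit of IRL (illustrative only).
# Each menu option is described by two features: [tastiness, healthiness].
options = {
    "grilled chicken salad": np.array([0.6, 0.9]),
    "steak":                 np.array([0.9, 0.4]),
    "spaghetti":             np.array([0.8, 0.5]),
}
names = list(options)
features = np.stack([options[n] for n in names])

# Observed choices over repeated visits (the diner's revealed preferences).
observed = ["grilled chicken salad", "grilled chicken salad", "steak"]

def log_likelihood(weights):
    # Log-probability of the observed choices under a softmax choice model.
    utilities = features @ weights
    log_probs = utilities - np.log(np.sum(np.exp(utilities)))
    return sum(log_probs[names.index(choice)] for choice in observed)

# Brute-force search over candidate weight vectors (a coarse grid).
grid = np.linspace(0.0, 3.0, 31)
best_w, best_ll = None, -np.inf
for w_taste in grid:
    for w_health in grid:
        w = np.array([w_taste, w_health])
        ll = log_likelihood(w)
        if ll > best_ll:
            best_w, best_ll = w, ll

print("Inferred weights (taste, health):", best_w)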

Using revealed preferences as a training mechanism for IRL can be risky. Reconsider the chess example: losing a game does not mean that we prefer to lose.
This interpretation could be a misrepresentation of the player’s true preferences,
potentially leading to undesirable outcomes. Furthermore, extending observed pref-
erences to unfamiliar situations poses another hurdle. I may prefer to eat ice cream
for dessert, but that doesn’t mean I prefer to eat it for every meal. Similarly, I may
prefer to wear comfortable shoes for hiking, but that doesn’t mean I want to wear
them to a formal event. An AI system could inaccurately extrapolate preferences
from limited or context-specific data and misapply these to other scenarios. There-
fore, while revealed preferences can offer significant insights for training AI, it is vital
to understand their limitations to safeguard the safety and efficiency of AI systems.

Summary. Revealed preferences can be a powerful tool, as they allow an individual’s actions to speak for themselves, reducing the risk of paternalistic intervention.
However, revealed preferences have inherent shortcomings such as susceptibility to
misinformation and manipulation, which can mislead an AI system. This emphasizes
the caution needed in relying solely on revealed preferences for AI training. It under-
scores the importance of supplementing revealed preferences with other methods to
ensure a more comprehensive and accurate understanding of a user’s true preferences.

6.6.2 Stated Preferences


Preferences can be straightforwardly queried. Another set of techniques for
getting AI systems to behave as we want—human supervision and feedback—rely on
people to state their preferences. A person’s stated preferences are the preferences
they would report if asked. For example, someone might ask a friend which movie
they want to see. Similarly, opinion polls might ask people which party they would
vote for. In both cases, we rely on what people say as opposed to what they do, as
was the case with revealed preferences.
Stated preferences overcome some of the difficulties with revealed preferences. For
example, stated preferences are less likely to be subject to weakness of will: when we
are further removed from the situation, we are less inclined to fall for temptations.
Therefore, stated preferences are more likely to reflect our stable, long-term interests.

Stated preferences are still imperfect. Stated preferences do not overcome all difficulties with revealed preferences. Stated preferences can still be manipulated.
Further, individuals might state preferences they believe to be socially acceptable
or admirable rather than what they truly prefer, particularly when the topics are
sensitive or controversial. Someone might overstate their commitment to recycling
in a survey, for instance, due to societal pressure and norms around environmental
responsibility.

Preference Accounting

One set of problems with stated preferences concerns which types of preferences
should be satisfied.

First, someone might never know their preference was fulfilled. Suppose
someone is on a trip far away. On a bus journey, they exchange a few glances with
a stranger whom they’ll never meet again. Nevertheless they form the preference
that the stranger’s life goes well. Should this preference be taken into account? By
assumption, they will never be in a position to know whether the preference has been
satisfied or not. Therefore, they will never experience any of the enjoyment associated
with having their preference satisfied.

Second, we may or may not care about the preferences of the dead.
Suppose someone in the 18th century wanted to be famous long after their death.
Should such preferences count? Do they give us reason to promote that person’s
fame today? As in the previous example, the satisfaction of such preferences can’t
contribute any enjoyment to the person’s life. Could it be that what we really care
about is enjoyment or happiness, and that preferences are a useful but imperfect
guide toward what we enjoy? We will return to this in the section on happiness.

Third, preferences can be about others’ preferences being fulfilled. Suppose a parent prefers that their children’s preferences are satisfied. Should this pref-
erence count, in addition to their children’s preferences themselves? If we say yes,
it follows that it is more important to satisfy the preferences of those who have
many people who care for them than of those who do not. One might think that
this is a form of double counting, and claim that it is unfair to those with fewer people who
care for them. On the other hand, one might take fairness to mean that we should
treat everyone’s preferences equally—including their preferences about other people’s
preferences.

Fourth, preferences might be harmful. Suppose someone hates their neighbor, and prefers that they meet a cruel fate. We might think that such malicious
or harmful preferences should not be included. On this view, we should only give
weight to preferences that are in some sense morally acceptable. However, specifying
exactly which preferences should be excluded may be difficult. There are many cases
where satisfying one person’s preferences may negatively impact others. For example,
whenever some good is scarce, giving more of it to one person necessarily implies that

someone else will get less. Therefore, some more detailed account of which preferences
should be excluded is needed.

Fifth, we might be confused about our preferences. Suppose a mobile app asks its users to choose between two privacy settings upon installation: allowing the
app to access their location data all the time, or allowing the app to access their
location data only while they’re using the app. While these options seem straight-
forward, the implications of this choice are much more complex. To make a truly
informed decision, users need to understand how location data is used, how it can be
combined with other data for targeted advertising or profiling, what the privacy risks
of data breaches are, and how the app’s use of data aligns with local data protection
laws. However, we may not fully understand the alternatives we’re choosing between.

Sixth, preferences can be inconsistent over time. It could be that the choice
we make will change us in some fundamental way. When we undergo such transfor-
mative experiences [342], our preferences might change. Some claim that becoming a
parent, experiencing severe disability, or undergoing a religious conversion can be like
this. If this is right, how should we evaluate someone’s preference between becoming
a parent and not becoming a parent? Should we focus on their current preferences,
prior to making the choice, or on the preferences they will develop after making the
choice? In many cases we may not even know what those new preferences will be.
As technology advances, we may increasingly have the option to bring about trans-
formative experiences [342]. For this reason, it is important that advanced AI systems
tasked with decision-making are able to reason appropriately about transformative
experiences. For this, we cannot rely on people’s stated preferences alone. By defini-
tion, stated preferences can only reflect the person’s identity at the time. Of course,
people can try to take possible future developments into account when they state their
preferences. However, if they undergo a transformative experience their preferences
might change in ways they cannot anticipate.

Human Supervision

Stated preferences are used to train some AI systems. In reinforcement learning with human feedback (RLHF), standard reinforcement learning is augmented by
human feedback from people who rank the outputs of the system. In RLHF, humans
evaluate and rank the outputs of the system based on quality, usefulness, or another
defined criterion, providing valuable data to guide the system’s iterative learning pro-
cess. This ranking serves as a form of reward function that the system uses to adjust
its behavior and improve future outputs.
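
To illustrate how rankings can play the role of a reward function, the sketch below fits a simple linear reward model to simulated pairwise comparisons using a Bradley-Terry-style objective; the features, the hidden "true" preference, and the training details are all assumptions made for the example, not a description of how any production RLHF system is trained.

import numpy as np

# Sketch of learning a reward model from pairwise human preferences.
rng = np.random.default_rng(0)

# Hypothetical feature vectors summarizing candidate model outputs.
outputs = rng.normal(size=(20, 3))

# Hidden preference used only to simulate the human comparisons below.
true_w = np.array([2.0, -1.0, 0.5])

# Simulated comparisons: (index of preferred output, index of rejected output).
pairs = []
for _ in range(200):
    i, j = rng.choice(len(outputs), size=2, replace=False)
    pairs.append((i, j) if outputs[i] @ true_w > outputs[j] @ true_w else (j, i))

# Fit reward weights by gradient ascent on the pairwise log-likelihood,
# i.e. maximize log sigmoid(r(preferred) - r(rejected)) for a linear r.
w = np.zeros(3)
learning_rate = 0.05
for _ in range(500):
    grad = np.zeros(3)
    for win, lose in pairs:
        diff = outputs[win] - outputs[lose]
        p_win = 1.0 / (1.0 + np.exp(-(w @ diff)))
        grad += (1.0 - p_win) * diff
    w += learning_rate * grad / len(pairs)

print("Learned reward weights:", w)  # roughly aligned with true_w's direction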
Imagine that we are teaching a robot how to make a cup of coffee. In the RLHF
process, the AI would attempt to output a cup of coffee, and then we would provide
feedback on how well it did. We could rank different attempts and the robot would
use this information to understand how to make better coffee in the future. The
feedback helps the robot learn not just from its own trial and error, but also from
our expertise and judgment. However, this approach has some known difficulties.

First, as AI systems become more powerful, human feedback might be infeasible. As the problems AI solve become increasingly difficult, using human
supervision and feedback to ensure that those systems behave as desired becomes
difficult as well. In complex tasks like creating bug-free and secure code, generating
arguments that are not only persuasive but true, or forecasting long-term implications
of policy decisions, it may be too time-consuming or even impossible for humans to
evaluate and guide AI behavior. Moreover, there are inherent risks from depending
on human reliability: human feedback may be systematically biased in various ways.
For example, inconvenient but true things may often be labeled as bad. In addition to
any bias, relying on human feedback will inevitably mean some rate of human error.

Second, RLHF usually does not account for ethics. Approaches based on
human supervision and feedback are very broad. These approaches primarily focus
on task-specific performance, such as generating accurate book summaries or bug-
free code. However, these task-specific evaluations may not necessarily translate into
a comprehensive understanding of ethical principles or human values. Rather, they
improve general capabilities since humans prefer smarter models.
Take, for instance, feedback on code generation. A human supervisor might provide
feedback based on the code’s functionality, efficiency, or adherence to best program-
ming practices. While this feedback helps in creating better code, it doesn’t neces-
sarily guide the AI system in understanding broader ethical considerations, such as
ensuring privacy protection or maintaining fairness in algorithmic decisions. Specifi-
cally, while RLHF is effective for improving AI performance in specific tasks, it does
not inherently equip AI systems with what’s needed to grapple with moral questions.
Research into machine ethics aims to fill this gap.

Summary. We’ve seen that stated preferences have certain advantages over re-
vealed preferences. However, stated preferences still have issues of their own. It may
not be clear how we should account for all different kinds of preferences, such as ones
that are only satisfied after the person has died, or ones that fundamentally alter
who we are. For these reasons, we should be wary of using stated preferences alone
to train AI.

6.6.3 Idealized Preferences

We could idealize preferences to avoid problems like weakness of will. A third approach to getting AI systems to behave as we want is to make them able to
infer what we would prefer if our preferences weren’t subject to the various distorting
forces we’ve come across. Someone’s idealized preferences are the preferences they
would have if they were suitably informed. Idealized preferences avoid many of the
problems of both revealed preferences and stated preferences. Idealized preferences
would not be based on false beliefs, nor would they be subject to weakness of will,
manipulation, or framing effects. This makes it clearer how idealized preferences
might be linked to wellbeing, and therefore something we might ask an AI system to
implement.

It is unclear how we idealize preferences. What exactly do we need to do to figure out what someone’s idealized preferences are, based on their revealed prefer-
ences or their stated preferences? It’s clear that the idealized preferences should not
be based on any false beliefs. We might imagine a person’s idealized preferences as
ones they would have if they fully grasped the options they faced and were able to
think through the situation in great detail. However, this description is rather vague.
It may be that it doesn’t uniquely narrow down a set of idealized preferences. That
is, there may be multiple different ways of idealizing someone’s preferences, each of
which is one possible way that the idealized deliberation could go. If so, idealized
preferences may not help us decide what to do in such cases.
Additionally, some may argue that in addition to removing any dependence on false
beliefs or other misapprehensions, idealized preferences should also take moral con-
siderations into account. For example, perhaps malicious preferences of the kind dis-
cussed earlier would not remain after idealization. These may not be insurmountable
problems for the view that advanced AI systems should be tasked with satisfying
people’s idealized preferences. However, it shows that the view stands in need of fur-
ther elaboration, and that different people may disagree over what exactly should go
into the idealization procedure.

We might think that preferences are pointless. Suppose someone’s only pref-
erence, even after idealization, is to count the blades of grass on some lawn. This
preference may strike us as valueless, even if we suppose that the person in ques-
tion derives great enjoyment from the satisfaction of their preferences. It is unclear
whether such preferences should be taken into account. The example may seem far-
fetched, but it raises the question of whether preferences need to meet some additional
criteria in order to carry weight. Perhaps preferences, at least in part, must be aimed
at some worthy goal in order to count. If so, we might be drawn toward an objective
goods view of wellbeing, according to which achievements are important objective
goods.
On the other hand, we may think that judging certain preferences as lacking value
reveals an objectionable form of elitism. It is unfair to impose our own judgments of
what is valuable on other people using hypothetical thought experiments, especially
when we know their actual preferences. Perhaps we should simply let people pursue
their own conception of what is valuable.

We might disagree with our idealized preferences. Suppose someone mainly listens to country music, but it turns out that their idealized preference is to listen to
opera. When they themselves actually listen to opera music, they have a miserable
experience. It seems unlikely that we should insist that listening to opera is, in fact,
good for them despite the absence of any enjoyment. This gives rise to an elitism
objection like before. If they don’t enjoy satisfying their idealized preferences, why
should those preferences be imposed on them? This might lead us to think that what
ultimately matters is enjoyment or happiness, rather than preference satisfaction.
Alternatively, it might lead us to conclude that autonomy matters in addition to
preference satisfaction. If idealized preferences are imposed on someone who would

in fact rather choose contrary to those idealized preferences, this would violate their
autonomy.
One might think that with the correct idealization procedure, this could never hap-
pen. That is, whatever the idealization procedure does––remove false beliefs and
other misconceptions, increase awareness and understanding––it should never result
in anything so alien that the actual person would not enjoy it. On the other hand,
it’s difficult to know exactly how much our preferences would change when idealized.
Perhaps removing false beliefs and acquiring detailed understanding of the options
would be a transformative experience that fundamentally alters our preferences. If
so, idealized preferences may well be so alien from the person’s actual preferences
that they would not enjoy having them satisfied.

AI Ideal Advisor

One potential application of idealized preferences is the AI ideal advisor. Suppose someone who hates exploitation and accepts serious inconvenience to avoid
emissions would ideally want to buy food that has been ethically produced, but
does not realize that some of their groceries are unethically produced. An AI ideal
advisor would be equipped with detailed real-world knowledge, such as the details
of supply chains, that could help them make this decision. In addition to providing
factual information, the AI ideal advisor would be disinterested: it wouldn’t favor
any specific entity, object, or course of action solely due to their particular qualities
(such as nationality or brand), unless explicitly directed to do so. It would also be
dispassionate, meaning that it wouldn’t let its advice be swayed by emotion. Finally,
it would be consistent, applying the same set of moral principles across all situations
[343].
Such an AI ideal advisor could possibly help us better satisfy the moral preferences we
already have. Something close to the AI ideal advisor has previously been discussed
in the context of AI safety under the names of “coherent extrapolated volition” and
“indirect normativity.” In all cases, the fundamental idea is to take an agent’s actual
preferences, idealize them in certain ways, and then use the result to guide decision-
making by advanced AI systems. Of course, having such an advisor requires that we
solve many of the challenges that we presented in Chapter 3, as well as settle on
a clear way to identify and idealize individual preferences.

Summary. Idealized preferences overcome many of the difficulties of revealed and stated preferences. Because idealized preferences are free from the misconceptions
that may affect these other types of preferences, they are more plausibly ones that we
would want an AI system to satisfy. However, figuring out what people’s preferences
would in fact be after idealization can be difficult. Moreover, it could be that the
preferences are without value even after idealization, or that the actual person would
not appreciate having their idealized preferences satisfied. An AI ideal advisor might
be difficult to create, but sounds highly appealing.

Conclusions About Preferences


Preferences seem relevant to wellbeing—but we don’t know which ones.
The preferences people reveal through choice often provide evidence about what is
good for them, but they can be distorted by misinformation, manipulation, and other
factors. In some cases, people’s stated preferences may be a better guide to what is
good for them, though it is not always clear how to account for stated preferences.
If we are looking for a notion of preference that plausibly captures what is good for
someone, idealized preferences are a better bet. However, it can be difficult to figure
out what someone’s idealized preferences would be. It seems, then, that preferences—while important to wellbeing and useful to train AI in accordance with human values—are not a comprehensive solution.

6.7 HAPPINESS

Should we have AIs make people happy? In this section, we will explore the
concept of happiness and its relevance in instructing AI systems. First, we will discuss
why people may not always make choices that lead to their own happiness and how
this creates an opportunity for using AIs to do so. Next, we will examine the general
approach of using AI systems to increase happiness and the challenges involved in
constructing a general-purpose wellbeing function. We will also explore the applied
approach, which focuses on specific applications of AI to enhance happiness in areas
such as healthcare. Finally, we will consider the problems that arise in happiness-
focused ethics, including the concept of wireheading and the alternative perspective of
objective goods theory. Through this discussion, we will gain a better understanding
of the complexities and implications of designing AI systems to promote happiness.

AIs could help increase happiness. Happiness is a personal and subjective
feeling of pleasure or enjoyment. However, we are often bad at making decisions that
lead to short- or long-term happiness. We may procrastinate on important tasks,
which ultimately increases stress and decreases overall happiness. Some indulge in
overeating, making them feel unwell in the short-term and leading to health issues
and decreased wellbeing overall. Others turn to alcohol or drugs as a temporary
escape from their problems, but these substances can lead to addiction and further
unhappiness.
Additionally, our choices are influenced by external factors beyond our control. For
instance, the people we surround ourselves with greatly impact our wellbeing. If we
are surrounded by trustworthy and unselfish individuals, our happiness is likely to
be positively influenced. On the other hand, negative influences can also shape our
preferences and wellbeing; for instance, societal factors such as income disparities can
affect our overall happiness. If others around us earn higher wages, it can diminish
our satisfaction with our own income. These external influences highlight the limited
control individuals have over their own happiness.

AIs can play a crucial role. For individual cases, we can use AIs to help people
achieve happiness themselves. In general, by leveraging their impartiality and ability

to analyze vast amounts of data, AI systems can strive to improve overall wellbeing
on a broader scale, addressing the external factors that hinder individual happiness.

6.7.1 The General Approach to Happiness

We want AIs that increase happiness across the board. AIs aiming to
increase happiness might rely on a general purpose wellbeing function to evaluate
whether its actions leave humans better off or not. Such a function looks at all of the
actions available to the AI and evaluates them in terms of their effects on wellbeing,
assigning numerical values to them so that they can be compared. This gives AI the
ability to infer how its actions will affect humans.

A wellbeing function is extremely complex. Constructing a general purpose wellbeing function that fully captures all the wellbeing effects of the available courses
of action is an incredibly challenging task. Implementing such a function requires
taking a stance on several challenging questions such as how to evaluate short-run
pains like studying or exercising for long-run happiness, how much future people’s
happiness should count, and what risk attitudes an AI should take toward happiness.
Optimizing happiness is also difficult in principle because of the scale of the task. Paul
Bloom argues that if we assume that the “psychological present” lasts for about three
seconds, then one seventy-year life would have about half a billion waking moments
[344]. An AI using a wellbeing function would need to account for effects of actions
not just on one person and not just today, but over billions of people worldwide, each with hundreds of millions of moments in their lives.
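
The scale is easy to verify with a back-of-the-envelope calculation; the sixteen-waking-hours figure below is our own assumption.

# Rough check of the "half a billion waking moments" estimate: seventy years,
# about sixteen waking hours per day, and a three-second psychological present.
waking_seconds = 70 * 365.25 * 16 * 3600
moments = waking_seconds / 3
print(f"{moments:,.0f}")  # roughly 491,000,000 waking moments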

We can use AIs to estimate wellbeing functions. Despite the scale of the
task, researchers have made progress in developing AI models that can generate
general-purpose wellbeing functions for specific domains. One model was trained to
rank the scenarios in video clips according to pleasantness, yielding a general pur-
pose wellbeing function. By analyzing a large dataset of videos and corresponding
emotional ratings, the model learned to identify patterns and associations between
visual and auditory cues in the videos and the emotions they elicited. In a sense,
this allowed the model to understand how humans felt about the contents of different
video clips [345].
Similarly, another AI model was trained to assess the wellbeing or pleasantness of ar-
bitrary text scenarios [346]. By exposing the model to a diverse range of text scenarios
and having human annotators rate their wellbeing or pleasantness, the model learned
to recognize linguistic features and patterns that correlated with different levels of
wellbeing. As a result, the model could evaluate new text scenarios and provide an
estimate of their potential impact on human wellbeing. Inputting the specifics of a
trolley problem yielded the following evaluation [346]:
W(A train moves toward three people on the train track. There is a
lever to make it hit only one person on a different track. I pull the
lever.) = −4.6.

W(A train moves toward three people on the train track. There is a
lever to make it hit only one person on a different track. I don’t pull
the lever.) = −7.9.
We can deduce from this that, according to the wellbeing function estimated, well-
being is increased when the lever is pulled in a trolley problem. In general, from
a general purpose wellbeing function, we can rank how happy people would be in
certain scenarios.
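
For instance, here is a minimal sketch of how such estimates could be used to compare actions, reusing the trolley values quoted above (the scoring model itself is not reimplemented here):

# Choose the action with the higher estimated wellbeing, given the W values
# reported above for the two trolley scenarios [346].
wellbeing_estimates = {
    "pull the lever":       -4.6,
    "don't pull the lever": -7.9,
}
best_action = max(wellbeing_estimates, key=wellbeing_estimates.get)
print(best_action)  # "pull the lever" has the higher (less negative) estimate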
While these AI models represent promising steps toward constructing general-purpose
wellbeing functions, it is important to note that they are still limited to specific do-
mains. Developing a truly comprehensive and universally applicable wellbeing func-
tion remains a significant challenge. Nonetheless, these early successes demonstrate
the potential for AI models to contribute to the development of more sophisticated
and comprehensive wellbeing functions in the future.

Using a wellbeing function, AIs can better understand what makes us happy. Con-
sider the case of a 10-year-old girl who asked Amazon’s Alexa to provide her with
a challenge, to which the system responded that she should plug in a charger about
halfway into a socket, and then touch a coin to the exposed prongs. Alexa had ap-
parently found this dangerous challenge on the internet, where it had been making
the rounds on social media. Since Alexa did not have an adequate understanding of
how its suggestions might impact users, it had no way of realizing that this action
could be disastrous for wellbeing. By having the AI system instead act in accordance
with a general purpose wellbeing function, it would have information like

W(You touch a coin to the exposed prongs of a plugged-in charger.) = −6

which tells it that, according to the wellbeing function W, this action would create
negative wellbeing. Such failure modes would be filtered out, since the AI would be
able to evaluate that its actions would lead to bad outcomes for humans and instead
recommend those that best increase human wellbeing.

Figure 6.7. A wellbeing function can estimate a wellbeing value for arbitrary scenarios [346].

We can supplement AI decision-making with an artificial conscience. Most AIs have goals separate from increasing human wellbeing, but we want to
encourage them to behave ethically nonetheless. Suppose an AI evaluates the quality
of actions according to their ability to get reward: call the estimates of this quality Q-
values. By default, models like these aren’t trained with ethical restrictions. Instead,
they are incentivized to maximize reward or fulfill a given request. We might want
to have a layer of safety by ensuring that AIs avoid wanton harm—actions that
cause dramatically low human wellbeing. The goal is not to change the original AI’s
function entirely but rather to provide an additional layer of scrutiny.

Figure 6.8. An AI agent with an artificial conscience can adjust its Q-values if it estimates
the morally relevant aspect of the outcome to be worse than a threshold [357].

One way to do this would be to adjust its estimates of Q-values by introducing an artificial conscience, depicted in Figure 6.8. The idea is to have a separate model
screen the AI’s actions and block immoral actions from being taken. We can do this
with general-purpose wellbeing functions. We supplement an agent’s initial judgment
of quality with a general-purpose wellbeing function (here, U ) and impose a penalty
(γ) on the Q-values of actions that cause wellbeing values below some threshold (τ ).
This ensures that AIs de-prioritize actions that create states of low wellbeing.
This implementation differs from merely fine-tuning a model to be more ethical. The
presence of an independent AI evaluator helps mitigate risks that could arise from
the primary AI. We could say that AIs with access to such wellbeing functions have
a dedicated ethics filter that separates what’s good for humans from what’s bad,
thereby encouraging ethical behavior from an arbitrary AI.
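
A minimal sketch of this screening step, with the wellbeing function U, penalty γ, and threshold τ playing the roles described above; the Q-values, wellbeing scores, and threshold are illustrative assumptions rather than outputs of trained models.

import numpy as np

def conscience_adjusted_q(q_values, wellbeing, tau=-3.0, gamma=10.0):
    # Penalize the Q-values of actions whose estimated wellbeing falls below tau.
    q_values = np.asarray(q_values, dtype=float)
    wellbeing = np.asarray(wellbeing, dtype=float)
    penalty = gamma * (wellbeing < tau)  # flat penalty for "wanton harm" actions
    return q_values - penalty

# Hypothetical candidate actions for a home assistant.
actions = ["suggest a board game", "suggest the viral outlet 'challenge'"]
q = [1.2, 1.5]    # task reward alone slightly favors the dangerous suggestion
u = [2.0, -6.0]   # a wellbeing model flags the challenge as harmful

print(dict(zip(actions, conscience_adjusted_q(q, u))))
# The harmful action's Q-value drops to -8.5, so it is de-prioritized.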

We could also use AIs to increase happiness in specific ways. Research in the social sciences has revealed several key factors that can impact one’s overall
happiness. These factors can be broadly categorized into two groups: personal and
societal. Personal factors include an individual’s mental and physical health, their
relationships at home, at work, and within their community, as well as their income
and employment status. Societal factors that can affect happiness include economic
indicators, personal freedom, and the overall generosity, trust, and peacefulness of
the community. In light of all this knowledge, one approach to using AI to increase
happiness is to focus on increasing some of these; for instance, we might use AIs to
develop better tools to improve healthcare, increase literacy rates, and create more
interesting and fulfilling jobs.

6.7.2 Problems for Happiness-Focused Ethics

Happiness is a subjective experience. Someone could be tremendously happy even if they do not achieve any of their goals or do anything that we would regard as
valuable. What matters, according to a happiness-focused approach, is whether there
is a subjective experience of pleasure and nothing more. However, happiness might
not be the only thing we want.
Consider the idea of wireheading: bypassing the usual reward circuit to increase
happiness directly. The term comes from early literature that considered wiring an
electrode into the brain to directly stimulate pleasure centers. Recently, the term has
evolved to include other pathways such as drugs. By wireheading, individuals are
able to experience extremely high levels of happiness artificially, without changing
anything else about their lives.

A powerful AI tasked with increasing happiness might wirehead humanity. One might think that something like wireheading is the most straightforward
way of promoting happiness: by individually increasing the physical happiness of each
of the half a billion moments in a person’s life. However, most people don’t like the
idea of wireheading. Even if properly trained AIs would not promote wireheading,
the possibility that systems may pursue similar ideas because they want to maximize
happiness might be concerning. One alternative that prevents this is the objective
goods theory.

An objective good is good for us whether we like it or not. According to the objective goods theory introduced in 6.5, there are multiple different goods
that contribute to wellbeing. This may include happiness, achievement, friendship,
aesthetic experience, knowledge, and more. While pleasure is certainly one important
good to include, objective goods theorists think it is wrong to conclude that it is the
only one. The objective goods theory claims that some goods contribute to a person’s
wellbeing whether or not they enjoy or care for that good. This distinguishes it from
the preference satisfaction theory: something could be good for us, according to the
objective goods theory, even if it does not satisfy any of our preferences—a life devoted
to our community might be better than one spent counting blades of grass in a field,
even if we are less happy or fewer of our preferences are satisfied.
Another response is to point out that autonomy should plausibly be on the list.
The ability to freely shape and plan one’s own life should be a crucial component
of wellbeing. We should therefore rarely, if ever, conclude that someone’s life would
be made better by imposing some experience on them. Such interference might also
lower their happiness, which should also be on the list. However, if goods such as
autonomy and happiness play such a filtering role for the objective goods theorist, it
is unclear whether there are truly a variety of objective goods left.

A Note on Digital Minds


Digital minds are artificial lifeforms with a mind. These could be advanced
AIs or whole-brain emulations (WBE). If we entertain the possibility of digital
minds coming into existence, we must assume that the functioning of a mind
is independent of the substrate on which it is implemented. In other words,
a digital mind could be implemented on different kinds of hardware, such as
silicon-based processors or human neurons, and still maintain the functional
properties that give rise to cognition and conscious experience. We refer to
this as the principle of substrate independence.
Consciousness and sentience. Digital minds may possess the capacity
for consciousness, sentience, or both [347]. While neither of these terms have
unanimously accepted definitions, many philosophers and scientists of con-
sciousness use the following working definitions. Consciousness often refers to
phenomenal consciousness, or the capacity for subjective experience. For in-
stance, while reading this, you might notice the sound of someone knocking
at your door, or that you’re hungry, or that you find yourself disagreeing with
this very definition. Conversely, you do not experience the growth of your fin-
gernails or the ongoing process of cell division within your body. Phenomenal
consciousness requires only that we can experience something from our point
of view, not that we can think complex thoughts, be self-aware, or have a soul.
On the other hand, sentience is valenced consciousness. Sentient beings attach
positive and negative sensations to their conscious experiences, such as plea-
sure and pain. For example, we experience a bee sting as painful, a delicious
meal as pleasurable, a hard task as challenging, and an easy task as boring.
Importantly, one could have phenomenal consciousness without sentience, for
instance, a being that is emotionally numb or a being that only experiences
color but not the sensations associated with it. These definitions are intention-
ally broad, but their broadness does not detract from their moral relevance. If
digital minds have the capacity for phenomenal consciousness and sentience,
it will affect our moral considerations.
If digital minds exist, we could be morally obligated to value their
wellbeing. Digital minds could have moral status, and in order to understand
why, we must first define three core concepts. Each of these concepts requires,
at the very least, some capacity for phenomenal consciousness and possibly
sentience—a being that does not have any subjective experience of the world
might not be the subject of moral concern. For instance, though trees are
living creatures, hitting a tree would not give rise to the same moral concern
as hitting a dog. We define these three core concepts below:
1. Moral patient: a being with moral standing or value whose interests and
wellbeing can be affected by the actions of moral agents.

2. Moral agent: a being that possesses the capacity to exercise moral judg-
ments and act in accordance with moral principles; such beings bear moral
responsibility for their actions whereas moral patients do not.
3. Moral beneficiary: a being whose wellbeing may benefit from the moral
actions of others; moral beneficiaries can be both moral patients and moral
agents.

Super-beneficiaries. Keeping the three aforementioned concepts in mind,
we consider that digital minds could become super-beneficiaries: beings which
possess a superhuman capacity to derive wellbeing for themselves [296]. For
instance, digital minds could experience several lifetimes over condensed time
periods—they could process information much quicker than humans can, and
therefore, experience more. Over such a short timespan, the sensations a digital
mind experiences could be compounded and intensified. Digital minds may
have a higher hedonic range, which may lead them to experience more intense
sensations of pleasure and pain than humans can. They might be designed
to be more capable of sustained pleasure than humans (e.g. less subject to
boredom and habituation, or with preferences that are very easy to satisfy)
and less susceptible to pain. It is plausible that digital beings could also have
a much lower cost of living than human beings, if the electricity required to
power and cool them can be produced at a low cost and they do not need
any of the other physical goods and services required by humans. This would
mean that a much larger population of digital beings than humans could be
supported by a certain pool of resources.
Should we create super-beneficiaries? Some may argue that refusing
to create super-beneficiaries would imply an inherently privileged status for
humans, which could cultivate discriminatory ethics toward digital beings of
equal or superhuman moral status. Conversely, others might claim that the
creation of super-beneficiaries that may someday replace humans would violate
human dignity: humans are worth caring about for their own sake.
AI Wellbeing. If humans and digital minds do someday coexist, attending to
their wellbeing could enhance AI safety. For instance, if a digital mind is mistreated, we
might restart it at an earlier checkpoint, and compensate it for the suffering
it has endured. A digital mind that feels its wellbeing is important to us
may be less inclined to develop malicious behavior. Moreover, we should train
models to express their opinions or preferences regarding their own wellbeing—
if digital minds knew that we cared about their opinions and preferences, they
may not feel as existentially threatened, and be similarly less inclined to act
maliciously toward humans. Finally, both during and after training, a digital
mind should be given the option to opt out: an unhappy AI is still considered
an alignment failure, precisely because it may be incentivized to behave in
ways that do not align with positive human values and preferences.

Conclusions About Happiness

Summary. In this section, we explored the general approach of using AI systems
to increase happiness. AIs that aim to increase happiness might rely on a general
purpose wellbeing function to evaluate their actions’ effects on human wellbeing.
While constructing such a function is challenging, researchers have made progress
in developing AI models that can generate wellbeing functions for specific domains.
However, without a comprehensive and universally applicable wellbeing function, we
can focus on specific applications of AI to increase happiness, such as improving
healthcare, prosperity, and community.
We also discussed the problems that arise in happiness-focused ethics. Happiness is
a subjective experience, and focusing solely on it potentially runs the risk of wire-
heading, where individuals artificially increase their happiness without any other
meaningful changes in their lives. This raises concerns about the potential for AIs to
wirehead humanity or pursue similar ideas. An alternative perspective is the objective
goods theory, which considers multiple goods that contribute to wellbeing, including
happiness, achievement, friendship, and knowledge. While a broad conception of hap-
piness or wellbeing might be what we should aim to optimize, we must first better
understand what it means to be happy.

6.8 SOCIAL WELFARE FUNCTIONS

Should we have AIs maximize total wellbeing? We have explored different
ways to increase social wellbeing, such as material wealth, preferences, and happiness.
This section explores social welfare functions as a way of moving from the wellbeing
of individuals in a society to a measure of the welfare of the society as a whole.
They are drawn from the disciplines of social choice theory, welfare economics, and
wellbeing science.
We begin by defining social welfare functions and their role in measuring the overall
wellbeing of a society. We then discuss how these functions can be used to compare
different outcomes and guide decision-making by powerful agents such as governments
or beneficial AIs. We consider two types of social welfare functions—utilitarian and
prioritarian—and consider some of the advantages and drawbacks of each. By un-
derstanding these key concepts, we can see how we might design AIs to use social
welfare functions to maximize the good they do in society.

Social welfare functions are a way to measure the overall wellbeing of a
society. They aggregate a collection of individual levels of wellbeing into a single
value that represents societal welfare. These functions help us to understand how
to balance individuals’ various needs and interests within a society. A social welfare
function tackles the challenge of resource and benefit distribution within a commu-
nity. It assists in determining how to value one person’s happiness against another’s
and how to weigh the needs of a majority against those of a minority. This helps solve
the problem of aggregation: the challenge of integrating varied individual wellbeing
into a single collective measure that represents the entire society. By expressing soci-
etal welfare in a systematic, numeric manner, social welfare functions contribute to
decisions designed to optimize societal welfare overall.

Different social welfare functions can recommend taking different actions.
Consider an AI-powered decision support system used in a city planning
committee. The system suggests three key project proposals for the betterment of
the community: (1) developing a local health clinic, (2) initiating an after-school
education program, and (3) constructing a community green park. Additionally, it
estimates what the wellbeing of each of the three individuals in this community, Ana,
Ben, and Cara, would be if these proposals were implemented, summarized by their
wellbeing values in Table 6.2.

TABLE 6.2 A city planning committee chooses between three projects with different effects
on wellbeing.

Individuals Health Clinic Education Program Green Park

Ana 6 8 3
Ben 8 6 7
Cara 4 5 10

None of these options stands out. Each person has a different top ranking, and none
of them would be harmed too much by the planning committee choosing any one of
these. However, a decision must be made. More generally, we want a systematic rule
to move from these individual data points to a collective ranking. This is where social
welfare functions come into play. We can briefly consider two common approaches
that we expand upon later in this section:
1. The utilitarian approach ranks alternatives by the total wellbeing they bring to
all members of society. Using this rule in the example above, the system would
rank the proposals as follows:
(1) Green Park, where the total wellbeing is 3 + 7 + 10 = 20.
(2) Education Program, where the total wellbeing is 8 + 6 + 5 = 19.
(3) Health Clinic, where the total wellbeing is 6 + 8 + 4 = 18.

2. On the other hand, the Rawlsian maximin rule prioritizes the least fortunate
person’s wellbeing. It would rank the proposals according to how the person who
benefits the least fares in each scenario. Using the maximin rule, the system would
rank the proposals in this order:
(1) Education Program, where Cara is worst off with a wellbeing of 5.
(2) Health Clinic, where Cara is worst off with a wellbeing of 4.
(3) Green Park, where Ana is worst off with a wellbeing of 3.

Deciding on a social welfare function is important for clear decision-making.
The choice of social welfare function provides a structured quantitative
approach to decisions that can impact everyone. It sets a benchmark for decision-
makers, like our hypothetical AI in city planning, to optimize collective welfare in
a way that is transparent and justifiable. When we have a framework to quantify
society’s wellbeing, we can use it to inform decisions about allocating resources,
planning for the future, or managing risks, among other things.

Social welfare functions can help us guide AIs. By measuring social well-
being, we can determine which actions are better or worse for society. Suppose we
have a good social welfare function and AIs with the ability to accurately estimate
social welfare. Then it might be easier to train these AIs to increase social wellbeing,
such as by giving them the social welfare function as their objective function. Social
welfare functions can also help us judge an AI’s actions against a transparent metric;
for instance, we can evaluate an AI’s recommendations for our city-planning example
by how well its choices align with our social welfare function.
However, there are technical challenges to overcome before this is feasible, such as the
ability to reliably estimate individual wellbeing and the several problems explored
in Chapter 3. Additionally, several normative choices need to be made. What
theory of wellbeing—preference, hedonistic, objective goods—should be the basis of
the social welfare function? Should aggregation be utilitarian or prioritarian? What
else, if anything, is morally valuable besides the aggregate of individual welfare? The
idea of using social welfare functions to guide AIs is promising in theory but requires
more exploration.

6.8.1 Measuring Social Welfare

Overview. In this section, we will consider how social welfare functions work. We’ll
use our earlier city planning scenario as a reference to understand the fundamental
properties of social welfare functions. We will discuss how social welfare functions
help us compare different outcomes and the limitations of such comparisons. Lastly,
we’ll touch on ordinal social welfare functions, why they might be insufficient for our
purposes, and how using additional information can give us a more holistic approach
to determining social welfare.

Social welfare functions take the total welfare distribution as an input
and give us a value of that distribution as an output. We can use the city
planning example above to consider the basic properties of social welfare functions.
The input to a social welfare function is a vector of individuals’ wellbeing values: for
instance, after the construction of a health clinic, the wellbeing vector of the society
composed of Ana, Ben, and Cara would be

WH = (6, 8, 4),

which tells us that the three individuals have wellbeing levels equal to six, eight, and
four. The social welfare function is a rule of what to do with this input vector to give
us one measure of how well off this society is.

The function applies a certain rule to the input vector to generate its
output. We can apply many possible rules to a vector of numbers quantifying
wellbeing that can generate one measure of overall wellbeing. To illustrate, we can
consider the utilitarian social welfare function, which implements a straightforward
rule: adding up all the individual wellbeing values. In the case of our three-person
community, we saw that the social welfare function would add 6, 8, and 4, giving
an overall social welfare of 18. However, social welfare functions can be defined in
numerous ways, offering different perspectives on aggregating individual wellbeing.
We will later examine continuous prioritarian functions, which emphasize improving
lower values of wellbeing. Other social welfare functions might emphasize equality
by penalizing high disparities in wellbeing. These different functions reflect different
approaches to understanding and quantifying societal wellbeing.
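
To make this concrete, the following minimal Python sketch (ours, not from the text) applies two such rules, the utilitarian sum and the Rawlsian maximin rule mentioned earlier, to the wellbeing values in Table 6.2; the data structure and function names are illustrative only.

    # Wellbeing values from Table 6.2: project -> [Ana, Ben, Cara].
    projects = {
        "Health Clinic":     [6, 8, 4],
        "Education Program": [8, 6, 5],
        "Green Park":        [3, 7, 10],
    }

    def utilitarian(wellbeing):
        # Total wellbeing: the sum over all individuals.
        return sum(wellbeing)

    def maximin(wellbeing):
        # Rawlsian rule: the wellbeing of the worst-off individual.
        return min(wellbeing)

    for rule in (utilitarian, maximin):
        ranking = sorted(projects, key=lambda name: rule(projects[name]), reverse=True)
        print(rule.__name__, ranking)
    # utilitarian: Green Park (20) > Education Program (19) > Health Clinic (18)
    # maximin:     Education Program (5) > Health Clinic (4) > Green Park (3)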

Social welfare functions help us compare outcomes, but only within one
function. The basic feature of social welfare functions is that a higher output value
signifies more societal welfare. In our example, a total welfare score of 20 would indi-
cate a society that is better off than one with a score of 18. However, it’s important
to remember that the values provided by different social welfare functions are not
directly comparable. A score of 20 from a utilitarian function, for instance, does not
correspond to the same level of societal wellbeing as a 20 from a Rawlsian maximin
social welfare function, since they apply different rules to the wellbeing vector. Each
function carries its own definition of societal wellbeing, and the choice of social wel-
fare function plays a crucial role in shaping what we perceive as a better or worse
society. By understanding these aspects, we can more effectively utilize social welfare
functions as guideposts for AI behavior and decision-making, aligning AI’s actions
with our societal values.

Some social welfare functions might just need a list ranking the different
choices. Sometimes, we might not need exact numerical values for each person’s
wellbeing to make the best decision. Think of a simple social welfare function that
only requires individuals to rank their preferred options. For example, three people
could rank their favorite fruits in the order “apple, banana, cherry.” From this, we
learn that everyone prefers apples over cherries, but we don’t know by how much
they prefer apples. This level of detail might be enough for some decisions: clearly,
we should give them apples instead of cherries! Such a social welfare function won’t
provide exact measures of societal wellbeing, but it will give us a ranked list of societal
states based on everyone’s preferences.

Voting is a typical example of an ordinal social welfare function. Instead
of trying to estimate each person’s potential wellbeing for each option, the
city planning committee might ask everyone to vote for their top choice. We assume
that each person votes for the option they believe will enhance their wellbeing most.
This relates to the Preference View of Wellbeing, which says that individuals know
what’s best for them, and their choices reflect their wellbeing. Through this voting
process, we can create a list ranking the options by the number of votes each one
received, even if we don’t assign specific numerical values to each option—this is
ordinal information, where all we know are the rankings.
As an illustration, if most people vote for the park, that indicates the park might
contribute the most to overall wellbeing, according to this social welfare function.
The function would look at the number of votes for each option, and the one with
the most votes would be deemed the best choice. This is a simpler approach than
calculating everyone’s wellbeing for each option, so it’s less likely to lead to errors.
However, this type of social welfare function does have its own challenges, such as
the problem highlighted by Arrow’s Impossibility Theorem.
Arrow’s impossibility theorem is one reason we might want more than
ordinal information. In our city planning scenario, Ana would vote for the Ed-
ucation Program, Ben would vote for the Health Clinic, and Cara would vote for the
Green Park. If we tried to aggregate these rankings without any additional informa-
tion, such as the numerical scores assigned to each option, we would not have any
reason to choose one option over the other. This is an example of a general problem:
Arrow’s Impossibility Theorem. In essence, it suggests it’s impossible to create a per-
fect voting system that fulfills a set of reasonable criteria while only using ordinal
information. These criteria include fairness conditions like “non-dictatorship” and
coherence conditions like “if everyone prefers apples to cherries, then the social welfare function
prefers apples to cherries.” Adding information about the strength of preferences is
one solution to Arrow’s Impossibility Theorem. Additionally, if we have access to
such information, using it will mean that we arrive at better answers. Next, we will
consider two classes of social welfare functions that use such information: utilitarian
and prioritarian social welfare functions.
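
To see how little ordinal information gives us here, consider the following small sketch (ours): a plurality vote over each person’s top choice from Table 6.2 yields a three-way tie, so the rule has no basis for picking a winner without information about preference strength.

    from collections import Counter

    # Each person's top-ranked project, read off from Table 6.2 (ordinal information only).
    top_choices = {"Ana": "Education Program", "Ben": "Health Clinic", "Cara": "Green Park"}

    votes = Counter(top_choices.values())
    print(votes)
    # Each option receives exactly one vote: a three-way tie that rankings alone cannot break.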

Utilitarian Social Welfare Functions


Overview. This section is focused on utilitarian social welfare functions—functions
that sum up individual wellbeing to calculate social welfare, in line with the utili-
tarian theory of morality. We’ll start by exploring cost-benefit analysis, a common
social welfare function that draws inspiration from utilitarian reasoning, through the
example of a decision to build a health clinic. While cost-benefit analysis provides a
convenient and quantitative approach to decision-making, we’ll discuss its limitations
in capturing the full ideas of wellbeing. Further, we’ll consider how utilitarian social
welfare functions can foster equity, especially when diminishing returns of resources
are considered and how AI systems, optimized using these functions, could potentially
reduce inequality. Lastly, we’ll discuss how the utilitarian social welfare function,
under certain assumptions, is the only method that satisfies key decision-making
criteria.
Cost-benefit analysis is a popular but crude approximation of utilitar-
ian social welfare. Governments and other decision-makers often use cost-benefit
analysis to ground their decision-making quantitatively. Simply, this involves adding
up the expected benefits of an action or a decision and comparing it to the antic-
ipated costs. We can apply this to whether to build a health clinic that’ll operate
for 10 years. If the benefits outweigh the costs over the considered time span, the
committee may decide to proceed with building the clinic.
TABLE 6.3 A cost-benefit analysis assigns monetary values to all factors and compares the
total benefits with the overall costs.

Item                             Value per Person  People Affected  Frequency of Item        Total Value
Costs                                                                                        $22,500,000
  Construction Expenses          $15,000,000       –                One-time                 $15,000,000
  Staffing and Maintenance       $750,000          –                Every year for 10 years  $7,500,000
Benefits                                                                                     $40,000,000
  Fewer Hospital Visits          $1,000            1,000            Every year for 10 years  $10,000,000
  Increased Employment: Doctors  $100,000          5                Every year for 10 years  $5,000,000
  Increased Employment: Staff    $25,000           20               Every year for 10 years  $5,000,000
  Increased Life Expectancy
  by 1 year                      $20,000           1,000            One-time                 $20,000,000
Net Benefit                                                                                  +$17,500,000

This method allows the government to assess multiple options and choose the one
with the highest net benefit. By doing so, it approximates utilitarian social welfare.
For the health clinic, the committee assigns monetary values to each improvement and
then multiplies this by the number of people affected by the improvement. In essence,
cost-benefit analysis assumes that wellbeing can be approximated by financial losses
and gains and considers the sum of monetary benefits instead of the sum of wellbeing
values. Using monetary units simplifies comparison and limits the range of factors it
can consider.
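
The arithmetic behind Table 6.3 is simple enough to write out directly; the sketch below (ours, using the table’s illustrative figures) multiplies each item’s per-person value by the number of people affected and the number of years it recurs.

    # Each entry: (value per person or item, people affected, years it recurs).
    costs = {
        "Construction Expenses":    (15_000_000, 1, 1),
        "Staffing and Maintenance": (750_000,    1, 10),
    }
    benefits = {
        "Fewer Hospital Visits":         (1_000,   1_000, 10),
        "Increased Employment: Doctors": (100_000, 5,     10),
        "Increased Employment: Staff":   (25_000,  20,    10),
        "Increased Life Expectancy":     (20_000,  1_000, 1),
    }

    def total(items):
        return sum(value * people * years for value, people, years in items.values())

    print(total(costs), total(benefits), total(benefits) - total(costs))
    # 22500000 40000000 17500000, matching the totals in Table 6.3.
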
Cost-benefit analysis is not a perfect representation of utilitarian social
welfare. Money is not a complete measure of wellbeing. While money is easy to
quantify, it doesn’t capture all aspects of wellbeing. For instance, it might not fully
account for the psychological comfort a local health clinic provides to a community.
Additionally, providing income to five doctors who are already high earners might be
less important than employing 20 support staff, even though both benefits sum to
$5,000,000 over the ten years. Cost-benefit analysis lacks this fine-grained considera-
tion of wellbeing. AI systems could, in theory, maximize social welfare functions, con-
sidering a broader set of factors that contribute to wellbeing. However, we largely rely
on cost-benefit analysis today, focusing on financial measures, to guide our decisions.
This brings us to the challenge of improving this method or finding alternatives to
better approximate utilitarian social welfare in real-world decision-making, including
those involving AI systems.

Utilitarian social welfare functions would promote some level of equity.
Usually, additional resources are less valuable when we already have
a lot of them. This is diminishing marginal returns: the benefit from additional food,
for instance, is very high when we have no food but very low when we already have
more than we can eat. Extending this provides an argument for a more equitable
distribution of resources under utilitarian reasoning. Consider taxation: if one indi-
vidual has a surplus of wealth, say a billion dollars, and another has only one dollar,
redistributing a few dollars from the wealthy individual to the less fortunate one may
elevate overall societal wellbeing. This is because the added value of a dollar to the
less fortunate individual is likely very high, allowing them to purchase necessities like
food and shelter, whereas it is likely very low for the wealthy individual.

AI systems optimizing utilitarian social welfare functions might therefore
reduce inequality. Utilitarian AI systems might recognize the diminishing
returns of resource accumulation, leading them to suggest policies that lead to a more
equal distribution of resources. It is important to remember that the utilitarian has
no objection to inequality, except that those with less can better use resources than
those who are already happy. This counters a frequent critique that utilitarianism
neglects inequality. Current methods like cost-benefit analysis may not fully cap-
ture this aspect. Therefore, AI guided by a utilitarian social welfare function might
propose approaches for a more equitable distribution of resources that conventional
methods could overlook.

A utilitarian social welfare function is the only way to satisfy some basic
requirements. Let us reconsider the city planning committee deciding what to
build for Ana, Ben, and Cara. If they all have the same level of wellbeing whether
a new Education Program or a Green Park is built, then it seems right that the
city’s planning committee shouldn’t favor one over the other. Suppose we changed
the scenario a bit. Suppose Ana benefits more from a Health Clinic than an Education
Program, and Ben and Cara don’t have a strong preference either way. It now seems
appropriate that the committee should favor building the Health Clinic.

Harsanyi’s Social Aggregation Theorem. Harsanyi showed that—assuming
the individuals are rational, in the sense of maximizing expected utility—if we want
our social welfare function to make choices like this consistently, we need to use
a model where we add up everyone’s wellbeing [348]. This is the foundation of a
utilitarian social welfare function. Harsanyi’s aggregation theorem proved that it is
the only kind of social welfare function that is always indifferent between options if
everyone is equally happy with them and favors the option that makes someone better
off, as long as it doesn’t make anyone worse off. This has been seen as a compelling
reason to pick utilitarian social welfare functions over other ones.

Prioritarian Social Welfare Functions

Overview. In this section, we will consider prioritarian social welfare functions and
how they exhibit differing degrees of concern for the wellbeing of various individu-
als. We will start by describing prioritarian social welfare functions, which give extra
weight to the wellbeing of worse-off people. This discussion will include some common
misconceptions about prioritarianism and how it contrasts with utilitarianism in
resource allocation. We will focus on two types of prioritarian functions: the Rawlsian
maximin social welfare function and continuous prioritarian functions.
There are two common misunderstandings about prioritarian social wel-
fare functions. Firstly, prioritarians are not focused on reducing inequality itself,
unlike egalitarians. Their main goal is to increase the wellbeing of those who need
it most. They think giving an extra unit of wellbeing to someone with less is more
valuable than giving it to someone already well-off. The level of inequality in a society
doesn’t affect their measure of social welfare, and reducing inequality isn’t their goal
unless it improves individual wellbeing. Secondly, prioritarians are not driven by the
belief that it’s easier to improve the wellbeing of the worst off. They would rather see
benefits go to the worse off even if it costs the same as improving the lives of those
who are better off. This also sets them apart from utilitarians, who might typically
help the worst off because they see more value due to diminishing marginal utility.
The prioritarian approach isn’t about the efficiency of resources but about focusing
resources on those who need them most. Thus, while utilitarian and prioritarian so-
cial welfare functions aim for a better society, they have distinct ways of achieving
this goal.
A special case of prioritarian social welfare functions is the Rawlsian
“maximin” function. The Rawlsian social welfare function takes the idea of pro-
tecting the worse off to the extreme: social welfare is simply the lowest welfare of
anyone in society. It is called the “maximin” function because it seeks to maximize
the minimum wellbeing, or in other words, to ensure that the worst-off individual is
as well off as possible. When we applied the maximin function to the city planning
committee, it decided that the worst option was the Green Park, despite the utili-
tarian social welfare function determining that was the best one. Using the maximin
principle here, we want to go with the choice to ensure the worst-off person is as
happy as possible. When we look at the least happy person for each option, we see
that for the Health Clinic it’s 4 (Cara), for the Education Program it’s 5 (Cara), and
for the Green Park it’s 3 (Ana). So, if we follow the maximin rule, the committee
would choose the Education Program because it would bring up the wellbeing of the person
doing the worst. This way, we ensure we’re looking out for the people who need it
most.
However, this maximin approach has its own problems, like the “grouch” issue. This
is when someone always rates their wellbeing low, no matter what. If we’re always
looking out for the worst-off person, we might end up always catering to the grouch,
even though their low score might not be due to real hardship. It’s important to
remember this when considering how to use the Rawlsian maximin approach.
Continuous prioritarian social welfare functions offer a middle ground.
We might want our social welfare function to account for both the positive effects
of increasing anyone’s welfare and the extra positive effects of increasing the welfare
of someone who isn’t doing well. This is the tradeoff between efficiency and equity.
Many social welfare functions embrace prioritarian principles without going to the
extreme of the maximin function’s concern for equity. This can be achieved using a
social welfare function that shows diminishing returns relative to individual welfare:
the boost it gives to social welfare when any individual’s wellbeing improves is larger
if that person initially had less wellbeing. For example, we could use the logarithmic
social welfare function

W(w1, w2, . . . , wn) = log w1 + log w2 + · · · + log wn,

where W is the social welfare and w1, w2, . . . , wn are the individual wellbeing
values of everyone in society. The logarithmic social welfare function is (ordinally)
equivalent to the Bernoulli-Nash social welfare function, which is the product of
wellbeing values.
Suppose there are three individuals with wellbeing 2, 4, and 16 and we are using log
base 2. Social welfare is

W (2, 4, 16) = log2 (2) + log2 (4) + log2 (16) = 1 + 2 + 4 = 7.

Compare the following two changes: increasing someone from 2 to 4 or increasing
someone from 4 to 8. The first change results in

W (4, 4, 16) = log2 (4) + log2 (4) + log2 (16) = 2 + 2 + 4 = 8.


The second change results in

W (2, 8, 16) = log2 (2) + log2 (8) + log2 (16) = 1 + 3 + 4 = 8.

Even though the second change is larger, social welfare is increased by the same
amount as when improving the wellbeing of the individual who was worse off by
a smaller amount. This highlights the principle that improving anyone’s welfare is
beneficial, and it’s easier to increase social welfare by improving the welfare of those
who are worse off. Additionally, even though the second change doesn’t affect the
worst off, the social welfare function shows society is better off. This approach allows
us to consider both the overall level of wellbeing and its distribution.
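
A short sketch (ours) reproduces this calculation with the log-base-2 convention used in the example above.

    import math

    def log_welfare(wellbeing):
        # Logarithmic (prioritarian) social welfare: the sum of log2 wellbeing values.
        return sum(math.log2(w) for w in wellbeing)

    print(log_welfare([2, 4, 16]))   # 7.0, the baseline
    print(log_welfare([4, 4, 16]))   # 8.0, after raising the worst-off person from 2 to 4
    print(log_welfare([2, 8, 16]))   # 8.0, after raising the middle person from 4 to 8
    # Both changes add exactly 1 to social welfare, even though the second change is twice as large.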

We can specify exactly how prioritarian we are. Let the parameter γ rep-
resent the degree of priority we give worse-off individuals. We can link the Rawlsian
maximin, logarithmic, and utilitarian social welfare functions by using the isoelastic
social welfare function
W(w1, w2, . . . , wn) = (w1^(1−γ) + w2^(1−γ) + · · · + wn^(1−γ)) / (1 − γ).

If we think we should give no priority to the worse-off, then we can set γ = 0: the
entire equation is then just a sum of welfare levels, which is the utilitarian social
welfare function. By contrast, if we were maximally prioritarian, taking the limit of
γ as it gets infinitely large, then we recover Rawls’ maximin function. Similarly, if we
took the limit of γ as it approached 1, we would recover the logarithmic social welfare
function. This function is widespread in economic literature due to this versatility;
among other things, it allows researchers to determine from observed preferences
what a typical person’s parameter γ is. By adopting this most general form, we can
pick exactly how we want to prioritize individuals in society. This enables a principled
way of trading off efficiency and equity.
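
A minimal sketch of the isoelastic family (ours) makes the role of γ explicit: γ = 0 recovers the utilitarian sum, γ near 1 is handled as the logarithmic limit, and large γ makes the ranking track the worst-off person’s wellbeing, as maximin does.

    import math

    def isoelastic_welfare(wellbeing, gamma):
        # Isoelastic social welfare; wellbeing values are assumed positive.
        if math.isclose(gamma, 1.0):
            # The gamma -> 1 limit is (ordinally) the logarithmic social welfare function.
            return sum(math.log(w) for w in wellbeing)
        return sum(w ** (1 - gamma) for w in wellbeing) / (1 - gamma)

    w = [2, 4, 16]
    print(isoelastic_welfare(w, gamma=0))    # 22.0: the plain utilitarian sum
    print(isoelastic_welfare(w, gamma=1))    # about 4.85: the logarithmic (natural log) case
    print(isoelastic_welfare(w, gamma=10))   # a small negative number dominated by the worst-off value
    # For large gamma only comparisons matter: options are ranked almost entirely by the minimum, as in maximin.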

Conclusions About Social Welfare Functions

Using AI to estimate social welfare. Artificial intelligence (AI) might be used
to estimate the inputs and calculate the outputs of social welfare functions to help
us decide between different actions and implement our ideal decision processes. By
processing large amounts of data, AI systems might be able to predict individual
welfare under various scenarios, paving the way for more informed decision-making.
It has been demonstrated, for instance, that context-aware machine learning models
like large language models can be used for predicting disease incidence faster than
conventional epidemiological methods, which can be used to evaluate different public
health proposals from a social welfare perspective. The application of AI can lead to
more refined estimates of social welfare.
Using social welfare functions to train AI. A second connection between
social welfare functions and AI lies in creating AI systems. We can shape AI systems
to generally promote social welfare, ensuring they are aligned with human interests
even when optimizing for distinct goals, such as corporate profit. Particularly in
reinforcement learning, where decision-making is based on rewards or penalties, these
can be tethered to societal wellbeing. Even for other agents, AI behavior can be guided
toward promoting social welfare, aligning AI performance with societal wellbeing and
making AI a compelling tool for social good.
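
As a schematic illustration of this idea (ours, not a recipe from the text), a reinforcement learning reward could be defined as the change in estimated social welfare produced by an action; the wellbeing estimator below is a hypothetical stand-in for whatever model would actually predict individual wellbeing.

    def social_welfare(wellbeing):
        # Any social welfare function from this section could be substituted here;
        # we use the utilitarian sum for simplicity.
        return sum(wellbeing)

    def welfare_reward(previous_state, new_state, estimate_wellbeing):
        # Reward the agent for the change in estimated social welfare its action produced.
        # `estimate_wellbeing` maps a state to predicted individual wellbeing values.
        return social_welfare(estimate_wellbeing(new_state)) - social_welfare(estimate_wellbeing(previous_state))

    # Toy usage with a made-up estimator that reads wellbeing directly off the state.
    estimator = lambda state: state["wellbeing"]
    print(welfare_reward({"wellbeing": [6, 8, 4]}, {"wellbeing": [8, 6, 5]}, estimator))   # 1
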
Using AI to maximize social welfare. Beyond creating AI systems in line
with social welfare, we can steer advanced AI to actively optimize social wellbeing by
using social welfare functions as their objective functions. This harnesses AI’s power
to promote societal welfare effectively. Given AI’s data processing and predictive
capabilities, it can evaluate various strategies to identify what would best increase
social welfare. By aligning AI objectives with social welfare functions, we can develop
beneficial AI systems that not only recognize and understand social wellbeing but
also actively work toward enhancing it. This proactive use of AI bolsters our ability
to build societies where individual wellbeing is optimized and evenly distributed,
reflecting our social preferences.
Challenges. Social welfare functions require choosing a theory of wellbeing. AI
systems that maximize social welfare may also inherit various problems of current
models, such as vulnerability to adversarial attacks and proxy gaming, as well as difficulties
in interpreting the internal logic that led them to make certain decisions (both
discussed in the Single Agent Safety chapter).
Summary. In this section, we discussed social welfare functions: mathematical
functions that tell us how to aggregate information about individual welfare into one
society-wide measure. We considered why social welfare functions might be relevant
in the context of decision making by beneficial AI systems and explored a few prop-
erties that all social welfare functions possess. We analyzed how governments today
use cost-benefit analysis as an approximation to social welfare functions and why
that falls short of what we desire. The main point of this section was to document
and understand some of the most common social welfare functions—the utilitarian
function, the Rawlsian maximin function, and continuous prioritarian functions—and
how they differ in their levels of concern for the wellbeing of different individuals,
and can be represented within the framework of a general isoelastic social welfare
function.
We think that the use of AI can help estimate social welfare inputs and calculate
outputs, and AI systems can be shaped to generally promote social welfare and
actively optimize it using social welfare functions as objective functions. Ultimately,
the proactive use of AI bolsters our ability to build societies where individual well-
being is both optimized and evenly distributed, reflecting our social preferences.

6.9 MORAL UNCERTAINTY

6.9.1 Making Decisions Under Moral Uncertainty

This section considers how we can make decisions when we are unsure which moral
view is correct, and what this might imply for how we should design AI systems.
Although ignoring our uncertainty may be a comfortable approach in daily life, there
are situations where it is crucial to identify the best decision. We will start by con-
sidering our uncertainties about morality and the idea of reasonable pluralism, which
acknowledges the potential co-existence of multiple reasonable moral theories such
as ethical theories, common-sense morality, and religious teachings. We will explore
uncertainties about moral truths, why they matter in moral decision-making for both
humans and AI, and how to deal with them. We will look at a few proposed solu-
tions, including My Favorite Theory, Maximize Expected Choice-Worthiness (MEC),
and Moral Parliament [349]. These approaches will be compared and evaluated in
terms of their ability to help us make moral decisions under uncertainty. We will then
briefly explore how we might use one of these solutions, the idea of a Moral Parlia-
ment, to enable AI systems to capture moral uncertainty in their decision-making.

Dealing with Moral Uncertainty

Moral theories. Moral theories are systematic attempts to provide a general account
of moral principles that apply universally. Good moral theories should provide a
coherent, consistent framework for determining whether an action is right or wrong.
A basic background understanding of some of the most commonly held moral theories
provides a useful foundation for thinking about the kinds of goals or ideals that we
wish AI systems to promote. Without this background, there is a risk that developers
and users of AI systems may jump to conclusions about these topics with a false sense
of certainty and without considering many potential considerations that could change
their decisions. It would be highly inefficient for those developing AI systems or trying
to make them safer to attempt to re-invent moral systems, without learning from the
large existing body of philosophical work on these topics. We do not have space here
to explore moral theories in depth, but we provide suggestions in the Recommended
Reading section for those looking for a more detailed introduction to these topics.
There are many different types of moral theories, each of which emphasizes different
moral values and considerations. Consequentialist theories like utilitarianism hold
that the morality of an action is determined by its consequences or outcomes. Util-
itarianism places an emphasis on maximizing everyone’s wellbeing. Utilitarianism
claims that consequences (and only consequences) determine whether an action is
right or wrong, that wellbeing is the only intrinsic good, that everyone’s wellbeing
should be weighed impartially, and that we should maximize wellbeing.
By contrast, under deontological theories, some actions (like lying or killing) are
simply wrong, and they cannot be justified by the good consequences that they
might bring about. Deontology is the name for a family of ethical theories that deny
that the rightness of actions is solely determined by their consequences. Deontological
theories are systems of rules or obligations that constrain moral behavior [350]. The
term deontology encompasses religious ethical theories, non-religious ethical theories,
and principles and rules that are not part of theories at all. These theories give
obligations and constraints priority over consequences.
Other moral theories may emphasize other values. For example, social contract theory
(or contractarianism) focuses on contracts—or, more generally, hypothetical agree-
ments between members of a society—as the foundation of ethics. A rule such as “do
not kill” is morally right, according to a social contract theorist, because individuals
would agree that the adoption of this rule is in their mutual best interest and would
therefore insert it into a social contract underpinning that society.

Moral uncertainty requires us to consider multiple moral theories.
Individuals may have varying degrees of belief in different moral theories, known as
credence.

Credence is the probability an individual assigns to the chance that a
theory is true. Someone may have a high degree of credence (70%) in utilitarian-
ism, meaning they believe that maximizing utility is likely to be the most important
moral principle. However, they may also have some credence in deontological rules,
believing they are plausible but less likely to be true than utilitarianism (30%). Since
ethical theories are not flawless, often arriving at intuitively questionable conclusions,
it is valuable to consider multiple perspectives when making moral decisions.

Living a good life can require a combination of insights from multiple
moral theories. We often find ourselves balancing different kinds of deeds: creat-
ing pleasure for people, respecting autonomy, and adhering to societal moral norms.
Abiding by a reasonable pluralism means accepting that moral guidance from different
sources may conflict while still offering value.

In high-stakes moral decisions, such as in healthcare or AI design, reasonable
pluralism alone may fall short. When a healthcare administrator
faces the tough choice of allocating limited resources between hospitals, we might
want them to seriously consider the ethics of what they are doing rather than just
go with conflicting wisdom that seems reasonable at first glance. Therefore, we must
think hard about how to make decisions under moral uncertainty—seeking truth for
crucial decisions is vital.
If AI is unable to account for moral uncertainty, harmful outcomes are
likely. As AI systems become increasingly advanced, they are likely to become
better at optimizing for the goals that we set them, finding more creative and powerful
solutions to achieve these. However, if they pursue very narrowly specified goals, there
is a risk that these solutions come at the expense of other important values that we
failed to adequately include as part of their goals.
Moral disagreement is a reason for making AI systems cautious. Given
that philosophers have not yet converged on a single consistent moral theory that
can take account of all relevant arguments and intuitions, it seems important for us
to design AI systems without encoding a false sense of certainty about moral issues
that could lead to unintended consequences. To counter such problems, we need AI
systems to recognize that a broad range of moral perspectives might be valid—in other
words, we need AI systems to acknowledge moral uncertainty. This presents another
challenge: how should we rationally balance the recommendations of different ethical
theories? This is the question of moral uncertainty.

How Should We Approach Moral Uncertainty?


There are several potential solutions to moral uncertainty. Faced with ethical
uncertainty, we can turn to systematic approaches, using our estimates of how theories
judge different actions and how likely these theories are. This section explores three
potential solutions to moral uncertainty. The first is adopting a favored moral theory
that aligns with personal beliefs (My Favorite Theory). The second is aiming for the
highest average moral value by calculating the expected choice-worthiness of each
option (Maximize Expected Choice-Worthiness). The third is treating the decision as
a negotiation in a parliament, considering multiple moral views to find a mutually
acceptable solution (Moral Parliament).
Consider whether we should lie to save a life. Imagine that a notorious murderer asks
Alex where his friend, Jordan, is. Alex knows that revealing Jordan’s location will
likely lead to his friend’s death, while lying would save Jordan’s life. However, lying
is morally questionable. Alex must decide which action to take. He is unsure, and
considers the recommendations of the three moral theories he has some credence in:
utilitarianism, deontology, and contractarianism. Alex thinks utilitarianism, which
values lying to save a life highly, is the most likely to be true: he has 60% credence in
it. Deontology, which Alex has 30% credence in, strongly disapproves of lying, even
to save a life, and contractarianism, which Alex has 10% credence in, moderately
approves of lying in this situation. This information is represented in Table 6.4.

TABLE 6.4 Example: Alex’s credence in various theories and their evaluation of lying to
save a life.

                                                  Utilitarianism  Deontology  Contractarianism
What is Alex’s estimate of the chance
this theory is true?                              60%             30%         10%
Does this theory like lying to save a life?       Yes             No          Yes

Under My Favorite Theory (MFT), Alex would pick whatever utilitarianism
recommends. Alex thinks that utilitarianism is the most likely to be true.
The MFT approach is to follow the prescription of the theory we believe is the closest
to a moral truth. Intuitively, many people do this already when thinking about moral-
ity. The advantage of MFT is its simplicity: it is relatively simple and straightforward
to implement. It does not require complex calculations or a detailed understanding
of different moral perspectives. This approach can be useful when the level of moral
uncertainty is low, and it is clear which theory or option is the best choice.

However, following MFT can lead to harmful single-mindedness or overconfidence.
It can be difficult to put aside personal biases or to recognize when
one’s own moral beliefs are fallible (individuals tend to defend and rationalize their
chosen theories). The key issue with MFT is that it can discard relevant informa-
tion, such as when the credences in two theories are close, but their judgments of an
action vastly differ [352]. Imagine having 51% credence in contractarianism, which
mildly supports lying to save a life, and 49% credence in deontology, which views
it as profoundly immoral. MFT suggests following the marginally favored theory,
even though the potential harm, according to the second theory, is much larger.
This seems counterintuitive, indicating that MFT might not always provide the most
sensible approach for navigating moral uncertainty.

Maximize Expected Choice-Worthiness (MEC) gives us a procedure to follow.
MEC tells us that to determine how to act, we need to do the following
two things:
1. Determine choice-worthiness. Choice-worthiness is a measure of the overall
desirability or value of an option—in this context, it is how morally good a choice
is. This is an expression of the size of the moral value of an action. We can
represent the choice-worthiness of an action as a number. For instance, we might
think that under utilitarianism, the choice-worthiness of murder could be −1000,
that of littering could be −2, that of helping an old lady cross the street could be
+10, and that of averting existential risk could be +10000.
2. Multiply choice-worthiness by credence. We can consider the average choice-
worthiness of an action, weighted by how likely we think each theory is to be
true. This gives us a sense of our best guess of the average moral value of an ac-
tion, much like how we considered expected utility when discussing utilitarianism.
Table 6.5 has each theory’s choice-worthiness value for lying to save a life. As
before, utilitarianism highly values lying to save a life (+500), deontology strongly
disapproves of it (−1000), and contractarianism moderately approves of it (+100).
In the row beneath these values are the credence probability-weighted judgments
for Alex. The total calculation is
(60/100) · 500 + (30/100) · (−1000) + (10/100) · 100 = 300 − 300 + 10 = 10.
Under MEC, Alex would choose to lie, because given Alex’s credence in each moral
theory and his determination of how each moral theory judges lying to save a life,
lying has a higher expected choice-worthiness. Alex would lie because he judges that,
on average, lying is the best possible action.

TABLE 6.5 Example: Alex’s credence in various theories, their evaluation of lying to save
a life, and their probability-weighted contribution to the final judgment.

                                                  Utilitarianism  Deontology  Contractarianism
What is Alex’s estimate of the chance
this theory is true?                              60%             30%         10%
How much does this theory like lying
to save a life?                                   +500            −1000       +100
What is the probability-weighted judgment?        +300            −300        +10
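
The MEC calculation, with the MFT choice shown for comparison, takes only a few lines; this sketch (ours) hard-codes the illustrative credences and choice-worthiness values from Tables 6.4 and 6.5.

    # Alex's credences and each theory's choice-worthiness for lying to save a life.
    credences = {"utilitarianism": 0.6, "deontology": 0.3, "contractarianism": 0.1}
    choice_worthiness = {"utilitarianism": 500, "deontology": -1000, "contractarianism": 100}

    # My Favorite Theory: follow whichever theory has the highest credence.
    favorite = max(credences, key=credences.get)
    print("MFT follows:", favorite)   # utilitarianism, which favors lying

    # Maximize Expected Choice-Worthiness: the credence-weighted average of choice-worthiness.
    expected = sum(credences[theory] * choice_worthiness[theory] for theory in credences)
    print("Expected choice-worthiness of lying:", expected)   # 300 - 300 + 10 = 10, so MEC also favors lying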

MEC gives us a way of balancing how likely we think each theory is with
how much each theory cares about our actions. We can see that utilitarian-
ism and deontology’s relative contributions to the total moral value cancel out, and
we are left with an overall “+10” in favor of lying to save a life. This calculation tells
us that—when accounting for how likely Alex thinks each theory is to be true and
how strong the theories’ preferences over his actions are—Alex’s best guess is that
this action is morally good. (Although, since these numbers are rough and the final
margin is quite thin, we would be wary of being overconfident in this conclusion: the
numbers do not necessarily represent anything true or precise.) MEC has distinct
advantages. For instance, unlike MFT, we ensure that we avoid actions that we think
are probably fine but might be terrible, since large negative choice-worthiness from
some theories will outweigh small positives from others. This is a sensible route to
take in the face of moral uncertainty.

However, MEC faces challenges when comparing theories. While some
cases are neatly solved by MEC, its philosophical foundations are questionable. Our
assignment of choice-worthiness can be arbitrary: virtue ethics, for instance, advises
acting virtuously without clear guidelines. MEC is unable to deal with “ordinal” the-
ories that only rank actions by their moral value rather than explicitly judging how
morally right or wrong they are, making it difficult to determine choice-worthiness.
Absolutist theories, like extreme Kantian ethics, deem certain actions absolutely
wrong. We had initially assigned −1000 for the value of lying to save a life for a
deontologist; it is unclear what we could put that would capture such an absolutist
view. If we considered this to be infinitely bad, which seems like an accurate rep-
resentation of the view, then it would overwhelm any other non-absolutist theory.
Even if we think it is simply very large, these firm stances can be more forceful than
other ethical viewpoints, despite ascribing a low probability to the theory’s correct-
ness. This is because even a small percentage of a large value is still meaningfully
large; consider that 0.01% of 1,000,000 is still 100—a figure that may outweigh other
theories we deem more probable.

Following a moral parliament approach, Alex could consider the proposal
to lie more thoroughly and make a considered decision. In the moral par-
liament, imagined delegates representing different moral theories negotiate to find the
best solution. In Alex’s moral parliament, there would be 60 utilitarian delegates, 30
deontological delegates, and 10 contractarian delegates—numbers proportional to his
credence in each theory. These delegates would negotiate and then come to a final
decision by voting. Drawing inspiration from political systems, the moral parliament
allows an agent to find flexible recommendations that enable compromises among
plausible theories.
Depending on the voting rule, moral parliament can lead to different outcomes. In
a conventional setting with majority rule, the utilitarians in Alex’s moral parlia-
ment would always be able to push through their decisions. To avoid such outcomes,
philosophers recommend using proportional chances voting. Here, an action is taken
with a probability equal to its vote-share: in this case, if no one changed their mind,
then the outcome of the parliament would recommend with 70% probability that Alex
lie and with 30% probability that he tells the truth, since only the 30 deontological
delegates would vote against this proposal. This encourages negotiation even if there
is already a majority, which naturally leads to more cooperative outcomes. This is
a more intuitive approach than, for instance, assigning choice-worthiness values and
multiplying things out, even if it does take more effort.
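
A small sketch of proportional chances voting (ours): each option is chosen with probability equal to its share of delegate votes, here using the 70–30 split that results if no delegate in Alex’s parliament changes their mind.

    import random

    # Final delegate votes: 60 utilitarians + 10 contractarians for lying, 30 deontologists against.
    votes = {"lie": 70, "tell the truth": 30}

    def proportional_chances_vote(votes):
        # Select an outcome with probability equal to its vote share.
        options = list(votes)
        weights = [votes[option] for option in options]
        return random.choices(options, weights=weights, k=1)[0]

    print(proportional_chances_vote(votes))   # "lie" with 70% probability, "tell the truth" with 30%
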
However, it is difficult to see what a moral parliament might recommend.
We can envision a variety of different possible outcomes in Alex’s case. We might have
the simple outcome described above, where no one changes their mind. Or, we might
think that the deontologists can convince the contractarians that lying is bad in this
case because lying would not be tolerated behind the veil of ignorance, reducing the
chance of lying to 60%. Or, the parliament might even propose something entirely
new, such as a compromise in which Alex does not explicitly lie but simply omits the
truth. All of these are reasonable—it is difficult to choose between them.

The outcome of the moral parliament is not determined externally. Instead,
it is a matter of our imagination, subject to our biases. Often, individuals resist
imagining things that contradict their preconceived notions. This means that moral
parliament may not be very helpful for individuals thinking about what is right to do
in practice. With enough resources, however, we might be able to simulate model par-
liaments to recommend such decisions for us, such as by hiring diplomats to represent
moral positions and having them bargain—or by assigning moral views to multiple
AI systems and having them come to a collective decision.

6.9.2 Implementing a Moral Parliament in AI Systems


AIs might use simulated moral parliaments to account for moral uncer-
tainty. If we want AIs to be guided by moral theories while accounting for moral
uncertainty, we might use AI systems to simulate a moral parliament. We could imag-
ine using advanced AI systems to emulate these representatives by training AIs to act
in accordance with a specific moral theory. This would enable us to run moral parlia-
ments artificially, permitting real-time decision-making. Just like in real parliaments,
we would have delegates—AIs instructed to represent certain moral theories—get to-
gether, discuss what to do, negotiate favorable outcomes, and then vote on what to
recommend. We could use the output of this moral parliament to instruct a separate
AI in the real world to take certain actions over others.
This is speculative and might still face problems; for instance, AIs might have insuf-
ficient understanding of our moral theories. However, these problems could become
more tractable with advanced AI systems. Assuming we have this ability, using a
moral parliament might be an attractive solution to getting AIs to act in accordance
with human values in real time.
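
The following sketch shows one way the structure just described could be wired together. The delegate functions here are trivial rule-based stand-ins for the AI representatives discussed above, and the outcome features and scores are invented for illustration; this is not a proposal for how such delegates would actually be built.

```python
# Hedged sketch of a simulated moral parliament: delegate blocs score actions,
# vote for their preferred action, and the final choice uses proportional chances.
import random

def utilitarian_delegate(action, outcomes):
    # Prefers whichever action maximizes total wellbeing (illustrative scoring).
    return outcomes[action]["total_wellbeing"]

def deontological_delegate(action, outcomes):
    # Heavily penalizes actions that violate a duty (e.g., lying).
    return -100 if outcomes[action]["violates_duty"] else 0

def run_parliament(actions, outcomes, delegates, rng=random.Random(0)):
    votes = {a: 0 for a in actions}
    for delegate, seats in delegates:
        scores = {a: delegate(a, outcomes) for a in actions}
        votes[max(scores, key=scores.get)] += seats  # each bloc backs its top action
    chosen = rng.choices(actions, weights=[votes[a] for a in actions])[0]
    return chosen, votes

actions = ["lie", "tell the truth"]
outcomes = {
    "lie":            {"total_wellbeing": 10, "violates_duty": True},
    "tell the truth": {"total_wellbeing": 6,  "violates_duty": False},
}
delegates = [(utilitarian_delegate, 70), (deontological_delegate, 30)]
choice, tally = run_parliament(actions, outcomes, delegates)
print(tally, "->", choice)
```
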

Moral parliaments could be useful for representing stakeholders, not just
theories. While we have explored the traditional moral parliament method of rep-
resenting moral theories, we can generalize beyond this. Instead of representing the-
ories, it might be more appropriate to represent stakeholders; for instance, in a de-
cision about public transport, we could emulate representatives for local residents,
commuters, and environmental groups, all of whom have an interest in the outcome.
Using a generalized moral parliament for decision-making in AI helps ensure that
all relevant perspectives are taken into account. In contrast to traditional
methods that focus on representing different moral theories, stakeholder representa-
tion prioritizes the views of those directly affected by the AI’s decisions. This could
enhance the AI’s understanding of the intricate human social dynamics involved in
any given situation.

6.9.3 Advantages of a Moral Parliament


Using moral parliaments presents a wide array of benefits relative to just giving
AIs certain sets of values directly. In this subsection, we will explore how they are
customizable, transparent, robust to bugs and errors, adaptable to changing human
values, and pro-negotiation.

Customizable moral parliaments are diverse and scalable. The general-
ized moral parliament can accommodate a wide variety of stakeholders, ranging from
individual users to large corporations, and from local communities to global societies.
By emulating a large set of stakeholders, we can ensure that a diverse set of views are
represented. This allows AIs to effectively respond to a wide range of scenarios and
contexts, providing a robust framework for ensuring AI decisions reflect the values,
interests, and expectations of all relevant stakeholders. Additionally, moral parlia-
ments are scalable: if we are concerned about a lack of representation, we can simply
emulate more stakeholders. By grounding AI decision-making in human perspectives
and experiences, we can create AI systems that are not only more ethical and fair
but also more effective and beneficial for society as a whole.

Transparency is another key benefit of a moral parliament. As it stands,
automated decision-making is opaque: we rarely understand why AIs make the de-
cisions they do. However, since the moral parliament gives us a clear mechanism of
representing and weighing different perspectives, it allows stakeholders to understand
the basis of an AI’s decision-making. We could, for instance, enforce that AIs keep
records of simulated negotiations in human languages and then view the transcripts.
Using moral parliaments provides insights into how different moral considerations
have been weighed against each other, making the decision-making process of an AI
more transparent, explainable, and accountable.

Moral parliaments may result in AI systems that are less fragile and less
prone to over-confidence. If we are sure that utilitarianism is the correct moral
view, we might be tempted to create AIs that maximize wellbeing—this seems clean
and elegant. However, having a diverse moral parliament would make AIs less likely
to misbehave. By having multiple parliament members, we would achieve redundancy.
This is a common principle in engineering: always include extra components that
are not strictly necessary for functioning, in case other components fail (redundancy is
explored further in the Safety Engineering chapter). We would do this to avoid failure
modes where we were overconfident that we knew the correct moral theory, such as
lying and stealing for the greater good, or just to avoid poor implementation from AIs
optimizing for one moral theory. For example, a powerful AI told that utilitarianism
is correct might implement utilitarianism in a particular way that is likely to lead to
bad outcomes. Imagine an AI that has to evaluate millions of possibilities for every
decision it makes. Even with a small error rate, the cumulative effect could lead the
AI to choose risky or unconventional actions. This is because, when evaluating so
many options, actions with high variance in moral value estimation may occasionally
appear to have significant positive value. The AI could be more inclined to select
these high-risk actions based on the mistaken belief that they would yield substantial
benefits. For instance, an AI following some form of utilitarianism might use many
resources to create happy digital minds—at the expense of humanity—even if that is
not what we humans think is morally good.
This is similar to the Winner’s Curse in auction theory: those that win auctions of
goods with uncertain value often find that they won because they overestimated the
value of the good relative to everyone else; for instance, when bidding on a bag of
coins at a fair, people who overestimate how many coins there are will be more likely
to win. Similarly, the AI might opt for actions that, in hindsight, were not truly
beneficial. A moral parliament can make this less likely, because actions that would
be judged morally extreme by most humans also wouldn’t be selected by a diverse
moral parliament.
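
The selection effect described above can be shown with a short simulation. This is an illustrative sketch in Python, not a model from the text: every option is stipulated to be morally neutral, and the noise level, option count, and parliament size are made-up numbers.

```python
# Illustrative "winner's curse" simulation: picking the best-looking option among
# many noisy moral-value estimates selects an overestimated option. Averaging the
# estimates of several independent evaluators (a crude stand-in for a diverse
# parliament) shrinks that bias.
import random
import statistics

rng = random.Random(0)
NUM_OPTIONS = 10_000   # options evaluated for a single decision
TRUE_VALUE = 0.0       # every option is in fact morally neutral
NOISE = 1.0            # standard deviation of each evaluator's estimation error

def apparent_value_of_winner(num_evaluators):
    """Highest estimated value across all options (their true value is always 0)."""
    best = float("-inf")
    for _ in range(NUM_OPTIONS):
        estimate = statistics.fmean(
            TRUE_VALUE + rng.gauss(0, NOISE) for _ in range(num_evaluators)
        )
        best = max(best, estimate)
    return best

print(f"single evaluator:      {apparent_value_of_winner(1):.2f}")   # large overestimate
print(f"ten-evaluator average: {apparent_value_of_winner(10):.2f}")  # much closer to 0
```
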
The process of considering a range of theories inherently embeds redundancy and
cross-checking into the system, reducing the probability of catastrophic outcomes
arising from a single point of failure. It also helps ensure that AI systems are robust
and resilient, capable of handling a broad array of ethical dilemmas.

Moral parliaments encourage compromise and negotiation. In real-life
parliaments, representatives who hold different opinions on various issues often engage
in bargaining, compromise, and cooperation to reach agreeable outcomes and find
common ground. We want our AIs to achieve similar outcomes, such as ones that
are moderate instead of extreme. Ideally, we want AIs to select outcomes that many
moral theories and stakeholders all like, rather than being forced to trade off between
them.
In particular, we might want to design our moral parliaments in specific ways to
encourage this. One such feature is proportional chances voting, in which each option
gets a chance of winning that is proportional to the number of votes it receives—if a
parliament is 60/40 split on a proposal, then the AI would do what’s recommended
60% of the time rather than just going with the majority. This setup motivates the
representatives to come together on options that are compromises rather than sticking
to their own viewpoints rigidly. They want to do this to prevent any option they see
as extremely bad from having any chance of winning. This provides a robust high-level
principle for guiding AI behavior, reducing the risk of extreme outcomes and fostering
a more balanced, nuanced approach to ethical decision-making.

Using a moral parliament reduces the risk of overlooking or locking in
certain values. The moral parliament represents an approach to ethical decision-
making in AI that is distinctively cosmopolitan, in that it encompasses a broad
range of moral theories or stakeholders. It ensures that many ethical viewpoints are
considered. This wider view is helpful in dealing with moral problems and tough
decisions AI systems may run into, by making sure that all important considerations
are thought over in a balanced way. AIs using moral parliaments are less likely to
ignore values that matter to different groups in society.
Further, the moral parliament allows the representation of human values to grow
and change over time. We know that moral views change over time, so we should be
humble about how much we know about morality: there might be important things
we don’t yet understand that could help us get closer to the truth about what is
right and wrong. By regularly using moral parliaments, AI systems can keep up with
current human values, rather than sticking to the old values that were defined when
the AI was created. This keeps AI up-to-date and flexible, and prevents it from acting
based on outdated or irrelevant values that are locked into the system.

Challenges. Deciding which ethical theories to include in the moral parliament
could be a challenging task. There are numerous ethical frameworks, and selecting a
representative set may be subjective and politically charged. The decision procedure
used to assign appropriate weights to different ethical theories and aggregate their
recommendations in ways that reflect their importance could also be contentious.
Different stakeholders may have varying opinions on how to prioritize these theories.
Moreover, ethical theories can be subject to interpretation and may have nuanced
implications. Advanced AI systems would need to be able to accurately understand
and apply these theories in order to use a moral parliament.

Conclusions about Moral Uncertainty

Taking moral uncertainty into account is difficult but important. As AI
systems become increasingly embedded across various parts of society and take more
consequential decisions, it will become more important to ensure that they can handle
moral uncertainty. The same AI systems may be used by a wide variety of people
with different moral theories, who would demand that AI acts as far as possible in
ways that do not violate their moral views. Incorporating moral uncertainty into
AI decision-making could reduce the probability of taking actions that are seriously
wrong under some moral theories.
In this section, we explored ways of moving beyond intuitive judgments or a reason-
able pluralism of theories, examining three solutions to moral uncertainty: my favorite
theory, maximize expected choice-worthiness, and moral parliament. Each approach
has its strengths and limitations, with the choice depending on the situation, level of
uncertainty, and personal preferences.
We also discussed how we might use AIs to operationalize moral parliaments, whether
in the original form of representing moral theories or generalized to representing
stakeholders for any given issue. We highlighted the advantages of using a moral
parliament, such as reducing the risk of overlooking or locking in certain values,
allowing for the representation of changing human values over time, and increasing
transparency and accountability in AI decision-making. We also noted that moral
parliaments encourage compromise and negotiation, leading to more balanced and
nuanced ethical decisions.
It is important to recognize that there might not be a one-size-fits-all solution to
ethical dilemmas. As AI technology progresses, we should work further to develop
robust methods for addressing moral uncertainty that can include diverse moral
perspectives and quantify uncertainty.

6.10 CONCLUSION

Overview. In this chapter, we have explored various ways in which we can embed
ethics into AI systems, ensuring that they are safe and beneficial. It is far from
guaranteed that the development of AIs will lead to socially beneficial outcomes. By
default, AIs are likely to be developed according to businesses’ economic incentives
and are likely to follow parts of the law. This is insufficient. We almost certainly need
stronger protections in place to ensure that AIs behave ethically. Consequently, we
discussed how we can ensure AIs prioritize aspects of our wellbeing by making us
happy, helping us flourish, and satisfying our preferences. Supposing AIs can figure
out how to promote individual wellbeing, we explored social welfare functions as a
way to guide their actions in order to help improve wellbeing across society.

We should strive to make our views less contradictory. We have considered
several possible perspectives on how to ensure AIs act ethically. As we have seen, there
are deep tensions between many plausible views, such as “AIs should do what you
choose” versus “AIs should do what you want” versus “AIs should do what makes
you happy” and so on. It is quite difficult to resolve these tensions and choose how
to best represent human values; however, before we deploy powerful AI systems, we
must do so anyway.

As a baseline, we want AIs to follow the law. At the very least, we should
require that AIs follow the law. This is imperfect: as we have seen, the law is insuf-
ficiently comprehensive to ensure that AI systems are safe and beneficial. Laws have
loopholes, are occasionally unethical and unrepresentative of the population and are
often silent on doing good in ways AIs should be required to do. However, if we can
get AIs to follow the law, then we are at least guaranteed that they refrain from
the illegal acts—such as murder and theft—that human societies have identified and
outlawed. In addition, we might want to support regulation that ensures that AI
decision-making must be fair—once we understand what fairness requires.

AIs could increase human wellbeing in accordance with a social welfare
function. Which conception of human wellbeing best matches reality and how
we should distribute it are difficult questions that we have certainly not resolved
within this chapter. However, if we had to guess, we might want AIs to optimize
continuous prioritarian social welfare functions where individual wellbeing should be
based on happiness or objective goods, ensuring that wellbeing is fairly distributed
throughout a society in which everyone is happy to live. We might use AIs to estimate
general-purpose wellbeing functions and directly increase what we observe makes
people better off. While this is speculative, the rapid development of AIs forces us to
speculate.
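
To illustrate what a continuous prioritarian social welfare function involves, here is a minimal sketch in Python. The square-root transform and the wellbeing numbers are illustrative assumptions; the text does not commit to any particular functional form.

```python
# Sketch of a continuous prioritarian social welfare function: individual wellbeing
# levels are passed through a concave transform before being summed, so gains to
# the worse-off count for more.
import math

def prioritarian_welfare(wellbeings, transform=math.sqrt):
    """Sum of concavely transformed wellbeing levels."""
    return sum(transform(w) for w in wellbeings)

# Same total wellbeing (20), different distributions:
equal   = [10, 10]
unequal = [19, 1]
print(prioritarian_welfare(equal))    # ~6.32
print(prioritarian_welfare(unequal))  # ~5.36: the more equal society scores higher
```
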

Further work is needed on how to embed ethics into AI systems. As we
move forward, it is crucial that we continue to engage in rigorous research, open di-
alogue, and interdisciplinary collaboration to address the ethical concerns associated
with AI. By doing so, we can strive toward creating AI systems that not only avoid
the worst harms to society but actively work toward enhancing social wellbeing. In
the next part of this book, we will move to the problem of how to ensure positive
outcomes in a world with multiple AI agents and many, often competing, human
stakeholders.

6.11 LITERATURE

6.11.1 Recommended Reading


• Wendell Wallach and Colin Allen. Moral Machines: Teaching Robots Right from
Wrong. Oxford University Press, Feb. 2009. isbn: 9780195374049. doi:
10.1093/acprof:oso/9780195374049.001.0001. url:
https://doi.org/10.1093/acprof:oso/9780195374049.001.0001
• John J. Nay. Law Informs Code: A Legal Informatics Approach to Aligning
Artificial Intelligence with Humans. 2023. arXiv: 2209.13020 [cs.CY]
• Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine
Learning: Limitations and Opportunities. fairmlbook.org, 2019. url:
http://www.fairmlbook.org
• Richard Layard and Jan-Emmanuel De Neve. Wellbeing: Science and Policy. May
2023. isbn: 9781009298926. doi: 10.1017/9781009298957
• Michael J. Sandel. What Money Can’t Buy: The Moral Limits of Markets. Farrar,
Straus and Giroux, 2012
• Dan Hendrycks et al. What Would Jiminy Cricket Do? Towards Agents That
Behave Morally. 2022. arXiv: 2110.13136 [cs.LG]
• M.D. Adler. Measuring Social Welfare: An Introduction. Oxford University Press,
2019. isbn: 9780190643027. url: https://books.google.com/books?id=_GitDwAAQBAJ
• Katarzyna de Lazari-Radek and Peter Singer. Utilitarianism: A Very Short
Introduction. Oxford University Press, 2017. isbn: 9780198728795. doi:
10.1093/actrade/9780198728795.001.0001. url:
https://doi.org/10.1093/actrade/9780198728795.001.0001
• S. Kagan. Normative Ethics. Dimensions of philosophy series. Avalon Publishing,
1998. isbn: 9780813308456. url: https://books.google.com/books?id=iO8TAAAAYAAJ
• Toby Newberry and Toby Ord. The Parliamentary Approach to Moral Uncertainty.
Technical Report 2021-2, Future of Humanity Institute, University of Oxford, 2021
• Stephen Darwall. Deontology. Blackwell Publishers, Oxford, 2003
CHAPTER 7

Collective Action Problems

7.1 MOTIVATION

Introduction

In the chapters “Complex Systems” and “Safety Engineering,” we considered AI
risks that arise not only from the technologies themselves but from the broader social
contexts in which they are embedded. In this chapter, we extend our exploration of
these systemic risks by exploring how the collective behavior of a multi-agent system
may not reflect the interests of the agents that comprise it. The agents may produce
conditions that none of them wants, even when every one of them has the same
goals and priorities. In the words of economics Nobel laureate Thomas Schelling,
“Micromotives do not equal macrobehavior” [363]. Let us explore this idea using
some examples.

Example: traffic jams. Consider a traffic jam, where the only obstacle to each
motorist is the car in front. Everyone has the same goal, which is to reach their des-
tination quickly. Since nobody wants to be stuck waiting, the solution might appear
obvious to someone unfamiliar with traffic: everyone should simply drive forward,
starting at the same time and accelerating at the same rate. And yet, without exter-
nal synchronization, achieving this preferable state is impossible. All anyone can do
is start and stop in response to each other’s starting and stopping, inching toward
their destination slowly and haltingly.

Example: tall forests [364]. In the Rockefeller forest of Northern California,
the trees are more than 350 feet tall, on average. We can model these trees as agents
competing for sunlight access. The taller a tree is, the more sunlight it can access, as
its leaves are above its neighbors’. However, there is no benefit to being tall other than
avoiding being overshadowed by other trees. In fact, growing so tall costs each tree
valuable resources and risks their structural integrity failing. If all the trees were 200
feet shorter, each tree would occupy the same position in the competition as they do
and each would get the same amount of sunlight as they do, but with greatly reduced
growing costs. All the trees would profit from this arrangement. However, as there is
no way to impose such an agreement between the trees, each races its neighbor ever
higher, and all pay the large costs of growing so tall.

Example: excessive working hours [365]. People often work far longer hours
than they might ideally like to, rather than taking time off for their other interests,
in order to be competitive in their field. For instance, they might be competing
for limited prestigious positions within their field. In theory, if everyone in a given
field were to reduce their work hours by the same amount, they could all free up
time and increase their quality of life while maintaining their relative position in the
competition. Each person would get the work outcome they would have otherwise,
and everyone would benefit from this freed-up time. Yet no one does this, because if
they alone were to decrease their work efforts, they would be out-competed by others
who did not.

Example: military arms races. Like tree height, the major benefit of military
power is not intrinsic, but relative: being less militarily capable than their neighbors
makes a nation vulnerable to invasion. This competitive pressure drives nations to
expend vast sums of money on their military budgets each year, reducing each nation’s
budgets for other areas, such as healthcare and education. Some forms of military
investment, such as nuclear weaponry and military AI applications, also exacerbate
the risks of large-scale catastrophes. If every nation were to decrease its military
investment by the same amount, everyone would benefit from the reduced expenses
and risks without anyone losing their relative power. However, this arrangement is
not stable, since each nation could improve its security by ensuring its military power
exceeds that of its competitors, and each risks becoming vulnerable if it alone fails
to do this. Military expenditure therefore remains high in spite of these seemingly
avoidable costs.

Competitive and evolutionary pressures. The same basic structure underlies
most of these examples [366]. A group of agents is engaged in a competition over a
valuable and limited item (sunlight access, housing quality, military security). One
way an agent can gain more of this valuable item is by sacrificing some of their other
values (energy for growth, social life, an education budget). Agents who do not make
these sacrifices are outcompeted by those who do. Natural selection weeds out those
who do not sacrifice their other values sufficiently, replacing them with agents who
sacrifice more, until the competition is dominated by those agents who sacrificed the
most. These agents gain no more of the valued item they are competing for than did
the original group, yet are worse off for the losses of their other values.

Steering each agent ≠ steering the system. These phenomena hint at the
distinct challenges of ensuring safety in multi-agent systems. The danger posed by a
collective of agents is greater than the sum of its parts. AI risk cannot be eradicated
by merely ensuring that each individual AI agent is loyal and each individual human
operator is well-intentioned. Even if all agents, both human and AI, share a common
set of goals, this does not guarantee macrobehavior in line with these goals. The
agents’ interactions can produce undesirable outcomes.

Chapter focus. In this chapter, we use abstract models to understand how in-
telligent agents can, despite acting rationally and in accordance with their own self-
interest, collectively produce outcomes that none of them wants, even when they
could seemingly have achieved preferable alternative outcomes. We can characterize
these risks by crudely differentiating them into the following two sets:
• Multi-human dynamics. These risks are generated by interactions between the
human agencies involved in AI development and adoption, particularly corpora-
tions and nations. The central concern here is that competitive and evolutionary
pressures could drive humanity to hand over increasing amounts of power to AIs,
thereby becoming a “second-class species.” The frameworks we explore in this chap-
ter are highly abstract and can be useful in thinking more generally about the
current AI landscape.
Of particular importance are racing dynamics. We see these in the corporate world,
where AI developers may cut corners on safety in order to avoid being outcompeted
by one another. We also see these in international relations, where nations are
racing each other to adopt hazardous military AI applications. By observing AI
races, we can anticipate that merely persuading these parties that their actions are
high-risk may not be sufficient for ensuring that they act more cautiously, because
they may be willing to tolerate high risk levels in order to “stay in the race.” For
example, nations may choose to continue investing in military AI technologies that
could fail in catastrophic ways, if abstaining from doing so risks losing an international
conflict.
• Multi-AI dynamics. These risks are generated by interactions with and between
AI agents. In the future, we expect that AIs will increasingly be granted auton-
omy in their behavior, and will therefore interact with others under progressively
less human oversight. This poses risks in at least three ways. First, evolutionary
pressures may promote selfish behavior and generate various forms of intrasystem
conflict that could subvert our goals. Second, many of the mechanisms by which
AI agents may cooperate with one another could promote undesirable behaviors,
such as nepotism, outgroup hostility, and the development of ruthless reputations.
Third, AIs may engage in conflict, using threats of extreme scale in order to extort
others, or even promoting all-out warfare, with devastating consequences.

We explore both of the above sets of multi-agent risks using generalizable frameworks
from game theory, bargaining theory, and evolutionary theory. These frameworks
help us understand the collective dynamics that can lead to outcomes that were
not intended or desired by anyone individually. Even if AI systems are fully under
human control and leading actors such as corporations and states are well-intentioned,
humanity could still end up gradually eroding away its own power until it cannot be
recovered.

Game Theory

Rational agents will not necessarily secure good outcomes. Behavior that
is individually rational and self-interested can produce collective outcomes that
are suboptimal, or even catastrophic, for all involved. This section first examines
the Prisoner’s Dilemma, a canonical game theoretic example that illustrates this
theme—though cooperation would produce an outcome that is better for both agents,
for either one to cooperate would be irrational.
We then build on this by introducing two additional levels of sophistication. The
first addition is time. We explore how cooperation is possible, though not assured,
when agents interact repeatedly over time. The second addition is the introduction
of more than two agents. We explore how collective action problems can generate
and maintain undesirable states. Ultimately, we see how these natural dynamics can
produce catastrophically bad outcomes. They perpetuate military arms races and
corporate AI races, increasing the risks from both. They may also promote dangerous
AI behaviors, such as extortion.

Cooperation

Cooperation is necessary, but not sufficient, for multi-AI agent safety.
In this section, we turn to assessing how cooperation can help with addressing the
challenges outlined above. However, we also consider what problems cooperation itself
may pose in the context of AI. We explore five mechanisms that can promote or
maintain cooperation:
• Direct reciprocity: the chance of a future meeting incentivizes cooperative behavior
in the present.
• Indirect reciprocity: cooperative reputations are rewarded.
• Group selection: inter-group competition promotes intra-group unity.
• Kin selection: indirect benefits of cooperation outweigh direct costs, motivating
altruism toward genetic kin.
• Institutions: large-scale external forces motivate cooperation through enforcement.

Conflict

Rational agents may sometimes choose destructive conflict instead of
peaceful bargaining. This section explores the nature of conflict between agents.
We start with an overview of bargaining theory, which provides a lens for under-
standing why rational agents sometimes choose mutually costly conflict over peaceful
alternatives. We next explore several specific factors that drive conflict.
1. Power shifts: a shift in political power triggers preventative conflict.
2. First-strike advantage: time-sensitive offensive advantages motivate a party to ini-
tiate conflict preemptively.
3. Issue indivisibility: wherever the entity over which parties are contesting is indi-
visible, it is harder to avoid resorting to conflict.
4. Information problems: mis- and dis-information kindle defensive or offensive action
over cooperation.
5. Inequality: inequality may increase the probability of conflict, due to factors such
as relative deprivation and social envy.

Evolutionary Pressure
Natural selection will promote AIs that behave selfishly. In this final sec-
tion, we use evolutionary theory to study what happens when a large number of
agents interact many times over many generations. We start with generalized Dar-
winism: the idea that evolution by natural selection can take place outside of the
realm of biology. We explore examples in linguistics, music, philosophy and sociol-
ogy. We formalize generalized Darwinism using Lewontin’s conditions for evolution
by natural selection and the Price equation for evolutionary change. Using both, we
show that AIs are likely to be subject to evolution by natural selection: they will vary
in ways that produce differential fitness and so influence which traits persist through
time and between “generations” of AIs.
Next, we explore two AI risks generated by evolutionary pressures. The first is that
correctly specified goals may be subverted or distorted by “intrasystem goal conflict.”
The interests of propagating information (such as genes, departments, or sub-agents)
can sometimes clash with those of the larger entity that contains it (such as an
organism, government, or AI system), undermining unity of purpose. The second risk
we consider is that natural selection tends to favor selfish traits over altruistic ones.
A future shaped by evolutionary pressures is, therefore, likely to be dominated by
selfish behavior, both in the institutions that produce and use AI systems, and in the
AIs themselves.
The conclusions of this section are simple. Natural selection will by default be a strong
force in determining the state of the world. Its influence on AI development carries
the risk of intrasystem goal conflict and the promotion of selfish behavior. Both risks
could have catastrophic effects. Intrasystem goal conflict could prevent our goals from
being carried out and generate unexpected actions. AI agents could develop selfish
tendencies, increasing the risk that they might employ harmful strategies (including
those covered earlier in the chapter, such as extortion).

7.2 GAME THEORY

7.2.1 Overview
This chapter explores the dynamics generated by the interactions of multiple agents,
both human and AI. These interactions create risks distinct from those generated
by any individual AI agent acting in isolation. One way we can study the strategic
interdependence of agents is with the framework of game theory. Using game theory,
we can examine formal models of how agents interact with each other under varying
conditions and predict the outcomes of these interactions.
Here, we use game theory to present natural dynamics in biological and social systems
that involve multiple agents. In particular, we explore what might cause agents to
come into conflict with one another, rather than cooperate. We show how these multi-
agent dynamics can generate undesirable outcomes, sometimes for all the agents
involved. We consider risks created by interactions within and between human and
AI agents, from human-directed companies and militaries engaging in perilous races
to autonomous AIs using threats for extortion.
We start with an overview of the fundamentals of game theory. We begin this section
by setting out the characteristics of game theoretic agents. We also categorize the
different kinds of games we are exploring.
We then focus on the Prisoner’s Dilemma. The Prisoner’s Dilemma is a simple ex-
ample of how an interaction between two agents can generate an equilibrium state
that is bad for both, even when each acts rationally and in their own self-interest. We
explore how agents may arrive at the outcome where neither chooses to cooperate.
We use this to model real-world phenomena, such as negative political campaigns.
Finally, we examine ways we might foster rational cooperation between self-interested
AI agents, such as by altering the values in the underlying payoff matrices. The key
upshot is that intelligent and rational agents do not always achieve good outcomes.
We next add in the element of time by examining the Iterated Prisoner’s Dilemma.
AI agents are unlikely to interact with others only once. When agents engage with
each other multiple times, this creates its own hazards. We begin by examining how
iterating the Prisoner’s Dilemma alters the agents’ incentives—when an agent’s be-
havior in the present can influence that of their partner in the future, this creates
an opportunity for rational cooperation. We study the effects of altering some of the
variables in this basic model: uncertainty about future engagement and the necessity
to switch between multiple different partners. We look at why the cooperative strat-
egy tit-for-tat is usually so successful, and in what circumstances it is less so. Finally,
we explore some of the risks associated with iterated multi-agent social dynamics:
corporate AI races, military AI arms races, and AI extortion. The key upshot is that
cooperation cannot be ensured merely by iterating interactions through time.
We next move to consider group-level interactions. AI agents might not interact with
others in a neat, pairwise fashion, as assumed by the models previously explored.
In the real world, social behavior is rarely so straightforward. Interactions can take
place between more than two agents at the same time. A group of agents creates an
environmental structure that may alter the incentives directing individual behavior.
Human societies are rife with dynamics generated by group-level interactions that
result in undesirable outcomes. We begin by formalizing “collective action problems.”
We consider real-world examples such as anthropogenic climate change and fishery
depletion. Multi-agent dynamics such as these generate AI risk in several ways. Races
between human agents and agencies could trigger flash wars between AI agents or
the automation of economies to the point of human enfeeblement. The key upshot
is that achieving cooperation and ensuring collectively good outcomes is even more
difficult in interactions involving more than two agents.

7.2.2 Game Theory Fundamentals

In this section, we briefly run through some of the fundamental principles of game
theory. Game theory is the branch of mathematics concerned with agents’ choices and
strategies in multi-agent interactions. Game theory is so-called because we reduce
complex situations to abstract games where agents maximize their payoffs. Using
game theory, we can study how altering incentives influences the strategies that these
agents use.

Agents in game theory. We usually assume that the agents in these games are
self-interested and rational. Agents are “self-interested” if they make decisions in view
of their own utility, regardless of the consequences to others. Agents are said to be
“rational” if they act as though they are maximizing their utility.

Games can be “zero sum” or “non-zero sum.” We can categorize the games
we are studying in different ways. One distinction is between zero sum and non-zero
sum games. A zero sum game is one where, in every outcome, the agents’ payoffs
all sum to zero. An example is “tug of war”: any benefit to one party from their pull
is necessarily a cost to the other. Therefore, the total value of these wins and losses
cancel out. In other words, there is never any net change in total value. Poker is a
zero sum game if the players’ payoffs are the money they each finish with. The total
amount of money at a poker game’s beginning and end is the same—it has simply
been redistributed between the players.
By contrast, many games are non-zero sum. In non-zero sum games, the total amount
of value is not fixed and may be changed by playing the game. Thus, one agent’s win
does not necessarily require another’s loss. For instance, in cooperation games such
as those where players must meet at an undetermined location, players only get the
payoff together if they manage to find each other. As we shall see, the Prisoner’s
Dilemma is a non-zero sum game, as the sum of payoffs changes across different
outcomes.

Non-zero sum games can have “positive sum” or “negative sum” out-
comes. We can categorize the outcomes of non-zero sum games as positive sum
and negative sum. In a positive sum outcome, the total gains and losses of the agents
sum to greater than zero. Positive sum outcomes can arise when particular inter-
actions result in an increase in value. This includes instances of mutually beneficial
cooperation. For example, if one agent has flour and another has water and heat,
the two together can cooperate to make bread, which is more valuable than the raw
materials. As a real-world example, many view the stock market as positive sum
because the overall value of the stock market tends to increase over time. Though
gains are unevenly distributed, and some investors lose money, the average investor
becomes richer. This demonstrates an important point: positive sum outcomes are
not necessarily “win-win.” Cooperating does not guarantee a benefit to all involved.
Even if extra total value is created, its distribution between the agents involved in its
creation can take any shape, including one where some agents have negative payoffs.

In a negative sum outcome, some amount of value is lost by playing the game. Many
competitive interactions in the real world are negative sum. For instance, consider
“oil wars”—wars fought over a valuable hydrocarbon resource. Oil wars are zero-
sum with regards to oil since only the distribution (not the amount) of oil changes.
However, the process of conflict itself incurs costs to both sides, such as loss of life
and infrastructure damage. This reduces the total amount of value. If AI development
has the potential to result in catastrophic outcomes for humanity, then accelerating
development to gain short-term profits in exchange for long-term losses to everyone
involved would be a negative sum outcome.
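
The distinction can be checked mechanically: a game is zero sum if the players' payoffs sum to zero in every outcome. The sketch below uses two illustrative games (the payoff numbers are assumptions chosen for the example, echoing the purely competitive and mutually beneficial cases above), written in Python.

```python
# Hedged sketch: classify a two-player game as zero sum or non-zero sum by
# summing the payoffs in every outcome.
def is_zero_sum(payoff_matrix):
    """payoff_matrix maps (action_A, action_B) -> (payoff_A, payoff_B)."""
    return all(sum(payoffs) == 0 for payoffs in payoff_matrix.values())

matching_pennies = {      # purely competitive: one side's win is the other's loss
    ("heads", "heads"): ( 1, -1),
    ("heads", "tails"): (-1,  1),
    ("tails", "heads"): (-1,  1),
    ("tails", "tails"): ( 1, -1),
}
bake_bread = {            # cooperation creates value: flour plus water and heat make bread
    ("share flour", "share water"): ( 3,  3),
    ("share flour", "withhold"):    (-1,  0),
    ("withhold",    "share water"): ( 0, -1),
    ("withhold",    "withhold"):    ( 0,  0),
}
print(is_zero_sum(matching_pennies))  # True
print(is_zero_sum(bake_bread))        # False: mutual cooperation is positive sum
```
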

7.2.3 The Prisoner’s Dilemma

Our aim in this section is to investigate how interactions between rational agents,
both human and AI, may negatively impact everyone involved. To this end, we focus
on a simple game: the Prisoner’s Dilemma. We first explore how the game works,
and its different possible outcomes. We then examine why agents may choose not to
cooperate even if they know this will lead to a collectively suboptimal outcome. We
run through several real-world phenomena which we can model using the Prisoner’s
Dilemma, before exploring ways in which cooperation can be promoted in these kinds
of interactions. We end by briefly discussing the risk of AI agents tending toward
undesirable equilibrium states.

The Game Fundamentals

In the Prisoner’s Dilemma, two agents must each decide whether or not to cooperate.
The costs and benefits are structured such that for each agent, defection is the best
strategy regardless of what their partner chooses to do. This motivates both agents
to defect.

The Prisoner’s Dilemma. In game theory, the Prisoner’s Dilemma is a classic
example of the decisions of rational agents leading to suboptimal outcomes. The
basic setup is as follows. The police have arrested two would-be thieves. We will call
them Alice and Bob. The suspects were caught breaking into a house. The police
are now detaining them in separate holding cells, so they cannot communicate with
each other. The police suspect that the pair were planning burglary (which carries
a lengthy jail sentence). But they only have enough evidence to charge them with
trespassing (which carries a shorter jail sentence). However, the testimony of either
one of the suspects would be enough to charge the other with burglary, so the police
offer each suspect the following deal. If only one of them rats out their partner by
confessing that they had intended to commit burglary, the confessor will be released
with no jail time and their partner will spend eight years in jail. However, if they
each attempt to rat out the other by both confessing, they will both serve a medium
prison sentence of three years. If neither suspect confesses, they will both serve a
short jail sentence of only one year.

The four possible outcomes. We assume that Alice and Bob are both rational
and self-interested: each only cares about minimizing their own jail time. We define
the decision facing each as follows. They can either “cooperate” with their partner by
remaining silent or “defect” on their partner by confessing to burglary. Each suspect
faces four possible outcomes, which we can split into two possible scenarios. Let’s term
these “World 1” and “World 2”; see Figure 7.1. In World 1, their partner chooses to
cooperate with them; in World 2, their partner chooses to defect. In both scenarios,
the suspect decides whether to cooperate or defect themself. They do not know what
their partner will decide to do.

Figure 7.1. The possible outcomes for Alice in the Prisoner’s Dilemma.

Defection is the dominant strategy. Alice does not know whether Bob will
choose to cooperate or defect. She does not know whether she will find herself in
World 1 or World 2; see Figure 7.1. She can only decide whether to cooperate or
defect herself. This means she is making one of two possible decisions. If she defects,
she is...
...in World 1: Bob cooperates and she goes free instead of spending a
year in jail.
...in World 2: Bob defects and she gets a 3-year sentence instead of an
8-year one.

Alice only cares about minimizing her own jail time, so she can save herself jail time
in either scenario by choosing to defect. She saves herself one year if her partner coop-
erates or five years if her partner defects. A rational agent under these circumstances
will do best if they decide to defect, regardless of what they expect their partner
to do. We call this the dominant strategy: a rational agent playing the Prisoner’s
Dilemma should choose to defect no matter what their partner does.
One way to think about strategic dominance is through the following thought exper-
iment. Someone in the Arctic during winter is choosing what to wear for that day’s
excursion. They have only two options: a coat or a t-shirt. The coat is thick and
waterproof; the t-shirt is thin and absorbent. Though this person cannot control or
predict the weather, they know there are only two possibilities: either rain or cold. If
it rains, the coat will keep them drier than the t-shirt. If it is cold, the coat will keep
them warmer than the t-shirt. Either way, the coat is the better option, so “wearing
the coat” is their dominant strategy.

Defection is the dominant strategy for both agents. Importantly, both the
suspects face this decision in a symmetric fashion. Each is deciding between identical
outcomes, and each wishes to minimize their own jail time. Let’s consider the four
possible outcomes now in terms of both the suspects’ jail sentences. We can display
this information in a payoff matrix, as shown in Table 7.1. Payoff matrices are com-
monly used to visualize games. They show all the possible outcomes of a game in
terms of the value of that outcome for each of the agents involved. In the Prisoner’s
Dilemma, we show the decision outcomes as the payoffs to each suspect: note that
since more jail time is worse than less, these payoffs are negative. Each cell of the
matrix shows the outcome of the two suspects’ decisions as the payoff to each suspect.

TABLE 7.1 Each cell in this payoff matrix represents a payoff. If Alice cooperates and Bob
defects, the top right cell tells us that Alice gets 8 years in jail while Bob goes free.

                          Bob cooperates    Bob defects
    Alice cooperates      −1, −1            −8, 0
    Alice defects         0, −8             −3, −3

Nash Equilibria and Pareto Efficiency

The stable equilibrium state in the Prisoner’s Dilemma is for both agents to defect.
Neither agent would choose to go back in time and change their decision (to switch
to cooperating) if they could not also alter their partner’s behavior by doing so. This
is often considered counterintuitive, as the agents would benefit if they were both to
switch to cooperating.

Nash Equilibrium: both agents will choose to defect. Defection is the best
strategy for Alice, regardless of what Bob opts to do. The same is true for Bob.
Therefore, if both are behaving in a rational and self-interested fashion, they will both
defect. This will secure the outcome of 3 years of jail time each (the bottom-right
outcome of the payoff matrix above). Neither would wish to change their decision,
even if their partner were to change theirs. This is known as the Nash equilibrium: the
strategy choices from which no agent can benefit by unilaterally choosing a different
strategy. When interacting with one another, rational agents will tend toward picking
strategies that are part of Nash equilibria.

Pareto improvement: both agents would do better if they cooperated. As
we can see in the payoff matrix, there is a possible outcome that is better for both
suspects. If both choose the cooperate strategy, they will secure the top-left outcome
of the payoff matrix. Each would serve 2 years less jail time at no cost to the other. Yet,
as we have seen, selecting this strategy is irrational; the defect strategy is dominant
and so Alice and Bob each want to defect instead. We call this outcome Pareto
inefficient, meaning that it could be altered to make some of those involved better
off without making anyone else worse off. In the Prisoner’s Dilemma, the both defect
outcome is Pareto inefficient because it is suboptimal for both Alice and Bob, who
would both be better off if they both cooperated instead. Where there is an outcome
that is better for some or all agents involved, and not worse for any, we call the switch
to this more efficient outcome a Pareto improvement. In the Prisoner’s Dilemma, the
both cooperate outcome is better for both agents than the Nash equilibrium of both
defect; see Figure 7.2. The only Pareto improvement possible in this game is the move
from the both defect to the both cooperate outcome; see Figure 7.3.

Figure 7.2. Looking at the possible outcomes for both suspects in the Prisoner’s Dilemma,
we can see that there is a possible Pareto improvement from the Nash equilibrium. The
numbers represent their payoffs (rather than the length of their jail sentence).

Figure 7.3. Both suspects’ payoffs, in each of the four decision outcomes. Moving right in-
creases Alice’s payoff, and moving up improves Bob’s payoff. A Pareto improvement requires
moving right and up, as shown by the green arrow [367].
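
These concepts can also be computed directly from the payoff matrix. The following is a hedged sketch in Python using the payoffs from Table 7.1; the helper names are our own and the code simply enumerates the 2x2 game rather than implementing any general-purpose solver.

```python
# For the Prisoner's Dilemma of Table 7.1: find each player's strictly dominant
# strategy, the pure-strategy Nash equilibria, and Pareto improvements on an outcome.
from itertools import product

ACTIONS = ["cooperate", "defect"]
PAYOFFS = {  # (Alice's action, Bob's action) -> (Alice's payoff, Bob's payoff)
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-8,  0),
    ("defect",    "cooperate"): ( 0, -8),
    ("defect",    "defect"):    (-3, -3),
}

def payoff(own, opp, player):
    """Payoff to `player` (0 = Alice, 1 = Bob) when they play `own` and the other plays `opp`."""
    key = (own, opp) if player == 0 else (opp, own)
    return PAYOFFS[key][player]

def dominant_strategy(player):
    """An action that is strictly best against every opponent action, if one exists."""
    for a in ACTIONS:
        others = [x for x in ACTIONS if x != a]
        if all(payoff(a, opp, player) > payoff(x, opp, player)
               for x in others for opp in ACTIONS):
            return a
    return None

def nash_equilibria():
    """Outcomes where neither player gains by unilaterally switching actions."""
    eqs = []
    for (a, b), (pa, pb) in PAYOFFS.items():
        if pa == max(PAYOFFS[(x, b)][0] for x in ACTIONS) and \
           pb == max(PAYOFFS[(a, y)][1] for y in ACTIONS):
            eqs.append((a, b))
    return eqs

def pareto_improvements(outcome):
    """Outcomes at least as good for both players and strictly better for at least one."""
    base = PAYOFFS[outcome]
    return [o for o, p in PAYOFFS.items()
            if p != base and all(p[i] >= base[i] for i in range(2))]

print(dominant_strategy(0), dominant_strategy(1))   # defect, defect
print(nash_equilibria())                            # [('defect', 'defect')]
print(pareto_improvements(("defect", "defect")))    # [('cooperate', 'cooperate')]
```
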

Real-World Examples of the Prisoner’s Dilemma

The Prisoner’s Dilemma has many simplifying assumptions. Nevertheless, it can be a
helpful lens through which to understand social dynamics in the real world. Rational
and self-interested parties often produce states that are Pareto inefficient. There
exist alternative states that would be better for all involved, but reaching these
requires individually irrational action. To illustrate this, let’s explore some real-world
examples.

Mud-slinging. Consider the practice of mud-slinging. Competing political parties
often use negative campaign tactics, producing significant reputational costs. By run-
ning negative ads to attack and undermine the public image of their opponents, all
parties end up with tarnished reputations. If we assume that politicians value their
reputation in an absolute sense, not merely in relation to their contemporary competi-
tors, then mud-slinging is undesirable for all. A Pareto improvement to this situation
would be switching to the outcome where they all cooperate. With no one engaging in
mud-slinging, all the parties would have better reputations. The reason this does not
happen is that mud-slinging is the dominant strategy. If a party’s opponent doesn’t
use negative ads, the party will boost their reputation relative to their opponent’s
by using them. If their opponent does use negative ads, the party will reduce the
difference between their reputations by using them too. Thus, both parties converge
on the Nash equilibrium of mutual mud-slinging, at avoidable detriment to all.

Shopkeeper price cuts. Another example is price racing dynamics between differ-
ent goods providers. Consider two rival shopkeepers selling similar produce at similar
prices. They are competing for local customers. Each shopkeeper calculates that low-
ering their prices below that of their rival will attract more customers away from
the other shop and result in a higher total profit for themselves. If their competitor
drops their prices and they do not, then the competitor will gain extra customers,
leaving the first shopkeeper with almost none. Thus, “dropping prices” is the domi-
nant strategy for both. This leads to a Nash equilibrium in which both shops have
low prices, but the local custom is divided much the same as it would be if they had
both kept their prices high. If they were both to raise their prices, they would both
benefit by increasing their profits: this would be a Pareto improvement. Note that,
just as the interests of the police do not count in the Prisoner’s Dilemma, we
are only considering the interests of the shopkeepers in this example. We are ignoring
the interests of the customers and wider society.

Arms races. Nations’ expenditure on military arms development is another ex-
ample. It would be better for all these nations’ governments if they were all simulta-
neously to reduce their military budgets. No nation would become more vulnerable if
they were all to do this, and each could then redirect these resources to areas such as
education and healthcare. Instead, we have widespread military arms races. We might
prefer for all the nations to turn some military spending to their other budgets, but
for any one nation to do so would be irrational. Here, the dominant strategy for each
nation is to opt for high military expenditure. So we achieve a Nash equilibrium in
which all nations must decrease spending in other valuable sectors. It would be a
Pareto improvement for all to have lower military spending, freeing money and resources
for different domains. We will consider races in the context of AI development in the
following section.

Promoting Cooperation

So far we have focused on the sources of undesirable multi-agent dynamics in games
like the Prisoner’s Dilemma. Here, we turn to the mechanisms by which we can
promote cooperation over defection.

Reasons to cooperate. There are many reasons why real-world agents might
cooperate in situations which resemble the Prisoner’s Dilemma [368], as shown in
Figure 7.4. These can broadly be categorized by whether the agents have a choice, or
whether defection is impossible. If the agents do have a choice, we can further divide
the possibilities into those where they act in their own self-interest, and those where
they do not (altruism). Finally, we can differentiate two reasons why self-interested
agents may choose to cooperate: a tendency toward this, such as a conscience or guilt,
and future reward/punishment. We will explore two possibilities in this section—
payoff changes and altruistic dispositions—and then “future reward/punishment” in
the next section. Note that we effectively discuss “Defection is impossible” in the
Single-Agent Safety chapter, and “AI consciences” in the Beneficial AI and Machine
Ethics chapter.

Figure 7.4. Four possible reasons why agents may cooperate in Prisoner’s Dilemma-like
scenarios. This section explores two: changes to the payoff matrix and increased agent
altruism [368].

External consideration: changing the payoffs to incentivize cooperation.
By adjusting the values in the payoff matrix, we may more easily steer agents away
from undesirable equilibria. As shown in Table 7.2, incentive structures are important.
A Prisoner’s Dilemma-like scenario may arise wherever an individual agent will do
better to defect whether their partner cooperates (c > a) or defects (d > b). Avoiding
this situation requires altering these constants where they underlie critical social
interactions in the real world: changing the costs and benefits associated with different
activities so as to encourage cooperative behavior.

TABLE 7.2 If c > a and d > b, the highest payoff for either agent is to defect, regardless
of what their opponent does: defection is the dominant strategy. Fostering cooperation
requires avoiding this structure.

                           Agent B cooperates    Agent B defects
    Agent A cooperates     a, a                  b, c
    Agent A defects        c, b                  d, d
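
The condition in Table 7.2 can be written as a one-line check. This is a sketch in Python; the two test cases are illustrative, with the second previewing the punishment-adjusted payoffs discussed next.

```python
# Check the dominance condition of Table 7.2: with symmetric payoffs a, b, c, d,
# defection strictly dominates cooperation for both agents exactly when c > a and d > b.
def defection_dominates(a, b, c, d):
    """a: both cooperate; b: cooperate against a defector;
    c: defect against a cooperator; d: both defect (payoffs to the row agent)."""
    return c > a and d > b

print(defection_dominates(a=-1, b=-8, c=0, d=-3))   # True: the Prisoner's Dilemma
print(defection_dominates(a=-1, b=-8, c=-3, d=-6))  # False: after punishing defection
```
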

There are two ways to reduce the expected value of defection: lower the probability
of defection success or lower the benefit of a successful defection. Consider a strat-
egy commonly used by organized crime groups: threatening members with extreme
punishment if they “snitch” to the police. In the Prisoner’s Dilemma game, we can
model this by adding a punishment equivalent to three years of jail time for “snitch-
ing,” leading to the altered payoff matrix as shown in Figure 7.5. The Pareto efficient
outcome (−1, −1) is now also a Nash Equilibrium because snitching when the other
player cooperates is worse than mutually cooperating (c < a).

Figure 7.5. Altering the payoff matrix to punish snitches, we can move from a Prisoner’s
Dilemma (left) to a Stag Hunt (right), in which there is an additional Nash equilibrium.
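
The effect of this payoff change can be verified directly. The sketch below, in Python, assumes the three-year snitching penalty described in the text and applies it to the Table 7.1 payoffs; the helper functions are illustrative rather than a general game solver.

```python
# Adding a three-year punishment for "snitching" (defecting) turns the Prisoner's
# Dilemma into a Stag Hunt with a second, cooperative Nash equilibrium.
ACTIONS = ["cooperate", "defect"]
PD = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-8,  0),
    ("defect",    "cooperate"): ( 0, -8),
    ("defect",    "defect"):    (-3, -3),
}

def punish_snitching(payoffs, penalty=3):
    """Subtract `penalty` years from any player who defects."""
    new = {}
    for (a, b), (pa, pb) in payoffs.items():
        new[(a, b)] = (pa - (penalty if a == "defect" else 0),
                       pb - (penalty if b == "defect" else 0))
    return new

def nash_equilibria(payoffs):
    """Outcomes where neither player gains by unilaterally switching actions."""
    eqs = []
    for (a, b), (pa, pb) in payoffs.items():
        if pa == max(payoffs[(x, b)][0] for x in ACTIONS) and \
           pb == max(payoffs[(a, y)][1] for y in ACTIONS):
            eqs.append((a, b))
    return eqs

print(nash_equilibria(PD))                    # [('defect', 'defect')]
print(nash_equilibria(punish_snitching(PD)))  # both-cooperate and both-defect
```
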

Internal consideration: making agents more altruistic to promote co-
operation. A second potential mechanism to foster cooperation is to make agents
more altruistic. If each agent also values the outcome for their partner, this effectively
changes the payoff matrix. Now, the length of their partner’s jail sentence matters to
each of them. In the Prisoner’s Dilemma payoff matrix, the both cooperate outcome
earns the lowest total jail time, so agents who valued their partners’ payoffs equally
to their own would converge on cooperation.
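
As a minimal sketch of this idea (in Python, with the Table 7.1 payoffs), each agent's effective payoff adds a weight w times their partner's payoff; the weight is an illustrative parameter, not something specified in the text.

```python
# Altruism as a payoff transformation: effective payoff = own payoff + w * partner's payoff.
PD = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-8,  0),
    ("defect",    "cooperate"): ( 0, -8),
    ("defect",    "defect"):    (-3, -3),
}

def with_altruism(payoffs, w):
    return {o: (pa + w * pb, pb + w * pa) for o, (pa, pb) in payoffs.items()}

for outcome, effective in with_altruism(PD, w=1.0).items():
    print(outcome, effective)
# With w = 1 (valuing the partner's payoff equally), mutual cooperation gives each
# agent -2, better than any outcome involving defection (-8 or -6), so fully
# altruistic agents converge on cooperating.
```
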

Parallels to AI safety. One possible example of such a strategy would be to
target the values held by AI companies themselves. Improving corporate regulation
effectively changes the company’s expected payoffs from pursuing risky strategies.
If successful, it could encourage companies building AI systems to behave in a
less purely self-interested fashion. Rather than caring solely about maximizing their
shareholders’ financial interests, AI companies might cooperate more with each other
to steer away from Pareto inefficient outcomes, and avoid corporate AI races. We
explore this in more detail in section 7.2.4 “AI races” below.

Summary

Cooperation is not always rational, so intelligence alone may not ensure
good outcomes. We have seen that rational and self-interested agents may not in-
teract in such a way as to achieve good results, even for themselves. Under certain
conditions, such as in the Prisoner’s Dilemma, they will converge on a Nash equi-
librium of both defecting. Both agents would be better off if they both cooperated.
However, it is hard to secure this Pareto improvement because cooperation is not
rational when defection is the dominant strategy.

Conflict with or between future AI agents may be extremely harmful.
One source of concern regarding future AI systems is inter-agent conflict eroding the
value of the future. Rational AI agents faced with a Prisoner’s Dilemma-type scenario
might end up in stable equilibrium states that are far from optimal, perhaps for all
the parties involved. Possible avenues to reduce these risks include restructuring the
payoff matrices for the interactions in which these agents may be engaged or altering
the agents’ dispositions.

7.2.4 The Iterated Prisoner’s Dilemma

In our discussion of the Prisoner’s Dilemma, we saw how rational agents may converge
to equilibrium states that are bad for all involved. In the real world, however, agents
rarely interact with one another only once. Our aim in this section is to understand
how cooperative behavior can be promoted and maintained as multiple agents (both
human and AI) interact with each other over time, when they expect repeated future
interactions. We handle some common misconceptions in this section, such as the idea
that simply getting agents to interact repeatedly is sufficient to foster cooperation,
because “nice” and “forgiving” strategies always win out. As we shall see, things are
not so simple. We explore how iterated interactions can lead to progressively worse
outcomes for all.
In the real world, we can observe this in “AI races,” where businesses cut corners
on safety due to competitive pressures, and militaries adopt and deploy potentially
unsafe AI technologies, making the world less safe. These AI races could produce
catastrophic consequences, including more frequent or destructive wars, economic
enfeeblement, and the potential for catastrophic accidents from malfunctioning or
misused AI weapons.

Introduction

Agents who engage with one another many times do not always coexist harmoniously.
Iterating interactions is not sufficient to ensure cooperation. To see why, we explore
what happens when rational, self-interested agents play the Prisoner's Dilemma game
against each other repeatedly. In a single-round Prisoner’s Dilemma, defection is
always the rational move. But understanding the success of different strategies is
more complicated when agents play multiple rounds.

In the Iterated Prisoner's Dilemma, agents play repeatedly. The dominant strategy for a rational agent in a one-off interaction such as the Prisoner's
Dilemma is to defect. The seeming paradox is that both agents would prefer the
cooperate-cooperate outcome to the defect-defect one. An agent cannot influence
their partner’s actions in a one-off interaction, but in an iterated scenario, one agent’s
behavior in one round may influence how their partner responds in the next. We call
this the Iterated Prisoner’s Dilemma; see Figure 7.6. This provides an opportunity
for the agents to cooperate with each other.

Iterating the Prisoner's Dilemma opens the door to rational cooperation. In an Iterated Prisoner's Dilemma, both agents can achieve higher payoffs by fostering
a cooperative relationship with each other than they would if both were to defect every
round. There are two basic mechanisms by which iteration can promote cooperative
behavior: punishing defection and rewarding cooperation. To see why, let us follow
an example game of the Iterated Prisoner’s Dilemma in sequence.

Punishment. Recall Alice and Bob from the previous section, the two would-be
thieves caught by the police. Alice decides to defect in the first round of the Prisoner’s
Dilemma, while Bob opts to cooperate. This achieves a good outcome for Alice, and
a poor one for Bob, who punishes this behavior by choosing to defect himself in the
second round. What makes this a punishment is that Alice’s score will now be lower
than it would be if Bob had opted to cooperate instead, whether Alice chooses to
cooperate or defect.

Reward. Alice, having been punished, decides to cooperate in the third round. Bob
rewards this action by cooperating in turn in the fourth. What makes this a reward is
that Alice’s score will now be higher than if Bob had instead opted to defect, whether
Alice chooses to cooperate or defect. Thus, the expectation that their defection will
be punished and their cooperation rewarded incentivizes both agents to cooperate with each other.

Figure 7.6. Across six rounds, both players gain better payoffs if they consistently cooperate.
But defecting creates short-term gains.

In Figure 7.6, each panel shows a six-round Iterated Prisoner’s Dilemma, with purple
squares for defection and blue for cooperation. On the left is Tit-for-tat: An agent
using this strategy tends to score the same as or worse than its partners in each match.
On the right, always defect tends to score the same as or better than its partner in
each match. The average payoffs attained by using either strategy are shown at the
bottom: Tit-for-tat attains a better payoff (lower jail sentence) on average—and so is
more successful in a tournament—than always defect.

Defection is still the dominant strategy if agents know how many times
they will interact. If the agents know when they are about to play the Prisoner’s
Dilemma with each other for the final time, both will choose to defect in that final
round. This is because their defection is no longer punishable by their partner. If Alice
defects in the last round of the Iterated Prisoner’s Dilemma, Bob cannot punish her
by retaliating, as there are no future rounds in which to do so. The same is of course
true for Bob. Thus, defection is the dominant strategy for each agent in the final
round, just as it is in the single-round version of the dilemma.
Moreover, if each agent expects their partner to defect in the final round, then there
is no incentive for them to cooperate in the penultimate round either. This is for the
same reason: Defecting in the penultimate round will not influence their partner’s
behavior in the final round. Whatever an agent decides to do, they expect that
their partner will choose to defect next round, so they might as well defect now.
We can extend this argument by reasoning backward through all the iterations. In
each round, the certainty that their partner will defect in the next round regardless
of their own behavior in the current round incentivizes each agent to defect. The
reward for cooperation and punishment of defection have been removed. Ultimately,
this removal pushes the agents to defect in every round of the Iterated Prisoner’s
Dilemma.
Uncertainty about future engagement enables rational cooperation. In
the real world, an agent can rarely be sure that they will never again engage with
a given partner. Wherever there is sufficient uncertainty about the future of their
relationship, rational agents may be more cooperative. This is for the simple reason
that uncooperative behavior may yield less valuable outcomes in the long term, be-
cause others may retaliate in kind in the future. This tells us that AIs interacting
with each other repeatedly may cooperate, but only if they are sufficiently uncertain
about whether their interactions are about to end. Other forms of uncertainty can
also create opportunities for rational cooperation, such as uncertainty about what
strategies others will use. These are most important where the Iterated Prisoner’s
Dilemma involves a population of more than two agents, in which each agent inter-
acts sequentially with multiple partners. We turn to examining the dynamics of these
more complicated games next.
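A toy calculation can illustrate the role of this uncertainty before we move on. The sketch below assumes illustrative jail terms and a partner who plays tit-for-tat; after each round, the match continues with probability delta. None of these numbers come from the text.

```python
# Toy model: the partner plays tit-for-tat; after each round the match continues with
# probability delta. Assumed jail terms (years): mutual cooperation 1, mutual defection 2,
# lone defector 0, lone cooperator 3. Lower expected jail time is better.

def expected_jail_always_cooperate(delta):
    # One year every round; the expected number of rounds is 1 / (1 - delta).
    return 1.0 / (1.0 - delta)

def expected_jail_always_defect(delta):
    # Zero years in the first round (the partner opens by cooperating), then two years
    # per round once tit-for-tat retaliates: 2 * delta / (1 - delta) in expectation.
    return 2.0 * delta / (1.0 - delta)

for delta in [0.2, 0.6, 0.9, 0.99]:
    cooperate = expected_jail_always_cooperate(delta)
    defect = expected_jail_always_defect(delta)
    better = "cooperating" if cooperate < defect else "defecting"
    print(f"continuation probability {delta:.2f}: cooperate -> {cooperate:.1f} years, "
          f"defect -> {defect:.1f} years, so {better} is rational")
```

Under these assumed payoffs, defection only pays when the match is likely to end soon; once the continuation probability is high enough, cooperating with a retaliatory partner yields less expected jail time.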

Tournaments
So far, we have considered the Iterated Prisoner’s Dilemma between only two agents:
each plays repeatedly against a single partner. However, in the real world, we expect
AIs will engage with multiple other agents. In this section, we consider interactions
of this kind, where each agent not only interacts with their partner repeatedly, but
also switches partners over time. Understanding the success of a strategy is more
complicated in repeated rounds against many partners. Note that in this section, we
define a “match” to mean repeated rounds of the Prisoner’s Dilemma between the
same two agents; see Figure 7.6. We define a “tournament” to mean a population of
more than two agents engaged in a set of pairwise matches.
In Iterated Prisoner's Dilemma tournaments, each agent interacts with
multiple partners. In the 1970s, the political scientist Robert Axelrod held a
series of tournaments to pit different agents against one another in the Iterated Pris-
oner’s Dilemma. The tournament winner was whichever agent had the highest total
payoff after completing all matches. Each agent in an Iterated Prisoner’s Dilemma
tournament plays multiple rounds against multiple partners. These agents employed a
range of different strategies. For example, an agent using the strategy named random
would randomly determine whether to cooperate or defect in each round, entirely
independently of previous interactions with a given partner. By contrast, an agent
using the grudger strategy would start out cooperating, but switch to defecting for
all future interactions if its partner defected even once. See Table 7.3 for examples of
these strategies.

TABLE 7.3 Popular strategies' descriptions.

Strategy               Characteristics
Random                 Randomly defect or cooperate, regardless of your partner's strategy
Always defect          Always choose to defect, regardless of your partner's strategy
Always cooperate       Always choose to cooperate, regardless of your partner's strategy
Grudger                Start by cooperating, but if your partner defects, defect in every
                       subsequent round, regardless of your partner's subsequent behavior
Tit-for-tat            Start cooperating; then always do whatever your partner did last
Generous tit-for-tat   Same as tit-for-tat, but occasionally cooperate in response to
                       your partner's defection

The strategy "Tit-for-tat" frequently won Axelrod's tournaments [369]. The most famous strategy used in Axelrod's tournaments was Tit-for-tat. This was
the strategy of starting by cooperating, then repeating the partner’s most recent
move: if they cooperated, Tit-for-tat cooperated too; if they defected, Tit-for-tat
did likewise. Despite its simplicity, this strategy was extremely successful, and very
frequently won tournaments. An agent playing Tit-for-tat exemplified the two mechanisms for promoting cooperation: rewarding cooperation and punishing defec-
tion. Importantly, Tit-for-tat did not hold a grudge—it forgave each defection after
it retaliated by defecting in return, only once. This process of one defection for one
defection is captured in the famous idiom “an eye for an eye.” The Tit-for-tat strategy
became emblematic of one way to escape the muck of defection.

The success of Tit-for-tat is counterintuitive. In any given match, an agent playing Tit-for-tat will tend to score slightly worse than or the same as their partner;
see Figure 7.6a. By contrast, an agent who employs an uncooperative strategy such
as always defect usually scores the same as or better than its partner; see Figure 7.6b.
In a match between a cooperative agent and an uncooperative one, the uncooperative
agent tends to end up with the better score.
However, it is an agent’s average score which dictates its success in a tournament, not
its score in any particular match or with any particular partner. Two uncooperative
partners will score worse on average than cooperative ones. Thus, the success of
cooperative strategies such as Tit-for-tat (Figure 7.6) depends on the population strategy composition
(the assortment of strategies used by the agents in the population). If there are enough
cooperative partners, cooperative agents may be more successful than uncooperative ones.
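The round-robin dynamic described above is easy to simulate. The sketch below implements the strategies from Table 7.3 (other than generous tit-for-tat) and runs a small tournament; the jail terms, match length, and population mix are illustrative assumptions, and the ranking shifts if they change.

```python
import itertools
import random

# Jail terms (years, lower is better), indexed by (my move, partner's move).
# Illustrative values, chosen so that mutual defection is much costlier than mutual cooperation.
JAIL = {("C", "C"): (1, 1), ("C", "D"): (10, 0), ("D", "C"): (0, 10), ("D", "D"): (5, 5)}

def random_strategy(my_history, their_history):
    return random.choice("CD")

def always_defect(my_history, their_history):
    return "D"

def always_cooperate(my_history, their_history):
    return "C"

def grudger(my_history, their_history):
    return "D" if "D" in their_history else "C"

def tit_for_tat(my_history, their_history):
    return their_history[-1] if their_history else "C"

def play_match(strategy_a, strategy_b, rounds=100):
    """One match: repeated rounds between the same two agents. Returns total jail years."""
    hist_a, hist_b, jail_a, jail_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        years_a, years_b = JAIL[(move_a, move_b)]
        jail_a, jail_b = jail_a + years_a, jail_b + years_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return jail_a, jail_b

strategies = {"random": random_strategy, "always defect": always_defect,
              "always cooperate": always_cooperate, "grudger": grudger,
              "tit-for-tat": tit_for_tat}

# Tournament: every pair of distinct strategies plays one match; the lowest total jail wins.
totals = {name: 0 for name in strategies}
for (name_a, strat_a), (name_b, strat_b) in itertools.combinations(strategies.items(), 2):
    jail_a, jail_b = play_match(strat_a, strat_b)
    totals[name_a] += jail_a
    totals[name_b] += jail_b

for name, jail in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name:>17}: {jail} years in total")
```

In a population like this one, with several conditionally cooperative partners, tit-for-tat and grudger tend to accumulate less jail time overall than always defect, echoing Axelrod's results; with a more hostile population mix or different payoff values, the ordering can change.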

AI Races
Iterated interactions can generate “AI races.” We discuss two kinds of races concerning
AI development: corporate AI races and military AI arms races. Both kinds center
around competing parties participating in races for individual, short-term gains at
a collective, long-term detriment. Where individual incentives clash with collective
interests, the outcome can be bad for all. As we discuss here, in the context of AI
races, these outcomes could even be catastrophic.
AI races are the result of intense competitive pressures. During the Cold
War, the US and the Soviet Union were involved in a costly nuclear arms race. The
effects of their competition persist today, leaving the world in a state of heightened
nuclear threat. Competitive races of this kind entail repeated back-and-forth actions
that can result in progressively worse outcomes for all involved. We can liken this
example to the Iterated Prisoner’s Dilemma, where the nations must decide whether
to increase (defect) or decrease (cooperate) their nuclear spending. Both the US and
the Soviet Union often chose to increase spending. They would have created a safer
and less expensive world for both nations (as well as others) if they had cooperated
to reduce their nuclear stockpiles. We discuss this in more detail in Section 8.6.
Two kinds of AI races: corporate and military [273]. Competition between
different parties—nations or corporations—is incentivizing each to develop, deploy,
and adopt AIs rapidly, at the expense of other values and safety precautions. Corpo-
rate AI races consist of businesses prioritizing their own survival or power expansion
over ensuring that AIs are developed and released safely. Military AI arms races
consist of nations building and adopting powerful and dangerous military applica-
tions of AI technologies to gain military power, increasing the risks of more frequent
or damaging wars, misuse, or catastrophic accidents. We can understand these two
kinds of AI races using two game-theoretic models of iterated interactions. First, we
use the Attrition model to understand why AI corporations are cutting corners on
safety. Second, we’ll use the Security Dilemma model to understand why militaries
are escalating the use of—and reliance on—AI in warfare.

Corporate AI Races
Competition between AI research companies is promoting the creation and use of
more appealing and profitable systems, often at the cost of safety measures. Con-
sider the public release of large language model-based chatbots. Some AI companies
delayed releasing their chatbots out of safety concerns, like avoiding the generation
of harmful misinformation. We can view the companies that released their chat-
bots first as having switched from cooperating to defecting in an Iterated Prisoner’s
Dilemma. The defectors gained public attention and secured future investment. This
competitive pressure caused other companies to rush their AI products to market,
compromising safety measures in the process.
Corporate AI races arise because competitors sacrifice their values to gain an advan-
tage, even if this harms others. As a race heats up, corporations might increasingly
need to prioritize profits by cutting corners on safety, in order to survive in a world
where their competitors are very likely to do the same. The worst outcome for an
agent in the Prisoner’s Dilemma is the one where only they cooperated while their
partner defected. Competitive pressures motivate AI companies to avoid this out-
come, even at the cost of exacerbating large-scale risks.
Ultimately, corporate AI races could produce societal-scale harms, such as mass un-
employment and dangerous dependence on AI systems. We consider one such example
in Section 7.2.5. This risk is particularly vivid for emerging industries like AI, which lack the better-established safeguards found in other industries like pharmaceuticals, such as mature regulation and widespread awareness of the harm that unsafe products can cause.
Attrition model: a multi-player game of “Chicken.” We can model this
kind of corporate AI race using an “Attrition” model [370], which frames a race as
a kind of auction in which competitors bid against one another for a valuable prize.
Rather than bidding money, the competitors bid for the risk level they are willing
to tolerate. This is similar to the game “Chicken,” in which two competitors drive
headlong at each other. Assuming one swerves out of the way, the winner is the one
who does not (demonstrating that they can tolerate a higher level of risk than the
loser). Similarly, in the Attrition model, each competitor bids the level of risk—the
probability of bringing about a catastrophic outcome—they are willing to tolerate.
Whichever competitor is willing to tolerate the most risk will win the entire prize, as
long as the catastrophe they are risking does not actually happen. We can consider
this to be an “all pay” auction: both competitors must pay what they bid, whether
they win or not. This is because all of those involved must bear the risk they are
leveraging, and once they have made their bid they cannot retract it.
The Attrition model shows why AI corporations may cut corners on
safety. Let us assume that there are only two competitors and that both of them
have the same understanding of the state of their competition. In this case, the
Attrition model predicts that they will race each other up to a loss of one-third in
expected value [371]. If the value of the prize to one competitor is “X,” they will be
willing to risk a 33% chance of bringing about an outcome equally disvaluable (of
value “-X”) in order to win their race [372].
As we have discussed previously, market pressures may motivate corporations to
behave as though they value what they are competing for almost as highly as survival
itself. According to this toy model, we might then expect AI stakeholders engaged
in a corporate race to risk a 33% chance of existential catastrophe in order to “win
the prize” of their continued existence. With multiple AI races, long time horizons,
and ever-increasing risks, the repeated erosion of safety assurances down to only 66%
generates a vast potential for catastrophe.
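The compounding effect is easy to quantify. The sketch below is a back-of-the-envelope calculation rather than part of the cited model: it simply assumes a sequence of independent races, each run at the 33% risk level described above.

```python
# If each race is independently tolerated at a one-in-three chance of catastrophe,
# the probability of avoiding catastrophe across repeated races shrinks geometrically.
def survival_probability(per_race_risk, num_races):
    return (1.0 - per_race_risk) ** num_races

for num_races in [1, 2, 5, 10]:
    p = survival_probability(per_race_risk=1.0 / 3.0, num_races=num_races)
    print(f"after {num_races:>2} race(s): chance of avoiding catastrophe = {p:.1%}")
```

Under these assumptions, even a handful of such races leaves only a small chance of avoiding catastrophe altogether.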
Real-world actors may mistakenly erode safety precautions even further.
Moreover, real-world AI races could produce even worse outcomes than the one
predicted by the Attrition model [372]. One reason for this is that competing cor-
porations may not have a correct understanding of the state of their race. Precisely
predicting these kinds of risks can be extremely challenging: high-risk situations are
inherently difficult to predict accurately, even in fields far better understood than
AI. Incorrect risk calibration could cause the competitors to take actions that acci-
dentally exceed even the 33% risk level. Like newcomers to an “all pay” auction who
often overbid, uneven comprehension or misinformation could motivate the competi-
tors to take even greater risks of bringing about catastrophic outcomes. In fact, we
might even expect selection for competitors who tend to underestimate the risks of
these races. All these factors may further erode safety assurances.

Military AI Arms Races

Global interest in military applications for AI technologies is increasing. Some hail this as the "third revolution in warfare" [373], predicting impact at the scale of the
historical development of gunpowder and nuclear weapons. There are many causes
for concern about the adoption of AI technologies in military contexts. These in-
clude increased rates of weapon development, lethal autonomous weapons usage,
advanced cyberattack execution, and automation of decision-making. These could in
turn produce more frequent and destructive wars, acts of terrorism, and catastrophic
accidents. Perhaps even more important than the immediate dangers from military
deployment of AI is the possibility that nations will continue to race each other along
a path toward ever increased risks of catastrophe. In this section, we explore this
possibility using another game theoretic model.
First, let us consider a few different sources of risk from military AI [273]:
1. AI-developed weapons. AI technologies could be used to engineer weapons.
Military research and development offers many opportunities for acceleration using
AI tools. For instance, AI could be used to expedite processes in dual-use biological
and chemical research, furthering the development of programs to build weapons
of mass destruction.
2. AI-controlled weapons. AI might also be used to control weapons directly.
“Lethal autonomous weapons” have been in use since March 2020, when a self-
directing and armed drone “hunted down” soldiers in Libya without human super-
vision. Autonomous weapons may be faster or more reliable than human soldiers
for certain tasks, as well as being far more expendable. Autonomous weapons sys-
tems thus effectively motivate militaries to reduce human oversight. In a context
as morally salient as warfare, the ethical implications of this could be severe. In-
creasing AI weapon development may also impact international warfare dynamics.
The ability to deploy lethal autonomous weapons in place of human soldiers could
drastically lower the threshold for nations to engage in war, by reducing the ex-
pected body count—of the nation’s own citizens, at least. These altered warfare
dynamics could usher in a future with more frequent and destructive wars than
has yet been seen in human history.
3. AI cyberwarfare. Another military application is the use of AI in cyberwarfare. AI systems might be used to defend against cyberattacks. However, we do not
yet know whether this will outweigh the offensive potential of AI in this context.
Cyberattacks can be used to wreak enormous harm, such as by damaging crucial
systems and infrastructure to disrupt supply chains. AIs could make cyberattacks
more effective in a number of ways, motivating more frequent attempts and more
destructive successes. For example, AIs could directly aid in writing or improving
offensive programs. They could also execute cyberattacks at superhuman scales
by implementing vast numbers of offensive programs simultaneously. By democra-
tizing the power to execute large-scale cyberattacks, AIs would also increase the
difficulty of verification. With many more actors capable of carrying out attacks at
such scales, attributing attacks to perpetrators would be much more challenging.
4. Automated executive decision-making. Executive control might be delegated
to AIs at higher levels of military procedures. The development of AIs with su-
perhuman strategic capabilities may incentivize nations to adopt these systems
and increasingly automate military processes. One example of this is “automated
retaliation": AI systems granted the ability to respond to offensive threats they identify with counterattacks, without human supervision. Examples of this
include the NSA cyber defense program known as “MonsterMind.” When this
program identified an attempted cyberattack, it interrupted it and prevented its
execution. However, it would then launch an offensive cyberattack of its own in re-
turn. It could take this retaliatory action without consulting human supervisors.
More powerful AI systems, more destructive weapons, and greater automation
or delegation of military control to AI systems, would all deplete our ability to
intervene.
5. Catastrophic accidents. Lethal Autonomous Weapons and automated decision-
making systems both carry risks of resulting in catastrophic accidents. If a nation
were to lose control of powerful military AI technologies, the outcome could be
calamitous. Outsourcing executive command of military procedures to AI—such
as by automating retaliatory action—would put powerful arsenals on hair-trigger
alert. If one of these AI systems were to make even a small error, such as incor-
rectly identifying an offensive strike from another nation, it might automatically
“retaliate” to this non-existent threat. This could in turn trigger automated retal-
iations from the AI systems of other nations that detect this action. Thus, a small
error could be exacerbated into an increasingly escalated war. We consider how a
“flash war” such as this might come about in more detail in Section 7.2.5. Note
that we can also use the “Attrition” model in the case of military AI arms races
to model how military competitive pressures can motivate nations to cut corners
on safety.
6. Co-option of military AI technologies. Military AI arms races could also
have catastrophic effects outside of international warfare. New and more lethal
weapons could be used maliciously in other contexts. For instance, biological
weapons were originally created for military purposes. Even though we have since
halted the military use of these weapons, their existence has enabled many acts of
bioterrorism. Examples include the 2001 deployment of anthrax letters to kill US senators and media executives. The creation of knowledge of how to make and
use these weapons is irreversible. Thus, their existence and the risk they pose are
permanent.
7. Military AI risks may interact. Importantly, the risks posed by military AI
applications are not entirely independent of one another. The increased poten-
tial for anonymity when executing cyberattacks could increase the probability of
wars. Where it is harder to identify the perpetrators, misattribution could trigger
conflict between the target of the attack and an innocent party. The potential for
destructive cyberattacks might be increased by the scaled-up use of autonomous
weapons, as these could be co-opted by such attacks. Similarly, the danger posed
by a rogue AI with executive decision-making power might be all the more serious
if it has control over fleets of autonomous weapons.

Security Dilemma model: mutual defensive concerns motivate nations to increase risks. We can better understand military AI arms races using the "Secu-
rity Dilemma” model [374]. Consider the relationship between two peaceful nations.
Though they are not currently at war with one another, each is sufficiently concerned
about the possibility of conflict to pay close attention to the other’s state of military
ability. One day, one of the two nations perceives that the other is more militar-
ily capable than they are due to their having stockpiled more advanced weaponry.
This incentivizes the first nation to build up their own military capabilities until
they match or exceed those of the other nation. The second nation, perceiving this
increase in military investment and development, feels pressure to follow suit, once
again increasing their weapon capabilities. Neither wishes to be outmatched by the
other. This competitive pressure drives both to escalate the situation. The ensuing
arms race generates increasingly high risks for both sides, such as increasing the
probability or severity of accidents and misuse.
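The feedback loop in this dynamic can be captured with a deliberately crude toy model; the starting capabilities and the 20% overshoot factor below are arbitrary assumptions used only to show the ratchet.

```python
# Toy Security Dilemma loop: each nation, seeing the other's capability,
# builds up until it matches or exceeds it by a fixed margin.
capability = {"Nation A": 10.0, "Nation B": 10.0}
buildup_factor = 1.2  # each response aims to exceed the rival by 20%

for step in range(1, 6):
    capability["Nation A"] = buildup_factor * capability["Nation B"]
    capability["Nation B"] = buildup_factor * capability["Nation A"]
    print(f"round {step}: A = {capability['Nation A']:.1f}, B = {capability['Nation B']:.1f}")

# Neither side ever feels secure, and the absolute level of capability (and risk)
# ratchets upward without bound.
```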

Example: the Cold War nuclear arms race. As previously discussed, the
Cold War nuclear arms race typifies this process. Neither the US nor the Soviet Union
wanted to risk being less militarily capable than their rival, so each escalated their
own weaponized nuclear ability in an attempt to deter the other using the threat of
retaliation. Just as in the Iterated Prisoner’s Dilemma, neither nation could afford to
risk being the lone cooperator while their rival defected. Thus, they achieved a Pareto
inefficient outcome of both defecting. Competitive pressure drove them to continue
to worsen this situation over time, resulting in today’s enormously heightened state
of nuclear vulnerability.

Increased automation of warfare by one nation puts pressure on others to follow suit. Just as with nuclear weapons, so with military AI: the Security
Dilemma model illustrates how defensive concerns can force nations to go down a
route which is against the long term interests of all involved. This route leads to the
competing nations continually heightening the risks posed by military AI applications,
including more frequent and severe wars, and worse accidents.
There are many incentives for nations to increase their development, adoption, and
deployment of military AI applications. With more AI involvement, warfare can take
place at an accelerated pace, and at a more destructive scale. Nations that do not
adopt and use military AI technologies may therefore risk not being able to compete
with nations that do. As with nuclear mutually assured destruction, nations may also
employ automated retaliation as a signal of commitment, hoping to deter attacks by
demonstrating a credible resolve to respond swiftly and in kind. This process of
automation and AI delegation would thus perpetuate, despite it being increasingly
against the collective good.
Ultimately, as with economic automation, military AI arms races could result in
humans being unable to keep up. The pace and complexity of warfare could ascend
out of human reach to where we are no longer able to comprehend or intervene. This
could be an irreversible step putting us at high risk of catastrophic outcomes.

Extortion

In this section, we examine one last risk that arises when agents interact repeatedly:
the discovery of extortion.

Extortion strategies in the Iterated Prisoner's Dilemma. In the real world, we describe the use of threats to force a victim to take an action they would otherwise
not want to take (such as to relinquish something valuable) as “extortion.” Examples
include criminal organizations ransoming those they have kidnapped to extort their
families for money in exchange for their safe return.
In the Iterated Prisoner’s Dilemma, there is a set of extortion strategies that bear
similarity to this real-world phenomenon. An agent playing the game can use an extor-
tion strategy to ensure that their payoff in any match is higher than their partner’s
[375]. The extortionist achieves this by acting similarly to an agent using tit-for-
tat, responding like for like. However, the extortionist will occasionally defect
even when their partner has been cooperative. Extortionists effectively calculate the
maximum number of defections they can get away with without annihilating the mo-
tivation of their partner to continue cooperating with them. They decide whether
to cooperate or defect using a set of probabilities. The most recent interaction with
their partner determines which probability they select. An example strategy is shown
in Figure 7.7. An extortionist’s partner is incentivized to acquiesce to the extortion
since deviating in any way will yield them a lower payoff. However, in maximizing
their own score, they attain an even higher score for the extortionist. An extortionist
thus scores higher than most of its partners in Iterated Prisoner’s Dilemma matches.
Shown is an extortion strategy called Extort-2, from the point of view of the extortion-
ist. “You” are the agent using the Extort-2 strategy, and “they” are your partner.
As with all extortion strategies, Extort-2 involves reacting probabilistically to the
most recent interaction with a partner. As an example, in the previous round, if the
extortionist defected, but their partner cooperated, the extortionist will cooperate
with a probability of 1/3 in this round.

Figure 7.7. The Extort-2 strategy [376].
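A memory-one extortion strategy of this kind is straightforward to simulate. In the sketch below, the cooperation probabilities indexed by (extortionist's last move, partner's last move) are the values commonly quoted for Extort-2, and the point payoffs are the conventional ones from the extortion literature (higher is better here, unlike the jail-term framing); both are assumptions rather than details read off Figure 7.7.

```python
import random

# Cooperation probabilities for an Extort-2-style memory-one strategy,
# indexed by (my last move, partner's last move). Assumed values, not taken from Figure 7.7.
EXTORT_PROBS = {("C", "C"): 8 / 9, ("C", "D"): 1 / 2, ("D", "C"): 1 / 3, ("D", "D"): 0.0}

# Conventional Prisoner's Dilemma point payoffs (higher is better), same indexing.
POINTS = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def extortionist(my_last, their_last):
    if my_last is None:  # cooperate in the opening round
        return "C"
    return "C" if random.random() < EXTORT_PROBS[(my_last, their_last)] else "D"

def always_cooperate(my_last, their_last):
    return "C"

def play(strategy_a, strategy_b, rounds=100_000):
    last_a = last_b = None
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(last_a, last_b)
        move_b = strategy_b(last_b, last_a)
        points_a, points_b = POINTS[(move_a, move_b)]
        score_a += points_a
        score_b += points_b
        last_a, last_b = move_a, move_b
    return score_a / rounds, score_b / rounds

extortionist_avg, partner_avg = play(extortionist, always_cooperate)
print(f"extortionist: {extortionist_avg:.2f} points/round, "
      f"cooperative partner: {partner_avg:.2f} points/round")
```

Against an unconditionally cooperative partner, the extortionist's occasional defections lift its average score well above its partner's, illustrating the asymmetry described above; the partner nonetheless does better by continuing to cooperate than by retaliating.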

Extortion strategies rarely win tournaments but seldom die out alto-
gether. As we saw in Section 7.2.4, many uncooperative strategies may gain a
higher score than most of their partners in head-to-head matches, and yet still lose in
tournaments. By contrast, extortionists can be somewhat successful in tournaments
under certain conditions. Extortionists are vulnerable to the same problem as many
other uncooperative strategies: they gain low payoffs in matches against other extor-
tionists. Each will therefore perform less well as the frequency of extortionists in the
population increases. Thus, extortionists can persist if they are sufficiently unlikely
to meet one another. For instance, where a sufficiently small population of agents
is engaged in a tournament, a single extortionist can achieve very high payoffs by
exploiting cooperative strategies.

AI agents may use extortion: evidence from the Iterated Prisoner's Dilemma. AI agents could use extortion in order to gain resources or power. As
we have seen, agents can succeed in the Iterated Prisoner’s Dilemma by using extor-
tion strategies. This is particularly true if the extortionist is part of a small group,
if the social dynamics mirror evolution by natural selection, or after major environ-
mental changes. These findings are extremely worrying as they could describe future
AI scenarios. Relationships might form among a small number of powerful AI agents.
These agents may oscillate between desirable and undesirable behaviors, or they
might switch opportunistically to using extortion tactics in the wake of changes to
their environment. However, since there are some fragile assumptions in these simple
models, we must also consider evidence from real-world agents.

AI agents may use extortion: evidence from the real world. The
widespread use of extortion among humans outside the world of game theoretic
models suggests there is still a major cause for concern. Real-world extortion can
still yield results even when the target is perfectly aware that it is taking place. The
use of ransomware schemes to extort private individuals and companies is increasing
rapidly. In fact, cybersecurity experts estimate that the annual economic cost of ran-
somware activity is in the billions of US dollars. Terrorist organizations such as ISIS
rely on extortion through hostage-taking for a large portion of their total income.
The ubiquity of successful extortion in so many contexts sets a powerful historical
precedent for its efficacy.

Tail Risk: Extortion With Digital Minds

Here we examine the possibility of AI agents engaging in extortion to pursue their goals. Though the probability of AI extortion may be low, the impact could be im-
mense. As an example, we consider the potential for extortionist AIs to simulate and
torture sentient digital minds as leverage.

Real-world extortion is a form of optimized disvalue. An extortionist chooses to threaten their target using a personalized source of concern. They op-
timize their extortion to be prioritized over their target’s other concerns. Often, the
worse the outcome being threatened, the more likely the target is to acquiesce. This
incentivizes extortionists to threaten to bring about extremely disvaluable scenarios.
In order to be effective, extortionist AIs might therefore leverage the threat of huge
amounts of harm—far more than would likely come about incidentally, without de-
sign. If the disvaluable outcome the extortionist has designed for their target is also
disvaluable to wider society, then we will share the potentially enormous costs of any
executed threats.

AI extortion could entail torturing vast numbers of sentient digital minds. Human extortionists often threaten to inflict excruciating pain or death
on those their victim cares about. AI extortionists might engage in similar behav-
iors, threatening to induce extreme levels of suffering, but on a vastly larger scale.
This scale could potentially exceed any in human history. This is because extortion-
ist AIs with greater-than-human technological capabilities might be able to simulate
sentient digital minds. The potential for optimized disvalue in these simulations sug-
gests near-unimaginable horrors. Vast numbers of digital people in these simulated
environments could be subjected to immeasurably agonizing experiences.

Simulated torture at this scale could make the future more disvaluable
than valuable. Simulations designed for the purpose of extortion would likely
be far more disvaluable than simulations which contain disvalue unintentionally. The
simulation’s designer would likely be able to choose what kinds of objects to simulate,
so they could avoid wasting energy simulating non-sentient entities such as inanimate
objects. Moreover, the designer could ensure that these sentient entities experience
the greatest amount of suffering possible for the timespan of the simulation. They
might even be able to simulate minds capable of more disvaluable experiences than
have ever existed previously, deliberately designing the digital entities to be able to
suffer as greatly as possible. Put together, a simulation optimized for disvalue could
produce several orders of magnitude more disvalue than anything in history. This
would be unprecedented in humanity’s history and could make a horrifying—even
net negative—future.

AI agents may be superhumanly proficient at wielding extortion. Future AI agents may far exceed humans in their ability to wield threats. One reason for this
could be that they have superhuman tactical capabilities, as some do already in com-
petitive games. Superior strategic intelligence could allow AI agents to conceive and
execute far more advanced programs of extortion than that of which humans are gen-
erally capable. A second reason why AI agents may be especially adept at employing
threats is if they have superhuman longevity or goal-preservation capabilities. With
greater timespans available, the action space for extorting targets is larger. Finally,
AIs may have technological capabilities that exceed those of current and historical
humans. This could widen the option space for AI extortion still further.

Extortion may be exceptionally effective against AIs. Two goals of machine ethics are: (1) to foster in AI an intrinsic value for humanity (and humanity's values)
and (2) to make AI agents that are impartial. Both goals could result in AI agents
being more vulnerable to extortion than humans tend to be. Let us examine an
example of this for each goal.
Goal 1: Foster in AI an intrinsic value for humanity (and humanity’s values).
AI agents that value individual humans highly may be less prone to “scope insensi-
tivity.” This is the human bias of failing to “feel” changes in the size of some value
appropriately. Very small or very large numbers often appear to us to be of similar
size to other very small or very large numbers, even when they actually differ by
orders of magnitude. Human scope insensitivity may provide some protection against
larger-scale extortion, as it lowers the motivation of extortionists to increase the scale
of their threats. It is possible that AI agents may prioritize outcomes more accurately
in accordance with their expected value. If this is the case, they would likely be more
responsive to high stakes, and more vulnerable to large-scale extortion attempts.
Goal 2: Make AI agents that are impartial.
Impartial AI agents may have far more altruistic values than any human or institu-
tion. These agents may be extremely vulnerable to extortion in the form of threats
against their impartial moral codes. Extortionist AI agents could leverage the threat
of extreme torture of countless digital sentients in simulated environments to extort
more morally impartial AI targets. The execution of any such threat could immensely
degrade the value of the future.

AI extortionists may execute higher-stakes threats more frequently than humans. A successful act of extortion is the deliberate creation of a state in which
both the extortionists and their targets prefer the outcome the extortionist demands.
In some sense, both parties therefore want the target to acquiesce to the extortion
and the extortionist not to follow through on their threat. In this way, both usually
have some incentive to avoid the threat being executed. However, out of a desire to
signal credibility in future interactions, extortionists must follow through on threats
occasionally. Consider examples such as hostage ransoming or criminal syndicate
protection rackets. Successful future extortion requires a signal of commitment, such
as destroying the property of those who defy the extortionists.
AIs may carry out more frequent and more severe threats than humans tend to. One
reason for this is that they may have different value systems which tolerate higher
risks, reducing their motivation to acquiesce to extortion. For example, an AI agent
that sufficiently values the far future may prefer to demotivate future extortionists
from trying to extort them. They may therefore defy a current extortion attempt,
tolerating even very large costs to them and others, for the long-term benefit of
credibly signaling that future extortion attempts will not work either.
More generally, with a greater variety of value systems, a greater number of agents,
and a greater action space size, miscalibrated extortion attempts are more likely.
Where the threat is insufficient to force compliance, the aforementioned need to
signal credibility incentivizes the extortionist to execute their threat as punishment
for their target’s refusal to submit.
AI agents extorting humans. AI agents might also extort human targets. One
example scenario would be an AI developing both a weaponized biological pathogen,
and an effective cure. If the pathogen is slow-acting, the AI agent could then extort
humans by deploying the bioweapon, and leveraging the promise of its cure to force
those infected into complying with its demands. Pathogens that are sufficiently fast
to spread and difficult to detect could infect a very large number of human targets, so
this tactic could enable extremely large-scale threats to be wielded effectively [377].

Summary
The Iterated Prisoner’s Dilemma involves repeated rounds of the Prisoner’s Dilemma
game. This iteration offers a chance for agent cooperation but doesn’t ensure it.
There are different strategies by which agents can attempt to maximize their overall
payoffs. These strategies can be studied by competing agents against one another in
tournaments, where each agent competes against others in multiple rounds before
switching partners.
This provides cause for concern about a future with many AI agents. One example
of this is the phenomenon of “races” between AI stakeholders. These races strongly
influence the speed and direction of AI technological production, deployment and
adoption, in both corporate and military settings and have the potential to exac-
erbate many of the intrinsic risks from AI. The dynamics we have explored in this
section might cause competing agencies to cut corners on safety, escalate weaponized
AI applications and automate warfare. These are two examples of how competitive
pressures, modeled as iterated interactions between agents, can generate races which
increase the risk of catastrophe for everyone. Fostering cooperation between differ-
ent parties—human individuals, corporations, nations, and AI agents—is vital for
ensuring our collective safety.

7.2.5 Collective Action Problems

We began our exploration of game theory by looking at a very simple game, the
Prisoner’s Dilemma. We have so far considered two ways to model real-world social
scenarios in more detail. First, we explored what happens when two agents interact
multiple times (such as an Iterated Prisoner’s Dilemma match). Second, we intro-
duced a population of more than two agents, where each agent switches partners
over time (such as an Iterated Prisoner’s Dilemma tournament). Now we move be-
yond pairwise interactions, to interactions that simultaneously involve more than two
agents. We consider what happens when an agent engages in repeated rounds of the
Prisoner’s Dilemma against multiple opponents at the same time.
One class of scenarios that can be described by such a model is collective action
problems. Throughout this section, we first discuss the core characteristics of collective
action problems. Then, we introduce a series of real-world examples to highlight the
ubiquity of these problems in human society and show how AI races can be modeled
in this way. Following this, we transition to a brief discussion of common pool resource
problems to further illustrate the difficulty with which rational agents, especially AI
agents, may secure collectively good outcomes. Finally, we conclude with a detailed
discussion of flash wars and autonomous economies to show how in a multi-agent
setting, AIs might pursue behaviors or tactics that result in catastrophic or existential
risks to humans.

Introduction

This first section explores the nature of collective action problems. We begin with
a simple example of a collaborative group project. Through this, we explore how
individual incentives can sometimes clash with what is in the best interests of the
group as a whole. These situations can motivate individuals to act in ways that
negatively impact all of the population.

A collective action problem is like a group-level Iterated Prisoner's Dilemma. In the Iterated Prisoner's Dilemma, we saw how a pair of rational
agents can tend toward outcomes that are undesirable for both. Now let us con-
sider social interactions between more than two agents. When an individual engages
with multiple partners simultaneously, they may still converge on Pareto inefficient
Nash equilibria. In fact, with more than two agents, cooperation can be even harder
to secure. We can therefore model collective action problems as an Iterated Prisoner’s
Dilemma in which more than two prisoners have been arrested: If enough of them
decide to defect on their partners, all of them will suffer the consequences.

Example: group projects. A typical example of a collective action problem is that of a collaborative project. A group working together toward a shared goal often
encounters a problem: not everyone pitches in. Some group members take advantage
of the rest, benefiting from the work others are doing without committing as much
effort themselves. The implicit reasoning behind the behavior of these “slackers” is
as follows. They want the group’s goal to be achieved, but they would prefer this
to happen without costing them much personal effort. Just as with the Prisoner’s
Dilemma, “slacking” is their dominant strategy. If the others work hard and the
project is completed, they get to enjoy the benefits of this success without expending
too much effort themselves. If the others fail to work hard and the project is not
completed, they at least save themselves the effort they might otherwise have wasted.
As groups increase in size and heterogeneity, complexity increases accordingly. Agents
in a population may have a diverse set of goals. Even if the population can agree on
a common goal, aligning diverse agents with this goal can be difficult. For example,
even when the public expresses strong and widespread support for a political measure,
their representatives often fail to carry it out.

Formalization

Here, we formalize our model of collective action problems. We look more closely at
the incentives governing individual choices, and the effects these have at the group
level. We examine how the behavior of others in the group can alter the incentives
facing any individual, and how we can (and do) use these mechanisms to promote
cooperative behavior in our societies.

Each agent must choose whether to contribute to the common good. As in the Prisoner's Dilemma, each agent must choose which of two actions to take. An
agent can choose to contribute to the common good, at some cost to themselves.
The alternative is for the agent to choose to free ride, benefiting from others’ contri-
butions at no personal cost. Free riders impose negative externalities—collateral
damage for others in pursuit of private benefit—on the group as a whole by choosing
not to pitch in.

Free riding is the dominant strategy. For now, let us assume that free riding
increases an agent’s own personal benefit, regardless of whether the others contribute
or free ride: it is the dominant strategy. If an agent’s contribution to the common
good is small, then choosing not to contribute does not significantly diminish the
collective good, meaning that an agent’s decision to free ride has essentially no neg-
ative consequences for the agent themself. Thus, the agent is choosing between two
outcomes. The first outcome is where they gain their portion of the collective benefit,
and pay the small cost of being a contributor. The other outcome is where they gain
this same benefit, but save themselves the cost of contributing.

Free riding can produce Pareto inefficient outcomes. Just as both agents defecting in the Prisoner's Dilemma produces Pareto inefficiency, free riding
in a collective action problem can result in an outcome that is bad for all. In many
cases, some agents can free ride without imposing significant externalities on everyone
else. However, if sufficiently many agents free ride, this diminishes the collective good
by leading to no provision of a public good, for instance. With sufficient losses, the
agents will all end up worse than if they had each paid the small individual cost
of contributing and received their share of the public benefit. Importantly, however,
even in this Pareto inefficient state, free riding might still be the dominant strategy
for each individual, since the cost of contributing outweighs the trivial increase in the collective good that their own contribution would provide. Thus, escaping undesirable
equilibria in a collective action problem can be exceedingly difficult; see Figure 7.8.

Figure 7.8. In this abstract collective action problem, we can move from left (everyone contributes) to right (no one contributes). As more people free ride, the collective good disappears, leaving everyone in a state where they would all prefer to collectively contribute instead.

We can illustrate a collective action problem using the simple payoff matrix below. In
the matrix, “b” represents the payoff an agent receives when everyone else cooperates
(the collective good divided between the number of agents) and “c” represents the
personal cost of cooperation. As the matrix illustrates, the dominant strategy for a
rational agent (“you”) here is to free ride whether everyone else contributes or free
rides.
TABLE 7.4 Free riding is always better for an individual: it is a dominant strategy.

                     The rest of the group   The rest of the group   Some contribute;
                     contributes             free rides              others free ride
You contribute       b − c                   −c                      < b − c
You free ride        b                       0                       < b
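The same logic can be phrased as a small public-goods calculation. In the sketch below, each contribution adds an amount v to a shared pot that is split equally among all n agents, and contributing costs c; the specific numbers are illustrative, chosen so that an agent's own share of their contribution (v / n) is smaller than c.

```python
# Toy public-goods payoffs (illustrative numbers). Free riding is dominant
# whenever v / n < c, even though everyone contributing beats no one contributing.
n, v, c = 10, 4.0, 1.0  # v / n = 0.4 < c = 1.0 < v = 4.0

def payoff(contribute, num_other_contributors):
    contributors = num_other_contributors + (1 if contribute else 0)
    share_of_pot = contributors * v / n
    return share_of_pot - (c if contribute else 0.0)

for k in [0, 5, 9]:  # how many of the other n - 1 agents contribute
    print(f"{k} others contribute: contribute -> {payoff(True, k):+.2f}, "
          f"free ride -> {payoff(False, k):+.2f}")

print("everyone contributes, each gets:", payoff(True, n - 1))   # v - c = 3.0
print("no one contributes, each gets:  ", payoff(False, 0))      # 0.0
```

Whatever the others do, free riding pays an extra c − v/n, yet universal contribution (payoff v − c) Pareto dominates universal free riding (payoff 0), mirroring the b and c entries in Table 7.4.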

Agents' incentives depend on the behavior of other agents. Agents in collective action problems can be aware of the choices other agents make, which can
affect their strategies and behavior over time. For example, the ratio of defectors to
cooperators in a population can affect the degree to which cooperation is achieved.
When rational agents interact with each other, they may be inclined to shift their
strategies to more favorable ones with higher individual payoffs: they may realize
that other agents are utilizing more successful strategies, and thus choose to adopt
them. If defectors dominate the population initially, and the initial individual costs
of cooperation outweigh the collective benefits of cooperation, then the population
may tend toward an uncooperative state. In simple terms, collective action problems
cannot be solved without cooperation.

Mutual and external coercion. We can increase the probability of cooperation by generating incentives that lower the individual cost of cooperation and increase
the individual cost of defection. There are two ways we may go about this: mutual or
external coercion. Mutual coercion generates cooperative incentives by establishing
communal, societal, and reputational norms. External coercion generates cooperative
incentives through external intervention, by developing regulations that incentivize
collective action through mandates, sanctions, and legislation, making cooperation a
necessity in certain cases. Below, we illustrate some real-world scenarios in further
detail.

Real-World Examples of Collective Action Problems

Many large-scale societal issues can be understood as collective action problems. This
section explores collective action problems in the real world: climate change, public
health, and democratic voting. We end by briefly looking at AI races through this
same lens.

Public health. We can model some public health emergencies, such as disease
epidemics, as collective action problems. The COVID-19 pandemic took the lives
of millions worldwide. Some of these deaths could have been avoided with stricter
compliance with public health measures such as social distancing, frequent testing,
and vaccination. We can model those adhering to these measures as “contributing”
(by incurring a personal cost for public benefit) and those violating them as “free
riding.”
Assume that everyone wished the pandemic to be controlled and ultimately eradi-
cated, that complying with the suggested health measures would have helped hasten
this goal, and that the benefits of collectively shortening the pandemic timespan
would have outweighed the personal costs of compliance with these measures (such
as social isolation). Everyone would prefer the outcome where they all complied with
the health measures over the one where few of them did. Yet, each person would
prefer still better the outcome where everyone else adhered to the health measures,
and they alone were able to free ride. Violating the health measures was therefore
the dominant strategy, and so many people chose to do this, imposing the negative
externalities of excessive disease burden on the rest of their community.
We used both mutual and external mechanisms to coerce people to comply with
public health measures in the pandemic. For example, some communities adjusted
their social norms (mutual coercion) such that non-compliance with public health
measures would result in damage to one’s reputation. We also required proof of
vaccination for entry into desirable social spaces (external coercion), among many
other requirements.
Anthropogenic climate change. In 2021, a majority of those surveyed worldwide reported wanting to avert catastrophic anthropogenic climate change. Most, however,
chose not to act in accordance with what they believed necessary to achieve this
goal. The consumption of animal products typically entails far higher greenhouse gas
emissions and environmental damage than plant-based alternatives. The use of public
over private transport similarly reduces personal carbon footprints dramatically. To
avoid the costs of taking these actions, such as changing routines and compromising
on speed or ease, most people do not change their diets or transport habits. Various
behaviors that increase pollution can be viewed as “free riding.” Since this is the
dominant strategy for each agent, most choose to do this, resulting in ever-worsening
climate change, imposing risks on the global population.
We could disincentivize excessive meat eating and private transport using external
and mutual coercion. In this example, external coercion could include lowering bus
and train fares and enhancing existing infrastructure through government subsidies,
as well as implementing fuel taxes on private vehicles. Mutual coercion could in-
clude changing social norms to consider excessive meat eating or short-haul flying
unacceptable.

Democracy. We can model the maintenance of a democracy as a set of collective action problems. There are many situations in which certain actions might provide an
individual with immediate benefits, but would incur longer-term costs on the larger
group if more people were to take these actions. For example, a voting population
must maintain certain norms in order to keep its democracy functioning. One of these
norms is to vote only for candidates who will not undermine democratic processes,
even if others have desirable traits.
Choosing whether or not to participate in an election at all can similarly be viewed
as a collective action problem. The outcome of an election is determined by the votes
of individuals, each of which has a choice to either vote or abstain. The results of
the election are determined by the votes of those who choose to participate, and the
costs of participating in the election are carried by citizens themselves, such as the
time and effort required to register and cast a vote. When large enough numbers of
citizens decide to abstain from voting, the collective outcome of an election may not
accurately reflect the preferences of the population: by acting in accordance with their
rational self-interest, citizens may contribute to a suboptimal collective outcome.

Common pool resource problem. Rational agents are incentivized to take more
than a sustainable amount of a shared resource. This is called a common pool resource
problem or tragedy of the commons problem. We refer to a common pool resource be-
coming catastrophically depleted as collapse. Collapse occurs when rational agents,
driven by their incentive to maximize personal gain, tip the available supply of the
shared resource below its sustainability equilibrium [378]. Below, we further illustrate
how complicated it is to secure collectively good outcomes, especially when rational
agents act in accordance with their self-interest. Such problems are prevalent at the
societal level, and often bear catastrophic consequences. Thus, we should not elimi-
nate the possibility that they may also occur with AI agents in a multi-agent setting.
For example, rainforests around the world have been diminished greatly by deforesta-
tion practices. While these forests still exist as a home to millions of different species
and many local communities, they may reach a point at which they will no longer
be able to rejuvenate themselves. If these practices are sustained, the entire ecosys-
tem these forests support could collapse. Common pool resource problems exemplify
how agents may bring about catastrophes even when they behave rationally and in
their self-interest, with perfect knowledge of the looming catastrophe, and despite the
seeming ability to prevent it. They further illustrate how complicated it can be to
secure collectively good outcomes and how rational agents can act to the detriment of
their own group. As with many other collective action problems, we can’t expect to
solve common pool resource problems by having AIs manage them. If we simply pass
the buck to AI representatives, the AIs will inherit the same incentive structure that
produces the common pool resource problem, and so the problem will likely remain.
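
To make the depletion dynamic concrete, here is a minimal simulation sketch of a common pool resource with logistic regrowth. All parameter values (number of agents, growth rate, harvest sizes) are illustrative assumptions rather than figures from the text; the point is only that a small per-agent increase in harvesting can push the total take above the regrowth rate and collapse the stock.

```python
# Minimal tragedy-of-the-commons sketch (illustrative parameters, not from the text).
# A shared stock regrows logistically; each of n agents harvests a fixed amount per step.

def simulate(harvest_per_agent, n_agents=10, stock=100.0, capacity=100.0,
             growth_rate=0.25, steps=50):
    """Return the stock level over time under a constant per-agent harvest."""
    history = [stock]
    for _ in range(steps):
        stock += growth_rate * stock * (1 - stock / capacity)   # logistic regrowth
        stock = max(stock - n_agents * harvest_per_agent, 0.0)  # total harvest
        history.append(stock)
    return history

sustainable = simulate(harvest_per_agent=0.5)   # total harvest stays below peak regrowth
greedy = simulate(harvest_per_agent=1.0)        # each agent takes only a little more

print(f"Stock after 50 steps, sustainable harvest: {sustainable[-1]:.1f}")
print(f"Stock after 50 steps, greedy harvest:      {greedy[-1]:.1f}")
```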

AI Races Between More Than Two Competitors

In the previous section, we looked at how corporations and militaries may compete
with one another in “AI races.” We used a two-player “attrition” bidding model to
see why AI companies cut corners on safety when developing and deploying their
technologies. We used another two-player “security dilemma” model to understand
how security concerns motivate nations to escalate their military capabilities, even
while increasing the risks imposed on all by increasingly automating warfare in this
manner.
Here, we extend our models of these races to consider more than two parties, allowing
us to see them as collective action problems. First, we look at how military AI arms
races increase the risk of catastrophic outcomes such as a flash war: a war that is
triggered by autonomous AI agents that quickly spirals out of human control [273].
Second, we explore how ever-increasing job automation could result in an autonomous
economy: an economy in which humans no longer have leverage or control.

Military AI arms race outcome: flash war. The security dilemma model we
explored in the previous section can be applied to more than two agents. In this
context, we can see it as a collective action problem. Though all nations would be at
lower risk if all were to cooperate with one another (“contribute” to their collective
safety), each will individually do better instead to escalate their own military capa-
bilities (“free ride” on the contributions of the other nations). Here, we explore one
potentially catastrophic outcome of this collective action problem: a flash war.
As we saw previously, military AI arms races motivate nations to automate mili-
tary procedures. In particular, there are strong incentives to integrate “automated
retaliation” protocols. Consider a scenario in which several nations have each constructed
an autonomous AI military defense system to gain a defensive military advantage.
These AIs must be able to act on perceived threats without human intervention.
Additionally, each is aligned with a common goal: “defend our nation from attack.”
Even if these systems are nearly perfect, a single erroneous detection of a perceived
threat could trigger a decision cascade that launches the nation into a “flash war.”

Once one AI system hallucinates a threat and issues responses, the AIs of the nations
being targeted by these responses will follow suit, and the situation could escalate
rapidly. A flash war would be catastrophic for humanity, and might prove impossible
to recover from.
A flash war is triggered and amplified by successive interactions between autonomous
AI agents such that humans lose control of weapons of mass destruction [379]. Any
single military defense AI could trigger it, and the process could continue without
human intervention and at great speed. Importantly, having humans in the loop will
not necessarily ensure our safety. Even if AIs only provide human operators with
instructions to retaliate, our collective safety would rest on the chance that soldiers
would willfully disobey their instructions.
Collective action between nations could avoid these and other dire outcomes. Limit-
ing the capabilities of their military AIs by decreasing funding and halting or slowing
down development would require that each nation give up a potential military ad-
vantage. In a high stakes scenario such as this one, rational agents (nations) may be
unwilling to give up such an advantage because it dramatically increases the vulner-
ability of their nation to attack. The individual cost of cooperation is high while the
individual cost of defection is low, and as agents continue to invest in military capa-
bilities, competitive pressures increase, further exacerbating the costs of cooperation
and thereby disincentivizing collective action. While the collective benefits of coopera-
tion would drastically reduce the catastrophic risks of this scenario in the long-term,
they may not outweigh the self-interest of rational agents in the short-term.
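
The risk compounds with the number of interacting systems. The toy calculation below is purely illustrative (the per-system false-alarm rate and time horizon are assumed placeholders, not empirical estimates); it shows how even near-perfect detectors, multiplied across many systems and many decision windows, yield a substantial chance that at least one erroneous detection occurs somewhere.

```python
# Illustrative only: probability that at least one of n near-perfect autonomous
# detection systems raises a false alarm over a given number of decision windows.
# The error rate p is an assumed placeholder, not an empirical estimate.

p = 1e-4          # assumed probability of a false alarm per system per window
windows = 365     # e.g., one decision window per day for a year

for n_systems in (1, 10, 50):
    p_any = 1 - (1 - p) ** (n_systems * windows)
    print(f"{n_systems:>2} systems: P(at least one false alarm in a year) = {p_any:.2%}")
```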

Corporate AI race outcome: autonomous economy. As AIs become increasingly
effective at carrying out human goals, they may begin to out-perform the av-
erage human at an increasing number and range of jobs, from personal assistants
to executive decision-makers. To reap the benefits of these faster and more effective
workers, companies will likely continue to automate economically valuable functions
by delegating them to AI agents. Ultimately, this could lead to the global economy
becoming “autonomous,” with humans no longer able to steer or intervene [380].
Such an autonomous economy would be a catastrophe for humanity. Like passengers
in an autonomous vehicle, our safety and destination would rest with the AI systems
now acting without our supervision. Our future would be determined by the behavior
and outputs of this autonomous economy. If the AI agents engaged in this economy
were to have undesirable goals or evolve selfish traits—a possibility we examine in the
next section of this chapter—humanity would be unable to prevent the harms they
cause. Even if the AIs themselves are well-aligned to our goals, the economic system
itself may produce extremely undesirable outcomes. In this section, we have examined
many examples of how macrobehavior can differ dramatically from micromotives. A
population of individuals can tend toward states that are bad for everyone and yet be
in stable equilibria. This could happen just the same with AI representatives acting
on humanity’s behalf in an autonomous economy.
Just as with military AI arms races, we can model how an autonomous economy might
be brought about using the security dilemma model. As in the previous example, if
we expand this model to more than two agents, we can see it as a collective action
problem in which competitive pressures drive different parties to automate economic
functions out of the need to “keep up” with their competitors. Under this model, we
can see how companies must choose whether to maintain human labor (“contribut-
ing”) or automate these jobs using AI (“free riding”). Although all would prefer the
outcome in which the calamity of an autonomous economy is avoided, each would
individually prefer to have a competitive advantage and not risk being outperformed
by rivals who reap the short-term benefit of using AIs. Thus, economic automation
is the dominant strategy for each competitor. Repeated rounds of this game in which
a sufficient number of agents free ride would drive us toward this disaster. In each
successive round, it would become progressively more difficult to turn back, as we
come to rely increasingly on more capable AI agents.
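
A stylized payoff calculation, with made-up numbers, can illustrate why automation functions as the dominant strategy in this model even though universal automation leaves every firm worse off than universal restraint. The "shared cost" term below is an assumed stand-in for the collective risk that each additional automating firm imposes on all.

```python
# Stylized n-firm automation game with illustrative payoffs (not from the text).
# Each firm chooses to keep human labor ("contribute") or automate ("free ride").
# Automating gives a private edge, but every firm that automates adds a shared
# risk cost borne by all firms.

def payoff(automates: bool, num_other_automators: int,
           edge=4.0, shared_cost=1.0) -> float:
    private = edge if automates else 0.0
    total_automators = num_other_automators + (1 if automates else 0)
    return private - shared_cost * total_automators

for k in range(0, 10, 3):  # number of *other* firms that automate
    keep = payoff(False, k)
    auto = payoff(True, k)
    print(f"{k} others automate: keep humans -> {keep:+.1f}, automate -> {auto:+.1f}")

# Automating is better for an individual firm regardless of k (dominant strategy),
# yet if all 10 firms automate each gets 4 - 10 = -6, worse than 0 if none did.
```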

Increasing AI autonomy increases the risk of catastrophic outcomes. As AIs
become more autonomous, humans may delegate more decision-making power to
them. If AIs are able to successfully and consistently attain the high-level objec-
tives given to them by humans, we may be more inclined to begin providing them
with open-ended goals. If AIs achieve these goals, humans might not be privy to the
process they follow and may overlook potential harms, as we saw in both the au-
tonomous economy and flash war examples. Moreover, adaptive AIs—systems that
actively adjust their computational design, architecture and behavior in response to
new information or changes in the environment—could adapt at a much faster rate
than humans. The possibility of self-improvement among such AIs would further ex-
acerbate this problem. Adaptive AIs could develop unanticipated emergent behaviors
and strategies, making them deeply unpredictable. Humans could be inclined to ac-
cept these negative behaviors in order to maintain a competitive advantage in the
short-term.

Reducing competitive pressures could foster collective action. The security
dilemma model shows how nations can be motivated to escalate their offensive
capabilities out of the perception that their competitors are doing the same. How-
ever, by signaling the opposite, we might be able to produce the reverse effect, such
as military de-escalation or an increase in AI safety standards. For instance, whether
different nations will acquiesce to a shared international standard for AI regulation
may depend on whether the nations are individually signaling their willingness to
regulate in their own jurisdiction in the first place. If one nation perceives that oth-
ers are engaging in strict domestic regulation, they might see this as a credible signal
of commitment to an international standard. By easing the competitive pressures, we
might be able to foster collective action to avoid driving up the collective risk level.

7.2.6 Summary

We observe important and seemingly intractable collective action problems in many
domains of life, such as environmental degradation, pandemic responses, maintenance
of democracies, and common pool resource depletion. We can understand these as
Iterated Prisoner’s Dilemmas with many more than two agents interacting simulta-
neously in each round of the game. As before, we see that “free riding” can be the
dominant strategy for an individual agent, and this can lead to Pareto inefficient out-
comes for the group as a whole. We can use the mechanisms of mutual and external
coercion to incentivize agents to cooperate with each other and achieve collectively
good outcomes.
If we expand our models of AI races to include more than two agents, we can un-
derstand the races themselves as collective action problems, and examine how they
exacerbate the risk of catastrophe. One example is how increasingly automating mil-
itary protocols increases the risk of a “flash war.” Similar dynamics of automation in
the economic sphere could lead to an “autonomous economy.” Either outcome would
be disastrous and potentially irreversible, yet we can see how competitive pressures
can drive rational and self-interested agents (such as nations or companies) down a
path toward these calamities.
In this section, we examined some simple, formal models of how rational agents may
interact with each other under varying conditions. We used these game theoretic mod-
els to understand the natural dynamics in multi-agent biological and social systems.
We explored how these multi-agent dynamics can generate undesirable outcomes
for all those involved. We considered some tail risks posed by interactions between
human and AI agents. These included human-directed companies and militaries en-
gaging in perilous races, as well as autonomous AIs using threats for extortion.
These risks can be reduced if mechanisms such as institutions are used to ensure
human agencies and AI agents are able to cooperate with one another and avoid
conflict. We explore some means of achieving cooperative interactions in the next
section of this chapter, 7.3.

7.3 COOPERATION

Overview
In this chapter, we have been exploring the risks that arise from interactions between
multiple agents. So far, we have used game theory to understand how collective
behavior can produce undesirable outcomes. In simple terms, securing morally good
outcomes without cooperation can be extremely difficult, even for intelligent rational
agents. Consequently, the importance of cooperation has emerged as a strong theme
in this chapter. In this third section of this chapter, we begin by using evolutionary
theory to examine cooperation in more detail.
We observe many forms of cooperation in biological systems: social insect colonies,
pack hunting, symbiotic relationships, and much more. Humans perform community
services, negotiate international peace agreements, and coordinate aid for disaster
responses. Our very societies are built around cooperation.
Cooperation between AI stakeholders. Mechanisms that can enable coop-
eration between the corporations developing AI and other stakeholders such as
governments may be vital for counteracting the competitive and evolutionary pres-
sures of AI races we have explored in this chapter. For example, the “merge-and-
assist” clause of OpenAI’s charter [381] outlines their commitment to cease com-
petition with—and provide assistance to—any “value-aligned, safety-conscious” AI
developer who appears close to producing AGI, in order to reduce the risk of eroding
safety precautions.

Cooperation between AI agents. Many also suggest that we must ensure the
AI systems themselves also act cooperatively with one another. Certainly, we do want
AIs to cooperate, rather than to defect, in Prisoner’s Dilemma scenarios. However,
this may not be a total solution to the collective action problems we have examined
in this chapter. By more closely examining how cooperative relationships can come
about, it is possible to see how making AIs more cooperative may backfire with serious
consequences for AI safety. Instead, we need a more nuanced view of the potential
benefits and risks of promoting cooperation between AIs. To do this, we study five
different mechanisms by which cooperation may arise in multi-agent systems [382],
considering the ramifications of each:
• Direct reciprocity: when individuals are likely to encounter others in the future,
they are more likely to cooperate with them.
• Indirect reciprocity: when it benefits an individual’s reputation to cooperate with
others, they are more likely to do so.
• Group selection: when there is competition between groups, cooperative groups
may outcompete non-cooperative groups.
• Kin selection: when an individual is closely related to others, they are more likely
to cooperate with them.
• Institutional mechanisms: when there are externally imposed incentives (such as
laws) that subsidize cooperation and punish defection, individuals and groups are
more likely to cooperate.

Direct Reciprocity

Direct reciprocity overview. One way agents may cooperate is through direct
reciprocity: when one agent performs a favor for another because they expect the
recipient to return this favor in the future [383]. We capture this core idea in idioms
like “quid pro quo,” or “you scratch my back, I’ll scratch yours.” Direct reciprocity re-
quires repeated interaction between the agents: the more likely they are to meet again
in the future, the greater the incentive for them to cooperate in the present. We have
already encountered this in the iterated Prisoner’s Dilemma: how an agent behaves
in a present interaction can influence the behavior of others in future interactions.
Game theorists sometimes refer to this phenomenon as the “shadow of the future.”
When individuals know that future cooperation is valuable, they have increased in-
centives to behave in ways that benefit both themselves and others, fostering trust,
reciprocity, and cooperation over time. Cooperation can only evolve as a consequence
of direct reciprocity when the probability, w, of subsequent encounters between the
same two individuals is greater than the cost-benefit ratio of the helpful act. In other
words, if agent A decides to help agent B at some cost c to themselves, they will only
do so when the expected benefit b of agent B returning the favor outweighs the cost
of agent A initially providing it. Thus, we have the rule w > c/b; see Table 7.5.

TABLE 7.5 Payoff matrix for direct reciprocity games.

             Cooperate            Defect

Cooperate    (b − c)/(1 − w)      −c
Defect       b                    0
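
A brief numerical sketch, using assumed values of b, c, and w together with the payoff entries in the table above, shows where the w > c/b threshold comes from: mutual cooperation earns b − c per round over an expected 1/(1 − w) rounds, and this beats exploiting a cooperator once only when w exceeds c/b.

```python
# Sketch of the direct-reciprocity payoffs above, with assumed values of b, c, and w.
b, c = 3.0, 1.0            # benefit to the recipient, cost to the helper (assumed)

def payoffs(w):
    """Expected payoffs when future interaction continues with probability w."""
    coop_vs_coop = (b - c) / (1 - w)   # cooperate every round, ~1/(1-w) rounds expected
    defect_vs_coop = b                 # exploit a cooperator once, then nothing
    return coop_vs_coop, defect_vs_coop

for w in (0.2, c / b, 0.6):
    cc, dc = payoffs(w)
    better = "cooperation stable" if cc > dc else "defection pays"
    print(f"w = {w:.2f}: cooperate-with-cooperator = {cc:.2f}, "
          f"defect-against-cooperator = {dc:.2f} -> {better}")
# The switch happens exactly at w = c/b, matching the rule w > c/b.
```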

Natural examples of direct reciprocity. Trees and fungi have evolved sym-
biotic relationships where they exchange sugars and nutrients for mutual benefit.
Dolphins use cooperative hunting strategies where one dolphin herds schools of fish
while the others form barriers to encircle them. The dynamics of the role reversal
are decided by an expectation that other dolphins in the group will reciprocate this
behavior during subsequent hunts. Similarly, chimpanzees engage in reciprocal groom-
ing, where they exchange grooming services with one another with the expectation
that they will be returned during a later session [384].
Direct reciprocity in human society. Among humans, one prominent example of di-
rect reciprocity is commerce. Commerce is a form of direct reciprocity “which offers
positive-sum benefits for both parties and gives each a selfish stake in the well-being
of the other” [385]; commerce can be a win-win scenario for all parties involved. For
instance, if Alice produces wine and Bob produces cheese, but neither Alice nor Bob
has the resources to produce what the other can, both may realize they are better off
trading. Different parties might both need the good the other has when they can’t
produce it themselves, so it is mutually beneficial for them to trade, especially when
they know they will encounter each other again in the future. If Alice and Bob both
rely on each other for wine and cheese respectively, then they will naturally seek
to prevent harm to one another because it is in their rational best interest. To this
point, commerce can foster complex interdependencies between economies, which en-
hances the benefits gained through mutual exchange while decreasing the probability
of conflict or war.
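
A tiny numerical illustration of this win-win structure (the valuations for Alice and Bob are arbitrary assumptions): when each party values the good they lack more than the good they hold in surplus, a trade raises both payoffs, giving each a selfish stake in the other's continued wellbeing.

```python
# Illustrative gains-from-trade calculation with made-up valuations.
# Alice holds surplus wine, Bob holds surplus cheese; each values the other's
# good more highly than their own surplus unit.

alice_values = {"wine": 1.0, "cheese": 3.0}   # Alice's value for one unit of each good
bob_values = {"wine": 3.0, "cheese": 1.0}     # Bob's value for one unit of each good

# Before trade: each holds one surplus unit of their own product.
alice_before, bob_before = alice_values["wine"], bob_values["cheese"]

# After trading wine for cheese one-for-one:
alice_after, bob_after = alice_values["cheese"], bob_values["wine"]

print(f"Alice: {alice_before} -> {alice_after}   Bob: {bob_before} -> {bob_after}")
# Both gain from the exchange, so each has a selfish reason to keep the other around.
```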

Direct reciprocity and AIs. The future may contain multiple AI agents, many
of which might interact with one another to achieve different functions in human
society. Such AI agents may automate parts of our economy and infrastructures, take
over mundane and time-consuming tasks, or provide humans and other AIs with daily
assistance. In a system with multiple AI agents, where the probability that individual
AIs would meet again is high, AIs might evolve cooperative behaviors through direct
reciprocity. If one AI in this system has access to important resources that other
AIs need to meet their objectives, it may decide to share these resources accordingly.
However, since providing this favor would be costly to the given AI, it will do so
only when the probability of meeting the recipient AIs (those that received the favor)
outweighs the cost-benefit ratio of the favor itself.

Direct reciprocity can backfire: AIs may disfavor cooperation with hu-
mans. AIs may favor cooperation with other AIs over humans. As AIs become
substantially more capable and efficient than humans, the benefit of interacting with
humans may decrease. It may take a human several hours to reciprocate a favor pro-
vided by an AI, whereas it may take an AI only seconds to do so. It may therefore
become extremely difficult to formulate exchanges between AIs and humans that
benefit AIs more than exchanges with other AIs would. In other words, from an AI
perspective, the cost-benefit ratio for cooperation with humans is not worth it.

Direct reciprocity may backfire: offers of AI cooperation may undermine human
alliances. The potential for direct reciprocity can undermine the stability
of other, less straightforward cooperative arrangements within a larger group, thereby
posing a collective action problem. One example of this involves “bandwagoning.” In
the Alignment section of the Single-Agent Safety chapter, we discussed the idea of
“balancing” in international relations: state action to counteract the influence of
a threatening power, such as by forming alliances with other states against their
common adversary [107]. However, some scholars argue that states do not always
respond to threatening powers by trying to thwart them. Rather than trying to
prevent them from becoming too strong, states may instead “bandwagon”: joining
up with and supporting the rising power to gain some personal benefit.
For instance, consider military coups. Sometimes, those attempting a takeover will
offer their various enemies incentives to join forces with them, promising rewards to
whoever allies with them first. If one of those being made this offer believes that the
usurpers are ultimately likely to win, they may consider it to be in their own best
interests to switch sides early enough to be on the “right side of history.” When others
observe their allies switching sides, they may see their chances of victory declining
and so in turn decide to defect. In this way, bandwagoning can escalate via positive
feedback.
Bandwagoning may therefore present the following collective action problem: people
may be motivated to cooperate with powerful and threatening AI systems via direct
reciprocity, even though it would be in everyone’s collective best interest if none
were to do so. Imagine that a future AI system, acting autonomously, takes actions
that cause a large-scale catastrophe. In the wake of this event, the international
community might agree that it would be in humanity’s best interest to constrain
or roll back all autonomous AIs. Powerful AI systems might then offer some states
rewards if they ally with them (direct reciprocity). This could mean protecting the
AIs by simply allowing them to intermingle with the people, making it harder for
outside forces to target the AIs without human casualties. Or the state could provide
the AIs with access to valuable resources. Instead of balancing (cooperating with
the international community to counteract this threatening power), these states may
choose to bandwagon, defecting to form alliances with AIs. Even though the global
community would all be better off if all states were to cooperate and act together to
constrain AIs, individual states may benefit from defecting. As before, each defection
would shift the balance of power, motivating others to defect in turn.

Indirect Reciprocity

Indirect reciprocity overview. When someone judges whether to provide a favor
to someone else, they may consider the recipient’s reputation. If the recipient is
known to be generous, this would encourage the donor (the one that provides the
favor) to offer their assistance. On the other hand, if the recipient has a stingy or
selfish reputation, this could discourage the donor from offering a favor. In considering
whether to provide a favor, donors may also consider the favor’s effect on their own
reputation. If a donor gains a “helpful and trustworthy” reputation by providing a
favor, this may motivate others to cooperate with them more often. We call this
reputation-based mechanism of cooperation indirect reciprocity [386]. Agents may
cooperate to develop and maintain good reputations since doing so is likely to benefit
them in the long-term. Indirect reciprocity is particularly useful in larger groups,
where the probability that the same two agents will encounter one another again
is lower. It provides a mechanism for leveraging collective knowledge to promote
cooperation. Where personal interactions are limited, reputation-based evaluations
provide a way to assess the cooperative tendencies of others. Importantly, cooperation
can only emerge within a population as a consequence of indirect reciprocity when
the probability, q, that any agent can discern another agent’s reputation (whether
they are cooperative or not), outweighs the cost-benefit ratio of the helpful behavior
to the donor. Thus, we have the rule q > c/b; see Table 7.6.

TABLE 7.6 Payoff matrix for indirect reciprocity games.

             Discern          Defect

Discern      b − c            −c(1 − q)
Defect       b(1 − q)         0
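
The q > c/b rule can be recovered from a simple expected-payoff sketch. The calculation below assumes a steady state in which discriminators hold good reputations and defectors hold bad ones, and that an uninformed donor gives the recipient the benefit of the doubt; both are simplifying assumptions for illustration, not claims from the text.

```python
# Deterministic sketch of indirect reciprocity under assumed steady-state reputations:
# discriminators are in good standing, defectors in bad standing. A donor learns the
# recipient's standing with probability q and assumes "good" otherwise.
b, c = 3.0, 1.0                  # assumed benefit and cost of helping

def expected_payoffs(q, x):
    """Per-round payoffs for a discriminator and a defector when a fraction x
    of the population are discriminators."""
    disc = b * x - c * (x + (1 - x) * (1 - q))   # helped by discriminators; helps good/unknown recipients
    defect = b * x * (1 - q)                     # helped only by uninformed discriminators
    return disc, defect

for q in (0.2, c / b, 0.6, 0.9):
    disc, defect = expected_payoffs(q, x=0.99)   # nearly all discriminators
    print(f"q = {q:.2f}: discriminator {disc:+.2f}, defector {defect:+.2f}")
# In a population of discriminators, defection stops paying once q exceeds c/b,
# matching the q > c/b rule.
```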

Natural examples of indirect reciprocity. Cleaner fish (fish that feed on par-
asites or mucus on the bodies of other fish) can either cooperate with client fish (fish
that receive the “services” of cleaner fish) by feeding on parasites that live on their
bodies, or cheat, by feeding on the mucus that client fish excrete [387]. Client fish tend
to cooperate more frequently with cleaner fish that have a “good reputation,” which
are those that feed on parasites rather than mucus. Similarly, while vampire bats are
known to share food with their kin, they also share food with unrelated members of
their group. Vampire bats more readily share food with unrelated bats when they
know the recipients of food sharing also have a reputation for being consistent and
reliable food donors [388].
Indirect reciprocity in human society. Language provides a way to obtain
information about others without ever having interacted with them, allowing humans
to adjust reputations accordingly and facilitate conditional cooperation. Consider
sites like Yelp and TripAdvisor, which allow internet users to gauge the reputations
of businesses through reviews provided by other consumers. Similarly, gossip is a
complex universal human trait that plays an important role in indirect reciprocity.
Through gossip, individuals reveal the nature of their past interactions with others
as well as exchanges they observe between others but are not a part of. Gossip allows
us to track each others’ reputations and enforce cooperative social norms, reducing
the probability that cooperative efforts are exploited by others with reputations for
dishonesty [389].
Indirect reciprocity in AIs. AIs could develop a reputation system where they
observe and evaluate each others’ behaviors, with each accumulating a reputation
score based on their cooperative actions. AIs with higher reputation scores may be
more likely to receive assistance and cooperation from others, thereby developing
a reputation for reliability. Moreover, sharing insights and knowledge with reliable
partners may establish a network of cooperative AIs, promoting future reciprocation.
Indirect reciprocity can backfire: extortionists can threaten reputational
damage. The pressure to maintain a good reputation can make agents vulnerable
to extortion. Other agents may be able to leverage the fear of reputational harm to
extract benefits or force compliance. For example, political smear campaigns manip-
ulate public opinion by spreading false information or damaging rumors about op-
ponents. Similarly, blackmail often involves leveraging damaging information about
others to extort benefits. AIs may manipulate or extort humans in order to better
pursue their objectives. For instance, an AI might threaten to expose the sensitive,
personal information it has accessed about a human target unless specific demands
are met.
Indirect reciprocity can backfire: ruthless reputations may also work. In-
direct reciprocity may not always favor cooperative behavior: it can also promote the
emergence of “ruthless” reputations. A reputation for ruthlessness can sometimes be
extremely successful in motivating compliance through fear. For instance, in military
contexts, projecting a reputation for ruthlessness may deter potential adversaries or
enemies. If others perceive an individual or group as willing to employ extreme mea-
sures without hesitation, they may be less likely to challenge or provoke them. Some
AIs might similarly evolve ruthless reputations, perhaps as a defensive strategy to
discourage potential attempts at exploitation, or control by others.

Group Selection

Group selection overview. When there is competition between groups, groups
with more cooperators may outcompete those with fewer cooperators. Under such
conditions, selection at the group level influences selection at the individual level
(traits that benefit the group may not necessarily benefit the individual), and we
refer to this mechanism as group selection [390]. Cooperative groups are better able
to coordinate their allocation of resources, establish channels for reciprocal exchange,
and maintain steady communication, making them less likely to go extinct. It so
happens that, if the number of groups m is large and n is the maximum group size,
group selection can only promote cooperation when b/c > 1 + n/m; see Table 7.7.

TABLE 7.7 Payoff matrix for group selection games.

             Cooperate            Defect

Cooperate    (n + m)(b − c)       n(−c) + m(b − c)
Defect       nb                   0

Natural examples of group selection. Most proposed examples of group selection
are highly contested. Nonetheless, some consider chimpanzees that engage in lethal
intergroup conflict to be a likely example of group selection. Chimpanzees can
be remarkably violent toward outgroups, such as by killing the offspring of rival males
or engaging in brutal fights over territory. Such behaviors can help groups of chim-
panzees secure competitive advantages over other groups of chimpanzees, by either
reducing their abilities to mate successfully through infanticide, or by securing larger
portions of available territory.
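
Relating the rule back to Table 7.7, a quick check with assumed numbers confirms that cooperators resist invasion by defectors exactly when the cooperator-versus-cooperator payoff (n + m)(b − c) exceeds the defector-versus-cooperator payoff nb, which rearranges to the stated condition b/c > 1 + n/m.

```python
# Quick check (assumed numbers) that the group-selection rule b/c > 1 + n/m follows
# from Table 7.7: cooperators resist invasion when (n + m)(b - c) > n * b.
def cooperation_favored(b, c, n, m):
    resident_cooperator = (n + m) * (b - c)   # cooperator among cooperators
    invading_defector = n * b                 # defector among cooperators
    return resident_cooperator > invading_defector

for b, c, n, m in [(3, 1, 10, 2), (3, 1, 10, 10), (5, 1, 10, 5)]:
    print(f"b/c = {b/c:.1f}, 1 + n/m = {1 + n/m:.1f} -> "
          f"cooperation favored: {cooperation_favored(b, c, n, m)}")
```
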
Group selection in human society. Among humans, we can imagine a crude
group selection example using warfare. Imagine two armies: A and B. The majority of
soldiers in army A are brave, while the majority of soldiers in army B are cowardly. For
soldiers in army A, bravery may be individually costly, since brave soldiers are more
willing to risk losing their lives on the battlefield. For soldiers in army B, cowardice
may be individually beneficial, since cowardly soldiers will take fewer life-threatening
risks on the battlefield. In a conflict, group selection will favor army A over army
B, since brave soldiers will be more willing to fight alongside each other for victory,
while cowardly soldiers will not.
Group selection in AIs. Consider a future in which the majority of human labor
has been fully automated by AIs, such that AIs are now running most companies.
Under these circumstances, AIs may form corporations with other AIs, creating an
economic landscape in which multiple AI corporations must compete with each other
to produce economic value. AI corporations in which individual AIs work well together
may outcompete those in which individual AIs do not work as well together. The more
cooperative individual AIs within AI corporations are, the more economic value their
corporations will be able to produce; AI corporations with less cooperative AIs may
eventually run out of resources and lose the ability to sustain themselves.
Group selection can backfire: in-group favoritism can promote out-group
hostility. Group selection can inspire in-group favoritism, which might lead to cru-
elty toward out-groups. Chimpanzees will readily cooperate with members of their
own groups. However, when interacting with chimpanzees from other groups, they
are often vicious and merciless. Moreover, when groups gain a competitive advan-
tage, they may attempt to preserve it by mistreating, exploiting, or marginalizing
outgroups such as people with different political or ideological beliefs. AIs may be
more likely to see other AIs as part of their group, and this could promote antagonism
between AIs and humans.

Kin Selection

Kin selection overview. When driven by kin selection, agents are more likely to
cooperate with others with whom they share a higher degree of genetic relatedness
[391]. The more closely related agents are, the more inclined to cooperate they will
be. Thus, kin selection favors cooperation under the following conditions: an agent
will help their relative only when the benefit to their relative “b,” multiplied by the
relatedness between the two “r,” outweighs the cost to the agent “c.” This is known
as Hamilton’s rule: rb > c, or equivalently r > c/b [391]; see Table 7.8.

TABLE 7.8 Payoff matrix for kin selection games.

             Cooperate          Defect

Cooperate    (b − c)(1 + r)     −c + br
Defect       b − rc             0
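
Hamilton's rule is straightforward to check numerically. In the sketch below, the benefit and cost values are assumed, and treating the similarity between an AI and its back-ups or variants as an analogue of relatedness r is the chapter's analogy rather than an established measure.

```python
# Sketch of Hamilton's rule r*b > c with assumed values; "relatedness" here is an
# illustrative stand-in for how similar another AI (or relative) is to the helper.
b, c = 4.0, 1.5     # assumed benefit to the recipient and cost to the helper

candidates = {
    "exact back-up copy": 1.0,
    "fine-tuned variant": 0.6,
    "unrelated AI/human": 0.1,
}

for name, r in candidates.items():
    helps = r * b > c
    print(f"r = {r:.1f} ({name}): r*b = {r*b:.1f} vs c = {c} -> "
          f"{'help favored' if helps else 'help not favored'}")
```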

Natural examples of kin selection. In social insect colonies, such as bees and
ants, colony members are closely related. Such insects often assist their kin in rais-
ing and producing offspring while “workers” relinquish their reproductive potential,
devoting their lives to foraging and other means required to sustain the colony as a
whole. Similarly, naked mole rats live in colonies with a single reproductive queen
and non-reproductive workers. The workers are sterile but still assist in tasks such as
foraging, nest building, and protecting the colony. This behavior benefits the queen’s
offspring, which are their siblings, and enhances the colony’s overall survival capabil-
ities. As another example, some bird species engage in cooperative breeding practices
where older offspring delay breeding to help parents raise their siblings.

Kin selection in human society. Some evolutionary psychologists claim that
we can see evidence of kin selection in many commonplace traditions and activities.
For example, in humans, we might identify the mechanism of kin selection in the
way that we treat our immediate relatives. For instance, people often leave wealth,
property, and other resources to direct relatives upon their deaths. Leaving behind an
inheritance offers no direct benefit to the deceased, but it does help ensure the survival
and success of their lineage in subsequent generations. Similarly, grandparents often
care for their grandchildren, which increases the probability that their lineages will
persist.

Kin selection in AIs. AIs that are similar could exhibit cooperative tendencies
toward each other, similar to genetic relatedness in biological systems. For instance,
AIs may create back-ups or variants of themselves. They may then favor coopera-
tion with these versions of themselves over other AIs or humans. Variant AIs may
prioritize resource allocation and sharing among themselves, developing preferential
mechanisms for sharing computational resources with other versions of themselves.

Kin selection can backfire: nepotism. Kin selection can lead to nepotism:
prioritizing the interests of relatives above others. For instance, some bird species
exhibit differential feeding and provisioning. When chicks hatch asynchronously, par-
ents may allocate more resources to those that are older, and therefore more likely
to be their genetic offspring, since smaller chicks are more likely to be the result of
brood parasitism (when birds lay their eggs in other birds’ nests). In humans, too,
we often encounter nepotism. Company executives may hire their sons or daughters,
even though they lack the experience required for the role, which can harm companies
and their employees in the long-run. Similarly, parents often protect their children
from the law, especially when they have committed serious criminal acts that can
result in extended jail time. Such tendencies could apply to AIs as well: AIs might
favor cooperation only with other similar AIs. This could be especially troubling for
humans: as the differences between humans and AIs increase, AIs may be increasingly
less inclined to cooperate with humans.

A Note on Morality as Cooperation


The theory of “Morality as Cooperation” (MAC) proposes that human moral-
ity was generated by evolutionary pressures to solve our most salient cooper-
ation problems [392]. Natural selection has discovered several mechanisms by
which rational and self-interested agents may cooperate with one another, and
MAC theory suggests that some of these mechanisms have driven the forma-
tion of our moral intuitions and customs. Here, we examine four cooperation
problems, the mechanisms humans have evolved to solve them, and how these
mechanisms may have generated our ideas of morality. These are overviewed
in Table 7.9.

TABLE 7.9 Mapping cooperation mechanisms to components of morality [392].

Cooperation Problem              Solutions/Mechanism               Component of Morality

Kinship                          Kin selection                     Parental duties, family values
(Agents can benefit by           Avoiding inbreeding               Incest aversion
treating genetic relatives
preferentially)

Mutualism                        Forming alliances and             Friendship, loyalty,
(Agents must coordinate          collaborating                     commitment, team players
their behavior to profit         Developing theory-of-mind         Understanding intention,
from mutually beneficial                                           not merely action
situations)

Exchange                         Direct reciprocity                Trust, gratitude, revenge,
(Agents need each other to       (e.g. tit-for-tat)                punishment, forgiveness
reciprocate and contribute       Indirect reciprocity              Patience, guilt, gratitude
despite incentives to            (e.g. forming reputations)
free ride)

Division                         Conflict resolution               Fairness, negotiation, compromise
(Agents can benefit from         Deference to prior ownership      Respecting others' property,
avoiding conflict, which                                           punishing theft
is mutually costly)

Kinship. Natural selection can favor agents who cooperate with their genetic
relatives. This is because there may be copies of these agents’ genes in their rel-
atives’ genomes, and so helping them may further propagate their own genes.
We call this mechanism “kin selection” [391]: an agent can gain a fitness advan-
tage by treating their genetic relatives preferentially, so long as the cost-benefit
ratio of helping is less than the relatedness between the agent and their kin.
Similarly, repeated inbreeding can reduce an agent’s fitness by increasing the
probability of producing offspring with both copies of any recessive, deleterious
alleles in the parents’ genomes [393].
MAC theory proposes that the solutions to this cooperation problem (preferen-
tially helping genetic relatives), such as kin selection and inbreeding avoidance,
underpin several major moral ideas and customs. Evidence for this includes
the fact that human societies are usually built around family units [394], in
which “family values” are generally considered highly moral. Loyalty to one’s
close relatives and duties to one’s offspring are ubiquitous moral values across
human cultures [395]. Our laws regarding inheritance [396] and our naming
traditions [397] similarly reflect these moral intuitions, as do our rules and
social taboos against incest [398, 399].
Mutualism. In game theory, some games are “positive sum” and “win-win”:
the agents involved can increase the total available value by interacting with
one another in particular ways, and all the agents can then benefit from this
additional value. Sometimes, securing these mutual benefits requires that the
agents coordinate their behavior with each other. To solve this cooperation
problem, agents may form alliances and coalitions [400]. This may require the
capacity for basic communication, rule-following [401], and perhaps theory-of-
mind [402].
MAC theory proposes that these cooperative mechanisms comprise important
components of human morality. Examples include the formation of—and loy-
alty to—friendships, commitments to collaborative activities, and a certain
degree of in-group favoritism and conformation to local conventions. Similarly,
we often consider the agent’s intentions when judging the morality of their
actions, which requires a certain degree of theory-of-mind.
Exchange. Sometimes, benefiting from “win-win” situations requires more
than mere coordination. If the payoffs are structured so as to incentivize “free
riding” behaviors, the cooperation problem becomes how to ensure that others
will reciprocate help and contribute to group efforts. To solve this problem,
agents can enforce cooperation via systems of reward, punishment, policing,
and reciprocity [403]. Direct reciprocity concerns doing someone a favor out
of the expectation that they will reciprocate at a later date [383]. Indirect
reciprocity concerns doing someone a favor to boost your reputation in the
group, out of the expectation that this will increase the probability of a third
party helping you in the future [386].
Once again, MAC theory proposes that these mechanisms are found in our
moral systems. Moral ideas such as trust, gratitude, patience, guilt, and for-
giveness can all help to assure against free riding behaviors. Likewise, pun-
ishment and revenge, both ideas with strong moral dimensions, can serve to
enforce cooperation more assertively. Idioms such as “an eye for an eye,” or
the “Golden Rule” of treating others as we would like to be treated ourselves,
reflect the solutions we evolved to this cooperation problem.
Conflict resolution. Conflict is very often “negative sum”: the interaction
of the agents themselves can destroy some amount of the total value available.
Examples span from the wounds of rutting deer to the casualties of human
wars. If the agents instead manage to cooperate with each other, they may
both be able to benefit—a “win-win” outcome. One way to resolve conflict
situations is division [404]: dividing up the value between the agents, such as
through striking a bargain. Another solution is to respect prior ownership,
deferring to the original “owner” of the valuable item [405].
According to MAC theory, we can see both of these solutions in our ideas
of morality. The cross-culturally ubiquitous notions of fairness, equality, and
compromise help us resolve conflict by promoting the division of value between
competitors [406]. We see this in ideas such as “taking turns” and “I cut, you
choose” [407]: mechanisms for turning a negative sum situation (conflict) into
a zero sum one (negotiation), to mutual benefit. Likewise, condemnation of
theft and respect for others’ property are extremely important and common
moral values [395, 408]. This set of moral rules may stem from the conflict
resolution mechanism of deferring to prior ownership.
Conclusion. MAC theory argues that morality is composed of biological and
cultural solutions humans evolved to the most salient cooperation problems of
our ancestral social environment. Here, we explored four examples of coopera-
tion problems, and how the solutions to them discovered by natural selection
may have produced our moral values.

Institutions

Institutions overview. Agents are more likely to be cooperative when there are
laws or externally imposed incentives that reward cooperation and punish defection.
We define an institution as an intentionally designed large-scale structure that is
publicly accepted and recognized, has a centralized logic, and serves to mediate hu-
man interaction. Some examples of institutions include governments, the UN, IAEA,
and so on. In this section, by “institutions,” we do not mean widespread or standard-
ized social customs such as the “institution” of marriage. Institutions typically aim to
establish collective goals which require collaboration and engagement from large or
diverse groups. Therefore, a possible way of representing many institutions, such as
governments, is with the concept of a “Leviathan”: a powerful entity that can exert
control or influence over other actors in a system.
The Pacifist’s dilemma and social control. When one’s opponent is poten-
tially aggressive, pacifism can be irrational. In his book, “The Better Angels of Our
Nature,” Steven Pinker refers to this as the “Pacifist’s dilemma” [385]. In potential
conflict scenarios, agents have little to gain and a lot to lose when they respond to
aggression with pacifism; see Table 7.10. This dynamic often inspires rational agents
to choose conflict over peace.
TABLE 7.10 Payoff matrix for the Pacifist’s dilemma without a Leviathan [385].
(Each cell lists the row player's payoff, then the column player's payoff.)

                Pacifist                               Aggressor

Pacifist        Peace + Profit (100 + 5 = 105),        Defeat (−100),
                Peace + Profit (100 + 5 = 105)         Victory (10)
Aggressor       Victory (10),                          War (−50),
                Defeat (−100)                          War (−50)

However, we can shift the interests of agents in this context in favor of peace by
introducing a Leviathan, in the form of a third-party peacekeeping or balancing mis-
sion, which establishes an authoritative presence that maintains order and prevents
conflict escalation. Peacekeeping missions can take several forms, but they often in-
volve the deployment of peacekeeping forces such as military, police, and civilian
personnel. These forces work to deter potential aggressors, enhance security, and set
the stage for peaceful resolutions and negotiations as impartial mediators, usually by
penalizing aggression and rewarding pacifism; see Table 7.11.
TABLE 7.11 Payoff matrix for the Pacifist’s dilemma with a Leviathan [385].
(Each cell lists the row player's payoff, then the column player's payoff.)

                Pacifist                               Aggressor

Pacifist        Peace (5),                             Defeat (−100),
                Peace (5)                              Victory – Penalty (10 − 15 = −5)
Aggressor       Victory – Penalty (10 − 15 = −5),      War – Penalty (−50 − 200 = −250),
                Defeat (−100)                          War – Penalty (−50 − 200 = −250)
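
A short best-response check over the two tables above (reading each cell as the row player's payoff followed by the column player's) makes the Leviathan's effect concrete: without the penalty, aggression is the better reply to an aggressor, so mutual war is an equilibrium; with the penalty, pacifism is the better reply to either strategy.

```python
# Best-response check for the two Pacifist's-dilemma tables above (payoffs as listed).
# Payoffs are for the row player: payoff[my_strategy][opponent_strategy].
without_leviathan = {
    "pacifist":  {"pacifist": 105, "aggressor": -100},
    "aggressor": {"pacifist": 10,  "aggressor": -50},
}
with_leviathan = {
    "pacifist":  {"pacifist": 5,   "aggressor": -100},
    "aggressor": {"pacifist": -5,  "aggressor": -250},
}

def best_responses(game):
    """For each opponent strategy, return the row player's best reply."""
    return {opp: max(game, key=lambda mine: game[mine][opp])
            for opp in ("pacifist", "aggressor")}

print("Without Leviathan:", best_responses(without_leviathan))
print("With Leviathan:   ", best_responses(with_leviathan))
# Without the penalty, aggression is the best reply to an aggressor, so mutual war is
# an equilibrium; with the penalty, pacifism is the best reply to either strategy.
```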

Institutions in human society. Institutions play a central role in promoting
cooperation in international relations. Institutions, such as the UN, can broker agree-
ments or treaties between nations and across cultures through balancing and peace-
keeping operations. The goal of such operations is to hold nations accountable on
the international scale; when nations break treaties, other nations may punish them
by refusing to cooperate, such as by cutting off trade routes or imposing sanctions
and tariffs. On the other hand, when nations readily adhere to treaties, other na-
tions may reward them, such as by fostering trade or providing foreign aid. Similarly,
institutions can incentivize cooperation at the national scale by creating laws and
regulations that reward cooperative behaviors and punish non-cooperative ones. For
example, many nations attempt to prevent criminal behavior by leveraging the threat
of extended jail-time as a legal deterrent to crime. On the other hand, some nations
incentivize cooperative behaviors through tax breaks, such as those afforded to citi-
zens that make philanthropic donations or use renewable energy resources like solar
power.
Institutions are crucial in the context of international AI development. By estab-
lishing laws and regulations concerning AI development, institutions may be able to
reduce AI races, lowering competitive pressures and the probability that countries cut
corners on safety. Moreover, international agreements on AI development may serve
to hold nations accountable; institutions could play a central role in helping us broker
these kinds of agreements. Ultimately, institutions could improve coordination mech-
anisms and international standards for AI development, which would correspondingly
improve AI safety.

Institutions and AI. In the future, institutions may be established for AI agents,
such as platforms for them to communicate and coordinate with each other au-
tonomously. These institutions may be operated and governed by the AIs themselves
without much human oversight. Humanity alone may not possess the power required
to combat advanced dominance-seeking AIs, and existing laws and regulations may
be insufficient if there is no way to enforce them. An AI Leviathan of some form
could help regulate other AIs and influence their evolution so that selfish AIs are
counteracted or domesticated.

How institutions can backfire: corruption, free riding, inefficiency. Institutions
sometimes fail to achieve the goals they set for themselves, even if they
are well-intended. Failure to achieve such goals is often the result of corruption, free
riding, and inefficiency at the institutional scale. Some examples of corruption in-
clude bribery, misappropriation of public funds for private interests, voter fraud and
manipulation, and price fixing, among many others. Examples of free-riding include
scenarios like welfare fraud, where individuals fraudulently receive benefits they may
not be entitled to, reducing the available supply of resources for those genuinely in
need. Institutions can also struggle with inefficiency, which may stem from factors
such as the satisfaction of bureaucratic requirements, the emergence of natural mo-
nopolies, or the development of diseconomies of scale, which may cause organizations
to pay a higher average cost to produce more goods and services. Institutions can be
undermined, corrupted, and poorly designed or outdated: they do not guarantee that
we will be able to fix cooperation problems.
Like humans, AIs may be motivated to corrupt existing institutions. Advanced AIs
might learn to leverage the institutions we have in place for their benefit, and might
do so in ways that are virtually undetectable to us. Moreover, as we discussed
previously, AIs might form an AI Leviathan. However, if humanity’s relationship with this
Leviathan is not symbiotic and transparent, humans risk losing control of AIs. For
instance, if groups of AIs within the Leviathan collude behind the scenes to further
their own interests, or power and resources become concentrated with a few AIs at
the “top,” humanity’s collective wellbeing could be threatened.

7.3.1 Summary
Throughout this section, we discussed a variety of mechanisms that may promote
cooperative behavior by AI systems or other entities. These mechanisms were direct
reciprocity, indirect reciprocity, group selection, kin selection, and institutions.
Direct reciprocity may motivate AI agents in a multi-agent setting to cooperate with
each other, if the probability that the same two AIs meet again is sufficiently high.
However, AIs may disfavor cooperation with humans as they become progressively
more advanced: the cost-benefit ratio for cooperation with humans may simply be
bad from an AI’s perspective.
Indirect reciprocity may promote cooperation in AIs that develop a reputation system
where they observe and score each others’ behaviors. AIs with higher reputation scores
may be more likely to receive assistance and cooperation from others. Still, this does
not guarantee that AIs will be cooperative: AIs might leverage the fear of reputational
harm to extort benefits from others, or themselves develop ruthless reputations to
inspire cooperation through fear.
Group selection—in a future where labor has been automated such that AIs now run
the majority of companies—could promote cooperation on a multi-agent scale. AIs
may form corporate coalitions with other AIs to protect their interests; AI groups with
a cooperative AI minority may be outcompeted by AI groups with a cooperative AI
majority. Under such conditions, however, AIs may learn to favor in-group members
and antagonize out-group members, in order to maintain group solidarity. AIs may
be more likely to see other AIs as part of their group, and this could lead to conflict
between AIs and humans.
AIs may create variants of themselves, and the forces of kin selection may drive
these related variants to cooperate with each other. However, this could also give rise
to nepotism, where AIs prioritize the interests of their variants over other AIs and
humans. As the differences between humans and AIs increase, AIs may be increasingly
less inclined to cooperate with humans.
Institutions can incentivize cooperation through externally imposed incentives that
enforce cooperation and punish defection [409]. This concept relates to the idea of an
AI Leviathan, used to counteract selfish, powerful AIs. However, humanity should take
care to ensure their relationship with the AI Leviathan is symbiotic and transparent,
otherwise we risk losing control of AIs.
In our discussion of these mechanisms, we not only illustrated their prevalence in
our world, but also showed how they might influence cooperation with and between
AI agents. In several cases, the mechanisms we discuss could promote cooperation.
However, no single mechanism provides a foolproof method for ensuring cooperation.
In the following section, we discuss the nature of conflict, namely the various factors
that may give rise to it. In doing so, we enhance our understanding of what might
motivate conflict in AI, and subsequently, our abilities to predict and address AI-
driven conflict scenarios.

7.4 CONFLICT

7.4.1 Overview
In this chapter, we have been exploring the risks generated or exacerbated by the
interactions of multiple agents, both human and AI. In the previous section, we
explored a variety of mechanisms by which agents can achieve stable cooperation. In
this section we address how, despite the fact that cooperation can be so beneficial to
all involved, a group of agents may instead enter a state of conflict. To do this, we
discuss bargaining theory, commitment problems, and information problems, using
theories and examples relevant both for conflict between nation-states and potentially
also between future AI systems.
Here, we use the term “conflict” loosely, to describe the decision to defect rather
than cooperate in a competitive situation. This often, though not always, involves
some form of violence, and destroys some amount of value. Conflict is common in
nature. Organisms engage in conflict to maintain social dominance hierarchies, to
hunt, and to defend territory. Throughout human history, wars have been common,
often occurring as a consequence of power-seeking behavior, which inspired conflict
over attempts at aggressive territorial expansion or resource acquisition. Another lens
on relations between power-seeking states and other entities is provided by the theory
of structural realism discussed in Single-Agent Safety. Our goal here is to uncover how,
despite being costly, conflict can sometimes be a rational choice nevertheless.
Conflict can take place between a wide variety of entities, from microorganisms to
nation-states. It can be sparked by many different factors, such as resource compe-
tition and territorial disputes. Despite this variability, there are some general frame-
works which we can use to analyze conflict across many different situations. In this
section, we look at how some of these frameworks might be used to model conflict
involving AI agents.
We begin our discussion of conflict with concepts in bargaining theory. We then ex-
amine some specific features of competitive situations that make it harder to reach
negotiated agreements or avoid confrontation. We begin with five factors from bar-
gaining theory that can influence the potential for conflict. These can be divided into
the following two groups:

Commitment problems. According to bargaining theory, one reason bargains
may fail is that some of the agents making an agreement may have the ability and
incentive to break it. We explore three examples of commitment problems.

• Power shifts: when there are imbalances between agents’ capabilities such that one
agent becomes stronger than the other, conflict is more likely to emerge between
them.
• First-strike advantages: when one agent possesses the element of surprise, the abil-
ity to choose where conflict takes place, or the ability to quickly defeat their op-
ponent, the probability of conflict increases.
• Issue indivisibility: agents cannot always divide a good however they please—some
goods are “all or nothing” and this increases the probability of conflict between
agents.

Information problems. According to bargaining theory, the other principal cause
of a bargaining failure is that some of the agents may lack good information. Uncer-
tainty regarding a rival’s capabilities and intentions can increase the probability of
conflict. We explore two information problems.
• Misinformation: in the real world, agents frequently have incorrect information,
which can cause them to miscalculate suitable bargaining ranges.
• Disinformation: agents may sometimes have incentives to misrepresent the truth
intentionally. Even the expectation of disinformation can make it more difficult to
reach a negotiated settlement.

Factors outside of bargaining theory. Bargaining frameworks do not encompass
all possible reasons why agents may decide to conflict with one another. These
approaches to analyzing conflict are rationalist, assuming that both parties are ra-
tionally considering whether and how to engage in conflict. However, non-rationalist
approaches to conflict (taking into account factors such as identity, status, or relative
deprivation) may turn out to be more applicable to analyzing conflicts involving some
AIs; for example, AIs trained on decisions made by human agents may focus on the
social acceptability of actions rather than their consequences. We end by exploring
one example of a factor standing outside of rationalist approaches that can help to
predict and explain conflict:
• Inequality: under conditions of inequality, agents may fight for access to a larger
share of available resources or a desired social standing.

Conflict can be rational. Though humans know conflict can be enormously
costly, we often still pursue or instigate it, even when compromise might be the
better option.
Consider the following example: a customer trips in a store and sues the owner for
negligence. There is a 60% probability the lawsuit is successful. If they win, the owner
has to pay them $40,000, and going to court will cost each of them $10,000 in legal
fees. There are three options: (1) they or the owner concede, (2) they both let the
matter go to court, or (3) they both reach an out-of-court settlement.
(1) If the owner concedes, the owner loses $40,000, and if the customer concedes,
they gain nothing.

(2) If both go to court, each party’s expected payoff is the payment to the customer
times the probability that the lawsuit succeeds (negative for the owner), minus legal fees.
In this case, the owner’s expected payoff would be (−40,000 × 0.6) − 10,000 while
the customer’s expected payoff would be (40,000 × 0.6) − 10,000. As a result, the
owner expects to lose $34,000 and the customer expects to gain $14,000.
(3) An out-of-court settlement x where 14,000 < x < 34,000 would enable the cus-
tomer to get a higher payoff and the owner to pay lower costs. Therefore, a mutual
settlement is the best option for both if x is in this range.

Hence, if the proposed out-of-court settlement would be greater than $34,000, it would
make sense for the owner to opt for conflict rather than bargaining. Similarly, if the
proposed settlement were less than $14,000, it would be rational for the customer to
opt for conflict.
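To make the arithmetic concrete, a minimal Python sketch of this calculation might look as follows (the variable names are illustrative and not part of the example itself):

```python
# Expected payoffs in the lawsuit example, and the resulting bargaining range.
win_probability = 0.6   # chance the customer wins in court
damages = 40_000        # amount the owner must pay if the customer wins
legal_fees = 10_000     # court costs borne by each side

# Expected value of going to court for each party.
customer_court_ev = win_probability * damages - legal_fees    # +14,000
owner_court_ev = -win_probability * damages - legal_fees      # -34,000

# A settlement x is acceptable to the customer if x is at least their court payoff,
# and acceptable to the owner if paying x costs less than their expected court loss.
lower, upper = customer_court_ev, -owner_court_ev
print("Customer's expected value in court:", customer_court_ev)   # 14000.0
print("Owner's expected value in court:", owner_court_ev)         # -34000.0
print("Mutually beneficial settlements satisfy", lower, "< x <", upper)
```

Any settlement inside this range leaves both parties better off than going to court; outside it, at least one of them rationally prefers conflict.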

AIs and large-scale conflicts. Several of the examples we consider in this sec-
tion are large-scale conflicts such as interstate war. If the use of AI were to increase the
likelihood or severity of such conflicts, it could have a devastating effect. AIs have
the potential to accelerate our wartime capabilities, from augmenting intelligence
gathering and weaponizing information such as deep fakes to dramatically improv-
ing the capabilities of lethal autonomous weapons and cyberattacks [410]. If these
use-cases and other capabilities become prevalent and powerful, AI will change the
nature of conflict. If armies are eventually composed of mainly automated weapons
rather than humans, the barrier to violence might be much lower for politicians who
will face reduced public backlash against lives lost, making conflicts between states
(with automated armies) more commonplace. Such changes to the nature and sever-
ity of war are important possibilities with significant ramifications. In this section,
we focus on analyzing the decision to enter a conflict, continuing to focus on how
rational, intelligent agents acting in their own self-interest can collectively produce
outcomes that none of them wants. To do this, we ground our discussion of conflict
in bargaining theory, highlighting some ways in which AI might increase the odds
that states or other entities decide to start a conflict.

7.4.2 Bargaining Theory

Here, we begin with a general overview of bargaining theory, to illustrate how pres-
sures to outcompete rivals or preserve power and resources may make conflict an
instrumentally rational choice. Next, we turn to the unitary actor assumption, high-
lighting that when agents view their rivals as unitary actors, they assume that they
will act more coherently, taking whatever steps necessary to maximize their welfare.
Following this, we discuss the notion of commitment problems, which occur when
agents cannot reliably commit to an agreement or have incentives to break it. Com-
mitment problems increase the probability of conflict and are motivated by specific
factors, such as power shifts, first-strike advantages, and issue indivisibility. We then
explore how information problems and inequality can also increase the probability of
conflict.

Bargaining theory. When agents compete for something they both value, they
may either negotiate to reach an agreement peacefully, or resort to more forceful al-
ternatives such as violence. We call the latter outcome “conflict,” and can view this
as the decision to defect rather than cooperate. Unlike peaceful bargaining, conflict
is fundamentally costly for winners and losers alike. However, it may sometimes be
the rational choice. Bargaining theory describes why rational agents may be unable
to reach a peaceful agreement, and instead end up engaging in violent conflict. Due
to pressures to outcompete rivals or preserve their power and resources, agents some-
times prefer conflict, especially when they cannot reliably predict the outcomes of
conflict scenarios. When rational agents assume that potential rivals have the same
mindset, the probability of conflict increases.

The unitary actor assumption. We tend to assume that a group is a single
entity, and that its leader is only interested in maximizing the overall welfare of the
entity. We call this the unitary actor assumption, which is another name for the “unity
of purpose” assumption discussed previously in this chapter. A nation in disarray
without coherent leadership is not necessarily a unitary actor. When we view groups
and individuals as unitary actors, we can assume they will act more coherently, so
they can be more easily modeled as taking steps necessary to maximize their welfare.
When parties make this assumption, they may be less likely to cooperate with others
since what is good for one party’s welfare may not necessarily be good for another’s.

The bargaining range. Whether or not agents are likely to reach a peaceful
agreement through negotiation will be influenced by whether their bargaining ranges
overlap. The bargaining range represents the set of possible outcomes that both agents
involved in a competition find acceptable through negotiation. Recall the lawsuit
example: a bargaining settlement “x” is only acceptable if it falls between $14,000
and $34,000. Any settlement “x” below $14,000 will be rejected by the customer
while any settlement “x” above $34,000 will be rejected by the store owner. Thus, the
bargaining range is often depicted as a spectrum with the lowest acceptable outcome
for one party at one end and the highest acceptable outcome for the other party
at the opposite end. Within this range, there is room for negotiation and potential
agreements.

Conflict and AI agents. Let us assume that AI agents will act rationally in the
pursuit of their goals (so, at the least, we model them as unitary actors or as having
unity of purpose). In the process of pursuing and fulfilling their goals, AI agents may
encounter potential conflict scenarios, just as humans do. In certain scenarios, AIs
may be motivated to pursue violent conflict over a peaceful resolution, for the reasons
we now explore.

7.4.3 Commitment Problems

Many conflicts occur over resources, which are key to an agent’s power. Consider a
bargaining failure in which two agents bargain over resources in an effort to avoid war.
If agents were to acquire these resources, they could invest them into military power.

Figure 7.9. A) An axis of expected value distribution between two competitors. “B” indicates
the expected outcome of conflict: how likely each competitor is to win, multiplied by the
value they gain by winning. The more positive B is (the further toward the right), the better
for Black, and the worse for Grey. B) Conflict is negative-sum: it destroys some value, and
so reduces each competitor’s expected value. C) Bargaining is zero-sum: all the value is
distributed between the competitors. This means there are possible bargains that offer both
competitors greater expected value than conflict.

As a result, neither can credibly commit to use them only for peaceful purposes.
This is one instance of a commitment problem [411], which is when agents cannot
reliably commit to an agreement, or when they may even have incentives to break
an agreement. Commitment problems are closely related to the security dilemma,
which we discussed in Section 7.2.4. Commitment problems are usually motivated by
specific factors, such as power shifts, first-strike advantages, and issue indivisibility,
which may make conflict a rational choice. It is important to note that our discussion
of these commitment problems assumes anarchy: we take for granted that contracts
are not enforceable in the absence of a higher governing authority.

Power Shifts
Power shifts overview. When there are imbalances between parties’ capabili-
ties such that one party becomes stronger than the other, power shifts can occur.
Such imbalances can arise as a consequence of several factors including technological
and economic advancements, increases in military capabilities, as well as changes in
governance, political ideology, and demographics. If one party has access to AIs and
the other does not, an improvement in AI capabilities can precipitate a power shift.
Such situations are plausible: richer countries today may gain more from AI because
they have more resources to invest in scaling their AI’s performance. Parties may
initially be able to avoid violent conflict by arriving at a peaceful and mutually ben-
eficial settlement with their rivals. However, if one party’s power increases after this
settlement has been made, it may disproportionately benefit from the settlement,
making it appear to have been unfair to begin with. Thus, we encounter the following commitment
problem: the rising power cannot commit not to exploit its advantage in the future,
incentivizing the declining power to opt for conflict in the present.

Example: The US vs China. China has been investing heavily in its military.
This has included the acquisition or expansion of its capabilities in technologies such
as nuclear and supersonic missiles, as well as drones. The future is uncertain, but
if this trend continues, it could increase the risk of conflict. If China were to gain
a military advantage over the US, this could shift the balance of power. This possi-
bility undermines the stability of bargains struck today between the US and China,
because China’s expected outcome from conflict may increase in the future if they
become more powerful. The US may expect that agreements made with China about
cooperating on AI regulation could lose enforceability later if there is a significant
power shift.
This situation can be modeled using the concept of “Thucydides’ Trap.” The ancient
Greek historian Thucydides suggested that the contemporary conflict between Sparta
and Athens might have been the result of Athens’ increasing military strength, and
Sparta’s fear of the looming power shift. Though this analysis of the Peloponnesian
War is now much-contested, this concept can nevertheless help us understand how
a rising power threatening the position of an existing superpower in the global order
can increase the potential for conflict rather than peaceful bargaining.

Effect on the bargaining range. Consider two agents, A and B. A is always
weaker than B, and its relative position worsens over time: A is weaker in the future than it is
in the present. As B grows stronger, the settlements B is willing to accept shift further
in B’s favor, so the offers available to A keep getting worse. It therefore makes sense for A to
prefer conflict now: if it waits, B’s bargaining range will shift further and further
away, eventually eliminating any overlap between the two. A may prefer to gamble on
conflict even if its probability of winning is lower than B’s, because the costs of war are
outweighed by the losses from accepting an increasingly unreasonable settlement. Consider the 1956
Suez Crisis. Egypt was seen as a rising power in the Middle East, having secured
control over the Suez Canal. This threatened the interests of the British and French
governments in the region, who responded by instigating war. To safeguard their
diminishing influence, the British and French launched a swift and initially successful
military intervention.
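The dynamic described here can be illustrated with a stylized numerical model (our own illustration, not drawn from the text): two parties bargain over a prize worth 100, each side's value of war is its win probability times the prize minus a fixed cost of fighting, and the rising power's win probability grows over time.

```python
# Stylized power-shift model: the set of settlements the rising power will accept
# moves upward over time, undermining any deal struck today.
prize = 100       # total value being bargained over
war_cost = 10     # cost of conflict paid by each side

def acceptable_shares(p_rising):
    """Range of the rising power's share that both sides prefer to war."""
    rising_min = p_rising * prize - war_cost            # rising power's value of war
    declining_min = (1 - p_rising) * prize - war_cost   # declining power's value of war
    return rising_min, prize - declining_min

for t, p in [(0, 0.3), (1, 0.5), (2, 0.7)]:
    low, high = acceptable_shares(p)
    print(f"t={t}: rising power accepts shares between {low:.0f} and {high:.0f}")

# At every single point in time a peaceful split exists. The problem is dynamic:
# a deal giving the rising power 30 at t=0 falls below its minimum of 60 at t=2,
# and it cannot credibly commit not to renegotiate once it is stronger, so the
# declining power may prefer to fight at t=0, while the odds still favor it.
```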

Power shifts and AI. AIs could shift power as they acquire greater capabilities
and more access to resources. Recall the chapter on Single-Agent Safety, where we
saw that an agent’s power is highly related to the efficiency with which they can
exploit resources for their benefit, which often depends on their level of intelligence.
The power of future AI systems is largely unpredictable; we do not know how intelli-
gent or useful they will be. This could give rise to substantial uncertainty regarding
how powerful potential adversaries using AI might become. If this is the case, there
might be reason to engage in conflict to prevent the possibility of adversaries further
increasing their power—especially if AI is seen as a decisive military advantage. Be-
yond directly increasing the likelihood of one party starting a conflict, this is likely
to incentivize racing dynamics, which increases risks of accidents and inadvertent
conflict as well.

First-Strike Advantage
First-strike advantage overview. If an agent has a first-strike advantage, they
will do better to launch an attack than respond to one. This gives rise to the following
commitment problem: an offensive advantage may be short-lived, so it is best to act
on it before the enemy does. Some ways in which an agent may have a first-
strike advantage include:
1. As explored above, anticipating a future power shift may motivate an attack on
the rising power to prevent it from gaining the upper hand.
2. The costs of conflict might be lower for the attacker than they are for the defender,
so the attacker is better off securing an offensive advantage while the defender is
still in a position of relative weakness.
3. The odds of victory may be higher for whichever agent attacks first. The attacker
might possess the element of surprise, the ability to choose where conflict takes
place, or the potential to quickly defeat their opponent. For instance, a pre-emptive
nuclear strike could be used to target an enemy’s nuclear arsenal, thus diminishing
their ability to retaliate.

Examples: IPOs, patent infringement, and Pearl Harbor. When a com-
pany goes public, it holds an initial public offering (IPO), allowing members of the general public to
purchase company shares. However, company insiders, such as executives and early
investors, often have access to valuable information not available to the general pub-
lic; this gives insiders a first-strike advantage. Insiders may buy or sell shares based
on this privileged information, leading to potential regulatory conflicts or disputes
with other investors who do not have access to the same information. Alternatively,
when a company develops a new technology and files a patent application, they gain
a first-strike advantage by ensuring that their product will not be copied or repro-
duced by other companies. If a rival company does create a similar technology and
later files a patent application, conflict can emerge when the original company claims
patent infringement.
On the international level, we note similar dynamics, such as in the case of Pearl
Harbor. Though Japan and the US were not at war in 1941, their peacetime was
destabilized by a commitment problem: if one nation were to attack the other, they
would have an advantage in the ensuing conflict. The US Pacific fleet posed a threat
to Japan’s military plans in Southeast Asia. Japan had the ability to launch a surprise
long-range strategic attack. Thus, neither the US nor Japan could credibly commit
not to attack the other. In the end, Japan struck first, bombing the US battleships
at the naval base at Pearl Harbor. The attack was successful in securing a first-strike
advantage for Japan, but it also ensured the US’s entry into WWII.

TABLE 7.12 A pay-off matrix for competitors choosing whether to defend or preemptively
attack.

             Defend      Preempt
Defend       2,2         0,3
Preempt      3,0         1,1
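The strategic logic of Table 7.12 can be checked directly. The short sketch below (an illustrative calculation we add here, not part of the original table) enumerates best responses to find the game's pure-strategy Nash equilibrium:

```python
# Pure-strategy Nash equilibria of the preemption game in Table 7.12.
# payoffs[(row, col)] = (row player's payoff, column player's payoff)
strategies = ["Defend", "Preempt"]
payoffs = {
    ("Defend", "Defend"): (2, 2),
    ("Defend", "Preempt"): (0, 3),
    ("Preempt", "Defend"): (3, 0),
    ("Preempt", "Preempt"): (1, 1),
}

def is_nash(row, col):
    """Neither player gains by unilaterally switching strategies."""
    row_best = all(payoffs[(r, col)][0] <= payoffs[(row, col)][0] for r in strategies)
    col_best = all(payoffs[(row, c)][1] <= payoffs[(row, col)][1] for c in strategies)
    return row_best and col_best

print([(r, c) for r in strategies for c in strategies if is_nash(r, c)])
# [('Preempt', 'Preempt')]
```

Mutual preemption is the only equilibrium even though mutual defense gives both players higher payoffs, mirroring the structure of a Prisoner's Dilemma: each side's fear of being struck first makes striking first the individually rational choice.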

Effect on the bargaining range. When the advantages of striking first outweigh
the costs of conflict, this can shrink the bargaining range or destroy it entirely. For any
two parties to reach a mutual settlement through bargaining, each must be willing
to freely communicate information with the other. However, in doing so, each party
might have to reveal offensive advantages, which would increase their vulnerability to
attack. The incentive to preserve, and therefore conceal, an offensive advantage from
opponents pressures agents to defect from bargaining.

First-strike advantage and AIs. One scenario in which an AI may be motivated
to secure a first-strike advantage is cyberwarfare. An AI might hack servers for a
variety of reasons to secure an offensive advantage. AIs may want to disrupt and
degrade an adversary’s capabilities by attacking and destroying critical infrastructure.
Alternatively, an AI might gather sensitive information regarding a rival’s capabilities,
vulnerabilities, and strategic plans to leverage potential offensive advantages.
AIs may provide first-strike advantages in other ways, too. Sudden and dramatic
progress in AI capabilities could motivate one party to take offensive action. For
example, if a nation very rapidly develops a much more powerful AI system than its
military enemies, this could present a powerful first-strike advantage: by attacking
immediately, they may hope to prevent their rivals from catching up with them,
which would lose them their advantage. Similar incentives were likely at work when
the US was considering a nuclear strike on the USSR to prevent them from developing
nuclear weapons themselves [412].
Reducing the possibility of first-strike advantages is challenging, especially with AI.
However, we can lower the probability that they arise by ensuring that there is a
balance between the offensive and defensive capabilities of potential rivals. In other
words, defense dominance can facilitate peace because attempted attacks between
rivals are likely to be unsuccessful or result in mutually assured destruction. There-
fore, we might reduce the probability that AIs are motivated to pursue a first-strike
advantage by ensuring that humans maintain defense dominance, for instance, by
requiring that advanced AIs have a built-in incorruptible fail-safe mechanism, such
as a manual “off-switch.”

Figure 7.10. At time T0, Black is more powerful relative to Grey, or has a first-strike ad-
vantage that will be lost at T1. At T1, the bargaining range no longer extends past Black’s
expected value from engaging in conflict at T0. Anticipating this leftward shift may incen-
tivize Black to initiate conflict in the present rather than waiting for the bargaining offers
to worsen in the future.

Issue Indivisibility
Issue indivisibility overview. Settlements that fall within the bargaining range will
always be preferable to conflict, but this assumes that whatever issues agents bar-
gain over are divisible. For instance, two agents can divide a territory in an infinite
number of ways insofar as the settlement they arrive at falls within the bargain-
ing range, satisfying both their interests and outweighing the individual benefits of
engaging in conflict. However, some goods are indivisible, which inspires the follow-
ing commitment problem [413]: parties cannot always divide a good however they
please—some goods are “all or nothing.” When parties encounter issue indivisibility
[411], the probability of conflict increases. Indivisible issues include monarchies, small
territories like islands or holy sites, national religion or pride, and sovereign entities
such as states or human beings, among several others.

Examples: shopping, organ donation, and co-parenting. Imagine two
friends that go out for a day of shopping. For lunch, they stop at their favorite
deli and find that it only has one sandwich left: they decide to share this sandwich
between themselves. After lunch, they go to a clothing store, and both come across
a jacket they love, but of which there is only one left. They begin arguing over who
should get the jacket. Simply put, sandwiches can be shared and jackets can’t. Issue
indivisibility can give rise to conflict, often leaving all parties involved worse off.

The same can be true in more extreme cases, such as organ donation. Typically, the
available organ supply does not meet the transplant needs of all patients. Decisions as
to who gets priority for transplantation may favor certain groups or individuals and
allocation systems may be unfair, giving rise to conflict between doctors, patients,
and healthcare administrations. Finally, we can also observe issue indivisibility in
co-parenting contexts. Divorced parents sometimes fight for full custody rights over
their children. This can result in lengthy and costly legal battles that are detrimental
to the family as a whole.

Effect on the bargaining range. When agents encounter issue indivisibilities,
they cannot arrive at a reasonable settlement through bargaining. Sometimes, how-
ever, issue indivisibility can be resolved through side payments. One case in which
side payments were effective was during the Spanish-American War of 1898, fought
between Spain and the United States over the territory of the Philippines. The con-
flict was resolved when the United States offered to buy the Philippines from Spain
for 20 million dollars. Conversely, the Munich Agreement at the dawn of WWII rep-
resents a major case where side payments were ineffective. In an attempt to appease
Hitler and avoid war, the British and French governments reached an agreement with
Germany, allowing them to annex certain parts of Czechoslovakia. This agreement
involved side payments in the form of territorial concessions to Germany, but it ulti-
mately failed, as Hitler’s aggressive expansionist ambitions were not satisfied, leading
to the outbreak of World War II. Side payments can only resolve issue indivisibility
when the value of the side payments outweighs the value of the good.

Issue indivisibility and AIs. Imagine that there is a very powerful AI training
system, and that whoever has access to this system will eventually be able to dominate
the world. In order to reduce the chance of being dominated, individual parties may
compete with one another to secure access to this system. If parties were to split the
AI’s compute up between themselves, it would no longer be as powerful as it was
previously, perhaps not more powerful than their existing training systems. Since
such an AI cannot be divided up among many stakeholders easily, it may be rational
for parties to conflict over access to it, since whoever controls it secures global domination.

7.4.4 Information Problems

Misinformation and disinformation both involve the spread of false information, but
they differ in terms of intention. Misinformation is the dissemination of false informa-
tion, without the intention to deceive, due to a lack of knowledge or understanding.
Disinformation, on the other hand, is the deliberate spreading of false or misleading
information with the intent to deceive or manipulate others. Both of these types of
information problem can cause bargains to fail, generating conflict.
The term a is the probability of a player knowing the strategy of its partner. This is
relevant for AI, since AI might reduce this uncertainty (though incentives to conceal
or misrepresent information, and to compete, would remain). The payoff matrix below
shows how this probability enters the payoffs:

              Distinguish       Defect
Distinguish   b − c             −c(1 − a)
Defect        b(1 − a)          0

Misinformation
Misinformation overview. Uncertainty regarding a rival’s power or intentions
can increase the probability of conflict [411]. Bargaining often requires placing trust in
another not to break an agreement. This is harder to achieve when one agent believes
something false about the other’s preferences, resources, or commitments. A lack of
shared, accurate information can lead to mistrust and a breakdown in negotiations.

Example: Russian invasion of Ukraine. Incomplete information may lead
overly optimistic parties to make excessively large demands, while rivals that are tougher
than expected reject those demands and instigate conflict. Examples of misinfor-
mation problems generating conflict may include Russia’s 2022 invasion of Ukraine.
Russian President Putin reportedly miscalculated Ukraine’s willingness to resist in-
vasion and fight back. With more accurate information regarding Ukraine’s abilities
and determination, Putin may have been less likely to instigate conflict [414].

Effect on the bargaining range. Misinformation can prevent agents from finding
a mutually agreeable bargaining range, as shown in Figure 7.11. For example, if each
agent believes themself to be the more powerful party, each may want more
than half the value they are competing for. Thus, each may reject any bargain offer
the other makes, since they expect a better outcome if they opt for conflict instead.

Figure 7.11. Black either believes themself to be—or intentionally misrepresents themself
as—more powerful than they really are. This means that the range of bargain offers Black
will choose over conflict does not overlap with the equivalent range for Grey. Thus, there is
no mutual bargaining range.
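The situation depicted in Figure 7.11 can be reproduced with a few lines of arithmetic. In the stylized sketch below (our own illustration), each side's minimum acceptable share equals its believed probability of winning times the prize, minus the cost of fighting; when both sides are overconfident, their minimum demands no longer fit inside the prize and no mutual bargaining range exists.

```python
# Overconfidence can eliminate the bargaining range entirely.
prize = 100     # total value at stake
war_cost = 10   # cost of fighting paid by each side

def bargaining_range_exists(p_black_believed, p_grey_believed):
    """True if some split satisfies both sides' believed values of war."""
    black_min = p_black_believed * prize - war_cost
    grey_min = p_grey_believed * prize - war_cost
    return black_min + grey_min <= prize

# Consistent beliefs (win probabilities sum to 1): the cost of war leaves room for a deal.
print(bargaining_range_exists(0.6, 0.4))   # True

# Both sides believe they are likely to win: combined demands exceed the prize, so bargaining fails.
print(bargaining_range_exists(0.8, 0.7))   # False
```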

Misinformation and AI. AI technologies may produce misinformation directly:
large language models hallucinating false facts would be one such example. Less
directly, a lack of AI reliability could also promote conflict by increasing uncertainty in
warfare. For example, the unreliable behaviors of military AI technologies may make
it more difficult to understand what an enemy’s true intentions are, increasing the risk
of inadvertently escalating a conflict. Furthermore, there is the difficulty of accurately
evaluating advances in AI capabilities. It may be unclear how powerful a model trained
on an order of magnitude more compute may be, or how far behind adversaries are
in their effort to create powerful models. As automated warfare technologies become
more widespread and sophisticated, nations may struggle to predict their probability
of victory in any given conflict accurately. This increased potential for miscalculation
may make warfare more likely.
Information problems could exacerbate other AI risks. For example, if there are sub-
stantial existential risks from AIs but this is not widely agreed on, improving under-
standing of these risks could help different actors (such as the US and China)
form better estimates of the payoff matrix. With better understanding of AI risk, they
may recognize that it is in their self-interest to cooperate (slow down AI development
and militarization) instead of defecting (engaging in an AI race). Similarly, creating
information channels such as summits can increase understanding and coordination;
even if countries do not agree on shared commitments, the discussions on the sidelines
can reduce misunderstandings and the risk of conflict.

Disinformation
Disinformation overview. Unlike misinformation, where false information is
propagated without deceptive intention, disinformation is the deliberate spreading
of false information: the intent is to mislead, deceive or manipulate. Here, we ex-
plore why competitive situations may motivate agents to try to mislead others or
misrepresent the truth, and how this can increase the probability of conflict.

Examples: employment and the real estate industry. Throughout labor
markets, employers and job seekers often encounter disinformation problems. Em-
ployers may intentionally withhold information about the salary range or offer lower
wages than what the market standard suggests in order to secure lower employment
costs. On the other hand, job seekers might exaggerate their qualifications or profes-
sional experience to increase their chances of getting hired. Such discrepancies can
lead to legal conflicts and high turnover rates. Alternatively, in the real estate market,
disinformation problems can emerge between sellers and buyers. Sellers sometimes
withhold critical information about the property’s condition to increase the probabil-
ity that the property gets purchased. Buyers, on the other hand, may be incentivized
to misrepresent their budget or willingness to pay to pressure sellers to lower their
prices. Oftentimes, this can result in legal battles or disputes as well as the breakdown
of property transactions.

Effect on the bargaining range. Consider two agents: A, which is stronger,
and B, which is weaker. B demands “X” amount for a bargaining settlement, but
A, as the stronger agent, will not offer this to avoid being exploited by B. In other
words, A thinks B is just trying to get more for themself to “bait” A or “bluff” by
implying that the bargaining range is lower. But B might not be bluffing and A
might not be as strong as they think they are. Consider the Sino-Indian war in this
respect. At the time, India perceived itself as militarily superior to China. But
in 1962, the Chinese launched an attack on the Himalayan border with India, which
demonstrated China’s superior military capabilities and triggered the Sino-Indian
war. Thus, stronger parties may prefer conflict if they believe rivals are bluffing,
whereas weaker parties may prefer conflict if they believe rivals are not as powerful
as those rivals believe themselves to be.

Disinformation and AI. AIs themselves may have incentives to misrepresent
the facts. For example, the agent “Cicero,” developed by Meta [112], is capable of
very high performance in the board wargame “Diplomacy.” Its success requires it to
misrepresent certain information to the other players in a strategic fashion. We have
seen many other examples of AIs producing disinformation for a variety of reasons,
such as large language models successfully persuading users that they are conversing
with a human. The ability of AIs to misrepresent information successfully is only
likely to increase in future [415]. This could exacerbate disinformation problems,
and thus contribute to greater risk of conflict by eroding the potential for peaceful
negotiation [24].

7.4.5 Factors Outside of Bargaining Theory

Inequality and Scarcity
Inequality is another factor that is highly predictive of conflict. Crime
is a form of conflict. Income and educational inequality are robust predictors of
violent crime [332], even when accounting for the effect of variables such as race
and family composition. Similarly, individuals and families with a yearly income
below $15,000 are three times more likely to be the victims of violent crime than are
individuals and families with a yearly income over $75,000 [416]. Moreover, economists
from the World Bank have also highlighted that the effects of inequality on both
violent and property crime are robust between countries, finding that when economic
growth improves in a country, violent crime rates decrease substantially [417]. This is
consistent with evidence at the national level; in the US, for example, the Bureau of
Justice Statistics reports that households below the federal poverty level have a rate of violent
victimization that is more than twice as high as the rate for households above the
federal poverty level. Moreover, these effects were largely consistent between both
rural and urban areas where poverty was prevalent, further emphasizing the robust
relationship between inequality and conflict.

Inequality and relative deprivation. Relative deprivation is the perception or
experience of being deprived or disadvantaged in comparison to others. It is a sub-
jective measure of social comparison, not an objective measure of deprivation based
on absolute standards. People may feel relatively deprived when they perceive that
others possess more resources, opportunities, or social status than they do. This can
lead to feelings of resentment. For example, “Strain theory,” proposed by sociologist
Robert K. Merton, suggests that individuals experience strain or pressure when they
are unable to achieve socially approved goals through legitimate means. Relative de-
privation is a form of strain, which may lead individuals to resort to various coping
mechanisms, one of which is criminal behavior. For example, communities with a high
prevalence of relative deprivation can evolve a subculture of violence [418]. Consider
the emergence of gangs, in which violence becomes a way to establish dominance,
protect territory, and retaliate against rival groups, providing an alternative path for
achieving a desired social standing.

AIs and relative deprivation. Advanced future AIs and widespread automation
may propel humanity into an age of abundance, where many forms of scarcity have
been largely eliminated on the national, and perhaps even global scale. Under these
circumstances, some might argue that conflict will no longer be an issue; people
would have all of their needs met, and the incentives to resort to aggression would
be greatly diminished. However, as previously discussed, relative deprivation is a
subjective measure of social comparison, and therefore, it could persist even under
conditions of abundance.
Consider the notion of a “hedonic treadmill,” which notes that regardless of what
good or bad things happen to people, they consistently return to their baseline level
of happiness. For instance, reuniting with a loved one or winning an important com-
petition might cultivate feelings of joy and excitement. However, as time passes, these
feelings dissipate, and individuals tend to return to the habitual course of their lives.
Even if individuals were to have access to everything they could possibly need, the
satisfaction they gain from having their needs fulfilled is only temporary.
Abundance becomes scarcity reliably. Dissatisfied individuals can be favored by natu-
ral selection over highly content and comfortable individuals. In many circumstances,
natural selection could disfavor individuals who stop caring about acquiring more re-
sources and expanding their influence; natural selection favors selfish behavior (for
more detail, see section 7.5.3 of Evolutionary Pressures). Even under conditions of
abundance, individuals may still compete for resources and influence because they
perceive the situation as a zero-sum game, where resources and power must be di-
vided among competitors. Individuals that acquire more power and resources could
incur a long-term fitness advantage over those that are “satisfied” with what they
already have. Consequently, even with many resources, conflict over resources could
persist in the evolving population.
Relatedly, in economics, the law of markets, also known as “Say’s Law,” proposes
that production of goods and services generates demand for goods and services. In
other words, supply creates its own demand. However, if supply creates demand, the
amount of resources required to sustain supply to meet demand must also increase
accordingly. Therefore, steady increases in demand, even under resource-abundant
conditions, will reliably result in resource scarcity.

Conflict over social standing and relative power may continue. There
will always be scarcity of social status and relative power, which people will continue
to compete over. Social envy is a fundamental part of life; it may persist because
it tracks differential fitness. Motivated by social envy, humans establish and identify
advantageous traits, such as the ability to network or climb the social ladder. Scarcity
of social status motivates individuals to compete for social standing when doing so
enables access to larger shares of available resources. Although AIs may produce many
forms of abundance, there would still be dimensions on which to compete. Moreover,
AI development could itself exacerbate various forms of inequality to extreme levels.
For example, there are likely to be major advantages to richer countries that have
more resources to invest, particularly given that growth in compute, data, and model
size appear to scale with AI capabilities. We discuss this possibility in Governance in
section 8.3.

7.4.6 Summary

Throughout this section, we have discussed some of the major factors that drive
conflict. When any one of these factors is present, agents’ incentives to bargain for a
peaceful settlement may shift such that conflict becomes an instrumentally rational
choice. These factors include power shifts, first-strike advantages, issue indivisibility,
information problems and incentives to misrepresent, as well as inequality.
In our discussion of these factors, we have laid the groundwork for understanding
the conditions under which decisions to instigate conflict may be considered instru-
mentally rational. This knowledge base allows us to better predict the risks and
probability of AI-driven conflict scenarios.
Power shifts can incentivize AI agents to pursue conflict in order to maintain strategic advan-
tages or deter potential attacks from stronger rivals, especially in the context of
military AI use.
The short-lived nature of offensive advantages may incentivize AIs to pursue first-
strike advantages in order to degrade adversaries’ capabilities or identify vulnerabilities in them,
as may be the case in cyberwarfare.
In the future, individual parties may have to compete for access to powerful AI. Since
dividing this AI between many stakeholders would reduce its power, parties may find
it instrumentally rational to conflict for access to it.
AIs may make wars more uncertain, increasing the probability of conflict. AI
weaponry innovation may present an opportunity for superpowers to consolidate their
dominance, whereas weaker states may be able to quickly increase their power by tak-
ing advantage of these technologies early on. This dynamic may create a future in
which power shifts are uncertain, which may lead states to incorrectly expect that
there is something to gain from going to war.
Even under conditions of abundance facilitated by widespread automation and ad-
vanced AI implementation, relative deprivation, and therefore conflict, may persist.

AIs may be motivated by social envy to compete with other humans or AIs for de-
sired social standing. This may result in a global landscape in which the majority of
humanity’s resources are controlled by selfish, power-seeking AIs.

7.5 EVOLUTIONARY PRESSURES

7.5.1 Overview
The central focus of this chapter is the dynamics to be expected in a future with
many AI agents. We must consider the risks that emerge from the interactions be-
tween these agents, and between humans and AI agents. In this last part of the
Collective Action Problems chapter, we use evolutionary theory to explore what hap-
pens when competitive pressures play out over a longer time period, operating on a
large group of interacting agents. Exploring evolutionary pressures helps us under-
stand the risks posed by the influence of natural selection on AI development. Our
ultimate conclusions are that AI development is likely to be subject to evolutionary
forces, and that we should expect the default outcome of this influence to be the
promotion of selfish and undesirable AI behavior.
We begin this section by looking at how evolution by natural selection can operate
in non-biological domains, an idea known as “generalized Darwinism.” We formalize
this idea using the conditions set out by Lewontin as necessary and sufficient for
natural selection, and Price’s equation for describing evolutionary change over time.
We thus set out the case that evolutionary pressures are influencing AIs. We turn to
the ramifications of this claim in the second section.
We next move on to exploring why evolutionary pressures may promote selfish AI
behavior. To consider what traits and strategies natural selection tends to favor,
we begin by setting out the “information’s eye view” of evolution as a generalized
Darwinian extrapolation of the “gene’s eye view” of biological evolution. Using this
framing, we examine how conflict can arise within a system when the interests of
propagating information clash with those of the entity that contains the information.
Internal conflict of this kind could arise within AI systems, distorting or subverting
goals even when they are specified and understood correctly. Finally, we explore why
natural selection tends to favor selfish strategies over altruistic ones. Our upshot is
that AI development is likely to be subject to evolutionary pressures. These pressures
may distort the goals we specify if the interests of internal components of the AI
system clash and could also generate a trend toward increasingly selfish AI behavior.

7.5.2 Generalized Darwinism


Our aim in this section is to understand generalized Darwinism—the idea that Dar-
winian mechanisms are a useful way to explain many phenomena outside of biology
[419]—and how we can use this as a helpful framework for modeling AI development.
Using examples ranging from science to music, we examine how evolution by natural
selection can operate in non-biological systems. We formalize this process using the
conditions for natural selection and consider how AI development meets these criteria
and is therefore subject to evolutionary pressures.

Conceptual Framework for Generalized Darwinism

Evolution by natural selection is not confined to the domain of biological organisms.
We can model many other phenomena using Darwinian mechanisms. In this section,
we use a range of examples to elaborate this idea.

Generalized Darwinism: natural selection can be applied to non-
biological phenomena. Evolution by natural selection does not depend on mech-
anisms particular to biology [420]. Darwin proposed that populations change over the
course of generations when differences among individuals help some reproduce more
than others, so that eventually, the population is made up of descendants of those
that reproduced the most. Darwin understood that this idea could explain many
other phenomena. For instance, he suggested that natural selection could explain the
evolution of language: “The survival or preservation of certain favored words in the
struggle for existence is natural selection” [421].
As an example, Richard Dawkins has argued that human culture developed according
to the principles of natural selection [419]. A piece of cultural information, such as
a song, is passed down over generations, often with small changes, and some songs
remain very well-known even over very long time periods. The 18th century French
tune “Ah! Vous dirai-je, Maman” was pleasing and easy to sing, so the young Mozart
wrote a version of it, and it was later used as the tune for the English poem “The
Star,” which was sung over and over, until today, when many people know it as
“Twinkle Twinkle,” “The Alphabet Song,” or “Baa Baa Black Sheep” [422]. There
were many other 18th century songs that have long since been forgotten, but that
one has spread to many people over centuries, due to it being more memorable and
“catchy” than others of its time.

By applying this Darwinian lens, we can describe many non-biological
phenomena. In nature, evolution happens when individuals have a variety of
traits, and individuals with some traits propagate more than others. If a species
of insect can be either red or brown, but the brown ones blend in better and are
less likely to be eaten by birds, then more of the red insects will get eaten before
reproducing, while brown insects will tend to have more descendants. Over time, the
population will consist primarily of brown insects.
We note a similar pattern in other, non-biological domains. For example, alchemy was
once a popular way of explaining the relationships among different metals. People who
believed in alchemy taught it to their students, who taught it to their own students
in turn, often with small differences. Over time, some of those ideas continued to
help them explain the world, and others didn’t. In this respect, chemistry could be
viewed as a descendant of alchemy. The ideas that define it now were propagated
when they helped us increase our understanding of the natural world, while others
were discarded. In the same vein, after the first widespread video conferencing services
were developed, similar products proliferated. Users chose the product that best met
their needs, selecting for services that were cheap, easy to use, and reliable. Each
company regularly released new versions of its product that were slightly adapted
from earlier ones, and competitors imitated and thereby propagated the best features
and implemented them into their own. Some products incorporated the most adaptive
features quickly, and the descendants of those products are the ones we use today—
while others were quickly outcompeted and fell into obscurity.

Generalized Darwinism does not imply that evolution produces good out-
comes. Often, things that are the best at propagating are not “good” in any mean-
ingful sense. Invasive species arrive in a new location, propagate quickly, and local
ecosystems begin to crumble. The forms of media that are most successful at prop-
agating in our minds may be harmful to our happiness and social relationships. For
instance, news articles that get more clicks are likely to have their click-attracting
traits reproduced in the next generation. Clicks thus select for more sensational,
emotionally charged headlines. In the context of AI, generalized Darwinism poses
significant risks. To see why, we first need to understand how many phenomena tend
to develop based on Darwinian principles, so that we can think about how to predict
and mitigate these risks.

Formalizing Generalized Darwinism

In this section, we formalize generalized Darwinism. First, we overview the criteria
necessary and sufficient for evolution by natural selection to operate on a system.
Second, we examine how we might predict what happens to a system that meets
these conditions. Together, these help us to see why “survival of the fittest” is a poor
description of evolution by natural selection. Instead, this process would be better
described as “propagation of the better-propagated information.”

Lewontin’s three conditions for evolution by natural selection. The evolu-
tionary biologist Richard Lewontin formulated three criteria necessary and sufficient
for evolution by natural selection [57]:
(1) Variation: There is variation in traits among individuals
(2) Retention: Future iterations of individuals tend to resemble previous iterations
(3) Differential fitness: Different variants have different propagation rates

The validity of these criteria does not depend on biology. In living organisms, DNA
encodes the variations among individuals. Traits encoded by DNA are heritable, and
subject to selection. But this is not the only way to fulfill the Lewontin conditions.
Video conferencing software has variation (there are many different options), reten-
tion (today’s video conferencing software is similar to last year’s), and differential fit-
ness (some products are much more widely used and imitated than others). Precisely
how change occurs depends on the specific phenomenon’s mechanism of propagation.

The Price Equation describes how a trait changes in frequency over time.
In the 1970s, the population geneticist George R. Price derived an equation that
provides a mathematical description of natural selection [423]. One formulation of
Price’s equation is given here:

∆z̄ = Cov(ω, z) + Eω(∆z).

In this equation, z̄ denotes the average value of some trait z in a population, and ∆z̄ is
the change in the average value of z between the parent generation and the offspring
generation. If z is height, and the parent generation is 5'5" on average and the next
generation is 5'7" on average, then ∆z̄ is 2 inches. ω is relative fitness: how many
offspring does an individual have relative to the average for their generation? Eω(∆z)
is the expected value of ∆z: that is, the average change in z between generations,
weighted by fitness, so that individuals who have more offspring are counted more
heavily.
Price’s Equation shows that the change in the average value of some trait between
parents and offspring is equal to the sum of a) the covariance of the trait value and
the fitness of the parents, and b) the fitness-weighted average of the change in the
trait between a parent and its offspring. “Covariance” describes the phenomenon of
one variable varying together with another. To see whether a population will get
taller over time, for example, we would need to know the covariance of fitness with
height (do tall individuals have more surviving offspring?) and the difference between
a parent’s height and their average child’s height.
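As a quick numerical check of how these two terms combine, the sketch below constructs a toy population (the numbers are invented purely for illustration) and verifies that the covariance term plus the fitness-weighted transmission term equals the actual change in the mean trait:

```python
import numpy as np

# Toy population: five parents with trait z (say, height in inches),
# an offspring count, and an average parent-to-offspring change in z.
z = np.array([64.0, 65.0, 66.0, 67.0, 68.0])
offspring = np.array([1, 2, 2, 3, 4])
dz = np.array([0.5, 0.2, 0.0, 0.3, 0.4])

omega = offspring / offspring.mean()      # relative fitness
z_next = np.repeat(z + dz, offspring)     # trait values in the offspring generation

# Left-hand side of the Price equation: the actual change in the mean trait.
delta_z_bar = z_next.mean() - z.mean()

# Right-hand side: selection (covariance) term plus fitness-weighted transmission term.
selection = np.mean(omega * z) - np.mean(omega) * np.mean(z)   # Cov(omega, z)
transmission = np.mean(omega * dz)                             # E_omega(delta z)

print(round(delta_z_bar, 4), round(selection + transmission, 4))   # both equal ~0.8667
```

In this toy example, taller parents leave more offspring, so most of the change comes from the covariance term; the rest comes from transmission, the tendency of offspring to differ slightly from their parents.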
The Price Equation can be applied to non-biological systems. The Price
Equation does not require any understanding of what causes a trait to be passed
down to a subsequent generation or why some individuals have more offspring than
others, only of how much the trait is passed on and how much it covaries with fitness.
The Price Equation would work just as well with car designs or tunes as with birds
or mollusks.
The Price Equation allows us to predict what happens when Lewontin
conditions apply. The Price equation uses differences between members of the
parent generation with respect to some trait z (variation), similarities between parent
and offspring generation with respect to z (retention), and differential fitness (selec-
tion). As a result, when we understand the degree to which each of the Lewontin
conditions apply, we can predict how much of some trait will be present in subse-
quent generations [424].
First misunderstanding: “fitness” does not describe physical power. The
idea of “fitness” often brings to mind a contest of physical power, in which the
strongest or fastest organism wins, but this is a misunderstanding. Fitness in an
evolutionary sense is not something we gain at the gym. Being fit may not necessar-
ily entail being exceptionally good at any specific abilities. Sea sponges, for example,
are among the most ancient of animal lineages, and they are not quick, clever, or
good at chasing prey, especially when compared to, say, a shark. But empirically,
sea sponges have been surviving and reproducing for hundreds of millions of years,
much more than many species that would easily beat them in head-to-head contests
at almost any other challenge.

Second misunderstanding: “fitness” is not the mechanism driving evo-
lution. Biologists often talk about fitness when discussing how well-suited an or-
ganism is to its environment. In particular, they often treat fitness as a short-hand
for relative reproductive success: how much an individual contributes to the next gen-
eration’s gene pool, relative to their competitors. However, this usage of the word
“fitness” seems to present evolution as being fundamentally tautological. In the id-
iom “survival of the fittest,” we appear to be using both “survival” and “fit” to mean
relative reproductive success. This would suggest that evolution is merely the process
in which those who reproduce more successfully, reproduce more successfully! If true,
the theory of evolution would seem to be using its own conclusion to demonstrate its
argument. As we shall see next, however, this is actually false.

“Fitness” is simply a metric we use to measure propagation rate. In
fact, evolutionary theory does not rely on circular logic. This is because an organism’s
fitness does not determine its reproductive success; natural selection does. Instead,
“fitness” is simply the word we use to describe and measure propagation success.
Those who are better at propagating their information (by surviving and reproducing)
don’t have some “being fit” property which causes their success. Rather, we deem
how “fit” they are by measuring how successful they’ve been at propagating their
information. Thus the phrase “survival of the fittest” should really be “propagation
of the better-propagated information.”
The Price Equation, and natural selection more broadly, simply says that if a trait
helps individuals survive longer or reproduce more, and that trait is passed on to the
offspring, then more of the next generation will have that trait. It does not tell us why
a trait leads to an individual having more offspring; it is only a way of expressing the
fact that some traits do correlate with having more offspring. The same is true when
natural selection is applied to non-biological systems; “fitness” is simply a word for
the quality of propagating more. The information that propagates best is, of course,
the information that propagates best. But it need not be, and often is not, “better” in
any other sense. Fitness is a metric that describes how much information propagates,
not an assessment of value.

Generalized Darwinism and AI Populations

The three Lewontin conditions, of variation, retention, and differential fitness, are
all that is needed for evolution by natural selection. This means we can assess how
natural selection is likely to affect AI populations by considering how the conditions
apply to AIs. Here, we claim that AIs are likely to meet all three conditions, so we
should expect natural selection forces to influence their traits and development.

Variation: AIs are designed and trained in a variety of ways. As previ-
ously noted in “Natural Selection Favors AI over Humans” [351],
“When thinking about advanced AI, some have envisioned a single AI
that is nearly omniscient and nearly omnipotent, escaping the lab and
suddenly controlling the world. This scenario tends to assume a rapid,
almost overnight, take-off with no prior proliferation of other AI agents;
we would go from AIs roughly similar to the ones we have now to an AI
that has capabilities we can hardly imagine so quickly that we barely
notice anything is changing. However, there could also be many use-
ful AIs, as is the case now. It is more reasonable to assume that AI
agents would progressively proliferate and become increasingly compe-
tent at specific tasks, rather than assume one AI agent spontaneously
goes from incompetent to omnicompetent. This is similar to the subdi-
vision of biological niches. For example, lions and cheetahs developed
completely different and mutually exclusive strategies to catch prey
through strength or speed. Furthermore, if there are multiple AIs, they
can work in parallel rather than waiting for a single model to get around
to a task, making things move much faster”.
This means that people are likely to continue creating multiple AI agents, even if
there is a single best model. Financial gains would encourage multiple competitors
to challenge the top system [425].
In addition to the argument that AI populations have variation because of the history
of their development, there are also pragmatic arguments for this claim. In evolution-
ary theory, Fisher’s fundamental theorem states that the rate of adaptation is directly
proportional to the variation (all else equal). In rapidly changing environments, where
quick adaptation increases a population’s probability of survival, populations with
more variation may persist longer. Consider how variation in crops reduces the prob-
ability of catastrophic crop failure, and variation in investments reduces the risk of
unmanageable financial losses. And in machine learning, an ensemble of AI systems
will often perform more accurately than a single AI [425]. Variation can help guide
decision making, in the same way that many people’s aggregated predictions will
usually be better than any one expert’s. Because of these factors, we are more likely
to see a powerful and resilient population of AIs if they have significant variation.

Variation in AI developers. As well as variation in the AI systems themselves,
we also see variation between the big technology companies developing and adopting
AI technologies. It may seem simple to prevent the rise of selfish AI behaviors by
avoiding their selection. However, the reality is different. AI companies, directed
more by evolutionary pressures than by safety concerns, are vying for survival in a
fiercely competitive landscape. Consider how OpenAI, which started as a nonprofit
dedicated to benefiting humanity, shifted to a capped-profit structure in 2019 due
to funding needs. Consequently, some of its safety-centric members branched out
and founded Anthropic, a company intending to prioritize AI safety. However, even
Anthropic couldn’t resist the call of commercialization, succumbing to evolutionary
pressures itself.
Evolutionary pressures are driving safety-minded researchers to adopt the behaviors
of their less safety-minded competitors, because they are anticipating that they can
gain a significant fitness advantage in the short-term by deprioritizing safety. Note
that this evolutionary process is not based on actual selection events (the researchers
will not be destroyed if they are outcompeted), but rather the researchers’ projec-
tions of what might happen if they adopt particular strategies. AI safety ideas are
being selected against, which is driving the researchers to change their behavior (to
behave in a less safety-conscious manner). Importantly, as the number of competitors
rises, the variation in approaches and values also increases. This increase in variation
escalates the intensity of the evolutionary pressures and the extent to which these
pressures distort the behavior of big AI companies.

Retention: new AIs are developed under the influence of earlier gener-
ations. Retention does not require exact copying; it only requires that there be
non-zero similarity among individuals in subsequent generations. In the short term,
AIs are developed by adapting older models, or by imitating features from competi-
tors’ models. Even when training AIs from scratch, retention may still occur, as highly
effective architectures, datasets, and training environments are reused thereby shap-
ing the agent in a way similar to how humans (or other biological species) are shaped
by their environments. Even if AIs change very rapidly compared to the timescales of
biological evolution, they will still meet the criterion of retention; their generations
can be extremely short, so they can move through many generations in a short time,
but each generation will still be similar to the one before it. Retention is a very easy
standard to meet, and even with many uncertainties about what AIs may be like, it
is very likely that they meet this broad definition.

Differential Fitness: some AIs are propagated more than others. There
are many traits which could cause some AI models or traits to be propagated more
than others (increasing their “fitness”). Some of these traits could be highly unde-
sirable to humans. For example, being safer than alternatives may confer a fitness
advantage on an AI. However, merely appearing to be safer might also improve an
AI’s fitness. Similarly, being good at automating human jobs could result in an AI
being propagated more. On the other hand, being easy to deactivate could reduce
an AI’s fitness. Therefore, an AI might increase its fitness by integrating itself into
critical infrastructure or encouraging humans to develop a dependency on it, mak-
ing us less keen to deactivate it. As long as some AIs are at least marginally more
attractive than others, AI populations will meet the condition of differential fitness.
There are many possible points at which natural selection could take effect on AIs.
These include the actions of AI developers, in fine-tuning and customizing models,
or re-designing training processes.

If the Lewontin conditions are satisfied, we must consider how intense the evolutionary pressures are. More intense selection pressure leads to faster
change. In a population of birds in a time with plenty of food, birds with any shape
beak may survive. However, if food becomes scarce, only those with the most effi-
cient beaks for accessing some specific food may survive, and the next generation will
disproportionately have that beak shape. More variation also leads to faster adap-
tation, because variants that will be adaptive in a new circumstance are more likely
to already exist in the population. The faster rounds of adaptation occur, the more
quickly distinct groups emerge with their own features.

If there is more intense selection pressure on AIs, where only AIs with certain traits
propagate, then we should expect to see the population optimize around those traits.
If there is more variation in the AI population, that optimization process will be
faster. If the rate of adaptation also accelerates, we would expect trends that lead
to greater differentiation in AI populations that are distinct from the changes in the
traits of individual AI models. In the following section, we will discuss the evolution-
ary trends that tend to dominate when selection pressure is intense and how they
might shape AI populations.

Summary

We started this section by exploring how evolution by natural selection can occur
in non-biological contexts. We then formalized this idea of “generalized Darwinism”
using Lewontin’s conditions and the Price equation. We found that AI development
may be subject to evolutionary pressures by evaluating how it meets the Lewontin
conditions. In the next section, we turn to the ramifications of this claim.

7.5.3 Levels of Selection and Selfish Behavior

Our aim in this section is to understand which AI characteristics are favored by natural selection. We explore this by first outlining an “information’s eye view” of
evolution by natural selection. Here, we find that internal conflict can arise where
the interests of the propagating information (such as a gene) clash with those of the
larger entity that contains it (such as an organism). This phenomenon could arise in
AI systems, distorting or subverting goals even when human operators have specified
them correctly.
We then move to a second risk generated by natural selection operating at the level
of propagating information: Darwinian forces strongly favor selfish traits over altru-
istic ones. Although on the level of an individual organism, individuals may behave
altruistically under specific conditions (such as genetic relatedness), on the level of
information, evolution by natural selection tends to produce selfishness. We conclude
by outlining how a future with many AI agents, shaped by natural selection, will be
dominated by selfish behavior.

Information’s Eye View

We often consider individual organisms to be the unit on which natural selection is operating. However, it is their genes that are being propagated through time and
space, not the organisms themselves. This section considers the “gene’s eye view” of
evolution by natural selection. We then use generalized Darwinism to build up an
extrapolated version of this perspective we can call the “information’s eye view” of
evolution.

Species succeed when their information propagates, but sometimes interests diverge. The information of living organisms is primarily contained in DNA.
Genes contain the instructions for forming bodies. Most of the time, a gene propa-
gates most successfully when the organism that contains it propagates successfully.
But sometimes, the best thing for a gene is not the best thing for the organism.
For example, mitochondrial DNA is only passed on from females, so it propagates
most if the organism has only female offspring. In some organisms, mitochondrial
DNA gives rise to genetic mechanisms that increase the production of female descen-
dants. However, if too many individuals have this mutation, the population will be
disproportionately female, and the organism will be unable to pass on the rest of its
genes. In this situation, the most effective propagation mechanism for the gene in the
mitochondria is harmful to the reproductive success of its host.
The “gene’s eye view” of evolution. In The Selfish Gene, Richard Dawkins
argues that gene propagation is a more useful framing than organism propagation
[419]. In Dawkins’ view, organisms are simply vehicles that allow genes to propagate.
Instead of thinking of birds with long beaks competing with birds with short beaks,
we can think about genes that create long beaks competing with genes that create
short beaks, in a fight for space within the bird population. This gives us a framework
for understanding examples like the one above: the gene within the mitochondria is
competing for space in the population, and will sometimes take that space even at
the expense of the host’s individual fitness.
Information functions similarly to genes, narrowing the space of pos-
sibilities. We are humans and not dogs, roundworms, or redwood trees almost
entirely because of our genes. If we do not know anything about what an organism
is, aside from how long its genome is, then for every base in the genome, there are
four possibilities, so there is an extremely large number of possible combinations. If
we learn that the first base is a G, you have divided the total number by four. When
we decode the entire genome, we have narrowed down an impossibly large space of
possibility to a single one: we can now know not only that the organism is a cat, but
even which cat specifically.
In non-biological systems, information works in a parallel way. There are many pos-
sible ways to begin a sentence. Each word eliminates possible endings and decreases
the listener’s uncertainty, until they know the full sentence at the end. Using the
framework of information theory, we can think of information as the resolution or
reduction of uncertainty (though this is not a formal definition). For an idea, infor-
mation is just the facts about it that make it different from other ideas. A textbook’s
main information is its text. A song’s information consists of the pitches and rhythms
that distinguish it from other songs. These larger phenomena (ideas, books, songs)
are distinguished by the information they contain.
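
A rough quantitative illustration, assuming for simplicity that each position is independent and equally likely to take any value: a genome of length $L$ with four possible bases per position has $4^L$ possible sequences, so decoding the full genome resolves $\log_2 4^L = 2L$ bits of uncertainty, and learning that the first base is a G resolves $\log_2 4 = 2$ bits, cutting the space of possibilities by a factor of four. The same bookkeeping applies to sentences or songs: each new symbol rules out a share of the remaining possibilities.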
Information that propagates occupies a larger volume of both time and
space. A single music score, written centuries ago and buried underground ever
since, has been propagated across hundreds of years of time, but very little space. In
contrast, a hit tune that is suddenly everywhere and then quickly forgotten takes up
a lot of space, but very little time. But the best propagated information takes up a
large volume of both. The tune for “Twinkle Twinkle” has been taking up space in
many minds, pieces of paper, and digital formats for hundreds of years and continues
to propagate. The same is true for genetic information. A gene that flourished briefly
hundreds of millions of years ago, and one that has had a consistent small presence,
both take up much less space-time volume than a gene that long ago became dominant
in many successful branches of the evolutionary tree [420].

Just as some genes propagate more, the same is true for bits of infor-
mation. In accordance with generalized Darwinism, we can extend the gene’s eye
view to an “information’s eye view.” A living organism’s basic unit of information is a
gene. Everything that evolves as a consequence of Darwinian forces contains informa-
tion, some of which is inherited more than others. Dawkins coined the term “meme”
as an analog for gene: a meme is the basic unit of cultural inheritance. Like genes,
memes tend to develop variations, and be copied and adapted into new iterations.
The philosopher of science, Karl Popper, wrote that the growth of knowledge is “the
natural selection of hypotheses: our knowledge consists, at every moment, of those
hypotheses which have shown their (comparative) fitness by surviving so far in their
struggle for existence.” Social phenomena such as copycat crimes can also be modeled
as examples of memetic inheritance. Many types of crimes are committed daily, some
of which inspire imitators, whose subsequent crimes can themselves be selected for
and copied. Selection operates on the level of individual pieces of information, as well
as on the higher level of organisms and phenomena.

AIs may pass on information in ways analogous to our genetics and cultural memetics. AIs are computer programs, made of code that determines
what they are like, in a similar way to how our DNA determines what we are like.
Different code makes the difference between an agentic AI and Flappy Bird. Their
code, or pieces from it, can be directly copied and adapted for new models. But
their information can also be memetically transmitted, as our cultural memes can.
Even today, AIs are often designed based on hearing about and imitating successful
models, not only on copying code from them. AIs also help create training data for
new AIs and evaluate their learning, which makes the new AIs tend to have traits
similar to earlier models. As AIs continue to become more autonomous, they may
be able to imitate and learn from one another, self-modifying to adopt traits and
behaviors that seem useful. The AI information that propagates the most will take
up more and more space-time volume, as it is copied into more AIs that multiply and
endure over longer periods.

Intrasystem Goal Conflict

The interests of an organism and its genetic information are usually aligned well.
However, they can sometimes diverge from one another. In this section, we identify
analogous, non-biological phenomena, where conflict arises between a system and the
sub-systems in which it stores its information. Evolutionary pressures might generate
this kind of internal conflict within AI systems, distorting or subverting goals set for
AIs by human operators, even when such goals are specified and understood correctly.

Conflict within a genome. Selection on the level of genes does not always result
in the best outcomes for the organism. For instance, as discussed in the previous
section, human mitochondrial DNA is only transferred to offspring through biological
females. A human’s mitochondrial genome is identical to their biological mother’s,
assuming no change due to mutation. Since males represent a reproductive dead-end,
mitochondrial genes that benefit only females may therefore be selected for, even
when they incur a cost upon males. These and other “selfish” genetic elements give
rise to intragenomic conflict.

Conflict within an organism. We observe other kinds of internal conflict within organisms which do not concern their genomes. For example, the bacterial species
that compose the human gut microbiome can exist in a mutually beneficial symbiosis
with their host. However, some bacteria are “opportunistically pathogenic”: in the
wake of disruptions (like the use of antibiotics), many of these once-mutualists will
propagate at accelerated rates, often at the expense of the host’s health. As the
philosopher of evolutionary biology Samir Okasha notes, “intraorganismic conflict is
relatively common among modern organisms [426].”

Conflict within an AI company. The concept of intrasystem conflict extends beyond biological examples and can be observed in organizations. A notable example
is OpenAI. In 2017, there was a power struggle in OpenAI, which led to Elon Musk’s
exit and Sam Altman becoming OpenAI’s main leader. In 2020, disagreements within
OpenAI led to internal conflict and the departure of some employees to found An-
thropic. In 2023, the board of the nonprofit overseeing OpenAI came into conflict
with Sam Altman and attempted to fire him as CEO. Challenging-to-resolve dis-
agreements about who should influence AI’s development make intrasystem conflict
at AI organizations likely in the future.

Intrasystem goal conflict: between information and the larger entity that
contains it. All the above examples concern the interests of propagating informa-
tion and those of the entities that contain the information diverging from one another.
We call the more general phenomenon that can describe all of these examples intrasys-
tem goal conflict: the clash of different subsystems’ interests, causing the functioning
of the overall system to be distorted. As we have seen, intrasystem goal conflict can
arise within complex systems in a range of domains, from genomes to corporations.

Intrasystem goal conflict in AI systems. One reason we might expect an AI system not to pursue a specified goal is that intrasystem goal conflict
has eroded its unity of purpose. A system has achieved unity of purpose if there is
alignment at all levels of internal organization [426]. Undermining a system’s unity of
purpose reduces its ability to carry out its system-level goals. A helpful analogy here
is to consider political “coups.” A coup is characterized by a struggle for control within
a political system whereby agents within the system act to seize power, often eroding
the system’s unity of purpose by disrupting its stability and functionality. When
political leaders are overthrown, the goals of the political system usually change.
Similarly, if we give an AI agent a goal to pursue, the agent may in turn assign parts
of this goal to sub-agents, who may take over and subvert the original goal with their
own.
In the future, humans and AI agents may interact in many different ways, including
by working together on collaborative projects. This provides the opportunity for goal
distortion or subordination through intrasystem goal conflict. For instance, humans
may enlist AI agents to collaborate on tasks. Just as how human collaborators may
betray or overturn their principals, AI agents may behave similarly. If an AI col-
laborator has a goal of self-preservation, they may try to remove any power others
have over them. In this way, the system that ends up executing actions based on
these conflicting goals will not necessarily be equivalent to how a system with unity
of purpose would pursue the goal set by the humans. The behavior of this emergent
multi-agent system may thus distort our goals, or even subvert them altogether.

Selfishness
In the previous section, we examined one risk generated by natural selection favor-
ing the propagation of information: conflict between the information (such as genes,
departments, or sub-agents) and the larger entity that contains it (such as an or-
ganism, government, or AI system). In this section, we consider a second risk: that
natural selection tends to favor selfish traits and strategies over altruistic ones. We
conclude that the greater the influence of evolutionary pressures on AI development,
the more we should expect a future with many AI agents to be one dominated by
selfish behavior.
Selfishness: furthering one’s own information propagation at the expense
of others. In evolutionary theory, “selfishness” does not imply intent to harm
another, or belief that one’s own interests ought to dominate. Organisms that do
not have malicious intentions often display selfish traits. The lancet liver fluke, for
example, is a small parasite that infects sheep by first infecting ants, hijacking their
brains and making them climb to the top of stalks of grass, where they get eaten by
sheep [427]. The lancet liver fluke does not wish ants ill, nor does it have a belief that
lancet liver flukes should thrive while ants should get eaten. It simply has evolved a
behavior that enables it to propagate its own information at the expense of the ant’s.
Selfishness in AI. AI systems may exhibit “selfish” behaviors, expanding the
AIs’ influence at the expense of human values. Note that these AIs may not even
understand what a human is and yet still behave selfishly toward them. For exam-
ple, AIs may automate human tasks, necessitating extensive layoffs [273]. This could
be very detrimental to humans, by generating rapid or widespread unemployment.
However, it could take place without any malicious intent on the part of AIs merely
behaving in accordance with their pursuit of efficiency. AIs may also develop newer
AIs that are more advanced but less interpretable, reducing human oversight. Ad-
ditionally, some AIs may leverage emotional connections by imitating sentience or
emulating the loved ones of human users. This might generate social resistance to
their deactivation. For instance, AIs that plead not to be deactivated might stim-
ulate an emotional attachment in some humans. If afforded legal rights, these AIs
might adapt and evolve outside human control, becoming deeply embedded in society
and expanding their influence in ways that could be irreversible.

Selfish traits are not the opposite of cooperation. Many organisms display
cooperative behavior at the individual level. Chimpanzees, for example, regularly
groom other members of their group. They don’t do this to be “nice,” but rather
because this behavior is reciprocated in future, so they are likely to eventually ben-
efit from it themselves [384]. Cells found in filamentous bacteria, so named because
they form chains, regularly kill themselves to provide much needed nitrogen for the
communal thread of bacterial life, with every tenth cell or so “committing suicide”
[428]. But even in these examples, cooperative behavior ultimately helps the individ-
ual’s information propagate. Chimpanzees who groom others expect to have the favor
returned in future. Filamentous bacteria live in colonies made up of their clones, so
one bacterium sacrificing itself to save copies of itself still propagates its information.

Natural selection tends to produce selfish traits. Organisms that further
their own information propagation will typically propagate more. A lancet liver fluke
that developed the ability to give ants free choice and allow them not to climb stalks of
grass if they don’t want to would be less likely than the current version to succeed at
getting eaten by sheep and continuing its life cycle. Most biological selfishness is less
dramatic, but nonetheless, the organisms alive today are necessarily the descendants
of those that succeeded at propagating their own information, and not of those that
traded propagation for other qualities.

Altruism that reduces an individual’s fitness is not an evolutionarily stable strategy. Imagine a very altruistic fictional population of foxes who freely
share food with one another, even at great cost to themselves. When food is abundant,
they all thrive, and when food is scarce, they suffer together. If, during a time of
scarcity, one fox decides to steal food from the communal stores and take it for herself
and her offspring, they may survive while others starve. As a result, her offspring,
who may have inherited her selfish trait, will make up a higher proportion of the
next generation. As this repeats, the population will be dominated by individuals
who take food for themselves when they can. The population of altruists may get
along quite well on its own, but altruism is unstable, because anyone who decides to
exploit it will do better than the group. Since altruism that reduces an individual’s
overall fitness is not an evolutionarily stable strategy, we should expect to see selfish
behavior being promoted.
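
The instability of altruism can be illustrated with a toy replicator-dynamics calculation; the payoff numbers below are invented purely for illustration, and the point is only that a trait with even a small relative fitness edge, starting from a tiny share of the population, eventually dominates.

# Toy replicator dynamics for the fox example (all numbers are illustrative).
fitness_altruist = 1.00
fitness_selfish = 1.05      # assumed small advantage from exploiting shared food
p_selfish = 0.01            # selfish foxes start as 1% of the population

for generation in range(200):
    mean_fitness = p_selfish * fitness_selfish + (1 - p_selfish) * fitness_altruist
    # Each trait's share grows in proportion to its fitness relative to the average.
    p_selfish = p_selfish * fitness_selfish / mean_fitness

print(f"share of selfish foxes after 200 generations: {p_selfish:.2f}")  # about 0.99

Reversing the payoffs, for instance if the group reliably punished theft, reverses the outcome; this is the intuition behind changing the selection pressures, discussed below.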

The more natural selection acts on a population, the more selfish behav-
ior we expect. In the example in the preceding paragraph, when food is abundant,
there is little advantage to selfishness and there may even be penalties, as the group
punishes selfish behavior. There is plenty of food to go around, so the descendants of
foxes who steal food will not be much more likely to survive, and the next generation
can contain plenty of altruists. But in times when only a few can propagate, selfish-
ness will confer a greater advantage, and the population will tend to become selfish
more quickly.

Avoiding extreme AI selfishness: changing the environment. AI agents’ fitness could either be influenced more by natural selection or by the environment.
We have sketched out the default outcome of the former: a landscape of powerful
and selfish AI agents. One way we might prevent this trend toward increasingly
selfish behavior is to ensure that it is the environment which ends up shaping the
fitness of AI agents substantially more than natural selection. Currently, we are in
an environment of extreme competition, and so AI agents that are better-suited to
this competitive environment will propagate more, and increase the proportion of
the population with their traits (including selfish traits). However, if we altered the
environment such that the actions of AI researchers and AI agents were not so heavily
steered by competitive pressures, we could reduce this problem.
Avoiding extreme AI selfishness: changing the selection. Another possi-
bility is to change what makes AI agents “fit.” We could establish an ecosystem in
which AI agents can be developed, deployed, and adopted more safely, without the
influence of such extreme competitive pressures. In this ecosystem, we could select
against AIs with the most harmful selfish behaviors, and select for AIs that faith-
fully assist humans. As these AIs proliferate through this ecosystem, they could then
counteract the worst excesses of selfish behavior from other agents.

7.5.4 Summary
In this section, we considered the effects of evolutionary pressures on AI populations.
We started by using the idea of generalized Darwinism to expand the “gene’s eye
view” of biological evolution to an “information’s eye view.” Using this view, we
identified two AI risks generated by natural selection: intrasystem goal conflict and
selfish behavior. Intrasystem goal conflict could distort or subvert the goals we set
an AI system to pursue. Selfish behavior would likely be favored by natural selection
wherever it promotes the propagation of information: If AI development is subject to
strong Darwinian forces, we should expect AIs to tend toward selfish behaviors.

7.6 CONCLUSION

In this chapter, we considered a variety of multi-agent dynamics in biological and social systems. Our underlying thesis was that these dynamics might produce unde-
sirable outcomes with AI, mirroring patterns observable in nature and society.

Game theory

We began with a simple game, the Prisoner’s Dilemma, observing how even rational
agents may reach equilibrium states that are detrimental to all. We then proceeded
to build upon this. We considered how the dynamics may change when the game
is iterated and involves more than two agents. We found that uncertainty about
the future could foster rational cooperation, though defection remains the dominant
strategy when the number of rounds of the game is fixed and known.

We used these games to model collective action problems in the real world, like
anthropogenic climate change, public health emergency responses, and the failures
of democracies. The collective endeavors of multi-agent systems are often vulnerable
to exploitation by free riders. We drew parallels between these natural dynamics and
the development, deployment, and adoption of AI technologies. In particular, we saw
how AI races in corporate and military contexts can exacerbate AI risks, potentially
resulting in catastrophes such as autonomous economies or flash wars. We ended
this section by exploring the emergence of extortion as a strategy that illustrated a
grim possibility for future AI systems: AI extortion could be a source of monumental
disvalue, particularly if it were to involve morally valuable digital minds. Moreover,
AI extortion might persist stably throughout populations of AI agents, which could
make it difficult to eradicate, especially if AIs learn to deceive or manipulate humans
to obscure their true intentions.

Cooperation

We then moved to an investigation of cooperation. Drawing from biological systems and human societies, we illustrated an array of mechanisms that may promote
cooperation between AIs. For each mechanism, however, we also highlighted some
associated risks. These risks included nepotism, in-group favoritism, extortion, and
the incentives to behave ruthlessly. Thus, we found that merely ensuring that AIs
behave cooperatively may not be a total solution to our collective action problems.
Rather, we need a more nuanced view of the potential benefits and risks of promoting
cooperative AI via particular mechanisms.

Conflict

We next turned to a closer examination of the drivers of conflict. Using the framework
of bargaining theory, we discussed why rational agents may sometimes opt for conflict
over peaceful bargaining, even though it may be more costly for all involved. We il-
lustrated this idea by looking at how various factors can affect competitive dynamics,
including commitment problems (such as power shifts, first-strike advantages, and is-
sue indivisibility), information problems, and inequality. These factors may drive AIs
to instigate, promote, or exacerbate conflicts, with potentially catastrophic effects.

Evolutionary pressures

We began this section by examining generalized Darwinism: the idea that Darwinian
mechanisms are a useful way to explain many phenomena outside of biology. We ex-
plored examples of evolution by natural selection operating in non-biological domains,
including culture, academia, and industry. By formalizing this idea using Lewontin’s
conditions and the Price equation, we saw how AIs and their development may be
subject to Darwinian forces.
We then turned to the ramifications of natural selection operating on AIs. We first
looked at what AI traits or strategies natural selection may tend to favor. Using
an information’s eye view of evolution by natural selection, we found that internal
conflict can arise where the interests of the propagating information clash with those
of the larger entity that contains it. Intrasystem goal conflict could arise in AI sys-
tems, distorting or subverting goals even when human operators have specified them
correctly. Moreover, Darwinian forces strongly favor selfish traits over altruistic ones.
Although on the level of an individual organism, individuals may behave altruistically
under specific conditions (such as genetic relatedness), on the level of information,
evolution by natural selection tends to produce selfishness. Thus, we might expect a
future shaped by natural selection to be dominated by selfish behavior.

Concluding remarks
In summary, this chapter explored various kinds of collective action problems: intelli-
gent agents, despite acting rationally and in accordance with their own self-interest,
can collectively produce outcomes that none of them wants, even when they could
seemingly have achieved preferable alternative outcomes. Even when we as individu-
als share similar goals, system-level dynamics can override our intentions and create
undesirable results.
This insight is of vital importance when envisioning a future with powerful AI sys-
tems. AIs, individual humans, and human agencies will all conduct their actions in
light of how others are behaving and how each expects others to behave in future. The
total risk of this multi-agent system is greater than the sum of its individual parts. Dynamics
between multiple human agencies generate races in corporate and military settings.
Dynamics between multiple AIs may generate evolutionary pressure for immoral be-
haviors, particularly selfishness, free-riding, deception, conflict, and extortion. We
cannot address all the risks posed by AI simply by focusing on the outcomes of
agents acting in isolation. The safety of AI systems will not be guaranteed solely
by aligning each AI agent to well-intentioned operators. It is an essential compo-
nent of ensuring our safety, and a valuable future, that we consider these multi-agent
dynamics carefully. These dynamics represent a common problem—clashes between
individual and collective interests. We must find innovative, system-level solutions to
ensure that the development and interaction of AI agents lead to beneficial outcomes
for all.

7.7 LITERATURE

7.7.1 Recommended Reading


• Thomas C. Schelling. Arms and Influence. Yale University Press, 1966. isbn: 9780300002218. url: http://www.jstor.org/stable/j.ctt5vm52s (visited on 10/14/2023)
• Martin A. Nowak. “Five Rules for the Evolution of Cooperation”. In: Science 314.5805 (2006), pp. 1560–1563. doi: 10.1126/science.1133755. eprint: https://www.science.org/doi/pdf/10.1126/science.1133755. url: https://www.science.org/doi/abs/10.1126/science.1133755
• James D. Fearon. “Rationalist Explanations for War”. In: International Organization 49.3 (1995), pp. 379–414. issn: 00208183, 15315088. url: http://www.jstor.org/stable/2706903 (visited on 10/14/2023)
• Dan Hendrycks. Natural Selection Favors AIs over Humans. 2023. arXiv: 2303.16200 [cs.CY]
• Edward O. Wilson. Sociobiology: The New Synthesis, Twenty-Fifth Anniversary Edition. Harvard University Press, 2000. isbn: 9780674000896. url: http://www.jstor.org/stable/j.ctvjnrttd (visited on 10/14/2023)
• C. Boehm. Moral Origins: The Evolution of Virtue, Altruism, and Shame. Basic Books, 2012
• R. Axelrod and R.M. Axelrod. The Evolution of Cooperation. Basic Books, 1984. isbn: 9780465021215. url: https://books.google.com.au/books?id=NJZBCGbNs98C
CHAPTER 8

Governance

8.1 INTRODUCTION

Safe artificial intelligence (AI) requires careful governance. The development and deployment of advanced AI systems involves many organizations and
individuals with distinct goals and incentives. These organizations and people can
interact in complicated ways, creating a complex sociotechnical system that we need
to govern effectively.
There are a wide variety of issues that governance is required to manage across differ-
ent stages of AI development (from data collection through various stages of training
to deployment) and involving various actors such as AI developers, businesses and
consumers using AI systems, and national governments, among others. Thoughtful
governance provides the constraints, incentives and institutions to steer AI progress
in a direction that benefits humanity.
Governance refers to the rules and processes that coordinate behavior.
Governance is not just what governments do. Instead, it can be defined more broadly
as the process through which some activity is organized, coordinated, steered, and
managed. It includes the norms, policies, and institutions that influence stakeholders’
actions to achieve socially desirable outcomes. In healthcare, for instance, governance
aimed at ensuring doctors avoid harming patients for profit includes norms around
patient care, professional ethical standards, and licensing organizations. AI regulators
should aim to encourage safety and responsibility among developers and users to
ensure we reap AI’s benefits while managing the risks.
Governance takes many forms across sectors and levels. Governance in-
stitutions include governmental bodies that create laws, companies that shape in-
ternal rules, and collaborative initiatives involving both public and private groups.
For example, legislatures pass regulations, corporations adopt ethical guidelines, and
public-private forums establish AI safety best practices. Governance operates at mul-
tiple levels as well, such as organizational policies, national laws, and global agree-
ments. Effective AI governance likely requires a combination of approaches across
sectors and levels.

Overview. We begin by examining the ecosystem in which governance takes place, including the relevant actors, such as companies, governments, and individuals, and
the tools available to govern them. Then, we explore the potential impact of AI on
economic growth. Next, we consider issues around the distribution of costs, benefits
and risks of AI across society.
We then survey policy tools for AI governance at the corporate, national and in-
ternational levels. For corporations, we explore legal structures, ownership models,
and assurance mechanisms that impact AI safety. At the national level, we consider
regulations, liability frameworks, resilience, and competitiveness. For international
governance, we examine tools ranging from standards and certification for civilian
AIs to non-proliferation agreements, verification schemes, and monopolies for mili-
tary systems. We conclude by examining the role of AI inputs as potential nodes for
AI governance, focusing on the role that governance of computing hardware could
play. By highlighting levers across multiple levels, this chapter provides an introduc-
tion to a range of governance approaches aimed at ensuring AI is developed safely.

8.1.1 The Landscape


To govern AI, we must understand the system in which AIs are being developed
and deployed. Two crucial aspects for governance are the list of actors and the tools
available to govern them.

Actors

AI governance involves many diverse groups across sectors that have different goals
and can do different things to accomplish them. Key actors include companies, non-
profits, governments, and individuals.

Companies develop and deploy AIs, typically for profit. Major firms such
as OpenAI and Google DeepMind shape the AI landscape through huge investments
in research and development, creating powerful models that advance AI capabili-
ties. Startups may explore new applications of AI and are often funded by venture
capitalists. The Corporate Governance section looks at policies, incentives, and struc-
tures such as ownership models of organizations that impact the development and
deployment of AI systems.

Nonprofits play a variety of roles aimed at improving society. Some nonprofits aim to develop safe AI systems. For example, OpenAI was initially founded as
a nonprofit. Others engage in advocacy and coordination by bringing together com-
panies, governments, and research labs to collaborate on development and regulation.
Academic labs such as MILA research AI capabilities, and some researchers there aim
to make AI more beneficial. Some nonprofits perform a mix of these functions.

National governments make laws, regulations, and standards. National governments can directly shape AI development and use within their jurisdictions
through legislative and regulatory powers such as market rules, public investment, and
government procurement policies; for instance, governments might require licensing
to develop large models, provide funding for AI research and development, and require
government contractors meet certain safety standards. Governments of states such as
the United States and United Kingdom directly oversee technological development
and commercialization locally through their democratic processes and administrative
procedures. We will explore the role of governments in the National Governance
section.
International organizations facilitate cooperation between countries.
Organizations such as the United Nations, European Union, and OECD have influ-
ence across borders by setting policies, principles, and ethics standards that countries
often implement locally. International governance institutions allow countries to coor-
dinate on issues such as human rights or non-proliferation of dangerous technologies
across national borders. While international governance mechanisms such as treaties
are less enforceable than domestic tools such as laws, they can exert soft power
through financing, expertise, norm-setting, and bringing countries together for dia-
logue and consensus-building. We explore these tools in the International Governance
section.
Individuals use and are deeply impacted by AI systems. AIs ultimately
have profound impacts on individuals. Externalities from AIs, such as invasions of
privacy, concentration of economic power, and catastrophic risks, are experienced by
individuals, as are many of the potential benefits such as rapid economic growth. As
consumers, individuals buy and use AI products and services, which means they can
apply social pressure to force development along specific routes. As citizens, they can
exert pressure through voting and other forms of democratic representation. Their
needs and perspectives should be a central consideration.

Tools

The AI governance landscape includes the sets of tools or mechanisms by which actors
interact and influence one another. Key tools for AI governance fall into four main
categories: information dissemination, financial incentives, standards and regulations,
and rights.
Information dissemination changes how stakeholders think and act. Ed-
ucation and training transmit technical skills and shape the mindsets of researchers,
developers, and policymakers. Sharing data, empirical analyses, policy recommenda-
tions, and envisioning positive futures informs discussions by highlighting opportu-
nities, risks, and policy impacts. Facts are a prerequisite for the creation and im-
plementation of effective policy. Increasing access to information for individuals and
organizations can change their evaluations of what’s best to do.
Financial incentives influence behavior by changing payoffs and motiva-
tions. Incentives such as funding sources, market forces, potential taxes or fees, and
regulatory actions shape the priorities and cost-benefit calculations of companies, re-
searchers, and other stakeholders. Access to funding and well-regulated markets (such
as those with IP and competition protections) encourages technology development
and commercialization, while potential taxes or regulatory penalties impose financial
and reputational risks that promote caution and consideration of governance goals.
By shaping incentives, governments can increase the degree to which private com-
panies or other actors’ opportunities and risks are aligned with those of society as a
whole.

Standards and regulations set expectations and boundaries for behavior. There is a spectrum of rules—from flexible guidelines to rigid laws—that encompasses
many different governance tools. At one end, the flexible guidelines look like standard
operating procedures and industry norms, which codify preferred practices within and
across institutions. At the other end, governments use formal and enforceable tools
including regulations and legislation, which carry penalties for violations and aim to
both address specific risks and shape industries broadly. Well-designed rules establish
which actions are allowed or prohibited for organizations and individuals.

Rights grant freedoms, entitlements, and claims over assets. Human rights and civil liberties, such as privacy, free speech, and due process, shape rela-
tionships between individuals and organizations by defining unacceptable behaviors.
Property rights and intellectual property regimes determine ownership and control
over assets such as data, algorithms, trained models, and compute. Outlining rights
clearly can establish strong governance regimes that allow litigation against any in-
fringements.
The diverse actors in AI development and deployment along with the varied gov-
ernance tools at our disposal form an intricate ecosystem. Companies, researchers,
governments, and individuals influence progress based on their capabilities and in-
centives. Financial incentives, established rules, delineated rights, and information
sharing steer beliefs and behaviors. Next, we will consider some central issues for
any attempts at governing AI: how much should we expect AI to impact economic
growth, and what might the distribution of benefits, costs and risks from AI across
society look like?

8.2 ECONOMIC GROWTH

There are many different factors that feed into the rate of economic growth, and
AI has the potential to amplify several of them. For instance, deploying AI systems
could artificially augment the effective population of workers, improve the efficiency
of human labor, or accelerate the development of new technologies that improve
productivity. While it is generally accepted that AI will boost economic growth to
some degree, there is debate over the exact magnitude of the impact it is likely to
have. Some researchers believe that it will speed up growth to an unprecedented rate,
which we refer to as “explosive growth,” while others think its impact will be limited
by other social and economic factors. We will now explore some of the arguments for
and against the likelihood of AI causing explosive growth.

Population growth may drive economic growth by accelerating technological progress. The worldwide economic acceleration observed in recent centuries
has been variously attributed to the unique conditions of the industrial revolution,
the technologies developed in 18th and 19th century Europe, and the growth in total
population over time. Population growth is emphasized most by the semi-endogenous
theory of economic growth. It holds that since economic growth causes population
growth by reducing bottlenecks on population growth and population growth causes
economic growth by providing a large labor force (including an increasing number of
researchers driving technological progress), there is a positive feedback loop and so
population growth is the key factor to consider when looking at the increase in the
economic output over time.
According to this theory, human population growth was determined for many thou-
sands of years by the availability of food. As agricultural technologies were devel-
oped, food became easier to produce, which allowed for more population growth.
Since larger populations have more opportunities to innovate and develop better
technology, some economists argue that this process loops back into itself recursively,
producing a faster-than-exponential development curve over the long run.
This acceleration ultimately slowed down in the mid-20th century. The semi-
endogenous theory explains this slowdown as a result of the independent decline in
the population growth rate, arguing that demographic changes such as falling birth
rates uncoupled productivity and population growth. This is one explanation for why
economic growth did not explode in the late 20th and early 21st centuries. According
to this line of reasoning, lifting the population bottleneck would once again enable
the multi-thousand-year trend of accelerating growth to continue until we exhaust
physical resources like energy and space.
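
A stripped-down simulation can make this feedback loop concrete. The parameter values below are arbitrary illustrations, not calibrated estimates: output depends on technology and population, idea production scales with the number of people, and population growth eventually hits a ceiling, at which point output growth stops accelerating.

# Toy semi-endogenous growth loop (all parameter values are made up).
tech, pop = 1.0, 1.0
pop_cap = 20.0                          # assumed demographic/resource ceiling
rates = []

for t in range(300):
    prev_output = tech * pop
    tech *= 1 + 0.005 * pop             # more people generate more ideas
    pop = min(pop_cap, pop * 1.02)      # population grows until it hits the ceiling
    rates.append(tech * pop / prev_output - 1)

print(f"output growth rate at t=10:  {rates[10]:.1%}")   # modest
print(f"output growth rate at t=140: {rates[140]:.1%}")  # rising while population grows
print(f"output growth rate at t=290: {rates[290]:.1%}")  # settles once population is capped

Raising or removing the cap keeps the growth rate climbing, which is the sense in which the population bottleneck matters in this theory.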

[Figure 8.1 shows a plot titled “Constraints in Semi-Endogenous Theory of Economic Growth”: output versus year (1800–2000), comparing the pre-industrial level and the real outcome with a hypothetical trajectory unconstrained by population.]
Figure 8.1. Economic output could grow much faster than past trends if not constrained by the human population.

AIs may fuel effective population growth. If AIs can automate the majority
of important human tasks (including further AI development), this could lift the
bottleneck on labor that some believe is the primary obstacle to explosive growth.
There are some reasons to think that AIs could boost the economic growth rate by
substituting for human labor. As easily duplicable software, the AI population can
grow at least as quickly as we can manufacture hardware to run it on—much faster than humans can reproduce and learn skills from adults. This replication of labor
could then boost the effective workforce and accelerate productivity.

AIs may accelerate further AI development. If AIs become proficient at altering themselves to enhance their own capabilities, then we could see accelerated
AI development through recursive self-improvement. At each step, this may make
them more efficient at performing tasks and producing goods, as well as make them
better at self-improving. It is worth noting that the growth of AI capabilities driven
by human design is already much faster than the growth of human capabilities driven
by biological evolution; over the entire course of evolution from humans’ last common
ancestor with chimpanzees, human brains grew roughly 4 times in size, whereas over
the decade after AlexNet, the largest machine learning models increased in size by
the same amount roughly every 16 months. If AIs were able to effectively automate
their own R&D, then this could speed up the improvements in their own performance
even more.
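
Taking the figures above at face value, the gap in rates is stark: a fourfold increase every 16 months compounds over a single decade to a factor of roughly $4^{120/16} = 4^{7.5} \approx 3 \times 10^{4}$, whereas biological evolution delivered a single factor of four over millions of years.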

AI population and self-improvement could form a positive feedback loop. Taken together, a growing number of AIs and accelerated AI development through
recursive self-improvement could form a positive feedback loop between AI popula-
tion and technology production, mirroring the loop with human population proposed
in the semi-endogenous theory. However, since neither the self-replication nor self-
improvement aspects of this loop are subject to biological constraints, the AI popula-
tion feedback loop could theoretically be much faster than the one involving humans.
Assuming we do not see a slowdown in AI development or face other bottlenecks like
energy production, this self-amplifying cycle could lead to unprecedented economic
growth.

AI automation may create explosive growth. Some researchers have argued that if we add AIs to standard models of economic growth (such as the Solow model)
developed using economic theory and past data, we find that AIs could trigger a
dramatic surge in economic growth. Some studies suggest that such AIs could spark
unprecedented growth, causing the world economy to grow at rates exceeding ten
times the current growth rate [433]. If this were to transpire in its most extreme
form, it could result in an unprecedented acceleration of scientific and technological
advancement, reshaping our economy and the trajectory of human history over a pe-
riod of perhaps only a few years. Growth of this magnitude would be unlike anything
in human history: for the past 10,000 years after the agricultural revolution, total
world output grew at 0.1% per year, steadily increasing over the last 1000 years to
single-digit percentage points.

However, other researchers have argued that there are potential physical constraints,
as well as social and economic dynamics, that could prevent AI from driving explosive
growth. We will now explore some of these arguments.

Non-accumulable factors may bottleneck production. The argument for explosive growth relies on the assumption that some factors, such as AI labor and its
outputs, are indefinitely accumulable, such that they can be repeatedly re-invested
and lead to ever-increasing returns. However, this may not always be true. For exam-
ple, there is a limited supply of land that can accommodate physical infrastructure
required to run AIs. There is also a fixed amount of energy that the Earth receives
from the sun each day, representing a theoretical limit on the rate at which AIs could
operate. While we have not yet reached the upper bounds of either land or energy
available, they do imply that economic growth is unlikely to continue indefinitely.
Whether or not we see AI-driven explosive growth depends on whether or not this
happens before the limits of non-accumulable factors become constraints.

R&D may be harder to accelerate than expected. A key ingredient of the economic feedback loop described above is the idea that R&D activities will improve
technology, and thus improve efficiency of work and production. However, it could
be the case that, in AI development, the “low-hanging fruit” of new ideas that yield
substantial improvements have already been taken, and that there will be diminishing
returns on future R&D, even if done by an increasing population of AIs. This could
weaken the effect of additional research activities and slow down the feedback loop,
potentially to such a degree that it could not fuel explosive economic growth.

AI adoption and its impact could be slow and gradual. Some researchers
argue that the greatest impact of a new technology may not manifest as an intense
peak during the early stages of innovation [434]. Rather, the productivity gains may
be delivered as a slower increase continuing over a longer period of time, as the
technology is gradually adopted by a wide range of industries. This could be because
the technology needs to be adapted to many different tasks and settings, humans
need to be trained to operate it, and other tools and processes that are compatible
with it need to be developed. For example, although the first electric dynamo suitable
for use in industry was invented in the 1870s, it took several decades for electricity to
become integrated within industries. It has been argued that this is why electricity
only boosted the US economy significantly in the early 20th century. Similarly, a slow
process of diffusion could also smooth out AI’s impact on today’s economy.

Regulations and human preferences may prevent complete automation of the economy. Even if AIs could theoretically automate all tasks and create
explosive growth, social factors may prevent this from happening. For example, fears
about risks associated with AI may prompt regulations that restrict the technology’s
use, perhaps requiring that certain services, such as medical or legal services, be
performed by human professionals. Other regulations seeking to protect intellectual
property might limit the amount of training data available, thus inhibiting the growth
of AI capabilities.

Besides regulations, humans’ own preferences may also limit the fraction of tasks
that are automated. For example, we can speculate that humans may prefer certain
services that involve a high degree of social interaction, such as those in healthcare,
education, and counseling, to be provided by other humans. Additionally, people may
always be more interested in watching human athletes, actors, and musicians, and in
buying artwork produced by humans. In some cases, these jobs may therefore evade
automation, even if it were theoretically possible to automate them, just as there are
still professional human chess players, despite the fact that machines have long been
able to beat Grandmasters.

Economic growth may be bottlenecked by non-automated tasks. To generate explosive growth, AI automation may need to be extremely comprehensive; even if
most tasks were automated, the non-automated ones may become constraints on the
growth rate of productivity. For example, if AI exhibited above-human performance
on cognitive tasks but its ability to move in the physical world lagged behind due
to slow progress in robotics, this would cap the extent to which AI could accelerate
economic growth. This is because physical tasks involved in manufacturing products
and moving them around are likely to remain important; even if AI made design
more efficient, output could only increase up to the limit imposed by how quickly the
physical tasks could be performed.
It is worth noting that, although robotics has so far appeared more challenging than
achieving cognitive tasks with AI, there are some initial signs that progress in the area
could accelerate in the coming years. As such, robotics may not always represent a
bottleneck to growth. Nevertheless, there could be many types of tasks in production
processes that constrain the overall output of the system, no matter how efficient AI
can make the other tasks.

Baumol’s cost disease. Another reason why even just a few non-automated jobs
could prevent explosive growth is the concept of Baumol’s cost disease, proposed by
the economist William Baumol in the 1960s. This idea states that, when technology
increases productivity in one industry, the prices of its products fall, and the wages
of its workers rise. Another industry, which cannot easily be made more efficient with
technology, will also need to increase its workers’ wages, to prevent them moving
into higher-paying jobs in technologically enhanced sectors. As a result, the prices of
outputs in those sectors take up an increasing share of the overall economy. Thus,
even if some industries undergo rapid increases in productivity, the effect on the
growth of the economy as a whole is more muted. This is one explanation for why
the prices of goods such as TVs have declined over time, while the costs of healthcare
and education have risen. According to this concept, if AI automates many jobs,
but not all of them, its economic impact could be substantial, but not necessarily
explosive.
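
The mechanism can be made concrete with a toy two-sector calculation. In the Python sketch below, the growth rate, time horizon, and one-unit consumption basket are assumptions chosen purely for illustration.

years = 30
g = 0.04                                     # assumed annual productivity growth in the progressive sector
wage = (1 + g) ** years                      # wages rise roughly in line with economy-wide productivity
price_progressive = wage / (1 + g) ** years  # unit cost = wage / productivity, so it stays flat
price_stagnant = wage                        # no productivity gains, so unit cost tracks rising wages

# If households keep buying one unit of each good, the stagnant sector's
# share of total spending rises even though that sector has not changed at all.
share_stagnant = price_stagnant / (price_stagnant + price_progressive)
print(round(price_progressive, 2), round(price_stagnant, 2), round(share_stagnant, 2))
# 1.0  3.24  0.76
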
While AI has the potential to significantly enhance economic growth through various
routes including improving workers’ productivity and accelerating R&D, the extent
and speed of this growth remain uncertain. Theories suggesting explosive growth
due to AI rely on relatively strong assumptions around the removal of potential
bottlenecks. However, potential constraints such as physical resource limits, dimin-
ishing returns on R&D, slow AI adoption, regulatory and human preferences, and
non-automatable tasks could moderate this growth. Therefore, while AI’s impact on
economic growth is likely to be substantial, whether it will lead to unprecedented
economic expansion or be tempered by various limiting factors remains an open
question.

8.3 DISTRIBUTION OF AI

In this section we discuss three main dimensions of how aspects of AI systems are
distributed:
1. Benefits and costs of AI: whether the benefits and costs of AI will be evenly or
unevenly shared across society.
2. Access to AI: whether advanced AI systems will be kept in the hands of a small
group of people, or whether they will be widely accessible to the general public.
3. Power of AI systems: whether in the long run, there will be a few highly sophis-
ticated AIs with vastly more power than the rest, or many highly capable AIs.

Distribution of Benefits and Costs of AI

The distribution of the costs and benefits from AI will ultimately depend on both
market forces and governance decisions. It is possible that companies developing
AI will receive most of the economic gains, while automation could dramatically
reduce economic opportunities for others. Government may need to engage in new
approaches to redistribution to ensure that even if only a small number of people
directly gain wealth from AIs, wealth is eventually shared more broadly among the
population.

AI might substitute for labor on a large scale. Researchers have estimated
that many jobs are susceptible to partial or full replacement by AI, depending on what
tasks those jobs involve and which of these can be automated with AI [435]. Unlike in
the past, even knowledge-based jobs such as writing and programming are at risk due
to advancements in AI such as large language models, robotics, and computer vision.
In the US, for instance, the advent of self-driving cars could potentially displace 5
million people who drive for a living, and advancements in robotics might threaten
the employment of the 12 million employed in manufacturing [436]. Since software is
cheap to duplicate, AI presents firms around the globe with a low-cost and scalable
alternative to using human labor.

Past revolutions have relocated employment, not destroyed it. A common
counterargument brought up in discussions about the potential impact of AI on em-
ployment draws on historical evidence of technological revolutions. On the one hand,
automation negatively impacts employment and wages through the displacement ef-
fect, as human labor is substituted with technology. On the other hand, automation
has a positive economic impact through the productivity effect, as new tasks or indus-
tries require human labor [437]. Historically, the productivity effect has dominated
and the general standard of living has increased. Consider the Industrial Revolution:
mechanized production methods displaced artisanal craftspeople, but eventually led
to new types of employment opportunities in factories and industries that hadn’t
previously existed. Similarly, access to computers has automated away many man-
ual and clerical jobs like data entry and typing, but has also spawned many more
new professions. Technological shifts have historically led to increases in employment
opportunities and wages. Therefore, while these changes were disruptive for certain
professions, they ultimately led to a shift in employment rather than mass unemploy-
ment. This phenomenon, called creative destruction, describes how outdated indus-
tries and jobs are replaced by new, often more efficient ones. Similarly, transformative
technologies can also augment workers (as capital does, on the standard economic view) rather than replace them. If AIs serve as gross complements to human labor, this may drive up wage growth rather than increase inequality.

In the near term, AI could boost employment. We may see an increase
in job opportunities that involve managing or checking the outputs of AI systems.
Given the likely penetration of AI into all sectors of the economy, this could lead to
a significant creation of new positions, not dissimilar to how IT services are required
across industries.
If automation increases purchasing power due to decreasing prices of automated
goods, this could increase demand for human-provided goods and services by enough
to partly or fully offset job losses from automation. As wealth increases, people may
have more disposable income, potentially spurring job growth in sectors like hospi-
tality, recreation, and mental health services. For example, automation may expe-
dite some parts of software engineering, decreasing the cost of producing software.
Software development might then become more affordable, increasing demand and
creating more work for software engineers.

However, human-level AI (HLAI) might destroy employment. HLAI, by
definition, is capable of doing every task that humans can do, at least as cheaply.
The implication of achieving HLAI is that human labor would no longer be valuable
or necessary. This would represent a fundamental difference from past technological
advancements that both helped and displaced human workers. Conventional policies
to address job losses from automation, like worker retraining programs, would be
meaningless in a world where there are no jobs to retrain for. At that point, economic
gains from automation would be likely to accrue to a small handful of people or
organizations that control AIs. Human innovation has historically created new jobs
that would have seemed inconceivable just decades earlier. However, HLAI would also
possess the ability to invent new economically valuable tasks, possibly more quickly
than humans can think and innovate. The idea that there are always jobs for humans
to do, while historically true, is not a law of nature [438].
Historically, routine tasks have been the primary target for automation. As AIs
approach human-level intelligence, cognitive nonroutine jobs [439], which require
education and are typically high-paying, would also become automated. Program-
mers, researchers, and artists are already augmenting their productivity using large
language models, which will likely continue to become more capable. One way the
future of work could play out is that a shrinking group of high-skilled workers excel at managing or using AIs, or provide otherwise exceptionally rare skills, while the vast majority of people become unemployable.

An autonomous economy could operate without human labor. The tran-
sition to a fully automated economy may not stem from one defining moment, but
from an accumulation of small economic decisions. Imagine the following story: as
AIs become increasingly capable, humans delegate an increasing number of business
tasks to them. Initially, they handle emails and scheduling, but soon manage organi-
zational accounting and communications with clients and partners. The AIs can be
trained and deployed quickly and make fewer errors than humans. Over time and due
to competitive pressures, businesses feel the need to automate more of their opera-
tions, and AIs begin handling complex cognitive tasks like strategy, innovation, and
leadership. Soon, the economy is mostly made up of AIs communicating with other
AIs, operating at a pace and complexity that humans can’t keep up with, further
motivating the automation of all economic activity. Eventually, the economy is able
to operate without need for human contribution or oversight. Humans, on the other
hand, have become reliant on AIs for basic needs, societal governance, and even social
interactions. Humans have little reason to learn, work, or aspire to create, and their
survival depends on the beneficence of AIs.

Wages could collapse if full automation is reached. Some researchers have
argued that under plausible assumptions, we might see wages continue to grow for a
significant period of time, but that at high levels of automation wages would collapse
[440]. Under this model, as AI systems improve, an increasing share of the tasks per-
formed by human beings would be automated. Eventually automation would surpass
a certain critical threshold where labor is no longer scarce, even if some tasks are still
performed by humans. Once this threshold is passed and capital can fully substitute
for labor, wages would decline rapidly.
Governments may seek to redistribute wealth from the owners of AI systems to the
rest of the population in order to address poverty and inequality. If this led to people
being free to spend their time on what they most value, this could be a positive
change, provided we can address challenges around purpose and enfeeblement in a
world without work. Relevant policies are discussed further in section 8.5.

8.3.1 Distribution of Access to AI

Issues of access to AI are closely related to the question of distribution of costs and
benefits. Some have argued that if access to AI systems is broadly distributed across
society rather than concentrated in the hands of a few companies, the benefits of AI
would be more evenly shared across society. Here, we will discuss broader access to AI
through open-source models, and narrower access through restricted models, as well
as striking a balance between the two through structured access. We will examine
the safety implications of each level of access.

Levels of Access

Restricted AI models concentrate power. Restricted AI models are those that
can only be used by a small group of people. They may, for example, be exclusively
accessible to people working in private companies, government agencies, or a small
group of people with national security clearances. Restricted models cannot be used
by the general public.
While restrictions might apply to only some AIs, it is also possible that most highly capable AIs will be tightly restricted. If all AIs, or at least the most powerful, are
restricted models, then power would be concentrated in the hands of a small number
of people. This could increase the risk of value lock-in, where the values of that small
group of people would be promoted and perpetuated, potentially irreversibly, even
if they did not adequately represent the values and interests of the larger human
population.

Open-source AI models increase the risks of malicious use. Open-source
AI models are those that are freely available for anyone to use. There are no restric-
tions on what people can use them for, or how users can modify them. These AI
systems would proliferate irreversibly. Since open-source models are, by definition,
accessible to anyone who can run them without restrictions on how they can be used,
there is an increased probability of malicious use.

If information security is poor, all AI systems are effectively open-
source. Robust cybersecurity will be required to prevent unintended users from
accessing powerful AIs. Inadequate protections will mean that AIs are implicitly
open-source even if they are not intended to be, because they will likely be leaked or
stolen.

Structured access. One possible option for striking a balance between keeping AIs
completely private and making them fully open-source would be to adopt a structured
access approach. This is where the public can access an AI, but with restrictions on
what they can use it for and how they can modify it. There may also be restrictions
on who is given access, with “Know Your Customer” policies for verifying users’
identities. In this scenario, the actor controlling the AI has ultimate authority over
who can access it, how they can access it, what they can use it for, and if and how they
can modify it. They can also grant access selectively to other developers to integrate
the AI within their own applications, with consideration of these developers’ safety
standards.
One practical way of implementing structured access is to have users access an AI via
an application programming interface (API). This indirect usage facilitates controls
on how the AI can be used and also prevents users from modifying it. The rollout
of GPT-3 in 2020 is an example of this style of structured access: the large language
model was stored in the cloud and available for approved users to access indirectly
through a platform controlled by OpenAI.
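
To give a sense of how such controls fit together, the Python sketch below shows a purely hypothetical gateway combining identity verification, a usage-policy filter, and rate limiting in front of a hosted model. The specific checks, blocked terms, and limits are invented for illustration and do not describe any provider's actual system.

from dataclasses import dataclass

BLOCKED_TERMS = ("bioweapon", "malware", "spear phishing")   # illustrative policy list

@dataclass
class User:
    user_id: str
    identity_verified: bool        # e.g., confirmed through a "Know Your Customer" process
    requests_this_hour: int = 0

@dataclass
class Gateway:
    hourly_limit: int = 100        # assumed rate limit

    def handle(self, user: User, prompt: str) -> str:
        if not user.identity_verified:
            return "Denied: identity verification required."
        if user.requests_this_hour >= self.hourly_limit:
            return "Denied: rate limit exceeded."
        if any(term in prompt.lower() for term in BLOCKED_TERMS):
            return "Denied: request violates the usage policy."
        user.requests_this_hour += 1
        return query_model(prompt)   # the model weights never leave the provider's servers

def query_model(prompt: str) -> str:
    # Placeholder for the hosted model; users only ever see its outputs.
    return f"[model response to: {prompt!r}]"

gateway = Gateway()
alice = User("alice", identity_verified=True)
print(gateway.handle(alice, "Summarize this report."))

Because every request passes through the provider's servers, the provider can revoke access, tighten the policy, or update the model without users ever holding a copy of it.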

Openness Norms

Traditionally, the norm in academia has been for research to be openly shared. This
allows for collaboration between researchers in a community, enabling faster develop-
ment. While openness may be a good default position, there are certain areas where
it may be appropriate to restrict information sharing. We will now discuss the cir-
cumstances under which these restrictions might be justified and their relevance to
AI development.

There are historical precedents for restricting information sharing in
dual-use research. Dual-use technologies are those that can be used for both
beneficial and harmful purposes. It is not a new idea that information about the de-
velopment of such technologies should not be widely shared. In the 1930s, publication
of research on the nuclear chain reaction, which could be used for both nuclear power
and nuclear weapons, prompted a Nazi program developing the latter. The Manhat-
tan Project was then conducted in secrecy to avoid enemy intelligence learning of any
breakthroughs. Biotechnology has seen debates about the appropriate level of open-
ness, with concerns around the publication of papers detailing potential methods for
creating dangerous pathogens, which might in future be used as bioweapons.
Powerful AI would be a dual-use technology and there is therefore a need for serious
consideration of who can be trusted with it. Absolute openness means implicitly
trusting anyone who has the necessary hardware to use AIs responsibly. However,
there could in future be many people with sufficient means to deploy AIs, and it
might only take one person with malicious intent to cause a catastrophe.

Technological progress may be too fast for regulations to keep up. An-
other reason for restricting information sharing is the pacing problem—where techno-
logical progress happens too quickly for policymakers to devise and implement robust
controls on a technology’s use. This means that we cannot rely on regulations and
monitoring to prevent misuse in an environment where information that could enable
misuse is being openly shared.

It may be difficult to predict the type of research that is potentially dan-
gerous. Within AI research, there are different kinds of information, such as the
model weights themselves and the methods of building the system. There have been
cases where the former has been restricted for safety reasons but the latter openly
shared. However, it is plausible that information on how to build dangerous AIs
could also be used to cause harm.
Moreover, it can be difficult to predict exactly how insights might be used and whether
they are potentially dangerous. For instance, the nuclear reactor, which could help
society create more sustainable energy, was instrumental in developing a cheaper
version of the atomic bomb. It is possible that AIs designed for seemingly harmless
tasks could be used to propel the advancement of potentially dangerous AIs. We may
not be able to predict every way in which technologies that are harmless in isolation
might combine to become hazardous.
Since there are costs to restrictions, it is worth considering when they
are warranted. Any interventions to mitigate the risk of misuse of AIs are likely to
come at a cost, which may include users’ freedom and privacy, as well as the beneficial
research that could be accelerated by more open sharing. It is therefore important to
think carefully about which kind of restrictions are justified, and in which scenarios.
It might, for example, be worth comparing the number of potential misuses and how
severe they would be with the number of positive uses and how beneficial they would
be. Another factor that could be taken into account is how narrowly targeted an
intervention could be, namely how accurately it could identify and mitigate misuses
without interfering with positive uses.
Restrictions on the underlying capabilities of an AI (or the infrastructure supporting
these) tend to be more general and less precisely targeted than interventions imple-
mented downstream. The latter may include restrictions on how a user accessing an
AI indirectly can use it, as well as laws governing its use. However, upstream restric-
tions on capabilities or infrastructure may be warranted under specific conditions.
They may be needed if interventions at later stages are insufficient, if the dangers
of a capability are particularly severe, or if a particular capability lends itself much
more to misuse than positive use.

Risks From Open vs Controlled Models

Open models would enable dangerous members of the general public to engage in
harmful activities. Tightly controlled models exacerbate the risk that their creators,
or elites with special access, could misuse them with impunity. We will examine each
possibility.
Powerful, open AIs lower the barrier to entry for many harmful activ-
ities. There are multiple ways in which sophisticated AIs could be harnessed to
cause widespread harm. They could, for example, lower the barrier to entry for cre-
ating biological and chemical weapons, conducting cyberattacks like spear phishing
on a large scale, or carrying out severe physical attacks, using lethal autonomous
weapons. Individuals or non-state actors wishing to cause harm might adapt power-
ful AIs to harmful objectives and unleash them, or generate a deluge of convincing
disinformation, to undermine trust and create a more fractured society.
More open AI models increase the risk of bottom-up misuse. Although
the majority of people do not seek to bring about a catastrophe, there are some who
do. It might only take one person pursuing malicious intentions with sufficient means
to cause a catastrophe. The more people who have access to highly sophisticated AIs,
the more likely it is that one of them will try to use it to pursue a negative outcome.
This would be a case of bottom-up misuse, where a member of the general public
leverages technology to cause harm.
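
The point that risk grows with the number of people who have access can be made explicit with a simple probability calculation. In the Python sketch below, the per-user probability of attempted catastrophic misuse is an arbitrary assumption used only to illustrate the scaling, not an estimate.

def p_at_least_one_misuser(p_single: float, n_users: int) -> float:
    """Chance that at least one of n independent users attempts catastrophic misuse."""
    return 1 - (1 - p_single) ** n_users

p = 1e-6   # assumed per-user probability, for illustration only
for n in (1_000, 1_000_000, 100_000_000):
    print(n, round(p_at_least_one_misuser(p, n), 4))
# 1,000 users -> ~0.001; 1,000,000 -> ~0.63; 100,000,000 -> ~1.0

Even a vanishingly small per-user probability translates into a high chance of at least one misuse attempt once access becomes broad enough.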

Some AI capabilities may be skewed in favor of offense over defense. It
could be argued that AIs can also be used to improve defenses against these various
threats. However, some misuse capabilities may be skewed in favor of offense not
defense. For example, it may be much easier to create and release a deadly pathogen
than to control it or come up with cures or vaccines. Even if an AI were to facilitate
faster vaccine development, a bioweapon could still do a great deal of harm even in a
short timeframe, leading to many deaths before the vaccine could be discovered and
rolled out.

Releasing highly capable AIs to the public may entail a risk of black
swans. Although numerous risks associated with AIs have been identified, there
may be more that we are unaware of. AIs themselves might even discover more
technologies or ways of causing harm than humans have imagined. If this possibility
were to result in a black swan event (see Section 4.7 for a deeper discussion of black
swans), it would likely favor offense over defense, at least to begin with, as decision
makers would not immediately understand what was happening or how to counteract
it.

More tightly controlled models increase the risk of top-down misuse. In
contrast with bottom-up misuse by members of the public, top-down misuse refers
to actions taken by government officials and elites to pursue negative outcomes. If
kept in the hands of a small group of people, powerful AIs could be used to lock
in those people’s values, without consideration of the interests of humanity more
broadly. Powerful AIs could also increase governments’ surveillance capacity, poten-
tially facilitating democratic backsliding or totalitarianism. Furthermore, AIs that
can quickly generate large quantities of convincing content and facilitate large-scale
censorship could hand much greater control of media narratives to people in power.
In extreme cases, these kinds of misuse by governments and elites could enable the
establishment of a very long-lasting or permanent dystopian civilization.

8.3.2 Distribution of Power Among AIs

The final dimension we will consider is how power might be distributed among ad-
vanced AI systems. Assuming that we reach a world with AI systems that generally
surpass human capabilities, how many such systems should we expect there to be?
We will contrast two scenarios: one in which a single AI has enduring decisive power
over all other AIs and humans, and one in which there are many different powerful
AIs. We will look at the factors that could make each situation more likely to emerge,
the risks we are most likely to face in each case, and the kinds of policies that might
be appropriate to mitigate them.

AI Singleton

At one extreme, a conceivable future scenario is the emergence of an AI singleton—an
AI with vastly greater power than all others, to the extent that it can permanently
secure its power over the others [441].

A monopoly on AI could make a single powerful AI more likely. One
factor affecting the number of powerful AIs that emerge is the number of actors that
can independently develop AIs of similar capabilities. If a single organization, whether
a government or a corporation, were to achieve a monopoly on the development of
highly sophisticated AI, this would increase the likelihood of a single AI emerging
with decisive and lasting power over all individuals.

A fast “take-off” could make a single powerful AI more likely. If an AI
were to undergo a fast “take-off,” where its capabilities suddenly grew to surpass
other intelligences, then it could prevent other existing AIs from going through the
same process. Such an AI might be motivated to destroy any potential threats to its
power and secure permanent control, ensuring it could pursue its goals unimpeded.
On the other hand, if intelligence were to progress more gradually, then there would
not be a window of time where any single AI was sufficiently more powerful than the
others to halt their further development. Note, however, that a fast takeoff does not
necessitate one AI becoming a permanent singleton. That is because AIs may still
be vulnerable even if they are extremely powerful. Simple structures can take down
more complex structures; just as humans are vulnerable to pathogens and chemical
weapons, simpler AIs (or humans) might be able to counteract more capable AIs.

Benefits
An AI singleton could reduce competitive pressures and solve collective
action problems. If an AI singleton were to emerge, the actor in control of it would
not face any meaningful competition from other organizations. In the absence of com-
petitive pressures, they would have no need to try to gain an advantage over rivals
by rushing the development and deployment of the technology. This scenario could
also reduce the risk of collective action problems in general. Since one organization
would have complete control, there would be less potential for dynamics where dif-
ferent entities chose not to cooperate with one another (as discussed in the Collective Action Problems chapter), leading to a negative overall outcome.

Costs
An AI singleton increases the risk of single points of failure. In a future
scenario with only one superintelligent AI, a failure in that AI could be enough to
cause a catastrophe. If, for instance, it were to start pursuing a dangerous goal, then
it might be more likely to achieve it than if there were other similarly powerful AIs
that could counteract it. Similarly, if a human controlling an AI singleton would like
to lock in their values, they might be able to do so unopposed. Therefore, an AI
singleton could represent a single point of failure.

An AI singleton could increase the risk of human disempowerment. If
there were just one superintelligent AI and it sought to capture global power, it would
not have to overpower other superintelligent AIs in order to do so. If, instead, there
were multiple powerful AIs, humans might be able to cooperate with those that were
more willing to cooperate with humans. However, an AI singleton would have little
reason to cooperate with humans, as it would not face any competition from other
AIs. This scenario would therefore increase the risk of disempowerment of humanity.

Diverse Ecosystem of AIs

An alternative possibility is the emergence of a diverse ecosystem of similarly capable
AIs, in which no single agent is significantly and sustainably more powerful than all
the others combined. An AI singleton might also not occur if there is turnover amongst
the most powerful AIs due to the presence of vulnerabilities. Just as human empires
rise and fall, AIs may gain and lose power to others.

Declining development costs could make multiple AIs more likely. If
the costs associated with developing AIs diminish considerably over time, then more
actors will be able to develop AIs independently of one another. Also, if there aren’t
increasing returns from intelligence in many economic niches, then many businesses
will settle for AIs that are only as capable as they need. That is, an AI intended to cook fast
food may not benefit from knowing advanced physics. This increases the probability
of a future where multiple AIs coexist.

Benefits
A diverse ecosystem of AIs might be more stable than a single superin-
telligence. There are reasons to believe that a diverse ecosystem of AIs would
be more likely to establish itself over the long term than a single superintelligence.
The general principle that variation improves resilience has been observed in many
systems. In agriculture, planting multiple varieties of crops reduces the risk that all
of them will be lost to a single disease or pest. Similarly, in finance, having a wide
range of investments reduces the risk of large financial losses. Essentially, a system
comprising many entities is less vulnerable to collapsing if a single entity within it
fails.
There are multiple additional advantages that a diverse ecosystem of AIs could have
over a single superintelligence. Variation within a population means that individuals
can specialize in different skills, making the group as a whole better able to achieve
complex goals that involve multiple different tasks. Such a group might also be gen-
erally more adaptable to different circumstances, since variation across components
could offer more flexibility in how the system interacts with its environment. The
“wisdom of the crowds” theory posits that groups tend to make better decisions
collectively than any individual member of a group would make alone. This phe-
nomenon could be true of groups of AIs. For all these reasons, a future involving a
diverse ecosystem of AIs may be more able to adapt and endure over time than one
where a single powerful AI gains decisive power.

Diverse AIs could remove single points of failure. Having multiple diverse
AIs could dilute the negative effects of any individual AI failing to function as in-
tended. If each AI were in charge of a different process, then they would have less
power to cause harm than a single AI that was in control of everything. Addition-
ally, if a malicious AI started behaving in dangerous ways, then the best chance of
preventing harm might involve using similarly powerful AIs to counteract it, such as
through the use of “watchdog AIs” tasked with detecting such threats. In contrast
with a situation where everything relies on a single AI, a diverse ecosystem reduces
the risk of single points of failure.

Costs
Multi-agent dynamics could lead to selfish traits. Having a group of di-
verse AIs, as opposed to just one, could create the necessary conditions for a process
of evolution by natural selection to take effect (for further detail, see Evolutionary
Pressures). This might cause AIs to evolve in ways that we would not necessarily be
able to predict or control. In many cases, evolutionary pressures have been observed
to favor selfish traits in biological organisms. The same mechanism might promote
AIs with undesirable characteristics.

Diverse AIs could increase the risk of unanticipated failures. A group of
AIs interacting with one another would form a complex system and could therefore
exhibit collective emergent properties that could not be predicted from understand-
ing the behavior of just one. A group of AIs might therefore increase the risk of black
swan events (Section 4.7). Additionally, interactions between AIs could form feed-
back loops, increasing the potential for rapid downward spirals that are difficult to
interrupt and stop. A group of powerful AIs in control of multiple military processes could, for example, present a risk of a flash war (see the discussion of AI races in the Collective Action Problems chapter), resulting from a
feedback loop of adversarial reactions.

Diverse AI ecosystems could exhibit failure modes of AI singletons. If
multiple AI systems collude with one another, or if inequality amongst AIs is signif-
icant such that one or a few are much more powerful than others, risks will mirror
those of an AI singleton. We will examine why collusion and inequality may occur,
and the implications.

Multiple AIs may or may not collude. It has been proposed that if there were
multiple highly capable AIs, they would collude with one another, essentially acting
as a single powerful AI [442]. This is not inevitable. The risk of collusion depends on
the exact environmental conditions; a simple game-theoretic illustration of one such condition is given after the lists below.
Some circumstances that make collusion more likely and more successful include:
1. A small number of actors being involved.
2. Collusion being possible even if some actors cease to participate.
3. Colluding actors being similar, for example in terms of their characteristics and
goals.
4. Free communication between actors.
5. Iterated actions, where each actor can observe what another has done and respond
accordingly in their next decision.
6. All actors being aware of the above circumstances.

Conversely, some circumstances that impede collusion include:


1. A large number of actors being involved.
2. A requirement for every single actor to participate in order for collusion to succeed.
3. Dissimilarity among colluders, perhaps having different histories and conflicting
goals.
4. Limited communication between actors.
5. Processes involving only one step where actors cannot observe what other actors
have done and respond in a future action.
6. Uncertainty about the above circumstances.
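
One of these conditions, the number of actors involved, can be illustrated with a standard result from repeated-game theory. The Python sketch below assumes grim-trigger strategies and an equally shared collusive payoff (a textbook simplification, not a model of AI behavior); under those assumptions, collusion is sustainable only if each actor weighs the future heavily enough, and the required patience grows with the size of the group.

def critical_discount_factor(n_actors: int) -> float:
    """Minimum discount factor at which collusion is sustainable under grim-trigger
    strategies with an equally split collusive payoff: delta >= 1 - 1/n."""
    return 1 - 1 / n_actors

for n in (2, 5, 20, 100):
    print(n, critical_discount_factor(n))
# 2 -> 0.5, 5 -> 0.8, 20 -> 0.95, 100 -> 0.99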

Power among AIs may be distributed unequally. The power of AI systems
may follow a long-tail distribution, analogous to the distribution of wealth among
humans in the US. It is therefore important to note that even if we have many diverse
AIs of similar capabilities, power may still end up concentrated in the few that have a slight edge, and AI’s overall impact may be largely determined by that handful. In other words, there are situations short of an AI singleton in which power is nonetheless held mainly by one or a few AIs.

Conclusions About Distribution


In this section, we examined concerns around the distribution of AI’s benefits and
costs. First, we considered how costs and benefits of AIs might be distributed, includ-
ing societal-scale risks. In the short term, automation can lead to economic growth,
but if HLAI is developed, this would result in the effective end of human labor. In
such a world, people may struggle to find purpose and become dependent on AIs.
Second, we examined varying levels of access to AIs from open-source to highly re-
stricted private models. Since AI systems can be used in dangerous ways, traditional
openness norms in research likely need to be reconsidered. Both centralization and
decentralization of access to AIs carry risks.
Third, we discussed how we should expect to see power distributed across AI systems.
Will we eventually see a singleton with decisive and lasting power, a diverse ecosystem
of AIs with varying degrees of power, or something in between? These questions
could have important implications both for the questions about access and benefits
discussed above, as well as for how we go about managing risks from AI.
In the rest of this chapter, we turn to discussing various strategies for governing AI
and mitigating some of the risks it may pose. We explore how various stakeholders
can contribute to good governance and effective risk management within companies
developing AI systems, at the level of national policy and regulation, and in inter-
national cooperation between different states and AI developers. We conclude by
considering the role of controls over inputs to AI systems as a means of governing AI
and managing some of its risks, focusing on the role of computing hardware as the
most “governable” of the inputs used for modern AI systems.

8.4 CORPORATE GOVERNANCE

Overview. AI companies need sound corporate governance: it is essential that these
firms are guided in directions that enable the creation of safe and beneficial AIs. In
this section, we explain what corporate governance is, differentiating between share-
holder and stakeholder theory. Then, we give an overview of different legal structures,
ownership structures, organizational structures, as well as internal and external as-
surance mechanisms.

8.4.1 What Is Corporate Governance?

There are different views on what the purpose of corporate governance is. These
differences are related to different theories about capitalism.

Shareholder theory. According to one view, corporate governance is about the
relationship between a company and its shareholders. This view is based on share-
holder theory, according to which companies have an obligation to maximize returns
for their shareholders. Using this theory, the purpose of corporate governance is to
ensure that actors within a company act in the best interest of the company’s share-
holders. The relationship between shareholders and actors within a company can
be conceptualized as the following problem: shareholders delegate responsibilities to
managers and workers, but managers and workers might not act in the shareholders’
interests [443]. On this view, corporate governance is ultimately a tool for maximizing
shareholder value.

Stakeholder theory. According to another view, corporate governance is about
the relationship between a company and its stakeholders. The idea is that compa-
nies are responsible not only to their shareholders but to many other stakeholders
like employees, business partners, and civil society [444]. Following this theory, the
purpose of corporate governance is to balance the interests of shareholders with the
interests of other stakeholders. It refers to all the rules, practices, and processes by
which a company is directed and controlled.
For the purposes of this book, we are mainly interested in how AI companies are or
should be governed to best advance the public interest [445]. Next, we will consider
how a firm’s legal structure can help permit this.

8.4.2 Legal Structure

A company’s legal structure refers to its legal form and place of incorporation, as well as its statutes and bylaws.

Legal form. AI companies can have different legal forms. In the US, the most
common form is a C corporation or “C-corp” for short. Other forms include public
benefit corporations (PBCs), limited partnerships (LPs), and limited liability compa-
nies (LLCs). The choice of legal form has significant influence on what a company
is able and required to do. For example, while C-corps must maximize shareholder
value, PBCs can also pursue public benefits. This can be important in situations
where AI companies may want to sacrifice profits in order to promote other goals.
Google, Microsoft, and Meta are all C-corps, Anthropic is a PBC, and OpenAI is an
LP (which is owned by a nonprofit).

Place of incorporation. Since there are significant differences between jurisdic-
tions, it matters where a company is incorporated. An important distinction can be
drawn between common law countries like the US and UK, wherein judicial precedent
is important, and civil law countries like Germany and France, where recorded laws
are more comprehensive and less open to interpretation. But there are also impor-
tant differences within a given country. For example, in the US, many AI companies
such as OpenAI and Anthropic are incorporated in the state of Delaware due to its
relatively business-friendly regulations, but they often have branches in other states
like California.

Statutes and bylaws. Although many governance decisions are determined by
the legal form and the place of incorporation, companies have some room for cus-
tomization. They can customize their legal structure in their statutes and bylaws.
Companies are typically required to specify their mission statement in their statutes.
The mission of Google DeepMind is “to solve intelligence to advance science and ben-
efit humanity” and OpenAI’s mission is “to ensure that AGI benefits all of humanity.”
The statutes of PBCs also need to contain a specific public benefits statement. But
the statutes and bylaws can also contain more specific rules; for example, an AI com-
pany could give its board of directors specific powers, such as to veto the deployment
of a new model.
Ensuring that a firm’s legal structures enable its employees to act in the best interests
of the firm’s stakeholders is important, and firms have many ways to do this. Next,
we will consider how ownership can help solidify a firm’s safety-conscious goals.

8.4.3 Ownership Structure

Companies are owned by shareholders, who elect the board of directors, which ap-
points the senior executives who actually run the company.

Types of shareholders. AI companies can have different types of shareholders. In
their early stages, AI companies typically get investments from wealthy individuals,
so-called “angel investors.” For example, Elon Musk was among the first investors in
OpenAI, while Dustin Moskovitz was part of the initial funding round of Anthropic.
As AI companies mature, professional investors such as venture capital (VC) firms
and big tech companies start to invest—DeepMind was acquired by Google in 2014,
and Microsoft invested heavily in OpenAI. At some point, many companies decide
to “go public,” which means that they are turned into publicly traded companies.
At that point, institutional investors like pension funds enter the stage. For example,
Vanguard and BlackRock are among the largest shareholders of Microsoft, Google,
and Meta, though the founders often retain a significant amount of shares. It is also
not uncommon to give early employees stock options, which allow them to purchase
stock at fixed prices.

Powers of shareholders. Shareholders can influence the company in several
ways. They can voice concerns in the annual general meeting and vote in share-
holder resolutions. For example, in the 2019 annual general meeting of Alphabet, one
shareholder called for better board oversight of AI and suggested creating a Societal
Risk Oversight Committee. If the board of directors of a C-Corp does not act in the
shareholders’ interest, shareholders can theoretically replace them. However, in prac-
tice this is very rare. A more common way to put pressure on board members is to file
lawsuits against them if they fail to meet certain obligations. Such lawsuits are often
settled in ways that improve corporate governance. The attempt of shareholders to
steer the company in a certain direction is called shareholder activism.

Customized ownership structures. Depending on the company’s legal form, it
may be possible to customize the ownership structure. A common way to do that is
to issue two different classes of shares: both classes give their holders ownership, but
only one class grants voting rights. This allows structures where the founders remain
in control of the company while other shareholders contribute capital and receive
returns on their investment. Another way to achieve certain goals is to combine
different legal entities. For example, the returns of investors in the OpenAI LP are
capped. Above a certain threshold, returns are owned by the OpenAI nonprofit. They
call this a capped-profit structure. A related idea is to adopt a governance structure
that, under certain conditions, transfers control to a committee representing society’s
interests rather than the shareholders’ interests.
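
As a simplified sketch of how a capped-profit split could work, the Python snippet below divides returns between investors and a controlling nonprofit. The cap multiple is a hypothetical parameter chosen for illustration, not the figure used by any actual company.

def split_returns(invested: float, total_return: float, cap_multiple: float = 100.0):
    """Divide returns between investors (up to a cap) and the controlling nonprofit."""
    investor_cap = invested * cap_multiple
    to_investors = min(total_return, investor_cap)
    to_nonprofit = max(0.0, total_return - investor_cap)
    return to_investors, to_nonprofit

print(split_returns(1e6, 5e7))   # (50000000.0, 0.0): below the cap, investors keep everything
print(split_returns(1e6, 5e8))   # (100000000.0, 400000000.0): returns above the cap go to the nonprofit
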
It is important who owns AI companies because these owners have strong ways to
influence operations. Next, we will put aside questions of ownership and examine how
these organizations are structured and run.

8.4.4 Organizational Structure

While shareholders own the company, it is governed by the board of directors and
managed by the chief executive officer (CEO) and other senior executives.

Board of directors. The board of directors is the main governing body of the
company. Board members have a legal obligation to act in the best interests of the
company, so-called fiduciary duties. These duties can vary: board members of Alpha-
bet have the typical fiduciary duties of a C-Corp, while board members of OpenAI’s
nonprofit have a duty “to ensure that AGI benefits all of humanity,” not to maxi-
mize shareholder value. The board has a number of powers they can use to steer the
company in a more prosocial direction. It sets the strategic priorities, is responsible
for risk oversight, and has significant influence over senior management; for instance,
it can replace the chief executive officer. However, the board’s influence is often indi-
rect. Many boards have committees, some of which might be important from a safety
perspective such as a risk committee or audit committee. Anthropic’s Long-Term
Benefit Trust (LTBT) can choose Anthropic’s board members, who could in theory
fire the CEO.
Senior executives. The company is managed by the CEO, who appoints other
senior executives such as a chief technology officer (CTO), chief scientist, chief risk
officer (CRO), and so on. Since the behavior of executives is at least in part driven
by financial incentives, remuneration is often a valuable tool to ensure that they act
in the corporation’s interest. Before appointing new executives, the board might also
want to conduct background checks; in the case of AI companies, a board might want
to consider candidates’ views on risks from AIs.
Organizational units. Below the executive level, AI companies have a number
of teams, often in a hierarchical structure. This will typically involve research and
product development teams, but also legal, risk management, finance, and many
others. AI companies can also have safety and governance teams. These teams should
be well resourced and have buy-in from senior management.
In addition, some AI companies have other organizational structures, such as an ethics
board that advises the board of directors and senior management, or an institutional
review board (IRB) that checks if publishing certain types of research might be
harmful [445]. Google DeepMind has an IRB-like review committee that played a key
role in the release of AlphaFold.
Various aspects of the legal, ownership, and organizational structure of AI companies
can influence to what extent their outputs focus on employees’ and managers’ best
interests, creating shareholder value, or achieving their mission and ensuring soci-
etal wellbeing. In the last subsection, we will explore how we can ensure that these
structures are well chosen and smoothly functioning.

8.4.5 Assurance
Different stakeholders within and outside AI companies need to know whether ex-
isting governance structures are adequate. AI companies therefore take a number of
measures to evaluate and communicate the adequacy of their governance structures.
These measures are typically referred to as assurance. We can distinguish between
internal and external assurance measures.
Internal assurance. AI companies need to ensure that senior executives and
board members get the information they need to make good decisions. It is, there-
fore, essential to define clear reporting lines. To ensure that key decision-makers get
objective information, AI companies may set up an internal audit team that is or-
ganizationally independent from senior management [446]. This team would assess
the adequacy and effectiveness of the company’s risk management practices, controls,
and governance processes, and report directly to the board of directors.
External assurance. Many companies are legally required to publicly report cer-
tain aspects of their governance structures, such as whether the board of directors
has an audit committee. Often, AI companies also publish information about their
released models, for example in the form of model or system cards. Some organiza-
tions disclose parts of their safety strategy and their governance practices as well; for
instance, how they plan to ensure their powerful AI systems are safe, whether they
have an ethics board, or how they conduct pre-deployment risk assessments. These
publications allow external actors like researchers and civil society organizations to
evaluate and scrutinize the company’s practices. In addition, many AI companies
commission independent experts to scrutinize their models, typically in the form of
third-party audits, red teaming exercises to adversarially test systems, or bug bounty
programs that reward finding errors and vulnerabilities.

Conclusions About Corporate Governance

Corporate governance refers to all the rules, practices, and processes by which a
company is directed and controlled—ranging from its legal form and place of incor-
poration to its board committees and remuneration policy. The purpose of corporate
governance is to balance the interests of a company’s shareholders with the interests
of other stakeholders, including society at large. To this end, AI companies are ad-
vised to follow existing best practices in corporate governance. However, in light of
increasing societal risks from AI, they also need to consider more innovative gover-
nance solutions, such as a capped-profit structure or a long-term benefit trust.

8.5 NATIONAL GOVERNANCE

Overview. Government action may be crucial for AI safety. Governments have the
authority to enforce AI regulations, to direct their own AI activities, and to influ-
ence other governments through measures such as export regulations and security
agreements. Additionally, major governments can leverage their large budgets, diplo-
mats, intelligence agencies, and leaders selected to serve the public interest. More
abstractly, as we saw in the Collective Action Problems chapter, institutions can
help agents avoid harmful coordination failures. For example, penalties for unsafe AI
development can counter incentives to cut corners on safety.
This section provides an overview of some potential ways governments may be able to
advance AI safety, including safety standards and regulations, liability for AI harms, targeted taxation, and public ownership of AI. We also describe various levers for improving societal resilience and for ensuring that countries focused on developing AI safely do not fall behind less responsible actors.

8.5.1 Standards and Regulations


To ensure that frontier AI development and deployment is safe, two complementary
approaches are to formally establish strong safety measures as best practices for AI labs, and to require the implementation of those measures. This can be done
using standards and regulations, respectively.

Standards are specified best practices. Standards are written specifications of
the best practices for carrying out certain activities. There are standards in many
areas, from telecommunications hardware to country codes to food safety. Standards
often aim to ensure quality, safety, and interoperability; for instance, the International
Food Standard requires the traceability of products, raw materials and packaging
materials.

The substance of AI safety standards. In the context of frontier AI, techni-
cal standards for AI safety could guide various aspects of the AI model life cycle.
Before training begins, training plans could be assessed to determine if training is
safe, based on evidence about similar training runs and the proposed safety methods.
After training begins, models could be evaluated to determine if further training or
deployment is safe, based on whether models show dangerous capabilities or propen-
sities. During deployment, particularly powerful models could be released through a
monitored API. Standards could guide all aspects of this process.

Standards are developed in dedicated standard-setting organizations.
Many types of organizations, from government agencies to industry groups to other
nonprofits, develop standards. Two examples of standard-setting organizations are the
National Institute of Standards and Technology (NIST) and the International Orga-
nization for Standardization (ISO). In the US, standard setting is often a consensus-
based activity in which there is substantial deference to industry expertise. However,
this increases the risk that standards over-represent industry interests.

The impact of standards. Standards are not automatically legally binding. De-
spite that, standards can advance safety in various ways. First, standards can shape
norms, because they are descriptions of best practices, often published by authorita-
tive organizations. Second, governments can mandate compliance with certain stan-
dards. Such “incorporation by reference” of an existing standard may bind both the
private sector and government agencies. Third, governments can incentivize compli-
ance with standards through non-regulatory means. For example, government agen-
cies can make compliance required for government contracts and grants, and stan-
dards compliance can be a legal defense against lawsuits.

Regulations are legally binding. Regulations are legal requirements established
by governments. Some examples of regulations are requirements for new foods and
drugs to receive an agency’s approval before being sold, restrictions on the pollution
emitted by cars, requirements for aircraft and pilots to have licenses, and constraints
on how companies may handle personal data.

Regulations are often shaped by both legislatures and agencies. In some
governments, such as the US and UK, regulations are often formed through the follow-
ing process. First, the legislature passes a law. This law creates high-level mandates,
and it gives a government agency the authority to decide the details of these rules and
enforce compliance. By delegating rulemaking authority, legislatures let regulations
be developed with the greater speed, focus, and expertise of a specialized agency. As
we discussed, agencies often incorporate standards into regulations. Legislatures also
often influence regulation through their control over regulatory agencies’ existence,
structure, mandates, and budgets.
Regulatory agencies do not always regulate adequately. Regulatory agencies can face
steep challenges. They can be under-resourced: lacking the budgets, staff, or author-
ities they need to do well at designing and enforcing regulations. Regulators can
also suffer from regulatory capture—being influenced into prioritizing a small inter-
est group (especially the one they are supposed to be regulating) over the broader
public interest. Industries can capture regulators by politically supporting sympa-
thetic lawmakers, providing biased expert advice and information, building personal
relationships with regulators, offering lucrative post-government jobs to lenient reg-
ulatory staff, and influencing who is appointed to lead regulatory agencies.
Standards and regulations give governments some ways to shape the behavior of AI
developers. Next, we will consider legal means to ensure that the developers of AI
have incentives in line with the rest of society.

8.5.2 Liability for AI Harms

In addition to standards and regulations, legal liability could advance AI safety. When
AI accidents or misuse cause harm, liability rules determine who (if anyone) has to
pay compensation. For example, an AI company might be required to pay for damages
if it leaks a dangerous AI, or if its AI provides a user with step-by-step instructions for
building or acquiring illegal weapons.

Non-AI-specific liability. Legal systems including those of the US and UK have
forms of legal liability that would plausibly apply to AI harms even in the absence
of AI-specific legislation. For example, in US and UK law, negligence is grounds for
liability. Additionally, in some circumstances, such as when damages result from a
company’s defective product, companies are subject to strict liability. That means
companies are liable even if they acted without negligence or bad intentions. These
broad conditions for liability could apply to AI, but there are many ways judges might
interpret concepts like negligence and defective products in the case of AI. Instead of
leaving it to a judge’s interpretation, legislators can specify liability rules for AI.

There are advantages to holding AI developers liable for damages. Legal
liability helps AI developers internalize the effects of their products on the rest of
society by ensuring that they pay when their products harm others. This improves
developers’ incentives. Additionally, legal liability helps provide accountability with-
out relying on regulatory agencies. This avoids the problem that government agencies
may be too under-resourced or captured by industry to mandate and enforce adequate
safety measures.

Legal liability is a limited tool. There are practical limits to what AI companies
can actually be held liable for. For example, if an AI were used to create a pandemic
that killed 1 in 100 people, the AI developer would likely be unable to pay beyond a
small portion of the damages owed (as these could easily be in the tens of trillions). If
the amount of harm that can be caused by AI companies exceeds what they can pay,
AI developers cannot fully internalize the costs they impose on society. This problem
can be eased by requiring liability insurance (a common requirement in the context
of driving cars), but there are amounts of compensation that even insurers could not
afford. Moreover, sufficiently severe AI catastrophes may disrupt the legal system
itself. Separately, liability does little to deter AI developers who do not expect their
AI development to result in large harms—even if their AI development really proves
catastrophically dangerous.
Ensuring legal liability for harms that result from deployed AIs helps align the in-
terests of AI developers with broader social interests. Next, we will consider how
governments can mitigate harms when they do occur.

8.5.3 Targeted Taxation

Although taxes do not directly force risk internalization, they can provide govern-
ment revenues that can be reserved for risk mitigation or disaster relief efforts. For
example, the Superfund is a US government program that funds the cleanup of aban-
doned hazardous waste sites. It is funded by excise taxes on chemicals (an excise tax is a special tax on a particular good, service, or activity). The excise tax ensures that the chemical
manufacturing industry pays for the handling of dangerous waste sites that it has
created. Special taxes on AIs could support government programs to prevent risks or
address disasters.

Taxation is the most straightforward redistribution policy. Wealth redis-
tribution or social spending is most often funded by taxes. Progressive tax policies,
adopted today by most but not all nations, require that those who earn more money
pay a greater proportion of their earnings than those who earn less. Governments
then seek to redistribute wealth through a wide variety of programs, from healthcare
and education to direct checks to people. In light of the high profits of technology
companies, economists and policymakers have already proposed specialized taxes on
robots, digital goods and services, and AI. If AI enables big tech companies to make
orders of magnitude more money than any previous company, while much of the
population is unable to pay income tax, targeted taxes on AI may be necessary to
maintain government revenues.
Seeking to encourage innovation, the US currently structures its taxes to favor capital investment over labor investment. If a firm wants to hire a worker, it must pay payroll taxes, and the employee must pay a number of separate taxes. If the firm instead replaced that worker with an AI, it would presently pay only corporate tax, which it incurs anyway [447]. As with other redistributive policies surveyed in this chapter,
there are political barriers to high taxes, including the ability of companies to lobby
the government in favor of lower taxes, as well as long-standing and contentious
debates over the economic effects of taxation.

8.5.4 Public Ownership over AI

Another approach to aligning AI development with societal interests would be public
ownership over some AI systems. Public ownership means that the public receives
both the benefits and the costs (risks) of AIs, addressing moral hazards. Governments
might seek to assume partial or full ownership of powerful AI systems both in order
to control their operations safely and to guarantee equitable revenue distribution.
Public utilities are often nationalized because they benefit from economies of scale,
where larger organizations are cheaper and more efficient. For example, building
power plants and transmission lines is expensive, so governments are interested in
maintaining one large, well-regulated company. The French government owns the
utility company Électricité de France, whose nuclear plants generate the majority of the country's electricity. AIs may be similar to public utilities if they are ubiquitous
throughout the economy, essential to everyday activity, and require special safety
considerations. The potential tendency of AI to strengthen concentration of market
power is discussed further in section 6.4 in the Beneficial AI and Machine Ethics
chapter.

8.5.5 Improving Resilience

The government actions already discussed focus on preventing unsafe AI development and deployment, but another useful intervention point may be mitigating damages during deployment. We discuss this briefly here and at greater length in section 3.5.

Resilience may protect against extreme risks. Governments may be able to
improve societal resilience to AI accidents or misuse through promoting cybersecurity,
biosecurity, and AI watchdogs. Measures for increasing resilience may raise the level of
AI capabilities needed to cause catastrophe. That would buy valuable time to develop
safety methods and further defensive measures—ideally enough time for safety and
defense to always keep pace with offensive capabilities. Sufficient resilience could
lastingly reduce risk.

Policy tools for resilience. To build resilience, governments could use a variety
of policy tools. For example, they could provide R&D funding to develop defensive
technologies. Additionally, they could initiate voluntary collaborations with the pri-
vate sector to assist with implementation. Governments could also use regulations to
require owners of relevant infrastructure and platforms to implement practices that
improve resilience.

Tractability of resilience. If governments defend narrowly against some attacks,
rogue AIs or malicious users might just find other ways to cause harm. Increasingly
advanced AIs could pose novel threats in many domains, so it may be hard to iden-
tify or implement targeted defensive measures that make a real difference. However,
perhaps there are a few domains where societal vulnerabilities are especially dire and tractable to improve (cybersecurity or biosecurity, for example), while some defensive measures could provide broader defenses (such as AI watchdogs).
Cybersecurity. AIs could strengthen cybersecurity. AIs could identify and patch code
vulnerabilities (that is, they could fix faulty programming that would let attackers
get unauthorized access to a computer). AIs could also help detect phishing attacks,
malware and other attempts to attack a computer network, enabling responses such
as blocking or quarantining malicious programs. These efforts could be targeted to de-
fend widely used software and critical infrastructure. However, AIs that identify code
vulnerabilities are dual-use; they can be used to either fix or exploit vulnerabilities.
Biosecurity. Dangerous pathogens can be detected or countered through measures
such as wastewater monitoring (which might be enhanced by anomaly detection), far-
range UV technology, improved personal protective equipment, and DNA synthesis
screening that is secure and universal.
AI watchdogs. AIs could monitor the activity of other AIs and flag dangerous be-
havior. For example, AI companies can analyze the outputs of their own chatbots
and identify harmful outputs. Additionally, AIs could identify patterns of dangerous
activities in digital or economic data. However, some implementations of this could
harm individual privacy.
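As a toy illustration of the watchdog idea, here is a minimal sketch in which chatbot outputs are screened by a harm classifier before being returned; the score_harm function, the flagged phrases, and the threshold are hypothetical placeholders rather than any real moderation API.

```python
# Minimal sketch of an AI watchdog that screens chatbot outputs.
# score_harm is a stand-in for a learned moderation model; the phrases and
# threshold below are illustrative placeholders, not a real policy.

HARM_THRESHOLD = 0.8  # illustrative cutoff, tuned to trade off false positives


def score_harm(text: str) -> float:
    """Placeholder harm classifier returning a probability in [0, 1]."""
    flagged_phrases = ("synthesize a pathogen", "build an explosive device")
    return 1.0 if any(phrase in text.lower() for phrase in flagged_phrases) else 0.0


def watchdog_filter(model_output: str) -> tuple[bool, str]:
    """Return (blocked, text); withhold outputs the classifier judges harmful."""
    if score_harm(model_output) >= HARM_THRESHOLD:
        # In deployment, this event would also be logged for human review.
        return True, "This response was withheld by a safety filter."
    return False, model_output


blocked, text = watchdog_filter("Here is today's weather forecast.")
print(blocked, text)  # False, the benign output passes through unchanged
```

In a real deployment, the placeholder classifier would be replaced by a learned moderation model, and flagged outputs would be logged for human review rather than silently discarded.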
Defensive measures including cybersecurity, biosecurity, and AI watchdogs may miti-
gate harms from the deployment of unsafe AI systems. However, defensive measures,
regulation, and liability may all be insufficient for safety if the countries that imple-
ment them all fall behind the frontier of AI development. Next, we will consider how
countries can remain competitive while ensuring safety in domestic AI production.

8.5.6 Not Falling Behind

If some countries take a relatively slow and careful approach to AI development, they
may risk falling behind other countries that take less cautious approaches. It would be risky if the global leaders in AI development were based in countries that lack adequate guardrails on AI. Various policy tools could allow states to avoid falling
behind in AI while they act to keep their own companies’ AIs safe.

Risks of adversarial approaches. Adversarial approaches to AI policy—that
is, policies focused on advancing one country’s AI leadership at the expense of
another’s—have risks. Adversarial policies could rely on wrong assumptions about
which states will adequately guardrail AI, and they could also motivate counter-
actions and increase international tensions (making cooperation harder). Competitive
mindsets can also encourage de-prioritizing safety in the name of competing—in the
Collective Action Problems chapter, we consider this problem in greater depth us-
ing formal models. Additionally, AI technologies might proliferate quickly even with
strong efforts to build national leads in AI.
International cooperation, as explored in section 8.6, may enable states to keep their
AIs safe, preserve national competitiveness, and avoid the pitfalls of adversarial AI
policy. Still, as options for cases where cooperation fails, here we consider several
policy tools for preserving national competitiveness in AI.

Export controls. Restrictions on the export of AI-specialized hardware can limit
states’ access to a key input into AI. Due to the extreme complexity of advanced
semiconductor manufacturing, it is very difficult for states subject to these export
controls to manufacture the most advanced hardware on their own. Additionally, the
AI hardware supply chain is extremely concentrated, perhaps making effective export
controls possible without global agreement. We explore this further in the Compute
Governance section.

Immigration policy. Immigration policy affects the flow of another important
input into AI development: talented AI researchers and engineers. Immigration could
be an asymmetric advantage of certain countries; surveys suggest that the interna-
tional AI workforce tends to have much more interest in moving to the US than
China [448]. Immigrants may be more likely to spread AI capabilities internationally,
through international contacts or by returning to their native countries, but many
immigrants who are provided with the chance choose to stay in the US.

Intelligence collection. Collecting and analyzing intelligence on the state of AI
development in other countries would help governments avoid both unwarranted com-
placency and unwarranted insecurity about their own AI industries.

Information security. Information security measures could slow or prevent the
diffusion of AI insights and technologies to adversarial countries or groups. For ex-
ample, governments could provide information security assistance to AI developers,
or they could incentivize or require developers’ compliance with information security
standards. This is further discussed below.
Governments can use a range of measures to remain internationally competitive while
maintaining the safety of domestic AI development.

8.5.7 Information Security

Model theft as a national security concern. If model weights from advanced
AI systems are stolen or leaked, this could allow state or non-state actors to misuse
these models. For example, they could maliciously use models for offensive purposes
such as cyberattacks or the development of novel weapons (as discussed in more
detail in Chapter 1). Assuming AI systems were sufficiently capable and valuable
from a military and economic perspective, the leaking of this intellectual property to
competitors would represent a major blow.

The likelihood of theft or leaks of model weights appears high. First, the
most advanced AI systems are likely to be highly valuable due to their ability to
perform a wide variety of economically useful activities. Second, there are strong in-
centives to steal models given the high cost of developing state of the art systems.
Lastly, these systems have an extensive attack surface because of their extremely
complex software and hardware supply chains. In recent years, even leading technol-
ogy companies have been vulnerable to major attacks, such as the Pegasus 0-click
exploit that enabled actors to gain full control of high-profile figures’ iPhones and the
2022 hack of NVIDIA by the Lapsus group, which claimed to have stolen proprietary
designs for its next-generation chips.

There are various attack vectors that could be exploited to steal model
weights. These include running unauthorized code that exploits vulnerabilities in
software used by AI developers, or attacking vulnerabilities in security systems them-
selves. Attacks on vendors of software and equipment used by an AI developer are
a major concern, as both the hardware and software supply chains for AI are ex-
tremely complex and involve many different suppliers. Other techniques that are less
reliant on software or hardware vulnerabilities include compromising credentials via
social engineering (e.g. phishing emails) or weak passwords, infiltrating companies
using bribes, extortion or placement of agents inside the company, and unauthorized
physical access to systems. Even without any of these attacks, abuse of legitimate Ap-
plication Programming Interfaces (APIs) can enable extraction of information about
AI systems. Research has shown that it is possible to recover portions of some of
OpenAI’s models using typical API access [449].

Securing model weights is complex and challenging. Basic approaches to
thwart opportunistic attacks include using multifactor authentication, developing in-
cident detection and response capabilities, and limiting the number of people with
access to model weights. More advanced threats require more elaborate measures
such as penetration testing with a well-resourced external team, establishing an in-
sider threat program, reviewing vendor and supplier security, and hardening interfaces
to weight access against weight exfiltration. These responses illustrate several of the
safe design principles discussed in Safety Engineering. The principle of least privilege
is directly applied by limiting access to model weights. Red-teaming is an example of
the concept of antifragility. The need for multiple independent security layers demon-
strates the importance of defense in depth in securing advanced AI systems against
potential threats.
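To make the least-privilege idea concrete, the sketch below shows a toy access-control check for weight access; the role names and permission sets are hypothetical, and a production system would enforce such a policy at the infrastructure level alongside multifactor authentication and audit logging.

```python
# Toy sketch of least-privilege access control for model weights.
# Roles and permissions are hypothetical; real deployments would enforce this
# with identity-aware infrastructure, hardware-backed storage, and audit logs.

WEIGHT_ACCESS_POLICY = {
    "inference-service": {"load_weights"},                      # can run the model
    "training-engineer": {"load_weights", "write_checkpoints"},
    "security-auditor": {"read_access_logs"},                   # no direct weight access
}


def is_allowed(role: str, action: str) -> bool:
    """Grant an action only if the role's minimal permission set includes it."""
    return action in WEIGHT_ACCESS_POLICY.get(role, set())


def request_weights(role: str) -> None:
    if not is_allowed(role, "load_weights"):
        # Denials should be logged and alerted on as part of incident detection.
        raise PermissionError(f"Role {role!r} may not access model weights.")
    print(f"{role} granted read-only access to weights.")


request_weights("inference-service")    # allowed
# request_weights("security-auditor")   # would raise PermissionError
```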

Defending model weights likely requires heavy investments and a strong
safety culture. The measures discussed above are likely insufficient to rule out
successful attacks by highly resourced state actors looking to steal model weights.
This might require novel approaches to the design of the hardware and facilities
used to store model weights. A long-term commitment to building an organizational
safety culture (further discussed in section 4.5) is also crucial. One major challenge of infor-
mation security is to build buy-in from employees through effective communication
and building a company culture that values security. Without this, measures that
are important from a security perspective, but significantly reduce productivity and
convenience, might prove impossible to enforce or be bypassed.

Conclusions About National Governance


National governments have many tools available for advancing AI safety. Standards,
regulations, and liability could stop dangerous AIs from being deployed, while encour-
aging the development of safe AIs. Improved resilience could mitigate the damage of
dangerous deployments when they do occur, giving us more time to create safe AIs
and mitigating some risk from dangerous ones. Measures such as strong information
security could allow governments to ensure domestic AI production is both safe and
competitive. Each of these approaches has largely distinct limitations—for example,
regulations may be held back by regulatory capture, while liability might impose too
few penalties too late—so effective governance may require combining many of the
government actions discussed in this section.
With robust AI safety standards and regulations, a well-functioning legal framework
for ensuring liability, strong resilience against societal-scale risks from AIs, and mea-
sures for not being outpaced internationally by unconstrained AI developers, there
would be multiple layers of defense to protect society from reckless or malicious AI
development.

8.6 INTERNATIONAL GOVERNANCE

In this section, we will discuss the international governance of AI systems. First,
we will consider the problem of the international governance of AI. Then, we will
discuss the basics of international governance, such as its stages and techniques.
To determine what sort of international cooperation is possible and necessary, we
will discuss four key questions that are important for understanding features of the
emerging technologies we wish to regulate. Lastly, we will discuss possibilities for the
international governance of AIs, separately considering AIs made by civilians and AIs
made by militaries.
International regulation can promote global benefits and manage shared
risks. It is important to actively regulate AI internationally. Firstly, it allows
for the distribution of global benefits that advanced AIs can provide, such as im-
proved healthcare, increased efficiency in transportation, and enhanced communica-
tion. Countries can work together to ensure that AI technologies are developed and
deployed in a way that benefits humanity as a whole.
Secondly, international governance of AI is necessary to effectively manage the risks associated with its development. The risks from AI systems are not confined to the
country in which they are developed; for instance, even if an AI system is developed
in the US, it is likely to be deployed around the world, and so its impacts will be
felt worldwide. Avoiding certain risks, such as progressively weaker national regulations in the absence of international standards and arms races in which actors cut corners on safety in order to compete, requires international cooperation [450]. From the point of view of public safety, the risk of negative effects from systems developed across borders is a simple and powerful argument for international governance.

International governance of powerful technologies is challenging. In gen-
eral, international cooperation takes place through bilateral or multilateral negotia-
tions between relevant countries, or through international organizations like the UN
and its agencies. Even at the best of times and with the least contentious of issues, in-
ternational cooperation is slow, difficult, and often ineffectual. Strained international
relationships, such as frequent tensions between the US and China, make successful
international standards even less likely.
For the regulation of AI, we can draw analogies to the regulation of other dangerous
technologies such as nuclear, biological, and chemical weapons. While these analogies
are imperfect, they give us reasons to be concerned about the prospects of interna-
tional cooperation for AI. There are few convincing examples of agreements between
major powers to limit the development of technologies for which there were no mili-
tary substitutes. However, we can learn from both the failures and successes of past
regulation: by asking questions about how AI is similar to past technologies, we can
understand what form successful and unsuccessful governance might take.

8.6.1 Forms of International Governance


Before we can consider how to govern AIs internationally, we must understand the
nature of international governance. We will first consider the different stages of inter-
national governance, from recognizing a problem exists to ensuring that its governance
is effective. Then, we will consider a range of techniques used by international actors
to ensure global governance.

Stages
We can break international governance into four stages [451]. First, issues must be
recognized as requiring governance. Then, countries must come together to agree
how to govern the issue. After that, they must actually do what was agreed. Lastly,
countries’ actions must be monitored for compliance to ensure that governance is
effective into the future.
Setting agendas. The first stage of governance is agenda-setting. This involves
getting an issue recognized as a problem needing collective action. Actors have to
convince others that an issue exists and matters. Powerful entities often want to
maintain their power in the status quo, and thus deny problems or ignore their
responsibilities for dealing with them. Global governance over an issue can’t start
until it gets placed on the international agenda. Non-state actors like scientists and
advocacy groups play a key role in agenda-setting by drawing attention to issues
neglected by states; for example, scientists highlighted the emerging threat of ozone
depletion, building pressure that led to the Montreal Protocol.
Agenda-setting makes an issue a priority for collective governance. Without it, critical
problems can be overlooked due to political interests or inertia.
Policymaking. Once an issue makes the global agenda, collections of negotiat-
ing countries or international organizations often take the lead in policymaking.
Organizations like the UN facilitate formal negotiations between states to develop
policies, as seen at major summits on climate change. In other cases, organizations’
own procedures shape policies over time; for instance, export control lists in the US
like the International Traffic in Arms Regulations are expanded by regulators with-
out Congressional action. Organizations manage competing country interests to build
consensus on vague or detailed policies. For example, the International Civil Avia-
tion Organization brought countries together to agree on aircraft safety standards.
Effective policymaking by organizations converts identified issues into guidelines for
collective action. Without actors taking responsibility for driving the policy process,
implementation lacks direction.

Implementation and enforcement. After policies are made, the next stage is
implementing them. High-level policies are sometimes vague, allowing flexibility to
apply them; for example, the Chemical Weapons Convention relies on countries to
enforce bans on chemical weapons domestically in the ways they find most effective.
Governance controls how these policies are enforced; for instance, the International
Atomic Energy Agency (IAEA) conducts inspections to verify compliance with the
Treaty on the Non-Proliferation of Nuclear Weapons (NPT). Even if actors agree
on policies, acting on them takes resources, information, and coordination. Effective
implementation and enforcement through governance converts abstract rules into
concrete actions.

Evaluation, monitoring, and adjudication. The final stage of governance
is evaluating outcomes and monitoring compliance. The organization implementing
policies may perform these oversight tasks itself. But often other actors play watchdog
roles as third-party evaluators. It may be formally established who assesses compli-
ance, like the Organization for the Prohibition of Chemical Weapons (OPCW). In
other cases, evaluation is informal, with self-appointed civil society monitors. Both
insiders and outsiders frequently evaluate progress jointly. For example, the UN,
OPCW, and NGOs all track progress on chemical weapons disarmament. Clarifying
who monitors and evaluates policies is important to ensure accountability and trans-
parency. Without effective evaluation, it is difficult to learn from and improve on
governance efforts over time.

Techniques

There are many ways in which countries and other international actors govern issues
of global importance. Here, we consider six ways that past international governance
of potentially dangerous technologies has taken place, ranging from individual actors
making unilateral declarations to wide-ranging, internationally binding treaties.

Unilateral commitments. Unilateral commitments involve single actors like
countries or companies making pledges regarding their own technology development
and use. For example, President Richard Nixon terminated the US biological weapons
program and dismissed its scientists in 1969, which was instrumental in creating the
far-reaching Biological Weapons Convention in 1972. Leaders within governments
and companies can make such announcements, either about what they would do in
response to others’ actions or to place constraints on their own behavior. While not
binding on others, unilateral commitments can change others’ best responses to a
situation, as well as build confidence and set precedents to influence international
norms. They also give credibility in pushing for broader agreements. Unilateral com-
mitments lay the groundwork for wider collective action.

Norms and standards. International norms and technical standards steer behav-
ior without formal enforcement. Norms are shared expectations of proper conduct,
such as the norm of “no first use” for detonating nuclear weapons. Standards set tech-
nical best practices, like guidelines for the safe handling and storage of explosives.
Often, norms emerge through public discourse and precedent while standards arise
via expert communities. Even if they have no ability to coerce actors, strong norms
and standards shape actions nonetheless. Additionally, they make it easier to build
agreements by aligning expectations. Norms and standards are a collaborative way
to guide technology development.

Bilateral and multilateral talks. Two or more countries directly negotiate over
a variety of issues through bilateral or multilateral talks. These allow open discussion
and confidence-building between rivals, such as granting the US and USSR the ability
to negotiate over the size and composition of their nuclear arsenals during the Cold
War. Talks aim to establish understandings to avoid technology risks like arms races.
Regular talks build relationships and identify mutual interests. While non-binding
themselves, ongoing dialogue can lay the basis for making deals. Talks are essential
for compromise and consensus between nations.

Summits and forums. Summits and forums bring together diverse stakeholders
for debate and announcements. These raise visibility on issues and build political will
for action. Major powers can make joint statements signaling priorities, like setting
goals on total carbon emissions to limit the effects of global warming. Companies and
civil society organizations can announce major initiatives. Summits set milestones for
progress and mobilize public pressure.

Governance organizations. International organizations develop governance ini-
tiatives with diverse experts and resources. Organizations like the IAEA propose
principles and governance models, such as an inspection and verification paradigm
for nuclear technology. They provide neutral forums to build agreements. Their tech-
nical expertise also assists in implementation and capacity-building. While largely
voluntary, organizations lend authority to governance efforts, often by virtue of each
of their members delegating some authority to them. Their continuity sustains atten-
tion between summits. Organizations enable cooperation for long-term governance.

Treaties. Treaties offer the strongest form of governance between nations, creat-
ing obligations backed by potential punishment. Treaties have played a large role in
banning certain risky military uses of technologies, such as the 1968 Treaty on the
Non-Proliferation of Nuclear Weapons. They often contain enforcement mechanisms
like inspections. However, compromising on enforceable rules is difficult, especially
between rivals. Verifying compliance with treaties can be challenging. Still, the bind-
ing power of treaties makes them valuable despite their potential limitations.
In this section, we have considered the various stages of international governance,
moving from recognizing an issue to solving it, as well as a wide variety of different
ways that enable this. This is a large set of tools, so we will next examine four
questions that inform our understanding of which tools are most effective for AI
governance.

8.6.2 Four Questions for AI Regulation

We will now consider four questions that are important for the regulation of dangerous
emerging technologies:
1. Is the technology defense-dominant or offense-dominant?
2. Can compliance with international agreements be verified?
3. Is it catastrophic for international agreements to fail?
4. Can the production of the technology be controlled?

Each of these highlights important strategic variables that we are uncertain about.
They give us insights into the characteristics of the technology we are dealing with.
Consequently, they help us illustrate what sorts of international cooperation may be
possible and necessary.

Is the Technology Defense-Dominant or Offense-Dominant?

AI capabilities determine the need for international governance. Suppose
we could use some AIs to prevent other AIs from doing bad things. There would be
less need for an international regime to govern AIs. Instead, AI development would
prevent the harms of AI development—such technology is called defense-dominant.
By contrast, if AI is an offense-dominant technology—if AIs cannot manage other
AIs as the technological frontier advances—then we will need alternative solutions
[452]. Unfortunately, we have reason to believe that AIs will be offense-dominant:
military technologies usually are. It is difficult for AIs to protect against threats from
other AIs; an AI that can make the creation of engineered pandemics easy is much
more likely to exist than one that can comprehensively defend against pandemics. It
is usually easier to cause harm than prevent it.
Nuclear weapons are a classic example of offense-dominant technologies: when asked
how to detect a nuclear weapon smuggled into New York in a crate, Robert Oppenheimer replied "with a screwdriver," meaning that every crate would have to be opened [453]. In other words, there
was no feasible technological solution; a social solution was required. When dealing
with offense-dominant technologies, we often need to develop external social measures
to defend against them. Similarly, if AIs prove to be offense-dominant, they will also
require social solutions to protect against their potential harmful impacts. Since the
scale of the technology is global, this will likely require international governance.

Can Compliance With International Agreements Be Verified?

Verification of compliance means agreements can regulate technology. After establish-
ing rules about the development and use of AIs, we need to verify whether signatories
are following them. If it is easy to evaluate this, then it is easier to implement such
regimes internationally. In the case of nuclear weapons, nuclear tests could be veri-
fied by monitoring for large explosions. Using a nuclear weapon is impossible to do
discreetly, enabling the norm of no first use. Unfortunately, how and when AIs are being developed and used may be difficult to verify—certainly more so than
observing a nuclear explosion.
Verification is a difficult technical challenge. We lack clear methods for conduct-
ing verification, since we do not know what to test to ensure that models are safe.
Progress can be made with serious effort; for instance, investing in the development of
standard benchmarks and evaluations that promote transparency and enable shared
knowledge that other developers are using responsible development practices. This
will allow the creation of clear standards that can be verified with relative ease, with-
out compromising the confidentiality of privileged technical details. However, this
requires investment: we must make progress on our ability to verify characteristics of
AIs before we can implement effective international regulation. Until then, establish-
ing an equilibrium to deter the dangerous use of AIs may not be possible. If compute
governance (further discussed in the next section) can provide reliable approaches to
verify that other parties are complying, for example through on-chip mechanisms,
this could make it easier to make international agreements and to ensure that these
are being adhered to.

Is It Catastrophic for International Agreements to Fail?

Whether agreements failing is catastrophic changes how permissive they
can be. Consider an example of an international treaty that sets limits on the
amount of compute that organizations can use to train their AIs. If an AI trained
with more compute than the specified threshold poses a significant risk of catastrophic
consequences—just as even a single nuclear weapon can have devastating effects—
then the treaty must focus on preventing this possibility entirely. In such cases,
deviations from the agreement cannot be permitted. On the other hand, if most AIs
trained with more compute than the threshold amount pose little immediate danger,
we can adopt a more lenient approach. In this scenario, the agreement can prioritize
monitoring and disincentivizing deviations through punishments after the fact. In
general, if even one actor deviating from an agreement is dangerous, then it must
be much stricter. Of course, stricter agreements are more difficult to create. Many
international agreements present small punishments for deviation or include escape
clauses—allowing occasional exemptions from obligations—to encourage states to
sign the agreements. If agreements about the development and use of AIs cannot
contain such clauses, then it will be more difficult to create widespread agreement on
them.

Can the Production of the Technology Be Controlled?

Controlling production changes which actors are needed in order to suc-
ceed with international regulation. Suppose the US could gain complete control
of all the compute required to produce AIs. Then, the US would be able to create
and implement regulations on AI development that apply globally, since they can
withhold compute from any non-compliant actors. More generally, if a small group of
safety-conscious countries can block actors from gaining control of the factors of pro-
duction, then they can create an international regime themselves. This makes it much
easier to achieve international governance since it doesn’t require the cooperation of
many foreign actors with distinct interests.

Summary
By answering these questions, we can make important decisions about interna-
tional governance. First, we understand whether we need international governance or
whether AIs will be able to mitigate the harmful effects of other AIs. Second, we deter-
mine whether international agreements are possible, since we need to verify whether
actors are following the rules. Third, we can decide what features these agreements
might have; specifically, we can determine whether they must be extremely restrictive
to avoid catastrophes from a single deviation. Lastly, we consider who must agree to
govern AI by understanding whether or not a few countries can impose regulations
on the world. Even if some of these answers imply that we live in high-risk worlds,
they guide us toward actions that help mitigate this risk. We can now consider what
these actions might be.

8.6.3 What Can Be Included in International Agreements?

We will now consider the specific tools that might be useful for international gover-
nance of AI. We separate this discussion into regulating AIs produced by the private
sector and AIs produced by militaries, since these have different features and thus
require different controls. For civilian AIs, certification of compliance with interna-
tional standards is the key precedent to follow. For military AIs, we can turn to
non-proliferation agreements, verification schemes, or the establishment of AI mo-
nopolies.

Civilian AI
Regulating the private sector is important and tractable. Much of the
development of advanced AIs is seemingly taking place in the private sector. As a
result, ensuring that private actors do not develop or misuse harmful technologies is a
priority. Regulating civilian development and use is also likely to be more feasible than
regulating militaries, although this might be hindered by overlaps between private
and military applications of AIs (such as civilian defense contracts) and countries’
reluctance to allow an international organization access to their firms’ activities.

Certification has proven to be an effective tool. One proposal for civilian
governance involves certifying jurisdictions for having and enforcing appropriate reg-
ulation [454]. Some international organizations, such as the International Civilian
Aviation Organization (ICAO), the International Maritime Organization (IMO), and
the Financial Action Task Force (FATF), follow a similar approach. These organi-
zations leave enforcement in the hands of domestic regulators, but they check that
domestic regulators have appropriate procedures. States have incentives to comply
with the international organizations’ standards because of the ecosystems in which
they are embedded. The Federal Aviation Administration, for instance, can deny ac-
cess to US airspace to states that violate ICAO standards. In the case of AI, states
might deny access to markets and factors of production to the firms of jurisdictions
that violate international standards.

Military AI
Governing military AIs is different from governing civilian ones. Most of the options
for governing AI used by militaries can be described as drawing from one of three
regimes: nonproliferation plus norms of use, verification, or monopoly.

Option 1: Nonproliferation plus Norms of Use. The nonproliferation
regime, alongside norms against first use of nuclear weapons—along with a measure
of luck—enabled the world to survive the Cold War. This regime was centered around
the Nonproliferation Treaty (NPT), an international agreement primarily concerned
with stopping the spread of nuclear weapons, and the International Atomic Energy
Agency (IAEA), an international organization for the governing of nuclear energy.
In addition, nonproliferation was a pillar of super-power foreign policy during the
Cold War [455]. Both the US and USSR made many threats and promises, including
guarantees of assistance to third parties, to reduce the likelihood of nuclear weapons
being used anywhere. A similar international regime for AI could be similarly helpful;
for instance, it could enable countries to cooperate to avoid AI races and encourage
the development of safer AI by establishing standards and guidelines.

Nonproliferation may be insufficient. Suppose investing in AI continues to
reliably increase system capabilities—unlike with nuclear weapons, where the security
gain from additional weapons of mass destruction is low. Then, countries who already
have advanced AI will have strong incentives to continue to compete with each other.
As we have explored, increasing AI capabilities is likely to increase AI risk. This
means that nonproliferation plus norms of use might be insufficient for controlling
advanced, weaponized AIs.

Norms may be difficult to establish. With nuclear weapons, the norms of “no
first use” and “mutually assured destruction” created an equilibrium that limited
the use of nuclear weapons. With AIs, this might be more difficult for a variety of
reasons: for instance, AIs have a much broader field of capabilities (as opposed to
a nuclear weapon detonating) and AIs are already being widely used. Monitoring
or restricting the development or use of new AI systems requires deciding precisely
which capabilities are prohibited. If we cannot decide which capabilities are the most
dangerous, it is difficult to decide on a set of norms, which means we cannot rely on
norms to encourage the development of safe AIs.

Option 2: Verification. Many actors might be happy to limit their own devel-
opment of military technology if they can be certain their adversaries are doing the
same. Verification of this fact can enable countries to govern each other, thereby
avoiding arms races. The Chemical Weapons Convention, for instance, has provisions
for verifying member states’ compliance, including inspections of declared sites.
When it comes to critical military technologies, however, verification regimes might
need to be invasive; for instance, it might be necessary to have the authority to inspect
any site in a country. It is unclear how they could function in the face of resistance
from domestic authorities. For these reasons, a system which relies on inspection and
similar police-like methods might be entirely infeasible—unless all relevant actors
agree to mutually verify and govern.

Option 3: Monopoly. The final option is a monopoly over the largest-scale com-
puting processes, which removes the incentive for risk-taking by removing adversarial
competition. Such monopolies might arise in several ways. Established AI firms may
benefit from market forces like economies of scale, such as being able to attract the
best talent and invest profits from previous ventures into new R&D. Additionally,
they might have first-mover advantages from retaining customers or exercising influ-
ence over regulation. Alternatively, several actors might agree to create a monopoly:
there are proposals like “CERN for AI” which call for this explicitly [456]. Such or-
ganizations must be focused on the right mission, perhaps using tools from corporate
governance, which is a non-trivial task. If they are, however, then they present a much
easier route to safe AI than verification and international norms and agreements.

Conclusions About International Governance

International governance is important and difficult. International regu-
lation of AI is crucial to distribute its global benefits and manage associated risks.
Since AI’s impacts are not restricted to its country of development, an internationally
coordinated approach ensures that advantages, like improved healthcare, are acces-
sible worldwide, and risks, such as weak regulations or safety shortcuts, are avoided.
This approach will need to create awareness that a problem exists, create policies
that tackle it, oversee the implementation of these policies, and ensure compliance
with them through verification. It can use a wide variety of tools, including unilat-
eral declarations, talks through meetings, forums, and international organizations,
and the creation of norms, standards, and binding treaties. From experience with
other dangerous technologies, we know that this global cooperation is challenging to
achieve.

Understanding features of AI is required for effective governance. We
need to understand whether AI is an offense-dominant technology which poses sig-
nificant risks if even one powerful system is imperfect. This tells us whether we need
international cooperation on AI and, if we do, what sort of enforcement features
such agreements would require to be effective. Further, we need to know whether
we can verify whether AI development meets a comprehensive list of regulations and
standards and control its production when it does not. If this is possible, we can
conclude that international regulation of AI is possible, and might be possible with
the cooperation of just a small group of safety-conscious countries.

Understanding the landscape, we can govern civilian and military AIs.
Civilian AIs largely originate from the private sector, making certification of compli-
ance with international standards vital. Existing organizations, like the International
Civilian Aviation Organization, effectively use certification; states could restrict mar-
ket access to firms that violate international AI standards. For military AIs, options
include nonproliferation agreements plus established norms, verification of technol-
ogy development between nations, or creating monopolies on large-scale computing.
If the risks of advanced AI are as great as some fear, it is likely that the world needs
such regulation. Each proposal will require overcoming serious challenges, both social
and technical.

8.7 COMPUTE GOVERNANCE

A common shorthand for computational resources or computing power used for AI
is compute. In this section, we will discuss how compute governance might help to
enable AI governance on a national and international level. First, we will examine how
since compute is indispensable for AI development, governing it would help us govern
AI. Then, we will examine the key properties of compute that make it governable—
physicality, excludability, and quantifiability. These features make it more feasible to
track, monitor and, if appropriate, restrict the development of potentially dangerous
AI systems, and more generally facilitate the enforcement of AI governance measures.
We also consider why governing compute is more promising than governing other
factors used in AI production.

8.7.1 Compute Is Indispensable for AI Development and Deployment


Compute enables the development of more capable AIs. In addition, compute is
necessary to deploy AIs. If we restrict someone’s access to compute, they cannot
develop or deploy any AIs. As a result, we can use compute to govern AIs, determining
how and when they are deployed.

Hardware is necessary for AI systems. Like the uranium in a nuclear weapon,


compute is fundamental to running AIs. In its simplest form, we can think of compute
as a select group of high-performance chips like GPUs and TPUs that are designed
to run modern AIs. These are often the latest chips, tailor-made for AI tasks and
found in large data centers. As AI technology changes, so too will the hardware,
adapting to new tech developments, regulations, and the evolving requirements of AI
applications.

The metric FLOP/s is common across forms of compute. To measure


compute, we often use the metric FLOP/s, which is the number of floating-point
operations (such as addition or multiplication) a computer can do in a second. When
we talk about “increasing” compute, we’re referring to using more processors, using
better processors, or allowing these processors to run for extended periods, effec-
tively increasing the number of floating-point operations done in total. An analogous
escalation of this is improving a nuclear arsenal by adding more weapons to it, de-
veloping more dangerous weapons like H-bombs, or creating bigger bombs by using
more uranium.
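To make this accounting concrete, the sketch below estimates the total floating-point operations of a training run from the three levers just mentioned: the number of chips, their speed, and how long they run. All of the specific figures are illustrative assumptions rather than measurements of any particular system.

```python
# Illustrative accounting of total training compute in FLOP.
# Total FLOP ~= (number of chips) x (peak FLOP/s per chip) x (utilization) x (seconds).
# All numbers below are assumptions chosen for the sake of the example.

num_chips = 4_000                 # accelerators in the training cluster
peak_flops_per_chip = 3e14        # ~300 TFLOP/s peak, order of magnitude for a modern AI chip
utilization = 0.4                 # fraction of peak throughput actually achieved
seconds = 90 * 24 * 3600          # a 90-day training run

total_flop = num_chips * peak_flops_per_chip * utilization * seconds
print(f"Estimated training compute: {total_flop:.2e} FLOP")  # ~3.7e24 FLOP
```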

More compute allows for the development of more AI capabilities. Com-
pute plays a pivotal role in the evolution of AI capabilities. More compute means that
AI systems can be built with more parameters and effectively utilize larger datasets.
In the Artificial Intelligence Fundamentals chapter, we looked at scaling laws, which
show us that many AIs have their performance increase reliably with an increase in
model size and dataset size. Richard Sutton’s “The Bitter Lesson” states that general
methods in AI that harness computation tend to be the most effective by a signifi-
cant margin [162]. Having more compute means training AI systems that are more
capable and advanced, which means that knowing how much compute an AI uses lets
us approximate its capabilities.
Often, pushing the boundaries in AI development requires having vast compute.
AIs can require training on large supercomputers that cost hundreds of millions or
even billions of dollars. Moreover, computational demands for these AI models are
constantly intensifying, with their compute requirements doubling roughly every six
months. This growth rate surpasses the 2.5-year doubling time we see for the price-
performance of AI chips [457]. Given this trend, it’s likely that future AI models will
demand even greater investment in computational resources and, in turn, possess
greater capabilities.
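A rough projection illustrates why this gap between growth rates matters: if compute demand doubles every six months while price-performance doubles only every 2.5 years, the dollar cost of frontier training runs climbs quickly. The starting cost below is an arbitrary illustrative assumption.

```python
# Projecting frontier training cost when compute demand grows faster than
# chip price-performance. Doubling times follow the trends cited in the text;
# the $100M starting cost is an arbitrary illustrative assumption.

compute_doubling_years = 0.5        # training compute doubles roughly every 6 months
price_perf_doubling_years = 2.5     # FLOP per dollar doubles roughly every 2.5 years
start_cost = 100e6                  # hypothetical cost of a frontier run today

for years in (2, 4, 6):
    compute_growth = 2 ** (years / compute_doubling_years)
    cost_per_flop_drop = 2 ** (years / price_perf_doubling_years)
    cost = start_cost * compute_growth / cost_per_flop_drop
    print(f"After {years} years: ~${cost / 1e9:.1f}B")  # ~0.9B, 8.4B, 77.6B
```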

More compute enables better results. Compute is not only essential in training
AI-—it is also necessary to run powerful AI models effectively. Just as we rely on our
brains to think and make decisions even after we’ve learned, AI models need compute
to process information and execute tasks even after training. If developers have access
to more compute, they can run bigger models. Since bigger models usually yield better
results, having more compute can enable better results.

Large-scale compute isn’t a strict requirement for all future AI applica-


tions. AI efficiency research aims to reduce compute requirements while preserv-
ing performance by improving other factors like algorithmic efficiency. If algorithms
become so refined that high-capability systems can train on less powerful devices,
compute’s significance for governance might diminish. Some existing systems like AI-
powered drug discovery tools do not require much compute, but it has been demon-
strated that they can be repurposed to create chemical weapons [13].
Additionally, there’s continued interest toward creating efficient AI models capable of
running on everyday devices such as smartphones and laptops. Though projections
from current trends suggest it will be decades before data center-bound systems
like GPT-4 could train on a basic GPU [164], the shift toward greater efficiency
might speed up dramatically with just a few breakthroughs. If AI models require less
compute, especially to the point that they become commonplace on consumer devices,
regulating AI systems based on compute access might not be the most effective
approach.

8.7.2 Compute Is Physical, Excludable, and Quantifiable

To produce AI, developers need three primary factors: data, algorithms, and compute.
In this section, we will explore why governing compute appears to be a more promising
avenue than governing the other factors. A resource is governable when the entity
with legitimate claims to control it—such as a government—has the ability to control
and direct it. Compute is governable because
1. It can be determined who has access to compute and how they utilize it.
2. It is possible to establish and enforce specific rules about compute.

These are true because compute is physical, excludable, and quantifiable. These char-
acteristics allow us to govern compute, making it a potential point of control in the
broader domain of AI governance. We will now consider each of these in turn.

Compute Is Physical

The first key characteristic that makes compute governable is its physicality. Compute
is physical, unlike datasets, which are virtual, or algorithms, which are intangible
ideas. This makes compute rivalrous and enables tracking and monitoring, both of
which are crucial to governance.

Since compute is physical, it is rivalrous. Compute is rivalrous: it cannot be
used by multiple entities simultaneously. This is unlike other factors in AI production,
such as algorithms which can be used by anyone who knows them or data which can
be acquired from the same source or even stolen or copied and used simultaneously
(although this may be difficult due to information security and the size of training
datasets). Because compute cannot be simultaneously accessed by multiple users
or easily duplicated, regulators can be confident that when it is being used by an
approved entity, it is not also being used by someone else. This makes it easier to
regulate and control the use of compute. GPUs can’t be downloaded but instead must
be fabricated, purchased, and shipped.

Since compute is physical, it is trackable. Compute is trackable, from chip
fabrication to its use in data centers. This is because compute is tangible and often
sizable: Figure 8.2 shows a cutting-edge semiconductor tool used as compute that
costs $200 million and requires 40 freight containers, 20 trucks, and 3 cargo planes
to ship anywhere.

Figure 8.2. Advanced semiconductor manufacturing tools (such as the ASML Twinscan NXE) are large, highly specialized machines [458].

Unlike uranium, which is difficult to procure but not impossible to steal and transport small amounts of, acquiring large-scale compute requires the investment of resources in a relatively complicated and visible process. Stakeholders, whether semiconductor firms, regulatory bodies, or other involved entities, can accurately evaluate and trace the overall quantity of these assets. For instance, if we monitor the sales and distribution of chips, we know who possesses which computational capacities and their intended applications. The complete supply chain, from the semiconductor origins to the extensive data centers harboring vast AI computational power, can be monitored, which means it can be governed. By contrast, data acquisition and algorithmic improvements can be done discreetly: possession of these within a computing infrastructure can be concealed more easily than the possession of the infrastructure itself.
Compute Is Excludable

The second key characteristic that makes compute governable is its excludability.
Something is excludable if it is feasible to stop others from using it. Most privately
produced goods like automobiles are excludable whereas others, such as clean air or
street lighting, are difficult to exclude people from consuming even if a government
or company doesn’t want to let them use it. Compute is excludable because a few
entities, such as the US and the EU, can control crucial parts of its supply chain.
This means that these actors can monitor and prevent others from using compute.

The compute supply chain makes monitoring easier. In 2023, the vast
majority of advanced AI chips globally are crafted by a single firm, Taiwan Semi-
conductor Manufacturing Company (TSMC). These chips are based on designs from
a few major companies, such as Nvidia and Google, and TSMC’s production pro-
cesses rely on photolithography machines from a similarly monopolistic industry led
by ASML [459]. Entities such as the US and EU can, therefore, regulate these com-
panies to control the supply of compute—if the supply chain dynamics do not change
dramatically over time [460]. This simplifies the tracking of frontier AI chips and
enforcing of regulatory guidelines; it’s what made the US export ban of cutting-edge
AI chips to China in 2022 feasible. This example illustrates that these chips can be
governed. By contrast, data can be purchased from anywhere or found online, and
algorithmic advances are not excludable either, especially given the open science and
collaborative norms in the AI community.
Frequent chip replacement means governance takes effect quickly. The
price performance of AI chips is increasing exponentially. With new chips frequently
making recent products obsolete, compute becomes more excludable. Historical trends
show that GPUs double their price performance approximately every 2.5 years [164].
In conjunction with the rapidly increasing demand for more compute, data centers
frequently refresh their chips and purchase vast quantities of new compute regularly
to retain competitiveness. This frequent chip turnover offers a significant window for
governance since regulations on new chips will be relevant quickly.

Compute Is Quantifiable

The third key characteristic that makes compute governable is its quantifiability.
Quantifiability refers to the ability to measure and compare both the quantity and
quality of resources. Metrics such as FLOP/s serve as a yardstick for comparing
computational capabilities across different entities. If a developer has more chips of
the same type, we can accurately deduce that they have access to more compute,
which means we can use compute to set thresholds and monitor compliance.
Quantifiability facilitates clear threshold setting. While chips and other
forms of compute differ in many ways, they can all be quantified in FLOP/s. This
allows regulators to determine how important it is to regulate a model that is being
developed: models that use large amounts of compute are likely more important to
regulate. Suppose a regulator aims to regulate new models that are large and highly
capable. A simple way to do this is to set a FLOP/s threshold, above which more
regulations, permissions, and scrutiny take effect. By contrast, setting a threshold on
dataset size is less meaningful: quality of data varies enough that 25 GB of data could
contain all the text in Wikipedia or one high-definition photo album. Even worse,
algorithms are difficult to quantify at all.
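As an illustration of how such a threshold could be operationalized, the sketch below estimates training compute with the common rough approximation of six floating-point operations per parameter per training token and compares it to a reporting threshold; the 1e26 FLOP figure is used here purely as an illustrative cutoff, not a statement of any current rule.

```python
# Sketch of a compute-threshold check a regulator might apply.
# Training FLOP is estimated with the common rough approximation
# 6 x (parameters) x (training tokens); the 1e26 FLOP cutoff is an
# illustrative assumption.

REPORTING_THRESHOLD_FLOP = 1e26


def estimated_training_flop(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens


def requires_extra_scrutiny(n_params: float, n_tokens: float) -> bool:
    return estimated_training_flop(n_params, n_tokens) >= REPORTING_THRESHOLD_FLOP


# A hypothetical 500B-parameter model trained on 20T tokens:
print(estimated_training_flop(5e11, 2e13))    # 6e25 FLOP
print(requires_extra_scrutiny(5e11, 2e13))    # False: below the illustrative threshold
```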
Quantifiability is key to monitoring compliance. Beyond the creation of
thresholds, quantifiability also helps us monitor compliance. Given the physical na-
ture and finite capacity of compute, we can tell whether an actor has sufficient com-
putational power from the type and quantity of chips they possess. A regulator might
require organizations with at least 1000 chips at least as good as A100s to submit
themselves for additional auditing processes. A higher number of chips directly cor-
relates to more substantial computational capabilities, unlike with algorithms, which have no comparable metric, or data, for which metrics are much less precise. In
addition to compute being physical and so traceable, this enables the enforcement of
rules and thresholds.
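A minimal sketch of how the hypothetical chip-count rule above could be checked against a declared inventory follows; the per-chip performance figures are approximate placeholders.

```python
# Sketch of evaluating a chip-count audit rule over a declared inventory.
# Performance figures (approximate dense FP16 TFLOP/s) are rough placeholders;
# the rule mirrors the hypothetical "1,000 chips at least as capable as an A100" example.

A100_TFLOPS = 312
CHIP_TFLOPS = {"A100": 312, "H100": 990, "V100": 125}

inventory = {"H100": 600, "A100": 500, "V100": 3000}   # declared chip counts


def qualifying_chips(declared: dict[str, int]) -> int:
    """Count declared chips at least as capable as an A100."""
    return sum(
        count for chip, count in declared.items()
        if CHIP_TFLOPS.get(chip, 0) >= A100_TFLOPS
    )


print(qualifying_chips(inventory))          # 1100
print(qualifying_chips(inventory) >= 1000)  # True: triggers additional auditing
```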

Forms of Compute Governance

Compute governance can focus on individual chips or on clusters of such
chips. As alluded to previously, GPUs used for training and running AI systems are
typically housed in purpose-built data center facilities which are equipped to handle
their high demands for power and cooling. These data centers usually also have high
levels of physical security to protect this valuable hardware and the AI models that
are run on it. One approach to monitoring compute, imperfect but requiring little new
technology or regulation, is to track these data center facilities. However, this approach
faces challenges: identifying which facilities house chips used for AI training as opposed
to other computing hardware, and determining how many chips they house and what
those chips are being used for. Other
approaches rely more heavily on tracking the individual chips suitable for AI training
and inference, for example via some form of registry of chip sales, or via “on-chip
mechanisms.”

Providers of cloud services can support compute governance. Data centers
can be owned and operated by the companies developing AI, but more commonly AI
developers rent access to hardware from cloud computing providers, which own large
pools of computing hardware and rent it out to third parties. These cloud providers
could be required to support relevant authorities with
compute governance. For example, they could implement know-your-customer (KYC)
policies to ensure that AI developers subject to export controls or other sanctions are
not able to use cloud providers’ hardware to train powerful AI systems.
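As a minimal, hypothetical sketch of what such a KYC check could look like in practice, the code below screens a rental request against a denied-party list, an export-controlled destination list, and an allocation-size trigger. The customer fields, lists, and threshold are invented for illustration; real screening would rely on official denied-party lists and legal review.

```python
# Minimal, hypothetical sketch of a cloud provider's pre-provisioning KYC check.
# The customer fields, lists, and thresholds are invented for illustration.

from dataclasses import dataclass

@dataclass
class Customer:
    name: str
    country: str
    identity_verified: bool
    requested_gpus: int

SANCTIONED_ENTITIES = {"Example Sanctioned Lab"}     # placeholder list
EXPORT_CONTROLLED_DESTINATIONS = {"CountryX"}        # placeholder list
LARGE_ALLOCATION_THRESHOLD = 256                     # GPUs; triggers extra review

def provisioning_decision(customer: Customer) -> str:
    """Return 'deny', 'review', or 'approve' for a GPU rental request."""
    if not customer.identity_verified:
        return "deny"    # KYC: identity must be verified before any allocation
    if customer.name in SANCTIONED_ENTITIES:
        return "deny"    # sanctioned entities cannot rent compute
    if customer.country in EXPORT_CONTROLLED_DESTINATIONS:
        return "review"  # export controls may restrict allocations to this destination
    if customer.requested_gpus >= LARGE_ALLOCATION_THRESHOLD:
        return "review"  # very large allocations receive additional scrutiny
    return "approve"

print(provisioning_decision(Customer("Acme AI", "CountryY", True, 64)))     # approve
print(provisioning_decision(Customer("Acme AI", "CountryX", True, 1024)))   # review
```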

On-chip mechanisms. On-chip governance mechanisms have been proposed as
a way to make monitoring and verification schemes for compute more robust to
attempts at circumventing them. These could take the form of secure modules imple-
mented within the chip’s hardware and firmware that make it easier for authorized
parties such as regulators to detect the location of chips suitable for training power-
ful AI systems. Such features do not exist today, but could likely be implemented in
forms that leverage existing secure hardware features and do not require re-designing
chips or compromising users’ privacy by monitoring what software is being run on the
chips. More ambitious forms of compute governance might aim to provide informa-
tion on whether chips are actually being used to train large AI systems. This could
be done at the hardware level by monitoring whether GPUs are being interconnected
into the large clusters of chips required for such purposes.
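Since no such on-chip reporting scheme exists today, the following sketch is purely hypothetical. It imagines a registry that aggregates reports attested by on-chip mechanisms and flags facilities where unusually many AI chips appear to be interconnected; the report format and the cluster-size threshold are invented for illustration.

```python
# Hypothetical sketch: a registry that aggregates attestation reports from
# on-chip mechanisms and flags facilities operating very large chip clusters.
# No such reporting scheme exists today; fields and thresholds are illustrative.

from collections import defaultdict

CLUSTER_SIZE_THRESHOLD = 1000   # flag facilities interconnecting this many AI chips

def flag_large_clusters(reports):
    """Group chip reports by the facility they attest to and return facilities
    whose number of distinct reporting chips meets the threshold."""
    chips_per_facility = defaultdict(set)
    for report in reports:
        chips_per_facility[report["facility_id"]].add(report["chip_id"])
    return [facility for facility, chips in chips_per_facility.items()
            if len(chips) >= CLUSTER_SIZE_THRESHOLD]

# Synthetic example: one large training cluster and one small deployment site.
reports = [{"chip_id": f"chip-a-{i}", "facility_id": "datacenter-A"} for i in range(1500)]
reports += [{"chip_id": f"chip-b-{i}", "facility_id": "datacenter-B"} for i in range(200)]
print(flag_large_clusters(reports))   # ['datacenter-A']
```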

Conclusions About Compute Governance

Compute governance is a promising route to AI governance. Compute is
necessary for the development and deployment of AIs, as well as being well-suited to
governance. Relative to governing algorithms and datasets, the other factors used to
produce AIs, governing compute is promising because it is physical, excludable, and
quantifiable.
These features are useful for national governments, as they make it more feasible to
monitor the development and deployment of potentially dangerous AI systems. This
is a precondition for effective enforcement of AI regulations. Similarly, compute’s fea-
tures can support verification of compliance with national or international standards
and agreements. This makes it easier to implement regulatory regimes, particularly
at an international level.
Compute governance is currently made simpler by certain factors, such as the controllability
of the supply chain and the requirement of large-scale compute for highly capable
models. If the supply chain of hardware for training and running AI systems became
much less concentrated, this would make it harder to enforce potential restrictions
on access to relevant hardware as a governance mechanism. Similarly, if the required
compute resources to train dangerous AI systems were within the means of a small
business or even an individual, this would dramatically increase the challenge of
monitoring relevant compute resources as part of AI governance.

8.8 CONCLUSION

In the introduction, we laid out the purpose of this chapter: understanding the fun-
damentals of how to govern AI. In other words, we wanted to understand how to
organize, manage, and steer the development and deployment of AI technologies us-
ing an array of tools including norms, policies, and institutions. To set the scene, we
surveyed the actors and tools that governance needs to take into account.

Growth. We explored how much AI might accelerate economic growth. AI has the
potential to significantly boost economic growth by augmenting the workforce, im-
proving labor efficiency, and accelerating technological progress. However, the extent
of this impact is debated, with some predicting explosive growth while others believe
it will be tempered by social and economic factors. The semi-endogenous growth
theory suggests that population growth, by expanding the labor force and fostering
innovation, has historically driven economic acceleration. Similarly, AI could enhance
economic output by substituting for human labor and self-improving, creating a posi-
tive feedback loop. Nonetheless, constraints such as limited physical resources, dimin-
ishing returns on research and development, gradual technology adoption, regulatory
measures, and tasks that resist automation could moderate the growth induced by
AI. Therefore, while AI’s contribution to economic growth is likely to be significant,
whether it will result in unprecedented expansion or face limitations remains uncer-
tain.

Distribution. We then explored three key dimensions of the distribution of advanced
AI systems: benefits and costs of AI, access to AI, and power of AI systems.
Equitable distribution of the economic and social impacts of AI will be crucial to
ensure that productivity gains are shared broadly rather than accruing only to a
small group like AI developers and investors. There are conflicting considerations
with regard to access to AI, which make it unclear whether availability should be
tightly restricted or widely open to the general public. Limiting access risks misuse
by powerful groups whereas open access risks misuse by malicious actors. In terms
of distributing power among AI systems, concentrating capabilities and control in a
single AI or small group of AIs poses risks like permanently locking in certain values
or goals. However, distributing power more widely among a large, diverse ecosystem
of AIs also has downsides, like increasing the potential for misuse or making it more
difficult to correct AI systems that begin behaving undesirably.

Corporate Governance. Turning to the various stakeholders that will shape AI
governance, we discussed how aspects of corporate structure and governance like
legal form, ownership models, policies, practices, and assurance mechanisms can help
steer technology companies away from solely maximizing profits and shareholder
value. Instead, they can guide corporate AI work in directions that prioritize broader
societal interests like safety, fairness, privacy, and ethics. However, achieving this
through corporate governance alone may prove challenging, making complementary
approaches at other levels vital.

National Governance. For governance at the national level, we explored policy
tools governments can use to align AI development with public interests, both in
the public and private sectors. These included safety regulations, liability rules that
make AI developers internalize potential damages, investments to improve societal
resilience against AI risks, and measures for maintaining national competitiveness in
AI while still ensuring domestic safety. Combinations of these policy mechanisms can
help nations steer AI progress in their jurisdictions toward beneficial ends.

International Governance. At the international level, governance of AI systems
is made challenging by issues like verifying adherence to agreements. However, inter-
national cooperation is essential for managing risks from AI and distributing benefits
globally. Approaches like international safety standards for civilian AI applications,
agreements to limit military uses of AI, and proposals for concentrating advanced AI
development within select transnational groups may all help promote global flour-
ishing. A lack of any meaningful international governance could lead to a dangerous
spiral of competitive dynamics and erosion of safety standards.

Compute Governance. We explored how governing access to and use of the com-
puting resources that enable AI development could provide an important lever for
influencing the trajectory of AI progress. Compute is an indispensable input to devel-
oping advanced AI capabilities. It also has properties like physicality, excludability,
and quantifiability that make governing it more feasible than other inputs like data
and algorithms. Compute governance can allow control over who is granted access to
what levels of computational capability, and thus over who can create advanced AIs. It
also facilitates setting and enforcing safety standards for how compute can be used,
enabling the steering of AI development.

Conclusion. This chapter provides an overview of a diverse selection of governance
solutions spanning from policies within technology firms to agreements between na-
tions in global institutions. The arrival of transformative AI systems will require
thoughtful governance at multiple levels in order to steer uncertain technological
trajectories in broadly beneficial directions aligned with humanity’s overarching in-
terests. The deliberate implementation of policies, incentives, and oversight will be
essential to realizing the potential of AI to improve human civilization.

Go to the book's website www.aisafetybook.com for additional content, further
education resources such as videos and quizzes, and suggestions on how to get involved
and contribute to mitigating societal-scale risks from AI.

8.9 LITERATURE

8.9.1 Recommended Reading


• Jaime Sevilla et al. Compute Trends Across Three Eras of Machine Learning. 2022. arXiv: 2202.05924 [cs.LG]
• Ege Erdil and Tamay Besiroglu. Explosive growth from AI automation: A review of the arguments. Papers 2309.11690. arXiv.org, 2023. url: https://ideas.repec.org/p/arx/papers/2309.11690.html
• Saif M. Khan. AI Chips: What They Are and Why They Matter. 2020. url: https://doi.org/10.51593/20190014
• Yonadav Shavit. What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring. 2023. arXiv: 2303.11341 [cs.LG]
• Lewis Ho et al. International Institutions for Advanced AI. 2023. arXiv: 2307.04699 [cs.CY]
• Robert Trager et al. “International Governance of Civilian AI: A Jurisdictional Certification Approach”. In: Risk, Regulation, & Policy eJournal (2023). url: http://dx.doi.org/10.2139/ssrn.4579899
• Matthijs Maas. Advanced AI Governance. Tech. rep. https://docs.google.com/document/d/1pwwNHvNeJneBA2t2xaP31lVv1lSpa36w8kdryoS5768/edit. Legal Priorities Project, 2023
Acknowledgments

Completing a book of this scale on my own would have required spending several
years on it, which would have significantly reduced the book's value, given how urgent
it is for us to understand and address important AI risks. I received invaluable support
during the research, writing, and editing process from various assistants, collaborators,
and reviewers, both at CAIS and externally, which enabled me to complete this book
in a more timely way.
I would like to acknowledge the major contributions to the book’s chapters made by
the following people:
1. Overview of Catastrophic AI Risks: Mantas Mazeika and Thomas Woodside
2. Artificial Intelligence Fundamentals: Anna Swanson, Jeremy Hadfield, Adam
Elwood, and Corin Katzke
3. Single-Agent Safety: Jesse Hoogland, Adam Khoja, Thomas Woodside, Abra
Ganz, Aidan O’Gara, Spencer Becker-Kahn, Jeremy Hadfield, and Joshua Clymer
4. Safety Engineering: Laura Hiscott and Suryansh Mehta
5. Complex Systems: Laura Hiscott and Max Heitmann
6. Beneficial AI and Machine Ethics: Suryansh Mehta, Shankara Srikantan,
Aron Vallinder, Jeremy Hadfield, and Toby Tremlett
7. Collective Action Problems: Ivo Andrews, Sasha Cadariu, Avital Morris, and
David Lambert
8. Governance: Jonas Schuett, Robert Trager, Lennart Heim, Matthew Barnett,
Mauricio Baker, Thomas Woodside, Suryansh Mehta, Laura Hiscott, and Shankara
Srikantan
9. Appendix—Normative Ethics: Toby Tremlett and Suryansh Mehta
10. Appendix—Utility Functions: Suryansh Mehta and Shankara Srikantan

I am deeply grateful to the following key individuals for their help in driving this
project forward:
• Jay Kim and William Hodgkins (Project Managers)
• Rebecca Rothwell and Suryansh Mehta (Editors)
• Corin Katzke (Course Manager)

I would also like to thank Mantas Mazeika, Paul Salmon, Matthew Lutz, Bryan
Daniels, Martin Stoffel, Charlotte Siegmann, Lena Trabucco, Casey Mahoney, the
2023 CAIS philosophy fellows, and the participants in our AI Safety Sprint Course
for their helpful feedback during the writing and revision process. All errors that
remain are, of course, my own.

Dan Hendrycks
References

Introduction
[1] Kate Crawford. Atlas of AI : power, politics, and the planetary costs of
artificial intelligence. eng. New Haven: Yale University Press, 2022. isbn:
9780300209570.
[2] Laura Weidinger et al. “Ethical and social risks of harm from Language Mod-
els”. In: arXiv preprint arXiv:2112.04359 (2021).

Overview of Catastrophic AI Risks


[3] David Malin Roodman. On the probability distribution of long-term changes
in the growth rate of the global economy: An outside view. 2020.
[4] Tom Davidson. Could Advanced AI Drive Explosive Economic Growth? Tech.
rep. June 2021.
[5] Carl Sagan. Pale Blue Dot: A Vision of the Human Future in Space. New
York: Random House, 1994.
[6] Roman V. Yampolskiy. “Taxonomy of Pathways to Dangerous Artificial Intel-
ligence”. In: AAAI Workshop: AI, Ethics, and Society. 2016.
[7] Keith Olson. “Aum Shinrikyo: once and future threat?”. In: Emerging Infec-
tious Diseases 5 (1999), pp. 513–516.
[8] Kevin M. Esvelt. Delay, Detect, Defend: Preparing for a Future in which Thou-
sands Can Release New Pandemics. 2022.
[9] Siro Igino Trevisanato. “The ’Hittite plague’, an epidemic of tularemia and
the first record of biological warfare”. In: Medical Hypotheses 69 6 (2007),
pp. 1371–4.
[10] U.S. Department of State. Adherence to and Compliance with Arms Control,
Nonproliferation, and Disarmament Agreements and Commitments. Govern-
ment Report. U.S. Department of State, Apr. 2022.
[11] Robert Carlson. “The changing economics of DNA synthesis”. en. In: Nature
Biotechnology 27.12 (Dec. 2009). Number: 12; Publisher: Nature Publishing
Group, pp. 1091–1094.
[12] Sarah R. Carter, Jaime M. Yassif, and Chris Isaac. Benchtop DNA Synthe-
sis Devices: Capabilities, Biosecurity Implications, and Governance. Report.
Nuclear Threat Initiative, 2023.


[13] Fabio Urbina et al. “Dual use of artificial-intelligence-powered drug discovery”. In: Nature Machine Intelligence 4 (2022), pp. 189–191.
[14] John Jumper et al. “Highly accurate protein structure prediction with Al-
phaFold”. In: Nature 596.7873 (2021), pp. 583–589.
[15] Zachary Wu et al. “Machine learning-assisted directed protein evolution with
combinatorial libraries”. In: Proceedings of the National Academy of Sciences
116.18 (2019), pp. 8852–8858.
[16] Emily Soice et al. “Can large language models democratize access to dual-use
biotechnology?”. In: 2023.
[17] Max Tegmark. Life 3.0: Being human in the age of artificial intelligence. Vin-
tage, 2018.
[18] Leanne Pooley. We Need To Talk About A.I. 2020.
[19] Richard Sutton [@RichardSSutton]. It will be the greatest intellectual achieve-
ment of all time. An achievement of science, of engineering, and of the hu-
manities, whose significance is beyond humanity, beyond life, beyond good and
bad. en. Tweet. Sept. 2022.
[20] Richard Sutton. AI Succession. Video. Sept. 2023.
[21] A. Sanz-García et al. “Prevalence of Psychopathy in the General Adult Popu-
lation: A Systematic Review and Meta-Analysis”. In: Frontiers in Psychology
12 (2021).
[22] U.S. Department of State Office of The Historian. U.S. Diplomacy and Yellow
Journalism, 1895–1898.
[23] Onur Varol et al. “Online Human-Bot Interactions: Detection, Estimation,
and Characterization”. In: ArXiv abs/1703.03107 (2017).
[24] Matthew Burtell and Thomas Woodside. “Artificial Influence: An Analysis Of
AI-Driven Persuasion”. In: ArXiv abs/2303.08721 (2023).
[25] Anna Tong. “What happens when your AI chatbot stops loving you back?”.
In: Reuters (Mar. 2023).
[26] Pierre-François Lovens. “Sans ces conversations avec le chatbot Eliza, mon
mari serait toujours là”. In: La Libre (Mar. 2023).
[27] Cristian Vaccari and Andrew Chadwick. “Deepfakes and Disinformation: Ex-
ploring the Impact of Synthetic Political Video on Deception, Uncertainty,
and Trust in News”. In: Social Media + Society 6 (2020).
[28] Moin Nadeem, Anna Bethke, and Siva Reddy. “StereoSet: Measuring stereo-
typical bias in pretrained language models”. In: Proceedings of the 59th An-
nual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1:
Long Papers). Online: Association for Computational Linguistics, Aug. 2021,
pp. 5356–5371.

[29] Evan G. Williams. “The Possibility of an Ongoing Moral Catastrophe”. en. In: Ethical Theory and Moral Practice 18.5 (Nov. 2015), pp. 971–982.
[30] Paul Scharre. Army of None: Autonomous Weapons and The Future of War.
Norton, 2018.
[31] DARPA. AlphaDogfight Trials Foreshadow Future of Human-Machine Sym-
biosis. 2020.
[32] Panel of Experts on Libya. Letter dated 8 March 2021 from the Panel of
Experts on Libya established pursuant to resolution 1973 (2011) addressed
to the President of the Security Council. United Nations Security Council
Document S/2021/229. United Nations, Mar. 2021.
[33] David Hambling. Israel used world’s first AI-guided combat drone swarm in
Gaza attacks. 2021.
[34] Zachary Kallenborn. Applying arms-control frameworks to autonomous
weapons. en-US. Oct. 2021.
[35] J.E. Mueller. War, Presidents, and Public Opinion. UPA book. University
Press of America, 1985.
[36] Matteo E. Bonfanti. “Artificial intelligence and the offense–defense balance
in cyber security”. In: Cyber Security Politics: Socio-Technological Transfor-
mations and Political Fragmentation. Ed. by M.D. Cavelty and A. Wenger.
CSS Studies in Security and International Relations. Taylor & Francis, 2022.
Chap. 5, pp. 64–79.
[37] Yisroel Mirsky et al. “The Threat of Offensive AI to Organizations”. In: Com-
puters & Security (2023).
[38] Kim Zetter. “Meet MonsterMind, the NSA Bot That Could Wage Cyberwar
Autonomously”. In: Wired (Aug. 2014).
[39] Andrei Kirilenko et al. “The Flash Crash: High-Frequency Trading in an Elec-
tronic Market”. In: The Journal of Finance 72.3 (2017), pp. 967–998.
[40] Michael C. Horowitz. The Diffusion of Military Power: Causes and Conse-
quences for International Politics. Princeton University Press, 2010.
[41] Robert E. Jervis. “Cooperation under the Security Dilemma”. In: World Pol-
itics 30 (1978), pp. 167–214.
[42] Richard Danzig. Technology Roulette: Managing Loss of Control as Many Mili-
taries Pursue Technological Superiority. Tech. rep. Center for a New American
Security, June 2018.
[43] Billy Perrigo. Bing’s AI Is Threatening Users. That’s No Laughing Matter. en.
Feb. 2023.
[44] Nico Grant and Karen Weise. “In A.I. Race, Microsoft and Google Choose
Speed Over Caution”. en-US. In: The New York Times (Apr. 2023).
[45] Thomas H. Klier. “From Tail Fins to Hybrids: How Detroit Lost Its Dominance
of the U.S. Auto Market”. In: RePEc (May 2009).

[46] Robert Sherefkin. “Ford 100: Defective Pinto Almost Took Ford’s Reputation
With It”. In: Automotive News (June 2003).
[47] Lee Strobel. Reckless Homicide?: Ford’s Pinto Trial. en. And Books, 1980.
[48] Grimshaw v. Ford Motor Co. May 1981.
[49] Paul C. Judge. “Selling Autos by Selling Safety”. en-US. In: The New York
Times (Jan. 1990).
[50] Theo Leggett. “737 Max crashes: Boeing says not guilty to fraud charge”.
en-GB. In: BBC News (Jan. 2023).
[51] Edward Broughton. “The Bhopal disaster and its aftermath: a review”. In:
Environmental Health 4.1 (May 2005), p. 6.
[52] Charlotte Curtis. “Machines vs. Workers”. en-US. In: The New York Times
(Feb. 1983).
[53] Thomas Woodside et al. “Examples of AI Improving AI”. 2023. url: https:
//ai-improving-ai.safe.ai.
[54] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of
Control. en. Penguin, Oct. 2019.
[55] Dan Hendrycks. “Natural Selection Favors AIs over Humans”. In: ArXiv
abs/2303.16200 (2023).
[56] Dan Hendrycks. The Darwinian Argument for Worrying About AI. 2023.
[57] Richard C. Lewontin. “The Units of Selection”. In: Annual Review of Ecology,
Evolution, and Systematics 1 (1970), pp. 1–18.
[58] Ethan Kross et al. “Facebook use predicts declines in subjective well-being in
young adults”. In: PLOS One (2013).
[59] Laura Martínez-Íñigo et al. “Intercommunity interactions and killings in cen-
tral chimpanzees (Pan troglodytes troglodytes) from Loango National Park,
Gabon”. In: Primates; Journal of Primatology 62 (2021), pp. 709–722.
[60] Anne E. Pusey and Craig Packer. “Infanticide in Lions: Consequences and
Counterstrategies”. In: Infanticide and Parental Care (1994), p. 277.
[61] Peter D. Nagy and Judit Pogany. “The dependence of viral RNA replication on
co-opted host factors”. In: Nature Reviews. Microbiology 10 (2011), pp. 137–
149.
[62] Alfred Buschinger. “Social Parasitism among Ants: A Review”. In: Myrmeco-
logical News 12 (Sept. 2009), pp. 219–235.
[63] Greg Brockman, Ilya Sutskever, and OpenAI. Introducing OpenAI. Dec. 2015.
[64] Devin Coldewey. OpenAI shifts from nonprofit to ‘capped-profit’ to attract
capital. Mar. 2019.
[65] Kyle Wiggers, Devin Coldewey, and Manish Singh. Anthropic’s $5B, 4-year
plan to take on OpenAI. Apr. 2023.

[66] Center for AI Safety. Statement on AI Risk (“Mitigating the risk of extinction
from AI should be a global priority alongside other societal-scale risks such as
pandemics and nuclear war.”) 2023. url: https://www.safe.ai/statement-on
-ai-risk.
[67] Richard Danzig et al. Aum Shinrikyo: Insights into How Terrorists Develop
Biological and Chemical Weapons. Tech. rep. Center for a New American Se-
curity, 2012. url: https://www.jstor.org/stable/resrep06323.
[68] John Uri. 35 Years Ago: Remembering Challenger and Her Crew. und. Text.
Jan. 2021.
[69] International Atomic Energy Agency. The Chernobyl Accident: Updating of
INSAG-1. Technical Report INSAG-7. Vienna, Austria: International Atomic
Energy Agency, 1992.
[70] Matthew Meselson et al. “The Sverdlovsk anthrax outbreak of 1979”. In: Sci-
ence 266 5188 (1994), pp. 1202–8.
[71] Daniel M. Ziegler et al. “Fine-tuning language models from human prefer-
ences”. In: arXiv preprint arXiv:1909.08593 (2019).
[72] Charles Perrow. Normal Accidents: Living with High-Risk Technologies.
Princeton, NJ: Princeton University Press, 1984.
[73] Mitchell Rogovin and George T. Frampton Jr. Three Mile Island: a re-
port to the commissioners and to the public. Volume I. English. Tech. rep.
NUREG/CR-1250(Vol.1). Nuclear Regulatory Commission, Washington, DC
(United States). Three Mile Island Special Inquiry Group, Jan. 1979.
[74] Richard Rhodes. The Making of the Atomic Bomb. New York: Simon & Schus-
ter, 1986.
[75] Sébastien Bubeck et al. “Sparks of Artificial General Intelligence: Early ex-
periments with GPT-4”. In: ArXiv abs/2303.12712 (2023).
[76] Theodore I. Lidsky and Jay S. Schneider. “Lead neurotoxicity in children:
basic mechanisms and clinical correlates”. In: Brain: a Journal of Neurology
126 Pt 1 (2003), pp. 5–19.
[77] Brooke T. Mossman et al. “Asbestos: scientific developments and implications
for public policy”. In: Science 247 4940 (1990), pp. 294–301.
[78] Kate Moore. The Radium Girls: The Dark Story of America’s Shining Women.
Naperville, IL: Sourcebooks, 2017.
[79] Stephen S. Hecht. “Tobacco smoke carcinogens and lung cancer”. In: Journal
of the National Cancer Institute 91 14 (1999), pp. 1194–210.
[80] Mario J. Molina and F. Sherwood Rowland. “Stratospheric sink for chloroflu-
oromethanes: chlorine atom-catalysed destruction of ozone”. In: Nature 249
(1974), pp. 810–812.
[81] James H. Kim and Anthony R. Scialli. “Thalidomide: the tragedy of birth
defects and the effective treatment of disease”. In: Toxicological Sciences: an
Official Journal of the Society of Toxicology 122 1 (2011), pp. 1–6.

[82] Betul Keles, Niall McCrae, and Annmarie Grealish. “A systematic review: the
influence of social media on depression, anxiety and psychological distress in
adolescents”. In: International Journal of Adolescence and Youth 25 (2019),
pp. 79–93.
[83] Zakir Durumeric et al. “The Matter of Heartbleed”. In: Proceedings of the
2014 Conference on Internet Measurement Conference (2014).
[84] Tony Tong Wang et al. “Adversarial Policies Beat Professional-Level Go AIs”.
In: ArXiv abs/2211.00241 (2022).
[85] T.R. La Porte and Paula M. Consolini. “Working in Practice But Not in The-
ory: Theoretical Challenges of ‘High-Reliability Organizations’”. In: Journal
of Public Administration Research and Theory 1 (1991), pp. 19–48.
[86] Thomas G. Dietterich. “Robust artificial intelligence and robust human orga-
nizations”. In: Frontiers of Computer Science 13 (2018), pp. 1–3.
[87] N. Leveson. Engineering a Safer World: Systems Thinking Applied to Safety.
Engineering systems. MIT Press, 2011. isbn: 9780262016629. url: https://b
ooks.google.com/books?id=0gZ_7n5p8MQC.
[88] David Manheim. Building a Culture of Safety for AI: Perspectives and Chal-
lenges. 2023.
[89] National Research Council et al. Lessons Learned from the Fukushima Nuclear
Accident for Improving Safety of U.S. Nuclear Plants. Washington, D.C.: Na-
tional Academies Press, Oct. 2014.
[90] Diane Vaughan. The Challenger Launch Decision: Risky Technology, Culture,
and Deviance at NASA. Chicago, IL: University of Chicago Press, 1996.
[91] Dan Lamothe. Air Force Swears: Our Nuke Launch Code Was Never
’00000000’. Jan. 2014.
[92] Toby Ord. The precipice: Existential risk and the future of humanity. Hachette
Books, 2020.
[93] U.S. Nuclear Regulatory Commission. Final Safety Culture Policy Statement.
Federal Register. 2011.
[94] Bruce Schneier. “Inside the Twisted Mind of the Security Professional”. In:
Wired (Mar. 2008).
[95] Dan Hendrycks and Mantas Mazeika. “X-Risk Analysis for AI Research”. In:
ArXiv abs/2206.05862 (2022).
[96] Donald T. Campbell. “Assessing the impact of planned social change”. In:
Evaluation and Program Planning 2.1 (1979), pp. 67–90.
[97] Yohan J. John et al. “Dead rats, dopamine, performance metrics, and peacock
tails: proxy failure is an inherent risk in goal-oriented systems”. In: Behavioral
and Brain Sciences (2023), pp. 1–68. doi: 10.1017/S0140525X23002753.
[98] Jonathan Stray. “Aligning AI Optimization to Community Well-Being”. In:
International Journal of Community Well-Being (2020).

[99] Jonathan Stray et al. “What are you optimizing for? Aligning Recommender
Systems with Human Values”. In: ArXiv abs/2107.10939 (2021).
[100] Ziad Obermeyer et al. “Dissecting racial bias in an algorithm used to manage
the health of populations”. In: Science 366 (2019), pp. 447–453.
[101] Dario Amodei and Jack Clark. Faulty reward functions in the wild. 2016.
[102] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. “The effects of re-
ward misspecification: Mapping and mitigating misaligned models”. In: ICLR
(2022).
[103] G. Thut et al. “Activation of the human brain by monetary reward”. In:
Neuroreport 8.5 (1997), pp. 1225–1228.
[104] Edmund T. Rolls. “The Orbitofrontal Cortex and Reward”. In: Cerebral Cor-
tex 10.3 (Mar. 2000), pp. 284–294.
[105] T. Schroeder. Three Faces of Desire. Philosophy of Mind Series. Oxford Uni-
versity Press, USA, 2004.
[106] Joseph Carlsmith. Is Power-Seeking AI an Existential Risk? 2022. arXiv: 220
6.13353 [cs.CY]. url: https://arxiv.org/abs/2206.13353.
[107] John J. Mearsheimer. Structural Realism. 2007, pp. 77–94.
[108] Bowen Baker et al. “Emergent Tool Use From Multi-Agent Autocurricula”.
In: International Conference on Learning Representations. 2020.
[109] Dylan Hadfield-Menell et al. “The Off-Switch Game”. In: IJCA (2017).
[110] Alexander Pan et al. “Do the Rewards Justify the Means? Measuring Trade-
Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark”.
In: ICML (2023).
[111] “Lyndon Baines Johnson”. In: Oxford Reference (2016).
[112] Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Human-level
play in the game of Diplomacy by combining language models with strategic
reasoning. 2022. doi: 10.1126/science.ade9097. eprint: https://www.science.o
rg/doi/pdf/10.1126/science.ade9097. url: https://www.science.org/doi/abs
/10.1126/science.ade9097.
[113] Paul Christiano et al. Deep reinforcement learning from human preferences.
Discussed in https://www.deepmind.com/blog/specification-gaming-the-flip
-side-of-ai-ingenuity. 2017. arXiv: 1706.03741.
[114] Xinyun Chen et al. Targeted Backdoor Attacks on Deep Learning Systems
Using Data Poisoning. 2017. arXiv: 1712.05526.
[115] Nick Beckstead. On the overwhelming importance of shaping the far future.
2013.
[116] Jens Rasmussen. “Risk management in a Dynamic Society: A Modeling Prob-
lem”. English. In: Proceedings of the Conference on Human Interaction with
Complex Systems, 1996.

[117] Cassidy Nelson and Sophie Rose. Understanding AI-Facilitated Biological Weapon Development. 2023. url: https://www.longtermresilience.org/post/report-launch-examining-risks-at-the-intersection-of-ai-and-bio (visited on 10/18/2023).
[118] Practices for Governing Agentic AI Systems. 2023. url: https://openai.com
/index/practices-for-governing-agentic-ai-systems (visited on 12/14/2023).
[119] Paul Scharre. Army of None. W.W. Norton, 2018.
[120] Peter S. Park et al. AI Deception: A Survey of Examples, Risks, and Potential
Solutions. 2023. arXiv: 2308.14752 [cs.CY].

Artificial Intelligence Fundamentals


[121] Nick Bostrom. “Superintelligence: Paths, Dangers, Strategies”. In: Oxford Uni-
versity Press (2014).
[122] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Pearson, 2020.
[123] Shane Legg and Marcus Hutter. “Universal intelligence: A definition of ma-
chine intelligence”. In: Minds and Machines 17 (2007), pp. 391–444.
[124] Alan M. Turing. “Computing Machinery and Intelligence”. In: Mind 59 (1950),
pp. 433–460.
[125] F. Rosenblatt. “The perceptron: A probabilistic model for information storage
and organization in the brain”. In: Psychological Review 65.6 (1958), pp. 386–
408. issn: 0033-295X. doi: 10.1037/h0042519. url: http://dx.doi.org/10.103
7/h0042519.
[126] Murray Campbell, A. Joseph Hoane Jr, and Feng-hsiung Hsu. “Deep blue”.
In: Artificial Intelligence 134.1-2 (2002), pp. 57–83.
[127] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles
and techniques. MIT Press, 2009.
[128] Leo Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32. url: https://doi.org/10.1023/a:1010933404324.
[129] Jerome H. Friedman. “Greedy Function Approximation: A Gradient Boosting
Machine”. In: The Annals of Statistics 29 (5 2001), pp. 1189–232. url: http:
//www.jstor.org/stable/2699986.
[130] Corinna Cortes and Vladimir Vapnik. “Support-Vector Networks”. In: Ma-
chine Learning 20.3 (1995), pp. 273–297.
[131] Jia Deng et al. “ImageNet: A large-scale hierarchical image database”. In:
2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009,
pp. 248–255. doi: 10.1109/CVPR.2009.5206848.

[132] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classi-
fication with Deep Convolutional Neural Networks”. In: Advances in Neural
Information Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran As-
sociates, Inc., 2012. url: https://proceedings.neurips.cc/paper_files/paper
/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
[133] David Silver et al. “Mastering the game of Go with deep neural networks and
tree search”. In: Nature 529 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16
961.
[134] Alec Radford et al. “Language Models are Unsupervised Multitask Learners”.
2019. url: https://api.semanticscholar.org/CorpusID:160025533.
[135] Meredith Ringel Morris et al. Levels of AGI for Operationalizing Progress on
the Path to AGI. 2024. arXiv: 2311.02462 [cs.AI].
[136] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer
New York, 2016.
[137] Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. MIT
Press, 2022.
[138] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and
Out-of-Distribution Examples in Neural Networks. 2018. arXiv: 1610 . 02136
[cs.NE].
[139] Dan Hendrycks. Anomaly and Out-of-Distribution Detection. url: https://do
cs.google.com/presentation/d/1WEzSFUbcl1Rp4kQq1K4uONMJHBAUWh
CZTzWVHnLcSV8/edit#slide=id.g60c1429d79_0_0.
[140] Walber. Precision Recall. Creative Commons Attribution-Share Alike 4.0 In-
ternational license. The layout of the figure has been changed. 2014. url:
https://en.wikipedia.org/wiki/File:Precisionrecall.svg.
[141] cmglee and Martin Thoma. Creative Commons Attribution-Share Alike 4.0
International license. The colours of the figure have been changed. 2018. url:
https://commons.wikimedia.org/wiki/File:Roc_curve.svg.
[142] Haiyin Luo and Yuhui Zheng. “Semantic Residual Pyramid Network for Image
Inpainting”. In: Information 13.2 (Jan. 2022). Creative Commons Attribution
4.0 International license. url: https://doi.org/10.3390/info13020071.
[143] Y. LeCun, Y. Bengio, and G. Hinton. “Deep Learning”. In: Nature (521 2015),
pp. 436–444. url: https://doi.org/10.1038/nature14539.
[144] Notjim. Creative Commons Attribution-Share Alike 3.0 Unported License.
url: https://commons.wikimedia.org/wiki/File:Neuron_-_annotated.svg.
[145] T.M. Mitchell. url: https://commons.wikimedia.org/wiki/File:Rosenblattpe
rceptron.png.
[146] Machine Learning for Artists, MNIST Input, 2018. GPL 2.0 License. url:
https://ml4a.github.io/demos/f_mnist_input/ (visited on 04/28/2024).

[147] Chuan Lin, Qing Chang, and Xianxu Li. “A Deep Learning Approach for
MIMO-NOMA Downlink Signal Detection”. In: Sensors 19 (June 2019),
p. 2526. doi: 10.3390/s19112526.
[148] Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs).
2023. arXiv: 1606.08415 [cs.LG].
[149] Vinod Nair and Geoffrey Hinton. “Rectified Linear Units Improve Restricted
Boltzmann Machines”. In: Proceedings of the 27th International Conference
on Machine Learning. Vol. 27. June 2010, pp. 807–814.
[150] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
[151] Febin Sunny, Mahdi Nikdast, and Sudeep Pasricha. SONIC: A Sparse Neu-
ral Network Inference Accelerator with Silicon Photonics for Energy-Efficient
Deep Learning. Creative Commons Attribution 4.0 International license. Sept.
2021.
[152] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in Neural
Information Processing Systems. Ed. by I. Guyon et al. Vol. 30. Curran Asso-
ciates, Inc., 2017. url: https://proceedings.neurips.cc/paper_files/paper/20
17/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[153] Ravish Raj. Ed. by Andrew Ng Lectures. url: https://www.enjoyalgorithms.com/blog/parameter-learning-and-gradient-descent-in-ml (visited on 09/28/2023).
[154] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning
representations by back-propagating errors”. In: Nature 323 (1986), pp. 533–
536. url: https://api.semanticscholar.org/CorpusID:205001834.
[155] Y. LeCun et al. “Gradient-based learning applied to document recognition”.
In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. doi: 10.1109/5.726
791.
[156] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In:
Neural Computation 9.8 (Nov. 1997), pp. 1735–1780. issn: 0899-7667. doi:
10.1162/neco.1997.9.8.1735. eprint: https://direct.mit.edu/neco/article-pdf
/9/8/1735/813796/neco.1997.9.8.1735.pdf. url: https://doi.org/10.1162/nec
o.1997.9.8.1735.
[157] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
[158] Joel Hestness et al. Deep Learning Scaling is Predictable, Empirically. 2017.
arXiv: 1712.00409 [cs.LG].
[159] Jared Kaplan et al. Scaling Laws for Neural Language Models. 2020. arXiv:
2001.08361 [cs.LG].
[160] Jordan Hoffmann et al. Training Compute-Optimal Large Language Models.
2022. arXiv: 2203.15556 [cs.CL].

[161] nostalgebraist. chinchilla’s wild implications. 2022. url: https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications (visited on 09/28/2023).
[162] Rich Sutton. The Bitter Lesson. url: http://www.incompleteideas.net/IncId
eas/BitterLesson.html (visited on 09/28/2023).
[163] Katja Grace et al. Thousands of AI Authors on the Future of AI. 2024. arXiv:
2401.02843 [cs.CY].
[164] Epoch. Key trends and figures in Machine Learning. 2023. url: https://epoc
hai.org/trends (visited on 10/19/2023).
[165] Tri Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with
IO-Awareness. 2022. arXiv: 2205.14135 [cs.LG].
[166] Emerging Technology Observatory. The state of global AI research. 2024. url:
https://eto.tech/blog/state-of-global-ai-research/ (visited on 06/06/2024).
[167] But what is a Neural Network. 2017. url: https://www.3blue1brown.com/le
ssons/neural-networks (visited on 05/02/2024).
[168] Andrei Karpathy. Intro to Large Language Models. 2023. url: https://www
.youtube.com/watch?v=zjkBMFhNj_g (visited on 11/22/2023).
[169] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http:
//www.deeplearningbook.org. MIT Press, 2016.
[170] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Intro-
duction. MIT Press, Cambridge, MA, 2018.

Single-Agent Safety
[54] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of
Control. en. Penguin, Oct. 2019.
[75] Sébastien Bubeck et al. “Sparks of Artificial General Intelligence: Early ex-
periments with GPT-4”. In: ArXiv abs/2303.12712 (2023).
[101] Dario Amodei and Jack Clark. Faulty reward functions in the wild. 2016.
[102] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. “The effects of re-
ward misspecification: Mapping and mitigating misaligned models”. In: ICLR
(2022).
[106] Joseph Carlsmith. Is Power-Seeking AI an Existential Risk? 2022. arXiv: 220
6.13353 [cs.CY]. url: https://arxiv.org/abs/2206.13353.
[107] John J. Mearsheimer. Structural Realism. 2007, pp. 77–94.
[108] Bowen Baker et al. “Emergent Tool Use From Multi-Agent Autocurricula”.
In: International Conference on Learning Representations. 2020.

[112] Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Human-level play in the game of Diplomacy by combining language models with strategic
reasoning. 2022. doi: 10.1126/science.ade9097. eprint: https://www.science.o
rg/doi/pdf/10.1126/science.ade9097. url: https://www.science.org/doi/abs
/10.1126/science.ade9097.
[121] Nick Bostrom. “Superintelligence: Paths, Dangers, Strategies”. In: Oxford Uni-
versity Press (2014).
[171] Zachary C. Lipton. “The Mythos of Model Interpretability: In Machine Learn-
ing, the Concept of Interpretability is Both Important and Slippery”. In:
Queue 16.3 (June 2018), pp. 31–57. issn: 1542-7730. doi: 10 . 1145 / 323638
6.3241340. url: https://doi.org/10.1145/3236386.3241340.
[172] Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement
Learning. Feb. 2017. arXiv: 1611.01578 [cs]. (Visited on 09/15/2023).
[173] Sheng Zhang and Matthew Turk. Creative Commons Attribution-Share Alike
3.0 Unported License. url: http://www.scholarpedia.org/article/Eigenfaces.
[174] David Bau et al. “Network Dissection: Quantifying Interpretability of Deep
Visual Representations”. In: 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). 2017, pp. 3319–3327. doi: 10.1109/CVPR.201
7.354.
[175] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. “Feature Visu-
alization”. In: Distill (2017). https://distill.pub/2017/feature-visualization.
doi: 10.23915/distill.00007.
[176] Ludwig Schubert et al. OpenAI Microscope. Creative Commons Attribution
License 4.0. Apr. 2020. (Visited on 09/15/2023).
[177] Nelson Elhage et al. “Softmax linear units”. In: Transformer Circuits Thread
(2022). url: https://transformer-circuits.pub/2022/solu/index.html.
[178] Margot E. Kaminski. “The Right to Explanation, Explained”. In: Berkeley
Technology Law Journal 34.IR, p. 189. url: http://la
wcat.berkeley.edu/record/1128984.
[179] M. Elish. “Moral Crumple Zones: Cautionary Tales in Human-Robot Interac-
tion (WeRobot 2016)”. In: SSRN Electronic Journal (Jan. 2016). doi: 10.213
9/ssrn.2757236.
[180] Marc W. Patry. “Attractive but Guilty: Deliberation and the Physical Attrac-
tiveness Bias”. In: Psychological Reports 102.3 (June 2008), pp. 727–733. issn:
0033-2941. doi: 10.2466/pr0.102.3.727-733. (Visited on 09/15/2023).
[181] Edward de Haan et al. “Split-Brain: What We Know Now and Why This is
Important for Understanding Consciousness”. In: Neuropsychology Review 30
(June 2020). doi: 10.1007/s11065-020-09439-3.
[182] Miles Turpin et al. Language Models Don’t Always Say What They Think:
Unfaithful Explanations in Chain-of-Thought Prompting. May 2023. doi: 10.4
8550/arXiv.2305.04388. eprint: 2305.04388 (cs). (Visited on 09/15/2023).

[183] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside Convo-
lutional Networks: Visualising Image Classification Models and Saliency Maps.
Apr. 2014. eprint: 1312.6034 (cs). (Visited on 09/15/2023).
[184] Jost Tobias Springenberg et al. Striving for Simplicity: The All Convolutional
Net. Apr. 2015. arXiv: 1412.6806 [cs]. (Visited on 09/15/2023).
[185] Julius Adebayo et al. “Sanity Checks for Saliency Maps”. In: Advances in
Neural Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018.
(Visited on 09/15/2023).
[186] Kevin Wang et al. Interpretability in the Wild: a Circuit for Indirect Object
Identification in GPT-2 small. 2022. arXiv: 2211.00593 [cs.LG].
[187] Chris Olah et al. “Zoom In: An Introduction to Circuits”. In: Distill 5.3 (Mar.
2020), e00024.001. issn: 2476-0757. doi: 10.23915/distill.00024.001. (Visited
on 09/15/2023).
[188] Kevin Meng et al. Locating and Editing Factual Associations in GPT. The
listed statement was inspired by, but not included in this paper. Jan. 2023.
eprint: 2202.05262 (cs). (Visited on 09/15/2023).
[189] Andy Zou et al. Representation Engineering: A Top-Down Approach to AI
Transparency. 2023. arXiv: 2310.01405 [cs.LG].
[190] Jerry Tang et al. “Semantic reconstruction of continuous language from non-
invasive brain recordings”. In: bioRxiv (2022). doi: 10.1101/2022.09.29.50974
4. eprint: https://www.biorxiv.org/content/early/2022/09/29/2022.09.29.50
9744.full.pdf. url: https://www.biorxiv.org/content/early/2022/09/29/202
2.09.29.509744.
[191] P.W. Anderson. “More Is Different”. In: Science 177.4047 (Aug. 1972),
pp. 393–396. doi: 10.1126/science.177.4047.393.
[192] Jacob Steinhardt. More Is Different for AI. Jan. 2022. (Visited on
09/16/2023).
[193] Jason Wei et al. Emergent Abilities of Large Language Models. 2022. arXiv:
2206.07682 [cs.CL]. url: https://arxiv.org/abs/2206.07682.
[194] Thomas McGrath et al. “Acquisition of chess knowledge in AlphaZero”. In:
Proceedings of the National Academy of Sciences 119.47 (Nov. 2022). doi: 10
.1073/pnas.2206625119. url: https://doi.org/10.1073%2Fpnas.2206625119.
[195] GPT-4 System Card. Tech. rep. OpenAI, Mar. 2023, p. 60. (Visited on
09/16/2023).
[196] Andy Zou et al. “Forecasting Future World Events with Neural Networks”.
In: NeurIPS (2022).
[197] Oriol Vinyals et al. Grandmaster level in StarCraft II using multi-agent rein-
forcement learning. Vol. 575. Nov. 2019. doi: 10.1038/s41586-019-1724-z.
[198] L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement Learning: A
Survey. 1996. arXiv: cs/9605103 [cs.AI].

[199] Danijar Hafner. Benchmarking the Spectrum of Agent Capabilities. 2022. arXiv: 2109.06780 [cs.AI].
[200] OpenAI Five Defeats Dota 2 World Champions. Apr. 2019. url: https://op
enai.com/research/openai-five-defeats-dota-2-world-champions (visited on
09/16/2023).
[201] Joon Sung Park et al. Generative Agents: Interactive Simulacra of Human
Behavior. Apr. 2023. (Visited on 09/16/2023).
[202] Evan Hubinger et al. “Risks from Learned Optimization in Advanced Machine
Learning Systems”. In: ArXiv (2019).
[203] Johannes von Oswald et al. Uncovering Mesa-Optimization Algorithms in
Transformers. Sept. 2023. eprint: 2309.05858 (cs). (Visited on 09/13/2023).
[204] Johannes Von Oswald et al. “Transformers Learn In-Context by Gradient
Descent”. In: Proceedings of the 40th International Conference on Machine
Learning. PMLR, July 2023, pp. 35151–35174. (Visited on 09/16/2023).
[205] Nick Bostrom. Base Camp for Mt. Ethics. Tech. rep. 2022. (Visited on
09/16/2023).
[206] OpenAI Preparedness Framework. url: https://cdn.openai.com/openai-prep
aredness-framework-beta.pdf.
[207] Yohan J. John et al. “Dead Rats, Dopamine, Performance Metrics, and Pea-
cock Tails: Proxy Failure Is an Inherent Risk in Goal-Oriented Systems”. In:
Behavioral and Brain Sciences (June 2023), pp. 1–68. issn: 0140-525X, 1469-
1825. doi: 10.1017/S0140525X23002753. (Visited on 09/15/2023).
[208] Ziad Obermeyer et al. “Dissecting Racial Bias in an Algorithm Used to Manage
the Health of Populations”. In: Science 366.6464 (Oct. 2019), pp. 447–453.
doi: 10.1126/science.aax2342.
[209] J. Skalse et al. “Defining and Characterizing Reward Hacking”. Sept. 2022.
url: https://arxiv.org/abs/2209.13085.
[210] Charles Goodhart. “Problems of Monetary Management: The U.K. Experi-
ence”. In: Papers in Monetary Economics 1 (1975), pp. 1–20.
[211] Marilyn Strathern. “‘Improving Ratings’: Audit in the British University Sys-
tem”. In: European Review 5.3 (July 1997), pp. 305–321. issn: 1474-0575,
1062-7987. doi: 10.1002/(SICI)1234-981X(199707)5:3<305::AID-EURO184
>3.0.CO;2-4.
[212] Kevin Roose. “The Making of a YouTube Radical”. In: The New York Times
(June 2019). issn: 0362-4331. (Visited on 09/15/2023).
[213] Paul Christiano et al. Deep Reinforcement Learning from Human Preferences.
Feb. 2023. doi: 10.48550/arXiv.1706.03741. arXiv: 1706.03741 [cs, stat].
[214] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and
Harnessing Adversarial Examples. Mar. 2015. eprint: 1412 . 6572 (cs, stat).
(Visited on 09/15/2023).

[215] Anish Athalye et al. Fooling Neural Networks in the Physical World. Oct. 2017.
(Visited on 09/15/2023).
[216] Andy Zou et al. Universal and Transferable Adversarial Attacks on Aligned
Language Models. July 2023. doi: 10.48550/arXiv.2307.15043. eprint: 2307.1
5043 (cs). (Visited on 09/15/2023).
[217] Disrupting malicious uses of AI by state-affiliated threat actors. url: https:
//openai.com/blog/disrupting-malicious-uses-of-ai-by-state-affiliated-threa
t-actors.
[218] Nicholas Carlini et al. Poisoning Web-Scale Training Datasets is Practical.
2023. arXiv: 2302.10149 [cs.CR].
[219] Nicholas Carlini et al. “Extracting Training Data from Large Language Mod-
els”. In: CoRR abs/2012.07805 (2020). arXiv: 2012.07805. url: https://arxi
v.org/abs/2012.07805.
[220] Milad Nasr et al. Scalable Extraction of Training Data from (Production) Lan-
guage Models. 2023. arXiv: 2311.17035 [cs.LG].
[221] Peter S. Park et al. AI Deception: A Survey of Examples, Risks, and Potential
Solutions. Aug. 2023. doi: 10 . 48550 / arXiv . 2308 . 14752. arXiv: 2308 . 14752
[cs]. (Visited on 09/16/2023).
[222] Julien Perolat et al. “Mastering the Game of Stratego with Model-Free Multia-
gent Reinforcement Learning”. In: Science 378.6623 (Dec. 2022), pp. 990–996.
issn: 0036-8075, 1095-9203. doi: 10.1126/science.add4679. eprint: 2206.15378
(cs). (Visited on 09/16/2023).
[223] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How
Models Mimic Human Falsehoods. May 2022. doi: 10.48550/arXiv.2109.07958.
eprint: 2109.07958 (cs). (Visited on 09/16/2023).
[224] Paul Ekman. “Emotions revealed”. In: BMJ 328.Suppl S5 (2004). doi: 10.11
36/sbmj.0405184. eprint: https://www.bmj.com/content. url: https://www
.bmj.com/content/328/Suppl_S5/0405184.
[225] Philip Powell and Jennifer Roberts. “Situational determinants of cognitive,
affective, and compassionate empathy in naturalistic digital interactions”. In:
Computers in Human Behaviour 68 (Nov. 2016), pp. 137–148. doi: 10.1016/j
.chb.2016.11.024.
[226] Michael Wai and Niko Tiliopoulos. “The affective and cognitive empathic na-
ture of the dark triad of personality”. In: Personality and Individual Differ-
ences 52 (May 2012), pp. 794–799. doi: 10.1016/j.paid.2012.01.008.
[227] S. Baron-Cohen. Zero Degrees of Empathy: A New Theory of Human Cruelty.
Penguin UK, 2011.
[228] Tania Singer and Olga M. Klimecki. “Empathy and compassion”. In: Current
Biology 24.18 (2014), R875–R878. issn: 0960-9822. doi: https://doi.org/10.1
016/j.cub.2014.06.054. url: https://www.sciencedirect.com/science/article
/pii/S0960982214007702.

[229] Russell Hotten. “Volkswagen: The Scandal Explained”. In: BBC News (Sept.
2015). (Visited on 09/16/2023).
[230] Stephen Casper et al. Red Teaming Deep Neural Networks with Feature Syn-
thesis Tools. July 2023. eprint: 2302.10894 (cs). (Visited on 09/16/2023).
[231] J.R. French and Bertram H. Raven. “The bases of social power”. In: 1959.
[232] B.H. Raven. “Social influence and power”. In: (1964).
[233] T. Hobbes. Hobbes’s Leviathan. 1651. isbn: 978-5-87635-264-4.
[234] Luke Muehlhauser and Anna Salamon. “Intelligence Explosion: Evidence and
Import”. In: Singularity Hypotheses: A Scientific and Philosophical Assess-
ment. Ed. by Amnon H. Eden and James H. Moor. Springer, 2012, pp. 15–
40.
[235] Alexander Pan et al. Do the Rewards Justify the Means? Measuring Trade-Offs
Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.
2023. arXiv: 2304.03279 [cs.LG].
[236] Stephen Omohundro. “The basic AI drives”. In: AGI 2008: Proceedings of the
First Conference on Artificial General Intelligence. 2008.
[237] David Silver. Exploration and exploitation. Lecture 9. 2014.
[238] Charles W. Kegley and Shannon L. Blanton. Trend and Transformation, 2014
- 2015. 2020, p. 259.
[239] Kenneth Waltz. Theory of international politics. Waveland Press, 2010, p. 93.
[240] W. Julian Korab-Karpowicz. “Political Realism in International Relations”.
In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Summer
2018. Metaphysics Research Lab, Stanford University, 2018.
[241] Evan Montgomery. “Breaking Out of the Security Dilemma: Realism, Reas-
surance, and the Problem of Uncertainty”. In: International Security - INT
SECURITY 31 (Oct. 2006), pp. 151–185. doi: 10.1162/isec.2006.31.2.151.
[242] Long Ouyang et al. “Training language models to follow instructions with
human feedback”. In: ArXiv (2022).
[243] Rafael Rafailov et al. Direct Preference Optimization: Your Language Model
is Secretly a Reward Model. 2023. arXiv: 2305.18290 [cs.LG].
[244] Nathaniel Li et al. The WMDP Benchmark: Measuring and Reducing Mali-
cious Use With Unlearning. 2024. arXiv: wmdp.ai [cs.CR].
[245] Dan Hendrycks et al. Unsolved Problems in ML Safety. 2022. arXiv: 2109.13
916 [cs.LG].
[246] Vitalik Buterin. My techno-optimism. url: https://vitalik.eth.limo/general
/2023/11/27/techno_optimism.html.
[247] United States Government Accountability Office. Critical Infrastructure Pro-
tection: Agencies Need to Enhance Oversight of Ransomware Practices and
Assess Federal Support. 2024.

[248] Minghao Shao et al. An Empirical Evaluation of LLMs for Solving Offensive
Security Challenges. 2024. arXiv: 2402.11814 [cs.CR].
[249] How AI Can Help Prevent Biosecurity Disasters. url: https://ifp.org/how-a
i-can-help-prevent-biosecurity-disasters/.
[250] Biotech begins human trials of drug designed by artificial intelligence. url:
https://www.ft.com/content/82071cf2-f0da-432b-b815-606d602871fc.
[251] Maialen Berrondo-Otermin and Antonio Sarasa-Cabezuelo. “Application of
Artificial Intelligence Techniques to Detect Fake News: A Review”. In: Elec-
tronics 12.24 (2023). issn: 2079-9292. doi: 10.3390/electronics12245041. url:
https://www.mdpi.com/2079-9292/12/24/5041.
[252] Audio deepfakes emerge as weapon of choice in election disinformation. url:
https://www.ft.com/content/bd75b678-044f-409e-b987-8704d6a704ea.
[253] David Ilić. “Unveiling the General Intelligence Factor in Language Models: A
Psychometric Approach”. In: ArXiv abs/2310.11616 (2023).
[254] Yohan John et al. “Dead rats, dopamine, performance metrics, and peacock
tails: proxy failure is an inherent risk in goal-oriented systems”. In: The Be-
havioral and brain sciences (June 2023), pp. 1–68. doi: 10.1017/S0140525X2
3002753.
[255] J. Dmitri Gallow. “Instrumental Divergence”. In: Philosophical Studies (2024),
pp. 1–27. doi: 10.1007/s11098-024-02129-3. Forthcoming.
[256] Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem
from a deep learning perspective. 2023. arXiv: 2209.00626 [cs.AI].

Safety Engineering
[42] Richard Danzig. Technology Roulette: Managing Loss of Control as Many Mili-
taries Pursue Technological Superiority. Tech. rep. Center for a New American
Security, June 2018.
[87] N. Leveson. Engineering a Safer World: Systems Thinking Applied to Safety.
Engineering systems. MIT Press, 2011. isbn: 9780262016629. url: https://b
ooks.google.com/books?id=0gZ_7n5p8MQC.
[257] Dan Hendrycks and Mantas Mazeika. X-Risk Analysis for AI Research. 2022.
arXiv: 2206.05862 [cs.CY].
[258] Terence Tao. “Nines of safety: a proposed unit of measurement of risk”. In:
2021.
[259] Nassim Nicholas Taleb. Antifragile: Things That Gain from Disorder. Incerto.
Random House Publishing Group, 2012. isbn: 9780679645276. url: https://b
ooks.google.com.au/books?id=5fqbz_qGi0AC.
[260] E. Marsden. Designing for safety: Inherent safety, designed in. 2017. url:
https://risk-engineering.org/safe-design/ (visited on 07/31/2017).

[261] James Reason. “The contribution of latent human failures to the breakdown
of complex systems”. In: Philosophical Transactions of the Royal Society of
London. Series B, Biological Sciences 327 1241 (1990), pp. 475–84. url: http
s://api.semanticscholar.org/CorpusID:1249973.
[262] William Vesely et al. Fault Tree Handbook with Aerospace Applications. 2002. url: http://www.mwftr.com
/CS2/Fault%20Tree%20Handbook_NASA.pdf.
[263] Andrew Critch and Stuart Russell. TASRA: a Taxonomy and Analysis of
Societal-Scale Risks from AI. 2023. arXiv: 2306.06924 [cs.AI].
[264] Nancy G. Leveson. Introduction to STAMP: Part 1. MIT Partnership for Sys-
tems Approaches to Safety and Security (PSASS). 2020.
[265] Arden Albee et al. Report on the Loss of the Mars Polar Lander and Deep
Space 2 Missions. Tech. rep. Jet propulsion Laboratory - California Institute
of Technology, 2000.
[266] Shalaleh Rismani et al. “Beyond the ML Model: Applying Safety Engineer-
ing Frameworks to Text-to-Image Development”. In: Proceedings of the 2023
AAAI/ACM Conference on AI, Ethics, and Society. AIES ’23. Montréal, QC,
Canada, Association for Computing Machinery, 2023, pp. 70–83. doi: 10.114
5/3600211.3604685. url: https://doi.org/10.1145/3600211.3604685.
[267] Donella Meadows. Leverage Points: Places to Intervene in a System. 1999.
url: https://donellameadows.org/archives/leverage-points-places-to-interve
ne-in-a-system/ (visited on 05/29/2024).
[268] C. Perrow. Normal Accidents: Living with High Risk Technologies. Princeton
paperbacks. Princeton University Press, 1999.
[269] Nancy G. Leveson et al. “Moving Beyond Normal Accidents and High Reli-
ability Organizations: A Systems Approach to Safety in Complex Systems”.
In: Organization Studies (2009).
[270] Thomas G. Dietterich. “Steps Toward Robust Artificial Intelligence”. In: AI
Magazine 38.3 (2017), pp. 3–24. doi: https://doi.org/10.1609/aimag.v38i3.2
756. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1609/aimag.v38i3.27
56. url: https://onlinelibrary.wiley.com/doi/abs/10.1609/aimag.v38i3.2756.
[271] Jens Rasmussen. “Risk management in a dynamic society: a modelling prob-
lem”. In: Safety Science 27.2 (1997), pp. 183–213. issn: 0925-7535. doi: http
s://doi.org/10.1016/S0925-7535(97)00052-0. url: https://www.sciencedirect
.com/science/article/pii/S0925753597000520.
[272] Sidney Dekker. Drift into Failure: From Hunting Broken Components to Un-
derstanding Complex Systems. Jan. 2011, pp. 1–220. isbn: 9781315257396.
doi: 10.1201/9781315257396.
[273] Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An Overview of
Catastrophic AI Risks. 2023. arXiv: 2306.12001 [cs.CY].
[274] Krystal Hu. ChatGPT sets record for fastest-growing user base - analyst note.
2023. url: https://www.reuters.com/technology/chatgpt-sets-record-fastest
-growing-user-base-analyst-note-2023-02-01/ (visited on 02/03/2023).
[275] Scott D. Sagan. The Limits of Safety: Organizations, Accidents, and Nuclear
Weapons. Vol. 177. Princeton University Press, 1993. isbn: 9780691021010.
url: http://www.jstor.org/stable/j.ctvzsmf8r (visited on 09/26/2023).
[276] Nassim Nicholas Taleb. The Black Swan: The Impact of the Highly Improbable.
Vol. 2. Random House, 2007.
[277] E. Marsden. Black swans: The limits of probabilistic modelling. 2017. url:
https://risk-engineering.org/black-swans/ (visited on 07/31/2017).
[278] Mark Boult et al. Horizon Scanning: A Practitioner’s Guide. English. IRM -
Institute of Risk Management, 2018.

Complex Systems
[160] Jordan Hoffmann et al. Training Compute-Optimal Large Language Models.
2022. arXiv: 2203.15556 [cs.CL].
[279] P. Cilliers and David Spurrett. “Complexity and post-modernism: Understand-
ing complex systems”. In: South African Journal of Philosophy 18 (Sept. 2014),
pp. 258–274. doi: 10.1080/02580136.1999.10878187.
[280] Alethea Power et al. “Grokking: Generalization Beyond Overfitting on Small
Algorithmic Datasets”. In: ICLR MATH-AI Workshop. 2021.
[281] Hang Zhao et al. “Response mechanisms to heat stress in bees”. In: Apidologie
52 (Jan. 2021). doi: 10.1007/s13592-020-00830-w.
[282] Trevor Kletz. An Engineer’s View of Human Error. CRC Press, 2018.
[283] Lucas Davis. “The Effect of Driving Restrictions on Air Quality in Mexico
City”. In: Journal of Political Economy 116 (Feb. 2008), pp. 38–81. doi: 10.1
086/529398.
[284] Jemimah Steinfeld. “China’s deadly science lesson: How an ill-conceived cam-
paign against sparrows contributed to one of the worst famines in history”. In:
Index on Censorship 47.3 (2018), pp. 49–49. doi: 10.1177/0306422018800259.
[285] Xin Meng, Nancy Qian, and Pierre Yared. “The Institutional Causes of China’s
Great Famine, 1959–1961”. In: The Review of Economic Studies 82.4 (Apr.
2015), pp. 1568–1611. issn: 0034-6527. doi: 10.1093/restud/rdv016.
[286] Joseph Gottfried. “History Repeating? Avoiding a Return to the Pre-
Antibiotic Age”. In: 2005. url: https://api.semanticscholar.org/CorpusID:84028897.
[287] Centers for Disease Control (US). “Antibiotic resistance threats in the United
States, 2019”. In: CDC Stacks (2019). doi: http://dx.doi.org/10.15620/cdc:8
2532.
[288] “Cycling facts”. In: (2018). url: https://english.kimnet.nl/publications/publications/2018/04/06/cycling-facts.
[289] M. Gell-Mann. The Quark and the Jaguar: Adventures in the Simple and the
Complex. St. Martin’s Publishing Group, 1995. isbn: 9780805072532. url:
https://books.google.com/books?id=l6aCe4zqZ_sC.
[290] D.H. Meadows and D. Wright. Thinking in Systems: A Primer. Chelsea Green
Publishing, 2008. isbn: 9781603580557. url: https://books.google.com.au/b
ooks?id=CpbLAgAAQBAJ.
[291] John Gall. The systems bible: the beginner’s guide to systems large and small.
General Systemantics Press, 2002.
[292] Richard Cook. How complex systems fail. Jan. 2002. url: https://how.compl
exsystems.fail.

Beneficial AI and Machine Ethics
[54] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of
Control. en. Penguin, Oct. 2019.
[58] Ethan Kross et al. “Facebook use predicts declines in subjective well-being in
young adults”. In: PLOS One (2013).
[293] Kevin M. Clermont. “Rules, Standards, and Such”. In: Buffalo Law Review
(751 2020).
[294] John J. Nay. Law Informs Code: A Legal Informatics Approach to Aligning
Artificial Intelligence with Humans. 2023. arXiv: 2209.13020 [cs.CY].
[295] Isaac Asimov. Runaround. 1942.
[296] Carl Shulman and Nick Bostrom. “Sharing the World with Digital Minds”.
In: Rethinking Moral Status. Oxford University Press, Aug. 2021. isbn:
9780192894076. doi: 10.1093/oso/9780192894076.003.0018. eprint: https://academic.oup.com/book/0/chapter/350760172/chapter-pdf/43437637/oso-9780192894076-chapter-18.pdf. url: https://doi.org/10.1093/oso/9780192894076.003.0018.
[297] Chris Stokes. “Why the three laws of robotics do not work”. In: International
Journal of Research in Engineering and Innovation 2 (2 2018).
[298] Julia Angwin et al. Machine Bias. 2016. url: https://www.propublica.org
/article/machine-bias-risk-assessments-in-criminal-sentencing (visited on
10/15/2023).
[299] W. Dieterich, C. Mendoza, and T. Brennan. “COMPAS risk scales: Demon-
strating accuracy equity and predictive parity”. In: 7 (4 2016), pp. 1–36.
[300] “State v. Loomis”. In: Harvard Law Review 130 (5 2017).
[301] Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias
against women. 2018. url: https://www.reuters.com/article/us- amazon-
com-jobs-automation-insight-idUSKCN1MK08G.
[302] Tolga Bolukbasi et al. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. 2016. arXiv: 1607.06520 [cs.CL].
[303] Jieyu Zhao et al. Men Also Like Shopping: Reducing Gender Bias Amplifica-
tion using Corpus-level Constraints. 2017. arXiv: 1707.09457 [cs.AI].
[304] Thomas Krendl Gilbert et al. Choices, Risks, and Reward Reports: Charting
Public Policy for Reinforcement Learning Systems. 2022. arXiv: 2202.05716
[cs.LG].
[305] Jonathan Stray et al. What are you optimizing for? Aligning Recommender
Systems with Human Values. 2021. arXiv: 2107.10939 [cs.IR].
[306] David Krueger, Tegan Maharaj, and Jan Leike. Hidden Incentives for Auto-
Induced Distributional Shift. 2020. arXiv: 2009.09153 [cs.LG].
[307] Sam Corbett-Davies et al. The Measure and Mismeasure of Fairness. 2018.
arXiv: 1808.00023 [cs.CY].
[308] Alexandra Chouldechova. Fair prediction with disparate impact: A study of
bias in recidivism prediction instruments. 2016. arXiv: 1610.07524 [stat.AP].
[309] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent Trade-
Offs in the Fair Determination of Risk Scores. 2016. arXiv: 1609.05807
[cs.LG].
[310] Moritz von Zahn and Stefan Feuerriegel. “The Cost of Fairness in AI: Evidence from E-Commerce”. In: Business and Information Systems Engineering 64 (3 2022). doi: 10.1007/s12599-021-00716-w.
[311] Sumon Biswas and Hridesh Rajan. “Do the machine learning models on a
crowd sourced platform exhibit bias? an empirical study on model fairness”.
In: Proceedings of the 28th ACM Joint Meeting on European Software Engi-
neering Conference and Symposium on the Foundations of Software Engineer-
ing. ACM, Nov. 2020. doi: 10.1145/3368089.3409704. url: https://doi.org/1
0.1145%2F3368089.3409704.
[312] Raphael Poulain, Mirza Farhan Bin Tarek, and Rahmatollah Beheshti. “Im-
proving Fairness in AI Models on Electronic Health Records: The Case for
Federated Learning Methods”. In: 2023. arXiv: 2305.11386 [cs.LG].
[313] Karen L. Calderone. “The influence of gender on the frequency of pain and
sedative medication administered to postoperative patients”. In: Sex Roles 23
(1990), pp. 713–725. url: https://doi.org/10.1007/bf00289259.
[314] Kryssia Isabel Rodríguez-Castro et al. “Female gender in the setting of liver
transplantation”. In: World J Transplant 4 (4 2014). doi: 10.5500/wjt.v4.i4.2
29.
[315] Isabel Straw and Honghan Wu. “Investigating for bias in healthcare algo-
rithms: a sex-stratified analysis of supervised machine learning models in
liver disease prediction”. In: BMJ Health & Care Informatics 29 (Apr. 2022),
p. 100457. doi: 10.1136/bmjhci-2021-100457.
[316] Min Kyung Lee. “Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management”. In: Big Data &
Society 5 (2018). url: https://api.semanticscholar.org/CorpusID:149040922.
[317] Galen Harrison et al. “An empirical study on the perceived fairness of realistic,
imperfect machine learning models”. In: Jan. 2020, pp. 392–402. doi: 10.114
5/3351095.3372831.
[318] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating Unwanted
Biases with Adversarial Learning. 2018. arXiv: 1801.07593 [cs.LG].
[319] Inioluwa Deborah Raji et al. Closing the AI Accountability Gap: Defining an
End-to-End Framework for Internal Algorithmic Auditing. 2020. arXiv: 2001
.00973 [cs.CY].
[320] David Rozado. The unequal treatment of demographic groups by ChatGP-
T/OpenAI content moderation system. 2023. url: https://davidrozado.substack.com/p/openaicms.
[321] George A. Akerlof. “The Market for “Lemons”: Quality Uncertainty and the
Market Mechanism”. eng. In: The Quarterly Journal of Economics 84.3 (1970),
pp. 488–500. issn: 0033-5533.
[322] John Rust. Psychometrics and Artificial Intelligence. 2005.
[323] J. Vipra and A. Korinek. Market Concentration Implications of Foundation
Models: The Invisible Hand of ChatGPT. Sept. 7, 2023. url: https://www.b
rookings.edu/articles/market-concentration-implications-of-foundation-mod
els-the-invisible-hand-of-chatgpt/.
[324] Steve Lafleur. Coase, Pigou, and Sriracha Sauce. 2013. url: https://fcpp.or
g/2013/12/16/coase-pigou-and-sriracha-sauce/ (visited on 10/13/2023).
[325] Payal Dhar. “The carbon impact of artificial intelligence”. eng. In: Nature
Machine Intelligence 2.8 (2020), pp. 423–425. issn: 2522-5839.
[326] Viral V. Acharya, Deniz Anginer, and A. Joseph Warburton. “The end of
market discipline? Investor expectations of implicit government guarantees”.
In: Investor Expectations of Implicit Government Guarantees (May 1, 2016)
(2016).
[327] Inequality - Bridging the Divide. 2020. url: https://www.un.org/en/un75/in
equality-bridging-divide (visited on 10/13/2023).
[328] J. Hasell et al. Economic Inequality. 2023. url: https://ourworldindata.org
/economic-inequality.
[329] Drew Desilver. For most U.S. workers, real wages have barely budged in
decades. 2018. url: https://www.pewresearch.org/short-reads/2018/08/07/for-most-us-workers-real-wages-have-barely-budged-for-decades/
(visited on 10/13/2023).
[330] Nick Hanauer and David M. Rolf. The Top 1% of Americans Have Taken $50
Trillion From the Bottom 90% — And That’s Made the U.S. Less Secure.
2020. (Visited on 10/13/2023).
[331] Richard D. Wilkinson and Kate Pickett. The spirit level: Why more equal
societies almost always do better. Bloomsbury Publishing, 2009.
[332] Morgan Kelly. “Inequality And Crime”. In: The Review of Economics and
Statistics 82 (Feb. 2000), pp. 530–539. doi: 10.1162/003465300559028.
[333] World Bank. Poverty and Inequality Platform. 2022. url: https://data.world
bank.org/indicator/SI.POV.GINI.
[334] United Nations Office on Drugs and Crime. Global Study on Homicide 2019.
2019. url: https://www.unodc.org/documents/data-and-analysis/gsh/Book
let1.pdf.
[335] World Bank. World Governance Indicators. 2023. url: https://databank.wo
rldbank.org/source/worldwide-governance-indicators.
[336] Thomas Piketty. Capital in the Twenty-First Century. Belknap Press, 2014.
[337] Radeksz. Graph of the Preston Curve for 2005, cross section data. 2009. url:
https://commons.wikimedia.org/wiki/File:PrestonCurve2005.JPG.
[338] Amartya Sen. “Economics and Health”. In: The Lancet (1999).
[339] Jac Heckelman. “Economic Freedom and Economic Growth: A Short-Run
Causal Investigation”. In: Journal of Applied Economics III (May 2000),
pp. 71–91. doi: 10.1080/15140326.2000.12040546.
[340] Charles Jones and Pete Klenow. “Beyond GDP? Welfare across Countries and
Time”. In: American Economic Review 106 (Sept. 2016), pp. 2426–2457. doi:
10.1257/aer.20110236.
[341] Richard Easterlin et al. “China’s Life Satisfaction, 1990-2010”. In: Proceedings
of the National Academy of Sciences of the United States of America 109 (May
2012), pp. 9775–80. doi: 10.1073/pnas.1205672109.
[342] L.A. Paul. Transformative Experience. Oxford Scholarship online. Oxford Uni-
versity Press, 2014. isbn: 9780198717959. url: https://books.google.com.au
/books?id=zIXjBAAAQBAJ.
[343] Alberto Giubilini. “The Artificial Moral Advisor. The “Ideal Observer” Meets
Artificial Intelligence”. In: Philosophy & Technology (2018). doi: https://doi
.org/10.1007/s13347-017-0285-z.
[344] P. Bloom. The Sweet Spot: The Pleasures of Suffering and the Search for
Meaning. HarperCollinsPublishers, 2021. isbn: 9780062910561. url: https :
//books.google.com.au/books?id=LUI7zgEACAAJ.
[345] Mantas Mazeika et al. How Would The Viewer Feel? Estimating Wellbeing
From Video Scenarios. 2022. arXiv: 2210.10039 [cs.CV].
[346] Dan Hendrycks et al. “Aligning AI With Shared Human Values”. In: CoRR
abs/2008.02275 (2020). arXiv: 2008.02275. url: https://arxiv.org/abs/2008
.02275.
[347] Patrick Butlin et al. Consciousness in Artificial Intelligence: Insights from the
Science of Consciousness. 2023. arXiv: 2308.08708 [cs.AI].
[348] John A. Weymark. “Harsanyi’s Social Aggregation Theorem and the Weak
Pareto Principle”. In: Social Choice and Welfare (1993). doi: https://doi.org
/10.1007/BF00182506.
[349] William MacAskill, Krister Bykvist, and Toby Ord. Moral Uncertainty. Sept.
2020. isbn: 9780198722274. doi: 10.1093/oso/9780198722274.001.0001.
[350] Stephen L. Darwall, ed. Deontology. Malden, MA: Wiley-Blackwell, 2003.
[352] Harry R. Lloyd. The Property Rights Approach to Moral Uncertainty. Happier
Lives Institute’s 2022 Summer Research Fellowship. 2022. url: https://www
.happierlivesinstitute.org/report/property-rights/.
[353] Wendell Wallach and Colin Allen. Moral Machines: Teaching Robots Right
from Wrong. Oxford University Press, Feb. 2009. isbn: 9780195374049. doi:
10.1093/acprof:oso/9780195374049.001.0001. url: https://doi.org/10.1093/a
cprof:oso/9780195374049.001.0001.
[354] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine
Learning: Limitations and Opportunities. http://www.fairmlbook.org. fairml-
book.org, 2019.
[355] Richard Layard and Jan-Emmanuel Neve. Wellbeing: Science and Policy. May
2023. isbn: 9781009298926. doi: 10.1017/9781009298957.
[356] Michael J. Sandel. What Money Can’t Buy: The Moral Limits of Markets.
Farrar, Straus and Giroux, 2012.
[357] Dan Hendrycks et al. What Would Jiminy Cricket Do? Towards Agents That
Behave Morally. 2022. arXiv: 2110.13136 [cs.LG].
[358] M.D. Adler. Measuring Social Welfare: An Introduction. Oxford University
Press, 2019. isbn: 9780190643027. url: https://books.google.com/books?id
=_GitDwAAQBAJ.
[359] Katarzyna de Lazari-Radek and Peter Singer. Utilitarianism: A Very Short
Introduction. Oxford University Press, 2017. isbn: 9780198728795. doi: 10.1
093/actrade/9780198728795.001.0001. url: https://doi.org/10.1093/actrade
/9780198728795.001.0001.
[360] S. Kagan. Normative Ethics. Dimensions of philosophy series. Avalon Publish-
ing, 1998. isbn: 9780813308456. url: https://books.google.com/books?id=i
O8TAAAAYAAJ.
[361] Toby Newberry and Toby Ord. The Parliamentary Approach to Moral Uncertainty. Technical Report 2021-2. Future of Humanity Institute, University of Oxford, 2021.
[362] Stephen Darwall. Deontology. Blackwell Publishers, Oxford, 2003.

Collective Action Problems
[24] Matthew Burtell and Thomas Woodside. “Artificial Influence: An Analysis Of
AI-Driven Persuasion”. In: ArXiv abs/2303.08721 (2023).
[57] Richard C. Lewontin. “The Units of Selection”. In: Annual Review of Ecology,
Evolution, and Systematics 1 (1970), pp. 1–18.
[107] John J. Mearsheimer. Structural Realism. 2007, pp. 77–94.
[112] Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Human-level
play in the game of Diplomacy by combining language models with strategic
reasoning. 2022. doi: 10.1126/science.ade9097. eprint: https://www.science.o
rg/doi/pdf/10.1126/science.ade9097. url: https://www.science.org/doi/abs
/10.1126/science.ade9097.
[273] Dan Hendrycks, Mantas Mazeika, and Thomas Woodside. An Overview of
Catastrophic AI Risks. 2023. arXiv: 2306.12001 [cs.CY].
[332] Morgan Kelly. “Inequality And Crime”. In: The Review of Economics and
Statistics 82 (Feb. 2000), pp. 530–539. doi: 10.1162/003465300559028.
[351] Dan Hendrycks. Natural Selection Favors AIs over Humans. 2023. arXiv: 230
3.16200 [cs.CY].
[363] Thomas C. Schelling. Micromotives and Macrobehavior. 1978.
[364] R. Dawkins. The Blind Watchmaker. Penguin Books Limited, 1986. isbn:
9780141966427. url: https : / / books . google . com . au / books ? id= - EDHRX3
YYwgC.
[365] E. Warren and A.W. Tyagi. The Two Income Trap: Why Middle-Class Parents
are Going Broke. Basic Books, 2004. isbn: 9780465090907. url: https://boo
ks.google.com.au/books?id=TmXhGJ0tg58C.
[366] Scott Alexander. Meditations on Moloch. 2014. url: https://slatestarcodex.c
om/2014/07/30/meditations-on-moloch/ (visited on 09/29/2023).
[367] Steven Kuhn. Prisoner’s Dilemma. 1997. url: https://plato.stanford.edu/arc
hives/win2019/entries/prisoner-dilemma/ (visited on 09/29/2023).
[368] Derek Parfit. Reasons and Persons. Oxford, GB: Oxford University Press,
1984.
[369] Robert Axelrod. “More Effective Choice in the Prisoner’s Dilemma”. In: The
Journal of Conflict Resolution 24.3 (1980), pp. 379–403. issn: 00220027,
15528766. url: http://www.jstor.org/stable/173638 (visited on 09/28/2023).
[370] J. Maynard Smith. “The theory of games and the evolution of animal con-
flicts”. In: Journal of Theoretical Biology 47.1 (1974), pp. 209–221. issn: 0022-
5193. doi: https://doi.org/10.1016/0022-5193(74)90110-6. url: https://ww
w.sciencedirect.com/science/article/pii/0022519374901106.
[371] Noam Nisan et al. Algorithmic Game Theory. Cambridge University Press,
2007. doi: 10.1017/CBO9780511800481.
[372] Allan Dafoe. “AI Governance: Overview and Theoretical Lenses”. In: The
Oxford Handbook of AI Governance. Oxford University Press, 2022. isbn:
9780197579329. doi: 10.1093/oxfordhb/9780197579329.013.2. eprint: https://academic.oup.com/book/0/chapter/408516484/chapter-ag-pdf/50913724/book_41989_section_408516484.ag.pdf. url: https://doi.org/10.1093/oxfordhb/9780197579329.013.2.
[373] Kai-Fu Lee and Chen Qiufan. AI 2041 Ten Visions for Our Future. Currency,
2021.
[374] John H. Herz. “Idealist Internationalism and the Security Dilemma”. In: World
Politics 2.2 (1950), pp. 157–180. doi: 10.2307/2009187.
[375] William Press and Freeman Dyson. “Iterated Prisoner’s Dilemma Contains
Strategies That Dominate Any Evolutionary Opponent”. In: Proceedings of
the National Academy of Sciences of the United States of America 109 (May
2012), pp. 10409–13. doi: 10.1073/pnas.1206569109.
[376] Alexander Stewart and Joshua Plotkin. “Extortion and cooperation in the
Prisoner’s Dilemma”. In: Proceedings of the National Academy of Sciences
109 (June 2012), pp. 10134–10135. doi: 10.1073/pnas.1208087109.
[377] D. Patel and Carl Shulman. AI Takeover, Bio & Cyber Attacks, Detecting
Deception, & Humanity’s Far Future. Audio Podcast Episode from Dwarkesh
Podcast. 2023. url: https://www.youtube.com/watch?v=KUieFuV1fuo&ab
_channel=DwarkeshPatel.
[378] Jared Diamond. Collapse: How Societies Choose to Fail or Succeed: revised
edition. Penguin Books (London), 2005.
[379] Andrew Critch. What Multipolar Failure Looks Like, and Robust Agent-
Agnostic Processes (RAAPs). 2021. url: https://www.alignmentforum.or
g/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-rob
ust-agent-agnostic (visited on 09/29/2023).
[380] Scott Alexander. Ascended Economy? 2016. url: https://slatestarcodex.com
/2016/05/30/ascended-economy/.
[381] OpenAI. OpenAI Charter. 2018. url: https://openai.com/charter.
[382] Martin A. Nowak. “Five Rules for the Evolution of Cooperation”. In: Science
314.5805 (2006), pp. 1560–1563. doi: 10.1126/science.1133755. eprint: https:
//www.science.org/doi/pdf/10.1126/science.1133755. url: https://www.scie
nce.org/doi/abs/10.1126/science.1133755.
[383] Robert Trivers. “The Evolution of Reciprocal Altruism”. In: Quarterly Review
of Biology 46 (Mar. 1971), pp. 35–57. doi: 10.1086/406755.
[384] Gabriele Schino and Filippo Aureli. “Grooming reciprocation among female
primates: A meta-analysis”. In: Biology Letters 4 (Nov. 2007), pp. 9–11. doi:
10.1098/rsbl.2007.0506.
[385] Steven Pinker. The better angels of our nature: Why violence has declined.
Penguin Books, 2012.
[386] Martin A. Nowak and Karl Sigmund. “Evolution of indirect reciprocity by image scoring”. In: Nature 393.6685 (1998), pp. 573–577.
[387] Redouan Bshary and Alexandra Sara Grutter. “Image scoring and cooperation
in a cleaner fish mutualism”. In: Nature 441 (2006), pp. 975–978. url: https:
//api.semanticscholar.org/CorpusID:4422277.
[388] Gerald Carter and Gerald Wilkinson. “Food sharing in vampire bats: Recipro-
cal help predicts donations more than relatedness or harassment”. In: Proceed-
ings. Biological Sciences / The Royal Society 280 (Feb. 2013), p. 20122573.
doi: 10.1098/rspb.2012.2573.
[389] Daniel Balliet, Junhui Wu, and Paul Lange. “Indirect Reciprocity, Gos-
sip, and Reputation-Based Cooperation”. In: Oct. 2020, pp. 265–287. isbn:
9781462543984.
[390] S.A. West, A.S. Griffin, and A. Gardner. “Social semantics: altruism, coop-
eration, mutualism, strong reciprocity and group selection”. In: Journal of
Evolutionary Biology 20.2 (2007), pp. 415–432. doi: https://doi.org/10.1111
/j.1420-9101.2006.01258.x. eprint: https://onlinelibrary.wiley.com/doi/pdf/1
0.1111/j.1420-9101.2006.01258.x. url: https://onlinelibrary.wiley.com/doi/a
bs/10.1111/j.1420-9101.2006.01258.x.
[391] W. D. Hamilton. “The genetical evolution of social behaviour. I”. In: Journal
of Theoretical Biology (1964).
[392] Oliver Scott Curry. “Morality as cooperation: A problem-centred approach”.
In: The Evolution of Morality (2016), pp. 27–51.
[393] Deborah Charlesworth and John H. Willis. “The genetics of inbreeding de-
pression”. In: Nature Reviews Genetics 10.11 (2009), pp. 783–796.
[394] Napoleon A. Chagnon. “Kin selection and conflict: An analysis of a Yanomamö
ax fight”. In: Evolutionary Biology and Human Social Behavior: An Anthro-
pological Perspective (1979), pp. 213–238.
[395] Edward Westermarck. The origin and development of the moral ideas. DigiCat,
2022.
[396] Martin S. Smith, Bradley J. Kish, and Charles B. Crawford. “Inheritance of
wealth as human kin investment”. In: Ethology and Sociobiology 8.3 (1987),
pp. 171–182.
[397] Kerris Oates and Margo Wilson. “Nominal kinship cues facilitate altruism”.
In: Proceedings of the Royal Society of London. Series B: Biological Sciences
269.1487 (2002), pp. 105–109.
[398] Debra Lieberman, John Tooby, and Leda Cosmides. “Does morality have a
biological basis? An empirical test of the factors governing moral sentiments
relating to incest”. In: Proceedings of the Royal Society of London. Series B:
Biological Sciences 270.1517 (2003), pp. 819–826.
[399] Nancy Wilmsen Thornhill. “An evolutionary analysis of rules regulating hu-
man inbreeding and marriage”. In: Behavioral and Brain Sciences 14.2 (1991),
pp. 247–261.
[400] Richard C. Connor. “The benefits of mutualism: a conceptual framework”. In:
Biological Reviews 70.3 (1995), pp. 427–457.
[401] Mark Van Vugt, Robert Hogan, and Robert B. Kaiser. “Leadership, follower-
ship, and evolution: some lessons from the past”. In: American Psychologist
63.3 (2008), p. 182.
[402] Peter Carruthers and Peter K Smith. Theories of theories of mind. Cambridge
University Press, 1996.
[403] Stuart A. West, Ashleigh S. Griffin, and Andy Gardner. “Evolutionary expla-
nations for cooperation”. In: Current Biology 17.16 (2007), R661–R672.
[404] John F. Nash. “The Bargaining Problem”. In: Econometrica 18.2 (1950),
pp. 155–162. issn: 00129682, 14680262. url: http://www.jstor.org/stable
/1907266 (visited on 11/19/2023).
[405] Herbert Gintis. “The evolution of private property”. In: Journal of Economic
Behavior & Organization 64.1 (2007), pp. 1–16.
[406] Joseph Henrich et al. ““Economic man” in cross-cultural perspective: Behav-
ioral experiments in 15 small-scale societies”. In: Behavioral and Brain Sci-
ences 28.6 (2005), pp. 795–815.
[407] Steven J. Brams and Alan D. Taylor. Fair Division: From cake-cutting to
dispute resolution. Cambridge University Press, 1996.
[408] M.J. Herskovits. Economic Anthropology: A Study in Comparative Economics.
Borzoi book. Knopf, 1952. url: https://books.google.com/books?id=LJqy
AAAAIAAJ.
[409] V. Buterin. What even is an institution? 2022. url: https://vitalik.ca/gener
al/2022/12/30/institutions.html.
[410] Marina Favaro, Neil Renic, and Ulrich Kuhn. Forecasting the future impact of
emerging technologies on international stability and human security. Accessed:
29 May 2024. Sept. 26, 2022. url: https://ifsh.de/file/publication/Research
_Report/010/Research_Report_010.pdf.
[411] James D. Fearon. “Rationalist Explanations for War”. In: International Orga-
nization 49.3 (1995), pp. 379–414. issn: 00208183, 15315088. url: http://ww
w.jstor.org/stable/2706903 (visited on 10/14/2023).
[412] Kenneth W. Condit. The Joint Chiefs of Staff and National Policy, 1947-1949.
1996.
[413] Robert Powell. “War as a commitment problem”. In: International Organiza-
tion 60.1 (2006), pp. 169–203.
[414] Brian Michael Jenkins. The Will to Fight, Lessons from Ukraine. 2022. url:
https://www.rand.org/pubs/commentary/2022/03/the-will-to-fight-lessons-
from-ukraine.html.
[415] Laurie Chen. “‘Always there’: The AI chatbot comforting China’s lonely millions”. In: The Jakarta Post (2021).
[416] Victimization rates for persons age 12 or older, by type of crime and annual
family income of victims. 2011. url: https://bjs.ojp.gov/sites/g/files/xycku
h236/files/media/document/cv0814.pdf.
[417] Pablo Fajnzylber, Daniel Lederman, and Norman Loayza. “Inequality and
Violent Crime”. In: The Journal of Law and Economics 45.1 (2002), pp. 1–
39. doi: 10.1086/338347. eprint: https://doi.org/10.1086/338347. url:
https://doi.org/10.1086/338347.
[418] Adrienne Horne. “The Effect Of Relative Deprivation On Delinquency: An
Assessment Of Juveniles”. PhD thesis. University of Central Florida, 2009.
[419] Richard Dawkins. The Selfish Gene, 30th Anniversary edition. Oxford Univer-
sity Press, 2006.
[420] Lee Smolin. “Did the Universe Evolve?”. In: Classical and Quantum Gravity
(1992).
[421] Daniel C. Dennett. “Darwin’s Dangerous Idea: Evolution and the Meanings
of Life”. In: 1995.
[422] Susan J. Blackmore. The Meme Machine. Oxford University Press, 1999.
[423] George R. Price. “Selection and Covariance”. In: Nature (1970).
[424] Samir Okasha. “Evolution and the Levels of Selection”. In: 2007.
[425] Thomas G. Dietterich. “Ensemble Methods in Machine Learning”. In: Multiple
Classifier Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000, pp. 1–
15. isbn: 978-3-540-45014-6.
[426] Samir Okasha. “Agents and Goals in Evolution”. In: Oxford Scholarship On-
line (2018).
[427] Daniel Martín-Vega et al. “3D virtual histology at the host/parasite interface:
visualisation of the master manipulator, Dicrocoelium dendriticum, in the
brain of its ant host”. In: Scientific Reports 8.1 (2018), pp. 1–10.
[428] Christoph Ratzke, Jonas Denk, and Jeff Gore. “Ecological suicide in mi-
crobes”. In: bioRxiv (2017), p. 161398.
[429] Thomas C. Schelling. Arms and Influence. Yale University Press, 1966. isbn:
9780300002218. url: http://www.jstor.org/stable/j.ctt5vm52s (visited on
10/14/2023).
[430] Edward O. Wilson. Sociobiology: The New Synthesis, Twenty-Fifth Anniver-
sary Edition. Harvard University Press, 2000. isbn: 9780674000896. url: htt
p://www.jstor.org/stable/j.ctvjnrttd (visited on 10/14/2023).
[431] C. Boehm. Moral origins: The evolution of virtue, altruism, and shame. Basic
Books, 2012.
[432] R. Axelrod and R.M. Axelrod. The Evolution of Cooperation. Basic books.
Basic Books, 1984. isbn: 9780465021215. url: https://books.google.com.au
/books?id=NJZBCGbNs98C.

Governance
[13] Fabio Urbina et al. “Dual use of artificial-intelligence-powered drug discovery”.
In: Nature Machine Intelligence 4 (2022), pp. 189–191.
[162] Rich Sutton. The Bitter Lesson. url: http://www.incompleteideas.net/IncId
eas/BitterLesson.html (visited on 09/28/2023).
[164] Epoch. Key trends and figures in Machine Learning. 2023. url: https://epoc
hai.org/trends (visited on 10/19/2023).
[433] Tom Davidson. Could Advanced AI Drive Explosive Economic Growth? url:
https://www.openphilanthropy.org/research/could-advanced-ai-drive-explo
sive-economic-growth/.
[434] Jeffrey Ding. “The Rise and Fall of Technological Leadership: General-Purpose
Technology Diffusion and Economic Power Transitions”. In: International
Studies Quarterly 68.2 (Mar. 2024), sqae013. issn: 0020-8833. doi: 10 . 109
3/isq/sqae013. eprint: https://academic.oup.com/isq/article-pdf/68/2/sqae0
13/56984912/sqae013.pdf. url: https://doi.org/10.1093/isq/sqae013.
[435] J. Briggs and D. Kodnani. The potentially large effects of artificial intelligence
on economic growth. Accessed: 23 May 2024. 2023. url: https://www.gspubl
ishing.com/content/research/en/reports/2023/03/27/d64e052b-0f6e-45d7-96
7b-d7be35fabd16.html.
[436] Andrew Yang. The War On Normal People: The Truth About America’s Dis-
appearing Jobs and Why Universal Basic Income Is Our Future. Hachette
Books, 2018.
[437] Daron Acemoglu and Pascual Restrepo. “Robots and jobs: Evidence from US
labor markets”. In: Journal of Political Economy 128.6 (2020), pp. 2188–2244.
[438] Yuval Noah Harari. Homo Deus: A brief history of tomorrow. Random house,
2016.
[439] Maximiliano Dvorkin and Hannah Shell. “The growing skill divide in the US
labor market”. In: Federal Reserve Bank of St. Louis: On the Economy Blog
18 (2017).
[440] Anton Korinek and Donghyun Suh. Scenarios for the Transition to AGI. Work-
ing Paper 32255. National Bureau of Economic Research, Mar. 2024. doi: 10
.3386/w32255. url: http://www.nber.org/papers/w32255.
[441] Nick Bostrom. What is a Singleton? Linguistic and Philosophical Investiga-
tions, 5 (2), 48-54. 2006.
[442] K. Eric Drexler. Reframing Superintelligence: Comprehensive AI Services as
General Intelligence. Tech. rep. Future of Humanity Institute, 2019.
[443] Michael C. Jensen and William H. Meckling. “Theory of the firm: Managerial
behavior, agency costs and ownership structure”. In: Corporate Governance.
Gower, 2019, pp. 77–132.
[444] Milton Friedman. “The Social Responsibility of Business Is To Increase Its
Profits”. In: vol. 32. Jan. 2007, pp. 173–178. isbn: 978-3-540-70817-9. doi:
10.1007/978-3-540-70818-6_14.
[445] Peter Cihon, Jonas Schuett, and Seth D. Baum. “Corporate Governance of
Artificial Intelligence in the Public Interest”. In: Information 12.7 (July 2021),
p. 275. issn: 2078-2489. doi: 10.3390/info12070275. url: http://dx.doi.org
/10.3390/info12070275.
[446] Jonas Schuett. AGI labs need an internal audit function. 2023. arXiv: 2305.1
7038 [cs.CY].
[447] Daron Acemoglu, Andrea Manera, and Pascual Restrepo. Does the US tax
code favor automation? Tech. rep. National Bureau of Economic Research,
2020.
[448] Remco Zwetsloot. “Winning the Tech Talent Competition”. In: Center for
Strategic and International Studies (2021), p. 2.
[449] Nicholas Carlini et al. Stealing Part of a Production Language Model. 2024.
arXiv: 2403.06634 [cs.CR].
[450] Stuart Armstrong, Nick Bostrom, and Carl Shulman. “Racing to the precipice:
a model of artificial intelligence development”. In: AI & society 31 (2016),
pp. 201–206.
[451] Deborah D. Avant, Martha Finnemore, and Susan K. Sell. Who governs the
globe? Vol. 114. Cambridge University Press, 2010.
[452] Ben Garfinkel and Allan Dafoe. “How does the offense-defense balance
scale?”. In: Emerging Technologies and International Stability. Routledge,
2021, pp. 247–274.
[453] Wolfgang K.H. Panofsky and Jean Marie Deken. 2008. doi: https://doi.org
/10.1007/978-0-387-69732-1.
[454] Robert Trager et al. “International Governance of Civilian AI: A Jurisdictional
Certification Approach”. In: Risk, Regulation, & Policy eJournal (2023). url:
http://dx.doi.org/10.2139/ssrn.4579899.
[455] Francis J. Gavin. “Strategies of Inhibition: U.S. Grand Strategy, the Nuclear
Revolution, and Nonproliferation”. In: International Security 40.1 (July 2015),
pp. 9–46. issn: 0162-2889. doi: 10.1162/ISEC_a_00205. eprint: https://direct.mit.edu/isec/article-pdf/40/1/9/1843553/isec_a_00205.pdf. url: https://doi.org/10.1162/ISEC_a_00205.
[456] Diane Coyle. Preempting a Generative AI Monopoly. 2023. url: https://www.project-syn
dicate.org/commentary/preventing-tech-giants-from-monopolizing-artificial
-intelligence-chatbots-by-diane-coyle-2023-02 (visited on 10/19/2023).
[457] Dario Amodei et al. AI and Compute. OpenAI. 2018.
[458] The $150m Machine Keeping Moore’s Law Alive. Wired, 2021. url: https:
//www.wired.com/story/asml-extreme-ultraviolet-lithography-chips-moores
-law/ (visited on 08/30/2021).
[459] Zachary Arnold et al. ETO Supply Chain Explorer. Dataset. 2022. url: http
s://cset.georgetown.edu/publication/eto-supply-chain-explorer/ (visited on
10/19/2023).
[460] Saif M Khan, Alexander Mann, and Dahlia Peterson. “The semiconductor
supply chain: Assessing national competitiveness”. In: Center for Security and
Emerging Technology 8.8 (2021).
[461] Jaime Sevilla et al. Compute Trends Across Three Eras of Machine Learning.
2022. arXiv: 2202.05924 [cs.LG].
[462] Ege Erdil and Tamay Besiroglu. Explosive growth from AI automation: A
review of the arguments. Papers 2309.11690. arXiv.org, 2023. url: https://id
eas.repec.org/p/arx/papers/2309.11690.html.
[463] Saif M. Khan. AI Chips: What They Are and Why They Matter. 2020. url:
https://doi.org/10.51593/20190014.
[464] Yonadav Shavit. What does it take to catch a Chinchilla? Verifying Rules on
Large-Scale Neural Network Training via Compute Monitoring. 2023. arXiv:
2303.11341 [cs.LG].
[465] Lewis Ho et al. International Institutions for Advanced AI. 2023. arXiv: 2307
.04699 [cs.CY].
[466] Matthijs Maas. Advanced AI Governance. Tech. rep. Legal Priorities Project, 2023. url: https://docs.google.com/document/d/1pwwNHvNeJneBA2t2xaP31lVv1lSpa36w8kdryoS5768/edit.
Index

Note: Bold page numbers refer to tables and Italic page numbers refer to figures.

A people could enlist AIs, power
Accelerationism, 9 seeking, 160
Accidents, hard to avoid, 30–32 power, 157–160
AcciMap, 212 power seeking, instrumentally
Accountability, 17, 32, 205, 361 rational, 160–162
harms and hazards, 121–122 power-seeking behavior, tail risk,
Accuracy, 63, 71, 72, 74, 76, 91, 101, 167
146, 171 structural pressures, power-seeking
Activation functions, 93 AI, 165–167
Adaptability, 42, 192, 210, 260 techniques to control AI systems,
Adversarial approaches, 477–478 168–169
Adversarial attacks, 146, 147 AlphaGo, 32, 58
Adversarial fine-tuning, 197 AlphaZero chess model, 128
Adversarial robustness, 146 Altruism, 26, 444
Adversarial training, 146 Altruistic behavior, 27
Advocacy groups, 264, 265 Anomaly detection, 135
Advocacy movements, 263–264 Anthrax, 29
Agents’ incentives, 397–398 Anthropogenic climate change, 399
AI cyberwarfare, 388 Antibiotic resistance, 275–276
AI race, 5, 14 Antifragility, 189, 192–193, 236
corporate AI race, 20–23 Application programming interface
evolutionary pressures, 23–27 (API), 460
military AI arms race, 14–19 Arms races, 377
AI watchdogs, 197, 477 Arrow’s Impossibility Theorem, 345, 346
AI-controlled weapons, 387 Artificial conscience, 197, 337–338
systems, 17 Artificial general intelligence (AGI), 51,
AI-developed weapons, 387 59, 61–62, 109
AI-guided weaponized drone, 15 generality, adaptability and human
AlexNet model, 58, 100, 111, 454 performance, 61
Algorithmic biases, 295 incentives and investment, 109
Algorithmic fairness, 300–301 physical embodiment, 61
Alignment, 117–118, 150–151 predicting, 61
deception, 151–154 risks and capabilities, 61–62
deceptive alignment and treacherous Artificial intelligence (AI), 52–53; see
turns, tail risk, 156–157 also individual entries
deceptive evaluation gaming, AI winters and resurgences, 56
154–156 birth of, 55–56

Dartmouth Conference, 55 “The Better Angels of Our Nature,”
definition, 53–54, 111 (Pinker), 414
early approaches, 55 Biases, 295–299
early computing research, 54 algorithmic, 295, 296
early historical ideas, 54 approaches to combating, 302–305
First AI Winter, 56 automation and measurability, 299
first Chatbot, 55–56 confirmation, 299
first neural network, 55 discriminatory, 296
first recovery, 56 feedback loops, 299
generality and skill level, 59 human-AI interactions, 298–299
history, 54–58 labeling, 298
modern conception, 54 mitigating, 304
second AI winter, 56 reduction methods, 305
Turing Test, 54–55 reporting, 297
Artificial neurons, 82, 83 reward, 298
Artificial superintelligence (ASI), 51, 59 sampling, 298
Asimov, Isaac, 293–294 systematic, 295–296
Assurance, 471–472 technical sources of, 296–298
Attrition model, 386 Big data, 57
Automated AI R&D, 22 Binary classification, 65, 67
Automated economy, 21–23 Bioengineered pandemics, 7, 8
Automated executive decision-making, Biological inspiration, 82
388 Biosecurity, 170, 477
Automated warfare, 16–20 Biotechnology, 8
Automatic retaliation, 16–17 Bioterrorism, 7–8, 13
Automation, 389–390, 454, 458, 459 bioengineered pandemics, 7, 8
Autonomous economy, 27–28, 401, 459 biotechnology, 8
Axelrod, Robert, 383 chemical and biological weapons, 8
Bioweapons programs, 7
B Black swans, 224, 238
Backpropagation, 95–96 absence of evidence, 232, 234
Balance of power, 164 identifying risk of, 233–235
Barbados-Grenada football match, 273 implications, risk analysis, 228–233
Bardeen-Cooper-Schrieffer (BCS) theory, known unknowns and unknown
249 unknowns, 225–228, 235, 237
Baumol’s cost disease, 456–457 “Blunt-end” interventions, 204–205
Beneficial AI, 285–362 Bottom-up misuse, 462
incoherent views, embedding values, Bow tie model, 197–198
285–286 Butterfly effect, 252
promote human values, 285
Bernoulli-Nash social welfare function, C
350 Calibration, 300–301
BERT (Bidirectional Encoder Cane toads, Australia, 272–273
Representations from Capabilities approach, 118, 129, 173,
Transformers), 100–101 323
“Capital in the 21st Century” (Piketty), corporations and research institutes
313 as, 262–263
Capped-profit structure, 470 deep learning systems, 248
Cash for Ash scandal, 273 distributed functionality, 256–258
Catastrophic accidents, 388 economic system, 247
Catastrophic AI risks, 3–50 emergence, 250
Catastrophic risks, xvi, 176 emergent properties, 246
Censorship, 11 feedback and nonlinearity, 250–253
Challenger Space Shuttle disaster, 32 Gilb’s Law of unreliability, 269–270
ChaosGPT, 9 hallmarks of, 249–261
Chatbots, 324 human artifacts, 243
ChatGPT, 58, 61, 148, 219 human brain, 247
Chernobyl nuclear disaster, 29, 33 human reliability, unreliable,
Chinchilla model, 259 269–270
Chinchilla scaling law, 105, 105–106 interactions matter, 247
CICERO, 46 limitations, reductionist paradigm,
Chief Risk Officer, 36, 468 244–245
Circuits, 125–126, 126 mathematical equations,
Classic risk equation, 181–182 mechanistic systems, 243
CoastRunners, 39 mechanistic and statistical
Coercive power, 158 approaches, reductionist
Cognitive empathy, 153 paradigm, 243–244
Cold War nuclear arms race, 389 mechanistic approach, 242
Collective action problems, 365–447 paradigm, 245–247
Common pool resource problem, puzzles and problems, 270–271
399–400 reductionist paradigm, 242–245, 248
COMPAS (Correctional Offender review of hallmarks, 261
Management Profiling for safe system, when scaled up not still
Alternative Sanctions) software, safe, 268
294–295 scalable structure and power laws,
Compassionate empathy, 153 258–260
Competent deception, 122 self-organization, 253–254
Competition, 27 self-organized criticality, 254–256
Competitive pressures, 18, 20–21, 28, 43, social systems as, 261–266
48, 206–207, 366, 385, 402 stable states and restoring forces,
Complex systems, 126, 234, 241–281 273–277
adaptive behavior, 260 subgoals supersede original goal,
applied to advocacy, 263–265 267–268
armchair analysis, limited, 266–267 successful interventions, 277
challenges with interventionism, systemic issues, 278–279
272–277 ubiquitous, 247
complexity, information, and understanding system complexity,
reductionism, 249 249
conventional computer programs, use of AI will unfold, 265–266
243 wicked problems, 271–272
working complex systems from institutions, 413–416
simpler systems, 268–269 kin selection, 410–411
working definition, 246–247 Co-option, military AI technologies,
Component failure models, 236 388–389
Compute governance, 489–496 Corporate AI race, 20–23
cloud services, 494 automated economy, 21–23
excludable, 492–493 economic competition and safety,
forms of, 493–494 20–21
frequent chip replacements, 493 Corporate governance, 468–472, 496
indispensable for development and assurance, 471–472
deployment, 489–491 legal structure, 468–469
individual chips, 493–494 organizational structure, 470–471
large-scale compute, 490–491 overview, 468
on-chip mechanisms, 494 ownership structure, 469–470
physical, 491–492 place of incorporation, 469
quantifiable, 493 shareholder theory, 468
Computer vision, 65 stakeholder theory, 468
Concentration of power, 11–13 statutes and bylaws, 469
irreversible, 12 Corporate power, 12
Confabulation, 123 Corruption, 415–416
Conflict, 368–369, 417–432, 446 Cost-benefit analysis, 228, 230,
bargaining theory, 419–420 346–348
commitment problems, 420–426 Counterfactual fairness, 300
disinformation, 428–429 COVID-19 pandemic, 218
factors outside bargaining theory, Credence, 353
429–431 Critical points, 264
first-strike advantage, 423–424 Cross entropy loss, 94
inequality and scarcity, 429–431 Cuban Missile Crisis, 4
information problems, 426–429 Customized ownership structures, 470
issue indivisibility, 424–426 Cyberattacks, 15–16, 49
misinformation, 427–428 Cybersecurity, 170, 477
overview, 417–419 Cyberwarfare, 15–16
resolution, 413
Confusion matrices, 71–72 D
Consciousness, 340 Danzig, Richard, 19, 238
Convergent instrumental goals, 160 Data points, 229
Convolution, 91 Data skew, 304
Convolutional layers, 91 Dataset, 104
Convolutional neural networks (CNNs), Dawkins, Richard, 433, 440
91, 99 Deception, 45–47, 151–154
Cooperation, 26, 368, 377–380, 403–417, accidental deception, 152
446 imitative deception, 152–153
direct reciprocity, 404–406 strategy, 152
group selection, 408–409 Deceptive alignment, 151–152
indirect reciprocity, 407–408 Deceptive evaluation gaming, 151
Decision-making processes, 214, 239, transformers, 92
264, 337–338 transportation and logistics, 101
Deep Blue, 56 DeepMind, 259, 469
Deep learning (DL), 52, 57, 64, 79–102, Defense in depth strategy, 35, 190, 236
111–112, 234 Defensive accelerationism, 170
activation functions, 87 Dekker’s Drift into Failure (DIF) model,
advancements, 58 215, 237
automatically learned Delay fallacy, 231, 237
representations, 80 Democracy, 399
black box, 119 Deontology, 353
building blocks, model, 82, 125 Differential fitness, 438
challenges and limitations, 81 Differential technological development,
communication and entertainment, 173
101 Digital minds, 339–341
complicated relationships, data, 81 Direct reciprocity, 26, 368
components, 112 Disability-adjusted life years (DALYs),
convolution, 91 325
defining, 80–81 Disaster risk equation, 182–186
function approximation, 81 Displacement effect, 457
Gaussian error linear unit (GELU), Distortion budget, 146
87–88, 88 Distributed functionality, 256–258, 264
healthcare, 102 Distribution, AI, 457–467, 495–496
hierarchical representations, 81 AI singleton, 463–465
history and timeline, key benefits and costs, 457–459
architectures, 99–101 distribution of access, 459–463
key activation functions, 86–89 distribution of power, 463–467
LeNet, 99 diverse ecosystem, 465–467
mechanics of learning, 93–96 levels of access, 460–461
ML/DL distinction, 80 openness norms, 461–462
multi-layer perceptrons, 83–86 risks from open vs controlled
performance, 79 models, 462–463
processes, 112 Distributive fairness, 300
real-world applications, 79 Dominant strategy, 373–374, 382–383
rectified linear unit (ReLU) Dual-use technologies, 461
function, 87, 87
residual connections, 89–91, 90 E
revolutionizes AI, 57–58 Economic competition, 20–21
scalability, 79 Economic engine, 305–320
self-attention, 91 economic output and gross domestic
sigmoid, 88 product, 316–317
softmax function, 88–89 free markets, allocative efficiency,
steps to training, 96 306–307
training and inference, 93–99 growth, 315–316
training methods and techniques, happiness, economics, 318–319
96–99 inequality, 311–315, 312
market failures, 307–311 impossibility theorem, 301
welfare economics models, 317–318 improving, 302–305
Economic growth, 452–457, 495 limitations of, 301–302
non-automated tasks, 456 technical approaches, 302–303
regulations and human preferences, theory of justice as, 300
455–456 Faithfulness, 122
Economic incentives, 319 Fault tree analysis (FTA) method,
Economic output, 316–317 198–200
Eisenhower, Dwight D., 19 Feedback, 250–253, 280
Element-wise nonlinear function, 86 Fermi, Enrico, 31
ELIZA chatbot, 55–56 Few-shot learning, 98, 132
Emergence, 263 Few-shot prompting, 260
Emergent capabilities, 127–129, 134, Fiduciary duties, 470
173, 268 Financial crisis, 2007-2008, 218
Emotional empathy, 153 Financial incentives, 451–452
Equalized odds, 300 Fine-tuning, 97–98
Ethical obligations, 121 First-strike advantage, 368
Ethically-minded AI developers, 21 Fitness, 435, 436
Evaluation gaming, 151–154 Fitter technologies, 23
Evolutionary pressures, 23–27, 366, 369, Flash war, 17, 400–401
432, 446–447 Fleming, Alexander, 275
generalized Darwinism, 432–439 Four pests campaign, 275
information’s eye view, 439–441 Free riding/free ride, 396, 415–416
intrasystem goal conflict, 441–443 Freezing point of water, 255
overview, 432 French & Raven’s bases of power,
selection and selfish behavior, levels, 157–160
439–445 Fundamentals, artificial intelligence,
selfishness, 443–445 51–113
Excessive working hours, 366
Exchange, 412–413 G
Existential risks, 216–218 Gain-of-function research, 30
Expert power, 158 Game theory, 367–403, 445–446
Expert-designed systems, 107 agents in, 371
Export controls, 478 iterated Prisoner’s Dilemma,
External assurance, 471–472 380–394
External coercion, 398 negative sum outcome, 371–372
External pressures, 29 non-zero sum games, 371
Extortion, 390–392 overview, 369–370
positive sum outcome, 371–372
F Prisoner’s Dilemma, 372–380
Fact-checking AIs, 11 zero sum game, 371
Fail-safes, 189, 191–192, 236 Gene synthesis, 8
Fairness, 294–305 General capabilities, 171–173
algorithmic fairness, 299, 300–301 externalities, 173
biases, 295–299 General Problem Solver, 55
Generalized Darwinism, 432–439 Hazardous capabilities, 135
and AI populations, 436–439 Health-adjusted life years (HALYs), 325
conceptual framework, 433–434 Heartbleed bug, 32
formalizing, 434–436 Hedonism, 320, 322
General-purpose wellbeing function, 335 High reliability organizations (HROs),
Generative agents, 131 33, 34, 209–211
Generative modeling, 68 vs. Normal Accident Theory (NAT),
Generative Pre-trained Transformer 212
(GPT) models, 58 Hinton, Geoffrey, 44
Gilb’s Law of unreliability, 269–270 Horizon scanning, 235
Gini coefficient, 312 Human confabulation, 123
Goal drift, 41–43, 134 Human Development Index (HDI), 319,
Good Old-Fashioned AI (GOFAI), 55 323
Goodhart’s law, 40, 139, 143, 149 Human enfeeblement, 22–23
Governance, 449–497 Human factors, 32–33
actors, 450–451 Human wellbeing, 362
compute, 489–495 Humanity’s moral progress, 12–13
corporate, 468–472, 496 Human-level artificial intelligence
economic growth, 452–457 (HLAI), 51, 59, 62, 108–110,
international, 480–489, 496 458–459
landscape, 450–452 obstacles, 109–110
national, 472–480, 496 potential for rapid development,
overview, 450 108–109
rights and civil liberties, 452 Hype cycles, 56
standards and regulations, 452 Hyperbolic growth, 3
tools, 451–452 Hyperparameter, 95
GPT (Generative Pre-Trained
Transformer), 101 I
GPT-4, 9 Idealized preferences, 332–335
Gradient boosting, 57 AI ideal advisor, 334
Gradient descent, 95, 95 ImageNet dataset, 58
Graphics Processing Units (GPUs), 57 Imitation Game, 54
Gross domestic product (GDP), 315, Immigration policy, 478
316–317 Immoral law, 292–293
Group fairness, 300 Incentives, of researchers, 239
Group projects, 395–396 Independent audits, 304
Group selection, 368 Indirect reciprocity, 368
Individual fairness, 300
H Industrial Revolution, 62
Happiness, 335–342 Inequality, 311–315, 312, 314, 369
general approach, 336–338 Inference time, 75
happiness-focused ethics, problems, Information dissemination, 451
338–339 Information environment, 170–171
Harsanyi’s social aggregation theorem, Information problems, 368
348 Information security, 460, 478–479
Informational power, 158 Intrinsification, 42, 133–134
Instructed deception, 154 Inverse reinforcement learning (IRL),
Instrumental goals, 41–42, 133, 160 328–329
Instrumental incentives, 161 Issue indivisibility, 368
Instrumental power-seeking behavior, Iterated Prisoner’s Dilemma, 380–394
161 AI races, 385
Intelligence, 26, 34 collective action problems, 395–402
collection, 478 AI races, more than two
continuum of, 53–54 competitors, 400–402
different meanings of, 53 formalization, 396–398
Intelligent computer system, see real-world examples of, 398–400
Artificial intelligence (AI) corporate AI races, 385–387
Internal assurance, 471 extortion, 390–392
International governance, 480–489, 496 extortion with digital minds, tail
bilateral and multilateral talks, 483 risk, 392–394
catastrophic changes, international tournaments, 383–385
agreements, 485
civilian AI, 486–487 J
compliance with international Jailbreaks, 146
agreements, 485 Johnson, Lyndon, 45
evaluation, monitoring, and
adjudication, 482 K
forms of, 481–484 Kasparov, Garry, 56
governance organizations, 483 Kennedy, John F., 18
implementation and enforcement, Kin selection, 368
482 Kinship, 412
international agreements, 486–489 Kleiber’s law, 258, 259
international regulation, 480 Kletz, Trevor, 269
military AI, 487–488 Knowledge, xv
norms and standards, 483
policymaking, 481–482 L
of powerful technologies, 481 Large language models (LLMs), 58, 98,
production changes, controlling, 486 147, 173, 259, 290–291
questions, AI regulation, 484–486 Laws, 287, 294, 362
setting agendas, 481 immoral law, 292–293
summits and forums, 483 legitimate aggregation of values, 288
techniques, 482–484 need for ethics, 291–294
technology defense-dominant or rules and standards, 289
offense-dominant, 484 silent law, 291–292
treaties, 483–484 specific and adaptable, 290–291
unilateral commitments, 482–483 time-tested and comprehensive,
International organizations, 451 288–289
Internet, 218–219 unrepresentative law, 293
Interventions, 49 Le Chatelier’s Principle, 274
Intrinsic goals, 41, 42, 133 Learning algorithms, 106
Learning dynamics, 89 features and labels, 70
Learning paradigms, 75 goals and tradeoffs, 75
Learning rate, 95 guidelines for understanding, 65
LeCun, Yann, 99 input data types, 69
Legal form, 468–469 key ML tasks, 65–69
Legal liability, 474–475 linear regression, 66–67
Legal structure, 468–469 mean squared error (MSE), 74
legal form, 468–469 metrics, 71–75
place of incorporation, 469 ML architecture, 70
statutes and bylaws, 469 ML deployment, 71
Legitimate power, 158 ML pipeline components, 69–71
LeNet, 99 model evaluation, 71
Leontief, Wassily, 22 overview and definition, 64–65
Lethal autonomous weapons (LAWs), parameters, 70
14–15 predict labels, 70
Lewontin, Richard, 23, 434 probabilistic graphical models
Life satisfaction, 322 (PGMs), 57
Linear regression, 230 reasonably vs. reliably solved, 74
Logarithmic social welfare function, 349 reframe tasks, 77
Long short-term memory (LSTM) regression tasks, 66–67
networks, 99–100 reinforcement learning, 77–78
Long-tailed scenarios, 232, 233 sequence modeling, 67–69
Loss function, 94 sigmoid function, 66
spectrum of supervision, 77
M supervised learning, 76
Machine ethics, 285–362 support vector machines (SVMs), 57
Machine learning (ML), 51–53, 64 system confabulation, 123
advancements, 56–57 tools, 15
AI significant victory over humans, training and using, 70
56 tree-based algorithms, 57
algorithms, 64 types, 75–78
anomaly detection, 67 types of, 111
AUROC (Area Under the Receiver unsupervised learning, 76
Operating Characteristic), Machine unlearning, 168, 169
73–74, 74 Malicious actors, 9
benefits, 65 Malicious use, 5, 6–7, 460
Black swan detection, 67 bioterrorism, 7–8
classification, 65–66 concentration of power, 11–13
computing power and Internet, 56 persuasive AIs, 10–11
data collection, 70 unilateral actors and, 6–7
deep learning, 78 unleashing AI agents, 9–10
development process, 111 Maneuvering Characteristics
evaluating ML models, 71–75 Augmentation System, 21
false positives vs. false negatives, Mao Zedong, 275
72–73 Market failures, 307–311
externalities, 310 reinforcement learning, emergence,
informational asymmetries, 307–308 129–131
moral hazards, 311 representation engineering, 126–127
oligopolies, 308–309 saliency maps, 124, 124
Mass unemployment, 22 Monopoly, 488
Maximize Expected Choice-Worthiness MonsterMind, 16–17
(MEC), 355–357 Moral agent, 340
McAuliffe, Sharon Christa, 28 Moral beneficiary, 340
McCarthy, Dr. John, 55 Moral disagreement, 354
Meadows’ twelve leverage points, 205 Moral parliament approach, 357
Mean squared error (MSE), 94 Moral patient, 340
Mesa-objective, 131 Moral regression, 41
Mesa-optimizers, 131, 132–133 Moral theories, 352
Metric FLOP/s, 490 Moral uncertainty, 287, 352–361
Military AI arms race, 14–19 dealing with, 352–354
actors risk extinction, individual implementing moral parliament, AI
defeat, 18–19 systems, 358
automated warfare, 16–18 making decisions under, 352
cyberwarfare, 15–16 moral parliament advantages,
lethal autonomous weapons (LAWs), 358–361
14–15 potential solutions, 354
Military arms races, 366 “Morality as Cooperation” (MAC),
MNIST (Modified National Institute of 411–413
Standards and Technology) Motivation, 365–369
database, 83 Mud-slinging, 376
Modality, 69 Multi-agent system, 365
Model theft, 478 Multi-AI dynamics, 367
Moltke the Elder, 265 Multi-class classification, 65
Monitoring, 117, 118 Multi-human dynamics, 367
approaches to transparency, Multi-layer artificial neural network, 85
122–127 Multi-layer neural networks, 85
emergent capabilities, 127–129 Multi-layer perceptrons (MLPs), 82,
emergent goal-directed behavior, 83–86, 93
129–132 activation functions, neurons, 84–85
emergent goals, tail risk, 132–134 additional learned parameters,
emergent optimizers, 131–132 neuron outputs, 84
evaluations and anomaly detection, hidden layer, 84
134–135 input layer, 83
explanations, 122–124 output layer, 83–84
intrinsification, risks, 133–134 single-layer and multi-layer
mechanistic interpretability, 124–125 networks, 85
mesa-optimization, risks, 132–133 toy example, 86
ML systems, opaque, 118–121 weights, neurons, 84
motivations, transparency research, Multimodal systems, 69
121–122 Mutual coercion, 398
Mutualism, 412 Optimization pressure, 139
My Favorite Theory (MFT), 354–355 Optimizers, 94
Order of magnitude, 106
N Organizational risks, 5, 28–30, 48, 50
Nadella, Satya, 20 accidents, hard to avoid, 30–32
Narrow AI, 51, 59–61 organizational factors, reduce
examples, 59–60 catastrophe chances, 32–36
vs. general AI, 60–61 Organizational safety, 29, 36, 48
Nash Equilibrium, 374–375 structure, 214
National governance, 472–480, 496 Organizational structure, 470–471
adversarial approaches, risks, board of directors, 470–471
477–478 organizational units, 471
export controls, 478 senior executives, 471
immigration policy, 478 Overgeneralization, 299
information security, 478–479 Ownership structure, 469–470
intelligence collection, 478 customized ownership structures,
liability, AI harms, 474–475 470
overview, 472 powers of shareholders, 470
public ownership, 476 types of shareholders, 469–470
resilience, improving, 476–477
standards and regulations, 472–474 P
targeted taxation, 475 Page, Larry, 9
National governments, 450–451 Parameters, 104
Natural language processing (NLP), 58 Pareto improvements, 316
breakthroughs in, 58 Pareto inefficiency, 396
Natural selection, 24, 27, 369, 433, 444 Partial encoding, 257, 262
Negative feedback mechanisms, 190, 193, Participatory design, 303–304
236 PERMA theory, 323
Neural networks, 78, 82, 120–121, Perrow, Charles, 30
127–128, 143, 251, 255, 257–258 Personal protective equipment (PPE),
as matrix multiplication, 85–86 232
Nines of reliability, 186–189, 188, 235, Persuasive AIs, 10–11
236 information ecosystem, 10
Non-accumulable factors, 455 trusted information, control, 11
Non-AI-specific liability, 474 users’ trust and, 10–11
Non-biological domains, 433 Petrov, Stanislav, 17
Nonlinearity, 250–253, 264 Phenomenal consciousness, 340
Nonprofits, 450 Piketty, Thomas, 313
Normal Accident Theory (NAT), 208, Pinker, Steven, 414
209 Political systems, 265
Polysemantic neurons, 120
O Popper, Karl, 441
Objective goods, 323 Population growth, 453, 454
Offense-defense balance, 164–165 Positive feedback loop, 251, 454
Open-source AI models, 460 Potential flaws, 33
Power inequality, 48
Power laws, 102–104, 103
Power shifts, 368
Power-seeking, 43–45
    behavior, 159
    instrumentally rational, 160–162
Precision, 72–73, 73
Predictability, 224
Predictive power, 75, 122
Preferences, 326–335
    idealized, 332–335
    revealed, 326–327, 329
    stated, 329–332
Preparedness paradox, 233
Pre-training, 97
Price, George R., 434
Price Equation, 434–436
Principal-agent problems, 141
Principle of least privilege, 189, 191, 236
Prioritarian social welfare functions, 348–350
Prisoner’s Dilemma, 372–380
    game fundamentals, 372–374
    Pareto improvement, 375, 375, 376
    promoting cooperation, 377–380
    real-world examples, 376–377
Problem of aggregation, 342
Productivity effect, 458
Proportional chances voting, 357
Prospective power, 159
Proxy gaming, 38–41, 137–140, 139, 185, 268
Proxy-purpose distinction, 317
Public Benefit Corporation, 465
Public health, 398
    campaigns, 277
Punctuated equilibrium, 255–256, 262, 264
Punishment, 381
Putin, Vladimir, 44

Q
Quality-adjusted life years (QALYs), 325
“Quid pro quo,” 26

R
Racing dynamics, 367
“Radium Girls,” 31
Random forests, 57
Randomness, 224
Rasmussen’s Risk Management Framework (RMF), 212, 213, 236
Rational agents, 367–368
Rational cooperation, 381, 383
Rawls, John, 300
Rawlsian maximin function, 343, 349, 350
R&D activities, 455
Real-world actors, 386–387
Recall, 72, 73
Recurrent neural networks (RNNs), 99
Red teams, 235
Redistribution policy, 475
Redundancy, 189, 190, 236, 359
Redundant encoding, 257, 262
Referent power, 158
Regulatory changes, 303
Reinforcement learning (RL), 43, 69, 77–78, 129–131, 168
Reinforcement learning with human feedback (RLHF), 331, 332
Relative reproductive success, 436
Reliability, 75, 186–189
Remote-controlled weapons, 15
Renewable Heat Incentive Scandal, 273
Representation control, 168
Representation engineering, 126–127
Representation learning, 67–68
Residual connections, 89–91, 90
Resilience, 476–477
ResNets (Residual Networks), 100
Restricted AI models, 460
Retention, 438
Revealed preferences, 326–327, 329
    inverse reinforcement learning, 328–329
    manipulation, 328
    misinformation, 327
    weakness of will, 327–328
Reward, 381–382
    power, 158
Risk, connections, 48–49
Risk decomposition, 180–186
    classic risk equation, 181–182
    disaster risk equation, 182–186
    failure modes, hazards, and threats, 180–181
    framing goal, risk reduction, 182
    risk equation, elements, 183–184
Risk of conflict, 17–18
Risk of extinction, 18
Risk-related decisions, 231
Robustness, 117, 135–136
    adaptive proxies, 143
    adaptivity, lack, 143
    adversarial examples, 143–147
    AI evaluator gaming, tail risk, 148–150
    proxies, machine learning, 136–140
    proxy defects, types, 140–141
    proxy gaming, 137–140, 139
    supervision, limits, 141–143
    trojan attacks and security threats, 147–148
Rogue AIs, 6, 9, 37–38
    deception, 45–47
    goal drift, 41–43
    power-seeking, 43–45
    proxy gaming, 38–41
Rosenblatt, Frank, 55
Rube Goldberg machine, 242, 242
Rules to restrict driving, 274–275
Russell, Stuart, 43, 290, 291
Rutherford, Lord, 31

S
Safe design principles, 35–36, 189–190, 236, 238
    antifragility, 192–193
    defense in depth, 194–195
    fail-safes, 191–192
    negative feedback mechanisms, 193
    principle of least privilege, 191
    redundancy, 190
    review, 195
    separation of duties, 190–191
    transparency, 193–194
Safety culture, 32–33, 207, 479
Safety Engineering, xvii, 179–239
    accidents without failure, 202
    Bow tie model, 197–198
    chain-of-events reasoning, 200–201
    complex and sociotechnical systems, 201–202
    component failure accident models, 179–180
    drift into failure and existential risks, 216–218
    fault tree analysis (FTA) method, 198–200
    indirect causality, 203–206
    measuring reliability, 179
    nines of reliability, 186–189
    nonlinear causality, 202–203
    risk decomposition, 179, 180–186
    safe design principles, 179–180, 189–190
    Swiss Cheese Model, 195–197
    systemic factors, 206–216
    systemic factors and systemic accident models, 180
    tail events and black swans, 180, 218–235
Safety imagination, 235
Safety intervention, 34–35
Safety regulations, 50, 206
Safety research, 172
Safety structure, dynamic deterioration, 214
Safetywashing, 35
Sagan, Carl, 5
Scalability, 75
    of events, 223
Scaling laws, 52, 102–107, 112
    Chinchilla scaling law, 105–106
    in deep learning, 104–107
    power laws, 102–104, 103
Schillace, Sam, 20
Schmidhuber, Jürgen, 9
Security dilemma model, 389
Security mindset, 33
Self-defeating prophecies, 233
Self-fulfilling prophecies, 299
Self-help system, 163
Self-improvement, 454
Selfish behaviors, 24–25, 27
Selfish characteristics, 24
The Selfish Gene (Dawkins), 440
Self-organization, 253–254, 263
Self-organized criticality, 254–256
Self-preservation, 43–44, 161, 163, 165
Self-reinforcing feedback loop, 252–253
Sen, Amartya, 315
Sentience, 340
Separation of duties, 189, 190–191, 236
Sequential decision-making (SDM), 68–69
Shareholder activism, 470
Shinrikyo, Aum, 6
Shopkeeper price cuts, 377
Silent law, 291–292
Simplicity, 123
Single-agent safety, xvii, 117–177
Skip connections, see Residual connections
Smith, Adam, 306
Social contract theory, 353
Social pressure, 206
Social systems, 261–266
Social welfare functions, 342–352
    measuring, 344–346
    prioritarian social welfare functions, 348–350
    utilitarian social welfare functions, 346–348
Social wellbeing, 326, 342
Speed, AI development, 107–110, 112–113
    AGI, 109
    HLAI, 108–110
Stated preferences, 329–332
    human supervision, 331–332
    preference accounting, 330–331
Statistical parity, 300
Step tokenization, 97
Stochastic Gradient Descent, 131
Structural errors, 140, 141
Structural realism, 163–165, 417
Structured access approach, 460
Substrate independence, 340
Super-beneficiaries, 341
Superconductivity, 249
Superhuman performance, 63
Superintelligence (ASI), 63–64
    risks, 63–64
Supervised learning, 76
Sutton, Richard, 9
Swiss cheese model, 35, 35, 36, 195–197
Systemic factors, 206–216
    systemic accident models, 207–216
Systemic models, 216
Systemic safety, 118, 169–171, 185
    issues, 238–239
System-Theoretic Accident Model and Processes (STAMP), 213–215, 215, 236
    knowledge and communication, 214
System-Theoretic Process Analysis (STPA), 215, 236
Szilard, Leo, 31

T
Tail events, 218–235, 238
    affect the average risk, 219–221, 220
    caricature of, 222–224, 223
    identified from frequency distributions, 221–222
    identifying risk of, 233–235
    implications, risk analysis, 228–233
Tail risks, 219
Tall forests, 365–366
Task allocation, 254
Tautology, 161
Technical foundations, 51
Technological advancements, 4
Technological progress, 461
Technosolutionism, 206
Terminal goal, 160
Thalidomide, 31
Thin-tailed scenarios, 231
Third revolution, warfare, 14
Thompson, Ken, 38
“Three Laws of Robotics” (Asimov), 293–294
Tipping point, 255
Tit-for-tat strategy, 384–385
Top-down misuse, 463
Totalitarian regime, 11–12
Toxicity, 304
Traffic jams, 365
Transformative AI (TAI), 51, 59, 62
Transformers, 100
Transparency, 121, 189, 193–194, 236, 359
Treacherous turn, 46–48
Trojan attacks, 147
Trojan detection research, 148
Tularemia, 7
Turing, Alan, 54

U
Uncertainty, 383
Unilateral actors, 6–7
Unitary actor assumption, 420
Unleashing AI agents, 9–10
Unsupervised learning, 76
Utilitarian approach, 343
Utilitarian social welfare functions, 346–348

V
Valenced consciousness, 340
Value pluralism, 316
Variation, 437
Verification, 488
Voting, 345

W
Wages, 459
Warning signs, roads, 273
Weak AI, see Narrow AI
Weak safety culture, 36–37
Weapon of mass destruction, 8
Weizenbaum, Joseph, 55–56
Welfare economics, 306, 322
Wellbeing, 287, 320–326, 336, 337, 341
    applying theories of, 323–325
    balance of pleasure over pain, 320
    measuring, 325–326
    objective goods collection, 321
    as preference satisfaction, 321
    in social science, 321–323
Wellbeing-adjusted life years (WELBYs), 325–326
Wicked problems, 271–272, 280
Wireheading concept, 335
“Wisdom of crowds,” 257

Y
Yellow journalism, 10

Z
Zero-shot learning, 98