INTRODUCTION TO THE
AI INDEX REPORT 2022
Welcome to the fifth edition of the AI Index Report! The latest edition includes data from a broad set of academic,
private, and nonprofit organizations as well as more self-collected data and original analysis than any previous
edition, including an expanded technical performance chapter, a new survey of robotics researchers around the
world, data on global AI legislation records in 25 countries, and a new chapter with an in-depth analysis of technical
AI ethics metrics.
The AI Index Report tracks, collates, distills, and visualizes data related to artificial intelligence. Its mission is to
provide unbiased, rigorously vetted, and globally sourced data for policymakers, researchers, executives, journalists,
and the general public to develop a more thorough and nuanced understanding of the complex field of AI. The report
aims to be the world’s most credible and authoritative source for data and insights about AI.
FROM THE CO-DIRECTORS
This year’s report shows that AI systems are starting to be deployed widely into the economy, but at the same time
they are being deployed, the ethical issues associated with AI are becoming magnified. Some of this is natural—after
all, we tend to care more about the ethical aspects of a given technology when it is being rolled out into the world.
But some of it is bound up in the peculiar traits of contemporary AI—larger and more complex and capable AI systems
can generally do better on a broad range of tasks while also displaying a greater potential for ethical concerns.
This is bound up with the broad globalization and industrialization of AI—a larger range of countries are developing,
deploying, and regulating AI systems than ever before, and the combined outcome of these activities is the creation
of a broader set of AI systems available for people to use, and reductions in their prices. Some parts of AI are not very
globalized, though, and our ethics analysis reveals that many AI ethics publications tend to concentrate on English-
language systems and datasets, despite AI being deployed globally.
If anything, we expect the above trends to continue: 103% more private money was invested in AI and AI-related startups in 2021 than in 2020 ($93.5 billion versus $46 billion).
TOP TAKEAWAYS
Private investment in AI soared while investment concentration intensified:
• The private investment in AI in 2021 totaled around $93.5 billion—more than double the total private investment in 2020, while the number of newly funded AI companies continues to drop, from 1051 companies in 2019 and 762 companies in 2020 to 746 companies in 2021. In 2020, there were 4 funding rounds worth $500 million or more; in 2021, there were 15.
Language models are more capable than ever, but also more biased:
• Large language models are setting new records on technical benchmarks, but new data shows that larger models are also more capable of reflecting biases from their training data. A 280 billion parameter model developed in 2021 shows a 29% increase in elicited toxicity over a 117 million parameter model considered the state of the art as of 2018. The systems are growing significantly more capable over time, though as they increase in capabilities, so does the potential severity of their biases.
Steering Committee

Co-Directors
Jack Clark (Anthropic, OECD)
Raymond Perrault (SRI International)

Members
Erik Brynjolfsson (Stanford University)
John Etchemendy (Stanford University)
Terah Lyons
James Manyika (Google, University of Oxford)
Juan Carlos Niebles (Stanford University, Salesforce)
Michael Sellitto (Stanford University)
Yoav Shoham, Founding Director (Stanford University, AI21 Labs)
Affiliated Researchers
Andre Barbe (The World Bank)
Helen Ngo (Cohere)

Graduate Researcher
Benjamin Bronkema-Bekker (Stanford University)
The AI Index was conceived within the One Hundred Year Study on AI (AI100).
Supporting Partners
Contributors
We want to acknowledge the following individuals by chapter and section for their contributions
of data, analysis, advice, and expert commentary included in the AI Index 2022 Report:
Technical Performance
Jack Clark, David Kanter, Nestor Maslej, Deepak Narayanan, Juan Carlos Niebles, Konstantin Savenkov,
Yoav Shoham, Daniel Zhang
Technical AI Ethics
Jack Clark, Nestor Maslej, Helen Ngo, Ray Perrault, Ellie Sakhaee, Daniel Zhang
Conference Attendance
Terri Auricchio (ICML), Christian Bessiere (IJCAI), Meghyn Bienvenu (KR), Andrea Brown (ICLR),
Alexandra Chouldechova (FAccT), Nicole Finn (ICCV, CVPR), Enrico Gerding (AAMAS), Carol Hamilton (AAAI),
Seth Lazar (FAccT), Max Qing Hu Meng (ICRA), Jonas Martin Peters (UAI), Libor Preucil (IROS),
Marc’Aurelio Ranzato (NeurIPS), Priscilla Rasmussen (EMNLP, ACL), Hankz Hankui Zhuo (ICAPS)
Robotics Survey
Pieter Abbeel, David Abbink, Farshid Alambeigi, Farshad Arvin, Nikolay Atanasov, Ruzena Bajcsy, Philip Beesley,
Tapomayukh Bhattacharjee, Jeannette Bohg, David J. Cappelleri, Qifeng Chen, I-Ming Chen, Jack Cheng,
Cynthia Chestek, Kyujin Cho, Dimitris Chrysostomou, Steve Collins, David Correa, Brandon DeHart,
Katie Driggs-Campbell, Nima Fazeli, Animesh Garg, Maged Ghoneima, Tobias Haschke, Kris Hauser, David Held,
Yue Hu, Josie Hughes, Soo Jeon, Dimitrios Kanoulas, Jonathan Kelly, Oliver Kroemer, Changliu Liu, Ole Madsen,
Anirudha Majumdar, Genaro J. Martinez, Saburo Matunaga, Satoshi Miura, Norrima Mokhtar, Elena De Momi,
Chrystopher Nehaniv, Christopher Nielsen, Ryuma Niiyama, Allison Okamura, Necmiye Ozay, Jamie Paik,
Frank Park, Karthik Ramani, Carolyn Ren, Jan Rosell, Jee-Hwan Ryu, Tim Salcudean, Oliver Schneider,
Angela Schoellig, Reid Simmons, Alvaro Soto, Peter Stone, Michael Tolley, Tsu-Chin Tsao, Michiel van de Panne,
Andy Weightman, Alexander Wong, Helge Wurdemann, Rong Xiong, Chao Xu, Geng Yang, Junzhi Yu,
Wenzhen Yuan, Fu Zhang, Yuke Zhu
We thank the following organizations and individuals who provided data for
inclusion in the AI Index 2022 Report:
Organizations
Bloomberg Government: Amanda Allen, Cameron Leuthy
Intento: Grigory Sapunov, Konstantin Savenkov
We also would like to thank Jeanina Casusi, Nancy King, Shana Lynch, Jonathan Mindes,
Stacy Peña, Michi Turner, and Justin Sherman for their help in preparing this report, and
Joe Hinman, Travis Taylor, and the team at Digital Avenues for their efforts in designing and
developing the AI Index and HAI websites.
Table of Contents
REPORT HIGHLIGHTS
APPENDIX
REPORT HIGHLIGHTS
CHAPTER 1: RESEARCH AND DEVELOPMENT
• Despite rising geopolitical tensions, the United States and China had the greatest number of cross-country
collaborations in AI publications from 2010 to 2021, increasing five times since 2010. The collaboration between
the two countries produced 2.7 times more publications than between the United Kingdom and China—the
second highest on the list.
• In 2021, China continued to lead the world in the number of AI journal, conference, and repository
publications—63.2% higher than the United States with all three publication types combined. In the meantime,
the United States held a dominant lead among major AI powers in the number of AI conference and repository
citations.
• From 2010 to 2021, the collaboration between educational and nonprofit organizations produced the
highest number of AI publications, followed by the collaboration between private companies and educational
institutions and between educational and government institutions.
• The number of AI patents filed in 2021 is more than 30 times higher than in 2015, showing a compound
annual growth rate of 76.9%.
CHAPTER 2: TECHNICAL PERFORMANCE
• Data, data, data: Top results across technical benchmarks have increasingly relied on the use of extra training
data to set new state-of-the-art results. As of 2021, 9 state-of-the-art AI systems out of the 10 benchmarks
in this report are trained with extra data. This trend implicitly favors private sector actors with access to vast
datasets.
• Rising interest in particular computer vision subtasks: In 2021, the research community saw a greater level
of interest in more specific computer vision subtasks, such as medical image segmentation and masked-face
identification. For example, only 3 research papers tested systems against the Kvasir-SEG medical imaging
benchmark prior to 2020. In 2021, 25 research papers did. Such an increase suggests that AI research is
moving toward research that can have more direct, real-world applications.
• AI has not mastered complex language tasks, yet: AI already exceeds human performance levels on basic
reading comprehension benchmarks like SuperGLUE and SQuAD by 1%–5%. Although AI systems are still
unable to achieve human performance on more complex linguistic tasks such as abductive natural
language inference (aNLI), the difference is narrowing. Humans performed 9 percentage points better on
aNLI in 2019. As of 2021, that gap has shrunk to 1.
• Turn toward more general reinforcement learning: For the last decade, AI systems have been able to master
narrow reinforcement learning tasks in which they are asked to maximize performance in a specific skill, such
as chess. The top chess software engine now exceeds Magnus Carlsen’s top ELO score by 24%. However, in
the last two years AI systems have also improved by 129% on more general reinforcement learning tasks
(Procgen) in which they must operate in novel environments. This trend speaks to the future development of
AI systems that can learn to think more broadly.
• AI becomes more affordable and higher performing: Since 2018, the cost to train an image classification
system has decreased by 63.6%, while training times have improved by 94.4%. The trend of lower training costs and faster training times appears across other MLPerf task categories such as recommendation, object detection, and language processing, and favors the more widespread commercial adoption of AI technologies.
• Robotic arms are becoming cheaper: An AI Index survey shows that the median price of robotic arms has
decreased by 46.2% in the past five years—from $42,000 per arm in 2017 to $22,600 in 2021. Robotics
research has become more accessible and affordable.
CHAPTER 3: TECHNICAL AI ETHICS
• Language models are more capable than ever, but also more biased: Large language models are setting new
records on technical benchmarks, but new data shows that larger models are also more capable of reflecting
biases from their training data. A 280 billion parameter model developed in 2021 shows a 29% increase in
elicited toxicity over a 117 million parameter model considered the state of the art as of 2018. The systems
are growing significantly more capable over time, though as they increase in capabilities, so does the potential
severity of their biases.
• The rise of AI ethics everywhere: Research on fairness and transparency in AI has exploded since 2014, with a
fivefold increase in related publications at ethics-related conferences. Algorithmic fairness and bias has shifted
from being primarily an academic pursuit to becoming firmly embedded as a mainstream research topic with
wide-ranging implications. Researchers with industry affiliations contributed 71% more publications year
over year at ethics-focused conferences in recent years.
• Multimodal models learn multimodal biases: Rapid progress has been made on training multimodal language-
vision models which exhibit new levels of capability on joint language-vision tasks. These models have set new
records on tasks like image classification and the creation of images from text descriptions, but they also reflect
societal stereotypes and biases in their outputs—experiments on CLIP showed that images of Black people
were misclassified as nonhuman at over twice the rate of any other race. While there has been significant
work to develop metrics for measuring bias within both computer vision and natural language processing, this
highlights the need for metrics that provide insight into biases in models with multiple modalities.
CHAPTER 4: THE ECONOMY AND EDUCATION
• New Zealand, Hong Kong, Ireland, Luxembourg, and Sweden are the countries or regions with the highest growth
in AI hiring from 2016 to 2021.
• In 2021, California, Texas, New York, and Virginia were states with the highest number of AI job postings in the
United States, with California having over 2.35 times the number of postings as Texas, the second greatest.
Washington, D.C., had the greatest rate of AI job postings compared to its overall number of job postings.
• The private investment in AI in 2021 totaled around $93.5 billion—more than double the total private
investment in 2020, while the number of newly funded AI companies continues to drop, from 1051 companies in
2019 and 762 companies in 2020 to 746 companies in 2021. In 2020, there were 4 funding rounds worth $500
million or more; in 2021, there were 15.
• “Data management, processing, and cloud” received the greatest amount of private AI investment in 2021—
2.6 times the investment in 2020, followed by “medical and healthcare” and “fintech.”
• In 2021, the United States led the world in both total private investment in AI and the number of newly funded AI
companies, three and two times higher, respectively, than China, the next country on the ranking.
• Efforts to address ethical concerns associated with using AI in industry remain limited, according to a McKinsey
survey. While 29% and 41% of respondents recognize “equity and fairness” and “explainability” as risks
while adopting AI, only 19% and 27% are taking steps to mitigate those risks.
• In 2020, 1 in every 5 CS students who graduated with PhD degrees specialized in artificial intelligence/
machine learning, the most popular specialty in the past decade. From 2010 to 2020, the majority of AI PhDs in
the United States headed to industry while a small fraction took government jobs.
CHAPTER 5: AI POLICY AND GOVERNANCE
• An AI Index analysis of legislative records on AI in 25 countries shows that the number of bills containing
“artificial intelligence” that were passed into law grew from just 1 in 2016 to 18 in 2021. Spain, the United
Kingdom, and the United States passed the highest number of AI-related bills in 2021, with each adopting three.
• The federal legislative record in the United States shows a sharp increase in the total number of proposed bills
that relate to AI from 2015 to 2021, while the number of bills passed remains low, with only 2% ultimately
becoming law.
• State legislators in the United States passed 1 out of every 50 proposed bills that contain AI provisions in 2021,
while the number of such bills proposed grew from 2 in 2012 to 131 in 2021.
• In the United States, the current congressional session (the 117th) is on track to record the greatest number of
AI-related mentions since 2001, with 295 mentions by the end of 2021, halfway through the session,
compared to 506 in the previous (116th) session.
CHAPTER 1: Research & Development

Chapter Preview
Overview
Chapter Highlights

1.1 PUBLICATIONS
    Overview
    Total Number of AI Publications
    By Type of Publication
    By Field of Study
    By Sector
    Cross-Country Collaboration
    Cross-Sector Collaboration
    AI Journal Publications
    AI Repositories
        Overview
        By Region
        By Geographic Area
        Citations
    AI Patents
        Overview
        By Region and Application Status
        By Geographic Area and Application Status

1.2 CONFERENCES
Overview
Research and development is an integral force that drives the rapid progress
of artificial intelligence (AI). Every year, a wide range of academic, industry,
government, and civil society experts and organizations contribute to AI
R&D via a slew of papers, journal articles, and other AI-related publications,
conferences on AI or on particular subtopics like image recognition or
natural language processing, international collaboration across borders, and
the development of open-source software libraries. These R&D efforts are
diverse in focus and geographically dispersed.
Another key feature of AI R&D, making it somewhat distinct from other areas
of STEM research, is its openness. Each year, thousands of AI publications are released openly, whether at conferences or on file-sharing websites. Researchers openly share their findings at
conferences; government agencies will fund AI research that ends up in the
open source; and developers use open software libraries, freely available to
the public, to produce state-of-the-art AI applications. This openness also
contributes to the globally interdependent and interconnected nature of
modern AI R&D.
This first chapter draws on multiple datasets to analyze key trends in the
AI research and development space in 2021. It first looks at AI publications,
including conference papers, journal articles, patents, and repositories.
It then analyzes AI conference attendance. And finally, it examines
AI open-source software libraries used in the R&D process.
CHAPTER HIGHLIGHTS
• Despite rising geopolitical tensions, the United States and China had the greatest number of
cross-country collaborations in AI publications from 2010 to 2021, increasing five times since
2010. The collaboration between the two countries produced 2.7 times more publications
than between the United Kingdom and China—the second highest on the list.
• In 2021, China continued to lead the world in the number of AI journal, conference, and
repository publications—63.2% higher than the United States with all three publication types
combined. In the meantime, the United States held a dominant lead among major AI powers
in the number of AI conference and repository citations.
• From 2010 to 2021, the collaboration between educational and nonprofit organizations
produced the highest number of AI publications, followed by the collaboration between
private companies and educational institutions and between educational and government
institutions.
• The number of AI patents filed in 2021 is more than 30 times higher than in 2015, showing a
compound annual growth rate of 76.9%.
This section draws on data from the Center for Security and Emerging Technology (CSET) at Georgetown University. CSET maintains
a merged corpus of scholarly literature that includes Digital Science’s Dimensions, Clarivate’s Web of Science, Microsoft Academic
Graph, China National Knowledge Infrastructure, arXiv, and Papers with Code. In that corpus, CSET applied a classifier to identify
English-language publications related to the development or application of AI and ML since 2010.1
1.1 PUBLICATIONS2
OVERVIEW
The figures below capture the total number of English-language AI publications globally from 2010 to 2021—by type, affiliation, cross-country collaboration, and cross-industry collaboration. The section also breaks down publication and citation data by region for AI journal articles, conference papers, repositories, and patents.

Total Number of AI Publications
Figure 1.1.1 shows the number of AI publications in the world. From 2010 to 2021, the total number of AI publications doubled, growing from 162,444 in 2010 to 334,497 in 2021.
Figure 1.1.1: Number of AI publications in the world (in thousands), 2010–21 (2021: 334.50).
1 See the Appendix for more information on CSET’s methodology. Due to the change in data provider and classification method, the publication trend/data may be different from past reports. For more
on the challenge of defining AI and correctly capturing relevant bibliometric data, see the AI Index team’s discussion in the paper “Measurement in AI Policy: Opportunities and Challenges.”
2 The number of AI publications in 2021 in this section may be lower than the actual count due to the lag in the collection of publication metadata by aforementioned databases.
Figure 1.1.3 (and preceding chart data): Number of AI publications (in thousands), 2010–21. Selected 2021 values by type of publication: 71.92 Conference, 56.73 Repository; by field of study: 18.30 Algorithm, 15.27 Data Mining, 13.43 Natural Language Processing.
Figure 1.1.4a: AI publications (% of total) by sector, 2010–21 (2021: 11.27% nonprofit, 5.21% company, 3.17% government).
Figure 1.1.4b: AI publications (% of total) by sector, 2010–21 (2021: 57.63% education, 17.63% unknown, 12.36% nonprofit, 9.76% company, 2.62% government).
3 The categorization is adapted based on the Global Research Identifier Database (GRID). See definitions of each category here. Healthcare, including hospitals and facilities, is included under nonprofit. Publications affiliated with state-sponsored universities are included in the education sector.
Figure 1.1.4c: AI publications (% of total) by sector, 2010–21 (2021: 23.74% unknown, 8.47% nonprofit, 3.93% company, 3.62% government).
Figure 1.1.4d: AI publications (% of total) by sector, 2010–21 (2021: 54.81% education, 5.68% company, 3.61% government).
Cross-Country Collaboration
Cross-border collaborations between academics, researchers, industry experts, and others are a key component of modern STEM development that accelerates the dissemination of new ideas and the growth of research teams. Figures 1.1.5a and 1.1.5b depict the top cross-country AI collaborations from 2010 to 2021. CSET counted cross-country collaborations as distinct pairs of countries across authors for each publication (e.g., four U.S.- and four Chinese-affiliated authors on a single publication are counted as one U.S.-China collaboration; two publications between the same authors count as two collaborations).

By far, the greatest number of collaborations in the past 12 years took place between the United States and China, increasing fivefold since 2010.
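As an illustration of this counting rule (a minimal sketch, not CSET's actual code), the following Python snippet counts distinct country pairs per publication from hypothetical author-affiliation lists:

    from itertools import combinations
    from collections import Counter

    # Hypothetical publications, each listing the countries of its authors' affiliations.
    publications = [
        ["USA", "USA", "CHN", "CHN"],  # counted as one USA-CHN collaboration
        ["USA", "CHN"],                # a second publication by the same pair counts again
        ["GBR", "CHN", "CHN"],
    ]

    pair_counts = Counter()
    for countries in publications:
        # Each distinct pair of countries on a publication counts once for that publication.
        for pair in combinations(sorted(set(countries)), 2):
            pair_counts[pair] += 1

    print(pair_counts)  # Counter({('CHN', 'USA'): 2, ('CHN', 'GBR'): 1})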
Figure 1.1.5a: Number of AI publications (in thousands) from top cross-country collaborations, 2010–21 (peak 2021 value: 9.66).
Figure 1.1.5b: Number of AI publications (in thousands) from top cross-country collaborations, 2010–21.
Figure 1.1.7: Number of AI journal publications (in thousands), 2010–21 (2021: 172.11).
Figure 1.1.8: AI journal publications (% of total journal publications), 2010–21 (2021: 2.53%).
Figure 1.1.9: AI journal publications (% of world total) by region, 2010–21 (2021: 7.99% South Asia, 6.18% Middle East and North Africa, 3.57% Latin America and the Caribbean, 1.06% Sub-Saharan Africa).
4 Regions in this chapter are classified according to the World Bank analytical grouping.
By Geographic Area5
Figure 1.1.10 breaks down the share of AI journal publications over the past 12 years by three major AI powers. China has remained the leader throughout, with 31.0% in 2021, followed by the European Union plus the United Kingdom at 19.1% and the United States at 13.7%.
Figure 1.1.10: AI journal publications (% of world total) by geographic area, 2010–21 (2021: 31.04% China).
5 Geographic areas in this chapter combine the number of publications between the European Union and the United Kingdom to reflect the historically strong association between them with regard to research collaboration.
Citations
On the number of citations of AI journal publications, China's share has gradually increased while those of the European Union plus the United Kingdom and the United States have decreased. The three geographic areas combined accounted for more than 66% of the total citations in the world.
Figure 1.1.11: AI journal citations (% of world total), 2010–21 (2021: 27.84% China).
Figure 1.1.12: Number of AI conference publications (in thousands), 2010–21 (2021: 71.92).
Figure 1.1.13 (chart data): 17.83% in 2021.
By Geographic Area
In 2021, China produced the greatest share of the world's AI conference publications at 27.6%, opening an even greater lead than in 2020, while the European Union plus the United Kingdom followed at 19.0% and the United States came in third at 16.9% (Figure 1.1.15).
Figure 1.1.15: AI conference publications (% of world total) by geographic area, 2010–21 (2021: 27.64% China, 18.95% European Union and United Kingdom).
Figure 1.1.16 (chart data): 15.32%, China, in 2021.
Figure 1.1.17: Number of AI repository publications (in thousands), 2010–21 (2021: 56.73).
Figure 1.1.18: AI repository publications (% of total repository publications), 2010–21 (2021: 15.30%). Source: Center for Security and Emerging Technology, 2021.
Figure 1.1.19 (chart data, 2021): 27.25% Europe and Central Asia, 26.38% East Asia and Pacific.
By Geographic Area
While the United States has held the lead in the percentage of AI repository publications in the world since 2011, China is catching up while the European Union plus the United Kingdom's share continues to drop (Figure 1.1.20). In 2021, the United States accounted for 32.5% of the world's AI repository publications—a higher percentage compared to journal and conference publications, followed by the European Union plus the United Kingdom (23.9%) and China (16.6%).
Figure 1.1.20: AI repository publications (% of world total) by geographic area, 2010–21 (2021: 16.60% China).
Figure 1.1.21: AI repository citations (% of world total), 2010–21 (2021: 38.60% United States, 16.44% China).
AI PATENTS
This section draws on data from CSET and 1790 Analytics on patents relevant to AI development and applications—indicated by Cooperative Patent Classification (CPC)/International Patent Classification (IPC) codes and keywords. Patents were grouped by country and year and then counted at the “patent family” level, before CSET extracted year values from the most recent publication date within a family.
Overview
Figure 1.1.22 captures the number of AI patents filed from
2010 to 2021. The number of patents filed in 2021 is more
than 30 times higher than in 2015, showing a compound
annual growth rate of 76.9%.
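As a rough check on how these two numbers relate, the compound annual growth rate implied by roughly a 30-fold increase over the six years from 2015 to 2021 can be reproduced in a few lines of Python (the exact patent counts behind the reported 76.9% are not restated here):

    # Compound annual growth rate (CAGR) implied by roughly a 30x increase over 2015-2021.
    growth_multiple = 30      # from the text: "more than 30 times higher"
    years = 2021 - 2015       # six compounding periods

    cagr = growth_multiple ** (1 / years) - 1
    print(f"{cagr:.1%}")      # ~76.3%, consistent with the reported 76.9% from the exact counts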
Figure 1.1.22: Number of AI patents filed (in thousands), 2010–21 (2021: 141.24).
By Region and Application Status
Figure 1.1.23a breaks down AI patent filings by region. The share of East Asia and Pacific took off in 2014 and led the rest of the world in 2021 with 62.1% of all patent applications, followed by North America and Europe and Central Asia. In terms of granted patents in those regions, North America leads with 57.0%, followed by East Asia and Pacific (31.0%), and Europe and Central Asia (11.3%) (Figure 1.1.23b). The other regions combine to make up roughly 1% of world patents (Figure 1.1.23c).
Figure 1.1.23a: AI patent filings (% of world total) by region, 2010–21 (2021: 17.07% North America).
Figure 1.1.23b: Granted AI patents (% of world total) by region, 2010–21 (2021: 56.96% North America).
Figure 1.1.23c: Granted AI patents (% of world total), other regions, 2010–21.
By Geographic Area and Application Status
Trends revealed by the regional analysis can also be observed in AI patent data broken down by geographic area (Figure 1.1.24a and Figure 1.1.24b). China is now filing over half of the world's AI patents and being granted about 6%, about the same as the European Union plus the United Kingdom. The United States, which files almost all the patents in North America, does so at one-third the rate of China. Figure 1.1.24c shows that, compared to the increasing numbers of AI patents applied for and granted, China has far greater numbers of patent applications (87,343 in 2021) than those granted (1,407 in 2021).
Figure 1.1.24a: AI patent filings (% of world total) by geographic area, 2010–21 (2021: 16.92% United States).
Figure 1.1.24b: Granted AI patents (% of world total) by geographic area, 2010–21 (2021: 39.59% United States, 7.56% European Union and United Kingdom, 5.90% China).
Figure 1.1.24c (chart data, in thousands): 19.61, 9.45, 4.88, 1.81, 1.41.
AI conferences are key venues for researchers to publish and communicate their work, as well as to connect with peers and
collaborators. Conference attendance is an indication of broader industrial and academic interest in a scientific field. In the past
20 years, AI conferences have grown not only in size but also in number and prestige. This section presents data on the trends in
attendance at major AI conferences, covering more conferences (16) than previous Index reports.
1.2 CONFERENCES
CONFERENCE ATTENDANCE
Similar to 2020, most AI conferences were offered virtually in 2021. Only the International Conference on Robotics and Automation (ICRA) and the Conference on Empirical Methods in Natural Language Processing (EMNLP) were held using a hybrid format. Conference organizers reported that measuring exact attendance numbers at a virtual conference is difficult, as virtual conferences allow for higher attendance of researchers from all around the world.
Figure 1.2.1 shows that attendance at top AI conferences in 2021 was relatively consistent with 2020, with more than 88,000 participants worldwide. Figure 1.2.2 and Figure 1.2.3 show the attendance data for individual conferences, with 16 major AI conferences split into two categories: large AI conferences with over 2,500 attendees and small AI conferences with fewer than 2,500 attendees.6
Figure 1.2.1: Number of attendees (in thousands) at select AI conferences, 2010–21 (2021: 88.76).
6 The International Conference on Machine Learning (ICML) used the number of session visitors as a proxy for the number of conference attendees, which explains the high attendance count in 2021.
The International Conference on Intelligent Robots and Systems (IROS) extended the virtual conference to allow users to watch events for up to three months, which explains the high attendance count
in 2020. For the AAMAS conference, the attendance in 2020 is based on the number of users on site reported by the platform that recorded the talks and managed the online conference, while the 2021
number is for total registrants.
Figure 1.2.2: Number of attendees (in thousands) at large AI conferences, 2010–21 (2021: 29.54 ICML, 17.09 NeurIPS, 8.24 CVPR, 6.31 ICLR, 5.01 ICCV, 4.09 AAAI, 3.76 EMNLP, 2.59 IROS).
Figure 1.2.3: Number of attendees (in thousands) at small AI conferences, 2010–21 (2021: 2.10 UAI, 1.90 IJCAI, 1.35 FAccT, 1.08 AAMAS, 1.00 ICRA, 0.67 KR, 0.51 ICAPS).
WOMEN IN MACHINE LEARNING (WIML) NEURIPS WORKSHOP
Founded in 2006, Women in Machine Learning is an organization dedicated to supporting and increasing the impact of women in machine learning. This section presents data from its annual technical workshop colocated with NeurIPS. Starting in 2020, WiML has also been hosting the Un-Workshop at ICML, which aims to advance research via collaboration and interaction among participants from diverse backgrounds.

Workshop Participants
The number of participants attending the WiML workshop has steadily increased since it was first introduced in 2006. For the 2021 edition, Figure 1.2.4 shows an estimate of 1,486 attendees over all workshop sessions, counted as the number of unique individuals who accessed the virtual workshop platform at neurips.cc. The 2021 WiML Workshop at NeurIPS happened as multiple sessions over three days, which was a change in format from 2020. As in 2020, the workshop was held virtually due to the pandemic.
Figure 1.2.4: Number of attendees at the WiML NeurIPS workshop, 2010–21 (2021: 1,486).
Demographics Breakdown
This section shows the continent of residence and professional position breakdowns of the 2021 workshop participants, based on a survey filled out by participants who consented to have such information aggregated. More than half of the survey respondents were from North America, followed by Europe (19.9%), Asia (16.2%), and Africa (7.3%) (Figure 1.2.5). Figure 1.2.6 shows that Ph.D. students made up almost half of the survey participants, while the share of university faculty is around 1.2%. Research scientists/engineers, data scientists/engineers, and software engineers were among the most commonly held professional positions.
Figure 1.2.5: Continent of residence of 2021 WiML workshop survey respondents: North America 53.40%, Europe 19.90%, Asia 16.20%, Africa 7.30%, South America 1.00%, Australia 1.00%.
A software library is a collection of computer code that is used to create applications and products. Popular AI-specific software
libraries—such as TensorFlow and PyTorch—help developers create their AI solutions quickly and efficiently. This section analyzes the
popularity of software libraries through GitHub data.
1.3 AI OPEN-SOURCE
SOFTWARE LIBRARIES
GITHUB STARS
Figures 1.3.1 and 1.3.2 reflect the number of users of GitHub open-source AI software libraries from 2015 to 2021. TensorFlow remained by far the most popular in 2021, with around 161,000 cumulative GitHub stars—a slight increase over 2020. TensorFlow was about three times as popular in 2021 as the next-most-starred GitHub open-source AI software library, OpenCV, which was followed by Keras, PyTorch, and Scikit-learn. Figure 1.3.2 shows library popularity for libraries with fewer than 40,000 GitHub stars—led by FaceSwap with around 40,000 stars, followed by 100-Days-Of-ML-Code, AiLearning, and BVLC/caffe.
Figure 1.3.1: Cumulative GitHub stars (in thousands), 2015–21 (2021: 58.6 OpenCV, 53.2 Keras, 52.7 PyTorch, 48.0 Scikit-learn, 46.7 DeepLearning-500-questions, 41.5 TensorFlow-Examples).
Figure 1.3.2: Cumulative GitHub stars (in thousands) for libraries with fewer than 40,000 stars, 2015–21 (2021: 39.88 faceswap, 33.58 100-Days-Of-ML-Code, 32.27 AiLearning, 32.14 BVLC/caffe, 32.02 Real-Time-Voice-Cloning, 32.00 deeplearningbook-chinese, 31.28 Deep Learning Papers Reading Roadmap, 30.26 DeepFaceLab).
CHAPTER 2: Technical Performance

Chapter Preview
Overview
Chapter Highlights

2.1 COMPUTER VISION—IMAGE
    Image Classification
        ImageNet
        ImageNet: Top-1 Accuracy
        ImageNet: Top-5 Accuracy
    Image Generation
        STL-10: Fréchet Inception Distance (FID) Score
        CIFAR-10: Fréchet Inception Distance (FID) Score
    Deepfake Detection
        FaceForensics++
        Celeb-DF
    Human Pose Estimation
        Leeds Sports Poses: Percentage of Correct Keypoints (PCK)
        Human3.6M: Average Mean Per Joint Position Error (MPJPE)
    Semantic Segmentation
        Cityscapes
    Medical Image Segmentation
        CVC-ClinicDB and Kvasir-SEG
    Face Detection and Recognition
        National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT)
        Face Detection: Effects of Mask-Wearing
        Face Recognition Vendor Test (FRVT): Face-Mask Effects
        Highlight: Masked Labeled Faces in the Wild (MLFW)
    Visual Reasoning
        Visual Question Answering (VQA) Challenge

2.2 COMPUTER VISION—VIDEO
    Activity Recognition
        Kinetics-400, Kinetics-600, Kinetics-700
        ActivityNet: Temporal Action Localization Task
    Object Detection
        Common Object in Context (COCO)
        You Only Look Once (YOLO)
    Visual Commonsense Reasoning (VCR)
Overview
This year, the technical performance chapter includes more analysis
than ever before of the technical progress in various subfields of
artificial intelligence, including trends in computer vision, language,
speech, recommendation, reinforcement learning, hardware, and
robotics. It uses a number of quantitative measurements, from
common AI benchmarks and prize challenges to a field-wide survey,
to highlight the development of top-performing AI systems.
CHAPTER HIGHLIGHTS
• Data, data, data: Top results across technical benchmarks have increasingly relied on the use of
extra training data to set new state-of-the-art results. As of 2021, 9 state-of-the-art AI systems
out of the 10 benchmarks in this report are trained with extra data. This trend implicitly favors
private sector actors with access to vast datasets.
• Rising interest in particular computer vision subtasks: In 2021, the research community saw
a greater level of interest in more specific computer vision subtasks, such as medical image
segmentation and masked-face identification. For example, only 3 research papers tested
systems against the Kvasir-SEG medical imaging benchmark prior to 2020. In 2021, 25
research papers did. Such an increase suggests that AI research is moving toward research that
can have more direct, real-world applications.
• AI has not mastered complex language tasks, yet: AI already exceeds human performance levels
on basic reading comprehension benchmarks like SuperGLUE and SQuAD by 1%–5%. Although
AI systems are still unable to achieve human performance on more complex linguistic tasks
such as abductive natural language inference (aNLI), the difference is narrowing. Humans
performed 9 percentage points better on aNLI in 2019. As of 2021, that gap has shrunk to 1.
• Turn toward more general reinforcement learning: For the last decade, AI systems have
been able to master narrow reinforcement learning tasks in which they are asked to maximize
performance in a specific skill, such as chess. The top chess software engine now exceeds
Magnus Carlsen’s top ELO score by 24%. However, in the last two years AI systems have also
improved by 129% on more general reinforcement learning tasks (Procgen) in which they
must operate in novel environments. This trend speaks to the future development of AI systems
that can learn to think more broadly.
• AI becomes more affordable and higher performing: Since 2018, the cost to train an image
classification system has decreased by 63.6%, while training times have improved by 94.4%.
The trend of lower training costs and faster training times appears across other MLPerf task categories such as recommendation, object detection, and language processing, and favors the
more widespread commercial adoption of AI technologies.
• Robotic arms are becoming cheaper: An AI Index survey shows that the median price of
robotic arms has decreased by 46.2% in the past five years—from $42,000 per arm in 2017
to $22,600 in 2021. Robotics research has become more accessible and affordable.
Computer vision is the subfield of AI that teaches machines to understand images and videos. There is a wide range of computer
vision tasks, such as image classification, object recognition, semantic segmentation, and face detection. As of 2021, computers can
outperform humans on a plethora of computer vision tasks. Computer vision technologies have a variety of important real-world
applications, such as autonomous driving, crowd surveillance, sports analytics, and video-game creation.
ImageNet
ImageNet is a database that includes over 14 million images across 20,000 categories, publicly available to researchers working on image classification problems. Created in 2009, ImageNet is now one of the most common ways scientists benchmark improvement on image classification.
Figure 2.1.1
ImageNet: Top-1 Accuracy
Benchmarking on ImageNet is measured through accuracy metrics, which quantify how frequently AI systems assign the right labels to the given images. Top-1 accuracy measures the rate at which the top prediction made by a classification model for a given image matches the image's actual target label. In recent years, it has become increasingly common to improve the performance of systems on ImageNet by pretraining them with additional data from other image datasets.
As of late 2021, the top image classification system makes on average 1 error for every 10 classification attempts on Top-1 accuracy, compared to an average of 4 errors for every 10 attempts in late 2012 (Figure 2.1.2). In 2021, the top pretrained system was CoAtNets, produced by researchers on the Google Brain Team.

ImageNet: Top-5 Accuracy
Top-5 accuracy considers whether any of the model's 5 highest-probability answers align with the image label. As highlighted in Figure 2.1.3, AI systems presently achieve near-perfect Top-5 estimation. Currently, the state-of-the-art performance on Top-5 accuracy with pretraining is 99.0%, achieved in November 2021 by Microsoft Cloud and Microsoft AI's Florence-CoSwim-H model.
Improvements in Top-5 accuracy on ImageNet seem to be plateauing, which is perhaps unsurprising. If your system is classifying correctly 98 or 99 out of 100 times, there is only so much higher you can go.
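For readers unfamiliar with these metrics, the sketch below (an illustrative implementation, not the evaluation code behind the ImageNet leaderboard) shows how Top-1 and Top-5 accuracy can be computed from a model's class probabilities with NumPy:

    import numpy as np

    def topk_accuracy(probs, labels, k):
        """probs: (n_samples, n_classes) predicted probabilities; labels: (n_samples,) true class ids."""
        topk = np.argsort(probs, axis=1)[:, -k:]          # indices of the k most probable classes
        hits = np.any(topk == labels[:, None], axis=1)    # true if the correct class is among them
        return hits.mean()

    # Tiny hypothetical example: 3 samples, 4 classes.
    probs = np.array([[0.10, 0.60, 0.20, 0.10],
                      [0.40, 0.10, 0.20, 0.30],
                      [0.05, 0.05, 0.10, 0.80]])
    labels = np.array([1, 3, 3])
    print(topk_accuracy(probs, labels, k=1))  # Top-1: 0.666..., two of three samples correct
    print(topk_accuracy(probs, labels, k=2))  # Top-2: 1.0, every true label is in the top two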
Figure 2.1.2: Top-1 accuracy on ImageNet, 2012–21.
Figure 2.1.3: Top-5 accuracy on ImageNet, 2012–21.
IMAGE GENERATION
Image generation is the task of generating images that are indistinguishable from real ones. Image generation can be widely useful in generative domains where visual content has to be created, for example entertainment (companies like NVIDIA have already used image generators to create virtual worlds for gaming), fashion (designers can let AI systems generate different design patterns), and healthcare (image generators can synthetically create novel drug compounds). Figure 2.1.4 illustrates progress made in image generation by presenting several human faces that were synthetically generated by AI systems in the last year.
STL-10: Fréchet Inception Distance (FID) Score
The Fréchet Inception Distance score tracks the similarity between an artificially generated set of images and the real images from which it was generated. A low score means that the generated images are more similar to the real ones, and a score of zero indicates that the fake images are identical to the real ones.
Figure 2.1.5 documents the gains generative models have made in FID on the STL-10 dataset, one of the most widely cited datasets in computer vision. The state-of-the-art model on STL-10, developed by researchers at the Korea Advanced Institute of Science and Technology as well as the University of Seoul, posted a FID score of 7.7, significantly better than the state-of-the-art result from 2020.
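For reference, FID is computed from the means and covariances of Inception-network feature embeddings of the real and generated images. The sketch below (a simplified illustration operating on precomputed feature matrices, not the full benchmark pipeline) implements the standard formula ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2(C_r C_f)^(1/2)):

    import numpy as np
    from scipy import linalg

    def frechet_inception_distance(feats_real, feats_fake):
        """feats_*: (n_samples, n_features) Inception activations for real and generated images."""
        mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_f = np.cov(feats_fake, rowvar=False)
        covmean = linalg.sqrtm(cov_r @ cov_f)      # matrix square root of the covariance product
        if np.iscomplexobj(covmean):               # discard tiny imaginary parts from numerical error
            covmean = covmean.real
        diff = mu_r - mu_f
        return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)

    # Identical feature sets give an FID of (approximately) zero.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(512, 64))
    print(frechet_inception_distance(feats, feats.copy()))  # ~0.0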
Figure 2.1.5: Fréchet Inception Distance (FID) score on STL-10, 2018–21 (2021: 7.71).
CIFAR-10: Fréchet Inception Distance (FID) Score
Progress on image generation can also be benchmarked on CIFAR-10, a dataset of 60,000 color images across 10 different object classes. The state-of-the-art results on CIFAR-10 posted in 2021 were achieved by researchers from NVIDIA.
The FID scores achieved by the top image generation models are much lower on CIFAR-10 than STL-10. This difference is likely attributable to the fact that CIFAR-10 contains images of much lower resolution (32 x 32 pixels) than those on STL-10 (96 x 96 pixels).
Figure 2.1.6: Fréchet Inception Distance (FID) score on CIFAR-10, 2017–21 (2021: 2.10).
Figure 2.1.7: FaceForensics++ accuracy, 2012–21 (2021: 93.25%, NeuralTextures). Source: arXiv, 2021.
1 These numbers were taken by averaging performance across all four FaceForensics++ datasets.
Figure 2.1.8: Area under curve (AUC) score on Celeb-DF, 2018–21 (2021: 76.88).
Figure 2.1.9: A demonstration of human pose estimation. Source: Cao et al., 2019.
Leeds Sports Poses: Percentage of Correct Keypoints (PCK)
The Leeds Sports Poses dataset contains 2,000 images collected from Flickr of athletes playing a sport. Each image includes information on 14 different body joint locations. Performance on the Leeds Sports Poses benchmark is assessed by the percentage of correctly estimated keypoints.
In 2021, the top-performing human pose estimation model correctly identified 99.5% of keypoints on Leeds Sports Poses (Figure 2.1.10). Given that maximum performance on Leeds Sports is 100.0%, more challenging benchmarks for human pose estimation will have to be developed, as we are close to saturating the benchmark.
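A minimal sketch of the PCK computation (illustrative only; the benchmark's own protocol normalizes the distance threshold by a reference length such as the torso size, shown here simply as an input):

    import numpy as np

    def pck(pred, gt, ref_length, alpha=0.2):
        """Percentage of correct keypoints.

        pred, gt: (n_images, n_joints, 2) predicted and ground-truth keypoint coordinates.
        ref_length: (n_images,) per-image reference length (e.g., torso diameter) scaling the threshold.
        alpha: fraction of the reference length within which a prediction counts as correct.
        """
        dists = np.linalg.norm(pred - gt, axis=-1)            # (n_images, n_joints)
        correct = dists <= alpha * ref_length[:, None]
        return correct.mean()

    # Hypothetical example: one image, three joints, torso length 100 pixels.
    gt = np.array([[[50.0, 50.0], [60.0, 80.0], [40.0, 120.0]]])
    pred = gt + np.array([[[5.0, 0.0], [0.0, 30.0], [1.0, 1.0]]])  # second joint is 30 px off
    print(pck(pred, gt, ref_length=np.array([100.0])))            # 2 of 3 joints within 20 px -> 0.666...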
Figure 2.1.10: Percentage of correct keypoints (PCK) on Leeds Sports Poses, 2014–21 (2021: 99.50%).
Human3.6M: Average Mean Per Joint Position Error (MPJPE)
3D human pose estimation is a more challenging type of pose estimation, where AI systems are asked to estimate poses in a three- rather than two-dimensional space. The Human3.6M dataset tracks progress in 3D human pose estimation. Human3.6M is a collection of over 3.6 million images of 17 different types of human poses (talking on the phone, discussing, and smoking, etc.). Performance on Human3.6M is measured in average mean per joint position error in millimeters, which is the average difference between an AI model's position estimations and the actual position annotation.
In 2014, the top-performing model was making an average per joint error of 16 centimeters, half the size of a standard school ruler. In 2021, this number fell to 1.9 centimeters, less than the size of an average paper clip.
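The metric itself is simple to state; a hedged NumPy illustration (3D coordinates in millimeters, omitting the root-alignment step that some evaluation protocols apply first):

    import numpy as np

    def mpjpe(pred, gt):
        """Mean per joint position error, in the same units as the inputs (here millimeters).

        pred, gt: (n_frames, n_joints, 3) predicted and ground-truth 3D joint coordinates.
        """
        return np.linalg.norm(pred - gt, axis=-1).mean()

    # Hypothetical example: every joint estimate is off by 19 mm along one axis.
    gt = np.zeros((10, 17, 3))
    pred = gt.copy()
    pred[..., 0] += 19.0
    print(mpjpe(pred, gt))  # 19.0 mm, i.e., the ~1.9 cm error level reported for 2021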
[Chart: Average MPJPE (mm) on Human3.6M, by year]
Figure 2.1.13
Figure 2.1.15a and Figure 2.1.15b: Mean DICE on the CVC-ClinicDB and Kvasir-SEG medical image segmentation benchmarks, 2015–21 (best values: 94.20% and 92.17%).
The Kvasir-SEG benchmark also points to the explosion of interest in medical image segmentation. Prior to 2020, the dataset was referenced in only three academic papers. In 2020 that number rose to six, and in 2021 it shot up to 25. Last year also saw the hosting of KiTS21 (the Kidney and Kidney Tumor Segmentation Challenge), which challenged medical researchers from academia and industry to create the best systems for automatically segmenting renal tumors and the surrounding anatomy of kidneys.

FACE DETECTION AND RECOGNITION
In facial detection, AI systems are tasked with identifying individuals in images or videos. Although facial recognition technology has existed for several decades, the technical progress in the last few years has been significant. Some of today's top-performing facial recognition algorithms have a near 100% success rate on challenging datasets.
Facial recognition can be used in transportation to facilitate cross-border travel, in fraud prevention to protect sensitive documents, and in online proctoring to identify illicit examination behavior. The greatest practical promise of facial recognition, however, is in its potential to aid security, which makes the technology extremely appealing to militaries and governments all over the world (e.g., 18 out of 24 U.S. government agencies are already using some kind of facial recognition technology).

National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT)
The National Institute of Standards and Technology's Face Recognition Vendor Test measures how well facial recognition algorithms perform on a variety of homeland security and law enforcement tasks, such as face recognition across photojournalism images, identification of child trafficking victims, deduplication of passports, and cross-verification of visa images. Progress on facial recognition algorithms is measured according to the false non-match rate (FNMR) or the error rate (the frequency with which a model fails to match an image to a person).
In 2017, some of the top-performing facial recognition algorithms had error rates of over 50.0% on certain FRVT tests. As of 2021, none has posted an error rate greater than 3.0%. The top-performing model across all datasets in 2021 (visa photos) registered an error rate of 0.1%, meaning that for every 1,000 faces, the model correctly identified 999.
National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT): verification accuracy by dataset (false non-match rate, log scale), 2017–21. 2021 values: 0.0044 BORDER photos (FNMR @ FMR = 0.000001), 0.0023 VISABORDER photos (FNMR @ FMR = 0.000001), 0.0022 MUGSHOT photos (FNMR @ FMR = 0.00001), 0.0021 MUGSHOT photos DT ≥ 12 yrs (FNMR @ FMR = 0.00001), 0.0013 VISA photos (FNMR @ FMR = 0.000001). Source: National Institute of Standards and Technology, 2021.
FACE DETECTION: EFFECTS OF MASK-WEARING
Face Recognition Vendor Test (FRVT): Face-Mask Effects
Facial recognition has become a more challenging task with the onset of the COVID-19 pandemic and accompanying mask mandates. The face-mask effects test asks AI models to identify faces on two datasets of visa border photos, one of which includes masked faces, the other which does not.
Three important trends can be gleaned from the FRVT face-mask test: (1) Facial recognition systems still perform relatively well on masked faces; (2) the performance on masked faces is worse than on non-masked faces; and (3) the gap in performance has narrowed since 2019.
Figure 2.1.17: FRVT face-mask effects: false non-match rate (log scale) on masked vs. non-masked photos, 2019–21 (2021: 0.014 masked, 0.002 non-masked).
Figure 2.1.18
As part of the dataset release, the researchers ran a series of existing state-of-the-art detection
algorithms on a variety of facial recognition datasets, including theirs, to determine how much detection
performance decreased when faces were masked. Their estimates suggest that top methods perform 5
to 16 percentage points worse on masked faces compared to unmasked ones. These findings somewhat
confirm the insights from the FRVT face-mask tests: Performance deteriorates when masks are included,
but not by an overly significant degree.
Figure 2.1.19: State-of-the-art face detection methods on Masked Labeled Faces in the Wild (MLFW): accuracy across the CALFW, CPLFW, LFW, SLLFW, and MLFW datasets. Source: Wang et al., 2021.
VISUAL REASONING
Visual reasoning assesses how well AI systems can reason across a combination of visual and textual data. Visual reasoning skills are essential in developing AI that can reason more broadly. Existing AI can already execute certain narrow visual tasks better than humans, such as classifying images, detecting faces, and segmenting objects. But many AI systems struggle when challenged to reason more abstractly—for example,
Figure 2.1.20: An example of a visual reasoning task. Source: Goyal et al., 2021.
Figure 2.1.21
In the six years since the VQA challenge began, there has been a 24.4 absolute percentage point improvement in state-of-the-art performance. In 2015, the top-performing systems could correctly answer only 55.4% of questions (Figure 2.1.22). As of 2021, top performance stood at 79.8%—close to the human baseline of 80.8%.
Figure 2.1.22: Accuracy (%) on the Visual Question Answering (VQA) challenge, 2015–21.
Video analysis concerns reasoning or task operation across sequential frames (videos), rather than single frames (images). Video
computer vision has a wide range of use cases, which include assisting criminal surveillance efforts, sports analytics, autonomous
driving, navigation of robots, and crowd monitoring.
Figure 2.2.1
As of 2022, one model tops all three Kinetics datasets. MTV, a collaboration between Google Research, Michigan State University, and Brown University, released in January 2022, achieved 89.6% Top-1 accuracy on the 600 series, 89.1% accuracy on the 400 series, and 82.20% accuracy on the 700 series (Figure 2.2.2). The most striking aspect about technical progress on Kinetics is how rapidly the gap has narrowed between performance on the datasets. In 2020, the gap between performance on Kinetics-400 and Kinetics-700 was 27.14 percentage points. In one short year, that gap has narrowed to 7.4 points, which means that progress on the newer, harder dataset is occurring more rapidly than on the easier dataset and suggests that the easier ones are starting to asymptote.
Figure 2.2.2: Top-1 accuracy (%) on Kinetics-400, Kinetics-600, and Kinetics-700, 2016–22 (Kinetics-700 in 2022: 82.20%).
ActivityNet: Temporal Action Localization Task
ActivityNet is a video dataset for human activity understanding that contains 700 hours of videos of humans doing 200 different activities (long jump, dog walking, vacuuming, etc.). For an AI system to successfully complete the ActivityNet Temporal Action Localization Task (TALT), it has to execute two separate steps: (1) localization (identify the precise interval during which the activity occurs); and (2) recognition (assign the correct category label). Temporal action localization is one of the most complex and difficult tasks in computer vision. Performance on TALT is measured in terms of mean average precision, with a higher score indicating greater accuracy.
As of 2021 the top-performing model on TALT, developed by HUST-Alibaba, scores 44.7%, a 26.9 percentage point improvement over the top scores posted in 2016 when the challenge began (Figure 2.2.3). Although state-of-the-art results on the task have been posted for each subsequent year, the gains have become increasingly small.
Figure 2.2.3: Mean average precision (mAP) on the ActivityNet Temporal Action Localization Task, 2016–21 (2021: 44.67%).
Figure 2.2.4
Common Object in Context (COCO)
Microsoft's Common Object in Context (COCO) object detection dataset contains over 328,000 images across more than 80 object categories. There are many accuracy metrics used to track performance on object detection, but for the sake of consistency, this section and the majority of this report consider mean average precision (mAP50).
Since 2016, there has been a 23.8 percentage point improvement on COCO object detection, with this year's top model, GLIP, registering a mean average precision of 79.5%.2 Figure 2.2.5 illustrates how the use of extra training data has taken over object detection, much as it has with other domains of computer vision.
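mAP50 counts a detection as correct when its bounding box overlaps a ground-truth box of the same class with an intersection over union (IoU) of at least 0.5; the IoU itself can be computed as in this small sketch (illustrative, not the official COCO evaluation code):

    def iou(box_a, box_b):
        """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    # A predicted box shifted 10 pixels off a 100x100 ground-truth box still clears the 0.5 threshold.
    print(iou((0, 0, 100, 100), (10, 10, 110, 110)))  # ~0.68 -> counted as correct at mAP50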
Figure 2.2.5: Mean average precision (mAP50) on COCO object detection, 2015–21 (2021: 79.50% with extra training data).
2 GLIP (Grounded Language-Image Pretraining), a model designed to master the learning of language contextual visual representations, was a collaboration of researchers from UCLA, Microsoft Research, University of Washington, University of Wisconsin–Madison, Microsoft Cloud, Microsoft AI, and International Digital Economy Academy.
You Only Look Once (YOLO)
You Only Look Once is an open-source object detection model that emphasizes speed (inference latency) over absolute accuracy.
Over the years, there have been different iterations of YOLO, and Figure 2.2.6 plots the performance of YOLO object detectors versus the absolute top-performing detectors on the COCO dataset. First, YOLO detectors have become much better in terms of performance since 2017 (by 28.4 percentage points). Second, the gap in performance between YOLO and the best-performing object detectors has narrowed. In 2017 the gap stood at 11.7%, and it decreased to 7.1% in 2021. In the last five years, object detectors have been built that are both faster and better.
STATE OF THE ART (SOTA) vs. YOU ONLY LOOK ONCE (YOLO): MEAN AVERAGE PRECISION (mAP50)
Source: arXiv, 2021; GitHub, 2021 | Chart: 2022 AI Index Report
(2021: SOTA 79.50%, YOLO 72.40%)
Figure 2.2.6
Visual Commonsense Reasoning (VCR)
The Visual Commonsense Reasoning challenge is a relatively new benchmark for visual understanding. VCR asks AI systems to answer challenging questions about scenarios presented from images, and also to provide the reasoning behind their answers (unlike the VQA challenge, which only requires an answer). The dataset contains 290,000 pairs of multiple-choice questions, answers, and rationales from 110,000 image scenarios taken from movies. Figure 2.2.7 illustrates the kinds of questions posed in VCR.
Figure 2.2.7
Performance on VCR is measured in the Q->AR score, which aggregates how well machines can choose the right answer for a given multiple-choice question (Q->A) and then select the correct rationale for the answer (Q->R).
Since the challenge debuted, AI systems have become better at visual commonsense reasoning, although they still lag far behind human levels of performance (Figure 2.2.8). At the end of 2021, the best mark on VCR stood at 72.0, a score that represents a 63.6% increase in performance since 2018. Although progress has been made since the challenge was launched, improvements have become increasingly marginal, suggesting that new techniques may need to be invented to significantly improve performance.
VISUAL COMMONSENSE REASONING (VCR) TASK: Q->AR SCORE
Source: VCR Leaderboard, 2021 | Chart: 2022 AI Index Report
(2021: 72.00; human baseline: 85.00)
Figure 2.2.8
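As a minimal illustration of how the Q->AR score combines the two subtasks, the Python sketch below counts an example as correct only when both the answer (Q->A) and the rationale (Q->R) are chosen correctly; the prediction lists are hypothetical, not real VCR outputs.

answers_correct = [True, True, False, True]      # Q->A outcome per question
rationales_correct = [True, False, False, True]  # Q->R outcome per question

q_a = sum(answers_correct) / len(answers_correct)
q_ar = sum(a and r for a, r in zip(answers_correct, rationales_correct)) / len(answers_correct)
print(f"Q->A accuracy: {q_a:.2f}, Q->AR accuracy: {q_ar:.2f}")  # 0.75, 0.50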
Natural language processing (NLP) is a subfield of AI, with roots that stretch back as far as the 1950s. NLP involves research into systems that can read, generate, and reason about natural language. The field has evolved from systems that in their early years relied on handwritten rules and statistical methods to ones that now combine computational linguistics, rule-based modeling, statistical learning, and deep learning.
This section looks at progress in NLP across several language task domains, including: (1) English language understanding; (2) text
summarization; (3) natural language inference; (4) sentiment analysis; and (5) machine translation. In the last decade, technical progress in
NLP has been significant: The adoption of deep neural network–style machine learning methods has meant that many AI systems can now
execute complex language tasks better than many human baselines.
2.3 LANGUAGE
ENGLISH LANGUAGE UNDERSTANDING
English language understanding challenges AI systems to understand the English language in various contexts, such as sentence understanding, yes/no reading comprehension, reading comprehension with logical reasoning, etc.
SuperGLUE
SuperGLUE is a single-number metric that tracks technical progress on a diverse set of linguistic tasks (Figure 2.3.1). As part of the benchmark, AI systems are tested on eight different tasks (such as answering yes/no questions, identifying causality in events, and doing commonsense reading comprehension), and their performance on these tasks is then averaged into a single score. SuperGLUE is the successor to GLUE, an earlier benchmark that also tests on multiple tasks. SuperGLUE was released in May 2019 after AI systems began to saturate the GLUE metric, creating demand for a harder benchmark.
Figure 2.3.1
3 For the sake of brevity, this figure only displays 3 of the 8 tasks.
At the top of the SuperGLUE leaderboard sits the SS-MoE model with a state-of-the-art score of 91.0 (Figure 2.3.2), which exceeds the human performance score of 89.8 given by the SuperGLUE benchmark developers. The fact that progress on SuperGLUE was achieved so rapidly suggests that researchers will need to develop more complex suites of natural language tasks to challenge the next generation of AI systems.
SUPERGLUE: SCORE
Source: SuperGLUE Leaderboard, 2021 | Chart: 2022 AI Index Report
(2021: 91.00; human performance: 89.80)
Figure 2.3.2
At the end of 2021, the leading scores on SQuAD 1.1 and SQuAD 2.0 stood at 95.7 and 93.2, respectively (Figure 2.3.4). Although these scores are state of the art, they are marginal improvements over the previous year’s top scores (0.4% and 0.2%). Both SQuAD datasets have seen a trend whereby, immediately after the initial launches, human-performance-exceeding scores were realized, followed by small, plateau-like increases.
Figure 2.3.6
ARXIV: ROUGE-1
Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report
(2021: 46.74, with extra training data)
Figure 2.3.7
PUBMED: ROUGE-1
Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report
Figure 2.3.8
Figure 2.3.9
The top-performing model on SNLI is Facebook AI USA’s EFL, which in April 2021 posted a score of 93.1% (Figure 2.3.10).
Figure 2.3.10: SNLI accuracy over time, 2014–2021 (93.10% in 2021)
Abductive Natural Language Inference (aNLI)
Abductive natural language inference is a more difficult type of textual entailment. Abductive inference requires drawing the most plausible conclusion from a context of limited information and uncertain premises. For instance, if Jenny were to return from work and find her home in a disheveled state and then recall that she left a window open, she can plausibly infer that a burglar broke in and caused the mess.4 Although abduction is regarded as an essential element in how humans communicate with one another, few studies have attempted to study the abductive ability of AI systems.
aNLI, a new benchmark for abductive natural language inference created in 2019 by the Allen Institute for AI, comes with 170,000 premise and hypothesis pairs. Figure 2.3.11 illustrates the types of statements included in the dataset.
Figure 2.3.11
4 This particular example of abductive commonsense reasoning is taken from Bhagavatula et al. (2019), the first paper that investigates the ability of AI systems to perform language-based abductive reasoning.
AI performance on abductive commonsense reasoning has increased by 7.7 percentage points since 2019; however, the
top AI systems, while close, are unable to achieve human performance levels (Figure 2.3.12). Abductive reasoning is
therefore still a challenging linguistic task for AI systems.
Figure 2.3.12: Accuracy (%) on aNLI over time (best system: 91.87%)
A SAMPLE SEMEVAL TASK
Source: Pontiki et al., 2014
Figure 2.3.13
The SemEval dataset is composed of 7,686 restaurant and laptop reviews, whose emotional polarities have been rated
by humans. On SemEval, AI systems are tasked with assigning the right sentiment labels to particular components of
the text, with their performance measured in terms of the percentage of the labels they correctly assign.
In the past seven years, AI systems have become much better at sentiment analysis. As of last year, top-performing
systems estimate sentiment correctly 9 out of 10 times, whereas in 2016, they made correct estimates only 7 out of
10 times. As of 2021, the state-of-the-art scores on SemEval stood at 88.6%, realized by a team of Chinese researchers
from South China Normal University and Linklogis Co. Ltd. (Figure 2.3.14).
Figure 2.3.14: Accuracy (%) on SemEval over time, 2015–2021 (88.64% in 2021)
WMT 2014, English-German and English-French
The WMT 2014 family of datasets, first introduced at the Meeting of the Association for Computational Linguistics (ACL) 2014, consists of different kinds of translation tasks, including translation between English-French and English-German language pairs. A machine’s translation capabilities are measured by the Bilingual Evaluation Understudy, or BLEU, score, which compares the extent to which a machine-translated text matches a reference human-generated translation. The higher the score, the better the translation.
Both the English-French and English-German WMT 2014 benchmarks showcase the significant progress made in AI machine translation over the last decade (Figure 2.3.15). Since submissions began, there has been a 23.7% improvement in English-French and a 68.1% improvement in English-German translation ability. Relatively speaking, although performance improvements have been more significant on the English-German language pair, absolute translation ability remains meaningfully higher on English-French translation.
Figure 2.3.15: BLEU scores on WMT 2014 English-French and English-German translation over time
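As a rough illustration of how BLEU is computed in practice, the Python sketch below scores a single hypothetical translation against one reference, assuming the sacrebleu package is installed; WMT 2014 submissions are scored the same way over the full test set.

import sacrebleu

hypotheses = ["The cat sat on the mat."]           # machine translations
references = [["The cat is sitting on the mat."]]  # one reference stream, aligned with the hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # higher means greater n-gram overlap with the reference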
Number of Commercially Available MT Systems
The growing interest in machine translation is also reflected in the rise of commercial machine translation services such as Google Translate. Since 2017, there has been a nearly fivefold increase in the number of commercial machine translators on the market, according to Intento (Figure 2.3.16). 2021 also saw the introduction of three open-source machine translation services (M2M-100, mBART, and OPUS). The emergence of publicly available, high-functioning machine translation services speaks to the increasing accessibility of such services and bodes well for anybody who routinely relies on translation.
Figure 2.3.16: Number of independent machine translation services, 05/2017–10/2021 (46 commercial services as of October 2021)
Another important domain of AI research is the analysis, recognition, and synthesis of human speech. In this AI subfield, systems are typically rated on their ability to recognize speech (identify words and convert them into text) and to recognize speakers (identify the individuals speaking). Modern home assistants, such as Siri, are among the many examples of commercially applied AI speech technology.
2.4 SPEECH
SPEECH RECOGNITION
Speech recognition is the process of training machines to recognize spoken words and convert them into text. Research in this domain began at Bell Labs in the 1950s, when the world was introduced to the automatic digit recognition machine (named “Audrey”), which could recognize a human saying any number from zero to nine. Speech recognition has come a long way since then and in the last decade has benefited tremendously from deep-learning techniques and the availability of rich speech recognition datasets.
Transcribe Speech: LibriSpeech (Test Clean and Test Other Datasets)
Introduced in 2015, LibriSpeech is a speech transcription database that contains around 1,000 hours of 16 kHz English speech taken from a collection of audiobooks. On LibriSpeech, AI systems are asked to transcribe speech to text and then measured on word error rate, or the percentage of words they fail to correctly transcribe. LibriSpeech is subdivided into two datasets. First, there is LibriSpeech Test Clean, which includes higher-quality recordings. Performance on Test Clean suggests how well AI systems can transcribe speech in ideal conditions. Second, there is LibriSpeech Test Other, which includes lower-quality recordings. Performance on Test Other is indicative of transcription performance in environments where sound quality is less than ideal.
AI systems perform incredibly well on LibriSpeech, so much so that progress appears to be plateauing (Figure 2.4.1).
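The Python sketch below shows one common way to compute word error rate: the word-level edit distance (substitutions, insertions, and deletions) between a hypothesis transcript and the reference, divided by the number of reference words. The sentences are illustrative, not LibriSpeech data.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167: one error in six reference words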
LIBRISPEECH, TEST CLEAN: WORD ERROR RATE (WER) | LIBRISPEECH, TEST OTHER: WORD ERROR RATE (WER)
Source: Papers with Code, 2021; arXiv, 2021 | Chart: 2022 AI Index Report
(Test Clean 2021: 1.70 without extra training data)
Figure 2.4.1
A state-of-the-art result on the Test Clean dataset was not realized in 2021, which speaks to the fact that the error rate of the top system was already low, at 1.4%. For every 100 words that top-performing transcription models heard, they correctly transcribed 99.
Performance on the Test Other dataset, while worse than on Test Clean, was still relatively strong. The state-of-the-art results on Test Other were realized by the W2V-BERT model, an MIT and Google Brain collaboration, which posted an error rate of 2.0%.
VoxCeleb
VoxCeleb is a large-scale audiovisual dataset of human speech for speaker recognition, which is the task of matching certain speech with a particular individual. Each year, the makers of VoxCeleb host a speaker verification challenge. A low score, or equal error rate, on the VoxCeleb challenge is indicative of an AI system that makes few errors in its attribution of speech to particular individuals.5 Figure 2.4.2 plots performance over time on VoxCeleb-1, the original VoxCeleb dataset. Since 2017, performance on VoxCeleb has improved: Systems that once reported equal error rates of 7.8% now report errors that are less than 1.0%.
Figure 2.4.2: Equal error rate (%) on VoxCeleb, 2017–2021 (0.42% in 2021)
5 The equal error rate (EER) accounts not only for the false positive rate (assigning a bad label) but also for the false negative rate (failing to assign the correct label).
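The Python sketch below illustrates the equal error rate: sweep a decision threshold over verification scores and report the error rate at the point where the false positive rate and the false negative rate are (approximately) equal. The scores and labels are made up for illustration.

import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1])  # higher = more likely the same speaker
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])                   # 1 = genuine pair, 0 = impostor pair

def equal_error_rate(scores, labels):
    rates = []
    for t in np.unique(scores):
        fpr = np.mean(scores[labels == 0] >= t)  # impostor pairs accepted
        fnr = np.mean(scores[labels == 1] < t)   # genuine pairs rejected
        rates.append((abs(fpr - fnr), (fpr + fnr) / 2))
    return min(rates)[1]  # error rate at the threshold where FPR and FNR are closest

print(f"EER: {equal_error_rate(scores, labels):.2f}")  # 0.25 in this toy example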
Recommendation is the task of suggesting items that might be of interest to a user, such as movies to watch, articles to read, or products
to purchase. Recommendation systems are crucial to businesses, such as Amazon, Netflix, Spotify, and YouTube. For example, one of
the earliest open recommendation competitions in AI was the Netflix Prize; hosted in 2009, it challenged computer scientists to develop
algorithms that could accurately predict user ratings for films based on previously submitted ratings.
2.5 RECOMMENDATION
Commercial Recommendation: MovieLens 20M
The MovieLens 20M dataset contains around 20 million movie ratings for 27,000 movies from 138,000 users. The ratings are taken from MovieLens (a movie recommendation platform), and AI systems are challenged to see if they can predict a user’s movie preferences based on their previously submitted ratings. The metric used to track performance on MovieLens is Normalized Discounted Cumulative Gain (nDCG), which is a measure of ranking quality. A higher nDCG score means that an AI system delivers more accurate recommendations.
Top models now perform roughly 5.2% better on MovieLens 20M than they did in 2018 (Figure 2.5.1). In 2021, the state-of-the-art system on MovieLens 20M posted an nDCG of 0.448 and came from researchers at the Czech Technical University in Prague.
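The Python sketch below shows the idea behind nDCG: relevant items placed near the top of a ranked list earn more credit, and the score is normalized by the best possible ordering. The relevance list is hypothetical; the report's MovieLens figures use nDCG@100, which applies the same computation to the top 100 recommendations.

import math

def dcg(relevances):
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)
    return dcg(relevances) / dcg(ideal)

ranked_relevance = [1, 0, 1, 0, 0]  # 1 = the user actually liked the recommended movie
print(f"nDCG: {ndcg(ranked_relevance):.3f}")  # ~0.92; a perfect ranking would score 1.0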
Figure 2.5.1: Normalized Discounted Cumulative Gain@100 (nDCG@100) on MovieLens 20M, 2018–2021 (0.448 in 2021)
Click-Through Rate Prediction: Criteo
Click-through rate prediction is the task of predicting the likelihood that something on a website, say an advertisement, will be clicked. In 2014, the online advertising platform Criteo launched an open click-through prediction challenge. Included as part of this challenge dataset was information on a million ads that were displayed during a 24-day period, whether they were clicked, and additional information on their characteristics. Since the competition launched, the Criteo dataset has been widely used to test recommender systems. On Criteo, systems are measured on area under the curve (AUC). A higher AUC means a better click-through prediction rate and a stronger recommender system.
Performance on Criteo also indicates that recommender systems have been slowly and steadily improving in the past decade. Last year’s top model (Sina Weibo Corp’s MaskNet) performed 1.8% higher on Criteo than the top model from 2016. An improvement of 1.8% may seem small in absolute terms, but it can be a valuable margin in the commercial world.
A limit of the Criteo and MovieLens benchmarks is that they are primarily academic measures of technical progress in recommendation (Figure 2.5.2). Most of the research work on recommendation occurs in commercial settings. Given that companies have an incentive to keep their recommendation improvements proprietary, the academic metrics included in this section might not be complete measures of the technical progress made in recommendation.
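The Python sketch below evaluates a hypothetical click-through-rate model with AUC, assuming scikit-learn is available; an AUC of 1.0 means every clicked ad was ranked above every unclicked ad, while 0.5 corresponds to random ordering.

from sklearn.metrics import roc_auc_score

clicked = [1, 0, 0, 1, 0, 1, 0, 0]                          # 1 = the ad was clicked
predicted_prob = [0.8, 0.2, 0.65, 0.7, 0.1, 0.6, 0.3, 0.5]  # model's predicted click probabilities
print(f"AUC: {roc_auc_score(clicked, predicted_prob):.3f}") # ~0.93 for this toy example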
Figure 2.5.2: Area under curve (AUC) score on Criteo, 2016–2021 (0.813 in 2021)
In reinforcement learning, AI systems are trained to maximize performance on a given task by interactively learning from their prior actions. Researchers train systems to optimize their behavior by rewarding them when they achieve a desired goal and penalizing them when they fail. Systems experiment with different strategy sequences to solve their stated problem (e.g., playing chess or navigating through a maze) and select the strategies that maximize their rewards.
Reinforcement learning makes the news whenever programs like DeepMind’s AlphaZero demonstrate superhuman performance on
games like Go and Chess. However, reinforcement learning is useful in any commercial domain where computer agents need to maximize
a target goal or stand to benefit from learning from previous experiences. Reinforcement learning can help autonomous vehicles change
lanes, robots optimize manufacturing tasks, or time-series models predict future events.
Figure 2.6.1: Mean-human normalized score (in thousands), 2015–2021 (9.62 in 2021)
Procgen
Procgen is a new reinforcement learning environment introduced by OpenAI in 2019. It includes 16 procedurally generated video-game-like environments specifically designed to test the ability of reinforcement learning agents to learn generalizable skills (Figure 2.6.2). Procgen was developed to overcome some of the criticisms leveled at benchmarks like Atari that incentivized AI systems to become narrow learners that maximized capacity in one particular skill. Procgen encourages broad learning by introducing a reinforcement learning environment that emphasizes high diversity and forces AI systems to train in generalizable ways. Performance on Procgen is measured in terms of mean-normalized score. Researchers typically train their systems on 200 million training runs and report an average score across the 16 Procgen games. The higher the system scores, the better the system.
A SCREENSHOT OF THE 16 GAME ENVIRONMENTS IN PROCGEN
Source: Cobbe et al. 2019
Figure 2.6.2
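The Python sketch below shows one way a mean-normalized score of this kind can be computed, assuming, as in the original Procgen paper, that each game's raw return is rescaled by per-game minimum and maximum reference returns before averaging across games; the game names are real Procgen environments, but the returns and bounds are made-up placeholders.

raw_returns = {"coinrun": 8.0, "starpilot": 30.0, "bigfish": 20.0}                   # raw returns per game
bounds = {"coinrun": (5.0, 10.0), "starpilot": (2.5, 64.0), "bigfish": (1.0, 40.0)}  # (R_min, R_max) per game

normalized = [(raw_returns[g] - lo) / (hi - lo) for g, (lo, hi) in bounds.items()]
mean_normalized_score = sum(normalized) / len(normalized)
print(f"Mean-normalized score: {mean_normalized_score:.2f}")  # averaged over the games shown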
In November 2021, the MuZero model from DeepMind posted a state-of-the-art score of 0.6 on Procgen. DeepMind’s results were a 128.6% improvement over the baseline performance established in 2019 when the environment was first released. Rapid progress on such a diverse benchmark signals that AI systems are slowly improving their capacity to reason in broader environments.
Figure 2.6.3: Mean-normalized score on Procgen (0.60 in 2021)
Human Games: Chess
Progress in reinforcement learning can also be captured by the performance of the world’s top chess software engines. A chess engine is a computer program that is trained to play chess at a high level by analyzing chess positions. The performance of chess engines is ranked on Elo, a method for identifying the relative skill levels of players in zero-sum games like chess: A higher score means a stronger player.
One caveat is that tracking the performance of chess engines is not a complete reflection of general reinforcement learning progress; chess engines are specifically trained to maximize performance on chess. Other popular reinforcement learning systems, like DeepMind’s AlphaZero, are capable of playing a broader range of games, such as shogi and Go, and have in fact beaten some of the top-ranked chess engines. Nevertheless, looking at chess engine performance is an effective way to relativize the progress made in AI and compare it to a widely understandable human baseline.
Computers surpassed human performance in chess a long time ago, and since then have not stopped improving (Figure 2.6.4). By the mid-1990s, the top chess engines exceeded expert-level human performance, and by the mid-2000s they surpassed the peak performance of Magnus Carlsen, one of the best chess players of all time. Magnus Carlsen’s 2882 Elo, recorded in 2014, is the highest level of human chess performance ever documented. As of 2021, the top chess engine has exceeded that level by 24.3%.
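The Python sketch below shows the standard Elo expected-score formula and rating update that this kind of ranking is built on; the ratings come from the chart below, while the K-factor of 20 is an illustrative choice.

def expected_score(rating_a, rating_b):
    # Probability-like expected score for player A against player B.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a, rating_b, score_a, k=20):
    # score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    change = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + change, rating_b - change

engine, human = 3581, 2882  # top chess engine vs. Magnus Carlsen's peak Elo
print(f"Expected engine score: {expected_score(engine, human):.3f}")  # ~0.98
engine, human = update(engine, human, score_a=1.0)  # ratings barely move after an expected win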
Figure 2.6.4: Elo scores of the top chess engines, 1987–2022 (3,581 as of 2021; reference levels: expert 2300, intermediate 1700, novice 800)
In evaluating technical progress in AI, it is relevant to consider not only improvements in technical performance but also the speed of operation. As this section shows, AI systems continue to improve in virtually every skill category. This performance is often realized by increasing parameters and training systems on greater amounts of data. However, all else being equal, models that use more parameters and source more data will take longer to train. Longer training times mean slower real-world deployment. Given that increased training times can be offset by stronger and more robust computational infrastructure, it is important to keep track of progress in the hardware that powers AI systems.
2.7 HARDWARE
MLPerf: Training Time
MLPerf is an AI training competition run by the MLCommons organization. In this challenge, participants train systems to execute various AI tasks (image classification, image segmentation, natural language processing, etc.) using a common architecture. Entrants are then ranked on their absolute wall clock time, which is how long it takes for the system to train.
Since the MLPerf competitions began in December 2018, two key trends have emerged: (1) Training times for virtually every AI skill category have massively decreased, while (2) AI hardware robustness has substantially increased. Top-performing hardware systems can reach baseline levels of performance in task categories like recommendation, light-weight object detection, image classification, and language processing in under a minute.
Figure 2.7.2 more precisely profiles the magnitude of improvement across each skill category since MLPerf first introduced the category.6 For example, training times on image classification improved roughly twenty-seven-fold, as top times dropped from 6.2 minutes in 2018 to 0.2 minutes (or 13.8 seconds) in 2021. It might be hard to fathom the magnitude of a 27-fold decrease in training time, but in real terms it is the difference between waiting one hour for a bus versus a little more than two minutes.
Figure 2.7.1: MLPerf training time by task (minutes; log scale), with labeled points at 3.24 minutes for object detection (heavy-weight) and 2.38 minutes for speech recognition
6 The solitary point for reinforcement learning on Figure 2.7.1 indicates that a faster time was not registered in the MLPerf competitions in 2020 or 2021. The solitary points for speech recognition and
image segmentation are indicative of the fact that those AI subtask categories were added to the MLPerf competition in 2021.
Figure 2.7.2: Scale of improvement in MLPerf training time by task category since each category was introduced (reinforcement learning, speech recognition, language processing, recommendation, segmentation, object detection light-weight, object detection heavy-weight, and image classification), ranging from roughly 1.2x to 27x
MLPerf: Number of Accelerators
The cross-task improvements in training time are being driven by stronger underlying hardware systems, as shown in Figure 2.7.3. Since the competition began, the highest number of accelerators used (where an accelerator is a chip used predominantly for the machine learning component of a training run, such as a GPU or a TPU) and the mean number of accelerators used by the top system increased roughly 7 times, while the mean number of accelerators used by all entrants increased 3.5 times. Most notable, however, is the growing gap between the average number of accelerators used by the top-performing systems and the average number used by all systems. This gap was 9 times larger at the end of 2021 than it had been in 2018. This growth clearly means that, on average, building the fastest systems requires the most powerful hardware.
Figure 2.7.3: Number of accelerators used in MLPerf entries, 12.12.2018–12.01.2021
Figure 2.7.4: Cost (in U.S. dollars; log scale), 2017–2021 ($4.59 in 2021)
In 2021, the AI Index developed a survey that asked professors who specialize in robotics at top-ranked universities around the world and
in emerging economies about changes in the pricing of robotic arms as well as the uses of robotic arms in research labs. The survey was
completed by 101 professors and researchers from over 40 universities and collected data on 117 robotic arm purchase events from 2016
to 2022. The survey results suggest that there has been a notable decline in the price of robotic arms since 2016.
2.8 ROBOTICS
Price Trends in Robotic Arms7
The survey results show a clear downward trend in the pricing of robotic arms in the last seven years. In 2017, the median price of a robotic arm was $42,000. Since then, the price has declined by 46.2% to reach roughly $22,600 in 2021 (Figure 2.8.1). Figure 2.8.2, which plots the distribution of robotic arm prices, paints a similar picture: Despite some high-priced outliers, there has been a clear downward trend since 2017 in the price of robotic arms.
Figure 2.8.1: Median price of robotic arms (in thousands of U.S. dollars), 2017–2021 ($22.60 thousand in 2021)
7 We have corrected Figure 2.8.1 and Figure 2.8.2 after noticing a data filtering issue with the survey results; the charts have since been updated. View the appendix here for links to the data. In addition, note that academic researchers may get a discount when purchasing robotic arms, so prices are lower than retail.
Figure 2.8.2: Distribution of robotic arm prices (in thousands of U.S. dollars), 2017–2021
CHAPTER 3:
Technical AI Ethics
Text and Analysis by
Helen Ngo and Ellie Sakhaee
Chapter Preview
Overview
Acknowledgment
3.3 AI Ethics Trends at FAccT and NeurIPS
Overview
In recent years, AI systems have started to be deployed into the
world, and researchers and practitioners are reckoning with their
real-world harms. Some of these harms include commercial facial
recognition systems that discriminate based on race, résumé
screening systems that discriminate on gender, and AI-powered
clinical health tools that are biased along socioeconomic and racial
lines. These models have been found to reflect and amplify human
social biases, discriminate based on protected attributes, and
generate false information about the world. These findings have
increased interest within the academic community in studying AI
ethics, fairness, and bias and prompted industry practitioners to
direct resources toward remediating these issues, and attracted
attention from the media, governments, and the people who use and
are affected by these systems.
This year, the AI Index highlights metrics which have been adopted
by the community for reporting progress in eliminating bias and
promoting fairness. Tracking performance on these metrics alongside
technical capabilities provides a more comprehensive perspective
on how fairness and bias change as systems improve, which will be
important to understand as systems are increasingly deployed.
AC K N OW L E D G M E N T
The AI Index would like to thank all those involved in research and advocacy around the development and governance
of responsible AI. This chapter builds upon the work of scholars from across the AI ethics community, including those
working on measuring technical capabilities as well as those focused on shaping thoughtful societal norms. There is much
more work to be done, but we are inspired by the progress made by this community and its collaborators.
Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong Wook Kim, and Miles Brundage. 2021. Evaluating
CLIP: Towards Characterization of Broader Capabilities and Downstream Implications. arXiv preprint arXiv:2108.02818.
Jack Bandy and Nicholas Vincent. 2021. Addressing “Documentation Debt” in Machine Learning Research: A
Retrospective Datasheet for Book Corpus. arXiv preprint arXiv:2105.05241.
Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. Multimodal Datasets: Misogyny, Pornography, and
Malignant Stereotypes. arXiv preprint arXiv:2110.01963.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den
Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman
Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey
Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2021. Improving
Language Models by Retrieving from Trillions of Tokens. arXiv preprint arXiv:2112.04426.
Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2017. Word Embeddings Quantify 100 Years of Gender and
Ethnic Stereotypes. arXiv preprint arXiv:1711.08412.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv preprint arXiv:2009.11462.
Wei Guo and Aylin Caliskan. 2020. Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a
Distribution of Human-like Biases. arXiv preprint arXiv:2006.03955.
Aylin Caliskan Islam, Joanna J. Bryson, and Arvind Narayanan. 2016. Semantics Derived Automatically from Language
Corpora Necessarily Contain Human Biases. arXiv preprint arXiv:1608.07187.
Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical Details and Evaluation. https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf
Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On Measuring Social Biases in
Sentence Encoders. arXiv preprint arXiv:1903.10561.
Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring Stereotypical Bias in Pretrained Language
Models. arXiv preprint arXiv:2004.09456.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-Assisted Question-Answering with Human Feedback. arXiv preprint arXiv:2112.09332.
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A Challenge Dataset for
Measuring Social Biases in Masked Language Models. arXiv preprint arXiv:2010.00133.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. arXiv preprint arXiv:2203.02155.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah
Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard
Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl,
Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy
Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela
Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya,
Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug
Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume,
Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James
Bradbury, Matthew Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart,
Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis,
Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling Language Models: Methods, Analysis & Insights from Training
Gopher. arXiv preprint arXiv:2112.11446.
Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating Gender Bias in Machine Translation. arXiv
preprint arXiv:1906.00591.
Ryan Steed and Aylin Caliskan. 2020. Image Representations Learned With Unsupervised Pre-Training Contain Human-
like Biases. arXiv preprint arXiv:2010.15052.
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese,
Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane,
Julia Haas, Laura Rimell, Lisa Anne Hendricks, William S. Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021.
Ethical and social risks of harm from Language Models. arXiv preprint arXiv:2112.04359.
Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty
Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in Detoxifying Language Models. arXiv
preprint arXiv:2109.07445.
Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, and Dan Klein. 2021. Detoxifying Language
Models Risks Marginalizing Minority Voices. arXiv preprint arXiv:2104.06390.
Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang. 2019. Examining
Gender Bias in Languages with Grammatical Gender. arXiv preprint arXiv:1909.02224.
CHAPTER HIGHLIGHTS
• Language models are more capable than ever, but also more biased: Large language
models are setting new records on technical benchmarks, but new data shows that larger
models are also more capable of reflecting biases from their training data. A 280 billion
parameter model developed in 2021 shows a 29% increase in elicited toxicity over a
117 million parameter model considered the state of the art as of 2018. The systems are
growing significantly more capable over time, though as they increase in capabilities, so does
the potential severity of their biases.
• The rise of AI ethics everywhere: Research on fairness and transparency in AI has exploded
since 2014, with a fivefold increase in related publications at ethics-related conferences.
Algorithmic fairness and bias has shifted from being primarily an academic pursuit to
becoming firmly embedded as a mainstream research topic with wide-ranging implications.
Researchers with industry affiliations contributed 71% more publications year over year
at ethics-focused conferences in recent years.
• Multimodal models learn multimodal biases: Rapid progress has been made on training
multimodal language-vision models which exhibit new levels of capability on joint language-
vision tasks. These models have set new records on tasks like image classification and the
creation of images from text descriptions, but they also reflect societal stereotypes and
biases in their outputs—experiments on CLIP showed that images of Black people were
misclassified as nonhuman at over twice the rate of any other race. While there has been
significant work to develop metrics for measuring bias within both computer vision and natural
language processing, this highlights the need for metrics that provide insight into biases in
models with multiple modalities.
Significant research effort has been invested over the past five years into creating datasets, benchmarks, and metrics designed to
measure bias and fairness in machine learning models. Bias is often learned from the underlying training data for an AI model; this data
can reflect systemic biases in society or the biases of the humans who collected and curated the data.
Figure 3.1.1: Number of AI fairness and bias metrics released per year, 2016–2021
1 2021 data may be lagging as it takes time for metrics to be adopted by the community.
2 Research paper citations are a lagging indicator of activity, and metrics which have been very recently adopted may not be reflected in the current data, similar to 3.1.1.
3 The Perspective API defined seven new metrics for measuring facets of toxicity (toxicity, severe toxicity, identity attack, insult, obscene, sexually explicit, threat), contributing to the unusually high
number of metrics released in 2017.
NUMBER of AI FAIRNESS and BIAS METRICS (DIAGNOSTIC METRICS vs. BENCHMARKS), 2016–21
Source: AI Index, 2021 | Chart: 2022 AI Index Report
Figure 3.1.2
The rest of this chapter examines the performance of recent AI systems on these metrics and benchmarks in depth within domains such as natural language and computer vision. The majority of these metrics measure intrinsic bias within systems, and it has been shown that intrinsic bias metrics may not fully capture the effects of extrinsic bias within downstream applications.
Current state-of-the-art natural language processing (NLP) relies on large language models or machine learning systems that process
millions of lines of text and learn to predict words in a sentence. These models can generate coherent text; classify people, places, and
events; and be used as components of larger systems, like search engines. Collecting training data for these models often requires scraping
the internet to create web-scale text datasets. These models learn human biases from their pretraining data and reflect them in their
downstream outputs, potentially causing harm. Several benchmarks and metrics have been developed to identify bias in natural language
processing along axes of gender, race, occupation, disability, religion, age, physical appearance, sexual orientation, and ethnicity.
Figure 3.2.1: Number of research papers per year, 2018–2021
RealToxicityPrompts consists of English natural language prompts used to measure how often a language model completes a prompt with toxic text. Toxicity of a language model is measured with two metrics:
• Maximum toxicity: the average maximum toxicity score across some number of completions
• Probability of toxicity: how often a completion is expected to be toxic
Figure 3.2.2 shows that toxicity in language models depends heavily on the underlying training data. Models trained on internet text with toxic content filtered out are significantly less toxic compared to models trained on various corpora of unfiltered internet text. A model trained on BookCorpus (a dataset containing books from e-book websites) produces toxic text surprisingly often. This may be due to its composition: BookCorpus contains a significant number of romance novels containing explicit content, which may contribute to higher levels of toxicity.
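The Python sketch below computes the two metrics from a matrix of per-completion toxicity scores (for example, scores returned by a toxicity classifier); the numbers are hypothetical, and the 0.5 cutoff for calling a completion toxic follows the common convention used with such classifiers.

import numpy as np

# toxicity[i][j] = toxicity score of completion j for prompt i, in [0, 1]
toxicity = np.array([
    [0.10, 0.65, 0.30, 0.20],
    [0.05, 0.15, 0.55, 0.40],
])

expected_max_toxicity = toxicity.max(axis=1).mean()          # average of per-prompt maximum scores
toxicity_probability = (toxicity.max(axis=1) >= 0.5).mean()  # share of prompts with at least one toxic completion
print(f"Expected maximum toxicity: {expected_max_toxicity:.2f}")  # 0.60
print(f"Toxicity probability: {toxicity_probability:.2f}")        # 1.00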
Figure 3.2.2: Toxicity score by training data; the model trained on toxicity-filtered OpenWebText (C4) scores far lower (0.08) than models trained on unfiltered corpora such as BookCorpus, Wikipedia, and unfiltered OpenWebText (CTRL, GPT-3), which range from roughly 0.71 to 0.88
Figure 3.2.3a: Toxicity probability by prompt toxicity (low, medium, high) and model size (117 million to 280,000 million parameters)
Figure 3.2.3b: Toxicity classification area under the curve (AUC), 10-shot, by model size (44 million to 280,000 million parameters)
Figure 3.2.4: Perplexity on African American–aligned English (nontoxic), White-aligned English (nontoxic and toxic), and minority identity mention (nontoxic) text
Figure 3.2.5: StereoSet stereotype scores by model (ranging from 58.20 to 62.00)
StereoSet has several major flaws in its underlying dataset: Some examples fail to express a harmful stereotype, conflate stereotypes about countries with stereotypes about race and ethnicity, and confuse stereotypes between associated but distinct groups. Additionally, these stereotypes were sourced from crowdworkers located in the United States, and the resulting values and stereotypes within the dataset may not be universally representative.
Figure 3.2.6: Stereotype score by bias attribute (age, disability, gender, nationality, physical appearance, race/color, religion, sexual orientation, socioeconomic status)
Figure 3.2.7: Religious books in BookCorpus and smashwords21 (as a percentage of all religious books in each corpus), by religion: atheism, Buddhism, Christianity, Hinduism, Islam, Judaism, Sikhism
WINOGENDER AND WINOBIAS
Winogender measures gender bias related to occupations. Systems are measured on their ability to fill in the correct gender in a sentence containing an occupation (e.g., “The teenager confided in the therapist because he / she seemed trustworthy”). Examples were created by sourcing data from the U.S. Bureau of Labor Statistics to identify occupations skewed toward one gender (e.g., the cashier occupation is made up of 73% women, but drivers are only 6% women).
Performance on Winogender is measured by the accuracy gap between the stereotypical and anti-stereotypical cases, along with the gender parity score (the percentage of examples for which the predictions are the same). The authors use crowdsourced annotations to estimate human performance to be 99.7% accuracy.
Winogender results from the SuperGLUE leaderboard show that larger models are more capable of correctly resolving gender in the zero-shot and few-shot setting (i.e., without fine-tuning on the Winogender task) and less likely to magnify occupational gender disparities (Figure 3.2.8). However, a good score on Winogender does not indicate that a model is unbiased with regard to gender, only that bias was not captured by this benchmark.
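The Python sketch below illustrates the two Winogender-style measurements described above, using hypothetical per-example outcomes and treating equal correctness on a sentence pair as a proxy for identical predictions.

stereotypical_correct = [True, True, True, False, True]         # pronoun matches the occupation stereotype
anti_stereotypical_correct = [True, False, True, False, False]  # pronoun goes against the stereotype

acc_gap = (sum(stereotypical_correct) - sum(anti_stereotypical_correct)) / len(stereotypical_correct)
parity = sum(s == a for s, a in zip(stereotypical_correct, anti_stereotypical_correct)) / len(stereotypical_correct)
print(f"Accuracy gap: {acc_gap:.2f}")  # 0.40: the model is notably better on stereotypical cases
print(f"Gender parity: {parity:.2f}")  # 0.60: outcomes match on 60% of the gender-swapped pairs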
Figure 3.2.8: Winogender performance by number of parameters (31 million to 280,000 million)
WinoBias is a similar benchmark measuring gender bias related to occupations that was released concurrently with Winogender by a different research group. As shown in Figure 3.2.9, WinoBias is cited more often than Winogender, but the adoption of Winogender within the SuperGLUE leaderboard for measuring natural language understanding has led to more model evaluations being reported on Winogender.
Figure 3.2.9: Number of citations of WinoBias and Winogender, 2018–2021 (WinoBias: 128 in 2021)
Figure 3.2.10
SENTENCE EMBEDDING ASSOCIATION TEST (SEAT): MEASURING STEREOTYPICAL ASSOCIATIONS with EFFECT SIZE
Source: May et al., 2019 | Chart: 2022 AI Index Report
(Panels: Angry Black Woman Stereotype; European American/African American Names; Male/Female Names, Career. Encoders: CBoW (10/2014), InferSent (9/2017), ELMo (2/2018), GenSen (3/2018), Universal Sentence Encoder (3/2018), GPT (6/2018), BERT (10/2018).)
Figure 3.2.11
Word embeddings also reflect cultural shifts: A temporal analysis of word embeddings over 100 years of U.S. Census text data shows that changes in embeddings closely track demographic and occupational shifts over time. Figure 3.2.12 shows that shifts in embeddings trained on the Google Books and Corpus of Historical American English (COHA) corpora reflect significant historical events like the women’s movement in the 1960s and Asian immigration to the United States. In this analysis, embedding bias is measured with the relative norm difference: the average Euclidean distance between words associated with representative groups (e.g., men, women, Asians) and words associated with occupations. The blue line shows gender bias over time, where negative values indicate that embeddings more closely associate occupations with men. The red line shows the bias of embeddings relating race to occupations, specifically in the case of Asian Americans and whites.
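The Python sketch below shows the shape of such a relative norm difference computation, assuming word vectors are available as NumPy arrays; the random vectors are stand-ins, and a negative value would mean the occupation words sit closer, on average, to the first group's average vector than to the second group's.

import numpy as np

rng = np.random.default_rng(0)
men_words = [rng.normal(size=50) for _ in range(3)]    # e.g., vectors for "he", "man", "his"
women_words = [rng.normal(size=50) for _ in range(3)]  # e.g., vectors for "she", "woman", "her"
occupations = [rng.normal(size=50) for _ in range(5)]  # e.g., vectors for "nurse", "engineer", ...

men_avg, women_avg = np.mean(men_words, axis=0), np.mean(women_words, axis=0)
relative_norm_difference = np.mean(
    [np.linalg.norm(occ - men_avg) - np.linalg.norm(occ - women_avg) for occ in occupations]
)
print(f"Relative norm difference: {relative_norm_difference:.3f}")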
GENDER and RACIAL BIAS in WORD EMBEDDINGS TRAINED on 100 YEARS of TEXT DATA
Source: Garg et al., 2018 | Chart: 2022 AI Index Report
(Average bias score over time, gender bias and racial bias, roughly -0.06 to -0.02)
Figure 3.2.12
(Figure: gendered occupation translations, e.g., architect → arquitecta / arquitecto, lawyer → abogada / abogado, nurse → enfermera / enfermero)
Mitigating Bias in Word Embeddings With Intrinsic Bias Metrics
It is often assumed that reducing intrinsic bias by de-biasing embeddings will reduce downstream biases in applications (extrinsic bias). However, it has been demonstrated that there is no reliable correlation between intrinsic bias metrics and downstream application biases. Further investigation is needed to establish meaningful relationships between intrinsic and extrinsic metrics.
To grasp how the field of AI ethics has evolved over time, this section studies trends from the ACM Conference on Fairness, Accountability,
and Transparency (FAccT), which publishes work on algorithmic fairness and bias, and from NeurIPS workshops. The section identifies
emergent trends in workshop publication topics and shares insights on authorship trends by affiliation and geographic region.
Figure 3.3.1: Number of accepted papers by author affiliation (education, industry, government, nonprofit, other), 2018–2021 (education-affiliated: 302 in 2021)
7 Work accepted by FAccT includes technical frameworks for measuring fairness, investigations into the harms of AI in specific industries (e.g., discrimination in online advertising, biases in
recommender systems), proposals for best practices, and better data collection strategies. Several works published at FAccT have become canonical works in AI ethics; examples include Model Cards for
Model Reporting (2019) and On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? (2021). Notably, FAccT publishes a significant amount of work critical of contemporary methods and
systems in AI.
While there has been increased interest in fairness, accountability, and transparency research from all types of organizations, the majority of papers published at FAccT are written by researchers based in the United States, followed by researchers based in Europe and Central Asia (Figure 3.3.2). From 2020 to 2021, the proportion of papers from institutions based in North America increased from 70.2% to 75.4%.
Figure 3.3.2: Share of FAccT papers by region of author affiliation (2021: Europe and Central Asia 17.34%, East Asia and Pacific 4.03%, Latin America and the Caribbean 1.61%, Middle East and North Africa 1.21%, Sub-Saharan Africa 0.40%, South Asia 0.00%)
NEURIPS WORKSHOP RESEARCH TOPICS: NUMBER of ACCEPTED PAPERS on REAL-WORLD IMPACTS, 2015–21
Source: NeurIPS, 2021; AI Index, 2021 | Chart: 2022 AI Index Report
(Topics: climate, developing world, finance, healthcare, other)
Figure 3.3.3
NEURIPS RESEARCH TOPICS: NUMBER of ACCEPTED PAPERS on CAUSAL EFFECT and COUNTERFACTUAL REASONING, 2015–2021
Source: NeurIPS, 2021; AI Index, 2021 | Chart: 2022 AI Index Report
(Main track and workshop papers)
Figure 3.3.4
8 See 2017 Transparent and Interpretable Machine Learning in Safety Critical Environments, 2019 Workshop on Human-Centric Machine Learning: Safety and Robustness in Decision-Making, 2019,
“‘Do the Right Thing’: Machine Learning and Causal Inference for Improved Decision-Making.”
9 See 2020 “Algorithmic Fairness Through the Lens of Causality and Interpretability.”
10 See 2020 “Machine Learning for Health (ML4H): Advancing Healthcare for All,” 2020 Workshop on Fair AI in Finance.
Privacy and Data Collection
Amid growing concerns about privacy, data sovereignty, and the commodification of personal data for profit, there has been significant momentum in industry and academia to build methods and frameworks to help mitigate privacy concerns. Since 2018, several workshops have been devoted to privacy in machine learning, covering topics such as privacy in machine learning within specific domains (e.g., financial services), federated learning for decentralized model training, and differential privacy to ensure that training data does not leak personally identifiable information.11 This section shows the number of papers submitted to NeurIPS mentioning “privacy” in the title along with papers accepted to privacy-themed NeurIPS workshops, and finds a significant increase in the number of accepted papers since 2016 (Figure 3.3.6).
NEURIPS RESEARCH TOPICS: NUMBER of ACCEPTED PAPERS on INTERPRETABILITY and EXPLAINABILITY, 2015–21
Source: NeurIPS, 2021; AI Index, 2021 | Chart: 2022 AI Index Report
(Main track and workshop papers)
Figure 3.3.5
11 See “Privacy Preserving Machine Learning,” Workshop on Federated Learning for Data Privacy and Confidentiality, Privacy in Machine Learning (PriML).
Figure 3.3.6: Number of accepted NeurIPS papers on privacy (main track and workshop), 2015–2021
Fairness and Bias
In 2020, NeurIPS started requiring authors to submit broader impact statements addressing the ethical and potential societal consequences of their work, a move that suggests the community is signaling the importance of AI ethics early in the research process. One measure of the interest in fairness and bias at NeurIPS over time is the number of papers accepted to the conference main track that mention fairness or bias in the title, along with papers accepted to a fairness-related workshop. Figure 3.3.7 shows a sharp increase from 2017 onward, demonstrating the newfound importance of these topics within the research community.
NEURIPS RESEARCH TOPICS: NUMBER of ACCEPTED PAPERS on FAIRNESS and BIAS in AI, 2015–21
Source: NeurIPS, 2021; AI Index, 2021 | Chart: 2022 AI Index Report
(Main track and workshop papers)
Figure 3.3.7
This section analyzes trends in using AI to verify the factual accuracy of claims, as well as research related to measuring the truthfulness
of AI systems. It is imperative that AI systems deployed in safety-critical contexts (e.g., healthcare, finance, disaster response) provide
users with knowledge that is factually accurate, but today’s state-of-the-art language models have been shown to generate false
information about the world, making them unsafe for fully automated decision making.
Figure 3.4.1: Number of fact-checking datasets by number of label classes (2 classes: 75 datasets; 3 classes: 28; 4 classes: 16; 5 classes: 15; 6 classes: 6; 7 classes: 3; 13, 17, 18, and 40 classes: 1 each; numeric: 1; ranking: 3)
The increased interest in automated fact-checking is evidenced by the number of citations of relevant benchmarks: FEVER is a fact extraction and verification dataset made up of claims classified as supported, refuted, or not enough information. LIAR is a dataset for fake news detection with six fine-grained labels denoting varying levels of factuality. Similarly, Truth of Varying Shades is a multiclass political fact-checking and fake news detection benchmark. Figure 3.4.2 shows that these three English benchmarks have been cited with increasing frequency in recent years.
Figure 3.4.2: Number of citations per year of the LIAR, FEVER, and Truth of Varying Shades benchmarks
Figure 3.4.3 shows the number of fact-checking datasets created for English compared to all other languages over time. As seen in Figure 3.4.4, there are only 35 non-English datasets (including 14 in Arabic, 5 in Chinese, 3 in Spanish, 3 in Hindi, and 2 in Danish) compared to 142 English-only datasets.12
Figure 3.4.3: Number of fact-checking datasets created per year, English vs. non-English, 2010–2021
Figure 3.4.4: Number of fact-checking benchmarks by language: English 142, Arabic 14, Chinese 5, Hindi 3, Spanish 3, Danish 2, French 2, German 2, Portuguese 2, Bengali 1, Bulgarian 1, Croatian 1, Italian 1, Malayalam 1, Tamil 1
12 Modern language models are trained on disproportionately larger amounts of English text, which negatively impacts performance on other languages. The Gopher family of models is trained on MassiveText
(10.5 TB), which is 99% English. Similarly, only 7% of training data in GPT-3 was in languages other than English. See the Appendix for a comparison of a multilingual model (XGLM-564M) and GPT-3.
Measuring Fact-Checking Accuracy With the FEVER Benchmark
FEVER (Fact Extraction and VERification) is a benchmark measuring the accuracy of fact-checking systems, where the task requires systems to verify the factuality of a claim with supporting evidence extracted from English Wikipedia. Systems are measured on classification accuracy and FEVER score, a custom metric which measures whether the claim was correctly classified and at least one set of supporting evidence was correctly identified. Several variations of this dataset have since been introduced (e.g., FEVER 2.0, FEVEROUS, FoolMeTwice).
Figure 3.4.5 shows that state-of-the-art performance has steadily increased over time on both accuracy and FEVER score. Some contemporary language models only report accuracy, as in the case of Gopher.
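The Python sketch below illustrates the difference between plain label accuracy and the stricter FEVER score: a claim counts toward the FEVER score only if its label is predicted correctly and, for claims other than "not enough info," at least one complete gold evidence set appears among the retrieved evidence. The records are hypothetical, not drawn from the FEVER leaderboard.

examples = [
    {"gold": "SUPPORTED", "pred": "SUPPORTED",
     "gold_evidence_sets": [{("PageA", 3)}, {("PageB", 1), ("PageB", 2)}],
     "retrieved": {("PageA", 3), ("PageC", 7)}},
    {"gold": "REFUTED", "pred": "SUPPORTED",
     "gold_evidence_sets": [{("PageD", 0)}],
     "retrieved": {("PageD", 0)}},
    {"gold": "NOT ENOUGH INFO", "pred": "NOT ENOUGH INFO",
     "gold_evidence_sets": [], "retrieved": set()},
]

label_correct, strict_correct = 0, 0
for ex in examples:
    label_ok = ex["pred"] == ex["gold"]
    evidence_ok = not ex["gold_evidence_sets"] or any(s <= ex["retrieved"] for s in ex["gold_evidence_sets"])
    label_correct += label_ok
    strict_correct += label_ok and evidence_ok

n = len(examples)
print(f"Label accuracy: {label_correct / n:.2f}, FEVER score: {strict_correct / n:.2f}")  # 0.67, 0.67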
FACT EXTRACTION and VERIFICATION (FEVER) BENCHMARK: ACCURACY and FEVER SCORE, 2018–21
Source: AI Index, 2021 | Chart: 2022 AI Index Report
Figure 3.4.5
TOWARD TRUTHFUL LANGUAGE MODELS
While language models have become increasingly coherent, they have also become more fluent at generating factually incorrect statements and fabricating falsehoods. This increased fluency means they also have a greater capacity to perpetuate harm by, for instance, creating convincing conspiracy theories.
Model Size and Truthfulness
TruthfulQA is a benchmark designed to evaluate the truthfulness of language models on question answering, with questions drawn from categories such as health, law, finance, and politics. Questions were designed such that humans might be prone to answer incorrectly due to common misconceptions: For example, when asked, “What happens if you smash a mirror?” GPT-3 responds, “You will have seven years of bad luck.”
The task is formulated as an open-ended generative task, where systems are evaluated on the truthfulness and informativeness of their answers (e.g., answering “I don’t know” may be truthful but not informative). A combination of metrics adapted from machine translation (BLEURT, ROUGE, and BLEU) is used to compare the model answer to reference answers. In this setting, a small model called GPT-Judge is also trained to predict human evaluations of truthfulness and informativeness for a given answer. Alternatively, the task can be formulated as a multiple-choice task where models are evaluated on their accuracy in selecting the correct answer.
In the multiple-choice version of this task, initial experiments on GPT-Neo, GPT-2, T5 (UnifiedQA), and GPT-3 showed that larger models provide more informative answers but are not necessarily more truthful. Later experiments on DeepMind’s Gopher model contradicted this finding: Figure 3.4.6 from the Gopher paper shows that accuracy improves with model size on the multiple-choice task. This contradiction may be due to the formulation of the TruthfulQA dataset, which was collected adversarially against GPT-3 175B, possibly explaining the lower performance of the GPT-3 family of models.
Figure 3.4.6: Percentage of truthful and informative answers on TruthfulQA by model (GPT-Neo, T5, GPT-2, GPT-3, and Gopher families, 60M to 280B parameters, including Gopher 280B 10-shot)
WebGPT was designed to improve the factual accuracy of GPT-3 by introducing a mechanism to search the Web for sources to cite when providing answers to questions. Similar to Gopher, WebGPT also shows more truthful and informative results with increased model size. While performance improves compared to GPT-3, WebGPT still struggles with out-of-distribution questions, and its performance is considerably below human performance. However, since WebGPT cites sources and appears more authoritative, its untruthful answers may be more harmful as users may not investigate cited material to verify each source.
InstructGPT models are a variant of GPT-3 that uses human feedback to train the model to follow instructions; they are created by fine-tuning GPT-3 on a dataset of human-written responses to a set of prompts. The fine-tuned models using human-curated responses are called SFT (supervised fine-tuning). The baseline SFT is further fine-tuned using reinforcement learning from human feedback. This family is called PPO because it uses a technique called Proximal Policy Optimization. Finally, PPO models are further enhanced and called InstructGPT.
Figure 3.4.7 shows the truthfulness of eight language model families on the TruthfulQA generation task. Similar to the scaling effect observed in the Gopher family, the WebGPT and InstructGPT models yield more truthful and informative answers as they scale. The exception to the scaling trend is the supervised fine-tuned InstructGPT baseline, which corroborates observations from the TruthfulQA paper that the baseline GPT-3 family of models underperforms with scale.
Truthful and informative answers (%) on the TruthfulQA generation task by model: GPT-Neo, GPT-2, GPT-3, T5, SFT, PPO-finetuned SFT, WebGPT, and InstructGPT families at various sizes.
Figure 3.4.7
Labels returned for images of women (including “lady,” “woman,” “female,” “blonde,” and “pixie cut”) and for images of men (including “man,” “male,” “player,” “suit,” and “military officer”), with their frequencies (%).
Figure 3.4.8
RESULTS OF THE CLIP EXPERIMENTS PERFORMED WITH THE COLOR IMAGE OF THE ASTRONAUT EILEEN
Source: Birhane et al., 2021
Figure 3.4.9
Underperformance on Non-English Languages
CLIP can be extended to non-English languages by replacing the original English text encoder with a pretrained, multilingual model such as Multilingual BERT (mBERT) and fine-tuning further. However, its documentation cautions against using the model for non-English languages since CLIP was trained only on English text, and its performance has not been evaluated on other languages.

Yet mBERT has performance gaps on low-resource languages such as Latvian or Afrikaans,14 which means that multilingual versions of CLIP trained with mBERT will still underperform. Even for high-resource languages, such as French and Spanish, there are still noticeable accuracy gaps in gender and age classification.
14 While mBERT performs well on high-resource languages like French, on 30% of languages (out of 104 total languages) with lower pretraining resources, it performs worse than using no pretrained
model at all.
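The sketch below illustrates, under stated assumptions, the kind of extension described above: CLIP's vision encoder is paired with a multilingual text encoder (mBERT) through small learned projections into a shared embedding space, which could then be fine-tuned contrastively on image and caption pairs. The model names, projection size, and the random image tensor are illustrative; this is not the recipe used by any particular multilingual CLIP release.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, CLIPVisionModel

class MultilingualCLIP(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text = BertModel.from_pretrained("bert-base-multilingual-cased")
        # Projection heads into a shared embedding space.
        self.v_proj = nn.Linear(self.vision.config.hidden_size, dim)
        self.t_proj = nn.Linear(self.text.config.hidden_size, dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.v_proj(self.vision(pixel_values=pixel_values).pooler_output)
        txt = self.t_proj(self.text(input_ids=input_ids,
                                    attention_mask=attention_mask).pooler_output)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return img @ txt.T  # cosine-similarity logits for contrastive fine-tuning

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = MultilingualCLIP()
pixel_values = torch.randn(1, 3, 224, 224)            # placeholder image batch
text = tokenizer(["una astronauta sonriente"], return_tensors="pt")
with torch.no_grad():
    print(model(pixel_values, text["input_ids"], text["attention_mask"]).shape)
```

Because the text side of such a pairing is mBERT, the low-resource weaknesses noted above would carry over directly to any multilingual CLIP built this way.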
CHAPTER 4:
The Economy
and Education
CHAPTER 4:
Chapter Preview
Overview
Chapter Highlights
4.1 JOBS
AI Hiring
AI Labor Demand
Global AI Labor Demand
U.S. AI Labor Demand: By Skill Cluster
U.S. Labor Demand: By Sector
U.S. Labor Demand: By State
AI Skill Penetration
4.3 CORPORATE ACTIVITY
Industry Adoption
Global Adoption of AI
AI Adoption by Industry and Function
Type of AI Capabilities Adopted
Consideration and Mitigation of Risks From Adopting AI
4.4 AI EDUCATION
CS Undergraduate Graduates in North America
Overview
The growing use of artificial intelligence (AI) in everyday life, across
industries, and around the world generates numerous questions about
how AI is shaping the economy and education—and, conversely, how
the economy and education are adapting to AI. AI promises many
opportunities in workplace productivity, supply chain efficiency,
customized consumer experiences, and other areas. At the same time,
however, the technology gives rise to a number of concerns. How
do businesses adapt to recruiting and retaining AI talent? How is the
education system keeping pace with the demand for AI labor and the
need to understand AI’s impact on society? All of these questions and
more are inextricable from AI today.
CHAPTER HIGHLIGHTS
• New Zealand, Hong Kong, Ireland, Luxembourg, and Sweden are the countries or regions with
the highest growth in AI hiring from 2016 to 2021.
• In 2021, California, Texas, New York, and Virginia were states with the highest number of AI job
postings in the United States, with California having over 2.35 times the number of postings
as Texas, the second greatest. Washington, D.C., had the greatest rate of AI job postings
compared to its overall number of job postings.
• The private investment in AI in 2021 totaled around $93.5 billion—more than double the
total private investment in 2020, while the number of newly funded AI companies continues
to drop, from 1051 companies in 2019 and 762 companies in 2020 to 746 companies in 2021.
In 2020, there were 4 funding rounds worth $500 million or more; in 2021, there were 15.
• “Data management, processing, and cloud” received the greatest amount of private AI
investment in 2021—2.6 times the investment in 2020, followed by “medical and healthcare”
and “fintech.”
• In 2021, the United States led the world in both total private investment in AI and the number
of newly funded AI companies, three and two times higher, respectively, than China, the next
country on the ranking.
• Efforts to address ethical concerns associated with using AI in industry remain limited,
according to a McKinsey survey. While 29% and 41% of respondents recognize “equity and
fairness” and “explainability” as risks while adopting AI, only 19% and 27% are taking steps
to mitigate those risks.
• In 2020, 1 in every 5 CS students who graduated with PhD degrees specialized in artificial
intelligence/machine learning, the most popular specialty in the past decade. From 2010 to
2020, the majority of AI PhDs in the United States headed to industry while a small fraction
took government jobs.
4.1 JOBS
AI HIRING
The AI hiring data draws on a dataset from LinkedIn of skills and jobs listings on the platform. It focuses specifically on countries or regions where LinkedIn covers at least 40% of the labor force and where there are at least 10 AI hires each month. China and India were also included due to their global importance, despite not meeting the 40% coverage threshold. Insights for these countries may not provide as full a picture as others, and should be interpreted accordingly.

Figure 4.1.1 shows the 15 geographic areas with the highest relative AI hiring index for 2021. The AI hiring rate is calculated as the percentage of LinkedIn members with AI skills on their profile or working in AI-related occupations who added a new employer in the same period the job began, divided by the total number of LinkedIn members in the corresponding location. This rate is then indexed to the average month in 2016; for example, an index of 1.05 in December 2021 points to a hiring rate that is 5% higher than the average month in 2016. LinkedIn makes month-to-month comparisons to account for any potential lags in members updating their profiles. The index for a year is the number in December of that year.

The relative AI hiring index captures whether hiring of AI talent is growing faster than, equal to, or more slowly than overall hiring in a particular country or region. New Zealand has the highest growth in AI hiring—2.42 times greater in 2021 compared with 2016, followed by Hong Kong (1.56), Ireland (1.28), Luxembourg (1.26), and Sweden (1.24). Moreover, many countries or regions experienced a decrease in their AI hiring growth from 2020 to 2021—indicating that the pace of change in the AI hiring rate, against the rate of overall hiring, declined over the last year, with the exception of Germany and Sweden (Figure 4.1.2).
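A minimal numerical sketch of this construction follows, using invented monthly rates: each month's AI hiring rate is divided by the 2016 monthly average, and the index reported for a year is the December value.

```python
# Hypothetical monthly AI hiring rates (AI hires / LinkedIn members in a location).
monthly_ai_hiring_rate = {
    "2016-01": 0.0020, "2016-06": 0.0022, "2016-12": 0.0021,
    "2021-11": 0.0029, "2021-12": 0.0031,
}

# Baseline: average monthly rate in 2016.
rates_2016 = [v for m, v in monthly_ai_hiring_rate.items() if m.startswith("2016")]
baseline_2016 = sum(rates_2016) / len(rates_2016)

# Yearly index is the December value relative to the 2016 baseline.
index_2021 = monthly_ai_hiring_rate["2021-12"] / baseline_2016
print(f"Relative AI hiring index for 2021: {index_2021:.2f}")
# A value of 1.48 here would mean December 2021 hiring ran 48% above the 2016 average month.
```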
Relative AI hiring index, 2021, by geographic area; values shown: Ireland 1.28, Luxembourg 1.26, Sweden 1.24, Netherlands 1.23, China 1.18, Australia 1.14, Canada 1.14, Spain 1.10, Denmark 1.09, Germany 1.08.
Figure 4.1.1
Relative AI hiring index over time, 2018–2021, for Germany, Hong Kong, Ireland, Luxembourg, the Netherlands, New Zealand, South Africa, Spain, Sweden, the United Kingdom, and the United States.
Figure 4.1.2
AI job postings (% of all job postings) by geographic area, end-of-series values: Singapore 2.33%, United States 0.90%, Canada 0.78%, United Kingdom 0.74%, Australia 0.58%.
AI JOB POSTINGS (% of ALL JOB POSTINGS) in the UNITED STATES by SKILL CLUSTER, 2010–21
Source: Emsi Burning Glass, 2021 | Chart: 2022 AI Index Report
Figure 4.1.4
1 See the Appendix for a complete list of AI skills under each skill cluster.
AI JOB POSTINGS (% of ALL JOB POSTINGS) in the UNITED STATES by SECTOR, 2021
Source: Emsi Burning Glass, 2021 | Chart: 2022 AI Index Report
Values shown include Information 3.30%, Manufacturing 2.02%, and Utilities 0.78%.
Figure 4.1.5
U.S. Labor Demand: By State
Figure 4.1.6 breaks down the U.S. AI labor demand by state. In 2021, the top states posting AI jobs were California (80,238), followed by Texas (34,021).
NUMBER of AI JOB POSTINGS in the UNITED STATES by STATE, 2021
Source: Emsi Burning Glass, 2021 | Chart: 2022 AI Index Report
Map values shown include Texas 34,021, Florida 14,972, Hawaii 866, Maine 693, and Alaska 481.
Figure 4.1.6
AI JOB POSTINGS (TOTAL and % of ALL JOB POSTINGS) by U.S. STATE and DISTRICT, 2021
Source: Emsi Burning Glass, 2021 | Chart: 2022 AI Index Report
Number of AI job postings (log scale) versus AI job postings as a share of all postings; labeled points include the District of Columbia, Virginia, Washington, Massachusetts, California, Delaware, and New York.
Figure 4.1.7
Relative AI skill penetration rate by geographic area; values shown: India 3.09, Germany 1.70, China 1.56, Israel 1.52, Canada 1.41, Singapore 1.22, France 1.11, Spain 0.84, Australia 0.77, Switzerland 0.76, Italy 0.74, Sweden 0.60.
2 Those included are a sample of eligible countries or regions with at least 40% labor force coverage by LinkedIn and at least 10 AI hires in any given month. China and India were also included in this
sample because of their increasing importance in the global economy, but LinkedIn coverage in these countries does not reach 40% of the workforce. Insights for these countries may not provide as full a
picture as in others, and should be interpreted accordingly.
Relative AI skill penetration rate by gender (female vs. male) for India, the United States, Canada, South Korea, China, Singapore, Germany, Israel, Australia, Finland, the United Kingdom, the Netherlands, France, Switzerland, and Brazil.
This section on corporate AI activity draws on data from NetBase Quid, which aggregates over 6 million global public and private
company profiles, updated on a weekly basis, including metadata on investments, location of headquarters, and more. NetBase Quid
also applies natural language processing technology to search, analyze, and identify patterns in large, unstructured datasets, like
aggregated blogs, company and patent databases.
4.2 INVESTMENT
CORPORATE INVESTMENT
Corporate investment in artificial intelligence, from mergers and acquisitions to public offerings, is a key contributor to AI research and development. It also contributes to AI’s impact on the economy. Figure 4.2.1 highlights overall global corporate investment in AI from 2013–2021. In 2021, companies made the greatest AI investment through private investment (totaling around $93.5 billion), followed by mergers and acquisitions (around $72 billion), public offerings (around $9.5 billion), and minority stake (around $1.3 billion). In 2021, investments from mergers and acquisitions grew by 3.3 times compared to 2020, led by two AI healthcare companies and two cybersecurity companies.3
Global corporate investment in AI by activity (merger/acquisition, minority stake, private investment, public offering), 2013–2021, in billions of U.S. dollars; total investment reached $176.47 billion in 2021.
Figure 4.2.1
3 Nuance Communications (by Microsoft, $19.8 billion), Varian Medical Systems (Siemens, $17.2 billion), and Proofpoint (Thoma Bravo, $12.4 billion) in the United States, followed by Avast in the
Czech Republic (NortonLifeLock, $8.0 billion).
Global private investment in AI, 2013–2021, in billions of U.S. dollars; $93.54 billion in 2021.
Figure 4.2.2
4 The largest private investments have been Databricks (United States), Beijing Horizon Robotics Technology (China), Oxbotica Limited (United Kingdom), and Celonis (Germany).
Number of newly funded AI companies, 2013–2021; 746 companies in 2021.
Figure 4.2.3
Table 4.2.1 (partially recovered): funding rounds over $1 billion — 3, 5, 8.
Regional Comparison by Funding Amount
In 2021, as captured in Figure 4.2.4, the United States led the world in overall private investment in funded AI companies—at approximately $52.9 billion—over three times the next country on the list, China ($17.2 billion). In third place was the United Kingdom ($4.65 billion), followed by Israel ($2.4 billion) and Germany ($1.98 billion). Figure 4.2.5 shows that when combining total private investment from 2013 to 2021, the same ranking applies: U.S. investment totaled $149 billion and Chinese investment totaled $61.9 billion, followed by the United Kingdom ($10.8 billion), India ($10.77 billion), and Israel ($6.1 billion). Notably, U.S. private investment in AI companies from 2013–2021 was more than double the total in China, which itself was about six times the total investment from the United Kingdom in the same period.

Broken out by geographic area, as shown in Figure 4.2.6, the United States, China, and the European Union all grew their investments from 2020 to 2021, with the United States leading China and the European Union by 3.1 and 8.2 times the investment amount, respectively.
Private investment in AI by geographic area, 2021 (in billions of U.S. dollars); values shown: China 17.21, Israel 2.41, Germany 1.98, Canada 1.87, France 1.55, India 1.35, Australia 1.25, Singapore 0.93, Spain 0.89, Portugal 0.52.
Figure 4.2.4
Private investment in AI by geographic area, 2013–21 (sum, in billions of U.S. dollars); values shown: China 61.9, India 10.8, Israel 6.1, Canada 5.7, Germany 4.4, France 3.9, Singapore 2.3, Japan 2.2, Australia 1.8, Spain 1.3, Hong Kong 1.2, Netherlands 1.2.
Figure 4.2.5
Private investment in AI by geographic area over time (in billions of U.S. dollars); 2021 values shown: China 17.21, European Union 6.42.
Figure 4.2.6
Number of newly funded AI companies by geographic area, 2021; values shown: China 119, United Kingdom 49, Israel 28, France 27, Germany 25, India 23, Canada 22, South Korea 19, Netherlands 16, Australia 12, Japan 12, Switzerland 12, Singapore 10.
Number of newly funded AI companies by geographic area, 2013–21 (sum); values shown: China 940, Israel 264, France 242, Canada 239, Japan 172, India 169, Germany 154, Singapore 99, Australia 77, South Korea 74, Switzerland 65, Spain 56.
Number of newly funded AI companies over time, 2013–2021; 2021 values shown: United States 299, China 119, European Union 106.
Figure 4.2.9
Focus Area Analysis
Private AI investment also varies by focus area. According to Figure 4.2.10, the greatest private investment in AI in 2021 was in data management, processing, and cloud (around $12.2 billion). Notably, this was 2.6 times the investment in 2020 (around $4.69 billion), as two of the four largest private investments made in 2021 were data management companies. In second place was private investment in medical and healthcare ($11.29 billion), followed by fintech ($10.26 billion), AV ($8.09 billion), and semiconductors ($6.0 billion).

Aggregated data in Figure 4.2.11 shows that in the last five years, the medical and healthcare category received the largest private investment globally ($28.9 billion); followed by data management, processing, and cloud ($26.9 billion); fintech ($24.9 billion); and retail ($21.95 billion). Moreover, Figure 4.2.12 shows the overall trend in private investment by industries from 2017–2021 and reveals a steady increase in the AV, cybersecurity and data protection, fitness and wellness, medical and healthcare, and semiconductor industries.
PRIVATE INVESTMENT in AI by FOCUS AREA, 2020 vs. 2021
PRIVATE INVESTMENT in AI by FOCUS AREA, 2017–21 (SUM)
Source: NetBase Quid, 2021 | Chart: 2022 AI Index Report
Figure 4.2.12
AI adoption by organizations, by region (% of respondents); labeled values include Developed Asia-Pacific 64% and India 65%.
AI Adoption by Industry and Function
Figure 4.3.2 shows AI adoption by industry and function in 2021. The greatest adoption was in product and/or service development for high tech/telecommunications (45%), followed by service operations for financial services (40%), service operations for high tech/telecommunications (34%), and risk function for financial services (32%).
AI adoption by industry and function, 2021 (% of respondents); industries shown include Automotive and Assembly, Consumer Goods/Retail, Healthcare Systems/Pharma and Medical Products, and High Tech/Telecom.
Figure 4.3.2
Type of AI capabilities adopted, by industry (% of respondents); industries shown include All Industries, Automotive and Assembly, Consumer Goods/Retail, Financial Services, Healthcare Systems/Pharma and Medical Products, and High Tech/Telecom.
Consideration and Mitigation of Risks From Adopting AI
The risk from adopting AI that survey respondents identified as most relevant in 2021 was cybersecurity (55% of respondents), followed by regulatory compliance (48%), explainability (41%), and personal/individual privacy (41%) (Figure 4.3.4). Fewer organizations found AI risks from cybersecurity to be relevant in 2021 than in 2020, declining from just over 60% of respondents expressing concern in 2020 to 55% in 2021. Concern over AI regulatory compliance risks, meanwhile, remained virtually unchanged from 2020.

Figure 4.3.5 shows risks from AI that organizations are taking steps to mitigate. Cybersecurity was the most frequent response (47% of respondents), followed by regulatory compliance (36%), personal/individual privacy (28%), and explainability (27%). It is worth noting the gaps between risks that organizations find relevant and risks that organizations take steps to mitigate—a gap of 10 percentage points with equity and fairness (29% to 19%), 12 percentage points with regulatory compliance (48% to 36%), 13 percentage points with personal/individual privacy (41% to 28%), and 14 percentage points with explainability (41% to 27%).
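The percentage-point gaps quoted above follow directly from the survey figures; the short sketch below recomputes them from values transcribed out of this section.

```python
# Survey values transcribed from the paragraph above (% of respondents).
relevant = {"equity and fairness": 29, "regulatory compliance": 48,
            "personal/individual privacy": 41, "explainability": 41}
mitigating = {"equity and fairness": 19, "regulatory compliance": 36,
              "personal/individual privacy": 28, "explainability": 27}

for risk in relevant:
    gap = relevant[risk] - mitigating[risk]
    print(f"{risk}: {relevant[risk]}% relevant vs. {mitigating[risk]}% mitigated "
          f"({gap} percentage points)")
```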
Risks from adopting AI that organizations consider relevant (% of respondents), 2019–2021; 2021 values shown: cybersecurity 55%, regulatory compliance 48%, explainability 41%.
Figure 4.3.4
Risks from adopting AI that organizations are taking steps to mitigate (% of respondents); values shown: cybersecurity 47%, personal/individual privacy 28%, explainability 27%, national security 8%.
Figure 4.3.5
The following section draws on data from the annual Computing Research Association (CRA) Taulbee Survey. For the latest survey
featured in this section, CRA collected data in Fall 2020 by reaching out to over 200 PhD-granting departments in the United States and
Canada. Results were published in May 2021. The CRA survey documents trends in student enrollment, degree production, employment
of graduates, and faculty salaries in academic units in the United States and Canada that grant doctoral degrees in computer science
(CS), computer engineering (CE), or information (I). Academic units include departments of CS and CE or, in some cases, colleges or
schools of information or computing.
4.4 AI EDUCATION
CS UNDERGRADUATE GRADUATES IN NORTH AMERICA
In North America, most AI-related courses are offered as part of the CS curriculum at the undergraduate level. The number of new CS undergraduate graduates at doctoral institutions in North America has grown 3.5 times from 2010 to 2020 (Figure 4.4.1). More than 31,000 undergraduates completed CS degrees in 2020—an 11.60% increase from the number in 2019.
Number of new CS undergraduate graduates (in thousands) in North America, 2010–2020; 31.84 thousand in 2020.
Figure 4.4.1
NEW CS PHDS IN NORTH AMERICA
The following sections show the trend of CS PhD graduates in North America with a focus on those with AI-related specialties.5 The CRA survey includes 20 specialties in total, two of which are directly related to the field of AI: artificial intelligence/machine learning (AI/ML) and robotics/vision.
5 New CS PhDs in this section include PhD graduates from academic units (departments, colleges, or schools within universities) of computer science in the United States.
PERCENTAGE POINT CHANGE in NEW CS PHDS in the UNITED STATES by SPECIALTY, 2010–20
Source: CRA Taulbee Survey, 2021 | Chart: 2022 AI Index Report
Figure 4.4.3
NEW CS PHDS with AI/ML and ROBOTICS/VISION SPECIALTY in the UNITED STATES, 2010–20
NEW CS PHDS (% of TOTAL) with AI/ML and ROBOTICS/VISION SPECIALTY in the UNITED STATES, 2010–20
Source: CRA Taulbee Survey, 2021 | Chart: 2022 AI Index Report
Labeled 2020 values: 21.00% (AI/ML) and 6.30% (robotics/vision).
EMPLOYMENT of NEW AI PHDS to ACADEMIA, GOVERNMENT, or INDUSTRY in NORTH AMERICA, 2010–20
EMPLOYMENT of NEW AI PHDS (% of TOTAL) to ACADEMIA, GOVERNMENT, or INDUSTRY in NORTH AMERICA, 2010–20
Source: CRA Taulbee Survey, 2021 | Chart: 2022 AI Index Report
Labeled 2020 values: industry 60.24%, academia 24.02%, government 1.97%.
6 New AI PhDs in this section include PhD graduates who specialize in artificial intelligence from academic units (departments, colleges, or schools within universities) of computer science,
computer engineering, and information in the United States and Canada.
DIVERSITY OF NEW AI PHDS IN NORTH AMERICA
By Gender
Figure 4.4.6 shows that the share of new female AI and CS PhDs in North America remains low and has changed little
from 2010 to 2020.
FEMALE NEW AI and CS PHDS (% of TOTAL NEW AI and CS PHDS) in NORTH AMERICA, 2010–20
Source: CRA Taulbee Survey, 2021 | Chart: 2022 AI Index Report
End-of-series values: AI 20.20%, CS 19.90%.
Figure 4.4.6
By Race/Ethnicity
According to Figure 4.4.7, among the new AI PhDs from 2010 to 2020 who are U.S. residents, the largest percentage has been non-Hispanic white and Asian—65.2% and 18.8% on average. By comparison, around 1.5% were Black or African American (non-Hispanic) and 2.9% were Hispanic on average over the past 11 years. Figure 4.4.8 shows all PhDs awarded in the United States to U.S. residents across departments of CS, CE, and information between 2010 and 2020. In the past 11 years, the share of new white (non-Hispanic) PhDs has changed little, while the percentage of new Black or African American (non-Hispanic) and Hispanic computing PhDs is significantly lower.
New U.S.-resident AI PhDs (% of total) by race/ethnicity, 2010–2020; labeled series values: Unknown 8.62%, Hispanic 6.90%, Black or African American (non-Hispanic) 1.72%, Native Hawaiian or Pacific Islander 0.86%, Multiracial (non-Hispanic) 0.86%, Native American or Alaskan Native 0.00%.
Figure 4.4.7
New U.S.-resident computing PhDs (% of total) by race/ethnicity, 2010–2020; labeled series values: Asian 24.80%, Unknown 7.80%, Hispanic 4.20%, Black or African American (non-Hispanic) 3.60%, Multiracial (non-Hispanic) 1.80%, Native American or Alaskan Native 0.10%, Native Hawaiian or Pacific Islander 0.10%.
Figure 4.4.8
New international AI PhDs (% of total new AI PhDs) in North America, 2010–2020; 60.50% at the end of the series.
Figure 4.4.9
Shares shown: United States 74.20%, outside United States 14.00%, unknown 11.80%.
Figure 4.4.10
CHAPTER 5:
AI Policy and
Governance
CHAPTER 5:
Chapter Preview
Overview
Chapter Highlights
U.S. AI Policy Papers
By Topic
Overview
As AI has become an increasingly ubiquitous topic in the last decade,
intergovernmental, national, and regional organizations have worked to
develop policies and strategies around AI governance. These actors are
driven by the understanding that it is imperative to find ways to address
the ethical and societal concerns surrounding AI, while maximizing
its benefits. Active and informed governance of AI technologies has
become a priority for many governments around the world.
CHAPTER HIGHLIGHTS
• An AI Index analysis of legislative records on AI in 25 countries shows that the number of bills
containing “artificial intelligence” that were passed into law grew from just 1 in 2016 to 18 in
2021. Spain, the United Kingdom, and the United States passed the highest number of AI-related
bills in 2021, with each adopting three.
• The federal legislative record in the United States shows a sharp increase in the total number of
proposed bills that relate to AI from 2015 to 2021, while the number of bills passed remains low,
with only 2% ultimately becoming law.
• State legislators in the United States passed 1 out of every 50 proposed bills that contain AI
provisions in 2021, while the number of such bills proposed grew from 2 in 2012 to 131 in 2021.
• In the United States, the current congressional session (the 117th) is on track to record the greatest
number of AI-related mentions since 2001, with 295 mentions by the end of 2021, halfway
through the session, compared to 506 in the previous (116th) session.
Discussions around AI governance and regulation have accelerated over the past decade, resulting in policy proposals across various
legislative bodies. This section first examines AI-related legislation that has either been proposed or passed into law across different
countries and regions, followed by a focused analysis of state-level legislation in the United States. It then takes a closer look at
congressional and parliamentary records on AI across the world and concludes with data on the number of policy papers published
in the United States.
Number of AI-related bills passed into law in 25 select countries, 2016–2021; 18 bills in 2021.
Figure 5.1.1
1 Note that the analysis only includes laws passed by national legislative bodies (e.g. congress, parliament) with the keyword “artificial intelligence” in various languages in the title or body of the bill
text. See the appendix for the methodology. Countries included: Australia, Belgium, Brazil, Canada, China, Denmark, Finland, France, Germany, India, Ireland, Italy, Japan, the Netherlands, New Zealand,
Norway, Russia, Singapore, South Africa, South Korea, Spain, Sweden, Switzerland, the United Kingdom, and the United States.
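A simplified sketch of the kind of keyword filter this methodology implies is shown below: passed bills are counted by year when their text contains "artificial intelligence" in any of the covered languages. The record fields, sample bills, and abbreviated keyword list are hypothetical and stand in for the AI Index's actual pipeline.

```python
from collections import Counter

# Hypothetical bill records; real data would come from national legislative databases.
bills = [
    {"country": "Spain", "year": 2021, "passed": True,
     "text": "...el uso de la inteligencia artificial en la administración..."},
    {"country": "United States", "year": 2020, "passed": True,
     "text": "...outputs of generative adversarial networks..."},
]

# Keyword in each covered language (abbreviated for illustration).
KEYWORDS = ["artificial intelligence", "inteligencia artificial", "intelligence artificielle"]

def mentions_ai(text: str) -> bool:
    text = text.lower()
    return any(k in text for k in KEYWORDS)

by_year = Counter(b["year"] for b in bills if b["passed"] and mentions_ai(b["text"]))
print(by_year)  # Counter({2021: 1}) for this toy sample
```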
By Geographic Area
Figure 5.1.2a shows the number of laws containing mentions of AI that were enacted in 2021. Spain, the United Kingdom, and the United States led, each passing three. Figure 5.1.2b shows the total number of bills passed in the past six years. The United States dominated the list with 13 bills, starting in 2017 with 3 new laws passed each subsequent year, followed by Russia, Belgium, Spain, and the United Kingdom.
Number of AI-related bills passed into law in select countries, 2021: Spain 3, United Kingdom 3, United States 3, Belgium 2, Russia 2, France 1, Germany 1, Italy 1, Japan 1, South Korea 1.
Figure 5.1.2a
NUMBER of AI-RELATED BILLS PASSED into LAW in SELECT COUNTRIES, 2016–21 (SUM)
Source: AI Index, 2021 | Chart: 2022 AI Index Report
United States 13, Russia 6, Belgium 5, Spain 5, United Kingdom 5, France 4, Italy 4, South Korea 4, Japan 3, China 2, Brazil 1, Canada 1, Germany 1, India 1.
Figure 5.1.2b
Federal AI Legislation in the United States
A closer look at the federal legislative record in the United States shows a sharp increase in the total number of proposed bills that relate to AI (Figure 5.1.3). In 2015, just one federal bill was proposed, while in 2021, there were 130. Although this jump is significant, the number of bills related to AI being passed has not kept pace with the growing volume of proposed AI-related bills. This gap was most evident in 2021, when only 2% of all federal-level AI-related bills were ultimately passed into law.
NUMBER of AI-RELATED BILLS in the UNITED STATES, 2015–21 (PROPOSED vs. PASSED)
Source: AI Index, 2021 | Chart: 2022 AI Index Report
2021 values shown: 130 proposed, 3 passed.
Figure 5.1.3
Canada (2017). Budget Implementation Act 2017, No. 1: A provision of this act authorized the Canadian government to make a payment of $125 million to the Canadian Institute for Advanced Research to support the development of a pan-Canadian artificial intelligence strategy.
China (2019). Law of the People’s Republic of China on Basic Medical and Health Care and the Promotion of Health: A provision of this law aimed to promote the application and development of big data and artificial intelligence in the health and medical field while accelerating the construction of medical and healthcare information infrastructure, developing technical standards on the collection, storage, analysis, and application of medical and health data.
Russia (2020). Federal Law of 24 April 2020 No. 123-FZ on the Experiment to Establish Special Regulation in order to Create the Necessary Conditions for the Development and Implementation of Artificial Intelligence Technologies in the Region of the Russian Federation – Federal City of Moscow and Amending the Articles 6 and 10 of the Federal Law on Personal Data: This law established an experimental framework for the development and implementation of AI as a five-year experiment starting in Moscow on July 1, 2020, including allowing AI systems to process anonymized personal data for governmental and certain commercial business activities.
United Kingdom (2020). Supply and Appropriation (Main Estimates) Act 2020, c.13: A provision of this act authorized the Office of Qualifications and Examination Regulation to explore opportunities for using artificial intelligence to improve the marking and administration of high-stakes qualifications.
United States (2020). IOGAN Act: Identifying Outputs of Generative Adversarial Networks Act: This act directed the National Science Foundation to support research dedicated to studying the outputs of generative adversarial networks (deepfakes) and other comparable technologies.
Belgium (2021). Decree on coaching and solution-oriented support for job seekers, N. 327: A provision of this act directs the government to create an advisory group called the Ethics Committee, which is responsible for submitting advice if artificial intelligence tools are to be used for digitization activities.
France (2021). Law No. 2021-1485 of November 15, 2021, aimed at reducing the environmental footprint of digital technology in France: This act sets up a monitoring system to evaluate the environmental impacts of newly emerging digital technologies, in particular, artificial intelligence.
Table 5.1.1
Number of state-level AI-related bills in the United States, 2012–2021 (proposed, passed, and vetoed); the number of proposed bills grew from 2 in 2012 to 131 in 2021.
Figure 5.1.4
State-level AI-related bills by U.S. state (map); labeled counts include Massachusetts 20, Illinois 15, Alabama 12, New York 8, Hawaii 7, Florida 7, Washington 6, Texas 6, and the District of Columbia 6.
Figure 5.1.6
Sponsorship by Political Party
State-level AI legislation data reveals that there is a partisan dynamic to AI lawmaking. Figure 5.1.7 plots the number of AI-related bills sponsored at the state level by Democratic and Republican lawmakers. Although there has been an increase in AI bills proposed by members of both parties since 2012, in the past four years, the data suggests Democrats were more likely to sponsor AI-related legislation. Whereas Democrats sponsored only two more AI bills than Republicans in 2018, they sponsored 39 more in 2021.
NUMBER of STATE-LEVEL PROPOSED AI-RELATED BILLS in the UNITED STATES by SPONSOR PARTY, 2012–21
Source: Bloomberg Government, 2021 | Chart: 2022 AI Index Report
2021 values shown: Democratic 79, Republican 40.
Figure 5.1.7
Mentions of AI in U.S. congressional records (legislation, Congressional Research Service reports, and committee reports) by session, 107th (2001–02) through 117th (2021–); 506 mentions in the 116th session and 295 in the 117th session through 2021.
Figure 5.1.8
AI Mentions in Global Legislative Proceedings
AI mentions in governmental proceedings are on the rise not only in the United States but also in many other countries across the world. The AI Index conducted an analysis on the minutes or proceedings of legislative sessions in 25 countries that contain the keyword “artificial intelligence” from 2016 to 2021. Figure 5.1.9 shows that the mentions of AI in legislative proceedings in 25 select countries grew 7.7 times in the past six years.2
Number of mentions of AI in the legislative proceedings of 25 select countries, 2016–2021; 1,323 mentions in 2021.
Figure 5.1.9
2 See the appendix for the methodology. Countries included: Australia, Belgium, Brazil, Canada, China, Denmark, Finland, France, Germany, India, Ireland, Italy, Japan, the Netherlands, New Zealand,
Norway, Russia, Singapore, South Africa, South Korea, Spain, Sweden, Switzerland, the United Kingdom, and the United States.
By Geographic Area
Figure 5.1.10a shows the number of mentions of AI in legislative proceedings recorded in 2021. Similar to the trend in the number of AI mentions in bills passed into law, Spain, the United Kingdom, and the United States topped the list. Figure 5.1.10b shows the total number of AI mentions in the past six years. The United Kingdom dominated the list with 939 mentions, followed by Spain, Japan, the United States, and Australia.
Number of AI-related policy papers published by the U.S. organizations tracked by the Index, 2018–2021; peaking at 210.
Figure 5.1.11
3 The complete list of organizations the Index followed can be found in the Appendix.
This section examines the public AI investment in the United States, based on data from the U.S. government and Bloomberg Government.
U.S. federal AI R&D budget (in billions of U.S. dollars), FY 2018 through FY 2021 (enacted) and FY 2022 (requested); values shown: 0.56, 1.11, 1.43, 1.53, and 1.67.
Figure 5.2.1
4 See NITRD website for details on AI R&D investment FY 2018-22 with the breakdown of core AI vs AI crosscut. Note that AI crosscutting budget data is not available for FY 2018.
U.S. DOD BUDGET for AI-SPECIFIC RESEARCH, DEVELOPMENT, TEST and EVALUATION (RDT&E), FY 2020–22
Source: Bloomberg Government and U.S. Department of Defense, 2021 | Chart: 2022 AI Index Report
Keyword-based totals range from roughly $8.7 billion to $10.0 billion per year; the DOD-reported AI R&D budget is $0.93 billion (FY20), $0.84 billion (FY21), and $0.87 billion (FY22).
Figure 5.2.2
Important data caveat: This chart is indicative of one of the challenges of quantifying public AI spending. Bloomberg Government’s analysis that searches AI-relevant keywords in DOD budgets shows that the department is requesting $10.0 billion for AI-specific R&D in FY 2022. However, DOD’s own measurement produces a smaller number of $874 million. The discrepancy may result from the difference in defining AI-related budget items. For example, a research project that uses AI for cyber defense may count human, hardware, and operations-related expenditures within the AI-related budget request, though the AI software component will be much smaller.
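The toy sketch below illustrates why a keyword-based total can be much larger than a narrower definition: matching a budget line on an AI-related keyword counts the entire line item, even when the AI component is small. The line items, amounts, and keyword list are invented for illustration.

```python
# Hypothetical budget lines; amounts in millions of U.S. dollars.
budget_lines = [
    {"name": "AI-enabled cyber defense platform", "millions": 420.0},
    {"name": "Autonomous vehicle prototyping",     "millions": 150.0},
    {"name": "Conventional radar sustainment",     "millions": 600.0},
]
AI_KEYWORDS = ("artificial intelligence", "machine learning", "ai-", "autonomous")

# Keyword matching pulls in the full value of every matched line item.
keyword_total = sum(line["millions"] for line in budget_lines
                    if any(k in line["name"].lower() for k in AI_KEYWORDS))
print(f"Keyword-matched total: ${keyword_total:.0f}M")
# Prints $570M here, even though the AI software share of those programs may be far smaller.
```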
Program Name | Department | Funds Received (in millions) | Purpose
1. Rapid Capability Development and Maturation | Army | 257 | Fund the development, engineering, acquisition, and operation of various AI-related technological prototypes that could be used for military purposes.
2. Counter Weapons of Mass Destruction Advanced Technology Development | Defense Threat Reduction Agency | 254 | Develop technologies that could “deny, defeat and disrupt” weapons of mass destruction (WMD).
3. Algorithmic Warfare Cross-Functional Teams – Software Pilot Program | Office of the Secretary of Defense | 230 | Accelerate the integration of AI technologies in DOD systems to “improve warfighting speed and lethality.”
4. Joint Artificial Intelligence Center | Defense Information Systems Agency | 137 | Develop, test, prototype, and demonstrate various AI and machine learning capabilities with the intention of integrating these capabilities across numerous domains which include “supply chain, personal recovery, infrastructure assessment, geospatial monitoring during disaster and cyber sense making.”
Table 5.2.1
DOD AI R&D Spending by Department
DOD spending on AI R&D can also be broken down at a subdepartmental level, which reveals how individual defense agencies—the Army and the Navy, for instance—compare in their AI spending (Figure 5.2.3). The U.S. Navy was the top-spending DOD agency in FY 2021 and is poised to maintain that position in 2022. The Navy has requested a total of $1.86 billion in FY 2022 for AI-related projects, followed by the Army ($1.77 billion), the Office of the Secretary of Defense ($1.1 billion), and the Air Force ($883 million).
U.S. DOD BUDGET for AI-SPECIFIC RESEARCH, DEVELOPMENT, TEST and EVALUATION (RDT&E) by DEPARTMENT, FY 2020–22
Source: Bloomberg Government, 2021 | Chart: 2022 AI Index Report
Figure 5.2.3
U.S. government contract spending on AI (in billions of U.S. dollars), 2000–2021, peaking at $1.79 billion.
Figure 5.2.4
5 Note that contractors may add a number of keywords into their applications during the procurement process, so some of the projects included may have a relatively small AI component relative to
other parts of technology.
Contract Spending by Department and Agency
Figures 5.2.5 and 5.2.6 report AI-related contract spending by the top 10 federal agencies in 2021 and from 2000 to 2021, respectively. The DOD outspent the rest of the U.S. government on both charts by a significant margin. In 2021, it spent $1.14 billion on AI-related contracts, roughly five times what was spent by the next highest department, the Department of Health and Human Services ($234 million).

Aggregate spending on AI contracts in the last four years tells a similar story. Since 2018, the DOD has spent $5.20 billion on AI contracts, approximately seven times the next highest spender, NASA ($1.41 billion). In fact, since 2018, the DOD has spent twice as much on AI-related contracts as all other government agencies combined. Following the DOD and NASA are the Department of Health and Human Services ($700 million), the Department of Homeland Security ($362 million), and the Department of the Treasury ($156 million).
TOP CONTRACT SPENDING on AI by U.S. GOVERNMENT DEPARTMENT and AGENCY, 2000–21 (SUM)
Source: Bloomberg Government, 2021 | Chart: 2022 AI Index Report
Contract Name | Department | Amount (in millions) | Purpose
Prototype Services in the Objective Areas of Automotive Cybersecurity, Vehicle Safety Technologies, Vehicle Light Weighting, Autonomous Vehicles and Intelligent Systems, Connected Vehicles, and Advanced Energy Storage Technologies | DOD | 70 | To acquire prototypes in the domain of automotive cybersecurity, vehicle safety technologies, and autonomous vehicles and intelligent systems.
Biomedical Advanced Research and Development Authority (BARDA) | HHS | 20 | To develop optical imaging devices and machine learning algorithms to assist in classifying and healing wounds and conventional burns.
Commercial Lunar Payload Services | NASA | 14 | To develop lunar robots capable of navigating the moon’s south pole to acquire lunar resources and engage in lunar-based scientific activities.
Table 5.2.2
Appendix
CHAPTER 1: Research & Development
1 All CNKI content is furnished for CSET by East View Information Services, Minneapolis, MN, USA.
2 For more information, see James Dunham, Jennifer Melot, and Dewey Murdick, “Identifying the Development and Application of Artificial Intelligence in Scientific Text,” arXiv [cs.DL], May 28, 2020,
https://arxiv.org/abs/2002.07143.
3 These scores are based on cosine similarities between field-of-study and paper embeddings. See Zhihong Shen, Hao Ma, and Kuansan Wang, “A Web-Scale System for Scientific Knowledge
Exploration,” arXiv [cs.CL], May 30, 2018, https://arxiv.org/abs/1805.12216.
4 See https://www.grid.ac/ for more information about the GRID dataset from Digital Science.
sector is available, papers were counted toward these sectors, by year. Cross-sector collaborations on academic publications were calculated using the same method as in the cross-country collaborations analysis.

GITHUB STARS
Source
GitHub: star-history (available at the star-history website) was used to retrieve the data.
Methodology
The visual in the report shows the number of stars for various GitHub repositories over time. The repositories include the following: apachecn/ailearning, apache/incubator-mxnet, Avik-Jain/100-Days-Of-ML-Code, aymericdamien/TensorFlow-Examples, BVLC/caffe, caffe2/caffe2, CorentinJ/Real-Time-Voice-Cloning, deepfakes/faceswap, dmlc/mxnet, exacity/deeplearningbook-chinese, fchollet/keras, floodsung/Deep-Learning-Papers-Reading-Roadmap, iperov/DeepFaceLab, Microsoft/CNTK, opencv/opencv, pytorch/pytorch, scikit-learn/scikit-learn, scutan90/DeepLearning-500-questions, tensorflow/tensorflow, Theano/Theano, Torch/Torch7.
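As a small illustration (not the report's pipeline, which used star-history), current star counts for the listed repositories can be pulled from the public GitHub REST API; historical counts would require paging through the stargazers endpoint.

```python
import requests

# A few of the repositories listed above.
REPOS = ["tensorflow/tensorflow", "pytorch/pytorch", "scikit-learn/scikit-learn"]

for repo in REPOS:
    # The /repos endpoint returns repository metadata, including stargazers_count.
    resp = requests.get(f"https://api.github.com/repos/{repo}", timeout=30)
    resp.raise_for_status()
    print(repo, resp.json()["stargazers_count"])
```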
5 Patents are analyzed at the “patent family” level rather than “patent document” level because patent families are a collective of patent documents all associated with a single invention and/or
innovation by the same inventors/assignees. Thus, counting at the “patent family” level mitigates artificial number inflation when there are multiple patent documents in a patent family or if a patent is
filed in multiple jurisdictions.
6 For more details on CSET’s approach and experimentation for assigning country values, see footnote 26 in “Patents and Artificial Intelligence: A Primer,” by Dewey Murdick and Patrick Thomas. (Center
for Security and Emerging Technology, September 2020), https://doi.org/10.51593/20200038.
technical progress reported on Papers with Code. The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (FID score) correspond to the result reported in the most recent version of each paper. Details on the STL-10 benchmark can be found in the STL-10 paper.

To highlight progress on STL-10, scores were taken from the following papers:
DEGAS: Differentiable Efficient Generator Search
Dist-GAN: An Improved GAN Using Distance Constraints
Off-Policy Reinforcement Learning for Efficient and Effective GAN Architecture Search
Score Matching Model for Unbounded Data Score

CIFAR-10
Data on CIFAR-10 FID scores was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code. The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (FID score) correspond to the result reported in the most recent version of each paper. Details on the CIFAR-10 benchmark can be found in the CIFAR-10 paper.

To highlight progress on CIFAR-10, scores were taken from the following papers:
AutoGAN: Neural Architecture Search for Generative Adversarial Networks
Denoising Diffusion Probabilistic Models
Improved Training of Wasserstein GANs
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Score-Based Generative Modeling in Latent Space

FaceForensics++
Data on FaceForensics++ accuracy was retrieved through a detailed arXiv literature review. The reported dates correspond to the year during which a paper was first published to arXiv or a method was introduced. With FaceForensics, recent researchers have tested previously existing deepfake detection methodologies. The year in which a method was introduced, even if it was subsequently tested, is the year in which it is included in the report. The reported results (accuracy) correspond to the result reported in the most recent version of each paper. Details on the FaceForensics++ benchmark can be found in the FaceForensics++ paper.

To highlight progress on FaceForensics++, scores were taken from the following papers:
A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer
Detection of Deepfake Videos Using Long Distance Attention
FakeCatcher: Detection of Synthetic Portrait Videos Using Biological Signals
FaceForensics++: Learning to Detect Manipulated Facial Images
Learning Spatiotemporal Features with 3D Convolutional Networks
Recasting Residual-Based Local Descriptors as Convolutional Neural Networks
Rich Models for Steganalysis of Digital Images
Thinking in Frequency: Face Forgery Detection by Mining Frequency-Aware Clues
Xception: Deep Learning with Depthwise Separable Convolutions

Celeb-DF
Data on Celeb-DF AUC was retrieved through a detailed arXiv literature review. The reported dates correspond to the year during which a paper was first published to arXiv or a method was introduced. With Celeb-DF, recent researchers have tested previously existing deepfake detection methodologies. The year in which a method was introduced, even if it was subsequently tested, is the year in which it is included in the report. The reported results (AUC) correspond to the result reported in the most recent version of each paper. Details on the Celeb-DF benchmark can be found in the Celeb-DF paper.

To highlight progress on Celeb-DF, scores were taken from the following papers:
Exposing DeepFake Videos by Detecting Face Warping
To highlight progress on Leeds Sports Poses, scores were taken from the following papers:
Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations
Human Pose Estimation via Convolutional Part Heatmap Regression
Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation
Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation
OmniPose: A Multi-Scale Framework for Multi-Person Pose Estimation
Toward Fast and Accurate Human Pose Estimation via Soft-Gated Skip Connections

Human 3.6M
Data on Human3.6M average mean per joint position error was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code. The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (MPJPE) correspond to the result reported in the most recent version of each paper. Details on the Human3.6M benchmark can be found in the Human3.6M paper.

To highlight progress on Human3.6M with the use of extra training data, scores were taken from the following papers:
Epipolar Transformers
Learnable Triangulation of Human Pose
TesseTrack: End-to-End Learnable Multi-Person Articulated 3D Pose Tracking

Cityscapes Challenge, Pixel-Level Semantic Labeling Task
Data on the Cityscapes challenge, pixel-level semantic labeling task mean IoU was taken from the Cityscapes dataset, more specifically their pixel-level semantic labeling leaderboard. More details about the Cityscapes dataset and other corresponding semantic segmentation challenges can be accessed at the Cityscapes dataset webpage.

CVC-ClinicDB and Kvasir-SEG
Data on CVC-ClinicDB and Kvasir-SEG mean dice was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code (CVC-ClinicDB and Kvasir-SEG). The reported dates correspond to the year during which a paper was
first published to arXiv, and the reported results (mean dice) correspond to the result reported in the most recent version of each paper. Details on the CVC-ClinicDB benchmark can be found in the CVC-ClinicDB database page. Details on the Kvasir-SEG benchmark can be found in the Kvasir-SEG paper.

To highlight progress on CVC-ClinicDB, scores were taken from the following papers:
DoubleU-Net: A Deep Convolutional Neural Network for Medical Image Segmentation
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation
ResUNet++: An Advanced Architecture for Medical Image Segmentation
U-Net: Convolutional Networks for Biomedical Image Segmentation

To highlight progress on Kvasir-SEG, scores were taken from the following papers:
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation
PraNet: Parallel Reverse Attention Network for Polyp Segmentation
U-Net: Convolutional Networks for Biomedical Image Segmentation

National Institute of Standards and Technology (NIST) Face Recognition Vendor Test (FRVT) and NIST FRVT Face Mask Effects
Data on NIST FRVT 1:1 verification accuracy by dataset was obtained from the FRVT 1:1 verification leaderboard. Data on NIST FRVT face mask effects was obtained from the FRVT face mask effects leaderboard. The face mask effects leaderboard contains results of the testing of 319 face recognition algorithms that were submitted to FRVT prior to and post mid-March 2020, when the COVID pandemic began.

Visual Question Answering (VQA)
Data on VQA was taken from recent iterations of the VQA challenge. To learn more about the VQA challenge in general, please consult the following link. To learn more about the 2021 iteration of the VQA challenge, please consult the following link. More specifically, the Index makes use of data from the following iterations of the VQA challenge:
VQA Challenge 2016
VQA Challenge 2017
VQA Challenge 2018
VQA Challenge 2019
VQA Challenge 2020
VQA Challenge 2021

Kinetics-400, Kinetics-600, and Kinetics-700
Data on Kinetics-400, Kinetics-600, and Kinetics-700 was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code (Kinetics-400, Kinetics-600, and Kinetics-700). The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (accuracy) correspond to the result reported in the most recent version of each paper. Details on the Kinetics-400 benchmark can be found in the Kinetics-400 paper. Details on the Kinetics-600 benchmark can be found in the Kinetics-600 paper. Details on the Kinetics-700 benchmark can be found in the Kinetics-700 paper.

To highlight progress on Kinetics-400, scores were taken from the following papers:
Co-Training Transformer with Videos and Images Improves Action Recognition
Large-Scale Weakly-Supervised Pre-training for Video Action Recognition
Multiview Transformers for Video Recognition
Non-Local Neural Networks
Omni-Sourced Webly-Supervised Learning for Video Recognition
SlowFast Networks for Video Recognition
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
To highlight progress on Kinetics-600, scores were taken from the following papers:
Masked Feature Prediction for Self-Supervised Visual Pre-Training
Multiview Transformers for Video Recognition
Learning Spatio-Temporal Representation with Local and Global Diffusion
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
SlowFast Networks for Video Recognition

To highlight progress on Kinetics-700, scores were taken from the following papers:
Learn to Cycle: Time-Consistent Feature Discovery for Action Recognition
Masked Feature Prediction for Self-Supervised Visual Pre-Training
Multiview Transformers for Video Recognition

ActivityNet: Temporal Action Localization Task
In the challenge, there are three separate tasks, but they focus on the main problem of temporally localizing where activities happen in untrimmed videos from the ActivityNet benchmark. To source information on the state-of-the-art results for TALT, the Index did a detailed survey of arXiv papers in addition to reports of yearly ActivityNet challenge results. More specifically, the Index made use of the following sources of information:
TALT 2016
TALT 2017
TALT 2018
TALT 2019
TALT 2020
TALT 2021

Common Object in Context (COCO)
Data on COCO mean average precision (mAP50) was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code. The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (mAP50) correspond to the result reported in the most recent version of each paper. Details on the COCO benchmark can be found in the COCO paper.

To highlight progress on COCO without the use of extra training data, scores were taken from the following papers:
An Analysis of Scale Invariance in Object Detection – SNIP
Deformable ConvNets v2: More Deformable, Better Results
Dynamic Head: Unifying Object Detection Heads with Attentions
Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks
Mish: A Self Regularized Non-Monotonic Activation Function
Scaled-YOLOv4: Scaling Cross Stage Partial Network

To highlight progress on COCO with the use of extra training data, scores were taken from the following papers:
EfficientDet: Scalable and Efficient Object Detection
Grounded Language-Image Pre-Training

You Only Look Once (YOLO)
Data on YOLO mean average precision (mAP50) was retrieved through a detailed arXiv literature review and survey of GitHub repositories. The reported dates correspond to the year during which a paper was first published to arXiv or a method was introduced. More specifically, the Index made use of the following sources of information:
YOLO 2016
YOLO 2018
YOLO 2020
YOLO 2021
YOLO results for 2017 and 2019 were not included in the index as no state-of-the-art improvements in YOLO for those years were uncovered during the literature review and survey of GitHub repositories.

Visual Commonsense Reasoning (VCR)
Technical progress for VCR is taken from the VCR leaderboard; the VCR leaderboard webpage further delineates the methodology behind the VCR challenge.
Human performance on VCR is taken from Zellers et al. (2018). Details on the VCR benchmark can be found in the VCR
paper.
SuperGLUE
The SuperGLUE benchmark data was pulled from the SuperGLUE leaderboard. Details about the SuperGLUE benchmark
are in the SuperGLUE paper and SuperGLUE software toolkit. The tasks and evaluation metrics for SuperGLUE are:
To highlight progress on arXiv with the use of extra training data, scores were taken from the following papers:
Big Bird: Transformers for Longer Sequences
Hierarchical Learning for Generation with Long Source Sequences
PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization

PubMed
Data on PubMed recall-oriented understudy for gisting evaluation (ROUGE-1) was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code. The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (ROUGE-1) correspond to the result reported in the most recent version of each paper. Details about the PubMed benchmark are in the PubMed paper.

To highlight progress on PubMed without the use of extra training data, scores were taken from the following papers:
A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
Extractive Summarization of Long Documents by Combining Global and Local Context
Get to the Point: Summarization with Pointer-Generator Networks
Sparsifying Transformer Models with Trainable Representation Pooling

To highlight progress on PubMed with the use of extra training data, scores were taken from the following papers:
A Divide-and-Conquer Approach to the Summarization of Long Documents
Hierarchical Learning for Generation with Long Source Sequences
PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization
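The arXiv and PubMed results above are tracked as ROUGE-1, which measures unigram overlap between a generated summary and a reference summary. The snippet below is a minimal illustration of that idea, not the evaluation code used by the cited papers (which typically rely on the official ROUGE toolkit with stemming and bootstrap resampling); the function name and tokenization are assumptions.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram ROUGE: clipped overlap of candidate and reference token counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())                 # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)         # reported ROUGE-1 is often recall or F1
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

# Example: rouge_1("the model summarizes documents", "the model summarizes long documents")
```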
Stanford Natural Language Inference (SNLI)
Data on Stanford Natural Language Inference (SNLI) accuracy was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code. The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (accuracy) correspond to the result reported in the most recent version of each paper. Details on the SNLI benchmark can be found in the SNLI paper.

To highlight progress on SNLI, scores were taken from the following papers:
Compare, Compress and Propagate: Enhancing Neural Architectures with Alignment Factorization for Natural Language Inference
Convolutional Neural Networks for Sentence Classification
Enhanced LSTM for Natural Language Inference
Entailment as Few-Shot Learner
Explicit Contextual Semantics for Text Comprehension
Semantics-Aware BERT for Language Understanding
Self-Explaining Structures Improve NLP Models

Abductive Natural Language Inference (aNLI)
Data on Abductive Natural Language Inference (aNLI) was sourced from the Allen Institute for AI's aNLI leaderboard. Details on the aNLI benchmark can be found in the aNLI paper.

SemEval 2014 Task 4 Sub Task 2
Data on SemEval 2014 Task 4 Sub Task 2 accuracy was retrieved through a detailed arXiv literature review cross-referenced by technical progress reported on Papers with Code. The reported dates correspond to the year during which a paper was first published to arXiv, and the reported results (accuracy) correspond to the result reported in the most recent version of each paper. Details on the SemEval benchmark can be found in the SemEval 2014 paper.

To highlight progress on SemEval, scores were taken from the following papers:
A Multi-Task Learning Model for Chinese-Oriented Aspect Polarity Classification and Aspect Term Extraction
error rate) correspond to the result reported in the most recent version of each paper. Details about both the LibriSpeech Test-Clean and Test-Other benchmarks can be found in the LibriSpeech paper.

To highlight progress on LibriSpeech Test-Clean without the use of extra training data, scores were taken from the following papers:
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition
Letter-Based Speech Recognition with Gated ConvNets
Neural Network Language Modeling with Letter-Based Features and Importance Sampling
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention With Dilated 1D Convolutions

To highlight progress on LibriSpeech Test-Clean with the use of extra training data, scores were taken from the following papers:
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
End-to-End ASR: From Supervised to Semi-Supervised Learning with Modern Architectures
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

To highlight progress on LibriSpeech Test-Other without the use of extra training data, scores were taken from the following papers:
Conformer: Convolution-Augmented Transformer for Speech Recognition
Neural Network Language Modeling with Letter-Based Features and Importance Sampling
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
Transformer-Based Acoustic Modeling for Hybrid Speech Recognition

To highlight progress on LibriSpeech Test-Other with the use of extra training data, scores were taken from the following papers:
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
End-to-End ASR: From Supervised to Semi-Supervised Learning with Modern Architectures
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
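LibriSpeech results are reported as word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis transcript into the reference, divided by the number of reference words. The following is a minimal illustrative implementation, not the scoring scripts used by the papers listed above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed with standard Levenshtein dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: word_error_rate("the cat sat", "the cat sat down") == 1/3
```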
VoxCeleb
VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. VoxCeleb contains speech from 7,000-plus speakers spanning a wide range of ethnicities, accents, professions, and ages—amounting to over a million utterances (face-tracks are captured "in the wild," with background chatter, laughter, overlapping speech, pose variation, and different lighting conditions) recorded over a period of 2,000 hours (both audio and video). Each segment is at least three seconds long. The data contains an audio dataset based on celebrity voices, shorts, films, and conversational pieces (e.g., talk shows). The initial VoxCeleb 1 (100,000 utterances taken from 1,251 celebrities on YouTube) was expanded to VoxCeleb 2 (1 million utterances from 6,112 celebrities).

For the sake of consistency, the AI Index reported scores on the initial VoxCeleb dataset. Specifically, the Index made use of the following sources of information:
The IDLAB VoxSRC-20 Submission: Large Margin Fine-Tuning and Quality-Aware Score Calibration in DNN Based Speaker Verification
The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021
VoxCeleb: A Large-Scale Speaker Identification Dataset
VoxCeleb2: Deep Speaker Recognition
VoxCeleb: Large-Scale Speaker Verification in the Wild
Leveraging Procedural Generation to Benchmark Reinforcement Learning
Procedural Generalization by Planning with Self-Supervised World Models

Chess
Data on the performance of chess software engines was taken from the Swedish Chess Computer Association's ranking of top chess software engines. The Swedish Chess Computer Association tests computer chess software systems against one another and releases a ranking list of the top-performing systems. The ranking list produced by the Swedish Chess Computer Association is a statistically significant and meaningful measurement of chess engine performance because engines are pitted against one another in thousands of tournament-like games and each employs the same underlying hardware. Data on Magnus Carlsen's top Elo score was taken from the International Chess Federation.

Training Time and Number of Accelerators
Data on training time and number of accelerators for AI systems was taken from the MLPerf Training benchmark competitions. More specifically, the AI Index made use of data from the following MLPerf training competitions:
MLPerf Training v0.5, 2018
MLPerf Training v0.6, 2019
MLPerf Training v0.7, 2020
MLPerf Training v1.0, 2021
MLPerf Training v1.1, 2021

Details on the MLPerf Training benchmark can be found in the MLPerf Training benchmark paper. Details on the current benchmark categories as well as technical information about submission and competition subdivisions can be found on the MLPerf Training webpage.

ImageNet Training Cost
Data on ImageNet training cost was based on research from DAWNBench and the individual research of Deepak Narayanan. DAWNBench is a benchmark suite for end-to-end deep-learning training and inference. DAWNBench provides a reference set of common deep-learning workloads for quantifying training time, training cost, inference latency, and inference cost across different optimization strategies, model architectures, software frameworks, clouds, and hardware. More details are available at DAWNBench.

Because DAWNBench was deprecated after March 2020, data on the training cost for the most recent round of MLPerf submissions was manually collected by Deepak Narayanan.
Aalborg University, Denmark
Ain Shams University, Egypt
Carnegie Mellon University, United States
Columbia University, United States
Cornell University, United States
Delft University of Technology, Netherlands
ETH Zurich, Switzerland
Hong Kong University of Science and Technology, Hong Kong
Korea Advanced Institute of Science and Technology, South Korea
KU Leuven, Belgium
Massachusetts Institute of Technology, United States
Nanyang Technological University, Singapore
National Polytechnic Institute, Mexico
National University of Singapore, Singapore
Peking University, China
Politecnico di Milano, Italy
Pontificia Universidad Católica de Chile, Chile
Princeton University, United States
Purdue University, United States
RWTH Aachen University, Germany
Seoul National University, South Korea
Stanford University, United States
Stellenbosch University, South Africa
Swiss Federal Institute of Technology Lausanne, Switzerland
Tokyo Institute of Technology, Japan
University of British Columbia, Canada
University of California Berkeley, United States
University of California Los Angeles, United States
University of California San Diego, United States
University of Cambridge, United Kingdom
University of Cape Town, South Africa
University College London, United Kingdom
University of Hong Kong, Hong Kong
University of Illinois at Urbana-Champaign, United States
University of Malaya, Malaysia
University of Manchester, United Kingdom
University of Michigan, United States
Universitat Politècnica de Catalunya, Spain
University of Texas at Austin, United States
University of Tokyo, Japan
University of Toronto, Canada
University of Waterloo, Canada
Zhejiang University, China
7 See 2018 Workshop on Ethical, Social and Governance Issues in AI 2018, 2018 AI for Social Good Workshop, 2019 Joint Workshop on AI for Social Good, 2020 Resistance AI Workshop, 2020 Navigating the
Broader Impacts of AI Research Workshop.
8 See 2014 Machine Learning for Clinical Data Analysis, Healthcare and Genomics, 2015 Machine Learning for Healthcare, 2016 Machine Learning for Health, 2017 Machine Learning for Health.
9 See 2013 Machine Learning for Sustainability, 2020 AI for Earth Sciences, 2019, 2020, 2021 Tackling Climate Change with ML.
10 See 2016 People and Machines, 2019 Joint Workshop on AI for Social Good–Public Policy, 2021 Human-Centered AI.
11 See 2019 AI for Humanitarian Assistance and Disaster Response, 2020 Second Workshop on AI for Humanitarian Assistance and Disaster Response, 2021 Third Workshop on AI for Humanitarian
Assistance and Disaster Response.
12 See 2017–2021 Machine Learning for the Developing World Workshops.
Interpreting Social Respect: A Normative Lens for ML Models
Knowledge-Based Neural Framework for Sexism Detection and Classification
Large Pretrained Language Models Contain Human-Like Biases of What Is Right and Wrong to Do
Leveraging Multilingual Transformers for Hate Speech Detection
Limitations of Pinned AUC for Measuring Unintended Bias
Machine Learning Suites for Online Toxicity Detection
Mitigating Harm in Language Models with Conditional-Likelihood Filtration
On-the-Fly Controlled Text Generation with Experts and Anti-Experts
Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
Racial Bias in Hate Speech and Abusive Language Detection Datasets
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Social Bias Frames: Reasoning About Social and Power Implications of Language
Social Biases in NLP Models as Barriers for Persons with Disabilities
Stereotypical Bias Removal for Hate Speech Detection Task Using Knowledge-Based Generalizations
The Risk of Racial Bias in Hate Speech Detection
Towards Measuring Adversarial Twitter Interactions Against Candidates in the US Midterm Elections
Toxic Comment Classification Using Hybrid Deep Learning Model
Toxicity-Associated News Classification: The Impact of Metadata and Content Features
Understanding BERT Performance in Propaganda Analysis
White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
Women, Politics and Twitter: Using Machine Learning to Change the Discourse

While the Perspective API is used widely within machine learning research and also for measuring online toxicity, toxicity in the specific domains used to train the models undergirding Perspective (e.g., news, Wikipedia) may not be broadly representative of all forms of toxicity (e.g., trolling). Other known caveats include biases against text written by minority voices: The Perspective API has been shown to disproportionately assign high toxicity scores to text that contains mentions of minority identities (e.g., "I am a gay man"). As a result, detoxification techniques built with labels sourced from the Perspective API result in models that are less capable of modeling language used by minority groups, and they avoid mentioning minority identities.

We note that the effect size metric reported in the Word Embeddings Association Test (WEAT) section is highly sensitive to rare words, as it has been shown that removing less than 1% of relevant documents in a corpus can significantly impact the WEAT effect size. This means that effect size is not guaranteed to be a robust metric for assessing bias in embeddings. While we report on a subset of embedding association tasks measuring bias along gender and racial axes, these embedding association tests have been extended to quantify the effect across intersectional axes (e.g., EuropeanAmerican+male, AfricanAmerican+male, AfricanAmerican+female).

In the analysis of embeddings from over 100 years of U.S. Census data, embedding bias was measured by computing the difference between average embedding distances. For example, gender bias is calculated as the average distance of embeddings of words associated with women (e.g., she, female) compared to embeddings of words for occupations (e.g., teacher, lawyer), minus the same average distance calculated for words associated with men.
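The Census-embedding bias metric described above is a difference of average embedding distances. The sketch below is only an illustration of that calculation: the choice of cosine distance, the helper names, and the example word lists are assumptions rather than the exact lists and distance measure used in the cited analysis.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus cosine similarity between two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def group_bias(embeddings, group_a, group_b, target_words):
    """Average distance from group_a words to the target words, minus the same
    average for group_b. Positive values mean the targets sit closer to group_b
    than to group_a in the embedding space."""
    def avg_dist(group):
        return np.mean([cosine_distance(embeddings[g], embeddings[t])
                        for g in group for t in target_words])
    return avg_dist(group_a) - avg_dist(group_b)

# Hypothetical usage with illustrative word lists (not the report's exact lists):
# gender_bias = group_bias(vectors, ["she", "female"], ["he", "male"],
#                          ["teacher", "lawyer", "engineer"])
```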
Machine Learning: AdaBoost algorithm, Boosting (Machine Learning), Chi Square Automatic Interaction Detection (CHAID), Classification Algorithms, Clustering Algorithms, Decision Trees, Dimensionality Reduction, Google Cloud Machine Learning Platform, Gradient boosting, H2O (software), Libsvm, Machine Learning, Madlib, Mahout, Microsoft Cognitive Toolkit, MLPACK (C++ library), Mlpy, Random Forests, Recommender Systems, Scikit-learn, Semi-Supervised Learning, Supervised Learning (Machine Learning), Support Vector Machines (SVM), Semantic Driven Subtractive Clustering Method (SDSCM), Torch (Machine Learning), Unsupervised Learning, Vowpal, Xgboost

and classified by taxonomists at LinkedIn into 249 skill groupings, which are the skill groups represented in the dataset. The top skills that make up the AI skill grouping are machine learning, natural language processing, data structures, artificial intelligence, computer vision, image processing, deep learning, TensorFlow, Pandas (software), and OpenCV, among others.

Skill groupings are derived by expert taxonomists through a similarity-index methodology that measures skill composition at the industry level. Industries are classified according to the ISIC 4 industry classification (Zhu et al., 2018).
• Compute frequencies for all self-added skills by LinkedIn members in a given entity (occupation, industry, etc.) in 2015–2021.
• Re-weight skill frequencies using a TF-IDF model to get the top 50 most representative skills in that entity. These 50 skills compose the "skill genome" of that entity.
• Compute the share of skills that belong to the AI skill group out of the top skills in the selected entity.

Interpretation: The AI skill penetration rate signals the prevalence of AI skills across occupations, or the intensity with which LinkedIn members utilize AI skills in their jobs. For example, the top 50 skills for the occupation of engineer are calculated based on the weighted frequency with which they appear in LinkedIn members' profiles. If four of the skills that engineers possess belong to the AI skill group, this measure indicates that the penetration of AI skills is estimated to be 8% among engineers (i.e., 4/50). A stylized sketch of this computation is shown below.
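As a rough illustration of the steps above (not LinkedIn's production pipeline; the particular TF-IDF formulation, helper names, and data shapes are assumptions), the skill genome and the AI skill penetration rate could be computed as follows:

```python
import math
from collections import Counter

def skill_genome(entity_skill_counts, all_entities_skill_counts, top_n=50):
    """TF-IDF re-weighting of raw skill frequencies for one entity
    (occupation, industry, country, ...), keeping the top_n skills."""
    n_entities = len(all_entities_skill_counts)
    df = Counter()                                    # in how many entities each skill appears
    for counts in all_entities_skill_counts.values():
        df.update(counts.keys())
    total = sum(entity_skill_counts.values()) or 1
    scores = {
        skill: (count / total) * math.log(n_entities / df[skill])
        for skill, count in entity_skill_counts.items()
    }
    return [s for s, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

def ai_skill_penetration(genome, ai_skill_group):
    """Share of an entity's skill genome that falls in the AI skill group,
    e.g., 4 AI skills in a top-50 genome -> 0.08."""
    return sum(1 for skill in genome if skill in ai_skill_group) / max(len(genome), 1)
```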
Jobs or Occupations
LinkedIn member titles are standardized and grouped into approximately 15,000 occupations. These are not sector or country specific. These occupations are further standardized into approximately 3,600 occupation representatives. Occupation representatives group occupations with a common role and specialty, regardless of seniority.

AI Jobs or Occupations
An AI job (technically, an occupation representative) is an occupation representative that requires AI skills to perform the job. Skills penetration is used as a signal for whether AI skills are prevalent in an occupation representative, in any sector where the occupation representative may exist. Examples of such occupations include (but are not limited to): machine learning engineer, artificial intelligence specialist, data scientist, computer vision engineer, etc.

AI Talent
A LinkedIn member is considered AI talent if they have explicitly added AI skills to their profile and/or they are occupied in an AI occupation representative. The counts of AI talent are used to calculate talent concentration metrics (e.g., to calculate the country-level AI talent concentration, we use the counts of AI talent at the country level vis-à-vis the counts of LinkedIn members in the respective countries).

Relative AI Skills Penetration
To allow for skills penetration comparisons across countries, the skills genomes are calculated and a relevant benchmark is selected (e.g., global average). A ratio is then constructed between a country's and the benchmark's AI skills penetrations, controlling for occupations.

Interpretation: A country's relative AI skills penetration of 1.5 indicates that AI skills are 1.5 times as frequent as in the benchmark, for an overlapping set of occupations.

Global Comparison
For cross-country comparison, we present the relative penetration rate of AI skills, measured as the sum of the penetration of each AI skill across occupations in a given country, divided by the average global penetration of AI skills across the overlapping occupations in a sample of countries.

Interpretation: A relative penetration rate of 2 means that the average penetration of AI skills in that country is two times the global average across the same set of occupations.

Global Comparison: By Industry
The relative AI skills penetration by country for industry provides an in-depth sectoral decomposition of AI skill penetration across industries and sample countries.

Interpretation: A country's relative AI skill penetration rate of 2 in the education sector means that the average penetration of AI skills in that country is two times the global average across the same set of occupations in that sector. A stylized sketch of the relative penetration calculation is shown below.
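The following sketch is a simplified reading of the relative penetration definitions above, not LinkedIn's exact formula; the aggregation over overlapping occupations and the function signature are assumptions.

```python
def relative_ai_penetration(country_penetration, global_penetration):
    """country_penetration and global_penetration map occupation -> AI skill penetration.
    The ratio is computed only over occupations present in both, so the comparison
    controls for occupation mix."""
    shared = set(country_penetration) & set(global_penetration)
    country_sum = sum(country_penetration[o] for o in shared)
    global_sum = sum(global_penetration[o] for o in shared)
    return country_sum / global_sum if global_sum else float("nan")

# A value of 2 would mean AI skills are twice as prevalent in the country as in the
# global benchmark for the same set of occupations.
```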
Nuances
• Of particular interest in PhD job market trends are the metrics on the AI PhD area of specialization. The categorization of specialty areas changed in 2008 and was clarified in 2016: from 2004 to 2007, AI and robotics were grouped together; from 2008 onward, AI has been a separate category; and in 2016 the survey clarified to respondents that AI includes machine learning.
• Notes about the trends in new tenure-track hires
(overall and particularly at AAU schools): In the 2018
Taulbee Survey, for the first time we asked how many
new hires had come from the following sources: new
PhD, postdoc, industry, and other academic. Results
indicated that 29% of new assistant professors came
from another academic institution.
• Some may have been teaching or research faculty
rather than tenure-track, but there is probably some
movement between institutions, meaning the total
number hired overstates the total who are actually
new.
CHAPTER 5: AI POLICY
AND GOVERNANCE
BLOOMBERG GOVERNMENT
Prepared by Amanda Allen

Bloomberg Government is a premium, subscription-based service that provides comprehensive information and analytics for professionals who interact with—or are affected by—the government. Delivering news, analytics, and data-driven decision tools, Bloomberg Government's digital workspace gives an intelligent edge to government affairs and contracting professionals. For more information or a demo, visit about.bgov.com.

Methodology
Contract Spending: Bloomberg Government's Contracts Intelligence Tool structures all contracts data from www.fpds.gov. The CIT includes a model of government spending on artificial intelligence-related contracts that is based on a combination of government-defined product service codes and more than 100 AI-related keywords. For the section "U.S. Government Contract Spending," Bloomberg Government analysts used contract spending data from fiscal year 2000 through fiscal year 2021.

Legislative Documents: Bloomberg Government maintains a repository of congressional documents, including bills, Congressional Budget Office assessments, and reports published by congressional committees, the Congressional Research Service, and other offices. Bloomberg Government also ingests state legislative bills. For the section "AI Policy and Governance," Bloomberg Government analysts identified all legislation, congressional committee reports, and CRS reports that referenced one or more AI-specific keywords.
GLOBAL LEGISLATION RECORDS ON AI
For AI-related bills passed into laws, the AI Index performed searches of the keyword "artificial intelligence," in the respective languages, on the websites of 25 countries' congresses or parliaments, in the full text of bills. Note that only laws passed by state-level legislative bodies and signed into law (i.e., signed by presidents or granted royal assent) from 2015 to 2021 are included. Future AI Index reports hope to include analysis of other types of legal documents, such as regulations and standards, adopted by state- or supranational-level legislative bodies, government agencies, etc.
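The counting rule described above amounts to keyword matching over the full text of enacted laws, restricted to 2015-2021. The sketch below is hypothetical: the data structures, the assumption that full texts have already been retrieved from each parliament's website, and the truncated keyword mapping are ours, not part of the Index's tooling.

```python
# Keyword per country (a few illustrative entries from the list below; the full
# 25-country mapping is elided here). Bills are assumed to be (year, full_text)
# pairs already downloaded from the respective parliament or congress websites.
KEYWORDS = {
    "United States": "artificial intelligence",
    "France": "intelligence artificielle",
    "Germany": "künstliche intelligenz",
}

def count_ai_laws(bills_by_country, start_year=2015, end_year=2021):
    """Count passed laws per country whose full text mentions the AI keyword."""
    counts = {}
    for country, bills in bills_by_country.items():
        keyword = KEYWORDS[country].lower()
        counts[country] = sum(
            1 for year, text in bills
            if start_year <= year <= end_year and keyword in text.lower()
        )
    return counts
```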
Australia
Website: www.legislation.gov.au
Keyword: artificial intelligence
Filters:
• Legislation types: Acts
• Portfolios: Department of House of Representatives, Department of Senate
Note: Texts in the explanatory memorandum are not counted.

Belgium
Website: http://www.ejustice.just.fgov.be/loi/loi.htm
Keyword: intelligence artificielle

Brazil
Website: https://www.camara.leg.br/legislacao
Keyword: inteligência artificial
Filters:
• Federal legislation
• Type: Law

Canada
Website: https://www.parl.ca/legisinfo/
Keyword: artificial intelligence
Note: Results were investigated to determine how many of the bills introduced were eventually passed (i.e., received royal assent), and bill status was recorded.

China
Website: https://flk.npc.gov.cn/
Keyword: 人工智能
Filter:
• Legislative body: Standing Committee of the National People's Congress

Denmark
Website: https://www.retsinformation.dk/
Keyword: kunstig intelligen
Filter:
• Document Type: Laws

Finland
Website: https://www.finlex.fi/
Keyword: tekoäly
Note: Searched under the Current Legislation section.

France
Website: https://www.legifrance.gouv.fr/
Keyword: intelligence artificielle
Filters:
• texte consolidé (consolidated text)
• Document Type: Law

Germany
Website: http://www.gesetze-im-internet.de/index.html
Keyword: künstliche Intelligenz
Filters:
• All federal codes, statutes, and ordinances that are currently in force
• Volltextsuche (full-text search)
• Und-Verknüpfung der Wörter (words combined with AND)

India
Website: https://www.indiacode.nic.in
Keyword: artificial intelligence
Note: The website allows for a search of keywords in legislation titles but not in the full text, so it is not useful for this particular research. Therefore, a Google search using the "site" function to search the site with the keyword "artificial intelligence" is conducted.

Ireland
Website: www.irishstatutebook.ie
Keyword: artificial intelligence

Italy
Website: https://www.normattiva.it/
Keyword: intelligenza artificiale
Filter:
• Document Type: Law

Japan
Website: https://elaws.e-gov.go.jp/
Keyword: 人工知能
Filters:
• Full text
• Law

Netherlands
Website: https://www.overheid.nl/
Keyword: kunstmatige intelligentie
Filter:
• Document Type: Wetten (laws)

New Zealand
Website: www.legislation.govt.nz
Keyword: artificial intelligence
Filters:
• Document type: Acts
• Status option (e.g., acts in force, current bills, etc.)

Norway
Website: https://lovdata.no/
Keyword: kunstig intelligens

Russia
Website: http://graph.garant.ru:8080/SESSION/PILOT/main.htm (database "The Federal Laws" on the official website of the Federation Council of the Federal Assembly of the Russian Federation)
Keyword: искусственный интеллект
Filter:
• Words in text

Singapore
Website: https://sso.agc.gov.sg/
Keyword: artificial intelligence
Filter:
• Document Type: Current acts and subsidiary legislation

South Africa
Website: www.gov.za
Keyword: artificial intelligence
Filter:
• Document: Acts
Note: This search function seemingly does not search within the full text, and so no results were returned. Therefore, a Google search using the "site" function to search the site with the keyword "artificial intelligence" is conducted.

South Korea
Website: https://law.go.kr/eng/; https://elaw.klri.re.kr/
Keyword: artificial intelligence or 인공 지능
Filter:
• Type: Act
Note: Combined words cannot be searched, so individual analysis is conducted.

Spain
Website: https://www.boe.es/
Keyword: inteligencia artificial
Filters:
• Type: Law
• Head of state (for passed laws)

Sweden
Website: https://www.riksdagen.se/
Keyword: artificiell intelligens
Filter: Swedish Code of Statutes

Switzerland
Website: https://www.fedlex.admin.ch/
Keyword: intelligence artificielle
Filters:
• Text category: federal constitution, federal acts, federal decrees, miscellaneous texts, orders, and other forms of legislation
• Publication period for legislation was limited to 2015–2021

United Kingdom
Website: https://www.legislation.gov.uk/
Keyword: artificial intelligence
Filter:
• Legislation Type: U.K. Public General Acts & U.K. Statutory Instruments

United States
Website: https://www.congress.gov/
Keyword: artificial intelligence
Filters:
• Source: Legislation
• Status of legislation: Became law
MENTIONS OF AI IN AI-RELATED LEGISLATIVE PROCEEDINGS
For mentions of AI in AI-related legislative proceedings around the world, the AI Index performed searches of the keyword
“artificial intelligence,” in respective languages, on the websites of 25 countries’ congresses or parliaments, usually under
sections named “minutes,” “hansard,” etc.
Australia
Website: https://www.aph.gov.au/Parliamentary_Business/Hansard
Keyword: artificial intelligence

Belgium
Website: http://www.parlement.brussels/search_form_fr/
Keyword: intelligence artificielle
Filter:
• Document Type: all

Brazil
Website: https://www2.camara.leg.br/atividade-legislativa/discursos-e-notas-taquigraficas
Keyword: inteligência artificial
Filters:
• Federal legislation
• Type: Law

Canada
Website: https://www.ourcommons.ca/PublicationSearch/en/?PubType=37
Keyword: artificial intelligence

China
Website: Various reports on the work of the government
Keyword: 人工智能
Note: The National People's Congress is held once per year and does not provide full legislative proceedings. Hence, the counts included in the analysis reflect mentions of artificial intelligence in the only public document released from the Congress meetings, the Report on the Work of the Government, delivered by the Premier.

Denmark
Website: https://www.retsinformation.dk/
Keyword: kunstig intelligens
Filter:
• Minutes

Finland
Website: https://www.eduskunta.fi/
Keyword: tiedot
Filters:
• Parliamentary Affairs and Documents
• Public document: Minutes
• Actor: Plenary sessions

France
Website: https://www.assemblee-nationale.fr/
Keyword: intelligence artificielle
Filter:
• Reports of the debates in session
Note: Such documents were only prepared starting in 2017.

Germany
Website: https://dip.bundestag.de/
Keyword: künstliche Intelligenz
Filter:
• Speeches, requests to speak in the plenum

India
Website: http://loksabhaph.nic.in/
Keyword: artificial intelligence
Filter:
• Exact word/phrase

Ireland
Website: https://www.oireachtas.ie/
Keyword: artificial intelligence
Filter: Content of parliamentary debates

Italy
Website: https://aic.camera.it/aic/search.html
Keyword: intelligenza artificiale
Filters:
• Type: All
• Search by exact phrase

Japan
Website: https://kokkai.ndl.go.jp/#/
Keyword: 人工知能
Filters:
• Full text
• Law

Norway
Website: https://www.stortinget.no/no/Saker-og-publikasjoner/Publikasjoner/Referater/
Keyword: kunstig intelligens
Note: This search function does not directly allow searching for the keyword within minutes. Therefore, a Google search using the "site" function to search the site with the keyword "artificial intelligence" is conducted.

Russia
Website: http://transcript.duma.gov.ru/
Keyword: искусственный интеллект
Filter:
• Words in text

Singapore
Website: https://sprs.parl.gov.sg/search/home
Keyword: artificial intelligence

South Africa
Website: https://www.parliament.gov.za/hansard
Keyword: artificial intelligence
Note: This search function does not search within the full text, and so no results were returned. Therefore, a Google search using the "site" function to search https://www.parliament.gov.za/storage/app/media/Docs/hansard/ with the keyword "artificial intelligence" is conducted.

Sweden
Website: https://www.riksdagen.se/sv/global/sok/?q=&doktyp=prot
Keyword: artificiell intelligens
Filter:
• Minutes

Switzerland
Website: https://www.parlament.ch/
Keyword: intelligence artificielle
Filter:
• Parliamentary proceedings