MACHINE LEARNING AI IN MEDICAL DEVICES
Adapting Regulatory Frameworks and Standards to Ensure Safety and Performance
ABOUT
AAMI
AAMI is a nonprofit organization founded in 1967. It is a diverse community of
approximately 9,000 professionals united by an important mission—the development,
management, and use of safe and effective health technology.
AAMI is a primary source of consensus standards, both national (US) and
international, for the medical device industry, as well as practical information,
support, and guidance for healthcare technology and sterilization professionals.
BSI
BSI is a global thought leader in the development of standards of best practice
for business and industry. Formed in 1901, BSI was the world’s first National
Standards Body (NSB) and a founding member of the International Organization for
Standardization (ISO). Over a century later, BSI is focused on business improvement
across the globe, working with experts in all sectors of the economy to develop
codes, guidance and specifications that will accelerate innovation, increase
productivity and support growth. Renowned as the originator of many of the
world’s best-known business standards, BSI’s activity spans multiple sectors including
aerospace, automotive, built environment, food, healthcare, and ICT.
Over 95% of BSI’s work is on international and European standards. In its role as
the UK National Standards Body, BSI represents UK economic and social interests
across the international standards organizations ISO, IEC, CEN, CENELEC and ETSI,
providing the infrastructure for over 11,000 experts to work on international,
European, national, and PAS standards development in their chosen fields.
AAMI/BSI INITIATIVE ON AI
The AAMI/BSI Initiative on Artificial Intelligence (AI) in medical technology is an effort
by AAMI and BSI to explore the ways that AI and, in particular, machine learning
pose unique challenges to the current body of standards and regulations governing
medical devices and related technologies, and to determine what additional
guidance or standards might be needed to promote the safety and effectiveness
of medical AI technologies. Two stakeholder workshops to explore the issue were
held in the fall of 2018 and resulted in the publication of a first whitepaper, The
emergence of artificial intelligence and machine learning algorithms in healthcare:
Recommendations to support governance and regulation.
This second whitepaper builds on that initial work and was informed by two
additional workshops held in Arlington, VA, USA, and in London, United Kingdom
in 2019.
PRIMARY AUTHORS
Rob Turpin – BSI
Emily Hoefer – AAMI
Joe Lewelling – AAMI
Pat Baird – Philips
ACKNOWLEDGEMENTS
The following individuals participated in the 2019 workshops or otherwise provided
input to the authors. Contributions of their time and expertise are greatly appreciated.
NOTE—Participation in the workshops or in the review of this whitepaper by any individual, including government
agency representatives, does not constitute endorsement or approval of its contents by those individuals or agencies.
EXECUTIVE SUMMARY
This paper examines how AI is different from traditional medical devices and medical
software, explores the implications of those differences, and discusses the controls
necessary to ensure AI in healthcare is safe and effective. Because these differences
will not be the same for the full range of systems, it is important to identify what
aspects of AI are of concern.
1. INTRODUCTION
Artificial Intelligence (AI) promises to revolutionize the practice of medicine by making
healthcare more accessible, more efficient, and even more effective. The concept of
AI itself, however, is ambiguous if not controversial. AI has been broadly defined as
“the capability of a machine to imitate intelligent human behavior”.1 A narrower
and more complex definition applies the term to systems that “display intelligent
behavior by analyzing their environment and taking actions—with some degree of
autonomy—to achieve specific goals…”2 In the first definition, the systems are merely
mimicking human intelligence, but in the second they exhibit a degree of reasoning
or cognition—of “thinking” in some sense of the word.
Whatever promise it holds, AI, like any new healthcare technology, can present
challenges to existing methods for ensuring safety and performance. The safety
and effectiveness3 of medical devices entering the market today are governed by
regulations and private-sector consensus standards. These controls (standards and
regulations) were developed alongside current technologies and are based on an
extensive, shared understanding of how and how well they work. With an emergent
technology like AI, real-world experience is limited, which can hinder our ability to
fully assess its effectiveness. Similarly, a lack of real-world experience with AI limits
our understanding of its associated risks. AI-related risks are harder to quantify and
mitigate; there may be unforeseeable and unpredictable hazards arising from the
unique nature or function of AI.
This paper examines how AI is different from other medical devices and medical
software, explores the implications of those differences, and discusses the controls
necessary to assure AI in healthcare is safe and effective. As these differences will not
be the same for the full range of systems, it is important to identify what aspects of
AI are of concern.

1 Definition from the Merriam-Webster Dictionary.
2 Communication from the European Commission to the European Parliament, the European Council, the Council, the European Economic and Social Committee and the Committee of the Regions on Artificial Intelligence for Europe, Brussels, 25.4.2018, COM(2018) 237 final.
3 Within this whitepaper the terms “effectiveness” and “performance” with respect to medical devices are used interchangeably, in accordance with the definition of “effectiveness” provided in the U.S. Code of Federal Regulations (21 CFR 860.7). See A.6.

2. MACHINE-LEARNING AI—WHAT MAKES IT DIFFERENT?

Systems that simply imitate humans are not new to the medical field, of course. Since the advent of commercial transistors in the 1960s, computational medical devices have increasingly mimicked human behavior and actions. Automatic blood pressure monitors (sphygmomanometers) imitate the actions of trained clinicians in detecting and reporting the Korotkoff sounds that signify systolic and diastolic blood pressures. Portable defibrillators evaluate heart waveforms to determine when defibrillation is necessary and can then act to deliver the needed defibrillation.

Devices like these, by supplementing or in some instances replacing direct clinician involvement, have already expanded the availability of care outside of healthcare facilities, to homes and workplaces, as well as to areas and regions where trained
clinicians are rare or absent. Such technologies, however, do not act independently
of human reasoning, but instead utilize previously validated clinical protocols to
diagnose medical conditions or deliver therapy. They do not “think” for themselves
in the sense of understanding, making judgements, or solving problems4; rather, they
are static rules-based systems,5 programmed to produce specific outputs based on
the values of received inputs.
While such systems can be very sophisticated, the rules they employ are static—
they are not created or modified by the systems. Their algorithms are developed
based on documented and approved clinical research and then validated to produce
expected (i.e., predictable) results. In this respect, rules-based AI systems, apart
from their complexity, do not differ substantially from the computational and
electronic medical devices that have been in use since the 1960s.
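As a concrete contrast for what follows, a rules-based system can be reduced to a fixed decision tree. The sketch below is a hypothetical, simplified illustration (the cutoffs are invented for the example and are not clinical guidance): every path through the logic was written and validated by humans, and nothing in it changes with use.

    # Hypothetical static rules-based classifier: a fixed decision tree
    # for flagging a blood-pressure reading. Cutoffs are invented for
    # illustration only.

    def classify_bp(systolic, diastolic):
        """Apply fixed, human-authored rules; behavior never changes."""
        if systolic >= 180 or diastolic >= 120:
            return "crisis"
        if systolic >= 140 or diastolic >= 90:
            return "high"
        if systolic >= 120:
            return "elevated"
        return "normal"

    print(classify_bp(118, 76))   # normal
    print(classify_bp(150, 85))   # high
    print(classify_bp(185, 100))  # crisis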
There are other types of AI that utilize large data sets and complex statistical methodologies to discover new relationships between inputs, actions, and outcomes. These data-driven or machine learning systems6 are not explicitly programmed to provide pre-determined outputs, but are heuristic, with the ability to learn and make judgements. In short, machine learning AI systems, unlike simple rules-based systems, are cognitive in some sense and can modify their outputs accordingly. For the purposes of this whitepaper, we have separated data-driven/machine learning AI into two groups—locked models that are unable to change without external intervention, and continuous learning (or adaptive models) that modify outputs automatically in real-time (see Figure 1). In reality, there are likely to be several levels of change control for AI—from traditional concepts that are already known, to accelerated concepts that may need additional levels of control.

The more sophisticated of these data-driven systems (i.e., super-intelligent AI) can surpass human cognition in their ability to process enormous and complicated data sets and engage in higher levels of abstraction. Utilizing multiple layers of statistical analysis and deep learning/neural networks, these systems act as black boxes7 producing protocols and algorithms for diagnosis or therapy that are not readily understandable by clinicians or explicable to patients.

4 One definition of “think” in the Cambridge Dictionary is “to use one’s mind to understand.”
5 Daniels et al., Current State and Near-Term Priorities for AI-Enabled Diagnostic Support Software in Health Care (White Paper), Duke Margolis Center for Health Policy, 2019, p. 10.
6 Ibid. For the purposes of this paper, the terms data-driven and machine-learning are synonymous, as are the terms continuous learning and adaptive models.
7 The metaphor of “black box” is used widely and with different connotations, but with respect to AI, we are not simply talking about a lack of visibility with respect to mechanisms or calculations, but also to the inscrutability of the basic rationale for performance.
Figure 1: Types of AI systems
• Rules-based AI systems mimic human behavior, making decisions by applying static rules to arrive at predictable decisions; often visualized as a decision tree.
• Data-driven/machine-learning AI systems fall into two groups: locked machine learning models, in which neither the internal algorithms nor system outputs change automatically, and continuous learning/adaptive models, which utilize newly received data to test the assumptions that underlie their operation in real-world use.
Data-driven machine learning AI systems can be further divided into locked models
and continuous learning models (a simple code sketch following this list illustrates
the distinction):
• Locked models8 employ algorithms that are developed using training data and
machine learning, which are then fixed so neither the internal algorithms nor
system outputs change automatically (though changes can be accommodated in
a stepwise manner).
• Continuous learning models (or adaptive models)9 utilize newly received data
to test the assumptions that underlie their operations in real-world use and,
when potential improvements are identified, the systems are programmed to
automatically modify internal algorithms and update external outputs.
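The sketch below is a minimal, hypothetical illustration of this distinction in Python (the one-parameter threshold classifier and its learning rate are invented for the example): both models are released with the same validated parameter, but only the adaptive model modifies it automatically as new labeled cases arrive.

    # Minimal sketch (hypothetical): a locked model vs. a continuous
    # learning model for a single-feature binary classifier.

    class LockedModel:
        """Threshold fixed at release; changes require a new version."""
        def __init__(self, threshold):
            self.threshold = threshold  # frozen after validation

        def predict(self, value):
            return value >= self.threshold

    class AdaptiveModel:
        """Nudges its threshold toward each labeled real-world case."""
        def __init__(self, threshold, learning_rate=0.05):
            self.threshold = threshold
            self.learning_rate = learning_rate

        def predict(self, value):
            return value >= self.threshold

        def update(self, value, true_label):
            # Move the decision boundary when a prediction was wrong.
            if self.predict(value) != true_label:
                direction = -1 if true_label else 1
                self.threshold += direction * self.learning_rate

    locked = LockedModel(threshold=0.7)
    adaptive = AdaptiveModel(threshold=0.7)

    # New field data: (input value, ground-truth label)
    for value, label in [(0.65, True), (0.62, True), (0.80, False)]:
        locked.predict(value)          # behavior never changes
        adaptive.update(value, label)  # behavior drifts with the data

    print(locked.threshold)    # 0.7 -- unchanged without external action
    print(adaptive.threshold)  # shifted automatically in "real time"

In regulatory terms, the locked model changes only through a controlled, stepwise re-release, whereas the adaptive model's behavior drifts with the data it receives.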
If real-world inputs fall outside that universe of assumptions and data, then the same
systems can fail. AI systems are poor at handling the unknown-unknowns—they do
not know what they do not know. Thus, any system that can learn can also
mislearn—it can “acquire incorrect knowledge”14 in a variety of ways.
Dataset annotations involve variables, and potentially biases, that humans apply so
that an AI solution can learn to spot a feature of interest.
Any bias that exists in a dataset will affect the performance of the machine
learning system. There are many sources, including population, frequency, and
instrumentation bias.
Having a system that is unintentionally biased towards one subset of a patient
population can result in poor model performance when faced with a different
subset, and ultimately this can lead to healthcare inequities. Even when working with
quality data, intentional bias (also known as positive bias) can be present, such as a
dataset made up only of people over the age of 70, assembled to study age-related
health concerns.
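To make the idea of auditing a dataset for representation concrete, the hypothetical sketch below tallies subgroup shares and flags thin coverage; the records, fields, and the 20% floor are assumptions invented for the example, not values from this whitepaper.

    from collections import Counter

    # Hypothetical patient records: (age_band, sex) per sample.
    records = [
        ("18-30", "M"), ("18-30", "M"), ("31-50", "M"),
        ("31-50", "F"), ("51-70", "M"), ("18-30", "M"),
    ]

    MIN_SHARE = 0.20  # representation floor chosen for this example

    def audit(records):
        """Report the share of each subgroup and flag thin coverage."""
        counts = Counter(records)
        total = len(records)
        for group, n in sorted(counts.items()):
            share = n / total
            flag = "LOW" if share < MIN_SHARE else "ok"
            print(f"{group}: {share:.0%} {flag}")

    audit(records)
    # A model trained on this set has weak evidence for female and
    # older subgroups; claims about them would need separate testing.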
When considering a dataset for a machine learning application, it is important to
understand the claims that can be made for it: whether a proper balance across the
represented population classes has been achieved, whether the data can be
reproduced, and whether any annotations are reliable. For
example, a dataset could contain chest X-rays from males aged 18–30 in a specific
country, half of whom have pneumonia. This dataset cannot claim to represent
pneumonia in females. It may not be able to claim to represent young males of
a particular ethnic group, as this subgroup might not be listed within the dataset
variables and might not be plausibly represented in the sample size.
The AI model is trained on the dataset and will learn the variables and annotations
that the dataset contains. In healthcare, the vast majority of neural networks
are initially trained on a dataset, evaluated for accuracy, and then used for inference
(e.g., by running the model on new images).
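That train, evaluate, then infer sequence can be sketched as follows; the one-feature threshold "model" and the made-up numbers are stand-ins for a real neural network and imaging dataset.

    # Hypothetical end-to-end flow: train -> evaluate -> inference.
    train_data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]  # (feature, label)
    test_data = [(0.3, 0), (0.7, 1), (0.5, 1)]

    def train(data):
        """Pick the threshold midway between the class means."""
        neg = [x for x, y in data if y == 0]
        pos = [x for x, y in data if y == 1]
        return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

    def accuracy(threshold, data):
        """Fraction of held-out samples classified correctly."""
        hits = sum((x >= threshold) == bool(y) for x, y in data)
        return hits / len(data)

    threshold = train(train_data)          # 1. train on the dataset
    print(accuracy(threshold, test_data))  # 2. evaluate for accuracy
    print(0.8 >= threshold)                # 3. inference on a new case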
It is important to understand what the model can reliably identify (e.g., the model
claims). Neural networks can generalize to some degree, allowing them to handle
inputs slightly different from their training dataset. For example, a model that is
carefully trained on male chest X-rays may also perform well on the female
population, or with different X-ray equipment. The only way to verify this is to present the trained model
with a new test dataset. Depending on the model performance, it may be possible
to demonstrate that the AI can accurately identify pneumonia across male and
female patients and generalize across different X-ray machines. There may be minor
differences in performance between datasets, but these could still be more accurate
than a human.
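Such a check might look like the following hypothetical comparison, where a second test set stands in for a different population or different X-ray equipment; all data and the trained threshold are invented for the example.

    # Hypothetical generalization check: score the trained model on the
    # original test population and on a new one, then compare.

    def accuracy(threshold, data):
        """Fraction of (feature, label) samples classified correctly."""
        hits = sum((x >= threshold) == bool(y) for x, y in data)
        return hits / len(data)

    threshold = 0.525  # assumed parameter from training on male X-rays

    male_test = [(0.3, 0), (0.7, 1), (0.8, 1), (0.4, 0)]
    female_test = [(0.35, 0), (0.65, 1), (0.50, 1), (0.45, 0)]

    for name, data in [("male", male_test), ("female", female_test)]:
        print(name, accuracy(threshold, data))
    # male 1.0, female 0.75 here: a gap this size may or may not be
    # acceptable; the generalization claim must be tested, not assumed.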
In summary, AI will learn the variables, biases, and annotations of a dataset, with
the expectation that it can spot an important feature. Once trained, an algorithm
will be tested, revealing that it is able to identify this feature with a certain level of
accuracy. In order to test the claim that the AI can identify a specific item, it needs to
be tested on a dataset that claims to represent this feature fairly. If it performs to a
satisfactory level on this dataset, the model can then claim to be able to identify this
item in future datasets that share the same variables as the test dataset.
Figure 2 explains this in more detail.
The following examples show instances where poor-quality datasets, and their
incorrect relationships with algorithm models, have caused failures in outputs.

An adaptive learning classifier system16 analyzed photographs to differentiate
between wolves and huskies. Instead of detecting distinguishing features between
the two canine breeds, the system determined the most salient distinction was that
photos of huskies included snow in the background, whereas photos of wolves did
not. The system’s conclusions were correct with respect to its training data but were
not usable in real-world scenarios, because extraneous and inappropriate variables
(i.e., the backgrounds) were included in the learning dataset.18 This is an example
of how an AI system may detect incidental patterns or correlations in a dataset and
assign a causal or meaningful relationship that is incorrect or irrelevant.
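One way to probe for this failure mode is a swap or ablation test: hold the intended subject constant while varying the incidental feature. The sketch below is a toy, hypothetical rendering of the wolf/husky case, in which a "model" that keys only on the background looks perfect on training data and collapses once the backgrounds are swapped.

    # Toy probe for spurious correlations (hypothetical data and model).
    # Each sample: (true_animal, background); the flawed model only
    # looks at the background, as in the wolf/husky example.

    def flawed_predict(sample):
        _, background = sample
        return "husky" if background == "snow" else "wolf"

    training = [("husky", "snow"), ("husky", "snow"), ("wolf", "forest")]
    swapped = [("husky", "forest"), ("wolf", "snow")]  # backgrounds swapped

    def score(data):
        return sum(flawed_predict(s) == s[0] for s in data) / len(data)

    print(score(training))  # 1.0 -- looks perfect on the training data
    print(score(swapped))   # 0.0 -- fails once the shortcut is removed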
Two other recent examples include IBM’s Watson using profanity after incorporating
the Urban Dictionary into its knowledge-base,19 and Microsoft’s interactive AI
assistant “Tay,” designed to learn from its interactions with users, which had to be
disabled after it was tricked into spouting racist dogma by online pranksters.20 AI is
vulnerable to bad data; it cannot always reliably evaluate the quality of incoming data
to determine if it might be biased, incorrect, or invalid. While AI system engineers can
create filters to curate data, those filters require assumptions and a priori knowledge
of the nature and quality of the data. When the assumptions are incorrect and/or the
knowledge is insufficient, system performance will be detrimentally affected.
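A curation filter of the kind described necessarily encodes a priori assumptions about what valid data looks like. In the hypothetical sketch below, the heart-rate bounds are such assumptions; if they are wrong for the target population, the filter discards good data or admits bad data.

    # Hypothetical intake filter for incoming heart-rate readings.
    # The bounds are a priori assumptions about plausible values.
    HR_MIN, HR_MAX = 30, 220  # assumed physiological limits (bpm)

    def curate(readings):
        """Keep readings inside the assumed valid range."""
        kept, rejected = [], []
        for bpm in readings:
            (kept if HR_MIN <= bpm <= HR_MAX else rejected).append(bpm)
        return kept, rejected

    kept, rejected = curate([72, 0, 185, 640, 55])
    print(kept)      # [72, 185, 55]
    print(rejected)  # [0, 640] -- filtered by assumption, not ground truth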
During the devastating California wildfires of 2017, a driving app designed to help
users avoid traffic directed fleeing drivers into areas where the inferno was raging,
as there was less traffic along those routes.21 Although the system operated correctly
for its original purpose of avoiding traffic jams, when that purpose expanded to the
more critical function of escaping a wildfire, it did not have adequate information—
or sufficiently robust algorithms—to make safe and accurate recommendations. In
this instance, the AI system made wrong decisions when it was not correctly matched
to the task at hand.

16 https://arxiv.org/pdf/1602.04938.pdf
17 Figure used with permission. Alberto Rizzoli, V7.
18 A similar medical AI example occurred when Stanford researchers tested an AI tool to identify melanomas from pictures of moles and found the tool used the presence of rulers in the photos as a positive indicator of cancer. See http://stanmed.stanford.edu/2018summer/artificial-intelligence-puts-humanity-health-care.html
19 https://www.theatlantic.com/technology/archive/2013/01/ibms-watson-memorized-the-entire-urban-dictionary-then-his-overlords-had-to-delete-it/267047/
20 https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html
21 http://www.latimes.com/local/california/la-me-southern-california-wildfires-live-firefighters-attempt-to-contain-bel-air-1512605377-htmlstory.html

5. POTENTIAL REGULATORY AND STANDARDIZATION APPROACHES TO ADDRESS SAFETY AND PERFORMANCE OF AI

As further advancements are made with AI technology, regulators may consider
multiple approaches for addressing the safety and effectiveness of AI in healthcare,
including how international standards and other best practices are currently used to
support the regulation of medical software, along with differences and gaps that will
need to be addressed for AI solutions. A key aspect will be the need to generate real-
world clinical evidence for AI throughout its life cycle, and the potential for additional
clinical evidence to support adaptive systems.

In the last ten years, regulatory guidance and international standards have emerged
for software, either as a standalone medical device or where it is incorporated
into a physical device. This has provided requirements and guidance for software
manufacturers to demonstrate compliance with medical device regulations and to
place their products on the market.

However, AI potentially introduces new risk, as discussed earlier in this paper, that
is not addressed within the current portfolio of standards and guidance for software.
Different approaches will be required to ensure the safety and performance of AI
solutions placed on the market. As these new approaches are being defined, the
current regulatory landscape for software should be considered as a good starting
point.
In Europe, the Medical Device Regulation (MDR) and In Vitro Diagnostic Regulation
(IVDR) include several generic requirements that can apply to software. These consist
of the following:
• general obligations of manufacturers, such as risk management, clinical
performance evaluation, quality management, technical documentation, unique
device identification, postmarket surveillance and corrective actions;
• requirements regarding design and manufacture, including construction of
devices, interaction with the environment, diagnostic and measuring functions,
active and connected devices; and
• information supplied with the device, such as labelling and instructions for use.
Due to the potential for AI solutions to learn and adapt in real time, organizational-
based approaches to establish the capabilities of software developers to respond to
real-world AI performance could become crucial. These approaches are already being
considered by U.S. FDA, although they may not necessarily align with EU Medical
Device Regulation.

25 https://www.fda.gov/about-fda/cdrh-reports/national-evaluation-system-health-technology-nest
7. DEVELOPMENTS IN AI

Good AI development processes and practices will be critical for ensuring the safety
and performance of AI solutions in healthcare. These practices will need to address
product changes anticipated prior to those changes being made. It may be possible
for an AI developer to specify any modifications that they plan to achieve in the
future (i.e., once the AI is in use). The developer would need methods in place to
achieve any anticipated changes and to control any risks associated with them.

26 https://www.nice.org.uk/Media/Default/About/what-we-do/our-programmes/evidence-standards-framework/digital-evidence-standards-framework.pdf
There will be limitations to the scope of anticipated changes that can be specified
in advance, depending upon how extensive the modifications will be. However, the
ability to monitor AI performance in real time provides an opportunity to develop a
dynamic regulatory process that allows rapid improvements to the software while
ensuring safety.
31 https://ieeexplore.ieee.org/abstract/document/8787436
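A minimal sketch of the kind of real-time performance monitoring that could feed such a dynamic process is shown below; the rolling window size and accuracy floor are illustrative assumptions, not values drawn from any regulation or standard.

    from collections import deque

    # Hypothetical post-market monitor: track rolling accuracy over
    # recent cases and raise a flag when it drops below a set floor.
    WINDOW, FLOOR = 5, 0.8  # illustrative assumptions

    recent = deque(maxlen=WINDOW)

    def record_outcome(correct):
        """Log one adjudicated prediction and check the rolling rate."""
        recent.append(1 if correct else 0)
        rate = sum(recent) / len(recent)
        if len(recent) == WINDOW and rate < FLOOR:
            print(f"ALERT: rolling accuracy {rate:.0%} below floor")
        return rate

    for outcome in [True, True, True, True, False, False]:
        record_outcome(outcome)
    # Only the last case trips the alert: 2 of the last 5 were wrong,
    # giving a rolling accuracy of 60%, below the 80% floor.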
BSI and AAMI acknowledge there are many factors that can have an impact on
the quality of data used in AI and a significant number of initiatives are working to
address these challenges. Some of these have been highlighted earlier within this
whitepaper, including dataset size, annotations, and biases. Other factors could also
be applicable for data quality that is applied in regulated situations. For example, data
storage could be an attractive target for hackers or, if a storage solution allows data
to be corrupted, then the performance of AI that depends on that data would be
adversely affected.
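As one concrete mitigation for silent corruption of stored data, an integrity check can hash dataset files at ingest and verify them before use. The sketch below uses only the Python standard library; the file name and contents are invented for the example.

    import hashlib

    # Hypothetical integrity check: detect silent corruption of stored
    # training data by comparing content hashes recorded at ingest.

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # At ingest: record the hash alongside the stored bytes.
    stored = b"patient_id,feature,label\n001,0.72,1\n"
    manifest = {"train.csv": digest(stored)}

    # Later, before training: re-hash and compare.
    corrupted = stored.replace(b"0.72", b"0.27")  # simulated corruption
    for name, data in [("train.csv", stored), ("train.csv", corrupted)]:
        ok = digest(data) == manifest[name]
        print(name, "verified" if ok else "CORRUPTED -- do not use")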
Even with enhanced validation study guidance, there will always be a risk of bias
in a validation study that may result in artificially high performance. One way to
mitigate this is by using good (and ideally excellent) development processes. A
thoughtful and pragmatic development process is more likely to create good software
than a compliance-only based development process. Product quality is related to
the process used to develop the product, and this has been noted in the U.S. FDA’s
proposed Pre-Certification program.33
Validity versus validation: The term validation has a special meaning in both
the medical device world and the data science world. For medical devices,
validation is a process that is used to ensure that user needs are met. For data
science, validation is a process to ensure that the data has validity (i.e., that the
data is correct and adequate for its intended purpose).
33 https://www.fda.gov/medical-devices/digital-health/digital-health-software-precertification-pre-cert-program
ANNEX A – GLOSSARY
This Annex is not intended to provide a comprehensive list of all terms associated
with the topic of healthcare artificial intelligence. Rather, the terms defined in this
Annex are intended to support the reading of this whitepaper.
A.1
algorithm
a process or set of rules, including data driven or human-curated, to be followed
in calculation or other problem-solving operations. The technology of artificial
intelligence uses a variety of algorithms as tools and applications.
[Source: ANSI/CTA-2089.1]
A.2
Artificial Intelligence (AI)
(1) <system>capability of an engineered system to acquire, process, and apply
knowledge and skills.
Note 1: Knowledge consists of facts, information, and skills acquired through
experience or education.
[Source: SC42, draft 22989]
(2) A machine’s ability to make decisions and perform tasks that simulate
human intelligence and behavior.
[Source: Xavier Health, Perspectives and Good Practices for AI and Continuously
Learning Systems in Healthcare]
(3) A general term addressing machine behavior and function that exhibits the
intelligence and behavior of humans.
[Source: ANSI/CTA-2089.1]
A.3
bias
favoritism towards some things, people, or groups over others.
[Source: ISO 24027]
A.4
continuous learning
incremental training of an AI system that takes place on an ongoing basis while the
system is running in production.
[Source: SC42, draft 22989]
A.5
deep learning
(1) approach to creating rich hierarchical representations through the training
of neural networks with many hidden layers.
Note 1: Deep learning uses multilayered networks of simple computing
units (or “neurons”). In these neural networks each unit combines a set of
input values to produce an output value, which in turn is passed on to other
neurons downstream.
[Source: SC42, draft 22989 references ISO/IEC 23053, 3.13]
(2) an advanced form of neural network machine learning that utilizes big data
to generate impressive results.
[Source: CTA, What is Artificial Intelligence?]
A.6
effectiveness
reasonable assurance that a device is effective when it can be determined, based
upon valid scientific evidence, that in a significant portion of the target population,
the use of the device for its intended uses and conditions of use, when accompanied
by adequate directions for use and warnings against unsafe use, will provide clinically
significant results.
[Source: U.S. Code of Federal Regulations, 21 CFR 860.7]
A.7
machine learning
(1) function of a system that can learn from input data instead of strictly
following a set of specific instructions.
Note 1: MACHINE LEARNING focuses on prediction based on known
properties learned from the input data.
[Source: AAMI TIR66, Guidance for the creation of physiologic data and waveform
databases to demonstrate reasonable assurance of the safety and effectiveness of
alarm system algorithms]
(2) a sub-branch of AI in which the rules by which a decision or action are
taken are learned through examples (i.e., a training process).
[Source adapted: BSI, Recent advancements in AI – implications for medical device
technology and certification]
A.8
Natural Language Processing (NLP)
(1) information processing based upon natural-language understanding.
Note 1: NLP is a field of AI.
A.9
neural network/neural net/artificial neural network
network of primitive processing elements connected by weighted links with
adjustable weights, in which each element produces a value by applying a nonlinear
function to its input values, and transmits it to other elements or presents it as an
output value.
Note 1: Whereas some neural networks are intended to simulate the functioning
of neurons in the nervous system, most neural networks are used in artificial
intelligence as realizations of the connectionist model.
Note 2: Examples of nonlinear functions are a threshold function, a sigmoid
function, and a polynomial function.
[Source: SC42, draft 22989, references ISO/IEC 2382-28:1995]
A.10
postmarket surveillance
systematic process to collect and analyze experience gained from medical devices that
have been placed on the market.
[Source ISO 13485:2016, 3.14]
A.11
robustness
ability of a system to maintain its level of performance under any circumstances.
[Source: SC 42, draft 22989]
A.12
Software as a Medical Device (SaMD)
software intended to be used for one or more medical purposes and to perform
these purposes without being integral to the hardware of a medical device.
[Source: IMDRF, Software as a Medical Device (SaMD): Key Definitions]
A.13
training
process to establish or to improve the parameters of a machine learning model,
based on a machine learning algorithm, by using training data.
Note 1: For supervised learning, the machine learning model can be trained
(learn from) data that is similar to input data.
Note 2: For transfer learning, the input data is not necessarily similar to the
training data.
Note 3: For unsupervised learning, the machine learning model is trained (learns
from) and makes inferences, or predictions, based on the same data.
[Source SC42, draft 22989 references ISO/IEC 23053, 3.9]
A.14
transparency
open, comprehensive, accessible, clear, and understandable presentation of
information.
[Source: ISO 20294:2018, 3.3.11]
A.15
validation
confirmation, through the provision of objective evidence, that the requirements for a
specific intended use or application have been fulfilled.
Note 1: The objective evidence needed for a VALIDATION is the result of a test or
other form of determination, such as performing alternative calculations or
reviewing documents.
Note 2: The word “validated” is used to designate the corresponding status.
Note 3: The use conditions for VALIDATION can be real or simulated.
[Source: ISO 9000:2015, 3.8.13]
A.16
verification
(1) confirmation, through the provision of objective evidence, that the specified
requirements have been fulfilled.
Note 1: Verification only provides assurance that a product conforms to
its specification.
[Source: ISO/IEC 27042:2015, 3.21]
(2) confirmation by examination and through provision of objective evidence
that specified requirements have been fulfilled.
Note 1: The term verified is used to designate the corresponding status.
Note 2: Confirmation can comprise activities such as:
• performing alternative calculations;
• comparing a new design specification with a similar proven
design specification;
• undertaking tests and demonstrations;
• reviewing documents prior to issue.
[Source: IEC 60601-1:2005+AMD1:2012, definition 3.138]