Abstract

Machine learning has become prevalent across a wide variety of applications. Unfortunately, machine learning has also been shown to be susceptible to deception, leading to errors and even fatal failures. This circumstance calls into question the widespread use of machine learning, especially in safety-critical applications, unless we are able to assure its correctness and trustworthiness properties. Software verification and testing are established techniques for assuring such properties, for example by detecting errors. However, software testing challenges for machine learning are vast and profuse, yet critical to address. This summary talk discusses the current state-of-the-art of software testing for machine learning. More specifically, it discusses six key challenge areas for software testing of machine learning systems, examines current approaches to these challenges, and highlights their limitations. The paper provides a research agenda with elaborated directions for making progress toward advancing the state-of-the-art on testing of machine learning.

1 Introduction

Applications of machine learning (ML) technology have become vital in many innovative domains. At the same time, the vulnerability of ML has become evident, sometimes leading to catastrophic failures (e.g. the fatal Tesla Autopilot crash, www.theguardian.com/technology/2016/jul/01/tesla-driver-killed-autopilot-self-driving-car). This entails that comprehensive testing of ML needs to be performed to ensure the correctness and trustworthiness of ML-enabled systems.

Software testing of ML systems faces a number of challenges compared to testing of traditional software systems. In this paper, by traditional systems we mean software systems not integrating ML, and by ML systems we mean software systems containing ML-trained components (e.g. self-driving cars, autonomous ships, or space exploration robots). As an example, one such challenge of testing ML systems stems from the non-determinism intrinsic to ML. Traditional systems are typically pre-programmed and execute a set of rules, while ML systems reason in a probabilistic manner and exhibit non-deterministic behavior. This means that for constant test inputs and preconditions, an ML-trained software component can produce different outputs in consecutive runs. Researchers have tried using testing techniques from traditional software development (Hutchison et al. 2018) to deal with some of these challenges. However, it has been observed that traditional testing approaches in general fail to adequately address fundamental challenges of testing ML (Helle and Schamai 2016), and that these traditional approaches require adaptation to the new context of ML. The better we understand the current research challenges of testing ML, the more successful we can be in developing novel techniques that effectively address these challenges and advance this scientific field.

In this paper, we: i) identify and discuss the most challenging areas in software testing for ML, ii) synthesize the most promising approaches to these challenges, iii) spotlight their limitations, and iv) make recommendations for further research efforts on software testing of ML. We note that the aim of the paper is not to exhaustively list all published work, but to distill the most representative work.

2 Testing ML

As ML technologies become more pervasive, enabling autonomous system functionality, it is more and more important to assure the quality of the autonomous reasoning supported by ML. Testing is a quality assurance activity that aims, in a broad sense, to determine the correctness of the system under test, for example by checking whether the system responds correctly to inputs, and to identify faults which may lead to failures.

Interpreting "Testing ML": Two distinct communities have been studying the concept of testing ML, the ML scientific community (MLC) and the software testing community (STC). However, as the two communities study ML algorithms from different perspectives, they interpret the term testing ML differently, and we think it is worth noting the distinction. In MLC, testing an ML model is performed to estimate its prediction accuracy and improve its prediction performance. Testing happens during model creation, using validation and test datasets, to evaluate the model fit on the training dataset.
In STC, testing an ML system has a more general scope, aiming to evaluate the system behavior for a range of quality attributes. For example, in the case of integration or system level testing, an ML component is tested in interaction with other system components against functional and non-functional requirements, such as correctness, robustness, reliability, or efficiency.
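To make the MLC interpretation concrete, the following minimal sketch estimates model fit on held-out validation and test datasets; the dataset, model, and split ratios are illustrative assumptions of ours rather than a setup taken from the surveyed work.

# MLC-style "testing": hold out validation and test datasets during model
# creation and use them to estimate prediction accuracy, i.e. model fit.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))   # used while tuning the model
print("test accuracy:", model.score(X_test, y_test))       # final estimate of model fit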
Challenges of Testing ML

Challenges of testing ML stem from the innate complexity of the underlying stochastic reasoning. Unlike traditional systems, for which the code is built deductively, ML systems are generated inductively: the logic defining system behavior is inferred from training data. Consequently, a fault can originate not only from faulty software code, but also from errors in the training data. However, existing approaches often assume that high-quality datasets are warranted, without applying systematic quality evaluation. Furthermore, ML systems require advanced reasoning and learning capabilities that can give answers in conditions where the correct answers are previously unknown (Murphy, Kaiser, and Arias 2007). Even though this may also be the case for traditional systems, ML systems have inherent non-determinism, which makes them constantly change behavior as more data becomes available, unlike traditional systems. Furthermore, for a system containing multiple ML models, the models will affect each other's training and tuning, potentially causing non-monotonic error propagation (Amershi, Begel, and Bird 2019).

We elaborate further challenges of testing ML in the following sections. Specifically, we identify six key challenge areas and discuss their implications. We synthesize existing work pertaining to these challenges and provide its structured presentation corresponding to the identified challenges.

3 Missing Test Oracles

Unlike traditional systems, which execute pre-programmed deterministic instructions, ML systems operate based on stochastic reasoning. Such stochastic or probability-based reasoning introduces uncertainty in the system response, which gives rise to non-deterministic behavior, including unpredictable or underspecified behavior. Due to non-determinism, ML systems can change behavior as they learn over time. The implication for testing is that system outputs can change over time for the same test inputs. This fact largely complicates test case specification.

Test cases are typically specified with specific inputs to the system under test and expected outputs for these inputs, known as test oracles. However, due to stochastic reasoning, the output of an ML system cannot be specified in advance; rather, it is learned and predicted by an ML model. This means that ML systems do not have defined expected values against which actual values can be compared in testing. Thus, the correctness of the output in testing ML cannot be easily determined. While this problem has been known for traditional systems, called "non-testable" systems (Weyuker 1982), ML systems have non-determinism as part of their design, making the oracle problem even more challenging.

One approach that has been considered for non-testable systems is the use of pseudo-oracles (Weyuker 1982). Pseudo-oracles are a differential testing technique that consists in running multiple systems satisfying the same specification as the original system under test, then feeding the same inputs to these systems and observing their outputs. Discrepancies in outputs are considered indicative of errors in the system under test. A limitation of differential testing is that it can be resource-inefficient, as it requires multiple runs of the system, and error-prone, as the same errors are possible in multiple implementations of the system under test (Knight and Leveson 1986).
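As a minimal illustration of the pseudo-oracle idea, the sketch below trains two independently implemented classifiers on the same dataset and flags the test inputs on which their predictions disagree; the scikit-learn models used here are our illustrative assumption, not implementations from the cited work.

# Pseudo-oracle (differential testing) sketch: two independently built models
# satisfying the same specification are run on the same test inputs, and
# disagreements are reported as suspicious behaviour for manual inspection.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

# Two "implementations" of the same learning task act as pseudo-oracles
model_a = RandomForestClassifier(random_state=0).fit(X_train, y_train)
model_b = LogisticRegression(max_iter=5000).fit(X_train, y_train)

pred_a = model_a.predict(X_test)
pred_b = model_b.predict(X_test)

# Discrepancies are treated as indicative of a potential error in one system
suspicious = np.where(pred_a != pred_b)[0]
print(f"{len(suspicious)} of {len(X_test)} test inputs trigger disagreement")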
Metamorphic Testing

Metamorphic testing is another approach to testing software without test oracles. In this approach, a transformation function is used to modify an existing test input and produce a new one. If the actual output for the modified input differs from the output expected under the transformation, this indicates an error in the software under test. Metamorphic testing has been applied to machine learning classifiers (Xie et al. 2011) (Dwarakanath et al. 2018). However, in testing ML systems with a large input space, writing metamorphic transformations is laborious, and there is great potential for ML to circumvent this difficulty by automating the creation of metamorphic relationships.
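For concreteness, a minimal metamorphic-relation check might look as follows; the relation used here (consistently permuting feature columns in both training and test data should not change the predictions of a k-nearest-neighbour classifier) is our own illustrative assumption, not one of the relations from the cited studies.

# Metamorphic testing sketch: apply a prediction-preserving transformation to
# the inputs and check that the relation between source and follow-up outputs
# holds; a violation indicates an error, without needing a ground-truth oracle.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
perm = rng.permutation(X.shape[1])  # consistent permutation of feature columns

source_model = KNeighborsClassifier().fit(X, y)
followup_model = KNeighborsClassifier().fit(X[:, perm], y)

source_out = source_model.predict(X)
followup_out = followup_model.predict(X[:, perm])

# Metamorphic relation: predictions must be identical under the transformation
violations = np.sum(source_out != followup_out)
print(f"metamorphic relation violated on {violations} inputs")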
4 Infeasibility of Complete Testing
ing unpredictable or underspecified behavior. Due to non- ML systems are commonly deployed in application areas
determinism, ML systems can change behavior as they learn dealing with a large amount of data. This creates large and
over time. The implications for testing are that system out- diverse test input space. Unfortunately, testing is rarely able
puts can change over time for the same test inputs. This fact to cover all valid inputs and their combinations to examine
largely complicates test case specification. the correctness of a system-under-test, and therefore cover-
Test cases are typically specified with specific inputs to age metrics are typically applied to select an adequate set of
the system under test and expected outputs for these inputs, inputs from a large input space (Gotlieb and Marijan 2014),
known as test oracles. However, due to stochastic reasoning, to generate tests, or to assess the completeness of a test set
the output of an ML system cannot be specified in advance, and improve its quality (Marijan, Gotlieb, and Liaaen 2019).
rather it is learned and predicted by an ML model. This
means that ML systems do not have defined expected val- Test Coverage
ues against which actual values can be compared in testing. The first attempts to define coverage metrics for testing of
Thus, the correctness of the output in testing ML cannot be neural networks are inspired by the traditional code cover-
easily determined. While this problem has been known for age metrics. A metric called neuron coverage was proposed
traditional systems, called ”non-testable” systems (Weyuker in DeepXplore (Pei et al. 2017) for testing deep neural net-
1982), ML systems have non-determinism as part of their works (DNN). DeepXplore measures the amount of unique
design, making the oracle problem even more challenging. neurons activated by a set of inputs out of the total num-
An approach that has been considered for non-testable ber of neurons in the DNN. The limitation of this coverage
metric is that a test suite that has full neuron coverage (all Fuzzing
neurons activated) can still miss to detect erroneous behav- Since the input space of DNNs is typically large and highly-
ior if there was an error in all other DNNs that were part dimensional, selecting test data for DNNs can be highly la-
of a differential comparing (Pei et al. 2017) used by neuron borious. One approach to deal with this challenge is fuzzing,
coverage (DeepXplore leverages the concept of differential which generates large amounts of random input data that is
testing). Furthermore, it has been shown that neuron cov- checked for failures. TensorFuzz is an initial work that ap-
erage can be too coarse a coverage metric, meaning that a plies fuzzing to testing of TensorFlow DNNs (Odena and
test suite that achieves full neuron coverage can be easily Goodfellow 2018). TensorFuzz uses a coverage metric con-
found, but the network can still be vulnerable to trivial ad- sisting of user-specified constraints to randomly mutate in-
versarial examples (Sun, Huang, and Kroening 2018). Sun puts. The coverage is measured by a fast approximate near-
et al. therefore proposed DeepCover, a testing methodology est neighbour algorithm. TensorFuzz has showed to outper-
for DNNs with four test criteria, inspired by the modified form random testing. Another similar approach is Deep-
condition/decision coverage (MC/DC) for traditional soft- Hunter (Xie et al. 2018). This is an initial work on auto-
ware. Their approach includes a test case generation algo- mated feedback-guided fuzz testing for DNNs. DeepHunter
rithm that perturbs a given test case using linear program- runs metamorphic mutation to generate new semantically
ming with a goal to encode the test requirement and a frag- preserved tests, and uses multiple coverage criteria as a feed-
ment of the DNN. The same author also developed a test back to guide test generation from different perspectives.
case generation algorithm based on symbolic approach and The limitation of this approach is that it uses only a single
the gradient-based heuristic (Sun et al. 2019). The difference coverage criteria at the time, not supporting multi-criteria
between their coverage approach, based on MC/DC crite- test generation. Moreover, the general limitation of fuzzing
rion, and neuron coverage is that the latter only considers is that it cannot ensure that certain test objectives will be
individual activations of neurons, while the former consid- satisfied.
ers causal relations between features at consecutive layers of
the neural network. Concolic Testing
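To make the notion concrete, the following sketch computes neuron coverage in the DeepXplore sense, i.e. the fraction of neurons whose scaled activation exceeds a threshold for at least one test input; the layer-activation format and the threshold value are illustrative assumptions of ours.

# Neuron coverage sketch: a neuron counts as covered if its (min-max scaled)
# activation exceeds a threshold for at least one input in the test set.
import numpy as np

def neuron_coverage(layer_activations, threshold=0.25):
    covered, total = 0, 0
    for acts in layer_activations:           # acts: (num_inputs, num_neurons)
        lo, hi = acts.min(axis=0), acts.max(axis=0)
        scaled = (acts - lo) / np.maximum(hi - lo, 1e-12)
        covered += np.sum((scaled > threshold).any(axis=0))
        total += acts.shape[1]
    return covered / total

# Toy example: activations of two layers recorded for 100 test inputs
rng = np.random.default_rng(0)
layers = [rng.random((100, 64)), rng.random((100, 10))]
print(f"neuron coverage: {neuron_coverage(layers):.2%}")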
Neuron coverage has been further extended in DeepGauge (Ma et al. 2018a), which aims to test DNNs by combining the coverage of key function regions as well as corner-case regions of the DNN, represented by neuron boundary coverage. Neuron boundary coverage measures how well the test datasets cover the upper and lower boundary values of neuron activations. DeepRoad (Zhang et al. 2018) is another test generation approach for DNN-based autonomous driving. DeepRoad is based on generative adversarial networks, and it generates realistic driving scenes with various weather conditions. DeepCruiser is an initial work towards testing recurrent-neural-network (RNN)-based stateful deep learning (Du et al. 2018). DeepCruiser represents an RNN as an abstract state transition system and defines a set of test coverage criteria for generating test cases for stateful deep learning systems. Other approaches were proposed extending the notion of neuron coverage, such as DeepTest (Tian et al. 2018) for testing other types of neural networks. DeepTest applies image transformations such as contrast, scaling, and blurring to generate synthetic test images. However, such generated images were found to be insufficiently realistic for testing real-world systems.

In summary, a common limitation of techniques based on neuron coverage is that they can easily lead to combinatorial explosion. Ma et al. initiated work on the adaptation of combinatorial testing techniques for the systematic sampling of the large space of neuron interactions at different layers of a DNN (Ma et al. 2018c). This approach can be promising for taming combinatorial explosion in testing of DNN-based systems, given that its current limitations are overcome. First, only 2-way interactions of input parameters are supported, while real systems typically have much higher interaction levels of inputs. Second, the approach has been found to face scalability problems for large and complex DNNs.
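As a hedged illustration of what 2-way (pairwise) interaction coverage over neuron activations can look like, the sketch below binarizes activations and measures how many of the four on/off combinations are exercised for each neuron pair; the binarization threshold and the layer format are our assumptions, not the exact definitions of Ma et al.

# Pairwise (2-way) coverage sketch over binarized neuron activations: for every
# pair of neurons in a layer, count how many of the 4 on/off combinations the
# test inputs exercise, and report the covered fraction.
import numpy as np
from itertools import combinations

def pairwise_activation_coverage(acts, threshold=0.5):
    states = acts > threshold                 # (num_inputs, num_neurons) booleans
    covered, total = 0, 0
    for i, j in combinations(range(states.shape[1]), 2):
        pairs = {(a, b) for a, b in zip(states[:, i], states[:, j])}
        covered += len(pairs)
        total += 4                            # (0,0), (0,1), (1,0), (1,1)
    return covered / total

rng = np.random.default_rng(0)
layer_acts = rng.random((50, 12))             # toy activations: 50 inputs, 12 neurons
print(f"2-way activation coverage: {pairwise_activation_coverage(layer_acts):.2%}")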
Fuzzing

Since the input space of DNNs is typically large and high-dimensional, selecting test data for DNNs can be highly laborious. One approach to deal with this challenge is fuzzing, which generates large amounts of random input data that is checked for failures. TensorFuzz is an initial work that applies fuzzing to testing of TensorFlow DNNs (Odena and Goodfellow 2018). TensorFuzz uses a coverage metric consisting of user-specified constraints to randomly mutate inputs, and coverage is measured by a fast approximate nearest-neighbour algorithm. TensorFuzz has been shown to outperform random testing. Another similar approach is DeepHunter (Xie et al. 2018), an initial work on automated feedback-guided fuzz testing for DNNs. DeepHunter runs metamorphic mutation to generate new semantically preserved tests, and uses multiple coverage criteria as feedback to guide test generation from different perspectives. The limitation of this approach is that it uses only a single coverage criterion at a time, not supporting multi-criteria test generation. Moreover, the general limitation of fuzzing is that it cannot ensure that certain test objectives will be satisfied.
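A stripped-down version of such coverage-guided fuzzing could look as follows; using the distance between probability vectors as the "new coverage" signal loosely mimics TensorFuzz's nearest-neighbour idea, but the concrete model, mutation, and distance threshold are our assumptions, not the original tool.

# Coverage-guided fuzzing sketch: mutate seeds with random noise, keep mutants
# whose behaviour (here, the model's probability vector) is sufficiently far
# from anything seen before, and report mutants that flip the seed's label.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
corpus = [X[0]]
coverage_points = [model.predict_proba([X[0]])[0]]
failures = []

for _ in range(500):
    seed = corpus[rng.integers(len(corpus))]
    mutant = seed + rng.normal(0.0, 0.5, size=seed.shape)       # random mutation
    prob = model.predict_proba([mutant])[0]
    # "new coverage": probability vector far from all previously kept ones
    if min(np.linalg.norm(prob - p) for p in coverage_points) > 0.1:
        corpus.append(mutant)
        coverage_points.append(prob)
    if model.predict([mutant])[0] != y[0]:                       # label no longer matches the seed's class
        failures.append(mutant)

print(f"kept {len(corpus)} corpus items, found {len(failures)} label flips")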
Concolic Testing

To provide more effective input selection that increases test coverage, a concolic testing approach has been proposed in DeepConcolic (Sun et al. 2018). The approach is parameterised with a set of coverage requirements. The requirements are used to incrementally generate a set of test inputs with the goal of improving the coverage of the requirements, by alternating between concrete execution (testing on particular inputs) and symbolic execution. For an unsatisfied requirement, a test input within the existing test suite that is close to satisfying that requirement is identified, based on concrete execution. Then, a new test input that satisfies the requirement is generated through symbolic execution and added to the test suite, improving test coverage.

5 Quality of Test Datasets for ML Models

When training ML models, the quality of the training dataset is important for achieving good performance of the learned model. The performance is evaluated using a test dataset.

Mutation Testing

To evaluate the quality of a test dataset for DNNs, DeepMutation (Ma et al. 2018b) is an initial work inspired by traditional mutation testing concepts. DeepMutation first designs a set of mutation operators to inject faults into training data. Then, it retrains models with the mutated training data to generate mutated models, which means that faults are injected into the models. After that, the mutated models are tested using a test dataset. Finally, the quality of the test dataset is evaluated by analysing to what extent the injected faults are detected. The limitation of this approach is that it employs basic mutation operators covering limited aspects of deep learning systems, so the injected faults may not be representative enough of real faults. MuNN (Shen, Wan, and Chen 2018) is another mutation testing approach for neural networks, which needs further work before it can be applied to DNNs. Specifically, the authors of the approach showed that neural networks of different depth require different mutation operators. They also showed the importance of developing domain-dependent mutation operators rather than using common mutation operators.
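In the spirit of DeepMutation's data-level operators, a minimal sketch of the workflow could be the following; the single label-noise mutation operator and the scikit-learn model stand in, as an assumption of ours, for the richer operator set and DNNs of the original approach.

# Mutation testing sketch: inject a fault into the training data (flip some
# labels), retrain to obtain a mutant model, and check whether the test set
# "kills" the mutant, i.e. exposes behaviour differing from the original model.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True), random_state=0)
original = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

rng = np.random.default_rng(0)
killed = 0
n_mutants = 10
for _ in range(n_mutants):
    y_mut = y_train.copy()
    idx = rng.choice(len(y_mut), size=len(y_mut) // 20, replace=False)
    y_mut[idx] = rng.integers(0, 10, size=len(idx))        # label-noise mutation operator
    mutant = DecisionTreeClassifier(random_state=0).fit(X_train, y_mut)
    # The test set kills the mutant if any prediction differs from the original
    if np.any(mutant.predict(X_test) != original.predict(X_test)):
        killed += 1

print(f"mutation score of the test set: {killed}/{n_mutants}")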
6 Vulnerability to Adversaries

ML classifiers are known to be vulnerable to attacks where small modifications are added to input data, causing misclassification and leading to failures of ML systems (Szegedy et al. 2014). The modifications made to input data, called adversarial examples, are small perturbations designed to be very close to the original data, yet able to cause misclassifications and to compromise the integrity (e.g. accuracy) of a classifier. Such attacks have been observed for image recognition (Xie et al. 2017), text (Sato et al. 2018), and speech recognition tasks (Carlini et al. 2016) (Carlini and Wagner 2018) (Jia and Liang 2017). In the latter work, it was shown that adversarially inserted sentences in the Stanford Question Answering Dataset can decrease the reading comprehension of ML from 75% to 36% F-measure (the harmonic mean of the precision and recall of a test).

Generating Adversarial Examples

Adversarial examples can be generated for the purpose of attack or defense of an ML classifier. The former often uses heuristic algorithms to find adversarial examples that are very close to correctly classified examples. The latter aims to improve the robustness of ML classifiers.

Some approaches to adversarial example generation include the Fast Gradient Sign Method (FGSM) (Goodfellow, Shlens, and Szegedy 2015), which showed that linear behavior in high-dimensional spaces is sufficient to cause adversarial examples. Later, FGSM was shown to be less effective for black-box attacks (Tramèr et al. 2017), and the authors developed the RAND-FGSM method, which adds random perturbations to the inputs before computing the adversarial perturbations. DeepFool (Moosavi-Dezfooli, Fawzi, and Frossard 2016) is another approach that generates adversarial examples based on an iterative linearization of the classifier, producing minimal perturbations that are sufficient to change classification labels. The limitation of this approach lies in the fact that it is a greedy heuristic, which cannot guarantee to find optimal adversarial examples. Further, a two-player turn-based stochastic game approach was developed for generating adversarial examples (Wicker, Huang, and Kwiatkowska 2018). The first player tries to minimise the distance to an adversarial example by manipulating the features, and the second player can be cooperative, adversarial, or random. The approach has been shown to converge to the optimal strategy, which represents a globally minimal adversarial image. The limitation of this approach is its long runtime. Extending the idea of DeepFool, a universal adversarial attack approach was developed (Moosavi-Dezfooli et al. 2017). This approach generates universal perturbations using a smaller set of input data, and uses DeepFool to obtain a minimal sample perturbation of the input data, which is later modified into a final perturbation.

Adversarial examples can also be generated with generative adversarial networks, such as AdvGAN (Xiao et al. 2018). This approach aims to generate perturbations for any instance, which can speed up adversarial training. The limitation of the approach is that the resulting adversarial examples are based on small norm-bounded perturbations. This challenge is further addressed in (Song et al. 2018) by developing unrestricted adversarial examples. However, their approach exploits classifier vulnerability to covariate shift and is sensitive to different distributions of input data.
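As an illustration of the gradient-sign idea behind FGSM, the sketch below perturbs an input in the direction of the sign of the loss gradient; the tiny logistic-regression model written in NumPy is our simplification, not the deep-network setting of the original paper.

# FGSM sketch: x_adv = x + eps * sign(d loss / d x), i.e. a single step in the
# direction that increases the classification loss the most per coordinate.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy differentiable "model": logistic regression with fixed weights
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def predict_proba(x):
    return sigmoid(np.dot(w, x) + b)

def loss_gradient_wrt_input(x, y_true):
    # d/dx of the cross-entropy loss for logistic regression: (p - y) * w
    return (predict_proba(x) - y_true) * w

x = np.array([0.2, 0.4, -0.1])
y_true = 1.0
eps = 0.1

x_adv = x + eps * np.sign(loss_gradient_wrt_input(x, y_true))
print("clean prob:", predict_proba(x), "adversarial prob:", predict_proba(x_adv))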
Countering Adversarial Examples

To counter adversarial attacks, reactive and proactive defensive methods against adversaries have been proposed. Defensive distillation is a proactive approach which aims to reduce the effectiveness of adversarial perturbations against DNNs (Papernot et al. 2016). Defensive distillation extracts additional knowledge about training points in the form of class probability vectors produced by a DNN. The probability vectors are fed back into training, producing DNN-based classifier models that are more robust to perturbations. However, it has been shown that such defensive mechanisms are typically vulnerable to new attacks (Carlini and Wagner 2017). Moreover, just as in testing, if a defense cannot find any adversarial examples, it does not mean that such examples do not exist.

Automated verification is a reactive defensive approach against adversarial perturbations, which analyses the robustness of DNNs to improve their defensive capabilities. Several approaches exist to deal with the robustness challenge. An exhaustive search approach to verifying the correctness of a classification made by a DNN has been proposed (Huang et al. 2017). This approach checks the safety of a DNN by exploring the region around a data point to search for specific adversarial manipulations. The limitations of the approach are limited scalability and poor computational performance induced by state-space explosion. Reluplex is a constraint-based approach for verifying the properties of DNNs by providing counter-examples (Katz et al. 2017), but it is currently limited to small DNNs. An approach that can work with larger DNNs is global optimization based on adaptive nested optimisation (Ruan, Huang, and Kwiatkowska 2018); however, the approach is limited in the number of input dimensions that can be perturbed. A common challenge for verification approaches is their computational complexity. For both approaches (Katz et al. 2017) and (Ruan, Huang, and Kwiatkowska 2018), the complexity is NP-complete. For the former, the complexity depends on the number of hidden neurons, and for the latter, on the number of input dimensions.

7 Evaluating the Robustness of ML Models

To reduce the vulnerability of ML classifiers to adversaries, research efforts are made on systematically studying and evaluating the robustness of ML models, as well as on providing frameworks for benchmarking the robustness of ML models.

Robustness Metrics

Lack of robustness in neural networks raises valid concerns about the safety of systems relying on these networks, especially in safety-critical domains such as transportation, robotics, medicine, or warfare. A typical approach to improving the robustness of a neural network is to identify adversarial examples that make the network fail, augment the training dataset with these examples, and train another neural network. The robustness of the new network is then the ratio between the number of adversarial examples that failed the original network and the number that can be found for the new network (Goodfellow, Shlens, and Szegedy 2015). The limitation of this approach is the lack of an objective robustness measure (Bastani et al. 2016). Therefore, a metric for measuring the robustness of DNNs using linear programming was proposed (Bastani et al. 2016). Other approaches include defining an upper bound on the robustness of classifiers to adversarial perturbations (Fawzi, Fawzi, and Frossard 2018). The upper bound is found to depend on a distinguishability measure between the classes, and can be established independently of the learning algorithm. In their work, Fawzi et al. report two findings: first, non-linear classifiers are more robust to adversarial perturbations than linear classifiers, and second, the depth (rather than the breadth) of a neural network plays a key role for adversarial robustness.
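As a rough illustration of what an empirical robustness measurement can look like, the sketch below reports the fraction of test points whose prediction stays unchanged under bounded random perturbations; this proxy is our own simplification and is much weaker than the adversarially driven or LP-based measures discussed above.

# Empirical robustness proxy: fraction of test inputs whose predicted label is
# unchanged for all sampled perturbations within an L-infinity ball of radius eps.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True), random_state=0)
model = SVC(gamma=0.001).fit(X_train, y_train)

rng = np.random.default_rng(0)
eps, samples = 1.0, 20
stable = 0
for x in X_test[:200]:
    base = model.predict([x])[0]
    noise = rng.uniform(-eps, eps, size=(samples, x.size))
    perturbed_preds = model.predict(x + noise)
    stable += int(np.all(perturbed_preds == base))

print(f"prediction stability under eps={eps}: {stable / 200:.2%}")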
Benchmarks for Robustness Evaluation

It can be difficult to reproduce some of the methods developed for improving the robustness of neural networks, or to compare experimental results, because different sources of adversarial examples in the training process can make adversarial training more or less effective (Goodfellow, Papernot, and McDaniel 2016). To alleviate this challenge, Cleverhans (Goodfellow, Papernot, and McDaniel 2016) and Foolbox (Rauber, Brendel, and Bethge 2017) are adversarial example libraries for developing and benchmarking adversarial attacks and defenses, so that different benchmarks can be compared. The limitation of both of these frameworks is that they lack defensive adversarial generation strategies (Yuan et al. 2019). The Robust Vision Benchmark (http://robust.vision/benchmarks/leaderboard) extends the idea of Foolbox by allowing the development of novel attacks, which are used to further strengthen robustness measurements of ML models. Other initiatives include a competition organized at the NIPS 2017 conference by Google Brain, where researchers were encouraged to develop new methods for generating adversarial examples and new methods for defending against them (Kurakin et al. 2018).

Formal Guarantees over Robustness

For safety-critical domains which need to comply with safety regulation and certification, it is of critical importance to provide formal guarantees of the performance of ML under adversarial input perturbations. Providing such guarantees is a real challenge for most defense approaches, including the approaches discussed above. Existing attempts in this direction include (Hein and Andriushchenko 2017), using regularization in training, and (Sinha, Namkoong, and Duchi 2018), updating the training objective to satisfy robustness constraints. While these initial approaches are interesting, they can provably achieve only moderate levels of robustness, i.e. they provide approximate guarantees. As such, further research advances on providing robustness guarantees for ML models are needed.

8 Verifying Ethical Machine Reasoning

ML systems can be deployed in environments where their actions have ethical implications, for example self-driving cars, and as a consequence they need to have the capabilities to reason about such implications (Deng 2015), even more so if such systems are to become widely socially accepted technologies. While multiple approaches have been proposed for building ethics into ML, the real research challenge lies in building solutions for verifying such machine ethics. This research area has remained largely unaddressed. Existing efforts are limited and include a theoretical framework for ethical decision-making of autonomous systems that can be formally verified (Dennis et al. 2016). The framework assumes that system control is separated from higher-order decision-making, and uses model checking to verify the rational agent (model checking is the most widely used approach to verifying ethical machine reasoning). However, as a limitation, the proposed approach requires ethics plans that have been correctly annotated with ethical consequences, which cannot be guaranteed. Second, the agent verification is demonstrated to be very slow. For situations where no ethical decision exists, the framework continues ethical reasoning, negatively affecting overall performance. Third, the approach scales poorly in the number of sensors and sensor values, due to non-deterministic modelling of sensor inputs. Furthermore, the approach cannot provide any guarantees that a rational agent will always operate within certain bounds regardless of the ethics plan.

Regarding the certification of autonomous reasoning, a proof-of-concept approach (Webster et al. 2014) was developed for the generation of certification evidence for autonomous aircraft using formal verification and flight simulation. However, the approach relies on a set of assumptions, such as that the requirements of a system are known, or that they have been accurately translated into a formal specification language, which may not always hold. Finally, ethical machine reasoning should be transparent to allow for checking of the underlying reasoning. These findings emphasize the need for further progress in verifying and certifying ethical machine reasoning.
9 Summary and Future Directions

Software testing of ML faces a range of open research challenges, and further research work focused on addressing these challenges is needed. We envision such further work developing in the following directions.

Automated test oracles. Test oracles are often missing in testing ML systems, which makes checking the correctness of their output highly challenging. Metamorphic testing can help address this challenge, and further work is needed on using ML to automate the creation of metamorphic relationships.

Coverage metrics for ML models. Existing coverage metrics are inadequate in some contexts. Structural coverage criteria can be misleading, i.e. too coarse for adversarial inputs and too fine for misclassified natural inputs (Li et al. 2019). High neuron coverage does not mean invulnerability to adversarial examples (Sun et al. 2019). In addition, neuron coverage can lead to input space explosion. Adaptation of combinatorial testing techniques is a promising approach to this challenge, given that progress is made on improving its scalability for real-world ML models.

Quality of test datasets for ML models. Evaluation of the quality of datasets for ML models is in its early stages. Adaptation of mutation testing can alleviate this challenge. Common mutation operators are insufficient for mutation testing of DNNs; instead, domain-specific operators are required.

Cost-effectiveness of adversarial examples. Generation strategies for adversarial examples need further advancing to reduce computational complexity and improve effectiveness for different classifiers.

Cost-effectiveness of adversarial countermeasures. Current techniques are mainly vulnerable to advanced attacks. Verification approaches for DNNs to counter adversarial examples are computationally complex (especially constraint-based approaches) and do not scale to real DNNs. More cost-effective verification approaches are required.

Robustness evaluation of ML models. Metrics for robustness evaluation of ML models and effectiveness evaluation of adversarial attacks need further advancing. Open benchmarks for developing and evaluating new adversarial attacks and defense mechanisms can be useful tools to achieve improved robustness of defenses. Further efforts on understanding the existence of adversarial examples are desired (Yuan et al. 2019).

Certified guarantees over robustness of ML models. Such guarantees are required for the deployment of ML in safety-critical domains. Current approaches provide only approximate guarantees. Also, further research progress is needed to overcome the high computational complexity of producing the guarantees.

Verification of machine ethics. Formal verification and certification of ethical machine reasoning is uniquely challenging. Further efforts are needed to enable the scalability of these approaches for real systems operating in real time, and to reach lower computational complexity. In addition, verification approaches may leverage different formal methods, which underlines the open challenge of interoperability between different methods. Finally, research advances on enabling the transparency of the ethical decision-making process are required.

In conclusion, with this paper we hope to provide researchers with useful insights into the unaddressed challenges of testing ML, along with an agenda for advancing the state-of-the-art in this research area.

10 Acknowledgments

This work is supported by the Research Council of Norway through the project T3AS No 287329.

References

Amershi, S.; Begel, A.; Bird, C.; et al. 2019. Software engineering for machine learning: A case study. In Int. Conf. on Soft. Eng.: SEIP, 291-300. IEEE Press.
Bastani, O.; Ioannou, Y.; Lampropoulos, L.; Vytiniotis, D.; Nori, A.; and Criminisi, A. 2016. Measuring neural net robustness with constraints. In Int. Conf. on Neural Inf. Processing Systems, 2621-2629.
Carlini, N., and Wagner, D. 2017. Towards evaluating the robustness of neural networks. IEEE Symp. on Security and Privacy, 39-57.
Carlini, N., and Wagner, D. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In IEEE Security and Privacy Worksh., 1-7.
Carlini, N.; Mishra, P.; Vaidya, T.; Zhang, Y.; Sherr, M.; Shields, C.; Wagner, D.; and Zhou, W. 2016. Hidden voice commands. In USENIX Conf. on Security Symp., 513-530.
Deng, B. 2015. The robot's dilemma. Nature 523(7558).
Dennis, L.; Fisher, M.; Slavkovik, M.; and Webster, M. 2016. Formal verification of ethical choices in autonomous systems. Robotics and Autonomous Systems 77:1-14.
Du, X.; Xie, X.; Li, Y.; Ma, L.; Zhao, J.; and Liu, Y. 2018. DeepCruiser: Automated guided testing for stateful deep learning systems. CoRR abs/1812.05339.
Dwarakanath, A.; Ahuja, M.; Sikand, S.; et al. 2018. Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In Int. Symp. on Soft. Test. and Anal., 118-128.
Fawzi, A.; Fawzi, O.; and Frossard, P. 2018. Analysis of classifiers' robustness to adversarial perturbations. Machine Learning 107(3):481-508.
Goodfellow, I.; Papernot, N.; and McDaniel, P. 2016. cleverhans v0.1: An adversarial machine learning library. CoRR abs/1610.00768.
Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. Int. Conf. on Learn. Represen., abs/1412.6572.
Gotlieb, A., and Marijan, D. 2014. Flower: Optimal test suite reduction as a network maximum flow. In Int. Symp. on Soft. Test. and Anal., 171-180.
Hein, M., and Andriushchenko, M. 2017. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Annual Conf. on Neural Inf. Processing Systems.
Helle, P., and Schamai, W. 2016. Testing of autonomous systems: Challenges and current state-of-the-art. In INCOSE Int. Symposium (IS 2016).
Huang, X.; Kwiatkowska, M.; Wang, S.; and Wu, M. 2017. Safety verification of deep neural networks. In Computer Aided Verification, 3-29.
Hutchison, C.; Zizyte, M.; Lanigan, P. E.; Guttendorf, D.; Wagner, M.; Le Goues, C.; and Koopman, P. 2018. Robustness testing of autonomy software. In IEEE/ACM Int. Conf. on Soft. Eng., 276-285.
Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. In Conf. on Emp. Methods in Natural Lang. Process.
Katz, G.; Barrett, C.; Dill, D.; Julian, K.; and Kochenderfer, M. 2017. Reluplex: An efficient SMT solver for verifying deep neural networks. In Computer Aided Verif., 97-117.
Knight, J. C., and Leveson, N. G. 1986. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Trans. on Soft. Eng. (1):96-109.
Kurakin, A.; Goodfellow, I.; Bengio, S.; et al. 2018. Adversarial attacks and defences competition. In The NIPS '17 Competition: Building Intelligent Systems, 195-231.
Li, Z.; Ma, X.; Xu, C.; and Cao, C. 2019. Structural coverage criteria for neural networks could be misleading. In Int. Conf. on Soft. Eng.: NIER, 89-92.
Ma, L.; Juefei-Xu, F.; Zhang, F.; Sun, J.; et al. 2018a. DeepGauge: Multi-granularity testing criteria for deep learning systems. In Int. Conf. on Aut. Soft. Eng., 120-131.
Ma, L.; Zhang, F.; Sun, J.; Xue, M.; et al. 2018b. DeepMutation: Mutation testing of deep learning systems. IEEE Int. Symp. on Soft. Reliab. Eng., 100-111.
Ma, L.; Zhang, F.; Xue, M.; Li, B.; Liu, Y.; Zhao, J.; and Wang, Y. 2018c. Combinatorial testing for deep learning systems. CoRR abs/1806.07723.
Marijan, D.; Gotlieb, A.; and Liaaen, M. 2019. A learning algorithm for optimizing continuous integration development and testing practice. Soft. Pract. and Exper. 49:192-213.
Moosavi-Dezfooli, S.; Fawzi, A.; Fawzi, O.; and Frossard, P. 2017. Universal adversarial perturbations. Conf. on Comp. Vis. and Pattern Recog., 86-94.
Moosavi-Dezfooli, S.; Fawzi, A.; and Frossard, P. 2016. DeepFool: A simple and accurate method to fool deep neural networks. IEEE Conf. on Computer Vision and Pattern Recognition.
Murphy, C.; Kaiser, G. E.; and Arias, M. 2007. An approach to software testing of machine learning applications. In Soft. Eng. and Knowledge Eng.
Odena, A., and Goodfellow, I. 2018. TensorFuzz: Debugging neural networks with coverage-guided fuzzing. CoRR abs/1807.10875.
Papernot, N.; McDaniel, P.; Wu, X.; Jha, S.; and Swami, A. 2016. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symp. on Sec. and Privacy, 582-597.
Pei, K.; Cao, Y.; Yang, J.; and Jana, S. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In Symp. on Oper. Syst. Princip., 1-18.
Rauber, J.; Brendel, W.; and Bethge, M. 2017. Foolbox v0.8.0: A Python toolbox to benchmark the robustness of machine learning models. CoRR abs/1707.04131.
Ruan, W.; Huang, X.; and Kwiatkowska, M. 2018. Reachability analysis of deep neural networks with provable guarantees. In Int. J. Conf. on Artif. Intel., 2651-2659.
Sato, M.; Suzuki, J.; Shindo, H.; and Matsumoto, Y. 2018. Interpretable adversarial perturbation in input embedding space for text. In Int. J. Conf. on Artif. Intel.
Shen, W.; Wan, J.; and Chen, Z. 2018. MuNN: Mutation analysis of neural networks. Int. Conf. on Soft. Quality, Reliab. and Secur. Comp., 108-115.
Shi, Q.; Wan, J.; Feng, Y.; Fang, C.; and Chen, Z. 2019. DeepGini: Prioritizing massive tests to reduce labeling cost. CoRR abs/1903.00661.
Sinha, A.; Namkoong, H.; and Duchi, J. 2018. Certifiable distributional robustness with principled adversarial training. Int. Conf. on Learning Representations, abs/1710.10571.
Song, Y.; Shu, R.; Kushman, N.; and Ermon, S. 2018. Constructing unrestricted adversarial examples with generative models. In Adv. in Neural Inf. Proc. Sys.
Sun, Y.; Wu, M.; Ruan, W.; Huang, X.; Kwiatkowska, M.; and Kroening, D. 2018. Concolic testing for deep neural networks. In Int. Conf. on Autom. Soft. Eng., 109-119.
Sun, Y.; Huang, X.; Kroening, D.; Sharp, J.; Hill, M.; and Ashmore, R. 2019. Structural test coverage criteria for deep neural networks. In Int. Conf. on Soft. Eng.
Sun, Y.; Huang, X.; and Kroening, D. 2018. Testing deep neural networks. CoRR abs/1803.04792.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; et al. 2014. Intriguing properties of neural networks. In Int. Conf. on Learning Representations, abs/1312.6199.
Tian, Y.; Pei, K.; Jana, S.; and Ray, B. 2018. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In Int. Conf. on Soft. Eng., 303-314.
Tramèr, F.; Kurakin, A.; Papernot, N.; Boneh, D.; and McDaniel, P. 2017. Ensemble adversarial training: Attacks and defenses. In Int. Conf. on Learning Representations.
Webster, M.; Cameron, N.; Fisher, M.; and Jump, M. 2014. Generating certification evidence for autonomous unmanned aircraft using model checking and simulation. J. Aerospace Inf. Sys. 11:258-279.
Weyuker, E. J. 1982. On testing non-testable programs. Comput. J. 25:465-470.
Wicker, M.; Huang, X.; and Kwiatkowska, M. 2018. Feature-guided black-box safety testing of deep neural networks. In Tools and Alg. for the Construction and Anal. of Systems, 408-426.
Xiao, C.; Li, B.; Zhu, J.; He, W.; Liu, M.; and Song, D. 2018. Generating adversarial examples with adversarial networks. In Int. Joint Conf. on Artif. Intel., 3905-3911.
Xie, C.; Wang, J.; Zhang, Z.; et al. 2017. Adversarial examples for semantic segmentation and object detection. In IEEE Int. Conf. on Computer Vision, 1378-1387.
Xie, X.; Ma, L.; Juefei-Xu, F.; Chen, H.; et al. 2018. Coverage-guided fuzzing for deep neural networks. CoRR abs/1809.01266.
Xie, X.; Ho, J.; et al. 2011. Testing and validating machine learning classifiers by metamorphic testing. J. of Sys. and Soft. (84):544-558.
Yuan, X.; He, P.; Zhu, Q.; and Li, X. 2019. Adversarial examples: Attacks and defenses for deep learning. Tr. on Neural Net. and Learn. Syst., 1-20.
Zhang, M.; Zhang, Y.; Zhang, L.; et al. 2018. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Int. Conf. on Aut. Soft. Eng., 132-142.