Analysis of Skin Lesion Images With Deep Learning

Josef Steppan, Sten Hanke

J. Steppan was with the Department of eHealth at FH Joanneum University of Applied Sciences, Alte Poststrasse 149, 8020 Graz, Austria. E-mail: josef.steppan@edu.fh-joanneum.at
S. Hanke is with the Department of eHealth at FH Joanneum University of Applied Sciences, Alte Poststrasse 149, 8020 Graz, Austria. E-mail: sten.hanke@fh-joanneum.at

Abstract—Skin cancer is the most common cancer worldwide, with melanoma being the deadliest form. Dermoscopy is a skin imaging modality that has shown an improvement in the diagnosis of skin cancer compared to unaided visual examination. We evaluate the current state of the art in the classification of dermoscopic images based on the ISIC-2019 Challenge for the classification of skin lesions and the current literature. Various deep neural network architectures pre-trained on the ImageNet data set are adapted to a combined training data set comprised of publicly available dermoscopic and clinical images of skin lesions using transfer learning and model fine-tuning. The performance and applicability of these models for the detection of eight classes of skin lesions are examined. Real-time data augmentation, which uses random rotation, translation, shear, and zoom within specified bounds, is used to increase the number of available training samples. Model predictions are multiplied by inverse class frequencies and normalized to better approximate actual probability distributions. Overall prediction accuracy is further increased by using the arithmetic mean of the predictions of several independently trained models. The best single model has been published as a web service. The source code is publicly available at http://github.com/j05t/lesion-analysis

Index Terms—Lesion, Skin, Melanoma, Deep Learning

I. INTRODUCTION

Skin cancer is the most common cancer worldwide, with melanoma being the deadliest form. A later stage at diagnosis of melanoma is strongly associated with melanoma mortality within 5 years of diagnosis [1]. Early detection of melanoma can significantly reduce both morbidity and mortality [2]. The risk of dying from the disease is directly related to the depth of the cancer, which is in turn directly related to the time it has been growing. Self-examination of the skin by patients, full-body skin examinations by a doctor, and patient education are the keys to early detection. Self-examiners are generally diagnosed with thinner melanomas than non-self-examiners (0.77 mm versus 0.95 mm) [3].

This paper evaluates the current state of the art in the classification of dermoscopic images based on the ISIC-2019 Challenge for the classification of skin lesions and the current literature. Since medical image data sets often show a class imbalance, several approaches for training deep neural networks on imbalanced data sets have been reviewed. Because the training of deep neural networks requires a large amount of training data, further publicly available dermoscopic as well as clinical image data sets of skin lesions have been evaluated for expanding the ISIC-2019 training data set. Since the heterogeneity of the image data of the ISIC data set requires preprocessing, a suitable approach to preprocessing, as well as the effects of preprocessing on the achieved accuracy of the trained networks, has been investigated. Furthermore, the potential of real-time data augmentation to increase the number of available training patterns during training and to improve the prediction accuracy at inference time has been investigated. Current ensembling strategies and an overview of current architectures of deep neural networks for the classification of image content have also been reviewed.

II. IMAGE CLASSIFICATION

Convolutional Neural Networks (CNNs) [4] are currently the state of the art in image classification and have been exceeding the recognition rate of human experts in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC, http://image-net.org/challenges/LSVRC/2017) [5] since 2015 [6]. The ILSVRC evaluates algorithms for object recognition and image classification on a large scale. An important motivation is to enable researchers to compare progress in recognition for a wider variety of objects. Another motivation is to measure the progress of computer vision algorithms for classifying images on a large scale. The ImageNet training data set contains 1,000 categories and 1.2 million images. Image classification algorithms are compared using a test data set of 150,000 images in 1,000 categories. The highest accuracy rates are currently achieved by the architectures SENet-154 [7] (81.3% top-1 accuracy), PNASNet-5 Large [8] (82.9%), AmoebaNet-C [9], [10] (83.9%), and EfficientNet-B7 [11] (84.4%) [12]. Algorithms for classifying image content are constantly being improved. Deep learning has shown enormous potential in this area due to the constantly increasing amounts of data [13], [14]. Some deep learning approaches outperform teams of certified dermatologists in the detection of melanoma in dermoscopic images [15], [16], [17] or achieve equivalent detection rates [18], [19].

III. SKIN LESION DATASETS

A. ISIC-2019

To make specialist knowledge more widely available, the International Skin Imaging Collaboration developed the ISIC archive, an international repository for dermoscopic images, both for clinical training purposes and to support technical research on automated algorithmic analysis by hosting the ISIC Challenges. The training data set of the ISIC-2019 Challenge consists of several dermoscopic image databases: BCN 20000 [20], with dermoscopic images of the most common classes of skin lesions: actinic keratosis, squamous cell carcinoma, basal cell carcinoma, seborrheic keratosis, solar lentigo, and dermatological lesions.
The HAM10000 dataset [21], with 600x450 pixel images centered and cropped on the lesions. The MSK data set [22], with images of different resolutions. A total of 25,331 images are available for training in 8 different categories. The test data set consists of 8,238 images whose labels are not publicly available. In addition, the test data set contains an outlier class that is not contained in the training data and must be identified by the developed systems. Predictions on the ISIC-2019 test data set are assessed by an automatic evaluation system. The goal of the ISIC-2019 Challenge is to classify dermoscopic images among nine different diagnostic categories:

1) Melanoma (MEL)
2) Melanocytic nevus (NV)
3) Basal cell carcinoma (BCC)
4) Actinic keratosis (AK)
5) Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis) (BKL)
6) Dermatofibroma (DF)
7) Vascular lesion (VASC)
8) Squamous cell carcinoma (SCC)
9) None of the others (UNK)

C. Light Field Image Dataset of Skin Lesions

The Light Field Image Dataset of Skin Lesions (SKINL2) [24] is intended to support the development of medical imaging research and the development of new classification algorithms based on light fields, as well as clinically oriented dermatological studies; however, only the dermoscopic images contained in the data set are taken into account for this work.

D. SD-198

In contrast to dermoscopic images with largely constant lighting and low image disturbances, clinical images are often created with a large number of different image recording devices, such as digital cameras or smartphones. The SD-198 data set [26] contains 6,584 clinical images from 198 classes, which vary according to scale, color, shape, and structure. The SD-198 benchmark data set is intended to stimulate further research into the visual classification of skin diseases. The authors also carry out an extensive analysis of this data set using modern methods, including CNNs. The ground truth labels of the images were created via DermQuest, with each image being examined by qualified experts and labeled with the name of its class. To ensure the quality of the labels, two experts were also invited to check the data set.
Fig. 1. Combined training data set from the data sets ISIC-2019, PH2, Light Field Image Dataset of Skin Lesions, SD-198, the 7-point criteria evaluation database, and MED-NODE. The "UNK" category is mainly formed from data from the SD-198 dataset. The combined data set is divided into a training set (90%) and a validation set (10%), so 29,469 images are available for training and 3,279 images for assessing the generalizability of the predictions and for adapting hyperparameters on the validation data set. The ISIC-2019 test data set consists of 8,238 images whose labels are not publicly available. The test data set is not used for training or parameter adjustment.
IV. COMBINED TRAINING DATASET

A combined training data set has been created from all the data sets described in Section III; in total, 32,748 images are available for training. Images from SD-198 were used exclusively for the creation of training data for the "UNK" class, after prior removal of image data belonging to the eight categories of the ISIC-2019 training data set. The combined data set is still heavily imbalanced (Figure 1).
V. METHODOLOGY

A. Preprocessing

Training and test data of the ISIC-2019 dataset have been preprocessed to remove the black areas surrounding the dermoscopic images and subsequently rescaled, maintaining the aspect ratio (Figure 2). Descriptive text appended to images in the SD-198 dataset has been removed.

Fig. 2. Preprocessing of the ISIC 2019 dataset. Black image borders are detected and removed. The top row shows images of the original training data set; preprocessed images are shown below.
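Border removal can be implemented with standard image-processing tools. The following is a minimal sketch of such a preprocessing step, not the paper's actual code: it assumes a plain intensity threshold is enough to separate the dermoscopic field of view from the surrounding black border, and the threshold, target size, and file names are illustrative.

```python
# Minimal preprocessing sketch (assumed approach, not the paper's code):
# crop away near-black borders, then rescale while keeping the aspect ratio.
import cv2
import numpy as np

def remove_black_borders(img: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Crop an image to the bounding box of all pixels brighter than `thresh`."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    mask = (gray > thresh).astype(np.uint8)        # 1 where the image is not black
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))
    return img[y:y + h, x:x + w]

def rescale_keep_aspect(img: np.ndarray, target: int = 600) -> np.ndarray:
    """Resize so the longer side equals `target`, preserving the aspect ratio."""
    scale = target / max(img.shape[:2])
    return cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)

img = cv2.imread('ISIC_0000000.jpg')               # illustrative file name
cv2.imwrite('ISIC_0000000_clean.jpg', rescale_keep_aspect(remove_black_borders(img)))
```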
B. Data Augmentation

To avoid overfitting [31] in neural networks, dropout [32] is often used. Another simple method for regularizing CNNs, which also expands the number of distinct training samples, is data augmentation: during training, the input data is changed randomly according to certain criteria (translation, rotation, scaling, etc.). Additionally, Cutout [33] has been used for regularization. Figure 3 shows the applied augmentations.

Fig. 3. Applied augmentations for a single training image. Random rotation, translation in the x and y directions, as well as scaling within defined limits, avoid overfitting on the training data and enable better generalization of the model. The augmentation parameters used are: max_rotate=45, p_affine=0.5, do_flip=True, flip_vert=True, max_zoom=1.05, max_lighting=0.2, crop_pad(input_size), cutout(n_holes=(1,1), length=(16,16), p=.5).
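The parameter names in the Fig. 3 caption match the fastai v1 transform API; the sketch below wires them up under that assumption. The data path, input size, and batch size are placeholders, not values from the paper.

```python
# Assumes fastai v1 (get_transforms/cutout); transforms are applied on the fly
# ("real-time") to each mini-batch during training.
from fastai.vision import ImageDataBunch, get_transforms, cutout, imagenet_stats

tfms = get_transforms(do_flip=True, flip_vert=True, max_rotate=45,
                      max_zoom=1.05, max_lighting=0.2, p_affine=0.5,
                      xtra_tfms=[cutout(n_holes=(1, 1), length=(16, 16), p=0.5)])

# Placeholder folder and sizes; a 90%/10% train/validation split as in Fig. 1.
data = ImageDataBunch.from_folder('data/combined', valid_pct=0.1,
                                  ds_tfms=tfms, size=456, bs=16)
data.normalize(imagenet_stats)
```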
C. Out of Distribution Detection

Neural networks offer little or no guarantee of reliable prediction when applied to data that was not generated by the same process that was used to create the network's training data. For such Out-of-Distribution (OOD) inputs, the prediction may not only be incorrect but also associated with high network confidence [34], [35], which restricts the reliability of deep learning classifiers in real-world applications. Often the predictions of (ensembles of) classifiers that have been trained on in-distribution data are examined for the presence of OOD inputs using statistical methods [36], [37]. Alternatively, the input distribution can be modeled directly by using generative models that do not require the presence of class labels. However, it has been shown that this method can also output higher probabilities on OOD inputs than on in-distribution inputs [38]. In the ISIC-2019 Challenge, classes that are not included in the training data set should be detected as OOD and recognized as the class "UNK". In this work, a data-driven approach to the recognition of OOD inputs is pursued by using images that are not labeled as one of the classes of the ISIC-2019 training data set (mostly from SD-198, see Subsection III-D) as training data for the "UNK" class. However, this approach is far from optimal, and OOD detection in deep learning classifiers remains an unsolved problem. Further work is needed to improve classifier performance regarding OOD detection.
D. Dataset Imbalance

A common problem with deep learning-based applications is the fact that some classes have a significantly higher number of samples in the training set than other classes. This difference is known as class imbalance. There are many examples in areas such as computer vision [39], [40], [41], [42], [43], medical diagnosis [44], [45], fraud detection [46], and others [47], [48], [49] where this problem is highly significant and the incidence of one class (e.g. cancer) can be 1,000 times less than that of another class (e.g. healthy patient) [50]. It has been shown that a class imbalance in training data sets can have a significant adverse effect on the training of traditional classifiers [51], including classic neural networks or multilayer perceptrons [52]. Class imbalance influences both the convergence of neural networks during the training phase and the generalization of a model to real or test data [50].
1) Undersampling / Oversampling: Undersampling and oversampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the relationship between the different classes/categories represented). These terms are used in statistical sampling, survey design methodology, and machine learning. The goal of undersampling and oversampling is to create a balanced data set. Many machine learning techniques, such as neural networks, make more reliable predictions when trained on balanced data. Oversampling is generally used more often than undersampling; the reasons for using undersampling are mainly practical and often resource-dependent. With random oversampling, the training data is supplemented by multiple copies of samples from minority classes. This is one of the earliest proposed methods and has also proven robust [53]. Instead of duplicating every minority class sample, some of them may be chosen at random, with replacement. Other methods of handling unbalanced data sets, such as synthetic oversampling [54], are more suitable for traditional machine learning tasks [55] and were therefore not considered further in this work.
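As an illustration of random oversampling (a generic sketch, not code from the paper), minority-class indices can be re-drawn with replacement until every class matches the majority-class count:

```python
# Random oversampling sketch: build a balanced index list by sampling each
# class with replacement up to the size of the largest class.
import numpy as np

rng = np.random.default_rng(42)

def oversample_indices(labels: np.ndarray) -> np.ndarray:
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    balanced = [rng.choice(np.flatnonzero(labels == c), size=target, replace=True)
                for c in classes]
    return rng.permutation(np.concatenate(balanced))

labels = np.array([0, 0, 0, 0, 0, 1, 1, 2])   # toy imbalanced label vector
print(oversample_indices(labels))             # 15 indices, 5 per class
```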
2) Weighted Cross-Entropy Loss: Weighted cross-entropy [56] is useful for training neural networks on unbalanced data sets. The authors of [57] suggest adding a margin-based loss term to the cross-entropy on in-distribution training patterns in order to ensure a minimum difference in average entropy between in-distribution and out-of-distribution data; this ensemble-based method is intended to surpass previous methods of recognizing out-of-distribution inputs such as ODIN [58]. Cross-entropy can be described as

L(x, y) = -\log\left(\frac{\exp(x[y])}{\sum_j \exp(x[j])}\right) = -x[y] + \log\left(\sum_j \exp(x[j])\right)

or, by using class weights:

L(x, y) = W[y]\left(-x[y] + \log\left(\sum_j \exp(x[j])\right)\right)

The arithmetic mean of the loss values is calculated for each mini-batch. A weight vector can be calculated from the effective number of samples [59], given by the simple formula (1 - \beta^n)/(1 - \beta) for a class with n samples; class weights are proportional to the inverse of this effective number. The hyperparameter \beta was set to 0.999 (a choice of \beta equal to zero would not apply any weighting, while a choice of \beta approaching 1 would correspond to weighting by the inverse class frequency). In the simplest case, loss values can be weighted by multiplying them by inverse class frequencies.
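A sketch of this weighting scheme follows. It computes the effective number of samples per class as in [59] and uses its inverse as the class weight; normalizing the weights to an average of one is a common convention assumed here, not something the paper specifies. The class counts are toy values.

```python
# Effective-number class weights (after [59]): weight ~ (1 - beta) / (1 - beta^n).
import numpy as np
import torch
import torch.nn as nn

def effective_number_weights(class_counts, beta: float = 0.999) -> torch.Tensor:
    n = np.asarray(class_counts, dtype=np.float64)
    effective_num = (1.0 - beta ** n) / (1.0 - beta)  # grows from 1 towards n
    weights = 1.0 / effective_num                     # rare classes get large weights
    weights = weights / weights.sum() * len(n)        # normalize to mean 1 (assumed)
    return torch.tensor(weights, dtype=torch.float32)

counts = [5000, 14000, 3400, 900, 2700, 300, 280, 600, 6000]  # toy class counts
criterion = nn.CrossEntropyLoss(weight=effective_number_weights(counts))
```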
3) Thresholding: Also referred to as threshold shifting or rescaling, thresholding adapts the decision threshold of a classifier. This method is used at inference time and involves changing the output class probabilities. There are several ways in which the network outputs can be rescaled; in general, an optimization algorithm can be used to configure the network to minimize an arbitrary criterion [60]. The simplest method only compensates for the a priori class probabilities [61]. It has been shown that neural networks estimate Bayesian a posteriori probabilities [61]; that is, for a given data point x, the output for class c is implicitly

y_c(x) = p(c|x) = \frac{p(c)\,p(x|c)}{p(x)}

The actual probabilities of class membership can therefore be calculated by dividing the output of the network by the estimated a priori probability p(c) = |c| / \sum_k |k|, where |c| is the number of samples of class c [50]. The resulting class probabilities are normalized after thresholding is applied. This simple method of handling an existing class imbalance can significantly improve the approximation of the class probability distribution made by classifiers.
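The prior correction can be expressed in a few lines; the following sketch divides softmax outputs by the estimated priors and renormalizes each row (toy values, not results from the paper):

```python
# Thresholding sketch: divide predicted probabilities by the a priori class
# probabilities p(c) = |c| / sum_k |k|, then renormalize each row.
import numpy as np

def rescale_by_priors(probs: np.ndarray, class_counts: np.ndarray) -> np.ndarray:
    priors = class_counts / class_counts.sum()
    rescaled = probs / priors                              # multiply by inverse frequency
    return rescaled / rescaled.sum(axis=1, keepdims=True)  # renormalize

probs = np.array([[0.70, 0.20, 0.10]])   # toy softmax output
counts = np.array([900, 90, 10])         # toy training-set class counts
print(rescale_by_priors(probs, counts))  # minority classes are boosted
```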
E. Transfer Learning

Transfer learning in the context of machine learning is a technique that uses information obtained from solving one problem and applies it to a similar problem. When using transfer learning, a model that has already been trained on another data set is adapted to custom data. Ideally, the pre-trained model has been trained on similar data, but this is not strictly necessary. The final layers of the network are removed and replaced by output layers of appropriate dimensions, and the model is then trained on the custom data. By using transfer learning, the time required for training a network can be greatly reduced [62], [63], [64]. The existing pre-trained model thus serves as a feature extractor, which forwards features such as edges, texture, and positions of recognized objects to the last layer for classification. A softmax function (normalized exponential function) transforms the network output into a vector of numbers between zero and one that sum to one, which allows interpreting the output of the network as a probability distribution.
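A sketch of this head replacement is shown below, using a torchvision ResNet-50 as a stand-in for the architectures evaluated in this paper; the attribute name `fc` is specific to this stand-in model.

```python
# Transfer-learning sketch: reuse ImageNet weights, swap the classification head.
import torch.nn as nn
from torchvision import models

num_classes = 9                              # 8 lesion classes + "UNK"
model = models.resnet50(pretrained=True)     # pre-trained feature extractor

# Replace the final fully connected layer with an appropriately sized output layer.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optionally freeze the backbone for the initial epochs, then unfreeze and
# fine-tune with differential learning rates.
for name, param in model.named_parameters():
    if not name.startswith('fc.'):
        param.requires_grad = False
```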
F. Test Time Augmentation

Data augmentation is a technique widely used to improve neural network training performance and reduce generalization error. The same image data augmentation technique can also be used at inference time to allow the model to make predictions for several different versions of each image in the test data. Test Time Augmentation (TTA) predictions are formed by blending the regular predictions (with a weighting of beta = 0.4) with the average of the predictions obtained on augmented versions of the image data (with a weighting of 1 - beta). The transformations specified for the training set are applied with the following changes: scaling with a factor of 1.05 controls the zoom (which is not random for TTA); cropping is not random, to ensure that the four corners of the picture are used; and reflection is not random but is applied once.
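The blending rule reduces to a weighted mean; in the sketch below, `predict` and `augment` are placeholder callables and the number of augmented copies is an assumption:

```python
# TTA sketch: blend the regular prediction (weight beta) with the mean
# prediction over augmented copies of the image (weight 1 - beta).
import numpy as np

def tta_predict(predict, augment, image, n_aug: int = 8, beta: float = 0.4):
    base = predict(image)                      # prediction on the original image
    aug = np.mean([predict(augment(image)) for _ in range(n_aug)], axis=0)
    return beta * base + (1.0 - beta) * aug
```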
G. Ensembling

Ensembling is the use of several independently trained models to form an overall prediction. The basic idea of ensembling is that individual models have weaknesses in different areas, which are compensated by combination with the predictions of other independently trained models. Possible ensembling strategies are, e.g., majority voting, the use of a weighted average based on classifier confidences, or simply the arithmetic mean of several predictions of different models and model architectures [65].
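The arithmetic-mean strategy used here is a one-liner over stacked model outputs; the arrays below are toy class probabilities for illustration:

```python
# Ensembling sketch: average per-class probabilities over independently
# trained models.
import numpy as np

def ensemble_mean(per_model_probs):
    """per_model_probs: list of (n_samples, n_classes) probability arrays."""
    return np.mean(np.stack(per_model_probs), axis=0)

a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])   # model 1
b = np.array([[0.5, 0.4, 0.1], [0.1, 0.7, 0.2]])   # model 2
c = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])   # model 3
print(ensemble_mean([a, b, c]))
```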
VI. EXPERIMENTS

The CNN architectures Inception-ResNet-v2 [66], SE-ResNeXt-101 (32x4d) [7], NASNet-A-Large [8], EfficientNet-B4, and EfficientNet-B5 [11], pre-trained on the ImageNet data set, were adapted to the task of classifying the nine classes of the ISIC-2019 Challenge by replacing the final layers with a custom linear layer that outputs nine class probabilities. Real-time data augmentation has been used to improve the generalizability of the resulting models. Models have been trained on an NVIDIA GTX 1070 GPU. Batch sizes (the number of training samples used for a single forward pass) were adapted to the individual architectures and input sizes to achieve optimal utilization of the available video memory. Images have been resized to fit the model input sizes prior to training.

Models have been trained via transfer learning over 32 epochs, followed by model fine-tuning with differential learning rates until convergence using the One Cycle Policy [67], allowing very rapid convergence of the trained networks [68]. Appropriate learning rates were determined manually at regular intervals. The use of a weighted loss function has, contrary to expectations, only proven advantageous for training the NASNet-A-Large architecture, which was unable to converge without a weighted loss; the other architectures did not benefit from training with a weighted loss function. Early stopping has been applied to avoid model overfitting, and the best models have been selected based on their performance on the validation data. Out-of-distribution detection using thresholding proved to provide inferior results to the data-driven approach described in Subsection V-C.

The unsatisfactory balanced multiclass accuracy of the NASNet model may be caused by the relatively small batch size, which was limited to four due to the size of the model. As expected, the improved performance of deep neural networks in the classification of ImageNet data can be directly translated to models trained on custom data sets: improved CNN architectures, which achieve higher accuracy in the classification of the ImageNet data set, also provide better results in the classification of dermoscopic images.

Rescaling the outputs of the models by multiplying the output probabilities by inverse class frequencies has proven advantageous for the balanced multiclass accuracy of the network predictions in all cases where no weighted loss function has been used; applying rescaling to models trained using a weighted loss function did not improve balanced multiclass prediction accuracy. The outputs of several independently trained models were combined into an overall prediction using the arithmetic mean of all model predictions and transmitted to the automated evaluation system of the ISIC-2019 Challenge.

Table I shows the results for the individual models. The best performing models were used to form ensemble predictions. NASNet-A-Large was not included in the ensemble due to the unsatisfactory overall accuracy achieved. Although EfficientNet shows the best results of all trained network architectures, the combination with predictions from the SE-ResNeXt-101 (32x4d) and Inception-ResNet-v2 models still leads to higher average accuracy than any single model could achieve independently.

TABLE I

Architecture                  Accuracy
EfficientNet-B5               0.600
SE-ResNeXt-101 (32x4d)        0.582
EfficientNet-B4               0.577
Inception-ResNet-v2           0.569
NASNet-A-Large                0.504
Ensemble (excluding NASNet)   0.634

Table II shows metrics for the ensemble with 0.634 balanced multiclass accuracy, as computed by the ISIC Challenge website. AUC: area under the receiver operating characteristic (ROC) curve. AUC, Sens > 80%: area under the ROC curve, evaluated exclusively for the region in which the sensitivity is greater than 80%. Average precision (precision is also called positive predictive value, PPV) measures the area under the interpolated precision-recall curve (recall = sensitivity). Accuracy measures the overall accuracy of the classifier, i.e. Accuracy = Sensitivity × Prevalence + Specificity × (1 − Prevalence). Sensitivity (recall) measures true-positive predictions; specificity measures true-negative predictions of the classifier. The F1 score (Dice coefficient) is the harmonic mean of precision and recall, reaching its best value at 1 (perfect precision and recall); it is also known as the Sørensen-Dice coefficient or Dice similarity coefficient (DSC). The positive predictive value (PPV) is the likelihood that subjects who test positive actually have the disease. The negative predictive value (NPV) is the likelihood that subjects who test negative really do not have the disease. Figure 4 shows the receiver operating characteristic curve for the ensemble.

TABLE II
METRICS (ENSEMBLE)

Metrics          Mean  MEL   NV    BCC   AK    BKL   DF    VASC  SCC   UNK
AUC              .902  .924  .957  .942  .917  .893  .977  .932  .936  .638
AUC, Sens>80%    .813  .853  .926  .883  .829  .776  .966  .868  .876  .336
Avg. Precision   .561  .766  .923  .719  .366  .572  .586  .502  .326  .285
Accuracy         .923  .899  .894  .908  .933  .933  .983  .978  .969  .808
Sensitivity      .525  .581  .752  .666  .580  .384  .744  .614  .408  .00
Specificity      .973  .963  .962  .944  .952  .985  .986  .983  .982  1.00
Dice Coeff.      .491  .659  .821  .654  .468  .499  .523  .434  .364  .00
PPV              .609  .760  .905  .642  .392  .713  .404  .335  .328  1.00
NPV              .941  .919  .890  .950  .977  .944  .997  .995  .987  .808
Fig. 4. ROC curve for the 0.634 balanced multiclass accuracy ensemble. The ROC curve shows the diagnostic capability of a binary classifier as its decision threshold varies. It is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also referred to as sensitivity, recall, or detection probability, whereas the FPR corresponds to 1 − specificity.

VII. CONCLUSION

Deep learning has become a mature technology for the classification of image content and can achieve accuracy similar or superior to that of human experts in the classification of skin lesions. Deep learning applications that automatically evaluate clinical and dermoscopic images and classify skin lesions offer great potential for improving and implementing prevention and screening measures and increasing their efficiency. One of the main criticisms of deep learning applications, namely that these networks have to be treated as a black box and that there is no easy explanation of how they form their decisions, remains unchanged despite some progress in the visualization of network activations. Careful validation of trained models using real-world data sets, before and also during use, is essential. Progress in the development of more efficient deep neural network architectures and improved accuracy in the classification of high-quality images does not automatically mean that the results can be transferred to real-world applications. For instance, [69] examined the use of a classification system created by Google researchers to detect diabetic retinopathy in 11 clinics in Thailand and found that this technology does not yet work well in practice despite all the research advances. Advantages of deep learning applications in the medical field are the rapid availability of a diagnosis compared to analysis by human specialists and the cost-effective provisioning of models for large numbers of simultaneous users. Central provisioning of deep learning models allows uncomplicated and transparent delivery of improved models without having to make changes to client software. Cloud applications can serve current deep learning models cost-effectively through automatic horizontal scaling.

REFERENCES

[1] K. J. Wernli, N. B. Henrikson, C. C. Morrison, M. Nguyen, G. Pocobelli, and P. R. Blasi, "Screening for skin cancer in adults: updated evidence report and systematic review for the US Preventive Services Task Force," JAMA, vol. 316, no. 4, pp. 436–447, 2016.
[2] L. F. di Ruffano, Y. Takwoingi, J. Dinnes, N. Chuchu, S. E. Bayliss, C. Davenport, R. N. Matin, K. Godfrey, C. O'Sullivan, A. Gulati et al., "Computer-assisted diagnosis techniques (dermoscopy and spectroscopy-based) for diagnosing skin cancer in adults," Cochrane Database of Systematic Reviews, no. 12, 2018.
[3] P. Carli, V. De Giorgi, D. Palli, A. Maurichi, P. Mulas, C. Orlandi, G. L. Imberti, I. Stanganelli, P. Soma, D. Dioguardi et al., "Dermatologist detection and skin self-examination are associated with thinner melanomas: results from a survey of the Italian multidisciplinary group on melanoma," Archives of Dermatology, vol. 139, no. 5, pp. 607–612, 2003.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[6] C. Langlotz, B. Allen, B. Erickson, J. Kalpathy-Cramer, K. Bigelow, T. Cook, A. Flanders, M. Lungren, D. Mendelson, J. Rudie, G. Wang, and K. Kandarpa, "A roadmap for foundational research on artificial intelligence in medical imaging: From the 2018 NIH/RSNA/ACR/The Academy workshop," Radiology, vol. 291, p. 190613, 2019.
[7] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[8] C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," CoRR, vol. abs/1712.00559, 2017. [Online]. Available: http://arxiv.org/abs/1712.00559
[9] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in The European Conference on Computer Vision (ECCV), September 2018.
[10] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
[11] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," CoRR, vol. abs/1905.11946, 2019. [Online]. Available: http://arxiv.org/abs/1905.11946
[12] S. Bianco, R. Cadene, L. Celona, and P. Napoletano, "Benchmark analysis of representative deep neural network architectures," IEEE Access, vol. 6, pp. 64270–64277, 2018.
[13] X. Cui, R. Wei, L. Gong, R. Qi, Z. Zhao, H. Chen, K. Song, A. A. Abdulrahman, Y. Wang, J. Z. Chen et al., "Assessing the effectiveness of artificial intelligence methods for melanoma: A retrospective review," Journal of the American Academy of Dermatology, vol. 81, no. 5, pp. 1176–1180, 2019.
[14] Y. Fujisawa, S. Inoue, and Y. Nakamura, "The possibility of deep learning-based, computer-aided skin tumor classifiers," Frontiers in Medicine, vol. 6, p. 191, 2019.
[15] A. Hekler, J. S. Utikal, A. H. Enk, A. Hauschild, M. Weichenthal, R. C. Maron, C. Berking, S. Haferkamp, J. Klode, D. Schadendorf et al., "Superior skin cancer classification by the combination of human and artificial intelligence," European Journal of Cancer, vol. 120, pp. 114–121, 2019.
[16] R. C. Maron, M. Weichenthal, J. S. Utikal, A. Hekler, C. Berking, A. Hauschild, A. H. Enk, S. Haferkamp, J. Klode, D. Schadendorf et al., "Systematic outperformance of 112 dermatologists in multiclass skin cancer image classification by convolutional neural networks," European Journal of Cancer, vol. 119, pp. 57–65, 2019.
[17] T. J. Brinker, A. Hekler, A. H. Enk, J. Klode, A. Hauschild, C. Berking, B. Schilling, S. Haferkamp, D. Schadendorf, T. Holland-Letz et al., "Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task," European Journal of Cancer, vol. 113, pp. 47–54, 2019.
[18] A. Blum, H. Luedtke, U. Ellwanger, R. Schwabe, G. Rassner, and C. Garbe, "Digital image analysis for diagnosis of cutaneous melanoma. Development of a highly effective computer algorithm based on analysis of 837 melanocytic lesions," British Journal of Dermatology, vol. 151, no. 5, pp. 1029–1038, 2004.
[19] M. Zortea, T. R. Schopf, K. Thon, M. Geilhufe, K. Hindberg, H. Kirchesch, K. Møllersen, J. Schulz, S. O. Skrøvseth, and F. Godtliebsen, "Performance of a dermoscopy-based computer vision system for the diagnosis of pigmented skin lesions compared with visual evaluation by experienced dermatologists," Artificial Intelligence in Medicine, vol. 60, no. 1, pp. 13–26, 2014.
[20] M. Combalia, N. C. Codella, V. Rotemberg, B. Helba, V. Vilaplana, O. Reiter, A. C. Halpern, S. Puig, and J. Malvehy, "BCN20000: Dermoscopic lesions in the wild," arXiv preprint arXiv:1908.02288, 2019.
[21] P. Tschandl, C. Rosendahl, and H. Kittler, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions," Scientific Data, vol. 5, p. 180161, 2018.
[22] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler et al., "Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC)," in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018, pp. 168–172.
[23] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. Marcal, and J. Rozeira, "PH2 - a dermoscopic image database for research and benchmarking," in 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2013, pp. 5437–5440.
[24] S. M. de Faria, J. N. Filipe, P. M. Pereira, L. M. Tavora, P. A. Assuncao, M. O. Santos, R. Fonseca-Pinto, F. Santiago, V. Dominguez, and M. Henrique, "Light field image dataset of skin lesions," in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2019, pp. 3905–3908.
[25] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, and Y. Liu, "Light field image processing: An overview," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 926–954, 2017.
[26] X. Sun, J. Yang, M. Sun, and K. Wang, "A benchmark for automatic visual classification of clinical skin disease images," in European Conference on Computer Vision. Springer, 2016, pp. 206–222.
[27] J. Kawahara, S. Daneshvar, G. Argenziano, and G. Hamarneh, "Seven-point checklist and skin lesion classification using multitask multimodal neural nets," IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 2, pp. 538–546, 2019.
[28] G. Argenziano, G. Fabbrocini, P. Carli, V. De Giorgi, E. Sammarco, and M. Delfino, "Epiluminescence microscopy for the diagnosis of doubtful melanocytic skin lesions: comparison of the ABCD rule of dermatoscopy and a new 7-point checklist based on pattern analysis," Archives of Dermatology, vol. 134, no. 12, pp. 1563–1570, 1998.
[29] H. Kittler, A. A. Marghoob, G. Argenziano, C. Carrera, C. Curiel-Lewandrowski, R. Hofmann-Wellenhof, J. Malvehy, S. Menzies, S. Puig, H. Rabinovitz et al., "Standardization of terminology in dermoscopy/dermatoscopy: results of the third consensus conference of the International Society of Dermoscopy," Journal of the American Academy of Dermatology, vol. 74, no. 6, pp. 1093–1106, 2016.
[30] I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman, and N. Petkov, "MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images," Expert Systems with Applications, vol. 42, no. 19, pp. 6578–6585, 2015.
[31] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012. [Online]. Available: http://arxiv.org/abs/1207.0580
[32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[33] T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.
[34] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[35] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
[36] D. Hendrycks and K. Gimpel, "A baseline for detecting misclassified and out-of-distribution examples in neural networks," arXiv preprint arXiv:1610.02136, 2016.
[37] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Advances in Neural Information Processing Systems, 2017, pp. 6402–6413.
[38] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan, "Likelihood ratios for out-of-distribution detection," in Advances in Neural Information Processing Systems, 2019, pp. 14680–14691.
[39] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, "The iNaturalist species classification and detection dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8769–8778.
[40] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 3485–3492.
[41] B. A. Johnson, R. Tateishi, and N. T. Hoan, "A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees," International Journal of Remote Sensing, vol. 34, no. 20, pp. 6969–6982, 2013.
[42] M. Kubat, R. C. Holte, and S. Matwin, "Machine learning for the detection of oil spills in satellite radar images," Machine Learning, vol. 30, no. 2-3, pp. 195–215, 1998.
[43] O. Beijbom, P. J. Edmunds, D. I. Kline, B. G. Mitchell, and D. Kriegman, "Automated annotation of coral reef survey images," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1170–1177.
[44] J. W. Grzymala-Busse, L. K. Goodwin, W. J. Grzymala-Busse, and X. Zheng, "An approach to imbalanced data sets based on changing rule strength," in Rough-Neural Computing. Springer, 2004, pp. 543–553.
[45] B. Mac Namee, P. Cunningham, S. Byrne, and O. I. Corrigan, "The problem of bias in training data in regression problems in medical decision support," Artificial Intelligence in Medicine, vol. 24, no. 1, pp. 51–70, 2002.
[46] K. Philip and S. Chan, "Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection," in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998, pp. 164–168.
[47] P. Radivojac, N. V. Chawla, A. K. Dunker, and Z. Obradovic, "Classification and knowledge discovery in protein databases," Journal of Biomedical Informatics, vol. 37, no. 4, pp. 224–239, 2004.
[48] C. Cardie and N. Nowe, "Improving minority class prediction using case-specific feature weights," in Proceedings of the Fourteenth International Conference on Machine Learning, ser. ICML '97. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997, pp. 57–65.
[49] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, "Learning from class-imbalanced data: Review of methods and applications," Expert Systems with Applications, vol. 73, pp. 220–239, 2017.
[50] M. Buda, A. Maki, and M. A. Mazurowski, "A systematic study of the class imbalance problem in convolutional neural networks," Neural Networks, vol. 106, pp. 249–259, 2018.
[51] N. Japkowicz and S. Stephen, "The class imbalance problem: A systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
[52] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker, and G. D. Tourassi, "Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance," Neural Networks, vol. 21, no. 2-3, pp. 427–436, 2008.
[53] C. X. Ling and C. Li, "Data mining for direct marketing: Problems and solutions," in KDD, vol. 98, 1998, pp. 73–79.
[54] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[55] A. Fernández, S. Garcia, F. Herrera, and N. V. Chawla, "SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018.