PLOS ONE | https://doi.org/10.1371/journal.pone.0269323
RESEARCH ARTICLE
* mwalker@toh.ca
OPEN ACCESS
Methods
Study setting and design
This was a retrospective study conducted at The Ottawa Hospital, a multi-site tertiary-care
facility in Ottawa, Canada, with a catchment area of 1.3 million people. First trimester
ultrasound images taken between March 2014 and March 2021 were retrieved from the
institutional Picture Archiving and Communication System (PACS) and saved in Digital
Imaging and Communications in Medicine (DICOM) format. Eligible images were those that
included a mid-sagittal view of the fetus taken between 11 and 14 weeks gestation. Cases were
identified by a final clinical diagnosis of cystic hygroma in the ultrasound report (Fig 1B). A
set of normal ultrasound images from NT screens, retrieved between March 2021 and June
2021, served as controls (Fig 1A). Cases and normal images were reviewed and verified by a
clinical expert (IW). Patients were not contacted, and patient consent was not required to
access the images. Data were de-identified and fully anonymized for model training. This
study was reviewed and approved by the Ottawa Health Sciences Network Research Ethics
Board (OHSN REB #20210079).

Fig 1. Fetal ultrasound images of normal (A) and cystic hygroma (B) scans.
https://doi.org/10.1371/journal.pone.0269323.g001
A 4-fold cross-validation (4CV) design was used, whereby the same deep-learning
architecture was trained and validated four separate times using randomly partitioned
versions (folds) of the image dataset. For each fold, 75% of the dataset was used for model training, and 25% was
used for model validation. The 4CV design was chosen instead of the more commonly used
10-fold cross-validation (10CV) design to optimize the performance of the deep-learning
models within our small dataset. With 4CV, each prediction error affects the accuracy of the
model by 1.4%, and the sensitivity by 3.1%. Had a 10CV approach been used, each prediction
error would have affected model accuracy by 3.4% and the sensitivity by 7.8%.
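The partitioning described above can be sketched as follows. This is a minimal NumPy illustration of the 4CV split, not the authors' code; the 289-image total is taken from the Results section, and the seeded random generator is an assumption for reproducibility.

```python
import numpy as np

def four_fold_splits(n, seed=0):
    """Randomly partition indices 0..n-1 into four folds; for each fold,
    ~25% of the data validates the model and the remaining ~75% trains it."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 4)
    for k in range(4):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(4) if j != k])
        yield train, val

# 289 images total, as reported in the Results
splits = list(four_fold_splits(289))
```

With 289 images, each validation fold holds 72 or 73 images, which is where the 1/72 ≈ 1.4% per-error accuracy increment comes from.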
Data preparation
DICOM images included coloured annotations such as calipers, text, icons, profile traces (Fig
2A) and patient personal health information (PHI), which were removed prior to analysis.
PHI was handled by cropping the images to remove the identifying information contained in
the image borders. Coloured annotations were removed by first converting image data from
the Red Green Blue (RGB) colour space to the Hue Saturation Value (HSV) colour space.
Image pixels belonging to the grey ultrasound image were empirically identified and ranged
from 0–27, 0–150 and 0–255 for the H, S and V values, respectively (Fig 2B).
Fig 2. Identifying image annotations on a normal NT scan. (A) Image annotations included calipers, text, icons, and profile traces, all of which were removed
prior to model training. (B) 3D scatter plot of HSV image data. Each point represents one image pixel and its associated HSV values. The red region highlights
the range of values that do not belong to the grayscale ultrasound image. The area encircled in green shows pixel values that belong to the grayscale
ultrasound image. Grayscale images had H, S and V values ranging from 0–27, 0–150 and 0–255, respectively.
https://doi.org/10.1371/journal.pone.0269323.g002
Second, pixels outside of these ranges were identified as belonging to the coloured
annotations. Third, a binary mask of the image was created in which annotation pixels were
labelled '1' and ultrasound image pixels were labelled '0'. The binary mask was then dilated
with a 5x5 kernel to include contours surrounding the annotations. Last, the Navier-Stokes
image infill method [12] was used to artificially reconstruct the ultrasound image without
annotations (Fig 3).
After the DICOM images were cleaned, they were converted to grayscale (single-channel
images). Intensities were standardized to a mean of zero and a standard deviation of one for
better stability during neural network training [13]. Finally, the images were resized to
256 x 256 pixels.
Fig 3. Removal of image annotations on a scan with cystic hygroma diagnosis. (A) Ultrasound image before annotations were removed. Yellow calipers
(bottom middle) are visible, along with text annotations (top left). (B) The binary mask of the image, generated to define the region of the image that
needed to be infilled (white pixels). (C) Result of the Navier-Stokes image infill method; all image annotations have been removed.
https://doi.org/10.1371/journal.pone.0269323.g003
Model training
Model training was performed from scratch with random weights initialization for 1000
epochs using the cross-entropy loss function [16], the Adam optimizer [17, 18]
(epsilon = 1e-8, beta1 = 0.9, beta2 = 0.999) and a batch size of 64. The available pretrained
DenseNet models were trained on ImageNet; because natural ImageNet images differ
substantially from fetal ultrasound images, we did not use ImageNet pretraining for this
application. Instead, we trained the architecture on our dataset of ultrasound images from
The Ottawa Hospital with random weights initialization. The learning rate was set to 1e-2 and
reduced by a factor of 0.72 at every 100 epochs (gamma = 0.72, step size = 100) using a learning
rate step scheduler [19].
Data augmentation was used during model training to increase the generalizability of the
CNN model. Augmentations included random horizontal flip of the ultrasound images
(50% probability), random rotation in the range of [–15, 15] degrees, random translation (x
±10% and y±30% of image size) and random shearing in the range of [-0.2, 0.2] degrees.
These augmentations were performed dynamically at batch loading. Due to the randomness
of these operations, a single training image underwent different augmentations at each
epoch.
To address the imbalance between normal NT and cystic hygroma images in the training
dataset, cystic hygroma images were randomly upsampled, with replacement, to match the
number of normal NT images.
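The rebalancing step can be sketched as follows (NumPy assumed; the 129/160 counts come from the overall case mix in the Results section and stand in for the per-fold training counts).

```python
import numpy as np

def upsample_with_replacement(minority_idx, target_n, seed=0):
    """Randomly duplicate minority-class samples until both classes match."""
    rng = np.random.default_rng(seed)
    extra = rng.choice(minority_idx, size=target_n - len(minority_idx),
                       replace=True)
    return np.concatenate([minority_idx, extra])

hygroma_idx = np.arange(129)                            # cystic hygroma images
balanced = upsample_with_replacement(hygroma_idx, 160)  # match 160 normal NT
```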
Model validation
For each epoch, the validation data were used to tally true and false positives and true and
false negatives, from which accuracy, sensitivity, specificity and the area under the
receiver-operating characteristic curve (AUC) were calculated. For each fold, the performance
metrics for the epoch with the highest accuracy were reported. The mean, standard deviation
and 95% confidence intervals (CI) of the performance metrics across all folds were then
computed.
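These per-epoch metrics can be computed as in the plain NumPy sketch below. The rank-based AUC uses the Mann-Whitney formulation and assumes no tied probabilities; a library such as scikit-learn would normally handle ties.

```python
import numpy as np

def metrics(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / y_true.size
    sens = tp / (tp + fn)          # true positive rate
    spec = tn / (tn + fp)          # true negative rate
    # AUC: probability a random positive outranks a random negative.
    ranks = y_prob.argsort().argsort() + 1
    n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return acc, sens, spec, auc
```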
Explainability
The Gradient-weighted Class Activation Mapping (Grad-CAM) method was used to improve
the interpretability of our trained DenseNet models, and visually contextualize important fea-
tures in the image data that were used for model predictions [20]. Grad-CAM is a widely used
technique for the visual explanation of deep-learning algorithms [21, 22]. With this approach,
heat maps were generated from 8x8 feature maps to highlight regions of each image that were
the most important for model prediction (Fig 4).
Fig 4. Grad-CAM image of a cystic hygroma case. The green gridlines indicate the size of the feature maps (8x8) used
to generate the heat maps. The red highlights the region of the image that influenced the model’s prediction the most.
https://doi.org/10.1371/journal.pone.0269323.g004
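A minimal Grad-CAM can be sketched in PyTorch as below. This is our illustrative implementation on a tiny stand-in network, not the authors' code; only the 8x8 feature-map resolution is taken from the text.

```python
import torch
from torch import nn
import torch.nn.functional as F

class GradCAM:
    """Weights the target layer's feature maps by the pooled gradients of the
    class score, then applies ReLU and normalizes the result to [0, 1]."""
    def __init__(self, model, target_layer):
        self.model, self.acts, self.grads = model, None, None
        target_layer.register_forward_hook(self._save_act)
        target_layer.register_full_backward_hook(self._save_grad)

    def _save_act(self, module, inputs, output):
        self.acts = output                       # feature maps on forward pass

    def _save_grad(self, module, grad_in, grad_out):
        self.grads = grad_out[0]                 # gradients on backward pass

    def __call__(self, x, class_idx):
        score = self.model(x)[0, class_idx]
        self.model.zero_grad()
        score.backward()
        weights = self.grads.mean(dim=(2, 3), keepdim=True)  # pooled gradients
        cam = F.relu((weights * self.acts).sum(dim=1))       # weighted map sum
        return (cam / (cam.max() + 1e-8))[0]                 # 8x8, in [0, 1]

# Tiny stand-in network whose last feature maps are 8x8, mirroring the
# heat-map resolution reported above (the trained DenseNet would be used
# in practice).
pool = nn.AdaptiveAvgPool2d(8)
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), pool,
                      nn.Flatten(), nn.Linear(8 * 8 * 8, 2))
cam = GradCAM(model, pool)(torch.randn(1, 1, 256, 256), class_idx=1)
```

The resulting 8x8 map is upsampled and overlaid on the input image to produce figures like Fig 4.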
Results
The image dataset included 289 unique ultrasound images: 160 control images with normal
NT measurements and 129 cases with cystic hygroma. A total of 217 images were used for
model training, and 72 images were used for model validation (Table 1).
Table 2 shows the results for all four cross-validation folds. All four models performed well
on the validation set. The overall mean accuracy was 93% (95% CI: 88–98%), and the area
under the receiver operating characteristic curve was 0.94 (95% CI: 0.89–1.0) (Fig 5). The
sensitivity was 92% (95% CI: 79–100%) and the specificity was 94% (95% CI: 91–96%).
Most of the Grad-CAM heat maps highlighted the fetal head and neck (Fig 6). While many
heat maps specifically highlighted the posterior cervical region, the area used for NT
measurement (Fig 7A and 7B), poor localization did occur in some cases (Fig 7C and 7D).
There were 10 false negatives and 10 false positives. The misclassified images were reviewed
with a clinical expert (MW), who determined that misclassifications commonly occurred
when the fetus was close to the placental membrane, leading the heat map to focus on another
part of the brain.
Discussion
Our findings demonstrate the feasibility of using deep-learning models to interpret fetal ultra-
sound images and identify cystic hygroma diagnoses with high performance in a dataset of
first trimester ultrasound scans. The model achieved excellent prediction of cystic hygroma
with a sensitivity of 92% (95% CI: 79–100%) and specificity of 94% (95% CI: 91–96%). This
study contributes to the literature on AI and medical diagnostics, and more specifically to the
use of AI in fetal ultrasonography where there are scant data.
Ultrasound is critical for monitoring fetal growth and development; however, small fetal
structures, involuntary fetal movements and poor image quality make fetal image acquisition
and interpretation challenging. Our study has shown that deep-learning models with Grad-
CAM heat maps can correctly identify the fetal head and neck region and use it to identify
abnormalities. There have been several investigations focusing on AI-based localization of
standard planes in fetal ultrasonography, suggesting that AI models perform as well as
clinicians in obtaining reasonable planes for image capture and diagnosis [23–27]. Cystic
hygroma, however, has not yet been evaluated in this literature, even though it is an ideal
condition for assessing the accuracy of AI-based models because the diagnosis is clearly
visible to the trained expert.
More recently, others have sought to apply machine learning methods to the identification
of fetal malformations, with promising results. Xie et al. developed and tested a CNN-based
deep-learning model in a dataset of nearly 30,000 fetal brain images, including over 14,000
images with common central nervous system abnormalities [28]. Although the final model
discriminated well between normal and abnormal images (sensitivity and specificity of 96.9%
and 95.9%, respectively) in a hold-out test set, it was not trained to distinguish between
specific brain abnormalities and had no information on cystic hygroma diagnoses.

Fig 5. Receiver operating characteristic plot summarizing performance of all four cross-validation folds.
https://doi.org/10.1371/journal.pone.0269323.g005

In a large sample of 2D ultrasounds, fetal echocardiograms and anatomy scans
from second trimester pregnancies, Arnaout et al. trained CNN models to accurately localize
the fetal heart and detect complex congenital heart disease [29]. Their models achieved high
sensitivity (95%; 95% CI: 84–99%) and specificity (96%; 95% CI: 95–97%) and were validated
in several independent datasets. Finally, Baumgartner et al. [25] demonstrated the potential for
such systems to operate in real-time. Their findings suggest that deep-learning models could
be used “live” to guide sonographers with image capture and recognition during clinical prac-
tice. The model proposed by Baumgartner et al. was trained for detection of multiple fetal stan-
dard views in freehand 2D ultrasound data. They achieved a 91% recall (i.e., sensitivity) on the
profile standard views, which is close to the evaluation metrics obtained in our study. Although
published data on deep-learning models for the identification of specific malformations are
generally lacking, our findings, combined with those of others in this space demonstrate the
feasibility of deep-learning models for supporting diagnostic decision-making.
Fig 6. Grad-CAM heat maps for the full validation set of Fold 2. Top 4 rows are normal NT cases and bottom 4 rows
are cystic hygroma cases. Red colours highlight regions of high importance and blue colours highlight regions of low or
no importance. Therefore, a good model would have Grad-CAM heatmaps that highlight the head and neck area for
both normal and cystic hygroma images.
https://doi.org/10.1371/journal.pone.0269323.g006
Using data augmentation and image dimensionality-reduction techniques, we have developed
a deep-learning algorithm with very good predictive accuracy on a relatively small dataset.
The strengths of this study include our use of the k-fold cross-validation experimental design.
Although computationally intensive, k-fold cross-validation reduces model bias and variance,
as most of the data are used for both model training and validation. In addition, removal of
calipers and text annotations from the ultrasound images established a dataset free from
clinical bias on which to develop our models. Annotations on routinely collected ultrasound
images have historically limited their utility for medical AI research [30]. Furthermore, our
use of Grad-CAM heat maps enabled transparent reporting of how the deep-learning models
performed on a case-by-case basis. With this approach, we were able to confirm excellent
localization for most of the images used in our dataset. Class-discriminative visualization
enables the user to understand where models fail (i.e., why the models predicted what they pre-
dicted) and can be used to inform downstream model enhancements. Additionally, all false
negative and false positive images were reviewed and it was determined that misclassifications
commonly occurred when the fetus was close to the placental membrane. Future work could
collect more images in which the fetus is close to the placental membrane, up-sample the
images that were error prone, and further train the model to incorporate this pattern.
Our study is not without limitations. First, as a single-centre study, the sample size available
for developing and validating the deep-learning models was relatively small. However, the use
of data augmentation to increase the variability in our dataset, enrichment of cystic hygroma
cases in our training set, and use of the k-fold cross-validation experimental design are all
well-accepted strategies to overcome the limitations of small datasets [31]. Second, although we removed all image annotations,
we cannot discount the possibility that the blending and infill methods used to reconstitute
the images influenced the deep-learning algorithm. However, the Grad-CAM heatmaps
provide reassurance that fetal craniocervical regions were driving the deep-learning
algorithm, and that the model appropriately places high importance on regions which are
clinically relevant for diagnosis.

Fig 7. Exemplary Grad-CAM heat maps. (A) Normal NT case with good localization in which the model predicted
the correct class with a high (1.00) output probability (true negative). (B) Cystic hygroma case with good localization in
which the model predicted the correct class with a high (1.00) output probability (true positive). (C) Normal NT case
showing poor localization in which the model predicted this class incorrectly with a 0.90 output probability (false
positive). (D) Cystic hygroma case showing poor localization in which the model predicted the correct class, but with
an output probability that suggests uncertainty (0.63) (true positive).
https://doi.org/10.1371/journal.pone.0269323.g007
Given the relatively low incidence of congenital anomalies such as cystic hygroma, a natural exten-
sion of this work will be to introduce our models to a larger, multi-centre dataset with more vari-
ability in the image parameters and greater feature variety specific to cystic hygroma.
Conclusions
In this proof-of-concept study, we demonstrate the potential for deep-learning to support
early and reliable identification of cystic hygroma from first trimester ultrasound scans.
Acknowledgments
The authors would like to acknowledge that this study took place on unceded Algonquin
Anishinabe territory.
Author Contributions
Conceptualization: Mark C. Walker, Olivier X. Miguel, Steven Hawken.
Data curation: Inbal Willner, Olivier X. Miguel.
Formal analysis: Olivier X. Miguel, Steven Hawken.
Funding acquisition: Mark C. Walker.
Investigation: Mark C. Walker, Inbal Willner, Darine El-Chaâr, Felipe Moretti, André M.
Carrington, Steven Hawken, Richard I. Aviv.
Methodology: Katherine A. Muldoon, André M. Carrington, Steven Hawken.
Project administration: Alysha L. J. Dingwall Harvey, Ruth Rennicks White.
Resources: Mark C. Walker.
Supervision: Mark C. Walker, Darine El-Chaâr, Felipe Moretti, Alysha L. J. Dingwall Harvey,
Ruth Rennicks White, André M. Carrington, Steven Hawken.
Writing – original draft: Olivier X. Miguel, Malia S. Q. Murphy, Katherine A. Muldoon.
Writing – review & editing: Mark C. Walker, Inbal Willner, Olivier X. Miguel, Malia S. Q.
Murphy, Darine El-Chaâr, Felipe Moretti, Alysha L. J. Dingwall Harvey, Ruth Rennicks
White, Katherine A. Muldoon, André M. Carrington, Steven Hawken, Richard I. Aviv.
References
1. Drukker L, Noble JA, Papageorghiou AT. Introduction to artificial intelligence in ultrasound imaging in obstetrics and gynecology. Ultrasound Obstet Gynecol. 2020 Oct;56(4):498–505. https://doi.org/10.1002/uog.22122
2. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019 Oct;1(6):e271–97. https://doi.org/10.1016/S2589-7500(19)30123-2 PMID: 33323251
3. Park SH. Artificial intelligence for ultrasonography: unique opportunities and challenges. Ultrasonography. 2021 Jan 1;40(1):3–6. https://doi.org/10.14366/usg.20078 PMID: 33227844
4. Chen Z, Liu Z, Du M, Wang Z. Artificial Intelligence in Obstetric Ultrasound: An Update and Future Applications. Front Med. 2021 Aug 27;8. https://doi.org/10.3389/fmed.2021.733468 PMID: 34513890
5. The Fetal Medicine Foundation. Cystic Hygroma [Internet]. [cited 2021 Nov 17]. Available from: https://fetalmedicine.org/education/fetal-abnormalities/neck/cystic-hygroma
6. Chen Y-N, Chen C-P, Lin C-J, Chen S-W. Prenatal Ultrasound Evaluation and Outcome of Pregnancy with Fetal Cystic Hygromas and Lymphangiomas. J Med Ultrasound. 2017 Mar;25(1):12–5.
25. Baumgartner CF, Kamnitsas K, Matthew J, Fletcher TP, Smith S, Koch LM, et al. SonoNet: Real-Time Detection and Localisation of Fetal Standard Scan Planes in Freehand Ultrasound. IEEE Trans Med Imaging. 2017 Nov;36(11):2204–15. https://doi.org/10.1109/TMI.2017.2712367 PMID: 28708546
26. Gofer S, Haik O, Bardin R, Gilboa Y, Perlman S. Machine Learning Algorithms for Classification of First-Trimester Fetal Brain Ultrasound Images. J Ultrasound Med. 2021 Oct 28. https://doi.org/10.1002/jum.15860 PMID: 34710247
27. Sciortino G, Tegolo D, Valenti C. Automatic detection and measurement of nuchal translucency. Comput Biol Med. 2017 Mar;82:12–20. https://doi.org/10.1016/j.compbiomed.2017.01.008 PMID: 28126630
28. Xie HN, Wang N, He M, Zhang LH, Cai HM, Xian JB, et al. Using deep-learning algorithms to classify fetal brain ultrasound images as normal or abnormal. Ultrasound Obstet Gynecol. 2020 Oct;56(4):579–87. https://doi.org/10.1002/uog.21967
29. Arnaout R, Curran L, Zhao Y, Levine JC, Chinn E, Moon-Grady AJ. An ensemble of neural networks provides expert-level prenatal detection of complex congenital heart disease. Nat Med. 2021 May 14;27(5):882–91. https://doi.org/10.1038/s41591-021-01342-5 PMID: 33990806
30. Prieto JC, Shah H, Rosenbaum A, Jiang X, Musonda P, Price J, et al. An automated framework for image classification and segmentation of fetal ultrasound images for gestational age estimation. In: Landman BA, Išgum I, editors. Medical Imaging 2021: Image Processing. SPIE; 2021. p. 55. https://doi.org/10.1117/12.2582243 PMID: 33935344
31. Zhang Y, Yang Y. Cross-validation for selecting a model selection procedure. J Econom. 2015 Jul;187(1):95–112. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0304407615000305