An Exploration of 2D and 3D Deep Learning Techniques For Cardiac MR Image Segmentation
Computer Vision and Geometry Group, ETH Zurich
1 Introduction
Cardiovascular diseases are a major public health concern and currently the
leading cause of death in Europe [12]. Automated segmentation of cardiac struc-
tures from medical images is an important step towards analysing normal and
pathological cardiac function on a large scale, and ultimately towards developing
diagnosis and treatment methods.
Until recently, the field of anatomical segmentation was dominated by atlas-
based techniques (e.g. [2]), which have the advantage of providing strong spatial
priors and yielding robust results with relatively little training data. With more
data becoming available and recent advances in machine learning and parallel
computing infrastructure, segmentation techniques based on deep convolutional
neural networks (CNN) are emerging as the new state-of-the-art [15,9].
This paper is dedicated to the segmentation of cardiac structures on short-
axis MR images and is accompanied by a submission to the automated cardiac
diagnosis challenge (ACDC) 2017. Short-axis MR images consist of a stack of 2D
MR images acquired over multiple cardiac cycles which are often not perfectly
aligned and typically have a low through-plane resolution of 5–10 mm.
* Both authors contributed equally.
2 Method
In the following, we outline the individual steps of our pipeline: pre-processing of the data, network architectures, optimisation, and post-processing.
2.1 Pre-Processing
Since the data were recorded at varying resolutions, we resampled all images
and segmentations to a common resolution. For the networks operating in 2D,
the images were resampled to an in-plane resolution of 1.37 × 1.37 mm. We
did not perform any resampling in the through-plane direction to avoid any
losses in accuracy in the up- and downsampling steps. Part of the data had
a relatively low through-plane resolution of 10 mm and we found that losses
incurred by resampling artefacts can be significant. For the 3D network we chose a resolution of 2.5 × 2.5 × 5 mm. Higher resolutions were not possible due to GPU
memory restrictions. We then placed all the resampled images centrally into
images of constant size, padding with zeros where necessary. The exact image
size depended on the network architecture and will be discussed below. Lastly,
each image was intensity-normalised to zero mean and unit variance.
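For illustration, the 2D pre-processing can be sketched in a few lines of NumPy/SciPy. The function name, the choice to normalise before padding, and the central cropping of oversized images are illustrative choices rather than verbatim details of our released code:

```python
import numpy as np
from scipy import ndimage

TARGET_SPACING = (1.37, 1.37)  # target in-plane resolution in mm (Sec. 2.1)

def preprocess_slice(image, pixel_spacing, target_size):
    """Resample a 2D slice to the target in-plane resolution, normalise its
    intensities, and place it centrally in a zero-padded canvas of fixed size."""
    # Resample in-plane only; the through-plane direction is left untouched.
    zoom = (pixel_spacing[0] / TARGET_SPACING[0],
            pixel_spacing[1] / TARGET_SPACING[1])
    image = ndimage.zoom(image, zoom, order=1)  # bilinear; use order=0 for masks

    # Intensity normalisation to zero mean, unit variance (done here before
    # padding so the zeros do not skew the statistics).
    image = (image - image.mean()) / (image.std() + 1e-8)

    # Central crop (if too large), then central zero-padding to a fixed size;
    # the exact size is network-dependent (see Sec. 2.2).
    sx, sy = image.shape
    tx, ty = target_size
    x0, y0 = max((sx - tx) // 2, 0), max((sy - ty) // 2, 0)
    image = image[x0:x0 + tx, y0:y0 + ty]
    out = np.zeros(target_size, dtype=np.float32)
    px, py = (tx - image.shape[0]) // 2, (ty - image.shape[1]) // 2
    out[px:px + image.shape[0], py:py + image.shape[1]] = image
    return out
```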
2.2 Network Architectures
We evaluated a fully convolutional architecture (FCN-8) [10] as well as variations of the U-Net [15]. The original U-Net consists of a symmetric downsampling and upsampling path with skip connections and two unpadded 3 × 3 convolutions within each resolution stage. Since this architecture does not employ padded convolutions, a larger input size of 396 × 396 pixels was necessary, which led to segmentation masks of size 212 × 212 pixels.
Inspired by the fact that the FCN-8 produces competitive results despite
having a simple upsampling path with few channels, we speculated that the full
complexity of the U-Net upsampling path may not be necessary for our problem.
Therefore, we additionally investigated a modified 2D U-Net in which the number of feature maps in the transposed convolutions of the upsampling path is set to the number of classes. Intuitively, each class should have at least one channel.
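A minimal sketch of one decoder step of this modification, written with tf.keras layers (our released implementation uses plain TensorFlow; for brevity the sketch uses padded convolutions and a kernel size of 4, whereas the architecture above uses unpadded convolutions with cropped skip connections):

```python
from tensorflow.keras import layers

NUM_CLASSES = 4  # background + 3 cardiac structures in ACDC

def modified_decoder_step(x, skip, n_filters):
    """One upsampling step of the modified 2D U-Net: the transposed
    convolution carries only NUM_CLASSES feature maps."""
    # Transposed convolution with stride 2 doubles the spatial size;
    # its channel count is reduced to the number of classes.
    up = layers.Conv2DTranspose(NUM_CLASSES, kernel_size=4,
                                strides=2, padding='same')(x)
    up = layers.concatenate([up, skip])  # skip connection as in the U-Net
    up = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(up)
    up = layers.Conv2D(n_filters, 3, padding='same', activation='relu')(up)
    return up
```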
Çiçek et al. recently extended the U-Net architecture to 3D [4] by following
the same symmetric design principle. However, for data with few slices in one orientation, the repeated pooling and convolving may be too aggressive. We found that, when using the 3D U-Net on our data, all spatial information in the through-plane direction was lost before the third max pooling step. We thus also investigated
a slightly modified version of the 3D U-Net in which we performed only one
max-pooling (and upsampling) step in the through-plane direction. This had
two advantages: 1) The spatial information in the through-plane was retained
and thus available in the deeper layers, 2) it allowed us to work with a slightly
higher image resolution because less padding in the through-plane direction (and
thus less GPU memory) was required. In preliminary experiments we found that
the modified 3D U-Net led to improvements of around 0.02 in the average Dice score over the standard 3D U-Net. In the interest of brevity we only included
the modified version in the final results of this paper. Here, we used an input
image size of 204 × 204 × 60, which led to output masks of size 116 × 116 × 28.
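In code, this modification amounts to using anisotropic pooling and upsampling kernels. Below is a minimal sketch with tf.keras layers; the stage indexing and the (x, y, z, channels) data layout are ours:

```python
from tensorflow.keras import layers

def pool_3d(x, stage):
    """Only the first stage pools in the through-plane (z) direction;
    all deeper stages pool in-plane only, preserving the few slices."""
    size = (2, 2, 2) if stage == 0 else (2, 2, 1)
    return layers.MaxPooling3D(pool_size=size)(x)

def upsample_3d(x, stage, n_filters):
    """Mirror the pooling schedule in the upsampling path."""
    strides = (2, 2, 2) if stage == 0 else (2, 2, 1)
    return layers.Conv3DTranspose(n_filters, kernel_size=strides,
                                  strides=strides, padding='same')(x)
```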
We used batch normalisation [6] on the outputs of every convolutional and
transposed convolutional layer for all architectures. We found that this not only
led to faster convergence, as reported in [4], but also consistently yielded better results and allowed some networks to converge that otherwise did not.
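Concretely, this corresponds to inserting a batch normalisation layer between each (transposed) convolution and its non-linearity. A minimal sketch of such a block with tf.keras (setting use_bias=False is our illustrative choice, the bias being redundant with the subsequent normalisation):

```python
from tensorflow.keras import layers

def conv_bn_relu(x, n_filters, kernel_size=3):
    """Convolution -> batch normalisation [6] -> ReLU. Weights are
    initialised as in [5] ('he_normal'), cf. Sec. 2.3."""
    x = layers.Conv2D(n_filters, kernel_size, padding='same',
                      use_bias=False, kernel_initializer='he_normal')(x)
    x = layers.BatchNormalization()(x)  # normalise the conv outputs
    return layers.Activation('relu')(x)
```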
2.3 Optimisation
We trained the networks introduced above (i.e. FCN-8, 2D U-Net, 2D U-Net
(mod.) and 3D U-Net (mod.)) from scratch with the weights of the convolutional
layers initialised as described in [5].
We investigated three different cost functions. First, we used the standard
pixel-wise cross entropy. To account for the class imbalance between the back-
ground and the foreground classes, we also investigated a weighted cross entropy
loss. We used a weight of 0.1 for the background class, and 0.3 for the foreground
classes in all experiments in this paper, which corresponds approximately to the
inverse prevalence of each label in the dataset. Lastly, we investigated optimising
the Dice coefficient directly. In order to get more stable gradients we calculated
the Dice loss on the softmax output as follows:
\[
\mathcal{L}_{\mathrm{dice}} = 1 - \frac{\sum_{k=2}^{K} \sum_{n=1}^{N} t_{nk}\, y_{nk}}{\sum_{k=2}^{K} \sum_{n=1}^{N} \left( t_{nk} + y_{nk} \right)},
\]
where $y_{nk}$ is the softmax output of the network for pixel $n$ and class $k$, $t_{nk}$ is the corresponding one-hot encoded ground-truth label, $N$ is the number of pixels and $K$ the number of classes. Note that the background class ($k = 1$) is excluded from the sums.
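For reference, both the weighted cross entropy and the Dice loss can be written compactly on the network logits. The following TensorFlow sketch assumes one-hot float ground-truth labels in the last dimension and the four ACDC classes with the background first; the function names are ours:

```python
import tensorflow as tf

CLASS_WEIGHTS = (0.1, 0.3, 0.3, 0.3)  # background, then foreground classes

def weighted_cross_entropy(t, logits):
    """Pixel-wise cross entropy with the per-class weights of Sec. 2.3.
    t: one-hot ground truth with the class dimension last."""
    w = tf.reduce_sum(tf.constant(CLASS_WEIGHTS) * t, axis=-1)
    ce = tf.nn.softmax_cross_entropy_with_logits(labels=t, logits=logits)
    return tf.reduce_mean(w * ce)

def dice_loss(t, logits, eps=1e-7):
    """Dice loss on the softmax output; channel 0 (background) is
    excluded, matching k = 2..K in the equation above."""
    y = tf.nn.softmax(logits)
    intersection = tf.reduce_sum(t[..., 1:] * y[..., 1:])
    denominator = tf.reduce_sum(t[..., 1:] + y[..., 1:])
    return 1.0 - intersection / (denominator + eps)
```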
2.4 Post-Processing
Since training and inference were performed at a different resolution than the original scans, the predictions had to be resampled to each subject's original resolution. To avoid resampling artefacts, this step was carried out on the softmax (i.e. continuous) network outputs for each label using linear interpolation. The final discrete segmentation was then obtained at the original resolution by choosing the label with the highest score at each voxel. Interpolation on the softmax output, rather than the output
masks, led to consistent improvements of around 0.005 in the average Dice score.
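A minimal SciPy sketch of this step (the function name is ours; zoom_factors denotes the per-axis ratio between the processing resolution and the subject's original voxel spacing):

```python
import numpy as np
from scipy import ndimage

def resample_prediction(softmax, zoom_factors):
    """Resample per-class softmax maps back to the subject's original
    resolution with linear interpolation, then discretise by taking the
    per-voxel argmax. softmax: array of shape (K, x, y[, z])."""
    upsampled = np.stack([ndimage.zoom(p, zoom_factors, order=1)
                          for p in softmax])
    return np.argmax(upsampled, axis=0).astype(np.uint8)
```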
We occasionally observed spurious predictions of structures in implausible lo-
cations. To compensate for this, we applied simple post-processing to the segmen-
tation results by keeping only the largest connected component for every struc-
ture. Since the segmentations are already quite accurate without post-processing, this only led to an average Dice increase of approximately 0.0003; however, it reduced the Hausdorff distance considerably, which by definition is very sensitive to outliers. Other post-processing techniques, such as the commonly used spatial
regularisation method based on fully connected conditional random fields [8] did
not yield improvements in our experiments.
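The connected-component filtering amounts to a few lines with scipy.ndimage; a minimal sketch (the function name is ours):

```python
import numpy as np
from scipy import ndimage

def keep_largest_components(segmentation, num_classes):
    """For every foreground structure, keep only the largest connected
    component and relabel all smaller ones as background (label 0)."""
    cleaned = np.zeros_like(segmentation)
    for k in range(1, num_classes):       # label 0 is background
        mask = segmentation == k
        components, n = ndimage.label(mask)
        if n == 0:
            continue
        sizes = ndimage.sum(mask, components, index=range(1, n + 1))
        largest = np.argmax(sizes) + 1    # component labels start at 1
        cleaned[components == largest] = k
    return cleaned
```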
does not include the apex and basal region of the heart, which are particularly
challenging to segment.
The code and pretrained models for all examined network architectures are
publicly available at https://github.com/baumgach/acdc_segmenter.
References
1. Avendi, M.R., Kheradvar, A., Jafarkhani, H.: A combined deep-learning and
deformable-model approach to fully automatic segmentation of the left ventricle in
cardiac MRI. Med Image Anal 30, 108–119 (2016)
2. Bai, W., Shi, W., Ledig, C., Rueckert, D.: Multi-atlas segmentation with aug-
mented features for cardiac MR images. Med Image Anal 19(1), 98–109 (2015)
3. Bai, W., Shi, W., O’Regan, D.P., Tong, T., Wang, H., Jamil-Copley, S., Peters,
N.S., Rueckert, D.: A probabilistic patch-based label fusion model for multi-atlas
segmentation with registration refinement: application to cardiac MR images. IEEE
Transactions on Medical Imaging 32(7), 1302–15 (2013)
4. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net:
Learning Dense Volumetric Segmentation from Sparse Annotation. In: MICCAI.
pp. 424–432 (2016)
5. He, K., Zhang, X., Ren, S., Sun, J.: Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification. In: ICCV. pp. 1026–34
(2015)
6. Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift. In: ICML. pp. 448–456 (2015)
7. Kingma, D.P., Ba, J.L.: ADAM: A Method for Stochastic Optimization. In: ICLR
(2015)
8. Krähenbühl, P., Koltun, V.: Efficient Inference in Fully Connected CRFs with
Gaussian Edge Potentials. In: NIPS. pp. 109–117 (2011)
9. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M.,
van der Laak, J.A.W.M., van Ginneken, B., Sánchez, C.I.: A Survey on Deep
Learning in Medical Image Analysis. arXiv:1702.05747 (2017)
10. Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation. In: CVPR. pp. 3431–3440 (2015)
11. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In: 3D Vision. pp. 565–571 (2016)
12. Nichols, M., Townsend, N., Scarborough, P., Rayner, M.: Cardiovascular disease in Europe 2014: epidemiological update. European Heart Journal (2014)
13. Oktay, O., Bai, W., Guerrero, R., Rajchl, M., de Marvao, A., O’Regan, D.P.,
Cook, S.A., Heinrich, M.P., Glocker, B., Rueckert, D.: Stratified Decision Forests
for Accurate Anatomical Landmark Localization in Cardiac Images. IEEE Trans
Med Imag 36(1), 332–342 (2017)
14. Oktay, O., Ferrante, E., Kamnitsas, K., Heinrich, M., Bai, W., Caballero, J., Guer-
rero, R., Cook, S., de Marvao, A., Dawes, T., O’Regan, D., Kainz, B., Glocker, B.,
Rueckert, D.: Anatomically Constrained Neural Networks (ACNN): Application to
Cardiac Image Enhancement and Segmentation. arXiv:1705.08302 (2017)
15. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomed-
ical Image Segmentation. In: MICCAI. pp. 234–241 (2015)