Medical Transformer: Gated Axial-Attention For Medical Image Segmentation
Jeya Maria Jose Valanarasu1, Poojan Oza1, Ilker Hacihaliloglu2, and Vishal M. Patel1
1 Johns Hopkins University, Baltimore, MD, USA
2 Rutgers, The State University of New Jersey, NJ, USA
1 Introduction
It has been observed that transformer-based models work well only when they are trained on large-scale datasets [6]. This becomes problematic when adopting transformers for medical imaging tasks, as the number of images with corresponding labels available for training in any medical dataset is relatively scarce. The labeling process is also expensive and requires expert knowledge. In particular, training with fewer images makes it difficult to learn positional encodings for the images. To this end, we propose a gated position-sensitive axial attention mechanism in which we introduce four gates that control the amount of information the positional embeddings supply to the key, query, and value. These gates are learnable parameters, which allows the proposed mechanism to be applied to datasets of any size. Depending on the size of the dataset, the gates learn whether the number of images is sufficient to learn proper positional embeddings. Based on whether the information learned by the positional embeddings is useful or not, the gate parameters either converge to 0 or to some higher value. Furthermore, we propose a Local-Global (LoGo) training strategy, in which we use a shallow global branch and a deep local branch that operates on patches of the medical image. This strategy improves segmentation performance, as we do not only operate on the entire image but also focus on finer details present in the local patches. Finally, we propose Medical Transformer (MedT), which uses our gated position-sensitive axial attention as its building block and adopts our LoGo training strategy.
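As a minimal illustration of the gating idea (a sketch only; the scalar form and unit initialization are our assumptions, and the full attention computation is given in Eq. 3 below), the four gates can be realized as learnable parameters that scale the positional terms before they enter the attention computation:

import torch
import torch.nn as nn

class PositionalGates(nn.Module):
    # Four learnable scalar gates G_Q, G_K, G_V1, G_V2 that weight the relative
    # positional terms supplied to the query, key and value paths. When the
    # positional encodings are not learned reliably, training can drive these
    # gates towards 0, suppressing their influence.
    def __init__(self):
        super().__init__()
        self.g_q = nn.Parameter(torch.ones(1))
        self.g_k = nn.Parameter(torch.ones(1))
        self.g_v1 = nn.Parameter(torch.ones(1))
        self.g_v2 = nn.Parameter(torch.ones(1))

    def forward(self, pos_q, pos_k, v, pos_v):
        # Scale each positional contribution (and the value term) by its gate.
        return self.g_q * pos_q, self.g_k * pos_k, self.g_v1 * v, self.g_v2 * pos_v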
Fig. 2. (a) The main architecture diagram of MedT, which uses the LoGo strategy for training. (b) The gated axial transformer layer used in MedT. (c) The gated axial attention layer, which is the basic building block of both the height and width gated multi-head attention blocks found in the gated axial transformer layer.
where the formulation in Eq. 2 follows the attention model proposed in [24] and r^q, r^k, r^v ∈ R^{W×W} for the width-wise axial attention model. Note that Eq. 2 describes the axial attention applied along the width axis of the tensor. A similar formulation is also used to apply axial attention along the height axis, and together they form a single self-attention model that is computationally efficient.
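For reference, the position-sensitive axial attention of [24] that Eq. 2 refers to takes the following form along the width axis (reproduced in the notation used here):

y_{ij} = \sum_{w=1}^{W} \mathrm{softmax}\left( q_{ij}^{T} k_{iw} + q_{ij}^{T} r^{q}_{iw} + k_{iw}^{T} r^{k}_{iw} \right) \left( v_{iw} + r^{v}_{iw} \right),

where q, k, v denote the projected queries, keys and values, and r^{q}, r^{k}, r^{v} are the learned relative positional encodings.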
Using the proposed gates, the self-attention mechanism applied on the width axis can then be formally written as:

y_{ij} = \sum_{w=1}^{W} \mathrm{softmax}\left( q_{ij}^{T} k_{iw} + G_Q\, q_{ij}^{T} r^{q}_{iw} + G_K\, k_{iw}^{T} r^{k}_{iw} \right) \left( G_{V1}\, v_{iw} + G_{V2}\, r^{v}_{iw} \right),   (3)
where the self-attention formula closely follows Eq. 2 with the added gating mechanism. Here, G_Q, G_K, G_{V1}, G_{V2} ∈ R are learnable parameters, and together they create a gating mechanism that controls the influence the learned relative positional encodings have on encoding non-local context. Typically, if a relative positional encoding is learned accurately, the gating mechanism assigns it a high weight compared to the ones that are not learned accurately. Fig. 2 (c) illustrates the feed-forward pass of a typical gated axial attention layer.
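A minimal single-head PyTorch sketch of this layer along one axis is given below. The tensor shapes, the per-pair dimensioning of the relative positional embeddings (chosen so that the dot products in Eq. 3 are well-defined), and the single-head simplification are our assumptions; MedT uses 8 heads and applies the layer along both the height and the width axes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAxialAttention1D(nn.Module):
    # Single-head sketch of gated axial attention (Eq. 3) along one axis.
    def __init__(self, in_dim: int, dim: int, axis_len: int):
        super().__init__()
        self.to_qkv = nn.Linear(in_dim, 3 * dim, bias=False)
        # Relative positional embeddings r^q, r^k, r^v: one d-dimensional vector
        # per pair of positions along the attended axis (assumed dimensioning).
        self.r_q = nn.Parameter(torch.randn(axis_len, axis_len, dim) * 0.02)
        self.r_k = nn.Parameter(torch.randn(axis_len, axis_len, dim) * 0.02)
        self.r_v = nn.Parameter(torch.randn(axis_len, axis_len, dim) * 0.02)
        # The four learnable gates G_Q, G_K, G_V1, G_V2 of Eq. 3.
        self.g_q = nn.Parameter(torch.ones(1))
        self.g_k = nn.Parameter(torch.ones(1))
        self.g_v1 = nn.Parameter(torch.ones(1))
        self.g_v2 = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # x: (batch, axis_len, in_dim), i.e. one row or one column of the feature map.
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)                # (B, L, d) each
        content = torch.einsum('bid,bjd->bij', q, k)             # q_ij^T k_iw
        pos_q = torch.einsum('bid,ijd->bij', q, self.r_q)        # q_ij^T r^q_iw
        pos_k = torch.einsum('bjd,ijd->bij', k, self.r_k)        # k_iw^T r^k_iw
        attn = F.softmax(content + self.g_q * pos_q + self.g_k * pos_k, dim=-1)
        out_v = torch.einsum('bij,bjd->bid', attn, v)            # weighted values
        out_r = torch.einsum('bij,ijd->bid', attn, self.r_v)     # weighted r^v
        return self.g_v1 * out_v + self.g_v2 * out_r

Applying such a module first over every row and then over every column of a 2-D feature map recovers the full height- and width-wise axial attention described above.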
In the training objective, w and h denote the dimensions of the image, p(x, y) corresponds to the pixel in the image, and p̂(x, y) denotes the output prediction at a specific location (x, y). The training details are provided in the supplementary document.
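The objective compares the prediction p̂ with the ground-truth label p at every pixel. A minimal sketch, assuming a standard pixel-wise binary cross-entropy over sigmoid outputs (the exact objective follows the training details above), is:

import torch
import torch.nn.functional as F

def pixelwise_bce(pred_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Pixel-wise binary cross-entropy averaged over the w x h grid.
    # pred_logits, target: (batch, 1, h, w); target holds {0, 1} mask values.
    # Sketch under the assumption of a standard BCE objective.
    return F.binary_cross_entropy_with_logits(pred_logits, target.float())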
For baseline comparisons, we first run experiments on both convolutional and transformer-based methods. For convolutional baselines, we compare with the fully convolutional network (FCN) [1], U-Net [17], U-Net++ [31], and Res-UNet [27]. For the transformer-based baseline, we use an Axial-Attention U-Net with residual connections inspired by [24]. For our proposed method, we experiment with all the individual contributions. In the gated axial attention network, we use the axial attention U-Net with all of its axial attention layers replaced by the proposed gated axial attention layers. In LoGo, we perform local-global training of the axial attention U-Net without using the gated axial attention layers. In MedT, we use gated axial attention as the basic building block for the global branch and axial attention without positional encoding for the local branch.
3.3 Results
Table 1. Quantitative comparison of the proposed methods with convolutional and transformer-based baselines in terms of F1 and IoU scores.

                                                  Brain US        GlaS            MoNuSeg
  Type               Network                      F1      IoU     F1      IoU     F1      IoU
  Convolutional      FCN [1]                      82.79   75.02   66.61   50.84   28.84   28.71
  Baselines          U-Net [17]                   85.37   79.31   77.78   65.34   79.43   65.99
                     U-Net++ [31]                 86.59   79.95   78.03   65.55   79.49   66.04
                     Res-UNet [27]                87.50   79.61   78.83   65.95   79.49   66.07
  Fully Attention    Axial Attention U-Net [24]   87.92   80.14   76.26   63.03   76.83   62.49
  Baseline
  Proposed           Gated Axial Attn.            88.39   80.7    79.91   67.85   76.44   62.01
                     LoGo                         88.54   80.84   79.68   67.69   79.56   66.17
                     MedT                         88.84   81.34   81.02   69.61   79.55   66.17
For quantitative analysis, we use F1 and IoU scores for comparison. The quantitative results are tabulated in Table 1. It can be noted that for datasets with relatively more images, such as Brain US, the fully attention (transformer) based baseline performs better than the convolutional baselines. For the GlaS and MoNuSeg datasets, the convolutional baselines perform better than the fully attention baseline, as it is difficult to train fully attention models with less data [6]. The proposed method is able to overcome this issue: gated axial attention and LoGo both individually perform better than the other methods. Our final architecture MedT performs better than gated axial attention, LoGo, and all the previous methods. The improvements over the fully attention baseline are 0.92 %, 4.76 %, and 2.72 % for the Brain US, GlaS, and MoNuSeg datasets, respectively. The improvements over the best convolutional baseline are 1.32 %, 2.19 %, and 0.06 %. All of these values are in terms of F1 scores. For the ablation study, we use the Brain US data for all our experiments. The corresponding results are tabulated in Table 2.
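For reference, the F1 (Dice) and IoU scores between a predicted mask and a ground-truth mask can be computed as below; the 0.5 threshold and the epsilon smoothing are illustrative assumptions rather than the exact evaluation code:

import torch

def f1_iou(pred: torch.Tensor, target: torch.Tensor, thr: float = 0.5, eps: float = 1e-7):
    # F1 (Dice) and IoU between a predicted probability map and a binary mask.
    # pred and target must have the same shape.
    p = (pred > thr).float()
    t = (target > 0.5).float()
    inter = (p * t).sum()
    f1 = (2 * inter + eps) / (p.sum() + t.sum() + eps)
    iou = (inter + eps) / (p.sum() + t.sum() - inter + eps)
    return f1.item(), iou.item()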
Furthermore, we visualize the predictions from U-Net [17], Res-UNet [27], Axial Attention U-Net [24], and our proposed method MedT in Fig. 3. It can be seen that the predictions of MedT capture long-range dependencies well. For example, in the second row of Fig. 3, we can observe that the small segmentation mask highlighted by the red box goes undetected in all the convolutional baselines. However, as the fully attention models encode long-range dependencies, they learn to segment it well thanks to the encoded global context. In the first and fourth rows, the other methods make false predictions at the highlighted regions, as those pixels are in close proximity to the segmentation mask. As our method takes into account pixel-wise dependencies that are encoded with the gating mechanism, it is able to learn those dependencies better than the axial attention U-Net. This makes our predictions more precise, as they do not misclassify pixels near the segmentation mask.
Table 2. Ablation study (F1 scores on the Brain US dataset).

  Network    U-Net [17]   Res-UNet [27]   Axial UNet [24]   Gated Axial UNet   Global only   Local only   LoGo    MedT
  F1 Score   85.37        87.5            87.92             88.39              87.67         77.55        88.54   88.84
Fig. 3. Qualitative results on sample test images from the Brain US, GlaS, and MoNuSeg datasets. The red box highlights regions where MedT performs better than the other methods by making better use of long-range dependencies.
4 Conclusion
In this work, we proposed a gated position-sensitive axial attention mechanism, which is used as the building block for multi-head attention models. We also proposed a LoGo training strategy that trains on both the full-resolution image and its patches. The global branch helps learn global context features by modeling long-range dependencies, whereas the local branch focuses on finer features by operating on patches. Using these, we propose MedT (Medical Transformer), which has gated axial attention as the main building block of its encoder and uses the LoGo strategy for training. Unlike other transformer-based models, the proposed method does not require pre-training on large-scale datasets. Finally, we conduct extensive experiments on three datasets, where MedT achieves better performance than ConvNets and other related transformer-based architectures.
Acknowledgment
This work was supported by the NSF grant 1910141.
References
1. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE transactions on pat-
tern analysis and machine intelligence 39(12), 2481–2495 (2017)
2. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou,
Y.: Transunet: Transformers make strong encoders for medical image segmentation.
arXiv preprint arXiv:2102.04306 (2021)
3. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic
image segmentation with deep convolutional nets and fully connected crfs. arXiv
preprint arXiv:1412.7062 (2014)
4. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3d u-net:
learning dense volumetric segmentation from sparse annotation. In: International
conference on medical image computing and computer-assisted intervention. pp.
424–432. Springer (2016)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
6. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is
worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929 (2020)
7. Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T.: Axial attention in multi-
dimensional transformers. arXiv preprint arXiv:1912.12180 (2019)
8. Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.W.,
Wu, J.: Unet 3+: A full-scale connected unet for medical image segmentation. In:
ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). pp. 1055–1059. IEEE (2020)
9. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross
attention for semantic segmentation. In: Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision. pp. 603–612 (2019)
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
11. Kumar, N., Verma, R., Anand, D., Zhou, Y., Onder, O.F., Tsougenis, E., Chen,
H., Heng, P.A., Li, J., Hu, Z., et al.: A multi-organ nucleus segmentation challenge.
IEEE transactions on medical imaging 39(5), 1380–1391 (2019)
12. Kumar, N., Verma, R., Sharma, S., Bhargava, S., Vahadane, A., Sethi, A.: A
dataset and a technique for generalized nuclear segmentation for computational
pathology. IEEE transactions on medical imaging 36(7), 1550–1560 (2017)
13. Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.W., Heng, P.A.: H-denseunet: hybrid
densely connected unet for liver and tumor segmentation from ct volumes. IEEE
transactions on medical imaging 37(12), 2663–2674 (2018)
14. Mehta, S., Mercan, E., Bartlett, J., Weaver, D., Elmore, J.G., Shapiro, L.: Y-
net: joint segmentation and classification for diagnosis of breast biopsy images. In:
International Conference on Medical Image Computing and Computer-Assisted
Intervention. pp. 893–901. Springer (2018)
15. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks
for volumetric medical image segmentation. In: 2016 fourth international confer-
ence on 3D vision (3DV). pp. 565–571. IEEE (2016)
16. Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori,
K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al.: Attention u-net: Learning
where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)
17. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedi-
cal image segmentation. In: International Conference on Medical image computing
and computer-assisted intervention. pp. 234–241. Springer (2015)
18. Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position represen-
tations. In: Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Volume 2 (Short Papers). pp. 464–468 (2018)
19. Sirinukunwattana, K., Pluim, J.P., Chen, H., Qi, X., Heng, P.A., Guo, Y.B., Wang,
L.Y., Matuszewski, B.J., Bruni, E., Sanchez, U., et al.: Gland segmentation in colon
histology images: The glas challenge contest. Medical image analysis 35, 489–502
(2017)
20. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training
data-efficient image transformers & distillation through attention. arXiv preprint
arXiv:2012.12877 (2020)
21. Valanarasu, J.M.J., Sindagi, V.A., Hacihaliloglu, I., Patel, V.M.: Kiu-net: Over-
complete convolutional architectures for biomedical image and volumetric segmen-
tation. arXiv preprint arXiv:2010.01663 (2020)
22. Valanarasu, J.M.J., Sindagi, V.A., Hacihaliloglu, I., Patel, V.M.: Kiu-net: Towards
accurate segmentation of biomedical images using over-complete representations.
In: International Conference on Medical Image Computing and Computer-Assisted
Intervention. pp. 363–373. Springer (2020)
23. Valanarasu, J.M.J., Yasarla, R., Wang, P., Hacihaliloglu, I., Patel, V.M.: Learning
to segment brain anatomy from 2d ultrasound with less data. IEEE Journal of
Selected Topics in Signal Processing 14(6), 1221–1234 (2020)
24. Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.C.: Axial-
deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint
arXiv:2003.07853 (2020)
25. Wang, P., Cuccolo, N.G., Tyagi, R., Hacihaliloglu, I., Patel, V.M.: Automatic real-
time cnn-based neonatal brain ventricles segmentation. In: 2018 IEEE 15th In-
ternational Symposium on Biomedical Imaging (ISBI 2018). pp. 716–719. IEEE
(2018)
26. Wang, X., Han, S., Chen, Y., Gao, D., Vasconcelos, N.: Volumetric attention for 3d
medical image segmentation and detection. In: International Conference on Medi-
cal Image Computing and Computer-Assisted Intervention. pp. 175–184. Springer
(2019)
27. Xiao, X., Lian, S., Luo, Z., Li, S.: Weighted res-unet for high-quality retina vessel
segmentation. In: 2018 9th international conference on information technology in
medicine and education (ITME). pp. 327–331. IEEE (2018)
28. Zhang, Y., Liu, H., Hu, Q.: Transfuse: Fusing transformers and cnns for medical
image segmentation. arXiv preprint arXiv:2102.08005 (2021)
29. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 2881–2890 (2017)
30. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T.,
Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. arXiv preprint arXiv:2012.15840 (2020)
31. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net
architecture for medical image segmentation. In: Deep learning in medical image
analysis and multimodal learning for clinical decision support, pp. 3–11. Springer
(2018)
Supplementary Material for Medical Transformer: Gated Axial-Attention for Medical Image Segmentation
Jeya Maria Jose Valanarasu1, Poojan Oza1, Ilker Hacihaliloglu2, and Vishal M. Patel1
1 Johns Hopkins University, Baltimore, MD, USA
2 Rutgers, The State University of New Jersey, NJ, USA
1 Dataset details
In this section, we describe the datasets that we use in this paper in detail.
The MoNuSeg dataset [11,12] was created using H&E stained tissue images captured at 40x magnification. This dataset is diverse, as it contains images across multiple organs and patients. The training data contains 30 images with around 22000 nuclear boundary annotations. The test data contains 14 images with over 7000 nuclear boundary annotations. We resize the images to 512 × 512 for all our experiments.
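A small sketch of this preprocessing step is given below; the interpolation choices (bilinear for images, nearest-neighbor for masks so the annotations stay binary) are our assumptions:

from PIL import Image

def load_pair(img_path: str, mask_path: str, size=(512, 512)):
    # Resize an image/mask pair to 512 x 512 as done for our experiments.
    img = Image.open(img_path).convert('RGB').resize(size, Image.BILINEAR)
    mask = Image.open(mask_path).convert('L').resize(size, Image.NEAREST)
    return img, mask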
2 MedT details
Medical Transformer (MedT) uses gated axial attention layer as the basic build-
ing block and uses LoGo strategy for training. MedT has two branches - a global
branch and local branch. The input to both of these branches are the feature
maps extracted from an initial conv block. This block has 3 conv layers, each
followed by a batch normalization and ReLU activation. In the encoder of both
branches, we use our proposed transformer layer while in the decoder, we use
a conv block. The encoder bottleneck contains a 1 × 1 conv layer followed by
normalization and two layers of multi-head attention layers where one operates
along height axis and the other along width axis. Each multi-head attention block
is made up of the proposed gated axial attention layer. Note that each multi-
head attention block has 8 gated axial attention heads. The output from the
multi-head attention blocks are concatenated and passed through another 1 × 1
conv which are added to residual input maps to produce the output attention
maps. In each decoder block, we have a conv layer followed by an upsampling
layer and ReLU activation. We also have skip connections between each encoder
and decoder blocks in both the branches.
In the global branch of MedT, we have 2 blocks of encoder and 2 blocks of
decoder. In the local branch, we have 5 blocks of encoder and 5 blocks of decoder.
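A rough sketch of how the two branches can be combined in a forward pass is shown below. The 4 × 4 patch grid, the fusion by addition, and the 1 × 1 prediction head are our assumptions for illustration; the branch modules stand for the gated axial transformer encoder-decoder stacks described above.

import torch
import torch.nn as nn

class LoGoSegmenter(nn.Module):
    # High-level sketch of the LoGo forward pass: a shallow global branch on the
    # full feature map and a deeper local branch on patches, whose outputs are
    # re-assembled and fused into the final segmentation prediction.
    def __init__(self, in_ch: int, feat: int, global_branch: nn.Module,
                 local_branch: nn.Module, grid: int = 4):
        super().__init__()
        self.grid = grid
        # Initial conv block: 3 conv layers, each followed by BN and ReLU.
        layers, c = [], in_ch
        for _ in range(3):
            layers += [nn.Conv2d(c, feat, 3, padding=1),
                       nn.BatchNorm2d(feat), nn.ReLU(inplace=True)]
            c = feat
        self.stem = nn.Sequential(*layers)
        self.global_branch = global_branch   # shallow: 2 encoder + 2 decoder blocks
        self.local_branch = local_branch     # deep: 5 encoder + 5 decoder blocks
        self.head = nn.Conv2d(feat, 1, kernel_size=1)

    def forward(self, x):
        f = self.stem(x)                     # shared feature maps
        g = self.global_branch(f)            # operates on the entire feature map
        # Local branch: split into a grid of patches, process each, re-assemble.
        b, c, h, w = f.shape
        ph, pw = h // self.grid, w // self.grid
        rows = []
        for i in range(self.grid):
            cols = [self.local_branch(f[:, :, i*ph:(i+1)*ph, j*pw:(j+1)*pw])
                    for j in range(self.grid)]
            rows.append(torch.cat(cols, dim=3))
        local = torch.cat(rows, dim=2)
        return self.head(g + local)          # fuse both branches and predict the mask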
3 Training details
We use a batch size of 4, the Adam optimizer [10], and a learning rate of 0.001 for our experiments. The network is trained for 400 epochs. While training the gated axial attention layers, we do not activate the training of the gates for the first 10 epochs. We use an Nvidia Quadro 8000 GPU for all our experiments.
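A minimal sketch of this schedule is given below; model.gate_parameters() is a hypothetical helper that returns the four gate parameters of every gated axial attention layer, and the criterion corresponds to the pixel-wise objective sketched earlier.

import torch

def train(model, loader, criterion, epochs: int = 400, lr: float = 1e-3, gate_warmup: int = 10):
    # Adam with learning rate 0.001, 400 epochs; batch size (4) is set by the loader.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        # Keep the gates frozen for the first 10 epochs, then release them.
        for g in model.gate_parameters():   # hypothetical helper
            g.requires_grad_(epoch >= gate_warmup)
        for images, masks in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()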
4 Analysis
In this section, we present an analysis of some of the parameters and methods used in our proposed approach.
For the ablation study, we use the Brain US data for all our experiments. We first start with a standard U-Net. Then, we add residual connections to the U-Net, making it a Res-UNet. Next, we replace all the convolutional layers in the encoder of Res-UNet with axial attention layers. This configuration is the Axial Attention UNet inspired by [24]. Note that in this configuration we have an additional conv block at the front for feature extraction. Next, we replace all the axial attention layers of the previous configuration with gated axial attention layers. This configuration is denoted as Gated Axial attention. We then experiment with the global branch and the local branch of the LoGo strategy individually. These experiments show that using just 2 layers in the global branch is enough to obtain a decent performance. The local branch in this configuration is tested on the patches extracted from the image. Then, we combine both branches and train the network in an end-to-end fashion, which is denoted as LoGo. Note that in this configuration the attention layers used are plain axial attention layers [24]. Finally, we replace the axial attention layers in LoGo with gated axial attention layers, which leads to MedT. The ablation study shows that each individual component of MedT provides a useful contribution to improving the performance.
Ablation study of the individual components (F1 scores on the Brain US dataset):

  Network    U-Net [17]   Res-UNet [27]   Axial UNet [24]   Gated Axial UNet   Global only   Local only   LoGo    MedT
  F1 Score   85.37        87.5            87.92             88.39              87.67         77.55        88.54   88.84
Number of parameters and F1 scores (Brain US) for the compared networks:

  Network      FCN [1]   U-Net [17] (mod)   U-Net [17]   Res-UNet [27] (mod)   Res-UNet [27]   Axial UNet [24]   Gated Axial UNet   MedT
  Parameters   12.5 M    3.13 M             1.3 M        5.32 M                1.34 M          1.3 M             1.3 M              1.4 M
  F1 Score     82.79     87.71              85.37        87.73                 87.5            87.92             88.39              88.84
5 Results
Fig. 1. Qualitative Results. The red box highlights the regions where our proposed
method outperforms the convolutional baselines.
6 Concurrent works
Very recently, TransUNet [2] was proposed, which uses a transformer-based encoder operating on sequences of image patches and a convolutional decoder with skip connections for medical image segmentation. As TransUNet is inspired by ViT [6], it also relies on pre-training over large-scale datasets, whereas our proposed method does not.