Abstract
The goal of this work is to develop a task-agnostic feature upsampling operator for dense prediction, where the operator is required to facilitate not only region-sensitive tasks like semantic segmentation but also detail-sensitive tasks such as image matting. Prior upsampling operators often work well on one type of task but not the other. We argue that task-agnostic upsampling should dynamically trade off between semantic preservation and detail delineation, instead of being biased toward one of the two properties. In this paper, we present FADE, a novel, plug-and-play, lightweight, and task-agnostic upsampling operator that fuses the assets of decoder and encoder features at three levels: (i) considering both encoder and decoder features in upsampling kernel generation; (ii) controlling the per-point contribution of the encoder/decoder features in upsampling kernels with an efficient semi-shift convolutional operator; and (iii) enabling the selective pass of encoder features with a decoder-dependent gating mechanism to compensate for details. To improve the practicality of FADE, we additionally study parameter- and memory-efficient implementations of semi-shift convolution. We analyze the upsampling behavior of FADE on toy data and show through large-scale experiments that FADE is task-agnostic, yielding consistent performance improvements on a number of dense prediction tasks at little extra cost. For the first time, we demonstrate robust feature upsampling on both region- and detail-sensitive tasks. Code is made available at: https://github.com/poppinace/fade
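To make the gating idea in (iii) concrete, the following minimal sketch (not the released FADE implementation) shows how a decoder-dependent gate can blend high-resolution encoder detail into an upsampled decoder feature. The 1x1 gate predictor, channel sizes, and the use of bilinear interpolation in place of FADE's semi-shift-generated upsampling kernels are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Sketch of decoder-dependent gating: a per-point gate predicted from the
    (upsampled) decoder feature decides how much encoder detail passes through."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, decoder_feat: torch.Tensor, encoder_feat: torch.Tensor) -> torch.Tensor:
        # upsample the decoder feature to the encoder resolution
        # (FADE would use its semi-shift-generated kernels here)
        up = F.interpolate(decoder_feat, size=encoder_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        g = self.gate(up)  # per-point gate in [0, 1]
        # selective pass of encoder detail, controlled by the decoder
        return g * encoder_feat + (1.0 - g) * up

# toy usage: x2 upsampling of a 256-channel decoder feature
dec = torch.randn(1, 256, 60, 60)
enc = torch.randn(1, 256, 120, 120)
print(GatedFusion(256)(dec, enc).shape)  # torch.Size([1, 256, 120, 120])
```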
References
Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495.
Borenstein, E., & Ullman, S. (2002). Class-specific, top-down segmentation. In: Proceedings of the European Conference on Computer Vision, pp. 109–122. Springer.
Bulo, S.R., Porzi, L., & Kontschieder, P. (2018). In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 5639–5647.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9650–9660.
Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision, pp. 801–818.
Cheng, B., Girshick, R., Dollár, P., Berg, A.C., & Kirillov, A. (2021). Boundary IoU: Improving object-centric image segmentation evaluation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 15334–15342.
Cheng, T., Wang, X., Huang, L., & Liu, W. (2020). Boundary-preserving Mask R-CNN. In: Proceedings of the European Conference on Computer Vision, pp. 660–676. Springer.
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv Comput Res Repository.
Dai, Y., Lu, H., & Shen, C. (2021). Learning affinity-aware upsampling for deep image matting. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 6841–6850.
Dong, C., Loy, C. C., He, K., & Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 295–307.
Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In: Annual Conference on Neural Information Processing Systems, pp. 2366–2374.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 580–587.
He, K., Sun, J., & Tang, X. (2010). Guided image filtering. In: Proceedings of the European Conference on Computer Vision, pp. 1–14. Springer.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 770–778.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In: Proceedings of the IEEE conference on computer vision, pp. 2961–2969.
Huang, S., Lu, Z., Cheng, R., & He, C. (2021). Fapn: Feature-aligned pyramid network for dense image prediction. In: Proceedings of the IEEE conference on computer vision, pp. 864–873.
Ignatov, A., Timofte, R., Denna, M., & Younes, A. (2021). Real-time quantized image super-resolution on mobile NPUs, Mobile AI 2021 challenge: Report. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, Workshops, pp. 2525–2534.
Kirillov, A., Wu, Y., He, K., & Girshick, R. (2020). Pointrend: Image segmentation as rendering. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 9799–9808.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., & Girshick, R. (2023). Segment anything. In: Proceedings of the IEEE conference on computer vision, pp. 4015–4026.
Lee, J.H., Han, M.K., Ko, D.W., & Suh, I.H. (2019). From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv Comput Res Repository.
Li, X., Li, X., Zhang, L., Cheng, G., Shi, J., Lin, Z., Tan, S., & Tong, Y. (2020). Improving semantic segmentation via decoupled body and edge supervision. In: Proceedings of the European Conference on Computer Vision, pp. 435–452. Springer.
Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tan, S., & Tong, Y. (2020). Semantic flow for fast and accurate scene parsing. In: Proceedings of the European Conference on Computer Vision, pp. 775–793. Springer.
Li, X., Zhao, H., Han, L., Tong, Y., Tan, S., & Yang, K. (2020). Gated fully fusion for semantic segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 11418–11425.
Li, X., Zhang, J., Yang, Y., Cheng, G., Yang, K., Tong, Y., & Tao, D. (2023). SFNet: Faster and accurate semantic segmentation via semantic flow. International Journal of Computer Vision, 132(2), 1–24.
Lin, G., Milan, A., Shen, C., & Reid, I. (2017a). RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 1925–1934.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755.
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017b). Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2117–2125.
Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 8759–8768.
Liu, Y., Li, J., Pang, Y., Nie, D., & Yap, P.T. (2023). The devil is in the upsampling: Architectural decisions made simpler for denoising with deep image prior. In: Proceedings of the IEEE conference on computer vision, pp. 12408–12417.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 3431–3440.
Lu, H., Dai, Y., Shen, C., & Xu, S. (2019). Indices matter: Learning to index for deep image matting. In: Proceedings of the IEEE conference on computer vision, pp. 3266–3275.
Lu, H., Dai, Y., Shen, C., & Xu, S. (2022). Index networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1), 242–255.
Lu, H., Liu, W., Fu, H., & Cao, Z. (2022b). FADE: Fusing the assets of decoder and encoder for task-agnostic upsampling. In: Proceedings of the European Conference on Computer Vision, pp. 231–247.
Lu, H., Liu, W., Ye, Z., Fu, H., Liu, Y., & Cao, Z. (2022c). SAPA: Similarity-aware point affiliation for feature upsampling. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 20889–20901.
Mao, X., Shen, C., & Yang, Y.B. (2016). Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 2802–2810.
Mazzini, D. (2018). Guided upsampling network for real-time semantic segmentation. In: Proceedings of British Machine Vision Conference (BMVC), pp. 1–12.
Niklaus, S., Mai, L., Yang, J., & Liu, F. (2019). 3D ken burns effect from a single image. ACM Transactions on Graphics, 38(6), 1–15.
Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and checkerboard artifacts. Distill, 1(10), e3.
Peng, J., Cao, Z., Luo, X., Lu, H., Xian, K., & Zhang, J. (2022). BokehMe: When neural rendering meets classical rendering. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 16283–16292.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning, pp. 8748–8763. PMLR.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Annual Conference on Neural Information Processing Systems 28.
Rhemann, C., Rother, C., Wang, J., Gelautz, M., Kohli, P., & Rott, P. (2009). A perceptually motivated online benchmark for image matting. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 1826–1833.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 10684–10695.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., & Wang, Z. (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 1874–1883.
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In: Proceedings of the European Conference on Computer Vision, pp. 746–760.
Song, S., Lichtenberg, S.P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 567–576.
Takikawa, T., Acuna, D., Jampani, V., & Fidler, S. (2019). Gated-SCNN: Gated shape CNNs for semantic segmentation. In: Proceedings of the IEEE conference on computer vision, pp. 5229–5238.
Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. Proceedings of the International Conference on Machine Learning, 97, 6105–6114.
Tang, C., Chen, H., Li, X., Li, J., Zhang, Z., & Hu, X. (2021). Look closer to segment better: Boundary patch refinement for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 13926–13935.
Teed, Z., & Deng, J. (2020). RAFT: Recurrent all-pairs field transforms for optical flow. In: Proceedings of the European Conference on Computer Vision, pp. 402–419. Springer.
Tian, Z., He, T., Shen, C., & Yan, Y. (2019). Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 3126–3135.
Tomasi, C., & Manduchi, R. (1998). Bilateral filtering for gray and color images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 839–846.
Wang, J., Chen, K., Xu, R., Liu, Z., Loy, C.C., & Lin, D. (2019). CARAFE: Context-aware reassembly of features. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3007–3016.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.
Wang, J., Chen, K., Xu, R., Liu, Z., Loy, C. C., & Lin, D. (2021). CARAFE++: Unified Content-Aware ReAssembly of FEatures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 4674–4687.
Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 7794–7803.
Wu, J., Pan, Z., Lei, B., & Hu, Y. (2022). FSANet: Feature-and-spatial-aligned network for tiny object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–17.
Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., & Luo, Z. (2018). Monocular relative depth perception with web stereo data supervision. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 311–320.
Xiao, H., Rasul, K., & Vollgraf, R. (2017), Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv Comput Res Repository.
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision, pp. 418–434.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., & Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 12077–12090.
Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403.
Xu, N., Price, B., Cohen, S., & Huang, T. (2017). Deep image matting. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2970–2979.
Yuan, Y., Huang, L., Guo, J., Zhang, C., Chen, X., & Wang, J. (2021). OCNet: Object context for semantic segmentation. International Journal of Computer Vision, 129(8), 2375–2398.
Zeiler, M.D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: Proceedings of the European Conference on Computer Vision, pp. 818–833.
Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 12104–12113.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2881–2890.
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., & Torr, P.H., et al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 6881–6890.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2921–2929.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 633–641.
Funding
This work is supported by the National Natural Science Foundation of China under Grant No. 62106080 and the Hubei Provincial Natural Science Foundation of China under Grant No. 2024AFB566.
Additional information
Communicated by Wanli Ouyang.
Comparison of Computational Complexity
A favorable upsampling operator, as part of the overall network architecture, should not significantly increase the computational cost. This issue is not well addressed in IndexNet, which introduces many parameters and considerable computational overhead (Lu et al., 2019). In this part, we analyze the computational workload and memory occupation of different dynamic upsampling operators. We first compare the FLOPs and the number of parameters in Table 11. FADE requires more FLOPs than CARAFE (note that FADE processes 5 times more feature data than CARAFE), but fewer parameters when the number of channels is small. For example, when \(C=256\), \(d=64\), \(K=5\), and \(H=W=112\), CARAFE and FADE cost 2.50 and 4.56 GFLOPs, respectively; their numbers of parameters are 74 K and 47 K, respectively. FADE-Lite, in the same setting, costs only 1.53 GFLOPs and 13 K parameters. We also test the inference speed by upsampling a random feature map of size \(256\times 120 \times 120\) (a guiding feature map of size \(256\times 240\times 240\) is used if required). The inference time is reported in Table 12. Among the compared dynamic upsampling operators, FADE and FADE-Lite are relatively efficient given that they process five times more data than CARAFE. We also test the practical memory occupation of FADE on SegFormer-B1 (Xie et al., 2021), which has 6 upsampling stages. Under the default training setting, SegFormer-B1 with bilinear upsampling costs 22,157 MB of GPU memory. With the H2L implementation of FADE, it consumes 24,879 MB, i.e., 2722 MB more than the original. The L2H implementation reduces the extra memory cost by \(24.2\%\) (from 2722 to 2064 MB), which is within an acceptable range compared with the decoder-only upsampling operator CARAFE (664 MB) once the five times more data is taken into account.
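As a sanity check on the figures above, the CARAFE entry of Table 11 can be reproduced with a short back-of-envelope script. The sketch below assumes CARAFE's standard design (a 1x1 channel compressor, a 3x3 content encoder predicting the \(\sigma^2 K^2\) kernel channels, and \(K\times K\) reassembly for x2 upsampling), counts one multiply-add as two FLOPs, and ignores biases and softmax; `carafe_cost` is a hypothetical helper, not code from the released repository, and FADE's count is omitted here since it additionally depends on the chosen semi-shift implementation.

```python
# Back-of-envelope estimate of CARAFE's cost, used to cross-check the
# 74 K parameters / 2.50 GFLOPs quoted above (sketch under stated assumptions).

def carafe_cost(C=256, d=64, K=5, H=112, W=112, sigma=2, k_enc=3):
    """Return (params, flops) of CARAFE: channel compressor + content encoder
    + feature reassembly; softmax and channel shuffling are ignored."""
    # 1x1 channel compressor: C -> d
    p_comp = C * d
    f_comp = 2 * H * W * p_comp
    # k_enc x k_enc content encoder: d -> sigma^2 * K^2 kernel channels
    p_enc = d * (sigma ** 2) * (K ** 2) * k_enc ** 2
    f_enc = 2 * H * W * p_enc
    # reassembly: every upsampled position gathers a KxK window per channel
    f_asm = 2 * (sigma * H) * (sigma * W) * C * K ** 2
    return p_comp + p_enc, f_comp + f_enc + f_asm

params, flops = carafe_cost()
print(f"CARAFE: {params / 1e3:.0f} K params, {flops / 1e9:.2f} GFLOPs")
# -> roughly 74 K params and 2.50 GFLOPs, matching Table 11
```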
About this article
Cite this article
Lu, H., Liu, W., Fu, H. et al. FADE: A Task-Agnostic Upsampling Operator for Encoder–Decoder Architectures. Int J Comput Vis 133, 151–172 (2025). https://doi.org/10.1007/s11263-024-02191-8