Abstract
The goal of this work is to develop a task-agnostic feature upsampling operator for dense prediction, where the operator is required to facilitate not only region-sensitive tasks like semantic segmentation but also detail-sensitive tasks such as image matting. Prior upsampling operators often work well on one type of task but not the other. We argue that task-agnostic upsampling should dynamically trade off between semantic preservation and detail delineation, instead of being biased toward one of the two properties. In this paper, we present FADE, a novel, plug-and-play, lightweight, and task-agnostic upsampling operator that fuses the assets of decoder and encoder features at three levels: (i) considering both encoder and decoder features in upsampling kernel generation; (ii) controlling the per-point contribution of the encoder/decoder features in upsampling kernels with an efficient semi-shift convolutional operator; and (iii) enabling the selective pass of encoder features with a decoder-dependent gating mechanism to compensate for details. To improve the practicality of FADE, we additionally study parameter- and memory-efficient implementations of semi-shift convolution. We analyze the upsampling behavior of FADE on toy data and show through large-scale experiments that FADE is task-agnostic, yielding consistent performance improvements on a number of dense prediction tasks at little extra cost. For the first time, we demonstrate robust feature upsampling on both region- and detail-sensitive tasks. Code is made available at: https://github.com/poppinace/fade
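To make the gating idea in (iii) concrete, the following minimal sketch (not the released FADE implementation) shows how a decoder-dependent gate can blend high-resolution encoder detail into an upsampled decoder feature. The 1x1 gate predictor, channel sizes, and the use of bilinear interpolation in place of FADE's semi-shift-generated upsampling kernels are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Sketch of decoder-dependent gating: a per-point gate predicted from the
    (upsampled) decoder feature decides how much encoder detail passes through."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, decoder_feat: torch.Tensor, encoder_feat: torch.Tensor) -> torch.Tensor:
        # upsample the decoder feature to the encoder resolution
        # (FADE would use its semi-shift-generated kernels here)
        up = F.interpolate(decoder_feat, size=encoder_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        g = self.gate(up)  # per-point gate in [0, 1]
        # selective pass of encoder detail, controlled by the decoder
        return g * encoder_feat + (1.0 - g) * up

# toy usage: x2 upsampling of a 256-channel decoder feature
dec = torch.randn(1, 256, 60, 60)
enc = torch.randn(1, 256, 120, 120)
print(GatedFusion(256)(dec, enc).shape)  # torch.Size([1, 256, 120, 120])
```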
References
Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481–2495.
Borenstein, E., & Ullman, S. (2002). Class-specific, top-down segmentation. In: Proceedings of the European Conference on Computer Vision, pp. 109–122. Springer.
Bulo, S.R., Porzi, L., & Kontschieder, P. (2018). In-place activated batchnorm for memory-optimized training of DNNs. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 5639–5647.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9650–9660.
Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision, pp. 801–818.
Cheng, B., Girshick, R., Dollár, P., Berg, A.C., & Kirillov, A. (2021). Boundary IoU: Improving object-centric image segmentation evaluation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 15334–15342.
Cheng, T., Wang, X., Huang, L., & Liu, W. (2020). Boundary-preserving Mask R-CNN. In: Proceedings of the European Conference on Computer Vision, pp. 660–676. Springer.
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv Comput Res Repository.
Dai, Y., Lu, H., & Shen, C. (2021). Learning affinity-aware upsampling for deep image matting. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 6841–6850.
Dong, C., Loy, C. C., He, K., & Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 295–307.
Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In: Annual Conference on Neural Information Processing Systems, pp. 2366–2374.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 580–587.
He, K., Sun, J., & Tang, X. (2010). Guided image filtering. In: Proceedings of the European Conference on Computer Vision, pp. 1–14. Springer.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 770–778.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In: Proceedings of the IEEE conference on computer vision, pp. 2961–2969.
Huang, S., Lu, Z., Cheng, R., & He, C. (2021). Fapn: Feature-aligned pyramid network for dense image prediction. In: Proceedings of the IEEE conference on computer vision, pp. 864–873.
Ignatov, A., Timofte, R., Denna, M., & Younes, A. (2021). Real-time quantized image super-resolution on mobile NPUs, Mobile AI 2021 challenge: Report. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, Workshops, pp. 2525–2534.
Kirillov, A., Wu, Y., He, K., & Girshick, R. (2020). Pointrend: Image segmentation as rendering. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 9799–9808.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., & Girshick, R. (2023). Segment anything. In: Proceedings of the IEEE conference on computer vision, pp. 4015–4026.
Lee, J.H., Han, M.K., Ko, D.W., & Suh, I.H. (2019). From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv Comput Res Repository.
Li, X., Li, X., Zhang, L., Cheng, G., Shi, J., Lin, Z., Tan, S., & Tong, Y. (2020). Improving semantic segmentation via decoupled body and edge supervision. In: Proceedings of the European Conference on Computer Vision, pp. 435–452. Springer.
Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tan, S., & Tong, Y. (2020). Semantic flow for fast and accurate scene parsing. In: Proceedings of the European Conference on Computer Vision, pp. 775–793. Springer.
Li, X., Zhao, H., Han, L., Tong, Y., Tan, S., & Yang, K. (2020). Gated fully fusion for semantic segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 11418–11425.
Li, X., Zhang, J., Yang, Y., Cheng, G., Yang, K., Tong, Y., & Tao, D. (2023). SFNet: Faster and accurate semantic segmentation via semantic flow. International Journal of Computer Vision, 132(2), 1–24.
Lin, G., Milan, A., Shen, C., & Reid, I. (2017a). RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 1925–1934.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755.
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017b). Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2117–2125.
Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 8759–8768.
Liu, Y., Li, J., Pang, Y., Nie, D., & Yap, P.T. (2023). The devil is in the upsampling: Architectural decisions made simpler for denoising with deep image prior. In: Proceedings of the IEEE conference on computer vision, pp. 12408–12417.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 3431–3440.
Lu, H., Dai, Y., Shen, C., & Xu, S. (2019). Indices matter: Learning to index for deep image matting. In: Proceedings of the IEEE conference on computer vision, pp. 3266–3275.
Lu, H., Dai, Y., Shen, C., & Xu, S. (2022). Index networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1), 242–255.
Lu, H., Liu, W., Fu, H., & Cao, Z. (2022b). FADE: Fusing the assets of decoder and encoder for task-agnostic upsampling. In: Proceedings of the European Conference on Computer Vision, pp. 231–247.
Lu, H., Liu, W., Ye, Z., Fu, H., Liu, Y., & Cao, Z. (2022c). SAPA: Similarity-aware point affiliation for feature upsampling. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 20889–20901.
Mao, X., Shen, C., & Yang, Y.B. (2016). Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 2802–2810.
Mazzini, D. (2018). Guided upsampling network for real-time semantic segmentation. In: Proceedings of British Machine Vision Conference (BMVC), pp. 1–12.
Niklaus, S., Mai, L., Yang, J., & Liu, F. (2019). 3D ken burns effect from a single image. ACM Transactions on Graphics, 38(6), 1–15.
Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and checkerboard artifacts. Distill, 1(10), e3.
Peng, J., Cao, Z., Luo, X., Lu, H., Xian, K., & Zhang, J. (2022). BokehMe: When neural rendering meets classical rendering. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 16283–16292.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In: Proceedings of the International Conference on Machine Learning, pp. 8748–8763. PMLR.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Annual Conference on Neural Information Processing Systems 28.
Rhemann, C., Rother, C., Wang, J., Gelautz, M., Kohli, P., & Rott, P. (2009). A perceptually motivated online benchmark for image matting. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 1826–1833.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 10684–10695.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., & Wang, Z. (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 1874–1883.
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In: Proceedings of the European Conference on Computer Vision, pp. 746–760.
Song, S., Lichtenberg, S.P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 567–576.
Takikawa, T., Acuna, D., Jampani, V., & Fidler, S. (2019). Gated-SCNN: Gated shape CNNs for semantic segmentation. In: Proceedings of the IEEE conference on computer vision, pp. 5229–5238.
Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. Proceedings of the International Conference on Machine Learning, 97, 6105–6114.
Tang, C., Chen, H., Li, X., Li, J., Zhang, Z., & Hu, X. (2021). Look closer to segment better: Boundary patch refinement for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 13926–13935.
Teed, Z., & Deng, J. (2020). RAFT: Recurrent all-pairs field transforms for optical flow. In: Proceedings of the European Conference on Computer Vision, pp. 402–419. Springer.
Tian, Z., He, T., Shen, C., & Yan, Y. (2019). Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 3126–3135.
Tomasi, C., & Manduchi, R. (1998). Bilateral filtering for gray and color images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 839–846.
Wang, J., Chen, K., Xu, R., Liu, Z., Loy, C.C., & Lin, D. (2019). CARAFE: Context-aware reassembly of features. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3007–3016.
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al. (2020). Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10), 3349–3364.
Wang, J., Chen, K., Xu, R., Liu, Z., Loy, C. C., & Lin, D. (2021). CARAFE++: Unified Content-Aware ReAssembly of FEatures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 4674–4687.
Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 7794–7803.
Wu, J., Pan, Z., Lei, B., & Hu, Y. (2022). FSANet: Feature-and-spatial-aligned network for tiny object detection in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–17.
Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., & Luo, Z. (2018). Monocular relative depth perception with web stereo data supervision. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 311–320.
Xiao, H., Rasul, K., & Vollgraf, R. (2017), Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv Comput Res Repository.
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision, pp. 418–434.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., & Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 12077–12090.
Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403.
Xu, N., Price, B., Cohen, S., & Huang, T. (2017). Deep image matting. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2970–2979.
Yuan, Y., Huang, L., Guo, J., Zhang, C., Chen, X., & Wang, J. (2021). OCNet: Object context for semantic segmentation. International Journal of Computer Vision, 129(8), 2375–2398.
Zeiler, M.D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In: Proceedings of the European Conference on Computer Vision, pp. 818–833.
Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 12104–12113.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2881–2890.
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., & Torr, P.H., et al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 6881–6890.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 2921–2929.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. (2017). Scene parsing through ADE20K dataset. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 633–641.
Funding
This work is supported by the National Natural Science Foundation of China under Grant No. 62106080 and the Hubei Provincial Natural Science Foundation of China under Grant No. 2024AFB566.
Additional information
Communicated by Wanli Ouyang.
Comparison of Computational Complexity
A favorable upsampling operator, as part of the overall network architecture, should not significantly increase the computational cost. This issue is not well addressed in IndexNet, which introduces many parameters and considerable computational overhead (Lu et al., 2019). In this part, we analyze the computational workload and memory occupation of different dynamic upsampling operators. We first compare the FLOPs and the number of parameters in Table 11. FADE requires more FLOPs than CARAFE (note that FADE processes 5 times more feature data than CARAFE), but fewer parameters when the number of channels is small. For example, when \(C=256\), \(d=64\), \(K=5\), and \(H=W=112\), CARAFE and FADE cost 2.50 and 4.56 GFLOPs, respectively; their numbers of parameters are 74 K and 47 K, respectively. FADE-Lite, in the same setting, costs only 1.53 GFLOPs and 13 K parameters. We also test the inference speed by upsampling a random feature map of size \(256\times 120 \times 120\) (a guiding feature map of size \(256\times 240\times 240\) is used if required). The inference time is reported in Table 12. Among the compared dynamic upsampling operators, FADE and FADE-Lite are relatively efficient given that they process five times more data than CARAFE. We also test the practical memory occupation of FADE on SegFormer-B1 (Xie et al., 2021), which has 6 upsampling stages. Under the default training setting, SegFormer-B1 with bilinear upsampling costs 22,157 MB of GPU memory. With the H2L implementation of FADE, it consumes 24,879 MB, i.e., 2722 MB more than the original. The L2H implementation reduces the extra memory cost by \(24.2\%\) (from 2722 to 2064 MB), which is within an acceptable range compared with the decoder-only upsampling operator CARAFE (664 MB) once the five times more data is taken into account.
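As a sanity check on the figures above, the CARAFE entry of Table 11 can be reproduced with a short back-of-envelope script. The sketch below assumes CARAFE's standard design (a 1x1 channel compressor, a 3x3 content encoder predicting the \(\sigma^2 K^2\) kernel channels, and \(K\times K\) reassembly for x2 upsampling), counts one multiply-add as two FLOPs, and ignores biases and softmax; `carafe_cost` is a hypothetical helper, not code from the released repository, and FADE's count is omitted here since it additionally depends on the chosen semi-shift implementation.

```python
# Back-of-envelope estimate of CARAFE's cost, used to cross-check the
# 74 K parameters / 2.50 GFLOPs quoted above (sketch under stated assumptions).

def carafe_cost(C=256, d=64, K=5, H=112, W=112, sigma=2, k_enc=3):
    """Return (params, flops) of CARAFE: channel compressor + content encoder
    + feature reassembly; softmax and channel shuffling are ignored."""
    # 1x1 channel compressor: C -> d
    p_comp = C * d
    f_comp = 2 * H * W * p_comp
    # k_enc x k_enc content encoder: d -> sigma^2 * K^2 kernel channels
    p_enc = d * (sigma ** 2) * (K ** 2) * k_enc ** 2
    f_enc = 2 * H * W * p_enc
    # reassembly: every upsampled position gathers a KxK window per channel
    f_asm = 2 * (sigma * H) * (sigma * W) * C * K ** 2
    return p_comp + p_enc, f_comp + f_enc + f_asm

params, flops = carafe_cost()
print(f"CARAFE: {params / 1e3:.0f} K params, {flops / 1e9:.2f} GFLOPs")
# -> roughly 74 K params and 2.50 GFLOPs, matching Table 11
```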
About this article
Cite this article
Lu, H., Liu, W., Fu, H. et al. FADE: A Task-Agnostic Upsampling Operator for Encoder–Decoder Architectures. Int J Comput Vis 133, 151–172 (2025). https://doi.org/10.1007/s11263-024-02191-8