(NIPS23) Scattering Transformation For ViT Matters
Abstract
Vision transformers have gained significant attention and achieved state-of-the-
art performance in various computer vision tasks, including image classification,
instance segmentation, and object detection. However, challenges remain in address-
ing attention complexity and effectively capturing fine-grained information within
images. Existing solutions often resort to down-sampling operations, such as pool-
ing, to reduce computational cost. Unfortunately, such operations are non-invertible
and can result in information loss. In this paper, we present a novel approach called
Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates
a spectrally scattering network that enables the capture of intricate image details.
SVT overcomes the invertibility issue associated with down-sampling operations
by separating low-frequency and high-frequency components. Furthermore, SVT
introduces a unique spectral gating network utilizing Einstein multiplication for
token and channel mixing, effectively reducing complexity. We show that SVT
achieves state-of-the-art performance on the ImageNet dataset with a significant
reduction in the number of parameters and FLOPS. SVT shows a 2% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2% top-1 accuracy, while SVT-H-B reaches 85.2% (state of the art for base versions) and SVT-H-L reaches 85.7% (again state of the art for large versions). SVT also shows comparable results in other vision
tasks such as instance segmentation. SVT also outperforms other transformers
in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford
Flower, and Stanford Car datasets. The project page is available on this webpage
(https://badripatro.github.io/svt/).
1 Introduction
In recent years, there has been a remarkable surge in the interest and adoption of Large Language
Models (LLMs), driven by the release and success of prominent models such as GPT-3, ChatGPT [1],
and Palm [9]. These LLMs have achieved significant breakthroughs in the field of Natural Language
Processing (NLP). Building upon their successes, subsequent research endeavors have extended the
language transformer paradigm to diverse domains including computer vision, speech recognition,
video processing, and even climate and weather prediction. In this paper, we specifically focus on
exploring the potential of LLMs for vision-related tasks. By leveraging the power of these language
models, we aim to push the boundaries of vision applications and investigate their capabilities in
addressing complex vision challenges.
Several adaptations of transformers have been introduced in the field of computer vision for various
tasks. For image classification, notable vision transformers include ViT [14], DeiT [61], PVT [66], Swin [41], Twins [10], and CSWin transformers [13]. These vision transformers improved the
performance of image classification tasks significantly compared to Convolutional Neural Networks
(CNNs) such as ResNets and RegNets, as discussed in efficient vision transformer research work [47].
This breakthrough in computer vision has led to state-of-the-art results in various vision tasks, including image segmentation, with models such as SegFormer [71], TopFormer [82], and SegViT [79], and object detection.
Figure 1: This figure illustrates the architectural details of the SVT model with a Scatter and Attention
Layer structure. The Scatter Layer comprises a Scattering Transformation that processes Low-
Frequency (LF) and High-Frequency (HF) components. Subsequently, we apply the Tensor and
Einstein Blending Method to obtain Low-Frequency Representation (LFR) and High-Frequency
Representation (HFR), as depicted in the figure. Finally, we apply the Inverse Scattering transformation
using LFR and HFR.
2 Method
2.1 Background: Overview of DTCWT and Decoupling of Low & High Frequencies
The Discrete Wavelet Transform (DWT) replaces the infinitely oscillating sinusoidal basis functions with a set of locally oscillating basis functions known as wavelets [54, 29]. A wavelet representation combines a low-pass scaling function φ(t) with shifted and dilated versions of a band-pass wavelet function ψ(t). It can be represented mathematically as:
$$x(t) = \sum_{n=-\infty}^{\infty} c(n)\,\phi(t-n) + \sum_{j=0}^{\infty}\sum_{n=-\infty}^{\infty} d(j,n)\,2^{j/2}\,\psi(2^{j}t-n). \qquad (1)$$
where c(n) are the scaling coefficients and d(j, n) are the wavelet coefficients. These coefficients are computed by taking the inner product of the input x(t) with the scaling function φ(t) and the wavelet function ψ(t):
$$c(n) = \int_{-\infty}^{\infty} x(t)\,\phi(t-n)\,dt, \qquad d(j,n) = 2^{j/2}\int_{-\infty}^{\infty} x(t)\,\psi(2^{j}t-n)\,dt. \qquad (2)$$
The DWT suffers from the following issues: oscillations, shift variance, aliasing, and lack of directionality. One solution to these problems is the Complex Wavelet Transform (CWT), which uses complex-valued scaling and wavelet functions. The Dual-Tree Complex Wavelet Transform (DT-CWT) addresses the practical issues of the CWT. The DT-CWT [30, 28, 29] comes very close to mirroring the attractive properties of the Fourier Transform: a smooth, non-oscillating magnitude; a nearly shift-invariant magnitude with a simple near-linear phase encoding of signal shifts; substantially reduced aliasing; and better directional selectivity of wavelets in higher dimensions. This makes it easier to detect edges and oriented features in images. The six orientations of the wavelet transformation are given by 15°, 45°, 75°, 105°, 135°, and 165°. The dual-tree CWT employs two real DWTs: the first DWT gives the real part of the transform, while the second DWT gives the imaginary part. The two real DWTs use two different sets of filters, which are jointly designed to approximate the overall complex wavelet transform and to satisfy the perfect reconstruction (PR) conditions.
Let ℎ0 (𝑛), ℎ1 (𝑛) denote the low-pass and high-pass filter pair in the upper band, while 𝑔0 (𝑛), 𝑔1 (𝑛)
denote the same for the lower band. Two real wavelets are associated with each of the two real wavelet
transforms, denoted ψ_h(t) and ψ_g(t). The complex wavelet ψ(t) := ψ_h(t) + jψ_g(t) can be approximated using the half-sample delay condition [53], i.e., ψ_g(t) is approximately the Hilbert transform of ψ_h(t):
$$g_0(n) \approx h_0(n - 0.5) \;\Rightarrow\; \psi_g(t) \approx \mathcal{H}\{\psi_h(t)\}, \qquad \psi_h(t) = \sqrt{2}\sum_{n} h_1(n)\,\phi_h(2t-n), \qquad \phi_h(t) = \sqrt{2}\sum_{n} h_0(n)\,\phi_h(2t-n)$$
Similarly, we can define ψ_g(t), φ_g(t), and g_1(n). Since the filters are real, no complex arithmetic is required to implement the DTCWT. It is only two times more expansive in 1D because the total output data rate is exactly twice the input data rate. It is also easy to invert, as the two separate DWTs can each be inverted. In contrast, with the Fourier Transform it is difficult to obtain the low-pass and high-pass components of an image, the transform is less invertible in practice (the loss is higher when we apply a Fourier and inverse Fourier transform) compared to the DTCWT, and it cannot localize in time and frequency simultaneously.
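To make the near-perfect invertibility discussed above concrete, the following minimal sketch decomposes a dummy image-sized tensor with a one-level DTCWT and reconstructs it. It assumes the third-party pytorch_wavelets package (not part of the SVT release), and the exact coefficient layout follows that library's conventions.

```python
import torch
from pytorch_wavelets import DTCWTForward, DTCWTInverse  # assumed third-party package

# One-level dual-tree complex wavelet transform and its inverse (default filter banks).
xfm = DTCWTForward(J=1)
ifm = DTCWTInverse()

x = torch.randn(1, 3, 224, 224)       # dummy image-sized input
yl, yh = xfm(x)                       # yl: low-pass (scaling) part, yh: high-pass (wavelet) parts
x_hat = ifm((yl, yh))                 # reconstruct the input from the two parts

print(torch.mean((x - x_hat) ** 2))   # reconstruction error is negligibly small
```

This forward/inverse round trip is the quantity reported as reconstruction loss in Table 8, where DTCWT is compared against FFT and DWT.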
Given an input image 𝐈 ∈ ℝ^{3×224×224}, we split the image into patches of size 16×16 and obtain an embedding for each patch token using a position encoder and a token embedding network: 𝐗 = T(𝐈) + P(𝐈), where T and P refer to the token and position encoding networks. The distinct components of the SVT architecture are illustrated in Figure 1. The Scattering Vision Transformer consists of three components: a) Scattering Transformation, b) Spectral Gating Network, and c) Spectral Channel and Token Mixing.
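As a small illustration of the embedding step 𝐗 = T(𝐈) + P(𝐈) above, the sketch below uses a standard ViT-style patch embedding with a learned positional parameter; the sizes are illustrative and the actual SVT stem is hierarchical (see Table 9), so this is an assumption-laden sketch rather than the released implementation.

```python
import torch

# Minimal patchify-and-embed sketch: X = T(I) + P(I) with 16x16 patches (illustrative sizes).
B, C_in, H_img, W_img, patch, dim = 1, 3, 224, 224, 16, 384

I = torch.randn(B, C_in, H_img, W_img)
T = torch.nn.Conv2d(C_in, dim, kernel_size=patch, stride=patch)  # token embedding network T(.)
num_patches = (H_img // patch) * (W_img // patch)                # 14 * 14 = 196 tokens
P = torch.nn.Parameter(torch.zeros(1, num_patches, dim))         # position encoding P(.)

tokens = T(I).flatten(2).transpose(1, 2)   # (B, 196, dim) patch tokens
X = tokens + P                             # X = T(I) + P(I), fed to the scatter/attention layers
print(X.shape)
```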
A. Scattering Transformation:
The input image 𝐈 is first patchified into a feature tensor 𝐗 ∈ ℝ^{C×H×W}, whose spatial resolution is H × W and whose number of channels is C. To extract the features of an image, we feed 𝐗 into a series of transformer layers. We use a novel spectral transform based on an invertible scattering network instead of the standard self-attention network. This allows us to capture both the fine-grained and the global information in the image. The fine-grained information consists of texture, patterns, and small features, which are encoded by the high-frequency components of the spectral transform. The global information consists of the overall brightness, contrast, edges, and contours, which are encoded by the low-frequency components. Given the feature 𝐗 ∈ ℝ^{C×H×W}, we apply the scattering transform using DTCWT [54], as discussed in Section 2.1, to obtain the corresponding frequency representation 𝐗_F = scatter(𝐗). The transformation in the frequency domain 𝐗_F provides two components: a low-frequency (scaling) component 𝐗_φ and a high-frequency (wavelet) component 𝐗_ψ. The simplified formulation for the real component of scatter(⋅) is:
$$\mathbf{X}_F(u,v) = \mathbf{X}_\phi(u,v) + \mathbf{X}_\psi(u,v) = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} c_{M,h,w}\,\phi_{M,h,w} \;+\; \sum_{m=0}^{M-1}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\sum_{k=1}^{6} d^{k}_{m,h,w}\,\psi^{k}_{m,h,w} \qquad (3)$$
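As a shape-level illustration of this decomposition (not the released SVT code), the sketch below applies a one-level DTCWT to a hypothetical stage-one feature tensor and inspects the two pieces of Eq. (3): the scaling component 𝐗_φ and the wavelet component 𝐗_ψ, whose sub-bands carry the six directional orientations k = 1, …, 6. It again assumes the pytorch_wavelets package, and the exact tensor layout (including how real and imaginary parts are stored) follows that library's conventions.

```python
import torch
from pytorch_wavelets import DTCWTForward  # assumed third-party package

B, C, H, W = 2, 64, 56, 56            # hypothetical stage-1 feature size (assumption)
X = torch.randn(B, C, H, W)

scatter = DTCWTForward(J=1)           # one decomposition level, i.e. M = 1 in Eq. (3)
X_phi, X_psi = scatter(X)             # low-frequency (scaling) and high-frequency (wavelet) parts

print(X_phi.shape)                    # X_phi: coarse energy/shape information
for band in X_psi:                    # one entry per level; each holds the 6 directional orientations
    print(band.shape)                 # layout of orientations and real/imag parts is library-defined
```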
To perform EBM, we first reshape a tensor 𝐀 from ℝ^{H×W×C} to ℝ^{H×W×C_b×C_d}, where C = C_b × C_d and C_b ≫ C_d. We then define a weight matrix 𝐖 ∈ ℝ^{C_b×C_d×C_d} and perform Einstein multiplication between 𝐀 and 𝐖 along the last two dimensions, resulting in a blended feature tensor 𝐘 ∈ ℝ^{H×W×C_b×C_d}, as shown in Figure 2. The formula for EBM is:
$$\mathbf{Y}_{H\times W\times C_b\times C_d} = \mathbf{A}_{H\times W\times C_b\times C_d} \;⧆\; \mathbf{W}_{C_b\times C_d\times C_d}$$

Figure 2: Einstein Blending Method.
where ⧆ represents Einstein multiplication and the bias terms are b_ψc ∈ ℝ^{C_b×C_d} and b_ψt ∈ ℝ^{H×H}. The total number of weight parameters in the high-frequency gating network is now (C_b × C_d × C_d) + (W × H × H) instead of (C × H × W × k × 2), where C ≫ H, and the bias requires (C_b × C_d) + (H × W) parameters. This reduces the number of parameters and multiplications required for the high-frequency gating operation on an image. We use a standard torch package [52] to perform Einstein multiplication. Finally, we perform the inverse scattering transform using the low-frequency representation (Eq. 4) and the high-frequency representation (Eq. 6) to bring the features back from the spectral domain to the physical domain. Our SVT architecture
consists of L layers, comprising α scatter layers and (L − α) attention layers [64], where L denotes the network's depth. The scatter layers, being invertible, capture both the global and the fine-grained information in the image via low-pass and high-pass filters, while the attention layers focus on extracting semantic features and addressing long-range dependencies present in the image.
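The following sketch makes the Einstein Blending Method concrete with torch.einsum and contrasts its parameter count with a dense per-location gating weight. The sizes C, H, W, C_b, and C_d are hypothetical placeholders (not the paper's exact settings), so this is an illustrative sketch of the blending rule rather than the released implementation.

```python
import torch

# Einstein Blending Method (EBM) on the channel axis with illustrative (hypothetical) sizes.
H, W, C = 14, 14, 448        # spatial size and channels of a high-frequency feature map (assumed)
Cb, Cd = 64, 7               # blocked split with C = Cb * Cd and Cb >> Cd (assumed values)
assert Cb * Cd == C

A = torch.randn(H, W, C).reshape(H, W, Cb, Cd)   # reshape to blocked channels
Wc = torch.randn(Cb, Cd, Cd)                     # channel-mixing weight
bc = torch.randn(Cb, Cd)                         # channel bias b_psi_c

# Einstein multiplication over the last two dimensions:
# Y[h, w, b, e] = sum_d A[h, w, b, d] * Wc[b, d, e]
Y = torch.einsum('hwbd,bde->hwbe', A, Wc) + bc

# Weight counts: blocked EBM gating vs. a dense per-location weight, as in the text.
k = 6                                            # number of directional orientations
ebm_weights = Cb * Cd * Cd + W * H * H           # (Cb x Cd x Cd) + (W x H x H)
dense_weights = C * H * W * k * 2                # (C x H x W x k x 2)
print(Y.shape, ebm_weights, dense_weights)       # e.g. 5,880 vs. 1,053,696 weights here
```

The blocked weight makes the gating cost grow with C_b·C_d² rather than with the full C × H × W grid.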
Table 1: The table shows the performance of various vision backbones on the ImageNet1K[11] dataset
for image recognition tasks. ⋆ indicates additionally trained with the Token Labeling objective using
MixToken[27] and a convolutional stem (conv-stem) [65] for patch encoding. This table provides
results for input image size 224 × 224. We have grouped the vision models into three categories
based on their GFLOPs (Small, Base, and Large). The GFLOP ranges: Small (GFLOPs<6), Base
(6≤GFLOPs<10), and Large (10≤GFLOPs<30).
Method Params GFLOPS Top-1 Top-5 Method Params GFLOPS Top-1 Top-5
Small Large
ResNet-50 [23] 25.5M 4.1 78.3 94.3 ResNet-152 [23] 60.2M 11.6 81.3 95.5
BoTNet-S1-50 [56] 20.8M 4.3 80.4 95.0 ResNeXt101 [72] 83.5M 15.6 81.5 -
Cross-ViT-S [6] 26.7M 5.6 81.0 - gMLP-B [39] 73.0M 15.8 81.6 -
Swin-T [41] 29.0M 4.5 81.2 95.5 DeiT-B [61] 86.6M 17.6 81.8 95.6
ConViT-S [15] 27.8M 5.4 81.3 95.7 SE-ResNet-152 [25] 66.8M 11.6 82.2 95.9
T2T-ViT-14 [77] 21.5M 4.8 81.5 95.7 Cross-ViT-B [6] 104.7M 21.2 82.2 -
RegionViT-Ti+ [5] 14.3M 2.7 81.5 - ResNeSt-101 [80] 48.3M 10.2 82.3 -
SE-CoTNetD-50 [37] 23.1M 4.1 81.6 95.8 ConViT-B [15] 86.5M 16.8 82.4 95.9
Twins-SVT-S [10] 24.1M 2.9 81.7 95.6 PoolFormer [76] 73.0M 11.8 82.5 -
CoaT-Lite-S [73] 20.0M 4.0 81.9 95.5 T2T-ViTt-24 [77] 64.1M 15.0 82.6 95.9
PVTv2-B2 [67] 25.4M 4.0 82.0 96.0 TNT-B [21] 65.6M 14.1 82.9 96.3
LITv2-S [45] 28.0M 3.7 82.0 - CycleMLP-B4 [7] 52.0M 10.1 83.0 -
MViTv2-T [35] 24.0M 4.7 82.3 - DeepViT-L [83] 58.9M 12.8 83.1 -
Wave-ViT-S [75] 19.8M 4.3 82.7 96.2 RegionViT-B [5] 72.7M 13.0 83.2 96.1
CSwin-T [13] 23.0M 4.3 82.7 - CycleMLP-B5 [7] 76.0M 12.3 83.2 -
DaViT-Ti [12] 28.3M 4.5 82.8 - ViP-Large/7 [24] 88.0M 24.4 83.2 -
SVT-H-S 21.7M 3.9 83.1 96.3 CaiT-S36 [62] 68.4M 13.9 83.3 -
iFormer-S [55] 20.0M 4.8 83.4 96.6 AS-MLP-B [38] 88.0M 15.2 83.3 -
CMT-S [19] 25.1M 4.0 83.5 - BoTNet-S1-128 [56] 75.1M 19.3 83.5 96.5
MaxViT-T [63] 31.0M 5.6 83.6 - Swin-B [41] 88.0M 15.4 83.5 96.5
Wave-ViT-S⋆ [75] 22.7M 4.7 83.9 96.6 Wave-MLP-B [58] 63.0M 10.2 83.6 -
SVT-H-S⋆ (Ours) 22.0M 3.9 84.2 96.9 LITv2-B [45] 87.0M 13.2 83.6 -
Base PVTv2-B4 [67] 62.6M 10.1 83.6 96.7
ResNet-101 [23] 44.6M 7.9 80.0 95.0 ViL-Base [81] 55.7M 13.4 83.7 -
BoTNet-S1-59 [56] 33.5M 7.3 81.7 95.8 Twins-SVT-L [10] 99.3M 15.1 83.7 96.5
T2T-ViT-19 [77] 39.2M 8.5 81.9 95.7 Hire-MLP-L [20] 96.0M 13.4 83.8 -
CvT-21 [69] 32.0M 7.1 82.5 - RegionViT-B+ [5] 73.8M 13.6 83.8 -
GFNet-H-B [51] 54.0M 8.6 82.9 96.2 Focal-Base [74] 89.8M 16.0 83.8 96.5
Swin-S [41] 50.0M 8.7 83.2 96.2 PVTv2-B5 [67] 82.0M 11.8 83.8 96.6
Twins-SVT-B [10] 56.1M 8.6 83.2 96.3 CoTNetD-152 [37] 55.8M 17.0 84.0 97.0
CoTNetD-101 [37] 40.9M 8.5 83.2 96.5 DAT-B [70] 88.0M 15.8 84.0 -
PVTv2-B3 [67] 45.2M 6.9 83.2 96.5 LV-ViT-M⋆ [27] 55.8M 16.0 84.1 96.7
LITv2-M [45] 49.0M 7.5 83.3 - CSwin-B [13] 78.0M 15.0 84.2 -
RegionViT-M+ [5] 42.0M 7.9 83.4 - HorNet-𝐵𝐺𝐹 [50] 88.0M 15.5 84.3 -
MViTv2-S [35] 35.0M 7.0 83.6 - DynaMixer-L [68] 97.0M 27.4 84.3 -
CSwin-S [13] 35.0M 6.9 83.6 - MViTv2-B [35] 52.0M 10.2 84.4 -
DaViT-S [12] 49.7M 8.8 84.2 - DaViT-B [12] 87.9M 15.5 84.6 -
VOLO-D1⋆ [78] 26.6M 6.8 84.2 - CMT-L [19] 74.7M 19.5 84.8 -
CMT-B [19] 45.7M 9.3 84.5 - MaxViT-B [63] 120.0M 23.4 85.0 -
MaxViT-S [63] 69.0M 11.7 84.5 - VOLO-D2⋆ [78] 58.7M 14.1 85.2 -
iFormer-B [55] 48.0M 9.4 84.6 97.0 VOLO-D3⋆ [78] 86.3M 20.6 85.4 -
Wave-ViT-B⋆ [75] 33.5M 7.2 84.8 97.1 Wave-ViT-L⋆ [75] 57.5M 14.8 85.5 97.3
SVT-H-B⋆ (Ours) 32.8M 6.3 85.2 97.3 SVT-H-L⋆ (Ours) 54.0M 12.7 85.7 97.5
85.2% with fewer parameters. We also compare SVT with iFormer [55], which captures low- and high-frequency information from visual data, whereas SVT uses an invertible spectral method, namely the scattering network, to obtain the low-frequency and high-frequency components and uses tensor and Einstein mixing, respectively, to capture effective spectral features from visual data. SVT's top-1 accuracy is 85.2%, which is better than iFormer-B at 84.6%, with fewer parameters and FLOPS. We also compare SVT with WaveMLP [58], an MLP-mixer-based technique that uses amplitude and phase information to represent the semantic content of an image. SVT uses the low-frequency component as the amplitude of the original feature, while the high-frequency component captures complex semantic changes in the input image. Our studies show, as depicted in Table 1, that SVT outperforms WaveMLP by about 1.8%.
We divide the transformer architectures into three parts based on computation requirements (FLOP counts): small (less than 6 GFLOPS), base (6-10 GFLOPS), and large (10-30 GFLOPS), following a categorization similar to WaveViT [75]. Notable recent works in the small category include CSwin Transformers [13], LiTv2 [45], MaxViT [63], iFormer [55], the CMT transformer, PVTv2 [67], and WaveViT [75]. It is worth mentioning that WaveViT relies on extra annotations to achieve its best results. In this context, SVT-H-S stands out as the state-of-the-art model in the small category,
Table 2: Initial Attention Layer vs Scatter Layer vs Initial Convolutional: This table compares the SVT transformer with initial scatter layers and later attention layers, SVT-Inverse with initial attention layers and later scatter layers, and SVT with initial convolutional layers. We also show a variant that alternates spectral and attention layers. This shows that initial scatter layers work better than the rest.
Model Params(M) FLOPS(G) Top-1(%) Top-5(%)
SVT-H-S 22.0M 3.9 84.2 96.9
SVT-H-S-Init-CNN 21.7M 4.1 84.0 95.7
SVT-H-S-Inverse 21.8M 3.9 83.1 94.6
SVT-H-S-Alternate 22.4M 4.6 83.4 95.0

Table 3: This table shows the ablation analysis of various spectral layers in the SVT architecture, such as FN, FFC, WGN, and FNO. We conduct this ablation study on the small-size networks in the stage architecture. This indicates that SVT performs better than the other kinds of networks.
Model Params(M) FLOPS(G) Top-1(%) Top-5(%) Invertible loss(↓)
FFC 21.53 4.5 83.1 95.23 –
FN 21.17 3.9 84.02 96.77 –
FNO 21.33 3.9 84.09 96.86 3.27e-05
WGN 21.59 3.9 83.70 96.56 8.90e-05
SVT 22.22 3.9 84.20 96.93 6.64e-06

Table 4: Results on transfer learning datasets. We report the top-1 accuracy on the four datasets.
Model CIFAR-10 CIFAR-100 Flowers-102 Cars-196
ResNet50 [23] - - 96.2 90.0
ViT-B/16 [14] 98.1 87.1 89.5 -
ViT-L/16 [14] 97.9 86.4 89.7 -
Deit-B/16 [61] 99.1 90.8 98.4 92.1
ResMLP-24 [60] 98.7 89.5 97.9 89.5
GFNet-XS [51] 98.6 89.1 98.1 92.8
GFNet-H-B [51] 99.0 90.3 98.8 93.2
SVT-H-B 99.22 91.2 98.9 93.6

Table 5: The performance of various vision backbones on the COCO val2017 dataset for the downstream instance segmentation task with the Mask R-CNN 1x [22] method. We adopt Mask R-CNN as the base model, and the bounding box & mask Average Precision (i.e., AP^b & AP^m) are reported for evaluation.
Backbone AP^b AP^b_50 AP^b_75 AP^m AP^m_50 AP^m_75
ResNet50 [23] 38.0 58.6 41.4 34.4 55.1 36.7
Swin-T [41] 42.2 64.6 46.2 39.1 61.6 42.0
Twins-SVT-S [10] 43.4 66.0 47.3 40.3 63.2 43.4
LITv2-S [45] 44.9 - - 40.8 - -
RegionViT-S [5] 44.2 - - 40.8 - -
PVTv2-B2 [67] 45.3 67.1 49.6 41.2 64.2 44.4
SVT-H-S 46.0 68.1 50.4 41.9 65.0 45.1
achieving a top-1 accuracy of 84.2%. Similarly, SVT-H-B surpasses all the transformers in the base
category, boasting a top-1 accuracy of 85.2%. Lastly, SVT-H-L outperforms other large transformers
with a top-1 accuracy of 85.7% when tested on the ImageNet dataset with an image size of 224x224.
When comparing different architectural approaches, such as Convolutional Neural Networks (CNNs), transformer architectures (attention-based models), MLP mixers, and spectral architectures, SVT consistently outperforms its counterparts. For instance, SVT achieves better top-1 accuracy and parameter efficiency than CNN architectures such as ResNet-152 [23], ResNeXt [72], and ResNeSt. Among attention-based architectures, MaxViT [63] has been recognized as the best performer, surpassing models like DeiT [61], Cross-ViT [6], DeepViT [83], and T2T [77] with a top-1 accuracy of 85.0%. However, SVT achieves an even higher top-1 accuracy of 85.7% with less than half the number of parameters. In the realm of MLP-mixer-based architectures, DynaMixer-L [68] emerges as the top-performing model, surpassing MLP-Mixer [59], gMLP [39], CycleMLP [7], Hire-MLP [20], AS-MLP [38], WaveMLP [58], and PoolFormer [76] with a top-1 accuracy of 84.3%. In comparison, SVT-H-L outperforms DynaMixer with a top-1 accuracy of 85.7% while requiring fewer parameters and computations. Hierarchical architectures, which include models such as PVT [66], the Swin [41], CSwin [13], and Twins [10] transformers, and VOLO [78], are also considered. Among this category, VOLO achieves the highest top-1 accuracy of 85.4%; however, SVT-H-L outperforms VOLO with a top-1 accuracy of 85.7%. Lastly, in the spectral architecture category, models such as GFNet [51], iFormer [55], LiTv2 [45], HorNet [50], and Wave-ViT [75] are examined. Wave-ViT was previously the state-of-the-art method with a top-1 accuracy of 85.5%. Nevertheless, SVT-H-L surpasses Wave-ViT in terms of top-1 accuracy, network size (number of parameters), and computational complexity (FLOPS), as indicated in Table 1.
3.3 What Matters: Initial Spectral, Initial Attention, or Initial Convolution Layers?
We conducted an ablation study to show that initial scatter layers followed by attention in deeper layers are more beneficial than initial attention layers followed by later scatter layers (SVT-H-S-Inverse). We also compare a transformer model that alternates attention and scatter layers (SVT-H-S-Alternate), as shown in Table 2. From all these combinations, we observe that initial scatter layers followed by attention in deeper layers are the most beneficial. We also compare the performance of SVT when the architecture changes from all attention layers (PVTv2 [67]) to all spectral layers (GFNet [51])
Table 6: The SVT model comprises a low-frequency component and a high-frequency component obtained with the help of the scattering net using the Dual-Tree Complex Wavelet Transform. Each frequency component is controlled by a parameterized weight matrix using patch (token) mixing and/or channel mixing. This table shows all combinations; SVT_TTEE is the best performing among them.
Backbone LF-Token LF-Channel HF-Token HF-Channel Params (M) FLOPS (G) Top-1 (%) Top-5 (%)
SVT_TTTT T T T T 25.18 4.4 83.97 96.86
SVT_EETT E E T T 21.90 4.1 83.87 96.67
SVT_EEEE E E E E 21.87 3.7 83.70 96.56
SVT_TTEE T T E E 22.01 3.9 84.20 96.82
SVT_TTEX T T E ✗ 21.99 4.0 84.06 96.76
SVT_TTXE T T ✗ E 22.25 4.1 84.12 96.91
as well as a few spectral layers followed by the remaining attention layers (SVT, ours). We observe that combining spectral and attention layers boosts performance compared to all-attention and all-spectral transformers, as shown in Table 2. We have also conducted an experiment where the initial layers of a ViT are convolutional networks and the later layers are attention layers, to compare against SVT. The results are captured in Table 1, where we compare SVT with transformers having initial convolutional layers such as CvT [69], CMT [19], and HorNet [50]. Initial convolutional layers in a transformer do not perform as well as initial scatter layers. Initial scatter-layer-based transformers achieve better performance at a lower computation cost than initial convolutional-layer-based transformers, as shown in Table 2.
SVT uses a scattering network to decompose the signal into low-frequency and high-frequency components. We use a gating operator to obtain effective learnable features from this spectral decomposition; the gating operator multiplies a weight parameter with both the high- and low-frequency components. We have conducted experiments using tensor and Einstein mixing. Tensor mixing is a simple multiplication operator, while Einstein mixing uses an Einstein matrix multiplication operator [52]. We observe that, for the low-frequency component, tensor mixing performs better than Einstein mixing. As shown in Table 6, we start with SVT_TTTT, which uses tensor mixing in both the high- and low-frequency components, and find that it does not perform optimally. Reversing this and using Einstein mixing in both components (SVT_EEEE) also does not perform optimally. We therefore adopt SVT_TTEE, which uses tensor mixing in the low-frequency component and Einstein mixing in the high-frequency component. The high-frequency mixing further decomposes into token and channel mixing, whereas for the low-frequency component we simply apply tensor multiplication, as it is an energy or amplitude component.
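To make the tensor-versus-Einstein distinction concrete, the following hedged sketch contrasts the two gating styles on made-up feature sizes: simple element-wise (tensor) gating for the low-frequency component and blocked einsum (Einstein) gating for one high-frequency orientation. The shapes are illustrative assumptions, not the exact SVT configuration.

```python
import torch

H, W, C = 14, 14, 320                      # hypothetical feature sizes (assumption)
X_low = torch.randn(H, W, C)               # low-frequency (amplitude-like) component

# Tensor mixing: element-wise multiplication with a learnable weight of the same shape.
W_low = torch.nn.Parameter(torch.randn(H, W, C))
Y_low = X_low * W_low

# Einstein mixing: blocked channel blending via einsum, as in the EBM sketch above.
Cb, Cd = 64, 5                             # hypothetical split with C = Cb * Cd
X_high = torch.randn(H, W, Cb, Cd)         # one directional orientation of the high-frequency part
W_high = torch.randn(Cb, Cd, Cd)
Y_high = torch.einsum('hwbd,bde->hwbe', X_high, W_high)

print(Y_low.shape, Y_high.shape)
```

This mirrors the SVT_TTEE choice above: the low-frequency path keeps the cheap amplitude-style gate, while the high-frequency path gets the blocked channel/token blending.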
In the second ablation analysis, we compare various spectral architectures, including the Fourier
Network (FN), Fourier Neural Operator (FNO), Wavelet Gating Network (WGN), and Fast Fourier
Convolution (FFC). When we contrast SVT with WGN, it becomes evident that SVT exhibits superior
directional selectivity and a more adept ability to manage complex transformations. Furthermore, in
comparison to FN and FNO, SVT excels in decomposing frequencies into low and high-frequency
components. It’s worth noting that SVT surpasses other spectral architectures primarily due to its
utilization of the Directional Dual-Tree Complex Wavelet Transform (DTCWT), which offers direc-
tional orientation and enhanced invertibility, as demonstrated in Table 3. For a more comprehensive
analysis, please refer to the Supplementary section.
We train SVT on ImageNet-1K data and fine-tune it on various datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars for image recognition tasks. We compare SVT-H-B with various transformers such as DeiT [61], ViT [14], and GFNet [51], as well as with CNN architectures such as ResNet-50 and MLP-mixer architectures such as ResMLP. This comparison is shown in Table 4. SVT-H-B outperforms the state of the art on CIFAR-10 with a top-1 accuracy of 99.1%, CIFAR-100 with a top-1 accuracy of 91.3%, Flowers with a top-1 accuracy of 98.9%, and Cars with a top-1 accuracy of 93.7%. We observe that SVT has more representative features and an inbuilt discriminative nature, which helps in classifying images into various categories. We use a
Table 7: Latency (speed test): This table shows the latency (milliseconds) of SVT compared with convolution-type networks and attention-type, pool-type, MLP-type, and spectral-type transformers. We report the latency per sample on an A100 GPU. We adopt the latency table from EfficientFormer [36].
Model Type Params(M) GMACs(G) Top-1(%) Latency(ms)
ResNet50 [23] Convolution 25.5 4.1 78.5 9.0
DeiT-S [61] Attention 22.5 4.5 81.2 15.5
PVT-S [67] Attention 24.5 3.8 79.8 23.8
T2T-14 [77] Attention 21.5 4.8 81.5 21.0
Swin-T [40] Attention 29.0 4.5 81.3 22.0
CSwin-T [13] Attention 23.0 4.3 82.7 28.7
PoolFormer [76] Pool 31.0 5.2 81.4 41.2
ResMLP-S [60] MLP 30.0 6.0 79.4 17.4
EfficientFormer [36] MetaBlock 31.3 3.9 82.4 13.9
GFNet-H-S [51] Spectral 32.0 4.6 81.5 14.3
SVT-H-S Spectral 22.0 3.9 84.2 14.7

Table 8: Invertibility: This table shows the invertibility of SVT (DTCWT) compared with Fourier and DWT. We also compare different directional orientations and show the reconstruction loss (MSE) on an image.
Model MSE loss(↓) PSNR (dB)(↑)
Fourier (FFT) 3.27e-05 11.18
DWT-M1 8.90e-05 76.33
DWT-M2 3.19e-05 84.67
DWT-M3 1.08e-05 91.94
DTCWT-M1 6.64e-06 137.97
DTCWT-M2 2.01e-06 138.87
DTCWT-M3 1.23e-07 142.14
pre-trained SVT model for the downstream instance segmentation task and obtain good results on the
MS-COCO dataset as shown in Table- 5.
3.8 Limitations
SVT currently uses six directional orientations to capture an image’s fine-grained semantic information.
It is possible to go for the second degree, which gives thirty-six orientations, while the third degree
gives 216 orientations. The more orientations, the more semantic information could be captured, but
this leads to higher computational complexity. The decomposition parameter ‘M’ is currently set to 1
to get single low-pass and high-pass components. Higher values of ‘M’ give more components in
both frequencies but lead to higher complexity.
4 Related Work
The Vision Transformer (ViT) [14] was the first transformer-based attempt to classify images into
pre-defined categories and to bring NLP advances into vision. Following this, several transformer-based approaches such as DeiT [61], Tokens-to-Token ViT [77], Transformer iN Transformer (TNT) [21], Cross-ViT [6], Class-Attention Image Transformer (CaiT) [62], UniFormer [34], BEiT [3], SViT [49], RegionViT [5], MaxViT [63], etc. have been proposed to improve accuracy using multi-headed self-attention (MSA). PVT [66], Swin [41], CSwin [13], and Twins [10] use hierarchical architectures to improve the performance of vision transformers on various tasks. The complexity of MSA is O(n²); for high-resolution images, the cost increases quadratically with token length. PoolFormer [76] uses a pooling operation over small patches to obtain a down-sampled version of the image and reduce computational complexity. The main problem with PoolFormer is that it uses a MaxPooling operation, which is not invertible. Another approach to reducing the complexity is to use spectral transformers such as FNet [33], GFNet [51], AFNO [18], WaveMix [26], WaveViT [75], SpectFormer [48], FourierFormer [43], etc. FNet [33] does not use inverse Fourier
transforms, leading to an invertibility issue. GFNet [51] solves this by using inverse Fourier transforms
with a gating network. AFNO [18] uses the adaptive nature of a Fourier neural operator similar to
GFNet. SpectFormer [48] introduces a novel transformer architecture that combines both spectral and
attention networks for vision tasks. GFNet, SpectFormer, and AFNO do not have proper separation
of low-frequency and high-frequency components and may struggle to handle the semantic content
of images. In contrast, SVT has a clear separation of frequency components and uses directional
orientations to capture semantic information. FourierIntegral [43] is similar to GFNet and may have
similar issues in separating frequency components.
WaveMLP [58] is a recent effort that dynamically aggregates tokens as a wave function with two parts, amplitude and phase, to capture the original features and the semantic content of images, respectively. SVT instead uses a scattering network to provide low-frequency and high-frequency components; the high-frequency component has six or more directional orientations to capture semantic information in images. We use Einstein multiplication in the token and channel mixing of the high-frequency components, leading to lower computational complexity and a smaller network. In Wave-ViT [75], the authors address the quadratic complexity of the self-attention network by using a wavelet transform to perform lossless down-sampling over keys and values. However, WaveViT still has the same complexity, as it uses attention instead of spectral layers. SVT uses the scatter network, which is more invertible compared to WaveViT.
One of the challenges of MSA is its inability to characterize different frequencies in the input image. HiLo attention (LITv2) [45] finds high-frequency and low-frequency components using a novel variant of MSA, but it does not solve the complexity issue of MSA. Another parallel effort, the Inception Transformer [55], uses an Inception mixer to capture high- and low-frequency information in visual data; iFormer still has the same complexity as it uses attention as the low-frequency mixer. SVT, in comparison, uses a spectral neural operator to capture the low- and high-frequency components using the DTCWT. This removes the O(n²) complexity, as it uses spectral mixing instead of attention. iFormer [55] uses non-invertible max-pooling and convolutional layers to capture high-frequency components, whereas SVT's mixer is completely invertible. SVT uses a scattering network to obtain better directional orientation for capturing fine-grained information such as lines and edges, compared to HiLo attention and iFormer.
References
[1] https://openai.com/blog/chatgpt/, 2022.
[2] Hezam Albaqami, G Hassan, and Amitava Datta. Comparison of wpd, dwt and dtcwt for multi-class seizure
type classification. In 2021 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), pages
1–7. IEEE, 2021.
[3] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. In
International Conference on Learning Representations, 2021.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey
Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer,
2020.
[5] Chun-Fu Chen, Rameswar Panda, and Quanfu Fan. Regionvit: Regional-to-local attention for vision
transformers. In International Conference on Learning Representations, 2022.
[6] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision
transformer for image classification. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 357–366, 2021.
[7] Shoufa Chen, Enze Xie, GE Chongjian, Runjian Chen, Ding Liang, and Ping Luo. Cyclemlp: A mlp-like
architecture for dense prediction. In International Conference on Learning Representations, 2022.
[8] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. Advances in Neural Information Processing
Systems, 33:4479–4488, 2020.
[9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language
modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[10] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua
Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural
Information Processing Systems, 34:9355–9366, 2021.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009.
[12] Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. Davit: Dual attention
vision transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October
23–27, 2022, Proceedings, Part XXIV, pages 74–92. Springer, 2022.
[13] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and
Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–
12134, 2022.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is
worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning
Representations, 2020.
[15] Stéphane d’Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent Sagun.
Convit: Improving vision transformers with soft convolutional inductive biases. In International Conference
on Machine Learning, pages 2286–2296. PMLR, 2021.
[16] Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu
Liu. You only look at one sequence: Rethinking transformer in vision through object detection. Advances
in Neural Information Processing Systems, 34:26183–26197, 2021.
[17] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics,
pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[18] John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro.
Efficient token mixing for transformers via adaptive fourier neural operators. In International Conference
on Learning Representations, 2022.
[19] Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. Cmt:
Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 12175–12185, 2022.
[20] Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, and Yunhe Wang.
Hire-mlp: Vision mlp via hierarchical rearrangement. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 826–836, June 2022.
[21] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer.
Advances in Neural Information Processing Systems, 34:15908–15919, 2021.
[22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages 2961–2969, 2017.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[24] Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, and Jiashi Feng. Vision permutator:
A permutable mlp-like architecture for visual recognition. IEEE Transactions on Pattern Analysis &
Machine Intelligence, (01):1–1, 2022.
[25] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 7132–7141, 2018.
[26] Pranav Jeevan and Amit Sethi. Wavemix: Resource-efficient token mixing for images. arXiv preprint
arXiv:2203.03689, 2022.
[27] Zi-Hang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng.
All tokens matter: Token labeling for training better vision transformers. Advances in Neural Information
Processing Systems, 34:18590–18602, 2021.
[28] Nick Kingsbury. Image processing with complex wavelets. Philosophical Transactions of the Royal Society
of London. Series A: Mathematical, Physical and Engineering Sciences, 357(1760):2543–2560, 1999.
[29] Nick Kingsbury. Complex wavelets for shift invariant analysis and filtering of signals. Applied and
computational harmonic analysis, 10(3):234–253, 2001.
[30] Nick G Kingsbury. The dual-tree complex wavelet transform: a new technique for shift invariance and
directional filters. In IEEE digital signal processing workshop, volume 86, pages 120–131. Citeseer, 1998.
[31] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained
categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages
554–561, 2013.
[32] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
[33] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. Fnet: Mixing tokens with fourier
transforms. arXiv preprint arXiv:2105.03824, 2021.
[34] Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao.
Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450,
2022.
[35] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph
Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814,
2022.
[36] Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian
Ren. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing
Systems, 35:12934–12949, 2022.
[37] Yehao Li, Ting Yao, Yingwei Pan, and Tao Mei. Contextual transformer networks for visual recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[38] Dongze Lian, Zehao Yu, Xing Sun, and Shenghua Gao. As-mlp: An axial shifted mlp architecture for
vision. In International Conference on Learning Representations, 2022.
[39] Hanxiao Liu, Zihang Dai, David So, and Quoc V Le. Pay attention to mlps. Advances in Neural Information
Processing Systems, 34:9204–9215, 2021.
[40] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang,
Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
[41] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 10012–10022, 2021.
[42] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on
Learning Representations, 2018.
[43] Tan Minh Nguyen, Minh Pham, Tam Minh Nguyen, Khai Nguyen, Stanley Osher, and Nhat Ho. Fouri-
erformer: Transformer meets generalized fourier integral theorem. In Advances in Neural Information
Processing Systems, 2022.
[44] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of
classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages
722–729. IEEE, 2008.
[45] Zizheng Pan, Jianfei Cai, and Bohan Zhuang. Fast vision transformers with hilo attention. In Advances in
Neural Information Processing Systems, 2022.
[46] Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, and Jianfei Cai. Less is more: Pay less attention in
vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages
2035–2043, 2022.
[47] Badri N Patro and Vijay Agneeswaran. Efficiency 360: Efficient vision transformers. arXiv preprint
arXiv:2302.08374, 2023.
[48] Badri N Patro, Vinay P Namboodiri, and Vijay Srinivas Agneeswaran. Spectformer: Frequency and
attention is what you need in a vision transformer. arXiv preprint arXiv:2304.06446, 2023.
[49] Tianming Qiu, Ming Gui, Cheng Yan, Ziqing Zhao, and Hao Shen. Svit: Hybrid vision transformer models
with scattering transform. In 2022 IEEE 32nd International Workshop on Machine Learning for Signal
Processing (MLSP), pages 01–06. IEEE, 2022.
[50] Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser Nam Lim, and Jiwen Lu. Hornet: Effi-
cient high-order spatial interactions with recursive gated convolutions. Advances in Neural Information
Processing Systems, 35:10353–10366, 2022.
[51] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image
classification. Advances in Neural Information Processing Systems, 34:980–993, 2021.
[52] Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with einstein-like notation. In
International Conference on Learning Representations, 2022.
[53] Ivan W Selesnick. Hilbert transform pairs of wavelet bases. IEEE Signal Processing Letters, 8(6):170–173,
2001.
[54] Ivan W Selesnick, Richard G Baraniuk, and Nick C Kingsbury. The dual-tree complex wavelet transform.
IEEE signal processing magazine, 22(6):123–151, 2005.
[55] Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng YAN. Inception
transformer. In Advances in Neural Information Processing Systems, 2022.
[56] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani.
Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 16519–16529, 2021.
[57] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In
International conference on machine learning, pages 6105–6114. PMLR, 2019.
[58] Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, and Yunhe Wang. An image patch is
a wave: Phase-aware vision mlp. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 10935–10944, 2022.
[59] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner,
Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture
for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
[60] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave,
Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. Resmlp: Feedforward networks
for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2022.
[61] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou.
Training data-efficient image transformers & distillation through attention. In International Conference on
Machine Learning, pages 10347–10357. PMLR, 2021.
[62] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper
with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 32–42, 2021.
[63] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li.
Maxvit: Multi-axis vision transformer. In Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 459–479. Springer, 2022.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems,
30, 2017.
[65] Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, and Rong Jin.
Scaled relu matters for training vision transformers. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 36, pages 2495–2503, 2022.
[66] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and
Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
[67] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and
Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media,
8(3):415–424, 2022.
[68] Ziyu Wang, Wenhao Jiang, Yiming M Zhu, Li Yuan, Yibing Song, and Wei Liu. Dynamixer: a vision mlp
architecture with dynamic mixing. In International Conference on Machine Learning, pages 22691–22701.
PMLR, 2022.
[69] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing
convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 22–31, 2021.
[70] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable
attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
4794–4803, 2022.
[71] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer:
Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information
Processing Systems, 34:12077–12090, 2021.
[72] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transforma-
tions for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 1492–1500, 2017.
[73] Weijian Xu, Yifan Xu, Tyler Chang, and Zhuowen Tu. Co-scale conv-attentional image transformers. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9981–9990, 2021.
[74] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal
self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021.
[75] Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, and Tao Mei. Wave-vit: Unifying wavelet and trans-
formers for visual representation learning. In Computer Vision–ECCV 2022: 17th European Conference,
Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV, pages 328–345. Springer, 2022.
[76] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng
Yan. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 10819–10829, 2022.
[77] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng,
and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 558–567, 2021.
[78] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. Volo: Vision outlooker for visual
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[79] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, et al. Segvit: Semantic
segmentation with plain vision transformers. Advances in Neural Information Processing Systems, 35:4971–
4982, 2022.
[80] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas
Mueller, R Manmatha, et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 2736–2746, 2022.
[81] Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale
vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 2998–3008, 2021.
[82] Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang, Wenyu Liu, Gang Yu, and
Chunhua Shen. Topformer: Token pyramid transformer for mobile semantic segmentation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12083–12093, 2022.
[83] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi
Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.
Appendix
This document provides a comprehensive analysis of the vanilla transformer architecture and explores various versions of SVT. The architecture comparisons are presented in Table 12, shedding light on the differences and capabilities of each version. The document also covers the training configurations, encompassing transfer learning, task learning, and fine-tuning tasks. The dataset information used for transfer learning is presented in Table 13, providing insights into dataset sizes and their relevance to different applications. Moving to the results section, we showcase the fine-tuned model outcomes, where models are initially trained on 224 × 224 images and subsequently fine-tuned on 384 × 384 images. The performance evaluation, as depicted in Table 14, covers accuracy metrics, the number of parameters (M), and floating-point operations (G). A detailed comparison of similar architectures is provided in Table 11. Regarding the trade-off between invertibility and redundancy, we conducted an experiment to demonstrate that invertibility aids in comprehending the image rather than merely contributing to performance, as shown in Table 10.
Figure 3: This figure shows the Filter characterization of the initial four layers of the SVT model. It
clearly shows that the High-frequency filter coefficient captures local filter information such as lines,
edges, and different orientations of an Image. The Low-frequency filter coefficient captures the shape
with the maximum energy part in the image.
Table 9: Detailed architecture specifications for three variants of our SVT with different model sizes,
i.e., SVT-S (small size), SVT-B (base size), and SVT-L (large size). 𝐸𝑖 , 𝐺𝑖 , 𝐻𝑖 , and 𝐶𝑖 represent
the expansion ratio of the feed-forward layer, the spectral gating number, the head number, and the
channel dimension in each stage 𝑖, respectively.
OP  Size  SVT-H-S  SVT-H-B  SVT-H-L
Stage 1  H/4 × W/4  [E1=8, G1=1, C1=64] ×3  [E1=8, G1=1, C1=64] ×3  [E1=8, G1=1, C1=96] ×3
Stage 2  H/8 × W/8  [E2=8, G2=1, C2=128] ×4  [E2=8, G2=1, C2=128] ×4  [E2=8, G2=1, C2=192] ×6
Stage 3  H/16 × W/16  [E3=4, H3=10, C3=320] ×6  [E3=4, H3=10, C3=320] ×12  [E3=4, H3=12, C3=384] ×18
Stage 4  H/32 × W/32  [E4=4, H4=14, C4=448] ×3  [E4=4, H4=16, C4=512] ×3  [E4=4, H4=16, C4=512] ×3
B Appendix: Dataset and Training Details:
B.1 Dataset and Training Setups on ImageNet-1K for Image Classification task
In this section, we outline the dataset and training setups for the Image Classification task on the
ImageNet-1K benchmark dataset. The dataset comprises 1.28 million training images and 50K valida-
tion images, spanning across 1,000 categories. To train the vision backbones from scratch, we employ
several data augmentation techniques, including RandAug, CutOut, and Token Labeling objectives
Table 10: Invertibility vs redundancy: This table shows the SVT-H performance for each orientation. We merge orientations to make them similar, producing 2- and 3-orientation variants. The final SVT-H-S has 6 orientations in the high-frequency components to capture curves and slants in all 6 directions. 'H' stands for hierarchical and 'S' for the small-size model at image size 224².
Model Params GFLOPs Top-1(%) Top-5(%)
SVT-H-S-ori-1 21.5M 3.9 83.2 94.9
SVT-H-S-ori-2 21.6M 3.9 83.4 95.1
SVT-H-S-ori-3 21.7M 3.9 83.7 95.5
SVT-H-S(ori-6) 22.0M 3.9 84.2 96.9
(Figure 4 panels: Low-Frequency Filter coefficients)
Figure 4: This figure shows the Filter characterization of the initial four layers of the SVT model. It
clearly shows that the High-frequency filter coefficient captures local filter information such as lines,
edges, and different orientations of an Image. The Low-frequency filter coefficient captures the shape
with the maximum energy part in the image.
Figure 5: Comparison of ImageNet Top-1 Accuracy (%) vs GFLOPs of various models in Vanilla
and Hierarchical architecture.
Figure 6: Comparison of ImageNet Top-1 Accuracy (%) vs Parameters (M) of various models in
Vanilla and Hierarchical architecture.
with MixToken. These augmentation techniques help enhance the model’s generalization capabilities.
For performance evaluation, we measure the trained backbones’ top-1 and top-5 accuracies on the
validation set, providing a comprehensive assessment of the model’s classification capabilities. In the
optimization process, we adopt the AdamW optimizer with a momentum of 0.9, combining it with a
10-epoch linear warm-up phase and a subsequent 310-epoch cosine decay learning rate scheduler.
These strategies aid in achieving stable and effective model training. To handle the computational
load, we distribute the training process on 8 V100 GPUs, utilizing a batch size of 128. This distributed
setup helps accelerate the training process while making efficient use of available hardware resources.
The learning rate and weight decay are fixed at 0.00001 and 0.05, respectively, maintaining stable
training and mitigating overfitting risks.
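A minimal sketch of this optimization recipe (AdamW with a 10-epoch linear warm-up followed by a 310-epoch cosine decay) is given below; it simply mirrors the hyperparameters stated in the text using standard PyTorch schedulers, with a stand-in module in place of the actual SVT backbone, and is not the released training script.

```python
import torch

model = torch.nn.Linear(768, 1000)   # stand-in module; substitute an SVT backbone here

# AdamW with the learning rate and weight decay stated above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)

# 10-epoch linear warm-up followed by a 310-epoch cosine decay, stepped once per epoch.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=10)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=310)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[10])

for epoch in range(320):
    # ... one training epoch over ImageNet-1K with a per-GPU batch size of 128 ...
    scheduler.step()
```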
In the context of transfer learning, we evaluate the efficacy of our vanilla SVT architecture on widely used benchmark datasets, namely CIFAR-10 [32], CIFAR-100 [32], Oxford Flowers-102 [44], and Stanford Cars [31]. Our approach follows the methodology of previous studies [57, 14, 61, 60, 51]: we initialize the model with pre-trained weights from ImageNet and subsequently fine-tune it on the new datasets.
Table-4 in the main paper presents a comprehensive comparison of the transfer learning performance
of both our basic and best models against state-of-the-art CNNs and vision transformers. To maintain
consistency, we employed a batch size of 64, a learning rate (lr) of 0.0001, a weight-decay of 1e-4,
a clip-grad value of 1, and performed 5 epochs of warmup. For the transfer learning process, we
utilized a pre-trained model that was initially trained on the ImageNet-1K dataset. This pre-trained
model was fine-tuned on the specific transfer learning dataset mentioned in Table-13 for a total of
1000 epochs.
In this section, we conduct an in-depth analysis of the pre-trained SVT-H-small model’s performance
on the COCO dataset for two distinct downstream tasks involving object localization, ranging from
Table 11: This shows a performance comparison of SVT with similar Transformer Architecture with
different sizes of the networks on ImageNet-1K. ⋆ indicates additionally trained with the Token
Labeling objective using MixToken[27].
Network Params GFLOPs Top-1 Acc (%) Top-5 Acc (%)
Vanilla Transformer Comparison
FFC-ResNet-50 [8] 26.7M - 77.8 -
FourierFormer [43] - - 73.3 91.7
GFNet-Ti [51] 7M 1.3 74.6 92.2
SVT-T 9M 1.8 76.9 93.4
FFC-ResNet-101 [8] 46.1M - 78.8 -
Fnet-S [33] 15M 2.9 71.2 -
GFNet-XS [51] 16M 2.9 78.6 94.2
GFNet-S [51] 25M 4.5 80.0 94.9
SVT-XS 19.9M 4.0 79.9 94.5
SVT-S 32.2M 6.6 81.5 95.3
FFC-ResNet-152 [8] 62.6M - 78.9 -
GFNet-B [51] 43M 7.9 80.7 95.1
SVT-B 57.6M 11.8 82.0 95.6
Hierarchical Transformer Comparison
GFNet-H-S [51] 32M 4.6 81.5 95.6
LIT-S [46] 27M 4.1 81.5 -
iFormer-S [55] 20.0M 4.8 83.4 96.6
Wave-ViT-S⋆ [75] 22.7M 4.7 83.9 96.6
SVT-H-S 21.7M 3.9 83.1 96.3
SVT-H-S⋆ 22.0M 3.9 84.2 96.9
GFNet-H-B [51] 54M 8.6 82.9 96.2
LIT-M [46] 48M 8.6 83.0 -
LITv2-M [45] 49.0M 7.5 83.3 -
iFormer-B [55] 48.0M 9.4 84.6 97.0
Wave-MLP-B [58] 63.0M 10.2 83.6 -
Wave-ViT-B⋆ [75] 33.5M 7.2 84.8 97.0
SVT-H-B⋆ 32.8M 6.3 85.2 97.3
LIT-B [46] 86M 15.0 83.4 -
LITv2-B [45] 87.0M 13.2 83.6 -
HorNet-𝐵𝐺𝐹 [50] 88.0M 15.5 84.3 -
iFormer-L[55] 87.0M 14.0 84.8 97.0
Wave-ViT-L⋆ [75] 57.5M 14.8 85.5 97.3
SVT-H-L⋆ 54.0M 12.7 85.7 97.5
bounding-box level to pixel level. Specifically, we evaluate our SVT-H-small model on instance
segmentation tasks, such as Mask R-CNN [22], as demonstrated in Table-5 of the main paper.
For the downstream task, we replace the CNN backbones in the respective detectors with our pre-trained
SVT-H-small model to evaluate its effectiveness. Prior to this, we pre-train each vision backbone on
the ImageNet-1K dataset, initializing the newly added layers with Xavier initialization [17]. Next,
we adhere to the standard setups defined in [41] to train all models on the COCO train2017 dataset,
which comprises approximately 118,000 images. The training process is performed with a batch size
of 16, and we utilize the AdamW optimizer [42] with a weight decay of 0.05, an initial learning rate
of 0.0001, and betas set to (0.9, 0.999). To manage the learning rate during training, we adopt the
step learning rate policy with linear warm-up over the first 500 iterations and a warm-up ratio of 0.001.
These learning rate configurations aid in optimizing the model’s performance and convergence.
In our main experiments, we conduct image classification tasks on the widely-used ImageNet dataset
[11], a standard benchmark for large-scale image classification. To ensure a fair and meaningful
Table 12: In this table, we present a comprehensive overview of different versions of SVT within the
vanilla transformer architecture. The table includes detailed configurations such as the number of
heads, embedding dimensions, the number of layers, and the training resolution for each variant. For
SVT-H models with a hierarchical structure, we refer readers to Table 9, which
outlines the specifications for all four stages. Additionally, the table provides FLOPs (floating-point
operations) calculations for input sizes of both 224×224 and 384×384. In the vanilla SVT architecture,
we utilize four spectral layers with 𝛼 = 4, while the remaining attention layers are (𝐿 − 𝛼).
Model #Layers #heads #Embedding Dim Params (M) Training Resolution FLOPs (G)
SVT-Ti 12 4 256 9 224 1.8
SVT-XS 12 6 384 20 224 4.0
SVT-S 19 6 384 32 224 6.6
SVT-B 19 8 512 57 224 11.5
SVT-XS 12 6 384 21 384 13.1
SVT-S 19 6 384 33 384 22.0
SVT-B 19 8 512 57 384 37.3
Table 13: This table presents information about datasets used for transfer learning. It includes the
size of the training and test sets, as well as the number of categories included in each dataset.
Dataset CIFAR-10 [32] CIFAR-100 [32] Stanford Cars [31] Flowers-102 [44]
Train Size 50,000 50,000 8,144 2,040
Test Size 10,000 10,000 8,041 6,149
#Categories 10 100 196 102
Figure 7: The first column shows phase and magnitude plots for the Fourier transform, and the second column shows the low-frequency component of the Dual-Tree Complex Wavelet Transform (DT-CWT). The third column onwards shows high-frequency visualizations for all six direction-selective orientations. The first row visualizes phase information and the second row shows the magnitude of all six high-frequency components.
comparison with previous research [61, 60, 51], we adopt the same training details for our SVT
models. For the vanilla transformer architecture (SVT), we utilize the hyperparameters recommended
by the GFNet implementation [51]. Similarly, for the hierarchical architecture (SVT-H), we employ
the hyperparameters recommended by the WaveVit implementation [75]. During fine-tuning at higher
resolutions, we follow the hyperparameters suggested by the GFNet implementation [51] and train
our models for 30 epochs.
All model training is performed on a single machine equipped with 8 V100 GPUs. In our experiments,
we specifically compare the fine-tuning performance of our models with GFNet [51]. Our observations
indicate that our SVT models outperform GFNet’s base spectral network. For instance, SVT-S(384)
achieves an impressive accuracy of 83.0%, surpassing GFNet-S(384) by 1.2%, as presented in Table 14.
Similarly, SVT-XS and SVT-B outperform GFNet-XS and GFNet-B, respectively, highlighting the
superior performance of our SVT models in the fine-tuning process.
Table 14: We conducted a comparison of various transformer-style architectures for image classifi-
cation on ImageNet. This includes vision transformers [61], MLP-like models [60, 39], spectral
transformers [51] and our SVT models, which have similar numbers of parameters and FLOPs.
The top-1 accuracy on ImageNet’s validation set, as well as the number of parameters and FLOPs, are
reported. All models were trained using 224 × 224 images. We used the notation "↑384" to indicate
models fine-tuned on 384 × 384 images for 30 epochs.
Model Params (M) FLOPs (G) Resolution Top-1 Acc. (%) Top-5 Acc. (%)
gMLP-Ti [39] 6 1.4 224 72.0 -
DeiT-Ti [61] 5 1.2 224 72.2 91.1
GFNet-Ti [51] 7 1.3 224 74.6 92.2
SVT-T 9 1.8 224 76.9 93.4
ResMLP-12 [60] 15 3.0 224 76.6 -
GFNet-XS [51] 16 2.9 224 78.6 94.2
SVT-XS 20 4.0 224 79.9 94.5
DeiT-S [61] 22 4.6 224 79.8 95.0
gMLP-S [39] 20 4.5 224 79.4 -
GFNet-S [51] 25 4.5 224 80.0 94.9
SVT-S 32 6.6 224 81.5 95.3
ResMLP-36 [60] 45 8.9 224 79.7 -
GFNet-B [51] 43 7.9 224 80.7 95.1
gMLP-B [39] 73 15.8 224 81.6 -
DeiT-B [61] 86 17.5 224 81.8 95.6
SVT-B 57 11.6 224 82.0 95.6
GFNet-XS↑384 [51] 18 8.4 384 80.6 95.4
GFNet-S↑384 [51] 28 13.2 384 81.7 95.8
GFNet-B↑384 [51] 47 23.3 384 82.1 95.8
SVT-XS↑384 21 13.1 384 82.2 95.8
SVT-S↑384 33 22.0 384 83.1 96.4
SVT-B↑384 57 37.3 384 83.0 96.2
85.2% with fewer parameters. We also compare SVT with iFormer [55], which captures low- and high-frequency information from visual data, whereas SVT uses an invertible spectral method, namely the scattering network, to obtain the low-frequency and high-frequency components and uses tensor and Einstein mixing, respectively, to capture effective spectral features from visual data. SVT's top-1 accuracy is 85.2%, which is better than iFormer-B at 84.6%, with fewer parameters and FLOPS.
We also compare SVT with WaveMLP [58], an MLP-mixer-based technique that uses amplitude and phase information to represent the semantic content of an image. SVT uses the low-frequency component as the amplitude of the original feature, while the high-frequency component captures complex semantic changes in the input image. Our studies show, as depicted in Table 11, that SVT outperforms WaveMLP by about 1.8%. Wave-ViT-B [75] uses a wavelet transform on the key and value parts of the multi-head attention method, whereas SVT uses a scattering network to decompose high- and low-frequency components with invertibility and better directional orientation using Einstein and tensor mixing. SVT outperforms Wave-ViT-B by 0.4%.
We wish to state the following regarding the reviewer's comment about large vision models (LVMs/LLMs): we have observed in recent papers, such as EfficientFormer and CvT, that certain models have a significantly larger number of parameters, with BiT-M having 928 million parameters and achieving 85.4% accuracy on ImageNet-1K, whereas ViT-H has 632 million parameters and achieves 85.1%. Comparatively, SVT-H-L has 54 million parameters and achieves 85.7% accuracy on ImageNet-1K - nearly 10× fewer parameters and FLOPS, but with improved accuracy, as captured in Table 3 of CvT [69].