(NIPS23) Scattering Transformation For ViT Matters
Abstract
Vision transformers have gained significant attention and achieved state-of-the-
art performance in various computer vision tasks, including image classification,
instance segmentation, and object detection. However, challenges remain in address-
ing attention complexity and effectively capturing fine-grained information within
images. Existing solutions often resort to down-sampling operations, such as pool-
ing, to reduce computational cost. Unfortunately, such operations are non-invertible
and can result in information loss. In this paper, we present a novel approach called
Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates
a spectrally scattering network that enables the capture of intricate image details.
SVT overcomes the invertibility issue associated with down-sampling operations
by separating low-frequency and high-frequency components. Furthermore, SVT
introduces a unique spectral gating network utilizing Einstein multiplication for
token and channel mixing, effectively reducing complexity. We show that SVT
achieves state-of-the-art performance on the ImageNet dataset with a significant
reduction in the number of parameters and FLOPS. SVT shows a 2% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2% top-1 accuracy, while SVT-H-B reaches 85.2% (state of the art for base versions) and SVT-H-L reaches 85.7% (again state of the art for large versions). SVT also shows comparable results in other vision
tasks such as instance segmentation. SVT also outperforms other transformers
in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford
Flower, and Stanford Car datasets. The project page is available on this webpage
(https://badripatro.github.io/svt/).
1 Introduction
In recent years, there has been a remarkable surge in the interest and adoption of Large Language
Models (LLMs), driven by the release and success of prominent models such as GPT-3, ChatGPT [1],
and Palm [9]. These LLMs have achieved significant breakthroughs in the field of Natural Language
Processing (NLP). Building upon their successes, subsequent research endeavors have extended the
language transformer paradigm to diverse domains including computer vision, speech recognition,
video processing, and even climate and weather prediction. In this paper, we specifically focus on
exploring the potential of LLMs for vision-related tasks. By leveraging the power of these language
models, we aim to push the boundaries of vision applications and investigate their capabilities in
addressing complex vision challenges.
Several adaptations of transformers have been introduced in the field of computer vision for various
tasks. For image classification, notable vision transformers include ViT [14], DeiT [61], PVT [66], Swin [41], Twins [10], and CSWin transformers [13]. These vision transformers improved the
performance of image classification tasks significantly compared to Convolutional Neural Networks
(CNNs) such as ResNets and RegNets, as discussed in efficient vision transformer research work [47].
This breakthrough in computer vision has led to state-of-the-art results in various vision tasks, including image segmentation, with models such as SegFormer [71], TopFormer [82], and SegViT [79], and object detection.
Figure 1: This figure illustrates the architectural details of the SVT model with a Scatter and Attention
Layer structure. The Scatter Layer comprises a Scattering Transformation that processes Low-
Frequency (LF) and High-Frequency (HF) components. Subsequently, we apply the Tensor and
Einstein Blending Method to obtain Low-Frequency Representation (LFR) and High-Frequency
Representation (HFR), as depicted in the figure. Finally, we apply the Inverse Scattering transformation
using LFR and HFR.
2 Method
2.1 Background: Overview of DTCWT and Decoupling of Low & High Frequencies
The Discrete Wavelet Transform (DWT) replaces the infinitely oscillating sinusoidal basis functions with a set of locally oscillating basis functions known as wavelets [54, 29]. A wavelet representation combines a low-pass scaling function φ(t) with shifted and dilated versions of a band-pass wavelet function ψ(t). It can be represented mathematically as:
$$x(t) = \sum_{n=-\infty}^{\infty} c(n)\,\phi(t-n) + \sum_{j=0}^{\infty}\sum_{n=-\infty}^{\infty} d(j,n)\,2^{j/2}\,\psi(2^{j}t-n). \qquad (1)$$
where c(n) are the scaling coefficients and d(j, n) are the wavelet coefficients. These coefficients are computed by taking the inner product of the input x(t) with the scaling function φ(t) and the wavelet function ψ(t):
$$c(n) = \int_{-\infty}^{\infty} x(t)\,\phi(t-n)\,dt, \qquad d(j,n) = 2^{j/2}\int_{-\infty}^{\infty} x(t)\,\psi(2^{j}t-n)\,dt. \qquad (2)$$
The DWT suffers from the following issues: oscillations, shift variance, aliasing, and lack of directionality. One solution to these problems is the Complex Wavelet Transform (CWT), which uses complex-valued scaling and wavelet functions. The Dual-Tree Complex Wavelet Transform (DT-CWT) addresses the practical issues of the CWT. The DT-CWT [30, 28, 29] comes very close to mirroring the attractive properties of the Fourier Transform: a smooth, non-oscillating magnitude; a nearly shift-invariant magnitude with a simple near-linear phase encoding of signal shifts; substantially reduced aliasing; and better directional selectivity of wavelets in higher dimensions. This makes it easier to detect edges and oriented features in images. The six orientations of the wavelet transformation are given by 15°, 45°, 75°, 105°, 135°, and 165°. The dual-tree CWT employs two real DWTs: the first DWT gives the real part of the transform, while the second DWT gives the imaginary part. The two real DWTs use two different sets of filters, which are jointly designed to approximate the overall complex wavelet transform and to satisfy the perfect reconstruction (PR) conditions.
Let ℎ0 (𝑛), ℎ1 (𝑛) denote the low-pass and high-pass filter pair in the upper band, while 𝑔0 (𝑛), 𝑔1 (𝑛)
denote the same for the lower band. Two real wavelets are associated with each of the two real wavelet
transforms, denoted ψ_h(t) and ψ_g(t). The complex wavelet ψ(t) := ψ_h(t) + jψ_g(t) can be approximated using the half-sample delay condition [53], i.e., ψ_g(t) is approximately the Hilbert transform of ψ_h(t):
$$g_0(n) \approx h_0(n - 0.5) \;\Rightarrow\; \psi_g(t) \approx \mathcal{H}\{\psi_h(t)\}, \qquad \psi_h(t) = \sqrt{2}\sum_{n} h_1(n)\,\phi_h(2t-n), \qquad \phi_h(t) = \sqrt{2}\sum_{n} h_0(n)\,\phi_h(2t-n)$$
Similarly, we can define ψ_g(t), φ_g(t), and g_1(n). Since the filters are real, no complex arithmetic is required to implement the DTCWT. It is only two times more expansive in 1D because the total output data rate is exactly twice the input data rate. It is also easy to invert, as the two separate DWTs can each be inverted. In contrast, with the Fourier Transform it is difficult to obtain the low-pass and high-pass components of an image, the transform is less invertible in practice (the loss is higher when we apply a Fourier and inverse Fourier transform) compared to the DTCWT, and it cannot localize in time and frequency simultaneously.
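To make the near-perfect invertibility discussed above concrete, the following minimal sketch decomposes a dummy image-sized tensor with a one-level DTCWT and reconstructs it. It assumes the third-party pytorch_wavelets package (not part of the SVT release), and the exact coefficient layout follows that library's conventions.

```python
import torch
from pytorch_wavelets import DTCWTForward, DTCWTInverse  # assumed third-party package

# One-level dual-tree complex wavelet transform and its inverse (default filter banks).
xfm = DTCWTForward(J=1)
ifm = DTCWTInverse()

x = torch.randn(1, 3, 224, 224)       # dummy image-sized input
yl, yh = xfm(x)                       # yl: low-pass (scaling) part, yh: high-pass (wavelet) parts
x_hat = ifm((yl, yh))                 # reconstruct the input from the two parts

print(torch.mean((x - x_hat) ** 2))   # reconstruction error is negligibly small
```

This forward/inverse round trip is the quantity reported as reconstruction loss in Table 8, where DTCWT is compared against FFT and DWT.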
Given an input image 𝐈 ∈ ℝ^{3×224×224}, we split the image into patches of size 16×16 and obtain an embedding for each patch token using a position encoder and a token embedding network: 𝐗 = T(𝐈) + P(𝐈), where T and P refer to the token and position encoding networks. The distinct components of the SVT architecture are illustrated in Figure 1. The Scattering Vision Transformer consists of three components: a) Scattering Transformation, b) Spectral Gating Network, and c) Spectral Channel and Token Mixing.
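As a small illustration of the embedding step 𝐗 = T(𝐈) + P(𝐈) above, the sketch below uses a standard ViT-style patch embedding with a learned positional parameter; the sizes are illustrative and the actual SVT stem is hierarchical (see Table 9), so this is an assumption-laden sketch rather than the released implementation.

```python
import torch

# Minimal patchify-and-embed sketch: X = T(I) + P(I) with 16x16 patches (illustrative sizes).
B, C_in, H_img, W_img, patch, dim = 1, 3, 224, 224, 16, 384

I = torch.randn(B, C_in, H_img, W_img)
T = torch.nn.Conv2d(C_in, dim, kernel_size=patch, stride=patch)  # token embedding network T(.)
num_patches = (H_img // patch) * (W_img // patch)                # 14 * 14 = 196 tokens
P = torch.nn.Parameter(torch.zeros(1, num_patches, dim))         # position encoding P(.)

tokens = T(I).flatten(2).transpose(1, 2)   # (B, 196, dim) patch tokens
X = tokens + P                             # X = T(I) + P(I), fed to the scatter/attention layers
print(X.shape)
```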
A. Scattering Transformation:
The input image 𝐈 is first patchified into a feature tensor 𝐗 ∈ ℝ^{C×H×W}, whose spatial resolution is H × W and whose number of channels is C. To extract the features of an image, we feed 𝐗 into a series of transformer layers. We use a novel spectral transform based on an invertible scattering network instead of the standard self-attention network. This allows us to capture both the fine-grained and the global information in the image. The fine-grained information consists of texture, patterns, and small features, which are encoded by the high-frequency components of the spectral transform. The global information consists of the overall brightness, contrast, edges, and contours, which are encoded by the low-frequency components. Given the feature 𝐗 ∈ ℝ^{C×H×W}, we apply the scattering transform using DTCWT [54], as discussed in Section 2.1, to obtain the corresponding frequency representation 𝐗_F = scatter(𝐗). The transformation in the frequency domain 𝐗_F provides two components: a low-frequency (scaling) component 𝐗_φ and a high-frequency (wavelet) component 𝐗_ψ. The simplified formulation for the real component of scatter(⋅) is:
$$\mathbf{X}_F(u,v) = \mathbf{X}_\phi(u,v) + \mathbf{X}_\psi(u,v) = \sum_{h=0}^{H-1}\sum_{w=0}^{W-1} c_{M,h,w}\,\phi_{M,h,w} \;+\; \sum_{m=0}^{M-1}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}\sum_{k=1}^{6} d^{k}_{m,h,w}\,\psi^{k}_{m,h,w} \qquad (3)$$
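As a shape-level illustration of this decomposition (not the released SVT code), the sketch below applies a one-level DTCWT to a hypothetical stage-one feature tensor and inspects the two pieces of Eq. (3): the scaling component 𝐗_φ and the wavelet component 𝐗_ψ, whose sub-bands carry the six directional orientations k = 1, …, 6. It again assumes the pytorch_wavelets package, and the exact tensor layout (including how real and imaginary parts are stored) follows that library's conventions.

```python
import torch
from pytorch_wavelets import DTCWTForward  # assumed third-party package

B, C, H, W = 2, 64, 56, 56            # hypothetical stage-1 feature size (assumption)
X = torch.randn(B, C, H, W)

scatter = DTCWTForward(J=1)           # one decomposition level, i.e. M = 1 in Eq. (3)
X_phi, X_psi = scatter(X)             # low-frequency (scaling) and high-frequency (wavelet) parts

print(X_phi.shape)                    # X_phi: coarse energy/shape information
for band in X_psi:                    # one entry per level; each holds the 6 directional orientations
    print(band.shape)                 # layout of orientations and real/imag parts is library-defined
```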
To perform EBM, we first reshape a tensor 𝐀 from ℝ^{H×W×C} to ℝ^{H×W×C_b×C_d}, where C = C_b × C_d and C_b ≫ C_d. We then define a weight matrix 𝐖 ∈ ℝ^{C_b×C_d×C_d} and perform Einstein multiplication between 𝐀 and 𝐖 along the last two dimensions, resulting in a blended feature tensor 𝐘 ∈ ℝ^{H×W×C_b×C_d}, as shown in Figure 2. The formula for EBM is:
$$\mathbf{Y}_{H\times W\times C_b\times C_d} = \mathbf{A}_{H\times W\times C_b\times C_d} \;⧆\; \mathbf{W}_{C_b\times C_d\times C_d}$$

Figure 2: Einstein Blending Method.
where ⧆ represents Einstein multiplication and the bias terms are b_ψc ∈ ℝ^{C_b×C_d} and b_ψt ∈ ℝ^{H×H}. The total number of weight parameters in the high-frequency gating network is now (C_b × C_d × C_d) + (W × H × H) instead of (C × H × W × k × 2), where C ≫ H, and the bias requires (C_b × C_d) + (H × W) parameters. This reduces the number of parameters and multiplications required for the high-frequency gating operation on an image. We use a standard torch package [52] to perform Einstein multiplication. Finally, we perform the inverse scattering transform using the low-frequency representation (Eq. 4) and the high-frequency representation (Eq. 6) to bring the features back from the spectral domain to the physical domain. Our SVT architecture
consists of L layers, comprising α scatter layers and (L − α) attention layers [64], where L denotes the network's depth. The scatter layers, being invertible, capture both the global and the fine-grained information in the image via low-pass and high-pass filters, while the attention layers focus on extracting semantic features and addressing long-range dependencies present in the image.
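The following sketch makes the Einstein Blending Method concrete with torch.einsum and contrasts its parameter count with a dense per-location gating weight. The sizes C, H, W, C_b, and C_d are hypothetical placeholders (not the paper's exact settings), so this is an illustrative sketch of the blending rule rather than the released implementation.

```python
import torch

# Einstein Blending Method (EBM) on the channel axis with illustrative (hypothetical) sizes.
H, W, C = 14, 14, 448        # spatial size and channels of a high-frequency feature map (assumed)
Cb, Cd = 64, 7               # blocked split with C = Cb * Cd and Cb >> Cd (assumed values)
assert Cb * Cd == C

A = torch.randn(H, W, C).reshape(H, W, Cb, Cd)   # reshape to blocked channels
Wc = torch.randn(Cb, Cd, Cd)                     # channel-mixing weight
bc = torch.randn(Cb, Cd)                         # channel bias b_psi_c

# Einstein multiplication over the last two dimensions:
# Y[h, w, b, e] = sum_d A[h, w, b, d] * Wc[b, d, e]
Y = torch.einsum('hwbd,bde->hwbe', A, Wc) + bc

# Weight counts: blocked EBM gating vs. a dense per-location weight, as in the text.
k = 6                                            # number of directional orientations
ebm_weights = Cb * Cd * Cd + W * H * H           # (Cb x Cd x Cd) + (W x H x H)
dense_weights = C * H * W * k * 2                # (C x H x W x k x 2)
print(Y.shape, ebm_weights, dense_weights)       # e.g. 5,880 vs. 1,053,696 weights here
```

The blocked weight makes the gating cost grow with C_b·C_d² rather than with the full C × H × W grid.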
Table 1: The table shows the performance of various vision backbones on the ImageNet1K[11] dataset
for image recognition tasks. ⋆ indicates additionally trained with the Token Labeling objective using
MixToken[27] and a convolutional stem (conv-stem) [65] for patch encoding. This table provides
results for input image size 224 × 224. We have grouped the vision models into three categories
based on their GFLOPs (Small, Base, and Large). The GFLOP ranges: Small (GFLOPs<6), Base
(6≤GFLOPs<10), and Large (10≤GFLOPs<30).
Method Params GFLOPS Top-1 Top-5 Method Params GFLOPS Top-1 Top-5
Small Large
ResNet-50 [23] 25.5M 4.1 78.3 94.3 ResNet-152 [23] 60.2M 11.6 81.3 95.5
BoTNet-S1-50 [56] 20.8M 4.3 80.4 95.0 ResNeXt101 [72] 83.5M 15.6 81.5 -
Cross-ViT-S [6] 26.7M 5.6 81.0 - gMLP-B [39] 73.0M 15.8 81.6 -
Swin-T [41] 29.0M 4.5 81.2 95.5 DeiT-B [61] 86.6M 17.6 81.8 95.6
ConViT-S [15] 27.8M 5.4 81.3 95.7 SE-ResNet-152 [25] 66.8M 11.6 82.2 95.9
T2T-ViT-14 [77] 21.5M 4.8 81.5 95.7 Cross-ViT-B [6] 104.7M 21.2 82.2 -
RegionViT-Ti+ [5] 14.3M 2.7 81.5 - ResNeSt-101 [80] 48.3M 10.2 82.3 -
SE-CoTNetD-50 [37] 23.1M 4.1 81.6 95.8 ConViT-B [15] 86.5M 16.8 82.4 95.9
Twins-SVT-S [10] 24.1M 2.9 81.7 95.6 PoolFormer [76] 73.0M 11.8 82.5 -
CoaT-Lite-S [73] 20.0M 4.0 81.9 95.5 T2T-ViTt-24 [77] 64.1M 15.0 82.6 95.9
PVTv2-B2 [67] 25.4M 4.0 82.0 96.0 TNT-B [21] 65.6M 14.1 82.9 96.3
LITv2-S [45] 28.0M 3.7 82.0 - CycleMLP-B4 [7] 52.0M 10.1 83.0 -
MViTv2-T [35] 24.0M 4.7 82.3 - DeepViT-L [83] 58.9M 12.8 83.1 -
Wave-ViT-S [75] 19.8M 4.3 82.7 96.2 RegionViT-B [5] 72.7M 13.0 83.2 96.1
CSwin-T [13] 23.0M 4.3 82.7 - CycleMLP-B5 [7] 76.0M 12.3 83.2 -
DaViT-Ti [12] 28.3M 4.5 82.8 - ViP-Large/7 [24] 88.0M 24.4 83.2 -
SVT-H-S 21.7M 3.9 83.1 96.3 CaiT-S36 [62] 68.4M 13.9 83.3 -
iFormer-S [55] 20.0M 4.8 83.4 96.6 AS-MLP-B [38] 88.0M 15.2 83.3 -
CMT-S [19] 25.1M 4.0 83.5 - BoTNet-S1-128 [56] 75.1M 19.3 83.5 96.5
MaxViT-T [63] 31.0M 5.6 83.6 - Swin-B [41] 88.0M 15.4 83.5 96.5
Wave-ViT-S⋆ [75] 22.7M 4.7 83.9 96.6 Wave-MLP-B [58] 63.0M 10.2 83.6 -
SVT-H-S⋆ (Ours) 22.0M 3.9 84.2 96.9 LITv2-B [45] 87.0M 13.2 83.6 -
Base PVTv2-B4 [67] 62.6M 10.1 83.6 96.7
ResNet-101 [23] 44.6M 7.9 80.0 95.0 ViL-Base [81] 55.7M 13.4 83.7 -
BoTNet-S1-59 [56] 33.5M 7.3 81.7 95.8 Twins-SVT-L [10] 99.3M 15.1 83.7 96.5
T2T-ViT-19 [77] 39.2M 8.5 81.9 95.7 Hire-MLP-L [20] 96.0M 13.4 83.8 -
CvT-21 [69] 32.0M 7.1 82.5 - RegionViT-B+ [5] 73.8M 13.6 83.8 -
GFNet-H-B [51] 54.0M 8.6 82.9 96.2 Focal-Base [74] 89.8M 16.0 83.8 96.5
Swin-S [41] 50.0M 8.7 83.2 96.2 PVTv2-B5 [67] 82.0M 11.8 83.8 96.6
Twins-SVT-B [10] 56.1M 8.6 83.2 96.3 CoTNetD-152 [37] 55.8M 17.0 84.0 97.0
CoTNetD-101 [37] 40.9M 8.5 83.2 96.5 DAT-B [70] 88.0M 15.8 84.0 -
PVTv2-B3 [67] 45.2M 6.9 83.2 96.5 LV-ViT-M⋆ [27] 55.8M 16.0 84.1 96.7
LITv2-M [45] 49.0M 7.5 83.3 - CSwin-B [13] 78.0M 15.0 84.2 -
RegionViT-M+ [5] 42.0M 7.9 83.4 - HorNet-𝐵𝐺𝐹 [50] 88.0M 15.5 84.3 -
MViTv2-S [35] 35.0M 7.0 83.6 - DynaMixer-L [68] 97.0M 27.4 84.3 -
CSwin-S [13] 35.0M 6.9 83.6 - MViTv2-B [35] 52.0M 10.2 84.4 -
DaViT-S [12] 49.7M 8.8 84.2 - DaViT-B [12] 87.9M 15.5 84.6 -
VOLO-D1⋆ [78] 26.6M 6.8 84.2 - CMT-L [19] 74.7M 19.5 84.8 -
CMT-B [19] 45.7M 9.3 84.5 - MaxViT-B [63] 120.0M 23.4 85.0 -
MaxViT-S [63] 69.0M 11.7 84.5 - VOLO-D2⋆ [78] 58.7M 14.1 85.2 -
iFormer-B [55] 48.0M 9.4 84.6 97.0 VOLO-D3⋆ [78] 86.3M 20.6 85.4 -
Wave-ViT-B⋆ [75] 33.5M 7.2 84.8 97.1 Wave-ViT-L⋆ [75] 57.5M 14.8 85.5 97.3
SVT-H-B⋆ (Ours) 32.8M 6.3 85.2 97.3 SVT-H-L⋆ (Ours) 54.0M 12.7 85.7 97.5
85.2% with fewer parameters. We also compare SVT with iFormer [55], which captures low- and high-frequency information from visual data, whereas SVT uses an invertible spectral method, namely the scattering network, to obtain the low-frequency and high-frequency components and uses tensor and Einstein mixing, respectively, to capture effective spectral features from visual data. SVT's top-1 accuracy is 85.2%, which is better than iFormer-B at 84.6%, with fewer parameters and FLOPS. We also compare SVT with WaveMLP [58], an MLP-mixer-based technique that uses amplitude and phase information to represent the semantic content of an image. SVT uses the low-frequency component as the amplitude of the original feature, while the high-frequency component captures complex semantic changes in the input image. Our studies show, as depicted in Table 1, that SVT outperforms WaveMLP by about 1.8%.
We divide the transformer architectures into three parts based on computation requirements (FLOP counts): small (less than 6 GFLOPS), base (6-10 GFLOPS), and large (10-30 GFLOPS), following a categorization similar to WaveViT [75]. Notable recent works in the small category include CSwin Transformers [13], LiTv2 [45], MaxViT [63], iFormer [55], the CMT transformer, PVTv2 [67], and WaveViT [75]. It is worth mentioning that WaveViT relies on extra annotations to achieve its best results. In this context, SVT-H-S stands out as the state-of-the-art model in the small category,
Table 2: Initial Attention Layer vs Scatter Layer vs Initial Convolutional: This table compares the SVT transformer with initial scatter layers and later attention layers, SVT-Inverse with initial attention layers and later scatter layers, and SVT with initial convolutional layers. We also show a variant that alternates spectral and attention layers. This shows that initial scatter layers work better than the rest.
Model Params(M) FLOPS(G) Top-1(%) Top-5(%)
SVT-H-S 22.0M 3.9 84.2 96.9
SVT-H-S-Init-CNN 21.7M 4.1 84.0 95.7
SVT-H-S-Inverse 21.8M 3.9 83.1 94.6
SVT-H-S-Alternate 22.4M 4.6 83.4 95.0

Table 3: This table shows the ablation analysis of various spectral layers in the SVT architecture, such as FN, FFC, WGN, and FNO. We conduct this ablation study on the small-size networks in the stage architecture. This indicates that SVT performs better than the other kinds of networks.
Model Params(M) FLOPS(G) Top-1(%) Top-5(%) Invertible loss(↓)
FFC 21.53 4.5 83.1 95.23 –
FN 21.17 3.9 84.02 96.77 –
FNO 21.33 3.9 84.09 96.86 3.27e-05
WGN 21.59 3.9 83.70 96.56 8.90e-05
SVT 22.22 3.9 84.20 96.93 6.64e-06

Table 4: Results on transfer learning datasets. We report the top-1 accuracy on the four datasets.
Model CIFAR-10 CIFAR-100 Flowers-102 Cars-196
ResNet50 [23] - - 96.2 90.0
ViT-B/16 [14] 98.1 87.1 89.5 -
ViT-L/16 [14] 97.9 86.4 89.7 -
Deit-B/16 [61] 99.1 90.8 98.4 92.1
ResMLP-24 [60] 98.7 89.5 97.9 89.5
GFNet-XS [51] 98.6 89.1 98.1 92.8
GFNet-H-B [51] 99.0 90.3 98.8 93.2
SVT-H-B 99.22 91.2 98.9 93.6

Table 5: The performance of various vision backbones on the COCO val2017 dataset for the downstream instance segmentation task with the Mask R-CNN 1x [22] method. We adopt Mask R-CNN as the base model, and the bounding box & mask Average Precision (i.e., AP^b & AP^m) are reported for evaluation.
Backbone AP^b AP^b_50 AP^b_75 AP^m AP^m_50 AP^m_75
ResNet50 [23] 38.0 58.6 41.4 34.4 55.1 36.7
Swin-T [41] 42.2 64.6 46.2 39.1 61.6 42.0
Twins-SVT-S [10] 43.4 66.0 47.3 40.3 63.2 43.4
LITv2-S [45] 44.9 - - 40.8 - -
RegionViT-S [5] 44.2 - - 40.8 - -
PVTv2-B2 [67] 45.3 67.1 49.6 41.2 64.2 44.4
SVT-H-S 46.0 68.1 50.4 41.9 65.0 45.1
achieving a top-1 accuracy of 84.2%. Similarly, SVT-H-B surpasses all the transformers in the base
category, boasting a top-1 accuracy of 85.2%. Lastly, SVT-H-L outperforms other large transformers
with a top-1 accuracy of 85.7% when tested on the ImageNet dataset with an image size of 224x224.
When comparing different architectural approaches, such as Convolutional Neural Networks (CNNs), transformer architectures (attention-based models), MLP mixers, and spectral architectures, SVT consistently outperforms its counterparts. For instance, SVT achieves better top-1 accuracy and parameter efficiency than CNN architectures such as ResNet-152 [23], ResNeXt [72], and ResNeSt. Among attention-based architectures, MaxViT [63] has been recognized as the best performer, surpassing models like DeiT [61], Cross-ViT [6], DeepViT [83], and T2T [77] with a top-1 accuracy of 85.0%. However, SVT achieves an even higher top-1 accuracy of 85.7% with less than half the number of parameters. In the realm of MLP-mixer-based architectures, DynaMixer-L [68] emerges as the top-performing model, surpassing MLP-Mixer [59], gMLP [39], CycleMLP [7], Hire-MLP [20], AS-MLP [38], WaveMLP [58], and PoolFormer [76] with a top-1 accuracy of 84.3%. In comparison, SVT-H-L outperforms DynaMixer with a top-1 accuracy of 85.7% while requiring fewer parameters and computations. Hierarchical architectures, which include models such as PVT [66], the Swin [41], CSwin [13], and Twins [10] transformers, and VOLO [78], are also considered. Among this category, VOLO achieves the highest top-1 accuracy of 85.4%; however, SVT-H-L outperforms VOLO with a top-1 accuracy of 85.7%. Lastly, in the spectral architecture category, models such as GFNet [51], iFormer [55], LiTv2 [45], HorNet [50], and Wave-ViT [75] are examined. Wave-ViT was previously the state-of-the-art method with a top-1 accuracy of 85.5%. Nevertheless, SVT-H-L surpasses Wave-ViT in terms of top-1 accuracy, network size (number of parameters), and computational complexity (FLOPS), as indicated in Table 1.
3.3 What Matters: Initial Spectral, Initial Attention, or Initial Convolution Layers?
We conducted an ablation study to show that initial scatter layers followed by attention in deeper layers are more beneficial than initial attention layers followed by later scatter layers (SVT-H-S-Inverse). We also compare a transformer model that alternates attention and scatter layers (SVT-H-S-Alternate), as shown in Table 2. From all these combinations, we observe that initial scatter layers followed by attention in deeper layers are the most beneficial. We also compare the performance of SVT when the architecture changes from all attention layers (PVTv2 [67]) to all spectral layers (GFNet [51])
Table 6: The SVT model comprises a low-frequency component and a high-frequency component obtained with the help of the scattering net using the Dual-Tree Complex Wavelet Transform. Each frequency component is controlled by a parameterized weight matrix using patch (token) mixing and/or channel mixing. This table shows all combinations; SVT_TTEE is the best performing among them.
Backbone LF-Token LF-Channel HF-Token HF-Channel Params (M) FLOPS (G) Top-1 (%) Top-5 (%)
SVT_TTTT T T T T 25.18 4.4 83.97 96.86
SVT_EETT E E T T 21.90 4.1 83.87 96.67
SVT_EEEE E E E E 21.87 3.7 83.70 96.56
SVT_TTEE T T E E 22.01 3.9 84.20 96.82
SVT_TTEX T T E ✗ 21.99 4.0 84.06 96.76
SVT_TTXE T T ✗ E 22.25 4.1 84.12 96.91
as well as a few spectral layers followed by the remaining attention layers (SVT, ours). We observe that combining spectral and attention layers boosts performance compared to all-attention and all-spectral transformers, as shown in Table 2. We have also conducted an experiment where the initial layers of a ViT are convolutional networks and the later layers are attention layers, to compare against SVT. The results are captured in Table 1, where we compare SVT with transformers having initial convolutional layers such as CvT [69], CMT [19], and HorNet [50]. Initial convolutional layers in a transformer do not perform as well as initial scatter layers. Initial scatter-layer-based transformers achieve better performance at a lower computation cost than initial convolutional-layer-based transformers, as shown in Table 2.
SVT uses a scattering network to decompose the signal into low-frequency and high-frequency components. We use a gating operator to obtain effective learnable features from this spectral decomposition; the gating operator multiplies a weight parameter with both the high- and low-frequency components. We have conducted experiments using tensor and Einstein mixing. Tensor mixing is a simple multiplication operator, while Einstein mixing uses an Einstein matrix multiplication operator [52]. We observe that, for the low-frequency component, tensor mixing performs better than Einstein mixing. As shown in Table 6, we start with SVT_TTTT, which uses tensor mixing in both the high- and low-frequency components, and find that it does not perform optimally. Reversing this and using Einstein mixing in both components (SVT_EEEE) also does not perform optimally. We therefore adopt SVT_TTEE, which uses tensor mixing in the low-frequency component and Einstein mixing in the high-frequency component. The high-frequency mixing further decomposes into token and channel mixing, whereas for the low-frequency component we simply apply tensor multiplication, as it is an energy or amplitude component.
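To make the tensor-versus-Einstein distinction concrete, the following hedged sketch contrasts the two gating styles on made-up feature sizes: simple element-wise (tensor) gating for the low-frequency component and blocked einsum (Einstein) gating for one high-frequency orientation. The shapes are illustrative assumptions, not the exact SVT configuration.

```python
import torch

H, W, C = 14, 14, 320                      # hypothetical feature sizes (assumption)
X_low = torch.randn(H, W, C)               # low-frequency (amplitude-like) component

# Tensor mixing: element-wise multiplication with a learnable weight of the same shape.
W_low = torch.nn.Parameter(torch.randn(H, W, C))
Y_low = X_low * W_low

# Einstein mixing: blocked channel blending via einsum, as in the EBM sketch above.
Cb, Cd = 64, 5                             # hypothetical split with C = Cb * Cd
X_high = torch.randn(H, W, Cb, Cd)         # one directional orientation of the high-frequency part
W_high = torch.randn(Cb, Cd, Cd)
Y_high = torch.einsum('hwbd,bde->hwbe', X_high, W_high)

print(Y_low.shape, Y_high.shape)
```

This mirrors the SVT_TTEE choice above: the low-frequency path keeps the cheap amplitude-style gate, while the high-frequency path gets the blocked channel/token blending.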
In the second ablation analysis, we compare various spectral architectures, including the Fourier
Network (FN), Fourier Neural Operator (FNO), Wavelet Gating Network (WGN), and Fast Fourier
Convolution (FFC). When we contrast SVT with WGN, it becomes evident that SVT exhibits superior
directional selectivity and a more adept ability to manage complex transformations. Furthermore, in
comparison to FN and FNO, SVT excels in decomposing frequencies into low and high-frequency
components. It’s worth noting that SVT surpasses other spectral architectures primarily due to its
utilization of the Directional Dual-Tree Complex Wavelet Transform (DTCWT), which offers direc-
tional orientation and enhanced invertibility, as demonstrated in Table 3. For a more comprehensive
analysis, please refer to the Supplementary section.
We train SVT on ImageNet-1K data and fine-tune it on various datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars for image recognition tasks. We compare SVT-H-B with various transformers such as DeiT [61], ViT [14], and GFNet [51], as well as with CNN architectures such as ResNet-50 and MLP-mixer architectures such as ResMLP. This comparison is shown in Table 4. SVT-H-B outperforms the state of the art on CIFAR-10 with a top-1 accuracy of 99.1%, CIFAR-100 with a top-1 accuracy of 91.3%, Flowers with a top-1 accuracy of 98.9%, and Cars with a top-1 accuracy of 93.7%. We observe that SVT has more representative features and an inbuilt discriminative nature, which helps in classifying images into various categories. We use a
Table 7: Latency (speed test): This table shows the latency (milliseconds) of SVT compared with convolution-type networks and attention-type, pool-type, MLP-type, and spectral-type transformers. We report the latency per sample on an A100 GPU. We adopt the latency table from EfficientFormer [36].
Model Type Params(M) GMACs(G) Top-1(%) Latency(ms)
ResNet50 [23] Convolution 25.5 4.1 78.5 9.0
DeiT-S [61] Attention 22.5 4.5 81.2 15.5
PVT-S [67] Attention 24.5 3.8 79.8 23.8
T2T-14 [77] Attention 21.5 4.8 81.5 21.0
Swin-T [40] Attention 29.0 4.5 81.3 22.0
CSwin-T [13] Attention 23.0 4.3 82.7 28.7
PoolFormer [76] Pool 31.0 5.2 81.4 41.2
ResMLP-S [60] MLP 30.0 6.0 79.4 17.4
EfficientFormer [36] MetaBlock 31.3 3.9 82.4 13.9
GFNet-H-S [51] Spectral 32.0 4.6 81.5 14.3
SVT-H-S Spectral 22.0 3.9 84.2 14.7

Table 8: Invertibility: This table shows the invertibility of SVT (DTCWT) compared with Fourier and DWT. We also compare different directional orientations and show the reconstruction loss (MSE) on an image.
Model MSE loss(↓) PSNR (dB)(↑)
Fourier (FFT) 3.27e-05 11.18
DWT-M1 8.90e-05 76.33
DWT-M2 3.19e-05 84.67
DWT-M3 1.08e-05 91.94
DTCWT-M1 6.64e-06 137.97
DTCWT-M2 2.01e-06 138.87
DTCWT-M3 1.23e-07 142.14
pre-trained SVT model for the downstream instance segmentation task and obtain good results on the
MS-COCO dataset as shown in Table- 5.
3.8 Limitations
SVT currently uses six directional orientations to capture an image’s fine-grained semantic information.
It is possible to go for the second degree, which gives thirty-six orientations, while the third degree
gives 216 orientations. The more orientations, the more semantic information could be captured, but
this leads to higher computational complexity. The decomposition parameter ‘M’ is currently set to 1
to get single low-pass and high-pass components. Higher values of ‘M’ give more components in
both frequencies but lead to higher complexity.
4 Related Work
The Vision Transformer (ViT) [14] was the first transformer-based attempt to classify images into
pre-defined categories and to bring NLP advances into vision. Following this, several transformer-based approaches such as DeiT [61], Tokens-to-Token ViT [77], Transformer iN Transformer (TNT) [21], Cross-ViT [6], Class-Attention Image Transformer (CaiT) [62], UniFormer [34], BEiT [3], SViT [49], RegionViT [5], MaxViT [63], etc. have been proposed to improve accuracy using multi-headed self-attention (MSA). PVT [66], Swin [41], CSwin [13], and Twins [10] use hierarchical architectures to improve the performance of vision transformers on various tasks. The complexity of MSA is O(n²); for high-resolution images, the cost increases quadratically with token length. PoolFormer [76] uses a pooling operation over small patches to obtain a down-sampled version of the image and reduce computational complexity. The main problem with PoolFormer is that it uses a MaxPooling operation, which is not invertible. Another approach to reducing the complexity is to use spectral transformers such as FNet [33], GFNet [51], AFNO [18], WaveMix [26], WaveViT [75], SpectFormer [48], FourierFormer [43], etc. FNet [33] does not use inverse Fourier
transforms, leading to an invertibility issue. GFNet [51] solves this by using inverse Fourier transforms
with a gating network. AFNO [18] uses the adaptive nature of a Fourier neural operator similar to
GFNet. SpectFormer [48] introduces a novel transformer architecture that combines both spectral and
attention networks for vision tasks. GFNet, SpectFormer, and AFNO do not have proper separation
of low-frequency and high-frequency components and may struggle to handle the semantic content
of images. In contrast, SVT has a clear separation of frequency components and uses directional
orientations to capture semantic information. FourierIntegral [43] is similar to GFNet and may have
similar issues in separating frequency components.
WaveMLP [58] is a recent effort that dynamically aggregates tokens as a wave function with two parts, amplitude and phase, to capture the original features and the semantic content of images, respectively. SVT instead uses a scattering network to provide low-frequency and high-frequency components; the high-frequency component has six or more directional orientations to capture semantic information in images. We use Einstein multiplication in the token and channel mixing of the high-frequency components, leading to lower computational complexity and a smaller network. In Wave-ViT [75], the authors address the quadratic complexity of the self-attention network by using a wavelet transform to perform lossless down-sampling over keys and values. However, WaveViT still has the same complexity, as it uses attention instead of spectral layers. SVT uses the scatter network, which is more invertible compared to WaveViT.
One of the challenges of MSA is its inability to characterize different frequencies in the input image. HiLo attention (LITv2) [45] finds high-frequency and low-frequency components using a novel variant of MSA, but it does not solve the complexity issue of MSA. Another parallel effort, the Inception Transformer [55], uses an Inception mixer to capture high- and low-frequency information in visual data; iFormer still has the same complexity as it uses attention as the low-frequency mixer. SVT, in comparison, uses a spectral neural operator to capture the low- and high-frequency components using the DTCWT. This removes the O(n²) complexity, as it uses spectral mixing instead of attention. iFormer [55] uses non-invertible max-pooling and convolutional layers to capture high-frequency components, whereas SVT's mixer is completely invertible. SVT uses a scattering network to obtain better directional orientation for capturing fine-grained information such as lines and edges, compared to HiLo attention and iFormer.
References
[1] https://openai.com/blog/chatgpt/, 2022.
[2] Hezam Albaqami, G Hassan, and Amitava Datta. Comparison of wpd, dwt and dtcwt for multi-class seizure
type classification. In 2021 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), pages
1–7. IEEE, 2021.
[3] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. In
International Conference on Learning Representations, 2021.
[4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey
Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer,
2020.
[5] Chun-Fu Chen, Rameswar Panda, and Quanfu Fan. Regionvit: Regional-to-local attention for vision
transformers. In International Conference on Learning Representations, 2022.
[6] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision
transformer for image classification. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 357–366, 2021.
[7] Shoufa Chen, Enze Xie, GE Chongjian, Runjian Chen, Ding Liang, and Ping Luo. Cyclemlp: A mlp-like
architecture for dense prediction. In International Conference on Learning Representations, 2022.
[8] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. Advances in Neural Information Processing
Systems, 33:4479–4488, 2020.
[9] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language
modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[10] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua
Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural
Information Processing Systems, 34:9355–9366, 2021.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009.
[12] Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. Davit: Dual attention
vision transformers. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October
23–27, 2022, Proceedings, Part XXIV, pages 74–92. Springer, 2022.
[13] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and
Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–
12134, 2022.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is
worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning
Representations, 2020.
[15] Stéphane d’Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent Sagun.
Convit: Improving vision transformers with soft convolutional inductive biases. In International Conference
on Machine Learning, pages 2286–2296. PMLR, 2021.
[16] Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, and Wenyu
Liu. You only look at one sequence: Rethinking transformer in vision through object detection. Advances
in Neural Information Processing Systems, 34:26183–26197, 2021.
[17] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics,
pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
[18] John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, and Bryan Catanzaro.
Efficient token mixing for transformers via adaptive fourier neural operators. In International Conference
on Learning Representations, 2022.
[19] Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Xinghao Chen, Yunhe Wang, and Chang Xu. Cmt:
Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 12175–12185, 2022.
[20] Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, and Yunhe Wang.
Hire-mlp: Vision mlp via hierarchical rearrangement. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pages 826–836, June 2022.
[21] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer.
Advances in Neural Information Processing Systems, 34:15908–15919, 2021.
[22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages 2961–2969, 2017.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[24] Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, and Jiashi Feng. Vision permutator:
A permutable mlp-like architecture for visual recognition. IEEE Transactions on Pattern Analysis &
Machine Intelligence, (01):1–1, 2022.
[25] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 7132–7141, 2018.
[26] Pranav Jeevan and Amit Sethi. Wavemix: Resource-efficient token mixing for images. arXiv preprint
arXiv:2203.03689, 2022.
[27] Zi-Hang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng.
All tokens matter: Token labeling for training better vision transformers. Advances in Neural Information
Processing Systems, 34:18590–18602, 2021.
[28] Nick Kingsbury. Image processing with complex wavelets. Philosophical Transactions of the Royal Society
of London. Series A: Mathematical, Physical and Engineering Sciences, 357(1760):2543–2560, 1999.
[29] Nick Kingsbury. Complex wavelets for shift invariant analysis and filtering of signals. Applied and
computational harmonic analysis, 10(3):234–253, 2001.
[30] Nick G Kingsbury. The dual-tree complex wavelet transform: a new technique for shift invariance and
directional filters. In IEEE digital signal processing workshop, volume 86, pages 120–131. Citeseer, 1998.
[31] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained
categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages
554–561, 2013.
[32] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
[33] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. Fnet: Mixing tokens with fourier
transforms. arXiv preprint arXiv:2105.03824, 2021.
[34] Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao.
Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450,
2022.
[35] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph
Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814,
2022.
[36] Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian
Ren. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing
Systems, 35:12934–12949, 2022.
[37] Yehao Li, Ting Yao, Yingwei Pan, and Tao Mei. Contextual transformer networks for visual recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[38] Dongze Lian, Zehao Yu, Xing Sun, and Shenghua Gao. As-mlp: An axial shifted mlp architecture for
vision. In International Conference on Learning Representations, 2022.
[39] Hanxiao Liu, Zihang Dai, David So, and Quoc V Le. Pay attention to mlps. Advances in Neural Information
Processing Systems, 34:9204–9215, 2021.
[40] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang,
Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.
[41] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 10012–10022, 2021.
[42] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on
Learning Representations, 2018.
[43] Tan Minh Nguyen, Minh Pham, Tam Minh Nguyen, Khai Nguyen, Stanley Osher, and Nhat Ho. Fouri-
erformer: Transformer meets generalized fourier integral theorem. In Advances in Neural Information
Processing Systems, 2022.
[44] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of
classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages
722–729. IEEE, 2008.
[45] Zizheng Pan, Jianfei Cai, and Bohan Zhuang. Fast vision transformers with hilo attention. In Advances in
Neural Information Processing Systems, 2022.
[46] Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, and Jianfei Cai. Less is more: Pay less attention in
vision transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages
2035–2043, 2022.
[47] Badri N Patro and Vijay Agneeswaran. Efficiency 360: Efficient vision transformers. arXiv preprint
arXiv:2302.08374, 2023.
[48] Badri N Patro, Vinay P Namboodiri, and Vijay Srinivas Agneeswaran. Spectformer: Frequency and
attention is what you need in a vision transformer. arXiv preprint arXiv:2304.06446, 2023.
[49] Tianming Qiu, Ming Gui, Cheng Yan, Ziqing Zhao, and Hao Shen. Svit: Hybrid vision transformer models
with scattering transform. In 2022 IEEE 32nd International Workshop on Machine Learning for Signal
Processing (MLSP), pages 01–06. IEEE, 2022.
[50] Yongming Rao, Wenliang Zhao, Yansong Tang, Jie Zhou, Ser Nam Lim, and Jiwen Lu. Hornet: Effi-
cient high-order spatial interactions with recursive gated convolutions. Advances in Neural Information
Processing Systems, 35:10353–10366, 2022.
[51] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image
classification. Advances in Neural Information Processing Systems, 34:980–993, 2021.
[52] Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with einstein-like notation. In
International Conference on Learning Representations, 2022.
[53] Ivan W Selesnick. Hilbert transform pairs of wavelet bases. IEEE Signal Processing Letters, 8(6):170–173,
2001.
[54] Ivan W Selesnick, Richard G Baraniuk, and Nick C Kingsbury. The dual-tree complex wavelet transform.
IEEE signal processing magazine, 22(6):123–151, 2005.
[55] Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng YAN. Inception
transformer. In Advances in Neural Information Processing Systems, 2022.
[56] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani.
Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 16519–16529, 2021.
[57] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In
International conference on machine learning, pages 6105–6114. PMLR, 2019.
[58] Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, and Yunhe Wang. An image patch is
a wave: Phase-aware vision mlp. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 10935–10944, 2022.
[59] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner,
Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture
for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
[60] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave,
Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. Resmlp: Feedforward networks
for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2022.
[61] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou.
Training data-efficient image transformers & distillation through attention. In International Conference on
Machine Learning, pages 10347–10357. PMLR, 2021.
[62] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper
with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
pages 32–42, 2021.
[63] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li.
Maxvit: Multi-axis vision transformer. In Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 459–479. Springer, 2022.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems,
30, 2017.
[65] Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, and Rong Jin.
Scaled relu matters for training vision transformers. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 36, pages 2495–2503, 2022.
[66] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and
Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
[67] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and
Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media,
8(3):415–424, 2022.
[68] Ziyu Wang, Wenhao Jiang, Yiming M Zhu, Li Yuan, Yibing Song, and Wei Liu. Dynamixer: a vision mlp
architecture with dynamic mixing. In International Conference on Machine Learning, pages 22691–22701.
PMLR, 2022.
[69] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing
convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 22–31, 2021.
[70] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable
attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
4794–4803, 2022.
[71] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer:
Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information
Processing Systems, 34:12077–12090, 2021.
[72] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transforma-
tions for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 1492–1500, 2017.
[73] Weijian Xu, Yifan Xu, Tyler Chang, and Zhuowen Tu. Co-scale conv-attentional image transformers. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9981–9990, 2021.
[74] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal
self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641, 2021.
[75] Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, and Tao Mei. Wave-vit: Unifying wavelet and trans-
formers for visual representation learning. In Computer Vision–ECCV 2022: 17th European Conference,
Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXV, pages 328–345. Springer, 2022.
[76] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng
Yan. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 10819–10829, 2022.
[77] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng,
and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 558–567, 2021.
[78] Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, and Shuicheng Yan. Volo: Vision outlooker for visual
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[79] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, et al. Segvit: Semantic
segmentation with plain vision transformers. Advances in Neural Information Processing Systems, 35:4971–
4982, 2022.
[80] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas
Mueller, R Manmatha, et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 2736–2746, 2022.
[81] Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale
vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 2998–3008, 2021.
[82] Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, Xinggang Wang, Wenyu Liu, Gang Yu, and
Chunhua Shen. Topformer: Token pyramid transformer for mobile semantic segmentation. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12083–12093, 2022.
[83] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi
Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.
Appendix
This document provides a comprehensive analysis of the vanilla transformer architecture and explores various versions of SVT. The architecture comparisons are presented in Table 12, shedding light on the differences and capabilities of each version. The document also covers the training configurations, encompassing transfer learning, task learning, and fine-tuning tasks. The dataset information used for transfer learning is presented in Table 13, providing insights into dataset sizes and their relevance to different applications. Moving to the results section, we showcase the fine-tuned model outcomes, where models are initially trained on 224 × 224 images and subsequently fine-tuned on 384 × 384 images. The performance evaluation, as depicted in Table 14, covers accuracy metrics, the number of parameters (M), and floating-point operations (G). A detailed comparison of similar architectures is provided in Table 11. Regarding the trade-off between invertibility and redundancy, we conducted an experiment to demonstrate that invertibility aids in comprehending the image rather than merely contributing to performance, as shown in Table 10.
Figure 3: This figure shows the Filter characterization of the initial four layers of the SVT model. It
clearly shows that the High-frequency filter coefficient captures local filter information such as lines,
edges, and different orientations of an Image. The Low-frequency filter coefficient captures the shape
with the maximum energy part in the image.
Table 9: Detailed architecture specifications for three variants of our SVT with different model sizes,
i.e., SVT-S (small size), SVT-B (base size), and SVT-L (large size). 𝐸𝑖 , 𝐺𝑖 , 𝐻𝑖 , and 𝐶𝑖 represent
the expansion ratio of the feed-forward layer, the spectral gating number, the head number, and the
channel dimension in each stage 𝑖, respectively.
OP  Size  SVT-H-S  SVT-H-B  SVT-H-L
Stage 1  H/4 × W/4  [E1=8, G1=1, C1=64] ×3  [E1=8, G1=1, C1=64] ×3  [E1=8, G1=1, C1=96] ×3
Stage 2  H/8 × W/8  [E2=8, G2=1, C2=128] ×4  [E2=8, G2=1, C2=128] ×4  [E2=8, G2=1, C2=192] ×6
Stage 3  H/16 × W/16  [E3=4, H3=10, C3=320] ×6  [E3=4, H3=10, C3=320] ×12  [E3=4, H3=12, C3=384] ×18
Stage 4  H/32 × W/32  [E4=4, H4=14, C4=448] ×3  [E4=4, H4=16, C4=512] ×3  [E4=4, H4=16, C4=512] ×3
B Appendix: Dataset and Training Details:
B.1 Dataset and Training Setups on ImageNet-1K for Image Classification task
In this section, we outline the dataset and training setups for the Image Classification task on the
ImageNet-1K benchmark dataset. The dataset comprises 1.28 million training images and 50K valida-
tion images, spanning across 1,000 categories. To train the vision backbones from scratch, we employ
several data augmentation techniques, including RandAug, CutOut, and Token Labeling objectives
Table 10: Invertibility vs redundancy: This table shows the SVT-H performance for each orientation. We merge orientations to make them similar, producing 2- and 3-orientation variants. The final SVT-H-S has 6 orientations in the high-frequency components to capture curves and slants in all 6 directions. 'H' stands for hierarchical and 'S' for the small-size model at image size 224².
Model Params GFLOPs Top-1(%) Top-5(%)
SVT-H-S-ori-1 21.5M 3.9 83.2 94.9
SVT-H-S-ori-2 21.6M 3.9 83.4 95.1
SVT-H-S-ori-3 21.7M 3.9 83.7 95.5
SVT-H-S(ori-6) 22.0M 3.9 84.2 96.9
(Figure 4 panels: Low-Frequency Filter coefficients)
Figure 4: This figure shows the Filter characterization of the initial four layers of the SVT model. It
clearly shows that the High-frequency filter coefficient captures local filter information such as lines,
edges, and different orientations of an Image. The Low-frequency filter coefficient captures the shape
with the maximum energy part in the image.
Figure 5: Comparison of ImageNet Top-1 Accuracy (%) vs GFLOPs of various models in Vanilla
and Hierarchical architecture.
Figure 6: Comparison of ImageNet Top-1 Accuracy (%) vs Parameters (M) of various models in
Vanilla and Hierarchical architecture.
with MixToken. These augmentation techniques help enhance the model’s generalization capabilities.
For performance evaluation, we measure the trained backbones’ top-1 and top-5 accuracies on the
validation set, providing a comprehensive assessment of the model’s classification capabilities. In the
optimization process, we adopt the AdamW optimizer with a momentum of 0.9, combining it with a
10-epoch linear warm-up phase and a subsequent 310-epoch cosine decay learning rate scheduler.
These strategies aid in achieving stable and effective model training. To handle the computational
load, we distribute the training process on 8 V100 GPUs, utilizing a batch size of 128. This distributed
setup helps accelerate the training process while making efficient use of available hardware resources.
The learning rate and weight decay are fixed at 0.00001 and 0.05, respectively, maintaining stable
training and mitigating overfitting risks.
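A minimal sketch of this optimization recipe (AdamW with a 10-epoch linear warm-up followed by a 310-epoch cosine decay) is given below; it simply mirrors the hyperparameters stated in the text using standard PyTorch schedulers, with a stand-in module in place of the actual SVT backbone, and is not the released training script.

```python
import torch

model = torch.nn.Linear(768, 1000)   # stand-in module; substitute an SVT backbone here

# AdamW with the learning rate and weight decay stated above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)

# 10-epoch linear warm-up followed by a 310-epoch cosine decay, stepped once per epoch.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=10)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=310)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[10])

for epoch in range(320):
    # ... one training epoch over ImageNet-1K with a per-GPU batch size of 128 ...
    scheduler.step()
```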
In the context of transfer learning, we evaluate the efficacy of our vanilla SVT architecture on widely used benchmark datasets, namely CIFAR-10 [32], CIFAR-100 [32], Oxford Flowers-102 [44], and Stanford Cars [31]. Our approach follows the methodology of previous studies [57, 14, 61, 60, 51]: we initialize the model with pre-trained weights from ImageNet and subsequently fine-tune it on the new datasets.
Table-4 in the main paper presents a comprehensive comparison of the transfer learning performance
of both our basic and best models against state-of-the-art CNNs and vision transformers. To maintain
consistency, we employed a batch size of 64, a learning rate (lr) of 0.0001, a weight-decay of 1e-4,
a clip-grad value of 1, and performed 5 epochs of warmup. For the transfer learning process, we
utilized a pre-trained model that was initially trained on the ImageNet-1K dataset. This pre-trained
model was fine-tuned on the specific transfer learning dataset mentioned in Table-13 for a total of
1000 epochs.
In this section, we conduct an in-depth analysis of the pre-trained SVT-H-small model’s performance
on the COCO dataset for two distinct downstream tasks involving object localization, ranging from
Table 11: This shows a performance comparison of SVT with similar Transformer Architecture with
different sizes of the networks on ImageNet-1K. ⋆ indicates additionally trained with the Token
Labeling objective using MixToken[27].
Network Params GFLOPs Top-1 Acc (%) Top-5 Acc (%)
Vanilla Transformer Comparison
FFC-ResNet-50 [8] 26.7M - 77.8 -
FourierFormer [43] - - 73.3 91.7
GFNet-Ti [51] 7M 1.3 74.6 92.2
SVT-T 9M 1.8 76.9 93.4
FFC-ResNet-101 [8] 46.1M - 78.8 -
Fnet-S [33] 15M 2.9 71.2 -
GFNet-XS [51] 16M 2.9 78.6 94.2
GFNet-S [51] 25M 4.5 80.0 94.9
SVT-XS 19.9M 4.0 79.9 94.5
SVT-S 32.2M 6.6 81.5 95.3
FFC-ResNet-152 [8] 62.6M - 78.9 -
GFNet-B [51] 43M 7.9 80.7 95.1
SVT-B 57.6M 11.8 82.0 95.6
Hierarchical Transformer Comparison
GFNet-H-S [51] 32M 4.6 81.5 95.6
LIT-S [46] 27M 4.1 81.5 -
iFormer-S [55] 20.0M 4.8 83.4 96.6
Wave-ViT-S⋆ [75] 22.7M 4.7 83.9 96.6
SVT-H-S 21.7M 3.9 83.1 96.3
SVT-H-S⋆ 22.0M 3.9 84.2 96.9
GFNet-H-B [51] 54M 8.6 82.9 96.2
LIT-M [46] 48M 8.6 83.0 -
LITv2-M [45] 49.0M 7.5 83.3 -
iFormer-B [55] 48.0M 9.4 84.6 97.0
Wave-MLP-B [58] 63.0M 10.2 83.6 -
Wave-ViT-B⋆ [75] 33.5M 7.2 84.8 97.0
SVT-H-B⋆ 32.8M 6.3 85.2 97.3
LIT-B [46] 86M 15.0 83.4 -
LITv2-B [45] 87.0M 13.2 83.6 -
HorNet-𝐵𝐺𝐹 [50] 88.0M 15.5 84.3 -
iFormer-L[55] 87.0M 14.0 84.8 97.0
Wave-ViT-L⋆ [75] 57.5M 14.8 85.5 97.3
SVT-H-L⋆ 54.0M 12.7 85.7 97.5
bounding-box level to pixel level. Specifically, we evaluate our SVT-H-small model on instance
segmentation tasks, such as Mask R-CNN [22], as demonstrated in Table-5 of the main paper.
For the downstream task, we replace the CNN backbones in the respective detectors with our pre-trained
SVT-H-small model to evaluate its effectiveness. Prior to this, we pre-train each vision backbone on
the ImageNet-1K dataset, initializing the newly added layers with Xavier initialization [17]. Next,
we adhere to the standard setups defined in [41] to train all models on the COCO train2017 dataset,
which comprises approximately 118,000 images. The training process is performed with a batch size
of 16, and we utilize the AdamW optimizer [42] with a weight decay of 0.05, an initial learning rate
of 0.0001, and betas set to (0.9, 0.999). To manage the learning rate during training, we adopt the
step learning rate policy with linear warm-up over the first 500 iterations and a warm-up ratio of 0.001.
These learning rate configurations aid in optimizing the model’s performance and convergence.
In our main experiments, we conduct image classification tasks on the widely-used ImageNet dataset
[11], a standard benchmark for large-scale image classification. To ensure a fair and meaningful
Table 12: In this table, we present a comprehensive overview of different versions of SVT within the
vanilla transformer architecture. The table includes detailed configurations such as the number of
heads, embedding dimensions, the number of layers, and the training resolution for each variant. For
SVT-H models with a hierarchical structure, we refer readers to Table 9, which
outlines the specifications for all four stages. Additionally, the table provides FLOPs (floating-point
operations) calculations for input sizes of both 224×224 and 384×384. In the vanilla SVT architecture,
we utilize four spectral layers with 𝛼 = 4, while the remaining attention layers are (𝐿 − 𝛼).
Model #Layers #heads #Embedding Dim Params (M) Training Resolution FLOPs (G)
SVT-Ti 12 4 256 9 224 1.8
SVT-XS 12 6 384 20 224 4.0
SVT-S 19 6 384 32 224 6.6
SVT-B 19 8 512 57 224 11.5
SVT-XS 12 6 384 21 384 13.1
SVT-S 19 6 384 33 384 22.0
SVT-B 19 8 512 57 384 37.3
Table 13: This table presents information about datasets used for transfer learning. It includes the
size of the training and test sets, as well as the number of categories included in each dataset.
Dataset CIFAR-10 [32] CIFAR-100 [32] Stanford Cars [31] Flowers-102 [44]
Train Size 50,000 50,000 8,144 2,040
Test Size 10,000 10,000 8,041 6,149
#Categories 10 100 196 102
Figure 7: The first column shows phase and magnitude plots for the Fourier transform, and the second column shows the low-frequency component of the Dual-Tree Complex Wavelet Transform (DT-CWT). The third column onwards shows high-frequency visualizations for all six direction-selective orientations. The first row visualizes phase information and the second row shows the magnitude of all six high-frequency components.
comparison with previous research [61, 60, 51], we adopt the same training details for our SVT
models. For the vanilla transformer architecture (SVT), we utilize the hyperparameters recommended
by the GFNet implementation [51]. Similarly, for the hierarchical architecture (SVT-H), we employ
the hyperparameters recommended by the WaveVit implementation [75]. During fine-tuning at higher
resolutions, we follow the hyperparameters suggested by the GFNet implementation [51] and train
our models for 30 epochs.
All model training is performed on a single machine equipped with 8 V100 GPUs. In our experiments,
we specifically compare the fine-tuning performance of our models with GFNet [51]. Our observations
indicate that our SVT models outperform GFNet’s base spectral network. For instance, SVT-S(384)
achieves an impressive accuracy of 83.0%, surpassing GFNet-S(384) by 1.2%, as presented in Table 14.
Similarly, SVT-XS and SVT-B outperform GFNet-XS and GFNet-B, respectively, highlighting the
superior performance of our SVT models in the fine-tuning process.
Table 14: We conducted a comparison of various transformer-style architectures for image classifi-
cation on ImageNet. This includes vision transformers [61], MLP-like models [60, 39], spectral
transformers [51] and our SVT models, which have similar numbers of parameters and FLOPs.
The top-1 accuracy on ImageNet’s validation set, as well as the number of parameters and FLOPs, are
reported. All models were trained using 224 × 224 images. We used the notation "↑384" to indicate
models fine-tuned on 384 × 384 images for 30 epochs.
Model Params (M) FLOPs (G) Resolution Top-1 Acc. (%) Top-5 Acc. (%)
gMLP-Ti [39] 6 1.4 224 72.0 -
DeiT-Ti [61] 5 1.2 224 72.2 91.1
GFNet-Ti [51] 7 1.3 224 74.6 92.2
SVT-T 9 1.8 224 76.9 93.4
ResMLP-12 [60] 15 3.0 224 76.6 -
GFNet-XS [51] 16 2.9 224 78.6 94.2
SVT-XS 20 4.0 224 79.9 94.5
DeiT-S [61] 22 4.6 224 79.8 95.0
gMLP-S [39] 20 4.5 224 79.4 -
GFNet-S [51] 25 4.5 224 80.0 94.9
SVT-S 32 6.6 224 81.5 95.3
ResMLP-36 [60] 45 8.9 224 79.7 -
GFNet-B [51] 43 7.9 224 80.7 95.1
gMLP-B [39] 73 15.8 224 81.6 -
DeiT-B [61] 86 17.5 224 81.8 95.6
SVT-B 57 11.6 224 82.0 95.6
GFNet-XS↑384 [51] 18 8.4 384 80.6 95.4
GFNet-S↑384 [51] 28 13.2 384 81.7 95.8
GFNet-B↑384 [51] 47 23.3 384 82.1 95.8
SVT-XS↑384 21 13.1 384 82.2 95.8
SVT-S↑384 33 22.0 384 83.1 96.4
SVT-B↑384 57 37.3 384 83.0 96.2
85.2% with fewer parameters. We also compare SVT with iFormer [55], which captures low- and high-frequency information from visual data, whereas SVT uses an invertible spectral method, namely the scattering network, to obtain the low-frequency and high-frequency components and uses tensor and Einstein mixing, respectively, to capture effective spectral features from visual data. SVT's top-1 accuracy is 85.2%, which is better than iFormer-B at 84.6%, with fewer parameters and FLOPS.
We also compare SVT with WaveMLP [58], an MLP-mixer-based technique that uses amplitude and phase information to represent the semantic content of an image. SVT uses the low-frequency component as the amplitude of the original feature, while the high-frequency component captures complex semantic changes in the input image. Our studies show, as depicted in Table 11, that SVT outperforms WaveMLP by about 1.8%. Wave-ViT-B [75] uses a wavelet transform on the key and value parts of the multi-head attention method, whereas SVT uses a scattering network to decompose high- and low-frequency components with invertibility and better directional orientation using Einstein and tensor mixing. SVT outperforms Wave-ViT-B by 0.4%.
We wish to state the following regarding the reviewer's comment about large vision models (LVMs/LLMs): we have observed in recent papers, such as EfficientFormer and CvT, that certain models have a significantly larger number of parameters, with BiT-M having 928 million parameters and achieving 85.4% accuracy on ImageNet-1K, whereas ViT-H has 632 million parameters and achieves 85.1%. Comparatively, SVT-H-L has 54 million parameters and achieves 85.7% accuracy on ImageNet-1K - nearly 10× fewer parameters and FLOPS, but with improved accuracy, as captured in Table 3 of CvT [69].