An Image Is Worth More Than 16x16 Patches
1 Introduction
The deep learning revolution can be characterized as a revolution in inductive
biases for computer vision. Learning previously occurred on top of manually
crafted features, such as those described in [16, 46], which encoded preconceived
notions about useful patterns and structures for specific tasks. In contrast, biases
in modern features are no longer predetermined but instead shaped by direct
learning from data using predefined model architectures. This paradigm shift’s
dominance highlights the potential of reducing feature biases to create more
versatile and capable systems that excel across a wide range of vision tasks.
Beyond features, model architectures also possess inductive biases. Reducing
these biases can facilitate greater unification not only across tasks but also across
data modalities. The Transformer architecture [62] serves as a great example.
Initially developed to process natural languages, its effectiveness was subsequently
demonstrated for images [22], point clouds [66], codes [8], and many other types
of data. Notably, compared to its predecessor in vision – ConvNet [28, 42],
Vision Transformer (ViT) [22] carries much less image-specific inductive biases.
Nonetheless, the initial advantage from such biases is quickly offset by more data (and models with enough capacity to store patterns within the data); ultimately, these biases become restrictions that prevent ConvNets from scaling further [22].
Table 1: Major inductive biases in vision architectures. ConvNet [28, 42] has all
three – spatial hierarchy, translation equivariance, and locality – with neighboring pixels
being more related than pixels farther apart. Vision Transformer (ViT) [22] removes the
spatial hierarchy, reduces (but still retains) translation equivariance and locality. We
use Pixel Transformer (PiT) to investigate the complete removal of locality by simply
applying Transformers on individual pixels. It works surprisingly well, challenging the
mainstream belief that locality is a necessity for vision architectures.
Of course, ViT is not entirely free of inductive bias. It gets rid of the spatial
hierarchy in the ConvNet and models multiple scales in a plain architecture.
However, for other inductive biases, the removal is only halfway done:
translation equivariance still exists in its patch projection layer and all the
intermediate blocks; and locality – the notion that neighboring pixels are more
related than pixels that are far apart – still exists in its ‘patchification’ step (that
represents an image with 16×16 patches on a 2D grid) and position embeddings
(when they are manually designed). Therefore, a natural question arises: can we
completely eliminate either or both of the remaining two inductive biases? Our
work aims to answer this question.
Surprisingly, we find locality can indeed be removed. We arrive at this conclusion by directly treating each individual pixel as a token for the Transformer and using position embeddings learned from scratch. In this way, we introduce zero priors about the 2D grid structure of images. Interestingly, instead of divergent training or a steep drop in performance, the resulting architecture delivers better quality results. For easier reference, we name this PiT, short for Pixel Transformer. Note that our goal is not to promote PiT as an approach to replace ViT, but the fact that PiT works so well suggests there are more signals Transformers can capture by viewing images as sets of individual pixels, rather than 16×16 patches. This finding challenges the conventional belief that ‘locality is a fundamental inductive bias for vision tasks’ (see Tab. 1).
In the main paper, we showcase the effectiveness of PiT via three different case
studies: (i) supervised learning for object classification, where CIFAR-100 [38] is
used for our main experiments thanks to its 32×32 input size, but the observation
also generalizes well to ImageNet [19]; (ii) self-supervised learning on CIFAR-100
via standard Masked Autoencoding (MAE) [26] for pre-training, and fine-tuning
for classification; and (iii) image generation with diffusion models, where we follow
the architecture of Diffusion Transformer (DiT) [51], and study its pixel variant
on ImageNet using the latent token space provided by VQGAN [24]. In all three
cases, we find PiT exhibits reasonable behaviors and achieves better quality results than baselines equipped with the locality inductive bias. This observation
is further generalized to fine-grained classification and depth estimation tasks in
the appendix.
2 Related Work
Locality for images. To the best of our knowledge, most modern vision architec-
tures [28, 50], including those aimed at simplifications of inductive biases [22, 60],
still maintain locality in their design. Manually designed visual features before
deep learning are also locally biased. For example, SIFT [46] uses a local de-
scriptor to represent each point of interest; HOG [16] normalizes the gradient
strengths locally to account for changes in illumination and contrast. Interestingly,
with these features, bag-of-words models [12, 41] were popular – analogous to the
set-of-pixels explored in our work.
Locality beyond images. The inductive bias of locality is widely accepted
beyond modeling 2D images. For text, a natural language sequence is often
pre-processed with ‘tokenizers’ [39, 57], which aggregate the dataset statistics
for grouping frequently-occurring adjacent characters into sub-words. Before
Transformers, recurrent neural networks [31, 47] were the default architecture for such data, exploiting temporal connectivity to process sequences step-by-step. For even less structured data (e.g. point clouds [6, 15]), modern networks [52, 66] resort to various sampling and pooling strategies to increase their sensitivity to the local geometric layout. In graph neural networks [56], nodes with edges are often viewed as being locally connected, and information is propagated through these connections to farther-away nodes. Such a design makes them particularly useful for analyzing social networks, molecular structures, etc.
Other notable efforts. We list four efforts in rough chronological order, and hope they provide historical context for our work from multiple perspectives:
Figure 1: Overview of Pixel Transformer (PiT). Image pixels, combined with position embeddings, are fed into a Transformer (stacked norm → Self-Attention and norm → MLP blocks) and applied to various tasks.
– For ConvNets, relevant attempts have been made to remove locality. Notably, [4] replaces all the spatial convolutional filters with 1×1 filters in a ResNet [28]. It provides more interpretability for understanding the decision-making process of a ConvNet, but without inter-pixel communications, the
resulting network is substantially worse in performance. Our work instead
uses Transformers, which are inherently built on set operations, with the Self-
Attention mechanism handling all-to-all communications; understandably, we
attain better results.
– Before ViT gained popularity, iGPT [7] was proposed to directly pre-train
Transformers on pixels following their success on text [20, 53]. In retrospect,
iGPT is a locality-free model for self-supervised next (or masked) pixel
prediction. But despite the expensive demonstrations, its performance still
falls short compared to simple contrastive pre-training [9] for ImageNet linear
classification. Later, ViT [23] re-introduced locality (e.g., via patchification)
into the architecture, achieving impressive results on many benchmarks
including ImageNet. Since then, the community has moved on with 16×16
patches as the default tokens for images. Even today, it is still unclear whether
higher resolution or locality is the key differentiator between the two. With systematic analyses, our work closes this understanding gap, pointing to resolution, not locality, as the enabler for ViT.
– Perceiver [34, 35] is another series of architectures that operate directly on
pixels for images. Aimed at being modality-agnostic, Perceiver designs latent
Transformers with cross-attention modules to tackle the efficiency issue when
the input is high-dimensional. However, this design is not as widely adopted as
plain Transformers, which have consistently demonstrated scalability across
multiple domains [5, 22]. Through PiT, we show Transformers can indeed
work directly with pixels, and given the rapid development of Self-Attention
a 2D sin-cos embedding [10, 26], which extends from the original 1D one [62]. As sin-cos functions are smooth, they tend to introduce a locality bias: nearby tokens are more similar in the embedding space.¹ Other designed variants are also possible and have been explored [22], but all of them can carry information about the 2D grid structure of images, unlike learned position embeddings, which do not make assumptions about the input.
The locality bias has also been exploited when the position embeddings are
interpolated [18, 43]. Through bilinear or bicubic interpolation, spatially close
embeddings are used to generate a new embedding of the current position, which
also leverages locality as a prior.
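As a concrete illustration of how this interpolation typically works, here is a minimal sketch (assuming PyTorch; the function name, shapes, and interpolation settings are our own illustration, not taken from [18, 43]):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize a (old_grid*old_grid, d) position-embedding table to a new grid size.

    Spatially close embeddings are blended by bicubic interpolation, which is
    exactly where the locality prior enters.
    """
    d = pos_embed.shape[-1]
    # (L, d) -> (1, d, H, W) so F.interpolate can treat the table as an image
    grid = pos_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    # back to (L_new, d)
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, d)

# e.g., going from a 14x14 token grid (224/16) to a 32x32 grid
resized = interpolate_pos_embed(torch.randn(14 * 14, 768), old_grid=14, new_grid=32)
```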
Compared to ConvNets, ViTs are designed with much less pronounced bias
toward locality. We push this further by completely removing this bias next.
4 Transformers on Pixels
We closely follow the standard Transformer encoder [62] which processes a se-
quence of tokens. Particularly, we apply the architecture directly on an unordered
set of pixels from the input image with learnable position embeddings. This
removes the remaining inductive bias of locality in ViT [22], and for reference,
we name it Pixel Transformer (PiT, see Fig. 1). Conceptually, PiT can be viewed
as a simplified version of ViT, with 1×1 patches instead of 16×16.
Formally, we denote the input sequence as X = (x_1, ..., x_L) ∈ R^{L×d}, where L is the sequence length and d is the hidden dimension. The Transformer maps the input sequence X to a sequence of representations Z = (z_1, ..., z_L) ∈ R^{L×d}. The architecture is a stack of N layers, each of which contains two blocks: a multi-headed Self-Attention block and an MLP block:

Ẑ^k = SelfAttention(norm(Z^{k−1})) + Z^{k−1},
Z^k = MLP(norm(Ẑ^k)) + Ẑ^k,

where Z^0 is the input sequence X, k ∈ {1, ..., N} indicates the k-th layer in the Transformer, and norm(·) is a normalization layer (typically LayerNorm [1]). Both blocks use residual connections [28].
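The two residual blocks above correspond to a standard pre-norm Transformer layer. A minimal sketch (assuming PyTorch; class and hyper-parameter names are ours, not the authors' code):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One layer: Z_hat = SelfAttention(norm(Z)) + Z, then Z' = MLP(norm(Z_hat)) + Z_hat."""

    def __init__(self, d: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d))

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, L, d)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # multi-headed Self-Attention block
        z = z + self.mlp(self.norm2(z))                    # MLP block
        return z
```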
Pixels as tokens. The typical input to the network in computer vision is an image of RGB values, I ∈ R^{H×W×3}, where (H, W) is the size of the original image. We follow a simple solution and treat I as an unordered set of pixels (p_l), l = 1, ..., H·W, with p_l ∈ R^3. Thus, PiT simply projects each pixel into a d-dimensional vector via a linear projection layer, f: R^3 → R^d, resulting in the input set of tokens X = (f(p_1), ..., f(p_L)) with L = H·W. We append the sequence with a learnable [cls] token [20]. Additionally, we learn a content-agnostic position embedding for each position. The pixel tokens are then fed into the Transformer.
¹ While sin-cos functions are also cyclic, it is easy to verify that the majority of their periods are longer than the typical sequence lengths encountered by ViTs.
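Putting the tokenization above into code, a minimal sketch of the per-pixel projection, the [cls] token, and the learned, content-agnostic position embeddings (our illustration under the stated assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class PixelTokenizer(nn.Module):
    """Turn an image (B, 3, H, W) into a token sequence (B, 1 + H*W, d) for PiT."""

    def __init__(self, h: int, w: int, d: int):
        super().__init__()
        self.proj = nn.Linear(3, d)                            # f: R^3 -> R^d, applied per pixel
        self.cls = nn.Parameter(torch.zeros(1, 1, d))          # learnable [cls] token
        self.pos = nn.Parameter(torch.zeros(1, 1 + h * w, d))  # learned position embeddings
        nn.init.trunc_normal_(self.pos, std=0.02)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        b = img.shape[0]
        pixels = img.flatten(2).transpose(1, 2)                      # (B, H*W, 3): the set of pixels
        tokens = self.proj(pixels)                                   # project each pixel to d dims
        tokens = torch.cat([self.cls.expand(b, -1, -1), tokens], 1)  # concatenate the [cls] token
        return tokens + self.pos                                     # add content-agnostic position embeddings
```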
model layers (N ) hidden dim (d) MLP dim heads param (M)
PiT-T(iny) 12 192 768 12 5.6
PiT-S(mall) 12 384 1536 12 21.8
PiT-B(ase) 12 768 3072 12 86.0
PiT-L(arge) 24 1024 4096 16 303.5
Table 2: Specifications of PiT size variants. In empirical evaluations, we utilize
ViT [22] with the same configuration for head-on comparisons.
Table 3: Results for case study #1 (supervised learning). Learning from scratch,
we compare PiT and ViT [22] (with patch size 2×2). We report results on (a) CIFAR-
100 [38]: with 32×32 inputs, PiT substantially outperforms ViT. Note that our ViT
baselines are already well-optimized, e.g. [58] reports 72.6% when training from scratch
with a larger model; (b) ImageNet [19]: using the training pipeline [26, 61] highly
optimized for ViT, we still observe accuracy gains with PiT. Due to computation
constraints, we use an input size of 28×28 (lower than [22, 61]).
Figure 2: Two trends for PiT vs. ViT on ImageNet. Since PiT can be viewed
as ViT with patch size 1×1, the trends w.r.t. patch size is crucial to our finding. In
(a), we vary the ViT-B patch size but keep the sequence length fixed (last data point equivalent to PiT) – so the input size is also varied. While Acc@1 remains constant at first, the input size, i.e., the amount of information, quickly becomes the dominant factor driving the degradation, and PiT is the worst. On the other hand,
in (b) we vary the ViT-S patch size while keeping the input size fixed. The trend is
opposite – reducing the patch size is always helpful and the last point (PiT) becomes
the best. The juxtaposition of these two trends gives a more complete picture of the
relationship between input size, patch size and sequence length.
when training from scratch with ViT-B, whilst we achieve 80+% with smaller
sized models), PiT-T improves over ViT-T by 1.5% of Acc@1; and when moving
to the bigger model (S), PiT shows an improvement of 1.3% of Acc@1 over the
small model (T) while ViT seems to be saturated. These results suggest that, compared to the patch-based ViT, PiT is potentially learning new, data-driven patterns directly from pixels.
Our observation also transfers to ImageNet: albeit at a significantly lower resolution, with results well below the state-of-the-art [22, 61] (80+%), PiT still outperforms ViT in both settings we experimented with.
PiT vs. ViT: a tale of two trends. If position embeddings are learned, PiT
is simply a version of ViT with 1×1 patches. Therefore, it is crucial to study the
performance trend when varying the patch sizes in ViT. There are three variables
in concern: sequence length (L), input size (H×W) and patch size (p). They have a deterministic relationship: L = H×W/p². Thus we have two ways to
study the Acc@1 trend w.r.t. patch size p:
Table 4: Results for case study #2 (self-supervised learning). We use PiT and
ViT [22] (patch size 2×2) for MAE pre-training [26] on CIFAR-100, and then fine-tune
with supervision. For both (a) Tiny and (b) Small sized variants, pre-training boosts
performance. PiT-S offers a bigger gap over ViT-S, suggesting PiT can scale better.
– Fixed input size. Our finding resides in the other trend, when we fix the
input size (therefore the amount of information), and vary the patch size on
ImageNet in Fig. 2b. The model size is ViT-S. Interestingly, we observe an
opposite trend here: it is always helpful to decrease the patch size (or increase
the sequence length), aligned with the prior studies that claim sequence
length is highly important. Note that the trend holds even when it ultimately
reaches PiT – a model without any design for locality. So PiT performs the
best in accuracy compared to ViTs.
With these two trend figures in Fig. 2, our study augments the observations
made from previous studies, as they mainly focused on regimes where the input
size is sufficiently large [2, 32], and presents a more complete picture.
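To make the deterministic relationship L = H×W/p² concrete, a quick sanity check in code (the 224×224 and 28×28 input sizes below are the ones discussed in this paper; everything else is illustrative):

```python
def seq_len(h: int, w: int, p: int) -> int:
    """Number of tokens for an h x w input split into p x p patches: L = H*W / p^2."""
    assert h % p == 0 and w % p == 0
    return (h // p) * (w // p)

print(seq_len(224, 224, 16))  # 196   -- standard ViT patchification
print(seq_len(224, 224, 1))   # 50176 -- treating every pixel as a token
print(seq_len(28, 28, 1))     # 784   -- the reduced ImageNet input size used for PiT here
```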
To see what PiT has learned, we show visualizations of the attention maps,
position embeddings from PiT in Appendix A.
In this subsection, we study PiT with self-supervised pre-training followed by fine-tuning for supervised classification. In particular, we choose MAE [26] for its efficiency (the encoder only retains 25% of the sequence length) and its effectiveness under fine-tuning based evaluation protocols.
Datasets. We use CIFAR-100 [38] due to its inherent size of 32 × 32 for images.
This allows us to fully explore the use of pixels as tokens on the original resolution.
Evaluation metrics. We first perform pre-training on the train split. Then
it serves as the initialization in the fine-tuning stage (also trained on train).
Again, we use image classification on CIFAR-100 as the downstream task and
report the top-1 (Acc@1) and top-5 (Acc@5) accuracy on the val split.
Implementation details. We follow standard MAE and use a mask ratio of
75% and select tokens randomly. Given the remaining 25% visible tokens, the
model needs to reconstruct masked regions using pixel regression. Since there
is no known default setting for MAE on CIFAR-100 (even for ViT), we search
for recipes and report results using PiT-T and PiT-S. The same augmentations
as in [26] are applied to the images during the pre-training for simplicity. All
models are pre-trained using AdamW with β1 =0.9, β2 =0.95. We follow all of the
hyper-parameters in [26] for the pre-training of 1600 epochs except for the initial
learning rate of 0.004 and a learning rate decay of 0.85 [11]. Thanks to MAE
pre-training, we can fine-tune our model with a higher learning rate of 0.024. We
also set weight decay to 0.02, layer-wise rate decay to 0.65, drop path to 0.3, and β2 to 0.999, and fine-tune for 800 epochs. Other hyper-parameters closely follow
the scratch training recipe for supervised learning (see Sec. 5.1).
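For reference, a minimal sketch of the 75% random token masking described above (this mirrors the standard MAE masking logic, but variable names and the exact form are ours):

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random 25% of the L tokens per sample; the rest are reconstructed by the decoder.

    tokens: (B, L, d). Returns the visible tokens (B, L_keep, d), a binary mask (B, L)
    with 1 = masked, and the indices needed to restore the original token order.
    """
    b, l, d = tokens.shape
    len_keep = int(l * (1 - mask_ratio))
    noise = torch.rand(b, l, device=tokens.device)     # one random score per token
    ids_shuffle = noise.argsort(dim=1)                 # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, l, device=tokens.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)          # back to the original token order
    return visible, mask, ids_restore
```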
Results. As shown in Tab. 4, we find that for PiT too, self-supervised pre-training
with MAE improves accuracy compared to training from scratch. This is true for
both PiT-T and PiT-S that we experimented with. Notably, the gap between
ViT and PiT, with pre-training, gets larger when we move from Tiny to Small
models. This suggests PiT can potentially scale better than ViT.
Table 5: Results for case study #3 (image generation). We use reference batches from the original evaluation suite of [21], and report 5 metrics comparing locality-biased DiT-L/2 and PiT-L. With the exception of the last two rows, we use classifier-free guidance [30] (scale 1.5) during the generation process (250 steps). The last row is from [51], compared to which our baseline is significantly stronger (8.90 FID [29] with DiT-L, vs. 10.67 with DiT-XL and longer training). Overall, our finding generalizes well to this new task with a different architecture and different input representations.

model                       FID (↓)  sFID (↓)  IS (↑)   precision (↑)  recall (↑)
DiT-L/2                     4.16     4.97      210.18   0.88           0.49
PiT-L                       4.05     4.66      232.95   0.88           0.49
DiT-L/2, no guidance        8.90     4.63      104.43   0.75           0.61
DiT-XL/2 [51], no guidance  10.67    -         -        -              -
warm up [25] for 100 epochs and then keep it constant for a total of 400 epochs.
We use a maximum learning rate of 8e-4, with no weight decay applied.
Qualitative results. Sampled generations from PiT-L are shown in Fig. 3. The
sampling takes 250 time steps, with the latent diffusion outputs mapped back to
the pixel space using the VQGAN decoder. A classifier-free guidance [30] scale
of 4.0 is used. All generations are detailed and reasonable compared to the DiT
models with the locality inductive bias [51].
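At each sampling step, classifier-free guidance combines a conditional and an unconditional noise prediction. A minimal sketch of the standard combination from [30] (the model call below is illustrative, not the DiT/PiT API):

```python
import torch

def cfg_noise(model, x_t: torch.Tensor, t: torch.Tensor, class_label, null_label, scale: float = 4.0) -> torch.Tensor:
    """eps = eps_uncond + scale * (eps_cond - eps_uncond); scale = 1 recovers the conditional model."""
    eps_cond = model(x_t, t, class_label)    # prediction conditioned on the target class
    eps_uncond = model(x_t, t, null_label)   # prediction conditioned on the "null" class
    return eps_uncond + scale * (eps_cond - eps_uncond)
```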
Quantitative comparisons. We summarize the quantitative comparison between DiT-L/2 and PiT-L in Tab. 5. First, our baseline is strong despite the change of
training recipe: compared to the reference 10.67 FID [51] with a larger model (DiT-
XL/2) and longer training (∼470 epochs), our DiT-L/2 achieves 8.90 without
classifier-free guidance. Our main comparison (first two rows) uses a classifier-free
guidance of 1.5 with 250 sampling steps. With PiT operating on the latent ‘pixels’,
it outperforms the baseline on three metrics (FID, sFID and IS), and is on-par
on precision/recall. With extended training, the gap is bigger (see Appendix B).
Our demonstration on the image generation task is an important extension of
PiT. Compared to the case studies on discriminative benchmarks from Sec. 5.1
and Sec. 5.2, the task has changed; the model architecture is changed from
standard ViT to a conditioned one; the input space is also changed from raw
pixels to latent encodings from the VQGAN tokenizer. The fact that PiT works out-of-the-box suggests our observation generalizes well, and a locality-free architecture can be used across different tasks, architectures, and operating representations.
Finally, we complete the loop of our investigation by revisiting the ViT archi-
tecture, and examining the importance of its two locality-related designs: (i)
position embedding and (ii) patchification.
Experimental setup. We use ViT-B for ImageNet supervised classification. We
adopt the exact same hyper-parameters, augmentations, and other training details
from the scratch training recipe from [26]. Notably, images are crop-and-resized
to 224×224 and divided into 16×16 non-overlapping patches.
Position embedding. Similar to the investigation in [10], we choose from three
candidates: sin-cos [62], learned, and none. The first option introduces locality
into the model, while the other two do not. The results are summarized below:
PE sin-cos learned none
Acc@1 82.7 82.8 81.2
Our conclusion is similar to the one drawn by [10] for self-supervised representation
evaluation: learnable position embeddings are on-par with fixed sin-cos ones.
Interestingly, we observe only a minor drop in performance even if there is no position embedding at all – ‘none’ is only worse by 1.5% compared to sin-cos. Note that without position embeddings, the classification model is fully permutation invariant w.r.t. patches, though not w.r.t. pixels – we will show evidence for this next.
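For reference, a minimal sketch of the 2D sin-cos candidate, built by concatenating 1D sin-cos embeddings of the row and column coordinates in the spirit of [10, 26] (the temperature and exact layout are assumptions, not the paper's code):

```python
import torch

def sincos_1d(pos: torch.Tensor, dim: int, temperature: float = 10000.0) -> torch.Tensor:
    """1D sin-cos embedding of positions `pos` (N,) into `dim` channels."""
    omega = 1.0 / temperature ** (torch.arange(dim // 2) / (dim // 2))
    out = pos[:, None] * omega[None, :]              # (N, dim/2)
    return torch.cat([out.sin(), out.cos()], dim=1)  # (N, dim)

def sincos_2d(h: int, w: int, dim: int) -> torch.Tensor:
    """Fixed (H*W, dim) embedding: half of the channels encode rows, half encode columns."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.cat([sincos_1d(ys.flatten().float(), dim // 2),
                      sincos_1d(xs.flatten().float(), dim // 2)], dim=1)

pe = sincos_2d(14, 14, 768)  # e.g., the 14x14 patch grid of a 224x224 image with 16x16 patches
```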
Patchification. Next, we use learnable position embeddings and study patchification. To systematically reduce locality from patchification, our key insight is that neighboring pixels should no longer be tied in the same patch. To this end, we perform a pixel-wise permutation before dividing the resulting sequence into separate tokens. Each token contains 256 pixels, the same number as in a 16×16 patch. The permutation is shared, i.e., it stays the same for all the images – including the ones for testing.
The permutation is performed in T steps; each step swaps a pixel pair within a distance threshold δ ∈ [2, inf] (2 means within the 2×2 neighborhood, inf means any pixel pair can be swapped). We use Hamming distance on the 2D image grid. T and δ control how ‘corrupted’ an image is – a larger T or δ indicates more damage to the local neighborhood, and thus more of the locality bias is taken away. Fig. 4 illustrates four such permutations.
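A minimal sketch of how such a shared permutation could be generated (T random swap steps, each constrained by a threshold δ on the 2D grid; the distance function is left as a parameter since our choice of metric is only illustrative):

```python
import random

def make_permutation(h: int, w: int, t: int, delta: float, dist_fn) -> list:
    """Return a fixed permutation of the H*W pixel indices, shared across all images.

    Performs t swap steps; each step swaps a random pixel pair whose grid distance
    (as measured by dist_fn on (row, col) coordinates) is at most delta.
    Rejection sampling keeps the sketch simple; it is not optimized.
    """
    perm = list(range(h * w))
    for _ in range(t):
        i = random.randrange(h * w)
        while True:
            j = random.randrange(h * w)
            if j != i and dist_fn(divmod(i, w), divmod(j, w)) <= delta:
                break
        perm[i], perm[j] = perm[j], perm[i]
    return perm

# example: delta = inf, i.e., any pixel pair can be swapped
perm = make_permutation(224, 224, t=10_000, delta=float("inf"),
                        dist_fn=lambda a, b: max(abs(a[0] - b[0]), abs(a[1] - b[1])))
```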
Fig. 5 illustrates the results we have obtained. In the table (left), we vary T
with no distance constraint (i.e., δ = inf). As we increase the number of shuffled
pixel pairs, the performance degenerates slowly in the beginning (up to 10K).
Then it quickly deteriorates as we further increase T. And at T = 25K, Acc@1 drops to 57.2%, a 25.2% decrease from the intact image. Note that in total there are 224×224/2 = 25,088 pixel pairs, so T = 25K means almost all the pixels
have moved away from their original location. Fig. 5 (right) shows the influence
of δ given a fixed T (10K or 20K). We can see when farther-away pixels are
allowed for swapping (with greater δ), performance gets hurt more. The trend is
more salient when more pixel pairs are swapped (T = 20K).
Overall, pixel permutation has a much more significant impact on Acc@1 than changing position embeddings, suggesting that patchification is much more crucial to the overall design of ViTs; this underscores the value of our work, which removes patchification altogether.
Discussion. As another way to remove locality, pixel permutation is highly
destructive. On the other hand, PiT shows that successful elimination of locality is possible by treating individual pixels as tokens. We hypothesize this is because permuting pixels not only damages the locality bias, but also hurts the other inductive bias – translation equivariance. In PiT, although locality is removed altogether, the Transformer weights are still shared to preserve translation equivariance; but with shuffling, this inductive bias is also largely removed. The
difference suggests that translation equivariance remains important and should
not be disregarded, especially after locality is already compromised.
References
1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv:1607.06450 (2016) 6
2. Beyer, L., Zhai, X., Kolesnikov, A.: Better plain vit baselines for imagenet-1k. arXiv
preprint arXiv:2205.01580 (2022) 10
3. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph
cuts. IEEE TPAMI (2001) 18
4. Brendel, W., Bethge, M.: Approximating cnns with bag-of-local-features models
works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760 (2019) 4
5. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C.,
Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are
few-shot learners. In: NeurIPS (2020) 4
6. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese,
S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model
repository. arXiv preprint arXiv:1512.03012 (2015) 3
7. Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I.: Generative
pretraining from pixels. In: ICML (2020) 4, 9
8. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards,
H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models
trained on code. arXiv preprint arXiv:2107.03374 (2021) 1
9. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised
models are strong semi-supervised learners. In: NeurIPS (2020) 4
10. Chen, X., Xie, S., He, K.: An empirical study of training self-supervised Vision
Transformers. In: ICCV (2021) 5, 6, 8, 13
11. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text
encoders as discriminators rather than generators. In: ICLR (2020) 11
12. Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization
with bags of keypoints. In: ECCVW (2004) 3
13. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning
augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018) 8
14. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated
data augmentation with a reduced search space. In: CVPR Workshops (2020) 8
15. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet:
Richly-annotated 3d reconstructions of indoor scenes. In: CVPR (2017) 3
16. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005) 1, 3
17. Dao, T., Fu, D.Y., Ermon, S., Rudra, A., Ré, C.: FlashAttention: Fast and memory-
efficient exact attention with IO-awareness. In: NeurIPS (2022) 5, 7
18. Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner,
A.P., Caron, M., Geirhos, R., Alabdulmohsin, I., et al.: Scaling vision transformers
to 22 billion parameters. In: ICML (2023) 6
19. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale
hierarchical image database. In: CVPR (2009) 2, 8
20. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
bidirectional transformers for language understanding. In: NAACL (2019) 4, 6
21. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In:
NeurIPS (2021) 11, 12
22. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale. In:
ICLR (2021) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 18
23. Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsu-
pervised feature learning with convolutional neural networks. In: NeurIPS (2014) 4,
7
24. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image
synthesis. In: CVPR (2021) 2, 11
25. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A.,
Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in
1 hour. arXiv:1706.02677 (2017) 12
26. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are
scalable vision learners. In: CVPR (2022) 2, 6, 7, 8, 10, 11, 13
27. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised
visual representation learning. In: CVPR (2020) 8
28. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016) 1, 2, 3, 4, 6
29. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained
by a two time-scale update rule converge to a local Nash equilibrium. In: NeurIPS
(2017) 11, 12
30. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint
arXiv:2207.12598 (2022) 12
31. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation
(1997) 3
32. Hu, R., Debnath, S., Xie, S., Chen, X.: Exploring long-sequence masked autoen-
coders. arXiv preprint arXiv:2210.07224 (2022) 5, 10
33. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with
stochastic depth. In: ECCV (2016) 8
34. Jaegle, A., Borgeaud, S., Alayrac, J.B., Doersch, C., Ionescu, C., Ding, D., Koppula,
S., Zoran, D., Brock, A., Shelhamer, E., et al.: Perceiver io: A general architecture
for structured inputs & outputs. arXiv preprint arXiv:2107.14795 (2021) 4
35. Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver:
General perception with iterative attention. In: ICML (2021) 4
36. Karpathy, A., Johnson, J., Fei-Fei, L.: Visualizing and understanding recurrent
networks. arXiv preprint arXiv:1506.02078 (2015) 5
37. Ke, T.W., Hwang, J.J., Guo, Y., Wang, X., Yu, S.X.: Unsupervised hierarchical
semantic segmentation with multiview cosegmentation and clustering transformers.
In: CVPR (2022) 7
38. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech Report
(2009) 2, 8, 10
39. Kudo, T., Richardson, J.: Sentencepiece: A simple and language independent
subword tokenizer and detokenizer for neural text processing. arXiv preprint
arXiv:1808.06226 (2018) 3
40. Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision
and recall metric for assessing generative models. NeurIPS (2019) 11
41. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In: CVPR (2006) 3
42. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
computation (1989) 1, 2
43. Li, Y., Xie, S., Chen, X., Dollár, P., He, K., Girshick, R.: Benchmarking detection
transfer learning with vision transformers. In preparation (2021) 6
44. Liu, H., Zaharia, M., Abbeel, P.: Ring attention with blockwise transformers for
near-infinite context. arXiv preprint arXiv:2310.01889 (2023) 5, 7
45. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
8
46. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
1, 3
47. Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent
neural network based language model. In: Interspeech (2010) 3
48. Nash, C., Menick, J., Dieleman, S., Battaglia, P.W.: Generating images with sparse
representations. arXiv preprint arXiv:2103.03841 (2021) 11
49. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of
classes. In: Indian Conference on Computer Vision, Graphics and Image Processing
(2008) 22
50. Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., Tran, D.:
Image transformer. In: ICML (2018) 3
51. Peebles, W., Xie, S.: Scalable diffusion models with Transformers. In: ICCV (2023)
2, 5, 7, 11, 12, 22, 23
52. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning
on point sets in a metric space. NeurIPS (2017) 3
53. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language
understanding by generative pre-training (2018) 4
54. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR (2022) 11
55. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.:
Improved techniques for training gans. NeurIPS (2016) 11
56. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph
neural network model. IEEE Transactions on Neural Networks (2009) 3
57. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with
subword units. arXiv preprint arXiv:1508.07909 (2015) 3
58. Shen, C., Chen, J., Wang, S., Kuang, H., Liu, J., Wang, J.: Asymmetric patch
sampling for contrastive learning. arXiv preprint arXiv:2306.02854 (2023) 8
59. Silberman, N., Kohli, P., Hoiem, D., Fergus, R.: Indoor segmentation and support
inference from rgbd images. In: ECCV (2012) 22
60. Tolstikhin, I.O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T.,
Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., et al.: Mlp-mixer: An all-mlp
architecture for vision. NeurIPS (2021) 3
61. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper
with image transformers. In: ICCV (2021) 8, 9
62. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017) 1, 3, 6, 13
63. Walmer, M., Suri, S., Gupta, K., Shrivastava, A.: Teaching matters: Investigating
the role of supervision in vision transformers. In: CVPR (2023) 18
64. Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization
strategy to train strong classifiers with localizable features. In: ICCV (2019) 8
65. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk
minimization. In: ICLR (2018) 8
66. Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: ICCV
(2021) 1, 3
Acknowledgments. We thank Kaiming He, Mike Rabbat and Sho Yaida for
helpful discussions. We thank Yann LeCun for feedback on positioning.
A Visualizations of PiT
To check what PiT has learned, we experimented with different ways of visualizing it. Unless otherwise specified, we use PiT-B and ViT-B models trained with supervised learning on ImageNet classification, and compare them side by side.
Mean attention distances. In Fig. 6, we present the mean attention distances
for PiT and ViT across three categories: late layers (last 4), middle layers (middle
4), and early layers (first 4). Following [22], this metric is computed by aggregating
the distances between a query token and all the key tokens in the image space,
weighted by their corresponding attention weights. It can be interpreted as the
size of the ‘receptive field’ for Transformers. The distance is normalized by the
image size, and sorted based on the distance value for different attention heads
from left to right.
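A minimal sketch of this metric for a single attention map (our reading of the description above; the tensor layout, Euclidean grid distance, and normalization are assumptions):

```python
import torch

def mean_attention_distance(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """attn: (heads, L, L) attention weights over an h*w token grid (each row sums to 1).

    For every query, average the 2D distance to all keys weighted by the attention,
    then average over queries; returns one value per head, normalized by the image size.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # (L, 2) token positions
    dists = torch.cdist(coords, coords)                                # (L, L) pairwise distances
    per_query = (attn * dists).sum(dim=-1)                             # (heads, L) expected distance per query
    return per_query.mean(dim=-1) / max(h, w)                          # (heads,) normalized
```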
As shown in Fig. 6a and Fig. 6b, both models exhibit similar patterns in
the late layers, with the metric increasing from the 8th to the 11th layer. In the
middle layers, while ViT displays a mixed trend among layers (see Fig. 6d), PiT clearly extracts patterns from larger areas in the relatively later layers (see Fig. 6c).
Most notably, PiT focuses more on local patterns by paying more attention to
small groups of pixels in the early layers, as illustrated in Fig. 6e and Fig. 6f.
Mean attention offsets. Fig. 7 shows the mean attention offsets between PiT
and ViT as introduced in [63]. This metric is calculated by determining the center
of the attention map generated by a query and measuring the spatial distance (or
offset) from the query’s location to this center. Thus, the attention offset refers
to the degree of spatial deviation of the ‘receptive field’ – the area of the input
that the model focuses on – from the query’s original position. Note that, unlike the convolution in ConvNets (a local operation always centered on the current pixel, i.e., with zero offset), Self-Attention is a global operation.
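A companion sketch for the offset metric (attention-weighted center of each query's attention map minus the query's own location; again our reading, with assumed tensor layout and normalization):

```python
import torch

def mean_attention_offset(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """attn: (heads, L, L). Deviation of each query's attention center from the query position."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # (L, 2) token positions
    centers = attn @ coords                                            # (heads, L, 2) attention-weighted centers
    offsets = (centers - coords).norm(dim=-1)                          # (heads, L) offset per query
    return offsets.mean(dim=-1) / max(h, w)                            # (heads,) normalized
```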
Interestingly, Fig. 7e suggests that PiT captures long-range relationships in
the first layer. Specifically, the attention maps generated by PiT focus on regions
far away from the query token – although according to the previous metric (mean
attention distance), the overall ‘size’ of the attention can be small and focused
in this layer.
Figure-ground segmentation in early layers. In Fig. 8, we observe another
interesting behavior of PiT. Here, we use the central pixel in the image space as
the query and visualize its attention maps in the early layers. We find that the
attention maps in the early layers can already capture the foreground of objects.
Figure-ground segmentation [3] can be effectively performed with low-level signals (e.g., RGB values), and can therefore be approached within a few layers. This separation prepares the model to potentially capture higher-order relationships in later layers.
Figure 6: Mean attention distances in late, middle, and early layers between PiT
and ViT. This metric can be interpreted as the receptive field size for Transformers.
The distance is normalized by the image size, and sorted based on the distance value
for different attention heads from left to right.
Figure 7: Mean attention offsets in late, middle, and early layers between PiT and
ViT. This metric measures the deviation of the attention map from the current token
location. The offset is normalized with the image size, and sorted based on the distance
value for different attention heads from left to right.
Table 6: Extended results for image generation. We continue the training from 400 epochs (main paper) to 1400 epochs, and find the gap between DiT and PiT becomes larger, especially on FID.

epochs  model    FID (↓)  sFID (↓)  IS (↑)   precision (↑)  recall (↑)
400     DiT-L/2  4.16     4.97      210.18   0.88           0.49
400     PiT-L    4.05     4.66      232.95   0.88           0.49
1400    DiT-L/2  2.89     4.43      242.13   0.85           0.54
1400    PiT-L    2.68     4.34      268.82   0.85           0.55
In the main paper (Sec. 5.3), both image generation models, DiT-L/2 and PiT-L,
are trained for 400 epochs. To see the trend for longer training, we followed [51]
and simply continued training them till 1400 epochs while keeping the learning
rate constant.
The results are summarized in Tab. 6. Interestingly, longer training also benefits PiT more than DiT. Note that FID should be compared in a relative sense – a 0.2 gap around 2 is bigger than a 0.2 gap around 4.
To further examine the generalization of our observation, we tried PiT on two more
tasks: (i) fine-grained classification on Oxford-102-Flower [49], which requires
nuanced understanding; and (ii) depth estimation on NYU-v2 [59]. Given the
computation budget, we resize images to either 32×32 (former) or 48×64 (latter),
and follow standard protocols to train and evaluate models. The results again show that PiT is more effective than ViT in terms of quality:
         fine-grained classification      depth estimation
model    Acc@1 (↑)      Acc@5 (↑)         RMSE (↓)
ViT-S/2  45.8           68.3              0.80
PiT-S    46.3           68.9              0.72