Lec25 Architectures
Transformer slides from S. Lazebnik
Reminders
Today
• Neural fields
• Transformers for vision
3D view synthesis
Point cloud (reconstructed with SfM + multi-view stereo)
[Riegler and Koltun, 2020]
Idea #1: Image-based rendering
To synthesize a new view, select colors from existing views using proxy geometry.
[Riegler and Koltun, 2020]
Idea #1: Image-based rendering
[Riegler and Koltun, 2020]
Idea #2: voxel representation
[Source: Sitzmann et al., “DeepVoxels”, 2019]
Idea #2: voxel representation
V[x, y, z, θ, ϕ] = (R, G, B, σ)
Input views
Idea #2: voxel representation
V[x, y, z, θ, ϕ] = (R, G, B, σ)
(x, y, z): position; (θ, ϕ): viewing direction; (R, G, B): color; σ: density
Training: fit V to the input views
Problem: a huge table! Storage is 𝒪(D³A²) for D samples per spatial axis and A samples per angular axis.
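To see why the table is huge (these resolutions are illustrative, not from the slide): with D = 512 samples per spatial axis and A = 100 samples per viewing-angle axis, V has D³A² = 512³ × 100² ≈ 1.3 × 10¹² cells, each storing (R, G, B, σ).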
Idea #3: neural radiance field (NeRF)
FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)
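A minimal sketch of what FΘ could look like as an MLP (the layer sizes and module layout are illustrative assumptions; the actual NeRF network is deeper, with 8 fully connected layers and a skip connection):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Illustrative F_theta: maps (position, view direction) -> (RGB, density)."""
    def __init__(self, pos_dim=3, dir_dim=3, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)       # density depends on position only
        self.rgb_head = nn.Sequential(               # color also depends on view direction
            nn.Linear(hidden + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        h = self.trunk(xyz)
        sigma = torch.relu(self.sigma_head(h))       # density is non-negative
        rgb = self.rgb_head(torch.cat([h, view_dir], dim=-1))
        return rgb, sigma
```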
Idea #3: neural radiance field (NeRF)
Learn volume: color + occupancy
(figure: a 3D scene and the viewpoints it is observed from)
[Mildenhall*, Srinivasan*, Tancik*, et al., Neural radiance fields, 2020]
Learning a NeRF
[Source: Mildenhall et al., “NeRF”, 2020]
Neural rendering
[Source: Mildenhall et al., “NeRF”, 2020]
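Rendering integrates color and density along each camera ray. A sketch of the quadrature rule NeRF uses for alpha compositing (names and shapes are mine; sigmas and rgbs would come from querying FΘ at points sampled along the ray):

```python
import torch

def render_ray(rgbs, sigmas, deltas):
    """Alpha-composite colors along one ray.

    rgbs:   (N, 3) colors at the N sampled points
    sigmas: (N,)   densities at the sampled points
    deltas: (N,)   distances between consecutive samples
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)   # opacity of each ray segment
    # transmittance: probability the ray reaches sample i without being absorbed
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
    weights = alphas * trans
    return (weights[:, None] * rgbs).sum(dim=0)   # expected color of the ray
```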
Representing the inputs
FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)
Input views
[Source: Mildenhall et al., “NeRF”, 2020]
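Before going into the MLP, NeRF maps each input coordinate through a sinusoidal positional encoding so the network can represent high-frequency detail; the paper uses L = 10 frequencies for positions and L = 4 for viewing directions. A sketch:

```python
import torch

def positional_encoding(p, num_freqs=10):
    """gamma(p): map each coordinate to sines/cosines at increasing frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi   # 2^0 pi, 2^1 pi, ..., 2^(L-1) pi
    angles = p[..., None] * freqs                        # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                     # (..., D * 2L)
```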
Results for a novel viewpoint
[Source: Mildenhall et al., “NeRF”, 2020]
Results
[Mildenhall*, Srinivasan*, Tancik*, et al. 2020]
Extension: internet photo collections
[Martin-Brualla, Radwan et al. “NeRF in the Wild”, 2020]
Lots of other applications of neural fields
• Neural fields
• Transformers for vision
Recall: Transformers
• Query vectors: Q = X W_Q
• Key vectors: K = X W_K
• Value vectors: V = X W_V
• Similarities: E_ij = (Q_i · K_j) / √D, or in matrix form E = Q Kᵀ / √D
• Attention weights: A = Softmax(E), taken over each row
• Output vectors: Y_i = Σ_j A_ij V_j, or in matrix form Y = A V
One query per input vector.
Adapted from J. Johnson and S. Lazebnik.
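A direct translation of the equations above into code (a sketch; variable names follow the slide):

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: one query per input vector.

    X: (N, D_in) input vectors; W_Q, W_K: (D_in, D); W_V: (D_in, D_V)
    """
    Q = X @ W_Q                          # queries
    K = X @ W_K                          # keys
    V = X @ W_V                          # values
    E = Q @ K.T / K.shape[-1] ** 0.5     # scaled dot-product similarities
    A = F.softmax(E, dim=-1)             # each row sums to 1
    return A @ V                         # Y_i = sum_j A_ij V_j
```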
Multi-head attention
• Run h attention models in parallel on top of different linearly projected versions of Q, K, V; concatenate and linearly project the results
• Intuition: enables model to attend to different kinds of information at different positions
Source: S. Lazebnik
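PyTorch provides this as a built-in module; a small usage sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 512)        # (batch, sequence length, embedding dim)
y, attn_weights = mha(x, x, x)     # self-attention: query = key = value = x
print(y.shape)                     # torch.Size([2, 10, 512])
```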
Transformer blocks
• A Transformer is a sequence of transformer blocks
• Vaswani et al. (base model): N = 6 blocks, embedding dimension = 512, 8 attention heads
• Add & Norm: residual connection followed by layer normalization
• Feedforward: two linear layers with a ReLU in between, applied independently to each vector
• Attention is the only interaction between inputs!
Source: S. Lazebnik
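Putting the pieces together, a sketch of one post-norm encoder block as described above (dimensions follow the base model; the module layout is my own):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(                   # two linear layers with a ReLU in between,
            nn.Linear(d_model, d_ff), nn.ReLU(),   # applied independently to each position
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # attention: the only interaction between positions
        x = self.norm1(x + a)              # Add & Norm
        x = self.norm2(x + self.ff(x))     # Add & Norm
        return x
```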
Self-supervised learning in Natural Language Processing
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on desired NLP task
Source: S. Lazebnik
Self-supervised language modeling with transformers
Bidirectional Encoder Representations from Transformers (BERT)
J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019
Source: S. Lazebnik
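BERT's main pretext task is masked language modeling: hide a random subset of tokens (15% in the paper) and train the model to predict them from the surrounding context. A sketch of the masking step (the vocabulary size and mask token ID below are made-up placeholders):

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    mask = torch.rand(token_ids.shape) < mask_prob
    inputs = token_ids.clone()
    inputs[mask] = mask_id                     # corrupted input the model sees
    targets = torch.where(mask, token_ids, torch.full_like(token_ids, -100))
    return inputs, targets                     # -100 = "ignore" index for cross-entropy loss

ids = torch.randint(0, 30000, (2, 16))         # pretend batch of token IDs
inp, tgt = mask_tokens(ids, mask_id=103)
```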
Recall: denoising autoencoder
Source: S. Lazebnik
BERT: More detailed view
Source: S. Lazebnik
BERT: Evaluation
• General Language Understanding Evaluation (GLUE) benchmark (gluebenchmark.com)
Source: S. Lazebnik
Image GPT
• Image resolution up to 64x64, color values quantized to 512 levels (9 bits), dense attention
• For transfer learning, average-pool encoded features across all positions
M. Chen et al., Generative pretraining from pixels, ICML 2020
Source: S. Lazebnik
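The 9-bit palette is built by clustering RGB values into 512 colors, so each pixel becomes one discrete token for the autoregressive transformer. A sketch (using scikit-learn's k-means as an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

# pixels: (N, 3) array of RGB values gathered from training images
pixels = np.random.randint(0, 256, size=(10000, 3)).astype(np.float32)
palette = KMeans(n_clusters=512, n_init=1).fit(pixels)   # 512 colors = 9 bits per pixel

def quantize(image):
    """Map an (H, W, 3) image to (H*W,) palette indices -- the token sequence."""
    return palette.predict(image.reshape(-1, 3).astype(np.float32))
```

At 64x64 resolution this yields a sequence of 4096 tokens per image.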
Image GPT – OpenAI
https://openai.com/blog/image-gpt/
M. Chen et al., Generative pretraining from pixels, ICML 2020
Source: S. Lazebnik
Vision transformer (ViT)
• Split an image into patches, feed linearly projected patches into standard transformer encoder
• With patches of 14x14 pixels, you need 16x16 = 256 patches to represent 224x224 images
A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
Source: S. Lazebnik
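A sketch of the patch-embedding step (here with 16x16-pixel patches, so a 224x224 image becomes 14x14 = 196 tokens; implementing the linear projection as a strided convolution is a common trick, not necessarily the paper's exact code):

```python
import torch
import torch.nn as nn

patch, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # linear projection of each patch

img = torch.randn(1, 3, 224, 224)
tokens = to_patches(img).flatten(2).transpose(1, 2)   # (1, 196, 768): 14x14 patch tokens
cls = nn.Parameter(torch.zeros(1, 1, dim))            # learnable [CLS] token
pos = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, dim))   # learnable position embeddings
x = torch.cat([cls.expand(1, -1, -1), tokens], dim=1) + pos    # input to the transformer encoder
```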
Vision transformer (ViT)
A. Dosovitskiy et al., An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
Source: S. Lazebnik
Swin Transformer: windowed attention
[Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”]
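Swin computes self-attention only within non-overlapping local windows, shifting the windows between consecutive blocks. A sketch of the window partition (the reshape pattern and names are mine):

```python
import torch

def window_partition(x, M=7):
    """(B, H, W, C) feature map -> (num_windows*B, M*M, C) tokens per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)
    return windows                          # attention is then computed within each window

x = torch.randn(1, 56, 56, 96)
print(window_partition(x).shape)            # torch.Size([64, 49, 96])
```

The shifted variant can be obtained by rolling the feature map (e.g. with torch.roll) before partitioning.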
Vision transformer (ViT)
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
Application: self-supervised learning
[Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”, 2021]
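MAE masks a large random fraction of patches (75% in the paper), encodes only the visible ones, and reconstructs the missing pixels with a lightweight decoder. A sketch of the random masking (names are mine):

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """tokens: (B, N, D) patch embeddings -> keep a random (1 - mask_ratio) subset."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    order = torch.argsort(torch.rand(B, N), dim=1)    # random permutation per image
    keep = order[:, :n_keep]                          # indices of visible patches
    visible = torch.gather(tokens, 1, keep[:, :, None].expand(-1, -1, D))
    return visible, keep                              # the encoder sees only `visible`
```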
Application: video
Source: [Gedas Bertasius et al, “Is Space-Time Attention All You Need for Video Understanding?”, 2021]
Application: multimodal models
[Arsha Nagrani et al, “Attention Bottlenecks for Multimodal Fusion”, 2021]
Detection Transformer (DETR) – Facebook AI
• Hybrid of CNN and transformer, aimed at standard recognition task
N. Carion et al., End-to-end object detection with transformers, ECCV 2020
Source: S. Lazebnik
CNNs can be improved too, though.
[Irwan Bello et al. “Revisiting ResNets: Improved Training and Scaling Strategies”, 2021]
CNNs can be improved too, though.
[Liu et al. “A ConvNet for the 2020s”, 2022]
Next class: Bias and ethics