
Lecture 25: Recent architectures

1
Transformer slides from S. Lazebnik
Reminders

• Sign up for final presentation (see Piazza)

2
Today

• Neural fields
• Transformers for vision

3
3D view synthesis

Input views → Create model → Render new views

What representation should we use?


4
[Source: Mildenhall et al., “NeRF”, 2020]
Idea #1: Image-based rendering

View from a different angle

Point cloud
(reconstructed with SfM + multi-view stereo)

5
[Riegler and Koltun, 2020]
Idea #1: Image-based rendering

Point cloud Proxy geometry (a mesh)

To synthesize a new view, select colors from existing views using proxy geometry.
6
[Riegler and Koltun, 2020]
Idea #1: Image-based rendering

7
[Riegler and Koltun, 2020]
Idea #2: voxel representation

8
[Source: Sitzmann et al., “DeepVoxels”, 2019]
Idea #2: voxel representation

Position Viewing direction Color Density

V[x, y, z, θ, ϕ] = (R, G, B, σ)

Input views

9
Idea #2: voxel representation
Position Viewing direction Color Density

V[x, y, z, θ, ϕ] = (R, G, B, σ)

Training:
Input views
V[x, y, z, θ, ϕ]

Problem: a huge table, with 𝒪(D³A²) entries!
10
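To get a feel for the 𝒪(D³A²) blow-up, a back-of-the-envelope calculation in Python (the resolutions D and A below are illustrative assumptions, not values from the lecture):

# Rough size of a dense 5D lookup table V[x, y, z, theta, phi] -> (R, G, B, sigma).
D, A = 512, 128           # D: spatial resolution per axis, A: angular resolution per angle (assumed)
channels = 4              # R, G, B, sigma
bytes_per_value = 4       # float32
entries = D**3 * A**2     # O(D^3 A^2) table cells
size_tb = entries * channels * bytes_per_value / 1e12
print(f"{entries:.2e} cells, ~{size_tb:.0f} TB at float32")   # ~2.2e12 cells, ~35 TB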
Idea #3: neural radiance field (NeRF)

FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)

• Represent using a neural radiance field.


Input views
• A function that maps (x, y, z, θ, ϕ) to a color and density.
• Typically parameterized as a multi-layer perceptron (MLP).
• Goal: find the MLP parameters Θ that explain the input images.

11
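A minimal PyTorch sketch of such a function FΘ (a toy stand-in: the real NeRF MLP is deeper, uses a skip connection, and feeds the viewing direction in only near the end, as on the architecture slide; the class name and widths here are my own):

import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Toy F_theta: maps a 5D input (x, y, z, theta, phi) to (R, G, B, sigma)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # 3 color channels + 1 density
        )

    def forward(self, x):                      # x: (batch, 5) sampled coordinates
        out = self.net(x)
        rgb = torch.sigmoid(out[..., :3])      # colors constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])       # density must be non-negative
        return rgb, sigma

rgb, sigma = TinyRadianceField()(torch.rand(1024, 5))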
Idea #3: neural radiance field (NeRF)
Learn volume:
color + occupancy

3D scene Viewpoints
[Mildenhall*, Srinivasan*, Tancik*, et al., Neural radiance fields, 2020]
12
Learning a NeRF

13
[Source: Mildenhall et al., “NeRF”, 2020]
Neural rendering

Ray: r(t) = o + t·d, with origin o and viewing direction d.
Rendered color: C(r) = ∫ T(t) σ(r(t)) c(r(t), d) dt, where T(t) = exp(−∫ σ(r(s)) ds),
for color c and density σ.

14


[Source: Mildenhall et al., “NeRF”, 2020]
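A sketch of the discrete quadrature NeRF uses to turn sampled colors and densities along a ray into one rendered pixel color (simplified: a single ray with fixed samples; the function name is mine):

import torch

def render_ray(rgb, sigma, t_vals):
    """Composite colors along one ray: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    deltas = t_vals[1:] - t_vals[:-1]                    # distance between adjacent samples
    deltas = torch.cat([deltas, deltas[-1:]])            # pad the last interval
    alpha = 1.0 - torch.exp(-sigma * deltas)             # opacity contributed by each segment
    # Transmittance T_i: probability the ray reaches sample i without being absorbed earlier
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)           # expected color C(r), shape (3,)

# 64 samples along a ray; rgb and sigma would come from querying the MLP at those points
color = render_ray(torch.rand(64, 3), torch.rand(64), torch.linspace(2.0, 6.0, 64))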
Why is it good to be view-dependent?

15
[Source: Mildenhall et al., “NeRF”, 2020]
Representing the inputs

FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)

Input views

• In theory, could just plug in the 5 raw inputs x, y, z, θ, ϕ


• However, this leads to blurry results.
• Neural nets show a bias toward low frequency
functions.
16
Representing the inputs

FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)

Input views

• Use a positional encoding. Given a scalar p, compute:
γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^(L−1)πp), cos(2^(L−1)πp))
• Plug each coordinate into sinusoids at different frequencies
(e.g. L = 10). This creates a high-frequency input.

17
[Source: Mildenhall et al., “NeRF”, 2020]
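A sketch of this encoding γ(p) in PyTorch (L = 10 as in the paper; the function name is mine):

import math
import torch

def positional_encoding(p, L=10):
    """Map each scalar coordinate to 2L sinusoidal features at frequencies 2^0*pi ... 2^(L-1)*pi."""
    freqs = (2.0 ** torch.arange(L)) * math.pi
    angles = p[..., None] * freqs                                    # (..., d, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., d, 2L)
    return enc.flatten(start_dim=-2)                                 # (..., 2 * L * d)

x = torch.rand(1024, 3)                    # sampled 3D positions
print(positional_encoding(x).shape)        # torch.Size([1024, 60])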
MLP architecture

18
[Source: Mildenhall et al., “NeRF”, 2020]
Results for a novel viewpoint

19
[Source: Mildenhall et al., “NeRF”, 2020]
Results

20
[Mildenhall*, Srinivasan*, Tanick*, et al. 2020]
Results

21
[Mildenhall*, Srinivasan*, Tanick*, et al. 2020]
Results

22
[Mildenhall*, Srinivasan*, Tanick*, et al. 2020]
Extension: internet photo collections

23
[Martin-Brualla, Radwan et al. “NeRF in the Wild”, 2020]
Extension: internet photo collections

24
[Martin-Brualla, Radwan et al. “NeRF in the Wild”, 2020]
25
Lots of other applications

Neural field

[Xie et al., “Neural Fields in Visual Computing and Beyond”, 2021]
26


Today

• Neural fields
• Transformers for vision

27
Recall: Transformers

• Build whole model out of self-attention


• Uses only point-wise processing and attention (no recurrent
units or convolutions)

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 2017
28

Source: S. Lazebnik
Self-attention layer

• Query vectors: Q = XW_Q
• Key vectors: K = XW_K
• Value vectors: V = XW_V
• Similarities: scaled dot-product attention
E_{i,j} = (Q_i · K_j) / √D, or E = QK^T / √D
(D is the dimensionality of the keys)
• Attention weights: A = softmax(E, dim = 1)
• Output vectors: Y_i = Σ_j A_{i,j} V_j, or Y = AV

[Figure: the attention grid — Softmax(↑) over the similarities E, then Product(→) with the values V and Sum(↑) to produce the outputs Y₁, Y₂, Y₃]

29
Adapted from J. Johnson and S. Lazebnik. One query per input vector.
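A minimal PyTorch sketch of the layer above, for a single sequence of input vectors X (single head; class and variable names are my own):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention: Y = softmax(QK^T / sqrt(D)) V."""
    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)   # Q = X W_Q
        self.W_k = nn.Linear(dim, dim, bias=False)   # K = X W_K
        self.W_v = nn.Linear(dim, dim, bias=False)   # V = X W_V

    def forward(self, X):                            # X: (num_tokens, dim)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        E = Q @ K.T / K.shape[-1] ** 0.5             # similarities E = QK^T / sqrt(D)
        A = F.softmax(E, dim=-1)                     # one attention distribution per query
        return A @ V                                 # outputs Y = AV

Y = SelfAttention(dim=64)(torch.rand(3, 64))         # 3 input vectors -> 3 output vectors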
Multi-head attention
• Run h attention models in parallel on top of different linearly projected versions of Q, K, V; concatenate and linearly project the results
• Intuition: enables model to attend to different kinds of information at different positions

30
Source: S. Lazebnik
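A sketch of this split-project-concatenate structure (the reshaping convention and names are mine; in practice one would typically just use PyTorch's nn.MultiheadAttention, as in the transformer-block sketch further below):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)           # linear projections for Q, K, V of all heads
        self.out = nn.Linear(dim, dim)               # final linear projection after concatenation

    def forward(self, X):                            # X: (num_tokens, dim)
        n, _ = X.shape
        q, k, v = self.qkv(X).chunk(3, dim=-1)
        # Split Q, K, V into `heads` chunks of size d_head: shape (heads, n, d_head)
        q, k, v = (t.reshape(n, self.heads, self.d_head).transpose(0, 1) for t in (q, k, v))
        att = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (att @ v).transpose(0, 1).reshape(n, -1) # concatenate the h head outputs
        return self.out(y)

y = MultiHeadSelfAttention(dim=64, heads=8)(torch.rand(10, 64))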
Transformer blocks
• A Transformer is a sequence
of transformer blocks
• Vaswani et al. (base model): N = 6 blocks,
embedding dimension = 512,
8 attention heads
• Add & Norm: residual connection
followed by layer normalization
• Feedforward: two linear layers
with ReLUs in between, applied
independently to each vector
• Attention is the only interaction
between inputs!

31
Source: S. Lazebnik
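A sketch of one such block, using PyTorch's built-in nn.MultiheadAttention and the post-norm Add & Norm layout described above (dimensions are illustrative):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=512, heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Feedforward: two linear layers with a ReLU in between, applied to each token independently
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))

    def forward(self, x):                   # x: (batch, tokens, dim)
        attn_out, _ = self.attn(x, x, x)    # self-attention is the only interaction between tokens
        x = self.norm1(x + attn_out)        # Add & Norm
        x = self.norm2(x + self.ff(x))      # Add & Norm
        return x

print(TransformerBlock()(torch.rand(2, 16, 512)).shape)   # torch.Size([2, 16, 512])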
Self-supervised learning in Natural Language Processing
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on desired NLP task

32
Source: S. Lazebnik
Self-supervised language modeling with transformers
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on desired NLP task
Bidirectional Encoder Representations from Transformers (BERT)

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
33
Source: S. Lazebnik
Recall: denoising autoencoder

Noisy image Reconstructed image

34

[Vincent et al., 2008]


BERT: Pretext tasks
• Masked language model (MLM)
• Randomly mask 15% of tokens in input sentences, goal is to
reconstruct them using bidirectional context

35
Source: S. Lazebnik
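A sketch of the random-masking step of the MLM pretext task (simplified: real BERT replaces a selected token with [MASK] only 80% of the time, otherwise keeping it or substituting a random token; the mask id 103 is assumed here):

import torch

def mask_tokens(token_ids, mask_id=103, mask_prob=0.15):
    """Randomly hide ~15% of tokens; the originals become the reconstruction targets."""
    mask = torch.rand(token_ids.shape) < mask_prob       # positions to corrupt
    inputs = token_ids.clone()
    inputs[mask] = mask_id                               # corrupted input the model sees
    targets = torch.where(mask, token_ids, torch.full_like(token_ids, -100))  # -100 = ignored by the loss
    return inputs, targets

tokens = torch.randint(0, 30000, (2, 16))                # toy batch of token ids
inputs, targets = mask_tokens(tokens)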
BERT: More detailed view

WordPiece (from GNMT)

Trained on Wikipedia (2.5B words) + BookCorpus (800M words)

36
Source: S. Lazebnik
BERT: Evaluation
• General Language Understanding Evaluation (GLUE)
benchmark (gluebenchmark.com)

37
Source: S. Lazebnik
Image GPT
• Image resolution up to 64x64, color values quantized to 512
levels (9 bits), dense attention
• For transfer learning, average-pool encoded features across
all positions

38
M. Chen et al., Generative pretraining from pixels, ICML 2020
Source: S. Lazebnik
Image GPT – OpenAI

https://openai.com/blog/image-gpt/
39
M. Chen et al., Generative pretraining from pixels, ICML 2020
Source: S. Lazebnik
Vision transformer (ViT)
• Split an image into patches, feed linearly projected patches into
standard transformer encoder
• With patches of 14x14 pixels, you need 16x16=256 patches to represent 224x224 images

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
40
Source: S. Lazebnik
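A sketch of the patch-embedding step that produces the transformer's input sequence (using the common trick of a strided convolution to flatten and project patches; the class name is mine):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one to a token."""
    def __init__(self, patch=14, in_ch=3, dim=768):
        super().__init__()
        # A convolution with kernel = stride = patch size acts as a per-patch linear layer
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (batch, 3, H, W)
        x = self.proj(x)                        # (batch, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)     # (batch, num_patches, dim) token sequence

tokens = PatchEmbedding()(torch.rand(1, 3, 224, 224))
print(tokens.shape)                             # torch.Size([1, 256, 768]) -> 16x16 = 256 patches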
Vision transformer (ViT)

BiT: Big Transfer (ResNet)


ViT: Vision Transformer (Base/Large/Huge,
patch size of 14x14, 16x16, or 32x32)

Internal Google dataset (not public)

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
41
Source: S. Lazebnik
Swin Transformer: windowed attention

[Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”]
42
Vision transformer (ViT)

43
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
Application: self-supervised learning

44
[Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”, 2021]
Application: self-supervised learning

45
[Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”, 2021]
Application: self-supervised learning

46
[Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”, 2021]
Application: video

47
Source: [Gedas Bertasius et al, “Is Space-Time Attention All You Need for Video Understanding?”, 2021]
Application: multimodal models

48
[Arsha Nagrani et al, “Attention Bottlenecks for Multimodal Fusion”, 2021]
Detection Transformer (DETR) – Facebook AI
• Hybrid of CNN and transformer, aimed at a standard
recognition task (object detection)

49
N. Carion et al., End-to-end object detection with transformers, ECCV 2020
Source: S. Lazebnik
CNNs can be improved too, though.

50
[Irwan Bello et al. “Revisiting ResNets: Improved Training and Scaling Strategies”, 2021]
CNNs can be improved too, though.

51
[Liu et al. “A ConvNet for the 2020s”, 2022]
Next class: Bias and ethics

52
