
Lecture 25: Recent architectures

1
Transformer slides from S. Lazebnik
Reminders

• Sign up for final presentation (see Piazza)

2
Today

• Neural fields
• Transformers for vision

3
3D view synthesis

Input views → Create model → Render new views

What representation should we use?


4
[Source: Mildenhall et al., “NeRF”, 2020]
Idea #1: Image-based rendering

View from a different angle

Point cloud
(reconstructed with SfM + multi-view stereo)

5
[Riegler and Koltun, 2020]
Idea #1: Image-based rendering

Point cloud Proxy geometry (a mesh)

To synthesize a new view, select colors from existing views using proxy geometry.
6
[Riegler and Koltun, 2020]
Idea #1: Image-based rendering

7
[Riegler and Koltun, 2020]
Idea #2: voxel representation

8
[Source: Sitzmann et al., “DeepVoxels”, 2019]
Idea #2: voxel representation

Position Viewing direction Color Density

V[x, y, z, θ, ϕ] = (R, G, B, σ)

Input views

9
Idea #2: voxel representation
Position Viewing direction Color Density

V[x, y, z, θ, ϕ] = (R, G, B, σ)

Training:
Input views
V[x, y, z, θ, ϕ]

Problem: a huge table, with 𝒪(D³A²) entries!
10
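To get a feel for the 𝒪(D³A²) blow-up, a back-of-the-envelope calculation in Python (the resolutions D and A below are illustrative assumptions, not values from the lecture):

# Rough size of a dense 5D lookup table V[x, y, z, theta, phi] -> (R, G, B, sigma).
D, A = 512, 128           # D: spatial resolution per axis, A: angular resolution per angle (assumed)
channels = 4              # R, G, B, sigma
bytes_per_value = 4       # float32
entries = D**3 * A**2     # O(D^3 A^2) table cells
size_tb = entries * channels * bytes_per_value / 1e12
print(f"{entries:.2e} cells, ~{size_tb:.0f} TB at float32")   # ~2.2e12 cells, ~35 TB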
Idea #3: neural radiance field (NeRF)

FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)

• Represent using a neural radiance field.


Input views
• A function that maps (x, y, z, θ, ϕ) to a color and density.
• Typically parameterized as a multi-layer perceptron (MLP).
• Goal: find the MLP parameters Θ that explain the input images.

11
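A minimal PyTorch sketch of such a function FΘ (a toy stand-in: the real NeRF MLP is deeper, uses a skip connection, and feeds the viewing direction in only near the end, as on the architecture slide; the class name and widths here are my own):

import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """Toy F_theta: maps a 5D input (x, y, z, theta, phi) to (R, G, B, sigma)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # 3 color channels + 1 density
        )

    def forward(self, x):                      # x: (batch, 5) sampled coordinates
        out = self.net(x)
        rgb = torch.sigmoid(out[..., :3])      # colors constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])       # density must be non-negative
        return rgb, sigma

rgb, sigma = TinyRadianceField()(torch.rand(1024, 5))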
Idea #3: neural radiance field (NeRF)
Learn volume:
color + occupancy

3D scene Viewpoints
[Mildenhall*, Srinivasan*, Tancik*, et al., Neural radiance fields, 2020]
12
Learning a NeRF

13
[Source: Mildenhall et al., “NeRF”, 2020]
Neural rendering

Ray: r(t) = o + t·d, with origin o and viewing direction d.
Rendered color: C(r) = ∫ T(t) σ(r(t)) c(r(t), d) dt, where T(t) = exp(−∫ σ(r(s)) ds),
for color c and density σ.

14


[Source: Mildenhall et al., “NeRF”, 2020]
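A sketch of the discrete quadrature NeRF uses to turn sampled colors and densities along a ray into one rendered pixel color (simplified: a single ray with fixed samples; the function name is mine):

import torch

def render_ray(rgb, sigma, t_vals):
    """Composite colors along one ray: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i."""
    deltas = t_vals[1:] - t_vals[:-1]                    # distance between adjacent samples
    deltas = torch.cat([deltas, deltas[-1:]])            # pad the last interval
    alpha = 1.0 - torch.exp(-sigma * deltas)             # opacity contributed by each segment
    # Transmittance T_i: probability the ray reaches sample i without being absorbed earlier
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)           # expected color C(r), shape (3,)

# 64 samples along a ray; rgb and sigma would come from querying the MLP at those points
color = render_ray(torch.rand(64, 3), torch.rand(64), torch.linspace(2.0, 6.0, 64))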
Why is it good to be view-dependent?

15
[Source: Mildenhall et al., “NeRF”, 2020]
Representing the inputs

FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)

Input views

• In theory, could just plug in the 5 raw inputs x, y, z, θ, ϕ


• However, this leads to blurry results.
• Neural nets show a bias toward low frequency
functions.
16
Representing the inputs

FΘ(x, y, z, θ, ϕ) = (R, G, B, σ)

Input views

• Use a positional encoding. Given a scalar p, compute:
γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^(L−1)πp), cos(2^(L−1)πp))
• Plug each coordinate into sinusoids at different frequencies
(e.g. L = 10). This creates a high-frequency input.

17
[Source: Mildenhall et al., “NeRF”, 2020]
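A sketch of this encoding γ(p) in PyTorch (L = 10 as in the paper; the function name is mine):

import math
import torch

def positional_encoding(p, L=10):
    """Map each scalar coordinate to 2L sinusoidal features at frequencies 2^0*pi ... 2^(L-1)*pi."""
    freqs = (2.0 ** torch.arange(L)) * math.pi
    angles = p[..., None] * freqs                                    # (..., d, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., d, 2L)
    return enc.flatten(start_dim=-2)                                 # (..., 2 * L * d)

x = torch.rand(1024, 3)                    # sampled 3D positions
print(positional_encoding(x).shape)        # torch.Size([1024, 60])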
MLP architecture

18
[Source: Mildenhall et al., “NeRF”, 2020]
Results for a novel viewpoint

19
[Source: Mildenhall et al., “NeRF”, 2020]
Results

20
[Mildenhall*, Srinivasan*, Tanick*, et al. 2020]
Results

21
[Mildenhall*, Srinivasan*, Tanick*, et al. 2020]
Results

22
[Mildenhall*, Srinivasan*, Tanick*, et al. 2020]
Extension: internet photo collections

23
[Martin-Brualla, Radwan et al. “NeRF in the Wild”, 2020]
Extension: internet photo collections

24
[Martin-Brualla, Radwan et al. “NeRF in the Wild”, 2020]
25
Lots of other applications

Neural field

[Xie et al., “Neural Fields in Visual Computing and Beyond”, 2021]
26


Today

• Neural fields
• Transformers for vision

27
Recall: Transformers

• Build whole model out of self-attention


• Uses only point-wise processing and attention (no recurrent
units or convolutions)

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 2017
28

Source: S. Lazebnik
Self-attention layer

• Query vectors: Q = XW_Q
• Key vectors: K = XW_K
• Value vectors: V = XW_V
• Similarities: scaled dot-product attention
E_{i,j} = (Q_i · K_j) / √D, or E = QK^T / √D
(D is the dimensionality of the keys)
• Attention weights: A = softmax(E, dim = 1)
• Output vectors: Y_i = Σ_j A_{i,j} V_j, or Y = AV

[Figure: the attention grid — Softmax(↑) over the similarities E, then Product(→) with the values V and Sum(↑) to produce the outputs Y₁, Y₂, Y₃]

29
Adapted from J. Johnson and S. Lazebnik. One query per input vector.
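A minimal PyTorch sketch of the layer above, for a single sequence of input vectors X (single head; class and variable names are my own):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention: Y = softmax(QK^T / sqrt(D)) V."""
    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)   # Q = X W_Q
        self.W_k = nn.Linear(dim, dim, bias=False)   # K = X W_K
        self.W_v = nn.Linear(dim, dim, bias=False)   # V = X W_V

    def forward(self, X):                            # X: (num_tokens, dim)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        E = Q @ K.T / K.shape[-1] ** 0.5             # similarities E = QK^T / sqrt(D)
        A = F.softmax(E, dim=-1)                     # one attention distribution per query
        return A @ V                                 # outputs Y = AV

Y = SelfAttention(dim=64)(torch.rand(3, 64))         # 3 input vectors -> 3 output vectors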
Multi-head attention
• Run h attention models in parallel on top of different linearly projected versions of Q, K, V; concatenate and linearly project the results
• Intuition: enables model to attend to different kinds of information at different positions

30
Source: S. Lazebnik
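A sketch of this split-project-concatenate structure (the reshaping convention and names are mine; in practice one would typically just use PyTorch's nn.MultiheadAttention, as in the transformer-block sketch further below):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)           # linear projections for Q, K, V of all heads
        self.out = nn.Linear(dim, dim)               # final linear projection after concatenation

    def forward(self, X):                            # X: (num_tokens, dim)
        n, _ = X.shape
        q, k, v = self.qkv(X).chunk(3, dim=-1)
        # Split Q, K, V into `heads` chunks of size d_head: shape (heads, n, d_head)
        q, k, v = (t.reshape(n, self.heads, self.d_head).transpose(0, 1) for t in (q, k, v))
        att = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (att @ v).transpose(0, 1).reshape(n, -1) # concatenate the h head outputs
        return self.out(y)

y = MultiHeadSelfAttention(dim=64, heads=8)(torch.rand(10, 64))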
Transformer blocks
• A Transformer is a sequence
of transformer blocks
• Vaswani et al. (base model): N = 6 blocks,
embedding dimension = 512,
8 attention heads
• Add & Norm: residual connection
followed by layer normalization
• Feedforward: two linear layers
with ReLUs in between, applied
independently to each vector
• Attention is the only interaction
between inputs!

31
Source: S. Lazebnik
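A sketch of one such block, using PyTorch's built-in nn.MultiheadAttention and the post-norm Add & Norm layout described above (dimensions are illustrative):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=512, heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Feedforward: two linear layers with a ReLU in between, applied to each token independently
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))

    def forward(self, x):                   # x: (batch, tokens, dim)
        attn_out, _ = self.attn(x, x, x)    # self-attention is the only interaction between tokens
        x = self.norm1(x + attn_out)        # Add & Norm
        x = self.norm2(x + self.ff(x))      # Add & Norm
        return x

print(TransformerBlock()(torch.rand(2, 16, 512)).shape)   # torch.Size([2, 16, 512])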
Self-supervised learning in Natural Language Processing
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on desired NLP task

32
Source: S. Lazebnik
Self-supervised language modeling with transformers
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on desired NLP task
Bidirectional Encoder Representations from Transformers (BERT)

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
33
Source: S. Lazebnik
Recall: denoising autoencoder

Noisy image Reconstructed image

34

[Vincent et al., 2008]


BERT: Pretext tasks
• Masked language model (MLM)
• Randomly mask 15% of tokens in input sentences, goal is to
reconstruct them using bidirectional context

35
Source: S. Lazebnik
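A sketch of the random-masking step of the MLM pretext task (simplified: real BERT replaces a selected token with [MASK] only 80% of the time, otherwise keeping it or substituting a random token; the mask id 103 is assumed here):

import torch

def mask_tokens(token_ids, mask_id=103, mask_prob=0.15):
    """Randomly hide ~15% of tokens; the originals become the reconstruction targets."""
    mask = torch.rand(token_ids.shape) < mask_prob       # positions to corrupt
    inputs = token_ids.clone()
    inputs[mask] = mask_id                               # corrupted input the model sees
    targets = torch.where(mask, token_ids, torch.full_like(token_ids, -100))  # -100 = ignored by the loss
    return inputs, targets

tokens = torch.randint(0, 30000, (2, 16))                # toy batch of token ids
inputs, targets = mask_tokens(tokens)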
BERT: More detailed view

WordPiece (from GNMT)

Trained on Wikipedia (2.5B words) + BookCorpus (800M words)

36
Source: S. Lazebnik
BERT: Evaluation
• General Language Understanding Evaluation (GLUE)
benchmark (gluebenchmark.com)

37
Source: S. Lazebnik
Image GPT
• Image resolution up to 64x64, color values quantized to 512
levels (9 bits), dense attention
• For transfer learning, average-pool encoded features across
all positions

38
M. Chen et al., Generative pretraining from pixels, ICML 2020
Source: S. Lazebnik
Image GPT – OpenAI

https://openai.com/blog/image-gpt/
39
M. Chen et al., Generative pretraining from pixels, ICML 2020
Source: S. Lazebnik
Vision transformer (ViT)
• Split an image into patches, feed linearly projected patches into
standard transformer encoder
• With patches of 14x14 pixels, you need 16x16=256 patches to represent 224x224 images

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
40
Source: S. Lazebnik
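A sketch of the patch-embedding step that produces the transformer's input sequence (using the common trick of a strided convolution to flatten and project patches; the class name is mine):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one to a token."""
    def __init__(self, patch=14, in_ch=3, dim=768):
        super().__init__()
        # A convolution with kernel = stride = patch size acts as a per-patch linear layer
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (batch, 3, H, W)
        x = self.proj(x)                        # (batch, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)     # (batch, num_patches, dim) token sequence

tokens = PatchEmbedding()(torch.rand(1, 3, 224, 224))
print(tokens.shape)                             # torch.Size([1, 256, 768]) -> 16x16 = 256 patches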
Vision transformer (ViT)

BiT: Big Transfer (ResNet)


ViT: Vision Transformer (Base/Large/Huge,
patch size of 14x14, 16x16, or 32x32)

Internal Google dataset (not public)

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
41
Source: S. Lazebnik
Swin Transformer: windowed attention

[Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”]
42
Vision transformer (ViT)

43
A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021
Application: self-supervised learning

44
[Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”, 2021]
Application: self-supervised learning

45
[Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”, 2021]
Application: self-supervised learning

46
[Kaiming He et al., “Masked Autoencoders Are Scalable Vision Learners”, 2021]
Application: video

47
Source: [Gedas Bertasius et al, “Is Space-Time Attention All You Need for Video Understanding?”, 2021]
Application: multimodal models

48
[Arsha Nagrani et al, “Attention Bottlenecks for Multimodal Fusion”, 2021]
Detection Transformer (DETR) – Facebook AI
• Hybrid of CNN and transformer, aimed at a standard
recognition task (object detection)

49
N. Carion et al., End-to-end object detection with transformers, ECCV 2020
Source: S. Lazebnik
CNNs can be improved too, though.

50
[Irwan Bello et al. “Revisiting ResNets: Improved Training and Scaling Strategies”, 2021]
CNNs can be improved too, though.

51
[Liu et al. “A ConvNet for the 2020s”, 2022]
Next class: Bias and ethics

52
