
Recent Advances in Vision-and-Language Research
Zhe Gan, Licheng Yu, Yu Cheng, Luowei Zhou, Linjie Li, Yen-Chun Chen, Jingjing Liu, Xiaodong He
Visual Captioning
• Popular Topics: Advanced attentions, RL/GAN-based model training, Style diversity, Language richness, Evaluation
• Popular Tasks: Image/video captioning, Dense captioning, Storytelling

Visual QA/Grounding/Reasoning
• Popular Topics: Multimodal fusion, Advanced attentions, Use of relations, Neural modules, Language bias reduction
• Popular Tasks: VQA, GQA, VisDial, Ref-COCO, CLEVR, VCR, NLVR2

Text-to-image Synthesis
• Popular Tasks: Text-to-image, Layout-to-image, Scene-graph-to-image, Text-based image editing, Story visualization
• SOTA Models: StackGAN, AttnGAN, ObjGAN, …
[Figure: generated bird images for the example caption "This bird is red with white belly and has a very short beak"]

Self-supervised Learning
• SOTA Models (Image+Text): ViLBERT, LXMERT, Unicoder-VL, UNITER, etc.
• SOTA Models (Video+Text): VideoBERT, CBT, UniViLM, etc.
Tutorial Agenda
• 1:15 – 1:25 Opening Remarks
• 1:25 – 2:15 Visual QA/Reasoning
• 2:15 – 2:30 Coffee Break
• 2:30 – 3:10 Visual Captioning
• 3:10 – 3:40 Text-to-image Generation
• 3:40 – 4:00 Coffee Break
• 4:00 – 5:00 Self-supervised Learning

Tutorial Website: https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research/


Session 1: Visual QA and Reasoning

Time:
1:25 – 2:15 PM (50 mins)

Presenter:
Zhe Gan (Microsoft)

Zhe Gan is a Senior Researcher at Microsoft Dynamics 365 AI Research. His current
research interests include vision-and-language pre-training and self-supervised
learning. Zhe obtained his Ph.D. degree from Duke University in 2018, and his Master's
and Bachelor's degrees from Peking University in 2013 and 2010, respectively. He is
an Area Chair for NeurIPS 2019 and 2020, and received the AAAI-2020 Outstanding
Senior Program Committee Award.
Visual QA/Reasoning/Grounding

Benchmarks: VQA, GQA, VCR, Referring Expressions, CLEVR, NLVR2


Main Topics
• Advanced attention mechanism
• Enhanced multimodal fusion
• Better image feature preparation
• Multi-step reasoning
• Incorporation of object relations
• Neural module networks
• Language bias reduction
• Multimodal pre-training
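To make the "enhanced multimodal fusion" and "advanced attention mechanism" topics above concrete, here is a minimal, illustrative PyTorch sketch of additive attention fusion for VQA: a question encoding attends over detected image regions, and the fused representation is scored against an answer vocabulary. The class name and all dimensions (36 region features of size 2048, a 768-d question vector, 3,129 answer classes) are assumptions chosen for illustration, not any specific published model.

```python
# Minimal sketch (illustrative only): attention-based multimodal fusion for VQA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, region_dim=2048, question_dim=768, hidden_dim=512, num_answers=3129):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question encoding
        self.att = nn.Linear(hidden_dim, 1)                # scalar attention logit per region
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, regions, question):
        # regions: (B, R, region_dim), e.g. 36 detected boxes; question: (B, question_dim)
        v = self.proj_v(regions)                  # (B, R, H)
        q = self.proj_q(question).unsqueeze(1)    # (B, 1, H)
        logits = self.att(torch.tanh(v + q))      # (B, R, 1) additive attention scores
        weights = F.softmax(logits, dim=1)        # attention over regions
        attended = (weights * v).sum(dim=1)       # (B, H) attended image summary
        fused = attended * q.squeeze(1)           # element-wise multimodal fusion
        return self.classifier(fused)             # answer logits

# Toy usage:
model = AttentionFusion()
scores = model(torch.randn(2, 36, 2048), torch.randn(2, 768))  # -> (2, 3129)
```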
Session 2: Visual Captioning
Time:
2:30 – 3:10 PM (40 mins)

Presenter:
Luowei Zhou (Microsoft)

Luowei Zhou is a Researcher at Microsoft. He received his Ph.D. degree in
Robotics from the University of Michigan in 2020 and his Bachelor's degree
in Automation from Nanjing University in 2015. His research interests
include computer vision and deep learning, in particular the intersection
of vision and language. He is a PC member/reviewer for TPAMI, IJCV,
CVPR, ICCV, ECCV, ACL, EMNLP, NeurIPS, AAAI, ICML, etc., and
actively organizes affiliated workshops and tutorials.
From Images to Videos and Beyond

[Figure credit: Aafaq et al., 2019]


Main Topics
• Show and Tell
• Attention-based
• “Fancier” Attention
• Transformer-based
• Pre-training
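To ground the "Attention-based" item above, here is a minimal, illustrative PyTorch sketch of one decoding step in a "Show, Attend and Tell"-style captioner: the decoder attends over image features conditioned on its hidden state, then feeds the attended context and previous word into an LSTM cell. All module names and sizes here are assumptions for illustration, not a faithful reimplementation.

```python
# Minimal sketch (illustrative only): one step of an attention-based captioning decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttendAndTellStep(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.logit = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, feats, state):
        # prev_word: (B,) previous token ids; feats: (B, K, feat_dim) grid/region features
        h, c = state
        # additive attention conditioned on the previous hidden state
        e = self.att_out(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)            # (B, K, 1) attention over image locations
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) attended visual context
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.logit(h), (h, c)           # next-word logits, updated state
```

At inference time this step is applied repeatedly (greedily or with beam search) until an end-of-sentence token is produced.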
Session 3: Text-to-Image Synthesis
Time:
3:10 – 3:40 PM (30 mins)

Presenter:
Yu Cheng (Microsoft)

Yu Cheng is a Senior Researcher at Microsoft. Before that, he was
a Research Staff Member at IBM Research/MIT-IBM Watson AI Lab. Yu
got his Ph.D. from Northwestern University in 2015 and his Bachelor's degree
from Tsinghua University in 2010. His research is in deep learning in
general, with specific interests in model compression, deep generative
models, and adversarial learning. Currently he focuses on using these
techniques to solve real-world problems in computer vision and NLP.
Image and Video Synthesis from Text

[Figure credits: Zhang et al., 2017; Li et al., 2018]


Main Topics
• Text-to-Image Synthesis (StackGAN, AttnGAN, TAGAN, Obj-GAN)
• Text-to-Video Synthesis (GAN-based, VAE-based)
• Dialogue-based Image Synthesis (ChatPainter, CoDraw, SeqAttnGAN)
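As a rough illustration of how models in the StackGAN/AttnGAN family condition image generation on text, here is a minimal PyTorch sketch of a text-conditional GAN generator: a noise vector concatenated with a sentence embedding is projected and upsampled into a 64x64 image. The architecture and sizes are simplified assumptions, not a faithful reimplementation of any listed model (StackGAN, for instance, adds conditioning augmentation and a second refinement stage).

```python
# Minimal sketch (illustrative only): a text-conditional GAN generator.
import torch
import torch.nn as nn

class TextToImageGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=256, base_channels=64):
        super().__init__()
        self.base_channels = base_channels
        self.fc = nn.Linear(noise_dim + text_dim, base_channels * 8 * 4 * 4)
        self.upsample = nn.Sequential(
            # 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1),
            nn.BatchNorm2d(base_channels * 4), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1),
            nn.BatchNorm2d(base_channels * 2), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1),
            nn.BatchNorm2d(base_channels), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1),
            nn.Tanh(),  # RGB image scaled to [-1, 1]
        )

    def forward(self, noise, text_embedding):
        # noise: (B, noise_dim); text_embedding: (B, text_dim), e.g. a sentence
        # encoding of "this bird is red with white belly and has a very short beak"
        x = self.fc(torch.cat([noise, text_embedding], dim=1))
        x = x.view(-1, self.base_channels * 8, 4, 4)
        return self.upsample(x)  # (B, 3, 64, 64)
```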
Session 4: Self-supervised Learning
Time:
4:00 – 5:00 PM (60 mins)

Presenters:
Licheng Yu (Facebook), Yen-Chun Chen (Microsoft), Linjie Li (Microsoft)
Dr. Licheng Yu is a Research Scientist at Facebook AI. Before that, he was at Microsoft Dynamics 365 AI
Research. Licheng completed his Ph.D. at the University of North Carolina at Chapel Hill in 2019, and got his B.S. degree
from Shanghai Jiao Tong University (SJTU) and M.S. degrees from both SJTU and Georgia Tech. During his Ph.D. study,
he did summer internships at eBay Research, Adobe Research, and Facebook AI Research.

Linjie Li is a Research SDE at Microsoft Dynamics 365 AI Research. Her current research interests include vision-and-language
pre-training and self-supervised learning. Linjie obtained her Master's degree in computer science from
Purdue University in 2018. She also holds a Master's degree in electrical engineering from UC San Diego.

Yen-Chun Chen is a Research SDE at Microsoft. He received his M.S. in computer science from UNC Chapel Hill in
2017, where he focused on NLP and text summarization. He got his Bachelor's degree in electrical engineering
from NTU in 2014. His current research focus is large-scale self-supervised pre-training and its applications.
Self-supervised Learning for Vision-and-Language

Large, Noisy, Free Data
[Figure: noisy web image-caption pairs feeding a pre-training model, e.g. "Interior design of modern white and brown living room furniture", "Emma in her hat looking super cute against white wall with a lamp hanging", "Man sits in a rusted car buried in the sand on Waitarere beach", "Little girl and her dog in northern Thailand. They both seemed interested in what we were doing"]

Pre-training Tasks
• Masked Language Modeling
• Masked Region Modeling
• Image-Text Matching
• Word-Region Alignment

Downstream Tasks: VQA, VCR, NLVR2, Referring Expressions, GQA, Visual Entailment, Image Captioning, Image-Text Retrieval, Text-Image Retrieval
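To make the first pre-training task above concrete, here is a minimal, illustrative PyTorch sketch of masked language modeling (MLM) on paired image-text data: roughly 15% of caption tokens are replaced with a [MASK] token, and the model must predict them from the fused image-text representation. The mask token id, the all-or-nothing masking scheme, and the `model` interface are assumptions for illustration (BERT-style recipes also mix in random/kept tokens, and real models like UNITER define their own tokenization).

```python
# Minimal sketch (illustrative only): masked language modeling over image-text pairs.
import torch
import torch.nn.functional as F

MASK_ID = 103          # assumed [MASK] token id
IGNORE_INDEX = -100    # positions that do not contribute to the loss

def mlm_loss(model, token_ids, region_feats, mask_prob=0.15):
    # token_ids: (B, T) caption tokens; region_feats: (B, R, D) image region features
    labels = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    labels[~mask] = IGNORE_INDEX                   # only score the masked positions
    inputs = token_ids.masked_fill(mask, MASK_ID)  # corrupt the caption
    # `model` is assumed to be any cross-modal transformer returning
    # per-token vocabulary logits conditioned on both modalities.
    logits = model(inputs, region_feats)           # (B, T, vocab)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=IGNORE_INDEX)
```

The other tasks follow the same recipe with different corruptions: masked region modeling masks visual features instead of words, and image-text matching swaps in mismatched captions as negatives.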
Main Topics

Image+Text Pre-trained Models (timeline):
• ViLBERT (Aug. 6, 2019) • VisualBERT (Aug. 9, 2019) • B2T2 (Aug. 14, 2019) • Unicoder-VL (Aug. 16, 2019) • LXMERT (Aug. 20, 2019) • VL-BERT (Aug. 22, 2019) • VLP (Sep. 24, 2019) • UNITER (Sep. 25, 2019) • 12-in-1 (Dec. 5, 2019) • Pixel-BERT (Apr. 2, 2020) • OSCAR (Apr. 13, 2020)

Image Downstream Tasks: VQA, VCR, NLVR2, Visual Entailment, Referring Expressions, Image-Text Retrieval, Image Captioning

Video+Text Pre-trained Models (timeline):
• VideoBERT (Apr. 3, 2019) • HowTo100M (Jun. 7, 2019) • CBT (Jun. 13, 2019) • MIL-NCE (Dec. 13, 2019) • UniViLM (Feb. 15, 2020) • HERO (May 1, 2020)

Video Downstream Tasks: Video QA, Video-and-Language Inference, Video Captioning, Video Moment Retrieval
