Large Language Models
CSC413 Tutorial 9
Yongchao Zhou
Overview
● What are LLMs?
● Why LLMs?
● Emergent Capabilities
○ Few-shot In-context Learning
○ Advanced Prompt Techniques
● LLM Training
○ Architectures
○ Objectives
● LLM Finetuning
○ Instruction finetuning
○ RLHF
○ Bootstrapping
● LLM Risks
What are Language Models?
● Narrow Sense
○ A probabilistic model that assigns a probability to every finite sequence (grammatical or not)
● Broad Sense
○ Decoder-only models (GPT-X, OPT, LLaMA, PaLM)
○ Encoder-only models (BERT, RoBERTa, ELECTRA)
○ Encoder-decoder models (T5, BART)
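In the narrow sense above, a decoder-only LM factorizes the probability of a sequence autoregressively, p(x_1, ..., x_T) = Π_t p(x_t | x_<t). The sketch below scores a sentence this way with GPT-2 through Hugging Face transformers; the specific model and library are illustrative choices, not something the slides prescribe.

```python
# Minimal sketch: score a sentence under a decoder-only LM (GPT-2 via Hugging Face).
# The model/tokenizer choice is illustrative; the slides do not prescribe a library.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Language models assign probabilities to sequences."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # labels=ids makes the model return the mean next-token cross-entropy.
    loss = model(ids, labels=ids).loss

# Mean negative log-likelihood per predicted token -> sequence log-probability.
log_prob = -loss.item() * (ids.shape[1] - 1)
print(f"log p(text) ≈ {log_prob:.2f}")
```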
Large Language Models - Billions of Parameters
https://huggingface.co/blog/large-language-models
Large Language Models - Hundreds of Billions of Tokens
https://babylm.github.io/
Large Language Models - YottaFLOPs of Compute
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf
Why LLMs?
● Scaling Law for Neural Language Models
○ Performance depends strongly on scale: loss keeps improving as we scale up the model, data, and compute (power-law fits sketched below)
https://arxiv.org/pdf/2001.08361.pdf
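The linked Kaplan et al. paper fits the test loss with power laws in parameter count N, dataset size D, and compute C; a sketch of the functional form (the constants and exponents are empirically fitted values reported in the paper):

```latex
% Power-law scaling of test loss (Kaplan et al., 2020); N = parameters, D = tokens, C = compute.
% N_c, D_c, C_c and \alpha_N, \alpha_D, \alpha_C are empirically fitted constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```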
Why LLMs?
● Generalization
○ We can now use one single model to solve many NLP tasks
https://arxiv.org/pdf/1910.10683.pdf
Why LLMs?
● Emergent Abilities
○ An ability is emergent if it is not present in smaller models but is present in larger models
https://docs.google.com/presentation/d/1yzbmYB5E7G8lY2-KzhmArmPYwwl7o7CUST1xRZDUu1Y/edit?resourcekey=0-6_TnUMoKWCk_FN2BiPxmbw#slide=id.g1fc34b3ac18_0_27
Emergent Capability - In-Context Learning
https://arxiv.org/pdf/2005.14165.pdf
Emergent Capability - In-Context Learning
https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec04.pdf
Emergent Capability - In-Context Learning
Pretraining + Fine-tuning Paradigm
Pretraining + Prompting Paradigm
● Fine-tuning (FT)
○ + Strongest performance
○ - Needs a curated and labeled dataset for each new task (typically 1k-100k ex.)
○ - Poor generalization, spurious feature exploitation
● Few-shot (FS)
○ + Much less task-specific data needed
○ + No spurious feature exploitation
○ - Challenging
● One-shot (1S)
○ + "Most natural," e.g. giving humans instructions
○ - Challenging
● Zero-shot (0S)
○ + Most convenient
○ - Challenging, can be ambiguous
(Figure axis: stronger task-specific performance at the top, more convenient / more general / less data at the bottom; example prompt formats are sketched below.)
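To make the FS/1S/0S comparison concrete, here is a minimal sketch of how the zero-, one-, and few-shot prompts are assembled; the translation task and the "=>" separator follow the GPT-3 paper's illustration, and no gradient updates are involved.

```python
# Minimal sketch of the zero-/one-/few-shot prompt formats compared on this slide.
# The model only conditions on the prompt at inference time; nothing is finetuned.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]
query = "cheese"

def build_prompt(n_shots: int) -> str:
    """n_shots = 0 (zero-shot), 1 (one-shot), or more (few-shot)."""
    lines = ["Translate English to French:"]
    for en, fr in demonstrations[:n_shots]:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")  # leave the answer for the model to complete
    return "\n".join(lines)

print(build_prompt(0))  # zero-shot: instruction + query only
print(build_prompt(1))  # one-shot: one in-context demonstration
print(build_prompt(2))  # few-shot: several demonstrations
```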
Emergent Capability - Chain of Thoughts Prompting
https://arxiv.org/pdf/2201.11903.pdf
Emergent Capability - Chain of Thoughts Prompting
https://arxiv.org/pdf/2201.11903.pdf
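Chain-of-thought prompting adds worked reasoning to the few-shot demonstrations so the model produces intermediate steps before its answer. A minimal sketch of the prompt structure, using the tennis-ball/cafeteria example from the linked paper:

```python
# Minimal sketch of a few-shot chain-of-thought prompt: the demonstration contains
# the intermediate reasoning, not just the final answer.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n"
)
new_question = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?\n"
    "A:"
)
prompt = cot_exemplar + "\n" + new_question
print(prompt)  # the model is expected to continue with reasoning, then "The answer is 9."
```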
Emergent Capability - Zero Shot CoT Prompting
https://arxiv.org/pdf/2205.11916.pdf
Emergent Capability - Zero Shot CoT Prompting
https://arxiv.org/pdf/2205.11916.pdf
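Zero-shot CoT (Kojima et al., linked above) skips hand-written demonstrations: appending a trigger phrase such as "Let's think step by step." elicits the reasoning, and a second prompt extracts the answer. A minimal sketch, where `generate` is a stub standing in for any completion API:

```python
# Minimal sketch of two-stage zero-shot chain-of-thought prompting.
# `generate` is a placeholder so the script runs end to end.

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g. an API or a local model)."""
    return "<model completion would appear here>"

question = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

# Stage 1: append the trigger phrase to elicit intermediate reasoning.
reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
reasoning = generate(reasoning_prompt)

# Stage 2: ask for the final answer conditioned on the generated reasoning.
answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer (arabic numerals) is"
answer = generate(answer_prompt)
print(answer)
```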
Emergent Capability - Self-Consistency Prompting
https://arxiv.org/pdf/2203.11171.pdf
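Self-consistency replaces greedy decoding with sampling: draw several chain-of-thought completions at nonzero temperature and majority-vote over the final answers. A minimal sketch with placeholder `sample_cot` / `extract_answer` helpers:

```python
# Minimal sketch of self-consistency: sample multiple reasoning paths, then take the
# most common final answer. Both helpers are illustrative placeholders.
from collections import Counter

def sample_cot(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for one sampled chain-of-thought completion."""
    return "... so the answer is 42"

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a completion (here: the last whitespace token)."""
    return completion.strip().split()[-1]

def self_consistency(prompt: str, n_samples: int = 10) -> str:
    answers = [extract_answer(sample_cot(prompt)) for _ in range(n_samples)]
    # Marginalize out the reasoning paths by majority vote over final answers.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("Q: ... A: Let's think step by step."))
```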
Emergent Capability - Least-to-Most Prompting
https://arxiv.org/pdf/2205.10625.pdf
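Least-to-most prompting proceeds in two stages: first ask the model to decompose the problem into subquestions, then solve them in order, feeding each sub-answer back into the context. A minimal sketch; the prompt wording and the `generate` stub are illustrative:

```python
# Minimal sketch of least-to-most prompting: decompose, then solve sequentially,
# reusing earlier answers as context. `generate` is a placeholder completion call.

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    return "<model completion>"

problem = ("Elsa has 5 apples. Anna has 2 more apples than Elsa. "
           "How many apples do they have together?")

# Stage 1: decomposition into simpler subquestions.
subquestions = generate(
    f"Decompose this problem into simpler subquestions, one per line:\n{problem}"
).splitlines()

# Stage 2: sequential solving, appending each sub-answer to the context.
context = problem
for sub in subquestions:
    answer = generate(f"{context}\nQ: {sub}\nA:")
    context += f"\nQ: {sub}\nA: {answer}"

print(context)
```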
Emergent Capability - Augmented Prompting Abilities
http://jalammar.github.io/illustrated-transformer/
Training Objectives - UL2
https://arxiv.org/pdf/2205.05131.pdf
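UL2 trains on a mixture of denoisers: regular span corruption (R-denoiser), a sequential/prefix-LM objective (S-denoiser), and extreme corruption with long spans or high corruption rates (X-denoiser). A minimal sketch of how span-corruption inputs and targets are built; the sentinel tokens and rates here are illustrative, not the exact UL2 settings:

```python
# Minimal sketch of span-corruption denoising in the spirit of UL2's mixture.
# R- and X-denoisers differ mainly in span length and corruption rate; the
# S-denoiser (prefix LM) is omitted here.
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3):
    """Replace random spans with sentinels; the target reconstructs the spans."""
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if random.random() < corruption_rate / mean_span_len:
            span_len = max(1, round(random.gauss(mean_span_len, 1)))
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span_len])
            sentinel += 1
            i += span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

toks = "large language models are trained on web scale text corpora".split()
print(span_corrupt(toks))                                         # R-denoiser-like
print(span_corrupt(toks, corruption_rate=0.5, mean_span_len=8))   # X-denoiser-like
```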
What kinds of things does pretraining learn?
https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf
Finetuning - Instruction Finetuning
https://arxiv.org/pdf/2210.11416.pdf
Finetuning - Instruction Finetuning
https://arxiv.org/pdf/2210.11416.pdf
Finetuning - Instruction Finetuning
https://arxiv.org/pdf/2210.11416.pdf
Finetuning - Instruction Finetuning
https://arxiv.org/pdf/2210.11416.pdf
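Instruction finetuning (FLAN-style, linked above) rewrites many supervised tasks as instruction/response pairs and finetunes the model with the usual next-token cross-entropy on the response. A minimal sketch of the data formatting; the templates and examples are illustrative, not the actual FLAN templates:

```python
# Minimal sketch of instruction-finetuning data formatting: tasks become
# (instruction prompt, response) pairs; the loss is applied to response tokens only.

raw_examples = [
    {"task": "sentiment", "input": "The movie was a delight.", "output": "positive"},
    {"task": "translation", "input": "How are you?", "output": "Comment ça va ?"},
]

TEMPLATES = {
    "sentiment": "Classify the sentiment of this review as positive or negative:\n{input}\nAnswer:",
    "translation": "Translate the following sentence to French:\n{input}\nTranslation:",
}

def to_instruction_pair(ex):
    prompt = TEMPLATES[ex["task"]].format(input=ex["input"])
    return {"prompt": prompt, "response": " " + ex["output"]}

finetuning_data = [to_instruction_pair(ex) for ex in raw_examples]
for pair in finetuning_data:
    print(pair["prompt"] + pair["response"], end="\n\n")
# During finetuning, prompt tokens are masked out of the cross-entropy loss.
```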
Finetuning - RLHF
https://arxiv.org/pdf/2203.14465.pdf
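A common way to write the RLHF policy objective (in the style of the InstructGPT line of work; reproduced here as a sketch, not taken from the linked slide): maximize the learned reward while a KL penalty keeps the tuned policy close to the supervised reference model.

```latex
% RLHF policy objective (sketch): r_\phi is the learned reward model,
% \pi_\theta the policy being tuned, \pi_{\mathrm{ref}} the supervised reference model,
% and \beta the strength of the KL penalty.
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\Big[\, r_\phi(x, y) \;-\; \beta \,\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \,\Big]
```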
Application - ChatGPT
Application - ChatGPT
https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1
Finetuning - Bootstrapping
https://arxiv.org/pdf/2203.14465.pdf
Finetuning - Bootstrapping
https://arxiv.org/pdf/2210.11610.pdf
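The bootstrapping loop in the spirit of STaR (linked above): sample rationales, keep only those whose final answer matches the label, finetune on the kept examples, and repeat. A minimal sketch with placeholder helpers:

```python
# Minimal sketch of a STaR-style bootstrapping loop. All helpers are placeholders;
# STaR additionally "rationalizes" failed examples by hinting the answer (omitted here).

def generate_rationale(model, question):
    """Placeholder: few-shot prompt the model for (rationale, final answer)."""
    return "reasoning ...", "answer"

def finetune(model, examples):
    """Placeholder: standard supervised finetuning on the filtered examples."""
    return model

def bootstrap(model, dataset, n_iterations=3):
    for _ in range(n_iterations):
        kept = []
        for question, gold_answer in dataset:
            rationale, answer = generate_rationale(model, question)
            if answer == gold_answer:           # filter: keep only correct chains
                kept.append((question, rationale, gold_answer))
        model = finetune(model, kept)           # train on self-generated rationales
    return model
```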
Large Language Models - Risks
● LLMs make mistakes (falsehoods, hallucinations)
● LLMs can be misused (misinformation, spam)
https://arxiv.org/pdf/2005.14165.pdf
Emergent Capability - Decomposed Prompting
https://arxiv.org/pdf/2210.02406.pdf
Training Objectives - UL2
https://arxiv.org/pdf/2205.05131.pdf
Training Techniques - Parallelism
https://openai.com/research/techniques-for-training-large-neural-networks
Training Techniques - Parallelism
https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
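Of the parallelism strategies surveyed at the links above (data, tensor, pipeline, and ZeRO-style sharded training), plain data parallelism is the simplest to sketch; below is a minimal PyTorch DistributedDataParallel example (the launch command and tensor sizes are illustrative):

```python
# Minimal sketch of data parallelism with PyTorch DDP: each rank holds a full model
# replica, processes its own data shard, and gradients are all-reduced across ranks.
# Launch with e.g. `torchrun --nproc_per_node=4 this_script.py` (illustrative).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                 # one process per GPU
    rank = dist.get_rank()
    model = torch.nn.Linear(1024, 1024).cuda(rank)
    model = DDP(model, device_ids=[rank])           # wraps the replica for gradient sync
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device=rank)          # each rank sees a different data shard
    loss = model(x).pow(2).mean()
    loss.backward()                                 # DDP overlaps all-reduce with backward
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```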