
Large Language Models

CSC413 Tutorial 9
Yongchao Zhou
Overview
● What are LLMs?
● Why LLMs?
● Emergent Capabilities
○ Few-shot In-context Learning
○ Advanced Prompt Techniques
● LLM Training
○ Architectures
○ Objectives
● LLM Finetuning
○ Instruction finetuning
○ RLHF
○ Bootstrapping
● LLM Risks
What are Language Models?
● Narrow Sense
○ A probabilistic model that assigns a probability to every finite sequence (grammatical or not)

● Broad Sense
○ Decoder-only models (GPT-X, OPT, LLaMA, PaLM)
○ Encoder-only models (BERT, RoBERTa, ELECTRA)
○ Encoder-decoder models (T5, BART)
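To make the narrow-sense definition concrete, here is a minimal sketch of how a language model scores a sequence with the chain rule; the conditional distribution is made up (and ignores the prefix) purely for illustration, it is not any real model.

```python
# Narrow-sense language model: assign a probability to a finite token sequence via
#   P(w_1, ..., w_T) = prod_t P(w_t | w_1, ..., w_{t-1}).
import math

def toy_conditional(prefix, token):
    """P(token | prefix) for a hypothetical 3-word vocabulary."""
    probs = {"the": 0.5, "cat": 0.3, "sat": 0.2}
    return probs[token]  # a real LM would actually condition on `prefix`

def sequence_log_prob(tokens):
    return sum(math.log(toy_conditional(tokens[:t], tokens[t])) for t in range(len(tokens)))

print(math.exp(sequence_log_prob(["the", "cat", "sat"])))  # 0.5 * 0.3 * 0.2 = 0.03
```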
Large Language Models - Billions of Parameters

https://huggingface.co/blog/large-language-models
Large Language Models - Hundreds of Billions of Tokens

https://babylm.github.io/
Large Language Models - yottaFLOPs of Compute

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf
Why LLMs?
● Scaling Law for Neural Language Models
○ Performance depends strongly on scale! Loss keeps improving as we scale up the model, data, and compute (see the power-law sketch below).

https://arxiv.org/pdf/2001.08361.pdf
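The cited scaling-law paper (Kaplan et al., 2020) fits test loss with simple power laws in model size N, dataset size D, and compute C when the other factors are not bottlenecks; a sketch of the functional form, with exponents approximately as reported there:

```latex
% Approximate power-law fits from Kaplan et al. (2020); exponents are approximate.
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
\qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
\qquad
L(C_{\min}) \approx \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C^{\min}}, \quad \alpha_C^{\min} \approx 0.050
```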
Why LLMs?
● Generalization
○ We can now use a single model to solve many NLP tasks

https://arxiv.org/pdf/1910.10683.pdf
Why LLMs?
● Emergent Abilities
○ An ability that is not present in smaller models but emerges in larger models

https://docs.google.com/presentation/d/1yzbmYB5E7G8lY2-KzhmArmPYwwl7o7CUST1xRZDUu1Y/edit?resourcekey=0-6_TnUMoKWCk_FN2BiPxmbw#slide=id.g1fc34b3ac18_0_27
Emergent Capability - In-Context Learning

https://arxiv.org/pdf/2005.14165.pdf
Emergent Capability - In-Context Learning

https://www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec04.pdf
Emergent Capability - In-Context Learning
Pretraining + Fine-tuning Paradigm vs. Pretraining + Prompting Paradigm
(moving down the list trades stronger task-specific performance for more convenience, more generality, and less data; a prompt sketch follows this list)
● Fine-tuning (FT)
○ + Strongest task-specific performance
○ - Needs a curated, labeled dataset for each new task (typically 1k-100k examples)
○ - Poor generalization, spurious feature exploitation
● Few-shot (FS)
○ + Much less task-specific data needed
○ + No spurious feature exploitation
○ - Challenging
● One-shot (1S)
○ + "Most natural," e.g. giving humans instructions
○ - Challenging
● Zero-shot (0S)
○ + Most convenient
○ - Challenging, can be ambiguous
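As a concrete illustration of the few-shot (FS) setting, the "training data" is just demonstrations placed in the prompt, with no gradient updates. The translation demonstrations below follow the GPT-3 paper's example; the string construction is only a sketch, not tied to any particular model or API.

```python
# Few-shot in-context learning: the task examples live entirely in the prompt.
demonstrations = [
    ("Translate English to French: sea otter", "loutre de mer"),
    ("Translate English to French: cheese", "fromage"),
]
query = "Translate English to French: peppermint"

prompt = "\n".join(f"{x} => {y}" for x, y in demonstrations) + f"\n{query} => "
print(prompt)
# The model is expected to continue with the correct translation, with no fine-tuning.
```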
Emergent Capability - Chain of Thoughts Prompting

https://arxiv.org/pdf/2201.11903.pdf
Emergent Capability - Chain of Thoughts Prompting

https://arxiv.org/pdf/2201.11903.pdf
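For example, a single chain-of-thought exemplar in the prompt (this arithmetic example appears in the CoT paper) nudges the model to emit intermediate reasoning before the final answer; the formatting below is a sketch, not a fixed template.

```python
# Few-shot chain-of-thought prompting: the in-context exemplar contains intermediate
# reasoning, so the model reasons before answering the new question.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
new_question = (
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\nA:"
)
prompt = cot_exemplar + new_question  # send to the LM; the expected final answer is 9
print(prompt)
```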
Emergent Capability - Zero Shot CoT Prompting

https://arxiv.org/pdf/2205.11916.pdf
Emergent Capability - Zero Shot CoT Prompting

https://arxiv.org/pdf/2205.11916.pdf
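A minimal sketch of the two-stage recipe from the zero-shot CoT paper: first append a reasoning trigger, then extract the answer. `generate` is a hypothetical placeholder for whatever completion API you use, not a real library call.

```python
# Zero-shot chain-of-thought prompting (two stages).
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM completion call here")

def zero_shot_cot(question: str) -> str:
    # Stage 1: elicit a rationale with the trigger phrase.
    reasoning = generate(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: ask for the final answer given the rationale.
    answer = generate(
        f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
        "Therefore, the answer is"
    )
    return answer.strip()
```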
Emergent Capability - Self-Consistency Prompting

https://arxiv.org/pdf/2203.11171.pdf
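Self-consistency can be sketched as sampling several reasoning paths and taking a majority vote over their final answers; `sample_answer_with_cot` below is a hypothetical stand-in for an LLM call plus answer parsing.

```python
# Self-consistency: sample diverse reasoning paths (temperature > 0), parse each
# final answer, and return the most common one.
from collections import Counter

def sample_answer_with_cot(question: str) -> str:
    raise NotImplementedError("sample a CoT completion and parse its final answer")

def self_consistency(question: str, num_samples: int = 10) -> str:
    answers = [sample_answer_with_cot(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote over answers
```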
Emergent Capability - Least-to-Most Prompting

https://arxiv.org/pdf/2205.10625.pdf
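Least-to-most prompting can be sketched as two phases, decomposition and sequential solving, where each solved subproblem is appended to the context; `generate` is again a hypothetical LLM call and the prompt templates are illustrative.

```python
# Least-to-most prompting: decompose the problem, then solve subproblems in order,
# letting later subproblems see earlier answers.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM completion call here")

def least_to_most(question: str) -> str:
    subproblems = generate(f"Decompose this problem into subproblems:\n{question}").splitlines()
    context, answer = question, ""
    for sub in subproblems:
        answer = generate(f"{context}\nQ: {sub}\nA:")
        context += f"\nQ: {sub}\nA: {answer}"
    return answer  # answer to the final (hardest) subproblem
```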
Emergent Capability - Augmented Prompting Abilities

Advanced Prompting Techniques and the human analogue each one mimics ("ask a human to ..."):

● Zero-shot CoT Prompting ↔ Explain the rationale
● Self-Consistency ↔ Double-check the answer
● Divide-and-Conquer ↔ Decompose into easy subproblems

Large Language Models demonstrate some human-like behaviors!


Training Architectures

● Encoder-decoder models (T5, BART)
● Decoder-only models (GPT-X, PaLM)

http://jalammar.github.io/illustrated-transformer/
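One way to see the architectural split is through the attention mask: decoder-only models restrict each position to attend only to earlier positions, while an encoder attends in both directions. A tiny NumPy sketch of the two mask patterns:

```python
# Attention masks: causal (decoder-only, GPT-style) vs. bidirectional (encoder, BERT-style).
import numpy as np

T = 5  # sequence length
causal_mask = np.tril(np.ones((T, T), dtype=bool))   # position t sees positions <= t
bidirectional_mask = np.ones((T, T), dtype=bool)     # every position sees every position
print(causal_mask.astype(int))
print(bidirectional_mask.astype(int))
```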
Training Objectives - UL2

https://arxiv.org/pdf/2205.05131.pdf
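UL2 mixes several denoising objectives; the simplest ingredient is T5-style span corruption, sketched below with a fixed (rather than randomly sampled) span and illustrative sentinel token names.

```python
# Span corruption: mask out a span in the input and train the model to reconstruct it.
tokens = "the quick brown fox jumps over the lazy dog".split()
span = (2, 4)  # corrupt tokens 2..3 ("brown fox"); a real sampler would randomize this

inputs = tokens[:span[0]] + ["<extra_id_0>"] + tokens[span[1]:]
targets = ["<extra_id_0>"] + tokens[span[0]:span[1]] + ["<extra_id_1>"]
print(" ".join(inputs))   # the quick <extra_id_0> jumps over the lazy dog
print(" ".join(targets))  # <extra_id_0> brown fox <extra_id_1>
```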
What kinds of things does pretraining learn?

https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf
Finetune - Instruction Finetune

https://arxiv.org/pdf/2210.11416.pdf
Finetune - Instruction Finetune

https://arxiv.org/pdf/2210.11416.pdf
Finetune - Instruction Finetune

https://arxiv.org/pdf/2210.11416.pdf
Finetune - Instruction Finetune

https://arxiv.org/pdf/2210.11416.pdf
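Instruction finetuning can be sketched as casting many tasks into (instruction, response) text pairs and training with the usual next-token loss, typically on the response tokens; the template and examples below are illustrative, not the actual FLAN format.

```python
# Instruction finetuning data preparation: turn tasks into instruction/response text.
examples = [
    {"instruction": "Translate to French: Where is the library?",
     "response": "Où est la bibliothèque ?"},
    {"instruction": "Is the sentiment positive or negative? 'I loved this movie.'",
     "response": "Positive"},
]

def to_training_text(ex):
    return f"Instruction: {ex['instruction']}\nResponse: {ex['response']}"

for ex in examples:
    print(to_training_text(ex), end="\n\n")
```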
Finetune - RLHF

https://arxiv.org/pdf/2203.14465.pdf
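A core ingredient of RLHF pipelines is a reward model trained on human preference pairs; a minimal sketch of the pairwise ranking loss (the scalar rewards below are placeholder numbers):

```python
# Pairwise reward-model loss: push r(preferred) above r(rejected).
import math

def reward_model_loss(r_preferred: float, r_rejected: float) -> float:
    # -log sigmoid(r_preferred - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

print(reward_model_loss(1.2, 0.3))  # small loss: the ranking is already correct
print(reward_model_loss(0.3, 1.2))  # larger loss: the ranking is wrong
# The trained reward model then scores policy samples, and the LM is updated with an
# RL algorithm (e.g. PPO) to maximize reward minus a KL penalty to the original model.
```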
Application - ChatGPT
Application - ChatGPT

https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1
Finetune - Bootstrapping

https://arxiv.org/pdf/2203.14465.pdf
Finetune - Bootstrapping

https://arxiv.org/pdf/2210.11610.pdf
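The bootstrapping idea (in the spirit of STaR-style self-training) can be sketched as: sample rationales from the current model, keep only those whose final answer is correct, and finetune on the kept examples. `generate_rationale_and_answer` and `finetune` are hypothetical stand-ins, not real library calls.

```python
# Bootstrapping / self-improvement loop: filter self-generated rationales by correctness.
def generate_rationale_and_answer(model, question):
    raise NotImplementedError("sample a rationale and final answer from the model")

def finetune(model, dataset):
    raise NotImplementedError("finetune on (question, rationale, answer) triples")

def bootstrap(model, labeled_questions, num_rounds=3):
    for _ in range(num_rounds):
        kept = []
        for question, gold_answer in labeled_questions:
            rationale, answer = generate_rationale_and_answer(model, question)
            if answer == gold_answer:          # keep only correct reasoning traces
                kept.append((question, rationale, answer))
        model = finetune(model, kept)          # train on self-generated rationales
    return model
```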
Large Language Models - Risks
● LLMs make mistakes (falsehoods, hallucinations)
● LLMs can be misused (misinformation, spam)
● LLMs can cause harms (toxicity, biases, stereotypes)
● LLMs can be attacked (adversarial examples, poisoning, prompt injection)
● LLMs can be useful as defenses (content moderation, explanations)


Resources for further reading
● https://web.stanford.edu/class/cs224n/
● https://stanford-cs324.github.io/winter2022/
● https://stanford-cs324.github.io/winter2023/
● https://www.cs.princeton.edu/courses/archive/fall22/cos597G/
● https://rycolab.io/classes/llm-s23/
● https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tracing-Emergent-Abilities-of-Language-Models-to-their-Sources-b9a57ac0fcf74f30a1ab9e3e36fa1dc1
● https://www.jasonwei.net/blog/emergence
Emergent Capability - Decomposed Prompting

https://arxiv.org/pdf/2210.02406.pdf
Training Techniques - Parallelism

https://openai.com/research/techniques-for-training-large-neural-networks
Training Techniques - Parallelism

https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
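Of the parallelism strategies discussed at the links above, data parallelism is the simplest to sketch: each "device" computes a gradient on its shard of the batch, and the gradients are averaged (the job of an all-reduce) before the update. A toy NumPy version with a linear model:

```python
# Data parallelism: per-shard gradients averaged before a single synchronized update.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)

def grad(w, X_shard, y_shard):
    return 2 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

shards = np.array_split(np.arange(len(y)), 2)          # two "devices"
grads = [grad(w, X[idx], y[idx]) for idx in shards]    # computed in parallel in practice
w -= 0.1 * np.mean(grads, axis=0)                      # averaged gradient step
print(w)
```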
