
Natural Language Processing Application
Week 3: Language Model
❏ Introduction to n-gram
❏ Estimating N-gram Probabilities
❏ N-gram model evaluation
❏ Smoothing techniques
INTRODUCTION TO N-GRAM
Introduction to n-gram
❏ Probabilistic Language Model: assigns a probability to a sentence
❏ Machine Translation:
❏ P(ngôn ngữ tự nhiên) > P(ngôn ngữ nhiên tự) (the correct Vietnamese word order for "natural language" outranks the scrambled order)
❏ Spelling error detection and correction:
❏ P(ngôn ngữ tự nhiên) > P(gôn ngữ tu nhiên) (the correctly spelled phrase outranks the misspelled one)
❏ Text summarization
❏ Question-Answering
❏ ...
Introduction to n-gram (cont)
❏ Probability of a sentence (or a sequence of words):
❏ P(W) = P(w1, w2, w3, ..., wn)
❏ Probability of the next word in a sentence (or a sequence of words):
❏ P(wn | w1, w2, ..., wn-1)
❏ A model that computes P(W) or P(wn | w1, w2, ..., wn-1) is called a Language Model

Introduction to n-gram (cont)
❏ The Chain Rule of Probability:
❏ Two variables: P(x1,x2) = P(x1)P(x2|x1)
❏ Three variables: P(x1,x2,x3) = P(x1)P(x2|x1)P(x3|x1x2)
❏ Four variables: P(x1,x2,x3,x4) = P(x1)P(x2|x1)P(x3|x1x2)P(x4|x1x2x3)
❏ …
❏ N variables: P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
Introduction to n-gram (cont)
❏ The Chain Rule of Probability:

❏ Example:
❏ P(ngôn ngữ tự nhiên) = P(ngôn) x P(ngữ|ngôn) x P(tự|ngôn ngữ)
x P(nhiên|ngôn ngữ tự)
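Below is a minimal Python sketch of this decomposition. The cond_prob function is a hypothetical stand-in for whatever conditional-probability estimate is available; it is not defined in the slides.

```python
def sentence_prob(words, cond_prob):
    """Chain rule: P(w1, ..., wn) = P(w1) * P(w2|w1) * ... * P(wn|w1, ..., wn-1).

    cond_prob(word, history) is a hypothetical function returning P(word | history).
    """
    p = 1.0
    for i, word in enumerate(words):
        p *= cond_prob(word, words[:i])   # multiply in P(wi | w1 ... wi-1)
    return p

# Example call mirroring the slide:
# sentence_prob(["ngôn", "ngữ", "tự", "nhiên"], cond_prob)
```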
Introduction to n-gram (cont)
❏ Estimating the probabilities by counting:
❏ P(ngôn) = count(ngôn)/N, where N is the total number of words in the corpus
❏ P(ngữ|ngôn) = count(ngôn ngữ)/count(ngôn)
❏ P(tự|ngôn ngữ) = count(ngôn ngữ tự)/count(ngôn ngữ)
❏ P(nhiên|ngôn ngữ tự) = count(ngôn ngữ tự nhiên)/count(ngôn ngữ tự)
❏ Comment:
❏ There are far too many possible word histories
❏ There is never enough data to estimate all of them reliably
Introduction to n-gram (cont)
❏ Markov Assumption:
❏ P(nhiên|ngôn ngữ tự) ≈ P(nhiên|tự) (condition only on the previous word)
Or
❏ P(nhiên|ngôn ngữ tự) ≈ P(nhiên|ngữ tự) (condition on the two previous words)
Introduction to n-gram (cont)
❏ Markov Assumption (general form): P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-k, ..., wi-1), i.e. condition only on the previous k words
Introduction to n-gram (cont)
❏ Unigram model (1-gram): P(w1, w2, ..., wn) ≈ P(w1) P(w2) ... P(wn)
❏ Automatically generated sentences from a unigram model:

ở trận, fabregas, cesc bàn dusan và cầu kiến emmanuel thứ reyes, utd trong sau tạo anh ngoại anh thủ pogba man một jose xuất ở này.
là tadic adebayor, thủ harry dennis santi cazorola, nhóm thành bergkamp, bốn tiên bảy hạng kane. đầu cầu hiện antonio
Introduction to n-gram (cont)
❏ Bigram model (2-gram): P(wi | w1, w2, ..., wi-1) ≈ P(wi | wi-1)
❏ Automatically generated sentences from a bigram model:

anh thành cầu ở ngoại hạng anh tạo bàn trong một trận, sau dennis
bergkamp, và harry kane.
pogba là man utd đầu xuất hiện ở nhóm này.
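For illustration, here is a minimal sketch of how such sentences can be sampled from a bigram model. The tiny corpus, the <s>/</s> boundary markers, and all counts below are invented for the example; they are not taken from the slides.

```python
import random
from collections import defaultdict

# A made-up miniature corpus with sentence-boundary markers.
corpus = [
    ["<s>", "pogba", "là", "cầu", "thủ", "man", "utd", "</s>"],
    ["<s>", "harry", "kane", "là", "cầu", "thủ", "anh", "</s>"],
]

# counts[w1][w2] = how often w2 follows w1
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        counts[w1][w2] += 1

def generate(max_len=20):
    """Sample one word at a time, each conditioned only on the previous word."""
    word, sentence = "<s>", []
    for _ in range(max_len):
        nexts, freqs = zip(*counts[word].items())
        word = random.choices(nexts, weights=freqs)[0]
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)

print(generate())
```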
Introduction to n-gram (cont)
❏ Extension: trigram, 4-gram, 5-gram, ...
❏ Comment:
❏ Language has long-distance dependencies that a short n-gram window cannot capture
❏ Example: "Chiếc máy tính mà tôi vừa đưa vào phòng máy trên tầng năm đã bị hỏng." ("The computer that I just moved into the machine room on the fifth floor is broken."; the subject and its verb phrase are far apart)
❏ However, n-gram models work well enough in most cases
ESTIMATING N-GRAM PROBABILITIES
Estimating n-gram probabilities
❏ Maximum Likelihood Estimation (MLE):
❏ For a bigram: P(wi | wi-1) = count(wi-1 wi) / count(wi-1)
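A minimal Python sketch of this estimate; the toy corpus is hypothetical, and in practice the counts come from a large training set.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
tokens = "ngôn ngữ tự nhiên và ngôn ngữ lập trình".split()

unigram = Counter(tokens)                  # count(w)
bigram = Counter(zip(tokens, tokens[1:]))  # count(w_prev w)

def p_mle(w, prev):
    """Maximum likelihood estimate: P(w | prev) = count(prev w) / count(prev)."""
    return bigram[(prev, w)] / unigram[prev]

print(p_mle("ngữ", "ngôn"))   # 2/2 = 1.0
print(p_mle("tự", "ngữ"))     # 1/2 = 0.5
```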
Estimating n-gram probabilities (cont)
❏ Example: bigram counts over a corpus of 9,222 sentences
❏ Normalize by the unigram counts to obtain bigram probabilities
Estimating n-gram probabilities (cont)
❏ What the estimated probabilities tell us:
❏ P(english|want) = .0011
❏ P(chinese|want) = .0065
❏ P(to|want) = .66
❏ P(eat|to) = .28
❏ P(food|to) = 0
❏ P(want|spend) = 0
❏ P(i|<s>) = .25
Estimating n-gram probabilities (cont)
❏ Problems with multiplying many small probabilities:
❏ Underflow
❏ Slow
❏ Transform the multiplication into an addition of log-probabilities:
❏ log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
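A minimal sketch of the log-space trick; the probability values are placeholders for one sentence's conditional probabilities, not numbers from the slides.

```python
import math

# Placeholder per-word conditional probabilities of some sentence.
probs = [0.25, 0.66, 0.28, 0.0011]

# The direct product can underflow for long sentences; the sum of logs cannot.
product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)

print(product)            # product of probabilities
print(math.exp(log_sum))  # the same value, recovered from the sum of logs
```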
Estimating n-gram probabilities (cont)
❏ Language Modeling Toolkits:
❏ SRILM
❏ IRSTLM
❏ KenLM
❏ ...
MODEL EVALUATION
Model Evaluation
❏ A language model should separate "good" sentences from "not good" ones:
❏ Assign a higher probability to "real" or "frequently seen" sentences than to "ungrammatical" or "rarely seen" ones
❏ The model's parameters are trained on a training set
❏ The model's performance is tested on unseen data
❏ A test set is an unseen dataset, separate from the training set
❏ An evaluation metric shows how well the model does on the test set
Model Evaluation (cont)
❏ Extrinsic Evaluation: to compare models A and B
❏ Give each model a task:
❏ Spelling correction, Machine Translation, ...
❏ Run the task and measure an accuracy for A and for B:
❏ How many misspelled words are corrected properly
❏ How many words are translated correctly
❏ Compare the accuracy of A and B
Model Evaluation (cont)
❏ Extrinsic Evaluation:
❏ Time consuming (running the task can take days or even weeks)
❏ Therefore an intrinsic evaluation, perplexity, is sometimes used instead
❏ But perplexity is a bad approximation of task performance:
❏ Unless the test data looks like the training data
❏ So it is mainly useful in pilot experiments
Model Evaluation (cont)
❏ Perplexity:
❏ How well can we predict the next word?
❏ A unigram model is not good at this task, since it ignores the preceding words
❏ A good model is one that assigns a higher probability to the word that actually occurs
Model Evaluation (cont)
❏ Perplexity:
❏ The best language model is the one that best predicts an unseen test set
❏ Perplexity is the inverse probability of the test set, normalized by the number of words:
❏ PP(W) = P(w1, w2, ..., wN)^(-1/N)
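A minimal Python sketch of this computation, done via the average log probability to avoid underflow; the per-word probabilities are placeholders.

```python
import math

def perplexity(word_probs):
    """PP(W) = P(w1 ... wN) ** (-1/N), computed from the average log probability."""
    n = len(word_probs)
    avg_log_p = sum(math.log(p) for p in word_probs) / n
    return math.exp(-avg_log_p)

# Placeholder conditional probabilities for a 4-word test sequence.
print(perplexity([0.25, 0.66, 0.28, 0.0011]))
```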
Model Evaluation (cont)
❏ Perplexity:
❏ Recognizing the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 (each equally likely):
❏ Perplexity = 10
❏ Recognizing one of 30,000 names (each equally likely):
❏ Perplexity = 30,000
❏ A system that recognizes one of:
❏ Management (probability 1/4)
❏ Business (probability 1/4)
❏ Assistance (probability 1/4)
❏ One of 30,000 names (the remaining 1/4, combined)
❏ Perplexity ≈ 53
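The value 53 can be reproduced with a quick calculation, under the assumption that the remaining 1/4 of the probability mass is split evenly over the 30,000 names (1/120,000 each) and that the four branches occur equally often in the test data.

```python
# Assumption: three words at probability 1/4 each, any single name at 1/120000,
# all four branches equally frequent in the test data.
pp = ((1 / 4) ** (3 / 4) * (1 / 120_000) ** (1 / 4)) ** -1
print(round(pp))   # -> 53
```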
Model Evaluation (cont)
❏ Perplexity:
❏ Lower perplexity = better model
❏ Example (WSJ): models trained on a 3M-word training set and evaluated on a 1.5M-word test set; higher-order n-grams (bigram, trigram) reach lower perplexity than unigrams
SMOOTHING TECHNIQUES
Smoothing Techniques
❏ Shakespeare corpus:
❏ N = 884,647 tokens, V = 29,066 word types
❏ About 300K distinct bigrams actually occur, out of V² ≈ 844M possible bigrams
=> 99.96% of the possible bigrams were never seen (and would get probability 0)
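A quick check of the 99.96% figure, using the counts quoted above (the 300K count of observed bigrams is approximate).

```python
V = 29_066              # vocabulary size (word types)
seen_bigrams = 300_000  # distinct bigrams observed in the corpus (approximate)
possible = V ** 2       # about 844 million possible bigrams
unseen_share = 1 - seen_bigrams / possible
print(f"{100 * unseen_share:.2f}% of possible bigrams never occur")   # -> 99.96%
```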
Smoothing Techniques (cont)
❏ The problem:
Smoothing Techniques (cont)
❏ The problem:
❏ Bigrams that never occurred in training get probability zero
❏ They assign probability zero to the whole test set
❏ So we cannot compute perplexity (division by zero)
Smoothing Techniques (cont)
❏ Add-one (Laplace) Smoothing:
❏ Statistics estimated from counts are sparse: many events never occur in training
❏ Move some probability mass to unseen events so the model generalizes better
Smoothing Techniques (cont)
❏ Add-one (Laplace) Smoothing:
❏ Pretend we saw each word one more time than we did
❏ Add one to all the counts: P_add1(wi | wi-1) = (count(wi-1 wi) + 1) / (count(wi-1) + V), where V is the vocabulary size
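A minimal sketch of the smoothed estimate; the toy corpus is hypothetical, and V is taken to be the number of observed word types.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
tokens = "ngôn ngữ tự nhiên và ngôn ngữ lập trình".split()
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
V = len(unigram)   # vocabulary size

def p_add_one(w, prev):
    """Add-one (Laplace): add 1 to every bigram count and V to the denominator."""
    return (bigram[(prev, w)] + 1) / (unigram[prev] + V)

print(p_add_one("ngữ", "ngôn"))   # seen bigram: (2 + 1) / (2 + 7)
print(p_add_one("máy", "ngôn"))   # unseen bigram: (0 + 1) / (2 + 7), no longer zero
```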
Smoothing Techniques (cont)
❏ Other techniques (a sketch of simple interpolation follows below):
❏ Recursive interpolation
❏ Backoff
❏ Good-Turing
❏ Kneser-Ney
❏ ...
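As a taste of these methods, here is a minimal sketch of simple linear interpolation; the λ weights and the probability functions passed in are placeholders, not values or code from the slides.

```python
# Placeholder interpolation weights; in practice they are tuned on held-out data
# and must sum to 1.
LAMBDAS = (0.6, 0.3, 0.1)

def p_interpolated(w, prev2, prev1, p_tri, p_bi, p_uni):
    """Linear interpolation of trigram, bigram, and unigram estimates:
    P(w | prev2 prev1) ≈ λ3*P(w | prev2 prev1) + λ2*P(w | prev1) + λ1*P(w).

    p_tri, p_bi, p_uni are hypothetical probability functions (e.g. MLE estimates).
    """
    l3, l2, l1 = LAMBDAS
    return l3 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l1 * p_uni(w)
```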
