CS 904: Natural Language Processing Statistical Inference: N-Grams
Statistical Inference
Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and making inferences about that distribution. We will look at the task of language modeling, where we predict the next word given the previous words. Language modeling is a well-studied statistical inference problem.
Applications of LM
Speech recognition, optical character recognition, spelling correction, handwriting recognition, statistical machine translation, and the Shannon Game.
Predict the next word given n-1 previous words. Past behaviour is a good guide to what will happen in the future as there is regularity in language. Determine the probability of different sequences from a training corpus.
Dividing the training data into equivalence classes. Finding a good statistical estimator for each equivalence class. Combining multiple estimators.
We try to predict the target feature on the basis of various classificatory features for the equivalence classes. More bins give greater discrimination. Too much discrimination may not leave sufficient training data in many bins. Thus, a statistically reliable estimate cannot be obtained.
Only the previous n-1 words affect the next word: an (n-1)th order Markov Model, or n-gram model. We construct a model where all histories that have the same last n-1 words are placed in the same equivalence class (bin).
Sue swallowed the large green ______ . Pill, frog, car, mountain, tree?
Knowing that Sue swallowed helps narrow down possibilities. How far back do we look?
For a vocabulary of 20,000 words: number of bigrams = 400 million, number of trigrams = 8 trillion, number of four-grams = 1.6 x 10^17! But other methods of forming equivalence classes are more complicated.
Data preparation: decide on a training corpus; remove punctuation; decide whether to keep or discard sentence breaks. Create equivalence classes and get counts on the training data falling into each class. Find statistical estimators for each class.
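As a concrete illustration of the counting step, the following minimal sketch (the toy corpus and whitespace tokenization are assumptions, not the course's actual data preparation) collects the unigram and bigram counts that the estimators below are built on:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the log ."  # hypothetical training text
tokens = corpus.split()                      # crude whitespace tokenization for illustration

unigrams = Counter(tokens)                   # counts for single-word histories
bigrams = Counter(zip(tokens, tokens[1:]))   # each bin = a pair (previous word, word)

print(bigrams[("sat", "on")])                # 2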
Statistical Estimators
Derive a good probability estimate for the target feature based on training data. Poor estimates of context are worse than none (Gale and Church, 1990). From n-gram estimates P(w1,..,wn), predict P(wn|w1,..,wn-1).
Maximum Likelihood Estimation (MLE), Laplace's, Lidstone's and Jeffreys-Perks' Laws, Held Out Estimation, Cross Validation (deleted estimation), Good-Turing Estimation.
Maximum Likelihood Estimation (MLE): the choice of parameter values which gives the highest probability to the training corpus.
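A minimal sketch of the MLE estimate for a bigram model, PMLE(w|v) = C(v,w)/C(v); the toy counts below are hypothetical:

# MLE relative-frequency estimate for a bigram model (toy, hypothetical counts).
bigram_counts = {("large", "green"): 3, ("green", "pill"): 1, ("green", "tree"): 2}
unigram_counts = {"large": 3, "green": 3, "pill": 1, "tree": 2}

def p_mle(prev, word):
    # P(word | prev) = C(prev, word) / C(prev); zero if the bigram was never seen
    return bigram_counts.get((prev, word), 0) / unigram_counts.get(prev, 1)

print(p_mle("green", "pill"))   # 1/3
print(p_mle("green", "frog"))   # 0.0 -- unseen events get zero probability (see below)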
MLE: Problems
Problem of sparseness of data: the vast majority of words are very uncommon (Zipf's Law). Some bins may remain empty or contain too little data. MLE assigns 0 probability to unseen events. We need to allow for the possibility of seeing events not seen in training.
Discounting or Smoothing
Decrease the probability of previously seen events to leave a little bit of probability mass for previously unseen events.
Laplace's Law: PLAP(w1,..,wn) = (C(w1,..,wn)+1)/(N+B), where C(w1,..,wn) is the frequency of the n-gram w1,..,wn, N is the total number of training instances, and B is the number of bins the training instances are divided into. This gives a little bit of the probability space to unseen events, and is the Bayesian estimator assuming a uniform prior on events.
For sparse sets of data over large vocabularies, it assigns too much of the probability space to unseen events.
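A sketch of Laplace's law for a bigram model; the vocabulary size and number of training bigrams below are assumed values, chosen only to make the previous point concrete:

def p_laplace(count, N, B):
    # Laplace's law: (C + 1) / (N + B)
    return (count + 1) / (N + B)

V = 20_000            # assumed vocabulary size
B = V ** 2            # number of bigram bins = 400 million
N = 1_000_000         # assumed number of training bigram instances
print(p_laplace(0, N, B))            # each unseen bigram gets a small probability...
print((B - N) * p_laplace(0, N, B))  # ...but at least this much total mass (~0.995) sits on unseen bigrams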
Lidstone's Law
PLID(w1,..,wn) = (C(w1,..,wn)+λ)/(N+λB), where C(w1,..,wn) is the frequency of the n-gram w1,..,wn, B is the number of bins the training instances are divided into, and λ > 0.
If λ = 1/2, Lidstone's Law corresponds to taking the expectation of the likelihood and is called Expected Likelihood Estimation (ELE) or the Jeffreys-Perks Law.
We need a good way to choose λ in advance. It predicts all unseen events to be equally likely. Discounting using Lidstone's law always gives probability estimates linear in the MLE frequency, and this is not a good match to the empirical distribution at low frequencies.
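A sketch of Lidstone's law with the same assumed corpus and vocabulary sizes as above; note that the estimate is linear in the raw count, which is the mismatch the slide points out:

def p_lidstone(count, N, B, lam=0.5):
    # Lidstone's law: (C + lambda) / (N + lambda * B); lam = 0.5 is ELE / Jeffreys-Perks
    return (count + lam) / (N + lam * B)

V, N = 20_000, 1_000_000          # assumed vocabulary and corpus sizes
B = V ** 2
print(p_lidstone(0, N, B))        # unseen bigram
print(p_lidstone(10, N, B))       # estimate grows linearly with the MLE count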
Validation
How do we know how much of the probability space to hold out for unseen events (i.e., how do we choose λ)? Validation: take further text (from the same source) and see how often the bigrams that appeared r times in the training text turn up in the further text.
Divide the training data into two parts. Build initial estimates by doing counts on one part, and then use the other pool of held out data to refine those estimates.
Let C1(w1,..,wn) and C2(w1,..,wn) be the frequencies of the n-gram w1,..,wn in the training and held out data, respectively. Let r be the frequency of an n-gram and let Nr be the number of bins that have r training instances in them. Let Tr be the total count in the held out data of all n-grams of training frequency r; their average frequency is then Tr/Nr. An estimate for the probability of one of these n-grams is Pho(w1,..,wn) = Tr/(Nr T), where T is the number of n-gram instances in the held out data.
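A sketch of held out estimation over toy bigram counts (both Counters are hypothetical); it pools the held out counts of all n-grams that share the same training frequency r:

from collections import Counter, defaultdict

train = Counter({("a", "b"): 2, ("b", "c"): 2, ("c", "d"): 1})                # C1: training counts
held = Counter({("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 2, ("d", "e"): 1})  # C2: held out counts

T = sum(held.values())            # total n-gram instances in the held out data
Nr = Counter(train.values())      # Nr[r] = number of bins with r training instances
Tr = defaultdict(int)
for ngram, r in train.items():
    Tr[r] += held[ngram]          # Tr[r] = total held out count of n-grams with training frequency r

def p_held_out(ngram):
    r = train[ngram]              # only defined here for n-grams seen in training (r > 0)
    return Tr[r] / (Nr[r] * T)    # Pho = Tr / (Nr * T)

print(p_held_out(("a", "b")))     # the same estimate is shared by every n-gram seen twice in training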
Training data (80% of total data), held out data (10% of total data), test data (5-10% of total data). Write an algorithm, train it, test it, note what it does wrong, revise it, and repeat many times. Keep separate development test data and final test data, since the development test data is seen by the system during repeated testing. Give final results by testing on n smaller samples of the test data and averaging.
If the total amount of data is small, then use each part of the data for both training and validation: cross-validation. Divide the data into parts 0 and 1. In one model use 0 as the training data and 1 as the held out data; in the other use 1 as training and 0 as held out data. Do a weighted average of the two: Pdel(w1,..,wn) = (Tr^01 + Tr^10)/(T(Nr^0 + Nr^1)), where C(w1,..,wn) = r.
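A sketch of deleted estimation on two toy, equal-sized parts (all counts are hypothetical); the slide's T is taken here to be the number of n-gram instances per part:

from collections import Counter, defaultdict

def tr_nr(train_counts, heldout_counts):
    # Return (Tr, Nr): held out totals and bin counts, both indexed by training frequency r.
    Nr = Counter(train_counts.values())
    Tr = defaultdict(int)
    for ngram, r in train_counts.items():
        Tr[r] += heldout_counts[ngram]
    return Tr, Nr

c0 = Counter({("a", "b"): 2, ("c", "d"): 1, ("d", "e"): 1})   # counts from part 0
c1 = Counter({("a", "b"): 1, ("c", "d"): 2, ("d", "e"): 1})   # counts from part 1

Tr01, Nr0 = tr_nr(c0, c1)     # part 0 as training, part 1 as held out
Tr10, Nr1 = tr_nr(c1, c0)     # part 1 as training, part 0 as held out
T = sum(c0.values())          # n-gram instances per part (parts assumed equal-sized)

def p_del(r):
    # Pdel for an n-gram of training frequency r
    return (Tr01[r] + Tr10[r]) / (T * (Nr0[r] + Nr1[r]))

print(p_del(1))               # shared estimate for n-grams occurring once: 6/16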
Deleted estimation overestimates the expected frequency of unseen objects, while underestimating the expected frequency of objects that were seen once in the training data. Leaving-one-out (Ney et al., 1997): the data is divided into K sets and the hold out method is repeated K times.
Good-Turing Estimation
Determines probability estimates of items based on the assumption that their distribution is binomial. Works well in practice even though the distribution is not binomial. PGT = r*/N, where r* can be thought of as an adjusted frequency given by r* = (r+1) E(N_{r+1})/E(N_r), and N_r is the number of n-grams with frequency r in the training data.
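A sketch of Good-Turing over toy counts, using raw Nr values in place of the expectations E(Nr) (real implementations smooth the Nr first, e.g. Simple Good-Turing; that smoothing and the toy counts are assumptions, not part of the slide):

from collections import Counter

train = Counter({("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 1, ("d", "e"): 2})  # toy n-gram counts
N = sum(train.values())
Nr = Counter(train.values())     # Nr[r] = number of n-gram types with frequency r

def r_star(r):
    # Adjusted frequency r* = (r + 1) * Nr[r + 1] / Nr[r] (raw counts standing in for expectations)
    return (r + 1) * Nr[r + 1] / Nr[r] if Nr[r] else 0.0

def p_gt(ngram):
    return r_star(train[ngram]) / N   # PGT = r* / N

print(r_star(1))                 # n-grams seen once are discounted: 2 * Nr[2] / Nr[1] = 1.0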
Discounting Methods
Obtain held out probabilities. Absolute discounting: decrease all non-zero MLE frequencies by a small constant amount and assign the probability mass so gained uniformly over unseen events. Linear discounting: the non-zero MLE frequencies are scaled by a constant slightly less than one, and the remaining probability mass is distributed among unseen events.
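A sketch of the two schemes (the discount d, the scaling constant alpha, and the toy inputs are assumed values, not prescribed by the slide):

def discount_absolute(counts, N, B, d=0.5):
    # Absolute discounting: subtract d from every non-zero count, spread the gained mass uniformly.
    seen = {g: (c - d) / N for g, c in counts.items()}
    p_unseen = d * len(counts) / (N * (B - len(counts)))   # probability per unseen bin
    return seen, p_unseen

def discount_linear(counts, N, B, alpha=0.9):
    # Linear discounting: scale non-zero MLE frequencies by alpha < 1, give 1 - alpha to unseen bins.
    seen = {g: alpha * c / N for g, c in counts.items()}
    p_unseen = (1 - alpha) / (B - len(counts))             # probability per unseen bin
    return seen, p_unseen

counts = {("a", "b"): 3, ("b", "c"): 1}   # toy bigram counts
print(discount_absolute(counts, N=4, B=10))
print(discount_linear(counts, N=4, B=10))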
Combining Estimators
Solve the sparseness in a trigram model by mixing it with bigram and unigram models. Combine linearly: termed linear interpolation, finite mixture models, or deleted interpolation. Pli(wn|wn-2,wn-1) = λ1 P1(wn) + λ2 P2(wn|wn-1) + λ3 P3(wn|wn-1,wn-2), where 0 ≤ λi ≤ 1 and Σi λi = 1. The weights can be picked automatically using Expectation Maximization.
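A sketch of the interpolation step with fixed, hypothetical weights (in practice the λs would be tuned by EM on held out data, which is not shown here):

def p_interpolated(w, h1, h2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # h1 = previous word, h2 = word before that; p_uni/p_bi/p_tri are the component models.
    l1, l2, l3 = lambdas                      # 0 <= li <= 1 and l1 + l2 + l3 = 1
    return l1 * p_uni(w) + l2 * p_bi(w, h1) + l3 * p_tri(w, h1, h2)

# Toy usage with constant component models, just to show the mixing:
print(p_interpolated("pill", "green", "large",
                     p_uni=lambda w: 0.001,
                     p_bi=lambda w, h1: 0.05,
                     p_tri=lambda w, h1, h2: 0.0))   # 0.1*0.001 + 0.3*0.05 + 0.6*0.0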
Katz's Backing-Off
Different models are consulted depending on their specificity. Use the n-gram probability when the n-gram has appeared more than k times (k is usually 0 or 1). If not, back off to the (n-1)-gram probability, and repeat as necessary.
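A simplified back-off sketch (it omits the discounting and the back-off weights that Katz's full method uses to keep the distribution normalized; the count dictionaries and k are hypothetical):

def p_backoff(w, h1, h2, tri_counts, bi_counts, uni_counts, N, k=0):
    # h1 = previous word, h2 = word before that; N = total number of tokens.
    if tri_counts.get((h2, h1, w), 0) > k:
        return tri_counts[(h2, h1, w)] / bi_counts[(h2, h1)]   # trigram relative frequency
    if bi_counts.get((h1, w), 0) > k:
        return bi_counts[(h1, w)] / uni_counts[h1]             # back off to the bigram
    return uni_counts.get(w, 0) / N                            # back off to the unigram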
Make the weights a function of the history h: Pli(w|h) = Σ_{i=1..k} λi(h) Pi(w|h), where 0 ≤ λi(h) ≤ 1 and Σi λi(h) = 1. Training a distinct set of weights for each history is not feasible, so we need to form equivalence classes for the λs.