
MODULE – 3

Naive Bayes, Text Classification and Sentiment


Naive Bayes, Text Classification and Sentiment: Naive Bayes Classifiers, Training the Naive
Bayes Classifier, Worked Example, Optimizing for Sentiment Analysis, Naive Bayes for Other
Text Classification Tasks, Naive Bayes as a Language Model.

Textbook 2: Ch. 4.

Introduction

• Classification lies at the heart of both human and machine intelligence: assigning a category to
an input.
• Examples include deciding what letter, word, or image has been presented to our senses,
recognizing faces or voices, sorting mail, and assigning grades to homework.
• Naïve Bayes algorithm for text categorization: the task of assigning a label or category
to an entire text or document.
• Common text categorization tasks:
1. Sentiment analysis, the extraction of sentiment, the positive or negative orientation that
a writer expresses toward some object.
o A review of a movie, book, or product on the web.

Example: + ... any characters and richly applied satire, and some great plot twists

- It was pathetic. The worst part about it was the boxing scenes ...
+ ... awesome caramel sauce and sweet toasty almonds. I love this place!
- ... awful pizza and ridiculously overpriced ...

Words like great, richly, and awesome, versus pathetic, awful, and ridiculously, are very informative
cues for deciding the sentiment of a review.

2. Spam detection:
o Binary classification task of assigning an email to one of the two classes spam or
not-spam.
o Many lexical and other features can be used to perform classification.
Example: An email containing phrases like “online pharmaceutical” or “WITHOUT ANY COST” or
“Dear Winner” is suspicious.

3. Assigning a library subject category or topic label to a text: Various sets of subject
categories exist. Deciding whether a research paper concerns epidemiology, embryology,
etc., is an important component of information retrieval.

Supervised Learning:

• The most common way of doing text classification in language processing is supervised
learning.
• In supervised learning, we have a data set of input observations, each associated with
some correct output (a ‘supervision signal’).
• The goal of the algorithm is to learn how to map from a new observation to a correct
output.
• We have a training set of N documents that have each been hand labeled with a class:
{(d1, c1), ..., (dN, cN)}. Our goal is to learn a classifier that is capable of mapping from a new
document d to its correct class c ∈ C, where C is some set of useful document classes.

3.1 Naive Bayes Classifiers

The intuition of the classifier is shown in Fig. 1. We represent a text document as if it were a bag
of words, that is, an unordered set of words with their position ignored, keeping only their
frequency in the document.

Instead of representing the word order in all the phrases like “I love this movie” and “I would
recommend it”, we simply note that the word I occurred 5 times in the entire excerpt, the word
it 6 times, the words love, recommend, and movie once, and so on.

• Naive Bayes is a probabilistic classifier.

• For a document d, out of all classes c ∈ C, the classifier returns the class ĉ which has the
maximum posterior probability given the document.

ĉ = argmax_{c ∈ C} P(c|d)        (1)

Use Bayes’ rule to break down any conditional probability P(x|y) into three other probabilities:

P(x|y) = P(y|x) P(x) / P(y)        (2)

We can then substitute Eq.2 into Eq.1 to get Eq.3

ĉ = argmax_{c ∈ C} P(d|c) P(c) / P(d)        (3)

Since P(d) doesn't change for each class, we can conveniently simplify Eq. 3 by dropping the
denominator.

ĉ = argmax_{c ∈ C} P(d|c) P(c)        (4)

We call naive Bayes a generative model: Eq. 4 can be read as saying that a class is first sampled
from P(c), and then the words are generated by sampling from P(d|c), thereby generating a document.

Eq. 4 states that we compute the most probable class ĉ given some document d by choosing the
class which has the highest product of two probabilities: the prior probability of the class P(c)
and the likelihood of the document P(d|c):

ĉ = argmax_{c ∈ C} P(d|c) P(c),   where P(d|c) is the likelihood and P(c) the prior        (5)

We can represent a document d as a set of features f1, f2, ..., fn:

ĉ = argmax_{c ∈ C} P(f1, f2, ..., fn | c) P(c)        (6)

Eq. 6 is still too hard to compute directly: without some simplifying assumptions, estimating the
probability of every possible combination of features (for example, every possible set of words
and positions) would require huge numbers of parameters and impossibly large training sets.

Naive Bayes classifiers therefore make two simplifying assumptions.

The first is the bag-of-words assumption, that the features f1, f2, ... ,fn only encode word identity
and not position.

The second is commonly called the naive Bayes assumption, the conditional independence
assumption that the probabilities P(fi|c) are independent given the class c.

Therefore, P(f1, f2, ..., fn | c) = P(f1|c) · P(f2|c) · ... · P(fn|c)        (7)

The final equation for the class chosen by a naive Bayes classifier is:

c_NB = argmax_{c ∈ C} P(c) Π_{f ∈ F} P(f|c)        (8)

To apply the naive Bayes classifier to text, we will use each word in the documents as a feature,
as suggested above, and we consider each of the words in the document by walking an index
through every word position in the document:

c_NB = argmax_{c ∈ C} P(c) Π_{i ∈ positions} P(wi|c)        (9)

Naive Bayes calculations, like calculations for language modelling, are done in log space, to
avoid underflow and increase speed. Thus Eq. 9 is generally instead expressed as,

c_NB = argmax_{c ∈ C} [ log P(c) + Σ_{i ∈ positions} log P(wi|c) ]        (10)

Eq. 10 computes the predicted class as a linear function of the input features. Classifiers that use
a linear combination of the inputs to make a classification decision, like naive Bayes and logistic
regression, are called linear classifiers.
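
To make Eq. 10 concrete, here is a minimal Python sketch of scoring a document in log space. The
table names (logprior, loglikelihood, vocab) are assumptions for illustration; they match the
training sketch given in Section 3.2 below.

import math

def predict(doc_words, classes, logprior, loglikelihood, vocab):
    """Return the class maximizing log P(c) + sum_i log P(wi|c)  (Eq. 10).

    logprior[c] holds log P(c); loglikelihood[(w, c)] holds log P(w|c).
    Words not in the vocabulary are skipped (see Section 3.2)."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = logprior[c]
        for w in doc_words:
            if w in vocab:                 # unknown test words are ignored
                score += loglikelihood[(w, c)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class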

3.2 Training the Naive Bayes Classifier

How can we learn the probabilities P(c) and P(fi|c)?

To learn the class prior P(c): compute what percentage of the documents in our training set are in
each class c.

Let Nc be the number of documents in our training data with class c, and Ndoc be the total number
of documents. Then:

P̂(c) = Nc / Ndoc        (11)

To learn the probability P(fi|c):

We'll assume a feature is just the existence of a word in the document's bag of words, and so we'll
want P(wi|c), which we compute as the fraction of times the word wi appears among all words in all
documents of topic c.

Concatenate all documents with category c into one big "category c" text. Then we use the
frequency of wi in this concatenated document to give a maximum likelihood estimate of the
probability:

P̂(wi|c) = count(wi, c) / Σ_{w ∈ V} count(w, c)        (12)

Here the vocabulary V consists of the union of all the word types in all classes, not just the
words in one class c.

Issues with training:

1. Zero Probability problem with maximum likelihood training:

Imagine we are trying to estimate the likelihood of the word "fantastic" given class positive, but
suppose there are no training documents that both contain the word "fantastic" and are classified
as positive. Perhaps the word "fantastic" happens to occur (sarcastically?) in the class negative.
In such a case the probability for this feature will be zero:

P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0        (13)

Since naive Bayes naively multiplies all the feature likelihoods together, zero probabilities in
the likelihood term for any class will cause the probability of the class to be zero, no matter the
other evidence!

To solve this, we use Laplace smoothing (add-one smoothing). Instead of the maximum likelihood
estimate of Eq. 12, we use:

P̂(wi|c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1) = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)        (14)

Now "fantastic" will still get a very small probability in the positive class, but not zero.

2. Unknown words, which occur in our test data but are not in our vocabulary:
• The simplest solution is to ignore them: remove them from the test document and do not
include any probability for them at all.
• Some systems also choose to completely ignore another class of words, stop words: very
frequent words like "the" and "a". Stop words can be defined as the top 10-100 vocabulary
entries sorted by frequency, or taken from one of the many predefined stop word lists
available online; each instance of these stop words is then removed from both training and
test documents.
• However, using a stop word list doesn't usually improve performance, so it is more common
to make use of the entire vocabulary.

Fig. The naive Bayes algorithm, using add-1 smoothing. To use add-α smoothing instead, change
the +1 to +α when computing the smoothed log likelihoods during training.
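
Since the algorithm figure itself did not survive extraction, here is a minimal Python sketch of the
training step it describes: priors from Eq. 11 and add-1 smoothed log likelihoods from Eq. 14. The
function name and data layout are assumptions for illustration.

import math
from collections import Counter

def train_naive_bayes(documents, labels):
    """documents: list of token lists; labels: parallel list of class labels.
    Returns log priors (Eq. 11), add-1 smoothed log likelihoods (Eq. 14), and the vocabulary."""
    classes = set(labels)
    vocab = {w for doc in documents for w in doc}
    n_doc = len(documents)
    logprior, loglikelihood = {}, {}
    for c in classes:
        docs_c = [doc for doc, y in zip(documents, labels) if y == c]
        logprior[c] = math.log(len(docs_c) / n_doc)               # Eq. 11
        counts = Counter(w for doc in docs_c for w in doc)        # the big "class c" document
        total = sum(counts.values())
        for w in vocab:
            loglikelihood[(w, c)] = math.log((counts[w] + 1) / (total + len(vocab)))  # Eq. 14
    return logprior, loglikelihood, vocab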

3.3 Worked example:

Let’s use a sentiment analysis domain with the two classes positive (+) and negative (-), and take
the following miniature training and test documents simplified from actual movie reviews.

Step 1: The prior P(c) for the two classes is computed per Eq. 11:

P(-) = 3/5        P(+) = 2/5


Step 2: The word “with” doesn't occur in the training set, so we drop it completely.

Step 3: The likelihoods from the training set for the remaining three words "predictable", "no",
and "fun" are as follows:

Step 4: For the test sentence S = "predictable with no fun", after removing the word "with", the
chosen class is therefore computed via Eq. 9 as follows:
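
This computation can be checked with the two sketches from Sections 3.1 and 3.2. The five miniature
training documents below are assumed to be the ones from the textbook example (they reproduce the
priors P(-) = 3/5 and P(+) = 2/5 computed above):

# Uses train_naive_bayes and predict from the sketches in Sections 3.2 and 3.1.
train_docs = [
    ("just plain boring".split(), "-"),
    ("entirely predictable and lacks energy".split(), "-"),
    ("no surprises and very few laughs".split(), "-"),
    ("very powerful".split(), "+"),
    ("the most fun film of the summer".split(), "+"),
]
docs, labels = zip(*train_docs)
logprior, loglikelihood, vocab = train_naive_bayes(list(docs), list(labels))

# "with" is not in the training vocabulary, so predict() skips it (Step 2).
print(predict("predictable with no fun".split(), {"+", "-"},
              logprior, loglikelihood, vocab))   # expected output: -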

3.4 Optimizing for Sentiment Analysis

While standard naive Bayes text classification can work well for sentiment analysis, some small
changes are generally employed that improve performance.

3.4.1 Clip the word counts (duplicate words) in each document at 1:


• Remove all duplicate words before concatenating them into the single big document
during training and we also remove duplicate words from test documents.
• This variant is called binary multinomial naive Bayes or binary naive Bayes.
• Example:

Fig. An example of binarization for the binary naive Bayes algorithm
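
A minimal sketch of the only change binary naive Bayes requires: clip each document's word counts
at 1 before counting, both at training and at test time. The helper name is illustrative, reusing
the training sketch from Section 3.2.

def binarize(documents):
    """Remove duplicate tokens within each document (binary naive Bayes)."""
    return [list(set(doc)) for doc in documents]

# Training and testing then proceed exactly as before, e.g.:
# logprior, loglikelihood, vocab = train_naive_bayes(binarize(docs), labels)
# predict(set(test_doc_words), classes, logprior, loglikelihood, vocab)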
3.4.2 Deal with negation.

Consider the difference between I really like this movie (positive) and I didn’t like this movie
(negative). Similarly, negation can modify a negative word to produce a positive review (don’t
dismiss this film, doesn’t let us get bored).

Solution: Prepend the prefix NOT to every word after a token of logical negation (n’t, not, no,
never) until the next punctuation mark.

Thus the phrase: didn’t like this movie , but I

becomes: didnt NOT_like NOT_this NOT_movie , but I

'Words' like NOT_like and NOT_recommend will thus occur more often in negative documents and
act as cues for negative sentiment, while words like NOT_bored and NOT_dismiss will acquire
positive associations.
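
A minimal Python sketch of this negation rule; the exact negation-token list and punctuation set
are assumptions for illustration:

import re

NEGATION_TOKENS = {"not", "no", "never"}        # plus any token ending in "n't"
PUNCTUATION = re.compile(r"^[.,;:!?]+$")

def mark_negation(tokens):
    """Prepend NOT_ to every token after a logical negation, until the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if PUNCTUATION.match(tok):
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATION_TOKENS or tok.lower().endswith("n't"):
            negating = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

print(" ".join(mark_negation("didn't like this movie , but I".split())))
# didn't NOT_like NOT_this NOT_movie , but I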

3.4.3 Insufficient labelled training data:

Derive the positive and negative word features from sentiment lexicons: lists of words that are
pre-annotated with positive or negative sentiment.

For example, the MPQA subjectivity lexicon has 6885 words, each marked for whether it is
strongly or weakly biased positive or negative. Some examples:

+ : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great

- : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate

3.5 Naive Bayes for other text classification tasks

3.5.1 Spam Detection and Naïve Bayes

Spam detection—deciding whether an email is unsolicited bulk mail—was one of the earliest
applications of naïve Bayes in text classification (Sahami et al., 1998). Rather than treating all
words as individual features, effective systems often use predefined sets of words or patterns,
along with non-linguistic features.

For instance, the open-source tool SpamAssassin uses a range of handcrafted features:

• Specific phrases like "one hundred percent guaranteed"

• Regex patterns like mentions of millions of dollars


• Structural properties like HTML with a low text-to-image ratio

• Non-linguistic metadata, such as the email’s delivery path

Other examples of SpamAssassin features include:

• Subject lines written entirely in capital letters

• Urgent phrases like "urgent reply"

• Keywords such as "online pharmaceutical"

• HTML anomalies like unbalanced head tags

• Claims such as "you can be removed from the list"
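
For illustration only, a few such features could be approximated with simple regular expressions.
These are a sketch in the spirit of the rules listed above, not SpamAssassin's actual rule syntax.

import re

ILLUSTRATIVE_RULES = {
    "one_hundred_percent_guaranteed": re.compile(r"one hundred percent guaranteed", re.I),
    "mentions_millions_of_dollars":   re.compile(r"\bmillions? of dollars\b", re.I),
    "online_pharmaceutical":          re.compile(r"online pharmaceutical", re.I),
    "claims_can_be_removed":          re.compile(r"you can be removed from the list", re.I),
}

def spam_features(body, subject=""):
    """Binary handcrafted features for a naive Bayes (or other) spam classifier."""
    feats = {name: bool(rule.search(body)) for name, rule in ILLUSTRATIVE_RULES.items()}
    feats["subject_all_caps"] = bool(subject) and subject.isupper()
    return feats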

3.5.2 Language Identification

In contrast, tasks like language identification rely less on words and more on subword units like
character n-grams or even byte n-grams. These can capture statistical patterns at the start or end
of words, especially when spaces are included as characters.

A well-known system, langid.py (Lui & Baldwin, 2012), starts with all possible n-grams of
lengths 1–4 and uses feature selection to narrow down to the 7,000 most informative.
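
A small sketch of extracting character n-gram features of lengths 1-4. Padding the text with spaces
so that word-initial and word-final patterns are captured is an assumption based on the description
above, not langid.py's exact implementation.

from collections import Counter

def char_ngrams(text, n_min=1, n_max=4):
    """Count character n-grams of lengths n_min..n_max, keeping spaces as characters."""
    padded = " " + text.lower() + " "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

print(char_ngrams("voulez vous").most_common(5))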

Training data for language ID systems often comes from multilingual sources such as Wikipedia
(in 68+ languages), newswire, and social media. To capture regional and dialectal diversity,
additional corpora include:

• Geo-tagged tweets from Anglophone regions like Nigeria or India

• Translations of the Bible and Quran

• Slang from Urban Dictionary

• Corpora of African American Vernacular English (Blodgett et al., 2016)

These diverse sources help models capture the full range of language use across different
communities and contexts (Jurgens et al., 2017).
3.6 Naive Bayes as a Language Model
• Naive Bayes classifiers can use any sort of feature: dictionaries, URLs, email addresses,
network features, phrases, and so on.
• A naive Bayes model can be viewed as a set of class-specific unigram language models,
in which the model for each class instantiates a unigram language model.
• Since the model assigns a probability P(word|c) to each word, it also assigns a probability to
each sentence:

P(s|c) = Π_{i ∈ positions} P(wi|c)        (15)
Example: Consider a naive Bayes model with the classes positive (+) and negative (-) and the
following model parameters:

w       P(w|+)   P(w|-)
I       0.1      0.2
love    0.1      0.001
this    0.01     0.01
fun     0.05     0.005
film    0.1      0.1

Each of the two columns above instantiates a language model that can assign a probability to
the sentence "I love this fun film":

P("I love this fun film" | +) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 5 × 10^-7
P("I love this fun film" | -) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 1.0 × 10^-9

The positive model assigns a higher probability to the sentence: P(s|pos) > P(s|neg).

Note: This is just the likelihood part of the naive Bayes model; once we multiply in the prior, a
full naive Bayes model might well make a different classification decision.

3.7 Evaluation: Precision, Recall, F-measure

Text classification evaluation often starts with binary detection tasks.


Example 1: Spam Detection

• Goal: Label each text as spam (positive) or not spam (negative).


• Need to compare:
o System’s prediction
o Gold label (the human-defined correct label)

Example 2: Social Media Monitoring for a Brand


• Scenario: CEO of Delicious Pie Company wants to track mentions on social media.
• Build a system to detect tweets about Delicious Pie.
• Positive class: Tweets about the company.
• Negative class: All other tweets.
Why we need metrics:

• To evaluate how well a system (e.g., spam detector or pie-tweet detector) performs.

• Confusion Matrix:
o A table that compares system predictions vs. gold (human) labels.
o Each cell represents a type of outcome:
▪ True Positive (TP): Correctly predicted positives (e.g., actual spam labeled as spam).
▪ False Negative (FN): Actual positives incorrectly labeled as negative (e.g., spam labeled as non-spam).
▪ False Positive (FP): Actual negatives incorrectly labeled as positive (e.g., non-spam labeled as spam).
▪ True Negative (TN): Correctly predicted negatives (e.g., non-spam labeled as non-spam).

• Accuracy:
o Formula: (Correct predictions) / (Total predictions).
o Appears useful but misleading for unbalanced classes.

• Why accuracy can fail:


o Real-world data is often skewed (e.g., most tweets are not about pie).
o Example:

▪ 1,000,000 tweets → only 100 about pie.

▪ A naive classifier labels all tweets as "not about pie".

▪ Result: 99.99% accuracy, but 0 useful results.


o Conclusion: Accuracy is not a reliable metric when the positive class is rare.
That’s why, instead of relying on accuracy, we often use two more informative metrics:
precision and recall (as shown in Fig).

• Precision measures the percentage of items labeled as positive by the system that are
actually positive (according to human-annotated “gold” labels).

Precision = true positives / (true positives + false positives)

• Recall measures the percentage of actual positive items that were correctly identified by
the system.

Recall = true positives / (true positives + false negatives)

These metrics address the issue with the “nothing is pie” classifier. Despite its seemingly
excellent 99.99% accuracy, it has a recall of 0: it misses all 100 true positive cases (0 true
positives and 100 false negatives, so recall is 0/100). Its precision is meaningless, since it
never labels anything as positive.

Unlike accuracy, precision and recall focus on true positives, helping us measure how well the
system finds the things it’s actually supposed to detect.

To combine both precision and recall into a single metric, we use the F-measure (van Rijsbergen,
1975):

F_β = (β² + 1) · P · R / (β² · P + R)

The β parameter differentially weights the importance of recall and precision, based perhaps on
the needs of an application. Values of β > 1 favor recall, while values of β < 1 favor precision.
When β = 1, precision and recall are equally balanced; this is the most frequently used version,
and is called Fβ=1 or just F1:

F1 = 2 · P · R / (P + R)        (16)
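
A minimal sketch of computing precision, recall, and F_β from confusion-matrix counts, applied to
the “nothing is pie” classifier above; the function name is illustrative.

def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from confusion-matrix counts (Eq. 16 when beta = 1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_beta

# The "nothing is pie" classifier: 0 true positives, 0 false positives, 100 false negatives.
print(precision_recall_f(tp=0, fp=0, fn=100))   # (0.0, 0.0, 0.0)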

3.7.1 Evaluating with more than two classes

For sentiment analysis we generally have 3 classes (positive, negative, neutral) and even
more classes are common for tasks like part-of-speech tagging, word sense disambiguation,
semantic role labeling, emotion detection, and so on. Luckily the naive Bayes algorithm is
already a multi-class classification algorithm.

Consider the sample confusion matrix for a hypothetical 3-way one-of email
categorization decision (urgent, normal, spam) shown in Fig. The matrix shows, for example,
that the system mistakenly labeled one spam document as urgent, and we have shown how to
compute a distinct precision and recall value for each class.

Confusion matrix for a three-class categorization task, showing for each pair of classes
(c1,c2), how many documents from c1 were (in)correctly assigned to c2.

In order to derive a single metric that tells us how well the system is doing, we can combine these
values in two ways.

1. In macroaveraging, we compute the performance for each class, and then average over
classes.
2. In microaveraging, we collect the decisions for all classes into a single confusion matrix,
and then compute precision and recall from that table.

Fig. shows the confusion matrix for each class separately, and shows the computation of
microaveraged and macroaveraged precision.

As the figure shows, a microaverage is dominated by the more frequent class (in this case spam),
since the counts are pooled. The macroaverage better reflects the statistics of the smaller classes,
and so is more appropriate when performance on all the classes is equally important.
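
A minimal sketch of the difference between the two averages, using hypothetical per-class (tp, fp)
counts for the three email classes:

def macro_micro_precision(per_class):
    """per_class: dict class -> (tp, fp) from that class's one-vs-rest confusion matrix."""
    precisions = [tp / (tp + fp) for tp, fp in per_class.values()]
    macro = sum(precisions) / len(precisions)         # average the per-class precisions
    tp_sum = sum(tp for tp, _ in per_class.values())  # pool the counts, then compute once
    fp_sum = sum(fp for _, fp in per_class.values())
    micro = tp_sum / (tp_sum + fp_sum)
    return macro, micro

# Hypothetical counts for the 3-way task (urgent, normal, spam); spam dominates the microaverage.
print(macro_micro_precision({"urgent": (8, 11), "normal": (60, 55), "spam": (200, 33)}))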
3.8 Test sets and Cross-validation

Training & Testing for Text Classification:

1. Standard Procedure:
o Train the model on the training set.
o Use the development set (devset) to tune parameters and choose the best model.
o Evaluate the final model on a separate test set.

2. Issue with Fixed Splits:


o Fixed training/dev/test sets may lead to small dev/test sets.

o Smaller test sets might not be representative of overall performance.

3. Solution – Cross-Validation (as shown in Fig; see the sketch after this list):
o Cross-validation allows use of all data for training and testing.
o Process:
▪ Split the data into k folds.
▪ For each fold, train on the other k-1 folds and test on the remaining fold.
▪ Repeat k times and average the test errors.
o Example: 10-fold cross-validation (train on 90%, test on 10%, repeated 10 times).

4. Limitation of Cross-Validation:
o All the data is used for testing, so we can't analyze the data in advance (to avoid "peeking" at the test set).
o Looking at the data is important for feature design in NLP systems.

5. Common Compromise:
o Split off a fixed test set.
o Do 10-fold cross-validation on the training set.
o Use the test set only for final evaluation.
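
A minimal sketch of k-fold cross-validation; train_and_eval is a placeholder for whatever training
and scoring routine is being evaluated.

import random

def kfold_cross_validation(data, k, train_and_eval):
    """Split data into k folds; for each fold, train on the other k-1 folds and test on it,
    then return the average of the k test scores. train_and_eval(train, test) -> score."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        scores.append(train_and_eval(train, test))
    return sum(scores) / k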
3.9 Statistical Significance Testing
• When building NLP systems, we often need to compare performance between two
systems (e.g., a new model vs. an existing one).
• Simply observing different scores (e.g., accuracy, F1) isn't enough — we need to know if
the difference is statistically significant.
• This is where statistical hypothesis testing comes in.
• Inspired by Dror et al. (2020) and Berg-Kirkpatrick et al. (2012), these tests help
determine if the observed improvement is real or due to chance.
• Example:
o Classifier A (e.g., logistic regression) vs. Classifier B (e.g., naive Bayes).
o Metric M (e.g., F1-score), tested on dataset x.
o Let M(A, x) be the score for A, and δ(x) be the difference in performance between A and B:

δ(x) = M(A, x) − M(B, x)        (19)
Understanding Effect Size and Significance
• We want to know if δ(x) > 0, meaning A (logistic regression) performs better than B
(naive Bayes).
• δ(x) is the effect size — larger δ means a bigger performance gap.
• But a positive δ alone isn’t enough.
o Example: A has 0.04 higher F1 than B — is that meaningful?
• Problem: The difference might be due to chance on this specific test set.
• What we really want to know:

o Would A still outperform B on another test set or under different conditions?
• That’s why we need statistical testing, not just raw differences.
Statistical Hypothesis Testing Paradigm
• We compare models by setting up two formal hypotheses:

H₀: δ(x) ≤ 0        H₁: δ(x) > 0        (20)

o Null hypothesis (H₀): There's no real difference between A and B; any observed difference is due to chance.
o Alternative hypothesis (H₁): There is a real performance difference between A and B.
• Statistical tests help us decide whether to reject H₀ in favor of H₁ based on the data.
Null Hypothesis and p-value

• Null hypothesis (H₀): Assumes δ(x) ≤ 0 — A is not better than B.


• We want to see if we can reject H₀ and support H₁ (that A is better).

• We imagine δ(x) over many possible test sets.

• The p-value measures how likely we are to observe our δ(x), or a larger one, if H₀ were true:

p-value(x) = P(δ(X) ≥ δ(x) | H₀ is true)        (21)
• A low p-value suggests our result is unlikely due to chance, supporting H₁.
Interpreting p-values and Statistical Testing in NLP
• The p-value is the probability of observing a performance difference δ(x) (or larger),
assuming A is not better than B (null hypothesis H₀).
• If δ(x) is large (e.g., A’s F1 = 0.9 vs. B’s = 0.2), it's unlikely under H₀ → low p-value →
we reject H₀.
• If δ(x) is small, it's more plausible under H₀ → higher p-value → we may fail to reject
H₀.
What Counts as “Small”?
o Common p-value thresholds: 0.05 or 0.01.
• If p < threshold, the result is considered statistically significant (we reject H₀ and
conclude A is likely better than B).
How Do We Compute the p-value in NLP?
• NLP avoids parametric tests (like t-tests or ANOVAs) because they assume certain
distributions that often don't apply.
• Instead, we use non-parametric tests that rely on sampling methods.
Key Idea:

• Simulate many variations of the experiment (e.g., using different test sets x′).

• Compute δ(x′) for each → this gives a distribution of δ values.

• If the observed δ(x) is in the top 1% (i.e., p-value < 0.01), it's unlikely under H₀ → reject
H₀.
Common Non-Parametric Tests in NLP:

1. Approximate Randomization (Noreen, 1989)

2. Bootstrap Test (paired version is most common)

o Compares aligned outputs from two systems (e.g., A vs. B on the same inputs xi).
o Measures how consistently one system outperforms the other across samples.
3.9.1 The Paired Bootstrap Test
The bootstrap test is a flexible, non-parametric method that can be applied to any evaluation
metric—like precision, recall, F1, or BLEU.
What is bootstrapping?
It involves repeatedly sampling with replacement from an original dataset to create many
"bootstrap samples" or virtual test sets. The key assumption is that the original sample is
representative of the larger population.
Example
Imagine a small classification task with 10 test documents. Two classifiers, A and B, are
evaluated:

• Each document outcome falls into one of four categories:


o Both A and B correct
o Both A and B incorrect
o A correct, B wrong
o A wrong, B correct
• If A has 70% accuracy and B has 50%, then the performance difference δ(x) = 0.20.
How bootstrap works:

1. Generate a large number (e.g., 100,000) of new test sets by sampling 10 documents with
replacement from the original set.

2. For each virtual test set, recalculate the accuracy difference between A and B.

3. Use the distribution of these differences to estimate a p-value, telling us how likely the
observed δ(x) is under the null hypothesis (that A is not better than B).

This helps determine whether the observed performance difference is statistically significant or
just due to random chance.

Figure: The paired bootstrap test: examples of b pseudo test sets x^(i) being created from an initial true test
set x. Each pseudo test set is created by sampling n = 10 times with replacement; thus an individual sample
is a single cell, a document with its gold label and the correct or incorrect performance of classifiers A and
B.
With the b bootstrap test sets, we now have a sampling distribution to analyze whether
A’s advantage is due to chance. Following Berg-Kirkpatrick et al. (2012), we assume the null
hypothesis (H₀)—that A is not better than B—so the average δ(x) should be zero or negative. If
our observed δ(x) is much higher, it would be surprising under H₀. To measure this, we
calculate the p-value by checking how often the sampled δ(xᵢ) values exceed the observed δ(x).

We use the notation 1(x) to mean “1 if x is true, and 0 otherwise.” Although the expected value
of δ(X) over many test sets is 0, this isn't true for the bootstrapped test sets because of the bias
in the original test set, so we compute the p-value by counting how often δ(x^(i)) exceeds the
observed δ(x) by δ(x) or more:

p-value(x) = (1/b) Σ_{i=1}^{b} 1( δ(x^(i)) − δ(x) ≥ δ(x) )        (22)

If we have 10,000 bootstrap test sets and a threshold of 0.01, and in 47 test sets we find
δ(x^(i)) ≥ 2δ(x), the p-value of 0.0047 is smaller than 0.01. This suggests the result is surprising,
allowing us to reject the null hypothesis and conclude that A is better than B.

Fig. A version of the paired bootstrap algorithm

The full algorithm for the bootstrap is shown in Fig. It is given a test set x and a number of samples
b, and it counts the percentage of the b bootstrap test sets in which δ(x^(i)) > 2δ(x). This percentage
then acts as a one-sided empirical p-value.
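
Since the algorithm figure did not survive extraction, here is a minimal sketch of a paired bootstrap
test on accuracy; the input format (parallel 0/1 correctness vectors for A and B) is an assumption
for illustration.

import random

def paired_bootstrap_test(correct_a, correct_b, b=10_000):
    """correct_a[i], correct_b[i]: 1 if the system classified test item i correctly, else 0.
    Returns a one-sided empirical p-value for H0: A is not better than B (Eq. 22)."""
    n = len(correct_a)
    delta_x = (sum(correct_a) - sum(correct_b)) / n        # observed accuracy difference
    exceed = 0
    for _ in range(b):
        idx = [random.randrange(n) for _ in range(n)]      # sample n items with replacement
        delta_i = sum(correct_a[j] - correct_b[j] for j in idx) / n
        if delta_i - delta_x >= delta_x:                   # i.e. delta_i >= 2 * delta_x
            exceed += 1
    return exceed / b

# Toy 10-document example from above: A correct on 7 documents, B correct on 5.
a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
b_vec = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
print(paired_bootstrap_test(a, b_vec))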

3.10 Avoiding Harms in Classification (Summary)

• Classifiers can cause harm, including representational harms (e.g., reinforcing stereotypes).
o Example: Sentiment analysis systems rated sentences with African American
names more negatively than identical ones with European American names.
• Toxicity classifiers may falsely label non-toxic content as toxic, especially when it
references marginalized groups or dialects (e.g., AAVE), leading to silencing.
• Harms can arise from:
o Biased training data
o Biased labels or resources (e.g., lexicons, embeddings)
o Model design choices

• No universal fix exists, so transparency is key.


• A proposed solution: release model cards (Mitchell et al., 2019), which include:
o Training algorithms and parameters
o Training data sources, motivation, and preprocessing
o Evaluation data sources, motivation, and preprocessing
o Intended use and users
o Model performance across different demographic or other groups and environmental situations

