
Grapheme:

A grapheme is a letter or group of letters that represents a single phoneme, so it is the smallest unit of written language, whether or not it carries meaning or corresponds to a single phoneme. For e.g. the /ā/ sound is a phoneme that can be represented by various graphemes including ai, ay, ey, ei and eigh (as in "eight"). Similarly, a single grapheme ea can represent different phonemes (/e/, /a/).
So, it is normally considered that there are about 250 graphemes in English.

A morpheme is the smallest unit of meaning that cannot be further divided, so a base word might be a morpheme, but a suffix, prefix or root also represents a morpheme. For e.g. the word "red" is a single morpheme, but the word "unpredictable" is made up of the morphemes un + pre + dict + able. None of these units stands alone as a word, but each is a smallest unit of meaning.

06/09/2024

Understanding Linguistics:
Linguistics is the study of language, its structure and the rules that govern its structure. Its main branches include morphology, syntax, semantics and pragmatics.

What is morphology?
It is the study of word structure. It describes how words are formed out of more
basic elements of language called morphemes. A morpheme is the smallest unit of a
language. Morphemes are considered minimal because if they were subdivided any further,
they would become meaningless.
Each morpheme is different from the others because each carries a different meaning. Morphemes are used to form words.
Base, root or free morphemes are words that have meaning and cannot be broken down into smaller parts. Examples of free morphemes are
ocean, book, colour, connect etc.
These words cannot be broken down into smaller units.
Bound or grammatical morphemes, which cannot convey meaning by themselves, must be combined with free morphemes.
Examples:
ocean-s
establish-ment
book-ed
color-ful
dis-connect

Bound morphemes often include the following -ing, -s, -ed etc.

09/09/2024

Semantic Analysis: (literal meaning)
Semantic analysis is a very important component of NLP that concentrates on understanding the meaning, interpretation and relationships between words, phrases and sentences. Tools built on semantic analysis can assist businesses in automatically extracting useful information from data including emails, support requests and consumer comments.

Advantages of semantic analysis:

1. Improved understanding of text: AI helps understand the tone and meaning of words, phrases and sentences, leading to a more accurate interpretation of text.
2. Enhanced search and information retrieval: Search engines can provide more relevant results by understanding user queries better, considering the context and meaning rather than just keywords.
3. Improved machine learning models: In AI and machine learning, semantic analysis helps in data extraction, sentiment analysis and understanding relationships in data, which enhances the performance of models.
4. Enhanced user experience: Chatbots, virtual assistants and recommendation systems benefit from semantic analysis by providing more accurate and context-aware responses.
5. Personalization and recommendation systems: Semantic analysis allows for a deeper understanding of user preferences, enabling personalized recommendations in e-commerce.

Pragmatic Analysis:
Pragmatics in NLP is the study of contextual meaning. It examines cases where a person's statement has one literal meaning and another, more profound meaning. It tells us how different contexts can change the meaning of a sentence. Pragmatics considers the intention of the speaker or writer.
E.g.:
1. Can you pull the car over?
Literal meaning: Are you capable of pulling the car over?
Pragmatic meaning: This means “Can you stop the car”
2. It’s hot in here! Can you crack a window?
Here the speaker wants the windows to be opened a little and does not want the
windows to be physically damaged.
3. What time do you call this?
It means "Why are you late?" and not that the speaker wants to know the time.

10/09/2024

Processing Text
1. Use of regular expressions: A regular expression is helpful in extracting words from a given sentence or paragraph. There are various rules used in regular expressions. So basically, a regular expression can be used to check if a string contains a specified search pattern.

Character   Description                                     Example
[]          A set of characters                             [a-d]
\           Signals a special sequence                      \d
.           Any character (except newline)                  g.a
^           Starts with                                     "^The"
$           Ends with                                       "school$"
*           Zero or more occurrences                        "he.*o"
+           One or more occurrences                         "he.+o"
?           Zero or one occurrence                          r.?m
{}          Exactly the specified number of occurrences     He.{2}o

Special Sequences:

Character   Description                                                                          Example
\A          Returns a match if the specified characters are at the beginning of the string      "\AThe"
\b          Returns a match if the specified characters are at the beginning or end of a word   r"\bain"
\d          Returns a match where the string contains digits
\D          Returns a match where the string does not contain digits
\s          Returns a match where the string contains a white-space character
\w          Returns a match where the string contains any word character
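A minimal sketch of a few of these patterns using Python's built-in re module (the sample strings below are made up for illustration):

import re

text = "The rain in Spain stays mainly in the plain"

# [] : any single character from the set a-d
print(re.findall(r"[a-d]", text))          # every 'a' (and b/c/d, if present) in the text

# ^ and $ : anchors for the start and end of the string
print(bool(re.search(r"^The", text)))      # True, the string starts with "The"
print(bool(re.search(r"plain$", text)))    # True, the string ends with "plain"

# \b : word boundary; here "ain" at the end of a word (rain, Spain, plain)
print(re.findall(r"ain\b", text))          # ['ain', 'ain', 'ain']

# \d : digits
print(re.findall(r"\d", "Room 42"))        # ['4', '2']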

Tokenization:
Tokenization in NLP is a technique that involves dividing a sentence or phrase into smaller units known as tokens. These tokens consist of words, punctuation marks etc.
Types of tokenization:
1. Sentence Tokenization: The text is segmented into sentences during sentence tokenization. This is useful for tasks that require individual sentence analysis.
For e.g.: "GLA is best university. It is located in Mathura."
After sentence tokenization: "GLA is best university."
"It is located in Mathura."
2. Word Tokenization: It divides the text into individual words.
For e.g.: “Tokenization is an important NLP task.”
After word tokenization we will get the result as
[“Tokenization”, “is”, “an”, “important”, “NLP”, “task”]

3. Character Tokenization: This process divides the text into individual characters. This can be useful for character-level language modelling.
For e.g.: “Tokenization”
[“T”, “o”, “k”, “e”, “n”, “i”, “z”, “a”, “t”, “i”, “o”, “n”]
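A minimal sketch of sentence, word and character tokenization using NLTK (assuming nltk is installed and its 'punkt' tokenizer data has been downloaded):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')   # tokenizer models, needed only once

text = "GLA is best university. It is located in Mathura."
print(sent_tokenize(text))
# ['GLA is best university.', 'It is located in Mathura.']

print(word_tokenize("Tokenization is an important NLP task."))
# ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']

print(list("Tokenization"))   # character tokenization
# ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']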

Need of Tokenization:
1. Effective text processing: Tokenization reduces the size of raw text so that it can be
handled easily
2. Language Modelling: It facilitates the creation of organized representation of
language which is useful for tasks like text generation.
3. Information retrieval: It is essential for indexing and searching in systems that store
and retrieve information.
4. Text Analysis: Tokenization enables tasks such as sentiment analysis and named
entity recognition.

Stemming:
Stemming is a process through which the base or root form of a word is extracted. In this process the last few characters of a given (derived) word are removed. Sometimes stemming produces a base word which does not have a proper meaning.
For e.g.: "history" is converted into "histori".
Stemming is used mostly for the purpose of sentiment analysis.
Positive sentiment → Pizza was very delicious.
Negative sentiment → Burger was very bad.
Neutral sentiment → I ordered Pizza today.
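A minimal sketch of stemming using NLTK's PorterStemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["history", "playing", "connected", "delicious"]:
    # the stemmer chops suffixes, so the result is not always a real word
    print(word, "->", stemmer.stem(word))
# e.g. history -> histori, playing -> play, connected -> connect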

Lemmatization:
Lemmatization is the process of getting meaningful base words from the given text.
Lemmatization uses a stop-word corpus and the WordNet corpus to produce the lemma.
Moreover, the part-of-speech also has to be specified to obtain the correct lemma. A lemma is an actual word of the language.
Lemmatization is used to find base words from derived words, but the result will always have a meaning.
Lemmatization is widely used in chatbots.
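A minimal sketch of lemmatization using NLTK's WordNetLemmatizer (the 'wordnet' corpus must be downloaded once):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # WordNet corpus, needed only once

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("histories"))          # 'history' -- a real word, unlike the stem 'histori'
print(lemmatizer.lemmatize("running", pos="v"))   # 'run', because the part-of-speech is given as verb
print(lemmatizer.lemmatize("better", pos="a"))    # 'good', lemma of the adjective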

13/09/2024

N-Grams:
N-grams are contiguous sequences of 'n' items, typically words, in the context of NLP. These items can be characters, words or even syllables depending on the granularity desired. The value of 'n' determines the order of the N-gram.
Example:
1. Unigram (1-gram): Single words e.g. “cat”, “dog”
2. Bigram (2-gram): Pairs of consecutive words e.g. “natural language”, “deep learning”
3. Trigrams (3-gram): Triplets of consecutive words e.g.: “machine learning models”,
“data science approach”
Similarly, 4-gram, 5-gram are sequences of four, five words etc.
N = 1 This is a sentence
Groups are [This, is, a, sentence]
N = 2 This is a sentence
Groups are [This is, is a, a sentence]
N = 3 This is a sentence
Groups are [This is a, is a sentence]
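A minimal sketch of generating word-level N-grams with plain Python:

def ngrams(text, n):
    # split on whitespace and slide a window of size n over the tokens
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This is a sentence"
print(ngrams(sentence, 1))   # ['This', 'is', 'a', 'sentence']
print(ngrams(sentence, 2))   # ['This is', 'is a', 'a sentence']
print(ngrams(sentence, 3))   # ['This is a', 'is a sentence']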

Significance of N-grams in NLP:

1. Capturing context and semantics:
N-grams help capture the contextual information and semantics within a sequence of words.
2. Improving language models:
In language modelling tasks, N-grams contribute to building more accurate and context-aware models.
3. Enhanced text prediction:
N-grams are essential for predictive-text applications, i.e. predicting the next word given a sequence of words.
4. Information retrieval:
In information retrieval tasks, N-grams assist in matching and ranking documents based on the relevance of N-gram patterns.
5. Feature extraction:
N-grams act as powerful features in text classification and sentiment analysis.

Applications of N-gram:
1. Speech Recognition
2. Machine Translation
3. Predictive Text Input
4. Named Entity Recognition

17/09/2024

Bag of Words
The bag of words model is a simple way to convert words to numerical representation in
natural language processing. This model is a simple document embedding technique based
on word frequency. Conceptually, we think of the whole document as a bag of words, rather
than a sequence. We represent the document simply by the frequency of each word. So
basically, this method converts text into vector based on the frequency of words in the text,
without considering the order or context of the words.
E.g.:
Doc1: cute dog
Doc2: cute cute cat

                 cute   dog   cat
cute dog           1     1     0
cute cute cat      2     0     1
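A minimal sketch using scikit-learn's CountVectorizer (note that it orders the vocabulary alphabetically):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["cute dog", "cute cute cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['cat' 'cute' 'dog']
print(X.toarray())
# [[0 1 1]    <- "cute dog"
#  [1 2 0]]   <- "cute cute cat"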

TF-IDF
The TF-IDF method identifies very common words or rare words in a given document. That
means it tells us the importance of each word in the given text which is not provided by bag
of words method. The term TF stands for term frequency and the IDF stands for Inverse
Document Frequency.
Term frequency refers to the frequency of a word in a document. For a specified word, it is defined as the ratio of the number of times the word appears in a document to the total number of words in the document.

TF (t, d) = (number of times t appears in d) / (total number of words in d)

where t is the word or token and d is the document.


Inverse Document Frequency measures the importance of the word in the corpus. It measures how common or rare a particular word is across all the documents in the corpus.

IDF (t) = log (total number of documents / number of documents containing t)

The TF-IDF score for a term in a document is calculated by multiplying its TF and IDF values.
This score reflects how important the term is within the context of the document and across
the entire corpus. Terms with higher TF-IDF scores are considered more significant.

TF-IDF (t, d) = TF (t, d) x IDF (t)

Let’s take an example


Suppose there are three documents
D1: The cat is on the mat
D2: My dog and cat are the best
D3: The locals are playing
Suppose there is a query
Q: The cat
Now we have to determine which document will appear first in the search result. For this we
will calculate TF-IDF.
So, let’s compute the TF scores of the words ‘the’ and ‘cat’ (i.e. the query words) with
respect to the documents
TF (“the”, D1) = 2 / 6 = 0.33
TF (“the”, D2) = 1 / 7 = 0.14
TF (“the”, D3) = 1 / 4 = 0.25
TF (“cat”, D1) = 1 / 6 = 0.16
TF (“cat”, D2) = 1 / 7 = 0.14
TF (“cat”, D3) = 0 / 4 = 0

Now we will find IDF


IDF (“the”) = log (3 / 3) = log (1) = 0
IDF (“cat”) = log (3 / 2) = log (1.5) = 0.18

So now we will calculate TF-IDF


TF-IDF (“the”, D1) = TF (“the”, D1) x IDF (“the”) = 0.33 x 0 = 0
TF-IDF (“the”, D2) = TF (“the”, D2) x IDF (“the”) = 0.14 x 0 = 0
TF-IDF (“the”, D3) = TF (“the”, D3) x IDF (“the”) = 0.25 x 0 = 0
TF-IDF (“cat”, D1) = TF (“cat”, D1) x IDF (“cat”) = 0.16 x 0.18 = 0.0293
TF-IDF (“cat”, D2) = TF (“cat”, D2) x IDF (“cat”) = 0.14 x 0.18 = 0.0251
TF-IDF (“cat”, D3) = TF (“cat”, D3) x IDF (“cat”) = 0 x 0.18 = 0

Now for the query we can use the average of TF-IDF word scores for each document to get
the ranking of D1, D2, D3 with respect to the query Q.
Average TF-IDF of D1 = (0 + 0.0293) / 2 = 0.01465
Average TF-IDF of D2 = (0 + 0.0251) / 2 = 0.01255
Average TF-IDF of D3 = (0 + 0) / 2 = 0
Since D1 has the highest average score, the ranking for the query is D1, then D2, then D3.
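A minimal sketch that reproduces this example in plain Python (log base 10, as in the calculation above):

import math

docs = {
    "D1": "the cat is on the mat",
    "D2": "my dog and cat are the best",
    "D3": "the locals are playing",
}
query = ["the", "cat"]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)

def idf(term):
    containing = sum(1 for d in docs.values() if term in d.split())
    return math.log10(len(docs) / containing)

for name, doc in docs.items():
    avg = sum(tf(t, doc) * idf(t) for t in query) / len(query)
    print(name, round(avg, 5))
# D1 gets the highest average TF-IDF, so it ranks first for the query "The cat"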

The Term Document Incidence Matrix:

The term document incidence matrix is one of the basic techniques to represent text data, where we collect the unique words across all the documents. For each document, we put 1 in the cell if the term exists in the document, otherwise we fill 0.
For e.g.:
Doc1: I am a cow
Doc2: Cow is what I am
Doc3: Today is Tuesday

Term       Doc1   Doc2   Doc3
i            1      1      0
am           1      1      0
a            1      0      0
cow          1      1      0
is           0      1      1
what         0      1      0
today        0      0      1
tuesday      0      0      1

Q. Draw Term Document Incidence matrix.


Doc1: breakthrough drug for schizophrenia
Doc2: new schizophrenia drug
Doc3: new approach for treatment of schizophrenia
Doc4: new hopes for treatment of schizophrenia

Term            Doc1   Doc2   Doc3   Doc4
breakthrough      1      0      0      0
drug              1      1      0      0
for               1      0      1      1
schizophrenia     1      1      1      1
new               0      1      1      1
approach          0      0      1      0
treatment         0      0      1      1
of                0      0      1      1
hopes             0      0      0      1

In the term document incidence matrix, we can answer any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR and NOT.
Q. Write the result for the query.
a) schizophrenia AND drug → 1111 && 1100 → 1100 → Doc1 and Doc2
b) for AND NOT (drug OR approach) → 1011 && ~(1100 || 0010) → 1011 && ~1110 → 1011 && 0001 → 0001 → Doc4
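A minimal sketch of answering such Boolean queries with bitwise operations on the incidence vectors (Doc1 is the most significant bit):

incidence = {
    "schizophrenia": 0b1111,
    "drug":          0b1100,
    "for":           0b1011,
    "approach":      0b0010,
}
ALL = 0b1111   # mask covering the four documents

def docs_of(bits):
    # turn a 4-bit incidence vector back into document names
    return ["Doc%d" % (i + 1) for i in range(4) if bits & (1 << (3 - i))]

# a) schizophrenia AND drug
print(docs_of(incidence["schizophrenia"] & incidence["drug"]))   # ['Doc1', 'Doc2']

# b) for AND NOT (drug OR approach)
print(docs_of(incidence["for"] & (ALL & ~(incidence["drug"] | incidence["approach"]))))   # ['Doc4']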

Inverted Index
In this method, each document is given a document ID and a list of (term, document ID) pairs is formed. The list is then sorted in alphabetical order of the terms, and for each term pointers are maintained to its corresponding document IDs.
For example, if we have the following documents
Max lives in Texas
Jen worked in Seattle
Max met Jen in Texas

Doc ID Document Tokenized terms


1 Max lives in Texas [‘Max’, ‘lives’, ‘in’, ‘Texas’]
2 Jen worked in Seattle [‘Jen’, ‘worked’, ‘in’, ‘Seattle’]
3 Max met Jen in Texas [‘Max’, ‘met’, ‘Jen’, ‘in’, ‘Texas’]

Formation of vector
Tokenized term    Doc ID
Max 1
lives 1
in 1
Texas 1
Jen 2
worked 2
in 2
Seattle 2
Max 3
met 3
Jen 3
in 3
Texas 3

Terms arranged in alphabetical order:

in 1
in 2
in 3
Jen 2
Jen 3
lives 1
Max 1
Max 3
met 3
Seattle 2
Texas 1
Texas 3
worked 2

Inverted Index:

Term      Freq   Documents
in          3    1, 2, 3
Jen         2    2, 3
lives       1    1
Max         2    1, 3
met         1    3
Seattle     1    2
Texas       2    1, 3
worked      1    2
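A minimal sketch of building such an inverted index in Python:

from collections import defaultdict

docs = {
    1: "Max lives in Texas",
    2: "Jen worked in Seattle",
    3: "Max met Jen in Texas",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

for term in sorted(index, key=str.lower):
    postings = sorted(index[term])
    # the frequency column is simply the length of the postings list
    print(term, len(postings), postings)
# e.g.  in 3 [1, 2, 3]   ...   Texas 2 [1, 3]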

Q. Draw the inverted index for the document collection


Doc1: breakthrough drug for schizophrenia
Doc2: new schizophrenia drug
Doc3: new approach for treatment of schizophrenia
Doc4: new hopes for treatment of schizophrenia

Term Doc ID
breakthrough 1
drug 1
for 1
schizophrenia 1
new 2
schizophrenia 2
drug 2
new 3
approach 3
for 3
treatment 3
of 3
schizophrenia 3
new 4
hopes 4
for 4
treatment 4
of 4
schizophrenia 4

Term             Freq   Documents
approach           1    3
breakthrough       1    1
drug               2    1, 2
for                3    1, 3, 4
hopes              1    4
new                3    2, 3, 4
of                 2    3, 4
schizophrenia      4    1, 2, 3, 4
treatment          2    3, 4

20/09/2024

Text Similarity:
Text Similarity is the process of comparing a piece of text with another and finding the
similarity between them. It’s basically about determining the degree of closeness of the text.
We will discuss two types of similarity.
1. Cosine similarity
2. Jaccard similarity

Cosine Similarity: Cosine similarity measures the similarity between two vectors. It measures the cosine of the angle between two embeddings and determines whether they are pointing in the same direction.
When the embeddings point in the same direction the angle between them is zero, so their cosine similarity is 1.
When the angle between them is 90 degrees, the cosine similarity is 0.

Cosine Similarity (A, B) = ( Σ A_i * B_i ) / ( sqrt(Σ A_i^2) * sqrt(Σ B_i^2) ),  where the sums run over i = 1 to n

Example: Find the cosine similarity


1. Cathy loves me more than Christine loves me
2. Rex likes me more than Cathy loves me

Term        Doc1   Doc2
me            2      2
Rex           0      1
Cathy         1      1
Christine     1      0
likes         0      1
loves         2      1
more          1      1
than          1      1

A = [2, 0, 1, 1, 0, 2, 1, 1]
B = [2, 1, 1, 0, 1, 1, 1, 1]
Cosine similarity:
Σ A_i * B_i = 2*2 + 0*1 + 1*1 + 1*0 + 0*1 + 2*1 + 1*1 + 1*1 = 4 + 0 + 1 + 0 + 0 + 2 + 1 + 1 = 9

Root of sum of squares of A = sqrt (4 + 0 + 1 + 1 + 0 + 4 + 1 + 1) = sqrt (12) = 3.46


Root of sum of squares of B = sqrt (4 + 1 + 1 + 0 + 1 + 1 + 1 + 1) = sqrt (10) = 3.16
Cosine similarity = 9 / (3.46 * 3.16) = 0.823
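A minimal sketch of the same calculation in Python:

import math

A = [2, 0, 1, 1, 0, 2, 1, 1]
B = [2, 1, 1, 0, 1, 1, 1, 1]

dot = sum(a * b for a, b in zip(A, B))        # 9
norm_a = math.sqrt(sum(a * a for a in A))     # sqrt(12) ≈ 3.46
norm_b = math.sqrt(sum(b * b for b in B))     # sqrt(10) ≈ 3.16

print(dot / (norm_a * norm_b))                # ≈ 0.82 (the 0.823 above comes from rounded intermediate values)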

Example 2:
1. I love Data science
2. I love SAP

Cosine distance = 1 – cos (θ)

Jaccard Similarity: Jaccard similarity, also called the Jaccard index or Jaccard coefficient, is a simple measure to represent the similarity between data samples. The similarity is computed as the ratio of the size of the intersection of the data samples to the size of their union.

J (A, B) = n(A ∩ B) / n(A ∪ B)

Which means common things between A and B / all things in A and B together
Let us suppose we have 2 vectors
A = [1, 3, 2]
B = [5, 0, 3]
A ∩ B = [3]

A ∪ B = [1, 3, 2, 5, 0]
J (A, B) = 1 / 5 = 0.2

Jaccard distance = 1 – Jaccard Similarity
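A minimal sketch of Jaccard similarity and distance in Python:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = [1, 3, 2]
B = [5, 0, 3]
print(jaccard(A, B))        # 0.2  (intersection {3}, union {0, 1, 2, 3, 5})
print(1 - jaccard(A, B))    # Jaccard distance = 0.8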

Q. Find the Jaccard Similarity between two sets.


Set A =
Set B =
J (A, B) = 2 / 6 = 1 / 3 = 0.33

24/09/2024

Difference between Jaccard and Cosine Similarities

Jaccard similarity treats data as sets, focusing on the overlap of elements. It considers only the presence or absence of terms, ignoring their magnitudes. Cosine similarity treats data as vectors in a multidimensional space. It considers the orientation (the angle between the vectors) regardless of their magnitudes.

Jaccard similarity is calculated by dividing the size of the intersection of the sets by the size of their union. It ranges from 0 (no common elements) to 1 (identical sets). Cosine similarity is calculated by taking the dot product of the two vectors and dividing it by the product of their magnitudes. It ranges from -1 (opposite direction) to 1 (identical direction).

Jaccard similarity is commonly used in applications involving set comparison, document and social network analysis, etc. Cosine similarity is widely used in information retrieval, recommendation systems, etc.

Q. Find the Jaccard similarity between words w1 and w2 based on the bigram model,

where w1 = night, w2 = nicht
Sol.: bigrams (n = 2) of 'night' and 'nicht':

             ni   ig   gh   ht   ic   ch
A (night)     1    1    1    1    0    0
B (nicht)     1    0    0    1    1    1

A = [1, 1, 1, 1, 0, 0]
B = [1, 0, 0, 1, 1, 1]
|A ∩ B| = 2

|A ∪ B| = 6
J (A, B) = 2 / 6 = 0.33

Context Free Grammar:

In NLP, a Context Free Grammar (CFG) is a set of production rules used to generate all the possible sentences in a given language. A CFG is a formal grammar in the sense that it consists of a set of terminals, which are the basic units of a language, and a set of non-terminals, which are used to generate the terminals through a set of production rules. CFGs are often used in natural language parsing and generation, and also in natural language understanding, where a CFG can be used to analyze the syntactic structure of a sentence.
G = {T, N, S, R}
Here, T is the set of terminals, N is the set of non-terminals, S is the starting symbol and R is the set of rules/productions of the form X → r, where X is a non-terminal and r is a sequence of terminals and non-terminals (which may be empty).
G is the grammar that generates a language L.

Now the rules we will follow:

R = { S → NP VP        {the start symbol can produce Noun Phrase Verb Phrase}
      S → Aux NP VP    {the start symbol can produce Auxiliary verb NP VP}
      S → VP
      NP → Det NOM
      NOM → Noun       {a Nominal can produce a Noun}
      NOM → Noun NOM
      VP → Verb NP
    }
T = {this, that, a, noun, flight, meal, include, read, does}
N = {S, NP, NOM, VP, Det, Noun, Verb, Aux}
S=S

Write the CFG derivation for the sentence "The man read this book":
S → NP VP
  → Det NOM VP
  → The NOM VP
  → The Noun VP
  → The man VP
  → The man Verb NP
  → The man read Det NOM
  → The man read this NOM
  → The man read this Noun
  → The man read this book
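A minimal sketch of checking this derivation with NLTK; the grammar below is a small hypothetical fragment, just large enough to parse this one sentence:

import nltk
from nltk import CFG

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det NOM
NOM -> Noun
VP -> Verb NP
Det -> 'The' | 'this'
Noun -> 'man' | 'book'
Verb -> 'read'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("The man read this book".split()):
    tree.pretty_print()   # draws the parse tree with S at the root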

Parsing:
In NLP, parsing is the process of analyzing a sentence to determine its grammatical structure. There are two main approaches to parsing:
1. Top-Down Parsing
2. Bottom-Up Parsing
The parser generates a parse tree from the input sentence and the CFG.

1. Top-down parsing: It is a parsing technique that starts with the highest-level production rule of the grammar and then works its way down to the lowest level. It begins with the start symbol of the grammar and applies the production rules recursively to expand it into a parse tree.

Parse tree:
2. Bottom-up parsing: It is a parsing technique that starts with the sentence's words and works its way up to the highest level of the grammar's production rules. It begins with the input sentence and applies the production rules in reverse, reducing the input sentence to the start symbol of the grammar.

Construct parse Tree


1. Book this flight

Word2Vec
Word2Vec creates a representation of each word present in our vocabulary as a vector. Words used in similar contexts or having semantic relationships are captured effectively through their closeness in the vector space; that means similar words will have similar word vectors. The Word2Vec model was created, patented and published in 2013 by a team of researchers led by Tomáš Mikolov at Google.
Word2Vec is a shallow, 2-layer neural network model. The input contains all the documents/texts in our training set. For the network to process these texts, they are represented as one-hot encodings of the words.
The number of neurons in the hidden layer is equal to the length of the embedding we want. That is, if we want all our words to be vectors of length 300, then the hidden layer will contain 300 neurons.
The output layer contains probabilities for a target word (given an input to the model, what word is expected).
At the end of the training process, the hidden-layer weights are treated as the word embeddings. Intuitively, this can be thought of as each word having a set of n weights. There are two approaches to develop these embeddings:
1. CBOW (Continuous Bag of Words)
2. Skip gram
CBOW: Continuous Bag of Words

For example, consider the sentence

Complete sentence → "The cake was chocolate flavoured"
Now suppose we provide as input only
"The ____ was chocolate flavoured"; then "cake" becomes the target word.
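A minimal sketch of training word embeddings with the gensim library (the tiny corpus here is made up, so the resulting vectors are not meaningful, but it shows the API):

from gensim.models import Word2Vec

# toy corpus: each document is a list of tokens
sentences = [
    ["the", "cake", "was", "chocolate", "flavoured"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=0 selects the CBOW architecture; sg=1 would select skip-gram
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

vector = model.wv["cake"]              # 100-dimensional embedding for "cake"
print(vector.shape)                    # (100,)
print(model.wv.most_similar("cake"))   # nearest words in the (toy) vector space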
