Grapheme:: Morpheme
A morpheme is the smallest unit of meaning that cannot be further divided. A base word such as “wind” may be a morpheme, but a suffix, prefix or root also represents a morpheme. For example, the word “red” is a single morpheme, while the word “unpredictable” is made up of the morphemes un + pre + dict + able. None of these units stands alone as a word, but each is a smallest unit of meaning.
06/09/2024
Understanding Linguistics:
Linguistics is the study of language, its structure and the rules that govern that structure. Its approach includes morphology, syntax, semantics and pragmatics.
What is morphology?
It is the study of word structure. It describes how words are formed out of more basic elements of language called morphemes. A morpheme is the smallest meaningful unit of a language. Morphemes are considered minimal because if they were subdivided any further, they would become meaningless.
Each morpheme is different from the others because each carries a different meaning. Morphemes are used to form words. Base, root or free morphemes are words that have meaning and cannot be broken down into smaller parts. Examples of free morphemes are
ocean, book, colour, connect etc.
These words cannot be broken down into smaller units.
Bound or grammatical morphemes, which cannot convey meaning by themselves, must be combined with free morphemes.
Examples:
ocean-s
establish-ment
book-ed
color-ful
dis-connect
Bound morphemes often include the following: -ing, -s, -ed, etc.
09/09/2024
Pragmatic Analysis:
Pragmatics in NLP is the study of contextual meaning. It examines cases where a person’s statement has one literal meaning and another, more profound meaning. It tells us how different contexts can change the meaning of a sentence. Pragmatics considers the intention of the speaker or writer.
E.g.:
1. Can you pull the car over?
Actual (literal) meaning: Are you capable of pulling the car over?
Pragmatic meaning: “Can you stop the car?”
2. It’s hot in here! Can you crack a window?
Here the speaker wants the window to be opened a little and does not want the window to be physically damaged (“cracked”).
3. What time do you call this?
It means “Why are you late?” and not that the speaker wants to know the time.
10/09/2024
Processing Text
1. Use of regular expressions: A regular expression is helpful for extracting the words in a sentence or paragraph. Various rules are used in regular expressions. Basically, a regular expression can be used to check whether a string contains a specified search pattern.
Special Sequences: commonly used special sequences in regular expressions include \d (digit), \w (word character), \s (whitespace) and \b (word boundary).
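A minimal sketch of both ideas with Python's re module (the sentence and patterns are just illustrations):

import re

text = "GLA is located in Mathura."

# findall() with \w+ derives the individual words from the sentence.
words = re.findall(r"\w+", text)
print(words)                               # ['GLA', 'is', 'located', 'in', 'Mathura']

# search() checks whether the string contains the specified search pattern.
print(bool(re.search(r"Mathura", text)))   # True
print(bool(re.search(r"\d+", text)))       # False, the text contains no digits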
Tokenization:
Tokenization in NLP is a technique that involves dividing a sentence or phrase into smaller
units known as tokens. These token consist of words, punctuation marks etc.
Types of tokenization:
1. Sentence Tokenization: The text is segmented into sentences during sentence tokenization. This is useful for tasks that require analysis at the level of individual sentences.
For e.g.: “GLA is the best university. It is located in Mathura.”
After sentence tokenization: “GLA is the best university.” and “It is located in Mathura.”
2. Word Tokenization: It divides the text into individual words.
For e.g.: “Tokenization is an important NLP task.”
After word tokenization we will get the result as
[“Tokenization”, “is”, “an”, “important”, “NLP”, “task”]
3. Character Tokenization: This process divides the text into individual characters. This
can be useful for modelling character level language.
For e.g.: “Tokenization”
[“T”, “o”, “k”, “e”, “n”, “i”, “z”, “a”, “t”, “i”, “o”, “n”]
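As a rough sketch, the three kinds of tokenization can be reproduced with the NLTK library (assuming nltk and its 'punkt' tokenizer data are installed; plain Python handles the character case):

import nltk
nltk.download("punkt")                      # tokenizer models, needed once
from nltk.tokenize import sent_tokenize, word_tokenize

text = "GLA is the best university. It is located in Mathura."
print(sent_tokenize(text))                  # sentence tokenization
# ['GLA is the best university.', 'It is located in Mathura.']

print(word_tokenize("Tokenization is an important NLP task."))   # word tokenization
# ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']

print(list("Tokenization"))                 # character tokenization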
Need of Tokenization:
1. Effective text processing: Tokenization breaks raw text into manageable units so that it can be handled easily.
2. Language Modelling: It facilitates the creation of organized representation of
language which is useful for tasks like text generation.
3. Information retrieval: It is essential for indexing and searching in systems that store
and retrieve information.
4. Text Analysis: Tokenization enables tasks such as sentiment analysis and named entity recognition.
Stemming:
Stemming is a process through which the base or root word is extracted. In this process the last few characters of a given (derived) word are removed. Sometimes stemming produces a base word that does not have a proper meaning.
For e.g.: “history” is converted into “histori”.
Stemming is used mostly for purposes such as sentiment analysis.
Positive sentiment: Pizza was very delicious.
Negative sentiment: Burger was very bad.
Neutral sentiment: I ordered Pizza today.
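A small sketch with NLTK's PorterStemmer (one common stemmer; the word list is only illustrative):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["history", "connected", "flying", "books"]:
    print(word, "->", stemmer.stem(word))
# "history" becomes "histori", which is not a proper English word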
Lemmatization:
Lemmatization is the process of getting meaningful words from the given text. Lemmatization uses a stop-word corpus and the WordNet corpus to produce the lemma. Moreover, the part of speech also has to be specified to obtain the correct lemma. A lemma is an actual word of the language.
Lemmatization finds the base word from a derived word, but the result will always have a meaning.
Lemmatization is basically used in chatbots.
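A minimal sketch with NLTK's WordNetLemmatizer (assumes the WordNet corpus has been downloaded):

import nltk
nltk.download("wordnet")                        # WordNet corpus used to look up lemmas
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("histories"))        # 'history', an actual word, unlike the stem 'histori'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good', the part of speech must be supplied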
13/09/2024
N-Grams:
N-Grams are contiguous sequences of ‘n’ items, typically words in the context of NLP. These items can be characters, words or even syllables depending on the granularity desired. The value of ‘n’ determines the order of the N-gram.
Example:
1. Unigram (1-gram): Single words e.g. “cat”, “dog”
2. Bigram (2-gram): Pairs of consecutive words e.g. “natural language”, “deep learning”
3. Trigrams (3-gram): Triplets of consecutive words e.g.: “machine learning models”,
“data science approach”
Similarly, 4-gram, 5-gram are sequences of four, five words etc.
N = 1: This is a sentence → groups are [This, is, a, sentence]
N = 2: This is a sentence → groups are [This is, is a, a sentence]
N = 3: This is a sentence → groups are [This is a, is a sentence]
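A short sketch of how the groups above can be produced in plain Python (the helper name ngrams is just for illustration):

def ngrams(sentence, n):
    # slide a window of size n over the list of words
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "This is a sentence"
print(ngrams(sentence, 1))   # ['This', 'is', 'a', 'sentence']
print(ngrams(sentence, 2))   # ['This is', 'is a', 'a sentence']
print(ngrams(sentence, 3))   # ['This is a', 'is a sentence']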
Applications of N-gram:
1. Speech Recognition
2. Machine Translation
3. Predictive Text Input
4. Named Entity Recognition
17/09/2024
Bag of Words
The bag of words model is a simple way to convert words to a numerical representation in natural language processing. It is a simple document-embedding technique based on word frequency. Conceptually, we think of the whole document as a bag of words rather than as a sequence, and we represent the document simply by the frequency of each word. Basically, this method converts text into a vector based on the frequency of words in the text, without considering the order or context of the words.
E.g.:
cute dog
cute cute cat
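A minimal sketch of the same two documents with scikit-learn's CountVectorizer (assuming scikit-learn is installed; the vocabulary is sorted alphabetically):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["cute dog", "cute cute cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # matrix of word frequencies

print(vectorizer.get_feature_names_out())    # ['cat' 'cute' 'dog']
print(X.toarray())
# [[0 1 1]     "cute dog"
#  [1 2 0]]    "cute cute cat"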
TF-IDF
The TF-IDF method identifies very common words and rare words in a given document. That means it tells us the importance of each word in the given text, which is not provided by the bag-of-words method. TF stands for Term Frequency and IDF stands for Inverse Document Frequency.
Term frequency refers to the frequency of a word in a document. For a specified word, it is defined as the ratio of the number of times the word appears in a document to the total number of words in the document. Inverse document frequency measures how rare a word is across the collection; it is usually computed as the logarithm of the total number of documents divided by the number of documents that contain the word.
The TF-IDF score for a term in a document is calculated by multiplying its TF and IDF values.
This score reflects how important the term is within the context of the document and across
the entire corpus. Terms with higher TF-IDF scores are considered more significant.
Now for the query we can use the average of TF-IDF word scores for each document to get
the ranking of D1, D2, D3 with respect to the query Q.
Average TF-IDF of D1 = (0 + 0.0293) / 2 = 0.01465
Average TF-IDF of D2 = (0 + 0.0251) / 2 = 0.01255
Average TF-IDF of D3 = (0 + 0) / 2 = 0
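A rough sketch of this ranking idea in Python, following the definitions above (TF = count / document length, IDF = log of total documents / documents containing the word). The documents and query here are assumptions for illustration, so the numbers will not reproduce the averages above:

import math

docs = {
    "D1": "breakthrough drug for schizophrenia".split(),
    "D2": "new schizophrenia drug".split(),
    "D3": "new approach for treatment of schizophrenia".split(),
}

def tf(term, doc):
    # term frequency: occurrences of the term / total words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(total documents / documents containing the term)
    df = sum(1 for d in docs.values() if term in d)
    return math.log10(len(docs) / df)

query = ["schizophrenia", "drug"]
for name, doc in docs.items():
    scores = [tf(t, doc) * idf(t, docs) for t in query]
    print(name, round(sum(scores) / len(scores), 4))   # average TF-IDF over the query terms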
In a term-document incidence matrix, we can answer any query that is in the form of a Boolean expression of terms, i.e., one in which terms are combined with the operators AND, OR and NOT.
Q. Write the result for the following queries.
a) schizophrenia AND drug: 1111 && 1100 = 1100 → Doc1 and Doc2
b) for AND NOT (drug OR approach): 1011 && ~(1100 || 0010) = 1011 && ~1110 = 1011 && 0001 = 0001 → Doc4
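A small sketch of the same Boolean retrieval in Python. The four documents are reconstructed from the inverted index listed in the next section, so their exact wording is an assumption; only their terms are known:

docs = [
    "breakthrough drug for schizophrenia",          # Doc1
    "new schizophrenia drug",                       # Doc2
    "new approach for treatment of schizophrenia",  # Doc3
    "new hopes for treatment of schizophrenia",     # Doc4
]

def incidence(term):
    # one bit per document: 1 if the term occurs in it, 0 otherwise
    return [1 if term in doc.split() else 0 for doc in docs]

def AND(a, b): return [x & y for x, y in zip(a, b)]
def OR(a, b):  return [x | y for x, y in zip(a, b)]
def NOT(a):    return [1 - x for x in a]

# a) schizophrenia AND drug -> [1, 1, 0, 0], i.e. Doc1 and Doc2
print(AND(incidence("schizophrenia"), incidence("drug")))
# b) for AND NOT (drug OR approach) -> [0, 0, 0, 1], i.e. Doc4
print(AND(incidence("for"), NOT(OR(incidence("drug"), incidence("approach")))))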
Inverted Index
In this method, a list is formed in which each document is given a document ID and each term acts as a pointer to its document. The list is then sorted in alphabetical order of terms, and the pointers to the corresponding document IDs are maintained.
For example, if we have the following documents
Max lives in Texas
Jen worked in Seattle
Max met Jen in Texas
Formation of the term list (tokenized term, Doc ID) in order of appearance:

Term      Doc ID
Max       1
lives     1
in        1
Texas     1
Jen       2
worked    2
in        2
Seattle   2
Max       3
met       3
Jen       3
in        3
Texas     3

After sorting the terms alphabetically:

Term      Doc ID
in        1
in        2
in        3
Jen       2
Jen       3
lives     1
Max       1
Max       3
met       3
Seattle   2
Texas     1
Texas     3
worked    2
Inverted Index (for the document collection used in the Boolean-retrieval example above):

Term            Doc ID
breakthrough    1
drug            1
for             1
schizophrenia   1
new             2
schizophrenia   2
drug            2
new             3
approach        3
for             3
treatment       3
of              3
schizophrenia   3
new             4
hopes           4
for             4
treatment       4
of              4
schizophrenia   4
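A minimal sketch that builds such an inverted index for the Max/Jen/Texas documents above, mapping each term to the list of document IDs it appears in:

from collections import defaultdict

docs = {
    1: "Max lives in Texas",
    2: "Jen worked in Seattle",
    3: "Max met Jen in Texas",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for term in text.split():
        if doc_id not in index[term]:        # record each (term, doc) pair only once
            index[term].append(doc_id)

for term in sorted(index, key=str.lower):    # alphabetical order of terms
    print(term, index[term])
# in [1, 2, 3], Jen [2, 3], lives [1], Max [1, 3], met [3], Seattle [2], Texas [1, 3], worked [2]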
20/09/2024
Text Similarity:
Text similarity is the process of comparing a piece of text with another and finding the similarity between them. It is basically about determining the degree of closeness between the texts.
We will discuss two types of similarity.
1. Cosine similarity
2. Jaccard similarity
Cosine Similarity(A, B) = ( Σ_{i=1}^{n} A_i · B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )
Word        Doc1   Doc2
Me          2      2
Rex         0      1
Cathy       1      1
Christine   1      0
Likes       0      1
Loves       2      1
More        1      1
Than        1      1
A = [2, 0, 1, 1, 0, 2, 1, 1]
B = [2, 1, 1, 0, 1, 1, 1, 1]
Cosine similarity(A, B) = (A · B) / (|A| · |B|) = 9 / (√12 × √10) ≈ 0.82
Example 2:
1. I love Data science
2. I love SAP
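A small sketch computing the cosine similarity for the two count vectors A and B above, and for Example 2 (its vectors over the vocabulary [I, love, Data, science, SAP] are an assumption built from the word counts):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))        # numerator: sum of A_i * B_i
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

A = [2, 0, 1, 1, 0, 2, 1, 1]   # Doc1 counts from the table above
B = [2, 1, 1, 0, 1, 1, 1, 1]   # Doc2 counts
print(round(cosine(A, B), 2))  # 0.82

# Example 2: "I love Data science" vs "I love SAP"
print(round(cosine([1, 1, 1, 1, 0], [1, 1, 0, 0, 1]), 2))   # 0.58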
Jaccard Similarity: Jaccard similarity, also called the Jaccard index or Jaccard coefficient, is a simple measure of the similarity between data samples. The similarity is computed as the ratio of the size of the intersection of the data samples to the size of their union.
J(A, B) = n(A ∩ B) / n(A ∪ B)
which means: the elements common to A and B divided by all the elements in A and B together.
Let us suppose we have 2 vectors
A = [1, 3, 2]
B = [5, 0, 3]
A ∩ B = [3]
A ∪ B = [1, 3, 2, 5, 0]
J (A, B) = 1 / 5 = 0.2
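A minimal sketch of the same calculation in Python:

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the unique elements of the two samples
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard([1, 3, 2], [5, 0, 3]))                                # 0.2, as computed above
print(round(jaccard(["ni", "ig", "gh", "ht"],
                    ["ni", "ic", "ch", "ht"]), 2))                  # 0.33, the bigram example in the next section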
24/09/2024
Jaccard vs Cosine:
Jaccard similarity treats data as sets, focusing on the overlap of elements. It considers only the presence or absence of terms, ignoring their magnitudes. It is calculated by dividing the size of the intersection of the sets by the size of their union, and it ranges from 0 (nothing in common) to 1 (identical sets).
Cosine similarity treats data as vectors in a multidimensional space. It considers orientation, i.e. the angle between the vectors, regardless of their magnitudes. It is calculated by taking the dot product of the two vectors and dividing it by the product of their magnitudes, and it ranges from -1 (opposite direction) to 1 (identical direction).
Ni Ig Gh Ht Ic Ch
A 1 1 1 1 0 0
B 1 0 0 1 1 1
A = [1, 1, 1, 1, 0, 0]
B = [1, 0, 0, 1, 1, 1]
|A ∩ B| = 2
|A ∪ B| = 6
J (A, B) = 2 / 6 = 0.33
Q. Write a CFG for the sentence “The man read this book”.
The rules implied by the derivation are: S → NP VP, NP → Det NOM, NOM → Noun, VP → Verb NP, Det → the | this, Noun → man | book, Verb → read.
Derivation:
S → NP VP
→ Det NOM VP
→ The NOM VP
→ The Noun VP
→ The man VP
→ The man Verb NP
→ The man read Det NOM
→ The man read this NOM
→ The man read this Noun
→ The man read this book
Parsing:
In NLP, parsing is the process of analyzing a sentence to determine its grammatical structure. There are two main approaches to parsing:
1. Top-Down Parsing
2. Bottom-Up Parsing
The parser generates a parse tree from the input text and the CFG.
1. Top-down parsing: It is a parsing technique that starts with the highest-level grammar production rule and works its way down to the lowest level. It begins with the start symbol of the grammar and applies the production rules recursively to expand it into a parse tree.
Parse tree:
2. Bottom-up parsing: It is a parsing technique that starts with the words of the sentence and works its way up to the highest level of the grammar’s production rules. It begins with the input sentence and applies the production rules in reverse, reducing the sentence to the start symbol of the grammar.
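As a sketch, the grammar from the CFG example above can be checked with NLTK's ChartParser (assumes nltk is installed; the sentence is lowercased to match the terminal symbols):

import nltk

grammar = nltk.CFG.fromstring("""
    S    -> NP VP
    NP   -> Det NOM
    NOM  -> Noun
    VP   -> Verb NP
    Det  -> 'the' | 'this'
    Noun -> 'man' | 'book'
    Verb -> 'read'
""")

parser = nltk.ChartParser(grammar)                    # builds parse trees from the CFG
for tree in parser.parse("the man read this book".split()):
    tree.pretty_print()                               # draws the parse tree as text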
Word2Vec
Word2Vec creates a vector representation of each word in our vocabulary. Words that are used in similar contexts or that have semantic relationships are captured effectively through their closeness in the vector space, which means similar words will have similar word vectors. The Word2Vec model was created, patented and published in 2013 by a team of researchers led by Tomáš Mikolov at Google.
Word2Vec is a shallow, two-layer neural network model. The input contains all the documents/texts in our training set. For the network to process these texts, they are represented as one-hot encodings of the words.
The number of neurons in the hidden layer is equal to the length of the embedding we want. That is, if we want all our words to be vectors of length 300, then the hidden layer will contain 300 neurons.
The output layer contains probabilities for a target word (given a particular input, which word is expected).
At the end of the training process, the hidden-layer weights are treated as the word embeddings. Intuitively, this can be thought of as each word having a set of n weights. There are two approaches by which we can develop these embeddings:
1. CBOW (Continuous Bag of Words)
2. Skip gram
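A minimal sketch with the gensim library (assuming gensim 4 or later is installed; the toy corpus is far too small for useful vectors and is only for illustration):

from gensim.models import Word2Vec

# each training "document" is a list of tokens
sentences = [
    ["gla", "is", "located", "in", "mathura"],
    ["tokenization", "is", "an", "important", "nlp", "task"],
    ["word2vec", "learns", "a", "vector", "for", "every", "word"],
]

# vector_size = length of each embedding (the hidden-layer width),
# window = context size, sg=0 selects CBOW and sg=1 selects skip-gram
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

print(model.wv["word2vec"].shape)            # (100,)
print(model.wv.most_similar("nlp", topn=3))  # nearest words in the vector space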
CBOW: Continuous Bag of Words