Natural Language Processing Lab Manual
Natural Language Processing Lab Manual
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # Download necessary NLTK data files

# Example text
text = ("Natural language processing (NLP) is a field of artificial intelligence "
        "that helps computers understand, interpret, and manipulate human language. "
        "It enables tasks like speech recognition, sentiment analysis, and machine "
        "translation.")

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)
print()

# Word Tokenization
words = word_tokenize(text)
print("Word Tokenization:")
print(words)
Explanation:
sent_tokenize splits the text into a list of sentences, while word_tokenize splits it into individual words and punctuation tokens.
Output:
Sentence Tokenization: ['Natural language processing (NLP) is a field of artificial intelligence that helps computers understand, interpret, and manipulate human language.', 'It enables tasks like speech recognition, sentiment analysis, and machine translation.']
Word Tokenization: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'helps', 'computers', 'understand', ',', 'interpret', ',', 'and', 'manipulate', 'human', 'language', '.', 'It', 'enables', 'tasks', 'like', 'speech', 'recognition', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.']
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def remove_stopwords(text):
    # Tokenize the input text into words
    words = word_tokenize(text)
    # Keep only the words that are not in the English stopword list
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Example text
text = "This is an example sentence where we will remove stop words."

# Remove stopwords
filtered_text = remove_stopwords(text)
print("Original Text:", text)
print("Filtered Text:", filtered_text)
Explanation:
Stopwords are common words (such as "is", "an", "we") that carry little meaning on their own; removing them keeps only the content-bearing tokens.
Output:
Original Text: This is an example sentence where we will remove stop words.
Filtered Text: example sentence remove stop words .
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')

def stem_text(text):
    words = word_tokenize(text)
    # Reduce each token to its Porter stem
    stemmer = PorterStemmer()
    stemmed_text = ' '.join(stemmer.stem(word) for word in words)
    return stemmed_text

# Example text
text = ("NLTK is a leading platform for building Python programs to work with "
        "human language data.")

# Perform stemming
stemmed_text = stem_text(text)
print(stemmed_text)

OUTPUT:
nltk is a lead platform for build python program to work with human languag data .
3. Word Analysis and Word Generation
Word Analysis:
Analyze character frequency in a given text.
Analyze word frequency and length.
Word Generation:
Generate new words based on the analyzed character and word statistics.
import random
from collections import Counter, defaultdict

class WordAnalyzer:
    def __init__(self, text):
        self.text = text
        self.word_list = self.text.split()
        self.char_freq = self.analyze_char_frequency()
        self.word_freq = self.analyze_word_frequency()

    def analyze_char_frequency(self):
        # Frequency of each character, excluding spaces
        return Counter(self.text.replace(" ", ""))

    def analyze_word_frequency(self):
        return Counter(self.word_list)

    def analyze_word_lengths(self):
        # Frequency of each word length (in characters)
        return Counter(len(word) for word in self.word_list)

    def display_analysis(self):
        print("Character Frequency:")
        for char, freq in self.char_freq.items():
            print(f"{char}: {freq}")
        print("\nWord Frequency:")
        for word, freq in self.word_freq.items():
            print(f"{word}: {freq}")
        print("\nWord Length Frequency:")
        for length, freq in self.analyze_word_lengths().items():
            print(f"{length}: {freq}")

class WordGenerator:
    def __init__(self, char_freq, word_list):
        self.char_freq = char_freq
        self.word_list = word_list
        self.transition_matrix = self.build_transition_matrix()

    def build_transition_matrix(self):
        # Count how often each character follows another within a word
        matrix = defaultdict(Counter)
        for word in self.word_list:
            for i in range(len(word) - 1):
                matrix[word[i]][word[i + 1]] += 1
        # Normalize the counts into probabilities
        for char, transitions in matrix.items():
            total = sum(transitions.values())
            for next_char in transitions:
                transitions[next_char] /= total
        return matrix

    def generate_word(self, length):
        # Markov Chain-based generation using the transition matrix
        if not self.char_freq:
            return ""
        start_char = random.choice(list(self.char_freq.keys()))
        word = start_char
        for _ in range(length - 1):
            if start_char not in self.transition_matrix:
                break
            next_char = random.choices(
                list(self.transition_matrix[start_char].keys()),
                weights=list(self.transition_matrix[start_char].values())
            )[0]
            word += next_char
            start_char = next_char
        return word

    def random_word(self, length):
        # Characters chosen independently, weighted by frequency
        return ''.join(random.choices(
            list(self.char_freq.keys()),
            weights=list(self.char_freq.values()),
            k=length
        ))

if __name__ == "__main__":
    # Example Text
    text = "hello world this is a simple example of word analysis and generation"

    # Word Analysis
    analyzer = WordAnalyzer(text)
    analyzer.display_analysis()

    # Word Generation
    generator = WordGenerator(analyzer.char_freq, analyzer.word_list)
    print("\nGenerated Words:")
    print("Markov Chain Based:", generator.generate_word(5))
    print("Random Word:", generator.random_word(5))

Explanation:
1. Class: WordAnalyzer
This class is responsible for analyzing a given text. It extracts meaningful statistics
about the words and characters in the text.
Methods:
1. __init__(self, text)
o Initializes the WordAnalyzer class with a text input.
o Splits the text into words (self.word_list) and computes:
Character frequencies: How often each character appears.
Word frequencies: How often each word appears.
2. analyze_char_frequency(self)
o Returns a Counter object with the frequency of each character in the
text (excluding spaces).
3. analyze_word_frequency(self)
o Returns a Counter object with the frequency of each unique word in
the text.
4. analyze_word_lengths(self)
o Analyzes the lengths of all the words in the text.
o Returns a Counter object where the keys are word lengths (in
characters) and values are their frequencies.
5. display_analysis(self)
o Prints the analysis results in a readable format, including:
Character frequency
Word frequency
Word length frequency
2. Class: WordGenerator
This class is responsible for generating new words based on the analyzed data.
Initialization:
o __init__(self, char_freq, word_list) stores the character frequencies and word list, then builds a transition matrix that tracks how likely one character is to follow another, based on the input text.
Methods:
1. build_transition_matrix(self)
o Creates a Markov Chain-like transition matrix for characters.
o For each character in the words, it calculates:
The frequency of every possible "next character."
Normalizes these frequencies to probabilities.
o Example:
For the word hello, the transitions would be:
h -> e
e -> l
l -> l
l -> o
(A runnable mini-example of this counting and normalization appears after this method list.)
2. generate_word(self, length)
o Uses the transition matrix to generate a word of a specified length.
o Starts with a random character and iteratively adds the next character
based on probabilities in the transition matrix.
3. random_word(self, length)
o Generates a completely random word of the specified length using
character frequencies.
o Characters are chosen independently of one another.
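As referenced in the build_transition_matrix example above, here is a minimal standalone sketch of the counting and normalization for a single word (hello):

from collections import Counter, defaultdict

matrix = defaultdict(Counter)
word = "hello"
for i in range(len(word) - 1):
    matrix[word[i]][word[i + 1]] += 1  # counts: h->e, e->l, l->l, l->o
for char, transitions in matrix.items():
    total = sum(transitions.values())
    for next_char in transitions:
        transitions[next_char] /= total
print({c: dict(t) for c, t in matrix.items()})
# {'h': {'e': 1.0}, 'e': {'l': 1.0}, 'l': {'l': 0.5, 'o': 0.5}}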
3. Main Script
This section ties everything together and demonstrates how to use the classes.
Steps:
1. Input text:
o The text is provided as a string: "hello world this is a simple example of
word analysis and generation".
2. Analyze the text:
o The WordAnalyzer class computes the following:
Frequency of each character (e.g., h appears twice, e appears 6 times, etc.).
Frequency of each word (e.g., hello appears once, world appears
once, etc.).
Distribution of word lengths (e.g., 5-character words appear
twice, etc.).
o Results are printed via display_analysis.
3. Generate new words:
o The WordGenerator class creates words using two approaches:
Markov Chain-based (generate_word):
Uses the character transition matrix to create more
context-aware words.
Random character-based (random_word):
Uses character frequency to randomly pick characters,
independent of sequence.
Example Outputs
Word Analysis:
Character Frequency:
h: 2
e: 6
l: 6
o: 5
w: 2
r: 3
d: 3
t: 2
i: 5
s: 5
a: 6
m: 2
p: 2
x: 1
f: 1
n: 4
y: 1
g: 1
Word Frequency:
hello: 1
world: 1
this: 1
is: 1
a: 1
simple: 1
example: 1
of: 1
word: 1
analysis: 1
and: 1
generation: 1
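Word Length Frequency:
5: 2
4: 2
2: 2
1: 1
6: 1
7: 1
8: 1
3: 1
10: 1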
Word Generation:
Generated Words:
Markov Chain Based: helor
Random Word: inaso
(Both generators are random, so the exact words vary from run to run.)
4. Create a sample list of at least 5 words with ambiguous senses and implement WSD using Python
Below is the code that uses the Lesk algorithm from the nltk library for WSD (the sample words and context sentences are illustrative; any list of at least 5 ambiguous words works):

import nltk
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

# Sample list of words with ambiguous senses
ambiguous_words = ["bank", "bat", "spring", "plant", "bark"]

# Sentences that provide context for the ambiguous words
sentences = [
    "The bank of the river was flooded after the heavy rain.",
    "He deposited the money in the bank.",
    "The bat flew out of the cave at night.",
    "Water flowed from the spring in the hillside.",
    "She works at a chemical plant.",
    "The bark of the tree was rough."
]

def disambiguate_sentence(sentence, word):
    # Lesk picks the WordNet sense whose definition best overlaps the context
    return lesk(word_tokenize(sentence), word)

# Disambiguating senses
for sentence in sentences:
    for word in ambiguous_words:
        if word in sentence:
            sense = disambiguate_sentence(sentence, word)
            print(f"Sentence: {sentence}")
            print(f"Word: {word}")
            if sense:
                print(f"Sense: {sense.name()}")
                print(f"Definition: {sense.definition()}")
            else:
                print("Sense: Not found.")
            print("-" * 50)
Sample Output
For the sentence "The bank of the river was flooded after the heavy rain.", the
output might be:
Sentence: The bank of the river was flooded after the heavy rain.
Word: bank
Sense: bank.n.01
Definition: sloping land (especially the slope beside a body of water)
5. Install the NLTK toolkit and perform stemming
First, install NLTK using pip:

pip install nltk
Then verify the installation from Python:
import nltk
print(nltk.__version__) # Check if it's installed and print the version
NLTK provides various datasets and models for natural language processing tasks.
To download the required data, follow these steps:
import nltk
nltk.download('punkt') # Required for tokenization
nltk.download('wordnet') # Optional: Required for lemmatization
Stemming is the process of reducing words to their root form. NLTK provides several
stemming algorithms. Here's an example using the Porter Stemmer:
Code Example:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'jumps', 'easily', 'happiness']
# Perform stemming
stemmed_words = [stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
Output:
Original Words: ['running', 'jumps', 'easily', 'happiness']
Stemmed Words: ['run', 'jump', 'easili', 'happi']
Alternatively, the Lancaster Stemmer, a more aggressive algorithm, can be used:

from nltk.stem import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
# Perform stemming
stemmed_words = [lancaster_stemmer.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)
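To compare the two algorithms side by side, a small driver like the following can be used (Lancaster generally produces shorter, more aggressive stems than Porter):

from nltk.stem import PorterStemmer, LancasterStemmer

words = ['running', 'jumps', 'easily', 'happiness']
porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in words:
    print(f"{word}: Porter={porter.stem(word)} | Lancaster={lancaster.stem(word)}")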
6. Create a sample list of at least 10 words, perform POS tagging, and find the POS for any given word
1. Run
2. Beautiful
3. Cat
4. Slowly
5. Play
6. Happiness
7. Beneath
8. Quickly
9. Book
10. Intelligent
Word         POS
Run          Verb/Noun
Beautiful    Adjective
Cat          Noun
Slowly       Adverb
Play         Verb/Noun
Happiness    Noun
Beneath      Preposition
Quickly      Adverb
Book         Noun/Verb
Intelligent  Adjective
You can implement a simple function in Python to find the POS of any word based
on this list.
# Sample POS Dictionary
pos_dict = {
"run": ["Verb", "Noun"],
"beautiful": ["Adjective"],
"cat": ["Noun"],
"slowly": ["Adverb"],
"play": ["Verb", "Noun"],
"happiness": ["Noun"],
"beneath": ["Preposition"],
"quickly": ["Adverb"],
"book": ["Noun", "Verb"],
"intelligent": ["Adjective"]
}
def get_pos(word):
    # Look the word up in the sample dictionary (case-insensitive)
    return pos_dict.get(word.lower(), ["POS not found in the sample list"])

# Example Usage
word = input("Enter a word: ")
pos = get_pos(word)
print(f"POS for '{word}': {pos}")
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

text = "The quick brown fox jumps over the lazy dog."
tokens = [t for t in word_tokenize(text) if t.isalpha()]  # keep alphabetic tokens only
print("Tokens:", tokens)

# Part-of-speech tagging
pos_tags = pos_tag(tokens)
print("\nPOS Tags:", pos_tags)
Explanation:
The program tokenizes the sentence, tags each token with its part of speech, and then derives a Porter stem and a WordNet lemma for every word.
Output:
For the input sentence "The quick brown fox jumps over the lazy dog.", the program
produces tokens, POS tags, stems, and lemmas.
Example output:
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'),
('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
Morphological Analysis:
Word: The | Stem: the | Lemma: the | POS: DT
Word: quick | Stem: quick | Lemma: quick | POS: JJ
Word: brown | Stem: brown | Lemma: brown | POS: NN
Word: fox | Stem: fox | Lemma: fox | POS: NN
Word: jumps | Stem: jump | Lemma: jump | POS: VBZ
Word: over | Stem: over | Lemma: over | POS: IN
Word: the | Stem: the | Lemma: the | POS: DT
Word: lazy | Stem: lazi | Lemma: lazy | POS: JJ
Word: dog | Stem: dog | Lemma: dog | POS: NN
7. Write a program to generate n-grams using NLTK

import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def generate_ngrams(text, n):
    """
    Generate n-grams from the given text.

    Args:
        text (str): The input text to process.
        n (int): The size of n-grams to generate (e.g., 2 for bigrams, 3 for trigrams).

    Returns:
        list: A list of n-grams, each represented as a tuple.
    """
    # Tokenize the text into words
    tokens = word_tokenize(text)
    # Generate n-grams
    n_grams = list(ngrams(tokens, n))
    return n_grams
# Example usage
if __name__ == "__main__":
    sample_text = "This is a simple example to generate n-grams using NLTK."
    n = 2  # For bigrams; change this value for different n-grams
    n_grams = generate_ngrams(sample_text, n)
    print(f"{n}-grams:")
    for gram in n_grams:
        print(gram)
How It Works:
The text is first tokenized with word_tokenize; nltk.util.ngrams then slides a window of size n over the token list and returns each window as a tuple.
Example Output:
For the input text: "This is a simple example to generate n-grams using NLTK." and
n=2:
2-grams:
('This', 'is')
('is', 'a')
('a', 'simple')
('simple', 'example')
('example', 'to')
('to', 'generate')
('generate', 'n-grams')
('n-grams', 'using')
('using', 'NLTK')
('NLTK', '.')
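Setting n = 3 in the same script should instead yield trigrams such as ('This', 'is', 'a'), ('is', 'a', 'simple'), ('a', 'simple', 'example'), and so on.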
A simple n-gram language model can also be built from scratch; a minimal version, assuming maximum-likelihood probabilities and greedy generation:

from collections import defaultdict

class NGramModel:
    def __init__(self, n):
        self.n = n  # The 'n' in n-grams
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.vocabulary = set()

    def train(self, corpus):
        # Count each n-gram and its (n-1)-word context across the corpus
        for sentence in corpus:
            padded = ['<s>'] * (self.n - 1) + sentence + ['</s>']
            for i in range(len(padded) - self.n + 1):
                ngram = tuple(padded[i:i + self.n])
                context, word = ngram[:-1], ngram[-1]
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1
                self.vocabulary.add(word)

    def probability(self, context, word):
        # Maximum-likelihood estimate of P(word | context)
        context = tuple(context)
        if self.context_counts[context] == 0:
            return 0.0
        return self.ngram_counts[context + (word,)] / self.context_counts[context]

    def generate_sentence(self, max_length=10):
        # Greedily choose the most probable next word until </s> or max_length
        context = tuple(['<s>'] * (self.n - 1))
        sentence = []
        for _ in range(max_length):
            word_probabilities = {word: self.probability(context, word)
                                  for word in self.vocabulary}
            next_word = max(word_probabilities, key=word_probabilities.get)
            if next_word == '</s>':
                break
            sentence.append(next_word)
            context = context[1:] + (next_word,)
        return sentence

# Example usage
if __name__ == "__main__":
    corpus = [
        ["the", "cat", "sat"],
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "barked"]
    ]
    model = NGramModel(2)  # bigram model
    model.train(corpus)
    print("Generated:", " ".join(model.generate_sentence()))

Key Features:
Counts n-grams and their contexts over a training corpus, estimates conditional probabilities by maximum likelihood, and generates sentences greedily from the learned model.
8. Using the NLTK package to convert an audio file to text and a text file to audio
The NLTK (Natural Language Toolkit) package is primarily used for natural language
processing tasks such as tokenization, stemming, lemmatization, and sentiment
analysis. However, it does not have built-in support for converting audio to text or
vice versa. For these tasks, specialized libraries such as SpeechRecognition (speech-to-text) and gTTS (text-to-speech) are used instead.
Requirements
Install the required libraries using pip:
pip install SpeechRecognition gTTS pydub
Script
import os
import speech_recognition as sr
from gtts import gTTS
# Example usage
audio_file = "example_audio.wav" # Replace with your audio file
text_file = "output_text.txt"
output_audio_file = "output_audio.mp3"
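The example usage above assumes two helper functions that the listing omits; a minimal sketch follows (the names audio_to_text and text_to_audio are illustrative):

def audio_to_text(audio_file, text_file):
    # Transcribe speech from a WAV file and save the transcription to a text file
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_file) as source:
        audio = recognizer.record(source)
    text = recognizer.recognize_google(audio)  # requires an internet connection
    with open(text_file, "w") as f:
        f.write(text)
    return text

def text_to_audio(text_file, output_audio_file):
    # Read a text file and synthesize its contents to an MP3 with gTTS
    with open(text_file, "r") as f:
        text = f.read()
    gTTS(text=text, lang="en").save(output_audio_file)

audio_to_text(audio_file, text_file)
text_to_audio(text_file, output_audio_file)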
Explanation
1. Audio to Text:
o Uses speech_recognition to transcribe speech from an audio file (e.g.,
WAV).
o Saves the transcription to a text file.
2. Text to Audio:
o Reads the content of a text file.
o Uses gTTS to convert the text to speech and saves it as an audio file
(e.g., MP3).