Lab Manual
Text preprocessing is an essential step in natural language processing (NLP) that involves
cleaning and transforming raw text data into a format that can be easily understood and analyzed
by machine learning algorithms. NLTK (Natural Language Toolkit) is a powerful library in Python that
provides tools for text processing and analysis. Here's an explanation of the key text preprocessing
steps using NLTK.
A. STOPWORD ELIMINATION:-
Stopwords are common words (e.g., "and", "the", "is") that often do not contribute much to the
meaning of a text. NLTK provides a list of stopwords that can be used to filter them out.
CODE:-
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
# Example text
text = "This is an example sentence for stopword elimination."
# Tokenize the text into words
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("WORDS AFTER STOPWORD ELIMINATION:", filtered_words)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER STOPWORD ELIMINATION: ['example', 'sentence', 'stopword',
'elimination', '.']
B. STEMMING:-
Stemming involves reducing words to their root or base form. NLTK provides different stemmers, such
as the Porter Stemmer.
CODE:-
from nltk.stem import PorterStemmer
# Example text
text = "This is an example sentence for stopword elimination."
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Reduce each word to its stem with the Porter Stemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("WORDS AFTER STEMMING:", stemmed_words)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER STEMMING: ['thi', 'is', 'an', 'exampl', 'sentenc', 'for', 'stopword',
'elimin', '.']
C. LEMMATIZATION :-
Lemmatization is similar to stemming but involves reducing words to their base or dictionary form
(lemma). NLTK provides lemmatization functionality.
CODE:-
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# Example text
text = "This is an example sentence for stopword elimination."
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Reduce each word to its dictionary form (lemma)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("WORDS AFTER LEMMATIZATION:", lemmatized_words)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER LEMMATIZATION: ['This', 'is', 'an', 'example', 'sentence', 'for',
'stopword', 'elimination', '.']
D. POS TAGGING:-
Part-of-speech (POS) tagging involves tagging each word with its grammatical part of speech (e.g.,
noun, verb, adjective). NLTK provides a function for POS tagging.
CODE:-
nltk.download('averaged_perceptron_tagger')
# Example text
text = "This is an example sentence for stopword elimination."
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Tag each token with its part of speech
pos_tags = nltk.pos_tag(words)
print("POS TAGS:", pos_tags)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
POS TAGS: [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'),
('sentence', 'NN'), ('for', 'IN'), ('stopword', 'NN'), ('elimination', 'NN'), ('.',
'.')]
These preprocessing steps are crucial for enhancing the quality of textual data before applying more
advanced NLP techniques or feeding it into machine learning models. The choice of which steps to include
in your preprocessing pipeline depends on the specific requirements of your NLP task.
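The individual steps above are usually chained into one pipeline. As a minimal, library-free sketch of that idea (using a regex tokenizer and a tiny illustrative stopword list rather than NLTK's tokenizer and full stopword corpus):

```python
import re

# A small stopword list for illustration only; NLTK's list is far larger.
STOPWORDS = {"this", "is", "an", "for", "the", "and", "a", "of"}

def preprocess(text):
    # Tokenize: lowercase alphabetic runs (word_tokenize is more robust)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword elimination
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("This is an example sentence for stopword elimination."))
# ['example', 'sentence', 'stopword', 'elimination']
```

In practice you would swap in NLTK's tokenizer, stopword list, and a stemmer or lemmatizer as shown in the sections above; the pipeline shape stays the same.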
2. Sentiment analysis of customer reviews on products
ANS:-
Sentiment analysis on customer reviews is a common natural language processing (NLP) task that
involves determining the sentiment expressed in a piece of text, typically in the form of positive, negative,
or neutral. Analyzing customer reviews on products can provide valuable insights into customer satisfaction,
product performance, and areas for improvement.
CODE:-
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
def analyze_sentiment(review):
    sid = SentimentIntensityAnalyzer()
    # Compound score ranges from -1 (most negative) to +1 (most positive)
    score = sid.polarity_scores(review)['compound']
    return 'Positive' if score >= 0.05 else 'Negative' if score <= -0.05 else 'Neutral'
# Example usage
print(analyze_sentiment("This product works great and arrived on time."))
OUTPUT:-
a. Web Server Log Data:
Web server log data contains information about user requests to a web server. Clickstream analysis
involves analyzing the sequence of pages and actions that a user takes on a website. Below is an example
Python code snippet using the pandas library for analyzing web server log data. Note that this is a basic
example, and in a real-world scenario, you might need to handle more complex log formats and consider
additional factors.
CODE:-
import pandas as pd
# Sample log records (hypothetical); real logs would be parsed from a file
log_data = pd.DataFrame({'Page': ['/home', '/products', '/about']})
# Count how often each page was requested
print("Most visited pages:")
print(log_data['Page'].value_counts())
OUTPUT:-
Most visited pages:
/home 1
/products 1
/about 1
Name: Page, dtype: int64
b. Hyperlink Data:
Hyperlink data involves analyzing the relationships between different web pages through hyperlinks.
You can use techniques like web scraping or APIs to gather hyperlink data. Here's a basic example using the
requests and BeautifulSoup libraries to scrape hyperlink data from a webpage
CODE:-
import requests
from bs4 import BeautifulSoup
# Fetch the page and print every hyperlink found in it
soup = BeautifulSoup(requests.get("https://example.com").text, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])
OUTPUT:-
https://www.iana.org/domains/example
Search Engine Optimization (SEO) refers to the set of activities performed to increase the number of
desirable visitors who come to your site via search engines. These activities may include things you do to
the site itself, such as making changes to your text and HTML code, or formatting text or documents to
communicate directly with the search engine.
Spamdexing:
An SEO tactic, technique or method is considered Black Hat or Spamdexing if it does any of the following:
Tries to improve rankings in ways that are disapproved of by the search engines and/or involve deception.
Redirecting users from a page that is built for search engines to one that is more human friendly.
Redirecting users to a page that was different from the page the search engine ranked.
Serving one version of a page to search engine spiders/bots and another version to human visitors.
This is called Cloaking SEO tactic.
Using hidden or invisible text, such as text in the page background color, text in a tiny font size, or
text hidden within the HTML code, such as "no frame" sections.
Repeating keywords in the Meta tags, and using keywords that are unrelated to the site's content.
This is called Meta tag stuffing.
Calculated placement of keywords within a page to raise the keyword count, variety, and density of
the page. This is called Keyword stuffing.
Creating low-quality web pages that contain very little content but are instead stuffed with very
similar keywords and phrases. These pages are called Doorway or Gateway Pages.
Creating mirror web sites by hosting multiple web sites, all with conceptually similar content but using
different URLs.
Creating a rogue copy of a popular web site which shows contents similar to the original to a web
crawler, but redirects web surfers to unrelated or malicious web sites. This is called Page hijacking.
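Keyword stuffing in particular can be approximated with a simple density check: the fraction of all words on a page that are a given keyword. The sketch below is illustrative only; the tokenization and any threshold you might apply are assumptions, and real search engines use far more sophisticated signals.

```python
import re
from collections import Counter

def keyword_density(text, keyword):
    # Fraction of all words on the page that are the given keyword
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)

# A stuffed page: 4 of the 10 words are "cheap"
page = "buy cheap shoes cheap shoes cheap shoes online cheap shoes"
print(keyword_density(page, "cheap"))
# 0.4
```

A density this high for a single keyword would be a red flag; well-written pages rarely repeat one term in a large share of their text.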
To use Google Analytics to implement conversion statistics and retrieve visitor profiles using Python,
you can use the Google Analytics Reporting API. To access this API, you will need to set up a project in the
Google Cloud Console, enable the Analytics API, and obtain the necessary credentials. Once you have the
credentials, you can use the google-analytics-data Python library to query the API.
A. CONVERSION STATISTICS
CODE:-
except RefreshError:
print("Credentials refresh failed. Check the credentials file.")
# Example usage
get_conversion_statistics("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")
OUTPUT:-
Date: 2023-01-01
Total Conversions: 10
Total Conversion Value: 500.0
Date: 2023-01-02
Total Conversions: 15
Total Conversion Value: 750.0
B. VISITOR PROFILES
CODE:-
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import Dimension, DateRange, Metric
# Example usage
get_visitor_profiles("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")
OUTPUT:-
User Age Bracket: 18-24
User Gender: Male
User Type: New Visitor
Active Users: 100
C. TRAFFIC SOURCES
CODE:-
except Exception as e:
print(f"An error occurred: {e}")
# Example usage
get_traffic_sources("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")
OUTPUT:-
Source: google
Medium: organic
Sessions: 100
Page Views: 500