Lab Manual
Text preprocessing is an essential step in natural language processing (NLP) that involves
cleaning and transforming raw text data into a format that can be easily understood and analyzed
by machine learning algorithms. NLTK (Natural Language Toolkit) is a powerful library in Python that
provides tools for text processing and analysis. Here's an explanation of the key text preprocessing
steps using NLTK.
A. STOPWORD ELIMINATION:-
Stopwords are common words (e.g., "and", "the", "is") that often do not contribute much to the
meaning of a text. NLTK provides a list of stopwords that can be used to filter them out.
CODE:-
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
# Example text
text = "This is an example sentence for stopword elimination."
# Tokenize the text into words
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("WORDS AFTER STOPWORD ELIMINATION:", filtered_words)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER STOPWORD ELIMINATION: ['example', 'sentence', 'stopword',
'elimination', '.']
B. STEMMING:-
Stemming involves reducing words to their root or base form. NLTK provides different stemmers, such
as the Porter Stemmer.
CODE:-
from nltk.stem import PorterStemmer
# Example text
text = "This is an example sentence for stopword elimination."
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Reduce each word to its stem with the Porter Stemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("WORDS AFTER STEMMING:", stemmed_words)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER STEMMING: ['thi', 'is', 'an', 'exampl', 'sentenc', 'for', 'stopword',
'elimin', '.']
C. LEMMATIZATION :-
Lemmatization is similar to stemming but involves reducing words to their base or dictionary form
(lemma). NLTK provides lemmatization functionality.
CODE:-
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# Example text
text = "This is an example sentence for stopword elimination."
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Reduce each word to its dictionary form (lemma)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("WORDS AFTER LEMMATIZATION:", lemmatized_words)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER LEMMATIZATION: ['This', 'is', 'an', 'example', 'sentence', 'for',
'stopword', 'elimination', '.']
D. POS TAGGING:-
Part-of-speech (POS) tagging involves tagging each word with its grammatical part of speech (e.g.,
noun, verb, adjective). NLTK provides a function for POS tagging.
CODE:-
nltk.download('averaged_perceptron_tagger')
# Example text
text = "This is an example sentence for stopword elimination."
words = word_tokenize(text)
print("ORIGINAL WORDS:", words)
# Tag each token with its part of speech
pos_tags = nltk.pos_tag(words)
print("POS TAGS:", pos_tags)
OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
POS TAGS: [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'),
('sentence', 'NN'), ('for', 'IN'), ('stopword', 'NN'), ('elimination', 'NN'), ('.',
'.')]
These preprocessing steps are crucial for enhancing the quality of textual data before applying more
advanced NLP techniques or feeding it into machine learning models. The choice of which steps to include
in your preprocessing pipeline depends on the specific requirements of your NLP task.
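The individual steps above are usually chained into one pipeline. As a minimal, library-free sketch of that idea (using a regex tokenizer and a tiny illustrative stopword list rather than NLTK's tokenizer and full stopword corpus):

```python
import re

# A small stopword list for illustration only; NLTK's list is far larger.
STOPWORDS = {"this", "is", "an", "for", "the", "and", "a", "of"}

def preprocess(text):
    # Tokenize: lowercase alphabetic runs (word_tokenize is more robust)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword elimination
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("This is an example sentence for stopword elimination."))
# ['example', 'sentence', 'stopword', 'elimination']
```

In practice you would swap in NLTK's tokenizer, stopword list, and a stemmer or lemmatizer as shown in the sections above; the pipeline shape stays the same.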
2. Sentiment analysis of customer reviews on products
ANS:-
Sentiment analysis on customer reviews is a common natural language processing (NLP) task that
involves determining the sentiment expressed in a piece of text, typically in the form of positive, negative,
or neutral. Analyzing customer reviews on products can provide valuable insights into customer satisfaction,
product performance, and areas for improvement.
CODE:-
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
def analyze_sentiment(review):
    sid = SentimentIntensityAnalyzer()
    # Compound score ranges from -1 (most negative) to +1 (most positive)
    score = sid.polarity_scores(review)['compound']
    return 'Positive' if score >= 0.05 else 'Negative' if score <= -0.05 else 'Neutral'
# Example usage
print(analyze_sentiment("This product works great and arrived on time."))
OUTPUT:-
a. Web Server Log Data:
Web server log data contains information about user requests to a web server. Clickstream analysis
involves analyzing the sequence of pages and actions that a user takes on a website. Below is an example
Python code snippet using the pandas library for analyzing web server log data. Note that this is a basic
example, and in a real-world scenario, you might need to handle more complex log formats and consider
additional factors.
CODE:-
import pandas as pd
# Sample log records (hypothetical); real logs would be parsed from a file
log_data = pd.DataFrame({'Page': ['/home', '/products', '/about']})
# Count how often each page was requested
print("Most visited pages:")
print(log_data['Page'].value_counts())
OUTPUT:-
Most visited pages:
/home 1
/products 1
/about 1
Name: Page, dtype: int64
b. Hyperlink Data:
Hyperlink data involves analyzing the relationships between different web pages through hyperlinks.
You can use techniques like web scraping or APIs to gather hyperlink data. Here's a basic example using the
requests and BeautifulSoup libraries to scrape hyperlink data from a webpage
CODE:-
import requests
from bs4 import BeautifulSoup
# Fetch the page and print every hyperlink found in it
soup = BeautifulSoup(requests.get("https://example.com").text, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])
OUTPUT:-
https://www.iana.org/domains/example
Search Engine Optimization (SEO) refers to the set of activities performed to increase the number of
desirable visitors who come to your site via search engines. These activities may include things you do to
the site itself, such as making changes to your text and HTML code, or formatting text or documents to
communicate directly with the search engine.
Spamdexing:
An SEO tactic, technique or method is considered Black Hat or Spamdexing if it does any of the following:
Tries to improve rankings in ways that are disapproved of by the search engines and/or involve deception.
Redirecting users from a page that is built for search engines to one that is more human friendly.
Redirecting users to a page that was different from the page the search engine ranked.
Serving one version of a page to search engine spiders/bots and another version to human visitors.
This is called Cloaking SEO tactic.
Using hidden or invisible text, such as text in the page background color, text in a tiny font size, or
text hidden within the HTML code, such as "no frame" sections.
Repeating keywords in the Meta tags, and using keywords that are unrelated to the site's content.
This is called Meta tag stuffing.
Calculated placement of keywords within a page to raise the keyword count, variety, and density of
the page. This is called Keyword stuffing.
Creating low-quality web pages that contain very little content but are instead stuffed with very
similar keywords and phrases. These pages are called Doorway or Gateway Pages.
Creating mirror web sites by hosting multiple web sites, all with conceptually similar content but using
different URLs.
Creating a rogue copy of a popular web site which shows contents similar to the original to a web
crawler, but redirects web surfers to unrelated or malicious web sites. This is called Page hijacking.
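Keyword stuffing in particular can be approximated with a simple density check: the fraction of all words on a page that are a given keyword. The sketch below is illustrative only; the tokenization and any threshold you might apply are assumptions, and real search engines use far more sophisticated signals.

```python
import re
from collections import Counter

def keyword_density(text, keyword):
    # Fraction of all words on the page that are the given keyword
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)

# A stuffed page: 4 of the 10 words are "cheap"
page = "buy cheap shoes cheap shoes cheap shoes online cheap shoes"
print(keyword_density(page, "cheap"))
# 0.4
```

A density this high for a single keyword would be a red flag; well-written pages rarely repeat one term in a large share of their text.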
To use Google Analytics to implement conversion statistics and retrieve visitor profiles using Python,
you can use the Google Analytics Reporting API. To access this API, you will need to set up a project in the
Google Cloud Console, enable the Analytics API, and obtain the necessary credentials. Once you have the
credentials, you can use the google-analytics-data Python library to query the API.
A. CONVERSION STATISTICS
CODE:-
except RefreshError:
print("Credentials refresh failed. Check the credentials file.")
# Example usage
get_conversion_statistics("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")
OUTPUT:-
Date: 2023-01-01
Total Conversions: 10
Total Conversion Value: 500.0
Date: 2023-01-02
Total Conversions: 15
Total Conversion Value: 750.0
B. VISITOR PROFILES
CODE:-
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import Dimension, DateRange, Metric
# Example usage
get_visitor_profiles("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")
OUTPUT:-
User Age Bracket: 18-24
User Gender: Male
User Type: New Visitor
Active Users: 100
C. TRAFFIC SOURCES
CODE:-
except Exception as e:
print(f"An error occurred: {e}")
# Example usage
get_traffic_sources("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")
OUTPUT:-
Source: google
Medium: organic
Sessions: 100
Page Views: 500