Lab Manual


1. Preprocessing a text document using NLTK in Python


a. Stopword elimination
b. Stemming
c. Lemmatization
d. POS tagging
e. Lexical analysis
ANS:-

PREPROCESSING TEXT DOCUMENT USING NLTK

Text preprocessing is an essential step in natural language processing (NLP) that involves
cleaning and transforming raw text data into a format that can be easily understood and analyzed
by machine learning algorithms. NLTK (Natural Language Toolkit) is a powerful library in Python that
provides tools for text processing and analysis. Here's an explanation of the key text preprocessing
steps using NLTK:
A. STOPWORD ELIMINATION:-

Stopwords are common words (e.g., "and", "the", "is") that often do not contribute much to the
meaning of a text. NLTK provides a list of stopwords that can be used to filter them out.

CODE:-

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Example text
text = "This is an example sentence for stopword elimination."

# Tokenize the text
words = word_tokenize(text)

# Remove stopwords
filtered_words = [word for word in words if word.lower() not in stopwords.words('english')]

print("ORIGINAL WORDS:", words)
print("WORDS AFTER STOPWORD ELIMINATION:", filtered_words)

OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER STOPWORD ELIMINATION: ['example', 'sentence', 'stopword',
'elimination', '.']
B. STEMMING:-

Stemming involves reducing words to their root or base form. NLTK provides different stemmers, such
as the Porter Stemmer.

CODE:-

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Example text
text = "This is an example sentence for stopword elimination."

# Tokenize the text
words = word_tokenize(text)

# Initialize the Porter Stemmer
porter_stemmer = PorterStemmer()

# Stem the words
stemmed_words = [porter_stemmer.stem(word) for word in words]

print("ORIGINAL WORDS:", words)
print("WORDS AFTER STEMMING:", stemmed_words)

OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER STEMMING: ['thi', 'is', 'an', 'exampl', 'sentenc', 'for', 'stopword',
'elimin', '.']
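
NLTK also ships with other stemmers. As an optional sketch (re-tokenizing the same example sentence), the Snowball stemmer for English can be swapped in and often produces slightly different stems than the Porter stemmer:

from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Example text
text = "This is an example sentence for stopword elimination."
words = word_tokenize(text)

# Initialize the Snowball ("Porter2") stemmer for English
snowball_stemmer = SnowballStemmer('english')
snowball_stems = [snowball_stemmer.stem(word) for word in words]

print("WORDS AFTER SNOWBALL STEMMING:", snowball_stems)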

C. LEMMATIZATION:-

Lemmatization is similar to stemming but involves reducing words to their base or dictionary form
(lemma). NLTK provides lemmatization functionality.

CODE:-

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')

# Example text
text = "This is an example sentence for stopword elimination."

# Tokenize the text
words = word_tokenize(text)

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the words
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print("ORIGINAL WORDS:", words)
print("WORDS AFTER LEMMATIZATION:", lemmatized_words)

OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
WORDS AFTER LEMMATIZATION: ['This', 'is', 'an', 'example', 'sentence', 'for',
'stopword', 'elimination', '.']
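
Note that WordNetLemmatizer treats every word as a noun by default, which is why the output above is unchanged. A minimal sketch (using made-up example words rather than the sentence above) shows how passing a part-of-speech hint changes the result:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS hint, words are treated as nouns
print(lemmatizer.lemmatize("running"))           # running
# With a verb POS hint, the dictionary form is returned
print(lemmatizer.lemmatize("running", pos="v"))  # run
# Adjectives can also be reduced to their base form
print(lemmatizer.lemmatize("better", pos="a"))   # good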

D. POS TAGGING:-

Part-of-speech (POS) tagging involves tagging each word with its grammatical part of speech (e.g.,
noun, verb, adjective). NLTK provides a function for POS tagging.

CODE:-

import nltk
from nltk.tokenize import word_tokenize

nltk.download('averaged_perceptron_tagger')

# Example text
text = "This is an example sentence for stopword elimination."

# Tokenize the text
words = word_tokenize(text)

# Perform POS tagging
pos_tags = nltk.pos_tag(words)

print("ORIGINAL WORDS:", words)
print("POS TAGS:", pos_tags)

OUTPUT:-
ORIGINAL WORDS: ['This', 'is', 'an', 'example', 'sentence', 'for', 'stopword',
'elimination', '.']
POS TAGS: [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'),
('sentence', 'NN'), ('for', 'IN'), ('stopword', 'NN'), ('elimination', 'NN'), ('.',
'.')]
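
E. LEXICAL ANALYSIS:-

In this context, lexical analysis essentially means splitting raw text into lexical units (sentences and word tokens), which the earlier examples already rely on through word_tokenize. A minimal sketch using NLTK's sentence and word tokenizers (the two-sentence example text is made up for illustration):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')

# Example text with two sentences
text = "This is an example sentence for stopword elimination. Here is another one."

# Split the text into sentences, then each sentence into word tokens
sentences = sent_tokenize(text)
tokens = [word_tokenize(sentence) for sentence in sentences]

print("SENTENCES:", sentences)
print("TOKENS PER SENTENCE:", tokens)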

These preprocessing steps are crucial for enhancing the quality of textual data before applying more
advanced NLP techniques or feeding it into machine learning models. The choice of which steps to include
in your preprocessing pipeline depends on the specific requirements of your NLP task.
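
As an optional illustration, the steps above can be combined into a single helper. This is just one possible pipeline (tokenize, lowercase, remove stopwords and punctuation, lemmatize), assuming the NLTK resources downloaded earlier:

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(text):
    # Tokenize, drop stopwords and punctuation, then lemmatize what remains
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    return [
        lemmatizer.lemmatize(token.lower())
        for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]

print(preprocess("This is an example sentence for stopword elimination."))
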
2. Sentiment analysis on customer reviews of products
ANS:-
Sentiment analysis on customer reviews is a common natural language processing (NLP) task that
involves determining the sentiment expressed in a piece of text, typically in the form of positive, negative,
or neutral. Analyzing customer reviews of products can provide valuable insights into customer satisfaction,
product performance, and areas for improvement.

CODE:-
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')

def analyze_sentiment(review):
    sid = SentimentIntensityAnalyzer()
    sentiment_score = sid.polarity_scores(review)['compound']

    if sentiment_score >= 0.05:
        return "Positive"
    elif sentiment_score <= -0.05:
        return "Negative"
    else:
        return "Neutral"

# Example customer reviews
reviews = [
    "I love this product! It's amazing.",
    "The quality is good, but the price is too high.",
    "Not satisfied with the product. It broke after a week.",
    "Fast delivery and excellent customer service!",
]

# Analyze sentiment for each review
for i, review in enumerate(reviews, 1):
    sentiment = analyze_sentiment(review)
    print(f"Review {i}: '{review}' - Sentiment: {sentiment}")

OUTPUT:-

Review 1: 'I love this product! It's amazing.' - Sentiment: Positive
Review 2: 'The quality is good, but the price is too high.' - Sentiment: Positive
Review 3: 'Not satisfied with the product. It broke after a week.' - Sentiment: Negative
Review 4: 'Fast delivery and excellent customer service!' - Sentiment: Positive
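
To turn the individual labels into the kind of product-level insight mentioned above, a simple extension (a sketch that reuses analyze_sentiment and reviews from the code above) is to aggregate sentiment counts across all reviews:

from collections import Counter

# Count how many reviews fall into each sentiment class
sentiment_counts = Counter(analyze_sentiment(review) for review in reviews)

total = sum(sentiment_counts.values())
for label in ("Positive", "Negative", "Neutral"):
    count = sentiment_counts.get(label, 0)
    print(f"{label}: {count} review(s) ({100.0 * count / total:.0f}%)")
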
3. WEB ANALYTICS
A. WEB USAGE DATA (WEB SERVER LOG DATA, CLICKSTREAM ANALYSIS)
B. HYPERLINK DATA
ANS:-
Web analytics involves analyzing data related to the behavior of users on a website. There are different
types of web analytics data, including web usage data (web server log data, clickstream analysis) and
hyperlink data. Let's discuss each type and provide a basic example of Python code for analyzing web usage
data using web server log data.

a. Web Usage Data (Web Server Log Data, Clickstream Analysis):

Web server log data contains information about user requests to a web server. Clickstream analysis
involves analyzing the sequence of pages and actions that a user takes on a website. Below is an example
Python code snippet using the pandas library for analyzing web server log data. Note that this is a basic
example, and in a real-world scenario, you might need to handle more complex log formats and consider
additional factors.

CODE:-

import pandas as pd

# Sample web server log data (columns: IP, Date, Page)
log_data = [
    {'IP': '192.168.1.1', 'Date': '2023-01-01 10:00:00', 'Page': '/home'},
    {'IP': '192.168.1.2', 'Date': '2023-01-01 10:05:00', 'Page': '/products'},
    {'IP': '192.168.1.1', 'Date': '2023-01-01 10:10:00', 'Page': '/about'},
    # Add more log entries as needed
]

# Create a DataFrame from the log data
df = pd.DataFrame(log_data)

# Extract useful information (e.g., most visited pages)
most_visited_pages = df['Page'].value_counts()

print("Most visited pages:")
print(most_visited_pages)

OUTPUT:-
Most visited pages:
/home 1
/products 1
/about 1
Name: Page, dtype: int64
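
For the clickstream part specifically, the same DataFrame can be used to reconstruct the sequence of pages each visitor (identified here by IP address) followed. A minimal sketch, reusing df from the code above:

# Parse the Date column as timestamps so the ordering is chronological
df['Date'] = pd.to_datetime(df['Date'])

# Reconstruct each visitor's clickstream as an ordered list of pages
clickstreams = df.sort_values('Date').groupby('IP')['Page'].apply(list)

print("Clickstream per visitor:")
print(clickstreams)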

b. Hyperlink Data:
Hyperlink data involves analyzing the relationships between different web pages through hyperlinks.
You can use techniques like web scraping or APIs to gather hyperlink data. Here's a basic example using the
requests and BeautifulSoup libraries to scrape hyperlink data from a webpage.

CODE:-

import requests
from bs4 import BeautifulSoup

# Sample URL for demonstration purposes
url = 'https://example.com'

# Send a request to the webpage
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks on the page
hyperlinks = soup.find_all('a', href=True)

# Extract and print the href attributes of the hyperlinks
for link in hyperlinks:
    print(link['href'])

OUTPUT:-
https://www.iana.org/domains/example
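
Relative links can also be normalized into absolute URLs so that the page-to-page relationships form a simple edge list (source page, target page). A minimal sketch, reusing url and hyperlinks from the code above:

from urllib.parse import urljoin

# Build (source, target) pairs, converting relative hrefs to absolute URLs
edges = [(url, urljoin(url, link['href'])) for link in hyperlinks]

for source, target in edges:
    print(source, "->", target)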

4. SEARCH ENGINE OPTIMIZATION - IMPLEMENT SPAMDEXING

ANS:-

Search Engine Optimization (SEO) refers to the set of activities performed to increase the number of
desirable visitors who come to your site via search engines. These activities may include things you do to the
site itself, such as making changes to your text and HTML code, or formatting text and documents to
communicate directly with the search engine.
Spamdexing:

An SEO tactic, technique, or method is considered Black Hat or spamdexing if it involves any of the following
(a small illustrative sketch follows the list):
- Trying to improve rankings in ways that are disapproved of by the search engines and/or involve deception.
- Redirecting users from a page that is built for search engines to one that is more human friendly.
- Redirecting users to a page that is different from the page the search engine ranked.
- Serving one version of a page to search engine spiders/bots and another version to human visitors. This tactic is called cloaking.
- Using hidden or invisible text, for example text in the page's background color, in a tiny font size, or hidden within the HTML code such as "noframes" sections.
- Repeating keywords in the meta tags and using keywords that are unrelated to the site's content. This is called meta tag stuffing.
- Calculated placement of keywords within a page to raise the keyword count, variety, and density of the page. This is called keyword stuffing.
- Creating low-quality web pages that contain very little content but are instead stuffed with very similar keywords and phrases. These pages are called doorway or gateway pages.
- Creating mirror websites by hosting multiple sites with conceptually similar content but different URLs.
- Creating a rogue copy of a popular website that shows content similar to the original to a web crawler, but redirects web surfers to unrelated or malicious websites. This is called page hijacking.
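
For the "implement" part of this exercise, the sketch below only generates a toy HTML page that demonstrates two of the tactics listed above (keyword stuffing and hidden text) so that they are easy to recognize. The keywords and file name are made up for illustration, and pages like this are penalized by real search engines.

# Demonstration only: generate a page that uses keyword stuffing and hidden text
keywords = ["cheap shoes", "buy shoes", "best shoes", "shoes online"]  # made-up keywords

# Meta tag stuffing: repeating (possibly unrelated) keywords in the meta tags
meta_keywords = ", ".join(keywords * 5)

# Hidden text: keywords placed in a block that human visitors never see
hidden_block = '<div style="display:none">' + " ".join(keywords * 20) + '</div>'

page = f"""<html>
<head>
  <title>{keywords[0]}</title>
  <meta name="keywords" content="{meta_keywords}">
</head>
<body>
  <h1>{keywords[0]}</h1>
  {hidden_block}
</body>
</html>"""

# Write the toy page to a file so it can be inspected in a browser
with open("spamdexing_demo.html", "w") as f:
    f.write(page)

print("Generated spamdexing_demo.html with", len(keywords) * 20, "hidden keyword phrases")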

5. USE GOOGLE ANALYTICS TOOLS TO IMPLEMENT THE FOLLOWING


A. CONVERSION STATISTICS
B. VISITOR PROFILES
ANS:-

To use Google Analytics to implement conversion statistics and retrieve visitor profiles using Python,
you can use the Google Analytics Data API. To access this API, you need to set up a project in the
Google Cloud Console, enable the Analytics Data API, and obtain the necessary credentials. Once you have the
credentials, you can use the google-analytics-data Python library to query the API.

A. CONVERSION STATISTICS
CODE:-

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google.auth.exceptions import RefreshError
from google.analytics.data_v1alpha import BetaAnalyticsDataClient
from google.analytics.data_v1alpha.types import DateRange, Metric, Dimension

def get_conversion_statistics(view_id, start_date, end_date):
    try:
        # Load credentials from a file
        credentials = Credentials.from_authorized_user_file('path/to/credentials.json')
        credentials.refresh(Request())

        # Create a Google Analytics Data API client
        client = BetaAnalyticsDataClient(credentials=credentials)

        # Query for conversion statistics
        response = client.run_report(
            entity="properties/" + view_id,
            date_ranges=[DateRange(start_date=start_date, end_date=end_date)],
            dimensions=[Dimension(name="date")],
            metrics=[Metric(name="totalConversions"), Metric(name="totalConversionValue")]
        )

        # Print the results
        for row in response.rows:
            print(f"Date: {row.dimension_values[0].value}")
            print(f"Total Conversions: {row.metric_values[0].value}")
            print(f"Total Conversion Value: {row.metric_values[1].value}\n")

    except RefreshError:
        print("Credentials refresh failed. Check the credentials file.")

# Example usage
get_conversion_statistics("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")

OUTPUT:-
Date: 2023-01-01

Total Conversions: 10
Total Conversion Value: 500.0

Date: 2023-01-02
Total Conversions: 15
Total Conversion Value: 750.0

B. VISITOR PROFILES
CODE:-
from google.analytics.data_v1alpha import BetaAnalyticsDataClient
from google.analytics.data_v1alpha.types import Dimension, DateRange, Metric

def get_visitor_profiles(view_id, start_date, end_date):
    # Create a Google Analytics Data API client
    client = BetaAnalyticsDataClient()

    # Query for visitor profiles
    response = client.run_report(
        entity="properties/" + view_id,
        date_ranges=[DateRange(start_date=start_date, end_date=end_date)],
        dimensions=[
            Dimension(name="userAgeBracket"),
            Dimension(name="userGender"),
            Dimension(name="userType"),
        ],
        metrics=[Metric(name="activeUsers")]
    )

    # Print the results
    for row in response.rows:
        print("User Age Bracket:", row.dimension_values[0].value)
        print("User Gender:", row.dimension_values[1].value)
        print("User Type:", row.dimension_values[2].value)
        print("Active Users:", row.metric_values[0].value, "\n")

# Example usage
get_visitor_profiles("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")

OUTPUT:-
User Age Bracket: 18-24
User Gender: Male
User Type: New Visitor
Active Users: 100

User Age Bracket: 25-34
User Gender: Female
User Type: Returning Visitor
Active Users: 150

6. USE GOOGLE ANALYTICS TOOLS TO IMPLEMENT THE TRAFFIC SOURCES.


ANS:-
To retrieve traffic source information from Google Analytics using Python, you can use the Google
Analytics Data API, as in the previous exercise.

CODE:-

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google.analytics.data_v1alpha import BetaAnalyticsDataClient
from google.analytics.data_v1alpha.types import Dimension, DateRange, Metric

def get_traffic_sources(view_id, start_date, end_date):
    try:
        # Load credentials from a file
        credentials = Credentials.from_authorized_user_file('path/to/credentials.json')
        credentials.refresh(Request())

        # Create a Google Analytics Data API client
        client = BetaAnalyticsDataClient(credentials=credentials)

        # Query for traffic sources
        response = client.run_report(
            entity="properties/" + view_id,
            date_ranges=[DateRange(start_date=start_date, end_date=end_date)],
            dimensions=[
                Dimension(name="source"),
                Dimension(name="medium"),
            ],
            metrics=[Metric(name="sessions"), Metric(name="pageViews")]
        )

        # Print the results
        for row in response.rows:
            print(f"Source: {row.dimension_values[0].value}")
            print(f"Medium: {row.dimension_values[1].value}")
            print(f"Sessions: {row.metric_values[0].value}")
            print(f"Page Views: {row.metric_values[1].value}\n")

    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
get_traffic_sources("YOUR_VIEW_ID", "2023-01-01", "2023-01-31")

OUTPUT:-
Source: google
Medium: organic
Sessions: 100
Page Views: 500

Source: (other source)
Medium: (other medium)
Sessions: (session count)
Page Views: (page views count)
