Experiment 2: Web Scraping and Data Analysis
Scrapy is an open-source web crawling framework written in Python. It is
used for extracting data from websites and is particularly well suited for large-scale web
scraping tasks. Scrapy provides a set of tools for efficient scraping and data extraction,
allowing developers to write spiders (crawlers) that navigate websites, extract data, and save
it in various formats.
Web scraping, also known as web harvesting or web data extraction, is the process of
automatically collecting information from websites. It involves fetching the content of web
pages and extracting specific data from them, which can then be stored and used for various
purposes, such as data analysis, market research, content aggregation, and more.
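As a minimal illustration of this fetch-and-extract process, here is a sketch using the requests and BeautifulSoup libraries rather than Scrapy (the URL is only a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML of a web page
response = requests.get('https://example.com')

# Parse the HTML and extract specific pieces of data
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.title.string                             # contents of the <title> tag
links = [a.get('href') for a in soup.find_all('a')]   # all link targets on the page

print(title)
print(links)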
Components of Scrapy:
1. Spiders: These are classes where you define how to scrape a website. Spiders contain the
initial requests and the logic to process the responses.
2. Selectors: Used to extract data from HTML or XML using CSS selectors or XPath.
3. Item: A container for the scraped data. It defines the structure of the data you want to
extract.
4. Pipelines: Process and store scraped items. Pipelines can clean data, validate it, and store
it in databases or files.
5. Middlewares: Customizable hooks that process requests and responses. They allow you
to modify requests before they are sent and responses before they are processed by
spiders.
For example, inside a spider's parse() method you can yield the extracted data as an item and follow links to continue the crawl:

def parse(self, response):
    # Extract data using CSS selectors
    title = response.css('title::text').get()
    paragraphs = response.css('p::text').getall()
    # Yield the extracted data as an item
    yield {
        'title': title,
        'paragraphs': paragraphs,
    }
    # Follow every link on the page and parse it with the same callback
    for a in response.css('a::attr(href)').getall():
        yield response.follow(a, callback=self.parse)
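The Item and Pipeline components can likewise be made concrete with a short sketch; the names ArticleItem and CleanTextPipeline below are illustrative, not part of Scrapy itself:

import scrapy

class ArticleItem(scrapy.Item):
    # Define the structure of the scraped data
    title = scrapy.Field()
    paragraphs = scrapy.Field()

class CleanTextPipeline:
    def process_item(self, item, spider):
        # Clean the data before it is stored
        item['title'] = item['title'].strip()
        return item

In a real project, a pipeline like this is enabled through the ITEM_PIPELINES setting in the project's settings.py.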
JSON Syntax
JSON (JavaScript Object Notation) represents data in a structured way using key-value
pairs. The building blocks of JSON syntax are described below.
Objects: An object is an unordered set of key-value pairs enclosed in curly braces {}. Keys
are strings, and values can be strings, numbers, objects, arrays, booleans, or null.
{
  "name": "John Doe",
  "age": 30,
  "isStudent": false,
  "address": {
    "street": "123 Main St",
    "city": "Anytown"
  },
  "courses": ["Math", "Science", "History"]
}
Arrays: An array is an ordered collection of values enclosed in square brackets []. Values in
an array can be of any type.
[
  "Apple",
  "Banana",
  "Cherry"
]
Data Types
String: Enclosed in double quotes.
"name": "John Doe"
Number: Can be an integer or a floating-point number.
"age": 30
Boolean: Can be true or false.
"isStudent": false
Null: Represents an empty value.
"middleName": null
Object: A collection of key-value pairs.
"address": {
"street": "123 Main St",
"city": "Anytown"
}
Array: An ordered list of values.
"courses": ["Math", "Science", "History"]
Example
Here is an example of how JSON is used to represent user information:
{
  "user": {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "isAdmin": true,
    "preferences": {
      "language": "en",
      "theme": "dark"
    },
    "friends": ["Bob", "Charlie", "David"]
  }
}
In summary, JSON is a versatile and widely used format for data interchange that is easy to
read, write, and parse across different programming languages.
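For instance, Python's built-in json module can parse a JSON string into native data types and serialize them back (a minimal sketch):

import json

text = '{"name": "John Doe", "age": 30, "isStudent": false}'

# Parse the JSON text into a Python dictionary
data = json.loads(text)
print(data['name'])      # John Doe
print(data['age'] + 1)   # 31

# Serialize the dictionary back to a JSON string
print(json.dumps(data, indent=2))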
Experiment 2 L1: Scrape data from a website using libraries like BeautifulSoup or
Scrapy.
Question: Write a Python script using the Scrapy framework to scrape data from a
website. The script should extract the title and the first paragraph of the webpage at
'https://presidencyuniversity.in' and save the extracted data in JSON format to a file
named output.json.
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        'https://presidencyuniversity.in',
    ]

    def parse(self, response):
        # Extracting data from the website
        title = response.css('title::text').get()
        paragraph = response.css('p::text').get()
        # You can further process the extracted data as needed
        # For example, printing it out:
        print("Title:", title)
        print("Paragraph:", paragraph)
        # Or you can yield the data to save it to a file or database
        yield {
            'title': title,
            'paragraph': paragraph,
        }

# To run the spider
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',     # Output format
    'FEED_URI': 'output.json'  # Output file
})
process.crawl(MySpider)
process.start()
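Assuming the script is saved as myspider.py, running it with python myspider.py executes the crawl and writes output.json. Note that in newer Scrapy releases the FEED_FORMAT and FEED_URI settings are deprecated in favor of the FEEDS setting, though they still work. Alternatively, the spider can be run from the command line without the CrawlerProcess boilerplate:

scrapy runspider myspider.py -o output.json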
Experiment 2 L2: Perform data analysis and visualization on the scraped data using
Pandas and Matplotlib.
import pandas as pd
import matplotlib.pyplot as plt
# Assuming you have already scraped the data and stored it in a JSON file
# If not, you can create a DataFrame with dummy data for demonstration purposes
data = {
    'title': ['Presidency', 'University'],
    'paragraph': ['Education is the', 'passport to the future']
}
# Creating a DataFrame from the scraped data
df = pd.DataFrame(data)
# Visualization
# Bar plot of the length of titles
plt.figure(figsize=(10, 6))
plt.bar(df['title'], df['title'].str.len(), color='skyblue')
plt.xlabel('Titles')
plt.ylabel('Length')
plt.title('Length of Titles')
plt.xticks(rotation=45)
plt.show()
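To run the same analysis on the actual scraped data rather than the dummy values above, the output.json feed produced by the spider can be loaded directly into a DataFrame (a sketch, assuming output.json exists and contains items with 'title' and 'paragraph' keys as produced by the spider in L1):

import pandas as pd

# Scrapy's JSON feed is a list of items, which read_json maps to DataFrame rows
df = pd.read_json('output.json')

# Add a derived column for analysis, e.g. the length of each paragraph
df['paragraph_length'] = df['paragraph'].str.len()
print(df.head())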