Experiment 2: Web Scraping and Data Analysis

Web Scraping: Scrapy is an open-source web scraping framework written in Python. It is used for extracting data from websites and is particularly well-suited for large-scale web scraping tasks. Scrapy provides a set of tools for efficient scraping and data extraction, allowing developers to write spiders (crawlers) that navigate websites, extract data, and save it in various formats.
Web scraping, also known as web harvesting or web data extraction, is the process of
automatically collecting information from websites. It involves fetching the content of web
pages and extracting specific data from them, which can then be stored and used for various
purposes, such as data analysis, market research, content aggregation, and more.

Key Features of Scrapy:


1. Ease of Use: Scrapy provides a simple way to define the logic for navigating and
extracting data from websites using Python.
2. Speed and Efficiency: Scrapy is built on top of Twisted, an asynchronous networking
framework, making it highly efficient and capable of handling multiple requests
simultaneously.
3. Extensible: Scrapy is highly customizable and can be extended with middlewares,
pipelines, and custom components to suit various scraping needs.
4. Built-in Data Exporters: It can export scraped data to various formats, including JSON,
CSV, XML, and more.
5. Selector Support: Scrapy supports both CSS selectors and XPath for extracting data
from HTML and XML documents.
6. Automatic Request Handling: Scrapy handles request scheduling, retries, and throttling
automatically.
7. Broad Crawl Support: It supports following links to other pages and recursively
scraping multiple pages across a website.

Components of Scrapy:
1. Spiders: These are classes where you define how to scrape a website. Spiders contain the
initial requests and the logic to process the responses.
2. Selectors: Used to extract data from HTML or XML using CSS selectors or XPath.
3. Item: A container for the scraped data. It defines the structure of the data you want to
extract.
4. Pipelines: Process and store scraped items. Pipelines can clean data, validate it, and store it in databases or files (a minimal sketch follows this list).
5. Middlewares: Customizable hooks that process requests and responses. They allow you to modify requests before they are sent and responses before they are processed by spiders.
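
As a minimal sketch of the Item and Pipeline components above: the field names and the whitespace-stripping behaviour below are illustrative assumptions, not part of the experiment code.

import scrapy

# A hypothetical Item declaring the structure of the scraped data.
class PageItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

# A hypothetical pipeline that cleans each item before it is stored.
# Pipelines are enabled by listing them in the ITEM_PIPELINES setting.
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        title = item.get('title')
        if title:
            item['title'] = title.strip()
        return item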

How Scrapy Works:


1. Spider starts: The spider sends initial requests to the start_urls.
2. Request processing: Scrapy sends these requests and waits for responses.
3. Response handling: Once responses are received, they are passed to the spider's
parsing methods (e.g., parse).
4. Data extraction: The parsing methods extract the desired data using selectors.
5. Item processing: Extracted data (items) are processed through pipelines for
validation, cleaning, and storage.
6. Follow links: The spider can follow links to other pages and continue scraping.

Example of a Basic Scrapy Spider:


import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        'https://example.com',
    ]

    def parse(self, response):
        # Extract the page title and the text of every paragraph.
        title = response.css('title::text').get()
        paragraphs = response.css('p::text').getall()

        yield {
            'title': title,
            'paragraphs': paragraphs,
        }

        # Follow each link on the page and parse it with this same method.
        for a in response.css('a::attr(href)').getall():
            yield response.follow(a, callback=self.parse)
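
If this spider is saved as, say, myspider.py, it can be run without a full Scrapy project using the scrapy runspider command, e.g. scrapy runspider myspider.py -o results.json; the file name and output path here are illustrative.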

Key Concepts in Web Scraping:


1. HTTP Requests: The scraper sends an HTTP request to a web server to retrieve the
content of a web page. The server responds with the HTML of the page.
2. Parsing HTML: The retrieved HTML is parsed to extract the relevant data. This
involves navigating the HTML structure and selecting the elements that contain the
desired information.
3. Data Extraction: Specific data points (e.g., text, images, links) are extracted from the
parsed HTML using various techniques like regular expressions, CSS selectors, or
XPath.
4. Data Storage: The extracted data is stored in a structured format, such as a database, CSV file, JSON file, etc., for further processing and analysis (a short end-to-end sketch follows this list).
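
As a short end-to-end illustration of these four steps, the following sketch uses the requests and BeautifulSoup libraries; the URL, selectors, and output file name are placeholders chosen for demonstration.

import csv
import requests
from bs4 import BeautifulSoup

# 1. HTTP request: fetch the page (example.com is a placeholder URL).
response = requests.get('https://example.com', timeout=10)
response.raise_for_status()

# 2. Parse the returned HTML into a navigable tree.
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract specific data points (page title and link targets).
title = soup.title.get_text(strip=True) if soup.title else ''
links = [a.get('href') for a in soup.select('a[href]')]

# 4. Store the extracted data in a structured format (CSV here).
with open('links.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['page_title', 'link'])
    for link in links:
        writer.writerow([title, link])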

Example Use Cases:


• Data Extraction: Extracting product details, prices, reviews, or any structured data from e-commerce sites.
• Content Aggregation: Aggregating news articles, blog posts, or social media content.
• Academic Research: Gathering large datasets from multiple sources for analysis.

Methods and Tools for Web Scraping:


1. Manual Scraping: Copying and pasting information manually from websites. This is
feasible for small amounts of data but impractical for larger tasks.
2. Automated Scraping: Using scripts or software tools to automate the process of data
extraction. This is more efficient and scalable.
3. Web Scraping Libraries:
• BeautifulSoup: A Python library for parsing HTML and XML documents. It creates a parse tree that can be used to extract data.
• Scrapy: A powerful Python framework for building web scrapers. It handles sending requests, parsing responses, and storing data.
• Selenium: A tool for automating web browsers. It is useful for scraping dynamic content generated by JavaScript (a minimal sketch follows).
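
A minimal sketch of the Selenium approach, assuming a local Chrome/chromedriver installation; the URL and the h1 selector are placeholders for illustration.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser so that JavaScript on the page is executed.
driver = webdriver.Chrome()
try:
    driver.get('https://example.com')  # placeholder URL
    # Elements rendered by JavaScript are now present in the DOM.
    heading = driver.find_element(By.TAG_NAME, 'h1').text
    print(heading)
finally:
    driver.quit()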
JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for
humans to read and write and easy for machines to parse and generate. It is often used to
transmit data between a server and a web application as text.

Key Features of JSON


1. Human-readable: JSON is formatted in a way that is easy for humans to read and
write.
2. Language-independent: Although it is based on JavaScript syntax, JSON can be
used with many programming languages.
3. Lightweight: JSON is a lightweight format compared to other data interchange
formats like XML.

JSON Syntax
JSON data is represented in a structured way using key-value pairs. Here is an example of
JSON syntax:
Objects: An object is an unordered set of key-value pairs enclosed in curly braces {}. Keys
are strings, and values can be strings, numbers, objects, arrays, booleans, or null.
{
  "name": "John Doe",
  "age": 30,
  "isStudent": false,
  "address": {
    "street": "123 Main St",
    "city": "Anytown"
  },
  "courses": ["Math", "Science", "History"]
}

Arrays: An array is an ordered collection of values enclosed in square brackets []. Values in
an array can be of any type.
[
  "Apple",
  "Banana",
  "Cherry"
]

Data Types
• String: Enclosed in double quotes.
  "name": "John Doe"
• Number: Can be an integer or a floating-point number.
  "age": 30
• Boolean: Can be true or false.
  "isStudent": false
• Null: Represents an empty value.
  "middleName": null
• Object: A collection of key-value pairs.
  "address": { "street": "123 Main St", "city": "Anytown" }
• Array: An ordered list of values.
  "courses": ["Math", "Science", "History"]

Example
Here is an example of how JSON is used to represent user information:
{
  "user": {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "isAdmin": true,
    "preferences": {
      "language": "en",
      "theme": "dark"
    },
    "friends": ["Bob", "Charlie", "David"]
  }
}
In summary, JSON is a versatile and widely used format for data interchange that is easy to
read, write, and parse across different programming languages.
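
In Python, the standard-library json module handles this conversion; the sketch below round-trips a record similar to the one above (the file name user.json is illustrative).

import json

# Serialize a Python dict to a JSON string ...
user = {"user": {"id": 1, "name": "Alice", "isAdmin": True}}
text = json.dumps(user, indent=2)

# ... and parse JSON text back into Python objects.
parsed = json.loads(text)
print(parsed["user"]["name"])   # -> Alice

# json.dump / json.load do the same against file objects.
with open('user.json', 'w', encoding='utf-8') as f:
    json.dump(user, f, indent=2)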

Experiment 2 L1: Scrape data from a website using libraries like BeautifulSoup or
Scrapy.

Question: Write a Python script using the Scrapy framework to scrape data from a
website. The script should extract the title and the first paragraph of the webpage at
'https://presidencyuniversity.in' and save the extracted data in JSON format to a file
named output.json.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        'https://presidencyuniversity.in',
    ]

    def parse(self, response):
        # Extracting data from the website
        title = response.css('title::text').get()
        paragraph = response.css('p::text').get()

        # You can further process the extracted data as needed
        # For example, printing it out:
        print("Title:", title)
        print("Paragraph:", paragraph)

        # Or you can yield the data to save it to a file or database
        yield {
            'title': title,
            'paragraph': paragraph,
        }

# To run the spider
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',     # Output format
    'FEED_URI': 'output.json'  # Output file
})
process.crawl(MySpider)
process.start()
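
Note: in recent Scrapy releases (2.1 and later) the FEED_FORMAT/FEED_URI pair is deprecated in favour of the FEEDS setting, e.g. 'FEEDS': {'output.json': {'format': 'json'}}. The script above should still run, but may emit a deprecation warning on newer versions.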

Experiment 2 L2: Perform data analysis and visualization on the scraped data using
Pandas and Matplotlib.

import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have already scraped the data and stored it in a JSON file
# If not, you can create a DataFrame with dummy data for demonstration purposes
data = {
    'title': ['Presidency', 'University'],
    'paragraph': ['Education is the', 'passport to the future']
}
# Creating a DataFrame from the scraped data
df = pd.DataFrame(data)
# Visualization
# Bar plot of the length of titles
plt.figure(figsize=(10, 6))
plt.bar(df['title'], df['title'].str.len(), color='skyblue')
plt.xlabel('Titles')
plt.ylabel('Length')
plt.title('Length of Titles')
plt.xticks(rotation=45)
plt.show()

# Pie chart of the number of characters in each paragraph


plt.figure(figsize=(8, 8))
df['paragraph_length'] = df['paragraph'].str.len()
plt.pie(df['paragraph_length'], labels=df['title'], autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Character Distribution in Paragraphs')
plt.show()
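
To analyse the data actually produced by the spider in L1 rather than the dummy data above, the JSON feed can be loaded directly with pandas; this assumes output.json from L1 is present in the working directory.

import pandas as pd

# Scrapy's JSON feed export writes a list of item objects,
# which pandas can read straight into a DataFrame.
df = pd.read_json('output.json')
print(df.head())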
