Experiment 2: Web Scraping and Data Analysis

Web Scraping: Scrapy is an open-source web scraping framework written in Python. It is used for extracting data from websites and is particularly well-suited for large-scale web scraping tasks. Scrapy provides a set of tools for efficient scraping and data extraction, allowing developers to write spiders (crawlers) that navigate websites, extract data, and save it in various formats.
Web scraping, also known as web harvesting or web data extraction, is the process of
automatically collecting information from websites. It involves fetching the content of web
pages and extracting specific data from them, which can then be stored and used for various
purposes, such as data analysis, market research, content aggregation, and more.

Key Features of Scrapy:


1. Ease of Use: Scrapy provides a simple way to define the logic for navigating and
extracting data from websites using Python.
2. Speed and Efficiency: Scrapy is built on top of Twisted, an asynchronous networking
framework, making it highly efficient and capable of handling multiple requests
simultaneously.
3. Extensible: Scrapy is highly customizable and can be extended with middlewares,
pipelines, and custom components to suit various scraping needs.
4. Built-in Data Exporters: It can export scraped data to various formats, including JSON,
CSV, XML, and more.
5. Selector Support: Scrapy supports both CSS selectors and XPath for extracting data
from HTML and XML documents.
6. Automatic Request Handling: Scrapy handles request scheduling, retries, and throttling
automatically.
7. Broad Crawl Support: It supports following links to other pages and recursively
scraping multiple pages across a website.

Components of Scrapy:
1. Spiders: These are classes where you define how to scrape a website. Spiders contain the
initial requests and the logic to process the responses.
2. Selectors: Used to extract data from HTML or XML using CSS selectors or XPath.
3. Item: A container for the scraped data. It defines the structure of the data you want to
extract.
4. Pipelines: Process and store scraped items. Pipelines can clean data, validate it, and store it in databases or files (a minimal sketch follows this list).
5. Middlewares: Customizable hooks that process requests and responses. They allow you to modify requests before they are sent and responses before they are processed by spiders.
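
As a minimal sketch of the Item and Pipeline components above: the field names and the whitespace-stripping behaviour below are illustrative assumptions, not part of the experiment code.

import scrapy

# A hypothetical Item declaring the structure of the scraped data.
class PageItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()

# A hypothetical pipeline that cleans each item before it is stored.
# Pipelines are enabled by listing them in the ITEM_PIPELINES setting.
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        title = item.get('title')
        if title:
            item['title'] = title.strip()
        return item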

How Scrapy Works:


1. Spider starts: The spider sends initial requests to the start_urls.
2. Request processing: Scrapy sends these requests and waits for responses.
3. Response handling: Once responses are received, they are passed to the spider's
parsing methods (e.g., parse).
4. Data extraction: The parsing methods extract the desired data using selectors.
5. Item processing: Extracted data (items) are processed through pipelines for
validation, cleaning, and storage.
6. Follow links: The spider can follow links to other pages and continue scraping.

Example of a Basic Scrapy Spider:


import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        'https://example.com',
    ]

    def parse(self, response):
        # Extract the page title and the text of every paragraph.
        title = response.css('title::text').get()
        paragraphs = response.css('p::text').getall()

        yield {
            'title': title,
            'paragraphs': paragraphs,
        }

        # Follow each link on the page and parse it with this same method.
        for a in response.css('a::attr(href)').getall():
            yield response.follow(a, callback=self.parse)
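
If this spider is saved as, say, myspider.py, it can be run without a full Scrapy project using the scrapy runspider command, e.g. scrapy runspider myspider.py -o results.json; the file name and output path here are illustrative.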

Key Concepts in Web Scraping:


1. HTTP Requests: The scraper sends an HTTP request to a web server to retrieve the
content of a web page. The server responds with the HTML of the page.
2. Parsing HTML: The retrieved HTML is parsed to extract the relevant data. This
involves navigating the HTML structure and selecting the elements that contain the
desired information.
3. Data Extraction: Specific data points (e.g., text, images, links) are extracted from the
parsed HTML using various techniques like regular expressions, CSS selectors, or
XPath.
4. Data Storage: The extracted data is stored in a structured format, such as a database, CSV file, JSON file, etc., for further processing and analysis (a short end-to-end sketch follows this list).
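
As a short end-to-end illustration of these four steps, the following sketch uses the requests and BeautifulSoup libraries; the URL, selectors, and output file name are placeholders chosen for demonstration.

import csv
import requests
from bs4 import BeautifulSoup

# 1. HTTP request: fetch the page (example.com is a placeholder URL).
response = requests.get('https://example.com', timeout=10)
response.raise_for_status()

# 2. Parse the returned HTML into a navigable tree.
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract specific data points (page title and link targets).
title = soup.title.get_text(strip=True) if soup.title else ''
links = [a.get('href') for a in soup.select('a[href]')]

# 4. Store the extracted data in a structured format (CSV here).
with open('links.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['page_title', 'link'])
    for link in links:
        writer.writerow([title, link])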

Example Use Cases:


• Data Extraction: Extracting product details, prices, reviews, or any structured data from e-commerce sites.
• Content Aggregation: Aggregating news articles, blog posts, or social media content.
• Academic Research: Gathering large datasets from multiple sources for analysis.

Methods and Tools for Web Scraping:


1. Manual Scraping: Copying and pasting information manually from websites. This is
feasible for small amounts of data but impractical for larger tasks.
2. Automated Scraping: Using scripts or software tools to automate the process of data
extraction. This is more efficient and scalable.
3. Web Scraping Libraries:
• BeautifulSoup: A Python library for parsing HTML and XML documents. It creates a parse tree that can be used to extract data.
• Scrapy: A powerful Python framework for building web scrapers. It handles sending requests, parsing responses, and storing data.
• Selenium: A tool for automating web browsers. It is useful for scraping dynamic content generated by JavaScript (a minimal sketch follows).
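
A minimal sketch of the Selenium approach, assuming a local Chrome/chromedriver installation; the URL and the h1 selector are placeholders for illustration.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser so that JavaScript on the page is executed.
driver = webdriver.Chrome()
try:
    driver.get('https://example.com')  # placeholder URL
    # Elements rendered by JavaScript are now present in the DOM.
    heading = driver.find_element(By.TAG_NAME, 'h1').text
    print(heading)
finally:
    driver.quit()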
JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for
humans to read and write and easy for machines to parse and generate. It is often used to
transmit data between a server and a web application as text.

Key Features of JSON


1. Human-readable: JSON is formatted in a way that is easy for humans to read and
write.
2. Language-independent: Although it is based on JavaScript syntax, JSON can be
used with many programming languages.
3. Lightweight: JSON is a lightweight format compared to other data interchange
formats like XML.

JSON Syntax
JSON data is represented in a structured way using key-value pairs. Here is an example of
JSON syntax:
Objects: An object is an unordered set of key-value pairs enclosed in curly braces {}. Keys
are strings, and values can be strings, numbers, objects, arrays, booleans, or null.
{
  "name": "John Doe",
  "age": 30,
  "isStudent": false,
  "address": {
    "street": "123 Main St",
    "city": "Anytown"
  },
  "courses": ["Math", "Science", "History"]
}

Arrays: An array is an ordered collection of values enclosed in square brackets []. Values in
an array can be of any type.
[
  "Apple",
  "Banana",
  "Cherry"
]

Data Types
• String: Enclosed in double quotes.
  "name": "John Doe"
• Number: Can be an integer or a floating-point number.
  "age": 30
• Boolean: Can be true or false.
  "isStudent": false
• Null: Represents an empty value.
  "middleName": null
• Object: A collection of key-value pairs.
  "address": { "street": "123 Main St", "city": "Anytown" }
• Array: An ordered list of values.
  "courses": ["Math", "Science", "History"]

Example
Here is an example of how JSON is used to represent user information:
{
  "user": {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "isAdmin": true,
    "preferences": {
      "language": "en",
      "theme": "dark"
    },
    "friends": ["Bob", "Charlie", "David"]
  }
}
In summary, JSON is a versatile and widely used format for data interchange that is easy to
read, write, and parse across different programming languages.
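
In Python, the standard-library json module handles this conversion; the sketch below round-trips a record similar to the one above (the file name user.json is illustrative).

import json

# Serialize a Python dict to a JSON string ...
user = {"user": {"id": 1, "name": "Alice", "isAdmin": True}}
text = json.dumps(user, indent=2)

# ... and parse JSON text back into Python objects.
parsed = json.loads(text)
print(parsed["user"]["name"])   # -> Alice

# json.dump / json.load do the same against file objects.
with open('user.json', 'w', encoding='utf-8') as f:
    json.dump(user, f, indent=2)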

Experiment 2 L1: Scrape data from a website using libraries like BeautifulSoup or
Scrapy.

Question: Write a Python script using the Scrapy framework to scrape data from a
website. The script should extract the title and the first paragraph of the webpage at
'https://presidencyuniversity.in' and save the extracted data in JSON format to a file
named output.json.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        'https://presidencyuniversity.in',
    ]

    def parse(self, response):
        # Extracting data from the website
        title = response.css('title::text').get()
        paragraph = response.css('p::text').get()

        # You can further process the extracted data as needed
        # For example, printing it out:
        print("Title:", title)
        print("Paragraph:", paragraph)

        # Or you can yield the data to save it to a file or database
        yield {
            'title': title,
            'paragraph': paragraph,
        }

# To run the spider
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',     # Output format
    'FEED_URI': 'output.json'  # Output file
})
process.crawl(MySpider)
process.start()
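
Note: in recent Scrapy releases (2.1 and later) the FEED_FORMAT/FEED_URI pair is deprecated in favour of the FEEDS setting, e.g. 'FEEDS': {'output.json': {'format': 'json'}}. The script above should still run, but may emit a deprecation warning on newer versions.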

Experiment 2 L2: Perform data analysis and visualization on the scraped data using
Pandas and Matplotlib.

import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have already scraped the data and stored it in a JSON file
# If not, you can create a DataFrame with dummy data for demonstration purposes
data = {
    'title': ['Presidency', 'University'],
    'paragraph': ['Education is the', 'passport to the future']
}
# Creating a DataFrame from the scraped data
df = pd.DataFrame(data)
# Visualization
# Bar plot of the length of titles
plt.figure(figsize=(10, 6))
plt.bar(df['title'], df['title'].str.len(), color='skyblue')
plt.xlabel('Titles')
plt.ylabel('Length')
plt.title('Length of Titles')
plt.xticks(rotation=45)
plt.show()

# Pie chart of the number of characters in each paragraph


plt.figure(figsize=(8, 8))
df['paragraph_length'] = df['paragraph'].str.len()
plt.pie(df['paragraph_length'], labels=df['title'], autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title('Character Distribution in Paragraphs')
plt.show()
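
To analyse the data actually produced by the spider in L1 rather than the dummy data above, the JSON feed can be loaded directly with pandas; this assumes output.json from L1 is present in the working directory.

import pandas as pd

# Scrapy's JSON feed export writes a list of item objects,
# which pandas can read straight into a DataFrame.
df = pd.read_json('output.json')
print(df.head())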
