
Crawling the Web with Scrapy
Web crawling or spidering is the process of systematically extracting
data from a website using a Web crawler, spider or robot. A Web scraper
methodically harvests data from a website. This article takes the reader
through the Web scraping process using Scrapy.

Scrapy is one of the most powerful and popular Python frameworks for crawling websites and extracting structured data, useful for applications like data analysis, historical archival, knowledge processing, etc. To work with Scrapy, you need to have Python installed on your system. Python can be downloaded from www.python.org.

Installing Scrapy with Pip
Pip is installed along with Python in the Python/Scripts/ folder. To install Scrapy, type the following command:

pip install scrapy

The above command will install Scrapy on your machine in the Python/Lib/site-packages folder.

Creating a project
With Scrapy installed, navigate to the folder in which you want to create your project, open cmd and type the command below to create the Scrapy project:

scrapy startproject scrapy_first

The above command will create a Scrapy project with the following file structure:

scrapy_first/
    scrapy.cfg
    scrapy_first/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

In the folder structure given above, 'scrapy_first' is the root directory of our Scrapy project.
A spider is a class that describes how a website will be scraped: how it will be crawled and how data will be extracted from it. The customisation needed to crawl and parse Web pages is defined in the spiders. The table below lists the Scrapy commands you will use most often while working with spiders.

Scrapy commands

scrapy startproject myproject [project_dir]: Creates a Scrapy project in the directory specified; if project_dir is not mentioned, a directory named after the project is used.
scrapy genspider spider_name [domain.com]: Needs to be run from the root directory of the project; creates a spider with allowed_domains set to domain.com.
scrapy bench: Runs a quick benchmark test, to tell you Scrapy's maximum possible speed in crawling Web pages, given your hardware.
scrapy check: Checks spider contracts.
scrapy crawl [spider]: Instructs the spider to start crawling Web pages.
scrapy edit [spider]: Edits the spider using the editor specified in the EDITOR environment variable or the EDITOR setting.
scrapy fetch [url]: Downloads the contents of the URL and writes them to standard output.
scrapy list: Lists the available spiders in the project.
scrapy parse [url]: Fetches the URL and parses it with the spider that handles it, using the parse method as the default callback when no other callback is specified.
scrapy runspider file_name.py: Runs a spider self-contained in a Python file, without having to create a project.
scrapy view [url]: Opens the URL in the browser as seen by the spider.
scrapy settings: Gets the value of a Scrapy setting.
scrapy version: Prints the installed Scrapy version.
scrapy shell [url]: Opens the interactive Scrapy console, optionally for the given URL.

A spider's scraping life cycle
1. You start by generating the initial requests to crawl the first URLs. The start_requests() method generates a request for each URL specified in start_urls and assigns the parse method as the callback that receives the response.
2. In the callback, after the parsing is done, one of three kinds of content is returned: request objects, item objects (or plain dicts), or an iterable of these. Any returned request will also carry a callback; it is downloaded by Scrapy and its response is handled by that callback.
3. In callbacks, parsing of the page content is performed using XPath selectors or any other parser library, such as lxml, and items are generated from the parsed data.
4. The returned items are then persisted into a database through an item pipeline, or written to a file using the feed exports service.

Scrapy is bundled with three kinds of spiders.
BaseSpider: All spiders must inherit from this spider. It is the simplest one, responsible for start_urls/start_requests() and for calling the parse method for each resulting response.
CrawlSpider: This provides a convenient mechanism for crawling links by defining a set of rules. It can be overridden as per the project's needs. It supports all of BaseSpider's attributes, as well as an additional attribute, 'rules', which is a list of one or more rules (a short sketch of such a spider follows this list).
XMLFeedSpider and CSVFeedSpider: XMLFeedSpider iterates over XML feeds through a certain node name, whereas CSVFeedSpider is used to crawl CSV feeds. The difference between them is that XMLFeedSpider iterates over nodes, while CSVFeedSpider iterates over rows with the parse_row() method.
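As an illustration of the rules mechanism, here is a minimal CrawlSpider sketch; the spider name, domain and URL below are placeholders rather than part of this article's project, and it uses Scrapy's CrawlSpider, Rule and LinkExtractor classes:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowLinksSpider(CrawlSpider):
    # Illustrative spider: follows every link on the site and logs each page visited.
    name = "followLinks"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    # One rule: extract all links, follow them, and send every downloaded
    # page to parse_item() (CrawlSpider reserves parse() for its own use).
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('Visited %s' % response.url)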
Having understood the different types of spiders, we are ready to start writing our first spider. Create a file named myFirstSpider.py in the spiders folder of our project, and add the following code:

import scrapy

class MyfirstspiderSpider(scrapy.Spider):
    name = "myFirstSpider"
    allowed_domains = ["opensourceforu.com"]
    start_urls = (
        'http://opensourceforu.com/2015/10/building-a-django-app/',
    )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

In the above code, the following attributes have been defined:
1. name: This is the unique name given to the spider in the project.


2. allowed_domains: This is the base address of the URLs that the spider is allowed to crawl.
3. start_requests(): The spider begins to crawl on the requests returned by this method. It is called when the spider is opened for scraping.
4. parse(): This handles the responses downloaded for each request made. It is responsible for processing the response and returning scraped data. In the above code, the parse method is used to save response.body into an HTML file.
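Since start_requests() is mentioned above but our spider relies on the start_urls shortcut, here is a minimal sketch of how the same spider could build its initial requests explicitly; this is an alternative to start_urls, not something the article's project requires:

import scrapy


class MyfirstspiderSpider(scrapy.Spider):
    name = "myFirstSpider"
    allowed_domains = ["opensourceforu.com"]

    def start_requests(self):
        # Generate the initial requests ourselves instead of using start_urls.
        urls = ['http://opensourceforu.com/2015/10/building-a-django-app/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log('Got response from %s' % response.url)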

Crawling
Crawling is basically following links and moving around websites. With Scrapy, we can crawl any website using our spider with the following command:

scrapy crawl myFirstSpider
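The scrapy crawl command is the usual way to run a spider, but a spider can also be started from a plain Python script; here is a minimal sketch using Scrapy's CrawlerProcess API (the module path below assumes the myFirstSpider.py file created earlier):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy_first.spiders.myFirstSpider import MyfirstspiderSpider

# Load the project settings, schedule the spider and block until it finishes.
process = CrawlerProcess(get_project_settings())
process.crawl(MyfirstspiderSpider)
process.start()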


Extraction with selectors and items
Selectors: A certain part of the HTML source can be scraped using selectors, which is achieved with CSS or XPath expressions. XPath is a language for selecting nodes in XML documents, and it works with HTML as well, whereas CSS selectors pick elements using the same expressions that are used to associate styles with them. Include the following code in our previous spider to select the title of the Web page:

def parse(self, response):
    url = response.url
    for select in response.xpath('//title'):
        title = select.xpath('text()').extract()
        self.log("title here %s" % title)
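Since Scrapy supports CSS selectors as well as XPath, the same title can be picked with response.css(); a minimal sketch equivalent to the XPath version above:

def parse(self, response):
    # 'title::text' is the CSS counterpart of the XPath '//title/text()'.
    for title in response.css('title::text').extract():
        self.log("title here %s" % title)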

Items: Items are used to collect the scraped data. They are regular Python dicts. Before using an item, we need to define the item fields in our project's items.py file. Add the following lines to the ScrapyFirstItem class in it:

title = scrapy.Field()
url = scrapy.Field()

Our code will look like what's shown in Figure 1.

Figure 1: First item
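Figure 1 is not reproduced here, so as a rough sketch, the complete items.py should end up looking something like the following (ScrapyFirstItem is the class name generated by scrapy startproject and imported in the next snippet):

import scrapy


class ScrapyFirstItem(scrapy.Item):
    # Fields that will hold the scraped data.
    title = scrapy.Field()
    url = scrapy.Field()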

After the changes in the item are done, we need to make some changes in our spider. Add the following lines so that it can yield the item data:

from scrapy_first.items import ScrapyFirstItem

def parse(self, response):
    item = ScrapyFirstItem()
    item['url'] = response.url
    for select in response.xpath('//title'):
        title = select.xpath('text()').extract()
        self.log("title here %s" % title)
        item['title'] = title
    yield item

Now run the spider and our output will look like what's shown in Figure 2.

Figure 2: Execution of the spider


Scraped data
After data is scraped from different sources, it can be persisted into a file using feed exports, which ensure that the data is stored properly in any of several serialisation formats. We will store the data as XML. Run the following command to store the data:

scrapy crawl myFirstSpider -o data.xml

We can find data.xml in our project's root folder, as shown in Figure 3.

Figure 3: data.xml
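The same export can also be configured in the project's settings.py instead of being passed with -o each time; a minimal sketch, assuming the FEED_URI and FEED_FORMAT settings available in the Scrapy releases current when this article was written:

# settings.py (sketch): write all scraped items to data.xml in XML format.
FEED_URI = 'data.xml'
FEED_FORMAT = 'xml'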

Our final spider will look like what's shown in Figure 4.

Figure 4: Final spider

Built-in services
1. Logging: Scrapy uses Python's built-in logging system for event tracking. It allows us to include our own messages along with the logging messages of third-party APIs in our application's log. For example:

import logging
logging.warning('this is a warning')
logging.log(logging.WARNING, "Warning Message")
logging.error("error goes here")
logging.critical("critical message goes here")
logging.info("info goes here")
logging.debug("debug goes here")
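Inside a spider, the same logging system is usually reached through the spider's logger attribute (the self.log() call used earlier is a shortcut for it); a minimal sketch with a placeholder spider:

import scrapy


class LoggingSpider(scrapy.Spider):
    # Illustrative spider, not part of the article's project.
    name = "loggingExample"
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Messages are tagged with the spider's name in Scrapy's log output.
        self.logger.info('Parsed %s', response.url)
        self.logger.warning('This is a warning from the spider')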
2. Stats collection: This facilitates the collection of stats in key-value pairs, where the values are often counters. The service is always available, even if it is disabled, in which case the API will be called but nothing will be collected. The stats collector can be accessed through the crawler's stats attribute. For example:

class ExtensionThatAccessStats(object):
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

Set a stat value: stats.set_value('hostname', socket.gethostname()) (this requires import socket)
Increment a stat value: stats.inc_value('count_variable')
Get the stat values: stats.get_stats()

3. Sending email: Scrapy comes with an easy-to-use service for sending email, implemented using Twisted's non-blocking IO of the crawler. For example:

from scrapy.mail import MailSender
mailer = MailSender()
mailer.send(to=['abc@xyz.com'], subject="Test Subject", body="Test Body", cc=['cc@abc.com'])

4. Telnet console: All the running processes of Scrapy are controlled and inspected using this console. It comes enabled by default and can be accessed using the following command:

telnet localhost 6023

5. Web services: This service is used to control Scrapy's Web crawler via the JSON-RPC 2.0 protocol. It needs to be installed separately using the following command:

pip install scrapy-jsonrpc

The following lines should be included in our project's settings.py file:

EXTENSIONS = {
    'scrapy_jsonrpc.webservice.WebService': 500,
}

Also set the JSONRPC_ENABLED setting to True.

Scrapy vs BeautifulSoup
Scrapy: Scrapy is a full-fledged spider library, capable of applying load-balancing restrictions and parsing a wide range of data types with minimal customisation. It is a Web scraping framework and can be used to crawl numerous URLs by providing constraints. It is best suited to situations where you have proper seed URLs. Scrapy supports both CSS selectors and XPath expressions for data extraction. In fact, you could even use BeautifulSoup or PyQuery as the data extraction mechanism in your Scrapy spiders, as sketched below.
BeautifulSoup: This is a parsing library which provides easy-to-understand methods for navigating, searching and finally extracting the data you need; i.e., it helps us navigate through HTML and can be used to fetch data and parse it into any specific format. It can be used if you'd rather implement the HTML fetching part yourself and want to easily navigate through the HTML DOM.
Reference
https://doc.scrapy.org

By: Shubham Sharma
The author is an open source activist working as a software engineer at KPIT Technologies, Pune. He can be contacted at shubham.ks494@gmail.com.

