
Crawling the Web with Scrapy
Web crawling or spidering is the process of systematically extracting
data from a website using a Web crawler, spider or robot. A Web scraper
methodically harvests data from a website. This article takes the reader
through the Web scraping process using Scrapy.

Scrapy is one of the most powerful and popular Python frameworks for crawling websites and extracting structured data, useful for applications like data analysis, historical archival, knowledge processing, etc. To work with Scrapy, you need to have Python installed on your system. Python can be downloaded from www.python.org.

Installing Scrapy with Pip
Pip is installed along with Python in the Python/Scripts/ folder. To install Scrapy, type the following command:

pip install scrapy

The above command will install Scrapy on your machine in the Python/Lib/site-packages folder.

Creating a project
With Scrapy installed, navigate to the folder in which you want to create your project, open cmd and type the command below to create the Scrapy project:

scrapy startproject scrapy_first

The above command will create a Scrapy project with the following file structure:

scrapy_first/
    scrapy.cfg
    scrapy_first/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

In the folder structure given above, 'scrapy_first' is the root directory of our Scrapy project.
A spider is a class that describes how a website will be scraped: how it will be crawled and how data will be extracted from it. The customisation needed to crawl and parse Web pages is defined in the spiders. The table below lists the Scrapy commands you will use most often while working with spiders.

Scrapy commands

scrapy startproject myproject [project_dir]: Creates a Scrapy project in the directory specified; if project_dir is not mentioned, a directory named after the project is used.
scrapy genspider spider_name [domain.com]: Needs to be run from the root directory of the project; creates a spider with allowed_domains set to domain.com.
scrapy bench: Runs a quick benchmark test, to tell you Scrapy's maximum possible speed in crawling Web pages, given your hardware.
scrapy check: Checks spider contracts.
scrapy crawl [spider]: Instructs the spider to start crawling Web pages.
scrapy edit [spider]: Edits the spider using the editor specified in the EDITOR environment variable or the EDITOR setting.
scrapy fetch [url]: Downloads the contents of the URL and writes them to standard output.
scrapy list: Lists the available spiders in the project.
scrapy parse [url]: Fetches the URL and parses it with the spider that handles it, using the parse method as the default callback when no other callback is specified.
scrapy runspider file_name.py: Runs a spider self-contained in a Python file, without having to create a project.
scrapy view [url]: Opens the URL in the browser as seen by the spider.
scrapy settings: Gets the value of a Scrapy setting.
scrapy version: Prints the installed Scrapy version.
scrapy shell [url]: Opens the interactive Scrapy console, optionally for the given URL.

A spider's scraping life cycle
1. You start by generating the initial requests to crawl the first URLs. The start_requests() method generates a request for each URL specified in start_urls and assigns the parse method as the callback that receives the response.
2. In the callback, after the parsing is done, one of three kinds of content is returned: request objects, item objects (or plain dicts), or an iterable of these. Any returned request will also carry a callback; it is downloaded by Scrapy and its response is handled by that callback.
3. In callbacks, parsing of the page content is performed using XPath selectors or any other parser library, such as lxml, and items are generated from the parsed data.
4. The returned items are then persisted into a database through an item pipeline, or written to a file using the feed exports service.

Scrapy is bundled with three kinds of spiders.
BaseSpider: All spiders must inherit from this spider. It is the simplest one, responsible for start_urls/start_requests() and for calling the parse method for each resulting response.
CrawlSpider: This provides a convenient mechanism for crawling links by defining a set of rules. It can be overridden as per the project's needs. It supports all of BaseSpider's attributes, as well as an additional attribute, 'rules', which is a list of one or more rules (a short sketch of such a spider follows this list).
XMLFeedSpider and CSVFeedSpider: XMLFeedSpider iterates over XML feeds through a certain node name, whereas CSVFeedSpider is used to crawl CSV feeds. The difference between them is that XMLFeedSpider iterates over nodes, while CSVFeedSpider iterates over rows with the parse_row() method.
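As an illustration of the rules mechanism, here is a minimal CrawlSpider sketch; the spider name, domain and URL below are placeholders rather than part of this article's project, and it uses Scrapy's CrawlSpider, Rule and LinkExtractor classes:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowLinksSpider(CrawlSpider):
    # Illustrative spider: follows every link on the site and logs each page visited.
    name = "followLinks"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    # One rule: extract all links, follow them, and send every downloaded
    # page to parse_item() (CrawlSpider reserves parse() for its own use).
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('Visited %s' % response.url)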
Having understood the different types of spiders, we are ready to start writing our first spider. Create a file named myFirstSpider.py in the spiders folder of our project, and add the following code:

import scrapy

class MyfirstspiderSpider(scrapy.Spider):
    name = "myFirstSpider"
    allowed_domains = ["opensourceforu.com"]
    start_urls = (
        'http://opensourceforu.com/2015/10/building-a-django-app/',
    )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

In the above code, the following attributes have been defined:
1. name: This is the unique name given to the spider in the project.


2. allowed_domains: This is the base address of the URLs that the spider is allowed to crawl.
3. start_requests(): The spider begins to crawl on the requests returned by this method. It is called when the spider is opened for scraping.
4. parse(): This handles the responses downloaded for each request made. It is responsible for processing the response and returning scraped data. In the above code, the parse method is used to save response.body into an HTML file.
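Since start_requests() is mentioned above but our spider relies on the start_urls shortcut, here is a minimal sketch of how the same spider could build its initial requests explicitly; this is an alternative to start_urls, not something the article's project requires:

import scrapy


class MyfirstspiderSpider(scrapy.Spider):
    name = "myFirstSpider"
    allowed_domains = ["opensourceforu.com"]

    def start_requests(self):
        # Generate the initial requests ourselves instead of using start_urls.
        urls = ['http://opensourceforu.com/2015/10/building-a-django-app/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log('Got response from %s' % response.url)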

Crawling
Crawling is basically following links and moving around websites. With Scrapy, we can crawl any website using our spider with the following command:

scrapy crawl myFirstSpider
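The scrapy crawl command is the usual way to run a spider, but a spider can also be started from a plain Python script; here is a minimal sketch using Scrapy's CrawlerProcess API (the module path below assumes the myFirstSpider.py file created earlier):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy_first.spiders.myFirstSpider import MyfirstspiderSpider

# Load the project settings, schedule the spider and block until it finishes.
process = CrawlerProcess(get_project_settings())
process.crawl(MyfirstspiderSpider)
process.start()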


Extraction with selectors and items
Selectors: A certain part of the HTML source can be scraped using selectors, which is achieved with CSS or XPath expressions. XPath is a language for selecting nodes in XML documents, and it works with HTML as well, whereas CSS selectors pick elements using the same expressions that are used to associate styles with them. Include the following code in our previous spider to select the title of the Web page:

def parse(self, response):
    url = response.url
    for select in response.xpath('//title'):
        title = select.xpath('text()').extract()
        self.log("title here %s" % title)
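Since Scrapy supports CSS selectors as well as XPath, the same title can be picked with response.css(); a minimal sketch equivalent to the XPath version above:

def parse(self, response):
    # 'title::text' is the CSS counterpart of the XPath '//title/text()'.
    for title in response.css('title::text').extract():
        self.log("title here %s" % title)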

Items: Items are used to collect the scraped data. They are regular Python dicts. Before using an item, we need to define the item fields in our project's items.py file. Add the following lines to the ScrapyFirstItem class in it:

title = scrapy.Field()
url = scrapy.Field()

Our code will look like what's shown in Figure 1.

Figure 1: First item
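Figure 1 is not reproduced here, so as a rough sketch, the complete items.py should end up looking something like the following (ScrapyFirstItem is the class name generated by scrapy startproject and imported in the next snippet):

import scrapy


class ScrapyFirstItem(scrapy.Item):
    # Fields that will hold the scraped data.
    title = scrapy.Field()
    url = scrapy.Field()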

After the changes in the item are done, we need to make some changes in our spider. Add the following lines so that it can yield the item data:

from scrapy_first.items import ScrapyFirstItem

def parse(self, response):
    item = ScrapyFirstItem()
    item['url'] = response.url
    for select in response.xpath('//title'):
        title = select.xpath('text()').extract()
        self.log("title here %s" % title)
        item['title'] = title
    yield item

Now run the spider and our output will look like what's shown in Figure 2.

Figure 2: Execution of the spider


Scraped data
After data is scraped from different sources, it can be persisted into a file using feed exports, which ensure that the data is stored properly in any of several serialisation formats. We will store the data as XML. Run the following command to store the data:

scrapy crawl myFirstSpider -o data.xml

We can find data.xml in our project's root folder, as shown in Figure 3.

Figure 3: data.xml
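The same export can also be configured in the project's settings.py instead of being passed with -o each time; a minimal sketch, assuming the FEED_URI and FEED_FORMAT settings available in the Scrapy releases current when this article was written:

# settings.py (sketch): write all scraped items to data.xml in XML format.
FEED_URI = 'data.xml'
FEED_FORMAT = 'xml'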

Our final spider will look like what's shown in Figure 4.

Figure 4: Final spider

Built-in services
1. Logging: Scrapy uses Python's built-in logging system for event tracking. It allows us to include our own messages along with the logging messages of third-party APIs in our application's log. For example:

import logging
logging.warning('this is a warning')
logging.log(logging.WARNING, "Warning Message")
logging.error("error goes here")
logging.critical("critical message goes here")
logging.info("info goes here")
logging.debug("debug goes here")
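Inside a spider, the same logging system is usually reached through the spider's logger attribute (the self.log() call used earlier is a shortcut for it); a minimal sketch with a placeholder spider:

import scrapy


class LoggingSpider(scrapy.Spider):
    # Illustrative spider, not part of the article's project.
    name = "loggingExample"
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Messages are tagged with the spider's name in Scrapy's log output.
        self.logger.info('Parsed %s', response.url)
        self.logger.warning('This is a warning from the spider')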
2. Stats collection: This facilitates the collection of stats in key-value pairs, where the values are often counters. The service is always available, even if it is disabled, in which case the API will be called but nothing will be collected. The stats collector can be accessed through the crawler's stats attribute. For example:

class ExtensionThatAccessStats(object):
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

Set a stat value: stats.set_value('hostname', socket.gethostname()) (this requires import socket)
Increment a stat value: stats.inc_value('count_variable')
Get the stat values: stats.get_stats()

3. Sending email: Scrapy comes with an easy-to-use service for sending email, implemented using Twisted's non-blocking IO of the crawler. For example:

from scrapy.mail import MailSender
mailer = MailSender()
mailer.send(to=['abc@xyz.com'], subject="Test Subject", body="Test Body", cc=['cc@abc.com'])

4. Telnet console: All the running processes of Scrapy are controlled and inspected using this console. It comes enabled by default and can be accessed using the following command:

telnet localhost 6023

5. Web services: This service is used to control Scrapy's Web crawler via the JSON-RPC 2.0 protocol. It needs to be installed separately using the following command:

pip install scrapy-jsonrpc

The following lines should be included in our project's settings.py file:

EXTENSIONS = {
    'scrapy_jsonrpc.webservice.WebService': 500,
}

Also set the JSONRPC_ENABLED setting to True.

Scrapy vs BeautifulSoup
Scrapy: Scrapy is a full-fledged spider library, capable of applying load-balancing restrictions and parsing a wide range of data types with minimal customisation. It is a Web scraping framework and can be used to crawl numerous URLs by providing constraints. It is best suited to situations where you have proper seed URLs. Scrapy supports both CSS selectors and XPath expressions for data extraction. In fact, you could even use BeautifulSoup or PyQuery as the data extraction mechanism in your Scrapy spiders, as sketched below.
BeautifulSoup: This is a parsing library which provides easy-to-understand methods for navigating, searching and finally extracting the data you need; i.e., it helps us navigate through HTML and can be used to fetch data and parse it into any specific format. It can be used if you'd rather implement the HTML fetching part yourself and want to easily navigate through the HTML DOM.
Reference
https://doc.scrapy.org

By: Shubham Sharma
The author is an open source activist working as a software engineer at KPIT Technologies, Pune. He can be contacted at shubham.ks494@gmail.com.

