Python Scrapy
Scrapy is one of the most powerful and popular Python frameworks for crawling websites and extracting structured data, useful for applications like data analysis, historical archival, knowledge processing, etc. To work with Scrapy, you need to have Python installed on your system. Python can be downloaded from www.python.org.

Installing Scrapy with Pip
Pip is installed along with Python, in the Python/Scripts/ folder. To install Scrapy, type the following command:

pip install scrapy

The above command will install Scrapy on your machine, in the Python/Lib/site-packages folder.
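If you want to confirm that the installation succeeded, Scrapy's own command-line tool can report the version that was installed:

scrapy version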
Creating a project
With Scrapy installed, navigate to the folder in which you want to create your project, open cmd and type the command below to create the Scrapy project:

scrapy startproject scrapy_first

The above command will create a Scrapy project with the following file structure:

scrapy_first/
- scrapy.cfg
- scrapy_first/
  - __init__.py
  - items.py
  - pipelines.py
  - settings.py
  - spiders/
    - __init__.py

In the folder structure given above, 'scrapy_first' is the root directory of our Scrapy project.
The scrapy command-line tool provides a number of commands:

scrapy startproject myproject [project_dir]: creates a Scrapy project in the project directory specified; if project_dir is not mentioned, a directory with the name of the project is used.
scrapy genspider spider_name [domain.com]: run from the root directory of the project, this creates a spider with allowed_domains set to domain.com.
scrapy bench: runs a quick benchmark test, to tell you Scrapy's maximum possible speed in crawling Web pages, given your hardware.
scrapy check: checks spider contracts.
scrapy crawl [spider]: instructs the named spider to start crawling Web pages.
scrapy edit [spider]: edits the spider using the editor specified in the EDITOR environment variable or the EDITOR setting.
scrapy fetch [url]: downloads the contents of the URL and writes them to standard output.
scrapy list: lists the available spiders in the project.
scrapy parse [url]: fetches the URL and parses it with the spider that handles it, using the spider's parse method as the default callback when no other callback is specified.
scrapy runspider file_name.py: runs a spider self-contained in a Python file, without having to create a project.
scrapy view [url]: opens the URL in the browser as seen by the spider.
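As an example of genspider, the spider written by hand in the next section could equally be scaffolded from inside the project directory; the name and domain below simply match the ones used later in this article:

cd scrapy_first
scrapy genspider myFirstSpider opensourceforu.com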
A spider is a class that describes how a website will be scraped: how it will be crawled and how data will be extracted from it. The customisation needed to crawl and parse Web pages is defined in the spiders.

A spider's scraping life cycle
1. You start by generating the initial requests to crawl the first URLs. These come from the start_requests() method, which generates a request for each URL specified in start_urls and registers the parse method as the callback that will receive the response.
2. In the callback, after the parsing is done, one of three kinds of content is returned: a request object, an item object (or dict) or an iterable of these. Any returned request also carries a callback; it is downloaded by Scrapy and its response is handled by that callback.
3. In callbacks, parsing of page content is performed using XPath selectors or any other parser library such as lxml, and items are generated from the parsed data.
4. The returned items are then persisted to a database, passed through the item pipeline, or written to a file using the Feed Exports service.
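Putting the cycle together, a small sketch of a spider whose parse callback yields both items and follow-up requests might look like the code below; the URL and the selectors are purely illustrative assumptions, not taken from a real site:

import scrapy


class LifecycleSpider(scrapy.Spider):
    # Illustrative spider: the name, URL and selectors are assumptions.
    name = "lifecycle"
    start_urls = ["http://example.com/articles/"]

    def parse(self, response):
        # Step 3: parse the page content with XPath selectors and
        # generate items from the parsed data.
        for link in response.xpath("//a[@class='article']"):
            yield {
                "title": link.xpath("text()").extract_first(),
                "url": link.xpath("@href").extract_first(),
            }
        # Step 2: return a further request object; Scrapy downloads it
        # and hands its response to the callback given here.
        next_page = response.xpath("//a[@rel='next']/@href").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

The yielded dicts then flow into the item pipeline or the Feed Exports service, exactly as described in step 4.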
Scrapy is bundled with three kinds of spiders.
BaseSpider: All spiders must inherit from this spider. It is the simplest one, responsible for start_urls / start_requests() and for calling the parse method for each resulting response.
CrawlSpider: This provides a convenient mechanism for crawling links by defining a set of rules. It can be overridden as per the project's needs. It supports all of BaseSpider's attributes, as well as an additional attribute, 'rules', which is a list of one or more rules.
XMLSpider and CSVSpider: XMLSpider iterates over XML feeds through a certain node name, whereas CSVSpider is used to crawl CSV feeds. The difference between them is that XMLSpider iterates over nodes, while CSVSpider iterates over rows with its parse_row() method.
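Of these, CrawlSpider is the one most often reached for when whole sections of a site have to be followed automatically. A minimal sketch of its 'rules' attribute, with an assumed link pattern and callback name, could look like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ArticleCrawlSpider(CrawlSpider):
    # Illustrative only: the name, domain and URL pattern are assumptions.
    name = "articleCrawler"
    allowed_domains = ["opensourceforu.com"]
    start_urls = ["http://opensourceforu.com/"]

    # Each Rule tells the spider which links to extract and follow,
    # and which callback should parse the pages they lead to.
    rules = (
        Rule(LinkExtractor(allow=r"/2015/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {
            "title": response.xpath("//title/text()").extract_first(),
            "url": response.url,
        }

Note that a CrawlSpider must leave the parse() callback untouched, because it uses parse() internally to apply the rules; hence the parse_item name above.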
Having understood the different types of spiders, we are ready to start writing our first spider. Create a file named myFirstSpider.py in the spiders folder of our project.

import scrapy


class MyfirstspiderSpider(scrapy.Spider):
    name = "myFirstSpider"
    allowed_domains = ["opensourceforu.com"]
    start_urls = (
        'http://opensourceforu.com/2015/10/building-a-django-app/',
    )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

In the above code, the following attributes have been defined:
1. name: This is the unique name given to the spider in the project.
2. allowed_domains: This is the base address of the URLs that the spider is allowed to crawl.
3. start_requests(): The spider begins to crawl on the requests returned by this method. It is called when the spider is opened for scraping (see the sketch after this list).
4. parse(): This handles the responses downloaded for each request made. It is responsible for processing the response and returning scraped data. In the above code, the parse method is used to save response.body into an HTML file.
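The spider above relies on the default behaviour of start_requests(). A sketch of overriding it explicitly, which is equivalent to listing the URL in start_urls, might look like this:

import scrapy


class MyfirstspiderSpider(scrapy.Spider):
    name = "myFirstSpider"
    allowed_domains = ["opensourceforu.com"]

    def start_requests(self):
        # Yield one Request per starting URL and name parse() as the
        # callback that will handle each downloaded response.
        urls = ['http://opensourceforu.com/2015/10/building-a-django-app/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log('Visited %s' % response.url)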
Crawling
Crawling is basically following links and crawling around websites. With Scrapy, we can crawl any website using a spider with the following command, run from the project's root directory:

scrapy crawl myFirstSpider
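For the start URL used in myFirstSpider, the filename logic inside parse() works out as follows; the values in the comments are simply what the string operations produce:

# What parse() computes for the start URL of myFirstSpider
url = 'http://opensourceforu.com/2015/10/building-a-django-app/'
page = url.split('/')[-2]            # 'building-a-django-app'
filename = 'quotes-%s.html' % page   # 'quotes-building-a-django-app.html'

So the downloaded page is saved as quotes-building-a-django-app.html in the directory the crawl was started from.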
Items
Items are used to collect the scraped data. They behave like regular Python dicts. Before using an item, we need to define the item fields in our project's items.py file. Add the following lines to it:

title = scrapy.Field()
url = scrapy.Field()

Our code will look like what's shown in Figure 1. Now run the spider, and our output will look like what's shown in Figure 2.
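A minimal items.py built along these lines would look roughly like the code below; the ScrapyFirstItem class name is what startproject generates for this project by default, so treat the exact name as an assumption:

import scrapy


class ScrapyFirstItem(scrapy.Item):
    # Fields the spider will fill in
    title = scrapy.Field()
    url = scrapy.Field()

The spider can then populate and yield this item from its parse() callback instead of writing the raw HTML to disk, for example:

import scrapy

from scrapy_first.items import ScrapyFirstItem


class MyfirstspiderSpider(scrapy.Spider):
    name = "myFirstSpider"
    allowed_domains = ["opensourceforu.com"]
    start_urls = ['http://opensourceforu.com/2015/10/building-a-django-app/']

    def parse(self, response):
        item = ScrapyFirstItem()
        item['title'] = response.xpath('//title/text()').extract_first()
        item['url'] = response.url
        yield item

To write the scraped items to the data.xml file mentioned below, the crawl can be run through Scrapy's feed exports, for example: scrapy crawl myFirstSpider -o data.xml.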
We can find data.xml in our project's root folder, as shown in Figure 3.

Scrapy also provides a built-in facility for sending e-mail, through the scrapy.mail module:

from scrapy.mail import MailSender

mailer = MailSender()
mailer.send(to=['abc@xyz.com'], subject='Test Subject', body='Test Body', cc=['cc@abc.com'])
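A MailSender created with no arguments assumes a local SMTP server; in a real project it is usually built from the settings instead. A minimal sketch, assuming the standard MAIL_* options in settings.py and a hypothetical pipeline class name:

# settings.py (placeholder values)
# MAIL_HOST = 'smtp.example.com'
# MAIL_PORT = 587
# MAIL_FROM = 'scrapy@example.com'
# MAIL_USER = 'scrapy@example.com'
# MAIL_PASS = 'secret'

from scrapy.mail import MailSender


class CrawlReportMailer(object):
    # Hypothetical item pipeline that mails a note when the spider closes.
    def __init__(self, mailer):
        self.mailer = mailer

    @classmethod
    def from_crawler(cls, crawler):
        # Build the sender from the MAIL_* settings above.
        return cls(MailSender.from_settings(crawler.settings))

    def process_item(self, item, spider):
        return item

    def close_spider(self, spider):
        self.mailer.send(to=['abc@xyz.com'],
                         subject='Crawl finished',
                         body='Spider %s has finished.' % spider.name)

Like any other pipeline, this one would also need to be enabled under ITEM_PIPELINES in settings.py.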