Web Scraping using Python (Buildwithpython): Scrapy Tutorial Transcripts

Video 1: Python Scrapy Tutorial- 1 - Web Scraping, Spiders and Crawling

Imagine you are currently on amazon.com, looking at the books section that lists all the books released in the last 30 days. There are a lot of books in this section, and for some reason you want the title of each book, maybe the author name, and even the price, and you want to store all of that data in some kind of file, maybe on your external drive, or even in some kind of database like SQL.

So you create a Python program using a package known as Scrapy, and this is roughly what that program looks like. To execute it, you open your terminal, go into your project folder, and type "scrapy crawl" followed by the name of the spider you want to run; I'm just going to write "amazon". Because we want to store all of the scraped data in a JSON file, I give it the -o flag and the name of the file where I want the data to be stored, so I'm going to call it items.json, and press Enter. It's going to fail a couple of times because we are using proxies, since Amazon doesn't allow a lot of bot activity on its website.

Now we can refresh our Scrapy project folder and open items.json. As you can see, under the product name it has the names of all the books, and if you scroll a little to the right you can see the product price as well. The title "Becoming" shows a price of about nineteen dollars, and similarly Diary of a Wimpy Kid has a price of somewhere around twenty dollars. If we go back to the website, you can see that "Becoming" is listed at nineteen dollars and fifty cents; we are only extracting the nineteen dollars and not the fifty cents. That is pretty easy to fix, and I'll show you how later, but as you can see, the scraper is working. Similarly, another book that is about twenty dollars on the site shows up as twenty dollars in our file.

This whole process, this idea of extracting data from a website or from multiple websites, is known as web scraping, and the Python program that we use to scrape the data is known as a spider. Sometimes web scraping is also called web crawling. So what we have done here is create a Python program, which uses a package called Scrapy and is known as a spider, that takes the information on this web page and sends it to a JSON file. Obviously this data could also be sent to a database like SQL instead. So that is web scraping and the idea of a spider. In the next video we are going to install Python and an IDE. You don't need the PyCharm IDE; you can install something like Sublime Text or Atom instead. We'll go through that in the next video, where we will also be installing Scrapy, so I'll see you there.
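The spider shown on screen is not included in the transcript, so here is only a minimal sketch of what a spider run with "scrapy crawl amazon -o items.json" could look like. The class name, the listing URL, and the CSS classes (s-access-title and sx-price-whole, taken from the next video) are assumptions rather than the exact code from the video, and Amazon's real markup changes often.

    import scrapy

    class AmazonBookSpider(scrapy.Spider):
        # Hypothetical sketch of the Amazon book spider described in the video
        name = 'amazon'
        start_urls = ['https://www.amazon.com/']  # placeholder: the "last 30 days" books listing URL

        def parse(self, response):
            # Each result block holds a title and a price; these class names are assumptions
            for book in response.css('li.s-result-item'):
                yield {
                    'product_name': book.css('h2 .s-access-title::text').extract_first(),
                    'product_price': book.css('.sx-price-whole::text').extract_first(),
                }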

Video 2: Python Scrapy Tutorial- 2 - How does Web Scraping work?

Alright guys, welcome back. In this video we are going to learn about the behind-the-scenes working of a web scraper, or web crawler, that uses Scrapy: how exactly does a Python program, called a spider, go to a web page and extract all of the data on it?

If you have done even a little bit of web development, you probably already know that every web page is made up of two main languages: HTML and CSS (cascading style sheets). CSS is basically used to beautify a page, and HTML is used to give the page its structure. If you want to see the HTML and CSS that go into making this web page, you can right-click somewhere on it and click on "View page source", and this will show you all of the source code behind the page. Some of what you see here is JavaScript; let's not worry about that. We only care about where the HTML elements are and where the CSS attributes are. For example, if I scroll down you can see a division (div) HTML tag that has a class attribute describing a row or some spacing. This is what we are going to be concerned with, because this is how a web scraper knows what to scrape.

So what our Python program does is go to this web page, because we give it a link to go to. After it goes to that link, it looks at the source code, and then we tell it, "hey, I want the title of the book", and it searches for it. We can do the same thing manually: press Ctrl+F in the page source, search for "Becoming", and you can see an href containing "Becoming Michelle Obama", so this data really is inside the source code. Of course, we don't have to read through all of this source code while scraping. Instead we can go to the "Becoming" title on the page, right-click on it, and click on Inspect, and that takes us directly to the place in the markup where "Becoming" is written.

So what the Python program does is this: first it goes to the URL that we have given it and takes out the source code of that web page. Then we give it a kind of condition, which for now you can think of like an if condition. We tell the spider: look for the h2 tags, and inside those h2 HTML tags find the class "s-access-title" and take the text out of it. So it goes to the source code of the page, looks for the h2 tag, looks inside it for the class "s-access-title", and extracts the text, which in this case is "Becoming". Then it takes that data and stores it inside a database, which you are going to learn about later. That is exactly how our Python program extracts data.

While extracting data, you have to make sure the element you want has some kind of unique property. In this case the "Becoming" title has the unique property of the class "s-access-title", and so do all of the other titles on this page: if we go to another title and click on Inspect, you'll see that it also has the class "s-access-title". So whenever you are scraping some piece of data, you have to find some unique property of the element you want to scrape. If you go to one of the prices, say a price of about 23 dollars, and click on Inspect, you'll see it also has a unique property: the class "sx-price-whole". So we can tell the Python program: go to this link, take out all of the source code, find the span HTML tag, and inside that span tag find the class containing "sx-price-whole", and extract this value of 23. That is exactly what we are doing here: go to the source code, find the span HTML tag, and find this "sx-price-whole" class. Notice that "sx-price-whole" is a class, which is why we use the dot operator in the selector; if it were an ID, we would use a different character. We are going to discuss this more in the selectors section of this video series.

Anyway, that's pretty much it for this video. We learned how a web scraper, a Python program, goes to a web page whose URL we have given it and extracts data according to the HTML elements and the attributes inside those elements. In the next video we are going to learn about something known as robots.txt. Just to give you a teaser: if you go to facebook.com (you can follow along), type /robots.txt after the domain and press Enter, you'll see this "secret" file inside Facebook. It basically says that crawling Facebook is prohibited unless you have express written permission. We are going to discuss this file in more detail in the next video, so I'll see you over there.
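As a rough sketch, the "condition" described above can be written with Scrapy's CSS selectors along these lines, inside a spider's parse() method. The class names s-access-title and sx-price-whole are the ones mentioned in the video; Amazon's actual markup may differ.

    # 'response' holds the downloaded page source
    titles = response.css('h2 .s-access-title::text').extract()   # classes are selected with a dot
    prices = response.css('span.sx-price-whole::text').extract()  # the whole-number part of each price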

Video 3: Python Scrapy Tutorial - 3 - Robots.txt and Web Scraping Rules

In this video we are going to learn about obeying the laws of web scraping on the internet. These rules are usually written in a file called robots.txt. You can go to the URL facebook.com/robots.txt, or to amazon.com/robots.txt, and you'll be able to see the directories that these websites don't want us to crawl. You don't strictly have to follow the rules written by these websites, but it's always a good idea to follow them where and when you can.

Because we are going to be using Scrapy, you don't really have to worry about the robots.txt file: Scrapy automatically follows the rules in robots.txt and doesn't crawl the URLs disallowed in that file. But if, for some reason, you don't want to follow the robots.txt rules, you can always go to the settings file and change the ROBOTSTXT_OBEY setting to False. Right now we haven't created a Scrapy project yet, which is why you don't have this settings file, but in the next video we are going to be installing Scrapy and creating a new project, which I'm really excited about. So I'm going to see you in the next video, and this is pretty much it for this one.
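For reference, the setting mentioned above lives in the project's settings.py once a project has been created; a project generated by "scrapy startproject" includes it set to True by default:

    # settings.py
    ROBOTSTXT_OBEY = True   # change to False only if you deliberately want to ignore robots.txt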

Video 5: Python Scrapy Tutorial - 5 - Installation with Terminal / Sublime

Alright guys, welcome to this video series on using Scrapy to do web scraping and web crawling. In this video we are going to learn how to install Scrapy on your computer if you are using an editor like Sublime Text, Atom, or Notepad++. For the purposes of this video I'm going to assume that you decided not to use the PyCharm IDE; if you do decide to use PyCharm, you can always go back to the previous video. PyCharm makes the whole process of installing and using Python and Scrapy a lot easier, but it's fine if you decide to use Sublime Text instead.

The first thing you need in order to install Scrapy is obviously Python itself, so go to the official Python website and install the latest version; in my case it's Python 3.7.1. After you have installed Python, install the text editor of your choice, in this case Sublime Text. Once Sublime Text is installed, it's going to look something like this. What I want you to do is create a new folder called scrapy_tutorial, just like I have, if you want to follow along, and then open that folder inside Sublime Text. After that we are going to follow a particular set of steps, which I'm going to paste over here.

The first step is already done: you have installed Python and Sublime Text. The second step is to create a virtual environment. What is a virtual environment, and why do you need one? A virtual environment basically isolates your project from the rest of the computer. For example, if you install a package inside your scrapy_tutorial folder, it's not going to be installed on the rest of the computer, and vice versa: if you install some package on the rest of the computer, it's not going to be installed inside the scrapy_tutorial folder. So if we install the Scrapy package inside the scrapy_tutorial folder with the virtual environment activated, Scrapy is only installed inside that folder and not on the rest of your computer. It's useful to isolate your working environment from the rest of the machine.

The process of creating a virtual environment is pretty easy; you just need to follow these steps. In my case I'm using Windows, so I'm going to open my Command Prompt; if you are using Linux or Mac you can open your terminal. First you need to install the package that creates virtual environments, so I'm just going to write "pip install virtualenv" and press Enter. I already have it installed on my computer, so it just says "requirement already satisfied", but if you don't have it installed, this command will install it.

After that, the instructions say to create a virtual environment by typing "virtualenv" followed by a dot. But first you need to go to your project folder; in our case that's the scrapy_tutorial folder on my drive, so I'm going to change my directory by writing "cd" (which stands for change directory) followed by the path to my scrapy_tutorial folder, and press Enter. Once you are inside your project folder (as you can see, it's totally empty right now), you create the virtual environment by typing "virtualenv ." with the full stop and pressing Enter, and it's going to create an environment for us. You just have to wait a couple of minutes.

After the virtual environment has been created, you need to activate it. If you are using Windows, you type ".\Scripts\activate"; if you are using Linux or Mac, you run the activate script from the bin directory instead (source bin/activate). Because I'm using Windows, I'm going to paste ".\Scripts\activate" into my terminal. Make sure that you run this activate command inside the folder where you created the virtual environment; I created mine inside the scrapy_tutorial folder, so I'm going to run it there and press Enter. As you can see, there are now brackets at the start of the prompt with "scrapy_tutorial" written inside them, which means that our virtual environment has been properly activated. If we now run "pip freeze", you'll see that we don't have any packages inside our scrapy_tutorial virtual environment; pip freeze basically shows what packages are installed in your environment, and right now there are none.

So let's install Scrapy inside our scrapy_tutorial folder now that the virtual environment has been activated; as you can see, that is the third step. Installing Scrapy is easy: just write "pip install scrapy". The Scrapy package is going to pull in other packages as well, so this might take a bit more time depending on the speed of your internet connection; I'm just going to wait for the installation to complete. After Scrapy has been installed, we can type "pip freeze" again to see what packages are installed inside our scrapy_tutorial environment. Now you can see that Scrapy is installed, and not just Scrapy: a lot of other packages were installed along with it. Scrapy itself is there (version 1.5.1 in my case), but it requires other packages, which is why Scrapy automatically installed them for you.

Now I want to create a project, but before that I just want to show you what is inside the scrapy_tutorial folder. If you open it up, it contains a bunch of folders like Include, Lib, Scripts, and tcl. All of these folders exist because of the virtual environment; don't worry about them, they are not the folders of your Scrapy project, but don't delete them either, because they are what makes the virtual environment work inside the scrapy_tutorial folder.

Now we are going to start a project inside Scrapy, and that is the most beautiful part of Scrapy: you just write "scrapy" and then "startproject" followed by a project name. In the next couple of videos we are going to be creating a quote web scraper, so I'm going to write "quotetutorial" and press Enter. What this does is create a project structure for us, and it says you can start your first spider by cd-ing into the quotetutorial folder; we don't need to do that right now because we are using Sublime Text. If we open up the quotetutorial folder, you can see it has a lot of files in it. We are going to go through this project structure in the next video, but for right now we have successfully installed Scrapy on the computer with Sublime Text.

Just for the purposes of the future videos, I'm going to be using PyCharm, so my project structure is going to look like this; it's just like in Sublime Text, there's not a lot of difference, and instead of using Command Prompt or a terminal I'm going to use the terminal that is built into PyCharm. The virtual environment inside PyCharm has been created for us by PyCharm itself, so you don't need to create one yourself there, but it doesn't matter, because we have already created a virtual environment of our own. So guys, this is pretty much it for this video. From the next video we are going to be creating the scraper for our first project, which is scraping quotes and authors from a website, so I'll see you in the next video.
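To summarize, these are the terminal steps described in this video, assuming the Windows-style paths used on screen (on Linux or Mac the activation step is "source bin/activate" instead, and the project path is whatever folder you created):

    pip install virtualenv              # install the virtual environment tool
    cd path\to\scrapy_tutorial          # go into your project folder
    virtualenv .                        # create a virtual environment in the current folder
    .\Scripts\activate                  # activate it (Windows)
    pip freeze                          # list installed packages (empty at first)
    pip install scrapy                  # install Scrapy and its dependencies
    scrapy startproject quotetutorial   # generate the Scrapy project skeleton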

Video 6: Python Scrapy Tutorial- 6 - Project Structure in Scrapy

So in the last video we learned how to install Scrapy on your computer, no matter whether you're using Windows, Linux, or Mac, and after that we even created a Scrapy project called quotetutorial. If you haven't created this project yet and you just stumbled upon this video from somewhere, what you can do is open up the terminal; if you're not using PyCharm and you are using something like Sublime Text, open up Command Prompt, or a terminal if you are on Linux or Mac. Once you have gone inside your virtual environment (and if you don't have a virtual environment, I highly recommend going back to the previous videos to see how I created it), you can just write "scrapy startproject" and then the name of the project. In our case, because we are going to be scraping a quotes website, we have called it quotetutorial, so if you want to follow along you can write "scrapy startproject quotetutorial" and press Enter. I'm not going to do it, because I have already created this project.

After you have created the project, you should have this kind of folder structure inside your Sublime Text or PyCharm. To make sure you don't get intimidated by all of these files, I'm going to go through them one by one, though not in a lot of depth, because we are going to cover each of these files in a much deeper way in future videos.

Let's go from the top. We have the scrapy_tutorial folder, which is our virtual environment folder, and inside that we have the project folder, the quotetutorial folder that we just created. Inside that, the Scrapy project has created another folder with the same name, quotetutorial, and inside that we have a folder called spiders. If you don't remember, a spider is basically a Python program that scrapes websites, so we are going to write our Python code inside this spiders folder. There are also two __init__.py files, or initialization files, that you don't need to worry about; Scrapy just uses them for its own purposes and they are essentially empty, though the one in the spiders folder notes that this is where the spiders of your Scrapy project should go.
After that we have four files that are really important. Let's start with the settings.py file. If we open it up and scroll to the top, you'll see that the first line is BOT_NAME = 'quotetutorial'. Now, quotetutorial is just the name of our project, but why is it calling it a bot, when we are creating a web scraper or web crawler? If you don't know, a bot is basically anything that automates some kind of action on the internet or on a website, and because we are automating the action of scraping or crawling a website, our crawler can also be called a bot. That is why settings.py says BOT_NAME = 'quotetutorial'.

After that there are a couple of module settings which you don't need to worry about, and then there is the user agent. What is a user agent? Whenever you visit a website, for example google.com, you need to identify yourself: who exactly is sending this request? Google asks the browser you are using, "who are you, exactly?", and the browser sends something along with the request that says, for example, "I'm just a Mozilla Firefox browser". If you are web scraping, you can be a little more responsible and identify yourself by putting your domain name in the user agent, but you don't have to. In future videos you'll also see that a lot of websites put restrictions on web scraping, and you'll see us bypassing those restrictions using this user agent setting, but for us right now it doesn't really matter. I'm not going to go into the ROBOTSTXT_OBEY setting here, because I already covered it in a bit more depth in the third or fourth video of this series, where we looked at what the robots.txt file is.

The next thing we need to know about in the settings.py file is the concurrent requests setting. Whenever you make a lot of requests at the same time, those are known as concurrent requests. To rewind a little, in case you don't know what a request is: it is basically asking a website to open up. For example, if you go to google.com and press Enter, your browser sends a request to the Google server asking it to respond with the page. Similarly, we are going to use a Python program to request a website so that we can scrape it. Each request gets us the data of that page once, but because we are going to be scraping a lot of data, we can't rely on just one request at a time, so we send many requests concurrently. In our case we are going to use the default of 16 concurrent requests rather than the commented-out value of 32.

You might be thinking: we have a huge amount of data to scrape and we want the scraping to be done really quickly, so why keep the number of concurrent requests at 16? Why not increase it from 16 to something like a thousand? The reason we don't do that is that it puts a lot of load on the server. Thirty-two requests at a time is fine, but a thousand is a bit too much. Imagine somebody hitting your website with thousands of requests every second; that can overload the server and the website, and if the website is small, it can even go down. That is why you want to keep this number small; 32 is fine, but a thousand is not. If you scroll down, the file has more of the same kind of thing, and further down there isn't a lot that we are interested in right now, except a setting called AutoThrottle, which we'll get into a little later; it also helps make sure that the website you are scraping does not get overloaded.
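Put together, the relevant lines of the generated settings.py look roughly like this; the values shown are the defaults discussed above, and the user agent string is just the illustrative placeholder Scrapy generates:

    # settings.py (excerpt)
    BOT_NAME = 'quotetutorial'

    # Identify yourself responsibly by customizing this, e.g. with your domain
    # USER_AGENT = 'quotetutorial (+http://www.yourdomain.com)'

    ROBOTSTXT_OBEY = True           # respect robots.txt (covered earlier in the series)

    # CONCURRENT_REQUESTS = 32      # commented out; the default of 16 is used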
The next file we are going to look into is the one called items.py. If you open it, it has a class called QuotetutorialItem that has been automatically created for us, and it says "define the fields for your item here". So what are the fields that it needs? If we go to a website like quotes.toscrape.com, which is the website we are going to scrape with this first spider (you can scrape this website freely; it was actually created by the Scrapy developers so that people can learn how to do web scraping), it has a lot of quotes, the authors of those quotes, and tags for those quotes. So the first item we can see in each element is the quote itself, the second item is the author, and the third item is the tags; one element on the page actually has three items. Whenever we scrape a website, it has a particular set of item fields. This element has three items, but if we open a website like Amazon, you'll see it has different items, for example the product name, the product price, the product category, the product description, and so on. So whenever we are scraping a project, it's a good idea, and good practice, to define your items in this items.py file. We are going to look into this file a little more in the next video, but for right now just understand that whenever we need to define a field, it goes into items.py. For example, the file shows name = scrapy.Field(), and if we want to scrape a quote we can write quote = scrapy.Field(). And that is it for the items.py file.
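As a small example of what that looks like, the generated class with a few fields added for this project might read as follows (the fields beyond the commented "name" example are additions for the quotes project, not part of the generated file):

    # items.py
    import scrapy

    class QuotetutorialItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        quote = scrapy.Field()
        author = scrapy.Field()
        tags = scrapy.Field()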
Next, let's look into the file called pipelines.py. In a nutshell, after scraping the data from this quotes website we want to store it somewhere; for example, that data could be stored in a JSON file, or maybe in a SQL database, or maybe a MongoDB database. That is done using this pipelines.py file: the code that goes inside pipelines.py makes sure that your scraped data is handled properly and goes wherever you intend it to go.
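A minimal sketch of what a pipeline class looks like; this is essentially the skeleton Scrapy generates, and any real storage logic (SQL, MongoDB, and so on) would go inside process_item:

    # pipelines.py
    class QuotetutorialPipeline(object):
        def process_item(self, item, spider):
            # handle or store each scraped item here (e.g. write it to a database)
            return item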
After that we have the file called middlewares.py. What this does is let you add things to the requests you send to a website. For example, later on we are going to be learning about user agents, and we are going to be adding proxies to our requests; a proxy basically means using different IP addresses to bypass a website's restrictions on web scraping. Whenever we add a proxy to our requests, we are going to do it through a middleware. And whenever a website sends a response back, we can also handle that response in a middleware: for example, if we extract a quote from this website and want to do something with it, we can use a middleware for that, and then we can obviously send it on to pipelines.py, where we can store it in some kind of database.

So guys, this is pretty much it for this video. In this video we learned about the project structure, and hopefully now you are a little more comfortable seeing all four of these files. In the next video we are going to create our first spider, our first Python program, inside the spiders folder, so I'll see you over there.

Video 7: Python Scrapy Tutorial- 7 - Creating our first spider ( web crawler )

Alright guys, welcome back. In the previous video we learned about the project structure, how to create it, and what each of these files does, so I'm really excited about this video, because we are finally going to create our very first spider, a Python program that is going to scrape the website quotes.toscrape.com. What I'd recommend is that you open up this website and actually check out the source code that goes into it, because we are going to use this website in the coming videos to scrape the quotes, the author names, their tags, and a lot more, so I just want you to get a little familiar with its source code. Obviously I'm going to go through the source code with you as we scrape the website, so don't worry about it too much.

What we are going to do now is go back to our code, and inside the spiders folder we are going to create our very first Python program. Whenever you create a spider, make sure that it goes inside this spiders folder and not anywhere else. So I'm going to create a new Python file in there and call it quotes_spider; you can call it whatever you want. After that I'm going to import the scrapy package, and then I'm going to create a class and call it QuoteSpider. You can call the class whatever you want, but the important thing is that it inherits from scrapy.Spider. This inheritance gives us a lot of functionality for free, and we won't have to write a lot of code because of it.

The first thing we need to do is create a variable called name; this is going to be the name of our spider, and in our case I'm just going to call it 'quotes'. After that, Scrapy requires you to give it a list of URLs that you want it to scrape, so you have to create a variable called start_urls and give it that list. Right now we just want to scrape one URL, quotes.toscrape.com, so I'm going to copy it and paste it into the list. Make sure you don't change the names of these two variables: call the first one name and the URL list start_urls, because the scrapy.Spider class that we are inheriting from expects these exact names.

After that, we are going to create a method called parse, and as you can see, PyCharm is already helping us: it takes two things, the self reference and, more importantly, the response parameter, which basically contains the source code of the website we want to scrape. The quotes website inside our start_urls list is going to send its source code into this response variable.

For this video we just want to scrape the title of the website, the one you can see in the top left corner, "Quotes to Scrape". If we go into the source code, this title sits inside the title tag. So inside our parse method we create a variable called title and write response, because response contains the source code; but we don't want all of the source code, only the part that contains this title element, so we write response.css() and tell it, "hey, I just want that title tag and nothing else". If we wanted another tag instead, for example a division tag, we would have written div instead of title. Then, to extract this title tag, we write .extract(). After that we write yield. I'll get into yield a little later, not right now; I just want to complete the code first. yield takes a dictionary here, so I'm going to create a dictionary, and every dictionary needs a key and a value; I'll call the key 'titletext' (you can call it whatever you want), and the value is this title variable. And with that we have completed our very first spider, which scrapes the title of the URL that we have given it.
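Putting together exactly what was just described, quotes_spider.py ends up looking roughly like this (shown with the start_urls naming fix that is pointed out at the end of this video):

    # quotetutorial/spiders/quotes_spider.py
    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # response holds the page source; grab just the <title> tag
            title = response.css('title').extract()
            yield {'titletext': title}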
Just to make sure you understand what's going on, I'm going to go through this code once more. The first thing we did was create this quotes_spider file inside the spiders folder. Then we imported scrapy and created a class known as QuoteSpider, which inherits from the scrapy.Spider class. We gave our spider a name, 'quotes', and then we put the list of websites that we want to scrape into the start_urls list; right now our list contains only one URL. After that we created a method called parse, which comes from Scrapy. Make sure you name the start_urls variable and the parse method exactly as written; don't change them, because Scrapy expects those exact names.

parse takes two things. The first is the reference to the instance itself, that is, self. If you don't know about classes, objects, and inheritance: basically, whenever you create a new object and you want it to refer to itself inside the class, you use self. If this doesn't make a lot of sense to you, don't worry about it; if you want to learn about classes, objects, and inheritance, check out my videos on that topic. I'll attach them somewhere in the bonus section of this video series, so be on the lookout for that if you want to know more, but you don't need to.

That is the first argument, and the second argument is the response, which is the more important one; it contains the source code. But we don't want the whole source code, only a particular part of it, and in our case that is the title tag, which contains the title of the page. So here we created a variable called title, and we are telling Scrapy: go to the source code and look for the title tag; after you have found it, extract it; and after you have extracted it, yield it (return it) and show it to us. And how do I want it shown? As a dictionary, and every dictionary contains a key and a value; the key can be anything, I just called it 'titletext', and the value is the title variable, which contains the title of the page, "Quotes to Scrape".

So guys, this is pretty much it for this video, and one thing I just noticed as I was about to end it is that one variable name is not correct: it shouldn't be start_url, it should actually be start_urls, with an s. This shows how important it is to make sure your variables are named correctly. Anyway, that's pretty much it for this video; in the next one we are going to run our crawler, and you're going to see how the title actually gets scraped. So I'll see you in the next video.

Video 8: Python Scrapy Tutorial- 8 - Running our first spider ( web crawler )

Alright guys, in the previous video we learned how to create our very first spider, and in this video we are finally going to run it. But before we get started on that, I just wanted to discuss the yield keyword that we didn't have time for in the previous video. You can think of the yield keyword as similar to the return keyword that we usually use inside a method or a function. Why exactly do we use yield instead of a normal return statement? Because yield is used with a generator, and that generator is being used by Scrapy behind the scenes. I won't go into what a generator is in Python, because it's not relevant to this video series; you just need to remember that you are supposed to use the yield keyword instead of the return statement inside the functions you put in your spider.

The process of running the spider we have created is really simple. If you are using PyCharm, you can just open up the terminal, and it will put you in the scrapy_tutorial project folder with the virtual environment activated. If you're not using PyCharm and you're using something like Sublime Text, then open the Command Prompt, or the terminal on your Mac or Linux OS, and make sure that you're inside the scrapy_tutorial folder and that your virtual environment is activated. After that you have to go inside the quotetutorial folder, and once we are inside it we can run our spider, or crawler. To run it, you just write "scrapy crawl" and then the name of the spider we created; we gave it the name "quotes", so we write "scrapy crawl quotes", and this runs our crawler.

Now, this may give you an error that says "no module named win32api". If your crawler runs fine and you're not getting this error, don't worry about it; if you are getting it, just follow what I'm doing. As you can see, it says the win32api module was not found, and the traceback mentions Twisted: when we installed Scrapy, it also installed another package known as Twisted, and Twisted needs this win32api module, which is not currently in our project. What we can do is go to File, go to Settings, click the plus button, search for pywin32, and click "Install Package", and this installs pywin32 for you. If you are not using PyCharm, you can just run "pip install pywin32", which installs the same thing. After the package has been installed properly, go back to the terminal, let the processes finish, and run the same command again; you'll see that the site is scraped without any problems.

So now our website is being crawled, and if we scroll up a little you'll be able to see that the title text, the title, is being extracted properly, but it also contains the title tag itself, which we don't want. What we can do is add "text" to the selector: after "title" we put in two colons and tell Scrapy that we just want the text inside the title tag and nothing else. Then we run our program again; scroll down a little, run the same command, press Enter, and this time you'll see that we get our text without the title tag. If we scroll up a bit, you can see our key, titletext, and now we have "Quotes to Scrape" in a list. So we have successfully scraped the website, or, to be exact, we have scraped the title of the website.
title of the website to be exact so in
this video we learned how to run our
crawler or how to run the spider and we
also learned the meaning of this yield
keyword so in the next video we are
going to be learning about something
known as selectors and we are also going
to be looking into the scruffy shell so
I'll see you in the next video

Video 9: Python Scrapy Tutorial - 9 - Extracting data w/ CSS Selectors

Alright guys, in the previous video we created a very simple Python program that scraped the title of our quotes website. How exactly did we scrape the title? We took the source code inside the response variable, then we selected the title tag, and then, because we only wanted the text inside the title tag, we put in two colons and added "text" after them. This idea of selecting a particular HTML tag, or a particular class or ID inside the source code, is known as a CSS selector. We did it by typing .css and then writing the condition inside the css() function; this idea of writing a CSS condition, kind of like an if condition, is what a CSS selector is. In this video we are going to learn about different CSS selectors: basically, how do you extract a quote from this page, or how do you extract an author? You do it by using a particular CSS selector.

A very good way to learn about CSS selectors is something known as the shell inside Scrapy. So what I want you to do is open up your terminal and type "scrapy shell", and then put in double quotes (if you are using Windows, make sure you use double quotes), and inside those double quotes put the website we want to scrape; for right now that's quotes.toscrape.com. We go back to our terminal, paste it in, and press Enter, and Scrapy is going to open this website inside a shell. If you're wondering what a shell is: the shell inside Scrapy is basically like the Command Prompt for Windows; it lets us control Scrapy in command-line mode. As you can see, we got some response from it: it says "Crawled (200)" followed by the URL of quotes.toscrape.com. 200 means that the status is OK; the website has been crawled, we sent a request, and we got back the response variable, which contains the source code. Now we can look inside that source code to find various things, or to scrape the data inside it.

Let's start by scraping the title, just like we did before, but we won't start from .extract(); we'll start from an even more basic command. We can write response, because that contains our source code, and we want to select something from that source code using a CSS selector. To use a CSS selector you write .css and then tell it what you want to select. In our case we want the title, and the title is inside an HTML title tag, so we just write 'title' inside the parentheses, close the quotes, and press Enter. As you can see, it gives us a selector list which contains this title tag that says "Quotes to Scrape". Obviously we don't want this whole selector list, we just want the "Quotes to Scrape" part, so now we write the same thing again but also add .extract(), because we don't want the whole Selector object, just the title tag and the "Quotes to Scrape" text. Write .extract() and press Enter, and now we have the title tag and, inside it, the "Quotes to Scrape" text.

Now, what if we don't want the title tag at all, just the text inside it? We can write colon colon and add "text" after it, to get just the text, and press Enter again, and as you can see, we have the "Quotes to Scrape" text. If you look carefully, this "Quotes to Scrape" is actually inside a list, as you can tell from the square brackets; whenever there are square brackets, you know it's a list. So what do you do if you want an item inside the list? It's pretty simple: just like we extract data from any list, we can put square brackets at the end and write 0 inside them, because this is the zeroth element of the list, and press Enter, and you can see that we have extracted the item from the list.

There is another way to do it: instead of using the index 0, we can write extract_first(), and this does the same thing; extract_first() just extracts the first item of the list, and if you press Enter you'll see it gives the same result as indexing with zero. The benefit of using extract_first() is that when you're scraping a website and nothing gets scraped, indexing with zero will raise an error, but if you are using extract_first() and there's no item in the scraped list, it returns None instead (not null, to be precise, but None), so you don't get an error. So whenever you can, use extract_first() instead of the zeroth index. Obviously, if you want to extract the second or third item from a list, you can put in an index of 1 or 2, and so on.
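For reference, the shell session described so far looks roughly like this; the output shown is approximately what quotes.toscrape.com returns, with formatting abbreviated:

    scrapy shell "http://quotes.toscrape.com"

    >>> response.css('title')
    [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
    >>> response.css('title').extract()
    ['<title>Quotes to Scrape</title>']
    >>> response.css('title::text').extract()
    ['Quotes to Scrape']
    >>> response.css('title::text').extract()[0]
    'Quotes to Scrape'
    >>> response.css('title::text').extract_first()
    'Quotes to Scrape'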
Now, what if instead of just the title we wanted something more substantial from this website, for example a quote? We can select a quote on the page, right-click on it, and click on Inspect, and this takes us directly to the span element that contains the text. We can see that the quote is inside a span HTML tag which has the class "text". So we go back to our terminal and write response.css, using a CSS selector, and inside it we tell it: go and look for a span HTML tag, and inside that span tag look for the class "text" (that is, class="text"). How is a class represented in CSS? With a dot. If this were an id="text" instead of a class, we would have used the pound (#) character instead of the dot; but because it's a class, we use the dot character and then write text. And because we don't want the whole span HTML tag, just the text inside it, we add the colon colon text, just like we did with the title, then write .extract() with parentheses and press Enter. As you can see, we have gotten all the quotes from the website, well, not from the whole website, just from this front page. If we want, say, the second quote from this list of items, we can go back and, just like we did with the title, put in the index 1, because 0 is the first index: the zeroth item is the first quote on the page, "The world...", and the item at index 1 is the second quote, just like lists normally work. Press Enter and you'll see that we have gotten the second quote.

Now, we can work out these CSS selectors, like span.text, on our own without any tool on simple websites like this quotes site, but when we are scraping a slightly more complex website, it's better to have a tool to find the particular CSS selectors that we want. So what I want you to do is go to google.com and search for "SelectorGadget Chrome". It's a Google Chrome extension, and the first link should be fine; just open it and add it to your Chrome. I've already added it, so I'm not going to add it again; you'll see its icon in the top right corner, called SelectorGadget. After you have installed it, come back to the quotes website. Let's say we want the author names on this page. I can click on the SelectorGadget icon, then go over to one of the authors and click on it; let's say I click on Albert Einstein. You'll see that Albert Einstein is highlighted in green, and all the other authors have turned yellow, and if we scroll down, all the other authors on the page are yellow too. At the bottom it shows ".author" and says that ten items have been selected. So it is basically telling us: to select all the authors, just use the class selector ".author". Let's try it out: I'm going to copy it and write response.css again, but instead of span.text I'll write .author, and press Enter, and you'll see we got J.K. Rowling, because we are still extracting the item at the first index from the previous command. We don't want just one index, we want all of the authors, so I'll remove the index and press Enter again, and as you can see, we have gotten all of the authors from our response.css() selector.
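The corresponding shell commands from this part, again with the output abbreviated and approximate:

    >>> response.css('span.text::text').extract()
    ['“The world as we have created it is a process of our thinking. ...”', ...]
    >>> response.css('span.text::text').extract()[1]
    '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'
    >>> response.css('.author::text').extract()
    ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', ...]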
Now I've restarted my terminal by clicking the close button, because I wanted to give you another example with this very cool tool. If I go back to the quotes website, I can click on the SelectorGadget icon again to deselect everything, and then I'm going to go to the amazon.com "last 30 days" book section, which lists the book releases of the last 30 days. What if I want the titles, the product names, of the books on this page? I open the SelectorGadget tool and click on one of the titles. As you can see, along with the titles, all of these prices, formats, and other bits get selected too, and I want to deselect those other things. So I click on one of them, and it shows a red box around it, because SelectorGadget is asking, "what kind of things do you not want inside your CSS selector?" I don't want that element, so I left-click on it, and as you can see it has removed a lot more stuff. I also don't want this other yellow element, so I click on that as well, and now only the titles, the product names, on this page are selected, and it gives us the CSS selector ".s-access-title".

Let's find out whether this selector actually works. I'm going to copy the page URL, go back to the terminal, and open this website in our shell; paste it, press Enter, and wait for it to be crawled. Let's maximize the window a little, and you can see we've gotten the response. Now we can type response.css, use a CSS selector, and paste in the selector that the tool gave us. Because we just want the text, we put in the two colons and "text" inside the quotes, and since we want to extract all of it, we add .extract() and press Enter, and this gives us the titles of all the books on the page. And if we want the title of the second book instead of the first, we put in the index 1, because that's how lists work, press Enter, and you'll see we have gotten the second book.
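In shell form, the Amazon example looks like this; the listing URL below is a placeholder for whatever "last 30 days" page was open in the video, and .s-access-title is the class SelectorGadget reported at the time (Amazon's markup may have changed since):

    scrapy shell "https://www.amazon.com/<last-30-days-books-listing>"

    >>> response.css('.s-access-title::text').extract()      # all book titles on the page
    >>> response.css('.s-access-title::text').extract()[1]   # just the second title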
So guys, that's pretty much it for this video. We learned about CSS selectors and how to find an HTML tag or a CSS selector using this tool known as SelectorGadget. In the next video we are going to learn about another way of selecting these classes and IDs, known as XPath. A lot of people prefer using XPath; I generally prefer CSS selectors, but because this is a video series I also want to cover XPath, so that is what we are going to cover in the next video. I'll see you over there.

Video 11: Python Scrapy Tutorial - 11 - Web Scraping Quotes and Authors
alright guys welcome back in the last
couple of videos we learned about the
selectors and more specifically we
learned about the CSS selectors and the
XPath selectors and now I believe that
you guys are finally ready to extract
data from this course website that is
Coates dot to scrape calm and we are
finally going to be writing code inside
our codes underscored Spyder dot Python
file but before we start writing code
let's actually inspect the elements and
see the code that goes into writing
these codes so we're just gonna right
click over here and click on inspect and
you'll be able to see that each of these
elements is inside this class equals two
code so the first element that contains
the code the author and the tags is
represented by this division tag with
class equals to code and similarly with
the second quote the third quote and so
on
so what we are going to be doing is that
first we are going to selecting this
division class equal to code so we are
going to be selecting all these division
tags from our response very well that
will contain the source code so let's go
back inside our python file and instead
of this title shield we'll just remove
that because we don't need it now and
instead of that we are just going to
create a new variable and let's call it
or give quotes and what this variable is
gonna contain is that is gonna contain
all of these division tags over here
which have the class of equals two code
so let's go back to our code and just
like we did in the previous videos we
are going to write response dot CSS we
are going to use the CSS selector and
because we want all of the quotes in the
division tax that's why we are going to
write div dot quote and we are not going
to write dot extract over here because
we don't want this class data we just
want the items that are inside this
class for example the code the actual
code and then we want author name and
then the tags we don't want this actual
class data so what we are going to do is
we are going to use this class equals
two code and we are going to go inside
this so instead of using this response
variable we are going to be using all
diff
Let me actually show you what I'm talking about. We are going to create a new variable, let's call it title, and now instead of using the response variable, which contains all of the source code, we are just going to use all_div_quotes, which contains only the source code of the div tags with class="quote". So over here we write all_div_quotes.css, and then we use another CSS selector to extract the title, that is, our main quote. If we open the element up a little bit, we can see that we want to extract the data from this span HTML tag which has a class of "text", so we write span.text to select it, and then .extract(). Now, obviously we need to extract the text part of this quote and not the whole span element, so we also add ::text over here, just like we learned in the previous videos.
Now that we have extracted the main quote, let's extract the author name. We'll create another variable, let's call it author, and again we use all_div_quotes instead of response, with a CSS selector. Let's go back to our quote to see how we can extract the data that is inside it. If you open up this span you can see that it contains an a tag, and next to that, inside the span, we have this small HTML tag which has the class "author", and inside it we have this Albert Einstein, which is what we want. So we can come over here and just use the .author selector, and because we just want the text, we again add ::text and .extract(). Now that we have the author, let's extract the tags. If we go inside this div of tags and open it up, you will see that the tags are inside a tags with class="tag", and there are four of these. So we go back to our code, write tag equals, and again use all_div_quotes.css, selecting with .tag, and because I just want the text from it I write ::text and .extract(), and this should work fine. Now, like we did in all of the videos, we are going to return something — except in web scraping, in Scrapy especially, instead of return we use the yield statement. So I'm just gonna write yield, and because you always yield a dictionary, I'm going to create a dictionary. Every dictionary needs a key and then a value, so I write a 'title' key over here and send the value of the title variable, and I do the same with the author and the tag: I write 'author' and give it the value of author — I forgot to put a comma over here, let me add that — and then 'tag' with the value of tag. This all looks pretty good, but it is going to return a lot of data, so for now I'm just going to return the first quote's data: over here the first quote contains the "world" quote, the author and these tags. So instead of returning all of that I can just write a [0] over here, because, like we learned, this is basically a list, so we can take the first element from it instead of all the elements. Now, to run this crawler, we open up the terminal, go inside our project folder and press Enter. Then we write scrapy crawl, and we want to crawl quotes — that is the name of the spider we created, in case you have forgotten — and now we can just press Enter and see what happens.
Now, if we scroll up a little bit, let's see what Scrapy has returned to us, and as you can see we have some output. It says the quote is over here — "The world as we have created it is a process..." — which is pretty good, and we have the author, but there is some problem with the tags, which are not giving us anything. So let's see what the problem is. For the tags it says all_div_quotes.css with .tags, ::text and .extract(). Let's go back and look: the class is class="tag", alright, so instead of writing .tags we have to write just .tag — I misspelled it, so we have to correct this. Instead of .tags we just write .tag, and this should work properly.
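For reference, here is a minimal sketch of what the spider looks like at this point. It is reconstructed from the narration, so the exact class name, start URL and where the [0] index is applied are assumptions rather than a copy of the author's file:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # select every <div class="quote"> block on the page
            all_div_quotes = response.css('div.quote')

            # extract() always returns a list of matches
            title = all_div_quotes.css('span.text::text').extract()
            author = all_div_quotes.css('.author::text').extract()
            tag = all_div_quotes.css('.tag::text').extract()

            # yield only the first element of each list for now
            yield {'title': title[0], 'author': author[0], 'tag': tag[0]}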
So let's go back to our terminal. I made a little mistake there, but that's fine — I'm not going to cut it from the video, because it's a good learning lesson. I press Enter again, and alright, here it is: the title we have extracted, the author Albert Einstein, and we have the tags. Now let's remove this zero index over here and see what the output is when we extract all of the quotes, titles and tags from this main page. Our page has quite a lot of quotes, so let's scroll down and write scrapy crawl quotes again. Alright, it has been crawled; let's scroll up, and as you can see this contains a lot of quotes — here are our quotes, here are our authors, and the tags should start somewhere over here. So this gives us a lot of values of titles, authors and tags. Now, what if we want the values of title, author and tags one by one, instead of having them all thrown at us at once?
What we are going to do is create a for loop. Over here we write for q — or instead of q we can write quote — in all_div_quotes, so we are going to go through all of these quotes that are inside this div tag. Let's actually inspect over here, and wait for this to open up — yeah, so we are going to take these div tags one by one, and then one by one we take the quote, the author and the tags. That's the process. So what we can do is take these extraction lines, and instead of all_div_quotes we put quote over here — actually, let's change the loop variable to quotes so that we don't have to do a lot of renaming — and then we press Tab so that these lines come inside the for loop.
This looks pretty good. Let me just format it properly and run the crawler again to see how the output looks now. The website has been crawled, and if we scroll up you'll see that a lot of stuff has been scraped. If we go to the top you can see that these quote sections have been scraped one by one: we have the title, the author and the tag, then the second section with its quote, author and tag, and similarly the third title, author and tag. You can do it either way, that's totally fine, but this for-loop way of extracting one quote section at a time helps you store the data properly inside a database. It won't matter too much though, because in the next video we'll be learning about items and making sure that the title, author and tags are properly organised and stored in a proper form. So I'll see you in the next video, where we'll be learning about items in Scrapy.
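Putting the whole video together, the for-loop version of the parse() method ends up looking roughly like the sketch below. It sits inside the same QuotesSpider class as the earlier sketch; the names are still assumptions based on the narration:

    def parse(self, response):
        all_div_quotes = response.css('div.quote')

        # go through the quote blocks one at a time
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()

            # one dictionary yielded per quote section
            yield {'title': title, 'author': author, 'tag': tag}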

Video 12: Python Scrapy Tutorial - 12 - Item containers ( Storing scraped data )

Alrighty guys, in the previous video we extracted the data of quotes and authors, so in this video we are going to learn how to put that extracted data in containers called items. Now, why exactly do we need to put them in containers? We have already extracted the data — can't we just put it in some kind of database directly? The answer is yes, obviously you can, but there might be a few problems when you store the data directly in the database while working on big or multiple projects. Scrapy spiders can return the extracted data as Python dictionaries, which is what we have been doing so far with our quotes project, but the problem with Python dictionaries is that they lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders. Our quotes project is very small right now, which is why we don't run into these kinds of mistakes. So it's always a good idea to move the scraped data to a temporary location, called a container, and then store it inside the database; these temporary containers where we store the extracted data are called items.
We will be using this items.py file to create our item containers, and if you look over here you can see that the QuotetutorialItem class was automatically created for us by Scrapy when we created the project. Now, we have a couple of fields inside our quotes_spider: the title field, the author field and the tag field. To declare these fields inside items.py, you simply write the name of the field and then scrapy.Field(). So we uncomment this line, and instead of name we call it title, then copy and paste it two more times for the author and the tags: instead of title over here I write author, easy peasy, then instead of title I write tags, and then I remove this pass. Now that we have declared the fields inside the QuotetutorialItem class, we need to import this items.py file inside our quotes_spider.py file. That is pretty simple: we write from ..items — this goes up from the spiders folder to the items.py file — and from there we need to import this QuotetutorialItem class, so we just type in QuotetutorialItem, and this automatically imports the class for us.
Now, over here inside this parse method we have to create a new variable; let's call it items. This is basically an instance variable, because we are going to create an instance of this QuotetutorialItem class. So we type items equals, then the name of the class, in our case QuotetutorialItem, and then parentheses. If you don't know about classes and objects: this is basically a class, and when we need to create an instance of it we just write the name we want for the instance, then the name of the class with parentheses. Now that we have created the instance, we need to use this QuotetutorialItem blueprint to store some values inside this items instance.
So what we can do is write items over here, and then, instead of just yielding these dictionaries, we write the field name in square brackets — that is, 'title'. The name that we write inside the square brackets is actually the field name we gave in items.py, so if we had put titles over there we would need to write titles over here too, but it's just easier to make the item field names the same as the names of the variables you extracted. So instead of calling them titles I'm just calling it title, to make our work a little easier, and then items['title'] equals title — the variable that we extracted over here. Just like we did with the title, we do the same for the author and the tag: I copy and paste it two more times, then instead of title I write author, and over here author too — if you're following along you can just write it with me — and then tag, and over here tag as well. Now, just to make sure: in items.py I called this field tags, while over here I've called the variable tag, so this is not going to work; instead of tags we'll just call the field tag, so that this remains the same.
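At this point the items.py file looks roughly like this. The class name QuotetutorialItem is an assumption based on the project name used in the series; Scrapy generates it automatically from the project name when you run startproject:

    import scrapy

    class QuotetutorialItem(scrapy.Item):
        # one Field per value the spider extracts
        title = scrapy.Field()
        author = scrapy.Field()
        tag = scrapy.Field()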
Now, instead of yielding all of these key–value pairs, we can just yield one thing, and that is items; this makes sure that all of our code works properly and all of these items are returned properly. So what we are going to do is run this crawler once and make sure everything is running properly, and then I'll go through all of this code again so that you properly understand what is happening. Let me just run the crawler: I open the terminal window, go inside our quotetutorial project, write scrapy crawl, then the name of our spider, which is quotes, and press Enter. Everything should be working properly — let it run — and if we scroll up you will be able to see that all of these quotes are getting scraped properly.
So what we have done is this: inside this items.py file we created a temporary container known as QuotetutorialItem and declared some fields inside it. Then, in the spider, we first imported the items.py file, specifically this QuotetutorialItem class; after we imported the class we created an instance called items, and then we used the blueprint of this class to make sure that the title, the author and the tags are stored in their proper containers, and then we just yielded the items. This actually looks a lot nicer than what we were doing previously, which was just returning or yielding dictionary key–value pairs. Now, if you have any confusion about classes, instances and objects, or whatever I'm talking about, make sure you check out the classes and objects and inheritance videos that I have added to this video series, because they will help you understand object-oriented programming. Anyway, now that our items are in proper temporary containers, we can move on to the third part, which is storing them inside our database. In the next video we are going to use a very simple kind of storage, the JSON format — we are going to be storing all of our data inside JSON in the next video, so I'll see you over there.
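For reference, the parse() method rewritten to use the item container looks roughly like this sketch. The import path ..items and the class names are assumptions based on the project layout described in the video:

    import scrapy
    from ..items import QuotetutorialItem

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            items = QuotetutorialItem()          # instance of the item "blueprint"
            all_div_quotes = response.css('div.quote')

            for quotes in all_div_quotes:
                title = quotes.css('span.text::text').extract()
                author = quotes.css('.author::text').extract()
                tag = quotes.css('.tag::text').extract()

                # fill the container fields instead of yielding a plain dict
                items['title'] = title
                items['author'] = author
                items['tag'] = tag

                yield items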

Video 13: Python Scrapy Tutorial - 13 - Storing in JSON, XML and CSV

So in the last video we learned about item containers, and now that we have successfully scraped data from the quotes website and stored it in these temporary containers, we can finally go to the next step and learn how to store that scraped data in some kind of database or file system. In this video we are going to learn how to store this extracted data in a JSON file, a CSV file or an XML file. Obviously in future videos we are going to learn how to store this data in SQLite, MySQL and MongoDB databases, but for this video we are going to learn how to store it in these files: JSON, CSV and XML.
So without wasting any more time, let's jump into it. Storing the data in files like JSON, XML and CSV is pretty easy, because Scrapy does all the heavy lifting for you behind the scenes. What you need to do is just open up your terminal, go inside your project folder, and give a command that tells Scrapy to store the scraped data inside some kind of file. How do we scrape our website? We write scrapy crawl and then the name of the crawler — or rather the spider — that we have, which is quotes. Then we tell Scrapy that we want output, with this -o flag, into a JSON file, and I'm just going to name it items.json. You can name it whatever you want, but if you want a JSON file make sure it has the .json extension; similarly, if you want an XML or CSV file, you write items.xml or items.csv. Let's actually run this and see how it looks. Now if we scroll up you'll see that our data has been scraped, and if we go back to our project structure you can see there is a new file known as items.json. If we open it up, it has this JSON format which contains the quote, and if we scroll a little bit to the right-hand side you can see it has the author and all the tags. Once we have the JSON file you can, if you want, use it in a web service or something like that — whatever you want to do with the JSON file.
Now, for items.xml and items.csv we are literally going to do the same thing. We scroll down a little bit, and instead of json we write items.csv, and then items.xml as well, and we are going to look at both files at the same time. Our spider has finished crawling, and as you can see there are two new files over here. Let's open items.csv: it has the author first, then the tag and then the quote, so we have Albert Einstein and the quote at the end. Similarly, if we open up the XML file, you can see it is in XML format, and you can use this XML file in whatever way you want — mostly XML files are used with some kind of website, so if you want data to be fed into a website you can use XML. But anyways guys, this is pretty much it for this video. It's pretty short, as you already know, but it was very important to show you how to store the scraped data in a file system as CSV, JSON and XML. From the next video we are going to be learning about more proper databases, like how to store your data in an SQLite database. So I'll see you over there, peace out.
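As a recap of this video: the -o flag picks the output file and the extension picks the format. As an alternative — a sketch that assumes a reasonably recent Scrapy version (2.1+), where the FEEDS setting exists; the video itself only uses the command-line flag — the same exports can be configured on the spider so you don't have to retype the flag every time:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        # Equivalent to running:  scrapy crawl quotes -o items.json  (or .csv / .xml).
        # FEEDS needs Scrapy 2.1+; older versions use FEED_FORMAT / FEED_URI instead.
        custom_settings = {
            'FEEDS': {
                'items.json': {'format': 'json'},
                'items.csv': {'format': 'csv'},
                'items.xml': {'format': 'xml'},
            },
        }

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {'title': quote.css('span.text::text').extract_first()}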

Video 14: Python Scrapy Tutorial - 14 - Pipelines in Web Scraping

Welcome back, you beautiful Python developers. Before we go on to learn about storing the scraped data inside our database, we've got to learn about pipelines — I'm talking about this file on the left-hand side, pipelines.py. If we look at the flow of our scraped data, it goes somewhat like this: it first gets scraped by a spider, then it is stored inside these temporary containers called items, and then you can store it inside a JSON file. But what if you want to send this data to a database like SQLite or MySQL? What we have to do is add one more step to this flow: after storing the data inside item containers, we send it to this pipeline, where this process_item method is automatically called and the item variable will contain our scraped data. Now, all of this code inside this pipelines.py file has been automatically generated for us by Scrapy, but we still need to activate this pipeline inside our settings.py file.
up our setting start by file and over
here we are going to search for the word
pipeline and let's close the search
functionality and over here you can see
that this word of item under scope
pipelines is written so let's actually
uncomment these three lines so you can
just select all of these three lines and
if you are using pycharm you can just
press the control button and then the
backslash button to uncomment multiple
lines at the same time so now that the
pipeline has been activated I just want
to tell you one more thing about this
number over here so the lower the number
is the more priority a pipeline is given
so for example let's say we have created
multiple pipelines over here and then in
settings stored by a file we have to add
another pipeline because this code
tutorial pipeline has been automatically
created for us by using scrappy but if
we add another pipeline over here we
will have to add them inside item
underscore pipelines and difficult
depending upon the priority we have to
give it a number so if this is a
a priority that is that you want to
execute this pipeline first we will give
it a lower number than the other
pipeline so this will just leave it at
300 after this let me just go through
After this, let me just go through the flow of our spider and our pipeline. Whenever we scrape data over here — in our case the title, author and tag — and we yield items, every time we yield, the items go to this pipelines.py file, now that we have activated the pipeline in settings.py. They arrive at this process_item method, which receives the items sent from over here: with every pass of the for loop, the yielded item is sent to pipelines.py and is contained inside this item variable (and each field of it actually holds a list, as we will see in a second). So let's actually try this out.
What we are going to do is just print a value out. I'm going to write "Pipeline :" over here, and we want to print out our title, so I write item and, inside square brackets, 'title'. This should print out the quote that we are scraping. Let's go to our quotes_spider, open up the terminal, go inside our project folder just like we normally do, write scrapy crawl quotes and press Enter. Now the crawler has run, so let's maximize the window, and if we go up — ah, there is an error. It says TypeError: must be str, not list. We made a very basic mistake, that's fine: let's go back to our pipelines.py file, and let me also add a [0] over here — this should give us the string; if we don't write the [0] then we are dealing with the whole list. Let's crawl it again and see if it works this time, and hopefully it will, otherwise I'll kill myself — I won't, I'm just kidding. Here it is, let's scroll up now, and it should work. Alright, it's working, and as you can see, along with the quote, author, tag and title that we are yielding with yield items over here, we have also printed out this line that says Pipeline and then the quote we wanted to print. So guys, this is pretty much it for this video. In the next video we are going to learn how to finally send this data from our pipelines.py file to our database — more specifically, we are going to be learning about the SQLite database. So I'll see you over there.
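Putting the pieces of this video together: the pipeline only runs once it is enabled in settings.py, and process_item() then receives every item the spider yields. A minimal sketch (the project and class names are assumed from this series):

    # settings.py -- uncomment / add this block to activate the pipeline;
    # the number (300) is the priority: lower numbers run first.
    ITEM_PIPELINES = {
        'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    }

    # pipelines.py
    class QuotetutorialPipeline:
        def process_item(self, item, spider):
            # item['title'] is a list (extract() returns lists), hence the [0]
            print("Pipeline :" + item['title'][0])
            return item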

Video 15: Python Scrapy Tutorial - 15 - Basics of SQLite3 database

Alright guys, we are back again. In the last video we learned about this pipelines.py file, and in this video we are going to use it to store the data inside an SQLite3 database. Now, if you are not using SQLite3 and don't want to use it, feel free to use any other database — we are going to be discussing other databases in the upcoming lectures — so if you want you can skip this video; for example, if you are using a MongoDB database you can just jump directly to that video. But if you want to stick around and maybe learn how SQLite3 works and how we are going to use Scrapy to store data inside an SQLite3 database, feel free to stick around. So let's get back to our video.
For a moment we are going to forget about pipelines and Scrapy and just learn about sqlite3. What I'm gonna do is go to this quotetutorial folder and create a new temporary file — we are going to delete this file later; it is just for learning sqlite3 — and I'm gonna call this file database, because we are going to be creating a database, so why not. Then we just import sqlite3 over here, and the cool thing about sqlite3 is that you don't need to install anything externally: sqlite3 already ships with Python, which is why you don't need to pip install it. After importing sqlite3 we can create a connection. How do we do that? We create a variable called conn — you can call it whatever you want, I'm just calling it conn, short for connection — and then I write sqlite3.connect, and over here you need to write down the database file name. I'm just going to call our database myquotes, because we are going to be storing the quotes inside this file, and it has an extension of .db, which stands for database.
What this line is going to do is connect to this myquotes database using sqlite3; if the database does not exist it will create the myquotes.db file, and if it already exists it will just connect to it. Let's actually try it out and run this database.py file by right-clicking on it and clicking Run 'database'. After it runs, you will be able to see that this new myquotes.db file has been created for us — this is the file that will contain our whole database — and you can even run it again if you want: nothing gets overwritten, it just opens the file again. Now what we need to do is create something known as a cursor. The cursor lets us use all the other functionality inside the sqlite3 package, so let's create a new variable, call it curr for cursor, and then we just use our connection variable and write .cursor(), and that's pretty much it. Now we can go on to add stuff inside this myquotes.db.
The first thing we are going to do is add a very simple table. If you don't know databases, I'm not gonna go much into it, because this is not a SQL or database video series, it's a Scrapy series, but just to give you an idea: a database normally contains tables, which have rows and columns, and the data is stored inside those rows and columns. So first we need to create this table. How do we create a table in sqlite3? It's pretty simple: you use your cursor variable, that is curr, and then you write execute. Whenever we need to execute SQL statements — if you already know SQL, things like CREATE TABLE, inserting values into a database or deleting values from a database — we use this .execute function that lives on the cursor variable we created. So we are going to execute a statement, but first we need to write it. We start with triple quotes, which are used whenever you need to write a multi-line query — a query is basically a statement that SQL needs to execute — and because our query is going to span multiple lines, we use these triple quotes.
So we write create table over here, then the name of the table — I'm gonna call it quotes_tb, tb stands for table; you can call it whatever you want, feel free to be innovative — then we open a bracket, press Enter, and over here we write the names of the columns that we want inside our database. I want the title, and what data type is the title going to be? Our title is going to be of text type. SQLite3, if I remember properly, has five kinds of data types — text and integer are the basic ones, it has three more that I don't remember — but anyway, if you want to store any kind of textual data you use the text data type, and if you want to save a price or something — for example if you were scraping Amazon and needed to store the cost of a product — then instead of text you could use an integer. In our case all three of them, the title, the author and the tags, are of text type, so we are not going to use integer. Then I just make sure this is formatted properly. Alright, this looks good — and don't worry about the yellow highlighting over here; PyCharm just marks the contents of the execute statement and shows it in a different colour, so don't worry about it too much.
worry about it too much now that our
table has been written actually this
hasn't been created yet inside our
database because for that we need to
execute a database not by file and we
also need to commit this execute
statement so what we are going to do is
we are going to write corn dot commit
and this is going to make sure that all
of these statements inside this data
based on PI file are executed when this
database dot PI 5 it's run and it's
always a good practice to close the
connection after you have done all of
the work with the database dot pi file
and basically escalate three database so
this connection is activated over here
and then this connection is closed by
this line now before we execute this
file I just want to show you what's
inside this micro store TV because we
created this file when there was nothing
over here
these two lines were present then we
created this mic or store DB file let me
just show you guys what's inside this
mic or store DB fine
How do you look inside these sqlite3 files? You can download offline programs if you want, but what I like to use is this website called sqliteonline.com — you can search for other websites too, just go to Google and type something like "SQLite3 online" and you'll get a lot of free options, but this is the one I like. So we go to File, Open DB, and search for this file inside our quotetutorial folder. Let me go over here — you can see this is the myquotes.db file that we want — so let's click on it, and if we look over here, nothing is present: even though we have uploaded the file, there is nothing inside it, so it doesn't show anything, basically. Let me just upload it once more to be sure — alright, there's nothing inside this file. What we are gonna do is run this database.py file again by right-clicking on it and clicking Run 'database', and now you will see that this icon over here has changed — basically, a table has been created inside the database. So now we go back, click on File, Open DB, pick this myquotes.db file, and now you can see that a table has been created over here. If you click on it there will be nothing inside the table, because we haven't added any items yet, but you can see that we are making some progress at least, and this quotes_tb table has been created.
Now let's actually try to add some values inside our quotes_tb table. What we'll do is comment out this execute of the create table quotes_tb statement, because we don't want the table to be created again — it has already been created once, and if you execute this file again it will give you an error which says the table already exists. Actually, let me just show you: if we hit the run button, you will see that it says table quotes_tb already exists; that is why we comment out that statement before inserting values into our quotes_tb table. So we write curr.execute again, and now we are going to insert into our sqlite3 table. If you already use SQL you probably know what to do: you write insert into, then the name of our table, quotes_tb, then values, and inside that we give it the values. Over here we have the title, the author and the tag, so for now let's just give the values manually: the title, we can say, is "Python is awesome" — let's be honest, it is — then the author is buildwithpython, and we'll just give the tag as python. Alright.
This looks pretty good; hopefully I haven't missed anything. Now we can run this file again, and all three values should get inserted into our table. Let's hit the run button — no error, perfect. Let's go back to SQLite online, click on Open DB, open this file again, and now if we click on this quotes_tb table you can see that it contains the title, the author and the tag, with the values "Python is awesome", the author and the tag. So guys, this video has actually been pretty long, so the second part of the SQLite3 topic we'll cover in the next video, and what we will be doing there is using the same concepts that we learned over here inside this pipelines.py file, more specifically inside this QuotetutorialPipeline class. So I'll see you in the next video, and we'll finish the process of storing the data inside our sqlite3 database. Peace.
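For reference, the throwaway database.py file built in this video ends up roughly like the sketch below. The file and table names follow what was typed in the video; one deviation is labelled in the comments — the video comments out the create-table line after the first run, while this sketch uses "if not exists" so it can be re-run safely:

    import sqlite3

    # connect to (or create) the database file
    conn = sqlite3.connect('myquotes.db')
    curr = conn.cursor()

    # the video creates the table once and then comments this statement out;
    # "if not exists" is used here instead so the sketch stays re-runnable
    curr.execute("""create table if not exists quotes_tb(
        title text,
        author text,
        tag text
    )""")

    # insert one row of hand-written values, like in the video
    curr.execute("""insert into quotes_tb values(
        'Python is awesome',
        'buildwithpython',
        'python'
    )""")

    conn.commit()   # actually apply the statements
    conn.close()    # good practice: close the connection when done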
Video 17: Python Scrapy Tutorial - 17 - Storing data in MySQL Database

Alright guys, welcome back. In this video we are going to learn how to store our scraped data inside a MySQL database. Before watching this video, make sure that you have watched the previous two videos, in which we covered how to store data inside an SQLite database, because a lot of the concepts I taught there are going to be used in this video too, and I don't want to go over the same concepts again. Now, obviously the first thing you need to do after you have seen the SQLite videos is install MySQL on your computer. You can go to this link if you are on the Windows platform to install MySQL, and if you're using Linux you can check this other link to install it and learn how to use it. I am only going to cover the Windows part of the installation, because the Linux installation is pretty easy and pretty self-explanatory. You can just click on this link to start the installation, and I'm going to go through it pretty quickly because it's simple.
While installing, make sure that you choose the Developer Default option, because we want everything installed on the computer, including the connectors, the server and the MySQL Workbench software — this is basically a GUI tool to connect to and handle various databases and connections. Also, when you are asked to choose the root password, you can choose whatever you want, but make sure you remember it. It's really important, because we are going to be using the same password everywhere, and if you forget it, it's going to be really difficult to reset. So my suggestion is to write it down somewhere, so that in the future, when you don't remember the root password, you can look it up and use it again. After everything is installed, we are going to launch MySQL Workbench, which looks somewhat like this. If for some reason you are not able to find MySQL Workbench, you can go to the MySQL folder inside your Program Files, or you can just go to the search bar and search for MySQL, and the first option will be MySQL Workbench.
We'll come back to this MySQL Workbench, because we are going to be creating our database and making a new connection using this GUI software, but before that we need to install one more thing inside our Python project. So inside the Python project, that is the PyCharm project, we'll go to File, and we need to install a new library or package — if you're not using PyCharm you can just pip install this library, I'll tell you its name in a sec. I want you to search for mysql-connector-python; if you are pip installing it you can check out this link, or just write pip install mysql-connector-python. Then you click on Install Package and this installs it on your computer, inside your Python. We need this MySQL connector to connect our Python code to the MySQL database, so it's pretty important. I have already installed it, so I'm not going to install it again — as you can see, it's already installed over here.
Now, after installing it, we need to create a database, so we go back to our Workbench and create a new connection over here. Creating a new connection is really easy: you just click on this plus icon, give it a connection name — I'm just going to call it buildwithpython, you can call it whatever you want — and then click on OK. After that we need to open up this connection, so we double-click on it, and now it asks for the password. I have put in the digit 1 followed by helloworld; you have to put in the password that you chose during the installation process — if you remember, I asked you to make sure you remember that password and store it somewhere. So my password is 1helloworld; you put in the password that you chose while installing MySQL, then click on OK, and this opens up the GUI where you can manipulate the database, create your database and do all kinds of cool stuff. The first thing we need to do is go to this Schemas option over here — a schema is basically the place where we create our databases and tables, and where we can look at them and manipulate them and all that stuff. Now we just need to create a new database, so we click on Create Schema (which, loosely speaking, means create database), and we're going to name our database myquotes. We put that in, click on Apply, then Apply again, and if we click on Finish you will be able to see that a new database has been created over here, with tables, views, stored procedures, etc. — but we are only interested in this one thing known as Tables.
Now that we have created our database, we can go back to our Python project and add the MySQL stuff to our pipeline. Over here you can see that we are currently using sqlite3 — again, if you haven't seen the SQLite3 videos I highly recommend you go back and watch them and then come back, because we are going to be reusing the same code in this video and I don't want to explain all of that again. So what we are going to do is, instead of importing sqlite3, we are going to import mysql.connector, because we are using the connector that we just installed as a package, mysql-connector-python.
After we've imported it, most of the stuff is going to remain the same, except for a few things. For example, as you can see over here, it's showing us an error, so we remove this line from here and write mysql.connector.connect. It's going to require a couple of parameters — more specifically, four. The first parameter is the host, which in our case is localhost, because we are just hosting it on our own computer. For the second parameter — make sure you add a comma over here, otherwise it's going to show you an error — the second parameter is the user, and in our case the user is root; you can check this again by going to the home screen over here, where you can see that the user is root, so we just put in user as root. After that we put in the password using the parameter passwd, which stands for password — in my case the password is 1helloworld, but you put in the password you selected during the installation of MySQL, and if you don't remember the password you'll just have to reinstall everything; I know it sucks, but you have to. Again we put in a comma here, and then for the last parameter we write the name of the database: we put in the parameter database, and then the name that we gave our database inside MySQL, which is myquotes. You can obviously create a new database, just like I showed you, but we are just going to go with myquotes, so I'm going to write that inside our code: let's call it myquotes.
Now I just want to make a couple more changes to our code. The first change is that I want to add one more execute line over here, so I'll copy this, and before creating the table we need to make sure that the table does not already exist — and if it already exists, I just want to drop it, that is, delete it. Because we are going to be running this program again and again, we want to make sure there is no "table already exists" error; if you already know a little bit of SQL, you know what I'm talking about. So I'm just going to write a statement over here which says drop table if exists, then the name of the table, which in our case is quotes_tb, wrapped in triple quotes and closed with parentheses. And now one last change: instead of question marks, which are the standard placeholders in sqlite3, in MySQL we use %s, so I'm just going to replace these three question marks with %s, and we should be ready to go. Let me just format all of this code properly — if you're using PyCharm you can press Ctrl+Alt+L, and this will automatically make sure that your code looks beautiful.
Alright, this looks pretty good. Now we can open up our terminal and run our crawler — you already know how to run it: we go inside our project folder, write scrapy crawl and the name of our crawler, which is quotes, and press Enter. Hopefully everything will go fine and we'll be able to store the data inside our database, so let's just wait for it to finish. Alright, our data has been crawled, and now if we go back inside our database and have a look at this myquotes database — let's open it up; let me just remove all of this stuff from here, close this connection and open it again, the password is 1helloworld — if we look inside our myquotes database and open up Tables, you can see that a quotes_tb table is over here. I just want to see what is inside this table, so I right-click over here and click on Table Inspector, and if you go inside Columns you'll see that there are three columns — author, tag and title — which we set up in the last lines of our pipeline over here: title, author and tag. Now, if you want to look further inside the database and actually see the values that are in there, we can right-click over here and go to Select Rows - Limit 1000, click on it, and as you can see over here you have all the quotes with the authors and tags. So guys, this is pretty much it for this video. It was a pretty simple video, I think — the most difficult part is actually installing MySQL, because all of the code in this video was taken from the previous two videos, and if you watched those, this was a piece of cake. Anyways guys, in the next video we are going to learn how to store the scraped data inside a MongoDB database — I know a lot of you want that — so we are going to be covering it in the next video. I'll see you over there.
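The finished pipeline from this video, reconstructed as a sketch: it assumes the mysql-connector-python package, the root/localhost credentials chosen during installation, and the myquotes database created in MySQL Workbench. The [0] indexing (storing only the first tag) and the exact layout of the class are assumptions based on the narration, not a copy of the author's file — swap in your own password:

    import mysql.connector

    class QuotetutorialPipeline:

        def __init__(self):
            # connect using the credentials chosen during the MySQL installation
            self.conn = mysql.connector.connect(
                host='localhost',
                user='root',
                passwd='1helloworld',   # replace with your own root password
                database='myquotes'
            )
            self.curr = self.conn.cursor()

            # drop and re-create the table so the crawl can be re-run without errors
            self.curr.execute("""drop table if exists quotes_tb""")
            self.curr.execute("""create table quotes_tb(
                title text,
                author text,
                tag text
            )""")

        def process_item(self, item, spider):
            # MySQL uses %s placeholders where sqlite3 uses ?
            self.curr.execute("""insert into quotes_tb values (%s, %s, %s)""", (
                item['title'][0],
                item['author'][0],
                item['tag'][0],
            ))
            self.conn.commit()
            return item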

Video 18: Python Scrapy Tutorial - 18 - Storing data in MongoDB

Alrighty guys, in the previous video we learned how to store our scraped data inside a MySQL database, and in this video, as you can see on the screen, we'll be learning how to store that scraped data inside a MongoDB database. We are going to go step by step, because MongoDB — it's actually pretty simple, but you need to follow a particular set of steps, so we are going to follow the steps I have written inside this notepad. The first step is to install MongoDB, pretty simple: you can just copy this installation link, paste it in your browser, and install the version you require — if you're on Linux you can install the Linux version, but I am on Windows so I installed the Windows version — and after installing it you'll be able to access MongoDB. But make sure you install everything that is included: don't do a custom installation and install only particular parts of MongoDB, I want you to install everything that is inside this installation package. There is one thing inside the package known as MongoDB Compass, which is kind of the GUI way of accessing the MongoDB database — looking at the stuff we have stored inside it, creating new connections, etc.
If for some reason you are not able to install the MongoDB Compass GUI software using that link, you can just go to MongoDB's products page and get Compass from there. It's free — even though it says "try free", it really is free; you won't be able to do the advanced stuff, but you will be able to do the easy stuff like looking at the database, which is what we want. So just install this MongoDB Compass, and after you have installed it, it will look somewhat like this. Now that you have installed MongoDB and MongoDB Compass — I'm going to assume you have installed both of them — we can go to the second step. (I have written the steps a little bit oddly, and that's because I want to make sure you guys remember everything that is in the video.)
The second step is to create a folder inside the C drive called data and, inside it, another folder called db. So we just go inside our C drive — as you can see I have already created this folder known as data, and if I open it up there is another folder known as db, and MongoDB has put some stuff inside it; we didn't put those files there, MongoDB did that automatically. Now what I want you to do is, again inside the C drive — after you have installed MongoDB, I know I keep saying it — go to Program Files, find the MongoDB folder, and inside that go to Server, then 4.0, then bin, and double-click on this file known as mongod.exe. Running mongod.exe actually starts your server, and when you run it you will get a kind of CMD terminal which looks like this, and at the end it says "waiting for connections on port 27017" — the port number might be a little bit different for you. If you are getting an error, make sure that you have created that data\db directory; if you haven't, you'll get an error. Also make sure you keep this mongod.exe running in the background while we do the other stuff inside our code.
Alright, so we have created the folder and we have run mongod.exe — that is why we are on this step; I've numbered these steps a little bit wrong, but it's alright. Now we have to install pymongo in PyCharm, so we go back to PyCharm — and if you're not using PyCharm, don't worry, you can just write pip install pymongo and it will be fine. But we are using PyCharm, so we go to File, then Settings, then Project Interpreter, click on this plus icon, search for pymongo — there it is — and click Install Package. Again, if you're not using PyCharm, pip install pymongo does the same for you. As for what pymongo does, I won't get into it much; as you can see, it is already installed. Now we can go to the next step.
That is, we have to make sure that our pipeline is activated. What I mean by that: if we go to our settings.py file and scroll down to about line 67, you'll see those three ITEM_PIPELINES lines; make sure these three lines are not commented out like all the other lines around them — that is basically what I mean by making sure your pipeline is activated. Now that our pipeline is activated, we can actually start writing code inside our pipelines.py file, which is the next step. First we need to import pymongo inside pipelines.py, so we write import pymongo, and then inside this QuotetutorialPipeline class we create our initialization function by typing def __init__ and pressing Enter; inside it we can do all kinds of stuff. If you don't know what this initialization function is, make sure you watch the classes and objects video, which I'll probably add somewhere at the start or the end of this video series — those are the extra videos you need to become comfortable with all of this — but you don't strictly need it; you can just think of it as the initialization function that runs whenever an instance of this QuotetutorialPipeline class is created.
Now, inside this initialization function we will actually try to connect to our MongoDB database. For that we create a variable and we'll just call it conn, which stands for connection, and we type pymongo.MongoClient; inside it we need to give two parameters. The first is 'localhost', because we are doing this on our own computer, and the second is the port number. Now, open up the MongoDB Compass that you installed at the start of this video; it will look like this, and as you can see the hostname is localhost and the port is 27017, so that is what we are going to put into our code. So we go back over here and put in the port — let me just go back and check the number again, 27017 — and we have to put a comma over here, don't forget that. Now we have created a connection variable. After that we need to create a database, which is pretty easy: we write self.conn and then, in square brackets with quotes, the name of the database we want — let's call it myquotes — and we save this into a variable we'll call db. So those lines create our connection, and this one creates our database.
Now, you already know that every database has tables inside it — the concept in MongoDB is a little bit different, but if you think of it as a conventional database, every database has a table. So now we have to create our table, and for that we need to create something known as a collection. We write self.collection, and inside it we write db and the name of the table we want — I want to call it quotes_tb. So we have done some pretty basic stuff over here: we have created the connection, we have created the database, and inside that database we have created a table, whose handle we have stored inside this collection variable; whatever we store through this variable goes into the quotes_tb table. Now we go to this process_item method, which has this item variable — that is the important part. Every time an item is scraped, it is sent to this pipeline; inside this QuotetutorialPipeline, the initialization function runs first (the connection is created, the database is created, the table is created), and then the item goes to this process_item function, which decides what to do with it. We want to store this item inside the table, and that is pretty easy: we write self.collection.insert, and inside that we write dict — because we are storing into MongoDB, and anything you store in MongoDB is in a kind of dictionary form — and then we pass in the item that we are getting from this variable. And that is pretty much it, guys.
Don't believe me? Let's just try it out and see if it works. Before we do that, I just want to open up MongoDB Compass and click on Connect so that we connect to this connection, and if you look over here there are just three databases — admin, config and local; there is no myquotes database over here yet, and there is no quotes_tb table either. So the database is going to appear over here and the table is going to appear over here — I just want you to look out for it. Now we can go back to our Scrapy project, open up the terminal and start from the start: we go inside our quotetutorial folder and, just like we have done with all the previous runs, we simply write scrapy crawl quotes and let it crawl. After it has finished, we go back to MongoDB Compass and refresh it a little bit — it says loading databases — and as you can see there is another database over here known as myquotes. If we click on it, you can see there is a table inside it called quotes_tb, and it has ten elements, and if we open it up a little bit more and look at what is inside these elements, you can see it has the quote, the author and our tags. So guys, this looks pretty good, and this is pretty much it for this video — surprisingly, this was the easiest of the databases to implement, because it is just so quick and so simple. In the next video we are going to be learning a little bit of advanced Scrapy: we are going to learn how to crawl the various links inside our website. So I'll see you over there.
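The whole MongoDB pipeline from this video fits in a few lines; here is a sketch. It assumes pymongo is installed and mongod.exe is running locally on the default port 27017. The video types insert(), which is deprecated in newer pymongo versions, so insert_one() is used here instead:

    import pymongo

    class QuotetutorialPipeline:

        def __init__(self):
            # connect to the local MongoDB server started by mongod.exe
            self.conn = pymongo.MongoClient('localhost', 27017)
            # the database and collection are created lazily on first insert
            db = self.conn['myquotes']
            self.collection = db['quotes_tb']

        def process_item(self, item, spider):
            # MongoDB stores documents, so convert the item to a plain dict
            self.collection.insert_one(dict(item))
            return item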

Video 19: Python Scrapy Tutorial - 19 - Web Crawling & Following links

Alright guys, welcome back. In the previous videos we learned how to store data inside a database, and we are done with that — we have already created a very basic spider, and the spider and crawler are working right now — but now it's time to move on to more advanced stuff, and what I want to cover is how to build a more advanced spider that is able to crawl multiple pages. Currently we are just scraping the first page of our example website, quotes.toscrape.com. What if you want to scrape multiple pages? For example, if I click on Next it takes me to the second page of this website, and if I click on Next again it takes me to the third page — but right now we are only scraping the first page. So in this video we are going to learn how to follow the links that are given on a web page, and in the next video we are going to learn how to scrape websites that have pagination — by pagination I basically mean that they have page numbers over here: the first page, the second page, the third page, the fourth page. In that case a different approach applies, but on websites where there is a Next button, this strategy of following links is the one to use. And even when you don't have a next page to follow, you can use it to follow other links: for example, websites have these tags — you can click on a tag and it will take you to the tag "love", and you'll see all the quotes with the tag love — so it can also be used to follow those links. But in our case we are going to learn how to follow the link that is behind this Next button and scrape all of these multiple pages, and in our case there are 10 pages.
you to notice is that whenever this next
button is clicked it takes us to the
next page so right now we are on the
second page of this codes website and
when we click on this next button the
page number changes to 3 and how does
the browser know that it has to change
this number to 3 and it knows that by
the free click on just an inspect
element of this next button
you will be able to see that the next
page number is actually stored inside
this lead element and inside the Slayer
element there is this element and inside
this
there is this attribute of harf and it
contains the page number that it is
supposed to go next so here is what we
So here is what we are going to do with Scrapy. First, we want Scrapy to be at the main page, with the start URL over here, so if we go back to our code the start URL should be this main page. Then what we want is for Scrapy to find this Next button, and after finding this Next button we want it to look at this a element's href attribute and take out this page number of two, and then we want Scrapy to go to this second page. When it goes to the second page it will see this page, so let's just wait for a second; alright, so it will see this second page, and then we want it to scrape the items of the second page, and then again we want it to go to the Next button, find the next link, which in our case will probably be page three, and keep going like that until all of the pages end. In our case, if we go to page number 10, and you can even manipulate the URL to go to page number 10, so if we go to page number 10 and press Enter, you will be able to see that this is the last page, because there is no more Next button; but if we try to go to the 11th page by manipulating the URL, you will be able to see that "No quotes found!" is written. So in our Scrapy Python code we will have to add one more condition which tells Scrapy: hey, if no more pages are found, that is, if the if condition evaluates to None, then make sure you stop following this link. So in total we need to do three steps: the first step is to find this Next button, the second step is actually to find the link that it is redirecting to, which in this case is page number two, and the third step is actually going to that link; and within these three steps, between the second step and the third step, we'll also put an if condition which checks whether the next page is empty, that is None, or not. So let's go into our code and actually code this thing.
The first step, as we have already discussed, is to find this page 2, and we are going to do that using CSS selectors, which we have already learned. We are going to find this li element which has a class of next, then inside that we are going to go inside the a element, then we are going to take out this attribute of href and get its value. So let's go inside our Python code, and over here, just outside this for loop, let's press Enter, go outside the for loop, and over here we are going to create a variable; let us call it next_page. Then we are just going to use a simple CSS selector, so we will just write response.css, and inside it we are going to put the CSS selector that we want: let's just write li, and we want the class of next, and as you can see it says li class equals next, so we'll put in the dot character and then we'll just put in next. Then inside this li.next we want the a element, the a HTML element, so we are going to put in an a, and inside this a we want the href attribute, so we are just going to go inside the quotes, put in a double colon and then write href, and because we want the attribute of this href we have to write attr, which stands for attribute, and we need to wrap this href in parentheses; then we can just use .get() to get the value of the next page.
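To make this concrete, here is a minimal sketch of that selector as it would sit in the spider; the quotes.toscrape.com markup is assumed to look the way it is described in this video, and the variable name is just the one chosen here:

    # inside the spider's parse() method, after the for loop over the quotes:
    # select the <li class="next"><a href="/page/2/">...</a></li> element
    # and pull out the value of its href attribute
    next_page = response.css('li.next a::attr(href)').get()
    # next_page is now something like '/page/2/', or None on the last page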
Now we are just going to use an if condition to check whether this next_page value is empty or not, so we're just going to write: if next_page is not None. This basically checks whether the next page value is empty, and we are doing this because if we go to the last page, let's go to page number 10 and press Enter, as you can see there is no Next button over here, so when Scrapy tries to get the value of the next page it will return None. So if the next page is not None, then we can actually go to the next page; otherwise it's not going to do anything. We don't want Scrapy to scrape empty pages; for example it could even go to page 100, but there is no page over there, so there is no point in going over to page 100 when the site only has 10 pages, and that's why we are putting in this if condition, so that not a lot of our resources are wasted.
So now that we have put in an if condition, how do we actually go to this next page? Even though we have the value of next_page right now, we don't have any way of going to the next page so that Scrapy can scrape it. What Scrapy provides is a very handy method known as response.follow, and it will follow this next page for us, so you don't have to do much work; every time Scrapy comes back into this def parse method it will process the request that response.follow produced, and that is what we are going to do. So first we will just write yield, which works kind of like return, and then we are going to use this response.follow function, and it is going to take two parameters: the first parameter is the page that we want it to follow, and we want it to follow the next page; and then it wants a callback, that is, where should it go after it reaches that next page. We want to also scrape the next page, so we'll just ask it to go back to parse and scrape all the quotes from the second page, and then it will come back to the next-page check, and this time it will contain page number three, and then it will check whether page number three is empty or not, and then it will again get the value of the next page, and then it will again go to parse, until it reaches page number ten, where this if condition will return False and it will just stop. So we will just add a callback over here, let's call it callback, and we will give it a value of self.parse, because we want it to go back to this parse method.
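Putting the pieces together, the parse method ends up looking roughly like this; this is a sketch in the spirit of the official Scrapy tutorial, and it yields plain dictionaries for brevity instead of the items container used earlier in the series:

    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # scrape the quotes on the current page
            for quote in response.css('div.quote'):
                yield {
                    'title': quote.css('span.text::text').get(),
                    'author': quote.css('.author::text').get(),
                    'tag': quote.css('.tag::text').extract(),
                }

            # outside the for loop: find the link inside <li class="next">
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                # follow the relative link and run parse() again on the new page
                yield response.follow(next_page, callback=self.parse)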
So what Scrapy is doing right now is that first it's going to start from this main page, that is quotes.toscrape.com; let's actually follow that, let's go back to our main page. First it will start with quotes.toscrape.com and it will scrape the first page; after that it will come to this next-page code, it will check out the Next button and get the value, that is the link to the second page, and it will go to this response.follow after checking whether the second page is empty or not, and then it will actually go to the next page. Then it kind of stands there asking, what am I supposed to do next? And here we are going to tell it: hey, just go to this parse method again, so it will go back to the parse method; this is kind of a recursive pattern, so it will go back to this method and it will just scrape the second page. So it goes to the next page, that is the second page, it scrapes all of these items, and as you can see we have put the next-page code outside this for loop, so this part happens after it has scraped all the quotes of that page, and then it goes on to page number three, page number four, and so on.
So guys, this is pretty much it for this video; in the next video we are going to learn how to scrape pages that have this kind of pagination in them, because we haven't learned that till now, so I'll see you in the next video.

Video 20: Python Scrapy Tutorial - 20 - Scraping Websites with Pagination

alright guys welcome back, in the last video we learned how to scrape multiple pages using Scrapy by following links given on a web page, and we wrote all of this code but we forgot to check the output of it. You must have already checked the output on your own, but if you haven't, I just want to show you the output of scraping multiple pages using this following-links method. What I've done is I've already printed out this next_page value, and if we scroll up a little bit you'll be able to see that each page is being printed; so let's scroll up a little bit, and somewhere over here you can see that page 9 has been printed, and every page actually gets printed because of this print statement. Obviously you don't need to print it, so we will just remove it.
In this video we are going to learn how to scrape pages with pagination. For example, if we open up amazon.com and scroll to the bottom of a department or a section, you will be able to see that there are these multiple pages at the bottom, and clicking on one of these numbers takes you to that page; so we'll learn how to scrape pages when this kind of pagination is given. The first thing I want you guys to notice is that, for example, if we are on our quotes.toscrape.com website and we go to the next page, you will be able to see that in the URL the page number increases; so right now we are on page 2, and if we click on the Next button again you will be able to see that now we are on page 3, and similarly if you just want to go back to page 1 you can just manipulate this URL to go to that specific page number. For example, if I want to go to page 1 I can press Enter and this is the home page of our quotes-to-scrape website, and if I want to go to page 9 I can just put in page 9 over here and this will directly take us to page 9; this is the concept that is used to scrape websites that have pagination in them. Similarly, if we go to amazon.com and look at the URL, you will be able to see that there is currently no page number over here; you can scroll a little bit to the right, but as you can see there is no page number, and that is because we are just on the first page. Sometimes only when you go to the second page can you actually see the page number; for example on the quotes website, on the main page you can't see the page number, but when you click the Next button, only then can you see it, and this theme is actually consistent across a lot of websites, so just keep it in mind when you are scraping pages and websites that have pagination in them. So if we go back to the second page, as you can see right now there is no page number, but when we click on the second page you can see there's this new attribute over here, page equals 2, and we can use this attribute to go to any page that we want; for example, if we want to go to the fifth page we can just type in five and press Enter, and now you will be able to see that the fifth page has opened up, and this is consistent: if we scroll down you can see that the pagination also has this fifth element ticked, which means we are on the fifth page, and similarly we can just go back to the first page again and press Enter and you will be able to see that the first page is the same. Now, we will be scraping amazon.com completely in the future videos, so I'm going to leave that website for now and we are just going to focus on our quotes.toscrape.com website and learn how to scrape it using pagination, and then we are going to use the same concepts that we learn in this video to actually scrape Amazon.
So what we do is we go back to our code, and instead of just putting quotes.toscrape.com as the start URL we're actually going to replace it with the first page; so if we click on Next this is page two, and we can just manipulate it to go to page one, and we will be using this as the start URL. You don't have to do it, we don't have to change the start URL in this case, but it just makes it more intuitive for you guys to understand what I am doing if I replace the start URL with this page one. After that, because we need to go to page 2, page 3, page 4, we need some kind of a variable, so I'm just going to create a variable and call it page_number, and we'll give it a value of two, because we are already on page one and I want it to go to page two next. Then, inside this def parse, outside the for loop, we are going to change this response.css, because for the next page we already know what we want: we want to go to the second page, so we can just copy that URL from over here, paste it, and enclose it in single quotes. Now this is static, so even though the first time it might go to the second page, it will keep going to the second page instead of the third page, fourth page, fifth page; so what I'm planning to do is replace this digit of 2 with the page number, and then inside this "if next_page is not None" condition we'll just increase the value of this page_number variable. So we are going to replace this two; let's actually bring this page_number over here. Now, this page_number variable is actually a class variable, so you can't just write page_number and expect it to work; you have to refer to it through the QuoteSpider class, so I'm just going to write QuoteSpider.page_number and now you can access it, and let me just add a plus over here. And now it gives us an error which says "expected string, got int instead", so this actually needs to be a string, so we're just going to type cast it: we write str and cast it, so this integer just became a string, and now the URL of the next page is actually stored inside this next_page variable. Now we are going to replace this if condition, because this is really just a page number check, with one which says: if QuoteSpider.page_number is less than or equal to 11, then only go to the next page, because the site only has 10 pages, and if we go to the eleventh page there are no quotes; so we're just putting that inside our if condition. After this we just need to increase the page number by one, so that by the time it comes back to this next page the page number actually becomes 3, 4, 5 and so on; so inside this if condition we are just going to write page_number += 1, and this will make sure that our page number is increased by 1. So just to recap what will happen: first the start URL is going to be this page 1, then it will go inside our parse function and scrape all the quotes from that page, then it is going to come to this next_page assignment and build the link with page number equal to two, which we have stored over here, and this is a class variable, which is why we have to use the dot notation over here; after that it will check whether our page number is less than or equal to 11, or rather let's make it less than 11 instead of less than or equal to 11, and if it's less than 11 then it will increase the page number by one and just go to the next page, basically the same thing that we did in the last video; and this will go on, it will scrape all of the pages till the page number is 10, after which the if condition becomes False and it won't go to the next page.
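As a reference, the pagination version of the spider ends up looking roughly like this; it is a sketch with illustrative names (the class name and item fields are assumed to match what was built earlier in the series), and the limit reflects the ten pages of quotes.toscrape.com:

    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = 'quotes'
        page_number = 2  # class variable: the next page to request
        start_urls = ['http://quotes.toscrape.com/page/1/']

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'title': quote.css('span.text::text').get(),
                    'author': quote.css('.author::text').get(),
                    'tag': quote.css('.tag::text').extract(),
                }

            # build the URL of the next page from the class variable
            next_page = 'http://quotes.toscrape.com/page/' + str(QuoteSpider.page_number) + '/'
            if QuoteSpider.page_number < 11:   # the site only has 10 pages of quotes
                QuoteSpider.page_number += 1
                yield response.follow(next_page, callback=self.parse)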
So guys, this is pretty much it for this video; in the next video we are going to learn how to log in to a website, so we are actually going to be exploring the functionality of this Login button over here, but more on that in the next video. Just to end this video, we are going to test this out to make sure that it actually works, so we are going to open up our terminal again, go to the bottom, and crawl it once more; let me just do it from the start, let's try scrapy crawl quotes, and hopefully this should work without any errors. Alright, so it's been crawling for a while, which means that it's actually working, and as you can see it has scraped a lot of quotes over here, so that is pretty much it, and see you in the next video, peace out.

Video 21: Python Scrapy Tutorial - 21 - Logging in with Scrapy FormRequest


alright guys welcome back, in this video we are going to learn how to log in to websites using Scrapy, and we'll be using this example website which we have used before, quotes.toscrape.com, to learn that. As you can see, there is a Login button on the right hand side, and clicking on it takes us to a form which contains a username field and a password field. Now why exactly do we need to learn how to log in using Scrapy? This is because a lot of websites will restrict the content that you might want to scrape behind a login page, so to scrape that restricted data it's always a good idea to learn how to log in to websites using Scrapy. Before we start coding, I just want you to notice a couple of very important things: the first is that currently our URL is quotes.toscrape.com, and after we click on this Login button it changes to quotes.toscrape.com/login. The second is that currently you can enter any username and password and just click on this Login button and it will log you in, because this is a testing website; obviously when you are scraping a real project you would already know the username and the password, so it won't matter, but to scrape a website while making sure that you are logged in to it you need to know the username and the password. Again, I want you to notice how the URL changes from this login URL after we click on the Login button: after we click on it you can see that it goes back to the main page, and now instead of Login it shows us Logout, which means that we have finally logged in to the website.
So let's actually click on Login and then again go over here, and now I want you to notice a very important thing, which you will only be able to see by using the Chrome developer tools. We are just going to right click, click on Inspect, then go to this Network tab, then minimize it a little bit, and now we are going to try and log in again; I'm just going to put in my email as the username, and for the password I am just going to put in hello world, and then I'm going to press the Login button. My Login button is a little bit hidden over here, but it's going to log in, and now you can see that in the Network section there is quotes.toscrape.com and three other unimportant entries, but the important ones are login and this quotes.toscrape.com entry; as you can see this has the 302 status, which means it's being redirected from the login page to the main page, that is quotes.toscrape.com. Now, whenever you click on one of these network entries, for example if I click on login, which is the most important one right now, and go to the Headers tab, you can see there is a lot of stuff over here, and the stuff we are actually interested in, if you scroll down, is this form data, which contains three values: first the csrf_token, then the username, and then the password which I put in while logging in. So whenever you are trying to log in to websites, make sure that you right click, click on Inspect, go to Network, and actually try to log in once; after you log in, try to find this login request or something like it, click on it, go to Headers, and try to find the form data. Those are the three values which we'll be using while coding our Scrapy Python project. So what is this CSRF token?
The CSRF token is used by most websites for security purposes, and it usually changes every time the login page is loaded; so right now this CSRF token has one value, but when we load the login page again it's going to change. So we need to get the value of the CSRF token from the login page by using CSS selectors, and then, using FormRequest in Scrapy, we can just send the username, password and CSRF token to log in to the website. We'll get into what FormRequest is a little bit later, but let me just show you where this CSRF token actually is; let me just log out and show you the whole process once more, because I want this step to be very clear, since it is the most important one. We'll just put in some random username and password, then we'll right click, click on Inspect, go to our Network tab, and press this Login button that is over here; let's press it, and now you can see again we have these requests, and if we click on the login one and scroll down you can see that the CSRF token has changed, along with the username and password. Now how do we get the CSRF token? If we just close this up, press Logout, go to the login page, and then right click and click on View Page Source to see all of the code that went into this page, we can search for this element, csrf, and as we can see there is this form, inside this form there is this input element, and inside that we have the name csrf_token, and its value is what we want to scrape. Now, armed with all of this knowledge, let's actually go back to our Python code and start coding.
to our Python code and start coding so
we have a very basic spider over here we
just are scrapping the main URL the
first page we are not concerned about
following links etc so this is a very
basic spider so what I want you to do is
actually create a backup of this so I'm
just going to create a new file over
here and create a backup for myself I'm
just gonna call this file as backup
press ENTER and paste it over here and
now we are gonna remove everything from
over here I know this sounds a little
bit intimidating why are we removing
everything but trust me is gonna be okay
because we have already created the
backup now instead of the start URL
because we want to login into the web
site first we are just gonna give it the
URL of this login so let's just copy
this from here and paste this instead of
the start URL currently and we're gonna
paste this and now the next step is to
get the value of the token over here
because this will change every time we
click on this login page so let's go
back to our code and create a variable
let's call it token and then we are just
going to use the CSS selectors that we
have learnt and we are going to write
response dot CSS and inside it if we go
back to our code you can see that it has
a form and then inside that form there
is this input element and if we get the
value of it the attribute or value we
can get the value that is inside this
attribute so this CSS selector might be
different in different websites and in
this one is pretty easy so don't worry
about it just when you are scrapping a
particular project make sure you do the
CSS selectors properly and we have
already discussed CSS selector
that but anyways the CSS selector for
this one is we are just going to go into
form and then inside form we are going
to go into input and then inside input
we need the attribute of value so I'm
just going to write a TDR and we need
the attribute of the term of value so
I'm going to get that and I'm just going
to extract it first so whatever form
input attribute value comes first we are
just going to extract that someone right
extract first and let's actually print
out this token to see whether our
crawler is working or not so I'm just
going to go to our terminal let's make
sure that it's open go to our encode
tutorial write crappy Chrome and then
quotes press ENTER and this should print
out our CSRF token if it's working
all right so I forgot to put in the
closing brackets over here so that's why
is giving us in here let's try it again
just go down and cross the quotes and
this shine should work properly so now
let's scroll up and as you can see that
the CSRF token has been printed so now
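For reference, the token extraction is just one selector call; this is a minimal sketch, assuming the login form on quotes.toscrape.com keeps the csrf_token as the first input inside the form:

    # inside the spider, with start_urls = ['http://quotes.toscrape.com/login']
    def parse(self, response):
        # the hidden <input name="csrf_token" value="..."> is the first input
        # element inside the login form, so grab its value attribute
        token = response.css('form input::attr(value)').extract_first()
        print(token)  # temporary: just to confirm the crawler sees the token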
So now we need to import into our project something known as FormRequest, and how do we do that? We are just going to write from scrapy.http import FormRequest. Now, inside this parse method, we are going to remove this print(token) statement, and instead we're going to return a form request, so we are going to use FormRequest.from_response, and it takes a couple of parameters: the first parameter, as you can see, is the response, then there is the form name, the form data and things like that; we are going to be using the formdata parameter. So the first parameter is obviously going to be the response, and then we are going to write formdata equals, and it requires a dictionary, so we are going to open up a dictionary and inside it we are going to write the values that the form needs. It currently needs three values, as you saw when we right clicked over here and looked at the Network tab: the first value is csrf_token, the second is username and the third is password. So I am just going to write csrf_token and give it the value of token, which is our token variable from here; then I'm going to put a comma, and the second value it requires is username, so I'm going to give it a random username right now; obviously this needs to be the username and password that you have for the website that you want to scrape; and then the third entry is the password, and I'm going to put in the password as some gibberish, because on this site it doesn't matter. Then it requires one more parameter: the first parameter is response, the second parameter is formdata, and the last parameter is what you want Scrapy to do after it has logged in to the website. Obviously we want all of our scraping code to run after it has logged in, so what I'm going to do is introduce a parameter known as callback, which we have already used in the previous videos, and I'm just going to give it the value self.start_scraping. So what we want is that, after logging in, it calls this function start_scraping; we haven't created that function yet, so let's actually create it. I'm just going to call it start_scraping, and it requires two parameters: the first is self and the second is the response, otherwise we won't be able to scrape the response. Now all of the scraping code from the backup we can just copy and paste, so I'm just going to copy all of that parse code from over here and paste it inside this start_scraping method; then it shows an error which says unresolved reference, and that's because we haven't copied the items line, so we need to copy that into start_scraping as well; let's copy it over here, and this is looking pretty good.
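Assembled, the login spider looks roughly like this; it is a sketch, with placeholder credentials, and with the quote scraping simplified to dictionaries rather than the exact item code copied from the backup:

    import scrapy
    from scrapy.http import FormRequest

    class QuoteSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/login']

        def parse(self, response):
            # grab the CSRF token hidden in the login form
            token = response.css('form input::attr(value)').extract_first()
            # submit the login form; quotes.toscrape.com accepts any credentials
            return FormRequest.from_response(
                response,
                formdata={
                    'csrf_token': token,
                    'username': 'abc@example.com',  # placeholder credentials
                    'password': 'whatever',
                },
                callback=self.start_scraping,
            )

        def start_scraping(self, response):
            # this runs on the page we land on after logging in
            for quote in response.css('div.quote'):
                yield {
                    'title': quote.css('span.text::text').get(),
                    'author': quote.css('.author::text').get(),
                    'tag': quote.css('.tag::text').extract(),
                }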
Alright, so let me just format this properly, and let's actually try it out and see if it works. I'm going to open up my terminal again, go down, and run our crawler again; and we are going to know whether it's logged in or not by a 302 redirect, because if you remember, whenever we typed in our username and password, right clicked over here, clicked on Inspect, went to the Network tab and pressed the Login button, so that you guys can see, there is on this login request a 302 status, which basically stands for redirect. So if there is a 302 redirect in our Scrapy output, then we know for sure that our login is working, and I just want to show you once more the three values that we have used: first the csrf_token, then the username and then the password, and you can see that inside our code we have also used the same three values. So let's actually run this; I don't remember if I already ran this crawler, I think I actually ran it before, so let me just scroll up, and as you can see there is this redirecting (302) line, and we know that our login is working because the login URL is also over here.
One other way to find out whether our login functionality is working or not is to import something known as open_in_browser: we're going to write from scrapy.utils.response import open_in_browser, and then, right when we start to scrape, so for example at the start of our start_scraping method, we are going to call this open_in_browser method and give it the value of response. Now we can just crawl again; let's go to the terminal, go down, scrape it again and see what the difference is. This is actually going to open up the web page in our browser when it starts to scrape data, and you will be able to see that on the right hand side, instead of Login, the Logout button is shown, and that means that our crawler is logged in; as you can see this new page has opened up, and on the right hand side the Logout button is shown instead of the Login button, and this means that our login functionality is working.
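A quick sketch of that check, using the import path from the Scrapy utilities mentioned above:

    from scrapy.utils.response import open_in_browser

    def start_scraping(self, response):
        # pops the response open in your default browser so you can visually
        # confirm the page shows "Logout" instead of "Login"
        open_in_browser(response)
        ...  # the rest of the scraping code from the backup goes here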
So guys, this is pretty much it for this video; from the next video I think I'm going to start teaching you guys how to scrape Amazon as a complete project, so I'll see you over there.

Video 22: Python Scrapy Tutorial - 22 - Web Scraping Amazon


so by this video you already have a very good understanding of Scrapy; now, just to internalize the concepts that we have already learned, we will be working on a complete real-life project of scraping amazon.com. We will be scraping the books department of Amazon, and more specifically the collection of books that were released in the last 30 days; if you are following along you don't have to choose books, you can choose any department on Amazon. I have already created the project "Amazon tutorial" in PyCharm and have installed Scrapy; if you don't remember how to install Scrapy you can always go back to my installing-Scrapy video. Now we will just start our new project so that we can get the rest of our project files, and we do that by going to the terminal, opening it up and writing scrapy startproject and then the name of the project; in this case I'm just going to call it amazon_tutorial in lower case and press Enter, and this will start our new project and give us the project files that are needed.
So now our project has been started, and if we refresh our folder structure you can see that this new folder contains all of the remaining files that we need, and now it says you can start your first spider by going into the amazon_tutorial directory, and then it gives us this neat command, scrapy genspider, which basically means generating a generic spider; so we'll be using both of these commands. Let's first go inside this amazon_tutorial folder and then run this scrapy genspider command, and this will create a generic spider for us, so we won't have to actually create our own spider file by going inside the spiders folder; it will be created automatically. Now it asks for a name, where "example" is basically the spider name, so I'm going to call our spider amazon_tutorial, actually let's just call it amazon_spider, and then it asks us which website we want to scrape; we want to scrape amazon.com, so we will just write that and press Enter, and now, magically, our spider has been created. If we refresh the spiders folder you will be able to see that there is this new file, amazon_spider.py, and if you open it up you can see that the parse function and the Amazon spider class have been automatically created for us.
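For reference, the scaffolding commands used here are roughly the following; the project and spider names are just the ones chosen in this tutorial:

    scrapy startproject amazon_tutorial        # creates the project skeleton
    cd amazon_tutorial
    scrapy genspider amazon_spider amazon.com  # generates spiders/amazon_spider.py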
Now we can just remove this allowed_domains line, because we don't need it, and instead of the start URL being just amazon.com we are going to paste the link that we actually want to scrape; so I'm just going to copy and paste this into our start_urls, and let me just format this a little bit so that we can fit more of the content on one line, so instead of amazon.com I am just going to paste this big URL, and this looks pretty good. Before we go into this parse method and actually start scraping stuff, we need to create some items, which are temporary containers. So what are the items that we want to scrape from Amazon? There are mainly four items that we want: the first is the title, the second is the author, the third is the price and the last is the image link of this cover image. So what we will be doing is creating four items over here, so we will just copy and paste this field definition four times; the first thing that we want is the product name, so instead of name I'm just going to write product_name, and then we are going to do the same thing with all four, so let me just copy and paste this over here, over here and over here, and now instead of name we want to put in the author, then the price, and last we need to put in the image link.
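So items.py ends up looking roughly like this; a sketch, with the exact field names being just the ones assumed for this tutorial:

    import scrapy

    class AmazonTutorialItem(scrapy.Item):
        # temporary containers for the four pieces of data we scrape per book
        product_name = scrapy.Field()
        product_author = scrapy.Field()
        product_price = scrapy.Field()
        product_imagelink = scrapy.Field()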
Now we need to import this items.py file inside our amazon_spider, so we're just going to write from ..items import and then the name of that class, that is AmazonTutorialItem, so we're just going to write that. Now, inside this parse method, we'll create a variable, let's call it items, and this items variable will store an instance of this AmazonTutorialItem class, so we're just going to write AmazonTutorialItem with parentheses to make sure that this is an instance, and now we can actually start scraping the items, the four items that we selected inside the items.py file, that is the name, the author, the price and the image link.
We're going to start with the product name, so I'm going to create a variable, let's call it product_name, and we're going to use our typical CSS selector, so I'm going to write response.css, and inside it we are going to put the CSS selector that we want. Now we can go over to the page and use the SelectorGadget extension that I've already discussed, in case you have forgotten about it; this actually helps us in selecting the items that we want, and if you want to download it you can just type "selectorgadget chrome" into Google and install it from the first link. So I'm just going to click on SelectorGadget, and the first thing that I want is all the titles, so I'm just going to click on one, and then we don't want all of the other yellow parts, so I'm going to click on the parts that I don't want: I don't want this "show results for", so I'm going to click on it, and we don't want this other yellow part, so I'm going to click on that too, and now you can see all the titles on this page have been selected, which means we are okay. So this is the CSS selector that we want; I'm just going to copy it from over here and paste it inside our response.css, and now we can just write .extract, and this will give us all of the product names, and we are going to do the same thing with the product price, the image link and the author. So now let's go with product_author: instead of the name we're just going to go with author and response.css, and now let's do the same thing with the author, so I'm just going to disable SelectorGadget and enable it again, click on an author, and again we don't want all of this extra stuff, so I'm going to click on it again, and this looks pretty good, we have selected all of the authors; actually we haven't selected all of the authors, because this one is not selected, so let me just click on it, and now it is selected but all the other stuff is also selected, so click on that other stuff again to make sure it is deselected. Alright, this looks pretty good, all of the authors have been selected and all the unwanted parts have been deselected, so now we have to copy all of this from here and paste it inside our CSS selector; I know this selector is pretty big, and that is why SelectorGadget comes in handy, you don't have to work these out on your own. Now we can just write .get() over here, and this will work for us; let's remove this pass statement, and the author name has been done. Next we need the price, so we're just going to do the same thing: response.css, single quotes, let's go back to SelectorGadget, deselect once, come over here, and now it is selecting the price for us, and I also want this cents part; alright, so this has selected all of the prices, but I don't want all of them, I just want the hardcover price, so I'm going to deselect the other items from here, and let me also deselect this one, and now if you scroll down you can see that only the prices of the hardcover are being selected, so I'm just going to copy and paste this from here inside our CSS code and add .get().
Now it is time for the image links, so I'm going to create another variable, and we're going to call it product_image_link, equals response.css; I'm learning from my mistakes, so I'm going to put in the single quotes and the .get before pasting, so that I do not have to scroll all the way to the right afterwards to add them. I'm going to use SelectorGadget again, so disable, enable, and let's go to our image links, and as you can see all the images have been selected and there are no problems, so we are just going to copy this selector from here and paste it inside our response.css. Now, I just realized that I've used .get instead of .extract; .get and .extract_first are the same, but .get and .extract are not the same: if you want .get to behave in the same way as .extract you need to write .getall instead, but just for the sake of clarity I'm not going to use .get at all, I'm just going to replace all the .get calls with .extract, so let me just go over here and replace both of these with .extract. Now what we need to do is add a text selector, because right now this would extract the whole HTML tags, but we only want to extract the text; so we are going to add another double colon and the ::text part over here, and this will make sure that the whole tag doesn't get extracted, only the text from the tag gets extracted and saved in product_name. We're going to do the same thing with the second variable, but because it is using multiple classes in the CSS selector we can't just add this ::text at the end; we actually have to chain another .css call and, inside those parentheses, add the value ::text, which obviously needs to be enclosed in single quotes, so I'm just going to add that, and then the same thing with the third variable, so I'm just going to copy and paste this into the third variable. Then, in the last line, that is product_image_link, the selector as it stands would just select the element that contains the image, not the image link itself; for example, if we go back to our Amazon page, right click on an image and click on Inspect Element, you'll be able to see this image element, and the image link that we want is actually inside the src attribute, the source attribute, and we want that link. So what we are going to do is add an attribute selector over here, and the way we do that is we write colon colon, then attr, and then we want the attribute src, so we're just going to write that.
we're just gonna write that and now that
you have extracted all the four values
we are going to store them inside their
individual temporary item containers and
how do you do that it's pretty easy you
just write items and then over here
we're gonna write product name so let's
just write product name and then this
will be equal to the product name
valuable so this product name is
actually the variable in which our scrap
data is being stored and this product in
which I have misspelled and then this
product name is actually the name that
we are given inside our items dot pi
file and now we are gonna do that with
all the four values and just to save
time I'm just gonna paste all of the
four values over yeah and then we just
need to yield the items so I'm just
gonna write yield and then items and
this will complete our code now we just
need to run the scrappy and see whether
it works or not and one last thing I
want to do is actually change this name
of Amazon underscore spider because it's
just too big
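Put together, the spider ends up looking roughly like this; this is only a sketch: the CSS selectors that SelectorGadget produces are long and page-specific, so the ones below are placeholders (the exact classes you copy will differ), and the start URL is a stand-in for the books "last 30 days" link copied from the browser:

    import scrapy
    from ..items import AmazonTutorialItem

    class AmazonSpiderSpider(scrapy.Spider):
        name = 'amazon'  # shortened spider name used with `scrapy crawl amazon`
        start_urls = ['https://www.amazon.com/...']  # paste the full section URL here

        def parse(self, response):
            items = AmazonTutorialItem()

            # placeholder selectors -- replace with the ones SelectorGadget gives you
            product_name = response.css('.a-text-normal::text').extract()
            product_author = response.css('.a-size-base.a-link-normal::text').extract()
            product_price = response.css('.a-price .a-offscreen::text').extract()
            product_imagelink = response.css('.s-image::attr(src)').extract()

            items['product_name'] = product_name
            items['product_author'] = product_author
            items['product_price'] = product_price
            items['product_imagelink'] = product_imagelink

            yield items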
Now, before we run the spider, I just want to tell you that the program might not work; don't get worried, it's okay. If you have scraped Amazon before then it's probably not going to work, but if this is your first time scraping Amazon then the above code should work. The reason it may not work is that Amazon puts restrictions on you when you try to scrape a lot of its data, so what we are going to do is bypass those restrictions using things known as user agents and proxies; but before we get into both of those, which we will cover in the next two videos, I just want to run this program, show you the error that you may get, and actually show you a little bit of user agents as a preview of what's coming in the next video. So let's run our program by opening up the terminal; we will go inside the project folder, and over here we're just going to write the command that you already know to scrape our website, scrapy crawl and then amazon, which we just changed over here, and then we are going to press Enter, and my Scrapy will work because I have already bypassed those restrictions; so if this doesn't work for you don't freak out, I wanted to show you guys a working example of this, which is why I have already removed the restrictions. If I scroll up a little bit you'll be able to see that we have all of the values that we want: we have the product author, the product image link, the product name, and lastly the product price, so our Scrapy spider is working properly.
Now, if you want to test whether your code is working and it is showing you some kind of an error when you run the program, what I want you to do is open up the settings.py file, find the USER_AGENT line that is over here, copy it just below that line, and instead of the default amazon_tutorial value put in a new value. Where are you going to get this new value from? You can just go to Google, type in "Googlebot user agent", and check out the website developers.whatismybrowser.com, which is the first link over there; you can just paste that value over here, run the program again, and it should run perfectly.
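As a sketch of that tweak, the relevant line in settings.py would look something like this; the exact Googlebot user-agent string should be taken from a lookup site such as developers.whatismybrowser.com, so treat the value below as illustrative:

    # settings.py -- identify the crawler as Googlebot instead of the project name
    USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'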
Now, I'm going to go into what this user agent is and what I just pasted over here in the next video, and then in the video after that we are going to go into proxies, which is another method for bypassing the restrictions of Amazon. So guys, this is pretty much it for this video, I'll see you in the next one.

Video 23: Python Scrapy Tutorial - 23 - Bypass Restrictions using User-Agent

alright guys welcome back, in the last video we scraped the book section of Amazon and we used something known as a user agent to bypass those restrictions; so what exactly is this user agent, and how is it able to bypass the restrictions placed by Amazon? Whenever a browser like Chrome or Mozilla Firefox visits a website, that website asks for the identity of your browser, and that identity is known as a user agent; and if we keep giving the same identity to a website like Amazon, it places restrictions and sometimes bans the computer from visiting Amazon. So there are two main ways to trick Amazon using user agents. The first is to use user agents that are allowed by Amazon; for example, Amazon has to allow Google to crawl its website if it wants its products to be shown in Google search, so we can basically replace our user agent with Google's user agent, which is known as Googlebot, and trick Amazon into thinking that it is actually Google crawling the website and not us, and this is exactly what we did in the last video: we found Google's user agent string by typing it into Google search and then we replaced our user agent with Google's. The other way is to keep rotating our user agents: if Amazon identifies a computer by its user agent, then we can use fake user agents in rotation and trick Amazon into thinking that a lot of different browsers are visiting the website instead of just one, and this is what we'll be learning in this video.
Doing that is really easy because of the various Python developers who have created cool libraries and packages for us, and we are going to install one of those libraries. We will just go to Settings, click on this plus button, and then we are going to install a library known as scrapy-user-agents; we are going to be installing this second result, and if you go to Google and visit its page on pypi.org you'll be able to read all about it. If we scroll down you can see the installation is pretty easy: if you are not using PyCharm you can just pip install it, and if you are using PyCharm you can just go over here and click on Install Package; I have already installed it on my computer so I'm not going to install it again. If we go into the description of how to enable it, it says that if you're using Scrapy 1.0 or greater you just need to copy and paste these few lines, so we're going to copy that, go inside our settings.py file, and comment out the USER_AGENT line so that we don't use the Googlebot value any more; then we're going to come down to where the middlewares are, this default block is already commented out so we don't need to worry about it, and just below it we are going to paste our DOWNLOADER_MIDDLEWARES. So this looks pretty good, we don't have any more instructions, and you must be wondering how it gets a list of user agents to rotate through: it basically ships with its own user agent file that we don't even need to provide; it already has about 2,200 user agents, so it basically has 2,200 browser identities to rotate through.
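For reference, the settings change described here is, going by the scrapy-user-agents README, roughly the following; double-check the README for the version you install, since the middleware paths and priorities come from it rather than from this video:

    # settings.py
    # USER_AGENT = '...'   # comment out any fixed user agent

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    }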
So now that we have pasted those lines, we're going to run our Scrapy crawler again; we're going to go to the terminal, and you'll be able to see that this time it crawls a little bit differently, so let's run our crawler: scrapy crawl amazon, press Enter, and you'll be able to see the output. You can see a lot of warnings about unsupported browser types, and this is basically our user agent rotation at work; it is trying out different user agents so that Amazon does not pin down your real user agent and use it for restriction purposes, and as you can see we have already scraped the data, and if we scroll up a little bit you will be able to see that we have used a user agent that is given over here, and not our own user agent. So guys, this is pretty much it for this video; in the next video we are going to learn how to bypass restrictions using proxies. There are two ways to bypass restrictions: one is user agents, the other is proxies, which are basically rotating IP addresses, and I'm going to go into what IP addresses are in the next video, so you don't need to worry about that; you can also use both of these methods, the user agent method and the proxy method, together and combine them into a hybrid approach for bypassing those restrictions, and using them together is very easy, so I'm not going to go into that, but we are going to go into how to use proxies to bypass these restrictions, which is also pretty easy, we're just going to use a package. So anyway, I'll see you in the next video, I'm really excited about it, peace out.

Video 24: Python Scrapy Tutorial - 24 - Bypass Restrictions using Proxies

in the last video we bypassed the scraping restrictions by using something known as user agents, and in this video we'll be learning how to bypass them by using proxies. Now, before we even go into proxies, you need to understand what exactly an IP address is. An IP address is basically an address of your computer: just like you have an address for your house or an apartment, your computer has an address for its location. You can find your own IP address by just going to Google and typing in "what is my IP", and whenever you connect to a website you are automatically telling it your IP address, so you cannot really hide it. A website like Amazon can recognize your IP address and even ban you if you try to scrape a lot of its data, but what if we use another IP address instead of our own? It's kind of like identity theft, but it's not illegal; and even better, what if we used a lot of IP addresses that are not our own and put them in rotation, just like we did with user agents, so that every time we send a request to Amazon it goes out with a different IP address? Whenever you use an IP address that is not your own, that other IP address is known as a proxy; if you look up the definition of proxy on Google it says "the authority to represent someone else", so basically we are hiding our address and using someone else's. But I just want to make one thing clear: whenever you use another IP address from one of these lists, it is not the IP address of another person's computer in use; every IP address has to be unique, and nobody is actually using those proxy addresses, so you're not actually doing identity theft, just wanted to make that clear.
Now, using proxies is really, really easy because of the libraries that a lot of Python developers have created; more specifically, we are going to be using the scrapy-proxy-pool library, and if you scroll down you'll be able to see that all the instructions are given over here. The first thing, obviously, is that we need to install it, so we'll go over here to File, Settings, and do the same thing that we have already done a lot of times: click on this plus icon and search for the tool, which is called scrapy-proxy-pool; let me go back and check what it says, scrapy-proxy-pool, so I'm just going to type that in, and this is the package that we need. If you want to know more about it you can just click on its GitHub link, which will take you to the project page, and now just click on Install Package and install it; I have already installed it so I'm not going to install it again. Let's go ahead and look at the instructions: it says that to enable this middleware you go to the settings.py file and copy and paste this line, so that is what we are going to do, I'm just going to take this line and paste it somewhere over here, just underneath the ROBOTSTXT_OBEY setting; and then it says go to DOWNLOADER_MIDDLEWARES and copy and paste these entries, so I'm going to paste them, and I'm going to find the middlewares section just for the sake of clarity and paste them over there; in fact there is an extra downloader middleware in my file, don't worry about it, I was just testing stuff out. So that's it, and then it says you can run your program and it will work fine.
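Going by the scrapy-proxy-pool README, the settings additions described here look roughly like this; treat the exact keys and priority numbers as the README's rather than something fixed by this video, and check the version you install:

    # settings.py
    PROXY_POOL_ENABLED = True

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
        'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    }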
So now that we have pasted our middleware, let's actually run our spider: let's go to the terminal, open it up, go inside our Amazon directory, and run scrapy crawl amazon. This crawling process is going to take a little bit more time than a normal crawl, but once it has found an IP address that works, then you can just grab as much data as you want. So let's wait for it to finish scraping; it's going to try out a lot of IP addresses, because some of them might not be working right now, and it's going to use this website known as Free Proxy List to get those IP addresses. If you want to use your own IP addresses you can always go to Google and type in something like "github proxy list", and that will give you lists of proxies, and you can use any of these; for example, if I go to this repository and open the last proxy file, we will see that there are a lot of IP addresses and port numbers and then the country where they are from, and similarly in another GitHub repository, if we go to this free proxy list, you will be able to see that it has a lot of IP addresses, and the part after the colon is actually the port number. So this is basically what proxies are: collections of IP addresses and port numbers that are not your own. Let's go and see if our program has completed; it hasn't yet, so we're just going to wait a little bit more. Alright guys, so our scraper has finally finished, and if we scroll up a little bit you can see that our page has been scraped properly; if it took you a couple of minutes don't worry about it, sometimes it takes a lot of time to find the right proxy, but in the bigger scheme of things, for example if you want to scrape thousands of pages of Amazon, it might be better to just wait a few minutes instead of running into problems later. So guys, this is pretty much it for this video; in the next video we are going to learn how to scrape the pagination of Amazon. If you have been following along with all of these videos, I have already covered that earlier, but I'm also going to cover it again in the next video so that you guys have an internalized understanding of how to do it; instead of just watching the next video, try to do it on your own first, but if you're not able to, don't worry about it, we cover it in the next video and I'll see you over there.

Video 25: Python Scrapy Tutorial - 25 - Scraping multiple pages of Amazon

alright guys welcome to the last video of the series, in which we will be scraping multiple pages from our amazon.com website. I've opened up the link, and what we'll be doing in this video is scraping the pagination over here, so that we can scrape multiple pages of this last-30-days books section; you guys probably already know how to do it, but I'm just going to go over it again for the sake of completion. The first thing we are going to do is create a variable over here, let's call it page_number, and I'm going to give it the value of 2; then we'll just go back to Chrome and check out the website, and if we go down over here you can see that there are multiple pages, 1, 2, 3, but in the URL there is no page number, there is no query parameter for the page; if we click on the second page, though, you can see that this page=2 has appeared, and that is why we will be using this link. So we'll just copy this link from here, go back to our code, and just underneath this yield items I'm going to create a variable, let's call it next_page, and I'm going to store that URL inside it, and I'm going to build it from this page_number that we have just created inside our class, that is this page_number equals 2, and this next_page variable basically contains the next page that we want to scrape. So what we are going to do is go over here, close the string, add another quote over here, and then replace this 2 with the page number; but obviously, because we are referring to a class variable, we also have to write down the name of the class, and in our case the name of the class is AmazonSpiderSpider, so we're going to write AmazonSpiderSpider, then add a dot operator, and this will make sure that we refer to the right thing. Then it complains, because this needs to be a string, and that is why we will convert this integer into a string by wrapping it in str(). Then, if you want, you can put in an if condition which basically says: if the page number is less than or equal to one hundred, because there are a hundred pages in our pagination and we just want to scrape a hundred pages; you can put that in, and if you don't want to, that is also fine, but I'm going to put it in for the sake of completion and to make sure that you guys understand what I'm doing. So let's go onto the next line and write an if condition which says if AmazonSpiderSpider.page_number is less than or equal to 100; let me just format this properly, and then underneath this if condition we are going to write yield and then response.follow, because after it scrapes the items that are inside this parse method it should follow the next page using this response.follow, which we have already covered in the pagination video. This response.follow method requires two parameters: the first is where it should go, and we want it to go to the next page; and the second is what function we want to call after it goes to the next page. We again want to scrape the same items, so we want it to go to the next page and basically run this parse method again, so we are just going to write a callback, callback equal to self.parse, and this will make sure that after it goes to the next page it again scrapes the same things, and this whole thing will go on until it reaches page number 100.
run the our scraper and see if it works
and one more thing in settings taught by
file I'm gonna add our Google port so
that we don't have any problems
scrapping Amazon so let's go back to our
terminal and if you forgot about your
agents just go back to the video termed
as named as user agent and have a look
at it again so I'm just going to run our
scraper again so let's click on enter
and hopefully it will keep on scraping
till we reach page number 100 so let's
just wait for it alright guys so it
didn't work it only scraped the first
two pages so the problem with this is
that if I were to increase the value of
this page number by one so let's go back
over here anyways like a very stupid
mistake but we are humans so I'm just
gonna increase the value
in this page number y1 and then let's go
back to our terminal and let's see
what's up so let me just scrap the
Amazon again all right now you can see
it has grabbed the second page now just
grab the third page and now it will go
on deliver th number reach page number
100 so this is looking pretty good guys
and so this was the last video of this
video series but don't get disheartened
because I'm going to be uploading a lot
more projects in this video series I
have a couple of projects in mind for
example scraping images and stuff like
that just tell me in the descriptions or
in the discussions wherever you want
what you want me to create further in
this course and make sure you also have
a look at my other courses because I
have a lot of other Python related video
series tool for example networking
videos GUI video is how to create GI
software's and stuff like that so I'll
see you in one of those other video
series next time peace out
