Web Scraping - Buildwithpython
Video 7: Python Scrapy Tutorial - 7 - Creating our first spider (web crawler)
Video 8: Python Scrapy Tutorial - 8 - Running our first spider (web crawler)
Video 11: Python Scrapy Tutorial - 11 - Web Scraping Quotes and Authors
Alright guys, welcome back. In the last couple of videos we learned about selectors, and more specifically we learned about the CSS selectors and the XPath selectors. Now I believe you're finally ready to extract data from this quotes website, quotes.toscrape.com, and we're finally going to be writing code inside our quotes_spider.py file. But before we start writing code, let's actually inspect the elements and see the HTML that these quotes are built from. So we're just going to right click over here and click on Inspect, and you'll be able to see that each of these elements sits inside a class="quote". The first element, which contains the quote, the author and the tags, is represented by a division (div) tag with class="quote", and similarly with the second quote, the third quote and so on.
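If you want to poke at these elements interactively before touching the spider file, Scrapy's shell is a handy sandbox; a minimal sketch, assuming Scrapy is installed and the site is reachable:

    scrapy shell "http://quotes.toscrape.com"
    # inside the shell, this returns one selector per quote block:
    response.css('div.quote')
    # you can try smaller selectors here before committing them to the spider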
So what we're going to do first is select this division with class="quote"; we're going to select all of these division tags from our response variable, which contains the page source. Let's go back inside our Python file. This title yield from before we'll just remove, because we don't need it now, and instead we're going to create a new variable, let's call it all_div_quotes. What this variable is going to contain is all of these division tags over here which have the class "quote". So let's go back to our code, and just like we did in the previous videos we're going to write response.css, using the CSS selector, and because we want all of the quotes in the division tags we're going to write div.quote. We are not going to write .extract() over here, because we don't want the raw element data; we just want the items inside this class, for example the quote itself, then the author name, then the tags. So we're going to take this class="quote" selection and go inside it: instead of using the response variable, we're going to be using all_div_quotes.
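As a sketch, inside the spider's parse() method that selection would look like this (the variable name just mirrors what's said in the video):

    # inside parse(self, response) in our spider file
    all_div_quotes = response.css('div.quote')   # one selector per quote block; no .extract() yet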
Let me actually show you what I'm talking about. We're going to create a new variable, let's call it title, and now instead of using the response variable, which contains all of the page source, we're just going to be using all_div_quotes, which only contains the source of the division tags with class "quote". So let's go back to our code, and over here we're just going to write all_div_quotes.css, and then we're going to use another CSS selector to extract the title, that is, the main quote. If we open the element up a little bit, we'll be able to see that we want to extract the data from this span HTML tag, which has a class of text. So we can just write span.text to select it (make sure you write it inside the quotation marks), and then we can write .extract(). Now obviously we need to extract the text part of this quote and not the whole span HTML element, so we're going to add ::text over here, just like we learned in the previous videos.
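Roughly, the title line ends up like this; the ::text pseudo-element is what pulls out the text node instead of the whole span element:

    # select the <span class="text"> inside each quote block and keep only its text
    title = all_div_quotes.css('span.text::text').extract()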
Now that we have extracted the main quote, let's extract the author name. We'll create another variable, let's call it author, and again, instead of response, we're going to use all_div_quotes with a CSS selector. Let's go back to the page to see how we can extract the data inside the quote. If you open up this span, you can see it contains another tag inside it, but what we want is this small HTML tag which has the class of author, and inside it we have this Albert Einstein, which is what we want. So what we can do is come over here and just use the .author selector, and because we just want the text we can again write ::text, and we can .extract() this as well.
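The author line is the same idea, selecting by class alone:

    # <small class="author"> holds the author name
    author = all_div_quotes.css('.author::text').extract()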
Now that we have the author, let's actually extract the tags. If we go inside this division of tags and open it up, you'll be able to see that the tags are inside these a tags with class="tag", and there are four of them. So we can go back to our code and write tag =, and again we're just going to use all_div_quotes.css, selecting with this .tags, and I just want the text from it, so I'm going to write ::text and .extract(), and this should work fine.
Now, like we did in all of the previous videos, we're going to return something, and in web scraping, in Scrapy especially, instead of return we use the yield statement. So I'm just going to write yield, and because you always yield a dictionary, I'm going to create a dictionary. Every dictionary needs a key and then a value, so I'm going to write a 'title' key over here and then give it the value of the title variable, and I'm going to do the same with both the author and the tag. For the author I'm just going to write 'author' and return the value of author (I forgot to put a comma over here, let me add that), and then I can just write 'tag' and over here return the value of that.
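Put together, the yield looks roughly like this (the keys simply mirror the variable names used in the video):

    yield {
        'title': title,    # the quote text
        'author': author,  # who said it
        'tag': tag,        # the list of tag strings
    }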
This all looks pretty good, but it's going to return a lot of data, so for now I'm just going to return the data of the first quote. Over here the first quote contains this "The world as we have created it is a process..." quote, the author and these tags. Instead of returning all of that, I can just write a [0] over here, because like we learned, this is basically a list, so we can grab the first element from it instead of all of the elements.
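One way to do that, as a sketch on top of the .extract() calls above, is plain list indexing; Scrapy selectors also have an extract_first() helper if you only ever want a single value:

    title = all_div_quotes.css('span.text::text').extract()[0]
    # or equivalently:
    title = all_div_quotes.css('span.text::text').extract_first()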
Now, to run this crawler, we can just open up our terminal, go inside our project folder and press Enter. Let's try scrapy crawl quotes; quotes is the name of the spider that we created, in case you have forgotten. Now we can just press Enter and see what happens.
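In the terminal that is roughly the following; the project folder name here is just a placeholder for whatever yours is called:

    cd quotetutorial        # hypothetical project folder name
    scrapy crawl quotes     # 'quotes' is the name attribute defined in the spider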
Now if we scroll up a little bit, let's see what Scrapy has returned to us. As you can see, we have some output: it says the quote is over here, "The world as we have created it is a process...", which is pretty good, and we have the author, but there's some problem with the tags, which aren't giving us anything. So let's see what the problem is. For the tags it says all_div_quotes.css('.tags::text').extract(). Let's go back and look: this class is actually equal to "tag", so instead of writing .tags we have to write just .tag. I kind of misspelled it, so we have to correct it: instead of .tags we just write .tag, and this should work properly.
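So the corrected line, with the singular class name, would be:

    # each tag link is an <a class="tag">, so the selector is .tag, not .tags
    tag = all_div_quotes.css('.tag::text').extract()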
So let's go back to our terminal. I made a little mistake there, but that's fine; I'm not going to remove it from the video, because it's kind of a good learning lesson. I'm going to press Enter again. Alright, here it is: we have extracted the title, the author Albert Einstein, and we have the tags. Now let's actually remove this zero index over here and see what the output is when we extract all of the quotes, with the title, author and tags, from this main page.
Our page has quite a lot of quotes, so let's go back to the terminal, scroll down, and run scrapy crawl quotes again. Alright, it has been crawled. Let's scroll up, and as you can see this contains a lot of quotes: here are our quotes, then here are our authors, and then the tags should start somewhere over here. So this gives us a lot of values for title, authors and tags. Now, what if we want the values of title, author and tags one by one, instead of everything being thrown at us all at once?
So what we're going to do is create a for loop. Over here we can write for q, or instead of q we can write quote, in all_div_quotes, so we're going to go through all of the quotes that are inside these div tags. Let's actually inspect over here again and wait for this to open up. Yeah, so we're going to take these division tags one by one, and one by one we're going to pull out the quote, the author and the tags; that's the process. What we can do is copy these extraction lines down into the loop, and now instead of all_div_quotes we can just put quote over here. Actually, let's change the loop variable to quotes so that we don't have to change a lot of lines, and we can just remove this part over here, remove this, and then press Tab so that these lines come inside the for loop.
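At this point the whole spider would look roughly like the sketch below; the file name, class name and start_urls are assumptions carried over from the earlier videos in the series:

    # quotes_spider.py (assumed filename)
    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            all_div_quotes = response.css('div.quote')

            # take the quote blocks one at a time
            for quotes in all_div_quotes:
                title = quotes.css('span.text::text').extract()
                author = quotes.css('.author::text').extract()
                tag = quotes.css('.tag::text').extract()

                yield {
                    'title': title,
                    'author': author,
                    'tag': tag,
                }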
And this looks pretty good. Let me just format it properly and run the crawler again to see how our output looks now. So the website has been crawled, and if we scroll up you'll be able to see that a lot of stuff has been scraped. If we go to the top, you'll see that these quote sections have now been scraped one by one: we have the title, the author and the tag, then we go to the second section, where we have the quote, the author and the tag, and similarly the third title, author and tag. You can do it either way, that's totally fine. This for-loop approach of extracting one quote section at a time helps you store them properly inside a database, but it won't matter much when we go to the next video, where we'll be learning about items and making sure that the title, author and tags are properly organised and stored in a proper form. So I'll see you in the next video, where we'll be learning about items in Scrapy.
Video 12: Python Scrapy Tutorial - 12 - Item containers ( Storing scraped data )
Video 13: Python Scrapy Tutorial - 13 - Storing in JSON, XML and CSV
Video 19: Python Scrapy Tutorial - 19 - Web Crawling & Following links