Session 3 Data Acquisition - Updated


Innovation and Marketing Analytics
Prof. Qiaoni Shi
Questions?
Today’s Plan
• Introduction to web scraping
• Web scraping with requests & BeautifulSoup
• Web scraping with Selenium
Introduction to Web Scraping
Web page
• Webpages are (mostly) written in HTML
• The web page is delivered to the user's browser exactly as stored
• Each webpage is a separate HTML file
Web page
• HTML: HyperText Markup Language
Tree-like Structure of an HTML Page
HTML tags
Tag          Name/function
<head>       Head of an HTML document; contains elements describing the document
<body>       Body of an HTML document; the content of the web page
<h1>…<h6>    Headings
<p>          Paragraph
<div>        A block-level section
<span>       An inline section
<a>          A link
<li>         List item
<ul>         Unordered list
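The tree structure these tags form can be made visible with a few lines of Python. A minimal sketch, using only the standard-library html.parser and a made-up page, that prints each tag indented by its nesting depth:

```python
# Print the tag tree of an HTML page, indenting by nesting depth.
# Uses only the standard library; the page content is made up.
from html.parser import HTMLParser

HTML = """
<html><head><title>Demo</title></head>
<body><h1>Heading</h1><ul><li>item</li></ul></body></html>
"""

class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)  # indent by current depth
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

p = TreePrinter()
p.feed(HTML)
print("\n".join(p.lines))
```

The output shows <head> and <body> nested inside <html>, and <li> nested inside <ul>, exactly the tree the table above describes.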
HTML Resources
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• https://www.w3schools.com/html/html_intro.asp
• Mac OS
• Chrome -> Developer -> View source
• Command + Shift + Option
• Windows
• Chrome -> right click -> Inspect

Source: Chris Bail
Exercise

• Pick a webpage and check the following items:
• How is it organized?
• Where is the head and where is the body?
• Is it a tree-like structure?
Outline of basic web scraping
Web scraping with requests & BeautifulSoup
Steps
Step 1 Request Information

You need to request information from the URL and get the HTML text data.

https://m.imdb.com/title/tt1160419/
Send request
import requests

page = requests.get(url)
page.content
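Step 1 can be sketched end to end. To keep it runnable offline, this example serves a tiny made-up page from a local http.server thread instead of requesting a real site like IMDb; only the URL differs from the real workflow.

```python
# Step 1 sketch: request a page and get its HTML text data.
# A local http.server stands in for a real website so this runs offline.
import threading
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer

HTML = b"<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(HTML)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
page = requests.get(url)       # send the request
print(page.status_code)        # 200 on success
print(page.content[:40])       # raw HTML bytes, ready for parsing
server.shutdown()
```

With a real site, `url` would simply be the page address, e.g. the IMDb URL above.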
Tree-like Structure of an HTML Page
Step 2 Parsing Information

What we want
Parsing Data
soup = BeautifulSoup(page.content, 'html.parser')
• Locate the information we want
soup.find("h1").get_text()
soup.find_all()
Parsing Data
soup.find("div", {"class": "…"})
soup.find(id="…")
soup.find_all("span", class_="…")
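Putting the locators above together, a minimal sketch of Step 2. The inline HTML string stands in for page.content from Step 1, and the class and id names here are made up for illustration:

```python
# Step 2 sketch: parse HTML and locate elements by tag, class, and id.
# The HTML string and its class/id names are made up for illustration.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Joker</h1>
  <div class="summary">A gritty character study.</div>
  <span class="rating">8.4</span>
  <span class="rating">8.5</span>
  <p id="year">2019</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()                           # first <h1>
summary = soup.find("div", {"class": "summary"}).get_text()  # by class
year = soup.find(id="year").get_text()                       # by id
ratings = [s.get_text() for s in soup.find_all("span", class_="rating")]
print(title, year, ratings)
```

Note that find() returns only the first match, while find_all() returns every matching element as a list.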
Web scraping with Selenium
Outline of basic web scraping

Selenium
Selenium is a Python module that controls a browser to open a webpage and extract data from it.
A unique advantage of Selenium

• Selenium can handle non-static webpages whose content is hidden behind code (e.g. JavaScript)
• How? Selenium can interact with the browser. For example, Selenium can click on a button / link / dropdown menu, etc.
Step 0 Import Modules
# install firefox, geckodriver, and selenium
!apt-get update
!pip install selenium
!apt install firefox-geckodriver
!cp /usr/lib/geckodriver /usr/bin
!cp -r /usr/lib/firefox /usr/bin

from selenium import webdriver

binary = '/usr/bin/firefox'
options = webdriver.FirefoxOptions()
options.binary = binary
options.add_argument('--headless')
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
Step 1&2 Send requests, Parsing Data
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
driver.get(url)
driver.page_source
Selenium
# requires: from selenium.webdriver.common.by import By
.find_element(By.CLASS_NAME, "")
.find_element(By.XPATH, "")
.find_elements(By.CLASS_NAME, "")
.find_elements(By.XPATH, "")

e.g.,
driver.find_elements(By.CLASS_NAME, "review-container")
Selenium locator
Selenium
Browser interaction

.click() instructs the browser to click on the element

Select() wraps a dropdown element so the browser can select one of its options

Example:

dropdown_box = Select(elem)
dropdown_box.select_by_visible_text('Most recent')

First, this tells the browser which dropdown box the element refers to;
second, it instructs the browser to choose the option with the text 'Most recent'.
Selenium
Browser interaction

.back() instructs the browser to go back one page

.forward() instructs the browser to go forward one page


Step 3 Save Data (Pandas)
• final_dict = {'v1': list1, 'v2': list2}
• df = pd.DataFrame(final_dict)
• df.to_csv()
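The three lines above can be sketched with concrete values. The column names and data here are made up; in practice, the lists would hold fields collected during scraping.

```python
# Step 3 sketch: collect scraped fields into lists, build a DataFrame,
# and save to CSV. Column names and values are made up for illustration.
import pandas as pd

titles = ["Joker", "Parasite"]   # e.g. one list per scraped field
ratings = [8.4, 8.5]

final_dict = {"title": titles, "rating": ratings}
df = pd.DataFrame(final_dict)
df.to_csv("reviews.csv", index=False)  # index=False drops the row index column
print(df.shape)
```

Each key in the dict becomes a column, and each list supplies that column's values, so all lists must have the same length.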
More tools
Comparison of Web Scraping Tools
Tool           Pros                                              Cons
BeautifulSoup  Easy to learn; extensive documentation            Slow performance
Selenium       Easy to learn; can scrape non-static pages        Slow performance
               (e.g. JavaScript) via browser automation
Scrapy         Good integration with data pipeline, proxies,     More complex
               VPN; fast performance
Documentation of Scrapy: https://docs.scrapy.org/en/latest/


More tools

https://www.youtube.com/watch?v=n7fob_XVsbY
Challenges in Web Scraping
• Time investment
• Each website is different and requires custom-made web scraping code
• Fragility of code
• Web scraping code may break when the website is redesigned (even slightly)
• Requires continual monitoring and maintenance for ongoing / production data sources
• Website may block / IP-ban your scraper
Resources
• HTML
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• BeautifulSoup
• Filters applied to search the tree
• https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
• Documentation
• https://www.crummy.com/software/BeautifulSoup/bs4/doc
• Selenium
• Documentation
• https://selenium-python.readthedocs.io/
• https://www.selenium.dev/
Questions?
