Session 3 Data Acquisition - Updated


Innovation and Marketing Analytics
Prof. Qiaoni Shi
Questions?
Today’s Plan
• Introduction to web scraping
• Web scraping with requests & BeautifulSoup
• Web scraping with Selenium
Introduction to Web Scraping
Web page
• Webpages are (mostly) written in HTML
• The web page is delivered to the user's browser exactly as stored
• Each webpage is a separate HTML file
Web page
• HTML: HyperText Markup Language
Tree-like Structure of an HTML Page
HTML tags
Tag          Name/function
<head>       Head of an HTML document; contains elements describing the document
<body>       Body of an HTML document; the content of the web page
<h1>…<h6>    Headings
<p>          Paragraph
<div>        A block-level section
<span>       An inline section
<a>          A link
<li>         List item
<ul>         Unordered list
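The tree structure these tags form can be made visible with a few lines of Python. A minimal sketch, using only the standard-library html.parser and a made-up page, that prints each tag indented by its nesting depth:

```python
# Print the tag tree of an HTML page, indenting by nesting depth.
# Uses only the standard library; the page content is made up.
from html.parser import HTMLParser

HTML = """
<html><head><title>Demo</title></head>
<body><h1>Heading</h1><ul><li>item</li></ul></body></html>
"""

class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)  # indent by current depth
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

p = TreePrinter()
p.feed(HTML)
print("\n".join(p.lines))
```

The output shows <head> and <body> nested inside <html>, and <li> nested inside <ul>, exactly the tree the table above describes.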
HTML Resources
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• https://www.w3schools.com/html/html_intro.asp
• Mac OS
• Chrome -> Developer -> View source
• Command + Shift + Option
• Windows
• Chrome -> right click -> Inspect

Source: Chris Bail
Exercise

• Pick a webpage and check the following items:
• How is it organized?
• Where is the head and where is the body?
• Is it a tree-like structure?
Outline of basic web scraping
Web scraping with requests & BeautifulSoup
Steps
Step 1 Request Information

You need to request information from the URL and get the HTML text data.

https://m.imdb.com/title/tt1160419/
Send request
import requests

page = requests.get(url)
page.content
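Step 1 can be sketched end to end. To keep it runnable offline, this example serves a tiny made-up page from a local http.server thread instead of requesting a real site like IMDb; only the URL differs from the real workflow.

```python
# Step 1 sketch: request a page and get its HTML text data.
# A local http.server stands in for a real website so this runs offline.
import threading
import requests
from http.server import BaseHTTPRequestHandler, HTTPServer

HTML = b"<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(HTML)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
page = requests.get(url)       # send the request
print(page.status_code)        # 200 on success
print(page.content[:40])       # raw HTML bytes, ready for parsing
server.shutdown()
```

With a real site, `url` would simply be the page address, e.g. the IMDb URL above.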
Tree-like Structure of an HTML Page
Step 2 Parsing Information

What we want
Parsing Data
soup = BeautifulSoup(page.content, 'html.parser')
• Locate the information we want
soup.find("h1").get_text()
soup.find_all()
Parsing Data
soup.find("div", {"class": "…"})
soup.find(id="…")
soup.find_all("span", class_="…")
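Putting the locators above together, a minimal sketch of Step 2. The inline HTML string stands in for page.content from Step 1, and the class and id names here are made up for illustration:

```python
# Step 2 sketch: parse HTML and locate elements by tag, class, and id.
# The HTML string and its class/id names are made up for illustration.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Joker</h1>
  <div class="summary">A gritty character study.</div>
  <span class="rating">8.4</span>
  <span class="rating">8.5</span>
  <p id="year">2019</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()                           # first <h1>
summary = soup.find("div", {"class": "summary"}).get_text()  # by class
year = soup.find(id="year").get_text()                       # by id
ratings = [s.get_text() for s in soup.find_all("span", class_="rating")]
print(title, year, ratings)
```

Note that find() returns only the first match, while find_all() returns every matching element as a list.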
Web scraping with Selenium
Outline of basic web scraping

Selenium
Selenium is a Python module that controls a browser to open a webpage and extract data from it.
A unique advantage of Selenium

• Selenium can handle non-static webpages whose content is hidden behind code (e.g. JavaScript)
• How? Selenium can interact with the browser. For example, Selenium can click on a button / link / dropdown menu, etc.
Step 0 Import Modules
# install firefox, geckodriver, and selenium
!apt-get update
!pip install selenium
!apt install firefox-geckodriver
!cp /usr/lib/geckodriver /usr/bin
!cp -r /usr/lib/firefox /usr/bin

from selenium import webdriver

binary = '/usr/bin/firefox'
options = webdriver.FirefoxOptions()
options.binary = binary
options.add_argument('--headless')
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
Step 1&2 Send requests, Parsing Data
driver = webdriver.Firefox(options=options, executable_path='/usr/bin/geckodriver')
driver.get(url)
driver.page_source
Selenium
# requires: from selenium.webdriver.common.by import By
.find_element(By.CLASS_NAME, "")
.find_element(By.XPATH, "")
.find_elements(By.CLASS_NAME, "")
.find_elements(By.XPATH, "")

e.g.,
driver.find_elements(By.CLASS_NAME, "review-container")
Selenium locator
Selenium
Browser interaction

.click() instructs the browser to click on the element

Select() wraps a dropdown element so the browser can select one of its options

Example:

dropdown_box = Select(elem)
dropdown_box.select_by_visible_text('Most recent')

First, this tells the browser which dropdown box the element refers to;
second, it instructs the browser to choose the option with the text 'Most recent'.
Selenium
Browser interaction

.back() instructs the browser to go back one page

.forward() instructs the browser to go forward one page


Step 3 Save Data (Pandas)
• final_dict = {'v1': list1, 'v2': list2}
• df = pd.DataFrame(final_dict)
• df.to_csv()
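The three lines above can be sketched with concrete values. The column names and data here are made up; in practice, the lists would hold fields collected during scraping.

```python
# Step 3 sketch: collect scraped fields into lists, build a DataFrame,
# and save to CSV. Column names and values are made up for illustration.
import pandas as pd

titles = ["Joker", "Parasite"]   # e.g. one list per scraped field
ratings = [8.4, 8.5]

final_dict = {"title": titles, "rating": ratings}
df = pd.DataFrame(final_dict)
df.to_csv("reviews.csv", index=False)  # index=False drops the row index column
print(df.shape)
```

Each key in the dict becomes a column, and each list supplies that column's values, so all lists must have the same length.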
More tools
Comparison of Web Scraping Tools
Tool           Pros                                              Cons
BeautifulSoup  Easy to learn; extensive documentation            Slow performance
Selenium       Easy to learn; can scrape non-static pages        Slow performance
               (e.g. JavaScript) via browser automation
Scrapy         Good integration with data pipeline, proxies,     More complex
               VPN; fast performance
Documentation of Scrapy: https://docs.scrapy.org/en/latest/


More tools

https://www.youtube.com/watch?v=n7fob_XVsbY
Challenges in Web Scraping
• Time investment
• Each website is different and requires custom-made web scraping code
• Fragility of code
• Web scraping code may break when the website is redesigned (even slightly)
• Requires continual monitoring and maintenance for ongoing / production data sources
• Website may block / IP-ban your scraper
Resources
• HTML
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• BeautifulSoup
• Filters applied to search the tree
• https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
• Documentation
• https://www.crummy.com/software/BeautifulSoup/bs4/doc
• Selenium
• Documentation
• https://selenium-python.readthedocs.io/
• https://www.selenium.dev/
Questions?
