Session 3 Data Acquisition - Updated
Marketing Analytics
Prof. Qiaoni Shi
Questions?
Today’s Plan
• Introduction to Web scraping
• Web scraping with requests & BeautifulSoup
• Web scraping with Selenium
Introduction to Web scraping
Web page
• Webpages are (mostly) written in HTML
• A static web page is delivered to the user’s browser exactly as stored
• Each webpage is a separate HTML file
Web page
• HTML: HyperText Markup Language
Tree-like Structure of an HTML Page
HTML tags
Tag          Name/function
<head>       Head of an HTML document, which contains elements describing the document (metadata)
<body>       Body of an HTML document, which is the content of the web page
<h1>…<h6>    Headings
<p>          Paragraph
<div>        A block-level section
<span>       An inline section
<a>          A link
<li>         List item
<ul>         Unordered list
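To make the tree-like structure concrete, here is a minimal sketch that prints the nesting of tags in a small page using Python's built-in html.parser module. The sample HTML and the names SAMPLE_HTML, TagTreePrinter, and tag_tree are made up for illustration; they are not from a real website.

```python
from html.parser import HTMLParser

# Hypothetical sample page, made up for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample movie page</title></head>
  <body>
    <h1>Sample Movie</h1>
    <div class="summary">
      <p>A short plot summary with a <a href="/cast">cast link</a>.</p>
      <ul>
        <li>First fact</li>
        <li>Second fact</li>
      </ul>
    </div>
  </body>
</html>
"""


class TagTreePrinter(HTMLParser):
    """Collects one indented line per opening tag to show nesting depth.

    Assumes every tag in the input is explicitly closed, as in SAMPLE_HTML.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent by the current depth, then go one level deeper.
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag moves back up one level in the tree.
        self.depth -= 1


def tag_tree(html):
    """Return the indented list of opening tags found in html."""
    parser = TagTreePrinter()
    parser.feed(html)
    return parser.lines


print("\n".join(tag_tree(SAMPLE_HTML)))
```

Each extra level of indentation in the printed result is one step further down the tree: <html> contains <head> and <body>, <body> contains <h1> and <div>, and so on.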
HTML Resources
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• https://www.w3schools.com/html/html_intro.asp
• Mac OS
• Chrome -> View -> Developer -> View Source
• Command + Option + U
https://m.imdb.com/title/tt1160419/
Step 1 Sending a Request
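import requests
url = "https://m.imdb.com/title/tt1160419/"
page = requests.get(url)   # send an HTTP GET request to the page
page.content               # raw HTML of the response, as bytes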
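The request step could be wrapped in a small function, sketched below. The names fetch_html and HEADERS, the User-Agent string, and the 10-second timeout are assumptions for illustration; the get parameter is a hypothetical design choice that lets the function be tried without network access.

```python
import requests

# URL taken from the slide above.
URL = "https://m.imdb.com/title/tt1160419/"

# Assumption: some sites reject the default requests User-Agent, so a
# browser-like header is sent. The exact string is illustrative only.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; course-demo)"}


def fetch_html(url, get=requests.get):
    """Return the raw HTML bytes for url, raising on HTTP errors.

    `get` defaults to requests.get; passing another callable (hypothetical
    design choice) allows a dry run without touching the network.
    """
    page = get(url, headers=HEADERS, timeout=10)
    page.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return page.content


# Usage (performs a real network request):
# html = fetch_html(URL)
# print(html[:200])
```

Checking the status before parsing helps separate "the site refused the request" from "the parsing code is wrong".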
Tree-like Structure of an HTML Page
Step 2 Parsing Information
What we want
Parsing Data
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, "html.parser")
• Locate the information we want
soup.find("h1").get_text()   # text of the first <h1> tag
soup.find_all()              # returns a list of all matching tags
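A minimal sketch of find, find_all, and get_text on a small inline page, so it runs without a network request. The HTML below is a hypothetical stand-in for page.content.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for page.content (bytes, like a real response).
html = b"""
<html><body>
  <h1>Sample Movie</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() extracts its text.
title = soup.find("h1").get_text()

# find_all() returns a list of every matching tag.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# find() returns None when nothing matches, so check before calling get_text().
missing = soup.find("h2")

print(title)
print(paragraphs)
print(missing)
```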
Parsing Data
soup.find("div", {"class": "…"})
soup.find(id="…")
soup.find_all("span", class_="…")
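The three filter styles above could be tried on a small inline snippet as sketched here. The class names movie-card and genre and the id rating are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the class and id values are made up.
html = """
<div class="movie-card">
  <span id="rating">8.1</span>
  <span class="genre">Action</span>
  <span class="genre">Sci-Fi</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter by tag name plus an attribute dictionary.
card = soup.find("div", {"class": "movie-card"})

# Filter by id only (ids are meant to be unique within a page).
rating = soup.find(id="rating").get_text()

# class is a reserved word in Python, so BeautifulSoup uses class_ instead.
genres = [s.get_text() for s in soup.find_all("span", class_="genre")]

# Searches can also start from a tag that was already found.
spans_in_card = card.find_all("span")

print(rating, genres, len(spans_in_card))
```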
Web scraping with Selenium
Outline of basic web scraping
Selenium
Selenium is a browser-automation tool; its Python package controls a browser to open a webpage and extract data from it
A unique advantage of Selenium
e.g.,
driver.find_elements(By.CLASS_NAME, "review-container")
Selenium locator
Selenium
Browser interaction
Example:
dropdown_box = Select(elem)
dropdown_box.select_by_visible_text('Most recent')
The first line wraps the dropdown element (elem) in a Select object.
The second line instructs the browser to choose the option with the text 'Most recent'.
Selenium
Browser interaction
https://www.youtube.com/watch?v=n7fob_XVsbY
Challenges in Web Scraping
• Time investment
• Each website is different and requires custom-made web scraping code
• Fragility of code
• Web scraping code may break when the website is redesigned (even slightly)
• Requires continual monitoring and maintenance for ongoing / production data sources
• Website may block / IP-ban your scraper
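Two of the challenges above, fragility and blocking, can be softened a little in code. The sketch below returns None instead of crashing when an element disappears after a redesign, and pauses between requests. The names text_or_none and scrape_titles, the fetch parameter, and the one-second default delay are assumptions for illustration, not recommendations from any particular website.

```python
import time

from bs4 import BeautifulSoup


def text_or_none(soup, name, **filters):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, **filters)
    return tag.get_text(strip=True) if tag is not None else None


def scrape_titles(urls, fetch, delay_seconds=1.0):
    """Map each URL to the text of its <h1>, or None if the page has none.

    `fetch` is any callable that takes a URL and returns HTML (hypothetical
    parameter), for example a function built on requests.get.
    """
    titles = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        soup = BeautifulSoup(fetch(url), "html.parser")
        titles[url] = text_or_none(soup, "h1")
    return titles
```

A None in the result is a signal to inspect that page by hand; it often means the site layout has changed.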
Resources
• HTML
• https://www.youtube.com/watch?v=UB1O30fR-EE
• https://www.codecademy.com/learn/learn-html
• BeautifulSoup
• Filters applied to search the tree
• https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
• Documentation
• https://www.crummy.com/software/BeautifulSoup/bs4/doc
• Selenium
• Documentation
• https://selenium-python.readthedocs.io/
• https://www.selenium.dev/
Questions?