DADS404 Unit-02 - V1.1
DADS404
DATA SCRAPING
Unit 2
Finding Data Across Sources
Table of Contents
1 Introduction
   1.1 Learning Objectives
2 How Data Scraping is Done (SAQ 1)
3 Scraping Data from Different Sources (SAQ 2)
   3.1 Considerations When Scraping Different Sources
4 Identifying the Right Data Source (SAQ 3)
5 Understanding the Website Structure for Scraping (SAQ 4, 5, 6)
   5.1 HTML
   5.2 CSS Selectors
   5.3 JavaScript and Dynamic Content
   5.4 Sitemaps (SAQ 7)
   5.5 Robots.txt
6 Summary
7 Glossary
8 Terminal Questions
9 Answer Keys
10 Suggested Books and E-References
1. INTRODUCTION
In our journey of understanding data scraping and wrangling, we've reached a critical
junction: finding data across sources. Now, you may ask, why is this so important? Can't we
just take any data and get started? Well, here's the catch: data is like the ingredients for our
analytical cooking. The quality, relevance, and accessibility of our data directly affect the
outcome of our analysis.
Just as a chef chooses the freshest produce or a carpenter selects the right wood, a data
analyst or scientist must know how to find and identify the most suitable data for their
project. This unit will empower you to do just that!
Our focus will be on understanding different data sources and how to effectively and
ethically scrape data from them. We will start by examining the various sources where data
can be obtained, such as websites, APIs, databases, social media platforms, and even PDF
documents.
Next, we will delve into how to identify the most suitable data source for your project, taking
into account factors like relevance, accessibility, data quality, data format, and legal and
ethical considerations.
Finally, we will dive into the structure of websites. Like a blueprint to a building,
understanding a website's structure helps you navigate efficiently and accurately to scrape
the data you need.
1.1 Learning Objectives
After studying this unit, you should be able to:
➢ Create a comprehensive plan for finding, evaluating, and scraping data from the most
suitable sources for your data analysis project.
So, are you ready to embark on this data sourcing adventure? Let's dive right in and explore
the world of data!
2. HOW DATA SCRAPING IS DONE
Web scraping typically follows a sequence of steps, from identifying the target page to
cleaning the extracted data:
1. Identify the target URL/website: The first step in web scraping is to determine the
URL of the webpage you want to scrape. This URL serves as the access point for the
scraping tool or script.
2. Inspect the Page: Before you can start scraping, you need to understand the
structure of the web page. This involves examining the page's HTML to identify the
tags that contain the data you're interested in.
3. Write the Code: Once you understand the page structure, you can write the script or
code that will extract the data. This code should instruct the scraper to visit the URL,
locate the desired data, and extract it.
4. Run the Scraper: After writing the code, the next step is to run the scraper. This will
initiate the data extraction process. Running the code involves the following steps.
a. Sending an HTTP request: When you run the code for web scraping, a
request is sent to the URL that you have mentioned. As a response to the
request, the server sends the data and allows you to read the HTML or XML
page.
b. Parsing the data: Once you have accessed the data, the next step is to parse
the data. Parsing refers to the process of analyzing the data syntactically. Web
scraping is all about parsing because once the data you need is in HTML
format, you must extract the data from it, which can only be done by parsing
the HTML document.
c. Web Data Extraction: After parsing the HTML document, you can extract the
data. Since the data on websites is unstructured, web scraping enables us to
convert the data into a structured form.
5. Store the Data: Once the data has been extracted, it's typically stored in a structured
format such as a CSV or Excel file, or a database, for further processing or analysis.
6. Clean and Analyze the Data: The final step involves cleaning the scraped data and
analyzing it to derive insights.
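Below is a minimal sketch of these steps in Python using the requests and BeautifulSoup
libraries. The URL, the <h2> tag, and the output filename are placeholders chosen for
illustration rather than details of any real website.

import csv

import requests
from bs4 import BeautifulSoup

# 1. The target URL (a placeholder for illustration)
url = "https://example.com/products"

# 4a. Send the HTTP request and check that it succeeded
response = requests.get(url, timeout=10)
response.raise_for_status()

# 4b. Parse the returned HTML
soup = BeautifulSoup(response.text, "html.parser")

# 4c. Extract the data -- here, hypothetical product names inside <h2> tags
names = [tag.get_text(strip=True) for tag in soup.find_all("h2")]

# 5. Store the data in a CSV file for later cleaning and analysis
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["product_name"])
    writer.writerows([[name] for name in names])

In practice, you would adapt the tag names and attributes to what you found while
inspecting the page in step 2.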
The choice of the programming language and tools you use for web scraping can vary widely,
depending on your specific needs and the complexity of the website you're scraping. Some
popular options include Python with libraries such as BeautifulSoup, Scrapy, or Selenium;
JavaScript with Node.js and libraries like Axios and Cheerio; and R with packages like rvest
or RCurl.
While the specifics can vary with the tools you choose, the basic process remains the same:
making a request to the server hosting the site, parsing the HTML response, and extracting
the data you need.
It's important to note that not all websites can be scraped, and not all data is accessible.
Some websites use JavaScript to load data, which can make it more difficult to scrape, and
some websites have measures in place to prevent or limit scraping. Additionally, it's
important to consider the legal and ethical implications of web scraping and to respect the
website's terms of service and the privacy of its users.
SELF-ASSESSMENT QUESTIONS - 1
1) Which of the following is NOT a step in the web scraping process?
a) Identify the URL
b) Inspect the page
c) Encrypt the data
d) Write the code
2) What is the final step in the web scraping process?
a) Run the scraper
b) Write the code
c) Store the data
d) Clean and analyze the data
3. SCRAPING DATA FROM DIFFERENT SOURCES
Data can be scraped from more than just ordinary web pages. Each kind of source has its
own access method and considerations.
➢ APIs: Many websites and services offer APIs that allow you to access their data in a
structured, reliable way. Scraping data from an API often requires registering for an
API key and making HTTP requests to the API's endpoints. The data returned from an
API is typically in a structured format like JSON or XML, making it easier to work
with than the unstructured HTML of a webpage (a minimal sketch appears after this
list).
➢ PDFs: PDFs are another source that can be used for data scraping, especially when
dealing with reports, academic papers, or other documents. Extracting data from
PDFs can be a bit more complex than scraping HTML or XML, but there are tools
available, like Tabula or PyPDF2 with Python, that can help extract text and tables
from PDF documents (a short sketch also follows this list).
➢ Databases: Web scrapers can also be used to extract data directly from databases.
This could involve downloading a database dump, connecting to a database server, or
using a database API.
Like web APIs, database APIs return data in a structured format, making it easier to
work with. However, accessing a database often requires permissions and
credentials, so this method is typically only possible if you have been granted access
to the database. This requires knowledge of SQL (Structured Query Language) or
other database query languages.
➢ Social Media Platforms: Many social media platforms provide APIs that allow you to
scrape data. This can include user profiles, posts, comments, likes, and more. This
data can be valuable for a variety of purposes, from market research to sentiment
analysis.
However, scraping social media data also raises significant ethical and privacy
concerns. It's important to respect users' privacy and the terms of service of the
platform. In some cases, you may need to anonymize the data to protect users'
privacy.
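To make the API and PDF cases above concrete, here are two minimal Python sketches. The
endpoint, API key, parameters, field names, and the PDF filename are all hypothetical; real
services document their own URLs, authentication schemes, and rate limits.

import requests

API_KEY = "your-api-key-here"                     # obtained by registering with the service
url = "https://api.example.com/v1/reviews"        # hypothetical endpoint

response = requests.get(
    url,
    params={"product_id": 42, "page": 1},         # hypothetical query parameters
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
response.raise_for_status()

data = response.json()                            # structured JSON, no HTML parsing needed
for review in data.get("results", []):            # "results", "rating", "text" are made-up field names
    print(review.get("rating"), review.get("text"))

And for pulling text out of a PDF with PyPDF2:

from PyPDF2 import PdfReader   # recent versions of PyPDF2 (also published as pypdf)

reader = PdfReader("report.pdf")                  # placeholder filename
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])                                 # first 500 characters of extracted text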
3.1 Considerations When Scraping Different Sources
Each of these sources comes with its own practical constraints. APIs often require
registration and may have rate limits or restrictions on what data you can access. Databases
require permissions and may also have restrictions on the volume of data you can access or
the queries you can run. Social media platforms have their own APIs, but these also come
with restrictions and privacy considerations.
In all cases, it's important to respect the source's terms of service and any legal or ethical
obligations you have when handling the data.
SELF-ASSESSMENT QUESTIONS - 2
3) Which of the following cannot be used as a source for data scraping?
a) HTML documents
b) APIs
c) Databases
d) Physical books
4) When scraping data from an API, what is typically involved?
a) Sending a request to the API's URL and parsing the response
b) Sending a request to the API's database and parsing the response
c) Sending a request to the API's HTML and parsing the response
d) None of the above
4. IDENTIFYING THE RIGHT DATA SOURCE
When choosing a source to scrape, several factors should guide your decision:
➢ Relevance: The source should contain the specific type of data you need for your analysis
or project. For instance, if you're analyzing customer reviews of a product, you might
choose to scrape data from e-commerce websites or social media platforms where the
product is discussed.
➢ Reliability: The source should be trustworthy and reliable. This typically means
choosing sources that are reputable and have a history of providing accurate data.
➢ Accessibility: The source should be accessible for scraping. Some websites have
measures in place to prevent or limit scraping, such as CAPTCHAs or limitations on the
number of requests that can be made from a single IP address. In some cases, the desired
data might be accessible through an API, which can be a more reliable and efficient way
to gather data.
➢ Legal and Ethical Considerations: Ensure that scraping the chosen source is both legal
and ethical. Some websites explicitly prohibit scraping in their terms of service, while
others may have legal protections on their data. Always respect privacy, copyright laws,
and terms of service when scraping data.
SELF-ASSESSMENT QUESTIONS - 3
5) Which of the following is NOT a factor to consider when identifying a source for
data scraping?
a) Relevance
b) Reliability
c) Website design
d) Legal and ethical considerations
6) Why is it important to consider legal and ethical considerations when choosing
a data source?
a) To avoid penalties and respect privacy rights
b) To ensure the data is relevant
c) To ensure the data is accessible
d) None of the above
5. UNDERSTANDING THE WEBSITE STRUCTURE FOR SCRAPING
To view the HTML of a webpage, you can use the "Inspect" or "View Page Source" option in
your web browser (Figure 2). This will display the page's HTML, allowing you to see how the
content is structured and identify the tags that contain the data you're interested in (Figure
3).
For example, text within <p> tags is paragraph text, links are typically contained within <a>
tags, and table data is often within <table>, <tr> (table row), and <td> (table data) tags.
Here is an overview of some of the key aspects of website structure you'll need to understand
for effective web scraping.
5.1 HTML
HyperText Markup Language (HTML) is the standard markup language for documents
designed to be displayed in a web browser. It's the backbone of any webpage and is what
you'll be interacting with when you're web scraping.
An HTML document consists of elements, represented by tags. Some of the most commonly
used HTML tags include:
➢ <div> and <span>: These tags are used to group other elements together.
➢ <table>, <tr>, <th>, and <td>: These tags are used for tables.
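The fragment below is a made-up example of how these tags nest, together with a short
BeautifulSoup snippet that walks them; it illustrates the structure only, not any real page.

from bs4 import BeautifulSoup

# A made-up fragment: a <div> grouping a paragraph and a small table
html = """
<div>
  <p>Quarterly sales figures</p>
  <table>
    <tr><th>Region</th><th>Sales</th></tr>
    <tr><td>North</td><td>1200</td></tr>
    <tr><td>South</td><td>950</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.p.get_text())                          # text inside the <p> tag

for row in soup.find_all("tr")[1:]:               # skip the <th> header row
    region, sales = [td.get_text() for td in row.find_all("td")]
    print(region, sales)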
5.2 CSS Selectors
CSS (Cascading Style Sheets) selectors are patterns used to select HTML elements; in web
scraping, they are used to pinpoint the elements you want to extract. Common types of
selectors include:
➢ Element selector: Selects elements based on the element name. For example, p
would select all paragraph elements.
➢ ID selector: Selects a specific element based on its ID. For example, #myID would
select the element with the ID "myID".
➢ Class selector: Selects elements based on their class. For example, .myClass would
select all elements with the class "myClass".
➢ Attribute selector: Selects elements based on an attribute or attribute value. For
example, [href] would select all elements with a "href" attribute, and
[href="https://www.example.com"] would select all elements with a "href" attribute
of "https://www.example.com".
5.3 JavaScript and Dynamic Content
Some websites use JavaScript to load or change content after the initial page is delivered, so
the data you want may not be present in the raw HTML returned by a simple HTTP request.
To scrape such dynamic content, you may need to use a tool like Selenium, Puppeteer, or
Pyppeteer that can interact with a web browser and execute JavaScript.
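A minimal sketch of that approach with Selenium is shown below. It assumes Chrome is
installed (recent versions of Selenium can manage the driver themselves), the URL is a
placeholder, and real pages may also need explicit waits until the JavaScript has finished
rendering.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                       # opens a real (automated) browser
try:
    driver.get("https://example.com/dynamic")     # placeholder URL; JavaScript runs in the browser
    html = driver.page_source                     # the HTML *after* scripts have executed
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text() if soup.title else "no title found")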
5.4 Sitemaps
A sitemap is a file where you provide information about the pages, videos, and other files on
your site, and the relationships between them. Webmasters use this data to inform search
engines about pages on their site that are available for crawling. For a web scraper, sitemaps
can be useful to identify and list all the URLs that need to be crawled, especially for large
websites.
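A sitemap is usually an XML file with one <loc> entry per URL. The sketch below fetches and
parses such a file with Python's standard library; the sitemap location is a placeholder, and
large sites may split their sitemaps into several index files.

import requests
import xml.etree.ElementTree as ET

resp = requests.get("https://example.com/sitemap.xml", timeout=10)   # placeholder location
resp.raise_for_status()

root = ET.fromstring(resp.content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}           # standard sitemap namespace
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(len(urls), "URLs listed in the sitemap")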
5.5 Robots.txt
The robots exclusion standard, also known as the robots exclusion protocol or simply
robots.txt, is a convention websites use to communicate with web crawlers and other web
robots. It lets the site owner indicate which areas of the website should not be processed or
scanned.
When planning a web scraping project, it's important to check a website's robots.txt file to
see which parts of the site the owner has asked bots not to crawl. It's generally considered
good web scraping etiquette to respect these wishes, although the robots.txt file is
technically more of a guideline than a hard and fast rule.
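Python's standard library includes a parser for robots.txt, which makes it easy to check a
URL before scraping it; the site and path below are placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")      # placeholder site
rp.read()

page = "https://example.com/products/page1.html"  # placeholder page to check
if rp.can_fetch("*", page):
    print("The robots.txt rules allow crawling this page")
else:
    print("The site owner has asked bots not to crawl this page")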
SELF-ASSESSMENT QUESTIONS - 4
7) Which of the following tags typically contains links in a webpage?
a) <p>
b) <a>
c) <tr>
d) <td>
8) How can you view the HTML of a webpage?
a) By right-clicking on the page and selecting "Inspect" or "View Page Source"
b) By copying the page's URL into a text editor
c) By saving the page as a PDF and opening it in a PDF viewer
d) None of the above
6. SUMMARY
In this unit we covered the important aspects of finding and identifying the right sources for
data scraping, an essential step in any data analysis project. We started by detailing the
general process of how data scraping is conducted. This involved sending a request to a
specific URL, parsing the response, extracting the required data, and storing it in a structured
format for further use.
We then dived into the various sources from which data can be scraped. We discussed the
potential for scraping data from websites, APIs, databases, social media platforms, and even
PDF documents. For each of these sources, we highlighted the unique considerations that
need to be taken into account.
In the third section, we explored the importance of identifying the right data source for your
project, keeping factors like relevance, accessibility, data quality, data format, and legal and
ethical considerations in mind.
In the final section, we took a deep dive into understanding the structure of websites to
facilitate efficient and accurate data scraping. We detailed the role of HTML and CSS
selectors, the challenge posed by JavaScript and dynamic content, and the usefulness of
sitemaps and robots.txt files.
Through this unit, students should have gained a deeper understanding of how to approach
the data gathering phase of their projects, making them well-equipped to find the right data
in the most efficient and ethical manner possible. The knowledge acquired will serve as a
foundation for subsequent units, where we will delve into the practical aspects of scraping
data from various types of websites and using different tools and libraries.
7. GLOSSARY
• Data Scraping/Web Scraping: The process of extracting data from websites. This
involves sending a request to a website's server, reading the HTML or XML page that's
returned, and parsing the page to extract the desired data.
• API (Application Programming Interface): An interface that allows software programs
to interact with each other. Many websites and services provide APIs, which allow for
structured and reliable access to their data.
• HTML (HyperText Markup Language): The standard language for creating web pages.
It uses tags to structure content on the page.
• CSS (Cascading Style Sheets): A style sheet language used for describing the look and
formatting of a document written in HTML.
• CSS Selector: Patterns used in CSS to select the elements to be styled. In web scraping,
CSS selectors are used to select the elements to be scraped.
• JavaScript: A programming language commonly used in web development to create
interactive effects within web browsers.
• Dynamic Content: Website content that changes based on the behavior, preferences,
and interest of the user. It's typically loaded or changed by JavaScript after the page loads.
• Sitemap: A file where information about the pages, videos, and other files on a site, and
the relationships between them, is provided. Sitemaps can help web scrapers to identify
and list all the URLs that need to be crawled.
• Robots.txt: A text file webmasters create to instruct web robots (typically search engine
robots) how to crawl pages on their website. Web scrapers should respect the
instructions in a website's robots.txt file.
• Data Source: The location, file, database, service, or user from which data originates.
• Data Quality: A measure of the condition of data based on factors such as accuracy,
completeness, consistency, reliability, and whether it's up-to-date.
• PDFs (Portable Document Format): A file format used to present and exchange
documents reliably, independent of software, hardware, or operating system. PDF files
can be a source of data for web scraping.
• Data Format: The particular way that data is structured or organized. Examples of data
formats include CSV, JSON, and XML.
• Data Parsing: The process of analyzing a string of symbols, either in natural language,
computer languages or data structures, conforming to the rules of a formal grammar.
• HTTP Request: A message sent by a client (user) to a server to retrieve information or
perform an operation on the server.
• Relevance: The degree to which data fits the purposes and needs of the data user. It's a
key factor to consider when identifying a source for data scraping.
8. TERMINAL QUESTIONS
1. Explain the process of data scraping.
2. Give examples of sources that can be used for data scraping.
3. What factors should you consider when identifying a source for data scraping?
4. Why is it important to understand a website's structure before scraping it?
5. What are APIs, and how are they used in data scraping?
6. What is the role of HTML in web scraping?
7. How can web scraping be used to gather data from social media platforms?
8. What measures do some websites use to prevent or limit scraping?
9. Give examples of different tags in HTML and explain what type of content they typically
contain.
10. How do you view the HTML of a webpage?
11. How do you store the data after scraping it?
12. What are the legal and ethical considerations in web scraping?
13. How can web scraping be used to gather data from PDFs?
14. What is the final step in the web scraping process?
15. Why is the reliability of a data source important in web scraping?
16. How can web scraping be used to gather data from databases?
17. What is the role of data cleaning in the web scraping process?
18. What is the relevance factor when identifying a data source for web scraping?
19. Give an example of how data scraped from the web can be used in a real-world
application.
20. What are the steps involved in scraping data from an API?
9. ANSWER KEYS
TERMINAL QUESTIONS
1. Refer Section 2
2. Refer Section 3
3. Refer Section 4
4. Refer Section 5
5. Refer Section 3
6. Refer Section 5
7. Refer Section 3
8. Refer Section 4
9. Refer Section 5
10. Refer Section 5
11. Refer Section 2
12. Refer Section 4
13. Refer Section 3
14. Refer Section 2
15. Refer Section 4
16. Refer Section 3
17. Refer Section 2
18. Refer Section 4
19. Open-ended question; can be inferred from all sections
20. Refer Section 3
10. SUGGESTED BOOKS AND E-REFERENCES
Books:
• Boopathi, Kabilan. "Web Scraping using Python (Guide for Beginners)." GeeksforGeeks,
2019.
• Mitchell, Ryan. "Web Scraping with Python: Collecting More Data from the Modern Web."
O'Reilly Media, 2018.
• Munzert, Simon, et al. "Automated Data Collection with R: A Practical Guide to Web
Scraping and Text Mining." John Wiley & Sons, 2015.
• Lawson, Richard. "Web Scraping with Python and BeautifulSoup." PythonHow, 2016.
• Russell, Matthew A. "Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn,
Google+, GitHub, and More." O'Reilly Media, 2013.