
DADS404: Data Scrapping Manipal University Jaipur (MUJ)

MASTER OF BUSINESS ADMINISTRATION


SEMESTER 4

DADS404
DATA SCRAPPING


Unit 2
Finding Data Across Sources
Table of Contents
1. Introduction
   1.1 Learning Objectives
2. How Data Scraping is Done (Figure 1; SAQ 1)
3. Scraping Data from Different Sources (Figure 2; SAQ 2)
   3.1 Considerations When Scraping Different Sources
4. Identifying the right Data Source (Figure 3; SAQ 3)
5. Understanding the Website Structure for Scraping (Figures 4-6; SAQ 4)
   5.1 HTML
   5.2 CSS Selectors
   5.3 JavaScript and Dynamic Content
   5.4 Sitemaps (Figure 7)
   5.5 Robots.txt
6. Summary
7. Glossary
8. Terminal Questions
9. Answer Keys
10. Suggested Books and E-References


1. INTRODUCTION

In our journey of understanding data scraping and wrangling, we've reached a critical
juncture: finding data across sources. Now, you may ask, why is this so important? Can't we
just take any data and get started? Well, here's the catch: data is like the ingredients in our
analytical cooking. The quality, relevance, and accessibility of our data directly affect the
outcome of our analysis.

Just as a chef chooses the freshest produce or a carpenter selects the right wood, a data
analyst or scientist must know how to find and identify the most suitable data for their
project. This unit will empower you to do just that!

Our focus will be on understanding different data sources and how to effectively and
ethically scrape data from them. We will start by examining the various sources where data
can be obtained, such as websites, APIs, databases, social media platforms, and even PDF
documents.

Next, we will delve into how to identify the most suitable data source for your project, taking
into account factors like relevance, reliability, accessibility, and legal and ethical
considerations.

Finally, we will dive into the structure of websites. Like a blueprint to a building,
understanding a website's structure helps you navigate efficiently and accurately to scrape
the data you need.

1.1 Learning Objectives

After studying this unit, you will be able to:


➢ Remember and list the various sources of data that can be used for data scraping.
➢ Understand the process of data scraping from websites and APIs.
➢ Apply the knowledge of identifying the right data source for a given project.
➢ Analyze the structure of a website to determine the best strategy for data extraction.
➢ Evaluate different data sources based on their relevance, reliability, accessibility, and legal
and ethical considerations.

➢ Create a comprehensive plan for finding, evaluating, and scraping data from the most
suitable sources for your data analysis project.

So, are you ready to embark on this data sourcing adventure? Let's dive right in and explore
the world of data!


2. HOW DATA SCRAPING IS DONE


Data scraping, also known as web scraping, involves extracting data from websites. It's a
powerful tool in data analysis, allowing users to gather large quantities of data quickly. While
the specific process may vary depending on the tool or library being used, the fundamental
steps involved in data scraping are generally the same:

Figure 1: Steps to perform Data Scraping

1. Identify the target URL/website: The first step in web scraping is to determine the
URL of the webpage you want to scrape. This URL serves as the access point for the
scraping tool or script.

2. Inspect the Page: Before you can start scraping, you need to understand the
structure of the web page. This involves examining the page's HTML to identify the
tags that contain the data you're interested in.

3. Write the Code: Once you understand the page structure, you can write the script or
code that will extract the data. This code should instruct the scraper to visit the URL,
locate the desired data, and extract it.


4. Run the Scraper: After writing the code, the next step is to run the scraper. This will
initiate the data extraction process. Running the code involves the following steps.

a. Sending an HTTP request: When you run the code for web scraping, a
request is sent to the URL that you have mentioned. As a response to the
request, the server sends the data and allows you to read the HTML or XML
page.

b. Parsing the data: Once you have accessed the data, the next step is to parse
it. Parsing refers to the process of analyzing the data syntactically. Parsing is
central to web scraping: the data you need arrives embedded in an HTML
document, and extracting it requires parsing that document into a structure
your code can navigate.

c. Web Data Extraction: After parsing the HTML document, you can extract the
data. Since the data on websites is unstructured, web scraping enables us to
convert the data into a structured form.

5. Store the Data: Once the data has been extracted, it's typically stored in a structured
format such as a CSV or Excel file, or a database, for further processing or analysis.

6. Clean and Analyze the Data: The final step involves cleaning the scraped data and
analyzing it to derive insights.
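To make the six steps above concrete, here is a minimal end-to-end sketch in Python using
the requests and BeautifulSoup libraries (both discussed below). The URL, the tag choices,
and the "price" class are placeholders invented for illustration; a real scraper would first
check the target site's terms of service and robots.txt.

import csv

import requests
from bs4 import BeautifulSoup

# Step 1: Identify the target URL (a placeholder for illustration).
URL = "https://www.example.com/products"

# Step 4a: Send an HTTP request and read the HTML response.
response = requests.get(URL, timeout=10)
response.raise_for_status()

# Step 4b: Parse the HTML document.
soup = BeautifulSoup(response.text, "html.parser")

# Steps 2 and 4c: Based on inspecting the page, extract the desired
# elements ("h2" and the "price" class are hypothetical choices).
rows = []
for heading in soup.find_all("h2"):
    price = heading.find_next("span", class_="price")
    rows.append([heading.text.strip(), price.text.strip() if price else ""])

# Step 5: Store the data in a structured format (CSV).
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["product", "price"])
    writer.writerows(rows)

# Step 6: Cleaning and analysis would follow, e.g. with pandas.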

The choice of the programming language and tools you use for web scraping can vary widely,
depending on your specific needs and the complexity of the website you're scraping. Some
popular options include Python with libraries such as BeautifulSoup, Scrapy, or Selenium;
JavaScript with Node.js and libraries like Axios and Cheerio; and R with packages like rvest
or RCurl.

While the specifics of the process can vary, the basic process remains the same. It involves
making a request to the server hosting the site, parsing the HTML response, and extracting
the data you need.

It's important to note that not all websites can be scraped, and not all data is accessible.
Some websites use JavaScript to load data, which can make it more difficult to scrape, and
some websites have measures in place to prevent or limit scraping. Additionally, it's
important to consider the legal and ethical implications of web scraping and to respect the
website's terms of service and the privacy of its users.

SELF-ASSESSMENT QUESTIONS - 1
1) Which of the following is NOT a step in the web scraping process?
a) Identify the URL
b) Inspect the page
c) Encrypt the data
d) Write the code
2) What is the final step in the web scraping process?
a) Run the scraper
b) Write the code
c) Store the data
d) Clean and analyze the data


3. SCRAPING DATA FROM DIFFERENT SOURCES


Data/Web scraping can be used to extract data from various types of sources. Here are a few
examples:

Figure 2: Sources for Data Scraping


➢ Websites: The most common source for data scraping is public websites. This can
include anything from e-commerce sites, news websites, forums, and more. The
diversity and volume of data available on websites make them a rich source of
information for a variety of purposes.
When scraping websites, it's essential to respect the site's terms of service. Some sites
explicitly forbid scraping, while others may limit how much data you can access or
how often you can make requests.
➢ HTML and XML Documents: Web pages are typically structured using HTML
(HyperText Markup Language), and web scraping involves parsing this HTML code to
extract the desired data. XML (eXtensible Markup Language) is another markup
language that can be scraped in a similar way.
➢ APIs: Another common source for data scraping is APIs or Application Programming
Interfaces. APIs are interfaces that software programs use to interact with each other.


Many websites and services offer APIs that allow you to access their data in a
structured, reliable way.
Scraping data from an API often requires registering for an API key and making HTTP
requests to the API's endpoints. The data returned from an API is typically in a
structured format like JSON or XML, making it easier to work with than the
unstructured HTML of a webpage (a minimal sketch follows this list).
➢ PDFs: PDFs are another source that can be used for data scraping, especially when
dealing with reports, academic papers, or other documents. Extracting data from
PDFs can be a bit more complex than scraping HTML or XML, but there are tools
available, like Tabula or PyPDF2 with Python, that can help extract text and tables
from PDF documents.
➢ Databases: Web scrapers can also be used to extract data directly from databases.
This could involve downloading a database dump, connecting to a database server, or
using a database API.
Like web APIs, database APIs return data in a structured format, making it easier to
work with. However, accessing a database often requires permissions and
credentials, so this method is typically only possible if you have been granted access
to the database. This requires knowledge of SQL (Structured Query Language) or
other database query languages.
➢ Social Media Platforms: Many social media platforms provide APIs that allow you to
scrape data. This can include user profiles, posts, comments, likes, and more. This
data can be valuable for a variety of purposes, from market research to sentiment
analysis.
However, scraping social media data also raises significant ethical and privacy
concerns. It's important to respect users' privacy and the terms of service of the
platform. In some cases, you may need to anonymize the data to protect users'
privacy.
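As an illustration of the API route described above, here is a hedged sketch that requests
JSON from a REST-style endpoint. The URL, query parameters, and response field names are
all invented; most real APIs also require an API key sent as a header or query parameter.

import requests

# Hypothetical endpoint; real APIs usually require a key, for example
# headers = {"Authorization": "Bearer <YOUR_API_KEY>"}.
API_URL = "https://api.example.com/v1/reviews"
params = {"product_id": 42, "page": 1}

response = requests.get(API_URL, params=params, timeout=10)
response.raise_for_status()

data = response.json()  # structured JSON rather than raw HTML
for review in data.get("reviews", []):  # "reviews" is an assumed field name
    print(review.get("rating"), review.get("text"))

Because the response is already structured, no HTML parsing is needed, which is one reason
an official API is usually preferable to scraping the equivalent web pages.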

3.1 Considerations When Scraping Different Sources


Each source for data scraping comes with its own considerations. When scraping websites,
you need to deal with the structure of the HTML and any measures the site has in place to
prevent scraping. APIs often require registration and may have rate limits or restrictions on
what data you can access. Databases require permissions and may also have restrictions on
the volume of data you can access or the queries you can run. Social media platforms have
their own APIs, but these also come with restrictions and privacy considerations.

In all cases, it's important to respect the source's terms of service and any legal or ethical
obligations you have when handling the data.

SELF-ASSESSMENT QUESTIONS - 2
3) Which of the following cannot be used as a source for data scraping?
a) HTML documents
b) APIs
c) Databases
d) Physical books
4) When scraping data from an API, what is typically involved?
a) Sending a request to the API's URL and parsing the response
b) Sending a request to the API's database and parsing the response
c) Sending a request to the API's HTML and parsing the response
d) None of the above


4. IDENTIFYING THE RIGHT DATA SOURCE


When planning a data scraping project, one of the most critical steps is identifying the right
data sources. The right data source depends on your project's goals and requirements. Here
are some factors to consider when choosing a data source:

Figure 3: Right Data Source Identification

➢ Relevance: The source should contain the specific type of data you need for your analysis
or project. For instance, if you're analyzing customer reviews of a product, you might
choose to scrape data from e-commerce websites or social media platforms where the
product is discussed.
➢ Reliability: The source should be trustworthy and reliable. This typically means
choosing sources that are reputable and have a history of providing accurate data.
➢ Accessibility: The source should be accessible for scraping. Some websites have
measures in place to prevent or limit scraping, such as CAPTCHAs or limitations on the
number of requests that can be made from a single IP address. In some cases, the desired
data might be accessible through an API, which can be a more reliable and efficient way
to gather data.
➢ Legal and Ethical Considerations: Ensure that scraping the chosen source is both legal
and ethical. Some websites explicitly prohibit scraping in their terms of service, while
others may have legal protections on their data. Always respect privacy, copyright laws,
and terms of service when scraping data.


SELF-ASSESSMENT QUESTIONS - 3
5) Which of the following is NOT a factor to consider when identifying a source for
data scraping?
a) Relevance
b) Reliability
c) Website design
d) Legal and ethical considerations
6) Why is it important to consider legal and ethical considerations when choosing
a data source?
a) To avoid penalties and respect privacy rights
b) To ensure the data is relevant
c) To ensure the data is accessible
d) None of the above


5. UNDERSTANDING THE WEBSITE STRUCTURE FOR SCRAPING


Before scraping a website, it's crucial to understand its structure. Websites are built using
HTML, which structures the content of the webpage using various tags. Understanding these
tags and how they're used to structure content can help you target the specific data you want
to scrape.

To view the HTML of a webpage, you can use the "Inspect" or "View Page Source" option in
your web browser (Figure 4). This will display the page's HTML, allowing you to see how the
content is structured and identify the tags that contain the data you're interested in (Figure
5).

Figure 4: Steps to view the Page Source


Figure 5: Page Source on the Right of the screen

For example, text within <p> tags is paragraph text, links are typically contained within <a>
tags, and table data is often within <table>, <tr> (table row), and <td> (table data) tags.

Here is an overview of some of the key aspects of website structure you'll need to understand
for effective web scraping.


Figure 6: Understanding Website Structure

5.1 HTML
HyperText Markup Language (HTML) is the standard markup language for documents
designed to be displayed in a web browser. It's the backbone of any webpage and is what
you'll be interacting with when you're web scraping.

An HTML document consists of elements, represented by tags. Some of the most commonly
used HTML tags include:

➢ <html>: This tag is used to start and end an HTML document.
➢ <head>: This tag is used for metadata and information that isn't displayed on the
page, like the title of the page and links to CSS stylesheets.
➢ <body>: This tag contains the content of the web page that is displayed in the
browser.
➢ <h1> to <h6>: These tags are used for headings, with <h1> being the highest level
and <h6> the lowest.
➢ <p>: This tag is used for paragraphs of text.
➢ <a>: This tag is used for hyperlinks.
➢ <div> and <span>: These tags group other elements together; <div> is a block-level
container, while <span> is an inline one.
➢ <table>, <tr>, <th>, and <td>: These tags are used for tables: the table itself, its
rows, header cells, and data cells.
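To make these tags concrete, here is a small, self-contained sketch in Python with
BeautifulSoup; the HTML snippet and its contents are invented for illustration.

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A made-up HTML document using the tags described above.
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Product Reviews</h1>
    <p>Latest reviews from our users.</p>
    <a href="https://www.example.com/reviews">All reviews</a>
    <table>
      <tr><th>Product</th><th>Rating</th></tr>
      <tr><td>Widget</td><td>4.5</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)              # Product Reviews
print(soup.a["href"])            # https://www.example.com/reviews
for row in soup.find_all("tr"):  # iterate over the table rows
    print([cell.text for cell in row.find_all(["th", "td"])])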

5.2 CSS Selectors


Cascading Style Sheets (CSS) is a style sheet language used for describing the look and
formatting of a document written in HTML. CSS selectors are patterns used to select the
elements you want to style. When you're web scraping, you can use CSS selectors to select
the elements you want to scrape.

There are several types of CSS selectors, including:

➢ Element selector: Selects elements based on the element name. For example, p
would select all paragraph elements.
➢ ID selector: Selects a specific element based on its ID. For example, #myID would
select the element with the ID "myID".
➢ Class selector: Selects elements based on their class. For example, .myClass would
select all elements with the class "myClass".
➢ Attribute selector: Selects elements based on an attribute or attribute value. For
example, [href] would select all elements with a "href" attribute, and
[href="https://www.example.com"] would select all elements with a "href" attribute
of "https://www.example.com".

5.3 JavaScript and Dynamic Content


Many modern websites use JavaScript to load or display content dynamically. This can
complicate web scraping, as the content you're trying to scrape might not be in the HTML
when you first load the page. Instead, it might be loaded or changed by JavaScript after the
page loads.

To scrape dynamic content, you may need to use a tool like Selenium, Puppeteer, or
Pyppeteer that can interact with a web browser and execute JavaScript.
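For instance, here is a minimal Selenium sketch, assuming Selenium 4 and a locally installed
Chrome browser; the URL and the "div.price" selector are placeholders for whatever element
the site renders with JavaScript.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # requires Chrome; Selenium 4 manages the driver
try:
    driver.get("https://www.example.com")  # placeholder URL

    # Wait until JavaScript has rendered the element we care about
    # ("div.price" is a hypothetical selector).
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.price"))
    )
    print(element.text)
finally:
    driver.quit()  # always release the browser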

5.4 Sitemaps
A sitemap is a file where you provide information about the pages, videos, and other files on
your site, and the relationships between them. Webmasters use this data to inform search

engines about pages on their site that are available for crawling. For a web scraper, sitemaps
can be useful to identify and list all the URLs that need to be crawled, especially for large
websites.

Figure 7: Sitemap of Google.com
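Because a sitemap is plain XML, its URLs can be collected with Python's standard library
alone. The sketch below assumes the site publishes a conventional /sitemap.xml at the
placeholder address shown; large sites often split this into several linked sitemap files.

import urllib.request
import xml.etree.ElementTree as ET

# Conventional location; the real path is often listed in robots.txt.
SITEMAP_URL = "https://www.example.com/sitemap.xml"

with urllib.request.urlopen(SITEMAP_URL) as response:
    root = ET.fromstring(response.read())

# Sitemap entries live in the standard sitemap XML namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(f"{len(urls)} URLs available for crawling")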

5.5 Robots.txt
The robots exclusion standard, also known as the robots exclusion protocol or simply
robots.txt, is a standard used by websites to communicate with web crawlers and other web
robots. The standard specifies how to inform the web robot about which areas of the website
should not be processed or scanned.

When planning a web scraping project, it's important to check a website's robots.txt file to
see which parts of the site the owner has asked bots not to crawl. It's generally considered
good web scraping etiquette to respect these wishes, although the robots.txt file is
technically more of a guideline than a hard and fast rule.
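Python's standard library ships a robots.txt parser, so honoring the file takes only a few
lines. This is a sketch with placeholder values: the site URL and the "MyScraperBot"
user-agent string are invented.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# can_fetch() reports whether the given user agent may crawl a URL.
if rp.can_fetch("MyScraperBot", "https://www.example.com/products/"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt; skip this path")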

Understanding the structure of a website is a crucial step in a web scraping project. By
understanding the underlying HTML, using CSS selectors, dealing with JavaScript and
dynamic content, and respecting sitemaps and robots.txt, you can ensure that your web
scraping project is more effective and respectful to the website owner.

SELF-ASSESSMENT QUESTIONS - 4
7) Which of the following tags typically contains links in a webpage?
a) <p>
b) <a>
c) <tr>
d) <td>
8) How can you view the HTML of a webpage?
a) By right-clicking on the page and selecting "Inspect" or "View Page Source"
b) By copying the page's URL into a text editor
c) By saving the page as a PDF and opening it in a PDF viewer
d) None of the above


6. SUMMARY
In this unit, we covered the important aspects of finding and identifying the right sources for
data scraping, an essential step in any data analysis project. We started by detailing the
general process of how data scraping is conducted. This involved sending a request to a
specific URL, parsing the response, extracting the required data, and storing it in a structured
format for further use.

We then dived into the various sources from which data can be scraped. We discussed the
potential for scraping data from websites, APIs, databases, social media platforms, and even
PDF documents. For each of these sources, we highlighted the unique considerations that
need to be taken into account.

In the third section, we explored the importance of identifying the right data source for your
project, keeping factors like relevance, reliability, accessibility, and legal and ethical
considerations in mind.

In the final section, we took a deep dive into understanding the structure of websites to
facilitate efficient and accurate data scraping. We detailed the role of HTML and CSS
selectors, the challenge posed by JavaScript and dynamic content, and the usefulness of
sitemaps and robots.txt files.

Through this unit, students should have gained a deeper understanding of how to approach
the data gathering phase of their projects, making them well-equipped to find the right data
in the most efficient and ethical manner possible. The knowledge acquired will serve as a
foundation for subsequent units, where we will delve into the practical aspects of scraping
data from various types of websites and using different tools and libraries.


7. GLOSSARY
• Data Scraping/Web Scraping: The process of extracting data from websites. This
involves sending a request to a website's server, reading the HTML or XML page that's
returned, and parsing the page to extract the desired data.
• API (Application Programming Interface): An interface that allows software programs
to interact with each other. Many websites and services provide APIs, which allow for
structured and reliable access to their data.
• HTML (HyperText Markup Language): The standard language for creating web pages.
It uses tags to structure content on the page.
• CSS (Cascading Style Sheets): A style sheet language used for describing the look and
formatting of a document written in HTML.
• CSS Selector: Patterns used in CSS to select the elements to be styled. In web scraping,
CSS selectors are used to select the elements to be scraped.
• JavaScript: A programming language commonly used in web development to create
interactive effects within web browsers.
• Dynamic Content: Website content that changes based on the behavior, preferences,
and interest of the user. It's typically loaded or changed by JavaScript after the page loads.
• Sitemap: A file where information about the pages, videos, and other files on a site, and
the relationships between them, is provided. Sitemaps can help web scrapers to identify
and list all the URLs that need to be crawled.
• Robots.txt: A text file webmasters create to instruct web robots (typically search engine
robots) how to crawl pages on their website. Web scrapers should respect the
instructions in a website's robots.txt file.
• Data Source: The location, file, database, service, or user from which data originates.
• Data Quality: A measure of the condition of data based on factors such as accuracy,
completeness, consistency, reliability, and whether it's up-to-date.
• PDFs (Portable Document Format): A file format used to present and exchange
documents reliably, independent of software, hardware, or operating system. PDF files
can be a source of data for web scraping.
• Data Format: The particular way that data is structured or organized. Examples of data
formats include CSV, JSON, and XML.


• Data Parsing: The process of analyzing a string of symbols, either in natural language,
computer languages or data structures, conforming to the rules of a formal grammar.
• HTTP Request: A message sent by a client (user) to a server to retrieve information or
perform an operation on the server.
• Relevance: The degree to which data fits the purposes and needs of the data user. It's a
key factor to consider when identifying a source for data scraping.

8. TERMINAL QUESTIONS
1. Explain the process of data scraping.
2. Give examples of sources that can be used for data scraping.
3. What factors should you consider when identifying a source for data scraping?
4. Why is it important to understand a website's structure before scraping it?
5. What are APIs, and how are they used in data scraping?
6. What is the role of HTML in web scraping?
7. How can web scraping be used to gather data from social media platforms?
8. What measures do some websites use to prevent or limit scraping?
9. Give examples of different tags in HTML and explain what type of content they typically
contain.
10. How do you view the HTML of a webpage?
11. How do you store the data after scraping it?
12. What are the legal and ethical considerations in web scraping?
13. How can web scraping be used to gather data from PDFs?
14. What is the final step in the web scraping process?
15. Why is the reliability of a data source important in web scraping?
16. How can web scraping be used to gather data from databases?
17. What is the role of data cleaning in the web scraping process?
18. What is the relevance factor when identifying a data source for web scraping?
19. Give an example of how data scraped from the web can be used in a real-world
application.
20. What are the steps involved in scraping data from an API?

9. ANSWERS

SELF-ASSESSMENT QUESTIONS

1. (c) Encrypt the data
2. (d) Clean and analyze the data
3. (d) Physical books
4. (a) Sending a request to the API's URL and parsing the response
5. (c) Website design
6. (a) To avoid penalties and respect privacy rights
7. (b) <a>
8. (a) By right-clicking on the page and selecting "Inspect" or "View Page Source"

TERMINAL QUESTIONS
1. Refer Section 2
2. Refer Section 3
3. Refer Section 4
4. Refer Section 5
5. Refer Section 3
6. Refer Section 5
7. Refer Section 3


8. Refer Section 4
9. Refer Section 5
10. Refer Section 5
11. Refer Section 2
12. Refer Section 4
13. Refer Section 3
14. Refer Section 2
15. Refer Section 4
16. Refer Section 3
17. Refer Section 2
18. Refer Section 4
19. Open-ended question; can be inferred from all sections
20. Refer Section 3

10. SUGGESTED BOOKS AND E-REFERENCES

BOOKS:
• Boopathi, Kabilan. "Web Scraping using Python (Guide for Beginners)." GeeksforGeeks,
2019.
• Mitchell, Ryan. "Web Scraping with Python: Collecting More Data from the Modern Web."
O'Reilly Media, 2018.
• Munzert, Simon, et al. "Automated Data Collection with R: A Practical Guide to Web
Scraping and Text Mining." John Wiley & Sons, 2015.
• Lawson, Richard. "Web Scraping with Python and BeautifulSoup." PythonHow, 2016.
• Russell, Matthew A. "Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn,
Google+, GitHub, and More." O'Reilly Media, 2013.
