DADS404 Unit-02 - V1.1
DADS404
DATA SCRAPING
Unit 2
Finding Data Across Sources
Table of Contents
1 Introduction
   1.1 Learning Objectives
2 How Data Scraping is Done (SAQ 1)
3 Scraping Data from Different Sources (SAQ 2)
   3.1 Considerations When Scraping Different Sources
4 Identifying the Right Data Source (SAQ 3)
5 Understanding the Website Structure for Scraping (SAQ 4, 5, 6)
   5.1 HTML
   5.2 CSS Selectors
   5.3 JavaScript and Dynamic Content
   5.4 Sitemaps (SAQ 7)
   5.5 Robots.txt
6 Summary
7 Glossary
8 Terminal Questions
9 Answer Keys
10 Suggested Books and E-References
1. INTRODUCTION
In our journey of understanding data scraping and wrangling, we've reached a critical
junction: finding data across sources. Now, you may ask, why is this so important? Can't we
just take any data and get started? Well, here's the catch: data is like the ingredients for our
analytical cooking. The quality, relevance, and accessibility of our data directly affect the
outcome of our analysis.
Just as a chef chooses the freshest produce or a carpenter selects the right wood, a data
analyst or scientist must know how to find and identify the most suitable data for their
project. This unit will empower you to do just that!
Our focus will be on understanding different data sources and how to effectively and
ethically scrape data from them. We will start by examining the various sources where data
can be obtained, such as websites, APIs, databases, social media platforms, and even PDF
documents.
Next, we will delve into how to identify the most suitable data source for your project, taking
into account factors like relevance, accessibility, data quality, data format, and legal and
ethical considerations.
Finally, we will dive into the structure of websites. Like a blueprint to a building,
understanding a website's structure helps you navigate efficiently and accurately to scrape
the data you need.
1.1 Learning Objectives
After studying this unit, you should be able to:
➢ Create a comprehensive plan for finding, evaluating, and scraping data from the most
suitable sources for your data analysis project.
So, are you ready to embark on this data sourcing adventure? Let's dive right in and explore
the world of data!
2. HOW DATA SCRAPING IS DONE
Web scraping typically follows a sequence of steps, from identifying the target page to
cleaning the extracted data:
1. Identify the target URL/website: The first step in web scraping is to determine the
URL of the webpage you want to scrape. This URL serves as the access point for the
scraping tool or script.
2. Inspect the Page: Before you can start scraping, you need to understand the
structure of the web page. This involves examining the page's HTML to identify the
tags that contain the data you're interested in.
3. Write the Code: Once you understand the page structure, you can write the script or
code that will extract the data. This code should instruct the scraper to visit the URL,
locate the desired data, and extract it.
4. Run the Scraper: After writing the code, the next step is to run the scraper. This will
initiate the data extraction process. Running the code involves the following steps.
a. Sending an HTTP request: When you run the code for web scraping, a
request is sent to the URL that you have mentioned. As a response to the
request, the server sends the data and allows you to read the HTML or XML
page.
b. Parsing the data: Once you have accessed the data, the next step is to parse
the data. Parsing refers to the process of analyzing the data syntactically. Web
scraping is all about parsing because once the data you need is in HTML
format, you must extract the data from it, which can only be done by parsing
the HTML document.
c. Web Data Extraction: After parsing the HTML document, you can extract the
data. Since the data on websites is unstructured, web scraping enables us to
convert the data into a structured form.
5. Store the Data: Once the data has been extracted, it's typically stored in a structured
format such as a CSV or Excel file, or a database, for further processing or analysis.
6. Clean and Analyze the Data: The final step involves cleaning the scraped data and
analyzing it to derive insights.
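Below is a minimal sketch of these steps in Python using the requests and BeautifulSoup
libraries. The URL, the <h2> tag, and the output filename are placeholders chosen for
illustration rather than details of any real website.

import csv

import requests
from bs4 import BeautifulSoup

# 1. The target URL (a placeholder for illustration)
url = "https://example.com/products"

# 4a. Send the HTTP request and check that it succeeded
response = requests.get(url, timeout=10)
response.raise_for_status()

# 4b. Parse the returned HTML
soup = BeautifulSoup(response.text, "html.parser")

# 4c. Extract the data -- here, hypothetical product names inside <h2> tags
names = [tag.get_text(strip=True) for tag in soup.find_all("h2")]

# 5. Store the data in a CSV file for later cleaning and analysis
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["product_name"])
    writer.writerows([[name] for name in names])

In practice, you would adapt the tag names and attributes to what you found while
inspecting the page in step 2.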
The choice of the programming language and tools you use for web scraping can vary widely,
depending on your specific needs and the complexity of the website you're scraping. Some
popular options include Python with libraries such as BeautifulSoup, Scrapy, or Selenium;
JavaScript with Node.js and libraries like Axios and Cheerio; and R with packages like rvest
or RCurl.
While the specifics can vary with the tools you choose, the basic process remains the same:
making a request to the server hosting the site, parsing the HTML response, and extracting
the data you need.
It's important to note that not all websites can be scraped, and not all data is accessible.
Some websites use JavaScript to load data, which can make it more difficult to scrape, and
some websites have measures in place to prevent or limit scraping. Additionally, it's
important to consider the legal and ethical implications of web scraping and to respect the
website's terms of service and the privacy of its users.
SELF-ASSESSMENT QUESTIONS - 1
1) Which of the following is NOT a step in the web scraping process?
a) Identify the URL
b) Inspect the page
c) Encrypt the data
d) Write the code
2) What is the final step in the web scraping process?
a) Run the scraper
b) Write the code
c) Store the data
d) Clean and analyze the data
3. SCRAPING DATA FROM DIFFERENT SOURCES
Data can be scraped from more than just ordinary web pages. Each kind of source has its
own access method and considerations.
➢ APIs: Many websites and services offer APIs that allow you to access their data in a
structured, reliable way. Scraping data from an API often requires registering for an
API key and making HTTP requests to the API's endpoints. The data returned from an
API is typically in a structured format like JSON or XML, making it easier to work
with than the unstructured HTML of a webpage (a minimal sketch appears after this
list).
➢ PDFs: PDFs are another source that can be used for data scraping, especially when
dealing with reports, academic papers, or other documents. Extracting data from
PDFs can be a bit more complex than scraping HTML or XML, but there are tools
available, like Tabula or PyPDF2 with Python, that can help extract text and tables
from PDF documents (a short sketch also follows this list).
➢ Databases: Web scrapers can also be used to extract data directly from databases.
This could involve downloading a database dump, connecting to a database server, or
using a database API.
Like web APIs, database APIs return data in a structured format, making it easier to
work with. However, accessing a database often requires permissions and
credentials, so this method is typically only possible if you have been granted access
to the database. This requires knowledge of SQL (Structured Query Language) or
other database query languages.
➢ Social Media Platforms: Many social media platforms provide APIs that allow you to
scrape data. This can include user profiles, posts, comments, likes, and more. This
data can be valuable for a variety of purposes, from market research to sentiment
analysis.
However, scraping social media data also raises significant ethical and privacy
concerns. It's important to respect users' privacy and the terms of service of the
platform. In some cases, you may need to anonymize the data to protect users'
privacy.
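To make the API and PDF cases above concrete, here are two minimal Python sketches. The
endpoint, API key, parameters, field names, and the PDF filename are all hypothetical; real
services document their own URLs, authentication schemes, and rate limits.

import requests

API_KEY = "your-api-key-here"                     # obtained by registering with the service
url = "https://api.example.com/v1/reviews"        # hypothetical endpoint

response = requests.get(
    url,
    params={"product_id": 42, "page": 1},         # hypothetical query parameters
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
response.raise_for_status()

data = response.json()                            # structured JSON, no HTML parsing needed
for review in data.get("results", []):            # "results", "rating", "text" are made-up field names
    print(review.get("rating"), review.get("text"))

And for pulling text out of a PDF with PyPDF2:

from PyPDF2 import PdfReader   # recent versions of PyPDF2 (also published as pypdf)

reader = PdfReader("report.pdf")                  # placeholder filename
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])                                 # first 500 characters of extracted text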
3.1 Considerations When Scraping Different Sources
Each of these sources comes with its own practical constraints. APIs often require
registration and may have rate limits or restrictions on what data you can access. Databases
require permissions and may also have restrictions on the volume of data you can access or
the queries you can run. Social media platforms have their own APIs, but these also come
with restrictions and privacy considerations.
In all cases, it's important to respect the source's terms of service and any legal or ethical
obligations you have when handling the data.
SELF-ASSESSMENT QUESTIONS - 2
3) Which of the following cannot be used as a source for data scraping?
a) HTML documents
b) APIs
c) Databases
d) Physical books
4) When scraping data from an API, what is typically involved?
a) Sending a request to the API's URL and parsing the response
b) Sending a request to the API's database and parsing the response
c) Sending a request to the API's HTML and parsing the response
d) None of the above
4. IDENTIFYING THE RIGHT DATA SOURCE
When choosing a source to scrape, several factors should guide your decision:
➢ Relevance: The source should contain the specific type of data you need for your analysis
or project. For instance, if you're analyzing customer reviews of a product, you might
choose to scrape data from e-commerce websites or social media platforms where the
product is discussed.
➢ Reliability: The source should be trustworthy and reliable. This typically means
choosing sources that are reputable and have a history of providing accurate data.
➢ Accessibility: The source should be accessible for scraping. Some websites have
measures in place to prevent or limit scraping, such as CAPTCHAs or limitations on the
number of requests that can be made from a single IP address. In some cases, the desired
data might be accessible through an API, which can be a more reliable and efficient way
to gather data.
➢ Legal and Ethical Considerations: Ensure that scraping the chosen source is both legal
and ethical. Some websites explicitly prohibit scraping in their terms of service, while
others may have legal protections on their data. Always respect privacy, copyright laws,
and terms of service when scraping data.
SELF-ASSESSMENT QUESTIONS - 3
5) Which of the following is NOT a factor to consider when identifying a source for
data scraping?
a) Relevance
b) Reliability
c) Website design
d) Legal and ethical considerations
6) Why is it important to consider legal and ethical considerations when choosing
a data source?
a) To avoid penalties and respect privacy rights
b) To ensure the data is relevant
c) To ensure the data is accessible
d) None of the above
5. UNDERSTANDING THE WEBSITE STRUCTURE FOR SCRAPING
To view the HTML of a webpage, you can use the "Inspect" or "View Page Source" option in
your web browser (Figure 2). This will display the page's HTML, allowing you to see how the
content is structured and identify the tags that contain the data you're interested in (Figure
3).
For example, text within <p> tags is paragraph text, links are typically contained within <a>
tags, and table data is often within <table>, <tr> (table row), and <td> (table data) tags.
Here is an overview of some of the key aspects of website structure you'll need to understand
for effective web scraping.
5.1 HTML
HyperText Markup Language (HTML) is the standard markup language for documents
designed to be displayed in a web browser. It's the backbone of any webpage and is what
you'll be interacting with when you're web scraping.
An HTML document consists of elements, represented by tags. Some of the most commonly
used HTML tags include:
➢ <div> and <span>: These tags are used to group other elements together.
➢ <table>, <tr>, <th>, and <td>: These tags are used for tables.
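The fragment below is a made-up example of how these tags nest, together with a short
BeautifulSoup snippet that walks them; it illustrates the structure only, not any real page.

from bs4 import BeautifulSoup

# A made-up fragment: a <div> grouping a paragraph and a small table
html = """
<div>
  <p>Quarterly sales figures</p>
  <table>
    <tr><th>Region</th><th>Sales</th></tr>
    <tr><td>North</td><td>1200</td></tr>
    <tr><td>South</td><td>950</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.p.get_text())                          # text inside the <p> tag

for row in soup.find_all("tr")[1:]:               # skip the <th> header row
    region, sales = [td.get_text() for td in row.find_all("td")]
    print(region, sales)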
5.2 CSS Selectors
CSS (Cascading Style Sheets) selectors are patterns used to select HTML elements; in web
scraping, they are used to pinpoint the elements you want to extract. Common types of
selectors include:
➢ Element selector: Selects elements based on the element name. For example, p
would select all paragraph elements.
➢ ID selector: Selects a specific element based on its ID. For example, #myID would
select the element with the ID "myID".
➢ Class selector: Selects elements based on their class. For example, .myClass would
select all elements with the class "myClass".
➢ Attribute selector: Selects elements based on an attribute or attribute value. For
example, [href] would select all elements with a "href" attribute, and
[href="https://www.example.com"] would select all elements with a "href" attribute
of "https://www.example.com".
5.3 JavaScript and Dynamic Content
Some websites use JavaScript to load or change content after the initial page is delivered, so
the data you want may not be present in the raw HTML returned by a simple HTTP request.
To scrape such dynamic content, you may need to use a tool like Selenium, Puppeteer, or
Pyppeteer that can interact with a web browser and execute JavaScript.
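A minimal sketch of that approach with Selenium is shown below. It assumes Chrome is
installed (recent versions of Selenium can manage the driver themselves), the URL is a
placeholder, and real pages may also need explicit waits until the JavaScript has finished
rendering.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                       # opens a real (automated) browser
try:
    driver.get("https://example.com/dynamic")     # placeholder URL; JavaScript runs in the browser
    html = driver.page_source                     # the HTML *after* scripts have executed
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text() if soup.title else "no title found")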
5.4 Sitemaps
A sitemap is a file where you provide information about the pages, videos, and other files on
your site, and the relationships between them. Webmasters use this data to inform search
engines about pages on their site that are available for crawling. For a web scraper, sitemaps
can be useful to identify and list all the URLs that need to be crawled, especially for large
websites.
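A sitemap is usually an XML file with one <loc> entry per URL. The sketch below fetches and
parses such a file with Python's standard library; the sitemap location is a placeholder, and
large sites may split their sitemaps into several index files.

import requests
import xml.etree.ElementTree as ET

resp = requests.get("https://example.com/sitemap.xml", timeout=10)   # placeholder location
resp.raise_for_status()

root = ET.fromstring(resp.content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}           # standard sitemap namespace
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(len(urls), "URLs listed in the sitemap")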
5.5 Robots.txt
The robots exclusion standard, also known as the robots exclusion protocol or simply
robots.txt, is a convention websites use to communicate with web crawlers and other web
robots. It lets the site owner indicate which areas of the website should not be processed or
scanned.
When planning a web scraping project, it's important to check a website's robots.txt file to
see which parts of the site the owner has asked bots not to crawl. It's generally considered
good web scraping etiquette to respect these wishes, although the robots.txt file is
technically more of a guideline than a hard and fast rule.
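Python's standard library includes a parser for robots.txt, which makes it easy to check a
URL before scraping it; the site and path below are placeholders.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")      # placeholder site
rp.read()

page = "https://example.com/products/page1.html"  # placeholder page to check
if rp.can_fetch("*", page):
    print("The robots.txt rules allow crawling this page")
else:
    print("The site owner has asked bots not to crawl this page")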
SELF-ASSESSMENT QUESTIONS - 4
7) Which of the following tags typically contains links in a webpage?
a) <p>
b) <a>
c) <tr>
d) <td>
8) How can you view the HTML of a webpage?
a) By right-clicking on the page and selecting "Inspect" or "View Page Source"
b) By copying the page's URL into a text editor
c) By saving the page as a PDF and opening it in a PDF viewer
d) None of the above
6. SUMMARY
In this unit we covered the important aspects of finding and identifying the right sources for
data scraping, an essential step in any data analysis project. We started by detailing the
general process of how data scraping is conducted. This involved sending a request to a
specific URL, parsing the response, extracting the required data, and storing it in a structured
format for further use.
We then dived into the various sources from which data can be scraped. We discussed the
potential for scraping data from websites, APIs, databases, social media platforms, and even
PDF documents. For each of these sources, we highlighted the unique considerations that
need to be taken into account.
In the third section, we explored the importance of identifying the right data source for your
project, keeping factors like relevance, accessibility, data quality, data format, and legal and
ethical considerations in mind.
In the final section, we took a deep dive into understanding the structure of websites to
facilitate efficient and accurate data scraping. We detailed the role of HTML and CSS
selectors, the challenge posed by JavaScript and dynamic content, and the usefulness of
sitemaps and robots.txt files.
Through this unit, students should have gained a deeper understanding of how to approach
the data gathering phase of their projects, making them well-equipped to find the right data
in the most efficient and ethical manner possible. The knowledge acquired will serve as a
foundation for subsequent units, where we will delve into the practical aspects of scraping
data from various types of websites and using different tools and libraries.
7. GLOSSARY
• Data Scraping/Web Scraping: The process of extracting data from websites. This
involves sending a request to a website's server, reading the HTML or XML page that's
returned, and parsing the page to extract the desired data.
• API (Application Programming Interface): An interface that allows software programs
to interact with each other. Many websites and services provide APIs, which allow for
structured and reliable access to their data.
• HTML (HyperText Markup Language): The standard language for creating web pages.
It uses tags to structure content on the page.
• CSS (Cascading Style Sheets): A style sheet language used for describing the look and
formatting of a document written in HTML.
• CSS Selector: Patterns used in CSS to select the elements to be styled. In web scraping,
CSS selectors are used to select the elements to be scraped.
• JavaScript: A programming language commonly used in web development to create
interactive effects within web browsers.
• Dynamic Content: Website content that changes based on the behavior, preferences,
and interest of the user. It's typically loaded or changed by JavaScript after the page loads.
• Sitemap: A file where information about the pages, videos, and other files on a site, and
the relationships between them, is provided. Sitemaps can help web scrapers to identify
and list all the URLs that need to be crawled.
• Robots.txt: A text file webmasters create to instruct web robots (typically search engine
robots) how to crawl pages on their website. Web scrapers should respect the
instructions in a website's robots.txt file.
• Data Source: The location, file, database, service, or user from which data originates.
• Data Quality: A measure of the condition of data based on factors such as accuracy,
completeness, consistency, reliability, and whether it's up-to-date.
• PDFs (Portable Document Format): A file format used to present and exchange
documents reliably, independent of software, hardware, or operating system. PDF files
can be a source of data for web scraping.
• Data Format: The particular way that data is structured or organized. Examples of data
formats include CSV, JSON, and XML.
• Data Parsing: The process of analyzing a string of symbols, either in natural language,
computer languages or data structures, conforming to the rules of a formal grammar.
• HTTP Request: A message sent by a client (user) to a server to retrieve information or
perform an operation on the server.
• Relevance: The degree to which data fits the purposes and needs of the data user. It's a
key factor to consider when identifying a source for data scraping.
8. TERMINAL QUESTIONS
1. Explain the process of data scraping.
2. Give examples of sources that can be used for data scraping.
3. What factors should you consider when identifying a source for data scraping?
4. Why is it important to understand a website's structure before scraping it?
5. What are APIs, and how are they used in data scraping?
6. What is the role of HTML in web scraping?
7. How can web scraping be used to gather data from social media platforms?
8. What measures do some websites use to prevent or limit scraping?
9. Give examples of different tags in HTML and explain what type of content they typically
contain.
10. How do you view the HTML of a webpage?
11. How do you store the data after scraping it?
12. What are the legal and ethical considerations in web scraping?
13. How can web scraping be used to gather data from PDFs?
14. What is the final step in the web scraping process?
15. Why is the reliability of a data source important in web scraping?
16. How can web scraping be used to gather data from databases?
17. What is the role of data cleaning in the web scraping process?
18. What is the relevance factor when identifying a data source for web scraping?
19. Give an example of how data scraped from the web can be used in a real-world
application.
20. What are the steps involved in scraping data from an API?
9. ANSWER KEYS
TERMINAL QUESTIONS
1. Refer Section 2
2. Refer Section 3
3. Refer Section 4
4. Refer Section 5
5. Refer Section 3
6. Refer Section 5
7. Refer Section 3
8. Refer Section 4
9. Refer Section 5
10. Refer Section 5
11. Refer Section 2
12. Refer Section 4
13. Refer Section 3
14. Refer Section 2
15. Refer Section 4
16. Refer Section 3
17. Refer Section 2
18. Refer Section 4
19. Open-ended question; can be inferred from all sections
20. Refer Section 3
10. SUGGESTED BOOKS AND E-REFERENCES
Books:
• Boopathi, Kabilan. "Web Scraping using Python (Guide for Beginners)." GeeksforGeeks,
2019.
• Mitchell, Ryan. "Web Scraping with Python: Collecting More Data from the Modern Web."
O'Reilly Media, 2018.
• Munzert, Simon, et al. "Automated Data Collection with R: A Practical Guide to Web
Scraping and Text Mining." John Wiley & Sons, 2015.
• Lawson, Richard. "Web Scraping with Python and BeautifulSoup." PythonHow, 2016.
• Russell, Matthew A. "Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn,
Google+, GitHub, and More." O'Reilly Media, 2013.