L2_Data Acquisition

The lecture covers data acquisition methods, including various data sources such as relational databases, flat files, APIs, and web scraping techniques. It explains the processes of identifying, scraping, and importing data into repositories, as well as the differences between web scraping and web crawling. Additionally, it highlights the importance of data collection methods and tools used for web scraping and crawling.


Lecture 2: Data Acquisition

CS5481 Data Engineering


Instructor: Linqi Song
Outline

• 1. Data sources

• 2. Web scraping

• 3. From web scraping to web crawling

2
Data
What is Data?

• Facts, observations, perceptions from experiments
• Numbers, characters, symbols
• Multimedia data
3
Sources of data

• Relational databases
• Flat files and XML datasets
• APIs and web services

4
Relational databases
Typically, data stored in databases and data warehouses can be used as a source for
analysis. Organizations have internal applications to support them in managing:
• Day-to-day business activities
• Customer transactions
• Human resource activities
• Workflows

5
Using queries to extract data from relational
databases
SQL, or Structured Query Language, is a querying language used for extracting
information from relational databases. It offers simple commands to specify:
• What is to be retrieved from the database.
• The table from which it needs to be extracted.
• How to group records with matching values.
• The sequence in which the query results are displayed.
• The limit on the number of results returned by the query.
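
A minimal sketch of these clauses using Python's built-in sqlite3 module; the sales.db file and its orders(customer, amount) table are hypothetical:

```python
# Illustrates SELECT / FROM / GROUP BY / ORDER BY / LIMIT against a
# hypothetical SQLite database file.
import sqlite3

conn = sqlite3.connect("sales.db")               # hypothetical database file
cursor = conn.execute("""
    SELECT customer, SUM(amount) AS total        -- what to retrieve
    FROM orders                                  -- table to extract from
    GROUP BY customer                            -- group matching values
    ORDER BY total DESC                          -- sequence of the results
    LIMIT 10                                     -- cap on returned rows
""")
for row in cursor.fetchall():
    print(row)
conn.close()
```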

6
Using queries to extract data from non-relational
databases
Non-relational databases can be queried using SQL or SQL-like query tools.
Some non-relational databases come with their own querying tools, such as CQL for
Cassandra and Cypher for Neo4j.
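
For example, a minimal sketch of issuing CQL from Python with the DataStax cassandra-driver; the contact point, keyspace, and users table are assumptions:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # Cassandra contact point (assumed)
session = cluster.connect("demo")     # keyspace name (assumed)

# CQL is deliberately SQL-like.
rows = session.execute("SELECT user_id, name FROM users LIMIT 5")
for row in rows:
    print(row.user_id, row.name)

cluster.shutdown()
```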

7
Flat file and XML datasets
External to the organization, there are
other publicly and privately available
datasets.
Such data sets are typically made
available as flat files, spreadsheet files,
or XML documents.

8
Flat files
• Store data in plain text format
• Each line, or row, is one record
• Each value is separated by a delimiter
• All of the data in a flat file maps to a single table
• Most common flat file format is .CSV
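
A minimal sketch of reading a flat file into a single table with pandas; customers.csv and its columns are assumptions:

```python
import pandas as pd

# Each line is one record; values are separated by a delimiter (comma by default).
df = pd.read_csv("customers.csv")
print(df.head())                          # first few records of the table

# Other delimiters can be specified explicitly, e.g. tab-separated values.
df_tsv = pd.read_csv("customers.tsv", sep="\t")
```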

9
APIs and web services
APIs and Web Services typically listen for incoming requests, which can be
in the form of web requests from users or network requests from
applications, and return data in plain text, XML, HTML, JSON, or media
files.

10
Sources for gathering data – web
The web is a source of publicly available data that companies and individuals can
use for free or for commercial purposes.

• News websites, social networks


• Textbooks
• Government records
• Papers and articles for public consumption
11
Types of file formats (1)
• Delimited text file formats, or .CSV

Used to store data as text; each value is separated by a delimiter.

• Microsoft Excel Open .XML Spreadsheet, or .XLSX

• The HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser.

• Extensible Markup Language, or .XML

It is a markup language with set rules for encoding data.


12
Types of file formats (2)
• Portable Document Format, or .PDF

Can be viewed the same way on any device.

• JavaScript Object Notation, or .JSON

A text-based open standard designed for transmitting structured data over the web.
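
A minimal sketch with Python's built-in json module; the record below is made up:

```python
import json

record = {"name": "Ada", "skills": ["SQL", "Python"], "active": True}

text = json.dumps(record, indent=2)   # serialise to a JSON string
print(text)

parsed = json.loads(text)             # parse back into a Python dict
print(parsed["skills"][0])            # -> "SQL"
```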

13
Application Programming Interfaces (or APIs)

• Popularly used for extracting data from a variety of data sources.


• Are invoked from applications that require the data and access an endpoint
containing the data. Endpoints can include databases, web services, and data
marketplaces.
• Also used for data validation.
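
A minimal sketch of invoking an API endpoint from Python with the requests library; the URL, parameters, and response structure are hypothetical:

```python
import requests

response = requests.get(
    "https://api.example.com/v1/quotes",      # hypothetical endpoint
    params={"symbol": "ACME", "limit": 5},    # query parameters
    timeout=10,
)
response.raise_for_status()    # raise an error on 4xx/5xx responses

data = response.json()         # many APIs return JSON
print(data)
```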

14
API examples
Popular Examples of APIs

• Twitter and Facebook APIs – for customer sentiment analysis
• Stock market APIs – for trading and analysis
• Data lookup and validation APIs – for cleaning and correlating data

15
Data streams and feeds
Aggregating streams of data flowing from:
1. Instruments
2. IoT devices and applications
3. GPS data from cars
4. Computer programs
5. Websites
6. Social networks.
On the World Wide Web, a web feed is a data format used for providing users with frequently updated content.
16
Sources for gathering data – sensor data
Sensor data produced by wearable devices, smart buildings, smart cities,
smartphones, medical devices, even household appliances, is a widely
used source of data.

17
Data streams and feeds examples
Examples of uses:
• Surveillance and video feeds for threat detection.
• Sensor data feeds for monitoring industrial or farming machinery.
• Social media feeds for sentiment analysis.
• Stock and market tickers for financial trading.
• Retail transaction streams for predicting demand and supply chain
management.
• Web click feeds for monitoring web performance and improving design.
18
Data streams and feeds tools
Popular technologies used to process data streams include:

• Kafka
Apache Kafka is a distributed event store and stream-processing platform. It is an
open-source system developed by the Apache Software Foundation, written in Java
and Scala. The project aims to provide a unified, high-throughput, low-latency
platform for handling real-time data feeds.

• Storm
Apache Storm is a distributed stream processing computation framework written
predominantly in the Clojure programming language.
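
As a sketch, a producer/consumer round trip with the kafka-python client might look as follows; the broker address and topic name are assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish one message to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"device": "t-01", "temp": 21.5}')
producer.flush()

# Read messages back from the same topic.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",    # start from the oldest message
)
for message in consumer:
    print(message.value)
    break
```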

19
Sources for gathering data – data exchange
Data exchange is a source of third-party data that involves the voluntary sharing
of data between data providers and data consumers. Individuals, organizations,
and governments could be both data providers and data consumers.

• Data from business applications
• Sensor devices
• Social media activity
• Location data
• Consumer behavior data

Examples of data exchange platforms: Crunchbase, AWS Data Exchange, Lotame, Snowflake
20
Other sources for gathering data
• Surveys – gather information through questionnaires distributed to a select group of
people.

• Census – popularly used for gathering household data such as wealth and income of
population.

• Interviews – a source for gathering qualitative data such as the participant’s opinions
and experiences. Interviews can be telephonic, over the web, or face-to-face.

• Observation studies – include monitoring participants in a specific environment or
while performing a particular task.

21
Outline

• 1. Data sources

• 2. Web scraping

• 3. From web scraping to web crawling

22
Web scraping
• The construction of an agent to download, parse, and organize data from the web in an
automated manner
• Extract relevant data from unstructured sources on the Internet
• Also known as screen scraping, web harvesting, and web data extraction
• Can extract text, contact information, images, videos, product items, etc.

23
Web scraping examples
Popular examples of uses:
• Generating sales leads through public data sources; using weather information
to forecast, for example, soft drink sales.
• Collecting training and testing datasets for machine learning models.
• There might be an interesting table on a Wikipedia page (or pages) you
want to retrieve to perform some statistical analysis (see the sketch below).
• You might wish to get a listing of properties on a real-estate site to build
an appealing geo-visualization.
24
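
For the Wikipedia-table example above, a minimal sketch with pandas.read_html, which returns every table on a page as a DataFrame (the page used here is only an illustration):

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
tables = pd.read_html(url)           # one DataFrame per <table> on the page
print(len(tables), "tables found")
print(tables[0].head())              # inspect the first table
```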
Web scraping tools
Popular Web Scraping tools:
• BeautifulSoup
• Scrapy
• Pandas
• Selenium

25
The web speaks HTTP
• The 7 network layers of the Open Systems
Interconnection (OSI) model
• Web scraping mainly focuses on the
application layer: HTTP

26
HTTP
The core component in the exchange of messages on the WWW consists of a
HyperText Transfer Protocol (HTTP) request message to a web server, followed
by an HTTP response, which can be rendered by the browser.

27
The response message body is usually in HTML format.
Hypertext Markup Language (HTML)
• HTML is a standard markup language for creating web pages.
• HTML provides the building blocks to provide structure and formatting to
documents.
• Python's 'requests' library can get the HTML content from a webpage.
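
A minimal sketch of fetching a page's HTML over HTTP with requests; example.com is a placeholder URL:

```python
import requests

response = requests.get("https://example.com", timeout=10)
print(response.status_code)                 # e.g. 200 on success
print(response.headers["Content-Type"])     # usually text/html
html = response.text                        # the HTML body as a string
print(html[:200])                           # first few characters of the markup
```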

28
HTML format
• HTML’s building blocks are usually a series of tags that often come in pairs
(but not always).
• Commonly used tags

29
HTML parsing
• HTML parsing involves tokenization and
tree construction. HTML tokens include
start and end tags, as well as attribute
names and values. If the document is
well-formed, parsing it is straightforward
and faster. The parser parses tokenized
input into the document, building up the
document tree.
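
A minimal sketch of building and walking the document tree with BeautifulSoup; the small HTML document below is made up:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li><a href="/p/1">Widget</a></li>
    <li><a href="/p/2">Gadget</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")   # tokenise and build the tree
print(soup.h1.get_text())                   # -> "Products"
for link in soup.find_all("a"):             # traverse the tree for <a> tags
    print(link["href"], link.get_text())
```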

30
General web scraping procedure
• Identifying data for scraping
• Scraping the data
• Importing the data

31
Identifying data for scraping (1)
Importance of Identifying data for analysis:

• Identifying the right data is a very important step of the data analysis process.

• Done right, it will ensure that you are able to look at a problem from multiple
perspectives and that your findings are credible and reliable.

32
Identifying data for scraping (2)
Process for identifying data
Step 1: Determine the information you want to collect

The specific information you need

The possible sources for this data

33
Identifying data for scraping (3)
Process for identifying data
Step 2: Define a plan for collecting data

• Establish a timeframe for collecting data
• Decide how much data is sufficient for a credible analysis
• Define dependencies, risks, and a mitigation plan

34
Identifying data for scraping (4)
Process for identifying data
Step 3: Determine your data collection methods. The methods depend on:

• Sources of data
• Type of data
• Timeframe over which you need the data
• Volume of data

35
Web scraping
• Web Scraping: Extracting a large amount of specific data from online
sources

36
Importing data into data repositories
• Gathering data from data sources such as databases, the web, sensor
data, data exchanges, and several other sources leveraged for specific
data needs.
• Importing data into different types of data repositories.

37
Importing structured data
Importing data: data identified and gathered -> data repository
Specific data repositories are optimized for certain types of data.

Structured data
• Relational databases store structured data with a well-defined schema.
• Sources include data from OLTP systems, spreadsheets, online forms, sensors, network and web logs.
• Can be stored in a NoSQL database.
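
A minimal sketch of loading gathered structured data into a relational repository; the orders.csv file, table name, and SQLite target are assumptions:

```python
import sqlite3
import pandas as pd

df = pd.read_csv("orders.csv")            # identified and gathered data
conn = sqlite3.connect("warehouse.db")    # target repository (SQLite here)
df.to_sql("orders", conn, if_exists="replace", index=False)

# Verify the import with a simple query.
print(pd.read_sql("SELECT COUNT(*) AS n FROM orders", conn))
conn.close()
```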
38
Importing semi-structured data
Specific data repositories are optimized for certain types of data.

Semi-structured data
• Sources include emails, XML, zipped files, binary executables, and TCP/IP protocols.
• Can also be stored in NoSQL clusters.
• XML and JSON are commonly used for storing and exchanging semi-structured data.

39
Outline

• 1. Data sources

• 2. Web scraping

• 3. From web scraping to web crawling

40
From web scraping to web crawling
Web Crawling: Using tools to read, copy, and store the content of websites for
archiving or indexing purposes. Crawling usually deals with a network of webpages.

Web Scraping: Extracting a large amount of specific data, usually from a single
webpage or a single website.

41
Different use cases
Web Crawling:
• Generating search engine results.
• Monitoring SEO analytics.
• Performing website analysis.
• Performed only by large corporations.

Web Scrapers:
• Comparing prices.
• Stock market analysis.
• Managing brand reputation.
• Academic and scientific research.
• Used by small and large businesses.

42
Tools
Differences between Web Crawling and Web Scraping
Web Crawlers Web Scrapers

43
Web crawling process

A web crawler operates like a graph traversal/search algorithm.
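
A minimal sketch of a breadth-first crawler built on requests and BeautifulSoup; the seed URL, page limit, and same-domain restriction are illustrative assumptions:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed, max_pages=10):
    domain = urlparse(seed).netloc
    frontier = deque([seed])     # queue of URLs to visit (breadth-first)
    seen = {seed}                # avoid revisiting pages

    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue             # skip pages that fail to load
        print("visited:", url)
        links = BeautifulSoup(html, "html.parser").find_all("a", href=True)
        for link in links:
            nxt = urljoin(url, link["href"])          # resolve relative links
            if urlparse(nxt).netloc == domain and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)


crawl("https://example.com")
```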


44
What MUST any crawler do?
• Be robust: Be immune to spider traps and other malicious
behavior from web servers
• Be polite: Respect implicit and explicit politeness considerations.
• Explicit politeness: specifications from webmaster on what
portions of a site can be crawled – robots.txt
• Implicit politeness: even with no specification, avoid hitting
any site too often.

45
Robots.txt
• Protocol for giving spiders (“robots”) limited access to a website,
originally from 1994.
- www.robotstxt.org/robotstxt.html
• A website announces its requests about what can (and cannot) be crawled.
- For a server, create a file /robots.txt.
- This file specifies access restrictions.
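
A minimal sketch of checking these explicit politeness rules with Python's standard-library robots.txt parser; the URLs and user agent are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                      # download and parse the file

url = "https://example.com/private/page.html"
if rp.can_fetch("MyCrawler", url):             # check access for our user agent
    print("allowed to crawl:", url)
else:
    print("disallowed by robots.txt:", url)
```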

46
47
Appendix
1. https://www.coursera.org/learn/introduction-to-data-analytics/home/week/3
2. https://www.ics.uci.edu/~lopes/teaching/cs221W15/slides/WebCrawling.pdf
3. https://link.springer.com/content/pdf/10.1007/978-1-4842-3582-9.pdf

48
