L2_Data Acquisition
• 1. Data sources
• 2. Web scraping
Data
What is Data?
Sources of data
Relational databases
Typically, data stored in databases and data warehouses can be used as a source for
analysis. Organizations have internal applications that support them in managing:
• Day to day business activities
• Customer transactions
• Human resource activities
• Workflows
Using queries to extract data from relational
databases
SQL, or Structured Query Language, is a query language used for extracting
information from relational databases. It offers simple commands to specify:
• What is to be retrieved from the database.
• The table from which it is to be extracted.
• How records with matching values are grouped.
• The sequence in which query results are displayed.
• The maximum number of results the query may return.
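As a sketch of these commands, the snippet below builds a small in-memory SQLite table (table name and values are illustrative) and runs one query that touches each of the points above:

```python
import sqlite3

# In-memory database with a sample "orders" table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5), ("carol", 20.0)],
)

# SELECT ... FROM chooses what to retrieve and from which table;
# GROUP BY groups records with matching values; ORDER BY dictates the
# sequence of results; LIMIT caps the number of rows returned.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    LIMIT 2
    """
).fetchall()
print(rows)  # [('alice', 37.5), ('carol', 20.0)]
```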
Using queries to extract data from non-relational
databases
Non-relational databases can be queried using SQL or SQL-like query tools.
Some non-relational databases come with their own query tools, such as CQL for
Cassandra and Cypher for Neo4j.
Flat file and XML datasets
External to the organization, there are other publicly and privately available
datasets. Such datasets are typically made available as flat files, spreadsheet
files, or XML documents.
Flat files
• Store data in plain text format
• Each line, or row, is one record
• Each value is separated by a delimiter
• All of the data in a flat file maps to a single table
• Most common flat file format is .CSV
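A minimal sketch of reading such a flat file with Python's standard csv module; the file contents here are illustrative and stand in for a real .CSV on disk:

```python
import csv
import io

# A flat file: plain text, one record per line, comma as the delimiter,
# with all rows mapping to a single table.
flat = "name,age,city\nAda,36,London\nAlan,41,Cambridge\n"

# DictReader uses the first line as column names.
reader = csv.DictReader(io.StringIO(flat))
records = list(reader)
print(records[0]["name"], records[0]["city"])  # Ada London
```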
APIs and web services
APIs and Web Services typically listen for incoming requests, which can be
in the form of web requests from users or network requests from
applications, and return data in plain text, XML, HTML, JSON, or media
files.
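A sketch of consuming a JSON response: a real call would use an HTTP client against an actual endpoint (the commented URL is hypothetical); a hard-coded stand-in payload keeps the example self-contained:

```python
import json

# A real request would look roughly like:
#   response = requests.get("https://api.example.com/quotes")  # hypothetical URL
#   payload = response.json()
# Here we parse a hard-coded stand-in payload instead.
payload = json.loads('{"symbol": "ACME", "price": 101.25, "currency": "USD"}')
print(payload["symbol"], payload["price"])  # ACME 101.25
```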
Application Programming Interfaces (or APIs)
API examples
Popular examples of APIs:
• Twitter and Facebook APIs: for customer sentiment analysis.
• Stock market APIs: for trading and analysis.
• Data lookup and validation APIs: for cleaning and correlating data.
Data streams and feeds
Aggregating streams of data flowing from:
1. Instruments
2. IoT devices and applications
3. GPS data from cars
4. Computer programs
5. Websites
6. Social networks.
On the World Wide Web, a web feed is a data format used
for providing users with frequently updated content.
Sources for gathering data – sensor data
Sensor data produced by wearable devices, smart buildings, smart cities,
smartphones, medical devices, even household appliances, is a widely
used source of data.
Data streams and feeds examples
Examples of uses:
• Surveillance and video feeds for threat detection.
• Sensor data feeds for monitoring industrial or farming machinery.
• Social media feeds for sentiment analysis.
• Stock and market tickers for financial trading.
• Retail transaction streams for predicting demand and supply chain
management.
• Web click feeds for monitoring web performance and improving design.
Data streams and feeds tools
Popular technologies used to process data streams include:
• Kafka: Apache Kafka is a distributed event store and stream-processing
platform. It is an open-source system developed by the Apache Software
Foundation, written in Java and Scala. The project aims to provide a unified,
high-throughput, low-latency platform for handling real-time data feeds.
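Kafka itself requires a running broker, so the sketch below illustrates the idea of stream processing in plain Python instead: a moving average computed over a simulated sensor feed (the values are made up; a real pipeline would consume them from a Kafka topic):

```python
from collections import deque

def moving_average(stream, window=3):
    """Yield the average of the last `window` readings from a data stream."""
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        yield sum(buf) / len(buf)

# Simulated sensor feed standing in for a real-time data source.
feed = [10.0, 12.0, 11.0, 13.0]
print(list(moving_average(feed)))  # [10.0, 11.0, 11.0, 12.0]
```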
Sources for gathering data – data exchange
Data exchange is a source of third-party data that involves the voluntary sharing
of data between data providers and data consumers. Individuals, organizations,
and governments could be both data providers and data consumers.
• Census – popularly used for gathering household data such as wealth and income of
population.
• Interviews – a source for gathering qualitative data such as the participant’s opinions
and experiences. Interviews can be telephonic, over the web, or face-to-face.
Outline
• 1. Data sources
• 2. Web scraping
Web scraping
• The construction of an agent to download, parse, and organize data from the web in an
automated manner
• Extract relevant data from unstructured sources on the Internet
• Also known as screen scraping, web harvesting, and web data extraction
• Can extract text, contact information, images, videos, product items, etc.
Web scraping examples
Popular examples of uses:
• Generating sales leads through public data sources.
• Collecting weather information to forecast, for example, soft drink sales.
• Collecting training and testing datasets for machine learning models.
• Retrieving an interesting table from a Wikipedia page (or pages) to perform
some statistical analysis.
• Getting a listing of properties from a real-estate site to build an appealing
geo-visualization.
Web scraping tools
Popular Web Scraping tools:
• BeautifulSoup
• Scrapy
• Pandas
• Selenium
The web speaks HTTP
• The Open Systems Interconnection (OSI) model defines 7 network layers.
• Web scraping mainly focuses on the application layer: HTTP.
HTTP
At its core, the exchange of messages on the WWW consists of a HyperText
Transfer Protocol (HTTP) request message sent to a web server, followed by an
HTTP response, which can be rendered by the browser.
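As an illustration (with example.com as a placeholder host and path), a request/response exchange looks roughly like:

```text
GET /index.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html

<html> ... </html>
```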
The response message body is usually in HTML format
Hypertext Markup Language (HTML)
• HTML is a standard markup language for creating web pages.
• HTML provides the building blocks to provide structure and formatting to
documents.
• The Python ‘requests’ library can be used to fetch the HTML content of a webpage.
HTML format
• HTML’s building blocks are usually a series of tags that often come in pairs
(but not always).
• Commonly used tags include <html>, <head>, <body>, <div>, <p>, and <a>.
HTML parsing
• HTML parsing involves tokenization and tree construction. HTML tokens include
start and end tags, as well as attribute names and values. If the document is
well-formed, parsing it is straightforward and faster. The parser parses
tokenized input into the document, building up the document tree.
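A minimal sketch of the tokenization step using Python's standard html.parser module, logging start tags (with their attributes) and end tags as the parser encounters them:

```python
from html.parser import HTMLParser

class TokenLogger(HTMLParser):
    """Record start tags (with attributes) and end tags during tokenization."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

parser = TokenLogger()
parser.feed('<p class="note">Hello</p>')
print(parser.tokens)  # [('start', 'p', [('class', 'note')]), ('end', 'p')]
```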
General web scraping procedure
• Identifying data for scraping
• Scraping the data
• Importing the data
Identifying data for scraping (1)
Importance of Identifying data for analysis:
• Identifying the right data is a very important step of the data analysis process.
Identifying data for scraping (2)
Process for identifying data
Step 1: Determine the information you want to collect
Identifying data for scraping (3)
Process for identifying data
Step 2: Define a plan for collecting data
Identifying data for scraping (4)
Process for identifying data
Step 3: Determine your data collection methods. The methods depend on:
Web scraping
• Web Scraping: Extracting a large amount of specific data from online
sources
Importing data into data repositories
• Data is gathered from sources such as databases, the web, sensors, and data
exchanges, as well as several other sources leveraged for specific data needs.
• The gathered data is then imported into different types of data repositories.
Importing structured data
Importing data: identified and gathered data is loaded into a data repository.
Specific data repositories are optimized for certain types of data.
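A small sketch of this step, assuming a CSV flat file as the gathered data and SQLite as the target repository (all table names and values are illustrative):

```python
import csv
import io
import sqlite3

# Identified and gathered data: a CSV flat file (contents are illustrative).
csv_text = "city,population\nOslo,709000\nBergen,286000\n"
rows = [(r["city"], int(r["population"]))
        for r in csv.DictReader(io.StringIO(csv_text))]

# Import the rows into a relational repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
print(count)  # 2
```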
Outline
• 1. Data sources
• 2. Web scraping
From web scraping to web crawling
Web Crawling: Using tools to read, copy, and store the content of websites for
archiving or indexing purposes. Crawling usually deals with a network of webpages.
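A toy sketch of the crawl loop: a breadth-first traversal over an in-memory page graph that stands in for real HTTP fetches (the page names are hypothetical):

```python
from collections import deque

# An in-memory "web": each page maps to the pages it links to.
site = {
    "/home": ["/about", "/blog"],
    "/about": ["/home"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
}

def crawl(start):
    """Visit every page reachable from `start`, breadth-first, once each."""
    seen, frontier, order = {start}, deque([start]), []
    while frontier:
        page = frontier.popleft()
        order.append(page)  # a real crawler would fetch and store the page here
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("/home"))  # ['/home', '/about', '/blog', '/blog/post-1']
```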
Different use cases
Web Crawling:
• Generating search engine results.
• Monitoring SEO analytics.
• Performing website analysis.
• Performed only by large corporations.

Web Scrapers:
• Comparing prices.
• Stock market analysis.
• Managing brand reputation.
• Academic and scientific research.
• Used by small and large businesses.
Tools
Differences between Web Crawling and Web Scraping
Web Crawlers vs. Web Scrapers
Web crawling process
Robots.txt
• Protocol for giving spiders (“robots”) limited access to a website,
originally from 1994.
- www.robotstxt.org/robotstxt.html
• A website announces its requests about what can (and cannot) be crawled.
- For a server, create a file /robots.txt.
- This file specifies access restrictions.
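A sketch using Python's standard urllib.robotparser to check a robots.txt (the file contents and user-agent name below are made up) before crawling:

```python
import urllib.robotparser

# A sample robots.txt (contents are illustrative).
robots = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots.splitlines())

# A polite crawler checks can_fetch() before requesting each URL.
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("MyBot", "https://example.com/private/data.html"))  # False
```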