4 Parsing
1. Foundations:
[w1] Infrastructure and protocols of the web, sessions
[w2] Security basics
2. Search:
[w3-4] Web as a graph, graph search basics
[w5] Crawling basics
3. Visualization:
[w6-8] Simple interactive graphical informatics, Dimensionality reduction
4. Demand forecasting:
[w9] Simple regression models, Bass diffusion model
5. Customer acquisition:
[w10-12] Clustering, classification; [time permitting: Discrete choice models]
Topics and motivations
Some use-cases:
Data collected from a crawled web page may contain some information,
but usually this is unstructured
e.g.
Text in the <meta> tag describing the page contents
Text in the body of the page
Image and image caption on the page
Video and video caption on the page
Table of data on the page
We usually need to explicitly convert the collected data into structured data before
we can use it
Exceptions:
some applications of deep learning
Suppose we capture a date from a text field (e.g. in an email); then we need to parse it,
e.g. 2015-03-25, 03/25/2015, Mar 25 2015, 25 Mar 2015, etc.
Example: Gmail automatically scheduling events mentioned in emails to the calendar
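A minimal sketch of this kind of date parsing, trying a list of candidate formats in turn; the format list is only an illustrative assumption (real parsers, like Gmail's, handle far more variants):

```python
from datetime import datetime

# Candidate formats for the date strings listed above
# (illustrative, non-exhaustive).
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d %Y", "%d %b %Y"]

def parse_date(text):
    """Try each known format in turn; return None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

for s in ["2015-03-25", "03/25/2015", "Mar 25 2015", "25 Mar 2015"]:
    print(s, "->", parse_date(s))
```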
Data parsing concerns
- Accuracy
- Were there errors in the data that was recorded/parsed?
- Completeness
- Did we collect all the necessary data?
- Uniqueness
- Did we record the same data more than once?
- Timeliness
- Is the data correct at the time when we use it?
- Consistency
- Are there any inconsistencies in the data?
e.g. recall the role of keys in the relational model for databases
Data parsing problems
- Accuracy
Very difficult to compute/estimate, in particular if we parse natural language
[Figure omitted; source: All Things Linguistic]
Data parsing problems..
- Completeness
Is it possible to measure completeness?
- Timeliness
Often not easy to estimate whether data is up-to-date or obsolete
- Consistency
If we infer a structure for the data, how can we make relational connections
between different pieces of data?
Data quality metrics..
1. Conformance to schema
How to measure?
Freeze the data set at a point in time and measure the percentage of non-conforming data,
i.e. entries that violate [domain / foreign-key / uniqueness] constraints
2. Accuracy
How to measure?
Manual checks (by sampling) are expensive
Estimate instead via other indicators (e.g. user complaints, ...)
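A minimal sketch of the "freeze and measure" idea above, assuming a hypothetical snapshot with an id key, an age field, and illustrative domain/uniqueness constraints:

```python
import pandas as pd

# Hypothetical frozen snapshot of a parsed data set; columns are illustrative.
df = pd.DataFrame({"id": [1, 2, 2, 4], "age": [34, -5, 27, 180]})

# Domain constraint: age must lie in [0, 120].
violates_domain = ~df["age"].between(0, 120)
# Uniqueness constraint: 'id' acts as the primary key and must be unique.
violates_unique = df["id"].duplicated(keep=False)

# Percentage of non-conforming rows in the frozen snapshot.
non_conforming = (violates_domain | violates_unique).mean()
print(f"{non_conforming:.0%} of rows violate at least one constraint")
```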
Solution
The current approach to estimating data quality is "use-based":
try to measure "how useful is the data"
try to measure "does it lead to improvement in processes/operations"
These measures identify where to focus: the "pain points" and "gain points" of the data
Data integration
Crawling for data implies we need to integrate data coming from different
databases/data sets
Problems:
Data is heterogeneous
- Different data sets may use different primary keys
- Different data sets may use different data types
- Different data sets may use different data dictionaries (meanings of terms)
e.g. "customer" may be a person, or a company, …
- Different data sets may not be time-synchronized
Is the data from two sets referencing the same time window?
If not, how do we unify them?
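A minimal sketch of unifying two hypothetical data sets that differ in keys, data types, and date conventions (all table and column names are illustrative assumptions):

```python
import pandas as pd

# Two hypothetical sources: one keys customers by email, the other by an id,
# and they record dates in different formats.
sales = pd.DataFrame({"cust_email": ["a@x.com", "b@y.com"],
                      "date": ["2015-03-25", "2015-03-26"],
                      "amount": [10.0, 20.0]})
crm = pd.DataFrame({"customer_id": [101, 102],
                    "email": ["a@x.com", "b@y.com"],
                    "joined": ["03/20/2015", "03/22/2015"]})

# Unify data types and time representations before joining.
sales["date"] = pd.to_datetime(sales["date"])
crm["joined"] = pd.to_datetime(crm["joined"], format="%m/%d/%Y")

# Resolve the key mismatch by joining on the shared attribute (email).
merged = sales.merge(crm, left_on="cust_email", right_on="email", how="left")
print(merged[["customer_id", "date", "amount", "joined"]])
```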
The next stage in drawing meaningful conclusions from parsed data is to assign a
meaning to each data entry.
- Incorrect entries
- Examples?
Data semantics.. entity resolution problems: examples
[Figure: data integration pipeline — data from several individual databases and crawled web data is extracted, transformed, merged under a common schema, and purged into a compiled data warehouse, which feeds decision analytics and business intelligence]
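A minimal sketch of the merge/purge idea in the pipeline above: records from different sources that refer to the same entity are matched on a crude normalized key (the normalization rules and example records are illustrative assumptions):

```python
import re

records = [
    {"source": "db1",   "name": "ACME Corp.",  "phone": "(555) 010-2000"},
    {"source": "crawl", "name": "Acme Corp",   "phone": "555-010-2000"},
    {"source": "db2",   "name": "Widget Inc.", "phone": "555-010-3000"},
]

def normalize(rec):
    """Crude matching key: lower-cased name without punctuation + phone digits."""
    name = re.sub(r"[^a-z0-9 ]", "", rec["name"].lower()).strip()
    phone = re.sub(r"\D", "", rec["phone"])
    return (name, phone)

# Group records that resolve to the same entity.
merged = {}
for rec in records:
    merged.setdefault(normalize(rec), []).append(rec["source"])

for key, sources in merged.items():
    print(key, "<-", sources)
```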
Parsing: structured data (tables)
- The internet has many pages that contain one or more “tables” (> 14B tables)
- Some of these tables contain data drawn from a relational source (~154M high-quality relational tables)
Questions:
- What are effective methods for searching within such a collection of tables?
- Is there additional power that can be derived by analyzing such a huge corpus?
Source: Cafarella et al., WebTables: Exploring the Power of Tables on the Web, Proc. VLDB 2008
Parsing: structured data (tables)
Example of a table
Problems:
The big table is derived from a DB schema, which is not explicitly defined on the page
Note that the navigation bar is also an HTML table, but is not a relational schema
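A minimal sketch of separating layout tables (like the navigation bar above) from data tables, using rough heuristics in the spirit of WebTables' filtering step; the thresholds are illustrative assumptions, not the paper's actual classifier:

```python
def looks_relational(rows):
    """rows: list of lists of cell strings extracted from one HTML <table>.

    Very rough heuristics (illustrative only): a data table has several rows,
    more than one column, and a consistent number of columns per row.
    """
    if len(rows) < 2:
        return False                       # single-row tables are usually layout
    n_cols = len(rows[0])
    if n_cols < 2:
        return False                       # one-column tables are usually menus
    return all(len(r) == n_cols for r in rows)  # ragged tables are usually layout

nav_bar    = [["Home", "Products", "About", "Contact"]]
data_table = [["City", "Population"], ["Zurich", "415000"], ["Basel", "178000"]]

print(looks_relational(nav_bar))     # False
print(looks_relational(data_table))  # True
```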
Parsing: structured data (tables)
1. The rank of a web page (e.g. via PageRank) may not be the same as the rank of a schema
2. Tables lack "incoming links", which PageRank relies on
3. Tables/relations may or may not have anchor text (to use as a cue about their meaning)
4. Relations hold two-dimensional, connected data, which cannot be queried using the
usual search-engine inverted index
Parsing: structured data (tables)
Main idea:
1. Crawl all pages and collect all “tables”
2. Throw away those tables that are not true schemas (~99%)
3. Select those tables that can be identified as coming from schemas. How?
4. Identify the schemas, and the relations between schemas
5. Consolidate the information into a large, connected set of schemas
In each of steps 2-5, different ‘heuristics’ must be employed (read the WebTables
paper to get an idea of some of these), e.g.:
- using meta-data (information in the page about the table)
- inferring equivalent terms (e.g. tel-#, tel, (p), Phone-No, …) etc.
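A minimal sketch of the "equivalent terms" heuristic, mapping column headers from different tables onto canonical attribute names via a small synonym dictionary (the dictionary itself is an illustrative assumption):

```python
# Hypothetical attribute-synonym dictionary used to map column headers
# from different tables onto one canonical attribute name.
SYNONYMS = {
    "tel-#": "phone", "tel": "phone", "(p)": "phone", "phone-no": "phone",
    "e-mail": "email", "mail": "email",
}

def canonical(header):
    h = header.strip().lower()
    return SYNONYMS.get(h, h)

schema_a = ["Name", "Tel-#", "E-mail"]
schema_b = ["name", "Phone-No", "mail"]

# Two columns are considered equivalent if they map to the same canonical name.
print([canonical(h) for h in schema_a])  # ['name', 'phone', 'email']
print([canonical(h) for h in schema_b])  # ['name', 'phone', 'email']
```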
Parsing: structured data (tables)
Sources of error:
In step 1, we may miss some web pages altogether!
In steps 2-5, we need to use automatic identification heuristics, and each may
make some mistakes
How to measure these types of mistakes?
In the example (figure from Wikipedia): 5 of the 8 retrieved items are relevant,
and there are 12 relevant items in total, so
precision = 5/8 = 0.625
recall = 5/12 ≈ 0.417
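These numbers follow directly from the definitions of precision and recall; a minimal computation, assuming 8 retrieved items (5 of them relevant) and 12 relevant items overall:

```python
true_positives = 5   # relevant items that were retrieved
retrieved      = 8   # everything the heuristic returned
relevant       = 12  # everything it should have returned

precision = true_positives / retrieved   # 5/8  = 0.625
recall    = true_positives / relevant    # 5/12 ≈ 0.417
print(f"precision={precision:.3f}, recall={recall:.3f}")
```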
Brief summary
By using cues from a standard search engine, the text surrounding a table,
and various other schema-matching heuristics, it is possible to build tools (e.g.
WebTables) that find data in tables faster, or return results more relevant to a
search
This type of parsing also allows for automatically combining data from multiple
sources into a consolidated database inferred by the engine.
References and Further Reading
References:
State-of-the-art in NLP:
GPT-3: https://www.youtube.com/watch?v=jz78fSnBG0s