4 Parsing
1. Foundations:
[w1] Infrastructure and protocols of the web, sessions
[w2] Security basics
2. Search:
[w3-4] Web as a graph, graph search basics
[w5] Crawling basics
3. Visualization:
[w6-8] Simple interactive graphical informatics, Dimensionality reduction
4. Demand forecasting:
[w9] Simple regression models, Bass diffusion model
5. Customer acquisition:
[w10-12] Clustering, classification; [time permitting: Discrete choice models]
Topics and motivations
Some use-cases:
Data collected from a crawled web page may contain some information,
but usually this is unstructured
e.g.
Text in the <meta> tag describing the page contents
Text in the body of the page
Image and image caption on the page
Video and video caption on the page
Table of data on the page
We usually need to explicitly convert the collected data into structured data before
we can use it
Exceptions:
some applications of deep learning
Suppose we capture a date from a text field (e.g. in an email); then we need to parse it,
e.g. 2015-03-25, 03/25/2015, Mar 25 2015, 25 Mar 2015, etc.
Example: Gmail automatically scheduling events mentioned in emails to the calendar
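A minimal sketch of this kind of date parsing, trying a list of candidate formats in turn; the format list is only an illustrative assumption (real parsers, like Gmail's, handle far more variants):

```python
from datetime import datetime

# Candidate formats for the date strings listed above
# (illustrative, non-exhaustive).
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d %Y", "%d %b %Y"]

def parse_date(text):
    """Try each known format in turn; return None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

for s in ["2015-03-25", "03/25/2015", "Mar 25 2015", "25 Mar 2015"]:
    print(s, "->", parse_date(s))
```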
Data parsing concerns
- Accuracy
- Were there errors in the data that was recorded/parsed?
- Completeness
- Did we collect all the necessary data?
- Uniqueness
- Did we record the same data more than once?
- Timeliness
- Is the data correct at the time when we use it?
- Consistency
- Are there any inconsistencies in the data?
e.g. recall the role of keys in the relational model for databases
Data parsing problems
- Accuracy
Very difficult to compute/estimate, in particular if we parse natural language
[Figure omitted; source: All Things Linguistic]
Data parsing problems..
- Completeness
Is it possible to measure completeness?
- Timeliness
Often not easy to estimate whether data is up-to-date or obsolete
- Consistency
If we infer a structure for the data, how can we make relational connections
between different pieces of data?
Data quality metrics..
1. Conformance to schema
How to measure?
Freeze the data set at a point in time and measure the percentage of non-conforming data,
i.e. entries that violate [domain / foreign-key / uniqueness] constraints
2. Accuracy
How to measure?
Manual checks (by sampling) are expensive
Estimate instead via other indicators (e.g. user complaints, ...)
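A minimal sketch of the "freeze and measure" idea above, assuming a hypothetical snapshot with an id key, an age field, and illustrative domain/uniqueness constraints:

```python
import pandas as pd

# Hypothetical frozen snapshot of a parsed data set; columns are illustrative.
df = pd.DataFrame({"id": [1, 2, 2, 4], "age": [34, -5, 27, 180]})

# Domain constraint: age must lie in [0, 120].
violates_domain = ~df["age"].between(0, 120)
# Uniqueness constraint: 'id' acts as the primary key and must be unique.
violates_unique = df["id"].duplicated(keep=False)

# Percentage of non-conforming rows in the frozen snapshot.
non_conforming = (violates_domain | violates_unique).mean()
print(f"{non_conforming:.0%} of rows violate at least one constraint")
```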
Solution
The current approach to estimating data quality is "use-based":
try to measure "how useful is the data"
try to measure "does it lead to improvement in processes/operations"
These measures identify where to focus: the "pain points" and "gain points" of the data
Data integration
Crawling for data implies we need to integrate data coming from different
databases/data sets
Problems:
Data is heterogeneous
- Different data sets may use different primary keys
- Different data sets may use different data types
- Different data sets may use different data dictionaries (meanings of terms)
e.g. "customer" may be a person, or a company, …
- Different data sets may not be time-synchronized
Is the data from two sets referencing the same time window?
If not, how do we unify them?
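A minimal sketch of unifying two hypothetical data sets that differ in keys, data types, and date conventions (all table and column names are illustrative assumptions):

```python
import pandas as pd

# Two hypothetical sources: one keys customers by email, the other by an id,
# and they record dates in different formats.
sales = pd.DataFrame({"cust_email": ["a@x.com", "b@y.com"],
                      "date": ["2015-03-25", "2015-03-26"],
                      "amount": [10.0, 20.0]})
crm = pd.DataFrame({"customer_id": [101, 102],
                    "email": ["a@x.com", "b@y.com"],
                    "joined": ["03/20/2015", "03/22/2015"]})

# Unify data types and time representations before joining.
sales["date"] = pd.to_datetime(sales["date"])
crm["joined"] = pd.to_datetime(crm["joined"], format="%m/%d/%Y")

# Resolve the key mismatch by joining on the shared attribute (email).
merged = sales.merge(crm, left_on="cust_email", right_on="email", how="left")
print(merged[["customer_id", "date", "amount", "joined"]])
```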
The next stage in drawing meaningful conclusions from parsed data is to assign a
meaning to each data entry.
- Incorrect entries
- Examples?
Data semantics.. entity resolution problems: examples
[Figure: data integration pipeline — data from several individual databases and crawled web data is extracted, transformed, merged under a common schema, and purged into a compiled data warehouse, which feeds decision analytics and business intelligence]
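A minimal sketch of the merge/purge idea in the pipeline above: records from different sources that refer to the same entity are matched on a crude normalized key (the normalization rules and example records are illustrative assumptions):

```python
import re

records = [
    {"source": "db1",   "name": "ACME Corp.",  "phone": "(555) 010-2000"},
    {"source": "crawl", "name": "Acme Corp",   "phone": "555-010-2000"},
    {"source": "db2",   "name": "Widget Inc.", "phone": "555-010-3000"},
]

def normalize(rec):
    """Crude matching key: lower-cased name without punctuation + phone digits."""
    name = re.sub(r"[^a-z0-9 ]", "", rec["name"].lower()).strip()
    phone = re.sub(r"\D", "", rec["phone"])
    return (name, phone)

# Group records that resolve to the same entity.
merged = {}
for rec in records:
    merged.setdefault(normalize(rec), []).append(rec["source"])

for key, sources in merged.items():
    print(key, "<-", sources)
```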
Parsing: structured data (tables)
- The internet has many pages that contain one or more “tables” (> 14B tables)
- Some of these tables contain data drawn from a relational source (~154M high-quality relational tables)
Questions:
- What are effective methods for searching within such a collection of tables?
- Is there additional power that can be derived by analyzing such a huge corpus?
Source: Cafarella et al., WebTables: Exploring the Power of Tables on the Web, Proc. VLDB 2008
Parsing: structured data (tables)
Example of a table
Problems:
The big table is derived from a DB schema, which is not explicitly defined on the page
Note that the navigation bar is also an HTML table, but is not a relational schema
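A minimal sketch of separating layout tables (like the navigation bar above) from data tables, using rough heuristics in the spirit of WebTables' filtering step; the thresholds are illustrative assumptions, not the paper's actual classifier:

```python
def looks_relational(rows):
    """rows: list of lists of cell strings extracted from one HTML <table>.

    Very rough heuristics (illustrative only): a data table has several rows,
    more than one column, and a consistent number of columns per row.
    """
    if len(rows) < 2:
        return False                       # single-row tables are usually layout
    n_cols = len(rows[0])
    if n_cols < 2:
        return False                       # one-column tables are usually menus
    return all(len(r) == n_cols for r in rows)  # ragged tables are usually layout

nav_bar    = [["Home", "Products", "About", "Contact"]]
data_table = [["City", "Population"], ["Zurich", "415000"], ["Basel", "178000"]]

print(looks_relational(nav_bar))     # False
print(looks_relational(data_table))  # True
```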
Parsing: structured data (tables)
1. The rank of a web page (e.g. via PageRank) may not be the same as the rank of a schema
2. Tables lack "incoming links", which PageRank relies on
3. Tables/relations may or may not have anchor text (to use as a cue about their meaning)
4. Relations hold two-dimensional, connected data, which cannot be queried using the
usual search-engine inverted index
Parsing: structured data (tables)
Main idea:
1. Crawl all pages and collect all “tables”
2. Throw away those tables that are not true schemas (~99%)
3. Select those tables that can be identified as coming from schemas. How?
4. Identify the schemas, and the relations between schemas
5. Consolidate the information into a large, connected set of schemas
In each of steps 2-5, different ‘heuristics’ must be employed (read the WebTables
paper to get an idea of some of these), e.g.:
- using meta-data (information in the page about the table)
- inferring equivalent terms (e.g. tel-#, tel, (p), Phone-No, …) etc.
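A minimal sketch of the "equivalent terms" heuristic, mapping column headers from different tables onto canonical attribute names via a small synonym dictionary (the dictionary itself is an illustrative assumption):

```python
# Hypothetical attribute-synonym dictionary used to map column headers
# from different tables onto one canonical attribute name.
SYNONYMS = {
    "tel-#": "phone", "tel": "phone", "(p)": "phone", "phone-no": "phone",
    "e-mail": "email", "mail": "email",
}

def canonical(header):
    h = header.strip().lower()
    return SYNONYMS.get(h, h)

schema_a = ["Name", "Tel-#", "E-mail"]
schema_b = ["name", "Phone-No", "mail"]

# Two columns are considered equivalent if they map to the same canonical name.
print([canonical(h) for h in schema_a])  # ['name', 'phone', 'email']
print([canonical(h) for h in schema_b])  # ['name', 'phone', 'email']
```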
Parsing: structured data (tables)
Sources of error:
In step 1, we may miss some web pages altogether!
In steps 2-5, we need to use automatic identification heuristics, and each may
make some mistakes
How to measure these types of mistakes?
In the example (figure from Wikipedia): 5 of the 8 retrieved items are relevant,
and there are 12 relevant items in total, so
precision = 5/8 = 0.625
recall = 5/12 ≈ 0.417
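These numbers follow directly from the definitions of precision and recall; a minimal computation, assuming 8 retrieved items (5 of them relevant) and 12 relevant items overall:

```python
true_positives = 5   # relevant items that were retrieved
retrieved      = 8   # everything the heuristic returned
relevant       = 12  # everything it should have returned

precision = true_positives / retrieved   # 5/8  = 0.625
recall    = true_positives / relevant    # 5/12 ≈ 0.417
print(f"precision={precision:.3f}, recall={recall:.3f}")
```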
Brief summary
By using cues from a standard search engine, the text surrounding a table,
and various other schema-matching heuristics, it is possible to build tools (e.g.
WebTables) that find data in tables faster, or return results more relevant to a
search
This type of parsing also allows for automatically combining data from multiple
sources into a consolidated database inferred by the engine.
References and Further Reading
References:
State-of-the-art in NLP:
GPT-3: https://www.youtube.com/watch?v=jz78fSnBG0s