4 Parsing


IEDA 3302

1. Foundations:
[w1] Infrastructure and protocols of the web, sessions
[w2] Security basics

2. Search:
[w3-4] Web as a graph, graph search basics
[w5] Crawling basics

3. Visualization:
[w6-8] Simple interactive graphical informatics, Dimensionality reduction

4. Demand forecasting:
[w9] Simple regression models, Bass diffusion model

5. Customer acquisition:
[w10-12] Clustering, classification; [time permitting: Discrete choice models]
Topics and motivations

The different modes of data collection from the web


- structured data
- unstructured data

Some use-cases:

- Recommendation systems: how does Amazon determine which products to show you?
- Fraud detection: how does a bank determine if a credit-card transaction is potentially illegitimate?
- Future cash flow prediction: estimating the potential lifetime value of a customer's transactions can help predict a website's cash flow
- Customer feedback/review evaluation for improving the product or customer service
Structured vs Unstructured data

Data collected from a crawled web page may contain some information,
but usually this is unstructured

e.g.
Text in the <meta> tags describing the page contents
Text in the body of the page
Image and image caption on the page
Video and video caption on the page
Table of data on the page

Examples of structured data:


Much (but not all) of the data collected by a website via forms
Data about the interactions of a user (e.g. click on a particular item on a shopping site)
Transactions on a service website – e.g. a banking site
(in general, data that can be [directly] transferred into a [relational] database)
Processing data

We usually need to explicitly convert collected data into structured data before
we can use it

Exceptions:
some applications of deep learning

A simple example of parsing


Suppose we wish to capture a Date/Time field
- If reported by a computer, we can use a standard timestamp format (ISO 8601), e.g.
YYYY-MM-DDTHH:mm:ss.sssZ
where Z is the time zone offset in the format +/-hh:mm relative to GMT

Suppose we capture it from a text field (e.g. from an email); then we need to parse it,
e.g. 2015-03-25, 03/25/2015, Mar 25 2015, 25 Mar 2015, etc.
Example: Gmail's automatic scheduling of events mentioned in emails into the calendar
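The multi-format parsing above can be sketched in a few lines of Python. The format list covers only the slide's four example spellings; a real parser (e.g. the python-dateutil library) handles many more variants:

```python
from datetime import datetime

# Candidate formats for the examples on this slide (illustrative, not exhaustive).
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d %Y", "%d %b %Y"]

def parse_date(text):
    """Try each known format until one matches; return None otherwise."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

for s in ["2015-03-25", "03/25/2015", "Mar 25 2015", "25 Mar 2015"]:
    print(parse_date(s))  # each parses to 2015-03-25 00:00:00
```

Note the try-in-order design: ambiguous strings like 03/04/2015 resolve to whichever format is listed first, which is itself a data-quality decision.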
Data parsing concerns

How do we know the quality of the data that we collect?

- Accuracy
- Were there errors in the data that was recorded/parsed?
- Completeness
- Did we collect all the necessary data?
- Uniqueness
- Did we record the same data more than once?
- Timeliness
- Is the data correct at the time when we use it?
- Consistency
- Are there any inconsistencies in the data?
e.g. recall the role of keys in the relational model for Databases
Data parsing problems

- Accuracy
Very difficult to compute/estimate, in particular if we parse natural language

[Try]: Translate a Chinese passage from a book via Google Translate

A classic example: I saw the man on the hill with a telescope.

source: Allthingslinguistic
Data parsing problems..

- Completeness
Is it possible to measure completeness?

- Timeliness
Often not easy to estimate if data is "up-to-date" or obsolete

- Consistency
If we infer a structure to the data, how can we make relational connections
between different pieces of data?
Data quality metrics..

1. Conformance to schema
How to measure?
Freeze the data set at a point in time; measure the % of non-conforming data,
i.e. data that violates [Domain / Foreign Key / Uniqueness] constraints
2. Accuracy
How to measure?
Manual check (sampling) → expensive
Estimate via other indicators (e.g. user-complaints, ..)
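The "freeze and measure" idea for schema conformance can be sketched as below; the record layout, the age-domain bounds, and the uniqueness rule on `id` are all hypothetical:

```python
# Count records in a frozen snapshot that violate a domain constraint
# (age in [0, 130]) or a uniqueness constraint (id seen before).
def pct_nonconforming(records):
    seen_ids, bad = set(), 0
    for r in records:
        violates_domain = not (0 <= r["age"] <= 130)  # domain constraint
        violates_unique = r["id"] in seen_ids          # uniqueness constraint
        seen_ids.add(r["id"])
        if violates_domain or violates_unique:
            bad += 1
    return 100 * bad / len(records)

frozen = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},   # domain violation
    {"id": 2, "age": 41},   # duplicate key
]
print(f"{pct_nonconforming(frozen):.1f}% non-conforming")  # 66.7%
```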

Solution
Current approach towards estimation of data quality is "use-based"
try to measure "how useful is the data"
try to measure "does it lead to improvement in processes/operations"
→ These measures identify where to focus: the "pain-points" and "gain-points" of the data
Data integration

Crawling for data implies we need to integrate data coming from different
databases/data sets

Problems:
Data is heterogeneous
- Different data sets may use different primary keys
- Different data sets may use different data types
- Different data sets may use different data dictionary (meaning of terms)
e.g. "customer" may be person, or company, …
- Different data sets are not time synchronized
is data from two sets referencing the same time window?
If not, how to unify?
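The different-primary-key and different-data-type problems can be illustrated with a toy unification sketch (the source records, field names, and key formats are all hypothetical):

```python
# Two sources describe the same customer with different keys: one an
# integer, the other a zero-padded string. Normalize both to a canonical
# string id, then merge attributes under the unified key.
source_a = [{"cust_id": 17, "name": "Acme Ltd"}]          # int key
source_b = [{"customerId": "0017", "city": "Hong Kong"}]  # string key

def canonical_id(value):
    # Unify key types: strip zero padding, compare as plain strings.
    return str(int(value))

merged = {}
for rec in source_a:
    merged.setdefault(canonical_id(rec["cust_id"]), {})["name"] = rec["name"]
for rec in source_b:
    merged.setdefault(canonical_id(rec["customerId"]), {})["city"] = rec["city"]

print(merged)  # {'17': {'name': 'Acme Ltd', 'city': 'Hong Kong'}}
```

In practice the hard part is discovering that `cust_id` and `customerId` refer to the same entity at all; the merge itself is the easy step.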

Reading Assignment: the WebTables paper. It shows an approach to unify data from
different tables on different pages
Found 2.6M unique schemas from 154M distinct tables
Data semantics

The next stage of making meaningful conclusions from parsed data is to assign a
meaning to each data entry.

This has several challenges also:


- Duplicate entries
- Why is this a problem?
How to handle duplicate entries → several approaches (DeDup)
DeDup: how to identify duplicates? After identification: merge or purge?
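A toy merge/purge sketch: identify duplicates by a normalized key and merge their attributes. The entries and fields are illustrative; real entity resolution uses much richer similarity measures than exact match on a normalized name:

```python
# Normalize names (lowercase, collapse whitespace), then merge records
# that share the same normalized key. Later fields overwrite earlier ones.
def normalize(name):
    return " ".join(name.lower().split())

entries = [
    {"name": "John  Smith", "email": "js@example.com"},
    {"name": "john smith",  "phone": "555-0101"},
    {"name": "Mary Jones",  "email": "mj@example.com"},
]

deduped = {}
for e in entries:
    key = normalize(e["name"])
    deduped.setdefault(key, {}).update(e)  # merge rather than purge

print(len(deduped))  # 2 distinct entities remain
```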

- Distinct entities with similar attribute values


- Examples: people with same name, …

- Incorrect entries
- Examples?
Data semantics.. entity resolution problems: examples

Incorrect entries problem:

Source: Entity resolution tutorial, L. Getoor & A. Machanavajjhala, VLDB 2012


Data semantics.. entity resolution problems: examples

Distinct entity problem:

Source: Entity resolution tutorial, L. Getoor & A. Machanavajjhala, VLDB 2012


Concluding remarks: unstructured data
An overview of crawling and parsing

[Diagram: many individual databases and crawled web data feed an Extract → Transform → Merge/Purge pipeline; the compiled data, under a merged schema, lands in a data warehouse that supports decision analytics and business intelligence]
Parsing: structured data (tables)

Before finishing, we take a brief look at a special case: parsing of tables

- The internet has many pages that contain one or more “tables” (> 14B tables)
- Some of these tables report data taken from relational databases (> 154M such tables)

Questions:

- What are effective methods for searching within such a collection of tables?
- Is there additional power that can be derived by analyzing such a huge corpus?

Source: Cafarella et al., WebTables: Exploring the Power of Tables on the Web, VLDB 2008
Parsing: structured data (tables)

Data organized on a page using <table> but that is not a table

Data organized on a page using <table> and that is a table

Another example (a table, but not organized using <table>):
https://www.cbssports.com/nba/standings/

Example of a table
Problems:
The big table is derived from a DB schema, which is not explicitly defined on the page
Note that the navigation bar is also an HTML table, but is not a relational schema
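The table-vs-layout distinction above can be captured by a toy filtering heuristic in the spirit of WebTables' step of discarding non-relational tables. The thresholds here are illustrative, not the paper's actual criteria:

```python
# Treat an HTML <table> (represented as a list of rows of cell strings)
# as relational only if it has enough rows and a consistent column count.
def looks_relational(rows, min_rows=2, min_cols=2):
    if len(rows) < min_rows:
        return False
    widths = {len(r) for r in rows}
    return len(widths) == 1 and widths.pop() >= min_cols

nav_bar = [["Home", "Scores", "Standings", "Stats"]]  # one row: layout only
standings = [["Team", "W", "L"],
             ["Bucks", "58", "24"],
             ["Celtics", "57", "25"]]                 # uniform grid of data

print(looks_relational(nav_bar), looks_relational(standings))  # False True
```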
Parsing: structured data (tables)

Question: why not just use Google-style search?

1. The rank of a web page (e.g. computed using PageRank) may not be the same as the rank of a schema
2. Tables lack "incoming links", which PageRank relies on
3. Tables/relations may or may not have anchor text (to use as a cue about their meaning)
4. Relations hold 2-dimensional, connected data, which cannot be queried using a search engine's usual inverted index
Parsing: structured data (tables)

Main idea:
1. Crawl all pages and collect all “tables”
2. Throw away those tables that are not true schemas (~99%)
3. Select those tables that can be identified as coming from schemas. How?
4. Identify the schemas, and the relations between schemas
5. Consolidate the information into a large, connected set of schemas

In each of steps 2-5, different ‘heuristics’ must be employed (read the WebTables
paper to get an idea of some of these), e.g.:
- using meta-data (information in the page about the table)
- inferring equivalent terms (e.g. tel-#, tel, (p), Phone-No, …) etc.
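The equivalent-term heuristic can be sketched with a hand-written synonym table. The table below is hypothetical; WebTables derives such attribute correspondences from corpus statistics rather than a fixed list:

```python
# Map variant column labels to a canonical attribute name.
SYNONYMS = {
    "tel-#": "phone", "tel": "phone", "(p)": "phone", "phone-no": "phone",
    "e-mail": "email", "mail": "email",
}

def canonical_attr(label):
    key = label.strip().lower()
    return SYNONYMS.get(key, key)  # unknown labels pass through unchanged

print([canonical_attr(x) for x in ["Tel-#", "Phone-No", "E-mail", "Name"]])
# ['phone', 'phone', 'email', 'name']
```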
Parsing: structured data (tables)

Sources of error:
In step 1, we may miss some web-pages altogether!

In steps 2-5, we need to use some automatic identification heuristics, and each may
make some mistakes

How to measure these types of mistakes?

Standard metrics are: precision and recall

[Venn diagram: 8 items retrieved, 12 items relevant, 5 items in the intersection]

In the example:
precision = 5/8 = 0.625
recall = 5/12 ≈ 0.417

Other (absolute) metrics used are:
Type I errors (false positives) = 8 − 5 = 3
Type II errors (false negatives) = 12 − 5 = 7

Source: wikipedia
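The example's numbers can be reproduced directly (here 5 of the 8 retrieved items are among the 12 relevant ones):

```python
# Precision: fraction of retrieved items that are relevant.
# Recall: fraction of relevant items that were retrieved.
def precision_recall(true_pos, retrieved, relevant):
    return true_pos / retrieved, true_pos / relevant

p, r = precision_recall(true_pos=5, retrieved=8, relevant=12)
print(f"precision = {p:.3f}, recall = {r:.3f}")  # precision = 0.625, recall = 0.417
```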
Brief summary

By using cues from a standard search engine, the text surrounding a table,
and various other schema-matching heuristics, it is possible to create tools (e.g.
WebTables) that make finding data in tables faster, and more relevant to a
search

This type of parsing also allows for automatically combining data from multiple
sources into a consolidated database inferred by the engine.
References and Further Reading

References:

Ted Johnson, SIGMOD 2003 Tutorial

Entity resolution tutorial, L. Getoor & A. Machanavajjhala, VLDB 2012

John Canny, UC Berkeley, Notes, Intro Data Science

State-of-the-art in NLP:

GPT-3: https://www.youtube.com/watch?v=jz78fSnBG0s

Bag-of-words (Naïve Bayes): NLTK


https://realpython.com/python-nltk-sentiment-analysis/
Next topic: Visualization
