Data Profiling Revisited

Felix Naumann∗
Qatar Computing Research Institute (QCRI), Doha, Qatar
fnaumann@qf.org.qa

∗ On leave from Hasso Plattner Institute, Potsdam, Germany (naumann@hpi.uni-potsdam.de).

ABSTRACT

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.

Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond the traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.

1. DATA PROFILING

"Data profiling is the process of examining the data available in an existing data source [...] and collecting statistics and information about that data."¹ Profiling data is an important and frequent activity of any IT professional and researcher. We can safely assume that any reader of this article has engaged in the activity of data profiling, at least by eye-balling spreadsheets, database tables, XML files, etc. Possibly more advanced techniques were used, such as key-word-searching in data sets, sorting, writing structured queries, or even using dedicated data profiling tools. While the importance of data profiling is undoubtedly high, and while efficiently and effectively profiling is an enormously difficult challenge, it has yet to be established as a research area in its own right. We focus our discussion on relational data, the predominant format of traditional data profiling methods, but we do regard data profiling for other data models in a separate section.

¹ Wikipedia on "Data Profiling", 2/2013.

Data profiling encompasses a vast array of methods to examine data sets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies. More advanced techniques detect approximate properties or conditional properties of the data set at hand. To allow focus, the broad field of data mining is deliberately omitted from the discussion here, as justified below. Obviously, all such discovered metadata refer only to the given data instance and cannot be used to derive with certainty schematic/semantic properties, such as primary keys or foreign key relationships. Figure 1 shows a classification of data profiling tasks. The tasks for "single sources" correspond to state-of-the-art in tooling and research (see Section 2), while the tasks for "multiple sources" reflect new research directions for data profiling (see Section 5).

[Figure 1: A classification of data profiling tasks. Single source – single column: cardinalities; uniqueness and keys; patterns and data types; distributions. Single source – multiple columns: uniqueness and keys; inclusion and foreign key dependencies; functional dependencies; conditional and approximate dependencies. Multiple sources – topic discovery: topical overlap; topical clustering. Multiple sources – schema matching: schematic overlap; cross-schema dependencies. Multiple sources – duplicate detection: data overlap; record linkage.]

Systematic data profiling, i.e., profiling beyond the occasional exploratory SQL query or spreadsheet browsing, is usually performed by dedicated tools or components, such as IBM's Information Analyzer, Microsoft's SQL Server Integration Services (SSIS), or Informatica's Data Explorer. Their approaches all follow the same general procedure: A user specifies the data to be profiled and selects the types of metadata to be generated. Next, the tool computes in batch the metadata using SQL queries and/or specialized algorithms. Depending on the volume of the data and the selected profiling results, this step can last minutes to hours. The results are usually displayed in a vast collection of tabs, tables, charts, and other visualizations to be explored by the user.
Typically, discoveries can then be translated into constraints or rules that are then enforced in a subsequent cleansing/integration phase. For instance, after discovering that the most frequent pattern for phone numbers is (ddd)ddd-dddd, this pattern can be promoted to the rule that all phone numbers must be formatted accordingly. Most cleansing tools can then either transform differently formatted numbers or at least mark them as violations.
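As an illustration of this promote-and-check step, here is a minimal Python sketch; the digit-to-"d" abstraction, the column name, and the sample numbers are assumptions for illustration, not the behavior of any particular tool. It derives the most frequent value pattern of a column and flags the values that violate the promoted rule:

import re
from collections import Counter

def value_pattern(value: str) -> str:
    """Abstract a value into a pattern: digits become 'd', letters become 'a'."""
    pattern = re.sub(r"\d", "d", value)
    pattern = re.sub(r"[A-Za-z]", "a", pattern)
    return pattern

def profile_patterns(column):
    """Count the patterns occurring in a column, ordered by frequency."""
    return Counter(value_pattern(v) for v in column)

def violations(column, rule_pattern):
    """Return all values that do not conform to the promoted pattern rule."""
    return [v for v in column if value_pattern(v) != rule_pattern]

# Hypothetical phone-number column used only for this sketch:
phone_numbers = ["(030)123-4567", "(040)555-0199", "0301234567", "(089)777-2323"]

patterns = profile_patterns(phone_numbers)
most_frequent, count = patterns.most_common(1)[0]
print("most frequent pattern:", most_frequent, "with", count, "occurrences")
# Promoting the pattern to a rule, the remaining values are flagged as violations:
print("violations:", violations(phone_numbers, most_frequent))

On the sample column this reports (ddd)ddd-dddd as the dominant pattern and flags the unformatted number as a violation, mirroring the cleansing workflow described above.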
Use cases for profiling. The need to profile a new or unfamiliar set of data arises in many situations, in general to prepare for some subsequent task.

Query optimization. Basic profiling is performed by most database management systems to support query optimization with statistics about tables and columns. These profiling results can be used to estimate the selectivity of operators and ultimately the cost of a query plan.

Data cleansing. Probably the most typical use case is profiling data to prepare a data cleansing process. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a data set, for instance by determining the number of records that do not conform to previously established constraints.

Data integration. Often the data sets to be integrated are somewhat unfamiliar and the integration expert wants to explore the data sets first: How large is it? What data types are needed? What are the semantics of columns and tables? Are there dependencies between tables and among databases, etc.? The vast abundance of (linked) open data and the desire and potential to integrate them with enterprise data has amplified this need.

Scientific data management. The management of data that is gathered during scientific experiments or observations has created additional motivation for efficient and effective data profiling: When importing raw data, e.g., from scientific experiments or extracted from the Web, into a DBMS, it is often necessary and useful to profile the data and then devise an adequate schema.

Data analytics. Almost any statistical analysis or data mining run is preceded by a profiling step to help the analyst understand the data at hand and appropriately configure tools, such as SPSS or Weka. Pyle describes detailed steps of analyzing and subsequently preparing data for data mining [38].

Knowledge about data types, keys, foreign keys, and other constraints supports data modeling and helps keep data consistent, improves query optimization, and reaps all the other benefits of structured data management. Other research efforts have mentioned query formulation and indexing [42], scientific discovery [26], and database reverse engineering [35] as further motivation for data profiling.

Time to revisit. Recent trends in the database field have added challenges but also opportunities for data profiling. First, under the big data umbrella, industry and research have turned their attention to data that they do not own or have not made use of yet. Data profiling can help assess which data might be useful and reveals the yet unknown characteristics of such new data: before exposing an infrastructure to Twitter's firehose it might be worthwhile to know about properties of the data one is receiving; before downloading significant parts of the linked data cloud, some prior
sense of the integration effort is needed; before augmenting a warehouse with text mining results an understanding of their quality is required. Leading researchers have recently noted "If we just have a bunch of data sets in a repository, it is unlikely anyone will ever be able to find, let alone reuse, any of this data. With adequate metadata, there is some hope, but even so, challenges will remain [...]" [4].

Second, much of the data that shall be exploited is of non-traditional type for data profiling, i.e., non-relational (e.g., linked open data), non-structured (e.g., tweets and blogs), and heterogeneous (e.g., open government data). And it is often truly "big", both in terms of schema, rendering algorithms that are exponential in the number of schema elements infeasible, and in terms of data, rendering main-memory based methods infeasible. Existing profiling methods are not adequate to handle that kind of data: Either they do not scale well (e.g., dependency discovery), or there simply are no methods yet (e.g., incremental profiling, profiling multiple data sets, profiling textual attributes). Third, different and new data management architectures and frameworks have emerged, including distributed systems, key-value stores, multi-core- or main-memory-based servers, column-oriented layouts, streaming input, etc. These new premises provide interesting opportunities as we discuss later.

Profiling challenges. Data profiling, even in a traditional relational setting, is non-trivial for three reasons: First, the results of data profiling are computationally complex to discover. For instance, discovering key candidates or dependencies usually involves some sorting step for each considered column. Second, the discovery-aspect of the profiling task demands the verification of complex constraints on all columns and combinations of columns in a database. And thus also the solution-space of uniqueness-, inclusion dependency-, or functional dependency-discovery is exponential in the number of attributes. Third, profiling is often performed on data sets that may not fit into main memory.

Various tools and algorithms have tackled these challenges in different ways. First, many rely on the capabilities of an underlying DBMS, as many profiling tasks can be expressed as SQL queries. Second, many have developed innovative ways to handle the individual challenges, for instance using indexing schemes, parallel processing, and reusing intermediate results. Third, several methods have been proposed that deliver only approximate results for various profiling tasks, for instance by profiling samples. Finally, users are asked to narrow down the discovery process to certain columns or tables. For instance, there are tools that verify inclusion dependencies on user-suggested pairs of columns, but that cannot automatically check inclusion between all pairs of columns or column sets.

The following section elaborates these traditional data profiling tasks and gives a brief overview of known approaches. Sections 3 – 6 are the main contributions of this article by defining and motivating new research perspectives for data profiling. These areas include interactive profiling (users can act upon profiling results and re-profile efficiently), incremental profiling (profiling results are incrementally updated as new data arrives), profiling heterogeneous data and multiple sources simultaneously, profiling non-relational data (XML and RDF), and profiling on different architectures (column stores, key-value stores, etc.).

This article is not intended to be a survey of existing approaches, though there is certainly a need for such, nor is it a formal framework for future data profiling developments. Rather, it strives to spark interest in this research area and to assemble a wide range of research challenges.

2. STATE OF THE ART

While the introduction mentions current industrial profiling tools, this section discusses current research directions. In its basic form, data profiling is about analyzing data values of a single column, summarized as "traditional data profiling". More advanced techniques detect relationships among columns of one or more tables, which we discuss as "dependency detection". Finally, we distinguish data profiling from the broad field of "data mining", which we deliberately exclude from further discussion.

Traditional data profiling. The most basic form of data profiling is the analysis of individual columns in a given table. Typically, generated metadata comprises various counts, such as the number of values, the number of unique values, and the number of non-null values. These metadata are often part of the basic statistics gathered by DBMS. Mannino et al. give a much-cited survey on statistics collection and its relationship to database optimization [32].
In addition to the basic counts, the maximum and minimum values are discovered and the data type is derived (usually restricted to string vs. numeric vs. date). Slightly more advanced techniques create histograms of value distributions, for instance to optimize range-queries [37], and identify typical patterns in the data values in the form of regular expressions [40]. Data profiling tools display such results and can suggest some actions, such as declaring a column with only unique values a key-candidate or suggesting to enforce the most frequent patterns.
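To make these basic counts concrete, the following minimal sketch computes them with a single SQL aggregation query via Python's sqlite3 module. The person table, its columns, and the sample rows are made-up assumptions; real profiling tools would add data type and pattern analysis on top of such counts.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, zip TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [("Alice", "14482"), ("Bob", "10115"), ("Carol", None), ("Dave", "14482")])

def profile_column(connection, table, column):
    """Gather basic single-column metadata with one SQL aggregation query."""
    row = connection.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    num_rows, non_null, distinct, minimum, maximum = row
    return {
        "rows": num_rows,
        "nulls": num_rows - non_null,                 # completeness
        "distinct": distinct,                         # cardinality
        "unique": distinct == non_null == num_rows,   # key candidate?
        "min": minimum,
        "max": maximum,
    }

print(profile_column(conn, "person", "zip"))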
Dependency detection. Dependencies are metadata that describe relationships among columns. The difficulties are twofold: First, pairs of columns or column-sets must be regarded, and second, the chance existence of a dependency in the data at hand does not imply that this dependency is meaningful.

The most frequent real-world use-case is the discovery of foreign keys [30, 41] with the help of inclusion dependencies [6, 33]. Current data profiling tools often avoid checking all combinations of columns, but rather ask the user to suggest a candidate key/foreign-key pair to verify. Another form of dependency, which is also relevant for data quality, is the functional dependency (FD). Again, much research has been performed to automatically detect FDs [26, 45].

Both types of dependencies can be relaxed in two ways. First, conditional dependencies need to hold only for tuples that fulfill the condition. Conditional inclusion dependencies (CINDs) were proposed for data cleaning and contextual schema matching [11]. Different aspects of CIND discovery have been addressed in [5, 17, 22, 34]. Conditional functional dependencies (CFDs) were introduced in [20] for data cleaning. Algorithms for discovering CFDs are also proposed in [14, 21]. Second, approximate dependencies need to hold only for a certain percentage of the data – they are not guaranteed to hold for the entire relation. Such dependencies are often discovered using sampling [27] or other summarization techniques [16].

Finally, algorithms for the discovery of columns and column combinations with only unique values (which is strictly speaking a constraint and not a dependency) have been proposed in [2, 42].
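To illustrate why tools avoid checking all combinations, here is a naive baseline sketch with an assumed toy schema (not a published algorithm): it tests every ordered pair of columns for set inclusion, which is already quadratic in the number of columns and ignores the exponentially many multi-column candidates that full discovery would face.

from itertools import product

def unary_inds(tables):
    """Naively test every ordered pair of columns (from any table) for inclusion.

    'tables' maps a table name to a dict of column name -> list of values.
    Returns all pairs (dependent, referenced) whose value sets are included.
    """
    columns = {(t, c): set(v for v in vals if v is not None)
               for t, cols in tables.items() for c, vals in cols.items()}
    results = []
    for dep, ref in product(columns, repeat=2):
        if dep != ref and columns[dep] and columns[dep] <= columns[ref]:
            results.append((dep, ref))
    return results

# Hypothetical two-table database used only for this sketch:
tables = {
    "orders":    {"customer_id": [1, 2, 2, 3], "amount": [10, 25, 5, 40]},
    "customers": {"id": [1, 2, 3, 4], "city": ["Doha", "Potsdam", "Doha", "Berlin"]},
}

for dependent, referenced in unary_inds(tables):
    print(dependent, "is included in", referenced)   # the foreign-key-like candidate

Only the inclusion of orders.customer_id in customers.id survives, which is exactly the kind of candidate that would then be verified as a foreign key.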
To reiterate our motivation: There are various individual techniques for various individual profiling tasks. What is lacking even for the state-of-the-art is a unified view of data profiling as a field and a unifying framework of its tasks.

Data mining. Rahm and Do distinguish data profiling from data mining by the number of columns that are examined: "Data profiling focusses on the instance analysis of individual attributes. [...] Data mining helps discover specific data patterns in large data sets, e.g., relationships holding between several attributes" [39]. Yet, a different distinction is more useful to separate the different use cases: Data profiling gathers technical metadata to support data management, while data mining and data analytics discover non-obvious results to support business management. In this way, data profiling results are information about columns and column sets, while data mining results are information about rows or row sets (clustering, summarization, association rules, etc.).

Of course such a distinction is not strict. Some data mining technology does express information about columns, such as feature selection methods for sets of values within a column [7] or regression techniques to characterize columns [13]. Yet with the distinction above, we concentrate on data profiling and put aside the broad area of data mining, which has already received unifying treatment in numerous text books and surveys.

3. INTERACTIVE DATA PROFILING

Data profiling research has yet hardly recognized that data profiling is an inherently user-oriented task. In most cases, the produced metadata is consumed directly by the user or it is at least regarded by a user before put to use in some application, such as schema design or data cleansing. We suggest the involvement of the user already during the algorithmic part of data profiling, hence "interactive profiling".

Online profiling. Despite many optimization efforts, data profiling might last longer than a user is willing to wait in front of a screen with nothing to look at. Online profiling shows intermediate results as they are created. However, simply hooking the graphical interface into existing algorithms is usually not sufficient: Data that is sorted by some attribute or has a skewed order yields misleading intermediate results. Solutions might be approximate or sampling-based methods, whose results gracefully improve as more computation is invested. Naturally, such intermediate results do not reflect the properties of the entire data set. Thus, some form of confidence, along with a progress indicator, can be shown to allow an early interpretation of the results.

Apart from entertaining users during computation, an advantage of online profiling is that the user may abort the profiling run altogether. For instance, a user might decide early on that the data set is not interesting (or clean) enough for the task at hand.
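A minimal sketch of this idea, under assumed simplifications (a single in-memory column and uniqueness as the only measure), scans the data once and periodically reports the current estimate together with a progress indicator, so a user could interpret the intermediate result or abort the run early:

def online_uniqueness(values, report_every=1000):
    """Scan a column once and yield intermediate (progress, uniqueness) estimates.

    The estimate reflects only the rows seen so far; for sorted or skewed input
    it can be misleading, which is why a progress indicator accompanies it.
    """
    seen = set()
    total = len(values)
    for i, v in enumerate(values, start=1):
        seen.add(v)
        if i % report_every == 0 or i == total:
            yield i / total, len(seen) / i   # (fraction scanned, distinct ratio so far)

# Synthetic column with duplicates, used only for this sketch:
column = [f"user{i % 7000}" for i in range(10000)]
for progress, uniqueness in online_uniqueness(column, report_every=2500):
    print(f"{progress:4.0%} scanned: uniqueness so far {uniqueness:.2f}")

On this synthetic column the early reports suggest a fully unique column, while the final result drops to 0.70, which is precisely why intermediate results need a confidence or progress indication.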
Profiling on queries and views. In many cases, data profiling is performed with the purpose of cleaning the data or the schema to some extent, for instance, to be able to insert it into a data warehouse or to integrate it with some other data set. However, each cleansing step changes the data, and thus implicitly also the metadata produced by profiling. In general, after each cleansing step a new profiling run should be performed. For instance, only after cleaning up zip codes does the functional dependence with the city values become apparent. Or only after deduplication does the uniqueness of email addresses reveal itself.

A modern profiling system should be able to allow users to virtually interact with the data and re-compute profiling results. For instance, the profiling system might show a 96% uniqueness for a certain column. The user might recognize that indeed the attribute should be completely unique and is in fact a key. Without performing the actual cleansing, a user might want to virtually declare the column to be a key and re-perform profiling on this virtually cleansed data. Only then a foreign key for this attribute might be recognized.

In short, a user might want to act upon profiling results in an ad-hoc fashion without going through the entire cleansing and profiling loop, but remain within the profiling tool context and perform cleansing and re-profiling only on a virtually cleansed view. When satisfied, the virtual cleansing can of course be materialized. A key enabling technology for this kind of interaction is the ability to efficiently re-perform profiling on slightly changed data, as discussed in the next section. In the same manner, profiling results can be efficiently achieved on query results: While calculating the query result, profiling results can be generated on the side, thus showing a user not only the result itself, but also the nature of that data. Faceted search provides similar features in that a user is presented with cardinalities based on the chosen filters.

For all suggestions above, new algorithms and data structures are needed to enhance the user experience of data profiling.

4. INCREMENTAL DATA PROFILING

A data set is hardly ever fixed: Transactional data is appended to frequently, analytics-oriented data sets experience periodic updates (typically daily), and large data sets available on the web are updated every few weeks or months. Data profiling methods should be able to efficiently handle such moving targets, in particular without re-profiling the entire data set.

Incremental profiling. An obvious, but yet under-examined extension to data profiling is to re-use earlier profiling results to speed-up computation on changed data. I.e., the profiling system is provided with a data set and with knowledge of its delta compared to a previous version, and it has stored any intermediate or final profiling results on that previous version. In the simplest cases, profiling metadata can be calculated associatively (e.g., sum, count, equi-width histograms), in some cases some intermediate metadata can help (e.g., sum and count for average, indexes for value patterns), and finally in some cases a complete recalculation might be necessary (e.g., median or clustering).
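The associativity argument can be made concrete with a small, assumed sketch of mergeable column statistics (simplified structures rather than an existing system): count, null count, sum, and min/max are kept as a partial state, and the state of the previously profiled data is merged with the state computed on the delta alone. The average is then derived from sum and count, whereas a median would indeed force a recalculation.

from dataclasses import dataclass

@dataclass
class ColumnStats:
    """Mergeable (associative) profiling state for one numeric column."""
    count: int = 0
    nulls: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, values):
        for v in values:
            if v is None:
                self.nulls += 1
                continue
            self.count += 1
            self.total += v
            self.minimum = min(self.minimum, v)
            self.maximum = max(self.maximum, v)
        return self

    def merge(self, other: "ColumnStats") -> "ColumnStats":
        """Combine the state of an old version and its delta (or of two partitions)."""
        return ColumnStats(self.count + other.count,
                           self.nulls + other.nulls,
                           self.total + other.total,
                           min(self.minimum, other.minimum),
                           max(self.maximum, other.maximum))

    @property
    def average(self):
        return self.total / self.count if self.count else None

old = ColumnStats().update([10, 12, None, 14])   # profiled earlier and stored
delta = ColumnStats().update([20, None, 2])      # newly arrived rows only
print(old.merge(delta))                          # incremental profile, no re-scan of old data
print("average:", old.merge(delta).average)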
There is already some research on performing individual profiling tasks incrementally. For instance, the AD-Miner algorithm allows an incremental update of functional dependency information [19]. Fan et al. focus on the area of conditional functional dependencies and also consider incremental updates [20]. The area of data mining, on the other hand, has seen much related work, for instance on association rule mining and other data mining applications [24].

Continuous profiling. While for incremental profiling we assumed periodic updates (or periodic profiling runs), a further use case is to update profiling results while (transactional) data is created or updated. If the profiling results can be expressed as a query, and if they shall be performed only on a temporal window of the data, this use case can be served by data stream management systems [23]. If this is not the case, continuous profiling methods need to be developed, whose results can be displayed in a dashboard. Of particular importance is to find a good tradeoff between recency, accuracy, and resource consumption. Use cases for continuous profiling include internet traffic monitoring or the profiling of incoming search queries.

Multi-measure profiling. Each profiling algorithm has its own scheme of running through the data and collecting or aggregating whatever information is needed. Realizing that multiple types of profiling metadata shall be collected, it is likely that many of these runs can be combined. Thus, in a manner similar to multi-query-optimization, there is a high potential for efficiency gains, in particular wrt. I/O cost. While such potential is already realized in commercial systems, it has not yet been investigated for the more complex tasks that are not covered by these tools.
5. PROFILING HETEROGENEOUS DATA

While typical profiling tasks assume a single, largely homogeneous database or even only a single table, there are many use cases in which a combined profiling of multiple, heterogeneous data sets is needed. In particular when integrating data it is useful to learn about the common properties of participating data sets. From profiling one can learn about their integrability, i.e., how well their data and schemata fit together, and learn in advance the properties of the integrated data set. Even profiling a single source that stores data for multiple or many domains, such as DBpedia or Freebase, can profit from techniques that profile heterogeneous data.

Degrees of heterogeneity. Heterogeneity in data sets can appear at many different levels and in many different degrees of severity. Data profiling methods can be used to uncover these heterogeneities and possibly provide hints on how to overcome them. Heterogeneity is traditionally divided into syntactic heterogeneity, structural heterogeneity, and semantic heterogeneity [36]. Discovering syntactic heterogeneity, in the context of data profiling, is precisely what traditional profiling aims at, e.g., finding inconsistent formatting. Next, structural heterogeneity appears in the form of unmatched schemata and differently structured information. Such problems are only partly addressed by traditional profiling, e.g., by discovering schema information, such as types, keys, or foreign keys. Finally, semantic heterogeneity addresses the underlying and possibly mismatched meaning of the data. For data profiling we interpret it as the discovery of semantical overlap of the data and their domain(s).

Data profiling for integration. Our focus here is on profiling tasks to discover structural and semantic heterogeneity, arguing that structural profiling seeks information about the schema and semantic profiling seeks information about the data. Both serve to assess the integrability of data sets, and thus also indicate the necessary integration effort, which is vital to project planning. The integration effort might be expressed in terms of similarity, but also in terms of man-months or in terms of which tools are needed.

An important issue in integrated information systems, irrelevant for single databases, is the schematic similarity, i.e., the degree to which their schemata complement each other and the degree to which they overlap. There is an obvious relation to schema matching techniques, which aim at automatically finding correspondences between schema elements [18]. Already Smith et al. have recognized that schema matching techniques often play the role of profiling tools [43]: Rather than using them to derive schema mappings and perform data transformation, they play roles that have a more informative character, such as assessment of project feasibility or the identification of integration targets. However, the mere matching of schema elements might not suffice as a profiling-for-integration result: Additional information on the structure of the values of the matching columns can provide further details about the integration difficulty.

After determining schematic overlap, a next step is to determine data overlap, i.e., the (estimated) number of real-world objects that are represented in both data sets, or that are represented multiple times in a single data set. Such multiple representations are typically identified using entity matching methods (aka. record linkage, entity resolution, duplicate detection, and many other names) [15]. However, estimating the number of matches without actually performing the matching on the entire data set is an open problem. If used to determine the integration effort, it is additionally important to know how diverse such matching records are represented, i.e., how difficult it is to devise good similarity measures and find appropriate thresholds.

Topical profiling. When profiling yet unknown data from a large pool of sources, it is necessary to recognize the topic or domain covered by the source. One recently proposed use case for such source discovery is situational BI where warehouse data is complemented with data from openly available sources [3, 31]. Examples for such sources are the set of linked open data sources (linkeddata.org) or tables gleaned from the web: "Data on the Web reflects every topic in existence, and topic boundaries are not always clear." [12]

Topical profiling should be able to match a data set to a given set of topics or domains. Given two data sets, it should be able to determine topical overlap between them. There is already initial work on topical profiling for traditional databases in the iDisc system [44], which matches tables to topics or clusters them by topic, and for web data [8], which discovers frequent patterns of concepts and aggregates them to topics.

6. DATA PROFILING ON OTHER ARCHITECTURES

Most current data profiling methods and tools assume data to be stored in relational form on a
single-node database. However, much interesting data nowadays resides in data stores of different architecture and in various (non-relational) models and formats. If these architectures are more amenable to data profiling tasks, they might even warrant copying data for the purpose of profiling.

Storage architectures. Of all modern hardware architectures, columnar storage seems the most promising for many data profiling tasks, which often are inherently column-oriented: Analyzing individual columns for patterns, data types, uniqueness, etc. involves reading only the data of that column and thus matches precisely the sweet-spot of column stores [1]. This advantage may dwindle when analyzing column-combinations, for instance to discover functional dependencies, but even then one can avoid reading entire rows of data.

As data profiling includes many different tasks on many tables and columns, a promising research avenue is the use of many cores, GPUs, or distributed environments for parallelization. Parallelization can occur at different levels: A comprehensive profiling run might distribute individual, independent profiling tasks to different nodes (task parallelism). Another approach is to partition data for a single profiling task (data parallelism). As most profiling tasks are not associative, in the sense that profiling results for subsets of column-values cannot be aggregated to overall results, horizontal partitioning is usually not useful or at least raises some coordination overhead. For instance, uniqueness within each partition of a column does not imply uniqueness of the entire column, but communicating the sets of distinct values is sufficient. Finally, task parallelism can again be applied to finer-grained tasks, such as sorting or hashing, that form the basic building blocks of many profiling algorithms.
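Here is a hedged sketch of the uniqueness example, using Python's multiprocessing pool merely as a stand-in for a distributed setting: every partition reports its row count and its set of distinct values, and the column is declared unique only if the merged distinct set is as large as the total row count.

from multiprocessing import Pool

def partition_profile(partition):
    """Local profiling of one horizontal partition: row count and distinct values."""
    return len(partition), set(partition)

def column_is_unique(partitions, workers=3):
    """Uniqueness of the whole column, decided from communicated partition results.

    Per-partition uniqueness is not enough: the same value may occur in two
    partitions, so the distinct sets themselves are merged at the coordinator.
    """
    with Pool(workers) as pool:
        results = pool.map(partition_profile, partitions)
    total_rows = sum(count for count, _ in results)
    merged_distinct = set().union(*(distinct for _, distinct in results))
    return len(merged_distinct) == total_rows

if __name__ == "__main__":
    # Hypothetical partitions of one column; 'a' repeats across partitions.
    partitions = [["a", "b", "c"], ["d", "e", "f"], ["g", "a"]]
    print(column_is_unique(partitions))   # False: each partition is unique, the column is not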
Further challenges arise when performing data profiling on key-value stores: Typically, the values contain some structured data, without enforced schemata. Thus, even defining the expected results on such "soft schema" values is a challenge, and a first step must involve schema profiling as described in Section 5.

To systematically evaluate different methods and architectures for the various data profiling tasks, a corresponding data profiling benchmark is needed. It must define (i) a set of tasks, (ii) data on which the tasks shall be executed, and (iii) measures to evaluate efficiency. For (i) the first (single-source) subtree of Figure 1 can serve as an initial set of tasks. Arguably, the most difficult part of establishing a benchmark is to (ii) provide data that closely mirrors real-world situations. Given a schema and a set of constraints (uniqueness, data types, FDs, INDs, patterns, etc.) it is not trivial to create a valid database instance. If in addition some dirtiness, i.e., violations to constraints, are to be inserted, or if conditional dependencies are needed, the task becomes even more daunting. The measures for (iii) need to be carefully selected, in particular if they are to go beyond traditional measures of response time and cost efficiency and include the evaluation of approximate results. Finally, the benchmark should be able to evaluate not only entire profiling systems but also methods for individual tasks.

Types of data. Data comes not only in relational form, but also in tree or graph shapes, such as XML and RDF data. A first step is to adapt traditional profiling tasks to those models. An example is ProLOD, which profiles linked open data delivered as RDF triples [10]. A further challenge arises from the sheer size of many RDF data sets, so profiling computation must be distributed [9]. In addition, such data models demand new, data model-specific profiling tasks, such as maximum tree depth or average node-degree.
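Such tree-specific measures are straightforward to state; the sketch below, with a small made-up XML fragment, computes maximum tree depth and average node degree using only the Python standard library:

import xml.etree.ElementTree as ET

def tree_profile(element, depth=1):
    """Walk an XML tree and collect maximum depth and per-node child counts."""
    degrees = [len(element)]                # node degree = number of children
    max_depth = depth
    for child in element:
        child_depth, child_degrees = tree_profile(child, depth + 1)
        max_depth = max(max_depth, child_depth)
        degrees.extend(child_degrees)
    return max_depth, degrees

# Hypothetical XML fragment used only for this sketch:
xml_data = """
<catalog>
  <product><name>Chair</name><price>49</price></product>
  <product><name>Table</name></product>
</catalog>"""

root = ET.fromstring(xml_data)
max_depth, degrees = tree_profile(root)
print("maximum tree depth:", max_depth)
print("average node degree:", sum(degrees) / len(degrees))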
Structured data is often intermingled with unstructured, textual data, for instance in product information or user profiles on the web. The field of linguistics knows various measures to characterize a text from simple measures, such as average sentence length, to complex measures, such as vocabulary richness [25] as visualized in [29]. Thus, data profiling might be extended to text profiling and possibly to methods that jointly profile both data and text. A discussion on the large area of text mining is omitted, for the same reasons data mining was omitted from this article.

7. AN OUTLOOK

This article points out the potentials and the needs of modern data profiling – there is yet much principled research to do. A planned first step is to develop a general framework for data profiling, which classifies and formalizes profiling tasks, shows its amenability for a range of use cases, and provides a means to compare various techniques both in their abilities and their efficiency.

At the same time, this article shall serve as a "call to arms" for database researchers to develop more efficient and more advanced profiling techniques, in particular for the fast growing areas of "big data"
and “linked data”, both of which have attracted
great interest by industry, but both of which have Proceedings of the International Conference
proven that data is difficult to grasp and use effec- on Information and Knowledge Management
tively. Data profiling can bridge this gap by show- (CIKM), pages 2094–2098, Maui, HI, 2012.
ing what the data sets are about, how well they fit [6] J. Bauckmann, U. Leser, F. Naumann, and
the data environment at hand, and what steps are V. Tietz. Efficiently detecting inclusion
needed to make use of them. dependencies. In Proceedings of the
Several research areas were deliberately omitted International Conference on Data Engineering
in this article, in particular data mining and text (ICDE), pages 1448–1450, Istanbul, Turkey,
mining, as reasoned above, but also data visual- 2007.
ization: Because data profiling targets users, ef- [7] J. Berlin and A. Motro. Database schema
fectively visualizing the profiling results is of ut- matching using machine learning with feature
most importance. A suggestion for such a visual selection. In Proceedings of the Conference on
data profiling tool is the Profiler system [28]. A Advanced Information Systems Engineering
strong cooperation between the database commu- (CAiSE), pages 452–466, Toronto, Canada,
nity, which produces the data and metadata to be 2002.
visualized, and the visualization community, which [8] C. Böhm, G. Kasneci, and F. Naumann.
enables users to understand and make use of the Latent topics in graph-structured data. In
data, is needed. Proceedings of the International Conference
Acknowledgments. Discussions and collabora- on Information and Knowledge Management
tion with Ziawasch Abedjan, Jana Bauckmann, (CIKM), pages 2663–2666, Maui, HI, 2012.
Christoph Böhm, and Frank Kaufer inspired this [9] C. Böhm, J. Lorey, and F. Naumann. Creating
article. voiD descriptions for web-scale data. Journal
of Web Semantics, 9(3):339–345, 2011.
8. REFERENCES [10] C. Böhm, F. Naumann, Z. Abedjan, D. Fenz,
[1] D. J. Abadi. Column stores for wide and T. Grütze, D. Hefenbrock, M. Pohl, and
sparse data. In Proceedings of the Conference D. Sonnabend. Profiling linked open data
on Innovative Data Systems Research with ProLOD. In Proceedings of the
(CIDR), pages 292–297, Asilomar, CA, 2007. International Workshop on New Trends in
[2] Z. Abedjan and F. Naumann. Advancing the Information Integration (NTII), pages
discovery of unique column combinations. In 175–178, Long Beach, CA, 2010.
Proceedings of the International Conference [11] L. Bravo, W. Fan, and S. Ma. Extending
on Information and Knowledge Management dependencies with conditions. In Proceedings
(CIKM), pages 1565–1570, Glasgow, UK, of the International Conference on Very Large
2011. Databases (VLDB), pages 243–254, Vienna,
[3] A. Abelló, J. Darmont, L. Etcheverry, Austria, 2007.
M. Golfarelli, J.-N. Mazón, F. Naumann, [12] M. J. Cafarella, A. Halevy, and J. Madhavan.
T. B. Pedersen, S. Rizzi, J. Trujillo, Structured data on the web. Communications
P. Vassiliadis, and G. Vossen. Fusion Cubes: of the ACM, 54(2):72–79, 2011.
Towards self-service business intelligence. [13] S. Chaudhuri, U. Dayal, and V. Ganti. Data
Data Warehousing and Mining (IJDWM), in management technology for decision support
press, 2013. systems. Advances in Computers, 62:293–326,
[4] D. Agrawal, P. Bernstein, E. Bertino, 2004.
S. Davidson, U. Dayal, M. Franklin, [14] F. Chiang and R. J. Miller. Discovering data
J. Gehrke, L. Haas, A. Halevy, J. Han, H. V. quality rules. Proceedings of the VLDB
Jagadish, A. Labrinidis, S. Madden, Endowment, 1:1166–1177, 2008.
Y. Papakonstantinou, J. M. Patel, [15] P. Christen. Data Matching. Springer Verlag,
R. Ramakrishnan, K. Ross, C. Shahabi, Berlin – Heidelberg – New York, 2012.
D. Suciu, S. Vaithyanathan, and J. Widom. [16] G. Cormode, M. N. Garofalakis, P. J. Haas,
Challenges and opportunities with Big Data. and C. Jermaine. Synopses for massive data:
Technical report, Computing Community Samples, histograms, wavelets, sketches.
Consortium, http://cra.org/ccc/docs/ Foundations and Trends in Databases,
init/bigdatawhitepaper.pdf, 2012. 4(1-3):1–294, 2012.
[5] J. Bauckmann, Z. Abedjan, H. Müller, [17] O. Curé. Conditional inclusion dependencies
U. Leser, and F. Naumann. Discovering for data cleansing: Discovery and violation
conditional inclusion dependencies. In
[17] O. Curé. Conditional inclusion dependencies for data cleansing: Discovery and violation detection issues. In Proceedings of the International Workshop on Quality in Databases (QDB), Lyon, France, 2009.
[18] J. Euzenat and P. Shvaiko. Ontology Matching. Springer Verlag, Berlin – Heidelberg – New York, 2007.
[19] S. M. Fakhrahmad, M. H. Sadreddini, and M. Z. Jahromi. AD-Miner: A new incremental method for discovery of minimal approximate dependencies using logical operations. Intelligent Data Analysis, 12(6):607–619, 2008.
[20] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):1–48, 2008.
[21] W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(4):683–698, 2011.
[22] L. Golab, F. Korn, and D. Srivastava. Efficient and effective analysis of data quality using pattern tableaux. IEEE Data Engineering Bulletin, 34(3):26–33, 2011.
[23] L. Golab and M. T. Özsu. Data Stream Management. Morgan Claypool Publishers, 2010.
[24] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[25] D. I. Holmes. Authorship attribution. Computers and the Humanities, 28:87–106, 1994.
[26] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. Computer Journal, 42:100–111, 1999.
[27] I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 647–658, Paris, France, 2004.
[28] S. Kandel, R. Parikh, A. Paepcke, J. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In Proceedings of Advanced Visual Interfaces (AVI), pages 547–554, Capri, Italy, 2012.
[29] D. A. Keim and D. Oelke. Literature fingerprinting: A new method for visual literary analysis. In Proceedings of Visual Analytics Science and Technology (VAST), pages 115–122, Sacramento, CA, 2007.
[30] S. Lopes, J.-M. Petit, and F. Toumani. Discovering interesting inclusion dependencies: application to logical database tuning. Information Systems, 27(1):1–19, 2002.
[31] A. Löser, F. Hueske, and V. Markl. Situational business intelligence. In Proceedings of Business Intelligence for the Real-Time Enterprise (BIRTE), pages 1–11, Auckland, New Zealand, 2008.
[32] M. V. Mannino, P. Chu, and T. Sager. Statistical profile estimation in database systems. ACM Computing Surveys, 20(3):191–221, 1988.
[33] F. D. Marchi, S. Lopes, and J.-M. Petit. Efficient algorithms for mining inclusion dependencies. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 464–476, Prague, Czech Republic, 2002.
[34] F. D. Marchi, S. Lopes, and J.-M. Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32:53–73, 2009.
[35] V. M. Markowitz and J. A. Makowsky. Identifying extended entity-relationship object structures in relational schemas. IEEE Transactions on Software Engineering, 16(8):777–790, 1990.
[36] T. Özsu and P. Valduriez. Principles of Distributed Database Systems. Prentice-Hall, 2nd edition, 1999.
[37] V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 294–305, Montreal, Canada, 1996.
[38] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[39] E. Rahm and H.-H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13, 2000.
[40] V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 381–390, Rome, Italy, 2001.
[41] A. Rostin, O. Albrecht, J. Bauckmann, F. Naumann, and U. Leser. A machine learning approach to foreign key discovery. In Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), Providence, RI, 2009.
[42] Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald. GORDIAN: Efficient and scalable discovery of composite keys. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 691–702, Seoul, Korea, 2006.
[43] K. P. Smith, M. Morse, P. Mork, M. H. Li, A. Rosenthal, M. D. Allen, and L. Seligman. The role of schema matching in large enterprises. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, 2009.
[44] W. Wu, B. Reinwald, Y. Sismanis, and R. Manjrekar. Discovering topical structures of databases. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 1019–1030, Vancouver, Canada, 2008.
[45] H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 16(2):197–219, 2008.