Lecture Notes on "Data Warehousing by Examples"
1 Introduction
Data Warehousing is an application-driven field: its techniques can be applied to almost every type of data, as long as there is a lot of it.
In this lecture, we review four Data Warehousing projects and some of the insights
gained by their authors. All of these papers are freely available on the Web. Only one of
the projects is an industrial project [1], but all are concerned with concrete problems
faced by Data Warehousing experts. The topics covered include updates from legacy
data sources, assessing the quality of a data warehouse, scalability, and performance.
We discuss the following projects:
• The WareHouse Information Project at Stanford (WHIPS, 1995) [3]: the project
reported on its use of monitors to check for updates in data sources.
• DWQ Project (1997): a European research project with some focus on data ware-
housing quality [4].
• The Data Warehouse of Newsgroups (1999): a research project on the application of
Data Warehousing techniques to Internet newsgroups [2].
• Data Warehouse Population Platform (2003): a data warehousing project
at Telecom Italia with strong performance concerns [1].
Figure 1: In the WHIPS project, monitors are placed between the Data Warehouse and
the data sources to check for changes, which are then forwarded to a data integrator.
and an insertion in place of an update. This means that instead of indexing both the
new data dump and the existing data, you can stream through the new data dump and,
assuming the order in which the data entries appear does not change too much between
data dumps, efficiently detect local changes in the stream of entries. This assumes you
have kept the older data dumps. Because the task has to be done continuously (daily),
you can build buffers to help: "The snapshot problem is a continuous problem where
we have a sequence of files, each to be compared with its predecessor. Thus, in comparing
F2 to F1, we can construct auxiliary structures for F2 that will be useful when we later
compare F3 to F2, and so on".
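A minimal sketch of this streaming comparison, assuming each dump is a text file with one record per line, that records within a window are distinct, and that record order is mostly preserved between dumps; the function name, window size, and matching strategy below are illustrative and not taken from the WHIPS papers:

from collections import deque

def snapshot_diff(old_path, new_path, window=1000):
    # Compare two successive snapshot dumps in a single streaming pass.
    # Old records are buffered in a bounded window; a new record that matches
    # something in the window is treated as unchanged, otherwise it is reported
    # as an insertion.  Old records that are never matched are deletions.
    inserts, deletes = [], []
    with open(old_path) as old, open(new_path) as new:
        buffer = deque()      # sliding window over the old dump
        pending = set()       # records currently in the window
        for line in new:
            record = line.rstrip("\n")
            # Refill the window from the old dump.
            while len(buffer) < window:
                old_line = old.readline()
                if not old_line:
                    break
                old_record = old_line.rstrip("\n")
                buffer.append(old_record)
                pending.add(old_record)
            if record in pending:
                # Matched: old records skipped before the match count as deletions.
                while buffer:
                    old_record = buffer.popleft()
                    pending.discard(old_record)
                    if old_record == record:
                        break
                    deletes.append(old_record)
            else:
                inserts.append(record)
        # Whatever remains of the old dump was deleted.
        deletes.extend(buffer)
        deletes.extend(line.rstrip("\n") for line in old)
    return inserts, deletes

On daily dumps, the auxiliary structures the quote refers to (for example, a sort order or hash index built over F2 while it is compared with F1) can be kept around to speed up the next comparison.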
Figure 2: Data Warehousing is sometimes described as the long-term buffer between
OLTP and OLAP.
Accessibility Do the people who need access to the data have the proper access? Is the system crashing or too slow?
Believability Am I missing too much data? Are there strong biases? Is the data quality consistent?
least 100 articles classified in soc.culture.indian. On the other hand, att.forsale may
require the user to be working at AT&T.
They say that they face two problems:
Newsgroup-selection problem Which views should be eager (materialized) and which should be lazy (computed on the fly)?
Newsgroup-maintenance problem New articles must be inserted efficiently into a possibly large number of newsgroups.
They solve the newsgroup-maintenance problem with an Independent Search Trees
algorithm, exploiting the fact that an article has relatively few attributes (date, length, title,
author): each newsgroup can be represented as a rectangular region in space and,
through indexes, you can quickly compute which newsgroups a given article
(modelled as a point in space) belongs to. "Contains" constraints (such as "the subject
must contain the word apple") can be handled with a trie data structure. For the article body,
they use word frequencies and an inverted list for fast similarity searches.
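A minimal sketch of the membership test behind this scheme, assuming each newsgroup is described by per-attribute ranges plus an optional set of required subject words; the class names, attributes, and example groups are illustrative and not taken from the paper, and the real algorithm answers the same question with one search tree per attribute and a trie instead of a loop:

from dataclasses import dataclass

@dataclass
class Newsgroup:
    name: str
    # Rectangular region: one (low, high) interval per numeric attribute.
    length_range: tuple = (0, float("inf"))
    date_range: tuple = (0, float("inf"))
    # "Contains" constraint: words that must appear in the subject.
    required_subject_words: frozenset = frozenset()

@dataclass
class Article:
    length: int
    date: int        # e.g. days since some epoch
    subject: str

def matching_newsgroups(article, newsgroups):
    # Brute-force membership test: the indexes described in the paper only
    # speed up this same point-in-rectangle and word-containment check.
    words = set(article.subject.lower().split())
    hits = []
    for g in newsgroups:
        if not (g.length_range[0] <= article.length <= g.length_range[1]):
            continue
        if not (g.date_range[0] <= article.date <= g.date_range[1]):
            continue
        if not g.required_subject_words <= words:
            continue
        hits.append(g.name)
    return hits

# Example: an article about apples lands in both illustrative groups.
groups = [
    Newsgroup("rec.food.cooking", length_range=(0, 10000)),
    Newsgroup("rec.food.fruit", required_subject_words=frozenset({"apple"})),
]
print(matching_newsgroups(Article(length=120, date=9000, subject="My apple pie recipe"), groups))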
For the newsgroup-selection problem, the goal is that frequently accessed newsgroups
have faster retrieval times whereas rarely accessed newsgroups can tolerate
slower access times. We can turn it into a graph problem: G =
(Q ∪ V, E) where Q are the newsgroup queries, V are the (potentially) materialized
newsgroups, and the hyperedges in E are of the form (q, {v1, ..., vl}) where q ∈ Q and
v1, ..., vl ∈ V. Each hyperedge is labeled with a query cost: how expensive it is to
build q from {v1, ..., vl}. (Comment: for example, you can build rec.cooking.* as
the union of rec.cooking.french, rec.cooking.italian and so on, or else you can
materialize the union. On the other hand, if you have materialized rec.cooking.*, you
might be able to compute rec.cooking.italian faster.) The newsgroup-selection problem
amounts to finding the smallest subset M of V such that every q ∈ Q belongs to at least
one hyperedge (q, {v1, ..., vl}) with {v1, ..., vl} ⊆ M whose cost is no larger than a given
threshold. We can reduce the minimum set cover problem to a special case of the
newsgroup-selection problem, and so the newsgroup-selection problem is NP-hard.
In summary, determining the views one needs to materialize is a hard problem,
and at least in this case it is NP-hard. Sometimes the problem is not so hard
(think Data Cubes).
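Since the exact problem is NP-hard, a set-cover style greedy heuristic is a natural, if non-optimal, way to approach it in practice. Below is a minimal sketch under the formulation above; the function name, the tie-breaking rule, and the toy newsgroups in the example are illustrative and not from the paper:

def select_views(queries, hyperedges, threshold):
    # hyperedges: iterable of (q, views, cost) triples, where `views` is the set
    # of candidate materialized newsgroups from which query q can be built at
    # the given cost.  We repeatedly materialize the view set that covers an
    # uncovered query while adding the fewest new views, ignoring hyperedges
    # whose cost exceeds the threshold.
    admissible = [(q, frozenset(vs), c) for q, vs, c in hyperedges if c <= threshold]
    materialized, uncovered = set(), set(queries)
    while uncovered:
        best = None
        for q, views, _cost in admissible:
            if q not in uncovered:
                continue
            extra = len(views - materialized)
            if best is None or extra < best[0]:
                best = (extra, q, views)
        if best is None:
            raise ValueError("some query cannot be answered within the cost threshold")
        materialized |= best[2]
        # Any query whose view set is now fully materialized is covered.
        uncovered -= {q for q, vs, _ in admissible if vs <= materialized}
    return materialized

# Example: rec.cooking.* can be answered either from its parts or from a
# materialized union; the costs are made up for illustration.
print(select_views(
    queries=["rec.cooking.*", "rec.cooking.italian"],
    hyperedges=[
        ("rec.cooking.*", {"rec.cooking.*"}, 1),
        ("rec.cooking.*", {"rec.cooking.french", "rec.cooking.italian"}, 3),
        ("rec.cooking.italian", {"rec.cooking.italian"}, 1),
        ("rec.cooking.italian", {"rec.cooking.*"}, 5),
    ],
    threshold=4,
))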
1. extraction of relevant information at the source side;
2. transformation and integration of the data coming from multiple sources into a
common format;
3. cleaning of the resulting data according to business rules;
4. loading and propagation of the data to the data warehouse and/or data marts.
While it is often reported elsewhere that the common data format mentioned in the
second step resides in a relational database, their approach to ETL was to convert all
data into a flat-file format.
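As a rough illustration of what such a flat-file common format can look like, here is a minimal sketch that flattens already-mapped source records into a tab-delimited file; the column names and sample values are invented for the example and are not the schema used by Telecom Italia:

import csv

# Illustrative common format: every source, whatever its original shape,
# is flattened to these columns before cleaning and loading.
COLUMNS = ["customer_id", "call_date", "duration_sec", "destination"]

def to_flat_file(records, path):
    # Write already-mapped source records (dicts) as a tab-delimited flat file.
    with open(path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for rec in records:
            writer.writerow([rec.get(col, "") for col in COLUMNS])

# Two records that could come from different sources, already mapped to
# the common column names.
to_flat_file(
    [
        {"customer_id": "42", "call_date": "2003-09-01", "duration_sec": 75, "destination": "0612345678"},
        {"customer_id": "17", "call_date": "2003-09-01", "duration_sec": 12, "destination": "0287654321"},
    ],
    "traffic_20030901.tsv",
)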
They were particularly concerned with performance and maintenance (including
data integrity) at the last step of their ETL tool. It is worth mentioning that they were
working with a very large data set: the main fact table, containing the traffic detail data
of all customers, has about 3 billion records, and the Customer dimension table is also
hash partitioned because of the data volume (about 26 million customers).
All fact tables are partitioned by value and hash sub-partitioned because of:
• history management needs - every day a new partition is added and the oldest one is dropped, and only the newly loaded partition has to be exported for backup (a toy sketch of this rolling scheme follows the list);
• performance needs - Parallel Query Servers work on different sub-partitions.
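A toy sketch of the rolling scheme behind the first bullet, in which each day's data becomes a new partition sub-partitioned by a hash of the customer id and the oldest partition is dropped once a retention window is exceeded; the class, the sub-partition count, and the retention window are all invented for illustration (the paper's tables live in Oracle, not in Python dictionaries):

from collections import OrderedDict

N_SUBPARTITIONS = 16   # illustrative: the paper does not give the sub-partition count here
RETENTION_DAYS = 90    # illustrative retention window

class FactTable:
    # Toy model of a fact table partitioned by day and hash sub-partitioned
    # on the customer id, with a rolling retention window.

    def __init__(self):
        # partition key (day string) -> list of sub-partitions (lists of rows)
        self.partitions = OrderedDict()

    def load_day(self, day, rows):
        # Load one day of traffic data into a fresh partition, then roll.
        subparts = [[] for _ in range(N_SUBPARTITIONS)]
        for row in rows:
            subparts[hash(row["customer_id"]) % N_SUBPARTITIONS].append(row)
        self.partitions[day] = subparts
        self.roll()
        return subparts   # only this partition needs to be exported for backup

    def roll(self):
        # Drop the oldest partitions once the retention window is exceeded.
        while len(self.partitions) > RETENTION_DAYS:
            self.partitions.popitem(last=False)

table = FactTable()
table.load_day("2003-09-01", [{"customer_id": 42, "duration_sec": 75}])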
For increased performance, they designed a module called LDR characterized by
so-called Asynchronous Read/Transform/Write, with complete parallelism of the read, transform
and write activities (see Fig. 4; a minimal threaded sketch follows the list below). The LDR module has an internal multi-threaded
structure based on in-memory structures (buffer pools) and specialized
thread groups:
Input Buffer Pool memory segments used by Readers (from the Reader Thread Pool)
for storing input data;
Output Buffer Pool memory segments used by ProcessUnits to write transformed
data ready to be loaded by Writers (from the Writer Thread Pool);
Reader Thread Pool threads that open the input data files or chunks, once the
appropriate message has been received from the Listener, and spread that data
over the Input Buffer Pool;
Writer Thread Pool threads that move transformed data from the Output Buffer Pool to
database tables (in direct or conventional path) or flat files.
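A minimal threaded sketch of this asynchronous read/transform/write structure, using Python threads and bounded queues to stand in for the buffer pools; the thread counts, file names, placeholder transformation, and flat-file target are all illustrative (the real LDR loads Oracle tables in direct or conventional path):

import threading
import queue

# Create tiny input chunks so the sketch runs end to end.
for name, rows in [("chunk1.txt", ["a;1", "b;2"]), ("chunk2.txt", ["c;3"])]:
    with open(name, "w") as f:
        f.write("\n".join(rows) + "\n")

input_pool = queue.Queue(maxsize=1000)    # stands in for the Input Buffer Pool
output_pool = queue.Queue(maxsize=1000)   # stands in for the Output Buffer Pool
DONE = object()                           # end-of-stream marker

def reader(paths):
    # Reader thread: stream input files (or chunks) into the input buffer pool.
    for path in paths:
        with open(path) as f:
            for line in f:
                input_pool.put(line.rstrip("\n"))
    input_pool.put(DONE)

def process_unit():
    # ProcessUnit thread: transform records and hand them to the writers.
    while True:
        record = input_pool.get()
        if record is DONE:
            input_pool.put(DONE)          # let the other ProcessUnits terminate too
            output_pool.put(DONE)
            break
        output_pool.put(record.upper())   # placeholder transformation

def writer(out_path, n_process_units):
    # Writer thread: drain the output buffer pool into the target (a flat file here).
    finished = 0
    with open(out_path, "w") as out:
        while finished < n_process_units:
            record = output_pool.get()
            if record is DONE:
                finished += 1
            else:
                out.write(record + "\n")

threads = [
    threading.Thread(target=reader, args=(["chunk1.txt", "chunk2.txt"],)),
    threading.Thread(target=process_unit),
    threading.Thread(target=process_unit),
    threading.Thread(target=writer, args=("loaded.txt", 2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()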
While they did not use commercially available ETL tools, they used an Oracle database;
the OLAP tool was Oracle Discoverer and the data mining tool was SAS Enterprise
Miner.
Acknowledgement
Figures were captured from the original articles.
Figure 4: The loading module in DWPP (LDR), used to load and transform data into
Data Warehouse tables, is entirely threaded, with Input/Output Buffer Pools and
Reader/Writer threads.
References
[1] Jovanka Adzic and Valter Fiore. Data Warehouse Population Platform. In DMDW,
September 2003.
[2] H. Gupta and D. Srivastava. The Data Warehouse of Newsgroups. In International
Conference on Database Theory, January 1999.
[3] J. Hammer, H. Garcia-Molina, W. Labio, J. Widom, and Y. Zhuge. The Stanford
Data Warehousing Project. IEEE Data Engineering Bulletin, 18(2):41–48, June
1995.
[4] M. Jarke and Y. Vassiliou. Data Warehouse Quality Design: A Review of the DWQ
Project. In Information Quality, 1997.