Data Warehouse
Warehousing
Metadata
3.6 Data lake, Architecture of Data lake, Data Warehouse Vs Data lake
Operational vs. Decision Support System
General
direction of
JMI – Enabled the flow of
Metadata data
Service
– driven tools
much as possible.
the data volume, which has to be managed and processed, and the number of
[Figure: multi-tier data warehouse architecture]
Data sources: operational DBs, other sources
Extract, Transform, Load, Refresh
Data storage: data warehouse, data marts, metadata, monitor & integrator
OLAP engine: OLAP server (served from the data warehouse)
Front-end tools: analysis, query, reports, data mining
Three Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects spanning
the entire organization
Data Mart
a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to specific,
selected groups, such as a marketing data mart
Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be
materialized
Extraction, Transformation, and Loading (ETL)
Data extraction
get data from multiple, heterogeneous, and external
sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse
format
Load
sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh
propagate the updates from the data sources to the
warehouse
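The five steps above can be sketched as a small pipeline. This is a minimal illustration, not a production design: the CSV sample, the `sales` table, the `daily_sales` view, and every function name are assumptions made for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational source (legacy DD/MM/YYYY dates).
RAW = """id,sold_on,amount
1,03/01/2003,10.50
2,03/01/2003,
3,04/01/2003,7.25
"""

def extract(text):
    """Data extraction: read rows from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Data cleaning: drop rows whose amount is missing."""
    return [r for r in rows if r["amount"].strip()]

def transform(rows):
    """Data transformation: convert legacy dates to ISO format, amounts to float."""
    out = []
    for r in rows:
        day, month, year = r["sold_on"].split("/")
        out.append((int(r["id"]), f"{year}-{month}-{day}", float(r["amount"])))
    return out

def load(conn, rows):
    """Load: insert rows, build an index, and define a summary view."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER PRIMARY KEY, sold_on TEXT, amount REAL)")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_day ON sales (sold_on)")
    conn.execute("CREATE VIEW IF NOT EXISTS daily_sales AS "
                 "SELECT sold_on, SUM(amount) AS total FROM sales GROUP BY sold_on")
    conn.commit()

def refresh(conn, changed_rows):
    """Refresh: propagate source updates by re-loading only the changed rows."""
    load(conn, changed_rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract(RAW))))
daily = conn.execute("SELECT sold_on, total FROM daily_sales ORDER BY sold_on").fetchall()
print(daily)
```

The row with the missing amount is rejected during cleaning. If the source later supplies it, `refresh(conn, [(2, "2003-01-03", 4.0)])` adds it to the warehouse and the summary view reflects the new total.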
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data definitions, data
mart locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies
Classification of Metadata
Determine whether dissimilar kinds are grouped together. In such a grouping,
a subject or subgroup does not inherit all the features of the superset,
which signifies that the knowledge and requirements about the
superset are not relevant for the members of the subset.
Determine whether the classes overlap.
Determine whether subordinates may have several superordinates.
Multiple supertypes for one subtype signify that the subordinate
contains the features of all its superordinates.
Evaluate whether the criteria for belonging to a class or group are
well defined.
Determine whether the types of relations between the concepts are made
clear and well defined.
Determine whether the subtype-supertype relations are differentiated
from composition relations and from object-role relations.
Data warehouse and Data Marts
Data warehouse: Data is received from the staging area.
Data mart: Data is received from star-join (facts and dimensions).
[Figure: lattice of cuboids over the dimensions time, item, location, supplier]
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
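The lattice of cuboids can be enumerated mechanically: a k-D cuboid is any choice of k dimensions out of the full set. The sketch below assumes only the four dimension names shown in the figure; the function name `cuboid_lattice` is made up for the illustration.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

def cuboid_lattice(dims):
    """Return {k: [k-D cuboids]} for k = 0..len(dims)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, conventionally written "all".
    print(f"{k}-D:", [",".join(c) or "all" for c in cuboids])

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

With n dimensions and no concept hierarchies there are 2^n cuboids in total, so four dimensions give 16.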
Conceptual Modeling of Data Warehouses
[Figure: star schema with a central Sales Fact Table and four dimension tables]
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
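A central fact table joined to its dimension tables can be sketched in SQLite as follows. This is an illustration under assumptions: only two of the dimensions are included, the tables are named `time_dim` and `item_dim` for the example, and all inserted rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT, supplier_type TEXT);
-- The fact table holds foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 5, 1, 'Q1', 2003), (2, 9, 4, 'Q2', 2003);
INSERT INTO item_dim VALUES (10, 'TV', 'Acme', 'electronics', 'wholesale'),
                            (11, 'Radio', 'Acme', 'electronics', 'retail');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0), (1, 11, 1, 50.0), (2, 10, 1, 400.0);
""")

# A typical star-join: aggregate a measure grouped by dimension attributes.
rows = conn.execute("""
SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
FROM sales_fact f
JOIN time_dim t ON t.time_key = f.time_key
JOIN item_dim i ON i.item_key = f.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter, i.item_name
""").fetchall()
print(rows)
```

Every query follows the same pattern: join the fact table to the dimensions it needs, then group by the dimension attributes of interest.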
Example of Snowflake Schema
[Figure: snowflake schema; the item and location dimensions are normalized into supplier and city tables]
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Example of Fact Constellation
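[Figure: fact constellation with Sales and Shipping fact tables sharing dimension tables]
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key
Shipping Fact Table: time_key, item_key, shipper_key, from_location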
A new dimension table row with the new value of the changed
attribute is added.
A new column called the effective date is added in the dimension
table.
The original row in the dimension table is not changed.
The key of the original row remains the same.
The new row is inserted with a new surrogate key in the dimension table.
Customer key   Customer ID   Customer Name   Marital status   Address
-   -   -   -   -
22522134   C12345   Jenny David   Single   AAAAAAA
-   -   -   -   -
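The row-versioning steps above (commonly called a Type 2 change to a slowly changing dimension) can be sketched as follows. The function names, the key generator, the new value "Married", and both dates are assumptions made for the illustration; only the customer ID, name, marital status, and address come from the sample row.

```python
from datetime import date
from itertools import count

_next_key = count(1)  # hypothetical surrogate key generator

def insert_row(dim, attrs, effective):
    """Append a dimension row with a fresh surrogate key and an effective date."""
    row = dict(attrs, customer_key=next(_next_key), effective_date=effective)
    dim.append(row)
    return row

def apply_type2_change(dim, customer_id, attribute, new_value, effective):
    """Add a new row carrying the changed attribute; earlier rows stay untouched."""
    versions = [r for r in dim if r["customer_id"] == customer_id]
    latest = max(versions, key=lambda r: r["effective_date"])
    attrs = {k: v for k, v in latest.items()
             if k not in ("customer_key", "effective_date")}
    attrs[attribute] = new_value
    return insert_row(dim, attrs, effective)

customer_dim = []
insert_row(customer_dim,
           {"customer_id": "C12345", "name": "Jenny David",
            "marital_status": "Single", "address": "AAAAAAA"},
           date(2003, 1, 1))
apply_type2_change(customer_dim, "C12345", "marital_status", "Married", date(2003, 3, 1))

for row in customer_dim:
    print(row["customer_key"], row["marital_status"], row["effective_date"])
```

Both versions share the same customer ID but carry different surrogate keys, so facts recorded before the change keep pointing at the original row.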
There is a need to track history with both old and new value of the same
attribute.
Type 3 changes are used to compare performance across the transition.
They enable the users to track data in both forward and backward
directions.
An “old” field is added in the dimension table for the affected attribute.
The existing value of the attribute is pushed down from the “current” field
to the “old” field.
The new value of the attribute is kept in the “current” field.
A “current” effective date field is also added for the changed
attribute.
The key of the row is not affected.
No new dimension row is added in the dimension table.
The existing queries will automatically switch to the “current” value.
Revision must be done to queries that need to use the “old” value.
This technique works well with one soft change at a time. If there
is a succession of changes, more sophisticated techniques must be
devised.
Salesperson key   Salesperson ID   Salesperson Name   Old Location   Current Location   Effective date
-   -   -   -   -
22522134   C12345   Jenny David   Delhi   1/Mar/03
-   -   -   -   -
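The Type 3 mechanics described above can be sketched as an in-place update of a single row. The field names, the starting state (Delhi as the current location), and the new values "Mumbai" and "Pune" are assumptions made for the illustration.

```python
from datetime import date

def apply_type3_change(row, attribute, new_value, effective):
    """Push the current value into the "old" field and store the new value.

    The row key is untouched and no new dimension row is created.
    """
    updated = dict(row)
    updated[f"old_{attribute}"] = row[f"current_{attribute}"]
    updated[f"current_{attribute}"] = new_value
    updated["effective_date"] = effective
    return updated

salesperson = {
    "salesperson_key": 22522134,
    "salesperson_id": "C12345",
    "name": "Jenny David",
    "old_location": None,
    "current_location": "Delhi",   # assumed starting state
    "effective_date": None,
}

moved = apply_type3_change(salesperson, "location", "Mumbai", date(2003, 3, 1))
print(moved["old_location"], "->", moved["current_location"])

# A second change overwrites the "old" field, so the first value is lost.
moved_again = apply_type3_change(moved, "location", "Pune", date(2004, 1, 1))
print(moved_again["old_location"], "->", moved_again["current_location"])
```

The second call shows why this technique suits one change at a time: after two moves the row remembers only the last two locations.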
Organizations that successfully generate business value from their data will [...]
leaders were able to do new types of analytics, such as machine learning, over new
sources such as log files, click-stream data, social media, and internet-connected
devices stored in the data lake. This helped them to identify and act upon [...]
decisions.
A Concept Hierarchy: Dimension (location)
[Figure: concept hierarchy for the location dimension, rooted at "all"]
Specification of hierarchies
Schema hierarchy
day < {month < quarter; week} < year
Set_grouping hierarchy
{1..10} < inexpensive
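Both kinds of hierarchy can be sketched as functions that map a low-level value to its ancestors. The function names are hypothetical, ISO week numbering is an assumption (an ISO week near a year boundary can belong to the neighbouring year), and the "other" label stands in for price groups the slide does not list.

```python
from datetime import date

def schema_hierarchy(d):
    """Schema hierarchy: day < {month < quarter; week} < year."""
    _, iso_week, _ = d.isocalendar()
    return {
        "day": d.isoformat(),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> Q1, 4-6 -> Q2, ...
        "week": iso_week,
        "year": d.year,
    }

def price_group(price):
    """Set-grouping hierarchy: values in {1..10} roll up to 'inexpensive'."""
    return "inexpensive" if 1 <= price <= 10 else "other"

levels = schema_hierarchy(date(2003, 3, 1))
print(levels)
print(price_group(7), price_group(25))
```

Month and week are both parents of day but neither contains the other, which is why the hierarchy is a partial order rather than a single chain.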
Multidimensional Data
[Figure: multidimensional data cube; legible axis labels: Product, Quarter]
A Sample Data Cube
[Figure: lattice of cuboids over the dimensions product, date, country]
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
Fig. 3.10 Typical OLAP Operations
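The core operations can be sketched on a tiny cube held in a dictionary. Everything here is an assumption made for the illustration: the three dimensions, the sample cells, and the function names. Roll-up is shown only in its dimension-reduction form.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, city, item) -> units sold.
DIMS = ("quarter", "city", "item")
cube = {
    ("Q1", "Delhi", "TV"): 5,
    ("Q1", "Delhi", "Radio"): 3,
    ("Q1", "Mumbai", "TV"): 4,
    ("Q2", "Delhi", "TV"): 6,
}

def roll_up(cells, drop):
    """Roll up by dimension reduction: sum the measure over the dropped dimension."""
    i = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)

def slice_(cells, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}

def dice(cells, criteria):
    """Dice: select on several dimensions; criteria maps dimension -> allowed values."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] in allowed for d, allowed in criteria.items())}

def pivot(cells, order):
    """Pivot: reorder the axes of every cell without changing any value."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cells.items()}

by_quarter_city = roll_up(cube, "item")
q1 = slice_(cube, "quarter", "Q1")
q1_tv = dice(cube, {"quarter": {"Q1"}, "item": {"TV"}})
rotated = pivot(cube, ("item", "city", "quarter"))
print(by_quarter_city)
```

Drill-down is the reverse of `roll_up`: it needs the more detailed cells, so in practice it is answered from the base cuboid rather than computed from the summary. The helpers assume cells keyed by all three dimensions, so they apply to the base cuboid and not to the output of `roll_up`.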
A Star-Net Query Model
[Figure: star-net query model. Each circle is called a footprint.]
Customer Orders: CONTRACTS, ORDER
Shipping Method: AIR-EXPRESS, TRUCK
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM
Location: CITY, COUNTRY, REGION
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Customer
Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
Chapter 4: Data Warehousing and On-line Analytical Processing
Multi-Tier Data Warehouse
Distributed Data Marts