What Is Data Warehouse?: Data Mining by IK Unit 2
FROM TABLES AND SPREADSHEETS TO DATA CUBES (WHY TO USE DATA CUBES INSTEAD OF TABLES AND SPREAD SHEETS)
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
Dimension tables describe each dimension, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
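The lattice of cuboids can be made concrete with a minimal sketch that enumerates every subset of a dimension list, from the n-D base cuboid down to the 0-D apex cuboid. The dimension names and the function name `cuboid_lattice` are illustrative assumptions, not part of the original notes.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every cuboid (subset of the dimensions) by its dimensionality."""
    lattice = {}
    # Walk from the n-D base cuboid down to the 0-D apex cuboid.
    for k in range(len(dimensions), -1, -1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice


# Hypothetical dimensions for a sales cube.
dims = ["time", "item", "location"]
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids:", cuboids)
```

With n dimensions and no concept hierarchies, the lattice contains 2^n cuboids, so three dimensions give eight.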
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data and (2) a set of smaller attendant tables (dimension tables), one for each dimension.
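As a rough illustration of a star schema, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and joins them to summarize a measure. The table names `sales_fact`, `time_dim`, and `item_dim` and all sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension (hypothetical names and rows).
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);

-- Fact table: measures plus a key to each dimension table.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    dollars_sold REAL
);

INSERT INTO time_dim VALUES (1, 'Q1', 2012), (2, 'Q2', 2012);
INSERT INTO item_dim VALUES (1, 'TV', 'Acme', 'home entertainment'),
                            (2, 'Laptop', 'Acme', 'computer');
INSERT INTO sales_fact VALUES (1, 1, 400.0), (1, 2, 900.0), (2, 1, 450.0);
""")

# Join the fact table to its dimensions and summarize by quarter and item type.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM sales_fact AS s
    JOIN time_dim AS t ON s.time_key = t.time_key
    JOIN item_dim AS i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```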
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables.
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
EXAMPLES FOR DEFINING STAR, SNOWFLAKE, AND FACT CONSTELLATION SCHEMAS

BASIC SYNTAX
Cube Definition (Fact Table):
    define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table):
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables):
    define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

DEFINING STAR SCHEMA IN DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

DEFINING SNOWFLAKE SCHEMA IN DMQL
    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city(city_key, province_or_state, country))

DEFINING FACT CONSTELLATION IN DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales
CONCEPT HIERARCHIES
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
SCHEMA HIERARCHY: A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy. A schema hierarchy may formally express existing relationships between attributes. For example, the attributes of the location dimension form the total order street < city < province_or_state < country.
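A minimal sketch of how a schema hierarchy supports summarization: a lower-level attribute (city) is mapped to a higher-level one (country), and a measure is rolled up along that mapping. The city-to-country mapping, the sales figures, and the function name `roll_up` are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from the low-level concept (city) to the
# higher-level concept (country).
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

# Hypothetical measure values at the city level.
sales_by_city = {"Vancouver": 100, "Toronto": 150, "Chicago": 200}


def roll_up(values, mapping):
    """Sum a measure from a lower hierarchy level into the next higher level."""
    totals = defaultdict(int)
    for low_level, value in values.items():
        totals[mapping[low_level]] += value
    return dict(totals)


sales_by_country = roll_up(sales_by_city, city_to_country)
print(sales_by_country)
```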
SET-GROUPING HIERARCHY: Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among groups of values.
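A set-grouping hierarchy can be sketched by discretizing a numeric attribute into ordered groups. The price boundaries and group labels below are assumptions chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical upper bounds (inclusive) of the first two price groups.
BOUNDS = [100, 500]
LABELS = ["inexpensive", "moderately_priced", "expensive"]


def price_group(price):
    """Map a price to its group: <=100, (100, 500], or >500."""
    return LABELS[bisect_left(BOUNDS, price)]


for price in (100, 250, 900):
    print(price, "->", price_group(price))
```

The labels are listed in increasing price order, so the groups themselves form a total order.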
BOTTOM TIER: The bottom tier is a warehouse database server that is almost always a relational database system. The data are extracted using application program interfaces known as gateways. Examples of gateways include ODBC and JDBC. This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
MIDDLE TIER: The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
TOP TIER: The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools.
ENTERPRISE WAREHOUSE: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers.
DATA MART: A data mart is a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
VIRTUAL WAREHOUSE: A virtual warehouse is a set of views over operational databases. Only some of the possible summary views may be materialized.
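The idea of a virtual warehouse as views over operational data can be sketched with an in-memory SQLite database. The `orders` table, the `sales_by_region` view, and the sample rows are hypothetical. The view is not materialized, so it is recomputed from the operational table on every query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Asia', 120.0), (2, 'Europe', 80.0), (3, 'Asia', 30.0);

-- Summary view defined directly over the operational data.
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region;
""")

query = "SELECT region, total_amount FROM sales_by_region ORDER BY region"
summary = conn.execute(query).fetchall()
print(summary)

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (4, 'Europe', 20.0)")
updated = conn.execute(query).fetchall()
print(updated)
```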
A recommended approach is to develop the data warehouse incrementally.
First, a high-level corporate data model is defined within a reasonably short period.
Second, independent data marts can be implemented in parallel with the enterprise warehouse, based on the same corporate data model as above.
Third, distributed data marts can be constructed to integrate different data marts via hub servers.
Finally, a multitier data warehouse is constructed, where the enterprise warehouse is the sole custodian of all warehouse data, which is then distributed to the various dependent data marts.
Materialization of data cube: Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization).
Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
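One reason partial materialization can work is that a coarser cuboid can often be derived from a finer one that is already stored. The sketch below assumes a sum measure (which can be re-aggregated safely) and a hypothetical materialized (city, item) cuboid. The function name `aggregate` is illustrative.

```python
from collections import defaultdict

# Hypothetical materialized cuboid: (city, item) -> dollars_sold.
city_item = {
    ("Vancouver", "TV"): 400,
    ("Vancouver", "Laptop"): 900,
    ("Toronto", "TV"): 450,
}


def aggregate(cuboid, keep):
    """Derive a coarser cuboid by summing over the dimensions not in `keep`.

    `keep` lists the positions of the dimensions to retain in each key.
    """
    out = defaultdict(int)
    for key, value in cuboid.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


by_city = aggregate(city_item, keep=[0])  # (city) cuboid, computed on demand
apex = aggregate(city_item, keep=[])      # apex cuboid: the grand total
print(by_city)
print(apex)
```

This shortcut does not carry over directly to measures such as the median, which cannot be recomputed from partial aggregates.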
January 23, 2012 Data Mining: Concepts and Techniques 49
Cube Operation
Cube definition and computation in DMQL:
    define cube sales[item, city, year]: sum(sales_in_dollars)
    compute cube sales
Transform it into a SQL-like language (with a new operator, cube by, introduced by Gray et al. '96):
    SELECT item, city, year, SUM(amount)
    FROM SALES
    CUBE BY item, city, year
For a generic three-dimensional cube on (date, product, customer), the following group-bys need to be computed:
    (date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()
Lattice of cuboids for sales[item, city, year]:
    ()
    (city) (item) (year)
    (city, item) (city, year) (item, year)
    (city, item, year)
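The effect of the cube by operator can be sketched in plain Python by computing a sum for every subset of the grouping attributes. This is a naive illustration that rescans the rows for each group-by, not how an OLAP engine computes a cube. The sample rows and the function name `cube_by` are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows of the SALES table.
rows = [
    {"item": "TV", "city": "Vancouver", "year": 2011, "amount": 400},
    {"item": "TV", "city": "Toronto", "year": 2011, "amount": 450},
    {"item": "Laptop", "city": "Vancouver", "year": 2012, "amount": 900},
]


def cube_by(rows, dims, measure):
    """Compute SUM(measure) for every group-by over subsets of `dims`."""
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(totals)
    return result


cube = cube_by(rows, ["item", "city", "year"], "amount")
for group, totals in cube.items():
    print(group, totals)
```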
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
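The bitmap indexes above can be reproduced with a short sketch that builds one bit vector per distinct value and answers a conjunctive query with a bitwise AND. The function name `build_bitmap_index` is illustrative, and the data is the five-row base table from the slide.

```python
# Base table from the slide: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(values):
    """Return {distinct value: bit vector with a 1 where the record has it}."""
    index = {}
    for pos, value in enumerate(values):
        index.setdefault(value, [0] * len(values))[pos] = 1
    return index


region_index = build_bitmap_index([row[1] for row in base_table])
type_index = build_bitmap_index([row[2] for row in base_table])

# Query: Region = 'Asia' AND Type = 'Dealer', answered with a bitwise AND.
bits = [a & b for a, b in zip(region_index["Asia"], type_index["Dealer"])]
matches = [base_table[i][0] for i, bit in enumerate(bits) if bit]
print(matches)
```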
An exception is a data cube cell whose value deviates significantly from its anticipated value, which is calculated through statistical measures.
The degree of surprise is defined as the deviation from the anticipated value of a data cell.
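The notes do not say which statistical model produces the anticipated value, so the sketch below makes a simple assumption: the anticipated value of a cell is the mean of all cells, and the degree of surprise is the deviation from that mean in units of standard deviation. The threshold, the cell values, and the function name `find_exceptions` are hypothetical.

```python
from statistics import mean, pstdev


def find_exceptions(cells, threshold=2.0):
    """Return {cell: degree of surprise} for cells beyond the threshold.

    Assumption: the anticipated value is the mean over all cells, and
    surprise is the standardized deviation from it.
    """
    values = list(cells.values())
    anticipated = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return {}  # all cells are identical, so nothing is surprising
    return {
        cell: (value - anticipated) / spread
        for cell, value in cells.items()
        if abs(value - anticipated) / spread > threshold
    }


# Hypothetical monthly sales cells; June is unusually high.
monthly_sales = {"Jan": 100, "Feb": 102, "Mar": 98, "Apr": 101, "May": 99, "Jun": 180}
exceptions = find_exceptions(monthly_sales)
print(exceptions)
```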