Data Warehouse Concept
Data warehousing, as the name suggests, is a game played with data: the data for today, yesterday, the day before yesterday, a week ago, a year ago, stretching back perhaps to the age of your great-grandfather.
The data warehouse is mainly the term companies use for how they manage their data. The larger the company, the bigger the data, and the bigger the requirement to manage that data.
The concept of a data warehouse is to have a centralized database that captures information from different parts of the business process. What qualifies as a data warehouse is determined by the data it collects and how that data is used by the company and the individuals it supports. Data warehousing is the method businesses use to create and maintain a companywide view of their information.
The History
Many people, when they first hear the basic principles of data warehousing — particularly copying
data from one place to another — think (or even say),
“That doesn’t make any sense! Why waste time copying and moving data, and storing it in a different
database? Why not just get it directly from its original location when someone needs it?”
The 1970s — the preparation
The computing world was dominated by the mainframe in those days. Real data-processing
applications, the ones run on the corporate mainframe, almost always had a complicated set of files or
early-generation databases (not the table-oriented relational databases most
applications use today) in which they stored data.
Although the applications did a fairly good job of performing routine data processing functions, data
created as a result of these functions (such as information about customers, the products they
ordered, and how much money they spent) was locked away in the depths of the files and databases.
The way the data was stored made it hard to use for any broader purpose.
It was almost impossible, for example, to see how retail stores in one region were doing against stores in another region, against their competitors, or even against their own performance in some earlier period.
Between 1976 and 1979, the concept grew out of research driven by discussions with Citibank's advanced technology group, and it came to be called Teradata. The name Teradata was chosen to symbolize the ability to manage terabytes (trillions of bytes) of data.
The 1980s — the birth
As the era began, computers were making their presence felt everywhere, and organizations started to have them in every department. How could an organization hope to
compete if its data was scattered all over the place on different computer systems that weren’t even all
under the control of the centralized data processing department? (Never mind that even when the data
was all stored on mainframes, it was still isolated in different files and databases, so it was
just as inaccessible.)
Special software then came into existence that made life simpler for everyone using or analyzing the data. This new type of software,
called a distributed database management system (distributed DBMS, or
DDBMS), would magically pull the requested data from databases across the
organization, bring all the data back to the same place, and then consolidate
it, sort it, and do whatever else was necessary to answer the user’s question.
Although this was what everyone had been expecting, the chocolate is always better when it is sweeter, and the sweetest is perfect. The sweetest chocolate turned out to be the world's first relational database management system (RDBMS) for decision support.
Data Warehousing: The Need
The need has been established for a companywide view of data in operational systems. Data warehouses are designed to help management and businesses analyze data, and this helps to fill the need for subject-oriented concepts. Integration is closely related to subject orientation. Data
warehouses must put data from disparate sources into a consistent format. They must resolve such
problems as naming conflicts and inconsistencies among units of measure. When this concept is
achieved, the data warehouse is considered to be integrated.
The form of the stored data has nothing to do with whether something is a data warehouse. A data
warehouse can be normalized or de-normalized. It can be a relational database, multidimensional
database, flat file, hierarchical database, object database, etc. Data warehouse data often gets
changed. Also, data warehouses will most often be directed to a specific action or entity.
Data warehouse success cannot be guaranteed for each project. The techniques involved can become quite complicated, and erroneous data can cause errors and failure. When management support is strong, resources are committed for business value, and an enterprise vision is established, the end results are more likely to be helpful for the organization or business. The main factors that create the need for data warehousing in most businesses today are the requirement for a companywide view of quality information and the separation of departmental information from operational systems for improved performance in managing data.
Dimension tables are typically small, ranging from a few to several thousand rows. Occasionally
dimensions can grow fairly large, however. For example, a large credit card company could have a
customer dimension with millions of rows. For example, a customer dimension could contain the following columns:
Customer_key
Customer_full_name
Customer_city
Customer_state
Customer_country
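As a concrete sketch, such a dimension could be declared with Python's built-in sqlite3 module. The column names follow the list above; the types and the sample row are assumptions for illustration only.

    import sqlite3

    # In-memory database for illustration; a real warehouse would run on a
    # dedicated platform, but the shape of the dimension table is the same.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer_dim (
            customer_key       INTEGER PRIMARY KEY,  -- surrogate key
            customer_full_name TEXT,
            customer_city      TEXT,
            customer_state     TEXT,
            customer_country   TEXT
        )
    """)
    # Sample row (values invented for illustration).
    conn.execute("INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?)",
                 (1001, "Robin", "Mumbai", "Maharashtra", "India"))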
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.
Hierarchy: The specification of levels that represents the relationships between different attributes within a
dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month →
Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales
amount would be such a measure. This measure is stored in the fact table with the appropriate
granularity. For example, it can be sales amount by store by day. In this case, the fact table would
contain three columns: A date column, a store column, and a sales amount column.
Fact tables contain keys to dimension tables as well as measurable facts that data analysts would
want to examine. For example, a store selling automotive parts might have a fact table recording a
sale of each item.
Fact tables can grow very large, with millions or even billions of rows. It is important to identify the lowest level of facts that makes sense to analyze for your business; this is often referred to as the fact table "grain".
Fact and Fact Table Types
Types of Facts
There are three types of facts:
Additive: Additive facts are facts that can be summed up through all of the dimensions in the
fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we
are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose (note: pay attention to the purpose of the table) of this table is to record the sales
amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case,
Sales_Amount is an additive fact, because you can sum up this fact along any of the three
dimensions present in the fact table -- date, store, and product. For example, the sum of
Sales_Amount for all 7 days in a week represents the total sales amount for that week.
The second example assumes that we are a bank, with a fact table containing the following columns:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as
well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the
facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts
(what's the total current balance for all accounts in the bank?), but it does not make sense to add them
up through time (adding up all current balances for a given account for each day of the month does
not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
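The distinction is easy to see in code. Below is a minimal sqlite3 sketch with invented values: summing the additive Sales_Amount across dates yields a meaningful total, while summing the semi-additive Current_Balance for one account across days does not.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    conn.execute("CREATE TABLE sales (date TEXT, store TEXT, product TEXT, sales_amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                     [("2003-01-01", "S1", "P1", 100.0),
                      ("2003-01-02", "S1", "P1", 150.0)])
    # Additive: summing across the date dimension is meaningful.
    print(conn.execute("SELECT SUM(sales_amount) FROM sales").fetchone()[0])  # 250.0

    conn.execute("CREATE TABLE balances (date TEXT, account TEXT, current_balance REAL)")
    conn.executemany("INSERT INTO balances VALUES (?, ?, ?)",
                     [("2003-01-01", "A1", 500.0),
                      ("2003-01-02", "A1", 500.0)])
    # Semi-additive: summing across accounts on ONE day is meaningful,
    # but summing the same account across days (1000.0 here) is not.
    print(conn.execute(
        "SELECT SUM(current_balance) FROM balances WHERE date = '2003-01-01'"
    ).fetchone()[0])  # 500.0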
Star Schema
What is a star schema? The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables.
In the star schema design, a single object (the fact table) sits in the middle and is radially connected
to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as
a single table. The primary key in each dimension table is related to a foreign key in the fact table.
A star schema can be simple or complex. A simple star consists of one fact table; a complex star can
have more than one fact table.
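To make the primary-key/foreign-key wiring concrete, a typical star-schema query joins the fact table in the middle to each surrounding dimension. The sketch below assumes the invented sales_fact, date_dim, and store_dim tables from the fact-table example earlier:

    # A star join: the fact table joined to each dimension by its key.
    star_join = """
        SELECT d.full_date, s.store_name, SUM(f.sales_amount) AS total_sales
        FROM sales_fact AS f
        JOIN date_dim   AS d ON d.date_key  = f.date_key
        JOIN store_dim  AS s ON s.store_key = f.store_key
        GROUP BY d.full_date, s.store_name
    """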
Snowflake schema
The snowflake schema is an extension of the star schema, where each point of the star explodes into
more points. In a star schema, each dimension is represented by a single dimensional table, whereas
in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each
representing a level in the dimensional hierarchy.
For example, consider a Time Dimension that consists of 2 different hierarchies:
1. Year → Month → Day
2. Week → Day
We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for month,
a lookup table for week, and a lookup table for day. Year is connected to Month, which is then
connected to Day. Week is only connected to Day.
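In the same hedged sqlite3 style, those four lookup tables might be declared as follows (the key and column names are assumptions for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")

    # Each level of the hierarchy becomes its own normalized lookup table.
    conn.execute("CREATE TABLE year_lu (year_key INTEGER PRIMARY KEY, year INTEGER)")
    conn.execute("""
        CREATE TABLE month_lu (
            month_key INTEGER PRIMARY KEY,
            month     INTEGER,
            year_key  INTEGER REFERENCES year_lu(year_key)     -- Year -> Month
        )
    """)
    conn.execute("CREATE TABLE week_lu (week_key INTEGER PRIMARY KEY, week INTEGER)")
    conn.execute("""
        CREATE TABLE day_lu (
            day_key   INTEGER PRIMARY KEY,
            full_date TEXT,
            month_key INTEGER REFERENCES month_lu(month_key),  -- Month -> Day
            week_key  INTEGER REFERENCES week_lu(week_key)     -- Week  -> Day
        )
    """)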
The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
Fact Constellation Schema
This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. The split of a fact table is done only when we want to focus on aggregation over a few facts and dimensions.
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity,
we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.
Slowly Changing Dimensions
The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a
nutshell, this applies to cases where the attribute for a record varies over time. We give an example
below:
Robin is a customer with ABC Inc. He first lived in Mumbai. So, the original entry in the customer
lookup table has the following record:
Customer Key    Name     State
1001            Robin    Mumbai
At a later date, he moved to Pune. How should ABC Inc. now modify its customer table to reflect this
change? This is the "Slowly Changing Dimension" problem.
There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change.
We next take a look at each of the scenarios and how the data model and the data looks like for each
of them. Finally, we compare and contrast among the three alternatives.
Type 1 Slowly Changing Dimension
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information.
In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key    Name     State
1001            Robin    Mumbai
After Robin moved from Mumbai to Pune, the new information replaces the original record, and we have
the following table:
Customer Key    Name     State
1001            Robin    Pune
Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The new record gets its
own primary key.
In our example, recall we originally have the following table:
Customer Key    Name     State
1001            Robin    Mumbai
After Robin moved from Mumbai to Pune, we add the new information as a new row into the table:
Customer Key    Name     State
1001            Robin    Mumbai
1005            Robin    Pune
Advantages:
This allows us to accurately keep all historical information.
Disadvantages:
This will cause the size of the table to grow fast. In cases where the number of rows for the
table is very high to start with, storage and performance can become a concern.
This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to
track historical changes.
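As a minimal sketch of the mechanics, here is a Type 2 change in Python with sqlite3; the table is the invented customer dimension from the example, and each change simply inserts a new row under a fresh surrogate key:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer_dim (
            customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            name  TEXT,
            state TEXT
        )
    """)
    conn.execute("INSERT INTO customer_dim (name, state) VALUES ('Robin', 'Mumbai')")

    def scd_type2_change(conn, name, new_state):
        """Type 2: leave the old row untouched and add a new row for the change."""
        conn.execute("INSERT INTO customer_dim (name, state) VALUES (?, ?)",
                     (name, new_state))

    scd_type2_change(conn, "Robin", "Pune")
    print(conn.execute("SELECT * FROM customer_dim").fetchall())
    # [(1, 'Robin', 'Mumbai'), (2, 'Robin', 'Pune')] -- history preserved

Real implementations usually also carry an effective-date range or a current-row flag on each row so that queries can pick out the active version; that detail is omitted here for brevity.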
Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of
interest, one indicating the original value, and one indicating the current value. There will also be a
column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key    Name     State
1001            Robin    Mumbai
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Robin moved from Mumbai to Pune, the original information gets updated, and we have the
following table (assuming the effective date of change is January 15, 2003):
Customer Key    Name     Original State    Current State    Effective Date
1001            Robin    Mumbai            Pune             15-JAN-2003
Advantages:
This does not increase the size of the table, since new information is updated.
This allows us to keep some part of history.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Robin later moves to Delhi on December 15, 2003, the Pune information will be
lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
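For contrast with Type 2, a Type 3 change is an in-place UPDATE: only the Current State and Effective Date columns move, and the Original State column keeps the very first value. A minimal sqlite3 sketch, using the same invented table:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer_dim (
            customer_key   INTEGER PRIMARY KEY,
            name           TEXT,
            original_state TEXT,
            current_state  TEXT,
            effective_date TEXT
        )
    """)
    conn.execute("INSERT INTO customer_dim VALUES (1001, 'Robin', 'Mumbai', 'Mumbai', NULL)")

    # Type 3: overwrite in place; a later move to Delhi would erase 'Pune'.
    conn.execute(
        "UPDATE customer_dim SET current_state = ?, effective_date = ? "
        "WHERE customer_key = ?",
        ("Pune", "2003-01-15", 1001))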
Data Integrity
Data integrity refers to the validity of data, meaning data is consistent and correct. In the data
warehousing field, we frequently hear the term, "Garbage In, Garbage Out." If there is no data integrity
in the data warehouse, any resulting report and analysis will not be useful.
In a data warehouse or a data mart, there are three areas where data integrity needs to be enforced:
Database level
We can enforce data integrity at the database level. Common ways of enforcing data integrity include:
Referential integrity
The relationship between the primary key of one table and the foreign key of another table must
always be maintained. For example, a primary key cannot be deleted if there is still a foreign key that
refers to this primary key.
Primary key / unique constraint
Primary keys and the UNIQUE constraint are used to make sure every row in a table can be uniquely
identified.
Not NULL vs. NULL-able
Columns identified as NOT NULL may not contain a NULL value.
Valid Values
Only allowed values are permitted in the database. For example, if a column can only have positive
integers, a value of '-1' cannot be allowed.
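All four mechanisms can be declared directly in the table definitions. A hedged sqlite3 sketch, with tables and columns invented for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # referential integrity is opt-in in SQLite

    conn.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY)")
    conn.execute("""
        CREATE TABLE order_fact (
            order_id    INTEGER UNIQUE,                       -- unique constraint
            product_key INTEGER NOT NULL                      -- NOT NULL
                        REFERENCES product_dim(product_key),  -- referential integrity
            quantity    INTEGER NOT NULL
                        CHECK (quantity > 0)                  -- valid values only
        )
    """)

    conn.execute("INSERT INTO product_dim VALUES (1)")
    conn.execute("INSERT INTO order_fact VALUES (10, 1, 5)")    # accepted
    # conn.execute("INSERT INTO order_fact VALUES (11, 1, -1)") # rejected by CHECK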
OLTP vs. OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can
assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to
analyze it.
Types of OLAP
Depending on the underlying technology used, OLAP can be broadly divided into two different camps: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The
storage is not in the relational database, but in proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and
dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube is
built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data
in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only
summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist in the
organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and
capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of
traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause in the SQL statement.
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on
data size of the underlying relational database. In other words, ROLAP itself places no limitation on
data amount.
Can leverage functionalities inherent in the relational database: Often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational
database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL
queries) in the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements
to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to
perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by
what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box
complex functions as well as the ability to allow users to define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information,
HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill
through" from the cube into the underlying relational data.