Data Warehouse Notes
INTRO
What is a data warehouse?
A data warehouse (DWH) is a large store of data accumulated from a wide
range of sources within a company and used to guide management decisions.
A DWH is typically used to connect and analyze business data from
heterogeneous sources.
The data warehouse is the core of the BI system which is built for data
analysis and reporting.
It is a blend of technologies and components which aids the strategic use of
data.
It is an electronic store of a large amount of business information,
designed for query and analysis instead of transaction processing.
Data Warehouse Architecture
[Figure: layered architecture. Labels, top to bottom: Clients; Query &
Analysis; Warehouse and Metadata; Integration; Sources.]
What is Data Warehousing?
Data warehousing is the process of collecting and
managing data from varied sources to provide
meaningful business insights.
Characteristics of data warehouse
• Subject Oriented: Data that gives information about a
particular subject instead of about a company's ongoing
operations.
• Integrated: Data that is gathered into the data warehouse from
a variety of sources and merged into a coherent whole.
• Time-variant: All data in the data warehouse is identified with
a particular time period.
• Non-volatile: Data is stable in a data warehouse. More data is
added but data is never removed. This enables management to
gain a consistent picture of the business.
Characteristics of data warehouse
• Data warehousing combines data from multiple, usually varied,
sources into one comprehensive and easily manipulated database.
• Common ways of accessing a data warehouse include queries,
analysis and reporting.
• Because data warehousing produces a single database, the number
of sources is limited only by the volume the system can handle.
• The final result is homogeneous data, which can be more easily
manipulated.
DATABASE VS DATA WAREHOUSE
Parameter | Database | Data Warehouse
Purpose | Is designed to record | Is designed to analyze
Processing Method | The database uses Online Transaction Processing (OLTP). | The data warehouse uses Online Analytical Processing (OLAP).
Usage | The database helps to perform fundamental operations for your business. | The data warehouse allows you to analyze your business.
Tables and Joins | Tables and joins of a database are complex as they are normalized. | Tables and joins are simple in a data warehouse because they are denormalized.
Orientation | Is an application-oriented collection of data | Is a subject-oriented collection of data
Storage limit | Generally limited to a single application | Stores data from any number of applications
DATABASE VS DATA WAREHOUSE
Parameter | Database | Data Warehouse
Availability | Data is available in real time. | Data is refreshed from source systems as and when needed.
Modeling | ER modeling techniques are used for designing. | Data modeling techniques are used for designing.
Technique | Capture data | Analyze data
Data Type | Data stored in the database is up to date. | Current and historical data is stored in the data warehouse; it may not be up to date.
Storage of data | A flat relational approach is used for data storage. | The data warehouse uses a dimensional and normalized approach for the data structure. Example: star and snowflake schema.
Query Type | Simple transaction queries are used. | Complex queries are used for analysis purposes.
Data Summary | Detailed data is stored in a database. | It stores highly summarized data.
Applications of Data Warehousing
Sector | Usage
Airline | Used for airline system management operations such as crew assignment, route analysis, frequent flyer program discount schemes for passengers, etc.
Banking | Used in the banking sector to manage the resources available on the desk effectively.
Healthcare sector | Data warehouses are used to strategize and predict outcomes, create patients' treatment reports, etc. Advanced machine learning and big data enable data warehouse systems to predict ailments.
Applications of Data Warehousing
[Table: Industry | Functional areas of use | Strategic use (rows not recovered)]
• Extraction, transformation, and loading (ETL) – a process that extracts information from
internal and external databases, transforms the information using a common set of enterprise
definitions, and loads the information into a data warehouse.
Data Mart
A subset of a data warehouse that is highly focused and
isolated for a specific population of users
Example: Marketing data mart, Sales data mart, etc.
ETL process
What is ETL?
ETL is a process that extracts the data from different
source systems, then transforms the data (like applying
calculations, concatenations, etc.) and finally loads the
data into the Data Warehouse system.
ETL stands for Extract, Transform and Load.
ETL PROCESS
Extraction, Transformation, and Loading (ETL)
Data extraction
◦ get data from multiple, heterogeneous, and external sources
Data cleaning
◦ detect errors in the data and rectify them when possible
Data transformation
◦ convert data from legacy or host format to warehouse format
Load
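The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources are made-up CSV strings held in memory, an in-memory SQLite database stands in for the warehouse, and all names (CRM_CSV, SHOP_CSV, run_etl, the sales table, amounts stored in cents) are hypothetical and not part of these notes.

```python
import csv
import io
import sqlite3

# Hypothetical source data: two systems exporting sales in different formats.
CRM_CSV = "customer,amount_usd\nalice,10.50\nbob,not-a-number\n"
SHOP_CSV = "CUSTOMER;AMOUNT_CENTS\nCarol;2500\nalice;300\n"


def extract():
    """Data extraction: read rows from multiple, heterogeneous sources."""
    crm = list(csv.DictReader(io.StringIO(CRM_CSV)))
    shop = list(csv.DictReader(io.StringIO(SHOP_CSV), delimiter=";"))
    return crm, shop


def clean(rows, amount_field):
    """Data cleaning: drop rows whose amount cannot be parsed as a number."""
    good = []
    for row in rows:
        try:
            float(row[amount_field])
        except ValueError:
            continue  # error detected; it cannot be rectified here, so skip
        good.append(row)
    return good


def transform(crm, shop):
    """Data transformation: convert both source formats to one target format.

    Assumed warehouse format: (lower-case customer name, amount in cents).
    """
    out = [(r["customer"].lower(), round(float(r["amount_usd"]) * 100))
           for r in crm]
    out += [(r["CUSTOMER"].lower(), int(r["AMOUNT_CENTS"])) for r in shop]
    return out


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run extract, clean, transform and load in order."""
    crm, shop = extract()
    crm = clean(crm, "amount_usd")
    shop = clean(shop, "AMOUNT_CENTS")
    load(conn, transform(crm, shop))
```

Calling run_etl(sqlite3.connect(":memory:")) fills the sales table with the three valid rows; the row with the unparseable amount is dropped during cleaning.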
Multidimensional Analysis
Data mining – the process of analyzing data to extract
information not offered by the raw data alone
Data-mining tool – uses a variety of techniques to find
patterns and relationships in large volumes of information
and infers rules that predict future behavior and guide
decision making
Data-mining tools include: query tools, statistical tools,
intelligent agents, etc.
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the
form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple
dimensions
◦ Dimension tables, such as item (item_name, brand, type), or time (day, week,
month, quarter, year)
◦ Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables
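As a rough illustration of the fact and dimension tables described above, the sketch below keeps two small dimension tables and one fact table as plain Python dictionaries and aggregates dollars_sold along any pair of dimension attributes. The rows, the key values and the helper name cube_view are invented for the example.

```python
from collections import defaultdict

# Dimension tables keyed by surrogate key (illustrative rows only).
item_dim = {
    1: {"item_name": "TV-40", "brand": "Acme", "type": "TV"},
    2: {"item_name": "PC-Pro", "brand": "Bolt", "type": "PC"},
}
time_dim = {
    10: {"day": 1, "month": "Jan", "quarter": "Q1", "year": 2024},
    11: {"day": 1, "month": "Apr", "quarter": "Q2", "year": 2024},
}

# Fact table: keys to each related dimension plus the measure dollars_sold.
sales_fact = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 400.0},
    {"item_key": 2, "time_key": 10, "dollars_sold": 900.0},
    {"item_key": 1, "time_key": 11, "dollars_sold": 350.0},
]


def cube_view(item_attr, time_attr):
    """Sum dollars_sold for each (item attribute, time attribute) cell."""
    cells = defaultdict(float)
    for row in sales_fact:
        cell = (
            item_dim[row["item_key"]][item_attr],
            time_dim[row["time_key"]][time_attr],
        )
        cells[cell] += row["dollars_sold"]
    return dict(cells)
```

For example, cube_view("type", "quarter") views sales by item type and quarter, while cube_view("brand", "year") views the same facts by brand and year.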
Conceptual Modeling of Data Warehouses
Example of Star Schema
[Figure: a central Sales fact table linked to four dimension tables.]
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
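One way to make the star schema concrete is to create it in an in-memory SQLite database and run a typical analytical query against it. This is a sketch only: the _dim and _fact table-name suffixes, the column types, the sample rows and the function names build_star and dollars_by_brand_and_quarter are assumptions made for the example.

```python
import sqlite3

STAR_DDL = """
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
    month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (
    branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    state_or_province TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER, dollars_sold REAL, avg_sales REAL);
"""


def build_star():
    """Create the star schema and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAR_DDL)
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 15, "Mon", 1, "Q1", 2024),
        (2, 15, "Mon", 4, "Q2", 2024),
    ])
    conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?, ?, ?)", [
        (1, "TV-40", "Acme", "TV", "wholesale"),
        (2, "PC-Pro", "Bolt", "PC", "retail"),
    ])
    conn.execute("INSERT INTO branch_dim VALUES (1, 'Downtown', 'store')")
    conn.execute(
        "INSERT INTO location_dim VALUES "
        "(1, '1 Main St', 'Toronto', 'Ontario', 'Canada')"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, 1, 2, 800.0, 400.0),
        (1, 2, 1, 1, 1, 900.0, 900.0),
        (2, 1, 1, 1, 1, 350.0, 350.0),
    ])
    conn.commit()
    return conn


def dollars_by_brand_and_quarter(conn):
    """Join the fact table to two dimensions and aggregate a measure."""
    return conn.execute("""
        SELECT i.brand, t.quarter, SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN item_dim AS i ON i.item_key = f.item_key
        JOIN time_dim AS t ON t.time_key = f.time_key
        GROUP BY i.brand, t.quarter
        ORDER BY i.brand, t.quarter
    """).fetchall()
```

Every dimension is reached from the fact table with a single join, which is the property that gives the star schema its simple queries.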
Example of Snowflake Schema
[Figure: the Sales fact table with dimension tables, some of which are
normalized into further tables.]
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
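The difference from the star schema is that a dimension is split into further tables. The sketch below shows that normalization step for the location dimension only, turning flat star-style rows into a location table that points at a separate city table. The sample rows and the function name snowflake_location are hypothetical.

```python
# Flat, star-style location rows (illustrative data).
STAR_LOCATIONS = [
    {"location_key": 1, "street": "1 Main St", "city": "Toronto",
     "state_or_province": "Ontario", "country": "Canada"},
    {"location_key": 2, "street": "9 King St", "city": "Toronto",
     "state_or_province": "Ontario", "country": "Canada"},
    {"location_key": 3, "street": "5 Lake Ave", "city": "Chicago",
     "state_or_province": "Illinois", "country": "U.S.A"},
]


def snowflake_location(star_rows):
    """Split flat location rows into location and city tables.

    Returns (location_table, city_table), each a dict keyed by its own key.
    City keys are assigned in order of first appearance, starting at 1.
    """
    city_keys = {}       # (city, state_or_province, country) -> city_key
    city_table = {}      # city_key -> city attributes
    location_table = {}  # location_key -> street and city_key
    for row in star_rows:
        city_id = (row["city"], row["state_or_province"], row["country"])
        if city_id not in city_keys:
            city_keys[city_id] = len(city_keys) + 1
            city_table[city_keys[city_id]] = {
                "city": row["city"],
                "state_or_province": row["state_or_province"],
                "country": row["country"],
            }
        location_table[row["location_key"]] = {
            "street": row["street"],
            "city_key": city_keys[city_id],
        }
    return location_table, city_table
```

The city attributes are now stored once per city rather than once per location, at the cost of one extra join when a query needs them.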
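Example of Fact Constellation
[Figure, partially recovered: two fact tables sharing dimension tables.]
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
Sales Fact Table: time_key, item_key, branch_key (remaining columns not recovered)
Shipping Fact Table: time_key, item_key, shipper_key, from_location (remaining columns not recovered)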
A Sample Data Cube
[Figure: data cube with a Product dimension (TV, PC, VCR, sum) and a
Country dimension (U.S.A, Canada, Mexico, sum).]
Typical OLAP Operations
Roll up (drill-up): summarize data
◦ by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
◦ from higher level summary to lower level summary or detailed
data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
◦ reorient the cube, visualization, 3D to series of 2D planes
Other operations
◦ drill across: involving (across) more than one fact table
◦ drill through: through the bottom level of the cube to its back-
end relational tables (using SQL)
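The operations above can be tried out on a toy cube. The sketch below stores a three-dimensional cube as a dictionary from (product, city, quarter) to dollars sold; the data, the city-to-country hierarchy and all function names are made up for illustration, and a real OLAP engine would work very differently internally.

```python
from collections import defaultdict

# Cube cells: (product, city, quarter) -> dollars sold (illustrative data).
CUBE = {
    ("TV", "Toronto", "Q1"): 100,
    ("TV", "Vancouver", "Q1"): 50,
    ("TV", "Chicago", "Q1"): 200,
    ("PC", "Toronto", "Q1"): 300,
    ("PC", "Chicago", "Q2"): 250,
}

# One level of the location hierarchy: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "U.S.A"}


def roll_up_city_to_country(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (product, city, quarter), value in cube.items():
        out[(product, CITY_TO_COUNTRY[city], quarter)] += value
    return dict(out)


def roll_up_drop_dimension(cube, axis):
    """Roll up by dimension reduction: sum over the dimension at `axis`."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_quarter(cube, quarter):
    """Slice: fix one value of the quarter dimension, leaving a 2-D table."""
    return {(p, c): v for (p, c, q), v in cube.items() if q == quarter}


def dice(cube, products, cities):
    """Dice: select a sub-cube by restricting two dimensions."""
    return {k: v for k, v in cube.items()
            if k[0] in products and k[1] in cities}


def pivot(table_2d):
    """Pivot: swap the two axes of a 2-D table."""
    return {(b, a): v for (a, b), v in table_2d.items()}
```

Drill-down is the reverse of the roll-up functions: it needs the detailed cells (here, CUBE itself) to go back from country level to city level.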
Roll-up: