Capstone Project
Data warehousing concepts
Operational / informational data:
❑ Operational data is the data you use to run your business. This data is what is
typically stored, retrieved, and updated by your Online Transactional
Processing (OLTP) system. An OLTP system may be, for example, a
reservations system, an accounting application, or an order entry application.
❑ Informational data is created from the wealth of operational data that exists in your business, together with external data useful for analyzing your business.
Informational data is what makes up a data warehouse. Informational data is
typically:
➢ Summarized operational data
➢ De-normalized and replicated data
➢ Infrequently updated from the operational systems
➢ Optimized for decision support applications
➢ Possibly "read only" (no updates allowed)
➢ Stored on separate systems to lessen impact on operational systems
Operational vs. Informational Data

Operational                               Informational
Primarily primitive, highly detailed      Primarily derived, somewhat summarized
Current; guaranteed accurate now          Historical; accuracy maintained over time
Constantly updated                        Infrequently updated
Minimal redundancy                        Managed redundancy
Static structure, dynamic content         Dynamic structure, static content
Referential integrity                     Historical integrity
Supports day-to-day business functions    Supports long-term informational requirements
The Complete DWH System
[Figure: operational databases are extracted, transformed, loaded, and periodically refreshed into the warehouse and its data marts, which in turn serve query/reporting tools, OLAP servers (e.g., ROLAP), data mining tools, etc.]
Three-Tier DWH System
• Warehouse database server
– Almost always a relational or analytical DBMS; rarely flat files
• OLAP servers
– Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators (see the sketch after this list)
– Multidimensional OLAP (MOLAP): special-purpose server
that directly implements multidimensional data and operations
• Clients
– Query and reporting tools
– Analysis tools
– Data mining tools
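As a rough illustration of how a ROLAP server maps a multidimensional question onto standard relational operators, here is a minimal sketch that uses pandas as a stand-in for the relational engine; the star-schema tables, column names, and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical star schema: a sales fact table plus a product dimension.
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2, 3],
    "store_id":   [10, 20, 10, 30, 20],
    "amt":        [56, 4, 11, 8, 50],
})
dim_product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "brand":      ["BrandA", "BrandA", "BrandB"],
})

# The multidimensional request "total sales rolled up to brand" becomes
# ordinary relational operators: a join followed by a GROUP BY / SUM.
by_brand = (fact_sales.merge(dim_product, on="product_id")
                      .groupby("brand", as_index=False)["amt"].sum())
print(by_brand)
```

In SQL terms this is simply a join of the fact table to the product dimension followed by GROUP BY brand; a MOLAP server would instead answer the same question from a pre-built multidimensional array.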
Sample Unified BI Architecture: Investment Banking Domain
[Figure: heterogeneous source systems (operational databases, Oracle, flat files for risk and profitability, banking feeds, and reference/master data) deliver data feeds over FTP into a staging area holding a source image. ETL servers load a normalized atomic-level schema and star schemas in the DWH database server, which feed analytics models, an OLAP server (optional), and a report builder. A presentation layer provides MIS/operational and ad hoc reports through a portal behind a security layer, and a publish layer with a report publish server distributes customized reports. Data validation components sample data at points A and B.]
➢ The numbered paths 1, 2, and 3 are alternative routes in the report generation process; if a bank does not need a given path, the components on that path are omitted from the setup.
➢ Data validation: Option A samples data from the staging area; Option B samples data from the generated report. The two samples are compared to check the validity of the data, and the result is stored in a log file.
Sample: Data Marts Blocks
[Figure: flat-file systems are two-dimensional (records by character positions); relational data is likewise two-dimensional (rows by columns).]
MOLAP: Dimensional Modeling Using the Multidimensional Model
[Figure: a fact table view versus a multidimensional cube. With two dimensions the data is a flat table; with three dimensions (Time, Product, Store) it becomes a 3-D cube. Dimensions carry attributes, e.g., Product (upc, price, ...), and hierarchies such as Product → Brand → ..., Day → Week → Quarter, and Store → Region → Country, which support roll-ups to brand, week, or region. Example cell: 56 units of bread sold in the JU/Kolkata store on Monday.]
Cube Aggregation: Roll-up
[Figure: day-level sales cells for products p1, p2 across stores s1, s2, s3 are rolled up over the day dimension to a (product, store) table (p1: 56, 4, 50; p2: 11, 8), then further to per-store sums (67, 12, 50), per-product sums (p1 = 110, p2 = 19), and the grand total 129. Drill-down is the reverse operation.]
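A small sketch of this roll-up in Python (pandas is assumed as the computation engine; the numbers are the ones readable from the figure):

```python
import pandas as pd

# Sales at the (product, store) level, as in the figure
# (missing cells are simply absent rows).
sales = pd.DataFrame({
    "product": ["p1", "p1", "p1", "p2", "p2"],
    "store":   ["s1", "s2", "s3", "s1", "s2"],
    "amt":     [56, 4, 50, 11, 8],
})

by_store   = sales.groupby("store")["amt"].sum()    # s1=67, s2=12, s3=50
by_product = sales.groupby("product")["amt"].sum()  # p1=110, p2=19
grand      = sales["amt"].sum()                     # 129
print(by_store, by_product, grand, sep="\n")
```

Drill-down is simply the reverse: going back from these aggregates to the more detailed (product, store) or day-level rows.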
Cube Operators for Roll-up
[Figure: the same data written with cube operators, where '*' means "aggregated over that dimension": sale(s1, *, *) sums all sales in store s1, sale(s2, p2, *) sums sales of product p2 in store s2 over all days, and sale(*, *, *) is the grand total, 129.]
Extended Cube
[Figure: the cube extended with '*' slices that store the aggregates alongside the base cells, e.g., sale(*, p2, *) = 19, the per-product totals 110 and 19, the per-store totals 67, 12, and 50, and the grand total sale(*, *, *) = 129.]
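The extended cube corresponds closely to a pivot table with margins: the '*' row and column hold the aggregates next to the base cells. A minimal sketch, again assuming pandas and the same figures as above:

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["p1", "p1", "p1", "p2", "p2"],
    "store":   ["s1", "s2", "s3", "s1", "s2"],
    "amt":     [56, 4, 50, 11, 8],
})

# margins=True adds the '*' row and column of the extended cube:
# per-store totals, per-product totals, and the grand total (129).
cube = sales.pivot_table(index="product", columns="store",
                         values="amt", aggfunc="sum",
                         margins=True, margins_name="*")
print(cube)
```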
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Aggregation Using Hierarchies
[Figure: the day-level sales cells are rolled up along the Store → Region → Country hierarchy. With store s1 in region A and stores s2, s3 in region B, the region-level table is p1: region A = 56, region B = 54; p2: region A = 11, region B = 8.]
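Rolling up along a hierarchy can be sketched the same way: each store is replaced by its region using the Store → Region mapping from the figure, and the measure is re-aggregated (pandas assumed, numbers as in the figure):

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["p1", "p1", "p1", "p2", "p2"],
    "store":   ["s1", "s2", "s3", "s1", "s2"],
    "amt":     [56, 4, 50, 11, 8],
})
# Store -> Region level of the hierarchy: s1 is in region A, s2 and s3 in region B.
store_region = {"s1": "region A", "s2": "region B", "s3": "region B"}

sales["region"] = sales["store"].map(store_region)
by_region = sales.pivot_table(index="product", columns="region",
                              values="amt", aggfunc="sum")
print(by_region)  # p1: A=56, B=54; p2: A=11, B=8
```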
Points to be noticed about MOLAP
• Pre-calculating (pre-consolidating) transactional data improves query speed.
BUT
• Because MDDs fully pre-consolidate the incoming data, they require an enormous amount of overhead, both in processing time and in storage: an input file of 200 MB can easily expand to 5 GB (a rough calculation follows below).
• MDDs are therefore good candidates for department-level data marts smaller than about 50 GB.
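A back-of-the-envelope sketch of this data explosion (the fact count and dimension cardinalities below are made up purely for illustration): a fully pre-consolidated cube materializes one cell for every member combination, including the "all" level in each dimension, whether or not a base fact exists for it.

```python
from math import prod

# Hypothetical sizes: one million base-level facts spread sparsely
# over these dimension cardinalities.
n_facts = 1_000_000
cardinalities = {"product": 1000, "store": 300, "day": 365}

# Cells in a fully pre-consolidated cube, with an extra 'all' (*) member
# per dimension.
cube_cells = prod(n + 1 for n in cardinalities.values())
print(f"base facts:             {n_facts:,}")
print(f"pre-consolidated cells: {cube_cells:,} (~{cube_cells / n_facts:.0f}x the input)")
```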
1) Performance:
[Figure: a sales cube with dimensions Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Time ('97, '98, '99), the kind of cube that MOLAP pre-consolidation is designed to answer quickly.]
Multidimensional Analysis
[Figure series: the same sales data viewed along different dimension combinations. Starting from a cube of Product (Household, Telecomm, Video, Audio) × Sales Channel (Retail, Direct, Special) × Time (1995, 1996, ...), the Time dimension is swapped for Customer Information (geographic location, customer type, income bracket); the cube is then sliced and diced by Customer Type (Government, Commercial, Individual) and by geography (Europe (EC), Asia Pacific) across the same products and sales channels. The resulting cross-tab of product and customer type by sales channel is shown below.]
Multidimensional Analysis
                       Retail   Direct   Special
Household               $300     $200      $100
  Government            $100      $50       $10
  Commercial            $100      $75       $70
  Individual            $100      $75       $30
Telecomm               $6000    $3000      $400
  Government           $2000     $500      $100
  Commercial           $1000    $1500      $100
  Individual           $3000    $1000      $200
Video                  $4444    $2222      $777
  Government           $1030     $150      $222
  Commercial           $1311    $1175      $111
  Individual           $2103     $897      $444
Audio                    $50      $75       $25
  Government             $10      $25       $20
  Commercial             $10      $25        $0
  Individual             $30      $25        $5
Tools available
• ROLAP:
– ORACLE 8i
– ORACLE Reports; ORACLE Discoverer
– ORACLE Warehouse Builder
– Arbor Software’s Essbase
• MOLAP:
– ORACLE Express Server
– ORACLE Express Clients (C/S and Web)
– MicroStrategy’s DSS server
– Platinum Technology’s Platinum InfoBeacon
• HOLAP:
– ORACLE 8i
– ORACLE Express Server
– ORACLE Relational Access Manager
– ORACLE Express Clients (C/S and Web)
Conclusion
• ROLAP: RDBMS -> star/snowflake schema
• ROLAP or MOLAP: the data models used play a major role in the performance differences
Simple Example of Parallel Storage and Parallel Processing
• A farmer wants to harvest grapes and sell all of the fruit in the nearby town.
• After harvesting, he stores the produce in a single storage room.
• Challenge: the single storage room becomes the bottleneck for storing and accessing all of the fruit in one place.
• So the farmer decides to distribute the storage, giving each worker a separate storage area.
• This high-yield storage and processing approach is a business decision that helps them complete orders on time without any hassle.
• This is how they keep growing, deliver more and more fruit baskets, and eventually form a large farm.
• The single storage unit became a bottleneck, which generated network overhead.
• The solution was to use distributed storage, one unit per processor.
• This made it easy to store and retrieve data.
• With this method, no network overhead is generated.
• There is no bottleneck for pushing or pulling data, so lead times stay low.
• This combination of parallel processing over distributed storage is the idea behind MapReduce.
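A toy sketch of this idea in Python (no Hadoop involved; the fruit data and its partitioning are invented to mirror the farmer example): each distributed "storage area" is counted independently by a map step, and a reduce step merges the partial results.

```python
from collections import defaultdict

# Three distributed "storage areas", each holding part of the harvest.
partitions = [
    ["grape", "grape", "apple"],
    ["grape", "banana"],
    ["apple", "grape", "banana", "grape"],
]

def map_count(partition):
    """Map step: count fruit within one partition (runs independently,
    and in parallel in a real system)."""
    counts = defaultdict(int)
    for fruit in partition:
        counts[fruit] += 1
    return counts

def reduce_counts(partials):
    """Reduce step: merge the partial counts from every partition."""
    total = defaultdict(int)
    for partial in partials:
        for fruit, n in partial.items():
            total[fruit] += n
    return dict(total)

print(reduce_counts(map_count(p) for p in partitions))
# {'grape': 5, 'apple': 2, 'banana': 2}
```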
What is in Hadoop
1. Storage unit of Hadoop - HDFS (Hadoop Distributed File System)
- The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware, i.e., hardware based on open standards. This means the system can run on different operating systems (OSes), such as Windows or Linux, without requiring special drivers.
- It is specially designed for storing huge data sets on commodity hardware.
2. Processing unit of Hadoop - MapReduce
- MapReduce is a programming model (or pattern) within the Hadoop framework that is used to access and process the big data stored in the Hadoop Distributed File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework.
There is only one NameNode, but there can be multiple DataNodes; the master node (NameNode) and the slave nodes (DataNodes) together form the HDFS cluster.
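As a concrete illustration of the MapReduce pattern over data in HDFS, here is a minimal word count written in the Hadoop Streaming style: two small Python scripts that read from standard input and write tab-separated key/value pairs. The script names and the surrounding job wiring are assumptions for illustration, not taken from the slides.

```python
# mapper.py -- emit one "<word>\t1" line for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop delivers the mapper output sorted by key,
# so counts can be summed per word in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

In a Hadoop cluster these two scripts would be submitted with the Hadoop Streaming jar, reading their input from and writing their output to HDFS paths; the NameNode tracks where the file blocks live, and the DataNodes hold the blocks the mappers actually read.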