Unit #3 - Data Warehouse and Data Mining
Prof. Dr. M. S. Memon
sulleman@quest.edu.pk
03337037187
May 20, 2023
Architecture of DW
• Basic Architecture
Data Warehouse Architecture
Operational Source systems
• These are the operational systems of record
that capture the transactions of the business.
• These systems lie outside the data warehouse,
which therefore has no control over the content
and format of their data
• The source systems maintain little historical data
• These systems generate operational data that is
detailed, current and subject to change
Data Staging Area
• Data staging area can be divided into three
phases
– Extraction (E)
– Transformation (T)
– Loading (L)
• Extraction: Reading and understanding the
source data and copying the data needed for
the data warehouse into the staging area for
further manipulation (i.e. transformation)
• Loading: Loading refers to populating the data
warehouse with data that has been extracted from
operational systems.
• There are two types of loads, which generally
take place in a data warehouse environment:
– Initial load
– Incremental load
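The two load types can be sketched as follows. This is a minimal illustration, assuming that an initial load copies every source row into an empty warehouse and that an incremental load applies only rows changed since the previous load. The row layout, dates and function names are hypothetical, and overwriting changed rows is a simplification (a real warehouse would usually keep history).

```python
from datetime import date

# Hypothetical operational rows: (order_id, last_modified, amount)
source_rows = [
    (1, date(2023, 5, 1), 100.0),
    (2, date(2023, 5, 10), 250.0),
    (3, date(2023, 5, 19), 75.0),
]

def initial_load(source):
    """Populate an empty warehouse table with every source row."""
    return {row[0]: row for row in source}

def incremental_load(warehouse, source, since):
    """Apply only the rows modified after the previous load date."""
    changed = [row for row in source if row[1] > since]
    for row in changed:
        warehouse[row[0]] = row  # insert new rows, overwrite changed ones
    return len(changed)

warehouse = initial_load(source_rows)
last_load = date(2023, 5, 19)

# Later the source changes: order 2 is corrected and order 4 arrives.
source_rows[1] = (2, date(2023, 5, 20), 260.0)
source_rows.append((4, date(2023, 5, 20), 40.0))

applied = incremental_load(warehouse, source_rows, last_load)
print(applied, sorted(warehouse))
```

Only the two changed rows are processed by the incremental load; the unchanged orders 1 and 3 are not read into the warehouse again.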
• Transformation: The transformation phase applies
a series of rules or functions to the extracted/loaded
data.
• This may include some or all of the following:
– Select only certain columns to load (or, equivalently, exclude
the columns that should not be loaded)
– Translate coded values
– Derive a new calculated value (e.g. sale_amount = qty * unit_price)
– Denormalize in order to fit the data warehouse schema
– Summarize multiple rows of data (e.g. total sales for each region)
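The transformation rules listed above can be sketched in a few lines. This is only an illustration: the column names, the region codes and the `transform` helper are assumptions, not part of any particular ETL tool.

```python
from collections import defaultdict

# Hypothetical extracted rows; column names and codes are illustrative only.
extracted = [
    {"region_code": "N", "qty": 2, "unit_price": 10.0, "internal_note": "x"},
    {"region_code": "S", "qty": 1, "unit_price": 30.0, "internal_note": "y"},
    {"region_code": "N", "qty": 5, "unit_price": 4.0, "internal_note": "z"},
]
REGION_NAMES = {"N": "North", "S": "South"}  # lookup table for coded values

def transform(row):
    """Keep only the needed columns, translate codes, derive sale_amount."""
    return {
        # internal_note is not selected, so it never reaches the warehouse
        "region": REGION_NAMES[row["region_code"]],      # translate coded value
        "sale_amount": row["qty"] * row["unit_price"],   # derive new value
    }

transformed = [transform(r) for r in extracted]

# Summarize multiple rows: total sales for each region
sales_by_region = defaultdict(float)
for r in transformed:
    sales_by_region[r["region"]] += r["sale_amount"]

print(dict(sales_by_region))
```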
• The Data Staging Area
– Is both a storage and process area (the ETL process)
– It represents everything that happens between the
operational source system and the data presentation
area
– The key architectural requirement for the data staging
area is that it is off-limits to business users and does not
provide query and presentation services
– It should be accessible only to skilled professionals
ETL versus ELT
• ETL (The traditional approach): ETL (Extract, transform,
and load) is a process in data warehousing that involves:
– Extracting data from outside sources
– Transforming it to fit business needs, and ultimately
– Loading it into the data warehouse
• ELT (The Teradata Approach): ELT (Extract, Load and
Transform) strategy extracts and loads the data into a
Teradata Database first, then uses the power and
performance of the Teradata Warehouse to perform the
transformation
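The difference between the two strategies is where the transformation runs. The sketch below contrasts them under loud assumptions: an in-memory SQLite database stands in for the warehouse (it is not Teradata), and the table names, columns and the `etl`/`elt` functions are hypothetical.

```python
import sqlite3

# Hypothetical source rows: (region, qty, unit_price)
source = [("n", 2, 10.0), ("s", 1, 30.0)]

def etl(rows):
    """ETL: transform in the staging code, then load the finished rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, sale_amount REAL)")
    cleaned = [(region.upper(), qty * price) for region, qty, price in rows]
    db.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return db.execute(
        "SELECT region, sale_amount FROM sales ORDER BY region"
    ).fetchall()

def elt(rows):
    """ELT: load the raw rows first, then transform with SQL in the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (region TEXT, qty INTEGER, unit_price REAL)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)
    db.execute(
        "CREATE TABLE sales AS "
        "SELECT upper(region) AS region, qty * unit_price AS sale_amount "
        "FROM raw_sales"
    )
    return db.execute(
        "SELECT region, sale_amount FROM sales ORDER BY region"
    ).fetchall()

etl_result = etl(source)
elt_result = elt(source)
print(etl_result == elt_result)
```

Both paths end with the same warehouse table; in ELT the work is done by the database engine rather than by a separate staging program.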
Data Presentation Area
• Extended Relational DBMS
(ROLAP servers)
– data stored in RDB
– star-join schemas
– support SQL extensions (Cube)
– Index structures (bitmap, join)
• Multidimensional DBMS
(MOLAP servers)
– data stored in arrays (n-dimensional
array)
– direct access to array data structure
– poor storage utilization, especially
when the data is sparse
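The MOLAP trade-off (direct array access versus poor storage utilization on sparse data) can be illustrated with a flat array. The cube dimensions, the sample cells and the `offset` helper are assumptions made for this sketch; real MOLAP servers use more elaborate storage.

```python
# Hypothetical cube dimensions: 100 products x 50 stores x 365 days.
n_products, n_stores, n_days = 100, 50, 365

# The cells that actually hold a sale: (product, store, day) -> amount
facts = {(0, 0, 0): 120.0, (3, 7, 14): 80.0, (99, 49, 364): 15.5}

def offset(product, store, day):
    """Row-major position of a cell in the flat n-dimensional array."""
    return (product * n_stores + store) * n_days + day

# MOLAP-style storage: one slot per possible cell, whether filled or not.
cube = [0.0] * (n_products * n_stores * n_days)
for (p, s, d), amount in facts.items():
    cube[offset(p, s, d)] = amount

# Direct access: the cell address is computed, not searched for.
print(cube[offset(3, 7, 14)])

# Storage utilization: share of allocated slots that hold real data.
utilization = len(facts) / len(cube)
print(len(cube), utilization)
```

With only three populated cells the array still reserves a slot for every product, store and day combination, which is the storage cost the slide refers to.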
• The Data Presentation Area
– Is where data is organized, stored and made available
for queries, report writers, and other analytical
processing
– This area is the Warehouse as far as the business
community is concerned
Data Access Tools
• Analysis / OLAP / DSS Tools
• Data Mining
Warehouse components
Component: Operational Data
• Data for the data warehouse is supplied from:
– Mainframe systems that hold data in the traditional
network and hierarchical formats
– Relational DBMSs such as Oracle and Informix
– In addition to this internal data, operational data also
includes external data obtained from commercial
databases and databases associated with suppliers and
customers
Component: Load Manager
• The load manager (also called the front end
component) performs all the operations associated
with the extraction and loading of data into the data
warehouse
• These operations include simple transformations of
the data to prepare the data for entry into the
warehouse
• The size and complexity of this component will vary
between data warehouses and may be constructed
using a combination of vendor data loading tools and
custom built programs
Component: Warehouse Manager
• The warehouse manager performs all the operations
associated with the management of data in the warehouse
• This component is built using vendor data management
tools and custom built programs
• The operations performed by the warehouse manager include:
– Analysis of data to ensure consistency
– Transformation and merging of the source data from temporary
storage into data warehouse tables
– Creation of indexes and views on the base tables
– Generation of denormalizations
– Generation of aggregations
– Backing up and archiving of data
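Several of the warehouse manager operations can be sketched with plain SQL. This is a minimal example, assuming an in-memory SQLite database as a stand-in for the warehouse; the table, index and view names are hypothetical, and denormalization, backup and archiving are left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- temporary storage filled by the load manager (hypothetical schema)
    CREATE TABLE staging_sales (store TEXT, product TEXT, amount REAL);
    INSERT INTO staging_sales VALUES
        ('S1', 'pen', 10.0), ('S1', 'ink', 5.0),
        ('S2', 'pen', 20.0), ('S2', 'pen', NULL);
    CREATE TABLE sales (store TEXT, product TEXT, amount REAL);
""")

# Consistency analysis: count the rows that lack an amount.
rejected = db.execute(
    "SELECT COUNT(*) FROM staging_sales WHERE amount IS NULL"
).fetchone()[0]

# Merge the consistent rows from temporary storage into the warehouse table.
db.execute("INSERT INTO sales SELECT * FROM staging_sales WHERE amount IS NOT NULL")

# Create an index and a view on the base table.
db.execute("CREATE INDEX idx_sales_store ON sales (store)")
db.execute(
    "CREATE VIEW pen_sales AS SELECT store, amount FROM sales WHERE product = 'pen'"
)

# Generate an aggregation: total sales per store.
db.execute(
    "CREATE TABLE sales_by_store AS "
    "SELECT store, SUM(amount) AS total FROM sales GROUP BY store"
)

totals = db.execute("SELECT store, total FROM sales_by_store ORDER BY store").fetchall()
print(rejected, totals)
```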
Warehouse Manager: Detailed Data
M. S. Memon, CSE Dept., QUEST, Nawabshah
Data Mart
• A data mart is a special purpose subset of
enterprise data for a particular function or
application (It may contain detail or summary data
or both).
• Data Mart types:
– Independent—created directly from operational systems
to a separate physical data store
– Logical—exists as a subset of existing data warehouse.
– Dependent—created from data warehouse to a separate
physical data store
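The difference between a logical and a dependent data mart can be sketched as a view versus a physical copy. The example assumes in-memory SQLite databases as stand-ins for the warehouse and the mart, with hypothetical table names. An independent data mart is not shown; it would be loaded from the operational systems directly, without the warehouse.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales (dept TEXT, amount REAL);
    INSERT INTO sales VALUES ('marketing', 10.0), ('finance', 7.0), ('marketing', 5.0);
    -- Logical data mart: a view, i.e. a subset that lives inside the warehouse
    CREATE VIEW marketing_mart AS SELECT * FROM sales WHERE dept = 'marketing';
""")

# Dependent data mart: the same subset copied into a separate physical store.
dependent_mart = sqlite3.connect(":memory:")
dependent_mart.execute("CREATE TABLE sales (dept TEXT, amount REAL)")
dependent_mart.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    warehouse.execute("SELECT dept, amount FROM marketing_mart").fetchall(),
)

# A later warehouse change is visible through the view but not in the copy.
warehouse.execute("INSERT INTO sales VALUES ('marketing', 1.0)")
logical_rows = warehouse.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0]
dependent_rows = dependent_mart.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(logical_rows, dependent_rows)
```

The dependent mart stays at its last refresh until it is reloaded from the warehouse, whereas the logical mart always reflects the current warehouse contents.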
Data Marts
[Figure labels: Operational Systems, Independent Data Mart, Data Warehouse, Dependent Data Mart, Time]
Multi-dimensional Data
• Measures - numerical data being tracked
• Dimensions - business parameters that define a
transaction
• Example: An analyst may want to view sales data
(measure) by geography, by time, and by product
(dimensions)
• Dimensional modeling is a technique for
structuring data around the business concepts
• ER models describe “entities” and “relationships”
• Dimensional models describe “measures” and
“dimensions”
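Viewing a measure by different dimensions amounts to grouping and totalling. A small sketch, with made-up transactions and a hypothetical `view_by` helper:

```python
from collections import defaultdict

# Hypothetical transactions: three dimensions plus one measure (sales).
transactions = [
    {"geography": "Sindh", "month": "2023-04", "product": "pen", "sales": 100.0},
    {"geography": "Sindh", "month": "2023-05", "product": "ink", "sales": 40.0},
    {"geography": "Punjab", "month": "2023-05", "product": "pen", "sales": 60.0},
]

def view_by(rows, *dimensions, measure="sales"):
    """Total the measure for each combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

by_geography = view_by(transactions, "geography")
by_product_month = view_by(transactions, "product", "month")
print(by_geography)
print(by_product_month)
```

The same rows answer both questions; only the choice of dimensions changes.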
Multi-dimensional Model
• Example queries:
– “Sales by product line over the past six months”
– “Sales by store between 1990 and 1995”
[Figure labels: Store Info; key columns joining fact table to dimension tables; numerical measures]
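A fact table joined to dimension tables through key columns can answer a query such as "Sales by store between 1990 and 1995". The sketch below assumes an in-memory SQLite database; the table names, columns and sample rows are hypothetical and are not taken from the slide's figure.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension tables (hypothetical columns)
    CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT);
    CREATE TABLE time_dim  (time_key  INTEGER PRIMARY KEY, year INTEGER);
    -- Fact table: key columns joining to the dimensions, plus a numerical measure
    CREATE TABLE sales_fact (store_key INTEGER, time_key INTEGER, sales REAL);

    INSERT INTO store_dim VALUES (1, 'Downtown'), (2, 'Airport');
    INSERT INTO time_dim  VALUES (1, 1989), (2, 1992), (3, 1995);
    INSERT INTO sales_fact VALUES
        (1, 1, 500.0), (1, 2, 300.0), (2, 2, 200.0), (2, 3, 100.0);
""")

# "Sales by store between 1990 and 1995"
sales_by_store = db.execute("""
    SELECT s.store_name, SUM(f.sales)
    FROM sales_fact AS f
    JOIN store_dim AS s ON s.store_key = f.store_key
    JOIN time_dim  AS t ON t.time_key  = f.time_key
    WHERE t.year BETWEEN 1990 AND 1995
    GROUP BY s.store_name
    ORDER BY s.store_name
""").fetchall()
print(sales_by_store)
```

The 1989 sale is filtered out through the time dimension, so it does not contribute to the Downtown total.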
• Web-based Architecture
– Advantages:
• Usage of existing software, reduction of costs, platform independence
– Disadvantages:
• Security issues: data encryption/user access and identification
Distributed DW
• In most cases the economics and technology
greatly favor a single centralized DW
• But in some cases, a distributed DW makes sense
• Types of distributed DW
– Geographically distributed
• Local DW/global DW
– Technologically distributed DW
• Logically one DW, physically several DWs
– Independently evolving distributed DW
• Uncontrolled growth
• Geographically distributed
– In the case of corporations spread
around the world
• Information is needed both locally and
globally
– A distributed DW makes sense
• When much processing occurs at the
local level
• Even though local branches report to the
same balance sheet, the local
organizations are their own companies
• Technologically distributed DW
– Placing the DW on the distributed technology of a vendor
– Advantages
• The entry cost is cheap – large centralized hardware is expensive
• No theoretical limit to how much data can be placed in the DW – one
can add new servers to the network
– As the DW starts to expand, network data communication
starts playing an important role
• Example: To simplify, consider 4 nodes, each holding the
data for one of the last 4 years
• Now consider a query that needs to access the data
from all 4 years: such a query raises the issue of transporting
large amounts of data between processors
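The communication cost in this example can be made concrete with a toy model. It assumes that each of the four nodes holds exactly one year of data and that the query is answered by pulling remote rows to a coordinating node; the node names, row counts and the `total_sales` function are hypothetical.

```python
# Hypothetical layout: four nodes, each holding one year of sales rows.
nodes = {
    "node_a": {"year": 2019, "rows": [10.0] * 1000},
    "node_b": {"year": 2020, "rows": [12.0] * 1000},
    "node_c": {"year": 2021, "rows": [9.0] * 1000},
    "node_d": {"year": 2022, "rows": [11.0] * 1000},
}

def total_sales(years, coordinator="node_a"):
    """Sum sales over the given years by pulling remote rows to the coordinator.

    Returns the total and the number of rows shipped across the network.
    """
    total, rows_shipped = 0.0, 0
    for name, node in nodes.items():
        if node["year"] not in years:
            continue
        if name != coordinator:
            rows_shipped += len(node["rows"])  # remote rows must be transported
        total += sum(node["rows"])
    return total, rows_shipped

one_year = total_sales({2019})
four_years = total_sales({2019, 2020, 2021, 2022})
print(one_year, four_years)
```

A query on the coordinator's own year ships nothing, while the four-year query moves three nodes' worth of rows. Computing partial sums on each node and shipping only the totals would reduce the traffic, but not every query can be decomposed that way.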
• Independently evolving distributed DW
– In practice there are many cases in which independent
DWs are developed concurrently and without coordination
in the same organization
• The first step many corporations make is to build a DW for
finance or marketing
• Once it is successfully set up, other parts of the organization
independently follow the same process, resulting in the
coexistence of several independent DWs in the same organization
• This problem will be addressed later