Chap 2

The document outlines the fundamentals of data warehousing, detailing its defining features such as subject-oriented, integrated, time-variant, nonvolatile data, and data granularity. It discusses different architectural types of data warehouses and data marts, including centralized, independent, federated, hub-and-spoke, and data-mart bus approaches, along with their advantages and disadvantages. Additionally, it covers the components involved in building a data warehouse, including source data, data staging, and the processes of extraction, transformation, and loading.

Uploaded by

Sobica Noor

Data Warehousing

Mohib Ullah Khan

mohibkhan483@gmail.com

1
Book

➢Data Warehousing Fundamentals for IT Professionals


Paulraj Ponniah

2
Chapter 2

➢DATA WAREHOUSE: The Building Blocks

3
DEFINING FEATURES

• The data in the data warehouse is:


• Separate
• Available
• Integrated
• Time stamped
• Subject oriented
• Nonvolatile
• Accessible

4
DEFINING FEATURES

Subject-Oriented Data
• In Operational Systems, we store data by individual applications
• E.g. in an order processing application, we keep the data for that
particular application
• These data sets provide the data for all the functions for entering
orders, checking stock, verifying customer’s credit etc.
• In data warehouse, data is stored by real-world business subjects or
events, not by applications
• In data warehouse, all the data sets relating to the same real-world
business subject or event is tied together
• Data is linked and stored by business subjects
5
[Figure: Subject-Oriented Data]

6
DEFINING FEATURES

Integrated Data
• The data in the data warehouse comes from several operational
systems
• Source data reside in different databases, files, and data segments
• The file layouts, character code representations, and field naming
conventions all could be different
• Here are some of the items that would need to be standardized and
made consistent:
• Naming conventions
• Codes
• Data attributes
• Measurements
7
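A minimal sketch of this kind of standardization, assuming two hypothetical source systems that name and encode the same customer fields differently. The field names, gender codes, and units are illustrative assumptions, not taken from the book.

```python
# Hypothetical records from two operational systems with different
# naming conventions, codes, and units of measurement.
orders_system = {"cust_nm": "Ada Lovelace", "gender": "F", "height_in": 65.0}
billing_system = {"customerName": "Alan Turing", "sex": 1, "height_cm": 178.0}

# Illustrative code mapping: each source's code -> one warehouse code.
GENDER_CODES = {"F": "female", "M": "male", 0: "female", 1: "male"}


def from_orders(rec):
    """Map an order-system record onto the warehouse's standard layout."""
    return {
        "customer_name": rec["cust_nm"],
        "gender": GENDER_CODES[rec["gender"]],
        "height_cm": round(rec["height_in"] * 2.54, 1),  # inches -> cm
    }


def from_billing(rec):
    """Map a billing-system record onto the same standard layout."""
    return {
        "customer_name": rec["customerName"],
        "gender": GENDER_CODES[rec["sex"]],
        "height_cm": rec["height_cm"],
    }


integrated = [from_orders(orders_system), from_billing(billing_system)]
```

After both mappings, every record uses the same field names, the same codes, and the same unit, whichever system it came from.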
[Figure: Integrated Data]

8
DEFINING FEATURES

Time-Variant Data
• Data warehouse contains historical data, not just current values
• Changes to data are tracked and recorded
• If necessary, reports can be produced to show changes over time
• Every data structure in the data warehouse contains the time
element
• This aspect of the data warehouse is quite significant for both the
design and the implementation phases

9
DEFINING FEATURES

Time-Variant Data
• For example, the sales quantity in a record may relate to a specific
date, week, month, or quarter
• The time-variant nature of the data in a data warehouse:
• Allows for analysis of the past
• Relates information to the present
• Enables forecasts for the future
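One way to picture the time element is a history table in which every row records when its value took effect. The sketch below assumes a hypothetical customer address history; the customer id, cities, and the `effective_from` field name are made up for illustration.

```python
from datetime import date

# Each row carries a time element: the date from which the value applies.
address_history = [
    {"customer_id": 7, "city": "Lahore", "effective_from": date(2021, 1, 1)},
    {"customer_id": 7, "city": "Karachi", "effective_from": date(2023, 6, 15)},
]


def city_as_of(history, customer_id, as_of):
    """Return the city that was current for the customer on the given date."""
    rows = [
        r for r in history
        if r["customer_id"] == customer_id and r["effective_from"] <= as_of
    ]
    if not rows:
        return None  # no value was recorded yet on that date
    return max(rows, key=lambda r: r["effective_from"])["city"]
```

Because old rows are kept rather than overwritten, a query can ask what was true on any past date, not only what is true now.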

10
DEFINING FEATURES

Nonvolatile Data
• Once the data is captured and committed in the data warehouse,
you do not run individual transactions to change the data there
• Data updates are commonplace in an operational database; not so
in a data warehouse
• The data in a data warehouse is not as volatile as the data in an
operational database

11
[Figure: Nonvolatile Data]

12
DEFINING FEATURES

Data Granularity
• Frequently, the analysis begins at a high level and moves down to
lower levels of detail
• It is efficient to keep data summarized at different levels
• Depending on the query, we can go to the particular level of detail
and satisfy the query
• Data granularity in a data warehouse refers to the level of detail
• The lower the level of detail, the finer the data granularity
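The idea of keeping the same data at several levels of detail can be sketched as follows. The daily sales rows and the `summarize` helper are illustrative assumptions; a real warehouse would normally store such summaries in tables of their own.

```python
from collections import defaultdict

# Daily sales at the finest grain kept here: (date, product, units).
daily_sales = [
    ("2024-01-05", "widget", 3),
    ("2024-01-20", "widget", 2),
    ("2024-02-02", "widget", 4),
    ("2024-02-02", "gadget", 1),
]


def summarize(rows, key_len):
    """Roll up units by a date prefix: 7 chars = month, 4 chars = year."""
    totals = defaultdict(int)
    for day, product, units in rows:
        totals[(day[:key_len], product)] += units
    return dict(totals)


monthly = summarize(daily_sales, 7)  # coarser than daily
yearly = summarize(daily_sales, 4)   # coarser still
```

A query about yearly totals can be answered from `yearly` alone, while a question about one particular day still has the daily rows to drill down into.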

13
[Figure: Data Granularity]

14
DATA WAREHOUSES AND DATA MARTS

• Before deciding, we need to address the relevant issues:


• Top-down or bottom-up approach?
• Enterprise-wide or departmental?
• Which first: data warehouse or data mart?
• Build pilot or go with a full-fledged implementation?
• Dependent or independent data marts?

15
DATA WAREHOUSES AND DATA MARTS

• Should you build a large data warehouse and then let that
repository feed data into local data marts?
• Should you build individual local data marts, and combine them to
form your overall data warehouse?
• Should these local data marts be independent of one another?
• Or should they be dependent on the overall data warehouse for
data feed?
• Should you build a pilot data mart?
• These are crucial questions

16
DATA WAREHOUSES AND DATA MARTS

How Are They Different?


• (1) Overall data warehouse feeding dependent data marts
• (2) Several departmental or local data marts combining into a data
warehouse
• In the first approach, you extract data from the operational
systems; you then transform, clean, integrate, and keep the data in
the data warehouse
• So, which approach is best in your case, the top-down or the
bottom-up approach?

17
DATA WAREHOUSES AND DATA MARTS

Top-Down Approach
• Data in the data warehouse is stored at the lowest level of
granularity; based on a normalized data model
• In the Inmon vision the data warehouse is at the center of the
“Corporate Information Factory” (CIF)
• Delivering business intelligence to the enterprise
• Business operations provide data to drive the CIF
• The centralized data warehouse would feed the dependent data
marts that may be designed based on a dimensional data model

18
DATA WAREHOUSES AND DATA MARTS

Top-Down Approach
• The advantages of this approach are:
• A truly corporate effort, an enterprise view of data
• Inherently architected, not a union of disparate data marts
• Single, central storage of data about the content
• Centralized rules and control
• May see quick results if implemented with iterations
• The disadvantages are:
• Takes longer to build even with an iterative method
• High exposure to risk of failure
• Needs high level of cross-functional skills
• High expense without proof of concept
19
DATA WAREHOUSES AND DATA MARTS

[Figure: Top-Down Approach]

20
DATA WAREHOUSES AND DATA MARTS

Bottom-Up Approach
• Data marts are created first to provide analytical and reporting
capabilities for specific business subjects
• Data marts may contain detailed data or summaries depending on
the needs for analysis
• These data marts are joined or “unioned” together by conforming
the dimensions

21
DATA WAREHOUSES AND DATA MARTS

Bottom-Up Approach
• The advantages of this approach are:
• Faster and easier implementation of manageable pieces
• Favorable return on investment and proof of concept
• Less risk of failure
• Inherently incremental; can schedule important data marts first
• Allows project team to learn and grow
• The disadvantages are:
• Each data mart has its own narrow view of data
• Permeates redundant data in every data mart
• Perpetuates inconsistent and irreconcilable data
• Proliferates unmanageable interfaces
22
DATA WAREHOUSES AND DATA MARTS

A Practical Approach
• We do not lose sight of the overall big picture for the entire
enterprise
• We base our planning on this overall big picture
• This aspect is from the top-down approach
• Adopt the principles of the bottom-up approach and build the
conformed data marts based on a priority scheme
• The steps in this practical approach are as follows:
• 1. Plan and define requirements at the overall corporate level
• 2. Create a surrounding architecture for a complete warehouse
• 3. Conform and standardize the data content
• 4. Implement the data warehouse as a series of supermarts, one at a
time
23
DATA WAREHOUSES AND DATA MARTS

A Practical Approach
• Plan at the enterprise level
• Gather requirements at the overall level
• Establish the architecture for the complete warehouse
• Determine the data content for each supermart
• Supermarts are carefully architected data marts
• Implement these supermarts, one at a time
• Make sure that the data content among the various supermarts is
conformed in terms of data types, field lengths, precision, etc.
• A certain data element must mean the same thing in every
supermart
24
ARCHITECTURAL TYPES

Centralized Data Warehouse


• Takes into account the enterprise-level information requirements
• An overall infrastructure is established
• Atomic level normalized data at the lowest level of granularity
• In the third normal form
• Occasionally, some summarized data is included
• Queries and applications access data in the central data warehouse
• There are no separate data marts

25
ARCHITECTURAL TYPES

Independent Data Marts


• Each data mart serves the particular organizational unit
• Separate data marts do not provide “a single version of the truth.”
• The data marts are independent of one another
• These different data marts are likely to have inconsistent data
definitions and standards
• Such variances hinder analysis of data across data marts
• For example, two independent data marts; sales and shipments
• Difficult to analyze sales and shipments data together

26
ARCHITECTURAL TYPES

Federated
• Some companies get into data warehousing with an existing legacy
of an assortment of decision-support structures
• In the form of operational systems, extracted datasets, primitive data
marts
• It is unwise to discard everything and start from scratch
• In the federated type, data may be physically or logically integrated
through shared key fields, overall global metadata, distributed
queries, and other methods
• In this architectural type, there is no one overall data warehouse
27
ARCHITECTURAL TYPES

Hub-and-Spoke
• Similar to the centralized data warehouse, an overall enterprise-
wide data warehouse
• Atomic data in the third normal form is stored in the centralized data
warehouse
• Dependent data marts in this architectural type
• Dependent data marts obtain data from the centralized data
warehouse
• The centralized data warehouse forms the hub to feed data to the
data marts on the spokes
28
ARCHITECTURAL TYPES

Hub-and-Spoke
• Each dependent data mart may have normalized, denormalized,
summarized, or dimensional data structures based on individual
requirements
• Most queries are directed to the dependent data marts
• The centralized data warehouse may also be used
• Top-down approach to data warehouse development

29
ARCHITECTURAL TYPES

Data-Mart Bus
• Begin with analyzing requirements for a specific business subject
such as orders, shipments, billings, insurance claims, car rentals
• Build the first data mart using business dimensions and metrics
• These business dimensions will be shared in the future data marts
• By conforming dimensions among the various data marts, the
result would be logically integrated supermarts that will provide an
enterprise view of the data
• The data marts contain atomic data organized as a dimensional
data model
• This architectural type results from adopting an enhanced bottom-up
approach to data warehouse development
30
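A small sketch of why conformed dimensions matter, assuming two hypothetical data marts (sales and shipments) that share one product dimension. The keys, names, and quantities are invented for illustration.

```python
# One conformed product dimension shared by both data marts.
product_dim = {
    101: {"name": "widget", "category": "hardware"},
    102: {"name": "gadget", "category": "hardware"},
}

# Fact rows in two separate data marts reference the same product keys.
sales_facts = [(101, 10), (102, 4), (101, 5)]  # (product_key, units_sold)
shipment_facts = [(101, 12), (102, 4)]         # (product_key, units_shipped)


def total_by_product(facts):
    """Sum the measure for each product key."""
    totals = {}
    for key, units in facts:
        totals[key] = totals.get(key, 0) + units
    return totals


def drill_across(dim, sales, shipments):
    """Combine the two marts on the shared dimension key."""
    sold = total_by_product(sales)
    shipped = total_by_product(shipments)
    return {
        dim[key]["name"]: {"sold": sold.get(key, 0), "shipped": shipped.get(key, 0)}
        for key in dim
    }


report = drill_across(product_dim, sales_facts, shipment_facts)
```

Because both marts use the same product keys and definitions, sales and shipments can be analyzed together. With independent data marts that define products differently, this combination would not line up.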
OVERVIEW OF THE COMPONENTS

• You build a data warehouse with software and hardware
components
• You arrange these building blocks in a certain way for maximum
benefit
• You may also want to lay special emphasis on one component; you
may want to bolster up another component with extra tools and
services
• All of this depends on your circumstances

31
OVERVIEW OF THE COMPONENTS

32
OVERVIEW OF THE COMPONENTS

Source Data Component


• Source data coming into the data warehouse may be grouped into
four broad categories
Production Data
• This category of data comes from the various operational systems of
the enterprise
• These normally include financial systems, manufacturing systems,
systems along the supply chain, and customer relationship
management systems
• Based on the information requirements in the data warehouse, you
choose segments of data from the different operational systems

33
OVERVIEW OF THE COMPONENTS

Source Data Component


Internal Data
• In every organization, users keep their “private” spread-sheets,
documents, customer profiles, and sometimes even departmental
databases
• This is the internal data, parts of which could be useful in a data
warehouse

34
OVERVIEW OF THE COMPONENTS

Source Data Component


Archived Data
• In every operational system, you periodically take the old data and
store it in archived files
• The circumstances in your organization dictate how often and which
portions of the operational databases are archived for storage
• Some data is archived after a year
• Sometimes data is left in the operational system databases for as
long as five years
• Much of the archived data comes from old legacy systems that are
nearing the end of their useful lives in organizations
35
OVERVIEW OF THE COMPONENTS

Source Data Component


External Data
• Most executives depend on data from external sources for a high
percentage of the information they use
• They use statistics relating to their industry produced by external
agencies and national statistical offices
• They use market share data of competitors
• They use standard values of financial indicators for their business to
check on their performance

36
OVERVIEW OF THE COMPONENTS

Data Staging Component


• Three major functions need to be performed for getting the data
ready
• You have to extract the data, transform the data, and then load the
data into the data warehouse storage
• These three major functions of extraction, transformation, and
preparation for loading take place in a staging area
• The data staging component consists of a workbench for these
functions
• Data staging provides a place and an area with a set of functions to
clean, change, combine, convert, deduplicate, and prepare source
data for storage and use in the data warehouse
37
OVERVIEW OF THE COMPONENTS

Data Staging Component


Data Extraction
• This function has to deal with numerous data sources
• You have to employ the appropriate technique for each data source
• Source data may be from different source machines in diverse data
formats
• Part of the source data may be in relational database systems
• Some data may be on other legacy network and hierarchical data
models
• Many data sources may still be in flat files
• You may want to include data from spreadsheets and local
departmental data sets. Data extraction may become quite complex
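As a rough illustration of using a different technique per source, the sketch below reads one flat file and one relational table into a common list of rows. The in-memory CSV text and SQLite table are stand-ins for real sources, and the `orders` layout is an assumption.

```python
import csv
import io
import sqlite3


def extract_flat_file(text):
    """Read rows from a comma-separated flat file held in a string."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_relational(conn, query):
    """Read rows from a relational source as dictionaries."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]


# Stand-ins for real sources: an in-memory CSV and an in-memory database.
flat_source = "order_id,amount\n1,25.50\n2,10.00\n"
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
db.execute("INSERT INTO orders VALUES (3, 99.0)")

extracted = extract_flat_file(flat_source) + extract_relational(
    db, "SELECT order_id, amount FROM orders"
)
```

Note that the flat-file values arrive as strings while the database values arrive typed. Reconciling such differences is left to the transformation step.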
38
OVERVIEW OF THE COMPONENTS

Data Staging Component


Data Transformation
• You perform a number of individual tasks as part of data
transformation
• First, you clean the data extracted from each source
• Cleaning may include:
• Correction of misspellings
• Resolution of conflicts between state codes and zip codes in the source
data
• Provision of default values for missing data elements
• Elimination of duplicates when you bring in the same data from multiple
source systems
39
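Three of these cleaning tasks (misspellings, default values, duplicates) can be sketched as below. The lookup tables and records are hypothetical, duplicates are detected only by `customer_id`, and the state/zip conflict check is not covered.

```python
# Illustrative lookup tables; real ones would come from reference data.
SPELLING_FIXES = {"Lahor": "Lahore", "Karachee": "Karachi"}
DEFAULTS = {"country": "unknown"}


def clean(records):
    """Correct known misspellings, fill defaults, and drop duplicates."""
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Start from defaults, then overlay the values that are present.
        present = {k: v for k, v in rec.items() if v not in (None, "")}
        row = {**DEFAULTS, **present}
        city = row.get("city")
        row["city"] = SPELLING_FIXES.get(city, city)
        if row["customer_id"] in seen_ids:
            continue  # same customer already brought in from another source
        seen_ids.add(row["customer_id"])
        cleaned.append(row)
    return cleaned


raw = [
    {"customer_id": 1, "city": "Lahor", "country": "PK"},   # misspelled
    {"customer_id": 1, "city": "Lahore", "country": "PK"},  # duplicate
    {"customer_id": 2, "city": "Karachi", "country": ""},   # missing value
]
cleaned_rows = clean(raw)
```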
OVERVIEW OF THE COMPONENTS

Data Staging Component


Data Transformation
• Standardization of data elements forms a large part of data
transformation
• You standardize the data types and field lengths for the same data
elements retrieved from the various sources
• Semantic standardization is another major task
• You resolve synonyms and homonyms
• When two or more terms from different source systems mean the same
thing, you resolve the synonyms
• When a single term means many different things in different source
systems, you resolve the homonym
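Synonym and homonym resolution can be pictured as a lookup keyed by both the source system and the source term. The systems and terms below are hypothetical examples.

```python
# Hypothetical mapping from (source system, source term) to one warehouse term.
TERM_MAP = {
    # Synonyms: different source terms, same meaning -> one standard term.
    ("sales", "client"): "customer",
    ("billing", "account_holder"): "customer",
    # Homonym: the same source term means different things per system,
    # so each meaning gets its own standard term.
    ("sales", "balance"): "order_balance",
    ("finance", "balance"): "ledger_balance",
}


def standardize_field(source, term):
    """Return the warehouse's standard name for a source field."""
    return TERM_MAP.get((source, term), term)
```

Keying on the source system as well as the term is what lets the homonym "balance" resolve to two distinct warehouse names.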
40
OVERVIEW OF THE COMPONENTS
[Figure: Data Staging Component]

41
OVERVIEW OF THE COMPONENTS

Data Staging Component


Data Loading
• Two distinct groups of tasks form the data loading function:
• When you complete the design and construction and go live for the
first time, initial loading of the data is done
• It moves large volumes of data using up substantial amounts of time
• As the data warehouse starts functioning, you continue to extract the
changes to the source data, transform the data revisions, and feed
the incremental data revisions on an ongoing basis
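The two groups of loading tasks might be sketched like this, assuming each source row carries a hypothetical `updated_at` counter that serves as a high-water mark for detecting changes.

```python
def initial_load(source_rows):
    """First-time load: copy every source row into the warehouse."""
    return list(source_rows)


def incremental_load(warehouse, source_rows, last_loaded_at):
    """Append only the rows changed since the previous load."""
    changes = [r for r in source_rows if r["updated_at"] > last_loaded_at]
    warehouse.extend(changes)
    # Advance the high-water mark to the newest change that was loaded.
    return max((r["updated_at"] for r in changes), default=last_loaded_at)


source = [
    {"id": 1, "qty": 5, "updated_at": 1},
    {"id": 2, "qty": 7, "updated_at": 2},
]
warehouse = initial_load(source)
mark = 2

source.append({"id": 1, "qty": 6, "updated_at": 3})  # a later revision
mark = incremental_load(warehouse, source, mark)
```

In this sketch the revision is appended as a new row rather than overwriting the earlier one, which keeps the warehouse's history intact.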

42
OVERVIEW OF THE COMPONENTS

Data Storage Component


• The data storage for the data warehouse is a separate repository
• Data in the data warehouse is in structures suitable for analysis, and
not for quick retrieval of individual pieces of information
• Therefore, the data storage for the data warehouse is kept separate
from the data storage for operational systems
• Generally, the database in your data warehouse must be open
• Depending on your requirements, you are likely to use tools from
multiple vendors
• The data warehouse must be open to different tools
• Most of the data warehouses employ relational database management
systems
43
OVERVIEW OF THE COMPONENTS

Data Storage Component


• Many data warehouses also employ multidimensional database
management systems
• Data extracted from the data warehouse storage is aggregated in
many ways and the summary data is kept in the multidimensional
databases (MDDBs)
• Such multidimensional database systems are usually proprietary
products

44
OVERVIEW OF THE COMPONENTS

Information Delivery Component


• The information delivery component includes different methods of
information delivery
• Ad hoc reports are predefined reports primarily meant for novice and
casual users
• Provision for complex queries, multidimensional (MD) analysis, and
statistical analysis cater to the needs of the business analysts and
power users
• Information fed into executive information systems (EIS) is meant for
senior executives and high-level managers
• Some data warehouses also provide data to data-mining applications
• Data-mining applications are knowledge discovery systems where the
mining algorithms help you discover trends and patterns from the usage
of your data
45
OVERVIEW OF THE COMPONENTS
[Figure: Information Delivery Component]

46
OVERVIEW OF THE COMPONENTS

Metadata Component
• The metadata component is the data about the data in the data
warehouse
• Metadata in a data warehouse is similar to the data dictionary or the
data catalog in a database management system, but it is much more
than a data dictionary

47
OVERVIEW OF THE COMPONENTS

Management and Control Component


• The management and control component coordinates the services
and activities within the data warehouse
• This component controls the data transformation and the data
transfer into the data warehouse storage
• On the other hand, it moderates the information delivery to the users
• It works with the database management systems and enables data to
be properly stored in the repositories
• It monitors the movement of data into the staging area and from
there into the data warehouse storage itself
48
METADATA IN THE DATA WAREHOUSE

• Metadata in a data warehouse falls into three major categories:
• Operational metadata
• Extraction and transformation metadata
• End-user metadata

49
METADATA IN THE DATA WAREHOUSE

Operational Metadata
• Data for the data warehouse comes from several operational systems
of the enterprise
• These source systems contain different data structures
• The data elements selected for the data warehouse have various field
lengths and data types
• In selecting data from the source systems for the data warehouse,
you split records, combine parts of records from different source
files, and deal with multiple coding schemes and field lengths
• When you deliver information to the end-users, you must be able to
tie that back to the original source data sets
• Operational metadata contain all of this information about the
operational data sources
50
METADATA IN THE DATA WAREHOUSE

Extraction and Transformation Metadata


• Extraction and transformation metadata contain data about the
extraction of data from the source systems, namely the extraction
frequencies, extraction methods, and business rules for the data
extraction
• This category of metadata also contains information about all the
data transformations that take place in the data staging area
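One possible shape for such a metadata record is sketched below. The class name, field names, and sample values are assumptions chosen to mirror the items listed above, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionMetadata:
    """Describes how one source feeds the warehouse (illustrative fields)."""
    source_system: str
    frequency: str  # e.g. "daily"
    method: str     # e.g. "full" or "incremental"
    business_rules: list = field(default_factory=list)
    transformations: list = field(default_factory=list)


orders_meta = ExtractionMetadata(
    source_system="order_processing",
    frequency="daily",
    method="incremental",
    business_rules=["exclude cancelled orders"],
    transformations=["standardize state codes", "deduplicate customers"],
)
```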

51
METADATA IN THE DATA WAREHOUSE

End-User Metadata
• The end-user metadata is the navigational map of the data
warehouse
• It enables the end-users to find information from the data warehouse
• The end-user metadata allows the end-users to use their own
business terminology and look for information in those ways in which
they normally think of the business
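The navigational map can be imagined as a catalog that translates business terms into warehouse locations. The catalog entries, table names, and column names below are hypothetical.

```python
# Hypothetical catalog mapping business terms to warehouse locations.
END_USER_CATALOG = {
    "revenue": ("sales_fact", "net_amount"),
    "customer": ("customer_dim", "customer_name"),
}


def locate(term):
    """Translate a business term into a (table, column) pair."""
    try:
        return END_USER_CATALOG[term.strip().lower()]
    except KeyError:
        raise KeyError(f"no warehouse mapping for business term {term!r}") from None
```

A user can then ask for "Revenue" in their own vocabulary without knowing which table or column holds it.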

52
METADATA IN THE DATA WAREHOUSE

Special Significance
• It acts as the glue that connects all parts of the data warehouse.
• It provides information about the contents and structures to the
developers
• It opens the door to the end-users and makes the contents
recognizable in their own terms

53
