DWM 1


Unit-1 Introduction to Data Warehousing

Data Warehouse:-

A data warehouse is a data management system that stores data from multiple sources in a single
repository for analytics and decision support. The term Data Warehouse was defined by Bill Inmon in 1990
in the following way:

"A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process."

He defined the terms in the sentence as follows:

Subject Oriented - Data that gives information about a particular subject instead of about a company's
ongoing operations.

Integrated - Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.

Time-variant - All data in the data warehouse is identified with a particular time period.

Non-volatile - Data is stable in a data warehouse. More data is added but data is never removed. This
enables management to gain a consistent picture of the business.

Features/Characteristics of Data warehouse:-

a) Subject Oriented: A data warehouse is subject oriented because it provides information around a
subject rather than the organization's ongoing operations. These subjects can be product, customers,
suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather it
focuses on modeling and analysis of data for decision making

b) Integrated: A data warehouse is constructed by integrating data from heterogeneous sources such as
relational databases, flat files, etc. This integration enhances the effective analysis of data.

c) Non-volatile: Non-volatile means the previous data is not erased when new data is added to it. A data
warehouse is kept separate from the operational database, and therefore frequent changes in the
operational database are not reflected in the data warehouse.

d) Time Variant: The data collected in a data warehouse is identified with a particular time period. The
data in a data warehouse provides information from the historical point of view.

Difference between Operational Database & Data Warehouse

The Operational Database is the source of information for the data warehouse. It includes detailed
information used to run the day-to-day operations of the business. The data changes frequently as
updates are made and reflects the current value of the last transaction.

Operational Database Management Systems, also called OLTP (Online Transaction Processing)
databases, are used to manage dynamic data in real time.

Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are called Online Analytical
Processing (OLAP) systems.

A data warehouse and an OLTP database are typically both relational databases. However, the goals of
these databases are different.

Operational Database | Data Warehouse

Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).

Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.

Data within operational systems is updated regularly according to need. | Data is non-volatile. New data may be added regularly, but once added it is rarely changed.

It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.

It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for large loads and complex, unpredictable queries that access many rows per table.

It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP.

Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented.

Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.

Data In | Data Out

A small number of records is accessed. | A large number of records is accessed.
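
The contrast between single-row transactions and many-row analytical queries can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `sales` table, its columns, and the sample values are assumptions, and a real warehouse would normally be a separate system from the operational database.

```python
import sqlite3

# Hypothetical sales table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "North", 2022, 100.0),
        (2, "North", 2023, 150.0),
        (3, "South", 2022, 200.0),
        (4, "South", 2023, 250.0),
    ],
)

# OLTP-style access: update and read back a single row by its key.
conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (175.0, 2))
single_row = conn.execute("SELECT amount FROM sales WHERE id = ?", (2,)).fetchone()

# OLAP-style access: scan many rows and aggregate them by subject (region).
totals_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The first query touches one row, while the second reads every row in the table to produce a summary per region.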

Need of Data warehouse

-A data warehouse allows business users to quickly access critical data from several sources all in one place.

-A data warehouse provides consistent information on various cross-functional activities. It also
supports ad-hoc reporting and queries.

-A data warehouse helps to integrate many sources of data to reduce stress on the production system.

-A data warehouse helps to reduce total turnaround time for analysis and reporting.

-Restructuring and integration make the data easier for the user to use for reporting and analysis.

-A data warehouse allows users to access critical data from a number of sources in a single place.
Therefore, it saves the user's time in retrieving data from multiple sources.

-A data warehouse stores a large amount of historical data. This helps users to analyze different time
periods and trends to make future predictions.

Data Warehouse Architecture:-

The data in a data warehouse comes from operational systems of the organization as well as from other
external sources. These are collectively referred to as source systems.

The data extracted from source systems is stored in an area called the data staging area, where the data is
cleaned, transformed, combined, and de-duplicated to prepare it for the data warehouse.

The data staging area is generally a collection of machines where simple activities like sorting and
sequential processing take place. The data staging area does not provide any query or presentation
services.

As soon as a system provides query or presentation services, it is categorized as a presentation server.

A Presentation Server is the target machine on which data from the data staging area is loaded, organized,
and stored for direct querying by end users.

The three different kinds of systems that are required for a data warehouse are:

1.Source Systems

2.Data Staging Area

3.Presentation servers

The data travels from source systems to presentation servers via the data staging area. The entire
process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer).
Oracle's ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server's ETL tool is called Data
Transformation Services (DTS).
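
As a rough sketch of how data moves through the three kinds of systems, the following Python fragment models each one as a plain list. The function names, the `customer_id` and `name` fields, and the sample rows are all hypothetical and chosen only for illustration.

```python
def extract(source_rows):
    """Copy rows out of a source system into the staging area unchanged."""
    return [dict(row) for row in source_rows]


def stage(rows):
    """Clean and de-duplicate rows inside the staging area."""
    seen = set()
    staged = []
    for row in rows:
        name = row["name"].strip().title()  # clean: trim spaces, fix casing
        key = (row["customer_id"], name)
        if key in seen:  # de-duplicate
            continue
        seen.add(key)
        staged.append({"customer_id": row["customer_id"], "name": name})
    return staged


def load(presentation_table, rows):
    """Load staged rows into the presentation server, where users query them."""
    presentation_table.extend(rows)
    return presentation_table


# Hypothetical source system containing untidy and duplicated records.
source_system = [
    {"customer_id": 1, "name": " alice "},
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "bob"},
]

presentation_server = load([], stage(extract(source_system)))
```

Only the final list is meant to be queried; the intermediate staging result is never exposed to end users, which mirrors the rule that the staging area offers no query or presentation services.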

Data Warehouse Architecture

Each component and the tasks performed by them are explained below:
1.Operational Source
The sources of data for the data warehouse are supplied from:
-The data from the mainframe systems in the traditional network and hierarchical format.
-Data can also come from the relational DBMS like Oracle, Informix.

2. Load Manager
-The load manager performs all the operations associated with extracting and loading data into the
data warehouse.
-These operations include simple transformations of the data to prepare the data for entry into the
warehouse.
-The size and complexity of this component will vary between data warehouses and may be constructed
using a combination of vendor data loading tools and custom built programs.

3. Warehouse Manager
-The warehouse manager performs all the operations associated with the management of data in the
warehouse.
-This component is built using vendor data management tools and custom built programs.
-The operations performed by the warehouse manager include:
o Analysis of data to ensure consistency.
o Transformation and merging of the source data from temporary storage into data warehouse tables.
o Creation of indexes and views on the base tables.
o Denormalization.
o Generation of aggregations.
o Backing up and archiving of data.
-In certain situations, the warehouse manager also generates query profiles to determine which indexes
and aggregations are appropriate.

4.Query Manager
-The query manager performs all operations associated with the management of user queries.
-This component is usually constructed using vendor end-user access tools, data warehousing
monitoring tools, database facilities, and custom built programs.

5.Detailed Data
-This area of the warehouse stores all the detailed data in the database schema.
-The detailed data is added regularly to the warehouse to supplement the aggregated data

6.Lightly and Highly Summarized Data
-This stores all the predefined lightly and highly summarized (aggregated) data generated by the
warehouse manager.
-The main goal of the summarized information is to speed up the query performance.
-As the new data is loaded into the warehouse, the summarized data is updated continuously.
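
A minimal sketch of how lightly and highly summarized data could be derived from detailed data is shown below. It assumes, purely for illustration, that "lightly summarized" means monthly totals and "highly summarized" means yearly totals. The `detailed` rows and the `summarize` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical detailed data: (ISO date, sale amount).
detailed = [
    ("2023-01-05", 10.0),
    ("2023-01-20", 15.0),
    ("2023-02-03", 20.0),
    ("2024-01-10", 5.0),
]


def summarize(rows, key_length):
    """Total the amounts, grouping by the first key_length characters of the date."""
    totals = defaultdict(float)
    for date, amount in rows:
        totals[date[:key_length]] += amount
    return dict(totals)


lightly_summarized = summarize(detailed, 7)  # one total per month ("YYYY-MM")
highly_summarized = summarize(detailed, 4)   # one total per year ("YYYY")
```

A query that only needs yearly totals can read the small `highly_summarized` result instead of scanning every detailed row, which is where the speed-up comes from.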

7.Archive and Back up Data
-The detailed and summarized data are stored for the purpose of archiving and back up.
-The data is transferred to storage archives such as magnetic tapes or optical disks.

8.Meta Data
-The data warehouse also stores all the Meta data (data about data) definitions used by all processes in
the warehouse.

9. End-User Access Tools
-The main purpose of a data warehouse is to provide information to the business managers for strategic
decision-making.
-These users interact with the warehouse using end user access tools.
-Some of the examples of end user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Three-Tier / Multi-Tier Data Warehouse Architecture:-

Data Warehouses usually have a three-level (tier) architecture that includes:

• Bottom Tier (Data Warehouse Server)
• Middle Tier (OLAP Server)
• Top Tier (Front end Tools).

The bottom tier consists of the Data Warehouse server, which is almost always an RDBMS. It may
include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile data provided by external
consultants) is extracted using application program interfaces called gateways. A gateway is provided
by the underlying DBMS and allows client programs to generate SQL code to be executed at a
server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and
Embedding, Database), by Microsoft, and JDBC (Java Database Connectivity).

The middle tier consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on
multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements
multidimensional information and operations.
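
To make the ROLAP idea concrete, the sketch below shows how two common multidimensional operations could be expressed as ordinary relational queries. The operation names (roll-up and slice), the `sales_fact` table, and its data are assumptions introduced for illustration, and sqlite3 stands in for the relational DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table with two dimensions (location, time) and one measure.
conn.execute(
    "CREATE TABLE sales_fact (city TEXT, country TEXT, quarter TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("Pune", "India", "Q1", 10),
        ("Mumbai", "India", "Q1", 20),
        ("Pune", "India", "Q2", 30),
        ("Paris", "France", "Q1", 40),
    ],
)

# Roll-up: move up the location hierarchy from city to country via GROUP BY.
roll_up = conn.execute(
    "SELECT country, quarter, SUM(units) FROM sales_fact "
    "GROUP BY country, quarter ORDER BY country, quarter"
).fetchall()

# Slice: fix one value of the time dimension via a WHERE clause.
slice_q1 = conn.execute(
    "SELECT city, units FROM sales_fact WHERE quarter = ? ORDER BY city", ("Q1",)
).fetchall()
```

A MOLAP server would instead keep the data in a multidimensional structure and answer such requests without translating them to SQL.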

The top tier contains front-end tools for displaying results provided by OLAP, as well as additional
tools for data mining of the OLAP-generated data.

The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

• A description of the DW structure, including the warehouse schema, dimensions, hierarchies,
data mart locations and contents, etc.
• Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.
• System performance data, which includes indices used to improve data access and retrieval
performance.
• Information about the mapping from operational databases, which provides source RDBMSs
and their contents, cleaning and transformation rules, etc.
• Summarization algorithms, predefined queries, and reports.
• Business data, which includes business terms and definitions, ownership information, etc.

Data Warehouse Models:-


Data warehouse modeling is the process of designing the schemas of the detailed and summarized
information of the data warehouse. The goal of data warehouse modeling is to develop a schema
describing the reality, or at least the part of it, that the data warehouse is needed to support.

Types of Data Warehouse Models

Enterprise Warehouse

-An Enterprise warehouse collects all of the records about subjects spanning the entire organization. It
supports corporate-wide data integration, usually from one or more operational systems or external
data providers, and it is cross-functional in scope. It generally contains detailed information as well as
summarized information and can range in size from a few gigabytes to hundreds of gigabytes,
terabytes, or beyond.

-It supports reporting, analysis and planning.

-It provides access to all data within an organization, without compromising security and integrity of the
data.

Data Mart

-A data mart is a small warehouse which is designed for the department level.

-A data mart includes a subset of corporate-wide data that is of value to a specific collection of users.
The scope is confined to particular selected subjects. For example, a marketing data mart may restrict its
subjects to the customer, items, and sales. The data contained in the data marts tend to be summarized.

Data marts are divided into two types:

Independent Data Mart: An independent data mart is sourced from data captured from one or more
operational systems or external data providers, or from data generated locally within a particular department
or geographic area. It is built by drawing data from operational or external data sources or both.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses. A dependent
data mart is built by drawing data from a central data warehouse.

Virtual Warehouses
-A virtual warehouse is a set of views over operational databases.
-It gives a quick overview of the data. It holds metadata and connects to several data sources through
middleware.
-A virtual warehouse is simple to build but requires excess capacity on operational database servers.
-The processing can be fast and allows users to filter the most important data from different legacy
applications.
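
The idea of a virtual warehouse as a set of views can be sketched with a single SQL view. The `orders` table, the `customer_totals` view, and the sample rows are hypothetical, and sqlite3 stands in for an operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical operational table.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Asha", 50.0), (2, "Ravi", 70.0), (3, "Asha", 30.0)],
)

# The "virtual warehouse" is just a view: no data is copied anywhere.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

before = conn.execute("SELECT * FROM customer_totals ORDER BY customer").fetchall()

# A new operational transaction is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (4, 'Ravi', 10.0)")
after = conn.execute("SELECT * FROM customer_totals ORDER BY customer").fetchall()
```

Because every query against the view is executed on the operational tables themselves, analytical work competes with day-to-day transactions, which is why excess capacity is needed on the operational servers.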

Extraction, Transformation and Loading:-

The mechanism of extracting information from source systems and bringing it into the data warehouse
is commonly called ETL, which stands for Extraction, Transformation and Loading.

Extraction

In this step, data is extracted from the source system to the ETL server or staging area. Transformation is
done in this area so that the performance of the source system is not degraded. If corrupted data is
copied directly into the data warehouse from the source system, rollback will be a challenge.
The staging area allows validation of the extracted data before it moves into the data warehouse.

The data warehouse has to integrate systems that have different DBMSs, hardware,
operating systems, and communication protocols. There is therefore a need for a logical data map before data is
extracted and loaded physically. This data map describes all the relationships between the sources and
the target data.

There are three methods to extract the data.

1. FULL Extraction

2. Partial Extraction- without update notification

3. Partial Extraction-With update notification

Whichever extraction method is used, it should not affect the performance and response time
of the source systems. These source systems are live production systems.
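
The three extraction methods can be sketched as follows. The sketch assumes that each source row carries an `updated_at` timestamp and that, when notification is available, the source supplies the ids of changed rows. Both are assumptions for illustration, as are the function names.

```python
def full_extraction(source_rows):
    """Extract every row, regardless of whether it has changed."""
    return list(source_rows)


def partial_extraction_without_notification(source_rows, last_extracted_at):
    """The source does not report changes, so compare timestamps ourselves."""
    return [row for row in source_rows if row["updated_at"] > last_extracted_at]


def partial_extraction_with_notification(source_rows, changed_ids):
    """The source reports which rows changed, so extract only those."""
    return [row for row in source_rows if row["id"] in changed_ids]


# Hypothetical source rows; ISO dates compare correctly as plain strings.
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-15"},
    {"id": 3, "updated_at": "2024-02-01"},
]

everything = full_extraction(source)
changed_since = partial_extraction_without_notification(source, "2024-01-10")
notified = partial_extraction_with_notification(source, {3})
```

The partial methods read fewer rows, which helps keep the load on the live source system low.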

Validations during the extraction:

o Reconcile records with the source data.

o Check that the data types are correct.

o Check that all the keys are in place.

o Make sure that no spam/unwanted data is loaded.

o Remove all kinds of fragmented and duplicate data.
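
A minimal sketch of these checks is given below. The `id` and `amount` field names, the error messages, and the `validate_extracted` helper are hypothetical, and a real staging area would apply many more rules.

```python
def validate_extracted(rows, source_count):
    """Return (clean_rows, errors) after basic extraction checks."""
    errors = []
    # Reconcile the number of extracted records with the source.
    if len(rows) != source_count:
        errors.append("record count does not match source")
    seen = set()
    clean = []
    for row in rows:
        key = row.get("id")
        if key is None:  # key must be in place
            errors.append("missing key")
            continue
        if not isinstance(row.get("amount"), (int, float)):  # data type check
            errors.append(f"bad data type for id {key}")
            continue
        if key in seen:  # drop duplicates silently
            continue
        seen.add(key)
        clean.append(row)
    return clean, errors


# Hypothetical extracted rows containing a duplicate, a missing key and a bad type.
extracted = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5},
    {"id": 2, "amount": "x"},
]
clean_rows, problems = validate_extracted(extracted, source_count=4)
```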

Transformation

Data extracted from the source server is raw and not usable in its original form. Therefore the data should
be mapped, cleansed, and transformed. Transformation is an important step where the ETL process
adds value and changes the data so that BI reports can be generated.

In this step, we apply a set of functions on extracted data. Data that does not require any
transformation is called direct move or pass-through data.

In this step, we can apply customized operations on data. For example, if the first name and the last name
in a table are in different columns, it is possible to concatenate them before loading.

Validation during the Transformation:

1. Filtering: Select only specific columns for loading.

2. Character set conversion and encoding handling.

3. Data threshold and validation checks. For example, Age cannot be more than two digits.

4. Required fields should not be left blank.

5. Transpose the rows and columns.

6. Use lookups to merge the data.
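
A few of these transformation rules can be sketched in a single function. The field names (`first_name`, `last_name`, `age`) and the `transform` helper are hypothetical. The sketch covers column filtering, the required-field check, the age threshold, and the name concatenation example.

```python
def transform(record):
    """Apply a handful of transformation and validation rules to one record."""
    # Required fields must not be blank.
    if not record.get("first_name") or not record.get("last_name"):
        raise ValueError("required field is blank")
    # Threshold check: age cannot be more than two digits.
    age = int(record["age"])
    if not 0 <= age <= 99:
        raise ValueError("age must be at most two digits")
    # Filtering plus a custom operation: keep only the columns we need
    # and concatenate first and last name into one column.
    return {
        "full_name": f'{record["first_name"]} {record["last_name"]}',
        "age": age,
    }


row = {"first_name": "Ada", "last_name": "Lovelace", "age": "36", "notes": "ignored"}
transformed = transform(row)
```

Records that fail a check raise an error here. A production ETL job would more likely route them to a reject file for review.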

Loading

Loading the data into the data warehouse is the last step of the ETL process. A vast volume of data
needs to be loaded into the data warehouse in a short time. To increase performance, loading
should be optimized.

If the loading fails, a recovery mechanism should be in place to restart from the point of failure
without loss of data integrity. The admin of a data warehouse needs to monitor, resume, and cancel loads as
per server performance.

Types of Loading

1. Initial Load - Populating all the data warehouse tables.

2. Incremental Load - Applying ongoing changes when needed.

3. Full Refresh - Erasing the contents of one or more tables and reloading them with new data.
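
The three loading types can be sketched against a single table. The `dim_product` table and the helper functions are hypothetical, and `INSERT OR REPLACE` is SQLite-specific syntax used here as a simple way to apply changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table.
conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT)")


def initial_load(rows):
    """Populate the empty warehouse table for the first time."""
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)


def incremental_load(changes):
    """Apply only new or changed rows (SQLite-specific upsert)."""
    conn.executemany("INSERT OR REPLACE INTO dim_product VALUES (?, ?)", changes)


def full_refresh(rows):
    """Erase the table contents and reload it with new data."""
    conn.execute("DELETE FROM dim_product")
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)


def snapshot():
    return conn.execute("SELECT id, name FROM dim_product ORDER BY id").fetchall()


initial_load([(1, "Pen"), (2, "Book")])
after_initial = snapshot()

incremental_load([(2, "Notebook"), (3, "Bag")])
after_incremental = snapshot()

full_refresh([(10, "Lamp")])
after_refresh = snapshot()
```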

Metadata Repository:-

A metadata repository is a pretentious term for nothing other than a computerized database containing
metadata to support the development, maintenance, and operations of a major portion of an
enterprise's systems. Among other things, such a repository can be the foundation for a data warehouse.

Metadata
-Data warehouse metadata are pieces of information stored in one or more special-purpose metadata
repositories that include:

-Information on the contents of the data warehouse, their location and their structure.

-Information on the processes that take place in the data warehouse back-stage, concerning the
refreshment of the warehouse with clean, up-to-date, semantically and structurally reconciled data.

-Information on the implicit semantics of data (with respect to a common enterprise model), along with
any other kind of data that helps the end user exploit the information of the warehouse.

-Information on the infrastructure and physical characteristics of components and the sources of the
data warehouse.

-Information including security, authentication, and usage statistics that helps the administrator tune
the operation of the data warehouse as appropriate.

Metadata of a Bookstore

Metadata can be compared to the catalog of an online book store such as Amazon. If we consider each data
element as a book, the metadata will contain:

-Name of the book,

-Summary of the book,

-Assessments about the book,

-The date of publication,

-High level description of what it contains,

-Who the publishers are,

-How you can find the book,

-Author of the book,

-Whether the book is available or not.

This information helps you to:

-Search for the book

-Access the book

-Understand the book before you access or buy it.

Types of Metadata in Data Warehouse

Operational Meta data

-In an Enterprise, data for the data warehouse comes from various operational systems.
-Different source systems contain different data structures having different field lengths and data types.
-So the information of operational data source is given by Operational Meta Data.

Extraction and Transformation Metadata

This contains information about the extraction of data from heterogeneous source systems.

It also contains information about data transformation in the data staging area.

End User Meta data

This particular category provides the end user the flexibility of looking for information in their own way.

How can Data Warehousing Metadata be Managed ?

Meta data is managed from both the people side and the process side. On the people side, users need to be trained on the
importance and usage of metadata. They need to understand how and when to use the tools and the
benefits gained from the metadata.

Meta data management on the process side encompasses the data warehousing and business intelligence life
cycle. Meta data is entered into the appropriate tool and stored in a repository for further use as the life
cycle progresses.

Metadata can be managed through individual tools:

*Metadata manager/repository
*Metadata extract tools
*Data modeling
*ETL
*BI Reporting

Role of Metadata:-
-Metadata plays a very important role in a data warehouse, and that role is different from the role of the
warehouse data itself. The various roles of metadata are explained below.
1. Metadata acts as a directory.
2. This directory helps the decision support system to locate the contents of the data warehouse.
3. Metadata helps the decision support system in mapping data when it is transformed from the
operational environment to the data warehouse environment.
4. Metadata helps in summarization between current detailed data and highly summarized data.
5. Metadata also helps in summarization between lightly summarized data and highly summarized data.
6. Metadata is used for query tools.
7. Metadata is used in extraction and cleansing tools.
8. Metadata is used in reporting tools.
9. Metadata is used in transformation tools.
10. Metadata plays an important role in loading functions.
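
The directory role can be sketched as a small in-memory catalog. The `TableMetadata` fields, the table names, and the locations below are all hypothetical, and a real repository would hold far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical metadata entry describing one warehouse table."""
    name: str
    source_system: str   # where the data was extracted from
    location: str        # where the table lives in the warehouse
    summary_level: str   # "detailed", "lightly summarized" or "highly summarized"


# The catalog acts as a directory keyed by table name.
catalog = {
    entry.name: entry
    for entry in [
        TableMetadata("sales_detail", "billing_db", "warehouse.detail.sales", "detailed"),
        TableMetadata("sales_monthly", "billing_db", "warehouse.summary.sales_monthly",
                      "lightly summarized"),
    ]
}


def locate(table_name):
    """Use the metadata directory to find where a table is stored."""
    return catalog[table_name].location


def tables_at_level(summary_level):
    """List the tables stored at a given level of summarization."""
    return sorted(e.name for e in catalog.values() if e.summary_level == summary_level)
```

A query tool could call `locate("sales_monthly")` to find the table it needs without knowing the physical layout of the warehouse.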

Metadata Manager/Repository
-A metadata repository is an integral part of a data warehouse system.
-A shared repository that combines information from multiple sources is used for managing metadata.
-Most organizations start with a spreadsheet containing data definitions, then grow to a more
refined approach.
-A meta data manager may be purchased as a software package or built as a full-fledged system.
