DWM 1
Data Warehouse:-
A data warehouse is a data management system that stores data from multiple sources in a single
repository for analytics and decision support. The term Data Warehouse was defined by Bill Inmon in 1990
in the following way:
Subject Oriented - Data that gives information about a particular subject instead of about a company's
ongoing operations.
Integrated - Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.
Time-variant - All data in the data warehouse is identified with a particular time period.
Non-volatile - Data is stable in a data warehouse. More data is added but data is never removed. This
enables management to gain a consistent picture of the business.
a) Subject Oriented: A data warehouse is subject oriented because it provides information around a
subject rather than the organization's ongoing operations. These subjects can be products, customers,
suppliers, sales, revenue, etc. A data warehouse does not focus on ongoing operations; rather, it
focuses on modeling and analysis of data for decision making.
b) Integrated: A data warehouse is constructed by integrating data from heterogeneous sources such as
relational databases, flat files, etc. This integration enhances the effective analysis of data.
c) Non-volatile: Non-volatile means that previous data is not erased when new data is added. A data
warehouse is kept separate from the operational database, and therefore frequent changes in the
operational database are not reflected in the data warehouse.
d) Time Variant: The data collected in a data warehouse is identified with a particular time period. The
data in a data warehouse provides information from a historical point of view.
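The non-volatile and time-variant properties can be illustrated with a minimal sketch. It uses an in-memory SQLite table; the table name `customer_history`, its columns, and the sample rows are assumptions made for illustration only. Each load appends rows stamped with a load date instead of overwriting earlier ones, so the history remains available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_history (customer_id INTEGER, city TEXT, load_date TEXT)"
)

def load_snapshot(rows, load_date):
    """Append a snapshot; earlier rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        [(cid, city, load_date) for cid, city in rows],
    )
    conn.commit()

load_snapshot([(1, "Pune")], "2023-01-31")
load_snapshot([(1, "Mumbai")], "2023-02-28")  # customer moved; the old row is kept

# Every row carries its time period, so the history can be read back in order.
history = conn.execute(
    "SELECT city, load_date FROM customer_history "
    "WHERE customer_id = 1 ORDER BY load_date"
).fetchall()
```

An operational database would typically update the customer's city in place and lose the earlier value.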
The Operational Database is the source of information for the data warehouse. It includes detailed
information used to run the day-to-day operations of the business. The data changes frequently as
updates are made and reflects the current value of the latest transactions.
Operational Database Management Systems, also called OLTP (Online Transaction Processing)
databases, are used to manage dynamic data in real time.
Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are called Online Analytical
Processing (OLAP) systems.
The data warehouse and the OLTP database are both relational databases. However, the goals of the two
are different:
-Operational systems are designed to support high-volume transaction processing. Data warehousing
systems are typically designed to support high-volume analytical processing (i.e., OLAP).
-Operational systems are usually concerned with current data. Data warehousing systems are usually
concerned with historical data.
-Data within operational systems is updated regularly according to need. Data in a warehouse is
non-volatile: new data may be added regularly, but once added it is rarely changed.
-An operational system is designed for real-time business dealings and processes. A data warehouse is
designed for analysis of business measures by subject area, categories, and attributes.
-An operational system is optimized for a simple set of transactions, generally adding or retrieving a
single row at a time per table. A data warehouse is optimized for large loads and complex,
unpredictable queries that access many rows per table.
-Operational systems are largely process-oriented. Data warehousing systems are largely
subject-oriented.
-Operational systems are usually optimized to perform fast inserts and updates of relatively small
volumes of data. Data warehousing systems are usually optimized to perform fast retrievals of
relatively high volumes of data.
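The difference in access patterns described above can be sketched with an in-memory SQLite table. The `sales` table and its values are hypothetical; the point is only the shape of the two queries, not a claim about how any particular product implements them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("East", 2022, 100.0), ("East", 2023, 150.0), ("West", 2023, 200.0)],
)

# OLTP-style access: update and read a single row, addressed by its key.
conn.execute("UPDATE sales SET amount = 120.0 WHERE sale_id = 1")
single_row = conn.execute("SELECT amount FROM sales WHERE sale_id = 1").fetchone()

# OLAP-style access: scan many rows and aggregate them by a subject attribute.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```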
-A data warehouse allows business users to quickly access critical data from several sources all in one place.
-A data warehouse helps to integrate many sources of data to reduce stress on the production system.
-A data warehouse helps to reduce total turnaround time for analysis and reporting.
-Restructuring and integration make the data easier to use for reporting and analysis.
-A data warehouse allows users to access critical data from a number of sources in a single place.
Therefore, it saves the user's time in retrieving data from multiple sources.
-A data warehouse stores a large amount of historical data. This helps users to analyze different time
periods and trends to make future predictions.
The data in a data warehouse comes from operational systems of the organization as well as from other
external sources. These are collectively referred to as source systems.
The data extracted from source systems is stored in an area called the data staging area, where the data is
cleaned, transformed, combined, and deduplicated to prepare it for the data warehouse.
The data staging area is generally a collection of machines where simple activities like sorting and
sequential processing take place. The data staging area does not provide any query or presentation
services.
As soon as a system provides query or presentation services, it is categorized as a presentation server.
A presentation server is the target machine on which data loaded from the data staging area is organized
and stored for direct querying by end users.
The three different kinds of systems that are required for a data warehouse are:
1.Source Systems
2.Data staging area
3.Presentation servers
The data travels from source systems to presentation servers via the data staging area. The entire
process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer).
Oracle's ETL tool is called Oracle Warehouse Builder (OWB), and MS SQL Server's ETL tool is called Data
Transformation Services (DTS).
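The flow from source systems through the staging area to a presentation server can be sketched in a few lines of Python. This is an illustrative outline only: the function names, the record fields (`id`, `name`), and the two sample sources are assumptions, and a real staging area would do far more.

```python
def extract(sources):
    """Gather raw records from every source system into one list."""
    return [record for source in sources for record in source]

def stage(records):
    """Clean, transform, and deduplicate records in the staging area."""
    seen = set()
    staged = []
    for rec in records:
        name = rec["name"].strip().title()  # clean and standardize the name
        key = (rec["id"], name)
        if key in seen:                     # drop duplicates across sources
            continue
        seen.add(key)
        staged.append({"id": rec["id"], "name": name})
    return sorted(staged, key=lambda r: r["id"])  # sorting is a typical staging task

def load(staged, presentation_store):
    """Load prepared records into the store that end users query."""
    presentation_store.extend(staged)
    return presentation_store

crm = [{"id": 2, "name": " alice "}]
billing = [{"id": 1, "name": "BOB"}, {"id": 2, "name": "Alice"}]
warehouse = load(stage(extract([crm, billing])), [])
```

Note that only `load` touches the presentation store; the staging step itself offers no query service, matching the description above.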
Each component and the tasks performed by them are explained below:
1.Operational Source
The sources of data for the data warehouse are supplied from:
-Mainframe systems that hold data in the traditional network and hierarchical formats.
-Relational DBMSs such as Oracle and Informix.
2. Load Manager
-The load manager performs all the operations associated with extracting data and loading it into the
data warehouse.
-These operations include simple transformations of the data to prepare it for entry into the
warehouse.
-The size and complexity of this component vary between data warehouses, and it may be constructed
using a combination of vendor data loading tools and custom-built programs.
3. Warehouse Manager
-The warehouse manager performs all the operations associated with the management of data in the
warehouse.
-This component is built using vendor data management tools and custom-built programs.
-The operations performed by the warehouse manager include:
-Analysis of data to ensure consistency.
-Transformation and merging of the source data from temporary storage into data warehouse tables.
-Creation of indexes and views on the base tables.
-Denormalization.
-Generation of aggregations.
-Backing up and archiving of data.
-In certain situations, the warehouse manager also generates query profiles to determine which indexes
and aggregations are appropriate.
4.Query Manager
-The query manager performs all operations associated with the management of user queries.
-This component is usually constructed using vendor end-user access tools, data warehousing
monitoring tools, database facilities, and custom-built programs.
5.Detailed Data
-This area of the warehouse stores all the detailed data in the database schema.
-The detailed data is added regularly to the warehouse to supplement the aggregated data
8.Meta Data
-The data warehouse also stores all the Meta data (data about data) definitions used by all processes in
the warehouse.
9. End-User Access Tools
-The main purpose of a data warehouse is to provide information to the business managers for strategic
decision-making.
-These users interact with the warehouse using end user access tools.
-Some of the examples of end user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
The bottom tier consists of the data warehouse server, which is almost always an RDBMS. It may
include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by external
consultants) is extracted using application program interfaces called gateways. A gateway is provided
by the underlying DBMS and allows client programs to generate SQL code to be executed at a
server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and
Embedding, Database), both by Microsoft, and JDBC (Java Database Connectivity).
The middle tier consists of an OLAP server for fast querying of the data warehouse. It is typically
implemented using either:
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on
multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements
multidimensional data and operations.
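To make the ROLAP idea concrete, here is a minimal sketch of how one multidimensional operation (a roll-up over chosen dimensions) can be mapped to a standard relational `GROUP BY`. The `sales_fact` table, its columns, and the `roll_up` helper are hypothetical, and real ROLAP servers are considerably more elaborate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (product TEXT, quarter TEXT, year INTEGER, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [("pen", "Q1", 2023, 10), ("pen", "Q2", 2023, 15), ("ink", "Q1", 2023, 5)],
)

ALLOWED_DIMENSIONS = {"product", "quarter", "year"}

def roll_up(dimensions):
    """Aggregate units over the given dimension columns via GROUP BY."""
    if not dimensions or not set(dimensions) <= ALLOWED_DIMENSIONS:
        raise ValueError("unknown or empty dimension list")
    cols = ", ".join(dimensions)  # safe: names are checked against the whitelist
    sql = f"SELECT {cols}, SUM(units) FROM sales_fact GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

# Rolling up from (product, quarter, year) to (product, year) drops the quarter level.
by_product_year = roll_up(["product", "year"])
```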
The top tier contains front-end tools for displaying results provided by OLAP, as well as additional
tools for data mining of the OLAP-generated data.
The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:
Enterprise Warehouse
-An enterprise warehouse collects all of the records about subjects spanning the entire organization. It
supports corporate-wide data integration, usually from one or more operational systems or external
data providers, and it is cross-functional in scope. It generally contains detailed information as well as
summarized information and can range in size from a few gigabytes to hundreds of gigabytes,
terabytes, or beyond.
-It provides access to all data within an organization without compromising the security and integrity of the
data.
Data Mart
-A data mart is a small warehouse designed for the department level.
-A data mart includes a subset of corporate-wide data that is of value to a specific group of users.
The scope is confined to particular selected subjects. For example, a marketing data mart may restrict its
subjects to customers, items, and sales. The data contained in data marts tends to be summarized.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses. They are
built by drawing data from a central data warehouse.
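A dependent data mart can be sketched as a summarized subset drawn from a central warehouse table. The example below uses in-memory SQLite; the table names `warehouse_sales` and `marketing_mart`, the `dept` column, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales (customer TEXT, item TEXT, dept TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [
        ("Asha", "pen", "marketing", 20.0),
        ("Asha", "ink", "marketing", 30.0),
        ("Ravi", "desk", "facilities", 400.0),
    ],
)

# Dependent data mart: restricted to one department's subjects and summarized.
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT customer, SUM(amount) AS total_sales "
    "FROM warehouse_sales WHERE dept = 'marketing' GROUP BY customer"
)
mart_rows = conn.execute("SELECT customer, total_sales FROM marketing_mart").fetchall()
```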
Virtual Warehouses
-A virtual warehouse is a set of views over operational databases.
-It gives a quick overview of the data. It contains metadata and connects to several data sources with
the use of middleware.
-A virtual warehouse is simple to build but requires excess capacity on operational database servers.
-The processing can be fast and allows users to filter the most important data from different legacy
applications.
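The idea of a virtual warehouse as views over operational data can be sketched as follows. For simplicity the two hypothetical operational tables (`customers`, `orders`) live in one in-memory SQLite database rather than behind middleware; the view name `customer_totals` is likewise an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 75.0)])

# The virtual warehouse stores no data of its own; it is only a view definition.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT c.name AS name, SUM(o.amount) AS total "
    "FROM customers c JOIN orders o ON o.customer_id = c.customer_id "
    "GROUP BY c.customer_id, c.name"
)

before = conn.execute("SELECT name, total FROM customer_totals").fetchall()
conn.execute("INSERT INTO orders VALUES (12, 1, 50.0)")  # an operational change
after = conn.execute("SELECT name, total FROM customer_totals").fetchall()
```

Because every query against the view is executed on the operational tables, analytical load lands on the operational servers, which is why excess capacity is needed there.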
The mechanism of extracting information from source systems and bringing it into the data warehouse
is commonly called ETL, which stands for Extraction, Transformation and Loading.
Extraction
In this step, data is extracted from the source system to the ETL server or staging area. Transformation is
done in this area so that the performance of the source system is not degraded. If corrupted data is
copied directly into the data warehouse from the source system, rollback will be a challenge.
The staging area allows validation of the extracted data before it moves into the data warehouse.
The data warehouse needs to integrate systems that have different DBMSs, hardware,
operating systems, and communication protocols. Hence there is a need for a logical data map before data is
extracted and loaded physically. This data map describes all the relationships between the sources and
the target data.
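One simple way to picture a logical data map is as a table that records, for each target column, which source system and field it comes from and how it is converted. The sketch below is an assumption-laden illustration: the system names (`crm`, `billing`), field names, and the `apply_data_map` helper are hypothetical.

```python
# Hypothetical logical data map:
# target column -> (source system, source field, conversion function)
DATA_MAP = {
    "customer_id": ("crm", "cust_no", int),
    "customer_name": ("crm", "full_name", str.strip),
    "order_total": ("billing", "amt", float),
}

def apply_data_map(source_rows):
    """Build one target row from per-system source rows using DATA_MAP."""
    target = {}
    for column, (system, field, convert) in DATA_MAP.items():
        target[column] = convert(source_rows[system][field])
    return target

row = apply_data_map({
    "crm": {"cust_no": "42", "full_name": " Asha Rao "},
    "billing": {"amt": "99.50"},
})
```

Writing the map down before any physical extraction makes the source-to-target relationships reviewable independently of the ETL code.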
1. FULL Extraction
Whichever extraction method we use, it should not affect the performance and response time
of the source systems. These source systems are live production systems.
Transformation
Data extracted from the source server is raw and not usable in its original form. Therefore the data should
be mapped, cleansed, and transformed. Transformation is an important step where the ETL process
adds value and changes the data so that BI reports can be generated.
In this step, we apply a set of functions to the extracted data. Data that does not require any
transformation is called direct move or pass-through data.
In this step, we can also apply customized operations to the data. For example, if the first name and the
last name in a table are in different columns, it is possible to concatenate them before loading.
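The name-concatenation example, together with pass-through data, can be sketched as a small Python function. The field names (`first_name`, `last_name`, `full_name`) and the sample record are assumptions for illustration.

```python
def transform(record):
    """Concatenate the name columns; pass all other fields through unchanged."""
    out = {k: v for k, v in record.items() if k not in ("first_name", "last_name")}
    out["full_name"] = f"{record['first_name'].strip()} {record['last_name'].strip()}"
    return out

loaded = transform({"id": 7, "first_name": "Asha ", "last_name": "Rao", "city": "Pune"})
```

Here `id` and `city` are pass-through data, while `full_name` is produced by a customized operation.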
Validation during the Transformation:
Loading
Loading the data into the data warehouse is the last step of the ETL process. A vast volume of data
needs to be loaded into the data warehouse in a short time. To increase performance, loading
should be optimized.
If the load fails, a recovery mechanism should be in place to restart from the point of failure
without loss of data integrity. The data warehouse administrator needs to monitor, resume, and cancel loads
according to server performance.
Types of Loading
3. Full Refresh - Erasing the contents of one or more tables and reloading them with new data.
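A full refresh can be sketched as a delete followed by a reload, wrapped in a single transaction so that a failure partway through does not leave the table empty. The `dim_product` table and the `full_refresh` helper are hypothetical; the example relies on Python's `sqlite3` connection context manager, which commits on success and rolls back on an exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_id INTEGER, name TEXT)")
conn.execute("INSERT INTO dim_product VALUES (1, 'old pen')")
conn.commit()

def full_refresh(rows):
    """Erase the table and reload it inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM dim_product")
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", rows)

full_refresh([(1, "pen"), (2, "ink")])
contents = conn.execute(
    "SELECT product_id, name FROM dim_product ORDER BY product_id"
).fetchall()
```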
Metadata repository is a pretentious term for nothing other than a computerized database containing
metadata to support the development, maintenance, and operations of a major portion of an
enterprise's systems. Among other things, such a repository can be the foundation for a data warehouse.
Metadata Repository:-
Metadata
-Data warehouse metadata are pieces of information stored in one or more special-purpose metadata
repositories that include:
-Information on the contents of the data warehouse, their location, and their structure.
-Information on the processes that take place in the data warehouse back-stage, concerning the
refreshment of the warehouse with clean, up-to-date, semantically and structurally reconciled data.
-Information on the implicit semantics of data (with respect to a common enterprise model), along with
any other kind of data that helps the end user exploit the information in the warehouse.
-Information on the infrastructure and physical characteristics of the components and the sources of the
data warehouse.
-Information, including security, authentication, and usage statistics, that helps the administrator tune
the operation of the data warehouse as appropriate.
Metadata of a Bookstore
Metadata can also be compared to the catalog of the Amazon book store. If we consider each data
element as a book, the metadata will contain
-In an Enterprise, data for the data warehouse comes from various operational systems.
-Different source systems contain different data structures having different field lengths and data types.
-So the information of operational data source is given by Operational Meta Data.
This contains the information about the extraction of data from heterogeneous source systems.
This particular category provides the end user the flexibility of looking for information in their own way.
Metadata is managed from both the people side and the process side. On the people side, users need to be
trained on the importance and usage of metadata. They need to understand how and when to use the tools
and the benefits gained from the metadata.
Metadata management on the process side encompasses the data warehousing and business intelligence life
cycle. Metadata is entered into the appropriate tool and stored in a repository for further use as the life
cycle progresses.
*Metadata manager/repository
*Metadata extract tools
*Data modeling
*ETL
*BI Reporting
Role of Metadata-
- Metadata plays a very important role in a data warehouse, and that role is different from the role of
the warehouse data itself. The various roles of metadata are explained below.
1.Metadata acts as a directory.
2.This directory helps the decision support system to locate the contents of the data warehouse.
3.Metadata helps in decision support system for mapping of data when data is transformed from
operational environment to data warehouse environment.
4.Metadata helps in summarization between current detailed data and highly summarized data.
5.Metadata also helps in summarization between lightly detailed data and highly summarized data.
6.Metadata is used for query tools.
7.Metadata is used in extraction and cleansing tools.
8.Metadata is used in reporting tools.
9.Metadata is used in transformation tools.
10.Metadata plays an important role in loading functions.
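The directory role of metadata can be sketched with a small in-memory catalog. Everything here is hypothetical (the `METADATA` structure, table names, and the two helper functions); it only illustrates how a tool might look up where a warehouse table lives, what it was derived from, and its level of summarization.

```python
# Hypothetical metadata directory: one entry per warehouse table.
METADATA = {
    "sales_fact": {
        "location": "warehouse.sales_fact",
        "source": "billing.orders",
        "grain": "current detailed",
        "columns": {"amount": "REAL", "sale_date": "TEXT"},
    },
    "sales_by_year": {
        "location": "warehouse.sales_by_year",
        "source": "warehouse.sales_fact",
        "grain": "highly summarized",
        "columns": {"year": "INTEGER", "total": "REAL"},
    },
}

def locate(table):
    """Directory lookup: where a table lives and what it was derived from."""
    entry = METADATA[table]
    return entry["location"], entry["source"]

def tables_with_grain(grain):
    """Find tables at a given summarization level."""
    return sorted(name for name, entry in METADATA.items() if entry["grain"] == grain)

summary_location = locate("sales_by_year")
detailed_tables = tables_with_grain("current detailed")
```

Query, reporting, and transformation tools could all consult the same catalog instead of hard-coding table locations.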
Metadata Manager/Repository
-Metadata repository is an integral part of a data warehouse system.
-A shared repository that combines information from multiple sources is used for managing metadata.
- Most organizations start with a spreadsheet containing data definitions, then grow to a more
refined approach.
-Meta data manager may be purchased as a software package or built as a full-fledged system.