Chapter 1: Data Warehousing:: Dept of CSE, KLESCET - Shrikant Athanikar
• Major enterprises have many computers that run a variety of enterprise applications.
• For an enterprise with many branches in many locations, each branch may have its own
systems.
• A large company might have the following systems:
Human resource
Financials
Billing
Sales leads
Web sales
Customer support
Such systems are called online transaction processing (OLTP) systems.
• The OLTP systems are mostly relational database systems designed for transaction
processing.
• The performance of OLTP systems is usually very important, since these systems support
the users who provide services to customers.
• These systems normally deal with operations such as insert, update and delete, and can
answer simple queries quickly. However, they cannot answer the more complex questions
that management asks, which typically involve many joins and aggregations.
• Some enterprises may have old pre-relational systems that store valuable information in
files built using complex data structures, and it is difficult for the organization to convert
all of these into relational databases.
• In some cases the data might be stored in old legacy systems built using different
applications, where the data semantics are not properly documented.
• With these types of storage, generating a business report, or even answering some
important queries, becomes difficult and involves intolerable delays.
It has been reported that, several years ago, the Coca-Cola Company could not even quickly
determine how many bottles its plants produced in a day, since the information was
distributed across 20 different computer systems in different locations.
The focus of operational managers is on improving business management and processes across
the various enterprise functions, e.g. customer support, inventory and marketing. Typical
management queries might include:
• How many students are enrolled in the current semester, and how many have dropped out?
• What percentage of the top 5% of high school graduates enrolled in the university this year?
1) One simple solution to meeting these needs is to allow managers to pose queries of interest to
a mediator system that decomposes each query into appropriate sub-queries for the systems
that the query needs to access, obtains the results from those systems, and then combines and
presents the result to the user. This is sometimes called lazy or on-demand query
processing, since nothing is computed until a query is actually posed (a small sketch
contrasting this and the next approach appears after this list).
Disadvantages: The management queries may generate such a heavy load on some OLTP
systems that their performance becomes unacceptable. Also, the OLTP systems hold no
historical data, so finding trends may not be possible.
2) One could collect the most common queries that managers ask, run them regularly, and
have the results available whenever a manager poses one of those queries. This is called
the eager approach, since answers are computed before the queries are actually asked.
Advantage: responses are quick.
Drawback: The information may not be up to date, and managers may pose queries that have
not been pre-computed, which would again have to be run in the lazy mode.
3) The third approach is somewhere between the two extremes. It involves the creation of a
separate database that stores only the information of interest to the management staff and
involves the following two steps:
1. The information needs of the management staff are analyzed and the enterprise systems that
store some or all of that information are identified. A suitable data model is developed and
the information is then extracted, transformed and loaded into the new database.
2. The new database is then used to answer management queries and the OLTP systems are
not accessed for such queries.
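To make the contrast between the first two approaches concrete, the following is a minimal Python sketch; the source systems, the query name and the figures are invented, and a real mediator would issue SQL sub-queries against live OLTP databases rather than call stub functions.

```python
# Lazy (on-demand): decompose a management question into sub-queries and
# hit every source system at the moment the question is asked.
# Eager: serve a pre-computed answer when one is available.

# Stub OLTP source systems (hypothetical; stand-ins for live databases).
def query_enrolment_system(question):
    return {"enrolled": 4200}

def query_admissions_system(question):
    return {"dropped_out": 310}

SOURCES = [query_enrolment_system, query_admissions_system]

def lazy_answer(question):
    """Break the question into sub-queries, query every source now, combine."""
    result = {}
    for source in SOURCES:
        result.update(source(question))   # combine the partial answers
    return result

# Eager: common questions are run on a schedule and their answers cached.
PRECOMPUTED = {"enrolment summary": {"enrolled": 4187, "dropped_out": 305}}  # possibly stale

def eager_answer(question):
    """Serve a pre-computed answer if one exists, otherwise fall back to lazy."""
    return PRECOMPUTED.get(question) or lazy_answer(question)

print(lazy_answer("enrolment summary"))   # always current, but loads the OLTP systems
print(eager_answer("enrolment summary"))  # fast, but may be out of date
```

The lazy path always reflects the current state of the sources but loads them with every management query; the eager path answers instantly from the pre-computed store but may serve stale figures, which is exactly the trade-off described above.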
To meet the information needs of the management staff, one possible solution is therefore to
build a separate database.
One such approach is called the operational data store (ODS) approach.
A data warehouse does not contain operational data. A data warehouse is a reporting
database that contains relatively recent as well as historical data and may also contain
aggregate data.
ODS is current valued: an ODS is up-to-date and reflects the current status of the information.
An ODS does not include historical data. Since the data in the OLTP systems is changing all
the time, data from the underlying sources should refresh the ODS as regularly and frequently as possible.
ODS is volatile: The data in the ODS changes frequently as new information refreshes the ODS.
ODS is detailed: The ODS is detailed enough to serve the needs of the operational management
staff in the enterprise.
Benefits of ODS:
1) An ODS is the unified operational view of the enterprise that provides the operational
managers improved access to important operational data.
2) An ODS can be much more effective in generating current reports without having to
access the OLTP or legacy systems.
3) An ODS may also shorten the time required to implement and populate a data warehouse
system because it already provides integrated enterprise data.
4) An ODS may also become a source of reliable information for some other applications.
• To implement an ODS, just like implementing any database system, a data model should
be developed
• Although all the attributes included in the source databases do not need to be included in
the ODS, all attributes that are likely to be needed by the operational management staff
should be identified and included.
• The extraction of information from the source databases needs to be efficient, and the
quality of the data must be maintained.
• Since the data is refreshed regularly and frequently, suitable checks are required to ensure
quality of data after each refresh.
• An ODS would of course be required to satisfy normal integrity constraints, for example,
existential integrity, referential integrity and appropriate action to deal with nulls.
• An ODS is a read-only database, apart from the regular refreshing performed from the OLTP systems.
• Users should not be allowed to update ODS information.
• Populating an ODS involves an acquisition process of extracting, transforming and
loading data from the OLTP source systems. This process is called ETL (a minimal sketch follows below).
• Once the database has been populated, checking for anomalies and testing for performance
are necessary before an ODS system can go online.
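As a rough illustration only, the following Python sketch walks through the three ETL steps used to populate an ODS, with an in-memory SQLite database standing in for the ODS. The source rows, table and column names are assumptions; the PRIMARY KEY and NOT NULL constraints stand in for the existential integrity and null-handling requirements mentioned above.

```python
import sqlite3

def extract():
    # In practice this would query the OLTP source systems; here it is a stub.
    return [("12345", "Krishna Software Inc.", "Andhra Pradesh"),
            ("67890", "  mehta traders ", None)]

def transform(rows):
    # Simple cleaning: trim whitespace, standardize case, replace missing state.
    cleaned = []
    for supplier_id, name, state in rows:
        cleaned.append((supplier_id, name.strip().title(), state or "UNKNOWN"))
    return cleaned

def load(rows):
    conn = sqlite3.connect(":memory:")      # the ODS would be a real database
    conn.execute("""CREATE TABLE supplier (
                        supplier_id TEXT PRIMARY KEY,   -- existential integrity
                        name  TEXT NOT NULL,
                        state TEXT NOT NULL)""")
    conn.executemany("INSERT INTO supplier VALUES (?, ?, ?)", rows)
    print(conn.execute("SELECT * FROM supplier").fetchall())
    conn.close()

load(transform(extract()))
```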
Characteristics of a zero latency enterprise (ZLE):
How does a ZLE data store fit into the enterprise information architecture?
Several different arrangements are possible: a ZLE data store can be used almost
like an ODS that is refreshed in real time (it can then feed data to the data
warehouse, if required), or a ZLE data store can take over some of the roles of
the data warehouse.
In practice, the process is much more complex and tedious and may require significant
resources to implement.
There are a variety of tools in the market that may help reduce the cost.
As different data sources tend to have different conventions for coding information and
different standards for the quality of information, building an ODS (or a data warehouse)
requires data filtering, data cleaning, and integration.
Data errors arise at least partly because of unmotivated data entry staff, who are
often poorly paid.
ETL requires skills in management, business analysis and technology and is often
a significant component of developing an ODS or a data warehouse.
The ETL process tends to be different for every ODS and data warehouse since
every system is different.
It is best to use a flexible ETL approach combining perhaps an off-the-shelf tool
with some in-house developed software to perform the tasks that are required for a
proposed ODS or DW.
It is essential that the ETL process be adequately documented so that, at a later
stage, other staff members can understand exactly what the ETL does.
An ODS will be refreshed regularly and frequently and therefore ETL will
continue to play an important role in maintaining the ODS since the refreshing
task itself will require an ETL process.
The ETL process needs to be automated, so that whenever the ODS is scheduled to
be refreshed the ETL process can be started automatically.
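A minimal sketch of such automation, using only the Python standard library, is shown below; run_etl is a hypothetical stand-in for the real ETL job, and the 2 a.m. refresh time and the bounded loop are arbitrary choices for illustration.

```python
import time
from datetime import datetime, timedelta

def run_etl():
    # Placeholder for the real extract-transform-load job.
    print(f"{datetime.now():%Y-%m-%d %H:%M} refreshing ODS ...")

def refresh_daily(at_hour=2, max_runs=3):
    """Sleep until the next scheduled refresh time, run the ETL, repeat."""
    for _ in range(max_runs):          # bounded here; a real scheduler would loop indefinitely
        now = datetime.now()
        next_run = now.replace(hour=at_hour, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += timedelta(days=1)
        time.sleep((next_run - now).total_seconds())
        run_etl()

if __name__ == "__main__":
    refresh_daily()
```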
What are the major issues that must be resolved for a successful ETL implementation?
2. To what extent are the source systems and the target system interoperable? The
more different the sources and the target, the more complex the ETL process.
4. How big is the ODS likely to be at the beginning and in the long term?
Database systems tend to grow with time. Consideration may have to be given to whether some of
the data from the ODS will be archived regularly as the data becomes old and is no longer needed
in the ODS.
9. Would the extraction process only copy data from the source systems and not delete the
original data?
In some cases one may wish to delete the data from source systems once it has been successfully
copied across to an ODS or a DW.
11. How will data be copied from non-relational legacy systems that are still operational?
It is necessary to work out a sensible methodology of copying data from such systems and
transforming it to eventually load it across to the ODS.
ETL Functions:
The ETL process consists of data extraction from the source systems, data transformation
(which includes data cleaning), and loading the data into the ODS or the data warehouse.
Suppose we have extracted the following three records from three different source systems,
all of which belong to the same supplier.
Record from source system 1:
  Supplier ID:    (missing)
  Business name:  Krishna software Software Inc.
  Address:        P O Box 123
  City:           Hyderabad
  State:          Andhra Pradesh
  PIN:            500001

Record from source system 2:
  Supplier ID:    12345
  Business name:  Krishna Inc.
  Address:        201, 2nd Floor, 65 Gandhi road
  City:           New Delhi
  State:          Andhra
  PIN:            500003

Record from source system 3:
  Supplier ID:    12345
  Business name:  Krishna software Inc.
  Address:        201, 2nd Floor, 65 Gandhi road
  City:           Secunderabad
  State:          Andhra Pradesh
  PIN:            500003
  Postal Address: P O Box 123
The data may still contain some errors, but they cannot be corrected without additional
information.
Data cleaning (also called data cleansing or data scrubbing) deals with detecting and
removing errors and inconsistencies from the data, in particular the data that is sourced from a
variety of computer systems.
1. Instance identity problem: The same customer or client may be represented slightly
differently in different source systems.
For example, the same person's name may be represented as Gopal Gupta in some systems and
as GK Gupta in others. In fact, it has been claimed that there are more than 200 different
ways to spell Mohammed.
It has been reported that achieving very high consistency (i.e. close to 100%) in names and
addresses requires a huge amount of resources.
2. Data errors: Many different types of data errors other than identity errors are possible.
For example:
• Data may have some missing attribute values
• Coding of some values in one database may not match with coding in other databases
(i.e. different codes with the same meaning or same code for different meanings)
• There may be duplicate records
• There may be wrong aggregations
• There may be inconsistent use of nulls, spaces and empty values
• Some data may be wrong because of input errors
• There may be inappropriate use of address lines
• There may be non-unique identifiers
The ETL process needs to ensure that all these types of errors and others are resolved using a
sound methodology.
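As a rough sketch of how some of these checks might look in code, the following Python fragment flags missing attribute values and non-unique identifiers, and uses a crude similarity ratio to pair up name variants (the instance identity problem). The records, field names and threshold are invented for illustration.

```python
import difflib
from collections import Counter

records = [
    {"supplier_id": "12345", "name": "Krishna Software Inc.", "pin": "500003"},
    {"supplier_id": "12345", "name": "Krishna Inc.",          "pin": None},
    {"supplier_id": "",      "name": "Krishna software Inc.", "pin": "500001"},
]

def quality_report(rows, key="supplier_id"):
    issues = []
    # Missing attribute values (None, empty strings and blanks treated alike).
    for i, row in enumerate(rows):
        missing = [f for f, v in row.items() if v is None or str(v).strip() == ""]
        if missing:
            issues.append(f"record {i}: missing {missing}")
    # Non-unique identifiers.
    counts = Counter(row[key] for row in rows if str(row.get(key, "")).strip())
    issues += [f"identifier {k!r} appears {n} times" for k, n in counts.items() if n > 1]
    return issues

def similar_names(a, b, threshold=0.8):
    """Crude approximate matching of name variants using a similarity ratio."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

for issue in quality_report(records):
    print(issue)
print(similar_names("Krishna Software Inc.", "Krishna software Inc"))  # True: likely the same supplier
```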
1. Parsing: As in compiler technology, parsing identifies the various components of the source data
files and then establishes relationships between those and the fields in the target files.
The classical example of parsing is identifying the various components of a person's
name and address.
2. Correcting: The identified components are then corrected, usually using a variety of
sophisticated techniques including mathematical algorithms.
3. Standardizing: Business rules of the enterprise may now be used to transform the data to
standard form. For example, in some companies there might be rules on how name and
address are to be represented.
4. Matching: The data extracted from a number of source systems is likely to be related. Such
data needs to be matched.
5. Consolidating: All corrected, standardized and matched data can now be consolidated to
build a single version of the enterprise data, as in the sketch that follows the note below.
Note: Once the data has been transformed in the staging area, it is ready to be loaded into
the ODS.
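The following Python sketch (referred to above) applies naive versions of the standardizing, matching and consolidating steps to supplier records like those in the earlier example. The state-abbreviation rule, the matching key and the merge preference are all assumptions; real ETL tools use far more sophisticated parsing and matching techniques.

```python
# Records as they might arrive from two source systems.
sources = [
    {"supplier_id": "12345", "name": "Krishna Inc.", "state": "Andhra", "pin": "500003"},
    {"supplier_id": "12345", "name": "Krishna Software Inc.", "state": "Andhra Pradesh", "pin": "500003"},
]

# Standardizing: apply simple business rules, e.g. expand known state variants.
STATE_RULES = {"Andhra": "Andhra Pradesh"}   # assumed enterprise rule

def standardize(rec):
    rec = dict(rec)
    rec["state"] = STATE_RULES.get(rec["state"], rec["state"])
    rec["name"] = rec["name"].strip().title()
    return rec

# Matching: here, records are assumed to match if they share a supplier_id.
def match_key(rec):
    return rec["supplier_id"]

# Consolidating: merge matched records, preferring the most complete (longest) values.
def consolidate(recs):
    merged = {}
    for rec in recs:
        for field, value in rec.items():
            if value and (field not in merged or len(str(value)) > len(str(merged[field]))):
                merged[field] = value
    return merged

standardized = [standardize(r) for r in sources]
groups = {}
for rec in standardized:
    groups.setdefault(match_key(rec), []).append(rec)
for key, recs in groups.items():
    print(key, consolidate(recs))   # a single version of the supplier record
```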
An ETL tool would normally include tools for data cleansing, reorganization,
transformation, aggregation, calculation and automatic loading of data into the target
database.
An ETL tool should provide an easy user interface that allows data cleansing and data
transformation rules to be specified using a point-and-click approach.
When all mappings and transformations have been specified, the ETL tool should
automatically generate the data extract/transform/load programs, which typically
run in batch mode.
Comparison of an ODS and a data warehouse:
• ODS: data of high quality at a detailed level and assured availability. Data warehouse: data may not be perfect, but sufficient for strategic analysis.
• ODS: contains current and near-current data. Data warehouse: contains historical data.
Data Mart:
• Data marts are often the common approach for building a data warehouse since the cost
curve for data marts tends to be more linear.
• A centralized data warehouse project can be very resource intensive and requires
significant investment at the beginning, although the overall costs over a number of years for a
centralized data warehouse and for decentralized data marts are likely to be similar.
• A centralized warehouse can provide better quality data and minimize data inconsistencies
since the data quality is controlled centrally.
Star schema:
It is a data warehousing model that often consists of a central fact table and a set of
surrounding dimension tables on which the facts depend.
Snowflake schema:
Star schemas may be refined into snowflake schemas if we wish to provide support for
dimension hierarchies by allowing the dimension tables to have sub tables to represent
the hierarchies
• The star and snowflake schemas are intuitive, easy to understand, can deal with
aggregate data and are easily extendible by adding new attributes or new dimensions
(a small sketch appears below).
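As a small illustration, the Python/SQLite sketch below builds a star schema with one fact table and two dimension tables, runs a typical join-and-aggregate management query, and notes how a snowflake schema would differ. The table names, columns and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Notebook', 'Stationery'), (2, 'Pen', 'Stationery');
INSERT INTO dim_store   VALUES (10, 'Hyderabad', 'Telangana'), (20, 'New Delhi', 'Delhi');
INSERT INTO fact_sales  VALUES (1, 10, 5, 250.0), (2, 10, 20, 200.0), (1, 20, 3, 150.0);
""")

# A typical management query: total sales amount per city, joining the fact
# table to a dimension table and aggregating.
for row in cur.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY s.city
"""):
    print(row)

# Snowflaking would normalize a dimension further, e.g. moving (city, state)
# out of dim_store into a separate dim_location sub-table referenced by a key.
conn.close()
```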
Implementation steps:
1) Requirements analysis and capacity planning: identifying enterprise needs.
2) Hardware integration: after selecting the hardware and software, they need to be integrated
with the servers.
3) Modeling: this major step involves designing the warehouse schema and views.
4) Physical modeling: involves data placement, data partitioning, decision on access
methods and indexing.
5) Sources: identifying and connecting the sources using gateway, ODBC drivers or other
wrappers.
6) ETL: The data from the source systems will need to go through an ETL process. The step
of designing and implementing the ETL process may involve identifying a suitable ETL
tool and implementing the tool.
7) Populate the data warehouse: Once the ETL tools have been agreed upon, testing the
tools will be required, perhaps using a staging area. Once everything is working
satisfactorily, the ETL tools may be used in populating the warehouse given the schema
and view definitions.
8) User application: For the data warehouse to be useful there must be end-user applications.
This step involves designing and implementing applications required by the end users.
9) Roll-out the warehouse and applications: once the data warehouse has been populated and
tested, it may be rolled out to the end users.
Implementation Guidelines
1) Build incrementally: Data warehouses must be built incrementally (step by step).
First build a data mart; then, based on the requirements, the data warehouse can be built.
2) Need a champion: Data warehousing projects require inputs from many units in an
enterprise and therefore need to be driven by someone who is capable of interacting with
people in the enterprise.
3) Senior management support: Data warehouse project must be fully supported by the
senior management. A warehouse project calls for a sustained commitment from senior
management.
4) Ensure quality: Only data that has been cleaned and is of a quality that is understood by
the organization should be loaded in the data warehouse.
5) Corporate strategy: A data warehouse project must fit with corporate strategy and
business objective.
6) Business plan: The financial costs (hardware, software, and people ware), expected
benefits and a project plan (including an ETL plan) for a data warehouse project must be
clearly outlined and understood by all stakeholders.
7) Training: A data warehouse project must not overlook data warehouse training
requirements. For a data warehouse project to be successful, the users must be trained to use
the warehouse and to understand its capabilities.
8) Adaptability: The project should build in adaptability so that changes may be made to the
data warehouse if and when required.
9) Joint management: The project must be managed by both IT and business professionals
in the enterprise to ensure good communication with the stakeholders.
Review questions:
2. What are the benefits of using an ODS? Describe the design and implementation of an ODS.
3. Explain the functions of the ETL process, and briefly explain the problems faced while
transforming data in ETL.
4. Briefly explain what a data warehouse is and mention the benefits of a data warehouse.
10. Mention the implementation steps for a data warehouse.
11. Mention the implementation guidelines for a data warehouse.
12. What are the major issues that must be resolved for a successful ETL implementation?