Chapter 1: Data Warehousing:: Dept of CSE, KLESCET - Shrikant Athanikar
• Major enterprises have many computers that run a variety of enterprise applications.
• For an enterprise with many branches in many locations, each branch may have its own
systems.
• A large company might have the following systems:
Human resource
Financials
Billing
Sales leads
Web sales
Customer support
Such systems are called online transaction processing (OLTP) systems.
• The OLTP systems are mostly relational database systems designed for transaction
processing.
• The performance of OLTP systems is usually very important, since these systems support
the users who provide services to customers.
• These systems normally deal with operations such as insert, update and delete, and can
answer simple queries quickly. However, they cannot answer the more complex questions
that management asks, which typically involve many joins and aggregations.
• Some enterprises may have old pre-relational systems that store valuable information in
files built using complex data structures, and it is difficult for the organization to convert
all of these into relational databases.
• In some cases the data might be stored in old legacy systems built using different
applications, where the data semantics are not properly documented.
• With these types of storage, generating a business report, or even answering some
important queries, becomes difficult and involves intolerable delays.
It has been reported that, several years ago, the Coca-Cola Company could not even quickly
determine how many bottles its plants produced in a day, since the information was
distributed across 20 different computer systems in different locations.
The focus of operational managers is on improving business management and processes across
the various enterprise functions, e.g. customer support, inventory and marketing. Typical
management queries might include:
• How many students are enrolled in the current semester, and how many have dropped out?
• What percentage of the top 5% of high school graduates enrolled in the university this year?
1) One simple solution to meeting these needs is to allow managers to pose queries of interest to
a mediator system that decomposes each query into appropriate sub-queries for the systems
that the query needs to access, obtains the results from those systems, and then combines and
presents the result to the user. This is sometimes called lazy or on-demand query
processing, since nothing is computed until a query is actually posed (a small sketch
contrasting this and the next approach appears after this list).
Disadvantages: The management queries may generate such a heavy load on some OLTP
systems that their performance becomes unacceptable. Also, the OLTP systems hold no
historical data, so finding trends may not be possible.
2) One could collect the most common queries that managers ask, run them regularly, and
have the results available whenever a manager poses one of those queries. This is called
the eager approach, since answers are computed before the queries are actually asked.
Advantage: responses are quick.
Drawback: The information may not be up to date, and managers may pose queries that have
not been pre-computed, which would again have to be run in the lazy mode.
3) The third approach is somewhere between the two extremes. It involves the creation of a
separate database that stores only the information of interest to the management staff and
involves the following two steps:
1. The information needs of the management staff are analyzed and the enterprise systems that
store some or all of that information are identified. A suitable data model is developed and
the information is then extracted, transformed and loaded into the new database.
2. The new database is then used to answer management queries and the OLTP systems are
not accessed for such queries.
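To make the contrast between the first two approaches concrete, the following is a minimal Python sketch; the source systems, the query name and the figures are invented, and a real mediator would issue SQL sub-queries against live OLTP databases rather than call stub functions.

```python
# Lazy (on-demand): decompose a management question into sub-queries and
# hit every source system at the moment the question is asked.
# Eager: serve a pre-computed answer when one is available.

# Stub OLTP source systems (hypothetical; stand-ins for live databases).
def query_enrolment_system(question):
    return {"enrolled": 4200}

def query_admissions_system(question):
    return {"dropped_out": 310}

SOURCES = [query_enrolment_system, query_admissions_system]

def lazy_answer(question):
    """Break the question into sub-queries, query every source now, combine."""
    result = {}
    for source in SOURCES:
        result.update(source(question))   # combine the partial answers
    return result

# Eager: common questions are run on a schedule and their answers cached.
PRECOMPUTED = {"enrolment summary": {"enrolled": 4187, "dropped_out": 305}}  # possibly stale

def eager_answer(question):
    """Serve a pre-computed answer if one exists, otherwise fall back to lazy."""
    return PRECOMPUTED.get(question) or lazy_answer(question)

print(lazy_answer("enrolment summary"))   # always current, but loads the OLTP systems
print(eager_answer("enrolment summary"))  # fast, but may be out of date
```

The lazy path always reflects the current state of the sources but loads them with every management query; the eager path answers instantly from the pre-computed store but may serve stale figures, which is exactly the trade-off described above.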
To meet the information needs of the management staff, one possible solution is therefore to
build a separate database.
One such approach is called the operational data store (ODS) approach.
A data warehouse does not contain operational data. A data warehouse is a reporting
database that contains relatively recent as well as historical data and may also contain
aggregate data.
ODS is current valued: an ODS is up-to-date and reflects the current status of the information.
An ODS does not include historical data. Since the data in the OLTP systems is changing all
the time, data from the underlying sources should refresh the ODS as regularly and frequently as possible.
ODS is volatile: The data in the ODS changes frequently as new information refreshes the ODS.
ODS is detailed: The ODS is detailed enough to serve the needs of the operational management
staff in the enterprise.
Benefits of ODS:
1) An ODS is the unified operational view of the enterprise that provides the operational
managers improved access to important operational data.
2) An ODS can be much more effective in generating current reports without having to
access the OLTP or legacy systems.
3) An ODS may also shorten the time required to implement and populate a data warehouse
system because it already provides integrated enterprise data.
4) An ODS may also become a source of reliable information for some other applications.
• To implement an ODS, just like implementing any database system, a data model should
be developed
• Although all the attributes included in the source databases do not need to be included in
the ODS, all attributes that are likely to be needed by the operational management staff
should be identified and included.
• The extraction of information from the source databases needs to be efficient, and the
quality of the data must be maintained.
• Since the data is refreshed regularly and frequently, suitable checks are required to ensure
quality of data after each refresh.
• An ODS would of course be required to satisfy normal integrity constraints, for example,
existential integrity, referential integrity and appropriate action to deal with nulls.
• An ODS is a read-only database, apart from the regular refreshing performed from the OLTP systems.
• Users should not be allowed to update ODS information.
• Populating an ODS involves an acquisition process of extracting, transforming and
loading data from the OLTP source systems. This process is called ETL (a minimal sketch follows below).
• Once the database has been populated, checking for anomalies and testing for performance
are necessary before an ODS system can go online.
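As a rough illustration only, the following Python sketch walks through the three ETL steps used to populate an ODS, with an in-memory SQLite database standing in for the ODS. The source rows, table and column names are assumptions; the PRIMARY KEY and NOT NULL constraints stand in for the existential integrity and null-handling requirements mentioned above.

```python
import sqlite3

def extract():
    # In practice this would query the OLTP source systems; here it is a stub.
    return [("12345", "Krishna Software Inc.", "Andhra Pradesh"),
            ("67890", "  mehta traders ", None)]

def transform(rows):
    # Simple cleaning: trim whitespace, standardize case, replace missing state.
    cleaned = []
    for supplier_id, name, state in rows:
        cleaned.append((supplier_id, name.strip().title(), state or "UNKNOWN"))
    return cleaned

def load(rows):
    conn = sqlite3.connect(":memory:")      # the ODS would be a real database
    conn.execute("""CREATE TABLE supplier (
                        supplier_id TEXT PRIMARY KEY,   -- existential integrity
                        name  TEXT NOT NULL,
                        state TEXT NOT NULL)""")
    conn.executemany("INSERT INTO supplier VALUES (?, ?, ?)", rows)
    print(conn.execute("SELECT * FROM supplier").fetchall())
    conn.close()

load(transform(extract()))
```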
Characteristics of a zero latency enterprise (ZLE):
How does a ZLE data store fit into the enterprise information architecture?
Several different arrangements are possible: a ZLE data store can be used almost
like an ODS that is refreshed in real time (it can then feed data to the data
warehouse, if required), or a ZLE data store can take over some of the roles of
the data warehouse.
In practice, the process is much more complex and tedious and may require significant
resources to implement.
There are a variety of tools in the market that may help reduce the cost.
As different data sources tend to have different conventions for coding information and
different standards for the quality of information, building an ODS (or a data warehouse)
requires data filtering, data cleaning, and integration.
Data errors arise at least partly because of unmotivated data entry staff, who are
often poorly paid.
ETL requires skills in management, business analysis and technology and is often
a significant component of developing an ODS or a data warehouse.
The ETL process tends to be different for every ODS and data warehouse since
every system is different.
It is best to use a flexible ETL approach combining perhaps an off-the-shelf tool
with some in-house developed software to perform the tasks that are required for a
proposed ODS or DW.
It is essential that the ETL process be adequately documented so that, at a later
stage, other staff members can understand exactly what the ETL does.
An ODS will be refreshed regularly and frequently and therefore ETL will
continue to play an important role in maintaining the ODS since the refreshing
task itself will require an ETL process.
The ETL process needs to be automated, so that whenever the ODS is scheduled to
be refreshed the ETL process can be started automatically.
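A minimal sketch of such automation, using only the Python standard library, is shown below; run_etl is a hypothetical stand-in for the real ETL job, and the 2 a.m. refresh time and the bounded loop are arbitrary choices for illustration.

```python
import time
from datetime import datetime, timedelta

def run_etl():
    # Placeholder for the real extract-transform-load job.
    print(f"{datetime.now():%Y-%m-%d %H:%M} refreshing ODS ...")

def refresh_daily(at_hour=2, max_runs=3):
    """Sleep until the next scheduled refresh time, run the ETL, repeat."""
    for _ in range(max_runs):          # bounded here; a real scheduler would loop indefinitely
        now = datetime.now()
        next_run = now.replace(hour=at_hour, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += timedelta(days=1)
        time.sleep((next_run - now).total_seconds())
        run_etl()

if __name__ == "__main__":
    refresh_daily()
```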
What are the major issues that must be resolved for a successful ETL implementation?
2. To what extent are the source systems and the target system interoperable? The
more different the sources and the target, the more complex the ETL process.
4. How big is the ODS likely to be at the beginning and in the long term?
Database systems tend to grow with time. Consideration may have to be given to whether some of
the data from the ODS will be archived regularly as the data becomes old and is no longer needed
in the ODS.
9. Would the extraction process only copy data from the source systems and not delete the
original data?
In some cases one may wish to delete the data from source systems once it has been successfully
copied across to an ODS or a DW.
11. How will data be copied from non-relational legacy systems that are still operational?
It is necessary to work out a sensible methodology of copying data from such systems and
transforming it to eventually load it across to the ODS.
ETL Functions:
The ETL process consists of data extraction from the source systems, data transformation
(which includes data cleaning), and loading the data into the ODS or the data warehouse.
Suppose we have extracted the following three records from three different source systems,
all of which belong to the same supplier.
Record from source system 1:
  Supplier ID:    (missing)
  Business name:  Krishna software Software Inc.
  Address:        P O Box 123
  City:           Hyderabad
  State:          Andhra Pradesh
  PIN:            500001

Record from source system 2:
  Supplier ID:    12345
  Business name:  Krishna Inc.
  Address:        201, 2nd Floor, 65 Gandhi road
  City:           New Delhi
  State:          Andhra
  PIN:            500003

Record from source system 3:
  Supplier ID:    12345
  Business name:  Krishna software Inc.
  Address:        201, 2nd Floor, 65 Gandhi road
  City:           Secunderabad
  State:          Andhra Pradesh
  PIN:            500003
  Postal Address: P O Box 123
The data may still contain some errors, but they cannot be corrected without additional
information.
Data cleaning (also called data cleansing or data scrubbing) deals with detecting and
removing errors and inconsistencies from the data, in particular the data that is sourced from a
variety of computer systems.
1. Instance identity problem: The same customer or client may be represented slightly
differently in different source systems.
For example, the same person's name may be represented as Gopal Gupta in some systems and
as GK Gupta in others. In fact, it has been claimed that there are more than 200 different
ways to spell Mohammed.
It has been reported that achieving very high consistency (i.e. close to 100%) in names and
addresses requires a huge amount of resources.
2. Data errors: Many different types of data errors other than identity errors are possible.
For example:
• Data may have some missing attribute values
• Coding of some values in one database may not match with coding in other databases
(i.e. different codes with the same meaning or same code for different meanings)
• There may be duplicate records
• There may be wrong aggregations
• There may be inconsistent use of nulls, spaces and empty values
• Some data may be wrong because of input errors
• There may be inappropriate use of address lines
• There may be non-unique identifiers
The ETL process needs to ensure that all these types of errors and others are resolved using a
sound methodology.
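As a rough sketch of how some of these checks might look in code, the following Python fragment flags missing attribute values and non-unique identifiers, and uses a crude similarity ratio to pair up name variants (the instance identity problem). The records, field names and threshold are invented for illustration.

```python
import difflib
from collections import Counter

records = [
    {"supplier_id": "12345", "name": "Krishna Software Inc.", "pin": "500003"},
    {"supplier_id": "12345", "name": "Krishna Inc.",          "pin": None},
    {"supplier_id": "",      "name": "Krishna software Inc.", "pin": "500001"},
]

def quality_report(rows, key="supplier_id"):
    issues = []
    # Missing attribute values (None, empty strings and blanks treated alike).
    for i, row in enumerate(rows):
        missing = [f for f, v in row.items() if v is None or str(v).strip() == ""]
        if missing:
            issues.append(f"record {i}: missing {missing}")
    # Non-unique identifiers.
    counts = Counter(row[key] for row in rows if str(row.get(key, "")).strip())
    issues += [f"identifier {k!r} appears {n} times" for k, n in counts.items() if n > 1]
    return issues

def similar_names(a, b, threshold=0.8):
    """Crude approximate matching of name variants using a similarity ratio."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

for issue in quality_report(records):
    print(issue)
print(similar_names("Krishna Software Inc.", "Krishna software Inc"))  # True: likely the same supplier
```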
1. Parsing: As in compiler technology, parsing identifies the various components of the source data
files and then establishes relationships between those and the fields in the target files.
The classical example of parsing is identifying the various components of a person's
name and address.
2. Correcting: The identified components are then corrected, usually using a variety of
sophisticated techniques including mathematical algorithms.
3. Standardizing: Business rules of the enterprise may now be used to transform the data to
standard form. For example, in some companies there might be rules on how name and
address are to be represented.
4. Matching: The data extracted from a number of source systems is likely to be related. Such
data needs to be matched.
5. Consolidating: All corrected, standardized and matched data can now be consolidated to
build a single version of the enterprise data, as in the sketch that follows the note below.
Note: Once the data has been transformed in the staging area, it is ready to be loaded into
the ODS.
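The following Python sketch (referred to above) applies naive versions of the standardizing, matching and consolidating steps to supplier records like those in the earlier example. The state-abbreviation rule, the matching key and the merge preference are all assumptions; real ETL tools use far more sophisticated parsing and matching techniques.

```python
# Records as they might arrive from two source systems.
sources = [
    {"supplier_id": "12345", "name": "Krishna Inc.", "state": "Andhra", "pin": "500003"},
    {"supplier_id": "12345", "name": "Krishna Software Inc.", "state": "Andhra Pradesh", "pin": "500003"},
]

# Standardizing: apply simple business rules, e.g. expand known state variants.
STATE_RULES = {"Andhra": "Andhra Pradesh"}   # assumed enterprise rule

def standardize(rec):
    rec = dict(rec)
    rec["state"] = STATE_RULES.get(rec["state"], rec["state"])
    rec["name"] = rec["name"].strip().title()
    return rec

# Matching: here, records are assumed to match if they share a supplier_id.
def match_key(rec):
    return rec["supplier_id"]

# Consolidating: merge matched records, preferring the most complete (longest) values.
def consolidate(recs):
    merged = {}
    for rec in recs:
        for field, value in rec.items():
            if value and (field not in merged or len(str(value)) > len(str(merged[field]))):
                merged[field] = value
    return merged

standardized = [standardize(r) for r in sources]
groups = {}
for rec in standardized:
    groups.setdefault(match_key(rec), []).append(rec)
for key, recs in groups.items():
    print(key, consolidate(recs))   # a single version of the supplier record
```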
An ETL tool would normally include tools for data cleansing, reorganization,
transformation, aggregation, calculation and automatic loading of data into the target
database.
An ETL tool should provide an easy user interface that allows data cleansing and data
transformation rules to be specified using a point-and-click approach.
When all mappings and transformations have been specified, the ETL tool should
automatically generate the data extract/transform/load programs, which typically
run in batch mode.
Comparison of an ODS and a data warehouse:
• ODS: data of high quality at a detailed level and assured availability. Data warehouse: data may not be perfect, but sufficient for strategic analysis.
• ODS: contains current and near-current data. Data warehouse: contains historical data.
Data Mart:
• Data marts are often the common approach for building a data warehouse since the cost
curve for data marts tends to be more linear.
• A centralized data warehouse project can be very resource intensive and requires
significant investment at the beginning, although the overall costs over a number of years for a
centralized data warehouse and for decentralized data marts are likely to be similar.
• A centralized warehouse can provide better quality data and minimize data inconsistencies
since the data quality is controlled centrally.
Star schema:
It is a data warehousing model that often consists of a central fact table and a set of
surrounding dimension tables on which the facts depend.
Snowflake schema:
Star schemas may be refined into snowflake schemas if we wish to provide support for
dimension hierarchies by allowing the dimension tables to have sub tables to represent
the hierarchies
• The star and snowflake schemas are intuitive, easy to understand, can deal with
aggregate data and are easily extendible by adding new attributes or new dimensions
(a small sketch appears below).
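As a small illustration, the Python/SQLite sketch below builds a star schema with one fact table and two dimension tables, runs a typical join-and-aggregate management query, and notes how a snowflake schema would differ. The table names, columns and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Notebook', 'Stationery'), (2, 'Pen', 'Stationery');
INSERT INTO dim_store   VALUES (10, 'Hyderabad', 'Telangana'), (20, 'New Delhi', 'Delhi');
INSERT INTO fact_sales  VALUES (1, 10, 5, 250.0), (2, 10, 20, 200.0), (1, 20, 3, 150.0);
""")

# A typical management query: total sales amount per city, joining the fact
# table to a dimension table and aggregating.
for row in cur.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY s.city
"""):
    print(row)

# Snowflaking would normalize a dimension further, e.g. moving (city, state)
# out of dim_store into a separate dim_location sub-table referenced by a key.
conn.close()
```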
Implementation steps:
1) Requirements analysis and capacity planning: identifying enterprise needs.
2) Hardware integration: after selecting the hardware and software, they need to be integrated
with the servers.
3) Modeling: this major step involves designing the warehouse schema and views.
4) Physical modeling: involves data placement, data partitioning, decision on access
methods and indexing.
5) Sources: identifying and connecting the sources using gateway, ODBC drivers or other
wrappers.
6) ETL: The data from the source systems will need to go through an ETL process. The step
of designing and implementing the ETL process may involve identifying a suitable ETL
tool and implementing the tool.
7) Populate the data warehouse: Once the ETL tools have been agreed upon, testing the
tools will be required, perhaps using a staging area. Once everything is working
satisfactorily, the ETL tools may be used in populating the warehouse given the schema
and view definitions.
8) User application: For the data warehouse to be useful there must be end-user applications.
This step involves designing and implementing applications required by the end users.
9) Roll-out the warehouse and applications: once the data warehouse has been populated and
tested, it may be rolled out to the end users.
Implementation Guidelines
1) Build incrementally: Data warehouses must be built incrementally (step by step).
First build a data mart; then, based on the requirements, the data warehouse can be built.
2) Need a champion: Data warehousing projects require inputs from many units in an
enterprise and therefore need to be driven by someone who is capable of interacting with
people in the enterprise.
3) Senior management support: Data warehouse project must be fully supported by the
senior management. A warehouse project calls for a sustained commitment from senior
management.
4) Ensure quality: Only data that has been cleaned and is of a quality that is understood by
the organization should be loaded in the data warehouse.
5) Corporate strategy: A data warehouse project must fit with corporate strategy and
business objective.
6) Business plan: The financial costs (hardware, software, and people ware), expected
benefits and a project plan (including an ETL plan) for a data warehouse project must be
clearly outlined and understood by all stakeholders.
7) Training: A data warehouse project must not overlook data warehouse training
requirements. For a data warehouse project to be successful, the users must be trained to use
the warehouse and to understand its capabilities.
8) Adaptability: The project should build in adaptability so that changes may be made to the
data warehouse if and when required.
9) Joint management: The project must be managed by both IT and business professionals
in the enterprise to ensure good communication with the stakeholders.
Review questions:
2. What are the benefits of using an ODS? Describe the design and implementation of an ODS.
3. Explain the functions of the ETL process, and briefly explain the problems faced while
transforming data in ETL.
4. Briefly explain what a data warehouse is and mention the benefits of a data warehouse.
10. Mention the implementation steps for a data warehouse.
11. Mention the implementation guidelines for a data warehouse.
12. What are the major issues that must be resolved for a successful ETL implementation?