Unit 1
Information assets are immensely valuable to any enterprise, and because of this, these assets must
be properly stored and readily accessible when they are needed. However, the availability of too
much data makes the extraction of the most important information difficult, if not impossible. View
results from any Google search, and you’ll see that the data = information equation is not always
correct—that is, too much data is simply too much.
Data warehousing is a phenomenon that grew from the huge amount of electronic data stored in
recent years and from the urgent need to use that data to accomplish goals that go beyond the
routine tasks linked to daily processing. In a typical scenario, a large corporation has many branches,
and senior managers need to quantify and evaluate how each branch contributes to the global
business performance. The corporate database stores detailed data on the tasks performed by
branches. To meet the managers’ needs, tailor-made queries can be issued to retrieve the required
data. In order for this process to work, database administrators must first formulate the desired
query (typically an aggregate SQL query) after closely studying database catalogs. Then the query is
processed. This can take a few hours because of the huge amount of data, the query complexity, and
the concurrent effects of other regular workload queries on data. Finally, a report is generated and
passed to senior managers in the form of a spreadsheet.
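For illustration only, such a tailor-made aggregate query might resemble the following sketch, which assumes hypothetical orders and branches tables and summarizes each branch's yearly contribution:

-- Aggregate query summarizing each branch's contribution to overall revenue
SELECT b.branch_name,
       COUNT(*)      AS orders_handled,
       SUM(o.amount) AS total_revenue
FROM   orders o
JOIN   branches b ON b.branch_id = o.branch_id
WHERE  o.order_date BETWEEN DATE '2011-01-01' AND DATE '2011-12-31'
GROUP  BY b.branch_name
ORDER  BY total_revenue DESC;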
Many years ago, database designers realized that such an approach is hardly feasible, because it is
very demanding in terms of time and resources, and it does not always achieve the desired results.
Moreover, a mix of analytical queries with transactional routine queries inevitably slows down the
system, and this does not meet the needs of users of either type of query. Today’s advanced data
warehousing processes separate online analytical processing (OLAP) from online transactional
processing (OLTP) by creating a new information repository that integrates basic data from various
sources, properly arranges data formats, and then makes data available for analysis and evaluation
aimed at planning and decision-making processes (Lechtenbörger, 2001).
Let’s review some fields of application for which data warehouse technologies are successfully used:
• Trade: Sales and claims analyses, shipment and inventory control, customer care and public relations.
• Health care service: Patient admission and discharge analysis, and bookkeeping in accounts departments.
The field of application of data warehouse systems is not restricted to enterprises: it ranges from epidemiology to demography, and from natural science to education. A property common to all of these fields is the need for storage and query tools that retrieve information summaries easily and quickly from the huge amounts of data stored in databases or made available on the Internet.
This kind of information allows us to study business phenomena, learn about meaningful
correlations, and gain useful knowledge to support decision-making processes.
Data Warehouse
A data warehouse is a collection of data that supports decision-making processes. It provides the following features (Inmon, 2005):
• It is subject-oriented.
• It is integrated.
• It is non-volatile.
• It is time variant.
The first characteristic of a data warehouse is that it is subject-oriented. In operational systems, data is stored by individual application. For example, for an order processing application, we keep the data for that particular application. This application provides the data for all the functions of entering orders, checking stock, verifying the customer's credit, and assigning the order for shipment. But these data sets contain only the data needed for the functions of this particular application. In striking contrast, in a data warehouse, data is stored by subject, not by application. For example, in the data warehouse of an insurance company, claims data are organized around the subject of claims and not by the individual applications for general insurance and life insurance.
The second characteristic of a data warehouse is that it is integrated. The integrator component of a data warehouse integrates the data fetched from the different data sources. For proper decision making, we need to pull together all the relevant data from the various applications and remove the inconsistencies. The data in the data warehouse comes from several disparate operational systems. Because these are disparate applications, the operational platforms and operating systems may differ, and the file layouts, character code representations, and field naming conventions can all be different. The data is therefore converted, reformatted, resequenced, summarized, and so forth. We have to standardize the various data elements and make sure of the meanings of data names in each source application: naming conventions, codes, data attributes, and measurements all need standardization. The result is that, once it resides in the data warehouse, the data has a single physical corporate image.
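As a minimal sketch of such standardization (the source tables src_crm_customer and src_billing_customer and their differing gender codes are hypothetical), an integration step might convert both encodings to one standard while loading a common staging table:

-- Standardize differently coded gender values from two source applications
INSERT INTO stg_customer (customer_id, customer_name, gender)
SELECT customer_id, customer_name,
       CASE gender_code WHEN 'M' THEN 'MALE' WHEN 'F' THEN 'FEMALE' ELSE 'UNKNOWN' END
FROM   src_crm_customer
UNION ALL
SELECT cust_no, cust_name,
       CASE sex_flag WHEN 1 THEN 'MALE' WHEN 0 THEN 'FEMALE' ELSE 'UNKNOWN' END
FROM   src_billing_customer;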
The third characteristic of a data warehouse is that it is non-volatile. Operational data is regularly accessed and manipulated one record at a time, and data is updated in the operational environment as a regular matter of course, but data warehouse data exhibits a very different set of characteristics. The data in
the data warehouse is not intended to run the day-to-day business. When we want to process the
next order received from a customer, we do not look into the data warehouse to find the current
stock status. Data warehouse data is loaded and accessed, but it is not updated. Instead, when data
in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes
occur, a new snapshot record is written. In doing so, a historical record of data is kept in the data
warehouse.
The fourth characteristic of a data warehouse is that it is time variant. The data in the warehouse is meant for analysis and decision making. If a user is looking at the buying pattern of a specific customer, the user needs data not only about the current purchase, but about past purchases as well. When a user wants to find out the reason for a drop in sales in a particular region, the user needs all the sales data for that region over a period extending back in time. Time variance implies that every unit of data in the data warehouse is accurate as of some moment in time. In some cases, a record is time stamped; in other cases, a record carries a transaction date. But in every case, there is some form of time marking to show the moment in time at which the record is accurate.
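The following sketch (table and column names are hypothetical) illustrates both the non-volatile and the time-variant characteristics: each load appends a dated snapshot row, and existing rows are never updated.

-- Snapshot table: every row is marked with the date for which it is accurate
CREATE TABLE customer_status_snapshot (
    customer_key   INTEGER NOT NULL,
    snapshot_date  DATE    NOT NULL,
    credit_rating  VARCHAR(10),
    account_status VARCHAR(10),
    PRIMARY KEY (customer_key, snapshot_date)
);

-- Initial snapshot loaded into the warehouse
INSERT INTO customer_status_snapshot VALUES (1001, DATE '2012-03-31', 'A', 'ACTIVE');

-- A later change is written as a new snapshot row, preserving the historical record
INSERT INTO customer_status_snapshot VALUES (1001, DATE '2012-06-30', 'B', 'ACTIVE');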
Architecture
The architecture of a data warehouse includes data sources, a wrapper, a monitor, an integrator, and the data warehouse data repository (Figure 1.11). The bottom level of the architecture depicts the data sources, which store day-to-day transaction data. The monitor component is responsible for automatically detecting changes of interest in the source data and reporting them to the integrator component. The wrapper component is responsible for translating information from the native format of the source into a format compatible with the warehouse. New information sources and changes in existing data sources are propagated to the integrator. The integrator acts as a liaison between the wrapper/monitor components and the data warehouse data repository: it brings source data into the repository, which may include filtering, summarizing, and merging the information. In order to properly integrate new change information into the data repository, a process is needed to store change data without affecting the quality of service (QoS) for the same or different data sources.
The information stored at the warehouse is in the form of views derived from data at the sources. These views stored at the warehouse are often referred to as materialized views. The data warehouse itself can use an off-the-shelf or special-purpose database management system. Typically there is a single, centralized warehouse, but the warehouse certainly may be implemented as a distributed database system; in fact, data parallelism or distribution may be necessary to provide the desired performance.
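As a small illustration (the sales table, its columns, and the view name are hypothetical), a derived view can be materialized at the warehouse so that queries read the stored aggregate rather than recomputing it from the sources; in Oracle or PostgreSQL syntax this might look like:

-- Materialized view: a derived, pre-aggregated view stored at the warehouse
CREATE MATERIALIZED VIEW mv_monthly_branch_sales AS
SELECT branch_id,
       EXTRACT(YEAR FROM sale_date)  AS sale_year,
       EXTRACT(MONTH FROM sale_date) AS sale_month,
       SUM(amount)                   AS total_sales
FROM   sales
GROUP  BY branch_id, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);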
The architecture and basic functionality we have described are more general than those provided by most commercial data warehousing systems. In particular, current systems usually assume that the sources and the warehouse subscribe to a single data model (normally relational), that propagation of information from the sources to the warehouse is performed as a batch process (perhaps off-line), and that queries from the integrator to the information sources are never needed.
ETL
ETL (Extract, Transform, and Load) is a process that extracts data from different source systems, then transforms the data (applying calculations, concatenations, and so on), and finally loads the data into the data warehouse system.
It is tempting to think that creating a data warehouse is simply a matter of extracting data from multiple sources and loading it into the warehouse database. This is far from the truth: a complex ETL process is required. The ETL process requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.
In order to maintain its value as a tool for decision-makers, a data warehouse system needs to change as the business changes. ETL is a recurring activity (daily, weekly, or monthly) of a data warehouse system and needs to be agile, automated, and well documented.
There are several reasons why ETL is needed:
• It helps companies to analyze their business data for taking critical business decisions.
• Transactional databases cannot answer complex business questions that can be answered by ETL.
• A data warehouse provides a common data repository.
• ETL provides a method of moving the data from various sources into a data warehouse.
• As data sources change, the data warehouse will automatically update.
• A well-designed and documented ETL system is almost essential to the success of a data warehouse project.
• ETL allows verification of data transformation, aggregation, and calculation rules.
• The ETL process allows sample data comparison between the source and the target system.
• The ETL process can perform complex transformations and requires an extra area (the staging area) to store the data.
• ETL helps to migrate data into a data warehouse, converting the various formats and types to adhere to one consistent system.
• ETL is a predefined process for accessing and manipulating source data into the target database.
• ETL offers deep historical context for the business.
• It helps to improve productivity because it codifies and reuses transformations without a need for technical skills.
Step 1) Extraction
In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the data warehouse database, rollback would be a challenge. The staging area gives an opportunity to validate the extracted data before it moves into the data warehouse.
A data warehouse needs to integrate systems that have different DBMSs, hardware, operating systems, and communication protocols. Sources could include legacy applications such as mainframes, customized applications, point-of-contact devices such as ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, among others. Hence one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between the source and target data.
There are three data extraction methods:
1. Full Extraction
2. Partial Extraction, without update notification
3. Partial Extraction, with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases; any slowdown or locking could affect the company's bottom line.
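As a hedged sketch of partial extraction (the orders source table, the stg_orders staging table, and the etl_run_log control table are all hypothetical), only rows changed since the previous run are pulled:

-- Incremental extraction: copy only rows modified since the last recorded run
INSERT INTO stg_orders (order_id, customer_id, order_date, amount, last_modified)
SELECT order_id, customer_id, order_date, amount, last_modified
FROM   orders
WHERE  last_modified > (SELECT MAX(extracted_up_to) FROM etl_run_log);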
Step 2) Transformation
Data extracted from the source server is raw and not usable in its original form; therefore it needs to be cleansed, mapped, and transformed. In fact, this is the key step where the ETL process adds value and changes data such that insightful BI reports can be generated.
In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.
In the transformation step, you can perform customized operations on the data. For instance, the user may want a sum-of-sales revenue figure that is not in the database, or the first name and the last name in a table may be in different columns; it is possible to concatenate them before loading.
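As an illustrative sketch (the staging table and its columns are hypothetical), both transformations mentioned above can be expressed while reading from the staging area:

-- Concatenate name columns and derive a revenue figure that is not stored in the source
SELECT customer_id,
       first_name || ' ' || last_name AS customer_name,
       unit_price * units_sold        AS sales_revenue
FROM   stg_sales;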
Typical data integrity problems that must be resolved during transformation include inconsistent spellings and naming of the same entity across sources, inconsistent codes, required fields left blank, and invalid or duplicate records.
Step 3) Loading
Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period (for example, overnight), so the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data warehouse administrators need to monitor, resume, or cancel loads according to prevailing server performance.
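As a hedged sketch (the staging table stg_sales, the fact table sales_fact, and the batch_id column are hypothetical), an incremental load might simply append the rows of one batch, so that a failed load can be re-run for the same batch:

-- Load one batch of staged rows into the warehouse fact table
INSERT INTO sales_fact (time_key, item_key, location_key, dollars_sold, units_sold)
SELECT time_key, item_key, location_key, dollars_sold, units_sold
FROM   stg_sales
WHERE  batch_id = 20120331;   -- hypothetical identifier of the current load batch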
Types of loading include the initial load (populating all the data warehouse tables for the first time), the incremental load (applying ongoing changes periodically), and the full refresh (erasing the contents of one or more tables and reloading them with fresh data).
Load verification includes checks such as the following:
• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check combined values and calculated measures.
• Run data checks on the dimension tables as well as the history tables.
• Check the BI reports built on the loaded fact and dimension tables.
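A couple of these checks can be expressed as simple queries; the table and column names below are hypothetical (and assume the batch identifier is carried into the fact table), so they only sketch the idea:

-- Key fields in the fact table should be neither missing nor null (expect a count of zero)
SELECT COUNT(*) AS bad_rows
FROM   sales_fact
WHERE  time_key IS NULL OR item_key IS NULL OR location_key IS NULL;

-- Row counts staged versus loaded for the current batch should agree
SELECT (SELECT COUNT(*) FROM stg_sales  WHERE batch_id = 20120331) AS staged_rows,
       (SELECT COUNT(*) FROM sales_fact WHERE batch_id = 20120331) AS loaded_rows;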
A multidimensional data model is typically organized around a central theme, such as sales. This
theme is represented by a fact table. Facts are numeric measures. Think of them as the quantities by
which we want to analyze relationships between dimensions. Examples of facts for a sales data
warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and
amount budgeted. The fact table contains the names of the facts, or measures, as well as keys to
each of the related dimension tables. You will soon get a clearer picture of how this works when we
look at multidimensional schemas.
Although we usually think of cubes as 3-D geometric structures, in data warehousing the data cube is n-dimensional. To gain a better understanding of data cubes and the multidimensional data model, let’s start by looking at a simple 2-D data cube that is, in fact, a table or spreadsheet for sales data from AllElectronics. In particular, we will look at the AllElectronics sales data for items sold per quarter in the city of Vancouver. These data are shown in Table 4.2. In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact or measure displayed is dollars sold (in thousands).
Now, suppose that we would like to view the sales data with a third dimension. For instance, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. These 3-D data are shown in Table 4.3. The 3-D data in the table are represented as a series of 2-D tables. Conceptually, we may also represent the same data in the form of a 3-D data cube, as in Figure 4.3. Suppose that we would now like to view our sales data with an additional fourth dimension such as supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a series of 3-D cubes, as shown in Figure 4.4. If we continue in this way, we may display any n-dimensional data as a series of (n − 1)-dimensional “cubes.”
The data cube is a metaphor for multidimensional data storage. The actual physical storage of such data may differ from its logical representation. The important thing to remember is that data cubes are n-dimensional and do not confine data to 3-D. Tables 4.2 and 4.3 show the data at different degrees of summarization. In the data warehousing research literature, a data cube like those shown in Figures 4.3 and 4.4 is often referred to as a cuboid. Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. The result would form a lattice of cuboids, each showing the data at a different level of summarization, or group-by. The lattice of cuboids is then referred to as a data cube. Figure 4.5 shows a lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier. The cuboid that holds the lowest level of summarization is called the base cuboid.
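In relational terms, the cuboids of a small cube can be generated with standard SQL grouping extensions. The sketch below is illustrative (it assumes hypothetical star-schema tables sales_fact, item_dim, and location_dim): GROUP BY CUBE produces one group-by, i.e. one cuboid, for every subset of the listed dimensions.

-- Each grouping is one cuboid of the lattice: (item, location), (item), (location), and the apex ()
SELECT i.type, l.country, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   item_dim     i ON i.item_key     = f.item_key
JOIN   location_dim l ON l.location_key = f.location_key
GROUP  BY CUBE (i.type, l.country);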
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
The entity-relationship data model is commonly used in the design of relational databases, where a
database schema consists of a set of entities and the relationships between them. Such a data model
is appropriate for online transaction processing. A data warehouse, however, requires a concise,
subject-oriented schema that facilitates online data analysis. The most popular data model for a data
warehouse is a multidimensional model, which can exist in the form of a star schema, a snowflake
schema, or a fact constellation schema. Let’s look at each of these.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy,
and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema
graph resembles a starburst, with the dimension tables displayed in a radial pattern around the
central fact table. In a typical sales star schema, the fact table holds keys to each dimension table along with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g., time key and item key) are system-generated identifiers. Notice
that in the star schema, each dimension is represented by only one table, and each table contains a
set of attributes. For example, the location dimension table contains the attribute set {location key,
street, city, province or state, country}. This constraint may introduce some redundancy. For
example, “Urbana” and “Chicago” are both cities in the state of Illinois, USA. Entries for such cities in
the location dimension table will create redundancy among the attributes province or state and
country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover, the attributes within a
dimension table may form either a hierarchy (total order) or a lattice (partial order).
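A minimal sketch of such a star schema, simplified from the AllElectronics sales example (the names and data types below are illustrative, not a prescribed design), could be:

-- Dimension tables: one table per dimension, keyed by a system-generated identifier
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, day DATE, month INTEGER,
                           quarter VARCHAR(2), year INTEGER);
CREATE TABLE item_dim     (item_key     INTEGER PRIMARY KEY, item_name VARCHAR(60),
                           brand VARCHAR(30), type VARCHAR(30));
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street VARCHAR(60), city VARCHAR(30),
                           province_or_state VARCHAR(30), country VARCHAR(30));

-- Central fact table: foreign keys to every dimension plus the measures
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    location_key INTEGER REFERENCES location_dim,
    dollars_sold NUMERIC(12,2),
    units_sold   INTEGER
);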
Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake. The major difference between the
snowflake and star schema models is that the dimension tables of the snowflake model may be kept
in normalized form to reduce redundancies. Such a table is easy to maintain and saves storage
space. However, this space savings is negligible in comparison to the typical magnitude of the fact
table. Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more
joins will be needed to execute a query. Consequently, the system performance may be adversely
impacted. Hence, although the snowflake schema reduces redundancy, it is not as popular as the
star schema in data warehouse design.
Measures. Measures store quantifiable business data (such as sales, expenses, and inventory).
Measures are also called "facts". Measures are organized by one or more dimensions and may be
stored or calculated at query time:
Stored Measures. Stored measures are loaded and stored at the leaf level. Commonly, there is also a
percentage of summary data that is stored. Summary data that is not stored is dynamically
aggregated when queried.
Calculated Measures. Calculated measures are measures whose values are calculated dynamically at
query time. Only the calculation rules are stored in the database. Common calculations include
measures such as ratios, differences, totals and moving averages. Calculations do not require disk
storage space, and they do not extend the processing time required for data maintenance.
Dimensions. A dimension is a structure that categorizes data to enable users to answer business questions. Commonly used dimensions are Customers, Products, and Time. A dimension's structure is organized hierarchically based on parent-child relationships. These relationships enable the following:
Hierarchies on dimensions enable drilling down to lower levels or navigating to higher levels (rolling up). Drilling down on the Time dimension member 2012 typically navigates you to the quarters Q1 2012 through Q4 2012. In a calendar year hierarchy for 2012, drilling down on Q1 2012 would navigate you to the months January 2012 through March 2012. These kinds of relationships make it easy for users to navigate through large volumes of multidimensional data.
The reverse of aggregation is allocation, which is heavily used by planning, budgeting, and similar applications. Here, the role of the hierarchy is to identify the children and descendants of particular dimension members for "top-down" allocation of budgets (among other uses).
Share and index calculations take advantage of hierarchical relationships (for example, the
percentage of total profit contributed by each product, or the percentage share of product revenue
for a certain category, or costs as a percentage of the geographical region for a retail location).
A dimension object helps to organize and group dimensional information into hierarchies. This
represents natural 1:n relationships between columns or column groups (the levels of a hierarchy)
that cannot be represented with constraint conditions. Going up a level in the hierarchy is called
rolling up the data and going down a level in the hierarchy is called drilling down the data.
There are two ways that you can implement a dimensional model: as relational tables arranged in a star (or snowflake) schema, or as OLAP cubes.
Oracle OLAP Cubes. The physical model provided with Oracle Communications Data Model provides a dimensional perspective of the data using Oracle OLAP cubes. This dimensional model is discussed in "Characteristics of the OLAP Dimensional Model".
Factless Fact Tables. A fact table that contains no measures, only foreign keys to the dimension tables, is called a factless fact table. For example, think about a record of student attendance in classes. In this case, the fact table would consist of three dimensions: the student dimension, the time dimension, and the class dimension, and each row would simply record one combination of student, time, and class. The only measure that you could possibly attach to each combination is "1" to show the presence of that particular combination; however, adding a fact that always equals 1 is redundant, because we can simply use the COUNT function in SQL to answer the same questions.
Factless fact tables offer great flexibility in data warehouse design. For example, with this factless fact table one can easily answer questions such as how many students attended a given class during a given period, or how many classes a given student attended.
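As an illustrative sketch (names are hypothetical), the attendance fact table holds only the dimension keys, and COUNT answers such questions directly:

-- Factless fact table: foreign keys only, no numeric measure
CREATE TABLE attendance_fact (
    student_key INTEGER NOT NULL,
    class_key   INTEGER NOT NULL,
    time_key    INTEGER NOT NULL
);

-- How many students attended each class during a given period?
SELECT class_key, COUNT(*) AS students_attended
FROM   attendance_fact
WHERE  time_key BETWEEN 20120101 AND 20120331   -- hypothetical date-key range
GROUP  BY class_key;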
“How are concept hierarchies useful in OLAP?” In the multidimensional model, data are organized
into multiple dimensions, and each dimension contains multiple levels of abstraction defined by
concept hierarchies. This organization provides users with the flexibility to view data from different
perspectives. A number of OLAP data cube operations exist to materialize these different views,
allowing interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly
environment for interactive data analysis.
OLAP operations. Let’s look at some typical OLAP operations for multidimensional data. Each of the
following operations described is illustrated in Figure 4.12. At the center of the figure is a data cube
for AllElectronics sales. The cube contains the dimensions location, time, and item, where location is
aggregated with respect to city values, time is aggregated with respect to quarters, and item is
aggregated with respect to item types. To aid in our explanation, we refer to this cube as the central
cube. The measure displayed is dollars sold (in thousands). (For improved readability, only some of
the cubes’ cell values are shown.) The data examined are for the cities Chicago, New York, Toronto,
and Vancouver.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors)
performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or
by dimension reduction. Figure 4.12 shows the result of a roll-up operation performed on the central
cube by climbing up the concept hierarchy for location given in Figure 4.9. This hierarchy was
defined as the total order “street < city < province or state < country.” The roll-up operation shown
aggregates the data by ascending the location hierarchy from the level of city to the level of country.
In other words, rather than grouping the data by city, the resulting cube groups the data by country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the
given cube. For example, consider a sales data cube containing only the location and time
dimensions. Roll-up may be performed by removing, say, the time dimension, resulting in an
aggregation of the total sales by location, rather than by location and by time.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional
dimensions. Figure 4.12 shows the result of a drill-down operation performed on the central cube by
stepping down a concept hierarchy for time defined as “day < month < quarter < year.” Drill-down
occurs by descending the time hierarchy from the level of quarter to the more detailed level of
month. The resulting data cube details the total sales per month rather than summarizing them by
quarter. Because a drill-down adds more detail to the given data, it can also be performed by adding
new dimensions to a cube. For example, a drill-down on the central cube of Figure 4.12 can occur by
introducing an additional dimension, such as customer group.
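In relational terms, and reusing the illustrative star-schema tables sketched earlier, a roll-up from city to country is simply a change of grouping level (a drill-down would, analogously, group by month instead of quarter):

-- Before roll-up: sales grouped by city and quarter
SELECT l.city, t.quarter, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   location_dim l ON l.location_key = f.location_key
JOIN   time_dim     t ON t.time_key     = f.time_key
GROUP  BY l.city, t.quarter;

-- After rolling location up from city to country
SELECT l.country, t.quarter, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   location_dim l ON l.location_key = f.location_key
JOIN   time_dim     t ON t.time_key     = f.time_key
GROUP  BY l.country, t.quarter;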
Slice and dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from
the central cube for the dimension time using the criterion time = “Q1.” The dice operation defines a
subcube by performing a selection on two or more dimensions. Figure 4.12 shows a dice operation
on the central cube based on the following selection criteria that involve three dimensions: (location
= “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or
“computer”).
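Using the same illustrative tables, a slice fixes a single value on one dimension, while a dice restricts several dimensions at once:

-- Slice: select the subcube where time = 'Q1'
SELECT i.type, l.city, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim     t ON t.time_key     = f.time_key
JOIN   item_dim     i ON i.item_key     = f.item_key
JOIN   location_dim l ON l.location_key = f.location_key
WHERE  t.quarter = 'Q1'
GROUP  BY i.type, l.city;

-- Dice: selection on three dimensions at once
SELECT i.type, l.city, t.quarter, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim     t ON t.time_key     = f.time_key
JOIN   item_dim     i ON i.item_key     = f.item_key
JOIN   location_dim l ON l.location_key = f.location_key
WHERE  l.city    IN ('Toronto', 'Vancouver')
  AND  t.quarter IN ('Q1', 'Q2')
  AND  i.type    IN ('home entertainment', 'computer')
GROUP  BY i.type, l.city, t.quarter;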
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view
to provide an alternative data presentation. Figure 4.12 shows a pivot operation where the item and
location axes in a 2-D slice are rotated. Other examples include rotating the axes in a 3-D cube, or
transforming a 3-D cube into a series of 2-D planes.
Other OLAP operations: Some OLAP systems offer additional drilling operations. For example, drill-
across executes queries involving (i.e., across) more than one fact table. The drill-through operation
uses relational SQL facilities to drill through the bottom level of a data cube down to its back-end
relational tables. Other OLAP operations may include ranking the top N or bottom N items in lists, as
well as computing moving averages, growth rates, interests, internal return rates, depreciation,
currency conversions, and statistical functions. OLAP offers analytical modeling capabilities, including
a calculation engine for deriving ratios, variance, and so on, and for computing measures across
multiple dimensions. It can generate summarizations, aggregations, and hierarchies at each
granularity level and at every dimension intersection. OLAP also supports functional models for
forecasting, trend analysis, and statistical analysis. In this context, an OLAP engine is a powerful data
analysis tool.
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational
DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
ROLAP servers include optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services. ROLAP technology tends to have greater
scalability than MOLAP technology. The DSS server of Microstrategy, for example, adopts the ROLAP
approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views
through array-based multidimensional storage engines. They map multidimensional views directly to
data cube array structures. The advantage of using a data cube is that it allows fast indexing to
precomputed summarized data. Notice that with multidimensional data stores, the storage
utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored.
Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data
sets: Denser subcubes are identified and stored as array structures, whereas sparse subcubes
employ compression technology for efficient storage utilization.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example,
a HOLAP server may allow large volumes of detailed data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. Microsoft SQL Server 2000, for example, supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases,
some database system vendors implement specialized SQL servers that provide advanced query
language and query processing support for SQL queries over star and snowflake schemas in a read-
only environment.
“How are data actually stored in ROLAP and MOLAP architectures?” Let’s first look at ROLAP. As its
name implies, ROLAP uses relational tables to store data for online analytical processing. Recall that
the fact table associated with a base cuboid is referred to as a base fact table. The base fact table
stores data at the abstraction level indicated by the join keys in the schema for the given data cube.
Aggregated data can also be stored in fact tables, referred to as summary fact tables. Some summary fact tables store both base fact table data and aggregated data (see Example 4.10). Alternatively, separate summary fact tables can be used for each abstraction level to store only aggregated data.
Example 4.10 (A ROLAP data store). Table 4.4 shows a summary fact table that contains both base fact data and aggregated data. The schema is “⟨record identifier (RID), item, . . . , day, month, quarter, year, dollars sold⟩,” where day, month, quarter, and year define the sales date, and dollars sold is the sales amount. Consider the tuples with an RID of 1001 and 1002, respectively.
The data of these tuples are at the base fact level, where the sales dates are October 15, 2010, and
October 23, 2010, respectively. Consider the tuple with an RID of 5001. This tuple is at a more
general level of abstraction than the tuples 1001 and 1002. The day value has been generalized to
all, so that the corresponding time value is October 2010. That is, the dollars sold amount shown is
an aggregation representing the entire month of October 2010, rather than just October 15 or 23,
2010. The special value all is used to represent subtotals in summarized data.
MOLAP uses multidimensional array structures to store data for online analytical processing. Most data warehouse systems adopt a client-server architecture. A relational data store always resides at the data warehouse/data mart server site, while a multidimensional data store can reside at either the database server site or the client site.