Data Warehousing & Mining
SYLLABUS
Data Mining
What is Data Mining (DM)? Definition and description, Relationships and Patterns, KDD vs. Data Mining, DBMS vs. Data Mining, Elements and uses of Data Mining, Measuring Data Mining Effectiveness: Accuracy, Speed and Cost; Data, Information and Knowledge; Data Mining vs. Machine Learning, Data Mining Models, Issues and challenges in DM, DM Application Areas.
OLAP
Need for OLAP, OLAP vs. OLTP, Multidimensional Data Model, Multidimensional versus Multirelational OLAP, Characteristics of OLAP: the FASMI test (Fast, Analysis, Share, Multidimensional and Information), Features of OLAP, OLAP Operations, Categorization of OLAP Tools: MOLAP, ROLAP.
Suggested Readings:
COURSE OVERVIEW
The last few years have seen a growing recognition of information as a key business tool. In general, the current business market dynamics make it abundantly clear that, for any company, information is the very key to survival.

If we look at the evolution of information processing technologies, we can see that while the first generation of client/server systems brought data to the desktop, not all of this data was easy to understand, and as such it was not very useful to end users. As a result, a number of new technologies have emerged that are focused on improving the information content of the data to empower the knowledge workers of today and tomorrow. Among these technologies are data warehousing, online analytical processing (OLAP), and data mining.

Therefore, this book is about the need, the value and the technological means of acquiring and using information. The need for and characteristics of Data Warehousing, Data Warehousing Models, data warehouse architecture and principles of Data Warehousing, and topics related to building a data warehouse project are discussed, along with managing and implementing a data warehouse project. Using these topics as a foundation, the book proceeds to analyze various important concepts related to Data Mining, Techniques of data mining, Need for OLAP, Multidimensional versus Multirelational OLAP, OLAP Operations and Categorization of OLAP Tools: MOLAP and ROLAP. Armed with the knowledge of data warehousing technology, the student continues into a discussion on the principles of business analysis, models and patterns and an in-depth analysis of data mining.

Prerequisite
Knowledge of Database Management Systems

Objective
Ever since the dawn of business data processing, managers have been seeking ways to increase the utility of their information systems. In the past, much of the emphasis has been on automating the transactions that move an organization through the interlocking cycles of sales, production and administration. Whether accepting an order, purchasing raw materials, or paying […], what they want is information. In conjunction with the increased amount of data, there has been a shift in the primary users of computers, from a limited group of information systems professionals to a much larger group of knowledge workers with expertise in particular business domains, such as finance, marketing, or manufacturing. Data warehousing is a collection of technologies designed to convert heaps of data into usable information. It does this by consolidating data from diverse transactional systems into a coherent collection of consistent, quality-checked databases used only for informational purposes. Data warehouses are used to support online analytical processing (OLAP).

However, the very size and complexity of data warehouses make it difficult for any user, no matter how knowledgeable in the application of data, to formulate all possible hypotheses that might explain something such as the behavior of a group of customers. How can anyone successfully explore databases containing 100 million rows of data, each with thousands of attributes? The newest technology to address these concerns is data mining. Data mining uses sophisticated statistical analysis and modeling techniques to uncover patterns and relationships hidden in organizational databases, patterns that ordinary methods might miss.

The objective of this book is to give detailed information about Data warehousing, OLAP and data mining. I have brought together these different pieces of data warehousing, OLAP and data mining and have provided an understandable and coherent explanation of how data warehousing as well as data mining works, plus how it can be used from the business perspective. This book will be a useful guide.
CONTENT
Lesson No. Topic Page No.
Lesson Plan vi
Data Warehousing
Lesson 1 Introduction to Data Warehousing 1
Lesson 2 Meaning and Characteristics of Data Warehousing 5
Lesson 24 Decision Trees - 2 103
Lesson 25 Neural Networks 107
Lesson 26 Neural Networks 112
Lesson 27 Association Rules and Genetic Algorithm 118
OLAP
Lesson 28 Online Analytical Processing, Need for OLAP
Multidimensional Data Model 124
Lesson 29 OLAP vs. OLTP, Characteristics of OLAP 129
Lesson 30 Multidimensional versus Multirelational OLAP,
Features of OLAP 132
Lesson 31 OLAP Operations 136
Lesson 32 Categorization of OLAP Tools Concepts used
in MOLAP/ ROLAP 141
CHAPTER 1: DATA WAREHOUSING
LESSON 1
INTRODUCTION TO DATA WAREHOUSING
The objective of this lesson is to make you understand what a data warehouse is and what it is not. You will learn what human resources are required, as well as the roles and responsibilities of each player. You will be given an overview of good project management techniques to help ensure that the data warehouse initiative does not fail due to poor project management. You will learn how to physically implement a data warehouse, with some new tools currently available to help you mine the vast amounts of information stored within the warehouse. Without fine-tuning this ability to mine the warehouse, even the most complete warehouse would be useless.

History of Data Warehousing
Let us first review the historical management schemes for analysis data and the factors that have led to the evolution of the data warehousing application class.

Traditional Approaches to Historical Data
Throughout the history of systems development, the primary emphasis had been given to the operational systems and the data they process. It was not practical to keep data in the operational systems indefinitely, and only as an afterthought was a structure designed for archiving the data that the operational system had processed. The fundamental requirements of the operational and analysis systems are different: the operational systems need performance, whereas the analysis systems need flexibility and broad scope.

Data from Legacy Systems
Different platforms have been developed along with the development of computer systems over the past three decades. In the 1970's, business system development was done on IBM mainframe computers using tools such as Cobol, CICS, IMS, DB2, etc. With the advent of the 1980's, computer platforms such as AS/400 and VAX/VMS were developed. In the late eighties and early nineties UNIX became a popular server platform, introducing the client/server architecture which remains popular till date. Despite all the changes in platforms, architectures, tools, and technologies, a large number of business applications continue to run in the mainframe environment of the 1970's. The most important reason is that over the years these systems have captured the business knowledge and rules that are incredibly difficult to carry to a new platform or application. These systems are generically called legacy systems. The data stored in such systems ultimately becomes remote and difficult to get at.

Extracted Information on the Desktop
During the past decade the personal computer has become very popular for business analysis. Business analysts now have many of the tools required to use spreadsheets for analysis and graphic representation. Advanced users will frequently use desktop database programs to store and work with the information extracted from the legacy sources.

The disadvantage of the above is that it leaves the data fragmented and oriented towards very specific needs. Each individual user has obtained only the information that she/he requires. The extracts are unable to address the requirements of multiple users and uses. The time and cost involved in addressing the requirements of only one user are large. These disadvantages led to the development of a new application called Data Warehousing.

Factors Which Lead to Data Warehousing
Many factors have influenced the quick evolution of the data warehousing discipline. The most important factor has been the advancement in hardware and software technologies.

• Hardware and software prices: Software and hardware prices have fallen to a great extent. Higher capacity memory chips are available at very low prices.
• Powerful processors: Today's processors are many times more powerful than yesterday's mainframes, e.g. Pentium III and Alpha processors.
• Inexpensive disks: The hard disks of today can store hundreds of gigabytes, with their prices falling. The amount of information that can be stored on just a single one-inch high disk drive would have required a roomful of disk drives in the 1970's and early eighties.
• Powerful desktops for analysis tools: Easy-to-use GUI interfaces, client/server architecture or multi-tier computing can be run on desktops, as opposed to the mainframe computers of yesterday.
• Server software: Server software is inexpensive, powerful, and easy to maintain as compared to that of the past. An example of this is Windows NT, which has made the setup of powerful systems very easy as well as reduced the cost.

The skyrocketing power of hardware and software, along with the availability of affordable and easy-to-use reporting and analysis tools, have played the most important role in the evolution of data warehouses.

Emergence of Standard Business Applications
New vendors provide end users with popular business application suites. German software vendor SAP AG, Baan, PeopleSoft, and Oracle have come out with suites of software that provide different strengths but have comparable functionality. These application suites provide standard applications that can replace the existing custom-developed legacy applications. This has led to the increase in popularity of such applications. Also, data acquisition from these applications is much simpler than from the mainframes.

End Users More Technology Oriented
One of the most important results of the massive investment in technology and the movement towards the powerful personal computer has been the evolution of a technology-oriented business analyst. Even though technology-oriented end users are not always beneficial to all projects, this trend certainly has produced a crop of technology-leading business analysts that are becoming essential to today's business. These technology-oriented end users have frequently played an important role in the development and deployment of data warehouses. They have become the core users that are first to demonstrate the initial benefits of data warehouses. These end users are also critical to the development of the data warehouse model: as they become experts with the data warehousing system, they train other users.
LESSON 2
MEANING AND CHARACTERISTICS OF DATA WAREHOUSING
In the last few years, Data Warehousing has grown rapidly from a set of related ideas into an architecture for data delivery for enterprise end-user computing.

Data warehouses support high-performance demands on an organization's data and information. Several types of applications are supported: OLAP, DSS, and data mining applications. OLAP (online analytical processing) is a term used to describe the analysis of complex data from the data warehouse in the hands of skilled knowledge workers. OLAP tools use distributed computing capabilities for analyses that require more storage and processing power than can be economically and efficiently located on an individual desktop. DSS (Decision-Support Systems), also known as EIS (Executive Information Systems, not to be confused with enterprise integration systems), support an organization's leading decision makers with higher-level data for complex and important decisions. Data mining is used for knowledge discovery, the process of searching data for unanticipated new knowledge.

Traditional databases support On-Line Transaction Processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. Traditional relational databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation. Thus, they cannot be optimized for OLAP, DSS, or data mining. By contrast, data warehouses are designed precisely to support efficient extraction, processing, and presentation for analytic and decision-making purposes. In comparison to traditional databases, data warehouses generally contain very large amounts of data from multiple sources, which may include databases from different data models and sometimes files acquired from independent systems and platforms.

A database is a collection of related data, and a database system is a database and database software together. A data warehouse is also a collection of information as well as a supporting system. However, a clear distinction exists: traditional databases are transactional (relational, object-oriented, network, or hierarchical). Data warehouses have the distinguishing characteristic that they are mainly intended for decision-support applications. They are optimized for data retrieval, not routine transaction processing.

Characteristics of Data Warehousing
As per W. H. Inmon, author of Building the Data Warehouse and the guru who is widely considered to be the originator of the data warehousing concept, there are generally four characteristics that describe a data warehouse. W. H. Inmon characterized a data warehouse as "a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions." Data warehouses provide access to data for complex analysis, knowledge discovery, and decision-making.

Subject Oriented
Data are organized according to subject instead of application; e.g. an insurance company using a data warehouse would organize its data by customer, premium, and claim, instead of by different products (auto, life, etc.). The data organized by subject contain only the information necessary for decision support processing.

Integrated
When data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application gender might be coded as "m" and "f", in another by 0 and 1. When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention; e.g. gender data is transformed to "m" and "f".

Time Variant
The data warehouse contains a place for storing data that are five to ten years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.

Non-volatile
Data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed.

Data warehouses have the following distinctive characteristics:
• Multidimensional conceptual view
• Generic dimensionality
• Unlimited dimensions and aggregation levels
• Unrestricted cross-dimensional operations
• Dynamic sparse matrix handling
• Client-server architecture
• Multi-user support
• Accessibility
• Transparency
• Intuitive data manipulation
• Consistent reporting performance
• Flexible reporting

Because they encompass large volumes of data, data warehouses are generally an order of magnitude (sometimes two orders of magnitude) larger than the source databases. The sheer volume of data (likely to be in terabytes) is an issue that has been dealt with through enterprise-wide data warehouses, virtual data warehouses, and data marts:
• Enterprise-wide data warehouses are huge projects requiring massive investment of time and resources.
• Virtual data warehouses provide views of operational databases that are materialized for efficient access.
• Data marts generally are targeted to a subset of the organization, such as a department, and are more tightly focused.

To summarize the above, here are some important points to remember about the various characteristics of a data warehouse:
• Subject-oriented
  • Organized around major subjects, such as customer, product, sales.
  • Focusing on the modeling and analysis of data for decision support.
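The consistent-encoding requirement described under the Integrated characteristic can be shown with a minimal sketch. The two source systems, field names and the assignment of 0/1 to "m"/"f" below are assumptions made purely for illustration; the point is only that each source's code is translated to the warehouse's single convention during the load.

```python
# Illustrative sketch: unifying inconsistent gender encodings during the load.
# The source systems, field names and the 0/1 assignment are hypothetical.

SOURCE_A_CODES = {"m": "m", "f": "f"}   # application A already uses "m"/"f"
SOURCE_B_CODES = {0: "m", 1: "f"}       # application B uses 0/1 (assumed mapping)

def to_warehouse_gender(value, source):
    """Translate a source-specific gender code to the warehouse convention."""
    mapping = SOURCE_A_CODES if source == "A" else SOURCE_B_CODES
    return mapping[value]

records = [
    {"source": "A", "customer_id": 101, "gender": "f"},
    {"source": "B", "customer_id": 102, "gender": 0},
]

for r in records:
    r["gender"] = to_warehouse_gender(r["gender"], r["source"])

print(records)   # both rows now carry "m"/"f", the warehouse convention
```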
LESSON 3
ONLINE TRANSACTION PROCESSING
• Mapping from the operational environment to the data warehouse
• In contrast, OLAP uses Multi-Dimensional (MD) views of data
• MD views provide the foundation for analytical processing
• Identify the various benefits of OLTP.
LESSON 4
DATA WAREHOUSING MODELS
Structure
• Introduction
• Objective
• The Data Warehouse Model
• Data Modeling for Data Warehouses
• Multidimensional models
• Roll-up display
• Drill-down display
• Multidimensional Schemas
• Star Schema
• Snowflake Schema
Objective
The main objective of this lesson is to make you understand a
data warehouse model. It also explains various types of
multidimensional models and Schemas.
Introduction
Data warehousing is the process of extracting and transforming operational data into informational data and loading it into a central data store or warehouse. Once the data is loaded, it is accessible via desktop query and analysis tools by the decision makers.

The Data Warehouse Model
The data warehouse model is illustrated in the following diagram.

Figure 1: A data warehouse model

The data within the actual warehouse itself has a distinct structure, with the emphasis on different levels of summarization, as shown in the figure below.

Figure 2: The structure of data inside the data warehouse

The current detail data is central in importance as it:
• reflects the most recent happenings, which are usually the most interesting;
• is voluminous, as it is stored at the lowest level of granularity;
• is almost always stored on disk storage, which is fast to access but expensive and complex to manage.

Older detail data is stored on some form of mass storage; it is infrequently accessed and stored at a level of detail consistent with the current detailed data.

Lightly summarized data is data distilled from the low level of detail found at the current detailed level and is generally stored on disk storage. When building the data warehouse we have to consider what unit of time the summarization is done over and also the contents, or what attributes, the summarized data will contain.

Highly summarized data is compact and easily accessible and can even be found outside the warehouse.

Metadata is the final component of the data warehouse and is really of a different dimension, in that it is not the same as data drawn from the operational environment but is used as:
• a directory to help the DSS analyst locate the contents of the data warehouse;
• a guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment;
• a guide to the algorithms used for summarization between the current detailed data and the lightly summarized data, and between the lightly summarized data and the highly summarized data, etc.
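As a rough illustration of the "lightly summarized" level described above, the sketch below rolls current detail data up to a chosen unit of time (here, weekly) and a chosen set of attributes (product and units sold). The sample rows and field names are invented for the example.

```python
# Sketch: deriving lightly summarized data (weekly totals per product)
# from current detail data. Sample data and field names are hypothetical.
from collections import defaultdict

detail_rows = [                      # current detail: one row per sale
    {"week": "2004-W01", "product": "P1", "units": 3},
    {"week": "2004-W01", "product": "P1", "units": 5},
    {"week": "2004-W01", "product": "P2", "units": 2},
    {"week": "2004-W02", "product": "P1", "units": 7},
]

weekly_summary = defaultdict(int)    # (week, product) -> total units
for row in detail_rows:
    weekly_summary[(row["week"], row["product"])] += row["units"]

for (week, product), units in sorted(weekly_summary.items()):
    print(week, product, units)
```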
The basic structure has been described, but Bill Inmon fills in the details to make the example come alive, as shown in the following diagram.

Data Modeling for Data Warehouses
In the figure, there is a three-dimensional data cube that organizes product sales data by fiscal quarters and sales regions. Each cell could contain data for a specific product, specific fiscal quarter, and specific region. By including additional dimensions, a data hypercube could be produced, although more than three dimensions cannot be easily visualized or presented graphically. The data can be queried directly in any combination of dimensions, bypassing complex database queries. Tools exist for viewing data along the chosen dimensions. In a multidimensional schema, the fact table contains the data, and the dimensions identify each tuple in that data.
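The three-dimensional cube described above can be sketched directly as a mapping from (product, quarter, region) coordinates to a measure; querying "in any combination of dimensions" then amounts to filtering and summing over cells. All names and figures in the sketch are made up for illustration.

```python
# Sketch of a small product x quarter x region data cube.
# Each cell maps a (product, fiscal quarter, region) coordinate to sales units.
cube = {
    ("P1", "Q1", "East"): 120, ("P1", "Q1", "West"): 80,
    ("P1", "Q2", "East"): 150, ("P2", "Q1", "East"): 60,
    ("P2", "Q2", "West"): 90,
}

def total(product=None, quarter=None, region=None):
    """Sum the cells that match the given dimension values (None = all)."""
    return sum(
        value
        for (p, q, r), value in cube.items()
        if (product is None or p == product)
        and (quarter is None or q == quarter)
        and (region is None or r == region)
    )

print(total(product="P1"))                 # all quarters and regions for P1
print(total(quarter="Q1", region="East"))  # one slice of the cube
```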
Bitmap indexing works very well for domains of low cardinality. A 1 bit is placed in the jth position of the vector if the jth row contains the value being indexed. For example, imagine an inventory of 100,000 cars with a bitmap index on car size. If there are four car sizes (economy, compact, midsize, and full size), there will be four bit vectors, each containing 100,000 bits (12.5 KB), for a total index size of 50 KB. Bitmap indexing can provide considerable input/output and storage space advantages in low-cardinality domains. With bit vectors, a bitmap index can provide dramatic improvements in comparison, aggregation, and join performance.

In a star schema, dimensional data can be indexed to tuples in the fact table by join indexing. Join indexes are traditional indexes that maintain relationships between primary key and foreign key values. They relate the values of a dimension of a star schema to rows in the fact table. For example, consider a sales fact table that has city and fiscal quarter as dimensions. If there is a join index on city, then for each city the join index maintains the tuple IDs of the tuples containing that city. Join indexes may involve multiple dimensions.

Data warehouse storage can also facilitate access to summary data by taking further advantage of the nonvolatility of data warehouses and a degree of predictability of the analyses that will be performed using them. Two approaches have been used: (1) smaller tables that include summary data such as quarterly sales or revenue by product line, and (2) encoding of level (e.g., weekly, quarterly, annual) into existing tables. By comparison, the overhead of creating and maintaining such aggregations would likely be excessive in a volatile, transaction-oriented database.
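The car-size example above can be checked with a short sketch: one bit vector per distinct car size, with bit j set when row j holds that value. Python integers stand in for bit vectors here purely for illustration, and the toy rows are invented; only the size arithmetic mirrors the figures in the text.

```python
# Sketch of a bitmap index on a low-cardinality column (car size).
# One bit vector per distinct value; bit j is set when row j holds that value.
SIZES = ["economy", "compact", "midsize", "full"]
rows = ["economy", "midsize", "compact", "midsize", "full", "economy"]  # toy data

bitmaps = {size: 0 for size in SIZES}          # Python ints used as bit vectors
for j, size in enumerate(rows):
    bitmaps[size] |= 1 << j

# Counting or combining values is cheap bit arithmetic, e.g. compact or midsize:
combined = bitmaps["compact"] | bitmaps["midsize"]
print(bin(combined).count("1"))                # -> 3

# Size arithmetic from the text: 100,000 rows -> 100,000 bits = 12.5 KB per
# vector, and four vectors (one per car size) total roughly 50 KB.
print(100_000 / 8 / 1000, "KB per vector,", 4 * 100_000 / 8 / 1000, "KB total")
```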
Discussions
• What are the various kinds of models used in Data warehousing?
• Discuss the following:
  • Roll-up display
  • Drill-down operation
  • Star schema
  • Snowflake schema
• Why is the star schema called by that name?
• State an advantage of the multidimensional database structure over the relational database structure for data warehousing applications.
• What is one reason you might choose a relational structure over a multidimensional structure for a data warehouse database?
• Clearly contrast the difference between a fact table and a dimension table.

Exercises
1. Your college or university is designing a data warehouse to enable deans, department chairs, and the registrar's office to optimize course offerings, in terms of which courses are offered, in how many sections, and at what times. The data warehouse planners hope they will be able to do this better after examining historical demand for courses and extrapolating any trends that emerge.
  a. Give three dimension data elements and two fact data elements that could be in the database for this data warehouse. Draw a data cube for this database.
  b. State two ways in which each of the two fact data elements could be of low quality in some respect.
2. You have decided to prepare a budget for the next 12 months based on your actual expenses for the past 12. You need to get your expense information into what is in effect a data warehouse, which you plan to put into a spreadsheet for easy sorting and analysis.
  a. What are your information sources for this data warehouse?
  b. Describe how you would carry out each of the five steps of data preparation for a data warehouse database, from extraction through summarization. If a particular step does not apply, say so and justify your statement.

References
1. Adriaans, Pieter, Data Mining, Delhi: Pearson Education Asia, 1996.
2. Anahory, Sam, Data Warehousing in the Real World: A Practical Guide for Building Decision Support Systems, Delhi: Pearson Education Asia, 1997.
3. Berry, Michael J. A.; Linoff, Gordon, Mastering Data Mining: The Art and Science of Customer Relationship Management, New York: John Wiley & Sons, 2000.
4. Corey, Michael, Oracle8 Data Warehousing, New Delhi: Tata McGraw-Hill Publishing, 1998.
5. Elmasri, Ramez, Fundamentals of Database Systems, 3rd ed., Delhi: Pearson Education Asia, 2000.
LESSON 5
ARCHITECTURE AND PRINCIPLES OF DATA WAREHOUSING
… maintained in transactional databases. Compared with transactional databases, data warehouses are nonvolatile. That means that information in the data warehouse changes far less often and may be regarded as non-real-time with periodic updating. In transactional systems, transactions are the unit and the agent of change to a database; by contrast, data warehouse information is much more coarse-grained and is refreshed according to a careful choice of refresh policy, usually incremental. Warehouse updates are handled by the warehouse's acquisition component, which provides all required preprocessing.

We can also describe data warehousing more generally as "a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions." The following figure gives an overview of the conceptual structure of a data warehouse. It shows the entire data warehousing process. This process includes possible cleaning and reformatting of data before its warehousing. At the end of the process, OLAP, data mining, and DSS may generate new relevant information such as rules; this information is shown in the figure going back into the warehouse. The figure also shows that data sources may include files.

Data Warehousing, OnLine Analytical Processing (OLAP) and Decision Support Systems, apart from being buzz words of today's IT arena, are the expected result of IT systems and current needs. For decades, Information Management Systems have focused exclusively on gathering and recording into Database Management Systems the data that corresponded to everyday simple transactions, from which the name OnLine Transaction Processing (OLTP) comes.

Managers and analysts now need to go steps further from the simple data storing phase and exploit IT systems by posing complex queries and requesting analysis results and decisions that are based on the stored data. Here is where OLAP and Data Warehousing are introduced, bringing into business the necessary system architecture, principles, methodological approach and, finally, tools to assist in the presentation of functional Decision Support Systems.

I.M.F. has been working closely with the academic community, which only recently followed up the progress of the commercial arena that was boosting and pioneering in the area for the past decade, and adopted the architecture and methodology presented in the following picture. This is the result of the ESPRIT-funded Basic Research project, "Foundations of Data Warehouse Quality - DWQ".
Being basically dependent on architecture in concept, a Data
Warehouse - or an OLAP system - is designed by applying data
warehousing concepts on traditional database systems and
using appropriate design tools. Data Warehouses and OLAP
applications designed and implemented comply with the methodology adopted by IMF.
The final deployment takes place through the use of specialized
data warehouse and OLAP systems, namely MicroStrategy’s
DSS Series. MicroStrategy Inc. is one of the most prominent
and accepted international players on data warehousing systems
and tools, offering solutions for every single layer of the DW
architecture hierarchy.
Principles of Data Warehousing
• Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business.
• Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data conversion, filtering, reformatting, indexing and metadata update.
• Data Quality Management
Fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size.
• Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must be completed in seconds, not days.
• Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today these range from a few gigabytes to hundreds of gigabytes and terabyte-sized data warehouses.
Fig: Expanded Three-Level Physical Architecture

Associated with the three-level physical architecture are:
• Operational Data
Stored in the various operational systems throughout the organization.
• Reconciled Data
The data stored in the enterprise data warehouse; generally not intended for direct access by end users.
• Derived Data
The data stored in the data marts; selected, formatted, and aggregated for end-user decision-support applications.

Fig: Three-Layer Data Architecture

Discussions
• Write short notes on:
  • Data Quality Management
  • OLAP
  • DSS
  • Data marts
  • Operational data
• Discuss the Three-Layer Data Architecture with the help of a diagram.
• What are the various principles of a Data warehouse?
• What is the importance of a data warehouse in any organization? Where is it required?

Self Test
A set of multiple choices is given with every question. Choose the correct answer for the following questions.
1. A data warehouse cannot deal with
a. Data analysis
b. Operational activities
c. Information extraction
d. None of these
2. A data warehouse system requires
a. Only current data
b. Data for a large period
c. Only data projections
d. None of these
CHAPTER 2: BUILDING A DATA WAREHOUSE PROJECT
LESSON 6
DATA WAREHOUSING AND OPERATIONAL SYSTEMS

A typical operational system deals with one order, one account, …
Nearly all data in a typical data warehouse is built around the time dimension. Time is the primary filtering criterion for a very large percentage of all activity against the data warehouse. An analyst may generate queries for a given week, month, quarter, or a year. Another popular query in many data warehousing applications is the review of year-on-year activity. For example, one may compare sales for the first quarter of this year with the sales for the first quarter of prior years. The time dimension in the data warehouse also serves as a fundamental cross-referencing attribute. For example, an analyst may attempt to assess the impact of a new marketing campaign run during selected months by reviewing the sales during the same periods. The ability to establish and understand the correlation between activities of different organizational groups within a company is often cited as the single biggest advanced feature of data warehousing systems.

The data warehouse system can serve not only as an effective platform to merge data from multiple current applications; it can also integrate multiple versions of the same application. For example, an organization may have migrated to a new standard business application that replaces an old mainframe-based, custom-developed legacy application. The data warehouse system can serve as a very powerful and much needed platform to combine the data from the old and the new applications. Designed properly, the data warehouse can allow for year-on-year analysis even though the base operational application has changed.

Differences between Transaction and Analysis Processes
The most important reason for separating data for business analysis from the operational data has always been the potential performance degradation on the operational system that can result from the analysis processes. High performance and quick response time is almost universally critical for operational systems. The loss of efficiency and the costs incurred with slower responses on the predefined transactions are usually easy to calculate and measure. For example, a loss of five seconds of processing time is perhaps negligible in and of itself, but it compounds out to considerably more time and high costs once all the other operations it impacts are brought into the picture. On the other hand, business analysis processes in a data warehouse are difficult to predefine and they rarely need to have rigid response time requirements.

Operational systems are designed for acceptable performance for pre-defined transactions. For an operational system, it is typically possible to identify the mix of business transaction types in a given time frame, including the peak loads. It is also relatively easy to specify the maximum acceptable response time given a specific load on the system. The cost of a long response time can then be computed by considering factors such as the cost of operators, telecommunication costs, and the cost of any lost business. For example, an order processing system might specify the number of active order takers and the average number of orders for each operational hour. Even the query and reporting transactions against the operational system are most likely to be predefined with predictable volume.

Even though many of the queries and reports that are run against a data warehouse are predefined, it is nearly impossible to accurately predict the activity against a data warehouse. The process of data exploration in a data warehouse takes a business analyst through previously undefined paths. It is also common to have runaway queries in a data warehouse that are triggered by unexpected results or by users' lack of understanding of the data model. Further, many of the analysis processes tend to be all-encompassing, whereas the operational processes are well segmented. A user may decide to explore detail data while reviewing the results of a report from the summary tables. After finding some interesting sales activity in a particular month, the user may join the activity for this month with the marketing programs that were run during that particular month to further understand the sales. Of course, there would be instances where a user attempts to run a query that will try to build a temporary table that is a Cartesian product of two tables containing a million rows each! While an activity like this would unacceptably degrade an operational system's performance, it is expected and planned for in a data warehousing system.

Data is Mostly Non-volatile
Another key attribute of the data in a data warehouse system is that the data is brought to the warehouse after it has become mostly non-volatile. This means that after the data is in the data warehouse, there are no modifications to be made to this information. For example, the order status does not change, the inventory snapshot does not change, and the marketing promotion details do not change. This attribute of the data warehouse has many very important implications for the kind of data that is brought to the data warehouse and the timing of the data transfer.

Let us further review what it means for the data to be non-volatile. In an operational system the data entities go through many attribute changes. For example, an order may go through many statuses before it is completed, or a product moving through the assembly line has many processes applied to it. Generally speaking, the data from an operational system is triggered to go to the data warehouse when most of the activity on these business entity data has been completed. This may mean completion of an order or final assembly of an accepted product. Once an order is completed and shipped, it is unlikely to go back to backorder status. Or, once a product is built and accepted, it is unlikely to go back to the first assembly station.

Another important example is constantly changing data that is transferred to the data warehouse one snapshot at a time. The inventory module in an operational system may change with nearly every transaction; it is impossible to carry all of these changes to the data warehouse. You may determine that a snapshot of inventory carried once every week to the data warehouse is adequate for all analysis. Such snapshot data naturally is non-volatile.

It is important to realize that once data is brought to the data warehouse, it should be modified only on rare occasions. It is very difficult, if not impossible, to maintain dynamic data in the data warehouse. Many data warehousing projects have failed miserably when they attempted to synchronize volatile data between the operational and data warehousing systems.
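The weekly inventory snapshot described above can be sketched as an insert-only load: once a snapshot row is in the warehouse it is never updated. The table layout, keys and values below are assumptions made only for this illustration.

```python
# Sketch: an insert-only load of weekly inventory snapshots.
# Once a (week, product) snapshot row is in the warehouse it is never updated.
warehouse_inventory = {}          # (week, product) -> quantity on hand

def load_snapshot(week, snapshot_rows):
    for product, quantity in snapshot_rows.items():
        key = (week, product)
        if key in warehouse_inventory:
            # Non-volatile: refuse to modify data already in the warehouse.
            raise ValueError(f"snapshot for {key} already loaded")
        warehouse_inventory[key] = quantity

load_snapshot("2004-W01", {"P1": 500, "P2": 120})
load_snapshot("2004-W02", {"P1": 430, "P2": 150})
print(warehouse_inventory)
```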
Data Saved for Longer Periods than in Transaction Systems

Logical Transformation of Operational Data

Data Warehouse Model Aligns with the Business Structure
The entities defined and maintained in the data warehouse parallel the actual business entities, such as customers, products, orders, and distributors. Different parts of an organization may have a very narrow view of a business entity such as a customer. For example, a loan service group in a bank may only know about a customer in the context of one or more loans outstanding. Another group in the same bank may know about the same customer in the context of a deposit account. The data warehouse view of the customer would transcend the view from any particular part of the business. A customer in the data warehouse would represent a bank customer that has any kind of business with the bank.

The data warehouse would most likely build the attributes of a business entity by collecting data from multiple source applications. Consider, for example, the demographic data associated with a bank customer. The retail operational system may provide the most fertile data for business analysis.

Figure 5: Data warehouse entities align with the business structure (the data warehouse model has business entities such as customer profile, product, product price, product inventory, and marketing, drawn from operational data such as product price changes)

Figure 5 illustrates the alignment of data warehouse entities with the business structure. The data warehouse model breaks away from the limitations of the source application data models and builds a flexible model that parallels the business structure. This extensible data model is easy to understand by the business analysts as well as the managers.
Transformation of the Operational State Information
It is essential to understand the implications of not being able to maintain the state information of the operational system when the data is moved to the data warehouse. Many of the attributes of entities in the operational system are very dynamic and constantly modified. Many of these dynamic operational system attributes are not carried over to the data warehouse; others are static by the time they are moved to the data warehouse. A data warehouse generally does not contain information about entities that are dynamic and constantly going through state changes.

To understand what it means to lose the operational state information, let us consider the example of an order fulfillment system that tracks the inventory to fill orders. First let us look at the order entity in this operational system. An order may go through many different statuses or states before it is fulfilled or goes to the "closed" status. Other order statuses may indicate that the order is ready to be filled, is being filled, back ordered, ready to be shipped, etc. This order entity may go through many states that capture the status of the order and the business processes that have been applied to it. It is nearly impossible to carry forward all of the attributes associated with these order states to the data warehousing system. The data warehousing system is most likely to have just one final snapshot of this order. Or, as the order is ready to be moved into the data warehouse, the information may be gathered from multiple operational entities such as order and shipping to build the final data warehouse order entity.

Now let us consider the more complicated example of inventory data within this system. The inventory may change with every single transaction. The quantity of a product in the inventory may be reduced by an order fulfillment transaction or […].

Figure 6: Transformation of the operational state information (operational state information is not carried to the data warehouse; data is transferred to the data warehouse after all state changes, or data is transferred with periodic snapshots; panel labels: inventory up/down by week)

Figure 6 illustrates how most of the operational state information cannot be carried over to the data warehouse system.

De-normalization of Data
Before we consider data model de-normalization in the context of data warehousing, let us quickly review relational database concepts and the normalization process. E. F. Codd developed relational database theory in the late 1960s while he was a researcher at IBM. Many prominent researchers have made significant contributions to this model since its introduction. Today, most of the popular database platforms follow this model closely. A relational database model is a collection of two-dimensional tables consisting of rows and columns. In relational modeling terminology, the tables, rows, and columns are respectively called relations, tuples, and attributes. The name of the relational database model is derived from the term relation for a table. The model further identifies unique keys for all tables and describes the relationships between tables.

Normalization is a relational database modeling process in which the relations or tables are progressively decomposed into smaller relations, to a point where all attributes in a relation are very tightly coupled with the primary key of the relation. Most data modelers try to achieve the "Third Normal Form" with all of the relations before they de-normalize for performance or other reasons. The three levels of normalization are briefly described below:
• First Normal Form: A relation is said to be in First Normal Form if it describes a single entity and it contains no arrays or repeating attributes. For example, an order table or relation with multiple line items would not be in First Normal Form, because it would have repeating sets of attributes for each line item. Relational theory would call for separate tables for order and line items.
• Second Normal Form: A relation is said to be in Second Normal Form if, in addition to the First Normal Form properties, all attributes are fully dependent on the primary key of the relation.
• Third Normal Form: A relation is in Third Normal Form if, in addition to Second Normal Form, all non-key attributes are completely independent of each other.

Data models are generally de-normalized for two reasons, namely, performance and simplicity. Data normalization in relational databases provides considerable flexibility at the cost of performance. This performance cost is sharply increased in a data warehousing system because the amount of data involved may be much larger. A three-way join with the relatively small tables of an operational system may be acceptable in terms of performance cost, but the join may take an unacceptably long time with the large tables in the data warehouse system.

Consider, for example, product list price changes. These price changes may be carried to the data warehouse with a periodic snapshot of the product price table. In a data warehousing system you would carry, with each order, the list price of the product when the order is placed, regardless of the selling price for this order. The list price of the product may change many times in one year, and your product price database snapshot may even manage to capture all these prices. But it is nearly impossible to determine the historical list price of the product at the time each order was generated if it is not carried to the data warehouse with the order. Relational database theory makes it easy to maintain dynamic relationships between business entities, whereas a data warehouse system captures relationships between business entities at a given time.

Figure 7: Logical transformation of application data (order processing, customer, and product data)

The logical transformation concepts of source application data described here require considerable effort, and they are a very important early investment towards the development of a successful data warehouse. Figure 7 highlights the logical transformation concepts discussed in this section.
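The First Normal Form discussion above uses an order with repeating line items as its example; the sketch below shows that decomposition in miniature, splitting a repeating-group order record into an order relation and a line-item relation keyed by the order number. All names and values are hypothetical.

```python
# Sketch: moving an order with repeating line items into First Normal Form
# by decomposing it into separate ORDER and ORDER_LINE_ITEM relations.
unnormalized_order = {
    "order_id": 1001,
    "customer": "Acme Corp",
    "line_items": [                      # repeating group -> violates 1NF
        {"product": "P1", "qty": 3},
        {"product": "P2", "qty": 1},
    ],
}

orders = [{"order_id": unnormalized_order["order_id"],
           "customer": unnormalized_order["customer"]}]

order_line_items = [
    {"order_id": unnormalized_order["order_id"],
     "line_no": i + 1,                   # (order_id, line_no) is the key
     "product": item["product"],
     "qty": item["qty"]}
    for i, item in enumerate(unnormalized_order["line_items"])
]

print(orders)
print(order_line_items)
```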
LESSON 7
BUILDING DATA WAREHOUSING, IMPORTANT CONSIDERATIONS
• What is the loading time (including cleaning, formatting, …)?
… descriptions, warehouse operations and maintenance, and access …
As a result of interviewing marketing users, finance users, sales force users, operational users, first- and second-level management, and senior management, a picture emerges of what is keeping these people awake at night. You can list and prioritize the primary business issues facing the enterprise. At the same time, you should conduct a set of interviews with the legacy systems' DBAs, who will reveal which data sources are clean, which contain valid and consistent data, and which will remain supported over the next few years.

Table 7.1: Nine-Step Method in the Design of a Data Warehouse
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing precalculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and the query modes

Preparing for the design with a proper set of interviews is crucial. Interviewing is also one of the hardest things to teach people. I find it helpful to reduce the interviewing process to a tactic and an objective. Crudely put, the tactic is to make the end users talk about what they do, and the objective is to gain insights that feed the nine design decisions. The tricky part is that the interviewer can't pose the design questions directly to the end users. End users talk about what is important in their business lives. End users are intimidated by system design questions, and they are quite right when they insist that system design is IT's responsibility, not theirs. Thus, the challenge of the data mart designer is to meet the users far more than half way.

In any event, armed with both the top-down view (what keeps management awake) and the bottom-up view (which data sources are available), the data warehouse designer may follow these steps:

Step 1: Choosing the subject matter of a particular data mart. The first data mart you build should be the one with the most bang for the buck. It should simultaneously answer the most important business questions and be the most accessible in terms of data extraction. According to Kimball, a great place to start in most enterprises is to build a data mart that consists of customer invoices or monthly statements. This data source is probably fairly accessible and of fairly high quality. One of Kimball's laws is that the best data source in any enterprise is the record of "how much money they owe us." Unless costs and profitability are easily available before the data mart is even designed, it's best to avoid adding these items to this first data mart. Nothing drags down a data mart implementation faster than a heroic or impossible mission to provide activity-based costing as part of the first deliverable.

Step 2: Deciding exactly what a fact table record represents. This step, according to R. Kimball, seems like a technical detail at this early point, but it is actually the secret to making progress on the design. The fact table is the large central table in the dimensional design; it has a multipart key, and each component of the multipart key is a foreign key to an individual dimension table. In the example of customer invoices, the "grain" of the fact table is the individual line item on the customer invoice. In other words, a line item on an invoice is a single fact table record, and vice versa. Once the fact table representation is decided, a coherent discussion of what the dimensions of the data mart's fact table are can take place.

Step 3: Identifying and conforming the dimensions. The dimensions are the drivers of the data mart. The dimensions are the platforms for browsing the allowable constraint values and launching these constraints. The dimensions are the source of row headers in the user's final reports; they carry the enterprise's vocabulary to the users. A well-architected set of dimensions makes the data mart understandable and easy to use. A poorly presented or incomplete set of dimensions robs the data mart of its usefulness. Dimensions should be chosen with the long-range data warehouse in mind. This choice presents the primary moment at which the data mart architect must disregard the data mart details and consider the longer-range plans. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. Only in this way can two data marts share one or more dimensions in the same application. When a dimension is used in two data marts, this dimension is said to be conformed. Good examples of dimensions that absolutely must be conformed between data marts are the customer and product dimensions in an enterprise. If these dimensions are allowed to drift out of synchronization between data marts, the overall data warehouse will fail, because the two data marts will not be able to be used together. The requirement to conform dimensions across data marts is very strong. Careful thought must be given to this requirement before the first data mart is implemented. The data mart team must figure out what an enterprise customer ID is and what an enterprise product ID is. If this task is done correctly, successive data marts can be built at different times, on different machines, and by different development teams, and these data marts will merge coherently into an overall data warehouse. In particular, if the dimensions of two data marts are conformed, it is easy to implement drill-across by sending separate queries to the two data marts, and then sort-merging the answer sets on a set of common row headers. The row headers can be made common only if they are drawn from a conformed dimension common to the two data marts.

With these first three steps correctly implemented, designers can attack the last six steps (see Table 7.1). Each step gets easier if the preceding steps have been performed correctly.

Discussions
• Write short notes on:
  • Data cleaning
  • Back flushing
  • Heterogeneous Sources
  • Metadata repository
• Discuss the various steps involved in the acquisition of data for a data warehouse.
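Step 3's drill-across idea, sending separate queries to two data marts and then sort-merging the answer sets on row headers drawn from a conformed dimension, can be sketched as below. The two pre-computed answer sets stand in for the separate queries; the product names are the conformed row headers, and all figures are invented.

```python
# Sketch of drill-across: merge two data marts' answer sets on row headers
# drawn from a conformed product dimension. All figures are invented.
sales_mart = {"P1": 1200, "P2": 800, "P3": 450}   # product -> revenue
shipping_mart = {"P1": 90, "P2": 60, "P4": 30}    # product -> shipments

common_headers = sorted(sales_mart.keys() | shipping_mart.keys())

report = [
    (product, sales_mart.get(product), shipping_mart.get(product))
    for product in common_headers
]

for product, revenue, shipments in report:
    print(product, revenue, shipments)
```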
LESSON 8
BUILDING DATA WAREHOUSING - 2
Structure
• Objective
• Introduction
• Data Warehouse Application
• Approaches Used to Build a Data Warehouse
• Important Considerations
  • Tighter Integration
  • Empowerment
  • Willingness
• Reasons for Building a Data Warehouse

Objective
The aim of this lesson is to study Data warehouse applications and the various approaches that are used to build a Data warehouse.

Introduction
The professional warehouse team deals with issues and develops solutions which will best suit the needs of the analytical user community. A process of negotiation, and sometimes of give and take, is used to address issues that have common ground between the players in the data warehouse delivery team.

Data Warehouse Application
A data warehouse application is different: each transaction deals with large amounts of data, which are aggregate in nature. A data warehouse application answers questions like:
• What is the average deposit by branch?
• Which day of the week is busiest?
• Which customers with high average balances are not currently participating in a checking-plus account?
Because we are dealing with questions, each request is unique. The interface that supports this end user must be flexible by design. You have many different applications accessing the same information, each with a particular strength. A data mart is typically a subset of your warehouse with a specific purpose in mind. For example, you might have a financial mart and a marketing mart, each designed to feed information to a specific part of your corporate business community. A key issue in the industry today is which approach you should take when building a Decision Support System.

Approaches Used to Build a Data Warehouse
We have experience with two approaches to the build process.

Top-Down Approach, meaning that an organization has developed an enterprise data model, collected enterprise-wide business requirements, and decided to build an enterprise data warehouse with subset data marts. In this approach, we need to spend the extra time and build a core data warehouse first, and then use this as the basis to quickly spin off many data marts. The disadvantage is that this approach takes longer to build initially, since time has to be spent analyzing data requirements in the full-blown warehouse and identifying the data elements that will be used in numerous marts down the road. The advantage is that once you go to build the data mart, you already have the warehouse to draw from.

Bottom-Up Approach, implying that the business priorities result in developing individual data marts, which are then integrated into the enterprise data warehouse. In this approach we need to build a workgroup-specific data mart first. This approach gets data into your users' hands quicker, but the work it takes to get the information into the data mart may not be reusable when moving the same data into a warehouse or trying to use similar data in a different data mart. The advantage is that you gain speed, but not portability.

Which approach is correct? We don't care; in fact, we'd like to coin the term "ware mart". The answer to which approach is correct depends on a number of vectors. You can take the approach that gets information into your users' hands quickest, rather than arguing over the marts versus evolving into a warehouse. Results are what count, not arguing over the data mart versus the warehouse approach.

Are they Different?
Well, you can give many reasons why your operational system and your data warehouse are not the same. The data needed to support operational needs is different from the data needed to support analytical processing. In fact, the data are physically stored quite differently. An operational system is optimized for transactional updates, while a data warehouse system is optimized for large queries dealing with large data sets. These differences become apparent when you begin to monitor central processing unit (CPU) usage on a computer that contains a data warehouse versus CPU usage on a system that contains an operational system.

Important Considerations
• Tighter Integration
• Empowerment
• Willingness

Tighter Integration
The term back end describes the data repository used to support the data warehouse, coupled with the software that supports the repository; for example, Oracle7 Cooperative Server. Front end describes the tools used by the warehouse end users to support their decision-making activities. With classic operational systems, sometimes ad-hoc dialog exists between the front end and back end to support specialties; with the data warehouse team, this dialog must be ongoing as the warehouse …
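A question such as "What is the average deposit by branch?" from the list above is an aggregate over many rows rather than a lookup of one account, which is what distinguishes warehouse queries from operational ones. The sketch below computes that average over a handful of made-up account rows.

```python
# Sketch: the kind of aggregate question a warehouse application answers,
# e.g. average deposit by branch. Sample rows are made up.
from collections import defaultdict

accounts = [
    {"branch": "North", "deposit": 2500.0},
    {"branch": "North", "deposit": 4100.0},
    {"branch": "South", "deposit": 1800.0},
    {"branch": "South", "deposit": 2200.0},
]

totals, counts = defaultdict(float), defaultdict(int)
for row in accounts:
    totals[row["branch"]] += row["deposit"]
    counts[row["branch"]] += 1

for branch in totals:
    print(branch, totals[branch] / counts[branch])
```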
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
LESSON 9
BUSINESS CONSIDERATIONS: RETURN ON INVESTMENT DESIGN CONSIDERATIONS
…components of the enterprise, interacting with each other, and using a common enterprise data model. As defined earlier, the individual warehouses are known as data marts. Organizations embarking on data warehousing development can choose one of the two approaches:
• The top-down approach, meaning that an organization has developed an enterprise data model, collected enterprise-wide business requirements, and decided to build an enterprise data warehouse with subset data marts.
• The bottom-up approach, implying that the business priorities resulted in developing individual data marts, which are then integrated into the enterprise data warehouse.
The bottom-up approach is probably more realistic, but the complexity of the integration may become a serious obstacle, and the warehouse designers should carefully analyze each data mart for integration affinity.

Organizational Issues
Most IS organizations have considerable expertise in developing operational systems. However, the requirements and environments associated with the informational applications of a data warehouse are different. Therefore, an organization will need to employ different development practices than the ones it uses for operational applications.
The IS department will need to bring together data that cuts across a company's operational systems as well as data from outside the company. But users will also need to be involved with a data warehouse implementation, since they are closest to the data. In many ways, a data warehouse implementation is not truly a technological issue; rather, it should be more concerned with identifying and establishing information requirements, the data sources to fulfill these requirements, and timeliness.

Design Considerations
To be successful, a data warehouse designer must adopt a holistic approach: consider all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements. Failing to do so may easily result in a data warehouse design that is skewed toward a particular business requirement, a particular data source, or a selected access tool.
In general, a data warehouse's design point is to consolidate data from multiple, often heterogeneous, sources into a query database. This is also one of the reasons why a data warehouse is rather difficult to build. The main factors include:
• Heterogeneity of data sources, which affects data conversion, quality, and timeliness.
• Use of historical data, which implies that data may be "old".
• Tendency of databases to grow very large.
Another important point concerns experience and accepted practices. Basically, the reality is that data warehouse design is different from traditional OLTP design. Indeed, the data warehouse is business-driven (not IS-driven, as in OLTP), requires continuous interactions with end users, and is never finished, since both requirements and data sources change. Understanding these points allows developers to avoid a number of pitfalls relevant to data warehouse development, and justifies a new approach to data warehouse design: a business-driven, continuous, iterative warehouse engineering approach. In addition to these general considerations, there are several specific points relevant to the data warehouse design.

Data content
One common misconception about data warehouses is that they should not contain as much detail-level data as the operational systems used to source this data. In reality, however, while the data in the warehouse is formatted differently from the operational data, it may be just as detailed. Typically, a data warehouse may contain detailed data, but the data is cleaned up and transformed to fit the warehouse model, and certain transactional attributes of the data are filtered out. These attributes are mostly the ones used for the internal transaction system logic, and they are not meaningful in the context of analysis and decision-making.
The content and structure of the data warehouse are reflected in its data model. The data model is the template that describes how information will be organized within the integrated warehouse framework. It identifies the major subjects and relationships of the model, including keys, attributes, and attribute groupings. In addition, a designer should always remember that decision-support queries, because of their broad scope and analytical intensity, require data models to be optimized to improve query performance. In addition to its effect on query performance, the data model affects data storage requirements and data loading performance.
Additionally, the data model for the data warehouse may be (and quite often is) different from the data models for data marts. The data marts, discussed in the previous chapter, are sourced from the data warehouse, and may contain highly aggregated and summarized data in the form of a specialized denormalized relational schema (star schema) or as a multidimensional data cube. The key point is, however, that in a dependent data mart environment, the data mart data is cleaned up, is transformed, and is consistent with the data warehouse and other data marts sourced from the same warehouse.

Metadata
As already discussed, metadata defines the contents and location of data (the data model) in the warehouse, relationships between the operational databases and the data warehouse, and the business views of the warehouse data that are accessible by end-user tools. Metadata is searched by users to find data definitions or subject areas. In other words, metadata provides decision-support oriented pointers to warehouse data, and thus provides a logical link between warehouse data and the decision support application. A data warehouse design should ensure that there is a mechanism that populates and maintains the metadata repository, and that all access paths to the data warehouse have metadata as an entry point. To put it another way, the warehouse design should prevent any direct access to the warehouse data (especially updates) if it does not use metadata definitions to gain that access.
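As a concrete illustration of the specialized denormalized star schema mentioned above, the following sketch builds a tiny dependent data mart in SQLite. All table names, columns and rows are hypothetical and chosen only to show the shape of the design.

```python
import sqlite3

# A minimal star schema: one fact table surrounded by denormalized dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
conn.executemany("INSERT INTO dim_store VALUES (?,?,?)",
                 [(1, "Pune", "West"), (2, "Delhi", "North")])
conn.executemany("INSERT INTO dim_date VALUES (?,?,?,?)",
                 [(20240101, "2024-01-01", "2024-01", 2024)])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)",
                 [(1, 1, 20240101, 10, 500.0), (2, 2, 20240101, 4, 360.0)])

# A typical decision-support query: revenue by region and category.
for row in conn.execute("""
    SELECT s.region, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_store s   ON f.store_key = s.store_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY s.region, p.category
"""):
    print(row)
conn.close()
```

The fact table carries the additive measures, while the dimension tables carry the descriptive attributes that decision-support queries group and filter by.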
LESSON 10
TECHNICAL CONSIDERATION, IMPLEMENTATION CONSIDERATION
LESSON 11
BENEFITS OF DATA WAREHOUSING
…purchasing and inventory patterns, and can indicate otherwise unseen credit exposure and opportunities for cost savings.

Intangible Benefits
In addition to the tangible benefits outlined above, a data warehouse provides a number of intangible benefits. Although they are more difficult to quantify, intangible benefits should also be considered when planning for the data warehouse. Examples of intangible benefits are:
1. Improved productivity, by keeping all required data in a single location and eliminating the rekeying of data.
2. Reduced redundant processing, support, and software to support overlapping decision support applications.
3. Enhanced customer relations through improved knowledge of individual requirements and trends, through customization, improved communications, and tailored product offerings.
4. Enabling business process reengineering: data warehousing can provide useful insights into the work processes themselves, resulting in breakthrough ideas for the reengineering of those processes.

Problems with Data Warehousing
One of the problems with data mining software has been the rush of companies to jump on the bandwagon, as these companies have slapped 'data warehouse' labels on traditional transaction-processing products, and co-opted the lexicon of the industry in order to be considered players in this fast-growing category.
- Chris Erickson, president and CEO of Red Brick (HPCwire, Oct. 13, 1995)
Red Brick Systems have established criteria for a relational database management system (RDBMS) suitable for data warehousing, and documented 10 specialized requirements for an RDBMS to qualify as a relational data warehouse server; these criteria are listed in the next section.
According to Red Brick, the requirements for data warehouse RDBMSs begin with the loading and preparation of data for query and analysis. If a product fails to meet the criteria at this stage, the rest of the system will be inaccurate, unreliable and unavailable.

Criteria for a Data Warehouse
The criteria for data warehouse RDBMSs are as follows:
• Load Performance - Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour, and must not artificially constrain the volume of data required by the business.
• Load Processing - Many steps must be taken to load new or updated data into the data warehouse, including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. These steps must be executed as a single, seamless unit of work.
• Data Quality Management - The shift to fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size. While loading and preparation are necessary steps, they are not sufficient. Query throughput is the measure of success for a data warehouse application. As more questions are answered, analysts are catalyzed to ask more creative and insightful questions.
• Query Performance - Fact-based management and ad-hoc analysis must not be slowed or inhibited by the performance of the data warehouse RDBMS; large, complex queries for key business operations must complete in seconds, not days.
• Terabyte Scalability - Data warehouse sizes are growing at astonishing rates. Today these range from a few to hundreds of gigabytes, and terabyte-sized data warehouses are a near-term reality. The RDBMS must not have any architectural limitations. It must support modular and parallel management. It must support continued availability in the event of a point failure, and must provide a fundamentally different mechanism for recovery. It must support near-line mass storage devices such as optical disk and Hierarchical Storage Management devices. Lastly, query performance must not be dependent on the size of the database, but rather on the complexity of the query.
• Mass User Scalability - Access to warehouse data must no longer be limited to the elite few. The RDBMS server must support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
• Networked Data Warehouse - Data warehouses rarely exist in isolation. Multiple data warehouse systems cooperate in a larger network of data warehouses. The server must include tools that coordinate the movement of subsets of data between warehouses. Users must be able to look at and work with multiple warehouses from a single client workstation. Warehouse managers have to manage and administer a network of warehouses from a single physical location.
• Warehouse Administration - The very large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility. The RDBMS must provide controls for implementing resource limits, chargeback accounting to allocate costs back to users, and query prioritization to address the needs of different user classes and activities. The RDBMS must also provide for workload tracking and tuning so system resources may be optimized for maximum performance and throughput. "The most visible and measurable value of implementing a data warehouse is evidenced in the uninhibited, creative access to data it provides the end user."
• Integrated Dimensional Analysis - The power of multidimensional views is widely accepted, and dimensional support must be inherent in the warehouse RDBMS to provide the highest performance for relational OLAP tools. The RDBMS must support fast, easy creation of precomputed summaries common in large data warehouses. It also should provide the maintenance tools to automate the creation of these precomputed aggregates. Dynamic calculation of aggregates should be consistent with the …
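The idea behind the precomputed summaries in the last criterion can be sketched in a few lines of pandas. The detail rows and the summary grain (region by month) below are made up purely for illustration.

```python
import pandas as pd

# Detail-level fact rows (invented), from which a precomputed summary is built.
detail = pd.DataFrame({
    "region":  ["West", "West", "North", "North", "North"],
    "month":   ["2024-01", "2024-02", "2024-01", "2024-01", "2024-02"],
    "revenue": [500.0, 650.0, 360.0, 220.0, 410.0],
})

# Precomputed aggregate: the kind of summary a warehouse (or its tools)
# maintains so that dimensional queries do not rescan the detail data.
summary = (detail.groupby(["region", "month"], as_index=False)["revenue"]
                 .sum()
                 .rename(columns={"revenue": "total_revenue"}))
print(summary)

# A dimensional query now answers from the summary in one lookup.
print(summary[(summary["region"] == "North") & (summary["month"] == "2024-01")])
```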
Notes
CHAPTER 3: MANAGING AND IMPLEMENTING A DATA WAREHOUSE PROJECT
LESSON 12
PROJECT MANAGEMENT PROCESS, SCOPE STATEMENT
Structure
• Objective
• Introduction
• Project Management Process
• Scope Statement
• Project planning
• Project scheduling
• Software Project Planning
• Decision Making

Objective
The main objective of this lesson is to introduce you to various topics related to the project management process. It also covers the need for a scope statement, project planning, project scheduling, software project planning and decision-making.

Introduction
How can we talk about project management when we haven't yet defined the term "project"? For this definition, we go to the source: the Project Management Institute. The Project Management Institute is a nonprofit professional organization dedicated to advancing the state of the art in the management of projects. Membership is open to anyone actively engaged or interested in the application, practice, teaching, and researching of project management principles and techniques.
According to the book A Guide to the Project Management Body of Knowledge, published by the Project Management Institute standards committee, a project is a temporary endeavor undertaken to create a unique product or service. One key point here is temporary endeavor. Any project must have a defined start and end. If you are unclear about what must be done to complete the project, don't start it; this is a sure path to disaster. In addition, if at any point in the project it becomes clear that the end product of the project cannot be delivered, then the project should be cancelled. This also applies to a data warehouse project. Another key point is to create a unique product or service. You must have a clear idea of what you are building and why it is unique. It is not necessary for you to understand why your product or service is unique from day one; through the process of developing the project plan, the uniqueness of the project emerges. Only with a defined end do you have a project that can succeed.

Project Management Process
The Project Management Institute is a nonprofit professional organization dedicated to advancing the state of the art in the management of projects. A project is a temporary endeavor undertaken to create a unique product or service. Every project must have a defined start and end. If you are unclear about what must be done to complete the project, don't start it. If at any point in the project it becomes clear that the end product of the project cannot be delivered, then the project should be cancelled. This also applies to a data warehouse project. One must have a clear idea of what you are building and why it is unique. It is not necessary to understand why your product or service is unique from day one; through the process of developing the project plan, the uniqueness of the project emerges. Only with a defined end do you have a project that can succeed.
In addressing the unique aspect of providing business users with timely access to data amidst constantly changing business conditions, whether you are embarking on a customer relationship management initiative, a balanced scorecard implementation, a risk management system or other decision support applications, there is a large body of knowledge about what factors contributed the most to the failures of these types of initiative. We, as an industry, need to leverage this knowledge and learn these lessons so we don't repeat them in our own projects.
A data warehouse is often the foundation for many decision support initiatives. Many studies have shown that a significant reason for the failure of data warehousing and decision support projects is not failure of the technology but rather inappropriate project management, including lack of integration, lack of communication and lack of clear linkages to business objectives and to benefits achievement.

The Scope Statement
One of the major proven techniques we can use to help us with this discovery process is called a scope statement. A scope statement is a written document by which you begin to define the job at hand and all the key deliverables. In fact, we feel it is good business practice not to begin work on any project until you have developed a scope statement. These are the major elements in the breakdown of a scope statement:
1. Project Title and Description: every project should have a clear name and a description of what you are trying to accomplish.
2. Project Justification: a clear description of why this project is being done. What is the goal of the project?
3. Project Key Deliverables: a list of key items that must be accomplished so this project can be completed. What must be done for us to consider the project done?
4. Project Objectives: an additional list of success criteria. …
LESSON 13
WORK BREAKDOWN STRUCTURE
Notes
LESSON 14
PROJECT ESTIMATION, ANALYZING PROBABILITY AND RISK
Accurately planning and estimating software projects is an extremely difficult software management function. Few organizations have established formal estimation processes, despite evidence that suggests organizations without formal estimation are four times more likely to experience cancelled or delayed projects.

Project Estimation
In the 1970s, geologists at Shell were excessively confident when they predicted the presence of oil or gas. They would, for example, estimate a 40% chance of finding oil, but when ten such wells were actually drilled, only one or two would produce. This overconfidence cost Shell considerable time and money. Shell embarked on a training programme, which enabled the geologists to be more realistic about the accuracy of their predictions. Now, when Shell geologists predict a 40% chance of finding oil, four out of ten wells are successful.
Software project managers are required to estimate the size of a project. They will usually add a percentage for 'contingency' to allow for their uncertainty. However, if their estimates are overconfident, these 'contingency' amounts may be insufficient, and significant risks may be ignored. Sometimes several such 'contingency' amounts may be multiplied together, but this is a clumsy device, which can lead to absurdly high estimates while still ignoring significant risks. In some cases, higher management will add additional 'contingency' to allow for the fallibility of the project manager. Game playing around 'contingency' is rife.

2. … (Examples: analyze sales, analyze markets, and analyze financial accounts.)
A pilot should be limited to just one business process. If management insists on more than one, the time and effort will be proportionally greater.
3. How many subject areas are expected for the pilot? (Examples: customer, supplier/vendor, store/location, product, organizational unit, demographic area or market segment, general ledger account, and promotion/campaign.)
If possible, a pilot should be limited to just one subject area. If management insists on more than one, the time and effort will be proportionally greater.
4. Will a high-level enterprise model be developed during the pilot?
Ideally, an enterprise model should have been developed prior to the start of the DW pilot; if the model has not been finished and the pilot requires its completion, the schedule for the pilot must be adjusted.
5. How many attributes (fields, columns) will be selected for the pilot?
The more attributes to research, understand, clean, integrate and document, the longer the pilot and the greater the effort.
6. Are the source files well modeled and well documented?
Documentation is critical to the success of the pilot. Extra time and effort must be included if the source files and databases have not been well documented.
7. Will there be any external data (Lundberg, A. C. Nielsen, Dun and Bradstreet) in the pilot system? Is the external system well documented?
External data is often not well documented and usually does not follow the organization's standards. Integrating external data is often difficult and time consuming.
8. Is the external data modeled? (Modeled, up-to-date, accurate, actively being used and comprehensive; a high-level, accurate and timely model exists; an old, out-of-date model exists; no model exists.)
Without a model, the effort to understand the source external data is significantly greater. It is unlikely that the external data has been modeled, but external data vendors should find the sale of their data easier when they have models that effectively document their products.
9. How much cleaning will the source data require? (Data need no cleaning; minor-complexity transformations; medium/moderate complexity; very complicated transformations required.)
Data cleansing, both with and without software tools to aid the process, is tedious and time consuming. Organizations usually overestimate the quality of their data and always underestimate the effort to clean the data.
10. How much integration will be required? (None required; moderate integration required; serious and comprehensive integration required.)
An example of integration is pulling customer data together from multiple internal files as well as from external data. The absence of consistent customer identifiers can cause significant work to accurately integrate customer data.
11. What is the estimated size of the pilot database?
Data Warehouse Axiom #A - Large DW databases (100 GB to 500 GB) will always have performance problems. Resolving those problems (living within an update/refresh/backup window, providing acceptable query performance) will always take significant time and effort.
Data Warehouse Axiom #B - A pilot greater than 500 GB is not a pilot; it's a disaster waiting to happen.
12. What is the service level requirement? (Five days/week, eight hours/day; six days/week, eighteen hours/day; seven days/week, 24 hours/day.)
It is always easier to establish an operational infrastructure, as well as to develop the update/refresh/backup scenarios, for an 8x5 schedule than for a 24x7 schedule. It is also easier for operational people and DBAs to maintain a more limited scheduled up time.
13. How frequently will the data be loaded/updated/refreshed? (Monthly, weekly, daily, hourly.)
The more frequent the load/update/refresh, the greater the performance impact. If real time is ever being considered, the requirement is for an operational system, not a decision support system.
14. Will it be necessary to synchronize the operational system with the data warehouse?
This is always difficult and will require initial planning, generation procedures and ongoing effort from operations.
15. Will a new hardware platform be required? If so, will it be different from the existing platform?
The installation of new hardware always requires planning and execution effort. If it is to be a new type of platform, operations training and familiarization take time and effort, and a new operating system requires work by the technical support staff. There are new procedures to follow, new utilities to learn, and the shakedown and testing efforts for anything new are always time consuming and riddled with unexpected problems.
16. Will new desktops be required?
New desktops require installation, testing and possibly training of the users of the desktops.
17. Will a new network be required?
If a robust network (one that can handle the additional load from the data warehouse with acceptable performance) is already in place, shaken out and tested, a significant amount of work and risk will be eliminated.
18. Will network people be available?
If network people are available, it will eliminate the need to recruit or train.
19. How many query tools will be chosen?
Each new query tool takes time to train those responsible for support and time to train the end users.
20. Is user management sold on and committed to this project, and what is the organizational level at which the commitment was made?
If management is not sold on the project, the risk is significantly greater. For the project manager, lack of management commitment means far more difficulty in getting resources (money, involvement) and getting timely answers.
21. Where does the DW project manager report in the organization?
The higher up the project manager reports, the greater the management commitment, the more visibility, and the more the indication that the project is important to the organization.
22. Will the appropriate users be committed and available for the project?
If people important to the project are not committed and available, it will take far longer for the project to complete. User involvement is essential to the success of any DW project.
23. Will knowledgeable application developers (programmers) be available for the migration process?
These programmers need to be available when they are needed; unavailability means the project will be delayed.
24. How many trained and experienced programmer/analysts will be available for system testing?
If these programmer/analysts are not available, they will have to be recruited and/or trained.
25. How many trained and experienced systems analysts will …
LESSON 15
MANAGING RISK: INTERNAL AND EXTERNAL, CRITICAL PATH ANALYSIS
Structure
• Objective
• Introduction
• Risk Analysis
• Risk Management

Objective
When you have completed this lesson you should be able to:
• Understand the importance of Project Risk Management.
• Understand what is Risk.
• Identify various types of Risks.
• Describe the Risk Management Process.
• Discuss Critical Path Analysis.

Introduction
Project Risk Management is the art and science of identifying, assigning and responding to risk throughout the life of a project and in the best interests of meeting project objectives. Risk management can often result in significant improvements in the ultimate success of projects. Risk management can have a positive impact on selecting projects, determining the scope of projects, and developing realistic schedules and cost estimates. It helps stakeholders understand the nature of the project, involves team members in defining strengths and weaknesses, and helps to integrate the other project management areas.

Risk Analysis
At a given stage in a project, there is a given level of knowledge and uncertainty about the outcome and cost of the project. The probable cost can typically be expressed as a skewed bell curve, since although there is a minimum cost, there is no maximum cost.
There are several points of particular interest on this curve:
Minimum: the lowest possible cost.
Mode: the most likely cost; this is the highest point on the curve.
Median: the midway cost of n projects (in other words, n/2 will cost less than the median, and n/2 will cost more).
Average: the expected cost of n similar projects, divided by n.
Reasonable maximum: the highest possible cost, to a 95% certainty.
Absolute maximum: the highest possible cost, to a 100% certainty.
On this curve, the following sequence holds:
minimum < mode < median < average < reasonable maximum < absolute maximum
Note the following points:
• The absolute maximum cost may be infinite, although there is an infinitesimal tail. For practical purposes, we can take the reasonable maximum cost. However, the reasonable maximum may be two or three times as great as the average cost.
• Most estimation algorithms aim to calculate the mode. This means that the chance that the estimates will be achieved on a single project is much less than 50%, and the chance that the total estimates will be achieved on a series of projects is even lower. (In other words, you do not gain as much on the roundabouts as you lose on the swings.)
• A tall thin curve represents greater certainty, and a low broad curve represents greater uncertainty.
• Risk itself has a negative value. This is why investors demand a higher return on risky ventures than on 'gilt-edged' securities. Financial number-crunchers use a measure of risk called 'beta'.
• Therefore, any information that reduces risk has a positive value.
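A small simulation makes the ordering mode < median < average < reasonable maximum tangible. The sketch below assumes a lognormal cost model, which is one convenient right-skewed distribution, and the parameters are arbitrary; it is an illustration, not an estimation method prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed project-cost model (arbitrary units).
mu, sigma = np.log(100.0), 0.5
costs = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

mode = np.exp(mu - sigma**2)                 # analytic mode of a lognormal
median = np.median(costs)
average = costs.mean()
reasonable_max = np.percentile(costs, 95)    # the "95% certainty" point
observed_max = costs.max()                   # unbounded in theory

print("mode            :", round(mode, 1))
print("median          :", round(median, 1))
print("average         :", round(average, 1))
print("reasonable max  :", round(reasonable_max, 1))
print("observed max    :", round(observed_max, 1))
# Expected ordering: mode < median < average < reasonable max < observed max
```

Running it shows why an estimate aimed at the mode is achieved less than half the time, and why a contingency sized off the average still falls well short of the reasonable maximum.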
This analysis yields the following management points:
• A project has a series of decision points, at each of which the sponsors could choose to cancel.
• Thus at decision point k, the sponsors choose between continuation, which is then expected to cost Rk, and cancellation, which will cost Ck.
• The game is to reduce risk as quickly as possible, and to place decisions at the optimal points.
• This means we have to understand what specific information is relevant to the reduction of risk, plan the project so that this information emerges as early as possible, and place the decision points immediately after this information is available.

Risk Management
Risk management is concerned with identifying risks and drawing up plans to minimize their effect on a project. A risk is a probability that some adverse circumstance will occur:
• Project risks affect schedule or resources
• Product risks affect the quality or performance of the software being developed
• Business risks affect the organization developing or procuring the software
The risk management process has four stages:
• Risk identification: identify project, product and business risks
• Risk analysis: assess the likelihood and consequences of these risks
• Risk planning: draw up plans to avoid or minimize the effects of the risk
• Risk monitoring: monitor the risks throughout the project
Some risk types that are applicable to software projects are:
• Technology risks
• People risks
• Organizational risks
• Requirements risks
• Estimation risks

Risk Analysis
• Assess the probability and seriousness of each risk
• Probability may be very low, low, moderate, high or very high
• Risk effects might be catastrophic, serious, tolerable or insignificant
Risk planning steps are:
1. Consider each risk and develop a strategy to manage that risk
2. Avoidance strategies: the probability that the risk will arise is reduced
3. Minimization strategies: the impact of the risk on the project or product will be reduced
4. Contingency plans: if the risk arises, contingency plans are plans to deal with that risk
To mitigate the risks in a software development project, a management strategy for every identified risk must be developed.
Risk monitoring steps are:
• Assess each identified risk regularly to decide whether or not it is becoming less or more probable
• Also assess whether the effects of the risk have changed
• Each key risk should be discussed at management progress meetings

Managing Risks: Internal and External
Let us take a closer look at risk. When you do get caught, it is typically due to one of three situations:
1. Assumptions - you get caught by unvoiced assumptions which were never spelled out.
2. Constraints - you get caught by restricting factors, which were not fully understood.
3. Unknowns - items you could never predict, be they acts of God or human errors.
The key to risk management is to do our best to identify the source of all risk and the likelihood of its happening. For example, when we plan a project, we typically do not take work stoppages into account. But if we were working for an airline that was under threat of a major strike, we might re-evaluate the likelihood of losing valuable project time. Calculate the cost to the project if the particular risk happens and make a decision: you can decide either to accept it, find a way to avoid it, or prevent it. Always look for ways around the obstacles.

Internal and External Risks
Duncan Nevison lists the following types of internal risks:
1. Project Characteristics
• Schedule Bumps
• Cost Hiccups
• Technical Surprises
2. Company Politics
• Corporate Strategy Change
• Departmental Politics
3. Project Stakeholders
• Sponsor
• Customer
• Subcontractors
• Project Team
As well as the following external risks:
1. Economy
• Currency Rate Change
• Market Shift
• Competitors' Entry or Exit
• Immediate Competitive Actions
• Supplier Change
2. Environment
…

Self Test
10. The project estimation can get delayed because of the reason:
a. Wrong assumptions
b. Constraints
c. Unknown factors
d. All of the above
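A very small risk register is enough to illustrate the identification, analysis and monitoring steps described above. The risks, probabilities and impact scores below are invented, and exposure = probability x impact is just one common scoring choice, not the only one.

```python
# Minimal risk-register sketch (all entries are hypothetical examples).
risks = [
    {"risk": "Key user unavailable",               "type": "people",     "probability": 0.4, "impact": 3},
    {"risk": "Source data dirtier than expected",  "type": "technology", "probability": 0.7, "impact": 4},
    {"risk": "Strike at the client airline",       "type": "external",   "probability": 0.1, "impact": 5},
]

# Risk analysis: score each risk by a simple exposure measure.
for r in risks:
    r["exposure"] = r["probability"] * r["impact"]

# Risk monitoring: review the largest exposures first at progress meetings,
# and reassess the probabilities as the project moves on.
for r in sorted(risks, key=lambda r: r["exposure"], reverse=True):
    print(f'{r["risk"]:<38} exposure={r["exposure"]:.1f}')
```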
Reference
1. Hughes, Bob, Software project management, 2nd ed. New
Delhi: Tata McGraw- Hill Publishing, 1999.
2. Kelkar, S.A., Software project management: a concise study, New
Delhi: Prentice Hall of India, 2002
3. Meredith, Jack R.; Mantel, Samuel J., Project management: a
managerial approach, New York: John Wiley and Sons, 2002.
4. Royce, Walker, Software project management: a unified
framework, Delhi: Pearson Education Asia, 1998.
5. Young, Trevor L., Successful project management, London:
Kogan Page, 2002.
Notes
CHAPTER 4: DATA MINING
LESSON 16
DATA MINING CONCEPTS
"…knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, of low value, as no direct use can be made of it; it is the hidden information in the data that is useful."
- Clementine User Guide, a data mining toolkit

Basically, data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer which is responsible for finding the patterns, by identifying the underlying rules and features in the data. The idea is that it is possible to strike gold in unexpected places, as the data mining software extracts patterns not previously discernible, or so obvious that no one has noticed them before.
Data mining analysis tends to work from the data up, and the best techniques are those developed with an orientation towards large volumes of data, making use of as much of the collected data as possible to arrive at reliable conclusions and decisions. The analysis process starts with a set of data and uses a methodology to develop an optimal representation of the structure of the data, during which time knowledge is acquired. Once knowledge has been acquired, this can be extended to larger sets of data, working on the assumption that the larger data set has a structure similar to the sample data. Again, this is analogous to a mining operation where large amounts of low-grade materials are sifted through in order to find something of value.
The following diagram summarizes some of the stages/processes identified in data mining and knowledge discovery by Usama Fayyad and Evangelos Simoudis, two of the leading exponents of this area.
[Figure: the stages of data mining and knowledge discovery, from raw data through selection, preprocessing, transformation, data mining, and interpretation/evaluation to knowledge]
The phases depicted start with the raw data and finish with the extracted knowledge, which is acquired as a result of the following stages:
• Selection - selecting or segmenting the data according to some criteria, e.g. all those people who own a car; in this way subsets of the data can be determined.
• Preprocessing - this is the data cleansing stage, where certain information is removed which is deemed unnecessary and may slow down queries; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, as there is a possibility of inconsistent formats because the data is drawn from several sources: e.g. sex may be recorded as f or m and also as 1 or 0.
• Transformation - the data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made useable and navigable.
• Data mining - this stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.
• Interpretation and evaluation - the patterns identified by the system are interpreted into knowledge, which can then be used to support human decision-making, e.g. prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena.

Data Mining Background
Data mining research has drawn on a number of other fields, such as inductive learning, machine learning and statistics.

Inductive Learning
Induction is the inference of information from data, and inductive learning is the model-building process where the environment, i.e. the database, is analyzed with a view to finding patterns. Similar objects are grouped in classes and rules formulated whereby it is possible to predict the class of unseen objects. This process of classification identifies classes such that each class has a unique pattern of values, which forms the class description. The nature of the environment is dynamic, hence the model must be adaptive, i.e. should be able to learn.
Generally, it is only possible to use a small number of properties to characterize objects, so we make abstractions, in that objects which satisfy the same subset of properties are mapped to the same internal representation.
Inductive learning, where the system infers knowledge itself from observing its environment, has two main strategies:
• Supervised learning - this is learning from examples, where a teacher helps the system construct a model by defining classes and supplying examples of each class. The system has to find a description of each class, i.e. the common properties in the examples. Once the description has been formulated, the description and the class form a classification rule, which can be used to predict the class of previously unseen objects. This is similar to discriminant analysis in statistics.
• Unsupervised learning - this is learning from observation and discovery. The data mining system is supplied with objects but no classes are defined, so it has to observe the examples and recognize patterns (i.e. class descriptions) by itself. This results in a set of class descriptions, one for each class discovered in the environment. Again, this is similar to cluster analysis in statistics.
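The five stages listed above can be walked through on a toy table with pandas. Everything in this sketch, the patient records, the f/m versus 1/0 recoding, and the age bands, is an assumption made purely to mirror the examples in the text.

```python
import pandas as pd

# Toy records with the inconsistent sex coding mentioned above (invented data).
raw = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "sex":        ["f", "F", 1, 0, "m", "f"],
    "age":        [24, 31, 45, 52, 38, 29],
    "owns_car":   [True, False, True, True, False, True],
    "outcome":    ["well", "ill", "ill", "ill", "well", "well"],
})

# Selection: segment the data according to some criterion.
selected = raw[raw["owns_car"]]

# Preprocessing: reconcile inconsistent formats (the 1/0-to-F/M mapping is an
# assumption) and drop an attribute irrelevant to this study.
sex_map = {"f": "F", "F": "F", 1: "F", "m": "M", "M": "M", 0: "M"}
cleaned = selected.assign(sex=selected["sex"].map(sex_map)).drop(columns="owns_car")

# Transformation: add an overlay, here a coarse age band.
transformed = cleaned.assign(
    age_band=pd.cut(cleaned["age"], bins=[0, 30, 50, 120],
                    labels=["young", "middle", "senior"]))

# Data mining: look for a simple pattern relating age band to outcome.
pattern = (transformed.groupby("age_band", observed=True)["outcome"]
                      .apply(lambda s: (s == "ill").mean()))

# Interpretation and evaluation: read the pattern back as knowledge.
print(pattern)   # proportion of 'ill' outcomes per age band
```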
Notes
LESSON 17
DATA MINING CONCEPTS-2
…such information. It is generally accepted that if we know the query precisely, we can turn to a query language to formulate the query. But if we have only some vague idea and we do not know the query precisely, then we can resort to data mining techniques.
The evolution of data mining began when business data was first stored in computers, and technologies were generated to allow users to navigate through the data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation, to prospective and proactive information delivery. This is made possible by massive data collection, high-performance computing and data mining algorithms.
We shall study some definitions of the term data mining in the following section.

Data Mining: Definitions
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to analyze important information in the data warehouse. Data mining scours databases for hidden patterns, finding predictive information that experts may miss, as it goes beyond their expectations. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as which clients are most likely to respond to the next promotional mailing. There is an increasing desire to use this new technology in new application domains, and a growing perception that these large passive databases can be made into useful actionable information.
The term 'data mining' refers to the finding of relevant and useful information from databases. Data mining and knowledge discovery in databases is a new interdisciplinary field, merging ideas from statistics, machine learning, databases and parallel computing. Researchers have defined the term 'data mining' in many ways. We discuss a few of these definitions below.
1. Data mining, or knowledge discovery in databases as it is also known, is the non-trivial extraction of implicit, previously unknown and potentially useful information from the data. This encompasses a number of technical approaches, such as clustering, data summarization, classification, finding dependency networks, analyzing changes, and detecting anomalies.
Though the terms data mining and KDD are used above synonymously, there are debates on the difference and similarity between data mining and knowledge discovery. In the present book, we shall be using these two terms synonymously. However, we shall also study the aspects in which these two terms are said to be different.
Data retrieval, in its usual sense in database literature, attempts to retrieve data that is stored explicitly in the database and presents it to the user in a way that the user can understand. It does not attempt to extract implicit information. One may argue that if we store 'date-of-birth' as a field in the database and extract 'age' from it, the information received from the database is not explicitly available. But all of us would agree that the information is not 'non-trivial'. On the other hand, if one views data mining as a sort of non-trivial extraction of implicit information, then can we say that extracting the average age of the employees of a department from the employee database (which stores the date-of-birth of every employee) is a data-mining task? The task is surely 'non-trivial' extraction of implicit information. It is indeed a type of data-mining task, but at a very low level. A higher-level task would, for example, be to find correlations between the average age and average income of individuals in an enterprise.
2. Data mining is the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data, such as the relationship between patient data and their medical diagnoses. This relationship represents valuable knowledge about the database, and the objects in the database, if the database is a faithful mirror of the real world registered by the database.
Consider the employee database, and let us assume that we have some tools available to determine relationships between fields, say the relationship between age and lunch patterns. Assume, for example, that we find that most employees in their thirties like to eat pizzas, burgers or Chinese food during their lunch break, employees in their forties prefer to carry a home-cooked lunch from their homes, and employees in their fifties take fruits and salads during lunch. If our tool finds this pattern from the database which records the lunch activities of all employees for the last few months, then we can term our tool a data-mining tool. The daily lunch activity of all employees, collected over a reasonable period of time, makes the database very vast. Just by examining the database, it is impossible to notice any relationship between age and lunch patterns.
3. Data mining refers to using a variety of techniques to identify nuggets of information or decision-making knowledge in the database, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but it has low value and no direct use can be made of it; it is the hidden information in the data that is useful.
Data mining is a process of finding value from volume. In any enterprise, the amount of transactional data generated during its day-to-day operations is massive in volume. Although these transactions record every instance of any activity, they are of little use in decision-making. Data mining attempts to extract smaller pieces of valuable information from this massive database.
4. Discovering relations that connect variables in a database is the subject of data mining. The data mining system self-learns from the previous history of the investigated system, formulating and testing hypotheses about the rules which the system obeys. When concise and valuable knowledge about the system of interest is discovered, it can and should be interpreted into some decision support system, which helps the manager to make wise and informed business decisions.
Data mining is essentially a system that learns from the existing data. One can think of two disciplines which address such problems: Statistics and Machine Learning. Statistics provides sufficient tools for data analysis, and machine learning deals with different learning methodologies. While statistical methods are a theory-rich-data-poor approach, data mining is a data-rich-theory-poor approach. On the other hand, machine learning deals with …
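The distinction drawn above between a low-level task (average age from date-of-birth) and a higher-level one (a relationship between age and income) can be shown in a few lines of pandas; the employee records below are invented for illustration.

```python
import pandas as pd

# Invented employee records; 'age' is derived from date_of_birth.
employees = pd.DataFrame({
    "emp_id":        [1, 2, 3, 4],
    "department":    ["sales", "sales", "it", "it"],
    "date_of_birth": pd.to_datetime(["1990-05-01", "1985-11-20",
                                     "1978-03-15", "1995-07-30"]),
    "income":        [40_000, 55_000, 72_000, 38_000],
})

today = pd.Timestamp("2024-01-01")
employees["age"] = (today - employees["date_of_birth"]).dt.days // 365

# Low-level, query-like task: average age per department.
print(employees.groupby("department")["age"].mean())

# Higher-level, mining-like task: is age related to income across the firm?
print(employees["age"].corr(employees["income"]))
```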
…collection of data. Data mining helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data or metadata in the data warehouse. Data mining applications should therefore be strongly considered early, during the design of a data warehouse. Also, data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of data mining applications will depend, first, on the construction of a data warehouse.

Machine Learning vs. Data Mining
• Large data sets in data mining
• Efficiency of algorithms is important
• Scalability of algorithms is important
• Real-world data
• Lots of missing values
• Pre-existing data - not user generated
• Data not static - prone to updates
• Efficient methods for data retrieval available for use
• Requires expert user guidance

Difference between Database Management Systems (DBMS), Online Analytical Processing (OLAP) and Data Mining
DBMS
Task: extraction of detailed and summary data
Type of result: information
Method: deduction (ask the question, verify with the data)
Example question: Who purchased mutual funds in the last 3 years?
OLAP
Task: summaries, trends and forecasts
Type of result: analysis
Method: multidimensional data modeling, aggregation, statistics
Example question: What is the average income of mutual fund buyers, by region, by year?
Data Mining
Task: knowledge discovery of hidden patterns and insights
Type of result: insight and prediction
Method: induction (build the model, apply it to new data, get the result)
Example question: Who will buy a mutual fund in the next 6 months, and why?
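The example questions in the comparison above can be paired with code. Below, a pandas group-by answers the OLAP-style question deductively, while a small scikit-learn decision tree answers the data-mining-style question inductively. The customer data, features and model settings are purely illustrative assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up customer records purely for illustration.
df = pd.DataFrame({
    "region":      ["north", "north", "south", "south", "east", "east", "west", "west"],
    "year":        [2022, 2023, 2022, 2023, 2022, 2023, 2022, 2023],
    "income":      [48_000, 52_000, 61_000, 58_000, 39_000, 41_000, 75_000, 72_000],
    "age":         [34, 41, 52, 47, 28, 30, 60, 58],
    "bought_fund": [0, 1, 1, 1, 0, 0, 1, 1],
})

# OLAP-style (deductive): average income of mutual fund buyers by region, by year.
print(df[df["bought_fund"] == 1].groupby(["region", "year"])["income"].mean())

# Data-mining-style (inductive): build a model of who buys, then apply it
# to a new case and inspect the rules it induced ("and why").
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(df[["income", "age"]], df["bought_fund"])
print(export_text(model, feature_names=["income", "age"]))

new_customer = pd.DataFrame({"income": [55_000], "age": [45]})
print(model.predict(new_customer))   # predicted class for an unseen customer
```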
Notes
LESSON 18
ELEMENTS AND USES OF DATA MINING
Noise and missing values
Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes which rely on subjective or measurement judgments can give rise to errors such that some examples may even be mis-classified. Errors in either the values of attributes or class information are known as noise. Obviously, where possible, it is desirable to eliminate noise from the classification information, as this affects the overall accuracy of the generated rules.
Missing data can be treated by discovery systems in a number of ways, such as:
• simply disregard missing values
• omit the corresponding records
• infer missing values from known values
• treat missing data as a special value to be included additionally in the attribute domain
• or average over the missing values using Bayesian techniques.
Noisy data, in the sense of being imprecise, is characteristic of all data collection and typically fits a regular statistical distribution such as the Gaussian, while wrong values are data entry errors. Statistical methods can treat problems of noisy data and separate different types of noise.

Uncertainty
Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.

Size, Updates, and Irrelevant Fields
Databases tend to be large and dynamic, in that their contents are ever-changing as information is added, modified or removed. The problem with this, from the data mining perspective, is how to ensure that the rules are up-to-date and consistent with the most current information. Also, the learning system has to be time-sensitive, as some data values vary over time and the discovery system is affected by the 'timeliness' of the data.
Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery; for example, post codes are fundamental to any study trying to establish a geographical connection to an item of interest, such as the sales of a product.

Potential Applications
Data mining has many and varied fields of application, some of which are listed below.
Retail/Marketing
• Identify buying patterns from customers
• Find associations among customer demographic characteristics
• Predict response to mailing campaigns
• Market basket analysis
Banking
• Detect patterns of fraudulent credit card use
• Identify 'loyal' customers
• Predict customers likely to change their credit card affiliation
• Determine credit card spending by customer groups
• Find hidden correlations between different financial indicators
• Identify stock trading rules from historical market data
Insurance and Health Care
• Claims analysis, i.e. which medical procedures are claimed together
• Predict which customers will buy new policies
• Identify behaviour patterns of risky customers
• Identify fraudulent behaviour
Transportation
• Determine the distribution schedules among outlets
• Analyse loading patterns
Medicine
• Characterise patient behaviour to predict office visits
• Identify successful medical therapies for different illnesses

Data Mining and Data Warehousing
The goal of a data warehouse is to support decision making with data. Data mining can be used in conjunction with a data warehouse to help with certain types of decisions. Data mining can be applied to operational databases with individual transactions. To make data mining more efficient, the data warehouse should have an aggregated or summarized collection of data. Data mining helps in extracting meaningful new patterns that cannot necessarily be found by merely querying or processing data or metadata in the data warehouse. Data mining applications should therefore be strongly considered early, during the design of a data warehouse. Also, data mining tools should be designed to facilitate their use in conjunction with data warehouses. In fact, for very large databases running into terabytes of data, successful use of data mining applications will depend, first, on the construction of a data warehouse.

Data Mining as a Part of the Knowledge Discovery Process
Knowledge Discovery in Databases, frequently abbreviated as KDD, typically encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
As an example, consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may …
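The treatments listed earlier in this lesson under "Noise and missing values" map directly onto a few pandas idioms. The records below are invented, and the choice of imputation value is an assumption made only for illustration.

```python
import pandas as pd

# Toy records with gaps (all values are made up).
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "age":      [34, None, 45, None, 29],
    "segment":  ["retail", "retail", None, "corporate", "retail"],
})

# 1. Simply disregard missing values in a computation.
mean_age_ignoring_gaps = df["age"].mean()          # NaNs are skipped

# 2. Omit the corresponding records entirely.
complete_rows = df.dropna()

# 3. Infer missing values from known values (here: the observed mean).
age_imputed = df["age"].fillna(df["age"].mean())

# 4. Treat missing data as a special value in the attribute domain.
segment_with_unknown = df["segment"].fillna("unknown")

print(mean_age_ignoring_gaps)
print(complete_rows)
print(age_imputed)
print(segment_with_unknown)
```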
Notes
LESSON 19
DATA INFORMATION AND KNOWLEDGE
Structure
• Objective
• Introduction
• Data, Information and Knowledge
• Information
• Knowledge
• Data warehouse
• What can Data Mining Do?
• How Does Data Mining Work?
• Data Mining in a Nutshell
• Differences between Data Mining and Machine Learning

Objective
At the end of this lesson you will be able to
• Understand the meaning and difference of Data, Information and Knowledge
• Study the need of data mining
• Understand the working of data mining
• Study the difference between Data Mining and Machine Learning

Introduction
In the previous lesson you studied various elements and uses of data mining. In this lesson, I will explain the difference and significance of data, information and knowledge. Further, you will also study the need for and working of data mining.

Data, Information and Knowledge
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:
• Operational or transactional data such as sales, cost, inventory, payroll and accounting data.
• Non-operational data such as industry sales, forecast data and macroeconomic data.
• Meta data - data about the data itself, such as logical database design or data dictionary definitions.

Information
The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.
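As a minimal sketch of turning raw data into information of this kind, the snippet below summarizes an invented point-of-sale table; the product names, weekdays and quantities are made up purely for illustration:

    import pandas as pd

    # Hypothetical point-of-sale transactions: raw data.
    sales = pd.DataFrame({
        "product":  ["milk", "bread", "milk", "beer", "bread", "milk"],
        "weekday":  ["Mon", "Mon", "Sat", "Sat", "Sat", "Sun"],
        "quantity": [2, 1, 5, 6, 3, 4],
    })

    # Information: which products are selling and when.
    summary = (sales.groupby(["product", "weekday"])["quantity"]
                    .sum()
                    .sort_values(ascending=False))
    print(summary)

The aggregated summary is "information" in the sense described above: a pattern extracted from the individual transaction records.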
Knowledge
Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data Warehouse
Dramatic advances in data capture, processing power, data transmission and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies, and equally dramatic advances in data analysis software are what support data mining.

What can Data Mining do?
Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning or staff skills, and external factors such as economic indicators, competition and customer demographics. It also enables them to determine the impact on sales, customer satisfaction and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.
With data mining, a retailer could use point of sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.
Wal-Mart is pioneering massive data mining to transform its supplier relationships. Wal-Mart captures point of sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata warehouse. Wal-Mart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, Wal-Mart computers processed over 1 million complex queries.
LESSON 20
DATA MINING MODELS
Structure
• Objective
• Introduction
• Data mining
• Data Mining Models
• Verification Model
• Discovery Model
• Data warehousing

Objective
At the end of this lesson you will be able to
• Review the concept of data mining
• Study various types of data mining models
• Understand the difference between the Verification and Discovery models.

Introduction
In the previous lesson, I explained the difference and significance of data, information and knowledge. You have also studied the need for and working of data mining. In this lesson, I will explain various types of data mining models.

Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to analyze important information in the data warehouse. Data mining scours databases for hidden patterns, finding predictive information that experts may miss, as it goes beyond their expectations. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as which clients are most likely to respond to the next promotional mailing. There is an increasing desire to use this new technology in new application domains, and a growing perception that these large passive databases can be made into useful, actionable information.
The term 'data mining' refers to the finding of relevant and useful information from databases. Data mining and knowledge discovery in databases is a new interdisciplinary field, merging ideas from statistics, machine learning, databases and parallel computing. Researchers have defined the term 'data mining' in many ways.
1. Data mining, or knowledge discovery in databases as it is also known, is the non-trivial extraction of implicit, previously unknown and potentially useful information from the data. This encompasses a number of technical approaches, such as clustering, data summarization, classification, finding dependency networks, analyzing changes, and detecting anomalies.
Though the terms data mining and KDD are used above synonymously, there are debates on the difference and similarity between data mining and knowledge discovery. In the present book, we shall be using these two terms synonymously. However, we shall also study the aspects in which these two terms are said to be different.
Data retrieval, in its usual sense in database literature, attempts to retrieve data that is stored explicitly in the database and presents it to the user in a way that the user can understand. It does not attempt to extract implicit information. One may argue that if we store 'date-of-birth' as a field in the database and extract 'age' from it, the information received from the database is not explicitly available. But all of us would agree that the information is not 'non-trivial'. On the other hand, if one defines data mining as a sort of non-trivial extraction of implicit information, then can we say that extracting the average age of the employees of a department from the employee database (which stores the date-of-birth of every employee) is a data-mining task? The task is surely a 'non-trivial' extraction of implicit information. It is indeed a type of data-mining task, but at a very low level. A higher-level task would, for example, be to find correlations between the average age and average income of individuals in an enterprise.
2. Data mining is the search for the relationships and global patterns that exist in large databases but are hidden among vast amounts of data, such as the relationship between patient data and their medical diagnosis. This relationship represents valuable knowledge about the database, and the objects in the database, if the database is a faithful mirror of the real world registered by the database.
Consider the employee database and let us assume that we have some tools available with us to determine relationships between fields, say the relationship between age and lunch patterns. Assume, for example, that we find that most of the employees in their thirties like to eat pizzas, burgers or Chinese food during their lunch break; employees in their forties prefer to carry a home-cooked lunch from their homes; and employees in their fifties take fruits and salads during lunch. If our tool finds this pattern from the database which records the lunch activities of all employees for the last few months, then we can term our tool a data-mining tool. The daily lunch activity of all employees collected over a reasonable period of time makes the database very vast. Just by examining the database, it is impossible to notice any relationship between age and lunch patterns.
3. Data mining refers to using a variety of techniques to identify nuggets of information or decision-making knowledge in the database and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but it has low value and no direct use can be made of it. It is the hidden information in the data that is useful.
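To make the data-retrieval versus data-mining contrast above concrete, here is a minimal pandas sketch; the departments, dates of birth and incomes are invented. It shows the low-level aggregation (average age per department) next to a first step toward a higher-level pattern (the correlation between age and income):

    import pandas as pd

    # Hypothetical employee records (only date-of-birth is stored, not age).
    emp = pd.DataFrame({
        "dept":   ["sales", "sales", "hr", "hr", "it"],
        "dob":    pd.to_datetime(["1985-04-12", "1979-11-02", "1990-06-30",
                                  "1982-01-15", "1975-09-08"]),
        "income": [52000, 61000, 40000, 48000, 75000],
    })

    today = pd.Timestamp("2004-01-01")          # reference date for the example
    emp["age"] = (today - emp["dob"]).dt.days // 365

    # Low-level task: derive implicit but essentially trivial information.
    print(emp.groupby("dept")["age"].mean())

    # Higher-level task: look for a global pattern, e.g. age vs. income correlation.
    print(emp["age"].corr(emp["income"]))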
Measures such as confidence and support provide the user with interesting and previously unknown information. We shall study the techniques to discover association rules in the following chapters.
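To make support and confidence concrete, here is a minimal sketch that computes both measures for a hypothetical rule soft_drink -> antacid; the basket contents and item names are made up for illustration:

    # Each transaction is the set of items in one market basket (toy data).
    baskets = [
        {"soft_drink", "chips"},
        {"soft_drink", "antacid"},
        {"antacid"},
        {"soft_drink", "chips", "antacid"},
        {"bread"},
    ]

    def support(itemset):
        # Fraction of baskets containing every item of the itemset.
        return sum(itemset <= b for b in baskets) / len(baskets)

    def confidence(lhs, rhs):
        # How often the right-hand side appears given the left-hand side.
        return support(lhs | rhs) / support(lhs)

    # Rule: soft_drink -> antacid
    print(support({"soft_drink", "antacid"}))       # support of the rule
    print(confidence({"soft_drink"}, {"antacid"}))  # confidence of the rule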
Discovery of Classification Rules
Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known. Classification analyzes the training data set and constructs a model based on the class label, and aims to assign a class label to future unlabelled records. Since the class field is known, this type of classification is known as supervised learning. The resulting classifier can then be used to classify future data and to develop a better understanding of each class in the database.
There are several classification discovery models: decision trees, neural networks, genetic algorithms and statistical models such as linear/geometric discriminants. The applications include credit card analysis, banking, medical applications and the like. Consider the following example.
The domestic flights in our country were at one time operated only by Indian Airlines. Recently, many other private airlines began their operations for domestic travel. Some of the customers of Indian Airlines started flying with these private airlines and, as a result, Indian Airlines lost these customers. Let us assume that Indian Airlines wants to understand why some customers remain loyal while others leave. Ultimately, the airline wants to predict which customers it is most likely to lose to its competitors. The aim is to build a model based on the historical data of loyal customers versus customers who have left. This becomes a classification problem. It is a supervised learning task, as the historical data becomes the training set, which is used to train the model. The decision tree is the most popular classification technique. We shall discuss different methods of decision tree construction in the forthcoming lessons.
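As a rough illustration of this supervised-learning setup, the sketch below trains a decision tree on a small, entirely hypothetical table of customer attributes (the column names and values are invented, not any airline's data) and then assigns a class label to a new, unlabelled record:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical historical data: loyal customers vs. customers who left.
    history = pd.DataFrame({
        "flights_per_year":    [24, 3, 15, 2, 30, 5, 18, 1],
        "avg_fare":            [210, 90, 150, 80, 250, 95, 180, 70],
        "uses_frequent_flyer": [1, 0, 1, 0, 1, 0, 1, 0],
        "label": ["loyal", "left", "loyal", "left",
                  "loyal", "left", "loyal", "left"],
    })

    X = history.drop(columns="label")
    y = history["label"]

    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Predict the class of a new, unlabelled customer record.
    new_customer = pd.DataFrame([{"flights_per_year": 4, "avg_fare": 85,
                                  "uses_frequent_flyer": 0}])
    print(model.predict(new_customer))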
Clustering
Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. The algorithm attempts to automatically partition the data space into a set of regions or clusters, to which the examples in the table are assigned, either deterministically or probability-wise. The goal of the process is to identify all sets of similar examples in the data, in some optimal fashion.
Clustering according to similarity is a concept which appears in many disciplines. If a measure of similarity is available, then there are a number of techniques for forming clusters. Another approach is to build set functions that measure some particular property of groups. This latter approach achieves what is known as optimal partitioning.
The objectives of clustering are:
• To uncover natural groupings
• To initiate hypotheses about the data
• To find consistent and valid organization of the data.
A retailer may want to know whether there are similarities in his customer base, so that he can create and understand different groups. He can use the existing database of the different customers or, more specifically, different transactions collected over a period of time. The clustering methods will help him in identifying different categories of customers. During the discovery process, the differences between data sets can be discovered in order to separate them into different groups, and similarity between data sets can be used to group similar data together. We shall discuss in detail the use of clustering algorithms for data mining tasks in further lessons.
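A minimal sketch of this idea, using k-means (one common clustering algorithm) on a small, made-up customer table; the feature names and numbers are purely illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer summaries: [visits per month, average spend per visit].
    customers = np.array([
        [2,  15], [3,  18], [2,  20],      # occasional, low spend
        [12, 22], [15, 25], [14, 19],      # frequent, low spend
        [3, 120], [2, 150], [4, 110],      # occasional, high spend
    ])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)           # cluster assigned to each customer
    print(kmeans.cluster_centers_)  # prototype ("average" customer) of each group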
Data Warehousing
Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. A data warehouse is a relational database management system (RDBMS) designed specifically to meet the needs of decision support rather than transaction processing systems. It can be loosely defined as any centralized data repository which can be queried for business benefit, but this will be more clearly defined later. Data warehousing is a new powerful technique making it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. As well as integrating data throughout an enterprise, regardless of location, format, or communication requirements, it is possible to incorporate additional or expert information. It is,
"the logical link between what the managers see in their decision support EIS applications and the company's operational activities"
- John McIntyre of SAS Institute Inc
In other words, the data warehouse provides data that is already transformed and summarized, therefore making it an appropriate environment for more efficient DSS and EIS applications.

Discussion
• Explain the relation between a data warehouse and data mining.
• What are the various kinds of data mining models?
• "Data warehousing is a new powerful technique making it possible to extract archived operational data and overcome inconsistencies between different legacy data formats." Comment.
• Explain the problems associated with the Verification Model.
• "Data mining is essentially a system that learns from the existing data." Illustrate with examples.
LESSON 21
ISSUES AND CHALLENGES IN DM, DM APPLICATIONS AREAS
the forms or sources of data. We shall study the data mining problems for different types of data.

Sequence Mining
Sequence mining is concerned with mining sequence data. It may be noted that in the discovery of association rules, we are interested in finding associations between items irrespective of their order of occurrence. For example, we may be interested in the association between the purchase of a particular brand of soft drink and the occurrence of stomach upsets. But it is more relevant to identify whether there is some pattern in the stomach upsets which occur after the purchase of the soft drink. Then one is inclined to infer that the soft drink causes stomach upsets. On the other hand, if it is more likely that the purchase of the soft drink follows the occurrence of the stomach upset, then it is probable that the soft drink provides some sort of relief to the user. Thus, the discovery of temporal sequences of events concerns causal relationships among the events in a sequence. Another application of this domain concerns drug misuse. Drug misuse can occur unwittingly, when a patient is prescribed two or more interacting drugs within a given time period of each other. Drugs that interact undesirably are recorded along with the time frame as a pattern that can be located within the patient records. The rules that describe such instances of drug misuse are then induced from medical records.
Another related area which falls into the larger domain of temporal data mining is trend discovery. One characteristic of sequence-pattern discovery in comparison with trend discovery is the lack of shapes, since the causal impact of a series of events cannot be shaped.
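To see why the order of events matters here, consider a minimal sketch that counts, in a small invented event log, how often one event follows another within a time window; the patient IDs, events and dates are made up for illustration:

    from datetime import date

    # Hypothetical event log: (patient_id, event, date).
    events = [
        (1, "soft_drink", date(2003, 5, 1)), (1, "stomach_upset", date(2003, 5, 2)),
        (2, "stomach_upset", date(2003, 5, 3)), (2, "soft_drink", date(2003, 5, 4)),
        (3, "soft_drink", date(2003, 5, 5)), (3, "stomach_upset", date(2003, 5, 5)),
    ]

    def count_sequences(first, then, within_days=3):
        """Count patients where `then` occurs after `first` within the window."""
        hits = 0
        patients = {p for p, _, _ in events}
        for p in patients:
            firsts = [d for q, e, d in events if q == p and e == first]
            thens  = [d for q, e, d in events if q == p and e == then]
            if any(0 <= (t - f).days <= within_days for f in firsts for t in thens):
                hits += 1
        return hits

    # Order matters: "A followed by B" is not the same as "B followed by A".
    print(count_sequences("soft_drink", "stomach_upset"))
    print(count_sequences("stomach_upset", "soft_drink"))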
Web Mining
With the huge amount of information available online, the World Wide Web is a fertile area for data mining research. Web mining research is at the crossroads of research from several communities, such as databases, information retrieval and, within AI, especially the sub-areas of machine learning and natural language processing. Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services. This area of research is so huge today partly due to the interests of various research communities, the tremendous growth of information sources available on the web and the recent interest in e-commerce. This phenomenon often creates confusion when we ask what constitutes web mining. Web mining can be broken down into the following subtasks:
1. Resource finding: retrieving documents intended for the web.
2. Information selection and preprocessing: automatically selecting and preprocessing specific information from resources retrieved from the web.
3. Generalization: automatically discovering general patterns at individual web sites as well as across multiple sites.
4. Analysis: validation and/or interpretation of the mined patterns.

Text Mining
The term text mining or KDT (Knowledge Discovery in Text) was first proposed by Feldman and Dagan in 1996. They suggest that text documents be structured by means of information extraction, text categorization, or applying NLP techniques as a preprocessing step before performing any kind of KDT. Presently the term text mining is being used to cover many applications such as text categorization, exploratory data analysis, text clustering, finding patterns in text databases, finding sequential patterns in texts, IE (Information Extraction), empirical computational linguistic tasks, and association discovery.
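As a small illustration of the preprocessing step mentioned above, the sketch below turns a few invented documents into a document-by-term count matrix (a bag-of-words representation) that later mining steps could work on; the documents and the use of scikit-learn are purely illustrative:

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical documents, e.g. from a digital library.
    docs = [
        "data mining extracts patterns from large databases",
        "text mining structures documents before mining for patterns",
        "spatial databases store location data",
    ]

    # A simple preprocessing step: represent each document by its key-word counts.
    vectorizer = CountVectorizer(stop_words="english")
    term_matrix = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the extracted vocabulary
    print(term_matrix.toarray())               # document-by-term count matrix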
Spatial Data Mining
Spatial data mining is the branch of data mining that deals with spatial (location) data. The immense explosion in geographically-referenced data, occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS, places demands on developing data-driven inductive approaches to spatial analysis and modeling. Spatial data mining is regarded as a special type of data mining that seeks to perform similar generic functions as conventional data mining tools, but modified to take into account the special features of spatial information.
For example, we may wish to discover some association among patterns of residential colonies and topographical features. A typical spatial association may look like: "The residential land pockets are dense in a plain region and rocky areas are thinly populated"; or, "The economically affluent citizens reside in hilly, secluded areas whereas the middle income group residents prefer having their houses near the market".

Data Mining Application Areas
The discipline of data mining is driven in part by new applications, which require new capabilities that are not currently being supplied by today's technology. These new applications can be naturally divided into three broad categories [Grossman, 1999].

A. Business and E-Commerce Data
This is a major source category of data for data mining applications. Back-office, front-office, and network applications produce large amounts of data about business processes. Using this data for effective decision-making remains a fundamental challenge.

Business Transactions
Modern business processes are consolidating with millions of customers and billions of their transactions. Business enterprises require the necessary information for their effective functioning in today's competitive world. For example, they would like to know: "Is this transaction fraudulent?"; "Which customer is likely to migrate?"; and "What product is this customer most likely to buy next?".

Electronic Commerce
Not only does electronic commerce produce large data sets in which the analysis of marketing patterns and risk patterns is critical, but it is also important to do this in near-real time, in order to meet the demands of online transactions.

Data Mining Applications - Case Studies
If a cluster of cases seems to point to the same offenders, then these frequent offenders can be subjected to careful examination.

Store-Level Fruit Purchasing Prediction
A supermarket chain called 'Fruit World' sells fruits of different types and purchases these fruits from wholesale suppliers on a day-to-day basis. The problem is to analyze fruit-buying patterns, using large volumes of data captured at the 'basket' level. Because fruits have a short shelf life, it is important that accurate store-level purchasing predictions should be made to ensure optimum freshness and availability. The situation is inherently complicated by the 'domino' effect: for example, when one variety of mangoes is sold out, sales are transferred to another variety. With the help of data mining techniques, a thorough understanding of purchasing trends enables a better availability of fruits and greater customer satisfaction.

Other Application Areas

Risk Analysis
Given a set of current customers and an assessment of their risk-worthiness, develop descriptions for various classes. Use these descriptions to classify a new customer into one of the risk categories.

Targeted Marketing
Given a database of potential customers and how they have responded to a solicitation, develop a model of customers most likely to respond positively, and use the model for more focused new customer solicitation. Other applications are to identify buying patterns from customers, to find associations among customer demographic characteristics, and to predict the response to mailing campaigns.

Retail/Marketing
• Identify buying patterns from customers
• Find associations among customer demographic characteristics
• Predict response to mailing campaigns
• Market basket analysis

Customer Retention
Given a database of past customers and their behavior prior to attrition, develop a model of customers most likely to leave. Use the model for determining the course of action for these customers.

Portfolio Management
Given a particular financial asset, predict the return on investment to determine whether or not to include the asset in a portfolio.

Brand Loyalty
Given a customer and the product he/she uses, predict whether the customer will switch brands.

Banking
The application areas in banking are:
• Detecting patterns of fraudulent credit card use
• Identifying 'loyal' customers
• Predicting customers likely to change their credit card affiliation
• Determining credit card spending by customer groups
• Finding hidden correlations between different financial indicators
• Identifying stock trading rules from historical market data

Insurance and Health Care
• Claims analysis, i.e., which medical procedures are claimed together
• Predict which customers will buy new policies
• Identify behavior patterns of risky customers
• Identify fraudulent behavior

Transportation
• Determine the distribution schedules among outlets
• Analyze loading patterns

Medicine
• Characterize patient behavior to predict office visits
• Identify successful medical therapies for different illnesses

Discussion
• Discuss different data mining tasks.
• What is spatial data mining?
• What is sequence mining?
• What is web mining?
• What is text mining?
• Discuss the applications of data mining in the banking industry.
• Discuss the applications of data mining in customer relationship management.
• How is data mining relevant to scientific data?
• How is data mining relevant for web-based computing?
• Discuss the application of data mining to scientific data.

Bibliography
• Agrawal R., Gupta A., and Sarawagi S., Modeling multidimensional databases. ICDE, 1997.
• Anahory S., and Murray D., Data Warehousing in the Real World: A practical guide for building decision support systems. Addison Wesley Longman, 1997.
• Barbara D. (ed.), Special Issue on Mining of Large Datasets, IEEE Data Engineering Bulletin, 21(1), 1998.
• Brachman R., Khabaza T., Kloesgen W., Shapiro G.P., and Simoudis E., Industrial applications of data mining and knowledge discovery, Communications of the ACM, 1996.
• Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (eds.), Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press/The MIT Press, 1996.
• Fayyad U.M., Uthurusamy R. (eds.), Special issue on data mining, Communications of the ACM, 1996.
• Grossman R., Kasif S., Moore R., Rocke D. and Ullmann J., Data Mining Research: Opportunities and Challenges, A Report, www.ncdm.uic.edu/M3D-final-report.htm, Jan 1999.
• Heckerman D., Bayesian networks for data mining, Data Mining and Knowledge Discovery, 1(1), 1997.
CHAPTER 5: DATA MINING TECHNIQUES
LESSON 22
VARIOUS TECHNIQUES OF DATA MINING: NEAREST NEIGHBOR AND CLUSTERING TECHNIQUES
that dimension (and hence predictor) more important than the others in calculating the distance.
For instance, if you were a mountain climber and someone told you that you were 2 mi from your destination, the distance would be the same whether it were 1 mi north and 1 mi up the face of the mountain or 2 mi north on level ground, but clearly the former route is much different from the latter. The distance traveled straight upward is the most important in figuring out how long it will really take to get to the destination, and you would probably like to consider this "dimension" to be more important than the others. In fact, you, as a mountain climber, could "weight" the importance of the vertical dimension in calculating some new distance by reasoning that every mile upward is equivalent to 10 mi on level ground.
If you used this rule of thumb to weight the importance of one dimension over the other, it would be clear that in one case you were much "farther away" from your destination (11 mi) than in the second (2 mi). In the next section we'll show how the nearest neighbor algorithm uses a distance measure that similarly weights the important dimensions more heavily when calculating a distance.
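A minimal sketch of that weighting idea; the routes, weights and the choice of a city-block distance are illustrative assumptions rather than a prescribed metric. A nearest-neighbor algorithm would use the same weighted measure when searching for the closest historical record:

    import numpy as np

    # Coordinates: [miles travelled north, miles climbed vertically].
    destination = np.array([0.0, 0.0])
    route_a = np.array([1.0, 1.0])   # 1 mi north and 1 mi up the mountain face
    route_b = np.array([2.0, 0.0])   # 2 mi north on level ground

    # Rule of thumb from the text: every mile upward counts as 10 miles on level ground.
    weights = np.array([1.0, 10.0])

    def weighted_distance(x, y, w):
        # Weighted city-block distance; other metrics (e.g. Euclidean) work the same way.
        return float(np.sum(w * np.abs(x - y)))

    print(weighted_distance(route_a, destination, weights))   # 11.0 -- much "farther"
    print(weighted_distance(route_b, destination, weights))   # 2.0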
Nearest Neighbor versus Clustering
• Purpose: Nearest neighbor is used for prediction as well as consolidation. Clustering is used mostly for consolidating data into a high-level view and a general grouping of records into like behaviors.
• Space: For nearest neighbor, the space is defined by the problem to be solved (supervised learning). For clustering, the space is the default n-dimensional space, or is defined by the user, or is a predefined space driven by past experience (unsupervised learning).
• Nearness: Nearest neighbor generally only uses distance metrics to determine nearness. Clustering can use other metrics besides distance to determine the nearness of two records - for example, linking points together.
Cluster Analysis: Overview
In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database, as shown in the following diagram. The first step is to discover subsets of related objects and then find descriptions, e.g., D1, D2, D3 etc., which describe each of these subsets.
Figure 5: Discovering clusters and descriptions in a database
Clustering and segmentation basically partition the database so that each partition or group is similar according to some criteria or metric. Clustering according to similarity is a concept which appears in many disciplines. If a measure of similarity is available, there are a number of techniques for forming clusters. Membership of groups can be based on the level of similarity between members, and from this the rules of membership can be defined. Another approach is to build set functions that measure some property of partitions, i.e. groups or subsets, as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.
Many data mining applications make use of clustering according to similarity, for example to segment a client/customer base. Clustering according to optimization of set functions is used in data analysis, e.g. when setting insurance tariffs the customers can be segmented according to a number of parameters and the optimal tariff segmentation achieved.
Clustering/segmentation in databases is the process of separating a data set into components that reflect a consistent pattern of behavior. Once the patterns have been established, they can be used to "deconstruct" the data into more understandable subsets; they also provide sub-groups of a population for further analysis or action, which is important when dealing with very large databases. For example, a database could be used for profile generation for target marketing, where previous response to mailing campaigns can be used to generate a profile of people who responded, and this can be used to predict response and filter mailing lists to achieve the best response.

Discussion
1. Write short notes on:
• Clustering
• Sequential patterns
• Segmentation
• Association rules
• Classification hierarchies
2. Explain cluster analysis.
3. Correctly contrast the difference between supervised and unsupervised learning.
4. Discuss in brief where clustering and nearest-neighbor prediction are used.
5. How is the space for clustering and nearest neighbor defined? Explain.
6. What is the difference between clustering and nearest-neighbor prediction?
7. How are tradeoffs made when determining which records fall into which clusters?
8. Explain the following:
• Clustering
• Nearest-Neighbor
LESSON 23
DECISION TREES
Decision Tree Structure
In order to have a clear idea of a decision tree, I have explained it with the following examples.
Example 23.1
Let us consider the following data sets: the training data set (see Table 23.1) and the test data set (see Table 23.2).
Table 23.1 Training Data Set
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then don't play.
Please note that this may not be the best set of rules that can be derived from the given set of training data.
The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node. A record enters the tree at the root node. At the root, a test is applied to determine which child node the record will encounter next. This process is repeated until the record arrives at a leaf node. All the records that end up at a given leaf of the tree are classified in the same way. There is a unique path from the root to each leaf. The path is a rule, which is used to classify the records.
In the above tree, we can carry out the classification for an unknown record as follows. Let us assume, for the record, that we know the values of the first four attributes (but we do not know the value of the class attribute) as
outlook = rain; temp = 70; humidity = 65; and windy = true.
We start from the root node to check the value of the attribute associated with the root node. This attribute is the splitting attribute at this node. Please note that for a decision tree, at every node there is an attribute associated with the node called the splitting attribute. In our example, outlook is the splitting attribute at the root. Since for the given record outlook = rain, we move to the right-most child node of the root. At this node, the splitting attribute is windy, and we find that for the record we want to classify, windy = true. Hence, we move to the left child node to conclude that the class label is "no play".
Note that every path from the root node to a leaf node represents a rule. It may be noted that many different leaves of the tree may refer to the same class label, but each leaf refers to a different rule.
The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified. Consider the following test data set (Table 23.2).

Table 23.2 Test Data Set

OUTLOOK    TEMP (F)   HUMIDITY (%)   WINDY   CLASS
sunny      79         90             true    play
sunny      56         70             false   play
sunny      79         75             true    no play
sunny      50         90             true    no play
overcast   88         88             false   no play
overcast   63         75             true    play
overcast   88         95             false   play
rain       78         60             false   play
rain       66         70             false   no play
rain       68         60             true    play

We can see that for Rule 1 there are two records of the test data set satisfying outlook = sunny and humidity <= 75, and only one of these is correctly classified as play. Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is 0.66.

Example 23.2
At this stage, let us consider another example to illustrate the concept of categorical attributes. Consider the following training data set (Table 23.3). There are three attributes, namely, age, pincode and class. The attribute class is used as the class label.

Table 23.3 Another Example

ID   AGE   PINCODE   CLASS
1    30    560046    C1
2    25    560046    C1
3    21    560023    C2
4    43    560046    C1
5    18    560023    C2
6    33    560023    C1
7    29    560023    C1
8    55    560046    C2
9    48    560046    C1

The attribute age is a numeric attribute, whereas pincode is a categorical one. Though the domain of pincode is numeric, no ordering can be defined among pincode values. You cannot derive any useful information if one pincode is greater than another pincode. Figure 23.2 gives a decision tree for this training data. The splitting attribute at the root is pincode and the splitting criterion here is pincode = 560046. Similarly, for the left child node, the splitting criterion is age <= 48 (the splitting attribute is age). Although the right child node has the same splitting attribute, the splitting criterion is different.
Most decision tree building algorithms begin by trying to find the test which does the best job of splitting the records among the desired categories. At each succeeding level of the tree, the subsets created by the preceding split are themselves split according to whatever rule works best for them. The tree continues to grow until it is no longer possible to find better ways to split up incoming records, or when all the records are in one class.

Figure 23.2 A Decision Tree

pincode = 560046?  [records 1-9]
  yes (1, 2, 4, 8, 9): age <= 48?
    yes: C1 (1, 2, 4, 9)    no: C2 (8)
  no (3, 5, 6, 7): age <= 21?
    yes: C2 (3, 5)          no: C1 (6, 7)

In Figure 23.2, we see that at the root level we have 9 records. The associated splitting criterion is pincode = 560046. As a result, we split the records into two subsets: records 1, 2, 4, 8 and 9 go to the left child node and the remaining records to the right node. This process is repeated at every node.
A decision tree construction process is concerned with identifying the splitting attributes and splitting criteria at every level of the tree. There are several alternatives, and the main aim of the decision tree construction process is to generate simple and accurate trees.
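As a rough cross-check of Example 23.2, the sketch below fits a decision tree to the Table 23.3 records with scikit-learn, one-hot encoding the categorical pincode attribute. The library, the encoding choice and the exact tree it produces are illustrative and may differ in detail from Figure 23.2:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Training data from Table 23.3 (pincode treated as a categorical attribute).
    data = pd.DataFrame({
        "age":     [30, 25, 21, 43, 18, 33, 29, 55, 48],
        "pincode": ["560046", "560046", "560023", "560046", "560023",
                    "560023", "560023", "560046", "560046"],
        "cls":     ["C1", "C1", "C2", "C1", "C2", "C1", "C1", "C2", "C1"],
    })

    # One-hot encode the categorical attribute so no artificial ordering is implied.
    X = pd.get_dummies(data[["age", "pincode"]], columns=["pincode"])
    y = data["cls"]

    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))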
{O1, O2, ..., On}. T is partitioned into the subsets T1, T2, T3, ..., Tn, where Ti contains all those cases in T that have the outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test, and one branch for each possible outcome. The same tree building method is applied recursively to each subset of training cases. Most often, n is chosen to be 2 and hence, the algorithm generates a binary decision tree.
• T is trivial: T contains no cases. The decision tree T is a leaf, but the class to be associated with the leaf must be determined from information other than T.
The generic algorithm of decision tree construction outlines the common principle of all algorithms. Nevertheless, the following aspects should be taken into account while studying any specific algorithm. In one sense, the following are three major difficulties which arise when one uses a decision tree in a real-life situation.

Guillotine Cut
Most decision tree algorithms examine only a single attribute at a time. As mentioned in the earlier paragraph, normally the splitting is done for a single attribute at any stage, and if the attribute is numeric, then the splitting test is an inequality. Geometrically, each splitting can be viewed as a plane parallel to one of the axes. Thus, splitting on one single attribute leads to rectangular classification boxes that may not correspond too well with the actual distribution of records in the decision space. We call this the guillotine cut phenomenon. The test is of the form (X > z) or (X < z), which is called a guillotine cut, since it creates a guillotine cut subdivision of the Cartesian space of the ranges of attributes.
However, the guillotine cut approach has a serious problem if a pair of attributes is correlated. For example, let us consider two numeric attributes, height (in meters) and weight (in kilograms). Obviously, these attributes have a strong correlation. Thus, whenever there exists a correlation between variables, a decision tree with the splitting criteria on a single attribute is not accurate. Therefore, some researchers propose an oblique decision tree that uses a splitting criterion involving more than one attribute.

Over Fit
Decision trees are built from the available data. However, the training data set may not be a proper representative of the real-life situation and may also contain noise. In an attempt to build a tree from a noisy training data set, we may grow a decision tree just deeply enough to perfectly classify the training data set.
Definition 23.3 Overfit
A decision tree T is said to overfit the training data if there exists some other tree T' which is a simplification of T, such that T has smaller error than T' over the training set but T' has a smaller error than T over the entire distribution of instances.
Overfitting can lead to difficulties when there is noise in the training data, or when the number of training examples is too small. Specifically, if there are no conflicting instances in the training data set, the error of the fully built tree is zero, while the true error is likely to be bigger. There are many disadvantages of an overfitted decision tree:
a. Overfitted models are incorrect.
b. Overfitted decision trees require more space and more computational resources.
c. Overfitted models require the collection of unnecessary features.
d. They are more difficult to comprehend.
The pruning phase helps in handling the overfitting problem. The decision tree is pruned back by removing the subtree rooted at a node and replacing it by a leaf node, using some criterion. Several pruning algorithms are reported in the literature.
In the next lesson we will study decision tree construction algorithms and the working of decision trees.
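A minimal sketch of the effect described in Definition 23.3, using scikit-learn on invented, noisy synthetic data. The pruning here is scikit-learn's cost-complexity pruning (ccp_alpha), used as a stand-in for the subtree-replacement pruning described above; the pruned tree typically does better on the held-out data even though the full tree fits the training set more closely:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Noisy synthetic data mimics a training set that is not perfectly representative.
    X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    full   = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

    # Training accuracy vs. held-out accuracy for the full and the pruned tree.
    print("full  :", full.score(X_train, y_train), full.score(X_test, y_test))
    print("pruned:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))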
Exercises
1. What is a decision tree? Illustrate with an example.
2. Describe the essential features of a decision tree. How is it useful to classify data?
3. What is a classification problem? What is supervised classification? How is a decision tree useful in classification?
4. Explain where to use decision trees.
5. What are the disadvantages of the decision tree over other classification techniques?
6. What are the advantages and disadvantages of the decision tree approach over other approaches of data mining?
7. What are the three phases of construction of a decision tree? Describe the importance of each of the phases.
8. ID3 generates
a. a binary decision tree
b. a decision tree with as many branches as there are distinct values of the attribute
c. a tree with a variable number of branches, not related to the domain of the attributes
d. a tree with an exponential number of branches.

Suggested Readings
1. Pieter Adriaans, Dolf Zantinge, Data Mining, Pearson Education, 1996.
2. George M. Marakas, Modern Data Warehousing, Mining, and Visualization: Core Concepts, Prentice Hall, 1st edition, 2002.
3. Alex Berson, Stephen J. Smith, Data Warehousing, Data Mining, and OLAP (Data Warehousing/Data Management), McGraw-Hill, 1997.
4. Margaret H. Dunham, Data Mining, Prentice Hall, 1st edition, 2002.
5. David J. Hand, Principles of Data Mining (Adaptive Computation and Machine Learning), Prentice Hall, 1st edition, 2002.
6. Jiawei Han, Micheline Kamber, Data Mining, Prentice Hall, 1st edition, 2002.
7. Michael J. Corey, Michael Abbey, Ben Taub, Ian Abramson, Oracle 8i Data Warehousing, McGraw-Hill Osborne Media, 2nd edition, 2001.
LESSON 24
DECISION TREES - 2
Structure
• Objective
• Introduction
• Best Split
• Decision Tree Construction Algorithms
• CART
• ID3
• C4.5
• CHAID
• When does the tree stop growing?
• Why would a decision tree algorithm prevent the tree from growing if there weren't enough data?
• Decision trees aren't necessarily finished after they are fully grown
• Are the splits at each level of the tree always binary yes/no splits?

Best Split
The best split at a node is the one that does the best job of separating the records into groups where a single class predominates. To choose the best splitter at a node, we consider each independent attribute in turn. Assuming that an attribute takes on multiple values, we sort it and then, using some evaluation function as the measure of goodness, evaluate each split. We compare the effectiveness of the split provided by the best splitter from each attribute. The winner is chosen as the splitter for the root node. How does one know which split is better than the other? We shall discuss below two different evaluation functions to determine the splitting attributes and the splitting criteria.

Decision Tree Construction Algorithms
A number of algorithms for inducing decision trees have been proposed over the years. However, they differ among themselves in the methods employed for selecting splitting attributes and splitting conditions. In the following few sections, we shall study some of the major methods of decision tree construction.
ID3
yet considered in the path from the root. Entropy is used to measure how informative a node is. This algorithm uses the criterion of information gain to determine the goodness of a split. The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.
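A small self-contained sketch of the entropy and information-gain computation that ID3 uses to rank candidate splits; the toy records below are invented, in the spirit of the weather example from the previous lesson:

    from collections import Counter
    from math import log2

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(records, attribute, label):
        base = entropy([r[label] for r in records])
        remainder = 0.0
        for v in {r[attribute] for r in records}:
            subset = [r[label] for r in records if r[attribute] == v]
            remainder += len(subset) / len(records) * entropy(subset)
        return base - remainder

    # Toy records (invented) with two candidate splitting attributes.
    records = [
        {"outlook": "sunny",    "windy": "true",  "class": "no play"},
        {"outlook": "sunny",    "windy": "false", "class": "play"},
        {"outlook": "rain",     "windy": "true",  "class": "no play"},
        {"outlook": "rain",     "windy": "false", "class": "play"},
        {"outlook": "overcast", "windy": "false", "class": "play"},
    ]

    # ID3 would pick the attribute with the greatest information gain as the splitter.
    print(information_gain(records, "outlook", "class"))
    print(information_gain(records, "windy", "class"))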
C4.5
C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation. In building a decision tree, we can deal with training sets that have records with unknown attribute values by evaluating the gain, or the gain ratio, for an attribute by considering only those records where those attribute values are available. We can classify records that have unknown attribute values by estimating the probability of the various possible results. Unlike CART, which generates a binary decision tree, C4.5 produces trees with a variable number of branches per node. When a discrete variable is chosen as the splitting attribute in C4.5, there will be one branch for each value of the attribute.

CHAID
CHAID, proposed by Kass in 1980, is a derivative of AID (Automatic Interaction Detection), proposed by Hartigan in 1975. CHAID attempts to stop growing the tree before overfitting occurs, whereas the above algorithms generate a fully grown tree and then carry out pruning as a post-processing step. In that sense, CHAID avoids the pruning phase.
In the standard manner, the decision tree is constructed by partitioning the data set into two or more subsets, based on the values of one of the non-class attributes. After the data set is partitioned according to the chosen attribute, each subset is considered for further partitioning using the same algorithm. Each subset is partitioned without regard to any other subset. This process is repeated for each subset until some stopping criterion is met. In CHAID, the number of subsets in a partition can range from two up to the number of distinct values of the splitting attribute. In this regard, CHAID differs from CART, which always forms binary splits, and from ID3 or C4.5, which form a branch for every distinct value.
The splitting attribute is chosen as the one that is most significantly associated with the dependent attribute according to a chi-squared test of independence in a contingency table (a cross-tabulation of the non-class and class attribute). The main stopping criterion used by such methods is the p-value from this chi-squared test. A small p-value indicates that the observed association between the splitting attribute and the dependent variable is unlikely to have occurred solely as the result of sampling variability.
If a splitting attribute has more than two possible values, then there may be a very large number of ways to partition the data set based on these values. A combinatorial search algorithm can be used to find a partition that has a small p-value for the chi-squared test. The p-values for each chi-squared test are adjusted for the multiplicity of partitions. A Bonferroni adjustment is used for the p-values computed from the contingency tables relating the predictors to the dependent variable. The adjustment is conditional on the number of branches (compound categories) in the partition, and thus does not take into account the fact that different numbers of branches are considered.
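A minimal sketch of the chi-squared test that drives this choice, using SciPy on an invented cross-tabulation of a candidate splitting attribute against the class:

    from scipy.stats import chi2_contingency

    # Contingency table: rows = values of a candidate splitting attribute,
    # columns = class counts (e.g. churned vs. not churned).
    table = [
        [30, 10],   # attribute value A
        [12, 28],   # attribute value B
        [20, 20],   # attribute value C
    ]

    chi2, p_value, dof, expected = chi2_contingency(table)

    # CHAID-style reasoning: a small p-value means the attribute is strongly
    # associated with the class, so it is a good candidate for the split.
    print(chi2, p_value)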
When does the tree stop growing?
If the decision tree algorithm just continued like this, it could conceivably create more and more questions and branches in the tree so that eventually there was only one record in each segment. To let the tree grow to this size is computationally expensive and also unnecessary. Most decision tree algorithms stop growing the tree when one of three criteria is met:
1. The segment contains only one record or some algorithmically defined minimum number of records. (Clearly, there is no way to break a single-record segment into two smaller segments, and segments with very few records are not likely to be very helpful in the final prediction, since the predictions that they are making won't be based on sufficient historical data.)
2. The segment is completely organized into just one prediction value. There is no reason to continue further segmentation since this data is now completely organized (the tree has achieved its goal).
3. The improvement in organization is not sufficient to warrant making the split. For instance, if the starting segment were 90 percent churners and the resulting segments from the best possible question were 90.001 percent churners and 89.999 percent churners, then not much progress would have been or could be made by continuing to build the tree.

Why would a decision tree algorithm prevent the tree from growing if there weren't enough data?
Consider the following example of a segment that we might want to split further because it has only two examples. Assume that it has been created out of a much larger customer database by selecting only those customers aged 27 with blue eyes and with salaries ranging between $80,000 and $81,000.
In this case all the possible questions that could be asked about the two customers turn out to have the same value (age, eye color, salary) except for name.

Table: Decision Tree Algorithm Segment*

Name    Age   Eyes   Salary ($)   Churned?
Steve   27    Blue   80,000       Yes
Alex    27    Blue   80,000       No

* This segment cannot be split further except by using the predictor "name".

Decision trees aren't necessarily finished after they are fully grown
After the tree has been grown to a certain size (depending on the particular stopping criteria used in the algorithm), the CART algorithm has still more work to do. The algorithm then checks to see if the model has been overfit to the data. It does this in several ways, using a cross-validation approach or a test set validation approach - basically using the same mind-numbingly simple approach it used to find the best questions in the first place: trying many different simpler versions of the tree on a held-aside test set. The algorithm selects as the best model the tree that does the best on the held-aside data. The nice thing about CART is that this testing and selection is an integral part of the algorithm.
Exercises
1. What are advantages and disadvantages of the decision tree
approach over other approaches of data mining?
2. Describe the ID3 algorithm of the decision tree construction.
Why is it unsuitable for data mining applications?
3. Consider the following examples and use them to illustrate the working of the different algorithms.
LESSON 25
NEURAL NETWORKS
Structure
• Objective
• Introduction
• What is a Neural Network?
• Learning in NN
• Unsupervised Learning
• Data Mining using NN: A Case Study

Objective
The aim of this lesson is to introduce you to the concept of neural networks. It also includes various topics which explain how the neural network method is helpful in extracting knowledge from the warehouse.

Introduction
Data mining is essentially a task of learning from data and hence, any known technique which attempts to learn from data can, in principle, be applied for data mining purposes. In general, data mining algorithms aim at minimizing I/O operations on disk-resident data, whereas conventional algorithms are more concerned about time and space complexities, accuracy and convergence. Besides the techniques discussed in the earlier lessons, a few other techniques hold promise of being suitable for data mining purposes. These are Neural Networks (NN), Genetic Algorithms (GA) and Support Vector Machines (SVM). The intention of this chapter is to briefly present the underlying concepts of these subjects and demonstrate their applicability to data mining. We envisage that in the coming years these techniques are going to be important areas of data mining.

Neural Networks
The first question that comes to mind is: what is this neural network?
When data mining algorithms are discussed these days, people are usually talking about either decision trees or neural networks. Of the two, neural networks have probably been of greater interest through the formative stages of data mining technology. As we will see, neural networks do have disadvantages that can be limiting in their ease of use and ease of deployment, but they do also have some significant advantages. Foremost among these advantages are their highly accurate predictive models, which can be applied across a large number of different types of problems.
To be more precise, the term neural network might be defined as an "artificial" neural network. True neural networks are biological systems (a.k.a. brains) that detect patterns, make predictions, and learn. The artificial ones are computer programs implementing sophisticated pattern detection and machine learning algorithms on a computer to build predictive models from large historical databases. Artificial neural networks derive their name from their historical development, which started off with the premise that machines could be made to "think" if scientists found ways to mimic the structure and functioning of the human brain on the computer. Thus, historically, neural networks grew out of the community of artificial intelligence rather than from the discipline of statistics. Although scientists are still far from understanding the human brain, let alone mimicking it, neural networks that run on computers can do some of the things that people can do.
It is difficult to say exactly when the first "neural network" on a computer was built. During World War II a seminal paper was published by McCulloch and Pitts which first outlined the idea that simple processing units (like the individual neurons in the human brain) could be connected together in large networks to create a system that could solve difficult problems and display behavior that was much more complex than the simple pieces that made it up. Since that time much progress has been made in finding ways to apply artificial neural networks to real-world prediction problems and in improving the performance of the algorithms in general. In many respects the greatest breakthroughs in neural networks in recent years have been in their application to more mundane real-world problems such as customer response prediction or fraud detection, rather than the loftier goals that were originally set out for the techniques, such as overall human learning and computer speech and image understanding.

Don't neural networks learn to make better predictions?
Because of the origins of the techniques and because of some of their early successes, the techniques have enjoyed a great deal of interest. To understand how neural networks can detect patterns in a database, an analogy is often made that they "learn" to detect these patterns and make better predictions, similar to the way human beings do. This view is encouraged by the way the historical training data is often supplied to the network - one record (example) at a time.
Networks do "learn" in a very real sense, but under the hood, the algorithms and techniques that are being deployed are not truly different from the techniques found in statistics or other data mining algorithms. It is, for instance, unfair to assume that neural networks could outperform other techniques because they "learn" and improve over time while the other techniques remain static. The other techniques, in fact, "learn" from historical examples in exactly the same way, but often the examples (historical records) to learn from are processed all at once in a more efficient manner than are neural networks, which often modify their model one record at a time.

Are neural networks easy to use?
A common claim for neural networks is that they are automated to a degree where the user does not need to know that much about how they work, or about predictive modeling or even the
database in order to use them. The implicit claim is also that most neural networks can be unleashed on your data straight out of the box, without the need to rearrange or modify the data very much to begin with.
Just the opposite is often true. Many important design decisions need to be made in order to effectively use a neural network, such as:
• How should the nodes in the network be connected?
• How many neuron-like processing units should be used?
• When should "training" be stopped in order to avoid overfitting?
There are also many important steps required for preprocessing the data that goes into a neural network - most often there is a requirement to normalize numeric data between 0.0 and 1.0, and categorical predictors may need to be broken up into virtual predictors that are 0 or 1 for each value of the original categorical predictor. And, as always, understanding what the data in your database means and a clear definition of the business problem to be solved are essential to ensuring eventual success. The bottom line is that neural networks provide no shortcuts.
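As an illustration of that preprocessing step, here is a minimal sketch using pandas and scikit-learn on an invented customer table; the column names and the choice of MinMaxScaler plus one-hot encoding are assumptions for the example, not a fixed recipe:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    customers = pd.DataFrame({
        "age":    [23, 45, 31, 60],
        "income": [28000, 72000, 54000, 61000],
        "colour": ["blue", "white", "black", "blue"],   # categorical predictor
    })

    # Numeric predictors scaled to the 0.0-1.0 range expected by the network.
    numeric = pd.DataFrame(
        MinMaxScaler().fit_transform(customers[["age", "income"]]),
        columns=["age", "income"],
    )

    # Categorical predictor broken up into 0/1 "virtual" predictors, one per value.
    virtual = pd.get_dummies(customers["colour"], prefix="colour")

    network_input = pd.concat([numeric, virtual], axis=1)
    print(network_input)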
Business Scorecard
Neural networks are very powerful predictive modeling techniques, but some of the power comes at the expense of ease of use and ease of deployment. As we will see in this chapter, neural networks create very complex models that are almost always impossible to fully understand, even by experts. The model itself is represented by numeric values in a complex calculation that requires all the predictor values to be in the form of a number. The output of the neural network is also numeric and needs to be translated if the actual prediction value is categorical (e.g., predicting the demand for blue, white, or black jeans for a clothing manufacturer requires that the values blue, black and white for the predictor color be converted to numbers). Because of the complexity of these techniques, much effort has been expended in trying to increase the clarity with which the model can be understood by the end user. These efforts are still in their infancy, but they are of tremendous importance, since most data mining techniques, including neural networks, are being deployed against real business problems where significant investments are made on the basis of the predictions from the models (e.g., consider trusting the predictive model from a neural network that dictates which one million customers will receive a $1 mailing).
These shortcomings in understanding the meaning of the neural network model have been successfully addressed in two ways:
1. The neural network is packaged up into a complete solution, such as fraud prediction. This allows the neural network to be carefully crafted for one particular application, and once it has been proven successful, it can be used over and over again without requiring a deep understanding of how it works.
2. The neural network is packaged up with expert consulting services. Here trusted experts who have a track record of success deploy the neural network. The experts either are able to explain the models or trust that the models do work.
The first tactic has seemed to work quite well because, when the technique is used for a well-defined problem, many of the difficulties in preprocessing the data can be automated (because the data structures have been seen before) and interpretation of the model is less of an issue, since entire industries begin to use the technology successfully and a level of trust is created. Several vendors have deployed this strategy (e.g., HNC's Falcon system for credit card fraud prediction and Advanced Software Applications' ModelMAX package for direct marketing).
Packaging up neural networks with expert consultants is also a viable strategy that avoids many of the pitfalls of using neural networks, but it can be quite expensive because it is human-intensive. One of the great promises of data mining is, after all, the automation of the predictive modeling process. These neural network consulting teams are little different from the analytical departments many companies already have in house. Since there is not a great difference in the overall predictive accuracy of neural networks over standard statistical techniques, the main difference becomes the replacement of the statistical expert with the neural network expert. Either with statistics or neural network experts, the value of putting easy-to-use tools into the hands of the business end user is still not achieved.
Neural networks rate high for accurate models that provide good return on investment, but rate low in terms of automation and clarity, making them more difficult to deploy across the enterprise.

Where to use Neural Networks
Neural networks are used in a wide variety of applications. They have been used in all facets of business, from detecting the fraudulent use of credit cards and credit risk prediction to increasing the hit rate of targeted mailings. They also have a long history of application in other areas, such as the military, for the automated driving of an unmanned vehicle at 30 mph on paved roads, and biological simulations, such as learning the correct pronunciation of English words from written text.

Neural Networks for Clustering
Neural networks of various kinds can be used for clustering and prototype creation. The Kohonen network described in this chapter is probably the most common network used for clustering and segmentation of the database. Typically the networks are used in an unsupervised learning mode to create the clusters. The clusters are created by forcing the system to compress the data by creating prototypes, or by algorithms that steer the system toward creating clusters that compete against each other for the records that they contain, thus ensuring that the clusters overlap as little as possible.
ways:
1. The neural network is packaged up into a complete solution Business Score Card for Neural Networks
such as fraud prediction. This allows the neural network to Data mining measure Description
be carefully crafted for one particular application, and once it Automation Neural networks are often repre
has been proven successful, it can be used over and over sented as automated data mining
again without requiring a deep understanding of how it techniques. While they are very
works. powerful at building predictive
models, they do require significant
2. The neural network is packaged up with expert consulting data preprocessing and a good
services. Here trusted experts who have a track record of understanding and definition of
success deploy the neural network. The experts either are able the prediction target. Usually
to explain the models or trust that the models do work.
108
normalizing predictor values between 0.0 and the picture looks like, but certainly describing it in terms of
109
TABLE 25.1 Applications Score Card for Neural Networks Text Because of the large number of possible
DATA WAREHOUSING AND DATA MINING
110
DATA WAREHOUSING AND DATA MINING
Fig. 25.4
111
DATA WAREHOUSING AND DATA MINING
LESSON 26
NEURAL NETWORKS
Structure The network has 2 binary inputs, I0 and I1 and one binary
• Objective output Y. W0 and W1 are the connection strengths of input 1
and input 2, respectively. Thus, the total input received at the
• What Is A Neural Network?
processing unit is given by
• Hidden nodes are like trusted advisors to the output nodes
W0I0 + W1I1 - Wb,
• Design decisions in architecting a neural network
where
• How does the neural network resemble the human brain?
Wb is the threshold (in another notational convention, it is
• Applications Of -Neural Networks
viewed as the bias). The output Y takes on the value 1, if W0I0
• Data Mining Using NN: A Case Study + W1I1 - Wb, > 0 and, otherwise, it is 0 if
Objective W0I0 + Will - Wb £ O.
The main objective of this lesson is to introduce you the But the model, known as perceptron, was far from a true model
principles of neural computing. of a biological neuron as, for a start, the biological neuron’s
What is a Neural Network? output is a continuous function rather than a step function.
Neural networks are a different paradigm for computing, which This model also has a limited computational capability as it
draws its inspiration from neuroscience. The human brain represents only a linear-separation. For two classes of inputs,
consists of a network of neurons, each of which is made up of which are linearly separable, we can find the weights such that
a number of nerve fibres called dendrites, connected to the cell the network returns 1 as output for one class and 0 for another
body where the cell nucleus is located. The axon is a long, single class.
fibre that originates from the cell body and branches near its end There have been many improvements on this simple model
into a number of strands. At the ends of these strands are the and many architectures have been presented in recently. As a first
transmitting ends of the synapses that connect to other step, the threshold function or the step function is replaced by
biological neurons through the receiving ends of the synapses other more general, continuous functions called activation.
found on the dendrites as well as the cell body of biological
neurons. A single axon typically makes thousands of synapses
with other neurons. The transmission process is a complex
chemical process, which effectively increases or decreases the
electrical potential within the cell body of the receiving neuron.
When this electrical potential reaches a threshold value (action
potential),itentersitsexcitatorystateandissaid tofire. It is the
connectivity of the neuron that gives these simple ‘devices’ their
real power.
Artificial neurons (or processing elements, PE) are highly
simplified models of biological neurons. As in biological
neurons, an artificial neuron has a number of inputs, a cell body Figure 26.1 a simple perceptron
(most often consisting of the summing node and the transfer
function), and an output, which can be connected to a number
of other artificial neurons. Artificial neural networks are densely
interconnected networks of PEs, together with a rule (learning
rule) to adjust the strength of the connections between the
units in response to externally supplied data.
The evolution of neural networks as a new computational
model originates from the pioneering work of McCulloch and
Pitts in 1943. They suggested a simple model of a neuron that
connoted the weighted sum of the inputs to the neuron and an
output of 1 or 0, according to whether the sum was over a
threshold value or not. A 0 output would correspond to the
inhibitory state of the neuron, while a 1 output would Figure 26.2 A Typical Artificial Neuron with Activation
correspond to the excitatory state of the neuron. Consider a Function
simple example illustrated below.
112
functions. Figure 26.2 illustrates the structure of a node (PE) 1. It is difficult to trust the prediction of the neural network if
with an activation function. For this particular node, n weighted the meaning of these nodes is not well understood.
inputs (denoted W;, i = 1,..., n) are combined via a combination 2. Since the prediction is made at the output layer and the
function that often consists of a simple summation. A transfer difference between the prediction and the actual value is
function then calculates a corresponding value, the result calculated there, how is this error cor-rection fed back
yielding a single output, usually between 0 and 1. Together, the through the hidden layers to modify the link weights. Those
combination function and the transfer function make up the connect them?
activation function of the node. The meaning of these hidden nodes is not necessarily well
Three common transfer functions are the sigmoid, linear and understood but sometimes after the fact they can be studied to
hyperbolic functions. The sigmoid function (also known as the see when they are active (have larger numeric values) and when
logistic function) is very widely used and it produces values they are not and derive some meaning from them. In some of
between 0 and 1 for any input from the combination function. the earlier neural networks were used to learn the family trees of
The sigmoid function is given by (the subscript n identifies a two different families-one was Italian and one was English and
PE): the network was trained to take as inputs either two people and
return their rela-tionship (father, aunt, sister, etc.) or given one
person and a relationship to return the. Other person. After
training, the units in one of the hidden layers were exam med
Y= to see If there was any discernible explanation as to their role in
the prediction. Several of the nodes did seem to have specific
and under-standable purposes. One, for instance, seemed to
Note that the function is strictly positive and defined for all break up the input records (people) into either Italian or
values of the input. When plotted, the graph takes on a English descent, another unit encoded for which generation a
sigmoid shape, with an inflection point at (0, .5) in the Carte- person belonged to, and another encoded for the branch of the
sian plane. The graph (Figure 26.3) plots the different values of family that the person came from. The neutral network to aid in
S as the input varies from -10 to 10. predictor automatically extracted each of these nodes.
Individual nodes are linked together in different ways to create Any interpretation of the meaning of the hidden nodes needs
neural networks. In a feed-forward network, the connections to be done after the fact-after the network has been trained, and
between layers are unidirectional from input to output. We it is not always possible to determine a logical description for
discuss below two different architectures of the feed-forward the particular function for the hidden nodes. The second
network, Multi-Layer Perceptron and Radial-Basis Function. problem with the hidden nodes is perhaps more serious (if it
hadn’t been solved, neural networks wouldn’t work). Luckily it
has been solved.
The learning procedure for the neural network has been defined
to work for the weights in the links connecting the hidden layer.
A good analogy of how this works would be a military
operation in some war where there are many layers of com-
mand with a general ultimately responsible for making the
decisions on where to advance and where to retreat. Sev-eral
lieutenant generals probably advise the general, and several
major generals, in turn, probably advise each lieutenant general.
This hierarchy continues downward through colonels and
privates at the bottom of the hierarchy.
This is not too far from the structure of a neural network with
several hid-den layers and one output node. You can think of
the inputs coming from the hidden nodes as advice. The link
weight corresponds to the trust that generals have in their
Figure 26.3 Sigmoid Functions advisors. Some trusted advisors have very high weights and
some advisors may not be trusted and, in fact, have negative
Hidden nodes are like trusted advisors to the output
weights. The other part of the advice from the advisors has to
nodes
do with how competent the particular
The meanings of the input nodes and the output nodes are
usually pretty well understood-and are usually defined by the Advisor is for a given situation. The general may have a trusted
end user with respect to the par-ticular problem to be solved advisor, but if that advisor has no expertise in aerial invasion
and the nature and structure of the database. The hidden and the situation in question involves the air force, this advisor
nodes, however, do not have a predefined meaning and are may be very well trusted but the advisor per-sonally may not
determined by the neural network as it trains. This poses two have any strong opinion one way or another.
problems:
In this analogy the link weight of a neural network to an
output unit is like the trust or confidence that commanders
have in their advisors and the actual node value represents how
strong an opinion this particular advisor has about this TABLE 26.1 Neural Network Nodes.
particular situation. To make a decision, the general considers
Advisor's Advisor's Change to
how trustworthy and valuable the advice is and how knowl- General's trust
edgeable and confident all the advisors are in making their Recommendation Confidence General’s trust
suggestions; then, taking all this into account, the general makes
the decision to advance or retreat. High Good High Great increase
In the same way, the output node will make a decision (a
High Good Low Increase
prediction) by tak-ing into account all the input from its
advisors (the nodes connected to it). In the case of the neural
High Bad High Great decrease
network multiplying the link weight by the output value of the
node and summing these values across all nodes reach this
High Bad Low Decrease
decision. If the prediction is incorrect, the nodes that had the
most influence on making the decision have their weights Low Good High Increase
modified so that the wrong prediction is less likely to be made
the next time. Low Good Low Small increase
This learning in the neural network is very similar to what
happens when the general makes the wrong decision. The Low Bad High Decrease
confidence that the general has in all those advisors who gave
the wrong recommendation is decreased-and all the more so for Low Bad Low Small decrease
those advisors who were very confident and vocal in they’re rec-
ommendations. On the other hand, any advisors who were
making the correct recommendation but whose input was not
taken as seriously would be taken more seriously the next time. *The link weights in a neural network are analogous to the
Likewise, any advisors who were reprimanded for giving the confidence that gen-erals might have. In there trusted advisors.
wrong advice to the general would then go back to their own diction from the neural network) through the hidden layers and
advisors and determine which of them should have been to the input layers. At each level the link weights between the
trusted less and whom should have been listened to more layers are updated so as to decrease the chance of making the
closely in rendering the advice or recommendation to the same mistake again.
general. The changes generals should make in listening to their
Design decisions in architecting a neural network
advisors to avoid the same bad decision in the future are shown
Neural networks are often touted as self-learning automated
in Table 26.1.
techniques that simplify the analysis process. The truth is that
This feedback can continue in this way down throughout the there still are many decisions to be made by the end user in
organizational each level, giving increased emphasis to those designing the neural network even before training begins. If
advisors who had advised cor-rectly and decreased emphasis to these decisions are not made wisely, the neural network will
those who had advised incorrectly. In this way the entire likely come up with a sub optimal model. Some of the
organization becomes better and better at supporting the gen- decisions that need to be made include
eral in making the correct decision more of the time. • How will predictor values be transformed for the input
A very similar method of training takes place in the neural nodes? Will normal-ization be sufficient? How will
network. It is called back propagation and refers to the propaga- categoricals be entered?
tion of the error backward from the output nodes (where the • How will the output of the neural network be interpreted?
error is easy to determine as the difference between the actual
prediction value from the training database and the pre- • How many hidden layers will there be?
• How will the nodes be connected? Will every node be
connected to every other node, or will nodes just be
connected between layers?
• How many nodes will there be in the hidden layer? (This can
have an impor-tant influence on whether the predictive
model is over fit to the training database.)
• How long should the network be trained for? (This also has
an impact on whether the model over fits the data.)
Depending on the tool that is being used, these decisions may
be (explicit, where the user must set some parameter value, or
they may be decided for the user because the particular neural
• Distant neurons seemed to inhibit each other
115
How does the neural network resemble the human • Marketing:
DATA WAREHOUSING AND DATA MINING
116
differentiation presented to the SOM’s input. For example,
Notes
After the network is trained it is used for one final pass through
the input data set, in which the weights are not adjusted. This
provides the final classification of each input data tuple into a
single node in the l0 x 10 grid. The output is taken from the
coordinate layer as an (x, y) pair. The output of the SOM is a
population distribution of tuples with spatial significance
(Figure 26.6). This grid displays the number of tuples that were
classified into each Kohorien layer node (square) during testing;
for example, Square (1,1) contains 180 tuples.
Upon examination of the raw data within these clusters, one
finds similarities between the tuples which are indicative of
medical relationships or dependencies. Numerous hypotheses
can be made regarding these relationships, many of which were
not known a priori. The SOM groups together tuples in each
square according to their similarity. The only level at which the
SOM can detect similarities between tuples is at the root level of
each of the three subspace trees, since this is the level of
117
DATA WAREHOUSING AND DATA MINING
LESSON 27
ASSOCIATION RULES AND GENETIC ALGORITHM
Structure Þ Juice has only 25% support. Another term for support is
preva-lence of the rule.
• Objective
• Association Rules To compute confidence we consider all transactions that include
items in LHS. The confidence for the association rule LHS
• Basic Algorithms for Finding Association Rules
ÞRHS is the percentage (fraction) of such transactions that also
• Association Rules among Hierarchies include RHS. Another term for confidence is strength of the
• Negative Associations rule.
• Additional Considerations for Association Rules For MilkÞJuice, the confidence is 66.7% (meaning that, of three
• Genetic Algorithms (GA) transactions in which milk occurs, two contain juice) and
• Crossover breadÞjuice has 50% confidence (meaning that one of two
transactions containing bread also contains juice.)
• Mutation
As we can see, support and confidence do not necessarily go
• Problem-Dependent Parameters
hand in hand. The goal of mining association rules, then, is to
• Encoding generate all possible rules that exceed some mini-mum user-
• The Evaluation Step specified support and confidence thresholds. The problem is
thus decomposed into two subproblems:
• Data Mining Using GA
1. Generate all item sets that have a support that exceeds the
Objective threshold. These sets of items are called large itemsets. Note
The objective of this lesson is to introduce you with data that large here means large support.
mining techniques like association rules and genetic algorithm. 2. For each large item set, all the rules that have a minimum
Association Rules confidence are gener-ated as follows: for a large itemset X and
One of the major technologies in data mining involves the Y C X, let Z = X – Y; then if support (X) /support (Z) Þ
discovery of association rules. The database is regarded as a minimum confidence, the rule Z ÞY (Le., X - Y Þ Y) is a
collection of transactions, each involving. Set of items. A valid rule. [Note: In the previous sentence, Y C X reads, “Y
common example is that of market-basket data. Here the is a subset of X.”]
market basket corresponds to what a consumer buys in a
supermarket during one visit. Consider four such transactions Generating rules by using all large itemsets and their supports is
in a random sample: relatively straightforward. However, discovering all large
Transaction-id Time Items- Brought itemsets together with the value for their support is a major
101 6:35 milk, bread, juice problem if the cardinality of the set of items is very high. A
typical supermarket has thousands of items. The number of
792 7:38 milk, juice distinct itemsets is 2m, where m is the number c£ items, and
1130 8:05 milk, eggs counting support for all possible itemsets becomes very
1735 8:40 bread, cookies, coffee computation-intensive.
An association rule is of the form X=>Y, where X = {x1, x2... To reduce the combinatorial search space, algorithms for finding
xn}, and Y = {y1, y2...ym} are sets, of items, with xi and Yj being association rule have the following properties:
distinct items for all i and j. This association states that if a • A subset of a large itemset must also be large (i.e., each
customer buys X, he or she is also likely to buy Y. In general, subset of a large itemset: L exceeds the minimum required
any association rule has the form LHS (left-hand side) Þ RHS support).
(right-hand side), where LHS and RHS are sets of items. • Conversely, an extension of a small itemset is also small
Association rules should supply both support and confidence. (implying that it does 00: have enough support).
The support for the rule LHSÞRHS is the percentage of The second property helps in discarding an itemset from further
transactions that hold all of the items in the union, the set LHS consideration J extension, if it is found to be small.
U RHS. If the support is low, it implies that there is no
overwhelming evidence that items in LHS U RHS occur Basic Algorithms for Finding Association
together, because the union happens in only a small fraction of Rules
transactions. The rule MilkÞJuice has 50% support, while Bread The current algorithms that find large itemsets are designed to
work as follows:
118
1. Test the support for itemsets of length 1, called l-itemsets,
119
they predominantly buy Topsy and not Joke and not Wakeup Association rules can be generalized for data mining purposes
DATA WAREHOUSING AND DATA MINING
that would be interesting. This is so because we would Nor- although the notion of itemsets was used above to discover
mally expect that if there is a strong association between Days association rules, almost any data in the standard relational
and Topsy, there should also be such a strong association form with a number of attributes can be used. For example,
between Days and Joke or Days and Wakeup. consider blood-test data with attributes like hemoglobin, red
In the frozen yogurt and bottled water groupings in Fig 27.1, blood cell count, white blood cell count, blood--sugar, urea, age
suppose the Reduce versus Healthy brand division is 80-20 and of patient, and so on. Each of the attributes can be divided
the Plain and Clear brands division 60-40 among respective into ranges, and the presence of an attribute with a value can be
categories. This would give a joint probability of Reduce frozen considered equivalent to an item. Thus, if the hemoglobin
yogurt. attribute is divided into ranges 0-5, 6-7, 8-9, 10-12, 13-14, and
above 14, then we can. Consider them as items HI, H2... H7.
Being purchased with Plain bottled water as 48% among the
Then a specific hemoglo-bin value for a patient corresponds to
transactions containing a frozen yogurt and bottled water. If
one of these seven items being present. The mutual exclusion
this support, however, is found to be only 20%, that would
among these hemoglobin items can be used to some advantage
indicate a significant negative association among Reduce yogurt
in the scanning for large itemsets. This way of dividing variable
and Plain bottled water; again, that would be interesting.
values into ranges allows us to apply the association-rule
The problem of finding negative association is important in the machinery to any database for mining purposes. The ranges
above situations given the domain knowledge in the form of have to be determined from domain knowledge such as the
item generalization hierarchies (that is, the beverage given and relative importance of each of the hemo-globin values.
desserts hierarchies shown in Fig 27.1), the existing positive
associations (such as between the frozen-yogurt and bottled Genetic Algorithm
water group’s), and the distribution of items (such as the name Genetic algorithms (GA), first proposed by Holland in 1975,
brands within related groups). Recent work has been reported are a class of computational models that mimic natural
by the database group at Georgia Tech in this context (see evolution to solve problems in a wide variety of domains.
bibliographic notes). The Scope of dis-covery of negative Genetic algorithms are particularly suitable for solving complex
associations is limited in terms of knowing the item hierarchies optimization problems and for applications that require
and dis-tributions. Exponential growth of negative associa- adaptive problem-solving strategies. Genetic algorithms are
tions remains a challenge. search algorithms based on the mechanics of natural genetics,
i.e., operations existing in nature. They combine a Darwinian
Additional Considerations for ‘survival of the fittest approach’ with a structured, yet random-
Association Rules ized, information exchange. The advantage is that they can
For very large datasets, one way to improve efficiency is by search complex and large amount of spaces efficiently and locate
sampling. If a representative sample can be found that truly near-optimal solutions pretty rapidly. GAs was developed in the
repre-sents the properties of the original data, then most of the early 1970s by John Holland at the University of Michigan
rules can be found. The problem then reduces to one of
(Adaptation in Natural and Artificial Systems, 1975).
devising a proper sampling procedure. This process has the
potential danger of discovering some false positives (large item A genetic algorithm operates on a set of individual elements
sets that are not truly large) as well as hawing false negatives by (the population) and there is a set of biologically inspired
missing some large itemsets and corresponding association operators that can change these individuals. According to the
rules. evolutionary theory_ only the more suited individuals in the
population are likely to survive and to generate offspring, thus
transmitting their biological heredity to new generations.
In computing terms, genetic algorithms map strings of
numbers to each potential solution. Each solution becomes an
Fig: 27.1 “individual in the population, and each string becomes a
representation of an individual. There should be a way to derive
Mining association rules in real-life databases is further compli- each individual from its string representation. The genetic
cated by the following factors. algorithm then manipulates the most promising strings in its
The cardinality of itemsets in most situations is extremely large, search for an improved solution. The algorithm operates
and the volume of transactions is very high as well. Some through a simple cycle:
operational databases in retailing and commu-nication indus- • Creation of a population of strings.
tries collect tens of millions of transactions per day. • Evaluation of each string.
Transactions show variability in such factors as geographic
• Selection of the best strings.
location and seasons, making sampling difficult. Item classifica-
tions exist along multiple dimensions. Hence, driving the • Genetic manipulation to create a new population of strings.
discovery process with domain knowledge, particularly for Figure 27.2 shows how these four stages interconnect. Each
negative rules, is extremely difficult. Quality of data is variable; cycle produces a new generation of possible solutions (indi-
significant problems exist with missing, erroneous, con-flicting, viduals) for a given problem. At the first stage, a population of
as well as redundant data in many industries. possible solutions is created as a starting point. Each individual
120
in this population is enc6ded into a string (the chromosome) to operation of cutting and combining strings from a father and a
Crossover
Crossover is one of the genetic operators used to recombine the
population’s genetic material. It takes two chromosomes and
swaps part of their genetic information to produce new
chromosomes. As Figure 27.3 shows, after the crossover point
has been randomly chosen, portions of the parent’s chromo-
some (strings). Parent 1 and Parent 2 are combined to produce Figure 27.3 Crossover
the new offspring, Son.
Genetic Algorithms In detail The selection process associated with the recombination made
Genetic algorithms (GAs) are a class of randomized search by the crossover, assures that special genetic structures, called
processes capable of adaptive and robust search over a wide building blocks, are retained for future generations. These
range of search space to pologies. Modeled after the adaptive building blocks represent the fittest genetic structures in the
emergence of biological species from evolutionary mechanisms, population.
and introduced by Holland GAs have been successfully applied
in such diverse. Fields such as image analysis, scheduling, and Mutation
engineering design. The mutation operator introduces new genetic structures in the
population by randomly changing some of its building blocks.
Genetic algorithms extend the idea from human genetics of the Since the modification is totally random and thus not related to
four-letter alphabet loosed on the A, C, T, G nucleotides) of any previous genetic structures present in the population, it
the human DNA code. The construction of a genetic algorithm creates different structures related to other sections of the search
involves devising an alphabet that encodes the solutions to the space. As shown in Figure 27.4, the mutation is implemented
deci-sion problem in terms of strings of that alphabet. Strings by occasionally altering a random bit from a chromosome
are equivalent to individuals. A fitness function defines which (string). The figure shows the operator being applied to the
solutions, can survive and which cannot. The ways in which fifth element of the chromosome.
solutions can be combined are patterned after the crossover
121
A number of other operators, apart from crossover and Data Mining using GA
DATA WAREHOUSING AND DATA MINING
mutation, have been introduced since the basic model was The application of the genetic algorithm in the context of data
proposed. They are usually versions of the recombination and mining is generally for the tasks of hypothesis testing and
genetic alterations processes adapted to the constraints of a refinement, where the user poses some hypothesis and the
particular problem. Examples of other operators are: inversion, system first evaluates the hypothesis and then seeks to refine it.
dominance and genetic edge recombination. Hypothesis refinement is achieved by “seeding” the system with
the hypothesis and then allowing some or all parts of it to vary.
One can use a variety of evaluation functions to determine the
fitness of a candidate refinement. The important aspect of the
GA application is the encoding of the hypothesis and the
evaluation function for fitness.
Another way to use GA for data mining is to design hybrid
techniques by blending one of the known techniques with GA.
For example, it is possible to use the genetic algorithm for
optimal decision tree induction. By randomly generating
different samples, we can build many decision trees using any
Figure 27.4 Mutation of the traditional techniques. But we are not sure of the
optimal tree. At this stage, the GA is very useful in deciding on
the optimal tree and optimal splitting attributes. The genetic
Problem-Dependent Parameters algorithm evolves a population of biases for the decision tree
This description of the GA’ s computational model reviews the induction algorithm. We can use a two-tiered search strategy. On
steps needed to create the algorithm. However, a real implemen- the bottom tier, the traditional greedy strategy is performed
tation takes into account a number of problem-dependent through the space of the decision trees. On the top tier, one can
parameters. For instance, the offspring produced by genetic have a genetic search in a space of biases. The attribute selection
manipulation (the next population to be evaluated) can either parameters are used as biases, which are used to modify the
replace the whole population (generational approach) or just its behavior of the first tier search. In other words, the GA
less fit members (steady-state approach). The problem con- controls the preference for one type of decision tree over
straints will dictate the best option. Other parameters to be another.
adjusted are the population size, crossover and mutation rates, An individual (a bit string) represents a bias and is evaluated by
evaluation method, and convergence criteria. using testing data subsets. The “fitness” of the individual is the
average cost of classification of the decision tree. In the next
Encoding
generation, the population is replaced with new individuals.
Critical to the algorithm’s performance is the choice of underly-
The new individuals are generated from the previous genera-
ing encoding for the solution of the optimization problem (the
tion, using mutation and crossover. The fittest individuals in
individuals or the population). Traditionally, binary encoding
the first generation have the most offspring in the second
has being used because they are easy to implement. The
generation. After a fixed number of generations, the algorithm
crossover and mutation operators described earlier are specific to
halts and its output is the decision tree the determined by the
binary encoding. When symbols other than 1 or 0 are used, the
fittest individual.
crossover and mutation operators must be tailored accordingly.
Exercises
The Evaluation Step
1. Write short notes on
The evaluation step in the cycle, shown in Figure 27.2, is more
closely related to the actual application the algorithm is trying to a. Mutation
optimize. It takes the strings representing the individuals of the b. Negative Associations
population and, from them, creates the actual individuals to be c. Partition algorithm
tested the way the individuals are coded as strings will depend
2. Discuss the importance of association rules.
on what parameters one is tying to optimize and the actual
structure of possible solutions (individuals). After the actual 3. Explain the basic Algorithms for Finding Association Rules.
individuals have been created, they have to be tested and scored. 4. Discuss the importance of crossover in Genetic algorithm.
These two tasks again are closely related to the actual system 5. Explain Association Rules among Hierarchies with example.
being optimized. The testing depends on what characteristics
6. Describe the principle of Genetic algorithm and discuss its
should be optimized and the scoring. The production of a
suitability to data mining.
single value representing the fitness of an individual depends
on the relative importance of each different characteristic value 7. Discuss the salient features of the genetic algorithm. How
can a data-mining problem be an optimization problem?
obtained during testing.
How do you use GA for such cases?
122
Reference
Notes
123
LESSON 28 CHAPTER 6
ONLINE ANALYTICAL PROCESSING, OLAP
NEED FOR OLAP MULTIDIMENSIONAL
DATA MODEL
• Objective system;
• Introduction • Multidimensional conceptual view
• On-line Analytical processing • Transparency
• What is Multidimensional (MD) data and when does it • Accessibility
become OLAP? • Consistent reporting performance
• OLAP Example • Client/server architecture
• What is OLAP? • Generic dimensionality
• Who uses OLAP and WHY? • Dynamic sparse matrix handling
• Multi-Dimensional Views • Multi-user support
• Complex Calculation capabilities • Unrestricted cross dimensional operations
• Time intelligence • Intuitative data manipulation
Objective • Flexible reporting
At the end of this lesson you will be able to • Unlimited dimensions and aggregation levels
• Understand the significance of OLAP in Data mining An alternative definition of OLAP has been supplied by Nigel
• Study about Multi-Dimensional Views, Complex Calculation Pendse who unlike Codd does not mix technology prescrip-
capabilities, and Time intelligence tions with application requirements. Pendse defines OLAP as,
Fast Analysis of Shared Multidimensional Information,
Introduction which means:
This lesson focuses on the need of Online Analytical Process- Fast in that users should get a response in seconds and so
ing. Solving modern business problems such as market analysis doesn’t lose their chain of thought;
and financial forecasting requires query-centric database schemas
that are array-oriented and multidimensional in nature. These Analysis in that the system can provide analysis functions in an
business problems are characterized by the need to retrieve large intuitative manner and that the functions should supply
numbers of records from very large data sets and summarize business logic and statistical analysis relevant to the users
them on the fly. The multidimensional nature of the problems application;
it is designed to address is the key driver for OLAP. Shared from the point of view of supporting multiple users
In this lesson i will cover all the important aspects of OLAP. concurrently;
Multidimensional as a main requirement so that the system
On Line Analytical Processing
supplies a multidimensional conceptual view of the data
A major issue in information processing is how to process including support for multiple hierarchies;
larger and larger databases, containing increasingly complex data,
Information is the data and the derived information required
without sacrificing response time. The client/server architecture
by the user application.
gives organizations the opportunity to deploy specialized
servers, which are optimized for handling specific data manage- One question that arises is,
ment problems. Until recently, organizations have tried to target What is Multidimensional (MD) data and when does
relational database management systems (RDBMSs) for the it become OLAP?
complete spectrum of database applications. It is however It is essentially a way to build associations between dissimilar
apparent that there are major categories of database applications pieces of information using predefined business rules about
which are not suitably serviced by relational database systems. the information you are using. Kirk Cruikshank of Arbor
Oracle, for example, has built a totally new Media Server for Software has identified three components to OLAP, in an issue
handling multimedia applications. Sybase uses an object- of UNIX News on data warehousing;
oriented DBMS (OODBMS) in its Gain Momentum product, • A multidimensional database must be able to express
which is designed to handle complex data such as images and complex business calculations very easily. The data must be
audio. Another category of applications is that of on-line referenced and mathematics defined. In a relational system
analytical processing (OLAP). OLAP was a term coined by E F there is no relation between line items, which makes it very
Codd (1993) and was defined by him as; difficult to express business mathematics.
The dynamic synthesis, analysis and consolidation of large
volumes of multidimensional data
124
• Intuitive navigation in order to ‘roam around’ data, which • In addition to answering who and what questions OLAPs
125
• Users queries should not be inhibited by the complex to The OLAP performance benchmark contains how time is used
DATA WAREHOUSING AND DATA MINING
form a query or receive an answer to a query. in OLAP applications. Eg the forecast calculation uses this year’s
• The benchmark for OLAP performance investigates a servers vs. last year’s knowledge, year-to-date knowledge factors.
ability to provide views based on queries of varying
complexity and scope.
• Basic aggregation on some dimensions
• More complex calculations are performed on other
dimensions
• Ratios and averages
• Variances on sceneries
• A complex model to compute forecasts
• Consistently quick response times to these queries are
imperative to establish a server’s ability to provide MD views
of information.
Complex Calculations
• The ability to perform complex calculations is a critical test
for an OLAP database.
• Complex calculations involve more than aggregation along a
hierarchy or simple data roll-ups, they also include percentage
of the total share calculations and allocations utilising
hierarchies from a top-down perspective.
• Further calculations include:
• Algebraic equations for KPI
• Trend algorithms for sales forecasting
• Modelling complex relationships to represent real
world situations
• OLAP software must provide powerful yet concise
computational methods.
• The method for implementing computational methods
must be clear and non-procedural
• Its obvious why such methods must be clear but they must
also be non-procedural otherwise changes can not be done in
a timely manner and thus eliminate access to JIT
information.
• In essence OLTP systems are judged on their ability to collect
and manage data while OLAP systems are judged on their
ability to make information from data. Such abilities involves
the use of both simple and complex calculations
Time Intelligence
• Time is important for most analytical applications and is a
unique dimension in that it is sequential in character. Thus
true OLAP systems understand the sequential nature of
time.
• The time hierarchy can be used in a different way to other
hierarchies eg sales for june or sales for the first 5 months of
2000.
• Concepts such as year to date must be easily defined
• OLAP must also understand the concept of balances over
time. Eg in some cases, for employees, an average is used
while in other cases an ending balance is used.
126
DATA WAREHOUSING AND DATA MINING
127
DATA WAREHOUSING AND DATA MINING
Exercise
1. Write short notes on:
• Multidimensional Views
• Time Intelligence
• Complex Calculations
2. What do you understand by Online Analytical Processing
(OLAP)? Explain the need of OLAP.
3. Who uses OLAP and why?
4. Correctly contrast the difference between OLAP and Data
warehouse.
5. Discuss various applications of OLAP.
Notes
128
DATA WAREHOUSING AND DATA MINING
LESSON 29
OLAP VS. OLTP, CHARACTERISTICS OF OLAP
129
few products, so the cells that relate sales channels to products • Accessibility: OLAP as a mediator, Dr. Codd essentially
DATA WAREHOUSING AND DATA MINING
will be mostly empty and therefore sparse. By optimizing space describes OLAP engines as middleware, sitting
utilization, OLAP servers can minimize physical storage heterogeneous data sources & OLAP front-end.
requirements, thus making it possible to analyse exceptionally • Batch Extraction vs. Interpretive: this rule effectively requires
large amounts of data. It also makes it possible to load more that products offer both their own staging database for
data into computer memory, which helps to significantly OLAP data as well as offering live access to external data.
improve performance by minimizing physical disk I/O.
• OLAP analysis models: Dr. Codd requires that OLAP
In conclusion OLAP servers logically organize data in multiple products should support all four-analysis models that
dimensions, which allows users to quickly, and easily analyze describes in his white paper model
complex data relationships. The database itself is physically
• Client server architecture: Dr. Codd requires not only that the
organized in such a way that related data can be rapidly retrieved
product should be client/server but that the server
across multiple dimensions. OLAP servers are very efficient
component of an OLAP product should be sufficiently
when storing and processing multidimensional data. RDBMSs
intelligent that various client can be attached with minimum
have been developed and optimized to handle OLTP applica-
effort and programming for integration.
tions. Relational database designs concentrate on reliability and
transaction processing speed, instead of decision support need. • Transparency: full compliance means that a user should be
The different types of server can therefore benefit a broad range able to get full value from an OLAP engine and not even be
of data management applications. aware of where the data ultimately comes from. To do this
products must allow live excess to heterogeneous data
Characteristics of OLAP: FASMI sources from a full function spreadsheet add-in, with the
Fast – means that the system targeted to deliver most re- OLAP server in between.
sponses to user within about five second, with the simplest • Multi-user support: Dr. Codd recognizes that OLAP
analysis taking no more than one second and very few taking applications are not all read-only, & says that, to be regarded
more than 20 seconds. as strategic, OLAP tools must provide concurrent access,
Analysis – means that the system can cope with any business integrity & security.
logic and statistical analysis that it relevant for the application
Special features
and the user, keep it easy enough for the target user. Although
some pre programming may be needed we do not think it • Treatment of non-normalize data: this refers to the
acceptable if all application definitions have to be allow the user integration between an OLAP engine and denormalized
to define new adhoc calculations as part of the analysis and to source data.
report on the data in any desired way, without having to • Storing OLAP Result: keeping them separate from source
program so we exclude products (like Oracle Discoverer) that do data. This is really an implementation rather than a product
not allow the user to define new adhoc calculation as part of the issue. Dr. Codd is endorsing the widely held view that read-
analysis and to report on the data in any desired product that do write OLAP applications should not be implemented directly
not allow adequate end user oriented calculation flexibility. on live transaction data, and OLAP data changes should be
Share – means that the system implements all the security kept distinct from transaction data.
requirements for confidentiality and, if multiple write access is • Extraction of missing value: all missing value are to be
needed, concurrent update location at an appropriated level not distinguished from zero values.
all applications need users to write data back, but for the • Treatment of Missing values: all missing values to be
growing number that do, the system should be able to handle ignored by the OLAP analyzer regardless of their source.
multiple updates in a timely, secure manner.
Multidimensional – is the key requirement. OLAP system
must provide a multidimensional conceptual view of the data,
including full support for hierarchies, as this is certainly the
most logical way to analyze business and organizations.
Information – are all of the data and derived information
needed? Wherever it is and however much is relevant for the
application. We are measuring the capacity of various products
in terms of how much input data they can handle, not how
many gigabytes they take to store it.
Basic Features of OLAP
• Multidimensional Conceptual view: We believe this to be the
central core of OLAP
• Initiative data manipulation: Dr. Codd prefers data
manipulation to be done through direct action on cells in the
view w/o resource to menus of multiple actions.
130
DATA WAREHOUSING AND DATA MINING
Exercise
1. Write short notes on
• Client Server Architecture
• Slicing and Dicing
• Drill down
2. Correctly contrast and Compare OLAP and OLTP with
example.
3. What is FASMI? Explain in brief.
4. Explain various Basic Features of OLAP
5. Discuss the importance of Multidimensional View in OLAP.
Explain with an example.
131
DATA WAREHOUSING AND DATA MINING
LESSON 30
MULTIDIMENSIONAL VERSES MULTIRELATIONAL OLAP,
FEATURES OF OLAP
132
GEOGRAPHY may contain country, state, city, etc. Having the (e.g., find account balance, find grade in course)
133
9. Unrestricted Cross-Dimensional Operations • Shared
DATA WAREHOUSING AND DATA MINING
All forms of calculations should be allowed across all The system implements all the security requirements for
dimensions. confidentiality. Also, if multiple write access is needed, the
10.Intuitive Data Manipulation system provides concurrent update locking at an appropriate
level.
The users should be able to directly manipulate the data
without interference from the user interface. • Multidimensional
11.Flexible Reporting This is the key requirement. The system should provide a
multidimensional conceptual view of the data, including
The user should be able to retrieve any view of data required
support for hierarchies and multiple hierarchies.
and present it in any way that they require.
12.Unlimited Dimensions and Aggregation Levels
• Information
The system should be able to hold all data needed by the
There should be no limit to the number of dimensions or
applications. Data sparsity should be handled in an efficient
aggregation levels.
manner.
Six additional features of an OLAP system
1. Batch Extraction vs. Interpretive
OLAP systems should offer both their own multi-
dimensional database as well as live access to external data.
This describes a hybrid system where users can transparently
reach through to detail data.
2. OLAP Analysis Models
OLAP products should support all four data analysis models
described above (Categorical, Exegetical, Contemplative, and
Formulaic)
3. Treatment of Non-Normalized Data
OLAP systems should not allow the user to alter de-
normalized data stored in feeder systems. Another
interpretation is that the user should not be allowed to alter
data in calculated cells within the OLAP database.
4. Storing OLAP Results: keeping them separate from Source
Data
Read-write OLAP applications should not be implemented
directly on live transaction data and OLAP data changes
should be kept distinct from transaction data.
5. Extraction of Missing Values
Missing values should be treated as Null values by the OLAP
database instead of zeros.
6. Treatment of Missing Values
An OLAP analyzer should ignore missing values.
Many people take issue with the rules put forth by Dr. Codd.
Unlike his rules for relational databases, these rules are not
based upon mathematical principles. Because a software
company, Arbor Software Corporation, sponsored his paper,
some members of the OLAP community feel that his rules are
too subjective. Nigel Pendise of the OLAP Report has offered
an alternate definition of OLAP. This definition is based upon
the phrase Fast Analysis of Shared Multidimensional Informa-
tion (FASMI).
• Fast
The system should deliver most responses to users within a
few seconds. Long delays may interfere with the ad hoc
analysis.
• Analysis
The system should be able to cope with any business logic
and statistical analysis that is relevant for the application.
134
DATA WAREHOUSING AND DATA MINING
Exercise
Notes
135
DATA WAREHOUSING AND DATA MINING
LESSON 31
OLAP OPERATIONS
Structure
• Objective
cool mild hot
• Introduction
day 1 0 0 0
• OLAP Operations day 2 0 0 0
• Lattice of cubes, slice and dice operations day 3 0 0 1
• Relational representation of the data cube day 4 0 1 0
• Database management systems (DBMS), Online Analytical day 5 1 0 0
Processing (OLAP) and Data Mining day 6 0 0 0
• Example of DBMS, OLAP and Data Mining: Weather data day 7 1 0 0
day 8 0 0 0
Objective
day 9 1 0 0
The main objective of this lesson is to introduce you with
day 10 0 1 0
various OLAP Operations
day 11 0 1 0
Introduction day 12 0 1 0
In today’s fast-paced, information-driven economy, organiza-
day 13 0 0 1
tions heavily rely on real-time business information to make
day 14 0 0 0
accurate decisions. The number of individuals within an
enterprise who have a need to perform more sophisticated
analysis is growing. With their ever-increasing requirements for
data manipulating tools, end users cannot be already satisfied
with flat grids and a fixed set of parameters for query execution. Lattice of Cubes, Slice and Dice
Operations
OLAP is the best technology that empowers users with
The number of dimensions defines the total number of data
complete ease in manipulating their data. The very moment you
cubes that can be created. Actually this is the number of
replace your common grid with an OLAP interface users will be
elements in the power set of the set of attributes. Generally if
able independently to perform various ad-hoc queries, arbitrarily
we have a set of N attributes, the power set of this set will
filter data, rotate a table, drill down, get desired summaries, and
have 2N elements. The elements of the power set form a
rank. From users’ standpoint, the information system equipped
lattice. This is an algebraic structure that can be generated by
with OLAP-tool gains a new quality; helps not only get
applying intersection to all subsets of the given set. It has a
information but also summarize and analyze it.
bottom element - the set itself and a top element - the empty
From the developer’s point of view, OLAP is an elegant way to set. Here is a part of the lattice of cubes for the weather data
avoid thankless and tedious programming of multiple on-line cube.
and printed reports.
OLAP Operations {}
Assume we want to change the level that we selected for the _____|______
| |
temperature hierarchy to the intermediate level (hot, mild, cool). ... {outlook} {temperature} ...
To do this we have to group columns and add up the values ___________|________
according to the concept hierarchy. This operation is called roll- | |
up, and in this particular case it produces the following cube. ... {temperature,humidity} {outlook,temperature} ...
| |
... ...
cool mild hot ...
week 1 2 1 1 |
week 2 1 3 1 {outlook,temperature,humidity,windy} {time,temperature,humidity,windy}
|____________________________________|
|
{time,outlook,temperature,humidity,windy}
In other words, climbing up the concept hierarchy produces
roll-up’s. Inversely, climbing down the concept hierarchy
expands the table and is called drill-down. For example, the
In the above terms the selection of dimensions actually means
drill down of the above data cube over the time dimension
selection of a cube, i.e. an element of the above lattice.
produces the following:
136
There are two other OLAP operations that are related to the Using this technique the whole data cube can be represented as a
cool hot
day 3 0 1
day 4 0 0
The above table allows us to use an unified approach to
implement all OLAP operations - they all can me implemented
Relational Representation of the Data just by selecting proper rows. For example, the following cube,
Cube can be extracted from the table by selecting the rows that match
The use of the lattice of cubes and concept hierarchies gives us a the pattern (*, ALL, *, ALL, ALL), where * matches all legiti-
great flexibility to represent and manipulate data cubes. mate values for the corresponding dimension except for ALL.
However, a still open question is how to implement all this. An
interesting approach to this based on a simple extension of cool mild hot
standard relational representation used in DBMS is proposed week 1 2 1 1
by Jim Gray and collaborators. The basic idea is to use the value week 2 1 3 1
ALL as a legitimate value in the relational tables. Thus, ALL
will represent the set of all values aggregated over the corre-
sponding dimension. By using ALL we can also represent the
lattice of cubes, where instead of dropping a dimension when
intersecting two subsets, we will replace it with ALL. Then all
cubes will have the same number of dimensions, where their
values will be extended with the val,ue ALL. For example, a
part of the above shown lattice will now look like this:
{ALL,ALL,temperature,ALL,ALL}
__________________|_________________
| |
{ALL,ALL,temperature,humidity,ALL} {ALL,outlook,temperature,ALL,ALL}
137
Database Management Systems (DBMS), Online By querying a DBMS containing the above table we may answer
DATA WAREHOUSING AND DATA MINING
Area DBMS OLAP Data Mining • Which days the humidity was less than 75? {6, 7, 9, 11}
• Which days the temperature was greater than 70? {1, 2, 3, 8,
Knowledge
Extraction of 10, 11, 12, 13, 14}
Summaries, trends discovery of
Task detailed and • Which days the temperature was greater than 70 and the
and forecasts hidden patterns
summary data
and insights humidity was less than 75? The intersection of the above
Type of Insight and two: {11}
Information Analysis
result Prediction
OLAP
Deduction Induction (Build Using OLAP we can create a Multidimensional Model of our
Multidimensional data data (Data Cube). For example using the dimensions: time,
(Ask the the model, apply
Method modeling,
question, verify it to new data, outlook and play we can create the following model.
Aggregation, Statistics
with data) get the result)
9 / 5 sunny rainy overcast
Week 1 0 / 2 2 / 1 2 / 0
Who Who will buy a
What is the average Week 2 2 / 1 1 / 1 2 / 0
purchased mutual fund in
Example income of mutual
mutual funds the next 6
question fund buyers by region Obviously here time represents the days grouped in weeks
in the last 3 months and
by year? (week 1 - days 1, 2, 3, 4, 5, 6, 7; week 2 - days 8, 9, 10, 11, 12, 13,
years? why?
14) over the vertical axis. The outlook is shown along the
horizontal axis and the third dimension play is shown in each
individual cell as a pair of values corresponding to the two
values along this dimension - yes / no. Thus in the upper left
Example of DBMS, OLAP and Data Mining: Weather corner of the cube we have the total over all weeks and all
Data outlook values.
Assume we have made a record of the weather conditions
By observing the data cube we can easily identify some impor-
during a two-week period, along with the decisions of a tennis
tant properties of the data, find regularities or patterns. For
player whether or not to play tennis on each particular day. Thus
example, the third column clearly shows that if the outlook is
we have generated tuples (or examples, instances) consisting of
overcast the play attribute is always yes. This may be put as a
values of four independent variables (outlook, temperature,
rule:
humidity, windy) and one dependent variable (play). See the
textbook for a detailed description. if outlook = overcast then play = yes
We may now apply “Drill-down” to our data cube over the time
DBMS
dimension. This assumes the existence of a concept hierarchy
Consider our data stored in a relational table as follows:
for this attribute. We can show this as a horizontal tree as
follows:
Day Outlook Temperature Humidity Windy Play • time
1 Sunny 85 85 false no • week 1
2 Sunny 80 90 true no • day 1
3 overcast 83 86 false yes • day 2
4 Rainy 70 96 false yes • day 3
5 Rainy 68 80 false yes • day 4
6 Rainy 65 70 true no • day 5
7 overcast 64 65 true yes
• day 6
8 Sunny 72 95 false no
• day 7
9 Sunny 69 70 false yes
10 Rainy 75 80 false yes
• week 2
138
• day 14 These rules show some attribute values sets (the so called item
139
DATA WAREHOUSING AND DATA MINING
Exercise
1. Write short notes on:
o Relational representation of the data cube
o Mining Association Rules
o Slice and dice operations
2. Explain in brief various OLAP Operations.
3. Differentiate between Database management systems
(DBMS), Online Analytical Processing (OLAP) and Data
Mining
4. Explain the difference between DBMS, OLAP and Data
Mining with related example.
Notes
140
DATA WAREHOUSING AND DATA MINING
LESSON 32
CATEGORIZATION OF OLAP TOOLS CONCEPTS USED IN MOLAP/ ROLAP
141
specialized multidimensional data storage with RDBMS
DATA WAREHOUSING AND DATA MINING
Fig.32.3
142
investment in the relational database technology to provide
143
• Remote analysis where users pull subsets of information marketing professionals. The following products are at the core
DATA WAREHOUSING AND DATA MINING
144
Embedded data mining. The Pilot Analysis Server is the first or a script written to a proprietary Web-server API (Le.,
145
Java and ActiveX applications. This approach is for a vendor
DATA WAREHOUSING AND DATA MINING
146
DATA WAREHOUSING AND DATA MINING
Exercise
1. Write short notes on:
• Managed Query environment
• MDDB
2. Explain the following:
• ROLAP
• MOLAP
3. Discuss the difference between Relational OLAP and
Multidimensional OLAP.
4. Explain the basic architecture of OLAP with the help of a
diagram.
5. What are the various OLAP tools available? Explain any one
of them.
Notes
147
“The lesson content has been compiled from various sources in public domain including but not limited to the
internet for the convenience of the users. The university has no proprietary right on the same.”