
Implementing Data Mining

Algorithms in
Microsoft® SQL Server™

WIT Press publishes leading books in Science and Technology.


Visit our website for the current list of titles.
www.witpress.com

WITeLibrary
Home of the Transactions of the Wessex Institute, the WIT electronic-library provides the
international scientific community with immediate and permanent access to individual
papers presented at WIT conferences. Visit the WIT eLibrary at www.witpress.com
Advances in Management Information Series

Objectives of the Series

Information and Communications Technologies have experienced considerable advances in the last few years. The task of managing and analysing ever-increasing amounts of data requires the development of more efficient tools to keep pace with this growth.
This series presents advances in the theory and applications of Management
Information. It covers an interdisciplinary field, bringing together techniques from
applied mathematics, machine learning, pattern recognition, data mining and data
warehousing, as well as their applications to intelligence, knowledge management,
marketing and social analysis. The majority of these applications are aimed at
achieving a better understanding of the behaviour of people and organisations in
order to enable decisions to be made in an informed manner. Each volume in the
series covers a particular topic in detail.

The volumes cover the following fields:

Information
Information Retrieval
Intelligent Agents
Data Mining
Data Warehouse
Text Mining

Competitive Intelligence
Customer Relationship Management
Information Management
Knowledge Management
Series Editor
A. Zanasi
TEMIS Text Mining Solutions S.A.
Italy

Associate Editors

P.L. Aquilar, University of Extremadura, Spain
D. Goulias, University of Maryland, USA
O. Ciftcioglu, Delft University of Technology, The Netherlands
P. Giudici, Universita di Pavia, Italy
M. Costantino, London, UK
A. Gualtierotti, IDHEAP, Switzerland
P. Coupet, TEMIS, France
T.V. Hromadka II, Integral Consultants, USA
N.J. Dedios Mimbela, Universidad de Cordoba, Spain
J. Jaafar, UiTM, Malaysia
A. De Montis, Universita di Cagliari, Italy
J. Lourenco, Universidade do Minho, Portugal
G. Deplano, Universita di Cagliari, Italy
G. Loo, The University of Auckland, New Zealand
D. Malerba, Università degli Studi, UK
F. Rossi, DATAMAT, Germany
N. Milic-Frayling, Microsoft Research Ltd, UK
D. Sitnikov, Kharkov Academy of Culture, Ukraine
G. Nakhaeizadeh, DaimlerChrysler, Germany
R. Turra, CINECA Interuniversity Computing Centre, Italy
P. Pan, National Kaohsiung University of Applied Science, Taiwan
D. Van den Poel, Ghent University, Belgium
J. Rao, Case Western Reserve University, USA
J. Yoon, Old Dominion University, USA
D. Riaño, Universiteit Ghent, Belgium
N. Zhong, Maebashi Institute of Technology, Japan
F. Rodrigues, Poly Institute of Porto, Portugal
H.G. Zimmermann, Siemens AG, Germany
J. Roddick, Flinders University, Australia
Implementing Data Mining
Algorithms in
Microsoft® SQL Server™

C.L. Curotto
CESEC/UFPR
Federal University of Paraná, Brazil

N.F.F. Ebecken
COPPE/UFRJ
Federal University of Rio de Janeiro, Brazil
Implementing Data Mining
Algorithms in
Microsoft® SQL Server™
Series: Advances in Management Information, Vol. 3
C.L. Curotto and N.F.F. Ebecken

Published by

WIT Press
Ashurst Lodge, Ashurst, Southampton, SO40 7AA, UK
Tel: 44 (0) 238 029 3223; Fax: 44 (0) 238 029 2853
E-Mail: witpress@witpress.com
http://www.witpress.com

For USA, Canada and Mexico

WIT Press
25 Bridge Street, Billerica, MA 01821, USA
Tel: 978 667 5841; Fax: 978 667 7582
E-Mail: infousa@witpress.com
http://www.witpress.com

British Library Cataloguing-in-Publication Data

A Catalogue record for this book is available from the British Library.

ISBN: 1-84564-037-3
ISSN: 1742-0172

Library of Congress Catalog Card Number: 2004116312

No responsibility is assumed by the Publisher, the Editors and Authors for any injury
and/or damage to persons or property as a matter of products liability, negligence or
otherwise, or from any use or operation of any methods, products, instructions or ideas
contained in the material herein.

© WIT Press 2005.

Printed in Great Britain by *******************

All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the Publisher.
Contents
Foreword xi

Preface xiii

Chapter 1: Data mining technology overview 1


1.1 The importance of data mining technology.................................... 1
1.2 Multidisciplinary aspects................................................................ 1
1.3 The KDD process ........................................................................... 3
1.4 Data preparation ............................................................................. 4
1.5 Database aspects............................................................................. 4
1.6 Frequent approaches....................................................................... 5
1.7 Existing tools.................................................................................. 6
1.8 Research activities.......................................................................... 6
1.9 Applications ................................................................................... 7
1.10 The future ....................................................................................... 7

Chapter 2: Tools 9
2.1 Hardware ........................................................................................ 9
2.2 Software ......................................................................................... 9
2.2.1 Operating system.................................................................. 9
2.2.2 DBMS .................................................................................. 9
2.2.3 Compilers ........................................................................... 10
2.2.3.1 Microsoft Visual Studio 6.0 SP5......................... 10
2.2.3.2 Microsoft Visual Studio .NET 2003.................... 12
2.2.4 Syntax parser ...................................................................... 12
2.2.5 Utilities............................................................................... 12
2.2.6 DM sample provider........................................................... 14
2.2.7 IDMA CD........................................................................... 14
2.2.7.1 Assembling DMSample ........................................ 14
2.2.7.2 Assembling DMclcMine ....................................... 15
2.2.7.3 Creating DMiBrowser web site............................. 16

Chapter 3: OLE DB for DM technology 17


3.1 Universal data access architecture................................................ 17
3.1.1 OLE DB technology........................................................... 19
3.2 OLE DB for DM specification ..................................................... 21
3.2.1 AllElet data set ................................................................... 21
3.2.2 Creating the data mining model ......................................... 24
3.2.3 Populating the data mining model...................................... 27
3.2.4 Predicting attributes from data mining model .................... 31
3.2.5 Browsing the data mining model........................................ 34

Chapter 4: Implementation of DMclcMine 39


4.1 The SNBi classifier....................................................................... 39
4.1.1 The Naïve Bayes classifier ................................. 39
4.1.2 Formulation of SNBi classifier ........................................... 43
4.1.3 Training and prediction algorithms of SNBi classifier ....... 45
4.2 The clcMine data mining provider ............................................... 48
4.2.1 Premises ............................................................................. 48
4.2.2 Parsing the syntax of the new algorithm ............................ 49
4.2.3 Creating the data mining model ......................................... 49
4.2.4 Populating the data mining model...................................... 49
4.2.5 Predicting attributes from data mining model .................... 50
4.2.6 Browsing the data mining model........................................ 50

Chapter 5: Experimental results 51


5.1 Waveform recognition problem ................................................... 52
5.2 Meteorological data...................................................................... 54
5.3 Life insurance data ....................................................................... 56
5.4 Performance study........................................................................ 58
5.4.1 Varying the number of input attributes............................... 59
5.4.2 Varying the number of training cases ................................ 59
5.4.3 Varying the number of states of the input attributes........... 62
5.5 Conclusions .................................................................................. 63

References 65

Appendix A Building DMSample 69


A.1 DMSample workspace ................................................................. 69
A.1.1 Naming variables............................................................... 70
A.2 Building steps............................................................................... 70
A.2.1 Building the release version .............................................. 74

Appendix B Building DMclcMine 75


B.1 Correcting the grammar definition file......................................... 75
B.2 Aggregate and standalone modes ................................................. 76
B.3 Creating a new provider from DMSample ................................... 79
B.4 Renaming algorithm names.......................................................... 80
B.5 Implementing new algorithms...................................................... 82
B.6 Inserting support for new parameters ........................................... 83
B.7 Inserting support for new properties............................................. 85
B.8 Inserting new error codes ............................................................. 85
B.9 XML load & save support............................................................ 86
B.10 Debugging .................................................................................... 86
B.10.1 Aggregate mode............................................................... 86
B.10.2 Standalone mode.............................................................. 87
B.11 Cleaning garbage.......................................................................... 88
B.12 Bugs ............................................................................................. 93
B.13 Created and modified files ........................................................... 93

Appendix C Running the experiments 95



C.1 Microsoft decision trees parameters........................................... 95
C.1.1 COMPLEXITY_PENALTY ............................................. 95
C.1.2 MINIMUM_LEAF_CASES.............................................. 95
C.1.3 SCORE_METHOD ........................................................... 96
C.1.4 SPLIT_METHOD ............................................................. 96
C.2 DTS packages............................................................................... 96
C.3 Waveform recognition problem ................................................... 96
C.3.1 Creating the data................................................................ 98
C.3.2 Creating the database......................................................... 98
C.3.3 DTS packages .................................................................... 99
C.3.4 DMSQL statements ......................................................... 102
C.4 Meteorological data.................................................................... 103
C.4.1 Creating the database....................................................... 104
C.4.2 DTS packages .................................................................. 104
C.4.3 DMSQL statements ......................................................... 106
C.5 Life insurance data ..................................................................... 107
C.5.1 Creating the database....................................................... 108
C.5.2 DTS packages .................................................................. 111
C.5.3 DMSQL statements ......................................................... 111
C.6 Performance study...................................................................... 123
C.6.1 Creating the data.............................................................. 124
C.6.2 Creating the database....................................................... 124
C.6.3 DTS packages .................................................................. 128
C.6.4 DMSQL statements ......................................................... 128

Appendix D List of conventions 135

Appendix E List of terms and acronyms 137

Appendix F List of Figures 139

Appendix G List of Tables 141

Appendix H List of Listings 143


Foreword

Pyungchul (Peter) Kim


SQL Server Data Mining Team

Microsoft

Data mining technology is growing rapidly and, in many industries, has become commonplace. The main reasons for this are advances in computer technology that acquire, store and retrieve enormous amounts of data about everything and from everywhere. Uncovering patterns from these data allows a better understanding of a particular business, which, in turn, helps optimize business processes, thereby creating new business opportunities. Some data mining applications include target marketing, cross-selling, risk analysis, fraud detection, text classification, spam filtering and anomaly detection. Data mining involves virtually all industrial sectors where data are collected and stored electronically.
In 2000, Microsoft introduced a data mining feature in their Microsoft® SQL Server™ 2000 Analysis Services. The SQL Server 2000 data mining component has made numerous features available, becoming an enterprise-level platform for data mining services. Among others, it implements a standard data mining language, called DMX (Data Mining eXtensions), which was proposed as part of the object linking and embedding database (OLE DB) for Data Mining specification. DMX is an extension of SQL (Structured Query Language), with the same basic philosophy that the SQL community had in mind when it was invented decades ago, i.e. of providing application programmers with a clear separation between business logic and data management details. With DMX, application programmers can focus on aspects of data mining modeling combined with business logic rather than algorithm-specific interfaces and data transformations. Customer feedback indicates that this is one of the most significant innovations in data mining technology.
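As a small, hypothetical illustration of this separation of concerns (the model name, column names and choice of algorithm below are invented for the example and are not taken from this book), a DMX model definition reads much like ordinary SQL and says nothing about the internals of the algorithm that will be trained:

    CREATE MINING MODEL [CustomerChurn]
    (
        [Customer Id]   LONG KEY,
        [Age Group]     TEXT DISCRETE,
        [Contract Type] TEXT DISCRETE,
        [Churned]       TEXT DISCRETE PREDICT
    )
    USING Microsoft_Decision_Trees

Training the model and querying it for predictions are expressed with equally SQL-like INSERT INTO and PREDICTION JOIN statements, as Chapter 3 describes in detail.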
Furthermore, Microsoft has introduced a new feature, third-party data mining provider aggregation, in the SQL Server™ 2000 Service Pack 1 (SP1), which allows third parties to integrate their own OLE DB for Data Mining providers into SQL Server 2000 Analysis Services. In this way, third-party data mining providers can supply their customers with all the enterprise-level platform functionalities of Analysis Services, as well as their data mining algorithms. This is particularly important in the current data mining industry, where no single vendor offers data mining algorithms to suit all customers' needs. New algorithms for new needs are being introduced as we speak.
To promote SQL Server 2000 Analysis Services among third-party data mining providers, Microsoft has made sample code (DMSample) available to the public. DMSample contains all the functionality of a full-fledged OLE DB for Data Mining provider. It includes OLE DB interfaces (connection, session, command, rowset, etc.), a full implementation of the DMX language and a sample data mining algorithm, which is expected to be replaced by the provider. Since the sample code has been placed on our web site, there have been several thousand downloads from all over the world. Traffic to the MSN Data Mining group (http://groups.msn.com/AnalysisServicesDataMining) and the newsgroup (microsoft.public.sqlserver.datamining) has increased significantly. Several vendors are offering their products through the aggregation and we expect substantial growth in this area.
In view of these developments, I am delighted to see Claudio make his experience in the integration available to others. He has been meticulous in getting his data mining algorithm integrated into SQL Server 2000 through the aggregation functionality using the DMSample code. Throughout the process, he was always in close communication with the SQL Server data mining team and did an excellent job on the integration.
This book covers all the practicalities required to integrate a third-party data mining algorithm into SQL Server 2000.
Preface

Database Management System (DBMS) technology will increase the popularity of Data Mining (DM) activities. It is efficient, inexpensive, safe and reliable to develop your own application in the same environment that supports and manages all data and knowledge.
To achieve the goal of providing tight integration between DM and DBMS, a number of approaches have been developed over the last few years. These approaches include solutions provided by both corporate and academic research groups.
In July 2000, Microsoft released the object linking and embedding database for DM (OLE DB for DM) specification, a protocol based on the SQL language. This provided software vendors and application developers with an open interface to integrate data mining tools and capabilities more efficiently into line-of-business and e-commerce applications. About 55 independent software vendors (ISVs) participated in the elaboration of this specification.
OLE DB for DM provides an industry standard for DM so that different DM algorithms from various DM developers can be easily plugged into user applications. It also specifies the Application Programming Interface (API) between DM consumers (applications that use DM features) and DM providers (software packages that provide DM algorithms). This approach appears to be the most promising solution for the integration of DM and DBMS technologies. As Stephen Brobst, chief technology officer for NCR's Teradata Solutions Group, said, "Release of the OLE DB for DM specification is a significant milestone on the path to much wider use of predictive and descriptive analytic models by commercial applications".
OLE DB for DM incorporates the Predictive Model Markup Language (PMML) standards from the Data Mining Group (http://www.dmg.org), an industry consortium that facilitates the creation of useful standards for the data mining community. PMML is an Extensible Markup Language (XML)-based language that provides a quick and easy method for organizations to define and share Data Mining Models (DMM) among compliant vendors' applications. To quote Jack Noonan, President and CEO of SPSS: "By incorporating the PMML standard, Microsoft has further strengthened an open specification for bringing data mining into analytical applications."
OLE DB for DM extends the SQL syntax to make DM capabilities available to any business analyst or developer, without the need for specialized data mining skills. This natural integration into the relational database world enables faster integration of data mining into high-payoff e-commerce application areas, such as site personalization and shopping cart analysis.
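For instance (a purely hypothetical sketch: the model, its columns and the input values below are invented and do not correspond to any example in this book), a web application could request a real-time recommendation for a single visitor with a singleton DMX prediction query against a previously trained model:

    SELECT [Preferred Category]
    FROM [SiteProfiles]
    PREDICTION JOIN
        (SELECT '25-34' AS [Age Group], 'Search engine' AS [Referrer]) AS t
    ON  [SiteProfiles].[Age Group] = t.[Age Group]
    AND [SiteProfiles].[Referrer]  = t.[Referrer]

The application supplies the known facts about the case inline and receives the predicted attribute back as an ordinary result set, with no algorithm-specific code involved.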
Microsoft® SQL Server (MSSQL) 2000 provides the integration of DM techniques by using the OLE DB for DM technology. The Analysis Services component of this software includes a DM provider supporting two algorithms: one for decision tree classification and another for clustering. The DM Aggregator feature of this component and the Microsoft® OLE DB for DM Sample Provider (DMSample) make it possible for researchers and developers to easily implement new DM providers supporting new DM algorithms. DMSample implements the Microsoft® Simple Naïve Bayes (MSNB) classifier, supporting only discrete attributes.
The task of creating a provider and implementing new algorithms, however, remains problematic due to the lack of practical experience and information. Only a few publications exist on this subject, and experience in implementing DM providers and algorithms is still inadequate.
A large number of DM papers have been published in recent years, many of them concerning Database Mining Integration, the main subject of this book. Despite outstanding contributions from DM researchers, valuable information still remains hidden forever from prospective readers. This situation is probably due to publishers' restrictions on the size of these papers, but, in fact, there is a lack of information on many aspects of the reported research. How were the experiments carried out? What is the complete specification of the equipment used? What are the details of the processed data sets? Where are the computer source codes published? Most researchers only publish a brief description of their theoretical development and the results of their experiments. Such practices make experimental reproducibility very difficult, resulting in unnecessary duplication.
Thanks to our backgrounds and experience as engineers, we take a very practical approach to applied research and appreciate the importance of publishing all research development details and related experimentation. Additionally, we have a special interest in hands-on work and still recall how, in the late 1970s, the tips & tricks books for the Apple® computer were so valuable for people like us, involved in developing software for such a machine.
All of these points have been taken into consideration during the writing of this book. In addition to the theoretical contribution of the incremental algorithm described here, we also supply the reader with all the tips & tricks for the implementation and experimentation of a DM algorithm in MSSQL.
This book was designed for DM researchers and Information Technology (IT) workers who implement DM algorithms and intend to use Microsoft DM. It is very suitable as a textbook in Master and Doctorate IT programs.
The reader must be familiar with MSSQL Analysis Services and with programming languages, such as Microsoft® Visual C++® and Microsoft® Visual Basic®.
All necessary information for implementing new DM algorithms in an MSSQL environment, using the OLE DB for DM technology, is provided. A complete implementation of a DM provider, supporting SNBi, an incremental Simple Naïve Bayes classifier, is described. This classifier supports both discrete and numeric input attributes, multiple discrete prediction attributes and incremental update of the training data set. All source codes, as well as the data sets used in the computational experiments, are included in the accompanying CD-ROM.
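In DMX terms, a model for such a classifier could be declared along the following lines. This is only an illustrative sketch: the model name, the column names and the algorithm identifier clcMine_SNBi are placeholders invented here, not the actual names used by DMclcMine, which are defined later in the book:

    CREATE MINING MODEL [RiskProfile]
    (
        [Case Id]        LONG KEY,
        [Age]            DOUBLE CONTINUOUS,
        [Region]         TEXT DISCRETE,
        [Risk Class]     TEXT DISCRETE PREDICT,
        [Buys Insurance] TEXT DISCRETE PREDICT
    )
    USING clcMine_SNBi

The declaration mixes a continuous input ([Age]) with a discrete input ([Region]) and marks two discrete attributes as PREDICT, mirroring the capabilities listed above; incremental updating of the training set can then, in principle, be expressed as additional INSERT INTO statements against the already populated model.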
Chapter 1 is an overview of DM technology and its role in the Knowledge Discovery in Databases (KDD) process. The multidisciplinary aspects of DM are highlighted and aspects of the databases are discussed. A brief description of some DM tasks and algorithms, such as classification, regression and clustering, is provided. In addition, the main features of frequent approaches in DM (Decision Trees, Artificial Neural Networks, Rule Induction and Genetic Algorithms) are shown.
Chapter 2 describes the hardware and software tools used to develop the DM provider, as well as to run the examples and the computational experiments shown in the book. All instructions on installing the source codes, utilities and data files from the accompanying CD-ROM are provided.
Chapter 3 contains a description of the OLE DB for DM technology and its insertion in the Universal Data Access (UDA) architecture context. A brief description of the Object Linking and Embedding Database (OLE DB) technology is also provided, as well as an extended overview of the OLE DB for DM specification. Using an example, the key operations to be supported by a DM provider algorithm are described.
Chapter 4 shows the complete formulation and implementation of the clcMine OLE DB for DM Provider (DMclcMine) supporting the SNBi classifier, as well as the DMiBrowser, a browser developed to display trees produced by OLE DB for DM classifiers.
Chapter 5 describes the computational experiments used to evaluate the SNBi classifier and shows the results. These results are also compared to those produced by the Microsoft® Decision Trees (MSDT), C4.5 decision trees and Waikato Environment for Knowledge Analysis (WEKA) Simple Naïve Bayes classifiers. Four types of data sets were used in these experiments: a synthetic data set from the University of California at Irvine (UCI) Repository; two real-world data sets, one of meteorological data and another of life insurance data; and synthetic data sets generated by the DATGEN program. Using the last data set type, a performance study was carried out to verify the scalability of the SNBi and MSDT classifiers. This study included cardinality factors (number of training cases), number of input attributes, number of states of the input attributes and number of prediction attributes.
Appendix A includes useful instructions to build the DMSample provider, using the original source files provided by Microsoft.
Appendix B provides information on building DM providers, starting from the original DMSample source files, as well as an improved awareness of the compiling and debugging tasks to be performed in building your own provider.
Appendix C details the experiments, including source codes, SQL scripts, DMSQL queries and utilities.
Appendices D–H provide several lists to assist the reader in navigating the book more efficiently.

Acknowledgements
Kris Ganjam, Claude Seidman, Raman Iyer, Peter Kim, Jamie MacLennan and
ZhaoHui Tang provided valuable help in elucidating the doubts that appeared dur-
ing the computational implementation and the execution of the performance
experiments.
Gary Perlman supplied the source code of the |STAT system, used in the com-
putational experiments.
Manoel Rodrigues Justino Filho extensively revised the text.
Chapter 1
Data mining technology overview
1.1 The importance of data mining technology
Transactions within the business community generate data, which are the foundation of the business itself. These business sectors include retail, petroleum, telecommunications, utilities, manufacturing, transportation, credit cards, insurance, banking, etc. For years, data have been collected, collated, stored and used through large databases.
Within the scientific community, data are much more dispersed, since the goal in modeling scientific problems is to find and formulate governing laws in the form of accurate mathematical terms. It has long been recognized, however, that such perfect descriptions are not always possible. Incomplete and inaccurate knowledge, observations that are often of a qualitative nature, the complete heterogeneity of the surrounding world, boundaries and initial conditions being incompletely known: all of these generate the search for data models.
To build a model that does not need complex mathematical equations, one needs sufficient and good data. The possibility of substituting a huge amount of data by a set of rules, for example, is an improbable accomplishment.
The possibility of extracting explicit and comprehensive knowledge, hidden in a large database, can generate precious information. It is also possible to complete these data with the addition of expert knowledge.
Data Mining (DM) technology has opened a whole new field of opportunities,
completely changing problem-solving strategies.
DM is a multidisciplinary domain, covering several techniques. It deals with
discovering hidden, unexpected knowledge in large databases. The
methodologies of traditional decision-making support systems do not scale up to
the degree where they can show the hidden structures of large databases and data
warehouses within the time limits imposed by current scientific and business
environments.

1.2 Multidisciplinary aspects


DM is not a new technique, but rather a multidisciplinary field of research, where statistics, machine learning, database technology, expert systems, data visualization and high-performance computing are important procedures (fig. 1.1).

Figure 1.1: Multidisciplinary context of DM.
It is important to note that DM is a crucial step in the Knowledge Discovery in Databases (KDD) process, which will be explained in Section 1.3. Occasionally, however, KDD is used to indicate DM only, and we must not confuse both terms.
As regards machine learning-based DM, a major component of the available literature describes the application of algorithms to relatively small data sets. Real world databases and data warehouses are huge, perhaps holding several terabytes of data. In addition, they are usually multipurpose databases, where data are available for several applications. Often there are a very large number of records and/or attributes. Furthermore, dealing with high-dimensional data sets often necessitates the use of parallel-processing techniques and grid computing technology.

The advantage of using intelligent agents to automate the DM task is that it enables DM of on-line transactions.

1.3 The KDD process


KDD is a multi-step process of knowledge discovery within databases, while DM is a step in this process, in which algorithms are used to extract patterns from data [1]. Other steps of the KDD process (such as incorporating appropriate prior knowledge and proper interpretation of the results) are essential to ensure that useful information (knowledge) is derived from the data.
The KDD multi-step process is shown in fig. 1.2 [2].

Figure 1.2: The KDD process.

A description of these steps follows:
1. Data consolidation: Data from diverse sources are consolidated in a data warehouse. This step includes the following tasks:
   a. Create or choose a working database;
   b. Define the list of attributes;
   c. Delete obvious exceptions (outliers);
   d. Define a strategy for handling missing data fields;
   e. Determine prior probabilities of categories.
2. Selection and Preprocessing:
   a. Generate a set of examples: choose a sampling method; consider sample complexity; deal with volume bias issues;
   b. Reduce attribute dimensionality: remove redundant and/or correlated attributes; combine attributes (sum, multiply, difference);
   c. Reduce attribute value ranges: group symbolic discrete values; quantize continuous numeric values;
   d. Transform data: uncorrelate and normalize values; map time-series data to a static representation;
   e. Encode data: the representation must be appropriate for the DM tool that will be used; continue to reduce attribute dimensionality where possible without loss of information.
3. Data Mining:
   a. Choose the DM task: classification, regression, clustering, etc.;
   b. Choose the DM algorithms: artificial neural networks; inductive decision tree and rule systems; genetic algorithms; nearest neighbor clustering algorithms; statistical (parametric and non-parametric).
4. Interpretation and Evaluation:
   a. Evaluation: statistical validation and significance testing; qualitative review by experts in the field; pilot surveys to evaluate model accuracy;
   b. Interpretation: inductive tree and rule models can be read directly; clustering results can be graphed and tabulated; histograms of value distribution can be plotted; code can be automatically generated by some systems.

1.4 Data preparation


Data preparation does not deal only with data. It encompasses the miner as well.
A miner is anyone who needs to understand data and wants to apply techniques to
achieve the best data model. Miners may include business analysts, consultants,
marketing managers, finance managers, corporate executives, data analysts,
statisticians, mathematicians, engineers, scientists, among others.
In any project, data preparation is traditionally the most time-consuming task. Moreover, it cannot be completely automated. Therefore, it is necessary to employ data preparation techniques with care, knowing their effects on the data and understanding their functions and applicability.
Furthermore, it is the miner who establishes how the data will be prepared, based
on the nature of the problem, available tools and types of variables in the data set.

1.5 Database aspects


The majority of data used in DM resides in relational database systems. SQL is
the primary method for manipulating data in relational databases. In the case of
extremely large data sets, the performance of the relational database systems
becomes more important.
For the purpose of better performance, databases have adopted parallel hardware configurations to provide the necessary transaction-processing and query-response times. These parallel database architectures can be split into two types:

• Symmetrical multiprocessing or tightly coupled systems: all the processors in the system can access all of the data on the hard disks and they also all share the system memory (bottleneck);
• Shared-nothing or loosely coupled systems: each processor has its own disks and its own memory.

The shared-nothing parallel databases are inherently scalable.
On-line analytical processing (OLAP) tools are based on the concepts of multi-dimensional databases and allow sophisticated analysis of the data using elaborate, multi-dimensional, complex views. Nevertheless, they do not suffice in today's competitive environment.

1.6 Frequent approaches


The frequent approaches in DM are:
• Decision Trees, which are tree-shaped structures representing sets of decisions. These decisions generate rules for data set classification.
• Artificial Neural Networks, which are nonlinear predictive models that learn through training and resemble biological networks.
• Rule Induction, which is the process of extracting useful if-then rules from data, based on statistical significance.
• Genetic Algorithms, which are optimization techniques, based on
evolutionary concepts. They use a process that mimics genetics, such as
combination, mutation and natural selection.
As for constructing a decision tree, it may be useful to adopt a splitting criterion for each tree node. It is worth noting that there are similarities between trees and neural networks. Decision trees, however, employ a hierarchically structured decision function applied sequentially, whereas neural networks utilize a set of soft decisions in parallel. Decision trees and neural networks must be trained beforehand, each with their own techniques.
Moreover, it has been shown that linear tree classifiers can be adequately mapped into multilayer perceptrons over the decision trees. Nevertheless, decision trees require less training time.
It is a known fact that, for large data sets, induction tree algorithms show good scalability. Induction trees also give an accurate insight into the nature of the decision process. Furthermore, the resulting decision tree can be used directly. In contrast, similar information from neural networks needs to be extracted from their hidden layers.
Currently, association rules are only useful in DM if a rough idea of what one is looking for already exists.
The utility of genetic algorithms lies more within the realm of optimization. Genetic algorithms start with a population of items and seek to alter and optimize their structure for the solution of a particular problem. Genetic algorithms can be used, for example, to evolve neural network structures while simultaneously searching for significant input variables to maximize the predictive accuracy of the resulting neural network models.
The pros and cons of genetic algorithms are those of natural selection. The main drawbacks are the large over-production of individuals and the randomness of the search.

Usually, they require a lot of computing power, but they are solid, in the sense that if a solution exists, a genetic algorithm can probably find it.
Fuzzy systems are universal approximators, that is, they are capable of approximating general nonlinear functions to any desired degree of accuracy, similar to feedforward networks. DM has played a central role in the development of fuzzy models because fuzzy interpretations of data structures are a very natural and intuitively plausible way of formulating and solving various problems.
The mixture of fuzzy logic, neural networks and genetic algorithms provides a powerful framework for solving difficult real-world problems.

1.7 Existing tools


There is quite a variety of commercial tools available as complete systems, along with countless others currently under development. Tools range in scope from small-scale systems, developed for particular problem domains, to general-purpose systems, which can be used on almost any DM problem. There is a lot of variance in terms of cost, platform independence, requirements and visualization facilities.
The methods available for analyzing data are virtually limitless. Visualization offers a powerful means of analysis, which can help uncover patterns and trends.
It should not be forgotten that databases are depositories of several types of information, accommodating many different data formats. Free text is an amorphous data type, which can be easily stored. However, text is far from a standard form, which suggests great difficulties for DM. An analysis of the written word can introduce complicated issues, such as understanding human reading and comprehension. However, some automated text analysis applications have achieved impressive results using relatively simple data preparation strategies. Recent research results in text mining have shown that a comprehensive statistical analysis of word frequencies in labeled documents, combined with prediction techniques, can lead to effective text classification systems.
Web mining is a new research area that tries to solve the information overload problem by exploiting recent advances in different fields of technology. Documents and web pages are a source of knowledge in an unstructured data format that can be decoded, analyzed and turned into actionable intelligence.

1.8 Research activities


As a multidisciplinary field of research, several new improvements are taking
place simultaneously, making DM more and more attractive. Research activities
in DM can be grouped as follows:
• Discovering new data analysis techniques or improving the existing
ones;
• Scaling data analysis techniques over large data sets.
Results obtained so far have evolved in two directions:
• Building computational models representing the data sets;
• Extracting explicit and interesting knowledge from data.

1.9 Applications
At present, DM is a fully developed and very powerful tool, ready for use. DM finds applications in many areas, the more popular being:
• DM in Government: Detection and prevention of fraud and money laundering, criminal patterns, health care transactions, etc.
• DM for Competitive Intelligence: New product ideas, retail marketing and sales patterns, competitive decisions, future trends and competitive opportunities, etc.
• DM in Finance: Consumer credit policy, portfolio management, bankruptcy prediction, foreign exchange forecasting, derivatives pricing, risk management, price prediction, forecasting macroeconomic data, time series modeling, etc.
• Building Models from Data: Applications of DM in science, engineering, medicine, global climate change modeling, ecological modeling, etc.

1.10 The future


The evolution in the availability of published information and primary sources (human intelligence collection) is creating many fundamental challenges in computing, communications, storage and human-computer interaction issues for DM. Novel DM application areas are fast emerging and these developments call for new perspectives and new approaches.
Today, DM is recognized as essential in analyzing and understanding the information collected by government, business and scientific applications.
Focusing on customers and competitors is the key to business effectiveness. Building applications for Customer Relationship Management (CRM) and Competitive Intelligence makes the identification and prediction of individual, as well as aggregate, behavior possible.
CRM requires learning from the past. This is where DM is appropriate, because DM is about learning from data. Competitive Intelligence is a tool that improves the strategic decision-making process. It involves the monitoring of competitor behavior to identify early warnings of opportunities and threats. With this information, a company can learn from its competitors and design counter-measures to improve its competitiveness.
DM technology is expanding continuously. Some traditional areas are highly organized. Financial engineering, for example, is well advanced, while bioinformatics and medical applications are currently booming. For instance, how much valuable information could be gathered by correlating patient histories stored in large hospital databases?
Certainly, advances in computer technology are an accelerating factor. Database system improvements, rapid access to information, visualization capacities, etc., will make it possible to build, without difficulties, accurate data models, which are essential for the future scientific and business environments.
Chapter 2
Tools
This chapter describes the hardware and software required to develop the DM provider, as well as to run the examples and the computational experiments.

2.1 Hardware
The implementation was made using an IBM PC-compatible microcomputer with an Intel Pentium III 500 MHz processor, 512 MB of RAM, a 30 GB hard disk, and virtual memory set to 768 MB minimum and 2048 MB maximum.
Several programs are used in the debugging and running tasks. This configuration exceeds the hardware required by each program running alone. However, in the debugging task for DM provider implementation, a large amount of memory is necessary, since MSSQL with the Analysis Services component, Microsoft® Visual C++® (MSVC) and Microsoft® Visual Basic® (MSVB) are usually running together.

2.2 Software
2.2.1 Operating system
The operating system used was the Microsoft® Windows® 2000 Advanced Server
SP4 (Service Pack 4)
(http://www.microsoft.com/windows2000/downloads/servicepacks/sp4). You can
try using the trial version of Windows® 2003
(http://www.microsoft.com/windowsserver2003/evaluation/trial/evalkit.mspx).
To use DMiBrowser, a browser developed to display trees produced by OLE DB for DM classifiers, you must first install the Internet Information Services (IIS) component.

2.2.2 DBMS
The DBMS used was MSSQL 2000 Enterprise SP3A. MSSQL and the Analysis Services component with Service Pack 3A must be installed (http://www.microsoft.com/sql/downloads/2000/sp3.asp). A trial version of this software is available (http://www.microsoft.com/sql/evaluation/trial) and can be used without any problems. For the debugging and developing tasks of the DM provider, MSSQL is not necessary, but it is almost impossible to follow the instructions in this book without it.
2.2.3 Compilers
The main development tools were Microsoft® Visual Studio (MSVS), releases 6.0
SP5 and .NET 2003 with Visual C++®, Visual Basic® and Visual J++® compilers;
and Microsoft® Platform SDK February 2003 Edition.
The DMSamp utility [3], the WEKA system [4] and the DMiBrowser utility, included in the IDMA CD, require MSVS 6.0 SP5 to be compiled, but their executable files will run without any compiler. DMclcMine can be compiled by both releases of MSVS. MSVS 6.0 SP5 cannot be used to compile the Assemble utility (see Section 2.2.7.2), but its executable file will run without any compiler. So, it does not matter which version of MSVS you are using: all operations described in this book can be carried out.

2.2.3.1 Microsoft® Visual Studio 6.0 SP5 The following instructions must be performed to install MSVS 6.0 SP5:
1. Install MSVS 6.0 (Visual C++®, Visual Basic® and Visual J++® compilers).
2. Install the debugging tools used to implement the DMiBrowser:
A. Running MSVS 6.0 setup:
a. Select Server Applications and Tools in the first dialog;
b. In the next dialog select Launch BackOffice Installation Wizard, and press Install;
c. In the BackOffice Business Solutions dialog, select Custom
and press Next;
d. In the dialog BackOffice Programs and Their Components
select only the installation of the components: Remote Machine
Debugging and Visual InterDev Server.
B. Finish the setup and execute the instruction included in Visual
InterDev / Using Visual InterDev / Building Integrated
Solutions / Integration Tasks / Debugging Remotely.
3. Install MSVS SP 5
(http://msdn.microsoft.com/vstudio/downloads/updates/sp/vs6/sp5).
4. Download the Microsoft® Platform SDK February 2003 Edition from (http://www.microsoft.com/msdownload/platformsdk/sdkupdate) and install the following minimum set of components:
   • Core SDK (Windows Server 2003) – Build Environment;
   • Common Setup Files;
   • Internet Development SDK (Version 6.0) – Build Environment;
   • Microsoft® Data Access Components (Version 2.7) – Build Environment, and Sample and source code.
5. Verify whether the Microsoft® Platform SDK include and lib directories are at the top of the Tools/Options/Directories/Include files and Tools/Options/Directories/Library files lists, as shown in figs 2.1 and 2.2.
6. If MSVS .NET 2003 is not installed, then install the Microsoft® .NET Framework Version 1.1 Redistributable Package (http://www.microsoft.com/downloads/details.aspx?FamilyId=262D25E3-F589-4842-8157-034D1E7CF3A3&displaylang=en). This package will be used by the Assemble utility (see Section 2.2.7.2) to build the DMclcMine provider.

Figure 2.1: MSVS 6.0 Tools/Options/Directories/Include files.

Figure 2.2: MSVS 6.0 Tools/Options/Directories/Library files.

2.2.3.2 Microsoft® Visual Studio .NET 2003 The following instructions must be performed to install MSVS .NET 2003:
1. Install MSVS .NET 2003 (Visual C++® and Visual Basic®). The trial version of MSVS .NET 2003 can be used (http://msdn.microsoft.com/vstudio/productinfo/trial/default.aspx);
2. Install the October 2003 MSVS .NET Documentation Update (http://www.microsoft.com/downloads/details.aspx?familyid=a3334aed-4803-4495-8817-c8637ae902dc&displaylang=en);
3. Verify whether the Microsoft® Platform SDK include and lib directories included in MSVS .NET are at the top of the Tools/Options/Projects/VC++ Directories/Include files and Tools/Options/Projects/VC++ Directories/Library files lists, as shown in figs 2.3 and 2.4.

2.2.4 Syntax parser


To build the DM provider, you must acquire a Visual Parse++ license and software
from SandStone Technology (http://www.sand-stone.com). It is possible to use the
DM provider without it, but you will have to re-implement the OLE DB for DM
syntax parsing in DMParse, a very hard task.
After you install Visual Parse++, you must execute the following steps:
1. If you are using MSVS 6.0:
a. Select Tools/Options/Directories/Include files and insert the
Program Files/Sandstone/Visual Parse++ 4.0/Include direc-
tory into the directories list;
b. Verify whether this list is as shown in fig. 2.1.
2. If you are using MSVS .NET:
a. Select Tools/Options/Projects/VC++ Directories/Include files
and insert the Program Files/Sandstone/Visual Parse++ 4.0/
Include directory into the directories list;
b. Verify whether this list is as shown in fig. 2.3.

2.2.5 Utilities
RowsetViewer is a sample tool included in the Microsoft® Data Access Components of the Microsoft® Platform SDK February 2003 Edition. It allows a simple way of viewing and manipulating OLE DB rowsets, with the additional ability of calling and manipulating other OLE DB methods from the data source, session, command, rowset, transaction and notification objects supported by any OLE DB provider (see Section 3.1.1). The source code of RowsetViewer can be installed by selecting the option Sample and source code from the Microsoft® Data Access Components (Version 2.7) group of the Microsoft® Platform SDK. The instructions to install this tool are described in step 5 of Section 2.2.3.1.
Figure 2.3: MSVS .NET Tools/Options/Projects/VC++ Directories/Include files.

Figure 2.4: MSVS .NET Tools/Options/Projects/VC++ Directories/Library files.
Another useful rowset viewer, enhanced to be used with DM providers, is the DMSamp utility. This program was used to run queries against the DM provider, as you will see in subsequent chapters. You can download this program from http://communities.msn.com/analysisservicesdatamining.
To transfer the Data Transformation Services (DTS) packages from the companion CD to your server, you may use DTSBackup 2000. This program can be downloaded from http://www.sqldts.com/?242.

2.2.6 DM sample provider
DMSample [5] provides a template, with complete source code, to implement a DM provider that can run aggregated with MSSQL. The following steps provide the instructions to install this tool:
1. Download the SampleProvider.zip file of DMSample from http://www.microsoft.com/downloads/details.aspx?FamilyID=d4402893-8952-4c9d-b9bc-0d60c70d619d&DisplayLang=en.
2. Extract all directories and files from SampleProvider.zip to any temporary directory or root drive. Be sure that the Use folder names option is selected. A directory named SampleProvider will be created with all the files of this provider.

2.2.7 IDMA CD
The IDMA CD that accompanies this book contains all source codes for the DMclcMine implementation, as well as the data sets and utilities used in the computational experiments. To implement DMclcMine, we created 54 files (including the 13 files created to support MSVS .NET) and modified 95 of the 460 original DMSample files (including the 77 files modified by the Assemble utility). That is, 149 files were created or modified, from a total of 514 files. Due to copyright restrictions on this tool, we cannot distribute all of these files already fixed. To circumvent this problem, we have made a Visual Basic® utility which assembles all the files of DMclcMine, starting from the DMSample original source files.
Create the idma directory by copying all files from the IDMA CD to the root directory of your hard disk's chosen drive and execute the instructions described in the following sections.

2.2.7.1 Assembling DMSample The DMSample source files and projects can be assembled automatically, with all compiler and run-time errors fixed, by performing the following steps:
1. The IDMA CD files must be copied into your drive and DMSample must be downloaded (see Section 2.2.6).
2. Copy all SandStone Visual Parse++ (see Section 2.2.4) source files from the Program Files/SandStone/Visual Parse++ 4.0/Source/C++ directory to the idma/DMclcMine/VP64 directory.
3. Run idma/utilities/assemble/bin/assemble.exe, select the Script File, IDMA directory and Sample Provider directory, edit New provider name as shown in fig. 2.5 and press Proceed. If you are not using MSVS .NET 2003, the Microsoft® .NET Framework Version 1.1 Redistributable Package is required to run this utility (see Section 2.2.3.1, step 6). When this utility is finished, DMSample is ready to be built.

Figure 2.5: Assembling DMSample.

If you prefer, this task can be done manually by executing the instructions included in Appendix A.

2.2.7.2 Assembling DMclcMine Assemble the DMclcMine workspace with its source files and projects, and register its executable file by performing the following steps:
1. The IDMA CD files must be copied into your drive and DMSample must be downloaded (see Section 2.2.6).
2. Copy all SandStone Visual Parse++ (see Section 2.2.4) source files from the Program Files/SandStone/Visual Parse++ 4.0/Source/C++ directory to the idma/DMclcMine/VP64 directory.
3. Run idma/utilities/assemble/bin/assemble.exe, select the Script File, IDMA directory and Sample Provider directory, edit New provider name as shown in fig. 2.6 and press Proceed. If you are not using MSVS .NET 2003, the Microsoft® .NET Framework Version 1.1 Redistributable Package is required to run this utility (see Section 2.2.3.1, step 6). When this utility is finished, DMclcMine is ready to be built.
4. Stop the MSSQL OLAP Service. This service must always be stopped when you build a DM provider, to allow old files to be unregistered and removed.
5. Run idma/DMclcMine/DMProv/release/vs6/RegDMclcMine.bat. This batch file will register the release MSVS 6.0 version of DMclcMine on your computer. Start the MSSQL OLAP Service again.

Figure 2.6: Assembling DMclcMine.

If you prefer, this task can also be done manually by executing the instructions in Appendix B.

2.2.7.3 Creating DMiBrowser web site Create the DMiBrowser web site on
your computer by performing the following steps:
1. Run IIS.
2. Execute Action/New/Web Site.
3. Execute the Web Site Creation Wizard providing:
a. A Description for the new web site;
b. An IP address, for example 127.0.0.1;
c. The Path of the new web site: idma/DMiBrowser/web;
d. Allow the following permissions: Read and Run scripts (such as ASP).
4. Execute Action/Properties, press the Documents tab and add index.htm as the default document.
To browse the DMiBrowser web site you have to use the http://127.0.0.1 link.
Chapter 3
OLE DB for DM technology
The OLE DB for DM technology [6] defines a standard open API for the
development of DM providers.
This standardization allows the complete portability of these providers,
formed as API coupled systems, which can take advantage of the internal
resources of DBMSs, as well as of any data source compatible with this API.
By using this technology, a DBMS manages distinct applications, developed by distinct developers, for each task forming the KDD process, e.g. data preparation, DM, visualization and evaluation.
Integration among the different steps of the KDD process is made possible by the storage of the DM model (DMM), which is available for use by any application that complies with this technology.
Seeking portability and free information exchange, the storage of the model is carried out by using the Predictive Model Markup Language (PMML) specification [7], a markup language for statistics and DMMs, based on XML [8].
In addition, this technology allows a DM provider to be fully portable, running, coupled with standalone applications, without the support of any DBMS. A single data connection with a data source, such as a flat text file, is sufficient.
From the DM researcher's point of view, this technology is highly promising, since it creates the possibility, with minimum effort, of porting or developing his/her own DM algorithm, DMM viewer or any other application, using a non-proprietary programming language to assemble a DM provider ready for use, integrated with a DBMS.
However, it is essential that implementation of DM algorithms proceeds efficiently and with care to optimize the use of available DBMS internal functions [9].
Physically, a DM provider is a Dynamic Link Library (DLL) file that can be used either by MSSQL or by any application through real-time function calls.
As an extension of OLE DB technology, OLE DB for DM can be explained within the UDA architecture context, which is described in Section 3.1.

3.1 Universal data access architecture


Figure 3.1 shows the UDA architecture, formed by the Microsoft® Data Access Components (MDAC), as well as the DMclcMine DM provider in this architecture context.

Figure 3.1: Universal Data Access Architecture.

This architecture provides high-performance access to a variety of data formats (both relational and non-relational) on multiple platforms across the enterprise. UDA gives a simple API that can be used with almost all current programming languages and tools. This allows the use of the user's preferred tools for developing complete applications based on this architecture [10].
The OLE DB for DM technology is an extension of OLE DB, which is described as follows.

3.1.1 OLE DB technology


OLE DB is the basis of the UDA architecture. It is a specification for a set of Component Object Model (COM)-based data access interfaces that encapsulate several data management services. These interfaces define how a multitude of data sources can interact seamlessly. An OLE DB provider is formed by OLE DB components. This component-based model allows the use of data sources in a unified manner while allowing future extensibility.
There is no restriction on data manipulation through a SQL query. The single
criterion imposed by this technology is that the data produced by a query must be exposed as virtual tables. Thus, any data can be manipulated, since even a single value can be represented by a virtual table with one row and one column.
As shown in fig. 3.2, the most important feature of OLE DB is its heterogeneity, which allows queries based on any data format or stored by any source, such as SQL servers, Access® databases and Excel® work-sheets, among others.

Figure 3.2: OLE DB architecture.



An OLE DB provider can be simple or complex. To carry out queries involving multiple data sources, a provider can use other providers.
The seven basic components of the OLE DB provider object model are
described in table 3.1 [10].

Table 3.1: OLE DB components.

OLE DB components can be classified into three categories: data providers, data consumers and service components. The key characteristic of a data provider is that it holds the data it exposes to the outside world. While each provider handles implementation details independently, all providers expose their data in a tabular format through virtual tables. A data consumer is any component—whether it is system or application code—that needs to access data from OLE DB providers. Development tools, programming languages and many sophisticated applications fall into this category. Finally, a service component is a logical object that encapsulates a piece of DBMS functionality. One of OLE DB's design goals was to implement service components (such as query processors, cursor engines, or transaction managers) as standalone products that can be plugged in when needed.

3.2 OLE DB for DM specification


A careful reading of the complete specification of this technology [6] is essential for anyone who wants to develop DM providers, since here we provide only an overview of this specification.
OLE DB for DM describes an abstraction of the DM process. Extending OLE
DB, a new object, called DMM, is introduced, as well as commands for its
manipulation. This manipulation is done by statements similar to SQL statements.
Netz et al. [11] described the basic philosophy and design decisions leading to the
current specification of OLE DB for DM. They stated exactly the key operations to
be supported by a DM provider algorithm on DMMs, reproduced as follows:
1. Create the DMM, identifying the prediction data set of attributes, the
input data set of attributes to be used for prediction, and the algorithm
used to build the mining model.
2. Populate a DMM from training data using the algorithm specified.
3. Predict attributes for new data using a DMM that has been populated.
4. Browse a DMM for reporting, visualization or any other application of the KDD process, such as interpretation and evaluation of results.
These key operations will be described in the next sections in the context of an
example.

3.2.1 AllElet data set


To illustrate these operations the AllElet data set, extracted from the AllElectronics customer database [12], will be used. Figure 3.3 shows the training data set table (idma/data/allelet/allelet.data file), fig. 3.4 displays the testing data set table (idma/data/allelet/allelet.test file) and table 3.2 shows the marginal model statistics. These figures reproduce the AllElet_Train and AllElet_Test views of the AllElet MSSQL database.
The AllElet database, as well as its DMM, can be built and filled using the following instructions:
1. Run MSSQL Enterprise Manager.
2. Start Microsoft® SQL Query Analyzer executing Tools/SQL Query
Analyzer and connect to your SQL Server.
3. Execute File/Open and select the idma/data/allelet/allelet.sql file.
4. Execute Query/Execute. An empty Allelet database will be created on
your server. You can quit SQL Query Analyzer now.
5. Run DTSBackup 2000.
6. Press Select Source in the Source tab and choose Structure Storage File
for Storage Location, choose Select a directory and select
idma/data/allelet for File path. Press Select All. Figure 3.5 shows the screen
of DTSBackup 2000 with the AllElet DTS source files before the selection.
7. Press Select Destination in the Destination tab, choose Local
Packages for Storage Location and choose your server. Press Transfer.
All DTS files will be transferred to your server. Figure 3.6 shows the screen
of DTSBackup 2000 after this operation. You can quit this utility now.

Figure 3.3: AllElet – Training data set.

Figure 3.4: AllElet – Testing data set.

Table 3.2: AllElet – Training data set – marginal model statistics.

8. Return to MSSQL Enterprise Manager and select the AllElet_Import
package from Data Transformation Services/Local Packages. Execute
Action/Execute Package. You will see a dialog, as shown in fig. 3.7.
9. Select the view AllElet_Train from AllElet's Views list. Then press
Action/Open View/Return all rows. Now you can see the contents of the
training data set table shown in fig. 3.3. Quit MSSQL Enterprise Manager.

Figure 3.5: DTSBackup 2000 – AllElet – DTS source files.

Figure 3.6: DTSBackup 2000 – AllElet – DTS transferred files.



10. Start MSSQL Analysis Manager.
11. Select your server. Execute Action/Restore Database and open the
idma/data/allelet/allelet.cab file.
12. Select the Data Source CLC – Allelet. Execute Action/Edit, select the
Connection tab, verify your server name and press Test Connection.
Quit MSSQL Analysis Manager.
DTSBackup 2000 is very useful but has a drawback: the layout of DTS packages cannot be retrieved from structure storage files to MSSQL. If you wish to retrieve these layouts, you must use the Open Package action of Data Transformation Services to open each file and save it to the Local Packages of MSSQL.

3.2.2 Creating the data mining model


A DMM object is very similar to a table and is created by a CREATE MINING MODEL statement, also very similar to the SQL CREATE TABLE statement.
OLE DB for DM treats a DMM as a special type of table. When training data are inserted into the table, they are processed by a DM algorithm and the resulting abstraction, or DMM, is saved instead of the training data itself.
Subsequently, the DMM can be browsed, refined, or used to derive predictions.

Figure 3.7: Run of AllElet – Import DTS package.



The CREATE MINING MODEL statement specifies the DMM columns (attributes to be analyzed) and the specific DM algorithm to be used when the model is later trained by the DM provider. The simplified syntax of the statement used to create a DMM is:
CREATE MINING MODEL <model name> (<column definitions>)
USING <algorithm name> [(<parameter list>)]
where
<model name> a unique name for the DMM;
<column definitions> a comma-separated list of column definitions;
<algorithm name> the provider-defined name of a DM algorithm;
<parameter list> (Optional) a comma-separated list of provider-defined
parameters for the algorithm.
However, since the DMM columns require a lot of specialized information,
some extensions were added to the standard SQL syntax. Details about the syntax
of DM SQL statements are found in Refs. [6, 13, 14].
Following is an example of a statement for creating a DM model with two prediction attributes of the data set displayed in table 3.2:
CREATE MINING MODEL [AllElet_SNB] ([NumReg] LONG KEY,
[Age] TEXT DISCRETE, [Income] TEXT DISCRETE, [Student]
TEXT DISCRETE PREDICT, [CreditRating] TEXT DISCRETE,
[BuysComputer] TEXT DISCRETE PREDICT) USING
Simple_Naive_Bayes
The Relational Mining Model Editor of MSSQL Analysis Services Manager
can be used to build this statement. You can run this statement and the next ones
using the DMSamp utility, creating and using this DMM:
1. First, DMclcMine must be registered. To do this, run idma/DMclcMine/
DMProv/release/RegDMclcMine.bat.
2. Run DMSamp utility (Section 2.2.5).
3. Execute File/Connect. Choose idma/data/allelet for Server and
DMclcMine for Provider. Press OK.
4. Execute File/Open. Choose the idma / data / allelet /
AllElet_SNB_Queries.xml file, which includes eight ready to use
DMSQL queries.
5. Select and run (Query/Run) the first query: DROP MINING MODEL
[AllElet_SNB]. Do not worry about an information message saying that
the model does not exist.
6. Select and run the second query, which is the create query displayed above.
The name of AllElet_SNB DMM will appear in Mining Models list as
shown in fig. 3.8.
A DMM was created in the DM provider and the text-formatted XML file idma/data/allelet/AllElet_SNB.sdmm.xml, shown in listing 3.1, will contain the empty DMM.

Figure 3.8: DMSamp – Creating AllElet DMM model.

Listing 3.1: AllElet – XML file of the empty DMM.



3.2.3 Populating the data mining model


The CREATE MINING MODEL statement does not define the actual content of the DMM. An INSERT INTO statement must be used to load the training data into the DMM. This operation is similar to the INSERT INTO statement used to populate a table. The training data are processed on the DM provider by the algorithm specified in the CREATE MINING MODEL statement. The data structure of the DMM is filled and stored instead of the training data. This result is referred to as the DMM content. The two forms of the statement syntax used to populate a DMM are as follows:
INSERT INTO < model name > (<mapped model columns>)
<source data query>
INSERT INTO < model name >.COLUMN_VALUES (<mapped
model columns>) <source data query>
where
<model name> a unique name for the DMM;
<mapped model columns> a comma-separated list of column identifiers
and nested identifiers;
<source data query> the source query in the provider-defined format.
The INSERT INTO statement inserts training data into the DMM. The columns from the query are mapped to DMM columns through the <mapped model columns> section. The keyword SKIP is used to instruct the DMM to ignore columns that appear in the source data query but are not used in the DMM.
The INSERT INTO <model>.COLUMN_VALUES form inserts data directly into the model's columns without training the model's algorithm. This allows providing column data to the DMM in a concise, ordered manner that is useful when dealing with data sets containing hierarchies or ordered columns. This form is also essential for dealing with incremental training.
The next lines show an example of the statement used to populate the DMM with the data set displayed in fig. 3.3, using data from MSSQL:
INSERT INTO [AllElet_SNB] (SKIP, [Age], [Income], [Student],
[CreditRating], [BuysComputer]) OPENROWSET('SQLOLEDB.1',
'Provider=SQLOLEDB.1;Integrated Security=SSPI;Persist
Security Info=False;Initial Catalog=allelet;Data
Source=CLC;Connect Timeout=15', 'SELECT "NumReg", "Age",
"Income", "Student", "CreditRating", "BuysComputer" FROM
"AllElet_Train"')
This SQL can be generated automatically by MSSQL Analysis Services
Manager. Select and run the third query on DMSamp to populate the DMM.
After this operation the idma/data/allelet/AllElet_SNB.sdmm.xml file, shown in listing 3.2, will contain the populated (and trained) DMM. As shown in this listing, all possible values of the attributes are included in the data dictionary, and the global statistics and the statistical data defining the DMM are included in the file.

Listing 3.2: AllElet – XML file of the populated and trained DMM.

3.2.4 Predicting attributes from data mining model


Prediction queries on a DMM allow the prediction of unknown attributes for new cases. After a DMM has been populated and trained, a prediction query can be done. The data set to be predicted can be a new one or the same already-trained set, with the purpose of verifying the DMM accuracy.
Prediction queries are retrieved from DMMs with a SELECT statement, as
shown below [6]:
SELECT <select expression list> FROM <model name> [NATURAL]
PREDICTION JOIN <source data query> [ON <join mapping list>]
[WHERE <condition expression>]
where
<select expression list> a comma-separated list of column identifiers and
other expressions to describe the columns in the
results of the query;
<model name> a unique name for the DMM;
<source data query> the source query in the provider-defined format;
<join mapping list> a logical expression comparing DMM columns to
columns from the source query;
<condition expression> (Optional) a condition to restrict the values
returned from the column list.
This statement allows the prediction of attributes (columns) based on the input data supplied in the PREDICTION JOIN clause. The OLE DB for DM feature-rich prediction functions, including prediction histograms, prediction probability, sub-SELECT, and so forth, can be specified in <select expression list> and <condition expression>. Only the rows that satisfy the condition in the WHERE clause will be included in the result. An example of this statement, to predict the BuysComputer attribute of the testing data set (fig. 3.4), is shown below:
SELECT FLATTENED [T1].[NumReg], [T1].[Age], [T1].[Income],
[T1].[Student], [T1].[CreditRating], [T1].[BuysComputer],
[AllElet_SNB].[BuysComputer] AS BuysComputer FROM
[AllElet_SNB] PREDICTION JOIN OPENROWSET ('SQLOLEDB.1',
'Provider=SQLOLEDB.1;Integrated Security=SSPI;Persist
Security Info=False;Initial Catalog=allelet;Data Source=CLC',
'SELECT "NumReg", "Age", "Income", "Student", "CreditRating",
"BuysComputer" FROM "AllElet_Test" ORDER BY "NumReg"') AS
[T1] ON [AllElet_SNB].[NumReg] = [T1].[NumReg] AND
[AllElet_SNB].[Age] = [T1].[Age] AND [AllElet_SNB].[Income] =
[T1].[Income] AND [AllElet_SNB].[Student] = [T1].[Student] AND
[AllElet_SNB].[CreditRating] = [T1].[CreditRating] AND
[AllElet_SNB].[BuysComputer] = [T1].[BuysComputer]
This statement may be built by using the DMSamp utility or the MSSQL
DTS. The result of selecting and running the fourth query on DMSamp is shown
in fig. 3.9. A table can be seen in this figure with two records containing the result

Figure 3.9: DMSamp – Predicting attributes of the testing data set.



Figure 3.10: AllElet – DMM contents.

Figure 3.11: AllElet – DMM prediction tree.



of the query. Column [T1].[BuysComputer] corresponds to the original value of this attribute and column [BuysComputer] corresponds to the predicted value. In this case, the first prediction is correct and the second is wrong (fig. 3.4).
3.2.5 Browsing the data mining model
A DMM is exposed by DM providers for applications such as viewers and report generators by means of rowsets. The rowset content is retrieved by DM provider client applications by using DMSQL SELECT queries or by function calls that retrieve the DMM node properties. The DMM content is a set of rules, formulae, distributions, nodes and other information specific to each DM algorithm.
Figure 3.12: AllElet BuysComputer prediction tree.



Figure 3.13: DMSamp using AllElet DMM UDA architecture.

Figure 3.14: AllElet BuysComputer prediction tree shown by Internet Explorer.

Figure 3.15: Microsoft® IE AllElet DMM UDA architecture.

So a suitable way to view DMM contents is by directed graphs or trees (sets of nodes with connecting edges). The decision tree technique fits very well with this method.
Figure 3.10 shows the DMSamp screen with part of the results of the query that retrieves the entire contents of the AllElet DMM (selecting and running the fifth query):
SELECT * FROM [AllElet_SNB].CONTENT
Selecting the AllElet_SNB DMM in the Mining Models list, clicking the right mouse button and pressing Browse, you can see the prediction tree of the AllElet_SNB DMM (fig. 3.11). Figure 3.12 shows the complete prediction tree of the BuysComputer attribute.
Figure 3.13 shows the UDA architecture configuration for the scenario of using DMSamp with the DMclcMine provider. DMSamp is a simple VB application, using SQLOLEDB, the Microsoft® OLE DB Provider for MSSQL, and the DMclcMine provider.
Figure 3.14 shows the AllElet BuysComputer prediction tree displayed by Microsoft® Internet Explorer (MSIE) through DMiBrowser, an ASP utility [15], modified by Curotto [16] to provide support for algorithms other than MSDT. You can see this screen in MSIE by performing the following steps:

1. Run MSIE and open the address http://127.0.0.1 (see Section 2.2.7.3).
2. In the right upper frame, a list of Analysis Server Databases will
appear. Select the AllElet database.
3. In this frame, a list of Models of the AllElet database will appear. Select
AllElet_SNB.
4. In this frame, a list of Prediction trees of the AllElet_SNB model will
appear. Select Student.
5. You can scroll over the tree and select one node to see the Node
Description and Node Distribution. You can also view a specific node
by selecting it in the left upper frame.

Figure 3.15 shows the UDA architecture configuration for this last scenario.
Chapter 4
Implementation of DMclcMine
4.1 The SNBi classifier
The SNBi classifier, implemented to illustrate the implementation process, is a Naïve Bayesian classifier supporting incremental update of the training data set, continuous data attributes and multiple discrete prediction attributes. In spite of its simplicity (due to the premise of attribute independence), this algorithm shows better results, in many situations, than more refined algorithms.

4.1.1 The Naïve Bayes classifier


The Naïve Bayes classifier is a statistical classifier, which predicts the probability that a sample belongs to each class. This classifier is based on Bayes' theorem, described below. Formulated by Thomas Bayes (†1761), a British mathematician, this theorem relates conditional probabilities:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}     (4.1)

where:

P(A|B) posterior probability of event A conditioned on event B;


P(B|A) posterior probability of event B conditioned on event A;
P(A) prior probability of event A;
P(B) prior probability of event B (P(B) ≠ 0).

P(A) and P(B) are prior probabilities because they are independent of any prior knowledge. In contrast, P(A|B) and P(B|A) are posterior probabilities because they are dependent on prior knowledge (based on more information).
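As a purely hypothetical numeric illustration (the values below are invented for this example and do not come from any data set in this book), suppose P(A) = 0.01, P(B|A) = 0.9 and P(B) = 0.108. Then eqn. (4.1) gives:

P(A|B) = \frac{P(B|A) P(A)}{P(B)} = \frac{0.9 \times 0.01}{0.108} \approx 0.083

so even a strong conditional probability P(B|A) yields a small posterior probability when the prior P(A) is small.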
The Naïve Bayesian classifier can be formulated as follows [12].
Suppose a training data set formed by:
v                  total number of samples or training cases;
X = xi, i = 1, n   vector of a training case formed by n attribute values of
                   vector A;

A = ai, i = 1, n   vector of attribute values of the training data set;
C = cj, j = 1, m   vector of classes to which each case may belong.
For an unknown case, represented by the vector Y = yi, i = 1, n, the Naïve Bayes
classifier will choose the class with the greatest posterior probability; in other
words, the chosen class will satisfy the following equation:

c = \arg\max_{j = 1, \dots, m} P(c_j|Y)     (4.2)

The class cj with the greatest probability P(cj|Y) is named the maximum posterior hypothesis. Using Bayes' theorem, this probability is computed by:

P(c_j|Y) = \frac{P(Y|c_j) P(c_j)}{P(Y)}     (4.3)

where:
P(cj|Y)   probability of event of class cj for a case Y;
P(Y|cj)   probability of event of case Y for a class cj;
P(cj)     probability of event of class cj;
P(Y)      probability of event of case Y (P(Y) ≠ 0).
Since P(Y) is constant for all classes, only the numerator of the above equation
needs to be computed. The class prior probabilities can be computed by:

P(c_j) = \frac{s_j}{v}     (4.4)

where:
sj   number of training cases of class cj;
v    total number of training cases.
Assuming class conditional independence between the attributes (origin of the
name Naïve Bayesian classifier), the probability P(Y|cj) can be computed by:

P(Y|c_j) = \prod_{i=1}^{n} P(y_i|c_j)     (4.5)

where:
P(Y|cj)    probability of event of case Y for a class cj;
P(yi|cj)   probability of event of value yi for an attribute ai of an unknown
           case Y for a class cj.
Using the training data, the probabilities P(yi|cj) are computed by equations,
depending on the type of attribute ai. If ai is discrete, then:

P(y_i|c_j) = \frac{r_{hji}}{s_j}     (4.6)

where:
rhji   number of training cases of class cj, value yi, order h, attribute ai;
sj     number of training cases of class cj.
Otherwise, if ai is continuous, then:

P(y_i|c_j) = g(y_i, \mu_{ji}, \sigma_{ji}) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left(-\frac{(y_i - \mu_{ji})^2}{2\sigma_{ji}^2}\right)     (4.7)

where:
g(yi, μji, σji)   Gaussian (normal) density function for attribute ai;
yi                value yi for attribute ai of unknown case Y;
e                 Naperian number;
μji, σji          mean and standard deviation, respectively, of attribute
                  values xi for attribute ai of training cases of class cj.
The mean is computed by:

\mu_{ji} = \frac{z_{ji}}{r_{1ji}} = \frac{1}{r_{1ji}} \sum_{k=1}^{r_{1ji}} x_{ijk}     (4.8)

where:
zji    sum of values xi for attribute ai for training cases of class cj;
r1ji   number of training cases of class cj of any value xi, order h = 1 (h = 1
       is used for existing values of continuous attributes and h = 0 is used
       for missing values of continuous and discrete attributes) for
       attribute ai;
xijk   values xi for attribute ai for training cases of class cj.
The original standard deviation equation is:

(4.9)

where:
xijk   value xi for attribute ai for training cases of class cj;
μji    mean of values xi for attribute ai for training cases of class cj;
r1ji   number of training cases of class cj of any value xi, order h = 1 (h = 1
       is used for existing values of continuous attributes and h = 0 is
       used for missing values of continuous and discrete attributes) for
       attribute ai.
Substituting the mean value in the above equation and expanding the square, the standard deviation equation becomes:

(4.10)

Simplifying:

(4.11)

Considering the sum of the squares of the values xi as:

q_{ji} = \sum_{k=1}^{r_{1ji}} x_{ijk}^2     (4.12)

Substituting the known values and simplifying:

(4.13)

Simplifying again, a final equation can be written. This equation is completely independent of the individual values xi, an excellent feature to be used in an incremental classifier:

(4.14)

where:
qji    sum of the squares of the values xi for attribute ai for training cases of
       class cj;
zji    sum of the values xi for attribute ai for training cases of class cj;
r1ji   number of training cases of class cj of any value xi, order h = 1 (h = 1
       is used for existing values of continuous attributes and h = 0 is used
       for missing values of continuous and discrete attributes) for
       attribute ai.
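To make the incremental character of this final equation concrete, the following minimal C++ sketch accumulates only the three stored quantities and derives the mean and standard deviation from them. It assumes the sample (n - 1) form of the variance, and the names (ClassAttrStats, addValue, mean, stdev) are illustrative, not the provider's actual code.

#include <cmath>

// Per-class, per-attribute accumulators for one continuous attribute:
// only these three values need to be stored between training increments.
struct ClassAttrStats {
    double r = 0.0;  // r1ji: number of existing (non-missing) values
    double z = 0.0;  // zji : sum of the values xi
    double q = 0.0;  // qji : sum of the squares of the values xi
};

// Incremental update for one training case (eqns. (4.8) and (4.12)).
inline void addValue(ClassAttrStats& s, double x) {
    s.r += 1.0;
    s.z += x;
    s.q += x * x;
}

// Mean, eqn. (4.8).
inline double mean(const ClassAttrStats& s) { return s.z / s.r; }

// Standard deviation in the spirit of eqn. (4.14): it depends only on
// q, z and r, never on the individual values xi (assuming the sample,
// n - 1, form of the variance).
inline double stdev(const ClassAttrStats& s) {
    if (s.r < 2.0) return 0.0;
    double var = (s.q - s.z * s.z / s.r) / (s.r - 1.0);
    return var > 0.0 ? std::sqrt(var) : 0.0;
}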

4.1.2 Formulation of SNBi classifier


The equations demonstrated above can be used to formulate the SNBi incremental classifier for multiple prediction attributes.
Suppose now a training data set formed by:
v                    total number of training cases;
Xt = xit, i = 1, n   vector of a training case for prediction attribute t, formed
                     by n attribute values of vector At;
At = ait, i = 1, n   vector of attribute values of the training data set for
                     prediction attribute t, formed by n values;
Ot = oit, i = 1, n   vector of the number of possible attribute values for each
                     attribute ait, for prediction attribute t, formed by n
                     values. For a continuous attribute, the vector element
                     has a value equal to 2 (one for existing values and
                     another for missing values). For a discrete attribute,
                     the vector element has a value equal to the number of
                     discrete values plus one (for missing values);
Ct = cjt, j = 1, mt  vector of classes to which each case may belong, for
                     prediction attribute t;
p                    total number of prediction attributes;
t = 1, p             prediction attribute index.
Suppose an unknown data set represented by:
u                    number of unknown cases;
Yt = yit, i = 1, n   vector of an unknown case, for prediction attribute t,
                     formed by n attribute values of vector At.
For an unknown case of this data set, represented by vector Yt, for prediction attribute t, the SNBi classifier will select the class with the greatest posterior probability using the following equation (see eqn. (4.2)):

c^t = \arg\max_{j = 1, \dots, m_t} P(c_j^t|Y^t)     (4.15)

P(cjt|Yt) (the probability of event of class cjt for a case Yt), for the prediction

attribute t, is computed by the following equation (see eqns. (4.4) and (4.5)):

P(c_j^t|Y^t) \propto \frac{s_j^t}{v} \prod_{i=1}^{n} P(y_i^t|c_j^t)     (4.16)

where:
sjt          number of training cases of class cjt;
v            total number of training cases;
P(yit|cjt)   probability of event of value yit for an attribute ait of an unknown
             case Yt for a class cjt.
The probabilities P(yit|cjt), defined by eqns. (4.5)–(4.7), for prediction attribute t, are computed from the training data set by the following equations, depending on the type of attribute ait. If ait is discrete, then (see eqn. (4.6)):

P(y_i^t|c_j^t) = \frac{r_{hji}^t}{s_j^t}     (4.17)

where:
rhjit   number of training cases of class cjt with the value yit, order h, for
        attribute ait;
sjt     number of training cases of class cjt.
Otherwise, if ait is continuous, then the following equation will be used
(see eqn. (4.7)):

P(y_i^t|c_j^t) = g(y_i^t, \mu_{ji}^t, \sigma_{ji}^t) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}^t} \exp\!\left(-\frac{(y_i^t - \mu_{ji}^t)^2}{2(\sigma_{ji}^t)^2}\right)     (4.18)

where:
yit          value yit for attribute ait of unknown case Yt;
e            Naperian number;
μjit, σjit   mean and standard deviation, respectively, for attribute values
             xit for attribute ait of training cases of class cjt, computed
             by the following equations.
The mean, for prediction attribute t, is computed by (see eqn. (4.8)):

\mu_{ji}^t = \frac{z_{ji}^t}{r_{1ji}^t}     (4.19)

where:
zjit    sum of values xit for attribute ait for training cases of class cjt;
r1jit   number of training cases of class cjt of any value xit, order h = 1
        (h = 1 is used for existing values of continuous attributes and h = 0
        is used for missing values of continuous and discrete attributes), for
        attribute ait;
xijkt   values xit for attribute ait for training cases of class cjt.
The standard deviation, for prediction attribute t, is computed by (see eqn. (4.14)):

(4.20)

where:
qjit    sum of the squares of the values xit for attribute ait for training cases
        of class cjt;
zjit    sum of the values xit for attribute ait for training cases of class cjt;
r1jit   number of training cases of class cjt of any value xit, order h = 1
        (h = 1 is used for existing values of continuous attributes and h = 0
        is used for missing values of continuous and discrete attributes), for
        attribute ait.

4.1.3 Training and prediction algorithms of SNBi classifier


Working with incremental update of training data, the Bayesian classifier (described in Section 4.1.2) needs to store some data, which will be retrieved when new training data are inserted. This strategy can be seen in listings 4.1 and 4.2, which describe, respectively, the training and prediction algorithms of this classifier.
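As an illustration of this store-and-update strategy, a minimal C++ sketch for a single prediction attribute with discrete inputs is shown below; the names (Case, SnbiModel, train) are assumptions of the sketch and not the code of listing 4.1.

#include <cstddef>
#include <vector>

// One training case: the class index plus one state index per input
// attribute (state 0 is reserved for a missing value, as in the text).
struct Case {
    std::size_t classIndex;
    std::vector<std::size_t> state;   // one entry per input attribute
};

// Stored model data: everything needed to resume training later.
struct SnbiModel {
    std::size_t v = 0;                     // total number of training cases
    std::vector<std::size_t> s;            // sj: cases per class
    // r[j][i][h]: cases of class j with state h for attribute i (eqn. (4.6))
    std::vector<std::vector<std::vector<std::size_t>>> r;

    SnbiModel(std::size_t classes, const std::vector<std::size_t>& states)
        : s(classes, 0), r(classes) {
        for (auto& perAttr : r)
            for (std::size_t i = 0; i < states.size(); ++i)
                perAttr.push_back(std::vector<std::size_t>(states[i], 0));
    }

    // Incremental training: counters are simply accumulated, so calling
    // train() again with a new batch refines the same stored model.
    void train(const std::vector<Case>& batch) {
        for (const Case& c : batch) {
            ++v;
            ++s[c.classIndex];
            for (std::size_t i = 0; i < c.state.size(); ++i)
                ++r[c.classIndex][i][c.state[i]];
        }
    }
};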

Listing 4.1: SNBi Classifier – Training algorithm.



The procedure to select the maximum class probability, shown in listing 4.2, can result in very long processing times for situations with many classes for the prediction attribute. An interesting solution for this problem has been shown by Leung and Sze [17]. They used the Branch-and-Bound algorithm in a Chinese character recognition problem with a training data set of 120 000 cases and 5515 classes.
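A minimal C++ sketch of this class-selection step, for discrete attributes only, is shown below. It works in log space to avoid numerical underflow and adds a simple +1 smoothing guard against zero counts; both choices, as well as the function name, are assumptions of the sketch and not the code of listing 4.2.

#include <cmath>
#include <cstddef>
#include <vector>

// Pick the class with the greatest posterior probability (eqns. (4.15)
// and (4.16)) given the stored counters of the trained model.
// s[j]        : number of training cases of class j
// r[j][i][h]  : cases of class j with state h for input attribute i
// y[i]        : state of attribute i in the unknown case
std::size_t predictClass(std::size_t v,
                         const std::vector<std::size_t>& s,
                         const std::vector<std::vector<std::vector<std::size_t>>>& r,
                         const std::vector<std::size_t>& y) {
    std::size_t best = 0;
    double bestLog = -1e300;
    for (std::size_t j = 0; j < s.size(); ++j) {
        // log P(cj) from sj / v; P(Y) is constant and can be ignored.
        double logP = std::log((s[j] + 1.0) / (v + s.size()));
        for (std::size_t i = 0; i < y.size(); ++i) {
            // log P(yi | cj) from rhji / sj, with a +1 smoothing guard
            // (an assumption of this sketch, not necessarily the book's).
            double num = r[j][i][y[i]] + 1.0;
            double den = s[j] + r[j][i].size();
            logP += std::log(num / den);
        }
        if (logP > bestLog) { bestLog = logP; best = j; }
    }
    return best;
}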

Listing 4.2: SNBi Classifier – Prediction algorithm.



4.2 The clcMine data mining provider


4.2.1 Premises
The DMclcMine provider was implemented starting from the DMSample source code [5]. Thus, before beginning implementation of the new algorithm, it is necessary to build DMSample, fixing a few compiler errors and bugs. You can either assemble DMSample automatically, using the Assemble utility (see Section 2.2.7.1), or execute the instructions included in Appendix A, performing the task manually, step by step. Whichever way you choose to build DMSample, you should read the instructions in Appendix A carefully to better understand the whole implementation process.
Furthermore, it is necessary to make extensive corrections to the syntax parser main function file (DMParse/dmreduce.cpp), because this file is not clean and, when any change is carried out, Visual Parse++ crashes and produces a corrupted file (see Appendix B.1).
The DMSample source code implements an aggregated DM provider that
can be used only with MSSQL. To implement a provider that can run either in
standalone or in aggregated modes, some modifications must be made. This
feature is useful for debugging new algorithms and for using the provider with
any application (see Appendix B.2).
To implement new algorithms, you can either include support for the new algorithm in DMSample or build a new DM provider to include the new algorithm. We chose the latter option. The advantage of this approach is that both providers can run together.
To create a new provider starting from DMSample you must copy all files to a different directory, modifying the name of all files that identify the provider,

especially the DLL file name, and all constant values that identify the provider to external applications, such as those using globally unique identifiers (GUIDs). The detailed instructions to do this task, creating the DMclcMine provider, are shown in Appendix B.3.
With this new DM provider, you must rename the MSNB classifier (see Appendix B.4), or remove it, to avoid conflict with the MSNB classifier supported by DMSample.
Finally, the new provider can accept new algorithms. In the following sections, the implementation will be described in the context of the key operations to be supported by a DM provider algorithm on DMMs, shown in Section 3.2.

4.2.2 Parsing the syntax of the new algorithm


The name of the new algorithm, as well as its parameters, must be defined. Visual Parse++ is used to include support for the new algorithm name. The complete instructions to carry out this task are shown in Appendix B.5.
The SNBi classifier does not have any parameters. However, for algorithms with new parameters, an additional task must be performed (see Appendix B.6).

4.2.3 Creating the data mining model


This operation is totally supported by the source code of DMSample, since the
CREATE MINING MODEL statement only builds a DMM, as shown in Section
3.2.2.

4.2.4 Populating the data mining model


Total support for inserting cases into the DMM must be developed for new algorithms. The developer must pay special attention when undertaking this task. The data structure that represents the model of the algorithm is defined, and all functions related to training the data set, assembling the model tree, and saving and loading this model are developed. This data structure must support the processing of the following two operations.
Three files were created to implement this operation:

• DMM/ModelUtils.hpp contains some functions and constants used by
the SNBi classifier.
• DMM/SNaiveBayes.hpp contains constants, function prototypes, class
definitions and inline function definitions used by the new algorithm.
• DMM/SNaiveBayes.cpp contains the implementation of the functions
used by the new algorithm.
All functions of the DMM/SNaiveBayes.cpp file must be carefully studied by the reader, with special attention to the DMM_SNaiveBayes::DoInsertCases function, which executes the processing of the INSERT INTO statement, populating the DMM. The source code of this file is very well documented and self-explanatory.

After the DMM has been populated, it has contents, which are saved and loaded by, respectively, the DMM_SNaiveBayes::Save and DMM_SNaiveBayes::Load functions, also included in the DMM/SNaiveBayes.cpp file.

4.2.5 Predicting attributes from data mining model


The prediction operation, started by an appropriate SELECT statement, has most of the work done by the DMSample source code. To implement support for the SNBi classifier, only the DMM_SNaiveBayes::DoPredict function, included in the DMM/SNaiveBayes.cpp file, was developed.

4.2.6 Browsing the data mining model


As stated in Section 3.2.5, a DMM is exposed by DM providers for applications, such as viewers and report generators, by means of rowsets. Also, a suitable way of viewing DMM contents is by directed graphs or trees (sets of nodes with connecting edges).
So, several functions have been developed to retrieve data from the DMM, as well as functions to navigate along the DMM tree. These functions are also included in the DMM/SNaiveBayes.cpp file.
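For illustration only, a viewer application could rebuild the DMM tree in memory along the following lines; the field and function names below are assumptions of this sketch and not the rowset schema defined by the specification.

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// A simplified in-memory view of one DMM node, of the kind a viewer
// application could rebuild from the content rowset.
struct DmmNode {
    std::string caption;                       // short node label
    std::string description;                   // human-readable rule text
    std::vector<std::pair<std::string, double>> distribution; // value, support
    std::vector<DmmNode> children;             // connected nodes (tree edges)
};

// Depth-first walk, the natural way to render the tree in a browser.
void printTree(const DmmNode& node, int depth = 0) {
    std::printf("%*s%s\n", depth * 2, "", node.caption.c_str());
    for (const auto& child : node.children)
        printTree(child, depth + 1);
}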
Chapter 5
Experimental results
We carried out four types of experiment to evaluate the implemented SNBi
classifier.
The first experiment, the classical waveform recognition problem [18], was used to verify the accuracy of the results and the classifier's performance. The results were also compared to those produced by other classifiers.
Meteorological and life insurance data allow the study of the SNBi classifier's behavior when faced with real data sets.
The synthetic data sets, used in the fourth experiment, allow evaluation of the scalability of the SNBi classifier, considering four variables: cardinality (number of cases), number of input attributes, number of states of the attributes and number of prediction attributes.
Three other classifiers were used in the experiments:
• MSDT – Microsoft® Decision Trees, the classifier included in MSSQL
Analysis Services [19];
• C4.5 – a Visual C++® implementation of the C4.5 decision tree classifier
[20];
• WEKA – a Visual J++® implementation of the Simple Naïve Bayes
classifier, included in the Waikato Environment for Knowledge Analysis [4].
The SNBi and MSDT classifiers are supported by DM providers and work aggregated with MSSQL. The C4.5 and WEKA classifiers were implemented as standalone applications using input and output text files. The parameters used by the C4.5 and MSDT classifiers in the experiments were those that produced the best results for their respective data sets. The MSDT classifier does not have a pruning phase; a similar effect is obtained by changing the COMPLEXITY_PENALTY parameter (a description of the MSDT classifier's parameters is shown in Appendix C.1). So, to comply with this situation, the C4.5 results are those obtained before pruning.
The accuracy of prediction results, obtained in classification problems, was computed as the percentage of correctly classified cases.
Two methods were used to estimate this accuracy. The first involves dividing
the available data into training and testing data sets. Usually, the testing data set is
about 30% of the whole data and the training data set is the remainder. The DMM

is built using the training data set and the accuracy is computed using this model
to predict results using the testing data set.
The second method estimates the accuracy by cross-validation and is more robust. In this method, the whole data is divided into N blocks, each with a number of cases and a class distribution as uniform as possible. N different DMMs are then built, each using as training data the whole data excluding one block, and this omitted block is used to evaluate the DMM. In this way, each case appears in exactly one testing set. Usually, N = 10 and the average accuracy rate over the N testing data sets is a good predictor of the accuracy of a model built from the whole data. Indeed, this result slightly underestimates the accuracy, since each of the N models is constructed from a subset of the whole data. All experiments carried out here used N = 10.
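A minimal C++ sketch of this estimation procedure is shown below, assuming a simple round-robin assignment of cases to the N blocks within each class; the helper names (makeFolds, crossValidate, trainAndTest) are illustrative.

#include <cstddef>
#include <map>
#include <vector>

// Assign each case to one of N blocks so that the class distribution of
// every block is as uniform as possible (round-robin within each class).
std::vector<std::size_t> makeFolds(const std::vector<int>& classLabel,
                                   std::size_t N) {
    std::vector<std::size_t> fold(classLabel.size());
    std::map<int, std::size_t> nextFold;          // per-class round-robin counter
    for (std::size_t k = 0; k < classLabel.size(); ++k) {
        std::size_t& f = nextFold[classLabel[k]];
        fold[k] = f;
        f = (f + 1) % N;
    }
    return fold;
}

// Cross-validation accuracy: each block is predicted by a model trained
// on all the other blocks, and the block accuracies are averaged.
// trainAndTest(trainIdx, testIdx) is a placeholder for building the DMM
// and returning the fraction of correctly classified test cases.
template <typename TrainAndTest>
double crossValidate(const std::vector<int>& classLabel, std::size_t N,
                     TrainAndTest trainAndTest) {
    std::vector<std::size_t> fold = makeFolds(classLabel, N);
    double sum = 0.0;
    for (std::size_t b = 0; b < N; ++b) {
        std::vector<std::size_t> trainIdx, testIdx;
        for (std::size_t k = 0; k < fold.size(); ++k)
            (fold[k] == b ? testIdx : trainIdx).push_back(k);
        sum += trainAndTest(trainIdx, testIdx);
    }
    return sum / static_cast<double>(N);          // average accuracy over N blocks
}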
In the tables that show the experiment results, the best values are highlighted in bold and the columns of the SNBi classifier have a light gray background.
It should be noted that the times shown in these tables are the elapsed times and not the real processing times for the considered task. During the experiments, the computer was used as a standalone server, without any connected client.
Appendix C shows all details required for processing the experiments, such as
programs and utilities.

5.1 Waveform recognition problem


The classical example of the Waveform recognition problem [18] was used to verify the accuracy of the results and the performance of the implementation. It is a three-class problem with one prediction attribute and 21 continuous input attributes.
Five sets of cases were considered. The testing data set for all of them was the 5000-case data set provided by the UCI Repository [21]. The Waveform program provided by this repository was used to randomly generate the training data sets, labeled a, b, c, d, e, with, respectively, 300, 5000, 10 000, 100 000 and 1 million cases. Table 5.1 shows the parameters of the classifiers used in this experiment, while table 5.2 shows the elapsed processing times spent by the classifiers.

Table 5.1: Waveform – Parameters of the classifiers.



The results of the MSDT1 and MSDT2 classifiers, identified by the question mark (?), mean that these classifiers fail and the process does not reach a successful termination. This error was reported to Microsoft and was qualified as a bug to be fixed in a future release of MSSQL Analysis Services. For customers with this problem, Microsoft has been suggesting a workaround, which involves discretizing the continuous attributes, and is creating a Knowledge Base (KB) article for this. For this reason, the tables also show the results of the classifier MSDT2d, which uses discretized input attributes instead of continuous input attributes. The null values of the prediction times on the training data of the C4.5 classifier are due to the fact that these values are very small and, when rounded to the unit, became null. The data structure of this classifier made these short times possible.
The training times of all classifiers, with the exception of the C4.5 classifier, demonstrated scalability, growing linearly with the number of cases. This behavior can be better observed in the graph in fig. 5.1. SNBi is the fastest of all considering the training times, while its prediction times were among the worst, with the exception of WEKA. The poor performance of this classifier is caused by the use of the Java virtual machine.

Table 5.2: Waveform – Elapsed times (s).



Figure 5.1: Waveform – Training time × Number of cases.

The non-linear behavior of the C4.5 classifier happened due to the great amount of memory demanded by this classifier for the training of 1 million cases, exceeding the efficient capacity of the equipment used. The intensive use of virtual memory introduces the hard disk access time variable, causing this behavior.
Table 5.3 shows that the SNBi and WEKA results are identical, as expected, since they have the same formulation. The accuracy results of the SNBi classifier were among the worst for training; however, the testing results were the best, though not better than the training results of the C4.5 classifier. In addition, if the product time × accuracy is considered in group d, the C4.5 classifier combines speed with competitive accuracy.
The cross-validation time results of SNBi and MSDT2 are shown in tables 5.4
and 5.5. The SNBi training times remained among the lowest. The SNBi accuracy
results also showed the same tendency: the training results were the worst and the
testing results were the best.
In this experiment, the excellent uniform results shown by the SNBi classifier are due to the fact that the data are artificially generated, using a normal distribution and without attribute dependencies.

5.2 Meteorological data


Over a 10-year period (from 1951 to 1960), the Institute of Meteorological Research of UNESP (State University of São Paulo) collected meteorological data at the International Airport of Rio de Janeiro. The purpose of this research was to assemble a model for forecasting thick fog, a risk situation for

Table 5.3: Waveform – Accuracy (%).

Table 5.4: Waveform – Cross-validation – Elapsed times (s).

aircraft using this airport. These data, collected daily, hour by hour, were formed by 36 input attributes with 87 600 cases. Faults in the data collection caused some missing attribute values in several cases.
Costa [22] carried out intensive data preparation that resulted in a data set formed by 26 482 cases, 18 input attributes and one prediction attribute with seven classes. This data set was split into two: one for training and another for testing. This split was achieved using a uniform random distribution, maintaining the same class distribution. The training data set is formed of 18 264 cases (68.97% of the whole data) and the testing data set is formed of 8218 cases (31.03% of the whole data). The parameters of the classifiers are shown in table 5.6.
As tables 5.7 and 5.8 show, the results of the SNBi2 classifier, with some discretized input attributes, are better than those of the SNBi1 classifier, with all continuous input attributes. This is due to the fact that the SNBi classifier uses a normal distribution for continuous attributes. When only a few values exist for an attribute, the model will certainly produce distortions. An experiment was done with all discretized
Table 5.5: Waveform – Cross-validation – Accuracy (%).

Table 5.6: Meteo – Parameters of the classifiers.

attributes and the obtained result was still better. However, some attributes, by
their own characteristics, cannot be modeled as discrete.
The elapsed processing times of all classifiers were comparable, except for
the classifier C4.5, which showed the best performance. The accuracy results of
the SNBi classifier were among the worst, unlike the previous experiment.
The cross-validation results of SNBi2 and MSDT1 are shown in tables 5.9 and
5.10. These results did not show significant differences, relative to previous results.
These worse results are probably due to dependence among the input
attributes, which is not considered by this classifier.
The results reported by Coelho and Ebecken [23] for ROC, an academic Bayesian classifier, are very close to those obtained by the SNBi2 classifier.

5.3 Life insurance data


This data set was extracted from the MASY data warehouse [24] of the SwissLife European insurance company for use in a competition, named KDD Sisyphus I [25], which was not finished. This is a high-dimensional and complex problem, which was exhaustively prepared by Costa [22].
Table 5.7: Meteo – Elapsed times (s).

Table 5.8: Meteo – Accuracy (%).



Table 5.9: Meteo – Cross-validation – Elapsed times (s).

The source data are multi-relational, formed by 10 tables. For use in this experiment, a SQL query using these tables generated a single table. The parameters of the classifiers used in this experiment are listed in table 5.11. Some input attributes are discrete and all classifiers use the same attribute types.
The prepared data set has 130 143 cases with 63 input attributes and one prediction attribute with two classes (see Appendix C.5). This data set was split into two: training and testing data sets. This split was achieved using a uniform random distribution, maintaining the same class distribution. The training data set is formed by 89 543 cases (68.80% of the whole data) and the testing data set is formed by 40 600 cases (31.20% of the whole data). Tables 5.12 and 5.13 show the results of elapsed processing times and accuracy.
In spite of showing the lowest training time among the compared classifiers, the SNBi classifier again presented the worst result for accuracy. On the other hand, the C4.5 classifier showed an excellent performance in all results. It should be noted that the high elapsed processing times of the MSDT2 classifier, as well as its accuracy results, are very close to those obtained by the C4.5 classifier. This is probably due to the fact that both classifiers use the entropy method to control decision tree growth.
The lowest accuracy presented by the SNBi classifier is probably due to
dependence among the input attributes, which is not considered by this classifier.

Table 5.10: Meteo – Cross-validation – Accuracy (%).

Table 5.11: Insurance – Parameters of the classifiers.



Table 5.12: Insurance – Elapsed times (s).

Table 5.13: Insurance – Accuracy (%).

The cross-validation results of SNBi and MSDT1 are shown in tables 5.14 and
5.15. These results did not show significant differences, relative to the previous results.
Again, the results reported by Coelho and Ebecken [23] for ROC, an academic Bayesian classifier, are very close to those obtained by the SNBi classifier.
The same experiment was carried out with the original (unprepared) data [26]. In general, all of the results, for accuracy and elapsed processing times, were worse than those obtained for the prepared data, justifying the need for this preparation. The ratios among the results of the classifiers were practically the same.

5.4 Performance study


A performance study for the MSDT classifier has been presented by Soni et al. [27]. This work was used as a template for experiments to evaluate the scalability and the performance of the SNBi classifier. Four variable factors are considered: cardinality (number of cases), number of input attributes, number of states of the

Table 5.14: Insurance – Cross-validation – Elapsed times (s).

Table 5.15: Insurance – Cross-validation – Accuracy (%).



Table 5.16: Performance – Parameters of the classifiers.

attributes and number of prediction attributes. The results of the experiments were
compared to those produced by the MSDT classifier.
The data sets used by these experiments are generated artificially by a computer program [28]. Details of this generation are shown in Appendix C.6.
In each experiment, the training time is measured by varying one variable. Table 5.16 shows the parameters and descriptions of the classifiers used in these experiments. All input and prediction attributes are discrete.

5.4.1 Varying the number of input attributes


In this experiment, the number of training cases was fixed at 1 million. Each input
attribute may have 25 different states. The prediction attribute may have five
classes. Figure 5.2 shows the graph with the training times for the following sequence of numbers of input attributes: 12, 25, 50, 100 and 200.
The SNBi classifier showed absolutely linear behavior and a better performance than the MSDT classifier. It is interesting to note that, as reported by Soni et al. [27], the MSDT classifier spent 130 min (7800 s) training 1 million cases, with equipment (4 Intel Xeon 550 MHz processors, 4 GB RAM) better than that used in this experiment. Figure 5.2 shows that here this time was only 8673 s.

5.4.2 Varying the number of training cases


This experiment was used to compare the performance of the classifiers by varying
the cardinality of the training data set. The numbers of training cases were: 10 000,

Figure 5.2: VarInpAtt – Training time × Number of input attributes.



25 000, 50 000, 75 000, 100 000, 1 million, 2 million and 10 million. The number
of input attributes was fixed at 20 with 25 different states. The prediction attribute
had five classes. The graph in fig. 5.3 displays the training elapsed processing times.
Again, the SNBi classifier showed absolutely linear behavior and a better performance than the MSDT classifier.
The graph in fig. 5.4 displays the elapsed processing times used to predict the same data sets. This experiment made possible the evaluation of the processing capacity of the equipment used. When processing the prediction of 1 million cases with the SNBi classifier, there was a lack of memory after 2 h and 16 min of elapsed time. At that time, the installed virtual memory was only 1024 MB. Only after the virtual memory was increased to 2048 MB could this task be completed. However, the task exceeded the efficient capacity of the equipment due to the intensive use of virtual memory, which caused the non-linear behavior in the results shown in fig. 5.4.
Another problem occurred with the prediction task of 1 million cases using the MSDT classifier. DM providers retrieve the cases using a simple query of a table with 10 million records stored in Microsoft® SQL ServerTM. However, a non-configurable time limit of 30 s exists to carry out query operations in this server. As queries of this size can exceed this limit, this task is impossible. This is a bug reported to Microsoft that should be fixed in future releases of the server. To bypass this problem, tables can be used instead of queries. If the use of queries is required, this can be done through Microsoft® SQL Query Analyzer, which allows the configuration of this limit. Surprisingly, this problem was also solved by the increase in virtual memory.
The graph in fig. 5.4 shows that the linear behavior of the SNBi classifier, for the prediction task, occurred only until the number of cases reached 100 000. From this value on, non-linear behavior occurred due to the high demand on memory

Figure 5.3: VarNumCases – Training time × Number of training cases.



Figure 5.4: VarNumCases – Prediction time × Number of cases.


required by the classifier. This fact can be explained by the same reasons as detailed previously.
The memory used by the SNBi classifier to process the prediction of 1 million cases was approximately 1.6 GB. As expected, the prediction task of 10 million cases was impossible to finish due to lack of memory (approximately 2.2 GB were in use after 6 h and 34 min of elapsed processing time). To make this processing task possible, a table was used instead of a query. However, the MSDT classifier presented an error due to exceeding the time limit. The difference in the two kinds of error can be explained by the way each classifier accesses the training data internally. As reported by Bernhardt et al. [29], the MSDT classifier possesses a processing module that makes queries directly to Microsoft® SQL ServerTM, justifying the kind of error presented.
To bypass the problem with the prediction task, prediction using part tasks was elaborated. Thus, the data were divided into parts of 10 000 cases. The graph in fig. 5.4, with the lines labeled MSDTp and SNBip, shows the results of this experiment, demonstrating the linear behavior of the two classifiers for this task.
Experiments were carried out with continuous data and both classifiers presented similar results to those for discrete data.
To evaluate the incremental capability of the SNBi classifier, the cardinality experiment was repeated, simulating increments of the number of training cases. Thus, the numbers of cases were: 10 000, 25 000, 50 000, 75 000, 100 000, 1 million, 2 million and 10 million. For comparison, the cardinality results of the non-incremental SNBi classifier are reproduced again in fig. 5.5, labeled SNBi.

Figure 5.5: VarNumCases – Incremental training time × Number of training cases.

The results of the non-incremental SNBi classifier for this experiment correspond to the accumulated values of the previous results, represented in the graph by the line labeled SNBia. The results of the incremental SNBi classifier are represented in the graph by the line labeled SNBii. Obviously, the elapsed processing times of the incremental method are lower than those obtained by the non-incremental method. However, the comparison of lines SNBi and SNBii shows that the incremental method produced slightly better results (~2.5%) than those obtained by the non-incremental method.

5.4.3 Varying the number of states of the input attributes


In this experiment, the number of training cases is fixed at 1 million. The number of input attributes is constant and equal to 20. The prediction attribute may have five different classes. The numbers of states of the input attributes were: 2, 5, 10, 25, 50, 75 and 100.
The results for the elapsed processing times are shown in fig. 5.6. This graph shows clear linear behavior of the SNBi classifier. The MSDT classifier oscillated in the same manner, presenting better linearity than reported by Soni et al. [27].

5.4.4 Varying the number of prediction attributes


The SNBi and MSDT classifiers have resources for generating multiple trees, considering multiple prediction attributes. This experiment studies the behavior of the two classifiers versus the variation of the number of prediction attributes, in the following sequence: 1, 2, 4, 16 and 32, while the other values stay constant: 1 million cases, 40 total attributes, including the prediction attributes for each task, and 25 states for each attribute, including the prediction attributes.

Figure 5.6: VarStates – Training time × Number of states of the input attributes.

The graph in fig. 5.7 shows the linear behavior of the SNBi classifier, which displayed a slight increase in training time when the number of prediction attributes increased. Again, the MSDT classifier has shown better linear behavior than reported by Soni et al. [27].

5.5 Conclusions
The OLE DB for DM technology has proven to be a very useful tool in implementing DM algorithms, achieving complete database querying and mining integration.
It has been demonstrated that the use of microcomputers is feasible for solving medium-level DM tasks.

Figure 5.7: VarPredAtt – Training time × Number of prediction attributes.



As predicted, due to its statistical formulation combined with the data structure of the implementation, the SNBi classifier has shown high scalability, within the limits of the equipment used in the experiments.
Since an optimum algorithm for all problems does not exist (only an optimum algorithm for each problem), the good results shown by the first experiment and the excellent elapsed processing times (considering large data sets) identify the SNBi classifier as an excellent choice for use, primarily, in data exploration, classification and prediction. Using incremental training, associated with prediction by part tasks, problems of virtually any size can be solved.
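As an illustration of prediction by part tasks, the driving loop can be as simple as the following C++ sketch; the chunk size and the predictChunk callback are assumptions of the sketch.

#include <cstddef>

// Predict a very large data set in fixed-size parts so that memory use
// stays bounded regardless of the total number of cases.
template <typename PredictChunk>
void predictByParts(std::size_t totalCases, std::size_t chunkSize,
                    PredictChunk predictChunk) {
    for (std::size_t first = 0; first < totalCases; first += chunkSize) {
        std::size_t last = first + chunkSize;
        if (last > totalCases) last = totalCases;
        predictChunk(first, last);   // e.g. issue one prediction query per part
    }
}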
The non-linear behavior of the classifier for large data sets signifies the need for additional research using higher-capacity equipment, including the use of federations of database servers, as well as multiprocessor computers.
References
[1] Fayyad, U.M., Piatetsky-Shapiro, G. & Smyth, P., From Data Mining to Knowledge Discovery in Databases, AI Magazine, Vol. 17, No. 3, pp. 37–54, 1996. http://kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf.
[2] Silver, D.L., Knowledge Discovery and Data Mining, MBA course notes of
Dalhousie University, Nova Scotia, Canada, 1998. http://plato.acadiau.ca/
courses/comp/dsilver.
[3] Kim, P. & Carroll, M., Making OLE DB for Data Mining queries against a DM provider, Visual Basic utility application hosted by Analysis Services – Data Mining Group web site, 2002. http://communities.msn.com/AnalysisServicesDataMining.
[4] Witten, I.H. & Frank, E., Data Mining: Practical Machine Learning Tools
with Java Implementations, 1st ed., San Francisco, California, USA,
Morgan Kaufmann Publishers, 2000. http://www.cs.waikato.ac.nz/~ml.
[5] Microsoft Corporation, OLE DB for Data Mining Sample Provider, Microsoft Corporation, Redmond, Washington, USA, 2002. http://www.microsoft.com/downloads/details.aspx?FamilyID=d4402893-8952-4c9d-b9bc-0d60c70d619d&DisplayLang=en.
[6] Microsoft Corporation, OLE DB for Data Mining Specification Version 1.0, Microsoft Corporation, Redmond, Washington, USA, 2000. http://www.microsoft.com/downloads/details.aspx?displaylang=en&familyid=01005f92-dba1-4fa4-8ba0-af6a19d30217.
[7] DMG – Data Mining Group, PMML – Predictive Model Markup Language
Version 2.0 Specification, Data Mining Group, 2002. http://www.dmg.org.
[8] W3C – World Wide Web Consortium, Extensible Markup Language (XML) 1.0 (Second Edition), Bray, T. et al (eds.), W3C – World Wide Web Consortium, 2000. http://www.w3.org/TR/REC-xml.
[9] Chaudhuri, S., Data Mining and Database Systems: Where is the
Intersection?, IEEE Data Engineering Bulletin , Vol. 21, No. 1, pp. 4–8,
1998. ftp://ftp.research.microsoft.com/pub/debull/98MAR-CD.pdf.
[10] Skonnard, A., Say UDA for All Your Data Access Needs, Microsoft
Interactive Developer, April, 1998. http://www.microsoft.com/mind/0498/uda/
uda.asp.
[11] Netz, A. et al, Integration of Data Mining and Relational Databases.
Proc. of the VLDB 2000, 26th Int’l Conf. on Very Large DataBases, Cairo,
Egypt, pp. 719–722, 2000. ftp://ftp.research.microsoft.com/users/
AutoAdmin/vldb00DM.pdf.
[12] Han, J. & Kamber, M., Data Mining: Concepts and Techniques, 1st ed., San Francisco, California, USA, Morgan Kaufmann Publishers, 2001. http://www.mkp.com.
[13] Microsoft Corporation, Microsoft® SQL Server™ 2000, Microsoft Corporation, Redmond, Washington, USA, 2000. http://www.microsoft.com/sql.
[14] Seidman, C., Data Mining with Microsoft® SQL Server™ 2000 – Technical Reference, 1st ed., Redmond, Washington, USA, Microsoft Press, 2001. http://www.microsoft.com/mspress.
[15] Tang, Z., Thin Client Decision Viewer Sample, ASP utility application hosted by Analysis Services – Data Mining Group web site, 2002. http://communities.msn.com/AnalysisServicesDataMining.
[16] Curotto, C.L., Thin Client Decision Viewer Sample Modified, ASP utility application, 2002. http://curotto.com/doc/english/dmibrowser.
[17] Leung, C.H. & Sze, L., A Method to Speed Up the Bayes Classifier, Engineering Applications of Artificial Intelligence, Vol. 11, No. 3, pp. 419–424, 1998. http://www.elsevier.com.
[18] Breiman, L. et al., Classification And Regression Trees, 1st ed., Boca Raton, Florida, USA, Chapman & Hall/CRC, 1984. http://www.crcpress.com.
[19] Chickering, D.M., Geiger, D. & Heckerman, D., Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, Technical Report MSR-TR-94-09, Microsoft Research, Microsoft Corporation, Redmond, Washington, USA, 1994. ftp://ftp.research.microsoft.com/pub/tr/tr-94-09.ps.
[20] Quinlan, J.R., C4.5: Programs for Machine Learning, 1st ed., San Mateo, California, USA, Morgan Kaufmann Publishers, 1993. http://www.cse.unsw.edu.au/~quinlan.
[21] Blake, C.L. & Merz, C.J., UCI Repository of Machine Learning Databases, University of California, Department of Information and Computer Science, Irvine, CA, USA, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[22] Costa, M.C.A., Data Mining on High Performance Computing Using Neural Networks, D.Sc. Thesis (in Portuguese), COPPE/UFRJ – Civil Engineering Graduate Program – Computational Systems, Rio de Janeiro, RJ, Brazil, 1999. http://www.coc.ufrj.br.
[23] Coelho, P.S.S. & Ebecken, N.F.F., A Comparison of some Classification Techniques, In: Data Mining III (Proc. of the 3rd Int'l Conf. on Data Mining), Bologna, Italy, Zanasi, A. et al. (eds.), WIT Press, pp. 573–582, 2002. http://www.witpress.com.
[24] Staudt, M., Kietz, J.U. & Reimer, U., A Data Mining Support Environment and its Application on Insurance Data. Proc. of the KDD'98, 4th Int'l Conf. on Knowledge Discovery and Data Mining, New York City, New York, USA, Agrawal, R., Stolorz, P.E. & Piatetsky-Shapiro, G. (eds.), AAAI Press, pp. 105–111, 1998. http://kietz.ch/kdd98.ps.gz.
[25] Kietz, J.U. & Staudt, M., KDD Sisyphus I – Dataset, Information Systems Research, CH/IFUE, Swiss Life, Zurich, Switzerland, 1998. http://www.cs.wisc.edu/~lists/archive/dbworld/0426.html.
[26] Curotto, C.L., Integration of Data Mining Resources with Relational Database Management Systems, D.Sc. Thesis (in Portuguese), COPPE/UFRJ – Civil Engineering Graduate Program – Computational Systems, Rio de Janeiro, RJ, Brazil, 2003. http://curotto.com/doc/portugues/tesedsc.
[27] Soni, S., Tang, Z. & Yang, J., Performance Study of Microsoft® Data Mining Algorithms, White paper, Microsoft Research, Microsoft Corporation, Redmond, Washington, USA, 2002. http://www.microsoft.com/SQL/evaluation/compare/AnalysisDMWP.asp.
[28] Melli, G., DatGen – Dataset Generator, Web Resource, 1999. http://www.datgen.com.
[29] Bernhardt, J., Chaudhuri, S. & Fayyad, U.M., Scalable Classification over SQL Databases. Proc. of the ICDE'99, 15th Int'l Conf. on Data Engineering, Sydney, Australia, pp. 470–479, 1999. ftp://ftp.research.microsoft.com/users/AutoAdmin/icde99.pdf.
[30] Kim, P., Microsoft.public.sqlserver.datamining FAQ (Frequently Asked Questions and Answers), Resource hosted by Analysis Services – Data Mining Group web site, 2002. http://communities.msn.com/AnalysisServicesDataMining.
[31] Perlman, G., The |STAT Handbook: Data Analysis Programs on UNIX and MSDOS, on-line ed., Gary Perlman's Home Page, 1986. http://www.acm.org/~perlman/stat.
Appendix A
Building DMSample
This appendix gives instructions for building DMSample, fixing compiler
errors and bugs, as well as providing useful information to better understand the
compiling and debugging tasks to be performed while building your own provider
and implementing new algorithms.
These instructions are useful if you plan to start your DM provider from the
original files. Otherwise, if you prefer to fix the DMSample files automatically, you
must execute the instructions provided in Section 2.2.7.1.

A.1 DMSample workspace


The DMSample's source code includes the complete implementation of a DM
provider able to run aggregated with MSSQL Analysis Services, including the
following [5]:
• All required OLE DB objects, such as session, command, and rowset (see
Section 3.1.1);
• The OLE DB for DM SQL Syntax Parser;
• Tokenization of input data;
• Query processing engine;
• A sample Naïve Bayes algorithm;
• Model archiving in XML and binary formats.
The DMSample.dsw workspace contains 12 projects spread over 14 sub-directories,
whose descriptions are reproduced as follows in compile-order dependency [5]:
1. VP64 is the Visual Parse++ library provided by SandStone Technology
(see Section 2.2.4).
2. Putl contains generic data structures and classes for storage, error
handling, database reading, etc.
3. DMBase contains the generic DMObject reference counted base class,
as well as global objects, macros, and resource IDs.
4. DMUtl contains more generic data structures and classes. This project also
contains wrappers of classes contained in Putl that combine those classes
with DMObject to provide reference counting.
5. DMXML contains code to store models in XML and binary formats.
6. DMM contains the actual code for the mining model algorithm, as well
as code for the marginal model (statistics). The sample provider contains
MSNB, a Naïve Bayes algorithm implementation supporting only
discrete input attributes.
7. DMCore contains the definitions and implementation of the core objects
used by the DM provider, such as DMModel, DMDatabase,
DMRowset, and so on. Additionally, the DMQueryContext and its
operations are defined here. These provide the mechanism for context
flow during a query.
8. DMParse contains the parser implementation.
9. TK contains the DMTokenizer, which converts discrete data into
integers, and the DMCaseConsumer, which reads data from the source
and creates cases to be used by the DM algorithms.
10. CP contains the DMCaseProvider, which calls the DM algorithms'
prediction method and provides output cases to the output rowset.
11. DMQM contains the DMQueryManager, which controls the overall
flow of any query against the provider.
12. DMProv provides the DLL output and implements all external interfaces
on the provider, such as the output rowset, the schema rowset and all other
supported OLE DB interfaces.

In addition to these projects, there are two sub-directories: DMCommon,
containing the common definitions header files, and DMInclude, which contains
the OLE DB structure definitions.

A.1.1 Naming variables

Variable names used by DMSample follow these conventions:
• public, private and protected class member variables start with the m_ prefix.
Example: DM_Count m_cCases (count of cases defined in the
SPCaseConsumer class).
• variables usually have a prefix indicating their type. Example:
WCHAR m_szName[129] (sz is used for string types). Table A.1 shows
the prefixes used by some types of variables.
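Because the body of Table A.1 is not reproduced here, the short sketch below applies the convention to a hypothetical class. Only the m_, c and sz prefixes come from the text above; the p (pointer) and f (boolean flag) prefixes, the stand-in DM_Count typedef and the class itself are assumptions added purely for illustration.

typedef unsigned long DM_Count;       // stand-in for the provider's DM_Count type

class SPExampleConsumer               // hypothetical class, not part of DMSample
{
public:
    SPExampleConsumer() : m_cCases(0), m_pBuffer(0), m_fIsOpen(false)
    { m_szName[0] = L'\0'; }

private:
    DM_Count m_cCases;                // c  prefix: count of cases (from the text)
    wchar_t  m_szName[129];           // sz prefix: zero-terminated string (from the text)
    void*    m_pBuffer;               // p  prefix: pointer (assumed)
    bool     m_fIsOpen;               // f  prefix: boolean flag (assumed)
};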

A.2 Building steps

Execute the following steps to build DMSample, remembering that, before doing
this, the appropriate tools must be installed (see Chapter 2):
1. Download and extract all files from DMSample (see Chapter 2, Section
2.2.6).
2. Unset the read-only attribute of all files in all directories under the
SampleProvider directory.
3. Remove or rename the oledb.h and transact.h files from the
SampleProvider/Putl/depend/inc directory.
Table A.1: Prefixes of some types of variables.

4. Copy all files from Program Files/SandStone/Visual Parse++ 4.0/
Source/C++ to the SampleProvider/VP64 directory (see Chapter 2,
Section 2.2.4).
5. Stop the MSSQL OLAP Service. This service must always be stopped when
you build a DM provider, so that old files can be unregistered and removed.
6. Open the DMSample.dsw file from the SampleProvider directory with
MSVC 6.0 and set DMProv as the Active Project, or open this file with
MSVC .NET, convert all projects to the new version of MSVS and set
DMProv as the StartUp Project.
7. If you do not wish to use Microsoft ® Visual SourceSafe, reply No to all
messages of this type:
Unable to connect to source code control project <project name>.
8. Modify the definitions _WIN32_WINNT=0x0400 and WINVER=0x0400
to _WIN32_WINNT=0x0500 and WINVER=0x0500 in the Preprocessor
definitions of the Preprocessor Category of the C/C++ Settings (Properties)
of all projects.
9. Execute the appropriate MSVC command to build the DMProv project.
Throughout this process, several compiler errors, shown as follows (not
necessarily in the same order), will appear. The beginning of each error
message contains the name of the file where the error was detected.
Following each compiler error message, instructions on how to fix it are
provided. If two error messages have the same solution, they are listed
together. After the corrections are finished, the build process must be
invoked again.
• ...Putl/cmpxchg.hpp(453): error C2373: ‘InterlockedIncrement’:
redefinition; different type modifiers.
3 Open this file and insert commenting characters at the beginning of
lines 453 and 454 to avoid their compilation:
//inline LONG InterlockedIncrement(volatile LONG* pl) { return InterlockedIncrement((LONG*)pl); }
//inline LONG InterlockedDecrement(volatile LONG* pl) { return InterlockedDecrement((LONG*)pl); }
• ...Putl/putillib.h(48): fatal error C1083: Cannot open include file:
‘jetoledb.h’: No such file or directory.
3 Add ../Putl/depend/inc in Additional include directories of the
Preprocessor Category of the C/C++ Settings of all projects but VP64.
• ...objbase.h(952): fatal error C1083: Cannot open include file:
‘propidl.h’: No such file or directory.
3 Install Core SDK (Windows Server 2003) – Build environment of the
Microsoft® Platform SDK February 2003 Edition (see Chapter 2,
Section 2.2.3.1).
• ...error C2501: ‘DBLENGTH’: missing storage-class or type specifiers or
...Putl/putillib.h(41): fatal error C1083: Cannot open include file:
‘oledb.h’: No such file or directory.
3 Install Microsoft® Data Access Components (Version 2.7) – Build
Environment and Sample and source code of the Microsoft® Platform
SDK February 2003 Edition (see Section 2.2.3.1) and verify whether
the Microsoft® Platform SDK include and lib directories are at the
top of the lists in Tools/Options/Directories/Include files and
Tools/Options/Directories/Library files (see Chapter 2,
Sections 2.2.3.1 and 2.2.3.2).
• ...msxml.idl(52): fatal error C1083: Cannot open include file:
‘xmldom.idl’: No such file or directory or
...msxml.idl(53): fatal error C1083: Cannot open include file:
‘xmldso.idl’: No such file or directory. How to fix:
3 Install Internet Development SDK (Version 6.0) – Build
Environment of the Microsoft® Platform SDK February 2003
Edition (see Chapter 2, Section 2.2.3.1).
• ...VP64/ssvppatch.cpp(11): fatal error C1083: Cannot open include
file: ‘ssvp.cpp’: No such file or directory or
...VP64/ssdbcs.cpp(11): fatal error C1083: Cannot open include
file: ‘ssdbcs.hpp’: No such file or directory
3 Install Visual Parse++ 4.0 (see Chapter 2, Section 2.2.4).
• ...DMParse/dmparser.cpp(160): error C2664: ‘__thiscall
SSLexTable::SSLexTable... and DMParse/dmparser.cpp(167): error
C2664: ‘__thiscall SSYaccTable::SSYaccTable...
3 Add VP40 in Preprocessor definitions of Preprocessor
Category of C/C++ Settings (Properties) of DMParse project.
• ...DMParse/dmparselib.h(31): fatal error C1083: Cannot open
include file: ‘sslex.hpp’: No such file or directory or
...VP64/ssdbcs.cpp(11): fatal error C1083: Cannot open include
file: ‘ssdbcs.hpp’: No such file or directory
3 Verify the installation of Visual Parse++ (see Chapter 2, Section 2.2.4).
• ...Putl/pnvalio.cpp(68): warning C4102: ‘abort’: unreferenced label,
and other similar warnings about this same file.
3 Open this file and insert comments in these lines.
• ...DMProv/dmglobals.cpp(307): warning C4102:
‘DmErrorHandlerLabel’: unreferenced label.
3 Open this file and replace SP_ERROR_HANDLER in line 307
by the definition of the macro without the unreferenced label:
PFAssert(!"This is a <fall through> ! Most
likely, you've forgotten to SP_REPORT_*");
• ...vc98/include/stdexcept(29): warning C4251: ‘_Str’: class
‘std::basic_string<char,struct std::char_traits<char>,class std::
allocator<char> >’ needs to have dll-interface to be used by clients of
class ‘std::logic_error’, and other warnings of the same kind.
3 If you are using MSVS 6.0, open the Program Files / Microsoft
Visual Studio / VC98 / Include / XSTRING file, replace
#ifdef _DLL by #if defined (_DLL) || defined
(_DMDLL) and include _DMDLL in the Preprocessor definitions of
the Preprocessor Category of the C/C++ Settings of the DMProv project.
_DLL is defined in all projects, but in the DMM/dmmlib.h file this
macro is undefined to “Force static runtimes”.
If you are using MSVS .NET, open the Program Files/ Microsoft
Visual Studio .NET 2003/Vc7/include/ XSTRING file, replace
#ifdef _DLL_CPPLIB by #if defined
(_DLL_CPPLIB) || defined (_DMDLL) and include
_DMDLL in the Preprocessor definitions of the Preprocessor
Category of the C/C++ Settings of the DMProv project.

If you use MSVS 6.0 to perform these steps, DMSample will be built
and registered.
However, if you are using MSVS .NET, you need to do more, because you
must port the source code to this new version of MSVS. The instructions for this
task can be seen in the idma/DMSampleModScript.txt script file.
After you start the MSSQL OLAP Service again, the new algorithm will
appear in the Mining Algorithm list of the Basic Properties of the Relational
Mining Model Editor of the MSSQL Analysis Services Manager as Sample DM
Algorithm.
A.2.1 Building the release version


The DMSample source files only include a Debug configuration. To build a
Release version, a new configuration must be created.
The Release configuration can be created by copying the Debug configuration.
In this configuration, the DEBUG macro, as well as the _DEBUG,
ID_DEBUG and DEBUG_VMALLOC macros, must be deleted from C/C++ /
Preprocessor / Preprocessor definitions of all projects. Also, the
Putl/pfdebug.h file must be modified to prevent the automatic definition of
the DEBUG macro.
In addition, the debug libraries must be replaced by the release libraries,
because the _DEBUG macro is defined automatically by the compiler when the
debug libraries are used.
Appendix B
Building DMclcMine
The instructions provided here are useful for starting your DM provider from the
original DMSample source files, as well as for better understanding the
compiling and debugging tasks to be performed when building your own provider.
If you prefer, you can assemble all DMclcMine files automatically by using the
Assemble utility, as shown in Section 2.2.7.2.
First, you must execute all instructions to build DMSample provided in
Appendix A and, then, execute all instructions provided in this appendix.

B.1 Correcting the grammar definition file


If you need to modify any token in the DMSQL grammar definition file
(DMParse/dmparser.ycc), you must make extensive corrections to the
dmreduce.cpp file, which holds the main function of the syntax parser. Rename
dmreduce.cpp as dmreduce.back.cpp. Open the dmparser.ypw workspace
with Visual Parse++ and execute:
9 Debug/Compile
9 Debug/Generate Files
Visual Parse++ will generate a new, clean dmreduce.cpp. You can see the
following notice at the beginning of this file:
You must keep this reduce function clean to avoid confusing
the AutoMerge parser. By ‘clean’ we mean that if you are going
to add a significant amount of code to a case, make a function
call instead of adding the code directly to the case.
NEVER add embedded switch statements inside this function.
Since the original dmreduce.cpp is not clean, when you modify anything in
dmparser.ycc and execute Debug/Generate Files, Visual Parse++ crashes
and produces a corrupted dmreduce.cpp file. Using this clean version of
dmreduce.cpp, you must include all function calls from the original version of
dmreduce.cpp, as well as create a file for each function call, adding each of
these files to the DMParse project.
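As a concrete illustration of the "clean" style requested in the notice above, each case of the regenerated reduce function should stay a single call into a helper that lives in its own source file added to the DMParse project. The production and helper names below are hypothetical, and the helpers take no arguments here only so that the sketch is self-contained; in the real file they receive whatever the surrounding reduce function passes them.

enum { SP_PExistingRule, SP_PNewRule };       // hypothetical production identifiers

void redSP_PExistingRule() { /* real work lives in its own .cpp file */ }
void redSP_PNewRule()      { /* real work lives in its own .cpp file */ }

void ReduceSketch(int production)             // stands in for the generated reduce function
{
    switch (production)
    {
    case SP_PExistingRule: redSP_PExistingRule(); break;   // one call per case
    case SP_PNewRule:      redSP_PNewRule();      break;   // never an embedded switch
    }
}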
B.2 Aggregate and standalone modes


The DMSample source code implements an aggregated DM provider. The
IDMASecurityAdvise/IDMASecurityNotify interfaces are used for imple-
menting a provider that can be aggregated by Microsoft® Analysis Services. If the
provider is aggregated, Analysis Server advises the aggregated provider with a
callback object that implements security notification methods defined by
IDMASecurityNotify.
To implement a standalone provider that also can be used in aggregated mode,
some modifications must be made.
A dummy IDMASecurityNotify implementation that just returns success/true
(as appropriate) for the security callback methods must be included. This object
must be instantiated into the session if IDMASecurityAdvise::AdviseSecurity
is not called on the session. The following steps show these modifications:
1. Open the DMCore/dmaggregator.h file and save it as
DMProv/dmstandalone.hpp. Use this as a template to define
IDMASecurityNotifyDummy as follows:

#ifndef __dmstandalone_hpp__
#define __dmstandalone_hpp__

#ifndef __cplusplus
#error dmstandalone.hpp requires C++ compilation
#endif

interface IDMASecurityNotifyDummy : public IDMASecurityNotify
{
public:

  SP_NO_INTERFACES(IDMASecurityNotifyDummy, IDMASecurityNotify);

  HRESULT STDMETHODCALLTYPE PreCreateMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName)
  { return S_OK; };

  HRESULT STDMETHODCALLTYPE PreRemoveMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName)
  { return S_OK; };

  HRESULT STDMETHODCALLTYPE RenameMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strOldMiningModelName,
    /* [string][in] */ wchar_t __RPC_FAR *in_strNewMiningModelName)
  { return S_OK; };

  HRESULT STDMETHODCALLTYPE IsSELECTAllowed(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName,
    /* [out] */ boolean __RPC_FAR *out_pfIsAllowed)
  { *out_pfIsAllowed = TRUE;
    return S_OK; };

  HRESULT STDMETHODCALLTYPE IsMODIFYAllowed(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName,
    /* [out] */ boolean __RPC_FAR *out_pfIsAllowed)
  { *out_pfIsAllowed = TRUE;
    return S_OK; };

  HRESULT STDMETHODCALLTYPE PostCreateMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName,
    /* [in] */ HRESULT in_hResult)
  { return S_OK; };

  HRESULT STDMETHODCALLTYPE PostRemoveMiningModel(
    /* [string][in] */ wchar_t __RPC_FAR *in_strMiningModelName,
    /* [in] */ HRESULT in_hResult)
  { return S_OK; };
};
#endif // __dmstandalone_hpp__
2. Open DMProv/StdAfx.h file:
3 Insert the following statement in DMProv header files:
#include "dmstandalone.hpp"
3. Open DMProv/Session.hpp file:
3 Insert the following private variables in the Session class:
bool IsAdviseSecurityCalled;
bool IsSecurityNotifyDummyInstantiated;
3 Replace GetDMASecurityNotify function by:
HRESULT GetDMASecurityNotify(IDMASecurityNotify ** out_ppSecurityNotify)
{
  if (IsAdviseSecurityCalled)
  {
    if (!m_spSecurityNotify) return E_FAIL;
    return m_spSecurityNotify.CopyTo(out_ppSecurityNotify);
  }
  else
  {
    if (!IsSecurityNotifyDummyInstantiated)
    {
      SPObject<IDMASecurityNotifyDummy>::CreateInstance(
        (IDMASecurityNotifyDummy **) &m_spSecurityNotify);
      if (!m_spSecurityNotify) return E_FAIL;
      IsSecurityNotifyDummyInstantiated = true;
    }
    return m_spSecurityNotify.CopyTo(out_ppSecurityNotify);
  }
}
9 Replace IDMASecurityAdvise methods by:
STDMETHOD (AdviseSecurity)(IDMASecurityNotify * in_pSecurityNotify)
{
  m_spSecurityNotify = in_pSecurityNotify;
  IsAdviseSecurityCalled = true;
  return S_OK;
}
STDMETHOD (UnadviseSecurity)(IDMASecurityNotify * in_pSecurityNotify)
{
  if (m_spSecurityNotify != in_pSecurityNotify) return E_INVALIDARG;
  m_spSecurityNotify.Release();
  IsAdviseSecurityCalled = false;
  return S_OK;
}
4. Open DMProv/Session.cpp file:


9 Insert the following statement in SPSession::Close function:
if (m_spSecurityNotify)
m_spSecurityNotify.Release();
5. Open DMProv/Session.inl file:
9 Insert the following statements in SPSession::SPSession function:
IsAdviseSecurityCalled = false;
IsSecurityNotifyDummyInstantiated = false;

B.3 Creating a new provider from DMSample


To create a new provider DMclcMine from DMSample source files, you must
perform the following steps:
1. Copy all files from SampleProvider directory to DMclcMine directory.
2. Rename all DMSample.* files to DMclcMine.* from DMclcMine directory.
3. If you are using MSVS 6.0, open DMclcMine.dsw file:
a Change the Output directories of Intermediate files and Output
files of General Settings of all projects as necessary.
b Using General Category of Link Settings of DMProv project:
9 In Output file name, replace Debug/DMSProv.dll by
Debug/DMclcMine.dll;
9 In Object/library modules, replace the directory of DMQM.lib,
as changed in step 7.
c Replace the directory of DMProv.bsc in Browse info file name of
Browse Info Settings of DMProv project.
d Using MIDL Settings of DMProv/DMSProv.idl file, replace
Output filename to “./DMclcMine.tlb”.
If you are using MSVS .NET, open DMclcMine.sln file:
a Change the Output Directory and the Intermediate Directory of
General category of General Properties of all projects, as necessary.
b Replace Debug/DMSProv.dll by Debug/DMclcMine.dll in
Output File of General category of Linker Properties of
DMProv project.
c Replace the directory of DMQM.lib, as changed in step 7, in
Additional Dependencies of Input category of Linker Properties
of DMProv project.
d Replace the directory of DMProv.bsc in Output file of General
category of Browse Information Properties of DMProv project.
e Replace Type Library of Output category of MIDL Properties of
DMProv/DMSProv.idl file to “./DMclcMine.tlb” .
4. Open DMProv/DMSProv.def file and replace DMSProv.dll by
DMclcMine.dll.
5. Open DMProv/DMSProv.rc file for text editing and replace:
9 DMSProv by DMclcMine;
9 Sample OLE DB FOR DM Provider by clcMine OLE DB FOR DM Provider;
9 Sample DM Algorithm by MS Naive Bayes.
6. Open the DMProv/DMSProv.idl, DMProv/DMSProv.rgs and
DMProv/DMErrorLookup.rgs files (the DMSProv.h file will be changed by the
MIDL compiler) and replace:
9 DMSProv by DMclcMine;
9 Sample by clcMine.
7. Open DMProv/SchemaRowsets.h file and replace Sample_
DM_Algorithm by MS_Naive_Bayes.
8. Open DMProv/dmerrorlookup.hpp file and replace CLSID_
DMSProvErrorLookup by CLSID_DMclcMineErrorLookup.
9. Open DMProv/dmerrorlookup.inl file and replace CLSID_
DMSProvErrorLookup by CLSID_DMclcMineErrorLookup.
10. Open DMProv/DMSProv.cpp file and replace:
9 CLSID_DMSProv by CLSID_DMclcMine;
9 CLSID_DMSProvErrorLookup by CLSID_DMclcMineErrorLookup;
9 LIBID_DMSProvLib by LIBID_DMclcMineLib.
11. Open DMProv/DMErrors.cpp file and replace CLSID_DMSProv by
CLSID_DMclcMine.
12. Open DMProv/DataSrc.hpp file and replace CLSID_DMSProv by
CLSID_DMclcMine.
13. Open DMProv/dmglobals.cpp file and replace TEXT("DMSProv")
by TEXT("DMclcMine").
14. Open dmparser.ypw workspace with Visual Parse++. Open
dmparser.ycc file and replace T_SampleAlgorithm token by
MS_Naive_Bayes (the algorithm name cannot contain spaces).
15. Rebuild all projects.
DMSample is designed to work only when aggregated. If you want your
provider to run in standalone mode, some extra modifications need to be carried
out (see the previous section, B.2). After you make these modifications, you should
be able to access the sample provider directly from ActiveX® Data Objects (ADO)
by supplying “DMclcMine” as the provider name in the OLE DB connection string.
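The following minimal C++/ADO sketch shows what such a connection could look like. Only the provider name “DMclcMine” comes from the text; the msado15.dll path, the bare connection string (a Data Source or other properties may also be required) and the rest of the program are assumptions to be adapted.

// Minimal sketch: connect to the standalone DMclcMine provider through ADO.
#include <windows.h>
#include <comdef.h>
#import "C:\Program Files\Common Files\System\ado\msado15.dll" \
        no_namespace rename("EOF", "adoEOF")

int main()
{
    ::CoInitialize(NULL);
    try
    {
        _ConnectionPtr pConn(__uuidof(Connection));
        // "DMclcMine" is the provider name registered in this appendix; further
        // connection properties may be needed depending on your configuration.
        pConn->Open(L"Provider=DMclcMine", L"", L"", adConnectUnspecified);
        // ... DMSQL statements can now be issued through ADO Command objects ...
        pConn->Close();
    }
    catch (_com_error& e)
    {
        ::MessageBoxW(NULL, e.Description(), L"Connection failed", MB_OK);
    }
    ::CoUninitialize();
    return 0;
}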

B.4 Renaming algorithm names


The following steps must be performed to rename the algorithm:
1. Open dmparser.ypw workspace with Visual Parse++. Open
dmparser.ycc file:
9 Replace the T_SampleAlgorithm token by (the token cannot
contain spaces):
'[Mm][Ss]_[Nn][Aa][Ii][Vv][Ee]_[Bb][Aa][Yy][Ee][Ss]' ("MS_Naive_Bayes")
9 Replace T_SampleAlgorithm by T_MSNaiveBayes;
9 Replace PSampleAlg by PMSNaiveBayesAlg;
9 Execute Debug/Compile;
9 Execute Debug/Generate Files. You will receive a message from
Visual Parse++: Auto-merge failed, your original file is in
./dmreduce.bak00x... You must remove dmreduce.cpp because it is
corrupted! The file dmreduce.bak00x must be renamed to
dmreduce.cpp.
2. Open DMParse/dmreduce.cpp file and replace SP_PSampleAlg by
SP_PMSNaiveBayesAlg.
3. Open DMProv/SchemaRowsets.h and DMProv/SchemaRowsets.cpp
files and replace:
9 SERVICE_GUID_Sample_Algorithm by
SERVICE_GUID_MS_Naive_Bayes;
9 SERVICE_NAME_SAMPLE_ALGORITHM by
SERVICE_NAME_MS_NAIVE_BAYES.
4. Open the DMProv/SchemaRowsets.h file and replace the value of the
SERVICE_NAME_MS_NAIVE_BAYES constant by “MS_Naive_Bayes”
(it must be exactly the same value as the T_MSNaiveBayes token
definition). This data is used by the Relational and OLAP Mining Model
editors to build the DMSQL queries.
5. Open DMProv/DMSProv.rc, DMProv/SchemaRowsets.cpp and
DMBase/dmresource.h files and replace:
9 IDS_SERVICE_SAMPLE_ALGORITHM_DESCRIPTION by
IDS_SERVICE_MS_NAIVE_BAYES_DESCRIPTION;
9 IDS_SERVICE_SAMPLE_ALGORITHM_DISPLAY_NAME by
IDS_SERVICE_MS_NAIVE_BAYES_DISPLAY_NAME.
6. Open the DMProv/DMSProv.rc file and replace the values of the
IDS_SERVICE_MS_NAIVE_BAYES_DISPLAY_NAME and
IDS_SERVICE_MS_NAIVE_BAYES_DESCRIPTION strings. These
strings are used in the Mining Algorithm list of the Relational and OLAP
Mining Model editors. They can be different from the token definition.
7. Open the DMCore/dmmodel.h, DMCore/dmmodel.cpp,
DMParse/dmreduce.cpp, DMProv/dmxbind.cpp and
DMProv/dmxmlpersist.cpp files and replace
DM_ALGORITHMIC_METHOD_SAMPLE_ALGORITHM by
DM_ALGORITHMIC_METHOD_MS_NAIVE_BAYES.
B.5 Implementing new algorithms


The following steps must be performed to implement new algorithms:
1. Open dmparser.ypw workspace with Visual Parse++. Open
dmparser.ycc file:
9 Insert T_SNaiveBayes token:
'[Ss][Ii][Mm][Pp][Ll][Ee]_[Nn][Aa][Ii][Vv][Ee]_
[Bb][Aa][Yy][Ee][Ss]' ("Simple_Naive_Bayes")
after T_MSNaiveBayes token. Tokens cannot contain spaces;
9 Insert PSNaiveBayesAlg definition after PMSNaiveBayesAlg:
PSNaiveBayesAlg dmm_algorithm -> T_SNaiveBayes;
9 Execute Debug/Compile;
9 Execute Debug/Generate Files.
2. Open the DMParse/dmreduce.cpp file and move the SP_PSNaiveBayesAlg
case to the same place as the SP_PMSNaiveBayesAlg case. The same function
(redSP_PAlgorithm) will be used to process this case.
3. Open DMParse/dmredAlgorithm.cpp file and modify redSP_
PAlgorithm inserting the following statement:
case SP_PSNaiveBayesAlg: Algorithm =
DM_ALGORITHMIC_METHOD_S_NAIVE_BAYES; break;
inside the switch block.
4. Open DMProv/SchemaRowsets.h file:
9 Run guidgen.exe from the common tools folder of MSVS
(Microsoft Visual Studio / Common / Tools for MSVS 6.0 and
Microsoft Visual Studio .NET 2003 / Common7 / Tools for
MSVS .NET), create a new GUID (static const format) and
insert the extern const GUID SERVICE_GUID_S_Naive_Bayes;
9 Insert the SERVICE_NAME_S_NAIVE_BAYES constant with the
value “Simple_Naive_Bayes” (it must be exactly the same value as the
T_SNaiveBayes token definition). These data are used by the Relational
and OLAP Mining Model editors to build DMSQL queries;
9 Increment by one the value of the DM_NUM_MINING_SERVICES constant;
9 Increment the value of the DM_NUM_MINING_SERVICE_PARAMETERS
constant by the number of parameters used by the new algorithm.
5. Open DMProv/DMSProv.rc file and insert the
IDS_SERVICE_S_NAIVE_BAYES_DISPLAY_NAME and
IDS_SERVICE_S_NAIVE_BAYES_DESCRIPTION strings. These
strings are used in Mining Algorithm list of Relational and OLAP
Mining Model editors. They can be different from token definition.
6. Open DMBase/dmresource.h file and insert the
IDS_SERVICE_S_NAIVE_BAYES_DISPLAY_NAME and
IDS_SERVICE_S_NAIVE_BAYES_DESCRIPTION definitions after
IDS_SERVICE_MS_NAIVE_BAYES_DESCRIPTION.
7. Open the DMXML/PersistXML.idl file and insert the
DMPI_SNAIVEBAYESMODEL constant in the DMPERSISTITEM enumerator
and modify DMPI_END to reflect this change.
8. Open the DMXML/PersistXML.cpp file and insert the L"Simple-naive-
bayes-model" value in wszTags (the array of all possible tags).
9. Open the DMCore/dmmodel.h file and insert the
DM_ALGORITHMIC_METHOD_S_NAIVE_BAYES constant in the
DM_AlgorithmicMethod enumerator.
10. Open the DMCore/dmmodel.cpp file and insert the
DM_ALGORITHMIC_METHOD_S_NAIVE_BAYES case and the appropriate
statements in the SPModel::GetDMAlgorithm function.
11. Create DMM/ModelUtils.hpp file from DMM/NaiveBayes.hpp
file, moving definitions that are common to all algorithms.
12. Create DMM/SNaiveBayes.hpp file from DMM/NaiveBayes.hpp
file and replace MSNaiveBayes by SNaiveBayes.
13. Create DMM/SNaiveBayes.cpp file from DMM/NaiveBayes.cpp
file and replace MSNaiveBayes by SNaiveBayes.
14. Open DMM/dmmlib.h file and insert the following line:
#include "SNaiveBayes.hpp"
15. Open the DMProv/dmxbind.cpp file and make the necessary
modifications for the SNBi classifier.
16. Open the DMProv/dmxmlpersist.cpp file:
9 Insert the DMPI_SNAIVEBAYESMODEL case in the
SPModel::Load function, including the appropriate statements;
9 Insert the DM_ALGORITHMIC_METHOD_S_NAIVE_BAYES case
in the SPModel::Save and SPModel::Load functions, including the
appropriate statements.
17. Open the DMProv/SchemaRowsets.cpp file:
9 Insert the necessary steps to define SERVICE_NAME_S_NAIVE_BAYES
in the _rgDMSERVICES table in the
SPServicesSchemaRowset::InitClass function (not implemented for the
SNBi classifier);
9 Insert the necessary steps to define SERVICE_NAME_S_NAIVE_BAYES
in the _rgDMSERVICE_PARAMETERS table in the
SPServiceParametersSchemaRowset::InitClass function (not
implemented for the SNBi classifier).

B.6 Inserting support for new parameters


Parameters are not processed by the parser. Support for new parameters must be
inserted in the code of many projects, as described by the following steps:
1. Open the DMBase/dmresource.h file and include the resource identifier
of the new parameters after the line:
// Resources for MINING_SERVICE_PARAMETERS schema rowset
2. Open the DMCore/dmmodel.h file:
9 Insert the strings of the identifier of the new parameter after the line:
// Flag names
9 Insert the type of the new parameter in the DMMFlagSearch structure.
3. Open DMCore/dmmodel.hpp file:
9 Insert the default constant value of the new parameter after the line:
// Constant Declarations
9 Insert the global variable definition for the new parameter after the line:
DM_Real m_HoldoutPercentage; // Training
Holdout Percentage
9 Define the accessors of the new parameter after the line:
// Accessors
9 Define the initiators of the new parameter after the line:
// Initiators
4. Open the DMCore/dmmodel.inl file (a sketch of these four accessors is given after this list):
9 Create a function SPModel::Set<parameter name> to set the
new parameter.
9 Create a function SPModel::Get<parameter name> to get the
new parameter.
9 Create a function SPModel::<parameter name> to return a
reference to the current new parameter.
9 Create a function SPModel::Reset<parameter name> to reset
the new parameter to original values.
5. Open the DMM/DMM.idl file and insert the value
DMMPROP_<parameter name> in the DMMPROPENUM enumerator.
6. Open the DMParse/dmtools.cpp file and insert the validations of the new
parameter value in the SPDataManager::ApplyModelFlags function.
7. Open the DMProv/DMSProv.rc file for text editing and insert the new
parameter description in the string table after the line:
// Mining Service Parameter Descriptions
8. Open the DMProv/dmxmlpersist.cpp file and insert support for the new
parameter in the SPModel::LoadModelFlags function.
9. Open the DMProv/SchemaRowsets.cpp file and insert support for the new
parameter in the SPServiceParametersSchemaRowset::InitClass function.
10. Open DMXML/PersistXML.cpp file and insert the new parameter tag
name in wszTags (array of all of the possible tags).
11. Open DMXML/PersistXML.idl file and add DMPI_<parameter
name> constant in DMPERSISTITEM enumerator and modify
DMPI_END to reflect this change.
12. Insert new error codes associated with the new parameter using the
instructions in Section B.8.
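The sketch referred to in step 4 above is shown below for an imaginary SAMPLE_SIZE parameter on a stand-in class. In the provider these functions are members of SPModel, live in DMCore/dmmodel.inl and use the provider's own types and default constants (steps 3 and 4); every name used here is hypothetical.

typedef long DM_Integer;                              // stand-in for the provider's type
const DM_Integer DM_DEFAULT_SAMPLE_SIZE = 0;          // default constant declared in step 3

class ModelSketch                                     // stands in for SPModel
{
public:
    void        SetSampleSize(DM_Integer in_Value) { m_SampleSize = in_Value; }             // Set<parameter name>
    DM_Integer  GetSampleSize() const              { return m_SampleSize; }                 // Get<parameter name>
    DM_Integer& SampleSize()                       { return m_SampleSize; }                 // <parameter name>: reference to the current value
    void        ResetSampleSize()                  { m_SampleSize = DM_DEFAULT_SAMPLE_SIZE; } // Reset<parameter name>
private:
    DM_Integer  m_SampleSize;                         // member variable added in step 3
};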

B.7 Inserting support for new properties


To insert support for a new property (for example: name = INIT_ASYNCH), the
following steps must be performed:
1. Open DMInclude/dbs.idl file and verify whether DBPROP_<name>
(DBPROP_INIT_ASYNCH) entry exists in DBPROPENUM enumerator.
2. Open the DMBase/dmresource.h file and insert the
IDS_DBPROP_<name> (IDS_DBPROP_INIT_ASYNCH) macro in the
appropriate place.
3. Open the DMProv/DMSProv.rc file and insert the
IDS_DBPROP_<name> (IDS_DBPROP_INIT_ASYNCH) string
definition in the appropriate place.
4. Open the DMProv/DataSrc.hpp file and insert the
PROPERTY_INFO_ENTRY(<name>)
(PROPERTY_INFO_ENTRY(INIT_ASYNCH)) property info entry
in the appropriate place.
5. Open the DMProv/Properties.h file and insert the <name>_Flags
(INIT_ASYNCH_Flags), <name>_Type (INIT_ASYNCH_Type) and
<name>_Value (INIT_ASYNCH_Value) macros in the appropriate place
(a sketch of these macros is given after this list).
6. Insert the code needed to handle the new property, usually in the
SPDataSource::Initialize function of the DMProv/DataSrc.hpp file.
7. Evaluate the implemented resources by using RowsetViewer or another
tool.
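As an illustration of step 5 above, the three macros for the INIT_ASYNCH example could look like the sketch below. The flag combination, the VT_I4 type and the default value 0 are assumptions; the real entries must match the way DMProv/DataSrc.hpp and the property-info tables use them.

#include <oledb.h>      // DBPROPFLAGS_* and the VT_* automation types

// Hypothetical values only – adjust to the behaviour the property must expose.
#define INIT_ASYNCH_Flags  (DBPROPFLAGS_DBINIT | DBPROPFLAGS_READ | DBPROPFLAGS_WRITE)
#define INIT_ASYNCH_Type   VT_I4
#define INIT_ASYNCH_Value  0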

B.8 Inserting new error codes


The necessary steps for inserting new error codes are listed as follows. They will be
illustrated with the insertion of the error regarding the impossibility of predicting
continuous attributes.
1. Open the DMBase/dmerrors_rc.h file and insert the error resource identifier
according to the error type syntax. For errors regarding the DMM, the
identifier will be: IDS_SPE_DMM_<error code>
(IDS_SPE_DMM_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE).
2. Open the DMProv/dmerrors.rc file and insert the error string definition in
the corresponding table (“This algorithm can’t predict continuous
attribute”).
3. Open the DMBase/dmerrors.h file and insert the return error code
according to the error type syntax. For DMM errors this syntax will be:
SPE_DMM_<error code>. For the example, the following statement
must be inserted:
DM_DEFINE_ERROR_HRESULT(SPE_DMM_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE)
4. Open the DMM/DMM.idl file and insert the return error code definition
according to the syntax DMME_<error code>. For the example, the
following statement must be inserted:
DMM_DEFINE_ERROR_HRESULT(DMME_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE, 0x000D)
5. Open the DMProv/DMErrors.cpp file and insert the appropriate
statements regarding the new error in the DMMapDMMErrorCode
function.
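A sketch of the mapping asked for in step 5 is given below. The real DMMapDMMErrorCode function in DMProv/DMErrors.cpp may be organized differently; only the two error identifiers come from the steps above, and the stand-in definitions with placeholder values exist only to keep the fragment self-contained.

#include <windows.h>    // HRESULT

// Stand-ins for the codes defined in steps 3 and 4 (placeholder values).
#define SPE_DMM_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE  ((HRESULT)0x80040E01L)
#define DMME_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE     ((HRESULT)0x8004000DL)

HRESULT MapDMMErrorCodeSketch(HRESULT in_hrDMM)        // mirrors DMMapDMMErrorCode
{
    switch (in_hrDMM)
    {
    case DMME_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE:
        return SPE_DMM_ALG_CANT_PREDICT_CONTINUOUS_ATTRIBUTE;   // new mapping (step 5)
    default:
        return in_hrDMM;                                        // other codes pass through
    }
}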
Other files that deal with error codes are the following:
DMBase/DMErrors.hpp, DMBase/DMErrors.inl and DMBase/dmbsglob.cpp.

B.9 XML load & save support


DMMs are archived in XML text or binary format. Text-format archiving is
carried out using the Predictive Model Markup Language (PMML) specification [7],
a markup language for statistical and DM models, based on the Extensible Markup
Language (XML) [8].
For debugging purposes, it is useful to see XML files in text format,
which can be viewed with any text editor or internet browser. The DMM XML file
is located in the MSSQL Analysis Services data folder, named <database
name> / DMA.<DM provider DLL name> / <DMM name>.sdmm.xml.
To define text-file archiving as the default, the m_dmpfDefault variable
must be initialized to DMPF_XML instead of DMPF_BIN in the SPDataSource::
SPDataSource function of the DMProv/DataSrc.inl file.
Also verify lines 68 and 222 of DMProv/dmglobals.cpp file.
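A minimal sketch of that one-line change is shown below; m_dmpfDefault, DMPF_XML and DMPF_BIN are the names given above, while the enum and the stand-in constructor exist only to make the fragment self-contained.

enum DMPERSISTFORMAT { DMPF_BIN, DMPF_XML };        // stand-in for the provider's enum

struct DataSourceSketch                             // stands in for SPDataSource (DataSrc.inl)
{
    DMPERSISTFORMAT m_dmpfDefault;
    DataSourceSketch() : m_dmpfDefault(DMPF_XML) {} // was DMPF_BIN (binary archiving)
};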

B.10 Debugging
The DM provider can be debugged either running integrated with the Analysis
Services server or running as a standalone server.

B.10.1 Aggregate mode


To debug a DM provider aggregated with MSSQL Analysis Services, the MSSQL
OLAP Service should be set as the executable for the debugging session:
9 If you are using MSVS 6.0, select the General Category of the Debug
Settings of the DMProv project and set Executable for debug session to
“Program Files/Microsoft Analysis Services/Bin/msmdsrv.exe”
and Program arguments to “-c”, which means console mode.
If you are using MSVS .NET, select the Action category of the Debugging
Properties of the DMProv project and set Command to “Program
Files/Microsoft Analysis Services/Bin/msmdsrv.exe” and
Command Arguments to “-c”.
The MSSQL OLAP Service must be stopped before starting the debugging session.
Note that the DLL of a DM provider will not be loaded until its DM functionality
is requested, unless you ask the debugger to load its symbols initially.
At the start of debugging in MSVS, a console window will appear with the
message: The Analysis server started successfully.
Start the MSSQL Analysis Services Manager and select the Mining Models
folder of any database.
Each registered DM provider DLL will be called and the following functions
will be processed:
9 DllMain (DMProv/DMSProv.cpp);
9 DllGetClassObject (DMProv/DMSProv.cpp);
9 SPDataSource::Initialize (DMProv/DataSrc.cpp);
9 SPModel::Load (DMProv/dmxmlpersist.cpp) called by the previous
function in the statement:
SP_CHECK_ERROR(pDataManager->Init(pQC));
This last function retrieves all existing DMMs, reading the following segments:
9 statements – DMM creation statements;
9 data-dictionary – Data dictionary definition;
9 global-statistics – Global statistics data;
9 Simple-naive-bayes-model – DMM contents.
Reading the “data dictionary” segment involves the following functions:
9 SPModel::LoadColumn (DMProv/dmxmlpersist.cpp);
9 SPQueryManager::DoTrainAttributes (DMQM/DMQueryManager.cpp),
called for each attribute to compute all values of the discrete attributes.
When a specific DMM is selected and Action/Process is activated, the
following functions will be processed:
9 SPQueryManager::DoCreate (DMQM/DMQueryManager.cpp);
9 SPDataManager::CreateModel (DMParse/dmtools.cpp);
9 SPQueryManager::DoTrainModel (DMQM/DMQueryManager.cpp).
The SPModel::Load and SPQueryManager::DoTrainModel functions
call the DMM-specific functions.

B.10.2 Standalone mode


To debug a DM provider with a Microsoft® Visual Basic® project, such as the
DMSamp utility [3] (see Chapter 2, Section 2.2.5), Visual Basic should be
set as the executable for the debug session:
9 If you are using MSVS 6.0, select the General Category of the Debug
Settings of the DMProv project and set Executable for debug session
to “Program Files/Microsoft Visual Studio/VB98/VB6.EXE” and
Program arguments to “DMSamp/DMSamp.vbp”.
If you are using MSVS .NET, consult Debugging Across
Languages / Debugging in MSVS .NET of the MSVS .NET Technical
Articles. The DMSamp project cannot be compiled with this version
of MSVS. However, the executable version of this utility can be
used to debug a DM provider, as you will see in the following
paragraphs.
It does not matter whether MSSQL OLAP Service is stopped or not, since the
DM Provider will run in standalone mode.
To debug a DM Provider with an application like DMSamp utility [3] (see
Section 2.2.5), the executable of the application must be set as the executable for
the debug session:
9 If you are using MSVS 6.0, select General Category of Debug
Settings of DMProv project and set Executable for debug session
to “DMSamp/DMSamp.exe” and Program arguments to any value
you wish. This argument can be the name of a DMSQL file to be executed,
for example: “idma/data/allelet/AllElet_SNB_Queries.xml”.
If you are using MSVS .NET, select Action category of Debugging
Properties of DMProv project and set Command to “DMSamp/
DMSamp.exe” and Command Arguments to any value you wish.

B.11 Cleaning garbage


To clean garbage produced by the DM providers, you must look at the MSSQL Analysis
Services data folder. To see the location of this folder, run the MSSQL Analysis Services
Manager and execute Action/Properties, selecting the desired Analysis Server.
This folder contains the log file each DM provider creates. MSDMine.log is the
log file of the Microsoft® OLE DB Provider for DM Services (MSDMine) and
DMclcMine.log is the log file of the DMclcMine provider. Some of these files need
to be removed at regular intervals or when DMMs and databases are deleted. This folder
also contains the <database name>.odb file with the definition of the databases.
For each database, a folder named <database name> is created in the data
folder. In each database folder, the following are created:
9 for each data source, a file named <server name> – <database name>.src;
9 for each DMM, a file named <DMM name>.dmm.xml (these files are
always empty);
9 for each DM provider, a folder named DMA.<DM provider DLL name>.
Inside this folder, a file named <DMM name>.sdmm.xml is created for
each DMM. These files contain the DMM contents. Also, a file named
<DMM name>.debug.txt is created by the DMclcMine provider;
9 for each access role, a file named <rule name>.role;
9 for each DMM, a folder named <DMM name>. Inside this folder, a file
named <rule name>.dmmsec is created for each rule.
When you delete a DMM, these files are not deleted. When you use the
MSSQL Analysis Manager, the deleted DMMs and databases no longer appear,
but the corresponding files remain available for debugging tasks. When you
debug the MSSQL OLAP Service, all these files will be loaded and sometimes this
program can crash. The files named *.log and *.txt can be removed at any time.
When the DM provider is used in standalone mode, the DMM XML files, the
log files and the debug files are created in the default folder.
Table B.1: DMclcMine – List of created and modified files.

B.12 Bugs
A few bugs were detected while using DM providers and MSSQL Analysis
Services:
9 When you use an invalid algorithm name, the error message reported by
the Analysis Server is incorrect. This bug has not been fixed yet and may
be in the __DMSetLastErrorExHelper function of the DMBase/dmbsglob.cpp
file, in PFError::SetLastError (pnerror.cpp file of the Putl project) or in
PNWriteDebug (pnglobal.cpp file of the Putl project);
9 Opening a long-running view from the MSSQL Enterprise Manager gets a
time-out. The query time-out is 30 s, and the time-out value is not configurable.
To circumvent this bug, you can use the Microsoft® SQL Query Analyzer to
run the long-running view. If necessary, you can modify the Query
time-out of this tool using Tools/Options/Connections/Query time-
out = 9999 seconds, for example. It seems that using a zero value to permit
infinite time-out queries does not work.
9 Running the MSDT classifier with enormous training data sets and continuous
input attributes causes MSDMine to crash. This error was reported to
Microsoft and was qualified as a bug to be fixed in a future release of
MSSQL Analysis Services. For customers with this problem, Microsoft
has suggested a workaround, i.e. discretizing the continuous attributes,
and will create a Knowledge Base (KB) article for this.

B.13 Created and modified files


Table B.1 shows all files created and modified, starting from DMSample, to
implement DMclcMine. All files marked with an asterisk (all created files and some
modified files) are included in the IDMA CD.
Appendix C
Running the experiments
This appendix shows the details of the experiments, such as the data generator
programs, the SQL script files used to create the databases, the DMSQL statements
used to create the DMMs and to make data predictions, and the DTS packages for
managing the processes of data import, training and data prediction. The MSDT
parameters are also described.

C.1 Microsoft® decision trees parameters


The descriptions of these parameters are extracted from the SQL Server Books
Online [13] and from the Analysis Services – Data Mining Group web site [30].

C.1.1 COMPLEXITY_PENALTY
A floating point number, with a range between 0 and 1 (exclusive), that acts as a
penalty for growing a tree. The parameter is applied at each additional split. A
value of 0 applies no penalty, and a value close but not equal to 1 (1.0 is outside
the range) applies a high penalty. Applying a penalty limits the depth and
complexity of learned trees, which avoids over-fitting. However, using too high a
penalty may adversely affect the predictive ability of the learned model. The
effect of this mining parameter depends on the mining model itself; some
experimentation and observation may be required to accurately tune the DM
model. The default value is based on the number of attributes for a given model:
9 For 1–9 attributes, the value is 0.5;
9 For 10–99 attributes, the value is 0.9;
9 For 100 or more attributes, the value is 0.99.
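The illustrative helper below simply restates these defaults; it is not part of the MSDT implementation.

// Default COMPLEXITY_PENALTY as a function of the number of attributes.
double DefaultComplexityPenalty(int in_cAttributes)
{
    if (in_cAttributes <= 9)  return 0.5;    // 1–9 attributes
    if (in_cAttributes <= 99) return 0.9;    // 10–99 attributes
    return 0.99;                             // 100 or more attributes
}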

C.1.2 MINIMUM_LEAF_CASES
A non-negative integer within a range of 0 to 2 147 483 647. It determines the
minimum number of leaf cases required to generate a split in the decision tree.
A low value causes more splits in the decision tree, but can increase the
likelihood of over-fitting. A high value reduces the number of splits in the
decision tree, but can inhibit the growth of the decision tree. The default value
is 10.
C.1.3 SCORE_METHOD
This identifies the algorithm used to control the growth of a decision tree. This
algorithm selects the attributes that constitute the tree, the order in which the
attributes are used, the way in which the attribute values should be split up and
the point at which the tree should stop growing. Valid values: 1, 2, 3, 4. The
meanings of these values are:
1. Entropy: based on the entropy gain of the classifier.
2. Orthogonal: a home-grown method based on the orthogonality of
the classifier state distribution. This scoring method yields binary splits
only, which may end up with too-deep trees.
3. Bayesian with K2: based on the Bayesian score with a K2 prior.
4. Bayesian Dirichlet Equivalent with Uniform prior: the default scoring
method [19].

C.1.4 SPLIT_METHOD
This describes the various ways in which SCORE_METHOD should consider splitting up
attribute values. For example, if an attribute has five potential values, the values
could be split into binary branches (for example, 3 and 1,2,4,5), or the values
could be split into five separate branches, or some other combination may be
considered. A value of 1 results in decision trees that have only binary branches; a
value of 2 results in decision trees with multiple (N-ary) branches; a value of 3
(the default) allows the algorithm to use binary or multiple branches as needed.
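The enumerations below are only a symbolic summary of the SCORE_METHOD and SPLIT_METHOD values described above; MSDT itself takes the plain integer values, and these names are not part of any API.

enum ScoreMethod
{
    ScoreEntropy      = 1,   // entropy gain of the classifier
    ScoreOrthogonal   = 2,   // orthogonality of the state distribution (binary splits only)
    ScoreBayesianK2   = 3,   // Bayesian score with a K2 prior
    ScoreBayesianBDeu = 4    // Bayesian Dirichlet Equivalent, uniform prior (default)
};

enum SplitMethod
{
    SplitBinaryOnly   = 1,   // binary branches only
    SplitNaryOnly     = 2,   // multiple (N-ary) branches
    SplitBinaryOrNary = 3    // default: binary or multiple branches as needed
};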

C.2 DTS packages


All computational experiments are carried out using MSSQL, so appropriate
utilities are needed to run these experiments in an easy and rapid fashion.
The natural choice for developing these utilities is DTS packages.
DTS packages are object-oriented graphic tools used for data transformation in
the MSSQL environment.
Using a graphic API, data connection objects and task objects are used to assemble
data transformation procedures. Data connections can use text files, MSSQL data and
Microsoft® Access® databases, among others. Tasks can transfer data from the Internet,
perform data transformations, execute external applications, execute SQL queries,
execute DM tasks, execute data prediction tasks, execute other DTS tasks, etc.
DTS packages can run in the MSSQL environment or standalone, through
direct commands or be included in batch command f iles. These packages can use
timers to run once at pre-determined times or at regular time intervals.
Fig. C.1 shows the DTS Package Editor screen with all programmable objects
used for data import (idma/data/meteo/meteo_Import.dts) in the
Meteorological experiment. This figure clearly illustrates the processing workflow
to carry out this task, as well as all necessary objects, described in Section C.4.2.

C.3 Waveform recognition problem


This example is a three-class problem, based on linear waveforms, denoted by
h1(t), h2(t) and h3(t), shown in figs. C.2–C.4 and described in detail by Breiman
et al. [18].
Figure C.1: Meteo – DTS – Data import workflow.

Each class consists of a random convex combination of two of these waveforms,
sampled at the integers, with noise added. The 21 continuous attributes
are the values of the measurement vectors (cases or instances). Figs C.5–C.7
show the first three cases of each class from the 5000 cases provided by the UCI
Repository [21].
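The sketch below illustrates the generation scheme following Breiman et al. [18]; it is not the idma waveform.c program, which draws its random numbers from the probdist utility [31]. The triangular base waveforms and the class-to-waveform pairing are taken from the usual description of this data set and should be checked against figs. C.2–C.4 and the original generator.

#include <cstdlib>
#include <cmath>

// Base waveforms: triangles of height 6 peaking at t = 7, 15 and 11 (see figs. C.2–C.4).
double H(int k, int t)
{
    static const int peak[3] = { 7, 15, 11 };
    double v = 6.0 - std::abs(t - peak[k]);
    return v > 0.0 ? v : 0.0;
}

double Gauss()                                   // standard normal noise (Box–Muller)
{
    double u1 = (std::rand() + 1.0) / (RAND_MAX + 2.0);
    double u2 = (std::rand() + 1.0) / (RAND_MAX + 2.0);
    return std::sqrt(-2.0 * std::log(u1)) * std::cos(6.283185307179586 * u2);
}

// Fill x[1..21] with one case of the given class (0, 1 or 2).
void GenerateCase(int cls, double x[22])
{
    static const int pair[3][2] = { {0, 1}, {0, 2}, {1, 2} };   // waveforms mixed per class
    double u = std::rand() / (double)RAND_MAX;                  // convex-combination weight
    for (int t = 1; t <= 21; ++t)
        x[t] = u * H(pair[cls][0], t) + (1.0 - u) * H(pair[cls][1], t) + Gauss();
}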

Figure C.2: Waveform 1.



Figure C.3: Waveform 2.

C.3.1 Creating the data


The training data sets were created by the idma/data/waveform/generator/
waveform.c program. The random numbers are generated by the probdist
program [31].
The seeds to generate the random numbers were: 3, 50, 100, 1000 and 10 000 to
generate, respectively, the 300, 5000, 10 000, 100 000 and 1 000 000 training cases.

C.3.2 Creating the database


The database, with all its objects, such as tables, views and stored procedures, was
created by the idma/data/waveform/waveform.sql script file.

Figure C.4: Waveform 3.



Figure C.5: First three cases of class 0.

C.3.3 DTS packages


This experiment uses several DTS packages for data import and for processing
the training, prediction and cross-validation tasks, described as follows.
The first part of this experiment considered five data groups, and the complete
processing of each data group was carried out fully by the DTS package
(idma/data/waveform/waveform.dts) shown in fig. C.8. This figure shows all
programmable objects of the package; the most important will be detailed.

Figure C.6: First three cases of class 1.



Figure C.7: First three cases of class 2.

It must be noted that two batch command file tasks, c45exe.bat and wekaNaive.bat,
correspond, respectively, to the processing of the C4.5 and WEKA classifiers.
The data import workflow of this experiment (idma/data/waveform/waveform_Import.dts)
is shown in fig. C.9, where two distinct data files should be
noted, one for training and another for testing. These data are accessed by two
Text File (Source) data connection objects, named Train Data, for reading the
training data set, and Test Data, for reading the testing data set. This figure also
shows a double SQLOLEDB database connection object, named Waveform,
which allows connection with the data in MSSQL.
To begin, all existing data is removed from the Waveform MSSQL database by
a SQL statement, included in an Execute SQL Task named Remove data:
DELETE FROM Waveform
In the following, an ActiveX Script Task, with Visual Basic® script commands,
initializes the record counter, as shown in listing C.1.
In the next step, the training data of the text file are archived in MSSQL
by using a Transform Data Task, defined by another Visual Basic® script, as
shown in listing C.2. A similar task transfers the testing data set.
Fig. C.10 shows the processing workflow of the training and prediction
tasks for the MSDT1 classifier (idma/data/waveform/waveform_MSDT1_Train&Predict.dts).
The running time control tasks (Init TrainPredTime,
Init TestPredTime and End TestPredTime), carried out by stored procedures
included in the idma/data/waveform/waveform.sql script file, should be
noted. The processing workflow of the other classifiers is similar to this.
The processing workflow of the cross-validation task (idma/data/
waveform/waveform_Cross_Validation.dts) is shown in fig. C.11.

Figure C.8: Waveform – Full processing workflow of the first part of the experiment.

On the other hand, fig. C.12 shows the processing workflow of the block data import
(idma/data/waveform/waveform_Import_Blocks.dts). Data transfer is
carried out by a Visual Basic® script similar to that shown in listing C.2, while
the stored procedure to create the blocks, named Create_Blocks and called by the
Create Blocks Execute SQL Task, is included in the idma/data/waveform/
waveform.sql script file.

Figure C.9: Waveform – Data import workflow.



Listing C.1: Record count initializer script.

Listing C.2: Waveform – Text file data import.

All tasks, beginning in Create Views and finishing in Inc ActiveBlock,
are executed N times (N = 10). This loop is controlled by the Inc ActiveBlock
task, as shown in listing C.3.

C.3.4 DMSQL statements


The DMSQL statements used to create the DMMs for all classifiers are shown
in Table C.1. On the other hand, listing C.4 shows the DMSQL statement required
to predict the testing data set for the SNBi classifier.

Figure C.10: Waveform – Training and prediction processing tasks of the MSDT1 classifier.

Figure C.11: Waveform – Cross-validation – Processing workflow.

The respective statements for the other classifiers are similar to this. Extensive
repetitions in these statements are replaced by three consecutive dots (...).

C.4 Meteorological data


Table C.2 shows the attribute description of the meteorological data set [22]. In
addition to these attributes, one more was used, a record key identifier, required
by the classifiers supported by DM providers.
Initially, the weather prediction attribute (Class) could assume values from 0
to 9, indicating the following situations: nothing to tell (0), rain to the view (1),
dry fog or smoke (2), sand or dust (3), moist fog or thick fog (4), sprinkle (5), rain
(6), snow (7), thunderstorm or lightning (8), and hail (9). After data preparation,
the value 9 (occurring only once) and the values 3 and 7 (never occurring) were
excluded; thus, in the end, this attribute had seven values.

Figure C.12: Waveform – Block data import.

Listing C.3: Waveform – Cross-validation loop control script.

Table C.1: Waveform – DMSQL statements to create the DMMs.

C.4.1 Creating the database


The database, with all its objects, such as tables, views and stored procedures, was
created by the idma/data/meteo/meteo.sql script file.

C.4.2 DTS Packages


The first part of this experiment uses the whole data split in two data sets, one
for training and another for testing the DMM generated by the training data set.

Listing C.4: Waveform – SNBi classifier – DMSQL statement to predict the testing data set.

Fig. C.13 shows the full processing workflow of this part of the experiment
(idma/data/meteo/meteo.dts) and fig. C.14 illustrates the data import
workflow (idma/data/meteo/meteo_Import.dts). There is a double SQLOLEDB
database connection object, named Meteo, which allows connecting with the data in
MSSQL.
There are three Text File (Source) data connection objects, named All
Data, to read the whole data, Train Data, to write the training data in text
format, and Test Data, to write the testing data in text format. These text
formatted data are used by the C4.5 classifier.
To begin, all existing data is removed from the meteo MSSQL database by a
SQL statement included in an Execute SQL Task, named Remove data:

DELETE FROM Meteo


In the following, an ActiveX Script Task, with Visual Basic® script com-
mands, initializes the record counter, as shown in listing C.3.
In the next step, the whole data of the text file are archived in the MSSQL by
using a Transform Data Task, defined by another Visual Basic® script, as
shown in listing C.5.
After this step, the whole data is split into two parts, using the column Train,
by a stored procedure, named Create_Test_Data, included in the idma/data/
meteo/meteo.sql script file and called by the Create Test Data task.

Table C.2: Meteo – Attribute description of the meteorological data set.

Concluding, by using two simple data transfer tasks, the training and the testing
data sets are archived in text files. These tasks use two distinct SQL statements
to select the desired data, as can be seen in the two following lines:
SELECT * FROM Meteo WHERE Train=1 -- training data
SELECT * FROM Meteo WHERE Train=0 -- testing data
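
The stored procedure that performs the split is not reproduced in the text. The
following is only a minimal sketch of one way Create_Test_Data could materialize
the two partitions from the Train flag; the table names Meteo_Train and Meteo_Test
are assumptions, and the actual procedure in idma/data/meteo/meteo.sql may instead
simply assign the flag.

-- Hedged sketch only; see idma/data/meteo/meteo.sql for the actual procedure.
CREATE PROCEDURE Create_Test_Data AS
BEGIN
    -- Rebuild the two partitions from the Train flag column.
    IF OBJECT_ID('Meteo_Train') IS NOT NULL DROP TABLE Meteo_Train
    IF OBJECT_ID('Meteo_Test')  IS NOT NULL DROP TABLE Meteo_Test
    SELECT * INTO Meteo_Train FROM Meteo WHERE Train = 1
    SELECT * INTO Meteo_Test  FROM Meteo WHERE Train = 0
END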

The processing workflow of the training and prediction tasks is similar to that
shown in fig. C.10 of the Waveform experiment. Likewise, the cross-validation
procedures are similar to those shown in the Waveform experiment and will not
be described here.

C.4.3 DMSQL statements


The DMSQL statements used to create the DMMs for all classifiers are shown in
table C.3, while listing C.6 shows the DMSQL statement required to predict the
testing data set for the SNBi1 classifier. The respective statements of the other
classifiers are similar to this.
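As with the creation statements, only a hedged sketch of the general shape of such
a prediction statement is given here; the model name, attribute names and connection
string are illustrative assumptions, not the contents of listing C.6.

-- Illustrative sketch of a DMSQL prediction query (not the actual listing C.6):
-- the testing cases are read through OPENROWSET and joined to the trained DMM.
SELECT t.[Id], [Meteo_SNBi1].[Class]
FROM [Meteo_SNBi1]
PREDICTION JOIN
    OPENROWSET('SQLOLEDB',
               'Data Source=localhost;Initial Catalog=meteo;Integrated Security=SSPI',
               'SELECT * FROM Meteo WHERE Train=0') AS t
ON  [Meteo_SNBi1].[Att1] = t.[Att1]
AND [Meteo_SNBi1].[Att2] = t.[Att2]
-- one ON mapping per input attribute (elided here, as in the book's listings)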

Figure C.13: Meteo – Full processing workflow of the first part of the
experiment.

C.5 Life insurance data


Fig. C.15 shows the existing relationships between the 10 original tables
distributed for the KDD Sisyphus I competition [25]. Table C.4 shows the
descriptions of all attributes of these original tables, indicating the attributes
used for each part of the experiment.
The distributed files describe the relationships between the clients (partners),
the insurance contracts and the components of the insurance tariffs.
Table part contains all partners, as the company calls its clients. Tables
eadr and padr contain, respectively, the electronic addresses (fax, phone, etc.)
and the postal addresses of the partners. Details about the households they are

Figure C.14: Meteo – Data import workflow.



living in can be found in table hhold. Each partner can play roles in certain
insurance policies (table vvert), realized by table parrol. If a partner is the
insured person of the contract, then tariff role records (table tfrol) specify
certain further properties. An insurance contract can have several components
(e.g. the main contract part plus a component insuring against the case that the
insured person becomes invalid), each of which (recorded in table tfkomp) is
related to a tariff role of the respective partner. Finally, each policy concerns
a certain product (table prod) and tariff components are bound to dedicated
insurance tariffs (table lvtarf).
The prod and lvtarf tables were not distributed. The taska and taskb tables
contain, respectively, classes assigned to partners and to households.

C.5.1 Creating the database


The idma/data/insurance/insurance.sql and idma/data/sisyphus/sisyphus.sql
script files are used to create the prepared and the original databases,
containing all required objects, such as tables, views and stored procedures. The
latter file contains the taska1 query used to assemble the unprepared data from
the original tables.

Listing C.5: Meteo – Text file data import.



Table C.3: Meteo – DMSQL statements to create the DMM.



Listing C.6: Meteo – SNBi1 classifier – DMSQL statement to predict the testing
data set.

Figure C.15: Insurance – Original tables relationships.

C.5.2 DTS packages


The first part of this experiment, related to the prepared data, is completely
similar to the Meteorological experiment. The DTS packages are similar and will
not be reproduced again.
The second part of this experiment, related to the original (unprepared) data,
differs in the data import and assembly and was carried out by the DTS packages
shown in figs. C.16 and C.17.
The DTS package for data assembling, through the Create Test Data
object, executes the stored procedures Assemble_sisyphus, which creates a
single table using the taska1 query, and Create_Test_Data, which splits the
whole data into the training and testing data sets. The other objects modify these
tables, identifying the null columns, and export the data to be used by the C4.5
classifier.
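Neither stored procedure is listed in the text; the sketch below only illustrates the
kind of flattening step Assemble_sisyphus performs, under the assumption that taska1
is available as a view and that the target table is called sisyphus_all (both names
are illustrative; the real definitions are in idma/data/sisyphus/sisyphus.sql).

-- Hedged sketch only; the actual procedure is defined in sisyphus.sql.
CREATE PROCEDURE Assemble_sisyphus AS
BEGIN
    IF OBJECT_ID('sisyphus_all') IS NOT NULL DROP TABLE sisyphus_all
    -- taska1 joins the original tables (part, hhold, parrol, vvert, ...)
    -- into one partner-level record per case.
    SELECT q.* INTO sisyphus_all FROM taska1 AS q
END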

C.5.3 DMSQL statements


The DMSQL statements used to create the DMMs for all classifiers of the first
part of the experiment are shown in table C.5, while listing C.7 shows the DMSQL
statement required to predict the testing data set for the SNBi classifier. The
respective statements of the MSDT classifier are similar to this. Extensive
repetitions in these statements are replaced by three consecutive dots (...).
Table C.6 and listing C.8 show the respective DMSQL statements for the sec-
ond part of the experiment.

Table C.4: Insurance – Attribute description.



Figure C.16: Insurance – Original data – Data import workflow.



Figure C.17: Insurance – Original data – Data assembling workflow.

C.6 Performance study


The performance study was carried out through four experiments. The data sets
used by the experiments were generated artificially by a computer program,
DATGEN [28]. In the following, the details of the data generation, as well as the
DTS packages and stored procedures used in the task processing, are described.

Table C.5: Insurance – Prepared data – DMSQL statements to create the DMM.

C.6.1 Creating the data


The DATGEN program needed to be modified to run on the installed operating
system. The drand48 and srand48 functions were created, as shown in listing
C.9. Table C.7 shows the parameters used with this program to generate the data
for the different experiments of this study.

C.6.2 Creating the database


The idma/data/performance/VarInpAtt.sql, idma/data/performance/VarNumCases.sql,
idma/data/performance/VarStates.sql and idma/data/performance/VarPredAtt.sql
script files contain the database generation scripts, including all objects such
as tables, views and stored procedures, for all the experiments in this study.

Listing C.7: Insurance – Prepared data – SNBi classifier – DMSQL statement to
predict the testing data set.

Table C.6: Insurance – Original data – DMSQL statements to create the DMM.

Listing C.8: Insurance – Original data – SNBi classifier – DMSQL statement to
predict the testing data set.

Listing C.9: Functions for the DATGEN program.


C.6.3 DTS packages


The experiments of this study have several DTS packages, which are similar to
those described previously, but some are reproduced as follows.
Fig. C.18 shows the processing workflow of the prediction by parts tasks of the
SNBi classifier. The Create Train View, Train Prediction and Inc View
tasks are executed as many times as necessary to complete the prediction of all
desired cases. The Visual Basic® script file, called by the Init View task, executes
the necessary initializations for the following tasks, as shown in listing C.10. The
control loop is carried out by the Inc View task, as shown in listing C.11.
This package also has global variables, defined to allow its use for all desired
prediction tasks. Thus, there are variables for the index of the active view, for
the increment of the number of cases, for the total number of cases, for the name
of the current task being processed and, also, for the number of views, computed
by the Init View task, as shown in listing C.10. Fig. C.19 shows the DTS package
used to process all prediction by parts tasks of the SNBi classifier, which, through
appropriate definitions of the global variables, uses the same DTS package shown
in fig. C.18 to execute these tasks.
Fig. C.20 shows the DTS package used for the training task of the SNBi classifier,
which also has global variables defined to allow another DTS package, similar to
that used for processing all prediction by parts tasks, to perform the processing of
all training tasks.
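In essence, each pass of the Create Train View task restricts the prediction input
to one slice of the cases. The statement below is only an illustrative sketch of the
kind of view such a pass could build; the table, column and view names and the slice
size are assumptions, and in the actual package the slice boundaries come from the
global variables and are assembled by the Visual Basic® script.

-- Hedged sketch: one slice of the testing cases for a prediction by parts pass.
CREATE VIEW VarNumCases_Part1 AS
SELECT *
FROM   VarNumCases_Test
WHERE  Id >= 1 AND Id <= 50000   -- slice boundaries supplied by the DTS script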

C.6.4 DMSQL statements


The following tables show the DMSQL statements used in all experiments of this
study. Again, extensive repetitions in these statements are replaced by three
consecutive dots (...).
Table C.8 shows the statements to create the DMM with 12 input attributes of the
VarInpAtt experiment. The statements for other numbers of attributes are similar
to this one. This experiment does not have any prediction task.

Table C.7: DATGEN parameters.

Table C.9 shows the DMSQL statements to create the DMM of the
VarNumCases experiment, and listing C.12 shows the DMSQL statement to
predict the testing data set for the SNBi classifier of this same experiment.
Table C.10 shows the DMSQL statements to create the DMM of the
VarStates experiment, while table C.11 shows the DMSQL statements to create
the DMM with 32 prediction attributes of the VarPredAtt experiment. The
statements for other numbers of attributes are similar to this.
These last two experiments do not have any prediction task.
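For completeness, the statement that populates (trains) a DMM from a relational
table is sketched below. This is only an illustrative shape, not the contents of
tables C.8 to C.11: the model, column and table names and the connection string
are assumptions, and the real statements list many more attributes.

-- Hedged sketch of a DMSQL training statement: the source rows are read
-- through OPENROWSET and inserted into the (empty) mining model.
INSERT INTO [VarNumCases]
    ([Id], [Att01], [Att02], [Class])
OPENROWSET('SQLOLEDB',
           'Data Source=localhost;Initial Catalog=VarNumCases;Integrated Security=SSPI',
           'SELECT Id, Att01, Att02, Class FROM TrainData')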

Figure C.18: VarNumCases – SNBi classifier – Prediction by parts task.

Listing C.10: Task loop initializer script.



Listing C.11: Task loop script.

Figure C.19: VarNumCases – SNBi classifier – Processing workflow of prediction
by parts tasks.

Figure C.20: VarNumCases – SNBi classifier – Training task.



Table C.8: VarInpAtt – DMSQL statements to create the DMM with 12 input
attributes.

Table C.9: VarNumCases – DMSQL statements to create the DMM.

Listing C.12: VarNumCases – SNBi classifier – DMSQL statement to predict the
testing data set.

Table C.10: VarStates – DMSQL statements to create the DMM.

Table C.11: VarPredAtt – DMSQL statements to create the DMM with 32
prediction attributes.
Appendix D
List of conventions
Example                                    Description

1                                          Font used for constant values of
                                           equations.

n                                          Font used for variable values of
                                           equations.

clcMine                                    Font used for names and identifiers,
                                           including filenames, function names,
                                           compiler options, etc.

CREATE MINING MODEL                        Font used for SQL keywords.

void initialize()                          Font used for source code listings,
{                                          SQL statements, scripts, stored
    int i,j;                               procedures and Visual Basic®
    char command[100];                     scripts.

For j = 0, mt                              Font used for algorithm listings.
    Recover sjt

Application Programming Interface (API)    Acronyms are written within
                                           parentheses, following their
                                           description the first time they
                                           appear, and are included in the
                                           list of acronyms.

Program Files/                             The drive letter part of file path
Microsoft Analysis Services/Bin/           names is always omitted in the text;
msmdsrv.exe                                you have to include your own. In
                                           figures, the drive letters used by
                                           the authors appear; you must replace
                                           them with your own. Occasionally,
                                           additional spaces are added near the
                                           forward slash to improve text
                                           readability.
Appendix E
List of terms and acronyms
Abbreviation Description
ADO ActiveX Data Objects, object oriented API.
API Application Programming Interface
C4.5 C4.5 Classifier
COM Component Object Model
CRISP-DM CRoss-Industry Standard Process for Data Mining
CRM Customer Relationship Management
DBMS DataBase Management System
DLL Dynamic Link Library
DM Data Mining
DMclcMine clcMine OLE DB for Data Mining Provider
DMM Data Mining Model
DMSample Microsoft® OLE DB for Data Mining Sample Provider
DMSQL Data Mining Structured Query Language
DMX Data Mining eXtensions
DTS Data Transformation Services
IIS Internet Information Services
ISV Independent Software Vendor
IT Information Technology
KDD Knowledge Discovery in Databases
MDAC Microsoft® Data Access Components
ML Machine Learning
MOLAP Multidimensional On-Line Analytical Processing
MSDMine Microsoft® OLE DB Provider for Data Mining Services
MSDT MicroSoft® Decision Trees
MSIE MicroSoft® Internet Explorer
MSJet MicroSoft® Jet OLE DB Provider
MSNB Microsoft® Simple Naïve Bayes
MSOLAP Microsoft® OLE DB Provider for Analysis Services
MSSQL Microsoft® SQL Server™
MSVB MicroSoft® Visual Basic®
MSVC MicroSoft® Visual C++®

MSVS MicroSoft® Visual Studio®


OLAP On-line Analytical Processing
OLE Object Linking and Embedding
OLE DB Object Linking and Embedding DataBase
OLE DB for DM Object Linking and Embedding DataBase for Data
Mining
PL Programming Language
PMML Predictive Model Markup Language
RDBMS Relational DataBase Management System
ROLAP Relational On-Line Analytical Processing
SNB Simple Naïve Bayes
SNBi Simple Naïve Bayes Incremental
SP Service Pack
SQL Structured Query Language
SQLOLEDB Microsoft® OLE DB Provider for SQL Server
UDA Universal Data Access
XML eXtensible Markup Language
WEKA Waikato Environment for Knowledge Analysis
Appendix F
List of Figures
Figure 1.1: The Multidisciplinary context of DM. 2
Figure 1.2: The KDD process. 3
Figure 2.1: MSVS 6.0 Tools/Options/Directories/Include files. 11
Figure 2.2: MSVS 6.0 Tools/Options/Directories/Library files. 11
Figure 2.3: MSVS .NET Tools/Options/Projects/VC++
Directories/Include files. 13
Figure 2.4: MSVS .NET Tools/Options/Projects/VC++
Directories/Library files. 13
Figure 2.5: Assembling DMSample. 15
Figure 2.6: Assembling DMclcMine. 16
Figure 3.1: Universal data access architecture. 18
Figure 3.2: OLE DB architecture. 19
Figure 3.3: AllElet – Training data set. 22
Figure 3.4: AllElet – Testing data set. 22
Figure 3.5: DTSBackup 2000 – AllElet – DTS source files. 23
Figure 3.6: DTSBackup 2000 – AllElet – DTS transferred files. 23
Figure 3.7: Run of AllElet – Import DTS package. 24
Figure 3.8: DMSamp – Creating AllElet DMM model. 26
Figure 3.9: DMSamp – Predicting attributes of the testing data set. 32
Figure 3.10: AllElet – DMM contents. 33
Figure 3.11: AllElet – DMM prediction tree. 33
Figure 3.12: AllElet BuysComputer prediction tree. 34
Figure 3.13: DMSamp using AllElet DMM UDA architecture. 35
Figure 3.14: AllElet BuysComputer prediction tree shown by
Internet Explorer. 35
Figure 3.15: Microsoft® IE AllElet DMM UDA architecture. 36
Figure 5.1: Waveform – Training time × Number of cases. 54
Figure 5.2: VarInpAtt – Training time × Number of input attributes. 59
Figure 5.3: VarNumCases – Training time × Number of training cases. 60
Figure 5.4: VarNumCases – Prediction time × Number of cases. 61
Figure 5.5: VarNumCases – Incremental training time × Number
of training cases. 62

Figure 5.6: VarStates – Training time × Number of states of the


input attributes. 63
Figure 5.7: VarPredAtt – Training time × Number of prediction attributes. 63
Figure C.1: Meteo – DTS – Data import workflow. 97
Figure C.2: Waveform 1. 97
Figure C.3: Waveform 2. 98
Figure C.4: Waveform 3. 98
Figure C.5: First three cases of class 0. 99
Figure C.6: First three cases of class 1. 99
Figure C.7: First three cases of class 2. 100
Figure C.8: Waveform – Full processing workflow of the first
part of the experiment. 101
Figure C.9: Waveform – Data import workflow. 101
Figure C.10: Waveform – Training and prediction processing
tasks of MSDT1 classifier. 102
Figure C.11: Waveform – Cross-validation – Processing workflow. 103
Figure C.12: Waveform – Block data import. 103
Figure C.13: Meteo – Full processing workflow of the first
part of the experiment. 107
Figure C.14: Meteo – Data import workflow. 107
Figure C.15: Insurance – Original tables relationships. 111
Figure C.16: Insurance – Original data – Data import workflow. 122
Figure C.17: Insurance – Original data – Data assembling workflow. 123
Figure C.18: VarNumCases – SNBi classifier – Prediction by parts task. 130
Figure C.19: VarNumCases – SNBi classifier – Processing
workflow of prediction by parts tasks. 131
Figure C.20: VarNumCases – SNBi classifier – Training task. 131
Appendix G
List of Tables
Table 3.1: OLE DB components. 20
Table 3.2: AllElet – Training data set – marginal model statistics. 22
Table 5.1: Waveform – Parameters of the classifiers. 52
Table 5.2: Waveform – Elapsed times (s). 53
Table 5.3: Waveform – Accuracy (%). 55
Table 5.4: Waveform – Cross-validation – Elapsed times (s). 55
Table 5.5: Waveform – Cross-validation – Accuracy (%). 55
Table 5.6: Meteo – Parameters of the classifiers. 56
Table 5.7: Meteo – Elapsed times (s). 56
Table 5.8: Meteo – Accuracy (%). 56
Table 5.9: Meteo – Cross-validation – Elapsed times (s). 57
Table 5.10: Meteo – Cross-validation – Accuracy (%). 57
Table 5.11: Insurance – Parameters of the classifiers. 57
Table 5.12: Insurance – Elapsed times (s). 58
Table 5.13: Insurance – Accuracy (%). 58
Table 5.14: Insurance – Cross-validation – Elapsed times (s). 58
Table 5.15: Insurance – Cross-validation – Accuracy (%). 58
Table 5.16: Performance – Parameters of the classifiers. 59
Table A.1: Prefixes of some types of variables. 71
Table C.1: Waveform – DMSQL statements to create the DMMs. 104
Table C.2: Meteo – Attribute description of the meteorological data set. 106
Table C.3: Meteo – DMSQL statements to create the DMM. 109
Table C.4: Insurance – Attribute description. 112
Table C.5: Insurance – Prepared data – DMSQL statements to create
the DMM. 123
Table C.6: Insurance – Original data – DMSQL statements to create
the DMM. 125

Table C.7: DATGEN parameters. 129


Table C.8: VarInpAtt – DMSQL statements to create the DMM with 12
input attributes. 132
Table C.9: VarNumCases – DMSQL statements to create the DMM. 132
Table C.10: VarStates – DMSQL statements to create the DMM. 133
Table C.11: VarPredAtt – DMSQL statements to create the DMM with
32 prediction attributes. 133
Appendix H
List of Listings
Listing 3.1: AllElet – XML file of the empty DMM. 26
Listing 3.2: AllElet – XML file of the populated and trained DMM. 28
Listing 4.1: SNBi Classifier – Training algorithm. 45
Listing 4.2: SNBi Classifier – Prediction algorithm. 47
Listing C.1: Record count initializer script. 102
Listing C.2: Waveform – Text file data import. 102
Listing C.3: Waveform – Cross-validation loop control script. 103
Listing C.4: Waveform – SNBi classifier – DMSQL statement
to predict the testing data set. 105
Listing C.5: Meteo – Text file data import. 108
Listing C.6: Meteo – SNBi1 classifier – DMSQL statement to predict
the testing data set. 110
Listing C.7: Insurance – Prepared data – SNBi classifier – DMSQL
statement to predict the testing data set. 124
Listing C.8: Insurance – Original data – SNBi classifier – DMSQL
statement to predict the testing data set. 126
Listing C.9: Functions for the DATGEN program. 128
Listing C.10: Task loop initializer script. 130
Listing C.11: Task loop script. 131
Listing C.12: VarNumCases – SNBi classifier – DMSQL statement
to predict the testing data set. 132
