CHAPTER TWO
LITERATURE REVIEW
This chapter covers the basic topics of data mining: its meaning, the reasons for its application, and its various tasks, processes, techniques, and application areas. It dwells particularly on the artificial neural network approach. The strengths and weaknesses of these algorithms, and when to apply them, are also discussed.
2.1 DATA MINING
Data mining is the process of discovering useful patterns and relationships in large quantities of data through a variety of analytical techniques [6].
Data mining also refers to the analysis of the large quantities of data that are stored in computers, in files or in databases. It is called exploratory data analysis, among other things [5]. Data mining is not limited to business: it has been heavily used in the medical field, including the analysis of patient records to help identify best practices.
2.2 REASONS FOR DATA MINING
Data mining caught on in a big way in recent years due to a number of factors [6]:
i)
ii)
2.3 THE DATA MINING PROCESS
The data mining process typically involves data processing, analysis, inferences drawn, and implementation [7].
2.3.1 CRISP-DM
CRISP-DM, the Cross-Industry Standard Process for Data Mining, is widely used by industry and corporate organizations. This model consists of six phases intended as a cyclical process.
Business Understanding: Business understanding includes
determining business objectives, assessing the current situation,
establishing data mining goals, and developing a project plan.
Data Understanding: Once business objectives and the project plan are established, data understanding considers data requirements. Data can serve many purposes, including prediction or classification.
2.4 PREDICTION
Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value.
Data Quality
Data quality refers to the accuracy and completeness of the data. Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates, and human error can all significantly affect data quality [6].
2.6.1 Hypothesis
A hypothesis is a proposed explanation whose validity can be tested. Testing the validity of a hypothesis is done by analyzing data that may simply be collected by observation or generated through experiment.
The process of hypothesis testing
The hypothesis testing method has several steps:
1) Formulate the null hypothesis and the alternative hypothesis.
2) Choose a significance level for the test.
3) Select an appropriate test statistic.
4) Collect the data and compute the value of the test statistic.
5) Determine the critical region (or the p-value) for the statistic.
6) Accept or reject the null hypothesis based on the comparison.
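As an illustration, these steps can be carried out in a few lines of Python; the sketch below assumes a two-sample t-test, made-up measurements, and a 0.05 significance level, none of which come from this thesis.

    # A minimal sketch of the hypothesis-testing steps above using a
    # two-sample t-test; the data and significance level are illustrative.
    from scipy import stats

    group_a = [23.1, 25.3, 24.8, 26.0, 24.2]   # observed measurements
    group_b = [21.9, 22.8, 23.5, 22.1, 23.0]

    alpha = 0.05                               # chosen significance level
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    if p_value < alpha:
        print(f"p={p_value:.4f} < {alpha}: reject the null hypothesis")
    else:
        print(f"p={p_value:.4f} >= {alpha}: fail to reject the null hypothesis")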
2.7 DATA MINING TECHNIQUES/METHODS
MEMORY-BASED REASONING
Memory-based reasoning methods are attractive in that they are relatively machine driven, involving automatic pattern detection [8]. The approach has been applied in a number of domains [9] [10] [11]. Matching can also be applied to pattern recognition [12].
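Memory-based reasoning is essentially nearest-neighbour matching: new cases are classified by their similarity to cases already stored in memory. The Python sketch below is a minimal illustration under that reading; the toy coordinates, labels and k = 3 are assumptions for the example only.

    # A minimal sketch of memory-based reasoning as nearest-neighbour
    # matching: a new case takes the majority label of its k most
    # similar stored cases.
    import math
    from collections import Counter

    # the stored "memory" of past cases: (features, label)
    train = [((1.0, 1.1), "A"), ((0.9, 1.0), "A"),
             ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.8, 3.0), "B")]

    def classify(x, k=3):
        # sort stored cases by distance to the new case
        nearest = sorted(train, key=lambda rec: math.dist(x, rec[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    print(classify((1.2, 0.9)))   # -> "A"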
2.8 ASSOCIATION RULES
Association rules identify items that tend to occur together in transactions [13] [14] [15] or time intervals [16] [17].
2.9 MARKET BASKET ANALYSIS
Market-basket analysis refers to methodologies for studying the composition of the basket of products purchased by a household during a single shopping trip [18] [19].
Retailers can use this information to group associated items together, making them more visible and accessible for customers at the time of shopping. These assortments can affect customer behavior and promote the sales of complementary items. This information can also be used to decide the layout of catalogs, placing items with strong associations together in sales catalogs. The advantage of using sales data for promotions and store layout is that the items with associations are determined by actual consumer behavior. This information may vary based on the area and the assortment of items available in stores, and the point-of-sale data reflect the real behavior of the group of customers that frequently shop at the same store. Catalogs designed based on market basket analysis are therefore expected to be more effective in influencing consumer behavior and promoting sales.
2.9.2 Strengths of Market Basket Analysis
It produces clear and understandable results.
It supports undirected data mining.
It works on variable-length data.
The computations it uses are simple to understand.
2.9.3 Weaknesses of Market Basket Analysis
It requires exponentially more computational effort as the problem size grows.
It has limited support for data attributes.
It is difficult to determine the right number of items.
It discounts rare items.
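As a minimal sketch of the computations involved, the Python fragment below derives the support and confidence of a candidate rule from a handful of toy transactions; the items and the rule bread to milk are illustrative assumptions, not data from this work.

    # A minimal sketch of market-basket measures: support counts how
    # often an itemset appears; confidence is the conditional frequency
    # of the consequent given the antecedent.
    transactions = [
        {"bread", "milk"},
        {"bread", "beer", "eggs"},
        {"milk", "beer", "cola"},
        {"bread", "milk", "beer"},
        {"bread", "milk", "cola"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "milk"}))        # 0.6
    print(confidence({"bread"}, {"milk"}))   # 0.75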
2.10 FUZZY SETS
Fuzzy set approaches to managing uncertainty and imprecision have been applied to data mining. The categorical limits selected are key to accurate model building, and as the number of attributes grows, extracting precise information becomes more and more difficult.
Fuzzy set concepts have been applied to KDD [26] [27]. Each of these approaches has been extended to weighted quantitative association rules [28] [29].
The combined weight or fuzzy value becomes very small, even tending to zero, when the number of items in a candidate itemset is large, so the support level is very small. This can cause numerical underflow and make the algorithm terminate unexpectedly when calculating the confidence value.
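The numerical issue can be reproduced directly: multiplying many small membership values underflows to zero in floating point, while working with sums of logarithms keeps the quantity usable for comparison. The membership value 0.01 and the itemset size below are illustrative assumptions.

    # A minimal sketch of the underflow problem described above and the
    # usual log-space workaround.
    import math

    memberships = [0.01] * 200          # a large candidate itemset

    product = 1.0
    for m in memberships:
        product *= m
    print(product)                      # 0.0 -- the product underflows

    log_support = sum(math.log(m) for m in memberships)
    print(log_support)                  # about -921.0, still comparable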
2.11 ROUGH SET
Rough set analysis is a mathematical approach based on the theory of rough sets, first introduced by Pawlak (1982) [22]. Its purpose is to deal with the vagueness and uncertainty associated with the measurable characteristics of objects. As an approach to handling imperfect data, rough set analysis complements other, more traditional theories such as probability theory, evidence theory, and fuzzy set theory.
2.11.1
Statistical data analysis faces limitations in dealing with data with high levels of uncertainty or with non-monotonic relationships among the variables. The original idea behind Pawlak's rough set theory was to address the vagueness inherent in the representation of a decision situation.
Vagueness and imprecision problems are present in many real-world decision situations [30] [31].
In the example information system, the first three attributes form the attribute set Q, their possible values the set V, and the profit category the decision f. Any pair (q, v), for q ∈ Q and v ∈ Vq, is called a descriptor in an information system S. The information system can be represented as a finite data table, in which the columns represent the attributes, the rows represent the objects, and the cells contain the attribute values f(x, q). Thus, each row in the table describes the information about an object in S.
An equivalence relation that groups together objects of the universe that have identical values on a set of attributes is called an indiscernibility relation.
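A minimal sketch of an indiscernibility relation follows: objects with identical values on the selected attributes fall into the same equivalence class. The toy information table and attribute names are assumptions made for the example.

    # A minimal sketch of indiscernibility classes over an information
    # table: objects are grouped by their descriptor tuples.
    from collections import defaultdict

    # rows are objects; f(x, q) given for attributes q1, q2 and decision d
    table = {
        "x1": {"q1": "high", "q2": "yes", "d": "profit"},
        "x2": {"q1": "high", "q2": "yes", "d": "loss"},
        "x3": {"q1": "low",  "q2": "no",  "d": "loss"},
        "x4": {"q1": "low",  "q2": "no",  "d": "loss"},
    }

    def indiscernibility(attrs):
        classes = defaultdict(set)
        for obj, values in table.items():
            key = tuple(values[q] for q in attrs)   # the descriptor tuple
            classes[key].add(obj)
        return list(classes.values())

    print(indiscernibility(["q1", "q2"]))   # [{'x1', 'x2'}, {'x3', 'x4'}]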
Rough set analysis has been applied in many domains [32] [33] [34] [35] [36] [37] [38], including timing decisions [39] [40], and has been used to enhance support vector machine models [41] [42].
Figure 2.3: Process map and the main steps of rough set analysis [5].
2.12 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) have demonstrated highly competitive performance in numerous real-world applications, such as medical diagnosis, which has established SVMs as one of the most popular, state-of-the-art tools for knowledge discovery and data mining.
Similar to artificial neural networks, SVMs possess the well-known ability of being universal approximators of any multivariate function to any desired degree of accuracy. They are therefore of particular interest for modeling highly nonlinear, complex systems and processes.
Regression
A version of the SVM for regression has been proposed, called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data that are close (within a threshold ε) to the model prediction.
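A minimal sketch of SVR with scikit-learn follows, assuming the library is available; the synthetic data, kernel and epsilon value are illustrative choices, not models from this work. Points within epsilon of the prediction do not become support vectors, so the fitted model depends only on a subset of the data.

    # A minimal sketch of support vector regression on noisy sine data.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(0, 0.1, 60)

    # training points within epsilon of the prediction are ignored by
    # the cost function and do not become support vectors
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

    print(model.predict([[2.5]]))
    print("support vectors:", len(model.support_))   # a subset of the data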
2.12.1
Figure 2.4
2.12.2 Support Vector Machines versus Artificial Neural Networks
The development of ANNs followed a heuristic path, with applications
and extensive experimentation preceding theory. In contrast, the
development of SVMs involved sound theory first, then implementation
and experiments.
A significant advantage of SVMs is that while ANNs can suffer from multiple local minima, the solution to an SVM is global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. The reason that SVMs often outperform ANNs in practice is that they address the biggest problem with ANNs: SVMs are less prone to overfitting.
They differ radically from comparable approaches such as neural
networks: SVM training always finds a global minimum, and their
simple geometric interpretation provides fertile ground for
further investigation.
Most often Gaussian kernels are used, in which case the resulting SVM corresponds to an RBF network with Gaussian radial basis functions. As the SVM approach automatically solves the network complexity problem, the size of the hidden layer is obtained as the result of the QP procedure; hidden neurons and support vectors correspond to each other.
2.13 EVALUATION OF CLASSIFICATION MODELS
How can one estimate the performance measures of a classifier, and are there established methodologies for doing so [43] [5]?
2.13.1
In classification, the primary source of performance measurements is a coincidence matrix (a.k.a. classification matrix or contingency table).
The numbers along the diagonal from upper-left to lower-right represent the correct decisions made, and the numbers outside this diagonal represent the errors. The true positive rate (also called hit rate or recall) of a classifier is estimated by dividing the correctly classified positives (the true positive count) by the total positive count. The false positive rate (also called false alarm rate) of the classifier is estimated by dividing the incorrectly classified negatives (the false positive count) by the total negative count.
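The rates defined above can be computed directly from the four cells of a binary coincidence matrix; the counts below are illustrative assumptions.

    # A minimal sketch of the rates derived from a 2x2 coincidence matrix.
    tp, fn = 80, 20    # actual positives: correctly / incorrectly classified
    fp, tn = 10, 90    # actual negatives: incorrectly / correctly classified

    true_positive_rate = tp / (tp + fn)    # hit rate / recall
    false_positive_rate = fp / (fp + tn)   # false alarm rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)

    print(true_positive_rate, false_positive_rate, accuracy)   # 0.8 0.1 0.85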
2.13.2 Estimation Methodology for Classification Models
Estimating the accuracy of a classifier induced by a supervised learning algorithm is important for several reasons. First, it can be used to estimate the classifier's future prediction accuracy, which indicates the level of confidence one should have in the classifier's output in the prediction system. Second, it can be used for choosing a classifier from a given set (selecting the best model from two or more candidate classification models). Lastly, it can be used to assign confidence levels to multiple classifiers so that the outcome of a combined classifier can be optimized. Combined classifiers are becoming increasingly popular due to empirical results suggesting that they produce more robust and more accurate predictions than the individual predictors. For estimating the final accuracy of a classifier, one would like an estimation method with low bias and low variance. In some application domains, to choose a classifier or to combine classifiers, the absolute accuracies may be less important, and one might be willing to trade off bias for low variance.
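A common estimation methodology with low bias is k-fold cross-validation. The sketch below, assuming scikit-learn and its bundled iris data purely for illustration, estimates accuracy as the mean score over ten folds.

    # A minimal sketch of accuracy estimation via 10-fold cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

    # the mean gives the accuracy estimate, the std hints at its variance
    print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")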
2.14 DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they produce rules that are easy to understand. Two of the best-known methods go by the acronyms CART and CHAID, which stand for Classification and Regression Trees and Chi-square Automatic Interaction Detection, respectively [5].
Figure: An example decision tree whose Yes/No branches lead to the classes Diet Soda, Milk, and Beer.
Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous variable such as income, blood pressure or interest rate. Decision trees are also problematic for time-series data unless a lot of effort is put into presenting the data in such a way that trends and sequential patterns are made visible.
2.14.3 When to Use Decision Trees
Decision-tree methods are a good choice when the data mining task is classification of records or prediction of outcomes. Use decision trees when the goal is to assign each record to one of a few broad categories. Decision trees are also a natural choice when the goal is to generate rules that can be easily understood, explained, and translated into SQL or a natural language.
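As a minimal illustration of how tree-induced rules can be read off and explained, the following scikit-learn sketch fits a shallow tree and prints it as if/else rules; the dataset and the depth limit are illustrative assumptions.

    # A minimal sketch of rule extraction from a fitted decision tree.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    # export_text renders the tree as human-readable if/else rules
    print(export_text(tree, feature_names=list(iris.feature_names)))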
2.15 GENETIC ALGORITHMS [45]
Advantages:
Genetic algorithms do not require derivative or gradient information about the fitness function.
Genetic algorithms search from a population of candidate solutions rather than a single point.
Disadvantages:
Genetic algorithms can be computationally expensive and offer no guarantee of finding the global optimum.
GA Operators
Selection
This is the procedure for choosing individuals (parents) on which to
perform crossover in order to create new solutions. The idea is that
the fitter individuals are more prominent in the selection process,
with the hope that the offspring they create will be even fitter still.
Two commonly used procedures are roulette wheel selection and tournament selection.
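A minimal sketch of roulette wheel selection follows: each individual is picked with probability proportional to its fitness, so fitter parents are chosen more often. The population and fitness values are illustrative assumptions.

    # A minimal sketch of roulette wheel selection.
    import random

    population = ["ind1", "ind2", "ind3", "ind4"]
    fitness    = [10.0,   30.0,   40.0,   20.0]

    def roulette_wheel(pop, fit):
        pick = random.uniform(0, sum(fit))   # spin the wheel
        cumulative = 0.0
        for individual, f in zip(pop, fit):
            cumulative += f
            if pick <= cumulative:
                return individual
        return pop[-1]

    parents = [roulette_wheel(population, fitness) for _ in range(2)]
    print(parents)   # fitter individuals are selected more often

The same effect can be had in one call with random.choices(population, weights=fitness, k=2); the explicit loop above simply makes the "wheel" visible.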
Application of Genetic Algorithms in Data Mining
Genetic algorithms have been applied to data mining in two ways. External support is through evaluation or optimization of some parameter for another learning system, often in hybrid systems using other data mining tools such as clustering or decision trees. In this sense, genetic algorithms help other data mining tools operate more efficiently. Genetic algorithms can also be directly applied to analysis, where the genetic algorithm generates descriptions, usually as decision rules or decision trees. Many applications of genetic algorithms within data mining have been applied outside of business.
Specific examples include medical data mining and computer network
intrusion detection. In business, genetic algorithms have been applied
to customer segmentation, credit scoring, and financial security
selection.
Genetic algorithms can be very useful within a data mining analysis dealing with many attributes and many observations: they avoid the brute-force checking of all combinations of variable values, which can make some data mining algorithms more effective.
2.16 ARTIFICIAL NEURAL NETWORKS
An artificial neural network [46] is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing [47].
2.16.1 BIOLOGICAL BACKGROUND
A neural microcircuit is an assembly of synaptic connections [49].
Figure 2.8: Schematic structural organization of levels in the brain (local circuits, neurons, dendritic trees, neural microcircuits, synapses, molecules).
At a higher level, these neural circuits are organized into interregional circuits that involve multiple regional neural networks located in different parts of the brain, connected through specific pathways, columns and topographic maps. Studies have shown clearly that different sensory inputs (motor, somatosensory, visual, auditory) are mapped onto corresponding areas of the cerebral cortex.
2.16.2 The Neuron
Figure 2.9: The biological neuron.
As shown in Fig. 2.9, the neuron typically consists of three main parts: the dendrites (or dendritic tree) with their synapses (synaptic connections or synaptic terminals), the neuron cell body, and the axon. Typically the neuron can be in two states: the resting state, where no electrical signal is generated, and the firing state, where the neuron depolarises and an electrical signal (the output of the neuron) is generated [48].
The neuron receives inputs from other neurons that are connected to it, via synaptic connections that are mainly positioned in the dendrites. The incoming signals (which are in the form of positive or negative electrical potentials) are summed in the neuron's cell body (also called the soma) [49].
Signal transmission can be bi-directional in electrical synapses. This affects the electrical potential that is transmitted to the neuron cell body.
The neuron cell body (or soma) has a triangular-like form and contains the nucleus of the cell. As shown in Fig. 2.9, the dendrites lead into the neuron cell body, carrying the incoming inputs (electrical signals generated by the postsynaptic potentials). These electrical signals affect the membrane potential of the cell body of the neuron. Typically, when in the resting state, the membrane potential of a neuron is approximately −70 mV. If the incoming postsynaptic potential is positive (excitatory), the membrane potential is increased, moving closer to the firing state. If the incoming postsynaptic potential is negative (inhibitory), the membrane potential is decreased, moving away from the firing state [49].
After firing, the membrane potential returns to the appropriate resting value. This does not happen instantaneously; the neuron needs a short recovery (refractory) period before it can fire again.
The Axon
In cortical neurons, the axon is very long and thin and is characterized by high electrical resistance and very large capacitance. The neural axon is the main transmission line of the neuron and propagates the action potential. The axon has a smoother surface than the dendrites and carries the characteristic nodes of Ranvier (not shown in Fig. 2.9) that help the propagation of the action potential along the axon. The axon terminates in the synaptic terminals that establish the interconnection of the neuron to other neurons.
2.16.5 Model of a Neuron
Figure 2.10: Model of a neuron.
In the notation that follows, the first subscript refers to the neuron in question and the second subscript refers to the input to which the weight refers. In general, and in accordance with the biological picture, there are two primary types of synaptic connections: excitatory and inhibitory. Excitatory connections increase the neuron's activation and are typically represented by positive signals. Inhibitory connections, on the other hand, decrease the neuron's activation and are typically represented by negative signals. The two types of connections are thus represented by the sign of the corresponding synaptic weight [49].
Thus, the step function of (Eq. 2.1) returns one constant value, a, if its argument is a nonnegative number, and another constant value, b, if its argument is a negative number. A special case of the step function is obtained for a = 1 and b = 0. In that case (Eq. 2.1) is transformed to (Eq. 2.2):

φ(v) = 1 if v ≥ 0, and φ(v) = 0 if v < 0.   (Eq. 2.2)
Figure 2.11.

φ(v) = tanh(v)   (Eq. 2.9)

Both functions defined in the last two equations have saturation levels at −1 (lower) and 1 (upper), and therefore range in [−1, 1].
The description of the neural dynamics in mathematical terms follows. According to the notation introduced above, assume that the k-th neuron receives m synaptic connections, x_j is the incoming input signal via the j-th synaptic connection, w_kj is the corresponding synaptic weight of that connection, the threshold is θ_k, and the bias is b_k. In the case that the adder sums the total incoming weighted signals and subtracts the threshold θ_k, the obtained result v_k is given by the formula:

v_k = Σ (j = 1..m) w_kj x_j − θ_k   (Eq. 2.13)

In (Eq. 2.13), the bias b_k can be included in the sum as the product w_k0 x_0, where x_0 = 1 and w_k0 = b_k.
Finally, let y_k be the output signal of the k-th neuron that receives a total incoming signal v_k. The output of the neuron is given by the formula:

y_k = φ(v_k)   (Eq. 2.14)

In the above equation, φ(·) is the activation function, which may be any of those described in Eqs. 2.1 to 2.10.
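A minimal sketch of the neuron model of Eqs. 2.13 and 2.14 follows, using the tanh activation of Eq. 2.9; the particular weights, bias and inputs are illustrative assumptions.

    # A minimal sketch of a single artificial neuron.
    import math

    def neuron(inputs, weights, bias):
        # Eq. 2.13 with x0 = 1 and w_k0 = b_k: v_k = sum_j w_kj * x_j + b_k
        v = bias + sum(w * x for w, x in zip(weights, inputs))
        # Eq. 2.14 with the tanh activation of Eq. 2.9: y_k = phi(v_k)
        return math.tanh(v)

    print(neuron([0.5, -1.0, 0.25], [0.8, 0.2, -0.5], bias=0.1))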
The neuron-like processing element presented here models the basic behaviour of the biological neuron [47], with the activation function limiting the output to a bounded value range.
2.16.7 Forward propagation
6. Change all weights by adding the error value to the (old) weight values.
7. Go to step 2.
8. The algorithm ends if all output patterns match their target patterns.
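A minimal sketch of this training loop follows, using a single neuron with a step activation and the AND-gate patterns as illustrative assumptions; it forward-propagates each pattern, adjusts the weights by the error (step 6), and repeats until all outputs match their targets (step 8).

    # A minimal sketch of the error-correction training loop above.
    patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    weights, bias, rate = [0.0, 0.0], 0.0, 0.1

    def forward(x):
        # step activation applied to the weighted sum
        return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0

    while True:
        errors = 0
        for x, target in patterns:
            error = target - forward(x)           # compare output with target
            if error:
                errors += 1
                for i, xi in enumerate(x):        # step 6: adjust the weights
                    weights[i] += rate * error * xi
                bias += rate * error
        if errors == 0:                           # step 8: all patterns match
            break

    print(weights, bias)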
2.16.8 Multi-Layer Networks
The multi-layer network [52] is a special case of a feedforward artificial neural network.
Figure 2.12: A multi-layer network with an input layer, two hidden layers, and an output layer.
Although the underlying ideas were known around 1970, multi-layer networks saw little use until backpropagation became established as a practical method to train them.
2.17 OLAP [6]
Strengths of OLAP
Weaknesses of OLAP
2.18 CONSTRAINT SATISFACTION [5]
A constraint satisfaction problem (CSP) P = (V, D, C) consists of:
V = {x1, ..., xn}, a finite set of n variables.
D = {dom(x1), ..., dom(xn)}, a set of domains. Each variable x ∈ V has a corresponding finite domain of possible values, dom(x).
C = {C1, ..., Cm}, a set of m constraints. Each constraint C ∈ C is a pair (vars(C), rel(C)), where vars(C) is the tuple of variables the constraint applies to and rel(C) is the set of value tuples allowed for those variables.
Given a CSP P = (V, D, C), its hidden transformation hidden(P) = (V_hidden(P), D_hidden(P), C_hidden(P)) is defined as follows:
V_hidden(P) = {x1, ..., xn} ∪ {c1, ..., cm}, where {x1, ..., xn} is the original set of variables in V (called ordinary variables) and c1, ..., cm are dual variables generated from the constraints in C. There is a unique dual variable corresponding to each constraint Ci ∈ C. When dealing with the hidden transformation, the dual variables are sometimes called hidden variables.
D_hidden(P) = {dom(x1), ..., dom(xn)} ∪ {dom(c1), ..., dom(cm)} extends the set of domains to the dual variables. For each dual variable ci, dom(ci) = rel(Ci).
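A minimal sketch of the hidden transformation follows, for an assumed toy CSP with two variables and a single "less than" constraint; it builds the dual variable and gives it the constraint's relation as its domain.

    # A minimal sketch of the hidden transformation: one dual (hidden)
    # variable per constraint, whose domain is the allowed-tuple set rel(Ci).
    # Toy CSP: variables x1, x2 over {1, 2, 3}, constraint C1: x1 < x2.
    domains = {"x1": {1, 2, 3}, "x2": {1, 2, 3}}
    constraints = {"C1": (("x1", "x2"),
                          {(a, b) for a in (1, 2, 3)
                                  for b in (1, 2, 3) if a < b})}

    # V_hidden: the ordinary variables plus one dual variable per constraint
    v_hidden = list(domains) + list(constraints)

    # D_hidden: the original domains plus dom(ci) = rel(Ci)
    d_hidden = dict(domains)
    for ci, (vars_c, rel_c) in constraints.items():
        d_hidden[ci] = rel_c

    print(v_hidden)        # ['x1', 'x2', 'C1']
    print(d_hidden["C1"])  # {(1, 2), (1, 3), (2, 3)}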