CHAPTER TWO
LITERATURE REVIEW
This chapter covers the basic topics of data mining: its meaning, the reasons for its application, and its various tasks, processes, techniques, and application areas. It dwells particularly on the artificial neural network approach. The strengths and weaknesses of these algorithms, and when to apply them, are also discussed.
2.1 DATA MINING
Data mining is the process of discovering useful patterns and relationships in large quantities of data through a variety of analytical techniques [6].
Data mining also refers to the analysis of the large quantities of data that are stored in computers, in files or in databases. It is called exploratory data analysis, among other things [5]. Data mining is not limited to business: it has been heavily used in the medical field, including the analysis of patient records to help identify best practices.
2.2 REASONS FOR DATA MINING
Data mining caught on in a big way in recent years due to a number of factors [6]:
i)
ii)
2.3 THE DATA MINING PROCESS
The data mining process typically involves data processing, analysis, inferences drawn, and implementation [7].
2.3.1 CRISP-DM
CRISP-DM, the Cross-Industry Standard Process for Data Mining, is widely used by industry and corporate organizations. This model consists of six phases intended as a cyclical process.
Business Understanding: Business understanding includes
determining business objectives, assessing the current situation,
establishing data mining goals, and developing a project plan.
Data Understanding: Once business objectives and the project plan are established, data understanding considers data requirements. Data can serve many purposes, including prediction or classification.
2.4 PREDICTION
Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value.
Data Quality
Data quality refers to the accuracy and completeness of the data. Data quality can also be affected by the structure and consistency of the data being analyzed. The presence of duplicate records, the lack of data standards, the timeliness of updates, and human error can all significantly affect data quality [6].
2.6.1 Hypothesis
A hypothesis is a proposed explanation whose validity can be tested. Testing the validity of a hypothesis is done by analyzing data that may simply be collected by observation or generated through experiment.
The process of hypothesis testing
The hypothesis testing method has several steps:
1) Formulate the null hypothesis and the alternative hypothesis.
2) Choose a significance level for the test.
3) Select an appropriate test statistic.
4) Collect the data and compute the value of the test statistic.
5) Determine the critical region (or the p-value) for the statistic.
6) Accept or reject the null hypothesis based on the comparison.
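As an illustration, these steps can be carried out in a few lines of Python; the sketch below assumes a two-sample t-test, made-up measurements, and a 0.05 significance level, none of which come from this thesis.

    # A minimal sketch of the hypothesis-testing steps above using a
    # two-sample t-test; the data and significance level are illustrative.
    from scipy import stats

    group_a = [23.1, 25.3, 24.8, 26.0, 24.2]   # observed measurements
    group_b = [21.9, 22.8, 23.5, 22.1, 23.0]

    alpha = 0.05                               # chosen significance level
    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    if p_value < alpha:
        print(f"p={p_value:.4f} < {alpha}: reject the null hypothesis")
    else:
        print(f"p={p_value:.4f} >= {alpha}: fail to reject the null hypothesis")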
2.7 DATA MINING TECHNIQUES/METHODS
MEMORY-BASED REASONING
Memory-based reasoning methods are attractive in that they are relatively machine driven, involving automatic pattern detection [8]. The approach has been applied in a number of domains [9] [10] [11]. Matching can also be applied to pattern recognition [12].
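Memory-based reasoning is essentially nearest-neighbour matching: new cases are classified by their similarity to cases already stored in memory. The Python sketch below is a minimal illustration under that reading; the toy coordinates, labels and k = 3 are assumptions for the example only.

    # A minimal sketch of memory-based reasoning as nearest-neighbour
    # matching: a new case takes the majority label of its k most
    # similar stored cases.
    import math
    from collections import Counter

    # the stored "memory" of past cases: (features, label)
    train = [((1.0, 1.1), "A"), ((0.9, 1.0), "A"),
             ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.8, 3.0), "B")]

    def classify(x, k=3):
        # sort stored cases by distance to the new case
        nearest = sorted(train, key=lambda rec: math.dist(x, rec[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    print(classify((1.2, 0.9)))   # -> "A"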
2.8 ASSOCIATION RULES
Association rules identify items that tend to occur together in transactions [13] [14] [15] or time intervals [16] [17].
2.9 MARKET BASKET ANALYSIS
Market-basket analysis refers to methodologies for studying the composition of the basket of products purchased by a household during a single shopping trip [18] [19].
Retailers can use this information to group associated items together, making them more visible and accessible for customers at the time of shopping. These assortments can affect customer behavior and promote the sales of complementary items. This information can also be used to decide the layout of catalogs, placing items with strong associations together in sales catalogs. The advantage of using sales data for promotions and store layout is that the items with associations are determined by actual consumer behavior. This information may vary based on the area and the assortment of items available in stores, and the point-of-sale data reflect the real behavior of the group of customers that frequently shop at the same store. Catalogs designed based on market basket analysis are therefore expected to be more effective in influencing consumer behavior and promoting sales.
2.9.2 Strengths of Market Basket Analysis
It produces clear and understandable results.
It supports undirected data mining.
It works on variable-length data.
The computations it uses are simple to understand.
2.9.3 Weaknesses of Market Basket Analysis
It requires exponentially more computational effort as the problem size grows.
It has limited support for data attributes.
It is difficult to determine the right number of items.
It discounts rare items.
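As a minimal sketch of the computations involved, the Python fragment below derives the support and confidence of a candidate rule from a handful of toy transactions; the items and the rule bread to milk are illustrative assumptions, not data from this work.

    # A minimal sketch of market-basket measures: support counts how
    # often an itemset appears; confidence is the conditional frequency
    # of the consequent given the antecedent.
    transactions = [
        {"bread", "milk"},
        {"bread", "beer", "eggs"},
        {"milk", "beer", "cola"},
        {"bread", "milk", "beer"},
        {"bread", "milk", "cola"},
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "milk"}))        # 0.6
    print(confidence({"bread"}, {"milk"}))   # 0.75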
2.10 FUZZY SETS
Fuzzy set approaches to managing uncertainty and imprecision have been applied to data mining. The categorical limits selected are key to accurate model building, and as the number of attributes grows, extracting precise information becomes more and more difficult.
Fuzzy set concepts have been applied to KDD [26] [27]. Each of these approaches has been extended to weighted quantitative association rules [28] [29].
The combined weight or fuzzy value becomes very small, even tending to zero, when the number of items in a candidate itemset is large, so the support level is very small. This can cause numerical underflow and make the algorithm terminate unexpectedly when calculating the confidence value.
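The numerical issue can be reproduced directly: multiplying many small membership values underflows to zero in floating point, while working with sums of logarithms keeps the quantity usable for comparison. The membership value 0.01 and the itemset size below are illustrative assumptions.

    # A minimal sketch of the underflow problem described above and the
    # usual log-space workaround.
    import math

    memberships = [0.01] * 200          # a large candidate itemset

    product = 1.0
    for m in memberships:
        product *= m
    print(product)                      # 0.0 -- the product underflows

    log_support = sum(math.log(m) for m in memberships)
    print(log_support)                  # about -921.0, still comparable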
2.11 ROUGH SET
Rough set analysis is a mathematical approach based on the theory of rough sets, first introduced by Pawlak (1982) [22]. Its purpose is to deal with the vagueness and uncertainty associated with the measurable characteristics of objects. As an approach to handling imperfect data, rough set analysis complements other, more traditional theories such as probability theory, evidence theory, and fuzzy set theory.
2.11.1
Statistical data analysis faces limitations in dealing with data with high levels of uncertainty or with non-monotonic relationships among the variables. The original idea behind Pawlak's rough set theory was to address the vagueness inherent in the representation of a decision situation.
Vagueness and imprecision problems are present in many real-world decision situations [30] [31].
In the example information system, the first three attributes form the attribute set Q, their possible values the set V, and the profit category the decision f. Any pair (q, v), for q ∈ Q and v ∈ Vq, is called a descriptor in an information system S. The information system can be represented as a finite data table, in which the columns represent the attributes, the rows represent the objects, and the cells contain the attribute values f(x, q). Thus, each row in the table describes the information about an object in S.
An equivalence relation that groups together objects of the universe that have identical values on a set of attributes is called an indiscernibility relation.
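A minimal sketch of an indiscernibility relation follows: objects with identical values on the selected attributes fall into the same equivalence class. The toy information table and attribute names are assumptions made for the example.

    # A minimal sketch of indiscernibility classes over an information
    # table: objects are grouped by their descriptor tuples.
    from collections import defaultdict

    # rows are objects; f(x, q) given for attributes q1, q2 and decision d
    table = {
        "x1": {"q1": "high", "q2": "yes", "d": "profit"},
        "x2": {"q1": "high", "q2": "yes", "d": "loss"},
        "x3": {"q1": "low",  "q2": "no",  "d": "loss"},
        "x4": {"q1": "low",  "q2": "no",  "d": "loss"},
    }

    def indiscernibility(attrs):
        classes = defaultdict(set)
        for obj, values in table.items():
            key = tuple(values[q] for q in attrs)   # the descriptor tuple
            classes[key].add(obj)
        return list(classes.values())

    print(indiscernibility(["q1", "q2"]))   # [{'x1', 'x2'}, {'x3', 'x4'}]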
Rough set analysis has been applied in many domains [32] [33] [34] [35] [36] [37] [38], including timing decisions [39] [40], and has been used to enhance support vector machine models [41] [42].
Figure 2.3: Process map and the main steps of rough set analysis [5].
2.12 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) have demonstrated highly competitive performance in numerous real-world applications, such as medical diagnosis, which has established SVMs as one of the most popular, state-of-the-art tools for knowledge discovery and data mining.
Similar to artificial neural networks, SVMs possess the well-known ability of being universal approximators of any multivariate function to any desired degree of accuracy. They are therefore of particular interest for modeling highly nonlinear, complex systems and processes.
Regression
A version of the SVM for regression has been proposed, called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data that are close (within a threshold ε) to the model prediction.
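A minimal sketch of SVR with scikit-learn follows, assuming the library is available; the synthetic data, kernel and epsilon value are illustrative choices, not models from this work. Points within epsilon of the prediction do not become support vectors, so the fitted model depends only on a subset of the data.

    # A minimal sketch of support vector regression on noisy sine data.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(0, 0.1, 60)

    # training points within epsilon of the prediction are ignored by
    # the cost function and do not become support vectors
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

    print(model.predict([[2.5]]))
    print("support vectors:", len(model.support_))   # a subset of the data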
2.12.1
Figure 2.4
2.12.2 Support Vector Machines versus Artificial Neural Networks
The development of ANNs followed a heuristic path, with applications
and extensive experimentation preceding theory. In contrast, the
development of SVMs involved sound theory first, then implementation
and experiments.
A significant advantage of SVMs is that while ANNs can suffer from multiple local minima, the solution to an SVM is global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. The reason that SVMs often outperform ANNs in practice is that they address the biggest problem with ANNs: SVMs are less prone to overfitting.
They differ radically from comparable approaches such as neural
networks: SVM training always finds a global minimum, and their
simple geometric interpretation provides fertile ground for
further investigation.
Most often Gaussian kernels are used, in which case the resulting SVM corresponds to an RBF network with Gaussian radial basis functions. As the SVM approach automatically solves the network complexity problem, the size of the hidden layer is obtained as the result of the QP procedure; hidden neurons and support vectors correspond to each other.
2.13 EVALUATION OF CLASSIFICATION MODELS
How can one estimate the performance measures of a classifier, and are there established methodologies for doing so [43] [5]?
2.13.1
In classification, the primary source of performance measurements is a coincidence matrix (a.k.a. classification matrix or contingency table).
The numbers along the diagonal from upper-left to lower-right represent the correct decisions made, and the numbers outside this diagonal represent the errors. The true positive rate (also called hit rate or recall) of a classifier is estimated by dividing the correctly classified positives (the true positive count) by the total positive count. The false positive rate (also called false alarm rate) of the classifier is estimated by dividing the incorrectly classified negatives (the false positive count) by the total negative count.
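The rates defined above can be computed directly from the four cells of a binary coincidence matrix; the counts below are illustrative assumptions.

    # A minimal sketch of the rates derived from a 2x2 coincidence matrix.
    tp, fn = 80, 20    # actual positives: correctly / incorrectly classified
    fp, tn = 10, 90    # actual negatives: incorrectly / correctly classified

    true_positive_rate = tp / (tp + fn)    # hit rate / recall
    false_positive_rate = fp / (fp + tn)   # false alarm rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)

    print(true_positive_rate, false_positive_rate, accuracy)   # 0.8 0.1 0.85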
2.13.2 Estimation Methodology for Classification Models
Estimating the accuracy of a classifier induced by a supervised learning algorithm is important for several reasons. First, it can be used to estimate the classifier's future prediction accuracy, which indicates the level of confidence one should have in the classifier's output in the prediction system. Second, it can be used for choosing a classifier from a given set (selecting the best model from two or more candidate classification models). Lastly, it can be used to assign confidence levels to multiple classifiers so that the outcome of a combined classifier can be optimized. Combined classifiers are becoming increasingly popular due to empirical results suggesting that they produce more robust and more accurate predictions than the individual predictors. For estimating the final accuracy of a classifier, one would like an estimation method with low bias and low variance. In some application domains, to choose a classifier or to combine classifiers, the absolute accuracies may be less important, and one might be willing to trade off bias for low variance.
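A common estimation methodology with low bias is k-fold cross-validation. The sketch below, assuming scikit-learn and its bundled iris data purely for illustration, estimates accuracy as the mean score over ten folds.

    # A minimal sketch of accuracy estimation via 10-fold cross-validation.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

    # the mean gives the accuracy estimate, the std hints at its variance
    print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")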
2.14 DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they produce rules that are easy to understand. Two of the best-known methods go by the acronyms CART and CHAID, which stand for Classification and Regression Trees and Chi-square Automatic Interaction Detection, respectively [5].
Figure: An example decision tree whose Yes/No branches lead to the classes Diet Soda, Milk, and Beer.
Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous variable such as income, blood pressure or interest rate. Decision trees are also problematic for time-series data unless a lot of effort is put into presenting the data in such a way that trends and sequential patterns are made visible.
2.14.3 When to Use Decision Trees
Decision-tree methods are a good choice when the data mining task is classification of records or prediction of outcomes. Use decision trees when the goal is to assign each record to one of a few broad categories. Decision trees are also a natural choice when the goal is to generate rules that can be easily understood, explained, and translated into SQL or a natural language.
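As a minimal illustration of how tree-induced rules can be read off and explained, the following scikit-learn sketch fits a shallow tree and prints it as if/else rules; the dataset and the depth limit are illustrative assumptions.

    # A minimal sketch of rule extraction from a fitted decision tree.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    # export_text renders the tree as human-readable if/else rules
    print(export_text(tree, feature_names=list(iris.feature_names)))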
2.15 GENETIC ALGORITHMS [45]
Advantages:
Genetic algorithms do not require derivative or gradient information about the fitness function.
Genetic algorithms search from a population of candidate solutions rather than a single point.
Disadvantages:
Genetic algorithms can be computationally expensive and offer no guarantee of finding the global optimum.
GA Operators
Selection
This is the procedure for choosing individuals (parents) on which to
perform crossover in order to create new solutions. The idea is that
the fitter individuals are more prominent in the selection process,
with the hope that the offspring they create will be even fitter still.
Two commonly used procedures are roulette wheel selection and tournament selection.
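A minimal sketch of roulette wheel selection follows: each individual is picked with probability proportional to its fitness, so fitter parents are chosen more often. The population and fitness values are illustrative assumptions.

    # A minimal sketch of roulette wheel selection.
    import random

    population = ["ind1", "ind2", "ind3", "ind4"]
    fitness    = [10.0,   30.0,   40.0,   20.0]

    def roulette_wheel(pop, fit):
        pick = random.uniform(0, sum(fit))   # spin the wheel
        cumulative = 0.0
        for individual, f in zip(pop, fit):
            cumulative += f
            if pick <= cumulative:
                return individual
        return pop[-1]

    parents = [roulette_wheel(population, fitness) for _ in range(2)]
    print(parents)   # fitter individuals are selected more often

The same effect can be had in one call with random.choices(population, weights=fitness, k=2); the explicit loop above simply makes the "wheel" visible.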
Application of Genetic Algorithms in Data Mining
Genetic algorithms have been applied to data mining in two ways. External support is through evaluation or optimization of some parameter for another learning system, often in hybrid systems using other data mining tools such as clustering or decision trees. In this sense, genetic algorithms help other data mining tools operate more efficiently. Genetic algorithms can also be directly applied to analysis, where the genetic algorithm generates descriptions, usually as decision rules or decision trees. Many applications of genetic algorithms within data mining have been applied outside of business.
Specific examples include medical data mining and computer network
intrusion detection. In business, genetic algorithms have been applied
to customer segmentation, credit scoring, and financial security
selection.
Genetic algorithms can be very useful within a data mining analysis dealing with many attributes and many observations: they avoid the brute-force checking of all combinations of variable values, which can make some data mining algorithms more effective.
2.16 ARTIFICIAL NEURAL NETWORKS
An artificial neural network [46] is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing [47].
2.16.1 BIOLOGICAL BACKGROUND
A neural microcircuit is an assembly of synaptic connections [49].
Figure 2.8: Schematic structural organization of levels in the brain (local circuits, neurons, dendritic trees, neural microcircuits, synapses, molecules).
At a higher level, these neural circuits are organized into interregional circuits that involve multiple regional neural networks located in different parts of the brain, connected through specific pathways, columns and topographic maps. Studies have shown clearly that different sensory inputs (motor, somatosensory, visual, auditory) are mapped onto corresponding areas of the cerebral cortex.
2.16.2 The Neuron
Figure 2.9: The biological neuron.
As shown in Fig. 2.9, the neuron typically consists of three main parts: the dendrites (or dendritic tree) with their synapses (synaptic connections or synaptic terminals), the neuron cell body, and the axon. Typically the neuron can be in two states: the resting state, where no electrical signal is generated, and the firing state, where the neuron depolarises and an electrical signal (the output of the neuron) is generated [48].
The neuron receives inputs from other neurons that are connected to it, via synaptic connections that are mainly positioned in the dendrites. The incoming signals (which are in the form of positive or negative electrical potentials) are summed in the neuron's cell body (also called the soma) [49].
Signal transmission can be bi-directional in electrical synapses. This affects the electrical potential that is transmitted to the neuron cell body.
The neuron cell body (or soma) has a triangular-like form and contains the nucleus of the cell. As shown in Fig. 2.9, the dendrites lead into the neuron cell body, carrying the incoming inputs (electrical signals generated by the postsynaptic potentials). These electrical signals affect the membrane potential of the cell body of the neuron. Typically, when in the resting state, the membrane potential of a neuron is approximately −70 mV. If the incoming postsynaptic potential is positive (excitatory), the membrane potential is increased, moving closer to the firing state. If the incoming postsynaptic potential is negative (inhibitory), the membrane potential is decreased, moving away from the firing state [49].
After firing, the membrane potential returns to the appropriate resting value. This does not happen instantaneously; the neuron needs a short recovery (refractory) period before it can fire again.
The Axon
In cortical neurons, the axon is very long and thin and is characterized by high electrical resistance and very large capacitance. The neural axon is the main transmission line of the neuron and propagates the action potential. The axon has a smoother surface than the dendrites and carries the characteristic nodes of Ranvier (not shown in Fig. 2.9) that help the propagation of the action potential along the axon. The axon terminates in the synaptic terminals that establish the interconnection of the neuron to other neurons.
2.16.5 Model of a Neuron
Figure 2.10: Model of a neuron.
In the notation that follows, the first subscript refers to the neuron in question and the second subscript refers to the input to which the weight refers. In general, and in accordance with the biological picture, there are two primary types of synaptic connections: excitatory and inhibitory. Excitatory connections increase the neuron's activation and are typically represented by positive signals. Inhibitory connections, on the other hand, decrease the neuron's activation and are typically represented by negative signals. The two types of connections are thus represented by the sign of the corresponding synaptic weight [49].
Thus, the step function of (Eq. 2.1) returns one constant value, a, if its argument is a nonnegative number, and another constant value, b, if its argument is a negative number. A special case of the step function is obtained for a = 1 and b = 0. In that case (Eq. 2.1) is transformed to (Eq. 2.2):

φ(v) = 1 if v ≥ 0, and φ(v) = 0 if v < 0.   (Eq. 2.2)
Figure 2.11.

φ(v) = tanh(v)   (Eq. 2.9)

Both functions defined in the last two equations have saturation levels at −1 (lower) and 1 (upper), and therefore range in [−1, 1].
The description of the neural dynamics in mathematical terms follows. According to the notation introduced above, assume that the k-th neuron receives m synaptic connections, x_j is the incoming input signal via the j-th synaptic connection, w_kj is the corresponding synaptic weight of that connection, the threshold is θ_k, and the bias is b_k. In the case that the adder sums the total incoming weighted signals and subtracts the threshold θ_k, the obtained result v_k is given by the formula:

v_k = Σ (j = 1..m) w_kj x_j − θ_k   (Eq. 2.13)

In (Eq. 2.13), the bias b_k can be included in the sum as the product w_k0 x_0, where x_0 = 1 and w_k0 = b_k.
Finally, let y_k be the output signal of the k-th neuron that receives a total incoming signal v_k. The output of the neuron is given by the formula:

y_k = φ(v_k)   (Eq. 2.14)

In the above equation, φ(·) is the activation function, which may be any of those described in Eqs. 2.1 to 2.10.
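A minimal sketch of the neuron model of Eqs. 2.13 and 2.14 follows, using the tanh activation of Eq. 2.9; the particular weights, bias and inputs are illustrative assumptions.

    # A minimal sketch of a single artificial neuron.
    import math

    def neuron(inputs, weights, bias):
        # Eq. 2.13 with x0 = 1 and w_k0 = b_k: v_k = sum_j w_kj * x_j + b_k
        v = bias + sum(w * x for w, x in zip(weights, inputs))
        # Eq. 2.14 with the tanh activation of Eq. 2.9: y_k = phi(v_k)
        return math.tanh(v)

    print(neuron([0.5, -1.0, 0.25], [0.8, 0.2, -0.5], bias=0.1))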
The neuron-like processing element presented here models the basic behaviour of the biological neuron [47], with the activation function limiting the output to a bounded value range.
2.16.7 Forward propagation
6. Change all weights by adding the error value to the (old) weight values.
7. Go to step 2.
8. The algorithm ends if all output patterns match their target patterns.
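A minimal sketch of this training loop follows, using a single neuron with a step activation and the AND-gate patterns as illustrative assumptions; it forward-propagates each pattern, adjusts the weights by the error (step 6), and repeats until all outputs match their targets (step 8).

    # A minimal sketch of the error-correction training loop above.
    patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    weights, bias, rate = [0.0, 0.0], 0.0, 0.1

    def forward(x):
        # step activation applied to the weighted sum
        return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0

    while True:
        errors = 0
        for x, target in patterns:
            error = target - forward(x)           # compare output with target
            if error:
                errors += 1
                for i, xi in enumerate(x):        # step 6: adjust the weights
                    weights[i] += rate * error * xi
                bias += rate * error
        if errors == 0:                           # step 8: all patterns match
            break

    print(weights, bias)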
2.16.8 Multi-Layer Networks
The multi-layer network [52] is a special case of a feedforward artificial neural network.
Figure 2.12: A multi-layer network with an input layer, two hidden layers, and an output layer.
Although the underlying ideas were known around 1970, multi-layer networks saw little use until backpropagation became established as a practical method to train them.
2.17 OLAP [6]
Strengths of OLAP
Weaknesses of OLAP
2.18 CONSTRAINT SATISFACTION [5]
A constraint satisfaction problem (CSP) P = (V, D, C) consists of:
V = {x1, ..., xn}, a finite set of n variables.
D = {dom(x1), ..., dom(xn)}, a set of domains. Each variable x ∈ V has a corresponding finite domain of possible values, dom(x).
C = {C1, ..., Cm}, a set of m constraints. Each constraint C ∈ C is a pair (vars(C), rel(C)), where vars(C) is the tuple of variables the constraint applies to and rel(C) is the set of value tuples allowed for those variables.
Given a CSP P = (V, D, C), its hidden transformation hidden(P) = (V_hidden(P), D_hidden(P), C_hidden(P)) is defined as follows:
V_hidden(P) = {x1, ..., xn} ∪ {c1, ..., cm}, where {x1, ..., xn} is the original set of variables in V (called ordinary variables) and c1, ..., cm are dual variables generated from the constraints in C. There is a unique dual variable corresponding to each constraint Ci ∈ C. When dealing with the hidden transformation, the dual variables are sometimes called hidden variables.
D_hidden(P) = {dom(x1), ..., dom(xn)} ∪ {dom(c1), ..., dom(cm)} extends the set of domains to the dual variables. For each dual variable ci, dom(ci) = rel(Ci).
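A minimal sketch of the hidden transformation follows, for an assumed toy CSP with two variables and a single "less than" constraint; it builds the dual variable and gives it the constraint's relation as its domain.

    # A minimal sketch of the hidden transformation: one dual (hidden)
    # variable per constraint, whose domain is the allowed-tuple set rel(Ci).
    # Toy CSP: variables x1, x2 over {1, 2, 3}, constraint C1: x1 < x2.
    domains = {"x1": {1, 2, 3}, "x2": {1, 2, 3}}
    constraints = {"C1": (("x1", "x2"),
                          {(a, b) for a in (1, 2, 3)
                                  for b in (1, 2, 3) if a < b})}

    # V_hidden: the ordinary variables plus one dual variable per constraint
    v_hidden = list(domains) + list(constraints)

    # D_hidden: the original domains plus dom(ci) = rel(Ci)
    d_hidden = dict(domains)
    for ci, (vars_c, rel_c) in constraints.items():
        d_hidden[ci] = rel_c

    print(v_hidden)        # ['x1', 'x2', 'C1']
    print(d_hidden["C1"])  # {(1, 2), (1, 3), (2, 3)}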