Efficient Distributed Genetic Algorithm For Rule Extraction
Antonio Peregrin
Dept. of Information Technologies
University of Huelva
peregrin@dti.uhu.es
Abstract
This paper presents an efficient distributed genetic algorithm for classification rule extraction in data mining. It is based on a new method of dynamic data distribution that exploits networks of computers in order to mine large datasets. The presented algorithm shows several advantages when compared with other distributed algorithms proposed in the specific literature. Results are presented showing a significant learning speed-up without compromising other features.
1. Introduction
Mining huge datasets to learn classification models with high prediction accuracy can be a very difficult task. Evaluating candidate models over large datasets makes data mining algorithms ineffective and inefficient. The effect that data size produces on these algorithms is known as the scaling problem.
There are three main approaches to this problem:
Use as much a priori knowledge as possible to search in subspaces small enough to be explored.
Perform data reduction.
Improve algorithm scalability.
Accordingly, there have been several efforts to use models based on distributed evolutionary algorithms in data mining for classification on large datasets, emphasizing scalability and efficiency. REGAL [1] and NOW G-net [2] increase the available computational resources via data distribution. Nowadays, the use of collections of computers to obtain a greater amount of computational resources has become more popular, because they are much more cost-effective than single computers of comparable speed.
In this paper we present an Efficient Distributed Genetic Algorithm for classification Rules extraction (EDGAR) with dynamic data partitioning, which shows advantages in scalability for exploring high-complexity search spaces while keeping comparable classification quality.
3. EDGAR

This section describes the characteristics of EDGAR. The algorithm uses the inherent parallelism of GAs to distribute both the population and the training data in order to improve scalability on large datasets.

3.1. Model
Exploiting the intrinsic parallelism of GAs, various distributed GAs have been proposed to reduce the computational effort needed to solve complex optimization problems. Instead of evolving the entire population on a single processor, parallel GAs apply the concept of multiple intercommunicating subpopulations [10], in analogy with the natural evolution of spatially distributed populations, namely the island model.
In order to improve scalability, we assign a different partition of the learning data to each node. Each node tends to cover its local data, proposing a concept description. The communication of the best individuals between subpopulations reinforces those individuals that perform properly in more than one node.
[Figure: EDGAR model. Subpopulations GA 1 to GA 4, each with its own data partition, exchange rules and examples with their neighbours; a pool collects rules and, contrasting them with the full dataset, produces the concept description.]
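As an illustration only, the following is a minimal, self-contained Python sketch of such an island model with per-node data partitions and ring migration. The bitstring encoding, the toy coverage-based fitness, the operators and all parameter values are assumptions for illustration, not EDGAR's actual components:

```python
import random

RULE_LEN = 16    # bits per rule (toy encoding, assumed)
POP_SIZE = 20

def covers(rule, example):
    # Assumed encoding: '0' is a don't-care, '1' must match the example bit.
    return all(r == '0' or r == e for r, e in zip(rule, example))

def local_fitness(rule, examples):
    # Toy fitness: fraction of the local partition covered by the rule.
    return sum(covers(rule, e) for e in examples) / len(examples)

def evolve(pop, examples):
    # One toy generation: keep the best half, add mutated copies.
    pop = sorted(pop, key=lambda r: local_fitness(r, examples), reverse=True)
    parents = pop[:POP_SIZE // 2]
    children = []
    for p in parents:
        i = random.randrange(RULE_LEN)
        children.append(p[:i] + random.choice('01') + p[i + 1:])
    return parents + children

def island_ga(examples, n_nodes=4, generations=50, migration_every=10):
    # Assign a different partition of the learning data to each node.
    partitions = [examples[i::n_nodes] for i in range(n_nodes)]
    pops = [[''.join(random.choice('01') for _ in range(RULE_LEN))
             for _ in range(POP_SIZE)] for _ in range(n_nodes)]
    for g in range(generations):
        # Each subpopulation evolves independently on its local data.
        pops = [evolve(p, d) for p, d in zip(pops, partitions)]
        if g % migration_every == 0:
            # Ring migration: the locally best rule moves to the
            # neighbouring node, reinforcing rules that work on both.
            for i in range(n_nodes):
                best = max(pops[i], key=lambda r: local_fitness(r, partitions[i]))
                pops[(i + 1) % n_nodes].append(best)
    return pops
```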
3.3. Representation
3.4. Fitness
The fitness function is based on the following
measurements:
f(r) = (1 + zeros(r) / length(r)) · Cases

and, with an additional reduction term:

f(r) = (1 + zeros(r) / length(r)) · Cases + reduction

where zeros(r) is the number of zero (don't care) positions in rule r, length(r) is its length, and Cases is the number of training examples covered by r.
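For illustration, a direct transcription of the formula into Python might look as follows. This is a sketch under the assumption of a bitstring rule encoding where zeros act as don't-cares; covers() is the same toy predicate used in the sketch of Section 3.1:

```python
def covers(rule, example):
    # Assumed encoding: '0' is a don't-care, '1' must match the example bit.
    return all(r == '0' or r == e for r, e in zip(rule, example))

def rule_fitness(rule, examples, reduction=0.0):
    # f(r) = (1 + zeros(r) / length(r)) * Cases (+ reduction)
    zeros = rule.count('0')                           # generality of the rule
    cases = sum(covers(rule, e) for e in examples)    # covered examples
    return (1 + zeros / len(rule)) * cases + reduction
```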
4. Experimental study
[Figure: Execution time in minutes of EDGAR and REGAL against the number of nodes (4 to 64), on a scale from 0.0 to 3.5 minutes.]
4.3 Results

Tables 2 and 3 report averages over 150 executions (5 for each combination of 6 different seeds and 5 node configurations). The first column is the number of nodes; the second, the execution time in minutes; the third, the number of rules; and the fourth and fifth, the classification accuracy on the test and training datasets.

Table 2: Results of REGAL
Nodes  Time  Rules  %Test  %Training
4      1.78  290    98.6   98.9
8      1.97  251    99.0   99.3
16     2.32  250    98.9   99.2
32     2.81  268    98.5   98.7
64     3.08  316    97.9   98.0

4.4 Analysis
5. Conclusions
This work presents a distributed genetic algorithm for classification rule extraction based on the island model and enhanced for scalability with training data partitioning. To be able to generate an accurate classifier with partitioned data, two techniques have been proposed: an elitist pool for rule selection and a novel data distribution technique (DLF) that uses heuristics based on the local data to dynamically redistribute the training data in the node neighbourhood.
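No pseudo-code for DLF is given here, so the sketch below is only a speculative illustration of the stated idea, assuming that examples left uncovered by the locally evolved rules are forwarded to the neighbouring node; the function name, the ring topology and the coverage criterion are all assumptions (covers() as in the earlier sketches):

```python
def dlf_redistribute(partitions, pops, n_nodes):
    # Speculative DLF-style step: keep locally covered examples,
    # forward uncovered ones to the neighbouring node.
    inbox = [[] for _ in range(n_nodes)]
    for i in range(n_nodes):
        kept = []
        for e in partitions[i]:
            if any(covers(r, e) for r in pops[i]):
                kept.append(e)                        # some local rule explains it
            else:
                inbox[(i + 1) % n_nodes].append(e)    # let a neighbour try
        partitions[i] = kept
    for i in range(n_nodes):
        partitions[i].extend(inbox[i])
    return partitions
```

Under this assumption, examples drift toward nodes whose rules can cover them, which matches the dynamic redistribution described above.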
In this preliminary study EDGAR shows a considerable speed-up; moreover, this improvement does not compromise the accuracy or quality of the classifier.

Finally, we would like to remark on the absence of a master process to guide the search. This architecture suggests better scalability by avoiding the idle time due to synchronization with a central process.
6. Acknowledgements
7. References