Evaluation of BIRCH Clustering Algorithm For Big Data
ISSN No:-2456-2165
Abstract:- Clustering algorithms have recently regained attention with the availability of vast datasets. Many clustering algorithms do not scale well with growing datasets and require the cluster count as a parameter, which is difficult to estimate. In this work, computing the threshold parameter of BIRCH from the data is evaluated, and the issue of scalability with increasing dataset sizes and cluster counts is solved.

Keywords:- BIRCH Clustering Algorithm, Tree-BIRCH, Flat Tree.

I. INTRODUCTION

A cluster is a subset of objects that are the same or similar. Data clustering is an unsupervised learning task in which a set of data points is grouped in such a way that the points in a group are more similar to each other than to those in other groups. Examples of clustering algorithms are the k-means clustering algorithm and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Many clustering algorithms do not scale well with dynamic (increasing) datasets, and the algorithms require the number of clusters to be formed as an input. Predicting the cluster count is not easy, since it needs information such as data exploration, feature engineering and documentation of the clusters.

Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) is an unsupervised data mining clustering algorithm introduced by Zhang et al. [8]. It is among the fastest clustering algorithms available and is used to form clusters over very large datasets; the BIRCH algorithm works only on numerical data. An advantage of BIRCH is its ability to cluster incoming, multi-dimensional metric data points incrementally and dynamically in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH requires only a single scan of the database. It is local, in that each clustering decision is made without scanning all data points and all currently existing clusters. It exploits the observation that the data space is usually not uniformly occupied and that not every data point is equally important. It makes full use of the available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does not require the whole dataset in advance.

BIRCH already solved the scalability problem for large datasets. To improve the quality of the clusters, however, the number of clusters has to be given as an input. The main aim here is to achieve the quality and the speed of the clustering without supplying the cluster count. This can be achieved by deleting the global clustering phase, which is performed at the end of the BIRCH algorithm. BIRCH has four phases: the first phase loads the data into memory, the second condenses the tree, the third performs global clustering and the last refines the clusters. The cluster count is needed only at the global clustering step, so that step is removed while the cluster quality and speed are preserved. This variant of the BIRCH clustering algorithm is called tree-BIRCH.
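This behaviour can be illustrated with the BIRCH implementation in scikit-learn, where passing n_clusters=None skips the final global clustering step and returns the CF-tree leaf subclusters directly, so no cluster count has to be supplied. The snippet below is only an illustrative sketch with arbitrary threshold and branching_factor values; it is not the implementation evaluated in this paper.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic numerical data, since BIRCH works on numerical datasets only.
X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# With a cluster count: the CF-tree leaf subclusters are re-clustered
# into n_clusters groups in the final (global clustering) phase.
birch_global = Birch(threshold=0.5, branching_factor=50, n_clusters=5).fit(X)

# Without a cluster count: n_clusters=None skips the global clustering
# phase and the leaf subclusters are returned as the final clusters.
birch_tree_only = Birch(threshold=0.5, branching_factor=50, n_clusters=None).fit(X)

print(len(set(birch_global.labels_)))            # 5 clusters
print(len(birch_tree_only.subcluster_centers_))  # number of leaf subclusters found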
Tree-BIRCH may go wrong when dividing a parent node, splitting clusters or merging clusters. This is solved by computing an optimal threshold parameter, obtained from the minimum distance and the maximum cluster radius; it is calculated automatically, as in Bach and Jordan [9]. Tree-BIRCH is quick and is used for online clustering, which is made possible by the elimination of the global clustering step. To solve supercluster splitting, the flat tree is introduced. The flat tree is a tree with a single parent node to which all child nodes are attached; this is achieved by keeping the branching factor unbounded. The parent node holds the information about all child nodes, that is, for each child node it maintains the triple of the total number of data points, the linear sum and the square sum. When a new data point is entered, these triples, which are maintained in a clustering feature tree, are updated. Tree-BIRCH clusters faster than the k-means clustering algorithm for a dataset of the same size [2].
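To make the flat tree and the (N, LS, SS) triples concrete, the following minimal Python sketch keeps one clustering feature per leaf under a single root and inserts each incoming point into the nearest leaf whenever the resulting radius stays below a threshold, otherwise opening a new leaf. The names CF and flat_tree_insert and the threshold value are assumptions introduced for this example, not the authors' code.

import numpy as np

class CF:
    """Clustering feature: the triple (N, LS, SS) kept per leaf."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1              # number of points absorbed
        self.ls = p.copy()      # linear sum of the points
        self.ss = float(p @ p)  # sum of squared norms of the points

    def centroid(self):
        return self.ls / self.n

    def radius_if_added(self, point):
        # Radius of the subcluster if `point` were absorbed,
        # computed from (N, LS, SS) only.
        p = np.asarray(point, dtype=float)
        n, ls, ss = self.n + 1, self.ls + p, self.ss + float(p @ p)
        c = ls / n
        return float(np.sqrt(max(ss / n - float(c @ c), 0.0)))

    def add(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += float(p @ p)

def flat_tree_insert(leaves, point, threshold):
    """Insert a point into a flat CF tree (one root, unbounded branching)."""
    if leaves:
        nearest = min(leaves, key=lambda cf: float(np.linalg.norm(cf.centroid() - point)))
        if nearest.radius_if_added(point) <= threshold:
            nearest.add(point)
            return
    leaves.append(CF(point))   # open a new leaf when the threshold would be violated

leaves = []
rng = np.random.default_rng(0)
for x in rng.normal(size=(1000, 2)):
    flat_tree_insert(leaves, x, threshold=0.5)
print(len(leaves), "leaf subclusters")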
A dataset is a collection of data. Each row in a dataset is one record; a dataset consists of many rows and columns and is a collection of such records. The dataset lists values for each of the variables, and each value is known as a datum. A dataset may comprise data for one or more members, corresponding to the number of rows. The term data set may also be used more loosely to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event. Here, the approach is aimed at two-dimensional datasets.
The scalability problem is solved in the k-means clustering algorithm by parallelization [2]. The work considers two types of k-means: sequential k-means and parallel k-means. Sequential k-means follows MacQueen [3], where the clusters are formed by repeating the computation many times: at the start, the centroids are selected randomly from the data points, and the centroids are then recomputed over all data points multiple times until good centroids are obtained, using the Euclidean distance. The second type is parallel k-means, which follows Modha [4] and targets distributed-memory multiprocessors. In this method the data points are divided equally among the processors, the calculations are performed on each subset, and the centroids are then computed from the subsets. The runtime of parallel k-means is much lower than that of sequential k-means. A graphics processing unit, which is a shared-memory multiprocessor, is also used; its processors (also called threads) work in a master-slave relationship in which the master selects the centroids. The labelling step of k-means on a graphics processing unit is especially advantageous for large datasets and large cluster counts. Parallel k-means through the Compute Unified Device Architecture (CUDA) is also considered; it is tested only within the limited available memory. The issue of scalability is thus solved by parallel k-means.

The clustering algorithm Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is proposed in [6]. The DBSCAN clustering algorithm visits the data points many times. The resulting clusters can be arbitrarily shaped, since clusters in spatial databases are not necessarily spherical. The algorithm distinguishes three kinds of points: core points, border points and noise points. Eps represents the radius of the neighborhood and minpts is the minimum number of points required in it. For example, if minpts is 4 and a point has at least minpts points within its Eps-neighborhood, it is called a core point. Points that have fewer than minpts neighbors but lie within the radius Eps of a core point, i.e. are neighbors of a core point, are called border points. Noise points are those that are neither core points nor border points. The algorithm needs density as an input parameter: the density parameters Eps and minpts specify the minimum number of points a cluster must contain within a radius Eps. The algorithm handles noise, but the values of Eps and minpts must be known. If one point of a cluster is known, every other point is checked for whether it is density-reachable from it or not. In this algorithm, global values of the Eps and minpts parameters are used throughout the cluster formation. The thinnest cluster, the one with the least density, is considered a good candidate for determining these global values, provided it contains no noise. Clusters are merged only if the densities of the two clusters are close to each other. The evaluation is done on real data and on synthetic data of the SEQUOIA 2000 benchmark.
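For illustration, the core, border and noise point types can be reproduced with the DBSCAN implementation in scikit-learn; the eps and min_samples settings below are arbitrary example values, not values taken from [6].

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

db = DBSCAN(eps=0.15, min_samples=4).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # points with >= min_samples neighbors within eps
noise_mask = db.labels_ == -1               # points assigned to no cluster
border_mask = ~core_mask & ~noise_mask      # reachable from a core point but not core themselves

print("clusters:", len(set(db.labels_)) - (1 if noise_mask.any() else 0))
print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())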
An improved handling of the threshold value in BIRCH is studied in the improved multi-threshold BIRCH clustering algorithm [5]. BIRCH maintains a clustering feature (CF) at each node of the clustering feature tree and is intended for very large datasets. The clustering feature consists of three attributes, N, LS and SS: N is the total number of data points in a cluster, LS is the linear sum of the data points in the cluster and SS is the square sum of the data points in the cluster. In standard BIRCH the clustering feature is maintained at each node together with a single static threshold value, which becomes a problem. In [5], an additional element, the threshold value T, is added to the clustering feature, so that multiple threshold values can be used within the algorithm. In the multi-threshold BIRCH algorithm the clustering feature therefore has four attributes: N, LS, SS and T. A threshold value is given at the initial stage. When a new data point is entered, the nearest node in the clustering feature tree is searched for. Once the nearest leaf node is found, its threshold value is checked. If the threshold value is not violated, the newly entered point is assigned to that nearest node, the clustering feature is updated and the updated threshold value is taken as the new threshold value. If the threshold value is violated by the entered point, the threshold value is increased by multiplying it with the modification factor of the threshold; once multiplied, the threshold value becomes large enough to admit the new data point.

In hierarchical clustering [7], the clusters are formed in a tree manner. It has two strategies: the first is agglomerative clustering and the second is divisive clustering. Hierarchical agglomerative clustering includes the single-linkage (SLINK) and complete-linkage (CLINK) variants and follows a leaves-to-root concept. The agglomerative clustering algorithm is a bottom-up approach in which clusters are merged; it has time complexity O(n^3) and needs O(n^2) memory. Using the maximum distance between the data points of two clusters is known as CLINK, using the minimum distance between the data points of two clusters is known as SLINK, and using the mean distance between the data points of the clusters is known as average-linkage clustering. The variance can also be used as a criterion, combining the two sub-clusters whose merge gives the smallest increase in variance. The clustering process ends with a small number of clusters. The other strategy is the divisive clustering algorithm, a top-down approach in which the clusters are split while moving down the tree, with complexity O(2^n); it is a root-to-leaves concept. The merging and splitting of clusters are done in a greedy manner, and the result is presented as a dendrogram. A dendrogram is a tree diagram used to arrange the clusters obtained from hierarchical agglomerative clustering and divisive clustering. The BIRCH clustering algorithm and its applications are studied in [1].
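As a brief illustration of the single, complete and average linkage criteria described above, the following sketch runs SciPy's agglomerative clustering on synthetic two-dimensional data and cuts each dendrogram into three flat clusters; the data and the cluster count are assumptions made only for this example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 4))])

dists = pdist(X)  # pairwise Euclidean distances

# Agglomerative (bottom-up) merging under the three linkage criteria.
Z_single   = linkage(dists, method="single")    # SLINK: minimum inter-cluster distance
Z_complete = linkage(dists, method="complete")  # CLINK: maximum inter-cluster distance
Z_average  = linkage(dists, method="average")   # average linkage: mean inter-cluster distance

# Cut each dendrogram into three flat clusters and report their sizes.
for name, Z in (("single", Z_single), ("complete", Z_complete), ("average", Z_average)):
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(name, np.bincount(labels)[1:])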
IV. IMPLEMENTATION

Centroid: C = \frac{LS}{N} . (3)

The centroid of the cluster is calculated by dividing the linear sum LS of the data points by the total number of data points N in the cluster; Equation 3 is used for this calculation. Equation 4 is used to calculate the radius of the cluster, and Equation 5 gives the average inter-cluster distance D between two clusters n and m, computed from their clustering features.

Radius: R = \sqrt{\frac{\sum_{j=1}^{N} (Y_j - C)^2}{N}} = \sqrt{\frac{N C^2 + SS - 2 C \cdot LS}{N}} . (4)

D = \sqrt{\frac{\sum_{j=1}^{N_n} \sum_{k=1}^{N_m} (Y_j - Z_k)^2}{N_n N_m}} = \sqrt{\frac{N_n SS_m + N_m SS_n - 2 LS_n \cdot LS_m}{N_n N_m}} . (5)

V. RESULTS

Fig 4:- Dataset

The snapshot of the dataset is shown in Fig. 3. This dataset is preprocessed and then given to the algorithm. As the threshold increases, cluster splitting decreases; when the threshold decreases, cluster combining is avoided. The variation of the threshold for tree-BIRCH and for the flat (levelled) tree is shown in Fig. 4 and Fig. 5.

When supercluster splitting is avoided in the flat tree, the cluster count is lower than that of tree-BIRCH, as shown in Fig. 6. The runtime of the flat tree is lower than that of tree-BIRCH, as shown in Fig. 7.

Fig 5:- Threshold for tree-birch
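As a sanity check of Equations 3, 4 and 5, the short sketch below computes the centroid, the radius and the average inter-cluster distance once directly from the points and once from the clustering-feature triples (N, LS, SS); the helper cf() and the random two-dimensional data are assumptions introduced for this example, and the two computations agree.

import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 2))   # points of cluster n
Z = rng.normal(size=(80, 2))    # points of cluster m

def cf(points):
    """Clustering feature (N, LS, SS) of a set of points."""
    return len(points), points.sum(axis=0), float((points ** 2).sum())

Nn, LSn, SSn = cf(Y)
Nm, LSm, SSm = cf(Z)

# Equation 3: centroid from the clustering feature.
C = LSn / Nn

# Equation 4: radius from the clustering feature vs. the direct definition.
R_cf = np.sqrt((Nn * float(C @ C) + SSn - 2 * float(C @ LSn)) / Nn)
R_direct = np.sqrt(((Y - C) ** 2).sum() / Nn)

# Equation 5: average inter-cluster distance from the clustering features
# vs. the direct double sum over all point pairs.
D_cf = np.sqrt((Nn * SSm + Nm * SSn - 2 * float(LSn @ LSm)) / (Nn * Nm))
D_direct = np.sqrt(((Y[:, None, :] - Z[None, :, :]) ** 2).sum() / (Nn * Nm))

print(np.allclose(R_cf, R_direct), np.allclose(D_cf, D_direct))  # True True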