Unit 4
Cluster analysis is the process of finding groups of similar objects in order to form clusters. It is an unsupervised machine-learning technique that works on unlabelled data. A group of data points that belong together forms a cluster, in which all the objects belong to the same group.
Cluster:
The given data is divided into different groups by combining similar objects into a group. This group is called a cluster: a collection of similar data objects grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses, bicycles, etc. Because this is unsupervised learning, there are no class labels like Cars or Bikes for the vehicles; all the data is mixed together and not in a structured form.
Our task is to convert the unlabelled data into labelled data, and this can be done using clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, for example a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on.
Put simply, it is the partitioning of similar objects, applied to unlabelled data.
Properties of Clustering:
1. Clustering Scalability:
Nowadays there is a vast amount of data, and we have to deal with huge databases. In order to handle such extensive databases, the clustering algorithm should be scalable. If the algorithm is not scalable, we cannot obtain an appropriate result, which would lead to wrong conclusions.
2. High Dimensionality:
The algorithm should be able to handle high dimensional space along with the data of small
size.
3. Dealing with unstructured data:
There may be databases that contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data it may lead to poor-quality clusters, so it should be able to handle unstructured data and give it some structure by organising it into groups of similar data objects. This makes the data expert's job of processing the data and discovering new patterns easier.
4. Interpretability:
The clustering outcomes should be interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
✓ Partitioning Method
✓ Hierarchical Method
✓ Density-based Method
✓ Grid-Based Method
✓ Model-Based Method
✓ Constraint-based Method
Partitioning Method:
It is used to make partitions of the data in order to form clusters. If "n" partitions are made on "p" objects of the database, then each partition is represented by a cluster and n ≤ p. The two conditions which need to be satisfied with this partitioning clustering method are: (i) each partition must contain at least one object, and (ii) each object must belong to exactly one partition.
In the partitioning method there is a technique called iterative relocation, which means an object is moved from one group to another to improve the partitioning; k-means, sketched below, works this way.
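For illustration, here is a minimal Python sketch of a partitioning method, using scikit-learn's KMeans; the synthetic data and the choice of k = 3 clusters are assumptions made only for this example.

# Minimal sketch: k-means as a partitioning method (iterative relocation).
# The synthetic data and k=3 are assumptions for this example.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three artificial blobs of 2-D points.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # k partitions, k <= number of objects
labels = kmeans.fit_predict(X)                            # each object ends up in exactly one cluster

print(kmeans.cluster_centers_)  # one representative (centroid) per partition
print(labels[:10])              # cluster membership of the first ten objects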
Hierarchical Method:
In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed. There are two approaches for creating the hierarchical decomposition:
° Agglomerative Approach:
The agglomerative approach is also known as the bottom-up approach. Initially, each object forms its own separate group. The method then keeps merging the objects or groups that are close to one another, i.e. that exhibit similar properties. This merging process continues until the termination condition holds.
° Divisive Approach:
The divisive approach is also known as the top-down approach. In this approach we start with all the data objects in a single cluster. This cluster is divided into smaller clusters by continuous iteration. The iteration continues until the termination condition is met or until each cluster contains only one object.
Once a group is split or merged it can never be undone, so hierarchical clustering is a rigid and not very flexible method. Two approaches that can be used to improve the quality of hierarchical clustering in data mining are:
* Carefully analyse the linkages between objects at every partitioning of the hierarchical clustering.
* Integrate hierarchical agglomeration with other clustering techniques: first the objects are grouped into micro-clusters, and then macro-clustering is performed on the micro-clusters. A sketch of plain agglomerative clustering follows this list.
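As a minimal sketch of the bottom-up (agglomerative) idea, the following Python example uses SciPy's hierarchical clustering; the toy data and the choice of Ward linkage are assumptions made only for this example.

# Minimal sketch: bottom-up (agglomerative) hierarchical clustering.
# The toy data and the 'ward' linkage choice are assumptions for this example.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2)) for c in ([0, 0], [4, 4])])

Z = linkage(X, method="ward")                    # start with singletons, repeatedly merge the closest pair
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree so that at most 2 clusters remain
print(labels)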
Density-Based Method:
The density-based method mainly focuses on density. In this method, a cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e. for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. A DBSCAN sketch is given below.
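A minimal Python sketch of this idea, using scikit-learn's DBSCAN; eps (the neighbourhood radius) and min_samples (the minimum number of points) correspond to the two quantities described above, and their values here are assumptions for the example.

# Minimal sketch: density-based clustering with DBSCAN.
# eps is the neighbourhood radius, min_samples the minimum number of points;
# the values and the synthetic data are assumptions for this example.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))  # a dense region
sparse = rng.uniform(low=-4, high=4, size=(10, 2))     # scattered points
X = np.vstack([dense, sparse])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))  # clusters are labelled 0, 1, ...; -1 marks points in low-density regions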
Grid-Based Method:
In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure. One of the major advantages of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects, so it can save a lot of time. A small sketch of the quantization step follows.
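A minimal Python sketch of the quantization step only, counting objects per grid cell with NumPy; the 10 x 10 grid resolution and the density threshold of 5 objects per cell are assumptions for this example.

# Minimal sketch: quantizing the object space into grid cells and counting objects per cell.
# The 10 x 10 resolution and the threshold of 5 objects are assumptions for this example.
import numpy as np

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in ([0, 0], [4, 4])])

counts, edges = np.histogramdd(X, bins=(10, 10))  # further processing depends only on the cells
dense_cells = np.argwhere(counts >= 5)            # cells holding at least 5 objects
print(len(dense_cells))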
Model-Based Method:
In the model-based method, a model is hypothesized for each cluster in order to find the data that best fits that model. A density function is used to locate the clusters for a given model. This reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account; therefore it yields robust clustering methods. A Gaussian-mixture sketch is given below.
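A minimal Python sketch of a model-based method, fitting Gaussian mixture models with scikit-learn and letting a standard statistic (BIC) suggest the number of clusters; the data and the candidate range 1..5 are assumptions made only for this example.

# Minimal sketch: model-based clustering with a Gaussian mixture model.
# The synthetic data and the candidate range of component counts are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2)) for c in ([0, 0], [5, 0], [2.5, 4])])

# Fit models with different numbers of clusters and keep the one with the lowest BIC,
# i.e. use a standard statistic to pick the number of clusters automatically.
best = min(
    (GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)),
    key=lambda m: m.bic(X),
)
print(best.n_components, best.predict(X)[:10])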
Constraint-Based Method:
The constraint-based clustering method is performed by the incorporation of application or
user-oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results. Constraints provide us with an interactive way of communication
with the clustering process. The user or the application requirement can specify constraints.
Evaluation of Clustering:
Clustering tendency can be evaluated using i) a statistical method (the Hopkins statistic) and ii) a visual method (the Visual Assessment of cluster Tendency (VAT) algorithm).
Statistical methods
The Hopkins statistic (Lawson and Jurs 1990) is used to assess the clustering tendency of a
data set by measuring the probability that a given data set is generated by a uniform data
distribution. In other words, it tests the spatial randomness of the data.
For example, let D be a real data set. The Hopkins statistic can be calculated as follows:
1. Sample uniformly n points (p1, …, pn) from D.
2. Compute the distance, xi, from each real point to its nearest neighbor: For each point pi ∈ D, find its nearest neighbor pj; then compute the distance between pi and pj and denote it as xi = dist(pi, pj).
3. Generate a simulated data set (randomD) drawn from a random uniform distribution with n points (q1, …, qn) and the same variation as the original real data set D.
4. Compute the distance, yi, from each artificial point to the nearest real data point: For each point qi ∈ randomD, find its nearest neighbor qj in D; then compute the distance between qi and qj and denote it as yi = dist(qi, qj).
5. Calculate the Hopkins statistic (H) as the mean nearest-neighbor distance in the random data set divided by the sum of the mean nearest-neighbor distances in the real and the simulated data sets:
H = Σ yi / (Σ xi + Σ yi).
A value of H close to 0.5 indicates uniformly distributed data (no clustering tendency), whereas a value close to 1 indicates highly clustered data. A sketch of this computation follows.
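A minimal Python sketch of the five steps above, using scikit-learn's nearest-neighbour search; the sample size m, the random seed, and the synthetic data are assumptions made only for this example.

# Minimal sketch of the Hopkins statistic described above.
# D is a 2-D NumPy array of real data; m is the number of sampled points (an assumption).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = D.shape
    m = min(m, n - 1)

    # Step 1: sample m points uniformly at random from the real data set D.
    sample = D[rng.choice(n, size=m, replace=False)]

    # Step 2: distance x_i from each sampled real point to its nearest (other) neighbour in D.
    nn = NearestNeighbors(n_neighbors=2).fit(D)
    x = nn.kneighbors(sample, return_distance=True)[0][:, 1]  # skip distance 0 to itself

    # Step 3: generate m artificial points uniformly within the bounding box of D.
    random_points = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))

    # Step 4: distance y_i from each artificial point to its nearest real point.
    y = nn.kneighbors(random_points, n_neighbors=1, return_distance=True)[0][:, 0]

    # Step 5: H = sum(y) / (sum(x) + sum(y)); values near 0.5 suggest no clustering tendency.
    return y.sum() / (x.sum() + y.sum())

rng = np.random.default_rng(4)
D = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in ([0, 0], [3, 3])])
print(hopkins(D))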
If all the data objects in a cluster are highly similar, then the cluster has high quality. In most situations we can measure the quality of a clustering using a dissimilarity/similarity metric, but there are other criteria for judging the quality of a good clustering as well, described below.
Extrinsic method:
1. Dissimilarity/Similarity metric:
The similarity between the clusters can be expressed in terms of a distance function, which
is represented by d(i, j). Distance functions are different for various data types and data
variables. Distance function measure is different for continuous-valued variables, categorical
variables, and vector variables. Distance function can be expressed as Euclidean distance,
Mahalanobis distance, and Cosine distance for different types of data.
2. Cluster completeness:
Cluster completeness is an essential requirement for good clustering: if two data objects have similar characteristics, i.e. belong to the same category according to the ground truth, then they should be assigned to the same cluster. Cluster completeness is high when objects of the same category end up in the same cluster.
Let us consider a clustering C1 which contains the sub-clusters s1 and s2, where the members of s1 and s2 belong to the same category according to the ground truth. Let us consider another clustering C2 which is identical to C1 except that s1 and s2 are merged into one cluster. Then, for a clustering quality measure Q, cluster completeness requires that C2 has higher quality than C1, that is, Q(C2, Cg) > Q(C1, Cg).
3. Ragbag:
In some situations, there can be a few categories in which the objects of those categories
cannot be merged with other objects. Then the quality of those cluster categories is
measured by the Rag Bag method. According to the rag bag method, we should put the
heterogeneous object into a rag bag category.
Let us consider a clustering C1 and a cluster C ∈ C1 such that, according to the ground truth, all objects in C belong to the same category except one object o. Consider a clustering C2 which is identical to C1 except that o is instead assigned to a cluster D that holds objects of different categories, i.e. a noisy rag bag cluster according to the ground truth. The rag bag criterion then requires that C2 has higher clustering quality than C1, that is, Q(C2, Cg) > Q(C1, Cg).
4. Small cluster preservation:
Splitting a small category into pieces is more harmful than splitting a large category. Suppose the ground truth contains a large category {d1, …, dn} and a small category {dn+1, dn+2}. Let clustering C1 split the small category, giving the clusters {d1, …, dn}, {dn+1}, {dn+2}, and let clustering C2 split the large category instead, giving the clusters {d1, …, dn−1}, {dn}, {dn+1, dn+2}. As C1 splits the small category and C2 splits the big category, C2 is preferred according to the rule above, so the clustering quality measure Q should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).
Intrinsic method:
When the ground truth of a data set is not available, we have to use an intrinsic method to
assess the clustering quality. In general, intrinsic methods evaluate a clustering by
examining how well the clusters are separated and how compact the clusters are. Many
intrinsic methods take advantage of a similarity metric between objects in the data set.
The silhouette coefficient is such a measure. For a data set, D, of n objects, suppose D is partitioned into k clusters, C1, …, Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to the objects of every cluster to which o does not belong. The silhouette coefficient of o is then s(o) = (b(o) − a(o)) / max{a(o), b(o)}, which lies between −1 and 1; values close to 1 indicate that o is much closer to its own cluster than to any other cluster.
To measure a cluster's fitness within a clustering, we can compute the average silhouette
coefficient value of all objects in the cluster. To measure the quality of a clustering, we can
use the average silhouette coefficient value of all objects in the data set. The silhouette
coefficient and other intrinsic measures can also be used in the elbow method to heuristically
derive the number of clusters in a data set by replacing the sum of within-cluster variances.
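A minimal Python sketch of this use of the silhouette coefficient, comparing candidate numbers of clusters with scikit-learn; the synthetic data and the candidate range 2..6 are assumptions made only for this example.

# Minimal sketch: average silhouette coefficient for different numbers of clusters,
# as a heuristic for choosing k. Data and candidate range are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 4])])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # silhouette_score averages s(o) = (b(o) - a(o)) / max(a(o), b(o)) over all objects
    print(k, round(silhouette_score(X, labels), 3))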
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups.
Clustering of high-dimensional data returns groups of objects as clusters. Cluster analysis of high-dimensional data requires grouping similar types of objects together, but the high-dimensional data space is huge and has complex data types and attributes. A major challenge is that we need to find the set of attributes that is present in each cluster, since a cluster is defined and characterized by the attributes present in it. When clustering high-dimensional data we therefore need to search both for the clusters and for the spaces in which the clusters exist.
High-dimensional data is often reduced to low-dimensional data to make clustering and the search for clusters simpler (see the sketch below). Some applications need appropriate models of clusters, especially for high-dimensional data: clusters in high-dimensional data are often significantly small, and conventional distance measures can be ineffective. Instead, to find the hidden clusters in high-dimensional data we need to apply sophisticated techniques that can model correlations among the objects in subspaces.
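A minimal Python sketch of the dimensionality-reduction route, projecting synthetic 100-dimensional data onto a few principal components before clustering; the data, the 2 components, and k = 2 are assumptions made only for this example.

# Minimal sketch: reduce high-dimensional data to a low-dimensional space, then cluster.
# The synthetic 100-dimensional data, 2 components and k=2 are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# Two groups that differ only in a few of the 100 dimensions.
A = rng.normal(loc=0.0, scale=1.0, size=(50, 100))
B = rng.normal(loc=0.0, scale=1.0, size=(50, 100))
B[:, :5] += 6.0
X = np.vstack([A, B])

X_low = PCA(n_components=2, random_state=0).fit_transform(X)  # project onto a smaller space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
print(labels)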
Subspace clustering approaches search for clusters existing in subspaces of the given high-dimensional data space, where a subspace is defined by a subset of attributes of the full space.
1. Subspace Search Methods:
A subspace search method searches the subspaces for clusters. Here, a cluster is a group of similar objects in a subspace. The similarity between objects is measured using distance or density features.
The bottom-up approach starts searching from the low-dimensional subspaces. If the hidden clusters are not found in the low-dimensional subspaces, it goes on to search in higher-dimensional subspaces.
The top-down approach starts searching from the high-dimensional subspaces and then searches in subsets of lower-dimensional subspaces. Top-down approaches are effective only if the subspace of a cluster can be determined by its local neighbourhood.
2. Correlation-Based Clustering:
Correlation-based approaches discover hidden clusters by developing advanced correlation models. They are preferred when it is not possible to cluster the objects using subspace search methods. Correlation-based clustering includes advanced mining techniques for correlation cluster analysis. Biclustering methods are correlation-based clustering methods in which both the objects and the attributes are clustered.
3. Biclustering Methods:
Biclustering means clustering the data along two dimensions at once: in some applications we can cluster both the objects and the attributes at the same time, and the resulting clusters are called biclusters. Biclustering typically imposes four requirements: (i) only a small set of objects participates in a bicluster, (ii) a bicluster involves only a small number of attributes, (iii) an object may participate in multiple biclusters or in none, and (iv) an attribute may be involved in multiple biclusters or in none.
Objects and attributes are not treated in the same way: objects are clustered according to their attribute values, so objects and attributes play different roles in a biclustering analysis. A co-clustering sketch is given below.
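A minimal Python sketch of biclustering, clustering rows (objects) and columns (attributes) together with scikit-learn's SpectralCoclustering on a synthetic block-structured matrix; the matrix shape, noise level, and number of biclusters are assumptions made only for this example.

# Minimal sketch: biclustering, i.e. clustering objects (rows) and attributes (columns) together.
# The synthetic block-structured matrix and n_clusters=3 are assumptions for this example.
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

X, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3, noise=0.5, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(X)
print(model.row_labels_)     # bicluster assignment of each object (row)
print(model.column_labels_)  # bicluster assignment of each attribute (column)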
Data mining is the process of extracting and processing useful information from a heap of unprocessed data. When patterns are established, various relationships between the datasets can be identified and presented in a summarized format, which helps statistical analysis in various industries. Among data structures, the graph is widely used in modeling advanced structures and patterns. In data mining, the graph is used to find subgraph patterns for discrimination, classification, clustering of data, etc. The graph is also used in network analysis: by linking the various nodes, graphs form networks such as communication networks, web and computer networks, and social networks. In multi-relational data mining, graphs or networks are used because of the varied, interconnected relationships between the datasets in a relational database.
Graph mining:
Graph mining is the application of mining techniques to find patterns or relationships in a given real-world collection of graphs. By mining graphs, frequent substructures and relationships can be identified, which helps in clustering graph sets, finding relationships between graph sets, or discriminating and characterizing graphs. Predicting such pattern trends can help in building models for enhancing applications that are used in real time. To implement graph mining, one must first learn to mine frequent subgraphs.
Let us consider a graph h with an edge set E(h) and a vertex set V(h). A subgraph isomorphism from h to h' means that h is a subgraph of h'. A label function is a function that maps the edges or vertices to labels. Let us consider a labeled graph dataset F = {H1, H2, H3, …, Hn}, and let s(h) be the support of h, i.e. the percentage of graphs in F in which h occurs as a subgraph. A frequent graph is a graph whose support is no less than a minimum support threshold, denoted min_support.
The approach for finding frequent graphs begins from graphs of small size and advances in a bottom-up way by creating candidates with an extra vertex or edge. This algorithm is called Apriori Graph.
Let us consider Qk as the frequent substructure set of size k. The Apriori Graph approach adopts a level-wise mining technique: before each level, candidate generation must be done.
This is done by combining two frequent subgraphs that are the same except for a slight variation. After the formation of the new substructures, the frequency of each candidate is checked, and the graphs found to be frequent are used to create the next candidates. This step of generating frequent-substructure candidates is complex. In contrast, generating candidates for itemsets is easy and effortless. Consider, for example, two itemsets of size three such that itemset 1 = pqr and itemset 2 = qrs. The itemset derived using a join would be pqrs (see the sketch below). But when it comes to substructures, there is more than one way to join two substructures.
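For the itemset analogy only, here is a minimal Python sketch of joining two size-3 itemsets that share two items into one size-4 candidate; the item names p, q, r, s follow the example above, and the join rule shown is an illustrative simplification.

# Minimal sketch (itemset analogy only): join two k-itemsets sharing k-1 items
# into a single (k+1)-candidate, as in the pqr/qrs example above.
def join(itemset_a, itemset_b):
    # Join two k-itemsets that overlap in exactly k-1 items into one (k+1)-candidate.
    union = set(itemset_a) | set(itemset_b)
    if len(union) == len(itemset_a) + 1:  # they overlap in exactly k-1 items
        return tuple(sorted(union))
    return None                           # not joinable

itemset1 = ("p", "q", "r")
itemset2 = ("q", "r", "s")
print(join(itemset1, itemset2))  # ('p', 'q', 'r', 's')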
The Apriori approach uses BFS (breadth-first search) because of its iterative, level-wise generation of candidates: to mine the subgraphs of size k + 1, you must already have mined the subgraphs of size k.
The pattern-growth approach, in contrast, can use both BFS and DFS (depth-first search); DFS is preferred for this approach because it consumes less memory. Let us consider a graph h. A new graph can be formed by adding an edge e. The edge may introduce a new vertex, but it does not have to; if it does, this can be done in two ways, forward and backward. The pattern-growth approach is simple but not very efficient, because there is a possibility of creating a graph that has already been created, which leads to computational inefficiency. The duplicate graphs generated can be removed, but this increases the time and work. To avoid the creation of duplicate graphs, frequent graphs should be extended very carefully and conservatively, which calls for other algorithms.
Algorithm:
An edge e is used to extend a new graph from the old graph q. The newly extended graph is denoted as q ⋄ e (q extended by the edge e).
Network Analysis:
In the concept of network analysis, the relationships between the units in a graph are called links. From the data mining outlook, this is called link mining or link analysis. A network is a multi-relational dataset represented in the form of a graph. The graph is very large, with nodes as objects and edges as links that denote the relationships between the nodes or objects. Telephone networks and the WWW (World Wide Web) are very good examples. Network analysis also helps in filtering datasets and providing customer-preferred services. Every network consists of numerous nodes and the datasets are enormously wide, so studying and mining useful information from such a wide group of datasets helps in solving problems and in the effective transmission of data.
Link Mining:
Conventional machine learning methods take homogeneous objects from a single relation. In networks this is not applicable, because of the large number of nodes and their multi-relational, heterogeneous nature. Link mining has therefore emerged as a new field after much research; it is the convergence of research in graph mining, networks, hypertext, logic programming, link analysis, predictive analysis, and modeling. Links are nothing but the relationships between nodes in a network, and with the help of links the mining process can be carried out efficiently. This calls for various functions to be performed.
In link mining, attributes alone are not enough; the links and the traits of the linked nodes are also necessary. One good example is web-based classification. In web-based classification, the system predicts the category of a web page based on the presence of specified words, i.e. whether the searched words occur on that page, and on the anchor text, which is the clickable text of the hyperlink that points to the page. These two things act as attributes in web-based classification. The attributes can be anything that relates to the links and the network of pages.
Object Reconciliation:
In this task, the function is to predict whether two objects are the same on the basis of their attributes, traits, or links. This task is also called identity uncertainty or record linkage. The same procedure is used in citation matching, information extraction, duplicate elimination, and object consolidation. For instance, this task can help determine whether one website mirrors another.
Constrained clustering is an approach to clustering data that incorporates domain knowledge in the form of constraints. All inputs, including the data, the constraints, and the domain knowledge, are processed by the constrained clustering process, which gives the resulting clusters as output.
There are various methods for clustering with constraints that can handle specific constraints, for example by adding penalties to the objective function (a sketch follows the two penalties below):
Penalty in must-link violation:
This penalty occurs when there is a must-link constraint on objects x and y but they are assigned to two different centers C1 and C2, so that the constraint is violated; the distance between C1 and C2 is then added to the objective function as a penalty.
Penalty in cannot-link violation:
This type of penalty is different from the must-link violation: it occurs when there is a cannot-link constraint on objects x and y but both are assigned to a common center C, so that the constraint is violated; a penalty, such as the distance from C to the next closest center, is then added to the objective function.
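A minimal, illustrative Python sketch of the penalty idea above (a toy objective, not a specific published constrained-clustering algorithm): squared distances to centers plus penalties for violated must-link and cannot-link constraints. The data, centers, labels, and constraint pairs are assumptions made only for this example.

# Minimal toy sketch: a k-means-style objective with penalties for violated constraints.
# All inputs below are assumptions for this example.
import numpy as np

def penalised_objective(X, labels, centers, must_links, cannot_links):
    # Sum of squared distances to assigned centers plus penalties for violated constraints.
    cost = sum(np.sum((X[i] - centers[labels[i]]) ** 2) for i in range(len(X)))
    for i, j in must_links:
        if labels[i] != labels[j]:
            # Must-link violated: the objects ended up at two different centers;
            # add the distance between those centers as a penalty.
            cost += np.linalg.norm(centers[labels[i]] - centers[labels[j]])
    for i, j in cannot_links:
        if labels[i] == labels[j]:
            # Cannot-link violated: both objects share a common center c; add the
            # distance from c to the nearest other center as a penalty.
            c = centers[labels[i]]
            others = [k for k in range(len(centers)) if k != labels[i]]
            cost += min(np.linalg.norm(c - centers[k]) for k in others)
    return cost

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])
labels = np.array([0, 0, 1, 1])
print(penalised_objective(X, labels, centers, must_links=[(1, 2)], cannot_links=[(0, 1)]))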
Web Mining
Web mining is the application of data mining techniques to automatically discover and extract information from web documents and services. The main purpose of web mining is discovering useful information from the World Wide Web and its usage patterns.
Web mining helps improve the power of web search engines by classifying web documents and identifying web pages.
*It is used for web searching, e.g., Google, Yahoo, etc., and vertical searching, e.g., FatLens, Become, etc.
*Web mining is used to predict user behavior.
*Web mining is very useful for a particular website and e-service, e.g., landing page optimization.
*Web mining can be broadly divided into three different types of techniques: Web Content Mining, Web Structure Mining, and Web Usage Mining. A comparison between data mining and web mining is given below.
Data Mining vs. Web Mining:
Definition:
*Data Mining: the process that attempts to discover patterns and hidden knowledge in large data sets in any system.
*Web Mining: the application of data mining techniques to automatically discover and extract information from web documents.
Application:
Target Users:
*Data Mining: data scientists and data engineers.
*Web Mining: data scientists along with data analysts.
Access:
*Data Mining: accesses data privately.
*Web Mining: accesses data publicly.
Structure:
*Data Mining: gets the information from an explicit structure.
*Web Mining: gets the information from structured, unstructured, and semi-structured web pages.
Problem Type:
*Data Mining: clustering, classification, regression, prediction, optimization, and control.
*Web Mining: web content mining, web structure mining.
Tools:
*Data Mining: includes tools like machine learning algorithms.
*Web Mining: special tools such as Scrapy, PageRank, and Apache logs.
Skills:
*Data Mining: includes approaches for data cleansing, machine learning algorithms, statistics, and probability.
*Web Mining: includes application-level knowledge and data engineering with mathematical modules like statistics and probability.
Spatial Mining:
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other
interesting patterns not explicitly stored in spatial databases. Such mining demands the
unification of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.
A central challenge to spatial data mining is the exploration of efficient spatial data mining techniques, because of the huge amount of spatial data and the complexity of spatial data types
and spatial access methods. Statistical spatial data analysis has been a popular approach to
analyzing spatial data and exploring geographic information.
The term geostatistics is often associated with continuous geographic space, whereas the term spatial statistics is often associated with discrete space. In a statistical model that handles nonspatial data, one generally assumes statistical independence among different parts of the data. There is no such independence among spatially distributed data because, in reality, spatial objects are interrelated, or more precisely spatially co-located, in the sense that the closer two objects are placed, the more likely they are to share similar properties. For example, natural resources, climate, temperature, and economic situations are likely to be similar in geographically closely located regions.
Such a property of close interdependency across nearby space leads to the notion of spatial
autocorrelation. Based on this notion, spatial statistical modeling methods have been
developed with success. Spatial data mining will create spatial statistical analysis methods
and extend them for large amounts of spatial data, with more emphasis on effectiveness,
scalability, cooperation with database and data warehouse systems, enhanced user
interaction, and the discovery of new kinds of knowledge.
Noise is any unwanted error or random variance in a previously measured variable. Before finding the outliers present in any data set, it is recommended to first remove the noise.
Types of Outliers
Global Outliers:
Global outliers are also called point outliers and are the simplest form of outliers. When a data point deviates from all the rest of the data points in a given data set, it is known as a global outlier. In most cases, outlier detection procedures are aimed at determining global outliers. In the usual illustration, the single point lying far from all the others (drawn in green) is the global outlier.
Collective Outliers:
When a group of data points deviates from the rest of the data set, they are called collective outliers. Here, the individual data objects may not be outliers, but considered as a whole they behave as outliers. To identify such outliers, you need background information about the relationship between the behaviors of the different data objects. For example, in an intrusion detection system, a denial-of-service (DoS) packet sent from one system to another is taken as normal behavior; but if this happens across various computers simultaneously, it is considered abnormal behavior, and as a whole these events are called collective outliers. In the usual illustration, the group of green data points taken together represents a collective outlier.
Contextual Outliers:
As the name suggests, "contextual" means the outlier arises within a context; for example, in speech recognition, a piece of background noise. Contextual outliers are also known as conditional outliers. These outliers occur when a data object deviates from the other data points because of a specific condition in a given data set. As we know, data objects have two types of attributes: contextual attributes and behavioral attributes. Contextual outlier analysis enables users to examine outliers in different contexts and conditions, which can be useful in various applications. For example, a temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season, but it will behave like a normal data point in the context of a summer season. In the corresponding diagram, a green dot representing a low temperature value in June is a contextual outlier, since the same value in December would not be an outlier.
Outliers Analysis
Outliers are discarded in many places when data mining is applied, but outlier analysis is still used in many applications such as fraud detection, medicine, etc. This is usually because events that occur rarely can carry much more significant information than events that occur regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed through outlier
analysis in data mining.
Supervised Methods − Supervised methods model data normality and abnormality. Domain professionals test and label a sample of the underlying data, and outlier detection can then be modeled as a classification problem: the task is to learn a classifier that can identify outliers. The sample is used for training and testing. In many applications, the professionals can label only the normal objects, and any object that does not fit the model of normal objects is reported as an outlier. Other methods model the outliers instead and consider objects that do not fit the model of outliers as normal.
Unsupervised Methods − An unsupervised outlier detection method assumes that normal objects follow a pattern far more frequently than outliers. Normal objects do not have to fall into one group sharing high similarity; instead, they can form several groups, where each group has its own features.
This assumption may not always hold: sometimes the normal objects do not share any strong patterns but are, instead, uniformly distributed, while the collective outliers share high similarity in a small area. Unsupervised methods cannot detect such outliers effectively. In some applications, normal objects are diversely distributed and many objects do not follow strong patterns; for example, in some intrusion detection and computer virus detection problems, normal activities are diverse and many do not fall into high-quality clusters.
Statistical approaches:
Parametric Methods:
The basic idea behind parametric methods is that a fixed set of parameters determines a probability model, as is also done in machine learning. Parametric methods are those for which we know a priori that the population is normal, or, if it is not, we can easily approximate it using a normal distribution by invoking the Central Limit Theorem. The parameters used with the normal distribution are as follows (a simple 3-sigma sketch follows this list):
Mean
Standard Deviation
The confidence interval for the population mean with known standard deviation.
The confidence interval for the population mean with unknown standard deviation.
The confidence interval for the population variance.
The confidence interval for the difference of two means, with unknown standard deviation.
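A minimal Python sketch of a simple parametric approach: fit a normal model (mean and standard deviation) and flag values far from the mean. The synthetic data and the 3-sigma cut-off are assumptions made only for this example.

# Minimal sketch: parametric (normal-model) outlier detection with a 3-sigma rule.
# The synthetic data and the cut-off of 3 standard deviations are assumptions.
import numpy as np

rng = np.random.default_rng(7)
values = np.concatenate([rng.normal(loc=50.0, scale=5.0, size=200), [95.0, 2.0]])

mu, sigma = values.mean(), values.std()
z = (values - mu) / sigma         # standardise using the fitted parameters
outliers = values[np.abs(z) > 3]  # points that are unlikely under the normal model
print(mu, sigma, outliers)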
Nonparametric Methods:
The basic idea behind nonparametric methods is that there is no need to make any assumption about the parameters of the given population or the population we are studying. In fact, the methods do not depend on the population: there is no fixed set of parameters, and no distribution (normal distribution, etc.) of any kind is assumed. This is also the reason that nonparametric methods are referred to as distribution-free methods. Nowadays nonparametric methods are gaining popularity and influence, for several reasons:
The main reason is that there is no need to satisfy the rigid assumptions required by parametric methods.
The second important reason is that we do not need to make many assumptions about the population we are working on.
Most of the nonparametric methods available are very easy to apply and to understand, i.e. their complexity is very low.
There are many nonparametric methods available today, such as histogram-based methods and kernel density estimation.
Proximity-based methods are an important technique in data mining. They are employed to
find patterns in large databases by scanning documents for certain keywords and phrases.
They are highly prevalent because they do not require expensive hardware or much storage
space, and they scale up efficiently as the size of databases increases.
To find sets of documents containing certain categories, one must assign categorical values
to each document and then run proximity-based methods on these documents as training
data, hoping for accurate representations of the categories.
One way to identify outliers is by calculating an object's distance from the rest of the data set (distance-based detection); a closely related idea compares the local density around an object with that of its neighbours, which is known as density-based outlier detection. A sketch using the Local Outlier Factor is given below.
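A minimal Python sketch of a density-based detector, scikit-learn's Local Outlier Factor, which compares each object's local density with that of its neighbours; the synthetic data and n_neighbors = 20 are assumptions made only for this example.

# Minimal sketch: density-based outlier detection with the Local Outlier Factor.
# The synthetic data and n_neighbors=20 are assumptions for this example.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 2)), [[6.0, 6.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)       # -1 marks objects whose local density is unusually low
print(np.where(labels == -1)[0])  # indices of the detected outliers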
Clustering Analysis is the process of dividing a set of data objects into subsets. Each subset
is a cluster such that objects are similar to each other. The set of clusters obtained from
clustering analysis can be referred to as Clustering. For example: Segregating customers in
a Retail market as a frequent customer, new customer.
Clustering-based outlier detection methods assume that the normal data objects belong to
large and dense clusters, whereas outliers belong to small or sparse clusters, or do not
belong to any cluster. Clustering-based approaches detect outliers by examining the relationship between objects and clusters. An object is considered an outlier based on the following checks:
.Does the object belong to any cluster? If not, then it is identified as an outlier.
.Is there a large distance between the object and the cluster to which it is closest? If yes, it is
an outlier.
.Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster are
outliers.
Checking an outlier:
.To check for objects that do not belong to any cluster, we use DENSITY BASED CLUSTERING (DBSCAN).
.To check for outliers using the distance to the closest cluster, we use K-MEANS CLUSTERING (K-Means) together with the ratio below (a sketch follows):
dist(o, c_o) / l_o
where c_o is the center of the cluster closest to object o and l_o is the average distance between c_o and the objects assigned to that cluster. If this ratio is much larger than 1, o stands out from its closest cluster and is treated as an outlier.
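A minimal Python sketch of this distance-ratio check with k-means; the synthetic data, k = 2, and the threshold of 3 on the ratio are assumptions made only for this example.

# Minimal sketch: k-means distance-ratio check, dist(o, c_o) / l_o, for outliers.
# Synthetic data, k=2 and the threshold of 3 are assumptions for this example.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(80, 2)) for c in ([0, 0], [4, 4])] + [[[2.0, -3.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)  # dist(o, c_o)
avg = np.array([dist[km.labels_ == k].mean() for k in range(2)])    # l_o for each cluster
ratio = dist / avg[km.labels_]
print(np.where(ratio > 3)[0])  # indices of objects that stand out from their cluster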
Note that each of the procedures we have seen so far detects individual objects as outliers, because they evaluate objects one at a time against the clusters in the data set. However, in a huge data set, some outliers may be similar to one another and form a small cluster; the procedures discussed so far can be deceived by such outliers.
To overcome this problem, the third approach to cluster-based outlier detection identifies small or sparse clusters and declares all the objects in those clusters to be outliers as well.
Advantages:
Cluster-based outlier detection methods have the following advantages. First, they can detect outliers without the data being labeled, that is, they are unsupervised. They work for many types of data. A cluster can be regarded as a summary of the data: once the clusters are obtained, a cluster-based method only needs to compare each object with the clusters to determine whether the object is an outlier. This process is usually fast, because the number of clusters is usually small compared with the total number of objects.
Disadvantages:
The weakness of cluster-based outlier detection is that its effectiveness depends largely on the clustering method used, and these methods may not be optimized for outlier detection. Clustering techniques for large data sets are usually expensive, which may become a bottleneck.
Learning Step:
This is the step in which the classification model is constructed. In this phase, training data are analyzed by a classification algorithm.
Classification Step:
This is the step in which the model is employed to predict class labels for given data. In this phase, test data are used to estimate the accuracy of the classification rules.
High-dimensional data:
High-dimensional data poses unique challenges for the outlier detection process. Most existing algorithms fail to properly address the issues stemming from a large number of features; in particular, outlier detection algorithms perform poorly on data sets of small size with a large number of features. One proposed remedy is an outlier detection algorithm based on principal component analysis (PCA) and kernel density estimation (KDE).
Such a method addresses the challenges of high-dimensional data by projecting the original data onto a smaller space and using the innate structure of the data to calculate an anomaly score for each data point. Numerical experiments on synthetic and real-life data reported for this approach show that it performs well on high-dimensional data; in particular, it outperforms benchmark methods as measured by the F1-score and also produces better-than-average execution times compared with the benchmark methods.
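A minimal Python sketch of the general PCA-plus-KDE idea just described (not the exact algorithm from the cited work): project the data onto a few principal components, fit a kernel density estimate there, and treat low densities as anomaly scores. The synthetic data, the 5 components, the bandwidth, and the 1% cut-off are assumptions made only for this example.

# Minimal sketch of the general PCA + KDE idea: low density in a reduced space = anomalous.
# All parameter choices and the synthetic data are assumptions for this example.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(300, 50)),
               rng.normal(loc=5.0, scale=1.0, size=(3, 50))])  # a few far-away points

X_low = PCA(n_components=5, random_state=0).fit_transform(X)   # project onto a smaller space
kde = KernelDensity(bandwidth=1.0).fit(X_low)
scores = -kde.score_samples(X_low)                             # higher score = lower density = more anomalous

threshold = np.quantile(scores, 0.99)
print(np.where(scores > threshold)[0])                         # indices of the most anomalous points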