
Unit-4: CLUSTERING AND OUTLIER DETECTION

Cluster Analysis – Partitioning Methods – Hierarchical Methods – Density-Based Methods – Grid-Based Methods – Evaluation of Clustering – Clustering High-Dimensional Data – Clustering Graph and Network Data – Clustering with Constraints – Web Mining – Spatial Mining. Outlier Detection – Outliers and Outlier Analysis – Outlier Detection Methods – Outlier Approaches: Statistical, Proximity-Based, Clustering-Based, Classification-Based – High-Dimensional Data.

Cluster Analysis:

Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine learning technique that acts on unlabelled data: data points that are similar to one another are grouped together so that all the objects in a group belong to the same cluster.

Cluster:

The given data is divided into groups by putting similar objects together; each such group is a cluster. In short, a cluster is a collection of similar data objects grouped together.

For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses and bicycles. Because this is unsupervised learning, there are no class labels such as Car or Bike; all the data is mixed together and has no structure.

Our task is then to convert this unlabelled data into labelled data, and this can be done using clusters.

The main idea of cluster analysis is to arrange the data points into clusters, for example a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on.

Simply put, it is the partitioning of similar objects, applied to unlabelled data.

Properties of Clustering :

1. Clustering Scalability:
Nowadays clustering algorithms have to deal with vast amounts of data stored in huge databases. In order to handle such extensive databases, the clustering algorithm should be scalable; if it is not, the results obtained on a large data set may be misleading.

2. High Dimensionality:
The algorithm should be able to handle high-dimensional spaces as well as data sets of small size.

3. Algorithm Usability with multiple data kinds:

Different kinds of data can be used with clustering algorithms. The algorithm should be capable of dealing with different types of data, such as discrete, categorical, interval-based and binary data.

4. Dealing with unstructured data: There will be databases that contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data it may produce poor-quality clusters, so the algorithm should be able to handle unstructured data and give it some structure by organising it into groups of similar data objects. This makes the job of the data expert easier when processing the data and discovering new patterns.

5. Interpretability:
The clustering outcomes should be interpretable, comprehensible, and usable. The
interpretability reflects how easily the data is understood.

Clustering Methods:
The clustering methods can be classified into the following categories:

✓ Partitioning Method
✓ Hierarchical Method
✓ Density-Based Method
✓ Grid-Based Method
✓ Model-Based Method
✓ Constraint-Based Method

Partitioning Method:

It is used to make partitions of the data in order to form clusters. If "n" partitions are made of "p" objects of the database, then each partition is represented by a cluster and n ≤ p. The two conditions that need to be satisfied by this partitioning clustering method are:

* Each object must belong to exactly one group.

* Each group must contain at least one object.

In the partitioning method there is a technique called iterative relocation, which means an object may be moved from one group to another to improve the partitioning.
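Below is a minimal sketch of this idea using k-means, the classic partitioning algorithm with iterative relocation. The library (scikit-learn), the toy data and the choice of k = 2 are illustrative assumptions, not part of the syllabus.

# Partitioning by iterative relocation: k-means assigns every object to exactly
# one cluster and repeatedly relocates objects/centres to improve the partition.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one group of objects
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])   # another group

km = KMeans(n_clusters=2, n_init=10, random_state=0)  # number of partitions < number of objects
labels = km.fit_predict(X)        # each object belongs to exactly one cluster
print(labels)                     # e.g. [0 0 0 1 1 1]
print(km.cluster_centers_)        # one representative (centroid) per partition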

Hierarchical Method:
In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed. There are two approaches for creating the hierarchical decomposition:

° Agglomerative Approach:
The agglomerative approach is also known as the bottom-up approach. Initially each object forms its own separate group. The method then keeps merging the objects or groups that are close to one another, i.e. those that exhibit similar properties, and this merging continues until the termination condition holds.

° Divisive Approach:
The divisive approach is also known as the top-down approach. In this approach we start with all the data objects in the same cluster. The cluster is then divided into smaller clusters by continuous iteration, and the iteration continues until the termination condition is met or until each cluster contains a single object.
Once a group is split or merged it can never be undone; hierarchical clustering is therefore a rigid and not very flexible method. Two approaches that can be used to improve the quality of hierarchical clustering in data mining are:

* Carefully analyse the linkages between objects at every partitioning of the hierarchical clustering.

* Integrate hierarchical agglomeration with another clustering method: first use a hierarchical agglomerative algorithm to group the objects into micro-clusters, and then perform macro-clustering on the micro-clusters.
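A small sketch of the agglomerative (bottom-up) approach follows; scikit-learn, the toy data and the parameter choices are assumptions made for illustration only.

# Agglomerative clustering: each point starts as its own cluster and the closest
# clusters are merged repeatedly until only n_clusters remain.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

agg = AgglomerativeClustering(n_clusters=3, linkage="average")
print(agg.fit_predict(X))         # cluster label for every object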

Density-Based Method:
The density-based method mainly focuses on density. In this method, the given cluster will
keep on growing continuously as long as the density in the neighbourhood exceeds some
threshold, i.e, for each data point within a given cluster. The radius of a given cluster has to
contain at least a minimum number of points.
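A minimal sketch of the density-based idea using DBSCAN; the eps radius, min_samples and the toy data are arbitrary assumptions for illustration.

# DBSCAN grows a cluster while the eps-neighbourhood of a point contains at least
# min_samples points; points in no dense region are labelled -1 (noise/outliers).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [4.5, 4.5]])                       # an isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)    # dense groups get cluster ids; the isolated point gets -1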

Grid-Based Method:
In the Grid-Based method a grid is formed using the object together,i.e, the object space is
quantized into a finite number of cells that form a grid structure. One of the major
advantages of the grid-based method is fast processing time and it is dependent only on the
number of cells in each dimension in the quantized space. The processing time for this
method is much faster so it can save time.

Model-Based Method:
In the model-based method, all the clusters are hypothesized in order to find the data which
is best suited for the model. The clustering of the density function is used to locate the
clusters for a given model. It reflects the spatial distribution of data points and also provides
a way to automatically determine the number of clusters based on standard statistics, taking
outlier or noise into account. Therefore it yields robust clustering methods.

Constraint-Based Method:
The constraint-based clustering method is performed by the incorporation of application or
user-oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results. Constraints provide us with an interactive way of communication
with the clustering process. The user or the application requirement can specify constraints.

Applications of Cluster Analysis:

* It is widely used in image processing, data analysis, and pattern recognition.
* It helps marketers find distinct groups in their customer base and characterize those groups by purchasing patterns.
* It can be used in biology to derive animal and plant taxonomies and to identify genes with the same capabilities.
* It also helps in information discovery by classifying documents on the web.

Evaluation of Clustering:
Evaluating the clustering tendency can be done with i) a statistical method (the Hopkins statistic) and ii) a visual method (the Visual Assessment of cluster Tendency (VAT) algorithm).

Statistical methods
The Hopkins statistic (Lawson and Jurs, 1990) is used to assess the clustering tendency of a data set by measuring the probability that the data set was generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.

For example, let D be a real data set. The Hopkins statistic can be calculated as follows:

1. Sample uniformly n points (p1, ..., pn) from D.

2. Compute the distance xi from each real point to its nearest neighbour: for each point pi ∈ D, find its nearest neighbour pj, then compute the distance between pi and pj and denote it xi = dist(pi, pj).

3. Generate a simulated data set (randomD) drawn from a random uniform distribution with n points (q1, ..., qn) and the same variation as the original real data set D.

4. Compute the distance yi from each artificial point to its nearest real data point: for each point qi ∈ randomD, find its nearest neighbour qj in D, then compute the distance between qi and qj and denote it yi = dist(qi, qj).

5. Calculate the Hopkins statistic (H) as the sum of the artificial nearest-neighbour distances divided by the sum of the real and artificial nearest-neighbour distances: H = Σ yi / (Σ xi + Σ yi). A value around 0.5 indicates uniformly distributed data, while a value close to 1 indicates a highly clusterable data set.
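A sketch of the Hopkins statistic following the steps above. The helper name, the choice of sampling roughly 10% of the points, and the use of the data's bounding box as the "same variation" for the uniform sample are simplifying assumptions.

# Hopkins statistic: H = sum(y_i) / (sum(x_i) + sum(y_i)); ~0.5 suggests uniform
# (non-clusterable) data, values close to 1 suggest a highly clusterable data set.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, n=None, random_state=0):
    rng = np.random.default_rng(random_state)
    D = np.asarray(D, dtype=float)
    if n is None:
        n = max(1, int(0.1 * len(D)))               # sample ~10% of the points

    nn = NearestNeighbors(n_neighbors=2).fit(D)

    # x_i: distance from each sampled real point to its nearest other real point
    sample = D[rng.choice(len(D), n, replace=False)]
    x = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]   # column 0 is the point itself

    # y_i: distance from each uniformly generated point to its nearest real point
    mins, maxs = D.min(axis=0), D.max(axis=0)
    random_pts = rng.uniform(mins, maxs, size=(n, D.shape[1]))
    y = nn.kneighbors(random_pts, n_neighbors=1)[0][:, 0]

    return y.sum() / (x.sum() + y.sum())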

Measures for Quality of Clustering:

If all the data objects in a cluster are highly similar, the cluster has high quality. In most situations we can measure the quality of clustering using a dissimilarity/similarity metric, but there are also other criteria for judging the quality of a good clustering, described below.

Extrinsic method:

1. Dissimilarity/Similarity metric:
The similarity between clusters can be expressed in terms of a distance function, represented by d(i, j). Distance functions differ for different data types and variables: the measure is different for continuous-valued, categorical and vector variables. The distance function can be, for example, Euclidean distance, Mahalanobis distance or cosine distance, depending on the type of data.

2. Cluster completeness:
Cluster completeness is an essential requirement for good clustering: if any two data objects have similar characteristics, i.e. belong to the same category according to ground truth, they should be assigned to the same cluster. Cluster completeness is high if objects of the same category end up in the same cluster.
Let us consider a clustering C1 that contains sub-clusters s1 and s2, where the members of s1 and s2 belong to the same category according to ground truth. Let us consider another clustering C2 that is identical to C1 except that s1 and s2 are merged into one cluster. Then, for a clustering quality measure Q, cluster completeness requires that C2 scores higher than C1, that is, Q(C2, Cg) > Q(C1, Cg).

3. Rag bag:
In some situations there are a few heterogeneous objects that cannot be merged with objects of other categories. The quality of such a clustering is measured by the rag bag criterion, which says that heterogeneous objects should be put into a "rag bag" category rather than into a pure cluster.

Let us consider a clustering C1 and a cluster C ∈ C1 such that, according to ground truth, all objects in C belong to the same category except one object o. Consider a clustering C2 that is identical to C1 except that o is assigned to a cluster D that already holds objects of different categories (a rag bag). This situation is noisy, and according to the rag bag criterion the quality measure Q should score C2 higher than C1, that is, Q(C2, Cg) > Q(C1, Cg).

4. Small cluster preservation:

If a small category is further split into small pieces, those pieces become noise in the clustering and it becomes difficult to identify the small category at all. The small cluster preservation criterion therefore states that splitting a small category into pieces is less desirable than splitting a large category. Suppose clustering C1 splits the data into three clusters, C11 = {d1, . . . , dn}, C12 = {dn+1}, and C13 = {dn+2}.

Let clustering C2 also split the data into three clusters, namely C21 = {d1, . . . , dn−1}, C22 = {dn}, and C23 = {dn+1, dn+2}. Because C1 splits the small category {dn+1, dn+2} and C2 splits the big category, C2 is preferred according to the rule above, so the clustering quality measure Q should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

Intrinsic method:

When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact they are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.

The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, C1, ..., Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs, and b(o) as the minimum average distance from o to all clusters to which o does not belong. The silhouette coefficient of o is then s(o) = (b(o) − a(o)) / max{a(o), b(o)}, which lies between −1 and 1.
To measure a cluster's fitness within a clustering, we can compute the average silhouette coefficient of all objects in the cluster; to measure the quality of a whole clustering, we can use the average silhouette coefficient of all objects in the data set. The silhouette coefficient and other intrinsic measures can also be used in the elbow method to heuristically determine the number of clusters in a data set, by replacing the sum of within-cluster variances.
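A short sketch of using the silhouette coefficient to compare candidate numbers of clusters; the synthetic data, the values of k and the use of scikit-learn are assumptions for illustration.

# The average silhouette over the data set is computed for each k; a higher
# average indicates better-separated, more compact clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))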

Clustering High-Dimensional Data in Data Mining

Clustering is basically a type of unsupervised learning method, that is, a method in which we draw inferences from datasets consisting of input data without labeled responses.

Clustering is the task of dividing the population or data points into a number of groups such
that data points in the same groups are more similar to other data points in the same group
and dissimilar to the data points in other groups.

Challenges of Clustering High-Dimensional Data:

Clustering of high-dimensional data returns groups of objects as clusters. To perform cluster analysis of high-dimensional data we need to group similar objects together, but the high-dimensional data space is huge and it has complex data types and attributes. A major challenge is that we need to find the set of attributes that is relevant to each cluster, because a cluster is defined and characterized by the attributes it involves. When clustering high-dimensional data we therefore need to search both for the clusters and for the subspaces in which they exist.

High-dimensional data is often reduced to lower-dimensional data to make clustering and the search for clusters simpler. Some applications need appropriate models of clusters, especially for high-dimensional data: clusters in high-dimensional data are often significantly small, and conventional distance measures can be ineffective. Instead, to find the hidden clusters in high-dimensional data we need to apply sophisticated techniques that can model correlations among the objects in subspaces.

Subspace Clustering Methods:

There are three subspace clustering methods:

* Subspace search methods
* Correlation-based clustering methods
* Biclustering methods

Subspace clustering approaches search for clusters existing in subspaces of the given high-dimensional data space, where a subspace is defined by a subset of attributes of the full space.
1. Subspace Search Methods:

A subspace search method searches the subspaces for clusters. Here, a cluster is a group of similar objects in a subspace, and the similarity between objects is measured using distance or density features. The CLIQUE algorithm is an example of a subspace clustering method. Subspace search methods search a series of subspaces, and there are two approaches:

Bottom-up approach:
Starts the search from the low-dimensional subspaces. If the hidden clusters are not found in low-dimensional subspaces, it searches in higher-dimensional subspaces.

Top-down approach:
Starts the search from the high-dimensional subspaces and then searches in subsets of low-dimensional subspaces. Top-down approaches are effective if the subspace of a cluster can be determined by the local neighbourhood of the sub-space clusters.

2. Correlation-Based Clustering:
Correlation-based approaches discover hidden clusters by developing advanced correlation models. Correlation-based models are preferred when it is not possible to cluster the objects using subspace search methods; they apply advanced mining techniques for correlation cluster analysis. Biclustering methods are correlation-based clustering methods in which both the objects and the attributes are clustered.

3. Biclustering Methods:

Biclustering means clustering the data based on two factors: in some applications we can cluster both objects and attributes at the same time, and the resulting clusters are called biclusters. Biclustering has four requirements:

* Only a small set of objects participates in a cluster.
* A cluster only involves a small number of attributes.
* A data object can take part in multiple clusters, or may not be included in any cluster at all.
* An attribute may be involved in multiple clusters, or in none.

Objects and attributes are not treated in the same way: objects are clustered according to their attribute values, so objects and attributes are treated differently in biclustering analysis.
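A small sketch of biclustering using scikit-learn's SpectralCoclustering, which groups rows (objects) and columns (attributes) at the same time; the matrix shape and the number of biclusters are made-up values for illustration.

# make_biclusters creates a matrix with planted biclusters; SpectralCoclustering
# then recovers a label for every row (object) and every column (attribute).
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3,
                                   noise=0.1, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)
print(model.row_labels_)      # bicluster of each object (row)
print(model.column_labels_)   # bicluster of each attribute (column)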

Data Mining Graphs and Networks:

Data mining is the process of collecting and processing data from a heap of unprocessed
data. When the patterns are established, various relationships between the datasets can be
identified and they can be presented in a summarized format which helps in statistical
analysis in various industries. Among the other data structures, the graph is widely used in
modeling advanced structures and patterns. In data mining, the graph is used to find
subgraph patterns for discrimination, classification, clustering of data, etc. Graphs are also used in network analysis: by linking the various nodes, graphs form networks such as communication networks, the web and computer networks, social networks, etc. In multi-relational data mining, graphs or networks are used because of the varied, interconnected relationships between the datasets in a relational database.

Graph mining:

Graph mining is a process in which mining techniques are used to find patterns or relationships in a given real-world collection of graphs. By mining a graph, frequent substructures and relationships can be identified, which helps in clustering graph sets, finding relationships between graph sets, or discriminating and characterizing graphs. Predicting these pattern trends can help in building models for enhancing real-time applications. To implement graph mining, one must learn to mine frequent subgraphs.

Frequent Subgraph Mining:

Let us consider a graph h with an edge set E(h) and a vertex set V(h). A subgraph isomorphism from h to h' exists when h is a subgraph of h'. A label function is a function that maps the edges or vertices to labels. Let us consider a labeled graph dataset F = {H1, H2, H3, ..., Hn} and let s(h) be the support of h, i.e. the percentage of graphs in F in which h is a subgraph. A frequent graph is one whose support is no less than a minimum support threshold, denoted min_support.

Steps in finding frequent subgraphs:

There are two steps in finding frequent subgraphs:

* The first step is to generate frequent substructure candidates.
* The second step is to compute the support of each candidate. We must optimize and enhance the first step, because the second step involves subgraph isomorphism testing, which is an NP-complete problem and therefore computationally expensive.

There are two methods for frequent substructure mining.

The Apriori-based approach:

This approach to finding frequent graphs begins with graphs of small size and advances in a bottom-up way by creating candidates with an extra vertex or edge. The algorithm is called AprioriGraph.
Let Qk be the frequent substructure set of size k. This approach adopts a level-wise mining technique: before each level of AprioriGraph, candidates must be generated by combining two identical but slightly different frequent subgraphs of size k. After the formation of the new substructures, the frequency of each candidate graph is checked, and the graphs found to be frequent are used to create the next level of candidates. This candidate-generation step is complex for substructures, whereas generating candidates from itemsets is easy and effortless. For example, consider two itemsets of size three,

itemset 1 = pqr and

itemset 2 = qrs.
The itemset derived by joining them would be pqrs. But when it comes to substructures, there is more than one way to join two substructures.

The Apriori approach uses BFS (Breadth-First Search) because of its iterative, level-wise generation of candidates: to mine the size-(k+1) subgraphs, all size-k subgraphs must already have been mined.

The pattern-growth approach:

The pattern-growth approach can use both BFS and DFS (Depth-First Search); DFS is preferred because it consumes less memory. Consider a graph h: a new graph can be formed by adding an edge e, and the edge may (but need not) introduce a new vertex. If it introduces a vertex, the extension can be done in two ways, forward and backward. The basic pattern-growth algorithm is simple but not efficient, because it may generate a graph that has already been created, which leads to computational inefficiency. The duplicate graphs can be removed afterwards, but that increases time and work, so to avoid creating duplicate graphs the frequent graphs should be extended carefully and conservatively, which calls for other algorithms.

Algorithm:

The algorithm below follows the pattern-growth approach to frequent substructure mining in a simplistic way; if you need to search without duplication, you must go with a different algorithm such as gSpan.

An edge e is used to extend a new graph from the old one q; the newly extended graph is denoted q -> e. The extension can be either backward or forward.
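The following is a simplified, illustrative sketch of pattern-growth frequent subgraph mining in this spirit, not gSpan: instead of canonical DFS codes it uses a crude graph-hash set to suppress duplicate patterns, and support is counted with networkx's (induced) subgraph isomorphism test. All function names and the commented toy usage are assumptions for demonstration only.

# Pattern growth: extend a frequent pattern by one edge (forward = new vertex,
# backward = new edge between existing vertices) and recurse while it stays frequent.
import networkx as nx
from networkx.algorithms import isomorphism

def _matcher(host, pattern):
    return isomorphism.GraphMatcher(
        host, pattern,
        node_match=isomorphism.categorical_node_match("label", None))

def support(pattern, graph_db):
    # Fraction of database graphs that contain the pattern.
    return sum(_matcher(g, pattern).subgraph_is_isomorphic()
               for g in graph_db) / len(graph_db)

def extensions(pattern, graph_db):
    # Grow the pattern by one edge, guided by its embeddings in the database.
    for g in graph_db:
        for mapping in _matcher(g, pattern).subgraph_isomorphisms_iter():
            for u, pu in mapping.items():            # u in g  <->  pu in pattern
                for v in g.neighbors(u):
                    if v in mapping:                  # backward extension
                        pv = mapping[v]
                        if not pattern.has_edge(pu, pv):
                            c = pattern.copy(); c.add_edge(pu, pv); yield c
                    else:                             # forward extension (new vertex)
                        c = pattern.copy()
                        w = max(c.nodes) + 1
                        c.add_node(w, label=g.nodes[v]["label"])
                        c.add_edge(pu, w)
                        yield c

def pattern_growth(pattern, graph_db, min_sup, found, seen):
    key = nx.weisfeiler_lehman_graph_hash(pattern, node_attr="label")
    if key in seen or support(pattern, graph_db) < min_sup:
        return
    seen.add(key)
    found.append(pattern)
    for cand in extensions(pattern, graph_db):
        pattern_growth(cand, graph_db, min_sup, found, seen)

# Toy usage (assumed): seed with a one-vertex pattern for each vertex label.
# db = [... list of nx.Graph objects with a "label" attribute on every node ...]
# found, seen = [], set()
# for lbl in {d["label"] for g in db for _, d in g.nodes(data=True)}:
#     seed = nx.Graph(); seed.add_node(0, label=lbl)
#     pattern_growth(seed, db, min_sup=0.5, found=found, seen=seen)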

Network Analysis:

In network analysis, the relationships between units are called links in a graph. From the data mining outlook this is called link mining or link analysis. A network is a multi-relational data set represented in the form of a graph. The graph is typically very large, with nodes as objects and edges as links that denote the relationships between the nodes or objects; telephone networks and the WWW (World Wide Web) are good examples. Network analysis also helps in filtering the datasets and providing customer-preferred services. Every network consists of numerous nodes and the datasets are enormous, so studying and mining useful information from such a wide group of datasets helps in solving problems and in the effective transmission of data.

Link Mining:

Conventional machine learning methods usually take homogeneous objects from a single relation. In networks this is not applicable because of the large number of nodes and their multi-relational, heterogeneous nature; hence link mining has emerged as a new field after much research. Link mining is the convergence of research in graph mining, networks, hypertext, logic programming, link analysis, predictive analysis and modeling. Links are simply the relationships between nodes in a network, and with the help of links the mining process can be carried out efficiently. This gives rise to the following tasks.

Link-based object classification:

In link mining, attributes alone are not enough; the links and the traits of the linked nodes are also necessary. A good example is web-based classification, in which the system predicts the category of a web page based on the words that occur on the page and on the anchor text, i.e. the clickable text of the hyperlinks that point to the page. These two things act as attributes in web-based classification; the attributes can be anything that relates to the links and the network pages.

Link type prediction:


Based on the properties of the objects involved, the system predicts the purpose (type) of a link. In organizations this helps in suggesting interactive communication sessions between employees when needed; in the online retail market it helps predict what a customer prefers to buy, which can increase sales and improve recommendations.

Object type prediction:


Here the type of an object is predicted based on its attributes and properties and on the links and traits of the objects linked to it. For example, in the restaurant domain a similar method is used to predict whether a customer prefers ordering food or visiting the restaurant directly; it also helps in predicting the method of communication a customer prefers, whether by phone or by mail.

Link Cardinality estimation:


In this task there are two types of estimation. The first is predicting the number of links attached to an object. For example, the authority of a web page can be estimated from the number of links pointing to it (its in-links), while web pages that act as hubs, i.e. pages that point to a set of other pages on the same topic, can be identified using out-links. Similarly, when a pandemic strikes, following the links of an affected patient can lead us to other patients, which helps control the transmission. The second type is predicting the number of objects reachable along a route from an object; this is useful for estimating the number of objects that will be returned by a query.
Predicting link existence:
In link type prediction the type of a link is predicted; here, instead, the system predicts whether a link exists between two objects at all. For instance, this task can be used to predict whether a link exists between two web pages.

Object Reconciliation:
In this task the function is to predict whether two objects are in fact the same, on the basis of their attributes, traits or links. This method is also called identity uncertainty or record linkage. The same procedure appears in citation matching, information extraction, duplicate elimination and object consolidation. For instance, this task can help determine whether one website mirrors another.

Clustering with constraints:

Constrained clustering is an approach to clustering data that incorporates domain knowledge in the form of constraints. The input data, the constraints and the domain knowledge are all processed in the constrained clustering process, which gives the resulting clusters as output.

Methods For Clustering With Constraints:

There are various methods for clustering with constraints, each able to handle specific kinds of constraints:

Handling hard constraints:

Hard constraints can be handled by respecting the constraints in the cluster assignment procedure: whenever an object is assigned to a cluster, the assignment must not violate any constraint.

Generating super-instances for must-link constraints:

The transitive closure of the must-link constraints can be calculated, so must-link behaves as an equivalence relation that defines subsets (groups) of objects. The objects in each such subset can then be replaced by their mean, forming a single super-instance.
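The sketch below computes the transitive closure with a simple union-find structure; the constraint pairs, the object count and the toy data are made-up values for illustration.

# Group objects connected by must-link constraints (transitive closure), then
# replace each group by its mean to form one "super instance" per group.
import numpy as np

def must_link_groups(n_objects, must_links):
    parent = list(range(n_objects))

    def find(i):                      # representative of i's group (path compression)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):                  # merge the groups of i and j
        parent[find(i)] = find(j)

    for i, j in must_links:
        union(i, j)

    groups = {}
    for i in range(n_objects):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

groups = must_link_groups(6, [(0, 1), (1, 2), (4, 5)])
print(groups)                               # [[0, 1, 2], [3], [4, 5]]
X = np.random.default_rng(0).random((6, 2))
super_instances = [X[g].mean(axis=0) for g in groups]   # one mean per group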

Handling soft constraints:

Clustering with soft constraints is an optimization process: a penalty is charged whenever a constraint is violated, and the aim of the optimization is to minimize both the usual clustering cost and the total constraint-violation penalty. For example, given a data set and a set of constraints, the CVQE (Constrained Vector Quantization Error) algorithm applies k-means clustering with a constraint-violation penalty. The objective of CVQE is the total distance used in k-means plus the following penalties:

Penalty for a must-link violation:

If a must-link constraint on objects x and y is violated because they are assigned to two different centres C1 and C2, the distance between C1 and C2 is added to the objective function as a penalty.

Penalty for a cannot-link violation:

If a cannot-link constraint on objects x and y is violated because they are assigned to the same centre C, the distance between C and its nearest other centre is added to the objective function as a penalty.
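The rough sketch below illustrates such a penalised objective (it is not the exact CVQE formulation): ordinary k-means squared distances plus penalties for violated must-link and cannot-link constraints. All inputs (X, centers, labels, the constraint lists) are assumed to be supplied by the caller, and at least two clusters are assumed.

# Penalised clustering objective in the spirit described above.
import numpy as np

def penalised_objective(X, centers, labels, must_links, cannot_links):
    # Ordinary k-means term: squared distance of each object to its assigned centre.
    cost = sum(np.sum((X[i] - centers[labels[i]]) ** 2) for i in range(len(X)))

    # Must-link violated: the two objects sit in different clusters, so add the
    # (squared) distance between the two centres they were assigned to.
    for a, b in must_links:
        if labels[a] != labels[b]:
            cost += np.sum((centers[labels[a]] - centers[labels[b]]) ** 2)

    # Cannot-link violated: the two objects share a centre C, so add the (squared)
    # distance from C to its nearest other centre as a penalty.
    for a, b in cannot_links:
        if labels[a] == labels[b]:
            c = labels[a]
            others = [k for k in range(len(centers)) if k != c]
            cost += min(np.sum((centers[c] - centers[k]) ** 2) for k in others)
    return cost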

Web Mining

Web mining is the application of data mining techniques to automatically discover and extract information from web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.

Applications of Web Mining:

* Web mining helps improve the power of web search engines by classifying web documents and identifying web pages.
* It is used for web searching (e.g., Google, Yahoo) and vertical searching (e.g., FatLens, Become).
* Web mining is used to predict user behavior.
* Web mining is very useful for a particular website and e-service, e.g., landing page optimization.

Web mining can be broadly divided into three different types of techniques: Web Content Mining, Web Structure Mining, and Web Usage Mining. These are explained below.

Web Content Mining:

Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data: text, images, audio, video, etc. Content data is the collection of facts that a web page is designed to convey, and it can provide effective and interesting patterns about user needs. Mining of text documents is related to text mining, machine learning and natural language processing, and is also known as text mining. This type of mining scans and mines the text, images and groups of web pages according to the content of the input.

Web Structure Mining:

Web structure mining is the application of discovering structure information from the web. The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining basically shows the structured summary of a particular website. It identifies relationships between web pages linked by information or by direct link connection. Web structure mining can be very useful, for example, to determine the connection between two commercial websites.

Web Usage Mining:

Web usage mining is the application of identifying or discovering interesting usage patterns from large data sets; these patterns help us understand user behaviour. In web usage mining, the data users generate while accessing the web is collected in the form of logs, so web usage mining is also called log mining.

Comparison between Data Mining and Web Mining:

Definition:
* Data Mining: the process that attempts to discover patterns and hidden knowledge in large data sets in any system.
* Web Mining: the application of data mining techniques to automatically discover and extract information from web documents.

Application:
* Data Mining: very useful for web page analysis.
* Web Mining: very useful for a particular website and e-service.

Target Users:
* Data Mining: data scientists and data engineers.
* Web Mining: data scientists along with data analysts.

Access:
* Data Mining: accesses data privately.
* Web Mining: accesses data publicly.

Structure:
* Data Mining: gets the information from explicit structure.
* Web Mining: gets the information from structured, unstructured and semi-structured web pages.

Problem Type:
* Data Mining: clustering, classification, regression, prediction, optimization and control.
* Web Mining: web content mining, web structure mining.

Tools:
* Data Mining: includes tools like machine learning algorithms.
* Web Mining: special tools such as Scrapy, PageRank and Apache logs.

Skills:
* Data Mining: includes approaches for data cleansing, machine learning algorithms, statistics and probability.
* Web Mining: includes application-level knowledge and data engineering with mathematical modules like statistics and probability.
Spatial Mining:

A spatial database stores a huge amount of space-related data, including maps, preprocessed remote sensing or medical imaging records, and VLSI chip design data. Spatial databases have several features that distinguish them from relational databases: they carry topological and/or distance information, they are usually organized by sophisticated multidimensional spatial indexing structures that are accessed by spatial data access methods, and they often require spatial reasoning, geometric computation, and spatial knowledge representation techniques.

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands the unification of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and nonspatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.

Spatial data mining is expected to have broad applications in geographic information systems, marketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used.

A central challenge in spatial data mining is the exploration of efficient spatial data mining techniques, because of the huge amount of spatial data and the complexity of spatial data types and spatial access methods. Statistical spatial data analysis has been a popular approach to analyzing spatial data and exploring geographic information.

The term geostatistics is often associated with continuous geographic space, whereas the term spatial statistics is often associated with discrete space. In a statistical model that handles non-spatial data, one generally assumes statistical independence among different portions of the data.

There is no such independence among spatially distributed data, because spatial objects are in fact interrelated, or more precisely spatially co-located, in the sense that the closer two objects are placed, the more likely they are to share the same properties. For example, natural resources, climate, temperature, and economic situations are likely to be similar in geographically closely located regions.

Such close interdependency across nearby space leads to the notion of spatial autocorrelation. Based on this notion, spatial statistical modeling methods have been developed with success. Spatial data mining will further develop spatial statistical analysis methods and extend them to large amounts of spatial data, with more emphasis on effectiveness, scalability, cooperation with database and data warehouse systems, enhanced user interaction, and the discovery of new kinds of knowledge.

Outliers and Outlier Analysis:

Whenever we talk about data analysis, the term outlier often comes to mind. As the name suggests, "outliers" are data points that lie outside of what is expected. The important question about outliers is what you do with them. Whenever you analyze a data set you make assumptions about how the data were generated; if you find data points that are likely to contain some form of error, then these are outliers, and depending on the context you may want to correct or remove those errors. The data mining process involves the analysis and prediction of the data and of what the data holds. Grubbs introduced the first definition of outliers in 1969.

Difference between outliers and noise:

Noise is any unwanted error or random variance in a previously measured variable. Before finding the outliers in a data set, it is recommended to first remove the noise.

Types of Outliers

Outliers are divided into three different types:

* Global or point outliers
* Collective outliers
* Contextual or conditional outliers

Global outliers:
Global outliers are also called point outliers and are the simplest form of outliers. When a data point deviates from all the rest of the data points in a given data set, it is known as a global outlier. In most cases, outlier detection procedures are aimed at determining global outliers; for example, a single point lying far away from every other point in the data set is a global outlier.

Collective Outliers:

When a group of data points in a data set deviates from the rest of the data set, they are called collective outliers. Here the individual data objects may not be outliers on their own, but taken together as a whole they behave as outliers. To identify collective outliers, you need background information about the relationship between the behaviours of different data objects. For example, in an intrusion detection system, a single denial-of-service packet sent from one system to another may be taken as normal behaviour; if this happens on various computers simultaneously, however, it is considered abnormal behaviour, and as a whole the packets are called collective outliers.

Contextual Outliers:

As the name suggests, "contextual" means these outliers appear only within a particular context; one example is a burst of background noise in speech recognition. Contextual outliers are also known as conditional outliers. They occur when a data object deviates from the other data points because of a specific condition in the data set. As we know, data objects have two types of attributes: contextual attributes and behavioural attributes. Contextual outlier analysis enables users to examine outliers in different contexts and conditions, which can be useful in various applications. For example, a temperature reading of 45 degrees Celsius may behave as an outlier in the rainy season but like a normal data point in the context of the summer season; similarly, a low temperature value in June is a contextual outlier, while the same value in December is not.

Outlier Analysis

Outliers are often discarded when data mining is applied, but outlier analysis is still used in many applications such as fraud detection and medicine. This is because events that occur rarely can carry much more significant information than events that occur regularly.

Other applications where outlier detection plays a vital role are given below:

* Any unusual response to a medical treatment can be analyzed through outlier analysis in data mining.
* Fraud detection in the telecom industry.
* In market analysis, outlier analysis enables marketers to identify unusual customer behaviour.
* The medical analysis field.
* Fraud detection in banking and finance, such as credit cards, the insurance sector, etc.

The process of identifying the behaviour of outliers in a dataset is called outlier analysis, also known as "outlier mining"; it is a significant task of data mining.

Outlier detection methods

The various methods of outlier detection are as follows:

Supervised Methods − Supervised methods model data normality and abnormality. Domain experts examine and label a sample of the underlying data, and outlier detection can then be modeled as a classification problem: the task is to learn a classifier that can recognize outliers.

The sample can be used for training and testing. In many applications the experts can label only the normal objects, and any object that does not fit the model of normal objects is reported as an outlier. Other methods model the outliers instead, and consider objects that do not fit the model of outliers to be normal.

Unsupervised Methods − In many applications, objects labeled as "normal" or "outlier" are not available, so an unsupervised learning approach has to be used. Unsupervised outlier detection methods make an implicit assumption that the normal objects are to some degree "clustered": an unsupervised method expects normal objects to follow a pattern far more frequently than outliers. Normal objects do not have to fall into one group sharing high similarity; instead, they can form several groups, where each group has its own features.

This assumption may not always hold. Sometimes the normal objects do not share any strong patterns but are uniformly distributed, while the collective outliers share high similarity in a small area.

Unsupervised methods cannot detect such outliers effectively. In some applications normal objects are diversely distributed and many objects do not follow strong patterns; for example, in some intrusion detection and computer virus detection problems, normal activities are diverse and many of them do not fall into high-quality clusters.

Some clustering methods can be adapted to act as unsupervised outlier detection methods: the main idea is to find clusters first, and then the data objects not belonging to any cluster are detected as outliers. However, such methods suffer from two issues. First, a data object not belonging to any cluster may be noise rather than an outlier. Second, it is expensive to find clusters first and then find outliers.

Semi-Supervised Methods − In several applications, although obtaining some labeled examples is feasible, the number of such labeled examples is small. We may encounter cases where only a small set of the normal and outlier objects are labeled while most of the data are unlabeled; semi-supervised outlier detection methods were developed to tackle such scenarios.

Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning. For example, when some labeled normal objects are available, they can be used together with unlabeled objects that are close by to train a model for normal objects. The model of normal objects is then used to detect outliers: objects that do not fit the model of normal objects are classified as outliers.

Statistical approaches:

Parametric Methods:
The basic idea behind a parametric method is that there is a fixed set of parameters that determines a probability model (an idea that is also used in machine learning). Parametric methods are methods for which we know a priori that the population is normal, or, if not, that it can be approximated using a normal distribution, which is possible by invoking the Central Limit Theorem. The parameters for the normal distribution are:

* Mean
* Standard deviation

Ultimately, whether a method is classified as parametric depends completely on the assumptions made about the population. Some of the many available parametric methods are:

* Confidence interval for a population mean, with known standard deviation.
* Confidence interval for a population mean, with unknown standard deviation.
* Confidence interval for a population variance.
* Confidence interval for the difference of two means, with unknown standard deviations.
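A minimal sketch of the parametric (normal-distribution) idea applied to outlier detection: flag values whose z-score exceeds three standard deviations. The data and the 3-sigma cut-off are assumptions for illustration.

# Values far from the mean (in standard-deviation units) are suspected outliers.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10.0, 0.2, 20), [25.0]])   # 25.0 is suspicious
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])                                   # expected: [25.]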

Nonparametric Methods:
The basic idea behind a nonparametric method is that no assumption about parameters needs to be made for the population being studied; in fact the methods do not depend on the population. There is no fixed set of parameters and no assumed distribution (normal or otherwise), which is why nonparametric methods are also referred to as distribution-free methods. Nowadays nonparametric methods are gaining popularity and influence for several reasons:

* The main reason is that there is no need to assume a particular form of distribution, as there is with parametric methods.
* We do not need to make more and more assumptions about the population on which we are working.
* Most nonparametric methods are very easy to apply and to understand, i.e. their complexity is very low.

Some of the many nonparametric methods available are:

* Spearman correlation test
* Sign test for a population mean
* U-test for two independent means
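A minimal distribution-free sketch: the interquartile-range (box-plot) rule flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, with no assumption about the underlying distribution. The data are made up.

# IQR rule: anything outside the whiskers of a box plot is a suspected outlier.
import numpy as np

x = np.array([12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 40])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)        # -> [40]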

Proximity-Based Methods in Data Mining

Proximity-based methods are an important technique in data mining. They are employed to find patterns in large databases, for example by scanning documents for certain keywords and phrases. They are popular because they do not require expensive hardware or much storage space, and they scale up efficiently as the size of the database increases.

Advantages of Proximity-Based Methods:

* Proximity-based methods make use of machine learning techniques, in which algorithms are trained to respond to certain patterns.
* Using a random sample of documents, the machine learning algorithm analyzes the keywords and phrases used in them and predicts the probability that these words appear together across all documents.
* Proximity can be calculated by computing a similarity score between two collections of training data and then comparing these scores; the algorithm then tries to compute the maximum similarity score for two distinct sets of training items.

Disadvantages of Proximity-Based Methods:

* Important words may not be as close in proximity as we expected.
* Over-segmentation of documents into phrases; to counter these problems, a lexical chain-based algorithm has been proposed.
* Proximity-based methods perform very well at finding sets of documents that contain certain words based on background knowledge, but performance is limited when the background knowledge has not been pre-classified into categories.

To find sets of documents containing certain categories, one must assign categorical values to each document and then run proximity-based methods on these documents as training data, hoping for accurate representations of the categories.

For outlier detection, proximity can be used in two ways: by measuring an object's distance from the rest of the data set (distance-based outlier detection) or by comparing the density around an object with the density around its neighbours (density-based outlier detection).

Types of Proximity-Based Outlier Detection Methods:

Distance-based outlier detection methods:

A distance-based outlier detection method is a statistical technique that measures the distance between an individual data point and the rest of its group; many approaches also have a configurable threshold for deciding when a point is an outlier. Many distance-based outlier methods have been developed. They use distance measures such as Euclidean, Manhattan, or Mahalanobis distance for calculating
distances between individual points and to detect outliers. Classic algorithms for mining distance-based outliers include index-based, nested-loop, and cell-based (grid-based) approaches.
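A minimal sketch of the distance-based idea: score each object by its distance to its k-th nearest neighbour and flag the largest scores. The data, the value of k and the use of scikit-learn are illustrative assumptions.

# Points whose k-th nearest neighbour is far away are suspected outliers.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), [[6.0, 6.0]]])   # one far-away point

k = 5
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
scores = dists[:, k]              # distance to the k-th neighbour (column 0 is self)
print(np.argsort(scores)[-1])     # index of the most outlying point, expected 30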

Density-based outlier detection methods:

A density-based outlier detection method checks the density around an object and compares it with the density around its closest objects. Key applications of this method include malware detection, situational awareness, behaviour analysis, and network intrusion detection. Density-based outlier detection has some limitations: it is effective only when the points being detected are genuinely outliers rather than simply part of a much larger distribution of the data, and the density function must be defined and clearly understood, with a proper parameter value set, before implementation.
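A short sketch of density-based detection with the Local Outlier Factor, which compares the density around each point with the density around its neighbours; the data and n_neighbors value are assumptions.

# LOF: points in regions much sparser than their neighbours' regions get label -1.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), [[5.0, 5.0]]])   # one isolated point

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)          # -1 marks points judged to be outliers
print(np.where(labels == -1)[0])     # the isolated point is expected to appear here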

Clustering-Based approaches for outlier detection in data mining:

Cluster analysis is the process of dividing a set of data objects into subsets; each subset is a cluster such that the objects in it are similar to each other. The set of clusters obtained from cluster analysis can be referred to as a clustering. For example, customers in a retail market can be segregated into frequent customers and new customers.

Basic approaches in clustering:

* Partitioning methods
* Hierarchical methods
* Density-based methods
* Grid-based methods

Clustering-based outlier detection methods assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters or do not belong to any cluster. Clustering-based approaches detect outliers by examining the relationship between objects and clusters. An object is an outlier if:

* The object does not belong to any cluster.
* There is a large distance between the object and the cluster to which it is closest.
* The object is part of a small or sparse cluster; in that case, all the objects in that cluster are outliers.

Checking an outlier:

* To find the objects that do not belong to any cluster we can use density-based clustering (DBSCAN).
* To detect outliers using the distance to the closest cluster we can use k-means clustering.

K-means uses the ratio

dist(o, co) / lco

where co is the centre closest to object o, dist(o, co) is the distance between o and co, and lco is the average distance between co and the objects assigned to it. A large ratio means o is far from its centre relative to that centre's other members and is therefore suspicious.

Note that each of the procedures seen so far detects individual objects as outliers, because it compares objects one at a time against the clusters in the data set. However, in a large data set, some outliers may be similar to one another and form a small cluster, and the approaches discussed so far could be deceived by such outliers.

To overcome this problem, the third approach to cluster-based outlier detection identifies small or sparse clusters and declares the objects in those clusters to be outliers as well.
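A sketch of the ratio described above: each object is scored by its distance to the closest centre divided by the average distance of that centre's members to the centre, and large ratios suggest outliers. The data, k and the use of scikit-learn are illustrative assumptions.

# k-means based outlier scoring via the ratio dist(o, co) / lco.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (40, 2)),
               rng.normal(6, 0.4, (40, 2)),
               [[3.0, 12.0]]])                    # a point far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
d_to_center = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
avg_per_cluster = np.array([d_to_center[km.labels_ == c].mean()
                            for c in range(km.n_clusters)])
ratio = d_to_center / avg_per_cluster[km.labels_]
print(np.argsort(ratio)[-1])          # index of the most suspicious object, expected 80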

Strengths and weaknesses of cluster-based outlier detection:

Advantages:
Cluster-based outlier detection methods have the following advantages. First, they can detect outliers without labeled data, that is, they are unsupervised. They work with many types of data. A cluster can be regarded as a summary of the data: once the clusters are obtained, a cluster-based method only needs to compare each object with the clusters to determine whether the object is an outlier. This process is usually fast because the number of clusters is small compared with the total number of objects.

Disadvantages:
The weakness of cluster-based outlier detection is that its effectiveness depends largely on the clustering method used, and these methods may not be optimized for outlier detection. Clustering large data sets is usually expensive, which may be a bottleneck.

Classification-Based Approaches in Data Mining:

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e. data objects whose class labels are known) and may be represented in various forms, such as classification (if-then) rules, decision trees, or neural networks. In other words, classification is a form of data analysis that extracts models describing important data classes; such models are called classifiers. For example, we can build a classification model for banks to categorize loan applications.

A general approach to classification:

Classification is a two-step process involving:

Learning step:
This is the step in which the classification model is constructed. In this phase, training data are analyzed by a classification algorithm.

Classification step:
This is the step in which the model is employed to predict class labels for given data. In this phase, test data are used to estimate the accuracy of the classification rules.
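A sketch of the one-class idea often used in classification-based outlier detection: learn a boundary around training data assumed to be normal (the learning step), then classify new objects as inside (+1) or outside (-1) that boundary (the classification step). The data and the nu/gamma values are assumptions.

# One-class SVM: a classifier trained on normal data only; -1 means outlier.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train_normal = rng.normal(0, 1, (100, 2))        # learning step: only normal data
model = OneClassSVM(nu=0.05, gamma="scale").fit(train_normal)

test = np.array([[0.1, -0.2], [6.0, 6.0]])       # classification step
print(model.predict(test))                       # expected roughly [ 1 -1]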

High-dimensional data:

High-dimensional data poses unique challenges for the outlier detection process. Most existing algorithms fail to properly address the issues stemming from a large number of features; in particular, outlier detection algorithms perform poorly on data sets of small size with a large number of features. One proposed solution is an outlier detection algorithm based on principal component analysis (PCA) and kernel density estimation (KDE).
Such a method addresses the challenges of high-dimensional data by projecting the original data onto a smaller space and using the innate structure of the data to calculate anomaly scores for each data point. Numerical experiments reported for this approach on synthetic and real-life data show that it performs well on high-dimensional data, outperforming benchmark methods as measured by the F1-score while producing better-than-average execution times compared to the benchmark methods.
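A hedged sketch of this general idea (not the exact algorithm of the cited work): project the high-dimensional data onto a few principal components, fit a kernel density estimate in the projected space, and treat low-density points as anomalies. All parameter choices below are illustrative assumptions.

# PCA reduces the dimensionality; KDE assigns each point a density, and points
# with the lowest density receive the highest anomaly scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 50))                  # 200 objects, 50 features
X[0] += 8                                        # make the first object anomalous

Z = PCA(n_components=5).fit_transform(X)         # project to a smaller space
log_density = KernelDensity(bandwidth=1.0).fit(Z).score_samples(Z)
scores = -log_density                            # low density = high anomaly score
print(np.argsort(scores)[-1])                    # expected: 0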
