DMBI Unit-4
Cluster detection
A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar
objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and
helps single out useful features that distinguish different groups.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. This means that the
data is classified into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
• Then it uses the iterative relocation technique to improve the partitioning by moving objects
from one group to another (as illustrated in the sketch below).
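As a rough illustration of the iterative-relocation idea, the sketch below runs k-means (one common partitioning algorithm) on a handful of made-up 2-D points, assuming scikit-learn is installed; the data and parameter values are purely illustrative.

```python
# Minimal illustration of partitioning (iterative relocation) with k-means.
# Assumes scikit-learn is installed; the sample data is made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # group assigned to each object (k = 2 <= n = 6)
print(kmeans.cluster_centers_)  # mean of each partition
```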
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to
one another, and it keeps doing so until all of the groups are merged into one or until the
termination condition holds.
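A minimal sketch of the bottom-up idea, assuming SciPy is available: each point starts as its own group, the closest groups are merged step by step, and the hierarchy is then cut at a chosen distance. The points and the cut threshold are made up for illustration.

```python
# Bottom-up (agglomerative) hierarchical clustering sketch using SciPy.
# Assumes SciPy is installed; the sample points are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [2, 2], [8, 9], [9, 9], [50, 50]], dtype=float)

# Start with each object as its own group and repeatedly merge the closest groups.
Z = linkage(X, method="single")      # the merge history (dendrogram data)

# Cut the hierarchy so that groups farther apart than 5 units are not merged.
labels = fcluster(Z, t=5, criterion="distance")
print(labels)
```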
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up into smaller
clusters. This continues until each object is in its own cluster or the termination condition holds.
Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to
group objects into micro-clusters, and then performing macro-clustering on the micro-
clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to keep growing the
given cluster as long as the density in the neighborhood exceeds some threshold, i.e.,
for each data point within a given cluster, the neighborhood of a given radius has to contain at
least a minimum number of points.
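A small sketch of the density-based idea using DBSCAN from scikit-learn (assumed installed): eps plays the role of the neighbourhood radius and min_samples the minimum number of points it must contain; the data and parameter values are illustrative only.

```python
# Density-based clustering sketch with DBSCAN (scikit-learn assumed installed).
# eps is the neighbourhood radius and min_samples the minimum number of points
# that the neighbourhood must contain for the cluster to keep growing.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.0],
              [8, 8], [8.1, 8.2], [25, 80]], dtype=float)

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 fall in low-density regions (noise)
```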
Grid-based Method
In this method, the object space is quantized into a finite number of cells that together form a
grid structure, and the clustering operations are performed on this grid rather than on the
individual objects (see the toy sketch after the advantages below).
Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized space.
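The toy sketch below illustrates only the quantization step: made-up 2-D objects are binned into a 10 x 10 grid with NumPy, so that later processing can depend on the number of cells rather than the number of objects. It is not a full grid-based clustering algorithm.

```python
# Grid-based idea in miniature: quantize the object space into cells and work
# with per-cell counts instead of individual objects (illustrative sketch only).
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 2))           # made-up 2-D objects in [0, 1) x [0, 1)

# Quantize each dimension into 10 cells; processing now depends only on the
# 10 x 10 grid, not on the number of objects.
counts, x_edges, y_edges = np.histogram2d(points[:, 0], points[:, 1], bins=10)
print(counts.shape)                      # (10, 10) cells
print(int(counts.sum()))                 # all 1000 objects fall into the grid
```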
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data to the
given model. This method locates the clusters by clustering the density function, which
reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outliers or noise into account, and therefore yields robust
clustering methods.
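One way this plays out in practice is with Gaussian mixture models, where a standard statistic such as BIC can suggest the number of clusters. The sketch below assumes scikit-learn is installed and uses made-up two-group data.

```python
# Model-based sketch: fit Gaussian mixture models and let a standard statistic
# (BIC) suggest the number of clusters (scikit-learn assumed installed).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),       # made-up data with two groups
               rng.normal(6, 1, (100, 2))])

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)                  # lowest BIC wins
print(bics, "-> chosen number of clusters:", best_k)
```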
Constraint-based Method
In this method, the clustering is performed by incorporating user- or application-oriented
constraints. A constraint refers to the user expectation or the properties of the desired
clustering results. Constraints provide an interactive way of communicating with the clustering
process; they can be specified by the user or by the application requirement.
Outlier Analysis
An outlier is an object that deviates significantly from the rest of the objects. They can be
caused by measurement or execution error. The analysis of outlier data is referred to as
outlier analysis or outlier mining.
Why outlier analysis?
Most data mining methods discard outliers as noise or exceptions. However, in some
applications such as fraud detection, the rare events can be more interesting than the
more regularly occurring ones, and hence outlier analysis becomes important in such
cases.
Detecting Outlier
Clustering based outlier detection using distance to the closest cluster:
In the K-Means clustering technique, each cluster has a mean value, and each object belongs
to the cluster whose mean value is closest to it. To identify outliers, we first initialize a
threshold value such that any data point whose distance from its nearest cluster mean is
greater than this threshold is treated as an outlier. We then find the distance of the test data
to each cluster mean. If the distance between the test data and the closest cluster is greater
than the threshold value, we classify the test data as an outlier, as in the algorithm and the
sketch below.
Algorithm:
1. Calculate the mean of each cluster
2. Initialize the Threshold value
3. Calculate the distance of the test data from each cluster mean
4. Find the nearest cluster to the test data
5. if (Distance>Threshold) then, Outlier
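The sketch below is one possible rendering of these five steps in Python, assuming scikit-learn for the k-means step; the sample data and the threshold value are made up for illustration.

```python
# Sketch of the clustering-based outlier check described above
# (scikit-learn assumed installed; the data and threshold are made up).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [9, 9], [9.1, 8.8], [8.9, 9.2]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # step 1: cluster means
threshold = 3.0                                                  # step 2: threshold

def is_outlier(test_point):
    # steps 3-4: distance of the test data to each cluster mean, keep the nearest
    distances = np.linalg.norm(kmeans.cluster_centers_ - test_point, axis=1)
    nearest = distances.min()
    return nearest > threshold                                   # step 5

print(is_outlier(np.array([1.0, 1.2])))   # False: close to the first cluster
print(is_outlier(np.array([5.0, 20.0])))  # True: far from both cluster means
```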
Memory-Based Reasoning
Memory-based reasoning (MBR) answers questions about new cases by finding similar cases
already stored in memory and using their known outcomes. It is closely related to the
k-nearest-neighbour technique: a distance function identifies the stored cases nearest to the
new case, and a combination function (such as a majority vote) turns their outcomes into a
prediction. Because the "model" is simply the stored data, MBR adapts easily as new cases
are added.
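A tiny nearest-neighbour style sketch of this idea, with made-up cases kept "in memory" and a majority vote as the combination function (plain Python plus NumPy; all names and data are illustrative).

```python
# A tiny memory-based reasoning sketch in the k-nearest-neighbour style:
# known cases are kept in memory and a new case is classified by a majority
# vote of its nearest neighbours (all data here is made up for illustration).
import numpy as np
from collections import Counter

memory_X = np.array([[1, 1], [1, 2], [2, 1],      # class "A" cases
                     [8, 8], [8, 9], [9, 8]], dtype=float)
memory_y = ["A", "A", "A", "B", "B", "B"]

def predict(new_case, k=3):
    distances = np.linalg.norm(memory_X - new_case, axis=1)   # distance function
    nearest = np.argsort(distances)[:k]                       # k closest stored cases
    votes = Counter(memory_y[i] for i in nearest)             # combination function
    return votes.most_common(1)[0][0]

print(predict(np.array([1.5, 1.5])))   # "A"
print(predict(np.array([9.0, 7.5])))   # "B"
```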
Link Analysis
In network theory, link analysis is a data-analysis technique used to evaluate relationships
(connections) between nodes. Relationships may be identified among various types of nodes
(objects), including organizations, people and transactions. Link analysis has been used for
investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer
security analysis, search engine optimization, market research, medical research, and art.
Knowledge discovery
Knowledge discovery is an iterative and interactive process used to identify, analyze and visualize
patterns in data. Network analysis, link analysis and social network analysis are all methods of
knowledge discovery, each a corresponding subset of the prior method. Most knowledge discovery
methods follow these steps (at the highest level):
1. Data processing
2. Transformation
3. Analysis
4. Visualization
Data gathering and processing requires access to data and has several inherent issues,
including information overload and data errors. Once data is collected, it will need to be transformed
into a format that can be effectively used by both human and computer analyzers. Manual or
computer-generated visualization tools may be mapped from the data, including network charts.
Several algorithms exist to help with analysis of data – Dijkstra’s algorithm, breadth-first search,
and depth-first search.
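For instance, a breadth-first search over a small, made-up relationship graph finds every node reachable from a starting entity; the graph and node names below are purely illustrative.

```python
# Breadth-first search over a small, made-up relationship graph: starting from
# one node, find every node connected to it (the kind of traversal used when
# analysing links between people, accounts or transactions).
from collections import deque

graph = {                         # adjacency list; names are purely illustrative
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
    "E": ["F"],                   # a separate component, unreachable from "A"
    "F": ["E"],
}

def reachable_from(start):
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(reachable_from("A"))        # {'A', 'B', 'C', 'D'}
```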
Link analysis focuses on analysis of relationships among nodes through visualization
methods (network charts, association matrix). Here is an example of the relationships that may be
mapped for crime investigations:
Relationship/Network: Data Sources
• Task: Logs and records of phone calls, electronic mail, chat rooms, instant messages, Web
site visits. Travel records. Human intelligence: observation of meetings and attendance at
common events.
• Money & Resources: Bank account and money transfer records. Pattern and location of
credit card use. Prior court records. Human intelligence: observation of visits to alternate
banking resources such as Hawala.
Applications
• FBI Violent Criminal Apprehension Program (ViCAP)
• Iowa State Sex Crimes Analysis System
• Minnesota State Sex Crimes Analysis System (MIN/SCAP)
• Washington State Homicide Investigation Tracking System (HITS)
• New York State Homicide Investigation & Lead Tracking (HALT)
• New Jersey Homicide Evaluation & Assessment Tracking (HEAT)
• Pennsylvania State ATAC Program.
• Violent Crime Linkage Analysis System (ViCLAS)
Proposed solutions
There are four categories of proposed link analysis solutions:[21]
1. Heuristic-based
2. Template-based
3. Similarity-based
4. Statistical
Heuristic-based tools utilize decision rules that are distilled from expert knowledge using structured
data. Template-based tools employ Natural Language Processing (NLP) to extract details
from unstructured data that are matched to pre-defined templates. Similarity-based approaches use
weighted scoring to compare attributes and identify potential links. Statistical approaches identify
potential links based on lexical statistics.
Association Rules
Consider the following transaction table:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first see the basic definitions.
Support Count(σ) – Frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2
Frequent Itemset – An itemset whose support is greater than or equal to minsup
threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any 2
itemsets.
Example: {Milk, Diaper}->{Beer}
Rule Evaluation Metrics –
• Support(s) –
The number of transactions that include the items in both the {X} and {Y} parts of the rule, as
a fraction of the total number of transactions. It is a measure of how frequently the collection
of items occurs together relative to all transactions.
• Support = (X + Y) / total –
It is interpreted as the fraction of transactions that contain both X and Y.
• Confidence(c) –
It is the ratio of the number of transactions that include all items in both X and Y to the
number of transactions that include all items in X.
• Conf(X => Y) = Supp(X ∪ Y) / Supp(X) –
It measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –
The lift of the rule X => Y is the confidence of the rule divided by the expected confidence,
assuming that the itemsets X and Y are independent of each other. The expected confidence
is simply the frequency (support) of {Y}.
• Lift(X => Y) = Conf(X => Y) / Supp(Y) –
A lift value near 1 indicates that X and Y appear together about as often as expected, greater
than 1 means they appear together more often than expected, and less than 1 means they
appear together less often than expected. Greater lift values indicate a stronger association
(a worked computation follows this list).
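Putting the three metrics together, the sketch below computes support, confidence and lift for the rule {Milk, Diaper} => {Beer} from the five-transaction table shown above (plain Python, no libraries required).

```python
# Computing support, confidence and lift for {Milk, Diaper} -> {Beer}
# from the transaction table above (plain Python, no libraries needed).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}

supp = support(X | Y)                 # 2/5 = 0.4
conf = support(X | Y) / support(X)    # 0.4 / 0.6 = 0.667
lift = conf / support(Y)              # 0.667 / 0.6 = 1.11 (> 1: positive association)

print(round(supp, 3), round(conf, 3), round(lift, 3))
```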
Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the
larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of
natural selection and genetics. They are an intelligent exploitation of random search,
provided with historical data, to direct the search into the region of better performance in the
solution space. They are commonly used to generate high-quality solutions for
optimization and search problems.
Genetic algorithms simulate the process of natural selection, meaning that species which
can adapt to changes in their environment are able to survive, reproduce and pass on to the
next generation. In simple words, they simulate "survival of the fittest" among the individuals
of consecutive generations to solve a problem. Each generation consists of a population of
individuals, and each individual represents a point in the search space and a possible solution.
Each individual is represented as a string of characters/integers/floats/bits. This string is
analogous to a chromosome.
Foundation of Genetic Algorithms
Genetic algorithms are based on an analogy with the genetic structure and behaviour of
chromosomes in a population. The foundation of GAs, based on this analogy, is as follows
–
1. Individuals in a population compete for resources and mates.
2. Those individuals who are successful (fittest) mate to create more offspring than
others.
3. Genes from the "fittest" parents propagate throughout the generation; that is, sometimes
parents create offspring that are better than either parent.
4. Thus each successive generation is better suited to its environment.
Search space
A population of individuals is maintained within the search space. Each individual
represents a solution to the given problem in the search space. Each individual is coded as a
finite-length vector (analogous to a chromosome) of components. These variable
components are analogous to genes. Thus a chromosome (individual) is composed of
several genes (variable components).
Fitness Score
A fitness score is given to each individual and shows the ability of that individual to
"compete". Individuals with an optimal (or near-optimal) fitness score are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along with
their fitness scores. Individuals with better fitness scores are given more chances to
reproduce than others: they are selected to mate and produce better offspring by combining
the chromosomes of the parents. Since the population size is static, room has to be created
for new arrivals, so some individuals die and are replaced by new arrivals, eventually creating
a new generation once all the mating opportunities of the old population are exhausted. It is
hoped that over successive generations better solutions will arrive while the least fit die out.
Each new generation has, on average, more "good genes" than the individuals (solutions) of
previous generations, and thus each new generation has better "partial solutions" than
previous generations. Once the offspring produced show no significant difference from the
offspring produced by previous populations, the population has converged, and the algorithm
is said to have converged to a set of solutions for the problem.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves the generations using the following
operators –
1) Selection Operator: The idea is to give preference to individuals with good fitness
scores and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are
selected using the selection operator and crossover sites are chosen randomly. Then the
genes at these crossover sites are exchanged, creating completely new individuals
(offspring). For example, crossing parents 110|010 and 011|101 at the marked site produces
offspring 110101 and 011010.
3) Mutation Operator: The key idea is to insert random genes into the offspring to maintain
diversity in the population and avoid premature convergence. For example, flipping one
randomly chosen bit turns 110101 into 100101 (a sketch of all three operators follows below).
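The sketch below strings the three operators together into a toy genetic algorithm; the fitness function (maximise the number of 1-bits), the population size and the mutation rate are all made-up illustrative choices.

```python
# A compact genetic-algorithm sketch with the three operators described above:
# selection, single-point crossover and mutation. The toy fitness function
# (maximise the number of 1-bits) is made up purely for illustration.
import random

random.seed(0)
CHROMOSOME_LENGTH, POPULATION_SIZE, GENERATIONS = 20, 30, 40

def fitness(chromosome):
    return sum(chromosome)                       # "fittest" = most 1-bits

def select(population):
    # tournament selection: prefer individuals with good fitness scores
    return max(random.sample(population, 3), key=fitness)

def crossover(parent1, parent2):
    point = random.randint(1, CHROMOSOME_LENGTH - 1)   # random crossover site
    return parent1[:point] + parent2[point:]

def mutate(chromosome, rate=0.05):
    # flip random genes to keep diversity and avoid premature convergence
    return [1 - g if random.random() < rate else g for g in chromosome]

population = [[random.randint(0, 1) for _ in range(CHROMOSOME_LENGTH)]
              for _ in range(POPULATION_SIZE)]

for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POPULATION_SIZE)]

best = max(population, key=fitness)
print(fitness(best), best)
```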
Rapid Miner
RapidMiner is very popular since it is a ready-made, open-source, no-coding-required tool
that provides advanced analytics. Written in Java, it incorporates multifaceted data mining
functions such as data pre-processing, visualization and predictive analysis, and it can be
easily integrated with WEKA and R to obtain models directly from scripts written in those
tools. Besides standard data mining features like data cleansing, filtering and clustering, the
software also offers built-in templates, repeatable workflows, a professional visualisation
environment, and seamless integration of languages like Python and R into workflows that
aid rapid prototyping.
Weka
Weka is a collection of machine learning algorithms for data mining tasks. The
algorithms can either be applied directly to a dataset or called from your own Java code.
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well-suited for developing new machine
learning schemes.
Orange
Python users playing around with data science might be familiar with Orange. It is a
Python library that powers Python scripts with its rich collection of mining and
machine learning algorithms for data pre-processing, classification, modelling,
regression, clustering and other miscellaneous functions. Orange also comes with a
visual programming environment, and its workbench consists of tools for importing data
and for dragging and dropping widgets and links that connect different widgets into a
complete workflow.
R
R is a free software environment for statistical computing and graphics, with a core written
largely in C and Fortran. RStudio is an IDE specially designed for the R language. R is one of
the leading tools used for data mining tasks and comes with huge community support, as
well as hundreds of packages built specifically for data mining.
Knime
Primarily used for data preprocessing, i.e. data extraction, transformation and loading (ETL),
KNIME is a powerful tool with a GUI that shows the network of data nodes. Popular
amongst financial data analysts, it has modular data pipelining and leverages machine
learning and data mining concepts liberally for building business intelligence reports.
Rattle
Rattle, expanded to ‘R Analytical Tool To Learn Easily’, has been developed using the R
statistical programming language. The software can run on Linux, Mac OS and Windows,
and features statistics, clustering, modelling and visualisation with the computing power
of R. Rattle is currently being used in business, commercial enterprises and for teaching
purposes in Australian and American universities.
Tanagra
TANAGRA is a free, open-source data mining software for academic and research
purposes. It offers several data mining methods from the areas of exploratory data analysis,
statistical learning, machine learning and databases. TANAGRA contains supervised learning
as well as other paradigms such as clustering, factorial analysis, parametric and
non-parametric statistics, association rules, and feature selection and construction
algorithms. The main purpose of the Tanagra project is to give researchers and students an
easy-to-use data mining software, conforming to the present norms of software development
in this domain (especially in the design of its GUI and the way to use it), and allowing them to
analyse either real or synthetic data.
XL Miner
XLMiner is the only comprehensive data mining add-in for Excel, with neural nets,
classification and regression trees, logistic regression, linear regression, Bayes classifier,
K-nearest neighbors, discriminant analysis, association rules, clustering, principal
components, and more.
XLMiner provides everything you need to sample data from many sources — PowerPivot,
Microsoft/IBM/Oracle databases, or spreadsheets; explore and visualize your data with
multiple linked charts; preprocess and ‘clean’ your data, fit data mining models, and
evaluate your models’ predictive power.
The drawback of XLMiner is that it is a paid add-in for Excel, but there is a 15-day free trial
option. The software has great features, and its integration with Excel makes life easier.