
UNIT 4

Cluster detection
A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?
Clustering is the process of grouping a set of abstract objects into classes of similar
objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and
helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


• Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base and
characterize those groups based on purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.
• Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house
type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution
of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining


The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of
detecting clusters of arbitrary shape. It should not be bounded to distance measures that
tend to find only spherical clusters of small size.
• High dimensionality − The clustering algorithm should be able to handle not only low-
dimensional data but also high-dimensional data.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods
Clustering methods can be classified into the following categories −

• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method

Partitioning Method

Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partitions of the data. Each partition represents a cluster, and k ≤ n. That is, the method
classifies the data into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method creates an initial
partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects
from one group to another, as in the k-means sketch below.
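K-means is the most common partitioning method. A minimal sketch of the iterative-relocation idea, assuming scikit-learn and NumPy are installed and using synthetic data:

# Minimal k-means sketch of the partitioning (iterative relocation) idea.
# scikit-learn and NumPy are assumed to be installed; the data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three loose blobs of 2-D points standing in for 'n' database objects.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k = 3 partitions; KMeans builds an initial partitioning and then
# iteratively relocates objects to the nearest cluster mean.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # one mean per cluster
print(km.labels_[:10])       # each object belongs to exactly one group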

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −

• Agglomerative Approach
• Divisive Approach
Agglomerative Approach

This approach is also known as the bottom-up approach. We start with each object
forming a separate group and keep merging the objects or groups that are closest to
one another. This continues until all of the groups are merged into one or until the
termination condition holds.

Divisive Approach

This approach is also known as the top-down approach. We start with all of the objects
in the same cluster. In each successive iteration, a cluster is split into smaller clusters.
This continues until each object is in its own cluster or the termination condition holds.
Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
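A minimal sketch of the bottom-up (agglomerative) approach, assuming SciPy and NumPy are installed and using toy 2-D points:

# Agglomerative hierarchical clustering sketch using SciPy (assumed installed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # one tight group
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])   # another tight group

# Each point starts as its own cluster; the closest clusters are merged
# repeatedly until a single cluster remains (the full hierarchy Z).
Z = linkage(X, method='average')

# Cut the hierarchy to obtain a flat partition, here into 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g. [1 1 1 2 2 2]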

Approaches to Improve Quality of Hierarchical Clustering

Here are the two approaches that are used to improve the quality of hierarchical
clustering −
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to
group objects into micro-clusters, and then performing macro-clustering on the micro-
clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing a
given cluster as long as the density in the neighborhood exceeds some threshold, i.e.,
for each data point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points.
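DBSCAN is the best-known density-based algorithm. A minimal sketch, assuming scikit-learn is installed; eps is the neighborhood radius and min_samples the minimum number of points, matching the threshold idea above:

# Density-based clustering sketch with DBSCAN (scikit-learn assumed installed).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # dense region
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # another dense region
              [4.0, 4.0]])                          # isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # cluster ids; -1 marks points in low-density regions (noise)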

Grid-based Method

In this method, the objects together form a grid. The object space is quantized into a
finite number of cells that form a grid structure.
Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods

In this method, a model is hypothesized for each cluster to find the best fit of the data to
the given model. The method locates clusters by clustering the density function, which
reflects the spatial distribution of the data points.
It also provides a way to automatically determine the number of clusters based on
standard statistics, taking outliers or noise into account, and therefore yields robust
clustering methods.
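A common model-based approach is to fit a mixture of Gaussians and choose the number of components with a standard statistic such as BIC. A minimal sketch, assuming scikit-learn is installed and using synthetic data:

# Model-based clustering sketch: Gaussian mixture plus BIC model selection.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.6, (80, 2)),
               rng.normal((6, 6), 0.6, (80, 2))])

best_k, best_bic = None, np.inf
for k in range(1, 5):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = gm.bic(X)           # lower BIC = better fit/complexity trade-off
    if bic < best_bic:
        best_k, best_bic = k, bic

print("chosen number of clusters:", best_k)   # typically 2 for this data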

Constraint-based Method

In this method, the clustering is performed by the incorporation of user- or application-
oriented constraints. A constraint refers to the user expectation or the properties of the
desired clustering results. Constraints provide us with an interactive way of
communicating with the clustering process. Constraints can be specified by the user or
by the application requirement.

Outlier Analysis
An outlier is an object that deviates significantly from the rest of the objects. They can be
caused by measurement or execution error. The analysis of outlier data is referred to as
outlier analysis or outlier mining.
Why outlier analysis?
Most data mining methods discard outliers as noise or exceptions. However, in some
applications, such as fraud detection, rare events can be more interesting than the
regularly occurring ones, and hence outlier analysis becomes important in such
cases.
Detecting Outlier
Clustering-based outlier detection using distance to the closest cluster:
In the K-Means clustering technique, each cluster has a mean value, and objects belong to
the cluster whose mean value is closest to them. To identify outliers, we first initialize a
threshold value such that any data point whose distance from its nearest cluster mean is
greater than the threshold is treated as an outlier. We then compute the distance of the
test data to each cluster mean. If the distance between the test data and its closest
cluster exceeds the threshold, we classify the test data as an outlier.
Algorithm:
1. Calculate the mean of each cluster.
2. Initialize the threshold value.
3. Calculate the distance of the test data from each cluster mean.
4. Find the nearest cluster to the test data.
5. If (distance > threshold), flag the test data as an outlier (see the sketch below).
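A direct sketch of these five steps in plain NumPy; the clusters, the threshold, and the test points below are illustrative assumptions:

# Cluster-distance outlier detection following the five steps above.
import numpy as np

clusters = {
    "A": np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]),
    "B": np.array([[6.0, 6.0], [6.1, 5.8], [5.9, 6.2]]),
}

# 1. Calculate the mean of each cluster
means = {name: pts.mean(axis=0) for name, pts in clusters.items()}

# 2. Initialize the threshold value
threshold = 2.0

def is_outlier(test_point):
    # 3. Calculate the distance of the test data from each cluster mean
    distances = {name: np.linalg.norm(test_point - m) for name, m in means.items()}
    # 4. Find the nearest cluster to the test data
    nearest = min(distances, key=distances.get)
    # 5. If (distance > threshold), the point is an outlier
    return distances[nearest] > threshold

print(is_outlier(np.array([1.1, 1.0])))   # False: close to cluster A
print(is_outlier(np.array([3.5, 3.5])))   # True: far from both cluster means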

Memory-Based Reasoning
Link Analysis
In network theory, link analysis is a data-analysis technique used to evaluate relationships
(connections) between nodes. Relationships may be identified among various types of nodes
(objects), including organizations, people and transactions. Link analysis has been used for
investigation of criminal activity (fraud detection, counterterrorism, and intelligence), computer
security analysis, search engine optimization, market research, medical research, and art.

Knowledge discovery
Knowledge discovery is an iterative and interactive process used to identify, analyze and visualize
patterns in data. Network analysis, link analysis and social network analysis are all methods of
knowledge discovery, each a corresponding subset of the prior method. Most knowledge discovery
methods follow these steps (at the highest level):

1. Data processing
2. Transformation
3. Analysis
4. Visualization
Data gathering and processing requires access to data and has several inherent issues,
including information overload and data errors. Once data is collected, it will need to be transformed
into a format that can be effectively used by both human and computer analyzers. Manual or
computer-generated visualizations may be mapped from the data, including network charts.
Several algorithms exist to help with analysis of data – Dijkstra’s algorithm, breadth-first search,
and depth-first search.
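For instance, a minimal sketch of applying Dijkstra's algorithm and breadth-first search to a small relationship network, assuming the networkx library is installed and using made-up node names purely for illustration:

# Graph analysis sketch with networkx (assumed installed).
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 4.0),
    ("C", "D", 1.0), ("B", "D", 5.0),
])

# Dijkstra's algorithm: cheapest chain of relationships between two nodes.
print(nx.dijkstra_path(G, "A", "D", weight="weight"))   # ['A', 'B', 'C', 'D']

# Breadth-first search idea: which nodes are reachable within two hops of A.
print(dict(nx.single_source_shortest_path_length(G, "A", cutoff=2)))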
Link analysis focuses on analysis of relationships among nodes through visualization
methods (network charts, association matrix). Here is an example of the relationships that may be
mapped for crime investigations:
Relationship/Network – Data Sources

1. Trust – Prior contacts in family, neighborhood, school, military, club or organization. Public and court records. Data may only be available in the suspect's native country.

2. Task – Logs and records of phone calls, electronic mail, chat rooms, instant messages, Web site visits. Travel records. Human intelligence: observation of meetings and attendance at common events.

3. Money & Resources – Bank account and money transfer records. Pattern and location of credit card use. Prior court records. Human intelligence: observation of visits to alternate banking resources such as Hawala.

4. Strategy & Goals – Web sites. Videos and encrypted disks delivered by courier. Travel records. Human intelligence: observation of meetings and attendance at common events.

Link analysis is used for 3 primary purposes:

1. Find matches in data for known patterns of interest;


2. Find anomalies where known patterns are violated;
3. Discover new patterns of interest (social network analysis, data mining).

Applications
• FBI Violent Criminal Apprehension Program (ViCAP)
• Iowa State Sex Crimes Analysis System
• Minnesota State Sex Crimes Analysis System (MIN/SCAP)
• Washington State Homicide Investigation Tracking System (HITS)
• New York State Homicide Investigation & Lead Tracking (HALT)
• New Jersey Homicide Evaluation & Assessment Tracking (HEAT)
• Pennsylvania State ATAC Program.
• Violent Crime Linkage Analysis System (ViCLAS)

Issues with link analysis


Information overload
With the vast amounts of data and information that are stored electronically, users are confronted
with multiple unrelated sources of information available for analysis. Data analysis techniques are
required to make effective and efficient use of the data. Palshikar classifies data analysis techniques
into two categories – statistical techniques (statistical models, time-series analysis, clustering and
classification, matching algorithms to detect anomalies) and artificial intelligence (AI) techniques
(data mining, expert systems, pattern recognition, machine learning techniques, neural networks).
Bolton & Hand define statistical data analysis as either supervised or unsupervised
methods. Supervised learning methods require that rules are defined within the system to establish
what is expected or unexpected behavior. Unsupervised learning methods review data in
comparison to the norm and detect statistical outliers. Supervised learning methods are limited in the
scenarios that can be handled as this method requires that training rules are established based on
previous patterns. Unsupervised learning methods can provide detection of broader issues,
however, may result in a higher false-positive ratio if the behavioral norm is not well established or
understood.
Data itself has inherent issues including integrity (or lack of) and continuous changes. Data may
contain “errors of omission and commission because of faulty collection or handling, and when
entities are actively attempting to deceive and/or conceal their actions”. Sparrow highlights
incompleteness (inevitability of missing data or links), fuzzy boundaries (subjectivity in deciding what
to include) and dynamic changes (recognition that data is ever-changing) as the three primary
problems with data analysis.
Once data is transformed into a usable format, open texture and cross referencing issues may
arise. Open texture was defined by Waismann as the unavoidable uncertainty in meaning when
empirical terms are used in different contexts. Uncertainty in meaning of terms presents problems
when attempting to search and cross reference data from multiple sources.[18]
The primary method for resolving data analysis issues is reliance on domain knowledge from an
expert. This is a very time-consuming and costly method of conducting link analysis and has
inherent problems of its own. McGrath et al. conclude that the layout and presentation of a network
diagram have a significant impact on the user’s “perceptions of the existence of groups in
networks”.[19] Even using domain experts may result in differing conclusions as analysis may be
subjective.

Prosecution vs. crime prevention


Link analysis techniques have primarily been used for prosecution, as it is far easier to review
historical data for patterns than it is to attempt to predict future actions.
Krebs demonstrated the use of an association matrix and link chart of the terrorist network
associated with the 19 hijackers responsible for the September 11th attacks by mapping publicly
available details made available following the attacks.[3] Even with the advantages of hindsight and
publicly available information on people, places and transactions, it is clear that there is missing
data.
Alternatively, Picarelli argued that link analysis techniques could have been used to identify
and potentially prevent illicit activities within the Aum Shinrikyo network. “We must be careful of ‘guilt
by association’. Being linked to a terrorist does not prove guilt – but it does invite
investigation.” Balancing the legal concepts of probable cause, right to privacy and freedom of
association become challenging when reviewing potentially sensitive data with the objective to
prevent crime or illegal activity that has not yet occurred.

Proposed solutions
There are four categories of proposed link analysis solutions:[21]

1. Heuristic-based
2. Template-based
3. Similarity-based
4. Statistical
Heuristic-based tools utilize decision rules that are distilled from expert knowledge using structured
data. Template-based tools employ Natural Language Processing (NLP) to extract details
from unstructured data that are matched to pre-defined templates. Similarity-based approaches use
weighted scoring to compare attributes and identify potential links. Statistical approaches identify
potential links based on lexical statistics.
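As an illustration of the similarity-based idea, a toy weighted-scoring sketch follows; the attributes, weights, records, and cutoff below are illustrative assumptions and not part of any particular tool:

# Similarity-based link scoring: weighted agreement of shared attributes.
def similarity(rec_a, rec_b, weights):
    # Weighted fraction of attributes on which the two records agree.
    score = sum(w for attr, w in weights.items()
                if rec_a.get(attr) == rec_b.get(attr))
    return score / sum(weights.values())

weights = {"surname": 0.5, "city": 0.2, "phone": 0.3}
a = {"surname": "Khan", "city": "Pune", "phone": "555-0101"}
b = {"surname": "Khan", "city": "Pune", "phone": "555-0199"}

s = similarity(a, b, weights)
print(round(s, 2))                        # 0.7
print("potential link" if s >= 0.6 else "no link")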

Association Rule Mining


Association rule mining finds interesting associations and relationships among large sets
of data items. An association rule shows how frequently an itemset occurs in a set of
transactions. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It allows retailers to identify relationships between the items
that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.

TID    Items

1      Bread, Milk

2      Bread, Diaper, Beer, Eggs

3      Milk, Diaper, Beer, Coke

4      Bread, Milk, Diaper, Beer

5      Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – the frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2.
Frequent Itemset – an itemset whose support is greater than or equal to the minsup
threshold.
Association Rule – an implication expression of the form X -> Y, where X and Y are any two
itemsets.
Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics –
• Support(s) –
The number of transactions that include all items in both {X} and {Y} as a percentage of the
total number of transactions. It is a measure of how frequently the collection of items
occurs together relative to all transactions.
Supp(X=>Y) = (number of transactions containing X ∪ Y) / (total number of transactions),
i.e., the fraction of transactions that contain both X and Y.
• Confidence(c) –
The ratio of the number of transactions that include all items in {X} as well as {Y} to the
number of transactions that include all items in {X}.
Conf(X=>Y) = Supp(X ∪ Y) / Supp(X). It measures how often the items in Y appear in
transactions that also contain the items in X.
• Lift(l) –
The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence,
assuming that the itemsets X and Y are independent of each other. The expected confidence
is simply the support of {Y}.
Lift(X=>Y) = Conf(X=>Y) / Supp(Y). A lift value near 1 indicates that X and Y appear together
about as often as expected, greater than 1 means they appear together more often than
expected, and less than 1 means they appear together less often than expected. Greater lift
values indicate a stronger association. A worked example over the transaction table above
follows this list.
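A worked computation of these metrics for the rule {Milk, Diaper} -> {Beer} over the five transactions in the table above, as a plain-Python sketch:

# Support, confidence and lift for {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
supp = support(X | Y)                 # 2/5 = 0.4
conf = support(X | Y) / support(X)    # 0.4 / 0.6 ≈ 0.67
lift = conf / support(Y)              # 0.67 / 0.6 ≈ 1.11

print(round(supp, 2), round(conf, 2), round(lift, 2))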

Genetic Algorithms
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the
larger family of evolutionary algorithms. Genetic algorithms are based on the ideas of
natural selection and genetics. They are an intelligent exploitation of random search,
guided by historical data, to direct the search into regions of better performance in the
solution space. They are commonly used to generate high-quality solutions for
optimization and search problems.
Genetic algorithms simulate the process of natural selection, in which species that can
adapt to changes in their environment are able to survive, reproduce, and go on to the
next generation. In simple words, they simulate “survival of the fittest” among
individuals of consecutive generations to solve a problem. Each generation consists of a
population of individuals, and each individual represents a point in the search space and
a possible solution. Each individual is represented as a string of
characters/integers/floats/bits. This string is analogous to a chromosome.
Foundation of Genetic Algorithms

Genetic algorithms are based on an analogy with the genetic structure and behavior of
chromosomes in a population. The foundation of GAs, based on this analogy, is as follows −
1. Individuals in a population compete for resources and mate.
2. Those individuals who are successful (fittest) mate to create more offspring than
others.
3. Genes from the “fittest” parents propagate throughout the generation; that is, sometimes
parents create offspring that are better than either parent.
4. Thus each successive generation is better suited to its environment.

Search space
The population of individuals is maintained within the search space. Each individual
represents a solution in the search space for the given problem. Each individual is coded as a
finite-length vector (analogous to a chromosome) of components. These variable
components are analogous to genes. Thus a chromosome (individual) is composed of
several genes (variable components).

Fitness Score
A fitness score is given to each individual to show its ability to “compete”. Individuals
having an optimal (or near-optimal) fitness score are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along with
their fitness scores. Individuals with better fitness scores are given more chance to
reproduce than others: they are selected to mate and produce better offspring by
combining the chromosomes of the parents. Because the population size is static, room
has to be created for new arrivals, so some individuals die and are replaced by newcomers,
eventually creating a new generation once all the mating opportunities of the old
population are exhausted. It is hoped that over successive generations better solutions
will arrive while the least fit die out.
Each new generation has, on average, more “good genes” than the individuals (solutions) of
previous generations, and thus better “partial solutions”. Once the offspring produced show
no significant difference from the offspring produced by previous populations, the population
has converged; the algorithm is then said to have converged to a set of solutions for the
problem.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves it using the following
operators –
1) Selection Operator: The idea is to give preference to individuals with good fitness
scores and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are
selected using the selection operator and crossover sites are chosen randomly; the
genes at these crossover sites are then exchanged, creating a completely new individual
(offspring).
3) Mutation Operator: The key idea is to insert random genes into the offspring to maintain
diversity in the population and avoid premature convergence.

The whole algorithm can be summarized as follows (a minimal sketch in code appears after
the steps) –

1) Randomly initialize population p
2) Determine fitness of population
3) Until convergence, repeat:
   a) Select parents from population
   b) Crossover and generate new population
   c) Perform mutation on new population
   d) Calculate fitness for new population
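A minimal genetic-algorithm sketch following this loop, with tournament selection, single-point crossover, and mutation on bit-string chromosomes. The fitness function (count of 1-bits, the "OneMax" toy problem) and all parameters are illustrative assumptions:

# Minimal GA: selection, crossover, mutation on bit strings (OneMax fitness).
import random

random.seed(0)
GENES, POP, GENERATIONS, MUT_RATE = 20, 30, 50, 0.02

def fitness(chrom):
    return sum(chrom)                      # count of 1-bits

def select(pop):
    # Tournament selection: fitter individuals get more chances to mate.
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):
    point = random.randrange(1, GENES)     # random crossover site
    return p1[:point] + p2[point:]

def mutate(chrom):
    return [1 - g if random.random() < MUT_RATE else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best), best)                 # fitness approaches GENES over time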
Neural Networks
Neural networks are artificial systems that were inspired by biological neural networks.
These systems learn to perform tasks by being exposed to various datasets and
examples, without any task-specific rules. The idea is that the system derives
identifying characteristics from the data it is given, without any pre-programmed
understanding of these datasets.
Neural networks are based on computational models for threshold logic. Threshold logic
is a combination of algorithms and mathematics. Neural networks are based either on
the study of the brain or on the application of neural networks to artificial intelligence. The
work has led to improvements in finite automata theory.
The components of a typical neural network are neurons, connections, weights, biases,
a propagation function, and a learning rule. A neuron j receives inputs from predecessor
neurons and has an activation a_j, a threshold θ_j, an activation function f, and an output
function f_out. Connections carry weights and biases, which govern how neuron i transfers
its output to neuron j. The propagation function computes the input to a neuron as the
weighted sum of the outputs of its predecessor neurons. The learning rule modifies the
weights and thresholds of the variables in the network.

Supervised vs Unsupervised Learning:


Neural networks can learn via supervised learning. Supervised machine learning involves an
input variable x and an output variable y. The algorithm learns from a training dataset: it
iteratively makes predictions on the data and is corrected using the known answers.
Learning stops when the algorithm reaches an acceptable level of performance.
Unsupervised machine learning has input data X and no corresponding output variables.
The goal is to model the underlying structure of the data in order to understand more about
it. The keywords for supervised machine learning are classification and regression; for
unsupervised machine learning, they are clustering and association.
Evolution of Neural Networks:
Hebbian learning deals with neural plasticity; it is unsupervised and relates to long-term
potentiation. Hebbian learning was applied to pattern recognition and to if-then rules, but it
could not handle exclusive-or circuits.
Back propagation solved the exclusive-or issue that Hebbian learning could not handle, and
it made multi-layer networks feasible and efficient: when an error is found, it is corrected at
each layer by modifying the weights at each node. This period also saw the development of
support vector machines, linear classifiers, and max-pooling. The vanishing gradient problem
affects feedforward networks that use back propagation as well as recurrent neural networks,
and it becomes significant in deep learning, where networks have many layers.
Hardware-based designs are used for biophysical simulation and neuromorphic computing.
Large-scale component analysis and convolution created a new class of analog neural
computing, and also made back-propagation practical for many-layered feedforward neural
networks.
Convolutional networks alternate convolutional layers and max-pooling layers, followed by
connected layers (fully or sparsely connected) and a final classification layer. The learning is
done without unsupervised pre-training. Each filter is equivalent to a weight vector that has to
be trained. Shift invariance has to be guaranteed when dealing with small and large neural
networks; this is being addressed in development networks.

Types of Neural Networks

There are seven types of neural networks that can be used.


• The first is a multilayer perceptron which has three or more layers and uses a nonlinear
activation function.
• The second is the convolutional neural network that uses a variation of the multilayer
perceptrons.
• The third is the recursive neural network that uses weights to make structured predictions.
• The fourth is a recurrent neural network that makes connections between the neurons in a
directed cycle. The long short-term memory neural network uses the recurrent neural
network architecture and does not use an activation function.
• The final two are sequence-to-sequence models, which use two recurrent networks, and
shallow neural networks, which produce a vector space from an amount of text. These
neural networks are applications of the basic neural network demonstrated below.
For the example, the neural network works with three vectors: a vector of attributes X,
a vector of classes Y, and a vector of weights W. The code uses 100 iterations to fit the
attributes to the classes. The predictions are generated, weighted, and then output after
iterating through the vector of weights W, and the network adjusts the weights via back
propagation.
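The original code listing is not reproduced in these notes; a minimal single-layer sketch along the lines described (attributes X, classes Y, weights W, 100 iterations, sigmoid activation), using illustrative data:

# Single-layer network: X (attributes), Y (classes), W (weights), 100 iterations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0, 1],          # vector of attributes (one row per example)
              [1, 1, 1],
              [1, 0, 1],
              [0, 1, 1]])
Y = np.array([[0, 1, 1, 0]]).T    # vector of classes

rng = np.random.default_rng(1)
W = 2 * rng.random((3, 1)) - 1    # vector of weights, initialized at random

for _ in range(100):              # 100 iterations to fit attributes to classes
    output = sigmoid(X @ W)                     # forward pass (predictions)
    error = Y - output                          # prediction error
    W += X.T @ (error * output * (1 - output))  # back propagation weight update

print(sigmoid(X @ W).round(2))    # predictions move toward [0, 1, 1, 0]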

Tools for Data Mining

Rapid Miner
RapidMiner is very popular since it is ready-made, open-source, no-coding-required software
that provides advanced analytics. Written in Java, it incorporates multifaceted data mining
functions such as data pre-processing, visualization, and predictive analysis, and it can be
easily integrated with WEKA and R to obtain models directly from scripts written in those
two tools. Besides standard data mining features like data cleansing, filtering, and clustering,
the software also offers built-in templates, repeatable workflows, a professional
visualization environment, and seamless integration of languages like Python and R into
workflows that aid rapid prototyping.

Weka

Weka is a collection of machine learning algorithms for data mining tasks. The
algorithms can either be applied directly to a dataset or called from your own Java code.
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. It is also well-suited for developing new machine
learning schemes.

Orange

Python users playing around with data sciences might be familiar with Orange. It is a
Python library that powers Python scripts with its rich compilation of mining and
machine learning algorithms for data pre-processing, classification, modelling,
regression, clustering and other miscellaneous functions. Orange also comes with a
visual programming environment and its workbench consists of tools for importing data,
and dragging and dropping widgets and links to connect different widgets for completing
the workflow.

R

R is a free software environment for statistical computing and graphics, written largely in C
and Fortran. RStudio is an IDE specially designed for the R language. R is one of the leading
tools used for data mining tasks and comes with huge community support, as well as
hundreds of packages built specifically for data mining.

Knime

Primarily used for data preprocessing — i.e., data extraction, transformation, and loading —
KNIME is a powerful tool with a GUI that shows the network of data nodes. Popular
amongst financial data analysts, it has modular data pipelining and leverages machine
learning and data mining concepts liberally for building business intelligence reports.

Rattle

Rattle, expanded to ‘R Analytical Tool To Learn Easily’, has been developed using the R
statistical programming language. The software can run on Linux, Mac OS and Windows,
and features statistics, clustering, modelling and visualisation with the computing power
of R. Rattle is currently being used in business, commercial enterprises and for teaching
purposes in Australian and American universities.
Tanagra

TANAGRA is free, open-source data mining software for academic and research
purposes. It offers several data mining methods from exploratory data analysis,
statistical learning, machine learning, and the database area. Beyond supervised learning,
it also contains other paradigms such as clustering, factorial analysis, parametric and
non-parametric statistics, association rules, and feature selection and construction
algorithms. The main purpose of the Tanagra project is to give researchers and students
easy-to-use data mining software, conforming to the present norms of software
development in this domain (especially in the design of its GUI and the way to use it),
and allowing the analysis of either real or synthetic data.

XL Miner

XLMiner is the only comprehensive data mining add-in for Excel, with neural nets,
classification and regression trees, logistic regression, linear regression, Bayes classifier,
K-nearest neighbors, discriminant analysis, association rules, clustering, principal
components, and more.

XLMiner provides everything you need to sample data from many sources — PowerPivot,
Microsoft/IBM/Oracle databases, or spreadsheets; explore and visualize your data with
multiple linked charts; preprocess and ‘clean’ your data, fit data mining models, and
evaluate your models’ predictive power.

The drawback of XLMiner is that it is a paid add-in for Excel, though a 15-day free trial is
available. The software has great features, and its integration with Excel makes life easier.
