4) Normalization
• Upper or lowercasing
• Stopword removal
• Stemming – bluntly removing prefixes and suffixes from a word
• Lemmatization – replacing a word with its dictionary root (lemma)
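A minimal sketch of these normalization steps in Python with NLTK (assuming the stopwords and wordnet corpora can be downloaded; the sample sentence is made up for illustration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time corpus downloads (assumes network access is available)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

text = "The striped bats are hanging on their feet"

tokens = text.lower().split()                        # lowercasing + naive tokenization

stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]  # stopword removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # stemming: roughly ['stripe', 'bat', 'hang', 'feet']
print([lemmatizer.lemmatize(t) for t in tokens])  # lemmatization: roughly ['striped', 'bat', 'hanging', 'foot']
```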
3) BoW and TF-IDF For Creating Features from Text
• Consider three movie reviews; we understand each sentence in a fraction of a second:
• Review 1: This movie is very scary and long
• Review 2: This movie is not scary and is slow
• Review 3: This movie is spooky and good
• But machines simply cannot process text data in raw form. They need us to break down the text into a
numerical format that’s easily readable by the machine.
• This is where the two concepts come into play
• Bag-of-Words (BoW) and
• Term Frequency-Inverse Document Frequency (TF-IDF).
• Both BoW and TF-IDF are techniques that help us convert text sentences into numeric vectors.
3) Bag-of-Words (BoW)
• We will first build a vocabulary from all the unique words in the three reviews, which consists of these 11 words:
• ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’, ‘slow’, ‘spooky’, ‘good’.
• We can now take each of these words and mark how many times they occur in each of the three movie reviews above.
• This will give us 3 vectors for the 3 reviews as a vector representation (using the vocabulary order above):
Review 1: This movie is very scary and long → [1 1 1 1 1 1 1 0 0 0 0]
Review 2: This movie is not scary and is slow → [1 1 2 0 1 1 0 1 1 0 0]
Review 3: This movie is spooky and good → [1 1 1 0 0 1 0 0 0 1 1]
• Example: We can calculate the IDF values for all the words in Review 2:
• IDF('this') = log(number of documents / number of documents containing the word 'this') = log(3/3) = 0
• IDF('movie') = log(3/3) = 0
• IDF('is') = log(3/3) = 0
• IDF('not') = log(3/1) = 0.48
• IDF('scary') = log(3/2) = 0.18
• IDF('and') = log(3/3) = 0
• IDF('slow') = log(3/1) = 0.48
✓ Hence, we see that words like 'is', 'this', 'and', etc. are reduced to 0 and have little importance,
✓ while words like 'scary', 'long', 'good', etc. carry more importance.

Term counts and IDF values for all the words:

Term     Review 1   Review 2   Review 3   IDF
This        1          1          1       0.00
movie       1          1          1       0.00
is          1          2          1       0.00
very        1          0          0       0.48
scary       1          1          0       0.18
and         1          1          1       0.00
long        1          0          0       0.48
not         0          1          0       0.48
slow        0          1          0       0.48
spooky      0          0          1       0.48
good        0          0          1       0.48
• Similarly, we can calculate the TF-IDF scores for all the words with respect to all the reviews. For Review 2:
• TF-IDF('movie', Review 2) = 1/8 * 0 = 0
• TF-IDF('is', Review 2) = 1/4 * 0 = 0
• TF-IDF('not', Review 2) = 1/8 * 0.48 = 0.06
• TF-IDF('scary', Review 2) = 1/8 * 0.18 = 0.022
• TF-IDF('and', Review 2) = 1/8 * 0 = 0
• TF-IDF('slow', Review 2) = 1/8 * 0.48 = 0.06

TF-IDF scores for all the words with respect to all the reviews:

Term     Review 1   Review 2   Review 3   IDF    TF-IDF (Review 1)   TF-IDF (Review 2)   TF-IDF (Review 3)
This        1          1          1       0.00        0.000               0.000               0.000
movie       1          1          1       0.00        0.000               0.000               0.000
is          1          2          1       0.00        0.000               0.000               0.000
very        1          0          0       0.48        0.068               0.000               0.000
scary       1          1          0       0.18        0.025               0.022               0.000
and         1          1          1       0.00        0.000               0.000               0.000
long        1          0          0       0.48        0.068               0.000               0.000
not         0          1          0       0.48        0.000               0.060               0.000
slow        0          1          0       0.48        0.000               0.060               0.000
spooky      0          0          1       0.48        0.000               0.000               0.080
good        0          0          1       0.48        0.000               0.000               0.080

• Words with a higher score are more important, and
• those with a lower score are less important.
• TF-IDF gives larger values for less frequent words.
• It also gives a large value when a word is frequent in a single document but rare in all the documents combined,
• which means both its TF and IDF values are high.
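The same BoW and TF-IDF features can be reproduced in a few lines with scikit-learn; a minimal sketch on the three reviews above (note that TfidfVectorizer uses a smoothed natural-log IDF and L2-normalizes each row, so the exact numbers differ from the hand calculation, while the relative importance of the words stays the same):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = [
    "This movie is very scary and long",        # Review 1
    "This movie is not scary and is slow",      # Review 2
    "This movie is spooky and good",            # Review 3
]

# Bag-of-Words: each review becomes a vector of term counts
bow = CountVectorizer()
print(bow.fit_transform(reviews).toarray())
print(bow.get_feature_names_out())

# TF-IDF: words that occur in every review ('this', 'movie', 'is', 'and') get
# low weights; review-specific words ('long', 'slow', 'spooky') get high weights
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(reviews).toarray().round(2))
```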
4) Dimensionality Reduction
• The number of input features, variables, or columns present in a given dataset is known as dimensionality, and
• the process to reduce these features is called dimensionality reduction.
• Handling the high-dimensional data is very difficult in practice, commonly known as
• the curse of dimensionality.
• If the machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor
performance.
• Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.
• Some benefits of applying dimensionality reduction techniques to a given dataset are given below:
• By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
• Less computation/training time is required for reduced dimensions of features.
• Reduced dimensions of features of the dataset help in visualizing the data quickly.
• It removes the redundant features (if present) by taking care of multi-collinearity.
4.1) Techniques for Dimensionality Reduction
• Dimensionality reduction is accomplished based on either feature selection or feature extraction.
• Feature selection is based on omitting those features from the available measurements which do not
contribute to class separability. In other words, redundant and irrelevant features are ignored.
a) Variance Thresholds
b) Correlation Thresholds
c) Genetic Algorithms
d) Stepwise Regression- This has two types: forward and backward.
• Feature extraction: Feature extraction creates a new, smaller set of features that still captures
most of the useful information. This can be done with supervised (e.g. LDA) and unsupervised (e.g.
PCA) methods.
a) Principal Component Analysis (PCA)
b) Linear Discriminant Analysis (LDA)
4.2) Feature selection:
a) Variance Thresholds: This technique computes the variance of a given feature across the observations; then,
• if the feature's variance does not exceed a given threshold (i.e., its value barely changes from one observation to another),
• that feature is removed.
b) Correlation Thresholds: We first calculate all pair-wise correlations. Then, if
• the correlation between a pair of features is above a given threshold,
• we remove the one that has the larger mean absolute correlation with the other features.
• Like the previous technique, this is based on intuition, and hence the burden of tuning the thresholds in
such a way that useful information is not neglected falls upon the user (see the sketch at the end of this subsection).
• For these reasons, algorithms with built-in feature selection, or algorithms like PCA (Principal
Component Analysis), are preferred over this one.
c) Genetic Algorithms: They are search algorithms that are inspired by evolutionary biology and natural
selection, combining mutation and cross-over to efficiently traverse large solution spaces.
• Genetic Algorithms are used to find an optimal binary vector, where each bit is associated with a feature.
✓ If the bit of this vector equals 1, then the feature is allowed to participate in classification.
✓ If the bit is a 0, then the corresponding feature does not participate.
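A minimal sketch of the variance-threshold (a) and correlation-threshold (b) ideas using scikit-learn and pandas; the column names and threshold values are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "almost_constant": np.ones(100) + rng.normal(0, 0.001, 100),
    "x1": rng.normal(0, 1, 100),
})
df["x2"] = 0.95 * df["x1"] + rng.normal(0, 0.1, 100)   # nearly a copy of x1

# (a) variance threshold: drop features whose variance is below the cut-off
vt = VarianceThreshold(threshold=0.01).fit(df)
kept = list(df.columns[vt.get_support()])
print("kept after variance threshold:", kept)          # drops 'almost_constant'

# (b) correlation threshold: for each highly correlated pair, drop the feature
# with the larger mean absolute correlation to the remaining features
corr = df[kept].corr().abs()
to_drop = set()
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if corr.iloc[i, j] > 0.9:
            a, b = corr.columns[i], corr.columns[j]
            to_drop.add(a if corr[a].mean() > corr[b].mean() else b)
print("dropped by correlation threshold:", to_drop)
```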
4.3) Feature selection…
d) Stepwise Regression: This is a greedy algorithm and commonly has lower performance than supervised
methods such as regularization.
• This has two types: forward and backward.
• For forward stepwise search, we start without any features. Then,
• We train a 1-feature model using each of our candidate features and keep the version with the best performance.
• We would continue adding features, one at a time, until our performance improvements stall.
• Backward stepwise search is the same process, just reversed:
• start with all features in our model and
• then remove one at a time until performance starts to drop substantially.
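Scikit-learn's SequentialFeatureSelector implements essentially this greedy search in both directions; a minimal sketch (the data set, estimator, and the number of features to keep are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# forward search: start with no features, add the one that helps most, repeat
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",     # "backward" starts from all features instead
    cv=5,
)
sfs.fit(X, y)
print(list(X.columns[sfs.get_support()]))
```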
4.4) Feature selection: Example
A fitness level prediction based on three independent variables is used to show how forward feature selection works. The data set is:

ID   Calories_burnt   Gender   Plays_Sport?   Fitness_Level
1         121            M         Yes            Fit
2         230            M         No             Fit
3         342            F         No             Unfit
4          70            M         Yes            Fit
5         278            F         Yes            Unfit
6         146            M         Yes            Fit
7         168            F         No             Unfit
8         231            F         Yes            Fit
9         150            M         No             Fit
10        190            F         No             Fit

• So, the first step in Forward Feature Selection is
• to train n models and judge how well they work by looking at each feature on its own.
• So, if we have three independent variables, we'll train three models, one for each of these three features.
• Let's say we trained the model using the Calories_burnt feature and the Fitness_Level target variable and
• got an accuracy of 87%.
• We'll next use the Gender feature to train the model, and
• we acquire an accuracy of 80%.
• And similarly, the Plays_Sport variable gives us an accuracy of 85%.
✓ At this point, we are going to select the variable that produced the most favourable results.
✓ When the three models were compared, the winner was, unsurprisingly,
✓ the number of calories burned.
✓ As a direct result of this, we will select this variable.
4.5) Feature selection: Example conti…
• The next thing we'll do is repeat the previous steps, but this time we'll just add a single variable at
a time.
• Because of this, it makes perfect sense for us to retain the Calories Burned variable as we
proceed to add variables one at a time.
• Consequently, if we add Gender to Calories_burnt, we get an accuracy of 88%.
• We acquire 91% accuracy when we combine Plays_Sport with Calories_burnt.
• As a result, we will keep Plays_Sport and use it in our model.
• We will keep repeating the process till all the features are considered in improving the model
performance
4.6) Feature Extraction: Principal Component Analysis (PCA)
• PCA is a dimensionality reduction technique that
• identifies important relationships in our data,
• transforms the existing data based on these relationships, and then
• quantifies the importance of these relationships so we can keep the most important relationships.
Objectives of PCA:
1. Reduces attribute space: it is basically an unsupervised procedure that does not use the dependent (target) variable:
• From a large number of variables to a smaller number of factors.
• But there is no guarantee that the dimension is interpretable.
2. Identifying patterns: PCA can help identify patterns or relationships between variables.
3. Feature extraction: PCA can be used to extract features from a set of variables
• that are more informative or relevant than the original variables.
4. Data compression: PCA can be used to compress large datasets by
reducing the number of variables
• while retaining as much information as possible.
5. Noise reduction: PCA can be used to reduce the noise in a dataset by
• Identifying and removing the principal components that
correspond to the noisy parts of the data.
6. Visualization: PCA can be used to visualize high-dimensional data in
a lower-dimensional space,
• making it easier to interpret and understand.
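A minimal scikit-learn sketch of PCA used for attribute-space reduction and data compression (the Iris data and the choice of two components are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# standardize first: PCA is sensitive to the scale of each variable
X_std = StandardScaler().fit_transform(X)

# keep the two principal components that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                      # (150, 2): attribute space reduced from 4 to 2
print(pca.explained_variance_ratio_)   # importance of each retained component
```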
4.7) Feature Extraction: Linear Discriminant Analysis (LDA)
• Linear Discriminant Analysis (LDA) is a supervised learning algorithm
• used for classification tasks in machine learning.
• It is a technique used to find a linear combination of features that
• best separates the classes in a dataset.
Example:
• Suppose we have two sets of data points belonging to two different classes
• that we want to classify.
• As shown in the given 2D graph, when the data points are plotted on the 2D plane,
• there’s no straight line that can separate the two classes of the data points
completely.
• Hence, in this case, LDA (Linear Discriminant Analysis) is used
• which reduces the 2D graph into a 1D graph
• in order to maximize the separability between the two classes.
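A minimal scikit-learn sketch of the 2D-to-1D reduction described above (the synthetic two-class data set is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# two classes of points in a 2D feature space
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=0)

# LDA projects the points onto at most (n_classes - 1) axes chosen to
# maximize the separability between the classes: here 2D -> 1D
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

print(X_1d.shape)        # (200, 1)
print(lda.score(X, y))   # training accuracy of the resulting linear classifier
```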
Classification Applications in Data Mining
• Product cart analysis on eCommerce platforms uses the classification technique to associate items into groups and create combinations of products to recommend.
• This is a very common classification application in data mining.
• Weather patterns can be predicted and classified based on parameters such as temperature, humidity, wind direction, and many more.
• These classification applications of data mining are used in daily life.
• The public health sector classifies diseases based on parameters like spread rate, severity, and a lot more.
• This helps in charting out strategies to mitigate diseases and in finding cures.
• Financial institutes use classification to determine defaulters, loan seekers, and other categories.
• These classification applications in data mining make finding the target audience much easier.
• Spam detection in e-mails based on the header and content of the document.
• Classification of students according to their qualifications.
• Patients are classified according to their medical history.
• Classification can be used for the approval of credit.
• Facial key points detection.
• Drugs classification.
• Pedestrian detection in automotive car driving.
• Cancer tumor cell identification.
• Sentiment analysis.
8. Classification Algorithms in Data Mining
• Classification is the operation of separating various entities into several classes.
• These classes can be defined by
• business rules, class boundaries, or some mathematical function.
• The classification operation may be based on a relationship between a known class
assignment and characteristics of the entity to be classified.
• This type of classification is called supervised.
• If no known examples of a class are available, the classification is unsupervised.
• The most common unsupervised classification approach is clustering
• Classification algorithm finds relationships between the values of the predictors and the values
of the target.
• Different Classification algorithms use different techniques for finding relationships.
• Data mining has many classifiers/classification algorithms such as:
✓ Logistic regression
✓ Linear regression
✓ K-Nearest Neighbours Algorithm (kNN)
✓ Decision trees
✓ Support Vector Machines
✓ Rule-based Classification
✓ Bayesian Classification
✓ Random Forest
✓ Naive Bayes
8. k-NN (k-Nearest Neighbors) ALGORITHM
• The following is the pseudocode for KNN:
1. Load the data
2. Choose K value
3. For each test data point:
o Find the Euclidean distance to all training data samples
o Store the distances on an ordered list and sort it
o Choose the top K entries from the sorted list
o Label the test point based on the majority of classes present in the selected points
4. End
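A minimal Python sketch of this pseudocode, using Euclidean distance and a majority vote; the tiny 2D training set is made up for illustration:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, test_point, k):
    # distance from the test point to every training sample
    distances = [(math.dist(test_point, x), label)
                 for x, label in zip(train_X, train_y)]
    # sort the distances and keep the top K entries
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # label the test point by majority vote among the K neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy data: two well-separated groups of 2D points
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (2, 2), k=3))   # -> 'A'
print(knn_predict(train_X, train_y, (7, 9), k=3))   # -> 'B'
```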
9. Decision Tree Classifier
✓ In the basic version of the decision tree induction algorithm, all attributes are categorical, that is,
✓ discrete-valued.
✓ Continuous-valued attributes must be discretized.
9.1: Decision Tree Classifier…
The advantages of decision tree approaches are:
• Decision trees are simple to understand and interpret.
• They require little data preparation and are able to handle both numerical and categorical data.
• Decision trees can produce comprehensible rules.
• Decision trees can perform classification without much computation.
• Decision trees clearly show which fields for prediction or classification are most important.
• They are robust in nature; therefore,
• they perform well even if their assumptions are somewhat violated by the true model from
which the data were generated.
• Decision trees perform well with large data in a short time.
• Nonlinear relationships between parameters do not affect tree performance.
9.2: Decision Tree Classifier…
The drawbacks of decision tree approaches are:
• Decision trees are less suited to estimation tasks where the goal is to predict a continuous attribute value.
• Decision trees are prone to errors in classification problems with many classes and relatively small numbers of training
instances.
• The decision-making method is computationally costly.
• Each splitting field must be sorted at each node before the best split can be identified.
• Combinations of fields are used in some algorithms and search must be made for optimal combined weights.
• Pruning algorithms can also be costly as many sub-trees of candidates have to be created and compared.
• Data fragmentation: Each split in a tree leads to a reduced dataset under consideration.
• And, hence the model created at the split will potentially introduce bias.
• High variance and unstable: as a result of the greedy strategy applied by decision trees, variance in finding the
right starting point of the tree
• can greatly impact the final result, i.e. small changes early on can have big impacts later.
• So, if for example we draw two different samples from our universe,
• the starting points for both samples could be very different (and may even be different variables); this
can lead to totally different results.
10. Bayesian Classification
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities,
• such as the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• A simple Bayesian classifier known as the Naïve Bayesian classifier
• has been found to be comparable in performance with
• decision tree and selected neural network classifiers.
• Bayesian classifiers exhibit high accuracy and speed when applied to large databases.
• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
• independent of the values of the other attributes.
• This assumption is called class conditional independence.
• It is made to simplify the computations involved and, in this sense, is considered “Naïve.”
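For reference, Bayes' theorem and the class conditional independence assumption used by the Naïve Bayesian classifier can be written as:

```latex
P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)},
\qquad
P(X \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)
\quad \text{(class conditional independence)}
```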
10. Bayesian Classification…
Some of the advantages of the Naïve Bayes classifier are:
• The Naive Bayes algorithm is a fast, highly scalable algorithm.
• Naive Bayes can be used for binary and multiclass classification.
• It comes in different variants, such as
• GaussianNB, MultinomialNB, and BernoulliNB.
• It is a simple algorithm that depends on doing a bunch of counts.
• It is a great choice for text classification problems.
• It can be easily trained on a small dataset.
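A minimal scikit-learn sketch of Naïve Bayes for text classification (MultinomialNB is one of the variants listed above; the tiny spam/ham corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize money now", "lowest price guaranteed",
        "meeting agenda attached", "project status update"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer().fit(docs)                          # bag-of-words counts
model = MultinomialNB().fit(vec.transform(docs), labels)

# prediction uses P(class) * product of P(word | class) under the
# class conditional independence assumption
print(model.predict(vec.transform(["free money update"])))
```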
11. SVM (Support Vector Machine)
• If the vector of the weights is denoted by Θ and |Θ| is the norm of this vector, then
• it is easy to see that the size of the maximal margin is 2/|Θ|.
11. SVM…
• Finding the maximal margin hyperplanes and support vectors is
• a problem of convex quadratic optimization.
• It is important to note that the complexity of SVM is characterized by
• the number of support vectors, rather than the dimension of the feature space.
• That is the reason SVM has a comparatively less tendency to overfit.
• If all data points other than the support vectors are removed from the training data set, and the training
algorithm is repeated,
• the same separating hyperplane would be found.
• The number of support vectors provides an upper bound to the expected error rate of the SVM classifier,
• which happens to be independent of data dimensionality.
• An SVM with a small number of support vectors has good generalization,
• even when the data has high dimensionality.
• As a training algorithm, SVM may not be very fast compared to some other classification methods,
• but owing to its ability to model complex nonlinear boundaries, SVM has high accuracy.
• SVM is comparatively less prone to overfitting.
• SVM has successfully been applied to handwritten digit recognition, text classification, speaker identification etc..
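A small scikit-learn sketch (toy blobs data, linear kernel) that inspects the support vectors and computes the maximal margin 2/|Θ| mentioned earlier; the data set and parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two roughly linearly separable blobs
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# complexity is characterized by the support vectors, not the dimensionality
print(svm.n_support_)             # number of support vectors per class
print(svm.support_vectors_[:3])   # some of the points defining the margin

# maximal margin = 2 / ||w||, where w is the learned weight vector
w = svm.coef_[0]
print(2 / np.linalg.norm(w))
```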
12. Rule Based Classification
• Rule-based classifiers make use of a set of IF-THEN rules for classification.
• We can express a rule in the following form:
IF condition THEN conclusion
• Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buy_computer = yes
• Rule Notation: (Condition) → Class Label
Some of the advantages of Rule-Based classifiers:
• They have characteristics quite similar to decision trees
• These classifiers are as highly expressive as decision trees
• They are easy to interpret
• Their performance is comparable to decision trees
• They can handle redundant attributes
• They are better suited for handling imbalanced classes
• However, it is harder for them to handle missing values in the test set
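A minimal Python sketch of applying rule R1 as an IF-THEN classifier (the default conclusion for uncovered records is an assumption for illustration):

```python
def rule_based_classify(record):
    # R1: IF age = youth AND student = yes THEN buy_computer = yes
    if record["age"] == "youth" and record["student"] == "yes":
        return "buy_computer = yes"
    # default rule fired when no other rule condition is satisfied
    return "buy_computer = unknown"

print(rule_based_classify({"age": "youth", "student": "yes"}))   # buy_computer = yes
print(rule_based_classify({"age": "senior", "student": "no"}))   # buy_computer = unknown
```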
13. Model Selection
• Model evaluation for a classification model: ✓ Confusion Matrix
• Model selection is a technique for selecting the best model
• after the individual models are evaluated based on the required criteria.
• Model selection is the problem of choosing one from among a set of candidate models.
• in the case of supervised learning, the three most common approaches are:
• Train, Validation, and Test datasets
• Resampling Methods
• Probabilistic Statistics
The simplest reliable method of model selection involves fitting candidate models on a training set, tuning them on
the validation dataset, and selecting a model that performs the best on the test dataset according to a chosen metric,
such as accuracy or error. A problem with this approach is that it requires a lot of data.
Resampling techniques attempt to achieve the same as the train/val/test approach to model selection, although using
a small dataset. An example is k-fold cross validation where a training set is split into many train/test pairs and a
model is fit and evaluated on each. This is repeated for each model and a model is selected with the best average
score across the k-folds. A problem with this and the prior approach is that only model performance is assessed,
regardless of model complexity.
A third approach to model selection attempts to combine the complexity of the model with the performance of the
model into a score, then select the model that minimizes or maximizes the score. There are three statistical
approaches to estimating how well a given model fits a dataset and how complex the model is.
1. Akaike Information Criterion (AIC). Derived from frequentist probability
2. Bayesian Information Criterion (BIC). Derived from Bayesian probability
3. Minimum Description Length (MDL). Derived from information theory
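A minimal sketch of the resampling approach: candidate models are compared by their average k-fold cross-validation score (the data set and candidates are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# select the candidate with the best average score across the k folds
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 3))
```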
14. CLUSTERING – AN OVERVIEW
• Clustering helps in organizing huge voluminous data into clusters and
• displays interior structure of statistical information.
• Clustering improves the data readiness towards artificial intelligence techniques.
• The clustering process supports knowledge discovery in data;
• it is used either as a stand-alone tool to gain insight into the data distribution or
• as a preprocessing step for other algorithms.
14.1. General Approach to Clustering
• Cluster analysis is an exploratory discovery process.
• It can be used to discover structures in data without providing an explanation/interpretation.
• Cluster analysis includes two major aspects: clustering and cluster validation.
• Clustering aims at partitioning objects into groups according to a certain criteria.
• To achieve different application purposes, a large number of clustering algorithms have been developed.
• However, since there is no general-purpose clustering algorithm that fits all kinds of applications,
• an evaluation mechanism is required to assess the quality of the clustering results
• produced by different clustering algorithms or
• by a clustering algorithm with different parameters,
so that the user may find a fitting cluster scheme for a specific application.
• The quality assessment process of clustering results is regarded as cluster validation.
• Cluster analysis is an iterative process of clustering and cluster verification by the user facilitated with
• clustering algorithms,
• cluster validation methods,
• visualization and
• domain knowledge to databases.
14.3. Applications of Cluster Analysis
• Clustering analysis is widely utilized in a variety of fields, including
• data analysis, market research, pattern identification, and image processing.
• Earth observation databases use this data to identify
• similar land regions and
• to group houses in a city based on house type, value, and geographic position.
• It is the backbone of search engine algorithms,
• where objects that are similar to each other must be presented together and dissimilar
objects should be ignored.
• Also, it is required to fetch objects that are closely related to a search term, if not
completely related.
• Used in image segmentation in bioinformatics where
• clustering algorithms have proven their worth in detecting cancerous cells from various
medical imagery
– eliminating the prevalent human errors and other bias.
14.3. Applications of Cluster Analysis…
• Clustering effectively detects hidden patterns, rules, constraints, flow etc.
• based on various metrics of traffic density from GPS data and
• can be used for segmenting routes and
• suggesting users with best routes, location of essential services, search for objects
on a map etc.
• Satellite imagery can be segmented to find suitable and arable lands for agriculture.
• Document clustering is effectively being used in preventing the spread of
fake news on Social Media.
• Website network traffic can be divided into various segments, which
• helps in heuristically prioritizing the requests and
• also helps in detecting and preventing malicious activities.
15: Clustering Methods in Data Mining
• For a successful grouping there are two major goals –
(i) Similarity between one data point and another
(ii) Distinction of those similar data points from others which most certainly,
heuristically, differ from those points.
• To address the challenges such as scalability, attributes, dimensional,
boundary shape, noise, and interpretation
• there are various types of clustering methods to solve one or many of these
problems.
• Various types of Clustering methods are:
✓ Partitioning Method
✓ Hierarchical Method
✓ Density-based Method
✓ Grid-Based Method
✓ Model-Based Method
✓ Constraint-based Method
16. Partitioning Method: k-Means Algorithm
• In this method, m data points are clustered into some number of clusters, say k, where each data point belongs to the cluster with the closest mean.
Algorithm
1. Define the number of clusters (k) to be produced and choose initial data points as centroids.
2. The distance from every data point to all the centroids is calculated, and the point is assigned to the cluster with the minimum distance.
3. Follow the above step for all the data points.
4. The average of the data points present in a cluster is calculated and set as the new centroid for that cluster.
5. Repeat from Step 2 until the desired clusters are formed.
Advantages
• Effortless implementation process.
• Dense clusters are produced when the clusters are spherical, compared to the hierarchical method.
• Appropriate for large databases.
Disadvantages
• Inappropriate for clusters with different densities and sizes.
• Equivalent results are not produced on every run (results depend on initialization).
• Euclidean distance measures can weigh features unequally due to underlying factors.
• Unsuccessful for non-linear and categorical data sets.
• Noisy data and outliers are difficult to handle.
✓ The initial centroids are selected randomly and thus have a large influence on the resulting clusters.
✓ The complexity of the k-means algorithm is O(tkn), where n is the total number of data points, k the number of clusters formed, and t the number of iterations needed to form the clusters.
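A minimal scikit-learn sketch of k-means on a toy data set; n_init repeats the algorithm with different random initial centroids, which mitigates the sensitivity to initialization noted above (all parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# m data points to be grouped into k clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # final centroid (mean) of each cluster
print(km.labels_[:10])       # cluster assignment of the first 10 points
print(km.inertia_)           # sum of squared distances to the closest centroid
```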
16.1 Partitioning Method: k-Medoids or PAM
(Partitioning Around Medoids)
• It is similar in process to the k-means clustering algorithm, with the difference being in the assignment of the center of the cluster: in k-medoids, the center (medoid) is an actual data object.
The algorithm is implemented in two steps:
Build: initial medoids are the innermost (most centrally located) objects.
Swap: a medoid can be swapped with another object as long as the objective function can still be reduced.
Algorithm
1. Initially choose k random points as initial medoids from the given data set.
2. Assign every data point to its closest medoid using a distance metric.
3. The swapping cost is calculated for every selected and non-selected object, given as TCns, where s is a selected and n is a non-selected object.
4. If TCns < 0, s is replaced by n.
5. Repeat Steps 2–4 until there is no change in the medoids.
Advantages
• Effortless understanding and implementation process.
• Can run quickly and converge in a few steps.
• Arbitrary dissimilarities between the objects are allowed.
• Less sensitive to outliers when compared to k-means.
Disadvantages
• Different initial sets of medoids can produce different clusterings. It is thus advisable to run the procedure several times with different initial sets.
• Resulting clusters may depend upon the units of measurement. Variables of different magnitude can be standardized.
Four characteristics to be considered are:
✓ Shift-out membership: Movement of an object from current cluster to another is allowed.
✓ Shift-in membership: Movement of an object from outside to current cluster is allowed.
✓ Update the current medoids: Current medoid can be replaced by a new medoid.
✓ No change: Objects are at their appropriate distances from cluster.
17. Hierarchical Method: Agglomerative and Divisive Approach
• This method decomposes a set of data items into a hierarchy. Depending on how the hierarchical breakdown
is generated, we can put hierarchical approaches into different categories. Following are the two approaches:

Agglomerative Approach
• This algorithm is also referred to as the bottom-up approach.
• This approach treats each and every data point as a single cluster and
• then merges clusters by considering the similarity (distance) between the individual clusters,
• until a single large cluster is obtained or some condition is satisfied.
Advantages
• Easy to identify nested clusters.
• Gives better results and ease of implementation.
• They are suitable for automation.
• Reduces the effect of the initial values of clusters on the clustering results.
• Reduces the computing time and space complexity.
Disadvantages
• It can never undo what was done previously.
• Difficulty in handling clusters of different sizes and convex shapes leads to an increase in time complexity.
• There is no direct minimization of an objective function.
• Sometimes there is difficulty in identifying the exact number of clusters from the dendrogram.

Divisive Approach
• This approach is also referred to as the top-down approach.
• In this, we consider the entire data sample set as one cluster and
• continuously split the cluster into smaller clusters iteratively.
• It is done until each object is in its own cluster or the termination condition holds.
• This method is rigid, because once a merging or splitting is done, it can never be undone.
Advantage
• It produces more accurate hierarchies than the bottom-up algorithm in some circumstances.
Disadvantages
• The top-down approach is computationally more complex than the bottom-up approach because we need a second, flat clustering algorithm.
• Use of different distance metrics for measuring the distance between clusters may generate different results.
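A minimal SciPy sketch of the bottom-up (agglomerative) approach with Ward linkage; the data set and the cut into 3 clusters are illustrative:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=1)

# bottom-up: start with every point as its own cluster and repeatedly
# merge the two closest clusters (Ward linkage minimizes within-cluster variance)
Z = linkage(X, method="ward")

# cut the hierarchy to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the full merge hierarchy
```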
18. Density Based Method: DBSCAN
(Density-Based Spatial Clustering of Applications with Noise)
• In DBSCAN, a cluster is defined as a group of data points that is highly dense.
• DBSCAN considers two parameters:
• Eps: the maximum radius of the neighborhood around a point.
• MinPts: the minimum number of data points that must lie within the Eps-neighborhood of a point.
• The Eps-neighborhood of a point q is defined as NEps(q) = { p ∈ D | dist(p, q) ≤ Eps }.
• In order to understand density-based clustering, let us follow a few definitions:
• Core point: a point that has at least MinPts points within its Eps radius (Eps and MinPts are specified by the user); such a point is surrounded by a dense neighborhood.
• Border point: a point that lies within the neighborhood of a core point (multiple core points can share the same border point) but does not itself have a dense neighborhood.
• Noise/Outlier: a point that does not belong to any cluster.
• Directly Density Reachable: a point p is directly density reachable from a point q with respect to Eps and MinPts if p belongs to NEps(q) and q satisfies the core point condition, i.e. |NEps(q)| ≥ MinPts.
• Density Reachable: a point p is said to be density reachable from a point q with respect to Eps and MinPts if there is a chain of points p1, p2, ..., pn, with p1 = q and pn = p, such that pi+1 is directly density reachable from pi.
Algorithm
1. In order to form clusters, initially consider a random point, say point p.
2. The second step is to find all points that are density reachable from point p with respect to Eps and MinPts. The following condition is checked in order to form the cluster:
a. If point p is found to be a core point, then a cluster is obtained.
b. If point p is found to be a border point, then no points are density reachable from point p, and hence we visit the next point of the database.
3. Continue this process until all the points have been processed.
Advantages
• It can identify outliers.
• It does not require the number of clusters to be specified in advance.
Disadvantages
• If the density of the data keeps changing, then finding clusters efficiently is difficult.
• It does not suit high-dimensional data well, and the user has to specify the parameters in advance.
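A minimal scikit-learn sketch of DBSCAN on a non-spherical toy data set (the eps and min_samples values, i.e. Eps and MinPts, are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaving half-moons: non-spherical clusters where k-means struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = Eps, min_samples = MinPts

print(set(db.labels_))                        # cluster ids; -1 marks noise/outliers
print(list(db.labels_).count(-1), "points labelled as noise")
```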
19. Limitations with Cluster Analysis
• There are two major drawbacks that influence the feasibility of cluster analysis in real-world data mining applications.
• The first is that existing automated clustering algorithms struggle when dealing with arbitrarily shaped data distributions in the datasets.
• The second issue is that the evaluation of the quality of clustering results by statistics-based methods is time
consuming when the database is large,
• primarily due to the very high computational cost of statistics-based methods for assessing the
consistency of the cluster structure between the sampling subsets.
• The implementation of statistics-based cluster validation methods does not scale well to very large
datasets.
• On the other hand, arbitrarily shaped clusters also make the traditional statistical cluster
validity indices ineffective, which makes it difficult to determine the optimal cluster structure.
✓ Cluster analysis is a multiple-run, iterative process; without any user domain knowledge,
✓ it would be inefficient and
✓ unintuitive to satisfy the specific requirements of application tasks in clustering.
20. Outlier Analysis
• In data mining, it is common to utilize outlier detection to
• find anomalies, and
• find patterns or trends.
• Examples:
• Identifying financial fraud such as credit card hacking or other similar scams makes use of this technology.
• It's utilized to keep track of a customer's changing purchase habits.
• It's used to find and report human-made mistakes in typing.
• It's utilized for troubleshooting and identifying problems with machines and systems.
✓ Outlier detection can be defined as the process of detecting and then excluding outliers from a given set of data.
✓ There are no standardized outlier identification methods because these are mostly dataset-dependent.
Remember two important questions about your dataset during outlier identification:
(i) What and how many features do I consider for outlier detection? (Similarity/diversity)
(ii) Can I assume a distribution (or distributions) of values for the features I have selected? (Parametric/non-parametric)

➢ Outlier Detection Techniques:
• Numeric Outlier – calculated by the IQR (InterQuartile Range).
• Z-Score – the Z-score technique assumes a Gaussian distribution of the data. Outliers are data points that lie in the tails of the distribution and are therefore far from the mean.
• DBSCAN (clustering method) – a non-parametric, density-based outlier detection method. Here, all data points are classified as core points, border points, or noise points.
• Isolation Forest – this non-parametric method is suitable for large datasets with one or more dimensional features.

➢ Models for Outlier Detection Analysis:
• Extreme Value Analysis – in this approach, the largest or smallest values are considered outliers. The Z-test and the Student's t-test are examples. These are good heuristics for the initial analysis of data, but they are not of much value in multivariate settings.
• Linear Models – the distance of each data point to a plane that fits the sub-space of the data is calculated. This distance is used to detect outliers. PCA (Principal Component Analysis) is an example of a linear model for anomaly detection.
• Probabilistic and Statistical Models – Expectation-Maximization (EM) methods are used to estimate the parameters of the model. Finally, the probability of membership of each data point in the calculated distribution is computed. Points with the lowest probability of membership are marked as outliers.
• Proximity-based Models – in this approach, outliers are modeled as points isolated from the other observations. Cluster analysis, density-based analysis, and nearest-neighborhood analysis are key approaches of this type.
• Information-theoretic Models – in this approach, outliers increase the minimum code length needed to describe a data set.
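A minimal sketch of the numeric-outlier (IQR) and Z-score techniques listed above (the small data array is made up for illustration):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 10, 13, 12])   # 95 is the outlier

# (a) numeric outlier via IQR: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])   # [95]

# (b) Z-score: flag points far from the mean in units of standard deviation
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])                                       # [95]
```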
21) Hadoop Introduction: Hadoop Distributed File System (HDFS) :
• Hadoop is a collection of several software services that are all freely accessible to the
public and can be used in conjunction with one another.
• It offers a software framework
• for storing a huge amount of data in a variety of locations using Hadoop Distributed File System
(HDFS) and
• for working with that data by utilizing the MapReduce programming style.
• The combination of HDFS and MapReduce creates an architecture that
• conceals all of the complexity associated with the analysis of big data.
• It is scalable and fault-tolerant.
21.1) Various daemons in Apache Hadoop
• Apache Hadoop includes five daemons.
• Three relate to HDFS, for the purpose of efficiently managing distributed storage:
• NameNode,
• DataNode, and
• Secondary NameNode
• Two are utilized by the MapReduce engine and are responsible for job tracking and job execution:
• JobTracker and
• TaskTracker
• Each of these daemons runs in its own JVM.
21.2) HDFS: NameNode
• NameNode : A single NameNode daemon operates on the master node.
• NameNode is responsible for storing and managing the metadata that is connected with the file system.
• This metadata is stored in a file that is known as fsimage.
• When a client makes a request to read from or write to a file, the metadata is held in a cache that is located within the
main memory so that the client may access it more rapidly.
• The I/O tasks are completed by the slave DataNode daemons, which are directed in their actions by the NameNode.
• The NameNode is responsible for managing and directing
• how files are divided up into blocks,
• selecting which slave node should store these blocks, and
• monitoring the overall health and fitness of the distributed file system.
21.3) Hadoop MapReduce
• Hadoop MapReduce is a programming framework made available for creating applications that can process
and analyze massive data sets in parallel on large multi-node clusters of commodity hardware in a manner that is
scalable, reliable, and fault tolerant.
• The processing and analysis of data consist of two distinct stages known as
• the Map phase and the Reduce phase.
• Thus, in MapReduce programming, an entire task can be divided into map task and reduce task.
• Map takes a <key, value> pair as input and produces a list of <key, value> pairs as output.
• Reduce takes a key and the shuffled list of values for that key as input, and produces the final <key, value> pairs as output.
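A classic illustration of the Map and Reduce phases is the word-count job. Below is a minimal sketch of a Hadoop Streaming style mapper and reducer written in Python; the single-file layout, the map/reduce command-line switch, and the shell pipeline in the comment are illustrative, not part of Hadoop itself:

```python
# word count in the MapReduce style (Hadoop Streaming reads/writes via stdin/stdout)
import sys

def mapper():
    # Map phase: for every word in the input, emit a <word, 1> pair
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives grouped (sorted) by key, so we sum the
    # counts for each word and emit the final <word, total> pair
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # local simulation:  cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```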