
Lecture Notes on Fundamentals of Data Science

VISION OF THE INSTITUTE

Empower individuals and society at large through educational excellence; sensitize them to a life dedicated to the service of fellow human beings and the motherland.

MISSION OF THE INSTITUTE

To impart holistic education that enables students to become socially responsive and useful,
with roots firm in traditional and cultural values, and to hone their skills to accept
challenges and respond to opportunities in a global scenario.

Program Name: B.C.A
Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2
No. of Credits: 03
Contact Hours: 42 Hours
Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40
Summative Assessment Marks: 60

Course Outcomes (COs): After the successful completion of the course, the student will be able to:

CO1: Understand the concepts of data and pre-processing of data.

CO2: Know simple pattern recognition methods.

CO3: Understand the basic concepts of Clustering and Classification.

CO4: Know the recent trends in Data Science.



Program Name: B.C.A
Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2
No. of Credits: 03
Contact Hours: 42 Hours
Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40
Summative Assessment Marks: 60

Unit 1
Topics:

Data Mining: Introduction, Data Mining Definitions, Knowledge Discovery in Databases (KDD) vs. Data Mining, DBMS vs. Data Mining, DM Techniques, Problems, Issues and Challenges in DM, DM Applications.

Data Mining:

Def 1: Data mining refers to extracting or mining knowledge from large amounts of data stored in databases,
data warehouses, or other repositories, i.e., the extraction of small amounts of valuable information from huge volumes of data.

Def 2: It is the process of discovering interesting patterns and knowledge from large amounts of data.

Data archaeology, data dredging, and data/pattern analysis are other terms for data mining. Another
popular term is Knowledge Discovery from Data (KDD).

Why Data Mining is important?

Huge amounts of data are generated, and there is a need to turn them into useful information and knowledge.
This information and knowledge is used in various applications such as market analysis (consumer buying
patterns), fraud detection (fraudulent accounts, fraudulent credit card holders), science exploration
(hidden facts in data), telecommunications, etc.

Steps in Knowledge Discovery from Data:

1. Data Cleaning: Remove noise and inconsistent data.
2. Data Integration: Multiple data sources are combined.
3. Data Selection: Only relevant data are retrieved from the database.
4. Data Transformation: Data is consolidated into a form which is appropriate for mining.
5. Data Mining: Intelligent methods are applied to extract data patterns.
6. Pattern Evaluation: To identify the truly interesting patterns representing knowledge based
on some interestingness measures.
7. Knowledge Presentation: Visualization (graphics) and knowledge representation techniques are
used to present the mined knowledge to the user.

Example of Data Cleaning in Data Mining

Data cleaning involves identifying and correcting errors or inconsistencies in datasets to improve
their quality before analysis. Below is an example:

Scenario: Customer Database Cleaning

A retail company has a customer database with errors that need to be cleaned before performing
customer segmentation.

Raw Data (Before Cleaning)

Customer_ID Name Age Email Purchase_Amount Country

101 Chandra 25 CM@mitfgc.in 500 India

102 Jane Smith -30 janesmith@xyz.com 1000 UK

103 NULL 40 n/a 750 Canada

104 Mohan 32 mohan@mitfgcin ? India

105 Rao 27 rao@gmail.com 900 NULL

Data Cleaning Steps:

1. Handling Missing Values:

o Replace NULL in the "Name" column with "Unknown."


o Replace NULL in the "Country" column with the most common value.

o Replace "?" in "Purchase_Amount" with the average purchase amount.

2. Removing or Correcting Invalid Values:

o Age cannot be negative (-30) → Convert it to an estimated valid value (e.g., 30).

o Email "mohan@mitfgcin" has a missing dot (.) → Correct it to "mohan@mitfgc.in".

3. Standardization:

o Ensure all email addresses are in lowercase.

o Convert currency formats if needed.

Cleaned Data (After Cleaning)

Customer_ID Name Age Email Purchase_Amount Country

101 Chandra 25 cm@mitfgc.in 500 India

102 Jane Smith 30 janesmith@xyz.com 1000 UK

103 Unknown 40 NULL 750 Canada

104 Mohan 32 mohan@mitfgc.in 787.5 (avg) India

105 Rao 27 rao@gmail.com 900 India

Now the dataset is clean, accurate, and ready for analysis in data mining.
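For illustration only (not part of the original notes), the same cleaning steps can be sketched in Python with pandas; the column names mirror the example table above and the specific fill rules are assumptions taken from the steps listed:

import pandas as pd

# Hypothetical raw customer data mirroring the example table above.
raw = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104, 105],
    "Name": ["Chandra", "Jane Smith", None, "Mohan", "Rao"],
    "Age": [25, -30, 40, 32, 27],
    "Email": ["CM@mitfgc.in", "janesmith@xyz.com", "n/a",
              "mohan@mitfgcin", "Rao@gmail.com"],
    "Purchase_Amount": [500, 1000, 750, None, 900],   # "?" treated as missing
    "Country": ["India", "UK", "Canada", "India", None],
})

# 1. Handle missing values.
raw["Name"] = raw["Name"].fillna("Unknown")
raw["Country"] = raw["Country"].fillna(raw["Country"].mode()[0])
raw["Purchase_Amount"] = raw["Purchase_Amount"].fillna(raw["Purchase_Amount"].mean())

# 2. Remove or correct invalid values.
raw["Age"] = raw["Age"].abs()                                           # -30 -> 30
raw["Email"] = raw["Email"].str.replace("mitfgcin", "mitfgc.in", regex=False)

# 3. Standardize formats.
raw["Email"] = raw["Email"].str.lower()

print(raw)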

Example of Data Integration in Data Mining

Data integration is the process of combining data from multiple sources into a single, unified
view. This is essential in data mining to improve analysis, accuracy, and consistency.

Scenario: Integrating Sales and Customer Databases

A retail company has two separate datasets:

1. Sales Database (stores transaction details)


2. Customer Database (stores customer details)

Before Integration:

Sales Data Table:

Transaction_ID Customer_ID Product Amount (Rs) Date

T001 101 Laptop 1200 2024-01-10

T002 102 Phone 800 2024-01-11

T003 103 Tablet 500 2024-01-12

Customer Data Table:

Customer_ID Name Age Email Country

101 Mohan 25 mohan@mitfgc.com India

102 Jane Smith 30 janesmith@xyz.com UK

103 Chandra 28 chandra@mitfgc.com India

After Data Integration:

By joining these tables using Customer_ID, we create a unified dataset.

Integrated Data Table:

Transaction_ID Customer_ID Name Age Email Country Product Amount (Rs) Date

T001 101 Mohan 25 mohan@mitfgc.com India Laptop 1200 2024-01-10

T002 102 Jane Smith 30 janesmith@xyz.com UK Phone 800 2024-01-11

T003 103 Chandra 28 chandra@mitfgc.com India Tablet 500 2024-01-12

Benefits of Data Integration in Data Mining:

Improved Data Consistency – Eliminates redundancy and inconsistencies.


Better Decision-Making – Provides a holistic view of customers and their purchase behavior.
Enhanced Analysis – Enables deeper insights into customer preferences and trends.

Now, the company can analyze buying patterns, predict customer behavior, and improve
marketing strategies.
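As an illustration (not from the original notes), the join on Customer_ID can be sketched in Python with pandas:

import pandas as pd

# Hypothetical sales and customer tables from the example above.
sales = pd.DataFrame({
    "Transaction_ID": ["T001", "T002", "T003"],
    "Customer_ID": [101, 102, 103],
    "Product": ["Laptop", "Phone", "Tablet"],
    "Amount_Rs": [1200, 800, 500],
    "Date": ["2024-01-10", "2024-01-11", "2024-01-12"],
})
customers = pd.DataFrame({
    "Customer_ID": [101, 102, 103],
    "Name": ["Mohan", "Jane Smith", "Chandra"],
    "Age": [25, 30, 28],
    "Email": ["mohan@mitfgc.com", "janesmith@xyz.com", "chandra@mitfgc.com"],
    "Country": ["India", "UK", "India"],
})

# Join the two sources on the shared key to build the unified view.
integrated = sales.merge(customers, on="Customer_ID", how="inner")
print(integrated)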

Example of Data Selection in Data Mining

Data selection in data mining involves choosing relevant data from a larger dataset to improve
analysis efficiency and accuracy. It helps in focusing on only the necessary attributes instead of
processing the entire dataset.

Scenario: Selecting Data for Customer Cancellation of Service Prediction

A telecom company wants to predict customer churn (customers leaving the service). The
company has a large dataset with many attributes, but not all are useful for churn prediction.

Original Dataset (Before Selection)

Customer_ID Name Age Gender Address Call_Duration Data_Usage Monthly_Bill Payment_Method Customer_Feedback Cancel_Status

101 Mohan 25 Male Mysore 30 min/day 5GB 499 Credit Card Neutral No

102 Arun 35 Female Mysore 10 min/day 2GB 300 UPI Dissatisfied Yes

103 Chandra 28 Female Mysore 50 min/day 10GB 700 Debit Card Satisfied No

Step 1: Selecting Relevant Attributes

To predict churn, some attributes are irrelevant (e.g., "Name", "Address") and can be removed.
The most relevant features are:
✔ Age – May impact churn behavior.
✔ Call_Duration – Shows engagement with the service.
✔ Data_Usage – Indicates usage patterns.
✔ Monthly_Bill – Higher bills may lead to churn.
✔ Customer_Feedback – Negative feedback might indicate potential churn.
✔ Cancel_Status (churn status) – Target variable for prediction.

Filtered Dataset (After Data Selection)

Customer_ID Age Call_Duration Data_Usage Monthly_Bill Customer_Feedback Cancel_Status

101 25 30 min/day 5GB 499 Neutral No

102 35 10 min/day 2GB 300 Dissatisfied Yes

103 28 50 min/day 10GB 700 Satisfied No

Benefits of Data Selection in Data Mining:

Reduces Processing Time – Working with fewer attributes speeds up computations.


Enhances Model Performance – Eliminates irrelevant data that may add noise.
Improves Interpretability – Easier to analyze and derive insights.

Now, this refined dataset is ready for service-cancellation (churn) prediction models such as decision
trees or neural networks.
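A minimal sketch of this attribute selection in Python with pandas (illustrative only; column names follow the example table above):

import pandas as pd

# Hypothetical full telecom dataset mirroring the example above.
full = pd.DataFrame({
    "Customer_ID": [101, 102, 103],
    "Name": ["Mohan", "Arun", "Chandra"],
    "Age": [25, 35, 28],
    "Gender": ["Male", "Female", "Female"],
    "Address": ["Mysore", "Mysore", "Mysore"],
    "Call_Duration": [30, 10, 50],          # minutes per day
    "Data_Usage": [5, 2, 10],               # GB
    "Monthly_Bill": [499, 300, 700],
    "Payment_Method": ["Credit Card", "UPI", "Debit Card"],
    "Customer_Feedback": ["Neutral", "Dissatisfied", "Satisfied"],
    "Cancel_Status": ["No", "Yes", "No"],
})

# Keep only the attributes judged relevant for churn prediction.
relevant = ["Customer_ID", "Age", "Call_Duration", "Data_Usage",
            "Monthly_Bill", "Customer_Feedback", "Cancel_Status"]
selected = full[relevant]
print(selected)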


Example of Data Transformation in Data Mining

Data transformation in data mining is the process of converting data into a suitable format for
analysis. This includes normalization, aggregation, discretization, encoding, and feature
engineering.

Date Transformation (Feature Engineering)

• Convert Join_Date to "Customer Tenure" (years since joining).

• Example: If today’s year is 2024, then Customer Tenure = 2024 - Join Year.

Raw Dataset (Before Transformation)

Customer_ID Age Income (Rs) Purchase_Frequency Join_Date Preferred_Payment_Method

101 25 50000 15 2018-06-10 Credit Card

102 42 120000 5 2015-09-25 UPI

103 30 80000 8 2020-01-12 Debit Card
After Feature/column generation:

Customer_ID Age Customer_Tenure (Years)

101 25 6

102 42 9

103 30 4

Benefits of Data Transformation in Data Mining:

Improves Model Accuracy – Scaled and encoded data improves machine learning
performance.
Enhances Interpretability – Transformed features make patterns easier to detect.
Reduces Computational Complexity – Normalization speeds up algorithms.

Now, the dataset is ready for clustering algorithms such as k-means or for classification models.
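The tenure feature can be derived as in the following illustrative Python/pandas sketch (column names and the 2024 reference year follow the worked example above):

import pandas as pd

# Hypothetical customer table mirroring the example above.
cust = pd.DataFrame({
    "Customer_ID": [101, 102, 103],
    "Age": [25, 42, 30],
    "Income_Rs": [50000, 120000, 80000],
    "Join_Date": ["2018-06-10", "2015-09-25", "2020-01-12"],
})

# Feature engineering: Customer_Tenure = reference year - join year.
cust["Join_Date"] = pd.to_datetime(cust["Join_Date"])
cust["Customer_Tenure"] = 2024 - cust["Join_Date"].dt.year
print(cust[["Customer_ID", "Age", "Customer_Tenure"]])   # tenures: 6, 9, 4 years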


Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base.

Architecture of DM System

Typically, a DM system consists of the following components:

• Database, data warehouse, WWW, or other information repository (spreadsheets, files)

• Data Warehouse Server:
This server is responsible for fetching the relevant data based on the user's data mining
request.
• Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of the resulting patterns.
• DM Engine:
Consists of a set of methods/functions such as characterization, association, correlation
analysis, classification, cluster analysis, prediction, outlier analysis, etc.
• Pattern Evaluation:
Employs interestingness measures and interacts with the data mining modules so as to
focus the search toward interesting patterns.
• User Interface:

The interface between the user and the DM system. The user specifies queries, tasks, etc., and can
browse data and visualize output.


Fig: Architecture of DM System

Which technologies are used for DM?

Data mining involves an integration of techniques from multiple disciplines such as databases,
data warehouses, statistics, machine learning, pattern recognition, neural networks, data
visualization, information retrieval, image/signal processing, and spatial and temporal data analysis.


Fig: Data Mining adopts many domains

Data mining on what kinds of data?

DM can be used to mine knowledge from any kind of data source like:

Kinds of Data That Can Be Mined in Data Mining

Data mining can be applied to various types of data to discover patterns, trends, and useful insights.
Below are the main categories:

1. Structured Data

Definition: Data that is organized in a fixed format, typically stored in relational databases (tables
with rows and columns).

Examples:

• Customer databases (e.g., names, emails, purchases)

• Transaction records (e.g., sales, payments)

• Inventory management data

Mining Techniques Used: Classification, Clustering, Association Rule Mining

2. Semi-Structured Data

Definition: Data that does not fit into a rigid structure but still has some organization (e.g., tags,
metadata).

Examples:

• XML, JSON data files

• Emails (subject, body, attachments)

• Sensor logs from IoT devices

Mining Techniques Used: Text Mining, Information Extraction, Natural Language Processing
(NLP)

3. Unstructured Data

Definition: Data that has no predefined format, making it more complex to analyze.

Examples:

• Text data (e.g., social media posts, reviews, emails)

• Images & Videos (e.g., CCTV footage, medical images)

• Audio files (e.g., voice recordings, music)

Mining Techniques Used: Deep Learning, Computer Vision, Speech Recognition

4. Spatial Data

Definition: Data related to geographic locations or spatial dimensions.

Examples:

• GPS data (e.g., Google Maps locations)

• Satellite images

• Geographic Information Systems (GIS)

Mining Techniques Used: Spatial Clustering, Geospatial Analysis


5. Time-Series Data

Definition: Data collected over time, where the sequence and time intervals matter.

Examples:

• Stock market trends

• Weather data

• Website traffic logs

Mining Techniques Used: Time Series Forecasting, Anomaly Detection

6. Web Data

Definition: Data extracted from websites and online sources.

Examples:

• Clickstream data (user interactions on websites)

• E-commerce transaction logs

• Social media analytics

Mining Techniques Used: Web Scraping, Sentiment Analysis, Page Ranking Algorithms

7. Multimedia Data

Definition: Data consisting of images, videos, audio, and animations.

Examples:

• Medical imaging (X-rays, MRIs)

• Facial recognition data

• Video surveillance footage

Mining Techniques Used: Image Processing, Deep Learning, Computer Vision


8. Biological & Genetic Data

Definition: Data related to genetics, DNA sequencing, and medical research.

Examples:

• DNA sequencing data

• Protein structure data

• Electronic Health Records (EHRs)

Mining Techniques Used: Bioinformatics, Machine Learning in Healthcare

DM functionalities – What kinds of patterns can be mined?

DM tasks can be classified into two categories:

1) Descriptive: Characterizes the general properties of the data in the database.

2) Predictive: Performs inference on the current data in order to make predictions.

Different kinds of patterns that can be discovered are:

1) Concept/class description:
Data entries can be associated with classes or concepts. For example, classes of items for
sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders. It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions.

These descriptions can be derived by:

1. Data characterization, by summarizing the data of the class under study (e.g., based on
gender or buying behavior).
2. Data discrimination, by comparing the target class with one or a set of comparative classes
(e.g., sales of computers compared with laptops).
3. Both data characterization and discrimination.

The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables
2) Mining Frequent Pattern, Association & correlation:


A frequent pattern is a pattern that occurs frequently in data. Mining frequent patterns leads
to the discovery of interesting associations and correlations within data. Different kinds of frequent
patterns are:
• Itemsets – A frequent itemset typically refers to a set of items that often appear together
in a transactional data set, for example, milk and bread, which are frequently bought
together in grocery stores by many customers.
• Subsequences – A frequently occurring subsequence, such as the order in which customers
tend to purchase items, e.g., mobile phone, back case, screen guard.
3) Classification & Prediction:
It is the process of building a model that describes the classes and then predicting/assigning
objects to different classes using the model. The model can be built using if-then rules, decision
trees, neural networks, etc. Methods for constructing classification models include Bayesian
classification, SVM, and k-nearest neighbor.
Ex: A bank manager wants to analyze which loan applicants are safe and which pose a
risk.
4) Regression Analysis: Regression analysis is a reliable method of identifying which variables
have impact on a topic of interest. The process of performing a regression allows you to
confidently determine which factors matter most, which factors can be ignored, and how these
factors influence each other.
Regression analysis is a statistical process that estimates the relationship between a dependent
variable and one or more independent variables.

• E.g., Logistic regression


Used to predict categorical dependent variables, such as yes or no, true or false, or 0 or
1. For example, insurance companies use logistic regression to decide whether to approve
a new policy.
5) Cluster Analysis:
Clustering groups data without any model. Clustering analyzes data objects without
consulting class labels
Ex: Cluster formed by buying preferences.
6) Outlier Analysis:
Finding out data which differ drastically from others.
Ex: Fraud detection.
7) Evolution Analysis:
Describes and models trends for objects whose behavior changes over time. Ex: Shares.

Major Statistical Data Mining Methods

o Regression

o Generalized Linear Model


o Analysis of Variance

o Mixed-Effect Models

o Factor Analysis

o Discriminant Analysis

o Survival Analysis

Visual Data Mining

o Visualization: Use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data

o Visual Data Mining: discovering implicit but useful knowledge from large data sets using
visualization techniques

Visual data mining discovers implicit and useful knowledge from large data sets using data and/or
knowledge visualization techniques. Visual data mining can be viewed as an integration of two
disciplines: data visualization and data mining. It is also closely related to computer graphics,
multimedia systems, human–computer interaction, pattern recognition, and high-performance
computing.

In general, data visualization and data mining can be integrated in the following ways:
Data visualization: Data in a database or data warehouse can be viewed at different granularity or
abstraction levels, or as different combinations of attributes or dimensions. Data can be presented
in various visual forms, such as boxplots, 3-D cubes, data distribution charts, curves, surfaces, and
link graphs, etc. Visual display can help give users a clear impression and overview of the data
characteristics in a large data set.

Data mining result visualization: Visualization of data mining results is the presentation of the
results or knowledge obtained from data mining in visual forms. Such forms may include scatter
plots and boxplots , as well as decision trees, association rules, clusters, outliers, and generalized
rules.

Data mining process visualization: This type of visualization presents the various processes of data
mining in visual forms so that users can see how the data are extracted and from which database
or data warehouse they are extracted, as well as how the selected data are cleaned, integrated,
preprocessed, and mined. Moreover, it may also show which method is selected for data mining,
where the results are stored, and how they may be viewed.

Audio Data Mining


◼ Uses audio signals to indicate the patterns of data or the features of data mining results


◼ An interesting alternative to visual mining


◼ An inverse task of mining audio (such as music) databases which is to find patterns from
audio data
◼ Visual data mining may disclose interesting patterns using graphical displays, but requires
users to concentrate on watching patterns
◼ Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and
melody in order to identify anything interesting or unusual

KDD Vs Data Mining

KDD: Knowledge Discovery in Databases (KDD) is a process that automatically discovers patterns, rules, and other regular contents in large amounts of data.
Data Mining: Data mining (DM) is a step in the KDD process that involves applying algorithms to extract patterns from data.

KDD: KDD is a systematic process for identifying patterns in large and complex data sets.
Data Mining: Data mining is the foundation of KDD and is essential to the entire methodology.

KDD: The overall set of processes for knowledge extraction, such as data cleaning, data selection, data integration, data mining, pattern evaluation, and knowledge presentation.
Data Mining: The process of extraction of hidden knowledge from large data. Intelligent algorithms are used to extract useful information through data categorization, data characterization, data discrimination, association, frequent pattern mining, regression, outlier analysis, classification, clustering, etc.

KDD: Contains several steps.
Data Mining: It is one step in KDD.

KDD: Sometimes used as an alias for data mining.
Data Mining: Sometimes used as an alias for KDD.

DBMS Vs Data Mining

DBMS: A system to manage the data in a database: creation, insertion, deletion, updating, etc.
Data Mining: The process of extraction of hidden knowledge from large data. Intelligent algorithms are used to extract useful information through data categorization, data characterization, data discrimination, association, frequent pattern mining, regression, outlier analysis, classification, clustering, etc.

DBMS: Stores data in a format suitable for data management.
Data Mining: Data from the database is used for mining.

DBMS: Application oriented.
Data Mining: Fact oriented.

DBMS: Concerned with business transactions such as insertion, deletion, etc.
Data Mining: Concerned with hidden knowledge extraction using intelligent algorithms.

DBMS: Uses SQL.
Data Mining: Uses algorithms.

DBMS: Stores and manages data.
Data Mining: Analyzes data.

DBMS: Used to manage the data of an organization.
Data Mining: Used to extract valuable information from the data generated in the organization.

DBMS: Query-based (transactional) processing.
Data Mining: Analytical processing.

Major Issues in DM

a. Mining Methodology:
Researchers have been vigorously developing new DM techniques. This involves the
investigation of new kinds of knowledge, mining in multidimensional space, integrating
methods from other disciplines, and consideration of semantic ties among data objects.
b. User Interaction:
Users play an important role in DM process. Interesting areas of research include how
to interact with a DMS, how to incorporate a user’s background knowledge in mining and
how to visualize and comprehend data mining results.
c. Efficiency & Scalability:
DM algorithms must be efficient and scalable in order to effectively extract information
from huge amounts of data in many data repositories or in dynamic data streams. In other
words, the running time of the algorithms must be short.
d. Diversity of database types:
The discovery of knowledge from different sources of structured or unstructured yet
interconnected data with diverse data semantics poses great challenges to DM.
e. DM & Society:
I. Social Impact of DM:
The improper disclosure or use of data & the potential violations of individual
privacy and data protection rights are areas of concern that need to be addressed.
II. Privacy-Preserving DM:
DM poses a risk of disclosing an individual's personal information. The research aim
is to handle sensitive data and preserve people's privacy while performing successful
DM.
III. Invisible DM:
When purchasing online, the users might be unaware that the store is likely
collecting data on the buying patterns of its customers, which may be used to
recommend other items for purchase in the future.


Data Mining Applications

Two highly successful and popular application examples of data mining:

1. Business Intelligence:
BI technologies provide historical, current, and predictive views of business operations.
Without data mining, many businesses may not be able to perform effective market analysis,
compare customer feedback on similar products, discover the strengths and weaknesses of
competitors, perform predictive analysis, etc.
2. Web Search Engines:
Web search engines are very large DM applications. Various DM tasks such as crawling,
indexing, ranking, and searching are used.

Other important applications of Data Mining are:

Data Mining for Financial Data Analysis


o Financial data collected in banks and financial institutions are often relatively complete,
reliable, and of high quality
A credit card company can leverage its vast warehouse of customer transaction data to identify
customers most likely to be interested in a new credit product.
• Credit card fraud detection.
• Identify ‘Loyal’ customers.
• Extraction of information related to customers.
• Determine credit card spending by customer groups.
• Consumer credit rating
o Classification and clustering of customers for targeted marketing
- multidimensional segmentation by nearest-neighbor, classification, decision trees,
etc. to identify customer groups or associate a new customer to an appropriate
customer group
o Detection of money laundering and other financial crimes
- Integration of data from multiple DBs (e.g., bank transactions, federal/state crime
history DBs)
- Tools: data visualization, linkage analysis, classification, clustering tools, outlier
analysis, and sequential pattern analysis tools (find unusual access sequences)
Data Mining for Retail & Telecom Industries
o Retail industry: huge amounts of data on sales, customer shopping history, e-commerce,
etc.

o Applications of retail data mining

- Identify customer buying behaviors


- Discover customer shopping patterns and trends

- Improve the quality of customer service

- Achieve better customer retention and satisfaction

- Enhance goods consumption ratios

- Design more effective goods transportation and distribution policies

o Telecom and many other industries share many similar goals and expectations of retail
data mining

Data Mining Practice for Retail Industry

o Design and construction of data warehouses

o Multidimensional analysis of sales, customers, products, time, and region

o Analysis of the effectiveness of sales campaigns

o Customer retention: Analysis of customer loyalty

- Use customer loyalty card information to register sequences of purchases of


particular customers

- Use sequential pattern mining to investigate changes in customer consumption or


loyalty

- Suggest adjustments on the pricing and variety of goods

o Product recommendation and cross-reference of items

o Fraud analysis and the identification of unusual patterns

o Use of visualization tools in data analysis

Data Mining in Science and Engineering

o Data warehouses and data preprocessing

- Resolving inconsistencies or incompatible data collected in diverse environments


and different periods (e.g. eco-system studies)

o Mining complex data types

- Spatiotemporal, biological, diverse semantics and relationships

o Graph-based and network-based mining


- Links, relationships, data flow, etc.

o Visualization tools and domain-specific knowledge

o Other issues

- Data mining in social sciences and social studies: text and social media

- Data mining in computer science: monitoring systems, software bugs, network


intrusion

Data Mining for Intrusion Detection and Prevention

Data mining techniques play a vital role in detecting intrusions, network attacks, and
anomalies. These techniques help in selecting and refining useful and relevant information from
large data sets. Data mining techniques help classify relevant data for an Intrusion Detection
System. An Intrusion Detection System generates alarms about foreign invasions based on the
network traffic. For example:
• Detect security violations
• Misuse Detection
• Anomaly Detection

o Majority of intrusion detection and prevention systems use

- Signature-based detection: use signatures, attack patterns that are preconfigured


and predetermined by domain experts

- Anomaly-based detection: build profiles (models of normal behavior) and detect
those that substantially deviate from the profiles

Data Mining and Recommender Systems

o Recommender systems: Personalization, making product recommendations that are likely


to be of interest to a user

o Approaches: Content-based, collaborative, or their hybrid

- Content-based: Recommends items that are similar to items the user preferred or
queried in the past

- Collaborative filtering: Consider a user's social environment, opinions of other


customers who have similar tastes or preferences

Business Transactions: Every business transaction is recorded for perpetuity. Such transactions
are usually time-related and can be inter-business deals or intra-business operations. The
effective and timely use of the data in a reasonable time frame for competitive decision-making
is definitely the most important problem to solve for businesses that struggle to survive in a
highly competitive world. Data mining helps to analyze these business transactions and identify
marketing approaches and support decision-making. Example:
• Direct mail targeting
• Stock trading
• Customer segmentation
Market Basket Analysis: Market basket analysis is a technique that carefully studies the purchases
made by customers in a supermarket. It identifies patterns of items frequently purchased together
by customers. This analysis can help companies promote deals, offers, and sales, and data mining
techniques help achieve this analysis task. Example:
• Data mining concepts are in use for Sales and marketing to provide better customer
service, to improve cross-selling opportunities, to increase direct mail response rates.
• Customer Retention in the form of pattern identification and prediction of likely
defections is possible by Data mining.
• Risk Assessment and Fraud area also use the data-mining concept for identifying
inappropriate or unusual behavior etc.
Education: For analyzing the education sector, data mining uses Educational Data Mining
(EDM) method. This method generates patterns that can be used both by learners and educators.
By using data mining EDM we can perform some educational task:
• Predicting students admission in higher education
• Predicting students profiling
• Predicting student performance
• Teachers teaching performance
• Curriculum development
• Predicting student placement opportunities
Research: A data mining technique can perform predictions, classification, clustering,
associations, and grouping of data with perfection in the research area. Rules generated by data
mining are unique to find results. In most of the technical research in data mining, we create a
training model and testing model. The training/testing model is a strategy to measure the
precision of the proposed model. It is called Train/Test because we split the data set into two
sets: a training data set and a testing data set. A training data set used to design the training model
whereas testing data set is used in the testing model. Example:
• Classification of uncertain data.
• Information-based clustering.
• Decision support system
• Web Mining
• Domain-driven data mining
• IoT (Internet of Things)and Cybersecurity
• Smart farming IoT(Internet of Things)
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force activity
and its outcomes to improve the targeting of high-value physicians and determine which
marketing activities will have the best effect in the upcoming months. In the insurance sector,
data mining can help predict which customers will buy new policies, identify behavior patterns
of risky customers, and identify fraudulent behavior of customers.
• Claims analysis i.e which medical procedures are claimed together.
• Identify successful medical therapies for different illnesses.


• Characterizes patient behavior to predict office visits.


Transportation: A diversified transportation company with a large direct sales force can apply
data mining to identify the best prospects for its services. A large consumer goods company can
apply data mining to improve its sales process to retailers.
• Determine the distribution schedules among outlets.
• Analyze loading patterns.


Program Name: B.C.A
Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2
No. of Credits: 03

Unit 2
Topics:

Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization.

Data Warehouse:

According to William H. Inmon, a leading architect in the construction of data warehouse systems
(Father of Data Warehouse- American Computer Scientist), “A data warehouse is a subject-
oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s
decision making process”. In simple words, it is a centralized repository of data from multiple
sources that supports management's decision-making process.

Subject-oriented: A data warehouse is organized around major subjects such as customer,


supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective
(e.g., the past 5–10 years).
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse
does not require transaction processing, recovery, and concurrency control mechanisms. It usually
requires only two operations in data accessing: initial loading of data and access of data


Data warehousing:

The process of constructing and using data warehouses, as shown in the following figure.

Fig 1.1: Data ware house of a sales organization.

OLTP (Online Transaction Processing) vs. OLAP (Online Analytical Processing)

1. OLTP (Online Transaction Processing)

• Purpose: Handles real-time transaction processing.

• Usage: Used in operational systems like banking, retail, and airline reservations.

• Data Type: Stores current and detailed transactional data.

• Operations: Frequent, short, atomic transactions (INSERT, UPDATE, DELETE).

• Performance: Optimized for fast query processing and high availability.

• Example: A banking system processing multiple account transactions simultaneously.

2. OLAP (Online Analytical Processing)

• Purpose: Supports complex analysis and decision-making.

• Usage: Used in business intelligence, data mining, and reporting.

• Data Type: Stores historical and aggregated data for analysis.

• Operations: Complex queries involving multi-dimensional data (e.g., SUM, AVERAGE,


GROUP BY).


• Performance: Optimized for read-heavy operations and large data analysis.

• Example: A sales dashboard analyzing monthly revenue trends.

OLTP vs. OLAP:

users: OLTP – clerk, IT professional; OLAP – knowledge worker

function: OLTP – day-to-day operations; OLAP – decision support

DB design: OLTP – application-oriented; OLAP – subject-oriented

data: OLTP – current, up-to-date, detailed, flat relational, isolated; OLAP – historical, summarized, multidimensional, integrated, consolidated

usage: OLTP – repetitive; OLAP – ad-hoc

access: OLTP – read/write, index/hash on primary key; OLAP – lots of scans

unit of work: OLTP – short, simple transactions; OLAP – complex queries

# records accessed: OLTP – tens; OLAP – millions

# users: OLTP – thousands; OLAP – hundreds

DB size: OLTP – 100 MB to GB; OLAP – 100 GB to TB

metric: OLTP – transaction throughput; OLAP – query throughput, response time


Data Warehousing: Three Tier Architecture

Data warehouses often adopt a three-tier architecture, as presented in Figure.

Fig. Three Tier Architecture of Data warehousing

◼ The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile information provided
by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as
well as load and refresh functions to update the data warehouse

◼ The middle tier is an OLAP server that is typically implemented using either

(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or


(2) a Multi-dimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).

◼ The top tier is a front-end client layer , which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
o Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
o Data Mart
o a subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to specific, selected groups, such as a marketing data mart
o Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.


Fig: A recommended approach for data warehouse development

Data Warehouse Modeling: Data Cube and OLAP


Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube.
o A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts. Fact tables contain numerical data, while dimension
tables provide context and background information.
- Dimension tables, such as item (item_name, brand, type), or time(day, week,
month, quarter, year) (entities in which org keeps records), Store descriptive
attributes (product, customer, date, store).
- Fact table contains numeric measures (such as dollars_sold(sale amt in $), units
sold) and keys to each of the related dimension tables
Example Scenario: Sales Data Warehouse
Fact Table (Transactional Data)
• Fact_Sales (Stores measurable business data)
o Sales_ID (Primary Key)
o Date_Key (Foreign Key to Date Dimension)
o Product_Key (Foreign Key to Product Dimension)
o Customer_Key (Foreign Key to Customer Dimension)
o Store_Key (Foreign Key to Store Dimension)


o Sales_Amount (Measure)
o Quantity_Sold (Measure)
Dimension Tables (Descriptive Data)
1. Dim_Date (Time-based details)
o Date_Key (Primary Key)
o Date
o Month
o Quarter
o Year
2. Dim_Product (Product details)
o Product_Key (Primary Key)
o Product_Name
o Category
o Brand
3. Dim_Customer (Customer details)
o Customer_Key (Primary Key)
o Customer_Name
o Age
o Gender
o Location
4. Dim_Store (Store details)
o Store_Key (Primary Key)
o Store_Name
o City
o Region

In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is
typically denoted by ‘all’.


The lattice (a patterned structure, like a fence) of cuboids forms a data cube, as shown below.


Schemas for Multidimensional Data Models

In multidimensional data modeling for a data warehouse, three common schemas define how fact
and dimension tables are structured:

1. Star schema: A fact table in the middle connected to a set of dimension tables.

Dimensions are directly linked to the fact table.

Pros: Simple, fast query performance.

Cons: Data redundancy in dimensions.

2. Snowflake schema: A refinement of star schema where some dimensional


hierarchy is normalized into a set of smaller dimension tables, forming a shape
similar to snowflake. Dimension tables are split into sub-dimensions to reduce
redundancy.

Pros: Saves storage space.

Cons: Complex queries, slower performance.


3. Fact constellations: Multiple fact tables share dimension tables, viewed as a


collection of stars, therefore called galaxy schema or fact constellation. Used when
multiple business processes are analyzed together.

Pros: Flexible for large-scale data warehousing.

Cons: Complex structure.


Choosing the Right Schema:

Star Schema → Best for fast query performance and simple design.

Snowflake Schema → Best for storage optimization when normalization is needed.

Galaxy Schema → Best for complex business models with multiple fact tables.

OLAP Operations

o Roll up (drill-up): summarize data or aggregation of data

- by climbing up hierarchy or by dimension reduction

- In the cube given in the overview section, the roll-up operation is


performed by climbing up in the concept hierarchy
of Location dimension (City -> Country).


o Drill down (roll down): In drill-down operation, the less detailed data is converted into
highly detailed data. It can be done by:

- Moving down in the concept hierarchy

- Adding a new dimension

- In the cube given in overview section, the drill down operation is


performed by moving down in the concept hierarchy
of Time dimension (Quarter -> Month).

o Slice: Extracts a subset of the data for a single dimension value. It selects a single
dimension from the OLAP cube which results in a new sub-cube creation.


Example: Viewing sales data only for Q1 2024.

o Dice: Extracts a subset of data based on multiple conditions (multiple slices).

Example: Viewing sales for Q1 2024 in New York for Electronics category.

o Pivot (rotate):

- reorient the cube, visualization, 3D to series of 2D planes. Rearranges data for


better visualization by switching rows and columns.

- It is also known as rotation operation as it rotates the current view


to get a new view of the representation. In the sub-cube obtained
after the slice operation, performing pivot operation gives a new
view of it.


Summary Table:

OLAP Operation – Function – Example

Roll-Up – Aggregates data to a higher level – Sales from monthly → yearly

Drill-Down – Breaks data into a finer level – Sales from yearly → monthly

Slice – Selects data for one dimension – Sales only for Q1 2024

Dice – Filters data for multiple conditions – Sales for Q1 2024 & Electronics category

Pivot – Rotates data for different perspectives – Sales by category vs. year
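As an illustration (not from the notes), these operations can be mimicked on a small table in Python with pandas; the column names and values below are assumptions:

import pandas as pd

# Hypothetical fact table with time, location, and category dimensions.
sales = pd.DataFrame({
    "Quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "Month":    ["Jan", "Feb", "Apr", "May"],
    "City":     ["New York", "Mysore", "New York", "Mysore"],
    "Category": ["Electronics", "Clothing", "Electronics", "Clothing"],
    "Sales":    [1200, 800, 1500, 600],
})

# Roll-up: aggregate months up to quarters.
rollup = sales.groupby("Quarter")["Sales"].sum()

# Slice: fix a single dimension value (Quarter = Q1).
slice_q1 = sales[sales["Quarter"] == "Q1"]

# Dice: filter on multiple dimensions (Q1, New York, Electronics).
dice = sales[(sales["Quarter"] == "Q1") &
             (sales["City"] == "New York") &
             (sales["Category"] == "Electronics")]

# Pivot: rearrange rows/columns for a different view.
pivot = sales.pivot_table(index="Category", columns="Quarter",
                          values="Sales", aggfunc="sum")
print(rollup, slice_q1, dice, pivot, sep="\n\n")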

Data Cleaning

Today’s data are highly susceptible to noisy, missing and inconsistent data due to their typically
huge size and because of heterogeneous sources. Low quality data will lead to poor mining results.

Different data preprocessing techniques (data cleaning, data integration, data reduction, data
transformation), when applied before data mining, improve the overall quality of the patterns
mined and reduce the time required for the actual mining. The data cleaning stage helps smooth
out noise, fill in missing values, remove outliers, and correct inconsistencies in the data.

Different types of data cleaning tasks:

1) Handling missing values: Missing values are encountered due to Data entry errors,
system failures, incomplete records.
Techniques to handle missing values:

i. Ignoring the tuple: Used when the class label is missing. This method is not very
effective when many missing values are present.
ii. Fill in missing value manually: It is time consuming.
iii. Using global constant to fill missing value: Ex: unknown or ∞
iv. Use attribute mean to fill the missing value
v. Use attribute mean for all samples belonging to the same class as the given
tuple
vi. Use most probable value to fill the missing value: (using decision tree)


2) Handling Noisy Data: Noise is a random error or variance in a measured variable, caused by
sensor errors, outliers, rounding errors, or incorrect data entry.

Techniques to handle noisy data are:

1. Smoothing: Average out fluctuations in the data.


Techniques for smoothing are:
a) Binning: Smooth the sorted data by consulting its neighborhood. The values are
distributed into buckets/bins. They perform local smoothing.

Different binning methods for data smoothing:

i. Smoothing by bin means: Each value in bin is replaced by mean


Ex: BIN 1 : 4,8,15 = BIN 1: 9,9,9
ii. Smoothing by bin boundaries: Min and max value is identified and value is
replaced by closest boundary value
Ex: BIN 1 : 4,8,15 = BIN 1: 4,4,15
b) Regression: Data smoothing can also be done by regression (linear regression,
multiple linear regression). In this one attribute can be used to predict the value of
another.
c) Outlier analysis: Outliers can be done by clustering. The value outside the clusters
are outliers.
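A minimal illustration (not from the notes) of bin-based smoothing in Python with NumPy, using the example bin 4, 8, 15 from the binning methods above:

import numpy as np

# Example bin from the notes: sorted values 4, 8, 15.
bin_values = np.array([4, 8, 15])

# Smoothing by bin means: replace every value with the bin mean.
by_means = np.full_like(bin_values, bin_values.mean(), dtype=float)   # -> [9., 9., 9.]

# Smoothing by bin boundaries: replace every value with the closer of min/max.
lo, hi = bin_values.min(), bin_values.max()
by_boundaries = np.where(bin_values - lo <= hi - bin_values, lo, hi)  # -> [4, 4, 15]

print(by_means, by_boundaries)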

Data Integration

Data mining often works on integrated data from multiple repositories. Careful integration helps
in accuracy of data mining results.

Challenges of DI

1. Entity Identification Problem:


“How to match schema and objects from many sources?” This is called Entity
Identification Problem.
Ex: Cust-id in one table and Cust-no in another table.
Metadata helps in avoiding these problems.
2. Redundancy and correlation analysis:
Redundancy -> repetition.
Some redundancy can be detected by correlation analysis. Given two attributes, correlation
analysis tells how strong the relationship between them is (the chi-square test and the
correlation coefficient are examples).


Data Reduction

Data Reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintain the integrity of the original data.

Data Reduction Strategies:

1. Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Ex: Attribute subset selection, Wavelet Transform, PCA.
2. Numerosity reduction:
Replace original data by alternate smaller forms, clustering.
Ex: Histograms, Sampling, Data cube aggregation,
3. Data compression:
Reduce the size of data.

Wavelet Transform:

DWT- Discrete Wavelet Transform is a linear signal processing technique, that when applied to a
data vector X, transforms it to a numerically different vector X’ of same length. The DWT is a fast
and simple transformation that can translate an image from the spatial domain to the frequency
domain.

Principal Components Analysis(PCA)

PCA reduces the number of variables or features in a data set while still preserving the most
important information like major trends or patterns.

Attribute Subset Selection:

A dataset for analysis may consist of many attributes that are irrelevant to the mining task (e.g.,
a telephone number may not be important when classifying customers). Attribute subset selection
reduces the data set by removing irrelevant attributes.

Some heuristics methods for attribute subset selection are:

1. Stepwise forward selection (see the sketch after this list):

• Start with an empty set.
• The best attribute is added to the reduced set.
• At each iteration, the best of the remaining attributes is added.
• At each iteration, the rest of remaining attribute are added.
2. Stepwise backward elimination:
• Start with full set of attributes
• At each step, remove the worst attributes.
3. Combination of forward selection & backward selection:


• Combined method
• At each step, procedure selects the best attribute & remove worst from remaining.
4. Decision Tree Induction:
In DTI, a tree is constructed from the given data. All attributes that do not appear in the tree
are assumed to be irrelevant. Measures such as information gain, gain ratio, Gini index, chi-
square statistics, etc. are used to select the best attributes out of the set of attributes, thereby
reducing the number of attributes.
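A minimal, generic sketch of greedy stepwise forward selection in Python (illustrative only; score() is a hypothetical evaluation function, e.g., cross-validated accuracy, and is not part of the notes):

def forward_selection(all_attributes, score):
    """Greedy stepwise forward selection.

    `score(subset)` is a hypothetical callable that returns how well a model
    performs using only `subset` (e.g., cross-validated accuracy).
    """
    selected = []                      # start with the empty set
    remaining = list(all_attributes)
    best_score = float("-inf")
    while remaining:
        # Try adding each remaining attribute and keep the best candidate.
        candidate, cand_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score <= best_score:   # no improvement -> stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected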

Histograms:

Histogram is a frequency plot. It uses bins/buckets to approximate data distributions and are
popular form of data reduction. They are highly effective at approximating both sparse & dense
data as well as skewed & uniform data.

The following data are a list of AllElectronics prices for commonly sold items (rounded to the
nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30. The figure shows the histogram for this data.


Fig : Histogram for ALL Electronics

Clustering:

Clustering partition data into clusters/groups which are similar/close. In data reduction, cluster
representation of data are used to replace the actual data. Instead of storing all data points, store
only cluster centroids or representative points.

Example:

• Given a dataset with 1 million customer records, k-means clustering can reduce it to 100
clusters, where each centroid represents a group of similar customers.

Clustering can identify important features and remove redundant ones.

Example:

• In gene expression data, clustering similar genes can help reduce thousands of variables
into meaningful groups.

Instead of analyzing the entire dataset, work on a sample of clusters that represent the whole
population.


Example:

• Market research: Instead of surveying all customers, businesses analyze a few customer
segments.

Clustering helps detect and remove outliers, reducing noise in the dataset.

Example:

• Fraud detection: Unusual transaction patterns form separate clusters, helping identify
fraudulent activities.
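For illustration (the tooling is an assumption, not named in the notes), k-means from scikit-learn can replace a large dataset by a small set of representative centroids:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer feature matrix (e.g., age and monthly spend).
rng = np.random.default_rng(0)
customers = rng.normal(size=(10_000, 2))

# Reduce 10,000 records to 100 representative centroids.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(customers)
reduced = kmeans.cluster_centers_      # a 100 x 2 summary of the original data
print(reduced.shape)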

Sampling:

Used as data reduction technique in which large data are represented as small random samples
(subset).

Common ways to sample:

i. Simple random sample without replacement of size s (SRSWOR)

This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple
in D is 1/N, that is, all tuples are equally likely to be sampled.

ii. Simple random sample with replacement(SRSWR)

This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.

iii. Cluster sample

The tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be obtained,
where s < M .

iv. Stratified sample

If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group. In this way, the age group having the
smallest number of customers will be sure to be represented

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional
to the size of the sample, s , as opposed to N , the data set size. Hence, sampling complexity is
potentially sublinear to the size of the data.


Fig. Sampling Techniques
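A minimal illustration (not from the notes) of these sampling schemes in Python with pandas; the data set D and the age-group stratum column are assumptions:

import pandas as pd

# Hypothetical data set D with an age-group column used as the stratum.
D = pd.DataFrame({
    "Customer_ID": range(1, 1001),
    "Age_Group": (["young"] * 700) + (["middle"] * 250) + (["senior"] * 50),
})

s = 100  # desired sample size

# SRSWOR: simple random sample without replacement.
srswor = D.sample(n=s, replace=False, random_state=0)

# SRSWR: simple random sample with replacement.
srswr = D.sample(n=s, replace=True, random_state=0)

# Stratified sample: take 10% from every age-group stratum.
stratified = D.groupby("Age_Group").sample(frac=0.10, random_state=0)

print(len(srswor), len(srswr), len(stratified))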

Data Cube Aggregation:

• Aggregate data into one view.


• Data cube store multidimensional aggregated information.
• Data cube provides fast access to precomputed, summarized data, thereby benefits
OLAP/DM.
• Data cube created for varying level of abstraction are often referred to as cuboids.
• Cube created at lowest level of abstraction is base cuboids.
o Ex: Data regarding sales or customers.
• Cube created at highest level of abstraction is apex cuboids.
o Ex: Total sales for all 3 years, for items.

Fig. Data Cube

Data Transformation

The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.

Data Transformation Strategies overview:

1. Smoothing: Performed to remove noise.


Ex: Binning, regression, clustering.
2. Attribute construction: New attributes are added to help mining process.
3. Aggregation: Data is summarized or aggregated.
Ex: Sales data is aggregated into monthly & annual sales. This step is used for constructing
data cube.
4. Normalization: Data is scaled so as to fall within a smaller range.
Ex: -1.0 to +1.0.
5. Data Discretization: Where raw values are replaced by interval labels or conceptual labels.
Ex: Age
• Interval labels (10-18, 19-50)
• Conceptual labels (youth, adult)
6. Concept hierarchy generation for nominal data: Attributes are generalized to higher level
concepts
Ex: Street is generalized to city or country.


Data Transformation by Normalization:

The measurement unit used can affect data analysis. To help avoid dependence on the choice of
measurement units, the data should be normalized or standardized. This involves transforming the
data to fall within a smaller or common range such as Range = [-1,1], [0.0,1.0].

Normalizing the data attempts to give all attributes an equal weight. For example, measuring height in meters versus inches leads to different results because the attribute with the larger range dominates distance computations. Normalization is useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering. Common methods for normalization include min-max normalization, z-score normalization and normalization by decimal scaling.

Min-Max Normalization:

a) Find min & max no. in the data.


b) Transform the data to the range [new_minA, new_maxA] by computing

   Vi' = ((Vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

c) It preserves the relationship among the original data values


Ex: If the minimum income is Rs. 12,000 and the maximum income is Rs. 98,000, and the new range is [0.0, 1.0], a value Vi = Rs. 73,600 is transformed into

   Vi' = ((73600 − 12000) / (98000 − 12000)) × (1.0 − 0.0) + 0 = 0.716
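A minimal sketch of the same calculation in plain Python (the helper name min_max is illustrative):

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Scale v from [min_a, max_a] into [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716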

Z-score Normalization:

Values of an attribute A are normalized based on the mean and standard deviation of A:

   Vi' = (Vi − Ā) / σA,   where Ā = mean of A and σA = standard deviation of A.

Also, the mean absolute deviation (sA) could be used in place of σA, since it is more robust to outliers than the standard deviation.

Decimal Scaling:

a) Normalizes by moving the decimal point of values.


b) The number of decimal places moved depends on the maximum absolute value of A:

   Vi' = Vi / 10^j,   where j is the smallest integer such that max(|Vi'|) < 1.

Ex: Suppose A ranges from −986 to 917.

The maximum absolute value is 986, so j = 3.
Divide each number by 1000 (i.e., 10³).
Therefore the normalized values range from −0.986 to 0.917.
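A minimal sketch of z-score normalization and decimal scaling, assuming NumPy is available; apart from the endpoints −986 and 917 used above, the attribute values are synthetic.

import numpy as np

A = np.array([-986.0, 120.0, 350.0, 917.0])

# Z-score: subtract the mean of A and divide by its standard deviation.
z = (A - A.mean()) / A.std()

# Decimal scaling: divide by 10**j, with j chosen so that the largest
# absolute value drops below 1 (here max|A| = 986, so j = 3).
j = int(np.ceil(np.log10(np.abs(A).max())))
scaled = A / 10 ** j

print(np.round(z, 2))
print(scaled)          # [-0.986  0.12   0.35   0.917]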


Program
Name B.C.A Semester VI
Course Title Fundamentals of Data Science (Theory)
Unit 3

Topics:

Mining Frequent Patterns: Basic Concept – Frequent Item Set Mining Methods -Apriori
and Frequent Pattern Growth (FPGrowth) algorithms -Mining Association Rules.

Basic Concepts

Item: Refers to an item/product/data value in a dataset. E.g., Mobile, Case, Mouse, Keyboard,
Temp, Cold, etc.

Itemset: Set of items in a single transaction. Eg., X={Mobile, charger, screen guard}
Y={Headset, pendrive};

Frequent Itemset: An itemset that occurs repeatedly/frequently in a dataset (i.e., in many transactions); formally, an itemset whose support count ≥ the minimum support count.

X={X1,X2,X3, ….., Xk}

Where k refers numbers of items in an itemset (k-itemsets)

Support: It is a measure of frequency of items occurring in a dataset. It is the probability that


a transaction contains the item.

Support is often used as a threshold for identifying frequent item sets in a dataset, which can
be used to generate association rules. For example, if we set the support threshold to 5%, then
any itemset that occurs in more than 5% of the transactions in the dataset will be considered a
frequent itemset.

Support(X) = (Number of transactions containing X) / (Total number of transactions)

where X is the itemset for which you are calculating the support.

Support(X -> Y) = Support_count(X ∪ Y) / (Total number of transactions)

Closed Itemset: A frequent itemset with no superset that has the same support.

For example, if a dataset contains 100 transactions and the item set {milk, bread} appears in
20 of those transactions, the support count for {milk, bread} is 20. If there is no superset of
{milk, bread} that has a support count of 20, then {milk, bread} is a closed frequent itemset.


Closed frequent itemsets are useful for data mining because they can be used to identify
patterns in data without losing any information. They can also be used to generate association
rules, which are expressions that show how two or more items are related.

Maximal Frequent Itemset: A frequent itemset with no superset that is also frequent. For example, an itemset {a, b, c} is a maximal frequent itemset if it is frequent and has no frequent superset.

Confidence:

Confidence is a measure of the likelihood that an itemset will appear if another itemset appears; it is based on conditional probability. For example, suppose we have a dataset of 1000 transactions, and the itemset {milk, bread} appears in 100 of those transactions. The itemset {milk} appears in 200 of those transactions. The confidence of the rule “If a customer buys milk, they will also buy bread” would be calculated as follows:

Confidence("If a customer buys milk, they will also buy bread")

= Number of transactions containing

{milk, bread} / Number of transactions containing {milk}

= 100 / 200

= 50%

i.e., if a customer buys milk then there is 50% chances that the customer will buy bread.

Confidence(X => Y) = (Number of transactions containing X and Y) / (Number of


transactions containing X)

Confidence(X -> Y) = Support_count(X ∪ Y) / Support_count(X)

Support and confidence are two measures that are used in association rule mining to evaluate
the strength of a rule. Both support and confidence are used to identify strong association
rules. A rule with high support is more likely to be of interest because it occurs frequently
in the dataset. A rule with high confidence is more likely to be valid because it has a high
likelihood of being true.
Lift Measure in Association Rule Mining
Lift is a metric used to evaluate the strength of an association rule. It measures how much
more likely the occurrence of Y is when X is present, compared to when X and Y are
independent.
Formula for Lift
Lift(X→Y)=Confidence(X→Y)/Support(Y)


Where:
• Support(Y) = Probability of Y occurring in the dataset.
• Confidence(X → Y) = Probability of Y occurring given X has occurred.
Interpreting Lift Values
• Lift = 1 → X and Y are independent (no association).
• Lift > 1 → Positive correlation (Y is more likely when X happens).
• Lift < 1 → Negative correlation (Y is less likely when X happens).

Example:

Milk → Bread (Lift = 1.5)

→ Customers who buy Milk are 1.5× more likely to buy Bread than random chance
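The three measures can be computed directly from a transaction list. A minimal sketch in plain Python on a small synthetic market-basket dataset (the helper names are illustrative):

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "cola"},
    {"bread", "butter"},
    {"milk", "bread", "cola"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / N

def confidence(X, Y):
    return support(X | Y) / support(X)

def lift(X, Y):
    return confidence(X, Y) / support(Y)

X, Y = {"milk"}, {"bread"}
print(support(X | Y), confidence(X, Y), lift(X, Y))   # 0.6 0.75 0.9375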
Example for Support, Closed and Maximal Itemset:
Given Transactions Dataset

TID Items Bought


1 A, B, C, D
2 A, B, C
3 A, B
4 B, C
5 B, C, D
6 A, B, C
Assume Minimum Support = 2 Transactions

Step 1: Finding Frequent Itemsets


1-Itemsets (Single Items)

Item Support Count


A 4
B 6
C 5
D 2
All items are frequent since support ≥ 2.


2-Itemsets (Pairs)

Itemset Support Count Frequent


{A, B} 4 YES
{A, C} 3 YES
{B, C} 4 YES
{B, D} 2 YES
{C, D} 2 YES
{A,D} 1 NO
{A,B} 4 YES

3-Itemsets (Triplets)

Itemset Support Count Frequent


{A, B, C} 3 YES
{B, C, D} 2 YES
{A,B,D} 1 NO
{A,C,D} 1 NO
4-Itemsets (Quadruplets)

Itemset Support Count Frequent


{A, B, C, D} 1 NO
Not frequent (support < 2).

Step 2: Identifying Closed Frequent Itemsets


A frequent itemset is closed if no proper superset has the same support.

Itemset Support Superset Exists with Same Support? Closed?


{B} 6 No YES
{C} 5 No YES
{A, B} 4 {A, B, C} has lower support (3 ≠ 4) Yes
{A, C} 3 {A, B, C} has the same support (3 = 3) No
{B, C} 4 {A, B, C} has lower support (3 ≠ 4) Yes
{B, D} 2 {B, C, D} has the same support (2 = 2) No
{C, D} 2 {B, C, D} has the same support (2 = 2) No
{A, B, C} 3 No superset with support 3 Yes


Itemset Support Superset Exists with Same Support? Closed?


{B, C, D} 2 No superset with support 2 Yes

Final Closed Frequent Itemsets:


{B},{C},{A, B}, {B, C}, {A, B, C}, {B, C, D}

Step 3: Identifying Maximal Frequent Itemsets


A frequent itemset is maximal if no superset is frequent.

Itemset Support Superset Exists That Is Frequent? Maximal?


{A, B} 4 {A, B, C} is frequent No
{A, C} 3 {A, B, C} is frequent No
{B, C} 4 {A, B, C} is frequent No
{B, D} 2 {B, C, D} is frequent No
{C, D} 2 {B, C, D} is frequent No
{A, B, C} 3 No frequent superset Yes
{B, C, D} 2 No frequent superset Yes
Final Maximal Frequent Itemsets:
{A, B, C}, {B, C, D}

• Frequent Itemsets: Appear ≥ min support (2 transactions).


• Closed Frequent Itemsets: No superset has the same support.
• Maximal Frequent Itemsets: No superset is frequent.
Association Rule:
Association rules are "if-then" statements, that help to know the relationships between the
items in large dataset. Frequent patterns are represented by association rules.
The association rule is of the form X => Y, i.e., IF X (antecedent) THEN Y (consequent).

It means: "If X occurs, then Y is likely to occur."

E.g., Buys (X,”Laptop”) => Buys (Y,”Wireless Mouse”) [Support=50%, Confidence=70%]

Real-Life Examples of Association Rules

Retail & E-commerce

{Laptop} → {Laptop Bag}


→ If a customer buys a laptop, they are likely to buy a laptop bag.

Healthcare

{Fever, Cough} → {Flu}

→ Patients with Fever and Cough are likely to have the Flu.

Banking & Fraud Detection

{High Transaction, Midnight} → {Fraud Alert}

→ Late-night high-value transactions are often fraudulent.

Streaming Services (Netflix, Spotify)

{Action Movies} → {Thriller Movies}

→ If a user watches Action movies, they are likely to watch Thrillers.


Frequent Pattern (Itemset) Mining:
Frequent pattern mining in data mining is the process of identifying patterns or associations
within a dataset that occur frequently. This is typically done by analyzing large datasets to
find items or sets of items that appear together frequently.
Importance of Frequent Pattern Mining:
It helps to find association, correlation and interesting relationship among data.

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as

frequently as a predetermined minimum support count, min sup.

2. Generate strong association rules from the frequent itemsets: By definition, these

rules must satisfy minimum support and minimum confidence.


Applications of Frequent Pattern Mining:

• Market basket analysis: Helps identify items that are commonly purchased
• Web usage mining: Helps understand user browsing patterns
• Bioinformatics: Helps analyze gene sequences
• Fraud detection: Helps identify unusual patterns


• Healthcare: Analyzing patient data and identifying common patterns or risk factors.
• Recommendation systems: Identify patterns of user interaction and helps with
recommendation to the users of an application.
• Cross-selling and up-selling : Identifying related products to recommend or suggest
to customers.

Frequent Itemset Mining Methods


Methods for mining the simplest form of frequent patterns.
1. Apriori Algorithm
2. Frequent Pattern Growth Mining
Apriori Algorithm:
Apriori is an important algorithm proposed by R. Agrawal and R. Srikant in 1994. It uses frequent itemsets to generate association rules. It is based on the Apriori property: every subset of a frequent itemset must also be frequent. For example, if the itemset {A, B, C} frequently appears in a dataset, then the subsets {A, B}, {A, C}, {B, C}, {A}, {B}, and {C} must also appear frequently in the dataset. It is an iterative technique which uses a breadth-first search strategy to discover repeating groups/patterns.
It contains two steps:
1. Join Step: Find the itemsets (Lk), where k represents 1, 2, 3.. Itemsets
2. Prune Step: Remove the itemsets in which sub items do not satisfy the min support
count threshold. And generate association rules using confidence and lift measures.
Technique:

1. Set the minimum support threshold - min frequency required for an itemset to be
"frequent".
2. Identify frequent individual items - count the occurrence of each individual item.
3. Generate candidate itemsets of size 2 - create pairs of frequent items discovered.
4. Prune infrequent itemsets - eliminate itemsets that do no meet the threshold levels.
5. Generate itemsets of larger sizes - combine the frequent itemsets of size 3,4, and so on.
6. Repeat the pruning process - keep eliminating the itemsets that do not meet the
threshold levels.
7. Iterate till no more frequent itemsets can be generated.
8. Generate association rules that express the relationship between them - calculate
measures to evaluate the strength & significance of these rules.


Algorithm:

Algorithm Apriori(T, min_sup):

1. L1 = {frequent 1-itemsets in transactions T}

2. k = 2

3. Repeat:

a. Generate candidate itemsets Ck from L(k-1)

b. **Prune Ck:** Remove itemsets that contain any infrequent (k-1)-subset

c. Count support for each remaining itemset in Ck


d. Lk = {itemsets in Ck with support ≥ min_sup}

e. k = k + 1

4. Until Lk is empty

5. Return all frequent itemsets (L1, L2, ..., Lk-1)
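A minimal runnable sketch of this join-and-prune loop in plain Python (the function apriori below is illustrative, not a library call); it is run on the same four transactions as the worked example that follows.

from itertools import combinations

def apriori(transactions, min_sup):
    # Returns {frozenset(itemset): support count} for all frequent itemsets.
    needed = min_sup * len(transactions)
    items = {i for t in transactions for i in t}
    freq, Lk = {}, []
    for i in items:                                   # L1: frequent 1-itemsets
        c = sum(i in t for t in transactions)
        if c >= needed:
            freq[frozenset([i])] = c
            Lk.append(frozenset([i]))
    k = 2
    while Lk:
        # Join step: build candidate k-itemsets from L(k-1).
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        Lk = []
        for c in candidates:
            count = sum(c <= t for t in transactions)
            if count >= needed:
                freq[c] = count
                Lk.append(c)
        k += 1
    return freq

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(transactions, min_sup=0.5))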

Example:
Consider a dataset of simple business transactions: Min support=50% and Threshold
confidence=70%
TID Items
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5

Where TID refers to the Transaction ID and 1, 2, 3, ... refer to items/products (for simplicity, numbers are used).

Step 1: Find the individual items and count their occurrences (support) using the formula below; consider the result as C1 (itemsets of size 1).
Support(X) = (Number of transactions containing X) / (Total number of transactions)

Item Support
1 2/4=50%
2 ¾=75%
3 ¾=75%
4 ¼=25%
5 ¾=75%
Remove the items which has support less than 50%.
Itemset- L1


1
2
3
5
Step 2: Form Itemset of size 2 (pairs) by using L1.
Item Support
1,2 1/4=25%
1,3 2/4=50%
1,5 1/4=25%
2,3 2/4=50%
2,5 3/4=75%
3,5 2/4=50%
Remove the items which has support less than 50%.
Itemset L2
Itemset
1,3
2,3
2,5
3,5

Step 3: Form Itemset of size 3 (triplets) by using L2.


Item Support
1,2,3 1/4=25%
1,3,5 1/4=25%
1,2,5 1/4=25%
2,3,5 2/4=50%
Remove the items which has support less than 50%.
Note: {1,2} has already been eliminated in step 2 therefore as per Apriori principle no need to
consider in this step.
Itemset L3


Itemset
2,3,5

As no more itemset of size 4 can be generated therefore stop the iteration.


Now compute the support and confidence for the generated association rules for itemset
{2,3,5}.
Confidence is computed using the formula:
Confidence(X -> Y) = Support_count(X ∪ Y) / Support_count(X)

Rule Support Confidence


(2^3)->5 2/4=50% 2/2=100%
(3^5)->2 2/4=50% 2/2=100%
(2^5)->3 2/4=50% 2/3=66%
2->(3^5) 2/4=50% 2/3=66%
3->(2^5) 2/4=50% 2/3=66%
5->(2^3) 2/4=50% 2/3=66%
Now remove the rules whose confidence is less than 70%(threshold confidence)
Final association rules generated are:
(2^3)->5
(3^5)->2
This will give us the relationship between the objects.
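A minimal sketch of the rule-generation step in plain Python, reproducing the support/confidence table above for the frequent itemset {2, 3, 5}:

from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
N = len(transactions)

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

itemset = frozenset({2, 3, 5})
min_conf = 0.70

for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(itemset, r)):
        consequent = itemset - antecedent
        conf = support_count(itemset) / support_count(antecedent)
        verdict = "keep" if conf >= min_conf else "drop"
        print(f"{set(antecedent)} -> {set(consequent)}: "
              f"support={support_count(itemset) / N:.0%}, confidence={conf:.0%} ({verdict})")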
Advantages:

1. Simplicity & ease of implementation


2. The rules are human-readable and easy to interpret
3. Works well on large dataset
4. Can generate strong rules for association

Disadvantages of Apriori algorithm:


1. Computational complexity: Requires many database scans.
2. Higher memory usage: Assumes transaction database is memory resident.
3. It needs to generate a huge no. of candidate sets.
4. Limited discovery of complex patterns
5. Slow


Improving the efficiency of Apriori Algorithm:


Here are some of the methods how to improve efficiency of apriori algorithm -

1. Hash-Based Technique: This method uses a hash-based structure called a hash table
for generating the k-itemsets and their corresponding count. Uses hash tables to reduce
the number of candidate k-itemsets.

Note: A hash table is a data structure that stores key-value pairs. It uses a hash
function to map keys to specific locations (indexes) in an array, making data retrieval
fast and efficient.

Example: A hash table stores itemset counts, and infrequent hash buckets are pruned early.
Benefit: Reduces the number of candidates in later iterations.

2. Transaction Reduction: This method reduces the number of transactions scanned in


iterations. After each pass, remove transactions that do not contain frequent k-itemsets.

Benefit: Reduces the number of database scans.

3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it should
be frequent in at least one of the partitions of the database.

Benefit: Reduces memory usage and speeds up processing.

4. Sampling: Instead of scanning the full database, analyze a random sample of


transactions. It may be possible to lose a global frequent itemset; this risk can be reduced by lowering min_sup.

Benefit: Reduces the number of scans and speeds up computation.

5. Dynamic Itemset Counting: This method allows for the addition of new candidate
itemsets at any point during the database scan. This can reduce the number of database
scans required.

Benefit: Reduces database passes.

Frequent Pattern-growth Algorithm

FP-growth is an algorithm for mining frequent patterns that uses a divide-and-conquer


approach. FP Growth algorithm was developed by Han in 2000. It constructs a tree-like data
structure called the frequent pattern (FP) tree, where each node represents an item in a frequent
pattern, and its children represent its immediate sub-patterns. By scanning the dataset only
twice, FP-growth can efficiently mine all frequent itemsets without generating candidate


itemsets explicitly. It is particularly suitable for datasets with long patterns and relatively low
support thresholds.

Working on FP Growth Algorithm

The working of the FP Growth algorithm in data mining can be summarized in the following
steps:

Scan the database:

In this step, the algorithm scans the input dataset to determine the frequency of each item. This
determines the order in which items are added to the FP tree, with the most frequent items
added first.

Sort items:

In this step, the items in the dataset are sorted in descending order of frequency. The infrequent
items that do not meet the minimum support threshold are removed from the dataset. This
helps to reduce the dataset's size and improve the algorithm's efficiency.

Construct the FP-tree:

In this step, the FP-tree is constructed. The FP-tree is a compact data structure that stores the
frequent itemsets and their support counts.

Generate frequent itemsets:

Once the FP-tree has been constructed, frequent itemsets can be generated by recursively
mining the tree. Starting at the bottom of the tree, the algorithm finds all combinations of
frequent item sets that satisfy the minimum support threshold.

Generate association rules:

Once all frequent item sets have been generated, the algorithm post-processes the generated
frequent item sets to generate association rules, which can be used to identify interesting
relationships between the items in the dataset.

FP Tree

The FP-tree (Frequent Pattern tree) is a data structure used in the FP Growth algorithm for
frequent pattern mining. It represents the frequent itemsets in the input dataset compactly and
efficiently. The FP tree consists of the following components:


Root Node:

The root node of the FP-tree represents an empty set. It has no associated item but a pointer to
the first node of each item in the tree.

Item Node:

Each item node in the FP-tree represents a unique item in the dataset. It stores the item name
and the frequency count of the item in the dataset.

Header Table:

The header table lists all the unique items in the dataset, along with their frequency count. It
is used to track each item's location in the FP tree.

Child Node:

Each child node of an item node represents an item that co-occurs with the item the parent
node represents in at least one transaction in the dataset.

Node Link:

The node-link is a pointer that connects each item in the header table to the first node of that
item in the FP-tree. It is used to traverse the conditional pattern base of each item during the
mining process.

The FP tree is constructed by scanning the input dataset and inserting each transaction into the
tree one at a time. For each transaction, the items are sorted in descending order of frequency
count and then added to the tree in that order. If an item exists in the tree, its frequency count
is incremented, and a new path is created from the existing node. If an item does not exist in
the tree, a new node is created for that item, and a new path is added to the tree. We will
understand in detail how FP-tree is constructed in the next section.


Example: Consider the following transactions with minimum support count >= 2.

Generate frequency table: Compute the frequency of each item

Item Frequency
I1 6
I2 7
I3 6
I4 2
I5 2
Remove all the items below minimum support in the above table: As all items are above the
threshold no items are removed.


Order the items in each transaction in descending order of frequency:

Transaction ID Items Ordered Itemset


T100 I1,I2,I5 I2,I1,I5
T200 I2,I4 I2,I4
T300 I2,I3 I2,I3
T400 I1,I2,I4 I2,I1,I4
T500 I1,I3 I1,I3
T600 I2,I3 I2,I3
T700 I1,I3 I1,I3
T800 I1,I2,I3,I5 I2,I1,I3,I5
T900 I1,I2,I3 I2,I1,I3

Use the ordered itemset in each transaction to build the FP tree:

An FP-tree is then constructed as follows. First, create the root of the tree, labeled with “null.” Scan database D a second time. The items in each transaction are processed in L order (i.e., sorted according to descending support count), and a branch is created for each transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,” which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩, where I2 is linked as a child to the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing path for T100. Therefore, we instead increment the count of the I2 node by 1, and create a new node, ⟨I4: 1⟩, which is linked as a child to ⟨I2: 2⟩. To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.
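A minimal sketch of this tree-building pass in plain Python (construction only, without the recursive mining of conditional FP-trees); the input is the nine ordered transactions listed above and the class Node is illustrative.

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item frequencies and keep only the frequent items.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    # Pass 2: insert each transaction with its items in descending frequency order.
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump the count
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
                ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
                ["I1", "I2", "I3"]]
show(build_fp_tree(transactions, min_sup=2))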

Prepare the conditional pattern base and conditional FP Tree and Frequent pattern
generated.

The conditional FP-tree associated with the conditional node I3.

Advantages of FP-Growth

1. No Candidate Generation (Unlike Apriori)

Apriori generates many candidate itemsets, which increases computation time.

FP-Growth avoids this by building an FP-Tree and mining patterns directly.


📌 This makes FP-Growth much faster than Apriori for large datasets.

2. Faster and More Memory Efficient

Apriori scans the database multiple times, while FP-Growth compresses data into an FP-Tree,
requiring fewer scans.

This reduces time complexity and I/O cost.

🚀 FP-Growth is often 10x to 100x faster than Apriori on large datasets.

3. Works Well with Large Datasets

Since FP-Growth stores data in a tree structure, it scales better for large datasets with many
transactions.

Unlike Apriori, FP-Growth does not explode in size when handling large itemsets.

4. More Efficient for Sparse Data

FP-Growth handles sparse datasets better than Apriori, especially when transactions contain a
large number of unique items.

Disadvantages of FP-Growth

1. Complex Implementation

Building and mining the FP-Tree is more complicated than Apriori.

Recursion and tree-based mining make it harder to implement and debug.

2. High Memory Usage for Dense Datasets

If the dataset has many frequent patterns, the FP-Tree can become large, requiring more
memory.

This happens in dense datasets (where many items appear together frequently).

Apriori may be better in such cases since it generates fewer patterns.

3. FP-Tree Needs to Be Rebuilt for New Data

If new transactions are added, the entire FP-Tree must be rebuilt.

This makes it less flexible for dynamic or real-time updates compared to Apriori.


Vertical Data Format in Frequent Itemset Mining

The Vertical Data Format is a way of representing transactions in frequent itemset mining
where we store items along with their transaction IDs (TIDs) instead of listing transactions as
item sets.

Difference Between Horizontal and Vertical Data Formats

Format Representation

Horizontal (Traditional) Each transaction lists its items.

Vertical Each item lists the transactions it appears in.

Example

Given Transaction Dataset (Horizontal Format)

TID Items Bought

1 A, B, C

2 A, C

3 A, B

4 B, C

5 B, C, D

Convert to Vertical Data Format

Item TID List (Transactions Containing the Item)

A {1, 2, 3}

B {1, 3, 4, 5}

C {1, 2, 4, 5}


Item TID List (Transactions Containing the Item)

D {5}

Example of Mining Frequent Itemsets Using Vertical Format

To find the support of {B, C}, we intersect their TID lists:

• B = {1, 3, 4, 5}

• C = {1, 2, 4, 5}

• Intersection (B ∩ C) = {1, 4, 5} → Support = 3

Advantages:

• More efficient than the horizontal data format, since support counting uses fast TID-list intersections.
• Reduces memory usage
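A minimal sketch of building the TID lists and intersecting them, in plain Python, for the five transactions above:

transactions = {1: {"A", "B", "C"}, 2: {"A", "C"}, 3: {"A", "B"},
                4: {"B", "C"}, 5: {"B", "C", "D"}}

# Invert the horizontal format into item -> set of TIDs.
tid_lists = {}
for tid, items in transactions.items():
    for item in items:
        tid_lists.setdefault(item, set()).add(tid)

# Support of {B, C} is the size of the intersection of the two TID lists.
common = tid_lists["B"] & tid_lists["C"]
print(sorted(common), len(common))   # [1, 4, 5] 3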


Program
Name B.C.A Semester VI
Course Title Fundamentals of Data Science (Theory)
Course Code: DSE-E2 No. of Credits 03
Contact hours 42 Hours Duration of SEA/Exam 2 1/2 Hours
Formative Assessment
Marks 40 Summative Assessment Marks 60
Unit 4

Topics:

Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction. Bayes Classification
Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k Nearest
Neighbor. Prediction - Accuracy- Precision and Recall.

Classification Basic Concepts

Classification is a supervised machine learning method where the model tries to predict the correct
label of a given input data.

Classification is a two-step process:

1. Learning/Training Step:

Here a model is constructed for classification. A classifier model is built by analyzing the data
which are labeled already. Because the class label of each training tuple is provided, this step is
also known as supervised learning. This stage can also be viewed as a function, y=f(x), that can
predict the associated class label ‘y’ of a given tuple x considering attribute values. This mapping
function is represented in the form of classification rules, decision trees or mathematical formula.

2. Classification/Testing Step:

Here the model that is constructed in the learning step is used to predict class labels for given
data.

The data used in learning step is called “Training data”.

The data used in classification step is called “Testing data”.

Ex: A bank loan officer needs to classify loan applicants as safe or risky.

The accuracy of classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.


Issues in Classification

Overfitting occurs when a machine learning model learns the training data too well, including
noise and irrelevant details, instead of just the underlying pattern. This results in high accuracy on
the training data but poor performance on new, unseen data.

Ways to Prevent Overfitting:

• Using More Training Data – Helps the model learn more general patterns.
• Applying Ensemble Methods – Using techniques like Random Forest to combine
multiple trees and reduce variance.


Underfitting occurs when a machine learning model is too simple to capture the underlying
patterns in the data. This leads to poor performance on both the training data and new, unseen data.

Ways to Prevent Underfitting:

o Use more relevant features and data.

Class Imbalance: When one class has significantly more samples than another, the model may
become biased toward the dominant class.

Ways to Prevent Class Imbalance:

o Resampling techniques (oversampling the minority class or undersampling


the majority class).
o Using different evaluation metrics like F1-score, precision-recall instead of
accuracy.

High-Dimensional Data (Curse of Dimensionality): Too many features can make training slow
and reduce model performance. Feature selection or dimensionality reduction (e.g., PCA) can help.

Noisy and Incomplete Data: Missing values and irrelevant features can lead to poor classification
results.

Solutions:

o Data preprocessing (handling missing values, removing outliers).


o Feature selection techniques.

Computational Complexity: Some classification algorithms, like deep learning and SVM with
large datasets, require high computational power.

Solution: Use efficient algorithms or hardware acceleration (GPU).

Decision Tree Induction (Decision tree classifier):

• Developed by J Ross Quinlan around 1980’s


• Named it as ID3 (Iterative Dichotomiser)
• Uses greedy approach in which decision trees are constructed in a top-down recursive
divide and conquer method.
• Variants of ID3 are C4.5 (successor of ID3), CART(Classification And Regression Trees)

Decision Tree Induction is the learning of decision trees from class-labelled training tuples.
Decision Tree is a tree structure, where each internal node (non leaf) denotes a test on the attribute,
each branch represents an outcome of the test and each leaf node holds a class label. The attribute


values of a tuple ‘X’ is tested against the decision tree. A path is traced from the root to leaf to
predict the class label.

Advantages:

• DT can be easily converted into classification rules (IF-THEN)


• DT does not require any domain knowledge and hence easy to construct and interpret
• DT can handle multidimensional data
• DT is simple and fast

Disadvantages:

o Suffers from repetition problem: Occurs when same attribute is tested multiple times.
o Suffers for replication problem: Occurs when part of tree is repeated in other branches.


Applications:

Medicine, manufacturing and production, finance, astronomy, molecular biology.

Algorithm: Decision Tree Generation

Input: D = set of training tuples with associated class labels.

Attribute list= attributes of tuple.

Attribute_selection_method= a function to determine the splitting criteria that best


partitions the data tuples into individual classes.

Output: A Decision Tree


Method:

Create a node N;

If tuples in D are all of the same class, C then return N as a leaf node labeled with the class C;

If attribute list is empty then

Return N as the leaf node labeled with the majority class in D;

Apply attribute selection method(D, attribute list) to find the best splitting criterion;

Label node ‘N’ with splitting criterion;

If splitting attribute is discrete valued and multiway splits allowed then

attribute_list ← attribute_list − splitting_attribute; // remove the splitting attribute

For each outcome ‘j’ of splitting criterion// partition tuple and sub trees.

Let Dj be the set of data tuples in D satisfying outcome j;

If Dj is empty then

Attach a leaf labeled with majority class in D to node N;

Else

Attach the node returned by decision tree generation (Dj, attribute list) to node N

End for

Return N;


Splitting Scenarios:

Attribute Selection Measures:

It is used to decide which attribute should be chosen as the splitting point at each node in a
decision tree classifier. It is also called as splitting rules.

The attribute having the best measure is chosen as the splitting attribute for the given tuples.

Popular Attribute Selection Measures:

1. Information gain
2. Gain ratio
3. Gini index


Information Gain:

Based on the work by Claude Shanon on information theory. Information Gain is defined
as the difference between the original information requirement and new requirement. The attribute
with highest information gain is chosen as the splitting attribute for node N. It is used in ID3.

Gain(A) = Info(D) – InfoA(D)

Where,

Info(D) = − Σ (i=1 to m) pᵢ log₂(pᵢ)

m = m distinct classes

Pi = probability that a tuple belongs to class Ci

Info(D) is also known as entropy of D

How much more information would still be needed to arrive at an exact classification after partitioning on attribute A is given by

InfoA(D) = Σ (j=1 to v) (|Dj| / |D|) × Info(Dj)

Refer eg: 8.1
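A minimal sketch of Info(D), InfoA(D) and Gain(A) in plain Python; the 9-yes/5-no class distribution and the three-way partition below are assumed for illustration (they mirror the age split of the referenced example), and the helper names are illustrative.

from math import log2
from collections import Counter

def info(labels):
    # Entropy: -sum(p_i * log2(p_i)) over the class proportions.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(partitions, labels):
    # Gain(A) = Info(D) - sum(|Dj|/|D| * Info(Dj)) over the partitions induced by A.
    n = len(labels)
    info_a = sum(len(dj) / n * info(dj) for dj in partitions)
    return info(labels) - info_a

D = ["yes"] * 9 + ["no"] * 5
partitions = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(info(D), 3), round(info_gain(partitions, D), 3))   # 0.94 0.247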


Gain Ratio:

Information gain measure is biased toward test with many outcomes (many partitions).
That is, it prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier such as product ID. Split on product ID would result in a
large number of partitions (as many as there are values), each one containing just one tuple. Gain
ratio overcomes this bias by normalizing the values. Attribute with maximum gain ratio is selected
on splitting attribute. It is used in C4.5.

Gain ratio is given by the following formula:


Gain ratio(A) = Gain(A) / SplitInfoA(D)

It applies a kind of normalization to information gain using a split information value defined as

SplitInfoA(D) = − Σ (j=1 to v) (|Dj| / |D|) × log₂(|Dj| / |D|)

v = partitions


Gini Index:

It measures the impurity/uncertainty/randomness(heterogeneity) of Dataset D. It is given


by the following formula. A feature/attribute having lower weighted average Gini Impurity after
splitting is preferred. It is used in CART.

Gini(D) = 1 − Σ (i=1 to m) pᵢ²

where pᵢ = probability of class i in the dataset and m = number of classes.

Tree Pruning:

When decision tree is built, many of the branches will reflect problems in the training data
due to noise or outliers. Tree pruning removes the branches which are not relevant. Pruned tree are
smaller and less complex and thus easier to understand. They perform faster than unpruned trees.

There are two common approaches for tree pruning:

• Pre pruning: In pre pruning the tree branch is not further split into sub branches by
deciding early using statistical measures like info gain, gini index etc.
• Post pruning: In post pruning the fully grown tree branches is cut and leaf nodes are
added. The leaf is labeled with the most frequent class among the subtree being replaced.


Pruning Algorithms:

• Cost complexity pruning algorithm used in CART.


• Pessimistic pruning used in C4.5.

Bayes Classification Method

▪ BC is based on Bayes Theorem.


▪ Bayes Classification (BC) are probabilistic classifiers.
▪ It predicts the probability of a given input belonging to a particular class. Naive Bayes
calculates the probability of each class and then selects the class with the highest
probability.
▪ BC assumes that the effect of an attribute value on a given class is independent of the values
of other attributes. This means that the presence of one feature doesn't affect the presence
of another feature. This assumption is called class-conditional independence and hence it
is called Naïve Bayes Classifier.
Example for conditional Independence:
A fruit may be considered to be an apple if it is red, round and 3” diameter. A Naïve
Bayes Classifier considered each of these features to contribute independently to the
probability that this fruit is an apple, regardless of any possible correlations between the
color, roundness and diameter.

Review of Bayes Theorem:

Named after Thomas Bayes of 18th century.


Bayes Theorem is:


P(H/X) = [P(X/H) × P(H)] / P(X)

H – Hypothesis that belongs to a particular class

X – Data Values (measurements on n-attributes)

P(H/X) – Posterior probability, i.e. probability that X belong to hypothesis made on H. Probability
of a class given the data.

Ex: Probability that a customer X will buy a computer given that we know the age and
income of customer.

P(X/H) - Likelihood (probability of data appearing in a particular class).

Ex: Probability that customer X has the observed attribute values (e.g., income Rs. 40,000), given that we know the customer will buy a computer.

P(H) – Prior probability of H, i.e., probability of the class occurring overall.

Ex: Probability that any given customer will buy a computer regardless of measurements
on attribute.

P(X)- Evidence (total probability of the data occurring).

Naïve Bayesian Classification:

N B classifiers works as follows:

1) Let D be a training set of tuples(with class labels)


X – tuple with n-attributes.
2) Suppose that there are m classes C1, C 2,….., Cm. Given a tuple X, the classifier will predict
that X belong to the class having the highest posterior probability conditioned on X.
i.e.
P(Ci/X) > P(Cj/X) for 1≤j≤m, j≠i.
Thus we maximize P(Ci/X). The class Ci for which P(Ci/X) is maximized is called the
maximum posterior hypothesis.
By Bayes Theorem
P(Ci/X) = [P(X/Ci) × P(Ci)] / P(X)
3) Since P(X) is constant for all classes, therefore denominator is not considered and only
P(X/Ci) P(Ci) needs to be computed for all the classes.


4) To predict the class label of X, P(X/Ci) P(Ci) is evaluated for all class C and maximum of
P(X/Ci) P(Ci) is assigned as class label.

Example: Classifying if we can play based on weather

If we can play on a sunny day?

Dataset (Training)

Weather Play
Sunny Yes
Over cast No
Rainy No
Sunny No
Over cast Yes
Rainy No
Sunny Yes
Over cast No

i.e., compare P(Yes/Sunny) with P(No/Sunny).

P(Yes/Sunny) = [P(Sunny/Yes) × P(Yes)] / P(Sunny) = [(2/3) × (3/8)] / (3/8) = 0.67

P(No/Sunny) = [P(Sunny/No) × P(No)] / P(Sunny) = [(1/5) × (5/8)] / (3/8) = 0.33

Since P(Yes/Sunny) > P(No/Sunny), based on Bayes classification we can play on a sunny day.
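The same computation as a minimal plain-Python sketch (there is a single attribute here, so the naive independence product has only one factor):

data = [("Sunny", "Yes"), ("Overcast", "No"), ("Rainy", "No"), ("Sunny", "No"),
        ("Overcast", "Yes"), ("Rainy", "No"), ("Sunny", "Yes"), ("Overcast", "No")]
n = len(data)

def posterior(weather, play):
    n_class = sum(1 for w, p in data if p == play)
    likelihood = sum(1 for w, p in data if w == weather and p == play) / n_class
    prior = n_class / n                                      # P(class)
    evidence = sum(1 for w, p in data if w == weather) / n   # P(weather)
    return likelihood * prior / evidence

print(round(posterior("Sunny", "Yes"), 2))   # 0.67
print(round(posterior("Sunny", "No"), 2))    # 0.33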

Example 2: Spam Email Classification

Let's say we have an email spam filter that classifies emails as Spam or Not Spam based on the
presence of certain words.

Step 1: Training Data (Past Emails)

We have a dataset of emails labeled as Spam or Not Spam. The classifier learns from word
frequencies in each category.


Word Spam (Frequency) Not Spam (Frequency)

"Free" 60% 10%

"Offer" 50% 5%

"Urgent" 40% 5%

"Meeting" 5% 50%

"Project" 2% 40%

Also, from past data:

• 40% of emails are Spam → P(Spam) = 0.4

• 60% of emails are Not Spam → P(Not Spam) = 0.6

Step 2: Classifying a New Email

A new email contains: "Free Offer Urgent"

We need to calculate:

P(Spam | "Free Offer Urgent") and P(Not Spam | "Free Offer Urgent")

Using Naïve Bayes, we assume the words are independent:

P(Spam | Words) = P(Free | Spam) × P(Offer | Spam) × P(Urgent | Spam) × P(Spam)
= 0.6 × 0.5 × 0.4 × 0.4 = 0.048

Similarly, for Not Spam:

P(Not Spam | Words) = P(Free | Not Spam) × P(Offer | Not Spam) × P(Urgent | Not Spam) × P(Not Spam)
= 0.1 × 0.05 × 0.05 × 0.6 = 0.00015

Since P(Spam | Words) > P(Not Spam | Words), the classifier marks the email as Spam.

Advantages

- Simple and easy to implement


- Good results are obtained in most of the cases


- Fast and efficient

Disadvantages

It assumes that all features are independent given the class, which is rarely true in real-world data.
Example: In spam detection, "free" and "offer" may appear together often—but Naïve Bayes treats
them as if they are unrelated.

Applications:

• Spam filtering: Identifying emails as spam or not spam.

• Text classification: Categorizing documents or articles.

• Sentiment analysis: Determining the sentiment of text (e.g., positive, negative).

• Medical diagnosis: Predicting the probability of a disease based on symptoms

Rule Based Classifiers

Rule-based classifiers are used for classification by defining a set of rules that can be used to assign
class labels to new instances of data based on their attribute values. These rules can be created
using expert knowledge of the domain, or they can be learned automatically from a set of labeled
training data. A rule-based classifier uses a set of IF-THEN rules for classification.

An IF-THEN rule is an expression of the form:

IF condition THEN conclusion.

An example is rule R1,

R1: IF age == youth AND student == yes THEN buys computer == yes.

The “IF” part (or left side) of a rule is known as the rule antecedent or precondition.

The “THEN” part (or right side) is the rule consequent.

In the rule antecedent, the condition consists of one or more attribute tests (e.g., age == youth and
student == yes) that are logically ANDed.

If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say
that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the
tuple.


Conflict in rules:

In rule-based classification, it's common to have conflict when multiple rules apply to the same
data instance (tuple), possibly predicting different classes. To handle this, we use conflict
resolution strategies.

Ex. Predicting Whether to Approve a Loan

We have some rules learned from data:

• Rule 1: If Credit Score > 700 AND Income > 50K Then Approve = Yes

• Rule 2: If Age < 25 AND Credit Score > 650 THEN Approve = No

👤 Test Tuple:

A new customer with:

• Credit Score = 720, Income = 60K, Age = 22

Both Rule 1 and Rule 2 apply to this tuple!

• Rule 1 says: ✅ Approve = Yes

• Rule 2 says: ❌ Approve = No

This is a conflict.

Conflict resolution strategies:

• Rule Ordering (Priority-Based Resolution): In rule-based ordering, the rules are organized
based on priority, according to some measure of rule quality, such as accuracy, coverage,
or size (number of attribute tests in the rule antecedent), or based on advice from domain
experts. Class is predicted for the tuple based on the priority, and any other rule that
satisfies tuple is ignored.
• Majority Voting: If multiple rules apply and predict different classes, take a majority vote
from the predicted classes.
• Specificity Preference/Size based: Choose the most specific rule (i.e., the rule with the most
conditions or constraints).

Coverage and Accuracy: Measures to evaluate the accuracy of the rules.

In rule-based classification, coverage is the percentage of records that satisfy the antecedent
conditions of a rule.


Coverage(R)=n1/n
Where n1= instances with antecedent and n=no of training tuples

Accuracy is the percentage of records that satisfy the antecedent conditions and meet the
consequent values of a rule.

Accuracy(R)=n2/n1
Where n2= instances with antecedent AND consequent
Key Differences:
• Accuracy focuses on correctness, while coverage focuses on applicability.

• A rule can have high accuracy but low coverage (if it classifies correctly but applies to
very few instances).

• A rule can have high coverage but low accuracy (if it applies to many instances but
makes many errors).

• The best classification rules aim for a balance between accuracy and coverage to ensure
broad applicability while maintaining correctness.

Rule Extraction from a Decision Tree


Use the decision tree and generate the rules from each branch. Where nodes will become
antecedent and leaf node will be consequent. Logical AND is used to combine the nodes of the
tree.
E.g.,

Rule Induction Using a Sequential Covering Algorithm


Sequential Covering is a popular algorithm based on Rule-Based Classification. IF-THEN rules
can be extracted directly from the training data (i.e., without having to generate a decision tree
first) using a sequential covering algorithm.


There are many sequential covering algorithms. Popular variations include AQ, CN2, and the
more recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each
time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the
remaining tuples.
Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.
Input:
D: a data set of class-labeled tuples;
Att vals: the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
(1) Rule set = {}; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn One Rule(D, Att vals, c);
(5) remove tuples covered by Rule from D;
(6) Rule set = Rule set + Rule; // add new rule to rule set
(7) until terminating condition;
(8) endfor
(9) return Rule Set ;

Rule Pruning

Rule pruning is a technique used in rule-based classification to improve the generalization of


classification rules by removing overly specific or unnecessary conditions. It helps reduce
overfitting, making the rules more effective when applied to unseen data.
Example of Rule Pruning
Before Pruning:
Rule:
If (Temperature = High) AND (Humidity = High) AND (Wind = Weak) THEN Play = No
After Pruning:
If (Humidity = High) THEN Play = No
By pruning unnecessary conditions (Temperature and Wind), the rule becomes simpler and more
generalizable.

Rule Pruning strategies


Pre pruning: Stops rule expansion before it becomes too specific.
Post pruning: First, generate the complete rule using training data. Then, remove less significant
conditions that do not significantly reduce classification accuracy.

Advantages of Rule-Based Classifiers


• Has characteristics quite similar to decision trees
• As highly expressive as decision trees
• Easy to interpret
• Faster Performance comparable to decision trees
• Can handle redundant attributes


Lazy Learners: K-Nearest Neighbor Classifier Algorithm

The classification methods discussed such as decision tree induction, Bayesian classification, rule-
based classification, etc., are all examples of eager learners. Eager learners employ two step
approach for classification, i.e., in first step they build classifier model learning from the training
set and in second step they perform the classification on unknow tuples to know class using the
model.

Lazy learning algorithms simply store the training examples and wait until a new tuple (from the testing dataset) is encountered; only then do they compare it against the stored examples to make a prediction. This type of learning is useful when working with large datasets that have few attributes. Lazy learning is also known as instance-based or memory-based learning.

Examples of lazy learners: K-Nearest Neighbor Classifier, case-based reasoning classifiers.

Advantages of Lazy learners:


• Can adapt quickly to new or changing data.
• Less affected by outliers compared to eager learning methods.
• Handles complex data distributions and nonlinear relationships

Disadvantages of lazy learners:

• Computationally expensive
• Required more memory as training data will be loaded only during classification stage.

K-Nearest Neighbor Classifier Algorithm

• The k-nearest-neighbour method was first described in the early 1950s.


• widely used in the area of pattern recognition.
• The K-Nearest Neighbors (KNN) algorithm is a popular machine learning technique
used for classification and regression tasks. It relies on the idea that similar data
points tend to have similar labels or values.

The steps for the KNN algorithm are:

1. Assign a value to K
2. Calculate the distance(E.g, Euclidean Distance) between the new data entry and all other
existing data entries
3. Arrange the distances in ascending order
4. Determine the k-closest records of the training data set for each new record
5. Take the majority vote to classify the data point.


The Euclidean distance between two points or tuples, say, X1 = {x11, x12, ..., x1n} and X2 = {x21, x22, ..., x2n}, is

dist(X1, X2) = √( Σ (i=1 to n) (x1i − x2i)² )

Example:
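A minimal sketch of these five steps in plain Python, using hypothetical 2-D training points, two classes and k = 3:

from math import dist
from collections import Counter

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.2), "B"), ((4.5, 3.9), "B"), ((3.8, 4.1), "B")]

def knn_classify(x, k=3):
    # Steps 2-4: compute Euclidean distances, sort, keep the k closest records.
    neighbours = sorted(train, key=lambda item: dist(x, item[0]))[:k]
    # Step 5: majority vote among the k nearest labels.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))   # A
print(knn_classify((4.2, 4.0)))   # B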

Advantages

• Simple and efficient


• Easy to implement
• No assumption required
• No training time required

Disadvantages:

• Computationally expensive
• Accuracy reduces if there are noise in the dataset
• Requires large memory
• Need to accurately determine the value of k neighbors


Metrics for Evaluating Classifier Performance (Precision and Recall)

There are four terms we need to know that are the “building blocks” used in computing many
evaluation measures. Understanding them will make it easy to grasp the meaning of the various
measures.

True positives (TP): The model correctly predicts a positive class.

E.g., Person with COVID-19 correctly labelled as COIVD-19 positive.

True negatives (TN): The model correctly predicts a negative class.

E.g., Person without having COVID-19 virus is correctly labelled as COIVD-19 negative.

False positives (FP): The model incorrectly predicts a positive class when the actual class is
negative.

E.g., Person without having COVID-19 virus is incorrectly labelled as COIVD-19 positive.

False negatives (FN): The model incorrectly predicts a negative class when the actual class is
positive.

E.g., Person having COVID-19 virus is incorrectly labelled as COIVD-19 negative.

Precision and Recall:

Precision and recall are metrics used to evaluate the performance of classification models in machine learning. Precision is the percentage of positive identifications that are correct (how many predicted positives are actually positive?), while recall is the percentage of actual positives that are identified correctly (how many actual positives were correctly identified?).

Precision = TP / (TP + FP)     Recall = TP / (TP + FN)

F1-score = 2 × (Precision × Recall) / (Precision + Recall) → the harmonic mean of precision and recall.

Accuracy:

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier. In the pattern recognition literature, this is also referred to as the overall


recognition rate of the classifier, that is, it reflects how well the classifier recognizes tuples of the
various classes. That is,

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The sensitivity(recall) and specificity measures can be used, respectively, for this purpose.
Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive
tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion
of negative tuples that are correctly identified).

Sensitivity tells how good the model is at detecting positives.

Sensitivity = TP / (TP + FN)

Specificity tells how good the model is at detecting negatives.

Specificity = TN / (TN + FP)

Example:

In a COVID-19 test, sensitivity answers:

“Of all the people who actually have COVID, how many did the test correctly identify?”

If sensitivity is 90%, it means the test caught 90% of the infected people — but missed 10%.

“Of all the people who do NOT have COVID, how many did the test correctly say were negative?”

If specificity is 95%, it means 95% of healthy people were correctly told they’re negative.

Confusion Matrix:

A confusion matrix represents the prediction summary in matrix form. It shows how many
prediction are correct and incorrect per class.


Example:

• 1,000 people are tested.


• 100 people actually have COVID.
• 900 people do not have COVID.

Predicted Positive (1) Predicted Negative (0)

Actual Positive (1) TP = 85 FN = 15
Actual Negative (0) FP = 50 TN = 850

Accuracy = (TP + TN) / Total


= (85 + 850) / 1000 = 93.5%

Sensitivity (Recall) = TP / (TP + FN)


= 85 / (85 + 15) = 85%

Specificity = TN / (TN + FP)


= 850 / (850 + 50) = 94.4%

Precision = TP / (TP + FP)


= 85 / (85 + 50) = 62.96%

Interpretation:

• Sensitivity = 85% → The test correctly identifies 85% of people who actually have
COVID.

• Specificity = 94.4% → It correctly identifies 94.4% of those who don't have COVID.

• Precision = 62.96% → When the test says someone has COVID, it’s only right ~63% of
the time (high false positives).

• Accuracy = 93.5% → Overall, the test is mostly accurate.
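A minimal sketch computing these metrics from the four counts in plain Python:

TP, FN, FP, TN = 85, 15, 50, 850

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)        # recall / true positive rate
specificity = TN / (TN + FP)        # true negative rate
precision = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.1%}  sensitivity={sensitivity:.1%}  "
      f"specificity={specificity:.1%}  precision={precision:.2%}  f1={f1:.2%}")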


Program
Name B.C.A Semester VI
Course Title Fundamentals of Data Science (Theory)
Course Code: DSE-E2 No. of Credits 03
Contact hours 42 Hours Duration of SEA/Exam 2 1/2 Hours
Formative Assessment
Marks 40 Summative Assessment Marks 60
Unit 5
Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods,
Grid-Based Methods, Evaluation of Clustering

Cluster Analysis

Cluster analysis or clustering is the process of grouping a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters.

Clustering is also known as unsupervised learning since groups are made without the knowledge
of class labels.

Clustering is also called data segmentation in some applications because clustering partitions large
data sets into groups according to their similarity.
Ex: Customer Segmentation for a Retail Store

Clustering can also be used for outlier detection, where outliers (values that are “far away” from
any cluster) may be more interesting than common cases.

Clustering has many applications in different industries, including:

• Market segmentation: Categorizing customers into distinct groups based on purchasing


behavior, such as frequent buyers vs occasional buyers, for targeted marketing.
• Social network analysis: Identifying communities or clusters of users who interact
frequently, useful for targeted advertising or content recommendations.
• Recommendation engines: Grouping users with similar preferences or behaviors to
provide personalized suggestions (e.g., movies, products).
• Image segmentation: Dividing an image into clusters of pixels that represent different
regions, such as tissue types in medical imaging or objects in a photo.
• Medical data analysis: Clustering patients based on similar symptoms, medical history,
or genetic traits to assist in disease diagnosis and treatment planning.
• Search result clustering: Grouping similar search results (not queries) to organize and
present related content more effectively.


• Anomaly detection: Identifying data points that do not belong to any cluster or lie far from
all clusters, which may indicate fraud, errors, or unusual events

Requirements for Cluster Analysis

The following are typical requirements of clustering in data mining:

• Scalability: A large dataset may consist of millions of objects, and clustering on only a
sample of such a data set may lead to biased results. Therefore, highly scalable clustering
algorithms are needed.
• Ability to deal with different types of attributes: Clustering algorithms that can work
on all kinds of data, such as interval-based (numeric), binary, categorical, graph, image, and
document data, are needed.
• Discovery of clusters with arbitrary shape: Clusters of arbitrary shape are often
required in real-world scenarios.
• Minimal requirements for domain knowledge to determine input parameters: Algorithms
that require minimal input/domain knowledge about the data are preferred for clustering
complex data.
• Ability to deal with noisy data: Algorithms that are robust to noise are required.
• Incremental clustering and insensitivity to input order: Algorithms that can incorporate
additional data after producing a result, and hence update the existing clusters, are needed.
• Capability of clustering high-dimensional data: Finding clusters of data objects in a high-
dimensional space is a critical requirement.

Basic Clustering Methods.


1. Partitioning methods: Partitioning methods construct k partitions of the data, where each
partition represents a cluster. They group similar data points into clusters based on their
similarities and differences.
E.g., K-Means, K-Medoids, FCMA, CLARA.
2. Hierarchical methods: Hierarchical methods create a hierarchical decomposition of the
given set of data objects. E.g., BIRCH, CHAMELEON.
There are two approaches to hierarchical clustering:
A. Agglomerative approach: Also called the bottom-up approach. Initially, each
object forms a separate group; the objects or groups closest to one another are then
successively merged until all groups are merged into one. E.g., AGNES (AGglomerative
NESting).
B. Divisive approach: Also called the top-down approach. All objects start in a single
group, which is successively split into smaller clusters until each object forms its own
cluster. E.g., DIANA (DIvisive ANAlysis).
3. Density-based methods: Partitioning and hierarchical methods are designed to find
spherical-shaped clusters; they have difficulty finding clusters of arbitrary shape, such as
"S"-shaped or oval clusters. To find clusters of arbitrary shape, we can instead model
clusters as dense regions in the data space, separated by sparse regions. This is the
main strategy behind density-based clustering methods, which can discover clusters of
non-spherical shape. Here a cluster is grown as long as the density (number of objects) in the
neighborhood exceeds some threshold. These methods group similar data points based on
their density: the algorithm identifies core points with a minimum number of neighboring
points within a specified distance (known as the epsilon radius) and expands clusters by
connecting these core points to their neighboring points until the density falls below a
certain threshold. Points that do not belong to any cluster are considered outliers or noise.
E.g., DBSCAN, OPTICS, DENCLUE, Mean-Shift.
4. Grid-based methods: Here the data objects are first organized into a grid (cells) and
clustering operations are performed on this grid. The object space is divided into a grid
structure of a finite number of cells, and clustering operations are performed on the cells
instead of individual data points. This method is highly efficient for handling spatial data
and has a fast processing time that is independent of the number of data objects.
E.g., Statistical Information Grid (STING), CLIQUE, ENCLUS.

Overview of Clustering Methods: the four families above (partitioning, hierarchical, density-based,
and grid-based) are examined in more detail in the following sections.

Partitioning Methods: The simplest and most fundamental version of cluster analysis is
partitioning, which groups similar data points into clusters based on their similarities and
differences.
E.g., K-means and K-medoids.

K-Means Algorithm:

• First, it randomly selects k of the objects in D; each selected object initially represents a
cluster mean (center).
• In each iteration, every remaining object is assigned to the cluster whose mean it is nearest
to, based on Euclidean distance, and the cluster means are then updated.
• The iterations continue until the clusters of the previous iteration and the current iteration
are the same (i.e., the assignments no longer change).
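
The following is a minimal sketch of k-means in practice, assuming scikit-learn and NumPy are available; the two-dimensional toy data and the chosen k are made up purely for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy 2-D data: two loose blobs (illustrative only).
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    # k must be chosen in advance; multiple restarts (n_init) reduce the
    # sensitivity to the random initial centers.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print(km.cluster_centers_)   # final cluster means
    print(km.labels_[:10])       # cluster index assigned to the first 10 objects
    print(km.inertia_)           # within-cluster sum of squares (WCSS)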

Advantages of k-means:

• Relatively simple to implement.


• Can be used for large data sets
• Guarantees convergence.


Disadvantages of k-means:
• It is a bit difficult to predict the number of clusters i.e. the value of k.
• Output is strongly impacted by initial inputs like number of clusters (value of k).

K-Medoids Algorithm (Partitioning Around Medoids, PAM):

It is an improved version of the K-Means algorithm, designed mainly to deal with sensitivity to
outlier data. Instead of taking the mean value to represent a cluster, a medoid is used: a medoid is
the point in the cluster whose dissimilarity to all the other points in the cluster is minimal. A
representative object (Oi) is chosen randomly to represent each cluster, and each remaining object
is assigned to the cluster whose representative object it is most similar to. The partitioning is then
refined based on the principle of minimizing the sum of the dissimilarities between each object p
and its corresponding representative object.
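
The sketch below illustrates the idea with a simplified alternating k-medoids loop in NumPy (an assignment step followed by a medoid-update step). It is not the full PAM swap search, and the toy data are made up; it is included only to make the medoid idea concrete:

    import numpy as np

    def k_medoids(X, k, max_iter=100, seed=None):
        """Simplified alternating k-medoids (illustrative, not full PAM)."""
        rng = np.random.default_rng(seed)
        medoid_idx = rng.choice(len(X), size=k, replace=False)   # random initial medoids
        for _ in range(max_iter):
            # Assignment: each object joins the cluster of its nearest medoid.
            d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Update: the new medoid of each cluster is the member whose total
            # dissimilarity to the other members is minimal.
            new_idx = medoid_idx.copy()
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members):
                    intra = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
                    new_idx[c] = members[intra.sum(axis=1).argmin()]
            if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
                break                                            # medoids stopped changing
            medoid_idx = new_idx
        d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        return d.argmin(axis=1), medoid_idx

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
    labels, medoids = k_medoids(X, k=2, seed=0)
    print(X[medoids])   # the representative objects (medoids) of the two clusters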


Advantages of using K-Medoids:


• Deals with noise and outlier data effectively
• Easily implementable and simple to understand
• Faster compared to other partitioning algorithms
Disadvantages:
• Not suitable for Clustering arbitrarily shaped groups of data points.
• As the initial medoids are chosen randomly, the results might vary based on the choice in
different runs.

Hierarchical Methods

Partitioning methods partition objects into exclusive groups. In some situations, however, we may
want the data grouped at different levels. A hierarchical clustering method works by grouping data
objects into a hierarchy, or "tree", of clusters. This helps in summarizing the data with the help of
the hierarchy.

Hierarchical methods can be categorized as:

a) Algorithmic methods,

b) Probabilistic methods and

c) Bayesian methods

Agglomerative, divisive, and multiphase methods are algorithmic, meaning they consider data
objects as deterministic and compute clusters according to the deterministic distances between
objects. Probabilistic methods use probabilistic models to capture clusters and measure the
quality of clusters by the fitness of the models. Bayesian methods compute a distribution of possible
clusterings; that is, instead of outputting a single deterministic clustering over a data set, they
return a group of clustering structures and their probabilities, conditional on the given data.


AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and
DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method:

The figure (omitted here) shows the application of AGNES, the agglomerative method (top arrow),
and DIANA, the divisive method (bottom arrow), on a data set of five objects, {a, b, c, d, e}.
Working of AGNES:
• Initially, AGNES, the agglomerative method, places each object into a cluster of its own.
• The clusters are then merged step-by-step according to some criterion. For example,
clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum
Euclidean distance between any two objects from different clusters. This is a single-linkage
approach in that each cluster is represented by all the objects in the cluster, and the
similarity between two clusters is measured by the similarity of the closest pair of data
points belonging to different clusters.
• The cluster-merging process repeats until all the objects are eventually merged to form one
cluster.
Working of DIANA:
• All the objects are used to form one initial cluster.
• The cluster is split according to some principle such as the maximum Euclidean distance
between the closest neighboring objects in the cluster.
• The cluster-splitting process repeats until, eventually, each new cluster contains only a
single object.


Dendrogram:
A tree structure called a dendrogram is commonly used to represent the process of hierarchical
clustering. A dendrogram is plotted to show the results of a hierarchical clustering method
graphically.

Distance Measuring in Algorithmic Methods (Linkage Criteria)

Whether using an agglomerative method or a divisive method, a core need is to measure the
distance between two clusters. Four widely used measures for distance between clusters are as
follows, where |p-p’| is the distance between two objects.

1. Minimum distance:

   dist_min(Ci, Cj) = min { |p − p′| : p ∈ Ci, p′ ∈ Cj }

2. Maximum distance:

   dist_max(Ci, Cj) = max { |p − p′| : p ∈ Ci, p′ ∈ Cj }

3. Mean distance:

   dist_mean(Ci, Cj) = |mi − mj|, where mi and mj are the means of Ci and Cj.

4. Average distance:

   dist_avg(Ci, Cj) = (1 / (ni · nj)) Σ_{p ∈ Ci, p′ ∈ Cj} |p − p′|, where ni and nj are the sizes of clusters Ci and Cj.
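
As a quick illustration, all four measures can be computed directly from the pairwise Euclidean distances between two clusters. This is a minimal NumPy sketch; the two toy clusters are made up:

    import numpy as np

    # Two toy clusters in 2-D (illustrative only).
    Ci = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    Cj = np.array([[5.0, 5.0], [6.0, 5.0]])

    # Pairwise distances |p - p'| for p in Ci, p' in Cj.
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

    dist_min = pairwise.min()                                      # minimum distance
    dist_max = pairwise.max()                                      # maximum distance
    dist_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))  # distance between means
    dist_avg = pairwise.mean()                                     # average pairwise distance

    print(dist_min, dist_max, dist_mean, dist_avg)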

Types of Linkage of clusters:

Single linkage and complete linkage are two distinct methods used in agglomerative hierarchical
clustering. This type of clustering starts with each data point as its own cluster and iteratively
merges the closest clusters until a single cluster remains or a stopping criterion is met. The key
difference between single and complete linkage lies in how the "distance" between two clusters is
defined.

Single Linkage (Nearest Neighbor)

In single linkage, the distance between two clusters is defined as the minimum distance(use min
distance formula) between any data point in the first cluster and any data point in the second
cluster.

Mathematically, if C1 and C2 are two clusters, the distance D(C1,C2) in single linkage is:

D(C1, C2) = min { d(x, y) : x ∈ C1, y ∈ C2 }

where d(x,y) is the distance between data points x and y (e.g., Euclidean distance).

Complete Linkage (Farthest Neighbor)

In complete linkage, the distance between two clusters is defined as the maximum distance(use
max distance formula) between any data point in the first cluster and any data point in the second
cluster.

Mathematically, if C1 and C2 are two clusters, the distance D(C1,C2) in complete linkage is:

D(C1, C2) = max { d(x, y) : x ∈ C1, y ∈ C2 }

where d(x,y) is the distance between data points x and y.

Key Differences Summarized

Feature             | Single Linkage                            | Complete Linkage
Cluster Distance    | Minimum distance between any two points   | Maximum distance between any two points
Cluster Shape       | Tends to form long, chain-like clusters   | Tends to form compact, spherical clusters
Noise/Outliers      | More sensitive                            | Less sensitive
Computational Cost  | Generally lower                           | Generally higher
Shape Detection     | Better for non-globular shapes            | Better for globular shapes
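
A minimal sketch of agglomerative clustering under the two linkage criteria, assuming SciPy and Matplotlib are installed; the toy data, the choice of two clusters, and the plotting call are illustrative only:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
    import matplotlib.pyplot as plt

    # Toy 2-D data: two well-separated groups (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

    # Agglomerative clustering with the two linkage criteria discussed above.
    Z_single = linkage(X, method="single")      # nearest-neighbour linkage
    Z_complete = linkage(X, method="complete")  # farthest-neighbour linkage

    # Cut each hierarchy into 2 clusters and compare the assignments.
    labels_single = fcluster(Z_single, t=2, criterion="maxclust")
    labels_complete = fcluster(Z_complete, t=2, criterion="maxclust")

    # The dendrogram shows the merge order and merge distances graphically.
    dendrogram(Z_complete)
    plt.show()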

BIRCH : Multiphase Hierarchical Clustering Using Clustering Feature Trees


• Balanced Iterative Reducing and Clustering using Hierarchies.
• It is an unsupervised data mining algorithm that performs hierarchical clustering on large
datasets.
• Designed to cluster large amount of numeric data.

The BIRCH clustering algorithm consists of two stages:

1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions
called Clustering Feature (CF) entries. It uses a clustering feature (CF) to summarize a
cluster and a clustering feature tree (CF tree) to represent a cluster hierarchy. Formally, a
Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number
of data points in the cluster, LS is the linear sum of the data points, and SS is the squared
sum of the data points in the cluster. A CF tree is a height-balanced tree with two
parameters: the branching factor, which limits the number of entries per node, and the
threshold, which limits the size (diameter) of the subclusters stored at the leaf nodes. The
CF tree is a very compact representation of the dataset because each entry in a leaf node is
not a single data point but a subcluster. Every entry in a CF tree contains a pointer to a
child node and a CF entry made up of the sum of the CF entries in the child nodes. The
tree size is a function of the threshold: the larger the threshold, the smaller the tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree.
A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree
contains a pointer to a child node, and a CF entry made up of the sum of CF entries in the
child nodes. Optionally, we can refine these clusters.
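
A minimal sketch using the BIRCH implementation in scikit-learn (assumed available); the toy data and parameter values are illustrative only:

    from sklearn.cluster import Birch
    from sklearn.datasets import make_blobs

    # Toy numeric data (illustrative only).
    X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

    # threshold bounds the size of the subclusters kept at the leaves,
    # branching_factor bounds the number of CF entries per node, and
    # n_clusters drives the global clustering step on the leaf subclusters.
    model = Birch(threshold=0.5, branching_factor=50, n_clusters=3).fit(X)
    print(model.predict(X)[:10])   # cluster labels for the first 10 points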


CHAMELEON: Multiphase Hierarchical Clustering Using Dynamic Modeling

Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the
similarity between pairs of clusters.
Chameleon uses a two-phase algorithm to find clusters in a data set:

1. First phase: uses a graph partitioning algorithm to cluster the data items into many small
subclusters.
2. Second phase: uses an agglomerative algorithm to find the genuine clusters by repeatedly
combining these subclusters.

Two clusters are merged only if their interconnectivity is high and they are close together.
Chameleon uses a k-nearest-neighbour graph approach to construct a sparse graph; a graph
partitioning algorithm then partitions this graph into smaller subclusters, and an agglomerative
hierarchical clustering algorithm iteratively merges subclusters based on their similarity.

Chameleon works for arbitrarily shaped clusters.

Probabilistic Hierarchical Clustering:

Traditional hierarchical clustering assumes that there is no uncertainty or noise in the data being
clustered. However, this assumption does not hold for many real-world datasets where there may
be missing values, outliers, or measurement errors present in the data. Traditional methods also
assume that all features have equal importance, which may not always be true.
Probabilistic hierarchical clustering tries to overcome some of these drawbacks by employing
probabilistic models to measure distances between clusters.


Advantages:

• Handles complex data containing noise or missing values.
• More flexible than algorithmic methods.

Disadvantages:

• Computationally expensive.
• Sensitive to initialization parameters.
• Assumes Gaussian distributions within clusters.
• Requires domain knowledge and expertise in statistical modeling.

Density-Based Methods

DBSCAN: Density-Based Clustering Based on Connected Regions with High Density

The DBSCAN algorithm is based on an intuitive notion of "clusters" and "noise": the key idea
is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.

Parameters Required For DBSCAN Algorithm

1. eps: It defines the neighborhood around a data point i.e. if the distance between two
points is lower or equal to ‘eps’ then they are considered neighbors. If the eps value
is chosen too small then a large part of the data will be considered as an outlier. If it
is chosen very large then the clusters will merge and the majority of the data points
will be in the same clusters. One way to find the eps value is based on the k-
distance graph.


2. MinPts: Minimum number of neighbors (data points) within eps radius. The larger
the dataset, the larger value of MinPts must be chosen. As a general rule, the
minimum MinPts can be derived from the number of dimensions D in the dataset as,
MinPts >= D+1. The minimum value of MinPts must be chosen at least 3.

In this algorithm, we have 3 types of data points.

Core Point: A point is a core point if it has more than MinPts points within eps.

Border Point: A point which has fewer than MinPts within eps but it is in the neighborhood of a
core point.

Noise or outlier: A point which is not a core point or border point.

Steps Used In DBSCAN Algorithm

1. Find all the neighbor points within eps and identify the core points or visited with
more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster
as the core point.
Points a and b are said to be density-connected if there exists a point c that has a
sufficient number of points in its neighborhood and both a and b are reachable from c
within the eps distance. This is a chaining process: if b is a neighbor of c, c is a
neighbor of d, d is a neighbor of e, and e in turn is a neighbor of a, then b is
density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do
not belong to any cluster are noise.
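
A minimal sketch using the DBSCAN implementation in scikit-learn (assumed available). The half-moon toy data, the k-distance heuristic for eps, and the parameter values are illustrative only:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors
    from sklearn.datasets import make_moons

    # Toy non-spherical data: two interleaved half-moons (illustrative only).
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # k-distance heuristic for eps: sort each point's distance to its 4th
    # nearest neighbour and look for the knee in the sorted curve.
    dists, _ = NearestNeighbors(n_neighbors=5).fit(X).kneighbors(X)
    k_dist = np.sort(dists[:, -1])   # plot/inspect this curve to choose eps

    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    labels = db.labels_              # the label -1 marks noise/outlier points
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))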


DENCLUE:

DENCLUE (DENsity CLUstering) employs a cluster model based on kernel density estimation.
A cluster is defined by a local maximum of the estimated density function, and observations
assigned to the same local maximum are put into the same cluster.

DENCLUE does not work on data with a uniform distribution. In high-dimensional space, data tend
to look uniformly distributed because of the curse of dimensionality; therefore, DENCLUE does not
work well on high-dimensional data in general.

Grid-Based Methods

The grid-based clustering approach uses a grid data structure. It limits the object space into a finite
number of cells that form a grid structure on which all of the operations for clustering are
performed. The main advantage of the approach is its fast-processing time.

Statistical Information Grid (STING):

STING is a grid-based clustering technique. It uses a multidimensional grid data structure that
quantizes the space into a finite number of cells. Instead of focusing on the data points themselves,
it focuses on the value space surrounding the data points.
In STING, the spatial area is divided into rectangular cells organized into several levels at different
resolutions; high-level cells are divided into several lower-level cells.
Statistical information about the attributes in each cell, such as the mean, maximum, and
minimum values, is precomputed and stored as statistical parameters. These statistical
parameters are useful for query processing and other data analysis tasks. The statistical parameters
of higher-level cells can easily be computed from the parameters of the lower-level cells.
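
This bottom-up aggregation of cell statistics can be illustrated with a toy NumPy sketch (the grid sizes, the 2x2 parent/child layout, and the statistics kept are simplified assumptions, not the full STING parameter set):

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.random((1000, 2))   # toy spatial data in the unit square

    # Lowest level: an 8x8 grid; precompute per-cell count and sum of x.
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=8)
    sums_x, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=8,
                                  weights=points[:, 0])

    # Next level up: a 4x4 grid; each parent cell aggregates its 2x2 children,
    # so higher-level statistics never touch the raw points again.
    parent_counts = counts.reshape(4, 2, 4, 2).sum(axis=(1, 3))
    parent_sums_x = sums_x.reshape(4, 2, 4, 2).sum(axis=(1, 3))
    parent_mean_x = parent_sums_x / np.maximum(parent_counts, 1)
    print(parent_counts, parent_mean_x, sep="\n")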


Working of STING:
Step 1: Determine a layer, to begin with.
Step 2: For each cell of this layer, it calculates the confidence interval or estimated range of
probability that this cell is relevant to the query.
Step 3: From the interval calculated above, it labels the cell as relevant or not relevant.
Step 4: If this layer is the bottom layer, go to point 6, otherwise, go to point 5.
Step 5: It goes down the hierarchy structure by one level. Go to point 2 for those cells that form
the relevant cell of the high-level layer.
Step 6: If the specification of the query is met, go to point 8, otherwise go to point 7.
Step 7: Retrieve those data that fall into the relevant cells and do further processing. Return the
result that meets the requirement of the query. Go to point 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of the
query. Go to point 9.
Step 9: Stop or terminate.

Advantages:

• Grid-based computing is query-independent, because the statistics stored in each cell
represent a summary of the data in that cell and do not depend on any particular query.
• The grid structure facilitates parallel processing and incremental updates.

Disadvantage:

• The main disadvantage of STING is that all cluster boundaries are either horizontal or
vertical, so no diagonal boundaries are detected.

CLIQUE: An Apriori-like Subspace Clustering Method

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters
in subspaces.
It is based on automatically identifying the subspaces of a high-dimensional data space that allow
better clustering than the original space.
It uses a density threshold to identify dense cells and sparse ones: a cell is dense if the number of
objects mapped to it exceeds the density threshold.
The CLIQUE algorithm is very scalable with respect to the number of records and the number of
dimensions in the dataset because it is grid-based and uses the Apriori property effectively.
The Apriori property states that if a k-dimensional unit is dense, then all of its projections onto
(k−1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when projected
onto a lower-dimensional subspace.
CLIQUE restricts its search for high-dimensional dense cells to the intersection of dense cells
found in lower-dimensional subspaces, because it uses the Apriori property.


(a) Working of the CLIQUE Algorithm:

The CLIQUE algorithm first divides the data space into grids by dividing each dimension into
equal intervals called units. It then identifies dense units: a unit is dense if the number of data
points falling in it exceeds the threshold value.
Once the algorithm finds dense cells along one dimension, it tries to find dense cells along two
dimensions, and it continues until all dense cells along the full dimensionality are found.
After finding all dense cells in all dimensions, the algorithm proceeds to find the largest sets
("clusters") of connected dense cells. Finally, the CLIQUE algorithm generates a minimal
description of each cluster. Clusters are thus generated from all dense subspaces using the Apriori
approach.
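
The dense-unit step can be illustrated with a small NumPy sketch (the interval count and density threshold are made-up parameters, and the Apriori-style bottom-up combination of subspaces is omitted):

    import numpy as np

    def dense_cells(X, n_intervals=10, threshold=5):
        """Mark grid cells holding more than `threshold` points (toy CLIQUE step)."""
        counts, _ = np.histogramdd(X, bins=n_intervals)   # equal-width units per dimension
        return np.argwhere(counts > threshold)            # indices of dense grid cells

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.3, 0.05, (100, 2)), rng.normal(0.7, 0.05, (100, 2))])
    print(dense_cells(X))   # the dense cells concentrate around the two blobs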

Advantage:

• CLIQUE is a subspace clustering algorithm that outperforms K-means, DBSCAN, and
Farthest First in both execution time and accuracy.
• CLIQUE can find clusters of any shape and is able to find any number of clusters in
any number of dimensions, where the number is not predetermined by a parameter.
• It is one of the simplest methods, and its results are easy to interpret.

Disadvantage:

• The main disadvantage of the CLIQUE algorithm is that if the cell size is unsuitable for
the data, the density estimates become inaccurate and the correct clusters cannot be
found.

Evaluation of Clustering

Evaluation of clustering is the process of assessing how meaningful, accurate, and useful the
results of a clustering algorithm are. In simpler terms, it is about answering the question: "Did the
clustering algorithm do a good job?"

Since clustering is an unsupervised learning method (i.e., it doesn't use labels), evaluating it can
be tricky. Evaluation helps determine:

1. Whether the data has any real clusters at all

2. How many clusters there should be

3. How good or meaningful the resulting clusters are

The major tasks of clustering evaluation include the following:

• Assessing Clustering Tendency


Goal: Check if the data has any natural grouping before running a clustering algorithm.


Why: Applying clustering to random data will still produce clusters, but they might be meaningless.
Example Methods:
Hopkins Statistic: a spatial statistic that tests the spatial randomness of a variable as it is
distributed in space.
Visual inspection (scatter plots, heatmaps)

• Determining the Number of Clusters


Goal: Figure out how many clusters are appropriate for your data.
Why: Some algorithms (like k-means) require you to specify the number of clusters.
Example Methods:
Elbow Method: The Elbow Method helps you find the best value for k (the number of clusters)
by plotting:
X-axis: Number of clusters (k)
Y-axis: Within-Cluster Sum of Squares (WCSS) — also called inertia or distortion
WCSS measures how tight the clusters are — lower WCSS means points in a cluster are closer
to each other (better cohesion).
Silhouette Score
Gap Statistic
• Measuring Clustering Quality
Goal: Evaluate how well the algorithm grouped the data.
Why: To know if the clusters make sense, and to compare different clustering results.
Types of Methods: If ground truth is available, it can be used by extrinsic methods, which compare the
clustering against the ground truth and measure how well they match. If the ground truth is
unavailable, we can use intrinsic methods, which evaluate the goodness of a clustering by
considering how well the clusters are separated.
Extrinsic Methods:

Measure                    | Concept                                                    | Goal
Homogeneity                | Each cluster should have only one kind of item             | Avoid mixing categories in a cluster
Completeness               | All items from one category should be in the same cluster  | Keep whole categories together
Rag Bag                    | Avoid dumping unrelated items into one cluster             | No "junk drawer" clusters
Small Cluster Preservation | Keep small, unique groups intact                           | Don't lose small but important clusters

Intrinsic Methods:
The Silhouette Coefficient tells you how well each item fits into its own cluster compared to other
clusters.
You can think of it as answering this question:
“Am I closer to the stuff in my own group than to the stuff in the next closest group?”
Calculation of the Silhouette Coefficient: For each data point:
a = the average distance from the point to the other points in its own cluster (we want this to be small).
b = the average distance from the point to the points in the nearest other cluster (we want this to be big).
Then, the silhouette score for that point is:

s = (b − a) / max(a, b)


An average silhouette coefficient close to +1 indicates that the clusters are well-clustered.
An average silhouette coefficient close to 0 suggests overlapping clusters.
An average silhouette coefficient close to -1 indicates that the clustering might be wrong.
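
A minimal sketch combining the Elbow Method and the silhouette coefficient, assuming scikit-learn is available; the toy data and the range of k values are illustrative only:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.datasets import make_blobs

    # Toy data whose "true" number of clusters we pretend not to know.
    X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss = km.inertia_                       # within-cluster sum of squares (Elbow Method)
        sil = silhouette_score(X, km.labels_)    # average silhouette coefficient
        print(f"k={k}  WCSS={wcss:10.1f}  silhouette={sil:.3f}")

    # Choose k at the "elbow" of the WCSS curve and/or where the average
    # silhouette coefficient is highest (closest to +1).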
