Fundamentals of Data Science
Empower individuals and society at large through educational excellence; sensitize them
for a life dedicated to the service of fellow human beings and the motherland.
To impart holistic education that enables the students to become socially responsive and
useful, with roots firm in traditional and cultural values; and to hone their skills to accept
challenges and respond to opportunities in a global scenario.
Program Name: B.C.A        Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2        No. of Credits: 03
Contact Hours: 42 Hours        Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40        Summative Assessment Marks: 60
Course Outcomes (COs): After the successful completion of the course, the student will be able to:
Unit 1
Topics:
Data Mining:
Def 1: Refers to extracting or mining knowledge from large amounts of data stored in databases,
data warehouses, or other repositories, i.e., the extraction of small, valuable information from huge data.
Def 2: It is the process of discovering interesting patterns and knowledge from large amounts of data.
Data archaeology, data dredging, and data/pattern analysis are other terms for data mining. Another
popular term is Knowledge Discovery from Data (KDD).
Huge volumes of data are generated, and there is a need to turn them into useful information and
knowledge. This information and knowledge is used in various applications such as market analysis
(consumer buying patterns), fraud detection (fraudulent accounts, fraudulent credit card holders),
science exploration (hidden facts in data), telecommunication, etc.
Data cleaning involves identifying and correcting errors or inconsistencies in datasets to improve
their quality before analysis. Below is an example:
A retail company has a customer database with errors that need to be cleaned before performing
customer segmentation.
o Replace NULL in the "Country" column with the most common value.
o Age cannot be negative (-30) → Convert it to an estimated valid value (e.g., 30).
3. Standardization:
Now the dataset is clean, accurate, and ready for analysis in data mining.
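A minimal pandas sketch of the two cleaning rules above; the column names and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical customer data with the kinds of errors described above
df = pd.DataFrame({
    "Customer_ID": [101, 102, 103, 104],
    "Country": ["India", None, "India", "UK"],
    "Age": [25, -30, 28, 42],
})

# Replace NULL in the "Country" column with the most common (mode) value
df["Country"] = df["Country"].fillna(df["Country"].mode()[0])

# Age cannot be negative: convert -30 to an estimated valid value (here, its absolute value)
df["Age"] = df["Age"].abs()

print(df)
```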
Data integration is the process of combining data from multiple sources into a single, unified
view. This is essential in data mining to improve analysis, accuracy, and consistency.
Before Integration:

Transaction_ID | Customer_ID | Name       | Age | Country | Product | Price | Date       | Email
T001           | 101         | Mohan      | 25  | India   | Laptop  | 1200  | 2024-01-10 | mohan@mitfgc.com
T002           | 102         | Jane Smith | 30  | UK      | Phone   | 800   | 2024-01-11 | janesmith@xyz.com
T003           | 103         | Chandra    | 28  | India   | Tablet  | 500   | 2024-01-12 | chandra@mitfgc.com
Now, the company can analyze buying patterns, predict customer behavior, and improve
marketing strategies.
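A minimal pandas sketch of integrating two hypothetical sources (customer master data and transaction data) into one unified view on their common key.

```python
import pandas as pd

# Hypothetical source 1: customer master data
customers = pd.DataFrame({
    "Customer_ID": [101, 102, 103],
    "Name": ["Mohan", "Jane Smith", "Chandra"],
    "Country": ["India", "UK", "India"],
})

# Hypothetical source 2: transaction data
transactions = pd.DataFrame({
    "Transaction_ID": ["T001", "T002", "T003"],
    "Customer_ID": [101, 102, 103],
    "Product": ["Laptop", "Phone", "Tablet"],
    "Price": [1200, 800, 500],
})

# Integrate the two sources into a single, unified view on the common key
unified = pd.merge(transactions, customers, on="Customer_ID", how="inner")
print(unified)
```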
Data selection in data mining involves choosing relevant data from a larger dataset to improve
analysis efficiency and accuracy. It helps in focusing on only the necessary attributes instead of
processing the entire dataset.
A telecom company wants to predict customer churn (customers leaving the service). The
company has a large dataset with many attributes, but not all are useful for churn prediction.
Customer_ID | Name | Age | Gender | Address | Call_Duration | Data_Usage | Monthly_Bill | Payment_Method | Customer_Feedback | Cancel_Status
102         | Arun | 35  | Female | Mysore  | 10 min/day    | 2 GB       | 300          | UPI            | Dissatisfied      | Yes
To predict churn, some attributes are irrelevant (e.g., "Name", "Address") and can be removed.
The most relevant features are:
✔ Age – May impact churn behavior.
✔ Call_Duration – Shows engagement with the service.
✔ Data_Usage – Indicates usage patterns.
✔ Monthly_Bill – Higher bills may lead to churn.
✔ Customer_Feedback – Negative feedback might indicate potential churn.
✔ Cancel_Status – Target variable (churn) for prediction.
Now, this refined dataset is ready for churn (service cancellation) prediction models such as Decision
Trees or Neural Networks.
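A minimal pandas sketch of this attribute selection step; the extra customer row and the exact column values are hypothetical.

```python
import pandas as pd

# Hypothetical churn dataset with all original attributes
df = pd.DataFrame({
    "Customer_ID": [101, 102],
    "Name": ["Ravi", "Arun"],
    "Age": [29, 35],
    "Gender": ["Male", "Female"],
    "Address": ["Bengaluru", "Mysore"],
    "Call_Duration": [25, 10],       # minutes/day
    "Data_Usage": [1.5, 2.0],        # GB/day
    "Monthly_Bill": [450, 300],
    "Customer_Feedback": ["Satisfied", "Dissatisfied"],
    "Cancel_Status": ["No", "Yes"],  # target variable
})

# Keep only the attributes judged relevant for churn prediction
relevant = ["Age", "Call_Duration", "Data_Usage", "Monthly_Bill",
            "Customer_Feedback", "Cancel_Status"]
selected = df[relevant]
print(selected)
```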
Data transformation in data mining is the process of converting data into a suitable format for
analysis. This includes normalization, aggregation, discretization, encoding, and feature
engineering.
• Example: If today’s year is 2024, then Customer Tenure = 2024 - Join Year.
Customer_ID | Age | Income (Rs) | Purchase_Frequency | Join_Date  | Preferred_Payment_Method
101         | 25  | 50000       | 15                 | 2018-06-10 | Credit Card
102         | 42  | 120000      | 5                  | 2015-09-25 | UPI
103         | 30  | 80000       | 8                  | 2020-01-12 | Debit Card
After feature/column generation:

Customer_ID | Age | Customer_Tenure
101         | 25  | 6
102         | 42  | 9
103         | 30  | 4
Improves Model Accuracy – Scaled and encoded data improves machine learning
performance.
Enhances Interpretability – Transformed features make patterns easier to detect.
Reduces Computational Complexity – Normalization speeds up algorithms.
Now, the dataset is ready for clustering algorithms like K-Means or for classification models.
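A minimal pandas sketch of the two transformations above (tenure generation, assuming the current year is 2024, plus min-max scaling of Income).

```python
import pandas as pd

# Customer data from the example above
df = pd.DataFrame({
    "Customer_ID": [101, 102, 103],
    "Age": [25, 42, 30],
    "Income": [50000, 120000, 80000],
    "Join_Date": ["2018-06-10", "2015-09-25", "2020-01-12"],
})

# Feature generation: Customer_Tenure = current year - join year (assuming 2024)
df["Join_Year"] = pd.to_datetime(df["Join_Date"]).dt.year
df["Customer_Tenure"] = 2024 - df["Join_Year"]

# Normalization: min-max scale Income to [0, 1]
df["Income_Scaled"] = (df["Income"] - df["Income"].min()) / (df["Income"].max() - df["Income"].min())

print(df[["Customer_ID", "Age", "Customer_Tenure", "Income_Scaled"]])
```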
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base.
Architecture of DM System
User interface: the interface between the user and the DM system. The user specifies queries and
tasks, browses the data, and visualizes the output.
Data mining involves an integration of techniques from multiple disciplines such as database
systems, data warehousing, statistics, machine learning, pattern recognition, neural networks, data
visualization, information retrieval, image/signal processing, and spatial & temporal data analysis.
DM can be used to mine knowledge from any kind of data source like:
Data mining can be applied to various types of data to discover patterns, trends, and useful insights.
Below are the main categories:
1. Structured Data
Definition: Data that is organized in a fixed format, typically stored in relational databases (tables
with rows and columns).
Examples:
2. Semi-Structured Data
Definition: Data that does not fit into a rigid structure but still has some organization (e.g., tags,
metadata).
Examples:
Mining Techniques Used: Text Mining, Information Extraction, Natural Language Processing
(NLP)
3. Unstructured Data
Definition: Data that has no predefined format, making it more complex to analyze.
Examples:
4. Spatial Data
Examples:
• Satellite images
5. Time-Series Data
Definition: Data collected over time, where the sequence and time intervals matter.
Examples:
• Weather data
6. Web Data
Examples:
Mining Techniques Used: Web Scraping, Sentiment Analysis, Page Ranking Algorithms
7. Multimedia Data
Examples:
1) Concept/class description:
Data entries can be associated with classes or concepts. For example, classes of items for
sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders. It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions.
1. Data characterization, by summarizing the data of the class under study. (Ex: based on
gender, buying behavior)
2. Data discrimination, by comparing the target class with one or a set of comparative (contrasting)
classes. (Ex: comparing sales of desktop computers with laptops)
3. Both data characterization & discrimination
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables
2) Mining Frequent Patterns, Associations & Correlations:
Frequent patterns are patterns that occur frequently in data. Mining frequent patterns leads
to the discovery of interesting associations and correlations within data. Different kinds of frequent
patterns are:
• Itemsets – A frequent itemset typically refers to a set of items that often appear together
in a transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers.
• Subsequences – A frequently occurring subsequence, such as the sequence of items that
customers tend to purchase one after another, e.g., mobile phone, then back case, then screen guard.
3) Classification & Prediction:
This is the process of building a model that describes the classes and then predicting the class of
new objects using the model. The model can be built using IF–THEN rules, decision trees, neural
networks, etc. Methods for constructing classification models include Bayesian classification,
SVM, and k-nearest neighbor.
Ex: A bank manager wants to analyze which loan applicants are safe and which may pose a risk.
4) Regression Analysis Regression analysis is a reliable method of identifying which variables
have impact on a topic of interest. The process of performing a regression allows you to
confidently determine which factors matter most, which factors can be ignored, and how these
factors influence each other.
Regression analysis is a statistical process that estimates the relationship between a dependent
variable and one or more independent variables.
o Regression
o Analysis of Variance
o Mixed-Effect Models
o Factor Analysis
o Discriminant Analysis
o Survival Analysis
o Visualization: Use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data
o Visual Data Mining: discovering implicit but useful knowledge from large data sets using
visualization techniques
Visual data mining discovers implicit and useful knowledge from large data sets using data and/or
knowledge visualization techniques. Visual data mining can be viewed as an integration of two
disciplines: data visualization and data mining. It is also closely related to computer graphics,
multimedia systems, human–computer interaction, pattern recognition, and high-performance
computing.
In general, data visualization and data mining can be integrated in the following ways:
Data visualization: Data in a database or data warehouse can be viewed at different granularity or
abstraction levels, or as different combinations of attributes or dimensions. Data can be presented
in various visual forms, such as boxplots, 3-D cubes, data distribution charts, curves, surfaces, and
link graphs, etc. Visual display can help give users a clear impression and overview of the data
characteristics in a large data set.
Data mining result visualization: Visualization of data mining results is the presentation of the
results or knowledge obtained from data mining in visual forms. Such forms may include scatter
plots and boxplots , as well as decision trees, association rules, clusters, outliers, and generalized
rules.
Data mining process visualization: This type of visualization presents the various processes of data
mining in visual forms so that users can see how the data are extracted and from which database
or data warehouse they are extracted, as well as how the selected data are cleaned, integrated,
preprocessed, and mined. Moreover, it may also show which method is selected for data mining,
where the results are stored, and how they may be viewed.
KDD vs. Data Mining:

KDD | Data Mining
Knowledge Discovery in Databases (KDD) is a process that automatically discovers patterns, rules, and other regular contents in large amounts of data. | Data mining (DM) is a step in the KDD process that involves applying algorithms to extract patterns from data.
KDD is a systematic process for identifying patterns in large and complex data sets. | Data mining is the foundation of KDD and is essential to the entire methodology.
Overall set of processes for knowledge extraction: data cleaning, data selection, data integration, data mining, pattern evaluation, and knowledge presentation. | Data mining is the process of extracting hidden knowledge from large data. Intelligent algorithms are used to extract useful information: data characterization, data discrimination, association, frequent pattern mining, regression, outlier analysis, classification, clustering, etc.
Contains several steps. | It is one step in KDD.
Sometimes used as an alias for data mining. | Sometimes used as an alias for KDD.

DBMS vs. Data Mining:

DBMS | Data Mining
A system to manage the data in a database: creation, insertion, deletion, updating, etc. | The process of extracting hidden knowledge from large data using intelligent algorithms (characterization, discrimination, association, frequent pattern mining, regression, outlier analysis, classification, clustering, etc.).
Stores data in a format suitable for data management. | Data from the database is used for mining.
Major Issues in DM
a. Mining Methodology:
Researchers have been vigorously developing new DM techniques. This involves the
investigation of new kinds of knowledge, mining in multidimensional space, integrating
methods from other disciplines, and consideration of semantic ties among data objects.
b. User Interaction:
Users play an important role in DM process. Interesting areas of research include how
to interact with a DMS, how to incorporate a user’s background knowledge in mining and
how to visualize and comprehend data mining results.
c. Efficiency & Scalability:
DM algorithms must be efficient & scalable in order to effectively extract information
from huge amount of data in many data repositories or in dynamic data streams. In other
words running time of algorithm must be short.
d. Diversity of database types:
The discovery of knowledge from different sources of structured or unstructured yet
interconnected data with diverse data semantics poses great challenges to DM.
e. DM & Society:
I. Social Impact of DM:
The improper disclosure or use of data & the potential violations of individual
privacy and data protection rights are areas of concern that need to be addressed.
II. Privacy – Preserving DM:
DM poses a risk of disclosing an individual's personal information. The research goal
is to observe data sensitivity and preserve people's privacy while still performing successful
DM.
III. Invisible DM:
When purchasing online, the users might be unaware that the store is likely
collecting data on the buying patterns of its customers, which may be used to
recommend other items for purchase in the future.
1. Business Intelligence:
BI technologies provide historical, current, and predictive views of business operations.
Without data mining, many businesses may not be able to perform effective market analysis,
compare customer feedback on similar products, discover the strengths and weaknesses of
competitors, carry out predictive analysis, etc.
2. Web Search Engine:
Web search engines are very large DM applications. Various DM tasks such as crawling,
indexing, ranking, and searching are used.
o Telecommunications and many other industries: share many goals and expectations similar to
retail data mining
o Other issues
- Data mining in social sciences and social studies: text and social media
Data mining techniques play a vital role in intrusion detection, spotting network attacks and
anomalies. These techniques help in selecting and refining useful and relevant information from
large data sets, and help classify relevant data for an Intrusion Detection System (IDS). An IDS
generates alarms about foreign invasions observed in the network traffic. For example:
• Detect security violations
• Misuse Detection
• Anomaly Detection
- Content-based: Recommends items that are similar to items the user preferred or
queried in the past
Business Transactions: Every business transaction is memorized for perpetuity. Such transactions
are usually time-related and can be inter-business deals or intra-business operations. The
effective and in-time use of the data in a reasonable time frame for competitive decision-making
is definitely the most important problem to solve for businesses that struggle to survive in a
highly competitive world. Data mining helps to analyze these business transactions and identify
marketing approaches and support decision-making. Example:
• Direct mail targeting
• Stock trading
• Customer segmentation
Market Basket Analysis: Market basket analysis is a technique that involves the careful study of
purchases made by customers in a supermarket. It identifies patterns of items that customers
frequently purchase together. This analysis can help companies promote deals, offers, and sales,
and data mining techniques help achieve this analysis task. Example:
• Data mining concepts are in use for Sales and marketing to provide better customer
service, to improve cross-selling opportunities, to increase direct mail response rates.
• Customer Retention in the form of pattern identification and prediction of likely
defections is possible by Data mining.
• Risk Assessment and Fraud area also use the data-mining concept for identifying
inappropriate or unusual behavior etc.
Education: For analyzing the education sector, data mining uses the Educational Data Mining
(EDM) method. This method generates patterns that can be used both by learners and educators.
Using EDM, we can perform educational tasks such as:
• Predicting student admission to higher education
• Student profiling
• Predicting student performance
• Evaluating teachers' teaching performance
• Curriculum development
• Predicting student placement opportunities
Research: A data mining technique can perform predictions, classification, clustering,
associations, and grouping of data with perfection in the research area. Rules generated by data
mining are unique to find results. In most of the technical research in data mining, we create a
training model and testing model. The training/testing model is a strategy to measure the
precision of the proposed model. It is called Train/Test because we split the data set into two
sets: a training data set and a testing data set. A training data set used to design the training model
whereas testing data set is used in the testing model. Example:
• Classification of uncertain data.
• Information-based clustering.
• Decision support system
• Web Mining
• Domain-driven data mining
• IoT (Internet of Things)and Cybersecurity
• Smart farming IoT(Internet of Things)
Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force activity
and its outcomes to improve the targeting of high-value physicians and figure out which
promotional activities will have the best effect in the upcoming months. In the insurance sector,
data mining can help predict which customers will buy new policies, identify behavior patterns
of risky customers, and detect fraudulent behavior of customers.
• Claims analysis i.e which medical procedures are claimed together.
• Identify successful medical therapies for different illnesses.
Program Name: B.C.A        Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2        No. of Credits: 03
Unit 2
Topics:
Data Warehouse:
According to William H. Inmon, a leading architect in the construction of data warehouse systems
(Father of Data Warehouse- American Computer Scientist), “A data warehouse is a subject-
oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s
decision making process”. In simple words, it is a centralized data location for multiple sources of
data for management decision making process.
Data warehousing:
The process of constructing and using data warehouses, as shown in the following figure.
• Usage: OLTP is used in operational systems like banking, retail, and airline reservations.
(Comparison table: OLTP vs. OLAP)
◼ The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (e.g., customer profile information provided
by external consultants). These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as
well as load and refresh functions to update the data warehouse
◼ The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or
(2) a Multi-dimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
◼ The top tier is a front-end client layer , which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
o Enterprise warehouse
o collects all of the information about subjects spanning the entire organization
o Data Mart
o a subset of corporate-wide data that is of value to a specific groups of users. Its
scope is confined to specific, selected groups, such as marketing data mart
o Virtual warehouse
o A set of views over operational databases
o Only some of the possible summary views may be materialized
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in Figure.
First, a high-level corporate data model is defined within a reasonably short period (such as
one or two months) that provides a corporate-wide, consistent, integrated view of data among
different subjects and potential usages. This high-level model, although it will need to be
refined in the further development of enterprise data warehouses and departmental data marts,
will greatly reduce future integration problems. Second, independent data marts can be
implemented in parallel with the enterprise warehouse based on the same corporate data model
set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.
Fact Table (Measures)
o Sales_Amount (Measure)
o Quantity_Sold (Measure)
Dimension Tables (Descriptive Data)
1. Dim_Date (Time-based details)
o Date_Key (Primary Key)
o Date
o Month
o Quarter
o Year
2. Dim_Product (Product details)
o Product_Key (Primary Key)
o Product_Name
o Category
o Brand
3. Dim_Customer (Customer details)
o Customer_Key (Primary Key)
o Customer_Name
o Age
o Gender
o Location
4. Dim_Store (Store details)
o Store_Key (Primary Key)
o Store_Name
o City
o Region
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid,
which holds the highest-level of summarization, is called the apex cuboid. The apex cuboid is
typically denoted by ‘all’.
The lattice (a patterned structure, like a fence) of cuboids forms a data cube, as shown below.
In multidimensional data modeling for a data warehouse, three common schemas define how fact
and dimension tables are structured:
1. Star schema: A fact table in the middle connected to a set of dimension tables.
Star Schema → Best for fast query performance and simple design.
Galaxy Schema → Best for complex business models with multiple fact tables.
OLAP Operations
o Drill down (roll down): In drill-down operation, the less detailed data is converted into
highly detailed data. It can be done by:
o Slice: Extracts a subset of the data for a single dimension value. It selects a single
dimension from the OLAP cube which results in a new sub-cube creation.
Example: Viewing sales for Q1 2024 in New York for Electronics category.
o Pivot (rotate):
Summary Table:

OLAP Operation | Function                       | Example
Drill-Down     | Breaks data into a finer level | Sales from yearly → monthly
Slice          | Selects data for one dimension | Sales only for Q1 2024
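The OLAP operations above can be imitated on a small table with pandas; a minimal sketch with hypothetical sales data follows (a relational group-by stands in for cube roll-up/drill-down).

```python
import pandas as pd

# Hypothetical sales fact data at the month level
sales = pd.DataFrame({
    "Year":     [2024, 2024, 2024, 2024],
    "Quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "Month":    ["Jan", "Feb", "Apr", "May"],
    "Region":   ["New York", "New York", "London", "New York"],
    "Category": ["Electronics", "Clothing", "Electronics", "Electronics"],
    "Sales":    [1200, 800, 1500, 950],
})

# Roll-up: monthly -> quarterly totals (less detail)
rollup = sales.groupby(["Year", "Quarter"])["Sales"].sum()

# Drill-down: quarterly -> monthly totals (more detail)
drilldown = sales.groupby(["Year", "Quarter", "Month"])["Sales"].sum()

# Slice: fix a single dimension value, e.g. Quarter == "Q1"
slice_q1 = sales[sales["Quarter"] == "Q1"]

# Dice: select on two or more dimensions
dice = sales[(sales["Quarter"] == "Q1") & (sales["Region"] == "New York")
             & (sales["Category"] == "Electronics")]

print(rollup, drilldown, slice_q1, dice, sep="\n\n")
```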
Data Cleaning
Today's data are highly susceptible to noise, missing values, and inconsistency because of their
typically huge size and heterogeneous sources. Low-quality data will lead to poor mining results.
Different data preprocessing techniques (data cleaning, data integration, data reduction, data
transformation), when applied before data mining, improve the overall quality of the patterns
mined and reduce the time required for the actual mining. The data cleaning stage helps smooth
out noise, attempts to fill in missing values, removes outliers, and corrects inconsistencies in the data.
1) Handling missing values: Missing values are encountered due to Data entry errors,
system failures, incomplete records.
Techniques to handle missing values:
i. Ignoring the tuple: used when the class label is missing. This method is not very
effective when many values are missing.
ii. Fill in missing value manually: It is time consuming.
iii. Using global constant to fill missing value: Ex: unknown or ∞
iv. Use attribute mean to fill the missing value
v. Use attribute mean for all samples belonging to the same class as the given
tuple
vi. Use most probable value to fill the missing value: (using decision tree)
2) Handling Noisy data: Noise is a random error or variance in a measured variable, caused by
sensor errors, outliers, rounding errors, or incorrect data entry.
Data Integration
Data mining often works on integrated data from multiple repositories. Careful integration helps
in accuracy of data mining results.
Challenges of DI
Data Reduction
Data Reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintain the integrity of the original data.
1. Dimensionality reduction:
Reducing the number of attributes/variables under consideration.
Ex: Attribute subset selection, Wavelet Transform, PCA.
2. Numerosity reduction:
Replace original data by alternate smaller forms, clustering.
Ex: Histograms, Sampling, Data cube aggregation,
3. Data compression:
Reduce the size of data.
Wavelet Transform:
DWT- Discrete Wavelet Transform is a linear signal processing technique, that when applied to a
data vector X, transforms it to a numerically different vector X’ of same length. The DWT is a fast
and simple transformation that can translate an image from the spatial domain to the frequency
domain.
PCA reduces the number of variables or features in a data set while still preserving the most
important information like major trends or patterns.
Dataset for analysis consists of many attribute which may be irrelevant to the mining task. (Ex:
Telephone no. may not be important while classifying customer). Attribute subset selection reduces
the data set by removing irrelevant attributes.
• Combined method: at each step, the procedure selects the best attribute and removes the
worst from the remaining attributes.
4. Decision Tree Induction:
In DTI, a tree is constructed from the given data. All attributes that do not appear in the tree are
assumed to be irrelevant. Measures such as Information Gain, Gain Ratio, Gini Index, Chi-square
statistics, etc., are used to select the best attributes out of the set of attributes, thereby
reducing the number of attributes.
Histograms:
Histogram is a frequency plot. It uses bins/buckets to approximate data distributions and are
popular form of data reduction. They are highly effective at approximating both sparse & dense
data as well as skewed & uniform data.
The following data are a list of AllElectronics prices for commonly sold items (rounded to the
nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30. The figure shows the histogram for this data.
Clustering:
Clustering partitions data into clusters/groups whose members are similar/close to one another. In
data reduction, the cluster representation of the data is used to replace the actual data: instead of
storing all data points, store only cluster centroids or representative points.
Example:
• Given a dataset with 1 million customer records, k-means clustering can reduce it to 100
clusters, where each centroid represents a group of similar customers.
Example:
• In gene expression data, clustering similar genes can help reduce thousands of variables
into meaningful groups.
Instead of analyzing the entire dataset, work on a sample of clusters that represent the whole
population.
Example:
• Market research: Instead of surveying all customers, businesses analyze a few customer
segments.
Clustering helps detect and remove outliers, reducing noise in the dataset.
Example:
• Fraud detection: Unusual transaction patterns form separate clusters, helping identify
fraudulent activities.
Sampling:
Used as a data reduction technique in which a large data set D is represented by a much smaller
random sample (subset) of the data.
Simple random sample without replacement (SRSWOR) of size s: created by drawing s of the N
tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are
equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, except that each
time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is
placed back in D so that it may be drawn again.
Cluster sample: the tuples in D are grouped into M mutually disjoint "clusters"; then an SRS of s
clusters can be obtained, where s < M.
Stratified sample: if D is divided into mutually disjoint parts called strata, a stratified sample of D
is generated by obtaining an SRS at each stratum. For example, a stratified sample may be obtained
from customer data, where a stratum is created for each customer age group. In this way, the age
group having the smallest number of customers will be sure to be represented.
An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional
to the size of the sample, s , as opposed to N , the data set size. Hence, sampling complexity is
potentially sublinear to the size of the data.
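A minimal pandas sketch of these sampling schemes on a hypothetical customer table (the column names are illustrative).

```python
import pandas as pd

# Hypothetical customer data D with an age-group attribute used for stratification
D = pd.DataFrame({
    "Customer_ID": range(1, 101),
    "Age_Group":   ["young"] * 60 + ["middle"] * 30 + ["senior"] * 10,
})

s = 10  # desired sample size

# SRSWOR: simple random sample of s tuples without replacement
srswor = D.sample(n=s, replace=False, random_state=1)

# SRSWR: simple random sample of s tuples with replacement
srswr = D.sample(n=s, replace=True, random_state=1)

# Stratified sample: an SRS drawn from each stratum (age group)
stratified = D.groupby("Age_Group", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1))

print(len(srswor), len(srswr), len(stratified))
```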
Data Transformation
The data is transformed or consolidated so that the resulting mining process may be more efficient,
and the patterns found may be easier to understand.
The measurement unit used can affect data analysis. To help avoid dependence on the choice of
measurement units, the data should be normalized or standardized. This involves transforming the
data to fall within a smaller or common range such as Range = [-1,1], [0.0,1.0].
Normalizing the data attempts to give all attributes an equal weight. For example, measuring
height in inches instead of meters leads to different results because of the larger range of values for
that attribute. To help avoid dependence on the choice of units, the data should be normalized.
Normalization attempts to give all attributes equal weight. Normalization is useful in classification
algorithm involving neural networks or distance measurements such as nearest neighbor
classification & clustering. There are different methods for normalization like - min-max
normalization, z-score normalization, normalization by decimal scaling.
Min-Max Normalization:
Performs a linear transformation of the original data to a new range [new_min, new_max]:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max - new_min) + new_min
For example, if the minimum and maximum values of income are Rs.12,000 and Rs.98,000 and income
is mapped to [0.0, 1.0], then a value of Rs.73,600 is transformed to
(73,600 - 12,000) / (98,000 - 12,000) = 0.716, i.e., v' = 0.716.
Z-score Normalization:
The values of attribute A are normalized using the mean (μ_A) and standard deviation (σ_A) of A:
  v' = (v - μ_A) / σ_A
Alternatively, the mean absolute deviation (s_A) can be used in place of the standard deviation (σ_A),
since it is more robust to outliers.
Decimal Scaling:
Normalizes by moving the decimal point of the values of A. The number of decimal places moved
depends on the maximum absolute value of A:
  v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
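A minimal NumPy sketch of the three normalization methods, using a hypothetical income attribute whose minimum and maximum are 12,000 and 98,000 so that the min-max result for 73,600 matches the 0.716 above.

```python
import numpy as np

# Income values of a hypothetical attribute A
A = np.array([12000, 35000, 73600, 98000], dtype=float)

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (A - A.min()) / (A.max() - A.min()) * (new_max - new_min) + new_min
# 73600 maps to (73600 - 12000) / (98000 - 12000) = 0.716

# Z-score normalization: (v - mean) / standard deviation
zscore = (A - A.mean()) / A.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that max(|v'|) < 1 (here j = 5, since max |A| = 98000)
j = int(np.ceil(np.log10(np.abs(A).max())))
decimal_scaled = A / (10 ** j)

print(minmax, zscore, decimal_scaled, sep="\n")
```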
Program Name: B.C.A        Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Unit 3
Topics:
Mining Frequent Patterns: Basic Concept – Frequent Item Set Mining Methods -Apriori
and Frequent Pattern Growth (FPGrowth) algorithms -Mining Association Rules.
Basic Concepts
Item: Refers to an item/product/data value in a dataset. E.g., Mobile, Case, Mouse, Keyboard,
Temp, Cold, etc.
Itemset: Set of items in a single transaction. Eg., X={Mobile, charger, screen guard}
Y={Headset, pendrive};
Support: the fraction (or percentage) of transactions in the dataset that contain a given itemset:
  Support(X) = (number of transactions containing X) / (total number of transactions)
where X is the itemset for which you are calculating the support.
Support is often used as a threshold for identifying frequent itemsets in a dataset, which can
be used to generate association rules. For example, if we set the support threshold to 5%, then
any itemset that occurs in more than 5% of the transactions in the dataset will be considered a
frequent itemset.
Closed Itemset: A frequent itemset with no superset that has the same support.
For example, if a dataset contains 100 transactions and the item set {milk, bread} appears in
20 of those transactions, the support count for {milk, bread} is 20. If there is no superset of
{milk, bread} that has a support count of 20, then {milk, bread} is a closed frequent itemset.
Closed frequent itemsets are useful for data mining because they can be used to identify
patterns in data without losing any information. They can also be used to generate association
rules, which are expressions that show how two or more items are related.
Maximal Frequent Itemset: A frequent itemset with no superset that is also frequent. For
example if an itemset {a,b,c} is frequent itemset and do not have a superset which is also
frequent.
Confidence:
Confidence is a measure of the likelihood that an itemset will appear if another itemset
appears. It is based on conditional probability:
  Confidence(X → Y) = Support(X ∪ Y) / Support(X)
For example, suppose we have a dataset of 1000 transactions, the itemset {milk, bread} appears in
100 of those transactions, and the itemset {milk} appears in 200 of those transactions. The
confidence of the rule "If a customer buys milk, they will also buy bread" is calculated as follows:
  Confidence(milk → bread) = 100 / 200 = 50%
i.e., if a customer buys milk, there is a 50% chance that the customer will also buy bread.
Support and confidence are two measures that are used in association rule mining to evaluate
the strength of a rule. Both support and confidence are used to identify strong association
rules. A rule with high support is more likely to be of interest because it occurs frequently
in the dataset. A rule with high confidence is more likely to be valid because it has a high
likelihood of being true.
Lift Measure in Association Rule Mining
Lift is a metric used to evaluate the strength of an association rule. It measures how much
more likely the occurrence of Y is when X is present, compared to when X and Y are
independent.
Formula for Lift
Lift(X→Y)=Confidence(X→Y)/Support(Y)
Where:
• Support(Y) = Probability of Y occurring in the dataset.
• Confidence(X → Y) = Probability of Y occurring given X has occurred.
Interpreting Lift Values
• Lift = 1 → X and Y are independent (no association).
• Lift > 1 → Positive correlation (Y is more likely when X happens).
• Lift < 1 → Negative correlation (Y is less likely when X happens).
Example:
→ Customers who buy Milk are 1.5× more likely to buy Bread than random chance
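A minimal Python sketch of how support, confidence, and lift relate, computed over a small hypothetical transaction list (the milk/bread figures here are illustrative, not the ones from the text).

```python
# Small hypothetical transaction list
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"milk"}, {"bread"}
supp_xy = support(X | Y)            # Support(X ∪ Y)
conf = supp_xy / support(X)         # Confidence(X -> Y)
lift = conf / support(Y)            # Lift(X -> Y)

print(f"support={supp_xy:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```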
Example for Support, Closed and Maximal Itemset:
Given Transactions Dataset
2-Itemsets (Pairs)
3-Itemsets (Triplets)
Healthcare
→ Patients with Fever and Cough are likely to have the Flu.
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy the minimum support and minimum confidence thresholds.
• Market basket analysis: Helps identify items that are commonly purchased
• Web usage mining: Helps understand user browsing patterns
• Bioinformatics: Helps analyze gene sequences
• Fraud detection: Helps identify unusual patterns
• Healthcare: Analyzing patient data and identifying common patterns or risk factors.
• Recommendation systems: Identify patterns of user interaction and helps with
recommendation to the users of an application.
• Cross-selling and up-selling : Identifying related products to recommend or suggest
to customers.
1. Set the minimum support threshold - min frequency required for an itemset to be
"frequent".
2. Identify frequent individual items - count the occurrence of each individual item.
3. Generate candidate itemsets of size 2 - create pairs of frequent items discovered.
4. Prune infrequent itemsets - eliminate itemsets that do not meet the threshold levels.
5. Generate itemsets of larger sizes - combine the frequent itemsets to form candidate itemsets of size 3, 4, and so on.
6. Repeat the pruning process - keep eliminating the itemsets that do not meet the
threshold levels.
7. Iterate till no more frequent itemsets can be generated.
8. Generate association rules that express the relationship between them - calculate
measures to evaluate the strength & significance of these rules.
Algorithm (outline):
1. Find the frequent 1-itemsets, L1, by scanning the database and keeping the items that meet the minimum support.
2. k = 2
3. Repeat:
   a. Generate candidate k-itemsets Ck by joining Lk-1 with itself.
   b. Prune candidates that contain any infrequent (k-1)-subset.
   c. Scan the database to count the support of each candidate in Ck.
   d. Lk = the candidates in Ck that meet the minimum support.
   e. k = k + 1
4. Until Lk is empty
5. Return the union of all Lk as the set of frequent itemsets.
Example:
Consider a dataset of simple business transactions: Min support=50% and Threshold
confidence=70%
TID Items
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5
Item | Support
1    | 2/4 = 50%
2    | 3/4 = 75%
3    | 3/4 = 75%
4    | 1/4 = 25%
5    | 3/4 = 75%
Remove the items that have support less than 50%.
Itemset- L1
1
2
3
5
Step 2: Form Itemset of size 2 (pairs) by using L1.
Item Support
1,2 1/4=25%
1,3 2/4=50%
1,5 1/4=25%
2,3 2/4=50%
2,5 3/4=75%
3,5 2/4=50%
Remove the itemsets that have support less than 50%.
Itemset L2:
{1,3}, {2,3}, {2,5}, {3,5}
Step 3: Form itemsets of size 3 (triplets) using L2 and keep those meeting the minimum support.
Itemset L3:
{2,3,5}
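The frequent itemsets above can be verified with a short brute-force Python sketch; it enumerates all candidate itemsets of a given size rather than using Apriori's level-wise candidate generation, but it reproduces the same L1, L2, and L3 for this small dataset.

```python
from itertools import combinations

# Transactions from the worked example above (min support = 50%, i.e. 2 of 4)
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
]
min_support = 0.5
n = len(transactions)

def frequent_itemsets(size):
    """Count all candidate itemsets of the given size and keep the frequent ones."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for cand in combinations(items, size):
        count = sum(set(cand) <= t for t in transactions)
        if count / n >= min_support:
            result[cand] = count / n
    return result

print(frequent_itemsets(1))  # L1: {1}, {2}, {3}, {5}
print(frequent_itemsets(2))  # L2: {1,3}, {2,3}, {2,5}, {3,5}
print(frequent_itemsets(3))  # L3: {2,3,5}
```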
1. Hash-Based Technique: This method uses a hash-based structure called a hash table
for generating the k-itemsets and their corresponding count. Uses hash tables to reduce
the number of candidate k-itemsets.
Note: A hash table is a data structure that stores key-value pairs. It uses a hash
function to map keys to specific locations (indexes) in an array, making data retrieval
fast and efficient.
Example: A hash table stores itemset counts, and infrequent hash buckets are pruned early.
Benefit: Reduces the number of candidates in later iterations.
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it should
be frequent in at least one of the partitions of the database.
5. Dynamic Itemset Counting: This method allows for the addition of new candidate
itemsets at any point during the database scan. This can reduce the number of database
scans required.
The FP-Growth (Frequent Pattern Growth) algorithm mines frequent itemsets without generating candidate
itemsets explicitly. It is particularly suitable for datasets with long patterns and relatively low
support thresholds.
The working of the FP Growth algorithm in data mining can be summarized in the following
steps:
In this step, the algorithm scans the input dataset to determine the frequency of each item. This
determines the order in which items are added to the FP tree, with the most frequent items
added first.
Sort items:
In this step, the items in the dataset are sorted in descending order of frequency. The infrequent
items that do not meet the minimum support threshold are removed from the dataset. This
helps to reduce the dataset's size and improve the algorithm's efficiency.
In this step, the FP-tree is constructed. The FP-tree is a compact data structure that stores the
frequent itemsets and their support counts.
Once the FP-tree has been constructed, frequent itemsets can be generated by recursively
mining the tree. Starting at the bottom of the tree, the algorithm finds all combinations of
frequent item sets that satisfy the minimum support threshold.
Once all frequent item sets have been generated, the algorithm post-processes the generated
frequent item sets to generate association rules, which can be used to identify interesting
relationships between the items in the dataset.
FP Tree
The FP-tree (Frequent Pattern tree) is a data structure used in the FP Growth algorithm for
frequent pattern mining. It represents the frequent itemsets in the input dataset compactly and
efficiently. The FP tree consists of the following components:
Root Node:
The root node of the FP-tree represents an empty set. It has no associated item but a pointer to
the first node of each item in the tree.
Item Node:
Each item node in the FP-tree represents a unique item in the dataset. It stores the item name
and the frequency count of the item in the dataset.
Header Table:
The header table lists all the unique items in the dataset, along with their frequency count. It
is used to track each item's location in the FP tree.
Child Node:
Each child node of an item node represents an item that co-occurs with the item the parent
node represents in at least one transaction in the dataset.
Node Link:
The node-link is a pointer that connects each item in the header table to the first node of that
item in the FP-tree. It is used to traverse the conditional pattern base of each item during the
mining process.
The FP-tree is constructed by scanning the input dataset and inserting each transaction into the
tree one at a time. For each transaction, the items are sorted in descending order of frequency
count and then added to the tree in that order. If an item already exists along the current path,
its frequency count is incremented; if it does not, a new node is created for that item and a new
branch is added to the tree. We will see in detail how the FP-tree is constructed in the next section.
Example: Consider the following transactions with minimum support count >= 2.
Item Frequency
I1 6
I2 7
I3 6
I4 2
I5 2
Remove all the items below minimum support in the above table: As all items are above the
threshold no items are removed.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled
with “null.” Scan database D a second time. The items in each transaction are processed
in L order (i.e., sorted according to descending support count), and a branch is created
for each transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,”
which contains three items (I2, I1, I5 in L order), leads to the construction of the first
branch of the tree with three nodes, {I2: 1, I1: 1}, and {I5: 1}, where I2 is linked as a
child to the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200,
contains the items I2 and I4 in L order, which would result in a branch where I2 is linked
to the root and I4 is linked to I2. However, this branch would share a common prefix,
I2, with the existing path for T100. Therefore, we instead increment the count of the I2
node by 1, and create a new node, I4: 1, which is linked as a child to I2: 2. To facilitate tree
traversal, an item header table is built so that each item points to its occurrences in the tree via
a chain of node-links.
Prepare the conditional pattern base and conditional FP Tree and Frequent pattern
generated.
Advantages of FP-Growth
📌 This makes FP-Growth much faster than Apriori for large datasets.
Apriori scans the database multiple times, while FP-Growth compresses data into an FP-Tree,
requiring fewer scans.
Since FP-Growth stores data in a tree structure, it scales better for large datasets with many
transactions.
Unlike Apriori, FP-Growth does not explode in size when handling large itemsets.
FP-Growth handles sparse datasets better than Apriori, especially when transactions contain a
large number of unique items.
Disadvantages of FP-Growth
1. Complex implementation compared to Apriori.
2. High memory use: if the dataset has many frequent patterns, the FP-tree can become large,
requiring more memory. This happens in dense datasets (where many items appear together frequently).
3. Less flexible for dynamic or real-time updates compared to Apriori.
The Vertical Data Format is a way of representing transactions in frequent itemset mining
where we store items along with their transaction IDs (TIDs) instead of listing transactions as
item sets.
Format Representation
Example

Horizontal format (TID → items):
TID | Items
1   | A, B, C
2   | A, C
3   | A, B
4   | B, C
5   | B, C, D

Vertical format (item → TID set):
Item | TID set
A    | {1, 2, 3}
B    | {1, 3, 4, 5}
C    | {1, 2, 4, 5}
D    | {5}

To find the support of the 2-itemset {B, C}, intersect the TID sets:
• B = {1, 3, 4, 5}
• C = {1, 2, 4, 5}
• B ∩ C = {1, 4, 5}, so support({B, C}) = 3/5.
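A minimal Python sketch of support counting in the vertical format: the TID sets above are intersected instead of rescanning the transactions.

```python
# Vertical data format: each item maps to the set of TIDs that contain it
tid_lists = {
    "A": {1, 2, 3},
    "B": {1, 3, 4, 5},
    "C": {1, 2, 4, 5},
    "D": {5},
}
n_transactions = 5

# Support of the 2-itemset {B, C}: intersect the two TID lists,
# with no rescan of the original transactions
bc_tids = tid_lists["B"] & tid_lists["C"]      # {1, 4, 5}
support_bc = len(bc_tids) / n_transactions     # 3/5 = 0.6

print(bc_tids, support_bc)
```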
Advantages:
• More efficient than the horizontal data format: support counting is faster because it only requires intersecting TID sets.
• Reduces memory usage
Program Name: B.C.A        Semester: VI
Course Title: Fundamentals of Data Science (Theory)
Course Code: DSE-E2        No. of Credits: 03
Contact Hours: 42 Hours        Duration of SEA/Exam: 2 1/2 Hours
Formative Assessment Marks: 40        Summative Assessment Marks: 60
Unit 4
Topics:
Classification: Basic Concepts, Issues, Algorithms: Decision Tree Induction. Bayes Classification
Methods, Rule-Based Classification, Lazy Learners (or Learning from your Neighbors), k Nearest
Neighbor. Prediction - Accuracy- Precision and Recall.
Classification is a supervised machine learning method where the model tries to predict the correct
label of a given input data.
1. Learning/Training Step:
Here a model is constructed for classification. A classifier model is built by analyzing the data
which are labeled already. Because the class label of each training tuple is provided, this step is
also known as supervised learning. This stage can also be viewed as a function, y=f(x), that can
predict the associated class label ‘y’ of a given tuple x considering attribute values. This mapping
function is represented in the form of classification rules, decision trees or mathematical formula.
2. Classification/Testing Step:
Here the model that is constructed in the learning step is used to predict class labels for given
data.
Ex: A bank loan officer needs to classify loan applicants as safe or risky (Figure 8.1).
The accuracy of classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.
Issues in Classification
Overfitting occurs when a machine learning model learns the training data too well, including
noise and irrelevant details, instead of just the underlying pattern. This results in high accuracy on
the training data but poor performance on new, unseen data.
• Using More Training Data – Helps the model learn more general patterns.
• Applying Ensemble Methods – Using techniques like Random Forest to combine
multiple trees and reduce variance.
Underfitting occurs when a machine learning model is too simple to capture the underlying
patterns in the data. This leads to poor performance on both the training data and new, unseen data.
Class Imbalance: When one class has significantly more samples than another, the model may
become biased toward the dominant class.
High-Dimensional Data (Curse of Dimensionality): Too many features can make training slow
and reduce model performance. Feature selection or dimensionality reduction (e.g., PCA) can help.
Noisy and Incomplete Data: Missing values and irrelevant features can lead to poor classification
results.
Solutions:
Computational Complexity: Some classification algorithms, like deep learning and SVM with
large datasets, require high computational power.
Decision Tree Induction is the learning of decision trees from class-labelled training tuples.
Decision Tree is a tree structure, where each internal node (non leaf) denotes a test on the attribute,
each branch represents an outcome of the test and each leaf node holds a class label. The attribute
values of a tuple ‘X’ is tested against the decision tree. A path is traced from the root to leaf to
predict the class label.
Advantages:
Disadvantages:
o Suffers from the repetition problem: occurs when the same attribute is tested multiple times along a branch.
o Suffers from the replication problem: occurs when part of the tree is duplicated in other branches.
Applications:
Method:
Create a node N;
If tuples in D are all of the same class, C then return N as a leaf node labeled with the class C;
Apply attribute selection method(D, attribute list) to find the best splitting criterion;
For each outcome ‘j’ of splitting criterion// partition tuple and sub trees.
If Dj is empty then
    attach a leaf labeled with the majority class in D to node N;
Else
    attach the node returned by decision tree generation(Dj, attribute list) to node N;
End for
Return N;
Splitting Scenarios:
It is used to decide which attribute should be chosen as the splitting point at each node in a
decision tree classifier. It is also called as splitting rules.
The attribute having the best measure is chosen as the splitting attribute for the given tuples.
1. Information gain
2. Gain ratio
3. Gini index
Information Gain:
Based on the work by Claude Shannon on information theory. Information gain is defined
as the difference between the original information requirement and the new requirement (after
partitioning on an attribute). The attribute with the highest information gain is chosen as the
splitting attribute for node N. It is used in ID3.
Where:
  Info(D) = -Σ (i=1 to m) pi log2(pi)
  m = number of distinct classes, pi = probability that a tuple in D belongs to class Ci.
How much more information is needed to arrive at an exact classification after partitioning on
attribute A is given by the expected information requirement:
  InfoA(D) = Σ (j=1 to v) (|Dj| / |D|) × Info(Dj)
where v is the number of partitions of D induced by A. The information gain is then:
  Gain(A) = Info(D) - InfoA(D)
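A minimal Python sketch of these two formulas, using a small hypothetical Weather/Play data set: info() computes Info(D) and info_gain() computes Gain(A) for one categorical attribute.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy): Info(D) = -sum(pi * log2(pi))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - InfoA(D), where InfoA(D) is the weighted entropy
    of the partitions induced by attribute A."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / total * info(part) for part in partitions.values())
    return info(labels) - info_a

# Hypothetical tuples: (Weather,) with class label Play
rows = [("Sunny",), ("Overcast",), ("Rainy",), ("Sunny",),
        ("Overcast",), ("Rainy",), ("Sunny",), ("Overcast",)]
labels = ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No"]

print(info(labels), info_gain(rows, 0, labels))
```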
Gain Ratio:
Information gain measure is biased toward test with many outcomes (many partitions).
That is, it prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier such as product ID. Split on product ID would result in a
large number of partitions (as many as there are values), each one containing just one tuple. Gain
ratio overcomes this bias by normalizing the values. Attribute with maximum gain ratio is selected
on splitting attribute. It is used in C4.5.
It applies a kind of normalization to information gain using a split information value defined as:
  SplitInfoA(D) = -Σ (j=1 to v) (|Dj| / |D|) × log2(|Dj| / |D|)
where v is the number of partitions. The gain ratio is then defined as
  GainRatio(A) = Gain(A) / SplitInfoA(D)
Gini Index:
Gini(D) = 1 - Σ (i=1 to m) pi²
where pi is the probability that a tuple in D belongs to class Ci.
Tree Pruning:
When decision tree is built, many of the branches will reflect problems in the training data
due to noise or outliers. Tree pruning removes the branches which are not relevant. Pruned tree are
smaller and less complex and thus easier to understand. They perform faster than unpruned trees.
• Pre pruning: In pre pruning the tree branch is not further split into sub branches by
deciding early using statistical measures like info gain, gini index etc.
• Post pruning: In post pruning the fully grown tree branches is cut and leaf nodes are
added. The leaf is labeled with the most frequent class among the subtree being replaced.
Pruning Algorithms:
P(H/X) – Posterior probability: the probability that hypothesis H holds given the observed tuple X,
i.e., the probability of a class given the data.
Ex: the probability that a customer X will buy a computer given that we know the age and
income of the customer.
P(X/H) – Likelihood: the probability of observing the tuple X given that hypothesis H holds.
Ex: the probability that a customer X has an income of Rs.40,000 given that we know the
customer will buy a computer.
P(H) – Prior probability of the hypothesis.
Ex: the probability that any given customer will buy a computer, regardless of the measurements
on the attributes.
4) To predict the class label of X, P(X/Ci) P(Ci) is evaluated for all class C and maximum of
P(X/Ci) P(Ci) is assigned as class label.
Dataset (Training)
Weather Play
Sunny Yes
Over cast No
Rainy No
Sunny No
Over cast Yes
Rainy No
Sunny Yes
Over cast No
Test tuple: Weather = Sunny. From the training data:
P(Yes) = 3/8, P(No) = 5/8, P(Sunny) = 3/8, P(Sunny/Yes) = 2/3, P(Sunny/No) = 1/5.

P(No/Sunny) = [P(Sunny/No) × P(No)] / P(Sunny) = [(1/5) × (5/8)] / (3/8) ≈ 0.33
P(Yes/Sunny) = [P(Sunny/Yes) × P(Yes)] / P(Sunny) = [(2/3) × (3/8)] / (3/8) ≈ 0.67

Since P(Yes/Sunny) > P(No/Sunny), the predicted class for Weather = Sunny is Play = Yes.
Let's say we have an email spam filter that classifies emails as Spam or Not Spam based on the
presence of certain words.
We have a dataset of emails labeled as Spam or Not Spam. The classifier learns from word
frequencies in each category.
"Offer" 50% 5%
"Urgent" 40% 5%
"Meeting" 5% 50%
"Project" 2% 40%
For a new email containing the words "Free Offer Urgent", we need to calculate:
P(Spam | "Free Offer Urgent")
P(Not Spam | "Free Offer Urgent")
P(Spam∣Words)=P(Free∣Spam)⋅P(Offer∣Spam)⋅P(Urgent∣Spam)⋅P(Spam)
=(0.6)⋅(0.5)⋅(0.4)⋅(0.4)= 0.048
P(NotSpam∣Words)=P(Free∣NotSpam)⋅P(Offer∣NotSpam)⋅P(Urgent∣NotSpam)⋅P(NotSpam)
=(0.1)⋅(0.05)⋅(0.05)⋅(0.6)= 0.00015
Since P(Spam | Words) > P(Not Spam | Words), the classifier marks the email as Spam.
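As a check on the arithmetic above, a minimal Python sketch of the same Naïve Bayes comparison; the word likelihoods and the class priors P(Spam) = 0.4 and P(Not Spam) = 0.6 are taken from the worked example.

```python
# Word likelihoods and class priors from the worked example above
p_word_given_spam     = {"free": 0.6, "offer": 0.5, "urgent": 0.4}
p_word_given_not_spam = {"free": 0.1, "offer": 0.05, "urgent": 0.05}
p_spam, p_not_spam = 0.4, 0.6

words = ["free", "offer", "urgent"]

score_spam = p_spam
score_not_spam = p_not_spam
for w in words:
    score_spam *= p_word_given_spam[w]
    score_not_spam *= p_word_given_not_spam[w]

print(score_spam, score_not_spam)          # 0.048 vs 0.00015
print("Spam" if score_spam > score_not_spam else "Not Spam")
```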
Advantages
Disadvantages
It assumes that all features are independent given the class, which is rarely true in real-world data.
Example: In spam detection, "free" and "offer" may appear together often—but Naïve Bayes treats
them as if they are unrelated.
Applications:
Rule-based classifiers are used for classification by defining a set of rules that can be used to assign
class labels to new instances of data based on their attribute values. These rules can be created
using expert knowledge of the domain, or they can be learned automatically from a set of labeled
training data. A rule-based classifier uses a set of IF-THEN rules for classification.
R1: IF age == youth AND student == yes THEN buys computer == yes.
The “IF” part (or left side) of a rule is known as the rule antecedent or precondition.
In the rule antecedent, the condition consists of one or more attribute tests (e.g., age == youth and
student == yes) that are logically ANDed.
If the condition (i.e., all the attribute tests) in a rule antecedent holds true for a given tuple, we say
that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the
tuple.
Conflict in rules:
In rule-based classification, it's common to have conflict when multiple rules apply to the same
data instance (tuple), possibly predicting different classes. To handle this, we use conflict
resolution strategies.
• Rule 1: IF Credit Score > 700 AND Income > 50K THEN Approve = Yes
• Rule 2: another rule whose antecedent is also satisfied by the same tuple but which predicts Approve = No
• Test tuple: a loan applicant who satisfies the antecedents of both rules.
Because the two rules predict different classes for the same tuple, this is a conflict.
• Rule Ordering (Priority-Based Resolution): In rule-based ordering, the rules are organized
based on priority, according to some measure of rule quality, such as accuracy, coverage,
or size (number of attribute tests in the rule antecedent), or based on advice from domain
experts. Class is predicted for the tuple based on the priority, and any other rule that
satisfies tuple is ignored.
• Majority Voting: If multiple rules apply and predict different classes, take a majority vote
from the predicted classes.
• Specificity Preference/Size based: Choose the most specific rule (i.e., the rule with the most
conditions or constraints).
In rule-based classification, coverage is the percentage of records that satisfy the antecedent
conditions of a rule.
Coverage(R)=n1/n
Where n1= instances with antecedent and n=no of training tuples
Accuracy is the percentage of records that satisfy the antecedent conditions and meet the
consequent values of a rule.
Accuracy(R)=n2/n1
Where n2= instances with antecedent AND consequent
Key Differences:
• Accuracy focuses on correctness, while coverage focuses on applicability.
• A rule can have high accuracy but low coverage (if it classifies correctly but applies to
very few instances).
• A rule can have high coverage but low accuracy (if it applies to many instances but
makes many errors).
• The best classification rules aim for a balance between accuracy and coverage to ensure
broad applicability while maintaining correctness.
There are many sequential covering algorithms. Popular variations include AQ, CN2, and the
more recent RIPPER. The general strategy is as follows. Rules are learned one at a time. Each
time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the
remaining tuples.
Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.
Input:
D: a data set of class-labeled tuples;
Att vals: the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
(1) Rule set = {}; // initial set of rules learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn One Rule(D, Att vals, c);
(5) remove tuples covered by Rule from D;
(6) Rule set = Rule set + Rule; // add new rule to rule set
(7) until terminating condition;
(8) endfor
(9) return Rule Set ;
Rule Pruning
The classification methods discussed so far, such as decision tree induction, Bayesian classification,
and rule-based classification, are all examples of eager learners. Eager learners employ a two-step
approach to classification: in the first step they build a classifier model by learning from the
training set, and in the second step they use the model to classify unknown tuples.
Lazy learning algorithms wait until they encounter a new tuple (from the testing dataset); they
simply store the training examples and compare the new tuple against them only when making a
prediction. This type of learning is useful when working with large datasets that have few
attributes. Lazy learning is also known as instance-based or memory-based learning.
• Computationally expensive
• Requires more memory, as the training data must be available (loaded) at the classification stage.
1. Assign a value to K
2. Calculate the distance(E.g, Euclidean Distance) between the new data entry and all other
existing data entries
3. Arrange the distances in ascending order
4. Determine the k-closest records of the training data set for each new record
5. Take the majority vote to classify the data point.
The Euclidean distance between two points or tuples, say X1 = {x11, x12, ..., x1n} and X2
= {x21, x22, ..., x2n}, is
  dist(X1, X2) = sqrt( Σ (i=1 to n) (x1i - x2i)² )
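A minimal Python sketch of the k-NN procedure above (Euclidean distance plus majority vote); the training tuples and the query point are hypothetical.

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """dist(X1, X2) = sqrt(sum((x1i - x2i)^2))"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(training_data, new_point, k):
    """Classify new_point by majority vote among its k nearest training tuples."""
    nearest = sorted(training_data, key=lambda item: euclidean(item[0], new_point))
    k_nearest_labels = [label for _, label in nearest[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Hypothetical training tuples: ((age, income), class label)
training_data = [
    ((25, 40000), "No"), ((35, 60000), "Yes"),
    ((45, 80000), "Yes"), ((22, 20000), "No"),
]
print(knn_classify(training_data, (30, 55000), k=3))   # -> "Yes"
```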
Advantages
Disadvantages:
• Computationally expensive
• Accuracy reduces if there are noise in the dataset
• Requires large memory
• Need to accurately determine the value of k neighbors
There are four terms we need to know that are the “building blocks” used in computing many
evaluation measures. Understanding them will make it easy to grasp the meaning of the various
measures.
True positives (TP): The model correctly predicts the positive class when the actual class is positive.
E.g., a person having the COVID-19 virus is correctly labelled as COVID-19 positive.
True negatives (TN): The model correctly predicts the negative class when the actual class is negative.
E.g., a person without the COVID-19 virus is correctly labelled as COVID-19 negative.
False positives (FP): The model incorrectly predicts a positive class when the actual class is
negative.
E.g., Person without having COVID-19 virus is incorrectly labelled as COIVD-19 positive.
False negatives (FN): The model incorrectly predicts a negative class when the actual class is
positive.
Precision and recall are metrics used to evaluate the performance of classification models in
machine learning. Precision is the percentage of positive identifications that are correct (how
many predicted positives are actually positive?), i.e., Precision = TP / (TP + FP), while recall is the
percentage of actual positives that are identified correctly (how many actual positives were
correctly identified?), i.e., Recall = TP / (TP + FN).
Accuracy:
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier. In the pattern recognition literature, this is also referred to as the overall
recognition rate of the classifier, that is, it reflects how well the classifier recognizes tuples of the
various classes. That is,
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The sensitivity(recall) and specificity measures can be used, respectively, for this purpose.
Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive
tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion
of negative tuples that are correctly identified).
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Example:
“Of all the people who actually have COVID, how many did the test correctly identify?”
If sensitivity is 90%, it means the test caught 90% of the infected people — but missed 10%.
“Of all the people who do NOT have COVID, how many did the test correctly say were negative?”
If specificity is 95%, it means 95% of healthy people were correctly told they’re negative.
Confusion Matrix:
A confusion matrix represents the prediction summary in matrix form. It shows how many
predictions are correct and incorrect per class.
Example:
Interpretation:
• Sensitivity = 85% → The test correctly identifies 85% of people who actually have
COVID.
• Specificity = 94.4% → It correctly identifies 94.4% of those who don't have COVID.
• Precision = 62.96% → When the test says someone has COVID, it’s only right ~63% of
the time (high false positives).
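As a hedged illustration, the percentages above can be reproduced from assumed counts (e.g., TP = 17, FN = 3, FP = 10, TN = 170 out of 200 tests; these values are chosen only so the arithmetic matches the interpretation):

# Assumed counts consistent with the interpretation above (hypothetical example).
TP, FN, FP, TN = 17, 3, 10, 170

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)      # recall / true positive rate
specificity = TN / (TN + FP)      # true negative rate
precision   = TP / (TP + FP)

print(f"accuracy    = {accuracy:.2%}")     # 93.50%
print(f"sensitivity = {sensitivity:.2%}")  # 85.00%
print(f"specificity = {specificity:.2%}")  # 94.44%
print(f"precision   = {precision:.2%}")    # 62.96%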
Program
Name B.C.A Semester VI
Course Title Fundamentals of Data Science (Theory)
Course Code: DSE-E2 No. of Credits 03
Contact hours 42 Hours Duration of SEA/Exam 2 1/2 Hours
Formative Assessment
Marks 40 Summative Assessment Marks 60
Unit 5
Clustering: Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density-Based Methods,
Grid-Based Methods, Evaluation of Clustering
Cluster Analysis
Cluster analysis or clustering is the process of grouping a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet
dissimilar to objects in other clusters.
Clustering is also known as unsupervised learning since groups are made without the knowledge
of class labels.
Clustering is also called data segmentation in some applications because clustering partitions large
data sets into groups according to their similarity.
Ex: Customer Segmentation for a Retail Store
Clustering can also be used for outlier detection, where outliers (values that are “far away” from
any cluster) may be more interesting than common cases.
• Anomaly detection: Identifying data points that do not belong to any cluster or lie far from
all clusters, which may indicate fraud, errors, or unusual events
3. Density-based method: Unlike distance-based methods, which tend to find only spherical
clusters, density-based methods can discover clusters of arbitrary, nonspherical shape. Here a
cluster is grown as long as the density (number of objects) in the neighborhood exceeds some
threshold. The method groups similar data points in a dataset based on their density: the
algorithm identifies core points that have a minimum number of neighboring points within a
specified distance (the epsilon radius) and expands clusters by connecting these core points to
their neighbors until the density falls below a certain threshold. Points that do not belong to any
cluster are considered outliers or noise.
E.g., DBSCAN, OPTICS, DENCLUE, Mean-Shift.
4. Grid-based method: Here the data space is first quantized into a grid of cells, and the clustering
operations are then performed on this grid. The object space is divided into a grid structure of a
finite number of cells, and clustering operations are performed on the cells instead of on
individual data points. This method is highly efficient for handling spatial data and has a fast
processing time that is independent of the number of data objects.
E.g., Statistical Information Grid(STING), CLIQUE, ENCLUS
Partitioning Methods: The simplest and most fundamental version of cluster analysis is
partitioning, which groups similar data points into clusters based on their similarities and
differences.
E.g., k-means and k-medoids.
K-Means Algorithm:
• First, it randomly selects k of the objects in D; each of these initially represents a cluster mean/center.
• In each iteration, every object is assigned to the cluster whose mean/center it is nearest to, based on
Euclidean distance, and each mean/center is then updated.
• The iterations continue until the clusters of the current iteration are the same as those of the previous
iteration (i.e., the assignments no longer change). A minimal sketch is given below.
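A minimal NumPy sketch of the loop described above (random initial centers, Euclidean assignment, mean update, stop when assignments no longer change). This is an illustrative implementation with made-up data, not the text's own code:

import numpy as np

def k_means(D, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=k, replace=False)]   # random initial means
    labels = np.full(len(D), -1)
    while True:
        # assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # clusters unchanged -> stop
            return centers, labels
        labels = new_labels
        # update each center to the mean of its assigned objects
        for j in range(k):
            if np.any(labels == j):
                centers[j] = D[labels == j].mean(axis=0)

# Hypothetical usage
D = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centers, labels = k_means(D, k=2)
print(labels)   # e.g. [0 0 1 1] (cluster indices may differ between runs)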
Advantages of k-means:
Disadvantages of k-means:
• It is difficult to predict the number of clusters, i.e., the value of k.
• The output is strongly affected by the initial inputs, such as the randomly chosen initial centers and the value of k.
K-Medoids Algorithm:
It is an improved version of the k-means algorithm, designed mainly to deal with k-means'
sensitivity to outliers. Instead of taking the mean value to represent a cluster, it uses a medoid: a
point in the cluster whose dissimilarities with all the other points in the cluster are minimal. A
representative object (Oi) is chosen randomly to represent each cluster; each remaining object is
assigned to the cluster whose representative object it is most similar to. The partitioning is then
refined based on the principle of minimizing the sum of the dissimilarities between each object p
and its corresponding representative object. A short sketch of the medoid idea follows.
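A tiny sketch of the medoid idea on made-up points (the outlier at (8, 8) pulls the mean but not the medoid). This illustrates only the representative-object choice, not the full k-medoids/PAM algorithm:

import numpy as np

def medoid(cluster):
    """Return the point in `cluster` whose summed distance to all other points is minimal."""
    dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    return cluster[dists.sum(axis=1).argmin()]

points = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.1], [8.0, 8.0]])  # (8, 8) is an outlier
print(points.mean(axis=0))   # mean is pulled toward the outlier -> [2.825 2.75 ]
print(medoid(points))        # medoid stays in the dense group  -> [1.1 1.1]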
Hierarchical Methods
Partitioning methods partition objects into exclusive groups. In some situations, however, we may
want the data grouped at different levels. A hierarchical clustering method works by grouping data
objects into a hierarchy or 'tree' of clusters, which helps in summarizing the data.
Hierarchical clustering methods can be categorized into:
a) Algorithmic methods,
b) Probabilistic methods, and
c) Bayesian methods.
Agglomerative, divisive, and multiphase methods are algorithmic, meaning they consider data
objects as deterministic and compute clusters according to the deterministic distances between
objects. Probabilistic methods use probabilistic models to capture clusters and measure the
quality of clusters by the fitness of the models. Bayesian methods compute a distribution of
possible clusterings; that is, instead of outputting a single deterministic clustering over a data set,
they return a group of clustering structures and their probabilities, conditional on the given data.
Dendrogram:
A tree structure called a dendrogram is commonly used to represent the process of hierarchical
clustering. A dendrogram is a plot that shows the results of a hierarchical clustering method
graphically.
Whether using an agglomerative method or a divisive method, a core need is to measure the
distance between two clusters. Four widely used measures for distance between clusters are as
follows, where |p-p’| is the distance between two objects.
1. Minimum distance: dist_min(Ci, Cj) = min { |p − p′| : p ∈ Ci, p′ ∈ Cj }
2. Maximum distance: dist_max(Ci, Cj) = max { |p − p′| : p ∈ Ci, p′ ∈ Cj }
3. Mean distance: dist_mean(Ci, Cj) = |mi − mj|, where mi and mj are the means of clusters Ci and Cj
4. Average distance: dist_avg(Ci, Cj) = (1 / (ni nj)) Σ_{p ∈ Ci, p′ ∈ Cj} |p − p′|, where ni and nj are the numbers of objects in Ci and Cj
Single linkage and complete linkage are two distinct methods used in agglomerative hierarchical
clustering. This type of clustering starts with each data point as its own cluster and iteratively
merges the closest clusters until a single cluster remains or a stopping criterion is met. The key
difference between single and complete linkage lies in how the "distance" between two clusters is
defined.
In single linkage, the distance between two clusters is defined as the minimum distance(use min
distance formula) between any data point in the first cluster and any data point in the second
cluster.
Mathematically, if C1 and C2 are two clusters, the distance D(C1, C2) in single linkage is:
D(C1, C2) = min { d(x, y) : x ∈ C1, y ∈ C2 }
where d(x, y) is the distance between data points x and y (e.g., Euclidean distance).
In complete linkage, the distance between two clusters is defined as the maximum distance(use
max distance formula) between any data point in the first cluster and any data point in the second
cluster.
Mathematically, if C1 and C2 are two clusters, the distance D(C1, C2) in complete linkage is:
D(C1, C2) = max { d(x, y) : x ∈ C1, y ∈ C2 }
Comparison of single and complete linkage:
• Cluster distance: single linkage uses the minimum distance between any two points; complete linkage uses the maximum distance between any two points.
• Computational cost: generally lower for single linkage; generally higher for complete linkage.
• Shape detection: single linkage is better for non-globular shapes; complete linkage is better for globular shapes.
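The two linkage criteria can be tried directly with SciPy's hierarchical clustering, which takes the linkage method as a parameter (toy data, illustrative only):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])

# Single linkage: cluster distance = minimum pairwise distance
Z_single = linkage(X, method="single")
# Complete linkage: cluster distance = maximum pairwise distance
Z_complete = linkage(X, method="complete")

# Cut each dendrogram into 2 flat clusters
print(fcluster(Z_single, t=2, criterion="maxclust"))    # e.g. [1 1 1 2 2 2]
print(fcluster(Z_complete, t=2, criterion="maxclust"))  # e.g. [1 1 1 2 2 2]
# On this well-separated toy set both methods agree; they differ on elongated or noisy data.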
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions
called Clustering Feature (CF) entries. It uses a clustering feature (CF) to summarize a
cluster and a clustering feature tree (CF tree) to represent a cluster hierarchy. Formally, a
Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number
of data points in the cluster, LS is the linear sum of the data points, and SS is the squared
sum of the data points in the cluster. A CF tree is a height-balanced tree with two
parameters, branching factor and threshold. The CF tree is a very compact representation
of the dataset because each entry in a leaf node is not a single data point but a subcluster.
Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the
sum of the CF entries in the child nodes. There is a maximum number of entries in each leaf
node; this maximum number is called the threshold. The tree size is a function of the
threshold: the larger the threshold, the smaller the tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree.
A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree
contains a pointer to a child node, and a CF entry made up of the sum of CF entries in the
child nodes. Optionally, we can refine these clusters.
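The CF summary itself is easy to illustrate. The sketch below (not BIRCH itself, just the bookkeeping it relies on) shows a CF triple (N, LS, SS) for a set of points and the fact that two CF entries can be merged by simple addition, which is what lets the CF tree absorb points incrementally:

import numpy as np

def cf_entry(points):
    """Clustering Feature of a set of points: (N, LS, SS)."""
    points = np.asarray(points, dtype=float)
    N = len(points)
    LS = points.sum(axis=0)            # linear sum of the points
    SS = (points ** 2).sum()           # squared sum of the points
    return N, LS, SS

def cf_merge(cf1, cf2):
    """CF entries are additive: merging two subclusters just adds the triples."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def cf_centroid(cf):
    N, LS, _ = cf
    return LS / N

a = cf_entry([[1, 1], [2, 2]])
b = cf_entry([[3, 3]])
merged = cf_merge(a, b)
print(merged[0], cf_centroid(merged))   # 3 [2. 2.]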
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the
similarity between pairs of clusters.
Chameleon uses a two-phase algorithm to find clusters in a data set:
1. First phase
Uses a graph partitioning algorithm to cluster data items into small subclusters
2. Second phase
Uses an algorithm to find the genuine clusters by repeatedly combining these subclusters
Two clusters are merged if their interconnectivity is high and they are close together.
Chameleon uses a k-nearest-neighbour graph approach to construct a sparse graph. It then uses a
graph partitioning algorithm to partition the graph into smaller subclusters. Finally, an
agglomerative hierarchical clustering algorithm iteratively merges subclusters based on their
similarity.
Traditional hierarchical clustering assumes that there is no uncertainty or noise in the data being
clustered. However, this assumption does not hold for many real-world datasets where there may
be missing values, outliers, or measurement errors present in the data. Traditional methods also
assume that all features have equal importance, which may not always be true.
Probabilistic hierarchical clustering tries to overcome some of these drawbacks by employing
probabilistic models to measure the distances between clusters.
Advantages:
Disadvantages:
• Computationally expensive
• Sensitive to initialization parameters
• Assumes Gaussian distributions within clusters
• Requires domain knowledge and expertise in statistical modeling
Density-Based Methods
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is based on an intuitive
notion of “clusters” and “noise”. The key idea is that, for each point of a cluster, the neighborhood
of a given radius has to contain at least a minimum number of points.
1. eps: It defines the neighborhood around a data point i.e. if the distance between two
points is lower or equal to ‘eps’ then they are considered neighbors. If the eps value
is chosen too small then a large part of the data will be considered as an outlier. If it
is chosen very large then the clusters will merge and the majority of the data points
will be in the same clusters. One way to find the eps value is based on the k-
distance graph.
2. MinPts: Minimum number of neighbors (data points) within eps radius. The larger
the dataset, the larger value of MinPts must be chosen. As a general rule, the
minimum MinPts can be derived from the number of dimensions D in the dataset as,
MinPts >= D+1. The minimum value of MinPts must be chosen at least 3.
Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a
core point.
Noise Point: A point that is neither a core point nor a border point.
1. Find all the neighbor points within eps and identify the core points or visited with
more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster
as the core point.
Points a and b are said to be density-connected if there exists a point c that has a
sufficient number of points in its neighborhood and both a and b are within
the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a
neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then
b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do
not belong to any cluster are noise.
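In practice DBSCAN is usually run through a library. A minimal scikit-learn illustration is shown below; eps and min_samples correspond to eps and MinPts above, the data are made up, and points labelled −1 are noise:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],     # dense group 1
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],     # dense group 2
              [4.5, 15.0]])                           # isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # e.g. [ 0  0  0  1  1  1 -1] -> two clusters plus one noise point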
DENCLUE:
DENCLUE stands for DENsity-based CLUstEring. The DENCLUE algorithm employs a cluster model based on
kernel density estimation. A cluster is defined by a local maximum of the estimated density function.
Observations that converge to the same local maximum are put into the same cluster.
Clearly, DENCLUE does not work on data with a uniform distribution. In high-dimensional space, data tend
to look uniformly distributed because of the curse of dimensionality. Therefore, DENCLUE does not work
well on high-dimensional data in general.
Grid-Based Methods
The grid-based clustering approach uses a grid data structure. It limits the object space into a finite
number of cells that form a grid structure on which all of the operations for clustering are
performed. The main advantage of the approach is its fast-processing time.
STING (Statistical Information Grid) is a grid-based clustering technique. It uses a multidimensional
grid data structure that quantizes space into a finite number of cells. Instead of focusing on the data
points themselves, it focuses on the value space surrounding the data points.
In STING, the spatial area is divided into rectangular cells and several levels of cells at different
resolution levels. High-level cells are divided into several low-level cells.
In STING, statistical information about the attributes in each cell, such as the mean, maximum, and
minimum values, is precomputed and stored as statistical parameters. These statistical
parameters are useful for query processing and other data analysis tasks. The statistical parameters
of higher-level cells can easily be computed from the parameters of the lower-level cells.
Working of STING:
Step 1: Determine a layer to begin with.
Step 2: For each cell of this layer, calculate the confidence interval or estimated range of
probability that this cell is relevant to the query.
Step 3: From the interval calculated above, label the cell as relevant or not relevant.
Step 4: If this layer is the bottom layer, go to point 6, otherwise, go to point 5.
Step 5: It goes down the hierarchy structure by one level. Go to point 2 for those cells that form
the relevant cell of the high-level layer.
Step 6: If the specification of the query is met, go to point 8, otherwise go to point 7.
Step 7: Retrieve those data that fall into the relevant cells and do further processing. Return the
result that meets the requirement of the query. Go to point 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of the
query. Go to point 9.
Step 9: Stop or terminate.
Advantages:
Disadvantage:
• The main disadvantage of STING (Statistical Information Grid) is that all cluster boundaries
are either horizontal or vertical, so no diagonal boundaries are detected.
CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters
in subspaces.
It is based on automatically identifying the subspaces of a high-dimensional data space that allow
better clustering than the original space.
It uses a density threshold to distinguish dense cells from sparse ones.
A cell is dense if the number of objects mapped to it exceeds the density threshold.
The CLIQUE algorithm is very scalable with respect to the number of records and the number of
dimensions in the dataset, because it is grid-based and uses the Apriori property effectively.
The Apriori property states that if an X-dimensional unit is dense, then all of its projections in
(X−1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when projected
onto a lower-dimensional subspace.
Because it uses the Apriori property, CLIQUE restricts its search for high-dimensional dense cells
to the intersections of dense cells already found in lower-dimensional subspaces.
The CLIQUE algorithm first divides the data space into grids by dividing each dimension into
equal intervals called units. After that, it identifies dense units: a unit is dense if the number of
data points in it exceeds the threshold value.
Once the algorithm finds the dense cells along one dimension, it tries to find dense cells along
two dimensions, and so on, until dense cells have been found across all dimensions.
After finding all dense cells in all dimensions, the algorithm proceeds to find the largest set
(“cluster”) of connected dense cells. Finally, the CLIQUE algorithm generates a minimal
description of the cluster. Clusters are thus generated from all dense subspaces using the Apriori
approach. A small grid-density sketch is given below.
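The grid-and-density step can be sketched as follows. This toy code only bins points into equal-width cells and reports cells whose count exceeds a threshold; the subspace search and cluster generation of the full CLIQUE algorithm are omitted, and all names and data are illustrative:

import numpy as np
from collections import Counter

def dense_cells(X, intervals=4, threshold=2):
    """Bin points into an equal-width grid and return cells whose count exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # bin index per dimension, clipped so the maximum value falls in the last interval
    bins = np.clip(((X - mins) / (maxs - mins) * intervals).astype(int), 0, intervals - 1)
    counts = Counter(tuple(int(v) for v in row) for row in bins)
    return {cell: c for cell, c in counts.items() if c > threshold}

X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.9, 0.9], [3.0, 3.1], [3.1, 3.0], [3.05, 3.2]]
print(dense_cells(X, intervals=4, threshold=2))   # e.g. {(0, 0): 3, (3, 3): 3}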
Advantage:
Disadvantage:
• The main disadvantage of the CLIQUE algorithm is that the result depends on the grid cell
size: if the cell size is unsuitable for the data, too much approximation takes place and the
correct clusters may not be found.
Evaluation of Clustering
Evaluation of clustering is the process of assessing how meaningful, accurate, and useful the
results of a clustering algorithm are. In simpler terms, it is about answering the question: "Did the
clustering algorithm do a good job?"
Since clustering is an unsupervised learning method (i.e., it doesn't use labels), evaluating it can
be tricky. Evaluation helps determine whether the data has a genuine clustering tendency, how
many clusters are appropriate, and how good the resulting clusters are.
Assessing Clustering Tendency:
Why: Applying clustering to random data will still produce clusters, but they might be meaningless.
Example Methods:
• Hopkins Statistic: a spatial statistic that tests the spatial randomness of a variable as
distributed in a space.
• Visual inspection (scatter plots, heatmaps)
Intrinsic Methods:
The Silhouette Coefficient tells you how well each item fits into its own cluster compared to other
clusters.
You can think of it as answering this question:
“Am I closer to the stuff in my own group than to the stuff in the next closest group?”
Calculation of Silhouette Coefficient: For each data point:
a = how close the point is to others in its own cluster (we want this to be small).
b = how close the point is to the nearest other cluster (we want this to be big).
Then, the silhouette score for that point is: s = (b − a) / max(a, b), which always lies between −1 and +1.
An average silhouette coefficient close to +1 indicates that the clusters are well-clustered.
An average silhouette coefficient close to 0 suggests overlapping clusters.
An average silhouette coefficient close to -1 indicates that the clustering might be wrong.
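With scikit-learn, the average silhouette coefficient of a clustering can be computed directly. In this small, made-up illustration the labels come from k-means on two well-separated groups, so the score comes out close to +1:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # close to +1 -> compact, well-separated clusters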