4.0 - Lesson 6 - BI Mining


FACULTY OF INFORMATION TECHNOLOGY - UNIVERSITY OF SCIENCE (ĐHKHTN)

COURSE

Chapter 6
BI - mining

Lecturer: Hồ Thị Hoàng Vy


Ho Chi Minh City, August 2021

fit@hcmus
Learning Objectives

• Define data mining as an enabling technology for business intelligence
• Describe the objectives and benefits of business analytics and data mining
• Describe some algorithms that are applied to specific scenarios
• Design and implement an integrated data mining solution by using SQL Server Analysis Services

fit@hcmus
Introduction to data mining

• Data mining is a step in the KDD (Knowledge Discovery in Databases) process
• It is the core step of the knowledge discovery process
fit@hcmus
Introduction to Data mining
• Data Mining: Concepts and Techniques
  • "Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories or data streams."
• Data Mining: Practical Machine Learning Tools and Techniques
  • "Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one. The data is invariably present in substantial quantities."
fit@hcmus
Introduction to Data mining

• Data mining: what for?
  • To look for interesting structures such as:
    • Patterns from statistics
    • Predictive models
    • Hidden relationships
fit@hcmus
Introduction to data mining
• Patterns should be valid, novel, potentially useful, and understandable to the users.
• Types of patterns
  • Association
  • Prediction
  • Cluster (segmentation)
  • Sequential (or time series) relationships
fit@hcmus
Introduction to data mining
• These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:
  • Banking: loan/credit card approval
    • Predict good customers based on old customers
  • Customer relationship management:
    • Identify those who are likely to leave for a competitor
  • Targeted marketing:
    • Identify likely responders to promotions
  • Fraud detection: telecommunications, financial transactions
    • From an online stream of events, identify fraudulent events
fit@hcmus
Introduction to data mining

• A set of rules learned from this information:
  • If tear production rate = reduced then recommendation = none
  • If age = young and astigmatic = no then recommendation = soft

Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.


fit@hcmus
Introduction to data mining
• An attribute (or dimension, feature, variable) is a data field, representing a characteristic or a feature of a data object.
• A collection of attributes describes an object.
  [Figure: a data table in which the columns are attributes and each row is an object]
fit@hcmus
Data in Data mining

• Data may consist of numbers, words, images, …

[Figure: taxonomy of data]
• Structured data
  • Categorical: Nominal, Ordinal
  • Numerical: Interval, Ratio
• Unstructured or semi-structured data
  • Textual, Multimedia, HTML/XML
fit@hcmus
Data in data mining
• Nominal attributes are used to label variables without any quantitative value (categories, states, names of things…)
  • Hair_color = {black, brown, blond, red, grey, white}
  • Marital status, occupation, ID numbers, zip codes
• Binary
  • Nominal attribute with only 2 states (0 and 1)
  • Symmetric binary: both outcomes equally important
    • e.g., gender
• Ordinal
  • Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  • Size = {small, medium, large}, socioeconomic status ("low income", "middle income", "high income")
fit@hcmus
Data in data mining
• Interval scales
  • Are numeric scales
  • We know both the order and the exact differences between the values
  • Don't have a "true zero"
  • Ex: Celsius temperature; zero doesn't mean the absence of value, and 20 °C is not twice as hot as 10 °C
• Ratio
  • Have a clear definition of zero
  • Can be meaningfully added, subtracted, multiplied, divided (ratios)
  • Ex: weight, height
fit@hcmus
Data in data mining
• Discrete Attribute
  • Has only a finite or countably infinite set of values
  • Often represented as integer variables
• Continuous Attribute
  • Has real numbers as attribute values
  • Examples: height, weight, length, temperature, speed

https://link.springer.com/chapter/10.1007%2F978-1-84628-766-4_7 https://www.geeksforgeeks.org/understanding-data-attribute-types-qualitative-and-quantitative/
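To make these attribute types concrete, here is a small sketch in Python/pandas; the column names and values are invented for illustration and are not from the lecture's datasets.

```python
import pandas as pd

# Hypothetical records illustrating the attribute types above
df = pd.DataFrame({
    "hair_color": ["black", "brown", "blond"],    # nominal
    "size": ["small", "large", "medium"],          # ordinal
    "temperature_c": [20.0, 10.0, 31.5],           # interval (no true zero)
    "weight_kg": [70.2, 55.0, 82.1],               # ratio (true zero, ratios meaningful)
    "num_children": [0, 2, 1],                     # discrete
})

# Nominal: categories without any order
df["hair_color"] = pd.Categorical(df["hair_color"])

# Ordinal: categories with a meaningful order, but unknown magnitude between values
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)

print(df.dtypes)
print(df["size"].min())   # ordered categories support comparisons -> 'small'
```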
fit@hcmus
Data mining process - CRISP-DM

Step 1: Business Understanding
  • Set goals for the project
  • Using the business objectives and the current scenario, define your data mining goals
Step 2: Data Understanding
  • Set the data and data sources
  • Check whether the available data can meet the objectives of the project and establish how you will meet those objectives
Step 3: Data Preparation
  • The data from different sources should be selected, cleaned, transformed, formatted, anonymized, and constructed
  • Data cleaning & transformation (smoothing noisy data and filling in missing values, aggregation, normalization, …)
  (Steps 2 and 3 account for ~85% of total project time)
Step 4: Model Building
  • Execute the algorithms that satisfy the project objectives
Step 5: Testing and Evaluation
  • Create a scenario to test and check the quality and validity of the model
  • Run the model on the prepared dataset
  • Results should be assessed by all stakeholders
Step 6: Deployment

https://towardsdatascience.com/crisp-dm-methodology-leader-in-data-mining-and-big-data-467efd3d3781
fit@hcmus
Data Preparation
Real-world Data
  ↓
Data Consolidation: collect data, select data, integrate data
  ↓
Data Cleaning: impute missing values, reduce noise in data, eliminate inconsistencies
  ↓
Data Transformation: normalize data, discretize/aggregate data, construct new attributes
  ↓
Data Reduction: reduce number of variables, reduce number of cases, balance skewed data
  ↓
Well-formed Data

Sharda, Business Intelligence, 3rd ed.
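A minimal sketch of the consolidation/cleaning/transformation/reduction steps above, using pandas and scikit-learn on made-up data; the column names, the 95th-percentile cap and the three age bins are arbitrary illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data with typical real-world problems
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income": [50_000, None, 72_000, 1_000_000],   # missing value + outlier
    "age": [25, 40, 40, 38],
    "city": ["HCMC", "HCMC", "Hanoi", None],
})

# Data cleaning: impute missing values, eliminate inconsistencies
clean = raw.copy()
clean["income"] = clean["income"].fillna(clean["income"].median())
clean["city"] = clean["city"].fillna("unknown")

# Reduce noise: cap extreme outliers at the 95th percentile (arbitrary threshold)
clean["income"] = clean["income"].clip(upper=clean["income"].quantile(0.95))

# Data transformation: normalize numeric columns to [0, 1], discretize age
clean[["income", "age"]] = MinMaxScaler().fit_transform(clean[["income", "age"]])
clean["age_group"] = pd.cut(clean["age"], bins=3, labels=["young", "middle", "senior"])

# Data reduction: drop a column that carries no predictive information
well_formed = clean.drop(columns=["customer_id"])
print(well_formed)
```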
fit@hcmus
Data mining tasks

Data Mining Task | Learning Method | Popular Algorithms
Prediction | Supervised | Classification and Regression Trees, ANN, SVM, Genetic Algorithms
Classification | Supervised | Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
Regression | Supervised | Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association | Unsupervised | Apriori, OneR, ZeroR, Eclat
Link analysis | Unsupervised | Expectation Maximization, Apriori Algorithm, Graph-based Matching
Sequence analysis | Unsupervised | Apriori Algorithm, FP-Growth technique
Clustering | Unsupervised | K-means, ANN/SOM
Outlier analysis | Unsupervised | K-means, Expectation Maximization (EM)

Recommender Systems Handbook, Springer Science+Business Media, LLC 2011
Sharda, Business Intelligence, 3rd ed.


fit@hcmus
Supervised vs unsupervised

• Supervised learning
  • All data is labeled and the algorithms learn to predict the output from the input data.
  • Given input variables (X) and an output variable (Y), you use an algorithm to learn the mapping function from the input to the output: Y = f(X)
• Unsupervised learning
  • You only have input data (X) and no corresponding output variables.
  • The goal is to model the underlying structure or distribution in the data in order to learn more about the data.
  • There are no correct answers and there is no teacher.
  • All data is unlabeled and the algorithms learn the inherent structure from the input data.
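The contrast can be seen in a few lines of scikit-learn: a supervised learner is fit on (X, y) pairs and then predicts labels, while an unsupervised learner only sees X. This is a generic sketch on synthetic data, not part of the chapter's SSAS workflow.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))             # input variables (features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # known labels -> supervised setting

# Supervised: learn the mapping Y = f(X) from labeled examples, then predict
clf = DecisionTreeClassifier().fit(X, y)
print("predicted labels:", clf.predict(X[:5]))

# Unsupervised: no labels, only model the structure of X
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:5])
```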
fit@hcmus
Classic application
• "Market basket" data
  • Purchase(salesID, item)
    • (3, bread), (3, milk), (3, eggs), (3, beer)
    • (4, beer), (4, chips)
    • ....
• Want to find association rules: {L1, L2, ..., Ln} -> R
  • Interpretation: "If a customer bought all the items in set {L1, L2, ..., Ln}, he is very likely to also have bought item R"
  • Ex: {bread, milk} -> eggs; {diapers} -> beer
• Goal of data mining: quickly find association rules over extremely large data sets (e.g., all Wal-Mart sales for a year).

CS145 - http://infolab.stanford.edu/
fit@hcmus
Classic application
• Classification trees (= decision trees)
  • Buyers(<attributes>, purchase)
  • Want to predict purchase from <attributes>
• Clustering
  • Buyers(<attributes>)
  • Automatically group buyers into N similar types
• Top-N items
  • Purchase(salesID, item)
  • What were the N most often purchased items? (salesID irrelevant)
fit@hcmus
DM - Classification
• Most frequently used DM method
• Employs supervised learning
• Learns from past data, classifies new data
• The output variable is categorical (nominal or ordinal) in nature
• Predicts categorical class labels (discrete or nominal)
• Uses the labels of the training data to classify new data
• There are many classification algorithms available: Decision Trees, Bayesian Classifiers, Neural Networks, K-Nearest Neighbour, Support Vector Machines, Linear Regression
fit@hcmus
DM - Classification

• Example:
  Customer profile → Classifier → "Will buy bike" / "Will not"

• A model or classifier is constructed to predict categorical labels such as {hard, soft, none} for a lens recommendation application.
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
fit@hcmus
How Does Classification Work?

1. All data is labeled and the algorithms learn to predict the output from the input data.
2. Learning Step (Training Phase): construction of the classification model
  • Given input variables (X) and an output variable (Y), an algorithm learns the mapping function from the input to the output: Y = f(X)
  • Different algorithms are used to build a classifier by making the model learn from the available training set
3. Classification Step
  • The model is used to predict class labels and is tested on test data
  • Given an unlabeled observation X, predict(X) returns the predicted label y
4. Evaluate the classifier model
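These steps map onto the usual scikit-learn workflow. The sketch below is a generic illustration on synthetic data (the 70/30 split anticipates the following slides); it is not the SSAS implementation used later in the chapter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Labeled data
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# 2. Learning step: split, then fit a classifier on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 3. Classification step: predict class labels for unseen observations
y_pred = model.predict(X_test)

# 4. Evaluate the classifier on data it has never seen
print("accuracy:", accuracy_score(y_test, y_pred))
```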
fit@hcmus
How Does Classification Work?

• The weather problem: predict whether to play or not
  • Suppose the weather concerns the conditions that are suitable for playing some unspecified game
  • When a new case arrives, measure the variables (outlook, temperature, humidity, windy) to predict whether to play or not
fit@hcmus
How Does Classification Work?
The weather problem
1. Labeled data:
  • Attributes/features:
    • Outlook {sunny, overcast, rainy}
    • Temperature {hot, mild, cool}
    • Humidity {high, normal}
    • Windy {true, false}
  • Attribute values: symbolic categories
  • The outcome is: play or not play
  • A predefined class label is assigned to every sample tuple or object
fit@hcmus
How Does Classification Work?

• Learning Step (Training Phase):
  • Randomly split the loaded dataset into two parts (e.g., 70%-30%): training rows and testing rows
  • Perform the model training on the training set
  • Use the test set for validation purposes
• Model usage

[Figure: preprocessed data is split into training data (2/3), used for model development to build the classifier, and testing data (1/3), used for model assessment (scoring) to estimate prediction accuracy; the independent variables are the inputs and the dependent variable is the predicted output]

Sharda, Business Intelligence, 3rd ed.


fit@hcmus
How Does Classification Work?

• Learning Step (Training Phase):
  • Randomly split the loaded dataset into training rows and testing rows
fit@hcmus
How Does Classification Work?

• Learning Step (Training Phase):
  • Perform the model training on the training set

Training data → Classification algorithm → Classifier

A set of rules learned from this information:
• If outlook = sunny and humidity = high then play = no
• If outlook = rainy and windy = true then play = no
• If outlook = overcast then play = yes
• ………..
fit@hcmus
How Does Classification Work?

• Learning Step (Training Phase):
  • Use the test set for validation purposes

Testing data → Classifier → predicted outcomes → test accuracy

• Hide the real outcome in the testing data, then compare the classifier's predictions with the real outcomes.
fit@hcmus
How Does Classification Work?

• Learning Step (Training Phase):
  • Use the test set for validation purposes
  • If the accuracy is acceptable, apply the classifier to new (unlabeled) data:

New data (unlabeled)                    Predicted label
Overcast  Mild  High    False  ??   →   yes
Overcast  Cold  Normal  True   ??   →   no
fit@hcmus
How Does Classification Work?
Assessment Methods
• To predict the performance of a classifier on new data, we need to assess its error rate on a dataset that played no part in the formation of the classifier → the test set (an independent dataset)
• The test data is not used in any way to create the classifier.
• If the class prediction is correct → SuccessCount++; if not, it is an error → ErrorCount++
• The error rate = the proportion of errors made over the whole set of instances
• Understanding the accuracy of your model is invaluable because you can begin to tune the parameters of your model to increase its performance.
fit@hcmus
How Does Classification Work?

• Assessment Methods (cont.)
  • In classification problems, the primary source for accuracy estimation is the confusion matrix:

                           True Class
                           Positive                False Class: Negative
Predicted   Positive       True Positive (TP)      False Positive (FP)
Class       Negative       False Negative (FN)     True Negative (TN)

  Accuracy = (TP + TN) / (TP + TN + FP + FN)
  True Positive Rate = TP / (TP + FN)
  True Negative Rate = TN / (TN + FP)
  Precision = TP / (TP + FP)
  Recall = TP / (TP + FN)

Sharda, Business Intelligence, 3rd ed.
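A small sketch of how these formulas are applied to a confusion matrix in Python; the true and predicted labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1 = positive class)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, scikit-learn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)     # true positive rate
tnr       = tn / (tn + fp)     # true negative rate

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} TNR={tnr:.2f}")
```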


fit@hcmus
How Does Classification Work?

• k-Fold Cross-Validation
  • Split the data into k mutually exclusive subsets
  • Use each subset for testing while using the rest of the subsets for training
  • Repeat the experiment k times
  • Aggregate the k test results for a true estimation of prediction accuracy

https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
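In scikit-learn the whole k-fold loop is a single call; a minimal sketch on synthetic data (k = 5 is an arbitrary choice).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=5, random_state=0)

# k mutually exclusive subsets; each fold is used exactly once for testing
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print("fold accuracies:", scores.round(2))
print("estimated accuracy:", scores.mean().round(2))   # aggregate the k test results
```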
fit@hcmus
How Does Classification Work?

• Training set:
  • A set of examples used for learning, that is, to fit the parameters of the classifier.
• Validation set:
  • A set of examples used to tune the parameters of a classifier.
• Test set:
  • A set of examples used only to assess the performance of a fully specified classifier.
• If you have a model with no hyperparameters, or ones that cannot be easily tuned, you probably don't need a validation set.

https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
fit@hcmus
Classification vs Prediction

• What is prediction?
  • Classification models predict categorical class labels
  • Prediction models predict continuous-valued functions
• Ex:
  • Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company → prediction
  • A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe → classification
fit@hcmus
Classification - Decision Tree

• Definition:
  • Employs the divide-and-conquer method
  • Recursively divides a training set until each division consists of examples from one class
• Types of node:
  • Root node: the attribute placed at the root of the tree
  • Internal nodes (non-leaf nodes): each denotes a test on an attribute
  • Leaf nodes (terminal nodes): hold class labels
• How are decision trees used for classification?
fit@hcmus
Classification - Decision Tree
• Attributes:
  • Input (independent variables): v features/attributes X1, X2, ..., Xv
    • Each Xj has a domain Oj:
      • Categorical, e.g. {high, cold}
      • Numerical, e.g. {0, 1}
  • Output (dependent variable)/class: C with domain Oy
    • Categorical: classification
    • Numerical: regression
• Given a dataset D with n rows:
  • n examples (Xi, Ci); Xi is a v-dimensional feature vector
  • Ci ∈ Oy is the output variable
• Task:
  • Given an input data vector x, predict C
http://cs246.stanford.edu
fit@hcmus
Classification - Decision Tree
Idea:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each and every leaf node until the stopping criterion is reached.

Sharda, Business Intelligence, 3rd ed.


fit@hcmus
Classification - Decision Tree
• Main problems:
  1. Splitting criteria
    • Which variable, what value, etc.
  2. Stopping criteria
    • When to stop building the tree
  3. Pruning (generalization method)
    • Pre-pruning versus post-pruning
• The most popular DT algorithms include:
  • ID3, C4.5, C5; CART; CHAID; M5

Sharda, Business Intelligence, 3rd ed.


fit@hcmus
Classification - Decision Tree
• Alternative splitting criteria
  • Gini index determines the purity of a specific class as a result of a decision to branch along a particular attribute/value
    • Used in CART
  • Information gain uses entropy to measure the extent of uncertainty or randomness of a particular attribute/value split
    • Used in ID3, C4.5, C5
Classification - Decision Tree
fit@hcmus

• Impurity/Entropy (informal)
  • Measures the level of impurity in a group of examples
  • Entropy: a common way to measure impurity
  • The expected information needed to classify a tuple in D is given by:

    Info(D) = - Σ_{i=1..m} pi · log2(pi),  with  pi = |Ci,D| / |D|

    • m: the number of classes
    • pi: the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D| (the proportion of tuples of each class)
    • Entropy comes from information theory. The higher the entropy, the more the information content.

https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
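Info(D) translates almost directly into Python; a minimal sketch that takes the class counts |Ci,D| as a list.

```python
import math

def info(class_counts):
    """Info(D) = -sum_i p_i * log2(p_i), where p_i = |C_i,D| / |D|."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:            # 0 * log2(0) is taken as 0
            continue
        p = count / total
        result -= p * math.log2(p)
    return result

print(info([9, 5]))    # the weather data: 9 "yes", 5 "no" -> about 0.940
print(info([7, 7]))    # maximally impure -> 1.0
print(info([14, 0]))   # pure -> 0.0
```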
fit@hcmus
Decision Tree - Classification

• Example

Data Mining: Practical Machine Learning Tools and Techniques, p. 11


fit@hcmus
Information gain

• Training set D
• Attribute X = {x1, x2, …, xv}
  • Outlook = {sunny, overcast, rainy}
  • Temperature = {hot, mild, cool}
  • Humidity = {high, normal}
  • Windy = {true, false}
• Which attribute should be used for the split, and on what values?

  InfoX(D) = Σ_{j=1..v} (|Dj| / |D|) · Info(Dj)
  Gain(X) = Info(D) - InfoX(D)

• Gain(Outlook)? Gain(Temperature)? Gain(Humidity)? Gain(Windy)?
fit@hcmus
Example: training set

• Information gain
  • 14 tuples: 9 yes (play tennis); 5 no
  • |D| = 14
  • m = 2
  • C1 = "Yes"; C2 = "No"
  • |C1,D| = 9; |C2,D| = 5

  Info(D) = I(9,5) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.94
fit@hcmus
Information Gain

Training data (14 tuples):

id  outlook   temperature  humidity  wind    play
1   sunny     hot          high      weak    no
2   sunny     hot          high      strong  no
3   overcast  hot          high      weak    yes
4   rainy     mild         high      weak    yes
5   rainy     cool         normal    weak    yes
6   rainy     cool         normal    strong  no
7   overcast  cool         normal    strong  yes
8   sunny     mild         high      weak    no
9   sunny     cool         normal    weak    yes
10  rainy     mild         normal    weak    yes
11  sunny     mild         normal    strong  yes
12  overcast  mild         high      strong  yes
13  overcast  hot          normal    weak    yes
14  rainy     mild         high      strong  no

Outlook   C1j: Yes  C2j: No  I(C1j,C2j)
Sunny     2         3        0.971
Overcast  4         0        0
Rainy     3         2        0.971

I(2,3) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.971
I(4,0) = -(4/4)·log2(4/4) - 0 = 0
I(3,2) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.971

InfoOutlook(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.693

Gain(Outlook) = Info(D) - InfoOutlook(D) = 0.94 - 0.693 = 0.25
fit@hcmus
Information Gain

Grouping the same 14 training tuples by temperature:

Temperature  C1j: Yes  C2j: No  I(C1j,C2j)
Hot          2         2        1
Mild         4         2        0.9183
Cool         3         1        0.8113

I(2,2) = -(2/4)·log2(2/4) - (2/4)·log2(2/4) = 1
I(4,2) = -(4/6)·log2(4/6) - (2/6)·log2(2/6) = 0.9183
I(3,1) = -(3/4)·log2(3/4) - (1/4)·log2(1/4) = 0.8113

InfoTemperature(D) = (4/14)·I(2,2) + (6/14)·I(4,2) + (4/14)·I(3,1) = 0.911

Gain(Temperature) = Info(D) - InfoTemperature(D) = 0.94 - 0.911 = 0.029
fit@hcmus
Information Gain

Grouping the same 14 training tuples by humidity:

Humidity  C1j: Yes  C2j: No  I(C1j,C2j)
High      3         4        0.985
Normal    6         1        0.592

I(3,4) = -(3/7)·log2(3/7) - (4/7)·log2(4/7) = 0.985
I(6,1) = -(6/7)·log2(6/7) - (1/7)·log2(1/7) = 0.592

InfoHumidity(D) = (7/14)·I(3,4) + (7/14)·I(6,1) = 0.7885

Gain(Humidity) = Info(D) - InfoHumidity(D) = 0.94 - 0.7885 = 0.152
fit@hcmus
Information Gain

Grouping the same 14 training tuples by wind:

Windy   C1j: Yes  C2j: No  I(C1j,C2j)
Weak    6         2        0.811
Strong  3         3        1

I(6,2) = -(6/8)·log2(6/8) - (2/8)·log2(2/8) = 0.811
I(3,3) = -(3/6)·log2(3/6) - (3/6)·log2(3/6) = 1

InfoWindy(D) = (8/14)·I(6,2) + (6/14)·I(3,3) = 0.892

Gain(Windy) = Info(D) - InfoWindy(D) = 0.94 - 0.892 = 0.048
fit@hcmus
Information Gain

Summary of the candidate splits on the full training set:

Gain(Outlook)     = 0.25   (largest)
Gain(Humidity)    = 0.152
Gain(Windy)       = 0.048
Gain(Temperature) = 0.029

→ Choose the attribute with the largest information gain as the decision node: Outlook
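The whole calculation above can be reproduced in a few lines of Python; the sketch below encodes the 14 training tuples and recomputes the gain of each attribute (the values match the slides up to rounding).

```python
import math
from collections import Counter, defaultdict

# The 14 weather tuples: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rainy", "mild", "high", "weak", "yes"),
    ("rainy", "cool", "normal", "weak", "yes"),      ("rainy", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rainy", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rainy", "mild", "high", "strong", "no"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

def info(labels):
    """Info(D) over a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Gain(X) = Info(D) - InfoX(D) for the attribute at attr_index."""
    labels = [row[-1] for row in rows]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[attr_index]].append(row[-1])
    info_x = sum(len(part) / len(rows) * info(part) for part in partitions.values())
    return info(labels) - info_x

for i, name in enumerate(attributes):
    print(f"Gain({name}) = {gain(data, i):.3f}")
# -> outlook ~0.247, temperature ~0.029, humidity ~0.152, windy ~0.048
```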
fit@hcmus
Information Gain

Split on Outlook (the root node):

• Outlook = overcast (ids 3, 7, 12, 13): all tuples are "yes" → leaf node Yes
• Outlook = sunny (ids 1, 2, 8, 9, 11): mixed → repeat the attribute selection on this subset

  id  temperature  humidity  wind    play
  1   hot          high      weak    no
  2   hot          high      strong  no
  8   mild         high      weak    no
  9   cool         normal    weak    yes
  11  mild         normal    strong  yes

• Outlook = rainy (ids 4, 5, 6, 10, 14): mixed → repeat the attribute selection on this subset

  id  temperature  humidity  wind    play
  4   mild         high      weak    yes
  5   cool         normal    weak    yes
  6   cool         normal    strong  no
  10  mild         normal    weak    yes
  14  mild         high      strong  no
fit@hcmus
Information Gain

The final tree: Outlook at the root; under Sunny, split on Humidity; under Rainy, split on Windy; Overcast is a Yes leaf.

• R1: if outlook = overcast then yes
• R2: if outlook = sunny and humidity = high then no
• R3: if outlook = sunny and humidity = normal then yes
• R4: if outlook = rainy and windy = weak then yes
• R5: if outlook = rainy and windy = strong then no
fit@hcmus
Classification - Decision Tree
• Example: Bike buyer prediction using the AdventureWorks database
• Purpose:
  • Create a classification model that predicts whether or not a customer will purchase a bike
  • The model should predict bike purchasing for new customers for whom no information about average monthly spend or previous bike purchases is available

https://www.kaggle.com/rahulsah06/bike-buying-prediction-for-adventure-works-cycles?select=AW_BikeBuyer.csv
fit@hcmus
Classification - Decision Tree
• Bike buyer prediction using the AdventureWorks database
• Data:
  • The class label is described as either '1: Yes' or '0: No', indicating whether the customer is a bike buyer or not.
  • The detailed description of the dataset is shown in the table. [Figure: dataset description table, with the bike-buyer class as the target]
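Outside SSAS, a comparable model could be sketched with scikit-learn on the Kaggle CSV referenced above. The file name and the target column name used here (AW_BikeBuyer.csv, BikeBuyer) are assumptions for illustration and should be checked against the actual download.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Assumed file and column names -- verify against the downloaded dataset
df = pd.read_csv("AW_BikeBuyer.csv")
target = "BikeBuyer"                               # assumed label column: 1 = Yes, 0 = No

X = pd.get_dummies(df.drop(columns=[target]))      # one-hot encode categorical attributes
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```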
fit@hcmus
Cluster Analysis for Data Mining
• Clustering techniques apply when there is no class to be predicted but the instances are to be divided into natural groups
• Clustering is dividing data points into homogeneous classes or clusters:
  • Points in the same group are as similar as possible
  • Points in different groups are as dissimilar as possible
• Used for automatic identification of natural groupings of things
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past data, then assigns new instances

Sharda, Business Intelligence, 3rd ed.


fit@hcmus
Cluster Analysis for Data Mining
• Clustering results may be used to:
  • Identify natural groupings of customers
  • Identify rules for assigning new cases to classes for targeting/diagnostic purposes
  • Provide characterization, definition, and labeling of populations
  • Decrease the size and complexity of problems for other data mining methods
  • Identify outliers in a specific domain (e.g., rare-event detection)

Sharda, Business Intelligence, 3rd ed.


fit@hcmus
K-means
• k-Means Clustering Algorithm
  • k: pre-determined number of clusters
  • Algorithm (Step 0: determine the value of k)
    Step 1: Randomly generate k random points as initial cluster centers.
    Step 2: Assign each point to the nearest cluster center.
    Step 3: Re-compute the new cluster centers.
    Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable).

Sharda, Business Intelligence, 3rd ed.
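A minimal NumPy sketch of the algorithm above, using squared Euclidean distance as the closeness measure; it runs on synthetic 2-D points and does not handle empty clusters.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points at random as initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    assignment = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest cluster center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assignment = d2.argmin(axis=1)
        # Step 3: re-compute the new cluster centers (empty clusters not handled)
        new_centers = np.array([points[assignment == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):      # convergence: centers stable
            break
        centers = new_centers
    return centers, assignment

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, labels = kmeans(pts, k=2)
print(centers)
```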


fit@hcmus
K-means
The decision of assigning a point to a cluster (or merging two clusters) is based on closeness. There are multiple metrics for measuring this closeness:
• Euclidean distance: ||a-b||₂ = √(Σ(aᵢ-bᵢ)²)
• Squared Euclidean distance: ||a-b||₂² = Σ(aᵢ-bᵢ)²
• Manhattan distance: ||a-b||₁ = Σ|aᵢ-bᵢ|
• Pearson correlation distance
• Spearman correlation distance
• …
fit@hcmus
Association Rule Mining
• Is used when you want to find associations between different objects in a set, or find frequent patterns in a transaction database, relational databases, or any other information repository.
• The applications of Association Rule Mining are found in Marketing and Basket Data Analysis
  • "Frequently Bought Together" → Association
  • "Customers who bought this item also bought" → Recommendation
fit@hcmus
The market basket model

• Market Basket Analysis takes data at the transaction level, which lists all items bought by a customer in a single purchase.
• The technique determines relationships of what products were purchased with which other product(s).
• These relationships are then used to build profiles containing If-Then rules of the items purchased.
• The rules could be written as: If {X} Then {Y}
Association Rule Mining
fit@hcmus

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• Item = a product purchased in a basket/transaction
• An itemset is a set of items
  • An itemset that contains k items is a k-itemset
  • Ex: {beer, diaper} is a 2-itemset
• Let D = {t1, t2, …, tm} be the set of all transactions, called the dataset.
  • Ex: t1 = {Milk, bread, eggs}
• Let I = {i1, i2, …, in} be the set of all items in the market basket data
• Each transaction ti contains a subset of items chosen from I
• An association rule is defined as an implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅
  • Ex: Diaper → Beer (buying diapers may likely lead to buying beer)
• Problem: find sets of items that appear together "frequently" in baskets
fit@hcmus
Association Rule Mining

The same transactions in binary form (one column per item):

Tid  Beer  Nuts  Diaper  Coffee  Egg  Milk
10   1     1     1       0       0    0
20   1     0     1       1       0    0
30   1     0     1       0       1    0
40   0     1     0       0       1    1
50   0     1     1       1       1    1

This representation is a very simplistic view of market basket data.


fit@hcmus
Association Rule Mining
• Support count: the number of transactions that contain a given itemset.
• Frequent itemsets: a set of items that appears in many baskets is said to be "frequent."
  • Given a number minsup, called the support threshold: if support(X) >= minsup then X is frequent.
• Association rule discovery
  • Given a set of transactions T, find all the rules having support >= minsup and confidence >= minconf, where minsup and minconf are the corresponding support and confidence thresholds.

Tid Items bought


10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
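The support and confidence measures defined on this and the next slide can be checked with a few lines of Python over the five example transactions.

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """conf(lhs => rhs) = supp(lhs U rhs) / supp(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Beer"}))                  # 0.6
print(support({"Diaper"}))                # 0.8
print(support({"Beer", "Diaper"}))        # 0.6
print(confidence({"Diaper"}, {"Beer"}))   # 0.75
```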
fit@hcmus
Association Rule Mining
(using the five transactions above)

• Supp(X) = count(X) / |D|
  • Ex: supp({Beer}) = 3/5 = 60%
  • Ex: supp({Diaper}) = 4/5 = 80%
  • Ex: supp({Beer, Diaper}) = 3/5 = 60%
• The strength of an association rule can be measured as:
  • Support, s(X → Y): the percentage of transactions in the database that contain X ∪ Y
    • supp(X ⇒ Y) = supp(X ∪ Y) = count(X ∪ Y) / |D|
  • Confidence, c(X → Y): the conditional probability that a transaction containing X also contains Y
    • conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
    • Ex: c(Diaper ⇒ Beer) = supp({Diaper, Beer}) / supp({Diaper}) = 3/4 = 0.75
fit@hcmus
Association Rule Mining
• Let minsup = 50%
• All the frequent 1-itemsets:
  • Beer: 3/5 (60%)
  • Nuts: 3/5 (60%)
  • Diaper: 4/5 (80%)
  • Eggs: 3/5 (60%)
  • Coffee: 2/5 (40%) < minsup → not frequent
  • Milk: 2/5 (40%) < minsup → not frequent
• All the frequent 2-itemsets: {Beer, Diaper}: 3/5 (60%)
• All the frequent 3-itemsets: none.
fit@hcmus
Apriori Algorithm
Association rule mining can be viewed as a two-step process:
1. Find the item subsets that are common to at least a minimum number of the transactions (the frequent itemsets)
2. Use a bottom-up approach:
  • frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on), and
  • groups of candidates at each level are tested against the data for minimum support.

Sharda, Business Intelligence, 3rd ed.


The Apriori Algorithm—An Example
fit@hcmus

http://hanj.cs.illinois.edu/cs412/bk3_slides/06FPBasic.pdf
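A compact, pure-Python sketch of the bottom-up, level-wise search described above; it omits the subset-based candidate pruning of full Apriori and is written for readability, not efficiency, reusing the five-transaction example.

```python
from itertools import combinations

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def frequent_itemsets(transactions, minsup=0.5):
    n = len(transactions)
    sup = lambda c: sum(c <= t for t in transactions) / n
    # Level 1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = list(current)
    k = 2
    while current:
        # Extend frequent (k-1)-itemsets by one item to form candidate k-itemsets
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Test each level of candidates against the data for minimum support
        current = [c for c in candidates if sup(c) >= minsup]
        frequent.extend(current)
        k += 1
    return frequent

for itemset in frequent_itemsets(transactions, minsup=0.5):
    print(set(itemset))
# -> the frequent 1-itemsets plus {Beer, Diaper}, matching the earlier slide
```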
fit@hcmus
Association rule with SSAS

• Data source: AdventureWorksDW
• Problem: the objective of the association rule mining is to find out which models are selling together.

https://docs.microsoft.com/en-us/analysis-services/data-mining/microsoft-association-algorithm?view=asallproducts-allversions
fit@hcmus
Association rule with SSAS
1. Data preparation:
  • The requirements for an association rules model are as follows:
    • A single key column: one numeric or text column that uniquely identifies each record; compound keys are not permitted.
    • A single predictable column: the values must be discrete or discretized.
    • Input columns: the input columns must be discrete. The input data for an association model is often contained in two tables:
      • one table might contain customer information while another table contains customer purchases
      • vAssocSeqLineItems: nested table
      • vAssocSeqOrders: case table
fit@hcmus
Association rule with SSAS
fit@hcmus
Association rule with SSAS

• After the Association Rule configuration is completed, the model can be processed. Then users can review the prediction model and perform predictions.
fit@hcmus
Association rule with SSAS

• Input columns: the input columns must be discrete
• A single predictable column: the values must be discrete or discretized
Modifying and Processing the Market Basket Model
fit@hcmus

• Before you process the association mining model that you created, you must change the default values of two of the parameters: support and probability.
  • MINIMUM_PROBABILITY = 0.5
  • MINIMUM_SUPPORT = 0.03
Modifying and Processing the Market Basket Model
fit@hcmus

• Support:
  • A MINIMUM_SUPPORT value of 0.03 means that at least 3% of the total cases in the data set must contain an item or itemset for it to be included in the model.
• Confidence:
  • MINIMUM_PROBABILITY = 0.5 means that no rule with less than fifty percent probability can be generated.
fit@hcmus
Mining model viewer

• In the Mining Model viewer, there are three tabs to view the data patterns. The Rules tab shows the rules that can be derived from the Association Rule Mining model on the sample set.
fit@hcmus

If you click any node, the related nodes will be highlighted with different colors, as shown in the screenshot below.
fit@hcmus

The highlighted items are the items which will be bought by the customers who had bought Water Bottle.
fit@hcmus
References
• J. Han, J. Pei, M. Kamber - Data Mining: Concepts and Techniques
• Ian H. Witten, Eibe Frank, Mark A. Hall - Data Mining: Practical Machine Learning Tools and Techniques
• https://medium.com/analytics-vidhya/entropy-calculation-information-gain-decision-tree-learning-771325d16f
• https://www.saedsayad.com/decision_tree.htm
• http://hanj.cs.illinois.edu/cs412/bk3/06.pdf
