4.0 - Lesson 6 - BI Mining
Chapter 6
BI - Mining
Learning Objectives
Introduction to data mining
Data in Data mining
Structured
Semi-structured
Unstructured
https://link.springer.com/chapter/10.1007%2F978-1-84628-766-4_7 https://www.geeksforgeeks.org/understanding-data-attribute-types-qualitative-and-quantitative/
Data mining process - CRISP-DM
Data Consolidation:
· Collect data
· Select data
· Integrate data
Data Transformation:
· Normalize data
· Discretize/aggregate data
· Construct new attributes
Result: well-formed data
Sharda, Business Intelligence, 3rd edition
Data mining tasks
Supervised learning
All data is labeled and the algorithms learn to predict the output from the input data.
Given input variables (X) and an output variable (Y), an algorithm learns the mapping function from the input to the output: Y = f(X)
Unsupervised learning
You only have input data (X) and no corresponding output variables.
The goal is to model the underlying structure or distribution in the data in order to learn more about the data.
There are no correct answers and there is no teacher.
All data is unlabeled and the algorithms learn the inherent structure from the input data.
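A minimal sketch of the two settings, assuming scikit-learn (the slides do not prescribe a library): the supervised learner is fit on labeled pairs (X, y), the unsupervised one on X alone.

```python
# Minimal sketch (assumes scikit-learn; not prescribed by the slides).
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # supervised: needs labels y
from sklearn.cluster import KMeans                # unsupervised: input X only

X = np.array([[1, 2], [1, 4], [8, 7], [9, 6]])    # input variables X
y = np.array([0, 0, 1, 1])                        # output variable Y (labels)

# Supervised learning: learn the mapping Y = f(X) from labeled data.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[8, 8]]))                      # predicted label for a new input

# Unsupervised learning: no labels; model the structure of X itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                                 # inherent grouping found in X
```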
Classic application
"Market basket" data
Purchase(salesID, item), e.g.:
(3, bread), (3, milk), (3, eggs), (3, beer)
(4, beer), (4, chips), ...
Want to find association rules: {L1, L2, ..., Ln} -> R
Interpretation: "If a customer bought all the items in the set {L1, L2, ..., Ln}, he is very likely to also have bought item R."
Ex: {bread, milk} -> eggs; {diapers} -> beer
Goal of data mining: quickly find association rules over extremely large data sets (e.g., all Wal-Mart sales for a year).
CS145 - http://infolab.stanford.edu/
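As a sketch of what the support and confidence of such a rule mean, the rule {bread, milk} -> eggs can be checked on a few toy Purchase(salesID, item) facts in plain Python; the baskets below are illustrative, extending the sales 3 and 4 shown above.

```python
from collections import defaultdict

# Toy Purchase(salesID, item) facts, grouped into one basket per sale.
purchases = [(3, "bread"), (3, "milk"), (3, "eggs"), (3, "beer"),
             (4, "beer"), (4, "chips"),
             (5, "bread"), (5, "milk"), (5, "eggs")]      # sale 5 is made up
baskets = defaultdict(set)
for sale_id, item in purchases:
    baskets[sale_id].add(item)

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    hits = sum(1 for basket in baskets.values() if itemset <= basket)
    return hits / len(baskets)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) for a rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)

# Rule {bread, milk} -> eggs from the slide:
lhs, rhs = {"bread", "milk"}, {"eggs"}
print("support:", support(lhs | rhs), "confidence:", confidence(lhs, rhs))
```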
Classic application
Classification trees (= decision trees)
Buyers(<attributes>, purchase)
Want to predict purchase from <attributes>
Clustering
Buyers(<attributes>)
Automatically group buyers into N similar types
Top-N items
Purchase(salesID, item)
What were the N most often purchased items?
(salesID irrelevant)
DM - Classification
Most frequently used DM method
Employ supervised learning
Learn from past data, classify new data
The output variable is categorical (nominal or ordinal) in
nature
Predicts categorical class labels (discrete or nominal)
Use labels of the training data to classify new data
There are many classification algorithms available:
Decision Trees, Bayesian Classifiers, Neural Networks, K-Nearest Neighbour, Support Vector Machines, Logistic Regression
DM - Classification
Example:
1. All data is labeled and the algorithms learn to predict the output from the input data.
2. Learning step (training phase): construction of the classification model. Given input variables (X) and an output variable (Y), an algorithm learns the mapping function from the input to the output: Y = f(X). Different algorithms are used to build a classifier by making the model learn from the available training set.
3. Classification step: the model is used to predict class labels, and the constructed model is tested on test data. Given an unlabeled observation X, predict(X) returns the predicted label y.
4. Evaluate the classifier model.
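A minimal sketch of these steps, assuming scikit-learn and its built-in iris data (neither is prescribed by the slides):

```python
# Sketch of the learning step, the classification step, and the evaluation step.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # labeled data (X, Y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Learning step: build the classification model from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# 3. Classification step: predict(X) returns the predicted labels y.
y_pred = model.predict(X_test)

# 4. Evaluate the classifier model on the held-out test data.
print("accuracy:", accuracy_score(y_test, y_pred))
```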
How Does Classification Work?
The weather problem: predict whether to play or not.
The weather data describes the conditions that are suitable for playing some unspecified game.
When a new case arrives, measure the variables (outlook, temperature, humidity, windy) and use them to predict whether to play or not.
How Does Classification Work?
The weather problem
1. Labeled data:
Attribute/feature:
Outlook { sunny, overcast, rainy}
Temperature { hot, mild, cool}
Humidity { high, normal}
Windy { true, false}
Attribute values: symbolic categories
The outcome is: play or not play
A predefined class label is assigned to every sample tuple or object
How Does Classification Work?
Preprocessed data is split into training rows and testing rows (e.g., 1/3 of the rows held out for testing).
The classifier model is built on the training rows, then applied to the testing data (scoring) to assess its prediction accuracy.
How Does Classification Work?
Test the classifier's accuracy on the testing data: hide the real outcome in the testing data, let the classifier predict it, and compare the predictions against the real outcomes.
How Does Classification Work?
After training, the classifier is applied to new (unlabeled) data, and its predictions are summarized in a confusion matrix:
True Positive Count (TP): positive cases predicted as positive
False Positive Count (FP): negative cases predicted as positive
False Negative Count (FN): positive cases predicted as negative
True Negative Count (TN): negative cases predicted as negative
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
True Negative Rate = TN / (TN + FP)
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
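A small sketch of these measures computed directly from the four counts (the numbers below are made up for illustration):

```python
# Illustrative confusion-matrix counts (made-up numbers, not from the slides).
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)              # of the predicted positives, how many are right
recall = TP / (TP + FN)                 # of the real positives, how many were found
true_negative_rate = TN / (TN + FP)     # of the real negatives, how many were found
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(precision, recall, true_negative_rate, accuracy)
```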
How Does Classification Work?
Training set:
A set of examples used for learning, that is to
fit the parameters of the classifier.
Validation set:
A set of examples used to tune the hyperparameters of a classifier.
Test set:
A set of examples used only to assess the
performance of a fully-specified classifier.
If you have a model with no hyperparameters, or ones that cannot be easily tuned, you probably do not need a validation set.
https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
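One common way to carve out the three sets is to apply scikit-learn's train_test_split twice; the 60/20/20 proportions below are an assumption for illustration, not something the slides specify.

```python
# Sketch: split data into training, validation, and test sets (assumed 60/20/20).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the test set (used only for the final assessment).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into training (fit parameters) and validation
# (tune hyperparameters): 0.25 of the remaining 80% = 20% of the whole.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```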
Classification vs Prediction
What is prediction?
Classification models predict categorical class labels.
Prediction models predict continuous-valued functions.
Ex:
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company: this is prediction.
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe: this is classification.
Classification - Decision Tree
C is the output variable (class label).
Task:
Given an input data vector x, predict C.
http://cs246.stanford.edu
Classification - Decision Tree
Idea:
1. Create a root node and assign all of the training
data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of
the split. Split the data into mutually exclusive
subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each and every leaf node until a stopping criterion is reached.
Impurity/Entropy (informal)
Measures the level of impurity in a group of examples
Entropy: a common way to measure impurity
The expected information needed to classify a tuple in D is given by:
Info(D) = - Σ(i=1..m) pi · log2(pi), where pi = |Ci,D| / |D|
https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
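As a sketch, the formula translates directly into Python (the class counts |Ci,D| are passed in explicitly):

```python
import math

def info(class_counts):
    """Expected information Info(D) = -sum(pi * log2(pi)), with pi = |Ci,D| / |D|."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(info([9, 5]))   # e.g. 9 "yes" and 5 "no" tuples -> about 0.94
```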
Decision Tree - Classification
Example
Training set D; attribute X with values {x1, x2, ..., xv}
Outlook = {sunny, overcast, rainy}
Temperature = {hot, mild, cool}
Humidity = {high, normal}
Windy = {true, false}
Which attribute should be used for the split, and on which values?
InfoX(D) = Σ(j=1..v) (|Dj| / |D|) · Info(Dj)
Gain(X) = Info(D) - InfoX(D)
Gain(Outlook)? Gain(Temperature)? Gain(Humidity)? Gain(Windy)?
Example: training set
Information gain
14 tuples: 9 Yes (play tennis); 5 No
|D| = 14, m = 2
C1 = "Yes"; C2 = "No"
|C1,D| = 9; |C2,D| = 5
Info(D) = I(9,5) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.94
Information Gain
Training data (play tennis):
id | outlook | temperature | humidity | wind | play
1 | sunny | hot | high | weak | no
2 | sunny | hot | high | strong | no
8 | sunny | mild | high | weak | no
9 | sunny | cool | normal | weak | yes
11 | sunny | mild | normal | strong | yes
3 | overcast | hot | high | weak | yes
7 | overcast | cool | normal | strong | yes
12 | overcast | mild | high | strong | yes
13 | overcast | hot | normal | weak | yes
4 | rainy | mild | high | weak | yes
5 | rainy | cool | normal | weak | yes
6 | rainy | cool | normal | strong | no
10 | rainy | mild | normal | weak | yes
14 | rainy | mild | high | strong | no

Outlook | C1j: Yes | C2j: No | I(C1j, C2j)
Sunny | 2 | 3 | 0.971
Overcast | 4 | 0 | 0
Rainy | 3 | 2 | 0.971

I(2,3) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.971
I(4,0) = -(4/4)·log2(4/4) - (0/4)·log2(0/4) = 0
I(3,2) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.971
InfoOutlook(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.693
Gain(Outlook) = Info(D) - InfoOutlook(D) = 0.94 - 0.693 = 0.247
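For checking the arithmetic, here is a small Python sketch (an illustration only, not a tool used in the course) that recomputes Info(D) and the gain of every attribute on the 14 tuples above:

```python
import math
from collections import Counter, defaultdict

# The 14 training tuples (outlook, temperature, humidity, wind, play) from the table above.
data = [
    ("sunny", "hot", "high", "weak", "no"),     ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rainy", "mild", "high", "weak", "yes"),
    ("rainy", "cool", "normal", "weak", "yes"), ("rainy", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rainy", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rainy", "mild", "high", "strong", "no"),
]
attribute_index = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def info(labels):
    """Info(D) = -sum(pi * log2(pi)) over the class proportions pi."""
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def gain(attr):
    """Gain(X) = Info(D) - InfoX(D), weighting each subset Dj by |Dj| / |D|."""
    idx = attribute_index[attr]
    subsets = defaultdict(list)
    for row in data:
        subsets[row[idx]].append(row[-1])   # class labels grouped by attribute value
    info_x = sum(len(lbls) / len(data) * info(lbls) for lbls in subsets.values())
    return info([row[-1] for row in data]) - info_x

for attr in attribute_index:                # Outlook wins with about 0.247
    print(attr, round(gain(attr), 3))
```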
Information Gain
Temperature | C1j: Yes | C2j: No | I(C1j, C2j)
Hot | 2 | 2 | 1
Mild | 4 | 2 | 0.9183
Cool | 3 | 1 | 0.8113

I(2,2) = 1
I(4,2) = -(4/6)·log2(4/6) - (2/6)·log2(2/6) = 0.9183
I(3,1) = -(3/4)·log2(3/4) - (1/4)·log2(1/4) = 0.8113
InfoTemperature(D) = (4/14)·I(2,2) + (6/14)·I(4,2) + (4/14)·I(3,1) = 0.911
Gain(Temperature) = Info(D) - InfoTemperature(D) = 0.94 - 0.911 = 0.029

Humidity | C1j: Yes | C2j: No | I(C1j, C2j)
High | 3 | 4 | 0.985
Normal | 6 | 1 | 0.592

InfoHumidity(D) = (7/14)·I(3,4) + (7/14)·I(6,1) = 0.78845
Gain(Humidity) = Info(D) - InfoHumidity(D) = 0.94 - 0.78845 = 0.152
Information Gain
Windy | C1j: Yes | C2j: No | I(C1j, C2j)
Weak | 6 | 2 | 0.811
Strong | 3 | 3 | 1

InfoWindy(D) = (8/14)·I(6,2) + (6/14)·I(3,3) = 0.892
Gain(Windy) = Info(D) - InfoWindy(D) = 0.94 - 0.892 = 0.048
Information Gain (summary)
Outlook | C1j: Yes | C2j: No | I(C1j, C2j)
Sunny | 2 | 3 | 0.971
Overcast | 4 | 0 | 0
Rainy | 3 | 2 | 0.971
Gain(Outlook) = 0.247

Temperature | C1j: Yes | C2j: No | I(C1j, C2j)
Hot | 2 | 2 | 1
Mild | 4 | 2 | 0.9183
Cool | 3 | 1 | 0.8113
Gain(Temperature) = 0.029

Humidity | C1j: Yes | C2j: No | I(C1j, C2j)
High | 3 | 4 | 0.985
Normal | 6 | 1 | 0.592
Gain(Humidity) = 0.152

Windy | C1j: Yes | C2j: No | I(C1j, C2j)
Weak | 6 | 2 | 0.811
Strong | 3 | 3 | 1
Gain(Windy) = 0.048

Choose the attribute with the largest information gain (Outlook, 0.247) as the decision node.
Information Gain
Split on Outlook:
Outlook = Overcast: all samples are Yes, so this branch becomes a leaf labeled "Yes".
Outlook = Sunny (branch still to be split, "???"):
id | temperature | humidity | wind | play
1 | hot | high | weak | no
2 | hot | high | strong | no
8 | mild | high | weak | no
9 | cool | normal | weak | yes
11 | mild | normal | strong | yes
Outlook = Rainy (branch still to be split, "???"):
id | temperature | humidity | wind | play
4 | mild | high | weak | yes
5 | cool | normal | weak | yes
6 | cool | normal | strong | no
10 | mild | normal | weak | yes
14 | mild | high | strong | no
Repeat the attribute selection on each of these subsets.
Information Gain
https://www.kaggle.com/rahulsah06/bike-buying-prediction-for-adventure-works-cycles?select=AW_BikeBuyer.csv
Classification - Decision Tree
Bike buyer prediction using the AdventureWorks database
Data:
The class is described as either '1: Yes' or '0: No', depending on whether the customer is a bike buyer or not.
The detailed description of the dataset is shown in the table (target attribute: bike buyer).
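A rough sketch of such a model on the AW_BikeBuyer.csv file from the Kaggle link above; the column names used here (BikeBuyer as the 0/1 target, CustomerID as the key) are assumptions about that file, so adjust them to the real schema.

```python
# Sketch only: assumes AW_BikeBuyer.csv has a 0/1 "BikeBuyer" target column and a
# "CustomerID" key column; these names are assumptions, not confirmed by the slides.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("AW_BikeBuyer.csv")
y = df["BikeBuyer"]                                               # class: 1 = Yes, 0 = No
X = pd.get_dummies(df.drop(columns=["BikeBuyer", "CustomerID"]))  # encode categorical inputs

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```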
Cluster Analysis for Data Mining
Clustering techniques apply when there is no class
to be predicted but the instances are to be divided
into natural groups
Clustering is dividing data points into homogeneous classes or clusters:
Points in the same group are as similar as possible
Points in different groups are as dissimilar as possible
Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employ unsupervised learning
Learns the clusters of things from past data, then assigns new
instances
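A short sketch of this unsupervised grouping with k-means; scikit-learn and the made-up buyer attributes below are illustrative assumptions, not part of the slides.

```python
# Sketch: automatically group buyers into N similar types with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Buyers(<attributes>): e.g. [age, yearly purchases]; made-up values for illustration.
buyers = np.array([[25, 3], [27, 4], [45, 20], [50, 22], [33, 10], [31, 9]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(buyers)  # no labels used
print(kmeans.labels_)            # cluster assignment for each buyer
print(kmeans.cluster_centers_)   # the "natural groups" found in the data

# A new instance is assigned to the nearest learned cluster.
print(kmeans.predict([[40, 18]]))
```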
Transaction data in item-list form and in one-hot (binary) form:

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

Tid | Beer | Nuts | Diaper | Coffee | Egg | Milk
10 | 1 | 1 | 1 | 0 | 0 | 0
20 | 1 | 0 | 1 | 1 | 0 | 0
30 | 1 | 0 | 1 | 0 | 1 | 0
40 | 0 | 1 | 0 | 0 | 1 | 1
50 | 0 | 1 | 1 | 1 | 1 | 1
http://hanj.cs.illinois.edu/cs412/bk3_slides/06FPBasic.pdf
Association rule with SSAS
https://docs.microsoft.com/en-us/analysis-services/data-mining/microsoft-association-algorithm?view=asallproducts-allversions
Association rule with SSAS
1. Data preparation:
The requirements for an association rules
model are as follows:
A single key column: one numeric or text column that uniquely identifies each record. Compound keys are not permitted.
A single predictable column: the values must be discrete or discretized.
Input columns: the input columns must be discrete. The input data for an association model is often contained in two tables: one table might contain customer information while another table contains customer purchases.
vAssocSeqLineItems: nested table
vAssocSeqOrders: case table
Association rule with SSAS
Support:
A MINIMUM_SUPPORT value of 0.03 means that at least 3% of the total cases in the data set must contain an item or itemset for it to be included in the model.
Confidence:
A MINIMUM_PROBABILITY value of 0.5 means that no rule with less than fifty percent probability can be generated.
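Outside SSAS, the same two thresholds have direct counterparts in other association-rule tools; the sketch below uses the mlxtend library (an illustrative assumption, not part of the SSAS workflow) on the small transaction table shown earlier, with min_support playing the role of MINIMUM_SUPPORT and the confidence threshold the role of MINIMUM_PROBABILITY.

```python
# Illustration of the same thresholds outside SSAS, assuming the mlxtend library.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["Beer", "Nuts", "Diaper"],
                ["Beer", "Coffee", "Diaper"],
                ["Beer", "Diaper", "Eggs"],
                ["Nuts", "Eggs", "Milk"],
                ["Nuts", "Coffee", "Diaper", "Eggs", "Milk"]]

# One-hot encode the transactions (same shape as the binary table above).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# MINIMUM_SUPPORT ~ min_support: itemsets must appear in at least this fraction of cases.
frequent = apriori(onehot, min_support=0.03, use_colnames=True)

# MINIMUM_PROBABILITY ~ confidence threshold: keep only rules with confidence >= 0.5.
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```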
Mining model viewer