4.0 - Lesson 6 - BI Mining
Chapter 6
BI - Mining
Learning Objectives
Introduction to data mining
Data in Data mining
Structured
Semi-structured
Unstructured
https://link.springer.com/chapter/10.1007%2F978-1-84628-766-4_7 https://www.geeksforgeeks.org/understanding-data-attribute-types-qualitative-and-quantitative/
Data mining process - CRISP-DM
Data Consolidation:
· Collect data
· Select data
· Integrate data
Data Transformation:
· Normalize data
· Discretize/aggregate data
· Construct new attributes
Result: well-formed data
Sharda, Business Intelligence, 3rd edition
Data mining tasks
Supervised learning
All data is labeled and the algorithms learn to predict the output from the input data.
Given input variables (X) and an output variable (Y), an algorithm learns the mapping function from the input to the output: Y = f(X)
Unsupervised learning
You only have input data (X) and no corresponding output variables.
The goal is to model the underlying structure or distribution in the data in order to learn more about the data.
There are no correct answers and there is no teacher.
All data is unlabeled and the algorithms learn the inherent structure from the input data.
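A minimal sketch of the two settings, assuming scikit-learn (the slides do not prescribe a library): the supervised learner is fit on labeled pairs (X, y), the unsupervised one on X alone.

```python
# Minimal sketch (assumes scikit-learn; not prescribed by the slides).
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # supervised: needs labels y
from sklearn.cluster import KMeans                # unsupervised: input X only

X = np.array([[1, 2], [1, 4], [8, 7], [9, 6]])    # input variables X
y = np.array([0, 0, 1, 1])                        # output variable Y (labels)

# Supervised learning: learn the mapping Y = f(X) from labeled data.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[8, 8]]))                      # predicted label for a new input

# Unsupervised learning: no labels; model the structure of X itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                                 # inherent grouping found in X
```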
Classic application
"Market basket" data
Purchase(salesID, item), e.g.:
(3, bread), (3, milk), (3, eggs), (3, beer)
(4, beer), (4, chips), ...
Want to find association rules: {L1, L2, ..., Ln} -> R
Interpretation: "If a customer bought all the items in the set {L1, L2, ..., Ln}, he is very likely to also have bought item R."
Ex: {bread, milk} -> eggs; {diapers} -> beer
Goal of data mining: quickly find association rules over extremely large data sets (e.g., all Wal-Mart sales for a year).
CS145 - http://infolab.stanford.edu/
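As a sketch of what the support and confidence of such a rule mean, the rule {bread, milk} -> eggs can be checked on a few toy Purchase(salesID, item) facts in plain Python; the baskets below are illustrative, extending the sales 3 and 4 shown above.

```python
from collections import defaultdict

# Toy Purchase(salesID, item) facts, grouped into one basket per sale.
purchases = [(3, "bread"), (3, "milk"), (3, "eggs"), (3, "beer"),
             (4, "beer"), (4, "chips"),
             (5, "bread"), (5, "milk"), (5, "eggs")]      # sale 5 is made up
baskets = defaultdict(set)
for sale_id, item in purchases:
    baskets[sale_id].add(item)

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    hits = sum(1 for basket in baskets.values() if itemset <= basket)
    return hits / len(baskets)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) for a rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)

# Rule {bread, milk} -> eggs from the slide:
lhs, rhs = {"bread", "milk"}, {"eggs"}
print("support:", support(lhs | rhs), "confidence:", confidence(lhs, rhs))
```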
Classic application
Classification trees (= decision trees)
Buyers(<attributes>, purchase)
Want to predict purchase from <attributes>
Clustering
Buyers(<attributes>)
Automatically group buyers into N similar types
Top-N items
Purchase(salesID, item)
What were the N most often purchased items?
(salesID irrelevant)
DM - Classification
Most frequently used DM method
Employ supervised learning
Learn from past data, classify new data
The output variable is categorical (nominal or ordinal) in
nature
Predicts categorical class labels (discrete or nominal)
Use labels of the training data to classify new data
There are many classification algorithms available:
Decision Trees, Bayesian Classifiers, Neural Networks, K-Nearest Neighbour, Support Vector Machines, Logistic Regression
DM - Classification
Example:
1. All data is labeled and the algorithms learn to predict the output from the input data.
2. Learning step (training phase): construction of the classification model. Given input variables (X) and an output variable (Y), an algorithm learns the mapping function from the input to the output: Y = f(X). Different algorithms are used to build a classifier by making the model learn from the available training set.
3. Classification step: the model is used to predict class labels, and the constructed model is tested on test data. Given an unlabeled observation X, predict(X) returns the predicted label y.
4. Evaluate the classifier model.
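A minimal sketch of these steps, assuming scikit-learn and its built-in iris data (neither is prescribed by the slides):

```python
# Sketch of the learning step, the classification step, and the evaluation step.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # labeled data (X, Y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Learning step: build the classification model from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# 3. Classification step: predict(X) returns the predicted labels y.
y_pred = model.predict(X_test)

# 4. Evaluate the classifier model on the held-out test data.
print("accuracy:", accuracy_score(y_test, y_pred))
```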
How Does Classification Work?
The weather problem: predict whether to play or not.
The weather data describes the conditions that are suitable for playing some unspecified game.
When a new case arrives, measure the variables (outlook, temperature, humidity, windy) and use them to predict whether to play or not.
How Does Classification Work?
The weather problem
1. Labeled data:
Attribute/feature:
Outlook { sunny, overcast, rainy}
Temperature { hot, mild, cool}
Humidity { high, normal}
Windy { true, false}
Attribute values: symbolic categories
The outcome is: play or not play
A predefined class label is assigned to every sample tuple or object
How Does Classification Work?
Preprocessed data is split into training rows and testing rows (e.g., 1/3 of the rows held out for testing).
The classifier model is built on the training rows, then applied to the testing data (scoring) to assess its prediction accuracy.
How Does Classification Work?
Test the classifier's accuracy on the testing data: hide the real outcome in the testing data, let the classifier predict it, and compare the predictions against the real outcomes.
How Does Classification Work?
After training, the classifier is applied to new (unlabeled) data, and its predictions are summarized in a confusion matrix:
True Positive Count (TP): positive cases predicted as positive
False Positive Count (FP): negative cases predicted as positive
False Negative Count (FN): positive cases predicted as negative
True Negative Count (TN): negative cases predicted as negative
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
True Negative Rate = TN / (TN + FP)
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
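A small sketch of these measures computed directly from the four counts (the numbers below are made up for illustration):

```python
# Illustrative confusion-matrix counts (made-up numbers, not from the slides).
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)              # of the predicted positives, how many are right
recall = TP / (TP + FN)                 # of the real positives, how many were found
true_negative_rate = TN / (TN + FP)     # of the real negatives, how many were found
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(precision, recall, true_negative_rate, accuracy)
```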
How Does Classification Work?
Training set:
A set of examples used for learning, that is to
fit the parameters of the classifier.
Validation set:
A set of examples used to tune the hyperparameters of a classifier.
Test set:
A set of examples used only to assess the
performance of a fully-specified classifier.
If you have a model with no hyperparameters, or ones that cannot be easily tuned, you probably do not need a validation set.
https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
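One common way to carve out the three sets is to apply scikit-learn's train_test_split twice; the 60/20/20 proportions below are an assumption for illustration, not something the slides specify.

```python
# Sketch: split data into training, validation, and test sets (assumed 60/20/20).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the test set (used only for the final assessment).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into training (fit parameters) and validation
# (tune hyperparameters): 0.25 of the remaining 80% = 20% of the whole.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```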
Classification vs Prediction
What is prediction?
Classification models predict categorical class labels.
Prediction models predict continuous-valued functions.
Ex:
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company: this is prediction.
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe: this is classification.
Classification - Decision Tree
C is the output variable (class label).
Task:
Given an input data vector x, predict C.
http://cs246.stanford.edu
Classification - Decision Tree
Idea:
1. Create a root node and assign all of the training
data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of
the split. Split the data into mutually exclusive
subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each and every leaf node until a stopping criterion is reached.
Impurity/Entropy (informal)
Measures the level of impurity in a group of examples
Entropy: a common way to measure impurity
The expected information needed to classify a tuple in D is given by:
Info(D) = - Σ(i=1..m) pi · log2(pi), where pi = |Ci,D| / |D|
https://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
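As a sketch, the formula translates directly into Python (the class counts |Ci,D| are passed in explicitly):

```python
import math

def info(class_counts):
    """Expected information Info(D) = -sum(pi * log2(pi)), with pi = |Ci,D| / |D|."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(info([9, 5]))   # e.g. 9 "yes" and 5 "no" tuples -> about 0.94
```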
Decision Tree - Classification
Example
Training set D; attribute X with values {x1, x2, ..., xv}
Outlook = {sunny, overcast, rainy}
Temperature = {hot, mild, cool}
Humidity = {high, normal}
Windy = {true, false}
Which attribute should be used for the split, and on which values?
InfoX(D) = Σ(j=1..v) (|Dj| / |D|) · Info(Dj)
Gain(X) = Info(D) - InfoX(D)
Gain(Outlook)? Gain(Temperature)? Gain(Humidity)? Gain(Windy)?
Example: training set
Information gain
14 tuples: 9 Yes (play tennis); 5 No
|D| = 14, m = 2
C1 = "Yes"; C2 = "No"
|C1,D| = 9; |C2,D| = 5
Info(D) = I(9,5) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.94
Information Gain
Training data (play tennis):
id | outlook | temperature | humidity | wind | play
1 | sunny | hot | high | weak | no
2 | sunny | hot | high | strong | no
8 | sunny | mild | high | weak | no
9 | sunny | cool | normal | weak | yes
11 | sunny | mild | normal | strong | yes
3 | overcast | hot | high | weak | yes
7 | overcast | cool | normal | strong | yes
12 | overcast | mild | high | strong | yes
13 | overcast | hot | normal | weak | yes
4 | rainy | mild | high | weak | yes
5 | rainy | cool | normal | weak | yes
6 | rainy | cool | normal | strong | no
10 | rainy | mild | normal | weak | yes
14 | rainy | mild | high | strong | no

Outlook | C1j: Yes | C2j: No | I(C1j, C2j)
Sunny | 2 | 3 | 0.971
Overcast | 4 | 0 | 0
Rainy | 3 | 2 | 0.971

I(2,3) = -(2/5)·log2(2/5) - (3/5)·log2(3/5) = 0.971
I(4,0) = -(4/4)·log2(4/4) - (0/4)·log2(0/4) = 0
I(3,2) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.971
InfoOutlook(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.693
Gain(Outlook) = Info(D) - InfoOutlook(D) = 0.94 - 0.693 = 0.247
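For checking the arithmetic, here is a small Python sketch (an illustration only, not a tool used in the course) that recomputes Info(D) and the gain of every attribute on the 14 tuples above:

```python
import math
from collections import Counter, defaultdict

# The 14 training tuples (outlook, temperature, humidity, wind, play) from the table above.
data = [
    ("sunny", "hot", "high", "weak", "no"),     ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rainy", "mild", "high", "weak", "yes"),
    ("rainy", "cool", "normal", "weak", "yes"), ("rainy", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rainy", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rainy", "mild", "high", "strong", "no"),
]
attribute_index = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def info(labels):
    """Info(D) = -sum(pi * log2(pi)) over the class proportions pi."""
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def gain(attr):
    """Gain(X) = Info(D) - InfoX(D), weighting each subset Dj by |Dj| / |D|."""
    idx = attribute_index[attr]
    subsets = defaultdict(list)
    for row in data:
        subsets[row[idx]].append(row[-1])   # class labels grouped by attribute value
    info_x = sum(len(lbls) / len(data) * info(lbls) for lbls in subsets.values())
    return info([row[-1] for row in data]) - info_x

for attr in attribute_index:                # Outlook wins with about 0.247
    print(attr, round(gain(attr), 3))
```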
Information Gain
Temperature | C1j: Yes | C2j: No | I(C1j, C2j)
Hot | 2 | 2 | 1
Mild | 4 | 2 | 0.9183
Cool | 3 | 1 | 0.8113

I(2,2) = 1
I(4,2) = -(4/6)·log2(4/6) - (2/6)·log2(2/6) = 0.9183
I(3,1) = -(3/4)·log2(3/4) - (1/4)·log2(1/4) = 0.8113
InfoTemperature(D) = (4/14)·I(2,2) + (6/14)·I(4,2) + (4/14)·I(3,1) = 0.911
Gain(Temperature) = Info(D) - InfoTemperature(D) = 0.94 - 0.911 = 0.029

Humidity | C1j: Yes | C2j: No | I(C1j, C2j)
High | 3 | 4 | 0.985
Normal | 6 | 1 | 0.592

InfoHumidity(D) = (7/14)·I(3,4) + (7/14)·I(6,1) = 0.78845
Gain(Humidity) = Info(D) - InfoHumidity(D) = 0.94 - 0.78845 = 0.152
Information Gain
Windy | C1j: Yes | C2j: No | I(C1j, C2j)
Weak | 6 | 2 | 0.811
Strong | 3 | 3 | 1

InfoWindy(D) = (8/14)·I(6,2) + (6/14)·I(3,3) = 0.892
Gain(Windy) = Info(D) - InfoWindy(D) = 0.94 - 0.892 = 0.048
Information Gain (summary)
Outlook | C1j: Yes | C2j: No | I(C1j, C2j)
Sunny | 2 | 3 | 0.971
Overcast | 4 | 0 | 0
Rainy | 3 | 2 | 0.971
Gain(Outlook) = 0.247

Temperature | C1j: Yes | C2j: No | I(C1j, C2j)
Hot | 2 | 2 | 1
Mild | 4 | 2 | 0.9183
Cool | 3 | 1 | 0.8113
Gain(Temperature) = 0.029

Humidity | C1j: Yes | C2j: No | I(C1j, C2j)
High | 3 | 4 | 0.985
Normal | 6 | 1 | 0.592
Gain(Humidity) = 0.152

Windy | C1j: Yes | C2j: No | I(C1j, C2j)
Weak | 6 | 2 | 0.811
Strong | 3 | 3 | 1
Gain(Windy) = 0.048

Choose the attribute with the largest information gain (Outlook, 0.247) as the decision node.
Information Gain
Split on Outlook:
Outlook = Overcast: all samples are Yes, so this branch becomes a leaf labeled "Yes".
Outlook = Sunny (branch still to be split, "???"):
id | temperature | humidity | wind | play
1 | hot | high | weak | no
2 | hot | high | strong | no
8 | mild | high | weak | no
9 | cool | normal | weak | yes
11 | mild | normal | strong | yes
Outlook = Rainy (branch still to be split, "???"):
id | temperature | humidity | wind | play
4 | mild | high | weak | yes
5 | cool | normal | weak | yes
6 | cool | normal | strong | no
10 | mild | normal | weak | yes
14 | mild | high | strong | no
Repeat the attribute selection on each of these subsets.
Information Gain
https://www.kaggle.com/rahulsah06/bike-buying-prediction-for-adventure-works-cycles?select=AW_BikeBuyer.csv
Classification - Decision Tree
Bike buyer prediction using the AdventureWorks database
Data:
The class is described as either '1: Yes' or '0: No', depending on whether the customer is a bike buyer or not.
The detailed description of the dataset is shown in the table (target attribute: bike buyer).
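A rough sketch of such a model on the AW_BikeBuyer.csv file from the Kaggle link above; the column names used here (BikeBuyer as the 0/1 target, CustomerID as the key) are assumptions about that file, so adjust them to the real schema.

```python
# Sketch only: assumes AW_BikeBuyer.csv has a 0/1 "BikeBuyer" target column and a
# "CustomerID" key column; these names are assumptions, not confirmed by the slides.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("AW_BikeBuyer.csv")
y = df["BikeBuyer"]                                               # class: 1 = Yes, 0 = No
X = pd.get_dummies(df.drop(columns=["BikeBuyer", "CustomerID"]))  # encode categorical inputs

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```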
Cluster Analysis for Data Mining
Clustering techniques apply when there is no class
to be predicted but the instances are to be divided
into natural groups
Clustering is dividing data points into homogeneous classes or clusters:
Points in the same group are as similar as possible
Points in different groups are as dissimilar as possible
Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employ unsupervised learning
Learns the clusters of things from past data, then assigns new
instances
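A short sketch of this unsupervised grouping with k-means; scikit-learn and the made-up buyer attributes below are illustrative assumptions, not part of the slides.

```python
# Sketch: automatically group buyers into N similar types with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Buyers(<attributes>): e.g. [age, yearly purchases]; made-up values for illustration.
buyers = np.array([[25, 3], [27, 4], [45, 20], [50, 22], [33, 10], [31, 9]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(buyers)  # no labels used
print(kmeans.labels_)            # cluster assignment for each buyer
print(kmeans.cluster_centers_)   # the "natural groups" found in the data

# A new instance is assigned to the nearest learned cluster.
print(kmeans.predict([[40, 18]]))
```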
Transaction data in item-list form and in one-hot (binary) form:

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

Tid | Beer | Nuts | Diaper | Coffee | Egg | Milk
10 | 1 | 1 | 1 | 0 | 0 | 0
20 | 1 | 0 | 1 | 1 | 0 | 0
30 | 1 | 0 | 1 | 0 | 1 | 0
40 | 0 | 1 | 0 | 0 | 1 | 1
50 | 0 | 1 | 1 | 1 | 1 | 1
http://hanj.cs.illinois.edu/cs412/bk3_slides/06FPBasic.pdf
Association rule with SSAS
https://docs.microsoft.com/en-us/analysis-services/data-mining/microsoft-association-algorithm?view=asallproducts-allversions
Association rule with SSAS
1. Data preparation:
The requirements for an association rules
model are as follows:
A single key column: one numeric or text column that uniquely identifies each record. Compound keys are not permitted.
A single predictable column: the values must be discrete or discretized.
Input columns: the input columns must be discrete. The input data for an association model is often contained in two tables: one table might contain customer information while another table contains customer purchases.
vAssocSeqLineItems: nested table
vAssocSeqOrders: case table
Association rule with SSAS
Support:
A MINIMUM_SUPPORT value of 0.03 means that at least 3% of the total cases in the data set must contain an item or itemset for it to be included in the model.
Confidence:
A MINIMUM_PROBABILITY value of 0.5 means that no rule with less than fifty percent probability can be generated.
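Outside SSAS, the same two thresholds have direct counterparts in other association-rule tools; the sketch below uses the mlxtend library (an illustrative assumption, not part of the SSAS workflow) on the small transaction table shown earlier, with min_support playing the role of MINIMUM_SUPPORT and the confidence threshold the role of MINIMUM_PROBABILITY.

```python
# Illustration of the same thresholds outside SSAS, assuming the mlxtend library.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["Beer", "Nuts", "Diaper"],
                ["Beer", "Coffee", "Diaper"],
                ["Beer", "Diaper", "Eggs"],
                ["Nuts", "Eggs", "Milk"],
                ["Nuts", "Coffee", "Diaper", "Eggs", "Milk"]]

# One-hot encode the transactions (same shape as the binary table above).
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# MINIMUM_SUPPORT ~ min_support: itemsets must appear in at least this fraction of cases.
frequent = apriori(onehot, min_support=0.03, use_colnames=True)

# MINIMUM_PROBABILITY ~ confidence threshold: keep only rules with confidence >= 0.5.
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```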
Mining model viewer