Week 01 Lecture Material PDF

This document provides an introduction to data mining and the data preprocessing step. It discusses the explosive growth of data and the need for knowledge discovery. Data mining involves extracting patterns from large amounts of data through techniques such as association rule mining, classification, clustering, and anomaly detection. The document outlines the typical steps in the knowledge discovery process, including data selection, cleaning, transformation, and mining. It also discusses major issues in data mining such as methodology, user interaction, and applications. Finally, it provides an overview of data preprocessing as the first step in the data mining process.


Data Mining

Week 1: Introduction, Association Rules

Pabitra Mitra
Computer Science and Engineering, IIT Kharagpur
Email: pabitra@gmail.com
Course Outline:
• Introduction: KDD Process
• Data Preprocessing
• Association Rule Mining
• Classification
• Clustering and Anomaly Detection
• Regression
• Case Studies
Data Mining

Introduction

Pabitra Mitra
Computer Science and Engineering, IIT Kharagpur
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of massive data
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Data Mining: Confluence of Multiple Disciplines

[Diagram: data mining draws on database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines.]
Why Not Traditional Data Analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle data sets as large as terabytes
• High dimensionality of data
– Microarray data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks, and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text, and Web data
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks, and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia databases
– Text databases
– The World-Wide Web
Data Mining Functionalities
• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Frequent patterns, association, correlation vs. causality
– Tea → Sugar [support 0.5%, confidence 75%] (correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
– Predict some unknown or missing numerical values
Data Mining Functionalities
• Cluster analysis
– Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
– Maximizing intra-class similarity & minimizing inter-class similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Noise or exception? Useful in fraud detection and rare-event analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory card
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed, and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Architecture: Typical Data Mining System
[Architecture diagram, top to bottom: Graphical User Interface → Pattern Evaluation → Data Mining Engine (backed by a Knowledge Base) → Database or Data Warehouse Server → data cleaning, integration, and selection → data sources: databases, data warehouses, the World-Wide Web, and other information repositories.]
KDD Process: Summary
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
End of Introduction

Data Mining

Data Preprocessing

Pabitra Mitra
Computer Science and Engineering, IIT Kharagpur
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example (attributes are columns, objects are rows):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, −
– Multiplication: *, /

– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23                | 5.27                 | 15.22    | 2.7  | 1.2
12.65                | 6.25                 | 16.22    | 2.2  | 1.1
Text Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.

            | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1  |  3   |   0   |  5   |  0   |   2   |  6   |  0  |  2   |    0    |   2
Document 2  |  0   |   7   |  0   |  2   |   1   |  0   |  0  |  3   |    0    |   0
Document 3  |  0   |   1   |  0   |  0   |   1   |  2   |  2  |  0   |    3    |   0
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID | Items
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk
Graph Data
• Examples: Facebook graph and HTML Links

Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and “snow” on a television screen

[Figure: two sine waves, and the same two sine waves with noise added.]
Outliers
• Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases

(e.g., annual income is not applicable to children)
• Handling missing values (see the sketch below)
– Eliminate data objects
– Estimate missing values
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)
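As a small illustration of the first two strategies, here is a pandas sketch on made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data set with missing values (NaN); column names are illustrative only.
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 31],
    "income": [50000, 62000, np.nan, 58000],
})

# Eliminate data objects: drop any row containing a missing value.
dropped = df.dropna()

# Estimate missing values: fill each gap with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(imputed)
```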
Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling

• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a
single attribute (or object)
• Purpose (a small sketch follows below)
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
– More “stable” data
• Aggregated data tends to have less variability
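A minimal pandas sketch of aggregation as a change of scale, assuming hypothetical city-level sales records rolled up to regions:

```python
import pandas as pd

# Hypothetical city-level records (names and values are made up).
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "city":   ["A", "B", "C", "D"],
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Aggregation: combine city objects into region objects (data reduction, change of scale).
regional = sales.groupby("region", as_index=False)["amount"].sum()
print(regional)
```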
Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Sample Size

[Figure: the same data set shown with 8000 points, 2000 points, and 500 points.]
Sampling …
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item

• Sampling without replacement
– As each item is selected, it is removed from the population
• Sampling with replacement
– Objects are not removed from the population as they are selected for the sample. In sampling with replacement, the same object can be picked more than once
• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition (see the sketch below)
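The sketch below illustrates the three schemes with pandas on a toy data set; the "class" column used for stratification is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(10),
    "class": ["A"] * 6 + ["B"] * 4,   # stratification variable (toy)
})

# Simple random sampling without replacement: each object appears at most once.
without_repl = df.sample(n=5, replace=False, random_state=0)

# Sampling with replacement: the same object can be picked more than once.
with_repl = df.sample(n=5, replace=True, random_state=0)

# Stratified sampling: draw 50% from each class partition.
stratified = df.groupby("class", group_keys=False).sample(frac=0.5, random_state=0)

print(without_repl, with_repl, stratified, sep="\n\n")
```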
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques (a small sketch follows below)
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
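As an illustration of the first technique, a minimal scikit-learn sketch of Principal Component Analysis on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 objects with 5 numeric attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Project the objects onto the first 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```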
Discretization

[Figure, four panels: the original data, and its discretization by equal interval width, equal frequency, and K-means.]
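A small pandas sketch of two of these schemes, equal interval width and equal frequency, on made-up one-dimensional data:

```python
import numpy as np
import pandas as pd

# Made-up skewed data to discretize into 4 bins.
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10, size=1000))

# Equal interval width: bins of equal length over the value range.
equal_width = pd.cut(x, bins=4)

# Equal frequency: bins holding roughly the same number of points.
equal_freq = pd.qcut(x, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```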
Attribute Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization (see the sketch below)
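A minimal NumPy sketch of the two transformations named above, on toy values:

```python
import numpy as np

# Toy attribute values on a skewed scale.
x = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

# Standardization: zero mean, unit standard deviation (z-scores).
z = (x - x.mean()) / x.std()

# Min-max normalization: rescale to the range [0, 1].
normed = (x - x.min()) / (x.max() - x.min())

print(z)
print(normed)
```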
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.

– Often falls in the range [0, 1]
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

[Table: similarity and dissimilarity definitions for each simple attribute type.]
Euclidean Distance
• Euclidean Distance

  dist(p, q) = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
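A direct NumPy translation of the formula, on toy vectors:

```python
import numpy as np

# Toy data objects p and q with n = 3 attributes.
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# dist(p, q) = sqrt(sum over k of (p_k - q_k)^2)
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)  # 5.0
```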
Mahalanobis Distance
  mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)^T

where Σ is the covariance matrix of the input data X:

  Σ_{j,k} = (1 / (n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)

[Figure: for the highlighted (red) points, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.]
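A NumPy sketch that follows the formulation above (without the square root that some libraries apply); the data set is made up:

```python
import numpy as np

# Made-up 2-D data set X; each row is one object.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])

p = np.array([1.0, 1.0])
q = np.array([-1.0, 0.0])

# Covariance matrix Sigma of the input data (rows are observations).
sigma = np.cov(X, rowvar=False)
sigma_inv = np.linalg.inv(sigma)

# mahalanobis(p, q) = (p - q) Sigma^{-1} (p - q)^T
d = (p - q) @ sigma_inv @ (p - q)
print(d)
```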
Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d.
• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2

  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

  cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.3150
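The same calculation in NumPy:

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

# cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)
cos = (d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))  # ~0.3150
```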
Similarity Between Binary Vectors
• A common situation is that objects, p and q, have only binary attributes
• Compute similarities using the following quantities
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
  SMC = number of matches / number of attributes
      = (M11 + M00) / (M01 + M10 + M11 + M00)
  J   = number of 11 matches / number of not-both-zero attribute values
      = M11 / (M01 + M10 + M11)
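A short NumPy sketch of both coefficients on toy binary vectors:

```python
import numpy as np

# Toy binary vectors p and q.
p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = int(np.sum((p == 1) & (q == 1)))
m00 = int(np.sum((p == 0) & (q == 0)))
m10 = int(np.sum((p == 1) & (q == 0)))
m01 = int(np.sum((p == 0) & (q == 1)))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # simple matching coefficient
jaccard = m11 / (m01 + m10 + m11)             # Jaccard coefficient
print(smc, jaccard)  # 0.7 0.0
```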
Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, p and q, and then take their dot product

  p′_k = (p_k − mean(p)) / std(p)
  q′_k = (q_k − mean(q)) / std(q)

  correlation(p, q) = p′ · q′
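A NumPy sketch of this recipe; note that dividing the dot product by n (with the population standard deviation) makes the result agree with np.corrcoef:

```python
import numpy as np

# Toy data objects.
p = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Standardize, then take the dot product (divided by n).
n = len(p)
p_std = (p - p.mean()) / p.std()
q_std = (q - q.mean()) / q.std()
corr = (p_std @ q_std) / n

print(corr)
print(np.corrcoef(p, q)[0, 1])  # same value
```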
Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from −1 to 1.]
End of Data Preprocessing
Data Mining

Association Rules

Pabitra Mitra
Computer Science and Engineering
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s)
• Fraction of transactions that contain both X and Y
– Confidence (c)
• Measures how often items in Y appear in transactions that contain X

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
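A small Python sketch that recomputes these two metrics from the five transactions above:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y) / len(transactions)    # 2/5 = 0.4
c = support_count(X | Y) / support_count(X)     # 2/3 ≈ 0.67
print(s, round(c, 2))
```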
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is
to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
[Figure: itemset lattice over items A–E, from the null set through 1-itemsets (A, B, C, D, E), 2-itemsets (AB, …, DE), 3-itemsets, and 4-itemsets, down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
[Figure: N transactions (the market-basket table above) matched against a list of M candidate itemsets; w is the maximum transaction width.]
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
[Figure: the itemset lattice over A–E; once an itemset is found to be infrequent, all of its supersets are pruned.]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets), no need to generate candidates involving Coke or Eggs:
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Milk, Diaper} | 3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidates; with support-based pruning, 6 + 6 + 1 = 13.
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified (see the sketch below)
• Generate length (k+1) candidate itemsets from length-k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are frequent
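A compact, unoptimized Python sketch of this loop (no hash tree, so it only suits small data sets); it reuses the five market-basket transactions from earlier with a minimum support count of 3:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3  # minimum support count

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# k = 1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

k = 1
while frequent[-1]:
    # Generate length (k+1) candidates by joining frequent k-itemsets.
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    # Prune candidates containing an infrequent k-subset, then count support.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
    frequent.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level, itemsets in enumerate(frequent, start=1):
    for s in sorted(itemsets, key=sorted):
        print(level, set(s), support(s))
```

On these transactions it recovers the same frequent itemsets as the worked example above: {Bread}, {Milk}, {Diaper}, {Beer}, four frequent pairs, and {Bread, Milk, Diaper}.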
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store the support count of each item
– if the number of frequent items also increases, both computation and I/O costs may also increase
• Size of database
– Apriori makes multiple passes; run time of the algorithm increases with the number of transactions
• Average transaction width
– This may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)
Rule Generation
• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property
  c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the same itemset has an anti-monotone property
– e.g., L = {A, B, C, D}:
  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation for Apriori Algorithm
[Figure: lattice of rules for itemset ABCD, from ABCD⇒{ } down through BCD⇒A, ACD⇒B, ABD⇒C, ABC⇒D, then CD⇒AB, BD⇒AC, BC⇒AD, AD⇒BC, AC⇒BD, AB⇒CD, down to D⇒ABC, C⇒ABD, B⇒ACD, A⇒BCD. Once a rule is found to have low confidence, the rules below it (with more items moved to the consequent) are pruned.]
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
• join(CD⇒AB, BD⇒AC) would produce the candidate rule D⇒ABC
• Prune rule D⇒ABC if its subset AD⇒BC does not have high confidence
Pattern Evaluation
• Association rule algorithms tend to produce too many rules
– many of them are uninteresting or redundant
– Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
• Interestingness measures can be used to prune/rank the derived patterns
• In the original formulation of association rules, support & confidence are the only measures used
Application of Interestingness Measure
[Figure: where interestingness measures fit when ranking and filtering the patterns produced by mining.]
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

     |  Y  | ¬Y  |
X    | f11 | f10 | f1+
¬X   | f01 | f00 | f0+
     | f+1 | f+0 | |T|

Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Statistical Independence
• Population of 1000 students
– 600 students know how to swim (S)
– 700 students know how to bike (B)

– 420 students know how to swim and bike (S, B)
– P(S∧B) = 420/1000 = 0.42
– P(S) × P(B) = 0.6 × 0.7 = 0.42
– P(S∧B) = P(S) × P(B) ⇒ statistical independence
– P(S∧B) > P(S) × P(B) ⇒ positively correlated
– P(S∧B) < P(S) × P(B) ⇒ negatively correlated
Statistical-based Measures
• Measures that take statistical dependence into account:

  Lift = P(Y|X) / P(Y)

  Interest = P(X, Y) / (P(X) P(Y))

  PS = P(X, Y) − P(X) P(Y)

  φ-coefficient = (P(X, Y) − P(X) P(Y)) / sqrt( P(X)[1 − P(X)] P(Y)[1 − P(Y)] )
Example: Lift/Interest
      | Coffee | ¬Coffee |
Tea   |   15   |    5    |  20
¬Tea  |   75   |    5    |  80
      |   90   |   10    | 100

Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75
but P(Coffee) = 0.9
⇒ Lift = 0.75/0.9 = 0.8333 (< 1, therefore negatively associated)
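The same arithmetic in a few lines of Python:

```python
# Counts from the contingency table above.
n_tea_and_coffee, n_tea, n_coffee, n_total = 15, 20, 90, 100

confidence = n_tea_and_coffee / n_tea   # P(Coffee | Tea) = 0.75
p_coffee = n_coffee / n_total           # P(Coffee) = 0.9
lift = confidence / p_coffee            # ~0.8333 < 1: negatively associated
print(round(lift, 4))
```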
There are lots of measures proposed in the literature

Some measures are good for certain applications, but not for others

What criteria should we use to determine whether a measure is good or bad?

What about Apriori-style support-based pruning?
Subjective Interestingness Measure
• Objective measure:
– Rank patterns based on statistics computed from data
– e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.)
• Subjective measure:
– Rank patterns according to the user’s interpretation
• A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin)
• A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin)
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)

  +  Pattern expected to be frequent
  −  Pattern expected to be infrequent
  Patterns are then found to be frequent or infrequent in the data
  Expected patterns: expectation and finding agree
  Unexpected patterns: expectation and finding disagree

• Need to combine expectation of users with evidence from data (i.e., extracted patterns)
End of Association Rules
