FP Growth
Introduction
Apriori: uses a generate-and-test approach. It generates
candidate itemsets and tests whether they are frequent.
– Generating candidate itemsets is expensive (in both
space and time)
– Support counting is expensive
• Subset checking (computationally expensive)
• Multiple database scans (I/O)
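The two costs listed above can be seen in a minimal Python sketch of Apriori's level-wise loop. The function names (`count_support`, `generate_candidates`, `apriori`) and the absolute `min_support` threshold are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def count_support(transactions, candidates):
    """One full scan of the data: subset-check every candidate against every transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:  # subset check, the expensive inner step
                counts[c] += 1
    return counts

def generate_candidates(frequent, k):
    """Join frequent k-itemsets into (k+1)-itemsets and prune any candidate
    that has an infrequent k-subset."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent for s in combinations(union, k)
            ):
                candidates.add(union)
    return candidates

def apriori(transactions, min_support):
    """Return (frequent itemsets with their counts, number of database scans)."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in transactions for i in t}
    frequent, k, scans = {}, 1, 0
    while level:
        counts = count_support(transactions, level)  # one scan per level
        scans += 1
        kept = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in kept})
        level = generate_candidates(kept, k)
        k += 1
    return frequent, scans
```

Each pass through the `while` loop rescans the whole data set, so the number of scans grows with the length of the longest candidate itemset.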
FP-Growth: allows frequent itemset discovery without
candidate itemset generation. Two-step approach:
– Step 1: Build a compact data structure called the FP-tree
• Built using 2 passes over the data set.
– Step 2: Extract frequent itemsets directly from the FP-tree
Step 1: FP-Tree Construction
The FP-tree is constructed using 2 passes over the
data set:
Pass 1:
– Scan the data and find the support of each item.
– Discard infrequent items.
– Sort the frequent items in decreasing order of
support.
Use this order when building the FP-tree, so that
common prefixes can be shared.
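Pass 1 can be sketched in a few lines of Python. The names `first_pass` and `reorder`, the absolute `min_support` count, and the alphabetical tie-break are assumptions made for illustration; the slides only require some fixed order by decreasing support.

```python
from collections import Counter

def first_pass(transactions, min_support):
    """Count item supports, drop infrequent items, and fix a global item order."""
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: n for i, n in support.items() if n >= min_support}
    # Decreasing support; ties broken alphabetically so the order is deterministic.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    return frequent, order

def reorder(transaction, order):
    """Keep only frequent items and sort them by the global order."""
    rank = {item: r for r, item in enumerate(order)}
    return sorted((i for i in set(transaction) if i in rank), key=rank.__getitem__)
```

Applying `reorder` to every transaction yields the item sequences that are inserted into the tree in Pass 2.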
Pass 2:
Nodes correspond to items, and each node has a counter.
1. FP-Growth reads 1 transaction at a time and maps it to a
path.
2. A fixed item order is used, so paths can overlap when
transactions share a prefix.
– In this case, the counters of the shared nodes are incremented.
3. Pointers are maintained between nodes containing the
same item, creating singly linked lists (dotted lines).
– The more the paths overlap, the higher the compression, and the
more likely the FP-tree is to fit in memory.
4. Frequent itemsets are then extracted from the FP-tree (Step 2).
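The insertion procedure of Pass 2 can be sketched as follows. This is a minimal illustration, assuming the transactions have already been filtered and sorted by the Pass 1 order; the class `FPNode`, the `header` and `tail` tables, and the helper `item_support` are hypothetical names, not part of the slides.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> child FPNode
        self.next = None    # next node holding the same item (node link)

def build_fp_tree(ordered_transactions):
    """Insert each transaction as a path; shared prefixes reuse existing nodes."""
    root = FPNode(None)
    header = {}  # item -> first node of that item's linked list
    tail = {}    # item -> last node of that item's linked list
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if item in tail:
                    tail[item].next = child  # extend the singly linked list
                else:
                    header[item] = child
                tail[item] = child
            child.count += 1  # overlapping path: increment the counter
            node = child
    return root, header

def item_support(header, item):
    """Sum the counters along an item's linked list."""
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.next
    return total
```

Following an item's linked list visits every path that contains the item, which is what Step 2 relies on when it extracts frequent itemsets.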
FP-Tree size
The FP-tree is usually smaller than the
uncompressed data, because many transactions typically share
items (and hence prefixes).
– Best-case scenario: all transactions contain the same set of
items.
• 1 path in the FP-tree
– Worst-case scenario: every transaction has a unique set of items
(no items in common)
• The FP-tree is at least as large as the original data.
• Storage requirements for the FP-tree are then higher, because the
pointers between the nodes and the counters must also be stored.
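The best and worst cases can be checked by counting tree nodes for two toy inputs. The sketch below is an assumption-laden simplification: it uses a nested dictionary as a prefix tree and ignores counters and node links, so it measures only the number of nodes. The name `count_nodes` and the sample transactions are illustrative.

```python
def count_nodes(ordered_transactions):
    """Number of prefix-tree nodes created when inserting the given item sequences."""
    root = {}
    nodes = 0
    for transaction in ordered_transactions:
        level = root
        for item in transaction:
            if item not in level:
                level[item] = {}
                nodes += 1
            level = level[item]
    return nodes

# Best case: identical transactions collapse into a single path.
best = [["a", "b", "c"]] * 4
# Worst case: no shared items, so every item occurrence becomes its own node.
worst = [["a", "b"], ["c", "d"], ["e", "f"]]

best_nodes = count_nodes(best)
worst_nodes = count_nodes(worst)
```

In the best case the 12 item occurrences collapse into 3 nodes; in the worst case the 6 item occurrences need 6 nodes, before any pointer or counter overhead is added.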
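Tid-lists (the tid-list of AB is the intersection of the tid-lists of A and B):
Itemset   Tid-list
A         1, 4, 5, 6, 7, 8, 9
B         1, 2, 5, 7, 8, 10
AB        1, 5, 7, 8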
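Support counting with tid-lists reduces to a set intersection, as a short Python sketch shows. The tid values are those read from the A, B and AB columns above; the helper `vertical_layout` and the variable names are illustrative assumptions.

```python
def vertical_layout(transactions):
    """Convert {tid: items} into the vertical format {item: set of tids}."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

# Tid-lists for A and B as read from the columns above.
tid_A = {1, 4, 5, 6, 7, 8, 9}
tid_B = {1, 2, 5, 7, 8, 10}

# The tid-list of AB is the intersection; its size is the support of AB.
tid_AB = tid_A & tid_B
support_AB = len(tid_AB)
```

No rescan of the original transactions is needed: once the vertical layout exists, the support of a larger itemset comes from intersecting the tid-lists of its parts.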
Advantages:
• Very fast support counting: the support of an itemset is the size of its tid-list.
• Eclat is typically faster than Apriori.
• Eclat does not repeatedly scan the data to compute
individual support values.
• Eclat is better suited to small and medium datasets,
whereas Apriori is used for large datasets.
Disadvantage:
• Intermediate tid-lists may become too large for memory.