Candidate Generation and Pruning
Candidate generation and pruning are techniques used in data science,
particularly in association rule mining and frequent itemset mining.
• Candidate Generation: In this step, potential itemsets that may be
frequent are generated. For example, in market basket analysis (a
common application), if we have a transaction dataset where each
transaction contains items purchased by a customer, candidate
itemsets are combinations of items that might occur together
frequently. Let's say we have transactions:
• Transaction 1: {apple, banana, orange}
• Transaction 2: {apple, banana, mango}
• Transaction 3: {banana, orange, mango}
• Candidate 2-itemsets might include {apple, banana}, {apple, orange},
{banana, orange}, and {banana, mango}.
• Pruning: Pruning involves eliminating candidate itemsets that cannot be
frequent based on a minimum support threshold. Support is the
frequency of occurrence of an itemset in the dataset. If an itemset's
support is below a certain threshold, it is pruned, as it cannot be
frequent.
• For example, if the minimum support threshold is set to 2 (meaning an
itemset must appear in at least 2 transactions to be considered
frequent), then {apple, banana}, {banana, orange}, and {banana, mango}
are kept, since each appears in two transactions, while {apple, orange}
is pruned since it appears only once (in Transaction 1).
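The two steps above can be sketched in Python; this is a minimal illustration of generating candidate 2-itemsets and pruning them by support (variable names are my own):

```python
from itertools import combinations

# Toy dataset from the example above
transactions = [
    {"apple", "banana", "orange"},
    {"apple", "banana", "mango"},
    {"banana", "orange", "mango"},
]
min_support = 2  # itemset must appear in at least 2 transactions

# Candidate generation: every 2-item combination of the items seen
items = sorted(set().union(*transactions))
candidates = [frozenset(pair) for pair in combinations(items, 2)]

# Pruning: keep only candidates whose support meets the threshold
support = {c: sum(c <= t for t in transactions) for c in candidates}
frequent = [c for c in candidates if support[c] >= min_support]

print(sorted(tuple(sorted(c)) for c in frequent))
# → [('apple', 'banana'), ('banana', 'mango'), ('banana', 'orange')]
```

Note that exhaustive candidate generation considers all six possible pairs here; only the three meeting the support threshold survive pruning.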
Rule Generation in the Apriori Algorithm
In the Apriori algorithm, rule generation is the process of deriving
association rules from frequent itemsets discovered during the
candidate generation and pruning phases.
• Here's how rule generation works with an example:
• Let's consider a dataset of transactions in a grocery store:
• Transaction 1: {bread, milk}
• Transaction 2: {bread, butter, cheese}
• Transaction 3: {bread, milk, butter}
• Transaction 4: {bread, butter}
• Transaction 5: {milk, cheese}
• Finding frequent itemsets: First, we apply the Apriori algorithm to find frequent
itemsets that meet a minimum support threshold. Let's assume the minimum support
threshold is set to 2 transactions.
• Frequent 1-itemsets: {bread}, {milk}, {butter}, {cheese}
• Frequent 2-itemsets: {bread, milk}, {bread, butter} (note that {milk, butter}
appears only in Transaction 3, so it does not meet the threshold and is pruned)
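The support counting for this dataset can be verified with a short sketch, using the same count-and-prune pattern as before (helper names are my own):

```python
from itertools import combinations

# Grocery transactions from the example; minimum support of 2 transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "cheese"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "cheese"},
]
min_support = 2

def support(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
freq1 = [frozenset([i]) for i in items
         if support(frozenset([i])) >= min_support]

# Apriori-style candidate generation: join frequent 1-itemsets into pairs,
# then prune by support
cand2 = [a | b for a, b in combinations(freq1, 2)]
freq2 = [c for c in cand2 if support(c) >= min_support]
# freq2 contains only {bread, butter} and {bread, milk}
```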
• Rule generation: Once we have the frequent itemsets, we generate association rules
from them. An association rule has the form "If {X} then {Y}", where X and Y are sets of
items.
• For each frequent itemset, we generate association rules by considering all possible
subsets of the itemset as the antecedent (X) and the remaining items as the consequent
(Y).
• For example:
• From {bread, milk}, we can generate two rules: {bread} -> {milk} and {milk} -> {bread}.
• From {bread, butter}, we can generate two rules: {bread} -> {butter} and {butter} -> {bread}.
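The antecedent/consequent enumeration described above can be sketched as follows (the function name is my own):

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """Generate every rule X -> Y where X is a nonempty proper subset
    of the itemset and Y is the remaining items."""
    items = frozenset(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(sorted(items), size):
            X = frozenset(antecedent)
            rules.append((X, items - X))
    return rules

# A 2-itemset yields exactly two rules:
# {bread} -> {milk} and {milk} -> {bread}
print(rules_from_itemset({"bread", "milk"}))
```

In general a frequent k-itemset yields 2^k − 2 rules (every nonempty proper subset as antecedent), so a 3-itemset would yield six.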
• Pruning weak rules: After generating all possible rules, we can
prune uninteresting rules based on metrics such as confidence or lift.
Confidence measures the proportion of transactions that contain {Y}
among the transactions that contain {X}. Lift measures how much
more often {X} and {Y} occur together than we would expect if they
were statistically independent.
• For example, the rule {bread} -> {milk} has confidence 2/4 = 0.5: only
half of the transactions containing bread also contain milk. If our
minimum confidence threshold exceeds 0.5, this rule is pruned.
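Confidence and lift can be computed directly from support fractions; a minimal sketch over the grocery transactions above (function names are my own):

```python
# Grocery transactions from the example
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "cheese"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "cheese"},
]

def support_frac(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """P(Y | X): support of X union Y divided by support of X."""
    return support_frac(X | Y) / support_frac(X)

def lift(X, Y):
    """Ratio of observed co-occurrence to that expected under independence."""
    return confidence(X, Y) / support_frac(Y)

# confidence({bread} -> {milk}) = 0.5; lift ≈ 0.83 (< 1, so bread and
# milk co-occur slightly less often than independence would predict)
```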
Brute Force Method
• In data science, the brute force method refers to a straightforward and exhaustive
approach to solving a problem by considering all possible solutions without employing
any optimization or heuristics.
• Here's how the brute force method typically works:
• Enumerate all possibilities: The algorithm considers all possible combinations or
permutations of the problem space without any specific strategy to reduce the search
space.
• Evaluate each possibility: For each combination or permutation generated, the
algorithm evaluates its validity or optimality according to the problem's criteria or
constraints.
• Select the best solution: After evaluating all possibilities, the algorithm selects the
solution that meets the desired criteria or optimizes the objective function, if
applicable.
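Applied to frequent itemset mining, the three steps above amount to enumerating every nonempty subset of the item universe and checking each one's support; a sketch reusing the grocery data from the earlier example:

```python
from itertools import combinations

# Grocery transactions from the earlier example
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "cheese"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "cheese"},
]
min_support = 2

# Enumerate all possibilities: every nonempty subset of the items
items = sorted(set().union(*transactions))
all_itemsets = [frozenset(c)
                for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

# Evaluate each possibility against the support criterion
frequent = [s for s in all_itemsets
            if sum(s <= t for t in transactions) >= min_support]
```

With n distinct items this checks 2^n − 1 subsets (15 here), which is why brute force quickly becomes infeasible and why Apriori's candidate generation and pruning matter.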
The brute force method is often used when: