2nd Place Solution: Instacart Market Basket Analysis

At a glance
The key takeaways are that the author proposes two models - one for predicting reorder and one for predicting none. Various features are engineered including user, item, user-item interaction, and datetime features. F1 maximization is used to convert probabilities to binary predictions in order to optimize the F1 score as the evaluation metric.

The two models proposed in the main approach are the reorder model which predicts reorder with user_id and product_id as keys, and the none model which predicts none with only the user_id as key. The none model cannot use features like user-item interaction.

The four types of features engineered are 1) user features capturing user preferences, 2) item features capturing item popularity, 3) user-item interaction features capturing how a user feels about an item, and 4) datetime features capturing daily/hourly patterns. An example of each would be number of purchases for user, number of total purchases for item, number of times a user purchased a particular item, and purchase count for a particular day of week.

2nd Place Solution

Instacart Market Basket Analysis


Agenda
• My Background
• Problem Overview
• Main Approach
• Feature Engineering
• Feature Importance
• Important Findings
• F1 maximization
My Background

• Bachelor of Economics

• Programmer of Financial Industry

• Consultant of Financial Industry

• 2nd Place at KDDCUP2015

• Data Scientist at Yahoo! JAPAN


Problem Overview
• In this competition, we have to predict which products a user will reorder.
• So it is a little different from general recommendation: the candidates are items the user has already bought.
Problem Overview
• How hot(user)?

*prior is regarded as train


Problem Overview
• How hot(item)?

*Clipped by 500
Problem Overview
• Evaluation metric is mean F1 score

• Precision and Recall
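As a rough illustration of the metric, here is a minimal sketch that computes F1 for each order and averages over orders. The function names `order_f1` and `mean_f1` and the toy data are made up for this example; they are not taken from the competition code.

```python
def order_f1(predicted, actual):
    """F1 for a single order; both arguments are sets of product ids (or {"None"})."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)          # correctly predicted products
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, truths):
    """Average the per-order F1 over all orders (dicts keyed by order_id)."""
    scores = [order_f1(predictions[o], truths[o]) for o in truths]
    return sum(scores) / len(scores)


# Toy data: order 1 contains A and B, order 2 has no reorder at all.
truths = {1: {"A", "B"}, 2: {"None"}}
predictions = {1: {"A"}, 2: {"None"}}
score = mean_f1(predictions, truths)   # order 1 scores 2/3, order 2 scores 1
```

Because the score is averaged per order, a missed item hurts more in a small basket than in a large one.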


Problem Overview
• Links between the files
Main Approach

• I built 2 models: one for predicting reorder and one for predicting None*


• The reorder model’s keys are user_id and product_id
• The None model’s key is only user_id
• I thought more training data would give better predictions
• So I decided to use prior as train
• After tuning, the best number of windows was 3
• See next page for details
*None means the order contains no reordered item
Main Approach
• We are given orders.csv
Main Approach

• We are given order_products.csv


Main Approach
• Reorder Prediction
  Table columns: user_id | product_id | label
Main Approach
• None Prediction
  Table columns: user_id | label
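The two label tables can be sketched roughly as follows. This is only one plausible construction: the helper `build_labels`, its `target_number` parameter, and the toy rows are assumptions for illustration, and the way the author actually defined the training windows is not shown in the text. Column names follow the competition files.

```python
import pandas as pd

# Toy stand-ins for orders.csv and order_products.csv.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id": [1, 1, 1, 2, 2, 2],
    "order_number": [1, 2, 3, 1, 2, 3],
})
order_products = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4, 5, 6],
    "product_id": [10, 11, 10, 10, 20, 21, 22],
})


def build_labels(orders, order_products, target_number):
    """Use each user's order `target_number` as the target and earlier orders as history."""
    df = order_products.merge(orders, on="order_id")
    history = df[df["order_number"] < target_number]
    target = df[df["order_number"] == target_number]

    # Reorder table: one row per (user_id, product_id) seen in the history.
    reorder = history[["user_id", "product_id"]].drop_duplicates().reset_index(drop=True)
    bought = target[["user_id", "product_id"]].assign(label=1)
    reorder = reorder.merge(bought, on=["user_id", "product_id"], how="left")
    reorder["label"] = reorder["label"].fillna(0).astype(int)

    # None table: one row per user_id; label is 1 when nothing was reordered.
    none = (reorder.groupby("user_id")["label"].max() == 0).astype(int).reset_index()
    return reorder, none


reorder_labels, none_labels = build_labels(orders, order_products, target_number=3)
```

Calling `build_labels` with several different `target_number` values and stacking the results is one way to get more training rows out of the prior orders.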
Feature Engineering
• I made 4 types of features

1. User
• What this user is like
2. Item
• What this item is like
3. User x Item
• How the user feels about the item
4. Datetime
• What this day and hour are like

*For the None model, I can’t use the above features except user and datetime, so I convert the others to
stats (min, mean, max, sum, std…).
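To make the four feature types concrete, here is a small pandas sketch with one simple count per type, followed by the per-user statistics used for the None model. The toy purchase log and the feature names (`user_purchases` and so on) are invented for this example; the real feature set is much larger.

```python
import pandas as pd

# Toy purchase log: one row per product in an order.
log = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2],
    "product_id": [10, 11, 10, 10, 10, 12],
    "order_dow":  [0, 0, 3, 5, 0, 0],
})

# 1. User feature: number of purchases made by the user.
user_feat = log.groupby("user_id").size().rename("user_purchases")
# 2. Item feature: number of times the item was purchased overall.
item_feat = log.groupby("product_id").size().rename("item_purchases")
# 3. User x Item feature: number of times this user purchased this item.
useritem_feat = log.groupby(["user_id", "product_id"]).size().rename("useritem_purchases")
# 4. Datetime feature: purchase count for each day of week.
dow_feat = log.groupby("order_dow").size().rename("dow_purchases")

# The None model is keyed by user_id only, so user x item values are
# summarised per user with simple statistics.
none_feat = useritem_feat.groupby(level="user_id").agg(["min", "mean", "max", "sum", "std"])
```

The last line turns a table with one row per user and item into a table with one row per user, which is the shape the None model needs.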
Feature Importance for reorder
Feature Importance for None
Important Findings for reorder - 1
• Let’s think about the reordering problem. Common sense
tells us that an item purchased many times in the past has a
high probability of being reordered. However, there may be a
pattern for when the item is not reordered. We can try to
figure out this pattern and understand when a user doesn’t
repurchase an item.

• See next page for details


Important Findings for reorder - 1
• user_id: 54035
Important Findings for reorder - 1

• This user always reorders Cola.

• But at order number 8, the user didn’t. Why not?

• Probably because the user bought Fridge Pack Cola instead.

• I created features to catch this type of behavior.


Important Findings for reorder - 2
• days_last_order-max is the difference between days_since_last_order_this_item and
useritem_order_days_max

• days_since_last_order_this_item is a user x item feature: how many days have passed
since the user last ordered the item

• useritem_order_days_max is also a user x item feature: the max span (in days) between
the user’s orders of the item

• For more detail, see the next page


Important Findings for reorder - 2
• See index 0: the user bought this item 14 days ago, and the max span is 30 days

• So I think this feature says whether or not the user has become bored with that item
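A minimal sketch of this feature, reproducing the index 0 values (14 days since the last order, max span 30 days). The function name, the day-offset input, and the sign convention (days since last order minus max span, read off the feature name) are assumptions for illustration.

```python
def days_last_order_minus_max(purchase_days, today):
    """purchase_days: sorted day offsets on which the user ordered the item."""
    gaps = [b - a for a, b in zip(purchase_days, purchase_days[1:])]
    if not gaps:
        return None                       # bought only once: no span yet
    days_since_last = today - purchase_days[-1]
    max_span = max(gaps)                  # longest gap between two purchases
    return days_since_last - max_span


# Bought on days 0, 30 and 46; today is day 60.
# days_since_last = 14, max_span = 30, so the feature is 14 - 30 = -16.
value = days_last_order_minus_max([0, 30, 46], today=60)
```

Under this reading, a negative value means the user is still within their usual waiting time, while a positive value means they have waited longer than ever before.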
Important Findings for reorder - 3
• We already know fruits are reordered more frequently than vegetables (3
Million Instacart Orders, Open Sourced)

• I wanted to know how often each item is reordered

• So I made an item_10to1_ratio feature,
defined as the reorder ratio after
an item is ordered vs. not ordered.

• Next page, for more details


Important Findings for reorder - 3
• Let’s say userA bought itemA at order_number 1 and 4
• And userB bought itemA at order_number 1 and 3
• item_10to1_ratio is 0.5
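One reading of the name that is consistent with this example: look for the pattern "ordered, then not ordered" (1, 0) in each user's history of the item and count how often the next order contains the item again. The sketch below implements that reading; it is an interpretation, not the author's code, and it assumes userA has 4 orders and userB has 3.

```python
def item_10to1_ratio(histories):
    """histories: one 0/1 list per user, where 1 means the item was in that order."""
    hits = total = 0
    for seq in histories:
        for prev2, prev1, current in zip(seq, seq[1:], seq[2:]):
            if (prev2, prev1) == (1, 0):   # ordered, then not ordered
                total += 1
                hits += current            # was the item ordered right after the gap?
    return hits / total if total else float("nan")


user_a = [1, 0, 0, 1]   # bought itemA at order_number 1 and 4
user_b = [1, 0, 1]      # bought itemA at order_number 1 and 3
ratio = item_10to1_ratio([user_a, user_b])
```

userA shows the pattern once and does not come back in the next order, userB shows it once and does, so the ratio is 1 / 2 = 0.5.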
Important Findings for None - 1
• Useritem_sum_pos_cart(User A, Item B) is the average position in User A’s cart
that Item B falls into

• Useritem_sum_pos_cart-mean(User A) is the mean of the above feature across all
items

• So this feature essentially captures the average position of an item in a user’s
cart, and we can see that users who don’t buy many items all at once are
more likely to be None


Important Findings for None - 2
• total_buy is the total number of times a user has ordered an item

• If userA bought itemA 3 times in the past, this would be 3

• So total_buy-max is the max of the above feature per user

• We can see that it predicts whether or not a user will make a reorder
Important Findings for None - 3

• t-1_is_None(User A) is a binary feature that says whether or not the
user’s previous order was None.

• If the previous order is None, then the next order will also be
None with 30% probability.
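A rough pandas sketch of how such a flag could be derived, assuming the `reordered` column of order_products.csv marks reordered products. The helper column `is_None` and the toy rows are made up for this example.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id": [1, 1, 1, 1],
    "order_number": [1, 2, 3, 4],
})
order_products = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "product_id": [10, 10, 11, 12, 11],
    "reordered": [0, 1, 0, 0, 1],
})

# An order counts as None when it contains no reordered product.
# (A user's first order is trivially None under this definition.)
has_reorder = order_products.groupby("order_id")["reordered"].max()
orders["is_None"] = (orders["order_id"].map(has_reorder).fillna(0) == 0).astype(int)

# t-1_is_None: the is_None flag of the same user's previous order.
orders = orders.sort_values(["user_id", "order_number"])
orders["t-1_is_None"] = orders.groupby("user_id")["is_None"].shift(1)
```

The first order of each user has no previous order, so its `t-1_is_None` value is missing and has to be handled by the model or filled in.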


F1 maximization
• In this competition, the evaluation metric was an F1 score, which is a way of
capturing both precision and recall in a single metric.

• Thus, we needed to convert reorder probabilities into binary 1/0 (Yes/No)
numbers.

• However, in order to perform this conversion, we need to know a threshold. At
first, I used grid search to find a universal threshold of 0.2. But I saw
comments on the Kaggle discussion boards that said different orders should
have different thresholds.

• To understand why, let’s look at an example.


F1 maximization
F1 maximization
• In the first example, the threshold is between 0.9 and 0.3
• In the second example, the threshold is lower than 0.2
• As I showed, each order should have its own threshold
• But with the above calculation, we have to enumerate all patterns of
probability first
• Thus I needed to come up with another calculation
• See the next page
F1 maximization
• Let’s say our model predicts Item A will be reordered with probability 0.9, and Item B with probability 0.3. I then
simulate 9,999 target labels (whether A and B will be ordered or not) using these probabilities.

• For example, the simulated labels might look like this.

• I then calculate the expected F1 score over the simulated labels for each candidate prediction,
starting from the highest-probability item and then adding items
(e.g., [A], then [A, B], then [A, B, C], etc.) until the F1 score
peaks and then decreases.

• We don’t need to calculate all patterns
like A, B, AB…

• Because if we select itemB, we should
select itemA as well


F1 maximization

• F1score_mean(simulated labels, [A]) -> 0.809747641431

• F1score_mean(simulated labels, [A,B]) -> 0.709004233757
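The simulation can be sketched with NumPy as follows. This is a minimal sketch, not the author's implementation: the function names, the fixed seed, and the choice to score every prefix and take the best (instead of stopping at the first decrease) are assumptions, and None is left out here. Estimates vary slightly with the random seed, so they will not match the numbers above to every digit.

```python
import numpy as np


def expected_f1(probs, k, n_sim=9999, seed=0):
    """Monte Carlo estimate of F1 when predicting the k most probable items."""
    probs = np.sort(np.asarray(probs, dtype=float))[::-1]   # highest first
    rng = np.random.default_rng(seed)
    labels = rng.random((n_sim, len(probs))) < probs        # simulated truth
    tp = labels[:, :k].sum(axis=1)        # predicted items that were ordered
    n_true = labels.sum(axis=1)           # items actually ordered
    # F1 = 2*TP / (n_predicted + n_true); the denominator is at least k >= 1.
    f1 = 2 * tp / (k + n_true)
    return f1.mean()


def best_k(probs):
    """Score the prefixes [A], [A, B], ... and return the best size with all scores."""
    scores = [expected_f1(probs, k) for k in range(1, len(probs) + 1)]
    return int(np.argmax(scores)) + 1, scores


k, scores = best_k([0.9, 0.3])   # scores come out near 0.81 for [A] and 0.71 for [A, B]
```

For these two probabilities the exact expectations are 0.81 and 0.71, so only Item A should be predicted, which agrees with the simulated values on the slide.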


F1 maximization - Predicting None

• One way to think about None is as the probability (1 - Item A)
* (1 - Item B) * …

• But another method is to try to predict None as a special
case.

• By using our None model and treating None as just another
item, we can boost the F1 score from 0.400 to 0.407.
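The first way of thinking about None, the product of (1 - p) over all items, can be written in a few lines. It assumes the item probabilities are independent; the function name is made up for this example.

```python
import math


def none_probability(item_probs):
    """P(None) if every item is reordered independently with its own probability."""
    return math.prod(1.0 - p for p in item_probs)


# With Item A at 0.9 and Item B at 0.3: (1 - 0.9) * (1 - 0.3) = 0.07
p_none = none_probability([0.9, 0.3])
```

The second method replaces this product with the output of the separate None model and then lets None compete with the real items when choosing what to predict.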