MSOM Research Challenge Data


Paper
2019-07-12
JD.com

JD.com: Transactional data for the 2020 MSOM Data Driven Research Challenge

Max Shen (University of California, Berkeley), Christopher S. Tang (University of
California, Los Angeles), Di Wu (JD.COM American Technologies Corporation),
Rong Yuan (Stitch Fix), Wei Zhou (JD.COM American Technologies Corporation)

August 20, 2019

Abstract

To support the 2020 MSOM Data Driven Research Challenge, JD.com, China’s largest retailer,
offers transaction level data to MSOM members for conducting data-driven research. This article
describes the transactional data associated with over 2.5 million customers (anonymized) and
31,868 SKUs over the month of March in 2018. We also present potential research questions
suggested by JD.com. Researchers are welcome to develop econometric models or data-driven
models using this database to address some of the suggested questions or examine their own
research questions.

Keywords: E-Commerce, Transactional Data, MSOM Society, Data Driven Research

1. Introduction

The growth of e-commerce retailing (or E-tailing) has given rise to many new and challenging
problems at both strategic and operational levels. To encourage Operations Management (OM)
researchers to conduct data-driven research in E-tailing, we describe the data provided by
JD.com that are intended to enable researchers to examine research questions arising from
customer purchasing decisions and supply chain operations in the context of E-tailing.

JD.com is China’s largest retailer with a net revenue of US$67.2 billion in 2018 and over 320
million annual active customers. According to the official description provided by JD.com,
“JD.com is committed to providing only high-quality, authentic products, and is known for its
fast delivery speed. JD.com sets the standard for online shopping through its commitment to
quality, authenticity, and its vast product offering covering everything from fresh food and
apparel to electronics and cosmetics. JD.com combines its business model of first party, where it
controls the entire supply chain, with a marketplace that intentionally limits the number of
sellers, to ensure that it can maintain strict quality oversight. JD.com has a nationwide
fulfillment network covers 99% of China’s population, and is able to provide standard same- and
next-day delivery as standard for approximately 90% of orders.”

The data sets provided by JD.com capture a “full customer experience cycle” that begins at the
moment when a customer browses through products available on the platform before placing his
or her order and ends at the moment when the customer receives the products at his or her
designated location. The data sets provide information on 2.5 million customers and over 30,000
SKUs in one specific product category during the month of March in 2018.

Based on our discussion with the management of JD.com, we developed the following set of
research questions. We encourage researchers to explore the provided data and develop
innovative solutions to address the following problems (or other research problems of their own
choosing):

1. Which product attributes and/or features have predictive power for customers’
product choices? Do customers’ product choices differ by channel (e.g., purchasing via
mobile phones versus personal computers), by region, and by brand loyalty?

2. Would more products with similar attributes and features improve or hinder sales
revenues for JD.com?

3. For a specific target customer segment (e.g., female customers in a tier 1 city), what
should merchants and brands do to improve their sales performance?

4. What is the impact of various pricing and promotion strategies on product sales? How
should JD.com improve its pricing and promotion strategy? In particular, among all the
promotion methods (e.g., direct discounts, bundle discounts, and volume discounts),
which one is more effective?

5. Do ordinary customers behave differently from JD.com’s PLUS members?1 How should
JD.com improve its pricing and shipping strategy for its PLUS members?

6. How should JD.com improve its demand forecast accuracy for different geographic
regions and different customer groups?

7. How should JD.com improve its fulfillment efficiency and customer experience with
better inventory allocation strategies in a multilevel inventory network?

2. Data Description

We now describe the transactional data provided by JD.com. (We shall explain how to download
the database and the Python code in Section 4.) To ensure confidentiality, certain key
identification information such as user ID and Stock Keeping Unit (SKU) ID are fully
anonymized.2

In the database, each SKU can be identified either as “first party owned” (1P) or “third-party
owned” (3P), depending on the ownership of the inventory of that SKU.3 All 1P SKUs are
managed by JD.com, including product assortments, inventory replenishments, product pricing,
order deliveries, and after-sales customer services. Despite different operations, 1P and 3P SKUs
compete on the JD.com platform for sales through different pricing strategies and marketing
activities.

In general, 1P SKUs are usually top sellers within the category. By owning these 1P products,
JD.com can fully control the entire customer experience to provide guaranteed quality, fast
delivery, and good customer services. By contrast, all 3P SKUs are managed by third-party
merchants on the JD marketplace. Specifically, to fulfill an order of a 3P SKU, the corresponding
merchant can decide freely whether to use the logistics services provided by JD Logistics or
other logistics service providers.4

The data sets provided by JD.com offer a detailed view on the activities associated with all SKUs
within one anonymized consumable category during the month of March in 2018. Owing to
confidentiality, the specific category is not disclosed. The data set consists of seven tables that
are labeled as (1) skus, (2) users, (3) clicks, (4) orders, (5) delivery, (6) inventory, and (7)
network. We now describe each of these seven tables.

1 JD.com’s PLUS membership is a subscription-based program that provides its members certain benefits that range
from free shipping to member-specific price discounts. For details about JD.com’s PLUS membership, see
https://plus.jd.com/index.html (Chinese Content).
2 Note that the data provided by JD.com represent only a small sample of users and SKUs. Therefore, the database
does not necessarily fully capture the business performance or business trends of JD.com.
3 All SKUs are displayed on JD.com’s product page with the seller name and/or tags so that customers are fully
aware of whether the corresponding SKU is a 1P SKU or a 3P SKU.
4 The fulfillment process is usually described on the product page so that customers will know that the shipping
process is managed by the merchant itself.

1. Table: skus

The skus table (Table 1) describes the characteristics of each of the 31,868 SKUs that belong to a
single product category. We now define each field along with a brief description. Each entry in
the skus table corresponds to a unique SKU (sku_ID). In addition, each SKU ID is “seller-
specific.” For example, an identical product that is sold by JD as a 1P product and by a third-
party seller as a 3P product will be treated as two separate SKUs with different SKU IDs.
Similarly, an identical product sold by multiple third-party sellers will be denoted by different
SKU IDs.

Field            Data type  Description                                    Sample value
sku_ID           string     Unique identifier of a product                 b4822497a5
type             int        1P or 3P SKU                                   1
brand_ID         string     Brand unique identification code               c840ce7809
attribute1       int        First key attribute of the category            3
attribute2       int        Second key attribute of the category           60
activate_date    string     The date at which the SKU is first introduced  2018-03-01
deactivate_date  string     The date at which the SKU is terminated        2018-03-01
Table 1: Description of the skus table

Among these 31,868 SKUs, 1,167 of them are 1P SKUs (type value = 1) and the rest (30,701)
are 3P SKUs (type value = 2). The brand information of each SKU is provided via the field
brand_ID. However, only 9,159 of the 31,868 SKUs were involved in purchase activities
during March of 2018.

Each SKU also has two key attributes: the first attribute takes on a value that ranges from 1 to 4,
and the second attribute takes on a value that ranges from 30 to 100. For each attribute, a higher
value indicates better performance of a certain functionality (e.g., longer battery life and higher
screen resolution). The distributions of the value associated with these two attributes across all
SKUs are depicted in Figures 1 and 2. Notice that many SKUs have missing values for different
reasons, including (a) the third-party merchants did not provide the attribute value, especially for
certain slow-moving items or (b) a certain attribute was not applicable to certain SKUs.5

5 JD.com displays product ratings for each SKU. However, in the Chinese marketplace, most product ratings reported
by customers are usually the highest rating. Because most ratings are rated 5, the information associated with
product ratings has been shown to be uninformative. Consequently, product ratings are omitted in the database.

Figure 1: Distribution of Attribute 1 across All SKUs

Figure 2: Distribution of Attribute 2 across All SKUs

For each SKU, the skus table provides two extra elements: activate_date and deactivate_date.
The former specifies the date at which an SKU is first introduced on the JD.com platform and the
latter specifies the date at which the SKU is terminated and removed from JD.com.6 Notice that
the data set only lists a valid activate_date and deactivate_date when one of these dates occurred
within the month of March in 2018. If one of these fields is empty, this means that the SKU was
activated before March 2018 and/or deactivated after March 2018. Our records indicate that
1,141 SKUs were deactivated and 3,058 SKUs were activated during March, and their monthly
sales account for 2% and 5% of the total monthly sales, respectively.
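
The way empty date fields are read can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, an empty string is assumed to stand for a missing date, and an SKU is assumed to stay listed through its deactivation day.

```python
from datetime import date

MARCH_START = date(2018, 3, 1)
MARCH_END = date(2018, 3, 31)


def parse_date(value):
    """Return a date for a 'yyyy-mm-dd' string, or None for an empty field."""
    return date.fromisoformat(value) if value else None


def listed_on(day, activate_date, deactivate_date):
    """Return True if the SKU is listed on `day`, a day in March 2018.

    An empty activate_date means the SKU was activated before March 2018;
    an empty deactivate_date means it was not deactivated during March 2018.
    Assumption: the SKU still counts as listed on its deactivation day.
    """
    if not (MARCH_START <= day <= MARCH_END):
        raise ValueError("the skus table only describes March 2018")
    start = parse_date(activate_date)
    end = parse_date(deactivate_date)
    if start is not None and day < start:
        return False
    if end is not None and day > end:
        return False
    return True
```

For example, an SKU with both fields empty is treated as listed on every day of the month, whereas the sample row in Table 1 (both dates equal to 2018-03-01) is listed on March 1 only.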

6 Note that, even though an SKU is deactivated, it may still be able to be bought as a part of a bundled product or as
the gift portion of a promotion.

2. Table: users

The users table (Table 2) describes the characteristics of each of the 457,298 users who
purchased at least one of the SKUs in the given category during March of 2018. We now define
each field along with a brief description. Each entry in the users table corresponds to a unique
customer (user_ID). The field first_order_month specifies the month when the user made his or
her “first purchase” on JD.com.

Field              Data type  Description                                        Sample value
user_ID            string     User unique identification code                    000000f736
user_level         int        User level                                         10
first_order_month  string     First month in which the customer placed an        2017-07
                              order on JD.com (format: yyyy-mm)
plus               int        If user is with a PLUS membership                  0
gender             string     User gender (estimated)                            F
age                string     User age range (estimated)                         26–35
marital_status     string     User marital status (estimated)                    M
education          int        User education level (estimated)                   3
purchase_power     int        User purchase power (estimated)                    2
city_level         int        City level of user address                         1
Table 2: Description of the users table

For each repeat customer, the corresponding user is classified according to his or her past
purchases so that the customer’s user_level takes on a value of 0, 1, 2, 3, or 4, where a higher
user_level is associated with a higher total purchase value in the past. For users who are
enterprise users (e.g., small shops in rural areas or small businesses), the corresponding
user_level takes on a value of 10. For first-time purchasers, the value −1 is assigned as their
user_level.7 Figure 3 depicts the distribution of user levels for all 457,298 customers.

7 Regardless of different users’ user_level values, they observe the same information and receive the same service
from JD.com.

Figure 3: Distribution of Users: User Level.

A plus value of 1 denotes that, before March of 2018, the corresponding user was already a PLUS
member.8 In addition to customer past purchase value and PLUS membership, the users table
contains user demographic information. Because JD.com’s customers are not required to provide
any demographic information when making a purchase, these demographic fields are estimated
by JD.com’s data-driven artificial intelligence system.

The estimated user demographics for each user are (a) gender (F: female, M: male, U:
unknown); (b) age (<=15: less than or equal to 15 years old, 16-25: 16 to 25 years old, 26-35: 26
to 35 years old, 36-45: 36 to 45 years old, 46-55: 46 to 55 years old, >=56: greater than or equal
to 56 years old, U: unknown); (c) marital_status – user’s marital status (M: married, S: single, U:
unknown); (d) education – user’s education level (1: less than high school, 2: high school
diploma or equivalent, 3: Bachelor’s degree, 4: post-graduate degree, −1: unknown); and (e)
purchase_power – user’s estimated purchase power (ranging from 1 to 5 with 1 being the highest
purchase power; −1 if there is no estimation).

In addition to those estimated demographics of each user, JD.com has provided actual
information about the most commonly used shipping address for each user. This information is
captured in the field city_level, which takes on values ranging between 1 and 5. Here, level 1
corresponds to highly industrialized cities such as Beijing and Shanghai; level 2 cities
correspond to provincial capitals; level 3 to 5 cities are smaller cities; if there are no data then the
value is −1. Notice that city_level is based on actual information.
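
Because several fields mix real levels with sentinel codes, it can help to decode a row before analysis. The following is a minimal sketch, not part of the released code: the function names are hypothetical, every raw value is assumed to arrive as a string (as it would from a CSV reader), and the unknown code is assumed to be stored as an ASCII "-1".

```python
# Labels taken from the field descriptions above.
EDUCATION = {
    1: "less than high school",
    2: "high school diploma or equivalent",
    3: "Bachelor's degree",
    4: "post-graduate degree",
}


def user_segment(user_level):
    """Map user_level to the three groups described in the text."""
    if user_level == -1:
        return "first-time purchaser"
    if user_level == 10:
        return "enterprise user"
    if 0 <= user_level <= 4:
        return "repeat customer"
    raise ValueError(f"unexpected user_level: {user_level}")


def clean_user(row):
    """Turn one raw users row (a dict of strings) into typed values.

    Unknown values ("U" or -1) become None so that they are not mistaken
    for real categories or levels.
    """

    def code(field):
        value = int(row[field])
        return None if value == -1 else value

    def label(field):
        return None if row[field] == "U" else row[field]

    return {
        "user_ID": row["user_ID"],
        "segment": user_segment(int(row["user_level"])),
        "is_plus": int(row["plus"]) == 1,
        "gender": label("gender"),
        "age": label("age"),
        "marital_status": label("marital_status"),
        "education": EDUCATION.get(code("education")),
        "purchase_power": code("purchase_power"),
        "city_level": code("city_level"),
    }


# Sample row from Table 2 (the age band is written with an ASCII hyphen here).
sample_user = {
    "user_ID": "000000f736", "user_level": "10", "first_order_month": "2017-07",
    "plus": "0", "gender": "F", "age": "26-35", "marital_status": "M",
    "education": "3", "purchase_power": "2", "city_level": "1",
}
cleaned_user = clean_user(sample_user)
```

Keeping user_level = −1 as its own segment, rather than mapping it to None, reflects that it marks first-time purchasers and is not a missing value.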

8 JD PLUS membership costs up to US$45 per year and members enjoy a variety of perquisites including exclusive
discounts, higher purchasing reward rate, free delivery, and return with no pre-conditions. About 18% of those
457,298 customers in the data set are JD PLUS members.

Figure 4 depicts the distribution of user gender across all 457,298 customers in the database, and
Figure 5 summarizes the distribution of estimated user age. As shown in Figures 4 and 5, for this
specific product category, more than 60% of all customers are estimated to be female and the
estimated ages of these customers are in their 30s to 40s. From Figure 6, we observe a relatively
even distribution between married and single customers. Figures 7 and 8 provide the customers’
estimated education levels and purchase power. Figure 9 summarizes the distribution of shipping
addresses according to different city levels. It can be seen that most of the customers are from
level 1 and level 2 cities.

Figure 4: Distribution of Users: Gender

Figure 5: Distribution of Users: Age

Figure 6: Distribution of Users: Marital Status

Figure 7: Distribution of Users: Education Levels

Figure 8: Distribution of Users: Purchase Power Levels

Figure 9: Distribution of Users: City Levels

3. Table: clicks

The clicks table (Table 3) establishes the linkage between users and SKUs through their
browsing history. Each entry in the clicks table represents a user’s “click event” on a specific
SKU page.9 The data set contains over 20 million click records generated by
2.5 million customers. Note that this table contains clicks contributed not only by the users
identified in the users table (Table 2) who purchased at least one SKU but also by “other users”
who did not end up completing a purchase order.

Field         Data type  Description                                              Sample value
sku_ID        string     SKU unique identification code                           b4822497a5
user_ID       string     User unique identification code                          94ff800585
request_time  string     The time at which the customer clicks the SKU item page  2018-03-01 23:57:53
                         (format: yyyy-mm-dd HH:MM:SS)
channel       string     The click channel                                        wechat
Table 3: Description of the clicks table

The records include the following: (a) the user who initiated a “click event” (user_ID), (b) the
SKU associated with the click event (sku_ID), (c) the time at which the click event occurred
(request_time), and (d) the channel in which the click event occurred (channel).10 The channel
field takes one of five string values: pc, mobile, app, wechat, and others. Channels pc and mobile
are associated with clicks through web browsers on personal computers and mobile devices,
respectively. Channel app corresponds to JD.com’s mobile app. Channel wechat corresponds to
the mini-program that runs on the social media app WeChat. Finally, channel others aggregates
the clicks from all other channels.

9 It is worth noting that this table only contains click information on the SKU detail page. There are many other page
types with which a customer can interact on JD.com, such as the website main page, category main page, various
landing pages, search, recommendation page, and shopping cart page. Although those pages also contain
information about SKUs and promotions, the customers still need to go to the SKU detail page to review the detailed
description of the products and place the order.

The distribution of all click events across all channels is summarized in Figure 10. Because of
the popularity of smartphones in China and the popularity of mobile payment options (e.g.,
WeChat payment), the majority of click events come from the app and wechat channels.

The field request_time records each click with a full timestamp, so it can be used to infer
customers’ browsing sequences and habits. In Figure 11, we plot the number of clicks over the
course of March 1, 2018, within the app channel. We can clearly identify two peaks in the daily
browsing activities: one from 8am to 4pm and the other in the late evening.
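
An hourly profile of the kind plotted in Figure 11 can be computed directly from request_time. The sketch below is illustrative only: the function name is hypothetical, rows are assumed to be dicts with the Table 3 field names, and all rows other than the first in the example are made up.

```python
from collections import Counter
from datetime import datetime

CLICK_FORMAT = "%Y-%m-%d %H:%M:%S"


def clicks_per_hour(rows, channel, day):
    """Count click events per hour of day for one channel on one day.

    `day` is a 'yyyy-mm-dd' string; the result is a list of 24 counts,
    where index 0 covers 00:00:00 to 00:59:59 and so on.
    """
    counts = Counter()
    for row in rows:
        stamp = datetime.strptime(row["request_time"], CLICK_FORMAT)
        if row["channel"] == channel and stamp.date().isoformat() == day:
            counts[stamp.hour] += 1
    return [counts.get(hour, 0) for hour in range(24)]


# The first row is the sample row from Table 3; the others are hypothetical.
example_clicks = [
    {"sku_ID": "b4822497a5", "user_ID": "94ff800585",
     "request_time": "2018-03-01 23:57:53", "channel": "wechat"},
    {"sku_ID": "b4822497a5", "user_ID": "user_a",
     "request_time": "2018-03-01 09:15:00", "channel": "app"},
    {"sku_ID": "b4822497a5", "user_ID": "user_b",
     "request_time": "2018-03-01 09:45:10", "channel": "app"},
    {"sku_ID": "b4822497a5", "user_ID": "user_b",
     "request_time": "2018-03-02 09:05:00", "channel": "app"},
]
app_profile = clicks_per_hour(example_clicks, "app", "2018-03-01")
```

Sorting one user's rows by request_time in the same way yields that user's browsing sequence.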

Figure 10: Distribution of All Click Events Across Different Channels

10 Note that these data capture the click event of an SKU initiated by a user, but each click event may not lead to the
purchase of this SKU. In other words, a user may choose not to purchase this SKU even after the click event.

Figure 11: Number of Click Events Occurring on March 1, 2018, through JD.com’s App Channel

4. Table: orders

The orders table (Table 4) contains 486,928 unique customer orders associated with our focused
product category that were placed during the month of March in 2018. Each row of the orders
table pairs a customer order (order_ID) with a specific SKU (sku_ID) and a unique customer
(user_ID). (If a customer ordered multiple SKUs, then the same order_ID will appear in
multiple rows, one per SKU.)
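
Since one order can span several rows, order-level analysis first requires grouping rows by order_ID. A minimal sketch follows; the function name and the example identifiers are hypothetical, and rows are assumed to be dicts keyed by the Table 4 field names.

```python
from collections import defaultdict


def order_baskets(rows):
    """Group order rows into baskets: order_ID -> {sku_ID: total quantity}."""
    baskets = defaultdict(dict)
    for row in rows:
        basket = baskets[row["order_ID"]]
        basket[row["sku_ID"]] = basket.get(row["sku_ID"], 0) + int(row["quantity"])
    return dict(baskets)


# Hypothetical rows: order "o1" contains two SKUs, order "o2" contains one.
example_order_rows = [
    {"order_ID": "o1", "user_ID": "u1", "sku_ID": "sku_a", "quantity": "1"},
    {"order_ID": "o1", "user_ID": "u1", "sku_ID": "sku_b", "quantity": "2"},
    {"order_ID": "o2", "user_ID": "u2", "sku_ID": "sku_a", "quantity": "1"},
]
baskets = order_baskets(example_order_rows)
multi_sku_orders = [order for order, skus in baskets.items() if len(skus) > 1]
```

Counting keys of the result gives the number of distinct orders, which differs from the number of rows whenever multi-SKU orders are present.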

Field                       Data type  Description                                    Sample value
order_ID                    string     Order unique identification code               3b76bfcd3b
user_ID                     string     User unique identification code                3cde601074
sku_ID                      string     SKU unique identification code                 443fd601f0
order_date                  string     Order date (format: yyyy-mm-dd)                2018-03-01
order_time                  string     Specific time at which the order gets placed   2018-03-01 11:10:40.0
                                       (format: yyyy-mm-dd HH:MM:SS)
quantity                    int        Number of units ordered                        1
type                        int        1P or 3P orders                                1
promise                     int        Expected delivery time (in days)               2
original_unit_price         float      Original list price                            99.9
final_unit_price            float      Final purchase price                           53.9
direct_discount_per_unit    float      Discount due to SKU direct discount            5.0
quantity_discount_per_unit  float      Discount due to purchase quantity              41.0
bundle_discount_per_unit    float      Discount due to “bundle promotion”             0.0
coupon_discount_per_unit    float      Discount due to customer coupon                0.0
gift_item                   int        If the SKU is with gift promotion              0
dc_ori                      int        Distribution center ID where the order is      29
                                       shipped from
dc_des                      int        Destination address where the order is         29
                                       shipped to (represented by the closest
                                       distribution center ID)
Table 4: Description of the orders table

Other information associated with a customer order, as shown in Table 4, includes (a) the order
quantity for each SKU associated with the order (quantity), (b) the date and time when the
ordering event took place (order_date and order_time), (c) the type of SKU being ordered (type
= 1 if it is a 1P SKU and type = 2 if it is a 3P SKU), and (d) the promised delivery time of the
order (promise).11 Observe from Figure 12 that most orders have promised delivery times within
2 days. Figure 13 shows the total sales quantity by date and by order type.
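
The promise field can be converted into a delivery deadline by following the rule stated in the footnote on promise. The sketch below is one possible reading, not JD.com's own logic: the function name is hypothetical, "delivered on day d" is assumed to mean by the end of day d, and cases the text does not cover return None.

```python
from datetime import datetime, time, timedelta

END_OF_DAY = time(23, 59, 59)  # assumption: "on day d" means by the end of day d


def promised_deadline(order_time, promise):
    """Return the latest delivery moment implied by `promise`, or None.

    order_time is a datetime; promise is an int number of days, or None
    when the field is missing (as it is for most 3P orders).
    """
    if promise is None:
        return None
    if promise < 1:
        raise ValueError(f"unexpected promise value: {promise}")
    day = order_time.date()
    if promise == 1:
        if order_time.time() < time(11, 0):
            # Placed before 11am: delivered on the same day.
            return datetime.combine(day, END_OF_DAY)
        if order_time.time() < time(23, 0):
            # Placed before 11pm: delivered before 3pm on the following day.
            return datetime.combine(day + timedelta(days=1), time(15, 0))
        return None  # orders placed at or after 11pm are not covered by the text
    # promise = x > 1: delivery arrives on day t + x.
    return datetime.combine(day + timedelta(days=promise), END_OF_DAY)


# Sample row from Table 4: placed 2018-03-01 11:10:40 with promise = 2.
sample_deadline = promised_deadline(datetime(2018, 3, 1, 11, 10, 40), 2)
```

Comparing such a deadline with arr_time in the delivery table would give one way to measure whether a promise was met, subject to the same assumptions.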

The orders table also offers information about product pricing and promotional activities for
each SKU. For each entry, we denote the original list price of the SKU in the field
original_unit_price and the actual paid price by the customer for the SKU as final_unit_price.
The original list price of an SKU at any given time instant is the same for all customers, but the
final price can vary among customers owing to various discounts or promotions.

The “gap” between the original price and the final price represents the coupons and discounts
associated with different promotional activities for each SKU. There are four common types of
promotional discounts on the JD.com platform: (1) SKU direct discount, (2) quantity discount,
(3) bundle promotion, and (4) gift items.12 These four types of discounts can be described as
follows:

(1) The seller of an SKU may offer a price cut in terms of a direct discount. This discount
reflects the reduction in the list price as stated on the product detail page.

(2) The seller of an SKU may offer a quantity discount to entice the customer to buy more.
This quantity discount promotion can take different forms including “get an RMB 100
discount if buying over RMB 199” or “buy 3 and get 1 free.” We note that the quantity
discount promotion is usually on the order level and we apply a simple allocation rule to
calculate the contribution provided by each SKU in the order.

(3) The seller may offer a bundle discount (bundle_discount_per_unit) if a customer buys a
“pre-specified bundle” of SKUs within an order.

(4) The seller may offer an SKU as a “free gift” (gift_item value = 1) if the customer
purchases a “pre-specified set” of SKUs (e.g., get a free eraser if you buy x pencils and y
pads of paper). The final_unit_price for each gift item is always equal to 0.

11 When promise = 1, this refers to the standard same- and next-day delivery promise: Orders placed before 11am will
be delivered on the same day, and orders placed before 11pm will be delivered before 3pm on the following day.
When promise is x (x > 1), this indicates that the delivery will arrive at day t + x, where t is the day the order is
placed. We note that promise information is not available for a small fraction of 1P orders and for most of the 3P
orders.
12 In addition to the discounts, coupons are also commonly used to reduce the final paid price by customers.

Coupons can also be applied to the order after all other promotions are applied. In contrast to the
four aforementioned promotion activities where discounts will be applied automatically once
certain criteria are met, customers need to “clip” (or claim) a coupon before making a purchase.13
The field coupon_discount_per_unit records the per-unit coupon value applied to each entry.
As with the quantity discount explained earlier, the discount value of the coupon is allocated
among the items in the same order using an allocation rule when necessary.

We note that, for each entry in the orders table, the gap between original_unit_price and
final_unit_price should always equal the sum of direct_discount_per_unit,
quantity_discount_per_unit, bundle_discount_per_unit, and coupon_discount_per_unit.
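
This accounting identity is easy to verify row by row using the four per-unit discount fields of Table 4. The sketch below is a hedged illustration: the function names are hypothetical, and the tolerance of RMB 0.01 is an assumption introduced to absorb floating-point rounding.

```python
DISCOUNT_FIELDS = (
    "direct_discount_per_unit",
    "quantity_discount_per_unit",
    "bundle_discount_per_unit",
    "coupon_discount_per_unit",
)


def price_gap_residual(row):
    """Return (original - final) minus the sum of the four per-unit discounts."""
    gap = float(row["original_unit_price"]) - float(row["final_unit_price"])
    return gap - sum(float(row[field]) for field in DISCOUNT_FIELDS)


def is_price_consistent(row, tolerance=0.01):
    """True if the price gap matches the discounts within `tolerance` (assumed)."""
    return abs(price_gap_residual(row)) <= tolerance


# Sample row from Table 4: 99.9 - 53.9 = 46.0 = 5.0 + 41.0 + 0.0 + 0.0.
sample_order = {
    "original_unit_price": "99.9", "final_unit_price": "53.9",
    "direct_discount_per_unit": "5.0", "quantity_discount_per_unit": "41.0",
    "bundle_discount_per_unit": "0.0", "coupon_discount_per_unit": "0.0",
}
sample_is_consistent = is_price_consistent(sample_order)
```

Rows that fail such a check are worth inspecting separately before they enter a pricing or promotion analysis.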

Finally, for each order, we show from which district the order was shipped (dc_ori) and to which
district the order was shipped (dc_des). The district here is defined by the warehouse ID that
covers the demand of that district. In other words, one can think of dc_ori as the warehouse
where the package is shipped from and dc_des as the warehouse that is nearest to the customer’s
designated shipping address. If dc_ori and dc_des are the same, this means that the package is
shipped from the warehouse closest to the customer. Otherwise, it indicates that the package is
fulfilled by some other warehouse in a different district. We note that in theory any warehouse in
the nationwide network can fulfill any customer in the country. However, in practice, there is a
complicated order fulfillment logic that determines what inventory should be used to fulfill each
customer order to optimize fulfillment resources while satisfying delivery promise.
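
The comparison of dc_ori and dc_des translates into a simple local-fulfillment indicator. The following sketch is illustrative only: the function name is hypothetical, rows with an empty dc_ori or dc_des are assumed to carry no shipping information and are skipped, and the share is computed over rows rather than over distinct orders.

```python
def local_fulfillment_share(rows):
    """Share of order rows shipped from the warehouse closest to the customer.

    A row counts as locally fulfilled when dc_ori equals dc_des.
    Returns None when no row has both fields filled in.
    """
    usable = [row for row in rows if row["dc_ori"] != "" and row["dc_des"] != ""]
    if not usable:
        return None
    local = sum(1 for row in usable if int(row["dc_ori"]) == int(row["dc_des"]))
    return local / len(usable)


# The first row uses the sample values from Table 4; the others are hypothetical.
example_shipments = [
    {"order_ID": "3b76bfcd3b", "dc_ori": "29", "dc_des": "29"},
    {"order_ID": "o2", "dc_ori": "9", "dc_des": "29"},
    {"order_ID": "o3", "dc_ori": "", "dc_des": "29"},
]
local_share = local_fulfillment_share(example_shipments)
```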

13 Coupons normally consist of a discount value, an eligibility criterion, and an expiration date. The discount value is
the monetary amount that can be deducted from the order; the eligibility criterion specifies which SKU or SKU set is
eligible for coupon use and whether there is a total purchase amount criterion. The expiration date shows when the
coupon can be applied. There are many ways in which customers can receive a coupon. They can clip coupons from
the product detail pages, promotional landing pages, or “coupon mall” (a specific section on the JD.com platform for
coupon distribution). Customers can also receive personalized coupons based on their past activities.

Figure 12: Distribution of Promise Delivery Time (1P Orders)

Figure 13: Sales in Quantity by Date and Order Type

5. Table: delivery

The delivery table (Table 5) establishes the linkage between each order (order_ID) and
(possibly) multiple shipping packages (i.e., multiple package_IDs) in the event that an order is
split into multiple delivery packages for logistical reasons (e.g., an order that involves in-stock
and on-order items).

Field             Data type  Description                                           Sample value
package_ID        string     Package unique identification code (same as order_ID  209a005c40
                             if the package contains all SKUs in the order)
order_ID          string     Order unique identification code                      209a005c40
type              int        1P or 3P orders                                       1
ship_out_time     string     The timestamp when the package is shipped out from    2018-03-01 08:37:33
                             the warehouse (format: yyyy-mm-dd HH:MM:SS)
arr_station_time  string     The timestamp when the package arrives at the         2018-03-01 15:37:31
                             station (format: yyyy-mm-dd HH:MM:SS)
arr_time          string     The timestamp when the package is delivered to the    2018-03-01 18:49:03
                             customer home (format: yyyy-mm-dd HH:MM:SS)
Table 5: Description of the delivery table

The delivery table contains 293,229 packages delivered by JD Logistics in the given time period,
among which 244,333 packages involve 1P SKUs (type = 1) and 48,896 packages involve 3P SKUs
(type = 2). We further provide three key timestamps (up to hourly granularity) for each package
delivery, namely, the time at which the package was shipped from the warehouse
(ship_out_time), the time at which the package arrived at the delivery station (arr_station_time),
and the time at which the package was successfully delivered to the customer (arr_time).
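
The three timestamps split each delivery into two legs whose durations can be computed directly. This is a minimal sketch under assumptions: the function name is hypothetical, and a missing timestamp is assumed to appear as an empty string, in which case the affected durations are returned as None.

```python
from datetime import datetime

DELIVERY_FORMAT = "%Y-%m-%d %H:%M:%S"


def delivery_legs(row):
    """Return the hours spent on each leg of one package delivery."""

    def parse(field):
        value = row.get(field, "")
        return datetime.strptime(value, DELIVERY_FORMAT) if value else None

    def hours(start, end):
        if start is None or end is None:
            return None
        return (end - start).total_seconds() / 3600

    shipped = parse("ship_out_time")
    at_station = parse("arr_station_time")
    delivered = parse("arr_time")
    return {
        "warehouse_to_station": hours(shipped, at_station),
        "station_to_customer": hours(at_station, delivered),
        "total": hours(shipped, delivered),
    }


# Sample row from Table 5.
sample_package = {
    "package_ID": "209a005c40", "order_ID": "209a005c40", "type": "1",
    "ship_out_time": "2018-03-01 08:37:33",
    "arr_station_time": "2018-03-01 15:37:31",
    "arr_time": "2018-03-01 18:49:03",
}
sample_legs = delivery_legs(sample_package)
```

For the sample row, the total elapsed time from shipping to delivery is 10 hours, 11 minutes, and 30 seconds.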

6. Table: inventory

The inventory table (Table 6) provides information about the availability of each SKU (sku_ID) at
each warehouse (dc_ID). We disclose only whether an SKU was available at the end of each day
(date), not the amount of inventory. In addition, when an SKU is not available at a specific
warehouse on a specific day, there will be no record of that SKU at that warehouse on that day.
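
Because unavailability is encoded by the absence of a record, stockout days have to be derived by checking which days are missing. The sketch below is illustrative only: the function names are hypothetical, and a missing record may also mean that the SKU was never stocked at that warehouse or was not yet activated, which this code cannot distinguish.

```python
from datetime import date, timedelta


def availability_index(rows):
    """Build a set of (dc_ID, sku_ID, date) keys from inventory rows."""
    return {(int(row["dc_ID"]), row["sku_ID"], row["date"]) for row in rows}


def days_without_record(index, dc_id, sku_id,
                        first=date(2018, 3, 1), last=date(2018, 3, 31)):
    """List the days in [first, last] with no record for the SKU at the warehouse."""
    missing = []
    day = first
    while day <= last:
        if (dc_id, sku_id, day.isoformat()) not in index:
            missing.append(day.isoformat())
        day += timedelta(days=1)
    return missing


# The first row is the sample row from Table 6; the second is hypothetical.
example_inventory = [
    {"dc_ID": "9", "sku_ID": "fcc883f713", "date": "2018-03-01"},
    {"dc_ID": "9", "sku_ID": "fcc883f713", "date": "2018-03-03"},
]
inventory_index = availability_index(example_inventory)
gaps = days_without_record(inventory_index, 9, "fcc883f713",
                           first=date(2018, 3, 1), last=date(2018, 3, 3))
```

Restricting the check to warehouse and SKU pairs that appear at least once in the table is one way to reduce false stockout signals.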

Field   Data type  Description                     Sample value
dc_ID   int        Distribution center ID          9
sku_ID  string     SKU unique identification code  fcc883f713
date    string     Date (format: yyyy-mm-dd)       2018-03-01
Table 6: Description of the inventory table

7. Table: network

The network table (Table 7) provides information about the assignment of different warehouses
located in different districts (dc_ID) to different geographical regions (region_ID). For each
district, a designated warehouse (dc_ID) is responsible for fulfilling orders in the district. In
addition, for different districts that are assigned to a geographical region, one of the (larger)
warehouses will be designated as the “central warehouse” for that region. In JD.com’s context, a
central warehouse provides the “back-up fulfillment” option when other (typically smaller)
warehouses in the region run out of inventory for their corresponding districts. Figure 14 shows
the number of districts within each geographical region. We denote each central warehouse for
each region by setting dc_ID = region_ID.
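
The convention dc_ID = region_ID makes it straightforward to look up the back-up warehouse of any district. A minimal sketch follows; the function names are hypothetical, and the example rows other than the Table 7 sample (region 2, district 6) are made up for illustration.

```python
def region_lookup(rows):
    """Map each dc_ID to its region_ID from network rows."""
    return {int(row["dc_ID"]): int(row["region_ID"]) for row in rows}


def is_central(dc_id, region_of):
    """A warehouse is central when its dc_ID equals its region_ID."""
    return region_of.get(dc_id) == dc_id


def backup_warehouse(dc_id, region_of):
    """Return the dc_ID of the central warehouse serving the district's region."""
    return region_of[dc_id]  # the central warehouse carries the region's ID


# Region 2 / district 6 is the sample row from Table 7; the rest is hypothetical.
example_network = [
    {"region_ID": "2", "dc_ID": "2"},
    {"region_ID": "2", "dc_ID": "6"},
    {"region_ID": "4", "dc_ID": "4"},
    {"region_ID": "4", "dc_ID": "9"},
]
region_of = region_lookup(example_network)
central_ids = sorted(dc for dc in region_of if is_central(dc, region_of))
```

Combined with dc_ori and dc_des from the orders table, this lookup can indicate whether a non-local shipment came from the region's central warehouse or from elsewhere in the network.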

Field      Data type  Description                         Sample value
region_ID  int        Region ID                           2
dc_ID      int        District ID (same as warehouse ID)  6
Table 7: Description of the network table

Figure 14: Number of Districts within the Regions

3. Conclusion
The data sets provided by JD.com are presented in seven tables as described above. These data
sets are based on the activities associated with 2.5 million users (among them 457,298 with
purchases) and 31,868 SKUs over the month of March in 2018. We also present potential
research questions suggested by JD.com. Researchers are welcome to develop econometric
models or data-driven models using this database to address some of the suggested questions or
examine their own research questions.

The data sets include product information (attributes, pricing, etc.) and customer information
(demographics, total value of past purchases, PLUS membership, etc.). In addition, the data sets
capture a “full customer experience cycle” that begins at the moment when a customer chooses
the products on the platform and ends at the moment when the customer receives the products at
his or her designated location.

4. Downloading the Data and Python Code


MSOM members can access the data sets through the MSOM website XXXX. For easy access,
we provide a Python notebook14 with runnable sample code to facilitate reviewing and
understanding of the data sets as well as to explain the relationships among the seven tables
described in this paper. The code is provided in the online appendix, and a runnable version is
available within the data set package.

14 https://jupyter.org/
