TransAct
TransAct
Recommendation at Pinterest
Xue Xia Pong Eksombatchai Nikil Pancha
xxia@pinterest.com pong@pinterest.com npancha@pinterest.com
Pinterest Pinterest Pinterest
San Francisco, CA, USA San Francisco, CA, USA San Francisco, CA, USA
Andrew Zhai∗
andrew@aideate.ai
Pinterest
San Francisco, CA, USA
ABSTRACT Our model has been deployed to production in Homefeed, Related
Sequential models that encode user activity for next action predic- Pins, Notifications, and Search at Pinterest.
tion have become a popular design choice for building web-scale
personalized recommendation systems. Traditional methods of se- CCS CONCEPTS
quential recommendation either utilize end-to-end learning on • Information systems → Web searching and information
realtime user actions, or learn user representations separately in an discovery; Content ranking; Personalization;
offline batch-generated manner. This paper (1) presents Pinterest’s
ranking architecture for Homefeed, our personalized recommenda- KEYWORDS
tion product and the largest engagement surface; (2) proposes Trans- Personalization, Recommender Systems, Sequential Recommenda-
Act, a sequential model that extracts users’ short-term preferences tion, User Interest Modeling
from their realtime activities; (3) describes our hybrid approach to
ACM Reference Format:
ranking, which combines end-to-end sequential modeling via Trans-
Xue Xia, Pong Eksombatchai, Nikil Pancha, Dhruvil Deven Badani, Po-
Act with batch-generated user embeddings. The hybrid approach Wei Wang, Neng Gu, Saurabh Vishwas Joshi, Nazanin Farahpour, Zhiyuan
allows us to combine the advantages of responsiveness from learn- Zhang, and Andrew Zhai. 2023. TransAct: Transformer-based Realtime User
ing directly on realtime user activity with the cost-effectiveness Action Model for Recommendation at Pinterest. In Proceedings of the 29th
of batch user representations learned over a longer time period. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD
We describe the results of ablation studies, the challenges we faced ’23), August 6–10, 2023, Long Beach, CA, USA. ACM, New York, NY, USA,
during productionization, and the outcome of an online A/B ex- 11 pages. https://doi.org/10.1145/3580305.3599918
periment, which validates the effectiveness of our hybrid ranking
model. We further demonstrate the effectiveness of TransAct on 1 INTRODUCTION
other surfaces such as contextual recommendations and search. The proliferation of online content in recent years has created an
∗ work
overwhelming amount of information for users to navigate. To
done at Pinterest
address this issue, recommender systems are employed in various
industries to help users find relevant items from a vast selection,
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed including products, images, videos, and music. By providing person-
for profit or commercial advantage and that copies bear this notice and the full citation alized recommendations, businesses and organizations can better
on the first page. Copyrights for components of this work owned by others than the serve their users and keep them engaged with the platform. There-
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission fore, recommender systems are vital for businesses as they drive
and/or a fee. Request permissions from permissions@acm.org. growth by boosting engagement, sales, and revenue.
KDD ’23, August 6–10, 2023, Long Beach, CA, USA As one of the largest content sharing and social media plat-
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0103-0/23/08. . . $15.00 forms, Pinterest hosts billions of pins with rich contextual and
https://doi.org/10.1145/3580305.3599918 visual information, and brings inspiration to over 400 million users.
5249
KDD ’23, August 6–10, 2023, Long Beach, CA, USA Xue Xia et al.
5250
TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest KDD ’23, August 6–10, 2023, Long Beach, CA, USA
both low-order and high-order feature interactions automatically. However, we found that positional information does not add much
DCN [34] and its upgraded version DCN v2 [35] both aim to auto- value. We find other designs like better early fusion and action type
matically model the explicit feature crosses. The aforementioned embedding are effective when dealing with sequence features.
recommender systems do not work well in capturing the short-
term interests of users since only the static features of users are 3 METHODOLOGY
utilized. These methods also tend to ignore the sequential relation- In this section, we introduce TransAct, our realtime-batch hybrid
ship within the action history of a user, resulting in an inadequate ranking model. We will start with an overview of the Pinterest
representation of user preferences. Homefeed ranking model, Pinnability. We then describe how to use
TrancAct to encode the realtime user action sequence features in
Pinnability for the ranking task.
2.2 Sequential Recommendation
To address this problem, sequential recommendation has been 3.1 Preliminary: Homefeed Ranking Model
widely studied in both academia and the industry. A sequential In Homefeed ranking, we model the recommendation task as a
recommendation system uses a behavior history of users as input pointwise multi-task prediction problem, which can be defined as
and applies recommendation algorithms to suggest appropriate follows: given a user 𝑢 and a pin 𝑝, we build a function to predict
items to users. Sequential recommendation models are able to cap- the probabilities of user 𝑢 performing different actions on the can-
ture users’ long-term preferences over an extended period of time, didate pin 𝑝. The set of different actions contains both positive and
similar to traditional recommendation methods. Additionally, they negative actions, e.g. click, repin2 and hide.
also have the added benefit of being able to account for users’ We build Pinnability, Pinterest’s Homefeed ranking model, to
evolving interests, which enables higher quality recommendations. approach the above problem. The high-level architecture is a Wide
Sequential recommendation is often viewed as a next item predic- and Deep learning (WDL) model [5]. The Pinnability model utilizes
tion task, where the goal is to predict a user’s next action based on various types of input signals, such as user signals, pin signals,
their past action sequence. We are inspired by the previous sequen- and context signals. These inputs can come in different formats,
tial recommendation method [4] in terms of encoding users’ past ac- including categorical, numerical, and embedding features.
tion into a dense representation. Some early sequential recommen- We use embedding layers to project categorical features to dense
dation systems use machine learning techniques, such as Markov features, and perform batch normalization on numerical features.
Chain [8] and session-based K nearest neighbors (KNN) [11] to We then apply a feature cross using a full-rank DCN V2 [35] to
model the temporal dependencies among interactions in users’ ac- explicitly model feature interactions. At last, we use fully connected
tion history. These models are criticized for not being able to fully layers with a set of output action heads 𝑯 = {ℎ 1, ℎ 2, . . . , ℎ𝑘 } to pre-
capture the long-term patterns of users by simply combining infor- dict the user actions on the candidate pin 𝑝. Each head maps to one
mation from different sessions. Recently, deep learning techniques action. As shown in Figure 2, our model is a realtime-batch hybrid
such as recurrent neural networks (RNN) [25] have shown great suc- model that encodes the user action history features by both realtime
cess in natural language processing and have become increasingly (TransAct) and batch (PinnerFormer) approaches and optimizes for
popular in sequential recommendation. As a result, many DL-based the ranking task [37].
sequential models [6, 9, 30, 42] have achieved outstanding perfor-
mance using RNNs. Convolutional neural networks (CNNs) [40] head 1 head 2 head k
are widely used for processing time-series data and image data. In … context features
pin features
the context of sequential recommendation, CNN-based models can user features
MLP
effectively learn dependency within a set of items users recently action type
interacted with, and make recommendations accordingly [31, 32]. DCN V2 pin embedding
timestamp
Attention mechanism is originated from the neural machine trans-
lation task, which models the importance of different parts of the
Concatenate
input sentences on the output words [2]. Self-attention is a mecha-
nism known to weigh the importance of different parts of an input
dense
sequence [33]. There have been more recommender systems that Embedding features
Layer
use attention [43] and self-attention [4, 13, 16, 27, 39].
Many previous works [13, 16, 27] only perform offline evalua- id features TransAct
Static features
tions using public datasets. However, the online environment is
PinnerFormer
more challenging and unpredictable. Our method is not directly
… …
comparable to these works due to differences in the problem for-
non-realtime user actions realtime user actions
mulation. Our approach resembles a Click-through Rate (CTR)
prediction task. Deep Interest Network (DIN) uses an attention Figure 2: Pinterest Homefeed ranking model (Pinnability)
mechanism to model the dependency within users’ past actions in
CTR prediction tasks. Alibaba’s Behavior Sequence Transformer Each training sample is (𝒙, 𝒚), where 𝒙 represents a set of fea-
(BST) [4] is the improved version of DIN and is closely related to our tures, and 𝒚 ∈ {0, 1} |𝑯 | . Each entry in 𝒚 corresponds to the label of
work. They propose to use Transformer to capture the user interest 2A "repin" on Pinterest refers to the action of saving an existing pin to another board
from user actions, emphasizing the importance of the action order. by a user.
5251
KDD ’23, August 6–10, 2023, Long Beach, CA, USA Xue Xia et al.
an action head in 𝑯 . The loss function of Pinnability is a weighted 3.3.1 Feature encoding. The relevance of pins that a user has en-
cross-entropy loss, designed to optimize for multi-label classifica- gaged with can be determined by the types of actions taken on them
tion tasks. We formulate the loss function as: in the user’s action history. For example, a pin repinned to a user’s
board is typically considered more relevant than one that the user
only viewed. If a pin is hidden by the user, the relevance should be
∑︁
L = 𝑤𝑢 {−𝑤ℎ [𝑦ℎ log 𝑓 (𝒙)ℎ + (1 − 𝑦ℎ ) (1 − log 𝑓 (𝒙)ℎ )]} (1)
very low. To incorporate this important information, we use train-
ℎ∈𝐻
able embedding tables to project action types to low-dimensional
where 𝑓 (𝒙) ∈ (0, 1) 𝐻 , and 𝑓 (𝒙)ℎ is the output probability of head vectors. The user action type sequence is then projected to a user
ℎ. 𝑦ℎ ∈ {0, 1} is the ground truth on head ℎ. action embedding matrix 𝑾 𝑎𝑐𝑡𝑖𝑜𝑛𝑠 ∈ R |𝑆 | ×𝑑𝑎𝑐𝑡𝑖𝑜𝑛 , where 𝑑𝑎𝑐𝑡𝑖𝑜𝑛 is
A weight 𝑤ℎ is applied on the cross entropy of each head’s output the dimension of action type embedding.
𝑓 (𝒙)ℎ . 𝑤ℎ is calculated using the ground truth 𝒚 and a label weight As mentioned earlier, the content of pins in the user action
matrix 𝑴 ∈ R |𝐻 |∗|𝐻 | as follows: sequence is represented by PinSage embeddings [38]. Therefore, the
∑︁ content of all pins in the user action sequence is a matrix 𝑾 𝑝𝑖𝑛𝑠 ∈
𝑤ℎ = 𝑴 ℎ,𝑎 × 𝑦𝑎 (2) R |𝑆 | ×𝑑𝑃𝑖𝑛𝑆𝑎𝑔𝑒 . The final encoded user action sequence feature is
CONCAT(𝑾 𝑎𝑐𝑡𝑖𝑜𝑛𝑠 , 𝑾 𝑝𝑖𝑛𝑠 ) ∈ R |𝑆 | × (𝑑𝑃𝑖𝑛𝑆𝑎𝑔𝑒 +𝑑𝑎𝑐𝑡𝑖𝑜𝑛 ) .
𝑎∈𝐻
The label weight matrix 𝑴 acts as a controlling factor for the
contribution of each action to the loss term of each head3 . Note that 3.3.2 Early fusion. One of the unique advantages of using user
if 𝑴 is a diagonal matrix, Eq (1) reduces to a standard multi-head action sequence features directly in the ranking model is that we
binary cross entropy loss. But selecting empirically determined can explicitly model the interactions between the candidate pin
label weights 𝑴 improves performance considerably. and the user’s engaged pins. Early fusion in recommendation tasks
In addition, each training example is weighted by a user-dependent refers to merging user and item features at an early stage of the
weight 𝑤𝑢 , which is determined by user attributes, such as the recommendation model. Through experiments, we find that early
user state4 , gender and location. We compute 𝑤𝑢 by multiplying fusion is an important factor to improve ranking performance. Two
user state weight, user gender weight, and user location weight: early fusion methods are evaluated:
𝑤𝑢 = 𝑤 state × 𝑤 location × 𝑤 gender . These weights are adjusted based • append: Append candidate pin’s PinSage embedding to user
on specific business needs. action sequence as the last entry of the sequence, similar to
BST [4]. Use a zero vector to serve as a dummy action type
3.2 Realtime User Action Sequence Features for candidate pin.
User’s past action history is naturally a variable length feature • concat: For each action in the user action sequence, con-
– different users have different amounts of past actions on the catenate the candidate pin’s PinSage embedding with user
platform. action features.
Although a longer user action sequence usually means more We choose concat as our early fusion method based on the offline
accurate user interest representation, in practice, it is infeasible experiment results. The resulting sequence feature with early fusion
to include all user actions. Because the time needed to fetch user is a 2-d matrix 𝑼 ∈ R |𝑆 | ×𝑑 , where 𝑑 = (𝑑𝑎𝑐𝑡𝑖𝑜𝑛 + 2𝑑𝑃𝑖𝑛𝑆𝑎𝑔𝑒 )
action features and perform ranking model inference can also grow
substantially, which in turn hurts user experience and system effi- 3.3.3 Sequence Aggregation Model. With the user action sequence
ciency. Considering infrastructure cost and latency requirements, feature 𝑼 prepared, the next challenge is to efficiently aggregate
we choose to include each user’s most recent 100 actions in the all the information in the user action sequence to represent the
sequence. For users with less than 100 actions, we pad the feature user’s short-term preference. Some popular model architectures for
to the length of 100 with 0s. The user action sequence features are sequential modeling in the industry include CNN[40], RNN [25]
sorted by timestamp in descending order, i.e. the first entry being and recently transformer [33], etc. We experimented with different
the most recent action. sequence aggregation architectures and choose transformer-based
All actions in the user action sequence are pin-level actions. For architectures. We employed the standard transformer encoder with
each action, we use three primary features: the timestamp of the 2 encoder layers and one head. The hidden dimension of feed for-
action, action type, and the 32-dimensional PinSage embedding ward network is denoted as 𝑑ℎ𝑖𝑑𝑑𝑒𝑛 . Positional encoding is not
[38] of the pin. PinSage is a compact embedding that encodes a used here because our offline experiment showed that position
pin’s content information. information is ineffective5 .
3.3.4 Random Time Window Mask. Training on all recent actions
3.3 Our Approach: TransAct of a user can lead to a rabbit hole effect, where the model recom-
Unlike static features, the realtime user action sequence feature mends content similar to the user’s recent engagements. This hurts
𝑺 (𝑢) = [𝑎 1, 𝑎 2, ..., 𝑎𝑛 ] is handled using a specialized sub-module the diversity of users’ Homefeeds, which is harmful to long-term
called TransAct. TransAct extracts sequential patterns from the user retention. To address this issue, we use the timestamps of the
user’s historical behavior and predicts (𝑢, 𝑝) relevance scores. user action sequence to build a time window mask for the trans-
3 For more details, see Appendix A
former encoder. This mask filters out certain positions in the input
4 User states are used to group users of different behavior patterns, for example, users sequence before the self-attention mechanism is applied. In each
who engage daily are in one group, while those who engage once a month have a
different user state 5 For more details about positional encoding, see Appendix B.
5252
TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest KDD ’23, August 6–10, 2023, Long Beach, CA, USA
forward pass, a random time window 𝑇 is sampled uniformly from 3.4.2 GPU serving. Pinnability with TransAct is 65 times more
0 to 24 hours. All actions taken within (𝑡𝑟𝑒𝑞𝑢𝑒𝑠𝑡 − 𝑇 , 𝑡𝑟𝑒𝑞𝑢𝑒𝑠𝑡 ) are computationally complex compared to its predecessors in terms
masked, where 𝑡𝑟𝑒𝑞𝑢𝑒𝑠𝑡 stands for the timestamp of receiving the of floating point operations. Without any breakthroughs in model
ranking request. It is important to note that the random time win- inference, our model serving cost and latency would increase by
dow mask is only applied during training, while at inference time, the same scale. GPU model inference allows us to serve Pinnability
the mask is not used. with TransAct at neutral latency and cost6 .
The main challenge to serve Pinnability on GPUs is the CUDA
3.3.5 Transformer Output Compression. The output of the trans-
kernel launch overhead. The CPU cost of launching operations on
former encoder is a matrix 𝑶 = (𝒐 0 : 𝒐 |𝑆 | −1 ) ∈ R |𝑆 | ×𝑑 . We only
the GPU is very high, but it is often overshadowed by the prolonged
take the first 𝐾 columns (𝒐 0 : 𝒐𝐾 −1 ), concatenated them with the
GPU computation time. However, this is problematic for Pinnabil-
max pooling vector MAXPOOL(𝑶) ∈ R𝑑 , and flattened it to a vec-
ity GPU model serving in two ways. First, Pinnability and recom-
tor 𝒛 ∈ R (𝐾+1)∗𝑑 . The first 𝐾 output columns capture users’ most mender models in general process hundreds of features, which
recent interests and MAXPOOL(𝑶) represents users’ longer-term pref- means that there is a large number of CUDA kernels. Second, the
erence over 𝑆 (𝑢). Since the output is compact enough, it can be batch size during online serving is small and hence each CUDA
easily integrated into the Pinnability framework using the DCN kernel requires little computation. With a large number of small
v2 [35] feature crossing layer. CUDA kernels, the launching overhead is much more expensive
than the actual computation. We solved the technical challenge
Output
through the following optimizations:
Fuse CUDA kernels. An effective approach is to fuse operations
as much as possible. We leverage standard deep learning compilers
such as nvFuser7 but often found human intervention is needed for
max pool
many of the remaining operations. One example is our embedding
d
<latexit sha1_base64="pbrTBi+I5TdMb2KRSYVNYXSeyj4=">AAAB6HicbVDLSgNBEOz1GeMr6tHLYBA8hV3xdQx68ZiAeUCyhNnZ3mTM7OwyMyuEkC/w4kERr36SN//GSbIHTSxoKKq66e4KUsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZIphg2WiES1A6pRcIkNw43AdqqQxoHAVjC8m/qtJ1SaJ/LBjFL0Y9qXPOKMGivVw16p7FbcGcgy8XJShhy1XumrGyYsi1EaJqjWHc9NjT+mynAmcFLsZhpTyoa0jx1LJY1R++PZoRNyapWQRImyJQ2Zqb8nxjTWehQHtjOmZqAXvan4n9fJTHTjj7lMM4OSzRdFmSAmIdOvScgVMiNGllCmuL2VsAFVlBmbTdGG4C2+vEya5xXvqnJZvyhXb/M4CnAMJ3AGHlxDFe6hBg1ggPAMr/DmPDovzrvzMW9dcfKZI/gD5/MHyymM8g==</latexit>
|S| k cols
<latexit sha1_base64="0qnebSNwpn5sT8995J1u2BHMI/I=">AAAB7HicbVBNSwMxEJ31s9avqkcvwSJ4seyKX8eiF48V3bbQLiWbZtvQbLIkWaFs+xu8eFDEqz/Im//GtN2Dtj4YeLw3w8y8MOFMG9f9dpaWV1bX1gsbxc2t7Z3d0t5+XctUEeoTyaVqhlhTzgT1DTOcNhNFcRxy2ggHtxO/8USVZlI8mmFCgxj3BIsYwcZK/uhhdDrolMpuxZ0CLRIvJ2XIUeuUvtpdSdKYCkM41rrluYkJMqwMI5yOi+1U0wSTAe7RlqUCx1QH2fTYMTq2ShdFUtkSBk3V3xMZjrUexqHtjLHp63lvIv7ntVITXQcZE0lqqCCzRVHKkZFo8jnqMkWJ4UNLMFHM3opIHytMjM2naEPw5l9eJPWzindZubg/L1dv8jgKcAhHcAIeXEEV7qAGPhBg8Ayv8OYI58V5dz5mrUtOPnMAf+B8/gCnfI6Z</latexit>
k cols
<latexit sha1_base64="d/RhcRSMbEv4ivpbAndL8To6sfA=">AAAB6HicbVDLSgNBEOz1GeMr6tHLYBA8hV3xdQx68ZiAeUCyhNlJbzJmdnaZmRXCki/w4kERr36SN//GSbIHTSxoKKq66e4KEsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZwqhg0Wi1i1A6pRcIkNw43AdqKQRoHAVjC6m/qtJ1Sax/LBjBP0IzqQPOSMGivVR71S2a24M5Bl4uWkDDlqvdJXtx+zNEJpmKBadzw3MX5GleFM4KTYTTUmlI3oADuWShqh9rPZoRNyapU+CWNlSxoyU39PZDTSehwFtjOiZqgXvan4n9dJTXjjZ1wmqUHJ5ovCVBATk+nXpM8VMiPGllCmuL2VsCFVlBmbTdGG4C2+vEya5xXvqnJZvyhXb/M4CnAMJ3AGHlxDFe6hBg1ggPAMr/DmPDovzrvzMW9dcfKZI/gD5/MH1cWM+Q==</latexit>
a2
<latexit sha1_base64="q9AISI9UburW4eeMTsXbP+uc7zA=">AAAB6nicbVDLSgNBEOz1GeMr6tHLYBA8hd3g6xj04jGieUCyhNlJbzJkdnaZmRXCkk/w4kERr36RN//GSbIHTSxoKKq66e4KEsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZwqhg0Wi1i1A6pRcIkNw43AdqKQRoHAVjC6nfqtJ1Sax/LRjBP0IzqQPOSMGis90F61Vyq7FXcGsky8nJQhR71X+ur2Y5ZGKA0TVOuO5ybGz6gynAmcFLupxoSyER1gx1JJI9R+Njt1Qk6t0idhrGxJQ2bq74mMRlqPo8B2RtQM9aI3Ff/zOqkJr/2MyyQ1KNl8UZgKYmIy/Zv0uUJmxNgSyhS3txI2pIoyY9Mp2hC8xZeXSbNa8S4rF/fn5dpNHkcBjuEEzsCDK6jBHdShAQwG8Ayv8OYI58V5dz7mrStOPnMEf+B8/gDtW42U</latexit>
an
<latexit sha1_base64="+HklQ2ImAJDmqUhjBI5PNbdwCBA=">AAAB6nicbVDLSgNBEOyNrxhfUY9eBoPgKeyKr2PQi8eI5gHJEnons8mQ2dllZlYISz7BiwdFvPpF3vwbJ8keNLGgoajqprsrSATXxnW/ncLK6tr6RnGztLW9s7tX3j9o6jhVlDVoLGLVDlAzwSVrGG4EayeKYRQI1gpGt1O/9cSU5rF8NOOE+REOJA85RWOlB+zJXrniVt0ZyDLxclKBHPVe+avbj2kaMWmoQK07npsYP0NlOBVsUuqmmiVIRzhgHUslRkz72ezUCTmxSp+EsbIlDZmpvycyjLQeR4HtjNAM9aI3Ff/zOqkJr/2MyyQ1TNL5ojAVxMRk+jfpc8WoEWNLkCpubyV0iAqpsemUbAje4svLpHlW9S6rF/fnldpNHkcRjuAYTsGDK6jBHdShARQG8Ayv8OYI58V5dz7mrQUnnzmEP3A+fwBIWo3Q</latexit>
S(u)
the scheduling overhead of transferring hundreds of tensors indi-
Figure 3: TransAct architecture. Note that this is a submod- vidually to transferring one tensor.
ule that can be plugged into any similar architecture like Form larger batches. For CPU-based inference, smaller batches
Pinnability are preferred to increase parallelism and reduce latency. However,
for GPU-based inference, larger batches are more efficient [29]. This
led us to re-evaluate our distributed system setup. Initially, we used
3.4 Model Productionization a scatter-gather architecture to split requests into small batches
3.4.1 Model Retraining. Retraining is important for recommender and run them in parallel on multiple leaf nodes for better latency.
systems because it allows the system to continuously adapt to However, this setup did not work well with GPU-based inference.
changing user behavior and preferences over time. Without re- Instead, we use the larger batches in the original requests directly.
training, a recommender system’s performance can degrade as the To compensate for the loss of cache capacity, we implemented a
user’s behavior and preferences change, leading to less accurate hybrid cache that uses both DRAM and SSD.
recommendations [26]. This holds especially true when we use Utilize CUDA graphs. We relied on CUDA Graphs9 to com-
realtime features in ranking. The model is more time sensitive and pletely eliminate the remaining small operations overhead. CUDA
requires frequent retraining. Otherwise, the model can become Graphs capture the model inference process as a static graph of
stale in a matter of days, leading to less accurate predictions. We
6 For
retrain Pinnability from scratch twice per week. We find that this more details about model effiency, see Appendix C.
7 https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-
retraining frequency is essential to ensure a consistent engagement
pytorch/
rate and still maintain a manageable training cost. We will dive into 8 https://github.com/NVIDIA/cuCollections
5253
KDD ’23, August 6–10, 2023, Long Beach, CA, USA Xue Xia et al.
operations instead of individually scheduled ones, allowing the 4.2 Offline Experiment
computation to be executed as a single unit without any kernel 4.2.1 Metrics. The offline evaluation data, unlike training data,
launching overheads. is randomly sampled from FVL to represent the true distribution
of the real-world traffic. With this sampling strategy, the offline
3.4.3 Realtime Feature Processing. When a user takes an action, a evaluation data is representative of the entire population, reducing
realtime feature processing application based on Flink10 consumes the variance of evaluation results.
user action Kafka11 streams generated from front-end events. It In addition to sampling bias, we also eliminate position bias
validates each action record, detects and combines duplicates, and in offline evaluation data. Position bias refers to the tendency for
manages any time discrepancies from multiple data sources. The items at the top of a recommendation to receive more attention
application then materializes the features and stores them in Rock- and engagement than the items lower down the list. This can be a
store [3]. At serving time, each Homefeed logging/serving request problem when evaluating a ranking model, as it can distort the eval-
triggers the processor to convert sequence features into a format uation results and make it difficult to accurately assess the model’s
that can be utilized by the model. performance. To avoid position bias, we randomize the order of
pins in a very small portion of Homefeed recommendation sessions.
4 EXPERIMENT This is done by shuffling the recommendations before presenting
In this section, we will present extensive offline and online A/B them to users. We gather the FVL for those randomized sessions
experiment results of TransAct. We compare TransAct with baseline and only use randomized data to perform the offline evaluation.
models using Pinterest’s internal training data. Our model is evaluated on HIT@3. A chunk 𝒄 = [𝑝 1, 𝑝 2, . . . , 𝑝𝑛 ]
refers to a group of pins that are recommended to a user at the same
time. Each input instance to the ranking model is associated with
4.1 Experiment Setup a user id 𝑢_𝑖𝑑, a pin id 𝑝_𝑖𝑑, and a chunk id 𝑐_𝑖𝑑. The evaluation
4.1.1 Dataset. We construct the offline training dataset from three output is grouped by (𝑢_𝑖𝑑, 𝑐_𝑖𝑑) so that it contains the model
weeks of Pinterest Homefeed view log (FVL). The model is trained output from the same ranking request. We sort the pins from the
on the first two weeks of FVL and evaluated on the third week. same ranking request by a final ranking score S, which is a linear
The training data is sampled based on user state and labels. For combination of Pinnability output heads 𝑓 (𝒙).
example, we design the sampling ratio for different label actions ∑︁
based on their statistical distribution and importance. In addition, S= 𝑢ℎ 𝑓 (𝒙)ℎ (3)
since users only engage with a small portion of pins shown on ℎ∈𝐻
their Homefeed page, most of the training samples are negative
samples. To balance the highly skewed dataset and improve model Then we take the top 𝐾 ranked pins in each chunk and calculate
accuracy, we employ downsampling on the negative samples and the hit@K for all heads, denoted by 𝛽𝑐,ℎ , which is defined as the
set a fixed ratio between the positive and negative samples. Our number of topK-ranked pins whose labels of ℎ are 1. For example, if
training dataset contains 3 billion training instances of 177 million a chunk 𝒄 = [𝑝 1, 𝑝 2, 𝑝 3, . . . , 𝑝𝑛 ] is sorted by S, and the user repins
users and 720 million pins. 𝑝 1 and 𝑝 4 , then hit@K of repin 𝛽𝑐,𝑟𝑒𝑝𝑖𝑛 = 1 when 𝐾 = 3.
In this paper, we conduct all experiments with the Pinterest We calculate the aggregated HIT@3 for each head ℎ as follows:
dataset. We do not use public datasets as they lack the necessary Í Í
realtime user action sequence metadata features, such as item em- 𝑢 ∈𝑈 𝑐 ∈𝐶𝑢 𝛽𝑐,ℎ
𝐻𝐼𝑇 @3/ℎ = (4)
beddings and action types, required by TransAct. Furthermore, they |𝑈 |
are incompatible with our proposal of realtime-batch hybrid model, It is important to note that for actions indicating positive engage-
which requires both realtime and batch user features. And they ment, such as repin or click, a higher HIT@K score means better
cannot be tested in online A/B experiments. model performance. Conversely, for actions indicating negative
engagement, such as hide, a lower HIT@K/hide score is desirable.
4.1.2 Hyperparameters. Realtime user sequence length is |𝑆 | = At Pinterest, a non-core user is defined as a user who has not
100 and the dimension of action embedding 𝑑𝑎𝑐𝑡𝑖𝑜𝑛 = 32. The actively saved pins to boards within the past 28 days. Non-core
encoded sequence feature is passed through a transformer encoder users tend to be less active and therefore pose a challenge in terms
composed of 2 transformer blocks, with a default dropout rate of of improving their recommendation relevance due to their limited
0.1. The feed forward network in the transformer encoder layer has historical engagement. This is also referred to as the cold-start
a dimension of 𝑑ℎ𝑖𝑑𝑑𝑒𝑛 = 32, and positional encoding is not used. user problem in recommendation [19]. Despite the challenges, it
The implementation is done using PyTorch. We use an Adam [14] is important to retain non-core users as they play a crucial role
optimizer with a learning rate scheduler. The learning rate begins in maintaining a diverse and thriving community, contributing to
with a warm-up phase of 5000 steps, gradually increasing to 0.0048, long-term platform growth.
and finally reduced through cosine annealing. The batch size is All reported results are statistically significant (p-value < 0.05)
12000. unless stated otherwise.
5254
TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest KDD ’23, August 6–10, 2023, Long Beach, CA, USA
Table 1: Offline evaluation of comparing existing methods Table 2: Ablation study of realtime-batch hybrid model
with TransAct. (∗ statistically insignificant) Other User
TransAct PF HIT@3/repin HIT@3/hide
HIT@3/repin HIT@3/hide Features
Methods
all non-core all non-core ✓ ✓ ✓ — —
✓ ✕ ✓ -2.46% +3.61%
WDL + seq +0.21% +0.35% -1.61% -1.55% ✕ ✓ ✓ -8.59% +17.45%
BST (all actions) +4.41% +5.09% +2.33% +3.59% ✓ ✓ ✕ -0.67% +1.40%
BST (positive actions) +7.34% +8.16% -1.12%∗ -3.14%∗
TransAct +9.40% +10.42% -14.86% -13.54%
5255
KDD ’23, August 6–10, 2023, Long Beach, CA, USA Xue Xia et al.
the model performance, we evaluate the model on different lengths Table 4: Ablation study of transformer output compression
of user sequence input. Output Compression Size HIT@3/repin HIT@3/hide
a random col 𝑑 +6.80% -10.96%
first col 𝑑 +7.82% -11.28%
random K cols 𝐾𝑑 +7.42% -12.12%
first K cols 𝐾𝑑 +9.38% -14.33%
all cols |𝑆 |𝑑 +8.86% -15.70%
max pooling 𝑑 +6.38% -14.15%
first K cols + max pool (𝐾 + 1)𝑑 +9.41% -14.86%
all cols + max pool (|𝑆 | + 1)𝑑 +8.67% -12.64%
represents the most recently engaged pins and the max pooling
is an aggregated representation of the entire sequence, Although
Figure 4: Effect of early fusion and sequence length on rank- using all columns improved HIT@3/hide slightly, the combination
ing model performance (HIT@3/repin, HIT@3/hide) of the first K columns and max pooling provided a good balance
between performance and latency. We use K=10 for TransAct.
An analysis of Figure 4 reveals that there is a positive correla-
tion between sequence length and performance. The performance 4.4 Online Experiment
improvement increases at a rate that is sub-linear with respect to Compared with offline evaluation, one advantage of online exper-
the sequence length. The use of concatenation as the early fusion iments in recommendation tasks is that they can be run on live
method was found to be superior to the use of appending. There- user data, allowing the model to be tested in a more realistic and
fore, the optimal engagement gain can be achieved by utilizing the dynamic environment. For the online experiment, we serve the
maximum available sequence length and employing concatenation ranking model trained on the 2-week offline training dataset. We
as the early fusion method. set the control group to be the Pinnability model without any re-
altime user sequence features. The treatment group is Pinnability
4.3.4 Transformer hyperparameters. We optimized TransAct’s trans-
model with TransAct. Each experiment group serves 1.5% of the
former encoder by adjusting its hyperparameters. As shown in
total users who visit Homefeed page.
Figure 5, increasing the number of transformer layers and feed for-
ward dimension leads to higher latency and also better performance. 4.4.1 metrics. On Homefeed, one of the most important metrics
While the best performance was achieved using 4 transformer lay- is Homefeed repin volume. Repin is the strongest indicator that
ers and 384 as the feed forward dimension, this came at the cost users find the recommended pins relevant, and is usually positively
of a 30% increase in latency, which does not meet the latency re- correlated to the amount of time users spend on Pinterest. Em-
quirement. To balance performance and user experience, we chose pirically, we found that offline HIT@3/repin usually aligns very
2 transformer layers and 32 as the hidden dimension. well with Homefeed online repin volume. Another important met-
ric is Homefeed hide volume, which measures the proportion
Feedforward Dimension of recommended items that users choose to hide or remove from
30.0% 32
Latency (vs TransAct)
5256
TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest KDD ’23, August 6–10, 2023, Long Beach, CA, USA
5257
KDD ’23, August 6–10, 2023, Long Beach, CA, USA Xue Xia et al.
REFERENCES sequential behavior data for click-through rate prediction. In Proceedings of the
[1] 2016. Search serving and ranking at Pinterest. https://medium.com/the-graph/ 29th ACM International Conference on Information & Knowledge Management.
search-serving-and-ranking-at-pinterest-224707599c92 2685–2692.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural ma- [22] Steffen Rendle. 2010. Factorization Machines. In 2010 IEEE International Confer-
chine translation by jointly learning to align and translate. arXiv preprint ence on Data Mining. 995–1000. https://doi.org/10.1109/ICDM.2010.127
arXiv:1409.0473 (2014). [23] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International conference
[3] Jessica Chan. 2022. 3 Innovations While Unifying Pinterest’s Key-Value on data mining. IEEE, 995–1000.
Storage. https://medium.com/@Pinterest_Engineering/3-innovations-while- [24] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based
unifying-pinterests-key-value-storage-8cdcdf8cf6aa collaborative filtering recommendation algorithms. In Proceedings of the 10th
[4] Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Be- international conference on World Wide Web. 285–295.
havior Sequence Transformer for E-Commerce Recommendation in Alibaba. [25] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural net-
In Proceedings of the 1st International Workshop on Deep Learning Practice works. IEEE transactions on Signal Processing 45, 11 (1997), 2673–2681.
for High-Dimensional Sparse Data (Anchorage, Alaska) (DLP-KDD ’19). Asso- [26] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Di-
ciation for Computing Machinery, New York, NY, USA, Article 12, 4 pages. etmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and
https://doi.org/10.1145/3326937.3341261 Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Sys-
[5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, tems. In Advances in Neural Information Processing Systems, C. Cortes,
Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28.
2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/
workshop on deep learning for recommender systems. 7–10. 86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
[6] Tim Donkers, Benedikt Loepp, and Jürgen Ziegler. 2017. Sequential user-based [27] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang.
recurrent neural network recommendations. In Proceedings of the eleventh ACM 2019. BERT4Rec: Sequential recommendation with bidirectional encoder rep-
conference on recommender systems. 152–160. resentations from transformer. In Proceedings of the 28th ACM international
[7] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. conference on information and knowledge management. 1441–1450.
DeepFM: a factorization-machine based neural network for CTR prediction. arXiv [28] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang.
preprint arXiv:1703.04247 (2017). 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Rep-
[8] Ruining He and Julian McAuley. 2016. Fusing similarity models with markov resentations from Transformer. CoRR abs/1904.06690 (2019). arXiv:1904.06690
chains for sparse sequential recommendation. In 2016 IEEE 16th international http://arxiv.org/abs/1904.06690
conference on data mining (ICDM). IEEE, 191–200. [29] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. 2017. Efficient
[9] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 12
2015. Session-based recommendations with recurrent neural networks. arXiv (2017), 2295–2329.
preprint arXiv:1511.06939 (2015). [30] Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved recurrent neural
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. networks for session-based recommendations. In Proceedings of the 1st workshop
Neural Computation 9 (1997), 1735–1780. on deep learning for recommender systems. 17–22.
[11] Haoji Hu, Xiangnan He, Jinyang Gao, and Zhi-Li Zhang. 2020. Modeling personal- [31] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommenda-
ized item frequency information for next-basket recommendation. In Proceedings tion via convolutional sequence embedding. In Proceedings of the eleventh ACM
of the 43rd International ACM SIGIR Conference on Research and Development in international conference on web search and data mining. 565–573.
Information Retrieval. 1071–1080. [32] Trinh Xuan Tuan and Tu Minh Phuong. 2017. 3D convolutional networks
[12] Rong Jin, Joyce Y. Chai, and Luo Si. 2004. An Automatic Weighting Scheme for session-based recommendation with content features. In Proceedings of the
for Collaborative Filtering. In Proceedings of the 27th Annual International ACM eleventh ACM conference on recommender systems. 138–146.
SIGIR Conference on Research and Development in Information Retrieval (Sheffield, [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
United Kingdom) (SIGIR ’04). Association for Computing Machinery, New York, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All
NY, USA, 337–344. https://doi.org/10.1145/1008992.1009051 you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von
[13] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recom- Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.),
mendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/
197–206. 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[14] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Opti- [34] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network
mization. In 3rd International Conference on Learning Representations, ICLR 2015, for Ad Click Predictions. In Proceedings of the ADKDD’17 (Halifax, NS, Canada)
San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio (ADKDD’17). Association for Computing Machinery, New York, NY, USA, Article
and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980 12, 7 pages. https://doi.org/10.1145/3124749.3124754
[15] Eileen Li. 2019. Pin2Interest: A scalable system for content classifica- [35] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong,
tion. https://medium.com/pinterest-engineering/pin2interest-a-scalable- and Ed Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons
system-for-content-classification-41a586675ee7 for Web-Scale Learning to Rank Systems. In Proceedings of the Web Conference
[16] Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time interval aware self- 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery,
attention for sequential recommendation. In Proceedings of the 13th international New York, NY, USA, 1785–1797. https://doi.org/10.1145/3442381.3450078
conference on web search and data mining. 322–330. [36] Yuyan Wang, Mohit Sharma, Can Xu, Sriraj Badam, Qian Sun, Lee Richardson,
[17] David C. Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C. Ma, Lisa Chung, Ed H. Chi, and Minmin Chen. 2022. Surrogate for Long-Term User
Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related Pins at Pinterest: Experience in Recommender Systems. In Proceedings of the 28th ACM SIGKDD
The Evolution of a Real-World Recommender System. In Proceedings of the 26th Conference on Knowledge Discovery and Data Mining (Washington DC, USA)
International Conference on World Wide Web Companion (Perth, Australia) (WWW (KDD ’22). Association for Computing Machinery, New York, NY, USA, 4100–4109.
’17 Companion). International World Wide Web Conferences Steering Committee, https://doi.org/10.1145/3534678.3539073
Republic and Canton of Geneva, CHE, 583–592. https://doi.org/10.1145/3041021. [37] Jiajing Xu, Andrew Zhai, and Charles Rosenberg. 2022. Rethinking Personalized
3054202 Ranking at Pinterest: An End-to-End Approach. In Proceedings of the 16th ACM
[18] Hao Ma, Irwin King, and Michael R. Lyu. 2007. Effective Missing Data Prediction Conference on Recommender Systems (Seattle, WA, USA) (RecSys ’22). Association
for Collaborative Filtering. In Proceedings of the 30th Annual International ACM for Computing Machinery, New York, NY, USA, 502–505. https://doi.org/10.
SIGIR Conference on Research and Development in Information Retrieval (Amster- 1145/3523227.3547394
dam, The Netherlands) (SIGIR ’07). Association for Computing Machinery, New [38] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton,
York, NY, USA, 39–46. https://doi.org/10.1145/1277741.1277751 and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale
[19] Senthilselvan Natarajan, Subramaniyaswamy Vairavasundaram, Sivaramakrish- Recommender Systems. In Proceedings of the 24th ACM SIGKDD International
nan Natarajan, and Amir H Gandomi. 2020. Resolving data sparsity and cold Conference on Knowledge Discovery and Data Mining (London, United Kingdom)
start problem in collaborative filtering recommender system using linked open (KDD ’18). Association for Computing Machinery, New York, NY, USA, 974–983.
data. Expert Systems with Applications 149 (2020), 113248. https://doi.org/10.1145/3219819.3219890
[20] Nikil Pancha, Andrew Zhai, Jure Leskovec, and Charles Rosenberg. 2022. Pinner- [39] Shuai Zhang, Yi Tay, Lina Yao, Aixin Sun, and Jake An. 2019. Next item recom-
Former: Sequence Modeling for User Representation at Pinterest. In Proceedings mendation with self-attentive metric learning. In Thirty-Third AAAI Conference
of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining on Artificial Intelligence, Vol. 9.
(Washington DC, USA) (KDD ’22). Association for Computing Machinery, New [40] Bendong Zhao, Huanzhang Lu, Shangfeng Chen, Junliang Liu, and Dongya Wu.
York, NY, USA, 3702–3712. https://doi.org/10.1145/3534678.3539156 2017. Convolutional neural networks for time series classification. Journal of
[21] Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Systems Engineering and Electronics 28, 1 (2017), 162–169.
Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong
5258
TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest KDD ’23, August 6–10, 2023, Long Beach, CA, USA
Table 8: Offline evaluation of different positional encoding • If a user only hides a pin (𝒚ℎ𝑖𝑑𝑒 = 1), and does not repin
methods compared with TransAct or click (𝒚𝑟𝑒𝑝𝑖𝑛 = 𝒚𝑐𝑙𝑖𝑐𝑘 = 0). Then we want to penalize
Positional encoding method HIT@3/hide HIT@3/repin the model more if it predicts repin or click, by setting the
𝑴 𝑟𝑒𝑝𝑖𝑛,ℎ𝑖𝑑𝑒 and 𝑴 𝑐𝑙𝑖𝑐𝑘,ℎ𝑖𝑑𝑒 to a large value.
None (TransAct) - - 𝑤𝑟𝑒𝑝𝑖𝑛 = 𝑴 𝑟𝑒𝑝𝑖𝑛,ℎ𝑖𝑑𝑒 ∗ 𝒚ℎ𝑖𝑑𝑒 = 100.
From scratch +0.86% -0.61% 𝑤𝑐𝑙𝑖𝑐𝑘 = 𝑴 𝑐𝑙𝑖𝑐𝑘,ℎ𝑖𝑑𝑒 ∗ 𝒚ℎ𝑖𝑑𝑒 = 100.
Sinusoidal +0.78% -0.13% • If a user only repins a pin (𝒚𝑟𝑒𝑝𝑖𝑛 = 1), but does not hide or
Linear projection ∗ +2.29% +0.19% click (𝒚ℎ𝑖𝑑𝑒 = 𝒚𝑐𝑙𝑖𝑐𝑘 = 0). We want to penalize the model if
it predicts hide: 𝑤ℎ𝑖𝑑𝑒 = 𝑴 ℎ𝑖𝑑𝑒,𝑟𝑒𝑝𝑖𝑛 ∗ 𝒚𝑟𝑒𝑝𝑖𝑛 = 5.
[41] Bo Zhao, Koichiro Narita, Burkay Orten, and John Egan. 2018. Notification But we do not need to penalize the model if it predicts a
Volume Control and Optimization System at Pinterest. In Proceedings of the 24th click, because a user could repin and click the same pin.
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
(London, United Kingdom) (KDD ’18). Association for Computing Machinery, 𝑤𝑐𝑙𝑖𝑐𝑘 = 𝑴 𝑐𝑙𝑖𝑐𝑘,𝑟𝑒𝑝𝑖𝑛 ∗ 𝒚𝑟𝑒𝑝𝑖𝑛 = 0
New York, NY, USA, 1012–1020. https://doi.org/10.1145/3219819.3219906
[42] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang B POSITIONAL ENCODING
Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate
prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. We tried several positional encoding approaches: learning positional
5941–5948.
[43] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui
embedding from scratch, sinusoidal positional encoding [33], and
Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click- linear projection positional encoding as proposed in [4]. Table8
Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International shows that positional encoding does not add much value.
Conference on Knowledge Discovery & Data Mining (London, United Kingdom)
(KDD ’18). Association for Computing Machinery, New York, NY, USA, 1059–1068.
https://doi.org/10.1145/3219819.3219823 C MODEL EFFICIENCY
Table 9 shows more detailed information on the efficiency of our
A HEAD WEIGHTING model, including number of flops, model forward latency per batch
We illustrate how head weighting helps the multi-task prediction (batch size = 256), and serving cost. The serving cost is not linearly
task here. Consider the following example, of a model using 3 correlated with model forward latency because it is also related to
actions: repins, clicks, and hides. The label weight matrix is set as server configurations such as time out limit, batch size, etc. GPU
Table 7. serving optimization is important to maintain low latency and
serving cost.
Table 7: An example of label weight matrix 𝑴 with 3 actions
Action Table 9: Model Efficiency Numbers from Serving Optimiza-
click repin hide
Head tion
click 100 0 100
Baseline(CPU) TransAct(CPU) TransAct(GPU)
repin 0 100 100
hide 1 5 10 Parameters 60M 92M 92M
flops 1M 77M 77M
Hides are a strong negative action, while repins and clicks are Latency 22ms 712ms 8ms
both positive engagements, although repins are considered a stronger Serving Cost 1x 32x 1x
positive signal than clicks. We set the value of 𝑴 manually, to con-
trol the weight on cross-entropy loss. Here, we give some examples
of how this is achieved.
5259