
TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest

Xue Xia (xxia@pinterest.com), Pong Eksombatchai (pong@pinterest.com), Nikil Pancha (npancha@pinterest.com), Dhruvil Deven Badani (dbadani@pinterest.com), Po-Wei Wang (poweiwang@pinterest.com), Neng Gu (ngu@pinterest.com), Saurabh Vishwas Joshi (sjoshi@pinterest.com), Nazanin Farahpour (nfarahpour@pinterest.com), Zhiyuan Zhang (zhiyuan@pinterest.com), and Andrew Zhai* (andrew@aideate.ai)
Pinterest, San Francisco, CA, USA

* Work done at Pinterest.
ABSTRACT
Sequential models that encode user activity for next action prediction have become a popular design choice for building web-scale personalized recommendation systems. Traditional methods of sequential recommendation either utilize end-to-end learning on realtime user actions, or learn user representations separately in an offline batch-generated manner. This paper (1) presents Pinterest's ranking architecture for Homefeed, our personalized recommendation product and the largest engagement surface; (2) proposes TransAct, a sequential model that extracts users' short-term preferences from their realtime activities; (3) describes our hybrid approach to ranking, which combines end-to-end sequential modeling via TransAct with batch-generated user embeddings. The hybrid approach allows us to combine the advantages of responsiveness from learning directly on realtime user activity with the cost-effectiveness of batch user representations learned over a longer time period. We describe the results of ablation studies, the challenges we faced during productionization, and the outcome of an online A/B experiment, which validates the effectiveness of our hybrid ranking model. We further demonstrate the effectiveness of TransAct on other surfaces such as contextual recommendations and search. Our model has been deployed to production in Homefeed, Related Pins, Notifications, and Search at Pinterest.

CCS CONCEPTS
• Information systems → Web searching and information discovery; Content ranking; Personalization.

KEYWORDS
Personalization, Recommender Systems, Sequential Recommendation, User Interest Modeling

ACM Reference Format:
Xue Xia, Pong Eksombatchai, Nikil Pancha, Dhruvil Deven Badani, Po-Wei Wang, Neng Gu, Saurabh Vishwas Joshi, Nazanin Farahpour, Zhiyuan Zhang, and Andrew Zhai. 2023. TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23), August 6–10, 2023, Long Beach, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3580305.3599918

1 INTRODUCTION
The proliferation of online content in recent years has created an overwhelming amount of information for users to navigate. To address this issue, recommender systems are employed in various industries to help users find relevant items from a vast selection, including products, images, videos, and music. By providing personalized recommendations, businesses and organizations can better serve their users and keep them engaged with the platform. Therefore, recommender systems are vital for businesses as they drive growth by boosting engagement, sales, and revenue.

As one of the largest content sharing and social media platforms, Pinterest hosts billions of pins with rich contextual and visual information, and brings inspiration to over 400 million users.


Upon visiting Pinterest, users are immediately presented with the Homefeed page as shown in Figure 1, which serves as the primary source of inspiration and accounts for the majority of overall user engagement on the platform. The Homefeed page is powered by a 3-stage recommender system that retrieves, ranks, and blends content based on user interests and activities. At the retrieval stage, we filter billions of pins created on Pinterest down to thousands, based on a variety of factors such as user interests, followed boards, etc. Then we use a pointwise ranking model to rank candidate pins by predicting their personalized relevance to users. Finally, the ranked result is adjusted using a blending layer to meet business requirements.

[Figure 1: Pinterest Homefeed Page]

Realtime recommendation is crucial because it provides quick and up-to-date recommendations to users, improving their overall experience and satisfaction. The integration of realtime data, such as recent user actions, results in more accurate recommendations and increases the probability of users discovering relevant items [4, 21].

Longer user action sequences result in improved user representation and hence better recommendation performance. However, using long sequences in ranking poses challenges to infrastructure, as they require significant computational resources and can result in increased latency. To address this challenge, some approaches have utilized hashing and nearest neighbor search in long user sequences [21]. Other work encodes users' past actions over an extended time frame into a user embedding [20] to represent long-term user interests. Such user embedding features are often generated as batch features (e.g. generated daily), which are cost-effective to serve across multiple applications with low latency. The limitation of existing sequential recommendation methods is that they either only use realtime user actions, or only use a batch user representation learned from long-term user action history.

We introduce a novel realtime-batch hybrid ranking approach that combines both realtime user action signals and batch user representations. To capture the realtime actions of users, we present TransAct, a new transformer-based module designed to encode recent user action sequences and comprehend users' immediate preferences. For user actions that occur over an extended period of time, we transform them into a batch user representation [20].

By combining the expressive power of TransAct with batch user embeddings, the hybrid ranking model offers users realtime feedback on their recent actions, while also accounting for their long-term interests. The realtime component and batch component complement each other for recommendation accuracy. This leads to an overall improvement in the user experience on the Homefeed page.

The major contributions of this paper are summarized as follows:
• We describe Pinnability, the architecture of Pinterest's Homefeed production ranking system. The Homefeed personalized recommendation product accounts for the majority of the overall user engagement on Pinterest.
• We propose TransAct, a transformer-based realtime user action sequential model that effectively captures users' short-term interests from their recent actions. We demonstrate that combining TransAct with daily-generated user representations [20] into a hybrid model leads to the best performance in Pinnability. This design choice is justified through a comprehensive ablation study. Our code implementation is publicly available¹.
• We describe the serving optimizations implemented in Pinnability to make feasible the 65x increase in computational complexity when introducing TransAct to the Pinnability model. Specifically, optimizations are done to enable GPU serving of our prior CPU-based model.
• We describe online A/B experiments on a real-world recommendation system using TransAct. We demonstrate some practical issues in the online environment, such as a recommendation diversity drop and engagement decay, and propose solutions to address these issues.

The remainder of this paper is organized as follows: Related work is reviewed in Section 2. Section 3 describes the design of TransAct and the details of bringing it to production. Experiment results are reported in Section 4. We discuss some findings beyond the experiments in Section 5. Finally, we conclude our work in Section 6.

¹ Our code is available on Github: https://github.com/pinterest/transformer_user_action

2 RELATED WORK

2.1 Recommender System
Collaborative filtering (CF) [12, 18, 24] makes recommendations based on the assumption that a user will prefer items that other similar users prefer. It uses the user behavior history to compute the similarity between users and items and recommends items based on this similarity. This approach suffers from the sparsity of the user-item matrix and cannot handle users who have never interacted with any items. Factorization machines [22, 23], on the other hand, are able to handle sparse matrices.

More recently, deep learning (DL) has been used in click-through rate (CTR) prediction tasks. For example, Google uses Wide & Deep [5] models for application recommendation. The wide component achieves memorization by capturing the interaction between features, while the deep component helps with generalization by learning the embedding of categorical features using a feed forward network.


DeepFM [7] makes improvements by learning both low-order and high-order feature interactions automatically. DCN [34] and its upgraded version DCN v2 [35] both aim to automatically model explicit feature crosses. The aforementioned recommender systems do not work well in capturing the short-term interests of users, since only static user features are utilized. These methods also tend to ignore the sequential relationship within the action history of a user, resulting in an inadequate representation of user preferences.

2.2 Sequential Recommendation
To address this problem, sequential recommendation has been widely studied in both academia and industry. A sequential recommendation system uses the behavior history of users as input and applies recommendation algorithms to suggest appropriate items to users. Sequential recommendation models are able to capture users' long-term preferences over an extended period of time, similar to traditional recommendation methods. Additionally, they have the added benefit of being able to account for users' evolving interests, which enables higher quality recommendations. Sequential recommendation is often viewed as a next item prediction task, where the goal is to predict a user's next action based on their past action sequence. We are inspired by a previous sequential recommendation method [4] in terms of encoding users' past actions into a dense representation. Some early sequential recommendation systems use machine learning techniques, such as Markov Chains [8] and session-based K nearest neighbors (KNN) [11], to model the temporal dependencies among interactions in users' action history. These models are criticized for not being able to fully capture the long-term patterns of users, as they simply combine information from different sessions. Recently, deep learning techniques such as recurrent neural networks (RNN) [25] have shown great success in natural language processing and have become increasingly popular in sequential recommendation. As a result, many DL-based sequential models [6, 9, 30, 42] have achieved outstanding performance using RNNs. Convolutional neural networks (CNNs) [40] are widely used for processing time-series and image data. In the context of sequential recommendation, CNN-based models can effectively learn the dependency within a set of items users recently interacted with, and make recommendations accordingly [31, 32]. The attention mechanism originated in the neural machine translation task, where it models the importance of different parts of the input sentences on the output words [2]. Self-attention is a mechanism known to weigh the importance of different parts of an input sequence [33]. A growing number of recommender systems use attention [43] and self-attention [4, 13, 16, 27, 39].

Many previous works [13, 16, 27] only perform offline evaluations using public datasets. However, the online environment is more challenging and unpredictable. Our method is not directly comparable to these works due to differences in the problem formulation. Our approach resembles a click-through rate (CTR) prediction task. Deep Interest Network (DIN) uses an attention mechanism to model the dependency within users' past actions in CTR prediction tasks. Alibaba's Behavior Sequence Transformer (BST) [4] is the improved version of DIN and is closely related to our work. They propose to use a Transformer to capture the user interest from user actions, emphasizing the importance of the action order. However, we found that positional information does not add much value. We find other designs, like better early fusion and action type embedding, are effective when dealing with sequence features.

3 METHODOLOGY
In this section, we introduce TransAct, our realtime-batch hybrid ranking model. We start with an overview of the Pinterest Homefeed ranking model, Pinnability. We then describe how we use TransAct to encode the realtime user action sequence features in Pinnability for the ranking task.

3.1 Preliminary: Homefeed Ranking Model
In Homefeed ranking, we model the recommendation task as a pointwise multi-task prediction problem, which can be defined as follows: given a user u and a pin p, we build a function to predict the probabilities of user u performing different actions on the candidate pin p. The set of different actions contains both positive and negative actions, e.g. click, repin² and hide.

We build Pinnability, Pinterest's Homefeed ranking model, to approach the above problem. The high-level architecture is a Wide and Deep learning (WDL) model [5]. The Pinnability model utilizes various types of input signals, such as user signals, pin signals, and context signals. These inputs can come in different formats, including categorical, numerical, and embedding features.

We use embedding layers to project categorical features to dense features, and perform batch normalization on numerical features. We then apply a feature cross using a full-rank DCN V2 [35] to explicitly model feature interactions. At last, we use fully connected layers with a set of output action heads H = {h_1, h_2, ..., h_k} to predict the user actions on the candidate pin p. Each head maps to one action. As shown in Figure 2, our model is a realtime-batch hybrid model that encodes the user action history features by both realtime (TransAct) and batch (PinnerFormer) approaches and optimizes for the ranking task [37].

² A "repin" on Pinterest refers to the action of saving an existing pin to another board by a user.

[Figure 2: Pinterest Homefeed ranking model (Pinnability)]
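To make this stack concrete, here is a minimal, illustrative PyTorch sketch of the dense path (categorical embeddings, TransAct, and PinnerFormer inputs are omitted); the class names and layer sizes are our own assumptions, while the cross layer follows the published DCN V2 formulation x_{l+1} = x_0 * (W x_l + b) + x_l:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One full-rank DCN V2 cross layer: x_{l+1} = x0 * (W x_l + b) + x_l."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0, xl):
        return x0 * self.linear(xl) + xl

class PinnabilitySkeleton(nn.Module):
    """Dense features -> DCN V2 feature crossing -> MLP -> one logit per action head."""
    def __init__(self, dim: int, num_heads: int, num_cross_layers: int = 2):
        super().__init__()
        self.cross = nn.ModuleList([CrossLayer(dim) for _ in range(num_cross_layers)])
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, num_heads))

    def forward(self, x):          # x: (B, dim) concatenated dense features
        x0, xl = x, x
        for layer in self.cross:
            xl = layer(x0, xl)
        return self.mlp(xl)        # logits, one per action head h in H
```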
Each training sample is (x, y), where x represents a set of features and y ∈ {0, 1}^{|H|}.


Each entry in y corresponds to the label of an action head in H. The loss function of Pinnability is a weighted cross-entropy loss, designed to optimize for multi-label classification tasks. We formulate the loss function as:

\mathcal{L} = w_u \sum_{h \in H} \left\{ -w_h \left[ y_h \log f(\mathbf{x})_h + (1 - y_h) \log\big(1 - f(\mathbf{x})_h\big) \right] \right\}   (1)

where f(x) ∈ (0, 1)^{|H|}, and f(x)_h is the output probability of head h. y_h ∈ {0, 1} is the ground truth on head h.

A weight w_h is applied on the cross entropy of each head's output f(x)_h. w_h is calculated using the ground truth y and a label weight matrix M ∈ R^{|H| × |H|} as follows:

w_h = \sum_{a \in H} M_{h,a} \, y_a   (2)

The label weight matrix M acts as a controlling factor for the contribution of each action to the loss term of each head³. Note that if M is a diagonal matrix, Eq. (1) reduces to a standard multi-head binary cross-entropy loss, but selecting empirically determined label weights M improves performance considerably.

In addition, each training example is weighted by a user-dependent weight w_u, which is determined by user attributes such as the user state⁴, gender, and location. We compute w_u by multiplying the user state weight, user gender weight, and user location weight: w_u = w_state × w_location × w_gender. These weights are adjusted based on specific business needs.

³ For more details, see Appendix A.
⁴ User states are used to group users with different behavior patterns; for example, users who engage daily are in one group, while those who engage once a month are in a different user state.
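As a sanity check on Eqs. (1) and (2), the following is a minimal PyTorch sketch of this weighted multi-head cross-entropy; the function name and tensor shapes are our own illustrative assumptions:

```python
import torch

def pinnability_loss(logits, y, M, w_u):
    # logits: (B, |H|) head outputs; y: (B, |H|) binary labels
    # M: (|H|, |H|) label weight matrix; w_u: (B,) per-user weights
    f = torch.sigmoid(logits)            # f(x)_h in (0, 1)
    w_h = y @ M.T                        # Eq. (2): w_h = sum_a M[h, a] * y_a
    bce = -(y * torch.log(f) + (1 - y) * torch.log1p(-f))  # per-head cross entropy
    return (w_u.unsqueeze(1) * w_h * bce).sum(dim=1).mean()
```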
3.2 Realtime User Action Sequence Features
A user's past action history is naturally a variable-length feature: different users have different amounts of past actions on the platform.

Although a longer user action sequence usually means a more accurate user interest representation, in practice it is infeasible to include all user actions, because the time needed to fetch user action features and perform ranking model inference can grow substantially, which in turn hurts user experience and system efficiency. Considering infrastructure cost and latency requirements, we choose to include each user's most recent 100 actions in the sequence. For users with fewer than 100 actions, we pad the feature to the length of 100 with 0s. The user action sequence features are sorted by timestamp in descending order, i.e. the first entry is the most recent action.

All actions in the user action sequence are pin-level actions. For each action, we use three primary features: the timestamp of the action, the action type, and the 32-dimensional PinSage embedding [38] of the pin. PinSage is a compact embedding that encodes a pin's content information.

3.3 Our Approach: TransAct
Unlike static features, the realtime user action sequence feature S(u) = [a_1, a_2, ..., a_n] is handled using a specialized sub-module called TransAct. TransAct extracts sequential patterns from the user's historical behavior and predicts (u, p) relevance scores.

3.3.1 Feature encoding. The relevance of pins that a user has engaged with can be determined by the types of actions taken on them in the user's action history. For example, a pin repinned to a user's board is typically considered more relevant than one that the user only viewed. If a pin is hidden by the user, the relevance should be very low. To incorporate this important information, we use trainable embedding tables to project action types to low-dimensional vectors. The user action type sequence is then projected to a user action embedding matrix W_actions ∈ R^{|S| × d_action}, where d_action is the dimension of the action type embedding.

As mentioned earlier, the content of pins in the user action sequence is represented by PinSage embeddings [38]. Therefore, the content of all pins in the user action sequence is a matrix W_pins ∈ R^{|S| × d_PinSage}. The final encoded user action sequence feature is CONCAT(W_actions, W_pins) ∈ R^{|S| × (d_PinSage + d_action)}.
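A minimal sketch of this encoding step as a PyTorch module; the module name and the action-type vocabulary size are hypothetical:

```python
import torch
import torch.nn as nn

class ActionSequenceEncoding(nn.Module):
    def __init__(self, num_action_types: int = 20, d_action: int = 32):
        super().__init__()
        # Trainable embedding table projecting action types to low-dim vectors.
        self.action_emb = nn.Embedding(num_action_types, d_action)

    def forward(self, action_types, pinsage_seq):
        # action_types: (B, |S|) integer ids; pinsage_seq: (B, |S|, d_PinSage)
        w_actions = self.action_emb(action_types)           # (B, |S|, d_action)
        # CONCAT(W_actions, W_pins): (B, |S|, d_PinSage + d_action)
        return torch.cat([w_actions, pinsage_seq], dim=-1)
```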
3.3.2 Early fusion. One of the unique advantages of using user action sequence features directly in the ranking model is that we can explicitly model the interactions between the candidate pin and the user's engaged pins. Early fusion in recommendation tasks refers to merging user and item features at an early stage of the recommendation model. Through experiments, we find that early fusion is an important factor in improving ranking performance. Two early fusion methods are evaluated:
• append: Append the candidate pin's PinSage embedding to the user action sequence as the last entry of the sequence, similar to BST [4]. Use a zero vector as a dummy action type for the candidate pin.
• concat: For each action in the user action sequence, concatenate the candidate pin's PinSage embedding with the user action features.

We choose concat as our early fusion method based on the offline experiment results. The resulting sequence feature with early fusion is a 2-d matrix U ∈ R^{|S| × d}, where d = d_action + 2d_PinSage (see the sketch below).
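A minimal sketch of the concat early fusion, with shapes following the notation above; the function name is ours:

```python
import torch

def concat_early_fusion(seq_feat, candidate_pinsage):
    # seq_feat: (B, |S|, d_action + d_PinSage); candidate_pinsage: (B, d_PinSage)
    s_len = seq_feat.size(1)
    cand = candidate_pinsage.unsqueeze(1).expand(-1, s_len, -1)  # tile onto every position
    # Resulting U: (B, |S|, d) with d = d_action + 2 * d_PinSage
    return torch.cat([seq_feat, cand], dim=-1)
```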
3.3.3 Sequence Aggregation Model. With the user action sequence feature U prepared, the next challenge is to efficiently aggregate all the information in the user action sequence to represent the user's short-term preference. Some popular model architectures for sequential modeling in the industry include CNNs [40], RNNs [25], and, recently, transformers [33]. We experimented with different sequence aggregation architectures and chose a transformer-based architecture. We employ the standard transformer encoder with 2 encoder layers and one head. The hidden dimension of the feed forward network is denoted as d_hidden. Positional encoding is not used here because our offline experiments showed that positional information is ineffective⁵.

⁵ For more details about positional encoding, see Appendix B.
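A sketch of this encoder using PyTorch's stock transformer modules, under the stated settings (2 layers, 1 head, feed forward dimension d_hidden, no positional encoding); the input dimension follows Section 3.3.2 with 32-dimensional PinSage embeddings, and other details are assumptions:

```python
import torch.nn as nn

d = 32 + 2 * 32          # d_action + 2 * d_PinSage for 32-dim PinSage embeddings
d_hidden = 32

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d, nhead=1, dim_feedforward=d_hidden,
    dropout=0.1, batch_first=True)
seq_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
# O = seq_encoder(U, src_key_padding_mask=mask)  # (B, |S|, d); no positional encoding added
```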


3.3.4 Random Time Window Mask. Training on all recent actions of a user can lead to a rabbit hole effect, where the model recommends content similar to the user's recent engagements. This hurts the diversity of users' Homefeeds, which is harmful to long-term user retention. To address this issue, we use the timestamps of the user action sequence to build a time window mask for the transformer encoder. This mask filters out certain positions in the input sequence before the self-attention mechanism is applied. In each forward pass, a random time window T is sampled uniformly from 0 to 24 hours. All actions taken within (t_request − T, t_request) are masked, where t_request stands for the timestamp of receiving the ranking request. It is important to note that the random time window mask is only applied during training; at inference time, the mask is not used.
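A minimal sketch of building this training-time mask; names are illustrative, and the boolean output would be fed to the encoder's masking argument so masked positions are ignored by self-attention:

```python
import torch

def random_time_window_mask(timestamps, t_request, max_window_secs=24 * 3600):
    # timestamps: (B, |S|) action times in seconds; t_request: (B, 1) request time
    T = torch.rand(timestamps.size(0), 1, device=timestamps.device) * max_window_secs
    # True marks actions inside (t_request - T, t_request); these get masked out.
    return timestamps > (t_request - T)
```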
3.3.5 Transformer Output Compression. The output of the transformer encoder is a matrix O = (o_0 : o_{|S|−1}) ∈ R^{|S| × d}. We take only the first K columns (o_0 : o_{K−1}), concatenate them with the max pooling vector MAXPOOL(O) ∈ R^d, and flatten the result into a vector z ∈ R^{(K+1)d}. The first K output columns capture users' most recent interests, and MAXPOOL(O) represents users' longer-term preference over S(u). Since the output is compact enough, it can be easily integrated into the Pinnability framework using the DCN v2 [35] feature crossing layer.
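A sketch of the compression step, following the notation above with O laid out as (B, |S|, d) and the most recent action first; the default K follows the value reported later in Section 4.3.5:

```python
import torch

def compress_transformer_output(O, K: int = 10):
    # O: (B, |S|, d) transformer encoder output, most recent action first
    first_k = O[:, :K, :]                          # (o_0 : o_{K-1}): recent interests
    pooled = O.max(dim=1).values.unsqueeze(1)      # MAXPOOL(O): longer-term preference
    return torch.cat([first_k, pooled], dim=1).flatten(start_dim=1)  # z: (B, (K+1)*d)
```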
[Figure 3: TransAct architecture. Note that this is a submodule that can be plugged into any similar architecture like Pinnability.]

3.4 Model Productionization
3.4.1 Model Retraining. Retraining is important for recommender systems because it allows the system to continuously adapt to changing user behavior and preferences over time. Without retraining, a recommender system's performance can degrade as the user's behavior and preferences change, leading to less accurate recommendations [26]. This holds especially true when we use realtime features in ranking. The model is more time sensitive and requires frequent retraining; otherwise, it can become stale in a matter of days, leading to less accurate predictions. We retrain Pinnability from scratch twice per week. We find that this retraining frequency is essential to ensure a consistent engagement rate while still maintaining a manageable training cost. We will dive into the importance of retraining in Section 4.4.3.

3.4.2 GPU serving. Pinnability with TransAct is 65 times more computationally complex than its predecessors in terms of floating point operations. Without any breakthroughs in model inference, our model serving cost and latency would increase by the same scale. GPU model inference allows us to serve Pinnability with TransAct at neutral latency and cost⁶.

The main challenge in serving Pinnability on GPUs is the CUDA kernel launch overhead. The CPU cost of launching operations on the GPU is very high, but it is often overshadowed by the prolonged GPU computation time. However, this is problematic for Pinnability GPU model serving in two ways. First, Pinnability and recommender models in general process hundreds of features, which means there is a large number of CUDA kernels. Second, the batch size during online serving is small, and hence each CUDA kernel requires little computation. With a large number of small CUDA kernels, the launching overhead is much more expensive than the actual computation. We solved this technical challenge through the following optimizations:

Fuse CUDA kernels. An effective approach is to fuse operations as much as possible. We leverage standard deep learning compilers such as nvFuser⁷, but often found that human intervention is needed for many of the remaining operations. One example is our embedding table lookup module, which consists of two computation steps: raw id to table index lookup, and table index to embedding lookup. This is repeated hundreds of times due to the large number of features. We significantly reduce the number of operations by leveraging cuCollections⁸ to support hash tables for the raw ids on GPUs, and by implementing a custom consolidated embedding lookup module to merge the lookups for multiple features into one. As a result, we reduced hundreds of operations related to sparse features into one.

Combine memory copies. For every inference, hundreds of features are copied from the CPU to the GPU memory as individual tensors. The overhead of scheduling hundreds of tensor copies becomes the bottleneck. To decrease the number of tensor copy operations, we combine multiple tensors into one continuous buffer before transferring them from CPU to GPU. This reduces the scheduling overhead from transferring hundreds of tensors individually to transferring one tensor.

Form larger batches. For CPU-based inference, smaller batches are preferred to increase parallelism and reduce latency. However, for GPU-based inference, larger batches are more efficient [29]. This led us to re-evaluate our distributed system setup. Initially, we used a scatter-gather architecture to split requests into small batches and run them in parallel on multiple leaf nodes for better latency. However, this setup did not work well with GPU-based inference. Instead, we use the larger batches in the original requests directly. To compensate for the loss of cache capacity, we implemented a hybrid cache that uses both DRAM and SSD.

Utilize CUDA graphs. We relied on CUDA Graphs⁹ to completely eliminate the remaining overhead of small operations.

⁶ For more details about model efficiency, see Appendix C.
⁷ https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-pytorch/
⁸ https://github.com/NVIDIA/cuCollections
⁹ https://developer.nvidia.com/blog/cuda-graphs/


CUDA Graphs capture the model inference process as a static graph of operations, rather than individually scheduled ones, allowing the computation to be executed as a single unit without any kernel launching overheads.
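A minimal sketch of graph capture and replay with PyTorch's torch.cuda.graph API; the tiny model and shapes are placeholders, and production use also needs warm-up iterations before capture:

```python
import torch

model = torch.nn.Linear(256, 8).cuda().eval()        # placeholder model
static_in = torch.randn(64, 256, device="cuda")      # fixed input buffer

g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_out = model(static_in)                    # captured once as a static graph

static_in.copy_(torch.randn(64, 256, device="cuda"))  # write new request data in place
g.replay()  # one launch replays all captured kernels; the result lands in static_out
```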
3.4.3 Realtime Feature Processing. When a user takes an action, a realtime feature processing application based on Flink¹⁰ consumes user action Kafka¹¹ streams generated from front-end events. It validates each action record, detects and combines duplicates, and manages any time discrepancies from multiple data sources. The application then materializes the features and stores them in Rockstore [3]. At serving time, each Homefeed logging/serving request triggers the processor to convert sequence features into a format that can be utilized by the model.

¹⁰ https://flink.apache.org/
¹¹ https://kafka.apache.org/

4 EXPERIMENT
In this section, we present extensive offline and online A/B experiment results for TransAct. We compare TransAct with baseline models using Pinterest's internal training data.

4.1 Experiment Setup
4.1.1 Dataset. We construct the offline training dataset from three weeks of Pinterest Homefeed view log (FVL). The model is trained on the first two weeks of FVL and evaluated on the third week. The training data is sampled based on user state and labels. For example, we design the sampling ratio for different label actions based on their statistical distribution and importance. In addition, since users only engage with a small portion of the pins shown on their Homefeed page, most of the training samples are negative samples. To balance the highly skewed dataset and improve model accuracy, we employ downsampling on the negative samples and set a fixed ratio between the positive and negative samples. Our training dataset contains 3 billion training instances from 177 million users and 720 million pins.

In this paper, we conduct all experiments with the Pinterest dataset. We do not use public datasets, as they lack the realtime user action sequence metadata features, such as item embeddings and action types, required by TransAct. Furthermore, they are incompatible with our proposed realtime-batch hybrid model, which requires both realtime and batch user features, and they cannot be tested in online A/B experiments.

4.1.2 Hyperparameters. The realtime user sequence length is |S| = 100 and the dimension of the action embedding is d_action = 32. The encoded sequence feature is passed through a transformer encoder composed of 2 transformer blocks, with a default dropout rate of 0.1. The feed forward network in the transformer encoder layer has a dimension of d_hidden = 32, and positional encoding is not used. The implementation is done using PyTorch. We use an Adam [14] optimizer with a learning rate scheduler. The learning rate begins with a warm-up phase of 5000 steps, gradually increasing to 0.0048, and is finally reduced through cosine annealing. The batch size is 12000.
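A sketch of this schedule (linear warm-up to 0.0048 over 5000 steps, then cosine annealing) expressed with a LambdaLR multiplier; the total step count and the model are placeholders we assume for illustration:

```python
import math
import torch

model = torch.nn.Linear(8, 1)                        # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.0048)

def lr_lambda(step, warmup=5000, total_steps=100_000):
    if step < warmup:
        return step / warmup                         # linear warm-up to the peak lr
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```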
4.2 Offline Experiment
4.2.1 Metrics. The offline evaluation data, unlike the training data, is randomly sampled from FVL to represent the true distribution of the real-world traffic. With this sampling strategy, the offline evaluation data is representative of the entire population, reducing the variance of evaluation results.

In addition to sampling bias, we also eliminate position bias in the offline evaluation data. Position bias refers to the tendency for items at the top of a recommendation to receive more attention and engagement than items lower down the list. This can be a problem when evaluating a ranking model, as it can distort the evaluation results and make it difficult to accurately assess the model's performance. To avoid position bias, we randomize the order of pins in a very small portion of Homefeed recommendation sessions. This is done by shuffling the recommendations before presenting them to users. We gather the FVL for those randomized sessions and only use randomized data to perform the offline evaluation.

Our model is evaluated on HIT@3. A chunk c = [p_1, p_2, ..., p_n] refers to a group of pins that are recommended to a user at the same time. Each input instance to the ranking model is associated with a user id u_id, a pin id p_id, and a chunk id c_id. The evaluation output is grouped by (u_id, c_id) so that it contains the model output from the same ranking request. We sort the pins from the same ranking request by a final ranking score S, which is a linear combination of the Pinnability output heads f(x):

S = \sum_{h \in H} u_h f(\mathbf{x})_h   (3)

Then we take the top K ranked pins in each chunk and calculate the hit@K for all heads, denoted by β_{c,h}, which is defined as the number of top-K-ranked pins whose labels for h are 1. For example, if a chunk c = [p_1, p_2, p_3, ..., p_n] is sorted by S, and the user repins p_1 and p_4, then the hit@K of repin β_{c,repin} = 1 when K = 3.

We calculate the aggregated HIT@3 for each head h as follows:

\mathrm{HIT@3}/h = \frac{\sum_{u \in U} \sum_{c \in C_u} \beta_{c,h}}{|U|}   (4)

It is important to note that for actions indicating positive engagement, such as repin or click, a higher HIT@K score means better model performance. Conversely, for actions indicating negative engagement, such as hide, a lower HIT@K/hide score is desirable.

At Pinterest, a non-core user is defined as a user who has not actively saved pins to boards within the past 28 days. Non-core users tend to be less active and therefore pose a challenge in terms of improving their recommendation relevance, due to their limited historical engagement. This is also referred to as the cold-start user problem in recommendation [19]. Despite the challenges, it is important to retain non-core users, as they play a crucial role in maintaining a diverse and thriving community, contributing to long-term platform growth.

All reported results are statistically significant (p-value < 0.05) unless stated otherwise.
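A minimal sketch of the HIT@K computation in Eqs. (3)-(4) for the repin head; the input row format is our own assumption:

```python
from collections import defaultdict

def hit_at_k(rows, k=3):
    # rows: (u_id, c_id, final_score, repin_label) tuples from randomized sessions
    chunks = defaultdict(list)
    for u_id, c_id, score, label in rows:
        chunks[(u_id, c_id)].append((score, label))
    per_user = defaultdict(float)
    for (u_id, _), pins in chunks.items():
        top_k = sorted(pins, key=lambda p: p[0], reverse=True)[:k]  # sort chunk by S
        per_user[u_id] += sum(label for _, label in top_k)          # beta_{c, repin}
    return sum(per_user.values()) / len(per_user)                   # Eq. (4): divide by |U|
```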

4.2.2 Results. We compare TransAct with existing methods of sequential recommendation. The first baseline is the WDL model [5] that incorporates sequence features as part of its wide features.


Due to the large size of the sequence features, the number of parameters in the feature cross layer would grow quadratically, making it unfeasible for both training and online serving. Therefore, we used average pooling over the PinSage embeddings of user actions to encode the sequence. The second baseline is Alibaba's behavior sequence transformer (BST) model [4]. We trained 2 BST model variants: one with only positive actions in the user sequence, the other with all actions. We opted not to compare our results with DIN [43], as BST has already demonstrated its superiority over DIN. Additionally, we did not compare with variants like BERT4Rec [28], as the problem formulations are different and a direct comparison is not feasible.

The results of the model comparison are presented in Table 1. It is evident that BST and TransAct outperform the WDL model, demonstrating the necessity of using a specialized sequential model to effectively capture short-term user preferences through realtime user action sequence features. BST performs well when only positive actions are encoded; however, it struggles to distinguish negative actions. In contrast, TransAct outperforms BST, particularly in terms of hide prediction, due to its ability to distinguish between different actions by encoding action types. Furthermore, TransAct also exhibits improved performance in HIT@3/repin compared to BST, which can be attributed to its effective early fusion and output compression design. A common trend across all groups is that the performance for non-core users is better than for all users; this is because realtime user action features are crucial for users with limited engagement history on the platform, as they provide the only source of information for the model to learn their preferences.

Table 1: Offline evaluation of existing methods compared with TransAct. (∗ statistically insignificant)

                          HIT@3/repin            HIT@3/hide
  Methods                 all        non-core    all         non-core
  WDL + seq               +0.21%     +0.35%      -1.61%      -1.55%
  BST (all actions)       +4.41%     +5.09%      +2.33%      +3.59%
  BST (positive actions)  +7.34%     +8.16%      -1.12%∗     -3.14%∗
  TransAct                +9.40%     +10.42%     -14.86%     -13.54%

4.3 Ablation Study
4.3.1 Hybrid ranking model. First, we investigate the effect of the realtime-batch hybrid design by examining the individual impact of TransAct (the realtime component) and PinnerFormer (the batch component). Table 2 shows the relative decrease in offline performance from the model containing all user features as we remove each component. TransAct captures users' immediate interests, which contribute the most to the user's overall engagement, while PinnerFormer (PF) [20] extracts users' long-term preferences from their historical behavior. We observe that TransAct is the most important user understanding feature in the model, but we still see value from the large-scale training and longer-term interests captured by PinnerFormer, showing that longer-term batch user understanding can complement a realtime engagement sequence for recommendations. In the last row of Table 2, we show that removing all user features other than TransAct and PinnerFormer only leads to a relatively small drop in performance, demonstrating the effectiveness of our combination of a realtime sequence model with a pre-trained batch model.

Table 2: Ablation study of the realtime-batch hybrid model

  TransAct  PF  Other User Features  HIT@3/repin  HIT@3/hide
  ✓         ✓   ✓                    —            —
  ✓         ✕   ✓                    -2.46%       +3.61%
  ✕         ✓   ✓                    -8.59%       +17.45%
  ✓         ✓   ✕                    -0.67%       +1.40%

4.3.2 Base sequence encoder architecture. We perform an offline evaluation of different sequential models that process realtime user sequence features. We use different architectures to encode the PinSage embedding sequence from users' realtime actions.

Average Pooling: use the average of the PinSage embeddings in the user sequence to represent the user's short-term interest.
CNN: use a 1-d CNN with 256 output channels to encode the sequence. The kernel size is 4 and the stride is 1.
RNN: use 2 RNN layers with a hidden dimension of 256 to encode the sequence of PinSage embeddings.
LSTM: use Long Short-Term Memory (LSTM) [10], a more sophisticated version of the RNN that better captures longer-term dependencies by using memory cells and gating. We use 2 LSTM layers with a hidden size of 256.
Vanilla Transformer: encode the PinSage embedding sequence directly using the Transformer encoder module. We use 2 transformer encoder layers with a hidden dimension of 32.

The baseline group is the Pinnability model without the realtime user sequence feature. From Table 3, we learned that using realtime user sequence features, even with a simple average pooling method, improves engagement. Surprisingly, more complex architectures like RNNs, CNNs, and LSTMs do not always perform better than average pooling. However, the best performance is achieved with the vanilla transformer, as it significantly reduces HIT@3/hide and improves HIT@3/repin.

Table 3: Offline evaluation of sequence encoder architectures

  Sequence Encoder     HIT@3/repin  HIT@3/hide
  Average Pooling      +0.21%       -1.61%
  CNN                  +0.08%       -1.29%
  RNN                  -1.05%       -2.46%
  LSTM                 -0.75%       -2.98%
  Vanilla Transformer  +1.56%       -8.45%

4.3.3 Early fusion and sequence length selection. As discussed in Section 3.3.2, early fusion plays a crucial role in the ranking model. By incorporating early fusion, the model can not only take into account the dependency between different items in the user's action history, but also explicitly learn the relationship between the ranking candidate pin and each pin that the user has engaged with in the past.

Longer user action sequences are naturally more expressive than short sequences.


To learn the effect of the input sequence length on model performance, we evaluate the model on different lengths of user sequence input.

[Figure 4: Effect of early fusion and sequence length on ranking model performance (HIT@3/repin, HIT@3/hide)]

An analysis of Figure 4 reveals a positive correlation between sequence length and performance. The performance improvement increases at a rate that is sub-linear with respect to the sequence length. The use of concatenation as the early fusion method was found to be superior to appending. Therefore, the optimal engagement gain can be achieved by utilizing the maximum available sequence length and employing concatenation as the early fusion method.

4.3.4 Transformer hyperparameters. We optimized TransAct's transformer encoder by adjusting its hyperparameters. As shown in Figure 5, increasing the number of transformer layers and the feed forward dimension leads to higher latency but also better performance. While the best performance was achieved using 4 transformer layers and a feed forward dimension of 384, this came at the cost of a 30% increase in latency, which does not meet the latency requirement. To balance performance and user experience, we chose 2 transformer layers and a hidden dimension of 32.

[Figure 5: Effect of transformer hyperparameters on model performance and latency. Each point plots HIT@3/repin (vs. TransAct) against latency (vs. TransAct), varying the feed forward dimension (32, 64, 128, 384) and the number of transformer layers (1, 2, 4).]

4.3.5 Transformer output compression. The transformer encoder produces O ∈ R^{d × |S|}, with each column corresponding to an input user action. However, directly using O as input to the DCN v2 layers for feature crossing would result in excessive time complexity, which is quadratic in the input size.

To address this issue, we explored several approaches to compress the transformer output. Table 4 shows that the highest gain in HIT@3/repin is achieved by combining the first K columns with max pooling applied over the entire sequence. The first K columns represent the most recently engaged pins, and the max pooling is an aggregated representation of the entire sequence. Although using all columns improved HIT@3/hide slightly, the combination of the first K columns and max pooling provided a good balance between performance and latency. We use K = 10 for TransAct.

Table 4: Ablation study of transformer output compression

  Output Compression       Size        HIT@3/repin  HIT@3/hide
  a random col             d           +6.80%       -10.96%
  first col                d           +7.82%       -11.28%
  random K cols            Kd          +7.42%       -12.12%
  first K cols             Kd          +9.38%       -14.33%
  all cols                 |S|d        +8.86%       -15.70%
  max pooling              d           +6.38%       -14.15%
  first K cols + max pool  (K + 1)d    +9.41%       -14.86%
  all cols + max pool      (|S| + 1)d  +8.67%       -12.64%

4.4 Online Experiment
Compared with offline evaluation, one advantage of online experiments in recommendation tasks is that they can be run on live user data, allowing the model to be tested in a more realistic and dynamic environment. For the online experiment, we serve the ranking model trained on the 2-week offline training dataset. We set the control group to be the Pinnability model without any realtime user sequence features. The treatment group is the Pinnability model with TransAct. Each experiment group serves 1.5% of the total users who visit the Homefeed page.

4.4.1 Metrics. On Homefeed, one of the most important metrics is Homefeed repin volume. Repin is the strongest indicator that users find the recommended pins relevant, and it is usually positively correlated with the amount of time users spend on Pinterest. Empirically, we found that offline HIT@3/repin usually aligns very well with online Homefeed repin volume. Another important metric is Homefeed hide volume, which measures the proportion of recommended items that users choose to hide or remove from their recommendations. High hide rates indicate that the system is recommending items that users do not find relevant, which can lead to a poor user experience. Conversely, low hide rates indicate that the system is recommending items that users find relevant and engaging, which can lead to a better user experience.

4.4.2 Online engagement. We observe significant online metric improvements with TransAct introduced to ranking. Table 5 shows that we improved the Homefeed repin volume by 11%. It is worth noting that the engagement gains for non-core users are higher, because they do not have a well-established user action history and realtime features can capture their interests in a short time. Using TransAct, the Homefeed page is able to respond quickly and adjust the ranking results in a timely manner. We see that hide volume dropped and that the overall time spent on Pinterest increased.

4.4.3 Model retrain. One challenge observed in the TransAct group was the decay of engagement metrics over time for a given user. As shown in Figure 6, we compare the Homefeed repin volume gain of TransAct to the baseline, with both groups either fixed or retrained.


We observed that if TransAct was not retrained, despite having significantly higher engagement on the first day of the experiment, it gradually decreased to a lower level over the course of two weeks. However, when TransAct was retrained on fresh data, there was a noticeable increase in engagement compared to not retraining the model. This suggests that TransAct, which utilizes realtime features, is highly sensitive to changes in user behavior and requires frequent retraining. Therefore, a high retraining frequency is desired when using TransAct. In production, we set the retraining frequency to twice a week, and this frequency has been proven to keep the engagement rate stable.

[Figure 6: Effect of retraining on TransAct]

Table 5: Online evaluation of TransAct

  Online Metrics         All Users  Non-core Users
  Homefeed Repin Volume  +11.0%     +17.0%
  Homefeed Hide Volume   -10.0%     -10.5%
  Overall Time Spent     +2.0%      +1.5%

4.4.4 Random time window masking. Another challenge observed was a drop in recommendation diversity. Diversity measures the broadness and variety of the items being recommended to a user. Previous literature [36] finds that diversity is associated with increased user visiting frequency. However, diversity is not always desirable, as it can lead to a drop in relevance. Therefore, it is crucial to find the right balance between relevance and diversity in recommendations.

At Pinterest, we have a 28k-node hierarchical interest taxonomy [15] that classifies all pins. The top-level interests are coarse; some examples are art, beauty, and sport. Here, we measure the impression diversity as the summation of the number of unique top-level interests viewed per user. We observe that with TransAct introduced to Homefeed ranking, the impression diversity dropped by 2% to 3%. The interpretation is that by adding the user action sequence feature, the ranking model learns to optimize for the user's short-term interest, and by focusing mainly on short-term interest, the diversity of the recommendations dropped.

We mitigate the diversity drop by using the random time window mask in the transformer, as mentioned in Section 3.3.4. This random masking encourages the model to focus on content other than only the most recent items a user engaged with. With this design, the diversity metric drop was brought back to only -1% without influencing relevance metrics like repin volume. We also tried using a higher dropout rate in the transformer encoder layer, and randomly masking out a fixed percentage of actions in the user action sequence input. However, neither of these methods yielded better results than the random time window masking: they increased diversity at the cost of an engagement drop.

5 DISCUSSION
5.1 Feedback Loop
An interesting finding is that our online experiment does not fully capture the true potential of TransAct. We observed a greater improvement in performance when the model was deployed as the production Homefeed ranking model for full traffic. This is due to the effect of a positive feedback loop: as users experience a more responsive Homefeed built on TransAct, they tend to engage with more relevant content, leading to changes in their behavior (such as more clicks or repins). These changes in behavior lead to shifts in the realtime user sequence feature, which are then used to generate new training data. Retraining the Homefeed ranking model with this updated data results in a positive compounding effect, leading to a higher engagement rate and a stronger feedback loop. This phenomenon is similar to "direct feedback loops" in the literature [26], which refer to a model directly influencing the selection of its own future training data; such loops are more difficult to detect if they occur gradually over time.

5.2 TransAct in Other Tasks
The versatility of TransAct extends beyond ranking tasks. It has been successfully applied in contextual recommendation and search ranking scenarios as well. TransAct is used in Related Pins [17] ranking, a contextual recommendation model that provides personalized recommendations of pins based on a given query pin. TransAct is also applied in Pinterest's Search ranking [1] system and notification ranking [41]. Table 6 showcases the effectiveness of TransAct in a variety of use cases and its potential to drive engagement in more real-world applications.

Table 6: TransAct's impact on other applications

  Application   Metrics         Δ
  Related Pins  Repin Volume    +2.8%
  Search        Repin Volume    +2.3%
  Notification  Email CTR       +1.4%
  Notification  Push Open Rate  +1.9%

6 CONCLUSIONS
In this paper, we present TransAct, a transformer-based realtime user action model that effectively captures users' short-term interests by encoding their realtime actions. Our novel hybrid ranking model merges the strengths of both realtime and batch approaches to encoding user actions, and it has been successfully deployed in the Homefeed recommendation system at Pinterest. The results of our offline experiments indicate that TransAct significantly outperforms state-of-the-art recommender system baselines. In addition, we have discussed and provided solutions for the challenges faced during online experimentation, such as high serving complexity, a diversity decrease, and engagement decay. The versatility and effectiveness of TransAct make it applicable to other tasks, such as contextual recommendations and search ranking.


REFERENCES
[1] 2016. Search serving and ranking at Pinterest. https://medium.com/the-graph/search-serving-and-ranking-at-pinterest-224707599c92
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Jessica Chan. 2022. 3 Innovations While Unifying Pinterest's Key-Value Storage. https://medium.com/@Pinterest_Engineering/3-innovations-while-unifying-pinterests-key-value-storage-8cdcdf8cf6aa
[4] Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior Sequence Transformer for E-Commerce Recommendation in Alibaba. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data (Anchorage, Alaska) (DLP-KDD '19). Association for Computing Machinery, New York, NY, USA, Article 12, 4 pages. https://doi.org/10.1145/3326937.3341261
[5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[6] Tim Donkers, Benedikt Loepp, and Jürgen Ziegler. 2017. Sequential user-based recurrent neural network recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 152–160.
[7] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[8] Ruining He and Julian McAuley. 2016. Fusing similarity models with Markov chains for sparse sequential recommendation. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 191–200.
[9] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
[11] Haoji Hu, Xiangnan He, Jinyang Gao, and Zhi-Li Zhang. 2020. Modeling personalized item frequency information for next-basket recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1071–1080.
[12] Rong Jin, Joyce Y. Chai, and Luo Si. 2004. An Automatic Weighting Scheme for Collaborative Filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Sheffield, United Kingdom) (SIGIR '04). Association for Computing Machinery, New York, NY, USA, 337–344. https://doi.org/10.1145/1008992.1009051
[13] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
[14] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
[15] Eileen Li. 2019. Pin2Interest: A scalable system for content classification. https://medium.com/pinterest-engineering/pin2interest-a-scalable-system-for-content-classification-41a586675ee7
[16] Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining. 322–330.
[17] David C. Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C. Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. 2017. Related Pins at Pinterest: The Evolution of a Real-World Recommender System. In Proceedings of the 26th International Conference on World Wide Web Companion (Perth, Australia) (WWW '17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 583–592. https://doi.org/10.1145/3041021.3054202
[18] Hao Ma, Irwin King, and Michael R. Lyu. 2007. Effective Missing Data Prediction for Collaborative Filtering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Amsterdam, The Netherlands) (SIGIR '07). Association for Computing Machinery, New York, NY, USA, 39–46. https://doi.org/10.1145/1277741.1277751
[19] Senthilselvan Natarajan, Subramaniyaswamy Vairavasundaram, Sivaramakrishnan Natarajan, and Amir H Gandomi. 2020. Resolving data sparsity and cold start problem in collaborative filtering recommender system using linked open data. Expert Systems with Applications 149 (2020), 113248.
[20] Nikil Pancha, Andrew Zhai, Jure Leskovec, and Charles Rosenberg. 2022. PinnerFormer: Sequence Modeling for User Representation at Pinterest. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD '22). Association for Computing Machinery, New York, NY, USA, 3702–3712. https://doi.org/10.1145/3534678.3539156
[21] Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692.
[22] Steffen Rendle. 2010. Factorization Machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000. https://doi.org/10.1109/ICDM.2010.127
[23] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
[24] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. 285–295.
[25] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[26] D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, and Dan Dennison. 2015. Hidden Technical Debt in Machine Learning Systems. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
[27] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1441–1450.
[28] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. CoRR abs/1904.06690 (2019). arXiv:1904.06690 http://arxiv.org/abs/1904.06690
[29] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[30] Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 17–22.
[31] Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565–573.
[32] Trinh Xuan Tuan and Tu Minh Phuong. 2017. 3D convolutional networks for session-based recommendation with content features. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 138–146.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[34] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD'17 (Halifax, NS, Canada) (ADKDD'17). Association for Computing Machinery, New York, NY, USA, Article 12, 7 pages. https://doi.org/10.1145/3124749.3124754
[35] Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW '21). Association for Computing Machinery, New York, NY, USA, 1785–1797. https://doi.org/10.1145/3442381.3450078
[36] Yuyan Wang, Mohit Sharma, Can Xu, Sriraj Badam, Qian Sun, Lee Richardson, Lisa Chung, Ed H. Chi, and Minmin Chen. 2022. Surrogate for Long-Term User Experience in Recommender Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD '22). Association for Computing Machinery, New York, NY, USA, 4100–4109. https://doi.org/10.1145/3534678.3539073
[37] Jiajing Xu, Andrew Zhai, and Charles Rosenberg. 2022. Rethinking Personalized Ranking at Pinterest: An End-to-End Approach. In Proceedings of the 16th ACM Conference on Recommender Systems (Seattle, WA, USA) (RecSys '22). Association for Computing Machinery, New York, NY, USA, 502–505. https://doi.org/10.1145/3523227.3547394
[38] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (London, United Kingdom) (KDD '18). Association for Computing Machinery, New York, NY, USA, 974–983. https://doi.org/10.1145/3219819.3219890
[39] Shuai Zhang, Yi Tay, Lina Yao, Aixin Sun, and Jake An. 2019. Next item recommendation with self-attentive metric learning. In Thirty-Third AAAI Conference on Artificial Intelligence, Vol. 9.
[40] Bendong Zhao, Huanzhang Lu, Shangfeng Chen, Junliang Liu, and Dongya Wu. 2017. Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics 28, 1 (2017), 162–169.

[41] Bo Zhao, Koichiro Narita, Burkay Orten, and John Egan. 2018. Notification Volume Control and Optimization System at Pinterest. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD '18). Association for Computing Machinery, New York, NY, USA, 1012–1020. https://doi.org/10.1145/3219819.3219906
[42] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.
[43] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom) (KDD '18). Association for Computing Machinery, New York, NY, USA, 1059–1068. https://doi.org/10.1145/3219819.3219823

A HEAD WEIGHTING
We illustrate here how head weighting helps the multi-task prediction. Consider the following example of a model using 3 actions: repins, clicks, and hides. The label weight matrix M is set as in Table 7.

Table 7: An example of label weight matrix M with 3 actions

Head \ Action | click | repin | hide
click | 100 | 0 | 100
repin | 0 | 100 | 100
hide | 1 | 5 | 10

Hides are a strong negative action, while repins and clicks are both positive engagements, although repins are considered a stronger positive signal than clicks. We set the value of M manually to control the weight on the cross-entropy loss. Here we give some examples of how this is achieved (see the sketch after this list):

• If a user only hides a pin (y_hide = 1), and does not repin or click (y_repin = y_click = 0), then we want to penalize the model more if it predicts repin or click, by setting M_repin,hide and M_click,hide to a large value: w_repin = M_repin,hide * y_hide = 100 and w_click = M_click,hide * y_hide = 100.
• If a user only repins a pin (y_repin = 1), but does not hide or click (y_hide = y_click = 0), we want to penalize the model if it predicts hide: w_hide = M_hide,repin * y_repin = 5. But we do not need to penalize the model if it predicts a click, because a user could repin and click the same pin: w_click = M_click,repin * y_repin = 0.
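One way to read the two examples above is as a matrix-vector product: the per-head weight is w = M @ y, where y is the multi-hot label vector over observed actions. The following NumPy sketch, an illustration of that reading rather than the production training code, reproduces the numbers above:

```python
import numpy as np

# Label weight matrix M from Table 7: rows are prediction heads,
# columns are observed user actions (order: click, repin, hide).
M = np.array([
    [100,   0, 100],  # click head
    [  0, 100, 100],  # repin head
    [  1,   5,  10],  # hide head
], dtype=np.float32)

def head_weights(y: np.ndarray) -> np.ndarray:
    """Per-head loss weights for one example: w_h = sum_a M[h, a] * y[a]."""
    return M @ y

# Hide-only example (y_hide = 1): repin and click heads get weight 100.
print(head_weights(np.array([0.0, 0.0, 1.0])))  # -> [100. 100.  10.]

# Repin-only example (y_repin = 1): hide head gets 5, click head gets 0.
print(head_weights(np.array([0.0, 1.0, 0.0])))  # -> [  0. 100.   5.]
```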
B POSITIONAL ENCODING
We tried several positional encoding approaches: learning positional embeddings from scratch, sinusoidal positional encoding [33], and linear projection positional encoding as proposed in [4]. Table 8 shows that positional encoding does not add much value.

Table 8: Offline evaluation of different positional encoding methods compared with TransAct

Positional encoding method | HIT@3/hide | HIT@3/repin
None (TransAct) | - | -
From scratch | +0.86% | -0.61%
Sinusoidal | +0.78% | -0.13%
Linear projection ∗ | +2.29% | +0.19%
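For reference, the sinusoidal variant is the standard fixed encoding of [33]; a minimal sketch follows (PyTorch assumed, even d_model assumed, shapes illustrative). The returned tensor would be added to the action embeddings before the transformer encoder.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard fixed sinusoidal encoding [33] of shape (seq_len, d_model).
    Assumes d_model is even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe
```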
C MODEL EFFICIENCY
Table 9 shows more detailed information on the efficiency of our model, including the number of FLOPs, model forward latency per batch (batch size = 256), and serving cost. The serving cost is not linearly correlated with model forward latency because it is also affected by server configurations such as the timeout limit, batch size, etc. GPU serving optimization is important for maintaining low latency and serving cost.

Table 9: Model Efficiency Numbers from Serving Optimization

Metric | Baseline (CPU) | TransAct (CPU) | TransAct (GPU)
Parameters | 60M | 92M | 92M
FLOPs | 1M | 77M | 77M
Latency | 22ms | 712ms | 8ms
Serving Cost | 1x | 32x | 1x
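The latency row in Table 9 reports forward time per batch of 256. As a rough illustration of how such a number could be measured offline (PyTorch assumed; model and batch are placeholders, and the reported numbers come from the production serving stack, not a script like this):

```python
import time
import torch

@torch.no_grad()
def forward_latency_ms(model: torch.nn.Module, batch: torch.Tensor,
                       warmup: int = 10, iters: int = 100) -> float:
    """Average forward latency per batch in milliseconds."""
    model.eval()
    for _ in range(warmup):       # warm up kernels and caches
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()  # drain queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```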
