Behavior Sequence Transformer

Subir Verma
4 min read · Oct 18, 2021

**Paper Study of Alibaba’s SOTA RecSys**

In this blog, I will try to explain some key concepts of BST and which factors can make it a game-changer. If you are not familiar with transformer-based architectures, I recommend reading this article first: The Illustrated Transformer.

Overview

Sequential deep learning models have proven benefits in the recommendation industry, generating more relevant and dynamic recommendations from users’ past sequences of actions. BST is Alibaba’s (Qiwei Chen et al.) take on an e-commerce recommendation system. Find the research paper here: https://arxiv.org/pdf/1905.06874.pdf

In this paper, the authors propose using the powerful Transformer model to capture the sequential signals underlying users’ behavior sequences for recommendation at Alibaba.
The RecSys at Alibaba is a two-stage pipeline: match and rank.

  • In the match stage, a set of similar items is selected according to the items users have interacted with (https://arxiv.org/pdf/1803.02349.pdf),
  • and in the rank stage, a fine-tuned prediction model is learned to predict the probability of the user clicking each item in the candidate set (see the sketch below).
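
Here is a minimal, runnable sketch of that match-then-rank flow. The retrieval by averaged item embeddings and the dummy scoring function are illustrative stand-ins, not the paper’s or Alibaba’s actual components.

```python
import numpy as np

def match(clicked_items, item_embeddings, num_candidates=10):
    """Match stage: retrieve items most similar to the user's clicked items."""
    profile = item_embeddings[clicked_items].mean(axis=0)   # crude user profile vector
    sims = item_embeddings @ profile                         # similarity to every item
    sims[clicked_items] = -np.inf                            # exclude already-clicked items
    return np.argsort(-sims)[:num_candidates]

def rank(clicked_items, candidates, score_fn):
    """Rank stage: score the click probability of each candidate with a learned model."""
    scores = np.array([score_fn(clicked_items, c) for c in candidates])
    return candidates[np.argsort(-scores)]

# Toy usage with random embeddings and a dummy scorer in place of a real ranking model.
rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(100, 16))
clicked = np.array([3, 17, 42])
candidates = match(clicked, item_embeddings)
print(rank(clicked, candidates, lambda seq, c: rng.random()))
```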

Model Architecture

The overview architecture of the proposed BST

The 3 key components of BST:

  • embedding layer
  • transformer layer
  • MLP

They use encoder blocks of the Transformer architecture, which rely on self-attention to combine signals from a user’s past interactions. Self-attention is a pretty effective mechanism for capturing recent changes in a user’s interests while preserving long-term context at the same time.
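
To make this concrete, here is a minimal sketch of an encoder layer applied to an embedded behavior sequence, using PyTorch’s built-in Transformer encoder layer; the dimensions are illustrative, not the paper’s configuration.

```python
import torch
import torch.nn as nn

d_model, seq_len, batch = 64, 20, 8

# One Transformer encoder block: multi-head self-attention + feed-forward network.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True
)

# One embedded behavior sequence per user: (batch, seq_len, d_model).
behavior_seq = torch.randn(batch, seq_len, d_model)

# Self-attention lets every position attend to every other position, so recent
# interest shifts and long-term context are mixed into each output position.
contextualized = encoder_layer(behavior_seq)
print(contextualized.shape)   # (8, 20, 64)
```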

Input Embedding

This layer reshapes various features from interaction, user, and item data and adds them together to create the final input vectors that feed the Transformer’s encoder layers.

There are two types of features to represent an item: “Sequence Item Features” (in red) and “Positional Features” (in dark blue), where “Sequence Item Features” include item_id and category_id.
There are also various other features, such as user profile features, item features, context features, and combinations of different features, i.e., cross features. Since this work focuses on modeling the behavior sequence with the Transformer, the authors denote all of these as “Other Features”.
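
A minimal sketch of this embedding step is below. It assumes item_id and category_id embeddings are summed with a learned positional embedding; the combination rule (sum vs. concatenation) and all sizes are illustrative, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

num_items, num_categories, max_len, d_model = 10_000, 500, 20, 64

item_emb = nn.Embedding(num_items, d_model)
cat_emb  = nn.Embedding(num_categories, d_model)
pos_emb  = nn.Embedding(max_len, d_model)   # learned positional embedding

def embed_sequence(item_ids, category_ids):
    """item_ids, category_ids: (batch, seq_len) integer tensors."""
    positions = torch.arange(item_ids.size(1)).unsqueeze(0)       # (1, seq_len)
    # Sum the per-item feature embeddings with the positional embedding.
    return item_emb(item_ids) + cat_emb(category_ids) + pos_emb(positions)

items = torch.randint(0, num_items, (4, max_len))
cats  = torch.randint(0, num_categories, (4, max_len))
print(embed_sequence(items, cats).shape)   # (4, 20, 64)
```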

Though combinations of features can be learned automatically by neural networks, the authors still incorporate some hand-crafted cross features, which had been demonstrated useful in their scenarios before the deep learning era.

Other Features

The way of representing Sequence Item Features in latent space has been discussed in detail here.

Now the interesting part: the Positional Features. The original Transformer authors proposed a positional embedding to capture the order information in sentences; likewise, order exists in users’ behavior sequences. So the “position” of each item is added as an input feature in the bottom layer, before it is projected to a low-dimensional vector.
Note that the position value of item v(i) is computed as pos(vi) = t(vt) − t(vi), where t(vt) represents the recommending time and t(vi) the timestamp when the user clicked item v(i).
Note: sinusoidal-based encoding of the relative time performed better than using the absolute relative time in days as a feature.
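
A small sketch of computing pos(vi) = t(vt) − t(vi) follows. The bucketization step before an embedding lookup is an assumption for illustration; the paper only defines the time-difference value itself.

```python
import numpy as np

def position_features(click_timestamps, recommend_time):
    """click_timestamps: unix timestamps (seconds) of the user's clicks."""
    deltas = recommend_time - np.asarray(click_timestamps)   # pos(vi) = t(vt) - t(vi)
    # Illustrative recency buckets (1 min, 1 h, 1 day, 1 week, 30 days) before embedding.
    bounds = [60, 3600, 86400, 7 * 86400, 30 * 86400]
    return np.digitize(deltas, bounds)

clicks = [1_700_000_000, 1_700_050_000, 1_700_090_000]
print(position_features(clicks, recommend_time=1_700_100_000))
```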

MLP layers and Loss function

The embeddings of the Other Features are concatenated with the output of the Transformer layer at the target item’s position, and three fully connected layers further learn the interactions among the dense features, which is standard practice in industrial RecSys. To predict whether a user will click the target item v(t), the problem is modeled as binary classification, so a sigmoid function is used as the output unit. The model is trained with the cross-entropy loss.
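
Here is a minimal sketch of that prediction head. Layer widths and activations are illustrative, not the paper’s exact configuration; the sigmoid and cross-entropy are folded into one numerically stable loss.

```python
import torch
import torch.nn as nn

d_transformer, d_other = 64, 32

# Three fully connected layers on top of the concatenated dense features.
mlp = nn.Sequential(
    nn.Linear(d_transformer + d_other, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
loss_fn = nn.BCEWithLogitsLoss()   # sigmoid + binary cross-entropy in one step

target_item_repr = torch.randn(8, d_transformer)        # Transformer output for the target item
other_features   = torch.randn(8, d_other)              # embedded "Other Features"
labels           = torch.randint(0, 2, (8, 1)).float()  # clicked / not clicked

logits = mlp(torch.cat([target_item_repr, other_features], dim=-1))
print(loss_fn(logits, labels).item())
```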

Configuration

Experimental results demonstrate the superiority of the proposed model, which was then deployed online at Taobao and obtained significant improvements in online Click-Through Rate (CTR) compared to two baselines.

Modifications

The original BST model can be modified and experimented with in the following ways:

  1. Incorporate the features into the processing of the embedding of each item in the input sequence and of the target item, rather than treating them as “other features” outside the transformer layer.
  2. Utilize the ratings or user feedback on the items in the input sequence, together with their positions in the sequence, to update the item embeddings before feeding them into the self-attention layer.
  3. Use a softmax across all class probabilities, making the model learn a multi-class classification task.
  4. Personalize based on the user’s clicks. Compute two features derived from user clicks and the categories of clicked Experiences (a sketch follows this list):
    * Category Intensity: weighted sum of the user’s clicks on Experiences that have that particular category.
    * Category Recency: number of days that have passed since the user last clicked on an Experience in that category.
    Note that the user may have clicked on many different categories with different intensities and recencies, but when the feature is computed for a particular Experience that needs to be ranked, we use the intensity and recency of that Experience’s category.
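
Below is a minimal sketch of these two personalization features. The exponential decay used as the click weighting for intensity, the half-life, and the time units are assumptions for illustration; only the general definitions above come from the source.

```python
import numpy as np

def category_features(click_log, category, now, half_life_days=14.0):
    """click_log: list of (timestamp_in_days, category) pairs for one user."""
    clicks = [t for t, c in click_log if c == category]
    if not clicks:
        return 0.0, None                                   # no clicks in this category
    ages = now - np.array(clicks)                          # days since each click
    intensity = np.sum(0.5 ** (ages / half_life_days))     # decayed (weighted) click count
    recency = float(ages.min())                            # days since the most recent click
    return float(intensity), recency

log = [(100.0, "surfing"), (110.0, "cooking"), (118.0, "surfing")]
print(category_features(log, "surfing", now=120.0))
```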
