Multi-objective Ranking in Large-Scale E-commerce Recommender Systems
Motivation
Users have multiple types of behavior and each behavior can have a different intent. For example, a user might click on a product, add it to their cart, and then wishlist it before finally purchasing it. A recommender system that only takes into account a user’s purchase history would not be able to accurately recommend products to that user.
Another reason why this research is important is that e-commerce metrics often trade off against one another. For example:
- A recommender system might be able to increase the click-through rate (CTR) on a product by recommending it to everyone, but this would come at the expense of the conversion rate (CVR).
- There can be a trade-off between optimizing for a higher conversion rate and maximizing the average order value. For example, offering discounts or promotions might increase the conversion rate but could lead to a decrease in AOV if customers are incentivized to make smaller purchases.
The goal is to find a recommender system that can optimize multiple objectives at the same time.
Finally, this research is also important because it can help to identify and address biases in recommender systems. For example, a recommender system might be biased toward recommending products that are in stock or that have been recently added to the website. This research can help to develop recommender systems that are more fair and unbiased.
Deep Interest Transformer (DIT) — Overview
- DIT has a dedicated Transformer block for each kind of behavior sequence, so that it can capture users' multifaceted interests.
- Authors argue that users’ multiple types of behavior sequences on items (e.g., click, add to cart, and order) are significantly different both in terms of timescales and intent.
- For example, a user may have many click behaviors but usually has only a few cart or order behaviors in a short time range (e.g., within a week).
Inside Deep Interest Transformer
- The encoder models the mutual relationships between every pair of historical behaviors.
- The decoder learns a unique user interest vector for each target item. It captures multiple unique interests.
- During training, the DIT decoder learns to generate accurate interest vectors by adjusting its parameters based on the difference between the predicted interest vector and the ground truth (actual user behavior).
- The target-product embedding plays a crucial role in guiding the decoder to focus on the relevant aspects of the input sequence, contributing to the overall learning process.
The basic idea is that users’ behavior sequences of different types (e.g., click, add to cart, and order) differ significantly in both intent and timescale, so each sequence type is modeled separately.
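The decoder step described above can be sketched as single-head scaled dot-product attention, with the target-item embedding as the query and the encoded behaviors as keys and values. This is a minimal illustration under stated assumptions, not the DIT implementation: the function names and toy embeddings are hypothetical, and a real model would add learned projections, multiple heads, and one Transformer per behavior type.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def interest_vector(target, behaviors):
    """Attend from the target-item embedding (query) over encoded behaviors."""
    d = len(target)
    # Scaled dot-product score between the target and each behavior.
    scores = [sum(t * b for t, b in zip(target, beh)) / math.sqrt(d)
              for beh in behaviors]
    weights = softmax(scores)
    # Weighted sum of behavior embeddings gives the interest vector.
    vec = [sum(w * beh[i] for w, beh in zip(weights, behaviors))
           for i in range(d)]
    return vec, weights

# Toy encoder outputs for three clicked items (hypothetical values).
click_seq = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
vec_a, w_a = interest_vector([1.0, 0.0], click_seq)  # target item A
vec_b, w_b = interest_vector([0.0, 1.0], click_seq)  # target item B
```

The same click sequence yields a different interest vector for each target item, because the attention weights shift toward the behaviors most similar to that target.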
Multi-Task Learning Overview
Hard Parameter Sharing
- The model shares the hidden layers across all tasks and keeps a few task-specific layers to specialize in each task.
- The shared layers allow the model to learn general features across tasks, while the task-specific layers enable specialization for individual tasks.
Soft Parameter Sharing
- Each task has its own model with its own set of parameters. The distance between the corresponding layers of the task-specific models is then regularized during training, which keeps their parameters similar.
- The idea is to balance between task-specific learning and leveraging shared knowledge, providing flexibility for tasks to adapt to their unique requirements.
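Soft parameter sharing can be illustrated with a small sketch in which two tasks keep separate weight matrices and a penalty term pulls them toward each other. The squared L2 distance used here is one common choice and is an assumption, as are the function names, the `strength` coefficient, and the toy weights.

```python
def l2_distance_penalty(weights_a, weights_b, strength=0.01):
    """Penalize the squared distance between two tasks' layer weights."""
    squared = sum((a - b) ** 2
                  for row_a, row_b in zip(weights_a, weights_b)
                  for a, b in zip(row_a, row_b))
    return strength * squared

def total_loss(loss_click, loss_order, w_click, w_order, strength=0.01):
    """Sum of per-task losses plus the soft-sharing regularizer."""
    return loss_click + loss_order + l2_distance_penalty(
        w_click, w_order, strength)

# Hypothetical weights of the same layer in the click and order models.
w_click = [[0.5, -0.2], [0.1, 0.3]]
w_order = [[0.4, -0.2], [0.1, 0.5]]
penalty = l2_distance_penalty(w_click, w_order)
```

A larger `strength` pushes the two models toward identical weights, which approaches hard sharing; setting it to zero trains the tasks independently.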
Introduction to Mixture-of-Experts (MoE)
- MoE architecture is an ensemble of many models (aka experts).
- Each expert is trained on a subspace of the problem and then specializes in that specific part of the input space.
- Each of the experts can be any machine learning algorithm.
- These experts often have the same architecture and are also trained by the same algorithm.
- MoEs include a gating (or routing) function that acts as a manager that forwards individual inputs to the best experts based on their specialization.
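The gating idea above can be sketched as a softmax over per-expert scores followed by a weighted sum of expert outputs. This is a minimal sketch assuming dense (soft) gating with a linear gate; sparse top-k routing is another common variant. The two toy experts and the gate weights are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights):
    """Mix expert outputs using input-dependent gate probabilities."""
    # One gate logit per expert: dot product of the input with that row.
    logits = [sum(xi * gi for xi, gi in zip(x, row)) for row in gate_weights]
    gate = softmax(logits)
    outputs = [expert(x) for expert in experts]
    y = sum(g * o for g, o in zip(gate, outputs))
    return y, gate

# Two toy experts with scalar outputs (hypothetical).
experts = [lambda x: x[0] + x[1], lambda x: x[0] - x[1]]
gate_weights = [[2.0, 0.0], [0.0, 2.0]]
y, gate = moe_forward([1.0, 0.0], experts, gate_weights)
```

For this input the gate puts more weight on the first expert, because the first row of `gate_weights` responds to the first feature.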
Multi-gate Mixture-of-Experts (MMoE)
Why MMoE?
- The multiple objectives may have complex relationships with each other (e.g., independent, related, or conflicting).
- In Shared Bottom, the hard-parameter sharing mechanisms may harm the learning of multiple objectives when the correlation between tasks is low.
- If the tasks are less related, then sharing experts will be penalized and the gating networks of these tasks will learn to utilize different experts instead.
MMoE in DMT
- Multi-gate Mixture-of-Experts (MMoE) can capture the relation and conflict of multiple tasks.
- In this architecture, the expert submodels (feed-forward networks) are shared across all tasks, while the gating networks are task-specific.
- Each gating network can learn to “select” a subset of experts to use, conditioned on the input example. This allows flexible parameter sharing in the multi-task learning setting.
- Each task also has a task-specific “tower” (utility) network to decouple the optimization of the tasks.
MMoE in DMT has 4 expert networks, all of which are multi-layer perceptrons with ReLU activations.
MMoE in DMT has 2 gate networks, both of which are multi-layer perceptrons with Softmax activations.
The outputs of the N expert networks are denoted by e_1(x), e_2(x), ..., e_N(x).
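The structure above (four shared experts, two task-specific gates, one tower per task) can be sketched as a forward pass. This is an illustrative simplification, not the DMT implementation: each expert, gate, and tower is reduced to a single linear layer, the weights are random, and the class name, task names, and dimensions are assumptions.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

class MMoE:
    def __init__(self, in_dim, expert_dim, n_experts=4,
                 tasks=("click", "order"), seed=0):
        rng = random.Random(seed)
        # Experts are shared by all tasks; gates and towers are per task.
        self.experts = [rand_matrix(expert_dim, in_dim, rng)
                        for _ in range(n_experts)]
        self.gates = {t: rand_matrix(n_experts, in_dim, rng) for t in tasks}
        self.towers = {t: rand_matrix(1, expert_dim, rng) for t in tasks}

    def forward(self, x):
        # e_1(x), ..., e_N(x): ReLU expert outputs.
        expert_out = [[max(0.0, v) for v in matvec(w, x)]
                      for w in self.experts]
        preds, gate_weights = {}, {}
        for task, gate in self.gates.items():
            g = softmax(matvec(gate, x))  # task-specific expert weights
            mixed = [sum(g[i] * expert_out[i][j] for i in range(len(g)))
                     for j in range(len(expert_out[0]))]
            logit = matvec(self.towers[task], mixed)[0]
            preds[task] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            gate_weights[task] = g
        return preds, gate_weights

model = MMoE(in_dim=3, expert_dim=2)
preds, gate_weights = model.forward([0.2, -0.5, 1.0])
```

Inspecting `gate_weights` per task shows how heavily each task relies on each expert, which is the quantity behind an expert-utilization analysis.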
Introduction to Selection Bias
- Training our models on biased historical data perpetuates the bias via a self-reinforcing feedback loop.
- This can lead to suboptimal outcomes where items that are more relevant but are shown in lower positions continue to get lower engagement and thus don’t improve their rank.
- Position-index is the index number of items in the sequence of product-list.
- Position-page is the page number in which the item is present.
- The maximum index and page number are set to 100 and 400 respectively.
- Neighboring bias is calculated over the K neighboring items, where K is set to 6.
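The bias features listed above could be extracted roughly as follows. This sketch rests on several assumptions that the notes do not specify: positions are 0-based, the page number is derived from a hypothetical `page_size`, the K neighbors are taken as K/2 items on each side of the target, and missing neighbors are padded with 0. The caps use the values stated above.

```python
def bias_features(item_ids, index, page_size=10,
                  max_index=100, max_page=400, k=6, pad=0):
    """Build position and neighboring bias features for one displayed item."""
    # Clip position features to the configured maximums.
    position_index = min(index, max_index)
    position_page = min(index // page_size, max_page)
    # Collect k neighbors: k // 2 before and k // 2 after the target.
    half = k // 2
    offsets = list(range(-half, 0)) + list(range(1, half + 1))
    neighbors = []
    for offset in offsets:
        j = index + offset
        neighbors.append(item_ids[j] if 0 <= j < len(item_ids) else pad)
    return {
        "position_index": position_index,
        "position_page": position_page,
        "neighbors": neighbors,
    }

# Hypothetical product list in display order.
shown = [101, 102, 103, 104, 105, 106, 107, 108]
feats = bias_features(shown, index=1, page_size=4)
```

In a model, these raw features would typically be embedded and fed to a separate bias network rather than used directly.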
Model Training
Model Prediction
Overall Performance
JD RecSys Dataset: One week’s samples are used for training and samples of the following day are for testing.
The improvements of DIN, DIEN, and DMT over DNN (Base) become smaller when the dense features are used.
Performance of multi-task learning in DMT
- Adding an Order Sequence Feature does not add significant value to the score.
- The authors find empirically that the order sequence can disturb the click and cart sequence information.
- Shared bottom improves the performance but because of the hard-parameter sharing mechanisms, it fails to beat MMoE.
Expert Utilization for Multiple Tasks in MMoE
- Click prediction relies heavily on experts 1 and 4, while Order prediction is influenced by experts 1, 3, and 4.
- MMoE efficiently organizes input information using gating networks
Performance of Bias Deep Neural Network in DMT
Modeling the neighboring bias brings better performance than modeling the position bias features.
Reasoning for Better Performance with Neighboring Bias:
- Contextual Relationships: Neighboring bias captures contextual relationships between items, which can be crucial in understanding dependencies and patterns. It allows the model to consider the influence of nearby elements on the current element, providing a more contextually rich representation.
- Adaptability to Dynamic Content: Neighboring bias is often more adaptable to dynamic content where the importance of items may vary based on their proximity to each other. This adaptability helps the model generalize better to different scenarios.
- Robustness to Position Variations: Neighboring bias can be more robust to variations in the position of items, as it focuses on the local context rather than relying solely on the fixed position defined by indices or page numbers.