Personalization and Recommendation with Contextual Bandits 🤖

Subir Verma
Dec 1, 2021

Simulate a content personalization scenario with Vowpal Wabbit using contextual bandits to make choices between actions in a given context.


INTRODUCTION

Recommending relevant and personalized content to users is crucial for media service providers, e-commerce platforms, content-based platforms, and more.

Indeed, effective recommender systems improve users’ experience and engagement on the platform by helping them navigate through massive amounts of content.

As demand grows for features like personalization systems, efficient information retrieval, and anomaly detection, the need for a solution to optimize these features has grown as well. Contextual bandit is a machine learning framework designed to tackle these — and other — complex situations.

This tutorial includes a brief overview of reinforcement learning, the contextual bandits approach to this machine learning paradigm, and describes how to approach a contextual bandits problem with Vowpal Wabbit.

What is reinforcement learning?

Reinforcement learning is a machine learning paradigm used to train models for sequential decision making. It involves using algorithms concerned with how a software agent takes suitable actions in complex environments and uses the feedback to maximize reward over time. This approach provides the freedom to enact specific user behavior, in a given context, and provide feedback on how the chosen behavior is rewarded based on the goal.

Bandits, explained

Let’s say you are an online retailer that wants to show personalized product suggestions on your homepage.

You can only show a limited number of products to a specific customer, and you don’t know which ones will have the best reward. In this case, let’s make the reward $0 if the customer doesn’t buy the product, and the item price if they do.

To try to maximize your reward, you could use a multi-armed bandit (MAB) algorithm, where each product is a bandit: a choice available for the algorithm to try. During each play, the multi-armed bandit agent must choose whether to show the user item 1 or item 2. Each play is independent of the others: sometimes the user buys item 2 for a reward of $22, and sometimes the user buys item 2 twice, earning a reward of $44.

The contextual bandits problem

Now let’s say we have a customer that’s a professional interior designer and an avid knitting hobbyist. They may be ordering wallpaper and mirrors during working hours and browsing different yarns when they’re home. Depending on what time of day they access our website, we may want to show them different products.

The contextual bandit algorithm is an extension of the multi-armed bandit approach where we factor in the customer’s environment, or context, when choosing a bandit. The context affects how a reward is associated with each bandit, so as contexts change, the model should learn to adapt its bandit choice.

In the contextual bandit problem, a learner repeatedly observes a context, chooses an action, and observes a loss/cost/reward for the chosen action only. Contextual bandits algorithms use additional side information (or context) to aid real-world decision-making. They work well for choosing actions in dynamic environments where options change rapidly, and the set of available actions is limited.

With contextual bandit, a learning algorithm can test out different actions and automatically learn which one has the most rewarding outcome for a given situation. It’s a powerful, generalizable approach for solving key business needs in industries from healthcare to finance, and almost everything in between.

Vowpal Wabbit: Working with contextual bandits

Vowpal Wabbit is an interactive machine learning library and the reinforcement learning framework behind services like Microsoft Personalizer. It is built for high throughput and low latency when serving personalization ranking requests and training the model on all events.

This tutorial uses an application example we’ll call Con-Ban Agent to introduce a Vowpal Wabbit approach to the contextual bandit problem and explore the capabilities of this reinforcement learning approach. The problem scenario of web content personalization motivates our example Con-Ban Agent. The goal is to show the user the most relevant web content on each page to maximize engagement (clicks).

Con-Ban Agent performs the following functions:

  • Some context x arrives and is observed by Con-Ban Agent.
  • Con-Ban Agent chooses an action a from a set of actions A, i.e., a ∈ A (A may depend on x).
  • Some reward r for the chosen a is observed by Con-Ban Agent.

In a contextual bandit setting, a data point has four components:

  • Context
  • Action
  • Probability of choosing the action
  • Reward/cost for the chosen action

For example:

Con-Ban Agent news website:

  • Decision to optimize: articles to display to user.
  • Context: user data (browsing history, location, device, time of day)
  • Actions: available news articles
  • Reward: user engagement (click or no click)

In our simulator, we need to generate a context, get an action/decision for that context, and simulate the resulting reward. The goal of the simulator is to maximize reward (CTR), or equivalently to minimize loss (-CTR).

The context is therefore (user, time_of_day):

  • We have two website visitors: “Tom” and “Anna.”
  • Tom and Anna visit the website in the morning or the afternoon.

We have the option of recommending a variety of articles to Tom and Anna. Therefore, the actions are the different choices of articles: “politics”, “sports”, “music”, “food”, “finance”, “health”, or “cheese.”

The reward is whether they click on the article or not: “click” or “no click.”

Contextual Bandit functionalities in Vowpal Wabbit

VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not implement exploration; it assumes it can only use the currently available data, logged using an exploration policy. (Note: for fully online contextual bandits, see the --cb_explore and --cb_explore_adf options in the Vowpal Wabbit documentation.)

The data is specified as a set of tuples (x,a,c,p), where x is the current feature set/context for the decision, a is the action chosen by the exploration policy for context x, c is the observed cost for action a in context x, and p is the probability with which the exploration policy chose this action in context x.

Vowpal Wabbit also supports a multiline format for action-dependent features (used with the --cb_adf and --cb_explore_adf options). Here each example spans multiple lines, with one line per action. For each action, we can attach the label information (a,c,p), if known. The action field a is now ignored, since actions are identified by their line numbers, and is typically set to 0. The semantics of cost and probability are the same as before. Each example is allowed to specify the label information on precisely one action, and an empty line signals the end of a multiline example. Additionally, we can specify contextual features that are shared across all actions at the beginning of an example, on a line tagged shared. Since the shared line is not associated with any action, it should never contain label information.
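For instance, a single multiline example with a shared context line and three candidate actions might look like the sketch below (the feature names are made up for illustration; here the first action was chosen with probability 0.6 and observed a cost of 1.5):

shared | user_age:25 device=mobile
0:1.5:0.6 | article=politics
| article=sports
| article=music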

Simple Example

Here is a simple example that illustrates the input format and how to use vw on this data.

We consider a problem with 4 actions and we observed the following 5 data points in VW format:

1:2:0.4 | a c  
3:0.5:0.2 | b d
4:1.2:0.5 | a b c
2:1:0.3 | b c
3:1.5:0.7 | a d

Here each line is a separate example and each takes the form:

action:cost:probability | features

Where

  • action is the id of the action taken, where we observed the cost (a positive integer in {1, …, k})
  • cost is the cost observed for this action (floating point, lower is better)
  • probability is the probability (floating point, in [0,1]) with which the exploration policy chose this action when the data was collected
  • features is the list of all features for this example, specified as usual for classification/regression problems with vw

So the first line above indicates we observed action 1 has cost 2 on an example with features a and c, and this action was chosen with probability 0.4 by the exploration policy in this context when collecting the data.
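Assuming this data is saved to a file, say train.dat (a placeholder name), a policy can be trained from the command line by passing the number of actions to the --cb option:

vw -d train.dat --cb 4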

Simulating reward for Vowpal Wabbit

In the real world, we must learn Tom and Anna’s preferences for articles as we observe their interactions. Since this is a simulation, we must define Tom and Anna’s preference profile.

The reward that we provide to the learner follows this preference profile. We hope to see if the learner can make better and better decisions as we see more samples, which in turn means we are maximizing the reward.

To accomplish this, we need to modify the reward function in a few different ways and see if the contextual bandit learner picks up the changes. Then, we compare the CTR with and without learning.

Vowpal Wabbit optimizes to minimize cost, which is the negative of reward. Therefore, we always pass the negative of the reward as the cost to Vowpal Wabbit.

# VW tries to minimize loss/cost, therefore we will pass cost as -reward
USER_LIKED_ARTICLE = -1.0
USER_DISLIKED_ARTICLE = 0.0

The reward function below specifies that Tom likes politics in the morning and music in the afternoon. Anna likes sports in the morning and politics in the afternoon. It looks dense, but we are simply simulating a hypothetical world in the format of the feedback the learner understands: cost.

If the learner recommends an article that aligns with the reward function, we give a positive reward. In our simulation, this is a click.
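A cost function implementing this preference profile could look like the following sketch, modeled on the official Vowpal Wabbit tutorial that this walkthrough follows (the exact function and variable names are illustrative):

def get_cost(context, action):
    # Tom: politics in the morning, music in the afternoon
    if context['user'] == "Tom":
        if context['time_of_day'] == "morning" and action == 'politics':
            return USER_LIKED_ARTICLE
        elif context['time_of_day'] == "afternoon" and action == 'music':
            return USER_LIKED_ARTICLE
        else:
            return USER_DISLIKED_ARTICLE
    # Anna: sports in the morning, politics in the afternoon
    elif context['user'] == "Anna":
        if context['time_of_day'] == "morning" and action == 'sports':
            return USER_LIKED_ARTICLE
        elif context['time_of_day'] == "afternoon" and action == 'politics':
            return USER_LIKED_ARTICLE
        else:
            return USER_DISLIKED_ARTICLE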

Understanding Vowpal Wabbit format

There are steps we need to take to set up our input in a format Vowpal Wabbit understands.

This function handles converting from our context as a dictionary, list of articles, and the cost if there is one into the text format it understands:
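A sketch of such a conversion function, in the spirit of the official tutorial (the namespace names User and Action and the helper name to_vw_example_format come from that tutorial; they are conventions, not requirements):

def to_vw_example_format(context, actions, cb_label=None):
    if cb_label is not None:
        chosen_action, cost, prob = cb_label
    # Shared (context) features go on a line tagged "shared"
    example_string = "shared |User user={} time_of_day={}\n".format(
        context["user"], context["time_of_day"])
    # One line per candidate action
    for action in actions:
        if cb_label is not None and action == chosen_action:
            # Label the chosen action with 0:cost:probability
            example_string += "0:{}:{} ".format(cost, prob)
        example_string += "|Action article={}\n".format(action)
    # Strip the trailing newline
    return example_string[:-1]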

To make sense of this format, we go through an example. In this example, the time of day is morning, and the user is Tom. There are four possible articles.

In Vowpal Wabbit format, there is one line that starts with shared (the shared context), followed by four lines, each corresponding to an article:

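Assuming the four articles are politics, sports, music, and food, the text produced by the conversion function above would look roughly like this:

shared |User user=Tom time_of_day=morning
|Action article=politics
|Action article=sports
|Action article=music
|Action article=food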

More details here: https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Input-format

Getting a decision from Vowpal Wabbit

When we call Vowpal Wabbit, the output is a probability mass function (PMF). Vowpal Wabbit provides a list of probabilities over the set of actions because we are incorporating exploration into our strategy. This exploration means that the probability at a given index in the list corresponds to the likelihood of picking that specific action.

To arrive at a decision/action, we must sample from this list.

For example, given the list [0.7, 0.1, 0.1, 0.1], we would choose the first item with a 70% chance. The function sample_custom_pmf takes such a list and gives us the index it chose and the probability of choosing that index.
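A possible implementation of sample_custom_pmf, following the official tutorial:

import random

def sample_custom_pmf(pmf):
    # Normalize, in case the probabilities do not sum exactly to 1
    total = sum(pmf)
    pmf = [p / total for p in pmf]
    draw = random.random()
    sum_prob = 0.0
    # Walk the cumulative distribution until it exceeds the random draw
    for index, prob in enumerate(pmf):
        sum_prob += prob
        if sum_prob > draw:
            return index, prob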

We have all the information we need to choose an action for a specific user and context. Use Vowpal Wabbit to achieve this with the following steps:

  1. Convert the context and actions into the text format needed.
  2. Pass this example to Vowpal Wabbit and get the PMF output.
  3. Sample this PMF to get the article to show.
  4. Return the chosen article and the probability of choosing it.
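Putting these steps together, a get_action helper might look like this sketch (it reuses to_vw_example_format and sample_custom_pmf from above; vw is assumed to be a pyvw instance created with --cb_explore_adf, as instantiated in the next section):

def get_action(vw, context, actions):
    # 1. Convert context and actions into VW text format
    vw_text_example = to_vw_example_format(context, actions)
    # 2. Get the PMF over actions from VW
    pmf = vw.predict(vw_text_example)
    # 3. Sample the PMF to pick an article
    chosen_action_index, prob = sample_custom_pmf(pmf)
    # 4. Return the chosen article and its probability
    return actions[chosen_action_index], prob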

Reinforcement learning simulation

Now that we have done all of the setup work and we know how to interface with Vowpal Wabbit, let’s simulate the world of Tom and Anna. The scenario is as follows: Tom and Anna go to a website and are shown an article. Remember that the reward function allows us to define the real-world reaction to the content that Vowpal Wabbit recommends.

We choose between Tom and Anna uniformly at random and choose the time of day they visit the site uniformly at random. Think of this as flipping a coin to choose between Tom and Anna and flipping the coin again to choose the time of day.
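A simple sketch of these choices, reusing the random module imported above:

users = ['Tom', 'Anna']
times_of_day = ['morning', 'afternoon']

def choose_user(users):
    return random.choice(users)

def choose_time_of_day(times_of_day):
    return random.choice(times_of_day)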

Instantiate learner

We instantiate a contextual bandit learner in Vowpal Wabbit and then simulate Tom and Anna’s website visits num_iterations number of times. With each visit, we do the following:

  1. Decide between Tom and Anna
  2. Decide the time of day
  3. Pass context (i.e., user, time of day) to the learner to get action (i.e., article recommendation, and the probability of choosing action).
  4. Receive reward (i.e., see if the user clicked or not). Remember that cost is just a negative reward.
  5. Format context, action, probability, reward in Vowpal Wabbit format
  6. Learn from the example
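With the pyvw bindings, a contextual bandit learner with epsilon-greedy exploration over action-dependent features can be instantiated like this (the -q UA flag adds interactions between the User and Action namespaces used in our text format; --quiet suppresses diagnostic output):

from vowpalwabbit import pyvw

# --cb_explore_adf: contextual bandit with action-dependent features and exploration
# --epsilon 0.2: choose a random action 20% of the time, exploit otherwise
vw = pyvw.vw("--cb_explore_adf -q UA --quiet --epsilon 0.2")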

This process is the same for every one of our simulations, so we define it in the run_simulation function. We have to supply the cost function to simulate how the real world works:
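A sketch of run_simulation in the spirit of the official tutorial, building on the helpers defined above (treat the exact pyvw calls, in particular vw.parse with the lContextualBandit label type, as indicative of the tutorial’s pattern rather than the only way to do this):

def run_simulation(vw, num_iterations, users, times_of_day, actions,
                   cost_function, do_learn=True):
    cost_sum = 0.0
    ctr = []
    for i in range(1, num_iterations + 1):
        # 1 & 2. Pick a user and a time of day uniformly at random
        context = {'user': choose_user(users),
                   'time_of_day': choose_time_of_day(times_of_day)}
        # 3. Ask the learner for an action and the probability of choosing it
        action, prob = get_action(vw, context, actions)
        # 4. Simulate the user's reaction; cost is the negative of the reward
        cost = cost_function(context, action)
        cost_sum += cost
        if do_learn:
            # 5. Format the labeled example and 6. learn from it
            vw_format = vw.parse(
                to_vw_example_format(context, actions, (action, cost, prob)),
                pyvw.vw.lContextualBandit)
            vw.learn(vw_format)
            vw.finish_example(vw_format)
        # Track the running CTR; negate the cost so that higher is better
        ctr.append(-1 * cost_sum / i)
    return ctr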

People’s preferences change over time in the real world. To account for this in the simulation, we incorporate two different cost functions and swap over to the second one halfway through. When we change the cost function, we start rewarding actions that have never been rewarded previously:
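For instance, a second cost function might swap in preferences that reward articles the first cost function never rewarded (this particular profile is a hypothetical choice for illustration):

def get_cost_new(context, action):
    # Hypothetical new preferences: Tom now likes food in the morning and
    # finance in the afternoon; Anna now likes health at any time of day.
    if context['user'] == "Tom":
        if context['time_of_day'] == "morning" and action == 'food':
            return USER_LIKED_ARTICLE
        elif context['time_of_day'] == "afternoon" and action == 'finance':
            return USER_LIKED_ARTICLE
    elif context['user'] == "Anna" and action == 'health':
        return USER_LIKED_ARTICLE
    return USER_DISLIKED_ARTICLE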

User Behaviour

Now, we switch to the new reward function after running a number of samples with the first one. Remember that the new reward function changes the users’ preferences: the rewarded actions are different from before. We should see the learner pick up these changes and optimize for the new preferences.
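One simple way to run the switch, assuming the run_simulation and cost functions sketched above (note that the running CTR restarts between the two calls in this simplified version):

actions = ["politics", "sports", "music", "food", "finance", "health", "cheese"]
num_iterations = 5000
ctr = run_simulation(vw, num_iterations // 2, users, times_of_day, actions, get_cost)
ctr += run_simulation(vw, num_iterations // 2, users, times_of_day, actions, get_cost_new)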

Model saving and loading — Python

VW supports saving a model and loading it in another VW process. The contents of the model are:

  • VW version
  • Command line arguments that are marked as keep
  • This is a source of some confusion: the contents of the model will often contain more or fewer arguments than those provided when originally running VW. This is because non-keep arguments are not saved, and some reductions insert extra command-line arguments that are themselves marked as keep.
  • General VW state
  • State for each enabled reduction

When using VW in Python, practically all command line parameters work as expected.

Saving a model

To save a model in Python, final_regressor can be used. However, it is important to note that the model saving happens when VW cleans up, so you will need to call finish or destroy the object (with del, for example), which will in turn call finish.

You can also call save to save a model file at any point. This only supports saving the binary model file and not the readable version.

vw = pyvw.vw("-f vw.model")
vw = pyvw.vw(final_regressor="vw.model")
vw.save("vw.model")

Loading a model

To load a model file in Python, you should use the initial_regressor option when creating the vw instance.

vw = pyvw.vw("-i vw.model")
vw = pyvw.vw(initial_regressor="vw.model")

Resources

I have not tried to modify the existing Vowpal Wabbit tutorial; I have only collected the parts relevant to this post. The options to explore here are vast. Do read their resources and experiment accordingly. I hope this blog gave you a good idea of how powerful VW is.

My Previous Write-ups: https://subirverma.medium.com

Thanks
