
Variational Autoencoders for Recommendation

Header image note: I typed ‘variation’ in Pexels and this popped up. Delicious!

We can turn a simple autoencoder into something more sophisticated. Autoencoders learn a latent representation z which, as a lower-dimensional encoding of the input x, is useful for pre-training networks and building recommender models. However, there are countless ways an autoencoder can compress that information. To put more structure on how z is generated, we turn to variational autoencoders.
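To make the x → z → x̂ idea concrete, here's a minimal, hypothetical autoencoder in PyTorch; the layer sizes and names are placeholders of mine, not taken from any of the papers or notebooks below:

import torch
from torch import nn

class TinyAutoencoder(nn.Module):
    def __init__(self, num_items, latent_dim=64):
        super().__init__()
        # compress the interaction vector down to latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(num_items, 256), nn.Tanh(), nn.Linear(256, latent_dim))
        # reconstruct the interaction vector from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.Tanh(), nn.Linear(256, num_items))

    def forward(self, x):
        z = self.encoder(x)          # lower-dimensional representation of x
        return self.decoder(z), z    # reconstruction and latent code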

As pointed out by Bell and Koren, winners of the 2007 Netflix Progress Prize, improving recommender models comes in three directions: (1) deepening known methods; (2) taking a multi-scale view of the data; and (3) using implicit rating behavior. In my view, variational autoencoders deepen the existing autoencoder literature by adding a Bayesian flavor to the encoding process. They introduce a variational distribution q parametrized by µ and σ. As seen below, q(z) is generated from the input x via variational inference.
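Written out, the approximate posterior is a diagonal Gaussian whose mean and variance are produced by the encoder network from x (standard VAE notation, my transcription rather than the paper's figure):

q_\phi(z \mid x) = \mathcal{N}\left(z;\ \mu_\phi(x),\ \mathrm{diag}\{\sigma^2_\phi(x)\}\right)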

The loss term of variational autoencoders is a bit different from that of the usual autoencoders. The function below is called the evidence lower bound (ELBO). The first term is often called the reconstruction loss, while the second is the KL divergence. I like the authors' alternative reading of the function: we can take the first term as the reconstruction loss (similar to other autoencoders) and the second term as a regularization factor. As the authors state, “this introduces a trade-off between how well we can fit the data and how close the approximate posterior stays to the prior during learning.”
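For reference, the β-weighted ELBO for a single input x can be written as below (my transcription; β is the annealing factor that the training code further down ramps up gradually):

\mathcal{L}_\beta(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \beta \cdot \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)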

Image from Liang et al. Variational Autoencoders for Collaborative Filtering. In the paper, the authors use the multinomial loss for the reconstruction loss.

For the multinomial variational autoencoder (MVAE), the multinomial likelihood is used for the reconstruction loss, which pushes the recommender to assign more probability mass to the items a user is likely to consume. It therefore effectively creates a ranking over the candidate items to be served.
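Concretely, the multinomial log-likelihood for a user u sums the log of the predicted softmax probabilities π(z_u) over the items in the user's interaction vector x_u; this is the term the cross-entropy code further below implements (my transcription of the paper's notation):

\log p_\theta(x_u \mid z_u) = \sum_i x_{ui} \log \pi_i(z_u), \qquad \pi(z_u) = \mathrm{softmax}\big(f_\theta(z_u)\big)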

Another variant, the sequential variational autoencoder (SVAE), treats the input items as sequences instead of a “bag of items”. An RNN first encodes the input, then variational inference is done similarly to the other variational autoencoders. The choice of how to compute the loss is also more flexible: in addition to predicting the next item, the next k items can also be predicted.
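Here is a minimal sketch of that encoding step, simplified from my understanding of the paper rather than taken from Sachdeva et al.'s code; the layer sizes and names are placeholders:

import torch
from torch import nn

class TinySVAEEncoder(nn.Module):
    def __init__(self, num_items, embed_dim=64, hidden_dim=100, latent_dim=32):
        super().__init__()
        self.item_embedding = nn.Embedding(num_items, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, item_seq):
        # item_seq: (batch, seq_len) of item indices, in consumption order
        h, _ = self.gru(self.item_embedding(item_seq))  # per-timestep hidden states
        mu, logvar = self.to_mu(h), self.to_logvar(h)   # per-timestep q(z_t | x_{1:t})
        # reparameterization trick: sample z_t for each position in the sequence
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar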

Image from Sachdeva et al. Sequential Variational Autoencoders for Collaborative Filtering.

You can view this notebook for my PyTorch implementation of multinomial variational autoencoders, and this notebook for the implementation of sequential variational autoencoders. I'll explain some bits below for more clarity.

The Model

MVAE is based on James Le’s PyTorch implementation. SVAE is based on Noveen Sachdeva’s original code.

Here’s a snippet from my PyTorch Lightning implementation of MVAE. As you can see, the forward step contains the sampling of the reparameterized z and the computation of the KL divergence; these are used to compute the loss in the training step. Note also the annealing technique used to weight the KL term in the overall loss; the authors report improved performance when annealing is used.

from typing import Dict

import torch
from torch import nn
import torch.nn.functional as F


class MVAERecommender(TopNRecommender):
    # TopNRecommender contains methods to predict the top k items
    def __init__(self, model_conf: Dict, novelty_per_item, num_users, num_items, remove_observed=False):
        # … configuration is skipped
        # # # # Model Structure # # # #
        # this is to handle encoding dimensions as lists
        self.encoder = nn.ModuleList()
        # this enumeration produces (d_in, d_out) pairs, counting from 1
        for i, (d_in, d_out) in enumerate(zip(self.enc_dims[:-1], self.enc_dims[1:]), start=1):
            # double d_out at the last layer for the mean and variance parameters
            if i == len(self.enc_dims) - 1:
                d_out *= 2
            self.encoder.append(nn.Linear(d_in, d_out))
            # if NOT at the middle bottleneck point, simply add a nonlinearity
            if i != len(self.enc_dims) - 1:
                self.encoder.append(nn.Tanh())

        self.decoder = nn.ModuleList()
        # this enumeration produces (d_in, d_out) pairs, counting from 1
        for i, (d_in, d_out) in enumerate(zip(self.dec_dims[:-1], self.dec_dims[1:]), start=1):
            self.decoder.append(nn.Linear(d_in, d_out))
            # if we're not at the last layer, then add nonlinearities
            if i != len(self.dec_dims) - 1:
                self.decoder.append(nn.Tanh())

    def forward(self, x):
        # corrupt the input with dropout after normalizing with the Euclidean norm
        h = F.dropout(F.normalize(x), p=self.dropout, training=self.training)
        # forward through the encoder
        for layer in self.encoder:
            h = layer(h)
        # from h we get our q(z|x) parameters: mean and log-variance
        mu_q = h[:, :self.enc_dims[-1]]
        logvar_q = h[:, self.enc_dims[-1]:]
        std_q = torch.exp(0.5 * logvar_q)

        ## Sample z from q
        # fill a tensor with Gaussian noise (mean 0, std 0.01)
        epsilon = torch.zeros_like(std_q).normal_(mean=0, std=0.01)
        # reparameterization trick; self.training is 0/1, so sampling is off at eval time
        sampled_z = mu_q + self.training * epsilon * std_q

        # decode for the reconstruction error
        output = sampled_z
        for layer in self.decoder:
            output = layer(output)

        # KL divergence between q(z|x) and the standard normal prior
        kl_loss = ((0.5 * (-logvar_q + torch.exp(logvar_q) + torch.pow(mu_q, 2) - 1)).sum(1)).mean()
        return output, kl_loss

    def training_step(self, batch, batch_idx):
        """One training step
        Args:
            batch (torch.Tensor): batch matrix
            batch_idx (int): mini-batch index
        Returns:
            torch.Tensor: loss
        """
        # prep the annealing factor, a linearly increasing function with a cap
        if self.total_anneal_steps > 0:
            self.anneal = min(self.anneal_cap, 1. * self.update_count / self.total_anneal_steps)
        else:
            self.anneal = self.anneal_cap
        # forward prop
        pred_matrix, kl_loss = self(batch)
        # loss
        loss = self.__compute_loss(batch, pred_matrix, kl_loss)
        self.update_count += 1
        self.log("train_loss", loss, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def __compute_loss(self, batch_matrix, pred_matrix, kl_loss):
        # first term is the reconstruction (multinomial) loss:
        # log-softmax the predicted matrix, mask via multiplication with the
        # observed interactions, sum per user, average over the batch, then negate
        ce_loss = -(F.log_softmax(pred_matrix, 1) * batch_matrix).sum(1).mean()
        loss = ce_loss + kl_loss * self.anneal
        return loss
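For completeness, here's a rough sketch of how a Lightning module like this can be trained; model_conf, novelty_per_item, and train_loader are placeholders here, not the exact setup in my notebooks:

import pytorch_lightning as pl

# train_loader is assumed to yield dense user-item interaction matrices
model = MVAERecommender(model_conf=model_conf, novelty_per_item=novelty_per_item,
                        num_users=num_users, num_items=num_items)
trainer = pl.Trainer(max_epochs=50)
trainer.fit(model, train_dataloaders=train_loader)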

Results

I used MovieLens 1M for the experiments and performed a grid search over the hyperparameters. Here are the links to my CometML experiments on MVAE and SVAE. Below, I list my results alongside the benchmark results from Sachdeva et al. I was not able to replicate their numbers, however. I may have missed some parameter combinations, but I had to stop since tuning both models took a very long time. SVAE in particular ran very slowly since the batch size was 1; to improve this, I would need to pre-pad the sequences so I can increase the batch size. More GPU time would also help, since I'm bound by Kaggle's limits 🙂

Metric                          MVAE       SVAE
NDCG@100                        0.3019     0.1765
coverage@100                    0.3747     0.0320
novelty@100                     2.6796     1.9415
gini@100                        0.2029     1.0
authors' NDCG@100 (benchmark)   0.2219     0.2677
training time (single model)    137.28 s   1,565.46 s

Overall, I think both architectures are very promising, and both approaches have led to advancements in recommender algorithm research. Perhaps, like their use in computer vision, they can also be used to pre-train larger deep learning models for recommendation, where other modalities like image and text can also contribute to the output rankings.

Thanks for reading!

By krsnewwave

I'm a software engineer and a data science guy working on recommender systems, natural language processing, and computer vision.
