We can turn a simple autoencoder into something sophisticated. Autoencoders discover a latent mapping z, which, as a lower-dimensional representation of the input x, can be useful for pre-training networks and building recommender models. However, there are countless ways an autoencoder could compress that information. To put more structure on how z is generated, we turn to variational autoencoders.
As pointed out by Bell and Koren, winners of the 2007 Netflix Progress Prize, improvements to recommender models come from three directions: (1) deepening known methods; (2) taking a multi-scale view of the data; and (3) using implicit rating behavior. In my view, variational autoencoders deepen the existing autoencoder literature by adding a Bayesian flavor to the encoding process. They introduce a variational distribution q parametrized by µ and σ. As sketched below, q(z|x) is produced from the input x via variational inference.
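
As a quick sketch of the usual Gaussian parameterization (my notation, not necessarily the exact figure from the papers), the encoder with parameters φ maps x to a mean and a variance that define q:

```latex
q_\phi(z \mid x) = \mathcal{N}\!\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big)
```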

The loss term of variational autoencoders is a bit different from that of the usual autoencoders. The function below is called the evidence lower bound (ELBO). The first term is often called the reconstruction loss, while the second is the KL divergence. I like the authors’ alternative reading of the function: the first term is the reconstruction loss (similar to other autoencoders), and the second term acts as a regularization factor. As the authors state, “this introduces a trade-off between how well we can fit the data and how close the approximate posterior stays to the prior during learning.”
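
Written out in standard notation (a sketch rather than the exact figure), the objective for a single data point x is:

```latex
\mathrm{ELBO}(x) =
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
\;-\;
\underbrace{\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{regularization}}
```

We maximize the ELBO (equivalently, minimize its negative), and it is the KL term that pulls the approximate posterior toward the prior.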

For the multinomial variational autoencoder (MVAE), the multinomial loss is used as the reconstruction loss so that the recommender assigns more probability mass to the items a user is likely to consume. This effectively creates a ranking over the possible items to be served.
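
As a minimal sketch of that reconstruction term (illustrative names, assuming binary user–item interaction vectors; not the exact code from my notebook):

```python
import torch
import torch.nn.functional as F

def multinomial_recon_loss(logits: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Negative multinomial log-likelihood.

    logits: (batch, n_items) decoder outputs
    x:      (batch, n_items) binary user-item interaction vectors
    """
    log_probs = F.log_softmax(logits, dim=-1)   # one probability mass spread over all items
    return -(log_probs * x).sum(dim=-1).mean()  # reward mass placed on consumed items
```

Because the softmax spreads a fixed probability budget over all items, minimizing this loss pushes mass toward the consumed items, which is what induces the ranking behavior.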
Another variant, the sequential variational autoencoder (SVAE), treats the input items as sequences instead of a “bag of items”. An RNN first encodes the input, then variational inference is done similarly to the other variational autoencoders. The choice of loss is also more flexible: in addition to predicting the next item, the next k items can also be predicted.
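
Here is a toy sketch of that encoding step (my own simplified version, not Sachdeva et al.’s exact architecture): embed the item sequence, run it through a GRU, and map the final hidden state to the parameters of q(z | sequence).

```python
import torch
import torch.nn as nn

class TinySVAEEncoder(nn.Module):
    """Toy sequential encoder: embed item ids, run a GRU, then map the
    final hidden state to the mean and log-variance of q(z | sequence)."""

    def __init__(self, n_items: int, emb_dim: int = 64, latent_dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_items, emb_dim)
        self.gru = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.mu = nn.Linear(emb_dim, latent_dim)
        self.logvar = nn.Linear(emb_dim, latent_dim)

    def forward(self, item_seq: torch.LongTensor):
        _, h = self.gru(self.embed(item_seq))   # h: (1, batch, emb_dim)
        h = h.squeeze(0)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar
```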

You can view this notebook for my PyTorch implementation of multinomial variational autoencoders, and this notebook for sequential variational autoencoders. I’ll explain some bits below for more clarity.
The Model
MVAE is based on James Le’s PyTorch implementation. SVAE is based on Noveen Sachdeva’s original code.
Here’s a snippet from my PyTorch Lightning implementation of MVAE. As you can see, the forward step samples z for the ELBO and computes the KL divergence, which are then used to compute the loss in the training step. Note also the annealing technique applied when computing the overall loss; the authors report improved performance when annealing is used.
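
For clarity, here is a rough skeleton of that logic (class and hyperparameter names are illustrative, and the linear warm-up capped at anneal_cap is one common annealing schedule, not necessarily the exact configuration in my notebook):

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class MVAELightning(pl.LightningModule):
    """Hypothetical skeleton of the training logic described above."""

    def __init__(self, encoder, decoder, total_anneal_steps: int = 200_000, anneal_cap: float = 0.2):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.total_anneal_steps = total_anneal_steps
        self.anneal_cap = anneal_cap

    def forward(self, x):
        mu, logvar = self.encoder(x)                             # encoder outputs q's parameters
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # sample z for the ELBO
        return self.decoder(z), mu, logvar

    def training_step(self, batch, batch_idx):
        x = batch                                                # (batch, n_items) interaction rows
        logits, mu, logvar = self(x)
        recon = -(F.log_softmax(logits, dim=-1) * x).sum(dim=-1).mean()            # multinomial term
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()     # KL to N(0, I)
        # Linearly anneal the KL weight up to a cap.
        beta = min(self.anneal_cap, self.global_step / self.total_anneal_steps)
        loss = recon + beta * kl
        self.log_dict({"recon": recon, "kl": kl, "beta": beta})
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```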
Results
I used MovieLens 1M for the experiments and ran a grid search over several hyperparameters. Here are the links for my CometML experiments on MVAE and SVAE. Below, I list my results alongside the benchmark results from Sachdeva et al. I was not able to replicate their numbers, however. I probably missed some parameter combinations, but I had to stop since tuning both models took a very long time. SVAE in particular ran very slowly because the batch size was 1. To improve this, I would pre-pad the sequences so they can be batched together (see the sketch below). More GPU time would also help, since I’m bound by Kaggle 🙂
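
As a quick illustration of that pre-padding idea (assuming item id 0 is reserved for padding; pad_sequence pads at the end, so the sequences are flipped before and after):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical sketch: pre-pad variable-length histories so they can share a batch.
seqs = [torch.tensor([5, 2, 9]), torch.tensor([7, 1]), torch.tensor([3, 3, 8, 4])]
padded = pad_sequence([s.flip(0) for s in seqs], batch_first=True, padding_value=0).flip(1)
lengths = torch.tensor([len(s) for s in seqs])  # keep true lengths to mask padded positions in the loss
print(padded)
# tensor([[0, 5, 2, 9],
#         [0, 0, 7, 1],
#         [3, 3, 8, 4]])
```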
Overall, I think both architectures are very promising, and both approaches have led to advancements in recommender systems research. Perhaps, as in computer vision, they can also be used to pre-train larger deep learning models for recommendation, where other modalities like image and text can also contribute to the output rankings.
Thanks for reading!