We can turn a simple autoencoder into something sophisticated. Autoencoders discover a latent mapping z, which, as a lower-dimensional representation of the input x, can be useful for pre-training networks and building recommender models. However, there are countless ways an autoencoder could compress that information. To put more structure on how z is generated, we turn to variational autoencoders.
As pointed out by Bell and Koren, winners of the 2007 Netflix Progress Prize, improvements to recommender models come from three directions: (1) deepening known methods; (2) taking a multi-scale view of the data; and (3) using implicit rating behavior. In my view, variational autoencoders deepen the existing autoencoder literature by adding a Bayesian flavor to the encoding process. They introduce a variational distribution q parametrized by µ and σ. As sketched below, q(z|x) is produced from the input x via variational inference.
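
As a quick sketch of the usual Gaussian parameterization (my notation, not necessarily the exact figure from the papers), the encoder with parameters φ maps x to a mean and a variance that define q:

```latex
q_\phi(z \mid x) = \mathcal{N}\!\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big)
```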

The loss term of variational autoencoders is a bit different from that of the usual autoencoders. The function below is called the evidence lower bound (ELBO). The first term is often called the reconstruction loss, while the second is the KL divergence. I like the authors’ alternative reading of the function: the first term is the reconstruction loss (similar to other autoencoders), and the second term acts as a regularization factor. As the authors state, “this introduces a trade-off between how well we can fit the data and how close the approximate posterior stays to the prior during learning.”
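
Written out in standard notation (a sketch rather than the exact figure), the objective for a single data point x is:

```latex
\mathrm{ELBO}(x) =
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
\;-\;
\underbrace{\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{regularization}}
```

We maximize the ELBO (equivalently, minimize its negative), and it is the KL term that pulls the approximate posterior toward the prior.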

For the multinomial variational autoencoder (MVAE), the multinomial loss is used as the reconstruction loss so that the recommender assigns more probability mass to the items a user is likely to consume. This effectively creates a ranking over the possible items to be served.
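
As a minimal sketch of that reconstruction term (illustrative names, assuming binary user–item interaction vectors; not the exact code from my notebook):

```python
import torch
import torch.nn.functional as F

def multinomial_recon_loss(logits: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Negative multinomial log-likelihood.

    logits: (batch, n_items) decoder outputs
    x:      (batch, n_items) binary user-item interaction vectors
    """
    log_probs = F.log_softmax(logits, dim=-1)   # one probability mass spread over all items
    return -(log_probs * x).sum(dim=-1).mean()  # reward mass placed on consumed items
```

Because the softmax spreads a fixed probability budget over all items, minimizing this loss pushes mass toward the consumed items, which is what induces the ranking behavior.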
Another variant, the sequential variational autoencoder (SVAE), treats the input items as sequences instead of a “bag of items”. An RNN first encodes the input, then variational inference is done similarly to the other variational autoencoders. The choice of loss is also more flexible: in addition to predicting the next item, the next k items can also be predicted.
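
Here is a toy sketch of that encoding step (my own simplified version, not Sachdeva et al.’s exact architecture): embed the item sequence, run it through a GRU, and map the final hidden state to the parameters of q(z | sequence).

```python
import torch
import torch.nn as nn

class TinySVAEEncoder(nn.Module):
    """Toy sequential encoder: embed item ids, run a GRU, then map the
    final hidden state to the mean and log-variance of q(z | sequence)."""

    def __init__(self, n_items: int, emb_dim: int = 64, latent_dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(n_items, emb_dim)
        self.gru = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.mu = nn.Linear(emb_dim, latent_dim)
        self.logvar = nn.Linear(emb_dim, latent_dim)

    def forward(self, item_seq: torch.LongTensor):
        _, h = self.gru(self.embed(item_seq))   # h: (1, batch, emb_dim)
        h = h.squeeze(0)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar
```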

You can view this notebook for my PyTorch implementation of multinomial variational autoencoders, and this notebook for sequential variational autoencoders. I’ll explain some bits below for more clarity.
The Model
MVAE is based on James Le’s PyTorch implementation. SVAE is based on Noveen Sachdeva’s original code.
Here’s a snippet from my PyTorch Lightning implementation of MVAE. As you can see, the forward step samples z for the ELBO and computes the KL divergence, which are then used to compute the loss in the training step. Note also the annealing technique applied when computing the overall loss; the authors report improved performance when annealing is used.
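
For clarity, here is a rough skeleton of that logic (class and hyperparameter names are illustrative, and the linear warm-up capped at anneal_cap is one common annealing schedule, not necessarily the exact configuration in my notebook):

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class MVAELightning(pl.LightningModule):
    """Hypothetical skeleton of the training logic described above."""

    def __init__(self, encoder, decoder, total_anneal_steps: int = 200_000, anneal_cap: float = 0.2):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.total_anneal_steps = total_anneal_steps
        self.anneal_cap = anneal_cap

    def forward(self, x):
        mu, logvar = self.encoder(x)                             # encoder outputs q's parameters
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # sample z for the ELBO
        return self.decoder(z), mu, logvar

    def training_step(self, batch, batch_idx):
        x = batch                                                # (batch, n_items) interaction rows
        logits, mu, logvar = self(x)
        recon = -(F.log_softmax(logits, dim=-1) * x).sum(dim=-1).mean()            # multinomial term
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()     # KL to N(0, I)
        # Linearly anneal the KL weight up to a cap.
        beta = min(self.anneal_cap, self.global_step / self.total_anneal_steps)
        loss = recon + beta * kl
        self.log_dict({"recon": recon, "kl": kl, "beta": beta})
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```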
Results
I used MovieLens 1M for the experiments and ran a grid search over several hyperparameters. Here are the links for my CometML experiments on MVAE and SVAE. Below, I list my results alongside the benchmark results from Sachdeva et al. I was not able to replicate their numbers, however. I probably missed some parameter combinations, but I had to stop since tuning both models took a very long time. SVAE in particular ran very slowly because the batch size was 1. To improve this, I would pre-pad the sequences so they can be batched together (see the sketch below). More GPU time would also help, since I’m bound by Kaggle 🙂
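
As a quick illustration of that pre-padding idea (assuming item id 0 is reserved for padding; pad_sequence pads at the end, so the sequences are flipped before and after):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical sketch: pre-pad variable-length histories so they can share a batch.
seqs = [torch.tensor([5, 2, 9]), torch.tensor([7, 1]), torch.tensor([3, 3, 8, 4])]
padded = pad_sequence([s.flip(0) for s in seqs], batch_first=True, padding_value=0).flip(1)
lengths = torch.tensor([len(s) for s in seqs])  # keep true lengths to mask padded positions in the loss
print(padded)
# tensor([[0, 5, 2, 9],
#         [0, 0, 7, 1],
#         [3, 3, 8, 4]])
```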
Overall, I think both architectures are very promising, and both approaches have led to advancements in recommender systems research. Perhaps, as in computer vision, they can also be used to pre-train larger deep learning models for recommendation, where other modalities like image and text can also contribute to the output rankings.
Thanks for reading!