Here’s a quick post about my activities this week.
NVIDIA Merlin is a high-level open-source library to “accelerate recommender systems on NVIDIA GPUs” at scale. As advertised, it has several production-ready components that, with relatively few lines of code, can perform end-to-end recommender system training, evaluation, deployment, and inference. It’s such a great piece of software that their team won the RecSys 2021 Challenge. So I wanted to try it for myself and see if I can learn a thing or two. I’ll be focusing on NVTabular and Merlin Models for this post.
TL;DR: I learned things the hard way here. As long as it works…
Challenge 1: Installation
Since I’m on a potato PC setup, I first tried their recommended installation procedure, which uses nvidia-docker. I have an old GTX 960 GPU with 4 GB of memory. I made it work half a decade ago on Ubuntu without Docker. Now, I have a Windows box. After much wrangling, I gave up on that, since I couldn’t get it to work on WSL2, which is the only option I currently have. I then moved to a Kaggle and Google Colab setup.
In Google Colab, cudf is required by NVTabular to truly make things fast. Although NVTabular can handle CPU loads, it always warns the user that its CPU functionality is experimental. Here is the official way to make it run. On Kaggle, things are easier since cudf is already available. I made it work, easy.
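For reference, here’s the quick sanity check I run first (a minimal sketch, nothing Merlin-specific) to confirm cudf is importable, i.e. that NVTabular won’t silently fall back to its experimental CPU path:

# Minimal sanity check: is cudf importable and usable on this instance?
try:
    import cudf
    gdf = cudf.DataFrame({"userId": [1, 2, 3], "rating": [4.0, 5.0, 3.5]})
    print("cudf", cudf.__version__, "is working:")
    print(gdf.describe())
except ImportError:
    print("cudf not found -- NVTabular will fall back to its experimental CPU mode.")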
Challenge 2: Training – PyTorch
Here’s my Kaggle code that combines the example notebooks for Merlin’s MovieLens-25M workflow. The notebooks I chose use PyTorch, and training went through with no problem. However, the inference notebooks use TensorFlow and HugeCTR, another component of Merlin. So it makes me wonder how to effectively do inference with PyTorch; there’s no notebook about that yet.
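For context, the training side looks roughly like this in the MovieLens notebooks: NVTabular’s PyTorch data loader streams the preprocessed parquet files on the GPU. The paths and column names below are stand-ins, and the loader API has shifted between NVTabular versions, so treat this as a sketch:

import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader

# Stand-in paths/columns -- the real ones come from the NVTabular workflow output.
TRAIN_PATHS = ["./train/part_0.parquet"]
CATEGORICAL_COLUMNS = ["userId", "movieId"]
LABEL_COLUMNS = ["rating"]

train_dataset = TorchAsyncItr(
    nvt.Dataset(TRAIN_PATHS),
    batch_size=65536,
    cats=CATEGORICAL_COLUMNS,
    conts=[],
    labels=LABEL_COLUMNS,
)
train_loader = DLDataLoader(
    train_dataset, batch_size=None, collate_fn=lambda x: x, pin_memory=False, num_workers=0
)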
Challenge 3: Training – TensorFlow
Merlin Models is predominantly TensorFlow code. There are amazing implementations of the YouTube DNN, Facebook’s DLRM, and standard Two Tower models. It got me really excited. SOTA models with a few lines of code? Sign. Me. Up.
Trouble though. Roadblocks. The code itself was importing the following:
from typing import TYPE_CHECKING, Dict, Union, Protocol, runtime_checkable
Protocol and runtime_checkable were added to typing in Python 3.8 and are not available in 3.7, except through the typing_extensions module. So I hacked it. In Google Colab, one can easily edit installed library files via the file browser panel, so I corrected the import to:
from typing import TYPE_CHECKING, Dict, Union
from typing_extensions import Protocol, runtime_checkable
Boom, that works. So that I don’t forget, I created a GitHub repo of the exact code, so I can just patch over it on fresh installs. Hacks. See the Colab here.
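For what it’s worth, the patch step itself can be scripted. This is only a rough sketch of the idea; the package import path and the location of the offending file are assumptions, not verified against a specific Merlin release:

import importlib.util
import pathlib

# Locate wherever pip installed the merlin.models package (assumed import path).
spec = importlib.util.find_spec("merlin.models")
pkg_dir = pathlib.Path(list(spec.submodule_search_locations)[0])

old = "from typing import TYPE_CHECKING, Dict, Union, Protocol, runtime_checkable"
new = (
    "from typing import TYPE_CHECKING, Dict, Union\n"
    "from typing_extensions import Protocol, runtime_checkable"
)

# Rewrite any file that still uses the Python 3.8-only imports.
for py_file in pkg_dir.rglob("*.py"):
    text = py_file.read_text()
    if old in text:
        py_file.write_text(text.replace(old, new))
        print("patched", py_file)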

And voila, it works! Training the Two Tower and DLRM models works!
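To give a flavor of what “a few lines of code” means, the Two Tower training run looks roughly like this. I’m reproducing it from memory of the Merlin Models examples, so treat the exact arguments as assumptions; the API was moving fast at the time:

import merlin.models.tf as mm
from merlin.io import Dataset

# Parquet output from the NVTabular workflow (path is a stand-in).
train = Dataset("./train/*.parquet")

# Two Tower retrieval model: user (query) and item towers, each a small MLP.
model = mm.TwoTowerModel(train.schema, query_tower=mm.MLPBlock([128, 64]))
model.compile(optimizer="adam", run_eagerly=False)
model.fit(train, batch_size=4096, epochs=1)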

Challenge 4: Deployment
Recommender system deployment is a complex business since it involves multiple stages: offline training, online evaluation, and caching. The process diagram looks like the following (taken from NVIDIA).

Why do you need two models, a retrieval model and a ranking model? In setups where you only have thousands of items, you can get away with just a single ranking model. But in planet-scale businesses, the items number in the millions, and ranking all of them would take an impractical amount of inference time. So we have a retrieval model, typically a simpler one, which only passes user and item embeddings down the pipeline. At inference time, we identify the user (cold start doesn’t matter much if we have a good enough feature set for all users) and retrieve a small candidate set for ranking. Ranking models are more complex and can incorporate other streaming features like the last items clicked and the time of day.
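Here’s a toy sketch of that two-stage flow. Every name in it is hypothetical; it’s only meant to show where the retrieval and ranking models sit relative to each other:

import numpy as np

def recommend(user_features, retrieval_model, ranking_model, ann_index, k=100, n=10):
    # Stage 1 (retrieval): one cheap tower forward pass, then an ANN lookup
    # narrows millions of items down to k candidates.
    user_emb = retrieval_model.encode_user(user_features)      # hypothetical interface
    _, candidates = ann_index.search(user_emb, k)
    candidates = candidates[0]

    # Stage 2 (ranking): a richer model scores only those k candidates,
    # optionally with streaming features (last clicks, time of day, ...).
    scores = ranking_model.score(user_features, candidates)    # hypothetical interface
    return candidates[np.argsort(scores)[::-1][:n]]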
Back to retrieval. The item embeddings are used to build an approximate nearest neighbor (ANN) index for a quick retrieval process. Here is where the trouble began. The error is somewhere deep in pandas, Dask, or somewhere in between. Perhaps even the Python version is at fault here, and I can’t do any more hacks. Hence, friends, I end there.
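For completeness, the ANN step itself is not the hard part. A minimal standalone version with FAISS looks like this; FAISS is my stand-in here, not necessarily what the Merlin example uses, and the embeddings are random placeholders:

import numpy as np
import faiss

# Stand-in for the item tower output you would export after training.
item_embeddings = np.random.rand(100_000, 64).astype("float32")

index = faiss.IndexFlatIP(64)      # exact inner-product index; swap for IVF/HNSW variants at scale
index.add(item_embeddings)

# Stand-in for the user tower output at request time.
user_embedding = np.random.rand(1, 64).astype("float32")
scores, candidate_ids = index.search(user_embedding, 100)   # top-100 candidates for the ranking model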

Learnings
Truly though, diving through the Merlin GitHub page, I can see a lot of activity and feature requests flying around. It seems like people are going full steam ahead with the recommended installation, which is the Docker image. Maybe in a few months’ time, people will circle back around for potato setups like mine, but I think the space is too fast-moving for them to focus on this. I did learn a lot though:
- NVTabular and cudf can be used separately. I want to use them in conjunction with PyTorch Lightning to speed up training. Data loaders are a bottleneck since my usual implementations run on the CPU, and I do want to squeeze out every millisecond of GPU time on free instances. 🙂
- Speaking of cudf, I discovered cuml as well through NVIDIA RAPIDS. It uses Dask and CUDA to scale computation out not just on a single GPU but across multiple ones (see the sketch after this list). I’ve worked with Dask before, but I haven’t yet encountered multi-GPU setups, not that they’re common anyway.
- I skimmed through SOTA models like DLRM and DCN v2. The recommender space is truly exciting!
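The multi-GPU pattern I mean is roughly this, a hedged sketch: one Dask worker per visible GPU, with an illustrative parquet path and MovieLens-style column names:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One Dask worker per visible GPU; on my setup that's just one.
cluster = LocalCUDACluster()
client = Client(cluster)

# Illustrative path/columns -- think of the MovieLens ratings file.
ddf = dask_cudf.read_parquet("./ratings/*.parquet")
mean_ratings = ddf.groupby("userId")["rating"].mean().compute()
print(mean_ratings.head())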
Thanks for reading!