You Might Like… Why?

In this post, I’m using the implicit library by Ben Frederickson, a well-designed Python library that computes recommendations from implicit feedback. This time, I’ll focus on explaining recommendations, which is a big part of the user experience. Recommendations are often presented as “You Might Like X because you watched Y.” This connection grounds the recommendation in the user’s prior experience and encourages her to consume the recommended content.

The foundation of this post is the much-cited paper by Yifan Hu, Yehuda Koren, and Chris Volinsky, Collaborative Filtering for Implicit Feedback Datasets. Section 5 of the paper explains how the latent factors can be used to explain recommendations. More on that later.

I’ll be using the Steam Video Games dataset from Kaggle. It’s a small dataset, and it’s no secret that video games are another hobby of mine.

Check out my other post for my previous shot at this.

In [1]:
import numpy as np
import pandas as pd
%pylab inline

data = pd.read_csv("data/steam-200k.csv", header=None, index_col=False, 
                   names=["UserID", "GameName", "Behavior", "Value"])
data[:5]
Populating the interactive namespace from numpy and matplotlib
Out[1]:
   UserID     GameName                    Behavior  Value
0  151603712  The Elder Scrolls V Skyrim  purchase  1
1  151603712  The Elder Scrolls V Skyrim  play      273
2  151603712  Fallout 4                   purchase  1
3  151603712  Fallout 4                   play      87
4  151603712  Spore                       purchase  1
In [2]:
print("Data shape: ", data.shape)
print("Unique users in dataset: ", data.UserID.nunique())
print("Unique games in dataset: ", data.GameName.nunique())
Data shape:  (200000, 4)
Unique users in dataset:  12393
Unique games in dataset:  5155
In [3]:
# total hours played per (user, game) pair
hours_played = data[data.Behavior == "play"].groupby(["UserID","GameName"])["Value"].sum()
print("Hours Played Stats:")
print(hours_played.describe())
Hours Played Stats:
count    70477.000000
mean        48.886386
std        229.353917
min          0.100000
25%          1.000000
50%          4.500000
75%         19.100000
max      11754.000000
Name: Value, dtype: float64
In [4]:
import seaborn as sns

sns.distplot(hours_played[hours_played < 1000])
Out[4]:
[Figure: distribution of hours played per user-game pair, restricted to values below 1,000 hours]

Some Preprocessing

1) This dataset is small, 200k rows, with 12k users and 5k games.

2) For rows whose Behavior is “play”, Value is the number of hours the user has played the game.

3) As you can see, hours played is highly skewed to the right, with some hardcore players logging a very large number of hours.

4) Since we’re using implicit feedback, outliers may skew the results toward very popular games. I’ll use Tukey’s method to clip the outliers. Other options include BM25 weighting, which I’ll get to in a while.
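
As a rough illustration of the BM25 alternative mentioned in the last point, here is a minimal sketch on a toy item-by-user matrix. The function name `bm25_weight_sketch`, the toy hours, and the defaults `K1=100` and `B=0.8` are assumptions for illustration. The implicit library ships its own helper, `implicit.nearest_neighbours.bm25_weight`, and its details may differ from this sketch.

```python
import numpy as np
from scipy.sparse import coo_matrix

def bm25_weight_sketch(X, K1=100.0, B=0.8):
    """BM25-style reweighting of an item x user matrix of hours played.

    Large raw values saturate (controlled by K1), rows with a lot of total
    play time are normalised (controlled by B), and columns that have
    entries in many rows get a smaller IDF factor.
    """
    X = coo_matrix(X, dtype=np.float64)
    n_rows = float(X.shape[0])

    # IDF per column: columns with entries in many rows are down-weighted.
    col_counts = np.bincount(X.col, minlength=X.shape[1])
    idf = np.log(n_rows) - np.log1p(col_counts)

    # Length normalisation per row, relative to the average row total.
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    length_norm = (1.0 - B) + B * row_sums / row_sums.mean()

    weighted = X.data * (K1 + 1.0) / (K1 * length_norm[X.row] + X.data) * idf[X.col]
    return coo_matrix((weighted, (X.row, X.col)), shape=X.shape)

# Made-up hours: 5 games (rows) x 4 users (columns).
hours = np.array([
    [1000.0, 10.0, 0.0, 0.0],
    [5.0,    0.0,  2.0, 0.0],
    [0.0,    3.0,  0.0, 1.0],
    [0.0,    0.0,  8.0, 4.0],
    [0.0,    0.0,  0.0, 6.0],
])

weighted = bm25_weight_sketch(hours).toarray()
raw_ratio = hours[0, 0] / hours[0, 1]
weighted_ratio = weighted[0, 0] / weighted[0, 1]
print(raw_ratio, weighted_ratio)  # the 1000-hour outlier dominates far less after weighting
```

Unlike clipping, which flattens everything above a fixed fence, this kind of weighting compresses large values smoothly, so the ordering of play times within a game is preserved.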

In [5]:
hours_played = hours_played.reset_index()
In [6]:
# Tukey's fences with k=3: clip everything above q75 + k * IQR
k = 3

q75 = hours_played["Value"].quantile(0.75)
q25 = hours_played["Value"].quantile(0.25)
iqr = q75 - q25

hours_played["Value"] = hours_played["Value"].clip(0, q75 + iqr * k)
In [7]:
sns.distplot(hours_played["Value"])
Out[7]:
[Figure: distribution of hours played per user-game pair after clipping]
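
To make the clipping concrete, the fence can be worked out by hand from the quartiles in the describe() output above. This is only a small arithmetic sketch; the helper `clip_hours` is a hypothetical stand-in for what the pandas clip call does to each value.

```python
# Quartiles copied from the describe() output above (hours per user-game pair).
q25, q75 = 1.0, 19.1
k = 3  # same multiplier as in the clipping cell

iqr = q75 - q25              # 18.1
upper_fence = q75 + k * iqr  # 19.1 + 3 * 18.1

def clip_hours(hours, low=0.0, high=upper_fence):
    """Clamp a single value into [low, high]."""
    return max(low, min(hours, high))

print(round(upper_fence, 1))          # 73.4
print(clip_hours(4.5))                # 4.5, the median is untouched
print(round(clip_hours(11754.0), 1))  # 73.4, the maximum is pulled down to the fence
```

So every user-game pair with more than roughly 73 hours is treated as if it had about 73 hours.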

Some data transformation

The library requires the data in a sparse matrix format. This data structure is very memory-efficient and offers some nice computation speedups as well.
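
To make the sparse format concrete, here is a small sketch with made-up triples (the user IDs, games, and hours are invented for illustration). It shows how integer codes become row and column indices, and how few values are stored compared with a dense array.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Made-up (user, game, hours) triples, purely for illustration.
users = np.array([10, 10, 42, 42, 77])
games = np.array(["Portal", "Spore", "Portal", "RAGE", "Spore"])
hours = np.array([12.0, 3.5, 40.0, 7.0, 1.0])

# np.unique returns the sorted unique labels plus an integer code per row,
# which plays the same role as pandas' .cat.codes.
game_labels, game_codes = np.unique(games, return_inverse=True)
user_labels, user_codes = np.unique(users, return_inverse=True)

# Rows are games and columns are users.
toy = coo_matrix((hours, (game_codes, user_codes)),
                 shape=(len(game_labels), len(user_labels)))

print(toy.shape)  # (3, 3)
print(toy.nnz)    # 5 stored values instead of 9 dense cells
print(toy.toarray())
```

On the real data the gap is far larger: about 70k stored values against a grid of several thousand games by several thousand users.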

You’ll also notice that the computation is very fast. As the paper notes, one sweep of weighted matrix factorization (WMF) over the users runs in O(f^2 N + f^3 m), where f is the number of latent factors, N is the number of nonzero values in the data, and m is the number of users. For a fixed f, that is linear in the size of the data. Perfect even for large datasets!
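
To see where the two terms come from, here is a minimal NumPy sketch of the closed-form update for a single user's factors, following the formula in the paper. The function name and toy sizes are assumptions for illustration, not the library's internals. Building the user's f x f system touches only that user's nonzero entries (summed over all users, this gives the f^2 N term), and solving it costs about f^3 per user (the f^3 m term).

```python
import numpy as np

def solve_user_factors(Y, YtY, item_ids, confidences, reg):
    """Closed-form ALS update for one user.

    Y           : (n_items, f) item factors
    YtY         : (f, f) Gram matrix Y.T @ Y, computed once per sweep
    item_ids    : indices of the items this user interacted with
    confidences : confidence c_ui for each of those items
    reg         : L2 regularisation strength (lambda)
    """
    f = Y.shape[1]
    Yu = Y[item_ids]  # (n_u, f): only this user's items
    # Y^T C^u Y = Y^T Y + Yu^T (C^u - I) Yu, which costs O(f^2 n_u).
    A = YtY + Yu.T @ ((confidences - 1.0)[:, None] * Yu) + reg * np.eye(f)
    # Y^T C^u p(u): preferences are 1 for observed items and 0 elsewhere.
    b = Yu.T @ confidences
    return np.linalg.solve(A, b)  # O(f^3)

rng = np.random.default_rng(0)
n_items, f = 50, 4
Y = rng.normal(size=(n_items, f))
item_ids = np.array([3, 17, 42])
confidences = np.array([5.0, 2.0, 9.0])

x_u = solve_user_factors(Y, Y.T @ Y, item_ids, confidences, reg=0.1)

# Cross-check against the naive dense formula, which touches every item.
C = np.ones(n_items)
C[item_ids] = confidences
p = np.zeros(n_items)
p[item_ids] = 1.0
x_naive = np.linalg.solve(Y.T @ (C[:, None] * Y) + 0.1 * np.eye(f), Y.T @ (C * p))
print(np.allclose(x_u, x_naive))  # True
```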

In [8]:
from scipy.sparse import coo_matrix

hours_played["UserID"] = hours_played.UserID.astype('category')
hours_played["GameName"] = hours_played.GameName.astype('category')

# rows are games, columns are users (np.float was removed from recent NumPy releases)
plays = coo_matrix((hours_played.Value.astype(np.float64),
    (hours_played["GameName"].cat.codes.copy(),
    hours_played["UserID"].cat.codes.copy())))
In [9]:
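# "!export" only changes a throwaway subshell (hence the warning in the run below); %env sets it for the kernel
%env OPENBLAS_NUM_THREADS=1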
In [10]:
import logging
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("implicit")
log.setLevel(logging.DEBUG)
In [11]:
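import implicit

# plays is item x user, the layout fit() expected in the implicit release used here;
# newer releases (0.5 and later) expect user x item instead
model = implicit.als.AlternatingLeastSquares(factors=100, iterations=20, regularization=0.1,
                                             calculate_training_loss=True, 
                                             num_threads=4)
model.fit(plays)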
WARNING:root:OpenBLAS detected. Its highly recommend to set the environment variable 'export OPENBLAS_NUM_THREADS=1' to disable its internal multithreading
DEBUG:implicit:finished iteration 0 in 0.168433904648
DEBUG:implicit:loss at iteration 0 is 0.00747555729539
DEBUG:implicit:finished iteration 1 in 0.238357067108
DEBUG:implicit:loss at iteration 1 is 0.00503169964673
DEBUG:implicit:finished iteration 2 in 0.224936008453
DEBUG:implicit:loss at iteration 2 is 0.00417479294388
DEBUG:implicit:finished iteration 3 in 0.240418195724
DEBUG:implicit:loss at iteration 3 is 0.00368095271631
DEBUG:implicit:finished iteration 4 in 0.252806901932
DEBUG:implicit:loss at iteration 4 is 0.0033598037319
DEBUG:implicit:finished iteration 5 in 0.245897054672
DEBUG:implicit:loss at iteration 5 is 0.00312864649216
DEBUG:implicit:finished iteration 6 in 0.2616147995
DEBUG:implicit:loss at iteration 6 is 0.00295347147061
DEBUG:implicit:finished iteration 7 in 0.244046926498
DEBUG:implicit:loss at iteration 7 is 0.00281540811624
DEBUG:implicit:finished iteration 8 in 0.25129699707
DEBUG:implicit:loss at iteration 8 is 0.00270348356697
DEBUG:implicit:finished iteration 9 in 0.259847164154
DEBUG:implicit:loss at iteration 9 is 0.00261063792048
DEBUG:implicit:finished iteration 10 in 0.233308076859
DEBUG:implicit:loss at iteration 10 is 0.00253233806323
DEBUG:implicit:finished iteration 11 in 0.282055854797
DEBUG:implicit:loss at iteration 11 is 0.00246539037816
DEBUG:implicit:finished iteration 12 in 0.257393836975
DEBUG:implicit:loss at iteration 12 is 0.00240746030433
DEBUG:implicit:finished iteration 13 in 0.303618907928
DEBUG:implicit:loss at iteration 13 is 0.00235681940443
DEBUG:implicit:finished iteration 14 in 0.19407081604
DEBUG:implicit:loss at iteration 14 is 0.00231214550339
DEBUG:implicit:finished iteration 15 in 0.170318841934
DEBUG:implicit:loss at iteration 15 is 0.00227256111595
DEBUG:implicit:finished iteration 16 in 0.155373096466
DEBUG:implicit:loss at iteration 16 is 0.00223719169848
DEBUG:implicit:finished iteration 17 in 0.250052213669
DEBUG:implicit:loss at iteration 17 is 0.0022054165621
DEBUG:implicit:finished iteration 18 in 0.241719007492
DEBUG:implicit:loss at iteration 18 is 0.00217669027829
DEBUG:implicit:finished iteration 19 in 0.220392942429
DEBUG:implicit:loss at iteration 19 is 0.00215057491699

User recommendation

In WMF, a recommendation score is the dot product of the user’s and the item’s latent factors. It turns out that with some clever manipulation of the user and item factors, one can explain the recommendations with the following linear expression.

\hat{p}_{ui} = \sum_{j:\ r_{uj} >0} s^u_{ij} \ c_{uj}

The first factor in the summation, s^u_{ij}, is the weighted similarity between items i and j from user u’s viewpoint. The second factor, c_{uj}, is the confidence associated with user u’s past action on item j. Another way to look at it: this resembles neighborhood methods, where item similarities are directly available, except that the similarities are now computed from the latent factors and from the perspective of a particular user. As the paper puts it:

This shares much resemblance with item-oriented neighborhood models, which enables the desired ability to explain computed predictions... In addition, similarities between items become dependent on the specific user in question, reflecting the fact that different users do not completely agree on which items are similar.
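
The decomposition can be checked numerically. The sketch below uses random toy factors and the definitions from Section 5 of the paper, W^u = (Y^T C^u Y + λI)^{-1} and s^u_{ij} = y_i^T W^u y_j. All variable names and sizes are assumptions for illustration, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, f, reg = 30, 5, 0.1
Y = rng.normal(size=(n_items, f))    # toy item factors

liked = np.array([2, 8, 21])         # items j with r_uj > 0
c_liked = np.array([4.0, 1.5, 7.0])  # confidences c_uj for those items

# Full confidence vector: 1 for unobserved items, c_uj for observed ones.
C = np.ones(n_items)
C[liked] = c_liked

# W^u = (Y^T C^u Y + reg * I)^-1 and the user factor x_u = W^u Y^T C^u p(u).
W = np.linalg.inv(Y.T @ (C[:, None] * Y) + reg * np.eye(f))
x_u = W @ (Y[liked].T @ c_liked)

i = 11                               # candidate item to explain
score = Y[i] @ x_u                   # usual prediction y_i^T x_u

# Per-item contributions s^u_ij * c_uj, with s^u_ij = y_i^T W^u y_j.
s = Y[liked] @ (W @ Y[i])
contributions = s * c_liked

print(np.isclose(score, contributions.sum()))  # True

# Sorting the contributions gives the "because you played ..." list.
for j, w in sorted(zip(liked, contributions), key=lambda t: -t[1]):
    print(j, round(float(w), 4))
```

The predicted score splits exactly into one term per item the user already played, which is what makes the explanation faithful to the model rather than a post-hoc guess.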

Conveniently, the library already does this for us. There is an associated score for each recommendation, and each explanation also has a similarity score, the s in the above equation.

Let’s go for it!

In [12]:
user_to_id_mapping = dict([(user, idx) for (idx, user) in enumerate(hours_played.UserID.cat.categories)] )
id_to_user_mapping = dict(enumerate(hours_played.UserID.cat.categories))
item_to_id_mapping = dict([(item, idx) for (idx, item) in enumerate(hours_played.GameName.cat.categories)] )
id_to_item_mapping = dict(enumerate(hours_played.GameName.cat.categories))
In [13]:
def recommend_user(user_id, model, user_items, N=10, name=False):
    # Items are already identified by game name, so `name` makes no difference;
    # it is kept only so existing calls keep working.
    return [(id_to_item_mapping[item], score) for (item, score) in
        model.recommend(user_to_id_mapping[user_id], user_items, N=N)]

def explain(user_id, model, user_items, N_recos=20, N_explanations=1, name=True):
    recos = recommend_user(user_id, model, user_items, N_recos, False)
    
    for item, score in recos:
        print("You might like:")
        print("'{}' ({}) because:".format(item, score))
        
        _, explanations, _ = model.explain(user_to_id_mapping[user_id], 
                user_items, item_to_id_mapping[item], N=N_explanations)
        
        for related_item, weight in explanations:        
            related_item_name = id_to_item_mapping[related_item]
            print("-- {} ({})".format(related_item_name, weight))
        print("")  # blank line between recommendations

User 1

1) The Warhammer and Saints Row IV recommendations are explained clearly.

2) Aliens vs. Predator is a little strange. Grand Theft Auto as well.

In [14]:
user_id = 87201181
user_items = plays.transpose().tocsr()
explain(user_id, model, user_items, N_recos= 5, N_explanations=2)
You might like:
'Warhammer 40,000 Space Marine' (0.650754633743) because:
-- Warhammer 40,000 Dawn of War II  Retribution (0.431388781737)
-- Warhammer 40,000 Dawn of War  Dark Crusade (0.190309152636)

You might like:
'Saints Row IV' (0.593350439269) because:
-- Saints Row The Third (0.332452200095)
-- Call of Duty Black Ops (0.179419466526)

You might like:
'Chivalry Medieval Warfare' (0.478674623652) because:
-- Warhammer 40,000 Dawn of War  Dark Crusade (0.204242121751)
-- Mortal Kombat Komplete Edition (0.158272607268)

You might like:
'Injustice Gods Among Us Ultimate Edition' (0.475480738068) because:
-- Mortal Kombat Komplete Edition (0.275720011508)
-- Warhammer 40,000 Dawn of War II  Retribution (0.187482845692)

You might like:
'Aliens vs. Predator' (0.461616421818) because:
-- Warhammer 40,000 Dawn of War II  Retribution (0.257499609042)
-- Mortal Kombat Komplete Edition (0.205403214068)

User 2

1) This user looks like a fan of action RPGs.

2) Cities Skylines is a strange explanation, however. Connecting a city simulator to an FPS is far-fetched.

In [15]:
user_id = 5250
user_items = plays.transpose().tocsr()
explain(user_id, model, user_items, N_recos= 5, N_explanations=2)
You might like:
'Mafia II' (0.354957815467) because:
-- Cities Skylines (0.137989084212)
-- Alien Swarm (0.129184417192)

You might like:
'Portal' (0.343594570406) because:
-- Portal 2 (0.434564176472)
-- Alien Swarm (0.0470184070953)

You might like:
'GRID 2' (0.28461606681) because:
-- Cities Skylines (0.174919238512)
-- Portal 2 (0.0495238025198)

You might like:
'RAGE' (0.279416044669) because:
-- Deus Ex Human Revolution (0.230393735747)
-- Cities Skylines (0.0752454003821)

You might like:
'Left 4 Dead 2' (0.27811598071) because:
-- Cities Skylines (0.11492406745)
-- Alien Swarm (0.107206717424)

User 3

1) Here’s our fan of first-person shooters. Counter-Strike is a good recommendation.

2) He also has a preference for strategy games, making RUSE a good recommendation.

In [16]:
user_id = 76767
user_items = plays.transpose().tocsr()
explain(user_id, model, user_items, N_recos= 5, N_explanations=2)
You might like:
'Portal' (0.558594907323) because:
-- Portal 2 (0.437595504459)
-- Call of Duty Modern Warfare 2 (0.128709422013)

You might like:
'Call of Duty Black Ops III' (0.471370623459) because:
-- Total War ATTILA (0.258104185012)
-- Call of Duty World at War (0.133022459966)

You might like:
'Medieval II Total War' (0.395897631422) because:
-- Total War ATTILA (0.231480911151)
-- Banished (0.158693693231)

You might like:
'Rome Total War' (0.382202113883) because:
-- Banished (0.125397503703)
-- Total War ATTILA (0.0993190916736)

You might like:
'Don't Starve' (0.377931995365) because:
-- Call of Duty World at War (0.188067361432)
-- Banished (0.167969052676)

User 4

And here’s an RPG gamer.

In [17]:
user_id = 86540
user_items = plays.transpose().tocsr()
explain(user_id, model, user_items, N_recos= 5, N_explanations=2)
You might like:
'Torchlight' (0.371717403665) because:
-- Torchlight II (0.138597557086)
-- Audiosurf (0.100369466322)

You might like:
'DC Universe Online' (0.360708026933) because:
-- Audiosurf (0.232236096265)
-- XCOM Enemy Unknown (0.0625009941007)

You might like:
'Fallout 3 - Game of the Year Edition' (0.359058475385) because:
-- Audiosurf (0.140772373185)
-- Far Cry 3 (0.0586405840269)

You might like:
'Killing Floor' (0.320037787897) because:
-- Audiosurf (0.230890212286)
-- XCOM Enemy Unknown (0.125344723916)

You might like:
'Alien Swarm' (0.316589753989) because:
-- Audiosurf (0.182271834021)
-- Left 4 Dead 2 (0.0872668658834)

User 5

A Sportsman!

In [18]:
user_id = 17495098
user_items = plays.transpose().tocsr()
explain(user_id, model, user_items, N_recos= 5, N_explanations=2)
You might like:
'Football Manager 2016' (0.74589289089) because:
-- Football Manager 2015 (0.423861168519)
-- Football Manager 2014 (0.267616806169)

You might like:
'Counter-Strike Condition Zero Deleted Scenes' (0.501820960363) because:
-- Counter-Strike (0.264472688471)
-- TowerFall Ascension (0.151734294928)

You might like:
'Hammerwatch' (0.475934267318) because:
-- TowerFall Ascension (0.129457486512)
-- The Binding of Isaac (0.0861012657826)

You might like:
'Day of Defeat Source' (0.471574863449) because:
-- Counter-Strike Source (0.269956533642)
-- Football Manager 2013 (0.10682900327)

You might like:
'DARK SOULS II Scholar of the First Sin' (0.46081385385) because:
-- TowerFall Ascension (0.137355275935)
-- Crusader Kings II (0.13357454549)

Conclusions

Some of the results were indeed helpful. I particularly liked the Football Manager recommendations. However, I do have a few pointers:

1) Repeated recommendations from the same franchise should come up less often. See those repeated Football Manager titles?

2) Steam has genre panes: there is one each for indie games, sports, RPG, FPS, strategy, and so on. To fit these explanations to each pane, one can add another weight for genre similarity to the score.
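
Both pointers can be sketched as a re-ranking step on top of the model's output. Everything here is hypothetical: the `franchise_of` and `genre_similarity` helpers, the cap of one title per franchise, the blending weight, and the toy scores and genres are assumptions for illustration, since the dataset as loaded above has only user, game, behavior, and value columns.

```python
def rerank(recos, franchise_of, genre_similarity, pane_genre,
           max_per_franchise=1, genre_weight=0.5):
    """Re-rank (title, score) pairs for one genre pane.

    recos            : list of (title, score) pairs from the recommender
    franchise_of     : hypothetical function, title -> franchise key
    genre_similarity : hypothetical function, (title, genre) -> value in [0, 1]
    """
    # Pointer 2: add a genre-similarity bonus to the model score.
    boosted = [(title, score + genre_weight * genre_similarity(title, pane_genre))
               for title, score in recos]
    boosted.sort(key=lambda pair: pair[1], reverse=True)

    # Pointer 1: keep at most `max_per_franchise` titles per franchise.
    seen, kept = {}, []
    for title, score in boosted:
        key = franchise_of(title)
        if seen.get(key, 0) < max_per_franchise:
            kept.append((title, score))
            seen[key] = seen.get(key, 0) + 1
    return kept

# Toy inputs with made-up scores and genre labels.
recos = [("Football Manager 2016", 0.75), ("Football Manager 2015", 0.70),
         ("Hammerwatch", 0.48), ("Day of Defeat Source", 0.47)]
toy_genres = {"Football Manager 2016": "sports", "Football Manager 2015": "sports",
              "Hammerwatch": "indie", "Day of Defeat Source": "fps"}

def franchise_of(title):
    # Crude stand-in: treat the first two words as the franchise.
    return " ".join(title.split()[:2])

def genre_similarity(title, genre):
    return 1.0 if toy_genres[title] == genre else 0.0

indie_pane = rerank(recos, franchise_of, genre_similarity, pane_genre="indie")
print([title for title, _ in indie_pane])
# ['Hammerwatch', 'Football Manager 2016', 'Day of Defeat Source']
```

A real version would need franchise and genre metadata from another source, and the weight would have to be tuned against actual user behavior.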

Thanks for reading.

By krsnewwave

I'm a software engineer and a data science guy on recommender systems, natural language processing, and computer vision.
