Machine Learning

# A fastText-based hybrid recommender

September 27, 2016

## Introduction

Using Facebook Research's new fastText library in supervised mode, I trained a hybrid recommender system, to recommend articles to users, given as training data both the text in the articles and the user/article interaction matrix. The labels attached to a document were both its id, and the ids of all users who viewed it. I've not finished testing it, but early signs are that it learns quite well. This post is a short progress report, perhaps of interest to others building hybrid recommenders.

### What is it?

Recently, Facebook Research released the C++ source code for fastText, a software library for learning word embeddings using a shallow neural network. It can learn either unsupervised, training on unstructured text, or in a supervised manner, from a series of labelled documents. The code is accompanied by two research preprints:

1. Enriching Word Vectors with Subword Information (unsupervised)
2. Bag of Tricks for Efficient Text Classification (supervised),

explaining fastText's workings and performance on benchmark datasets in each case. The authors state its performance is comparable with much more complex "deep learning" algorithms, but the training time is significantly lower.

### How does it work?

In unsupervised mode, fastText trains by sliding a window over the input text and either learning the centre word from the remaining context ("continuous bag of words" or "CBOW"), or all the context words from the centre word ("Skipgram"), and learning can be viewed as a series of updates to a neural network with two layers of weights and three tiers of neurons, in which the two outer tiers each have one neuron for each word in the vocabulary, and the middle tier has as many neurons as there are dimensions in the embedding space. In this way, it is similar to word2vec (in fact, one of the fastText coauthors, Tomas Mikolov, is the creator of word2vec). Unlike word2vec, however, fastText can also learn vectors for sub-parts of words  -- so-called character n-grams -- ensuring e.g. that the words "supervised", "supervise" and "supervisor" all have similar vectorisations, even if they tend to appear in different contexts. This feature enhances learning on heavily inflected languages, according to the authors.

In supervised mode, the first two neuron tiers of the network are the same, but the neurons in the output tier are in bijection with the labels, rather than the words in the vocabulary. The learning task on a given window is to predict the labels attached to the current document, given the words in the window.

## Training a hybrid recommender

### Defining the model

Supervised fastText trains on a text file in which each line is a series of space-separated labels, followed by the text of single document. The labels are prepended by a crib, by default "__label__", so they can be differentiated from words in the text.

In my case, each line of the training file contained one document, with no repetitions. The labels attached to a document were its document id, and the ids of all users who had interacted with it. Here's a fictitious example:

<pre>
__label__user_id_87sdfsdf __label__user_id_45jfkasf novak djokovic
won yet another tennis match last friday ...
</pre>

After training, the score associated to a user/document pair is the cosine similarity of their associated vectors. (Note that the vectors learned by fastText are not of unit length, so computing cosine similarity means normalising.) The document yielding the highest score is defined to be the recommendation.

### Data preparation and hyperparameters

After throwing out users and documents with too few interactions, my training set consisted of around 22k documents and 85k users, with 1 million interactions, and my test set consisted of 30,000 user/document interaction pairs. The training set was 56MB in size. Here are the hyperparameters I specified for the best-performing model (all others were default):

-dim 100
-loss ns
-minCount 5
-neg 10
-ws 20
-epoch 100000
-lr 0.075


The training took around 5 hours on a 32-core, 120 GB AWS instance. Varying the window size, dimension and minCount made little difference to model quality. Increasing "neg" -- the number of negative samples -- improved the quality of the model uniformly in the range I checked: values from 5 to 50 in intervals of 5. (The downside is that more negative samples means training takes longer.) Varying learning rate affected the model unpredictably: larger values seemed better when the number of epochs was small, but I can't see a consistent pattern. Using a very large learning rate with very many epochs lead to segfault errors, perhaps because of arithmetic overflow. Increasing the number of epochs steadily improved model quality. I stopped at 100,000 epochs because the improvements were becoming marginal, and my patience ran out.

### A small "gotcha"

In the version of the code I downloaded in late August, the documentation stated the default value for the "loss" hyperparameter was "ns", meaning negative sampling. However, this was not true at the time: reading the code revealed the true default value was "softmax", which perhaps leads to a better model, but is much slower to train. This meant that my first attempts to train a model were unsuccessful, because fixing a large enough value of "epoch" to ensure the model learned something would've meant waiting days or weeks for the model to train. After a few days' frustration, I read the code and noticed the error in the documentation. Happily, fastText coauthor Edouard Grave was extremely helpful, responding immediately to my bug report, and fixing the problem within a day. He was also kind enough to suggest I use hierarchical softmax instead of negative sampling for the "loss" parameter, since it is, apparently, faster still to train. I haven't yet experimented with hierarchical softmax systematically. A big thank you to Edouard for his help!

### Rationale: why should this work?

The first layer of weights in the network encodes the word vectors; the second layer the label vectors. During training, at any given text window, each word vector is moved closer to a randomly chosen label vector for the current document. Each document id label is seen only once per training epoch, but if we train over many epochs, then:

1. We expect the label vector associated to a document id to be pushed closer to the average of the word vectors this document contains, and
2. We expect the vector associated to a user id to be pushed closer to the average of the vectors of all words mentioned in each document seen by that user.

If both the above hold, then label vectors of document ids and label vectors of user ids will be tend to be moved closer to one another, provided the given document has been seen by the given user. For the same reason, it will also be the case that any two users with similar reading tastes, and any two documents attracting users with similar reading tastes, will have similar vectorisations.

## Results

We measured the performance of the model as follows:

1. For each user_id/doc_id pair(u, d) in the test set, compute the cosine similarity of vec(u) with vec(d'), where d' varies across all documents in the test set; this is by definition the score s(u, d') of (u, d').
2. Write these scores s(u, d') in descending order in a list L, and record the position in which s(u, d) occurs.
3. Now compute the "decimal percentile" of s(u, d) in L: if it is in first place, assign it a score of 0.0, if it is in last place, give it 1.0; in the middle is 0.5. Call this number the "percentile rank" pr(u, d) of the pair (u, d).
4. We measure the performance of the model as the mean of pr(u, d) across all 30,000 pairs (u, d).

The best model trained in this way achieved a score of around 0.093. A model that understood nothing would achieve a score of 0.5 -- meaning that the score s(u, d) would be half-way down the list of test scores on average -- so this is significantly better than random. However, this result is not as good as our current state-of-the-art model (CSOA), which achieves a score of less than 0.01. However, the CSOA is also trained on around 100 times more data. The next step is to train a fastText model on this larger data set. The experiment continues!