How to produce rankings using the SerialRank algorithm

This IPython notebook is an informal demo of how to produce rankings using SerialRank. While it is self-contained, it is merely a practical introduction to SerialRank.

Theoretical justifications for using SerialRank can be found in the preprint of our NIPS paper (Fajwel Fogel, Alexandre d'Aspremont, Milan Vojnovic).

You can download the code of this ipython notebook here: tutorial.py, tutorial.ipynb.

A starting example: ranking restaurants in 10 lines of code

Suppose Alice and Bob want to rank the four restaurants they visited during their lovely weekend in Paris. Because they do not want to bother giving each restaurant a score based on a list of criteria (even thinking about it gives them a headache), they have decided to write down their pairwise preferences in the form of a double-entry table, which we call the comparison matrix $C$. For instance, if they preferred the pizzeria to the brasserie, they set the corresponding entry of $C$ to 1. In the opposite case they would set it to -1.

                Pizzeria   Brasserie   Gastronomique   Café
Pizzeria            -         -1             1          -1
Brasserie           1          -             1          -1
Gastronomique      -1         -1             -          -1
Café                1          1             1           -

The corresponding matrix $C$ is antisymmetric by construction (off the diagonal), and we set the diagonal entries to 1 (they have no influence on what follows).

In [2]:
%matplotlib inline
import numpy as np
import scipy as sc
from pylab import *
from scipy import stats
import matplotlib.image as mpimg

# Input pairwise preference matrix C
C = np.array([[ 1, -1,  1, -1],
              [ 1,  1,  1, -1],
              [-1, -1,  1, -1],
              [ 1,  1,  1,  1]])

Now we count, for each pair of restaurants, the number of times they have the same preference relationship with another restaurant. This gives us a similarity measure. For instance, the café and the brasserie look quite similar in terms of appreciation by Bob and Alice, since both were enjoyed less than the gastronomique restaurant and the pizzeria. In terms of calculations, this just amounts to computing the matrix product of $C$ with its transpose, i.e.:

In [3]:
# Calculate similarity Sim
Sim=np.dot(C,C.T)
# subtract the minimum of Sim in order to keep Sim non-negative
Sim=Sim-Sim.min()

Now we reorder our similarity matrix, based on the fact that similar restaurants should have nearby positions in the ranking. We use a spectral technique from the seriation literature, which in practice just amounts to sorting the entries of the second eigenvector of the Laplacian matrix $L$:

In [4]:
# Compute the Laplacian matrix and its second eigenvector
L = np.diag(np.sum(Sim,1)) - Sim
D, V =np.linalg.eigh(L)
argD = np.argsort(D)
D=D[argD]
V=V[:,argD]
fiedler=V[:,1]

# get permutation corresponding to sorting the second eigen vector
retrievedPerm=np.argsort(fiedler)
# display ranking of restaurants
restaurants=np.array(['Pizzeria', 'Brasserie', 'Gastronomique', 'Cafe'])
ranking=restaurants[retrievedPerm]
print(ranking)
['Gastronomique' 'Pizzeria' 'Brasserie' 'Cafe']

It's just as simple as that! Of course, there is actually a lot more going on, hidden behind the eigendecomposition of $L$. We refer the interested reader to our NIPS paper for more explanations.

Why use SerialRank?

Besides the fact that SerialRank is intuitive and easy to implement, the main justification for using it is its robustness to non-transitive preferences. In contrast to scoring methods, such as counting the number of positive comparisons minus the number of negative comparisons, the similarity measure of SerialRank keeps most of the information present in the comparison matrix.

Let's go back to Alice and Bob to illustrate this. Suppose Alice and Bob have changed their minds and have now decided that the café was better than the gastronomique restaurant (it could be due to a mistake made while writing down the comparisons, or some other unknown reason...).

If we rank restaurants according to the commonly used score #positive comparisons - #negative comparisons, there is now a tie between the gastronomique restaurant and the pizzeria, as well as between the brasserie and the café:

In [5]:
score=np.sum(C,0)
newC=C.copy()
newC[2,3]=1
newC[3,2]=-1
newScore=np.sum(newC,0)
print('old score is',score)
print('new score is',newScore)
old score is [ 2  0  4 -2]
new score is [2 0 2 0]

On the other hand, SerialRank preserves the original ranking, which seems much more reasonable given all comparisons.

In [7]:
# compute new similarity
newSim=np.dot(newC,newC.T)
newSim=newSim-newSim.min()
# compute new ranking
newL = np.diag(np.sum(newSim,1)) - newSim
newD, newV = np.linalg.eigh(newL)
argD = np.argsort(newD)
newD=newD[argD]
newV=newV[:,argD]
newRetrievedPerm=np.argsort(newV[:,1])
newRanking=restaurants[newRetrievedPerm]
print(newRanking)
['Gastronomique' 'Pizzeria' 'Brasserie' 'Cafe']

This can also be seen by looking at the similarity matrix, which preserves some strict inequalities between its entries once reordered.

In [8]:
Sim_reordered=Sim[retrievedPerm,:]
Sim_reordered=Sim_reordered[:,retrievedPerm]
newSim_reordered=newSim[newRetrievedPerm,:]
newSim_reordered=newSim_reordered[:,newRetrievedPerm]
subplot(1, 2, 1)
imshow(Sim_reordered, interpolation = 'nearest')
title('Original similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(newSim_reordered, interpolation = 'nearest')
title('New similarity matrix')
axis('off')
show()

Real-world datasets

Numerical experiments comparing SerialRank with classical ranking methods on real-world datasets can be found in our NIPS paper.

More detailed description

We describe a seriation algorithm for ranking a set of $n$ items given pairwise comparisons between these items. Intuitively, the algorithm assigns similar rankings to items that compare similarly with all others. It does so by constructing a similarity matrix from pairwise comparisons, using spectral methods to reorder this matrix and construct a ranking.
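
Before going through the steps one by one, here is the whole pipeline condensed into roughly ten lines. This is a minimal sketch mirroring the cells below; the function name serialrank and the sign-fixing heuristic at the end are our own choices, not prescribed by the paper.

import numpy as np

def serialrank(C):
    # C: n x n comparison matrix with entries in {-1, 0, 1}
    n = C.shape[0]
    Sim = (np.dot(C, C.T) + n) / 2         # similarity: matching-comparison counts
    L = np.diag(Sim.sum(axis=1)) - Sim     # graph Laplacian of the similarity
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues come out in ascending order
    perm = np.argsort(eigvecs[:, 1])       # sort the Fiedler vector
    # the Fiedler vector is defined up to sign: flip so that the first-ranked
    # item has mostly positive comparisons
    if np.mean(C[perm[0], :]) < 0:
        perm = perm[::-1]
    return perm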

Ranking based on pairwise comparisons

Suppose your input is a matrix of pairwise comparisons $C$ that takes values between -1 and 1. In the ideal scenario where all comparisons are available and transitive (i.e. consistent with an underlying total order), $C$ would look like this:

In [11]:
%matplotlib inline
import numpy as np
import scipy as sc
from pylab import *
from scipy import stats
import matplotlib.image as mpimg
n=10
C=np.triu(np.ones((n,n)),0)-np.tril(np.ones((n,n)),-1)
C
Out[11]:
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [-1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [-1., -1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [-1., -1., -1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [-1., -1., -1., -1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [-1., -1., -1., -1., -1.,  1.,  1.,  1.,  1.,  1.],
       [-1., -1., -1., -1., -1., -1.,  1.,  1.,  1.,  1.],
       [-1., -1., -1., -1., -1., -1., -1.,  1.,  1.,  1.],
       [-1., -1., -1., -1., -1., -1., -1., -1.,  1.,  1.],
       [-1., -1., -1., -1., -1., -1., -1., -1., -1.,  1.]])

If some pairwise comparisons are missing, we set the corresponding entries of the comparison matrix $C$ to 0.

Constructing a similarity between items

Suppose for instance that we want to order a set of players based on the outcomes of their games. The underlying assumption of SerialRank is that players who beat the same players and are beaten by the same players should have a similar ranking.

We relate the ranking problem to the "seriation" problem, where given a similarity matrix between a set of $n$ items, we aim at ordering the items along a chain such that the similarity between items decreases with their distance within this chain. When ranking players, the similarity measure between two players will be the number of similar outcomes against other opponents.

Formally, we define the similarity matrix $Sim$ componentwise as $Sim_{i,j}=\sum_{k=1}^n\left( \frac{1+C_{i,k}C_{j,k}}{2}\right)$, or equivalently $Sim=\frac{1}{2} \left(n \mathbf 1 \mathbf 1^T+ CC^T\right)$. Since $C_{i,k}C_{j,k}=1$ if $C_{i,k}$ and $C_{j,k}$ have the same sign, and $C_{i,k}C_{j,k}=-1$ if they have opposite signs, $Sim_{i,j}$ counts the number of matching comparisons of $i$ and $j$ with other reference items $k$. If $i$ or $j$ is not compared with $k$, then $C_{i,k}C_{j,k}=0$ and the term $(1+C_{i,k}C_{j,k})/2$ contributes the neutral value $1/2$ to the similarity.

In [12]:
Sim=1/2*(np.dot(C,C.T)+n)
Sim
Out[12]:
array([[ 10.,   9.,   8.,   7.,   6.,   5.,   4.,   3.,   2.,   1.],
       [  9.,  10.,   9.,   8.,   7.,   6.,   5.,   4.,   3.,   2.],
       [  8.,   9.,  10.,   9.,   8.,   7.,   6.,   5.,   4.,   3.],
       [  7.,   8.,   9.,  10.,   9.,   8.,   7.,   6.,   5.,   4.],
       [  6.,   7.,   8.,   9.,  10.,   9.,   8.,   7.,   6.,   5.],
       [  5.,   6.,   7.,   8.,   9.,  10.,   9.,   8.,   7.,   6.],
       [  4.,   5.,   6.,   7.,   8.,   9.,  10.,   9.,   8.,   7.],
       [  3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,   9.,   8.],
       [  2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,   9.],
       [  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.]])

Remark: it is actually sufficient for what follows to have $Sim \geq 0$, so we may as well compute

In [13]:
Sim=1/2*(np.dot(C,C.T))
Sim=Sim-Sim.min()
Sim
Out[13]:
array([[ 9.,  8.,  7.,  6.,  5.,  4.,  3.,  2.,  1.,  0.],
       [ 8.,  9.,  8.,  7.,  6.,  5.,  4.,  3.,  2.,  1.],
       [ 7.,  8.,  9.,  8.,  7.,  6.,  5.,  4.,  3.,  2.],
       [ 6.,  7.,  8.,  9.,  8.,  7.,  6.,  5.,  4.,  3.],
       [ 5.,  6.,  7.,  8.,  9.,  8.,  7.,  6.,  5.,  4.],
       [ 4.,  5.,  6.,  7.,  8.,  9.,  8.,  7.,  6.,  5.],
       [ 3.,  4.,  5.,  6.,  7.,  8.,  9.,  8.,  7.,  6.],
       [ 2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.,  8.,  7.],
       [ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.,  8.],
       [ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.]])

Reordering the similarity

When rows and columns are ordered according to the underlying total order, our similarity matrix has a nice "Robinson" structure, i.e. its coefficients decrease when moving away from the diagonal.

In [14]:
# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()
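
If you prefer to check the Robinson property numerically rather than just visually, here is a small helper of our own (a hypothetical utility, not part of the SerialRank code); it assumes Sim is the well-ordered similarity matrix computed above:

import numpy as np

def is_robinson(S):
    # S is Robinson if, in every row, entries are non-increasing when moving
    # away from the diagonal (S is assumed symmetric)
    n = S.shape[0]
    for i in range(n):
        right = S[i, i:]     # from the diagonal to the right edge: non-increasing
        left = S[i, :i + 1]  # from the left edge to the diagonal: non-decreasing
        if np.any(np.diff(right) > 0) or np.any(np.diff(left) < 0):
            return False
    return True

print(is_robinson(Sim))  # True for the well-ordered similarity matrix above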

When rows and columns are disordered, no pattern is visible.

In [15]:
# Disorder comparison and similarity matrices
perm=np.random.permutation(n)
Sim_perm=Sim[perm,:]
Sim_perm=Sim_perm[:,perm]
C_perm=C[perm,:]
C_perm=C_perm[:,perm]

# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim_perm, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C_perm, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()

We use a very nice result from the seriation literature, which in our case says that reordering the similarity matrix amounts to sorting the entries of the eigenvector associated with the second smallest eigenvalue of the Laplacian matrix $L=\mathbf{Diag}(Sim\mathbf 1) - Sim$. This eigenvector is called the "Fiedler" vector, and it also appears in the spectral clustering literature.

In [16]:
# compute the Laplacian matrix and the Fiedler vector
L = np.diag(np.sum(Sim_perm,1)) - Sim_perm

D, V =np.linalg.eigh(L)
argD = np.argsort(D)
D=D[argD]
V=V[:,argD]
fiedler=V[:,1]

# get permutation corresponding to sorting the Fiedler vector
retrievedPerm=np.argsort(fiedler)
Sim_reordered=Sim_perm[retrievedPerm,:]
Sim_reordered=Sim_reordered[:,retrievedPerm]
C_reordered=C_perm[retrievedPerm,:]
C_reordered=C_reordered[:,retrievedPerm]

# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim_reordered, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C_reordered, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()

The retrieved permutation perfectly reorders the similarity matrix.

Remark on reverse permutation

Notice that reordering the similarity matrix with the "reverse" permutation (i.e. indices taken in reverse order) gives the same result. Therefore, to choose between the two permutations, we look at one (or several) of the available pairwise comparisons.

In [17]:
if C_reordered[0,n-1] < 0:
    retrievedPerm=retrievedPerm[::-1]
    Sim_reordered=Sim_perm[retrievedPerm,:]
    Sim_reordered=Sim_reordered[:,retrievedPerm]
    C_reordered=C_perm[retrievedPerm,:]
    C_reordered=C_reordered[:,retrievedPerm]
    # display similarity and comparison matrices
    subplot(1, 2, 1)
    imshow(Sim_reordered, interpolation = 'nearest')
    title('Similarity matrix')
    axis('off')
    subplot(1, 2, 2)
    imshow(C_reordered, interpolation = 'nearest')
    title('Comparison matrix')
    axis('off')
    show()

Dealing with missing and corrupted comparisons

When some pairwise comparisons are missing and/or corrupted, the similarity matrix gets "noisy".

In [18]:
# number of items
n=100

# ratio of observed comparisons
ratioObserved=0.5

# ratio of corrupted comparisons
ratioCorrupted=0.1

isNotObserved=np.zeros((n,n))
triIdx=np.nonzero(np.triu(np.ones((n,n)),1))
# integer division (//): rand() expects an integer number of samples
isNotObserved[triIdx]=rand(n*(n-1)//2)<=1-ratioObserved
isNotObserved=isNotObserved + isNotObserved.T

isCorrupted=np.zeros((n,n))
isCorrupted[triIdx]=rand(n*(n-1)//2)<=ratioCorrupted
isCorrupted=isCorrupted + isCorrupted.T

C=np.triu(np.ones((n,n)),0)-np.tril(np.ones((n,n)),-1)
C[isNotObserved.astype(bool)]=0
C[isCorrupted.astype(bool)]=-C[isCorrupted.astype(bool)]

Sim=1/2*(np.dot(C,C.T))
Sim=Sim-Sim.min()
# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()

It turns out that if only a limited number of comparisons are corrupted or missing, the ranking given by sorting the Fiedler vector is very close to the "true" ranking.

In [19]:
perm=np.arange(n)  # identity permutation: items are already in their true order
Sim_perm=Sim[perm,:]
Sim_perm=Sim_perm[:,perm]
C_perm=C[perm,:]
C_perm=C_perm[:,perm]
L = np.diag(np.sum(Sim_perm,1)) - Sim_perm
D, V =np.linalg.eigh(L)
argD = np.argsort(D)
D=D[argD]
V=V[:,argD]
fiedler=V[:,1]
retrievedPerm=np.argsort(fiedler)
Sim_reordered=Sim_perm[retrievedPerm,:]
Sim_reordered=Sim_reordered[:,retrievedPerm]
C_reordered=C_perm[retrievedPerm,:]
C_reordered=C_reordered[:,retrievedPerm]

if np.mean(C_reordered[0,:]) < 0:
    retrievedPerm=retrievedPerm[::-1]
    Sim_reordered=Sim_perm[retrievedPerm,:]
    Sim_reordered=Sim_reordered[:,retrievedPerm]
    C_reordered=C_perm[retrievedPerm,:]
    C_reordered=C_reordered[:,retrievedPerm]
# display reordered similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim_reordered, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C_reordered, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()

# Compute Kendall tau between true ranking and retrieved ranking
tau, pvalue=sc.stats.kendalltau(perm, retrievedPerm)
print("Kendall tau is", tau)
Kendall tau is 0.90101010101

Large datasets

The code shown above requires computing the full similarity matrix and then an eigenvalue decomposition of the Laplacian. As such, it is efficient and fast for datasets of reasonable size (i.e. $n$ up to a few thousand). For larger datasets, it is still possible to use SerialRank if we only consider a subset of the pairwise comparisons as input (either because only a few are available, or by choice, to reduce the input size). We outline below the modifications to make in the computations (see the sketch after the list):

  • store the comparison matrix $C$ in sparse format
  • do not compute and store $Sim$ and $L$, as they will not be sparse
  • do not provide the eigenvalue solver with the full Laplacian, but rather with the function computing the matrix-vector product $x \rightarrow Lx$
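
Below is a minimal sketch of these three modifications, assuming $C$ is provided as a scipy.sparse matrix; the helper name serialrank_sparse is ours. It relies on scipy's matrix-free LinearOperator interface, so that neither $Sim$ nor $L$ is ever formed explicitly.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

def serialrank_sparse(C):
    # C: sparse n x n comparison matrix with entries in {-1, 0, 1}
    C = sp.csr_matrix(C)
    Ct = C.T.tocsr()
    n = C.shape[0]
    ones = np.ones(n)

    # Sim = (C C^T + n 11^T) / 2, hence Sim x = (C (C^T x) + n (1^T x) 1) / 2
    def sim_matvec(x):
        return 0.5 * (C @ (Ct @ x) + n * x.sum() * ones)

    d = sim_matvec(ones)  # row sums of Sim (the degrees of the similarity graph)

    # Laplacian matrix-vector product: L x = d * x - Sim x
    L_op = LinearOperator((n, n), matvec=lambda x: d * x - sim_matvec(x))

    # two smallest eigenpairs of L; the second one gives the Fiedler vector
    # (which='SM' avoids factorizing L, at the price of slower convergence)
    eigvals, eigvecs = eigsh(L_op, k=2, which='SM')
    fiedler = eigvecs[:, np.argsort(eigvals)[1]]
    return np.argsort(fiedler)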

Handling ties

The definition of the similarity given above does not take ties between items into account. We provide below an alternative definition of the similarity matrix which handles ties. Let $m_{i,j}$ be the number of times items $i$ and $j$ were compared, let $C_{i,j}^s \in \{-1,1\}$ be the outcome of comparison $s$ between $i$ and $j$, and let $Q$ be the matrix of corresponding empirical probabilities, i.e. if $m_{i,j} > 0$ we have $$ Q_{i,j}= \frac{1}{m_{i,j}} \sum_{s=1}^{m_{i,j}} \frac{C_{i,j}^s+1}{2}, $$ and $Q_{i,j} = 1/2$ in case $m_{i,j} = 0$. We then define the similarity matrix $Sim$ from the observations $Q$ as $$ Sim_{i,j}= \sum_{k=1}^n \left[ \mathbf{1}_{m_{i,k} m_{j,k}>0}\left(1 - \frac{| Q_{i,k}- Q_{j,k}|}{2} \right)+ \frac{\mathbf{1}_{m_{i,k} m_{j,k}=0}}{2} \right], $$ where $\mathbf{1}_{x}$ denotes the indicator function.
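
As an illustration, here is a direct (unvectorized) transcription of this definition into code; the input matrices Q and m and the helper name similarity_with_ties simply follow the notation above.

import numpy as np

def similarity_with_ties(Q, m):
    # Q: n x n matrix of empirical win probabilities (Q[i, j] = 1/2 if unobserved)
    # m: n x n matrix of comparison counts
    n = Q.shape[0]
    observed = m > 0
    Sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            both = observed[i] & observed[j]  # items k compared with both i and j
            # matched items contribute 1 - |Q_ik - Q_jk| / 2, unmatched ones 1/2
            Sim[i, j] = np.sum(np.where(both, 1 - np.abs(Q[i] - Q[j]) / 2, 0.5))
    return Sim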