This IPython notebook is an informal demo of how to produce rankings using SerialRank. While it is self-contained, it is merely a practical introduction to SerialRank.

Theoretical justifications for using SerialRank can be found in the preprint of our NIPS paper (Fajwel Fogel, Alexandre d'Aspremont, Milan Vojnovic).

You can download the code of this IPython notebook here: tutorial.py, tutorial.ipynb.

Suppose Alice and Bob want to rank the four restaurants they visited during their lovely weekend in Paris. Because they do not want to bother giving each restaurant a score based on a list of criteria (even thinking of it gives them a headache), they have decided to write down their pairwise preferences in the form of a double-entry table, which we call the comparison matrix $C$. For instance, if they preferred the pizzeria to the brasserie, they set the corresponding entry of $C$ (in the Brasserie row and Pizzeria column) to 1. In the opposite scenario they would set it to -1.

| | Pizzeria | Brasserie | Gastronomique | Café |
|---|---|---|---|---|
| Pizzeria | - | -1 | 1 | -1 |
| Brasserie | 1 | - | 1 | -1 |
| Gastronomique | -1 | -1 | - | -1 |
| Café | 1 | 1 | 1 | - |

The corresponding matrix $C$ is antisymmetric by construction, and we set the diagonal entries to 1 (this has no influence on what follows).

In [2]:

```
%matplotlib inline
import numpy as np
import scipy as sc
from pylab import *
from scipy import stats
import matplotlib.image as mpimg
# Input pairwise preference matrix C
C = np.array([[ 1, -1,  1, -1],
              [ 1,  1,  1, -1],
              [-1, -1,  1, -1],
              [ 1,  1,  1,  1]])
```
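As a quick sanity check (this snippet is not part of the original notebook), we can verify that $C$ has the announced structure: antisymmetric off the diagonal with unit diagonal entries, which together mean $C + C^T = 2I$:

```python
import numpy as np

C = np.array([[ 1, -1,  1, -1],
              [ 1,  1,  1, -1],
              [-1, -1,  1, -1],
              [ 1,  1,  1,  1]])

# antisymmetry off the diagonal plus a unit diagonal means C + C^T = 2I
assert np.array_equal(C + C.T, 2 * np.eye(4))
```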

In [3]:

```
# Calculate the similarity matrix Sim
Sim = np.dot(C, C.T)
# subtract the minimum of Sim in order to keep Sim non-negative
Sim = Sim - Sim.min()
```

In [4]:

```
# Compute the Laplacian matrix and its second eigenvector
L = np.diag(np.sum(Sim, axis=1)) - Sim
D, V = np.linalg.eigh(L)
argD = np.argsort(D)
D = D[argD]
V = V[:, argD]
fiedler = V[:, 1]
# get the permutation corresponding to sorting the second eigenvector
retrievedPerm = np.argsort(fiedler)
# display the ranking of restaurants
restaurants = np.array(['Pizzeria', 'Brasserie', 'Gastronomique', 'Cafe'])
ranking = restaurants[retrievedPerm]
print(ranking)
```

As **simple** as that!
Of course, there is actually a lot more going on, hidden behind the eigenvector decomposition of $L$. We refer the interested reader to our NIPS paper for more explanations.

Besides the fact that SerialRank is intuitive and **easy to implement**, the main justification for using it is its **robustness** to non-transitive preferences.
In contrast to scoring methods such as counting the number of positive comparisons minus the number of negative comparisons, the similarity measure of SerialRank *keeps most of the information present in the comparison matrix*.

Let's go back to Alice and Bob to illustrate this. Suppose Alice and Bob have changed their minds and have now decided that the café was better than the gastronomique restaurant. (It could be because of a mistake while writing down the comparisons, or for some other unknown reason...)

If we rank restaurants according to the commonly used score #positive comparisons - #negative comparisons, there is now a **tie** between the gastronomique restaurant and the pizzeria, as well as between the brasserie and the café:

In [5]:

```
score = np.sum(C, axis=0)
newC = C.copy()
newC[2, 3] = 1
newC[3, 2] = -1
newScore = np.sum(newC, axis=0)
print('old score is', score)
print('new score is', newScore)
```

**SerialRank preserves the original ranking**, which seems much more reasonable given all comparisons.

In [7]:

```
# compute the new similarity
newSim = np.dot(newC, newC.T)
newSim = newSim - newSim.min()
# compute the new ranking
newL = np.diag(np.sum(newSim, axis=1)) - newSim
newD, newV = np.linalg.eigh(newL)
argD = np.argsort(newD)
newD = newD[argD]
newV = newV[:, argD]
newRetrievedPerm = np.argsort(newV[:, 1])
newRanking = restaurants[newRetrievedPerm]
print(newRanking)
```

In [8]:

```
# reorder the similarity matrices according to the retrieved rankings
Sim_reordered = Sim[retrievedPerm, :]
Sim_reordered = Sim_reordered[:, retrievedPerm]
newSim_reordered = newSim[newRetrievedPerm, :]
newSim_reordered = newSim_reordered[:, newRetrievedPerm]
subplot(1, 2, 1)
imshow(Sim_reordered, interpolation='nearest')
title('Original similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(newSim_reordered, interpolation='nearest')
title('New similarity matrix')
axis('off')
show()
```

In [11]:

```
%matplotlib inline
import numpy as np
import scipy as sc
from pylab import *
from scipy import stats
import matplotlib.image as mpimg
# comparison matrix for a fully observed total order over n items
n = 10
C = np.triu(np.ones((n, n)), 0) - np.tril(np.ones((n, n)), -1)
C
```

Out[11]:

If some pairwise comparisons are missing, we set the corresponding entries of the comparison matrix $C$ to 0.

Suppose for instance that we want to order a set of players based on the outcomes of their games. The underlying assumption of SerialRank is that players who beat the same players and are beaten by the same players should have a similar ranking.

We relate the ranking problem to the "seriation" problem: given a similarity matrix between a set of $n$ items, we aim to order the items along a chain such that the similarity between items decreases with their distance along this chain. When ranking players, the similarity measure between two players is the number of similar outcomes against other opponents.

Formally, we define the similarity matrix $Sim$ componentwise as $Sim_{i,j}=\sum_{k=1}^n\left( \frac{1+C_{i,k}C_{j,k}}{2}\right)$, or equivalently $Sim=\frac{1}{2} \left(n \mathbf 1 \mathbf 1^T+ CC^T\right)$. Since $C_{i,k}C_{j,k}=1$ if $C_{i,k}$ and $C_{j,k}$ have the same sign, and $C_{i,k}C_{j,k}=-1$ if they have opposite signs, $Sim_{i,j}$ counts the number of matching comparisons of $i$ and $j$ with other reference items $k$. If $i$ or $j$ is not compared with $k$, then $C_{i,k}C_{j,k}=0$ and the term $(1+C_{i,k}C_{j,k})/2$ contributes a neutral $1/2$ to the similarity.
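To make the componentwise definition concrete, here is a small sanity check (not in the original notebook) comparing a naive double loop against the vectorized expression:

```python
import numpy as np

n = 10
C = np.triu(np.ones((n, n)), 0) - np.tril(np.ones((n, n)), -1)

# componentwise definition: Sim_ij = sum_k (1 + C_ik * C_jk) / 2
Sim_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Sim_loop[i, j] = np.sum((1 + C[i, :] * C[j, :]) / 2)

# equivalent vectorized form: Sim = (n * 11^T + C C^T) / 2
Sim_vec = (n * np.ones((n, n)) + np.dot(C, C.T)) / 2

assert np.allclose(Sim_loop, Sim_vec)
```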

In [12]:

```
# similarity from the formula Sim = (n 11^T + C C^T) / 2
Sim = 1/2 * (np.dot(C, C.T) + n)
Sim
```

Out[12]:

In [13]:

```
# equivalent up to an additive constant: subtract the minimum instead
Sim = 1/2 * np.dot(C, C.T)
Sim = Sim - Sim.min()
Sim
```

Out[13]:

In [14]:

```
# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()
```

When rows and columns are disordered, no pattern is visible.

In [15]:

```
# Disorder comparison and similarity matrices
perm = np.random.permutation(n)
Sim_perm = Sim[perm, :]
Sim_perm = Sim_perm[:, perm]
C_perm = C[perm, :]
C_perm = C_perm[:, perm]
# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim_perm, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C_perm, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()
```

In [16]:

```
# compute the Laplacian matrix and the Fiedler vector
L = np.diag(np.sum(Sim_perm, axis=1)) - Sim_perm
D, V = np.linalg.eigh(L)
argD = np.argsort(D)
D = D[argD]
V = V[:, argD]
fiedler = V[:, 1]
# get the permutation corresponding to sorting the Fiedler vector
retrievedPerm = np.argsort(fiedler)
Sim_reordered=Sim_perm[retrievedPerm,:]
Sim_reordered=Sim_reordered[:,retrievedPerm]
C_reordered=C_perm[retrievedPerm,:]
C_reordered=C_reordered[:,retrievedPerm]
# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim_reordered, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C_reordered, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()
```

The retrieved permutation perfectly reorders the similarity matrix.
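We can check this claim programmatically. The following sketch (not in the original notebook) verifies that the reordered similarity matrix is a Robinson matrix, i.e. its entries decrease monotonically when moving away from the diagonal in each row:

```python
import numpy as np

n = 10
C = np.triu(np.ones((n, n)), 0) - np.tril(np.ones((n, n)), -1)
Sim = (np.dot(C, C.T) + n) / 2

# shuffle rows and columns, then recover the order via the Fiedler vector
perm = np.random.permutation(n)
Sim_perm = Sim[np.ix_(perm, perm)]
L = np.diag(np.sum(Sim_perm, axis=1)) - Sim_perm
D, V = np.linalg.eigh(L)
fiedler = V[:, np.argsort(D)[1]]
retrievedPerm = np.argsort(fiedler)
R = Sim_perm[np.ix_(retrievedPerm, retrievedPerm)]

# Robinson property: nonincreasing to the right of the diagonal,
# nondecreasing up to the diagonal, in every row
is_robinson = all(
    (np.diff(R[i, i:]) <= 1e-9).all() and (np.diff(R[i, :i + 1]) >= -1e-9).all()
    for i in range(n)
)
print(is_robinson)
```

Since the comparisons here are noiseless, the reordering is exact (up to a global reversal, which is also Robinson).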

In [17]:

```
# if the retrieved order is reversed, flip the permutation and reorder again
if C_reordered[0, n-1] < 0:
    retrievedPerm = retrievedPerm[::-1]
    Sim_reordered = Sim_perm[retrievedPerm, :]
    Sim_reordered = Sim_reordered[:, retrievedPerm]
    C_reordered = C_perm[retrievedPerm, :]
    C_reordered = C_reordered[:, retrievedPerm]
# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim_reordered, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C_reordered, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()
```

When some pairwise comparisons are missing and/or corrupted, the similarity matrix gets "noisy".

In [18]:

```
# number of items
n=100
# ratio of observed comparisons
ratioObserved=0.5
# ratio of corrupted comparisons
ratioCorrupted=0.1
isNotObserved = np.zeros((n, n))
triIdx = np.nonzero(np.triu(np.ones((n, n)), 1))
isNotObserved[triIdx] = rand(n*(n-1)//2) <= 1 - ratioObserved
isNotObserved = isNotObserved + isNotObserved.T
isCorrupted = np.zeros((n, n))
isCorrupted[triIdx] = rand(n*(n-1)//2) <= ratioCorrupted
isCorrupted = isCorrupted + isCorrupted.T
C = np.triu(np.ones((n, n)), 0) - np.tril(np.ones((n, n)), -1)
C[isNotObserved.astype(bool)] = 0
C[isCorrupted.astype(bool)] = -C[isCorrupted.astype(bool)]
Sim = 1/2 * np.dot(C, C.T)
Sim = Sim - Sim.min()
# display similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()
```

In [19]:

```
perm = np.arange(n)
Sim_perm = Sim[perm, :]
Sim_perm = Sim_perm[:, perm]
C_perm = C[perm, :]
C_perm = C_perm[:, perm]
L = np.diag(np.sum(Sim_perm, axis=1)) - Sim_perm
D, V = np.linalg.eigh(L)
argD = np.argsort(D)
D = D[argD]
V = V[:, argD]
fiedler = V[:, 1]
retrievedPerm = np.argsort(fiedler)
Sim_reordered = Sim_perm[retrievedPerm, :]
Sim_reordered = Sim_reordered[:, retrievedPerm]
C_reordered = C_perm[retrievedPerm, :]
C_reordered = C_reordered[:, retrievedPerm]
# if the retrieved order is reversed, flip the permutation and reorder again
if np.mean(C_reordered[0, :]) < 0:
    retrievedPerm = retrievedPerm[::-1]
    Sim_reordered = Sim_perm[retrievedPerm, :]
    Sim_reordered = Sim_reordered[:, retrievedPerm]
    C_reordered = C_perm[retrievedPerm, :]
    C_reordered = C_reordered[:, retrievedPerm]
# display reordered similarity and comparison matrices
subplot(1, 2, 1)
imshow(Sim_reordered, interpolation = 'nearest')
title('Similarity matrix')
axis('off')
subplot(1, 2, 2)
imshow(C_reordered, interpolation = 'nearest')
title('Comparison matrix')
axis('off')
show()
# Compute Kendall tau between true ranking and retrieved ranking
tau, pvalue=sc.stats.kendalltau(perm, retrievedPerm)
print("Kendall tau is", tau)
```

The code shown above requires computing the full similarity matrix and then an eigenvalue decomposition of the Laplacian. As such, it is very efficient and fast for datasets of reasonable size (i.e., $n$ up to a few thousand). For large datasets, it is still possible to use SerialRank if we only consider a subset of the pairwise comparisons as input (either because only a few are available, or by choice, to reduce the input size). We list below the modifications to make in the computations:

- store the comparison matrix $C$ in sparse format;
- do not compute and store $Sim$ and $L$, as they will not be sparse;
- instead of the full Laplacian, provide the eigenvalue solver with a function that computes the matrix-vector product $x \mapsto Lx$.
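As a rough sketch of these modifications (not part of the original notebook), the following uses `scipy.sparse` and a `LinearOperator` so that neither $Sim$ nor $L$ is ever formed explicitly; for illustration we build a small dense $C$ and store it sparsely, whereas in a real large-scale setting $C$ would be assembled directly in sparse format:

```python
import numpy as np
import scipy.sparse as sp
from scipy import stats
from scipy.sparse.linalg import LinearOperator, eigsh

n = 40
C = sp.csr_matrix(np.triu(np.ones((n, n)), 0) - np.tril(np.ones((n, n)), -1))
ones = np.ones(n)

def sim_matvec(x):
    # Sim @ x with Sim = (n * 11^T + C C^T) / 2, without forming Sim
    return 0.5 * (n * x.sum() * ones + C @ (C.T @ x))

deg = sim_matvec(ones)  # row sums of Sim

def lap_matvec(x):
    # L x = D x - Sim x
    return deg * x - sim_matvec(x)

L_op = LinearOperator((n, n), matvec=lap_matvec)
# two smallest eigenpairs; the second one gives the Fiedler vector
vals, vecs = eigsh(L_op, k=2, which='SA', maxiter=10000)
fiedler = vecs[:, np.argsort(vals)[1]]
retrievedPerm = np.argsort(fiedler)
tau, _ = stats.kendalltau(np.arange(n), retrievedPerm)
print(abs(tau))
```

Note that Lanczos methods converge more slowly for the smallest eigenvalues than for the largest; for very large problems, a shift of the spectrum or a dedicated sparse eigensolver setting may be needed.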