Tutorials - KMeans

The following sections describe how to use the KMeans algorithm implemented in VLFeat. A user can switch between several variations of the original algorithm (proposed by Lloyd) to improve the speed of convergence (sometimes at the expense of robustness).

KMeans

KMeans is a method for finding clusters in a dataset given a particular distance metric.

Consider a dataset containing 5000 randomly sampled points in 2-D:

N         = 5000 ;
dimension = 2 ;
data = rand(dimension,N) ;

To split the data points data into 30 clusters, the following procedure can be used:

numClusters = 30 ;
[centers, assignments] = vl_kmeans(data, numClusters);

After this call, the centers of the individual clusters are stored in the centers variable. The assignments of data points to clusters are stored in the assignments variable, an N-element vector holding, for each data point, the index of its closest cluster center.
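For example, the assignments vector can be used to select the points belonging to one cluster or to visualize the result. The plotting code below is only a sketch (it is not part of vl_kmeans; any equivalent visualization works):

```matlab
% Points belonging to, e.g., the 7th cluster.
cluster7 = data(:, assignments == 7) ;

% Plot the data colored by cluster assignment, centers in black.
figure ;
scatter(data(1,:), data(2,:), 10, double(assignments), 'filled') ; hold on ;
plot(centers(1,:), centers(2,:), 'k.', 'MarkerSize', 20) ;
```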

KMeans clustering of 5000 randomly sampled data points. The black dots are centers of each cluster.

Initialization

The KMeans algorithm in its original form initializes the centers of the clusters to a numClusters-sized random subset of the data points. After this initialization, the algorithm runs an iterative process that outputs the refined centers of the clusters.

The original random initialization process can be improved using the so-called kmeans++ method. This procedure picks the first center at random; each subsequent center is then drawn from the data points with probability proportional to its squared distance from the nearest already-selected center, so points far from the existing centers are more likely to be chosen.

This method can improve both the speed of convergence and the quality of the final local minimum of the objective function that KMeans minimizes.
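The seeding procedure can be sketched in a few lines of MATLAB. This is only an illustration of the idea; VLFeat's own kmeans++ implementation is internal to vl_kmeans:

```matlab
% Sketch of kmeans++ seeding (D^2 sampling) for illustration only.
seeds = zeros(1, numClusters) ;
seeds(1) = randi(N) ;                      % first center: uniform at random
minDist2 = sum(bsxfun(@minus, data, data(:,seeds(1))).^2, 1) ;
for k = 2:numClusters
  p = minDist2 / sum(minDist2) ;           % prob. proportional to squared distance
  seeds(k) = find(cumsum(p) >= rand, 1) ;  % draw the next seed
  d2 = sum(bsxfun(@minus, data, data(:,seeds(k))).^2, 1) ;
  minDist2 = min(minDist2, d2) ;           % distance to the nearest seed so far
end
initialCenters = data(:, seeds) ;
```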

kmeans++ initialization can be turned on by specifying the 'Initialization' parameter:

[centers, assignments] = vl_kmeans(data, numClusters,'Initialization','plusplus');

Algorithm selection

Apart from the original KMeans algorithm proposed by Lloyd, the Elkan and ANN methods can also be used to speed up the process of finding the cluster centers.

Lloyd is the original method of finding the nearest cluster for each point: it naively computes the distance from each point to each center and picks the minimum of these computed values.
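The naive assignment step can be sketched as follows (a vectorized illustration of what the Lloyd update does, not VLFeat's internal code):

```matlab
% Sketch of Lloyd's assignment step: all point-to-center squared
% distances at once, then the nearest center for every point.
dist2 = bsxfun(@plus, sum(data.^2,1)', sum(centers.^2,1)) ...
        - 2 * (data' * centers) ;    % N x numClusters matrix
[~, assign] = min(dist2, [], 2) ;    % assign(i) = index of nearest center
```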

Elkan follows the same approach as Lloyd but achieves a speedup by skipping as many distance computations as possible using the triangle inequality, which every distance metric satisfies. For example, if a point x is currently assigned to center c and 2 d(x,c) <= d(c,c') for some other center c', then d(x,c') >= d(x,c), so the distance from x to c' never needs to be computed.

ANN uses randomized approximate nearest neighbor KD-tree forests to find the point-to-cluster correspondences.
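The same idea can be illustrated with VLFeat's standalone KD-tree functions, vl_kdtreebuild and vl_kdtreequery. This is only a sketch of the assignment step that vl_kmeans performs internally, not its actual implementation:

```matlab
% Sketch: approximate assignment of points to centers via a KD-tree forest.
forest = vl_kdtreebuild(centers, 'NumTrees', 3) ;
assign = vl_kdtreequery(forest, centers, data, 'MaxComparisons', 5) ;
```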

These optimization methods can be enabled by setting the 'Algorithm' parameter to 'Lloyd', 'Elkan' or 'ANN'. When using 'ANN', a user should also supply the 'MaxNumComparisons' and 'NumTrees' options to adjust the speed/accuracy trade-off of the ANN algorithm (for a detailed explanation of ANN KD-tree forests see the KD-trees and forests page).

The following benchmark compares the speed of the implemented optimization methods. Because of the random initialization, each call to vl_kmeans converges to a different local minimum in a different number of iterations. Therefore we fix the number of iterations (by setting the 'MaxNumIterations' option) to ensure a reliable measurement of the elapsed time.

N = 10000;
numCenters = 100;
dimension = 128;
data = rand(dimension,N);

tic
[C] = vl_kmeans(data, numCenters, ...
                'algorithm','lloyd', ...
                'MaxNumIterations', 10);
elapsed_lloyd = toc

tic
[C] = vl_kmeans(data, numCenters, ...
                'algorithm','elkan', ...
                'MaxNumIterations', 10);
elapsed_elkan = toc

tic;
[C] = vl_kmeans(data, numCenters, ...
                'algorithm','ann', ...
                'NumTrees', 3, ...
                'MaxNumComparisons', 5, ...
                'MaxNumIterations', 10);
elapsed_ann = toc

The above code produces the following output:

elapsed_lloyd =
    6.9902
elapsed_elkan =
    1.8153
elapsed_ann =
    1.3716

More detailed statistics of the speed and achieved energies can be seen in the following figure (generated by vl_demo_kmeans_ann_speed):

Comparison of the speeds and achieved energies of Elkan, Lloyd and ANN (the latter expressed as a fraction of the maximum number of possible comparisons in the KD-tree forest) using serial and parallel computation. A parallel/serial speedup ratio is also shown (the experiment was run on a 4-core Intel Core i7 machine).