Documentation - C API

kmeans.h File Reference

K-means - Declaration.

#include "generic.h"
#include "random.h"
#include "mathop.h"

Data Structures

struct  VlKMeans
 K-means quantizer.

Enumerations

enum  VlKMeansAlgorithm { VlKMeansLloyd, VlKMeansElkan, VlKMeansANN }
 K-means algorithms.
enum  VlKMeansInitialization { VlKMeansRandomSelection, VlKMeansPlusPlus }
 K-means initialization algorithms.

Functions

Create and destroy
VlKMeans * vl_kmeans_new (vl_type dataType, VlVectorComparisonType distance)
 Create a new KMeans object.
VlKMeans * vl_kmeans_new_copy (VlKMeans const *kmeans)
 Create a new KMeans object by copy.
void vl_kmeans_delete (VlKMeans *self)
 Deletes a KMeans object.
Basic data processing
void vl_kmeans_reset (VlKMeans *self)
 Reset state.
double vl_kmeans_cluster (VlKMeans *self, void const *data, vl_size dimension, vl_size numData, vl_size numCenters)
 Cluster data.
void vl_kmeans_quantize (VlKMeans *self, vl_uint32 *assignments, void *distances, void const *data, vl_size numData)
 Quantize data.
Advanced data processing
void vl_kmeans_set_centers (VlKMeans *self, void const *centers, vl_size dimension, vl_size numCenters)
 Set centers.
void vl_kmeans_seed_centers_with_rand_data (VlKMeans *self, void const *data, vl_size dimensions, vl_size numData, vl_size numCenters)
 Seed centers by randomly sampling data.
void vl_kmeans_seed_centers_plus_plus (VlKMeans *self, void const *data, vl_size dimensions, vl_size numData, vl_size numCenters)
 Seed centers by the KMeans++ algorithm.
double vl_kmeans_refine_centers (VlKMeans *self, void const *data, vl_size numData)
 Refine center locations.
Retrieve data and parameters
vl_type vl_kmeans_get_data_type (VlKMeans const *self)
 Get data type.
VlVectorComparisonType vl_kmeans_get_distance (VlKMeans const *self)
 Get distance type.
VlKMeansAlgorithm vl_kmeans_get_algorithm (VlKMeans const *self)
 Get K-means algorithm.
VlKMeansInitialization vl_kmeans_get_initialization (VlKMeans const *self)
 Get K-means initialization algorithm.
vl_size vl_kmeans_get_num_repetitions (VlKMeans const *self)
 Get maximum number of repetitions.
vl_size vl_kmeans_get_dimension (VlKMeans const *self)
 Get data dimension.
vl_size vl_kmeans_get_num_centers (VlKMeans const *self)
 Get the number of centers (K)
int vl_kmeans_get_verbosity (VlKMeans const *self)
 Get verbosity level.
vl_size vl_kmeans_get_max_num_iterations (VlKMeans const *self)
 Get maximum number of iterations.
double vl_kmeans_get_energy (VlKMeans const *self)
 Get the energy of the current fit.
void const * vl_kmeans_get_centers (VlKMeans const *self)
 Get centers.
Set parameters
void vl_kmeans_set_algorithm (VlKMeans *self, VlKMeansAlgorithm algorithm)
 Set K-means algorithm.
void vl_kmeans_set_initialization (VlKMeans *self, VlKMeansInitialization initialization)
 Set K-means initialization algorithm.
void vl_kmeans_set_num_repetitions (VlKMeans *self, vl_size numRepetitions)
 Set maximum number of repetitions.
void vl_kmeans_set_max_num_iterations (VlKMeans *self, vl_size maxNumIterations)
 Set maximum number of iterations.
void vl_kmeans_set_verbosity (VlKMeans *self, int verbosity)
 Set verbosity level.

Detailed Description

Overview

kmeans.h implements a number of algorithms for k-means quantization. It supports

  • data of type float or double;
  • l1 and l2 distances;
  • random selection and k-means++ initialization methods;
  • basic Lloyd and accelerated Elkan optimization methods.

Usage

To use kmeans.h to learn clusters from some training data, instantiate a VlKMeans object, set the configuration parameters, initialize the cluster centers, and run the training code. For instance, to learn numCenters clusters from numData vectors of dimension dimension and storage type float using the L2 distance and at most 100 iterations of the Lloyd algorithm, use:

 #include <vl/kmeans.h>

 /* data holds numData vectors of dimension `dimension`, stored as float */
 VlKMeansAlgorithm algorithm = VlKMeansLloyd ;
 VlVectorComparisonType distance = VlDistanceL2 ;
 VlKMeans * kmeans = vl_kmeans_new (VL_TYPE_FLOAT, distance) ;
 vl_kmeans_set_algorithm (kmeans, algorithm) ;
 vl_kmeans_seed_centers_with_rand_data (kmeans, data, dimension, numData, numCenters) ;
 vl_kmeans_set_max_num_iterations (kmeans, 100) ;
 vl_kmeans_refine_centers (kmeans, data, numData) ;

Use vl_kmeans_get_energy to get the solution energy (or an upper bound for the Elkan algorithm) and vl_kmeans_get_centers to obtain the numCenters cluster centers. Use vl_kmeans_quantize to quantize new data points.

Initialization algorithms

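kmeans.h supports the following cluster initialization algorithms:

  • Random data points (VlKMeansRandomSelection, vl_kmeans_seed_centers_with_rand_data). The centers are initialized from a random selection of the data points.
  • k-means++ [1] (VlKMeansPlusPlus, vl_kmeans_seed_centers_plus_plus). The centers are initialized by the randomized seeding procedure described under "Initialization by k-means++".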

Optimization algorithms

kmeans.h supports the following optimization algorithms:

  • Lloyd [2] (VlKMeansLloyd). This is the standard k-means algorithm, alternating the estimation of the point-to-cluster membership and of the cluster centers (means in the Euclidean case). Estimating membership requires computing the distance of each point to all cluster centers, which can be extremely slow.
  • Elkan [3] (VlKMeansElkan). This is a variation of [2] that uses the triangle inequality to avoid many distance calculations when assigning points to clusters and is typically much faster than [2]. However, it uses storage proportional to the square of the number of clusters, which makes it impractical for a very large number of clusters.

Technical details

Given data points $ x_1, \dots, x_n \in \mathbb{R}^d $, k-means searches for $ k $ vectors $ c_1, \dots, c_k \in \mathbb{R}^d $ (cluster centers) and a function $ \pi : \{1, \dots, n\} \rightarrow \{1, \dots, k\} $ (cluster memberships) that minimize the objective:

\[ E(c_1,\dots,c_k,\pi) = \sum_{i=1}^n d^2(x_i, c_{\pi(i)}) \]

A simple procedure due to Lloyd [2] to locally optimize this objective alternates estimating the cluster centers and the membership function. Specifically, given the membership function $ \pi $, the objective can be minimized independently for each $ c_k $ by minimizing

\[ \sum_{i : \pi(i) = k} d^2(x_i, c_k) \]

For the Euclidean distance, the minimizer is simply the mean of the points assigned to that cluster. For other distances, the minimizer is a generalized average. For instance, for the $ l^1 $ distance, this is the median. Assuming that computing the average is linear in the number of points and the data dimension, this step requires $ O(nd) $ operations.
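
As a brief check of the Euclidean case, here is a sketch of the standard argument. The count $ n_k $ is notation introduced here for illustration and is assumed to be nonzero. Setting the gradient of the per-cluster objective with respect to $ c_k $ to zero gives

\[ \nabla_{c_k} \sum_{i : \pi(i) = k} \| x_i - c_k \|_2^2 = -2 \sum_{i : \pi(i) = k} (x_i - c_k) = 0 \quad \Longrightarrow \quad c_k = \frac{1}{n_k} \sum_{i : \pi(i) = k} x_i, \qquad n_k = |\{ i : \pi(i) = k \}| \]

Since the per-cluster objective is convex in $ c_k $, this stationary point is its minimizer.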

Similarly, given the centers $ c_1, \dots, c_k $, the objective can be optimized independently for the membership $ \pi(i) $ of each point $ x_i $ by minimizing $ d^2(x_i, c_{\pi(i)}) $ over $ \pi(i) \in \{1, \dots, k\} $. Assuming that computing a distance is $ O(d) $, this step requires $ O(ndk) $ operations and dominates the other.
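
The two alternating steps can be sketched in plain C as follows. This is a minimal illustration for the Euclidean case, not the kmeans.h implementation: the function names (dist2, assign_points, update_centers, lloyd), the row-major double storage, and the choice to leave empty clusters unchanged are all assumptions made for this example.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdlib.h>

/* Squared Euclidean distance between two d-dimensional vectors. */
double dist2 (double const *x, double const *c, size_t d)
{
  double acc = 0.0 ;
  for (size_t j = 0 ; j < d ; ++j) {
    double delta = x[j] - c[j] ;
    acc += delta * delta ;
  }
  return acc ;
}

/* Assignment step, O(ndk): store the index of the nearest center of
   each point in pi and return the energy E for the current centers. */
double assign_points (double const *x, size_t n, size_t d,
                      double const *c, size_t k, size_t *pi)
{
  double energy = 0.0 ;
  assert (k > 0) ;
  for (size_t i = 0 ; i < n ; ++i) {
    double best = DBL_MAX ;
    for (size_t q = 0 ; q < k ; ++q) {
      double e = dist2 (x + i * d, c + q * d, d) ;
      if (e < best) { best = e ; pi[i] = q ; }
    }
    energy += best ;
  }
  return energy ;
}

/* Update step, O(nd): replace each center by the mean of its points.
   Centers with no assigned points are left unchanged (a choice made
   for this sketch only). */
void update_centers (double const *x, size_t n, size_t d,
                     double *c, size_t k, size_t const *pi)
{
  double *sum = calloc (k * d, sizeof (double)) ;
  size_t *count = calloc (k, sizeof (size_t)) ;
  assert (sum != NULL && count != NULL) ;
  for (size_t i = 0 ; i < n ; ++i) {
    count[pi[i]] += 1 ;
    for (size_t j = 0 ; j < d ; ++j) sum[pi[i] * d + j] += x[i * d + j] ;
  }
  for (size_t q = 0 ; q < k ; ++q) {
    if (count[q] == 0) continue ;
    for (size_t j = 0 ; j < d ; ++j) {
      c[q * d + j] = sum[q * d + j] / (double) count[q] ;
    }
  }
  free (sum) ;
  free (count) ;
}

/* Alternate the two steps maxIter times, starting from the centers
   already stored in c, and return the final energy. */
double lloyd (double const *x, size_t n, size_t d,
              double *c, size_t k, size_t *pi, size_t maxIter)
{
  double energy = assign_points (x, n, d, c, k, pi) ;
  for (size_t t = 0 ; t < maxIter ; ++t) {
    update_centers (x, n, d, c, k, pi) ;
    energy = assign_points (x, n, d, c, k, pi) ;
  }
  return energy ;
}
```

The inner loop over centers in assign_points is the $ O(ndk) $ cost that dominates each iteration.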

The algorithm usually starts by initializing the centers from a random selection of the data points.

Initialization by k-means++

[1] proposes a randomized initialization of the centers which improves upon random selection. The first center $ c_1 $ is selected at random from the data points $ x_1, \dots, x_n $ and the squared distance $ d^2(x_i, c_1) $ from this center to every point is computed. Then the second center $ c_2 $ is selected at random from the data points with probability proportional to this squared distance, and the procedure is repeated using the minimum squared distance to the centers collected so far.
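
The seeding procedure can be sketched in plain C as follows. This is an illustration under stated assumptions rather than the code behind vl_kmeans_seed_centers_plus_plus: the names sq_dist, uniform01 and seed_plus_plus are hypothetical, the data are row-major doubles, the distance is Euclidean, the sampling weight is the squared distance $ d^2 $ as in [1], and the random numbers come from the C library rand().

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Squared Euclidean distance between two d-dimensional vectors. */
double sq_dist (double const *a, double const *b, size_t d)
{
  double acc = 0.0 ;
  for (size_t j = 0 ; j < d ; ++j) acc += (a[j] - b[j]) * (a[j] - b[j]) ;
  return acc ;
}

/* Uniform sample in [0, 1) built on rand(); adequate for a sketch only. */
double uniform01 (void)
{
  return rand () / ((double) RAND_MAX + 1.0) ;
}

/* Seed k centers from n points of dimension d. The output c must have
   room for k * d doubles. Requires 1 <= k <= n. */
void seed_plus_plus (double const *x, size_t n, size_t d,
                     double *c, size_t k)
{
  double *minDist ;
  size_t pick ;
  assert (k >= 1 && k <= n) ;
  minDist = malloc (n * sizeof (double)) ;
  assert (minDist != NULL) ;
  for (size_t i = 0 ; i < n ; ++i) minDist[i] = DBL_MAX ;

  /* First center: uniform choice among the data points. */
  pick = (size_t) (uniform01 () * (double) n) ;

  for (size_t q = 0 ; q < k ; ++q) {
    double total = 0.0, threshold, acc = 0.0 ;
    memcpy (c + q * d, x + pick * d, d * sizeof (double)) ;
    if (q + 1 == k) break ;

    /* Minimum squared distance of each point to the centers so far. */
    for (size_t i = 0 ; i < n ; ++i) {
      double e = sq_dist (x + i * d, c + q * d, d) ;
      if (e < minDist[i]) minDist[i] = e ;
      total += minDist[i] ;
    }

    /* Draw the next center with probability minDist[i] / total. If
       total is zero, every point coincides with a chosen center and
       the previous pick is reused. */
    threshold = uniform01 () * total ;
    for (size_t i = 0 ; i < n ; ++i) {
      acc += minDist[i] ;
      if (acc > threshold) { pick = i ; break ; }
    }
  }
  free (minDist) ;
}
```

A point that is already a center has zero weight, so it is not drawn again unless all remaining points coincide with the chosen centers.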

Speeding up by using the triangle inequality

[3] proposes to use the triangle inequality to avoid most distance calculations when computing the point-to-cluster memberships, provided that the cluster centers did not change much from the previous iteration.

This uses two key ideas:

  • If a point $ x_i $ is very close to its current center $ c_{\pi(i)} $ and this center is very far from another center $ c $, then the point cannot be assigned to $ c $. Specifically, if $ d(x_i, c_{\pi(i)}) \leq d(c_{\pi(i)}, c) / 2 $, then also $ d(x_i, c_{\pi(i)}) \leq d(x_i, c) $.
  • If a center $ c $ is updated to $ \hat c $, then the variation of the distance of the center to any point can be bounded by $ d(x, c) - d(c, \hat c) \leq d(x, \hat c) \leq d(x,c) + d(c, \hat c) $.

The first idea is used by keeping track of the inter-center distances and excluding reassignments to centers too far away from the currently assigned center. The second idea is used by keeping, for each point, an upper bound on the distance to the currently assigned center and a lower bound on the distance to the other centers. If these bounds do not intersect, the point need not be reassigned. See [3] for details.
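
The two tests and the bound updates can be written as a few small helpers. This is a sketch of the inequalities above only, not of the bookkeeping in [3] or in kmeans.h. The function names are hypothetical and all arguments are plain (non-squared) distances.

```c
#include <assert.h>

/* First idea: `upper` is an upper bound on d(x, c_a), the distance of a
   point to its current center, and `centerDist` is d(c_a, c). When this
   returns 1, c cannot be closer to x than c_a, so d(x, c) need not be
   computed. */
int can_skip_center (double upper, double centerDist)
{
  return upper <= 0.5 * centerDist ;
}

/* Second idea: after a center moves by shift = d(c, c_hat), an upper
   bound on the distance to it stays valid if increased by the shift. */
double widen_upper (double upper, double shift)
{
  assert (shift >= 0.0) ;
  return upper + shift ;
}

/* Likewise, a lower bound stays valid if decreased by the shift. The
   result is clamped at zero because distances are nonnegative. */
double shrink_lower (double lower, double shift)
{
  double v ;
  assert (shift >= 0.0) ;
  v = lower - shift ;
  return v > 0.0 ? v : 0.0 ;
}

/* `upper` bounds the distance to the assigned center from above and
   `lower` bounds the distance to another center from below. If
   upper <= lower the bounds do not intersect and that center cannot be
   strictly closer, so the point keeps its assignment. */
int may_need_reassignment (double upper, double lower)
{
  return upper > lower ;
}
```

For example, in one dimension with $ x = 0 $, current center at 1 and another center at 10, the inter-center distance is 9 and can_skip_center (1.0, 9.0) holds, which agrees with $ d(x, c) = 10 \geq 1 $.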

References

  • [1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 2007.
  • [2] S. Lloyd. Least squares quantization in PCM. IEEE Trans. on Information Theory, 28(2), 1982.
  • [3] C. Elkan. Using the triangle inequality to accelerate k-means. In Proc. ICML, 2003.
Author:
Andrea Vedaldi

Enumeration Type Documentation

enum VlKMeansAlgorithm

Enumerator:
VlKMeansLloyd 

Lloyd algorithm

VlKMeansElkan 

Elkan algorithm

VlKMeansANN 

Approximate nearest neighbors

enum VlKMeansInitialization

Enumerator:
VlKMeansRandomSelection 

Randomized selection

VlKMeansPlusPlus 

k-means++ randomized selection


Function Documentation

double vl_kmeans_cluster ( VlKMeans *  self,
void const *  data,
vl_size  dimension,
vl_size  numData,
vl_size  numCenters
)
Parameters:
self  KMeans object.
data  data to quantize.
dimension  data dimension.
numData  number of data points.
numCenters  number of clusters.
Returns:
K-means energy at the end of optimization.

The function initializes the centers by using the initialization algorithm set by vl_kmeans_set_initialization and refines them by the quantization algorithm set by vl_kmeans_set_algorithm. The process is repeated one or more times (see vl_kmeans_set_num_repetitions) and the result with the smallest energy is retained.

void vl_kmeans_delete ( VlKMeans *  self )
Parameters:
self  KMeans object instance.

The function deletes the KMeans object instance created by vl_kmeans_new.

VlKMeansAlgorithm vl_kmeans_get_algorithm ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object.
Returns:
algorithm.

void const * vl_kmeans_get_centers ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
cluster centers.

vl_type vl_kmeans_get_data_type ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
data type.

vl_size vl_kmeans_get_dimension ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
data dimension.

VlVectorComparisonType vl_kmeans_get_distance ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
distance type.

double vl_kmeans_get_energy ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
energy.

VlKMeansInitialization vl_kmeans_get_initialization ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object.
Returns:
initialization algorithm.

vl_size vl_kmeans_get_max_num_iterations ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
maximum number of iterations.

vl_size vl_kmeans_get_num_centers ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
number of centers.

vl_size vl_kmeans_get_num_repetitions ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
current number of repetitions for quantization.

int vl_kmeans_get_verbosity ( VlKMeans const *  self ) [inline]
Parameters:
self  KMeans object instance.
Returns:
verbosity level.

VlKMeans* vl_kmeans_new ( vl_type  dataType,
VlVectorComparisonType  distance
)
Parameters:
dataType  type of data (VL_TYPE_FLOAT or VL_TYPE_DOUBLE).
distance  distance.
Returns:
new KMeans object instance.

VlKMeans* vl_kmeans_new_copy ( VlKMeans const *  kmeans )
Parameters:
kmeans  KMeans object to copy.
Returns:
new copy.

void vl_kmeans_quantize ( VlKMeans *  self,
vl_uint32 *  assignments,
void *  distances,
void const *  data,
vl_size  numData
)
Parameters:
self  KMeans object.
assignments  data to centers assignments.
distances  data to closest center distances.
data  data to quantize.
numData  number of data points.

double vl_kmeans_refine_centers ( VlKMeans *  self,
void const *  data,
vl_size  numData
)
Parameters:
self  KMeans object.
data  data to quantize.
numData  number of data points.
Returns:
K-means energy at the end of optimization.

The function calls the underlying K-means quantization algorithm (VlKMeansAlgorithm) to quantize the specified data data. The function assumes that the cluster centers have already been assigned by using one of the seeding functions, or by setting them.

void vl_kmeans_reset ( VlKMeans *  self )

The function resets the state of the KMeans object. It deletes any stored centers, releasing the corresponding memory. This cancels the effect of seeding or setting the centers, but does not change the other configuration parameters.

void vl_kmeans_seed_centers_plus_plus ( VlKMeans *  self,
void const *  data,
vl_size  dimension,
vl_size  numData,
vl_size  numCenters
)
Parameters:
self  KMeans object.
data  data to sample from.
dimension  data dimension.
numData  number of data points.
numCenters  number of centers.

void vl_kmeans_seed_centers_with_rand_data ( VlKMeans *  self,
void const *  data,
vl_size  dimension,
vl_size  numData,
vl_size  numCenters
)
Parameters:
self  KMeans object.
data  data to sample from.
dimension  data dimension.
numData  number of data points.
numCenters  number of centers.

The function seeds the KMeans centers by randomly sampling the data data.

void vl_kmeans_set_algorithm ( VlKMeans *  self,
VlKMeansAlgorithm  algorithm
) [inline]
Parameters:
self  KMeans object.
algorithm  K-means algorithm.

void vl_kmeans_set_centers ( VlKMeans *  self,
void const *  centers,
vl_size  dimension,
vl_size  numCenters
)
Parameters:
self  KMeans object.
centers  centers to copy.
dimension  data dimension.
numCenters  number of centers.

void vl_kmeans_set_initialization ( VlKMeans *  self,
VlKMeansInitialization  initialization
) [inline]
Parameters:
self  KMeans object.
initialization  initialization.

void vl_kmeans_set_max_num_iterations ( VlKMeans *  self,
vl_size  maxNumIterations
) [inline]
Parameters:
self  KMeans object.
maxNumIterations  maximum number of iterations.

void vl_kmeans_set_num_repetitions ( VlKMeans *  self,
vl_size  numRepetitions
) [inline]
Parameters:
self  KMeans object instance.
numRepetitions  maximum number of repetitions. The number of repetitions cannot be smaller than 1.

void vl_kmeans_set_verbosity ( VlKMeans *  self,
int  verbosity
) [inline]
Parameters:
self  KMeans object instance.
verbosity  verbosity level.