The following sections describe how to estimate a Gaussian mixture model using the VLFeat implementation of the Expectation Maximization (EM) algorithm.

Expectation maximization

The EM algorithm attempts to model a dataset as a mixture of K multivariate Gaussian distributions.
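Concretely, the model assumes that each data point x is drawn from a density of the standard mixture form

p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,

where \mu_k, \Sigma_k and w_k are the mean, covariance and weight of the k-th Gaussian component; these are exactly the quantities that the EM iterations estimate.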

Consider a dataset containing 1000 randomly sampled points in 2-D.

N = 1000 ; dimension = 2 ; data = rand(dimension,N) ;

If one wants to estimate a Gaussian mixture model of this dataset, the following commands can be invoked:

numClusters = 30 ; [means, sigmas, weights] = vl_gmm(data, numClusters);

The variables means, sigmas and weights hold, respectively, the means, the (diagonal) covariances and the weights of the estimated Gaussians that form the mixture. One possible outcome of the algorithm is shown in the following figure:
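As a quick sanity check, the outputs follow a one-column-per-Gaussian layout (this sketch assumes the conventions used throughout this tutorial):

size(means)    % dimension x numClusters, one mean per column
size(sigmas)   % dimension x numClusters, diagonals of the covariances
size(weights)  % 1 x numClusters
sum(weights)   % should be (approximately) one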

A simple GMM estimated on a small random 2-D dataset.

The visualization was produced using the vl_plotframe function. Each Gaussian is passed to it as a five-element ellipse frame [x y S11 S12 S22]; the off-diagonal entry is zero here because the estimated covariances are diagonal.

figure
hold on
plot(data(1,:), data(2,:), 'r.');
for i = 1:numClusters
    % draw each gaussian as an ellipse centered at its mean
    vl_plotframe([means(:,i)' sigmas(1,i) 0 sigmas(2,i)]);
end

Covariance optimization

Note that the ellipses are axis aligned. This is a consequence of the optimization method: for the sake of speed, all computations are carried out using only the diagonals of the covariance matrices.
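If a full covariance matrix of a component is needed elsewhere, it can be reconstructed from the stored diagonal; a minimal sketch:

% expand the diagonal of the i-th component into a full
% dimension x dimension covariance matrix (off-diagonals are zero)
i = 1;
Sigma_i = diag(sigmas(:,i));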

GMM Initialization

The simplest way to initialize the GMM algorithm is to pick numClusters data points at random as the initial means of the individual Gaussians, to use the covariance of the whole dataset as the initial covariance matrices, and to assign equal weights, summing to one, as the initial weight of each Gaussian. This random initialization is the default when running the vl_gmm function. Alternatively, the user can specify the Custom initialization method.
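For illustration only (this is not the library code itself), the default random initialization just described amounts to something like:

% pick numClusters distinct data points as the initial means
perm = randperm(N);
initMeans = data(:, perm(1:numClusters));
% reuse the diagonal of the whole-dataset covariance for every gaussian
initSigmas = repmat(diag(cov(data')), 1, numClusters);
% equal weights summing to one
initWeights = ones(1, numClusters) / numClusters;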

The Custom initialization method is used when the user wants to provide their own starting point for the algorithm. When the 'Initialization' option is set to 'Custom', the options 'InitMeans', 'InitSigmas' and 'InitWeights' also have to be set. This initialization approach is frequently combined with the KMeans algorithm: KMeans is used to obtain the initial means, covariances and weights of the Gaussians, after which the EM algorithm takes over. The workflow is shown in the following piece of code:

%% data init
numClusters = 30;
numData = 1000;
dimension = 2;
data = rand(dimension, numData);

%% kmeans initialization
[initMeans, assignments] = vl_kmeans(data, numClusters, ...
    'algorithm','lloyd', ...
    'MaxNumIterations',5);

initSigmas = zeros(dimension, numClusters);
initWeights = zeros(1, numClusters);

%% find initial means, sigmas and weights
for i = 1:numClusters
    data_k = data(:, assignments == i);
    % weight of each gaussian: the fraction of points assigned to it,
    % so that the weights sum to one
    initWeights(i) = size(data_k,2) / numData;

    if size(data_k,2) <= 1
        % empty or singleton cluster: fall back to the covariance
        % of the whole dataset
        initSigmas(:,i) = diag(cov(data'));
    else
        initSigmas(:,i) = diag(cov(data_k'));
    end
end

%% gmm estimation
[means, sigmas, weights, ll, posteriors] = vl_gmm(data, numClusters, ...
    'initialization','custom', ...
    'InitMeans',initMeans, ...
    'InitSigmas',initSigmas, ...
    'InitWeights',initWeights);
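Since vl_gmm also returns the final log-likelihood in ll, one way to check whether the KMeans initialization paid off is to compare it against a run with the default random initialization; a sketch:

% rerun with the default random initialization and compare fits;
% a higher log-likelihood indicates a better fit to the data
[~, ~, ~, llRand] = vl_gmm(data, numClusters);
fprintf('log-likelihood: custom %f, random %f\n', ll, llRand);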

The demo scripts vl_demo_gmm_2d and vl_demo_gmm_3d also produce cute colorized figures such as these:

The figure shows what the estimated Gaussian mixture looks like with and without the KMeans initialization.