The following sections describe how to estimate a Gaussian mixture model (GMM) using the VLFeat implementation of the Expectation Maximization (EM) algorithm.
Expectation maximization
The EM algorithm attempts to model a dataset as a mixture of K multivariate Gaussian distributions.
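For reference, the density being estimated has the standard Gaussian-mixture form (standard notation, added here for clarity):

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,

where \mu_k, \Sigma_k, and \pi_k denote the mean, covariance, and weight of the k-th component. EM alternates between computing posterior responsibilities of the components for each point (the E step) and re-estimating the component parameters from those responsibilities (the M step).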
Consider a dataset containing 1000 randomly sampled points in 2-D.
N = 1000 ;
dimension = 2 ;
data = rand(dimension,N) ;
To estimate a Gaussian mixture for this dataset, the following commands can be invoked:
numClusters = 30 ;
[means, sigmas, weights] = vl_gmm(data, numClusters);
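As a quick sanity check, one can inspect the outputs (a minimal sketch; the shapes below follow the conventions used throughout this tutorial):
size(means)    % dimension x numClusters
size(sigmas)   % dimension x numClusters (one diagonal covariance entry per row)
size(weights)  % 1 x numClusters
sum(weights)   % the mixture weights sum to one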
The variables means, sigmas, and weights store, respectively, the means, the (diagonal) covariances, and the weights of the estimated Gaussians that form the mixture. One possible outcome of the algorithm is presented in the following figure:
The visualization was produced using the vl_plotframe function.
figure
hold on
plot(data(1,:), data(2,:), 'r.') ;
for i = 1:numClusters
  % draw the i-th component as an unoriented ellipse [x y S11 S12 S22]
  vl_plotframe([means(:,i)' sigmas(1,i) 0 sigmas(2,i)]) ;
end
Covariance optimization
Note that the ellipses are axis aligned. This is a consequence of the optimization method: for the sake of speed, all computations are performed only with the diagonals of the covariance matrices, so each component's covariance is constrained to be diagonal.
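Since only the diagonals are stored, the full covariance matrix of a component can be recovered as a diagonal matrix; a minimal sketch:
% rebuild the full (diagonal) covariance matrix of component i
% from its stored column of variances
i = 1 ;
Sigma_i = diag(sigmas(:,i)) ;  % dimension x dimension, zero off the diagonal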
GMM Initialization
The simplest way to initialize the GMM algorithm is to pick a random subset of numClusters data points as the initial means of the individual Gaussians, the covariance of the whole dataset as the initial covariance of every component, and equal weights, summing to one, as the initial weight of each Gaussian. This random method is the default when running the vl_gmm function; what it amounts to is sketched below.
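A conceptual sketch of this default random initialization (an illustration only, not the library's internal implementation):
% pick random data points as means; reuse the whole-dataset covariance
perm = randperm(N) ;
initMeans = data(:, perm(1:numClusters)) ;              % random points as means
initSigmas = repmat(diag(cov(data')), 1, numClusters) ; % dataset covariance diagonal
initWeights = ones(1, numClusters) / numClusters ;      % equal weights summing to one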
Alternatively, the user can supply a custom initialization. When the 'Initialization' option is set to 'Custom', the options 'InitMeans', 'InitSigmas', and 'InitWeights' must also be set. This approach is frequently combined with the KMeans algorithm: KMeans is used to obtain the initial means, covariances, and weights of the Gaussians, and the EM algorithm then refines them. The workflow is shown in the following piece of code:
%% data init
numClusters = 30 ;
numData = 1000 ;
dimension = 2 ;
data = rand(dimension, numData) ;

%% kmeans initialization
[initMeans, assignments] = vl_kmeans(data, numClusters, ...
    'algorithm','lloyd', ...
    'MaxNumIterations',5) ;

initSigmas = zeros(dimension,numClusters) ;
initWeights = zeros(1,numClusters) ;

%% find initial means, sigmas and weights
for i = 1:numClusters
  data_k = data(:, assignments == i) ;
  % the weight of a cluster is the fraction of points assigned to it
  initWeights(i) = size(data_k,2) / numData ;
  if size(data_k,2) < 2
    % degenerate cluster: fall back to the covariance of the whole dataset
    initSigmas(:,i) = diag(cov(data')) ;
  else
    initSigmas(:,i) = diag(cov(data_k')) ;
  end
end

%% gmm estimation
[means, sigmas, weights, ll, posteriors] = vl_gmm(data, numClusters, ...
    'initialization','custom', ...
    'InitMeans',initMeans, ...
    'InitSigmas',initSigmas, ...
    'InitWeights',initWeights) ;
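Besides the mixture parameters, the call above also returns the final log-likelihood ll and the soft assignments posteriors. A minimal sketch of turning the latter into hard cluster assignments (assuming posteriors has one row per cluster and one column per data point; transpose if your version returns the opposite layout):
% hard assignments from soft posteriors
[~, hardAssignments] = max(posteriors, [], 1) ;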
The demo scripts vl_demo_gmm_2d and vl_demo_gmm_3d also produce colorized figures of similar GMM estimations in two and three dimensions.
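To run them (assuming VLFeat is installed and vl_setup is on the MATLAB path; the 'demo' argument adds the demo scripts to the path):
% assumes a standard VLFeat installation
vl_setup demo
vl_demo_gmm_2d
vl_demo_gmm_3d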