Quickstart for Audio Processing

Here is a quick tutorial on scattering representations of sounds using ScatNet.

Setup

ScatNet is now setup and will automatically be loaded at each new MATLAB session.

Fundamentals

This section describes how to compute the scattering spectrum of an audio signal with just three instructions to ScatNet.

The sound file we propose to use for this tutorial is an excerpt of Handel's Messiah, available in all versions of MATLAB. Once loaded the handel.mat file, we choose a length T for the averaging window. Since the signal y is encoded at the sampling rate of \( 8000 \text{\,Hz} \) setting T to 4096 corresponds to about half a second.

load handel;    % loads the signal into y
N = length(y);
T = 2^12;       % length of the averaging window

Before computing the scattering transform of y, we need to define the linear operators that will make up the layers of the scattering networks. In audio processing, these linear operators are constant-Q filter banks. In most experiments, two layers are sufficient, as they capture almost the whole energy in the signal, along an averaging window of up to one second. By default, the quality factors of these two layers are \( Q_1 = 8 \) and \( Q_2 = 1 \). By calling default_filter_options with the label 'audio', these parameters are automatically integrated into the structure filt_opt, which contains the options related to the construction of the filter banks.

filt_opt = default_filter_options('audio', T);

Scattering networks rely on a set of built-in "wavelet factories", which are suited to specific classes of signals. For translation-invariant representations in one dimension, one may call wavelet_factory_1d in the following way :

Wop = wavelet_factory_1d(N, filt_opt);
The previous instruction may take a few seconds on a personal computer, as it requires to build many filters at once.
Note : with the new version of ScatNet, the length N of the signal y no longer needs to be a multiple of the length T of the averaging window.

Finally, the scattering transform of y is computed with the generic function scat, which is at the core of the ScatNet toolbox.

S = scat(y, Wop);
Most often, scat requires less computer time than wavelet_factory_1d. Once the factory has been run, scattering transforms may be calculated iteratively with the same operators Wop, provided the input signals have the same length. More information about the vectorisation of several scattering transforms of one-dimensional signals may be found in the Batch processing section of this tutorial.

Filter banks

The wavelet transforms Wop provided by wavelet_factory_1d are merely function handles, with no actual data in them. Instead, they rely on a cell array named filters, which can be extracted from the factory as an optional second output argument :

[Wop,filters] = wavelet_factory_1d(N, filt_opt);

Like a scattering network, each element in the cell array corresponds to a layer ; however, while S has \( M+1 \) elements, filters has only \( M \) elements, as there is no filter bank at the root.

The psi.filter field of filters{m} contains the band-pass filters \(\psi_\lambda\). These filters are expressed in the Fourier domain, and only their nonzero coefficients are stored. For visualization purposes, one may use realize_filter to convert them to arrays of length N by padding them with zeros.
Here is a short script to display the filter banks at both orders \(M = 1\) and \(M = 2\).

figure;
for m = 1:2
	subplot(1,2,m);
	hold on; 
	for k = 1:length(filters{m}.psi.filter)
		plot(realize_filter(filters{m}.psi.filter{k}, N)); 
	end
	hold off;
	ylim([0 1.5]);
	xlim([1 5*N/8]);
end
This should result in the following figure.

Wavelets are built by dilating a mother wavelet \( \psi \) by a factor \( 2^{1/Q} \), for some quality factor \( Q \), so as to obtain the filter bank \[ \psi_j(t) = 2^{-j/Q}\psi(2^{-j/Q}t),\] or, in terms of Fourier transforms, \[ \hat\psi_j(\omega) = \hat\psi(2^{j/Q}\omega).\] If we suppose that \( \hat\psi \) has central frequency \( \xi_\psi \), \(\hat \psi_j \) will have central frequency \( 2^{-j/Q} \xi_psi\). The scale index \( j \), then is a logarithmic measure of frequency, with a linear increase in \( j \) resulting an exponential decrease of \( \omega \).

In fact, the mother wavelet \( \psi \) may only be dilated up to the maximum scale maximum scale \( J-1 \), resulting in \( J \) wavelets \( \psi_0 = \psi, \psi_1, \ldots, \psi_{J-1} \). The filter \( \psi_{J-1} \) has the smallest possible bandwidth in frequency and the largest possible support in time. If \( Q > 1 \), this is not sufficient to cover the frequency spectrum. To address this issue, \( \psi_{J-1} \) is translated in frequency to capture the remaining frequencies. That way, the temporal bandwidth will not exceed that of \( \psi_{J-1} \). In summary, \( J \) determines the maximum wavelet time support \( T \), which is equal to the averaging length of the scattering transform, and is chosen by the user.

So as to get \( J \) in a practical way, ScatNet provides the utility function T_to_J, which takes the integer T and the structure filt_opt as arguments. Thus, a customized set of options for wavelet_factory_1d may be built in the following way :

filt_opt.Q = [8 8];
filt_opt.J = T_to_J(T, filt_opt);
Note that the original prototype T_to_J(T,Q) is no longer supported in the new version of ScatNet.

To ensure coverage of the frequency domain without becoming redundant, the mother wavelet \( \psi \) is chosen such that adjacent wavelets barely overlap in frequency. In filters, one has \( Q_1 = 8 \) and \( Q_2 = 1 \) by default, which means that the first order has higher resolution in frequency compared to the second. This can be confirmed by looking at the Fourier transforms of the filters above.

Exploration of a network by scale paths

You may recall that scattering coefficients are defined by \[ S_1x(t,j_1) = |x\star\psi_{j_1}|\star\phi(t) \] \[ S_2x(t,j_1,j_2) = ||x\star\psi_{j_1}|\star\psi_{j_2}|\star\phi(t) \] and so on, where \( \psi_{j} \) are band-pass filters and \( \phi \) a low-pass filter.

In ScatNet, the scattering representation S is a cell array, whose elements correspond to respective layers in the scattering transform. Theoretically, the root of the network has the index \( m=0 \) ; however, since MATLAB uses 1-based indexing, an offset of 1 is required in the cell array : the \(m^{\text{th}}\) layer is stored as S{1+m}. Each layer has two fields :

The signal field is a cell array of P elements, each entry corresponding to a node in the scattering layer. The \(k^{\text{th}}\) node of the \(m^{\text{th}}\) layer is expressed as :

S{1+m}.signal{k};

An alternative way to identify scattering nodes is by their paths along the scattering network. For translation-invariant scattering representations, these paths are expressed with the wavelet filter scales \( j_1 \), \( j_2 \), etc. at each layer. To make the correspondence between the cell array S{m+1}.signal of coefficients and these scales, the matrix S{m+1}.meta.j provides the paths of scales of each coefficient in the layer. Namely, S{m+1}.meta.j(:,p) provides the scales [j_1 j_2 ... j_m]' associated with S{m+1}.signal{p}.

Similarly, the meta field in each layer contains other properties of the scattering coefficients, such as resolution, bandwidth, and so forth. Each field is arranged so that the \(k^{\text{th}}\) column is associated with the \(k^{\text{th}}\) signal in S{m+1}.signal. To get the node corresponding to a specifi path, one may use the MATLAB find function ; but be aware that, by default, only paths of increasing scale are computed, so find may return an empty array if the required path does not belong to the network.

Note that convolutions are implemented with symmetric boundary conditions. This can be changed with the filt_opt.boundary parameter. For more information, see the documentation of the filter_bank function.

Display and Formatting

The scattergram built-in function in ScatNet provides a spectrogram-like visualization of the second-order, translation-invariant scattering transform of one-dimensional signals. The top image is made of first-order coefficients, organized in a time-scale matrix. Very low frequencies, i.e. large scales, appear at the bottom of the image. For its part, the second-order coefficients , for a fixed scale j1, are shown in the bottom image, again in a time-scale matrix.
In the following example, one displays the first-order coefficients and second-order coefficients for an arbitrary j1 = 23. Note that the wildcard [] means that all paths are gathered at the first order.

j1 = 23;
scattergram(S{2},[],S{3},j1);
This should give the figure below.

To improve visualization or classification performance, ScatNet offers utility functions to manipulate a scattering representation, such as renormalisation (renorm_scat) and logarithmic transformation (log_scat). The former renormalizes second-order coefficients by dividing them by the corresponding first-order coefficients to decorrelate the two : \(m^{\text{th}}\)-order coefficients are similarly renormalized by dividing by \((m-1)^{\text{th}}\)-order coefficients. The latter computes the logarithm of the coefficients at each point. Both of these functions use the format described above for the scattering transform for both input and output.

Let's renormalize the coefficients and compute the logarithm.

S = renorm_scat(S);
S = log_scat(S);

scattergram(S{2},[],S{3},j1);
This results in a figure where much more details are noticeable.

Batch Processing

In order to use the scattering coefficients for classification or other tasks, we need to assemble them together in vector form. To do this, we use the function format_scat.

[S_table, meta] = format_scat(S);
Here, S_table will be a P-by-N table, where P is the total number of scattering coefficients (all orders combined) and N is the number of time points. Now, S_table can be fed into a classifier for training or testing.

The ScatNet toolbox also contains a pipeline for classification using an affine space classifier or a support vector machine (SVM). To use it, we need to create a database of feature vectors by calculating the scattering coefficients of a collection of audio files. Here, we shall consider the GTZAN dataset for musical genre classification.

First, we create the list of files accompanied by their class labels using the gtzan_src function.

src = gtzan_src('/path/to/gtzan/dataset');
Second, we need to specify the feature transform to use. This is done, as before, using wavelet_factory_1d and the scattering transform functions.
N = 5*2^17;
T = 8192;

filt_opt.Q = [8 1];
filt_opt.J = T_to_J(T, filt_opt);

scat_opt.M = 2;

Wop = wavelet_factory_1d(N, filt_opt, scat_opt);

feature_fun = @(x)(format_scat( ...
	log_scat(renorm_scat(scat(x, Wop)))));
Note that we must specify N = 5*2^17, corresponding to 29.72 s, because the files in src have been truncated to this length. Finally, we can to specify additional options for generating the features. Here, to speed up training and testing, we decide to only keep every eighth frame of the scattering transform.
database_options.feature_sampling = 8;
Now we use the prepare_database function to calculate the scattering coefficients of the files in src.
database = prepare_database(src, feature_fun, database_options);
Depending on the speed of the machine, this can take some time. If you have access to the distributed computing toolbox, invoking matlabpool open before calling prepare_database allows the function to use multiple cores simultaneously.

Support Vector Machine Classification

To perform evaluate classification performance, we split the database into a training and a testing set using the create_partition function.

[train_set, test_set] = create_partition(src, 0.8);
The second parameter specifies the proportion of the objects in each class that are assigned to the training set. Note that the MATLAB random number generator is used to calculate the partition, so if you want to ensure the same results, please reset it using the following code before creating the partition.
rs = RandStream.getDefaultStream();
rs.reset();

When using the SVM classifier, we can speed up calculations by precomputing the kernel. This is done using the svm_calc_kernel function.

database = svm_calc_kernel(database, 'gaussian');
Note that the above requires a lot of memory, in this case almost 3 gigabytes. If necessary, this step can be skipped and the code will run fine.

We now prepare parameter ranges for the SVM and conduct a parameter search using the svm_param_search function.

svm_options.kernel_type = 'gaussian';
svm_options.C = 2.^[0:4:8];
svm_options.gamma = 2.^[-16:4:-8];

[err,C,gamma] = svm_param_search( ...
	database, train_set, [], svm_options);
The svm_param_search function takes four parameters: the feature database, the training set, the development set, and an options structure. If a validation set is present, models are trained using the training set and their error is calculated on the validation set. If a validation set is missing (as in our case), five-fold cross-validation is performed on the training set to estimate the error.

Since we've performed cross-validation, we obtain an error matrix as output err from svm_param_search, the first dimension corresponding to a particular parameter combination and the second dimension corresponding to the fold index. The C and gamma outputs give the particular used in the different parameter combinations. As a result, we obtain the optimal parameters using the following code.

[temp,ind] = min(mean(err,2));
C_best = C(ind);
gamma_best = gamma(ind);

Once we've estimated the proper parameters, we are ready to calculate the error over the testing set. To do this, we set the parameters in the svm_options structure and call svm_train to train the model and svm_test to apply it to the testing set.

svm_options.C = C_best;
svm_options.gamma = gamma_best;

model = svm_train(database, train_set, svm_options);
labels = svm_test(database, model, test_set);
The error can now be calculated using the classif_err function.
test_err = classif_err(labels, test_set, src);

The error should be 0.15 for this split. To further reduce the error, we need to keep more of the scattering frames and also conduct a more extensive parameter search. This may be achieved by using a finer grid or resort to svm_adaptive_param_search.