Data Team - DI - ENS Paris

Scattering

Scattering for Audio Processing

Mel-frequency spectral coefficients (MFSCs) are computed by averaging a spectrogram along the frequency axis according to a mel-frequency scale. This averaging makes the coefficients stable to time-warping deformations. Consider a signal \( x(t) \) deformed into \( L_\tau{}x(t) = x(t-\tau(t)) \) by a deformation \( \tau(t) \). A feature representation \(\Phi\), mapping a signal \(x\) to a feature vector \(\Phi{}x\), is said to be stable to deformation if the Euclidean norm \( \| \Phi{}L_\tau{}x-\Phi{}x \| \) is small when \( \tau \) is small. MFSCs satisfy this condition, but the spectrogram does not, because its high-frequency components are unstable. However, the averaging that stabilizes MFSCs loses information at high frequencies, and the loss grows as the window size increases. The window size is therefore kept small, around 20 ms, so MFSCs cannot capture large-scale structures.
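The stability condition can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions, not the computation used in the papers: the signal is a single 6 kHz tone, the deformation is the dilation \( \tau(t) = \epsilon t \) with \( \epsilon = 0.02 \), and rectangular third-octave bands stand in for the mel filters at high frequencies. The sampling rate, frame length, band edges and all function names are choices made for this example.

```python
import numpy as np

def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame (a single spectrogram column)."""
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2

def band_energies(spec, freqs, edges):
    """Sum spectral energy inside each band [edges[k], edges[k+1])."""
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

def relative_change(a, b):
    """Euclidean distance between two feature vectors, relative to the first."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)

sr = 16000                      # sampling rate in Hz (assumed)
n = 320                         # 20 ms frame
t = np.arange(n) / sr
f0, eps = 6000.0, 0.02          # tone frequency and warp tau(t) = eps * t (assumed)

x = np.cos(2 * np.pi * f0 * t)
x_warped = np.cos(2 * np.pi * f0 * (t - eps * t))   # x(t - tau(t))

freqs = np.fft.rfftfreq(n, d=1.0 / sr)
# Third-octave band edges from 500 Hz to 8 kHz: a rough stand-in for mel bands.
edges = 500.0 * 2.0 ** (np.arange(13) / 3.0)

spec, spec_warped = power_spectrum(x), power_spectrum(x_warped)
d_spec = relative_change(spec, spec_warped)
d_bands = relative_change(band_energies(spec, freqs, edges),
                          band_energies(spec_warped, freqs, edges))
print(f"spectrogram column: {d_spec:.3f}   band-averaged: {d_bands:.3f}")
```

The warp shifts the tone by \( \epsilon f_0 = 120 \) Hz, which is more than two frequency bins of the 20 ms frame, so the two spectrogram columns barely overlap. Both tones fall inside the same wide band, so the band-averaged features hardly change.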

Scattering coefficients extend MFSCs through a cascade of wavelet modulus operators that recovers this lost information. As a result, scattering coefficients can be computed over larger window sizes with much less loss of information, which allows larger-scale structures to be captured. These include timbral structures such as attacks, amplitude and frequency modulations, and interference phenomena found in musical chords.
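As a rough illustration of the cascade, first-order coefficients can be written \( S_1x(t,\lambda_1) = |x \star \psi_{\lambda_1}| \star \phi(t) \) and second-order coefficients \( S_2x(t,\lambda_1,\lambda_2) = ||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}| \star \phi(t) \), where \( \psi_\lambda \) are band-pass wavelets and \( \phi \) is a low-pass filter whose support is the window size \( T \). The NumPy sketch below is a minimal version under simplifying assumptions and is not the ScatNet implementation: the wavelets are plain Gaussian bumps in the Fourier domain with non-positive frequencies zeroed, the low-pass bandwidth is taken to be about \( 1/T \), all pairs \( (\lambda_1, \lambda_2) \) are computed, and nothing is subsampled. Function names and parameter values are hypothetical.

```python
import numpy as np

def gabor_bank(n, sr, f_max, n_filters, q):
    """Analytic Gaussian band-pass filters sampled in the Fourier domain.

    Centers are f_max * 2**(-k/q) and bandwidths scale with the center
    (constant-Q). Non-positive frequencies are zeroed so each filter is
    analytic with zero mean: a crude stand-in for a Morlet wavelet.
    """
    freqs = np.fft.fftfreq(n, d=1.0 / sr)
    centers = f_max * 2.0 ** (-np.arange(n_filters) / q)
    sigmas = centers / (2.0 * q)
    bank = np.exp(-0.5 * ((freqs[None, :] - centers[:, None]) / sigmas[:, None]) ** 2)
    bank[:, freqs <= 0] = 0.0
    return centers, bank

def lowpass(n, sr, T):
    """Gaussian low-pass with bandwidth of about 1/T Hz (window size ~ T)."""
    freqs = np.fft.fftfreq(n, d=1.0 / sr)
    return np.exp(-0.5 * (freqs * T) ** 2)

def scattering(x, sr, T, bank1, bank2):
    """Return S1 with shape (J1, n) and S2 with shape (J1, J2, n)."""
    phi = lowpass(len(x), sr, T)
    # First layer: wavelet modulus |x * psi_1|, then averaging by phi.
    u1 = np.abs(np.fft.ifft(np.fft.fft(x)[None, :] * bank1, axis=-1))
    u1_hat = np.fft.fft(u1, axis=-1)
    s1 = np.real(np.fft.ifft(u1_hat * phi, axis=-1))
    # Second layer: wavelet modulus of each envelope, then averaging by phi.
    u2 = np.abs(np.fft.ifft(u1_hat[:, None, :] * bank2[None, :, :], axis=-1))
    s2 = np.real(np.fft.ifft(np.fft.fft(u2, axis=-1) * phi, axis=-1))
    return s1, s2

# Test signal: a 1 kHz tone with a 20 Hz amplitude modulation, one second long.
sr = 8000
t = np.arange(sr) / sr
f_c, f_m = 1000.0, 20.0
x = (1.0 + 0.8 * np.cos(2 * np.pi * f_m * t)) * np.cos(2 * np.pi * f_c * t)

centers1, bank1 = gabor_bank(len(x), sr, f_max=2000.0, n_filters=9, q=4)
centers2, bank2 = gabor_bank(len(x), sr, f_max=160.0, n_filters=5, q=1)
j_c = int(np.argmin(np.abs(centers1 - f_c)))      # first-order filter at the carrier

s1_short, _ = scattering(x, sr, 0.02, bank1, bank2)      # T = 20 ms
s1_long, s2_long = scattering(x, sr, 0.5, bank1, bank2)  # T = 500 ms

# Relative temporal fluctuation of the first-order coefficient at the carrier.
ripple_short = s1_short[j_c].std() / s1_short[j_c].mean()
ripple_long = s1_long[j_c].std() / s1_long[j_c].mean()
# Second-order filter that responds most strongly to the carrier's envelope.
peak_rate = centers2[np.argmax(s2_long[j_c].mean(axis=-1))]
print(ripple_short, ripple_long, peak_rate)
```

With the short window, the first-order coefficient at the carrier still fluctuates at the modulation rate. With the long window that fluctuation is averaged out, and the modulation reappears in the second-order coefficients, which peak at the filter centered on the modulation frequency.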

References:

Multiscale Scattering for Audio Classification, Andén J. and Mallat S., Proceedings of the ISMIR 2011 conference, pp. 657-662, 2011. (PDF)
Deep Scattering Spectrum, Andén J. and Mallat S., Submitted to IEEE Transactions on Signal Processing, 2011. (PDF)
Scattering Representation of Modulated Sounds, Andén J. and Mallat S., Proceedings of the DAFx 2012 conference, 2012. (PDF)

Software

ScatNet: a toolbox for computing scattering transforms, with a classification framework based on affine space or support vector classifiers.

Reconstruction Examples

Scattering coefficients up to order 2 were computed for a 6-second clip of classical music for different window sizes (20ms, 400ms, 740ms, 1.5s and 4s). These coefficients were then used to reconstruct the original signal, first using only the first-order coefficients, and then using both the first- and second-order coefficients. The results are as follows:

Original:
T=20ms,m=1:		T=20ms,m=2:
T=400ms,m=1:		T=400ms,m=2:
T=740ms,m=1:		T=740ms,m=2:
T=1.5s,m=1:		T=1.5s,m=2:
T=3s,m=1:		T=3s,m=2:

Note that although the reconstruction is of good quality for m=1 when T is small, it deteriorates significantly as T increases. Adding the second order restores the quality, especially for transients such as attacks. For even larger values of T, however, the quality of the reconstruction for m=2 suffers as well.