SIERRA: Sparse Structured Methods for Machine Learning
Coordinator: Francis Bach. Ended December 2014.

SIERRA was a research project funded by the European Research Council (ERC) and coordinated by Francis Bach, from December 2009 to December 2014. It was located within the joint INRIA/CNRS/École Normale Supérieure computer science laboratory in downtown Paris.

Summary

Machine learning is now a core part of
many research domains, where the abundance of data has forced researchers
to rely on automated information processing. In practice, today, machine
learning techniques are applied in two stages: practitioners first build a
large set of features; then, off-the-shelf algorithms are used to solve
the appropriate prediction tasks, such as classification or regression.
While this has led to significant advances in many domains, the potential
of machine learning is far from being fulfilled.

The tenet of the SIERRA project was that to achieve the expected breakthroughs, this two-stage paradigm should be replaced by an integrated process where the specific structure of a problem is taken into account explicitly in the learning process. This allows the consideration of massive numbers of features, in both numerically efficient and theoretically well-understood ways. The SIERRA project-team attacked this problem through the tools of regularization by sparsity-inducing norms. The scientific objective was thus to marry structure with sparsity: this was particularly challenging because structure may occur in various ways (discrete, continuous or mixed) and the targeted applications in computer vision and audio processing lead to large-scale convex optimization problems.
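
To make "regularization by sparsity-inducing norms" concrete, here is a minimal sketch (illustrative Python/NumPy written for this summary, with made-up vectors and groups; it is not code from the project) contrasting the proximal operator of the l1 norm, which zeroes out individual coefficients, with that of a non-overlapping group norm, which zeroes out whole predefined blocks at once:

```python
import numpy as np

def prox_l1(w, lam):
    """Soft-thresholding: prox of lam * ||w||_1 (plain, unstructured sparsity)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_group_l2(w, groups, lam):
    """Block soft-thresholding: prox of lam * sum_g ||w_g||_2, the simplest
    structured sparsity-inducing norm; whole groups vanish together."""
    out = w.copy()
    for g in groups:  # each g is an array of indices
        norm_g = np.linalg.norm(w[g])
        scale = max(0.0, 1.0 - lam / norm_g) if norm_g > 0 else 0.0
        out[g] = scale * w[g]
    return out

w = np.array([0.3, -1.5, 0.2, 2.0, -0.1, 0.05])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(prox_l1(w, 0.5))                 # zeroes individual small coefficients
print(prox_group_l2(w, groups, 0.5))   # the last group is zeroed as a block
```

Structure (discrete, continuous or mixed) enters through the choice of groups or, more generally, of the norm itself.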

Throughout the project, the SIERRA project-team members have been able to provide a general and flexible framework that formalizes what can and cannot be achieved with tools from convex optimization. In particular, given explicit prior knowledge obtained from experts, we have developed an automated way to transform it into a principled, stable, and efficient convex optimization formulation. These theoretical developments went hand in hand with provably efficient optimization algorithms for problems with millions of features or observations. This has already allowed applications to large-scale problems in neuro-imaging (brain reading in functional magnetic resonance imaging), computer vision (image denoising and category-level object recognition) and audio processing (separation of instruments and voice in musical audio tracks).

Several avenues for further study have been identified, such as the need for distributed processing in large-scale scenarios and adaptivity to the specifics of the learning problem and the hardware architecture; these are currently under investigation.


Main outcomes (Excerpts from the final activity report)

Research and technological achievements, and their impact and use

The main goal of the ERC project was to develop new methodologies for sparsity-inducing norms, with a triple theoretical, algorithmic, and applicative goal. We have had contributions and impact on all three aspects:

(1) Theory: norm design through submodular functions, and the associated theoretical analysis. Analysis of stochastic approximation algorithms with improved convergence rates (see the sketch after this list).

(2) Algorithms: large-scale convex optimization, with two main aspects, namely non-smoothness and the minimization of sums of a large number of functions (common in machine learning and big-data problems).

(3) Applications: we have been able to provide applications of our new framework in neuro-imaging (brain reading in functional magnetic resonance imaging), computer vision (object recognition and image denoising/deblurring), audio processing (source separation) and natural language processing (learning of embeddings).
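
To illustrate point (1), the following sketch (our own illustrative Python on a synthetic least-squares problem; the constant step size is a heuristic choice, not a prescription from the papers) runs stochastic gradient descent with Polyak-Ruppert averaging, the kind of stochastic approximation scheme whose non-asymptotic convergence rates were analyzed during the project:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: y = x . w_star + noise.
n, d = 10_000, 20
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Stochastic gradient descent with a constant step size, plus
# Polyak-Ruppert averaging of the iterates; for least-squares this
# combination attains the O(1/n) rate studied in the project.
w = np.zeros(d)
w_avg = np.zeros(d)
step = 0.5 / np.mean(np.sum(X**2, axis=1))  # heuristic constant step
for t in range(n):
    i = rng.integers(n)
    grad = (X[i] @ w - y[i]) * X[i]  # gradient of one sample's squared loss
    w -= step * grad
    w_avg += (w - w_avg) / (t + 1)   # running average of the iterates

print(np.linalg.norm(w_avg - w_star))  # close to the noise level
```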


Novel and/or unconventional methodologies

During our ERC project, we developed two major novel methodologies:

(1) Norm design: We have been able to change the way that sparsity-inducing norms are perceived. Over the preceding ten years, researchers had been analyzing existing tools (here, norms used for regularization in machine learning and signal processing) and showing under what circumstances they were useful. Guillaume Obozinski (a researcher financed by the ERC project) and I developed a general theory that makes it possible to automatically design a sparsity-inducing norm from the desired effects on the problem. This norm design was made possible through the use of submodular functions; using such tools was totally unexpected at the start of the project.
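
As a small illustration of this norm-design principle, the sketch below (illustrative Python; the helper `lovasz_extension` is ours, not project code) evaluates the Lovász extension f of a set function F by the classical greedy algorithm. For a nondecreasing submodular F with F(empty) = 0, Omega(w) = f(|w|) is a norm, and different choices of F yield different sparsity patterns: F(A) = |A| recovers the l1 norm, while F(A) = min(|A|, 1) yields the l-infinity norm.

```python
import numpy as np

def lovasz_extension(F, w):
    """Greedy evaluation of the Lovasz extension f at |w|: sort the
    magnitudes in decreasing order and telescope F along the chain of
    top-k sets. For nondecreasing submodular F with F(empty) = 0,
    Omega(w) = f(|w|) is a sparsity-inducing norm."""
    a = np.abs(w)
    order = np.argsort(-a)          # indices by decreasing magnitude
    val, prev, chain = 0.0, 0.0, []
    for j in order:
        chain.append(j)
        Fk = F(frozenset(chain))    # F evaluated on the top-k set
        val += a[j] * (Fk - prev)
        prev = Fk
    return val

w = np.array([0.5, -2.0, 1.0])
print(lovasz_extension(lambda A: len(A), w))          # 3.5 = ||w||_1
print(lovasz_extension(lambda A: min(len(A), 1), w))  # 2.0 = ||w||_inf
```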

(2) Large-scale convex optimization: with the ERC post-doctoral researchers Mark Schmidt and Nicolas Le Roux, we developed the “stochastic average gradient” (SAG) algorithm, the first algorithm for optimizing finite convex sums whose iterations have constant running-time complexity while retaining an exponential convergence rate. This surprising result has had a strong impact on machine learning and convex optimization, in particular in the big-data era.
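
A minimal sketch of the idea behind SAG follows (our own illustrative Python on a synthetic l2-regularized logistic-regression problem; the 1/(16L) step size follows the published analysis, everything else is a simplification). The method stores the last gradient seen for each function and steps along their average, so each iteration costs as much as one SGD step while using information from all n functions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic l2-regularized logistic regression: f(w) = (1/n) sum_i f_i(w).
n, d, mu = 1_000, 10, 1e-2
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def grad_i(w, i):
    """Gradient of a single regularized logistic loss f_i."""
    s = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
    return s * X[i] + mu * w

w = np.zeros(d)
table = np.zeros((n, d))   # last gradient seen for every function (O(nd) memory)
g_sum = np.zeros(d)        # running sum of the stored gradients
L = 0.25 * np.max(np.sum(X**2, axis=1)) + mu  # smoothness constant estimate
step = 1.0 / (16 * L)      # step size from the SAG analysis
for t in range(20 * n):    # roughly 20 passes over the data
    i = rng.integers(n)
    g_new = grad_i(w, i)
    g_sum += g_new - table[i]  # O(d) refresh of the averaged gradient
    table[i] = g_new
    w -= step * g_sum / n

full_grad = X.T @ (-y / (1.0 + np.exp(y * (X @ w)))) / n + mu * w
print(np.linalg.norm(full_grad))  # near zero after a few passes
```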


Inter and cross disciplinary developments

As expected, the ERC project has been interdisciplinary: most of our work on real data has been done in collaboration with researchers from other disciplines such as neuro-imaging, computer vision, bioinformatics, and audio processing. In particular:

- Structured sparsity-inducing norms are now routinely used in all data-oriented fields.

- Large-scale convex optimization is routinely used in all disciplines and many areas of industry. Moreover, through our theoretical work, we have bridged gaps between mathematical statistics and computer science, in particular through the use of submodular functions in a statistical context.


Knowledge and technology transfer

Although technology transfer was not the main goal of our ERC project, we have been able to contribute very significantly through our relationships with various companies, such as Microsoft, Technicolor, and Google. In addition, several researchers financed by the ERC project were hired by major companies (e.g., Facebook, Amazon, Criteo), further enhancing our technology transfer. Our work on large-scale optimization (with and without sparsity-inducing norms) was particularly appreciated.

Enhancing the immediate research environment

Through the ERC project, we have been able to create durable relationships between the signal processing and machine learning communities. Coming from machine learning, I have been invited to many signal processing venues, while I have co-organized signal-processing-oriented workshops at machine learning conferences. The ERC project was key to this success.

Establishment and/or consolidation of the research group and team composition

My ERC project mainly financed personnel. I have been fortunate to recruit very talented postdocs coming from major international universities (e.g., U.C. Berkeley, Cambridge, Montreal, Vancouver), while my recruiting of students was mainly local (I mostly recruit them from my Master's class). My host institution INRIA has been particularly helpful in providing extra support and funds. In particular, we have officially been a “project-team” since 2011, that is, a team of 4 permanent researchers (2 financed by INRIA, 2 by CNRS) with 3 postdocs (financed from external funds) and around 10 students. The ERC project was the center of the team at its creation (they share the same name), but the team has since broadened its research goals and is, of course, permanent.

Related courses / tutorials


**July 2014**: IFCAM Summer School, Indian Institute of Science, Bangalore - Large-scale machine learning and convex optimization [slides]

**March 2014**

**September 2013**: Fourth Cargese Workshop on Combinatorial Optimization - Machine learning and convex optimization with submodular functions

**July 2012**: Computer Vision and Machine Learning Summer School, Grenoble - Kernel methods and sparse methods for computer vision

**July 2011**: Computer Vision and Machine Learning Summer School, Paris - Kernel methods and sparse methods for computer vision

**September 2010**: ECML/PKDD Tutorial on Sparse methods for machine learning (Theory and algorithms)

**July 2010**: Computer Vision and Machine Learning Summer School, Grenoble - Kernel methods and sparse methods for computer vision

**July 2010**: Signal processing summer school, Peyresq - Sparse methods for machine learning

**June 2010**: CVPR Tutorial on Sparse Coding and Dictionary Learning for Image Analysis - slides of ML part

**December 2009**: NIPS Tutorial on Sparse methods for machine learning (Theory and algorithms)

**September 2009**: ICCV Tutorial on Sparse Coding and Dictionary Learning for Image Analysis

**October 2008**: ECCV Tutorial on Supervised Learning: Introduction - Part I (Theory) - Part II (Algorithms)

**September 2008**: Machine Learning Summer School, Île de Ré - Learning with sparsity-inducing norms (slides)

**January 2008**: Workshop RASMA, Franceville, Gabon - Introduction to kernel methods (slides in French)

Related publications


**2015**

A. Defossez, F. Bach. Averaged Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [pdf]

S. Lacoste-Julien, F. Lindsten, F. Bach. Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [pdf]

**2014**

F. Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. Technical report, HAL 01098505, 2014. [pdf]

J. Mairal, F. Bach, J. Ponce. Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Vision, 8(2-3):85-283, 2014. [pdf]

D. Garreau, R. Lajugie, S. Arlot, F. Bach. Metric Learning for Temporal Sequence Alignment. Advances in Neural Information Processing Systems (NIPS), 2014. [pdf]

A. Dieuleveut, F. Bach. Non-parametric Stochastic Approximation with Large Step Sizes. Technical report, HAL 01053831, 2014. [pdf]

A. Defazio, F. Bach, S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. Advances in Neural Information Processing Systems (NIPS), 2014. [pdf]

R. Gribonval, R. Jenatton, F. Bach. Sparse and spurious: dictionary learning with noise and outliers. Technical report, HAL 01025503, 2014. [pdf]

E. Grave, G. Obozinski, F. Bach.

N. Shervashidze, F. Bach. Learning to Learn for Structured Sparsity. Technical report, HAL 00986380, 2014. [pdf] [code]

F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(Feb):595-627, 2014. [pdf]

A. d'Aspremont, F. Bach, L. El Ghaoui. Approximation Bounds for Sparse Principal Component Analysis. Mathematical Programming, 2014. [pdf]

**2013**

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical report, HAL 00831977, 2013. To appear in Advances in Neural Information Processing Systems (NIPS). [pdf] [slides] [IPAM slides]

S. Jegelka, F. Bach, S. Sra. Reflection methods for user-friendly submodular optimization. Technical report, HAL 00905258, 2013. To appear in Advances in Neural Information Processing Systems (NIPS). [pdf]

B. Mishra, G. Meyer, F. Bach, R. Sepulchre. Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 23(4):2124-2149, 2013. [pdf]

F. Fogel, R. Jenatton, F. Bach, A. d'Aspremont. Convex Relaxations for Permutation Problems. Technical report, arXiv:1306.4805, 2013. To appear in Advances in Neural Information Processing Systems (NIPS). [pdf]

M. Schmidt, N. Le Roux, F. Bach. Minimizing Finite Sums with the Stochastic Average Gradient. Technical report, HAL 00860051, 2013. [pdf] [code]

K. S. Sesh Kumar and F. Bach. Maximizing submodular functions using probabilistic graphical models. Technical report, HAL 00860575, 2013. [pdf]

A. Nelakanti, C. Archambeau, J. Mairal, F. Bach, G. Bouchard. Structured Penalties for Log-linear Language Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013. [pdf]

F. Bach. Convex relaxations of structured matrix factorizations. Technical report, HAL 00861118, 2013. [pdf]

F. Bach. Duality between subgradient and conditional gradient methods. Technical report, HAL 00757696-v3, 2013. [pdf]

Z. Harchaoui, F. Bach, O. Cappe and E. Moulines. Kernel-Based Methods for Hypothesis Testing: A Unified View. IEEE Signal Processing Magazine, 30(4):87-97, 2013. [pdf]

E. Grave, G. Obozinski, F. Bach.

E. Richard, F. Bach, and J.-P. Vert. Intersecting singularities for multi-structured estimation. Proceedings of the International Conference on Machine Learning (ICML), 2013. [pdf]

G. Rigaill, T. D. Hocking, F. Bach, and J.-P. Vert. Learning Sparse Penalties for Change-Point Detection using Max Margin Interval Regression. Proceedings of the International Conference on Machine Learning (ICML), 2013. [pdf]

F. Bach. Sharp analysis of low-rank kernel matrix approximations. Technical report, HAL 00723365, 2013.

R. Lajugie, S. Arlot and F. Bach. Large-Margin Metric Learning for Partitioning Problems. Technical report, HAL 00796921 , 2013. [pdf]

K. S. Sesh Kumar and F. Bach. Convex Relaxations for Learning Bounded Treewidth Decomposable Graphs. Technical report, HAL 00763921, 2012. Proceedings of the International Conference on Machine Learning (ICML). [pdf]

N. Le Roux, F. Bach. Local component analysis. Proceedings of the International Conference on Learning Representations (ICLR), 2013.

**2012**

G. Obozinski and F. Bach. Convex Relaxation for Combinatorial Penalties. Technical report, HAL 00694765, 2012. [pdf]

N. Le Roux, M. Schmidt, F. Bach. A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. Technical report, HAL 00674995, 2012.

R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, E. Eger, F. Bach, B. Thirion. Multi-scale Mining of fMRI data with Hierarchical Structured Sparsity. SIAM Journal on Imaging Sciences, 5(3):835-856, 2012. [pdf]

F. Bach, S. Lacoste-Julien, G. Obozinski. On the Equivalence between Herding and Conditional Gradient Algorithms. Proceedings of the International Conference on Machine Learning (ICML), 2012.

F. Bach, R. Jenatton, J. Mairal, G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.

J. Mairal, F. Bach, J. Ponce. Task-Driven Dictionary Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791-804, 2012. [pdf]

**2011**

F. Bach.

C. Archambeau, F. Bach. Multiple Gaussian process models. Technical report, arXiv:1110.5238, 2011. [pdf]

F. Bach, E. Moulines. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. Advances in Neural Information Processing Systems (NIPS), 2011. [pdf] [long-version-pdf-HAL]

M. Schmidt, N. Le Roux, F. Bach. Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization. Advances in Neural Information Processing Systems (NIPS), 2011. [pdf]

E. Grave, G. Obozinski, F. Bach.

B. Mishra, G. Meyer, F. Bach, R. Sepulchre. Low-rank optimization with trace norm penalty. Technical report, arXiv:1112.2318, 2011.

R. Jenatton, J.-Y. Audibert, F. Bach. Structured Variable Selection with Sparsity-inducing Norms. Journal of Machine Learning Research, 12:2777-2824, 2011. [pdf] [code]

N. Le Roux, F. Bach. Local component analysis. Technical report, HAL 00617965, 2011.

J. Mairal, R. Jenatton, G. Obozinski, F. Bach. Convex and Network Flow Optimization for Structured Sparsity.

R. Jenatton, J. Mairal, G. Obozinski, F. Bach. Proximal Methods for Hierarchical Sparse Coding. Journal of Machine Learning Research, 12:2297-2334, 2011. [pdf]

F. Couzinie-Devy, J. Mairal, F. Bach, J. Ponce. Dictionary Learning for Deblurring and Digital Zoom. Technical report, HAL inria-00627402, 2011. [pdf]

Y-L. Boureau, N. Le Roux, F. Bach, J. Ponce, Y. LeCun. Ask the locals: multi-way local pooling for image recognition. Proceedings of the International Conference on Computer Vision (ICCV), 2011. [pdf]

M. Solnon, S. Arlot, F. Bach. Multi-task Regression using Minimal Penalties. Technical report, HAL 00610534, 2011. [pdf]

A. Lefèvre, F. Bach, C. Févotte. Online algorithms for Nonnegative Matrix Factorization with the Itakura-Saito divergence. Technical report, HAL 00602050, 2011. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011. [pdf]

T. Hocking, A. Joulin, F. Bach, J.-P. Vert. Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties. Proceedings of the International Conference on Machine Learning (ICML), 2011. [pdf]

L. Benoît, J. Mairal, F. Bach, J. Ponce. Sparse Image Representation with Epitomes. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2011. [pdf]

A. Lefèvre, F. Bach, C. Févotte. Itakura-Saito nonnegative matrix factorization with group sparsity. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011. [pdf]

F. Bach, R. Jenatton, J. Mairal, G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, S. J. Wright, editors, Optimization for Machine Learning, MIT Press, 2011. [pdf]

**2010**

F. Bach. Convex Analysis and Optimization with Submodular Functions: a Tutorial. Technical report, HAL 00527714, 2010. [pdf]

F. Bach. Structured Sparsity-Inducing Norms through Submodular Functions. Advances in Neural Information Processing Systems (NIPS), 2010. [pdf] [long version, arxiv] [slides]

J. Mairal, R. Jenatton, G. Obozinski, F. Bach. Network Flow Algorithms for Structured Sparsity. Advances in Neural Information Processing Systems (NIPS), 2010. [pdf]

M. Hoffman, D. Blei, F. Bach. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems (NIPS), 2010. [pdf]

F. Bach, S. D. Ahipasaoglu, A. d'Aspremont. Convex Relaxations for Subset Selection. Technical report, arXiv:1006.3601, 2010. [pdf]

R. Jenatton, J. Mairal, G. Obozinski, F. Bach. Proximal Methods for Sparse Hierarchical Dictionary Learning. Proceedings of the International Conference on Machine Learning (ICML), 2010. [pdf] [slides]

M. Journée, F. Bach, P.-A. Absil, R. Sepulchre. Low-Rank Optimization on the Cone of Positive Semidefinite Matrices. SIAM Journal on Optimization, 20(5):2327-2351, 2010. [pdf] [code]

Y-L. Boureau, F. Bach, Y. LeCun, J. Ponce. Learning Mid-Level Features For Recognition. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2010. [pdf]

R. Jenatton, G. Obozinski, F. Bach. Structured Sparse Principal Component Analysis. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010. [pdf] [code]

Related software

Sparse modeling software - SPAMS (C)

Computing regularization paths for multiple kernel learning - version 1.0 (matlab)

Predictive low-rank decomposition for kernel methods - version 1.0 (matlab/C)

Support Kernel Machine - Multiple kernel learning (matlab)

SimpleMKL - version 1.0 (matlab)

Grouplasso - version 1.0 (matlab)

Hierarchical kernel learning - version 3.0 (matlab)