ERC SIERRA

Sparse Structured Methods for Machine Learning

Francis Bach

ended December 2014




SIERRA was a research project funded by the European Research Council (ERC) and coordinated by Francis Bach, from December 2009 to December 2014. It was located within the joint INRIA/CNRS/Ecole Normale Superieure computer science laboratory in downtown Paris.



Summary

Machine learning is now a core part of many research domains, where the abundance of data has forced researchers to rely on automated information processing. In practice, today, machine learning techniques are applied in two stages: practitioners first build a large set of features; then, off-the-shelf algorithms are used to solve the appropriate prediction tasks, such as classification or regression. While this has led to significant advances in many domains, the potential of machine learning is far from being fulfilled.

The tenet of the SIERRA project was that to achieve the expected breakthroughs, this two-stage paradigm should be replaced by an integrated process where the specific structure of a problem is taken into account explicitly in the learning process. This allows the consideration of massive numbers of features, in both numerically efficient and theoretically well-understood ways. The SIERRA project-team attacked this problem through the tools of regularization by sparsity-inducing norms. The scientific objective was thus to marry structure with sparsity: this was particularly challenging because structure may occur in various ways (discrete, continuous or mixed) and the targeted applications in computer vision and audio processing led to large-scale convex optimization problems.
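As a concrete illustration of regularization with sparsity-inducing norms, here is a minimal sketch (in Python, illustrative only and not code from the project) of the proximal operator of the group-lasso norm, one of the simplest structured norms: entire pre-defined groups of features are either shrunk or discarded at once.

```python
import numpy as np

def prox_group_lasso(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 (group-lasso norm).

    Each group of coefficients is shrunk toward zero, and set exactly to
    zero when its norm falls below the threshold lam; this is how whole
    groups of features are switched off (structured sparsity).
    """
    w = w.copy()
    for g in groups:                      # g is a list of feature indices
        norm_g = np.linalg.norm(w[g])
        if norm_g <= lam:
            w[g] = 0.0                    # the whole group is discarded
        else:
            w[g] *= 1.0 - lam / norm_g    # block soft-thresholding
    return w

# Toy usage: two groups of three features each; the first group vanishes.
w = np.array([0.1, -0.2, 0.05, 1.5, -2.0, 0.8])
groups = [[0, 1, 2], [3, 4, 5]]
print(prox_group_lasso(w, groups, lam=0.5))
```

Richer structures (overlapping groups, hierarchies, graphs) require more intricate norms, which is precisely the setting the project formalized.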

Throughout the project, the SIERRA project-team members have been able to provide a general and flexible framework that can formalize what can and cannot be achieved with tools from convex optimization. In particular, given explicit prior knowledge obtained from experts, we have developed an automated way to transform it into a principled, stable and efficient convex optimization formulation. These theoretical developments came hand in hand with provably efficient optimization algorithms for problems with millions of features or observations. This has already allowed applications to large-scale problems in neuro-imaging (brain reading in functional magnetic resonance imaging), computer vision (image denoising and category-level object recognition) and audio processing (separation of instruments and voice in musical audio tracks).

Several avenues for further study have been identified and are currently under investigation, such as the need for distributed processing in large-scale scenarios and adaptivity to the specifics of the learning problem and the hardware architecture.



Main outcomes (Excerpts from the final activity report)

Research and technological achievements, and their impact and use
The main goal of the ERC project was to develop new methodologies for sparsity-inducing norms, with theoretical, algorithmic and applicative objectives. We have had contributions and impact on all three aspects:
(1) Theory: Norm design through submodular functions, and associated theoretical analysis. Analysis of stochastic approximation algorithms with improved convergence rates.
(2) Algorithms: large-scale convex optimization, with two main aspects: non-smoothness and the minimization of objectives composed of a large number of functions (common in machine learning and big-data problems).
(3) Applications: we have been able to provide applications of our new framework in neuro-imaging (brain reading in functional magnetic resonance imaging), computer vision (object recognition and image denoising/deblurring), audio processing (source separation) and natural language processing (learning of embeddings).

Novel and/or unconventional methodologies
During our ERC project, we developed two major novel methodologies:
(1) Norm design: We have been able to change the way that sparsity-inducing norms are perceived. Over the previous ten years, researchers had been analyzing existing tools (here, norms used for regularization in machine learning and signal processing) and showing under which circumstances they were useful. A researcher financed by the ERC project (Guillaume Obozinski) and I developed a general theory that makes it possible to automatically design a sparsity-inducing norm from its desired effects on the problem (a small illustrative sketch is given after this list). This norm design was made possible through the use of submodular functions. Using such tools was totally unexpected at the start of the project.
(2) Large-scale convex optimization: with the ERC post-doctoral researchers (Mark Schmidt and Nicolas Le Roux), we developed the “stochastic average gradient” (SAG) algorithm, which was the first algorithm for optimizing finite convex sums with both a constant per-iteration cost and an exponential (linear) convergence rate (see the second sketch below). This surprising result has had a strong impact on machine learning and convex optimization, in particular in the big data era.
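
To make the norm-design methodology of item (1) concrete, here is a minimal sketch (in Python, illustrative only and not code produced by the project): a nondecreasing submodular set function F, chosen to encode the desired support structure, yields a convex sparsity-inducing norm through its Lovász extension evaluated at the vector of absolute values.

```python
import numpy as np

def lovasz_extension(F, w):
    """Evaluate the Lovász extension of a set function F at |w|.

    For a nondecreasing submodular F with F(empty set) = 0, the value
    Omega(w) = sum_k |w|_(k) * (F(S_k) - F(S_{k-1})), where the entries
    are sorted as |w|_(1) >= ... >= |w|_(p) and S_k collects the indices
    of the k largest entries, defines a convex sparsity-inducing norm.
    """
    a = np.abs(w)
    order = np.argsort(-a)                     # indices sorted by decreasing |w_j|
    omega, F_prev = 0.0, 0.0
    for k, j in enumerate(order):
        S_k = set(order[: k + 1].tolist())     # support of the k+1 largest entries
        F_curr = F(S_k)
        omega += a[j] * (F_curr - F_prev)
        F_prev = F_curr
    return omega

# F(S) = |S| recovers the l1 norm; F(S) = min(|S|, 1) gives the l_infinity norm.
w = np.array([3.0, -1.0, 0.5])
print(lovasz_extension(lambda S: len(S), w))          # 4.5 = ||w||_1
print(lovasz_extension(lambda S: min(len(S), 1), w))  # 3.0 = ||w||_inf
```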
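
The following sketch illustrates the stochastic average gradient idea of item (2) on a regularized least-squares objective (the objective, step size and variable names are assumptions made for this example, not the project's reference implementation): one stored gradient per data point is refreshed at each iteration, and the update follows the average of all stored gradients, so each iteration has constant cost while convergence is linear for strongly convex problems.

```python
import numpy as np

def sag_least_squares(X, y, lam=0.1, n_iter=10000, seed=0):
    """Stochastic average gradient (SAG) sketch for
    min_w (1/n) * sum_i [ (x_i^T w - y_i)^2 / 2 + (lam/2) * ||w||^2 ].

    One data point is sampled per iteration; only its stored gradient is
    recomputed, and the step follows the average of all stored gradients.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # conservative constant step size (a fraction of 1 / L_max)
    step = 0.25 / (np.max(np.sum(X ** 2, axis=1)) + lam)
    w = np.zeros(d)
    grads = np.zeros((n, d))   # last gradient seen for each data point
    g_avg = np.zeros(d)        # running average of the stored gradients
    for _ in range(n_iter):
        i = rng.integers(n)
        g_new = (X[i] @ w - y[i]) * X[i] + lam * w
        g_avg += (g_new - grads[i]) / n   # refresh the average in O(d)
        grads[i] = g_new
        w -= step * g_avg
    return w

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.standard_normal(200)
print(sag_least_squares(X, y, lam=0.01))
```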

Inter- and cross-disciplinary developments
As expected, the ERC project has been interdisciplinary, as most of our work on real data has been done in collaboration with researchers from other disciplines such as neuro-imaging, computer vision, bioinformatics or audio processing. In particular:
(1) Structured sparsity-inducing norms are now routinely used in all data-oriented fields.
(2) Large-scale convex optimization is routinely used in all disciplines and many areas of industry. Moreover, through our theoretical work, we have bridged gaps between mathematical statistics and computer science, in particular through the use of submodular functions in a statistical context.

Knowledge and technology transfer
Although technology transfer was not the main goal of our ERC project, we have been able to contribute very significantly through our relationships with various companies, such as Microsoft, Technicolor or Google. Also, several researchers financed by the ERC project were hired by major companies (e.g., Facebook, Amazon, Criteo), further enhancing our technology transfer. Our work on large-scale optimization (with and without sparsity-inducing norms) was particularly appreciated.

Enhancing the immediate research environment
Through the ERC project, we have been able to create durable relationships between the signal processing and machine learning communities. Coming from machine learning, I have been invited to many signal processing venues, while I have co-organized signal-processing oriented workshops at machine learning conferences. The ERC project was key to this success.

Establishment and/or consolidation of the research group and team composition
My ERC project mainly financed personnel. I have been fortunate to recruit very talented postdocs coming from major international universities (e.g., U.C. Berkeley, Cambridge, Montreal, Vancouver), while my recruiting of students was mainly local (I mostly recruited them from my Masters class). My host institution INRIA has been particularly helpful in providing extra support and funds. In particular, we are now officially (since 2011) a “project-team”, that is, a team of 4 permanent researchers (2 of them financed by INRIA, 2 by CNRS) with 3 postdocs (financed from external funds) and around 10 students. The ERC project was at the center of the team at its creation (they share the same name), but the team has since broadened its research goals and is, of course, permanent.




Related courses / tutorials



July 2014: IFCAM Summer School, Indian Institute of Science, Bangalore - Large-scale machine learning and convex optimization [slides]
March 2014: YES Workshop, Eurandom, Eindhoven - Large-scale machine learning and convex optimization [slides]

September 2013: Fourth Cargese Workshop on Combinatorial Optimization - Machine learning and convex optimization with submodular functions

September 2012: Machine Learning Summer School, Kyoto - Learning with submodular functions [slides] [notes]

July 2012: Computer Vision and Machine Learning Summer School, Grenoble - Kernel methods and sparse methods for computer vision
July 2011: Computer Vision and Machine Learning Summer School, Paris - Kernel methods and sparse methods for computer vision
September 2010: ECML/PKDD Tutorial on Sparse methods for machine learning (Theory and algorithms)
July 2010: Computer Vision and Machine Learning Summer School, Grenoble - Kernel methods and sparse methods for computer vision
July 2010: Signal processing summer school, Peyresq - Sparse method for machine learning
June 2010: CVPR Tutorial on Sparse Coding and Dictionary Learning for Image Analysis - slides of ML part
December 2009: NIPS Tutorial on Sparse methods for machine learning (Theory and algorithms)
September 2009: ICCV Tutorial on Sparse Coding and Dictionary Learning for Image Analysis
October 2008: ECCV Tutorial on Supervised Learning: Introduction - Part I (Theory) - Part II (Algorithms)
September 2008: Machine Learning Summer School, Ile de Re - Learning with sparsity-inducing norms (slides)

January 2008: Workshop RASMA, Franceville, Gabon - Introduction to kernel methods (slides in French)




Related publications



2015

A. Defossez, F. Bach. Averaged Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [pdf]

S. Lacoste-Julien, F. Lindsten, F. Bach. Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [pdf]


2014


F. Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. Technical report, HAL-01098505, 2014. [pdf]

J. Mairal, F. Bach, J. Ponce. Sparse Modeling for Image and Vision Processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85-283, 2014. [pdf]

D. Garreau, R. Lajugie, S. Arlot and F. Bach. Metric Learning for Temporal Sequence Alignment. Advances in Neural Information Processing Systems (NIPS), 2014. [pdf]

A. Dieuleveut, F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. Technical report, HAL 01053831, 2014. [pdf]

A. Defazio, F. Bach, S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. Advances in Neural Information Processing Systems (NIPS), 2014. [pdf]

R. Gribonval, R. Jenatton, F. Bach. Sparse and spurious: dictionary learning with noise and outliers. Technical report, HAL 01025503, 2014. [pdf]

E. Grave, G. Obozinski, F. Bach. A Markovian approach to distributional semantics with application to semantic compositionality. Proceedings of the International Conference on Computational Linguistics (COLING), 2014. [pdf]

R. Lajugie, S. Arlot and F. Bach. Large-Margin Metric Learning for Partitioning Problems. Proceedings of the International Conference on Machine Learning (ICML), 2014. [pdf]

N. Shervashidze and F. Bach. Learning to Learn for Structured Sparsity. Technical report, HAL 00986380, 2014. [pdf] [code]

F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(Feb):595-627, 2014. [pdf]

A. d'Aspremont, F. Bach, L. El Ghaoui. Approximation Bounds for Sparse Principal Component Analysis. Mathematical Programming, 2014. [pdf]



2013


F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2013. [FOT website] [pdf] [slides]

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical report, HAL 00831977, 2013. To appear in Advances in Neural Information Processing Systems (NIPS). [pdf] [slides] [IPAM slides]

S. Jegelka, F. Bach, S. Sra. Reflection methods for user-friendly submodular optimization. Technical report, HAL 00905258, 2013. To appear in Advances in Neural Information Processing Systems (NIPS). [pdf]

B. Mishra, G. Meyer, F. Bach, R. Sepulchre. Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 23(4):2124-2149, 2013. [pdf]

F. Fogel, R. Jenatton, F. Bach, A. d'Aspremont. Convex Relaxations for Permutation Problems. Technical report, arXiv:1306.4805, 2013. To appear in Advances in Neural Information Processing Systems (NIPS). [pdf]

M. Schmidt, N. Le Roux, F. Bach. Minimizing Finite Sums with the Stochastic Average Gradient. Technical report, HAL 00860051, 2013. [pdf] [code]

K. S. Sesh Kumar and F. Bach. Maximizing submodular functions using probabilistic graphical models. Technical report, HAL 00860575, 2013. [pdf]

A. Nelakanti, C. Archambeau, J. Mairal, F. Bach, G. Bouchard. Structured Penalties for Log-linear Language Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013. [pdf]

F. Bach. Convex relaxations of structured matrix factorizations. Technical report, HAL 00861118, 2013. [pdf] 

F. Bach. Duality between subgradient and conditional gradient methods. Technical report, HAL 00757696-v3, 2013. [pdf]

Z. Harchaoui, F. Bach, O. Cappe and E. Moulines. Kernel-Based Methods for Hypothesis Testing: A Unified View. IEEE Signal Processing Magazine, 30(4):87-97, 2013. [pdf]

E. Grave, G. Obozinski, F. Bach. Hidden Markov tree models for semantic class induction. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), 2013. [pdf]

E. Richard, F. Bach, and J.-P. Vert. Intersecting singularities for multi-structured estimation. Proceedings of the International Conference on Machine Learning (ICML), 2013. [pdf]

G. Rigaill, T. D. Hocking, F. Bach, and J.-P. Vert. Learning Sparse Penalties for Change-Point Detection using Max Margin Interval Regression. Proceedings of the International Conference on Machine Learning (ICML), 2013. [pdf]

F. Bach. Sharp analysis of low-rank kernel matrix approximations. Technical report, HAL 00723365, 2013. Proceedings of the International Conference on Learning Theory (COLT), 2013. [pdf]


R. Lajugie, S. Arlot and F. Bach. Large-Margin Metric Learning for Partitioning Problems. Technical report, HAL 00796921 , 2013. [pdf]

K. S. Sesh Kumar and F. Bach. Convex Relaxations for Learning Bounded Treewidth Decomposable Graphs. Technical report, HAL 00763921, 2012. Proceedings of the International Conference on Machine Learning (ICML), 2013. [pdf]

N. Le Roux, F. Bach. Local component analysis. Proceedings of the International Conference on Learning Representations (ICLR), 2013. [pdf]


2012

G. Obozinski and F. Bach. Convex Relaxation for Combinatorial Penalties. Technical report, HAL 00694765, 2012. [pdf]

N. Le Roux, M. Schmidt, F. Bach. A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. Technical report, HAL 00674995, 2012. [pdf]


R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, E. Eger, F. Bach, B. Thirion. Multi-scale Mining of fMRI data with Hierarchical Structured Sparsity. SIAM Journal on Imaging Sciences, 5(3):835-856, 2012. [pdf]

F. Bach, S. Lacoste-Julien, G. Obozinski. On the Equivalence between Herding and Conditional Gradient Algorithms. Proceedings of the International Conference on Machine Learning (ICML), 2012. [pdf]

A. Joulin, F. Bach, J. Ponce. Multi-Class Cosegmentation. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [pdf]

F. Bach, R. Jenatton, J. Mairal, G. Obozinski. Structured sparsity through convex optimization. Technical report, HAL 00621245-v2, to appear in Statistical Science, 2012. [pdf]

F. Bach, R. Jenatton, J. Mairal, G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012. [pdf]

J. Mairal, F. Bach, J. Ponce. Task-Driven Dictionary Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791-804, 2012. [pdf]


2011

F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Technical Report HAL 00645271, 2011. Submitted to Foundations and Trends in Machine Learning. [pdf]


C. Archambeau, F. Bach. Multiple Gaussian process models. Technical report, Arxiv 1110.5238, 2011. [pdf]

F. Bach, E. Moulines. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. Advances in Neural Information Processing Systems (NIPS), 2011. [pdf] [long-version-pdf-HAL]

M. Schmidt, N. Le Roux, F. Bach. Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization. Advances in Neural Information Processing Systems (NIPS), 2011. [pdf] [long-version-pdf-HAL]

E. Grave, G. Obozinski, F. Bach. Trace Lasso: a trace norm regularization for correlated designs. Advances in Neural Information Processing Systems (NIPS), 2011. [pdf] [long-version-pdf-HAL]


F. Bach. Shaping Level Sets with Submodular Functions. Advances in Neural Information Processing Systems (NIPS), 2011. [pdf] [long-version-pdf-HAL]

B. Mishra, G. Meyer, F. Bach, R. Sepulchre. Low-rank optimization with trace norm penalty. Technical report, Arxiv 1112.2318, 2011. [pdf]


R. Jenatton, J.-Y. Audibert and F. Bach. Structured Variable Selection with Sparsity-inducing Norms. Journal of Machine Learning Research, 12, 2777-2824, 2011. [pdf] [code]


N. Le Roux, F. Bach. Local component analysis. Technical report, HAL 00617965, 2011. [pdf]

J. Mairal, R. Jenatton, G. Obozinski, F. Bach. Convex and Network Flow Optimization for Structured Sparsity. Journal of Machine Learning Research, 12, 2681-2720, 2011. [pdf]

R. Jenatton, J. Mairal, G. Obozinski, F. Bach. Proximal Methods for Hierarchical Sparse Coding. Journal of Machine Learning Research, 12, 2297-2334, 2011. [pdf]


F. Couzinie-Devy, J. Mairal, F. Bach and J. Ponce. Dictionary Learning for Deblurring and Digital Zoom. Technical report, HAL inria-00627402, 2011. [pdf]

Y-L. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: multi-way local pooling for image recognition. Proceedings of the International Conference on Computer Vision (ICCV), 2011. [pdf]

M. Solnon, S. Arlot, F. Bach. Multi-task Regression using Minimal Penalties. Technical report, HAL 00610534, 2011. [pdf]

A. Lefèvre, F. Bach, C. Févotte. Online algorithms for Nonnegative Matrix Factorization with the Itakura-Saito divergence. Technical report, HAL 00602050, 2011. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011. [pdf]

T. Hocking, A. Joulin, F. Bach and J.-P. Vert. Clusterpath: an Algorithm for Clustering using Convex Fusion Penalties. Proceedings of the International Conference on Machine Learning (ICML), 2011. [pdf]

L. Benoît, J. Mairal, F. Bach, J. Ponce. Sparse Image Representation with Epitomes. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2011. [pdf]

A. Lefèvre, F. Bach, C. Févotte. Itakura-Saito nonnegative matrix factorization with group sparsity. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011. [pdf]

F. Bach, R. Jenatton, J. Mairal and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, S. J. Wright., editors, Optimization for Machine Learning, MIT Press, 2011. [pdf]



2010

F. Bach. Convex Analysis and Optimization with Submodular Functions: a Tutorial. Technical report, HAL 00527714, 2010. [pdf]

F. Bach. Structured Sparsity-Inducing Norms through Submodular Functions. Advances in Neural Information Processing Systems (NIPS), 2010. [pdf] [long version, arxiv] [slides]

J. Mairal, R. Jenatton, G. Obozinski, F. Bach. Network Flow Algorithms for Structured Sparsity. Advances in Neural Information Processing Systems (NIPS), 2010. [pdf]

M. Hoffman, D. Blei, F. Bach. Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems (NIPS), 2010. [pdf]

F. Bach, S. D. Ahipasaoglu, A. d'Aspremont. Convex Relaxations for Subset Selection. Technical report, ArXiv 1006.3601, 2010. [pdf]

R. Jenatton, J. Mairal, G. Obozinski, F. Bach. Proximal Methods for Sparse Hierarchical Dictionary Learning. Proceedings of the International Conference on Machine Learning (ICML), 2010. [pdf] [slides]

M. Journée, F. Bach, P.-A. Absil and R. Sepulchre. Low-Rank Optimization on the Cone of Positive Semidefinite Matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010. [pdf] [code]

Y-L. Boureau, F. Bach, Y. LeCun, J. Ponce. Learning Mid-Level Features For Recognition. Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2010. [pdf]

R. Jenatton, G. Obozinski, F. Bach. Structured Sparse Principal Component Analysis. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010. [pdf] [code]

F. Bach. Self-Concordant Analysis for Logistic Regression. Electronic Journal of Statistics, 4, 384-414, 2010. [pdf]

J. Mairal, F. Bach, J. Ponce, G. Sapiro. Online Learning for Matrix Factorization and Sparse Coding. Journal of Machine Learning Research, 11, 19-60, 2010. [pdf] [code]




Related software


Sparse modeling software - SPAMS (C)

Computing regularization paths for multiple kernel learning - version 1.0 (matlab)

Predictive low-rank decomposition for kernel methods - version 1.0 (matlab/C)

Support Kernel Machine - Multiple kernel learning (matlab)

SimpleMKL - version 1.0 (matlab)

Grouplasso - version 1.0 (matlab)

Hierarchical kernel learning - version 3.0 (matlab)