I am a Research Scientist at Google Research.
Research interests: My research focuses on the design of algorithms for optimization problems, with an emphasis on problems arising in data analysis and machine learning. My goal is to design efficient algorithms and to understand the complexity of these problems. I also have a strong interest in differential privacy, online optimization, learning theory, computational geometry, and fixed-parameter and fine-grained complexity.
My (almost up-to-date) Curriculum Vitae.
Previously: I am on leave from a CNRS researcher position at Sorbonne Université. Before that, I worked at the University of Copenhagen, supported by a Marie Skłodowska-Curie individual fellowship, where I was very lucky to be hosted by Mikkel Thorup. I did my Ph.D. at the Département d'Informatique de l'École normale supérieure, where I was very fortunate to have Claire Mathieu as my advisor.
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2024.
Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as k-means. Given a point set P, a coreset Ω is a small, weighted summary that preserves the cost of all candidate solutions S up to a (1±ε) factor. For k-means in d-dimensional Euclidean space, the cost of solution S is ∑_{p∈P} min_{s∈S} ‖p−s‖². A very popular method for coreset construction, both in theory and practice, is Sensitivity Sampling, where points are sampled in proportion to their importance. We show that Sensitivity Sampling yields optimal coresets of size Õ(k/ε² · min(√k, ε⁻²)) for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with Ω(1) cost stability, Sensitivity Sampling gives coresets of size Õ(k/ε²), improving over the worst-case lower bound. Notably, Sensitivity Sampling does not have to know the cost stability in order to exploit it: it is appropriately sensitive to the clusterability of the data set while being oblivious to it. We also show that any coreset for stable instances consisting only of input points must have size Ω(k/ε²). Our results for Sensitivity Sampling also extend to the k-median problem and to more general metric spaces.
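For intuition, the core sampling template is short: estimate each point's importance from a rough initial solution, sample proportionally, and reweight to keep the cost estimator unbiased. A minimal Python sketch (not the paper's exact sensitivity estimates or constants; `centers` is assumed to come from, e.g., k-means++):

```python
import numpy as np

def sensitivity_sampling_coreset(P, centers, m, z=2, seed=0):
    """Minimal sensitivity-sampling sketch for (k, z)-clustering coresets.

    P: (n, d) points; centers: (k, d) rough solution; m: coreset size.
    Returns (sampled points, weights). Not the paper's exact construction.
    """
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    labels = dists.argmin(axis=1)
    cost = dists.min(axis=1) ** z                     # each point's service cost
    sizes = np.bincount(labels, minlength=len(centers))
    # standard sensitivity upper bound: a point is important if it carries a
    # large share of the cost or lives in a small cluster
    sens = cost / cost.sum() + 1.0 / sizes[labels]
    probs = sens / sens.sum()
    idx = rng.choice(len(P), size=m, p=probs)
    weights = 1.0 / (m * probs[idx])                  # keeps the cost estimator unbiased
    return P[idx], weights
```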
Arxiv - Proceedings of the International Conference on Machine Learning (ICML) 2024.
We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on k-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of k + 1/ε² "typical" elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative (1±ε) factor and an additive ελΦ_k, where Φ_k represents the k-means cost of the input embeddings and λ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
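To make the recipe concrete, here is a hedged sketch of the selection step, reusing the sensitivity-sampling idea above; `select_training_subset` and its defaults are illustrative names, not the paper's implementation, and `embeddings` is the assumed embedding representation:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_training_subset(embeddings, m, k, seed=0):
    """Cluster the embeddings with k-means, then sensitivity-sample m
    'typical' elements; returns indices and importance weights. The
    weighted average loss over the subset then tracks the full-data
    average loss up to the error terms described in the abstract."""
    km = KMeans(n_clusters=k, n_init=4, random_state=seed).fit(embeddings)
    cost = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1) ** 2
    sizes = np.bincount(km.labels_, minlength=k)
    sens = cost / cost.sum() + 1.0 / sizes[km.labels_]
    probs = sens / sens.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings), size=m, p=probs)
    return idx, 1.0 / (m * probs[idx])
```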
Arxiv - Proceedings of the International Conference on Machine Learning (ICML) 2024 -- Spotlight.
We revisit the input perturbations framework for differential privacy, where noise is added to the input A ∈ S and the result is then projected back to the space of admissible datasets S. Through this framework, we first design novel efficient algorithms to privately release pairwise cosine similarities. Second, we derive a novel algorithm to compute k-way marginal queries over features. Prior work could achieve comparable guarantees only for even k. Furthermore, we extend our results to t-sparse datasets, where our efficient algorithms yield novel, stronger guarantees whenever t ≤ n^{5/6}/log n. Finally, we provide a theoretical perspective on why fast input perturbation algorithms work well in practice. The key technical ingredients behind our results are tight sum-of-squares certificates upper bounding the Gaussian complexity of sets of solutions.
Arxiv - Proceedings of the International Conference on Machine Learning (ICML) 2024.
Graph clustering is a central topic in unsupervised learning with a multitude of practical applications. In recent years, multi-view graph clustering has gained a lot of attention for its applicability to real-world instances where one often has access to multiple data sources. In this paper we formalize a new family of models, called multi-view stochastic block models, that capture this setting. For this model, we first study efficient algorithms that naively work on the union of multiple graphs. Then, we introduce a new efficient algorithm that provably outperforms previous approaches by analyzing the structure of each graph separately. Finally, we complement our results with an information-theoretic lower bound studying the limits of what can be done in this model.
Arxiv - Proceedings of the International Conference on Machine Learning (ICML) 2024 -- Spotlight.
We study the classic problem of correlation clustering in dynamic vertex streams. In this setting, vertices are either added or randomly deleted over time, and each vertex pair is connected by a positive or negative edge. The objective is to continuously find a partition which minimizes the sum of positive edges crossing clusters and negative edges within clusters. We present an algorithm that maintains an O(1)-approximation with O(polylog n) amortized update time. Prior to our work, Behnezhad et al. [SODA 2023] achieved a 5-approximation with O(1) expected update time in edge streams, which translates in vertex streams to an O(D) update time, where D is the maximum possible degree. Finally, we complement our theoretical analysis with experiments on real-world data.
Arxiv - Proceedings of the International Conference on Machine Learning (ICML) 2024.
We consider the semi-random graph model of (Makarychev et al., 2012), where, given a random bipartite graph with α edges and an unknown bipartition (A, B) of the vertex set, an adversary can add arbitrary edges inside each community and remove arbitrary edges from the cut (A, B) (i.e., all adversarial changes are monotone with respect to the bipartition). For this model, a polynomial-time algorithm is known to approximate the Balanced Cut problem up to value O(α) (Makarychev et al., 2012) as long as the cut (A, B) has size Ω(α). However, it consists of slow subroutines requiring optimal solutions for logarithmically many semidefinite programs. We study the fine-grained complexity of the problem and present the first near-linear time algorithm that achieves similar performance to that of (Makarychev et al., 2012). Our algorithm runs in time O(|V(G)|^{1+o(1)} + |E(G)|^{1+o(1)}) and finds a balanced cut of value O(α). Our approach appears easily extendible to related problems, such as Sparsest Cut, and also yields a near-linear time O(1)-approximation to Dasgupta's objective function for hierarchical clustering (Dasgupta, 2016) for the semi-random hierarchical stochastic block model inputs of (Cohen-Addad et al., 2019).
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2024.
In the classic Correlation Clustering problem introduced by Bansal, Blum, and Chawla [BBC04], the input is a complete graph where edges are labeled either + or −, and the goal is to find a partition of the vertices that minimizes the sum of the + edges across parts plus the sum of the − edges within parts. In recent years, Chawla, Makarychev, Schramm and Yaroslavtsev [CMSY15] gave a 2.06-approximation by providing a near-optimal rounding of the standard LP, and Cohen-Addad, Lee, Li, and Newman [CLN22, CLLN23] finally bypassed the integrality gap of 2 for this LP, giving a 1.73-approximation for the problem. While introducing new ideas for Correlation Clustering, their algorithm is more complicated than typical approximation algorithms in the following two aspects: (1) it is based on two different relaxations with separate rounding algorithms connected by the round-or-cut procedure; (2) each of the rounding algorithms has to separately handle seemingly inevitable correlated rounding errors, coming from correlated rounding of Sherali-Adams and other strong LP relaxations [GS11, BRS11, RT12]. In order to create a simple and unified framework for Correlation Clustering similar to those for typical approximate optimization tasks, we propose the cluster LP as a strong linear program that might tightly capture the approximability of Correlation Clustering. It unifies all the previous relaxations for the problem. It is exponential-sized, but we show that it can be (1+ε)-approximately solved in polynomial time for any ε > 0, providing the framework for designing rounding algorithms without worrying about correlated rounding errors; these errors are handled uniformly in solving the relaxation. We demonstrate the power of the cluster LP by presenting a simple rounding algorithm, and providing two analyses, one analytically proving a 1.49-approximation and the other solving a factor-revealing SDP to show a 1.437-approximation. Both proofs introduce principled methods by which to analyze the performance of the algorithm, resulting in a significantly improved approximation guarantee. Finally, we prove an integrality gap of 4/3 for the cluster LP, showing that our 1.437 upper bound cannot be drastically improved. Our gap instance directly inspires an improved NP-hardness of approximation with a ratio of 24/23 ≈ 1.042; no explicit hardness ratio was known before.
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2024.
Correlation Clustering is a classic clustering objective arising in numerous machine learning and data mining applications. Given a graph G=(V,E), the goal is to partition the vertex set into clusters so as to minimize the number of edges between clusters plus the number of edges missing within clusters. The problem is APX-hard, and the best known polynomial-time approximation factor is 1.73 by Cohen-Addad, Lee, Li, and Newman [FOCS'23]. They use an LP with |V|^{1/ε^{Θ(1)}} variables for some small ε. However, due to the practical relevance of correlation clustering, there has also been great interest in getting more efficient sequential and parallel algorithms. The classic combinatorial pivot algorithm of Ailon, Charikar and Newman [JACM'08] provides a 3-approximation in linear time. Like most other algorithms discussed here, this uses randomization. Recently, Behnezhad, Charikar, Ma and Tan [FOCS'22] presented a (3+ε)-approximation using a constant number of rounds in the Massively Parallel Computation (MPC) setting. Very recently, Cao, Huang, and Su [SODA'24] provided a 2.4-approximation in a polylogarithmic number of rounds in the MPC model and in Õ(|E|^{1.5}) time in the classic sequential setting. They asked whether it is possible to get a better-than-3 approximation in near-linear time. We resolve this question with an efficient combinatorial algorithm providing a drastically better approximation factor: it achieves a (2−2/13) ≈ 1.846 approximation in sublinear (Õ(|V|)) sequential time and uses only a constant number of rounds in the MPC model.
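For context, the Pivot baseline of Ailon, Charikar and Newman mentioned above fits in a few lines; a sketch, where `is_positive(u, v)` is an assumed oracle returning whether {u, v} is a + edge:

```python
import random

def pivot_correlation_clustering(vertices, is_positive, seed=0):
    """Classic randomized Pivot algorithm: repeatedly pick a random pivot
    and cluster it with all its remaining '+' neighbors. Expected
    3-approximation for complete instances."""
    rng = random.Random(seed)
    remaining = list(vertices)
    clusters = []
    while remaining:
        pivot = remaining.pop(rng.randrange(len(remaining)))
        clusters.append([pivot] + [v for v in remaining if is_positive(pivot, v)])
        remaining = [v for v in remaining if not is_positive(pivot, v)]
    return clusters
```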
Arxiv - Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) 2024.
We present a scalable algorithm for the individually fair (p, k)-clustering problem introduced by Jung, Kannan and Lutz, and Mahabadi and Vakilian. Given n points P in a metric space, let δ(x) for x ∈ P be the radius of the smallest ball around x containing at least n/k points. A clustering is then called individually fair if it has centers within distance δ(x) of x for each x ∈ P. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees had been presented. We design the first fast local-search algorithm that runs in Õ(nk²) time and obtains a bicriteria (O(1), 6) approximation. We then show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
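The fairness radii δ(x) are easy to compute exactly (in quadratic time, unlike the paper's fast algorithm); a small sketch, assuming the ball around x counts x itself:

```python
import numpy as np

def fairness_radii(P, k):
    """delta(x): radius of the smallest ball around each x in P containing
    at least n/k points. O(n^2) reference implementation, not the paper's."""
    n = len(P)
    need = int(np.ceil(n / k))
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    dists.sort(axis=1)              # row i: distances from P[i], ascending (0 first)
    return dists[:, need - 1]       # distance to the (n/k)-th closest point

def fairness_violation(P, centers, delta):
    """Largest factor by which a point's closest center exceeds its delta(x);
    an (O(1), 6)-bicriteria solution keeps this at most 6."""
    d = np.min(np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2), axis=1)
    return float(np.max(d / np.maximum(delta, 1e-12)))
```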
Arxiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2024.
We consider the Low Rank Approximation problem, where the input consists of a matrix A ∈ ℝ^{n_R×n_C} and an integer k, and the goal is to find a matrix B of rank at most k that minimizes ‖A−B‖₀, the number of entries where A and B differ. For any constant k and ε > 0, we present a polynomial-time (1+ε)-approximation algorithm for this problem, which significantly improves the previous best poly(k)-approximation. Our algorithm is obtained by viewing the problem as a Constraint Satisfaction Problem (CSP) where each row and column becomes a variable that can take a value from ℝ^k. In this view, we have a constraint between each row and column, which results in a dense CSP, a well-studied topic in approximation algorithms. While most previous algorithms focus on finite-size (or constant-size) domains and involve an exhaustive enumeration over the entire domain, we present a new framework that bypasses such an enumeration in ℝ^k. We also use tools from the rich literature on Low Rank Approximation with different objectives (e.g., ℓ_p with p ∈ (0,∞)) or domains (e.g., finite fields/generalized Boolean). We believe that our techniques might be useful for studying other real-valued CSPs and matrix optimization problems. On the hardness side, when k is part of the input, we prove that Low Rank Approximation is NP-hard to approximate within a factor of Ω(log n). This is the first superconstant NP-hardness of approximation for any p ∈ [0,∞] that does not rely on stronger conjectures (e.g., the Small Set Expansion Hypothesis).
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2023.
We consider the classic Correlation Clustering problem: given a complete graph where edges are labelled either + or −, the goal is to find a partition of the vertices that minimizes the number of + edges across parts plus the number of − edges within parts. Recently, Cohen-Addad, Lee and Newman [CLN22] presented a 1.994-approximation algorithm for the problem using the Sherali-Adams hierarchy, hence breaking through the integrality gap of 2 for the classic linear program and improving upon the 2.06-approximation of Chawla, Makarychev, Schramm and Yaroslavtsev [CMSY15]. We significantly improve the state of the art by providing a 1.73-approximation for the problem. Our approach introduces a preclustering of Correlation Clustering instances that allows us to essentially ignore the error arising from the correlated rounding used by [CLN22]. This additional power simplifies the previous algorithm and analysis. More importantly, it enables a new set-based rounding that complements the previous roundings. A combination of these two rounding algorithms yields the improved bound.
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2023.
We prove that there is a randomized polynomial-time algorithm that, given an edge-weighted n-vertex graph G excluding a fixed minor Q and an accuracy parameter ε > 0, constructs an edge-weighted graph H and an embedding η: V(G) → V(H) with the following properties: (1) For any constant-size Q, the treewidth of H is polynomial in ε⁻¹, log n, and the logarithm of the stretch of the distance metric in G. (2) The expected multiplicative distortion is (1+ε): for every pair of vertices u, v of G, we have dist_H(η(u), η(v)) ≥ dist_G(u, v) always, and E[dist_H(η(u), η(v))] ≤ (1+ε)·dist_G(u, v). Our embedding is the first to achieve polylogarithmic treewidth of the host graph and comes close to the lower bound by Carroll and Goel, who showed that any embedding of a planar graph with O(1) expected distortion requires the host graph to have treewidth Ω(log n). It also provides a unified framework for obtaining randomized quasi-polynomial-time approximation schemes for a variety of problems, including network design, clustering, or routing problems, in minor-free metrics where the optimization goal is the sum of selected distances. Applications include the capacitated vehicle routing problem and capacitated clustering problems.
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2023.
We consider the classic Euclidean k-median and k-means objectives on data streams, where the goal is to provide a (1+ε)-approximation to the optimal k-median or k-means solution, while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, sensitivity sampling, and so on. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least Ω(log(nΔ)) words of memory, where n is the size of the input and Δ is the aspect ratio. A natural question is whether one can beat this logarithmic dependence on n and Δ. In this paper, we break this barrier by first giving an insertion-only streaming algorithm that achieves a (1+ε)-approximation to the more general (k,z)-clustering problem, using Õ(dk/ε²)·(2^{z log z})·min(1/ε^z, k)·poly(log log(nΔ)) words of memory. Our techniques can also be used to achieve two-pass algorithms for k-median and k-means clustering on dynamic streams using Õ(1/ε²)·poly(d, k, log log(nΔ)) words of memory.
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2023.
In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic k-median and k-means problems, there is no known deterministic dimensionality reduction procedure or coreset construction that avoids an exponential dependency on the input dimension d, the precision parameter ε⁻¹, or k. Furthermore, there is no coreset construction that succeeds with probability 1−1/n and whose size does not depend on the number of input points, n. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov.'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension is Ω(1) for both k-median and k-means, even when allowing a complexity FPT in the number of clusters k. This stands in sharp contrast with the (1+ε)-approximation achievable in that case when allowing randomization. In this paper, we provide deterministic sketch constructions for clustering whose size bounds are close to the best known randomized ones. We also construct a deterministic algorithm for computing a (1+ε)-approximation to k-median and k-means in high-dimensional Euclidean spaces in time 2^{k²/ε^{O(1)}} poly(nd), close to the best randomized complexity. Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling and immediately improves over the recent results of [Braverman et al., FOCS'22] by a factor of k.
Proceedings of the conference on Neural Information Processing Systems (Neurips) 2023 -- Spotlight.
Proceedings of the conference on Neural Information Processing Systems (Neurips) 2023.
Proceedings of the Symposium on Theory of Computing (STOC) 2023.
Proceedings of Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX) 2023.
Proceedings of the Innovations in Theoretical Computer Science Conference (ITCS) 2023.
Proceedings of the Symposium on Simplicity in Algorithms (SOSA) 2023.
Proceedings of the Symposium on Discrete Algorithms (SODA) 2023.
Proceedings of the conference on Neural Information Processing Systems (Neurips) 2022.
Proceedings of the conference on Neural Information Processing Systems (Neurips) 2022.
Proceedings of the conference on Neural Information Processing Systems (Neurips) 2022.
Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2022.
Given a complete graph G = (V, E) where each edge is labeled + or −, the Correlation Clustering problem asks to partition V into clusters to minimize the number of + edges between different clusters plus the number of − edges within the same cluster. Correlation Clustering has been used to model a large number of clustering problems in practice, making it one of the most widely studied clustering formulations. The approximability of Correlation Clustering has been actively investigated, culminating in a 2.06-approximation algorithm by Chawla, Makarychev, Schramm, and Yaroslavtsev, based on rounding the standard LP relaxation. Since the integrality gap for this formulation is 2, it has remained a major open question to determine if the approximation factor of 2 can be reached, or even breached. In this paper, we answer this question affirmatively by showing that there exists a (1.994+ε)-approximation algorithm based on O(1/ε²) rounds of the Sherali-Adams hierarchy. In order to round a solution to the Sherali-Adams relaxation, we adapt the correlated rounding originally developed for CSPs. With this tool, we reach an approximation ratio of 2+ε for Correlation Clustering. To breach this ratio, we go beyond the traditional triangle-based analysis by employing a global charging scheme that amortizes the total cost of the rounding across different triangles.
Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2022.
Given x ∈ ℝ_{≥0}^{([n] choose 2)} recording pairwise distances, the Metric Violation Distance problem asks to compute the ℓ₀ distance between x and the metric cone; i.e., to modify the minimum number of entries of x to make it a metric. Due to its large number of applications in various data analysis and optimization tasks, this problem has been actively studied recently. We present an O(log n)-approximation algorithm for Metric Violation Distance, exponentially improving the previous best approximation ratio of O(OPT^{1/3}) of Fan, Raichel, and Van Buskirk [SODA, 2018]. Furthermore, a major strength of our algorithm is its simplicity and running time. We also study the related problem of Ultrametric Violation Distance, where the goal is to compute the ℓ₀ distance to the cone of ultrametrics, and achieve a constant-factor approximation algorithm. The Ultrametric Violation Distance problem can be regarded as an extension of the problem of fitting ultrametrics studied by Ailon and Charikar [SIAM J. Computing, 2011] and by Cohen-Addad, Das, Kipouridis, Parotsidis, and Thorup [FOCS, 2021] from the ℓ₁ norm to the ℓ₀ norm. We show that this problem can be favorably interpreted as an instance of Correlation Clustering with an additional hierarchical structure, which we solve using a new O(1)-approximation algorithm for correlation clustering that has the structural property that it outputs a refinement of the optimum clusters. An algorithm satisfying such a property can be considered of independent interest. We also provide an O(log n log log n)-approximation algorithm for weighted instances. Finally, we investigate the complementary version of these problems, where one aims at choosing a maximum number of entries of x forming an (ultra-)metric. In stark contrast with the minimization versions, we prove that these maximization versions are hard to approximate within any constant factor assuming the Unique Games Conjecture.
Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2022.
Motivated by practical generalizations of the classic k-median and k-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This enables us to use uniform sampling, in contrast to the widely used importance sampling, when constructing our coreset, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, we show that this simpler sampling scheme can yield bounds that are independent of n, the number of input points. Our technique yields better coreset bounds, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under the Fréchet and Hausdorff distances. Finally, our technique also yields improved coresets for 1-median in low-dimensional Euclidean spaces, specifically of size Õ(ε^{−1.5}) in ℝ² and of size Õ(ε^{−1.6}) in ℝ³.
Arxiv - Proceedings of the conference on Knowledge Discovery and Data Mining (KDD) 2022.
We study the private k-median and k-means clustering problem in d-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most O(d^{3/2} log n)·OPT + O(k d² log² n / ε²), where ε is the privacy guarantee. (The dimension term d can be replaced with O(log k) using standard dimension-reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, Õ(nkd), time, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
ICML proceedings - Proceedings of the International Conference on Machine Learning (ICML) 2022.
We consider k-means clustering of n data points in Euclidean space in the Massively Parallel Computation (MPC) model, an abstraction of modern massively parallel computing systems such as MapReduce. Recent work provides evidence that obtaining an O(1)-approximate k-means solution for general input points using o(log n) rounds in the MPC model may be impossible under certain conditions [Ghaffari, Kuhn & Uitto '2019]. However, real-world data points usually have better structure. One instance of interest is the set of data points which is perturbation resilient [Bilu & Linial '2010]. In particular, a point set is α-perturbation resilient for k-means if perturbing pairwise distances by multiplicative factors in the range [1,α] does not change the optimum k-means clusters. We bypass the worst-case lower bound by considering perturbation-resilient input points and giving o(log n)-round k-means clustering algorithms for these instances in the MPC model. Specifically, we show a fully scalable (1+ε)-approximate k-means clustering algorithm for O(α)-perturbation-resilient instances in the MPC model using O(1) rounds and O_{ε,d}(n^{1+1/α²+o(1)}) total space. If the space per machine is sufficiently larger than k, i.e., at least k·n^{Ω(1)}, we also develop an optimal k-means clustering algorithm for O(α)-perturbation-resilient instances in MPC using O(1) rounds and O_d(n^{1+o(1)}·(n^{1/α²}+k)) total space.
ICML proceedings - Proceedings of the International Conference on Machine Learning (ICML) 2022.
In the correlation clustering problem, the input is a signed graph where the sign indicates whether each pair of points should be placed in the same cluster or not. The goal is to compute a clustering which minimizes the number of disagreements with these recommendations. Thanks to its many practical applications, correlation clustering is a fundamental unsupervised learning problem and has been extensively studied in many different settings. In this paper, we study the problem in the classic online setting with recourse: the vertices of the graph arrive in an online manner, and the goal is to maintain an approximate clustering while minimizing the number of times each vertex changes cluster. Our main contribution is an algorithm that achieves logarithmic recourse per vertex in the worst case. We also complement this result with a tight lower bound. Finally, we show experimentally that our algorithm achieves better performance than state-of-the-art algorithms on real-world data.
COLT proceedings - Proceedings of the Conference on Learning Theory (COLT) 2022.
We consider the problem of recovering communities in a random directed graph with planted communities. To model real-world directed graphs such as the Twitter or Instagram graphs, which exhibit very heterogeneous degree sequences, we introduce the Degree-Heterogeneous Stochastic Block Model (DHSBM), a generalization of the classic Stochastic Block Model (SBM), where the vertex set is partitioned into communities and each vertex u has two (unknown) associated probabilities, p_u and q_u, with p_u > q_u. An arc from u to v is generated with probability p_u if u and v are in the same community and with probability q_u otherwise. Given a graph generated from this model, the goal is to retrieve the communities. The DHSBM makes it possible to generate graphs with planted communities while allowing heterogeneous degree distributions, a quite important feature of real-world networks. In the case where there are two communities, we present an iterative greedy linear-time algorithm that recovers them whenever min_u (p_u − q_u)/√p_u = Ω(√(log n / n)). We also show that, up to a constant, this condition is necessary. Our results also extend to the standard (undirected) SBM, where p_u = p and q_u = q for all nodes u. Ours is the first linear-time algorithm that exactly recovers the communities at the asymptotic information-theoretic threshold, improving over previous near-linear time spectral approaches.
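The generative model itself is simple to implement, which is handy for experiments; a sketch directly following the definition in the abstract (two communities, per-vertex probabilities p_u > q_u):

```python
import numpy as np

def sample_dhsbm(community, p, q, seed=0):
    """Sample a directed DHSBM graph: arc (u, v) appears with probability
    p[u] if u and v share a community and q[u] otherwise.

    community: length-n array of 0/1 labels; p, q: length-n probability
    arrays with p > q entrywise. Returns a dense adjacency matrix."""
    rng = np.random.default_rng(seed)
    n = len(community)
    same = community[:, None] == community[None, :]
    prob = np.where(same, p[:, None], q[:, None])   # row u uses p_u or q_u
    adj = rng.random((n, n)) < prob
    np.fill_diagonal(adj, False)                    # no self-loops
    return adj.astype(int)
```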
Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC) 2022.
Graph clustering is one of the most basic and popular unsupervised learning problems. Among the different formulations of the problem, the modularity objective has been particularly successful in helping design impactful algorithms; most notably, the Louvain algorithm has become one of the most widely used algorithms for clustering graphs. Yet, one major limitation of the Louvain algorithm is its sequential nature, which makes it impractical in distributed environments and on massive datasets. In this paper, we provide a parallel version of Louvain which works in the massively parallel computation (MPC) model. We show that it recovers the ground-truth clusters of the classic stochastic block model in only a constant number of parallel rounds, and does so for a wider regime of parameters than the standard Louvain algorithm, as shown recently in Cohen-Addad et al. [2020].
Arxiv - Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP) 2022.
We study several questions related to diversifying search results. We give improved approximation algorithms for each of the following problems, together with some lower bounds. 1) We give a polynomial-time approximation scheme (PTAS) for a diversified search ranking problem [Nikhil Bansal et al., 2010] whose objective is to minimize the discounted cumulative gain. Our PTAS runs in time n^{2^{O(log(1/ε)/ε)}} · m^{O(1)}, where n denotes the number of elements in the databases and m denotes the number of constraints. Complementing this result, we show that no PTAS can run in time f(ε) · (nm)^{2^{o(1/ε)}} assuming Gap-ETH; therefore, our running time is nearly tight. Both our upper and lower bounds answer open questions from [Nikhil Bansal et al., 2010]. 2) We next consider the Max-Sum Dispersion problem, whose objective is to select k out of n elements from a database that maximize the dispersion, defined as the sum of the pairwise distances under a given metric. We give a quasipolynomial-time approximation scheme (QPTAS) for the problem which runs in time n^{O_ε(log n)}. This improves upon previously known polynomial-time algorithms with approximation ratio 0.5 [Refael Hassin et al., 1997; Allan Borodin et al., 2017]. Furthermore, we observe that reductions from previous work rule out approximation schemes that run in n^{õ_ε(log n)} time assuming ETH. 3) Finally, we consider a generalization of Max-Sum Dispersion called Max-Sum Diversification. In addition to the sum of pairwise distances, the objective also includes another function f. For monotone submodular f, we give a quasipolynomial-time algorithm with approximation ratio arbitrarily close to (1−1/e). This improves upon the best polynomial-time algorithm, which has approximation ratio 0.5 [Allan Borodin et al., 2017]. Furthermore, the (1−1/e) factor is also tight, as achieving a better-than-(1−1/e) approximation is NP-hard [Uriel Feige, 1998].
Proceedings of the Symposium on Theory of Computing (STOC) 2022.
Since the mid 90s, the study of the complexity of classic network design problems such as the traveling salesman problem (TSP), the Steiner tree problem (ST), or the k-MST problem on metric spaces such as low-dimensional Euclidean spaces, doubling metrics, and planar or minor-free graphs has led to major improvements of our understanding of the structure of both these important metric spaces and the underlying problems. In their celebrated work, Arora and Mitchell gave quasi-polynomial-time approximation schemes (QPTAS) for several network design problems in Euclidean space, which were later improved to polynomial and (near-)linear time in some cases. Arora and Mitchell's QPTAS result has been extended by Talwar [STOC'04] to doubling metrics, showing that a Euclidean embedding is not a necessary condition for designing approximation schemes for network design problems. In the case of planar and, more broadly, minor-free graphs, the picture is much less satisfactory. For example, nothing better than the general case is known for k-MST, an open question that has thus been repeatedly asked. Even when approximation schemes are known for the planar case, generalizing them to the minor-free setting is often a major challenge because most of the topological properties are lost. Therefore, one of the most important questions of the field is whether we can bypass these topological arguments to obtain a general framework for network design problems in minor-free graphs, similar to that of Arora, Mitchell and Talwar. In this paper, we answer this question in the affirmative by proposing such a framework. We give the first approximation schemes, with quasi-polynomial running time, for k-MST, Steiner tree, and Steiner forest in minor-free graphs. The result on k-MST was not known before even for the planar case, and obtaining approximation schemes for Steiner tree on minor-free graphs has been a major open problem.
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2022.
Motivated by data analysis and machine learning applications, we consider the popular high-dimensional Euclidean k-median and k-means problems. We propose a new primal-dual algorithm, inspired by the classic algorithm of Jain and Vazirani and the recent algorithm of Ahmadian, Norouzi-Fard, Svensson, and Ward. Our algorithm achieves an approximation ratio of 2.408 and 5.957 for Euclidean k-median and k-means, respectively, improving upon the 2.633 approximation ratio of Ahmadian et al. and the 6.1291 approximation ratio of Grandoni, Ostrovsky, Rabani, Schulman, and Venkat. Our techniques involve a much stronger exploitation of the Euclidean metric than previous work on Euclidean clustering. In addition, we introduce a new method of removing excess centers using a variant of independent sets over graphs that we dub a "nested quasi-independent set". In turn, this technique may be of interest for other optimization problems in Euclidean and ℓ_p metric spaces.
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2022.
Given a set of points in a metric space, the (k, z)-clustering problem consists of finding a set of k points called centers, such that the sum of distances raised to the power z of every data point to its closest center is minimized. Special cases include the famous k-median problem (z = 1) and k-means problem (z = 2). The k-median and k-means problems are at the heart of modern data analysis, and massive data applications have given rise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative (1±ε) factor, hence reducing the input to the problem from large to small scale. While there has been an intensive effort to understand the best possible coreset size for both problems in various metric spaces, there is still a significant gap between the state-of-the-art upper and lower bounds. In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely: In finite n-point general metrics, any coreset must consist of Ω(k/ε² · log n) points. This improves on the Ω(k/ε · log n) lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds proposed for k-median by Feldman and Langberg [STOC'11] and for k-means by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog(1/ε) factors. The dependency on k, n is therefore optimal. For doubling metrics with doubling constant D, any coreset must consist of Ω(k/ε² · D) points. This matches the k-median and k-means upper bounds of Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog(1/ε) factors. The dependency on k, D is therefore optimal. In d-dimensional Euclidean space, any coreset for (k, z)-clustering requires Ω(k/ε²) points. This improves on the Ω(k/ε) lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20] for k-median and complements the Ω(k·min(d, 2^{z/20})) lower bound of Huang and Vishnoi [STOC'20]. We complement our lower bound for d-dimensional Euclidean space with the construction of a coreset of size Õ(k/ε² · min(ε^{−z}, k)). This improves over the Õ(k² ε^{−4}) upper bound for general powers z proposed by Braverman, Jiang, Krauthgamer, and Wu [SODA'21] and over the Õ(k/ε⁴) upper bound for k-median by Huang and Vishnoi [STOC'20]. In fact, ours is the first construction breaking through the ε^{−2}·min(d, ε^{−2}) barrier inherent in all previous coreset constructions. To do this, we employ a novel chaining-based analysis that may be of independent interest. Together, our upper and lower bounds for k-median in Euclidean spaces are tight up to a factor O(ε^{−1} polylog(k/ε)).
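As a reminder of what a coreset guarantees, the sketch below evaluates a weighted summary against random candidate solutions; since a true ε-coreset must hold for every solution, this is only a sanity check, not a certificate:

```python
import numpy as np

def clustering_cost(points, centers, z, weights=None):
    """Weighted (k, z)-clustering cost of `points` w.r.t. `centers`."""
    d = np.min(np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1)
    w = np.ones(len(points)) if weights is None else weights
    return float(np.dot(w, d ** z))

def empirical_coreset_error(P, core_pts, core_wts, k, z, trials=100, seed=0):
    """Largest relative cost deviation over random k-subsets of P used as
    candidate solutions."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        S = P[rng.choice(len(P), size=k, replace=False)]
        full = clustering_cost(P, S, z)
        small = clustering_cost(core_pts, S, z, core_wts)
        worst = max(worst, abs(small - full) / full)
    return worst
```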
AISTATS proceedings - Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) 2022.
We study the facility location problem under the constraints imposed by local differential privacy (LDP). Recently, Gupta et al. (2010) and Esencayi et al. (2019) proposed lower and upper bounds for the problem in the central differential privacy (DP) model, where a curator first collects all data before processing it. In this paper, we focus on the LDP model, where we protect a client's participation in the facility location instance. Under the HST metric, we show that there is a non-interactive ε-LDP algorithm achieving an O(n^{1/4}/ε²)-approximation ratio, where n is the size of the metric. On the negative side, we show a lower bound of Ω(n^{1/4}/√ε) on the approximation ratio of any non-interactive ε-LDP algorithm. Thus, our results are tight up to a polynomial factor of ε. Moreover, unlike previous results, our results generalize to non-uniform facility costs.
Arxiv - Proceedings of the Conference on Integer Programming and Combinatorial Optimization (IPCO) 2022.
In the non-uniform sparsest cut problem, we are given a supply graph G and a demand graph D, both with the same set of nodes V. The goal is to find a cut of V that minimizes the ratio of the total capacity on the edges of G crossing the cut over the total demand of the crossing edges of D. In this work, we study the non-uniform sparsest cut problem for supply graphs with bounded treewidth k. For this case, Gupta, Talwar and Witmer [STOC 2013] obtained a 2-approximation with polynomial running time for fixed k, and the question of whether there exists a c-approximation algorithm for a constant c independent of k, that runs in FPT time, remained open. We answer this question in the affirmative. We design a 2-approximation algorithm for the non-uniform sparsest cut with bounded treewidth supply graphs that runs in FPT time, when parameterized by the treewidth. Our algorithm is based on rounding the optimal solution of a linear programming relaxation inspired by the Sherali-Adams hierarchy. In contrast to the classic Sherali-Adams approach, we construct a relaxation driven by a tree decomposition of the supply graph by including a carefully chosen set of lifting variables and constraints to encode information of subsets of nodes with super-constant size, and at the same time we have a sufficiently small linear program that can be solved in FPT time.
Arxiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2022.
We present a new local-search algorithm for the k-median clustering problem. We show that local optima for this algorithm give a (2.836+ε)-approximation; our result improves upon the (3+ε)-approximate local-search algorithm of [Arya et al., STOC'01]. Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximation guarantee for the problem. The new ingredient in our algorithm is the use of a potential function based on both the closest and second-closest facilities to each client. Specifically, the potential is the sum, over all clients, of the distance of the client to its closest facility, plus (a small constant times) the truncated distance to its second-closest facility. We move from one solution to another only if the latter can be obtained by swapping a constant number of facilities and has a smaller potential than the former. This refined potential allows us to avoid the bad local optima given by Arya et al. for the local-search algorithm based only on the cost of the solution.
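A toy rendition of the refined potential and the swap rule (the exact truncation rule and constant in the paper differ; the cap below is a placeholder guess):

```python
import numpy as np

def potential(P, F, c=0.01):
    """Sum over clients of the distance to the closest facility plus c times
    a truncated distance to the second-closest facility.

    P: (n, d) clients; F: (k, d) facilities, k >= 2. The truncation at
    3x the closest distance is an illustrative guess, not the paper's rule."""
    d = np.sort(np.linalg.norm(P[:, None, :] - F[None, :, :], axis=2), axis=1)
    closest, second = d[:, 0], d[:, 1]
    return float(np.sum(closest + c * np.minimum(second, 3 * closest)))

def local_search(P, candidates, k, c=0.01, seed=0):
    """Single-swap local search guided by the potential above: accept a
    swap only if it strictly decreases the potential."""
    rng = np.random.default_rng(seed)
    F = candidates[rng.choice(len(candidates), size=k, replace=False)]
    best, improved = potential(P, F, c), True
    while improved:
        improved = False
        for i in range(k):
            for cand in candidates:
                trial = F.copy()
                trial[i] = cand
                val = potential(P, trial, c)
                if val < best - 1e-9:
                    F, best, improved = trial, val, True
    return F
```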
Arxiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2022.
k-median and k-means are the two most popular objectives for clustering algorithms. Despite intensive effort, a good understanding of the approximability of these objectives, particularly in ℓ_p-metrics, remains a major open problem. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for these objectives in ℓ_p-metrics.
We introduce a new hypothesis called the Johnson Coverage Hypothesis (JCH), which roughly asserts that the well-studied Max k-Coverage problem on set systems is hard to approximate to a factor better than (1−1/e), even when the membership graph of the set system is a subgraph of the Johnson graph. We then show that, together with generalizations of the embedding techniques introduced by Cohen-Addad and Karthik (FOCS '19), JCH implies hardness of approximation results for k-median and k-means in ℓ_p-metrics for factors which are close to the ones obtained for general metrics. In particular, assuming JCH we show that it is hard to approximate the k-means objective:
Discrete case: To a factor of 3.94 in the ℓ1-metric and to a factor of 1.73 in the ℓ2-metric; this improves upon the previous factors of 1.56 and 1.17, respectively, obtained under the Unique Games Conjecture (UGC).
Continuous case: To a factor of 2.10 in the ℓ1-metric and to a factor of 1.36 in the ℓ2-metric; this improves upon the previous factor of 1.07 in the ℓ2-metric obtained under UGC (and to the best of our knowledge, the continuous case of k-means in the ℓ1-metric had not previously been analyzed in the literature).
We also obtain similar improvements under JCH for the k-median objective.
Additionally, we prove a weak version of JCH using the work of Dinur et al. (SICOMP '05) on Hypergraph Vertex Cover, and recover all the results stated above of Cohen-Addad and Karthik (FOCS '19) to (nearly) the same inapproximability factors but now under the standard NP≠P assumption (instead of UGC).
Finally, we establish a strong connection between JCH and the long-standing open problem of determining the Hypergraph Turán number. We then use this connection to prove improved SDP gaps (over the existing factors in the literature) for the k-means and k-median objectives.
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2021.
We consider the numerical taxonomy problem of fitting a positive distance function 𝒟, defined on pairs of points of a set S, by a tree metric. We want a tree T with positive edge weights and including S among its vertices, so that their distances in T match those in 𝒟. A nice application is in evolutionary biology, where the tree T aims to approximate the branching process leading to the observed distances in 𝒟 [Cavalli-Sforza and Edwards 1967]. We consider the total error, that is, the sum of distance errors over all pairs of points. We present a deterministic polynomial-time algorithm minimizing the total error within a constant factor. We can do this both for general trees and for the special case of ultrametrics with a root having the same distance to all vertices in S. The problems are APX-hard, so a constant factor is the best we can hope for in polynomial time. The best previous approximation factor was O((log n)(log log n)) by Ailon and Charikar [2005], who wrote: "Determining whether an O(1) approximation can be obtained is a fascinating question."
Neurips Proceedings - Proceedings of the conference on Neural Information Processing Systems (Neurips) 2021.
As a fundamental unsupervised learning task, hierarchical clustering has been extensively studied in the past decade. In particular, standard metric formulations such as hierarchical k-center, k-means, and k-median have received a lot of attention, and these problems have been studied extensively in different models of computation. Despite all this interest, not many efficient parallel algorithms are known for these problems. In this paper, we introduce a new parallel algorithm for the Euclidean hierarchical k-median problem that, when using machines with memory s (for s ∈ Ω(log²(n+Δ+d))), outputs a hierarchical clustering such that for every fixed value of k, the cost of the solution is at most an O(min{d, log n}·log Δ) factor larger in expectation than that of an optimal solution. Furthermore, we also get that for all k simultaneously, the cost of the solution is at most an O(min{d, log n}·log Δ·log(Δdn)) factor bigger than that of the corresponding optimal solution. The algorithm runs in O(log_s(nd·log(n+Δ))) rounds. Here d is the dimension of the data set and Δ is the ratio between the maximum and minimum distance between two points in the input dataset. To the best of our knowledge, this is the first parallel algorithm for the hierarchical k-median problem with theoretical guarantees. We further complement our theoretical results with an empirical study of our algorithm that shows its effectiveness in practice.
Neurips Proceedings - Proceedings of the conference on Neural Information Processing Systems (Neurips) 2021 -- Spotlight.
In this paper, we consider the problem of finding high-dimensional power means: given a set A of n points in ℝ^d, find the point m that minimizes the sum of Euclidean distances, raised to the power z, over all input points. Special cases of this problem include the well-known Fermat-Weber problem -- or geometric median problem -- where z = 1, the mean or centroid, where z = 2, and the Minimum Enclosing Ball problem, where z = ∞. We consider these problems in the big-data regime. Here, we are interested in sampling as few points as possible such that we can accurately estimate m. More specifically, we consider sublinear algorithms as well as coresets for these problems. Sublinear algorithms have random query access to A, and the goal is to minimize the number of queries. Here, we show that Õ(ε^{−z−3}) samples are sufficient to achieve a (1+ε) approximation, generalizing the results from Cohen, Lee, Miller, Pachocki, and Sidford [STOC'16] and Inaba, Katoh, and Imai [SoCG'94] to arbitrary z. Moreover, we show that this bound is nearly optimal, as any algorithm requires at least Ω(ε^{−z+1}) queries to achieve said approximation. Our second contribution is coresets for these problems, where we aim to find a small, weighted subset of the points which approximates the cost of every candidate point c ∈ ℝ^d up to a (1±ε) factor. Here, we show that Õ(ε^{−2}) points are sufficient, improving on the Õ(dε^{−2}) bound by Feldman and Langberg [STOC'11] and the Õ(ε^{−4}) bound by Braverman, Jiang, Krauthgamer, and Wu [SODA'21].
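The sublinear estimator underlying the first result is conceptually just uniform subsampling; a sketch for estimating the cost of one candidate point (sample sizes and guarantees are as stated in the abstract, not re-derived here):

```python
import numpy as np

def estimate_power_mean_cost(A, c, z, num_samples, seed=0):
    """Estimate sum_{a in A} ||a - c||^z from uniform random queries to A;
    the work depends on num_samples, not on n = len(A)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(A), size=num_samples)      # random query access
    d = np.linalg.norm(A[idx] - c, axis=1) ** z
    return float(len(A) * d.mean())                 # rescale to the full sum
```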
PLoS open access - PLoS Computational Biology (2021).
Press coverage (in French): Europe1, L'Express, Sud Ouest.
The COVID-19 epidemic has forced most countries to impose contact-limiting restrictions at workplaces, universities, schools, and more broadly in our societies. Yet, the effectiveness of these unprecedented interventions in containing the virus spread remains largely unquantified. Here, we develop a simulation study to analyze COVID-19 outbreaks on three real-life contact networks stemming from a workplace, a primary school, and a high school in France. Our study provides a fine-grained analysis of the impact of contact-limiting strategies at workplaces, schools, and high schools, including: (1) rotating strategies, in which workers are evenly split into two shifts that alternate on a daily or weekly basis; and (2) on-off strategies, where the whole group alternates periods of normal work interactions with complete telecommuting. We model epidemic spread in these different setups using a stochastic discrete-time agent-based transmission model that includes the coronavirus's most salient features: super-spreaders, infectious asymptomatic individuals, and pre-symptomatic infectious periods. Our study yields clear results: the ranking of the strategies, based on their ability to mitigate epidemic propagation in the network from a first index case, is the same for all network topologies (workplace, primary school, and high school). Namely, from best to worst: rotating week-by-week, rotating day-by-day, on-off week-by-week, and on-off day-by-day. Moreover, our results show that below a certain threshold for the original local reproduction number within the network (< 1.52 for primary schools, < 1.30 for the workplace, < 1.38 for the high school, and < 1.55 for the random graph), all four strategies efficiently control the outbreak by decreasing the effective local reproduction number to < 1. These results can provide guidance for public health decisions related to telecommuting.
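The four strategies compared in the study reduce to simple presence schedules; a sketch (splitting workers into shifts by parity is an arbitrary choice made here for illustration):

```python
def at_work(worker, day, strategy):
    """Is `worker` on site on `day` under the given contact-limiting strategy?

    'rotating-*': two shifts alternate; 'onoff-*': everyone alternates
    between on-site periods and full telecommuting. Weeks last 7 days."""
    shift = worker % 2
    if strategy == "rotating-day":
        return day % 2 == shift
    if strategy == "rotating-week":
        return (day // 7) % 2 == shift
    if strategy == "onoff-day":
        return day % 2 == 0
    if strategy == "onoff-week":
        return (day // 7) % 2 == 0
    raise ValueError(strategy)
```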
ICML Proceedings - Proceedings of the International Conference on Machine Learning (ICML) 2021.
To tackle the curse of dimensionality in data analysis and unsupervised learning, it is critical to be able to efficiently compute "simple" faithful representations of the data that help extract information and improve understanding and visualization of its structure. When the dataset consists of d-dimensional vectors, simple representations of the data may consist of trees or ultrametrics, and the goal is to best preserve the distances (i.e., dissimilarity values) between data elements. To circumvent the quadratic running times of the most popular methods for fitting ultrametrics, such as average, single, or complete linkage, [CKL20] recently presented a new algorithm that, for any c ≥ 1, outputs in time n^{1+O(1/c²)} an ultrametric Δ such that for any two points u, v, Δ(u,v) is within a multiplicative factor of 5c of the distance between u and v in the "best" ultrametric representation. We improve this guarantee from 5c to √2·c+ε while achieving the same asymptotic running time. To complement the improved theoretical bound, we additionally show that the performance of our algorithm is significantly better on various real-world datasets.
Arxiv - Proceedings of the International Conference on Machine Learning (ICML) 2021 -- Long Talk.
Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental scalability evaluation of our techniques.
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2021.
Given a metric space, the (k,z)-clustering problem consists of finding k centers such that the sum of the distances, raised to the power z, of every point to its closest center is minimized. This encapsulates the famous k-median (z=1) and k-means (z=2) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as coresets, with minimal dependency on the number of clusters k, the number of input points n, or the underlying dimension, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean space, doubling metrics, and minor-free metrics to the general metric case. With Γ = min(ε^{−2} + ε^{−z}, k·ε^{−2})·polylog(ε^{−1}), this framework constructs coresets of size: O(Γ·k·(d + log k)) in doubling metrics of dimension d, improving upon the recent breakthrough of [Huang, Jiang, Li, Wu, FOCS'18], who presented a coreset of size O(k³d/ε²); O(Γ·k·min(d, ε^{−2} log k)) in d-dimensional Euclidean space, improving on the recent results of [Huang, Vishnoi, STOC'20], who presented a coreset of size O(k log k · ε^{−2z} · min(d, ε^{−2} log k)); O(Γ·k·(t + log k)) for graphs with treewidth t, improving on [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML'20], who presented a coreset of size O(k³t/ε²) for z=1 only; O(Γ·k·(log² k + (log k)/ε⁴)) for shortest-path metrics of graphs excluding a fixed minor, improving on [Braverman, Jiang, Krauthgamer, Wu, SODA'21], who presented a coreset of size O(k³/ε⁴) -- hence an improvement of Ω(k²/polylog(ε^{−1})); and O(Γ·k·log n) in general discrete metric spaces, improving on the results of [Feldman, Langberg, STOC'11], who presented a coreset of size O(k·ε^{−2z}·log n·log k) -- hence an improvement of Ω(log k/(ε^{z−1}·polylog(ε^{−1}))). A lower bound of Ω(k·log n/ε) for k-median in general metric spaces [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML'20] implies that in general metrics, as well as in metrics with doubling dimension d, our bounds are optimal up to a poly(log(1/ε)/ε) factor. For graphs with treewidth t, the lower bound of Ω(kt/ε) of [Baker, Braverman, Huang, Jiang, Krauthgamer, Wu, ICML'20] shows that our bounds are optimal up to the same factor.
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2021.
Click to show the abstract.The (non-uniform) sparsest cut problem is the following classical graph-partitioning problem: given a “supply” graph and demands on pairs of vertices, delete some subset of supply edges to minimize the ratio of the supply edges cut to the total demand of the pairs separated by this deletion. The sparsest cut problem has been a driver for new algorithmic approaches. Despite much effort, there are only a handful of nontrivial classes of supply graphs for which constant-factor approximations are known. We consider the problem for planar graphs, and give a (2 + ε)-approximation algorithm that runs in quasipolynomial time. Our approach defines a new structural decomposition of an optimal solution using a “radial cuts” primitive. We combine this decomposition with a Sherali-Adams-style linear programming relaxation of the problem, which we then round. This should be compared with the polynomial-time approximation algorithm of Rao (1999), which uses the metric linear programming relaxation and l_1-embeddings, and achieves an O(√(log n))-approximation in polynomial time.
Arxiv - Proceedings of the Symposium on Foundations of Responsible Computing (FORC) 2021.
Click to show the abstract.Redistricting is the problem of dividing a state into a number k of regions, called districts. Voters in each district elect a representative. The primary criteria are: each district is connected, district populations are equal (or nearly equal), and districts are "compact". There are multiple competing definitions of compactness, usually minimizing some quantity. One measure that has been recently promoted by Duchin and others is the number of cut edges. In redistricting, one is given atomic regions out of which each district must be built, along with the population of each atomic region. Consider the graph with one vertex per atomic region (with weight equal to the region's population) and an edge between atomic regions that share a boundary. A districting plan is a partition of vertices into k parts, each connected, of nearly equal weight. The districts are considered compact to the extent that the plan minimizes the number of edges crossing between different parts. Consider two problems: find the most compact districting plan, and sample districting plans under a compactness constraint uniformly at random. Both problems are NP-hard, so we restrict the input graph to have branchwidth at most w. (A planar graph's branchwidth is bounded by its diameter.) If both k and w are bounded by constants, the problems are solvable in polynomial time. Assume vertices have weight 1. One would like algorithms whose running times are of the form O(f(k,w)·n^c) for some constant c independent of k and w, in which case the problems are said to be fixed-parameter tractable with respect to k and w. We show that, under a complexity-theoretic assumption, no such algorithms exist. However, we do give algorithms with running time O(c^w n^{k+1}). Thus if the diameter of the graph is moderately small and the number of districts is very small, our algorithm is usable.
Arxiv - Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) 2021.
Click to show the abstract.We study the problem of learning a clustering of an online set of points. The specific formulation we use is the k-means objective: at each time step the algorithm has to maintain a set of k candidate centers, and the loss incurred is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the k-means objective in hindsight. We show that, provided the data lies in a bounded region, learning is possible: an implementation of the Multiplicative Weights Update Algorithm (MWUA) over a discretized grid achieves a regret bound of Õ(√T) in expectation. We also present an online-to-offline reduction showing that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centers at each round) implies an offline efficient algorithm for the k-means problem, which is known to be NP-hard. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to (1 + ε)·OPT, and present a no-regret algorithm with runtime O(T·poly(log T, k, d, 1/ε)^{O(kd)}). Our algorithm is based on maintaining a set of points of bounded size, a coreset that helps identify the relevant regions of the space for running an adaptive, more efficient variant of the MWUA. We show that simpler online algorithms, such as Follow The Leader (FTL), fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data. Our theoretical results answer an open question of Dasgupta (2008).
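A minimal sketch of the grid-based MWUA idea, specialized to k = 1 and the unit square for brevity (the paper's algorithm maintains k centers and uses a coreset to focus the grid adaptively, so this is only an illustration): each grid cell is an expert proposing itself as the center, and the algorithm samples a center from the exponential-weights distribution before seeing the next point.

```python
import numpy as np

def online_center_mwua(stream, grid_step=0.05, eta=0.5,
                       rng=np.random.default_rng(0)):
    xs = np.arange(0, 1, grid_step) + grid_step / 2
    grid = np.stack(np.meshgrid(xs, xs), -1).reshape(-1, 2)  # candidate centers
    weights = np.ones(len(grid))
    losses_incurred = []
    for point in stream:
        probs = weights / weights.sum()
        center = grid[rng.choice(len(grid), p=probs)]  # commit before the loss
        losses_incurred.append(((center - point) ** 2).sum())
        losses = ((grid - point) ** 2).sum(1)          # loss of every expert
        weights *= np.exp(-eta * losses)               # multiplicative update
        weights /= weights.max()                       # rescale, avoid underflow
    return np.array(losses_incurred)
```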
Arxiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2021.
Click to show the abstract.The k-means objective is arguably the most widely-used cost function for modeling clustering tasks in a metric space. In practice and historically, k-means is thought of in a continuous setting, namely where the centers can be located anywhere in the metric space. For example, the popular Lloyd's heuristic locates a center at the mean of each cluster. Despite persistent efforts on understanding the approximability of k-means, and other classic clustering problems such as k-median and k-minsum, our knowledge of the hardness of approximation factors of these problems remains quite poor. In this paper, we significantly improve upon the hardness of approximation factors known in the literature for these objectives. We show that if the input lies in a general metric space, it is NP-hard to approximate: ∙ Continuous k-median to a factor of 2−o(1); this improves upon the previous inapproximability factor of 1.36 shown by Guha and Khuller (J. Algorithms '99). ∙ Continuous k-means to a factor of 4−o(1); this improves upon the previous inapproximability factor of 2.10 shown by Guha and Khuller (J. Algorithms '99). ∙ k-minsum to a factor of 1.415; this improves upon the APX-hardness shown by Guruswami and Indyk (SODA '03). Our results shed new and perhaps counter-intuitive light on the differences between clustering problems in the continuous setting versus the discrete setting (where the candidate centers are given as part of the input).
Neurips Proceedings - Proceedings of the conference on Neural Information Processing Systems (Neurips) 2020.
Click to show the abstract.A classic problem in machine learning and data analysis is to partition the vertices of a network in such a way that vertices in the same set are densely connected and vertices in different sets are loosely connected. In practice, the most popular approaches rely on local search algorithms, not only for their ease of implementation and efficiency, but also because of the accuracy of these methods on many real-world graphs. For example, the Louvain algorithm -- a local-search-based algorithm -- has quickly become the method of choice for clustering in social networks, accumulating more than 11400 citations over the past 10 years. However, explaining the success of these methods remains an open problem: in the worst case, the runtime can be up to Ω(n²), much worse than what is typically observed in practice, and no guarantee on the quality of the output can be established. The goal of this paper is to shed light on the inner workings of Louvain; only if we understand Louvain can we rely on it and further improve it. To achieve this goal, we study the behavior of Louvain in the famous Stochastic Block Model, which has a clear ground truth and serves as the standard testbed for graph clustering algorithms. We provide valuable tools for the analysis of Louvain, but also for many other combinatorial algorithms. For example, we show that the probability for a node to have more edges towards its own community is 1/2 + Ω(min(Δ(p−q)/√(np), 1)) in the SBM(n, p, q), where Δ is the imbalance. Note that this bound is asymptotically tight and useful for the analysis of a wide range of algorithms (Louvain, Kernighan-Lin, Simulated Annealing, etc.).
Arxiv - Proceedings of the conference on Neural Information Processing Systems (Neurips) 2020.
Click to show the abstract.k-means++ (Arthur, Vassilvitskii, SODA'07) is a widely used clustering algorithm that is easy to implement, has nice theoretical guarantees, and strong empirical performance. Despite its wide adoption, k-means++ sometimes suffers from being slow on large datasets, so a natural question has been to obtain more efficient algorithms with similar guarantees. In this paper, we present such a near-linear-time algorithm for k-means++ seeding. Interestingly, our algorithm obtains the same theoretical guarantees as k-means++ and significantly improves earlier results on fast k-means++ seeding. Moreover, we show empirically that our algorithm is significantly faster than k-means++ and obtains solutions of equivalent quality.
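For reference, the baseline being accelerated is standard D²-sampling, which takes O(nkd) time because every new center triggers a full pass over the data; a minimal NumPy sketch (assuming distinct input points):

```python
import numpy as np

def kmeans_pp_seeding(points, k, rng=np.random.default_rng(0)):
    """Standard k-means++ seeding: each new center is sampled with
    probability proportional to the squared distance to the current seeds."""
    n = len(points)
    centers = [points[rng.integers(n)]]
    d2 = ((points - centers[0]) ** 2).sum(1)
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        c = points[rng.choice(n, p=probs)]            # D^2-sampling step
        centers.append(c)
        d2 = np.minimum(d2, ((points - c) ** 2).sum(1))  # the O(nd) pass
    return np.stack(centers)
```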
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2020.
Click to show the abstract.Understanding the structure of minor-free metrics, namely shortest path metrics obtained over a weighted graph excluding a fixed minor, has been an important research direction since the fundamental work of Robertson and Seymour. A fundamental idea that both helps understand the structural properties of these metrics and leads to strong algorithmic results is to construct a “small-complexity” graph that approximately preserves distances between pairs of points of the metric. We show the following two structural results for minor-free metrics: 1. Construction of a light subset spanner: given a subset of vertices called terminals and ε > 0, in polynomial time we construct a subgraph that preserves all pairwise distances between terminals up to a multiplicative 1 + ε factor, of total weight at most O(1) times the weight of the minimal Steiner tree spanning the terminals. 2. Construction of a stochastic metric embedding into low-treewidth graphs with expected additive distortion εD: given a minor-free graph G = (V, E, w) of diameter D and a parameter ε, we construct a distribution D over dominating metric embeddings into graphs of treewidth O_ε(log n) such that for all u, v ∈ V, E_{f∼D}[d_H(f(u), f(v))] ≤ d_G(u, v) + εD. One of our important technical contributions is a novel framework that allows us to reduce both problems to problems on simpler graphs of bounded diameter, which we solve using a new decomposition. Our results have the following algorithmic consequences: (1) the first efficient approximation scheme for subset TSP in minor-free metrics; (2) the first approximation scheme for vehicle routing with bounded capacity in minor-free metrics; (3) the first efficient approximation scheme for vehicle routing with bounded capacity on bounded-genus metrics. En route to the latter result, we design the first FPT approximation scheme for vehicle routing with bounded capacity on bounded-treewidth graphs (parameterized by the treewidth).
Arxiv - Proceedings of the International Conference on Machine Learning (ICML) 2020.
Click to show the abstract.A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic "linkage" algorithms (single, average, or complete). However, these methods exhibit a quite prohibitive running time of Θ(n²). In this paper, we provide a new algorithm which takes as input a set of points P in R^d, and for every c ≥ 1, runs in time n^{1+O(1/c²)} to output an ultrametric Δ such that for any two points u, v in P, Δ(u,v) is within a multiplicative factor of 5c of the distance between u and v in the "best" ultrametric representation of P. Here, the best ultrametric is the ultrametric Δ* that minimizes the maximum distance distortion with respect to the Euclidean distance, namely that minimizes max_{u,v in P} Δ*(u,v)/||u−v||. We complement the above result by showing that, under popular complexity-theoretic assumptions, for every constant ε > 0, no algorithm with running time n^{2−ε} can distinguish between inputs that admit an isometric embedding and inputs that incur a distortion of 3/2 in the L∞ metric. Finally, we present an empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2020.
Click to show the abstract.Sparsest Cut is a fundamental optimization problem that has been extensively studied. For planar inputs, the problem is in P and can be solved in Õ(n³) time if all vertex weights are 1. Despite a significant amount of effort, the best algorithms date back to the early 90's and can only achieve an O(log n)-approximation in Õ(n) time or a 3.5-approximation in Õ(n²) time [Rao, STOC'92]. Our main result is an Ω(n^{2−ε}) lower bound for Sparsest Cut even in planar graphs with unit vertex weights, under the (min,+)-Convolution conjecture, showing that approximations are inevitable in the near-linear time regime. To complement the lower bound, we provide a 3.3-approximation in near-linear time, improving upon the 25-year-old result of Rao in both time and accuracy. We also show that our lower bound is not far from optimal by observing an exact algorithm with running time Õ(n^{5/2}), improving upon the Õ(n³) algorithm of Park and Phillips [STOC'93]. Our lower bound accomplishes a repeatedly raised challenge by being the first fine-grained lower bound for a natural planar graph problem in P. Building on our construction, we prove near-quadratic lower bounds under SETH for variants of the closest pair problem in planar graphs, and use them to show that the popular Average-Linkage procedure for Hierarchical Clustering cannot be simulated in truly subquadratic time. At the core of our constructions is a diamond-like gadget that also settles the complexity of Diameter in distributed planar networks. We prove an Ω(n/log n) lower bound on the number of communication rounds required to compute the weighted diameter of a network in the CONGEST model, even when the underlying graph is planar and all nodes are D = 4 hops away from each other. This is the first poly(n) lower bound in the planar-distributed setting, and it complements the recent poly(D, log n) upper bounds of Li and Parter [STOC 2019] for (exact) unweighted diameter and for (1+ε)-approximate weighted diameter.
Arxiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2020.
Click to show the abstract.Motivated by applications in redistricting, we consider the uniform capacitated k-median and uniform capacitated k-means problems in bounded doubling metrics. We provide the first QPTAS for both problems, and the first PTAS for the uniform capacitated k-median problem for points in R². This is the first improvement over the bicriteria QPTAS for capacitated k-median in low-dimensional Euclidean space of Arora, Raghavan, Rao [STOC 1998] ((1+ε)-approximation with (1+ε)-capacity violation), and arguably the first polynomial-time approximation algorithm for a non-trivial metric.
Arxiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2020.
Click to show the abstract.Motivated by crowdsourced computation, peer-grading, and recommendation systems, Braverman, Mao and Weinberg [STOC'16] studied the query and round complexity of fundamental problems such as finding the maximum (max), finding all elements above a certain value (threshold-v), or computing the top k elements (Top-k) in a noisy environment. For example, consider the task of selecting papers for a conference. This task is challenging due to the crowdsourcing nature of peer reviews: the results of reviews are noisy, and it is necessary to parallelize the review process as much as possible. We study the noisy value model and the noisy comparison model: in the noisy value model, a reviewer is asked to evaluate a single element ("What is the value of paper i?", e.g., accept). In the noisy comparison model (introduced in the seminal work of Feige, Peleg, Raghavan and Upfal [SICOMP'94]), a reviewer is asked to do a pairwise comparison ("Is paper i better than paper j?"). In this paper, we show optimal worst-case query complexity for the max, threshold-v, and Top-k problems. For max and Top-k, we obtain optimal worst-case upper and lower bounds on the round vs. query complexity in both models. For threshold-v, we obtain optimal query complexity and nearly-optimal round complexity (where k is the size of the output) for both models. We then go beyond the worst case and address the question of the importance of knowledge of the instance by providing, for a large range of parameters, instance-optimal algorithms with respect to the query complexity. Furthermore, we show that the value model is strictly easier than the comparison model.
Neurips Proceedings - Proceedings of the conference on Neural Information Processing Systems (Neurips) 2019.
Click to show the abstract.We consider the widely-used average-linkage, single-linkage, and Ward's methods for computing hierarchical clusterings of high-dimensional Euclidean inputs. It is easy to show that there is no efficient implementation of these algorithms in high-dimensional Euclidean space, since each implicitly requires solving the closest pair problem, a notoriously difficult problem. However, how fast can these algorithms be implemented if we allow approximation? More precisely, these algorithms successively merge the clusters that are at closest average distance (for average-linkage), minimum distance (for single-linkage), or that induce the least sum-of-squares error (for Ward's). We ask whether one could obtain a significant running-time improvement if the algorithm can merge γ-approximate closest clusters (namely, clusters whose distance -- average, minimum, or sum-of-squares error -- is at most γ times the distance of the closest clusters). We show that one can indeed take advantage of the relaxation and compute the approximate hierarchical clustering tree using O(n polylog n) γ-approximate nearest neighbor queries. This leads to an algorithm running in time O((nd + n^{1+O(1/γ)}) polylog n) for d-dimensional Euclidean space. We then provide experiments showing that these algorithms perform as well as the non-approximate versions for classic classification tasks while achieving a significant speed-up.
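To see where the quadratic bottleneck sits, here is a minimal sketch of Ward's method via centroids (an illustration of the baseline, not the paper's algorithm): the merge cost of clusters A and B is |A||B|/(|A|+|B|)·||μ_A − μ_B||², and the brute-force closest-pair search in the inner loop is exactly the step the paper replaces with γ-approximate nearest-neighbor queries.

```python
import numpy as np

def ward_linkage(points):
    """Returns the merge history [(cluster_a, cluster_b, merge_cost), ...]."""
    clusters = {i: (points[i].astype(float), 1) for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        best, pair = np.inf, None
        for i_pos, a in enumerate(ids):     # brute-force closest pair:
            for b in ids[i_pos + 1:]:       # the quadratic step to speed up
                (ma, na), (mb, nb) = clusters[a], clusters[b]
                cost = na * nb / (na + nb) * ((ma - mb) ** 2).sum()
                if cost < best:
                    best, pair = cost, (a, b)
        a, b = pair
        (ma, na), (mb, nb) = clusters[a], clusters[b]
        clusters[a] = ((na * ma + nb * mb) / (na + nb), na + nb)  # merged centroid
        del clusters[b]
        merges.append((a, b, best))
    return merges
```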
Neurips Proceedings - Proceedings of the conference on Neural Information Processing Systems (Neurips) 2019.
Click to show the abstract.We consider classic clustering problems in fully dynamic data streams, where data elements can be both inserted and deleted. In this context, several parameters are of importance: (1) the quality of the solution after each insertion or deletion, (2) the time it takes to update the solution, and (3) how different consecutive solutions are. The question of obtaining efficient algorithms in this context for facility location, k-median and k-means has been raised in a recent paper by Hubert Chan et al. [WWW'18] and also appears as a natural follow-up on the online model with recourse studied by Lattanzi and Vassilvitskii [ICML'17] (i.e., in insertion-only streams). In this paper, we focus on general metric spaces and mainly on the facility location problem. We give an arguably simple algorithm that maintains a constant factor approximation, with O(n log n) update time and total recourse O(n). This improves over the naive algorithm, which recomputes a solution at each time step and can take up to O(n²) update time and O(n²) total recourse. These bounds are nearly optimal: in general metric spaces, inserting a point takes O(n) time to describe its distances to the other points, and we give a simple Ω(n) lower bound for the recourse. Moreover, we generalize this result to the k-median and k-means problems: our algorithm maintains a constant factor approximation in time O((n+k²) polylog n). We complement our analysis with experiments showing that the cost of the solution maintained by our algorithm at any time t is very close to the cost of a solution obtained by quickly recomputing a solution from scratch at time t, while having a much better running time.
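For intuition on the insertion side of this problem, here is a minimal sketch of Meyerson's classic online facility location rule for insertion-only streams (a folklore baseline, not the paper's fully dynamic algorithm): an arriving point opens a new facility with probability proportional to its distance to the current facilities.

```python
import math
import random

def online_facility_location(stream, opening_cost):
    """stream: iterable of coordinate tuples; opening_cost: facility cost f.
    An arriving point opens a facility with probability min(dist/f, 1),
    otherwise it connects to the nearest open facility."""
    facilities = []
    total_cost = 0.0
    for p in stream:
        d = min((math.dist(p, f) for f in facilities), default=math.inf)
        if random.random() < min(d / opening_cost, 1.0):
            facilities.append(p)          # open a facility at the point
            total_cost += opening_cost
        else:
            total_cost += d               # connect to the nearest facility
    return facilities, total_cost
```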
HAL - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2019.
Click to show the abstract.Proving hardness of approximation for min-sum objectives is an infamous challenge. For classic problems such as the Traveling Salesman problem, the Steiner tree problem, or the k-means and k-median problems, the best known inapproximability bounds for L-p metrics of dimension O(log n) remain well below 1.01. In this paper, we take a significant step to improve the hardness of approximation of the k-means problem in various L-p metrics, and more particularly on Manhattan (L-1), Euclidean (L-2), Hamming (L-0) and Chebyshev (L-infinity) metrics of dimension log n and above. We show that it is hard to approximate the k-means objective in O(log n)-dimensional space: (1) to a factor of 3.94 in the L-infinity metric when centers have to be chosen from a discrete set of locations (i.e., the discrete case); this improves upon the result of Guruswami and Indyk (SODA'03), who proved hardness of approximation for a factor less than 1.01. (2) To a factor of 1.56 in the L-1 metric and to a factor of 1.17 in the L-2 metric, both in the discrete case; this improves upon the result of Trevisan (SICOMP'00), who proved hardness of approximation for a factor less than 1.01 in both metrics. (3) To a factor of 1.07 in the L-2 metric, when centers can be placed at arbitrary locations (i.e., the continuous case); this improves on a result of Lee-Schmidt-Wright (IPL'17), who proved hardness of approximation for a factor of 1.0013. We also obtain similar improvements over the state-of-the-art hardness of approximation results for the k-median objective in various L-p metrics. Our hardness result given in (1) above is under the standard NP ≠ P assumption, whereas all the remaining results are under the Unique Games Conjecture (UGC). We can remove our reliance on UGC and prove standard NP-hardness for the above problems, but for smaller approximation factors. Finally, we note that in order to obtain our results for the L-1 and L-infinity metrics in O(log n)-dimensional space, we introduce an embedding technique which combines the transcripts of certain communication protocols with the geometric realization of certain graphs.
Arxiv - Journal of the A.C.M. and proceedings of the Symposium on Foundations of Computer Science (FOCS) 2019.
Click to show the abstract.We consider the classic Facility Location, k-Median, and k-Means problems in metric spaces of doubling dimension d. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is 2^{(log(1/ε)/ε)^{O(d²)}} n log⁴ n + 2^{d} n log⁹ n, a significant improvement over the state-of-the-art algorithms, which run in time n^{(d/ε)^{d}}. Moreover, we show how to extend the techniques used to get efficient approximation schemes for the problems of prize-collecting k-Median and k-Means, and efficient bicriteria approximation schemes for k-Median with outliers, k-Means with outliers, and k-Center.
Arxiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2019.
Click to show the abstract.We consider the classic Facility Location problem on planar graphs (non-uniform, uncapacitated). Given an edge-weighted planar graph G, a set of clients C ⊆ V(G), a set of facilities F ⊆ V(G), and opening costs open : F → R₊, the goal is to find a subset D of F that minimizes Sum_{c ∈ C} min_{f ∈ D} dist(c, f) + Sum_{f ∈ D} open(f). Facility Location remains one of the most classic and fundamental optimization problems for which, despite significant effort, it is not known whether it admits a polynomial-time approximation scheme (PTAS) on planar graphs. We solve this open problem by giving an algorithm that, for any ε > 0, computes a solution of cost at most (1 + ε) times the optimum in time n^{2^{O(ε⁻² log(1/ε))}}.
Arxiv - Proceedings of European Symposium on Algorithms (ESA) 2019.
Click to show the abstract.We consider the k-Median problem on planar graphs: given an edge-weighted planar graph G, a set of clients C ⊆ V(G), a set of facilities F ⊆ V(G), and an integer parameter k, the task is to find a set of at most k facilities whose opening minimizes the total connection cost of clients, where each client contributes to the cost with the distance to the closest open facility. We give two new approximation schemes for this problem:
- FPT Approximation Scheme: for any ε > 0, in time 2^O(k ε⁻³ log(k ε⁻¹)) · poly(n) we can compute a solution with connection cost at most (1 + ε) times the optimum, with high probability.
- Efficient Bicriteria Approximation Scheme: for any ε > 0, in time 2^O(ε⁻⁵ log(1/ε)) · poly(n) we can compute a set of at most (1 + ε)k facilities whose opening yields connection cost at most (1 + ε) times the optimum connection cost for opening at most k facilities, with high probability.
- As a direct corollary of the second result, we obtain an EPTAS for Uniform Facility Location on planar graphs, with the same running time.
Our main technical tool is a new construction of a “coreset for facilities” for k-Median in planar graphs: we show that in polynomial time one can compute a subset of facilities F₀ ⊆ F of size k · (log n/ε)^O(ε⁻³) with a guarantee that there is a (1 + ε)-approximate solution contained in F₀.
Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP) 2019.
Click to show the abstract.We study the complexity of the classic capacitated k-median and k-means problems parameterized by the number of centers, k. These problems are notoriously difficult since the best known approximation bound for high-dimensional Euclidean space and general metric space is Θ(log k), and it remains a major open problem whether a constant-factor approximation exists. We show that there exists a (3+ε)-approximation algorithm for the capacitated k-median problem and a (9+ε)-approximation algorithm for the capacitated k-means problem in general metric spaces, with running times f(ε,k)·n^{O(1)}. For Euclidean inputs of arbitrary dimension, we give a (1+ε)-approximation algorithm for both problems with a similar running time. This is a significant improvement over the (7+ε)-approximation of Adamczyk et al. for k-median in general metric spaces and the (69+ε)-approximation of Xu et al. for Euclidean k-means.
Arxiv - Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP) 2019.
Click to show the abstract.We investigate the fine-grained complexity of approximating the classical k-median / k-means clustering problems in general metric spaces. We show how to improve the approximation factors to (1+2/e+ε) and (1+8/e+ε) respectively, using algorithms that run in fixed-parameter time. Moreover, we show that we cannot do better in FPT time, modulo recent complexity-theoretic conjectures.
Arxiv - Journal of the A.C.M. and proceedings of the Symposium on Computational Geometry (SoCG) 2019. Best Paper Award.
Click to show the abstract.We prove essentially tight lower bounds, conditionally on the Exponential Time Hypothesis (ETH), for two fundamental but seemingly very different cutting problems on surface-embedded graphs: the Shortest Cut Graph problem and the Multiway Cut problem. A cut graph of a graph G embedded on a surface S is a subgraph of G whose removal from S leaves a disk. We consider the problem of deciding whether an unweighted graph embedded on a surface of genus g has a cut graph of length at most a given value. We prove a time lower bound for this problem of n^{Ω(g/log g)}, conditionally on ETH. In other words, the first n^{O(g)}-time algorithm by Erickson and Har-Peled [SoCG 2002, Discr. Comput. Geom. 2004] is essentially optimal. We also prove that the problem is W[1]-hard when parameterized by the genus, answering a 17-year-old question of these authors. A multiway cut of an undirected graph G with t distinguished vertices, called terminals, is a set of edges whose removal disconnects all pairs of terminals. We consider the problem of deciding whether an unweighted graph G has a multiway cut of weight at most a given value. We prove a time lower bound for this problem of n^{Ω(√(gt+g²)/log(gt))}, conditionally on ETH, for any choice of the genus g ≥ 0 of the graph and the number of terminals t ≥ 4. In other words, the algorithm by the second author [Algorithmica 2017] (for the more general multicut problem) is essentially optimal; this extends the lower bound by the third author [ICALP 2012] (for the planar case). Reductions to planar problems usually involve a grid-like structure. The main novel idea for our results is to understand what structures, instead of grids, are needed if we want to exploit optimally a certain value g of the genus.
Proceedings of the Symposium on Theory of Computing (STOC) 2019.
Click to show the abstract.We show that for n points in d-dimensional Euclidean space, a data-oblivious random projection of the columns onto m ∈ O((log k + log log n)·ε⁻⁶ log(1/ε)) dimensions is sufficient to approximate the cost of any k-means clustering up to a multiplicative (1 ± ε) factor. The previous-best upper bounds on m are O((log n)·ε⁻²), given by a direct application of the Johnson-Lindenstrauss Lemma, and O(k·ε⁻²), given by [Cohen et al., STOC'15]. We also prove the existence of a non-oblivious cost-preserving sketch with target dimension O(log k · ε⁻⁴ log(1/ε)), improving on the k/ε bound of [Cohen et al., STOC'15]. Furthermore, we show how to construct strong coresets for the k-means problem of size O(k · poly(log k, ε⁻¹)). Previous constructions of strong coresets have size of order k · min(d, k/ε).
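The dimension-reduction step itself is simple to state; the content of the paper is how small m can be. A minimal sketch (assuming any off-the-shelf k-means routine is then run on the projected points):

```python
import numpy as np

def project_for_kmeans(points, m, rng=np.random.default_rng(0)):
    """Data-oblivious sketch: multiply by a scaled Gaussian matrix.
    points: (n, d); returns the (n, m) projected data."""
    n, d = points.shape
    G = rng.normal(size=(d, m)) / np.sqrt(m)   # E[||xG||^2] = ||x||^2
    return points @ G

def kmeans_cost(points, centers):
    """Cost whose (1 +/- eps) preservation the projection guarantees."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1).sum()
```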
Arxiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2019.
Click to show the abstract.In this paper we study lower bounds for the fundamental problem of text indexing with mismatches and differences. In this problem we are given a long string of length n, the “text”, and the task is to preprocess it into a data structure such that, given a query string Q, one can quickly identify substrings that are within Hamming or edit distance at most k from Q. This problem is at the core of various problems arising in biology and text processing. While exact text indexing allows linear-size data structures with linear query time, text indexing with k mismatches (or k differences) seems to be much harder: all known data structures have exponential dependency on k either in the space or in the time bound. We provide conditional and pointer-machine lower bounds that make a step toward explaining this phenomenon. We start by demonstrating conditional lower bounds for k = Θ(log n). We show that, assuming the Strong Exponential Time Hypothesis, any data structure for text indexing that can be constructed in polynomial time cannot have O(n^{1−δ}) query time, for any δ > 0. This bound also extends to the setting where we only ask for (1 + ε)-approximate solutions for text indexing. However, in many applications the value of k is rather small, and one might hope that for small k we can develop more efficient solutions. We show that this would require a radically new approach, as using the current methods one cannot avoid exponential dependency on k either in the space or in the time bound, even for all √log n ≤ k = o(log n). Our lower bounds also apply to the dictionary look-up problem, where instead of a text one is given a set of strings.
Neurips Proceedings - Proceedings of the conference on Neural Information Processing Systems (Neurips) 2018.
Click to show the abstract.Kleinberg stated three axioms that any clustering procedure should satisfy and showed there is no clustering procedure that simultaneously satisfies all three. One of these, called the consistency axiom, requires that when the data is modified in a helpful way, i.e., if points in the same cluster are made more similar and those in different ones made less similar, the algorithm should output the same clustering. To circumvent this impossibility result, research has focused on considering clustering procedures that have a clustering quality measure (or a cost), and on showing that a modification of Kleinberg's axioms that takes cost into account leads to feasible clustering procedures. In this work, we take a different approach, based on the observation that the consistency axiom fails to be satisfied when the “correct” number of clusters changes. We modify this axiom by making use of cost functions to determine the correct number of clusters, and require that consistency holds only if the number of clusters remains unchanged. We show that single linkage satisfies the modified axioms, and that, if the input is well-clusterable, some popular procedures such as k-means also satisfy the axioms, taking a step towards explaining the success of these objective functions for guiding the design of algorithms.
Arxiv - Proceedings of the International Conference on Advances in Geographic Information Systems (SIGSPATIAL) 2018.
Click to show the abstract.We consider the problem of political redistricting: given the locations of people in a geographical area (e.g., a US state), the goal is to decompose the area into subareas, called districts, so that the populations of the districts are as close as possible and the districts are “compact” and “contiguous,” to use the terms referred to in most US state constitutions and/or US Supreme Court rulings. We study a method that outputs a solution in which each district is the intersection of a convex polygon with the geographical area. The average number of sides per polygon is less than six. The polygons tend to be quite compact. Every two districts differ in population by at most one (so we call the solution balanced). The solution is a centroidal power diagram: each polygon has an associated center in R³ such that
• the projection of the center onto the plane z = 0 is the centroid of the locations of people assigned to the polygon, and
• for each person assigned to that polygon, the polygon's center is closest among all centers.
The polygons are convex because they are the intersections of 3D Voronoi cells with the plane. The solution is in a sense a locally optimal solution to the problem of choosing centers in the plane and choosing an assignment of people to those 2D centers so as to minimize the sum of squared distances subject to the assignment being balanced. A practical problem with this method is that, in real-world redistricting, exact locations of people are unknown. Instead, the input consists of polygons (census blocks) and associated populations. A real redistricting must not split census blocks. We therefore propose a post-processing step in which we perturb the solution so that it does not split census blocks. In our experiments, our post-processing step is able to achieve this while preserving population balance and connectivity of the districts. The district polygons are no longer convex at the fine scale because their boundaries must follow the boundaries of census blocks, but at a coarse scale they preserve the shape of the original polygons.
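The assignment rule at the heart of this construction is easy to state in code. A minimal sketch (an illustration with hypothetical inputs people, centers, and weights; not the paper's full pipeline): each center c carries a weight w(c), a person at p goes to the center minimizing ||p − c||² − w(c), and raising the weight of an underfull district grows its cell while keeping every cell convex.

```python
import numpy as np

def power_diagram_assignment(people, centers, weights):
    """people: (n, 2); centers: (k, 2); weights: (k,). Returns (n,) labels.
    Each person is assigned to the center minimizing ||p - c||^2 - w(c);
    the resulting cells are intersections of half-planes, hence convex."""
    d2 = ((people[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return (d2 - weights[None, :]).argmin(1)
```

Iterating this assignment with weight updates that equalize district populations, and recentering each center at its district's centroid, gives the locally optimal balanced solution described above.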
Arxiv - Proceedings of the Symposium on Theory of Computing (STOC) 2018.
Click to show the abstract.We consider very natural “fence enclosure” problems studied by Arkin, Khuller, and Mitchell in the early 90s. Given a set S of n points in the plane, we aim at finding a set of closed curves such that (1) each point is enclosed by a curve and (2) the total length of the curves is minimized. We consider two main variants. In the first variant, we pay a unit cost per curve in addition to the total length of the curves. An equivalent formulation of this version is that we have to enclose n unit disks, paying only the total length of the enclosing curves. In the other variant, we are allowed to use at most k closed curves and pay no cost per curve. For the variant with at most k closed curves, we present an algorithm that is polynomial in both n and k. For the variant with unit cost per curve, or unit disks, we present a near-linear time algorithm. Arkin, Khuller, and Mitchell solved the problem with at most k curves in n^{O(k)} time, and used this to solve the unit-cost-per-curve version in exponential time. At the time, they conjectured that the problem with k curves is NP-hard for general k. Our polynomial-time algorithm refutes this conjecture unless P = NP.
ArXiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2018.
Click to show the abstract.We consider the popular k-means problem in d-dimensional Euclidean space. Recently, Friggstad, Rezapour, Salavatipour [FOCS'16] and Cohen-Addad, Klein, Mathieu [FOCS'16] showed that the standard local search algorithm yields a (1+ε)-approximation in time (nk)^{(1/ε)^{O(d)}}, giving the first polynomial-time approximation scheme for the problem in low-dimensional Euclidean space. While local search achieves optimal approximation guarantees, it is not competitive with the state-of-the-art heuristics such as the famous k-means++ and D²-sampling algorithms. In this paper, we aim at bridging the gap between theory and practice by giving a (1+ε)-approximation algorithm for low-dimensional k-means running in time n·k·(log n)^{(d/ε)^{O(d)}}, and so matching the running time of the k-means++ and D²-sampling heuristics up to polylogarithmic factors. We speed up the local search approach by making a non-standard use of randomized dissections that allows us to find the best local move efficiently using a quite simple dynamic program. We hope that our techniques could help design better local search heuristics for geometric problems.
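For concreteness, here is the vanilla single-swap local search that this line of work analyzes, in a minimal sketch (the naive try-every-swap version; the paper's contribution is finding improving moves fast via randomized dissections):

```python
import numpy as np

def cost(points, centers):
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1).sum()

def local_search_kmeans(points, k, rng=np.random.default_rng(0)):
    """Single-swap local search for discrete k-means: keep swapping one
    center for one input point while the total cost strictly improves."""
    centers = points[rng.choice(len(points), k, replace=False)]
    best = cost(points, centers)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for p in points:               # the O(nk) swap enumeration
                trial = centers.copy()
                trial[i] = p
                c = cost(points, trial)
                if c < best - 1e-12:       # strict improvement, so it halts
                    centers, best, improved = trial, c, True
    return centers, best
```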
ArXiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2018. Refer to the Journal of the ACM version.
Click to show the abstract.Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, Dasgupta framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a “good” hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties. We take an axiomatic approach to defining “good” objective functions for both similarity- and dissimilarity-based hierarchical clustering. We characterize a set of “admissible” objective functions (that includes Dasgupta's) that have the property that when the input admits a “natural” hierarchical clustering, it has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better algorithms. For similarity-based hierarchical clustering, Dasgupta showed that the divisive sparsest-cut approach achieves an O(log^{3/2} n)-approximation. We give a refined analysis of the algorithm and show that it in fact achieves an O(√log n)-approximation (Charikar and Chatziafratis independently proved that it is an O(√log n)-approximation). This improves upon the LP-based O(log n)-approximation of Roy and Pokutta. For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor-2 approximation, and provide a simple and better algorithm that gives a factor-3/2 approximation. Finally, we consider a “beyond-worst-case” scenario through a generalization of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function has desirable properties for these inputs and we provide a simple (1 + o(1))-approximation in this setting.
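Dasgupta's objective referenced above is easy to state in code: every pair (i, j) pays its similarity w(i, j) times the number of leaves below the lowest common ancestor of i and j, so high-similarity pairs should be split as late as possible. A minimal sketch, representing a binary hierarchy as nested pairs:

```python
def dasgupta_cost(tree, sim):
    """tree: a leaf id, or a pair (left, right) of subtrees.
    sim: dict mapping frozenset({i, j}) -> similarity.
    Returns (cost, leaves): pairs split at a node pay |leaves under node|
    times their similarity."""
    if not isinstance(tree, tuple):
        return 0.0, [tree]
    cost_l, leaves_l = dasgupta_cost(tree[0], sim)
    cost_r, leaves_r = dasgupta_cost(tree[1], sim)
    leaves = leaves_l + leaves_r
    # Pairs separated here have their lowest common ancestor at this node.
    split = sum(sim.get(frozenset((u, v)), 0.0)
                for u in leaves_l for v in leaves_r)
    return cost_l + cost_r + len(leaves) * split, leaves
```

For instance, dasgupta_cost(((0, 1), (2, 3)), sim) charges pairs split at the root four times their similarity, and pairs split at the bottom merges only twice theirs.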
ArXiv - SIAM Journal of Computing and proceedings of the Symposium on Discrete Algorithms (SODA) 2018.
Click to show the abstract.For an undirected edge-weighted graph G and a set R of pairs of vertices called pairs of terminals, a multicut is a set of edges such that removing these edges from G disconnects each pair in R. We provide an algorithm computing a (1+ε)-approximation of the minimum multicut of a graph G in time (g+t)^{O((g+t)³)} · (1/ε)^{O(g+t)} · n log n, where g is the genus of G and t is the number of terminals. This result is tight in several aspects, as the minimum multicut problem is both APX-hard and W[1]-hard (parameterized by the number of terminals), even on planar graphs (equivalently, when g = 0). To achieve this, our article leverages a novel characterization of a minimum multicut as a family of Steiner trees in the universal cover of a surface on which G is embedded. The algorithm heavily relies on topological techniques, and in particular on the use of homotopical tools and computations in covering spaces, which we blend with classic ideas stemming from approximation schemes for planar graphs and low-dimensional geometric inputs.
ArXiv - Proceedings of the Symposium on Discrete Algorithms (SODA) 2018.
Click to show the abstract.In this paper, we give a conditional lower bound of n^{Ω(k)} for the classic k-median and k-means clustering objectives (where n is the size of the input), even in low-dimensional Euclidean space of dimension four, assuming the Exponential Time Hypothesis (ETH). We also consider k-median (and k-means) with penalties, where each point need not be assigned to a center, in which case it must pay a penalty, and extend our lower bound to at least three-dimensional Euclidean space. This stands in stark contrast to many other geometric problems, such as the traveling salesman problem or computing an independent set of unit spheres. While these problems benefit from the so-called (limited) blessing of dimensionality, as they can be solved in time n^{O(k^{1−1/d})} or 2^{n^{1−1/d}} in d dimensions, our work shows that widely-used clustering objectives have a lower bound of n^{Ω(k)}, even in dimension four. We complete the picture by considering the two-dimensional case: we show that there is no algorithm that solves the penalized version in time n^{o(√k)}, and provide a matching upper bound of n^{O(√k)}. The main tool we use to establish these lower bounds is the placement of points on the moment curve, which takes its inspiration from constructions of point sets yielding Delaunay complexes of high complexity.
NIPS proceedings - Proceedings of the Conference on Neural Information Processing Systems (NIPS) 2017.
Click to show the abstract.Hierarchical clustering, i.e., computing a recursive partitioning of a dataset to obtain clusters at increasingly finer granularity, is a fundamental problem in data analysis. Although hierarchical clustering has mostly been studied through procedures such as linkage algorithms or top-down heuristics, rather than as optimization problems, Dasgupta recently proposed an objective function for hierarchical clustering and initiated a line of work developing algorithms that explicitly optimize an objective. In this paper, we consider a fairly general random graph model for hierarchical clustering, called the hierarchical stochastic block model (HSBM), and show that in certain regimes the SVD approach of McSherry combined with specific linkage methods results in a clustering that gives an O(1) approximation to Dasgupta's cost function. We also show that an approach based on SDP relaxations for balanced cuts, based on the work of Makarychev et al., combined with the recursive sparsest-cut algorithm of Dasgupta, yields an O(1) approximation in slightly larger regimes and also in the semi-random setting, where an adversary may remove edges from the random graph generated according to an HSBM. Finally, we report an empirical evaluation on synthetic and real-world data showing that our proposed SVD-based method does indeed achieve a better cost than other widely-used heuristics and also results in a better classification accuracy when the underlying problem is that of multi-class classification.
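A minimal sketch of the McSherry-style SVD step this builds on (my simplified illustration: plain Lloyd iterations stand in for the specific linkage methods used in the paper): project the adjacency matrix onto its top-k singular subspace to denoise the block structure, then cluster the embedded rows.

```python
import numpy as np

def svd_cluster(adj, k, iters=20, rng=np.random.default_rng(0)):
    """adj: (n, n) adjacency matrix; returns (n,) cluster labels."""
    U, s, Vt = np.linalg.svd(adj)
    rows = U[:, :k] * s[:k]                   # rank-k embedding of the nodes
    centers = rows[rng.choice(len(rows), k, replace=False)]
    for _ in range(iters):                    # plain Lloyd iterations
        labels = ((rows[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = rows[labels == j].mean(0)
    return labels
```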
ArXiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2017.
Click to show the abstract.We study the classic k-median and k-means clustering objectives in the beyond-worst-case scenario. We consider three well-studied notions of structured data that aim at characterizing real-world inputs: Distribution Stability (introduced by Awasthi, Blum, and Sheffet, FOCS 2010), Spectral Separability (introduced by Kumar and Kannan, FOCS 2010), and Perturbation Resilience (introduced by Bilu and Linial, ICS 2010). We prove structural results showing that inputs satisfying at least one of the conditions are inherently "local": namely, for any such input, any local optimum is close, both in terms of structure and in terms of objective value, to the global optimum. As a corollary, we obtain that the widely-used Local Search algorithm has strong performance guarantees for both the task of recovering the underlying optimal clustering and the task of obtaining a clustering of small cost. This is a significant step toward understanding the success of local search heuristics in clustering applications.
ArXiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2017.
Click to show the abstract.Given a graph, a distance oracle is a data structure that answers distance queries between pairs of vertices. We introduce an O(n^{5/3})-space distance oracle which answers exact distance queries in O(log n) time for n-vertex planar edge-weighted digraphs. All previous distance oracles for planar graphs with truly subquadratic space (i.e., space O(n^{2−ε}) for some constant ε > 0) either required query time polynomial in n or could only answer approximate distance queries. Furthermore, we show how to trade off time and space: for any S ≥ n^{3/2}, we show how to obtain an S-space distance oracle that answers queries in time O((n^{5/2}/S^{3/2}) log n). This is a polynomial improvement over the previous planar distance oracles with o(n^{1/4}) query time.
ArXiv - Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) 2017.
Click to show the abstract.We study online optimization of smoothed piecewise constant functions over the domain [0, 1). This is motivated by the problem of adaptively picking parameters of learning algorithms, as in the framework recently introduced by Gupta and Roughgarden. The majority of the machine learning literature has focused on Lipschitz-continuous functions or functions with bounded gradients, and with good reason: any learning algorithm suffers linear regret even against piecewise constant functions that are chosen adversarially, arguably the simplest of non-Lipschitz-continuous functions. The smoothed setting we consider is inspired by the seminal work of Spielman and Teng and the recent work of Gupta and Roughgarden. In this setting, the sequence of functions may be chosen by an adversary, however, with some uncertainty in the location of discontinuities. We give algorithms that achieve sublinear regret in the full information and bandit settings.
ArXiv - Journal of Combinatorial Theory, Series B (JCTB), 2017.
See also this article (in French) in the popular science magazine "Pour la Science".
Steinberg conjectured in 1976 that every planar graph with no cycles of length four or five is 3-colorable. We disprove this conjecture.
ArXiv - Proceedings of the Symposium on Foundations of Computer Science (FOCS) 2016. Refer to the SICOMP special issue on FOCS'16 version.
Click to show the abstract.We give the first polynomial-time approximation schemes (PTASs) for the following problems: (1) uniform facility location in edge-weighted planar graphs; (2) k-median and k-means in edge-weighted planar graphs; (3) k-means in Euclidean space of bounded dimension. Our first and second results extend to minor-closed families of graphs. All our results extend to cost functions that are the p-th power of the shortest-path distance. The algorithm is local search where the local neighborhood of a solution S consists of all solutions obtained from S by removing and adding 1/epsilon^O(1) centers.
ArXiv - Proceedings of the Conference on Economics and Computation (EC) 2016.
Click to show the abstract.Walrasian prices, if they exist, have the property that one can assign every buyer some bundle in her demand set, such that the resulting assignment will maximize social welfare. Unfortunately, this assumes carefully breaking ties amongst different bundles in the buyer demand set. Presumably, the shopkeeper cleverly convinces the buyer to break ties in a manner consistent with maximizing social welfare. Lacking such a shopkeeper, if buyers arrive sequentially and simply choose some arbitrary bundle in their demand set, the social welfare may be arbitrarily bad. In the context of matching markets, we show how to compute dynamic prices, based upon the current inventory, that guarantee that social welfare is maximized. Such prices are set without knowing the identity of the next buyer to arrive. We also show that this is impossible in general (e.g., for coverage valuations), but consider other scenarios where this can be done.
ICALP proceedings - Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP) 2016.
Click to show the abstract.In this paper, we investigate small-space summaries for diameter and k-center clustering in sliding windows. In particular, we give an algorithm maintaining a (3 + ε)-approximate diameter, i.e., the maximum distance between two input points, in arbitrary metric spaces with nearly optimal space requirements. Furthermore, we investigate the related problem of k-center clustering, i.e., the covering of the input with k balls of minimum radius. For k = 2 we give a sliding window algorithm with optimal approximation ratio 4 + ε, and for arbitrary values of k, algorithms with approximation ratio 6 + ε in general metric spaces and 4.48 in Euclidean spaces. In the metric distance model, it is assumed that any algorithm with space x can store at most x points and that it cannot generate new points that are distinct from the stored points except by reading from the input stream. Under these assumptions, we obtain lower bounds that separate the space complexity of sliding window algorithms from insertion-only algorithms by an exponential factor. To our knowledge, no such separation was previously known.
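For contrast with the sliding-window setting, the folklore insertion-only baseline is a two-line algorithm: anchor at the first point and track the farthest distance seen, which is within a factor 2 of the diameter by the triangle inequality. The difficulty the paper addresses is that in a sliding window the anchor itself eventually expires. A minimal sketch of the baseline:

```python
import math

def streaming_diameter_2approx(stream, dist=math.dist):
    """Insertion-only baseline: returns r with diameter/2 <= r <= diameter,
    since d(u, v) <= d(u, anchor) + d(anchor, v) <= 2r for any u, v."""
    anchor, radius = None, 0.0
    for p in stream:
        if anchor is None:
            anchor = p
        else:
            radius = max(radius, dist(anchor, p))
    return radius
```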
Proceedings of the Symposium on Theory of Computing (STOC) 2016.
Click to show the abstract.We present a framework for addressing several problems on weighted planar graphs and graphs of bounded genus. With that framework, we derive polynomial-time approximation schemes for the following problems in planar graphs or graphs of bounded genus: edge-weighted tree cover and tour cover; vertex-weighted connected dominating set, maximum-weight-leaf spanning tree, and connected vertex cover. In addition, we obtain a polynomial-time approximation scheme for feedback vertex set in planar graphs. These are the first polynomial-time approximation schemes for all those problems in weighted embedded graphs. (For unweighted versions of some of these problems, polynomial-time approximation schemes were previously given using bidimensionality.) Additionally, we design a quasi-polynomial-time approximation scheme for weighted connected face cover in planar graphs, which implies a quasi-polynomial-time approximation scheme for the minimum corridor problem, a problem in the plane.
ArXiv - Discrete Applied Mathematics, vol. 200. 2016.
Click to show the abstract.We introduced the notion of involution module, the first generalization of modular decomposition which has a unique linear-sized decomposition tree in the general context of 2-structures. We derived an O(n²) decomposition algorithm, and took advantage of the involution modular decomposition tree to state several algorithmic results. We introduced the class of switch cographs, the class of graphs that are totally decomposable w.r.t. involution modular decomposition. This class generalizes the class of cographs and is exactly the class of (Bull, Gem, Co-Gem, C5)-free graphs. We used our new decomposition tool to design three practical algorithms for the maximum cut and vertex separator problems. The complexity of these problems was previously unknown for this class of graphs. We also improved the complexity of the maximum clique, maximum independent set, chromatic number, and maximum clique cover problems by giving efficient algorithms that exploit the decomposition tree. Finally, we showed that this class of graphs has clique-width at most 4 and that a clique-width expression can be computed in linear time.
ArXiv - Proceedings of the European Symposium on Algorithms (ESA) 2015.
Click to show the abstract.Given a graph G cellularly embedded on a surface S of genus g, a cut graph is a subgraph of G such that cutting S along it yields a topological disk. We provide a fixed-parameter tractable approximation scheme for the problem of computing the shortest cut graph: for any ε > 0, we show how to compute a (1 + ε)-approximation of the shortest cut graph in time f(ε, g)·n³. Our techniques first rely on the computation of a spanner for the problem, using the technique of brick decompositions, to reduce the problem to the case of bounded treewidth. Then, to solve the bounded-treewidth case, we introduce a variant of the surface-cut decomposition of Rué, Sau and Thilikos, which may be of independent interest.
ArXiv - Proceedings of the Symposium on Computational Geometry (SoCG) 2015.
Click to show the abstract.We prove that local search with local neighborhoods of magnitude 1/ε^C is an approximation scheme for the following problems in the Euclidean plane: TSP with random inputs, Steiner tree with random inputs, facility location (with worst-case inputs), and bicriteria k-median (also with worst-case inputs). We show that the randomness assumption is necessary for TSP.
ArXiv - Proceedings of the Workshop on Approximation and Online Algorithms (WAOA) - 2014.
Click to show the abstract.We improve complexity bounds for energy-efficient speed scheduling problems for both the single-processor and multi-processor cases. Energy conservation has become a major concern, so revisiting traditional scheduling problems to take into account the energy consumption has been part of the agenda of the scheduling community for the past few years. We consider the energy-minimizing speed scaling problem introduced by Yao et al., where we wish to schedule a set of jobs, each with a release date, deadline and work volume, on a set of identical processors. The processors may change speed as a function of time, and the energy they consume is the α-th power of their speed. The objective is then to find a feasible schedule which minimizes the total energy used. We show that in the setting with an arbitrary number of processors where all work volumes are equal, there is a 2(1+ε)(5(1+ε))^{α−1} B_α = O_α(1) approximation algorithm, where B_α is the generalized Bell number. This is the first constant-factor algorithm for this problem. This algorithm extends to general unequal processor-dependent work volumes, up to losing a factor of ((1+r)r)^α / 2 in the approximation, where r is the maximum ratio between two work volumes. We then show that this latter problem is APX-hard, even in the special case when all release dates and deadlines are equal and r is 4. In the single-processor case, we introduce a new linear programming formulation of speed scaling and prove that its integrality gap is at most 12^{α−1}. As a corollary, we obtain a (12(1+ε))^{α−1}-approximation algorithm for the single-processor case, improving on the previous best bound of 2^{α−1}(1+ε)^α B_α when α > 24.