Worst-case and smoothed analysis of k-means clustering with Bregman divergences

Bodo Manthey, Heiko Röglin

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

The $k$-means method is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice despite its exponential worst-case running time. To narrow the gap between theory and practice, $k$-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than worst-case analysis alone. For the case that $n$ data points in $\mathbb{R}^d$ are perturbed by Gaussian noise with standard deviation $\sigma$, it has been shown that the expected running time is bounded by a polynomial in $n$ and $1/\sigma$. This result assumes that squared Euclidean distances are used as the distance measure. In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (also known as relative entropy), which is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running time of the $k$-means method for Bregman divergences. We first give a smoothed analysis of $k$-means with (almost) arbitrary Bregman divergences, and we show bounds of $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$ and $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$. The latter yields a polynomial bound if $k$ and $d$ are small compared to $n$. On the other hand, we show that the exponential lower bound carries over to a large class of Bregman divergences.
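For reference, a Bregman divergence (a standard definition, not quoted from the paper) is the distance measure induced by a strictly convex, differentiable function $\phi$:

```latex
d_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle
```

Choosing $\phi(x) = \|x\|_2^2$ recovers the squared Euclidean distance, while the negative entropy $\phi(x) = \sum_i x_i \log x_i$ yields the Kullback-Leibler divergence $d(x, y) = \sum_i x_i \log(x_i / y_i)$ for points on the probability simplex.

The following is a minimal sketch, not the paper's implementation, of the $k$-means method run under the Kullback-Leibler divergence. It uses the standard fact that for any Bregman divergence the arithmetic mean of a cluster minimizes the sum of divergences from its points, so the update step is identical to ordinary $k$-means. The helper names `bregman_kmeans` and `kl_divergence` are illustrative, and the sketch assumes strictly positive data on the probability simplex.

```python
import numpy as np

def kl_divergence(x, y):
    # Kullback-Leibler divergence between strictly positive
    # probability vectors x and y on the simplex.
    return np.sum(x * np.log(x / y))

def bregman_kmeans(points, k, iters=100, seed=0):
    # Lloyd-style k-means under the KL divergence. The assignment
    # step uses the Bregman divergence; the update step is the
    # arithmetic mean, which is optimal for every Bregman divergence.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its closest center w.r.t. KL divergence.
        labels = np.array([
            np.argmin([kl_divergence(x, c) for c in centers])
            for x in points
        ])
        # Recompute each center as the mean of its cluster
        # (keep the old center if its cluster became empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # local optimum: no center moved
        centers = new_centers
    return centers, labels

# Toy usage: cluster 200 random points on the probability simplex in R^3.
data = np.random.default_rng(1).dirichlet(np.ones(3), size=200)
centers, labels = bregman_kmeans(data, k=4)
```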
Original language: English
Pages (from-to): 94-132
Number of pages: 39
Journal: Journal of Computational Geometry
Volume: 4
Issue number: 1
DOIs
Publication status: Published - 2013

Keywords

  • Kullback-Leibler divergence
  • Smoothed analysis
  • k-Means clustering
  • Bregman divergence
  • Clustering
  • k-Means method
