Using a set of triangle inequalities to accelerate k-means clustering

Qiao Yu, Kuan-Hsun Chen, Jian-Jia Chen

Research output: Chapter in Book/Report/Conference proceedingConference contributionAcademicpeer-review

Abstract

The k-means clustering is a well-known problem in data mining and machine learning. However, the de facto standard, i.e., Lloyd’s k-mean algorithm, suffers from a large amount of time on the distance calculations. Elkan’s k-means algorithm as one prominent approach exploits triangle inequality to greatly reduce such distance calculations between points and centers, while achieving the exactly same clustering results with significant speed improvement, especially on high-dimensional datasets. In this paper, we propose a set of triangle inequalities to enhance the filtering step of Elkan’s k-means algorithm. With our new filtering bounds, a filtering-based Elkan (FB-Elkan) is proposed, which preserves the same results as Lloyd’s k-means algorithm and additionally prunes unnecessary distance calculations. In addition, a memory-optimized Elkan (MO-Elkan) is provided, where the space complexity is greatly reduced by trading-off the maintenance of lower bounds and the run-time efficiency. Throughout evaluations with real-world datasets, FB-Elkan in general accelerates the original Elkan’s k-means algorithm for high-dimensional datasets (up to 1.69x), whereas MO-Elkan outperforms the others for low-dimensional datasets (up to 2.48x). Specifically, when the datasets have a large number of points, i.e., n≥5 M, MO-Elkan still can derive the exact clustering results, while the original Elkan’s k-means algorithm is not applicable due to memory limitation.
Original languageEnglish
Title of host publicationSimilarity Search and Applications
Subtitle of host publication13th International Conference, SISAP 2020, Copenhagen, Denmark, September 30 – October 2, 2020, Proceedings
Pages297-311
ISBN (Electronic)978-3-030-60936-8
DOIs
Publication statusPublished - 2020
Externally publishedYes
Event13th International Conference on Similarity Search and Applications, SISAP 2020 - Virtual Conference
Duration: 1 Jan 19702 Oct 2020
Conference number: 13

Publication series

NameLecture notes in computer science
PublisherSpringer
Volume12440
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference13th International Conference on Similarity Search and Applications, SISAP 2020
Abbreviated titleSISAP 2020
CityVirtual Conference
Period1/01/702/10/20

Fingerprint

Dive into the research topics of 'Using a set of triangle inequalities to accelerate k-means clustering'. Together they form a unique fingerprint.

Cite this