Abstract
The k-means clustering is a well-known problem in data mining and machine learning. However, the de facto standard, i.e., Lloyd’s k-mean algorithm, suffers from a large amount of time on the distance calculations. Elkan’s k-means algorithm as one prominent approach exploits triangle inequality to greatly reduce such distance calculations between points and centers, while achieving the exactly same clustering results with significant speed improvement, especially on high-dimensional datasets. In this paper, we propose a set of triangle inequalities to enhance the filtering step of Elkan’s k-means algorithm. With our new filtering bounds, a filtering-based Elkan (FB-Elkan) is proposed, which preserves the same results as Lloyd’s k-means algorithm and additionally prunes unnecessary distance calculations. In addition, a memory-optimized Elkan (MO-Elkan) is provided, where the space complexity is greatly reduced by trading-off the maintenance of lower bounds and the run-time efficiency. Throughout evaluations with real-world datasets, FB-Elkan in general accelerates the original Elkan’s k-means algorithm for high-dimensional datasets (up to 1.69x), whereas MO-Elkan outperforms the others for low-dimensional datasets (up to 2.48x). Specifically, when the datasets have a large number of points, i.e., n≥5 M, MO-Elkan still can derive the exact clustering results, while the original Elkan’s k-means algorithm is not applicable due to memory limitation.
Original language | English |
---|---|
Title of host publication | Similarity Search and Applications |
Subtitle of host publication | 13th International Conference, SISAP 2020, Copenhagen, Denmark, September 30 – October 2, 2020, Proceedings |
Pages | 297-311 |
ISBN (Electronic) | 978-3-030-60936-8 |
DOIs | |
Publication status | Published - 2020 |
Externally published | Yes |
Event | 13th International Conference on Similarity Search and Applications, SISAP 2020 - Virtual Conference Duration: 1 Jan 1970 → 2 Oct 2020 Conference number: 13 |
Publication series
Name | Lecture notes in computer science |
---|---|
Publisher | Springer |
Volume | 12440 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 13th International Conference on Similarity Search and Applications, SISAP 2020 |
---|---|
Abbreviated title | SISAP 2020 |
City | Virtual Conference |
Period | 1/01/70 → 2/10/20 |