TY - JOUR
T1 - SDCOR
T2 - Scalable density-based clustering for local outlier detection in massive-scale datasets
AU - Naghavi Nozad, Sayyed Ahmad
AU - Amir Haeri, Maryam
AU - Folino, Gianluigi
N1 - Funding Information:
Special thanks to Dr. Victoria J. Hodge (University of York, UK) for her clever advice on the title of this paper, and Dr. Mohammad Mehdi Ebadzadeh (Amirkabir University of Technology, Iran), for his valuable comments on some of the algorithms. Moreover, we are grateful to Dr. Fabrizio Angiulli (University of Calabria) and Dr. Dan Pelleg (Yahoo Labs) for providing us with the DOLPHIN and X-means implementation codes, respectively. We also appreciate Dr. Ali-Mohammad Saghiri and Dr. Ehsan Nazerfard (Amirkabir University of Technology, Iran) for their generous review of the paper before submission.
Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021/9/27
Y1 - 2021/9/27
AB - This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike well-known traditional algorithms, which assume that all the data is memory-resident, the proposed method is scalable and processes the input data chunk by chunk within the confines of a limited memory buffer. A temporary clustering model is built in the first phase and is then gradually updated by analyzing consecutive memory loads of points. At the end of this scalable clustering, the approximate structure of the original clusters is obtained. Finally, in another scan of the entire dataset and using a suitable criterion, each object is assigned an outlying score called SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient than the best-known conventional density-based methods, which need to load all data into memory, and than some fast distance-based methods, which can operate on disk-resident data.
KW - Density-based clustering
KW - Local outlier detection
KW - Massive-scale datasets
KW - Scalable
KW - Anomaly detection
UR - http://www.scopus.com/inward/record.url?scp=85109162420&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2021.107256
DO - 10.1016/j.knosys.2021.107256
M3 - Article
AN - SCOPUS:85109162420
SN - 0950-7051
VL - 228
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 107256
ER -