Keywords: DBSCAN; Density-based clustering; K-nearest neighbors; Unsupervised clustering

Source: DOI:10.7717/peerj-cs.1921

Abstract:
Density-based clustering is considered a robust approach among unsupervised clustering techniques because it can identify outliers, form clusters of irregular shapes, and determine the number of clusters automatically. These properties have made its pioneering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), applicable to datasets in which varying numbers of clusters of different shapes and sizes can be detected without much intervention from the user. However, the original algorithm has limitations, most notably its sensitivity to the user-supplied input parameters minPts and ε. Additionally, the algorithm assigns inconsistent cluster labels to data objects that lie in overlapping density regions of separate clusters, which lowers its accuracy. To alleviate these specific problems and increase clustering accuracy, we propose two methods that use statistics from the k-nearest neighbor density distribution of a given dataset to determine the optimal ε values. Our approach removes this burden from users and automatically detects the clusters of a given dataset. Furthermore, a method for accurately identifying the border objects of separate clusters is proposed and implemented to resolve the unpredictability of the original algorithm. Finally, our experiments show that our efficient re-implementation of the original algorithm, which clusters datasets automatically and improves the clustering quality of adjoining cluster members, yields higher clustering accuracy and faster running times than earlier approaches.
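The abstract does not spell out the two ε-selection methods, so the following is only a minimal sketch of the general idea of deriving ε from the k-nearest neighbor distance distribution. The helper names (`kth_neighbor_distances`, `estimate_eps`), the toy data, and the choice of the median as the summary statistic are all assumptions made for illustration; they are not the authors' method.

```python
import math
import statistics


def kth_neighbor_distances(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor."""
    distances = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point, sorted ascending.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(others[k - 1])
    return distances


def estimate_eps(points, k):
    """Hypothetical heuristic (an assumption, not the paper's rule):
    use the median of the k-NN distances as the eps candidate."""
    return statistics.median(kth_neighbor_distances(points, k))


# Toy data: two tight groups plus one far-away point that should look like noise.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]
k = 2
knn = kth_neighbor_distances(points, k)
eps = estimate_eps(points, k)

# Points whose k-NN distance exceeds eps sit in sparse regions.
sparse = [p for p, d in zip(points, knn) if d > eps]
print(eps, sparse)
```

In this toy case the dense points share the same k-NN distance, so the median lands on the within-cluster spacing and only the isolated point is flagged as sparse. A robust statistic is used here because a single distant point inflates the mean and standard deviation of the k-NN distances considerably.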