A Hierarchical Method for Determining the Number of Clusters
CHEN Li-Fei1, JIANG Qing-Shan2+, WANG Sheng-Rui3
1(Department of Computer Science, Xiamen University, Xiamen 361005, China)
2(School of Software, Xiamen University, Xiamen 361005, China)
3(Department of Computer Science, University of Sherbrooke, Quebec J1K 2R1, Canada)
+ Corresponding author: Phn: +86-592-2186707, E-mail: ******@xmu., http:// ./View/shizi/
Received 2007-04-01; Accepted 2007-10-09
Abstract: A fundamental and difficult problem in cluster analysis is determining the “true” number of
clusters in a dataset. The common trial-and-error method generally depends on a particular clustering algorithm and is
inefficient when processing large datasets. In this paper, a hierarchical method is proposed that avoids repeatedly
clustering large datasets. The method first obtains the CF (Clustering Feature) by scanning the dataset and
agglomeratively generates hierarchical partitions of the dataset; a curve of the clustering quality with respect to the
varying partitions is then constructed incrementally. Finally, the partition corresponding to the extremum of the curve is used
to compute the number of clusters. A new validity index is also presented to quantify the clustering quality;
it is independent of the clustering algorithm and emphasizes the geometric features of clusters, handling
noisy data and arbitrarily shaped clusters efficiently. Experimental results on both real-world and synthetic datasets
demonstrate that the new method outperforms recently published approaches, while efficiency is
significantly improved.
Key words: clustering; clustering validity index; statistics; number of clusters; hierarchical clustering
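The pipeline outlined in the abstract (one scan to build CFs, agglomerative merging into hierarchical partitions, a quality curve over the partitions, and selection at the extremum) can be illustrated with a minimal sketch. Several parts are assumptions, not the authors' method: the CF is taken to be the BIRCH-style triple (count, linear sum, sum of squares), merging uses Ward's criterion, and the Calinski-Harabasz index stands in for the paper's new validity index, which this excerpt does not define. All function names (`make_cf`, `merge_cf`, `estimate_k`, and so on) are hypothetical.

```python
from itertools import combinations

def make_cf(point):
    """CF of one point: (count, linear sum, sum of squared norms). Assumed BIRCH-style."""
    return (1, tuple(point), sum(x * x for x in point))

def merge_cf(a, b):
    """CFs are additive, so a merge never revisits the raw points."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def sse(cf):
    """Within-cluster sum of squared errors recovered from a CF."""
    n, ls, ss = cf
    return ss - sum(x * x for x in ls) / n

def merge_cost(a, b):
    """Increase in total SSE caused by merging a and b (Ward's criterion, a stand-in)."""
    (na, la, _), (nb, lb, _) = a, b
    d2 = sum((x / na - y / nb) ** 2 for x, y in zip(la, lb))
    return na * nb / (na + nb) * d2

def quality(cfs, total_cf):
    """Calinski-Harabasz index from CFs only (stand-in for the paper's index)."""
    k, n = len(cfs), total_cf[0]
    within = sum(sse(cf) for cf in cfs)
    if within <= 0.0:
        return float("-inf")  # degenerate partition, never selected
    between = sse(total_cf) - within
    return (between / (k - 1)) / (within / (n - k))

def estimate_k(points, k_max):
    """Return (best_k, curve); assumes at least 3 points and k_max >= 2."""
    cfs = [make_cf(p) for p in points]  # single scan over the dataset
    total = cfs[0]
    for cf in cfs[1:]:
        total = merge_cf(total, cf)
    curve = {}
    while len(cfs) > 2:
        # Pick the cheapest pair to merge, then record quality for the new partition.
        i, j = min(combinations(range(len(cfs)), 2),
                   key=lambda ij: merge_cost(cfs[ij[0]], cfs[ij[1]]))
        merged = merge_cf(cfs[i], cfs[j])
        cfs = [cf for idx, cf in enumerate(cfs) if idx not in (i, j)] + [merged]
        if len(cfs) <= k_max:
            curve[len(cfs)] = quality(cfs, total)
    return max(curve, key=curve.get), curve  # extremum of the quality curve

# Toy data: three small, well-separated groups of five points each.
offsets = [(0.0, 0.0), (-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]
best_k, curve = estimate_k(points, k_max=6)
print(best_k, curve)
```

The dataset is clustered only once: every partition in the hierarchy, and every point on the quality curve, is derived from the additive CFs. The pair search here is brute force and is meant for illustration on small inputs only.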
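Abstract (translated from the Chinese): The trial-and-error method generally depends on a specific
clustering algorithm, [...] it does not need to cluster the dataset
repeatedly, but instead first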