Distance-Based Partition Clustering Algorithm
Ye Ruofen, Li Chunping
(School of Software, Tsinghua University, Beijing 100084)
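Abstract: The k-means algorithm is widely regarded as one of the more efficient algorithms for clustering large data sets. However, it can only be applied to sets of data objects described by numeric attributes, known as numeric data; it cannot be applied to the many real-world data sets whose objects are described by other kinds of attributes, such as color, texture, and shape. To cluster such categorical data, k-means has been extended into two new algorithms: k-modes and k-prototypes. Both, however, require the user to specify in advance the number of clusters k, the threshold r, and the cluster centers Q. Determining these three parameters with reasonable accuracy is difficult when the distribution of the data is unknown. The improved k-modes algorithm presented in this paper addresses this problem effectively.
Keywords: clustering, k-means, k-modes, k-prototypes, dissimilarity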
Distance-based Partition Clustering Algorithm
Ye Ruofen Li Chunping
(School of Software, Tsinghua University, Beijing 100084, China)
Abstract: The k-means algorithm is well known for its efficiency in clustering large data sets. However, because it works only on numeric values, it cannot be used to cluster real-world data containing categorical values, such as data described by attributes like color, texture, and shape. To cluster categorical values, the k-modes and k-prototypes algorithms have been proposed. Yet both require users to predefine the number of clusters, the cluster centers, and the initial threshold. These values are difficult to judge without understanding the distribution of the original data. This paper addresses the issue with an improved k-modes algorithm.
Keywords: clustering, k-means, k-modes, k-prototypes, dissimilarity
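1 Introduction
Data mining is one of the most active branches of database research, development, and application. Discovering useful knowledge and interesting data patterns in large volumes of data by non-trivial methods has become a natural need. 随着数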