Abstract
 
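    Grid- and density-based clustering algorithms are fast and can discover clusters of
arbitrary shape, which makes them well suited to clustering spatial data. However, existing
grid- and density-based clustering algorithms usually require the user to supply two
parameters, the grid granularity and the density threshold. This adds to the user's burden
and makes the clustering result hard to control.
The grid granularity determines the resolution at which the data are observed, and so
largely determines the clustering result; the grid size also affects the clustering speed.
Most existing algorithms obtain the grid granularity and density threshold by applying an
empirical formula to statistics such as the total number of data points and the average
density, which is a rather simplistic treatment.
Building on an analysis of the main clustering algorithms, and in particular of how they
handle the grid granularity and density threshold, this thesis first proposes that, for a
given density threshold, the grid granularity is optimal when the number of dense cells is
largest. On this basis, it proposes a method that, given a set of density thresholds,
determines the best density threshold and grid granularity from the maximum-dense-cell
principle together with the way dense and sparse cells arise during grid partitioning. The
grid granularity obtained in this way reflects the internal structure of the data without
getting lost in trivial local detail; it is appropriate for cluster analysis and is a good
compression of the data. The method greatly reduces the domain knowledge required of the
user and essentially achieves parameter-free clustering. Experiments show that the method
is fast and can find the main cluster structure of spatial data.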
Keywords: clustering algorithm; grid granularity; density threshold; parameter-free clustering
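
The maximum-dense-cell principle stated in the abstract can be illustrated with a minimal sketch. The abstract does not give the procedure in detail, so several things here are assumptions made for illustration only: the names `count_dense_cells` and `best_grid_size`, the use of equal-width cells over the bounding box of the data, the definition of a dense cell as one holding at least `min_pts` points, and the tie-break in favour of the coarser grid. The step that selects among several density thresholds is not reproduced.

```python
from collections import Counter


def count_dense_cells(points, k, min_pts):
    """Split the bounding box into k equal intervals per dimension and
    count the cells that hold at least min_pts points (assumed definition)."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    cells = Counter()
    for p in points:
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A zero-width dimension places every point in interval 0.
            i = int((p[d] - lows[d]) / span * k) if span > 0 else 0
            # The maximum coordinate is folded into the last interval.
            idx.append(min(i, k - 1))
        cells[tuple(idx)] += 1
    return sum(1 for n in cells.values() if n >= min_pts)


def best_grid_size(points, min_pts, candidates):
    """Return (k, dense_count) for the candidate k with the most dense cells.
    Ties go to the smaller k, i.e. the coarser grid (an assumption)."""
    best = None
    for k in sorted(candidates):
        dense = count_dense_cells(points, k, min_pts)
        if best is None or dense > best[1]:
            best = (k, dense)
    return best


# Two tight groups of three points each, far apart.
points = [(0, 0), (0.1, 0.1), (0.2, 0.0), (10, 10), (9.9, 9.8), (9.8, 10)]
print(best_grid_size(points, min_pts=3, candidates=[1, 2, 4, 100]))
```

With a grid that is too coarse the two groups merge into one dense cell, and with a grid that is too fine every point sits alone and no cell is dense; the dense-cell count peaks in between, which is the behaviour the principle relies on.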
 
 
ABSTRACT
Grid- and density-based clustering algorithms are fast and can find clusters of
arbitrary shape, which makes them suitable for clustering spatial data. However,
existing grid- and density-based clustering algorithms often require the user to supply
two parameters, the grid size and the density threshold, which increases the burden on
users and makes the clustering results hard to control. The grid size determines the
resolution at which the data are observed, and so largely determines the clustering
results. The grid size also affects the speed of a grid clustering algorithm. To set the
grid size and density threshold, most existing algorithms take the total number of data
points, the average density and other statistics, and apply an empirical formula to obtain
the two parameters. These methods are rather simplistic.
After analyzing the main clustering algorithms, especially how they handle the
grid size and density threshold, this paper first proposes that, for a given
density threshold, the grid size is optimal when the number of dense grid cells reaches
its maximum. On this basis, a new approach is proposed for obtaining the optimal grid size
and density threshold from a given set of density thresholds. This approach gets the optimal grid
size and density thresho