文档介绍:IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 7, JULY 2002 881
An Efficient k-Means Clustering Algorithm:
Analysis and Implementation
Tapas Kanungo, Senior Member, IEEE, David M. Mount, Member, IEEE,
Nathan S. Netanyahu, Member, IEEE, Christine D. Piatko, Ruth Silverman, and
Angela Y. Wu, Senior Member, IEEE
AbstractÐIn k-means clustering, we are given a set of n data points in d-dimensional space Rd and an integer k and the problem is to
determine a set of k points in Rd, called centers, so as to minimize the mean squared distance from each data point to its nearest center.
A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's
k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only
major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis
of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we
present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization,
pression, and image segmentation.
Index TermsÐPattern recognition, machine learning, data mining, k-means clustering, nearest-neighbor searching, k-d tree,
computational geometry, knowledge discovery.
æ
1INTRODUCTION
d
LUSTERING problems arise in many different applica- integer k, the problem is to determine a set of k points in R ,
Ctions, such as data mining and knowledge discovery called centers, so as to minimize the mean squared distance
[19], pression and vector quantization [24], and from each data point to its nearest center. This measure is
pattern recognition and pattern classification [16]. The often called the squared-error distortion