DB Seminar Series: HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering
Presented by:
Kevin Yip
20 September 2002
Short Summary
Our own work (unpublished), supervised by Dr. Cheung and Dr. Ng
Problem: to cluster datasets of very high dimensionality
Assumption: clusters are formed in subspaces
Short Summary
Previous approaches: either have special restrictions on the dataset or target clusters, or cannot determine the dimensionality of the clusters automatically
Our approach: not restricted by these limitations
Presentation Outline
Clustering
Projected clustering
Previous approaches to projected clustering
Our approach: HARP
Concepts
Implementation
Experiments
Future work and conclusions
Clustering
Goal: given a dataset D with N records and d attributes (dimensions), partition the records into k disjoint clusters such that
Intra-cluster similarity is maximized
Inter-cluster similarity is minimized
Clustering
How to measure similarity?
Distance-based: Manhattan distance, Euclidean distance, etc.
Correlation-based: cosine correlation, Pearson correlation, etc.
Link-based (common neighbors)
Pattern-based
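The distance- and correlation-based measures named above can be sketched in plain Python (the helper names are ours, for illustration; they are not part of HARP):

```python
import math

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # L2 distance: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # Cosine correlation: dot product divided by the product of norms
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def pearson(x, y):
    # Pearson correlation: cosine correlation of the mean-centered vectors
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(manhattan(x, y))  # 6.0
print(euclidean(x, y))  # ~3.742
print(cosine(x, y))     # 1.0 (same direction)
print(pearson(x, y))    # 1.0 (perfect linear correlation)
```

Note that distance measures treat scaled copies of a record as far apart, while the correlation measures above score them as identical, which is why the choice of measure matters for the cluster shapes found.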
Clustering
Common types of clustering algorithms:
Partitional: selects some representative points for each cluster, assigns all other points to their closest clusters, and then re-determines the new representative points
Hierarchical (agglomerative): repeatedly determines the two most similar clusters, and merges them
Clustering
Partitional clustering:
Dataset
Representatives
Assignment
Replacement
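The four steps above (dataset, representatives, assignment, replacement) follow the k-means pattern. A minimal sketch, assuming squared Euclidean distance and mean-based representatives (this is a generic partitional loop, not the HARP algorithm):

```python
def partitional_cluster(points, k, iters=10):
    # Representatives: start with the first k points (a naive initialization)
    reps = points[:k]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment: each point joins the cluster of its closest representative
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, reps[i])))
            clusters[j].append(p)
        # Replacement: the new representative is the mean of each cluster
        new_reps = []
        for i, c in enumerate(clusters):
            if not c:
                new_reps.append(reps[i])  # keep the old representative if empty
            else:
                new_reps.append([sum(col) / len(c) for col in zip(*c)])
        if new_reps == reps:
            break  # representatives stable: converged
        reps = new_reps
    return clusters, reps

data = [[0.0, 0.0], [0.0, 1.0], [9.0, 9.0], [10.0, 10.0]]
clusters, reps = partitional_cluster(data, 2)
# clusters separate the two tight groups near (0, 0.5) and (9.5, 9.5)
```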
Clustering
Hierarchical clustering:
Dataset
Similarity calculation
Best merge determination
Merging
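The agglomerative loop above can be sketched as follows, assuming single-link similarity over squared Euclidean distance (an illustrative choice; it is not HARP's merge criterion):

```python
def agglomerate(points, target_k):
    # Every record starts as its own singleton cluster
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def single_link(c1, c2):
        # Cluster similarity: distance of the closest pair across clusters
        return min(dist2(p, q) for p in c1 for q in c2)

    while len(clusters) > target_k:
        # Best merge determination: scan all cluster pairs for the closest two
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        # Merging: absorb cluster j into cluster i
        clusters[i].extend(clusters.pop(j))
    return clusters

result = agglomerate([[0.0, 0.0], [0.0, 1.0], [9.0, 9.0], [10.0, 10.0]], 2)
# result: the two nearby pairs end up merged into two clusters
```

The pairwise scan makes each merge step quadratic in the number of clusters, which is why practical hierarchical algorithms cache similarities rather than recomputing them.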
Projected Clustering
Assumption (general case): each cluster is formed in a subspace
Source of figures: ORCLUS (SIGMOD 2000)
Assumption (special case): each cluster has a set of relevant attributes
Goal: determine the records and relevant attributes of each cluster (to “