文档介绍:DB Seminar Series:The Subspace Clustering Problem
By: Kevin Yip
(17 May 2002)
Presentation Outline
Problem definition
Different approaches
Focus: the projective clustering approach
Problem Definition – Traditional Clustering
Traditional clustering problem:To divide data points into disjoint groups such that the value of an objective function is optimized.
Objective function: to minimize intra-cluster distance and maximize inter-cluster distance.
Distance function: define over all dimensions, numeric or categorical.
Problem Definition – Traditional Clustering
ExampleProblem: clustering points in 2-D space.Distance function: Euclidean distance (d: no. of dimensions, 2 in this case).
Problem Definition – Traditional Clustering
Example (source: CURE, SIGMOD 1998)
Problem Definition – Distance Function Problem
Observation: distance measures defined over all dimensions are sometimes inappropriate.
Example (source: DOC, SIGMOD 2002)
C1: (x1, x2)
C2: (x2, x3)
C3: (x1, x3)
Problem Definition – Distance Function Problem
As the number of noise dimensions increases, the distance functions e less and less accurate.
=> For each cluster, except the set of data points, we also need to find out the set of “related dimensions”(“bounded attributes”)
Problem Definition – The Subspace Clustering Problem
Formal Definition:Given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized.
Objective function: usually intra-cluster distance, each cluster uses its own set of dimensions in distance calculation.
Problem Definition – The Subspace Clustering Problem
Observation: normal distance functions (Manhattan, Euclidean, etc.) give a smaller value if less dimensions are involved.
=> 1. Use a normalized distance function.=> 2. Should also try to maximize the number of dimensions.
Example (DOC): score(C, D) = |C|(1/β)|D|, C = points in a c