文档介绍:An Overview of Clustering Methods
With Applications to Bioinformatics
Sorin Istrail
Informatics Research
Celera Genomics
What is Clustering?
Given a collection of objects, put objects into groups based on similarity.
Used for “discovery-based” science, to find unexpected patterns in data.
Also called “unsupervised learning” or “data mining”
Inherently an ill-defined problem
Do we put Collins with Venter because they’re both biologists, or do we put Collins with Lander because they both work for the HGP?
Biologist
Mathemat-ician
Celera
HGP
Data Representations for Clustering
Input data to algorithm is usually a vector (also called a “tuple” or “record”)
Types of data
Numerical
Categorical
Boolean
Example: Clinical Sample Data
Age (numerical)
Weight (numerical)
Gender (categorical)
Diseased? (boolean)
Must also include a method puting similarity of or distance between vectors
Calculating Distance
Distance is the most natural method for numerical data
Lower values indicate more similarity
Distance metrics
Euclidean distance
Manhattan distance
Etc.
Does not generalize well to non-numerical data
What is the distance between “male” and “female”?
Calculating Numerical Similarity
Traditionally over the range [, ]
= no similarity, = identity
Converting distance to similarity
Distance and similarity are two sides of the same coin
To obtain similarity from distance, take the maximum pairwise distance and subtract from
Pearson correlation
Removes magnitude effects
In range [-, ]
- = anti-correlated, = no correlation, = perfectly correlated
In the example below, the red and blue lines have high correlation, even though the distance between the lines is significant
Calculating Boolean Similarity
Given two boolean vectors X and Y, let A be the number of places where both are 1, etc. as shown below.
Two standard methods for similarity given at right
Can be generalized to handle categorical data as well.
Boolean Similarity