Principal Component Analysis (PCA)
Data Reduction
summarization of data with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables.
[Figure: an n × p data matrix A is reduced to an n × k matrix X]
“Residual” variation is information in A that is not retained in X
balancing act between
clarity of representation, ease of understanding
oversimplification: loss of important or relevant information.
Principal Component Analysis (PCA)
probably the most widely-used and well-known of the “standard” multivariate methods
invented by Pearson (1901) and Hotelling (1933)
first applied in ecology by Goodall (1954) under the name “factor analysis” (“principal factor analysis” is a synonym of PCA).
takes a data matrix of n objects by p variables, which may be correlated, and summarizes it by uncorrelated axes (principal components or principal axes) that are linear combinations of the original p variables
the first k components display as much as possible of the variation among objects.
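The summary above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `pca`, the simulated data, and the choice of an eigendecomposition of the covariance matrix are all assumptions made for the example.

```python
import numpy as np

def pca(A, k):
    """Return (scores, axes, variances) for an n x p data matrix A.

    scores    : n x k matrix X of objects on the first k components
    axes      : p x k matrix; each column holds the weights of one
                linear combination of the original variables
    variances : variances of all p components, largest first
    """
    A = np.asarray(A, dtype=float)
    centered = A - A.mean(axis=0)              # move the centroid to the origin
    C = np.cov(centered, rowvar=False)         # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    axes = eigvecs[:, :k]
    scores = centered @ axes
    return scores, axes, eigvals

# Hypothetical data: n = 200 objects, p = 3 variables, two of them correlated.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
A = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

X, axes, variances = pca(A, k=2)
retained = variances[:2].sum() / variances.sum()
print("fraction of total variance retained by k = 2 components:", retained)
```

The remainder, `1 - retained`, is the share of variation in A that X leaves out.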
Geometric Rationale of PCA
objects are represented as a cloud of n points in a multidimensional space with an axis for each of the p variables
the centroid of the points is defined by the mean of each variable
the variance of each variable is the average squared deviation of its n values around the mean of that variable.
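As a small illustration of the centroid and the per-variable variance, the sketch below uses only the Python standard library. The data values and function names are hypothetical, and the divisor n - 1 (the usual sample variance) is an assumption; dividing by n instead gives the population form.

```python
from statistics import mean

def centroid(data):
    """Mean of each variable (column) across the n objects (rows)."""
    return [mean(col) for col in zip(*data)]

def variances(data):
    """Variance of each variable: squared deviations from that variable's
    mean, summed over the n objects and divided by n - 1 (assumed divisor)."""
    n = len(data)
    center = centroid(data)
    return [
        sum((x - c) ** 2 for x in col) / (n - 1)
        for col, c in zip(zip(*data), center)
    ]

# Hypothetical data: n = 4 objects (rows) by p = 2 variables (columns).
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 5.0], [8.0, 7.0]]

print("centroid:", centroid(data))
print("variances:", variances(data))
```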
degree to which the variables are linearly correlated is represented by their covariances.
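$$C_{ij} = \frac{1}{n-1}\sum_{m=1}^{n}\left(X_{im}-\bar{X}_i\right)\left(X_{jm}-\bar{X}_j\right)$$
where:
$C_{ij}$ is the covariance of variables i and j
the sum runs over all n objects
$X_{im}$ is the value of variable i in object m, and $\bar{X}_i$ is the mean of variable i
$X_{jm}$ is the value of variable j in object m, and $\bar{X}_j$ is the mean of variable j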
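The covariance between two variables can be computed directly from its definition, as in the sketch below. The function names and data are hypothetical, and the divisor n - 1 is assumed.

```python
from statistics import mean

def covariance(xi, xj):
    """Covariance of two variables measured on the same n objects:
    sum over objects of (deviation of i) * (deviation of j), over n - 1."""
    n = len(xi)
    mi, mj = mean(xi), mean(xj)
    return sum((a - mi) * (b - mj) for a, b in zip(xi, xj)) / (n - 1)

def covariance_matrix(data):
    """p x p matrix of covariances; the diagonal holds the variances."""
    cols = list(zip(*data))
    return [[covariance(ci, cj) for cj in cols] for ci in cols]

# Hypothetical data: n = 4 objects (rows) by p = 2 variables (columns).
data = [[2.0, 1.0], [4.0, 4.0], [6.0, 5.0], [8.0, 6.0]]

C = covariance_matrix(data)
for row in C:
    print(row)
```

The matrix is symmetric, and a positive off-diagonal entry means the two variables tend to rise together across objects.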
objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) that have the following properties:
ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis p has the lowest variance
covariance among each pair of the principal axes is zero (the principal axes are uncorrelated).
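For p = 2 the rotation has a closed form, which makes both properties easy to check by hand. The sketch below uses only the standard library. The data values and function names are hypothetical, and the closed-form rotation angle applies to the two-variable case only.

```python
import math

def cov(u, v):
    """Covariance of two equal-length sequences (divisor n - 1 assumed)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def rotate_to_principal_axes(x, y):
    """Center 2D points on their centroid, then rigidly rotate the axes
    by the angle that maximizes the variance along the first new axis."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    theta = 0.5 * math.atan2(2 * cov(x, y), cov(x, x) - cov(y, y))
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * (a - mx) + s * (b - my) for a, b in zip(x, y)]
    pc2 = [-s * (a - mx) + c * (b - my) for a, b in zip(x, y)]
    return pc1, pc2

# Hypothetical data: n = 10 objects measured on two correlated variables.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

pc1, pc2 = rotate_to_principal_axes(x, y)
print("variance on axis 1:", cov(pc1, pc1))
print("variance on axis 2:", cov(pc2, pc2))
print("covariance between axes:", cov(pc1, pc2))
```

Because the rotation is rigid, the total variance is unchanged: the two new variances sum to the same value as the variances of the original variables.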
2D Example of PCA