Multi- and Megavariate data analysis of hierarchical biological data
Lennart Eriksson and Mark Earll
Contents
Introduction
Hierarchical modelling – for easier model interpretation
Application of Hi-modelling to QSAR data set
Multi- vs Megavariate data analysis
Conclusions
11/11/2017
Introduction: PLS Weight plot hard to interpret
QSAR data set
N = 10
K = 30
M = 255
Variable selection?
Risky – may change interpretation
”Surviving” variables take over the importance of the deleted ones
Weakened possibility of diagnosing outliers
Alternative - Hierarchical modelling
Hi-PCA & PLS are useful when variables can be blocked, e.g., process data and QSAR
Top level provides overview
Base level allows ”zooming” onto interesting sub-sets of data
Example data set (SIRAC) – more details
The 10 training compounds were selected by SMD:
CH2Cl2 (2)
CHCl3 (3)
CCl3F (7)
CH2Cl-CH2Cl (11)
CHCl2-CHCl2 (15)
CH3-CH2Br (30)
CH3-CHBr2 (33)
CBr3F (39)
CH3-CHCl-CH3 (48)
CH3-CH2-CH2-CH2Br (52)
Multivariate Biological Profiles
Block 1: In vivo Acute Toxicity, K = 2
Log Acute toxicity to rat, ”LD50”
Log Highest non-lethal dose to mouse, ”HNLD”
Block 2: In vivo Sub-acute toxicity in albino rat (28 days), K = 248
Block 2a: Body and organ weights, 72 responses
Block 2b: Hematology Data, 64 responses
Block 2c: Clinical Chemistry Data, 112 responses
Block 3: In vitro measurements, K = 3
Cytotoxicity to Chinese hamster V79 cells, ”EC20”
Genotoxicity to V79 cells using DNA precipitation assay, ”Slope”
Cytotoxicity to human HeLa cells, ”HeLa”, IC50
Block 4: Environmental responses, K = 2
Atmospheric persistence, log rate constant with OH-radical, ”k(oh)”
Soil/sediment persistence, log k for reductive dehalogenation in sediment/water mixture, ”log k”
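The block sizes listed above add up to the M = 255 responses of the data set. A minimal sketch of that bookkeeping follows. The short block keys and the column order (blocks laid out left to right as listed) are assumptions made for illustration, not something the slides specify.

```python
# Number of responses per block, as stated on the slide.
# The keys "1", "2a", ... are illustrative labels (assumption).
block_sizes = {"1": 2, "2a": 72, "2b": 64, "2c": 112, "3": 3, "4": 2}

# Assign each block a contiguous range of column indices in the
# response matrix Y, assuming the blocks are stored in the listed order.
column_ranges = {}
start = 0
for name, size in block_sizes.items():
    column_ranges[name] = range(start, start + size)
    start += size

M = start  # total number of responses
k_block2 = sum(size for name, size in block_sizes.items()
               if name.startswith("2"))  # sub-blocks 2a + 2b + 2c

print(M, k_block2)  # 255 248
```

With such ranges, any one block can be sliced out of the full response matrix for a separate base-level model.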
Arrangement of data – base level
PCA was used to summarize block Y2
Four PCs obtained
Compact summary of Y2
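The base-level step can be sketched as follows: a 10 x 248 block is compressed into four score vectors that then stand in for the whole block. This is only a sketch under stated assumptions. The data are synthetic placeholders (not the SIRAC measurements), the unit-variance scaling is assumed, and the NIPALS-style iteration was chosen here so that the example runs without third-party libraries; the slides do not state which PCA algorithm or scaling was used.

```python
import random

def autoscale(X):
    """Mean-centre every column and scale it to unit variance (assumed preprocessing)."""
    n = len(X)
    scaled_cols = []
    for col in zip(*X):
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / (n - 1)) ** 0.5
        sd = sd if sd > 0 else 1.0  # constant columns are only centred
        scaled_cols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled_cols)]

def nipals_pca(X, n_components, tol=1e-12, max_iter=200):
    """Extract components one at a time; return (scores, loadings, explained)."""
    E = [row[:] for row in X]  # residual matrix, deflated after each component
    n, k = len(E), len(E[0])
    total_ss = sum(v * v for row in E for v in row)
    scores, loadings, explained = [], [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        j0 = max(range(k), key=lambda j: sum(row[j] ** 2 for row in E))
        t = [row[j0] for row in E]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: project the residual columns onto the current score.
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
            norm = sum(v * v for v in p) ** 0.5
            p = [v / norm for v in p]
            # Score: project the residual rows onto the unit-length loading.
            t_new = [sum(a * b for a, b in zip(row, p)) for row in E]
            change = sum((a - b) ** 2 for a, b in zip(t, t_new))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the fitted component from the residual.
        for i in range(n):
            for j in range(k):
                E[i][j] -= t[i] * p[j]
        scores.append(t)
        loadings.append(p)
        explained.append(sum(v * v for v in t) / total_ss)
    return scores, loadings, explained

# Synthetic placeholder for block Y2: 10 compounds x 248 responses.
# Four latent factors are built in by construction (assumption for the demo).
rng = random.Random(0)
N, K = 10, 248
latent = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(N)]
weights = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(4)]
Y2 = [[sum(latent[i][a] * weights[a][j] for a in range(4)) + 0.1 * rng.gauss(0, 1)
       for j in range(K)] for i in range(N)]

scores, loadings, explained = nipals_pca(autoscale(Y2), n_components=4)
T = [list(row) for row in zip(*scores)]  # 10 x 4 summary of the block
print(len(T), len(T[0]))  # 10 4
```

The four columns of `T` are the compact summary: 248 responses are replaced by four score variables per compound, while `loadings` keeps the link back to the original responses for later interpretation.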
Principal Components Analysis - PCA
Multivariate pr