Computing with Whole Genomes
Stuart M. Brown
Computing, NYU School of Medicine
The Human Genome Project
Genome Sequencing
The ability to sequence entire genomes has created a huge demand for bioinformatics
Simple data management for the sequencing projects
Genome assembly
Annotation
Public access to the data
New types of whole genome analyses
Genome sequencing factories churn out raw sequence data at an ever-increasing rate
Fewer scientists are involved in generating data and more are involved in data analysis
Sequence Pipeline
Laboratory Information Management - track samples, store raw data
Assemble fragments
Track orientation and distance for paired reads from libraries of clones of known size
Find genes
Gene prediction algorithms
Map known genes and cDNAs
Annotation and public access to data
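The paired-read step in the pipeline above can be sketched in a few lines. This is only an illustration, not the method of any particular assembler: the `PlacedRead` record, the `mate_pair_consistent` function, and the `insert_size` and `tolerance` parameters are hypothetical names introduced here. The idea is that two reads taken from opposite ends of the same clone should land on an assembled contig pointing toward each other, at roughly the known clone size apart.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    """A read placed on an assembled contig (hypothetical record for this sketch)."""
    contig: str
    start: int   # 0-based leftmost position on the contig
    length: int  # read length in bases
    strand: str  # "+" or "-"


def mate_pair_consistent(a, b, insert_size, tolerance):
    """Return True if two mates face each other at about the expected distance."""
    if a.contig != b.contig:
        # Mates on different contigs cannot be checked for distance here.
        return False
    left, right = sorted((a, b), key=lambda r: r.start)
    # Reads from opposite ends of one clone should point toward each other.
    if left.strand != "+" or right.strand != "-":
        return False
    # Distance from the start of the left read to the end of the right read.
    span = right.start + right.length - left.start
    return abs(span - insert_size) <= tolerance


if __name__ == "__main__":
    fwd = PlacedRead("contig1", 100, 500, "+")
    rev = PlacedRead("contig1", 1600, 500, "-")
    print(mate_pair_consistent(fwd, rev, insert_size=2000, tolerance=200))
```

A pair that fails this check is a hint that the fragments may have been joined incorrectly, for example across a repeat.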
Raw Genome Data:
Finding genes in genome sequence is not easy
About 1% of human DNA encodes functional genes.
Genes are interspersed among long stretches of non-coding DNA.
Repeats, pseudo-genes, and introns confound matters
The next step is obviously to locate all of the genes and describe their functions. This will probably take another 15-20 years!
UCSC
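To see why gene finding is hard, it helps to look at the simplest possible approach: scanning all six reading frames for open reading frames (ORFs) that run from an ATG to a stop codon. The sketch below is a toy, not a real gene predictor; `find_orfs` and its `min_codons` parameter are names made up for this example. It knows nothing about introns, so in a genome where coding sequence is split into exons and interspersed with long non-coding stretches, a scan like this misses most real gene structure and reports many spurious hits.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, orf_sequence) for each ATG-to-stop ORF found.

    min_codons counts the start and stop codons as part of the ORF.
    Frames and sequences on the "-" strand refer to the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for hit in find_orfs("CCATGAAATTTGGGTAACC", min_codons=3):
        print(hit)
```

Lowering `min_codons` quickly floods the output with short chance ORFs, while raising it discards short exons, which is one face of the accuracy problem.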
Gene Prediction Works Poorly
Algorithms are not accurate
non-consensus splice sites
where is the true first 5' exon?
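The splice-site problem can be illustrated with a small check. Most introns begin with GT and end with AG, and prediction programs lean heavily on that signal; introns that break the rule are easy to miss. The function below is a minimal sketch under that assumption. The name `classify_intron` and its coordinate convention (0-based start, exclusive end) are choices made for this example only.

```python
def classify_intron(genome, intron_start, intron_end):
    """Classify an intron by its first and last two bases.

    intron_start is the 0-based index of the first intronic base;
    intron_end is one past the last intronic base.
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have donor and acceptor sites")
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical"
    return "non-consensus"


if __name__ == "__main__":
    print(classify_intron("AAAGTCCCCAGTTT", 3, 11))
    print(classify_intron("AAAGCCCCCAGTTT", 3, 11))
```

A check like this only tests the boundary dinucleotides; it says nothing about whether a candidate boundary is actually used by the cell.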
cDNA data is incomplete and confusing
truncated cDNA sequences
real alternative splicing
Pseudo-genes and true gene duplication vs. mistakes in the genome assembly