Information: The Language of Biology
Gary Strong
NSF, ITR Program
Cell
Human Language
Suggestive Biology-Language Homologies
Goals Leading Toward Predictive Biology
Gene Sequence Data
Gene Identification
Protein Circuit & Network Discovery
Biosimulation
Structure Prediction
Natural Language Processing and Bioinformatics are Already Related
Both natural language (NL) processing and biology must mine massive amounts of data
Applying NLP tools to biology
Hidden Markov gene finders
Protein grammars to predict function from sequence
Protein circuit extraction from scientific literature
Convergence of biological and language mining
PSI-BLAST homology searching in genomes, augmented by medical literature
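To make the "Hidden Markov gene finders" item concrete, here is a minimal sketch of Viterbi decoding over a two-state HMM. Everything in it is an illustrative assumption rather than anything taken from GENSCAN or another real gene finder: the state names ("N" for noncoding, "C" for coding), the start, transition, and emission probabilities, and the function name `viterbi`. Real gene finders use far richer state models (exons, introns, splice sites, codon phase).

```python
import math

# Hypothetical toy model: "N" = noncoding (AT-rich), "C" = coding (GC-rich).
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("ATATATATGCGCGCGCGC"))
```

With these made-up parameters, an AT-rich prefix followed by a GC-rich run is labeled noncoding and then coding, because the emission gain over the GC-rich run outweighs the cost of a single state switch.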
Model-based GENSCAN Is Best Among HMM Gene Finders
Sn = Sensitivity
Sp = Specificity
Ac = Approximate Correlation
ME = Missing Exons
WE = Wrong Exons
GENSCAN Performance Data,
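The abbreviations above can be sketched in code, assuming the definitions conventionally used in gene-prediction benchmarks (the slides give only the names, so the formulas are an assumption). Sn, Sp, and Ac are computed from nucleotide-level counts. Note that Sp in this convention is TP / (TP + FP), not the TN-based specificity used elsewhere. ME and WE are computed at the exon level from coordinate intervals. The function names and the inclusive `(start, end)` interval format are hypothetical.

```python
def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, Ac) from nucleotide-level counts."""
    sn = tp / (tp + fn)  # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)  # fraction of predicted coding bases that are true
    # Average conditional probability, rescaled to the range [-1, 1].
    acp = (tp / (tp + fn) + tp / (tp + fp)
           + tn / (tn + fp) + tn / (tn + fn)) / 4
    ac = (acp - 0.5) * 2
    return sn, sp, ac


def overlaps(a, b):
    """True if inclusive intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_errors(true_exons, predicted_exons):
    """Return (ME, WE): missing-exon and wrong-exon fractions."""
    missing = sum(1 for t in true_exons
                  if not any(overlaps(t, p) for p in predicted_exons))
    wrong = sum(1 for p in predicted_exons
                if not any(overlaps(p, t) for t in true_exons))
    return missing / len(true_exons), wrong / len(predicted_exons)


if __name__ == "__main__":
    print(nucleotide_accuracy(80, 20, 880, 20))
    print(exon_errors([(1, 10), (20, 30), (40, 50)], [(5, 12), (60, 70)]))
```

The sketch does not guard against zero denominators (for example, no predicted coding bases), which a real evaluation script would need to handle.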
The Chomsky Hierarchy
Language | Automaton | Grammar | Recognition | Dependency | Biology
Regular Languages | Finite-State Machine | Regular (A → cA) | Linear | Strictly Local | Central Dogma
Context-Free Languages | Pushdown (stack) | Context-Free (S → gSc) | Polynomial | Nested | Orthodox 2° Structure
Context-Sensitive Languages | Linear-Bounded | Context-Sensitive (At → aA) | NP-complete | Crossing | Pseudoknots, etc.
Recursively Enumerable Languages | Turing Machine | Unrestricted (Baa → A) | Undecidable | Arbitrary | Unknown
From D. Searls
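The difference between nested and crossing dependencies in the hierarchy can be illustrated with a short sketch. The first function recognizes the language of the context-free rule S → gSc using a stack, assuming an empty-string base case that the slide does not state. The second checks whether a set of base pairs contains a crossing pair, which is the pattern that pseudoknots introduce and that a single stack cannot track. The function names and the `(i, j)` index-pair format are hypothetical.

```python
def accepts_nested(seq):
    """Recognize g^n c^n: S -> gSc, with an assumed base case S -> empty."""
    stack = []
    i = 0
    # Push one symbol for every leading "g".
    while i < len(seq) and seq[i] == "g":
        stack.append("g")
        i += 1
    # Pop one symbol for every following "c" while the stack is non-empty.
    while i < len(seq) and seq[i] == "c" and stack:
        stack.pop()
        i += 1
    # Accept only if all input was consumed and every "g" was matched.
    return i == len(seq) and not stack


def has_crossing(pairs):
    """True if two base pairs (i, j) and (k, l) interleave as i < k < j < l."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for a in range(len(ordered)):
        i, j = ordered[a]
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return True
    return False


if __name__ == "__main__":
    print(accepts_nested("gggccc"))                 # nested, stem-like pairing
    print(has_crossing([(0, 9), (1, 8), (2, 7)]))   # purely nested pairs
    print(has_crossing([(0, 5), (3, 8)]))           # pseudoknot-like crossing
```

Nested pairs such as (0, 9), (1, 8), (2, 7) correspond to an ordinary stem, while interleaved pairs such as (0, 5), (3, 8) are the kind of crossing dependency that motivates grammars beyond context-free.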
Mildly Context-Sensitive Grammars (CSGs) for Structure Modeling
Yasuo UEMURA et al.
Tree Adjunct Grammars (TAG) have been applied to modeling RNA secondary structures including pseudoknots.
An efficient parsing algorithm for this grammar was developed and applied to computational problems concerning RNA secondary structures.
Further, a (-1) frameshift grammar is constructed, based on the biological observation that a (-1) frameshift might be caused by certain structural features of RNA sequences.
The proposed grammar was used to find candidate sequences for (-1) frame s