文档介绍:Pattern Discoveryin Biological Sequences: A Review
ChengXiang Zhai
Language Technologies Institiute
School puter Science
Carnegie Mellon University
Presentation at the Biological Language Modeling Seminar, June17, 2002
Outline
Biology
Computer
Science
Basic Concepts (“Common Language”)
Pattern
Discovery
Motivation
Formalization
Algorithm
Application
Basic Concepts
Alphabet & Language
Alphabet = set of symbols, ., ={A, T, G, C} is the nucleotide alphabet
String/Sequence (over an alphabet) = finite seq. of symbols, ., w=AGCTGC ( How many different nucleotide strings of length 3 are there?)
Language (over an alphabet) = set of strings, ., L={AAA, AAT, ATA, AGC, …, AGG} all nucleotide triplets starting with A.
Example:“Essential AA Language”
The language (set) of “essential” amino acids
on the alphabet {A, U, C, G}
L={CAC, CAU, …, UAC, UAU}
The ic Code
Questions to Ask about a Language (L)
Syntax & Semantics
How do we describe L and interpret L?
Recognition
Is sequence s in L or not?
Learning
Given example sequences in L and not in L, how do we learn L? What if given sequences that either match or do not match a sub-sequence in L ?
Syntax & Semantics of Language
Syntax: description of the form of sequences
“Surface” description: enumeration
“Deep” description: a concise decision rule or a characterizing pattern, .,
L contains all the triplets ending with “A”, or
L contains all sequences that match “AGGGGGA”
Semantics: meaning of sequences
Functional description of a amino acid sequence
Gene regulation of a nucleotide sequence
Recognizing Sequences in L
Recognizer (for L): given a sequence s, it tells us if s is in L or not. An operational way of describing L!
L (G-receptors)
* ( all protein sequences)
Is the sequence “SNASCTTNAP…TGAK” a G-receptor?
Algorithm
(G-rec. Recgonizer)
0 (no)
1 (yes)
More than “recognizing”...
Can the recognizer explain why a sequence is a “G-receptor”? Is the explanation biologically meaningf