Machine Learning for Information Extraction in
Informal Domains
Dayne Freitag
November, 1998
CMU-CS-99-104
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy.
Thesis Committee:
Tom Mitchell, Chair
Jaime Carbonell
David Evans
Oren Etzioni, University of Washington
© 1998 Dayne Freitag
This research was sponsored by Wright Laboratory, Aeronautical Systems Center under grant number
F33615-93-1-1330 and Rome Laboratory under grant number F30602-97-1-0215, both of the Air Force Materiel
Command-USAF, and by the Defense Advanced Research Projects Agency (DARPA). Part of this
research was conducted during a summer internship at Justsystem Pittsburgh Research Center.
The views and conclusions contained in this document are those of the author and should not be inter-
preted as representing the official policies, either expressed or implied, of any sponsoring party or the US
Government.
Keywords: machine learning, information extraction, information retrieval, multistrat-
egy learning
Abstract
Information extraction, the problem of generating structured summaries of human-oriented
text documents, has been studied for over a decade now, but the primary emphasis has been
on document collections characterized by well-formed prose (e.g., newswire articles). Solutions
have often involved the hand-tuning of general natural language processing systems
to a particular domain. However, such solutions may be difficult to apply to “informal” do-
mains, domains based on genres characterized by syntactically unparsable text and frequent
out-of-lexicon terms. With the growth of the Internet, such genres, which include email
messages, newsgroup posts, and Web pages, are particularly abundant, and there is no lack
of potential information extraction applications. Examples include a program to extract
names from personal home pages, or a system that monit