ABSTRACT (Chinese)
There are three fundamental problems in Chinese natural language processing: word segmentation, named entity recognition, and part-of-speech tagging. Unlike English, Chinese has no spaces between words, which makes Chinese NLP considerably more difficult.
This thesis discusses the application of the Maximum Entropy Model and Conditional Random Fields to Chinese natural language processing. For each model, we first present the mathematical background and the model's derivation, then the relevant implementation details, and finally how the model is applied to Chinese NLP tasks. For named entity recognition, we describe in detail how to extract features from domain knowledge, and how global features are used.
Experiments are conducted on two corpora: the People's Daily corpus and SIGHAN bakeoff 4. The results show that Conditional Random Fields outperform the Maximum Entropy Model in both precision and recall, and that domain knowledge smooths the model and alleviates overfitting to some extent.
Key words: Maximum Entropy Model, Conditional Random Fields, Local Features, Global Features, Segmentation, Named Entity Recognition, Part-Of-Speech Tagging
ABSTRACT
There are three fundamental problems in Chinese Natural Language Processing: Seg-
mentation, Named Entity Recognition and Part-Of-Speech Tagging. Chinese differs
greatly from English in that there are no spaces between Chinese words, which makes
Chinese much harder to process.
This paper discusses the Maximum Entropy Model and Conditional Random Fields
for Chinese Natural Language Processing. For each model, we first introduce the math-
ematical background and derivation. Then, we introduce implementation details, and
finally, the features we used for each problem. This paper describes how to extract
features from domain knowledge, and how global features are used.
This paper's experiments are based on the People's Daily corpus and SIGHAN bakeoff 4.
The results show that Conditional Random Fields perform better than the Maximum
Entropy Model on both precision and recall. Besides, domain knowledge can help to
smooth the model, and help to alleviate the problem of overfitting.
Key words: Maximum Entropy Model, Conditional Random Fields, Local Features,
Global Features, Segmentation, Named Entity Recognition, Part-Of-
Speech Tagging
CONTENTS
Chapter 1  Introduction
  Background and significance of this work
  History, current state, and analysis of domestic and international research
  Overview of this work
  Organization of this thesis
Chapter 2  Maximum