A Bayesian-Network-Based Method for Term Discovery
GUI Yao-Cheng, GAO Zhi-Qiang
School of Computer Science and Engineering, Southeast University, Nanjing 211189
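Abstract: Academic semantic search systems use terms with explicit meanings to describe
specific research domains. The main approaches to term discovery measure the relationship
between terms and papers and use it as the feature of a term. Academic semantic search
systems do not provide the full text of papers, and a short-text corpus built from paper
titles and abstracts limits how well such features can be measured. This paper proposes a
new term discovery method. First, the relationships between terms and other entities are
measured as new features of terms. Then, based on the relationships among the features of
terms, a TRBN (term recognition Bayesian network) model based on Bayesian networks is
constructed, which combines the features of a term as the basis for term discovery.
Experiments are conducted on a corpus of 7,750,000 paper titles and 4,500,000 paper
abstracts from the telecommunication and computer science domains. The TRBN-based method
exceeds the baseline method by 10% in precision, which is a satisfactory result.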
Keywords: natural language processing; term discovery; Bayesian network
CLC number: TP312
A Bayesian Network for Automatic Term Recognition
GUI Yao-Cheng , GAO Zhi-Qiang
School of Computer Science and Engineering, Southeast University, Nanjing 211189
Abstract: Terms with explicit meanings are used in academic semantic search systems
to represent specific research domains. Most work on Automatic Term Recognition
(ATR) measures the relationship between a term and papers as the feature of the term.
Academic semantic search systems do not provide full papers, and a short-text corpus
built from paper titles and abstracts limits how well this feature can be measured. This
paper proposes a novel ATR approach. First, new types of features are obtained by measuring
the relationships between terms and other entities. Second, based on the relations among
the features of a term, the TRBN (term recognition Bayesian network) model, which is
represented by a Bayesian network, is proposed to integrate the features. Experiments on a
corpus containing 7,750,000 titles and 4,500,000 abstracts from the domains of
telecommunication and computer science show that the new approach performs well,
outperforming the baseline method by 10% in precision.
Key words: Natural Language Processing; Automatic Term Recognition; Bayesian Network
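Funding: National Natural Science Foundation of China (60873153, 60803061, 61170165)
About the authors: GUI Yao-Cheng (b. 1987), male, master's student; main research interests:
semantic search, ontology learning. Corresponding author: GAO Zhi-Qiang (b. 1966), male,
professor; main research interests: multi-agent systems, natural language processing, Semantic Web.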
0 Introduction
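Semantic search combines Semantic Web technology with search systems in order to improve
the search results of current search systems [1]. An academic semantic search system is a
semantic search system whose search targets are the entities of academic activity in a
specific domain. An entity is something that exists independently and can be distinguished
from other things; the entities that an academic semantic search system is concerned with
include papers, researchers, and research institutions. An academic semantic search system
needs a way to describe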