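A Short-Text Query Expansion Algorithm Based on a Topic Model
Liu Runnan, Chen Guang**
(School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876)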
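Abstract: In recent years, the demand for information retrieval over microblog short-text corpora
has become increasingly prominent. Query expansion, a key technique in information retrieval,
plays a very important role in optimizing query results. This paper proposes a Bayes-LDA based
method for modeling microblog corpora; the model can model microblog short texts completely
while preserving modeling quality. We also design a query expansion algorithm for microblog
corpora based on this model. Its core is to apply the Bayes-LDA modeling results to operations
such as the generation and selection of feature terms and the re-ranking of query results,
thereby improving short-text query performance. Experimental results show that, on the dataset
of the TREC 2011 Microblog evaluation, the algorithm outperforms the BM25 pseudo-relevance
feedback method on several major performance metrics.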
Keywords: natural language processing; query expansion; LDA model; short text; Bayesian theory; pseudo-relevance feedback
CLC number:
SHORT TEXT QUERY EXPANSION BASED ON TOPIC MODEL
Liu Runnan, Chen Guang
(School of Information and Communication Engineering, Beijing University of Posts and
Telecommunications, Beijing 100876)
Abstract: In recent years, the demand for microblog retrieval has been growing. As a key
technology in the field of information retrieval, query expansion is vital to optimizing retrieved
results. This paper proposes a Bayes-LDA based method for modeling microblog corpora. The model
guarantees both the quality and the completeness of the modeling of short texts such as microblogs. We
design a query expansion algorithm based on this model. Its core idea is to apply the modeling
results of Bayes-LDA to the generation and selection of expansion features and the re-ranking of search results.
Experiments show that, on the TREC 2011 Microblog evaluation corpus, this algorithm outperforms the
BM25 pseudo-relevance feedback method on several major performance indicators.
Key words: Natural Language Processing; Quer