BBS Comment Information Extraction Based on Web Page Segmentation
(School of Information and Communication Engineering, Beijing University of Posts and Telecommunications)
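Abstract: With the rapid development of the Internet, people pay increasing attention to Internet
technology and to the vast resources available online, and they expect to extract target information
from massive Internet data quickly and accurately. A forum is a website where people raise questions
on key issues, seek answers, and freely express their views. Extracting the comment information of
forums is essential for obtaining answers quickly and for tracking public opinion. This paper
therefore proposes a BBS information extraction technique based on web page segmentation; the
algorithm maintains accuracy while offering a degree of generality, reducing manual involvement and
development cost. First, the paper proposes a web page segmentation method based on information
theory to remove noise information. Second, since the comments on a BBS page are similar to one
another, the paper proposes, on top of the page segmentation, a depth-weighted DOM tree similarity
algorithm to extract the comments, which reduces manual involvement and development difficulty while
preserving accuracy. The algorithm extracts BBS comments quickly and accurately and has good
application prospects and reference value for public opinion analysis and for information extraction
in search engines.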
Keywords: web information extraction; web page segmentation; DOM tree
CLC number:
Information Extraction Based on Page Segmentation for BBS
Comments
JIA Lulu, XIAO Bo
(School of Information and Communication Engineering, Beijing University of Posts and
Telecommunications)
Abstract: With the rapid development of the Internet, people are increasingly concerned with the
large number of resources on the Internet, and more and more people expect to extract target
information from the mass of information. A forum is a web site where people can freely express
their views. Extracting the comment information of forums makes it possible to get answers quickly
and to analyse people's thoughts. To ensure accuracy, this paper puts forward a comment extraction
algorithm based on page segmentation, which reduces the cost of development. First of all, this paper
proposes a page segmentation method based on information theory to remove the noise information.
Secondly, as the comments have some similarities with each oth