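Shanghai Jiao Tong University
Master's Thesis
Research and Implementation of Key Technologies for Internet Public Opinion
Information Management and Control
Name: Li Ruopeng (李若鹏)
Degree applied for: Master
Major: Communication and Information Systems
Supervisors: Li Jianhua (李建华); Li Xiang (李翔)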
20080101
Research and Implementation of Key Technologies for Internet Public Opinion
Information Management and Control
Abstract
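Internet content is characterized by novelty, rapid change, and a constant
stream of new categories. In response, this thesis studies in some depth
several key technologies for the management and control of public opinion
information, and designs a Chinese text clustering model, CTCM.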
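The thesis first analyzes Chinese word segmentation, text feature
selection, the structure of the Chinese language, and segmentation
dictionaries. It then proposes and implements new-word discovery based on
forward maximum matching. The algorithm can promptly detect trending words
of arbitrary length and allows the dictionary to be adjusted dynamically.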
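The abstract does not spell out the new-word discovery procedure, so the following is only a minimal sketch of plain forward maximum matching. The heuristic that joins runs of unmatched single characters into candidate words is an assumption made for illustration, not the thesis's algorithm; the names `forward_max_match`, `candidate_new_words`, `vocab`, and `max_len` are hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Segment text by greedily taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidate_new_words(tokens, vocab):
    """Join runs of consecutive out-of-vocabulary single characters.

    Assumed heuristic for illustration only; the thesis may work differently.
    """
    candidates, run = [], ""
    for tok in tokens + [""]:  # the empty sentinel flushes the final run
        if len(tok) == 1 and tok not in vocab:
            run += tok
        else:
            if len(run) > 1:
                candidates.append(run)
            run = ""
    return candidates


vocab = {"互联网", "舆情", "信息", "管控"}  # hypothetical toy dictionary
tokens = forward_max_match("互联网舆情信息管控", vocab)
```

A candidate that recurs often enough across documents could then be added to `vocab`, which is one way to read the "dynamically adjusted dictionary" mentioned above.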
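Second, after analyzing, comparing, and experimenting with a range of
clustering algorithms, and taking into account the characteristics of
Internet public opinion management and control, the thesis proposes a novel
clustering algorithm, DK, which combines a density-based method with
CFK-Means. DK greatly reduces computational complexity and running time,
and it overcomes the dependence of plain K-Means on the initial number of
clusters and the initial cluster centers. Extensive experimental data show
that DK significantly improves clustering speed and accuracy.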
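The abstract gives no details of DK or CFK-Means. The sketch below only illustrates the general idea of using a density pass to choose the number of clusters and the initial centers, followed by ordinary K-Means in place of CFK-Means. The functions `density_seeds`, `assign`, and `k_means`, and the parameters `radius` and `min_density`, are hypothetical and should not be read as the thesis's method.

```python
import math


def density_seeds(points, radius, min_density=2):
    """Pick well-separated dense points as initial centers (assumed heuristic)."""
    # Density of a point = number of points (itself included) within `radius`.
    density = [sum(1 for q in points if math.dist(p, q) <= radius) for p in points]
    order = sorted(range(len(points)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if density[i] < min_density:
            break  # every remaining point is at least as sparse
        # Skip points that fall inside the neighbourhood of a chosen seed.
        if all(math.dist(points[i], s) > radius for s in seeds):
            seeds.append(points[i])
    return seeds


def assign(points, centers):
    """Group points by their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, centers, iterations=20):
    """Plain K-Means (not CFK-Means) started from the given centers."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return centers, assign(points, centers)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = density_seeds(points, radius=2)
centers, clusters = k_means(points, seeds)
```

In this toy run the density pass yields two seeds, so neither the number of clusters nor the starting centers has to be supplied by hand, which is the K-Means weakness the paragraph above refers to.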
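Finally, the thesis applies the idea of text clustering to the automatic
generation of class descriptions. Each paragraph is treated as a short
text; the similarity between the class and each such text is computed, and
the paragraphs and sentences most similar to the class are selected to form
the class description.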
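As a rough illustration of selecting the most representative paragraph, the sketch below scores every paragraph against a term-frequency profile of the whole class. Cosine similarity over raw term counts is an assumption, since the abstract does not name the similarity measure; `cosine` and `describe_cluster` are hypothetical helpers, and the toy documents are invented for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def describe_cluster(documents):
    """Return the paragraph most similar to the class as a whole.

    `documents` is a list of documents; each document is a list of
    paragraphs; each paragraph is a list of tokens.
    """
    # Class profile: term counts pooled over every paragraph in the class.
    profile = Counter(tok for doc in documents for para in doc for tok in para)
    paragraphs = [para for doc in documents for para in doc]
    return max(paragraphs, key=lambda para: cosine(Counter(para), profile))


documents = [  # invented toy data, already tokenized
    [["stock", "market", "falls"], ["weather", "sunny"]],
    [["stock", "market", "market", "rally"], ["news"]],
]
best = describe_cluster(documents)
```

The same scoring could be repeated at sentence level inside the winning paragraph to obtain a shorter description.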
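Testing shows that the Chinese text clustering system designed and
implemented in this thesis detects and classifies hot topics in Internet
public opinion information in a timely manner, and effectively improves
network management and control.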
Keywords: public opinion information management and control, text clustering, DK, CTCM
KEY TECHNIQUE OF INFORMATION MANAGEMENT AND
CONTROL RESEARCH OVER INTERNET
ABSTRACT
This thesis studies in some depth several key technologies for information
management and control over the Internet. Because Internet content is new
and changes quickly, the thesis designs a Chinese Text Clustering Model
(CTCM).
First, the thesis reviews prior work on word segmentation, feature
selection, and Chinese word structure. It proposes and implements a new
Chinese word segmentation method based on forward maximum matching. This
method can discover new words of arbitrary length and update the word
dictionary automatically.
Second, building on a study of existing clustering methods, the thesis
proposes a new clustering method, DK, which combines a density-based
method with the CFK-Means method. It not only reduces computing time but
also overcomes the weakness of K-Means when the initial cluster centers
are inappropriate. A large number of experiments show that the DK method
improves both clustering accuracy and efficiency.
Finally, the thesis applies the clustering method to the generation of
class descriptions. Every paragraph is regarded as a small text; we
compute the similarity between the class and each small text, and choose the closest
paragraph and sentence to make up t