Research on Web-Based Text Classification Mining
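Chinese Abstract
The Internet has become a vast source of information. How to make that information serve people better, and how to retrieve the information one needs quickly and accurately, is an important problem we face. Web-based information processing has therefore become a current research focus, and the study of text classification methods for the Web is one of the key topics in Web data mining.
This thesis introduces the theory of data mining, Web mining, and text classification; analyzes the characteristics of Web data; compares HTML with traditional data; and reviews several text classification algorithms, concentrating on the naive Bayes classification algorithm and the concrete process of improving it. It attempts to use HTML tag weights to compensate for the weakness of the conditional independence assumption in naive Bayes. It briefly describes existing techniques for filtering tags out of Web pages, and combines the useful information carried by the tags with a text classification algorithm to classify text. Finally, because the precision of the improved classifier is not yet satisfactory, the thesis summarizes the work to be pursued next on this topic and offers some of the author's own views.
Keywords
Web mining; naive Bayes; data mining; text classification; Web page tags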
Research on Web-Based Text Classification Mining
ABSTRACT
The Internet has become a great source of information. How to make this information serve people better, and how to obtain it quickly and accurately, is an important issue that we face. Information processing based on the Web is therefore a current research hotspot, and text categorization on the Web is one of the key topics in Web data mining.
This thesis introduces the theory of data mining, Web mining, and text classification, analyzes the features of Web data, and compares HTML with traditional data. It analyzes several text categorization algorithms, with emphasis on the naive Bayes classifier and the concrete process of improving the algorithm. The thesis tries to use HTML tag weights to compensate for the weakness of the naive Bayes classifier, namely its conditional independence assumption. In implementing the classifier, the thesis summarizes existing methods for filtering HTML tags, then uses the information in the tags together with the text categorization algorithm to classify text.
Finally, because the precision of the improved classifier is not ideal, the next steps for this subject are summarized and some of the author's own views are presented.
Xu Ying
Directed by Liu Li-zhen
Keywords
Web Mining; Naïve Bayes; Data Mining; Text Categorization; HTML Tags
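Contents
Chinese Abstract
English Abstract
Chapter 1 Introduction
Background and Significance of the Topic
Data Mining
Web Mining
Research Status and Development of Web Mining
Main Research Content and Organization of This Thesis
Chapter 2 Web-Based Text Classification Mining
Introduction
Preprocessing of Web Text
Web Text Data Collection
Text Word Segmentation
Text Feature Library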
Text