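Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
Master's Thesis
Research on Feature Selection Methods for Text Classification
Name: Song Jiang (宋江)
Degree applied for: Master's
Major: Computer Science and Technology
Advisor: Xu Min (徐敏)
2010-12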
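Abstract
With the continuing development of computer technology and related technologies, the volume of
text information is growing explosively. How to help people effectively obtain the information they
need has become an urgent problem in the field of information processing. One effective way to
manage text is to categorize it, so text classification is a means of helping people locate the
information they need accurately and efficiently and of organizing that information effectively.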
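This thesis first introduces the basic concepts of text classification, describes the
classification process and its difficulties, and discusses the related techniques, including text
preprocessing, text representation, and weight computation. Feature selection is a key technique in
text classification, so the thesis concentrates on feature selection algorithms. It describes
several common feature selection methods in detail and, to address the shortcomings of the
traditional TFIDF algorithm, proposes TDE, a feature selection method based on a TFIDF formula that
incorporates information entropy, and applies it to text classification. The thesis also analyzes
various common text classification algorithms and compares their strengths and weaknesses.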
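The abstract names the two ingredients of TDE, TFIDF weighting and information entropy, but does not
state the formula. The following Python sketch is therefore only an illustration of those two
ingredients and is not the TDE method itself. The function names, the choice of entropy over class
labels, and the combination rule in illustrative_scores (total TFIDF divided by one plus the
entropy) are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per document (docs are lists of tokens)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores


def class_entropy(term, docs, labels):
    """Shannon entropy (in bits) of a term's occurrences across class labels."""
    counts = Counter()
    for doc, label in zip(docs, labels):
        counts[label] += doc.count(term)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return sum(p * math.log2(1.0 / p) for p in probs)


def illustrative_scores(docs, labels):
    """Hypothetical combination (NOT the thesis's TDE formula):
    total tf-idf of a term, damped when the term is spread evenly over classes."""
    totals = Counter()
    for per_doc in tfidf_scores(docs):
        for term, value in per_doc.items():
            totals[term] += value
    return {
        t: v / (1.0 + class_entropy(t, docs, labels)) for t, v in totals.items()
    }


if __name__ == "__main__":
    docs = [
        ["ball", "game", "the"],
        ["ball", "team", "the"],
        ["stock", "market", "the"],
        ["stock", "price", "the"],
    ]
    labels = ["sport", "sport", "finance", "finance"]
    ranked = sorted(illustrative_scores(docs, labels).items(), key=lambda kv: -kv[1])
    for term, score in ranked:
        print(f"{term}\t{score:.3f}")
```

In this toy corpus a term such as "the", which occurs in every document and is spread evenly over
both classes, receives a score of zero, while class-specific terms such as "ball" score higher. A
feature selector built this way would keep the highest-scoring terms.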
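Finally, the thesis discusses the performance evaluation framework for text classification systems
and presents several common methods for evaluating classification results. Experiments comparing
several feature selection methods demonstrate the effectiveness of the TDE method.
Keywords: text classification, feature selection, vector space model, TFIDF algorithm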
ABSTRACT
With the development of computer technology and the improvement of related technologies, the
number of documents has increased exponentially. How to access the information people need
effectively has become an urgent problem in the information processing domain. One effective way
to manage texts is to classify them, which is known as text classification. Text categorization is
therefore an effective solution that helps users locate, organize, and manage their information
effectively.
This thesis first introduces the concept of text categorization and explains the process and
difficulties of text classification. It then analyzes the essential technologies in detail, such as
text preprocessing, text representation, and weight computation. Feature selection has always been
a key technology and a bottleneck in text classification, so the thesis focuses on feature
selection algorithms, examining and evaluating many text feature selection algorithms in depth. It
proposes a new method, TDE, based on TFIDF, which applies the traditional feature weighting
function TFIDF to feature selection and combines it with information entropy. TDE is then applied
to text classification. This thesis also discusses
several general text classification met