Technique and Method
Improvement of the feature weight algorithm based on feature noise weighting*
Zhao Hang1, Yang Tianqi1, Zhao Xiaoxia2
( of Information Science and Technology, Jinan University, Guangzhou 510632, China;
puter, South China Normal University, Guangzhou 510631, China)
Abstract: The feature weight algorithm TF-IDF is one of the most important algorithms for text classification, but its IDF value fluctuates readily under the influence of feature noise. This paper proposes an improved feature weight algorithm based on feature noise weighting. By analyzing the distribution characteristics of noise features, the algorithm weights the feature noise that cannot accurately express the true meaning of a document, which reduces the influence of feature noise on IDF and ultimately improves the precision and robustness of the algorithm.
Key words: vector space model (VSM); text classification; feature noise; feature weighting; robustness
CLC number: TP391  Document code: A  Article ID: 1674-7720(2012)03-0066-03
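With the development of information technology, the volume of information has expanded enormously, and people urgently hope to find a technology that processes information automatically. Text classification, as one such information processing technology, can classify information and thereby makes obtaining information
the selection of noise features, but the appearance of feature noise in classification is unavoidable.
1 Analysis of the vector space algorithm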