Research on Feature-Distribution-Based Sampling for Imbalanced Text Categorization*

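Zhang Aihua 1, Wang Bin 1, Xu Yan 2
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190
2 Beijing Language and Culture University, Beijing, 100083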
E-mail: ******@ict.
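Abstract: Among the many approaches to imbalanced text categorization, data-level methods are the most
flexible and the most widely used. However, traditional data-level methods suffer from overfitting, loss of
useful information, and increased classifier training time. This paper proposes a series of strategies that
alleviate these problems to some extent. Borrowing from the SMOTE algorithm the idea of Over-Sampling by
constructing new minority-class samples, we process each feature dimension independently and thereby
achieve truly feature-based sampling. We model the weight distribution of each feature in the minority class
with a Gaussian mixture model, which has the strongest ability to model distributions of various kinds, then
draw new weights from that model and combine them into new samples that are added to the minority-class
training set. Because the method samples entirely at the feature level, it effectively avoids overfitting.
Moreover, because the modeling and sampling of each feature strictly follow the feature's original
distribution, the constructed samples are of high quality. Experimental results show that the method
performs well and is significantly better than the SMOTE algorithm.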
Keywords: text categorization; imbalanced data; Over-Sampling; feature-based sampling; Gaussian mixture model
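
The sampling procedure summarized in the abstract (fit a Gaussian mixture to each feature's weights in the minority class, draw new weights per feature, combine them into new samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the number of mixture components `k`, the plain EM fitting loop, the clipping of negative draws to zero, and all function names (`fit_gmm_1d`, `sample_gmm_1d`, `oversample_minority`) are assumptions made for illustration only.

```python
import math
import random

def fit_gmm_1d(values, k=2, iters=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    rng = random.Random(seed)
    n = len(values)
    mean_all = sum(values) / n
    var_all = sum((v - mean_all) ** 2 for v in values) / n + 1e-6
    means = rng.sample(values, k)          # initial means; requires k <= n
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each value
        resp = []
        for v in values:
            dens = [w * math.exp(-(v - m) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)
                    for w, m, s in zip(weights, means, variances)]
            total = sum(dens)
            if total == 0.0:               # numerical underflow: fall back to uniform
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate mixing weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, values)) / nj
            variances[j] = sum(r[j] * (v - means[j]) ** 2
                               for r, v in zip(resp, values)) / nj + 1e-6
    return weights, means, variances

def sample_gmm_1d(params, rng):
    """Draw one value: pick a component by its mixing weight, then sample it."""
    weights, means, variances = params
    j = rng.choices(range(len(weights)), weights=weights)[0]
    # Assumption: feature weights are non-negative, so negative draws are clipped.
    return max(0.0, rng.gauss(means[j], math.sqrt(variances[j])))

def oversample_minority(minority, n_new, k=2, seed=0):
    """Model every feature dimension independently, then combine per-feature draws."""
    rng = random.Random(seed)
    n_features = len(minority[0])
    models = [fit_gmm_1d([row[f] for row in minority], k=k, seed=seed)
              for f in range(n_features)]
    return [[sample_gmm_1d(m, rng) for m in models] for _ in range(n_new)]

# Toy minority class: 4 documents, 3 feature weights each (made-up numbers).
minority = [[0.0, 0.8, 0.1], [0.1, 0.7, 0.0], [0.0, 0.9, 0.2], [0.2, 0.6, 0.1]]
synthetic = oversample_minority(minority, n_new=5)
```

Because each dimension is drawn independently, a synthetic sample is not an interpolation between existing documents, which is the main contrast with SMOTE's neighbor-based construction. In practice a library mixture model could replace the hand-written EM loop.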

Feature-based Sampling for Imbalanced Text
Categorization

Zhang Aihua1, Wang Bin1, Xu Yan2
1Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190
2Beijing Language and Culture University, Beijing, 100083
E-mail: ******@ict.
Abstract: Among all solutions to imbalanced text categorization, data-level ones are the most flexible and most
widely used. However, traditional data-level solutions may cause overfitting, lose useful
information, or increase the complexity of training a classifier. This paper aims to overcome these
problems. Borrowing from SMOTE the idea of over-sampling by constructing new instances of the
minority class, we implement “true” feature-based sampling by processing each dimension
independently. Gaussian Mixture Model, which is co