Document title: Text Classification Analysis Based on Latent Semantic Indexing and var-tree

Format: DOCX. Size: 407 KB. Pages: 53.
Uploaded by: wz_198613, 2018/5/12. File size: 407 KB

Abstract
With the spread of the Internet, the mobile Internet, computing, and other rapidly developing information technologies, people now exchange information through microblogging, networking, and other information media. This has made study, daily life, and work more convenient, but it has also produced a genuine information explosion. Obtaining useful information accurately and efficiently from this mass of information is therefore a research topic of great practical and theoretical importance. Automatic text classification emerged as a basic technology for processing and organizing large amounts of text data. Automatic text classification, or text classification (Text Categorization) for short, is a basic technology and an active research topic in information retrieval and data mining. Since the late 1950s it has attracted wide attention and made significant progress. It is widely used in mail classification, web content management, information filtering and warning, and conference calls.
This paper describes Chinese text classification and the related technical theory, including text preprocessing, text representation, feature extraction, feature weight calculation, text classification algorithms, and evaluation of classification results. It analyzes the advantages and disadvantages of the traditional KNN text classification algorithm and of the vector space model (Vector Space Model, VSM), and makes improvements based on the results of that analysis. The main work is as follows:
First, latent semantic indexing with singular value decomposition (LSI/SVD) is used to improve and extend the vector space model. The term-document matrix of the training set is decomposed to build a low-dimensional semantic space that replaces the original keyword-based vector space. On the one hand, while retaining the intuitive representation and ease of computation of the vector space model, the LSI model can eliminate the adverse effects of synonymy and polysemy, extracting the semantic information of a text and emphasizing it so that the text is described more accurately. On the other hand, it can rule out a lot of useless, interf