Abstract
With the spread of the Internet, the mobile Internet, computing, and other emerging information technologies, people now exchange information through micro-blogging, networking, and other media. This makes study, daily life, and work more convenient, but it has also produced a genuine information explosion. Obtaining useful information accurately and efficiently from this mass of data is therefore an important research topic, both practical and theoretical. Automatic text classification emerged as a basic technology for processing and organizing large amounts of text data. Automatic text classification, or text classification for short (Text Categorization), is a basic technology and an active research topic in information retrieval and data mining. Since the late 1950s it has attracted wide attention and made significant progress. It is widely used in mail classification, web content management, information filtering and warning, and conference calls.
This paper describes Chinese text classification and the related techniques and theory: text preprocessing, text representation, feature extraction, feature weight calculation, text classification algorithms, and evaluation of classification results. It analyzes the advantages and disadvantages of the traditional KNN text classification algorithm and the vector space model (Vector Space Model, VSM), and proposes improvements based on that analysis. The main work is as follows:
First, latent semantic indexing with singular value decomposition (LSI/SVD) is used to improve and extend the vector space model. The term-document matrix of the training set is decomposed to build a low-dimensional semantic space that replaces the original keyword-based vector space. While retaining the intuitive representation and computational convenience of the vector space model, the LSI model can eliminate the adverse effects of synonymy and polysemy, and it extracts and highlights the semantic information of a text so that the text is described more accurately. In addition, it can rule out a lot of useless, interf