Document overview: text mining of Chinese and English text with R
Required packages: tm (text mining), rJava, Snowball, zoo, XML, slam, Rz, RWeka, matlab
1 Text mining overview
Text mining extracts implicit information from large volumes of text data.
... tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
$`reut-...`
indonesia seen crossroads economic change
jeremy clift, reuters
jakarta, march 1 -
indonesia appears nearing political crossroads measures deregulate pr u.s. embassy says new report. counter falling oil revenues, government measures past nine months boost exports outside oil sector attract new
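Because the corpus has already been converted to lowercase and had prepositions and similar function words removed, each sentence is reduced to a sequence of its distinctive words.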
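The preprocessing described above can be sketched as follows. This is a minimal illustration, not the original author's code: it assumes a recent tm release (where base functions such as `tolower` must be wrapped in `content_transformer`), uses a hypothetical one-sentence corpus in place of the Reuters sample, and uses tm's built-in English stop-word list as a stand-in for "prepositions and similar words".

```r
library(tm)

# Hypothetical one-document corpus standing in for the Reuters sample.
docs <- VCorpus(VectorSource(
  "Indonesia appears to be nearing a political crossroads"
))

# Lowercase the text, drop English stop words, then collapse the
# extra spaces that word removal leaves behind.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)

cleaned <- content(docs[[1]])
print(cleaned)
```

Only the content words survive, which is why the inspected document reads as a bare word sequence.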
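Build the term-frequency matrix and view its contents
dtm <- DocumentTermMatrix(reuters)
Inspect the term counts for part of the matrix. In dtm, each row identifies a document and each column represents a term.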
> inspect(dtm[10:15,110:120])
A document-term matrix (6 documents, 11 terms)
Non-/sparse entries: 6/60
Sparsity           : 91%
Maximal term length: 9
Weighting          : term frequency (tf)

      Terms
Docs   activity. add added added. address addressed adherence adhering advantage advisers agency
  [1,]         0   0     0      0       0         0         1        1         0        0      2
  [2,]         0   0     0      0       0         0         0        0         0        0      0
  [3,]         0   0     0      0       0         0         0        0         0        0      1
  [4,]         0   0     0      0       0         0         0        1         0        0      2
  [5,]         0   0     0      0       0         0         0        0         0        0      0
  [6,]         0   0     0      0       0         0         0        0         0        0      0
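The sparsity figure in the output above follows directly from the two entry counts. The short base-R sketch below reproduces it; the helper name `sparsity_pct` is hypothetical and is not part of tm.

```r
# Hypothetical helper: percentage of zero cells, given the
# "Non-/sparse entries" counts that inspect() reports.
sparsity_pct <- function(non_sparse, sparse) {
  round(100 * sparse / (non_sparse + sparse))
}

# The 6 x 11 slice above has 66 cells: 6 non-zero and 60 zero.
slice_sparsity <- sparsity_pct(non_sparse = 6, sparse = 60)
cat(sprintf("Sparsity: %.0f%%\n", slice_sparsity))
```

With 60 of the 66 cells empty, the ratio rounds to the 91% shown by `inspect`.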

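View documents containing specific terms
To examine how often specific terms occur across several documents, you can index the matrix by term and document name, or build a dictionary by hand and pass it as an argument when generating the matrix.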
> inspect(tdm[c("price", "texas"), c("127", "144", "191", "194")])
A term-document matrix (2 terms, 4 documents)
Non-/sparse entries: 6/2
Sparsity           : 25%
Maximal term length: 5
Weighting          : term frequency (tf)

       Docs
Terms   127 144 191 194
  price   2   1   2   2
  texas   1   0   0   2
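The name-based indexing used above can be tried on a small corpus. This sketch assumes the tm package is installed and uses three hypothetical documents; `VectorSource` assigns them the IDs "1", "2" and "3", which play the role of the Reuters IDs "127", "144" and so on.

```r
library(tm)

# Hypothetical three-document corpus (IDs "1", "2", "3").
docs <- VCorpus(VectorSource(c(
  "texas crude price rises",
  "oil price steady",
  "price talks in texas and more texas output"
)))

# Terms in rows, documents in columns.
tdm_small <- TermDocumentMatrix(docs)

# Select rows by term and columns by document ID.
sub <- tdm_small[c("price", "texas"), c("1", "3")]
counts <- as.matrix(sub)
print(counts)
```

Converting the selection with `as.matrix` gives an ordinary numeric matrix, which is convenient for small slices but can be memory-hungry for a full corpus.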
> inspect(DocumentTermMatrix(reuters,
+     list(dictionary = c("prices", "crude", "oil"))))
A document-term matrix (20 documents, 3 terms)
Non-/sparse entries: 41/19
Sparsity           : 32%
Maximal term length: 6
Weighting          : term frequency (tf)

     Terms
Docs  crude oil prices
  127     3   5      4
  144     0  11      4
  191     3   2      0
  194     4   1      0
  211     0   2      0
  236     1   7      2
  237     0   3      0
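The dictionary approach shown above restricts the matrix to a fixed vocabulary. A minimal sketch on a hypothetical two-document corpus, assuming the tm package is installed, looks like this; the dictionary is passed through the `control` argument, which is the same list the call above supplies positionally.

```r
library(tm)

# Hypothetical two-document corpus.
docs <- VCorpus(VectorSource(c(
  "crude oil prices rise as oil demand grows",
  "gold prices fall"
)))

# Hand-built dictionary: only these terms become columns.
dict <- c("prices", "crude", "oil")
dtm_dict <- DocumentTermMatrix(docs, control = list(dictionary = dict))

m <- as.matrix(dtm_dict)
print(m)
```

Terms outside the dictionary, such as "gold", are dropped, so the matrix stays small however large the corpus vocabulary is.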
Metadata operations (terms)
Find terms whose occurrence count is at or above a given value
findFreqTerms(dtm, 5)  # terms that occur at least 5 times
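To see what the threshold means, the sketch below runs `findFreqTerms` on a hypothetical two-document corpus (assuming the tm package is installed) and compares the result with a manual column-sum check. The second argument is the lower bound on a term's total count across all documents, and the bound is inclusive.

```r
library(tm)

# Hypothetical corpus: "oil" occurs 3 + 2 = 5 times in total,
# "prices" twice and "crude" once.
docs <- VCorpus(VectorSource(c(
  "oil oil oil prices",
  "oil oil crude prices"
)))
dtm_small <- DocumentTermMatrix(docs)

# Terms whose total frequency is at least 5.
frequent <- findFreqTerms(dtm_small, 5)

# The same selection done by hand from the column totals.
totals <- colSums(as.matrix(dtm_small))
by_hand <- names(totals)[totals >= 5]

print(frequent)
print(by_hand)
```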