Document name: R语言文本挖掘.doc

Uploaded by: 小s, 2022/4/11. File size: 204 KB

Text mining of Chinese and English text with R
Required packages: tm (text mining), rJava, Snowball, zoo, XML, slam, Rz, RWeka, matlab
1 Overview of text mining
Text mining is the process of extracting implicit [...] from large volumes of text data [...]
[...] value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
$`reut-…`
indonesia seen crossroads economic change
jeremy clift, reuters
jakarta, march 1 -
indonesia appears nearing political crossroads measures deregulate pr[...] u.s. embassy says new report. counter falling oil revenues, government measures past nine months boost exports outside oil sector attract new
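Because the corpus has already been converted to lower case and function words such as prepositions have been removed, each sentence is reduced to a sequence of the remaining content words.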
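The preprocessing described above can be sketched with tm as follows. This is a minimal sketch rather than the author's exact code: the two toy sentences are made-up stand-ins for the Reuters documents, and it assumes a recent tm release, where base functions such as tolower have to be wrapped in content_transformer().

```r
library(tm)

# Toy documents standing in for the Reuters texts (hypothetical data)
docs <- c("Indonesia appears to be nearing a political crossroads",
          "The government has launched a series of measures")
corpus <- VCorpus(VectorSource(docs))

# Lower-case every document; content_transformer() keeps the tm document type
corpus <- tm_map(corpus, content_transformer(tolower))
# Drop English stop words (prepositions, articles, auxiliaries and so on)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Collapse the runs of blanks left behind by the removed words
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- sapply(corpus, as.character)
print(cleaned)
```

Only content words survive, which is why the document printed above reads like a string of keywords rather than grammatical sentences.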
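Build the document-term matrix and inspect its contents
dtm <- DocumentTermMatrix(reuters)
Inspect the term counts for part of the matrix. In dtm, each row identifies a document and each column represents a term.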
> inspect(dtm[10:15,110:120])
A document-term matrix (6 documents, 11 terms)
Non-/sparse entries: 6/60
Sparsity           : 91%
Maximal term length: 9
Weighting          : term frequency (tf)
     Terms
Docs activity. add added added. address addressed adherence adhering advantage advisers agency
[1,]         0   0     0      0       0         0         1        1         0        0      2
[2,]         0   0     0      0       0         0         0        0         0        0      0
[3,]         0   0     0      0       0         0         0        0         0        0      1
[4,]         0   0     0      0       0         0         0        1         0        0      2
[5,]         0   0     0      0       0         0         0        0         0        0      0
[6,]         0   0     0      0       0         0         0        0         0        0      0
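The same steps, building the matrix and inspecting a rectangular block of it, can be tried on a small corpus. This is a sketch under stated assumptions: the three short documents are invented for illustration, and the row and column ranges are arbitrary.

```r
library(tm)

# Three made-up documents (hypothetical data, not the Reuters corpus)
docs <- c("oil prices rise",
          "crude oil output falls",
          "prices fall")
corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are raw term counts
dtm <- DocumentTermMatrix(corpus)
print(dim(dtm))

# Index by [documents, terms], exactly like an ordinary matrix
block <- dtm[1:2, 1:3]
inspect(block)

# A dense copy is convenient for small matrices
counts <- as.matrix(dtm)
print(counts)
```

For a large corpus the dense copy from as.matrix() can be very large, so indexing the sparse matrix first is usually the safer habit.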

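Inspect documents that contain specific terms
To examine how often particular terms occur across several documents, you can also build a dictionary by hand and pass it as an argument when the matrix is generated.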
> inspect(tdm[c("price", "texas"),c("127","144","191","194")])
A term-document matrix (2 terms, 4 documents)
Non-/sparse entries: 6/2
Sparsity           : 25%
Maximal term length: 5
Weighting          : term frequency (tf)
       Docs
Terms   127 144 191 194
  price   2   1   2   2
  texas   1   0   0   2
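A lookup by term and document name like the one above can be sketched as follows. The corpus is a made-up example, and tdm is assumed here to come from TermDocumentMatrix(), since the text shown does not say how it was created.

```r
library(tm)

# Made-up documents (hypothetical data)
docs <- c("oil price rises in texas",
          "crude oil output falls",
          "texas price falls again")
corpus <- VCorpus(VectorSource(docs))

# Terms in rows, documents in columns (the transpose of a document-term matrix)
tdm <- TermDocumentMatrix(corpus)

# Select rows by term and columns by document identifier
wanted_terms <- c("price", "texas")
wanted_docs  <- Docs(tdm)[c(1, 3)]   # identifiers of the first and third document
hits <- tdm[wanted_terms, wanted_docs]
inspect(hits)

hit_counts <- as.matrix(hits)
print(hit_counts)
```

Taking the identifiers from Docs(tdm) avoids hard-coding names such as "127" that exist only in the Reuters sample.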
> inspect(DocumentTermMatrix(reuters,
+     list(dictionary = c("prices", "crude", "oil"))))
A document-term matrix (20 documents, 3 terms)
Non-/sparse entries: 41/19
Sparsity           : 32%
Maximal term length: 6
Weighting          : term frequency (tf)
     Terms
Docs  crude oil prices
  127     3   5      4
  144     0  11      4
  191     3   2      0
  194     4   1      0
  211     0   2      0
  236     1   7      2
  237     0   3      0
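Restricting the matrix to a hand-made dictionary can be sketched like this. The documents are invented for illustration, and the dictionary is passed through the named control argument, which is the second argument the call above fills by position.

```r
library(tm)

# Made-up documents (hypothetical data)
docs <- c("crude oil prices rise as oil demand grows",
          "the embassy published a new report",
          "oil output falls")
corpus <- VCorpus(VectorSource(docs))

# Only the dictionary terms become columns; every other word is ignored
dict <- c("prices", "crude", "oil")
dtm_dict <- DocumentTermMatrix(corpus, control = list(dictionary = dict))
inspect(dtm_dict)

dict_counts <- as.matrix(dtm_dict)
print(dict_counts)
```

A document that contains none of the dictionary terms still gets a row, filled with zeros, so the matrix keeps one row per document.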
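Metadata operations (term level)
Find the terms whose number of occurrences is above a given value
findFreqTerms(dtm, 5)  # show terms whose frequency is greater than or equal to …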
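A small sketch of findFreqTerms() on an invented corpus follows. The second argument is the lower frequency bound and is inclusive; the threshold of 2 is chosen only to suit the toy data.

```r
library(tm)

# Made-up documents (hypothetical data)
docs <- c("oil prices rise",
          "crude oil output falls",
          "prices fall")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Terms whose total count over all documents is at least 2
frequent <- findFreqTerms(dtm, 2)
print(frequent)

# The same answer computed by hand from the column sums
totals <- colSums(as.matrix(dtm))
by_hand <- names(totals[totals >= 2])
print(by_hand)
```

An optional third argument, highfreq, sets an upper bound in the same way when very common terms should be excluded.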