[Kaggle] Spam/Ham Email Classification: spam email classification (RNN/G...)
Exercise link:
Related posts
1. Loading the data
Read the data. The test set has no labels.
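import pandas as pd
import numpy as np
# The read calls are reconstructed; the file paths are missing in the source.
train = pd.read_csv("")
test = pd.read_csv("")
train.head()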
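The data contains invalid cells
# Count the missing cells in each column
print(np.sum(np.array(train.isnull() == True), axis=0))
print(np.sum(np.array(test.isnull() == True), axis=0))
Both sets contain NaN cells (per-column counts, train first, then test):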
[0 6 0 0]
[0 1 0]
Fill them with fillna
train = train.fillna(" ")
test = test.fillna(" ")
print(np.sum(np.array(train.isnull() == True), axis=0))
print(np.sum(np.array(test.isnull() == True), axis=0))
Filling is complete; every column count is now 0:
[0 0 0 0]
[0 0 0]
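The counting and filling steps above can be tried on a toy frame. This is a minimal sketch: the frame `toy` and its values are made up for illustration, and only the column names follow the post. `isna().sum()` is used here as a shorter pandas idiom that gives the same per-column counts as the `np.sum(...)` expression.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror the training data.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "subject": ["hello", np.nan, "offer"],
    "email": ["how are you", "win a prize", "limited time"],
    "spam": [0, 1, 1],
})

# Missing cells per column.
missing_before = toy.isna().sum()
print(missing_before.to_dict())

# Replace every missing cell with a single space, as in the post.
toy_filled = toy.fillna(" ")
missing_after = toy_filled.isna().sum()
print(missing_after.to_dict())
```

Filling with a space rather than dropping rows keeps every sample, which matters for the test set, where each id needs a prediction.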
The label y takes only two values: 0 means not spam, 1 means spam
print(train['spam'].unique())
[0 1]
2. Text processing
Merge the email body and the subject into a single feature
X_train = train['subject'] + ' ' + train['email']
y_train = train['spam']
X_test = test['subject'] + ' ' + test['email']
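Convert the text into sequences of token ids
# The module path was lost in extraction; keras.preprocessing.text is assumed
from keras.preprocessing.text import Tokenizer
max_words = 300
tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')
# Only the most frequent words (ids below max_words) are kept; the rest are ignored
tokenizer.fit_on_texts(list(X_train) + list(X_test))  # fit the tokenizer
X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_test_tokens = tokenizer.texts_to_sequences(X_test)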
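To make the effect of `num_words` concrete, here is a minimal pure-Python sketch of frequency-ranked tokenization. It is not the Keras implementation: the helper names `fit_word_index` and `texts_to_ids` are hypothetical, the toy texts are made up, and punctuation filtering is left out. It does follow the documented Keras behaviour that id 0 is reserved and only ids below `num_words` survive, so at most `num_words - 1` distinct words are kept.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets id 1 (0 is reserved)."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split(" ") if w)
    ranked = [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index, num_words):
    """Map each text to word ids, dropping words whose id is >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split(" ") if w in word_index]
        sequences.append([i for i in ids if i < num_words])
    return sequences


# Hypothetical toy corpus.
toy_texts = ["free prize free money", "meeting at noon", "free money now"]
word_index = fit_word_index(toy_texts)
sequences = texts_to_ids(toy_texts, word_index, num_words=3)
print(sequences)
```

With `num_words=3` only the two most frequent words ("free" and "money") keep their ids, so a text made entirely of rarer words becomes an empty sequence.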
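Pad the id sequences so that they all have the same length
# The token sequences differ in length, so pad them
maxlen = 100
# The module path was lost in extraction; keras.preprocessing is assumed
from keras.preprocessing import sequence
X_train_tokens_pad = sequence.pad_sequences(X_train_tokens, maxlen=maxlen, padding='post')
X_test_tokens_pad = sequence.pad_sequences(X_test_tokens, maxlen=maxlen, padding='post')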
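What the padding call does can be sketched in NumPy. The helper `pad_post` below is hypothetical and only mirrors the settings used above: `padding='post'` puts zeros at the end, while the default `truncating='pre'` of `pad_sequences` means that a sequence longer than `maxlen` loses ids from its start.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Post-pad with `value`; longer sequences keep only their last `maxlen` ids."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]           # 'pre' truncation: drop ids from the front
        out[row, :len(kept)] = kept    # 'post' padding: the fill value stays at the end
    return out


# Hypothetical id sequences of different lengths.
padded = pad_post([[5, 3], [7, 1, 4, 2, 9], []], maxlen=4)
print(padded)
```

For emails this means that, with `maxlen=100`, a long message is represented by its last 100 kept tokens rather than its first 100.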
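3. Building the model
embeddings_dim = 30  # dimension of the word-embedding vectors
# The module paths were lost in extraction; keras.models and keras.layers are assumed
from keras.models import Model, Sequential
from keras.layers import Embedding, LSTM, GRU, SimpleRNN, Dense
model = Sequential()
model.add(Embedding(input_dim=max_words,  # Size of the vocabulary
                    output_dim=embeddings_dim,  # dimension of the word embeddings
                    input_length=maxlen))
model.add(GRU(units=64))  # can be swapped for SimpleRNN or LSTM
model.add(Dense(units=1, activation='sigmoid'))
model.summary()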
Model structure:
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 100, 30)           9000
_________________________________________________________________
gru (GRU)                    (None, 64)                18432
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65
=================================================================
Total params: 27,497
Trainable params: 27,497
Non-trainable params: 0
_________________________________________________________________
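The parameter counts in the summary can be checked by hand. The sketch below is plain arithmetic; the helper `recurrent_params` is hypothetical. It assumes the GRU uses two bias vectors per gate (the `reset_after=True` variant), which is what reproduces the 18,432 shown above. The LSTM and SimpleRNN figures are what the same formula gives if the layer is swapped, not numbers taken from the post.

```python
vocab_size, embed_dim, units = 300, 30, 64

# Embedding: one embed_dim-sized vector per vocabulary id.
embedding_params = vocab_size * embed_dim


def recurrent_params(n_gates, input_dim, units, biases_per_gate=1):
    """Each gate has an input kernel, a recurrent kernel, and its bias vector(s)."""
    return n_gates * (units * (input_dim + units) + biases_per_gate * units)


gru_params = recurrent_params(3, embed_dim, units, biases_per_gate=2)
lstm_params = recurrent_params(4, embed_dim, units)
simple_rnn_params = recurrent_params(1, embed_dim, units)

# Dense: one weight per GRU unit plus one bias.
dense_params = units * 1 + 1

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)  # 9000 18432 65 27497
print(simple_rnn_params, lstm_params)                     # 6080 24320
```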
4. 训练
(optimizer='adam',
='binar loss y_crossentropy, '
=m['aectrciucsracy]') # 配置模型
history= (X_train_tokens_pad, y_train,
b=at12ch_size8, epoch=s10, validation_sp= )
(") "# 保存训练好的模型
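For reference, the loss and metric named in the compile step can be written out in NumPy. This is a minimal sketch rather than the Keras code: the function names are hypothetical, the clipping constant `eps` is an assumption, and the 0.5 cut-off in `accuracy` is the usual default for binary accuracy.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predicted = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))


# Hypothetical labels and sigmoid outputs.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.4]
loss_value = binary_crossentropy(labels, probs)
acc_value = accuracy(labels, probs)
print(loss_value, acc_value)
```

A confident wrong prediction is penalised much more heavily by the loss than by accuracy, which is why the two curves can move differently during training.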
Plot the training curves
from matplotlib import pyplot as plt
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.show()
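The plot works because `history.history` is a dict that maps each metric name to a list with one value per epoch. A small sketch with made-up numbers shows the layout; the key names assume a recent Keras version with `metrics=['accuracy']` and a validation split, and the values are hypothetical, not results from this model.

```python
import pandas as pd

# Hypothetical per-epoch values, only to show the layout of history.history.
toy_history = {
    "loss": [0.60, 0.35, 0.20],
    "accuracy": [0.70, 0.88, 0.94],
    "val_loss": [0.50, 0.33, 0.36],
    "val_accuracy": [0.78, 0.89, 0.88],
}

curves = pd.DataFrame(toy_history)             # one row per epoch, one column per metric
best_epoch = int(curves["val_loss"].idxmin())  # 0-based epoch with the lowest validation loss
print(curves.shape, best_epoch)
```

Calling `.plot()` on such a frame draws one line per column against the epoch index, which is what the plotting code above relies on.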
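5. Testing
pred_prob = model.predict(X_test_tokens_pad).squeeze()
# The threshold and the target dtype are missing in the source
pred_class = (pred_prob > ...).astype(...)
id = test['id']
output = pd.DataFrame({'id': id, 'Class': pred_class})
output.to_csv("", index=False)  # output file name missing in the source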
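The prediction step can be tried without a trained model. In this sketch the probabilities and ids are made up, and `THRESHOLD = 0.5` is an assumption (a common choice for a sigmoid output), since the cut-off used in the post is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities, shaped (n_samples, 1) like a sigmoid output.
toy_prob = np.array([[0.91], [0.08], [0.55]])
toy_ids = pd.Series([10, 11, 12], name="id")

THRESHOLD = 0.5  # assumed cut-off; the value used in the post is not shown

flat_prob = toy_prob.squeeze()                    # (3, 1) -> (3,)
toy_class = (flat_prob > THRESHOLD).astype(int)   # True/False -> 1/0
submission = pd.DataFrame({"id": toy_ids, "Class": toy_class})

csv_text = submission.to_csv(index=False)         # returns a string when no path is given
print(csv_text)
```

`index=False` keeps the pandas row index out of the file, so the CSV has exactly the two columns `id` and `Class`.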
Comparison of the three RNN models: