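Research and Application of a Chinese Word Segmentation Model Combining Dictionary and Statistical Methods
JIANG Jian-Hong1, ZHAO Song-Zheng1, LUO Mei1
(1. Northwestern Polytechnical University, Xi'an, Shaanxi Province 710129)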
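Abstract: Traditional dictionary-based and statistics-based word segmentation methods both have shortcomings. Dictionary-based methods are highly efficient but poor at recognizing new words, whereas statistics-based methods discover new words well but segment less efficiently. By processing text corpus data from a specific domain and combining the strengths of the two kinds of methods, these problems can be addressed fairly well, and a fast, highly accurate segmentation model is implemented.
Keywords: word segmentation; mmseg algorithm; mutual information; dictionary; statistics
CLC number: TP311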
Analysis and Application of a Chinese Word Segmentation Model Combining Dictionary-Based and Statistical Methods
JIANG Jian-Hong1, ZHAO Song-Zheng1, LUO Mei1
(1. Northwestern Polytechnical University, Xi'an, 710129, China)
Abstract: Both dictionary-based and statistics-based word segmentation methods have shortcomings. Dictionary-based segmentation is efficient but weak at discovering new words, while statistics-based segmentation discovers new words readily but runs slowly. Processing domain-specific text corpus data and combining the two segmentation methods offers a better way to resolve these problems, yielding a fast and accurate segmentation model for data processing.
Key words: word segmentation; mmseg algorithm; mutual information; dictionary; statistics
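0 Introduction
In data mining for e-commerce, a traded item usually comes with a product name but no category, and the name may also carry some basic descriptive text. For product classification or clustering this is not enough information as it stands, so the useful information must be extracted and the superfluous parts filtered out, which calls for preprocessing during the data cleaning stage. Extracting valuable information from such statements is a task for natural language processing: segmenting each statement into words gives product identification a clearer meaning. Chinese word segmentation is the foundation of Chinese information processing; before text can be classified, its content must first be segmented. The basic principle of Chinese word segmentation methods is to take the input character string and perform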