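Journal of Chinese Information Processing (中文信息学报), Vol. 21, No. 6, Nov. 2007
Article ID: 1003-0077(2007)06-0059-06

Research and Implementation of a Fuzzy Matching Algorithm for Chinese Information Retrieval Systems
(English title as published: An Approximate String Matching Algorithm for Chinese Information Retrieval Systems)

WANG Jing-fan, WU Xiao-jun, XIA Yun-qing, ZHENG Fang
(Dept. of Computer Sci. & Tech., Tsinghua University; Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China)

Abstract (translated from the Chinese): In modern Chinese information retrieval systems, the string a user enters often deviates locally from the entries actually stored in the database, and retrieval based on keyword matching handles this poorly. This paper builds on and improves the filtering algorithm proposed by Tarhio and Ukkonen. To address the mixing of homophones and near-homophones that is common with Pinyin input methods, it further extends the algorithm to a generalized Edit Distance. Experiments show that the proposed algorithm effectively improves the recall of Chinese information retrieval systems and achieves "sublinear" efficiency in practical applications.

Keywords: computer application; Chinese information processing; approximate matching; filter algorithm; dynamic programming
CLC number: TP391    Document code: A

Abstract (as published in English): In modern Chinese information retrieval systems, classical keyword-based string matching cannot work when the input string differs from the entries in the database. This paper proposes a method based on Tarhio and Ukkonen's filtering algorithm to solve this problem. Since text typed with Chinese Pinyin input methods often contains Chinese characters with the same or similar pronunciations, we define a special Edit Distance and extend our method accordingly. The experimental results show that our algorithm can improve the recall rate of retrieval systems and achieve sub-linear complexity in practice.

1 Introduction

Most existing information retrieval systems rely on keyword matching [2]. In practice, users often search from memory and can sometimes describe the target only vaguely, so the keywords they enter do not exactly match the data stored in the collection. Conversely, errors introduced when the collection is built (such as OCR recognition errors) can also make the affected records unreachable for users. In both cases a traditional retrieval system has difficulty finding the required information in the collection. This paper uses fuzzy matching to find the items in the collection that are similar to the user's input and outputs the results ranked by similarity, which partially solves these problems. Fuzzy matching can also be applied in other fields, such as intrusion detection, information filtering, and gene detection [3].

Most Chinese users type with Pinyin input methods. Substituting a homophone because the wrong candidate was selected while entering the query string is a very typical phenomenon; substitution of similar-sounding characters caused by dialects, pronunciation habits, and the like (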