algorithm greatly improves the recall rate, especially for small documents, and greatly reduces the complexity.
Keywords: elimination of near-duplicate web pages; text structure tree; key sentence; layer fingerprint; high-frequency punctuation
Contents
Chinese Abstract
English Abstract
1 Introduction
Research background
Significance of the research
Information retrieval and search engines
Information retrieval
Search engines
Classification of search engines
Thesis structure
2 Related Technologies
Text copy detection
Overview of deduplication methods in China and abroad
URL-based deduplication
Deduplication algorithms based on feature-string matching
Clustering-based deduplication algorithms
Timing of deduplication
Main innovations of the thesis
Chapter summary