Document overview: An LCS-based Algorithm for Detecting Near-duplicate Web Pages
Abstract: Near-duplicate detection for web pages has long been a research hotspot. Shingling and simhash are regarded as the two best current algorithms, yet they have certain shortcomings. First, a high score implies a high probability of similarity, but not necessarily a high degree of similarity. Second, the conflict between hash-code length and hash-code multiplicity makes it difficult to achieve high precision and high recall at the same time. This paper proposes a new near-duplicate web page detection algorithm that offers both high precision and high recall. The algorithm adopts a similarity measure based on the LCS (longest common subsequence), which better captures both the similarity and the containment relationship between web pages and achieves high accuracy. In addition, a three-step detection framework is designed to guarantee the algorithm's efficiency. Comprehensive experiments show that the algorithm attains high precision and high recall simultaneously with good efficiency. The method has also been used to partition a set of 430 million web pages into 68 million subsets of similar pages; the whole process took only 5 days on a cluster of 6 Linux servers.
Keywords: near-duplicate detection; duplicate elimination; longest common subsequence; similarity measurement
Achieving both High Precision and High Recall in Near-duplicate Detection
Abstract: To find near-duplicate documents, fingerprint-based paradigms such as shingling and simhash have been recognized as effective approaches and are considered the state of the art. Nevertheless, we see two aspects of these approaches that may be improved. First, a high score under these algorithms' similarity measurement implies a high probability of similarity between documents, which is different from a high degree of similarity of the documents. Second, there has to be a tradeoff between hash-code length and hash-code multiplicity in fingerprint paradigms, which makes it hard to maintain a satisfactory recall level while improving precision. In this paper we propose a new approach to near-duplicate detection for web pages, which has both merits of high precision and high recall. Technically, our approach is based on an LCS (longest common subsequence) measurement, together with a 3-step framework, which is carefully designed to ensure high efficiency. A comprehensive experiment was conducted, which shows our method achieves both high precision and high recall. Also, the method has been successfully used to partition a set of 430 million web pages into 68 million subsets of similar pages. The process of partition took only 5 days to complete, using a cluster of 6 Linux boxes, which demonstrates its effectiveness.
Key words: near-duplicate detection; replica detection; longest common subsequence; similarity measurement
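To make the LCS-based measurement concrete, the following is a minimal sketch in Python. The normalization `lcs_length / max(len(a), len(b))` is an assumption of one plausible formulation, not necessarily the exact measure used in the paper; the function names are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via standard dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(a, b):
    """Similarity in [0, 1]. Because LCS preserves order, a page that is
    largely contained in another still scores highly against it, which is
    what lets this measure capture containment as well as similarity."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

In practice the inputs would be token sequences extracted from the pages rather than raw strings; the quadratic cost of the DP is why a filtering framework (such as the paper's three-step process) is needed before pairwise comparison.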
Introduction
Detection of near-duplicate web pages has long been a research hotspot. It plays an important role in many Web-related applications, including clustering and ranking of search results, Web crawling, information extraction, spam detection, and so on.
It is precisely because