Document description: original foreign-language material
Efficient URL Caching for World Wide Web Crawling
Marc Najork
Proceedings of the 12th International World Wide Web Conference (WWW 2003)
Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
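The basic algorithm of steps (a)-(c) can be sketched as follows. This is a minimal illustration, not the paper's crawler: the hypothetical in-memory link graph stands in for real HTTP fetching and parsing, and all names here are invented for the example.

```python
from collections import deque

# Hypothetical link graph standing in for fetch + parse of real pages.
LINK_GRAPH = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "c.example/"],
    "b.example/": ["c.example/", "a.example/1"],
    "c.example/": [],
}

def crawl(seed):
    seen = {seed}              # the "seen" set that step (c) tests against
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                      # (a) "fetch" the page
        for link in LINK_GRAPH.get(url, []):   # (b) extract linked URLs
            if link not in seen:               # (c) membership test
                seen.add(link)
                frontier.append(link)
    return order
```

At web scale the `seen` set cannot live entirely in main memory, which is exactly why the membership test in step (c) becomes the bottleneck the paper addresses.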
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
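Of the practical policies listed above, CLOCK is a standard low-overhead approximation of LRU: each cached entry carries a reference bit, and a rotating "hand" evicts the first entry whose bit is clear, clearing bits as it passes. The sketch below is an illustrative implementation under that general description (the class name, the synthetic URL stream, and the capacity are all invented for the example, not taken from the paper's setup).

```python
class ClockCache:
    """Minimal sketch of CLOCK replacement: a hit sets the entry's
    reference bit; a miss inserts the URL, evicting via the clock hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []    # list of [url, ref_bit]
        self.index = {}    # url -> slot position
        self.hand = 0

    def contains(self, url):
        """Return True on a cache hit; on a miss, insert the URL and return False."""
        pos = self.index.get(url)
        if pos is not None:
            self.slots[pos][1] = 1            # mark as recently used
            return True
        if len(self.slots) < self.capacity:   # cache not yet full
            self.index[url] = len(self.slots)
            self.slots.append([url, 1])
            return False
        # Advance the hand, clearing reference bits, until one is already 0.
        while self.slots[self.hand][1]:
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        victim, _ = self.slots[self.hand]     # evict this entry
        del self.index[victim]
        self.slots[self.hand] = [url, 1]
        self.index[url] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False

# Measure a hit rate over a small synthetic URL stream: a skewed phase of
# three repeating URLs, then a phase of ten distinct ones.
stream = ["u%d" % (i % 3) for i in range(30)] + ["v%d" % i for i in range(10)]
cache = ClockCache(4)
hits = sum(cache.contains(u) for u in stream)   # 27 hits out of 40 lookups
```

The skewed phase illustrates the paper's central observation: when the URL stream has strong repetition, even a small cache absorbs most membership tests.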
INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users".