Document description: original foreign-language material
Efficient URL Caching for World Wide Web Crawling
Marc Najork
Proceedings of the 12th International World Wide Web Conference (WWW 2003)
Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
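The basic algorithm of steps (a)-(c) can be sketched as follows. This is a minimal illustration, not the paper's crawler: the hypothetical in-memory link graph stands in for real HTTP fetching and parsing, and all names here are invented for the example.

```python
from collections import deque

# Hypothetical link graph standing in for fetch + parse of real pages.
LINK_GRAPH = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "c.example/"],
    "b.example/": ["c.example/", "a.example/1"],
    "c.example/": [],
}

def crawl(seed):
    seen = {seed}              # the "seen" set that step (c) tests against
    frontier = deque([seed])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                      # (a) "fetch" the page
        for link in LINK_GRAPH.get(url, []):   # (b) extract linked URLs
            if link not in seen:               # (c) membership test
                seen.add(link)
                frontier.append(link)
    return order
```

At web scale the `seen` set cannot live entirely in main memory, which is exactly why the membership test in step (c) becomes the bottleneck the paper addresses.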
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33 day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
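Of the practical policies listed above, CLOCK is a standard low-overhead approximation of LRU: each cached entry carries a reference bit, and a rotating "hand" evicts the first entry whose bit is clear, clearing bits as it passes. The sketch below is an illustrative implementation under that general description (the class name, the synthetic URL stream, and the capacity are all invented for the example, not taken from the paper's setup).

```python
class ClockCache:
    """Minimal sketch of CLOCK replacement: a hit sets the entry's
    reference bit; a miss inserts the URL, evicting via the clock hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []    # list of [url, ref_bit]
        self.index = {}    # url -> slot position
        self.hand = 0

    def contains(self, url):
        """Return True on a cache hit; on a miss, insert the URL and return False."""
        pos = self.index.get(url)
        if pos is not None:
            self.slots[pos][1] = 1            # mark as recently used
            return True
        if len(self.slots) < self.capacity:   # cache not yet full
            self.index[url] = len(self.slots)
            self.slots.append([url, 1])
            return False
        # Advance the hand, clearing reference bits, until one is already 0.
        while self.slots[self.hand][1]:
            self.slots[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.capacity
        victim, _ = self.slots[self.hand]     # evict this entry
        del self.index[victim]
        self.slots[self.hand] = [url, 1]
        self.index[url] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False

# Measure a hit rate over a small synthetic URL stream: a skewed phase of
# three repeating URLs, then a phase of ten distinct ones.
stream = ["u%d" % (i % 3) for i in range(30)] + ["v%d" % i for i in range(10)]
cache = ClockCache(4)
hits = sum(cache.contains(u) for u in stream)   # 27 hits out of 40 lookups
```

The skewed phase illustrates the paper's central observation: when the URL stream has strong repetition, even a small cache absorbs most membership tests.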
INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users".