Original foreign-language text
Efficient URL Caching for World Wide Web Crawling
Marc Najork
BMJ (International Edition) 2009
Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
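The three steps can be sketched in a few lines of Python. This is only an illustration of the basic loop, not the crawler used in the paper: the `crawl` function, the `fetch_links` callable, and the `TOY_WEB` link graph are hypothetical names introduced here, and the sketch keeps the whole "seen" set in main memory, which is exactly what does not scale to a crawl of the web.

```python
from collections import deque


def crawl(seed, fetch_links):
    """Return URLs in the order they were fetched, starting from seed.

    fetch_links is a hypothetical callable mapping a URL to the list of
    URLs linked from that page (it stands in for fetching and parsing).
    """
    seen = {seed}             # every URL discovered so far
    frontier = deque([seed])  # discovered but not yet fetched
    fetched = []
    while frontier:
        url = frontier.popleft()
        fetched.append(url)            # step (a): fetch the page
        for link in fetch_links(url):  # step (b): extract linked URLs
            if link not in seen:       # step (c): membership test
                seen.add(link)
                frontier.append(link)
    return fetched


# Toy link graph standing in for the web (made-up page names).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}

order = crawl("a", lambda url: TOY_WEB.get(url, []))
```

The `link not in seen` test runs once per extracted link, not once per page, which is why the membership test is executed far more often than the fetch step.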
A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the “seen” URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
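To make the idea of simulating a cache over a URL trace concrete, here is a minimal sketch of two of the practical policies named above, LRU and CLOCK, together with a hit-rate measurement. It is not the paper's simulator: the class names, the `hit_rate` helper, and the toy trace are assumptions introduced for illustration, the cache capacity is assumed to be at least 1, and the choice to insert new CLOCK entries with their "used" bit cleared is one common variant rather than something the text specifies.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used URL when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first

    def lookup(self, url):
        """Return True on a hit; on a miss, insert url and return False."""
        if url in self.entries:
            self.entries.move_to_end(url)  # mark as most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # drop least recently used
        self.entries[url] = None
        return False


class ClockCache:
    """Approximates LRU with one 'used' bit per slot and a rotating hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = []   # URLs held in a circular buffer
        self.used = []    # one 'recently used' bit per slot
        self.index = {}   # url -> slot position
        self.hand = 0

    def lookup(self, url):
        """Return True on a hit; on a miss, insert url and return False."""
        pos = self.index.get(url)
        if pos is not None:
            self.used[pos] = True
            return True
        if len(self.slots) < self.capacity:
            # Cache not yet full: take the next free slot.
            self.index[url] = len(self.slots)
            self.slots.append(url)
            self.used.append(False)  # assumption: new entries start unmarked
            return False
        # Advance the hand, clearing bits, until an unmarked slot is found.
        while self.used[self.hand]:
            self.used[self.hand] = False
            self.hand = (self.hand + 1) % self.capacity
        del self.index[self.slots[self.hand]]
        self.slots[self.hand] = url
        self.index[url] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return False


def hit_rate(cache, trace):
    """Fraction of lookups in trace that hit the cache."""
    hits = sum(cache.lookup(url) for url in trace)
    return hits / len(trace)


# Made-up trace of URL lookups, far smaller than a real crawl log.
trace = ["a", "b", "a", "c", "a", "b"]
lru_rate = hit_rate(LRUCache(2), trace)
clock_rate = hit_rate(ClockCache(2), trace)
big_rate = hit_rate(LRUCache(10), trace)  # large enough to never evict
```

A cache large enough to hold every distinct URL in the trace never evicts, so it behaves like the infinite cache that serves as a theoretical limit: it misses only on the first occurrence of each URL.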
1. INTRODUCTION
A recent Pew Foundation study [31] states that “Search engines have become an indispensable utility for users”
