Efficient URL Caching Based on Web Crawlers (Graduation Thesis Foreign-Literature Translation)

Original Foreign-Language Text
Efficient URL Caching for World Wide Web Crawling
Andrei Z. Broder
IBM TJ Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532

ABSTRACT
…cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.
INTRODUCTION
A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.
Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or SPIDERing. The basic algorithm is:
(a) Fetch a page
(b) Parse it to extract all linked URLs
(c) For all the URLs not seen before, repeat (a)-(c)
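In code, this loop can be made concrete. The sketch below is a minimal, single-machine illustration in Python, assuming a hypothetical fetch_page helper supplied by the caller; the in-memory "seen" set plays the role of the membership structure that, at web scale, must live on disk behind a cache.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    # Step (b): collect href attributes from anchor tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch_page, max_pages=1000):
    # fetch_page(url) -> str is an assumed helper, not part of the paper.
    seen = set(seed_urls)        # the "not seen before" test of step (c)
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    while frontier and max_pages > 0:
        url = frontier.popleft()
        max_pages -= 1
        page = fetch_page(url)   # step (a): fetch a page
        parser = LinkExtractor()
        parser.feed(page)        # step (b): parse out linked URLs
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:  # step (c): membership test
                seen.add(absolute)
                frontier.append(absolute)

At the scale of the full web, the seen set is far too large for main memory, which is precisely why caching a dynamic subset of it is the subject of this paper.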
Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page, or a directory such as yahoo.com, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.
If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next. Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS).
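To make the difference concrete, the short sketch below (same illustrative Python as above, over a toy in-memory adjacency-list graph rather than the actual web) shows that the two strategies differ only in which frontier node is removed next: a FIFO queue gives BFS, a LIFO stack gives DFS.

from collections import deque

def traverse(graph, start, strategy="bfs"):
    # Generic traversal; graph maps each node to its out-neighbors.
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier:
        # FIFO (popleft) yields Breadth First Search;
        # LIFO (pop) yields Depth First Search.
        node = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return order

# Toy example: pages as nodes, hyperlinks as directed edges.
web = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(traverse(web, "A", "bfs"))  # ['A', 'B', 'C', 'D']
print(traverse(web, "A", "dfs"))  # ['A', 'C', 'D', 'B']

The only difference between the two traversals is the discipline of the frontier data structure; everything else, including the membership test against the seen set, is identical.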