Google Cluster Computing Faculty Training Workshop
Module III: Nutch
This presentation © Michael Cafarella
Redistributed under the Creative Commons Attribution license.
Meta-details
Built to encourage public search work
Open-source, w/pluggable modules
Cheap to run, both machines & admins
Goal: Search more pages, with better quality, than any other engine
Pretty good ranking
Has handled ~200M pages; more is possible
Hadoop is a spinoff
Outline
Nutch design
Link database, fetcher, indexer, etc…
Hadoop support
Distributed filesystem, job control
[Architecture diagram. Components shown: Inject; WebDB; Fetchlist 0..N; Fetcher 0..N; Content 0..N; Update 0..N; Indexer 0..N; Index 0..N; Searcher 0..N; WebServer 0..M]
Moving Parts
Acquisition cycle
WebDB
Fetcher
Index generation
Indexing
Link analysis (maybe)
Serving results
WebDB
Contains info on all pages, links
Pages: URL, last download, # failures, link score, content hash, ref counting
Links: source hash, target URL
Must always be consistent
Designed to minimize disk seeks
19 ms seek time x 200M new pages/mo
= ~44 days of disk seeks!
Single-disk WebDB was huge headache
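The seek estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes each new page costs exactly one random disk seek, which the slide implies but does not state, and the names `SEEK_TIME_S`, `NEW_PAGES_PER_MONTH`, and `seek_days` are made up for illustration.

```python
SEEK_TIME_S = 0.019                 # 19 ms per seek (figure from the slide)
NEW_PAGES_PER_MONTH = 200_000_000   # 200M new pages per month (figure from the slide)
SECONDS_PER_DAY = 86_400


def seek_days(pages: int, seek_time_s: float = SEEK_TIME_S) -> float:
    """Days of pure seek time, assuming one random seek per page (an assumption)."""
    return pages * seek_time_s / SECONDS_PER_DAY


days = seek_days(NEW_PAGES_PER_MONTH)
print(f"{days:.1f} days of disk seeks per month of new pages")
```

Under that assumption, one month of new pages needs more than a month of seek time on a single disk, which is why the design favors sequential access over per-record seeks.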
Fetcher
Fetcher is very stupid. Not a “crawler”
Pre-MapRed: divide “to-fetch list” into k pieces, one for each fetcher machine
URLs for one domain go to same list, otherwise random
“Politeness” w/o inter-fetcher protocols
Can observe robots.txt similarly
Better DNS, robots caching
Easy parallelism
Two outputs: pages, WebDB edits
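The pre-MapReduce split of the to-fetch list can be sketched roughly as follows. This is an illustrative guess, not Nutch's actual code: it uses the URL's hostname as a stand-in for "domain", and hashes that hostname to pick one of the k lists so that the spread across lists is effectively random while every URL of a host lands in the same list. The function names and the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def fetchlist_index(url: str, k: int) -> int:
    """Pick one of k fetchlists from the URL's hostname (stand-in for 'domain')."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then reduce modulo k.
    return int.from_bytes(digest[:8], "big") % k


def partition_fetchlist(urls, k):
    """Split the to-fetch list into k pieces, one per fetcher machine."""
    lists = [[] for _ in range(k)]
    for url in urls:
        lists[fetchlist_index(url, k)].append(url)
    return lists


# Hypothetical URLs, for illustration only.
to_fetch = [
    "http://example.com/a.html",
    "http://example.com/b.html",
    "http://example.org/index.html",
    "http://example.net/",
]
pieces = partition_fetchlist(to_fetch, k=3)
```

Because one fetcher owns all URLs of a given host, it can rate-limit requests to that host and cache its DNS and robots data locally, without any coordination between fetcher machines.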
WebDB/Fetcher Updates
2. Sort edits (externally, if necessary)
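One way to apply fetcher edits without random seeks is to sort the edits by URL and then merge them with the already-sorted database in a single sequential pass. The sketch below assumes that approach. The merge step, the record layout (a dict of fields per URL), and the names `apply_edits`, `db`, and `edits` are illustrative assumptions, and an in-memory `sorted` stands in for the external sort.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def apply_edits(db_records, edits):
    """Yield a new URL-sorted database from a sorted database plus unsorted edits.

    db_records: iterable of (url, fields) already sorted by url.
    edits:      iterable of (url, fields) in arbitrary order.
    """
    # Stand-in for an external sort; stable, so later edits to a URL stay later.
    sorted_edits = sorted(edits, key=itemgetter(0))
    # Tag the streams so an existing record sorts before the edits for its URL.
    tagged_db = ((url, 0, fields) for url, fields in db_records)
    tagged_edits = ((url, 1, fields) for url, fields in sorted_edits)
    stream = merge(tagged_db, tagged_edits, key=lambda t: (t[0], t[1]))
    for url, group in groupby(stream, key=itemgetter(0)):
        record = {}
        for _, _, fields in group:
            record.update(fields)  # edits overwrite older field values
        yield url, record


# Hypothetical records, for illustration only.
db = [
    ("http://a.example/", {"LastUpdated": "Never", "ContentHash": None}),
    ("http://c.example/", {"LastUpdated": "Never", "ContentHash": None}),
]
edits = [
    ("http://c.example/", {"LastUpdated": "4/07/05", "ContentHash": "MD5_placeholder"}),
    ("http://b.example/", {"LastUpdated": "Never", "ContentHash": None}),
]
new_db = list(apply_edits(db, edits))
```

Both inputs are read front to back exactly once, so the cost is dominated by sequential I/O rather than one seek per edited page.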
URL:
LastUpdated: Never
ContentHash: None
URL:
LastUpdated: Never
ContentHash: None
URL: .html
LastUpdated: 4/07/05
ContentHash: MD5_toewkekqmekkalekaa
URL:
LastUpdated: 3/22/05
ContentHash: MD5_sdflkjweroiwelksd
Edit: DOWNL