Google Cluster Computing Faculty Training Workshop
Module V: Hadoop Technical Review
© Spinnaker Labs, Inc.
Overview
Hadoop Technical Walkthrough
HDFS
Databases
Using Hadoop in an Academic Environment
Performance tips and other tools
You Say, “tomato…”
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop
GFS                 HDFS
Bigtable            HBase
Chubby              (nothing yet… but planned)
Some MapReduce Terminology
Job – A “full program”: an execution of a Mapper and Reducer across a data set
Task – An execution of a Mapper or a Reducer on a slice of data, a.k.a. Task-In-Progress (TIP)
Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
Running “Word Count” across 20 files is one job
20 files to be mapped imply 20 map tasks + some number of reduce tasks
At least 20 map task attempts will be performed… more if a machine crashes, etc.
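The relationship between job, task, and task attempt in this example can be sketched in plain Java. This is an illustrative model only, not Hadoop's API: the class `TerminologySketch`, its nested `Task` class, the attempt ID format, and the choice of 2 reducers are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the terminology; these are not Hadoop classes.
class TerminologySketch {
    // One task = one Mapper or Reducer run over one slice of data.
    static class Task {
        final String kind;   // "map" or "reduce"
        final int index;     // position of the task within the job
        final List<String> attempts = new ArrayList<>();

        Task(String kind, int index) {
            this.kind = kind;
            this.index = index;
        }

        // Each attempt gets its own ID: the task's name plus an attempt counter.
        String newAttempt() {
            String id = kind + "_" + index + "_attempt_" + attempts.size();
            attempts.add(id);
            return id;
        }
    }

    // One job = all map tasks plus all reduce tasks for one program run.
    static List<Task> planJob(int inputFiles, int reducers) {
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < inputFiles; i++) tasks.add(new Task("map", i));
        for (int i = 0; i < reducers; i++) tasks.add(new Task("reduce", i));
        return tasks;
    }

    public static void main(String[] args) {
        List<Task> job = planJob(20, 2);    // 2 reducers is an arbitrary choice
        for (Task t : job) t.newAttempt();  // every task is attempted at least once
        job.get(3).newAttempt();            // pretend map task 3 crashed and was retried
        long mapAttempts = job.stream()
                .filter(t -> t.kind.equals("map"))
                .mapToInt(t -> t.attempts.size())
                .sum();
        // prints "tasks=22 mapAttempts=21"
        System.out.println("tasks=" + job.size() + " mapAttempts=" + mapAttempts);
    }
}
```

One job over 20 files yields 20 map tasks, and a single retry already pushes the number of map task attempts above 20.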
Task Attempts
A particular task will be attempted at least once, possibly more times if it crashes
If the same input causes crashes over and over, that input will eventually be abandoned
Multiple attempts at one task may occur in parallel with speculative execution turned on
Task ID from TaskInProgress is not a unique identifier; don’t use it that way
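The retry-then-abandon behavior described above can be sketched as a simple loop. This is a minimal illustration in plain Java, not Hadoop's scheduler: the class `AttemptSketch`, the method `runWithRetries`, and the limit of 4 attempts are assumptions chosen for the example, and speculative (parallel) attempts are not modeled.

```java
import java.util.function.Consumer;

// Hypothetical sketch of sequential task attempts; not Hadoop code.
class AttemptSketch {
    // Runs the task on the input until it succeeds or the attempt limit is hit.
    // Returns the 1-based number of the successful attempt, or -1 if the
    // input was abandoned after maxAttempts failures.
    static int runWithRetries(String input, Consumer<String> task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.accept(input);
                return attempt;              // this attempt succeeded
            } catch (RuntimeException crash) {
                // A crash counts as one failed attempt; loop and try again.
            }
        }
        return -1;                           // same input kept crashing: abandon it
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Flaky task: crashes on the first attempt only.
        int flaky = runWithRetries("split-0", in -> {
            if (calls[0]++ == 0) throw new RuntimeException("simulated crash");
        }, 4);
        // Bad input: crashes on every attempt.
        int bad = runWithRetries("split-1", in -> {
            throw new RuntimeException("simulated bad record");
        }, 4);
        // prints "flaky succeeded on attempt 2, bad input result -1"
        System.out.println("flaky succeeded on attempt " + flaky
                + ", bad input result " + bad);
    }
}
```

Because one task can have several attempts, anything that must be unique per execution (such as an output file name) needs to be keyed on the attempt, not on the task.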
MapReduce: High Level
Node-to-Node Communication
Hadoop uses its own RPC protocol
Communication begins in slave nodes
Prevents circular-wait deadlock
Slaves periodically poll for “status” message
Classes must provide explicit serialization
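“Explicit serialization” means each class spells out, field by field, how it is written to and read from a byte stream. The sketch below shows the idea using only the JDK. The nested interface `SketchWritable` is a local stand-in modeled on the two-method shape of Hadoop's `Writable` interface (`write(DataOutput)` and `readFields(DataInput)`); the `WordCountPair` class and the `roundTrip` helper are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of explicit serialization; does not depend on Hadoop.
class SerializationSketch {
    // Local stand-in for the shape of Hadoop's Writable interface.
    interface SketchWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // A value type that states exactly which fields go on the wire, in order.
    static class WordCountPair implements SketchWritable {
        String word = "";
        int count;

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();    // must read fields in the order they were written
            count = in.readInt();
        }
    }

    // Serializes to bytes and back, as would happen between two nodes.
    static WordCountPair roundTrip(WordCountPair original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WordCountPair copy = new WordCountPair();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountPair sent = new WordCountPair();
        sent.word = "hadoop";
        sent.count = 7;
        WordCountPair received = roundTrip(sent);
        // prints "hadoop=7"
        System.out.println(received.word + "=" + received.count);
    }
}
```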
Nodes, Trackers, Tasks
Master node runs JobTracker instance, which accepts Job requests from clients
TaskTracker instances run on slave nodes
TaskTracker forks separate Java process for task instances
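Forking a separate Java process for a task can be illustrated with the JDK's `ProcessBuilder`. This is a sketch of the general mechanism, not the TaskTracker's actual code: the class `ForkSketch` and the method `runInChildJvm` are hypothetical, and the child here only runs `java -version` in place of real task code. A likely benefit of this design is isolation, since a crash in the child process does not take down the parent.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of launching work in a child JVM; not Hadoop code.
class ForkSketch {
    // Starts a new JVM with the given arguments and returns its exit code.
    static int runInChildJvm(String... jvmArgs) {
        // Reuse the java binary of the JVM that is currently running.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.addAll(Arrays.asList(jvmArgs));
        try {
            Process child = new ProcessBuilder(command)
                    .inheritIO()          // show the child's output on our console
                    .start();
            return child.waitFor();       // block until the child exits
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a task: the child JVM just reports its version and exits.
        int exitCode = runInChildJvm("-version");
        System.out.println("child exited with code " + exitCode);
    }
}
```

The parent learns the outcome only through the exit code, so a non-zero code can be treated as a failed attempt.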
Job Distribution
MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
Running a MapReduce job p