Document overview: Key Technologies of Cloud Computing
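Virtualization Technology: Contents
1 Definition of virtualization
2 Classification of virtualization
3 Full virtualization and paravirtualization
4 Implementing virtualization
5 Comparing and selecting virtualization technologies
6 Benefits of virtualization
7 Problems introduced by virtualization
8 Where virtualization applies
9 The server virtualization process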
MapReduce
MapReduce is a simple, easy-to-use parallel programming model that greatly simplifies the implementation of large-scale data processing.
Divide and Conquer
[Figure: "Work" is partitioned into units w1, w2, w3; each unit goes to a "worker", which produces a partial result r1, r2, r3; the partial results are combined into "Result".]
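The partition / worker / combine structure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-chunk job (a sum of squares) and the function names `partition`, `worker`, `combine`, and `divide_and_conquer` are hypothetical, and threads are used purely to show the shape of the pattern, not to claim a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split `work` into at most n_parts contiguous chunks (w1, w2, ...)."""
    size = max(1, -(-len(work) // n_parts))  # ceiling division
    return [work[i:i + size] for i in range(0, len(work), size)]


def worker(chunk):
    """Hypothetical per-chunk job: sum of squares of the chunk."""
    return sum(x * x for x in chunk)


def combine(partials):
    """Merge the partial results (r1, r2, ...) into the final result."""
    return sum(partials)


def divide_and_conquer(work, n_workers=3):
    chunks = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the chunks
        partials = list(pool.map(worker, chunks))
    return combine(partials)


if __name__ == "__main__":
    print(divide_and_conquer(list(range(10))))
```

Note that the executor hides several of the hard questions (assignment of chunks to workers, knowing when all workers are done), which is exactly the bookkeeping a framework is meant to take over.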
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
What is the common theme of all of these problems?
Common Theme?
Parallelization problems arise from:
Communication between workers (e.g., to exchange state)
Access to shared resources (e.g., data)
Thus, we need a synchronization mechanism
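As a minimal sketch of why shared data needs synchronization, the example here has several threads increment one counter. The `Counter` class and `run` helper are hypothetical names chosen for illustration; the lock makes each read-modify-write step atomic with respect to the other workers.

```python
import threading


class Counter:
    """A shared counter whose updates are guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the read-modify-write of `value += 1`
        # could interleave between threads and lose updates.
        with self._lock:
            self.value += 1


def run(n_workers=4, n_increments=10_000):
    counter = Counter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished
    return counter.value


if __name__ == "__main__":
    print(run())
```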
Managing Multiple Workers
Difficult because
We don’t know the order in which workers run
We don’t know when workers interrupt each other
We don’t know the order in which workers access shared data
Thus, we need:
Semaphores (lock, unlock)
Condition variables (wait, notify, broadcast)
Barriers
Still, lots of problems:
Deadlock, livelock, race conditions...
Dining philosophers, sleepy barbers, cigarette smokers...
Moral of the story: be careful!
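Two of the primitives listed above, barriers and condition variables, can be sketched with Python's `threading` module. This is an illustrative example only: the two-phase job and the name `run_phases` are assumptions. The barrier keeps any worker from starting phase 2 before all workers have finished phase 1, and the condition variable lets the main thread wait until every worker reports completion.

```python
import threading


def run_phases(n_workers=3):
    barrier = threading.Barrier(n_workers)
    cond = threading.Condition()
    phase1 = [None] * n_workers
    phase2 = [None] * n_workers
    finished = 0

    def worker(i):
        nonlocal finished
        phase1[i] = i * 10        # phase 1: each worker writes its own slot
        barrier.wait()            # block until all workers reach this point
        phase2[i] = sum(phase1)   # phase 2: now safe to read every slot
        with cond:
            finished += 1
            cond.notify_all()     # "broadcast" to any waiting thread

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    with cond:
        # wait_for re-checks the predicate after every notification
        cond.wait_for(lambda: finished == n_workers)
    for t in threads:
        t.join()
    return phase2


if __name__ == "__main__":
    print(run_phases())
```

If the barrier were removed, a fast worker could sum `phase1` while other slots were still unset, which is the kind of ordering bug the slide warns about.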
Current Tools
Programming models
Shared memory (pthreads)
Message passing (MPI)
Design Patterns
Master-slaves
Producer-consumer flows
Shared work queues
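[Figure: Message passing: processes P1 to P5 exchange messages with one another. Shared memory: processes P1 to P5 all access a single memory.]
[Figure: Design patterns: a master with slaves; producer-consumer flows; a shared work queue.]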
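The producer-consumer and shared-work-queue patterns can be combined in one short sketch. It assumes a single producer, a doubling step as the consumers' job, and a sentinel object to signal shutdown; `run_pipeline` and `_STOP` are hypothetical names used only for this illustration.

```python
import queue
import threading

_STOP = object()  # sentinel telling a consumer to exit


def run_pipeline(items, n_consumers=2):
    work_queue = queue.Queue(maxsize=4)  # bounded shared work queue
    results = []
    results_lock = threading.Lock()

    def producer():
        for item in items:
            work_queue.put(item)         # blocks while the queue is full
        for _ in range(n_consumers):
            work_queue.put(_STOP)        # one sentinel per consumer

    def consumer():
        while True:
            item = work_queue.get()      # blocks while the queue is empty
            if item is _STOP:
                break
            with results_lock:
                results.append(item * 2)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)  # sorted because completion order is not fixed


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4, 5]))
```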
But now, MapReduce!
MapReduce: Parallel/Distributed Computing Programming Model
[Figure: MapReduce data flow, with stages labeled input split, shuffle, and output.]
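To make the flow from input split through shuffle to output concrete, here is a single-process word-count sketch. It is not a distributed implementation and does not use any real MapReduce library: the functions `map_fn`, `shuffle`, `reduce_fn`, and `map_reduce` are hypothetical, and word count is an assumed example problem. The map and reduce functions follow the usual shape of taking a key and a value (or list of values) and returning a list.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """in_key: document name (unused), in_value: document text.
    Emits one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in in_value.split()]


def shuffle(pairs):
    """Group intermediate values by key, as if routing each key to one node."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(out_key, intermediate_values):
    """Aggregate all values for one key; here, sum the counts."""
    return [sum(intermediate_values)]


def map_reduce(records):
    intermediate = []
    for in_key, in_value in records:      # each record stands in for one split
        intermediate.extend(map_fn(in_key, in_value))
    grouped = shuffle(intermediate)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


if __name__ == "__main__":
    docs = [("d1", "a b a"), ("d2", "b c")]
    print(map_reduce(docs))
```

In a real cluster the map calls run on many machines and the shuffle moves data over the network; this sketch only shows the data shapes at each stage.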
Typical problem solved by MapReduce
Read the input: data as records in key/value format
Map: extract something from each record
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair
Emits intermediate key/value pairs
Shuffle: redistribute and exchange data
Brings intermediate results with the same key together on the same node
Reduce: aggregate, summarize, filter, etc.
reduce (out_key, list(intermediate_value)) -> list(out_value)
For a given key, merges all the va