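Design and Implementation of a DataNode Local Cache for HDFS#
Zhao Jing, Wang Hongbo, Cheng Shiduan**
(State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876)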
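Abstract: With the continuing growth of Internet applications and the rapid increase of network
data, processing and storing massive data has become one of the most important problems facing
Internet applications today. The Hadoop Distributed File System (HDFS) is a distributed file
system developed by the Apache Hadoop project and designed to run on commodity hardware. It is
highly reliable and fault-tolerant, provides high-throughput data access, and is well suited to
the storage and distributed processing of massive data sets. However, frequently reading the
same small pieces of data out of the massive data stored in HDFS generates frequent disk I/O
operations and leads to server overload, such as disk bottlenecks. To address this, this paper
proposes adding a local cache to HDFS DataNodes. We analyzed and modified the open-source code
of the HDFS data access path to implement the DataNode local cache, and experiments show that
the scheme improves data access efficiency and reduces the servers' CPU and disk utilization.
Keywords: computer application technology; HDFS; DataNode; cache
CLC number: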
The Design and Implementation of DataNode Local Cache
of HDFS
Zhao Jing, Wang Hongbo, Cheng Shiduan
(State Key Laboratory of Networking and Switching Technology, Beijing University of Posts &
Telecommunications, Beijing 100876)
Abstract: With the growth of Internet applications and network data, the processing and storage of
massive data has become one of the most important problems. The Hadoop Distributed File System
(HDFS) is a file system developed by the Apache Hadoop project and designed to run on commodity
hardware. HDFS is highly reliable and fault-tolerant and provides high-throughput access to large
data sets. It is well suited to the storage and distributed computing of massive data sets. However,
frequent, repeated access to small data segments causes frequent disk I/O operations, leading to
disk bottlenecks and resource overload on servers. To solve this problem, this paper proposes
adding a local cache to the DataNodes of HDFS. The data-reading part of the HDFS source code has
been analyzed and modified to implement the DataNode local cache. The experimental results show
that the data access rate is improved and that the CPU and disk utilization of the servers is
reduced.
Keywords: Computer application technology; HDFS; DataNode; cache
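0 Introduction
With the continuing growth of Internet applications and the rapid increase of network data,
processing and storing massive data has become one of the most important problems facing
Internet applications today[1]. The Hadoop Distributed File System (HDFS)[2] is a distributed
file system developed by the Apache Hadoop project[3] and designed to run on commodity hardware.
It is highly reliable and fault-tolerant, provides high-throughput data access, and is well
suited to the storage and distributed processing of massive data sets.
HDFS uses a master/slave logical architecture[4]. An HDFS cluster consists of one NameNode and
a number of DataNodes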