Building an Open Source Massive Data Mining Platform Based on Cloud Computing
Zhao Huaming
(National Science Library, Chinese Academy of Sciences, Beijing 100190)
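【Abstract】By analyzing the architecture of the Amazon Elastic MapReduce (EMR) platform, and in response to the pressing internal data processing needs of information and intelligence institutions, this paper proposes using the open source technologies Xen and Hadoop to build a dynamically scalable massive data processing platform based on cloud computing. It presents an implementation plan, a massive text data processing case, and an analysis of the advantages of an open source EMR platform. The implementation plan has three main parts: (1) building a dynamic, virtualized cloud computing environment; (2) installing and building a Hadoop virtual server template; and (3) configuring and running Cloudera and Cloudera Desktop. Applying the open source EMR architecture can effectively address server sprawl and improve the utilization of networked computing resources, as well as the rapid deployment capability and flexibility of distributed data mining services.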
【Keywords】cloud computing; massive data mining; virtualization technology; distributed computing; Xen; Cloudera; Hadoop
【Classification number】TP393
Building the Open Source Mass Data Mining Platform Based on Cloud Computing
Zhao Huaming
(National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
【Abstract】To meet the internal data processing needs of information organizations, this paper analyzes the framework of the Amazon Elastic MapReduce (EMR) platform and proposes building a dynamic and elastic open source mass data mining platform based on cloud computing. It provides a roadmap for successful implementation, an example of massive text data processing, and an analysis of the advantages of the open source EMR platform. The implementation plan has three parts: building a dynamic virtual cloud computing environment; creating the Hadoop virtual server template; and configuring and running Cloudera and Cloudera Desktop. Applying the open source EMR platform can effectively solve the problem of server sprawl, improve the utilization of network computing resources, and enhance the rapid deployment capability and agility of distributed data processing.