Research and Implementation of a Massive Web Data Mining System Based on Hadoop
Songchang Jin, Shuqiang Yang
jsc04@
(School of Computer Science, National University of Defense Technology, Changsha 410073, China)
Abstract:
With the rapid growth of web data, traditional data mining systems based on a single node can no longer cope with storing and processing massive web data. With the rise of "cloud computing" technology, integrating traditional data mining methods with cloud computing platforms to improve the efficiency of data mining has become a research direction. We integrate the traditional genetic algorithm with the MapReduce parallel processing framework of Hadoop to mine massive web data stored in HDFS (Hadoop Distributed File System). To further verify the effectiveness and efficiency of the platform, we use the integrated algorithm to mine users' preferred access paths in web logs on the platform. Experimental results show that using a distributed algorithm to process a large number of weblog files in the cluster can significantly improve the efficiency of web data mining.