Classified Index:

Dissertation for the Master Degree of Engineering

DESIGN AND IMPLEMENTATION OF WEB CRAWLER FOR DEEP WEB MINING

Candidate: Tian Wei
Supervisor: Prof. Wang Xiaolong
Associate Supervisor: Associate Prof. Chen Qingcai
Academic Degree Applied for: Master of Engineering
Specialty: Computer Science
Affiliation: Shenzhen Graduate School
Date of Defence: December, 2007
Degree-Conferring-Institution: Harbin Institute of Technology

Abstract

With the rapid growth of information on the Internet, search engines have become the primary means by which people retrieve online information and one of the most important ways of obtaining information on the Internet. The web crawler, as the module of a search engine responsible for collecting information, plays an important role. The Internet is characterized by a huge volume of information that updates and grows rapidly; moreover, with the development of the Web, more and more data can only be obtained by submitting forms, and the information produced by these form submissions is dynamically generated by Deep Web back-end databases. Under these circumstances, information integration increasingly requires a web crawler that can automatically fetch such pages for further data processing. A search engine therefore needs a powerful and efficient web crawler to collect information for it, so that it can provide users with comprehensive and timely query results.

To meet these needs, this thesis proposes a design method for a web crawler that collects Deep Web pages. First, heuristic rules are applied to filter target forms and extract their labels. Second, the forms are modeled. Finally, by analyzing the form model, attribute values are filled into the form controls to complete the form filling.

The main work of this thesis is as follows: (1) analyze the tasks that each module in the system architecture must accomplish, present the design ideas and implementation strategies of each module, and design and implement the Deep Web crawler; (2) build a feature quadruple model of web forms and automatically generate query terms; (3) verify the efficiency and effectiveness of the Deep Web crawler through practical tests; (4) discuss future directions of development and analyze the problems of the existing system.

Experiments show that the work presented in this thesis effectively improves the performance of the web crawler, satisfies the requirement of mining Deep Web content, and achieves the intended goals.

Keywords: search engine; web crawler; Deep Web

Harbin Institute of Technology, Dissertation for the Master Degree of Engineering

Abstract

With the rapid growth of information on the Internet, the search engine has become an indispensable tool for surfing the Web and a major tool for people to get information from the Internet. In a search engine system, the web crawler is the module responsible for collecting information and plays an important role. The information on the Internet is huge and grows and updates very fast. Furthermore, along with the development of the Web, more and more information can only be obtained by submitting web forms; this information is dynamically generated by the Deep Web's back-end databases. In such circumstances, information integration needs a web crawler that can automatically access web forms for further data processing. The search engine needs a powerful, efficient web crawler to collect information, so as to provide comprehensive, timely query results for users. To meet these needs, this paper proposes a design method for a web crawler that collects Deep Web pages
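The pipeline sketched in the abstract — heuristically filtering target forms, modeling each form, and filling attribute values into its controls — might look roughly like the following minimal Python sketch. The concrete heuristics, the four fields of the control model (label, name, type, candidate values), and the helper names are illustrative assumptions, not the thesis's actual design.

```python
# Hypothetical sketch of the abstract's pipeline: heuristic form filtering,
# a per-control feature model, and value filling. The 4-tuple fields
# (label, name, ctype, values) are assumptions for illustration only.
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List

@dataclass
class ControlModel:
    label: str                                       # text preceding the control
    name: str                                        # the control's "name" attribute
    ctype: str                                       # control type: text, select, ...
    values: List[str] = field(default_factory=list)  # candidate query terms

class FormExtractor(HTMLParser):
    """Collects <form> elements and their <input>/<select>/<textarea> controls."""
    def __init__(self):
        super().__init__()
        self.forms = []            # each form: {"method": ..., "controls": [...]}
        self._current = None
        self._pending_label = ""

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {"method": a.get("method", "get").lower(),
                             "controls": []}
        elif self._current is not None and tag in ("input", "select", "textarea"):
            ctype = a.get("type", "text") if tag == "input" else tag
            self._current["controls"].append(
                ControlModel(self._pending_label.strip(), a.get("name", ""), ctype))
            self._pending_label = ""   # label text is consumed by this control

    def handle_data(self, data):
        if self._current is not None:
            self._pending_label += data

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

def is_searchable(form):
    """Heuristic filter: keep forms likely to query a back-end database."""
    ctypes = [c.ctype for c in form["controls"]]
    return ("text" in ctypes              # has a free-text field to fill
            and "password" not in ctypes)  # skip login forms

def fill(form, terms):
    """Assign candidate query terms to text controls and build one submission."""
    for c in form["controls"]:
        if c.ctype == "text":
            c.values = list(terms)
    return {c.name: c.values[0] for c in form["controls"] if c.values}
```

A possible use: feed a page's HTML to `FormExtractor`, keep only forms passing `is_searchable`, then call `fill` with automatically generated query terms to obtain the key/value pairs for submission.

```python
html = ('<form method="get">Title <input type="text" name="q">'
        '<input type="submit"></form>')
p = FormExtractor()
p.feed(html)
searchable = [f for f in p.forms if is_searchable(f)]
print(fill(searchable[0], ["deep web"]))  # {'q': 'deep web'}
```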