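Huazhong University of Science and Technology
Master's Thesis
Design and Implementation of a Web Spider in a Network Security Scanner
Name: 申布琦 (Shen Buqi)
Degree applied for: Master
Major: Communication and Information Systems
Supervisor: 谭运猛 (Tan Yunmeng)
20090521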
Huazhong University of Science and Technology Master's Thesis
Abstract
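The aim of this thesis is to design and implement a Web spider. This module is an
important component of a Web application vulnerability assessment tool, which is designed
and developed to scan websites, identify security vulnerabilities, and produce an
assessment report of the scan.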
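Web spiders, also known as crawlers or robots, are programs that automatically download
Web pages from websites. The main purpose of designing a Web spider is to retrieve Web
pages. The main goals of a spider are to download a set of important pages, refresh the
pages already downloaded, discover new pages, and ensure that the pages it holds are a
proper representation.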
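Web spiders are also used for information extraction. In business intelligence, for
example, a company can use a Web spider to extract information about its competitors from
websites. Other applications of Web spiders include monitoring Web pages and search
engines. Spiders make these applications possible by following the hyperlinks in Web pages
to retrieve information automatically. In general, a spider starts by extracting
hyperlinks from an initial page, then follows those hyperlinks to obtain more pages, and
continues until the number of pages reaches a given size or some higher-level goal has
been met.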
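The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes a breadth-first visiting order, a page-count limit as the stopping condition, and a caller-supplied `fetch` function. The names `crawl`, `LinkExtractor`, `SITE` and the `example.test` URLs are hypothetical, and a small in-memory site stands in for real HTTP responses.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # URLs already queued, to avoid revisiting
    pages = {}                 # url -> HTML of each downloaded page
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # page could not be retrieved; skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


# Hypothetical three-page site standing in for real HTTP responses.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

pages = crawl("http://example.test/", SITE.get)
```

Passing `fetch` as a parameter keeps the loop independent of any particular HTTP library; a real spider would supply a function that issues the request and returns the response body.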
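Beneath this simple description lie many more complex research questions, such as
bandwidth usage, disk space, network connections, spider traps, URL classification, and
the analysis of HTML and dynamic page content. The dynamic nature of the Web makes a Web
spider challenging to implement. If Web pages were static, a spider would have little work
to do, because it could simply maintain a list of the pages it has already retrieved; in
practice, it must also handle pages that are updated or deleted.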
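One way to handle updated and deleted pages is to store a content digest with each retrieved URL and compare it on a later visit. The sketch below is only an assumption about how this could be done, not the method used in the thesis: the names `digest`, `refresh`, `OLD` and `NEW` are hypothetical, SHA-256 is an arbitrary choice of hash, and a `fetch` result of `None` is taken to mean the page no longer exists.

```python
import hashlib


def digest(html):
    """Return a fixed-length fingerprint of a page's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def refresh(index, fetch):
    """Re-fetch every known URL and report which pages changed.

    index maps url -> digest of the last stored copy and is updated in place.
    fetch(url) returns the current HTML text, or None if the page is gone.
    """
    updated, deleted = [], []
    for url in list(index):            # copy keys: index is modified below
        html = fetch(url)
        if html is None:
            deleted.append(url)
            del index[url]
        elif digest(html) != index[url]:
            updated.append(url)
            index[url] = digest(html)
    return updated, deleted


# Hypothetical snapshots of a site at two points in time.
OLD = {"/a": "<p>v1</p>", "/b": "<p>same</p>", "/c": "<p>bye</p>"}
NEW = {"/a": "<p>v2</p>", "/b": "<p>same</p>"}

index = {url: digest(html) for url, html in OLD.items()}
updated, deleted = refresh(index, NEW.get)
```

Storing digests rather than full copies keeps the list of retrieved pages small, at the cost of not knowing what changed, only that something did.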

Keywords: network security, network scanning, Web spider
Abstract
The goal of this research is the design and implementation of the Web-spidering
component, which is an integral part of the Web Application Vulnerability
Assessment (WAVA) tool, developed and designed to scan a website, identify security
vulnerabilities, and provide an assessment report of the results of the scan.
Web spiders, also known as crawlers or robots, are programs that automatically
download Web pages. The major reason for designing the Web spider is to retrieve Web
pages. The general goals of a spider are to download a significant set of pages, refresh
downloaded pages, find new pages and ensure the pages it has are a proper representation.
Web spiders are used in information retrieval; for example, in business intelligence, a
company can use a Web spider to collect information from the Web about its competition.
Other applications of Web spiders are in monitoring Web pages and in search engines.
Spiders make possible the above applications by following the hyperlinks in Web pages to
automatically retrieve a limited view of the Web. Basically, the spider begins with an
initial page and extracts the hyperlinks embedded within the Web pages to get new pages.
The process repeats with t