Document title: 面向Web挖掘的主题网络爬虫的研究与实现.pdf (Research and Implementation of a Topic Web Crawler for Web Mining)

Format: PDF; size: 1,750 KB; pages: 77


Uploaded by: wxc6688, 2021/12/8; file size: 1.71 MB

Abstract (Chinese version)
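With the rapid development of the Internet, more and more information resources are
presented to people through the network, and obtaining the information needed for daily
life and production through search engines has become one of the mainstream ways for
people to keep informed. However, because Web information resources are growing
explosively and are semi-structured, real-time, heterogeneous, and discrete, how to mine
and analyze Web resources and extract information on the specific topics people need has
become an important research subject.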
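This thesis studies topic-oriented search for Web mining, based on enterprise competitive
intelligence. After introducing the research background and the current state of the
field, it focuses on the core technologies of Web mining and topic search engines. The
specific research work is as follows: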
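Topic Web crawler: After a comprehensive analysis of the Web search algorithms used by
existing search engines, the thesis improves the relevant search strategies and proposes
a non-greedy genetic search algorithm.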
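Web document analysis: The thesis uses the HTML Tidy tool to convert a Web document into
its corresponding tree structure and then, according to the user's needs, applies
different traversal algorithms to extract the relevant information. After extracting and
segmenting the main text of a page, the crawler system builds the text's feature vector
using an improved method for computing feature-term weights.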
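Topic relevance evaluation: Building on the use of the vector space model to evaluate
the topic relevance of a page's main text, the system computes the topic relevance of a
hyperlink by combining its anchor text, its own string, and the page that contains it.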
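
The vector space model step described above can be illustrated with a minimal sketch. The abstract does not specify the thesis's improved feature-term weighting, so plain smoothed TF-IDF is used here as a stand-in. The function names `tf_idf_vector` and `cosine_similarity` and the `doc_freq` table are hypothetical, and the input is assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, doc_freq, n_docs):
    """Build a sparse {term: weight} feature vector for one document.

    Assumption: plain smoothed TF-IDF stands in for the thesis's improved
    weighting method, which the abstract does not describe.
    """
    counts = Counter(tokens)
    total = len(tokens)
    vector = {}
    for term, count in counts.items():
        tf = count / total
        # Smoothed idf keeps the weight finite for terms unseen in the corpus.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        vector[term] = tf * idf
    return vector


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A page would then be scored by comparing its vector with a vector built from the topic description; pages whose cosine similarity falls below a chosen threshold could be treated as off-topic.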
On the basis of the work above, a topic Web crawler system based on enterprise
competitive intelligence was designed and implemented.
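
As one illustration of how such a system might rank unvisited links, the sketch below combines the three signals named in the relevance-evaluation item: anchor text, the link's own string, and the relevance of the containing page. The weighted sum, the default weights, the term-overlap measure, and the names `term_overlap`, `link_relevance`, and `rank_links` are all assumptions made for illustration; the abstract does not give the thesis's formula. Ranking by descending score is a simple greedy ordering, not the non-greedy genetic search the thesis proposes.

```python
import re


def term_overlap(tokens, topic_terms):
    """Fraction of the topic terms that appear among the tokens."""
    topic = set(topic_terms)
    if not topic:
        return 0.0
    return len(topic & set(tokens)) / len(topic)


def link_relevance(anchor_text, url, parent_score, topic_terms,
                   weights=(0.5, 0.2, 0.3)):
    """Score a hyperlink from its anchor text, URL string, and parent page.

    Assumptions: whitespace tokenization (Chinese text would need a word
    segmenter), a linear combination, and illustrative weights.
    """
    anchor_tokens = anchor_text.lower().split()
    # Split the URL on anything that is not a lowercase letter or digit.
    url_tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    w_anchor, w_url, w_parent = weights
    return (w_anchor * term_overlap(anchor_tokens, topic_terms)
            + w_url * term_overlap(url_tokens, topic_terms)
            + w_parent * parent_score)


def rank_links(links, topic_terms):
    """Return URLs ordered from most to least relevant.

    Each link is an (anchor_text, url, parent_score) tuple.
    """
    scored = [(link_relevance(a, u, p, topic_terms), u) for a, u, p in links]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]
```

Here `parent_score` would be the page-level relevance of the page that contains the link, for example a cosine similarity between the page's feature vector and the topic vector.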

Keywords: Web mining; topic Web crawler; relevance computation; search algorithm; text classification algorithm
Abstract
With the rapid development of the Internet, more and more information is presented to
people, and search engines have become a mainstream way for people to access
information. However, because of the explosive growth of Web resources and their
discrete, heterogeneous, semi-structured, and real-time character, how to mine and
analyze them and extract information on a particular, user-specified topic has become
an important research subject.
This thesis studies Web-mining-oriented topic search for enterprise competitive
intelligence. After introducing the research background and the current state of the
field, it focuses on the key technologies of Web mining and topic search engines.
The main research work is as follows:
Topic Web Crawler: After a comprehensive analysis of the search algorithms that existing
search engines use in Web mining, the thesis improves the relevant search strategies and
proposes a non-greedy genetic search algorithm.
Web Document Analysis: In