Document name:

大连港竞争力现状及对策分析.doc

Format: doc   Pages: 43

Uploaded by: 策划大师, 2011/11/13   File size: 0 KB

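Undergraduate Thesis
Title: (Chinese) 大规模网页模块识别与信息提取系统设计与实现
(English) Design and Implementation of Large Scale Web Template Detection and Information Extraction System
Name:
Student ID:
Department: Department of Computer Science
Major: Search Engines and Internet Information Mining
Advisor:
November 11, 2017
Abstract (translated from the Chinese)
Building on existing web information extraction algorithms based on the DOM tree and heuristic rules, this thesis classifies all HTML tags that conform to the W3C specification, analyzes the semantic information carried by each tag one by one, and refines the rule set accordingly. The result is a bottom-up web page segmentation algorithm that loses no information. On top of this, detailed probability distributions are obtained by statistical methods and used to implement two algorithms for identifying the main-content blocks of a page, one based on text similarity comparison and one based on Bayesian posterior probability estimation. Intersecting their results improves the precision of main-content block identification.
These algorithms have been integrated into the web page preprocessing module of the TianWang (天网) search engine platform. At the SEWM 2008 conference, they served as the framework for organizing two Chinese Web information retrieval evaluation tasks: topical web page identification and main-content block extraction. Building on these algorithms and on the TianWang file system and Map-Reduce computing platform, a distributed block-level PageRank algorithm, named QuarkRank, was also implemented. Practical tests show that the algorithms adapt and scale well and achieve high precision and recall.
Keywords: web page segmentation, information extraction, SEWM evaluation, PageRank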
Abstract
Building on existing Web information extraction methods based on the DOM tree and heuristic rules, this paper classifies all HTML tags that conform to the W3C standards and analyzes the semantic information carried by each tag one by one. It thereby refines the rule set and obtains a bottom-up page blocking algorithm that loses no information.
On this basis, using probability distributions obtained by statistical methods, this paper implements two algorithms for recognizing theme-content information blocks: one based on text similarity comparison and the other on Bayes posterior probability estimation. The final result comes from their intersection, which improves the precision of theme-content block recognition.
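The idea of intersecting the two recognizers can be illustrated with a minimal Python sketch. It is not the thesis implementation. The block representation, the feature names, the thresholds, the likelihood tables and the use of the page title as the reference text are all hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two whitespace-tokenized texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bayes_posterior(features, prior, like_content, like_noise):
    """P(content | features), assuming features are conditionally independent."""
    p_content, p_noise = prior, 1.0 - prior
    for feature in features:
        p_content *= like_content[feature]
        p_noise *= like_noise[feature]
    total = p_content + p_noise
    return p_content / total if total else 0.0


def select_content_blocks(blocks, title, like_content, like_noise,
                          prior=0.5, sim_threshold=0.2, post_threshold=0.5):
    """Keep only the blocks accepted by BOTH recognizers (set intersection)."""
    by_similarity = {
        b["id"] for b in blocks
        if cosine_similarity(b["text"], title) >= sim_threshold
    }
    by_bayes = {
        b["id"] for b in blocks
        if bayes_posterior(b["features"], prior, like_content, like_noise)
        >= post_threshold
    }
    return by_similarity & by_bayes
```

A block that only looks similar to the title (for example a list of related links) or that only has content-like layout features is dropped, which is why the intersection trades some recall for precision.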
These algorithms have been integrated into the page preprocessing module of the TianWang search engine platform. At the SEWM 2008 meeting, two Chinese Web information retrieval evaluation projects were organized with these algorithms as the framework: theme-based Web page identification and extraction of theme-content information blocks.
Building on these algorithms and on the TianWang file system and Map-Reduce computing platform, this paper also reports a distributed block-level PageRank algorithm, named QuarkRank. Practical tests show that the algorithms adapt and scale well and reach high precision and recall.
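To make the notion of block-level ranking concrete, the sketch below runs the standard PageRank power iteration on a small graph whose nodes are page blocks rather than whole pages. It is an illustration under assumptions, not QuarkRank itself: it runs on a single machine instead of Map-Reduce, and the node naming scheme (page#block), the damping factor and the iteration count are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    links maps each node to the list of nodes it links to. Here a node is
    a page block, e.g. "a#main" (an illustrative naming scheme).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical block graph: links from a navigation block and from
# main-content blocks are kept as separate nodes.
block_links = {
    "a#main": ["b#main"],
    "a#nav": ["b#main", "c#main"],
    "b#main": ["c#main"],
    "c#main": ["a#main"],
}
ranks = pagerank(block_links)
```

Treating blocks as nodes lets a link that sits in a navigation bar be weighted differently from one inside the main content, which is the motivation for ranking at block level.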
Keywords:
Web-Page B