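[...] accuracy and coverage are low, the returned content is not detailed enough and contains too much noise, and maintaining a huge index library of web pages is especially difficult. [...] accurate [...]: [...] including the system architecture, which comprises a web spider, indexer, crawler and user interface, and the distribution characteristics of topics; [...] the determination of the topic, the collection and purification of basic web pages and the operating principle; and the optimization and implementation of an algorithm for eliminating duplicated pages.

The implementation is based mainly on the Lucene development kit. The web spider parses various types of documents, including text, HTML, Word, PDF and other formats, and extracts topic-related information from the parsed documents. The modules implemented for page processing include Chinese word segmentation [...].

[...] reproduction leads to the same content appearing at different web URLs, so there is a lot of duplicate content. The improved algorithm in this paper uses a main code and a secondary code to achieve [...]; the secondary code identifies the content of the web page [...].

Keywords: Chinese word segmentation, Lucene, feature series, elimination of duplicated pages

Statement of Originality

I declare that the thesis submitted is the research work I carried out, and the research results I obtained, under the guidance of my supervisor. To the best of my knowledge, except where specifically marked and acknowledged in the text, the thesis contains no research results already published or written by others, nor any material used to obtain a degree or certificate from Wuhan University of Technology or any other educational institution. Every contribution that colleagues working with me made to this research has, in the thesis, been [...]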