Computer Engineering & Science (计算机工程与科学)
CN43-1258/TP
(College of Computer and Information, Hohai University, Nanjing 211100, China)
Abstract: The problematic data in the data warehouse has a great impact on data [...]. In order to find and delete these problematic data, the primary work is the processing of similar repeated data. Currently, the most widely used algorithm for deduplication is the sorted-neighborhood method (SNM). After analyzing the shortcomings of this algorithm, an improved SNM algorithm (ISNM) is proposed. The attribute weights are calculated using the attribute discrimination method, which solves the subjectivity caused by artificial weights. The field filtering algorithm is used to calculate the similarity of two records, which reduces the number of comparisons of record attributes in the window and accelerates the detection speed of the algorithm. Variable windows are used instead of fixed-size windows to prevent missing records and reduce useless record comparisons. Experimental results show that the ISNM algorithm has obvious advantages in terms of recall, precision and running time overhead.
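For orientation, the baseline SNM that the abstract refers to can be sketched roughly as follows. This is a minimal illustration of the generic fixed-window sorted-neighborhood idea only, not the ISNM proposed in the paper, whose weighting, field filtering and variable-window details are not given in the abstract. The function names (`snm`, `record_similarity`, `field_similarity`), the hand-picked weights, the `difflib`-based string similarity, and the window and threshold values are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1].

    Uses difflib's ratio as an assumed stand-in for whatever
    string metric a real system would choose.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights):
    """Weighted sum of per-field similarities (weights assumed to sum to 1)."""
    return sum(w * field_similarity(x, y) for x, y, w in zip(r1, r2, weights))


def snm(records, key, weights, window=3, threshold=0.85):
    """Basic sorted-neighborhood method with a fixed-size window.

    Returns sorted pairs of indices into `records` that are judged
    to be similar duplicates.
    """
    # Step 1: sort record indices by the chosen sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    # Step 2: slide the window; each record is compared only with the
    # (window - 1) records that precede it in sorted order.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical sample data: (name, city).
    records = [
        ("John Smith", "Nanjing"),
        ("Alice Wong", "Beijing"),
        ("Jon Smith", "Nanjing"),
        ("Bob Lee", "Shanghai"),
    ]
    print(snm(records, key=lambda r: r[0].lower(), weights=(0.7, 0.3), window=2))
```

In this sample, sorting by name places the two Smith records next to each other, so even a window of 2 compares them. A fixed window is also what makes SNM miss duplicates that sort far apart, which is the limitation the abstract says variable windows are meant to address.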
Keyw