1 / 13
文档名称:

基于K近邻的聚类有效性指标.pdf

格式:pdf   大小:152KB   页数:13页
下载后只包含 1 个 PDF 格式的文档,没有任何的图纸或源代码,查看文件列表

如果您已付费下载过本站文档,您可以点这里二次下载

分享

预览

基于K近邻的聚类有效性指标.pdf

上传人:司棋夸克 2022/9/29 文件大小:152 KB

下载得到文件列表

基于K近邻的聚类有效性指标.pdf

文档介绍

文档介绍:该【基于K近邻的聚类有效性指标 】是由【司棋夸克】上传分享,文档一共【13】页,该文档可以免费在线阅读,需要了解更多关于【基于K近邻的聚类有效性指标 】的内容,可以使用淘豆网的站内搜索功能,选择自己适合的文档,以下文字是截取该文章内的部分文字,如需要获得完整电子版,请下载此文档到您的设备,方便您编辑和打印。:.

K-NearestNeighborClusterValidityIndex
YuXiao,JianYu
DepartmentofComputerScienceandTechnologyBeijingJiaotongUniversityBeijing,(100044)
Abstract
,weproposeanewcluster
validityindexbasedonK-,manyclustervalidityindices
,mostofthemarerestrictedtoeitherthe

fromthesimpleprincipleoftheK-nearestneighboralgorithm(KNN),thatis,adatapointshould


informationtoevaluatetheclusteringresult,itwillbeappropriateforboththesphericaldataand
,ithasnoadditionrequestfortheclustering

thefinestructureinthedata(thebiggestvaluethatcanbeastheoptimalnumberofclusters).In
numericalexperiments,thecomparisonbetweentheNNIindexandotherclustervalidity
approachesontheclusteringresultsoftheK-meansandSpectralclusteringalgorithmssuggests
thatourmethodperformswellformostconditions.
Keywords:clustervalidity,K-nearestneighbor,K-meansalgorithm,spectralclustering
1Introduction
Clusteringisanimportanttechniquefordataanalysis—theproblemofextractinggroupsfrom
,howtoinfertheoptimalnumber
ofclustersandtoevaluatetheclusteringresultswithaknownnumberofclustersisstillahardand
criticalpartoftheclusteranalysis,whichiscalledclustervalidityissue.
Intheliterature,therehavebeenmanymethodsproposedtoestimatetheoptimalnumberof
,thereweresomeclassicalvalidityindices,namely,theDavies-Bouldin
index,Calinski-Harabaszindex,XieBeniindex,
twotypesaccordingtoGordon(1999):
thenumberofclustersbasedontheentiredatasetandfindtheoptimalclusternumbertomaximize

,compactnessand
separationarefrequentlyusedasthecriteriaforclusteringevaluationandchoosingtheoptimal
(2007)summarizedshortcomingsoftheclustervalidity
measuresusingthesetwocriteria:theyarenotapplicableforclusterswithdifferentdensitiesor
sizes;thecompactnessmeasuresareinclinedtomonotonicdecreaseasthenumberofclusters

solvethislimitationofthesepreviousindices,twoimportantmethodshavebeendevelopedfor
,WaltherandHastie(2001)used

Statisticistofindtheoptimalclusternumbertomaximizethegapbetweenthequantityofthe
originaldataandtheaveragequantityofthesamplesgeneratingfromthereferencedistribution.
TheGapStatisticoftenworksforthesphericallydistributeddataset,itperformsnotsowellforthe
-basedapproach(Lange,Roth,
Braun&Buhnamm,2004)
approachisasfollows:usingtheclusteringsolutionoftheoriginaldatasettoconstructaclassifier
inthefirststep,thenclassifyinganotherdatasetsourcingfromtheoriginaldatasetusingthe
classifierandclusteringtheseconddatausingclusteringalgorithmdirectly,finally,comparethe
-1-:.


-based
,itcannot
,MollerandRadke(2006)
proposedaclustervalidityapproachbasedonnearest-

aboutalltheseindicesreferredaboveinsection4.
Inthispaper,wepresentanovelclustervalidityindexcalledNNIwhichisbasedontheideaof
K--nearestneighboralgorithmisapopularandimportant
’snearestneighborstoclassifythe
point,thatis,

,giventheclusteringresult,wecancompute
theknearestneighborsforeverypointandcomparetheclasslabelofthepointwiththechosen
’s,itisreasonablethatthis
pointiscorrectlyclustered,otherwise,apunishingvaluewillbeassignedtothatpointaccordingto
’spunishablevaluecanbeusedasanew
clustervalidityindextotesttheclusteringresults,whichiscalledK-nearestneighborcluster
validityindex(NNI).Minimizingthisnewclustervalidityindexmeanstheclusteringresultismore
,thebiggest
numbertominimizethisnewclustervalidityindexisregardedastheoptimalnumberofcluster.
,wechoose
K-meansandSpectralclusteringtoillustrateitsefficacywithcomparisontoexistedmethodsby
numericalexperiments.
-means
,whichis

experimentsandthelastsectionisthediscussionoftheresults.
2ClusteringAlgorithms
TheclusteringalgorithmsusedinthispaperarethewellknownK-meansandSpectral
clusteringalgorithms.
-Means
TheK-Meansalgorithmisoneoftheclassicalunsupervisedlearningalgorithms(MacQueen,1967).
Theaimofthisalgorithmistoevolvecclustersinthedatasuchthatanobjectivefunction
nc2
J=∑∑urixi−zr(1)
i==11r
,andsimplyuri∈{0,1}.
Here,,all
,
cnewcentersoftheclustersareupdatedtothemeanofthepointsbelongingtothecorresponding

Jfallsbelowathresholdoramaximalnumberofiterationsisreached.
-2-:.


Inrecentyears,spectralclusteringhasbecomeoneofthemostpopularmodernclustering
,spectralclusteringoften

(2006)gaveatutorialonspectralclusteringand

paper,wejustchooseoneofthesecommonspectralclusteringalgorithmstouseinourexperiments,
whichisproposedbyNg,JordanandWeiss(2002)calledNJWalgorithm.
Algorithm1NJW
Input:DatasetX,c:Clusternumber.
n×n22
∈RdefinedbyAij=exp(−xi−xj2σ)ifi≠j,
andAii=0.
(i,i)-elementisthesumofA’sithrowand
constructthematrix
L=D−12AD−12.
(chosentobeorthogonaltoeachotherinthecaseof
repeatedeigenvalues),andformthematrixW∈Rn×cbystackingintheeigenvectorsin
columns.
'fromWbyrenormalizingeachofW’srowtohaveunitlength:
212
W'ij=Wij(∑Wij).
j
'cc
,clusterthemintoclustersviaK-means.
,assigntheoriginalpointxtoclusterjifandonlyifrowiofthematrixW'was
i
assignedtoclusterj.
Returnclusterlabelofallthedatapoints.
2
Here,thescalingparameterσcontrolshowrapidlytheaffinitymatrixfallsoffwiththe


clusters,
algorithmsspecialforthenon-
.
3TheNNIIndex
WepresentanewclustervalidityindexcalledNNIwhichisbasedontheideaofK-nearest
-neighbortechniqueisusedasasimplemeanstoobtainlocal

classificationweassumethattheknearestneighborsofpointxiessentiallybelongtothesame
,theparameterkshouldbeclearlysmallerthanthesizeofthesmallest
-analysisaboutthesizeoftheclustersinthedataset.
-3-:.

LetX={x1,Κ,xn}'astheEuclideandistance
betweenxiandxi'.Supposewehaveclusteredthedataintocdisjointclustersandrepresenta
vectoroflabelsY={y1,Κ,yn}with1≤yi≤={xi1,Κ,xik}asthesetofk
nearestneighborsofxi,thatisthekdatapointswiththek-
definetheNNIindex,wegiveeachpointapunishingvalueaccordingtothefollowingformula:
P(xi)=∑bij*exp(−(dij−δyi))
(2)
xj∈Si
Then,theNNIindexisdefinedas:
N
NNI(c)=∑i=1P(xi)(2)
⎧1yi≠yj
wherebij=⎨,δr=median(DKr),withDKr={dij|xi∈Cr,xj∈Si},
⎩0otherwise
1≤r≤c.
TheNNIindexisdefinedasthesumofallthepoints’
dataset,thedefinitionofthepunishingvalueforeachpointnotonlyconsidersthenumberofthe
point’sk-nearestneighborsthatbelongtothedifferentclustersfromthepoint’scluster,butalso
thinkseachk-nearestneighbor’seffecttothepunishingvalueofthispointrespectivelyaccording
toformula(2).Consideringthateachclusterhasadensitydifferentdensity,itiseasytodefinethe
’sk-nearest
neighborsismuchbiggerthanthedensityparameterδ,thepunishingvalueonthispoint
−δi,thesmaller
isthepunishingvalue,onthecontrary,
punishingvalueforadatapointistoavoidtheinfluenceoftheoutliersandincreasethepunishing
valueofthepointsinhighdensityareasthatbelongtothedifferentclusterswithit’sk-nearest
neighbors.
Algorithm2NNI
Input:DatasetX
Repeatforeachc∈{cmin,Κ,cmax}:

(c)oftheclusteringresultinthefirststep.

Returnthelargestc=argmincNNI(c)astheestimatednumberofclusters.
4ExperimentalResults
Inthissection,weprovideempiricalevidencefortheusefulnessofourapproachtoevaluatethe
,sevenexistedclustervalidityindicesareusedtocomparewiththe
NNIindexforthetoydatasetsandrealdatasets,andtheexperimentresultssuggestthecompetitive
performanceoftheNNIindex.
-4-:.


ThesixtoydatasetsusedinthispaperarenamedasA,B,C,D,
-lifedatasets:irisdataset,andthewiscousindiagnosticbreastcancer
(Wdbc).Bothofthetworeal-lifedatasetscanbedownloadedfrom
/~mlearn/MLSummaryhtml.
Inordertoexplaintheexperimentresultsofalltheclusterindicesmoreclearly,wewillshowthe
,weusePasthenumberoffeatures,andcasthetrue
numberofclusters.
Table1:Characteristicsofallthedatasets
DataSetClusterSizesPc
A6×4026
B200,100,5023
C2×100,3