文档介绍:该【基于K近邻的聚类有效性指标 】是由【司棋夸克】上传分享,文档一共【13】页,该文档可以免费在线阅读,需要了解更多关于【基于K近邻的聚类有效性指标 】的内容,可以使用淘豆网的站内搜索功能,选择自己适合的文档,以下文字是截取该文章内的部分文字,如需要获得完整电子版,请下载此文档到您的设备,方便您编辑和打印。:.
K-NearestNeighborClusterValidityIndex
YuXiao,JianYu
DepartmentofComputerScienceandTechnologyBeijingJiaotongUniversityBeijing,(100044)
Abstract
,weproposeanewcluster
validityindexbasedonK-,manyclustervalidityindices
,mostofthemarerestrictedtoeitherthe
fromthesimpleprincipleoftheK-nearestneighboralgorithm(KNN),thatis,adatapointshould
informationtoevaluatetheclusteringresult,itwillbeappropriateforboththesphericaldataand
,ithasnoadditionrequestfortheclustering
thefinestructureinthedata(thebiggestvaluethatcanbeastheoptimalnumberofclusters).In
numericalexperiments,thecomparisonbetweentheNNIindexandotherclustervalidity
approachesontheclusteringresultsoftheK-meansandSpectralclusteringalgorithmssuggests
thatourmethodperformswellformostconditions.
Keywords:clustervalidity,K-nearestneighbor,K-meansalgorithm,spectralclustering
1Introduction
Clusteringisanimportanttechniquefordataanalysis—theproblemofextractinggroupsfrom
,howtoinfertheoptimalnumber
ofclustersandtoevaluatetheclusteringresultswithaknownnumberofclustersisstillahardand
criticalpartoftheclusteranalysis,whichiscalledclustervalidityissue.
Intheliterature,therehavebeenmanymethodsproposedtoestimatetheoptimalnumberof
,thereweresomeclassicalvalidityindices,namely,theDavies-Bouldin
index,Calinski-Harabaszindex,XieBeniindex,
twotypesaccordingtoGordon(1999):
thenumberofclustersbasedontheentiredatasetandfindtheoptimalclusternumbertomaximize
,compactnessand
separationarefrequentlyusedasthecriteriaforclusteringevaluationandchoosingtheoptimal
(2007)summarizedshortcomingsoftheclustervalidity
measuresusingthesetwocriteria:theyarenotapplicableforclusterswithdifferentdensitiesor
sizes;thecompactnessmeasuresareinclinedtomonotonicdecreaseasthenumberofclusters
solvethislimitationofthesepreviousindices,twoimportantmethodshavebeendevelopedfor
,WaltherandHastie(2001)used
Statisticistofindtheoptimalclusternumbertomaximizethegapbetweenthequantityofthe
originaldataandtheaveragequantityofthesamplesgeneratingfromthereferencedistribution.
TheGapStatisticoftenworksforthesphericallydistributeddataset,itperformsnotsowellforthe
-basedapproach(Lange,Roth,
Braun&Buhnamm,2004)
approachisasfollows:usingtheclusteringsolutionoftheoriginaldatasettoconstructaclassifier
inthefirststep,thenclassifyinganotherdatasetsourcingfromtheoriginaldatasetusingthe
classifierandclusteringtheseconddatausingclusteringalgorithmdirectly,finally,comparethe
-1-:.
-based
,itcannot
,MollerandRadke(2006)
proposedaclustervalidityapproachbasedonnearest-
aboutalltheseindicesreferredaboveinsection4.
Inthispaper,wepresentanovelclustervalidityindexcalledNNIwhichisbasedontheideaof
K--nearestneighboralgorithmisapopularandimportant
’snearestneighborstoclassifythe
point,thatis,
,giventheclusteringresult,wecancompute
theknearestneighborsforeverypointandcomparetheclasslabelofthepointwiththechosen
’s,itisreasonablethatthis
pointiscorrectlyclustered,otherwise,apunishingvaluewillbeassignedtothatpointaccordingto
’spunishablevaluecanbeusedasanew
clustervalidityindextotesttheclusteringresults,whichiscalledK-nearestneighborcluster
validityindex(NNI).Minimizingthisnewclustervalidityindexmeanstheclusteringresultismore
,thebiggest
numbertominimizethisnewclustervalidityindexisregardedastheoptimalnumberofcluster.
,wechoose
K-meansandSpectralclusteringtoillustrateitsefficacywithcomparisontoexistedmethodsby
numericalexperiments.
-means
,whichis
experimentsandthelastsectionisthediscussionoftheresults.
2ClusteringAlgorithms
TheclusteringalgorithmsusedinthispaperarethewellknownK-meansandSpectral
clusteringalgorithms.
-Means
TheK-Meansalgorithmisoneoftheclassicalunsupervisedlearningalgorithms(MacQueen,1967).
Theaimofthisalgorithmistoevolvecclustersinthedatasuchthatanobjectivefunction
nc2
J=∑∑urixi−zr(1)
i==11r
,andsimplyuri∈{0,1}.
Here,,all
,
cnewcentersoftheclustersareupdatedtothemeanofthepointsbelongingtothecorresponding
Jfallsbelowathresholdoramaximalnumberofiterationsisreached.
-2-:.
Inrecentyears,spectralclusteringhasbecomeoneofthemostpopularmodernclustering
,spectralclusteringoften
(2006)gaveatutorialonspectralclusteringand
paper,wejustchooseoneofthesecommonspectralclusteringalgorithmstouseinourexperiments,
whichisproposedbyNg,JordanandWeiss(2002)calledNJWalgorithm.
Algorithm1NJW
Input:DatasetX,c:Clusternumber.
n×n22
∈RdefinedbyAij=exp(−xi−xj2σ)ifi≠j,
andAii=0.
(i,i)-elementisthesumofA’sithrowand
constructthematrix
L=D−12AD−12.
(chosentobeorthogonaltoeachotherinthecaseof
repeatedeigenvalues),andformthematrixW∈Rn×cbystackingintheeigenvectorsin
columns.
'fromWbyrenormalizingeachofW’srowtohaveunitlength:
212
W'ij=Wij(∑Wij).
j
'cc
,clusterthemintoclustersviaK-means.
,assigntheoriginalpointxtoclusterjifandonlyifrowiofthematrixW'was
i
assignedtoclusterj.
Returnclusterlabelofallthedatapoints.
2
Here,thescalingparameterσcontrolshowrapidlytheaffinitymatrixfallsoffwiththe
clusters,
algorithmsspecialforthenon-
.
3TheNNIIndex
WepresentanewclustervalidityindexcalledNNIwhichisbasedontheideaofK-nearest
-neighbortechniqueisusedasasimplemeanstoobtainlocal
classificationweassumethattheknearestneighborsofpointxiessentiallybelongtothesame
,theparameterkshouldbeclearlysmallerthanthesizeofthesmallest
-analysisaboutthesizeoftheclustersinthedataset.
-3-:.
LetX={x1,Κ,xn}'astheEuclideandistance
betweenxiandxi'.Supposewehaveclusteredthedataintocdisjointclustersandrepresenta
vectoroflabelsY={y1,Κ,yn}with1≤yi≤={xi1,Κ,xik}asthesetofk
nearestneighborsofxi,thatisthekdatapointswiththek-
definetheNNIindex,wegiveeachpointapunishingvalueaccordingtothefollowingformula:
P(xi)=∑bij*exp(−(dij−δyi))
(2)
xj∈Si
Then,theNNIindexisdefinedas:
N
NNI(c)=∑i=1P(xi)(2)
⎧1yi≠yj
wherebij=⎨,δr=median(DKr),withDKr={dij|xi∈Cr,xj∈Si},
⎩0otherwise
1≤r≤c.
TheNNIindexisdefinedasthesumofallthepoints’
dataset,thedefinitionofthepunishingvalueforeachpointnotonlyconsidersthenumberofthe
point’sk-nearestneighborsthatbelongtothedifferentclustersfromthepoint’scluster,butalso
thinkseachk-nearestneighbor’seffecttothepunishingvalueofthispointrespectivelyaccording
toformula(2).Consideringthateachclusterhasadensitydifferentdensity,itiseasytodefinethe
’sk-nearest
neighborsismuchbiggerthanthedensityparameterδ,thepunishingvalueonthispoint
−δi,thesmaller
isthepunishingvalue,onthecontrary,
punishingvalueforadatapointistoavoidtheinfluenceoftheoutliersandincreasethepunishing
valueofthepointsinhighdensityareasthatbelongtothedifferentclusterswithit’sk-nearest
neighbors.
Algorithm2NNI
Input:DatasetX
Repeatforeachc∈{cmin,Κ,cmax}:
(c)oftheclusteringresultinthefirststep.
∧
Returnthelargestc=argmincNNI(c)astheestimatednumberofclusters.
4ExperimentalResults
Inthissection,weprovideempiricalevidencefortheusefulnessofourapproachtoevaluatethe
,sevenexistedclustervalidityindicesareusedtocomparewiththe
NNIindexforthetoydatasetsandrealdatasets,andtheexperimentresultssuggestthecompetitive
performanceoftheNNIindex.
-4-:.
ThesixtoydatasetsusedinthispaperarenamedasA,B,C,D,
-lifedatasets:irisdataset,andthewiscousindiagnosticbreastcancer
(Wdbc).Bothofthetworeal-lifedatasetscanbedownloadedfrom
/~mlearn/MLSummaryhtml.
Inordertoexplaintheexperimentresultsofalltheclusterindicesmoreclearly,wewillshowthe
,weusePasthenumberoffeatures,andcasthetrue
numberofclusters.
Table1:Characteristicsofallthedatasets
DataSetClusterSizesPc
A6×4026
B200,100,5023
C2×100,3