Document description: Zhejiang University undergraduate course "Introduction to Data Mining", Lecture 2: Data Preprocessing Techniques. Xu Congfu, Associate Professor, Institute of Artificial Intelligence, Zhejiang University.

Outline
■ Why preprocess the data?
■ Data cleaning
■ Data integration and transformation

Why preprocess the data?
■ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  □ e.g., occupation = ""
■ noisy: …
■ inconsistent: containing discrepancies in codes or names
  □ e.g., Age = "4…", "03/07/1997"
  □ e.g., rating "1, 2, 3", now rating "A, B, …"
  □ e.g., discrepancy between duplicate records

Why Is Data Dirty?
■ Incomplete data comes from
  □ n/a …
  □ different consideration between the time when the data was collected and when it is analyzed
  □ human/hardware/…
■ Inconsistent data comes from
  □ Different data sources
  □ Functional dependency violation

Why Is Data Preprocessing Important?
■ No quality data, …
■ …, cleaning, … comprises the majority of the work of building a data warehouse. - Bill Inmon

Multi-Dimensional Measure of Data Quality
■ A well-accepted multidimensional view: …, accessibility
■ Broad categories: intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
■ Data cleaning
  □ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
■ Data integration
  □ Integration of multiple databases, data cubes, or files
■ Data transformation
  □ Normalization and aggregation
■ Data reduction
  □ Obtains a reduced representation in volume but produces the same or similar analytical results
■ Data discretization
  □ Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
[Figure: forms of data preprocessing]

Data cleaning
■ … - DCI survey
■ Data cleaning tasks
  □ Fill in missing values
  □ Identify outliers and smooth out noisy data
  □ Correct inconsistent data
  □ Resolve redundancy caused by data integration

Missing data
■ Data is not always available
  □ e.g., many tuples have no recorded value for sever