Scalable Parallel Data Mining
Eui-Hong (Sam) Han
Department of Computer Science and Engineering
Army High Performance Computing Research Center
University of Minnesota
Research Supported by NSF, DOE,
Army Research Office, AHPCRC/ARL
/~han
Joint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh
What is Data Mining?
Many Definitions
Search for Valuable Information in Large Volumes of Data.
Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.
A Step in the KDD Process…
Why Mine Data? Commercial Viewpoint...
Lots of data is being collected and warehoused.
Computing has become affordable.
Competitive Pressure is Strong
Provide better, customized services for an edge.
Information is becoming a product in its own right.
Why Mine Data? Scientific Viewpoint...
Data collected and stored at enormous speeds (Gbyte/hour)
remote sensor on a satellite
telescope scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques are infeasible for raw data
Data mining for data reduction
cataloging, classifying, segmenting data
Helps scientists in Hypothesis Formation
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification Example
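[Figure: a data table whose columns are two categorical attributes, one continuous attribute, and a class label. A classifier is learned from the Training Set to produce a Model, which is then applied to the Test Set.]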
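The train/learn/test flow in this figure can be sketched in a few lines. This is only an illustration, not the method used in the talk: it swaps in a very simple one-attribute rule learner, and the attribute name `refund`, the function names, and all records below are hypothetical toy values.

```python
from collections import Counter, defaultdict


def learn_one_rule(records, attribute, label="class"):
    """Learn a model: map each value of one attribute to its majority class."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[label]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fallback class for attribute values never seen during training.
    default = Counter(r[label] for r in records).most_common(1)[0][0]
    return rule, default


def predict(model, record, attribute):
    """Apply the learned model to a single record."""
    rule, default = model
    return rule.get(record[attribute], default)


def accuracy(model, records, attribute, label="class"):
    """Fraction of records whose predicted class matches the true class."""
    correct = sum(predict(model, r, attribute) == r[label] for r in records)
    return correct / len(records)


# Hypothetical toy data, invented for illustration only.
training_set = [
    {"refund": "Yes", "class": "No"},
    {"refund": "No", "class": "Yes"},
    {"refund": "No", "class": "Yes"},
    {"refund": "Yes", "class": "No"},
    {"refund": "No", "class": "No"},
]
test_set = [
    {"refund": "Yes", "class": "No"},
    {"refund": "No", "class": "Yes"},
]

model = learn_one_rule(training_set, "refund")   # learn on the training set
test_accuracy = accuracy(model, test_set, "refund")  # evaluate on the test set
```

The point of the split is that the model is judged on records it did not see while learning, so the test accuracy estimates how well it generalizes.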
Classification Application
Direct Marketing
Fraud Detection
Customer Attrition/Churn
Sky Survey Cataloging
Example Decision Tree
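[Figure: a training table with attribute types categorical, categorical, continuous, and class, shown next to a decision tree. Splitting attributes: Refund (branches Yes / No), MarSt (branches Married / Single, Divorced), TaxInc (branches < 80K / > 80K). Leaves are labeled YES or NO.]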
The splitting attribute at a node is
determined based on the Gini index.
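As a minimal sketch of how a Gini-based choice can work, the code below uses the usual definition, Gini(t) = 1 - sum_j p_j^2 over the class fractions p_j at node t, and scores a candidate split by the size-weighted Gini of its partitions. It handles categorical attributes only; the function names, attribute names, and records are hypothetical and are not taken from the talk.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(partitions):
    """Size-weighted average Gini index over the partitions of a split."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)


def choose_split_attribute(records, attributes, label="class"):
    """Pick the categorical attribute whose split has the lowest weighted Gini."""
    def split_score(attribute):
        groups = {}
        for record in records:
            groups.setdefault(record[attribute], []).append(record[label])
        return gini_split(list(groups.values()))

    return min(attributes, key=split_score)


# Hypothetical toy records, invented for illustration only.
records = [
    {"refund": "Yes", "marital": "Single", "class": "No"},
    {"refund": "No", "marital": "Married", "class": "No"},
    {"refund": "No", "marital": "Single", "class": "Yes"},
    {"refund": "No", "marital": "Married", "class": "No"},
    {"refund": "No", "marital": "Single", "class": "Yes"},
    {"refund": "Yes", "marital": "Married", "class": "No"},
]

best = choose_split_attribute(records, ["refund", "marital"])
```

A pure node (all records in one class) has a Gini index of 0, so a lower weighted value means the split separates the classes more cleanly. A continuous attribute such as taxable income would additionally need a search over candidate thresholds, which this sketch leaves out.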
Hunt’s Method
An Example:
Attributes: Refund (Yes, No), Marital St