Data Mining Techniques
Review(Ⅰ)
What is data mining?
Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.
Review(Ⅱ)
KDD - knowledge discovery in databases
Data mining—core of knowledge discovery process
Preprocessing
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
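The steps listed above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the toy records, the field names (`id`, `age`, `bought`), and every function name are hypothetical, and the "mining" step is a simple frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical toy sources; all field names are assumptions for illustration.
SALES = [
    {"id": 1, "age": 23, "bought": "milk"},
    {"id": 2, "age": None, "bought": "bread"},  # record with a missing value
    {"id": 3, "age": 35, "bought": "milk"},
]
WEB = [
    {"id": 4, "age": 41, "bought": "milk"},
    {"id": 5, "age": 29, "bought": "beer"},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: discretize age into two coarse groups."""
    return [
        {"age_group": "under_30" if r["age"] < 30 else "30_plus",
         "bought": r["bought"]}
        for r in records
    ]

def mine(records):
    """Data mining (stand-in): count (age_group, item) co-occurrences."""
    return Counter((r["age_group"], r["bought"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{group} -> {item} (count={c})"
            for (group, item), c in sorted(patterns.items())]

def run_kdd(min_count=2):
    """Run the steps in the order given on the slide."""
    data = integrate(clean(SALES), clean(WEB))
    data = transform(select(data, ["age", "bought"]))
    return present(evaluate(mine(data), min_count))

if __name__ == "__main__":
    for line in run_kdd():
        print(line)
```

Keeping each step as its own function mirrors the slide: any one stage (for example, the stand-in `mine`) can be swapped for a real technique without touching the others.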
Top 10 Algorithms
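#1: C4.5 (61 votes): decision tree, classification algorithm
#2: K-Means (60 votes): k-means clustering algorithm
#3: SVM (58 votes): support vector machine, classification algorithm
#4: Apriori (52 votes): association rule mining algorithm
#5: EM (48 votes): expectation maximization, clustering and parameter estimation
#6: PageRank (46 votes): Google's well-known page ranking algorithm
#7: AdaBoost (45 votes): classification algorithm that combines weak learners into a strong one
#7: kNN (45 votes): classification method that takes the nearest neighbors as the model
#7: Naive Bayes (45 votes): classification algorithm based on the native distribution of the objects; needs little or no prior knowledge
#10: CART (34 votes): decision tree classification method based on recursive binary partitioning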
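As one concrete instance from the list, kNN classifies a point by majority vote among its nearest labeled neighbors. The following is a minimal sketch, assuming Euclidean distance and a small made-up 2-D training set; the function name `knn_predict` and the data are illustrative and do not come from the slides.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    Distance is Euclidean (an assumption; other metrics are also common).
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: two loose clusters labeled "a" and "b".
TRAIN = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.5), "a"),
    ((6.0, 6.0), "b"), ((6.5, 7.0), "b"), ((7.0, 6.5), "b"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, (1.2, 1.4)))
    print(knn_predict(TRAIN, (6.4, 6.2)))
```

With two classes, an odd `k` avoids tied votes. This sketch scans the whole training set for every query, which is fine for a toy example but not for large data.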
Top 10 Problems
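#1: A unifying theory of data mining. Ten years ago, experts observed that much data mining work was short-term and driven by urgent needs: techniques were developed for individual problems, with no unifying theory and no long-range view. A reasonably complete unifying theory of data mining is still being explored today.
#2: Scalability, high dimensionality, and high-speed data. With the data mining techniques of ten years ago, the resources required (time, space, and CPU) grew exponentially as dimensionality and data volume increased, which became a bottleneck in data stream analysis, network attack and defense, and sensor network applications. The problem persists today.
#3: Efficient processing of time series, plus efficient classification, clustering, and prediction. Problems remain today in short- and long-term forecasting and in high-precision processing.
#4: Mining complex knowledge from complex data; graph data mining is a prominent example. Today there is also demand in mining intervention rules for sub-complex systems.
#5: Network mining: social networks, e-mail, web pages, online counter-terrorism, massive-scale data mining, and so on. The problems remain.
#6: Distributed mining and multi-agent mining, for example in large online games and networked military confrontation; demand is growing.
#7: Biological data mining, such as mining related to AIDS vaccines and DNA; a rapidly growing area.
#8: Research on the methodology of data mining itself; breakthroughs are still awaited.
#9: Data mining together with information security and privacy protection; currently a focus of attention.
#10: Mining special kinds of data, including high-value data (such as intensive care unit data), skewed data (distorted by sampling bias), and imbalanced data (where the useful part is only a small fraction).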
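Problem #10 mentions imbalanced data, where the useful cases are only a small fraction. One way to see why this is hard: a model that always predicts the majority class can score high accuracy while finding none of the useful cases. The sketch below uses made-up labels (990 negatives, 10 positives); the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that were predicted positive."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    if not positives:
        return 0.0
    return sum(p == positive for _, p in positives) / len(positives)

# Made-up imbalanced labels: 1 marks the rare, useful class.
Y_TRUE = [0] * 990 + [1] * 10

# A trivial "model" that always predicts the majority class.
Y_PRED = [0] * len(Y_TRUE)

if __name__ == "__main__":
    print(f"accuracy = {accuracy(Y_TRUE, Y_PRED):.2f}")
    print(f"recall   = {recall(Y_TRUE, Y_PRED):.2f}")
```

Here accuracy looks excellent while recall on the rare class is zero, which is why imbalanced data calls for evaluation measures beyond plain accuracy.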
Mining Frequent Patterns, Associations
Outline
What is association rule mining and frequent pattern mining?
M