文档介绍:CS 361A (Advanced Data Structures and Algorithms)
Lecture 20 (Dec 7, 2005)
Data Mining: Association Rules
Rajeev Motwani
(partially based on notes by Jeff Ullman)
CS 361A
1
Association Rules Overview
Market Baskets & Association Rules
Frequent item-sets
A-priori algorithm
Hash-based improvements
One- or two-pass approximations
High-correlation mining
CS 361A
2
Association Rules
Two Traditions
DM is science of approximating joint distributions
Representation of process generating data
Predict P[E] for interesting events E
DM is technology for fast counting
pute certain summaries quickly
Lets try to use them
Association Rules
Captures interesting pieces of joint distribution
Exploits fast counting technology
CS 361A
3
Market-Basket Model
Large Sets
Items A = {A1, A2, …, Am}
., products sold in supermarket
Baskets B = {B1, B2, …, Bn}
small subsets of items in A
., items bought by customer in one transaction
Support – sup(X) = number of baskets with itemset X
Frequent Itemset Problem
Given – support threshold s
Frequent Itemsets –
Find – all frequent itemsets
CS 361A
4
Example
Items A = {milk, coke, pepsi, beer, juice}.
Baskets
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
Support threshold s=3
Frequent itemsets
{m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}
CS 361A
5
Application 1 (Retail Stores)
Real market baskets
chain stores keep TBs of customer purchase info
Value?
how typical customers navigate stores
positioning tempting items
suggests “tie-in tricks”– ., hamburger sale while raising ketchup price
…
High support needed, or no $$’s
CS 361A
6
Application 2 (Information Retrieval)
Scenario 1
baskets = documents
items = words in documents
frequent word-groups = linked concepts.
Scenario 2
items = sentences
baskets = documents containing sentences
frequent sentence-groups = possible plagiarism
CS 361A
7
Application 3 (Web Search)
Scenario 1
baskets = web pag