Data Mining: Frequent Itemsets and Association Rules
LECTURE 3
This is how it all started…
• Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216
• Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
Market-Basket Data
• A large set of items, e.g., things sold in a
supermarket.
• A large set of baskets, each of which is a small
subset of the items, e.g., the things one customer
buys on one day.
Items: {Bread, Milk, Diaper, Beer, Eggs, Coke}

Baskets (transactions):

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
Frequent itemsets
• Goal: find combinations of items (itemsets) that occur frequently
• Called Frequent Itemsets
• Support s(I): the number of transactions that contain itemset I

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Examples of frequent itemsets with s(I) ≥ 3:
{Bread}: 4
{Milk}: 4
{Diaper}: 4
{Beer}: 3
{Diaper, Beer}: 3
{Milk, Bread}: 3
Market-Baskets – (2)
• Really, a general many-to-many mapping (association) between two kinds of things, where one kind (the baskets) is a set of the other (the items).
• But we ask about connections among “items,” not “baskets.”
Applications – (1)
• Items = products; baskets = sets of products
someone bought in one trip to the store.
Applications – (2)
• Baskets = Web pages; items = words.
Applications – (3)
• Baskets = sentences; items = documents
containing those sentences.
• Frequent Itemset
• An itemset I whose support is greater than or equal to a minsup threshold: s(I) ≥ minsup
Mining Frequent Itemsets task
• Input: market-basket data, threshold minsup
• Output: all frequent itemsets with support ≥ minsup
• Problem parameters:
  • N (size): number of transactions
    • Walmart: billions of baskets per year
    • Web: billions of pages
  • d (dimension): number of (distinct) items
    • Walmart sells more than 100,000 items
    • Web: billions of words
  • w: max size of a basket
  • M: number of possible itemsets, M = 2^d
The itemset lattice
[Figure: the itemset lattice over items {A, B, C, D, E}, a representation of all possible itemsets and their subset relationships, from the empty set (null) at the top, through singletons, pairs, triples, and so on, down to ABCDE.]
[Figure: the naive counting picture. N transactions (each containing at most w items) are matched against a list of M candidate itemsets.]
Computation Model
• Typically, data is kept in flat files rather than in a
database system.
• Stored on disk.
• Stored basket-by-basket.
• We can expand a basket into pairs, triples, etc. as we read the data.
  • Use k nested loops, or recursion, to generate all itemsets of size k (see the sketch below).
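For instance, the size-k subsets of one basket can be generated with the Python standard library (a minimal sketch; the basket is transaction 4 of the running example):

```python
from itertools import combinations

# All itemsets of size k = 2 contained in one basket, items in sorted order.
basket = ["Bread", "Milk", "Diaper", "Beer"]
for itemset in combinations(sorted(basket), 2):
    print(itemset)   # ('Beer', 'Bread'), ('Beer', 'Diaper'), ...
```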
Main-Memory Bottleneck
• For many frequent-itemset algorithms, main
memory is the critical resource.
• As we read baskets, we need to count something, e.g.,
occurrences of pairs.
• The number of different things we can count is limited by main memory.
• Swapping counts in/out is too slow (see the estimate below).
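For a rough sense of scale (an illustrative back-of-the-envelope estimate, not a figure from the slides): with d = 100,000 distinct items there are about d²/2 = 5 × 10⁹ pairs, so naively counting all pairs at 4 bytes per count would need roughly 20 GB, far beyond typical main memory.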
The Apriori Principle
• Apriori principle (main observation):
  – If an itemset is frequent, then all of its subsets must also be frequent.
  – If an itemset is not frequent, then none of its supersets can be frequent.
  – The support of an itemset never exceeds the support of its subsets:
    ∀X, Y: X ⊆ Y ⇒ s(X) ≥ s(Y)
Illustration of the Apriori principle
[Figure: the itemset lattice over {A, B, C, D, E}. The subsets found to be frequent lie in the upper part of the lattice; once AB is found to be infrequent, all of its supersets (ABC, ABD, …, up to ABCDE) are pruned without being counted.]
The Apriori algorithm
Level-wise approach. Notation: Ck = candidate itemsets of size k; Lk = frequent itemsets of size k.
1. k = 1, C1 = all items
2. While Ck is not empty:
3.   Frequent itemset generation: scan the database to find which itemsets in Ck are frequent and put them into Lk
4.   Candidate generation: generate the candidate itemsets Ck+1 of size k+1 using Lk
5.   k = k+1
R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, 1994.
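A minimal Python sketch of this level-wise loop on the running example (an illustration, not the paper's optimized implementation; candidate generation is the join-and-prune step detailed on the next slides):

```python
from itertools import combinations

def apriori(transactions, minsup):
    # Itemsets are sorted tuples; support is a raw transaction count.
    items = sorted({i for t in transactions for i in t})
    Ck = [(i,) for i in items]                       # C1 = all items
    frequent = {}                                    # itemset -> support
    while Ck:
        # Frequent itemset generation: one scan of the database per level.
        counts = dict.fromkeys(Ck, 0)
        for t in transactions:
            ts = set(t)
            for c in Ck:
                if ts.issuperset(c):
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(Lk)
        # Candidate generation: join itemsets sharing the first k-1 items,
        # then keep a candidate only if all of its k-subsets are frequent.
        prev = sorted(Lk)
        Ck = [p + (q[-1],)
              for i, p in enumerate(prev) for q in prev[i + 1:]
              if p[:-1] == q[:-1]
              and all(s in Lk for s in combinations(p + (q[-1],), len(p)))]
    return frequent

# minsup = 3 on the running example recovers {Bread}, {Milk}, {Diaper},
# {Beer}, {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Beer, Diaper}.
T = [("Bread", "Milk"),
     ("Bread", "Diaper", "Beer", "Eggs"),
     ("Milk", "Diaper", "Beer", "Coke"),
     ("Bread", "Milk", "Diaper", "Beer"),
     ("Bread", "Milk", "Diaper", "Coke")]
print(apriori(T, minsup=3))
```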
Candidate Generation
• Apriori principle:
  • An itemset of size k+1 is a candidate to be frequent only if all of its subsets of size k are known to be frequent.
• Candidate generation:
  • Construct a candidate of size k+1 by combining frequent itemsets of size k:
    • If k = 1, take all pairs of frequent items.
    • If k > 1, join pairs of itemsets that differ by just one item.
  • For each generated candidate itemset, ensure that all of its subsets of size k are frequent.
Generate Candidates Ck+1
• Assumption: the items in an itemset are ordered
  • Integers in increasing order, strings in lexicographic order
  • The order ensures that if we scan an itemset and see an item y > x before seeing x, then x is not in the itemset
  • The itemsets in Lk are also ordered
• Self-join Lk:
insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, …, p.itemk-1 = q.itemk-1, p.itemk < q.itemk
Example
• L3 = {abc, abd, acd, ace, bcd}
• Generating candidate set C4
• Self-join L3*L3: abc and abd join to give abcd; acd and ace join to give acde
Illustration of the Apriori principle (minsup = 3)

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Milk, Diaper} | 2
Only this triplet has all of its subsets frequent, but it is below the minsup threshold.

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 itemsets.
With support-based pruning: 6 + C(4,2) + 1 = 6 + 6 + 1 = 13 itemsets.
Generate Candidates Ck+1
• Are we done? Are all the candidates valid? Not necessarily: apply the Apriori principle again.
• Pruning step:
  • For each candidate (k+1)-itemset, create all of its subset k-itemsets
  • Remove a candidate if it contains a subset k-itemset that is not frequent
Example
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining L3*L3: {a,b,c} and {a,b,d} join to give {a,b,c,d}; {a,c,d} and {a,c,e} join to give {a,c,d,e}
• Pruning: abcd is kept (abc, abd, acd, bcd are all in L3); acde is removed because its subset cde (and also ade) is not in L3
• C4 = {abcd}
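A standalone sketch of this join-and-prune step on the example above (itemsets as sorted tuples):

```python
def generate_candidates(Lk):
    # Join itemsets that agree on the first k-1 items, then prune
    # candidates that have an infrequent k-subset.
    Lk = sorted(Lk)
    Lk_set = set(Lk)
    out = []
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            if p[:-1] != q[:-1]:
                continue                      # prefixes differ: no join
            cand = p + (q[-1],)               # e.g., abc + abd -> abcd
            subsets = [cand[:j] + cand[j + 1:] for j in range(len(cand))]
            if all(s in Lk_set for s in subsets):
                out.append(cand)
    return out

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
print(generate_candidates(L3))   # [('a', 'b', 'c', 'd')]: acde was pruned
```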
Picture of A-Priori
[Figure: main-memory layout. Pass 1 (first pass): item counts → frequent items. Pass 2 (second pass): counts of pairs of frequent items → frequent pairs.]
• Storing pair counts takes 4 bytes per pair (triangular matrix) or about 12 bytes per occurring pair (triples; see Approach #2 below).
Triangular-Matrix Approach
• Number items 1, 2, …, n.
• Requires a table of size O(n) to convert item names to consecutive integers.
• Count {i, j} only if i < j.
• Keep pairs in the order {1,2}, {1,3}, …, {1,n}, {2,3}, {2,4}, …, {2,n}, {3,4}, …, {3,n}, …, {n−1,n}.
• Find pair {i, j} at position (i − 1)(n − i/2) + j − i.
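A sketch of this lookup (positions are 1-based, as in the formula; the last line converts to a 0-based array index):

```python
def pair_index(i, j, n):
    # Position (1-based) of pair {i, j}, 1 <= i < j <= n, in the order
    # {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n}.
    assert 1 <= i < j <= n
    return int((i - 1) * (n - i / 2) + j - i)

# n = 5: {1,2} -> 1, {1,5} -> 4, {2,3} -> 5, {4,5} -> 10 (the last slot).
counts = [0] * (5 * 4 // 2)            # one integer count per pair
counts[pair_index(2, 4, 5) - 1] += 1   # record one occurrence of {2, 4}
```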
Details of Approach #2 (triples)
[Figure: main-memory layout. Pass 1: item counts. Pass 2: a frequent-items table mapping old item numbers to new consecutive numbers, plus counts of pairs of frequent items.]
• Count pairs as triples {i, j, c}: about 12 bytes per occurring pair, versus 4 bytes per pair in the triangular matrix, so this wins when far fewer than all pairs actually occur.
Example of Association Rules
Market-basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Mining Association Rules
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s): the fraction of transactions that contain both X and Y, i.e., the probability P(X, Y) that X and Y occur together
  – Confidence (c): how often Y appears in transactions that contain X, i.e., the conditional probability P(Y|X) that Y occurs given that X has occurred

Example, for {Milk, Diaper} → {Beer} on the transactions above (σ(·) is the support count):
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
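Both metrics are simple counts over the transactions; a sketch on the running example:

```python
T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]

def sigma(itemset):
    # Support count: number of transactions containing the itemset.
    return sum(1 for t in T if t >= itemset)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(T)       # 2/5 = 0.4
confidence = sigma(X | Y) / sigma(X)  # 2/3 ~ 0.67
print(support, confidence)
```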
Problem Definition
– Input: market-basket data, minsup and minconf values
– Output: all rules X → Y over items in I with s ≥ minsup and c ≥ minconf
Mining Association Rules
• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a partitioning of a frequent itemset into a Left-Hand Side (LHS) and a Right-Hand Side (RHS)

Frequent itemset: {A,B,C,D}
E.g., rule: AB → CD
All candidate rules:
BCD → A, ACD → B, ABD → C, ABC → D,
CD → AB, BD → AC, BC → AD, AD → BC, AB → CD, AC → BD,
D → ABC, C → ABD, B → ACD, A → BCD
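A sketch of enumerating these LHS → RHS partitions and filtering by confidence (the support counts in the usage lines are illustrative placeholders, not from the slides):

```python
from itertools import combinations

def rules_from_itemset(itemset, support, minconf):
    # All rules LHS -> RHS partitioning 'itemset'. 'support' maps
    # frozensets to support counts and must cover every subset used.
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for lhs in combinations(sorted(items), r):
            lhs = frozenset(lhs)
            conf = support[items] / support[lhs]
            if conf >= minconf:
                yield set(lhs), set(items - lhs), conf

# Hypothetical supports for itemset {A, B}:
support = {frozenset("AB"): 4, frozenset("A"): 5, frozenset("B"): 8}
for lhs, rhs, c in rules_from_itemset("AB", support, minconf=0.7):
    print(lhs, "->", rhs, round(c, 2))   # {'A'} -> {'B'} 0.8
```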
Association Rule anti-monotonicity
• In general, confidence does not have an anti-monotone property with respect to the size of the itemset:
  c(ABC → D) can be larger or smaller than c(AB → D)
• But the confidence of rules generated from the same itemset is anti-monotone with respect to the size of the RHS, e.g., for L = {A,B,C,D}:
  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
[Figure: the lattice of rules created by the RHS for itemset {A,B,C,D}, from CD → AB, BD → AC, BC → AD, AD → BC, AC → BD, AB → CD down to D → ABC, C → ABD, B → ACD, A → BCD. Once a rule is found to have low confidence, all rules below it (obtained by moving more items to the RHS) are pruned.]
Rule Generation for the Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the RHS:
  join(CD → AB, BD → AC) would produce the candidate rule D → ABC
• Number of frequent itemsets: 3 × Σ_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069
• This motivates compact representations such as maximal and closed itemsets.
Maximal Itemsets
• An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
• Maximal itemsets = positive border
[Figure: the itemset lattice over {A, B, C, D, E} with the border separating frequent from infrequent itemsets; the maximal frequent itemsets sit just above the border.]
Example transactions:

TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

[Figure: the itemset lattice annotated with the IDs of the supporting transactions for each itemset (e.g., AB is supported by transactions 1 and 2); itemsets such as ABCE and ABDE are not supported by any transaction.]
Maximal vs Closed Frequent Itemsets
• An itemset is closed if none of its immediate supersets has exactly the same support.
• Minimum support = 2
[Figure: the lattice for the five transactions above, with each itemset labeled by its supporting transactions. Some itemsets are closed but not maximal; others are both closed and maximal.]
• # Closed = 9
• # Maximal = 4
Maximal vs Closed Itemsets
• Maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets
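Given support counts for all frequent itemsets, both families can be read off by inspecting immediate supersets; a sketch (the three-entry support map is an illustrative toy, not the slide's example):

```python
def closed_and_maximal(support):
    # 'support' maps frozensets (all frequent itemsets) to their counts.
    closed, maximal = set(), set()
    for s, cnt in support.items():
        supersets = [t for t in support if len(t) == len(s) + 1 and t > s]
        if all(support[t] < cnt for t in supersets):
            closed.add(s)      # no immediate superset has the same support
        if not supersets:
            maximal.add(s)     # no immediate superset is frequent at all
    return closed, maximal

# A and AB have equal support, so A is not closed; AB is closed and maximal.
sup = {frozenset("A"): 3, frozenset("B"): 4, frozenset("AB"): 3}
print(closed_and_maximal(sup))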
Pattern Evaluation
• Association rule algorithms tend to produce too many rules, and many of them are uninteresting or redundant
  • Redundant if, e.g., {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
  • Summarization techniques address redundancy
  • Uninteresting if the pattern that is revealed does not offer useful information
• Interestingness measures: a hard problem to define
• Example: consider the rule {Tea} → {Coffee} (contingency table below):
  Confidence = P(Coffee|Tea) = 15/20 = 0.75, which looks high
  But P(Coffee) = 0.9 and P(Coffee|¬Tea) = 75/80 = 0.9375, so tea drinkers are actually less likely to drink coffee
Statistical Independence
• Population of 1000 students
  • 600 students know how to swim (S)
  • 700 students know how to bike (B)
  • 420 students know how to swim and bike (S, B)
• P(S, B) = 420/1000 = 0.42 = 0.6 × 0.7 = P(S) × P(B): swimming and biking are statistically independent
• Lift / Interest / PMI:
  Lift = P(Y|X) / P(Y) = P(X, Y) / (P(X) · P(Y)) = Interest
• Piatetsky-Shapiro:
  PS = P(X, Y) − P(X) · P(Y)
      | Coffee | ¬Coffee | Total
Tea   |   15   |    5    |   20
¬Tea  |   75   |    5    |   80
Total |   90   |   10    |  100
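For this table, lift confirms the negative association (a quick check):

```python
P_tea_and_coffee = 15 / 100
P_tea, P_coffee = 20 / 100, 90 / 100

lift = P_tea_and_coffee / (P_tea * P_coffee)  # 0.15 / 0.18 ~ 0.83 < 1
ps = P_tea_and_coffee - P_tea * P_coffee      # -0.03 < 0
print(lift, ps)   # both signal a negative association despite c = 0.75
```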
• Lift/MI can be dominated by rare events, e.g., word co-occurrence:
  MI(honk, konk) = 0.0001 / (0.0001 × 0.0001) = 10000
  MI(hong, kong) = 0.19 / (0.2 × 0.2) = 4.75
Picture of A-Priori (recap)
[Figure: first pass: item counts → frequent items; second pass: counts of pairs of frequent items → frequent pairs.]
PCY Algorithm
• During Pass 1 of Apriori (computing frequent items), most memory is idle.
• PCY idea: use the idle memory for a hash table of buckets. As each basket is read, hash every pair it contains to a bucket and increment that bucket's count; on Pass 2, count a pair only if it hashed to a frequent bucket.
Needed Extensions
1. Pairs of items need to be generated from the input file; they are not present in the file.
2. Memory organization:
  • Space to count each item: one (typically) 4-byte integer per item.
  • Use the rest of the space for as many integers, representing buckets, as we can.
Picture of PCY
[Figure: main-memory layout. Pass 1: item counts, with the rest of memory used as a hash table of bucket counts. Pass 2: the bucket counts are replaced by a bitmap (one bit per bucket: frequent or not), leaving room for counts of candidate pairs.]
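A sketch of the two PCY passes (the bucket count and hash function are illustrative choices; a real implementation would store pair counts in the triangular matrix or triples layout described earlier):

```python
from itertools import combinations

def pcy(baskets, minsup, n_buckets=1009):
    # Pass 1: count items, and hash every pair to a bucket counter.
    item_count, bucket = {}, [0] * n_buckets
    for b in baskets:
        for i in b:
            item_count[i] = item_count.get(i, 0) + 1
        for pair in combinations(sorted(set(b)), 2):
            bucket[hash(pair) % n_buckets] += 1
    freq_items = {i for i, c in item_count.items() if c >= minsup}
    bitmap = [c >= minsup for c in bucket]       # 1 bit per bucket in spirit

    # Pass 2: count a pair only if both items are frequent AND the pair
    # hashes to a frequent bucket.
    pair_count = {}
    for b in baskets:
        for pair in combinations(sorted(set(b) & freq_items), 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_count[pair] = pair_count.get(pair, 0) + 1
    return {p: c for p, c in pair_count.items() if c >= minsup}
```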
Main-Memory Picture (sampling-based algorithm)
[Figure: main memory holds a copy of the sample baskets plus space for itemset counts.]
Negative Border
• An itemset is in the negative border if it is not frequent in the sample, but all of its immediate subsets are.
[Figure: the frequent itemsets from the sample (singletons, pairs/doubletons, triples, …), with the negative border sitting just outside them, e.g., infrequent pairs whose two singletons are both frequent.]
Theorem:
• If there is an itemset that is frequent in the whole dataset but not frequent in the sample, then there is a member of the negative border for the sample that is frequent in the whole dataset.
• Proof idea: take a smallest such itemset; by minimality, all of its immediate subsets are frequent in the sample, so the itemset itself is in the negative border.
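A sketch of computing the negative border of a downward-closed family of frequent itemsets (the small family in the usage lines is an illustrative toy):

```python
def negative_border(frequent, items):
    # Itemsets NOT in 'frequent' all of whose immediate subsets are in it.
    border = set()
    for f in frequent | {frozenset()}:
        for i in set(items) - f:
            cand = f | {i}                    # an immediate superset of f
            if cand in frequent or cand in border:
                continue
            if all(cand - {j} in frequent for j in cand):
                border.add(cand)
    return border

freq = {frozenset(s) for s in ("", "A", "B", "C", "AB", "AC")}
# Border = {D, BC}: neither is frequent, but all their subsets are.
print(negative_border(freq, "ABCD"))
```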
[Figure: the itemset lattice over {A, B, C, D, E}, with the border between frequent and infrequent itemsets.]
FREQUENT ITEMSET RESEARCH