Concepts and
Techniques
— Chapter 5 —
analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
August 10, 2009 Data Mining: Concepts and Techniques 5
Basic Concepts: Frequent
Patterns
What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions?
Chapter 5: Mining Frequent Patterns,
Association and Correlations
frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
@SIGMOD’00)
Vertical data format approach (Charm—Zaki &
Hsiao @SDM’02)
Apriori: A Candidate Generation & Test
Approach
L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
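The pseudocode above can be sketched in Python roughly as follows. This is a minimal, unoptimized illustration rather than the slides' own implementation: the function name `apriori`, the join-then-prune candidate generation, and the use of `frozenset` keys are choices made here. The sample database is the four-transaction Database D used in the Apriori example later in these slides, and `min_support=2` is an assumption inferred from that example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        # Generate C(k+1) by joining pairs from Lk, then prune any candidate
        # that has an infrequent k-subset (downward closure).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Database D from the example slides; min_support=2 is an assumption.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, min_support=2)
print(sorted((sorted(s), c) for s, c in result.items()))
```

Each pass of the `while` loop corresponds to one level of the level-wise search and performs exactly one scan of the transactions.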
Implementation of Apriori
Itemset lattice:
ABCD
ABC ABD ACD BCD
AB AC BC AD BD CD
A B C D
{}
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: timeline over the transactions comparing Apriori and DIC counting of 1-itemsets, 2-itemsets, and 3-itemsets]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using
local frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc:
DB|abc
“d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
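One pattern-growth step (project the database on a known frequent pattern, then look only for locally frequent items) might look like the sketch below. The helper names `project` and `local_frequent_items`, the toy transactions, and the support threshold of 2 are all hypothetical and not taken from the slides.

```python
from collections import Counter


def project(db, pattern):
    """Return DB|pattern: the transactions containing every item of the
    pattern, with the pattern's own items removed."""
    pattern = set(pattern)
    return [set(t) - pattern for t in db if pattern <= set(t)]


def local_frequent_items(projected_db, min_support):
    """Count items inside the projected database only."""
    counts = Counter(item for t in projected_db for item in t)
    return {item: c for item, c in counts.items() if c >= min_support}


# Hypothetical toy database; each string is one transaction of single-letter items.
db = ["abcd", "abcde", "abce", "abd", "cd"]
db_abc = project(db, "abc")

# Every locally frequent item extends "abc" into a longer frequent pattern.
grown = {"abc" + item: c for item, c in local_frequent_items(db_abc, 2).items()}
print(grown)
```

No candidate itemsets are generated here: the only work is counting items in the (smaller) projected database.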
Construct FP-tree from a Transaction
Database
Patterns containing p
…
Pattern f
base
Construct the FP-tree for the frequent items of
[Figure: FP-tree fragment rooted at {} with nodes a3:n3, C2:k2, and C3:k3]
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent
pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never larger than the original database (not counting
node-links and the count field)
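A minimal sketch of the two-scan FP-tree construction illustrates where the compactness comes from. The `Node` class, the function names, the alphabetical tie-breaking rule, and the toy database are assumptions made for this example; node-links and the header table are omitted.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # First scan: count items and drop the infrequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}

    root = Node(None)
    # Second scan: insert each transaction's frequent items in
    # frequency-descending order (ties broken by item name).
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq


def count_nodes(node):
    """Number of item nodes below the given node (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical toy database with 11 item occurrences in total.
db = ["abc", "abd", "ab", "ae", "b"]
root, freq = build_fp_tree(db, min_support=2)
print(count_nodes(root), dict(freq))
```

Because frequent prefixes are shared, the three transactions that start with a, b collapse into a single path whose counts record how many transactions passed through each node.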
conditional FP-tree
Until the resulting FP-tree is empty, or it
am-proj DB    cm-proj DB    …
fc            f
fc            f
fc            f
FP-Growth vs. Apriori: Scalability With the
Support Threshold
[Figure: two plots of Runtime (sec.) against Support threshold (%)]
Advantages of the Pattern Growth
Approach
Divide-and-conquer:
Decompose both the mining task and DB according to
the frequent patterns obtained so far
Lead to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local freq items and building sub FP-tree, no pattern search and matching
A good open-source implementation and refinement of FPGrowth
FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Extension of Pattern Growth Mining
Methodology
Mining closed frequent itemsets and max-patterns
CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, FIMI’03)
Mining sequential patterns
Pattern-growth-based Clustering
Pattern-Growth-Based Classification
Summary
Mining Various Kinds of Association Rules
lower support
Exploration of shared multi-level mining (Agrawal
& Srikant@VLDB’95, Han & Fu@VLDB’95)
mined is maximized
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid
Example
Subtle: They disagree
Analysis of DBLP Coauthor Relationships
Recent DB conferences, removing balanced associations, low sup, etc.
Chicago in Dec.’02
Dimension/level constraint
category
Rule (or pattern) constraint
> $200)
Interestingness constraint
≥ 60%
Constraint-Based Frequent Pattern
Mining
Classification of constraints based on their
constraint-pushing capabilities
Anti-monotonic: If constraint c is violated, its
again
Data anti-monotonic: If a transaction t does not
Succinctness:
Given A1, the set of items satisfying a
succinctness constraint C, any set S
satisfying C is based on A1, i.e., S contains a
subset belonging to A1
Idea: Without looking at the transaction
database, whether an itemset S satisfies
constraint C can be determined based on the
selection of items
min(S.Price) ≤ v is succinct
sum(S.Price) ≥ v is not succinct
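To make the "selection of items" idea concrete, the sketch below decides min(S.Price) ≤ v purely from item membership and cross-checks that against direct evaluation. The price table, the threshold `v = 2`, and the function name `satisfies_by_selection` are hypothetical; the slides give no concrete prices.

```python
from itertools import combinations

# Hypothetical price table; the slides do not give one.
prices = {"a": 1, "b": 4, "c": 7, "d": 2}
v = 2

# A1: the items that individually satisfy price <= v.
A1 = {item for item, p in prices.items() if p <= v}


def satisfies_by_selection(itemset):
    """Decide min(S.Price) <= v from item membership alone (no database access)."""
    return bool(set(itemset) & A1)


# Cross-check against evaluating the constraint directly on every non-empty itemset.
items = sorted(prices)
all_itemsets = [c for r in range(1, len(items) + 1) for c in combinations(items, r)]
agree = all(
    satisfies_by_selection(s) == (min(prices[i] for i in s) <= v)
    for s in all_itemsets
)
print(sorted(A1), agree)
```

An itemset satisfies the constraint exactly when it contains at least one member of A1, which is why the constraint can be enforced when candidates are selected rather than after counting.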
The Apriori Algorithm — Example
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (candidates): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (candidates): {2 3 5}

Scan D → L3
itemset   sup
{2 3 5}   2
Naïve Algorithm: Apriori + Constraint
Database D and the Apriori run are the same as in the preceding example:
L1 = {1}: 2, {2}: 3, {3}: 3, {5}: 3
L2 = {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
L3 = {2 3 5}: 2
Constraint: Sum{S.price} < 5
The Constrained Apriori Algorithm:
Push a Succinct Constraint Deep
Database D and the Apriori run are the same as in the preceding example:
L1 = {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 = {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2 = {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
L3 = {2 3 5}: 2
Some itemsets in the C2 step are marked "not immediately to be used"
Constraint: min{S.price} <= 1
Algorithm: Push a Succinct
Constraint Deep
1-Projected DB
TID   Items
100   3 4
300   2 3 5
No need to project on 2, 3, or 5
Constraint: min{S.price} <= 1
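The projection above can be reproduced with a short sketch. Database D is the one from the Apriori example; the price table is an assumption (each item's price is taken to equal its item number, which the slides do not state), and the names `seeds` and `projected` are illustrative.

```python
# Database D from the slides, keyed by TID.
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}

# ASSUMPTION: each item's price equals its item number; the slides give no prices.
price = {item: item for t in db.values() for item in t}
v = 1

# Only items with price <= v can supply the required minimum-price member,
# so every pattern satisfying min{S.price} <= v contains one of them.
seeds = sorted(i for i, p in price.items() if p <= v)

# Project the database on those seed items only.
projected = {
    seed: {tid: items - {seed} for tid, items in db.items() if seed in items}
    for seed in seeds
}
print(projected)
```

Under that price assumption only item 1 qualifies, so the 2-, 3-, and 5-projected databases never need to be built.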
The Constrained FP-Growth Algorithm:
Push a Data Antimonotonic Constraint
Deep
Original data:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
Remove from data, leaving:
TID   Items
100   1 3
300   1 3
→ FP-Tree
Constraint: min{S.price} <= 1
The Constrained FP-Growth Algorithm:
Push a Data Antimonotonic Constraint
Deep
TID   Transaction
10    a, b, c, d, f, h
20    b, c, d, f, g, h
30    b, c, d, f, g
40    a, c, e, f, g
→ FP-Tree

Item   Profit
a      40
b      0
c      -20
d      -15
e      -30
f      -10
g      20
h      -5

B-Projected DB (recursive data pruning)
TID   Transaction
10    a, c, d, f, h
20    c, d, f, g, h
30    c, d, f, g
→ B FP-Tree
Constraint                      Antimonotone   Monotone   Succinct
S ⊆ V                           yes            no         yes
min(S) ≤ v                      no             yes        yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes            no         no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no             yes        no
range(S) ≤ v                    yes            no         no
range(S) ≥ v                    no             yes        no
support(S) ≤ ξ                  no             yes        no
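The anti-monotone and monotone entries for the value-based constraints can be spot-checked by brute force on a small example. The sketch below assumes a hypothetical list of non-negative item values and a threshold `v = 4`; a check like this can refute a claimed property on the sample but does not prove it in general.

```python
from itertools import combinations

values = [0, 1, 2, 5]  # hypothetical non-negative item values
v = 4

constraints = {
    "sum(S) <= v": lambda xs: sum(xs) <= v,
    "sum(S) >= v": lambda xs: sum(xs) >= v,
    "min(S) <= v": lambda xs: min(xs) <= v,
    "range(S) <= v": lambda xs: max(xs) - min(xs) <= v,
    "range(S) >= v": lambda xs: max(xs) - min(xs) >= v,
}


def check(constraint, values):
    """Return (antimonotone, monotone) as observed on all non-empty subsets."""
    # Work with index sets so that duplicate values would still be distinct items.
    sets = [
        frozenset(c)
        for r in range(1, len(values) + 1)
        for c in combinations(range(len(values)), r)
    ]
    holds = {s: constraint([values[i] for i in s]) for s in sets}
    pairs = [(s, t) for s in sets for t in sets if s < t]  # s is a proper subset of t
    # Anti-monotone: a violated set never has a satisfying superset.
    anti = all(holds[s] or not holds[t] for s, t in pairs)
    # Monotone: a satisfying set never has a violating superset.
    mono = all(holds[t] or not holds[s] for s, t in pairs)
    return anti, mono


results = {name: check(c, values) for name, c in constraints.items()}
for name, (anti, mono) in results.items():
    print(f"{name:14s} antimonotone={anti} monotone={mono}")
```

On this sample the observed pairs agree with the corresponding rows of the table.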
[Figure: classification of constraints into Antimonotone, Monotone, Succinct, Convertible anti-monotone, Convertible monotone, Strongly convertible, and Inconvertible]
methods
Summary
Why Mining Colossal Frequent
Patterns?
F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal
Frequent Patterns by Core Pattern Fusion”, ICDE'07.
We have many algorithms, but can we mine large (i.e., colossal)
patterns, say of size around 50 to 100? Unfortunately, not!
Why not? The curse of the “downward closure” of frequent patterns
The “downward closure” property
Any sub-pattern of a frequent pattern is frequent.
Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1,
a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There
are about 2^100 such frequent itemsets!
Whether we use breadth-first search (e.g., Apriori) or depth-first
search (FPgrowth), we have to examine that many patterns
Thus the downward closure property leads to explosion!
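The size of that explosion is easy to verify: a frequent pattern over 100 items has one frequent sub-pattern for every non-empty subset of its items. A small check, using only the standard library (`math.comb` requires Python 3.8 or later):

```python
from math import comb

n = 100  # length of the frequent pattern (a1, ..., a100)

# Count the non-empty sub-patterns size by size: C(n, 1) + C(n, 2) + ... + C(n, n).
total = sum(comb(n, k) for k in range(1, n + 1))

# The sum collapses to 2^n - 1, i.e. "about 2^100".
print(total == 2**n - 1, len(str(total)))
```

The count has 31 decimal digits, which is why neither level-wise nor depth-first enumeration can finish on such data.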
Colossal Patterns: A Motivating
Example
Let’s make a set of 40 transactions
T1 = 1 2 3 4 ….. 39 40
T2 = 1 2 3 4 ….. 39 40
: .
T40 = 1 2 3 4 ….. 39 40
Then delete the items on the diagonal
T1 = 2 3 4 ….. 39 40
T2 = 1 3 4 ….. 39 40
: .
T40 = 1 2 3 4 …… 39
Closed/maximal patterns may partially alleviate the problem but not really solve it: we often need to mine scattered large patterns!
Let the minimum support threshold σ = 20
There are $\binom{40}{20}$ frequent patterns of size 20
Each is closed and maximal
# patterns = $\binom{n}{n/2} \approx \sqrt{2/\pi}\,\frac{2^n}{\sqrt{n}}$
The size of the answer set is exponential in n
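The construction can be checked by brute force on a scaled-down instance. The sketch below uses n = 6 items and σ = 3 in place of the slide's n = 40 and σ = 20 (an assumption made only to keep enumeration tractable); items are numbered from 0, and the variable names are illustrative.

```python
from itertools import combinations
from math import comb

n = 6           # the slides use n = 40; 6 keeps full enumeration small
sigma = n // 2  # minimum support, mirroring sigma = 20 for n = 40

# Transaction i contains every item except item i (the deleted diagonal).
db = [frozenset(range(n)) - {i} for i in range(n)]


def support(itemset):
    return sum(1 for t in db if itemset <= t)


# Enumerate all frequent itemsets by brute force.
frequent = [
    frozenset(c)
    for r in range(1, n + 1)
    for c in combinations(range(n), r)
    if support(frozenset(c)) >= sigma
]

largest = [s for s in frequent if len(s) == n - sigma]
# A pattern is maximal if it has no frequent proper superset.
maximal = [s for s in frequent if not any(s < t for t in frequent)]

print(len(largest), comb(n, n // 2), set(maximal) == set(largest))
```

A pattern of size r is missing from exactly r transactions, so its support is n - r; every pattern of size n/2 is therefore frequent with support exactly σ, and none of them can be extended.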
Colossal Pattern Set: Small but Interesting
Colossal patterns are usually attached with
Mining Colossal Patterns: Motivation
and Philosophy
Motivation: Many real-world tasks need mining colossal
patterns
Micro-array analysis in bioinformatics (when support is
low)
Biological sequence patterns
The key is to develop a mechanism that may quickly reach
Alas, A Show of Colossal Pattern Mining!
Transaction Database D
A colossal pattern α
[Figure: α and its subpatterns α1, α2, …, αk, with their supporting transaction sets Dα, Dα1, Dα2, …, Dαk inside D]
size c
A random draw from a complete set of pattern of size c would
(abcef) (100): (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
30
REPLACE
A program trace data set, recording 4395
methods
Summary
Frequent-Pattern Mining: Summary
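Subset function: 1,4,7 / 2,5,8 / 3,6,9
Transaction: 1 2 3 5 6
[Figure: hash tree of candidate 3-itemsets (124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689); while descending the tree the transaction is split as 1 + 2 3 5 6, 1 2 + 3 5 6, and 1 3 + 5 6]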