Seminar 2009

Frequent Subgraph/Substructure Mining

Lei Shi
Department of Computer Science and
Engineering
State University of New York at Buffalo

University at Buffalo, The State University of New York

Outline
• Introduction
• Apriori-based Subgraph Mining
• Pattern Growth Subgraph Mining
• Summary


Graphs are everywhere


Graph Mining Problems
• Graph Pattern Mining
  – Frequent subgraph pattern mining
  – Pattern summarization
  – Optimal graph patterns
  – Graph patterns with constraints
  – Approximate graph patterns …
• Graph Classification
  – Graph clustering
  – Important node identification
  – Bridge and hub identification
• Other Important Topics
  – Graph compression
  – Graph model
  – Social network analysis

Subgraph Pattern Mining
• Frequent subgraph
  – A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
• Applications of subgraph pattern mining
  – Mining biochemical structures
  – Program control flow analysis
  – Mining XML structures or Web communities
  – Building blocks for graph classification, clustering, compression, comparison and correlation analysis
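The support definition above can be sketched in a few lines of Python. This is a brute-force illustration, not an algorithm from the cited papers. The helper names (`make_graph`, `contains`, `support`, `is_frequent`) and the toy dataset are hypothetical. Graphs are assumed to be simple and undirected, with vertex labels only (edge labels are left out to keep the sketch short).

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: {vertex: label}; edges: iterable of undirected (u, v) pairs."""
    return labels, {frozenset(e) for e in edges}

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (not necessarily induced) subgraph.

    Brute force: try every injective mapping of pattern vertices to graph vertices.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        # Vertex labels must agree under the mapping.
        if any(p_labels[v] != g_labels[m[v]] for v in p_nodes):
            continue
        # Every pattern edge must map onto an edge of the graph.
        if all(frozenset(m[v] for v in e) in g_edges for e in p_edges):
            return True
    return False

def support(dataset, pattern):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(contains(g, pattern) for g in dataset)

def is_frequent(dataset, pattern, min_sup):
    """A pattern is frequent if its support is no less than min_sup."""
    return support(dataset, pattern) >= min_sup

# Hypothetical toy dataset of three labeled graphs.
dataset = [
    make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2), (0, 2)]),
    make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),
    make_graph({0: "A", 1: "C"}, [(0, 1)]),
]
pattern = make_graph({0: "A", 1: "B"}, [(0, 1)])  # a single A-B edge
print(support(dataset, pattern), is_frequent(dataset, pattern, min_sup=2))
```

The exhaustive mapping search is exponential in the pattern size, which is why support counting is the expensive step in practice.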

Frequent Subgraph Example
[Figure: three example graphs (1), (2), (3) with vertex labels A, B, C, and the subgraphs found with their support]

Key Challenges in Subgraph Mining
• Graph isomorphism
  – to detect if two graphs are identical in structure
• Graph representation (Canonical Labeling)
  – A canonical label is a unique code of a given graph.
  – The canonical label should be the same no matter how graphs are represented, as long as graphs have the same topological structure and the same labeling of edges and vertices.
• Subgraph candidate generation
  – generate candidate frequent subgraphs from datasets

Subgraph Mining Approaches
• Apriori-based
  – AGM/AcGM: Inokuchi et al. (PKDD’00)
  – FSG: Kuramochi and Karypis (ICDM’01)
    M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM’01, pages 313-320, Nov. 2001
  – PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
  – FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
  – FTOSM: Horvath et al. (KDD’06)
• Pattern growth based
  – Subdue: Holder et al. (KDD’94)
  – MoFa: Borgelt and Berthold (ICDM’02)
  – gSpan: Yan and Han (ICDM’02)
    Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02) (December 09-12, 2002). ICDM. IEEE Computer Society, Washington, DC, 721
  – Gaston: Nijssen and Kok (KDD’04)
  – CMTreeMiner: Chi et al. (TKDE’05)
  – LEAP: Yan et al. (SIGMOD’08)

Outline
• Introduction and Background
• Apriori-based Subgraph Mining
• Pattern Growth Subgraph Mining
• Summary

Apriori-based Approach
• FSG: Frequent subgraph discovery. M. Kuramochi and G. Karypis. In ICDM’01, Nov. 2001
• Flattened representation as canonical labeling
• Apriori-based method to generate subgraph candidates

Graph Representation in FSG
• Flattened Representation

Graph Representation in FSG
• Flattened Representation
  – Lexicographic order or dictionary order
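A minimal sketch of a canonical label built from a flattened adjacency matrix is given below. It is a brute-force illustration and not FSG's optimized procedure. The function names `flatten` and `canonical_label` are hypothetical. Labels are assumed to be single characters, "0" is assumed to mark a missing edge, and taking the lexicographically smallest code (rather than the largest) is an arbitrary convention chosen here.

```python
from itertools import permutations

def flatten(labels, edges, order):
    """Code for one vertex ordering: vertex labels, then the upper triangle
    of the adjacency matrix read column by column."""
    head = "".join(labels[v] for v in order)
    cells = []
    for j in range(1, len(order)):      # columns left to right
        for i in range(j):              # rows above the diagonal
            key = frozenset((order[i], order[j]))
            cells.append(edges.get(key, "0"))   # "0" means no edge
    return head + "|" + "".join(cells)

def canonical_label(labels, edges):
    """Smallest flattened code over all vertex orderings (dictionary order)."""
    return min(flatten(labels, edges, p) for p in permutations(labels))

# Two drawings of the same path a-b-a with different vertex names,
# and a different graph a-a-b (all hypothetical examples).
g1 = ({0: "a", 1: "b", 2: "a"},
      {frozenset((0, 1)): "e", frozenset((1, 2)): "e"})
g2 = ({"x": "b", "y": "a", "z": "a"},
      {frozenset(("x", "y")): "e", frozenset(("x", "z")): "e"})
g3 = ({0: "a", 1: "a", 2: "b"},
      {frozenset((0, 1)): "e", frozenset((1, 2)): "e"})

print(canonical_label(*g1) == canonical_label(*g2))  # same graph, same label
print(canonical_label(*g1) == canonical_label(*g3))  # different graphs
```

Trying all orderings costs n! codes for n vertices, so this only shows the idea of a representation-independent label. Practical miners cut the search down, for example by partitioning vertices with invariants such as degree and label.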

Apriori-based method
• Apriori Property
  – If a graph is frequent, all of its subgraphs are frequent.
• Candidate Generation
  – Create a set of candidates of size k+1
  – from two given frequent k-subgraphs
  – that contain the same (k-1)-subgraph
  – One such pair may result in several candidates of size k+1
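The Apriori property can be written compactly as follows. The notation is an assumption made for illustration: D is the graph dataset, g' ⊆ g means that g' is a subgraph of g, and σ is a hypothetical symbol for the minimum support threshold.

```latex
\mathrm{support}(g) = \bigl|\{\, G \in D : g \subseteq G \,\}\bigr|
\qquad
g' \subseteq g \;\Longrightarrow\; \mathrm{support}(g') \ge \mathrm{support}(g)
\qquad
\mathrm{support}(g') < \sigma \;\Longrightarrow\; \mathrm{support}(g) < \sigma
\quad \text{for every } g \supseteq g'
```

The last implication is the pruning rule: a size-(k+1) candidate can be discarded without counting its support as soon as one of its size-k subgraphs is known to be infrequent.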

Apriori-based method
• Example of candidate graph generation

Apriori-based method
• Flowchart

Apriori-based method
• Experimental Result
  – Chemical compound dataset, which contains 340 compounds and 24 different atoms (vertices)

Outline
• Introduction
• Apriori-based Subgraph Mining
• Pattern Growth Subgraph Mining
• Summary

Motivation of gSpan
• Weakness of Apriori-based approach
  – Generating size-(k+1) subgraph candidates from size-k frequent subgraphs is complicated and costly.
  – Pruning false positives requires subgraph isomorphism testing, which is an NP-complete problem and therefore costly.
• gSpan: Graph-Based Substructure Pattern Mining
  – Change the way a graph is represented (DFS: Depth First Search)
  – Use pattern growth to generate new subgraph candidates.

gSpan: Graph-Based Substructure Pattern Mining
• DFS (Depth First Search) Code
  – First Step: traverse the graph depth-first and use the edges on the path to represent the graph.
  – Second Step: DFS Lexicographic Order
• Pattern Growth subgraph generation

DFS code
• An edge is represented by a 5-tuple: $(i, j, l_i, l_{(i,j)}, l_j)$, e.g. $(0, 1, X, a, Y)$
  – $i$, $j$: DFS discovery indices of the two endpoints
  – $l_i$, $l_j$: vertex labels; $l_{(i,j)}$: edge label
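The 5-tuple encoding can be sketched as follows for one particular depth-first traversal. This is an illustrative sketch, not the gSpan implementation. The function name `dfs_code`, the adjacency-list input format and the example triangle are hypothetical. It produces the code of a single DFS traversal and does not search for the minimum DFS code.

```python
def dfs_code(labels, adj, start):
    """DFS code of one traversal.

    labels: {vertex: label}
    adj:    {vertex: [(neighbor, edge_label), ...]} in visiting order
    Returns a list of 5-tuples (i, j, l_i, l_(i,j), l_j).
    """
    index = {start: 0}   # discovery index of each visited vertex
    used = set()         # undirected edges already written to the code
    code = []

    def visit(v):
        # Backward edges first: to already-discovered vertices, smallest index first.
        back = [(w, el) for w, el in adj[v]
                if w in index and frozenset((v, w)) not in used]
        for w, el in sorted(back, key=lambda t: index[t[0]]):
            used.add(frozenset((v, w)))
            code.append((index[v], index[w], labels[v], el, labels[w]))
        # Forward edges: discover new vertices in the given neighbor order.
        for w, el in adj[v]:
            if w not in index:
                index[w] = len(index)
                used.add(frozenset((v, w)))
                code.append((index[v], index[w], labels[v], el, labels[w]))
                visit(w)

    visit(start)
    return code

# Hypothetical triangle: u(X) -a- v(Y) -b- w(X) -a- u(X)
labels = {"u": "X", "v": "Y", "w": "X"}
adj = {
    "u": [("v", "a"), ("w", "a")],
    "v": [("u", "a"), ("w", "b")],
    "w": [("v", "b"), ("u", "a")],
}
for edge in dfs_code(labels, adj, "u"):
    print(edge)
```

A forward edge has i < j and discovers a new vertex, while a backward edge has i > j and closes a cycle. Different start vertices or neighbor orders give different codes for the same graph, which is why a lexicographic order is needed to pick one canonical code.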

DFS code
• Second Step: DFS Lexicographic Order

Pattern Growth Approach
• Pattern Growth (free extension)

Pattern Growth Approach
• Duplicate Graphs

Pattern Growth Approach
• Free extension

Pattern Growth Approach
• Rightmost extension
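A sketch of where rightmost extension is allowed to add an edge, assuming the usual gSpan convention: backward edges may start only at the rightmost vertex, and forward edges may start only at vertices on the rightmost path. The function names `rightmost_path` and `extension_sites` are hypothetical, and the sketch returns only the index pairs (i, j) of candidate edges, without labels.

```python
def rightmost_path(code):
    """Vertex indices from the root (0) to the rightmost vertex of a DFS code."""
    # Forward edges (i < j) define the DFS tree: j was discovered from i.
    parent = {j: i for i, j, *_ in code if i < j}
    v = max(parent, default=0)   # rightmost vertex = last discovered vertex
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]

def extension_sites(code):
    """Index pairs (i, j) where rightmost extension may add one edge."""
    path = rightmost_path(code)
    rm = path[-1]
    existing = {frozenset((i, j)) for i, j, *_ in code}
    # Backward: from the rightmost vertex to an earlier vertex on the path.
    backward = [(rm, v) for v in path[:-1]
                if frozenset((rm, v)) not in existing]
    # Forward: from any vertex on the rightmost path to a brand-new vertex.
    new = rm + 1
    forward = [(v, new) for v in reversed(path)]
    return backward, forward

# Hypothetical DFS code of a path 0-1-2 (labels are placeholders).
code = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X")]
print(rightmost_path(code))
print(extension_sites(code))
```

Restricting growth to these sites is what removes most of the duplicate graphs that free extension would generate.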

Pattern Growth Approach
• Examples (cont.)

gSpan

gSpan

Pattern Growth Approach
• Experimental result using chemical data
  – 340 molecules; 66 atom types and 4 bond types as labels
  – On average only 27 vertices with 28 edges

Summary
• Graph representation
  – Flattened representation vs. DFS code
• Generation of Candidate Patterns
  – apriori vs. pattern growth


Pattern-Growth Approach

Frequent Graph Pattern
• Given a graph dataset D, find subgraph g, s.t. $\mathrm{freq}(g) \ge \theta$
  – where freq(g) is the percentage of graphs in D that contain g, and $\theta$ is the minimum support threshold.
• Problem 1: Exponential Pattern Set
• Problem 2: Threshold Setting

Difference between frequent itemset and frequent subgraph discovery

Frequent itemset discovery

Subgraph Mining Algorithms
• Apriori-based approach
  – AGM/AcGM: Inokuchi et al. (PKDD’00)
  – FSG: Kuramochi and Karypis (ICDM’01)
  – PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
  – FFSM: Huan et al. (ICDM’03) and SPIN: Huan et al. (KDD’04)
  – FTOSM: Horvath et al. (KDD’06)
• Pattern growth approach
  – Subdue: Holder et al. (KDD’94)
  – MoFa: Borgelt and Berthold (ICDM’02)
  – gSpan: Yan and Han (ICDM’02)
  – Gaston: Nijssen and Kok (KDD’04)
  – CMTreeMiner: Chi et al. (TKDE’05)
  – LEAP: Yan et al. (SIGMOD’08)

Framework of Subgraph Mining Algorithms
• Search Order
  – breadth vs. depth
  – complete vs. incomplete
• Generation of Candidate Patterns
  – apriori vs. pattern growth
• Discovery Order of Patterns
  – DFS order
  – path, tree, graph
• Elimination of Duplicate Subgraphs
  – passive vs. active
• Support Calculation
  – embedding store or not

Frequent Subgraph Examples:

Example (cont.)



Pattern Growth Approach
