
Proceedings of the 8th WSEAS International Conference on APPLIED COMPUTER SCIENCE (ACS'08)

Mining Sequential Patterns by PrefixSpan algorithm with approximation
ANKHBAYAR YUKHUU* SANSARBOLD GARAMRAGCHAA
and HWANG YOUNG SUP**
Department of Computer Science,
Sun Moon University,
100 Kalsan-ri, Tangjeong-myeon, Asan, Chungnam, 336-708
SOUTH KOREA

Abstract- We want to find sequential patterns in a long continuous noisy DNA sequence. Sequential pattern mining, which
discovers frequent subsequences as patterns in a sequence database, is an important data mining problem with broad
applications, including the analysis of customer purchase patterns, Web access patterns, and DNA sequences. We
investigated sequential pattern mining algorithms for long continuous DNA sequences. Most previously proposed mining
algorithms follow an exact-matching definition of a sequential pattern, so in practice they cannot cope with noisy
environments and inaccurate data. We investigated an approximate matching method to deal with those cases. In this
paper, we develop and apply the pattern-growth PrefixSpan algorithm to find the most repeated patterns, for example,
motifs in a DNA sequence. Our algorithm gains its efficiency by using pattern growth and approximation methodologies.
The algorithm is based on the observation that all occurrences of a frequent pattern can be classified into groups,
which we call approximated patterns. We developed algorithms to quickly find all related frequent patterns by a pattern
growth method and to determine approximated patterns from those frequent patterns. Our experimental studies demonstrate
that our algorithm is efficient in mining repeated approximate sequential patterns that would have been missed by
existing methods.

Key-Words: - sequential pattern mining, searching sequential pattern, pattern growth method, PrefixSpan algorithm with
approximation, long DNA sequence mining, searching subsequences.

ISSN: 1790-5109 159 ISBN: 978-960-474-028-4

1 Introduction
The sequential pattern mining problem was first introduced by Agrawal and Srikant in [1]: given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minimum support threshold (a user-specified minimum frequency), sequential pattern mining finds all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than the minimum support.
Sequential pattern mining is an important data mining problem with broad applications, including the analysis of customer purchase behavior, Web access patterns, scientific experiments, disease treatments, natural disasters, DNA sequences, and so on [2]. Many previously proposed pattern mining methods are Apriori-like: a typical Apriori-like method such as GSP (Generalized Sequential Patterns) adopts a multiple-pass, candidate generation-and-test approach to sequential pattern mining [2][3][4][5]. FreeSpan and PrefixSpan run much faster than the Apriori-based GSP algorithm [3][4][5].
Almost all existing mining algorithms follow an exact-matching definition of a sequential pattern. In practice they are not able to work in noisy environments or with inaccurate data. We studied an approximate matching method to deal with such data, for example DNA sequences. The development of efficient and scalable algorithms for mining approximate sequential patterns is a challenging and practically useful direction to pursue. It has been shown that the capacity to accommodate approximation in the mining process has become critical due to inherent noise and imprecision in data, or gene mutations in genomic DNA sequences [8]. The notion of an approximate sequential pattern was proposed in [8], in which an algorithm called ApproxMap is designed to mine consensus patterns. While mining consensus patterns provides one way to produce compact mining results under general distance measures, it remains a challenge to efficiently mine the complete set of approximate sequential patterns under some distance measure which is stricter yet equally useful in many cases.
Classical exploratory data analysis methods used in statistics and many of the earlier KDD methods tend to focus only on basic data types, such as interval or categorical data, as the unit of analysis. However, some information cannot be interpreted unless the data is treated as a unit, leading to complex data types. For example, research on DNA sequences involves interpreting huge databases of amino acid sequences. The interpretation could not be obtained if the DNA sequences were analyzed as multiple categorical variables; it requires a view of the data at the sequence level. DNA sequence research has developed many methods to interpret long sequences over an alphabet.
The purpose of this study is to mine long sequential patterns efficiently in noisy environments. In this paper, we develop and apply the pattern-growth PrefixSpan algorithm to find the most repeated patterns, that is, motifs in a DNA sequence. We establish a more general model. The sequences in the support set of a pattern may have different patterns of substitutions; they can in fact be classified into groups, which we call approximated patterns. Each approximated pattern is a set of sequences sharing a unified pattern representation together with its support.
The idea is that the sets of patterns mined by the pattern growth algorithm reveal their related approximated patterns, which we are then able to classify. These initial approximated patterns are then iteratively assembled into longer approximated patterns in a local search fashion, until no longer ones can be found. In the second stage, the longest or most repeated patterns are generalized from the groups, based on their constituent sequences, to form a support set so that the frequent approximate patterns can be identified. Instead of mining only the patterns repeating within a sliding window of fixed size, our algorithm is able to mine all globally repeating approximate patterns.
Traditionally in data mining, a sequential pattern has been defined as a subsequence that appears frequently in a sequence database. Such a definition leads to sequential pattern mining methods based on a support model [3]. As far as we know, the only methods currently available for analyzing sequences of sets are based on such a support model.
Although sequential pattern mining based on the support model has been extensively studied and many methods have been proposed, there are some inherent obstacles within the conventional model. These methods meet inherent difficulties in mining databases with long sequences and noise. These limitations are discussed together with our algorithm.
The most common sequential pattern mining method is mining patterns by pattern growth [3]: divide the sequential patterns to be mined based on the subsequences obtained so far, and project the sequence database based on the partition of such patterns. The general idea is outlined as follows: instead of repeatedly scanning the entire database and generating and testing large sets of candidate sequences, one can recursively project a sequence database into a set of smaller databases associated with the set of patterns mined so far and then mine locally frequent patterns in each projected database.

2 Term definition
Definition 1 - A sequence X is written as <x1, x2, …, xn>, where each xi is an element of the sequence. The number of instances of items in a sequence is the length (l) of the sequence.
Definition 2 - A sequence X = <x1, x2, …, xn> is a subsequence of another sequence Y = <y1, y2, …, ym> if there exist integers 1 ≤ i1 < i2 < … < in ≤ m such that x1 = y(i1), x2 = y(i2), …, xn = y(in).
Definition 3 - Given a sequence X and a user-specified minimum support threshold θ, a sequence is said to be frequent if it is contained in X at least θ times. A sequential pattern is a maximal sequence that is frequent.
In the case of a DNA sequence, each element xi is one of A, C, G, T. If the sequence X includes a sequence P at least θ times, P is called a sequential pattern and the integer θ is called the user-specified repeat threshold.
In our case, we studied a pattern recognition algorithm that finds a DNA pattern in a long noisy sequence by pattern growth, with errors tolerated up to a user-specified error threshold.

3 Related Algorithms
In this section, we first outline a projection-based sequential pattern mining method, called FreeSpan, and then introduce the improved method, PrefixSpan. Both methods generate projected databases, but they differ in the criteria for database projection: FreeSpan creates projected databases based on the current set of frequent patterns without a particular ordering (i.e., growth direction), whereas PrefixSpan projects databases by growing frequent prefixes. PrefixSpan is substantially faster than FreeSpan in most sequence databases.
3.1 Freespan
FreeSpan is based on the following property: if an itemset X is infrequent, any sequence whose projected itemset is a superset of X cannot be a sequential pattern. FreeSpan mines sequential patterns by partitioning the search space and projecting the sequence subdatabases recursively based on the projected itemsets.
Given the database S and min_support θ1, FreeSpan first scans S, collects the support for each item, and finds the set of frequent items. Frequent items are listed in support descending order.


f_list = a: 4, b: 4, c: 4, d: 3, e: 3, f: 3

Table 1. A sequence database
id_number   Sequence
110         (a(abc)(ac)d(cf))
120         ((ad)c(bc)(ae))
130         ((ef)(ab)(df)cb)
140         (eg(af)cbc)

Algorithm 1: The FreeSpan process for mining each projected database, using the sequence database in Table 1, is detailed as follows:
I. Mining sequential patterns containing only item a. By scanning the sequence database once, the only two sequential patterns containing only item a are found.
II. Mining sequential patterns containing item b but no item after b in the f_list. This can be achieved by constructing the {b}-projected database. For a sequence α in S containing item b, a subsequence α′ is derived by removing from α all items after b in the f_list. α′ is inserted into the {b}-projected database. Thus the {b}-projected database contains four sequences: (a(ab)a), (aba), ((ab)b), and (ab). By scanning the projected database once more, all sequential patterns containing item b but no item after b in the f_list are found. They are (b), (ab).
III. Mining other subsets of sequential patterns. Other subsets of sequential patterns can be mined similarly on their corresponding projected databases. This mining process proceeds recursively, which derives the complete set of sequential patterns.
3.2 PrefixSpan
The major idea of the PrefixSpan algorithm is that any frequent subsequence can always be found by growing frequent prefixes. The steps of the PrefixSpan algorithm are:
I. Find length-1 sequential patterns. Scan S once to find all frequent items in sequences. Each of those frequent items is a length-1 sequential pattern. They are <a>:4, <b>:4, <c>:4, <d>:3, <e>:3 and <f>:3, where <pattern>:count represents the pattern and its associated support count.
II. Divide the search space. The complete set of sequential patterns can be partitioned into the following six subsets according to the six prefixes: (1) the ones having prefix <a>; (2) the ones having prefix <b>; … (6) the ones having prefix <f>.
III. Find subsets of sequential patterns. The subsets of sequential patterns can be mined by constructing the corresponding projected databases and mining each recursively.
The PrefixSpan algorithm was created for mining projected databases. In this study our database is a single long continuous sequence, so we used the PrefixSpan algorithm to mine a continuous sequence database. A significant attribute of the PrefixSpan algorithm is that it only grows longer sequential patterns from the shorter frequent ones; this was a useful source of ideas for us.
The PrefixSpan algorithm is as follows:
Input: A sequence database S, and the minimum support threshold θ
Output: The complete set of sequential patterns
Method: Call PrefixSpan(<>, 0, S)
Subroutine PrefixSpan(α, l, S|α)
Parameters: α: a sequential pattern; l: the length of α; S|α: the α-projected database if α ≠ <>, otherwise the sequence database S.

4 PrefixSpan algorithm with approximation
One particularly important problem in the area of sequential pattern mining is the problem of discovering all subsequences that appear in a given sequence database and meet the minimum support threshold. The difficulty is in figuring out which sequences to try and then efficiently finding out which of those are frequent [2].
Pattern-growth methods are a more recent approach to sequential pattern mining problems. The key idea is to avoid the candidate generation step altogether, and to focus the search on a restricted portion of the initial database.
PrefixSpan is the most promising of the pattern-growth methods and is based on recursively constructing the patterns while simultaneously restricting the search to projected databases [3]. An α-projected database is the set of subsequences that are suffixes of the sequences having prefix α. In each step, the algorithm looks for the frequent sequences with prefix α in the corresponding projected database.
In this way, the search space is reduced at each step, allowing for better performance in the presence of small support thresholds. In the presence of gap constraints, the algorithm has to be adapted, and this claim is no longer valid. However, as has been shown, the adapted version of PrefixSpan remains faster than other methods.
4.1 Approximate matching
A fundamental problem of the conventional methods is that exact matching on patterns does not take into account the noise in the data. This causes two potential problems. In real data, long patterns tend to be noisy and may not meet the support level with exact matching. Even with moderate noise, a frequent long pattern can be mislabelled as an infrequent pattern [8]. Not much work has been done on approximate pattern mining.
We can extend the conventional sequential pattern mining model to get an approximate sequential pattern mining model. Given a minimum threshold distance σ,
the patterns generated with approximate matching depend on the number σ of string-matching transitions. Instead of generating the possible sequences while considering possible errors, we verify whether each sequence element performs a valid transition in the automaton, beginning at the initial state. Whenever an element does not correspond to a valid transition, the algorithm tries to replace it (which corresponds to applying a Replacement); if this fails, it tries to ignore the element (which corresponds to a Deletion); and finally it tries to introduce a valid transition (which corresponds to an Insertion). These insertions and replacements try to correct the element with every possible element {frequent length is constant}. The operation does not depend on the number of different elements in the database, but only on the DNA nucleotides of the sequences.
In our algorithm we use the edit operations of approximate string matching, with 3 transitions as follows:
Insertion - Ins(x, i) - adds symbol x at position i;
Deletion - Del(x, i) - deletes the symbol x from the sequence at position i;
Replacement - Repl(x, y, i) - substitutes the symbol x, which occurs at position i, by symbol y.
Example 3: If we are searching for the ATAT motif, approximate string matching finds all occurrences, including those with an Insertion (ATAAT), a Deletion (ATT), and a Replacement (ATGT), etc.
Unfortunately, the above model may suffer from the following two problems. First, the mining may find many short and probably trivial patterns. Short patterns tend to collect similarity counts from the sequences more easily than long patterns. Thus, short patterns may overwhelm the results.
Second, the complete set of approximate sequential patterns may be larger than that of exact sequential patterns and thus difficult to understand. By approximation, a user may want to get and understand the general trend and ignore the noise. However, a naive output of the complete set of approximate patterns in the above model may generate many (trivial) patterns and thus obstruct the mining of information.
4.2 PrefixSpan algorithm with approximate matching
We have used the PrefixSpan algorithm to find exact patterns; the approximate matching can then identify which approximate patterns are related to them.
We created Algorithm 2, which combines PrefixSpan with approximate matching to mine sequential patterns. The input data is a long continuous DNA sequence.
Our algorithm is divided into two main sections: the first creates patterns, CreateseqPref(char *pattern), and the second searches for patterns, SearchPrefix(sequence, pattern). The creating section grows longer sequential patterns from the shorter frequent ones and classifies them by length and by number of repeats.
Algorithm 2:
CreateseqPref(char *pattern)
// Create the pattern for the search function by the pattern growth approach
{ ...
    return pattern;
}

SearchPrefix(sequence, pattern)
// Search for the pattern in the sequence to find out how often it repeats
{ ...
    Boolean approxAcc(Sequence s, int i, pattern, int e, int Th);
    return repeat;
}

Boolean approxAcc(Sequence s, int i, pattern, int e, int Th)
// Approximate search bounded by the error threshold Th
{
    if (e > Th)
        return false;   // exceeds the maximal error
    if (i < |s|)
        return accTrans(s, s[i], i, pattern, e, Th);
    // Test the achievement of a final state
    else if (pattern in s)
        return true;
    else                // try to achieve a final state
        return accTrans(s, ( ), i, pattern, e, Th);
}

Boolean accTrans(Sequence s, s[i], int i, pattern, int e, int Th)
// Check the transitions for every element a
{
    int dif = max(|s[i]|, |a|) - |a in s[i]|;
    if (|s[i]| = |a in s[i]|)
        ok = approxAcc(s, i+1, pattern, e, Th);         // found the transition
    else
        ok = approxAcc(s, i+1, pattern, e + dif, Th);   // check edit operation: replacement
    if (ok)
        return true;
    // Element is not valid: delete it
    ok = approxAcc(s, i+1, pattern, e + |s[i]|, Th);
    if (ok)
        return true;
    // If deletion fails, check insertion
    ok = approxAcc(s, i+1, pattern, e + |a|, Th);
    if (ok)
        return true;
    return false;
}
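As an illustration of a search bounded by an error threshold, the following Python sketch uses a standard dynamic-programming formulation of approximate string matching (often attributed to Sellers) in place of the automaton-based approxAcc/accTrans recursion of Algorithm 2. The function name approx_match_ends is hypothetical, th plays the role of Th, and every insertion, deletion, or replacement is assumed to cost one unit.

```python
def approx_match_ends(sequence, pattern, th):
    """Return the 1-based end positions in `sequence` at which some
    substring lies within `th` edit operations of `pattern`."""
    m = len(pattern)
    # prev[i]: fewest edits turning pattern[:i] into a substring of the
    # sequence that ends at the previous position (initially: empty text).
    prev = list(range(m + 1))
    ends = []
    for j, symbol in enumerate(sequence, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == symbol else 1
            cur[i] = min(
                prev[i - 1] + cost,  # exact transition or Replacement
                prev[i] + 1,         # extra symbol in the sequence (Insertion)
                cur[i - 1] + 1,      # pattern symbol missing (Deletion)
            )
        if cur[m] <= th:
            ends.append(j)
        prev = cur
    return ends


# The variants of Example 3 are rejected exactly but accepted with th = 1.
for variant in ("ATAAT", "ATT", "ATGT"):
    print(variant,
          approx_match_ends(variant, "ATAT", 0),
          bool(approx_match_ends(variant, "ATAT", 1)))
```

A single approximate occurrence can produce several neighbouring end positions, so turning this list into a repeat count needs a counting convention (for example, non-overlapping matches), which the paper does not specify.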

First, the creating section defines which frequent pattern is searched for next; the pattern-creation step creates longer patterns.
Second, the searching section finds subsequences of the main sequence with the approximate matching algorithm and returns the number of repeats of a pattern. The rule is to compare two sequences using the approximation transitions.
When searching for short patterns (length=1; length=2; length=3) the algorithm may meet collisions. Furthermore, if the pattern length is too long, the pattern mining result is reduced. We therefore use the user-defined coefficients minimum length, min_length, and maximum length, max_length, to solve those problems.
CreateseqPref(char *pattern) creates the patterns to search for in the sequence by the pattern growth approach, adding every possibility for one more element to the pattern. The SearchPrefix(sequence, pattern) function returns how many times the pattern is repeated in the sequence. Boolean approxAcc(Sequence, int i, pattern, int e, int Th) checks whether the error is exceeded and whether a final state is reached; Boolean accTrans(Sequence, S[i], int i, pattern, int Th, e) finds and recognizes transitions. Th is the user-defined error threshold and e is the temporary (accumulated) error.
Whenever an element does not correspond to a valid transition, the algorithm tries to replace it (Replacement); if this fails, it tries to ignore the element (Deletion); and finally it tries to introduce a valid transition (Insertion), instead of trying every possible element (A, C, G, T). Transitions are limited by the user-defined threshold Th, which can be 1 or 2, because each increase of Th by 1 increases the search time at least threefold.

5. Experimental Results
Our goal is to validate our claim that approximations allow the discovery of unknown information while keeping the mining process centered on the user. To do so, we compared the general PrefixSpan algorithm and the PrefixSpan algorithm with an approximation approach when finding a motif in a DNA sequence.
The PrefixSpan pattern growth method has been used in most studies of sequential pattern mining, and it generates datasets that mimic real-world transactions, where customers tend to make a sequence of transactions involving a set of items. Sequence and transaction sizes are clustered around a mean and some of them may have larger sizes.
The input data was a text file containing a long continuous DNA sequence. The first 100 elements of the sequence are as follows:
Example 4:
AAGTATTGCGCATACTTCTTATGGGAGATCCAAATTCC
CAGTACATTAGTTGACTCTCCTTCCTTAATGTTTTCCT
CTGTACGATTATTTATGTTTAAAC…
The text file consists of A, G, C, T nucleotides. The outputs of the algorithms were the most repeated patterns and their numbers of repeats. We used a DNA sequence of length 27,126 and a user-defined error threshold under which one transition is possible.
Table 2 illustrates the experimental results of the two algorithms. When the pattern length was three {length=3}, the output pattern of the proposed algorithm was repeated about twice as often as that of the PrefixSpan algorithm, and the last nucleotides were different {TTT, TTA}. We observed that the motifs found by the two algorithms can differ from length to length. The differing patterns between the outputs revealed previously hidden information, because many patterns can include some transitions in a long continuous noisy DNA sequence.

Table 2: Comparison of the results achieved without and with approximation.
Pattern length               1       2       3       4       5
PrefixSpan Pattern           T       TT      TTT     TTTT    TTTTT
Pattern with Approximation   T       TT      TTA     TTAT    TTATT
PrefixSpan Repeat            9,288   2,646   936     270     90
Repeat with Approximation    9,288   2,646   2,015   1,350   520

Figure 1 compares the repeats of the PrefixSpan algorithm and the approximated PrefixSpan algorithm. The number of repeats rose sharply with approximation. Patterns of length one and two are not approximated by our algorithm. The repeats at length 3 increased from 936 to 2,015; at length 4 the repeats were 270 and 1,350; and at length 5 the PrefixSpan repeat was 90, while our algorithm showed the pattern /ATAGT/ repeated 520 times. For lengths 3 to 5 the repeats of our algorithm were always higher than those of the other algorithm, and the patterns found were dissimilar. It can be observed that the PrefixSpan algorithm with approximation can effectively uncover the hidden trend in the DNA sequence. The comparison shows that our algorithm can discover more patterns and trends than the PrefixSpan algorithm. As a result, we are able to find approximate motifs, the most repeated patterns, in a long noisy DNA sequence.
In summary, our experiment shows that the PrefixSpan algorithm with approximation is more efficient than the non-approximated PrefixSpan in finding a motif in a long repeated pattern.
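To make the exact-matching baseline of this comparison concrete, here is a minimal Python sketch of prefix growth over a single continuous sequence. It is an illustration under stated assumptions, not the code used in the experiments: patterns are taken to be contiguous, occurrences may overlap, and grow_patterns, min_repeat, and max_length are hypothetical names (the last one echoing the max_length coefficient mentioned in Section 4.2).

```python
def grow_patterns(sequence, min_repeat, max_length, alphabet="ACGT"):
    """Return {length: (pattern, repeats)}, the most repeated frequent
    pattern of each length, grown only from frequent shorter prefixes."""
    best = {}
    # projected[p]: start positions of every exact occurrence of p; this
    # plays the role of the p-projected database for a single sequence.
    projected = {"": list(range(len(sequence)))}
    for length in range(1, max_length + 1):
        grown = {}
        for prefix, starts in projected.items():
            for symbol in alphabet:
                # Extend only known occurrences of the frequent prefix;
                # the full sequence is never rescanned.
                hits = [s for s in starts
                        if s + length - 1 < len(sequence)
                        and sequence[s + length - 1] == symbol]
                if len(hits) >= min_repeat:
                    grown[prefix + symbol] = hits
        if not grown:
            break  # no frequent pattern of this length, so none longer
        # Ties in repeat count are broken by the lexicographically larger pattern.
        top = max(grown, key=lambda p: (len(grown[p]), p))
        best[length] = (top, len(grown[top]))
        projected = grown
    return best


# Toy input chosen for this sketch (not data from the paper).
result = grow_patterns("TTTTTATTT", min_repeat=2, max_length=5)
print(result)
```

On the toy input the repeat count falls as the pattern grows (8, 6, 4, 2) and growth stops at length 4, which mirrors the shape of the PrefixSpan rows of Table 2 on a much smaller scale.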


Figure 1: Comparison of the results achieved with and without approximation (27,126 length DNA sequence). [Plot: repeats (0 to 3000) against pattern length (2 to 5); series: PrefixSpan Repeat, Repeat with Approximation.]

6. Conclusions
We proposed the definition of closed frequent approximate sequential patterns to solve the problem of mining the complete set of frequent approximate sequential patterns. Our study shows that the PrefixSpan algorithm with approximation mines the complete set of patterns and that it is more efficient than the PrefixSpan algorithm.
Our algorithm is based on the notion of classifying a pattern's support set into approximated patterns. We combine approximation-based mining and iterative pattern growth. We adopt a local search optimization technique to improve the completeness of the mining result. Our performance study shows that our algorithm is able to mine globally repeating approximate patterns in biological genomic DNA data with great efficiency (for example, 5 times more repeats when length=4). It is important to explore how to further develop such a pattern-growth based sequential mining methodology for effectively mining DNA databases.

Acknowledgments
This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00833).

References
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pages 487–499, Santiago, Chile, Sept. 1994.
[2] C. Antunes and A.L. Oliveira, "Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints," in Int'l Conf. Machine Learning and Data Mining, (2003) 239-251.
[3] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, October 2004.
[4] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
[5] O. Efremides and G. Ivanov, "A Study for Data Pool Dynamic Workload Approaches of a String Matching Algorithm over a Heterogeneous Cluster," Proceedings of the WSEAS Transactions on Computers, Issue 7, Volume 5, July 2007 (pp. 1466-1474).
[6] M. J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences," in Machine Learning Journal, special issue on Unsupervised Learning (Doug Fisher, ed.), pages 31-60, Vol. 42, Nos. 1/2, Jan/Feb 2001.
[7] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pages 1–12, Dallas, TX, May 2000.
[8] H.C. Kum, J. Pei, W. Wang, and D. Duncan, "Approx-MAP: Approximate Mining of Consensus Sequential Patterns," Technical Report TR02-031, UNC-CH, 2002.
[9] C. Antunes and A.L. Oliveira, "Sequential Pattern Mining with Approximated Constraints," Int. Conf. Applied Computing, IADIS (2004) 131-138.
[10] Rachel Akimana, Oliver Markovitch, and Yves Roggeman, "Grid's Confidential Outsourcing of String Matching," Proceedings of the 6th WSEAS Int. Conf. on Software Engineering, Parallel and Distributed Systems, Corfu Island, Greece, February 16-19, 2007 (pp. 63-68).
[11] M. Yazid M. Saman, M. Nordin A. Rahman, Aziz Ahmad, and A. Osman M. Tap, "A Minimum Cost Process in Searching for a Set of Similar DNA Sequences," Proceedings of the 5th WSEAS International Conference on Telecommunications and Informatics, Istanbul, Turkey, May 27-29, 2006 (pp. 348-353).
[12] M. Yazid M. Saman, M. Nordin A. Rahman, Aziz Ahmad, and A. Osman M. Tap, "Incremental Clustering as a Tool for CRM," Proceedings of the WSEAS Transactions on Computers, Issue 7, Volume 5, July 2007 (pp. 1525-1533).
