
A Specialized Random Multi-Parent Crossover Operator embedded into a Genetic Algorithm for gene selection and classification problems

Edmundo Bonilla-Huerta, José Hernández Hernández, and Roberto Morales-Caporal
LITI, Instituto Tecnológico de Apizaco, Av. Instituto Tecnológico s/n. 93000, Apizaco, Tlaxcala, México {edbonn,roberto-morales,josechh}@itapizaco.edu.mx

Abstract. The microarray data classification problem is a recent pattern recognition problem. In supervised classification of microarray data, the goal of gene selection is to select a small number of relevant genes from the initial data in order to obtain high predictive classification accuracy. Within the framework of an embedded genetic algorithm, we study in this paper the role of the multi-parent recombination operator. For this purpose, we introduce a Random Multi-Parent crossover (RMPX) and analyze its effects in a genetic algorithm (GA) combined with Fisher's Linear Discriminant Analysis (LDA). The major characteristic of this embedded LDA-based GA is that the GA uses not only an LDA classifier in its fitness function, but also LDA's discriminant coefficients to drive specialized multi-parent crossover and mutation operations that improve the performance of our approach. The experimental results show that the RMPX operator works very well, achieving lower classification error rates than other methods reported in the literature.

Keywords: RMPX, multi-parent recombination, embedded, linear discriminant analysis, genetic algorithm, gene selection, microarray.

1 Introduction

Recently, different multi-parent recombination operators have been proposed in many evolutionary algorithms (EAs) to create offspring. Scanning crossover and diagonal crossover are proposed in [5, 6]. These two methods allow more than two parents to take part in the recombination process. Basically, these multi-parent operators generalize the genetic algorithm operators uniform crossover and one-point crossover, respectively. The gene pool recombination operator, in which the gene pool consists of several pre-selected parents, is proposed in [10]. In [15], the real-coded center of mass crossover (CMX) and the multi-parent feature-wise crossover (MFX) are developed as two multi-parent recombination operators that can perform better than the crossover operators conventionally used in genetic algorithms. In [8], the authors propose a fitness-weighted crossover

Lecture Notes in Computer Science: Authors Instructions

(FWX) with an original random threshold mechanism that determines the number of parents used to reproduce offspring. A study of crossover operators using a multi-parent approach is presented in [16], with a memetic algorithm as the framework for solving the unconstrained binary quadratic programming (UBQP) problem. In all these methods the number of parents plays a key role in keeping population diversity and avoiding premature convergence, although the optimal number of parents is still an open problem.

DNA microarray technology makes it possible to monitor and measure the expression levels of tens of thousands of genes simultaneously in a cell mixture. This technology enables cancer diagnosis based on gene expression [2, 3, 1, 7]. Given the very high number of genes, it is useful to select a limited number of relevant genes for classifying tissue samples.

This paper presents an empirical study of a multi-parent embedded genetic algorithm for gene selection. We analyze the effect of the number of parents on our Random Multi-Parent crossover (RMPX); this number varies from 2 to 12 parents. We argue that the more parents are used, the more good solutions (gene subsets) are obtained, because more parents explore and exploit the search space. Nevertheless, we need to know how many parents are necessary to escape from local minima and to obtain solutions with high classification accuracy and a minimum number of genes. For this study, three gene selection filters are used in the embedded approach, and Fisher's Linear Discriminant Analysis (LDA) is used to provide useful information to a Genetic Algorithm (GA) for an efficient exploration of the space of gene subsets. LDA is a well-known method of dimension reduction and classification, in which the data vectors are transformed into a low-dimensional subspace such that the class centroids are spread out as much as possible. It has been used for several classification problems and recently for microarray data [4].
The organization of the rest of this paper is as follows. Section 2 presents our wrapper LDA-based GA for gene selection. Section 3 shows the experimental results on seven microarray datasets and presents a table of comparisons with other well-known approaches. Finally, conclusions are presented in Section 4.

2 Filter-Wrapper method

In this section we describe our filter-wrapper method, the LDA-based Genetic Algorithm (LDA-GA), for gene subset selection. We apply the BW filter [4] to retain a group $G_p$ of the $p$ top-ranking genes (p = 50, 100, 150, 200, 250 and 300). Then, the LDA-based GA is used to explore the resulting search space of size $2^p$. The purpose of this search is to find good solutions (gene subsets) with high classification performance. In what follows, we present the general procedure and then the components of the LDA-based Genetic Algorithm. In particular, we explain how LDA is combined with the Genetic Algorithm.
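The pre-selection step can be illustrated with a minimal sketch. It assumes that the BW filter of [4] is the usual ratio of between-class to within-class sums of squares (called BSS/WSS later in this paper); the function names `bss_wss_ratio` and `top_p_genes` and the toy data are hypothetical and are not taken from the authors' MATLAB implementation.

```python
from collections import defaultdict

def bss_wss_ratio(values, labels):
    """BSS/WSS score of one gene: between-class over within-class sum of squares."""
    overall_mean = sum(values) / len(values)
    by_class = defaultdict(list)
    for value, label in zip(values, labels):
        by_class[label].append(value)
    bss = 0.0
    wss = 0.0
    for members in by_class.values():
        class_mean = sum(members) / len(members)
        bss += len(members) * (class_mean - overall_mean) ** 2
        wss += sum((v - class_mean) ** 2 for v in members)
    if wss == 0.0:
        # Degenerate gene: no within-class spread (assumed handling).
        return float("inf") if bss > 0.0 else 0.0
    return bss / wss

def top_p_genes(expression, labels, p):
    """expression[i][j] is the level of gene j in sample i; returns the p best gene indices."""
    n_genes = len(expression[0])
    scores = [bss_wss_ratio([row[j] for row in expression], labels)
              for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:p]

# Toy data (hypothetical): 4 samples, 3 genes, 2 classes.
expression = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [3.0, 5.0, 0.1],
    [3.1, 4.9, 0.8],
]
labels = [0, 0, 1, 1]
selected = top_p_genes(expression, labels, p=2)
```

In this toy example gene 0 separates the two classes far better than the other two, so it is ranked first; the GA then only searches among the retained indices.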

RMPX-GA

2.1 General GA procedure

Our LDA-based Genetic Algorithm follows the conventional schema of a generational GA and also uses an elitism strategy.

Initial population: The initial population is generated randomly in such a way that each chromosome contains a number of genes ranging from $p \times 0.6$ to $p \times 0.75$. The population size is fixed at 100 in this work.

Evolution: The chromosomes of the current population $P$ are sorted according to the fitness function (see Section 2.3). The best 10% of the chromosomes of $P$ are directly copied to the next population $P'$ and removed from $P$. The remaining 90% of the chromosomes of $P'$ are then generated by using crossover and mutation.

Crossover and mutation: Mating chromosomes are determined from the remaining chromosomes of $P$ by considering each pair of adjacent chromosomes. By applying our multi-parent recombination operator (see Section 2.4), one child is created each time. This child then undergoes a mutation operation (see Section 2.5) before joining the next population $P'$.

Stop condition: The evolution process ends when a pre-defined number of generations is reached (fixed at 250 generations in this work).

2.2 Chromosome encoding

Conventionally, a chromosome is used simply to represent a candidate gene subset. In our GA, a chromosome encodes more information and is defined by a couple $I = (\tau; \phi)$, where $\tau$ and $\phi$ have the following meaning. The first part ($\tau$) is a binary vector that effectively represents a candidate gene subset. Each allele $\tau_i$ indicates whether the corresponding gene $g_i$ is selected ($\tau_i = 1$) or not selected ($\tau_i = 0$). The second part of the chromosome ($\phi$) is a real-valued vector in which each $\phi_i$ corresponds to the discriminant coefficient of the eigenvector for gene $g_i$. We use the LDA discriminant coefficient to define the contribution of gene $g_i$ to the projection axis $w_{opt}$. A chromosome can thus be represented as follows:

$$I = (\tau_1, \tau_2, \ldots, \tau_p;\ \phi_1, \phi_2, \ldots, \phi_p)$$

The length of $\tau$ and $\phi$ is defined by $p$, the number of genes pre-selected with the filter (B/W) [4].

2.3 Fitness evaluation

The purpose of the genetic search in our LDA-GA approach is to seek good gene subsets having the minimal size and the highest prediction accuracy. To achieve this double objective, we devise a fitness function taking into account these (somewhat conflicting) criteria.


To evaluate a chromosome $I = (\tau; \phi)$, the fitness function considers the classification accuracy of the chromosome ($f_1$) and the number of selected genes in the chromosome ($f_2$). More precisely, $f_1$ is obtained by evaluating the gene subset $\tau$ with the LDA classifier on training data drawn with replacement from the original dataset, using 10-fold cross-validation. The second part of the fitness function, $f_2$, is calculated by the formula:

$$f_2(I) = 1 - \frac{m}{p} \qquad (1)$$

where $m$ is the number of bits having the value 1 in the candidate gene subset $\tau$, i.e. the number of selected genes, and $p$ is the length of the chromosome, corresponding to the number of genes pre-selected from the filter ranking. The fitness function $f$ is then defined as the following weighted aggregation:

$$f(I) = \alpha f_1(I) + (1 - \alpha) f_2(I) \quad \text{subject to } 0 < \alpha < 1$$

where $\alpha$ is a parameter that allocates a relative importance factor to $f_1$ or $f_2$. Assigning to $\alpha$ a value greater than 0.5 pushes the genetic search toward solutions of high classification accuracy (probably at the expense of having more selected genes). Conversely, using small values of $\alpha$ steers the search toward small gene subsets. Varying $\alpha$ therefore changes the search direction of the genetic algorithm.

2.4 Multi-parent recombination

We use the discriminant coefficients from the LDA classifier to design our crossover and mutation operators. Here, we explain how our LDA-based specialized genetic operators work. Our method relies on a modified random threshold [8] to select the parents of the crossover operation (denoted RMPX hereafter). This threshold $r$, drawn at random in the interval [0, 1], sets the number of parents $n_p$ involved in the recombination, as shown below:

$$n_p = \begin{cases} 2 & \text{if } r \le 0.2 \\ 3 & \text{if } 0.2 < r \le 0.4 \\ 4 & \text{if } 0.4 < r \le 0.6 \\ 5 & \text{if } 0.6 < r \le 0.8 \\ 6 & \text{if } r > 0.8 \end{cases}$$

RMPX randomly combines different parent chromosomes $I^1, \ldots, I^{n_p}$: we pick the majority of parental genes, following the definition of OB-SCAN [14], to create a new chromosome $I^c$. Given the $n_p$ parents determined by the random threshold, OB-SCAN produces the child $I^c$ as follows:

$$I^c_i = \begin{cases} 0 & \text{if } \sum_{j=1}^{n_p} (I^j)_i < \frac{n_p}{2} \\ 1 & \text{if } \sum_{j=1}^{n_p} (I^j)_i > \frac{n_p}{2} \\ \mathrm{rand}(0, 1) & \text{otherwise} \end{cases}$$

where $\mathrm{rand}(0, 1)$ denotes a binary random function and $(I^j)_i$ the $i$-th gene of the parent chromosome $I^j$. In order to obtain a good subset of informative genes, we propose a specialized recombination operator AND (denoted $\wedge$). This genetic operator preserves the genes obtained by the multi-parent crossover ($I^c$) that are also among the genes appearing most frequently according to the LDA coefficients ($J^c$). This is denoted $K^c = I^c \wedge J^c$, where $K^c$ is the child that contains the best information of the multi-parent recombination using LDA coefficients. Before the child is inserted into the next population, $K^c$ undergoes a mutation operation based on the LDA coefficients that removes the gene having the lowest discriminant coefficient.
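The recombination described in this section can be sketched as follows. This is only an illustration under stated assumptions: the handling of the threshold boundaries (0.2, 0.4, ...) is a guess, the parents are drawn uniformly from the population, and the LDA-derived mask `lda_mask` (playing the role of $J^c$) is taken as a given input because the text does not specify how it is built. All function names are hypothetical.

```python
import random

def number_of_parents(r):
    """Map a threshold r in [0, 1] to n_p in 2..6 (boundary handling is assumed)."""
    if r <= 0.2:
        return 2
    if r <= 0.4:
        return 3
    if r <= 0.6:
        return 4
    if r <= 0.8:
        return 5
    return 6

def ob_scan(parents, rng):
    """Occurrence-based scanning: each child bit takes the majority value of the parents."""
    n_p = len(parents)
    child = []
    for i in range(len(parents[0])):
        ones = sum(parent[i] for parent in parents)
        if 2 * ones < n_p:
            child.append(0)
        elif 2 * ones > n_p:
            child.append(1)
        else:
            child.append(rng.randint(0, 1))  # tie: random binary value
    return child

def rmpx(population, lda_mask, rng):
    """Draw the threshold, pick n_p random parents, apply OB-SCAN, then AND with the mask."""
    n_p = min(number_of_parents(rng.random()), len(population))
    parents = rng.sample(population, n_p)
    scanned = ob_scan(parents, rng)
    return [bit & keep for bit, keep in zip(scanned, lda_mask)]

# Hypothetical usage with random binary chromosomes of length p = 8.
rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
lda_mask = [1, 1, 0, 1, 0, 1, 1, 0]
child = rmpx(population, lda_mask, rng)
```

Because of the final AND, a gene can only survive in the child if it is both favoured by the majority of parents and flagged in the LDA-based mask, so the child never contains more genes than the mask allows.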

2.5 LDA-based mutation

In a conventional GA, the purpose of mutation is to introduce new genetic material that diversifies the population by making local changes in a given chromosome. For binary-coded GAs, this is typically realized by flipping the value of some bits (1 → 0 or 0 → 1). In our case, mutation is used for dimension reduction: each application of mutation eliminates a single gene (1 → 0). To determine which gene is discarded, one criterion is used, leading to the following mutation operator.

Mutation using discriminant coefficient (M1): Given a chromosome $K = (\tau; \phi)$, we identify the smallest LDA discriminant coefficient in $\phi$ and remove the corresponding gene (this is the least informative gene in the current candidate gene subset $\tau$).
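A small sketch may help to see how the M1 mutation interacts with the fitness function of Section 2.3. It assumes that "smallest coefficient" means smallest in absolute value and that only currently selected genes are candidates for removal; both points are assumptions, and the names `mutate_m1` and `fitness` are hypothetical. The accuracy passed to `fitness` stands in for $f_1$, which in the paper comes from the LDA classifier.

```python
def mutate_m1(tau, phi):
    """Drop the selected gene whose LDA coefficient has the smallest magnitude."""
    selected = [i for i, bit in enumerate(tau) if bit == 1]
    if not selected:
        return list(tau)  # nothing to remove
    weakest = min(selected, key=lambda i: abs(phi[i]))
    mutated = list(tau)
    mutated[weakest] = 0
    return mutated

def fitness(accuracy, tau, alpha=0.5):
    """f = alpha * f1 + (1 - alpha) * f2, with f2 = 1 - m / p."""
    m = sum(tau)   # number of selected genes
    p = len(tau)   # number of pre-selected genes
    f2 = 1 - m / p
    return alpha * accuracy + (1 - alpha) * f2

# Hypothetical chromosome with p = 4 pre-selected genes, 3 of them selected.
tau = [1, 1, 0, 1]
phi = [0.8, -0.1, 0.0, 0.5]
mutated = mutate_m1(tau, phi)            # gene 1 has the weakest coefficient
before = fitness(0.9, tau)               # accuracy 0.9 is a made-up value
after = fitness(0.9, mutated)            # same accuracy assumed after mutation
```

If the accuracy is unchanged by the removal, the fitness rises from 0.575 to 0.70 in this toy case, which is the sense in which mutation acts as a dimension-reduction step.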

3 Experiments on microarray datasets

3.1 Microarray gene expression datasets

Table 1. Summary of datasets used for experimentation.

Dataset    Genes   Samples  Reference
Leukemia   7129    72       Golub et al. [7]
Colon      2000    62       Alon et al. [2]
Lung       12533   181      Gordon et al. [9]
Prostate   12600   109      Singh et al. [13]
CNS        7129    60       Pomeroy et al. [12]
Ovarian    15154   253      Petricoin et al. [11]
DLBCL      4026    47       Alizadeh et al. [1]


3.2 Experimental results

All experiments were run on a DELL Precision M4500 laptop with an Intel Core i7 1.87 GHz processor and 4 GB of RAM. Our model was implemented in MATLAB. The following parameters were used in the experiments: a) population size |P| = 50; b) maximal number of generations fixed at 250; c) individual length (number of pre-selected genes) p = 50, 100, 150, 200, 250 and 300. We use a classical mutation in which each bit of an individual has a mutation probability of 0.01. For the single-point and uniform crossover operators, we use a crossover probability of 0.875, whereas the general settings for our LDA-based crossover operator are explained in Section 2.4. We study the effects of multi-parent LDA-based operators (crossover and mutation) within the same embedded GA/LDA framework. All the crossover operators were tested under the same conditions on seven microarray datasets (Leukemia, Colon cancer, Lung cancer, Prostate cancer, Ovarian, CNS and DLBCL). Figure 1 shows the influence of the number of genes used in each dataset. We use the p = 50, 100, 150, 200, 250 and 300 top-ranking genes obtained by the BSS/WSS filter. The more genes are used, the higher the classification rate of our model in 6 of the 7 datasets. Only the Ovarian dataset performs best with a minimal number of genes (p = 50).
Fig. 1. A comparison between different values of selected genes with $\alpha = 0.50$. Panels: a) Leukemia, b) Colon, c) DLBCL, d) CNS, e) Lung, f) Prostate, g) Ovarian.

We observe that our model achieves the highest accuracy with p = 300 in all datasets. For Leukemia we obtain an accuracy of 99.50% with a subset of 3 genes. Colon tumor gives its best performances, 98.00% and 98.83%, with 5 and 7 genes respectively. DLBCL provides 4 reduced gene subsets (of only 3 genes) with a classification rate greater than or equal to 90%. The number of genes for the CNS dataset is 4, with a high recognition rate. Lung and Prostate cancer give a classification performance greater than or equal to 90% with 4 genes. Ovarian cancer is the dataset that requires more than 4 genes to provide good performance.

4 Conclusions and discussion

In this paper we presented a study of multi-parent recombination using a GA/LDA framework with the specialized genetic operator RMPX for gene selection and classification of microarray gene expression data. Our model works very well when the number of parents is randomly selected in the interval [4, 12]. We confirm that when more than two and fewer than 10 parents are randomly used, we obtain better performance from our RMPX-LDA-based genetic operator. That is, diversity increases when the number of parents changes in every generation, which improves both the exploration and the exploitation capacities of our algorithm. In contrast, when the number of parents was increased, the performance decreased, leading to premature convergence in the first generations of our algorithm. The proposed approach begins with the B/W filter, which pre-selects different numbers of top-ranked genes (50, 100, 150, 200, 250 and 300). In this LDA-GA, LDA is used not only to assess the fitness of a candidate gene subset, but also to inform the crossover and mutation operators. We have extensively evaluated our model on seven public datasets (Leukemia, Colon, DLBCL, Lung, Prostate, CNS and Ovarian) using a 10-fold cross-validation process. We confirm that our model is very competitive: using more top-ranking genes leads to very small gene subsets with high performance.

Acknowledgments: This work is partially supported by the PROMEP project ITAPIEXB-000.

References
1. A. Alizadeh, M.B. Eisen, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511, February 2000.
2. U. Alon, N. Barkai, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. USA, 96:6745-6750, 1999.
3. A. Ben-Dor, L. Bruhn, et al. Tissue classification with gene expression profiles. Journal of Computational Biology, 7(3-4):559-583, 2000.
4. S. Dudoit, J. Fridlyand, and T. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97:77-87, 2002.
5. A.E. Eiben, P-E. Raue, and Z. Ruttkay. Genetic algorithms with multi-parent recombination. Parallel Problem Solving from Nature - PPSN III, 866:78-87, 1994.
6. A.E. Eiben. Multiparent recombination in evolutionary computing. Advances in Evolutionary Computing, pages 175-192. Springer, 2002.
7. T. Golub, D. Slonim, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.
8. D. Gong and X. Ruan. A new multi-parent recombination genetic algorithm. Control and Automation CMAS, 286:531-537, 2004.
9. G.J. Gordon, R.V. Jensen, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 17(62):4963-4967, 2002.
10. H. Mühlenbein and H.M. Voigt. Gene Pool Recombination for the Breeder Genetic Algorithm. Proc. of the Metaheuristics International Conference, pages 19-25, 1995.
11. E.F. Petricoin, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A. Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn, and L.A. Liotta. Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359:572-577, 2002.
12. S.L. Pomeroy, P. Tamayo, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415:436-442, 2002.
13. D. Singh, P. Febbo, K. Ross, D. Jackson, J. Manola, C. Ladd, P. Tamayo, A. Renshaw, A. D'Amico, and J. Richie. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203-209, 2002.
14. C-K. Ting. On the convergence of multi-parent genetic algorithms. Proc. of ECAL, pages 403-412, 2005.
15. S. Tsutsui and A. Ghosh. A study of the effect of multi-parent recombination with simplex crossover in real coded genetic algorithms. Proc. of the IEEE-ICEC, pages 657-664, 1998.
16. Y. Wang, Z. Lu, and J-K. Hao. A study of multi-parent crossover operators in a memetic algorithm. In PPSN XI, pages 556-565, 2010.
17. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147:195-197, 1981.
18. P. May, H.C. Ehrlich, and T. Steinke. ZIB Structure Prediction Pipeline: Composing a complex biological workflow through web services. In W.E. Nagel, W.V. Walter, and W. Lehner, editors, Euro-Par 2006, LNCS vol. 4128, pages 1148-1158. Springer, Heidelberg, 2006.
19. I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, 1999.
20. K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid information services for distributed resource sharing. In 10th IEEE International Symposium on High Performance Distributed Computing, pages 181-184. IEEE Press, New York, 2001.
21. I. Foster, C. Kesselman, J. Nick, and S. Tuecke. The physiology of the Grid: an open grid services architecture for distributed systems integration. Technical report, Global Grid Forum, 2002.
