Int. J. Appl. Math. Comput. Sci., 2004, Vol. 14, No. 3, 335–349

EVOLUTIONARY COMPUTATION BASED ON BAYESIAN CLASSIFIERS T ERESA MIQUÉLEZ∗ , E NDIKA BENGOETXEA∗ , P EDRO LARRAÑAGA∗∗ ∗

∗

Department of Computer Architecture and Technology University of the Basque Country P.O. Box 649, 20080 San Sebastian, Spain e-mail: {teresa, endika}@si.ehu.es

Department of Computer Science and Artificial Intelligence University of the Basque Country P.O. Box 649, 20080 San Sebastian, Spain e-mail: [email protected]

Evolutionary computation is a discipline that has been emerging for at least 40 or 50 years. All methods within this discipline are characterized by maintaining a set of possible solutions (individuals) to make them successively evolve to fitter solutions generation after generation. Examples of evolutionary computation paradigms are the broadly known Genetic Algorithms (GAs) and Estimation of Distribution Algorithms (EDAs). This paper contributes to the further development of this discipline by introducing a new evolutionary computation method based on the learning and later simulation of a Bayesian classifier in every generation. In the method we propose, at each iteration the selected group of individuals of the population is divided into different classes depending on their respective fitness value. Afterwards, a Bayesian classifier—either naive Bayes, seminaive Bayes, tree augmented naive Bayes or a similar one—is learned to model the corresponding supervised classification problem. The simulation of the latter Bayesian classifier provides individuals that form the next generation. Experimental results are presented to compare the performance of this new method with different types of EDAs and GAs. The problems chosen for this purpose are combinatorial optimization problems which are commonly used in the literature. Keywords: hybrid soft computing, probabilistic reasoning, evolutionary computing, classification, optimization, Bayesian networks, estimation of distribution algorithms

1. Introduction Estimation of Distribution Algorithms (EDAs) (Larrañaga and Lozano, 2001; Mühlenbein and Paaß, 1996; Pelikan et al., 1999) deals with evolutionary computation techniques that store more than a solution every iteration similarly as Genetic Algorithms (GAs) (Goldberg, 1989; Holland, 1975). The main difference between these two paradigms is the fact that GAs evolve using crossover and mutation operators, without explicitly expressing the characteristics of the selected individuals within a population. EDAs take into account these characteristics by considering the interdependencies between the different variables that form an individual and learning a probabilistic graphical model to represent them. EDAs allow us to take into account the dependencies between variables, and they have therefore shown to be more suitable for complex problems where these types of dependencies apply (Inza et al., 2000). EDAs have a theoretical foundation in probability theory and are based on probabilistic modelling of promising solutions in com-

bination with the simulation of models induced to guide their search. In most of EDAs all selected individuals chosen for building the probabilistic graphical model, usually the fittest ones, are treated equally for the learning step, and no difference is done between the fitness of one or another. One of EDAs in which the learning takes into account the differences in fitness among the selected individuals is the Bit-Based Simulated Crossover (Syswerda, 1993). Other authors have already applied fitness in many evolutionary computation operators in the past, for instance, in multiobjective GAs (Zitzler et al., 1999; Thierens and Bosman, 2001) and in discretization (Cantu-Paz, 2001). This paper introduces EBCOAs (Evolutionary Bayesian Classifier-based Optimization Algorithms) as a new approach in evolutionary computation. The motivation for this approach that makes it innovative is twofold: firstly, it evolves a generation of individuals by constructing Bayesian classifier models that take into account deeper differences rather than simply a subset of individuals of the previous population. Secondly, it also takes into

T. Miquélez et al.

336 account the differences between individuals in the population that make them more or less fit regarding their fitness values, and it applies this knowledge to create a new population by enhancing the characteristics of the fitter individuals and tries to avoid the less fitted ones. In this paper we analyse many of the different possibilities that can be exploited in this new framework. Briefly speaking, the main contribution of this new approach is to propose the use of classification techniques in the form of Bayesian networks applied to optimization problems in order to improve the generation of individuals in every iteration. This paper is organised as follows: the next section describes the estimation of distribution algorithms, paying special attention to the step of learning the probabilistic graphical model that allows the population to improve step after step. Section 3 describes the new paradigm that we propose in this paper as an innovative way of constructing probabilistic graphical models in the discrete domain by taking into account not only the dependencies between the different variables, but also the different fitness values of each of the individuals. Section 4 describes the experiments carried out, as well as the results obtained compared to other evolutionary computation techniques. Finally, Section 5 explains the conclusions and the future work to be done in this domain.

2. Estimation of Distribution Algorithms (EDAs) 2.1. Introduction The main idea of Estimation of Distribution Algorithms (EDAs) (Larrañaga and Lozano, 2001; Mühlenbein and Paaß, 1996; Pelikan et al., 1999) is to keep a population of individuals (or a set of solutions to a particular problem) and to make them evolve in order to obtain in each iteration a population of fitter individuals. Each individual is a vector of values considered to be instantiations of statistical variables. In EDAs the new population of individuals is generated by sampling from a probabilistic graphical model. This probabilistic graphical model is learned from a database containing only selected individuals from the previous generation, and the interrelations between the different variables that form each individual are expressed explicitly through the joint probability distribution associated to the individuals selected in each iteration. Generally speaking, the EDA approach, illustrated in Fig. 1, contains the following steps: 1. The first population D0 of R individuals is generated. The generation of these R individuals is usually produced by assuming a uniform distribution on each variable, and then each individual is evaluated.

2. A number N (N < R) of individuals are selected from Dl following some criteria (usually the ones with the best fitness values are selected, although in the literature there are many different selection procedures1 that allow any individual to be selected). These individuals form the selected population DlN . 3. The n-dimensional probabilistic model pl ( x) = pl ( x|DlN ) that better represents the interdependencies between the n variables is induced. This model is created in the form of a probabilistic graphical model (i.e. a Bayesian network if the domain is discrete) containing the variables X1 , X2 , . . . , Xn , where n is the size of each individual. 4. Finally, the new population Dl+1 formed from the R new individuals is obtained by carrying out the simulation of the probability distribution learned in the previous step. Steps 2, 3 and 4 are repeated until a stopping criterion is satisfied. Examples of stopping criteria are: achieving a fixed number of populations or a fixed number of different individuals, uniformity in the generated population, or the fact of having arrived at the optimum solution (at least, if the latter is known). The step of estimating the joint probability distribution associated with the database of the selected individuals constitutes the hardest work to perform, and this task has to be performed for each generation. That is why methods proposed for learning probabilistic graphical models from data have been applied to EDAs. Furthermore, all the different EDA approaches proposed in the literature can be categorized in order of interdependencies between variables that they can take into account as follows: the ones that consider all the variables to be independent of each other (Baluja, 1994; Harik et al., 1998; Mühlenbein, 1998; Syswerda, 1993), the ones that consider pairwise dependencies (Baluja and Davies, 1997; Chow and Liu, 1968; Pelikan and Mühlenbein, 1999), and the ones that can take into account multiple dependencies between the variables (Etxeberria and Larrañaga, 1999; Harik, 1999; Mühlenbein and Mahning, 1999; Mühlenbein et al., 1999; Pelikan et al., 1999). The reader can find a more complete review on this topic in (Larrañaga and Lozano, 2001). 2.2. Towards a More Efficient Learning Phase The step of learning the probabilistic graphical model is performed at each iteration, and this results in a new population. In EDAs, the set of individuals selected to learn the probabilistic graphical model are usually the best ones 1

Other methods in the literature propose to create multiple copies of the fittest solutions and fewer for the inferior ones to form the new population.

Evolutionary computation based on Bayesian classifiers

337

D0 X1

X2

X3

...

Xn

eval

1 2 ...

4 5 ...

5 3 ...

2 1 ...

... ... ...

3 6 ...

13.25 32.45 ...

R

1

5

4

...

2

34.12

Selection of N I(X1 , X3 |C) > I(X3 , X4 |C) > I(X2 , X4 |C), I(X1 , X4 |C) . In (4) the branch (X1 , X3 ) is rejected since it would form a loop. Here (6) is the result of the second phase of augmenting the tree structure. Following the tree augmented naive Bayes model, and using the classifier shown in this figure, an individual x = (x1 , x2 , x3 , x4 ) will be assigned to the class c∗ = arg max p(c)p(x1 |c, x2 )p(x2 |c) c

× p(x3 |c, x2 )p(x4 |c, x3 ).

(6)

There are several other methods to build Bayesian classifiers taking into account more or fewer dependencies between variables. These methods have been extensively proposed in the last years and their number is growing quite fast as they constitute a hot research topic. Examples of Bayesian classifiers that can be found in the literature are the K-dependence Bayesian classifier (Sahami, 1996), Bayesian augmented networks (Cheng and Greiner, 1999), general Bayesian networks (Neapolitan, 2003), and Bayesian multinets (Kontkanen et al., 2000). 3.5. Description of the Main Steps of EBCOAs Having described the different Bayesian classifiers that we can apply to EBCOAs, this section describes the main steps of the method as well as the implications of the different choices to be done on them.

344 3.5.1. Supervised Classification Step: Labelling Individuals and Selecting Classes In EBCOAs, instead of selecting a subset of individuals as EDAs do, the whole population is firstly classified into a fixed number |K| of different classes. These classes are formed by dividing the whole population into groups of individuals from the fittest to the least fitted ones. The result of this procedure is to assign to each individual in Dl a label k (with k ∈ {1, 2, . . . , K}). Each of the R individuals is assigned a label k, and they form the class variable K in the database DlK . As in EBCOAs the aim is also to take into account the main characteristics that distinguish both the fittest and the less fitted classes, some of the classes in DlK could be discarded to facilitate the learning. An example of this idea is to ignore the middle classes in DlK for the learning of the Bayesian classifier, so that the differences between the most distant classes are enhanced. DlC is the result of removing from DlK the classes that are not used for learning, and C is the class variable that is used for learning as the root of the Bayesian classifiers, with |C| ≤ |K|.

3.5.2. Learning Step: Building the Bayesian Classifier Learning is performed by applying an algorithm to induce a Bayesian classifier such that it forms a Bayesian network in which the root is the variable C representing the labels of the individual (C is treated as another variable), and the rest of the variables X1 to Xn can also be present. This Bayesian network will be formed following different classifier construction algorithms such as the ones described in the previous section. Therefore, the probabilistic graphical model obtained as a result of this method will contain a maximum of n + 1 nodes (the variables X1 to Xn and C), with the variable C always being the root and the parent of all the rest. As a result of this learning procedure, probability distribution can be represented by a factorization of the form pl (c| x) ∝ pl ( x|c). It is important to realize that in our case we are not interested in obtaining the best possible Bayesian classifier to represent a strictly correct classifier. These algorithms for obtaining optimum classifiers in the form of a Bayesian network are very time consuming, and the execution time requirement is crucial in EBCOAs. Taking into account the fact that this learning step (i.e. the classifier building step) is going to be applied in every generation, it is more important to use a Bayesian classifier builder that will return a satisfactory classifier in a reasonable time rather than a perfect classifier that will be ignored in the next generation.

T. Miquélez et al. 3.5.3. Simulation Step: Instantiating the New Population The step of instantiating the probabilistic graphical model to obtain the new R individuals is also performed in a similar way as in EDAs, although there is an important difference due to the fact of the existence of the C variable in the Bayesian network: every individual will be generated using a specific criterion, such as, for instance, the probability distribution pl ( x|c). Therefore, the simulation of the individual is performed following the probability distribution learned in the previous step. But the main difference comes from the need to reflect the different characteristics of individuals from the fittest and less fitted classes. In that sense, to perform the simulation and thus the generation of new individuals that will form the next population Dl+1 , the individuals should be generated using all classes in C. Our proposal is to generate R new individuals by assigning a different number of individuals by instantiating the probability distribution of all classes proportional to p(c), knowing that X p(c) ∝ f ( x), (7) x| C( x)=c where f ( x) is the fitness value of the individual x, and C( x) is the class assigned to the individual x in DlC . After generating these new R individuals, we fuse these with the previous R individuals of the population Dl , and we select the R/C individuals that better adjust to the characteristics of each of the casses of C, thus obtaining the R individuals that will form the next population Dl+1 . The reason for doing the simulation in this way is to ensure that individuals from all classes will be present in the next generation, while giving more chance to include individuals from the fittest ones according to the fitness value of the individuals. Following this procedure, even individuals from the less fitted classes will be included in the new generations, and this fact ensures that the differences between the fittest individuals and the less fitted ones are still present in the last generations of the search process as the algorithms converge to the optimum solution. The fact of keeping these differences is important since the convergence of the whole approach is based on the ability of the Bayesian classifier to model the main characteristics that place an individual within the fittest class found in the whole search process. Another important point worth commenting regarding the generation of new individuals of the next population Dl+1 is the decision of how to instantiate some of the variables that are not present in the Bayesian network classifier. This can happen, for instance, if in the learning step we apply algorithms such as selective Bayes or seminaive Bayes. These two algorithms can induce a Bayesian

Evolutionary computation based on Bayesian classifiers classifier in which some of the variables X1 , X2 , . . . , Xn are not present at all. Note that for the purpose of instantiating new individuals using such a model, this is a different situation as to have these variables present but disconnected, as even when the variables appear to be disconnected they have a probability distribution that has been learned and therefore they have probability distribution pl ( x) estimated for allowing simulating new individuals. The meaning of not having a variable Xi present in the final Bayesian classifier structure implies that the values assigned to such a variable in the individuals of all |C| classes are not relevant for distinguishing between them. This has an important consequence, since it does not mean that the value assigned to such variables is not important and that any value can be set. Note that the individual is a point in the search space for a specific problem, and that all values assigned to all the variables are usually relevant for obtaining a fitted individual and therefore converge to the optimum solution. However, as the search goes on, some variables might have the same values on the best and worst classes, and therefore in the learning step of EBCOAs these will be removed from the Bayesian classifiers. As a result, we propose to simulate the variables not present in the Bayesian classifier as follows: we consider that it is important to distinguish between irrelevant variables (i.e. variables that always have the same values in all classes) and redundant variables (i.e. those in which all values appear similarly in the different classes and therefore do not reflect any difference between the characteristics of the classes). For the former, the estimated probability for a redundant variable Xi to take its k-th value is computed as pˆ(xi ) = p(xki |c). For the latter type of variables, we assume that the probability distribution is uniform. 3.5.4. Stopping Criterion All the previous steps are repeated in EBCOAs until a stopping condition is satisfied. Examples of stopping conditions are: achieving a fixed number of populations or a fixed number of different evaluated individuals, uniformity in the generated population, and the fact of not obtaining an individual with a better fitness value after a certain number of generations.

4. Experimental Results An experiment was carried out in order to test the performance of EBCOAs compared with some EDAs and GAs. This section describes the experiments and the results obtained. We chose EDAs that take into account different numbers of dependencies between variables, in particular, UMDA (Mühlenbein, 1998), MIMIC (de Bonet et al., 1997), and EBNABIC (Etxeberria and Larrañaga, 1999).

345 K

Dl

X1

X2

X3

...

Xn

K

1 2 ... R/3

4 2 ... 3

1 3 ... 1

5 4 ... 4

... ... ... ...

3 6 ... 2

H H ... H

R/3 +1 ... 2R/3

2 ... 1

3 ... 5

1 ... 4

... ... ...

6 ... 2

M ... M

2R/3 +1 ... R

4 ... 1

2 ... 5

6 ... 7

... ... ...

6 ... 1

L ... L

Fig. 10. Three classes in DlK chosen for our experiments, from which only the clases H and L will be used in our case. The class M is simply ignored in the learning step, since those individuals will not be present in DlC .

The GAs that we chose are the broadly known basic (cGA) (Holland, 1975), elitist (eGA) (Whitley and Kauth, 1988) and steady state (ssGA) (Michalewicz, 1992) ones. We tried three standard optimization problems in the discrete domain such as HIFF, IsoPeak, and IsoTorus, which are known to be complex and full of local optima. Table 1 describes briefly these three functions. The reader can find more information on these problems in (Santana, 2004). In our particular experiments, in order to show the validity of EBCOAs, we divided each population into three different classes (|K| = 3), from which only those of the best and worst individuals are used for the learning step (|C| = 2). This is illustrated in Fig. 10. The stopping criterion in all experiments is when obtaining the optimum solution in that generation or reaching the generation number 500. Table 2 shows the mean fitness of the best individual found in the last generation, as well as the number of generations to reach the final solution for each of the experiments. In IsoPeak there is a local optimum with fitness 3906 which corresponds to the individuals with only zeros, very close to the global optimum. This fact confuses most algorithms, and even if some of them are sometimes able to find it (EBNA 3 times, and ssGA once), the results show that only the EBCOATAN B algorithm was able to find the optimum in all the ten runs. In IsoTorus there are also other local optima, and EDAs and GAs fall in these in some of the executions. From the ten runs of each algorithm, most of EDAs and GAs were able to find sometimes the global optimum (MIMIC once, EBNA and cGA 4 times, and ssGA and eGA 5 times), but EBCOAnBayes and EBCOATAN B found the global optimum in all of the 10 runs, while EBCOAseminnB−BSSJ and EBCOAselectivenBayes also found it 8 times and twice, respectively. In the HIFF fitness function the results

T. Miquélez et al.

346

Table 1. Description of the HIFF, IsoPeak, and IsoTorus fitness functions. The first column describes the objective funtion, the second the size of the individual, and the third and the fourth contain are the optimum solutions and their respective fitness values.

HIFF (x) = f (x1 , . . . , xn )

HIFF

f (x1 , . . . , xn ) = 1, if (|s| = 1) |s| + f (x1 , . . . , x 2s ) if (|s| > 1) P |s| +f (x 2s +1 , . . . , xs ) and xi = 0 , i=1 P |s| or xi = |s| i=1 f (x1 , . . . , x 2s ) +f (x s , . . . , x ) otherwise s +1 2

( n = 64

(1, 1, . . . , 1) (0, 0, . . . , 0)

Opt = 448

n = 64

(1,1,. . . ,1)

Opt = 3907

n = 64

(1,1,. . . ,1)

Opt = 505

m=n+1 m IsoC1 = m−1 0 ( IsoC2 =

IsoPeak

m 0

if x = 00 if x = 11 otherwise if x = 11 otherwise

FIsoPeak ( x) = IsoC2 (x1 , x2 ) m P + IsoC1 (xi , xi+1 ) i=2

n = m2 m IsoT1 = m−1 0 ( IsoT2 =

IsoTorus

m2 0

if u = 0 if u = 5 otherwise if u = 5 otherwise

FIsoTorus = IsoT1 (x1−m+n + x1−m+n + x1 + x2 + x1+m ) n P + IsoT2 (xup + xleft + xi + xright + xdown ) i=2

where xup , xleft , xi , xright , xdown are defined as the appropriate neighbors

Evolutionary computation based on Bayesian classifiers

347

Table 2. Mean results after 10 executions with each algorithm and objective function. The Ev and Val columns represent respectively the best fitness value obtained in the last generation, and the evaluations number in which it ended. HIFF

IsoPeak Ev. Val.

IsoTorus Ev. Val.

Ev.

Val.

EBCOAnBayes EBCOAselectivenBayes EBCOAseminnB−F SSJ EBCOAseminnB−BSSJ EBCOATAN B

105036.8 94640.7 249838.2 189178.9 4589.9

290 355.2 290.2 184.5 448

51995.4 43910.0 249893.5 58694.3 4391.8

3906 3906 3859.8 3803.8 3907

25175.9 207914.1 227610.3 66701.9 3989.6

505 472 471.6 474.3 505

UMDA MIMIC EBNA cGA eGA ssGA

107120.4 97572.0 23336.0 202000 202000 202000

295.6 283.2 448 395.2 388.8 448

67303.3 69385.9 19708.6 202000 202000 202000

3905.5 3906 3906.3 3628.1 3793.7 3906.1

47244.7 46941 28703.0 202000 202000 202000

400.3 422.3 485.2 477.2 488.5 488.5

are more similar between EBCOAs, EDAs and GAs, since EBCOATAN B , EBNA and ssGA obtained the best result in all the 10 runs. Also note that most of EBCOAs require fewer evaluations (e.g. fewer different solutions to be evaluated during the search) to reach these final results. These results show that the tree augmented naive Bayes approach performs very well in all these fitness functions, even improving the results obtained in many EDAs and GAs. Also, if we compare the behaviour of EBCOAnBayes with that of UMDA, its EDA equivalent in taking into account the dependencies between variables, we see that the results are at least comparable. Finally, regarding the seminaive and selective Bayes approaches, after monitoring the evolution of the search we realized that the choice of how to instantiate the variables that are not present in the Bayesian classifier is the main reason for these results, and further research is already in progress. In the light of the results we can conclude that the new paradigm EBCOA produced promising results in this experiment, sometimes giving better and comparable results to GAs and EDAs. However, their potential is still to be analysed, as there are still many different aspects that need to be tested and could result in a considerable improvement in the performance of these algorithms.

5. Conclusions and Further Work This paper introduces for the first time a new paradigm, Evolutionary Bayesian Classifier-based Optimization Algorithms (EBCOAs), which combines both evolutionary computation techniques and Bayesian classifiers in order

to solve optimization problems. The theoretical foundations and the generic pseudocode have been introduced for this new evolutionary computation paradigm. This paper also illustrates the behaviour of these algorithms in standard optimization problems in discrete domains such as HIFF, IsoPeak and IsoTorus. The first results obtained in these experiments show that some of the choices (such as the instantiation of variables not present in the Bayesian classifier) have to be revised and more techniques have to be tried. However, the fact that some EBCOAs perform in these problems in a similar way and even outperform in some cases EDAs and GAs is a promising result to encourage further testing. This experiment was performed with general objective functions, and further testing should be done with more complex problems and using EBCOAs that can take into account higher-order dependencies between variables. We reckon that the application of more complex EBCOAs to these problems should turn out to improve the performance of even EDAs and GAs. Future research trends also include the study and experimentation of new Bayesian network classifiers that are capable of taking into account more interdependencies than the ones introduced here. An example of possible structures to apply are the generalization of structures from the EBNA approach in problems where the dependencies between variables are high. Another future research topic for EBCOAs also includes applying classification techniques for building statistical probabilistic graphical models in continuous domains so that we can compare their performance with continuous EDAs and other evolutionary approaches in continuous domains.

T. Miquélez et al.

348

Acknowledgments This work was partly supported by the University of the Basque Country within the project 9/UPV-EHU 00140.226-15334/2003. The authors would also like to thank J.A. Lozano, Iñaki Inza, and the other members of the Intelligent Systems Group of the University of the Basque Country for their useful advise and contribution to this work.

Friedman N., Geiger D. and Goldsmidt M. (1997): Bayesian network classifiers. — Mach. Learn., Vol. 29, No. 2, pp. 131– 163. Gammerman A. and Thatcher A.R. (1991): Bayesian diagnostic probabilities without assuming independence of symptoms. — Meth. Inf. Medic., Vol. 30, No. 1, pp. 15–22. Goldberg D.E. (1989): Genetic Algorithms in Search, Optimization, and Machine Learning. — Reading: AddisonWesley.

References

Harik G. (1999): Linkage learning via probabilistic modeling in the EcGA. — Techn. Rep., University of Illinois, Urbana, IlliGAL Report No. 99010.

Baluja S. (1994): Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. — Techn. Rep., Carnegie Mellon, CMU-CS-94-163.

Harik G., Lobo F.G. and Golberg D.E. (1998): The compact genetic algorithm. — Proc. IEEE Conf. Evolutionary Computation, Piscataway, NJ, pp. 523–528.

Baluja S. and Davies S. (1997): Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. — Techn. Rep., Carnegie Mellon, CMUCS-97-107. Cantu-Paz E. (2001): Supervised and unsupervised discretization methods for evolutionary algorithms. — Proc. Genetic and Evolutionary Computation Conference (GECCO’2001), Workshop Optimization by Building and Using Probabilisitic Models, San Francisco, California, pp. 213–216. Cestnik B., Kononenko I. and Bratko I. (1987): ASSISTANT-86: A knowledge elicitation tool for sophisticated users, In: Progress in Machine Learning (I. Bratko and N. Lavrac, Eds.). — Wilmslow, U.K.: Sigma Press, pp. 31–45. Cheng J. and Greiner R. (1999): Comparing Bayesian network classifiers. — Proc. 15th Conf. Uncertainty in Artificial Intelligence, San Francisco, CA: Morgan Kaufmann Publishers, pp. 101–107. Chow C. and Liu C. (1968): Approximating discrete probability distributions with dependence trees. — IEEE Trans. Inf. Theory, Vol. 14, No. 3, pp. 462–467. de Bonet J.S., Isbell C.L. and Viola P. (1997): MIMIC: Finding optima by estimating probability densities, In: Advances in Neural Information Processing Systems (M. Mozer, M. Jordan and Th. Petsche, Eds.). — Cambridge, MA: The MIT Press, Vol. 9, pp. 424–431. Domingos P. and Pazzani M. (1997): On the optimality of the simple Bayesian classifier under zero-one loss. — Mach. Learn., Vol. 29, No. 2–3, pp. 103–130. Duda R. and Hart P. (1973): Pattern Classification and Scene Analysis. — New York: Wiley. Etxeberria R. and Larrañaga P. (1999): Global optimization with Bayesian networks. — Proc. 2nd Symp. Artificial Intelligence, CIMAF99, La Habana, Cuba, pp. 332–339.

Holland J.H. (1975): Adaptation in Natural and Artificial Systems. — Michigan: The University of Michigan Press. Inza I., Larrañaga P., Etxeberria R. and Sierra B. (2000): Feature subset selection by Bayesian network-based optimization. — Artif. Intell., Vol. 123, No. 1–2, pp. 157–184. Kaufman K. and Michalski R. (1999): The AQ18 machine learning and data mining system: An implementation and user’s guide. — Techn. Rep., Machine Learning and Inference Laboratory, George Manson University, Fairfax, Virginia. Kohavi R. and John G. (1997): Wrappers for feature subset selection. — Artif. Intell., Vol. 97, No. 1–2, pp. 273–324. Kononenko I. (1990): Comparison of inductive and naïve Bayesian learning approaches to automatic knowledge acquisition, In: Current Trends in Knowledge Acquisition (B. Wielinga, J. Boose, B. Gaines, G. Shereiber and M. van Someren, Eds.). — Amsterdam: IOS Press, pp. 190–197. Kononenko I. (1991): Semi-naïve Bayesian classifiers. — Proc. 6th Europ. Working Session on Learning, Porto, Portugal, pp. 206–219. Kontkanen P., Myllymäki P., Tirri H. and Valtonen K. (2000): Bayesian multinet classifiers. — Proc. 10th Int. Conf. Computing and Information (ICCI’2000). Langley P. and Sage S. (1994): Induction of selective Bayesian classifiers. — Proc. 10th Conf. Uncertainty in Artificial Intelligence, Seattle, WA, pp. 399–406. Larrañaga P. and Lozano J.A. (2001): Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. — Kluwer. Liu H. and Motoda H. (1998): Feature Selection. From Knowledge Discovery and Data Mining. — Boston: Kluwer. Llorà X. and Goldberg D.E. (2003): Wise Breeding GA via Machine Learning Techniques for Function Optimization. — Proc. Genetic and Evolutionary Computation Conference GECCO-03, Part I, Chicago, Illinois, pp. 1172–1183.

Evolutionary computation based on Bayesian classifiers Michalewicz Z. (1992): Genetic Algorithms + Data Structures = Evolution Programs. — Berlin: Springer Verlag. Michalski R.S. (2000): Learnable execution model: Evolutionary processes guided by machine learning. — Mach. Learn., Vol. 38, pp. 9–40. Minsky M. (1961): Steps toward artificial intelligence. — Trans. Inst. Radio Eng., Vol. 49, pp. 8–30. Mühlenbein H. (1998): The equation for response to selection and its use for prediction. — Evolut. Comput., Vol. 5, No. 3, pp. 303–346. Mühlenbein H. and Mahning T. (1999): FDA—A scalable evolutionary algorithm for the optimization of additively decomposed functions. — Evolut. Comput., Vol. 7, No. 4, pp. 353–376. Mühlenbein H., Mahning T. and Ochoa A. (1999): Schemata, distributions and graphical models in evolutionary optimization. — J. Heurist., Vol. 5, pp. 215–247. Mühlenbein H. and Paaß G. (1996): From recombination of genes to the estimation of distributions, I. Binary parameters, In: Parallel Problem Solving from Nature—PPSN IV (M. Voigt, W. Ebeling, I. Rechenberg and H.-P. Schwefel, Eds.). — Lecture Notes in Computer Science 1411, Berlin: Springer, pp. 178–187. Muñoz A. (2003): LEM algorithms. — Final year project, Computer Engineering Faculty, University of the Basque Country, (in Spanish). Neapolitan R.E. (2003): Learning Bayesian Networks. — Prentice Hall. Ohmann C., Yang Q., Kunneke M., Stolzing H., Thon K. and Lorenz W. (1988): Bayes theorem and conditional dependence of symptoms: Different models applied to data of upper gastrointestinal bleeding. — Meth. Inf. Medic., Vol. 27, No. 2, pp. 73–83. Pazzani M. (1997): Searching for dependencies in Bayesian classifiers, In: Learning from Data: Artificial Intelligence and Statistics V (D. Fisher and H.-J. Lenz, Eds.). — New York: Springer, pp. 239–248.

349 Pelikan M., Goldberg D.E. and Cantú-Paz E. (1999): BOA: The Bayesian optimization algorithm. — Proc. Genetic and Evolutionary Computation Conference GECCO-99, Orlando, pp. 525–532. Pelikan M. and Mühlenbein H. (1999): The bivariate marginal distribution algorithm, In: Advances in Soft ComputingEngineering Design and Manufacturing (R. Roy, T. Furuhashi and P.K. Chandhory, Eds.). — London: SpringerVerlag, pp. 521–535. Sahami M. (1996): Learning limited dependence Bayesian classifiers. — Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining, Portland, OR, pp. 335–338. Santana R. (2004): Estimation of distribution algorithms with Kikuchi approximations. — (submitted). Syswerda G. (1993): Simulated crossover in genetic algorithms, In: Foundations of Genetic Algorithms (L.D. Whitley, Ed.). — Vol. 2, San Mateo: Morgan Kaufmann, pp. 239– 255. Thierens D. and Bosman P.A.N. (2001): Multi-objective mixture-based iterated density estimation evolutionary algorithms. — Proc. Genetic and Evolutionary Computation Conference, GECCO’2001, San Francisco, pp. 663–670. Todd B.S. and Stamper R. (1994): The relative accuracy of a variety of medical diagnostic programs. — Meth. Inf. Medic., Vol. 33, No. ??, pp. 402–416. Ventura S., Herrera F., Berná J.N. and Hervás C. (2002): Evolución guiada mediante aprendizaje. Comparación en problemas de optimización. — Proc. Conf. Algoritmos Evolutivos y Bioinspirados, AEB’02, pp. 430–436, (in Spanish). Whitley D. and Kauth J. (1988): GENITOR: A different genetic algorithm. — Proc. Rocky Mountain Conf. Artificial Intelligence, USA, Vol. 2, pp. 118–130. Zitzler E., Deb K. and Thiele L. (1999): Comparison of multiobjective evolutionary algorithms on test functions of different difficulty. — Proc. 1999 Genetic and Evolutionary Computation Conf., Orlando, FL, pp.121–122.

EVOLUTIONARY COMPUTATION BASED ON BAYESIAN CLASSIFIERS T ERESA MIQUÉLEZ∗ , E NDIKA BENGOETXEA∗ , P EDRO LARRAÑAGA∗∗ ∗

∗

Department of Computer Architecture and Technology University of the Basque Country P.O. Box 649, 20080 San Sebastian, Spain e-mail: {teresa, endika}@si.ehu.es

Department of Computer Science and Artificial Intelligence University of the Basque Country P.O. Box 649, 20080 San Sebastian, Spain e-mail: [email protected]

Evolutionary computation is a discipline that has been emerging for at least 40 or 50 years. All methods within this discipline are characterized by maintaining a set of possible solutions (individuals) to make them successively evolve to fitter solutions generation after generation. Examples of evolutionary computation paradigms are the broadly known Genetic Algorithms (GAs) and Estimation of Distribution Algorithms (EDAs). This paper contributes to the further development of this discipline by introducing a new evolutionary computation method based on the learning and later simulation of a Bayesian classifier in every generation. In the method we propose, at each iteration the selected group of individuals of the population is divided into different classes depending on their respective fitness value. Afterwards, a Bayesian classifier—either naive Bayes, seminaive Bayes, tree augmented naive Bayes or a similar one—is learned to model the corresponding supervised classification problem. The simulation of the latter Bayesian classifier provides individuals that form the next generation. Experimental results are presented to compare the performance of this new method with different types of EDAs and GAs. The problems chosen for this purpose are combinatorial optimization problems which are commonly used in the literature. Keywords: hybrid soft computing, probabilistic reasoning, evolutionary computing, classification, optimization, Bayesian networks, estimation of distribution algorithms

1. Introduction Estimation of Distribution Algorithms (EDAs) (Larrañaga and Lozano, 2001; Mühlenbein and Paaß, 1996; Pelikan et al., 1999) deals with evolutionary computation techniques that store more than a solution every iteration similarly as Genetic Algorithms (GAs) (Goldberg, 1989; Holland, 1975). The main difference between these two paradigms is the fact that GAs evolve using crossover and mutation operators, without explicitly expressing the characteristics of the selected individuals within a population. EDAs take into account these characteristics by considering the interdependencies between the different variables that form an individual and learning a probabilistic graphical model to represent them. EDAs allow us to take into account the dependencies between variables, and they have therefore shown to be more suitable for complex problems where these types of dependencies apply (Inza et al., 2000). EDAs have a theoretical foundation in probability theory and are based on probabilistic modelling of promising solutions in com-

bination with the simulation of models induced to guide their search. In most of EDAs all selected individuals chosen for building the probabilistic graphical model, usually the fittest ones, are treated equally for the learning step, and no difference is done between the fitness of one or another. One of EDAs in which the learning takes into account the differences in fitness among the selected individuals is the Bit-Based Simulated Crossover (Syswerda, 1993). Other authors have already applied fitness in many evolutionary computation operators in the past, for instance, in multiobjective GAs (Zitzler et al., 1999; Thierens and Bosman, 2001) and in discretization (Cantu-Paz, 2001). This paper introduces EBCOAs (Evolutionary Bayesian Classifier-based Optimization Algorithms) as a new approach in evolutionary computation. The motivation for this approach that makes it innovative is twofold: firstly, it evolves a generation of individuals by constructing Bayesian classifier models that take into account deeper differences rather than simply a subset of individuals of the previous population. Secondly, it also takes into

T. Miquélez et al.

336 account the differences between individuals in the population that make them more or less fit regarding their fitness values, and it applies this knowledge to create a new population by enhancing the characteristics of the fitter individuals and tries to avoid the less fitted ones. In this paper we analyse many of the different possibilities that can be exploited in this new framework. Briefly speaking, the main contribution of this new approach is to propose the use of classification techniques in the form of Bayesian networks applied to optimization problems in order to improve the generation of individuals in every iteration. This paper is organised as follows: the next section describes the estimation of distribution algorithms, paying special attention to the step of learning the probabilistic graphical model that allows the population to improve step after step. Section 3 describes the new paradigm that we propose in this paper as an innovative way of constructing probabilistic graphical models in the discrete domain by taking into account not only the dependencies between the different variables, but also the different fitness values of each of the individuals. Section 4 describes the experiments carried out, as well as the results obtained compared to other evolutionary computation techniques. Finally, Section 5 explains the conclusions and the future work to be done in this domain.

2. Estimation of Distribution Algorithms (EDAs) 2.1. Introduction The main idea of Estimation of Distribution Algorithms (EDAs) (Larrañaga and Lozano, 2001; Mühlenbein and Paaß, 1996; Pelikan et al., 1999) is to keep a population of individuals (or a set of solutions to a particular problem) and to make them evolve in order to obtain in each iteration a population of fitter individuals. Each individual is a vector of values considered to be instantiations of statistical variables. In EDAs the new population of individuals is generated by sampling from a probabilistic graphical model. This probabilistic graphical model is learned from a database containing only selected individuals from the previous generation, and the interrelations between the different variables that form each individual are expressed explicitly through the joint probability distribution associated to the individuals selected in each iteration. Generally speaking, the EDA approach, illustrated in Fig. 1, contains the following steps: 1. The first population D0 of R individuals is generated. The generation of these R individuals is usually produced by assuming a uniform distribution on each variable, and then each individual is evaluated.

2. A number N (N < R) of individuals are selected from Dl following some criteria (usually the ones with the best fitness values are selected, although in the literature there are many different selection procedures1 that allow any individual to be selected). These individuals form the selected population DlN . 3. The n-dimensional probabilistic model pl ( x) = pl ( x|DlN ) that better represents the interdependencies between the n variables is induced. This model is created in the form of a probabilistic graphical model (i.e. a Bayesian network if the domain is discrete) containing the variables X1 , X2 , . . . , Xn , where n is the size of each individual. 4. Finally, the new population Dl+1 formed from the R new individuals is obtained by carrying out the simulation of the probability distribution learned in the previous step. Steps 2, 3 and 4 are repeated until a stopping criterion is satisfied. Examples of stopping criteria are: achieving a fixed number of populations or a fixed number of different individuals, uniformity in the generated population, or the fact of having arrived at the optimum solution (at least, if the latter is known). The step of estimating the joint probability distribution associated with the database of the selected individuals constitutes the hardest work to perform, and this task has to be performed for each generation. That is why methods proposed for learning probabilistic graphical models from data have been applied to EDAs. Furthermore, all the different EDA approaches proposed in the literature can be categorized in order of interdependencies between variables that they can take into account as follows: the ones that consider all the variables to be independent of each other (Baluja, 1994; Harik et al., 1998; Mühlenbein, 1998; Syswerda, 1993), the ones that consider pairwise dependencies (Baluja and Davies, 1997; Chow and Liu, 1968; Pelikan and Mühlenbein, 1999), and the ones that can take into account multiple dependencies between the variables (Etxeberria and Larrañaga, 1999; Harik, 1999; Mühlenbein and Mahning, 1999; Mühlenbein et al., 1999; Pelikan et al., 1999). The reader can find a more complete review on this topic in (Larrañaga and Lozano, 2001). 2.2. Towards a More Efficient Learning Phase The step of learning the probabilistic graphical model is performed at each iteration, and this results in a new population. In EDAs, the set of individuals selected to learn the probabilistic graphical model are usually the best ones 1

Other methods in the literature propose to create multiple copies of the fittest solutions and fewer for the inferior ones to form the new population.

Evolutionary computation based on Bayesian classifiers

337

D0 X1

X2

X3

...

Xn

eval

1 2 ...

4 5 ...

5 3 ...

2 1 ...

... ... ...

3 6 ...

13.25 32.45 ...

R

1

5

4

...

2

34.12

Selection of N I(X1 , X3 |C) > I(X3 , X4 |C) > I(X2 , X4 |C), I(X1 , X4 |C) . In (4) the branch (X1 , X3 ) is rejected since it would form a loop. Here (6) is the result of the second phase of augmenting the tree structure. Following the tree augmented naive Bayes model, and using the classifier shown in this figure, an individual x = (x1 , x2 , x3 , x4 ) will be assigned to the class c∗ = arg max p(c)p(x1 |c, x2 )p(x2 |c) c

× p(x3 |c, x2 )p(x4 |c, x3 ).

(6)

There are several other methods to build Bayesian classifiers taking into account more or fewer dependencies between variables. These methods have been extensively proposed in the last years and their number is growing quite fast as they constitute a hot research topic. Examples of Bayesian classifiers that can be found in the literature are the K-dependence Bayesian classifier (Sahami, 1996), Bayesian augmented networks (Cheng and Greiner, 1999), general Bayesian networks (Neapolitan, 2003), and Bayesian multinets (Kontkanen et al., 2000). 3.5. Description of the Main Steps of EBCOAs Having described the different Bayesian classifiers that we can apply to EBCOAs, this section describes the main steps of the method as well as the implications of the different choices to be done on them.

344 3.5.1. Supervised Classification Step: Labelling Individuals and Selecting Classes In EBCOAs, instead of selecting a subset of individuals as EDAs do, the whole population is firstly classified into a fixed number |K| of different classes. These classes are formed by dividing the whole population into groups of individuals from the fittest to the least fitted ones. The result of this procedure is to assign to each individual in Dl a label k (with k ∈ {1, 2, . . . , K}). Each of the R individuals is assigned a label k, and they form the class variable K in the database DlK . As in EBCOAs the aim is also to take into account the main characteristics that distinguish both the fittest and the less fitted classes, some of the classes in DlK could be discarded to facilitate the learning. An example of this idea is to ignore the middle classes in DlK for the learning of the Bayesian classifier, so that the differences between the most distant classes are enhanced. DlC is the result of removing from DlK the classes that are not used for learning, and C is the class variable that is used for learning as the root of the Bayesian classifiers, with |C| ≤ |K|.

3.5.2. Learning Step: Building the Bayesian Classifier Learning is performed by applying an algorithm to induce a Bayesian classifier such that it forms a Bayesian network in which the root is the variable C representing the labels of the individual (C is treated as another variable), and the rest of the variables X1 to Xn can also be present. This Bayesian network will be formed following different classifier construction algorithms such as the ones described in the previous section. Therefore, the probabilistic graphical model obtained as a result of this method will contain a maximum of n + 1 nodes (the variables X1 to Xn and C), with the variable C always being the root and the parent of all the rest. As a result of this learning procedure, probability distribution can be represented by a factorization of the form pl (c| x) ∝ pl ( x|c). It is important to realize that in our case we are not interested in obtaining the best possible Bayesian classifier to represent a strictly correct classifier. These algorithms for obtaining optimum classifiers in the form of a Bayesian network are very time consuming, and the execution time requirement is crucial in EBCOAs. Taking into account the fact that this learning step (i.e. the classifier building step) is going to be applied in every generation, it is more important to use a Bayesian classifier builder that will return a satisfactory classifier in a reasonable time rather than a perfect classifier that will be ignored in the next generation.

T. Miquélez et al. 3.5.3. Simulation Step: Instantiating the New Population The step of instantiating the probabilistic graphical model to obtain the new R individuals is also performed in a similar way as in EDAs, although there is an important difference due to the fact of the existence of the C variable in the Bayesian network: every individual will be generated using a specific criterion, such as, for instance, the probability distribution pl ( x|c). Therefore, the simulation of the individual is performed following the probability distribution learned in the previous step. But the main difference comes from the need to reflect the different characteristics of individuals from the fittest and less fitted classes. In that sense, to perform the simulation and thus the generation of new individuals that will form the next population Dl+1 , the individuals should be generated using all classes in C. Our proposal is to generate R new individuals by assigning a different number of individuals by instantiating the probability distribution of all classes proportional to p(c), knowing that X p(c) ∝ f ( x), (7) x| C( x)=c where f ( x) is the fitness value of the individual x, and C( x) is the class assigned to the individual x in DlC . After generating these new R individuals, we fuse these with the previous R individuals of the population Dl , and we select the R/C individuals that better adjust to the characteristics of each of the casses of C, thus obtaining the R individuals that will form the next population Dl+1 . The reason for doing the simulation in this way is to ensure that individuals from all classes will be present in the next generation, while giving more chance to include individuals from the fittest ones according to the fitness value of the individuals. Following this procedure, even individuals from the less fitted classes will be included in the new generations, and this fact ensures that the differences between the fittest individuals and the less fitted ones are still present in the last generations of the search process as the algorithms converge to the optimum solution. The fact of keeping these differences is important since the convergence of the whole approach is based on the ability of the Bayesian classifier to model the main characteristics that place an individual within the fittest class found in the whole search process. Another important point worth commenting regarding the generation of new individuals of the next population Dl+1 is the decision of how to instantiate some of the variables that are not present in the Bayesian network classifier. This can happen, for instance, if in the learning step we apply algorithms such as selective Bayes or seminaive Bayes. These two algorithms can induce a Bayesian

Evolutionary computation based on Bayesian classifiers classifier in which some of the variables X1 , X2 , . . . , Xn are not present at all. Note that for the purpose of instantiating new individuals using such a model, this is a different situation as to have these variables present but disconnected, as even when the variables appear to be disconnected they have a probability distribution that has been learned and therefore they have probability distribution pl ( x) estimated for allowing simulating new individuals. The meaning of not having a variable Xi present in the final Bayesian classifier structure implies that the values assigned to such a variable in the individuals of all |C| classes are not relevant for distinguishing between them. This has an important consequence, since it does not mean that the value assigned to such variables is not important and that any value can be set. Note that the individual is a point in the search space for a specific problem, and that all values assigned to all the variables are usually relevant for obtaining a fitted individual and therefore converge to the optimum solution. However, as the search goes on, some variables might have the same values on the best and worst classes, and therefore in the learning step of EBCOAs these will be removed from the Bayesian classifiers. As a result, we propose to simulate the variables not present in the Bayesian classifier as follows: we consider that it is important to distinguish between irrelevant variables (i.e. variables that always have the same values in all classes) and redundant variables (i.e. those in which all values appear similarly in the different classes and therefore do not reflect any difference between the characteristics of the classes). For the former, the estimated probability for a redundant variable Xi to take its k-th value is computed as pˆ(xi ) = p(xki |c). For the latter type of variables, we assume that the probability distribution is uniform. 3.5.4. Stopping Criterion All the previous steps are repeated in EBCOAs until a stopping condition is satisfied. Examples of stopping conditions are: achieving a fixed number of populations or a fixed number of different evaluated individuals, uniformity in the generated population, and the fact of not obtaining an individual with a better fitness value after a certain number of generations.

4. Experimental Results An experiment was carried out in order to test the performance of EBCOAs compared with some EDAs and GAs. This section describes the experiments and the results obtained. We chose EDAs that take into account different numbers of dependencies between variables, in particular, UMDA (Mühlenbein, 1998), MIMIC (de Bonet et al., 1997), and EBNABIC (Etxeberria and Larrañaga, 1999).

345 K

Dl

X1

X2

X3

...

Xn

K

1 2 ... R/3

4 2 ... 3

1 3 ... 1

5 4 ... 4

... ... ... ...

3 6 ... 2

H H ... H

R/3 +1 ... 2R/3

2 ... 1

3 ... 5

1 ... 4

... ... ...

6 ... 2

M ... M

2R/3 +1 ... R

4 ... 1

2 ... 5

6 ... 7

... ... ...

6 ... 1

L ... L

Fig. 10. Three classes in DlK chosen for our experiments, from which only the clases H and L will be used in our case. The class M is simply ignored in the learning step, since those individuals will not be present in DlC .

The GAs that we chose are the broadly known basic (cGA) (Holland, 1975), elitist (eGA) (Whitley and Kauth, 1988) and steady state (ssGA) (Michalewicz, 1992) ones. We tried three standard optimization problems in the discrete domain such as HIFF, IsoPeak, and IsoTorus, which are known to be complex and full of local optima. Table 1 describes briefly these three functions. The reader can find more information on these problems in (Santana, 2004). In our particular experiments, in order to show the validity of EBCOAs, we divided each population into three different classes (|K| = 3), from which only those of the best and worst individuals are used for the learning step (|C| = 2). This is illustrated in Fig. 10. The stopping criterion in all experiments is when obtaining the optimum solution in that generation or reaching the generation number 500. Table 2 shows the mean fitness of the best individual found in the last generation, as well as the number of generations to reach the final solution for each of the experiments. In IsoPeak there is a local optimum with fitness 3906 which corresponds to the individuals with only zeros, very close to the global optimum. This fact confuses most algorithms, and even if some of them are sometimes able to find it (EBNA 3 times, and ssGA once), the results show that only the EBCOATAN B algorithm was able to find the optimum in all the ten runs. In IsoTorus there are also other local optima, and EDAs and GAs fall in these in some of the executions. From the ten runs of each algorithm, most of EDAs and GAs were able to find sometimes the global optimum (MIMIC once, EBNA and cGA 4 times, and ssGA and eGA 5 times), but EBCOAnBayes and EBCOATAN B found the global optimum in all of the 10 runs, while EBCOAseminnB−BSSJ and EBCOAselectivenBayes also found it 8 times and twice, respectively. In the HIFF fitness function the results

T. Miquélez et al.

346

Table 1. Description of the HIFF, IsoPeak, and IsoTorus fitness functions. The first column describes the objective funtion, the second the size of the individual, and the third and the fourth contain are the optimum solutions and their respective fitness values.

HIFF (x) = f (x1 , . . . , xn )

HIFF

f (x1 , . . . , xn ) = 1, if (|s| = 1) |s| + f (x1 , . . . , x 2s ) if (|s| > 1) P |s| +f (x 2s +1 , . . . , xs ) and xi = 0 , i=1 P |s| or xi = |s| i=1 f (x1 , . . . , x 2s ) +f (x s , . . . , x ) otherwise s +1 2

( n = 64

(1, 1, . . . , 1) (0, 0, . . . , 0)

Opt = 448

n = 64

(1,1,. . . ,1)

Opt = 3907

n = 64

(1,1,. . . ,1)

Opt = 505

m=n+1 m IsoC1 = m−1 0 ( IsoC2 =

IsoPeak

m 0

if x = 00 if x = 11 otherwise if x = 11 otherwise

FIsoPeak ( x) = IsoC2 (x1 , x2 ) m P + IsoC1 (xi , xi+1 ) i=2

n = m2 m IsoT1 = m−1 0 ( IsoT2 =

IsoTorus

m2 0

if u = 0 if u = 5 otherwise if u = 5 otherwise

FIsoTorus = IsoT1 (x1−m+n + x1−m+n + x1 + x2 + x1+m ) n P + IsoT2 (xup + xleft + xi + xright + xdown ) i=2

where xup , xleft , xi , xright , xdown are defined as the appropriate neighbors

Evolutionary computation based on Bayesian classifiers

347

Table 2. Mean results after 10 executions with each algorithm and objective function. The Ev and Val columns represent respectively the best fitness value obtained in the last generation, and the evaluations number in which it ended. HIFF

IsoPeak Ev. Val.

IsoTorus Ev. Val.

Ev.

Val.

EBCOAnBayes EBCOAselectivenBayes EBCOAseminnB−F SSJ EBCOAseminnB−BSSJ EBCOATAN B

105036.8 94640.7 249838.2 189178.9 4589.9

290 355.2 290.2 184.5 448

51995.4 43910.0 249893.5 58694.3 4391.8

3906 3906 3859.8 3803.8 3907

25175.9 207914.1 227610.3 66701.9 3989.6

505 472 471.6 474.3 505

UMDA MIMIC EBNA cGA eGA ssGA

107120.4 97572.0 23336.0 202000 202000 202000

295.6 283.2 448 395.2 388.8 448

67303.3 69385.9 19708.6 202000 202000 202000

3905.5 3906 3906.3 3628.1 3793.7 3906.1

47244.7 46941 28703.0 202000 202000 202000

400.3 422.3 485.2 477.2 488.5 488.5

are more similar between EBCOAs, EDAs and GAs, since EBCOATAN B , EBNA and ssGA obtained the best result in all the 10 runs. Also note that most of EBCOAs require fewer evaluations (e.g. fewer different solutions to be evaluated during the search) to reach these final results. These results show that the tree augmented naive Bayes approach performs very well in all these fitness functions, even improving the results obtained in many EDAs and GAs. Also, if we compare the behaviour of EBCOAnBayes with that of UMDA, its EDA equivalent in taking into account the dependencies between variables, we see that the results are at least comparable. Finally, regarding the seminaive and selective Bayes approaches, after monitoring the evolution of the search we realized that the choice of how to instantiate the variables that are not present in the Bayesian classifier is the main reason for these results, and further research is already in progress. In the light of the results we can conclude that the new paradigm EBCOA produced promising results in this experiment, sometimes giving better and comparable results to GAs and EDAs. However, their potential is still to be analysed, as there are still many different aspects that need to be tested and could result in a considerable improvement in the performance of these algorithms.

5. Conclusions and Further Work This paper introduces for the first time a new paradigm, Evolutionary Bayesian Classifier-based Optimization Algorithms (EBCOAs), which combines both evolutionary computation techniques and Bayesian classifiers in order

to solve optimization problems. The theoretical foundations and the generic pseudocode have been introduced for this new evolutionary computation paradigm. This paper also illustrates the behaviour of these algorithms in standard optimization problems in discrete domains such as HIFF, IsoPeak and IsoTorus. The first results obtained in these experiments show that some of the choices (such as the instantiation of variables not present in the Bayesian classifier) have to be revised and more techniques have to be tried. However, the fact that some EBCOAs perform in these problems in a similar way and even outperform in some cases EDAs and GAs is a promising result to encourage further testing. This experiment was performed with general objective functions, and further testing should be done with more complex problems and using EBCOAs that can take into account higher-order dependencies between variables. We reckon that the application of more complex EBCOAs to these problems should turn out to improve the performance of even EDAs and GAs. Future research trends also include the study and experimentation of new Bayesian network classifiers that are capable of taking into account more interdependencies than the ones introduced here. An example of possible structures to apply are the generalization of structures from the EBNA approach in problems where the dependencies between variables are high. Another future research topic for EBCOAs also includes applying classification techniques for building statistical probabilistic graphical models in continuous domains so that we can compare their performance with continuous EDAs and other evolutionary approaches in continuous domains.

T. Miquélez et al.

348

Acknowledgments This work was partly supported by the University of the Basque Country within the project 9/UPV-EHU 00140.226-15334/2003. The authors would also like to thank J.A. Lozano, Iñaki Inza, and the other members of the Intelligent Systems Group of the University of the Basque Country for their useful advise and contribution to this work.

Friedman N., Geiger D. and Goldsmidt M. (1997): Bayesian network classifiers. — Mach. Learn., Vol. 29, No. 2, pp. 131– 163. Gammerman A. and Thatcher A.R. (1991): Bayesian diagnostic probabilities without assuming independence of symptoms. — Meth. Inf. Medic., Vol. 30, No. 1, pp. 15–22. Goldberg D.E. (1989): Genetic Algorithms in Search, Optimization, and Machine Learning. — Reading: AddisonWesley.

References

Harik G. (1999): Linkage learning via probabilistic modeling in the EcGA. — Techn. Rep., University of Illinois, Urbana, IlliGAL Report No. 99010.

Baluja S. (1994): Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. — Techn. Rep., Carnegie Mellon, CMU-CS-94-163.

Harik G., Lobo F.G. and Golberg D.E. (1998): The compact genetic algorithm. — Proc. IEEE Conf. Evolutionary Computation, Piscataway, NJ, pp. 523–528.

Baluja S. and Davies S. (1997): Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. — Techn. Rep., Carnegie Mellon, CMUCS-97-107. Cantu-Paz E. (2001): Supervised and unsupervised discretization methods for evolutionary algorithms. — Proc. Genetic and Evolutionary Computation Conference (GECCO’2001), Workshop Optimization by Building and Using Probabilisitic Models, San Francisco, California, pp. 213–216. Cestnik B., Kononenko I. and Bratko I. (1987): ASSISTANT-86: A knowledge elicitation tool for sophisticated users, In: Progress in Machine Learning (I. Bratko and N. Lavrac, Eds.). — Wilmslow, U.K.: Sigma Press, pp. 31–45. Cheng J. and Greiner R. (1999): Comparing Bayesian network classifiers. — Proc. 15th Conf. Uncertainty in Artificial Intelligence, San Francisco, CA: Morgan Kaufmann Publishers, pp. 101–107. Chow C. and Liu C. (1968): Approximating discrete probability distributions with dependence trees. — IEEE Trans. Inf. Theory, Vol. 14, No. 3, pp. 462–467. de Bonet J.S., Isbell C.L. and Viola P. (1997): MIMIC: Finding optima by estimating probability densities, In: Advances in Neural Information Processing Systems (M. Mozer, M. Jordan and Th. Petsche, Eds.). — Cambridge, MA: The MIT Press, Vol. 9, pp. 424–431. Domingos P. and Pazzani M. (1997): On the optimality of the simple Bayesian classifier under zero-one loss. — Mach. Learn., Vol. 29, No. 2–3, pp. 103–130. Duda R. and Hart P. (1973): Pattern Classification and Scene Analysis. — New York: Wiley. Etxeberria R. and Larrañaga P. (1999): Global optimization with Bayesian networks. — Proc. 2nd Symp. Artificial Intelligence, CIMAF99, La Habana, Cuba, pp. 332–339.

Holland J.H. (1975): Adaptation in Natural and Artificial Systems. — Michigan: The University of Michigan Press. Inza I., Larrañaga P., Etxeberria R. and Sierra B. (2000): Feature subset selection by Bayesian network-based optimization. — Artif. Intell., Vol. 123, No. 1–2, pp. 157–184. Kaufman K. and Michalski R. (1999): The AQ18 machine learning and data mining system: An implementation and user’s guide. — Techn. Rep., Machine Learning and Inference Laboratory, George Manson University, Fairfax, Virginia. Kohavi R. and John G. (1997): Wrappers for feature subset selection. — Artif. Intell., Vol. 97, No. 1–2, pp. 273–324. Kononenko I. (1990): Comparison of inductive and naïve Bayesian learning approaches to automatic knowledge acquisition, In: Current Trends in Knowledge Acquisition (B. Wielinga, J. Boose, B. Gaines, G. Shereiber and M. van Someren, Eds.). — Amsterdam: IOS Press, pp. 190–197. Kononenko I. (1991): Semi-naïve Bayesian classifiers. — Proc. 6th Europ. Working Session on Learning, Porto, Portugal, pp. 206–219. Kontkanen P., Myllymäki P., Tirri H. and Valtonen K. (2000): Bayesian multinet classifiers. — Proc. 10th Int. Conf. Computing and Information (ICCI’2000). Langley P. and Sage S. (1994): Induction of selective Bayesian classifiers. — Proc. 10th Conf. Uncertainty in Artificial Intelligence, Seattle, WA, pp. 399–406. Larrañaga P. and Lozano J.A. (2001): Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. — Kluwer. Liu H. and Motoda H. (1998): Feature Selection. From Knowledge Discovery and Data Mining. — Boston: Kluwer. Llorà X. and Goldberg D.E. (2003): Wise Breeding GA via Machine Learning Techniques for Function Optimization. — Proc. Genetic and Evolutionary Computation Conference GECCO-03, Part I, Chicago, Illinois, pp. 1172–1183.

Evolutionary computation based on Bayesian classifiers Michalewicz Z. (1992): Genetic Algorithms + Data Structures = Evolution Programs. — Berlin: Springer Verlag. Michalski R.S. (2000): Learnable execution model: Evolutionary processes guided by machine learning. — Mach. Learn., Vol. 38, pp. 9–40. Minsky M. (1961): Steps toward artificial intelligence. — Trans. Inst. Radio Eng., Vol. 49, pp. 8–30. Mühlenbein H. (1998): The equation for response to selection and its use for prediction. — Evolut. Comput., Vol. 5, No. 3, pp. 303–346. Mühlenbein H. and Mahning T. (1999): FDA—A scalable evolutionary algorithm for the optimization of additively decomposed functions. — Evolut. Comput., Vol. 7, No. 4, pp. 353–376. Mühlenbein H., Mahning T. and Ochoa A. (1999): Schemata, distributions and graphical models in evolutionary optimization. — J. Heurist., Vol. 5, pp. 215–247. Mühlenbein H. and Paaß G. (1996): From recombination of genes to the estimation of distributions, I. Binary parameters, In: Parallel Problem Solving from Nature—PPSN IV (M. Voigt, W. Ebeling, I. Rechenberg and H.-P. Schwefel, Eds.). — Lecture Notes in Computer Science 1411, Berlin: Springer, pp. 178–187. Muñoz A. (2003): LEM algorithms. — Final year project, Computer Engineering Faculty, University of the Basque Country, (in Spanish). Neapolitan R.E. (2003): Learning Bayesian Networks. — Prentice Hall. Ohmann C., Yang Q., Kunneke M., Stolzing H., Thon K. and Lorenz W. (1988): Bayes theorem and conditional dependence of symptoms: Different models applied to data of upper gastrointestinal bleeding. — Meth. Inf. Medic., Vol. 27, No. 2, pp. 73–83. Pazzani M. (1997): Searching for dependencies in Bayesian classifiers, In: Learning from Data: Artificial Intelligence and Statistics V (D. Fisher and H.-J. Lenz, Eds.). — New York: Springer, pp. 239–248.

349 Pelikan M., Goldberg D.E. and Cantú-Paz E. (1999): BOA: The Bayesian optimization algorithm. — Proc. Genetic and Evolutionary Computation Conference GECCO-99, Orlando, pp. 525–532. Pelikan M. and Mühlenbein H. (1999): The bivariate marginal distribution algorithm, In: Advances in Soft ComputingEngineering Design and Manufacturing (R. Roy, T. Furuhashi and P.K. Chandhory, Eds.). — London: SpringerVerlag, pp. 521–535. Sahami M. (1996): Learning limited dependence Bayesian classifiers. — Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining, Portland, OR, pp. 335–338. Santana R. (2004): Estimation of distribution algorithms with Kikuchi approximations. — (submitted). Syswerda G. (1993): Simulated crossover in genetic algorithms, In: Foundations of Genetic Algorithms (L.D. Whitley, Ed.). — Vol. 2, San Mateo: Morgan Kaufmann, pp. 239– 255. Thierens D. and Bosman P.A.N. (2001): Multi-objective mixture-based iterated density estimation evolutionary algorithms. — Proc. Genetic and Evolutionary Computation Conference, GECCO’2001, San Francisco, pp. 663–670. Todd B.S. and Stamper R. (1994): The relative accuracy of a variety of medical diagnostic programs. — Meth. Inf. Medic., Vol. 33, No. ??, pp. 402–416. Ventura S., Herrera F., Berná J.N. and Hervás C. (2002): Evolución guiada mediante aprendizaje. Comparación en problemas de optimización. — Proc. Conf. Algoritmos Evolutivos y Bioinspirados, AEB’02, pp. 430–436, (in Spanish). Whitley D. and Kauth J. (1988): GENITOR: A different genetic algorithm. — Proc. Rocky Mountain Conf. Artificial Intelligence, USA, Vol. 2, pp. 118–130. Zitzler E., Deb K. and Thiele L. (1999): Comparison of multiobjective evolutionary algorithms on test functions of different difficulty. — Proc. 1999 Genetic and Evolutionary Computation Conf., Orlando, FL, pp.121–122.