
A Graph-Based Evolutionary Algorithm: Genetic Network Programming (GNP) and Its Extension Using Reinforcement Learning

Shingo Mabu (mabu@asagi.waseda.jp)
Graduate School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0135, Japan

Kotaro Hirasawa (hirasawa@waseda.jp)
Graduate School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0135, Japan

Jinglu Hu (jinglu@waseda.jp)
Graduate School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0135, Japan

Abstract
This paper proposes a graph-based evolutionary algorithm called Genetic Network Programming (GNP). Our goal is to develop GNP, which can deal with dynamic environments efficiently and effectively, based on the distinguished expression ability of the graph (network) structure. The characteristics of GNP are as follows. 1) GNP programs are composed of a number of nodes which execute simple judgment/processing, and these nodes are connected to each other by directed links. 2) The graph structure enables GNP to re-use nodes, thus the structure can be very compact. 3) The node transition of GNP is executed according to its node connections without any terminal nodes, thus the past history of the node transition affects the current node to be used, and this characteristic works as an implicit memory function. These structural characteristics are useful for dealing with dynamic environments. Furthermore, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL), which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. In this paper, we applied GNP to the problem of determining agents' behavior in order to evaluate its effectiveness. Tileworld was used as the simulation environment. The results show some advantages of GNP over conventional methods.

Keywords
Evolutionary computation, graph structure, reinforcement learning, agent, Tileworld.

1 Introduction
A large number of studies have been conducted on evolutionary optimization techniques. Genetic Algorithm (GA) (Holland, 1975), Genetic Programming (GP) (Koza, 1992, 1994) and Evolutionary Programming (EP) (Fogel et al., 1966; Fogel, 1994) are typical evolutionary algorithms. GA evolves strings and is mainly applied to optimization problems. GP was devised later in order to expand the expression ability of GA by using tree structures. EP is a graph-structural system that creates finite state machines by evolution. In this paper, a new graph-based evolutionary algorithm named Genetic Network Programming (GNP) (Katagiri et al., 2000, 2001; Hirasawa et al., 2001; Mabu et al., 2002, 2004) is described.

Our aim in developing GNP is to deal with dynamic environments efficiently and effectively by using the higher expression ability of the graph structure and the functions inherently equipped in it. The distinguishing functions of the GNP structure are the directed graph expression, the reusability of nodes, and the implicit memory function. The directed graph expression can realize repetitive processes, which can be effective because it works like the Automatically Defined Functions (ADFs) in GP. The node transition of GNP starts from a start node and continues based on the node connections, thus it can be said that an agent's[1] actions in the past are implicitly memorized in the network flow. In addition, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL), which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. The distinguishing functions of GNP-RL are the combination of offline and online learning, and of diversified and intensified search. Although we have already proposed a method, online learning of GNP (Mabu et al., 2003), which uses only reinforcement learning to select the node connections, this method has a problem in that the Q table becomes very large, and the calculation time and memory occupation also become large. Thus, in this paper we propose a method using both evolution and reinforcement learning. Evolutionary algorithms are superior in terms of wide space search ability because they continue to evolve various individuals and select better ones (offline learning), while RL can learn incrementally, based on rewards obtained during task execution (online learning). Therefore, the combination of evolution and RL can cooperatively make good graph structures. In fact, the proposed evolutionary algorithm (diversified search) makes rough graph structures by selecting some important functions among many kinds of functions and connecting them based on fitness values after task execution, thus the Q table becomes quite compact. RL (intensified search) then selects the best function during task execution, i.e., determines an appropriate node transition.

This paper is organized as follows. Section 2 provides the related work and the comparisons with GNP. In Section 3, the details of GNP and GNP-RL are described. Section 4 explains the Tileworld problem and the available function sets, and shows the simulation results. Section 5 discusses future work and remaining problems. Section 6 is devoted to conclusions.

2 Related Work and Comparisons


A GP tree can be used as a decision tree when all function nodes are if-then type functions and all terminal nodes are concrete action functions. In this case, a tree is executed from the root node to a certain terminal node in each step, so the behaviors of agents are determined mainly by the current information. On the other hand, since GNP has an implicit memory function, GNP can determine an action using not only the current but also the past information. The most important problem for GP is the bloat of the tree. The increase in depth causes an exponential enlargement of the search space, the occupation of large amounts of memory, and an increase in calculation time. Constraining the depth of the tree is one of the ways to overcome the bloat problem. Since the graph structure of GNP has an implicit memory function and the ability to re-use nodes, GNP is expected to use necessary nodes repeatedly and create compact structures.
[1] An agent is a computer system that is situated in some environment and is capable of autonomous action in this environment in order to meet its design objectives (Weiss, 1999). In this paper, the autonomous action is determined by GNP.


We discuss the program size in the simulation section.

Evolutionary Programming (EP) is a graph-structural system used for the automatic synthesis of finite state machines (FSMs). For example, FSM programs are evolved in the iterated prisoner's dilemma game (Fogel, 1994; Angeline, 1994) and the ant problem (Angeline and Pollack, 1993). However, there are some essential differences between EP and GNP. Generally, FSMs must define their transition rules for all combinations of states and possible inputs, thus an FSM program becomes large and complex when the number of states and inputs is large. In GNP, the nodes are connected by necessity, so it is possible that only the essential inputs obtained in the current situation are used in the network flow. As a result, the graph structure of GNP can be quite compact.

PADO (Teller and Veloso, 1995, 1996; Teller, 1996) is also a graph-based algorithm, but its fundamental concept is different from that of GNP. Each node in PADO programs has two functional parts, an action part and a branch decision part, and PADO also has both a start node and an end node. The state transition of PADO is based on stack and indexed memory. Since PADO has been successfully applied to image and sound classification problems, it can be said that PADO has a splendid ability for static problems. GNP is designed mainly to deal with problems in dynamic environments. First, the main concept of GNP is to make use of the implicit memory function. Therefore, GNP does not presuppose that it uses explicit memories such as stack and indexed memories. Second, GNP has judgment nodes and processing nodes, which correspond to the branch decision parts and action parts in PADO, respectively. Note that GNP separates the judgment and processing functions, while in PADO both functions are in one node. Therefore, GNP can create more complex combinations/rules of judgment and processing. Finally, the nodes of GNP have a unique node number and the number of nodes is the same in all the individuals. This characteristic contributes to executing RL effectively in GNP-RL (see Section 3.6). Some information on the techniques of graph- or network-based GP is given in Luke and Spector (1996).

Finally, we explain the methods that combine evolution and learning. In Downing (2001), special terminal nodes for learning are introduced to GP, and the contents of the terminal nodes, i.e., the actions of agents, are determined by Q learning. In Iba (1998), a Q table is produced by GP to make an efficient state space for Q learning, e.g., if the GP program is (TAB (xy) (+z5)), it represents a 2-dimensional Q table having two axes, xy and z + 5. In Kamio and Iba (2005), the terminal nodes of GP select appropriate Q tables and the agent action is determined by the selected Q table. The most important difference between these methods and GNP-RL is how the state-action spaces (Q tables) are created. GNP creates Q tables using its graph structures. Concretely speaking, an activated node corresponds to the current state, and the selection of a function in the activated node corresponds to an action. In the other methods, the current state is determined by the combination of the inputs, and actions are actual actions such as move forward, turn right and so on.

3 Genetic Network Programming


In this section, Genetic Network Programming is explained in detail. GNP is an extension of GP in terms of gene structures. The original motivation for developing GNP is based on the more general representation ability of graphs as opposed to that of trees in dynamic environments.

Figure 1: Basic structure of GNP. The directed graph structure of an agent program consists of a start node, judgment nodes (J1, J2, . . .) and processing nodes (P1, P2, . . .) whose functions are taken from the LIBRARY and which are connected to each other by directed links interacting with the environment; time delays are attached to nodes and transitions. The gene structure of node i consists of the node gene (Ki, IDi, di) and the connection gene (Ci^A, di^A, Ci^B, di^B, . . .).

3.1 Basic Structure of GNP

3.1.1 Components
Fig. 1 shows the basic structure of GNP. A GNP program is composed of one start node, plural judgment nodes and plural processing nodes. In Fig. 1, there are one start node, two judgment nodes and two processing nodes, and they are connected to each other. The start node has no function and no conditional branch. The only role of the start node is to determine the first node to be executed. Judgment nodes have conditional branch decision functions. Each judgment node returns a judgment result and determines the next node to be executed. Processing nodes work as action/processing functions. For example, processing nodes determine the agent's actions such as go forward, turn right and turn left. In contrast to judgment nodes, processing nodes have no conditional branch. By separating processing and judgment functions, GNP can handle various combinations of judgment and processing. That is, how many judgments and which kinds of judgment should be used can be determined by evolution. Suppose there are eight kinds of judgment nodes (J1, . . . , J8) and four kinds of processing nodes (P1, . . . , P4). Then, GNP can make a node transition by selecting necessary nodes, e.g., J1 -> J5 -> J3 -> P1. Here, judgment nodes J1, J5 and J3 are needed for processing node P1. By selecting the necessary nodes, a GNP program can be quite compact and evolved efficiently. In this paper, as described above, each processing node determines an agent's action such as go forward, turn right and so on, and each judgment node determines the next node after judging "what is in front?", "what is to the right?" and so on. However, in other applications, these nodes could be assigned other functions, for example, judging sensor values (judgment) and determining wheel speeds (processing) of a Khepera robot (by K-Team Corp.), or judging whether stocks rise or drop (judgment) and determining a buy or sell strategy (processing) in stock markets.

GNP evolves the graph structure with a predefined number of nodes, so it never causes bloat[2]. In addition, GNP has the ability to use certain judgment/processing nodes repeatedly to achieve a task. Therefore, even if the number of nodes is predefined and small, GNP can perform well by making effective node connections based on re-using nodes. As a result, we do not have to prepare an excessive number of nodes. The compact structure of GNP is a quite important and distinguishing characteristic, because it contributes to saving memory consumption and calculation time.

3.1.2 Memory Function
The node transition begins from a start node, but there are no terminal nodes. After the start node, the current node is transferred according to the node connections and judgment results; in other words, the selection of the current node is influenced by the node transitions of the past. Therefore, the graph structure itself has an implicit memory function of the past agent actions. Although a judgment node is a conditional branch decision function, the GNP program is not merely an aggregate of if-then rules, because it includes information on past judgments and processing. For example, in Fig. 1, after node 1 (processing node P1) is executed, the next node becomes node 2 (judgment node J2). Therefore, when the current node is node 2, we can know that the previous processing was P1. The node transition of GNP ends when the end condition is satisfied, e.g., when the time step reaches the preassigned one or the GNP program completes the given task.

3.1.3 Time Delays
GNP has two kinds of time delays: the time delay GNP spends on judgment or processing, and the one it spends on node transitions. In real-world problems, when agents judge environments, prepare for actions and take actions, they need time. For example, when a man is walking and sees a puddle before him, he will avoid it. At that moment, it takes some time to judge the puddle (time delay of judgment), to put the judgment into action (time delay of the transition from judgment to processing) and to avoid the puddle (time delay of processing). Since time delays are listed in each node gene and are unique attributes of each node, GNP can evolve flexible programs considering time delays. In this paper, the time delay of each node transition is set at zero time units, that of each judgment node is one time unit, that of each processing node is five time units, and that of the start node is zero time units. In addition, one step of an agent's behavior is defined in such a way that one step ends when the agent uses five or more time units. Thus, in one step, an agent performs fewer than five judgments and one processing, or five judgments. Suppose there are three agents (agent 0, agent 1, agent 2) in an environment. During one step, first agent 0 takes an action, next agent 1, and finally agent 2. In this way, the agents repeatedly take actions until reaching the maximum preassigned number of steps. Another important role of time delays and steps is to prevent the program from falling into deadlocks. For example, if an agent cannot execute processing because of a judgment loop, then one step ends after five judgments. Such a program is removed from the population in the evolutionary process, or the node transition is changed by the learning process of GNP-RL, as described later.
[2] A phenomenon in which the program size, i.e., the number of nodes, becomes too large as the generation goes on.
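To make the step accounting above concrete, the following Python sketch simulates one step under the delays used in this paper (one time unit per judgment, five per processing, zero per transition, a step ending once five or more time units have been consumed). The node interface (kind, judge(), process(), next_node) is a hypothetical illustration, not the authors' implementation.

JUDGMENT_DELAY = 1    # time units spent on a judgment node (this paper's setting)
PROCESSING_DELAY = 5  # time units spent on a processing node
STEP_LIMIT = 5        # one step ends once five or more time units are used

def run_one_step(program, current_node):
    """Execute nodes until the agent has used STEP_LIMIT time units,
    i.e., fewer than five judgments plus one processing, or five judgments."""
    used = 0
    while used < STEP_LIMIT:
        node = program[current_node]
        if node.kind == "judgment":
            used += JUDGMENT_DELAY
            result = node.judge()                  # e.g., 'tile', 'hole', 'floor', ...
            current_node = node.next_node[result]  # branch selected by the result
        else:                                      # processing node
            used += PROCESSING_DELAY
            node.process()                         # e.g., move forward
            current_node = node.next_node["A"]     # processing nodes have one branch
    return current_node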


3.2 Gene Structure
The graph structure of GNP is determined by the combination of the following node genes. The genetic code of node i (0 <= i <= n-1)[3] is also shown in Fig. 1. Ki represents the node type: Ki = 0 means a start node, Ki = 1 means a judgment node and Ki = 2 means a processing node. IDi represents the identification number of the node function, e.g., Ki = 1 and IDi = 2 mean the node is J2. di is the time delay spent on judgment or processing. Ci^A, Ci^B, . . . show the node numbers connected from node i. di^A, di^B, . . . mean the time delays spent on the transition from node i to node Ci^A, Ci^B, . . . , respectively. Judgment nodes determine the upper suffix of the connection genes to refer to depending on their judgment results. For example, if the judgment result is B, GNP refers to Ci^B and di^B. However, a start node and processing nodes use only Ci^A and di^A, because they have no conditional branch.

3.3 Initialization of a GNP Population
Fig. 2 shows the whole flowchart of GNP. An initial population is produced according to the following rules. First, we determine the number of each kind of node[4]; therefore, all programs in a population have the same number of nodes, and the nodes with the same node number have the same function. However, the extended algorithm GNP-RL, described later, determines the node functions automatically, so we only need to determine the number of judgment nodes and processing nodes, e.g., 40 judgment nodes and 20 processing nodes. The connection genes Ci^A, Ci^B, . . . are set at values selected randomly from 1, . . . , n-1 (except i, in order to avoid self-loops).

3.4 A Run of a GNP Program
The node transition of GNP is based on Ci. If the current node i is a judgment node, GNP executes the judgment function IDi and determines the next node using its result. For example, when the judgment result is B, the next node becomes Ci^B. When the current node is a processing node, after executing the processing function IDi, the next node becomes Ci^A.

3.5 Genetic Operators
In each generation, the elite individuals are preserved and the rest of the individuals are replaced with new ones generated by crossover and mutation. In Simulation I (Section 4.3), first, 179 individuals are selected from the population by tournament selection[5] and their genes are changed by mutation. Then, 120 individuals are also selected from the population and their genes are exchanged by crossover. Finally, the 299 individuals generated by mutation and crossover and one elite individual form the next population.
[3] Each node in a program has a unique node number from 0 to n-1 (n: total number of nodes).
[4] Five of each kind in this paper. It could be determined experimentally; however, in this paper, previous experience indicates that five nodes per kind (J1, J2, . . . , P1, P2, . . .) keep a reasonable balance between expression ability and search speed.
[5] The calculation cost of tournament selection is relatively small, because it simply compares the fitness values of some individuals, and we can easily control the selection pressure by the tournament size. Thus, we use tournament selection in this paper. The tournament size is set at six.
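As an illustration of the gene structure in Section 3.2 and the node transition rule in Section 3.4, the following Python sketch shows one possible encoding of a node gene; the field names mirror Ki, IDi, di and the connection genes, while the class layout and the concrete example values are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeGene:
    K: int                                              # 0 = start, 1 = judgment, 2 = processing
    ID: int                                             # identification number of the node function
    d: int                                              # time delay of the judgment/processing itself
    C: Dict[str, int] = field(default_factory=dict)     # branch label -> next node number
    d_tr: Dict[str, int] = field(default_factory=dict)  # branch label -> transition delay

def next_node(genome: List[NodeGene], i: int, judgment_result: str = "A") -> int:
    """Node transition of Section 3.4: a judgment node branches on its result,
    while the start node and processing nodes always follow connection A."""
    node = genome[i]
    branch = judgment_result if node.K == 1 else "A"
    return node.C[branch]

# A small illustrative genome: start -> P1 -> J2 -> (P2 or J1), in the spirit of Fig. 1.
genome = [
    NodeGene(K=0, ID=0, d=0, C={"A": 1}, d_tr={"A": 0}),                     # node 0: start
    NodeGene(K=2, ID=1, d=5, C={"A": 2}, d_tr={"A": 0}),                     # node 1: P1
    NodeGene(K=1, ID=2, d=1, C={"A": 3, "B": 4}, d_tr={"A": 0, "B": 0}),     # node 2: J2
    NodeGene(K=2, ID=2, d=5, C={"A": 2}, d_tr={"A": 0}),                     # node 3: P2
    NodeGene(K=1, ID=1, d=1, C={"A": 1, "B": 3}, d_tr={"A": 0, "B": 0}),     # node 4: J1
]
print(next_node(genome, 2, judgment_result="B"))   # judgment result B at node 2 -> node 4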


Figure 2: Flowchart of the GNP system (generate an initial population; for each individual ind, run the judgment/processing trial until it ends; after all individuals are evaluated, apply reproduction, mutation and crossover; repeat until the last generation; ind: individual number).

3.5.1 Mutation
Mutation is executed in one individual and a new one is generated [Fig. 3]. The procedure of mutation is as follows.
1. Select one individual using tournament selection and reproduce it as a parent.
2. Each connection of each node (Ci) is selected with the probability of Pm. The selected Ci is changed to another value (node number) randomly.
3. The generated new individual becomes a member of the next generation.

3.5.2 Crossover
Crossover is executed between two parents and generates two offspring [Fig. 4]. The procedure of crossover is as follows.
1. Select two individuals using tournament selection twice and reproduce them as parents.
2. Each node i is selected as a crossover node with the probability of Pc.
3. The two parents exchange the genes of the corresponding crossover nodes, i.e., the nodes with the same node number.
4. The generated new individuals become members of the next generation.

Fig. 4 shows a crossover example of a graph structure with three processing nodes for simplicity. If GNP exchanges the genes of judgment nodes, it must exchange all the genes with suffix A, B, C, . . . simultaneously (a code sketch of both operators is given below).
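The two operators can be summarized with the following Python sketch, an illustration under the NodeGene representation assumed earlier and not the authors' code: mutation rewires individual connections with probability Pm, and crossover swaps whole node genes that share a node number with probability Pc.

import copy
import random

def mutate(parent, Pm, n_nodes):
    """Section 3.5.1: each connection C_i is rewired with probability Pm
    to a randomly chosen node other than i (self-loops are avoided)."""
    child = copy.deepcopy(parent)
    for i, node in enumerate(child):
        for branch in node.C:
            if random.random() < Pm:
                node.C[branch] = random.choice([j for j in range(1, n_nodes) if j != i])
    return child

def crossover(parent1, parent2, Pc):
    """Section 3.5.2: nodes with the same node number are selected with
    probability Pc, and all of their genes (all branches A, B, ...) are swapped."""
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    for i in range(len(child1)):
        if random.random() < Pc:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2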

Figure 3: Mutation (each branch is selected with the probability Pm, and the selected branch is reconnected to another node randomly).

3.6 Extended Algorithm: GNP with Reinforcement Learning (GNP-RL)
In this subsection, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL). Standard GNP (SGNP), described in the previous section, is based on the general evolutionary framework of selection, crossover and mutation. GNP-RL is based on evolution and reinforcement learning (Sutton and Barto, 1998). The aim of combining RL with evolution is to produce programs using the current information (state and reward) during task execution. Evolution-based methods change their programs mainly after task execution or after enough trials, i.e., offline learning. On the other hand, GNP-RL can change its programs incrementally based on rewards obtained during task execution, i.e., online learning. For example, when an agent takes a good action with a positive reward in a certain state, the action is reinforced and will be adopted with higher probability when the state is visited again. Online learning is one of the advantages of GNP-RL. The other advantage is the combination of the diversified search of evolution and the intensified search of RL. The role of evolution is to make rough structures, i.e., plural paths of node transition, through selection, crossover and mutation. The role of RL is to determine one appropriate path in a structure made by evolution. Because RL is executed based on immediate rewards obtained after taking actions, an intensified search, i.e., a local search, can be executed efficiently. Evolution changes programs more largely than RL, and the programs (solutions) can thereby escape from local minima, so we call evolution a diversified search. The nodes of GNP have a unique node number and the number of nodes (states) is the same in all the individuals. In addition, the crossover operator exchanges the nodes with the same node number. Therefore, large changes of the Q tables do not occur and the knowledge obtained in the previous generation can be used effectively in the current generation.

3.6.1 Basic Structure of GNP-RL
Fig. 5 shows the basic structure of GNP-RL. The difference between GNP-RL and SGNP is whether or not plural functions exist in a node. Each node of SGNP has one function, but a node of GNP-RL has several functions, and one of them is selected based on a policy. Ki represents the node type, which is the same as in SGNP. IDip (1 <= p <= mi)[6] shows the identification number of the node function. In Fig. 5, mi of all nodes is set at 2, i.e., GNP can select the node function IDi1 or IDi2. Qip is a Q value which is assigned to each state and action pair.
[6] mi (1 <= mi <= M; M: the maximum number of functions in a node, e.g., M = 4) shows the number of node functions GNP can select at the current node i. mi is determined randomly at the beginning of the first generation, but it can be changed by mutation.


Figure 4: Crossover (each node is selected as a crossover node with the probability Pc, and the two parents exchange the genes of the corresponding nodes to generate offspring 1 and offspring 2).

In reinforcement learning, state and action must be defined. Generally, the current state is determined by the combination of the current information, e.g., sensor inputs, and an action is an actual action an agent takes, e.g., go forward. However, in GNP-RL, the current state is defined as the current node, and the selection of a node function (IDip) is defined as an action. dip is the time delay spent on judgment or processing. Cip^A, Cip^B, . . . show the node number of the next node. dip^A, dip^B, . . . mean the time delays spent on the transition from node i to node Cip^A, Cip^B, . . . , respectively.

3.6.2 A Run of GNP with Reinforcement Learning
The node transition of GNP-RL also starts from a start node and continues depending on the node connections and judgment results. If the current node i is a judgment node, first, one Q value is selected from Qi1, . . . , Qimi based on the epsilon-greedy policy. That is, the maximum Q value among Qi1, . . . , Qimi is selected with the probability of 1-epsilon, or a random one is selected with the probability of epsilon; then the corresponding IDip is selected. GNP executes the selected judgment function IDip and determines the next node depending on the judgment result. For example, if the selected function is IDi2 and the judgment result is B, the next node becomes node Ci2^B. If the current node is a processing node, GNP selects and executes a processing function in the same way as for judgment nodes, and the next node becomes node Ci2^A when the selected function is IDi2.

Here, a concrete example of node transition is explained using Fig. 6. The first node is judgment node 2, which contains the functions JF and TD (see Table 1). Suppose JF is selected based on the epsilon-greedy policy and the judgment result is D (= floor). Then the next node number becomes C21^D = 4. In node 4, the processing function MF is selected, so the agent moves forward, and the next node becomes node C42^A = 9.
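A minimal sketch of this node transition follows, assuming each GNP-RL node stores a list of Q values Q, a list of function IDs ID, and one connection dictionary per function in C (this data layout is our assumption, not the authors'); epsilon = 0.1 follows the setting given later in Section 4.2.1.

import random

def select_function(node, epsilon=0.1):
    """epsilon-greedy policy: with probability 1 - epsilon take the function with
    the largest Q value among Q_i1, ..., Q_imi; otherwise take a random one."""
    if random.random() < epsilon:
        return random.randrange(len(node.Q))
    return max(range(len(node.Q)), key=lambda p: node.Q[p])

def next_node_rl(node, p, judgment_result=None):
    """After executing the selected function ID_ip, a judgment node follows the
    branch labeled by its result; a processing node always follows branch 'A'."""
    branch = judgment_result if judgment_result is not None else "A"
    return node.C[p][branch]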

Figure 5: Basic structure of GNP with Reinforcement Learning (each node i stores the node gene Ki and mi candidate functions IDip with Q values Qip and delays dip, together with the connection genes Cip and their transition delays; in a judgment node, one branch is selected according to the judgment result, while a processing node has a single branch per function).

3.6.3 Genetic Operators
The crossover operator in GNP-RL is the same as in SGNP, i.e., all the genes of the selected nodes are exchanged. However, GNP-RL has its own mutation operators. The procedure is as follows; a code sketch is given after the list.
1. Select one individual using tournament selection and reproduce it as a parent.
2. Mutation operator: there are three kinds of mutation operators [Fig. 7], and one of them, selected uniformly, is executed.
(a) connection of functions: each node connection is re-connected to another node (Cip is changed to another node number) with the probability of Pm.
(b) content of functions: each function is selected with the probability of Pm and changed to another function, i.e., IDip and dip are each changed.
(c) number of functions: each node i is selected with the probability of Pm, and the number of functions mi is changed to 1, . . . , or M randomly. If the revised mi becomes larger than the previous mi, then one or more new functions selected from the LIBRARY are added to the node so that the number of functions becomes the revised mi. If the revised mi becomes smaller, then one or more functions are deleted from the node.
3. The generated new individual becomes a member of the next generation.
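A compact sketch of the three operators is shown below, again under the assumed list-based node layout; representing the LIBRARY as a list of (function id, delay) pairs and the way connections of newly added functions are initialized are our assumptions, since the text does not specify them.

import copy
import random

def mutate_gnp_rl(parent, Pm, n_nodes, M, library):
    """Apply one of the three GNP-RL mutation operators, chosen uniformly."""
    child = copy.deepcopy(parent)
    op = random.choice(["connection", "content", "number"])
    for i, node in enumerate(child):
        if op == "connection":                     # (a) rewire branches with prob. Pm
            for p in range(len(node.ID)):
                for branch in node.C[p]:
                    if random.random() < Pm:
                        node.C[p][branch] = random.choice(
                            [j for j in range(1, n_nodes) if j != i])
        elif op == "content":                      # (b) replace ID_ip and d_ip
            for p in range(len(node.ID)):
                if random.random() < Pm:
                    node.ID[p], node.d[p] = random.choice(library)
        else:                                      # (c) change the number of functions m_i
            if random.random() < Pm:
                new_m = random.randint(1, M)
                while len(node.ID) < new_m:        # add functions drawn from the LIBRARY
                    f_id, f_d = random.choice(library)
                    node.ID.append(f_id); node.d.append(f_d); node.Q.append(0.0)
                    node.C.append(dict(node.C[0]))  # connection init unspecified; reuse an existing one
                del node.ID[new_m:], node.d[new_m:], node.Q[new_m:], node.C[new_m:]
    return child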

Figure 6: An example of node transition using the nodes for the Tileworld problem (JF: judge forward, TD: direction of the nearest tile from the agent, MF: move forward, TL: turn left).

3.7 Learning Phase
Reinforcement learning is carried out while the agents are carrying out their tasks, and it terminates when the time step reaches the predefined number of steps. The learning phase of GNP is based on the basic Sarsa algorithm (Sutton and Barto, 1998). Sarsa calculates Q values, which are functions of state s and action a. Q values estimate the sum of the discounted rewards obtained in the future. Suppose that an agent selects an action at in state st at time t, a reward rt is obtained, and an action at+1 is taken in the next state st+1. Then Q(st, at) is updated as follows.

Q(st, at) <- Q(st, at) + α [rt + γ Q(st+1, at+1) - Q(st, at)]    (1)

α is a step size parameter, and γ is a discount rate which determines the present value of future rewards: a reward received k time steps later is worth only γ^(k-1) times what it would be worth if it were received at the current step. As described before, a state means the current node and an action means the selection of a function. Here, the procedure for updating a Q value is explained using Fig. 8, which shows states, actions and an example of node transition.
1. At time t, GNP refers to Qi1, Qi2, . . . , Qimi and selects one of them based on the epsilon-greedy policy. Suppose that GNP selects Qip and the corresponding function IDip.
2. GNP executes the function IDip, gets the reward rt, and the next node j becomes Cip^A.
3. At time t+1, GNP selects one Q value in the same way as in step 1. Suppose that Qjp' is selected.
4. The Q value is updated as follows:
   Qip <- Qip + α [rt + γ Qjp' - Qip]
5. t <- t+1, i <- j, p <- p', then return to step 2.
In this example, node i is a processing node, but if it is a judgment node, the next current node is selected among Cip^A, Cip^B, . . . depending on the judgment result.
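Step 4 can be written as a one-line Sarsa update; the sketch below uses the same assumed node layout as before, with α and γ taking the values used later in Section 4.2.1 / Table 2.

ALPHA = 0.9   # step size alpha (Table 2)
GAMMA = 0.9   # discount rate gamma (Table 2)

def sarsa_update(node_i, p, reward, node_j, p_next, alpha=ALPHA, gamma=GAMMA):
    """GNP-RL Sarsa update: the state is the current node i, the action is the
    selected function p, and (node j, p') is the next state-action pair."""
    td_error = reward + gamma * node_j.Q[p_next] - node_i.Q[p]
    node_i.Q[p] += alpha * td_error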

Figure 7: Mutation of GNP with Reinforcement Learning (three operators: the connection is changed randomly; the content of the functions is changed randomly; the number of functions is selected from 1, . . . , M).

4 Simulations
To confirm the effectiveness of the proposed method, simulations for determining the agents' behavior using the Tileworld problem (Pollack and Ringuette, 1990) are described in this section.

4.1 Tileworld
Tileworld is well known as a testbed for agent problems. Fig. 9 shows an example of Tileworld, which is a 2D grid world including multiple agents, obstacles, tiles, holes and floors. Agents can move to a contiguous cell in one step. Moreover, agents can push a tile to the next cell, except when an obstacle or another agent exists in that cell.


When a tile is dropped into a hole, the hole and the tile vanish, i.e., the hole is filled with the tile. Agents have some sensors and action abilities, and their aim is to drop as many tiles into holes as fast as possible. Therefore, agents are required to use sensors and take actions properly according to their situations. Since the given sensors and simple actions are not enough to achieve the tasks, agents must make clever combinations of judgment and processing. The nodes used by the agents are shown in Table 1. The judgment nodes {JF, JB, JL, JR} return {tile, hole, obstacle, floor, agent}, and {TD, HD, THD, STD} return {forward, backward, left, right, nothing} as judgment results, like A, B, . . . in Fig. 1. Fig. 10 shows the four directions an agent can perceive when it faces north.

4.1.1 Fitness and Reward
A trial ends when the time step reaches the preassigned number of steps, and then the fitness is calculated. Fitness is used in the evolutionary processes and Reward is used in the learning phase of GNP-RL.
Fitness = the number of dropped tiles
Reward = 1 (when an agent drops a tile into a hole)

4.2 Simulation Conditions
The simulation conditions are shown in Table 2. For the comparison, the simulations are carried out by SGNP, GNP-RL, standard GP, GP with ADFs, and EP evolving FSMs.

4.2.1 Conditions of GNP
As shown in Table 2, the number of nodes in a program is 61. In the case of SGNP, the number of each kind of judgment and processing node is fixed at five. In the case of GNP-RL, 40 judgment nodes and 20 processing nodes are used, but the numbers of the different kinds of nodes (IDi) are changed through evolution. At the first generation, all nodes have functions randomly selected from the LIBRARY; then the IDi are exchanged by crossover and also changed by mutation, thus the appropriate kinds of nodes are selected as a result of evolution.

Figure 8: An example of node transition (the state st at time t is the current node i, the action at is the selection of a function IDip with Q value Qip, a reward rt is obtained, and the next state st+1 is the node j = Cip^A).


Figure 9: Tileworld (a 2D grid world containing tiles (T), holes, obstacles, floors and agents).

Figure 10: The four directions (forward, backward, left, right) an agent perceives when it faces north.

The numbers of elite individuals and of offspring generated by crossover and mutation are predefined. Simulation I uses the environment shown in Fig. 9, where the positions of tiles and holes are fixed, so the same environment is used every generation. On the other hand, in Simulation II, the positions of tiles and holes are determined randomly, so the problem becomes more difficult and complex. In Simulation I, the best individual is preserved as an elite one, but in Simulation II, five elite individuals are preserved, because the environment changes generation by generation. In fact, the best individual in the previous generation does not always show a good result in the current generation. Therefore, in order to make the performance of each method stable, we preserve five good individuals in Simulation II. In GNP-RL, when creating offspring in the evolution, the offspring inherit the Q values of their parents and use them as initial values. That is, the offspring generated by mutation have the same Q values as their parents, because mutation does not operate on Q values. Hence, the Q values of an elite individual carry over to the next generation. Furthermore, the offspring generated by crossover have the exchanged Q values of the parents. In addition, the three agents in the Tileworld share the Q values. The crossover rate Pc and the mutation rate Pm are determined appropriately through our experiments so that they maintain the variation of the population but do not change the programs too much. The settings of the parameters used in the learning phase of GNP-RL are as follows. The step size parameter α is set at 0.9 in order to find solutions quickly, and the discount rate γ is set at 0.9 in order to sufficiently consider the future rewards. ε is set at 0.1 experimentally, which balances exploitation and exploration. In fact, programs with a lower ε fall into local minima with higher probability, and those with a higher ε take too many random actions. M (the maximum number of functions in a node) is set at the best value among M = 2, 3 and 4.

4.2.2 Conditions of GP
We use GP as a decision maker, so the function nodes are used as if-then type branch decision functions, and the terminal nodes are used as action selection functions. The terminal nodes of standard GP are composed of the processing nodes of GNP, {MF, TL, TR, ST}, and the function nodes are the judgment nodes of GNP, {JF, JB, JL, JR, TD, HD, THD, STD}.

Table 1: Function Set.

Judgment nodes
  J1  JF   judge FORWARD
  J2  JB   judge BACKWARD
  J3  JL   judge LEFT side
  J4  JR   judge RIGHT side
  J5  TD   direction of the nearest TILE from the agent
  J6  HD   direction of the nearest HOLE from the agent
  J7  THD  direction of the nearest HOLE from the nearest TILE
  J8  STD  direction of the second nearest TILE from the agent

Processing nodes
  P1  MF   move forward
  P2  TR   turn right
  P3  TL   turn left
  P4  ST   stay

Terminal nodes have no arguments, and function nodes have five arguments corresponding to the judgment results. In the case of GP with ADFs, the main tree uses {ADF1, . . . , ADF10}[7] as terminal nodes in addition to the terminal and function nodes in standard GP. The ADF tree uses the same nodes as standard GP. The genetic operators of GP used in this paper are crossover (Poli and Langdon, 1998, 1997), mutation and inversion (Koza, 1992). In the simulations, the maximum depth of the trees is fixed in order to avoid bloat, but the setting of the maximum depth is very important, because the expression ability is improved as the depth becomes large, while the search space is increased. Therefore, we try various depths within the range permitted by machine memory and calculation time, and use the full and ramped half-and-half initialization methods (Koza, 1992) in order to produce trees with various sizes and shapes in the initial population.

4.2.3 Conditions of EP
EP uses the same sensor information as the judgment nodes of GNP use, and the outputs are the same as the contents of the processing nodes. Generally, EP must define transitions and outputs for all combinations of states and inputs. Here, we would like to discuss how the complexity of the EP and GNP programs differs depending on the problem. Table 3 shows the number of outputs/connections for each individual in EP and GNP. Fig. 11 shows the number of outputs at each state/node. In case 1, there is only one sensor, which can distinguish two objects, and the number of states/nodes is 60[8]. Then, the number of outputs of EP becomes 120, that of SGNP becomes 100, and that of GNP-RL becomes 100-400 (variable).
[7] The number of ADFs in each individual is 10, and each ADF is called by the terminal nodes of the main tree.
[8] The start node of GNP is not counted because it has only one branch, determining the first judgment or processing node, and does not have any function. EP has a branch determining the first state, but it is not counted as an output.


Table 2: Simulation conditions ([ ]: conditions in Simulation II).

GNP-RL: population size 300 (120 offspring by crossover, 179 [175] by mutation, 1 [5] elite); program size 61; crossover rate Pc = 0.1; mutation rate Pm = 0.01; other parameters α = 0.9, γ = 0.9, ε = 0.1, M = 4 [3].
SGNP: population size 300 (120 by crossover, 179 [175] by mutation, 1 [5] elite); program size 61; Pc = 0.1; Pm = 0.02 [0.01].
GP / GP-ADFs: population size 300 (120 by crossover, 119 [115] by mutation, 60 by inversion, 1 [5] elite); program size (max depth) GP: 3-5, GP-ADFs: main 3, 4 / ADF 2, 3; crossover/mutation rate 0.1 [0.01].
EP: population size 300 (299 [295] by mutation, 1 [5] elite); program size: max number of states 60, 30, 5 and number of inputs 1-4; mutation rate 0.1.
Tournament size: 6.

Population size: the number of individuals. Program size: the number of nodes including one start node (GNP), the max depth (GP), or the max number of states and the number of inputs (EP).

Figure 11: The number of outputs from a node/state (one connection for a processing node; Y outputs/connections for a judgment node in GNP or a state in EP).

However, as the number of sensors (inputs) and the number of objects each sensor can distinguish increase, the number of outputs of EP becomes exponentially large (case 2 to case 4). On the other hand, the number of connections in GNP does not become exponentially large, because each judgment node deals with only one sensor and does not need to consider all the combinations of the inputs. Case 4 in Table 3 shows the case of the Tileworld problem: the total number of outputs of an EP program is 23,437,500 (= 60 x 5^8). This is impractical to use, so we limit the number of sensors (inputs) used at each state to a certain number (see Table 2). However, which inputs are used at each state is a very important matter; thus, in order to find the necessary kind(s) of input(s), an additional mutation operator is introduced, which can change the kind(s) of input(s) used at each state. Fig. 12 shows an example of an EP program using two states and one sensor at each state for simplicity. In this example, the input used at state 1 is the judgment result of JF and that at state 2 is the judgment result of TD. Then, the number of transitions and outputs is five, each corresponding to a judgment result, and each output shows the next action the agent should take.

Table 3: The relation between the number of inputs and outputs in EP and GNP.

                        case 1     case 2     case 3     case 4
X                       1          4          8          8
Y                       2          2          2          5
Z   EP                  120        960        15,360     23,437,500
    SGNP                100        100        100        220
    GNP-RL (M = 4)      100-400    100-400    100-400    220-880
                        (variable)

X: the number of inputs (sensors). Y: the number of objects each sensor can distinguish. Z: the total number of outputs (connections). The number of states/nodes is 60 (in the case of GNP: 40 judgment nodes and 20 processing nodes).
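As a consistency check on the case 4 row of Table 3 (the Tileworld setting: X = 8 sensors, Y = 5 distinguishable objects per sensor, 60 states/nodes), the entries can be recomputed as follows; the formulas are our reading of how the table was obtained and are not stated explicitly in the text.

Z_EP = 60 x Y^X = 60 x 5^8 = 23,437,500
Z_SGNP = 40 x 5 (judgment nodes) + 20 x 1 (processing nodes) = 220
220 <= Z_GNP-RL <= M x Z_SGNP = 4 x 220 = 880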

Figure 12: An example of an EP program (two states, one input per state: state 1 uses JF and state 2 uses TD; each transition is labeled with judgment result/output action, e.g., tile/MF, hole/TR, obstacle/TL, floor/MF).

If the number of inputs at each state is two instead of one, the number of transitions and outputs becomes 25 each. At the beginning of the first generation, the predefined number of sensors is assigned to each state randomly, but the types of sensors are changed by the mutation operator as the generation goes on. As shown in Table 2, the number of sensors (inputs) and the maximum number of states are set at several values in order to find good settings for EP. The mutation operators used in the simulations are {ADD STATE, REMOVE STATE, CHANGE TRANSITION, CHANGE OUTPUT, CHANGE INITIAL STATE, CHANGE INPUT}. The last operator is the additional one adopted in this paper, and the others are the same as the ones used in Angeline (1994).

4.3 Simulation I
Simulation I uses the environment shown in Fig. 9, where there are 30 tiles and 30 holes. The three agents have the same program made by GNP, GP or EP. In this environment, since each input for the judgment nodes is not the complete information needed to distinguish various situations, each method is required to judge its situation and take actions properly by combining various kinds of judgment and processing nodes. The maximum number of time steps is set at 150.

Table 4: The data on the fitness of the best individuals at the last generation in Simulation I.

                          GNP-RL    SGNP          GP-ADFs        GP             EP
average                   21.23     18.00         15.43          14.00          16.30
standard deviation        2.73      1.88          1.94           2.00           1.99
t-test      vs. GNP-RL    -         1.04 x 10^-6  3.03 x 10^-13  3.13 x 10^-17  5.31 x 10^-11
(p-value)   vs. SGNP      -         -             1.32 x 10^-6   3.17 x 10^-11  5.95 x 10^-4

Figs. 13, 14 and 15 show the fitness curves of the best individuals at each generation, averaged over 30 independent simulations. From the results, GNP-RL shows the best fitness value at generation 5000. In early generations, SGNP exhibited better fitness, because the Q values in GNP-RL are set at zero in the first generation and must be updated gradually. However, GNP-RL produces the better result in the later generations owing to its appropriately learned Q values. Table 4 shows the averaged fitness values of the best individuals over 30 simulations[9] at the last generation, their standard deviation, and the results of a t-test (one-sided test). The results of the t-test show the p-values between GNP-RL and the other methods, and between SGNP and the other methods. There are significant differences between GNP-RL and the other methods, and between SGNP and GP, GP with ADFs, and EP. Although it seems natural that the method using RL can obtain better solutions than the methods without it, the aim of developing GNP-RL is to solve problems faster than the others within the same time limit of actions (same steps). In other words, GNP-RL aims to make full use of the information obtained during task execution in order to make the appropriate node transitions. From Fig. 14, we see that standard GP of depth four, initialized by the full method (GP-full4), shows better results than the other standard GP programs, and GP with ADFs of depth three (main tree) and depth two (ADF tree) initialized by the full method (GP-ADF-full3-2) produces the best result of all the GP programs. However, in this problem, the arity of the function nodes of GP is relatively large (five), so the total number of nodes of GP becomes quite large as the depth becomes large. For example, GP-full4 has 781 nodes, GP-full5 has 3,906 nodes, and GP-full6 has 19,531 nodes. Although GP programs can have higher expression ability as the number of nodes increases, they take much time to explore the programs and much memory is needed. For example, GP-full6 takes too much time to execute the programs, and GP with depth 7 cannot be executed because of the lack of memory in our machine (Pentium 4 2.5 GHz, DDR-SDRAM PC2100 512 MB). On the other hand, GNP can obtain good results using a relatively small number of nodes. As shown in Fig. 15, EP using three inputs and five states shows better results, so this setting is suitable for the environment. EP uses a graph structure, so it can also execute state transitions considering the agents' past actions. Furthermore, as the number of states increases, EP can implicitly memorize the past action sequences. However, if there are many inputs, this causes a large number of outputs and state transition rules, and the programs then become impractical for exploration and execution. The structure of GNP does not become exponentially large even if the number of inputs increases, as described in Section 4.2.3; therefore, many more states (nodes) can be used in GNP than in EP.
[9] The results of the best settings, i.e., GP-full4, GP-ADF-full3-2 and EP-input3-state5. Tables 5, 6 and 7 also show the results of the best settings.


Figure 13: Fitness curves of GNP in Simulation I (fitness at the last generation: GNP-RL 21.23, SGNP 18.00).

Figure 14: Fitness curves of GP in Simulation I (GP-ADF-full3-2 15.43, GP-ADF-full4-3 14.46, GP-full4 14.00, GP-ADF-ramp3-2 13.86, GP-full5 13.76, GP-ADF-ramp4-3 13.5, GP-ramp5 10.30, GP-ramp4 10.10).

Figure 15: Fitness curves of EP in Simulation I (EP-input3-state5 16.30, EP-input4-state5 14.93, EP-input2-state30 13.70, EP-input1-state60 13.30).



Table 5: Calculation time for 5000 generations in Simulation I.

                               GNP-RL   SGNP    GP-ADFs   GP      EP
Calculation time [s]           1,364    1,019   3,252     3,281   2,802
Ratio of each method to SGNP   1.34     1       3.19      3.22    2.75


Figure 16: Change of the average number of functions mi in each node in GNP-RL.

As a result, the implicit memory function of GNP becomes more effective in dynamic environments than that of EP.

Table 5 shows the calculation time for 5,000 generations. SGNP is the fastest, GNP-RL is second and EP is third. GNP-RL takes more time than SGNP, because it executes RL during the tasks; however, the overhead is not large. The maximum number of functions (M) in each node is 4 and one of them is selected as an action; this procedure does not take much time. In addition, and more importantly, mi tends to decrease as the generation goes on, as shown in Fig. 16 (1.75 at the last generation), because the appropriate number and contents of functions are selected automatically in the evolutionary processes. Therefore, reinforcement learning just selects one function from 1.75 functions on average. This tendency contributes to saving time. In addition, the relatively large ε (= 0.1) succeeded in achieving the tasks thanks to this tendency, because fewer than two functions (actions) are in a node on average at the last generation, while the relatively large ε is useful at the beginning of the generations, because the agents can try many kinds of actions and find good ones by RL. EP takes more time than SGNP and GNP-RL, but it can save calculation time compared with ordinary EP, because the number of inputs is limited to three and the structure becomes compact. Actually, the calculation time of EP using four inputs is 4,184 s. GP and GP with ADFs have many nodes, thus they take more time than the others in the evolutionary processes.

Fig. 17 (a) shows the typical node transition of the upper-left agent (in Fig. 9) operated by SGNP. The x-axis shows the symbols of the nodes, and the y-axis distinguishes the nodes of the same kind, i.e., there are five nodes per kind/symbol[10] and they are numbered 0, 1, 2, 3 and 4. For example, (x, y) = (4, 1) shows the second JF node. From the figure, we can see that specific nodes are used repeatedly.
[10] Five JF nodes, five JB nodes, . . . , five MF nodes, five TL nodes, . . . are used in SGNP.


Figure 17: Node transition of standard GNP. (a) Whole node transition for 150 steps (x-axis: node symbols MF, TL, TR, ST, JF, JB, JL, JR, TD, HD, THD, STD; y-axis: index 0-4 within each kind). (b) Partial node transition extracted from (a).

Fig. 17 (b) shows the partial node transition extracted from the whole node transition [Fig. 17 (a)]. In Fig. 17 (b), the first node is MF (0,1), so the agent moves forward, and the next node becomes JF (4,1) according to the node connection from MF (0,1). Thus the points (0,1) and (4,1) in Fig. 17 (a) are connected with a line. Next, the judgment JF (4,1) is executed and the judgment result is floor, thus the corresponding node branch (connected to TD (8,0)) is selected. Then the points (4,1) and (8,0) in Fig. 17 (a) are connected. After executing the judgment TD (8,0) (judgment result: forward), the agent goes forward, judges forward (judgment result: obstacle), judges the tile direction (judgment result: right), and so on.

Finally, the simulations using Environments A, B and C [Figs. 18, 19 and 20] are carried out. The condition of each method is the same as the one showing the best result in the previous environment [Fig. 9]. From Figs. 21, 22 and 23, GNP-RL and SGNP show better results than the other methods.


Figure 18: Environment A. Figure 19: Environment B. Figure 20: Environment C.

4.4 Simulation II
In Simulation II, we use an environment whose size (20x20) and distribution of obstacles are the same as in Simulation I. However, 20 tiles and 20 holes are set at random positions at the beginning of each generation. In addition, when an agent drops a tile into a hole, the tile and the hole disappear; however, a new tile and a new hole appear at random positions. Therefore, the individuals obtained in the previous generation are required to show good performance in the new, unexperienced environment. This problem is more dynamic and more suitable than Simulation I in terms of confirming the generalization ability of each method. The maximum number of steps is set at 300.

Figs. 24, 25[11] and 26 show the averaged fitness curves of the best individuals over 30 independent simulations at each generation. From the figures, we can see that GNP-RL obtains the highest fitness value at the last generation, because the information obtained during task execution is used for making node transitions efficiently. From Table 6, we can also see that there are significant differences between GNP-RL and the other methods. SGNP obtains a better fitness value than GP and GP-ADFs at the last generation, but, from Table 6, it is found that there is no significant difference between SGNP and EP-input1-state60. In the case of EP, it is interesting to note that the programs in Simulation II show the opposite results to those in Simulation I, i.e., the program using one input shows better results in Simulation II, while that using three inputs shows better results in Simulation I. Therefore, for EP in this environment, it is recommended that an action be determined by one input and that a relatively large number of states be used. In other words, this EP makes many simple rules and combines them considering the past state transitions. This special structure of EP is similar to that of GNP. However, the advantage of GNP is that it automatically selects the necessary number of inputs and actions depending on the situations; moreover, it is found that GNP programs with 61 nodes show good results in both Simulation I and II, therefore we do not have to worry about the setting of the number of nodes. In fact, there are significant differences between SGNP and EP-input3-state5 (p-value = 8.67 x 10^-6), which shows the best fitness value in Simulation I, EP-input2-state30 (1.12 x 10^-3) and EP-input4-state5 (1.29 x 10^-9). In the case of GP, it is difficult to find effective programs, because the environment changes randomly generation by generation. In addition, GP has relatively complex structures and a wide search space compared to GNP and EP, thus it is more difficult for GP to explore solutions.
[11] Fig. 25 shows the fitness curves of GP-full5 and GP-ADF-full3-2, and the fitness values of the other settings at the last generation. Because the fitness curves of the GP settings overlap each other, only the best two results (GP-full5 and GP-ADF-full3-2) are drawn.


Figure 21: Fitness curves in Environment A (GNP-RL 20.80, SGNP 16.87, EP 15.97, GP-ADF 14.20, GP 13.80).

Figure 22: Fitness curves in Environment B (GNP-RL 24.37, SGNP 18.83, EP 17.00, GP 16.93, GP-ADF 16.00).

Figure 23: Fitness curves in Environment C (GNP-RL 22.83, SGNP 19.87, EP 19.17, GP-ADF 16.43, GP 13.73).


Figure 24: Fitness curves of GNP in Simulation II (GNP-RL 19.93, SGNP 15.30).

Figure 25: Fitness curves of GP in Simulation II (GP-ADF-full3-2 6.67, GP-full5 6.10, GP-full4 5.90, GP-ADF-ramp3-2 5.73, GP-ADF-ramp4-3 5.60, GP-ADF-full4-3 5.56, GP-ramp5 5.20, GP-ramp4 5.13).

Figure 26: Fitness curves of EP in Simulation II (EP-input1-state60 14.40, EP-input2-state30 12.77, EP-input3-state5 9.40, EP-input4-state5 7.67).



Table 6: The data on the fitness of the best individuals at the last generation in Simulation II.

                              GNP-RL   SGNP          GP-ADFs        GP             EP
average                       19.93    15.30         6.67           6.10           14.40
standard deviation            2.43     3.88          3.19           1.75           2.54
t-test vs. GNP-RL (p-value)   -        5.90 × 10⁻⁸   7.46 × 10⁻²⁶   1.53 × 10⁻³¹   2.90 × 10⁻¹²
t-test vs. SGNP (p-value)     -        -             1.36 × 10⁻¹³   5.91 × 10⁻¹⁵   1.46 × 10⁻¹
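For reference, the p-values in Table 6 can be reproduced from the 30 per-run fitness values of the best individuals with a two-sample t-test. The sketch below is an assumption rather than the authors' code: the paper does not state which t-test variant was used, and the raw per-run values are not listed, so synthetic samples matching the reported mean and standard deviation stand in for them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder samples; in practice these would be the 30 recorded
# best-individual fitness values of each method in Simulation II.
gnp_rl = rng.normal(19.93, 2.43, size=30)
sgnp = rng.normal(15.30, 3.88, size=30)

# equal_var=False selects Welch's t-test, one reasonable choice when the
# two methods have different variances.
t_stat, p_value = stats.ttest_ind(gnp_rl, sgnp, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```

A p-value below a conventional threshold (e.g., 0.05) indicates a significant difference, which is why the 1.46 × 10⁻¹ entry for SGNP versus EP is reported as not significant.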

Table 7: Calculation time for 5,000 generations in Simulation II.

                               GNP-RL   SGNP    GP-ADFs   GP       EP
Calculation time [s]           2,734    1,177   5,921     12,059   1,584
Ratio of each method to SGNP   2.32     1       5.03      10.25    1.35

Table 7 shows the calculation time for 5,000 generations. SGNP is the fastest, as in Simulation I. However, EP shows better results than GNP-RL because EP deals with one input at each node, so the amount of calculation in one step decreases. In fact, the total calculation time of EP becomes smaller than in Simulation I. Although GP-full5 shows the best fitness value of all the settings of standard GP, its calculation time becomes quite large because each individual of GP-full5 has 3,906 nodes.

Fig. 27 shows the ratio of the nodes used by the best individual of GNP-RL, in order to see which nodes are used and which are most efficient for solving the problem. The x-axis represents the unique node number. In GNP-RL, the total number of nodes is 61; each processing node has a node number (1-20) and each judgment node has a node number (21-60). In addition, the symbol (function) of each node is changed through evolution, thus the x-axis also shows the node symbols in order to show which symbols are selected more often as a result of evolution. For example, if the width of TD is wider than that of STD, we know that GNP-RL selects TD more than STD as a result of evolution. The y-axis shows the ratio of the used nodes. In the first generation, GNP uses various kinds of nodes randomly and thus cannot obtain effective node transition rules. However, in the last generation, MF, JF, JB, TD and THD are used frequently. So the basic behavior obtained by evolution and RL turns out to be that each agent first judges the direction of the nearest tile from the agent and the nearest hole from the nearest tile, then turns to the tile, looks forward and backward, and moves to the forward cell. The other kinds of nodes are used as necessary. It is interesting that GNP rarely uses HD but frequently uses THD. An agent can know the direction of the nearest hole from the agent by using HD, but the aim of the agent is not to reach the hole position or to drop itself into a hole. The agent must drop tiles into holes, thus it is important to know the direction of the nearest hole from the nearest tile rather than from the agent.

Finally, simulations using the other three environments (Environments A-II, B-II and C-II) are carried out. The positions of obstacles in these environments are the same as in Environments A, B and C [Figs. 18, 19 and 20], respectively. However, as in the previous environment, 20 tiles and 20 holes are set at random positions at the beginning of each generation, and a new tile and a new hole appear at random positions when an agent drops a tile into a hole. Figs. 28, 29 and 30 show the fitness curves for each environment.


Figure 27: Ratio of nodes used by GNP with Reinforcement Learning (panels for Generation 1, Generation 100 and Generation 5000; x-axis: node number and symbol, y-axis: ratio of used nodes).
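The ratios plotted in Fig. 27 can be read as visit frequencies over the node transitions of the best individual. The sketch below is an illustrative assumption (the paper does not give the exact counting procedure): node numbers visited during task execution are logged and normalized by the total number of visits.

```python
from collections import Counter
from typing import Dict, Iterable

def node_usage_ratio(transition_log: Iterable[int]) -> Dict[int, float]:
    """Fraction of visits per node number (1-20 processing, 21-60 judgment)."""
    counts = Counter(transition_log)
    total = sum(counts.values())
    return {node: count / total for node, count in sorted(counts.items())}

# Hypothetical transition log of visited node numbers during one trial.
log = [3, 25, 25, 41, 3, 12, 25, 41, 41, 3]
print(node_usage_ratio(log))  # {3: 0.3, 12: 0.1, 25: 0.3, 41: 0.3}
```

Plotting such ratios per node number, with the node symbols attached, gives the kind of bar profile shown in the three panels of Fig. 27.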

Figure 28: Fitness curves in Environment A-II (GNP-RL 22.63, EP-input1-state60 16.87, SGNP 16.17, EP-input2-state30 12.53, EP-input3-state5 11.83, EP-input4-state5 9.93, GP-ADF 6.20, GP 6.20).


Figure 29: Fitness curves in Environment B-II (GNP-RL 16.07, EP-input1-state60 11.63, SGNP 9.50, EP-input3-state5 8.37, EP-input2-state30 7.67, EP-input4-state5 7.00, GP-ADF 5.30, GP 4.80).


Figure 30: Fitness curves in Environment C-II (GNP-RL 11.87, EP-input1-state60 8.47, SGNP 8.17, EP-input3-state5 5.13, EP-input2-state30 5.10, GP-ADF 4.63, GP 4.46, EP-input4-state5 3.70).



SGNP and GNP-RL use the same settings as in the previous simulations. GP and GP-ADFs use the settings that showed the best results in the previous simulations. EP uses four settings in order to see the differences between GNP and EP. From the figures, GNP-RL shows the best fitness values in all the environments. However, SGNP does not show better fitness values than EP-input1-state60, although it shows better fitness values than GP, GP-ADFs and the EPs with other settings. We used the special EP in this paper because ordinary EP, i.e., EP using eight inputs, is impractical to use, and it is found that if we determine an appropriate number of inputs and states, it shows good results. However, GNP can automatically determine the necessary number of inputs and actions, and this characteristic is an advantage of GNP.

5 Discussion
This paper describes a basic analysis of GNP in order to apply GNP, in the future, to more complex dynamic problems such as elevator group supervisory control systems (Crites and Barto, 1998), stock prediction (Potvin et al., 2004), and RoboCup soccer (Luke, 1998). GP and EP, for example, must prepare a large number of nodes if a problem is complex, and therefore the search space and calculation time become very large. GNP produces compact structures compared to GP and EP, and can consider its past judgments and processing when determining the next judgment or processing.

GNP still has some problems that remain to be solved. First, we set up the judgment nodes and processing nodes using raw information obtained from the environment. In the Tileworld problem, for example, since GNP can obtain eight kinds of information and execute four kinds of processing, it uses eight kinds of judgment nodes and four kinds of processing nodes. However, when we apply GNP to more complex problems such as the elevator group supervisory control systems and stock prediction we are now studying, there is too much information to consider, and therefore the number of nodes increases. Although GNP can select the necessary nodes as a result of evolution, the calculation time becomes large. In order to solve this problem and apply GNP to real-world problems, we will combine some important judgments and processing using foresight information and set up new judgment nodes and processing nodes that act like macro function nodes.

Secondly, the current GNP judges discrete information, e.g., "the direction of the nearest tile is right." We will extend the GNP system to deal with continuous inputs, such as an angle of 23 degrees to the tile, using a reinforcement learning architecture such as actor-critic.

Finally, we must compare GNP with other evolutionary and RL methods on more realistic problems to confirm the applicability of the proposed method to a great variety of applications.

6 Conclusions
In this paper, a graph-based evolutionary algorithm called Genetic Network Programming (GNP) and its extended algorithm, GNP with Reinforcement Learning (GNP-RL), are proposed. In the simulations on the Tileworld problem, GNP shows good results in terms of fitness values and calculation time. It is well known that the expression ability of GP improves as the number of nodes increases, but larger trees use greater amounts of memory and take more time to explore and execute the programs. GNP, on the other hand, shows good results using a comparatively small number of nodes. It is also clarified that GNP selects important kinds of nodes and uses them repeatedly, while the unnecessary nodes gradually become unused as the generations go on. GNP-RL shows much better fitness values than SGNP because GNP-RL can obtain and utilize more information than SGNP during task execution.

References
Angeline, P. J. (1994). An alternate interpretation of the iterated prisoner's dilemma and the evolution of non-mutual cooperation. In Brooks, R. and Maes, P., editors, Proceedings of the 4th Artificial Life Conference, pages 353-358. MIT Press.

Angeline, P. J. and Pollack, J. B. (1993). Evolutionary module acquisition. In Fogel, D. and Atmar, W., editors, Proceedings of the Second Annual Conference on Evolutionary Programming, pages 154-163, La Jolla, CA.

Crites, R. H. and Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235-262.

Downing, K. L. (2001). Adaptive genetic programming via reinforcement learning. In Spector, L., Goodman, E. D., Wu, A., Langdon, W. B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M. H., and Burke, E., editors, Proceedings of the 3rd Genetic and Evolutionary Computation Conference, pages 19-26. Morgan Kaufmann.

Fogel, D. B. (1994). An introduction to simulated evolutionary optimization. IEEE Transactions on Neural Networks, 5(1):3-14.

Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial Intelligence through Simulated Evolution. John Wiley & Sons.

Hirasawa, K., Okubo, M., Katagiri, H., Hu, J., and Murata, J. (2001). Comparison between genetic network programming (GNP) and genetic programming (GP). In Proceedings of the Congress on Evolutionary Computation, pages 1276-1282. IEEE Press.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

Iba, H. (1998). Multi-agent reinforcement learning with genetic programming. In Proceedings of the Third Annual Conference on Genetic Programming, pages 167-172.

Kamio, S. and Iba, H. (2005). Adaptation technique for integrating genetic programming and reinforcement learning for real robots. IEEE Transactions on Evolutionary Computation, 9(3):318-333.

Katagiri, H., Hirasawa, K., and Hu, J. (2000). Genetic network programming - application to intelligent agents. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pages 3829-3834. IEEE Press.

Katagiri, H., Hirasawa, K., Hu, J., and Murata, J. (2001). Network structure oriented evolutionary model - genetic network programming - and its comparison with genetic programming. In Spector, L., Goodman, E. D., Wu, A., Langdon, W. B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M. H., and Burke, E., editors, 2001 Genetic and Evolutionary Computation Conference Late Breaking Papers, pages 219-226. Morgan Kaufmann.

Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA.

Koza, J. R. (1994). Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge, MA.

Luke, S. (1998). Genetic programming produced competitive soccer softbot teams for RoboCup97. In Koza, J. R., Banzhaf, W., Chellapilla, K., Deb, K., Dorigo, M., Fogel, D. B., Garzon, M. H., Goldberg, D. E., Iba, H., and Riolo, R., editors, Proceedings of the Third Annual Genetic Programming Conference. Morgan Kaufmann.

Luke, S. and Spector, L. (1996). Evolving graphs and networks with edge encoding: Preliminary report. In Koza, J. R., editor, Late Breaking Papers at the Genetic Programming 1996 Conference, Stanford University, July 28-31, 1996, pages 117-124, Stanford University, CA.

Mabu, S., Hirasawa, K., and Hu, J. (2004). Genetic network programming with reinforcement learning and its performance evaluation. In Proceedings Part II of the 2004 Genetic and Evolutionary Computation Conference, pages 710-711, Seattle, WA.

Mabu, S., Hirasawa, K., Hu, J., and Murata, J. (2002). Online learning of genetic network programming. In Proceedings of the Congress on Evolutionary Computation, pages 321-326. IEEE Press.

Mabu, S., Hirasawa, K., Hu, J., and Murata, J. (2003). Online learning of genetic network programming and its application to prisoner's dilemma game. Transactions of IEE Japan, 123-C(3):535-543.

Poli, R. and Langdon, W. B. (1997). Genetic programming with one-point crossover. In Chawdhry, P. K., Roy, R., and Pant, R. K., editors, Second On-line World Conference on Soft Computing in Engineering Design and Manufacturing, pages 180-189. Springer-Verlag, London.

Poli, R. and Langdon, W. B. (1998). Schema theory for genetic programming with one-point crossover and point mutation. Evolutionary Computation, 6(3):231-252.

Pollack, M. E. and Ringuette, M. (1990). Introducing the tile-world: Experimentally evaluating agent architectures. In Dietterich, T. and Swartout, W., editors, Proceedings of the Conference of the American Association for Artificial Intelligence, pages 183-189. AAAI Press.

Potvin, J.-Y., Soriano, P., and Vallee, M. (2004). Generating trading rules on the stock markets with genetic programming. Computers & Operations Research, 31:1033-1047.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Teller, A. (1996). Evolving programmers: The co-evolution of intelligent recombination operators. In Angeline, P. J. and Kinnear, K. E., Jr., editors, Advances in Genetic Programming, chapter 3, pages 45-68. MIT Press, Cambridge, MA.

Teller, A. and Veloso, M. (1995). PADO: Learning tree-structured algorithm for orchestration into an object recognition system. Technical report, Carnegie Mellon University.

Teller, A. and Veloso, M. (1996). PADO: A new learning architecture for object recognition. In Ikeuchi, K. and Veloso, M., editors, Symbolic Visual Learning. Oxford University Press, New York.

Weiss, G., editor (1999). Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. MIT Press, Cambridge, MA.

