4.1. Biased exploration

LCS are considered self-adaptive and autonomous learners. However, in order to reach full autonomy, the credit assignment component must decide on its own when to exploit the existing knowledge by taking the most promising action and when to deliberately select an action that does not appear best, in order to potentially gain additional knowledge.

This decision is commonly referred to as the explore/exploit (E/E) dilemma, since obtaining new knowledge through exploration incurs a short-term performance loss, while too much exploitation risks staying at an unnecessarily low level of performance in the long term [55]. In a typical exploration phase, the action is selected randomly with the intention of unbiased exploration.

Such exploration might be inefficient in real-valued environments, where the search space is presumably large. Certain regions of the space might be explored multiple times, while others remain unknown. Therefore, an approach to optimizing the process of acquiring knowledge by suggesting more valuable actions is considered herein.

At first, the model learning capabilities of ALCS were enhanced by Stolzmann and Butz by using behavioural capabilities (internal reinforcement learning or lookahead action selection) in [130] and by introducing the action planning mechanism [131], [89].

Later, Butz suggested that, by biasing exploration towards specially chosen regions of the search space with computationally inexpensive methods, the model can be learned locally [132].

This chapter aims to narrow the existing gap in research by experimentally comparing four biased exploration strategies. Any ALCS agent operating in a large observation space might benefit from reducing the time needed to form an internal model. First, the baseline method - epsilon-greedy, which is the default option for LCS - is described. Then, two methods introduced by Butz - the action-delay and knowledge-array biases - are presented. They examine the match set \([M]\), searching for indications of which action might result in the highest information gain. Finally, the last approach is inspired by the “Optimistic Initial Values” technique described by Sutton in [55]. This strategy turned out to be very effective for multi-armed bandit problems, where the main objective is to select the most promising action, but it has never been examined in any LCS domain.

Epsilon-Greedy (EG)

In the epsilon-greedy approach, the agent explores all regions of the input space equally, not favouring any specific behaviour. In each step, a random action is executed with probability \(p_{explr}\) (or \(\epsilon\)); it is then chosen uniformly at random from the classifiers composing the match set \([M]\). Otherwise, with probability \(1-p_{explr}\), the action of the fittest classifier is executed. By doing so, the agent can occasionally perform the move it considers best at a given time, reinforcing its beliefs about the consequences.
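A minimal sketch of this selection scheme is given below. The `Classifier` structure with its `action` and `fitness` attributes and the function signature are illustrative assumptions and do not mirror the exact ACS2 implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Classifier:
    action: int      # action advocated by the classifier
    fitness: float   # fitness used for the greedy (exploit) choice


def epsilon_greedy(match_set, p_explr=0.5):
    """Epsilon-greedy action selection over a match set [M]."""
    if random.random() < p_explr:
        # Explore: pick uniformly among the actions proposed in [M]
        return random.choice([cl.action for cl in match_set])
    # Exploit: execute the action of the fittest classifier in [M]
    return max(match_set, key=lambda cl: cl.fitness).action
```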

Action-Delay (AD)

This bias is based on the recency-based principle, assuming that an action executed a long time ago might introduce new situation-action-effect triples. To determine which action was executed longest ago relative to the current time \(t\), the \(t_{alp}\) field of all classifiers in the match set \([M](t)\) is analyzed. For situation \(\sigma(t)\), the action of the classifier \(cl\) with the lowest value of \(t_{alp}\) is selected.

If there exists an action not represented by any classifier in \([M](t)\), it is assumed to have been experienced longest ago (if ever) and is therefore chosen for execution.
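Continuing the sketch above, and assuming each classifier additionally exposes a `talp` attribute holding the time of its last ALP application, the action-delay bias could look as follows; the set of all possible actions is passed explicitly.

```python
def action_delay_bias(match_set, all_actions):
    """Pick the action whose last ALP application lies furthest in the past."""
    # Actions not represented in [M](t) are assumed never experienced
    represented = {cl.action for cl in match_set}
    missing = [a for a in all_actions if a not in represented]
    if missing:
        return missing[0]
    # Otherwise take the action of the classifier with the lowest t_alp
    return min(match_set, key=lambda cl: cl.talp).action
```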

Knowledge Array (KA)

This method, in contrast, is based on the error-based principle. The classifier quality \(q\), a metric denoting the accuracy of the anticipations of each individual, is used to measure it.

The bias generates the knowledge array KA from the classifiers in the match set \([M]\), in which each entry specifies the averaged anticipation quality for each possible action - see Equation (4.1).

(4.1)\[KA[a] = \frac{\sum_{cl \in [M] \wedge cl.A = a}cl.q \cdot cl.num}{\sum_{cl \in [M] \wedge cl.A = a} cl.num}\]

The action with the lowest value in the knowledge array is selected for execution. Similarly to the action-delay bias, if any actions are not represented by any classifier, they are chosen first.
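Continuing the sketch above, and assuming each classifier also exposes `q` (quality) and `num` (numerosity) attributes, the knowledge array of Equation (4.1) and the corresponding action choice could be computed as follows.

```python
def knowledge_array_bias(match_set, all_actions):
    """Select the action with the lowest averaged anticipation quality."""
    ka = {}
    for a in all_actions:
        cls = [cl for cl in match_set if cl.action == a]
        if not cls:
            # Unrepresented actions are chosen first
            return a
        # Numerosity-weighted average quality, Equation (4.1)
        ka[a] = sum(cl.q * cl.num for cl in cls) / sum(cl.num for cl in cls)
    return min(ka, key=ka.get)
```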

Optimistic Initial Quality (OIQ)

By default, newly created classifiers \(cl\) have their initial quality set to \(cl.q = 0.5\). It can be said that they are biased towards this initial quality. In practice, this should not be a problem, because the value converges to its optimal level over a number of trials. Changing this parameter provides an easy way of supplying the agent with initial confidence in the generated classifiers.

In this method, the agent’s behaviour is parametrized by an extra parameter - the initial quality \(q_0\). Every time a new classifier is generated, its quality is set to \(cl.q = q_0\).

In all the experiments, \(q_0 = 0.8\) was fixed, with the expectation that the agent would build an internal model of knowledge representation faster (especially in the early stages). The default epsilon-greedy method was chosen as the action selection strategy.
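In implementation terms, the change amounts to a single parameter applied whenever a classifier is created, as in the sketch below; the class and constructor signature are assumptions for illustration, not the actual ACS2 code.

```python
class OIQClassifier:
    def __init__(self, condition, action, effect, q0=0.5):
        self.condition = condition
        self.action = action
        self.effect = effect
        # Optimistic Initial Quality: start from q0 instead of the default 0.5
        self.q = q0


# In the reported experiments every newly created classifier starts with q0 = 0.8
cl = OIQClassifier(condition='##1###', action=0, effect='######', q0=0.8)
```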

Interestingly, Hansmeier and Platzner made an effort to compare four strategies optimizing the time for alternating E/E phases using the XCS algorithm [133]. It turned out that, despite automated parameter tuning, none of the strategies is superior to the others. On the other hand, they noticed that certain multi-step environments with sparse reward signals become challenging when the classifier’s accuracy value is set too aggressively. The problem of sparse rewards and long action chains is further discussed in the following chapter.

Experimental evaluation

This section presents the motivation, goals and setup of the performed experiments and their results.

Research questions

The conducted research aims to answer the following questions regarding the biased exploration strategies:

  1. Do the biased exploration methods (AD, KA, OIQ) significantly accelerate the agent’s learning speed?

  2. Can the OIQ method improve performance in terms of ingesting knowledge or reducing the classifier population size?

Goals of the experiments

The differences between the four exploration methods are compared using the ACS2 agent in all the experiments.

Experiment 1 - Single-step problem performance

The setup is similar to the one above, but in this case a single-step 6-bit Real Multiplexer environment is used, introducing a much larger space of possible states. Since calculating precise model knowledge is infeasible, the key performance indicators chosen are the average obtained reward and the model size.

Experiment 2 - Multiple-step problems performance

Two basic multistep environments (Corridor and Grid) are used to determine the differences in the rate of gaining knowledge, the ability to build an internal pool of classifiers, and the performance of operating in the environments.

Experiment 3 - Balancing the pole

The methods are evaluated on the Cart Pole problem of balancing a pole on a cart. This is a novel problem for LCS due to its specific observation space (where two attributes span to infinity) and a specific reward scheme based on how long the pole is kept upright.

Answers to research questions

Q1: Do the biased exploration methods (AD, KA, OIQ) significantly accelerate the agent’s learning speed?

The conducted research showed that biased exploration methods like AD and KA positively impact the knowledge evolution process. The OIQ performance was highly correlated with that of the basic EG method and did not yield any competitive advantage.

Q2: Can the OIQ method improve the performance in terms of ingesting knowledge or reducing classifier population size?

Initializing the classifier quality with a higher (optimistic) value had a negligible effect on all evaluated metrics. The OIQ can, in theory, perform better when forming a population of reliable classifiers, but for the investigated problems its performance was highly correlated with that of the default EG method.