5.1. Diminishing reward

Previous chapters showed that handling real-valued input, either with interval predicates or through discretization, significantly increases the input space. This magnitude of possible states implies growth of the classifier population size; therefore, the traditional discounted sum of rewards might not be an appropriate distribution scheme. The desired behaviour is one where the rewards received in all decision instances are equally important.

The alternative criterion applied in this situation is called the average reward criterion and was introduced by Puterman [134]. He stated that the decision-maker might prefer it when decisions are made frequently (so that the discount rate is very close to 1), or when the performance criterion cannot easily be described in other terms. Possible areas of application include situations where system performance is assessed based on the throughput rate (such as making frequent decisions when controlling the flow of communication networks).

The averaged reward criterion was first implemented in XCS by Tharakunnel and Goldberg [135]. They called their modification AXCS and showed that it performed similarly to the standard XCS in the Woods2 environment. Later, Zang et al. [136] formally introduced the R-learning [137] [138] technique to XCS and called it XCSAR. They compared it with XCSG (where the prediction parameters are modified by applying the idea of gradient descent) and AXCS (maximizing the average of successive rewards) in large multistep problems (Woods1, Maze6, and Woods14).

This chapter replaces the ACS2 credit assignment component, which optimizes performance in the infinite horizon (discounted reward), with an averaged version. The introduced system is AACS2 (Averaged Anticipatory Classifier System), implemented in two slightly different variants - AACS2-v1 and AACS2-v2. The performance is validated using two scalable, discretized environments requiring multiple steps to reach the reward.

Reinforcement Learning and Reward Criterion

Reinforcement Learning (RL) is a formal framework in which the agent can influence the environment by executing specific actions and receive corresponding feedback (reward) afterwards. Usually, it takes multiple steps to reach the goal, which makes the process much more complicated. In the general form, RL consists of:

  • A discrete set of environment states \(S\),

  • A discrete set of available actions \(A\),

  • A reward mapping \(R\) between a particular state \(s \in S\) and action \(a \in A\); the environmental payoff \(r \in R\) describes the expected reward obtained after executing an action in a given state.

In each trial, the agent perceives the environmental state \(s\). Next, it evaluates all possible actions from \(A\) and executes action \(a\) in the environment. The environment returns a reward \(r\) and next state \(s'\) as intermediate feedback.

The agent’s task is to represent the knowledge using the policy \(\pi\), mapping states to actions, thereby optimizing a long-run measure of reinforcement. There are two popular optimality criteria used in Markov Decision Problems (MDP) - a discounted reward and an average reward [139] [140].

Discounted Reward Criterion

In discounted RL, the future rewards are geometrically discounted according to a discount factor \(\gamma\), where \(0 \leq \gamma < 1\). The performance is usually optimized in the infinite horizon [55]:

\[\lim_{N \to \infty} E^{\pi} \left(\sum_{t=0}^{N-1}\gamma^{t} r_{t}(s)\right)\]

\(E^{\pi}\) denotes the expected value under policy \(\pi\), \(N\) is the number of time steps, and \(r_t(s)\) is the reward received at time \(t\) starting from state \(s\) under that policy.
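
As a brief illustration of how discounting weights a delayed reward, the following minimal Python sketch computes the truncated discounted return for an arbitrary reward sequence (both the sequence and the \(\gamma\) value are illustrative assumptions):

```python
# Minimal sketch: truncated discounted return for an illustrative reward sequence.
def discounted_return(rewards, gamma=0.95):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 100]           # reward obtained only in the last step
print(discounted_return(rewards))  # 100 * 0.95**3 = 85.7375
```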

Undiscounted (Averaged) Reward Criterion

The averaged reward criterion [137], i.e., undiscounted RL, is the one under which the agent selects actions maximizing its long-run average reward per step \(\rho(s)\):

\[\rho^{\pi}(s) = \lim_{N \to \infty} \frac{E^{\pi} \left(\sum_{t=0}^{N-1} r_{t}(s) \right)}{N}\]

A policy that maximizes the average reward over all states is called a gain optimal policy. Usually, the average reward \(\rho(s)\) can be denoted simply as \(\rho\), being state-independent [141], formulated as \(\rho^{\pi}(x) = \rho^{\pi}(y) = \rho^{\pi}, \forall x,y \in S\), when the Markov chain resulting from policy \(\pi\) is ergodic (aperiodic and positive recurrent) [142].
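
For comparison, a minimal sketch of the empirical average reward per step over a sampled trajectory (the reward sequence is again an illustrative assumption); here every step contributes equally, regardless of when the reward arrives:

```python
# Minimal sketch: empirical average reward per step along a sampled trajectory.
def average_reward(rewards):
    return sum(rewards) / len(rewards)

rewards = [0, 0, 0, 100]        # reward obtained only in the last step
print(average_reward(rewards))  # 25.0 - each step contributes equally
```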

To solve an average reward MDP problem, a stationary policy \(\pi\) maximizing the average reward \(\rho\) must be determined. To do so, the average adjusted sum of rewards earned following a policy \(\pi\) is defined as:

\[V^{\pi}(s) = \lim_{N \to \infty} E^{\pi} \left( \sum_{t=0}^{N-1} (r_t - \rho^{\pi}) \right)\]

The \(V^{\pi}(s)\) can also be called a bias or relative value. Therefore, the optimal relative value for a state–action pair \((s, a)\) can be written as:

(5.1)\[V(s, a) = r^{a}(s, s') - \rho + \max_b V(s', b), \quad \forall s \in S \text{ and } \forall a \in A\]

where \(r^{a} (s, s')\) denotes the immediate reward of action \(a\) in state \(s\) when the next state is \(s'\), \(\rho\) is the average reward, and \(\max_b V(s', b)\) is the maximum relative value in state \(s'\) among all possible actions \(b\). Equation (5.1) is also known as the Bellman equation for an average reward MDP [142].
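
Equation (5.1) can be solved iteratively, for instance by relative value iteration. The sketch below assumes a small deterministic environment exposed through a hypothetical `step(s, a) -> (next_state, reward)` function; it is meant only to show the role of the \(\rho\) term, not to mirror the ACS2 implementation:

```python
# Relative value iteration sketch for a deterministic average-reward MDP (Equation 5.1).
# `states`, `actions` and `step(s, a) -> (next_state, reward)` are assumed placeholders.
def relative_value_iteration(states, actions, step, sweeps=100):
    V = {s: 0.0 for s in states}
    ref = states[0]                       # reference state anchoring the relative values
    rho = 0.0
    for _ in range(sweeps):
        V_new = {
            s: max(r + V[s_next] for s_next, r in (step(s, a) for a in actions))
            for s in states
        }
        rho = V_new[ref]                  # current estimate of the average reward (gain)
        V = {s: v - rho for s, v in V_new.items()}   # keep values relative (bias)
    return V, rho
```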

Integrating Reward Criteria in ACS2

Alongside the ACS’s latent-learning capabilities, the RL is realized using two classifier metrics: the reward \(cl.r\) and the immediate reward \(cl.ir\). The latter stores the immediate reward predicted to be received after acting in a particular situation and is used mainly for model exploitation, where the reinforcement might be propagated internally. The reward parameter \(cl.r\) stores the reward predicted to be obtained in the long run.

For the first version of ACS, Stolzmann proposed a bucket-brigade algorithm to update the classifier’s reward \(r_c\) [84] [143]. Let \(c_t\) be the active classifier at time \(t\) and \(c_{t+1}\) the active classifier at time \(t+1\):

\[\begin{split}r_{c_{t}}(t+1) = \begin{cases} (1-b_r) \cdot r_{c_{t}}(t) + b_r \cdot r(t+1), & \mbox{if } r(t+1) \neq 0\\ (1-b_r) \cdot r_{c_{t}}(t) + b_r \cdot r_{c_{t+1}}(t), & \mbox{if } r(t+1) = 0 \end{cases}\end{split}\]

where \(b_r \in [0,1]\) is the bid-ratio. The idea is that if there is no environmental reward at time \(t+1\), then the currently active classifier \(c_{t+1}\) gives a payment of \(b_r \cdot r_{c_{t+1}}(t)\) to the previous active classifier \(c_t\). If there is an environmental reward \(r(t+1)\), then \(b_r \cdot r(t+1)\) is given to the previous active classifier \(c_t\).
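
A sketch of this bucket-brigade payment, assuming the classifiers are plain objects whose `r` attribute holds the reward estimate (the attribute and argument names, as well as the default bid-ratio, are illustrative):

```python
# Sketch of Stolzmann's bucket-brigade reward update for the previously active classifier.
def bucket_brigade_update(prev_cl, curr_cl, env_reward, bid_ratio=0.05):
    if env_reward != 0:
        # environmental reward is passed back to the previously active classifier
        prev_cl.r = (1 - bid_ratio) * prev_cl.r + bid_ratio * env_reward
    else:
        # otherwise the current classifier pays a fraction of its own reward estimate
        prev_cl.r = (1 - bid_ratio) * prev_cl.r + bid_ratio * curr_cl.r
```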

Later, Butz adopted the Q-learning idea in ACS2 alongside other modifications [26]. For the agent to learn the optimal behavioural policy, both the reward \(cl.r\) and the immediate reward \(cl.ir\) are continuously updated. To approximate the maximal Q-value, the quality of a classifier is also taken into account, assuming that the reward converges together with the anticipation’s accuracy. The following updates are applied to each classifier \(cl\) in the action set \([A]\) during every trial:

(5.2)\[\begin{split}\begin{array}{lcl} cl.r & = & cl.r + \beta \left(\phi(t) + \gamma \max \limits_{cl' \in [M](t+1) \land cl'.E \neq \{\#\}^L} (cl'.q \cdot cl'.r) - cl.r \right)\\ cl.ir & = & cl.ir + \beta \left( \phi(t) - cl.ir \right) \end{array}\end{split}\]

The parameter \(\beta \in [0,1]\) denotes the learning rate and \(\gamma \in [0, 1)\) is the discount factor. With a higher \(\beta\) value, the algorithm gives less weight to previously encountered cases. On the other hand, \(\gamma\) determines to what extent the reward prediction depends on future rewards.

Thus, in the original ACS2, the discounted reward estimation at a specific time \(t\), which is part of Equation (5.2), is described as \(Q(t)\):

\[Q(t) \gets \phi(t) + \gamma \max \limits_{cl' \in [M](t+1) \land cl'.E \neq \{\#\}^L} (cl'.q \cdot cl'.r)\]
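
Combining \(Q(t)\) with Equation (5.2), a sketch of the discounted update applied to the whole action set is shown below; the classifier attributes (`r`, `ir`, `q`, `effect_is_general`) and both set arguments are assumed names mirroring the notation above, not the actual ACS2 code:

```python
# Sketch of the ACS2 discounted reward update (Equation 5.2).
# `effect_is_general` stands for the condition cl.E = {#}^L, i.e. a non-anticipating classifier.
def acs2_rl_update(action_set, next_match_set, reward, beta=0.05, gamma=0.95):
    best_fitness = max(
        (cl.q * cl.r for cl in next_match_set if not cl.effect_is_general),
        default=0.0,
    )
    q_t = reward + gamma * best_fitness      # Q(t)
    for cl in action_set:
        cl.r += beta * (q_t - cl.r)          # discounted long-run reward
        cl.ir += beta * (reward - cl.ir)     # immediate reward
```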

The modified ACS2 implementation replaces the discounted reward with the averaged version, using the formula \(R(t)\) defined below in Equation (5.3):

(5.3)\[R(t) = \phi(t) - \rho + \max \limits_{cl' \in [M](t+1) \land cl'.E \neq \{\#\}^L} (cl'.q \cdot cl'.r)\]

The definition above requires an estimate of the average reward \(\rho\). Equation (5.1) showed that the maximization of the average reward is achieved by maximizing the relative value. The following sections will propose two variants of using the average reward criterion for internal reward distribution. The altered version is named AACS2, which stands for Averaged ACS2.

As the next operation in both cases, the reward parameter of all classifiers in the current action set \([A]\) is updated using the following formula:

\[cl.r \gets cl.r + \beta (R - cl.r)\]

where \(\beta\) is the learning rate and \(R\) was defined in Equation (5.3).
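
Put together, a sketch of the averaged credit assignment, assuming the same hypothetical classifier structure as in the earlier sketch and a previously estimated \(\rho\):

```python
# Sketch of the AACS2 reward distribution (Equation 5.3 followed by the cl.r update).
def aacs2_rl_update(action_set, next_match_set, reward, rho, beta=0.05):
    best_fitness = max(
        (cl.q * cl.r for cl in next_match_set if not cl.effect_is_general),
        default=0.0,
    )
    r_t = reward - rho + best_fitness        # R(t), Equation (5.3)
    for cl in action_set:
        cl.r += beta * (r_t - cl.r)          # cl.r <- cl.r + beta * (R - cl.r)
```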

AACS2-v1

The first variant of the AACS2 represents the \(\rho\) parameter as the ratio of the total reward received along the path to the rewarding state to the average number of steps needed to reach it. It is initialized as \(\rho = 0\), and its update is executed as the first operation in RL using the Widrow-Hoff delta rule - Equation (5.4). The update is restricted to be executed only when the agent chooses the action greedily during the explore phase:

(5.4)\[\rho \gets \rho + \zeta [\phi - \rho]\]

The \(\zeta\) parameter denotes the learning rate for the average reward and is typically set at a very low value. This ensures a nearly constant value of average reward for the update of the reward, which is necessary for the convergence of average reward RL algorithms [144].
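
A minimal sketch of the AACS2-v1 estimate update (Equation 5.4), guarded by the greedy-action condition described above; the flag name and default \(\zeta\) value are assumptions:

```python
# Sketch of the AACS2-v1 average reward estimate (Equation 5.4, Widrow-Hoff delta rule).
def update_rho_v1(rho, reward, action_was_greedy, zeta=0.0001):
    if action_was_greedy:                 # updated only for greedy actions during exploration
        rho += zeta * (reward - rho)
    return rho
```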

AACS2-v2

The second version is based on the XCSAR proposition by Zang [136]. The only difference from AACS2-v1 is that the estimate also depends on the maximum classifier fitness calculated from the current and the subsequent match sets:

(5.5)\[\rho \gets \rho + \zeta [\phi + \max \limits_{cl \in [M](t) \land cl.E \neq \{\#\}^L} (cl.q \cdot cl.r) -\max \limits_{cl \in [M](t+1) \land cl.E \neq \{\#\}^L} (cl.q \cdot cl.r) -\rho]\]
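
A corresponding sketch of the AACS2-v2 estimate, following Equation (5.5) literally; the match-set arguments and classifier attributes are the same assumed names as in the earlier sketches:

```python
# Sketch of the AACS2-v2 average reward estimate (Equation 5.5).
def update_rho_v2(rho, reward, match_set, next_match_set, zeta=0.0001):
    def best_fitness(match):
        return max((cl.q * cl.r for cl in match if not cl.effect_is_general), default=0.0)
    return rho + zeta * (
        reward + best_fitness(match_set) - best_fitness(next_match_set) - rho
    )
```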

Experimental evaluation

This section presents the motivation, goals and setup of the performed experiments and their results.

Research questions

The conducted research aims to answer the following questions:

  1. Can the undiscounted reward criterion be used in discretized multistep environments?

  2. Does the undiscounted reward criterion result in better environment exploitation?

Goals of the experiments

In all the experiments, different environments are evaluated by two versions of the AACS2 algorithm. Benchmark algorithms such as pure ACS2, Q-Learning and R-Learning are also used to highlight the differences. The main performance metrics are collected, such as the number of steps to reach the reward state and the estimated average reward. Additionally, each available state-action pair from the environment is analyzed for the potential calculated reward.

Experiment 1 - Straight Corridor

The Corridor environment provides a simple, real-valued and scalable benchmark perfectly suited for the problem of long action chains. The agent must repeat the same action a certain number of times until it reaches the final reward state.

Experiment 2 - Deceptive Corridor

The Finite State World environment is an extension of the Corridor. In each state, the actions alternate - one brings the agent closer to the reward, while the other leads it astray. The agent has to distinguish every other state and distribute the rewards properly.

Answers to research questions

Can the undiscounted reward criterion be used in discretized multistep environments?

The conducted experiments have shown that the undiscounted reward criterion can be successfully applied to problems requiring long action chains, such as the Corridor or FSW. The new system AACS2 differs only in the way the classifier reward \(cl.r\) parameter is calculated. The clear difference from the discounted criterion is visible in the payoff landscapes generated from the testing environments. The AACS2 can produce a distinct payoff landscape with uniformly spaced payoff levels, similar to the one generated by the R-learning algorithm. On closer inspection, all algorithms generate step-like payoff-landscape plots, but the particular state-action pairs are more distinguishable when the average reward criterion is used. The explanation of why the agent moves toward the goal at all can be found in Equation (5.3) - it can find the next best action by using the best classifiers’ fitness from the following match set.

In addition, the rate at which the estimated average reward \(\rho\) converges differs between AACS2-v1 and AACS2-v2. Figures 5.1 and 5.3 demonstrate that AACS2-v2 settles to the final value faster but also exhibits greater fluctuations. That is caused by the fact that the maximum fitness of both match sets is considered when updating the value. Zang also observed this and proposed that the learning rate \(\zeta\) in Equation (5.5) could decay over time [136]:

\[\zeta \gets \zeta - \frac{\zeta^{max} - \zeta^{min}}{NumOfTrials}\]

where \(\zeta^{max}\) is the initial value of \(\zeta\), and \(\zeta^{min}\) is the minimum learning rate required. The update should take place at the beginning of each exploration trial.
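
A one-line sketch of this decay schedule, clamped at the minimum learning rate (the clamping is an assumption; the equation itself does not state it):

```python
# Sketch of the linear zeta decay applied at the beginning of each exploration trial.
def decay_zeta(zeta, zeta_max, zeta_min, num_trials):
    return max(zeta_min, zeta - (zeta_max - zeta_min) / num_trials)
```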

Does the undiscounted reward criterion result in better environment exploitation?

The number of steps during exploitation trials, depicted in Figures 5.1 and 5.3, indicates that for the investigated problems the agent is able to pick better actions when the undiscounted criterion is used.

In addition, the fact that \(\rho\) converged to a slightly suboptimal value might be caused by the adopted exploration strategy, which was \(\epsilon\)-greedy. Because the estimated average reward is updated only when a greedy action is executed, the number of greedy actions performed during an exploration trial is uncertain. Moreover, the probability of the agent observing the rewarding state might be too low to enable the estimated average reward to reach the optimal value. This was observed during the experimentation - the \(\rho\) value was correlated with the \(\epsilon\) parameter.