Reinforcement Learning Empowers Retailers
Part II. RL Introduction
Reinforcement Learning (RL) is a distinct class of machine learning algorithms. As a rule, information about the environment is initially missing: there are no labeled patterns to learn from before the algorithm starts running.
Attempting to escape a labyrinth, or maze, is a go-to simplified example of an RL application. Initially, the RL algorithm knows nothing about the maze. However, by exploring various paths, it learns how to find the shortest escape route.
What is the difference between RL and a random trial-and-error search?

Classic RL (i.e., without deep networks) processes alternatives more consistently and efficiently than a random trial-and-error search. A fundamental principle of RL is environment exploration followed by knowledge exploitation. In other words, nothing prevents us from mixing model application and testing, provided that a reasonable balance is maintained.

However, for some tasks it is unfeasible to process all possible scenarios. In this case, advanced RL algorithms allow us to generalize the collected knowledge and apply it to previously unseen scenarios. Again, the same explore-and-exploit concept is in action.

What is the optimal algorithm of interaction with an environment?

The RL mantra is: an instant win does not always guarantee sustainable success. For instance, capturing an opponent’s piece in chess might lead to greater losses later.

However, when we choose an action, we make assumptions about what might happen on the next move. On that next step, we can again assume what would happen next, and so on. All the accumulated knowledge is taken into account when we opt for the next action; thus, a behavior strategy emerges. This principle is best seen in games, where teaching agents has made notable progress. Below are examples and references:

  • Go: AlphaGo algorithms defeated the world's best professional Go players.
  • Chatbot: a bot that negotiates with another agent to strike a deal.

To establish a terminological glossary, we will first discuss three examples that demonstrate key conceptual features of RL:
The Multi-armed bandit
The multi-armed bandit is a classic problem that showcases RL strategies for finding the optimal way to interact with the environment.

We start with the following setup: a slot machine with N arms. You can pull only one arm at a time and receive a reward.

Objective: Define an action (i.e., choose an arm) to bring maximum reward.

Solution: Pull every arm multiple times, then choose the arm with the highest average reward as the optimal choice. Always making the currently best choice is called a greedy strategy. This strategy works only in a stationary environment. In a non-stationary environment (for example, if somebody regularly changes the bandit’s settings), the greedy strategy will not produce an optimal result.

There are other strategies besides the greedy one, such as:
  • ϵ-greedy strategy: in (1−ϵ)% of cases, we choose the currently optimal move, and in ϵ% of cases, we explore a random option.
  • Upper confidence bound (UCB) strategy: the algorithm makes a choice using a weighting factor that depends on how many times it has tested each move. The less an action has been tested, the more likely the algorithm is to choose it.
  • Softmax: the algorithm is more likely to choose an action the more reward it expects the action to yield.
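The strategies above can be sketched with a minimal ϵ-greedy simulation. The bandit here (Gaussian rewards with fixed means, i.e., a stationary environment) and all names are illustrative assumptions, not part of the original problem statement:

```python
import random

def pull(arm, true_means):
    """Simulate a stationary bandit: reward is the arm's true mean plus noise."""
    return random.gauss(true_means[arm], 1.0)

def epsilon_greedy(true_means, steps=5000, epsilon=0.1):
    n_arms = len(true_means)
    counts = [0] * n_arms          # how many times each arm was pulled
    estimates = [0.0] * n_arms     # running average reward per arm
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = pull(arm, true_means)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    return estimates

random.seed(0)
est = epsilon_greedy([1.0, 2.0, 3.0])
print(est)  # the estimate for the third arm ends up close to its true mean of 3.0
```

Raising ϵ increases exploration, which helps in non-stationary environments at the cost of short-term reward.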

The multi-armed bandit problem is a straightforward example where we have no prior knowledge of the system we observe and learn to interact with it from scratch. In solving this problem, the algorithm uses the trial-and-error method, much as we do in real life: the more experience we gain, the more successful our actions become.

Lessons learned from this example:

  • The trial-and-error method is a valid solution choice.
  • Various strategies can be implemented to enhance random exploration.
  • One should distinguish between stationary and non-stationary environments.
Balancing a cart-pole
Time to make the task a bit more complicated by looking at the cart-pole problem.

Environment: a cart with a pole attached; the cart can move right and left.
Objective: Learn to keep the pole vertical for as long as possible.
The difference from the previous problem: now we must consider additional parameters, namely the angle (a) and velocity (v) of the pole, and make decisions based on this data.

The problem is more complicated since there are many (a;v) combinations, making it impossible to try each combination multiple times.

Any (a;v) combination is called a state. The set of states can be either finite or continuous. As a rule, it is easier to implement algorithms that work with a finite number of states. Thus, a state is a set of parameters describing the system. RL theory makes an important assumption (the Markov property): this set of parameters must exhaustively define the state of the system. It means that whatever happened to the system during previous steps is irrelevant; the current observation is the only thing that matters.
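One common way to get a finite number of states from continuous parameters such as (a;v) is to discretize them into bins. A minimal sketch, assuming illustrative ranges and a 9×9 grid (the bin counts and ranges are assumptions, not values from the problem):

```python
import math

def discretize(angle, velocity,
               angle_bins=9, velocity_bins=9,
               angle_range=(-math.pi / 6, math.pi / 6),
               velocity_range=(-2.0, 2.0)):
    """Map a continuous (angle, velocity) pair to a single finite state index.

    Values outside the given range are clipped into the edge bins.
    """
    def to_bin(value, low, high, n_bins):
        if value <= low:
            return 0
        if value >= high:
            return n_bins - 1
        return int((value - low) / (high - low) * n_bins)

    a_bin = to_bin(angle, *angle_range, angle_bins)
    v_bin = to_bin(velocity, *velocity_range, velocity_bins)
    # Single index in [0, angle_bins * velocity_bins)
    return a_bin * velocity_bins + v_bin

print(discretize(0.0, 0.0))  # the middle bin of the 9x9 grid
```

Finer grids approximate the continuous state better but enlarge the table the algorithm has to fill in.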

Lessons learned from this example:

  • While choosing an optimal action, it is required to take into account the state of the system. The number of states influences the complexity of an algorithm.
  • Parameters defining the state of the system must provide complete information on the system at the current moment.
The Queen's Gambit vs the robot
Of course, any article about RL is incomplete without talking about chess.

The number of possible positions of the pieces on a chessboard is a 52-digit figure. In this problem, we need to choose actions that will lead to a victory many moves later rather than ones that bring maximum value now.

Lessons learned:

  • The Decision-making process must take into account long-term effects rather than an immediate benefit.
Based on the examples above, we will create a must-know RL glossary:
Agent is a subject that interacts with the environment, takes actions, and receives and remembers feedback.
  • For example, an engine that moves the cart-pole or a multi-armed bandit are agents.
Environment is a space where an agent exists and receives feedback.
Feedback that the agent receives from the environment is usually uncertain.
  • For example, the cart-pole makes a move and receives feedback on whether the pole has fallen or not. Therefore, the cart and the pole form the environment.
Action is any move available to the agent; the set of possible actions is finite.
  • For example, moving the cart right or left is an action.
Reward is the immediate result/feedback of the action taken, and it is always a number.
  • For example, a win on a multi-armed bandit is a reward.
As a rule, an agent's goal is to maximize the cumulative reward. Ultimately, the goal is to maximize this cumulative reward over a sequence of steps rather than on the next move alone.
  • For example, our goal is not to balance the pole only for a few seconds but to hold it upright as long as possible.
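Formally, the cumulative reward is usually written as a discounted return. This is the standard textbook formulation; the discount factor γ is an assumption of that formulation, not a quantity introduced above:

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1},
\qquad 0 \le \gamma \le 1
```

The closer γ is to 1, the more the agent values long-term rewards over immediate ones.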
Strategy (policy) is a mapping from states into actions.
  • For example, the probability of choosing action A in state S.
Formal Problem Setup

  1. At every step, the environment may be in the state s of all the states S (s∈S).
  2. According to some strategy π, the agent chooses an action 'a' from a set of all the possible actions A (a∈A).
  3. The environment informs the agent of the reward r and the new state s∗ (s∗∈S).
  4. The agent makes adjustments to the strategy π.
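The four steps above can be sketched as a generic interaction loop. The corridor environment and all function names below are illustrative assumptions, not a specific library's API:

```python
import random

def run_episode(start_states, choose_action, step, update, max_steps=100):
    """Generic RL interaction loop (steps 1-4 above)."""
    s = random.choice(start_states)       # 1. the environment is in some state s
    for _ in range(max_steps):
        a = choose_action(s)              # 2. the agent chooses action a per strategy pi
        r, s_next, done = step(s, a)      # 3. the environment reports reward r, new state
        update(s, a, r, s_next)           # 4. the agent adjusts its strategy
        s = s_next
        if done:
            break
    return s

# Toy corridor environment: move right (+1) or left (-1); state 4 is the goal.
def step(s, a):
    s_next = max(0, min(4, s + a))
    return (1 if s_next == 4 else 0), s_next, s_next == 4

random.seed(1)
final_state = run_episode([0, 1, 2, 3],
                          choose_action=lambda s: 1,              # always move right
                          step=step,
                          update=lambda s, a, r, s_next: None)    # no learning yet
print(final_state)  # the always-right strategy reaches the goal state 4
```

Everything interesting in RL lives inside `choose_action` and `update`; the loop itself stays the same.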

The only question left is: what is this mysterious strategy π? In other words, how does the agent make a decision at each step?

Spoiler alert: since we suggest a solution based on Q-learning, we will intentionally focus only on table methods.

Table RL Algorithms

Table methods are fundamental in RL and are used to solve problems with a finite number of states and actions. A characteristic feature of these methods is the use of state-action tables: the states are presented in the table rows, the actions in the columns, and the cells contain the output of the value function.
Q(si;aj) is the value of action aj in state si. Simply put, it is the expected reward we will get if we choose action aj while in state si. At the first step, the values Q(si;aj) are initialized, for example, with zeros.

The initial state-action table for the labyrinth problem might look as follows:
In this case, the state is the location of the agent (a cell of the labyrinth). Upon taking any action, the agent changes its state and receives a reward. In the labyrinth problem, the reward can be:
  • 1, if the escape is found,
  • 0, if not.
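Such an initial table is trivial to build in code. A minimal sketch, assuming an illustrative 6-cell labyrinth with four moves per cell (both sizes are assumptions for the example):

```python
# Initial state-action table for a labyrinth: one row per cell,
# one column per move, every Q(s, a) initialized to zero.
ACTIONS = ["up", "down", "left", "right"]
N_CELLS = 6  # assumed labyrinth size, purely for illustration

q_table = {cell: {action: 0.0 for action in ACTIONS} for cell in range(N_CELLS)}

print(q_table[0])  # {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}
```

As the agent explores, the zeros are gradually replaced by learned estimates of each move's value.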

When the agent receives the actual feedback from the environment, the value Q(si;aj) is adjusted. There are various update algorithms, for example, the Monte Carlo method, SARSA, and Q-learning.

At first glance, Q-learning and SARSA formulas are pretty much alike:
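In standard notation (with a learning rate α and discount factor γ, both assumed here as in the usual textbook presentation), the two update rules can be written as:

```latex
\text{Q-learning:}\quad Q(s_i, a_j) \leftarrow Q(s_i, a_j)
  + \alpha \bigl[ r_i + \gamma \max_{a} Q(s_k, a) - Q(s_i, a_j) \bigr]

\text{SARSA:}\quad Q(s_i, a_j) \leftarrow Q(s_i, a_j)
  + \alpha \bigl[ r_i + \gamma \, Q(s_k, a') - Q(s_i, a_j) \bigr]
```

Here a′ is the action the agent actually selects in the new state sk according to its strategy.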
Both methods use the expected value of the reward for the action at the next step, which is quite simple to calculate. Assume the agent is in state si and performs action aj. The environment then informs the agent that it received reward ri and moved to a new state sk as a result of this action. Using the state-action table, we find the row for state sk and read the value corresponding to each action in that row.

The difference is that Q-learning always takes Q(sk;a) as the maximum value over the actions in the new state, while the SARSA method assumes that the agent models the choice of the next action in state sk according to its ε-greedy or UCB strategy. When using the greedy strategy, the two methods coincide.
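The contrast can be sketched with two update functions over a dict-based state-action table; the values of α and γ and the table layout are illustrative assumptions:

```python
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (illustrative values)

def q_learning_update(q, s, a, r, s_next):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + GAMMA * max(q[s_next].values())
    q[s][a] += ALPHA * (target - q[s][a])

def sarsa_update(q, s, a, r, s_next, a_next):
    """On-policy update: bootstrap from the action actually chosen next."""
    target = r + GAMMA * q[s_next][a_next]
    q[s][a] += ALPHA * (target - q[s][a])
```

Passing the arg-max action of the next state as `a_next` makes the SARSA update identical to the Q-learning one, which is exactly the greedy-strategy case noted above.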

The disadvantage of such algorithms is that the state-action table must be kept and stored, and some problems have a very large number of states and actions. This makes classic table methods impractical. In such cases, it is appropriate to approximate Q(si;aj) with neural networks.

Dynamic programming is an alternative to table methods. We will not review those algorithms here; instead, we recommend reading Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto.