Q-learning example with Liar’s Dice in R

In my last post I coded Liar’s Dice in R and some brainless bots to play against. I build on that post by using Q-learning to train an agent to play Liar’s Dice well.

Spoiler alert: The brainless bots aren’t actually that brainless! More on that later.

Note – I’ll share enough code to run the simulations however the full code can be found on Github. Check out my previous post for the rules to Liar’s Dice.

What is Q-learning?

Firstly, some background. Q-learning is a reinforcement learning algorithm which trains an agent to make the right decisions given the environment it is in and what tasks it needs to complete. The task may be navigating a maze, playing a game, driving a car, flying a drone or learning which offers to make to increase customer retention. In the case of Liar’s Dice, when to call or raise the bid. The agent learns through

  • Exploring the environment
  • Choosing an action
  • Receiving a penalty or reward and
  • Storing that information to do better the next time.

This cycle is repeated until eventually it learns the best decisions to make given it’s situation.

The environment is formulated as a Markov Decision Process defined by a set of states s. By taking an action a the agent will transition to a new state s^\prime with probability P(s^\prime|s, a). After transitioning to the new state the agent receives a reward R(s^\prime|s, a) which will either tell it this was a good move, or this was a bad move. It may also tell the agent this was neither a good or bad move until it reaches a win/lose state.

Finding the optimal policy of an MDP can be done through value iteration and policy iteration. The optimal policy and state value functions are given by

    \[\pi(s) = \text{argmax}_a \left[ \sum_{s^\prime}{P(s^\prime|s, a) \left( R(s^\prime|s, a) + \gamma V(s^\prime) \right)} \right]\]

and

    \[V(s) = \sum_{s^\prime}{P(s^\prime|s,\pi(s))\left( R(s^\prime|s,\pi(s)) + \gamma V(s^\prime) \right)}\]

where \alpha and \gamma are learning rate and discount parameters. The above equations rely on knowing (or often, making crude approximations to) the transition probabilities. The benefit of Q-learning is the transition probabilities are not required, instead they are derived through simulation. The Q function is given by

    \[Q_{i+1}(s, a) = (1-\alpha)Q_i(s,a) + \alpha \left( R(s^\prime|s,a) + \gamma \text{max}_a Q_i(s^\prime,a) \right)\]

The cells in the Q matrix represent the ‘quality’ of taking action a given  state s. After each action the Q matrix is updated. After many iterations, the agent would have explored many states and determined which states and action pairs led to the best outcomes. Now it has the information needed to make the optimal choice by taking the action which leads to the maximum overall reward indicated by the largest Q value.

Liar’s Dice Markov Decision Process

State space

The key is to formulate the states the agent can be in at any point in the game and the reward for transitioning from one state to another. MDP’s can become very large very quickly if every possible state is accounted for so it’s important to identify the key information and the redundancies.

The key pieces of information needed to make a decision in Liar’s Dice are

  • The total number of dice on the table
  • The number of dice in the players possession
  • The players roll of the dice and
  • The bid

Consider the player has 6 dice, this gives a possible 6^6 = 46656 possible hands and this hasn’t yet factored in the bid or the total number of dice on the table. You can see how the number of states blows out.

To make a good decision on whether or not to raise or call the player only needs to know how many dice of the current bid value the player has in their hand and the chance the remainder are in the unseen opponents dice. Essentially, the dice value isn’t required in the formulation of the game state.

The states are given by 3 values.

  • The total number of dice on the table
  • The number of dice in the players possession
  • The probability bucket e.g. 10%, 20%, etc

The last point is the combination of the information given by the players dice and the bid. The probability there is at least the bid quantity on the table is calculated and reduced to a bucket.

    \[\text{probability bucket} = \text{floor} \left( \text{number of buckets} \times P(X \geq x) \right)\]

where

    \[P(X \geq x) = \sum_{k=x}^{n}{{{n}\choose{x}} \left(\frac{1}{6}\right)^{x}\left(\frac{5}{6}\right)^{n-x}}\]

and x is the unknown quantity needed and n is the number of unobserved dice on the table. This reduces the state space down to something more manageable. For this example we’ll use a maximum of 20 buckets i.e. (5%, 10%, …, 100%). Overkill for small numbers of dice, but it doesn’t hurt.

The function below generates the complete game states given the number of dice and players.

The state space reduces to only 2772 states for a game with 4 players with 6 dice each where previously it would have been several orders of magnitude larger. There are still redundant states in this formulation (mostly because I’m lazy) but it’s been reduced enough to be viable and won’t significantly slow down training.

Actions

To simplify the problem, the agent only needs to decide on whether to call or raise. A more complicated problem would be to allow the agent to choose what the new bid should be (this is for a later post, for now we’ll keep it simple).

The agent will explore the states randomly and will eventually learn when it’s a good time to call and a good time to raise. For example if the bid is three 5’s and the agent has three 5’s in hand, the obvious action is to raise. The agent won’t know this at first but will soon work it out.

If the agent raises, it will first randomly select whether to bluff or play the numbers. By bluffing the agent randomly selects a dice value and increases the bid by 1. If the agent plays the numbers it selects the value it has the most of and raises the quantity by 1.

When the agent has been trained it makes the optimal decision by selecting the maximum Q value given the current state \text{max}_a Q(s, a).

Rewards

The agent needs to know how good taking action a was. The reward matrix is defined by rewarding

  • 10 points for winning the game
  • -10 points for losing the game
  • 1 point for winning a round
  • -1 point for losing a round

The reward values are arbitrary but work well in this case. We want to emphasize that losing a die is bad but losing the game is worse. While any state other than the terminal states i.e. when the number dice the player has is 0 (lose) or the same as the total number of dice on the table (win) no state is particularly good/bad but the transition from one to the other s \to s^\prime is what triggers the reward or penalty. Therefore, each reward matrix will be an n \times n square matrix where n is the total number of states.

There is a reward matrix for each action and stored in a list. For Liar’s Dice this isn’t necessary since the rewards and penalties are same whether the player raises or calls and transitions to another state. However, the framework is there for actions to have different rewards.

The process

The process follows the steps below.

Assume player 1 raises on their turn. In a 4 person game, player 1 may actually transition to multiple other states before control returns. For each other raise or call by the other players, the game state will change for player 1. For the context of the model the action player 1 took is considered to be the last action for all subsequent transitions.

Here is an example of the state transition table for 4 players each with 3 dice.

This table is then passed to the update.Q() function.

The learning rate and discount values have been initialised to 0.1 and 0.9 respectively.

The simulation

Liar’s Dice is now simulated 5000 times and Q value iteration is conducted with the above functions (see github for the full code). The first agent will be the only one that uses the Q matrix to decide it’s actions and therefore the only agent that is trained. It will bluff with a probability of 50% to add in some more realism to the agents decision. The other 3 will be random agents, bluffing 100% of the time and randomly deciding to call or raise at each decision point. It is expected that after training agent 1 will outperform the other 3 random agents.

plot of chunk simulation results

After only 5000 iterations (which isn’t a lot given there are approximately 2000 valid states) the results show that agent 1 performs very well against the random agents. If each agent was equivalent the win percentage would be on average 25% where as here the results show agent 1 won 65% of the games.

The graph shows the percentage of wins for agent 1 continuing to increase as it is trained. Further training will improve the Q matrix and hence the performance of the agent. Given the stochastic nature of the game we wouldn’t expect a win percentage of 100%, so this is a great result.

Here is another 100 games with the trained agent.

Solid effort.

Bot got brains

What’s really happening here? The last variable in our state space formulation is the probability bucket which is in essence an approximation of the actual probability that the bid quantity exists on the table. At first the agent doesn’t know what to do with that information and will decide to call or raise randomly. Over time it learns how best to use that information and either calls or raises. In my previous post we simply used the probability directly by randomly choosing to raise with probability P(X \geq x) and call with probability 1-P(X \geq x). So in truth the original bots weren’t too bad.

The Q-learning algorithm has an advantage by being able to solve for more complex scenarios. The original agents only had the probability to base a decision, where as under an MDP framework the agent is free to also make decisions based on how many dice they have in hand and how many on the table. It has the ability to vary the risk depending on how close it is to winning or losing.

There are ways we can expand the state space to allow for potentially more complex decisions such as factoring in the remaining dice of the person to the left or right and allowing the agent to learn each players bluffing likelihoods. The state space could also be reduced to when a player has 0 dice and 1 or more, since whether the player has 2 or 6 dice may not matter too much. It’s worth an experiment to test this and see if it performs just as well.

Takeaways

In short a few things to take away are,

  • Q-learning improved the agents win percentage from 25% to 65%
  • When an environment can be appropriately quantified into states, MDP’s work really well
  • The state space can be reduced to speed up computation
  • The state space can be expanded to allow for more complex decisions and the actual value to raise the bid
  • Q-learning allows you to train an agent without knowledge of the transition probabilities, instead they are derived through simulation

Appendix: Code bits

Follow me on social media: