In my last post I coded Liar’s Dice in R along with some brainless bots to play against. Here I build on that post by using Q-learning to train an agent to play Liar’s Dice well.

Spoiler alert: The brainless bots aren’t actually that brainless! More on that later.

Note – I’ll share enough code to run the simulations, but the full code can be found on GitHub. Check out my previous post for the rules of Liar’s Dice.

## What is Q-learning?

Firstly, some background. Q-learning is a reinforcement learning algorithm which trains an agent to make good decisions given the environment it is in and the task it needs to complete. The task may be navigating a maze, playing a game, driving a car, flying a drone or learning which offers to make to increase customer retention. In the case of Liar’s Dice, it’s deciding when to call or raise the bid. The agent learns through

• Exploring the environment
• Choosing an action
• Receiving a penalty or reward, and
• Storing that information to do better the next time.

This cycle is repeated until eventually the agent learns the best decisions to make given its situation.

The environment is formulated as a Markov Decision Process (MDP) defined by a set of states $S$ and actions $A$. By taking an action $a$ the agent transitions from state $s$ to a new state $s'$ with probability $P(s' \mid s, a)$. After transitioning to the new state the agent receives a reward $R(s, s')$ which will either tell it this was a good move, or this was a bad move. The reward may also be neutral, telling the agent nothing either way until it reaches a win/lose state.

Finding the optimal policy of an MDP can be done through value iteration and policy iteration. The optimal state value function and policy are given by

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, s') + \gamma V^*(s')\right]$$

$$\pi^*(s) = \operatorname*{arg\,max}_a \sum_{s'} P(s' \mid s, a)\left[R(s, s') + \gamma V^*(s')\right]$$

where $\gamma$ is the discount parameter. These equations rely on knowing (or often, making crude approximations to) the transition probabilities. The benefit of Q-learning is the transition probabilities are not required; instead they are derived through simulation. The Q function is updated by

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[R(s, s') + \gamma \max_{a'} Q(s', a')\right]$$

where $\alpha$ is the learning rate. The cells in the Q matrix represent the ‘quality’ of taking action $a$ given state $s$. After each action the Q matrix is updated. After many iterations the agent will have explored many states and determined which state-action pairs led to the best outcomes. Now it has the information needed to make the optimal choice: take the action with the largest Q value, which leads to the maximum overall reward.
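To make the update rule concrete, here is a minimal toy sketch in R. The 3-state, 2-action setup and the `q.update` helper are purely illustrative and not part of the Liar’s Dice code developed below.

```r
# toy Q-learning update: 3 states, 2 actions ("raise" and "call")
# alpha is the learning rate, discount is gamma
q.update <- function(Q, s, a, r, s.next, alpha = 0.1, discount = 0.9){
  Q[s, a] <- (1 - alpha)*Q[s, a] + alpha*(r + discount*max(Q[s.next, ]))
  Q
}

Q <- matrix(0, nrow = 3, ncol = 2, dimnames = list(NULL, c("raise", "call")))

# one observed transition: in state 1, "raise" led to state 2 with reward 1
Q <- q.update(Q, s = 1, a = "raise", r = 1, s.next = 2)
Q[1, "raise"]
# 0.1 = (1 - 0.1)*0 + 0.1*(1 + 0.9*0)
```

With every Q value initialised to zero, the new estimate is just the learning rate times the observed reward; the $\gamma \max_{a'} Q(s', a')$ term only starts contributing once later states have non-zero values.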

## Liar’s Dice Markov Decision Process

### State space

The key is to formulate the states the agent can be in at any point in the game, and the reward for transitioning from one state to another. MDPs can become very large very quickly if every possible state is accounted for, so it’s important to identify the key information and the redundancies.

The key pieces of information needed to make a decision in Liar’s Dice are

• The total number of dice on the table
• The number of dice in the player’s possession
• The player’s dice roll, and
• The bid

Consider a player with 6 dice: this alone gives $6^6 = 46656$ possible rolls ($\binom{11}{6} = 462$ distinct hands once sorted), and this hasn’t yet factored in the bid or the total number of dice on the table. You can see how the number of states blows out.
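These counts are quick to verify in R:

```r
# ordered rolls of 6 six-sided dice
6^6
# 46656

# distinct sorted hands, i.e. multisets of size 6 drawn from 6 face values
choose(6 + 6 - 1, 6)
# 462
```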

To make a good decision on whether to raise or call, the player only needs to know how many dice of the current bid value they hold in their hand and the chance that the remainder are among the opponents’ unseen dice. Essentially, the actual dice value isn’t required in the formulation of the game state.

The states are given by 3 values.

• The total number of dice on the table
• The number of dice in the player’s possession
• The probability bucket e.g. 10%, 20%, etc

The last point combines the information given by the player’s dice and the bid. The probability that there is at least the bid quantity on the table is calculated and reduced to a bucket:

$$P(X \geq k) = \sum_{j=k}^{n} \binom{n}{j} \left(\frac{1}{6}\right)^j \left(\frac{5}{6}\right)^{n-j}$$

where $k$ is the unknown quantity needed (the bid quantity minus the matching dice in the player’s hand) and $n$ is the number of unobserved dice on the table. This reduces the state space down to something more manageable. For this example we’ll use a maximum of 20 buckets i.e. (5%, 10%, …, 100%). Overkill for small numbers of dice, but it doesn’t hurt.
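As a sanity check on the bucketing, the calculation can be done standalone. `prob.bucket` below is an illustrative helper, a simplified cousin of the `calc.prob` function used in the full code.

```r
# probability that at least `needed` of `unseen` unobserved dice show the bid
# value, reduced to a bucket of size bin.size
prob.bucket <- function(needed, unseen, bin.size = 20){
  if(needed <= 0) return(bin.size)  # already holding the bid quantity
  k <- seq(min(needed, unseen), unseen, 1)
  floor(sum(choose(unseen, k)*(1/6)^k*(5/6)^(unseen - k))*bin.size)
}

# e.g. a bid of 5 threes while holding 2 threes, with 18 unseen dice:
# we need at least 3 more threes among the 18
prob.bucket(3, 18)
# 11  (P is roughly 0.597, and floor(0.597*20) = 11)
```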

The function below generates the complete game states given the number of dice and players.

# generate game states
generate.game.states <- function(nplayers, ndice, bin.size = 20){

  # create the basic data frame with total dice and player dice count
  total.dice <- nplayers*ndice
  states <- t(combn(rep(0:ndice, nplayers), nplayers)) %>%
    as.data.frame() %>%
    distinct() %>%
    set_names(paste0("p", 1:nplayers))
  states$total <- rowSums(states)
  states <- states %>%
    select(total, p1) %>%
    arrange(total, p1) %>%
    filter(total > 0) %>%
    distinct()

  # add in the probability bucket state
  game.states <- data.frame()
  for(k in 1:nrow(states)){
    total <- states$total[k]
    p1 <- states$p1[k]
    game.states <- rbind(game.states, data.frame(total = total, p1 = p1, prob_cat = 0:bin.size))
  }
  return(game.states)
}

gs <- generate.game.states(4, 6)

### Reward matrix

The rewards are simple: a small penalty (-1) when the agent loses a die, a small reward (+1) when another player loses a die, a large penalty (-10) for being knocked out of the game and a large reward (+10) for winning. The function below generates the reward matrix from the game states. The same matrix is used for both actions.

# generate the reward matrix from the game states
generate.reward.matrix <- function(game.states){

  reward <- matrix(0, nrow = nrow(game.states), ncol = nrow(game.states))
  for(i in 1:nrow(game.states)){
    total <- game.states$total[i]
    p1 <- game.states$p1[i]

    # small penalty for losing a die
    reward[i, p1 - game.states$p1 == 1 & total - game.states$total == 1 & game.states$p1 != game.states$total] <- -1

    # small reward for others losing a die
    reward[i, p1 == game.states$p1 & total - game.states$total == 1 & game.states$p1 != game.states$total & p1 > 0] <- 1

    # fail states - when the player's dice count drops to 0
    if(p1 == 1){
      reward[i, which(total - game.states$total == 1 & game.states$p1 == 0)] <- -10
    }

    # win states - when the player's dice count equals the total dice count
    if(total - p1 == 1){
      reward[i, game.states$total == p1 & game.states$p1 == p1] <- 10
    }
  }
  return(reward)
}

rw <- generate.reward.matrix(gs)
reward <- list(raise = rw, call = rw)

## The process

The game follows the explore-act-reward-update cycle described above. Assume player 1 raises on their turn. In a 4 person game, player 1 may actually transition through multiple states before control returns to them: each raise or call by the other players changes the game state from player 1’s perspective. For the context of the model, the action player 1 took is considered to be the last action for all subsequent transitions.

Here is an example of the state transition table for 4 players, each with 3 dice.

play.liars.dice(auto = TRUE, players = 4, num.dice = 3, verbose = 0, agents = agents, Q.mat = Q.mat, print.trans = TRUE)$winner


##    y.ctrl y.state y.action
## 1       3     808    raise
## 2       3     787    raise
## 3       3     778    raise
## 4       4     778    raise
## 5       4     736    raise
## 6       4     736    raise
## 7       1     736     call
## 8       1     673     call
## 9       1     678    raise
## 10      2     674    raise
## 11      3     673    raise
## 12      4     673    raise
## 13      4     589    raise
## 14      4     592    raise
## 15      1     589     call
## 16      1     505     call
## 17      1     507     call
## 18      1     423     call
## 19      2     422    raise
## 20      3     421    raise
## 21      1     421    raise
## 22      2     421    raise
## 23      2     316    raise
## 24      2     324    raise
## 25      2     240    raise
## 26      3     232    raise
## 27      3     148    raise
## 28      2     151    raise
## 29      2      88    raise

##  1


This table is then passed to the update.Q() function.

# update Q matrix
update.Q <- function(play, Q.mat, reward, alpha = 0.1, discount = 0.9){
  for(k in 2:nrow(play)){
    curr.state <- play$y.state[k]
    prev.state <- play$y.state[k-1]
    prev.action <- play$y.action[k-1]

    # Q-learning update rule
    Q.mat[prev.state, prev.action] <- (1 - alpha)*Q.mat[prev.state, prev.action] +
      alpha*(reward[[prev.action]][prev.state, curr.state] + discount*max(Q.mat[curr.state, ]))
  }
  return(Q.mat)
}

The agent is then trained over 5000 games, carrying the Q matrix from one game into the next.

# train the agent over 5000 games
niter <- 5000
winners <- numeric(niter)
Q.mat <- NULL
pb <- txtProgressBar(min = 0, max = niter, style = 3)
for(k in 1:niter){
  out <- play.liars.dice(auto = TRUE, players = 4, num.dice = 6, verbose = 0, agents = agents, Q.mat = Q.mat, train = TRUE)
  winners[k] <- out$winner
  Q.mat <- out$Q.mat
  setTxtProgressBar(pb, k)
}

# table of winners
table(winners)

## winners
##    1    2    3    4
## 3288  434  496  782

# agent win percentage
x <- 1000:5000
y <- (cumsum(winners == 1)/(1:5000))[x]
qplot(x = x, y = y, geom = "line", xlab = "iterations", ylab = "Proportion of agent 1 wins")

After only 5000 iterations (which isn’t a lot given there are approximately 2000 valid states) the results show that agent 1 performs very well against the random agents. If each agent were equivalent the win percentage would be 25% on average, whereas here agent 1 won 65% of the games. The graph shows the percentage of wins for agent 1 continuing to increase as it is trained. Further training will improve the Q matrix and hence the performance of the agent. Given the stochastic nature of the game we wouldn’t expect a win percentage of 100%, so this is a great result.

Here is another 100 games with the trained agent.

library(pbapply)
sim <- pbsapply(1:100, function(x) play.liars.dice(auto = TRUE, players = 4, num.dice = 6, verbose = 0, agents = agents, Q.mat = Q.mat)[["winner"]])
table(sim)

## sim
##  1  2  3  4
## 65  6 15 14

sum(sim == 1)/100

## 0.65

Solid effort.

## Bot got brains

What’s really happening here? The last variable in our state space formulation is the probability bucket, which is in essence an approximation of the actual probability that the bid quantity exists on the table. At first the agent doesn’t know what to do with that information and will decide to call or raise randomly. Over time it learns how best to use that information and either calls or raises. In my previous post we simply used the probability directly by randomly choosing to raise with probability $p$ and call with probability $1-p$. So in truth the original bots weren’t too bad. The Q-learning algorithm has an advantage by being able to solve for more complex scenarios.
The original agents only had the probability to base a decision on, whereas under an MDP framework the agent is free to also make decisions based on how many dice it has in hand and how many are on the table. It has the ability to vary the risk depending on how close it is to winning or losing.

There are ways we can expand the state space to allow for potentially more complex decisions, such as factoring in the remaining dice of the person to the left or right and allowing the agent to learn each player’s bluffing likelihood. The state space could also be reduced to whether a player has 0 dice or 1 or more, since whether the player has 2 or 6 dice may not matter too much. It’s worth an experiment to test this and see if it performs just as well.

## Takeaways

In short, a few things to take away are:

• Q-learning improved the agent’s win percentage from 25% to 65%
• When an environment can be appropriately quantified into states, MDPs work really well
• The state space can be reduced to speed up computation
• The state space can be expanded to allow for more complex decisions, such as the actual value to raise the bid to
• Q-learning allows you to train an agent without knowledge of the transition probabilities; instead they are derived through simulation

## Appendix: Code bits

# set dice value (human input with validation)
set.dice.value <- function(note, max.val, prev.val = 0){
  good.val <- FALSE
  while(!good.val){
    val <- readline(note) %>% as.numeric()
    if(!is.na(val) & val > 0 & val <= max.val & val > prev.val){
      good.val <- TRUE
    }else{
      cat("please select a value between 1 and", max.val, "\n")
    }
  }
  return(val)
}

# roll table
roll.table.fn <- function(rolls){
  rt <- table(unlist(rolls))
  roll.table <- rep(0, 6)
  names(roll.table) <- 1:6
  roll.table[names(rt)] <- rt
  return(roll.table)
}

# probability that there is at least the bid quantity on the table, converted to a bucket
# x = c(total dice, player's dice count, bid quantity, matching dice in the player's hand)
calc.prob <- function(x, bin.size = 20){
  if(x[3] <= x[4]){
    return(1*bin.size)
  }else{
    n <- x[1] - x[2]
    k <- seq(min(x[3] - x[4], n), n, 1)
    return(floor(sum(choose(n, k)*(1/6)^k*(5/6)^(n-k))*bin.size))
  }
}

# agent function chooses the best action e.g. raise or call
# it needs to take in as input dice, total dice, dice value and dice quantity
# as output action (raise or call), if raised also new dice value and quantity
# this is wrapped by a building function to make it easier to change certain
# parameters and decisions an agent might make and be able to play them off against
# each other to see which is the better strategy
build.agent <- function(bluff.prob, method = "random"){
  return(
    function(pars, Q.mat){

      # bluff or truth
      bluff <- sample(c(TRUE, FALSE), 1, prob = bluff.prob)

      # probability table
      roll.table <- roll.table.fn(pars$dice)
      ptable <- roll.table/sum(roll.table)

      # if the initial bid do this
      if(is.null(pars$dice.value)){
        new.dice.value <- which.max(ptable[1:6]) %>% names() %>% as.numeric()
        new.dice.quantity <- max(roll.table) + 1
        return(list(dice.value = new.dice.value, dice.quantity = new.dice.quantity))
      }

      # are you gonna call?
      # use the Q matrix to make a decision
      if(method == "Q.decide"){

        if(abs(max(Q.mat[pars$p1.state,]) - min(Q.mat[pars$p1.state,])) < 1e-6){
          call <- sample(c(TRUE, FALSE), 1)
        }else{

          # exploration vs exploitation
          if(runif(1) < 0.1){
            call <- sample(c(TRUE, FALSE), 1)
          }else{
            call <- names(which.max(Q.mat[pars$p1.state,])) == "call"
          }
        }

      # the random agent
      }else if(method == "random"){

        prob <- 0.5
        call <- sample(c(TRUE, FALSE), 1, prob = c(1-prob, prob))

      # playing the actual numbers
      }else if(method == "true.prob"){

        prob <- 1-sum(dbinom(0:max(c(pars$dice.quantity - roll.table[pars$dice.value], 0)), pars$total.dice-length(pars$dice), prob = 1/6))
        call <- sample(c(TRUE, FALSE), 1, prob = c(1-prob, prob))

      }

      # if called return the values
      if(call){
        return(list(action = "call", dice.value = pars$dice.value, dice.quantity = pars$dice.quantity))
      }else{

        # raise
        # if choosing to bluff randomly select a number and increase by 1
        if(bluff){

          new.dice.value <- sample(1:6, 1)
          new.dice.quantity <- pars$dice.quantity + 1

        }else{

          # if not bluffing choose the maximum number in hand and increase by one
          # this should be made to be more flexible however in general raising by
          # 1 occurs 99% of the time
          new.dice.value <- which.max(ptable) %>% names() %>% as.numeric()
          new.dice.quantity <- pars$dice.quantity + 1

        }

        # return the new values and action
        return(list(action = "raise", dice.value = new.dice.value, dice.quantity = new.dice.quantity))
      }
    }
  )
}

#------ play a round of liars dice
liars.dice.round <- function(players, control, player.dice.count, agents, game.states, reward, Q.mat, a = 1, verbose = 1){

# set array for recording results
y.ctrl = c(); y.state = c(); y.action = c()

# roll the dice for each player
if(verbose > 0) cat("\n\n")
rolls <- lapply(1:players, function(x) sort(sample(1:6, player.dice.count[[x]], replace = TRUE)))
if(verbose > 1) lapply(rolls, function(x) cat("dice: ", x, "\n"))
total.dice <- sum(unlist(player.dice.count))

# set penalty
penalty <- sapply(1:players, function(x) 0, simplify = FALSE)

# print dice blocks
if(verbose > 0) Dice(rolls)

# set up roll table
roll.table <- roll.table.fn(rolls)

# initial bid
if(verbose > 0) cat("place first bid\nPlayer", control, "has control\n")
if(control == a){

dice.value <- set.dice.value("dice value: ", 6)
dice.quantity <- set.dice.value("quantity: ", sum(roll.table))

}else{

# agent plays
# no bid yet, so use the certain probability bucket as a placeholder state
p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == 20)
pars <- list(dice = rolls[[control]], total.dice = total.dice, dice.value = NULL, dice.quantity = 0, p1.state = p1.state)
agent.action <- agents[[control]](pars = pars, Q.mat = Q.mat)
dice.value <- agent.action$dice.value
dice.quantity <- agent.action$dice.quantity
}

# calculate the probability bucket and determine the game state
# action set to raise because you can't call without an initial bid
# this could be a 3rd action (initial bid) but it's not really necessary
player.dice.qty <- table(rolls[[1]])[as.character(dice.value)]
player.dice.qty <- ifelse(is.na(player.dice.qty), 0, player.dice.qty) %>% unname()
prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == prob.cat)
p1.action <- "raise"

# storing states for Q iteration
y.ctrl = c(); y.state = c(); y.action = c()

# moving control to the next player
# storing the previous player since if the next player calls the previous player could lose a die
prev <- control
control <- control %% players + 1
if(verbose > 0) cat("dice value ", dice.value, "; dice quantity ", dice.quantity, "\n")

# loop through each player and continue until there is a winner and loser
called <- FALSE
while(!called){

# check if the player with control is still in the game - if not skip
if(player.dice.count[[control]] > 0){
if(control == a){

action <- readline("raise or call (r/c)? ")

}else{

# the agent makes a decision
pars <- list(dice = rolls[[control]], total.dice = total.dice, dice.value = dice.value, dice.quantity = dice.quantity, p1.state = p1.state)
agent.action <- agents[[control]](pars = pars, Q.mat = Q.mat)
action <- agent.action$action
}

# storing states for reward iteration
if(control == 1 & !is.null(agent.action$action)){
player.dice.qty <- table(rolls[[1]])[as.character(dice.value)]
player.dice.qty <- ifelse(is.na(player.dice.qty), 0, player.dice.qty) %>% unname()

p1.action <- agent.action$action
prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == prob.cat)
}

# called
if(action %in% c("call", "c")){

if(verbose > 0) {
cat("player", control, "called\nRoll table\n")
print(roll.table)
}

# dice are revealed

# check whether the bid quantity exceeds the count of that dice value on the table
# if it does, the previous player (the bidder) loses a die, otherwise the caller does
if(dice.quantity > roll.table[dice.value]){

penalty[[prev]] <- penalty[[prev]] - 1
if(verbose > 0) cat("player", prev, "lost a die\n")

}else{

penalty[[control]] <- penalty[[control]] - 1
if(verbose > 0) cat("player", control, "lost a die\n")

}

# for Q iteration
y.ctrl <- c(y.ctrl, control); y.state <- c(y.state, p1.state); y.action <- c(y.action, p1.action)

# if called use the penalty array to change states
prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
p1.state <- which(game.states$total == total.dice - 1 & game.states$p1 == player.dice.count[[1]] + penalty[[1]] & game.states$prob_cat == prob.cat)

# break the loop
called <- TRUE

}else{

if(verbose > 0) cat("player", control, "raised\n")
if(control == a){

# player sets next dice value
dice.value <- set.dice.value("dice value: ", 6)
dice.quantity <- set.dice.value("quantity: ", sum(roll.table))

}else{

dice.value <- agent.action$dice.value
dice.quantity <- agent.action$dice.quantity
}

# p1 state after the raise
prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == prob.cat)
if(verbose > 0) cat("dice value", dice.value, "; dice quantity", dice.quantity, "\n")
}

# store info for Q update
y.ctrl <- c(y.ctrl, control); y.state <- c(y.state, p1.state); y.action <- c(y.action, p1.action)

# set the control player to now be the previous player
prev <- control
}

# next player has control
control <- control %% players + 1
}

# play results and return
play <- data.frame(y.ctrl, y.state, y.action)
return(list(penalty = penalty, play = play))
}

# play a full game of liars dice
play.liars.dice <- function(players = 4, num.dice = 6, auto = FALSE, verbose = 1, agents, Q.mat = NULL, train = FALSE, print.trans = FALSE){

# begin!
if(verbose > 0) liars.dice.title()

# setting the number of dice each player has
ndice <- sapply(rep(num.dice, players), function(x) x, simplify = FALSE)
players.left <- sum(unlist(ndice) > 0)

# setting game states matrix
game.states <- generate.game.states(players, num.dice)

# set up reward matrix
reward <- generate.reward.matrix(game.states)
reward <- list(raise = reward, call = reward)

# set Q matrix if null
if(is.null(Q.mat)) Q.mat <- matrix(0, nrow = nrow(reward$raise), ncol = length(reward), dimnames = list(c(), names(reward)))

# who has control first
ctrl <- sample(1:players, 1)
play.df <- data.frame()

# while there are at least 2 players left in the game
while(players.left > 1){

# play a round
results <- liars.dice.round(
players = players,
control = ctrl,
player.dice.count = ndice,
game.states = game.states,
reward = reward,
Q.mat = Q.mat,
agents = agents,
a = as.numeric(!auto),
verbose = verbose
)

# update how many dice the players are left with given the
# outcomes of the round
for(k in seq_along(ndice)){
ndice[[k]] <- ndice[[k]] + results$penalty[[k]]
if(ndice[[k]] == 0 & results$penalty[[k]] == -1){
if(verbose > 0) cat("player", k, "is out of the game\n")
}

# update who has control so they can start the bidding
if(results$penalty[[k]] == -1){
ctrl <- k
while(ndice[[ctrl]] == 0){
ctrl <- ctrl %% players + 1
}
}
}

# checking how many are left and if anyone won the game
players.left <- sum(unlist(ndice) > 0)
if(players.left == 1){
if(verbose > 0) cat("player", which(unlist(ndice) > 0), "won the game\n")
}

# appending play
play.df <- rbind(play.df, results$play)
}

if(print.trans) print(play.df)

# update Q
# rather than training after each action, training at the
# end of each game in bulk
# just easier this way
if(train) Q.mat <- update.Q(play.df, Q.mat, reward)

# return the winner and Q matrix
return(list(winner = which(unlist(ndice) > 0), Q.mat = Q.mat))
}

