Q-learning example with Liar’s Dice in R

In my last post I coded Liar’s Dice in R and some brainless bots to play against. I build on that post by using Q-learning to train an agent to play Liar’s Dice well.

Spoiler alert: The brainless bots aren’t actually that brainless! More on that later.

Note – I’ll share enough code to run the simulations however the full code can be found on Github. Check out my previous post for the rules to Liar’s Dice.

What is Q-learning?

Firstly, some background. Q-learning is a reinforcement learning algorithm which trains an agent to make the right decisions given the environment it is in and what tasks it needs to complete. The task may be navigating a maze, playing a game, driving a car, flying a drone or learning which offers to make to increase customer retention. In the case of Liar’s Dice, when to call or raise the bid. The agent learns through

  • Exploring the environment
  • Choosing an action
  • Receiving a penalty or reward and
  • Storing that information to do better the next time.

This cycle is repeated until eventually it learns the best decisions to make given it’s situation.

The environment is formulated as a Markov Decision Process defined by a set of states s. By taking an action a the agent will transition to a new state s^\prime with probability P(s^\prime|s, a). After transitioning to the new state the agent receives a reward R(s^\prime|s, a) which will either tell it this was a good move, or this was a bad move. It may also tell the agent this was neither a good or bad move until it reaches a win/lose state.

Finding the optimal policy of an MDP can be done through value iteration and policy iteration. The optimal policy and state value functions are given by

    \[\pi(s) = \text{argmax}_a \left[ \sum_{s^\prime}{P(s^\prime|s, a) \left( R(s^\prime|s, a) + \gamma V(s^\prime) \right)} \right]\]

and

    \[V(s) = \sum_{s^\prime}{P(s^\prime|s,\pi(s))\left( R(s^\prime|s,\pi(s)) + \gamma V(s^\prime) \right)}\]

where \alpha and \gamma are learning rate and discount parameters. The above equations rely on knowing (or often, making crude approximations to) the transition probabilities. The benefit of Q-learning is the transition probabilities are not required, instead they are derived through simulation. The Q function is given by

    \[Q_{i+1}(s, a) = (1-\alpha)Q_i(s,a) + \alpha \left( R(s^\prime|s,a) + \gamma \text{max}_a Q_i(s^\prime,a) \right)\]

The cells in the Q matrix represent the ‘quality’ of taking action a given  state s. After each action the Q matrix is updated. After many iterations, the agent would have explored many states and determined which states and action pairs led to the best outcomes. Now it has the information needed to make the optimal choice by taking the action which leads to the maximum overall reward indicated by the largest Q value.

Liar’s Dice Markov Decision Process

State space

The key is to formulate the states the agent can be in at any point in the game and the reward for transitioning from one state to another. MDP’s can become very large very quickly if every possible state is accounted for so it’s important to identify the key information and the redundancies.

The key pieces of information needed to make a decision in Liar’s Dice are

  • The total number of dice on the table
  • The number of dice in the players possession
  • The players roll of the dice and
  • The bid

Consider the player has 6 dice, this gives a possible 6^6 = 46656 possible hands and this hasn’t yet factored in the bid or the total number of dice on the table. You can see how the number of states blows out.

To make a good decision on whether or not to raise or call the player only needs to know how many dice of the current bid value the player has in their hand and the chance the remainder are in the unseen opponents dice. Essentially, the dice value isn’t required in the formulation of the game state.

The states are given by 3 values.

  • The total number of dice on the table
  • The number of dice in the players possession
  • The probability bucket e.g. 10%, 20%, etc

The last point is the combination of the information given by the players dice and the bid. The probability there is at least the bid quantity on the table is calculated and reduced to a bucket.

    \[\text{probability bucket} = \text{floor} \left( \text{number of buckets} \times P(X \geq x) \right)\]

where

    \[P(X \geq x) = \sum_{k=x}^{n}{{{n}\choose{x}} \left(\frac{1}{6}\right)^{x}\left(\frac{5}{6}\right)^{n-x}}\]

and x is the unknown quantity needed and n is the number of unobserved dice on the table. This reduces the state space down to something more manageable. For this example we’ll use a maximum of 20 buckets i.e. (5%, 10%, …, 100%). Overkill for small numbers of dice, but it doesn’t hurt.

The function below generates the complete game states given the number of dice and players.

# generate games states
generate.game.states <- function(nplayers, ndice, bin.size = 20){

  # create the basic data frame with total dice and player dice count
  total.dice <- nplayers*ndice
  states <- t(combn(rep(0:ndice, nplayers), nplayers)) %>% as.data.frame() %>% distinct()
  colnames(states) <- paste0("p", 1:nplayers)
  states$total <- rowSums(states)
  states <- states %>% dplyr::select(total, p1) %>% dplyr::arrange(total, p1) %>% dplyr::filter(total > 0) %>% distinct

  # add in the probability bucket state
  game.states <- data.frame()
  for(k in 1:nrow(states)){
    total <- states$total[k]
    p1 <- states$p1[k]
    for(j in 0:bin.size){
      game.states <- rbind(game.states, c(total, p1, j))
    }
  }
  colnames(game.states) <- c("total", "p1", "prob_cat")
  return(game.states)
}

gs <- generate.game.states(4, 6)
dim(gs)
## [1] 2772    3

The state space reduces to only 2772 states for a game with 4 players with 6 dice each where previously it would have been several orders of magnitude larger. There are still redundant states in this formulation (mostly because I’m lazy) but it’s been reduced enough to be viable and won’t significantly slow down training.

Actions

To simplify the problem, the agent only needs to decide on whether to call or raise. A more complicated problem would be to allow the agent to choose what the new bid should be (this is for a later post, for now we’ll keep it simple).

The agent will explore the states randomly and will eventually learn when it’s a good time to call and a good time to raise. For example if the bid is three 5’s and the agent has three 5’s in hand, the obvious action is to raise. The agent won’t know this at first but will soon work it out.

If the agent raises, it will first randomly select whether to bluff or play the numbers. By bluffing the agent randomly selects a dice value and increases the bid by 1. If the agent plays the numbers it selects the value it has the most of and raises the quantity by 1.

When the agent has been trained it makes the optimal decision by selecting the maximum Q value given the current state \text{max}_a Q(s, a).

Rewards

The agent needs to know how good taking action a was. The reward matrix is defined by rewarding

  • 10 points for winning the game
  • -10 points for losing the game
  • 1 point for winning a round
  • -1 point for losing a round

The reward values are arbitrary but work well in this case. We want to emphasize that losing a die is bad but losing the game is worse. While any state other than the terminal states i.e. when the number dice the player has is 0 (lose) or the same as the total number of dice on the table (win) no state is particularly good/bad but the transition from one to the other s \to s^\prime is what triggers the reward or penalty. Therefore, each reward matrix will be an n \times n square matrix where n is the total number of states.

There is a reward matrix for each action and stored in a list. For Liar’s Dice this isn’t necessary since the rewards and penalties are same whether the player raises or calls and transitions to another state. However, the framework is there for actions to have different rewards.

# generate reward matrix
generate.reward.matrix <- function(game.states){
  reward <- matrix(0, nrow = nrow(game.states), ncol = nrow(game.states))

  for(i in 1:nrow(reward)){

    # which state are we in
    total <- game.states$total[i]
    p1 <- game.states$p1[i]

    # small penalty for losing a die
    reward[i, p1 - game.states$p1 == 1 & total - game.states$total == 1 & game.states$p1 != game.states$total] <- -1

    # small reward for others losing a die
    reward[i, p1 == game.states$p1 & total - game.states$total == 1 & game.states$p1 != game.states$total & p1 > 0] <- 1

    # fail states - when players dice count is 0
    if(p1 == 1){
      reward[i, which(total - game.states$total == 1 & game.states$p1 == 0)] <- -10
    }

    # win states when the player dice count equals the total dice count
    if(total - p1 == 1){
      reward[i, game.states$total == p1 & game.states$p1 == p1] <- 10
    }
  }
  return(reward)
}

rw <- generate.reward.matrix(gs)
reward <- list(raise = rw, call = rw)

The process

The process follows the steps below.

Assume player 1 raises on their turn. In a 4 person game, player 1 may actually transition to multiple other states before control returns. For each other raise or call by the other players, the game state will change for player 1. For the context of the model the action player 1 took is considered to be the last action for all subsequent transitions.

Here is an example of the state transition table for 4 players each with 3 dice.

play.liars.dice(auto = TRUE, players = 4, num.dice = 3, verbose = 0, agents = agents, Q.mat = Q.mat, print.trans = TRUE)$winner
##    y.ctrl y.state y.action
## 1       3     808    raise
## 2       3     787    raise
## 3       3     778    raise
## 4       4     778    raise
## 5       4     736    raise
## 6       4     736    raise
## 7       1     736     call
## 8       1     673     call
## 9       1     678    raise
## 10      2     674    raise
## 11      3     673    raise
## 12      4     673    raise
## 13      4     589    raise
## 14      4     592    raise
## 15      1     589     call
## 16      1     505     call
## 17      1     507     call
## 18      1     423     call
## 19      2     422    raise
## 20      3     421    raise
## 21      1     421    raise
## 22      2     421    raise
## 23      2     316    raise
## 24      2     324    raise
## 25      2     240    raise
## 26      3     232    raise
## 27      3     148    raise
## 28      2     151    raise
## 29      2      88    raise
## [1] 1

This table is then passed to the update.Q() function.

# update Q matrix
update.Q <- function(play, Q.mat, reward, alpha = 0.1, discount = 0.9){
  for(k in 2:nrow(play)){
    curr.state <- play$y.state[k]
    prev.state <- play$y.state[k-1]
    action <- play$y.action[k]

    # Q update
    Q.mat[prev.state, action] <- (1-alpha)*Q.mat[prev.state, action] + alpha*(reward[[action]][prev.state, curr.state] + discount*max(Q.mat[curr.state,]))
  }
  return(Q.mat)
}

The learning rate and discount values have been initialised to 0.1 and 0.9 respectively.

The simulation

Liar’s Dice is now simulated 5000 times and Q value iteration is conducted with the above functions (see github for the full code). The first agent will be the only one that uses the Q matrix to decide it’s actions and therefore the only agent that is trained. It will bluff with a probability of 50% to add in some more realism to the agents decision. The other 3 will be random agents, bluffing 100% of the time and randomly deciding to call or raise at each decision point. It is expected that after training agent 1 will outperform the other 3 random agents.

# set the agents
agent1 <- build.agent(c(0.5,0.5), method = "Q.decide")
agent2 <- build.agent(c(1,0), method = "random")
agent3 <- build.agent(c(1,0), method = "random")
agent4 <- build.agent(c(1,0), method = "random")
agents <- list(agent1, agent2, agent3, agent4)
# training
its <- 5000

# setting the vector to store the winners
winners <- vector("numeric", length = its)

# loopy loop
pb <- txtProgressBar(min = 0, max = its, style = 3)
for(k in 1:its){
  out <- play.liars.dice(auto = TRUE, players = 4, num.dice = 6, verbose = 0, agents = agents, Q.mat = Q.mat, train = TRUE)
  winners[k] <- out$winner
  Q.mat <- out$Q.mat
  setTxtProgressBar(pb, k)
}
# table of winners
table(winners)
## winners
##    1    2    3    4 
## 3288  434  496  782
# agent win percentage
x <- 1000:5000
y <- (cumsum(winners == 1)/(1:5000))[x]
qplot(x = x, y = y, geom = "line", xlab = "iterations", ylab = "Proportion of agent 1 wins")
plot of chunk simulation results

After only 5000 iterations (which isn’t a lot given there are approximately 2000 valid states) the results show that agent 1 performs very well against the random agents. If each agent was equivalent the win percentage would be on average 25% where as here the results show agent 1 won 65% of the games.

The graph shows the percentage of wins for agent 1 continuing to increase as it is trained. Further training will improve the Q matrix and hence the performance of the agent. Given the stochastic nature of the game we wouldn’t expect a win percentage of 100%, so this is a great result.

Here is another 100 games with the trained agent.

library(pbapply)
sim <- pbsapply(1:100, function(x) play.liars.dice(auto = TRUE, players = 4, num.dice = 6, verbose = 0, agents = agents, Q.mat = Q.mat)[["winner"]])
table(sim)
## sim
##  1  2  3  4 
## 65  6 15 14
sum(sim == 1)/100
## [1] 0.65

Solid effort.

Bot got brains

What’s really happening here? The last variable in our state space formulation is the probability bucket which is in essence an approximation of the actual probability that the bid quantity exists on the table. At first the agent doesn’t know what to do with that information and will decide to call or raise randomly. Over time it learns how best to use that information and either calls or raises. In my previous post we simply used the probability directly by randomly choosing to raise with probability P(X \geq x) and call with probability 1-P(X \geq x). So in truth the original bots weren’t too bad.

The Q-learning algorithm has an advantage by being able to solve for more complex scenarios. The original agents only had the probability to base a decision, where as under an MDP framework the agent is free to also make decisions based on how many dice they have in hand and how many on the table. It has the ability to vary the risk depending on how close it is to winning or losing.

There are ways we can expand the state space to allow for potentially more complex decisions such as factoring in the remaining dice of the person to the left or right and allowing the agent to learn each players bluffing likelihoods. The state space could also be reduced to when a player has 0 dice and 1 or more, since whether the player has 2 or 6 dice may not matter too much. It’s worth an experiment to test this and see if it performs just as well.

Takeaways

In short a few things to take away are,

  • Q-learning improved the agents win percentage from 25% to 65%
  • When an environment can be appropriately quantified into states, MDP’s work really well
  • The state space can be reduced to speed up computation
  • The state space can be expanded to allow for more complex decisions and the actual value to raise the bid
  • Q-learning allows you to train an agent without knowledge of the transition probabilities, instead they are derived through simulation

Appendix: Code bits


# set dice value
set.dice.value <- function(note, max.val, prev.val = 0){
  good.val <- FALSE
  while(!good.val){
    val <- readline(note) %>% as.numeric()
    if(val > 0 & val <= max.val & !is.na(val) & (val > prev.val)){
      good.val <- TRUE
    }else{
      cat("please select a value between 1 and", max.val, "\n")
    }
  }
  return(val)
}


# roll table
roll.table.fn <- function(rolls){
  rt <- table(unlist(rolls))
  roll.table <- rep(0, 6)
  names(roll.table) <- 1:6
  roll.table[names(rt)] <- rt
  return(roll.table)
}


# call probability that there is at least the bid quantity on the table and converts to a bucket
calc.prob <- function(x, bin.size = 20) {
  if(x[3] <= x[4]){
    return(1*bin.size)
  }else{
    n <- x[1]-x[2]
    k <- seq(min(x[3]-x[4], n), n, 1)
    return(floor(sum(choose(n, k)*(1/6)^k*(5/6)^(n-k))*bin.size))
  }
}


# agent function chooses the best action e.g. raise or call
# it needs to take in as input dice, total dice, dice value and dice quantity
# as output action (raise or call), if raised also new dice value and quantity
# dice, total.dice, dice.value, dice.quantity.
# this is wrapped by a building function to make it easier to change certain 
# parameters and decisions an agent might make and be able to play them off against 
# each other to see which is the better strategy
build.agent <- function(bluff.prob, method = "random"){
  
  return(
    
    function(pars, Q.mat){
      
      # bluff or truth
      bluff <- sample(c(TRUE, FALSE), 1, prob = bluff.prob)
      
      # pobability table
      roll.table <- roll.table.fn(pars$dice)
      ptable <- roll.table/sum(roll.table)
      
      # if the initial bid do this
      if(is.null(pars$dice.value)){
        
        new.dice.value <- which.max(ptable[1:6]) %>% names() %>% as.numeric()
        new.dice.quantity <- max(roll.table) + 1
        return(list(dice.value = new.dice.value, dice.quantity = new.dice.quantity))
        
      }
      
      
      # are you gonna call?
      # use the Q matrix to make a decision
      if(method == "Q.decide"){
        if(abs(max(Q.mat[pars$p1.state,]) - min(Q.mat[pars$p1.state,])) < 1e-6)
          call <- sample(c(TRUE, FALSE), 1)
        else{
          # exploration vs exploitation
          if(runif(1) < 0.1){ 
            call <- sample(c(TRUE, FALSE), 1)
          }else{
            call <- names(which.max(Q.mat[pars$p1.state,])) == "call"
          }
        }
        
        
        # the random agent  
      }else if(method == "random"){
        
        prob <- 0.5
        call <- sample(c(TRUE, FALSE), 1, prob = c(1-prob, prob))
        
        # playing the actual numbers  
      }else if(method == "true.prob"){
        
        prob <- 1-sum(dbinom(0:max(c(pars$dice.quantity - roll.table[pars$dice.value], 0)), pars$total.dice-length(pars$dice), prob = 1/6))
        call <- sample(c(TRUE, FALSE), 1, prob = c(1-prob, prob))
        
      }
      
      # if called return the values
      if(call){
        return(list(action = "call", dice.value = pars$dice.value, dice.quantity = pars$dice.quantity))
      }else{
        
        # raise
        # if choosing to bluff randomly select a number and increase by 1
        if(bluff){
          
          new.dice.value <- sample(1:6, 1)
          new.dice.quantity <- pars$dice.quantity + 1
          
        }else{
          
          # if not bluffing choose the maximum number in hand and increase by one
          # this should be made to be more flexible however in general raising by
          # 1 occurs 99% of the time
          new.dice.value <- which.max(ptable) %>% names() %>% as.numeric()
          new.dice.quantity <- pars$dice.quantity + 1
          
        }
        
        # return the new values and action
        return(list(action = "raise", dice.value = new.dice.value, dice.quantity = new.dice.quantity))
      }
    }
  )
}



#------ play a round of liars dice
liars.dice.round <- function(players, control, player.dice.count, agents, game.states, reward, Q.mat, a = 1, verbose = 1){
  
  # set array for recording results
  y.ctrl = c(); y.state = c(); y.action = c()
  
  # roll the dice for each player
  if(verbose > 0) cat("\n\n")
  rolls <- lapply(1:players, function(x) sort(sample(1:6, player.dice.count[[x]], replace = TRUE)))
  if(verbose > 1) lapply(rolls, function(x) cat("dice: ", x, "\n"))
  total.dice <- sum(unlist(player.dice.count))
  
  # set penalty
  penalty <- sapply(1:players, function(x) 0, simplify = FALSE)
  
  # print dice blocks
  if(verbose > 0) Dice(rolls[[1]])
  
  # set up roll table
  roll.table <- roll.table.fn(rolls)
  
  # initial bid
  if(verbose > 0) cat("place first bid\nPlayer", control, "has control\n")
  if(control == a){
    
    dice.value <- set.dice.value("dice value: ", 6)
    dice.quantity <- set.dice.value("quantity; ", sum(roll.table))
    
  }else{
    
    # agent plays
    p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == total.dice)
    pars <- list(dice = rolls[[control]], total.dice = total.dice, dice.value = NULL, dice.quantity = 0, p1.state = p1.state)
    agent.action <- agents[[control]](pars = pars, Q.mat = Q.mat)
    dice.value <- agent.action$dice.value
    dice.quantity <- agent.action$dice.quantity
    
  }
  
  
  # calculate probability cat and determine the game state
  # action set to raise because you can't call without an initial bid
  # this could be a 3rd action (initial bid) but it's not really necessary
  player.dice.qty <- table(rolls[[1]])[as.character(dice.value)]
  player.dice.qty <- ifelse(is.na(player.dice.qty), 0, player.dice.qty) %>% unname
  prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
  p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == prob.cat)
  p1.action <- "raise"
  
  # storing states for Q iteration
  y.ctrl = c(); y.state = c(); y.action = c()
  
  # moving control to the next player
  # storing the previous player since if the next player calls the previous player could lose a die
  prev <- control
  control <- control %% players + 1
  if(verbose > 0) cat("dice value ", dice.value, "; dice quantity ", dice.quantity, "\n")
  
  
  # loop through each player and continue until there is a winner and loser
  called <- FALSE
  while(!called){
    
    # check if the player with control is still in the game - if not skip
    if(player.dice.count[[control]] > 0){
      if(control == a){
        
        action <- readline("raise or call (r/c)? ")
        
      }else{
        
        # the agent makes a decision
        pars <- list(dice = rolls[[control]], total.dice = total.dice, dice.value = dice.value, dice.quantity = dice.quantity, p1.state = p1.state)
        agent.action <- agents[[control]](pars = pars, Q.mat = Q.mat)
        action <- agent.action$action
        
      }
      
      
      # storing states for reward iteration
      if(control == 1 & !is.null(agent.action$action)){
        player.dice.qty <- table(rolls[[1]])[as.character(dice.value)]
        player.dice.qty <- ifelse(is.na(player.dice.qty), 0, player.dice.qty) %>% unname
        
        p1.action <- agent.action$action
        prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
        p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == prob.cat)
      }
      
      
      # called
      if(action %in% c("call", "c")){
        
        if(verbose > 0) {
          cat("player", control, "called\nRoll table\n")
          print(roll.table)
        }
        
        # dice are reavealed
        
        # check if the quantity of dice value is less or more than the total in the pool
        # if more control loses otherwise control-1 win
        if(dice.quantity > roll.table[dice.value]){
          
          penalty[[prev]] <- penalty[[prev]] - 1
          if(verbose > 0) cat("player", prev, "lost a die\n")
          
        }else{
          
          penalty[[control]] <- penalty[[control]] - 1
          if(verbose > 0) cat("player", control, "lost a die\n")
          
        }
        
        # for Q iteration
        y.ctrl <- c(y.ctrl, control); y.state <- c(y.state, p1.state); y.action <- c(y.action, p1.action)
        
        # if called use the penalty array to change states
        prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
        p1.state <- which(game.states$total == total.dice-1 & game.states$p1 == player.dice.count[[1]]+penalty[[1]] & game.states$prob_cat == prob.cat)
        
        # break the loop
        called <- TRUE
        
      }else{
        
        if(verbose > 0) cat("player", control, "raised\n")
        
        if(control == a){
          
          # player sets next dice value
          dice.value <- set.dice.value("dice value: ", 6)
          dice.quantity <- set.dice.value("quantity; ", sum(roll.table))
          
        }else{
          
          dice.value <- agent.action$dice.value
          dice.quantity <- agent.action$dice.quantity
        }
        
        # p1 state after the raise
        prob.cat <- calc.prob(c(total.dice, player.dice.count[[1]], dice.quantity, player.dice.qty))
        p1.state <- which(game.states$total == total.dice & game.states$p1 == player.dice.count[[1]] & game.states$prob_cat == prob.cat)
        if(verbose > 0) cat("dice value", dice.value, "; dice quantity", dice.quantity, "\n")
      }
      
      # store info for Q update
      y.ctrl <- c(y.ctrl, control); y.state <- c(y.state, p1.state); y.action <- c(y.action, p1.action)
      
      # set the control player to now be the previous player
      prev <- control
    }
    
    # next player has control
    control <- control %% players + 1
  }
  
  # play results and return
  play <- data.frame(y.ctrl, y.state, y.action)
  return(list(penalty = penalty, play = play))
}



# play a full game of liars dice
play.liars.dice <- function(players = 4, num.dice = 6, auto = FALSE, verbose = 1, agents, Q.mat = NULL, train = FALSE, print.trans = FALSE){
  
  # begin!
  if(verbose > 0) liars.dice.title()
  
  # setting the number of dice each player has
  ndice <- sapply(rep(num.dice, players), function(x) x, simplify = FALSE)
  players.left <- sum(unlist(ndice) > 0)
  
  # setting game states matrix
  game.states <- generate.game.states(players, num.dice)
  
  # set up reward matrix
  reward <- generate.reward.matrix(game.states)
  reward <- list(raise = reward, call = reward)
  
  # set Q matrix if null
  if(is.null(Q.mat)) Q.mat <- matrix(0, nrow = nrow(reward$raise), ncol = length(reward), dimnames = list(c(), names(reward)))
  
  # while there is at least 2 left in the game
  # who has control
  ctrl <- sample(1:players, 1)
  play.df <- data.frame()
  while(players.left > 1){
    
    # play a round
    results <- liars.dice.round(
      players = players, 
      control = ctrl,
      player.dice.count = ndice, 
      game.states = game.states,
      reward = reward,
      Q.mat = Q.mat,
      agents = agents,
      a = as.numeric(!auto),
      verbose = verbose
    )
    
    # update how many dice the players are left with given the 
    # outcomes of the round
    for(k in seq_along(ndice)){
      ndice[[k]] <- ndice[[k]] + results$penalty[[k]]
      if(ndice[[k]] == 0 & results$penalty[[k]] == -1){
        if(verbose > 0) cat("player", k, "is out of the game\n")
      }
      
      # update who has control so they can start the bidding
      if(results$penalty[[k]] == -1){
        ctrl <- k
        while(ndice[[ctrl]] == 0){
          ctrl <- ctrl %% players + 1
        }
      }
    }
    
    # checking how many are left and if anyone won the game
    players.left <- sum(unlist(ndice) > 0)
    if(players.left == 1){
      if(verbose > 0) cat("player", which(unlist(ndice) > 0), "won the game\n")
    }
    
    # appending play
    play.df <- rbind(play.df, results$play)
  }
  
  if(print.trans) print(play.df)
  
  # update Q
  # rather than training after each action, training at the 
  # end of each game in bulk
  # just easier this way
  if(train) Q.mat <- update.Q(play.df, Q.mat, reward)
  
  # return the winner and Q matrix
  return(list(winner = which(unlist(ndice) > 0), Q.mat = Q.mat))
}


Follow me on social media: