
The post Are the chances of winning Survivor getting better or worse? appeared first on Dan Oehm | Gradient Descending.

Following up on my previous post, where I estimated the chances of winning and making the final Tribal Council for different demographic groups, I’ll look at whether that has changed.

To summarise, I found that players over 40 were less likely to win, there was some evidence that women are less likely to win, and there was no difference between racial/ethnic groups.

24 years is a long time, and there have been many changes, such that the periods of US Survivor are often referred to as Old School (1-20), New School (21-40), and the New Era (41+). In the later seasons, the diversity campaign ensured a balance between racial groups. Historically, there has always been a balance between genders. The age profile has always been weighted towards younger players, with 47% under 30 and 75% under 40.

So, have ‘biases’ in the voting patterns changed across the eras of Survivor? It looks that way, and the game now appears more balanced, but it’s too early to tell.

I’ll only look at the expected number of winners across demographic groups at the start of the game. This is where the previous analysis found the most evidence of voting biases, so it isn’t worth looking at the other stages of the game.

There are several ways to look at this, but here I’ll take the six-season moving average of the residual, i.e., the difference between the observed number and the expected number within those six seasons. This way, we can see if it has increased or decreased over time. I chose six seasons largely because the New Era has run for six seasons.
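To make the calculation concrete, here’s a minimal base R sketch of the moving-average residual, using simulated winners rather than the real data (the cohort probability of 0.4 is made up purely for illustration):

```r
# Simulated example: 46 seasons, 1 if the winner came from the cohort, 0 otherwise
set.seed(1)
winner_in_cohort <- rbinom(46, 1, 0.4) # hypothetical observed winners
expected <- rep(0.4, 46)               # expected winners per season

# Residual: observed minus expected, season by season
residual <- winner_in_cohort - expected

# Six-season moving average of the residual
k <- 6
ma <- sapply(k:length(residual), function(i) mean(residual[(i - k + 1):i]))

round(ma, 2)
```

Plotting `ma` against season number gives the kind of chart described below: above zero means more winners than expected over the trailing six seasons.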

Where the line is above zero, there have been more winners in that cohort than expected for the last six seasons. This means the cohort has a higher chance of winning. Where the line is below zero, there have been fewer winners than expected for the last six seasons and a lower chance of winning.

If the line is going up, there is increasing favorability (or bias, or whatever you want to call it) for that cohort. If it goes down, the favorability of that cohort decreases.

These charts show how this has changed over time for each cohort and how different it was within each era. In the case of the New Era, we can see if the trends are changing and becoming more or less balanced.

The charts, particularly the bottom one, make it clear that older players have a lower chance of winning. However, with Tony’s win in 40 and Gabler’s win in 43, the chances of winning for those over 40 have hit their peak and are now right on zero, i.e. meeting expectations.

Otherwise, the pattern was similar across the Old School and New School Survivor eras; those 18-39 had a better chance of winning than those over 40.

There was a big bump in the New School era for the 30-39-year-olds, but that has balanced out as we enter the New Era.

In general, men have won more times than expected. The previous analysis showed some evidence of a statistically significant difference, but it is not convincing.

Below, you can see how it has gone up and down over the past 46 seasons. The most interesting feature is that the chances for men hit their highest point at season 40 with Tony’s second win and, equivalently, the lowest point for women. Since then, the chances for men to win have plummeted. It’s the biggest drop over a six-season period. The biggest drop before this was from seasons 19-25, where 5/7 winners were women.

As I showed in the previous analysis, there’s no real difference between BIPOC and White players.

From season 7, we saw more BIPOC players win than expected. The number declined from there until season 28 and started to bounce back when Natalie won season 29. From that point, the number of winners has been as expected, maybe trending slightly upwards.

Women of colour have won more seasons than expected in the New Era and are at their highest point since season 7.

If you want more info or the data, follow the links.

- {survivoR} package
- The Sanctuary
- Google sheets data
- More on predicting winners of Survivor



The post Who has the best chance of winning Survivor? appeared first on Dan Oehm | Gradient Descending.

This analysis is to see if particular demographics are more/less likely to win the season or at least make the final tribal than others. It compares the actual number of winners and finalists of demographic cohorts, i.e. age, gender, and race/ethnicity, to the expected number assuming equal probability.

I also conducted a Bayesian analysis to estimate the bias to understand if the difference between the actual and expected values is meaningful. You can read up on the methodology here. This post will cut to the chase and give you the results.

There’s not a lot of difference between age, gender, and racial groups concerning making the final tribal council. However, there is for winning.

- Older players are less likely to win than younger players
- Some evidence women are less likely to win than men
- No difference between racial groups

I find the first and last points interesting since there is usually a lot of focus on race and little on age, but it’s clear from the analysis that age is the bigger factor.

So, who has the best chance of winning Survivor? At the start of the game, white men under 40. At Merge, anyone under 40.

One key consideration here is that this analysis is based on 46 seasons of US Survivor. The first seasons of Survivor are very different from the seasons of the new era. In the new era, there have been 4 female winners, 4 BIPOC winners, and 1 winner over 40. This may mean there is a trend towards a more balanced competition. It’s worth looking into in another post.

At the start of a season of Survivor, 16-20 people line up on the beach with a seemingly equal chance of winning a million bucks. As it turns out, some have a slightly better chance than others.

- Players over 40 have a lower chance of winning than those under 40
- Given equal numbers, the probability of winning for those over 40 drops from 50% to 32%.

- Some evidence women are less likely to win
- Evidence that white men are more likely to win than other demographics.
- Some evidence white women are less likely to win than other demographics
- There is no difference between other demographic groups

- Some evidence women of colour are less likely to make the final tribal council.
- No difference between other demographic groups

The game dynamic changes at the Merge. To make it here, players usually need to have made strong enough social connections to survive a few Tribal Councils (although sometimes they don’t attend any).

The next set of tables shows whether players who make it to this stage of the game have more equal chances of making the final tribal or winning than they did at the start of the game.

- Some evidence that those over 40 are less likely to win than those under 40
- Bias is -16% points, only slightly better than at the start of the game.

- Weak evidence that women have a lower chance of winning.
- Bias -5% points

- No difference between demographic groups with respect to the chances of making the final tribal council

Players who have made it to the final tribal council have played a very good game, although some better than others. Some players are considered goats and dragged to the end. My opinion is, if you’ve stayed under the radar, made strong connections with players, and made it to the end, you’ve played a good game.

Anyway, is there a difference between demographic groups in the chances of winning when they’re sitting next to each other at the end?

It seems there is a difference.

- Those over 40 have only a 25% chance of winning rather than equal chances
- Some evidence that white women have a 9% lower chance
- Some evidence that white men have a 10% higher chance

What’s interesting about this analysis is it highlights that age is the most influential factor for winning Survivor, followed by gender and then race/ethnicity. There’s a lot of focus in the fandom on racial bias, some on gender bias, but not a lot on age biases. From this, it’s undeniably the biggest factor. Older players have a much lower chance of winning.

If you want more info or the data, follow the links.

- {survivoR} package
- The Sanctuary
- Google sheets data
- More on predicting winners of Survivor



The post Can Jury Favorability Predict the Winner of Survivor? appeared first on Dan Oehm | Gradient Descending.

I bang on a fair bit about jury favorability and the number of votes with jury members. This is because the number of votes with someone else is a good indicator of the strength of their alliance, and if one is a finalist and the other is on the jury, there is a good chance they will get their vote.

Of the 220 jurors, 115 voted for the finalist they shared the most votes with; we expect only 84 by chance. The test statistic is χ² = 11.4 with a p-value of 0.0007. This is approximately 37% above random chance.

There is an association between the number of votes shared with the finalist and voting for them. It’s a good predictor but far from deterministic.

Here’s a short analysis.

I’ll use the `castaways` table to find the jury members and finalists, and the `vote_history` table to calculate how many times players have voted with each other.

The number of votes a jury member has shared with a finalist is simply the number of times they have voted for the same person at the same Tribal.

I’ve summarised the data to see if the jury members voted for the finalists they shared the most votes with. A finalist is flagged as sharing the most votes whether, for example, they shared 7 votes and the other two shared 6 each, or they shared 7 votes and the others shared 1 each; in both cases the finalist with 7 votes is flagged. The size of the margin isn’t taken into account, but you can see how it could matter.
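As a toy sketch of this flagging step (hypothetical jurors and finalists, base R), flag the finalist(s) with whom each juror shared the most votes:

```r
# Toy data: one row per (juror, finalist) pair with the shared-vote count
shared <- data.frame(
  jury     = c("A", "A", "A", "B", "B", "B"),
  finalist = c("F1", "F2", "F3", "F1", "F2", "F3"),
  n        = c(7, 6, 6, 7, 7, 1)
)

# Flag rows where the count equals the juror's maximum shared-vote count
shared$max <- shared$n == ave(shared$n, shared$jury, FUN = max)

# Juror A has a unique most-voted-with finalist (F1); juror B has a tie (F1, F2)
shared[shared$max, ]
```

Juror B’s tie is exactly the kind of case filtered out below, where there isn’t a single finalist sharing the most votes.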

Typically jury members are more likely to vote for the finalists they shared the most votes with. It’s most easily seen in the percentage chart. The more votes with the finalist, the more likely they are to vote for them in the final Tribal.

If a jury member has voted with each finalist equally often, they are trivially guaranteed to vote for a finalist they shared the most votes with. So, I’m going to filter the dataset to the cases where there is a single finalist they’ve shared the most votes with. I’ll then test if they are more likely to vote for that finalist.

Of the 388 jury votes cast, 220 are cases where there is only one finalist the juror shared the most votes with.

Of those 220 jurors, 115 voted for the finalist they shared the most votes with, where we expect only 84. The test statistic is χ² = 11.4 with a p-value of 0.0007. This is approximately 37% above random chance, and that percentage increases as the number of shared votes increases.
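The test arithmetic can be reproduced directly; this is just the one-degree-of-freedom chi-squared calculation on the counts quoted above:

```r
n   <- 220 # jurors with a single most-voted-with finalist
obs <- 115 # voted for that finalist
exp <- 84  # expected under random voting

chisq <- (obs - exp)^2 / exp
p_val <- 1 - pchisq(chisq, df = 1)
index <- (obs / n) / (exp / n) # ~1.37, i.e. ~37% above random chance

round(c(chisq = chisq, p_val = p_val, index = index), 4)
```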

I think it’s safe to say there is an association between the number of votes shared with the finalist and voting for them.

During the season I look at the vote stats frequently to see who has positioned themselves well, has the strongest alliance, and who is the favorite to win. Often the data shows things that aren’t particularly obvious in the edit. The chart below is an example of how I view the votes shared between finalists and jurors.

As for predicting the winner, it’s far from deterministic but it is useful to know where votes are likely to land.

Links:

- {survivoR} package
- The Sanctuary
- Google sheets data
- More on predicting winners of Survivor

All code to run the analysis is below.

Functions

```
# functions
library(dplyr)
library(tidyr)  # expand_grid, replace_na
library(purrr)  # map_dfr

all_vs <- function(.v = c("US", "AU", "UK", "SA", "NZ")) {
  survivoR::season_summary |>
    filter(version %in% .v) |>
    pull(version_season)
}

filter_vs <- function(df, .vs) {
  df |>
    filter(version_season %in% .vs)
}

voted_with <- function(
    .vs,
    .n_boots = NULL,
    .ep_range = NULL,
    alive = FALSE,
    .vote_order = 1
) {
  if (is.null(.ep_range)) {
    .ep_range <- survivoR::vote_history |>
      filter_vs(.vs) |>
      pull(episode) |>
      range()
  } else {
    .ep_range <- range(.ep_range)
  }

  if (is.null(.n_boots)) {
    .n_boots <- survivoR::vote_history |>
      filter_vs(.vs) |>
      pull(order) |>
      max()
  }

  .vote_history <- survivoR::vote_history |>
    filter(
      version_season == .vs,
      episode >= .ep_range[1],
      episode <= .ep_range[2],
      !is.na(vote)
    ) |>
    mutate(vote = ifelse(!is.na(split_vote), split_vote, vote))

  # vote order
  if (.vote_order == "max") {
    .vote_history <- .vote_history |>
      group_by(order, castaway) |>
      slice_max(vote_order, with_ties = TRUE) |>
      ungroup()
  } else if (.vote_order == "all") {
    # keep all vote orders
  } else if (is.numeric(.vote_order)) {
    .vote_history <- .vote_history |>
      filter(vote_order == .vote_order)
  }

  .castaways <- survivoR::castaways |>
    filter(version_season == .vs)

  full_cast <- .castaways |>
    filter(version_season == .vs) |>
    distinct(castaway) |>
    pull(castaway)

  df_base <- expand_grid(
    castaway = full_cast,
    voted_with = full_cast
  ) |>
    mutate(version_season = .vs) |>
    select(version_season, castaway, voted_with) |>
    left_join(
      .castaways |>
        distinct(castaway, castaway_id),
      by = "castaway"
    ) |>
    left_join(
      .castaways |>
        distinct(castaway, voted_with_id = castaway_id),
      by = c("voted_with" = "castaway")
    )

  # weight
  df_votes <- .vote_history |>
    filter(order <= .n_boots) |>
    distinct(episode, order, castaway, vote, vote_order) |>
    mutate(wt = 1 / (max(order) - order + 1))

  voted_with <- df_votes |>
    rename(voted_with = castaway) |>
    select(-wt)

  df_v_count <- df_votes |>
    left_join(voted_with, by = c("episode", "order", "vote")) |>
    group_by(castaway, voted_with) |>
    summarise(
      n = n(),
      wt = sum(wt)
    )

  # number of tribals together
  tribals <- .vote_history |>
    filter(order <= .n_boots) |>
    pull(order) |>
    unique()

  df_n_tribals_base <- df_base |>
    rename(
      attended_tribal_with = voted_with,
      attended_tribal_with_id = voted_with_id
    )

  df_n_tribals <- map_dfr(tribals, ~{
    who_attended_tribal <- .vote_history |>
      filter(order == .x) |>
      pull(castaway) |>
      unique()
    df_n_tribals_base |>
      filter(
        castaway %in% who_attended_tribal,
        attended_tribal_with %in% who_attended_tribal
      )
  }) |>
    count(castaway, attended_tribal_with, name = "n_tribals") |>
    filter(castaway != attended_tribal_with)

  # filter for those still alive
  if (alive) {
    # still_alive() is defined elsewhere
    alive <- still_alive(.vs, .n_boots) |>
      pull(castaway)
  } else {
    alive <- survivoR::boot_mapping |>
      filter(
        version_season == .vs,
        order == 0
      ) |>
      pull(castaway)
  }

  # joining
  df_base |>
    left_join(df_v_count, by = c("castaway", "voted_with")) |>
    left_join(df_n_tribals, by = c("castaway", "voted_with" = "attended_tribal_with")) |>
    mutate(
      n = replace_na(n, 0),
      n_tribals = replace_na(n_tribals, 0),
      wt = replace_na(wt, 0),
      p = n / n_tribals
    ) |>
    filter(
      castaway != voted_with,
      castaway %in% alive,
      voted_with %in% alive
    )
}
```

Code

```
library(dplyr)
library(purrr)

df_jury <- survivoR::castaways |>
  filter(jury) |>
  distinct(version_season, voted_with_id = castaway_id)

df_finalists <- survivoR::castaways |>
  filter(finalist) |>
  distinct(version_season, castaway_id)

df <- map_dfr(all_vs("US"), ~voted_with(.x)) |> # functions in appendix
  semi_join(df_jury, by = c("version_season", "voted_with_id")) |>
  semi_join(df_finalists, by = c("version_season", "castaway_id")) |>
  left_join(
    survivoR::jury_votes |>
      distinct(voted_with_id = castaway_id, castaway_id = finalist_id, vote),
    by = c("voted_with_id", "castaway_id")
  )

df |>
  # `max` flags whether this finalist shares the most votes with the juror
  # (assumed to be computed upstream of this snippet)
  group_by(version_season, jury = voted_with) |>
  filter(sum(max) == 1) |>
  group_by(version_season, jury) |>
  mutate(p = sum(max) / n()) |>
  ungroup() |>
  filter(vote == 1) |>
  summarise(
    n = n(),
    obs = sum(max),
    exp = sum(p)
  ) |>
  mutate(
    p = obs / n,
    p_exp = exp / n,
    chisq = (obs - exp)^2 / exp,
    p_val = 1 - pchisq(chisq, 1),
    index = p / p_exp
  )
```



The post Ignoring the IID assumption isn’t a great idea appeared first on Dan Oehm | Gradient Descending.

The IID assumption (independent and identically distributed) is pretty important. Ignoring it can lead you to make incorrect conclusions (usually through pseudoreplication). Here’s a quick example.

You have 50 bags, each filled with 1 red and 9 green balls. You randomly draw 1 ball from each bag and record the colour. You draw the red ball 5 times and a green ball 45 times. Let’s put that into a 2×2 table.

```
x1 <- matrix(
  c(405, 45, 45, 5),
  nrow = 2,
  dimnames = list(
    c("Green", "Red"),
    c("Not drawn", "Drawn")
  )
)

> x1
      Not drawn Drawn
Green       405    45
Red          45     5
```

I’ll run a chi-squared test to see if there is a difference between drawing a green ball and drawing a red ball. Maybe the red balls are bigger, rougher, lighter, or stickier, something that makes them more likely to be drawn from the bag than the green balls.

```
> chisq.test(x1)

        Pearson's Chi-squared test

data:  x1
X-squared = 0, df = 1, p-value = 1
```

Nope. The p-value is 1 because this is exactly what we would expect to draw if the balls were identical apart from their colour.

Now perhaps you have 40 bags, each with 3 red balls and 1 green ball. It’s more likely you’ll draw red balls in this case. You draw 30 red balls and 10 green balls. Let’s put it into a 2×2 table and run the chi-squared test again.

```
x2 <- matrix(
  c(30, 90, 10, 30),
  nrow = 2,
  dimnames = list(
    c("Green", "Red"),
    c("Not drawn", "Drawn")
  )
)

> x2
      Not drawn Drawn
Green        30    10
Red          90    30

> chisq.test(x2)

        Pearson's Chi-squared test

data:  x2
X-squared = 0, df = 1, p-value = 1
```

Look at that: the p-value is 1, and we drew exactly what was expected. We would conclude that the red and green balls are not different, and that red balls are no more likely to be drawn than green balls.

Because we’re lazy, let’s combine the draws and fit the chi-squared test again.

```
> x1 + x2
      Not drawn Drawn
Green       435    55
Red         135    35

> chisq.test(x1 + x2)

        Pearson's Chi-squared test with Yates' continuity correction

data:  x1 + x2
X-squared = 8.6183, df = 1, p-value = 0.003328
```

Now, seemingly by magic, there is a difference. We might conclude that there is a difference between the balls and that the red ones are more likely to be drawn.

We know this isn’t correct, though; we set this up so that each test was not significant and we drew the expected number of red balls. So why is this now showing a strong association?

It’s because the observations are not IID. The observation is the colour of the ball drawn from each bag, so there is only one observation per bag, not the 10 (or 4) balls the pooled table implies. If one ball is drawn, all the others can’t be.

The probability of drawing a red in the first set is 1/10 and in the second set, it’s 3/4. They are not the same and the model needs to be set up in a way that accounts for that.

This problem can be structured as a regression problem and you’ll get the same result, in isolation the first and second examples are not significant but pooled together they are.
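Here is a sketch of that regression framing, deliberately committing the same pseudoreplication by treating every ball as an independent observation:

```r
# One row per ball: which set of bags it came from, its colour, and whether it was drawn
balls <- rbind(
  data.frame(set = 1, colour = "red",   drawn = rep(c(1, 0), c(5, 45))),
  data.frame(set = 1, colour = "green", drawn = rep(c(1, 0), c(45, 405))),
  data.frame(set = 2, colour = "red",   drawn = rep(c(1, 0), c(30, 90))),
  data.frame(set = 2, colour = "green", drawn = rep(c(1, 0), c(10, 30)))
)

# p-value for the colour effect in a logistic regression
p_for <- function(df) {
  fit <- glm(drawn ~ colour, family = binomial, data = df)
  summary(fit)$coefficients["colourred", "Pr(>|z|)"]
}

p_for(balls[balls$set == 1, ]) # ~1: no colour effect within set 1
p_for(balls[balls$set == 2, ]) # ~1: no colour effect within set 2
p_for(balls)                   # pooled: a spuriously significant colour effect
```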

It seems fairly benign and just poor stats, but imagine replacing the bags with job vacancies and the colour of the balls with some demographic variable of applicants e.g. age, gender, race, ethnicity, etc. Suddenly you find yourself in a situation.

I’m sure you could think of other examples where this would be a problem. Unfortunately, I’ve seen this on more than one occasion in the real world. Most recently regarding first boots in Survivor. You can check out that post if you wish (you will see the same example though).

In this case, the observation isn’t the balls in the bag but the one that is drawn. Assuming we know the contents of the bag the correct way to test for an association between colour and being drawn is as follows.

- The observed number of red balls is 5 in the first set and 30 in the second.
- The expected value is 50*1/10 = 5 in the first and 40*3/4 = 30 in the second.
- The test statistic is χ² = (5 − 5)²/5 + (30 − 30)²/30 = 0, giving a p-value of 1.

And we have observed exactly what was expected, ergo, no further action.

The takeaway: be careful how you analyse data, and keep this in mind when reading others’ analyses.



The post {alone} v0.4 is now available appeared first on Dan Oehm | Gradient Descending.

Alone Australia season 2 has finished and is now available in the package, ready for analysis. As usual, install via GitHub or CRAN.

```
devtools::install_github("doehm/alone")
```

```
install.packages("alone")
```

If you find any issues, please raise them on GitHub.

It was another great season. Season 1 started off rough with a few early taps, but the season 2 contestants hung around a little longer, which shows in the survival chart. While the average number of days lasted is higher for season 2, Gina still holds the record for lasting 67 days.

Over the next few weeks, I’ll update my analysis comparing the US and AU versions.

Two key pieces of information missing from the data are the full names of the contestants and the loadouts for AU. I haven’t been able to find this data anywhere. If you do come across it, let me know and I can add it to the package.

Alone season 11 has just started and will be available after the season. In the meantime, you can see the results and grab what data is available from Google Sheets.



The post The Sanctuary: Stats and data from {survivoR} appeared first on Dan Oehm | Gradient Descending.

I wanted a space to throw all my tables and charts made using the {survivoR} R package into, so I started The Sanctuary (built with Quarto, of course). It has interactive tables about the castaways, challenges, voting history, confessionals, episode details, and a bunch more.

While there is a lot of data out there about Survivor it’s rarely all in one place. This provides a view of castaways across seasons and various stats. There won’t be a lot of explanation or in-depth analysis, just a truckload of data, tables, and charts to explore. Longer form posts will remain on the blog.

The Sanctuary is updated regularly during seasons and whenever new data hits Git. It’s the companion for the {survivoR} package.

I won’t share too much here, I’ll let you explore for yourself, but here is an idea of what you can find.

The score is a measure of how many Tribal Councils a castaway survived, the difficulty of surviving the vote (e.g. surviving a Tribal with 4 people is harder than surviving one with 12), and how many votes they copped. Denise, Sandra, and Stephanie take out the top 3 spots.

The challenge score is a measure of challenge success. For individual immunity challenges, Ozzy takes out the top spot, followed by Brad Culpepper in season 34, Game Changers, and Mike Holloway in season 30, Worlds Apart.

The highest rated season based on IMDb ratings is season 31 Cambodia, the second chance season, followed by season 40 Winners at War and Season 20 Heroes vs Villains. The top 3 are all returnee seasons. The 4th highest rated season is season 7 Pearl Islands, which is also the highest rated all newbie season.

Russell Hantz still holds the record for the most confessionals in a season, which is going to be hard to beat. Rob Cesternino and Colby Donaldson round out the top 3.

Anyway, expect more things as time goes on.



The post {survivoR} 2.3.3 is now available appeared first on Dan Oehm | Gradient Descending.

Wrapping up season 46 and time for another release of survivoR. A few new things in this release including two new datasets.

Install from Git or CRAN:

```
install.packages("survivoR")
```

```
devtools::install_github("doehm/survivoR")
```

As usual, if you find any issues, raise an issue on Git – survivoR issues

For non-R users, it’s free to download from Google Sheets.

If you just want the stats you can head over to The Sanctuary

- New seasons:
  - US46
- New datasets added:
  - `episode_summary` – the summary of the episode from Wikipedia
  - `challenge_summary` – a summarised version of `challenge_results` for easy analysis
- New fields added:
  - `team` on `challenge_results` – identifying the team that the castaways were on during the challenge

I have included the episode summary extracts from Wikipedia that detail the events of the episode. It usually includes pre-challenge events and discussions of strategy, challenge description and results, strategy discussions amongst the tribe heading to Tribal Council, and the result. It may be interesting for NLP type applications.

```
> episode_summary
# A tibble: 647 × 4
version version_season episode episode_summary
<chr> <chr> <dbl> <chr>
1 US US01 1 "The two tribes paddled their way to their respective beaches on a raft with meager supplies. Upon ar…
2 US US01 2 "Following their Tribal Council, Tagi found its fish traps still empty. Disappointed that Rudy was no…
3 US US01 3 "At the Tagi tribe, Stacey still wanted to get rid of Rudy and tried to create a girl alliance to do …
4 US US01 4 "At Pagong, Ramona started to feel better after having been sick and tried to begin pulling her weigh…
5 US US01 5 "At Tagi, Dirk and Sean were still trying to fish instead of helping around camp, but to no avail. Su…
6 US US01 6 "Both tribes were wondering what the merge was going to be like. Tagi was afraid due to their numeric…
7 US US01 7 "The day after Pagong voted Joel out, one person from each tribe went to the opposite tribe's camp an…
8 US US01 8 "At camp, the remaining members of the former Pagong tribe felt vulnerable because the Tagi tribe had…
9 US US01 9 "While Richard was catching fish, the other players began to realize that nobody voted him out becaus…
10 US US01 10 "Some people were happy that Jenna was voted out because she was getting on everyone's nerves. Everyo…
# 637 more rows
# Use `print(n = ...)` to see more rows
```

When I was making some charts for The Sanctuary, specifically the challenge score, I realised it was quite difficult to summarise the `challenge_results` table by the different types of challenges, e.g. individual immunity. There are a few edge cases with combined challenges, e.g. Team / Individual Immunity and Reward challenges, where a team wins reward and the last person standing on each team wins immunity. So there are 3 winning outcomes – reward only, immunity only, and immunity and reward for the last person standing.

To make it easier to summarise, I created `challenge_summary`. It looks like this…

```
> challenge_summary
# A tibble: 50,428 × 12
category version_season challenge_id challenge_type outcome_type tribe castaway_id castaway n_entities n_winners n_in_team won
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <dbl>
1 All US01 1 Immunity and Reward Tribal Pagong US0002 B.B. 2 1 8 1
2 All US01 1 Immunity and Reward Tribal Pagong US0004 Ramona 2 1 8 1
3 All US01 1 Immunity and Reward Tribal Pagong US0006 Joel 2 1 8 1
4 All US01 1 Immunity and Reward Tribal Pagong US0007 Gretchen 2 1 8 1
5 All US01 1 Immunity and Reward Tribal Pagong US0008 Greg 2 1 8 1
6 All US01 1 Immunity and Reward Tribal Pagong US0009 Jenna 2 1 8 1
7 All US01 1 Immunity and Reward Tribal Pagong US0010 Gervase 2 1 8 1
8 All US01 1 Immunity and Reward Tribal Pagong US0011 Colleen 2 1 8 1
9 All US01 1 Immunity and Reward Tribal Tagi US0001 Sonja 2 1 8 0
10 All US01 1 Immunity and Reward Tribal Tagi US0003 Stacey 2 1 8 0
# 50,418 more rows
# Use `print(n = ...)` to see more rows
```

The other challenge datasets can easily be joined to this table. Note that `challenge_summary` is not MECE; for example, the category contains ‘All’, ‘Individual’, ‘Individual Reward’, and ‘Individual Immunity’, to name a few, and the results are counted separately for each category. You will need to filter for the right category before using the table.

Not every castaway is counted in every category. If they didn’t make it to the merge they didn’t compete in an individual challenge (except in some edge cases). Rather than their record being 0, they are not featured in that category. See Github for more details.
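As an illustration of the filtering point, using a toy data frame with a few of the `challenge_summary` columns (not the real dataset), an individual-immunity win count might look like:

```r
# Toy rows mimicking a few challenge_summary columns
toy <- data.frame(
  category = c("All", "Individual Immunity", "Individual Immunity", "All"),
  castaway = c("Ozzy", "Ozzy", "Mike", "Mike"),
  won      = c(1, 1, 1, 0)
)

# Filter to a single category first -- categories overlap, so skipping
# this step would double count results
ind_imm <- subset(toy, category == "Individual Immunity")
aggregate(won ~ castaway, data = ind_imm, FUN = sum)
```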



The post Racial Bias in Survivor: Are BIPOC Players Disproportionately Voted Out First? (part 2) appeared first on Dan Oehm | Gradient Descending.

A while back, I looked into whether or not BIPOC players are disproportionately voted out first. I didn’t find a lot of evidence to support this claim, despite how it may seem.

However, I did find that female players were disproportionately voted out first. BIPOC women were as well, although that is likely due more to gender than race/ethnicity. At the merge, it flips and men are more likely to be voted out.

I refreshed the analysis after Season 46 made the merge to see if there had been a change. I’ve expanded it to test the following points:

- Are BIPOC players disproportionately voted out of their original tribe first?
- Are women disproportionately voted out of their original tribe first?
- Are BIPOC women disproportionately voted out of their original tribe first?
- Are white women disproportionally voted out of their original tribe first?

I take a very statistical view here, but I think that’s needed to cut through perceptions and confirmation bias.

The number of first boots is over expectation for both BIPOC and Female cohorts, but far more so for women.

Cohort | Expected | Actual | Difference |
---|---|---|---|
BIPOC | 29 | 33 | 4 |
Female | 45 | 55 | 10 |
Male, BIPOC | 12 | 9 | -3 |
Male, White | 31 | 24 | -7 |
Female, White | 28 | 31 | 3 |
Female, BIPOC | 17 | 24 | 7 |

The model estimates if there is an increase in the probability of being voted out for a certain cohort. When there are equal numbers of BIPOC and other players in the tribe at the first Tribal Council, above 50% means a positive bias. For the splits e.g. BIPOC women, I have assumed 25%. The bands indicate statistical variation and represent the 50%, 80%, and 95% credible intervals.

Both BIPOC and female cohorts have a positive bias. The interval for the BIPOC cohort fairly comfortably contains 50% within the 80% CI, so there may be some weak evidence there, but it’s not particularly strong.

The bias estimate for women is clearly higher than 50% so it’s pretty obvious there is a bias here. Women are more likely to be voted out of their tribe first.

So, are BIPOC players disproportionately voted out of their original tribe first? Not really.

If there were no bias and the vote were perfectly random, then out of the 82 first Tribal Councils with at least 1 BIPOC player, we would expect 29 BIPOC players to be voted out first; we have observed 33.

The bias is estimated to be a +6% chance of being voted out first, which is not a huge amount; the 95% CI is (-6%, 17%). The claim that BIPOC players are disproportionately voted out first is not supported by the data.

To put this into perspective a little more, let’s say there are 8 people on the tribe, 4 of which are BIPOC players (50%). The average bias is +6% so the probability a BIPOC player will be voted out is 56%. Therefore each player has a 14% (56%/4) chance of being voted out where equal chance is 12.5%, just a +1.5% increase per person.
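The post’s model isn’t reproduced here, but a heavily simplified beta-binomial sketch gives numbers in the same ballpark: put a uniform prior on the probability that the first boot is BIPOC, update on 33 of 82, and measure the bias against the expected proportion 29/82 (this ignores that tribe composition varies from council to council):

```r
obs <- 33
n <- 82
expected <- 29 / 82 # expected proportion of BIPOC first boots

# Posterior for p with a uniform Beta(1, 1) prior
post_a <- 1 + obs
post_b <- 1 + n - obs

# Bias = p minus the expected proportion, with a 95% credible interval
bias_mean <- post_a / (post_a + post_b) - expected
bias_ci <- qbeta(c(0.025, 0.975), post_a, post_b) - expected

round(c(bias = bias_mean, lower = bias_ci[1], upper = bias_ci[2]), 2)
```

The interval straddles zero, consistent with the weak-evidence reading above.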

Are women disproportionately voted out first?

Yes.

If there were no bias we would expect 45 first boots to be female, but we have seen 55. If it were completely random, the probability of seeing 55 or more is about 1%.

The bias is estimated to be a +13% chance of being voted out first and is significant in this case. The 95% CI is (3%, 24%).

Are BIPOC women disproportionately voted out first?

Yes, but likely more due to gender than race/ethnicity.

If there were no bias we would expect to see 17, but there have been 24. The probability of seeing 24 is about 2%. The bias is estimated to be a +11% chance of being voted out first with a 95% CI of (0%, 24%). It's wide due to the lower numbers, but from the above, we can say that it's primarily due to gender, not race.

Are white women disproportionately voted out first?

Actually, no.

We would expect to see 28 but there have been 31. The bias is estimated to be +4% of being voted out first with a 95% CI of (-5%, 13%). This is pretty interesting: gender is clearly the strongest factor (out of gender and race/ethnicity), but particularly from S42 the first boots have been primarily BIPOC women. This suggests that BIPOC women are contributing more than their fair share of first boots.

These results are similar to what I found last time. The number of BIPOC first boots is above expectation, but only by 4. That may seem like a considerable increase above what was expected, but it could also happen randomly. After many more seasons something may emerge, but at this stage claiming that BIPOC players are voted out first isn't supported by the data. Although, I don't think it's as clear-cut as that.

Gender bias is present and more of a factor than race/ethnicity. The data shows that women are more likely to be voted out of the tribe first with a +13% increase in probability, and easily above equal probability.

What's important here is that while women are disproportionately voted out, BIPOC women make up most of that imbalance. When modelling white women and BIPOC women independently, white women have a smaller bias and, to be fair, one within reasonable variation. The bias for BIPOC women is about 2.5x higher.

There could be other points for consideration. There was a post comparing votes for BIPOC players and other players. I’ll look into that next to see if it holds water (I have a lot to say about that post, to be honest). In a later post, I’ll also consider age as a contributing factor.

Again, I’m not saying that (subconscious) bias doesn’t exist towards BIPOC players in Survivor, just that it’s not measurable in the data point of who is voted out first.

How I arrived at this position is important, so we get into the weeds a bit (by a bit I mean a lot) here about how I conducted the analysis. It’s important because I consider a few things that are often overlooked.

I looked at it in 3 ways:

- Bayesian model
- Simulation model
- Regression model (there are issues with this though. I’ll explain.)

All code and results are contained in the post so you can reproduce the analysis.

There are quite a few things to consider to set up the data correctly:

- Who is considered BIPOC? In the survivoR package, I record anyone as BIPOC if they are listed as African-American/Canadian, Asian-American/Canadian, Latin-American, or Native American on the Survivor Wiki. I don’t make assumptions about someone’s identity.
- I only consider the first time an original tribe goes to Tribal and votes. Some players didn't go to Tribal Council with their original tribe before the Merge or a tribe swap; they are removed from the analysis because I want to control for any other sources of variation. This leaves 88 Tribal Councils, still a good amount for this analysis.
- I only consider the first true vote out. That means if a player quits at the first Tribal, e.g. Hannah in S45, or asks to be voted out, as Jonny Fairplay did in S16, I don't consider that the first Tribal; similarly, I remove medically evacuated players. I want to make sure I only include the first true *target* for each tribe.
- To calculate the probability of someone being voted out I consider only those eligible to be voted out. For example, if someone has individual immunity for the first Tribal, or safety without power, they are removed from the analysis.
- I ensure the makeup of the tribe is accounted for. This is the single most important consideration and what other analyses overlook. More below in the analysis.
- I have included players who have played multiple times, even though in seasons mixing returning players and newbies, the returning players already have a target on their back. This is mostly because excluding them would remove too many Tribals.

The tricky thing with this problem is that every first tribe has a different number of BIPOC / other players. The proportion ranges from 0 to 1 in the 88 tribes that attend Tribal Council across the 46 seasons. On average 25% of the castaways in a tribe are BIPOC. If it were the same at every Tribal there wouldn't be an issue, but this is an important point that can't be ignored. More below.

I want to directly estimate the bias with this model. If bias is present and BIPOC players are more likely to be voted out, the probability should be some factor above equal probability.

The way I've formulated it is as follows:

```
y_k ~ Bernoulli(kappa_k)
kappa_k = logit^-1( logit(p_k) + beta )
```

where `kappa_k` is the probability of a BIPOC player being voted out first at Tribal Council `k`. The model takes the log odds of `p_k`, the BIPOC proportion of the tribe, and adds `beta`, the bias term. `beta` is a hierarchical term shared across all the seasons, whereas `p_k` is different for each Tribal Council.

Under this model, we can let `beta` be normally distributed and unconstrained to estimate the bias. I've used a prior of `beta ~ Normal(0.5, 1.5)` for the cohorts where a positive bias is plausible (and `Normal(-0.5, 1.5)` for their complements, as in the Stan code below). On the log scale it's hard to interpret, but essentially if `beta = 0.5` this would equate to a bias of +0.12, which I think is fair.
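To make the log-odds scale concrete, here's a quick check of that interpretation using the same inverse-logit transform the model uses (the 50% baseline and `beta = 0.5` are just illustrative values):

```r
log_odds     <- function(x) log(x / (1 - x))
log_odds_inv <- function(x) 1 / (1 + exp(-x))

# A bias of beta = 0.5 applied to a 50% baseline probability
log_odds_inv(log_odds(0.5) + 0.5)
# 0.622..., i.e. roughly a +0.12 shift above equal probability
```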

Code: Bayesian data analysis

```
# libraries ---------------------------------------------------------------
library(tidyverse)
library(glue)

# set up ------------------------------------------------------------------
no_quitters <- survivoR::castaways |>
filter(
version == "US",
str_detect(result, "voted"),
!(castaway == "Jonny Fairplay" & version_season == "US16")
) |>
distinct(version_season, castaway, castaway_id)
demogs <- survivoR::castaway_details |>
select(castaway_id, gender, bipoc, race, ethnicity)
tribe_size <- survivoR::boot_mapping |>
filter(order == 0) |>
count(version_season, tribe)
log_odds <- function(x) {
log(x/(1-x))
}
log_odds_inv <- function(x) {
1/(1+exp(-x))
}
p_adj <- function(p, bias) {
log_odds_inv(log_odds(p)+bias)
}
summarise_quantiles <- function(df, x) {
x <- enquo(x)
df |>
summarise(
q2.5 = quantile(!!x, 0.025),
q10 = quantile(!!x, 0.1),
q25 = quantile(!!x, 0.25),
q50 = quantile(!!x, 0.5),
q75 = quantile(!!x, 0.75),
q90 = quantile(!!x, 0.90),
q97.5 = quantile(!!x, 0.975),
mean = mean(!!x),
sd = sd(!!x)
)
}
levels <- c("bipoc", "not_bipoc", "female", "male", "female_bipoc", "male_bipoc",
"female_not_bipoc", "male_not_bipoc")
df_labs <- tribble(
~var, ~lab_text,
"bipoc", "BIPOC",
"female", "Female",
"female_bipoc", "Female, BIPOC",
"female_not_bipoc", "Female, White",
"male_bipoc", "Male, BIPOC",
"male_not_bipoc", "Male, White"
) |>
mutate(var = factor(var, levels = levels))
# first boots -------------------------------------------------------------
# voted out data frame
df_voted_out <- survivoR::vote_history |>
filter(
version == "US",
tribe_status == "Original"
) |>
distinct(version_season, voted_out, voted_out_id, order, tribe, tribe_status) |>
semi_join(no_quitters, by = c("version_season", "voted_out_id" = "castaway_id")) |>
group_by(version_season, tribe) |>
slice_min(order) |>
left_join(demogs, by = c("voted_out_id" = "castaway_id")) |>
mutate(
bipoc = replace_na(bipoc, FALSE),
not_bipoc = !bipoc,
female = gender == "Female",
male = gender == "Male",
female_bipoc = gender == "Female" & bipoc,
male_bipoc = gender == "Male" & bipoc,
female_not_bipoc = gender == "Female" & !bipoc,
male_not_bipoc = gender == "Male" & !bipoc
) |>
ungroup()
# expected data frame
df_expected <- survivoR::vote_history |>
filter(
version == "US",
tribe_status == "Original",
is.na(immunity) | immunity == "Hidden"
) |>
distinct(version_season, castaway, castaway_id, order, tribe, tribe_status) |>
group_by(version_season, tribe) |>
slice_min(order) |>
left_join(demogs, by = "castaway_id") |>
mutate(
bipoc = replace_na(bipoc, FALSE),
female = gender == "Female",
male = gender == "Male",
female_bipoc = gender == "Female" & bipoc,
male_bipoc = gender == "Male" & bipoc,
female_not_bipoc = gender == "Female" & !bipoc,
male_not_bipoc = gender == "Male" & !bipoc
) |>
group_by(version_season, order, tribe) |>
summarise(
n = n(),
n_bipoc = sum(bipoc),
n_not_bipoc = sum(!bipoc),
n_female = sum(female),
n_male = sum(male),
n_female_bipoc = sum(female_bipoc),
n_male_bipoc = sum(male_bipoc),
n_female_not_bipoc = sum(female_not_bipoc),
n_male_not_bipoc = sum(male_not_bipoc),
.groups = "drop"
) |>
mutate(
p_bipoc = n_bipoc/n,
p_not_bipoc = n_not_bipoc/n,
p_female = n_female/n,
p_male = n_male/n,
p_female_bipoc = n_female_bipoc/n,
p_male_bipoc = n_male_bipoc/n,
p_female_not_bipoc = n_female_not_bipoc/n,
p_male_not_bipoc = n_male_not_bipoc/n
) |>
ungroup()
# summary -----------------------------------------------------------------
# observed
df_obs <- df_voted_out |>
summarise(
bipoc = sum(bipoc),
not_bipoc = sum(not_bipoc),
female = sum(female),
male = sum(male),
female_bipoc = sum(female_bipoc),
male_bipoc = sum(male_bipoc),
female_not_bipoc = sum(female_not_bipoc),
male_not_bipoc = sum(male_not_bipoc)
) |>
pivot_longer(everything(), names_to = "var", values_to = "observed") |>
left_join(
df_expected |>
select(starts_with("p")) |>
summarise_all(~round(sum(.x))) |>
pivot_longer(everything(), names_to = "var", values_to = "expected") |>
mutate(var = str_remove(var, "p_")),
by = "var"
) |>
mutate(
var = factor(var, levels = levels),
res = observed-expected
)
# bayes model -------------------------------------------------------------
library(rstan)
library(tidybayes)
stan_dat <- df_voted_out |>
left_join(
df_expected |>
select(version_season, tribe, p_bipoc, p_not_bipoc, p_female, p_male, p_female_bipoc,
p_male_bipoc, p_female_not_bipoc, p_male_not_bipoc),
by = c("version_season", "tribe")
) |>
transmute(
version_season,
tribe,
y_bipoc = as.numeric(bipoc),
y_not_bipoc = as.numeric(!bipoc),
y_female = as.numeric(gender == "Female"),
y_male = as.numeric(gender == "Male"),
y_female_bipoc = as.numeric(female_bipoc),
y_male_bipoc = as.numeric(male_bipoc),
y_female_not_bipoc = as.numeric(female_not_bipoc),
y_male_not_bipoc = as.numeric(male_not_bipoc),
p_bipoc,
p_not_bipoc,
p_female,
p_male,
p_female_bipoc,
p_male_bipoc,
p_female_not_bipoc,
p_male_not_bipoc
)
stan_dat <- stan_dat |>
select(-starts_with("p")) |>
pivot_longer(starts_with("y"), names_to = "var", values_to = "y") |>
mutate(var = str_remove(var, "y_")) |>
left_join(
stan_dat |>
select(-starts_with("y")) |>
pivot_longer(starts_with("p"), names_to = "var", values_to = "p") |>
mutate(var = str_remove(var, "p_")),
by = c("version_season", "tribe", "var")
) |>
mutate(
log_odds = log(p/(1-p)),
var = factor(var, levels = levels),
mu0 = case_when(
var %in% c("bipoc", "female", "female_not_bipoc", "female_bipoc") ~ 0.5,
var %in% c("not_bipoc", "male", "male_not_bipoc", "male_bipoc") ~ -0.5,
TRUE ~ 0
)
)
stan_code <- "data {
int<lower=0> N;
array[N] int<lower=0, upper=1> y;
array[N] real<lower=0, upper=1> p;
array[N] real log_odds;
real mu0;
}
parameters {
real beta;
}
transformed parameters {
array[N] real<lower=0, upper=1> kappa;
for(k in 1:N) {
kappa[k] = 1/(1+exp(-(log_odds[k] + beta)));
}
}
model {
beta ~ normal(mu0, 1.5);
y ~ bernoulli(kappa);
}"
# compile one for faster fitting
dat <- stan_dat |>
filter(
var == "female",
p > 0,
p < 1
) |>
as.list()
dat$mu0 <- unique(dat$mu0)
dat$N <- length(dat$y)
mod_stan <- stan(
model_code = stan_code,
data = dat
)
# fit models
df_bias <- map_dfr(levels, ~{
dat <- stan_dat |>
filter(
var == .x,
p > 0,
p < 1
) |>
as.list()
dat$mu0 <- unique(dat$mu0)
dat$N <- length(dat$y)
mod_stan <- stan(
fit = mod_stan, # reuse the compiled model from above
data = dat
)
tibble(
var = .x,
bias = rstan::extract(mod_stan, "beta")$beta
)
}) |>
mutate(var = factor(var, levels = levels)) |>
left_join(
stan_dat |>
group_by(var) |>
summarise(median = median(p)),
by = "var"
)
df_bias_summary <- df_bias |>
group_by(var) |>
summarise_quantiles(bias) |>
mutate(
lab = snakecase::to_title_case(levels) |>
str_replace("Bipoc", "BIPOC")
)
df_bias_summary_p <- df_bias |>
mutate(
p0 = ifelse(var %in% c("female_bipoc", "male_bipoc", "female_not_bipoc", "male_not_bipoc"), 0.25, 0.5),
p = log_odds_inv(log_odds(p0)+bias),
) |>
group_by(var, p0) |>
summarise_quantiles(p) |>
mutate(
pct = glue("{ifelse(q50<p0, '', '+')}{100*round(q50-p0, 2)}%"),
)
```

```
> df_bias_summary
# A tibble: 8 × 11
var q2.5 q10 q25 q50 q75 q90 q97.5 mean sd lab
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 bipoc -0.245 -0.0708 0.0768 0.236 0.390 0.533 0.692 0.234 0.238
2 not_bipoc -0.687 -0.532 -0.384 -0.222 -0.0525 0.0908 0.268 -0.218 0.244
3 female 0.0987 0.247 0.369 0.525 0.691 0.832 0.997 0.533 0.232
4 male -0.967 -0.817 -0.678 -0.525 -0.374 -0.237 -0.0831 -0.526 0.226
5 female_bipoc 0.00560 0.178 0.342 0.513 0.699 0.861 1.04 0.517 0.263
6 male_bipoc -1.21 -0.929 -0.698 -0.443 -0.204 -0.00400 0.211 -0.458 0.366
7 female_not_bipoc -0.288 -0.127 0.0177 0.183 0.341 0.482 0.627 0.179 0.238
8 male_not_bipoc -0.932 -0.755 -0.588 -0.413 -0.240 -0.0944 0.0718 -0.416 0.258
```

For BIPOC players, the bias term's CI includes 0 fairly comfortably, so I'd say they are not voted out first any more than other players; at least, there is not enough evidence to confirm that they are. It is also lower than I expected.

Women are voted out first more often than male players. The 95% CI doesn't include 0 and is clearly above it. BIPOC women are similar. From this it should be clear that it's more due to gender than race/ethnicity.

The Bayesian analysis showed us what we need to know, but I wanted to look at this another way as well. I've also fit a simulation model and looked at the probability distribution: if it were completely random, how many BIPOC players could we expect to be voted out first?

I took 4,000 random draws from each of the first Tribal Councils and counted how many times a BIPOC, female, or female BIPOC castaway was voted out. Below are the probability distributions under the assumption of perfect randomness. Each bar represents the likelihood of observing that many boots out of the 88 Tribal Councils.

For example, there have been 33 BIPOC players booted from the first Tribal Council, and under perfect randomness, we would expect 29. But the distribution shows we can reasonably expect somewhere between 22 and 37, so 33 is on the upper end but isn't particularly unusual.

We have seen 55 female castaways booted first where we would expect 45. This is right in the tail of the distribution. There's only a 1% chance of seeing 55 or more female first boots, which means there's probably something here: a preference to vote women out first.

Code: Simulation

```
# number of sims
n_sims <- 4000
levels <- c("bipoc", "female", "female_bipoc", "female_not_bipoc")
df_sim0 <- map_dfr(1:n_sims, ~{
df_expected |>
mutate(sim = .x)
}) |>
mutate(
bipoc = rbernoulli(n(), p_bipoc),
female = rbernoulli(n(), p_female),
female_bipoc = rbernoulli(n(), p_female_bipoc),
female_not_bipoc = rbernoulli(n(), p_female_not_bipoc),
male_bipoc = rbernoulli(n(), p_male_bipoc),
male_not_bipoc = rbernoulli(n(), p_male_not_bipoc)
) |>
group_by(sim) |>
summarise(
bipoc = sum(bipoc),
female = sum(female),
female_bipoc = sum(female_bipoc),
female_not_bipoc = sum(female_not_bipoc),
male_bipoc = sum(male_bipoc),
male_not_bipoc = sum(male_not_bipoc)
) |>
pivot_longer(-sim, names_to = "var", values_to = "y") |>
filter(var %in% levels)
df_ci <- df_sim0 |>
group_by(var) |>
summarise_quantiles(y)
df_sim <- df_sim0 |>
count(var, y)
```

I’ve only included those with a positive bias in the chart.

The final way I'll look at this is by fitting a basic regression model. This is, to be honest, a bad model choice, for reasons I'll explain.

For this model to work, the model data frame needs to be at the person level. The response is 1 if the person was voted out and 0 otherwise. The predictors are BIPOC (yes, no) and gender (male, female).

The issue with this model is that each observation is assumed to be independent, meaning that whether or not a person is voted out depends only on that person's characteristics and is independent of all other people. But that doesn't hold. Only one person is eliminated per Tribal Council, so whoever is voted out of the tribe first means all the others can't be voted out. Independence only holds between Tribal Councils.

That's really important to understand, because if you were to predict who will be voted out, the model may spit out multiple people going home, which is dumb. It's also important to consider what that means for the coefficients of the model. The dependence is going to change the variance depending on the proportions within the tribe, and the effect is going to be averaged across the seasons. This could be misleading.

You need to be really careful when interpreting the output under these conditions. I'm doing this anyway because I'm mainly interested in whether it's drastically different from the above.

The convenient thing about the regression model is that we can directly compare the coefficients of gender and BIPOC status. Even with the issues of dependence, we can compare the magnitude of both to see which has the strongest influence. You still have to be careful though.

Given the analysis above, gender should be the stronger predictor. If that’s true, I rest my case.

Code: Regression

```
# regression --------------------------------------------------------------
diverse_tribes <- df_expected |>
filter(
p_bipoc > 0,
p_bipoc < 1
) |>
distinct(version_season, tribe)
df_mod <- survivoR::vote_history |>
semi_join(diverse_tribes, by = c("version_season", "tribe")) |>
filter(
version == "US",
tribe_status == "Original",
is.na(immunity) | immunity == "Hidden"
) |>
distinct(version_season, castaway, castaway_id, voted_out, voted_out_id, order, tribe, tribe_status) |>
semi_join(no_quitters, by = c("version_season", "voted_out_id" = "castaway_id")) |>
mutate(voted_out = as.numeric(voted_out == castaway)) |>
group_by(version_season, tribe) |>
slice_min(order) |>
left_join(demogs, by = "castaway_id") |>
mutate(bipoc = replace_na(bipoc, FALSE)) |>
filter(gender != "Non-binary")
mod <- glm(voted_out ~ gender + bipoc, data = df_mod, family = binomial(link='logit'))
summary(mod)
```

```
> summary(mod)
Call:
glm(formula = voted_out ~ gender + bipoc, family = binomial(link = "logit"),
data = df_mod)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7650 0.1816 -9.721 <2e-16 ***
genderMale -0.5869 0.2454 -2.392 0.0168 *
bipocTRUE 0.2285 0.2473 0.924 0.3555
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 494.06 on 654 degrees of freedom
Residual deviance: 486.77 on 652 degrees of freedom
AIC: 492.77
Number of Fisher Scoring iterations: 5
```

I rest my case.

Gender is clearly more influential than BIPOC status. If I were a frequentist, I'd be removing BIPOC from the model as it's not significant.
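For reference, exponentiating the fitted coefficients turns them into odds ratios (values copied from the summary above):

```r
# Odds ratios from the fitted logistic regression coefficients
exp(-0.5869)  # genderMale: ~0.56, men have ~44% lower odds of being the first boot
exp(0.2285)   # bipocTRUE:  ~1.26, and not significant
```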

I didn't want to talk about this, but here we go. The paper titled 'Surviving Racism and Sexism: What Votes in the Television Program Survivor Reveal about Discrimination' came out after my original post. The analysis looks at whether BIPOC and female contestants are disproportionately voted out, as well as trends at other stages of the game.

It shows that women are disproportionately voted out first, as I've shown above. However, it claims that BIPOC players are also more likely to be the target and disproportionately voted out: *"Compared to White contestants, BIPOC contestants had 51% higher odds of being voted out of their tribe first, χ²(1, N=731)=4.59, p=.032, OR=1.51, 95% CI [1.03–2.19]"*. This is counter to what I have shown in the analysis.

They have used a logistic regression model at the person level for all 731 castaways in seasons 1-40. As I’ve shown above, this is not a good model for the problem. Even with the issues of fitting a regression model to this data I don’t see anything close to 51% higher odds.

I suspect there are also differences in how the data was set up. There’s not a lot of discussion about the data considerations before modeling as I’ve done, e.g. only using the original tribes, and removing ineligible castaways.

They have used the Survivor Wiki for labelling race/ethnicity, so that should be consistent, but there could be differences in which races/ethnicities are included.

I’ve curated the data to only those eligible to be voted out and the first true vote for a tribe using 46 seasons. Some tribes/castaways are removed because they didn’t go to Tribal Council before a swap. This leaves 655 castaways over 46 seasons.

They have only kept diverse tribes (although N=731 from above, so I'm not so sure about that). A tribe consisting entirely of BIPOC players, like the Manihiki tribe in Cook Islands, only has one possible choice, so it should be removed. This is an important consideration, but the same logic should extend to all tribe makeups. The probability of voting out a BIPOC player when there are 5 of 6 in a tribe is much higher than when there is only 1 of 10.

This imbalance alters the model outcome. I believe it is at the heart of the issue and why the paper probably reached some incorrect conclusions. I'll explain.

To demonstrate why this is important, I’ll make up a toy example.

Let’s assume 50 tribes went to Tribal Council. Each tribe has 1 BIPOC and 9 white players. In total, there are 500 players – 50 BIPOC and 450 white.

Let’s also assume there is no bias and everyone has a 1/10 chance of being voted out. Then we expect to see 5 BIPOC players and 45 white players voted out first. We’ll put that into a 2×2.

```
x1 <- matrix(c(405, 45, 45, 5), nrow = 2, dimnames = list(c("White", "BIPOC"), c("No", "Yes")))
> x1
No Yes
White 405 45
BIPOC 45 5
```

I’ll fit a Chi-squared test to see if there is an association between race and being voted out first.

```
> chisq.test(x1)
Pearson's Chi-squared test
data: x1
X-squared = 0, df = 1, p-value = 1
```

The p-value is 1 because we’ve assumed equal probability of being voted out first. Makes perfect sense.

Let’s choose another example, 40 Tribals, and each tribe has 3 BIPOC players and 1 White player. That’s 120 BIPOC players and 40 White players in total. Again, let’s assume no bias and equal probability of being voted out of 1/4. Then we expect 30 BIPOC players and 10 white players voted out. I’ll put that into a 2×2 and run a Chi-squared test.

```
x2 <- matrix(c(30, 90, 10, 30), nrow = 2, dimnames = list(c("White", "BIPOC"), c("No", "Yes")))
> x2
No Yes
White 30 10
BIPOC 90 30
> chisq.test(x2)
Pearson's Chi-squared test
data: x2
X-squared = 0, df = 1, p-value = 1
```

No surprises to anyone, there’s no association.

Now what if we joined them together, i.e. x = x1 + x2? There would be 90 Tribal Councils, 490 White players, and 170 BIPOC players, with 55 and 35 voted out respectively. Again, I'll run a Chi-squared test for an association.

```
> x <- x1 + x2
> x
No Yes
White 435 55
BIPOC 135 35
> chisq.test(x)
Pearson's Chi-squared test with Yates' continuity correction
data: x
X-squared = 8.6183, df = 1, p-value = 0.003328
```

Now the test is highly significant! We would confirm without a doubt that there IS an association between race and being voted out first.

But we know this isn’t correct because we specifically assumed equal probability of being voted out. We set the data up with this exact property and the individual tests produced a p-value of 1.

So, why is combining them suddenly showing an association when there is not? It’s because each observation is assumed to be iid – independent and identically distributed. But, the observations are not independent. In a single Tribal Council if one person gets voted out it means the others can’t be voted out. There is a dependency within each Tribal Council so the makeup of the tribe matters. The model doesn’t understand this though.

Each Tribal Council IS independent since there is no interaction between the votes in one Tribal and the votes in another. That’s what the observation should be, the Tribal Council, not the player.
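A quick sketch of what council-level observations look like for the same toy data: each Tribal Council is one Bernoulli trial whose success probability is simply that tribe's BIPOC share.

```r
# 50 councils with a 10% BIPOC share, 40 councils with a 75% BIPOC share
p <- c(rep(1/10, 50), rep(3/4, 40))
# The observed first boots from the toy example: 5 and 30 BIPOC boots
y <- c(rep(0, 45), rep(1, 5), rep(0, 10), rep(1, 30))

sum(p)  # expected BIPOC first boots: 35
sum(y)  # observed BIPOC first boots: 35, exactly as expected, so no bias
```

Viewed this way, the expected and observed counts match exactly, which is what the combined chi-squared test above fails to see.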

This is accounted for in the Bayesian model and the simulation, but not the regression model since it is at the person level.

I hope this makes sense because by ignoring this property you could be making conclusions about an association when there is none, which is what I think may have happened.

To call a spade a spade, a group of people go to Tribal Council to vote someone out. Bias enters the game when humans group together and make decisions about who to vote out, which is the very essence of the game. Maybe, that’s what needs to change? To make the game fairer perhaps more elements of chance need to be introduced. Perhaps they need to remove players’ votes, force people to rely on their social game and make true, meaningful connections.

I can’t imagine the fandom getting behind any major change though. They are a pretty conservative bunch and resist almost any change made to the game – final 4 fire, the 3 tribe set up, 26 days, new advantages, more than one hidden immunity idol, moral dilemmas, Summit journeys, the rice negotiation, beware advantages, fake idol kits, shot in the dark…. pretty much everything. Now in the new era, players can lose their vote and they fucking hate it.

So, it's pretty funny reading unhinged posts like 'Man who manipulates Survivor's game cannot imagine adjusting to make it fair' because a) maybe production is adjusting it? And b) I can't imagine any change aimed at making it 'fairer' that would receive unanimous approval, particularly when that would probably mean removing human decisions or introducing a new mechanic. The sentiment tends to be 'if it ain't broke, don't fix it, except when the person I liked gets voted out'.

The post Racial Bias in Survivor: Are BIPOC Players Disproportionately Voted Out First? (part 2) appeared first on Dan Oehm | Gradient Descending.

If you're looking for something a little different, `ggbrick` creates a 'waffle' style chart with the aesthetic of a brick wall. The usage is similar to `geom_col()`, where you supply counts as the height of the bar and a `fill` for a stacked bar. Each whole brick represents 1 unit. Two half bricks equal one whole brick.

It has been available on GitHub for a while, but recently I've made some changes and it now has CRAN's tick of approval.

```
install.packages("ggbrick")
```

There are two main geoms included:

- `geom_brick()`: to make the brick wall-style waffle chart.
- `geom_waffle()`: to make a regular-style waffle chart.

Use `geom_brick()` the same way you would use `geom_col()`.

```
library(dplyr)
library(ggplot2)
library(ggbrick)
# basic usage
mpg |>
count(class, drv) |>
ggplot() +
geom_brick(aes(class, n, fill = drv)) +
coord_brick()
```

`coord_brick()` is included to maintain the aspect ratio of the bricks. It is similar to `coord_fixed()`; in fact, it is just a wrapper for `coord_fixed()` with a parameterised aspect ratio based on the number of bricks. The default number of bricks is 4. To change the width of the line outlining the brick, use the `linewidth` parameter as normal.

To change the number of bricks per layer, specify the `bricks_per_layer` parameter in both the geom and coord functions.

```
mpg |>
count(class, drv) |>
ggplot() +
geom_brick(aes(class, n, fill = drv), bricks_per_layer = 6) +
coord_brick(6)
```

You can change the width of the columns, similar to `geom_col()`, to add more space between the bars. To maintain the aspect ratio you also need to set the width in `coord_brick()`.

```
mpg |>
count(class, drv) |>
ggplot() +
geom_brick(aes(class, n, fill = drv), width = 0.5) +
coord_brick(width = 0.5)
```

To get more space between each brick use the `gap` parameter.

```
mpg |>
count(class, drv) |>
ggplot() +
geom_brick(aes(class, n, fill = drv), gap = 0.04) +
coord_brick()
```

For no gap, set `gap = 0` or use the shorthand `geom_brick0()`.

```
mpg |>
count(class, drv) |>
ggplot() +
geom_brick0(aes(class, n, fill = drv)) +
coord_brick()
```

For fun, I’ve included a parameter to randomise the fill of the bricks or add a small amount of variation at the join between two groups. The proportions are maintained and designed to just give a different visual.

```
mpg |>
count(class, drv) |>
ggplot() +
geom_brick(aes(class, n, fill = drv), type = "soft_random") +
coord_brick()
```

```
mpg |>
count(class, drv) |>
ggplot() +
geom_brick(aes(class, n, fill = drv), type = "random") +
coord_brick()
```

`geom_waffle()` has the same functionality as `geom_brick()` but the bricks are square, giving a standard waffle chart. I added this so you can make a normal waffle chart in the same way you would use `geom_col()`. It requires `coord_waffle()` to maintain the aspect ratio.

```
mpg |>
count(class, drv) |>
ggplot() +
geom_waffle(aes(class, n, fill = drv)) +
coord_waffle()
```

```
mpg |>
count(class, drv) |>
ggplot() +
geom_waffle0(aes(class, n, fill = drv), bricks_per_layer = 6) +
coord_waffle(6)
```

You may want to flip the coords when using `geom_waffle()`. To do so you'll need to use `coord_flip()` and `theme(aspect.ratio = <number>)`. I haven't made a `coord_waffle_flip()`, yet!

```
mpg |>
count(class, drv) |>
ggplot() +
geom_waffle0(aes(class, n, fill = drv)) +
coord_flip() +
theme(aspect.ratio = 1.8)
```

I think `geom_brick()` pairs with `with_shadow()` and `with_inner_glow()` pretty well!

```
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggbrick)
library(ggfx)
library(showtext)
library(ggtext)
font_add_google("Karla", "karla")
showtext_auto()
ft <- "karla"
txt <- "grey10"
bg <- "white"
survivoR::challenge_results |>
filter(
version == "US",
outcome_type == "Individual",
result == "Won"
) |>
left_join(
survivoR::castaway_details |>
select(castaway_id, gender, bipoc),
by = "castaway_id"
) |>
left_join(
survivoR::challenge_description |>
mutate(type = ifelse(race, "Race", "Endurance")) |>
select(version_season, challenge_id, type),
by = c("version_season", "challenge_id")
) |>
count(type, gender) |>
drop_na() |>
ggplot() +
with_shadow(
with_inner_glow(
geom_brick(aes(type, n, fill = gender), linewidth = 0.1, bricks_per_layer = 6)
),
x_offset = 4,
y_offset = 4
) +
coord_brick(6) +
# blue_pink is a custom colour palette defined elsewhere in my setup
scale_fill_manual(values = blue_pink[c(5, 1, 4)]) +
labs(
title = toupper("Survivor Challenges"),
subtitle = "Approximately a third of races and half of endurance challenges\nare won by women.",
fill = "Gender",
caption = "Individual challenges only. The different proportions of men and women at merge hasn't been taken into consideration."
) +
theme_void() +
theme(
text = element_text(family = ft, colour = txt, lineheight = 0.3, size = 32),
plot.background = element_rect(fill = bg, colour = bg),
plot.title = element_markdown(size = 128, colour = txt, hjust = 0.5, margin = margin(b = 10)),
plot.subtitle = element_text(hjust = 0.5, size = 48, margin = margin(b = 30)),
plot.caption = element_text(size = 24, hjust = 0, margin = margin(t = 20)),
axis.text = element_text(vjust = 0),
axis.title.y = element_blank(),
plot.margin = margin(t = 30, b = 10, l = 30, r = 30),
legend.position = "top"
)
```

The post ggbrick is now on CRAN appeared first on Dan Oehm | Gradient Descending.

The post Survivor Auction analysis: Should you bid on the first covered item? appeared first on Dan Oehm | Gradient Descending.

The Survivor Auction is classic. Seeing hungry people bid, win, and then binge on the food they just purchased is pretty good stuff.

It’s not always great. Remember bat soup? A covered item could be the meal of your dreams or it could be bat soup. So, when a covered item comes out, should you bid on it?

I’ll be running a simulation model to determine if bidding on the first covered item is a good idea or a terrible one.

Before that, I’ll give a rundown of the data and show some summary stats of Survivor Auctions over the years.

All data is available in the {survivoR} R package and code at the bottom of the post.

Should you bid on the first covered item?

It’s best not to.

Should you bid on any?

Probably not.

But this is Survivor, and people want to take risks. So if they’re going to bid on one, which one should it be?

Go with the second one.

I’ll use the `survivor_auction` and `auction_details` datasets. A couple of things to note:

- I’ll only use the US version data, but it could be expanded to other versions.
- The auctions have changed and evolved over the years. For example, the Season 5 auction was held early in the game and the tribe bid together, and in earlier seasons people could pool money, which was restricted later on. Fortunately, none of those things should complicate what we’re looking at here.
- What is considered a ‘bad item’ is subjective. You could argue that ‘rice and water’ isn’t a bad item, but in the spirit of the Survivor Auction, players wouldn’t be happy with that when they were hoping for a burger and chips or some sort of protein.

A borderline case for me was the skewer of chicken hearts Stephen purchased in Tocantins. It’s probably not what he hoped for, but it’s far from bat soup, and you can buy them at the supermarket. I’ve recorded it as ‘food and drink’ rather than a ‘bad item’ but could be convinced otherwise. If the skewer had been regular chicken meat, it’d be fine.

Another case was Will purchasing his removal from the auction at the very beginning, which seems bad, but back at camp he found the location of hidden rations. The result was good even though he couldn’t participate in the rest of the auction. I’ve categorised this as ‘food and drink’.
- Where an item is for multiple people, it is still considered one item. For example, letters from home are a common item. Usually, one person wins the bid and then it’s opened up to everyone. I consider this as one item.
- In the case where a covered item is purchased and then auctioned again to another player, e.g. Austin buying the giant fish eyes, it is only one item even though it was auctioned off twice. The second time it was uncovered, though.
- Occasionally players are given the option to switch to an alternative covered item, where one of them is likely bad and the other isn’t. So far most have refused, but Erik in Micronesia switched and got the good item. I’ve ignored this for the moment for data reasons, but it’s worth looking into.

Survivor Auctions aren’t super clean from a data point of view. There aren’t strict rules, or the rules have changed, and it’s a big collection of edge cases. But I think the way I’ve structured it makes sense.

Whether or not you should bid on an item depends on a few points of uncertainty:

- The number of players attending the Survivor Auction
- The number of items at the Survivor Auction
- The number of covered items
- The number of ‘bad items’

These points affect the chances of winning bat soup. Only the first is known to the player; the other three are unknown.

We need to understand how each of these varies from season to season to know if it’s a good idea to bid or not.

The first auction was held in Season 2 and there have been 17 in total, including its return in Season 45. It is held at different stages of the game. The number of players at the Survivor Auction ranges from 6 to 12, with an average of 8.

Code

```
# set up data frame
library(tidyverse)

df <- survivoR::auction_details |>
filter(
version == "US",
auction_num == 1
) |>
distinct(version_season, item, item_description, category, covered) |>
group_by(version_season) |>
summarise(
n_items = n(),
n_covered = sum(covered),
n_bad = sum(category == "Bad item"),
pos_first_bad = cumsum(covered)[which(category == "Bad item")[1]]
) |>
left_join(
survivoR::survivor_auction |>
count(version_season, name = "n_cast"),
by = "version_season"
) |>
mutate(pos_first_bad = replace_na(pos_first_bad, 99))
# number of castaways
df_n_cast <- df |>
count(n_cast)
```

| Number of castaways at the auction | Number of seasons |
|---|---|
| 6 | 2 |
| 7 | 6 |
| 8 | 3 |
| 9 | 4 |
| 10 | 1 |
| 12 | 1 |

The number of people at the Survivor Auction could help estimate how many items, and therefore how many covered items, there may be. More people meaning more items seems like a reasonable assumption. (Spoiler: it doesn’t matter.)

The number of items at each auction varies from 5 to 12, with an average of 8.

Code

```
# number of items
df_n_items <- df |>
count(n_items)
# categories
df_category <- survivoR::auction_details |>
filter(
version == "US",
auction_num == 1
) |>
distinct(version_season, item, item_description, category, covered) |>
count(category)
```

| Number of items up for grabs | Number of seasons |
|---|---|
| 5 | 2 |
| 6 | 4 |
| 7 | 2 |
| 8 | 4 |
| 9 | 1 |
| 10 | 2 |
| 11 | 1 |
| 12 | 1 |

I’ve binned the items into 5 main categories. Unsurprisingly, the majority are food and drink.

| Category | Number of items |
|---|---|
| Food and drink | 95 |
| Comfort | 8 |
| Advantage | 13 |
| Letter or message from home | 7 |
| Bad item | 9 |

Every auction includes covered items where the player doesn’t know what they’re bidding on. The number of covered items varies from 1 to 5, and on average 3 items are covered.

Code

```
# number of covered items
df_n_covered <- df |>
count(n_covered)
# categories
df_category_covered <- survivoR::auction_details |>
filter(
version == "US",
auction_num == 1,
covered
) |>
distinct(version_season, item, item_description, category) |>
count(category)
```

| Number of covered items | Number of seasons |
|---|---|
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
| 4 | 3 |
| 5 | 2 |

The majority of covered items are food and drink as well; a few are advantages, and 9 are bad items.

| Category | Number of items |
|---|---|
| Food and drink | 34 |
| Advantage | 3 |
| Bad item | 9 |

Here are the 9 bad items that have been purchased over the years.

Code

```
# number of bad items
df_bad <- df |>
count(n_bad) |>
mutate(p = n/sum(n))
survivoR::auction_details |>
group_by(version_season) |>
filter(
category == "Bad item",
auction_num == 1,
version == "US"
) |>
select(season_name, item, item_description, castaway, cost)
# position
df |>
filter(pos_first_bad < 99) |>
count(pos_first_bad) |>
ungroup() |>
mutate(p = n/sum(n))
```

| Season | Item number | Description | Castaway | Cost | The nth covered item |
|---|---|---|---|---|---|
| S2 The Australian Outback | 11 | Glass of river water | Amber | $200 | 1st |
| S5 Thailand | 3 | Baked grubs | Sook Jai | $80 | 1st |
| S6 The Amazon | 2 | Manioc | Alex | $240 | 1st |
| S13 Cook Islands | 6 | Sea cucumber | Sundra | $140 | 3rd |
| S16 Micronesia | 3 | Fruit bat soup | Natalie | $240 | 3rd |
| S19 Samoa | 2 | Sea noodles and slug guts with parmesan cheese | Shambo | $240 | 1st |
| S26 Caramoan | 8 | Pig brain | Brenda | $300 | 4th |
| S28 Cagayan | 4 | Rice and water | Trish | $60 | 3rd |
| S45 45 | 5 | Two giant fish eyes | Katurah | $480 | 2nd |

There has been a maximum of 1 bad item purchased for a given season. On average, 1 in every 2 seasons includes a bad item. Honestly, not as many as I remember.

| Number of ‘bad items’ | Number of seasons | Percentage |
|---|---|---|
| 0 | 8 | 47% |
| 1 | 9 | 53% |

The bad item has been revealed at different positions, i.e. as the first, second, etc., covered item.

| Position of ‘bad item’ | Number of seasons | Percentage |
|---|---|---|
| First covered item | 4 | 44% |
| Second | 1 | 11% |
| Third | 3 | 33% |
| Fourth | 1 | 11% |
| Fifth | 0 | 0% |

Out of the 9 seasons that had a bad item, 4 were revealed as the first covered item and the other 5 across the second, third, and fourth covered items. There has only been a bad item under the second covered item on one occasion, and that was in Season 45. If this were all we were going by, choosing the second item may be the way to go.

There has also only been one bad item under the 4th covered item and never under the 5th, though there have only been 5 covered items on 2 occasions. Not saying it won’t happen in the future.

I’ll fit a Bayesian simulation model to estimate the probability that each covered item holds the bad one, if any.

Each iteration of the simulation is done by the following:

- Draw the number of covered items in the auction.
- Draw the number of ‘bad items’ in the season.
- Draw the position of the ‘bad item’.

The number of covered items is drawn from a Dirichlet-multinomial distribution using a non-informative prior,

$$\theta \sim \text{Dirichlet}(n_1 + 1, \ldots, n_5 + 1)$$

where $n_i$ is the observed number of seasons with $i$ covered items (table 3). This outputs a vector of probabilities $\theta$ from which a single number of covered items is drawn for each iteration.

The number of ‘bad items’ is drawn from a Beta-Bernoulli distribution using a non-informative prior,

$$\gamma \sim \text{Beta}(a + 1, b + 1)$$

where $a$ and $b$ are the 8 and 9 from the table above. For each iteration, the number of ‘bad items’ is drawn from a Bernoulli distribution using $\gamma$. I’m restricting the simulation to have only one bad item, but it could be expanded to more using a binomial.

In S16 Micronesia, Erik purchased an item and was offered a switch with another item. He switched and got nachos instead of jarred octopus, so we know there were two in that season but they weren’t both won. An edge case I’m willing to ignore right now.

Similar to step 1, the position of the bad item is drawn from a Dirichlet-multinomial using the draws from steps 1 and 2,

$$\phi \sim \text{Dirichlet}(\mathbf{m} + 1)$$

where $\mathbf{m}$ is the vector of frequencies at which the bad item appeared at each position (table 6).

I’ll run 40,000 simulations. Each simulation can be considered a season.
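Condensed to a single iteration, the three draws look like this. This is only a sketch with the counts from the tables above hard-coded; the full simulation code below handles the bookkeeping and the option to condition on items already revealed.

```r
library(dirmult)

# observed counts from the summary tables above
n_covered_obs <- c(4, 4, 4, 3, 2)  # seasons with 1..5 covered items
n_bad_obs <- c(`0` = 8, `1` = 9)   # seasons with 0 / 1 bad items
pos_obs <- c(4, 1, 3, 1, 0)        # seasons with the bad item at position 1..5

# step 1: draw the number of covered items
theta <- as.vector(rdirichlet(1, alpha = n_covered_obs + 1))
n_covered <- sample(1:5, 1, prob = theta)

# step 2: draw whether the season has a bad item
# (shape parameters mirror the full simulation code below)
gamma <- rbeta(1, n_bad_obs[1] + 1, n_bad_obs[2] + 1)
has_bad <- rbinom(1, 1, prob = gamma)

# step 3: draw the position of the bad item, given it exists
if (has_bad == 1) {
  phi <- as.vector(rdirichlet(1, alpha = pos_obs[1:n_covered] + 1))
  pos <- sample(1:n_covered, 1, prob = phi)
}
```

Repeating these three draws 40,000 times and tallying `pos` gives the probabilities below.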

Simulation code

```
library(tidyverse)
library(dirmult)
# set main data frame
df0 <- survivoR::auction_details |>
filter(
version == "US",
auction_num == 1
) |>
distinct(version_season, item, item_description, category, covered) |>
group_by(version_season) |>
summarise(
n_items = n(),
n_covered = sum(covered),
n_bad = sum(category == "Bad item"),
pos_first_bad = cumsum(covered)[which(category == "Bad item")[1]]
) |>
left_join(
survivoR::survivor_auction |>
count(version_season, name = "n_cast"),
by = "version_season"
) |>
mutate(pos_first_bad = replace_na(pos_first_bad, 99))
# parameter to run the sim after a certain number of items are revealed.
.after <- 0
# data
df <- df0 |>
filter(pos_first_bad > .after & n_covered > .after)
# number of items
df_n_items <- df |>
count(n_items)
# number of covered items
df_n_covered <- df |>
count(n_covered)
# number of bad items
df_bad <- df |>
count(n_bad) |>
mutate(p = n/sum(n))
# position of first bad
df_pos_first_bad <- df |>
filter(pos_first_bad < 99) |>
drop_na() |>
count(n_covered, pos_first_bad) |>
group_by(n_covered) |>
mutate(p = n/sum(n))
# simulation
n_covered <- table(df$n_covered)
n_bad <- table(df$n_bad)
# vector of freqs for position simulation
n_pos_obs <- map((.after+1):5, ~{
i <- df_pos_first_bad |>
filter(n_covered == .x) |>
pull(pos_first_bad)
n <- df_pos_first_bad |>
filter(n_covered == .x) |>
pull(n)
obs <- rep(1, .x-.after)
obs[i-.after] <- obs[i-.after] + n
obs
})
# number of sims
n_sims <- 40000
# draw the probabilities
# theta
p_draws_n_covered <- rdirichlet(n = n_sims, alpha = n_covered+1)
# gamma
p_draws_n_bad <- rbeta(n_sims, n_bad[1]+1, n_bad[2]+1)
# draw the values
# y_covered
draws_n_covered <- apply(p_draws_n_covered, 1, function(x) sample((.after+1):5, 1, prob = x))
# y_bad
draws_n_bad <- rbinom(n = n_sims, 1, prob = p_draws_n_bad)
# sample position
pos <- rep(0, n_sims)
# run loop
fixed_n <- FALSE
i <- 3
equal_prob <- FALSE
for(k in 1:n_sims) {
if(draws_n_bad[k] == 1) {
if(!fixed_n) {
i <- draws_n_covered[k]-.after
}
if(equal_prob) {
pos[k] <- sample(1:i, 1, prob = rep(1/i, i))
} else {
p_k <- rdirichlet(1, alpha = n_pos_obs[[i]])
pos[k] <- sample(1:i, 1, prob = p_k)
}
}
}
# get the probs
table(pos)/n_sims
```

The vector `pos` holds all the simulated positions of the ‘bad item’. I’ve coded position ‘0’ for the cases where there wasn’t a bad item, which can be interpreted as the proportion of seasons that don’t have one. This sits at 53%, which is in the right ballpark (8 of the 17 observed auctions had no bad item).

The sum of positions 1 to 5 is 47%, the probability that there will be a ‘bad item’. The next part is interpreting the probabilities at each position.

The probability that the first covered item will be a ‘bad item’ is 25%. This is the highest of all positions, so you may think that’s the worst one to bid on. The later positions are lower partly because the season may only have 1 covered item, or 2, or 3, etc. This probability also accounts for the unknown number of covered items.

We also have to consider how this plays out. Let’s assume you don’t bid on the first item. It turns out to be a good item, and then another covered item is presented. Should you bid on this one?

In this case, the probabilities change. We know with certainty that the first item was good, so the probability that it is bad goes to 0. We also know with certainty that there are at least 2 covered items in the auction. So what are the probabilities now?

Thinking ahead briefly, we could extend this to the 3rd item, 4th, etc. The table below details all the scenarios.

| After ‘n’ covered items are revealed | No bad item | 1st covered item | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|---|
| 0 | 53% | 25% | 9% | 9% | 3% | 1% |
| 1 | 50% | | 23% | 18% | 7% | 2% |
| 2 | 56% | | | 30% | 12% | 3% |
| 3 | 40% | | | | 52% | 8% |
| 4 | 98% | | | | | 2% |

Based on this, skipping the first one and bidding on the second is probably the best, but it’s much of a muchness.
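The conditional rows come from re-running the simulation with the `.after` parameter in the code above set to the number of covered items already revealed. An alternative sketch is to filter the simulated seasons down to those consistent with what has been revealed and re-tally; this assumes the `pos` and `draws_n_covered` vectors from the simulation code above are in the workspace.

```r
# probabilities after the first k covered items have been revealed as good:
# keep simulated seasons where the bad item (if any) is not in the first k
# items and the season has more than k covered items still to come
condition_on_revealed <- function(pos, n_covered, k) {
  keep <- (pos == 0 | pos > k) & n_covered > k
  table(pos[keep]) / sum(keep)
}

# e.g. after the first covered item turned out to be good
condition_on_revealed(pos, draws_n_covered, k = 1)
```

The two approaches won’t match exactly, since `.after` also refits the priors on the filtered observed data, but they should land in the same ballpark.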

If the number of covered items is known (or we simply make an assumption), it changes slightly.

Let’s assume there are 3 covered items since there are on average 3. The table is now…

| After ‘n’ covered items are revealed | No bad item | 1st covered item | 2nd | 3rd |
|---|---|---|---|---|
| 0 | 53% | 15% | 8% | 24% |
| 1 | 50% | | 13% | 38% |
| 2 | 55% | | | 45% |

In this case, go with the second one, maybe the first, but don’t go with the third.

The best strategy is to not bid on the covered items.

But that’s boring, huh? Ok, bid on the second one instead.

If the first one is the bad item, then by our assumptions and past seasons there is only one bad item per auction, so you should be safe from there. Of course, there could still be another; it just hasn’t happened before.

Have you ever rolled a standard 6-sided die a handful of times and tallied up the results? You’re likely to roll a given number 0, 1, or 2 times, but it could be more. If you estimate the probabilities from those results, you’ll just get noise. We know the chance of rolling a given number is 1/6, not 0/6 or 2/6 or whatever.
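A quick illustration in R:

```r
set.seed(1)

# tally a small number of rolls of a fair die
rolls <- sample(1:6, size = 20, replace = TRUE)
table(factor(rolls, levels = 1:6)) / length(rolls)

# the empirical proportions bounce around the true value of 1/6;
# with only 20 rolls, a face showing up 0, 2, or 5 times is all plausible
```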

That’s kind of what’s happening here. If we get to season 40k, we’ll probably find that the number of covered items and the positions of the bad items balance out. I could be wrong; there could be subtle trends we pick up on. I’ll follow up after season 40k.

The best strategy at the Survivor Auction (in my opinion) is to just slam down 500 bucks or however many bucks you’ve got on the first food item you see.

Anyway, that was a fun way to say it doesn’t really matter that much! At the very least it should give you a good idea of how you can use the `auction_details` dataset.

The post Survivor Auction analysis: Should you bid on the first covered item? appeared first on Dan Oehm | Gradient Descending.

This one rounds out another full year of Tidy Tuesday. A quick one looking at the number of lines of code by the major version number. On average, the higher the version number, the more lines of code. Surprise, surprise!

Code Github

The post Tidy Tuesday week 52: R Package Structure appeared first on Dan Oehm | Gradient Descending.

Catching up on the lead-up to the holiday break. This was a quick one looking at the ratings of The Simpsons holiday episodes. It’s simple but I like the use of the yellow, blue, and Simpsons font.

Code Github

The post Tidy Tuesday week 51: Holiday Episodes appeared first on Dan Oehm | Gradient Descending.

Approaching the holiday season, we’re looking at Christmas / holiday movies. This list was curated to movies that include the words Christmas, Holiday, Hanukkah, or Kwanzaa in the title. So it doesn’t include classic ‘Christmas’ movies like Die Hard or Home Alone, much to my disappointment.

I attempted to bring the Christmas theme to life by plotting the 20 most popular movies on the list as baubles on a Christmas Tree. The higher up the tree, the higher the IMDb rating. The ‘x’ variable is simply random. I tried to use a quantitative variable but nothing really worked. So in the interest of time, this is what you get. I like it though.

Code Github. The tree was generated with Midjourney.

The post Tidy Tuesday week 50: Holiday Movies appeared first on Dan Oehm | Gradient Descending.

This week we’re looking at life expectancy across the globe. I wanted to look at the data holistically across each country and then focus on Australia. This would be great for an interactive chart built in Shiny but that’s for another day.

The first chart shows how overall life expectancy has increased over the years 1970-2020 as well as the rank change for each country. It also highlights events that have impacted life expectancy in some countries e.g. Rwanda in the 90s.

The second highlights Australia, whose life expectancy for children born in a given year has increased from 71 to 84, and which has climbed from 33rd to 5th in the world rankings.

Code Github

The post Tidy Tuesday week 49: Life Expectancy appeared first on Dan Oehm | Gradient Descending.

This week we’re looking at Doctor Who episode data compiled by Jonathan Kitt. This was a quick one charting the relationship between the episode rating and the number of viewers.

- The images were cropped using {cropcircles}
- The colour palette was created using {eyedroppeR}
- To get the facets in the right place, I hacked it by adding season numbers -1 and 0 and placing the logo in facet -1 and the subtitle text in facet 0.

The hardest part was finding who played the doctor in each season.

Code Github

The post Tidy Tuesday week 48: Doctor Who appeared first on Dan Oehm | Gradient Descending.
