I am pleased to announce survivoR 0.9.2 is now on CRAN. The survivoR data package is a collection of datasets […]

The post survivoR now on CRAN! appeared first on Daniel Oehm | Gradient Descending.

]]>I am pleased to announce **survivoR** **0.9.2** is now on CRAN. The **survivoR **data package is a collection of datasets detailing events and the cast across all 40 seasons of the US Survivor, including castaway information, vote history, immunity and reward challenge winners, jury votes, and viewers. It also includes season and tribe colour palettes, and ggplot2 scale functions.

To install, simply enter

```
install.packages("survivoR")
```

Or from Github with the following

```
devtools::install_github("doehm/survivoR")
```

For more details on the content of the package please follow the link below or visit the Github page.

I intend to update the data after each episode airs (by that I mean within a few days after the episode airs and probably on the weekend – just to manage expectations). There is plenty of data here, but there is also plenty of room for expansion. I will be adding new features, such as details on exiled castaways across the seasons, challenge details, and when hidden immunity idols were found, played, and for who over time. If you have a request for specific data let me know in the comments and I’ll see what I can do. Also, if you’d like to contribute by adding to existing datasets or contribute a new dataset, please contact me directly on the contacts page.

Just to jazz up the post a little, here are a few plots I’ve produced recently using the data.

The post survivoR now on CRAN! appeared first on Daniel Oehm | Gradient Descending.

]]>596 episodes. 40 seasons. 1 package! I’m a pretty big fan of Survivor and have religiously watched every season since […]

The post survivoR | Data from the TV series in R appeared first on Daniel Oehm | Gradient Descending.

]]>596 episodes. 40 seasons. 1 package!

I’m a pretty big fan of Survivor and have religiously watched every season since the first. With 40 seasons under its belt, there’s a tonne of data to dive into. However, getting that data in one place has been tedious. Hence, the survivoR package.

survivoR is a collection of datasets detailing events across all 40 seasons of the US Survivor, including castaway information, vote history, immunity and reward challenge winners, jury votes, and viewers.

survivoR is now on CRAN. Install the package using

```
install.packages("survivoR")
```

Or from Github with the following.

```
devtools::install_github("doehm/survivoR")
```

Below are all the datasets that are contained within the package.

A data frame containing summary details of each season of Survivor, including the winner, runner ups and location. This is a nested data frame given there maybe 1 or 2 runner-ups. By using a nested data frame the grain is maintained to 1 row per season.

```
season_summary
```

```
#> # A tibble: 40 x 17
#> season_name season location country tribe_setup full_name winner runner_ups
#> <chr> <int> <chr> <chr> <chr> <glue> <chr> <list>
#> 1 Survivor: ~ 1 Pulau T~ Malays~ Two tribes~ Richa~ Richard <tibble [~
#> 2 Survivor: ~ 2 Herbert~ Austra~ Two tribes~ Tina ~ Tina <tibble [~
#> 3 Survivor: ~ 3 Shaba N~ Kenya Two tribes~ Ethan~ Ethan <tibble [~
#> 4 Survivor: ~ 4 Nuku Hi~ Polyne~ Two tribes~ Vecep~ Vecepia <tibble [~
#> 5 Survivor: ~ 5 Ko Taru~ Thaila~ Two tribes~ Brian~ Brian <tibble [~
#> 6 Survivor: ~ 6 Rio Neg~ Brazil Two tribes~ Jenna~ Jenna <tibble [~
#> 7 Survivor: ~ 7 Pearl I~ Panama Two tribes~ Sandr~ Sandra <tibble [~
#> 8 Survivor: ~ 8 Pearl I~ Panama Three trib~ Amber~ Amber <tibble [~
#> 9 Survivor: ~ 9 Efate, ~ Vanuatu Two tribes~ Chris~ Chris <tibble [~
#> 10 Survivor: ~ 10 Koror, ~ Palau A schoolya~ Tom W~ Tom <tibble [~
#> # ... with 30 more rows, and 9 more variables: final_vote <chr>,
#> # timeslot <chr>, premiered <date>, premier_viewers <dbl>, ended <date>,
#> # finale_viewers <dbl>, reunion_viewers <dbl>, rank <dbl>, viewers <dbl>
```

```
season_summary %>%
select(season, viewers_premier, viewers_finale, viewers_reunion, viewers_mean) %>%
pivot_longer(cols = -season, names_to = "episode", values_to = "viewers") %>%
mutate(
episode = to_title_case(str_replace(episode, "viewers_", ""))
) %>%
ggplot(aes(x = season, y = viewers, colour = episode)) +
geom_line() +
geom_point(size = 2) +
theme_minimal() +
scale_colour_survivor(16) +
labs(
title = "Survivor viewers over the 40 seasons",
x = "Season",
y = "Viewers (Millions)",
colour = "Episode"
)
```

Season and demographic information about each castaway. Within a season the data is ordered by the first voted out to sole survivor indicated by `order`

which represents the order they castaways left the island. This may be by being voted off the island, being evacuated due to medical reasons, or quitting. When demographic information is missing, it likely means that the castaway re-entered the game at a later stage by winning the opportunity to return. Castaways that have played in multiple seasons will feature more than once with the age and location representing that point in time.

```
castaways %>%
filter(season == 40)
```

```
#> # A tibble: 22 x 15
#> season_name season castaway nickname age city state day original_tribe
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 Survivor: ~ 40 Natalie~ Natalie <NA> <NA> <NA> 2 Sele
#> 2 Survivor: ~ 40 Amber M~ Amber 40 Pens~ Flor~ 3 Dakal
#> 3 Survivor: ~ 40 Danni B~ Danni 43 Shaw~ Kans~ 6 Sele
#> 4 Survivor: ~ 40 Ethan Z~ Ethan 45 Hill~ New ~ 9 Sele
#> 5 Survivor: ~ 40 Tyson A~ Tyson <NA> <NA> <NA> 11 Dakal
#> 6 Survivor: ~ 40 Rob Mar~ Rob 43 Pens~ Flor~ 14 Sele
#> 7 Survivor: ~ 40 Parvati~ Parvati 36 Los ~ Cali~ 16 Sele
#> 8 Survivor: ~ 40 Sandra ~ Sandra 44 Rive~ Flor~ 16 Dakal
#> 9 Survivor: ~ 40 Yul Kwon Yul 44 Los ~ Cali~ 18 Dakal
#> 10 Survivor: ~ 40 Wendell~ Wendell 35 Phil~ Penn~ 21 Dakal
#> # ... with 12 more rows, and 6 more variables: merged_tribe <chr>,
#> # result <chr>, jury_status <chr>, order <int>, swapped_tribe <chr>,
#> # swapped_tribe2 <chr>
```

This data frame contains a complete history of votes cast across all seasons of Survivor. This allows you to see who voted for who at which tribal council. It also includes details on who had individual immunity as well as who had their votes nullified by a hidden immunity idol. This details the key events for the season.

While there are consistent events across the seasons such as the tribe swap, there are some unique events such as the ‘mutiny’ in Survivor: Cook Islands (Season 13) or the ‘Outcasts’ in Survivor: Pearl Islands (season 7). When castaways change tribes by some means other than a tribe swap, it is still recorded as ‘swapped’ to maintain a standard.

The data is recorded as ‘swapped’ with a trailing digit if a swap has occurred more than once. This includes absorbed tribes when 3 tribes are reduced to 2 or when Stephanie was ‘absorbed’ in Survivor: Palau (season 10) when everyone but herself was voted off the tribe (and making Palau one of the classic seasons of Survivor). To indicate a change in tribe status these events are also considered ‘swapped’.

This data frame is at the tribal council by castaway grain, so there is a vote for everyone that attended the tribal council. However, there are some edge cases such as when the ‘steal a vote’ advantage is played. In this case, there is a second row for the castaway indicating their second vote.

In the case of a tie and a revote, the first vote is recorded and the result is recorded as ‘Tie’. The deciding vote is recorded as normal. Where there is a double tie, it is recorded as ‘Tie2’ (for lack of a better name). In the case of a double tie and it goes to rocks, the vote is either ‘Black rock’ or ‘White rock’. In the older episodes of Survivor, when there were two ties in a row, rather than going to rocks there was a countback of votes.

```
vh <- vote_history %>%
filter(
season == 40,
episode == 10
)
vh
```

```
#> # A tibble: 9 x 11
#> season_name season episode day tribe_status castaway immunity vote
#> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 Survivor: ~ 40 10 25 merged Tony individ~ Tyson
#> 2 Survivor: ~ 40 10 25 merged Michele <NA> Tyson
#> 3 Survivor: ~ 40 10 25 merged Sarah <NA> Deni~
#> 4 Survivor: ~ 40 10 25 merged Sarah <NA> Tyson
#> 5 Survivor: ~ 40 10 25 merged Ben <NA> Tyson
#> 6 Survivor: ~ 40 10 25 merged Nick <NA> Tyson
#> 7 Survivor: ~ 40 10 25 merged Kim <NA> Soph~
#> 8 Survivor: ~ 40 10 25 merged Sophie <NA> Deni~
#> 9 Survivor: ~ 40 10 25 merged Tyson <NA> Soph~
#> # ... with 3 more variables: nullified <lgl>, voted_out <chr>, order <dbl>
```

```
vh %>%
count(vote)
```

```
#> # A tibble: 5 x 2
#> vote n
#> <chr> <int>
#> 1 Denise 2
#> 2 Immune 1
#> 3 None 1
#> 4 Sophie 2
#> 5 Tyson 5
```

Events in the game such as fire challenges, rock draws, steal-a-vote advantages, or countbacks (in the early days) often mean a vote wasn’t placed for an individual. Rather a challenge may be won, lost, no vote cast, etc but attended tribal council. These events are recorded in the `vote`

field. I have included a function `clean_votes`

for when only the votes cast for individuals are needed. If the input data frame has the `vote`

column it can simply be piped.

```
vh %>%
clean_votes() %>%
count(vote)
```

```
#> # A tibble: 3 x 2
#> vote n
#> <chr> <int>
#> 1 Denise 2
#> 2 Sophie 2
#> 3 Tyson 5
```

A nested tidy data frame of immunity and reward challenge results. The winners and winning tribe of the challenge are found by expanding the `winners`

column. For individual immunity challenges the winning tribe is simply `NA`

.

```
challenges %>%
filter(season == 40)
```

```
#> # A tibble: 28 x 7
#> season_name season episode title day challenge_type winners
#> <chr> <dbl> <dbl> <chr> <dbl> <chr> <list>
#> 1 Survivor: Winne~ 40 1 Greatest of~ 2 reward <tibble[,2~
#> 2 Survivor: Winne~ 40 1 Greatest of~ 2 immunity <tibble[,2~
#> 3 Survivor: Winne~ 40 1 Greatest of~ 3 immunity <tibble[,2~
#> 4 Survivor: Winne~ 40 2 It's Like a~ 6 reward <tibble[,2~
#> 5 Survivor: Winne~ 40 2 It's Like a~ 6 immunity <tibble[,2~
#> 6 Survivor: Winne~ 40 3 Out for Blo~ 9 reward <tibble[,2~
#> 7 Survivor: Winne~ 40 3 Out for Blo~ 9 immunity <tibble[,2~
#> 8 Survivor: Winne~ 40 4 I Like Reve~ 11 reward <tibble[,2~
#> 9 Survivor: Winne~ 40 4 I Like Reve~ 11 immunity <tibble[,2~
#> 10 Survivor: Winne~ 40 5 The Buddy S~ 14 immunity <tibble[,2~
#> # ... with 18 more rows
```

Typically in the merge, if a single person wins a reward they are allowed to bring others along with them. The first castaway in the expanded list is likely to be the winner and the subsequent players those they brought along with them. Although, not always. Occasionally in the merge, the castaways are split into two teams for the purpose of the reward, in which case all castaways win the reward rather than a single person.

The `day`

field on this data set represents the day of the tribal council rather than the day of the challenge. This is to more easily associate the reward challenge with the immunity challenge and result of the tribal council. It also helps for joining tables.

Note the challenges table is the combined immunity and rewards tables which will eventually be dropped in later releases.

This data frame contains the history of jury votes. It is more verbose than it needs to be. However, having a 0-1 column indicating if a vote was placed for the finalist makes it easier to summarise castaways that received no votes.

```
jury_votes %>%
filter(season == 40)
```

```
#> # A tibble: 48 x 5
#> season_name season castaway finalist vote
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Survivor: Winners at War 40 Sarah Michele 0
#> 2 Survivor: Winners at War 40 Sarah Natalie 0
#> 3 Survivor: Winners at War 40 Sarah Tony 1
#> 4 Survivor: Winners at War 40 Ben Michele 0
#> 5 Survivor: Winners at War 40 Ben Natalie 0
#> 6 Survivor: Winners at War 40 Ben Tony 1
#> 7 Survivor: Winners at War 40 Denise Michele 0
#> 8 Survivor: Winners at War 40 Denise Natalie 0
#> 9 Survivor: Winners at War 40 Denise Tony 1
#> 10 Survivor: Winners at War 40 Nick Michele 0
#> # ... with 38 more rows
```

```
jury_votes %>%
filter(season == 40) %>%
group_by(finalist) %>%
summarise(votes = sum(vote))
```

```
#> # A tibble: 3 x 2
#> finalist votes
#> <chr> <dbl>
#> 1 Michele 0
#> 2 Natalie 4
#> 3 Tony 12
```

A data frame containing the viewer information for every episode across all seasons. It also includes the rating and viewer share information for viewers aged 18 to 49 years.

```
viewers %>%
filter(season == 40)
```

```
#> # A tibble: 14 x 9
#> season_name season episode_number_~ episode title episode_date viewers
#> <chr> <dbl> <dbl> <dbl> <chr> <date> <dbl>
#> 1 Survivor: ~ 40 583 1 Grea~ 2020-02-12 6.68
#> 2 Survivor: ~ 40 584 2 It's~ 2020-02-19 7.16
#> 3 Survivor: ~ 40 585 3 Out ~ 2020-02-26 7.14
#> 4 Survivor: ~ 40 586 4 I Li~ 2020-03-04 7.08
#> 5 Survivor: ~ 40 587 5 The ~ 2020-03-11 6.91
#> 6 Survivor: ~ 40 588 6 Quic~ 2020-03-18 7.83
#> 7 Survivor: ~ 40 589 7 We'r~ 2020-03-25 8.18
#> 8 Survivor: ~ 40 590 8 This~ 2020-04-01 8.23
#> 9 Survivor: ~ 40 591 9 War ~ 2020-04-08 7.85
#> 10 Survivor: ~ 40 592 10 The ~ 2020-04-15 8.14
#> 11 Survivor: ~ 40 593 11 This~ 2020-04-22 8.16
#> 12 Survivor: ~ 40 594 12 Frie~ 2020-04-29 8.08
#> 13 Survivor: ~ 40 595 13 The ~ 2020-05-06 7.57
#> 14 Survivor: ~ 40 596 14 It A~ 2020-05-13 7.94
#> # ... with 2 more variables: rating_18_49 <dbl>, share_18_49 <dbl>
```

This data frame contains the tribe names and colours for each season, including the RGB values. These colours can be joined with the other data frames to customise colours for plots. Another option is to add tribal colours to ggplots with the scale functions.

```
tribe_colours
#> # A tibble: 139 x 7
#> season_name season tribe_name r g b tribe_colour
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Survivor: Winners at War 40 Sele 0 103 214 #0067D6
#> 2 Survivor: Winners at War 40 Dakal 216 14 14 #D80E0E
#> 3 Survivor: Winners at War 40 Yara 4 148 81 #049451
#> 4 Survivor: Winners at War 40 Koru 0 0 0 #000000
#> 5 Survivor: Island of the Ido~ 39 Lairo 243 148 66 #F39442
#> 6 Survivor: Island of the Ido~ 39 Vokai 217 156 211 #D99CD3
#> 7 Survivor: Island of the Ido~ 39 Lumuwaku 48 78 210 #304ED2
#> 8 Survivor: Edge of Extinction 38 Manu 16 80 186 #1050BA
#> 9 Survivor: Edge of Extinction 38 Lesu 0 148 128 #009480
#> 10 Survivor: Edge of Extinction 38 Kama 250 207 34 #FACF22
#> # ... with 129 more rows
```

Included are ggplot2 scale functions of the form `scale_fill_survivor()`

and `scale_fill_tribes()`

to add season and tribe colours to ggplot. The `scale_fill_survivor()`

scales uses a colour palette extracted from the season logo and `scale_fill_tribes()`

scales uses the tribal colours of the specified season as a colour palette.

Simply input the desired season number. If no season is provided it will default to season 40.

```
castaways %>%
count(season, personality_type) %>%
ggplot(aes(x = season, y = n, fill = personality_type)) +
geom_bar(stat = "identity") +
scale_fill_survivor(40) +
theme_minimal()
```

Below are the palettes for all seasons.

To use the tribe scales, simply input the season number desired to use those tribe colours. If the fill or colour aesthetic is the tribe name, this needs to be passed to the scale function as `scale_fill_tribes(season, tribe = tribe)`

(for now) where `tribe`

is on the input data frame. If the fill or colour aesthetic is independent from the actual tribe names, like gender for example, `tribe`

does not need to be specified and will simply use the tribe colours as a colour palette, such as the viewers line graph above.

```
ssn <- 35
labels <- castaways %>%
filter(
season == ssn,
str_detect(result, "Sole|unner")
) %>%
mutate(label = glue("{castaway} ({original_tribe})")) %>%
select(label, castaway)
jury_votes %>%
filter(season == ssn) %>%
left_join(
castaways %>%
filter(season == ssn) %>%
select(castaway, original_tribe),
by = "castaway"
) %>%
group_by(finalist, original_tribe) %>%
summarise(votes = sum(vote)) %>%
left_join(labels, by = c("finalist" = "castaway")) %>%
{
ggplot(., aes(x = label, y = votes, fill = original_tribe)) +
geom_bar(stat = "identity", width = 0.5) +
scale_fill_tribes(ssn, tribe = .$original_tribe) +
theme_minimal() +
labs(
x = "Finalist (original tribe)",
y = "Votes",
fill = "Original\ntribe",
title = "Votes received by each finalist"
)
}
```

This data provides a way to deeper analyse each season and the plays within each episode. For example, we could construct a graph of who voted for who, where the castaway is the node and the edge is who they voted for using the vote history data. While in this representation it’s possible to use clustering algorithms to identify alliances in the data. Other uses include identifying the probability of players jumping ship and pivotal votes. This is particularly interesting for the first 1 or 2 tribals of the merge to see if players stick with their original tribe or jump ship.

```
library(extrafont)
loadfonts()
ft <- "Segoe UI Light"
ssn <- 40
df <- vote_history %>%
filter(
season == ssn,
order == 13
)
nodes <- df %>%
distinct(castaway) %>%
mutate(id = 1:n()) %>%
rename(label = castaway)
edges <- df %>%
count(castaway, vote) %>%
left_join(
nodes %>%
rename(from = id),
by = c("castaway" = "label")
) %>%
left_join(
nodes %>%
rename(to = id),
by = c("vote" = "label")
) %>%
mutate(arrows = "to") %>%
rename(value = n) %>%
left_join(
castaways %>%
filter(season == ssn) %>%
select(castaway, original_tribe),
by = "castaway"
)
labels <- edges %>%
select(from, to, castaway, original_tribe) %>%
distinct(from, castaway, original_tribe) %>%
arrange(castaway) %>%
left_join(
edges %>%
count(vote),
by = c("castaway" = "vote")
)
cols <- tribe_colours$tribe_colour
names(cols) <- tribe_colours$tribe
ggraph(
edges %>%
rename(`Original tribe` = original_tribe),
layout = "linear") +
geom_edge_arc(aes(colour = `Original tribe`), arrow = arrow(length = unit(4, "mm"), type = "closed"), end_cap = circle(10, 'mm')) +
geom_node_point(size = 26, colour = cols[labels$original_tribe]) +
geom_node_point(size = 24, colour = "black") +
geom_node_text(aes(label = labels$castaway), colour = "grey", size = 4, vjust = 0, family = ft) +
geom_node_text(aes(label = labels$n), colour = "grey", size = 4, vjust = 2, family = ft) +
scale_edge_colour_manual(values = cols[unique(edges$original_tribe)]) +
scale_colour_manual(values = cols[unique(edges$original_tribe)]) +
theme_graph()
```

I intend to update the survivoR package each week during the airing of future seasons. For Survivor and data nuts like myself, this will enable a deeper analysis of each episode, and just neat ways visualise the evolution of the game.

New features will be added, such as details on exiled castaways across the seasons. If you have a request for specific data let me know in the comments and I’ll see what I can do. Also, if you’d like to contribute by adding to existing datasets or contribute a new dataset, please contact me directly on the contacts page.

Given the variable nature of the game of Survivor and how the rules are tweaked each season, there are bound to be edge cases where the data is not quite right. Please log an issue on Github, or with me directly in the comments and I will correct the datasets.

Data in the survivoR package was mostly sourced from Wikipedia. Other data, such as the tribe colours, was manually recorded and entered myself.

Torch graphic in hex: Fire Torch Vectors by Vecteezy

If R isn’t your thing you can download an XLSX of the data R data files here

The post survivoR | Data from the TV series in R appeared first on Daniel Oehm | Gradient Descending.

]]>A while ago I simulated Martingale strategies and some variations. This is the strategy where you double your bet on […]

The post Don’t get fooled by the Gambler’s Fallacy appeared first on Daniel Oehm | Gradient Descending.

]]>A while ago I simulated Martingale strategies and some variations. This is the strategy where you double your bet on each loss in order to win your money back. I simulated a number of different strategies to see if it gave you the edge or at least minimised your risk of losing.

TL;DR in the long run you will experience a big loss or incremental smaller losses, but you can minimise your losses by setting limits. It’s not a strategy that ensures you’ll always win your money back.

One thing I didn’t address was the Gambler’s Fallacy and it has come up more than once since the post. This is the perception that if one outcome has turned up many times in a row, such as black on a roulette wheel, then red must be ‘overdue’ to turn up and believed to be more likely to occur on the next spin. The strategy is to wait until black or red has come up a set number of times in a row and on the next spin place a bet on the other colour, in a sense ‘hacking’ the probability.

In roulette each spin is completely independent of the last meaning all previous spins have absolutely no influence on the next. Mathematically the property of independence is . Independence holds for most typical mechanical chance processes such as flipping a coin. The probability of an outcome is static.

Consider flipping a coin. A strategy may be to wait for 3 heads in a row then place a bet on tails, because tails is ‘overdue’. Let’s look at the probabilities. The probability of flipping a head is

and so the probability of 3 heads in a row is

What’s the probability the next flip is tails? Given each flip is independent the probability the next flip is tails is simply

Therefore the probability of 3 heads followed by a tails is

So, what’s the probability of observing 4 heads in a row?

Exactly the same. In other words, waiting for 3 heads to turn up doesn’t affect the chances of either heads or tails turning up next.

Let’s simulate the Gambler’s Fallacy just to convince ourselves that waiting for a high occurrence of one outcome doesn’t result in a higher probability of an alternative outcome.

A coin is flipped 40 times. The strategy we’ll test is waiting until 10 heads are observed in a row. The probability of this occurring is 0.00098. In the remaining 30 flips we’ll observe the proportion of tails. If the mean proportion of tails is greater than 0.5 then tails was ‘overdue’.

```
library(purrr)
n_sequences <- 1e6
n_flips <- 40
watching_window <- 1:10
bet_window <- (1:n_flips)[-watching_window]
n_heads <- 10
sequences <- map(1:n_sequences, ~sample(coin, n_flips, replace = TRUE))
heads <- sequences[map_lgl(sequences, function(sequence) sum(sequence[watching_window] == "heads") >= n_heads)]
tails_tally <- map_dbl(heads, function(sequence) sum(sequence[bet_window] == "tails"))
tibble(p_tails = tails_tally/30)
```

Did we hack the probability of tails being flipped? Nah. The mean proportion of tails observed in the remaining 30 flips is 0.493, very close to the expectation 0.5. In other words, tails was not ‘overdue’ to turn up. If we did hack the probability, the proportion would be something >0.5.

Our brains are funny and not random. If we saw 10 heads turn up in a row we would perceive that as a pretty remarkable event given the probability of that occurring is 0.00098. But, if we saw HTTTHHTTHT, we wouldn’t see that as anything special. Even though observing this exact sequence has the same probability as 10 heads in a row, at 0.00098.

Our brains are good at recognising patterns and 10 heads in a row is an easy pattern to recognise. Although, some combination of heads and tails, less so. In the above case I suspect (given my unqualified opinion on how brains work) we see 4 heads and 6 tails as a summary of the 10 flips. It’s not exactly a conscious thing, rather just how we summarise information. Observing 4 heads and 6 tails in any order has a probability of 0.21, which is much more likely to occur and so don’t see it as anything special.

The ‘Hot-hand’ fallacy is related the Gambler’s Fallacy where bets are placed on the same outcome as the last when a streak is expected. This arose in basketball where players could be considered ‘hot’ if they are having a good game. This has some logical basis since human performance is not entirely a stochastic process. If a player is having a cracker of a game it’s reasonable to expect higher scores / more scoring shots.

In the case of mechanical chance games like flipping a coin or roulette, this does not hold for the same reasons as discussed. Independence.

If you’re playing Roulette, flipping a coin, or something similar, don’t bother with waiting for a run to get an edge. You may as well just start betting straight away. The best strategy you can have is to go all in and walk away on your first bet. Not exactly fun. The next best is to set a limit and stick to it. It’s also worthwhile keeping a record of your profit / loss for each session so you are aware of your global profit / loss. It’s easy to remember the wins and forget about the losses. Personally, I opt for video games instead.

```
# density of tails outcomes
tibble(p_tails = tails_tally/30) %>%
ggplot(aes(x = p_tails)) +
geom_density(fill = "purple2", alpha = 0.8, colour = "purple2") +
geom_vline(xintercept = 0.5, linetype = 1) +
geom_vline(xintercept = 0.493, linetype = 2) +
annotate("segment", x = 0.493, xend = 0.64, y = 3, yend = 3, linetype = 2) +
annotate("segment", x = 0.5, xend = 0.64, y = 2.7, yend = 2.7, linetype = 1) +
annotate("text", x = 0.67, y = 3, label = "Mean: 0.493", size = 12, hjust = 0, family = "userfont") +
annotate("text", x = 0.67, y = 2.7, label = "p = 0.5", size = 12, hjust = 0, family = "userfont") +
theme_minimal() +
theme(
plot.title = element_blank(),
axis.title = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(margin = margin(0.5, 0, 0.5, 0)),
plot.subtitle = element_text(lineheight = 0.3, margin = margin(0, 0, 0, 50), hjust = 0)
) +
scale_x_continuous(breaks = round(seq(5, 25, 5)/30, 1), labels = round(seq(5, 25, 5)/30, 1))
```

The post Don’t get fooled by the Gambler’s Fallacy appeared first on Daniel Oehm | Gradient Descending.

]]>What’s that saying? Choice is the enemy of happiness, or something like that. There are so many colour palettes out […]

The post Evolve new colour palettes in R with evoPalette appeared first on Daniel Oehm | Gradient Descending.

]]>What’s that saying? Choice is the enemy of happiness, or something like that. There are so many colour palettes out there, that saying tends to resonate when either choosing one, or creating a new one. So, I started to think about a way I could make this more organic, and came up with the idea for evoPalette.

**evoPalette** allows for the discovery and generation of new colour palettes for R. Taking user input it uses an evolutionary algorithm to spawn new palettes with the expectation the colours you like will be combined over time. This can be repeated as many times as needed until you find something that works for you.

There’s a lot to colour theory, no choice is trivial, and so leaving some of those choices largely to an algorithm will have varying success. This is trading best practice for fun and discovery but I’m okay with that!

Very briefly, evolutionary algorithms operate in 3 main steps:

**Selection**: Each candidate has a measure of fitness and those that reach some threshold survive. In this case parents of the next generation are selected by the user and so fitness is essentially personal preference.**Crossover:**Children are spawned from a random combination of two parents genes. In this case colours represent a gene.**Mutation**: Mutation occurs in two ways 1) there is a chance of a completely new colour generated for a palette and 2) each colour varies locally around it’s initial value.

The process can continue for as long as needed.

You can install evoPalette from Github with the following command.

```
devtools::install_github("doehm/evoPalette")
```

(It’s only on Git for the moment. We’ll see how it goes.)

To jump straight into it launch the Shiny app with

```
launch_evo_palette()
```

To generate the first set of palettes, click **‘evolve’**. Using the default setting the initial palettes will be randomly drawn from a standard set saved in evoPalette and paletteer. Paletteer contains some 1700+ palettes from a number of packages.

Select one or more palettes that you like and think may work well together from the check box and click **‘evolve’**. The next generation of palettes will be created.

With each generation you’ll notice how the palette is converging to a similar theme based on your selections, along with some variation of each individual colour and a chance of a completely new colour. The features that you liked about one palette have a chance to be combined with other colours and flow through to the next generation. Although, this isn’t completely ensured since there is still a stochastic element at play, but you may discover another combination that works well for you. The selected parent palettes are shown in the sidebar for reference.

Selecting only a single parent is convenient for generating minor variations to tweak the colours.

Select** ‘Example: Fill aesthetic’** to see how the palette works in practice (similarly for **‘Example: Colour aesthetic’**).

To save a palette, select the desired palette from the check box and click ‘**save**‘. A box will appear giving the opportunity to change the randomly generated palette name but, why would you want to!

The palette is now accessible from `palette_box()`

once the app is closed. Multiple palettes can be saved which are collected in the palette box, just save one at a time. The palette box is refreshed when starting a new session so remember to save it to disk.

```
palette_box()
$instructive_customers
[1] "#C7B6C8" "#74B6BA" "#7783C4" "#B56D4E" "#F28E47" "#FCE269"
```

To begin again, deselect all palettes, click **‘evolve’**. New palettes are randomly selected beginning a new process.

A palette is viewed by using `show_palette()`

.

```
show_palette(palette_box()["instructive_customers"])
```

The continuous scales are simply created from the two colours on the ends and one in the middle (room for improvement).

The user has some control over the algorithm by adjusting the mutation rate and variation allowed for each colour. Select the parameters from the drop down in the menu.

- Number of colours to generate for each palette. Initial starting palettes with more / fewer colours are coerced to have the desired amount. So, a standard palette with more or fewer than this amount will be slightly different.
- Number of palettes to generate at each evolution. The default is 9.
- Mutation rate – Each colour in the palette has a probability of random mutation equal to this value (default p = 0.05)
- Mutation variation – Each colour will vary slightly around it’s original value.
- Load existing palettes from the global environment. Object should be a list, ideally with names i.e. the same format as
`palette_box()`

. - Are you feeling lucky? Select for generating palettes from completely random colours.

When you are happy with your selections, click **‘evolve’** and begin a new process.

The generated colour palettes are easily added to ggplots using the scale functions `scale_fill_evo`

and `scale_colour_evo`

.

```
library(ggplot2)
mpg %>%
ggplot(aes(x = displ, fill = class)) +
geom_histogram() +
scale_fill_evo("instructive_customers")
```

The first palette in the list is used if no name is given. The scales are also parameterised to reverse the palette and switch discrete / continuous.

Colour theory is pretty complex stuff so choosing a good palette isn’t easy, let alone evolving one. So, you’re going to have some hits and some misses. This is definitely more for fun seeing what you discover rather than finding the perfect palette. Having said that you could discover some gold!

There are best practices when choosing a palette for data visualisation depending on the context and what is to be shown. For example people tend to respond to certain colours representing high / low, hot / cold or good / bad, there is also colourblindness considerations. evoPalette won’t necessarily adhere to these ideals.

This is best for generating discrete colour palettes with 4-7 distinct colours. It will help you find colours which are to your liking and generate a palette which is unlikely to already exist. It’s fun to experiment and see what you come up with!

Some of the initial starting palettes I’ve created and are shown below. A few are created from evoPalette, some from a clustering process. The 90’s rock palettes were created from the album covers. Sorry for doubling up on Alice in Chains .

```
g_evo <- readRDS(system.file("extdata/palettes.rds", package = "evoPalette"))
map2(g_evo$palette, g_evo$name, ~show_palette(.x, title = .y)) %>%
wrap_plots(ncol = 2)
```

The post Evolve new colour palettes in R with evoPalette appeared first on Daniel Oehm | Gradient Descending.

]]>Generative Adversarial Networks are great for generating something from essentially nothing and there are some interesting uses for them. Most […]

The post Some basics and intuition behind GAN’s in R and Python appeared first on Daniel Oehm | Gradient Descending.

]]>Generative Adversarial Networks are great for generating something from essentially nothing and there are some interesting uses for them. Most uses are some sort of image processing. Nvidia’s GauGAN is an impressive example, giving the user an MS paint-like interface and generating landscapes. You can give the beta a shot here.

I wanted to take a step back and use a small example to understand the basics and build some intuition behind GAN’s. There’s a tonne of information out there on how to fit a GAN to generate new hand drawn numbers, faces or Pokemon to varying success (the jury is still out as to whether or not the output can pass as Pokemon, but anyway). This isn’t the focus for this post. Instead, I’m going to simplify things further and use a GAN to generate sine waves.

Keras and Tensorflow are used for this analysis. While there are R libraries I personally find it easier using Python via reticulate for deep learning tasks. You can find the code on Github. All the code bits in this post refer to functions from this repo.

For this to run correctly you’ll need Ananconda, Python 3.6-3.7 installed with keras and Tensorflow, as well as the standard libraries, numpy, pandas, etc. Along with python you’ll need reticulate installed and configured to use the appropriate version on Python. In short, run `py_config()`

to initialise python for the session and `py_available()`

to check if it’s all good to go. This can be tricky to set up and relies on how you’ve installed Python. If you have trouble refer to the reticulate cheat sheet and documentation.

For training data I’m going to use two wave functions,

with random noise added to throw in some minor variation.

```
get_training_data <- function(n = 200, m = 250, shape = list(a = c(1, 3), p = c(2, 10))) {
mat <- matrix(NA, nrow = n, ncol = m)
n_waves <- length(shape$a)
for(k in 1:n){
ak <- shape$a[(k-1) %% n_waves + 1]
pk <- shape$p[(k-1) %% n_waves + 1]
mat[k,] <- ak*sin(2*pi*seq(0, 1, length = m)*pk) + rnorm(m, 0, 0.05)
}
mat
}
train <- get_training_data()
plot_waves(train)
```

Nothing too complicated, just two distinct waves. Because we are generating training data using these two wave functions we only need to generate a handful of observations.

There are two main components, the generator and discriminator. The generator generates new waves from random input, in this case a standard normal distribution. The discriminator sorts the real from the fake data. During training it will switch between training the discriminator and the generator. At each iteration both components perform better – the generator gets better at generating real observations and the discriminator gets better at determining whether or not the observation is real or fake.

Like any neural network determining the number of hidden layers and sizes is more a process of experimentation than it is a science. For this example what I found worked well is,

All are dense layers with LeakyReLU activation and 20% dropout. Given the input data is distinct it seems like overkill however I found this worked best. I’m sure other network configurations also work.

The input data is time series data, in which case it is appropriate to use recurrent layers for the generator and discriminator. I actually found this to not be very successful. Either it takes far longer to train or just has trouble converging to a good solution, not saying it can’t be done though. For more challenging problems you’ll need more advanced models. Fortunately for this example we can keep it simple.

To train the GAN on these two functions we simply run.

`gan(train, epochs = 2000)`

This will call the Python script with the GAN code, run it in Python for 2000 epochs and return the results. The training is saved in the global environment as `x_train`

which is then able to be imported into the Python environment with `r.x_train`

. A log file is created within the working directory and records the progress every 100 epochs.

Once training has finished, view the output by `py$gan_r$output`

. At each iteration set by `trace`

a set of waves are generated. The `plot_waves`

function will plot a set from the final iteration.

Recall, we only fed the GAN two sine functions, which makes the output below pretty cool. Here we see 12 randomly generated waves from the final iteration of the GAN.

`plot_waves(py$gan_r$output, seq(1, 24, 2))`

With a single draw from a random normal distribution the GAN iteratively adjusted it’s weights until it learnt to generate and identify each wave. But also it learnt to generate new waves. What stands out here is the new waves appear to be some combination of the input waves. What we’ve done is found a really, really expensive and not particularly accurate way to estimate this…

where is between 0 and 1. This can be seen in the plot below where 12 waves have been plotted for different values of .

Without explicitly telling the GAN what the functions were, it managed to learn them and explore the space between. While it estimated the frequency well it didn’t quite explore the whole range of amplitudes. They tend to range between 1.5-2.5 rather than 1-3. With more training it would probably get there.

This took a few goes as training the GAN tends to converge to one of the input functions. By generating only one of the waves with high accuracy it can trick the discriminator into thinking it’s real every time. It’s a solution to the problem just not a very exciting one. With tuning we can get the desired result.

Each starting value will correspond to some kind of wave. Out of the 12 random waves, 4 are very similar, right down to the two little spikes at the top of the second crest (see the waves in the third column). This suggests this wave is mapped to a set of values that may be drawn with a higher probability.

This isn’t as sexy as generating new landscape images using Paint but it’s helpful to understand what is going on within the guts of the GAN. It attempts to identify the key features in the observations making it distinct from random noise, pass as a real observation and map the space between.

The same process is essentially happening at scale for more complex tasks. In this example it’s very inefficient to get the result we were looking for, but as the problem becomes more and more complex the trade-off makes sense.

With an image example such as faces, the GAN will identify what a nose looks like and the range of noses that exist in the population, what an eye looks like and that they generally come in pairs at a slightly varying distance apart and so on. It will then generate eyes and noses along some space of possible eyes and noses, in the same way it generated a wave along some space of possible waves given the input.

What’s interesting, the GAN only maps the space between. There are no examples where it generated a wave with a frequency greater than 10 or less than 2. Nor did it generate a wave with an amplitude greater than 3. The input waves essentially formed the boundaries of what could be considered real. There may be GAN variations that allow for this exploration.

The easiest way to get the code for this example is from Github. Either clone or install.

`devtools::install_github("doehm/rgan")`

The Python component is quite long and don’t think there is much to gain pasting it here. As mentioned the trickiest part is going to be configuring Python, keras and tensorflow but the R bits should work.

If you want to explore the output of the GAN the data is found at `inst/extdata/gan-output.Rdata`

in the Github repo. This will show you how the GAN improved with each iteration. Each element of the list is a sample of generated waves at iteration 100, 200, …, 2000. This data is the basis of the output waves above. e.g.

`plot_waves(gan_output[[1]], id = 1:12, nrow = 3)`

The code below created the title graphic. It is an area chart of waves using 48 values for . Perhaps worthy of an accidental aRt submission.

```
shape <- list(a = c(1, 3), p = c(2, 10))
m <- 250
ln <- 48
pal <- c("#4E364B", "#8D4E80", "#D86C15", "#F3C925", "#48B2A8")
map_dfr(seq(0, 1, length = ln), ~data.frame(
x = seq(0, 1, length = m),
y = .x*shape$a[1]*sin(2*pi*seq(0, 1, length = m)*shape$p[1]) +
(1-.x)*shape$a[2]*sin(2*pi*seq(0, 1, length = m)*shape$p[2]),
a = round(.x, 2))) %>% filter(x < 0.5) %>%
ggplot(aes(x = x, y = y, colour = a, fill = a, group = as.factor(a))) +
geom_area() +
theme_void() +
theme(legend.position = "none") +
scale_fill_gradientn(colors = colorRampPalette(pal)(200)) +
scale_colour_gradientn(colors = colorRampPalette(pal)(200))
```

The post Some basics and intuition behind GAN’s in R and Python appeared first on Daniel Oehm | Gradient Descending.

]]>Bayesian networks are really useful for many applications and one of those is to simulate new data. Bayes nets represent […]

The post Simulating data with Bayesian networks appeared first on Daniel Oehm | Gradient Descending.

]]>Bayesian networks are really useful for many applications and one of those is to simulate new data. Bayes nets represent data as a probabilistic graph and from this structure it is then easy to simulate new data. This post will demonstrate how to do this with bnlearn.

Before simulating new data we need a model to simulate data from. Using the same Australian Institute of Sport dataset from my previous post on Bayesian networks we’ll set up a simple model. For convenience I’ll subset the data to 6 variables.

```
library(DAAG)
library(tidyverse)
# ais data from the DAAG package
ais_sub <- ais %>% dplyr::select(sex, sport, pcBfat, hg, rcc, hc)
```

The variables sex and sport are pretty straight forward. The remaining four are

- pcBfat – percent of body fat
- hg – hemoglobin concentration
- rcc – red cell count
- hc – hematocrit percentage

I’ve allowed the data to learn the structure of the network, bar one arc, sport to percentage of body fat. The details are not shown here, but check out the post above on how to fit the structure algorithmically (also I suggest heading to the bnlearn doco which has great examples of a number of networks that can be downloaded). The structure is defined by the string and converted to a `bn`

class object.

```
library(visNetwork)
library(bnlearn)
# set structure
bn_struct <- model2network("[sex][sport|sex][hg|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc]")
bn_struct
```

## ## Random/Generated Bayesian network ## ## model: ## [sex][hg|sex][sport|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc] ## nodes: 6 ## arcs: 7 ## undirected arcs: 0 ## directed arcs: 7 ## average markov blanket size: 2.67 ## average neighbourhood size: 2.33 ## average branching factor: 1.17 ## ## generation algorithm: Empty

```
# plot network - code for function at end of post
plot.network(bn_struct)
```

Now that we have set the structure of the model it is fit to the data with `bn.fit`

using maximum likelihood estimation.

```
bn_mod <- bn.fit(bn_struct, data = ais_sub, method = "mle")
```

The output is quite detailed so it’s worth running `bn_mod`

to view the conditional probability tables and Gaussian distributions.

New data is simulated from a Bayes net by first sampling from each of the root nodes, in this case sex. Then followed by the children conditional on their parent(s) (e.g. sport | sex and hg | sex) until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.

From this point it’s easy to simulate new data using `rbn`

. Here we simulate a dataset the same size as the original, but you can simulate as many rows as needed.

```
ais_sim <- rbn(bn_mod, 202)
head(ais_sim)
```

## hc hg pcBfat rcc sex sport ## 1 38.00754 12.43565 13.772499 3.917082 f Swim ## 2 45.54250 15.79388 13.586402 4.824458 m Field ## 3 49.87429 17.31542 5.308675 5.814398 m T_400m ## 4 49.05707 17.19443 9.230973 5.337367 m W_Polo ## 5 37.66307 12.99088 13.685909 4.020170 f Gym ## 6 42.33715 14.62894 15.147165 4.440556 f Netball

Done. We now have a fully synthetic dataset which retains the properties of the original data. And it only took a few lines of code.

An important property of generating synthetic data is that it doesn’t use real data to do so, meaning any predictors need to be simulated first (my post on the synthpop package explains this in more detail). This property is retained since the data is generated sequentially as per the structure of network. Also, when using synthpop the order in which the variables are simulated needs to be set. The order can alter the accuracy of simulated dataset and so it’s important to spend the time to get it right. For a Bayesian network the order is determined by the structure, so in effect this step is already done.

The original and simulated datasets are compared in a couple of ways 1) observing the distributions of the variables 2) comparing the output from various models and 3) comparing conditional probability queries. The third test is more of a sanity check. If the data is generated from the original Bayes net then a new one fit on the simulated data should be approximately the same. The more rows we generate the closer the parameters will be to the original values.

The variable distributions are very close to the original with only a small amount of variation, mostly observed in sport. Red cell count may have a slight bi-modal distribution but in most part it’s a good fit. This amount of variation is reasonable since there are only 202 simulated observations. Simulating more rows will be a closer fit but there are often practical considerations for retaining the same size dataset.

For the second check, two linear models are fit to the original and simulated data to predict hematocrit levels with sex, hemoglobin concentration, percentage of body fat and red cell count as predictors. Sport was left out of the model since it was not a strong predictor of hc and only increased the error.

```
# fit models
glm_og <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sub)
glm_sim <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sim)
# print summary
summary(glm_og)
```

## ## Call: ## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sub) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -3.9218 -0.4868 0.0523 0.5470 2.8983 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.15656 1.04109 4.953 1.57e-06 *** ## hg 1.64639 0.11520 14.291 < 2e-16 *** ## rcc 3.04366 0.31812 9.568 < 2e-16 *** ## pcBfat -0.02271 0.01498 -1.517 0.131 ## sexm -0.20182 0.23103 -0.874 0.383 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.888242) ## ## Null deviance: 2696.92 on 201 degrees of freedom ## Residual deviance: 174.98 on 197 degrees of freedom ## AIC: 556.25 ## ## Number of Fisher Scoring iterations: 2

```
summary(glm_sim)
```

## ## Call: ## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sim) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.2117 -0.6232 0.0089 0.5910 2.0073 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.318057 0.953131 5.580 7.9e-08 *** ## hg 1.633765 0.107235 15.235 < 2e-16 *** ## rcc 2.905111 0.299397 9.703 < 2e-16 *** ## pcBfat -0.001506 0.014768 -0.102 0.919 ## sexm 0.319853 0.220904 1.448 0.149 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.7520993) ## ## Null deviance: 3014.20 on 201 degrees of freedom ## Residual deviance: 148.16 on 197 degrees of freedom ## AIC: 522.64 ## ## Number of Fisher Scoring iterations: 2

The coefficients and test statistics of the models are very similar, so both datasets result in the same conclusions. Percent of body fat is the least accurate but still make the same conclusion. In practice you should fit more models to assess the quality of the simulated data.

As mentioned the third is more of a sanity check but it is also a good demonstration of the process. By fitting the same structure to the simulated data we expect to estimate the same parameters and calculate very similar conditional probabilities. Here we simulate 20 000 observations to better estimate the parameters. The conditional probability for the athletes red cell count given the sport they compete in i.e. what is the probability the athletes red cell count will be greater than where is the 33rd and 66th percentile?

```
library(progress)
# fit bayes net with the same structure using simulated
ais_sim <- rbn(bn_mod, 2e4)
bn_mod_sim <- bn.fit(bn_struct, ais_sim, method = "mle")
# sport - mcpquery function at end of post - it generalises bnlearn::cpquery
# rcc is continuous so we're calculating the probability rcc > some value, here the 33rd and 66th percentile
# the function replaces xx and yy with actually values and evaluates the text
given <- "sport == 'xx'"
event <- "rcc > yy"
vals <- as.matrix(expand.grid(sport = unique(ais_sub$sport), rcc = quantile(ais_sub$rcc, c(0.33, 0.66))))
orig <- mcpquery(bn_mod, values = vals, token = c("xx", "yy"), event, given, n = 1e6) %>% spread(rcc, cp)
```

## P( rcc > yy | sport == 'xx' )

```
sim <- mcpquery(bn_mod_sim, values = vals, token = c("xx", "yy"), event, given, n = 1e6) %>% spread(rcc, cp)
```

## P( rcc > yy | sport == 'xx' )

```
# join for display
left_join(orig, sim, by = "sport", suffix = c("_orig", "_sim"))
```

## # A tibble: 10 x 5 ## sport `4.4533_orig` `4.9466_orig` `4.4533_sim` `4.9466_sim` ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 B_Ball 0.687 0.310 0.682 0.302 ## 2 Field 0.764 0.386 0.768 0.384 ## 3 Gym 0.477 0.0658 0.475 0.0717 ## 4 Netball 0.444 0.0584 0.442 0.0575 ## 5 Row 0.652 0.270 0.654 0.271 ## 6 Swim 0.752 0.370 0.758 0.376 ## 7 T_400m 0.767 0.388 0.769 0.386 ## 8 T_Sprnt 0.822 0.449 0.826 0.445 ## 9 Tennis 0.640 0.251 0.639 0.249 ## 10 W_Polo 0.945 0.571 0.944 0.566

The conditional probabilities from the simulated data are very close to the original as expected. Now we can be confident that our simulated data can be used as an alternative to the original.

Another useful property of Bayes nets is to impute missing values. This is easily done using `impute`

. We’ll remove 25% of the observations from variables hg and hc, and allow the Bayes net to impute them.

```
ais_miss <- ais_sub
miss_id <- sample(1:nrow(ais_sub), 50)
ais_miss[miss_id, c("hg", "hc")] <- NA
# table counting the NA values in hg and hc
apply(ais_miss[,c("hg", "hc")], 2, function(x) table(is.na(x)))
```

## hg hc ## FALSE 152 152 ## TRUE 50 50

The table confirms there are 50 missing observations from hemoglobin and hematocrit variables. Now impute using Bayesian likelihood weighting.

```
ais_imp <- impute(bn_mod, data = ais_miss, method = "bayes-lw")
```

Plotting the imputed against the true values shows the Bayes net imputed the missing values quite well.

I’ve only tested and shown two variables but the others would perform similarly for the subset of data I have chosen. This data is normally distributed so I expected it to work well, however if your data has more complex relationships you’ll need to be more rigorous with defining the structure.

```
# plot network function
plot.network <- function(structure, ht = "400px", cols = "darkturquoise", labels = nodes(structure)){
if(is.null(labels)) labels <- rep("", length(nodes(structure)))
nodes <- data.frame(id = nodes(structure),
label = labels,
color = cols,
shadow = TRUE
)
edges <- data.frame(from = structure$arcs[,1],
to = structure$arcs[,2],
arrows = "to",
smooth = FALSE,
shadow = TRUE,
color = "black")
return(visNetwork(nodes, edges, height = ht, width = "100%"))
}
# conditional probability query functions
mcpquery <- function(mod, values, token, event, given, ns = 1e4){
cat("P(", event, "|", given, ")\n")
UseMethod("mcpquery", values)
}
# numeric
mcpquery.numeric <- function(mod, values, token, event, given, ns = 1e4){
y <- rep(NA, length(values))
pb <- progress_bar$new(
format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
clear = FALSE, total = length(values))
for(k in 1:length(values)){
givenk <- gsub(token, values[k], given)
eventk <- gsub(token, values[k], event)
pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
}
return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}
# character
mcpquery.character <- function(mod, values, token, event, given, ns = 1e4){
y <- rep(NA, length(values))
pb <- progress_bar$new(
format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
clear = FALSE, total = length(values))
for(k in 1:length(values)){
givenk <- gsub(token, values[k], given)
eventk <- gsub(token, values[k], event)
pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
}
return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}
# matrix
mcpquery.matrix <- function(mod, values, token, event, given, ns = 1e4){
n <- nrow(values)
y <- rep(NA, n)
pb <- progress_bar$new(
format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
clear = FALSE, total = n)
for(k in 1:n){
givenk <- given
eventk <- event
for(j in 1:ncol(values)){
givenk <- gsub(token[j], values[k,j], givenk)
eventk <- gsub(token[j], values[k,j], eventk)
}
pb$tick(token = list(k = k, n = n, event = eventk, given = givenk))
y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
}
out <- as.tibble(values) %>%
bind_cols(tibble(cp = y))
colnames(out)[1:ncol(values)] <- colnames(values)
return(out)
}
# compare the synthetic and original data frames
df <- ais_sub %>%
mutate(type = "orig") %>%
bind_rows(
rbn(bn_mod, 202) %>%
mutate(type = "sim")
) # %>%
gg_list <- list()
grp_var <- "type"
vars <- colnames(df)[colnames(df) != grp_var]
for(k in 1:length(vars)){
var_k <- vars[k]
gg_list[[k]] <- ggplot(df, aes_string(x = var_k, fill = grp_var, col = grp_var))
if(is.numeric(df[[var_k]])){
gg_list[[k]] <- gg_list[[k]] + geom_density(alpha = 0.85, size = 0)
}else{
gg_list[[k]] <- gg_list[[k]] + geom_bar(position = "dodge")
}
gg_list[[k]] <- gg_list[[k]] +
theme(
axis.text.x = element_text(angle = 90),
axis.title.x = element_blank()
) +
labs(title = var_k)
}
```

The post Simulating data with Bayesian networks appeared first on Daniel Oehm | Gradient Descending.

]]>Previously I posted on how to apply gradient descent on linear regression as an example. With that as background it’s […]

The post Use more of your data with matrix factorisation appeared first on Daniel Oehm | Gradient Descending.

]]>Previously I posted on how to apply gradient descent on linear regression as an example. With that as background it’s relatively easy to extend the logic to other problems. One of those is matrix factorisation.

There are many ways to factorise a matrix into components such as PCA, singular value decomposition (SVD), but one way is to use gradient descent. While it’s inception is in image processing, it was popularised by it’s use with recommender systems (Funk SVD). But, it is also found useful in other ways such as market basket analysis and topic modelling. Once you know the nuts and bolts and how to apply it you’ll find uses that are not recommenders.

Firstly, the maths.

Consider a matrix . We want to factorise into two matrices and such that

and have dimensions x and x where typically . These are estimated in the following steps 1) initialise each matrix with random values 2) iteratively adjust each matrix by the direction of negative gradient to minimise the MSE and 3) stop at the maximum iterations or a specified tolerance.

The objective function is the mean squared error and a regularisation term.

Here is an indicator where if the rating or value exists and 0 if it is missing. The gradients with respect to and are

The factor is dropped to keep it tidy. By doing so this quantity is effectively rolled into the learning rate, if you work through the maths. For each iteration and are adjusted by

where is the learning rate.

Factorising a matrix into two lower dimensional matrices learns the latent information. For example, if your data is a patient-drug profile it may learn the diagnosis of the patient. In the case of a recommender if your data are song ratings it will learn those that prefer metal over pop, classical over electro, etc. The first column of the matrix may represent the level of distortion in a song, or tempo as a numeric value. What the algorithm picks out as important isn’t necessarily obvious and worth inspecting those on the extremes to get a feel for the data. The user matrix works similarly. The numeric values in the user row represent how much weight that user gives to the particular feature.

Why this is useful over PCA or SVD is it handles missing values. The other methods treat every cell as a value so need to be imputed before factorising. It will then use the imputed values to learn the data structure. The beauty of this method is it factorises only on the observed values and imputes the missings.

I’ve developed a simple package to apply the method. There are other packages which are more optimised but this is just simple and easy to use. You can install it from Github (not on CRAN).

`devtools::install_github("doehm/matrixfactorisation")`

Simulate some data.

```
m <- matrix(sample(c(NA, 1:5), 60, replace = TRUE, prob = c(0.2, rep(0.8/5, 5))), nrow = 10)
m
```

## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] NA NA 1 4 2 2 ## [2,] NA 2 NA 1 5 4 ## [3,] 5 3 3 1 4 4 ## [4,] NA 1 3 5 3 5 ## [5,] 4 4 3 2 4 3 ## [6,] NA 4 5 NA 4 4 ## [7,] NA NA 2 3 3 2 ## [8,] 2 NA 2 NA 3 2 ## [9,] 2 3 5 2 3 NA ## [10,] 3 3 2 2 5 5

Select a test set and factorise. Selecting a test set holds out a proportion of randomly selected observations. It ensures there is at least 1 observation remaining in each row. This can be increased by `min_row`

.

```
id <- select_test(m, 0.2)
mf <- matrix_factorisation(m, 2, test = id$test, pbar = TRUE)
```

## 00:00:02 // dimensions 2 // epoch 1000 // train 1.2703 // test 1.6059 // delta 9.467e-04

Plot the convergence.

```
plot(mf)
```

Viewing the convergence profile is useful to see whether or not the data has been overfit.

Typically missing values are treated before clustering which may include removing certain rows or columns. Usually this also means throwing away actual observations. If the data has a proportionally large amount of missing values you could be throwing away most of the data. This is quite typical for ‘user-item’ type data. Matrix factorisation enables the use of more data.

This is demonstrated in a small example. A matrix is randomised with 12 users, 6 items and 2 groups with a high degree of missingness. The group is the latent variable. The values are drawn from a Poisson distribution . Those that are in group 1 are drawn from the distribution with and those in group 2 are drawn from . It is then simple to see the relationship between the users, items and groups. Count data is very common and often has it’s own challenges when clustering, particularly when counts are small or only part of the units history is observed e.g. a data set where the grain is the patient, the items are prescription drugs and the values are counts of that drug.

```
user_a <- function(m) c(rpois(m/2, 5), rpois(m/2, 1))
user_b <- function(m) c(rpois(m/2, 1), rpois(m/2, 5))
df <- NULL
n <- 12
m <- 6
grp <- as.factor(sort(rep(letters[1:2], n/2)))
for(k in 1:n){
df <- rbind(df, if(k <= 6) user_a(m) else user_b(m))
}
df
```

## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 6 7 6 0 3 2 ## [2,] 5 5 3 1 1 1 ## [3,] 7 3 7 1 0 0 ## [4,] 5 5 5 0 0 0 ## [5,] 3 1 3 2 0 0 ## [6,] 5 7 4 0 1 0 ## [7,] 0 0 2 7 6 6 ## [8,] 3 1 0 6 9 8 ## [9,] 2 0 2 6 4 5 ## [10,] 0 2 1 5 4 9 ## [11,] 0 2 2 4 4 5 ## [12,] 0 0 2 1 6 8

We’ll now suppress 60% of the values.

```
df_na <- select_test(df, 0.6)$train
df_na
```

## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] NA 7 NA NA NA 2 ## [2,] 5 NA 3 NA NA NA ## [3,] NA 3 7 NA 0 NA ## [4,] 5 5 5 0 0 NA ## [5,] 3 1 3 NA NA 0 ## [6,] 5 NA NA 0 NA 0 ## [7,] 0 NA NA NA NA 6 ## [8,] 3 1 NA 6 9 8 ## [9,] NA 0 2 NA 4 NA ## [10,] 0 2 NA NA NA 9 ## [11,] 0 2 2 NA NA 5 ## [12,] NA NA 2 NA 6 NA

This data can be challenging to use, but by factorising using only the observed data it is embedded in a much more usable form (mathematically this is similar to using the autoencoder – PCA vs autoencoders). Using the prediction matrix all units can be clustered. While there is error in the prediction matrix it should be accurate enough to find the two groups.

```
# factorise
df_mf <- matrix_factorisation(df_na, 2, epochs = 2000)
```

```
# cluster
km <- kmeans(df_mf$pred, 2, nstart = 100)
clus <- as.factor(km$cluster)
names(clus) <- grp
clus
```

## a a a a a a b b b b b b ## 1 1 1 1 1 1 2 2 2 2 2 2 ## Levels: 1 2

```
table(clus, grp)
```

## grp ## clus a b ## 1 6 0 ## 2 0 6

A simple K-means clustering process finds the two groups, which is encouraging with a highly patchy and small data set. K-means can struggle with low value count data so embedding the data in this form can really help.

Often the original data set can be large and therefore inefficient to cluster. The same results (or a at least a very good approximation) is achieved using the matrix instead of the prediction matrix.

```
# cluster
km <- kmeans(df_mf$u, 2, nstart = 100)
clus <- as.factor(km$cluster)
names(clus) <- grp
clus
```

## a a a a a a b b b b b b ## 1 1 1 1 1 1 2 2 2 2 2 2 ## Levels: 1 2

```
table(clus, grp)
```

## grp ## clus a b ## 1 6 0 ## 2 0 6

In this case we achieve the same result which is great. This is an approximation so may not always work as well, in fact if you randomised the input matrix a number of times I’m sure you’ll see different results. But in general you should achieve very reasonable results most of the time.

In practice I have had success using this technique on large data sets. While my preferred method is to use the autoencoder for dimension reduction and embedding, there is an overhead in setting it up. On the other hand matrix factorisation is quick and simple to get you exploring the data sooner. Clustering is just one example and I’m sure you’ll find more applications that benefit from matrix factorisation.

The post Use more of your data with matrix factorisation appeared first on Daniel Oehm | Gradient Descending.

]]>If you’re like me, when running a process through a loop you’ll add in counters and progress indicators. That way […]

The post Fun with progress bars: Fish, daggers and the Star Wars trench run appeared first on Daniel Oehm | Gradient Descending.

]]>If you’re like me, when running a process through a loop you’ll add in counters and progress indicators. That way you’ll know if it will take 5 minutes or much longer. It’s also good for debugging to know when the code wigged-out.

This is typically what’s done. You take a time stamp at the start – `start <- Sys.time()`

, print out some indicators at each iteration – `cat(“iteration”, k, “// reading file”, file, “\n”)`

and print out how long it took at the end – `print(Sys.time()-start)`

. The problem is it will print out a new line at each time it is called, which is fine but ugly. You can reduce the number of lines printed by only printing out every 10th or 100th iteration e.g. `if(k %% 10 == 0) …`

.

A simple way to make this better is instead of using `"\n"`

for a new line use `"\r"`

for carriage return. This will overwrite the same line which is much neater. It’s much more satisfying watching a number go up, or down, whichever way is the good direction. Try it out…

```
y <- matrix(0, nrow = 31, ncol = 5)
for(sim in 1:5){
y[1, sim] <- rnorm(1, 0, 8)
for(j in 1:30){
y[j+1, sim] <- y[j, sim] + rnorm(1) # random walk
cat("simulation", sim, "// time step", sprintf("%2.0f", j), "// random walk", sprintf(y[j+1, sim], fmt='% 6.2f'), "\r")
Sys.sleep(0.1)
}
}
```

## simulation 5 // time step 30 // random walk 8.97

The best way is to use the {progress} package. This package allows you to simply add running time, eta, progress bars, percentage complete as well as custom counters to your code. First decide on what counters you want and the format of the string. The function identifies counters by using a colon at the beginning of the label. Check the doco for built-in tokens.

To add your own token add the label to the format string and add the token to `tick()`

. To make it pretty I recommend formatting digits with `sprintf()`

. Here’s an example.

```
library(progress)
pb <- progress_bar$new(format = ":elapsedfull // eta :eta // simulation :sim // time step :ts // random walk :y [:bar]", total = 30*5, clear = FALSE)
y <- matrix(0, nrow = 31, ncol = 5)
for(sim in 1:5){
y[1, sim] <- rnorm(1, 0, 8)
for(j in 1:30){
y[j+1, sim] <- y[j, sim] + rnorm(1) # random walk
pb$tick(tokens = list(sim = sim, ts = sprintf("%2.0f", j), y = sprintf(y[j+1, sim], fmt='% 6.2f')))
Sys.sleep(0.1)
}
}
```

00:00:17 // eta 0s // simulation 5 // time step 30 // random walk -12.91 [====================================================]

You can also jazz it up with a bit of colour with {crayon}. Be careful with this, it doesn’t handle varying string lengths very well and can start a new line exploding your console.

```
library(crayon)
pb <- progress_bar$new(format = green$bold(":elapsedfull // eta :eta // simulation :sim // time step :ts // random walk :y [:bar]"), total = 30*5, clear = FALSE)
...
```

` 00:00:17 // eta 0s // simulation 5 // time step 30 // random walk -12.91 [====================================================]`

That’s a much neater progress bar.

Procrastination set in and creative tangents were followed. So, made a progress bar into a big fish which eats smaller fish … and made it green.

```
n <- 300
bar_fmt <- green$bold(":elapsedfull | :icon |")
pb <- progress_bar$new(format = bar_fmt, total = n, clear = FALSE)
icon <- progress_bar_icon("fish", n, 75)
for(j in 1:n){
pb$tick(tokens = list(
icon = token(icon, j)
))
Sys.sleep(0.03)
}
```

Each fish represents 25% completion. Once they’re all gobbled up, the job is done.

I also threw knives at boxes. Each box represents 20% completion.

```
n <- 300
bar_fmt <- green$bold(":elapsedfull | :icon |")
pb <- progress_bar$new(format = bar_fmt, total = n, clear = FALSE)
icon <- progress_bar_icon("dagger", n, 75)
for(j in 1:n){
pb$tick(tokens = list(
icon = token(icon, j)
))
Sys.sleep(0.03)
}
```

And my personal favourite, the Star Wars trench run.

```
n <- 500
bar_fmt <- green$bold(":elapsedfull | :icon |")
pb <- progress_bar$new(format = bar_fmt, total = n, clear = FALSE)
icon <- progress_bar_icon("tiefighter", n, 75)
for(j in 1:n){
pb$tick(tokens = list(
icon = token(icon, j)
))
Sys.sleep(0.03)
}
```

Ok… I have spent way too long on this! But at least it was fun. If you want to play around with it, feel free to download it from Git.

```
devtools::install_github(“doehm/progressart”)
```

The post Fun with progress bars: Fish, daggers and the Star Wars trench run appeared first on Daniel Oehm | Gradient Descending.

]]>In my previous post I built a Shiny app mapping accidents on Queensland roads which was great at showing the […]

The post Bayesian estimation of fatality rates and accidents involving cyclists on Queensland roads appeared first on Daniel Oehm | Gradient Descending.

]]>In my previous post I built a Shiny app mapping accidents on Queensland roads which was great at showing the problematic areas within cities and regional areas. I have added to this by estimating the fatality rate given the observed accidents and the rate of accidents involving cyclists for SA3 and SA4 areas. I have also updated the filters making them tidier. What follows in commentary of what can be found in the Shiny app. If you want to jump straight to it, run

```
shiny::runGitHub("doehm/road-accidents/", "doehm", launch.browser = TRUE)
```

I used a Bayesian approach to estimate the fatality rate for 2017 (the data isn’t complete for 2018) and presented it as a proportion of the number of observed road accidents. The data dates back to 2001 but it’s reasonable to use the most recent data to allow for improvements in road conditions, policies, advertising, population growth, etc which may have an underlying impact on the risk of road accidents.

To construct a prior I used years 2014-2016 at SA3 level. By taking 10,000 draws from the posterior we estimate the fatality rate distributions for each area. The bands represent the 80% and 95% prediction interval.

The top 3 most dangerous SA4’s for 2017 are Mackay – Isaac – Whitsunday, Wide Bay and The Queensland outback. It’s not surprising that the large majority of fatalities occurred at high speeds on highways and back roads. Having said that a few have occurred closer to the town centers in these areas. It’s possible the road conditions are not particularly good but that’s hard to determine. The road condition filter only has 4 states, sealed (dry / wet) and unsealed (dry / wet) and doesn’t offer much more information. It appears dry conditions on sealed roads is when most accidents occur.

The 3 least dangerous areas are Brisbane North, Brisbane Inner City and Brisbane South. Again, it shouldn’t be surprising given the speed limit is lower in the city, if accidents occur they are generally less severe. There are a lot of rear ended accidents and accidents when leaving a car park – the classic.

At the SA3 level the incident rates focus on smaller geographical areas which can highlight other features. For example, at the SA4 level the area often includes the city centers and the Hinterland regions which tend to have different rates where it’s often higher in the Hinterlands. This is largely due to the differing speed limits. The most dangerous areas are Burnett, Port Douglas and the Ipswich Hinterlands. The least dangerous are Chermside, Brisbane Inner West and Springfield – Redbank. Click the image to zoom in, there are a lot of SA3’s.

As a cyclist it’s good to know which roads to pay particular attention on. In a similar way we’ll look at the rate of road accidents involving cyclists by area. An influencing factor is how many cyclists are on the roads relative to cars and other vehicles. If the area has a strong cycling culture it’s more likely that an accident will involve a cyclist.

It’s not surprising the dense city centers with more traffic have more accidents involving cyclists. At the SA4 level it’s pretty clear that Brisbane Inner City is a dangerous place to ride a bike, particularly on Adelaide street which you can see from the app (go to show filter > unit type involved > bicycle). The rate of accidents involving cyclists in Brisbane Inner City is significantly higher than all other SA4’s. I’m curious to know if the the growth of the CityCycle scheme (which is awesome by the way) is somewhat a contributing factor. Although, given this was introduced in 2011, and 2005-2010 saw the highest rate of growth in accidents involving cyclists, probably isn’t the biggest factor in the high rate of accidents but it should contribute if more people are on bikes – they’re also not the easiest bikes in the world to ride. The other areas in the top 3, Brisbane City West and Cairns. Sheridan Street, the main drag in Cairns is where you need to be extra vigilant.

Here Noosa tops the chart for 2017. It sits within the Sunshine Coast SA4 region ranking 5th in the chart above which also includes the Hinterland regions which is why it scores lower above. Noosa has a strong cycling culture which could be influencing it’s high rate. With a median of a whopping 18% definitely keep your eyes peeled when cycling on those roads. Also, take care in Sherwood – Indooroopilly. Keep in mind that these areas are not necessarily where the most accidents involving cyclists occur. Rather if there is an accident it’s more likely to have involved a cyclists relative to other areas. If you’re driving in those areas keep an eye out for cyclists on the road, perhaps there aren’t marked cycling lanes or off-road bike paths or some dodgy intersections.

Ideally the rate would be adjusted for the number of cyclists on the road, however this is still a useful statistic to find the more dangerous areas. This analysis has also been done for pedestrians involved in car accidents which is can be found in the app. Just a heads up, be careful in Fortitude Valley.

Run the shiny app

```
library(shiny)
runGitHub("doehm/road-accidents/", "doehm", launch.browser = TRUE)
```

Load the data from Github.

```
library(repmis)
source_data("https://github.com/doehm/road-accidents/raw/master/data/road-accident-data.Rdata")
```

Estimate hyperparameters

```
# get prior information from previous 3 years
# use method of moments to get alpha and beta
# ideally we would use a hierarchical model but this is suitable
hyper_params <- function(y){
mu <- mean(y)
s2 <- var(y)
gamma <- mu*(1-mu)/s2-1
alpha <- gamma*mu
beta <- gamma*(1-mu)
return(list(alpha = alpha, beta = beta))
}
# estimating the actual parameters
df <- accidents_raw %>%
filter(
Loc_Queensland_Transport_Region != "Unknown",
Crash_Year >= 2014,
Crash_Year <= 2016
) %>%
group_by(Loc_ABS_Statistical_Area_3) %>%
summarise(
count = n(),
n_fatalities = sum(Count_Casualty_Fatality)
) %>%
mutate(theta = n_fatalities/(count))
hyper_params(df$theta)
```

Forest plots

```
library(tidyverse)
library(showtext)
# font
font_add_google(name = "Montserrat", family = "mont")
showtext_auto()
# SA3 forest plot of fatality rate
# set my theme and colours
theme_forest <- function(scale = 1){
theme_minimal() +
theme(
legend.position = "none",
axis.text = element_text(family = "mont", size = 16*scale),
axis.text.x = element_text(vjust = 0.5),
axis.title.y = element_blank(),
axis.title.x = element_text(family = "mont", size = 16*scale),
plot.title = element_text(family = "mont", hjust = 0.5, size = 26*scale, face = "bold"),
plot.subtitle = element_text(family = "mont", hjust = 0.5, size = 20*scale),
plot.caption = element_text(size = 12*scale)
)
}
my_cols <- function(n = 16) colorRampPalette(c("darkmagenta", "turquoise"))(n)
# simulate from the posterior
posterior_f <- function(df, y, n, a = 1.3, b = 77, inflator = 100) {
out <- data.frame()
qs <- c(0.025, 0.1, 0.5, 0.9, 0.975)
for(k in 1:nrow(df)){
out <- rbind(out, inflator*rgamma(1e4, shape = a+y[k], rate = b+n[k]) %>% quantile(qs))
}
colnames(out) <- paste0("q", 100*qs)
return(out)
}
# SA4
# fatalities
areas <- grep("Area", colnames(accidents_raw), value = TRUE)[3:4]
names(areas) <- c("sa3", "sa4")
fatality_fn <- function(area){
accidents_raw %>%
group_by_(area) %>%
filter(
Crash_Year == 2017,
) %>%
summarise(
count = length(Count_Casualty_Total),
n_fatalities = sum(Count_Casualty_Fatality)
) %>%
bind_cols(posterior_f(df = ., y = .$n_fatalities, n = .$count)) %>%
arrange(q50) %>%
mutate_(area = interp(~factor(v, level = v), v = as.name(area))) %>%
ggplot() +
geom_segment(mapping = aes(x = q2.5, xend = q97.5, y = area, yend = area)) +
geom_segment(mapping = aes(x = q10, xend = q90, y = area, yend = area, col = q50), size = 2) +
geom_point(mapping = aes(x = q50, y = area), pch = 3) +
theme_forest() +
scale_colour_gradientn(colors = my_cols()) +
labs(
title = "Fatality rate given observed road accidents",
subtitle = paste("Bayesian estimate of the fatality rate for", toupper(names(area)), "areas in 2017"),
x = "Fatality rate (%)")
}
fatality_plots <- list(sa3 = fatality_fn(areas[1]), sa4 = fatality_fn(areas[2]))
```

Cyclist plot

```
# cyclists
df <- accidents_raw %>%
filter(
Crash_Year >= 2014,
Crash_Year <= 2016,
Loc_ABS_Statistical_Area_3 != "Unknown"
) %>%
group_by(Loc_ABS_Statistical_Area_3) %>%
summarise(
n_bicycles = sum(Count_Unit_Bicycle > 0),
n_accidents = n()
) %>%
mutate(p = n_bicycles/n_accidents) %>%
arrange(desc(p))
# estimate hyperparameters
hyper_params(df$p)
# cyclists
cyclist_fn <- function(area){
accidents_raw %>%
group_by_(area) %>%
filter(Crash_Year == 2017) %>%
summarise(
count = n(),
n_bicycles = sum(Count_Unit_Bicycle > 0)
) %>%
bind_cols(posterior_f(df = ., y = .$n_bicycles, n = .$count, a = 1.55, b = 25)) %>%
arrange(q50) %>%
mutate_(area = interp(~factor(v, level = v), v = as.name(area))) %>%
ggplot() +
geom_segment(mapping = aes(x = q2.5, xend = q97.5, y = area, yend = area)) +
geom_segment(mapping = aes(x = q10, xend = q90, y = area, yend = area, col = q50), size = 2) +
geom_point(mapping = aes(x = q50, y = area), pch = 3) +
theme_forest() +
scale_colour_gradientn(colors = my_cols()) +
labs(
title = "Rate of cyclists involved in road accidents",
subtitle = paste("Bayesian estimate of the rate of accidents involving cyclists for", toupper(names(area)), "areas in 2017"),
x = "Accidents involving cyclists (%)"
)
}
cyclist_plots <- list(sa3 = cyclist_fn(areas[1]), sa4 = cyclist_fn(areas[2]))
# Brisbane inner
accidents_raw %>%
filter(Loc_ABS_Statistical_Area_4 == "Brisbane Inner City", Crash_Year < 2018) %>%
group_by(Crash_Year) %>%
summarise(
n_bikes = sum(Count_Unit_Bicycle > 0),
n_accidents = n(),
p_bikes = n_bikes/n_accidents
) %>%
bind_cols(posterior_f(df = ., y = .$n_bikes, n = .$n_accidents, a = 1.55, b = 25)/100) %>%
ggplot(aes(x = Crash_Year, y = q50)) +
geom_line(col = "darkmagenta") +
geom_point(col = "darkmagenta") +
theme_minimal() +
theme_forest() +
labs(
x = "Year",
title = "Accidents involving cyclists - Brisbane Inner City",
subtitle = "Accidents involving cyclists are increasing, indicative of growth in popularity of cycling - Be careful on the roads"
)
```

The post Bayesian estimation of fatality rates and accidents involving cyclists on Queensland roads appeared first on Daniel Oehm | Gradient Descending.

]]>The Queensland government collects data on road accidents dating back to 1st January 2001 and details characteristics of the incident […]

The post Queensland road accidents mapped with Shiny and leaflet in R appeared first on Daniel Oehm | Gradient Descending.

]]>The Queensland government collects data on road accidents dating back to 1st January 2001 and details characteristics of the incident including,

- Location of the crash (lat / long coordinates)
- ABS statistical area codes (SA2-4, LGA, remoteness)
- Atmospheric and road conditions (weather, lighting, sealed / unsealed roads, speed limit zone, etc)
- Severity of the incident (minor injury to fatality)
- Types of vehicles involved (car, bus, truck, bike, etc) and
- Description of the incident

Mapping this data highlights hot spots where car accidents occur more often. In particular the dangerous areas in wet conditions, problematic intersections and the areas of Queensland which are more dangerous than others in terms of fatality rates.

I developed a Shiny App utilising leaflet to easily explore the data (and just for fun). It features,

- A collapsible panel for filtering the data by selecting statistical areas and other features.
- An insights panel showing the breakdown of severity, vehicles involved, accidents over time and a Bayesian estimate of the fatality rate for the selected area.
- Data explorer tab.

This data is of road accidents, so the estimate of fatality rate in this case is the fatality rate given the vehicle was involved in an accident, rather than the fatality rate by road accident in the population. It is a slightly different take on how this statistic is usually published, but a useful one.

The best way to view the app is to run the following code. Firstly, check to make sure you have the packages installed by running

```
check_packages <- function(packages){
if(all(packages %in% rownames(installed.packages()))){
TRUE
}else{
cat("Install the following packages before proceeding\n", packages[!(packages %in% rownames(installed.packages()))], "\n")
}
}
packages_needed <- c("tidyverse", "shiny", "leaflet", "leaflet.extras", "magrittr", "htmltools", "htmlwidgets", "showtext", "data.table")
check_packages(packages_needed)
```

If all good, now run the line below and it will load the app.

`runGitHub("doehm/road-accidents/", "doehm", launch.browser = TRUE)`

This will launch it directly on your machine. Or you can follow the link directly to the Shiny app.

There are a lot of neat things we can do with this data and I’ll be adding to the app over time.

A subset of the app focuses on the “Brisbane Inner” SA3 area to give a taste of what to expect. It shows car accidents in the city since 1st January 2013. When zooming in, hover over the marker to get a short description of the crash.

View the full screen map here.

Below is the underlying code of the example above leaflet map, but I strongly recommend running the code above to view the Shiny app. See Github for the full code.

```
# queensland road accident data
# libraries
library(tidyverse)
library(shiny)
library(leaflet)
library(leaflet.extras)
library(magrittr)
library(htmltools)
library(htmlwidgets)
library(showtext)
library(data.table)
# font
try({
font_add_google(name = "Montserrat", family = "mont")
showtext_auto()
}, TRUE)
# load data
# or if it doesn't work grab the Rdata file from Github - see link above
load_data <- function(){
if(!file.exists("locations.csv")){
cat('\n Download may take a few minutes...\n')
url <- "http://www.tmr.qld.gov.au/~/media/aboutus/corpinfo/Open%20data/crash/locations.csv"
download.file(url, destfile = "locations.csv", method="libcurl")
}
accidents_raw <- read_csv("locations.csv")
return(accidents_raw)
}
accidents_raw <- load_data() %>% filter(Crash_Severity != "Property damage only")
# sample of brisbane inner
accidents <- accidents_raw %>%
filter(
Loc_ABS_Statistical_Area_3 == "Brisbane Inner",
Crash_Year > 2013
) %>%
mutate(fatality = Count_Casualty_Fatality > 0)
# basic leaflet
m <- leaflet(accidents) %>%
addProviderTiles(providers$Stamen.Toner, group = "Black and white") %>%
addTiles(options = providerTileOptions(noWrap = TRUE), group="Colour") %>%
addMarkers(
lng = ~Crash_Longitude_GDA94,
lat = ~Crash_Latitude_GDA94,
clusterOptions = markerClusterOptions(),
label = ~htmlEscape(Crash_DCA_Description)
) %>%
addCircleMarkers(
lng = ~Crash_Longitude_GDA94[accidents$fatality],
lat = ~Crash_Latitude_GDA94[accidents$fatality],
color = "#8B0000",
stroke = FALSE,
fillOpacity = 0.8,
group = "Fatalities"
) %>%
addHeatmap(
lng = ~Crash_Longitude_GDA94,
lat = ~Crash_Latitude_GDA94,
radius = 17,
blur = 25,
cellSize = 25
) %>%
addLayersControl(
overlayGroups = c("Fatalities"),
baseGroups = c("Black and white","Colour"),
options = layersControlOptions(collapsed = FALSE)
)
```

The post Queensland road accidents mapped with Shiny and leaflet in R appeared first on Daniel Oehm | Gradient Descending.

]]>The Buffalo Stampede is one of the most brutal races on the ultra calendar covering 75km and 4545m of climbing. […]

The post Buffalo Stampede Ultra 75km 2019: Race report appeared first on Daniel Oehm | Gradient Descending.

]]>The Buffalo Stampede is one of the most brutal races on the ultra calendar covering 75km and 4545m of climbing. An out-and-back from Bright to the top of Mount Buffalo tapping Mystic and Clear Spot along the way. It’s iconic for it’s brutally steep ascents that are technical and demanding.

This was my second Buffalo Ultra. My first was in 2016 and one of the hardest I’ve done. In 2016 I finished 9th in 9:56:00 and that feeling of crossing the finish line is one of the greatest.

This year my goal was to do better than my 2016 effort and given the work I’ve put in training I was confident heading into it. My goal time was 9:30:00 which was reasonable since Buffalo does it’s best to throw you into the pain cave and ruin your day.

Kicking of at 6am it was still a little dark. It doesn’t take long to hit the first climb up to Mystic, a 500m ascent to get warmed up. I took it very easy. Trying to run this first climb is a death sentence but people still try.

One sure way to destroy your race is attacking the descents too hard, too early. This is easily done on the infamous Mick’s track, at -45% gradient it’s a monster. I took it very easy, saving the quads for later and just got to the bottom safely.

It’s always a good feeling tapping Clear Spot (10km mark) since you know much of the hard climbing is over, for now. I got into a nice rhythm, smashed my mash potato, cruised up and over Keating Ridge and into the Eurobin aid station clocking up 26kms.

At this point the field was pretty spread out. I came in in 3rd place running with 4th. Refueled, left in 4th place and began the 1000m climb to Buffalo Summit. It wasn’t long until I caught and moved into 3rd place. I was feeling very strong on the climbs and took advantage of that.

At the Chalet Aid station, again refueled and quickly began the loop still in 3rd place and about 10 minutes behind second. The Galleries are always fun, scrambling through the rocks. I recall in 2016 I was cramping up massively at this point. This time however, not even close. I was in good shape.

Coming into the Chalet aid station at the 42km mark again, still feeling great. I had really closed the gap to 2nd place who was only 2 minutes ahead, and apparently looking way more fresh which was a good sign!

On the descent I caught and passed 2nd place. I tried to not get too carried away and just kept the legs turning over and remain comfortable.

About halfway down Buffalo I felt the first tweak in my ITB. I’ve been having ITB issues for the last month leading into the race which is just depressing when you’re in top form but can’t run due to an injury. It was well enough to start but this is what I feared.

It did not feel good and brought me to a walk. I didn’t walk for long and tried to take the edge off on the steep downs hoping it wouldn’t do it again.

Entering Eurobin aid station at the 53km mark in 2nd place, 14 mins behind 1st and a few minutes in front of 3rd and 4th. It was a quick stop for me, grabbed more mash, more water melon and electrolyte. I got moving pretty quickly. I was very happy with where my energy levels and felt I had heaps left in the tank. My ITB was hanging in there but it did not feel good at all and just hoping it would hold together for the last 26kms. I smashed the anti-inflammatories hoping that would do something.

I took the climb over Keating Ridge fairly easy since there were two monsters waiting for me at the 65km mark. A few minutes into the descent it happened, my ITB locked up, I couldn’t straighten my leg and every step was very painful. Running was impossible.

I walked for 500m trying to get some movement and start running again. Limping along for 100m it again flared up, very painful to put any weight on it and forced into a walk. I knew deep down my race was over.

3rd and 4th place passed me and there was nothing I could do about it. I tried running after walking for about 2km but again 100m down the road running become too painful.

At this point, 60km deep, I had to make a decision, do I continue even though the race I wanted is over and just finish by walking it in and hope I don’t do any more damage? Or tap out and not risk major damage? The thought of rupturing my ITB and turning a few weeks of recovery into a few months was not a risk I was willing to take. So, I made the tough decision to drop. No matter how right that decision was it never feels good to drop from a race.

I gave the word and got a lift back to the finish line for first aid. The medical officer felt my knee and said “Faaaark!”. It was in some ways comforting to hear, reassuring me I made the right decision to drop.

I haven’t had a good run in my last 3 races so the way it went bums me out even more. But I’m remaining positive and spending the next few weeks doing all the right things to fix it.

I was on track to crush my goal time and have one of my best races. I guess now this race goes onto the ‘I need redemption’ pile, so I’ll be back next year. For now, it’s preparing for UTA and hoping I don’t see a repeat.

View my race on Strava

The post Buffalo Stampede Ultra 75km 2019: Race report appeared first on Daniel Oehm | Gradient Descending.

]]>Every so often a problem arises where it’s appropriate to use gradient descent, and it’s fun (and / or easier) […]

The post Applying gradient descent – primer / refresher appeared first on Daniel Oehm | Gradient Descending.

]]>Every so often a problem arises where it’s appropriate to use gradient descent, and it’s fun (and / or easier) to apply it manually. Recently I’ve applied it optimising a basic recommender system to ‘unsuppressing’ suppressed tabular data. I thought I’d do a series of posts about how I’ve used gradient descent, but figured it was worth while starting with the basics as a primer / refresher.

To understand how this works gradient descent is applied we’ll use the classic example, linear regression.

A simple linear regression model is of the form

where

The objective is to find the parameters such that they minimise the mean squared error.

This is a good problem since we know the analytical solution and can check our results.

In practice you would never use gradient descent to solve a regression problem, but it is useful for learning the concepts.

Set up

```
library(ggplot2)
set.seed(241)
nobs <- 250
b0 <- 4
b1 <- 2
# simulate data
x <- rnorm(nobs)
y <- b0 + b1*x + rnorm(nobs, 0, 0.5)
df <- data.frame(x, y)
# plot data
g1 <- ggplot(df, aes(x = x, y = y)) +
geom_point(size = 2) +
theme_minimal()
```

The analytical solution is given by

```
# set model matrix
X <- model.matrix(y ~ x, data = df)
beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta
```

## [,1] ## (Intercept) 4.009283 ## x 2.016444

And just to convince ourselves this is correct

```
# linear model formulation
lm1 <- lm(y ~ x, data = df)
coef(lm1)
```

## (Intercept) x ## 4.009283 2.016444

```
g1 + geom_abline(slope = coef(lm1)[2], intercept = coef(lm1)[1], col = "darkmagenta", size = 1)
```

The objective is to achieve the same result using gradient descent. It works by updating the parameters with each iteration in the direction of negative gradient to minimise the mean squared error i.e.

where is the learning rate. Here is the MSE with respect to the regression parameters. Firstly, we find the partial derivatives of .

The learning rate is to ensure we don’t jump too far with each iteration and rather some proportion of the gradient, otherwise we could end up overshooting the minimum and taking much longer to converge or not find the optimal solution at all.

Applying this to the problem above, we’ll initialise our values for to something sensible e.g. . I’ll choose a learning rate of . This is a slow burn, a learning rate of 0.1-0.2 is more appropriate for this problem but we’ll get to see the movement of the gradient better. It’s worth trying different values of to see how it changes convergence. The algorithm is setup as

```
# gradient descent function
gradientDescent <- function(formula, data, par.init, loss.fun, lr, iters){
formula <- as.formula(formula)
X <- model.matrix(formula, data = data)
y <- data[,all.vars(formula)[1]]
par <- loss <- matrix(NA, nrow = iters+1, ncol = 2)
par[1,] <- par.init
for(k in 1:iters){
loss[k,] <- loss.fun(X=X, y=y, par=par[k,])
par[k+1,] <- par[k,] - lr*loss[k,]
}
return(list(par = par))
}
# loss function
loss.fun <- function(X, y, par) return(-2/nrow(X)*(t(X) %*% (y - X %*% par)))
# gradient descent. not much to it really
beta <- gradientDescent(y ~ x, data = df, par.init = c(1, 1), loss.fun = loss.fun, lr = 0.01, iters = 1000)$par
# plotting results
z <- seq(1, 1001, 10)
g1 + geom_abline(slope = beta[z,2], intercept = beta[z,1], col = "darkmagenta", alpha = 0.2, size = 1)
```

```
tail(beta, 1)
```

## [,1] [,2] ## [1001,] 4.009283 2.016444

As expected we obtain the same result. The lines show the gradient and how the parameters converge to the optimal values. A less reasonable set of starting values still converges quickly to the optimal solution showing how well graident descent works on linear regression.

```
beta <- gradientDescent(y ~ x, data = df, par.init = c(6, -1), loss.fun = loss.fun, lr = 0.01, iters = 1000)$par
# plotting results
z <- seq(1, 1001, 10)
beta.df <- data.frame(b0 = beta[z,1], b1 = beta[z,2])
g1 + geom_abline(data = beta.df, mapping = aes(slope = b1, intercept = b0), col = "darkmagenta", alpha = 0.2, size = 1)
```

```
tail(beta, 1)
```

## [,1] [,2] ## [1001,] 4.009283 2.016444

```
library(gganimate)
library(magrittr)
ggif_minimal <- df %>%
ggplot(aes(x = x, y = y)) +
geom_point(size = 2) +
theme_minimal() +
geom_abline(data = beta.df, mapping = aes(slope = b1, intercept = b0), col = "darkmagenta", size = 1) +
geom_text(
data = data.frame(z, b0 = beta[z,1], b1 = beta[z,2]),
mapping = aes(
x = -2.8, y = 9,
label = paste("b0 = ", round(b0, 2), "\nb1 = ", round(b1, 2))),
hjust = 0,
size = 6
) +
transition_reveal(z) +
ease_aes("linear") +
enter_appear() +
exit_fade()
animate(ggif_minimal, width = 1920, height = 1080, fps = 80)
```

They are the basics of applying gradient descent. In practice there is no need to use gradient descent to solve a regression problem, but once you know how to apply it you’ll find real-world applications elsewhere that are more complicated (and interesting). If you can define the objective function and it is differentiable, you can apply gradient descent. In later posts i’ll demonstrate how I’ve applied it to real world problems. Stay tuned!

The post Applying gradient descent – primer / refresher appeared first on Daniel Oehm | Gradient Descending.

]]>It’s generally accepted that Martingale betting strategies don’t work. But people still gravitate towards them because they are intuitive. I […]

The post Martingale strategies don’t work, but we knew that – Simulation analysis in R appeared first on Daniel Oehm | Gradient Descending.

]]>It’s generally accepted that Martingale betting strategies don’t work. But people still gravitate towards them because they are intuitive. I was curious to find out how they actually perform.

*Disclaimer: I do not encourage gambling, nor do I gamble myself but the games are good examples of stochastic processes.*

- The classic and Reverse Martingale strategies don’t work. They will almost certainly have you leaving with less money than what you started with, or more often than not no money at all.
- The reverse strategy is generally less risky, higher potential to win big but lower returns on average.
- Going all in on the first bet is the best way to double your money and save your time.

The Classic Martingale strategy is as follows. Make an initial bet $. If you win, bet $ on the next round. If you lose, double your previous bet. In a nutshell you bet $ where is the number of losses in a row. The intention is to make your money back on the next win.

Assume we start with $100 and our initial bet is $1. We keep playing until there is not enough money in the cash pool to make the next bet. Also assume we are playing roulette on an American table and place bets on red or black which both have the probability 18/38. It only takes a streak of 6 losses before the game is over because we don’t have enough cash to double our bet for a 7th time. If we see a streak of 10 or more it really starts to get out of control.

trials | bet | cumulative_loss |
---|---|---|

1 | 1 | 1 |

2 | 2 | 3 |

3 | 4 | 7 |

4 | 8 | 15 |

5 | 16 | 31 |

6 | 32 | 63 |

7 | 64 | 127 |

8 | 128 | 255 |

9 | 256 | 511 |

10 | 512 | 1023 |

The probability of losing 6 in a row is . Sounds unlikely, but it will occur more often than you think. With each win we will win $1, so once we have won 27 times we’ll have enough cash in order to afford a losing streak of 6 and bet on the 7th.

It’s more likely we’ll have a few wins and losses before observing a long losing streak that takes us out of the game. The question is how many trials (spins of the roulette wheel) will we place bets on before we lose our money and play stops? A slight variation I’ve applied is, if there is not enough money left to double the bet, we will simply bet the remainder in the cash pool, in other words go all-in. More in line with what someone might do.

This simulation was a relatively lucky one, winning over $170 and almost 400 trials, however one bad streak and it’s all over. You can plot as many simulations as you like, some are shorter and some are longer but they all end the same way.

This is a typical pattern for the classic strategy. If you wish to see more try running the code at the end of this post or view this set.

Instead of doubling the bet upon a loss, double the bet upon a win. Not surprisingly this ends the same way as the classic strategy.

The players cash amount slowly decreases by $1 on each loss. Occasionally there is a big win, rather than a big loss. With this strategy you can watch your money dwindle away rather than vanish in front of your eyes.

This plot is using the same win-lose sequence as the one above. In this case the maximum cash held throughout the game is higher with the classic strategy than the reverse.

Click here to see more simulations.

These strategies were simulated 20,000 times. The distribution of the number of trials shows how long a typical game will last until bankruptcy. The classic strategy has a very long tail, so potentially could be playing for a very long time. The maximum number of trials is this simulation was 64254. But you could also be playing for a very short time.

The reverse strategy has a slightly higher median number of trials but much less variable than the classic strategy meaning you can be sure to play between 166 and 217 spins.

min | 2.5% | 10% | 50% | 90% | 97.5% | max | |
---|---|---|---|---|---|---|---|

classic | 7 | 10 | 22 | 170 | 1132 | 2759 | 64254 |

reverse | 152 | 167 | 172 | 191 | 207 | 213 | 226 |

Assume the goal is to double your money. What is the probability you’ll double your money before going bust? After 20,000 simulations for both strategies, the probability you will double your money using…

- The Classic Martingale strategy is 0.35
- The Reverse Martingale strategy is 0.28

The Classic Martingale strategy tends to do better on average, but only slightly. Neither of these strategies are better than simply playing once and going all-in which is 0.47

The distribution of the maximum amount of cash held by the player at any given time during the game shows the classic strategy has the slight edge over the reverse strategy on average. Although the reverse strategy has the potential for massive wins if you score a good winning streak.

min | 2.5% | 10% | 50% | 90% | 97.5% | max | |
---|---|---|---|---|---|---|---|

classic | 100 | 101 | 107 | 158 | 425 | 926 | 19894 |

reverse | 100 | 100 | 102 | 133 | 534 | 2062 | 131163 |

However, keep in mind these simulations all resulted in total loss at the end of the game. Perhaps the key is to know when to stop?

Rather than stopping once you reach a specified amount which may not happen, stop when the bet reaches a specified amount.

We could specify a streak length, however a better idea is to specify a ratio of bet to total cash. This way the is stopping condition is dynamic. For example, if there is a winning streak we’ll have more money in which to bet.

Essentially by using this ratio we are fixing a certain level risk rather than betting amount. The ratio is calculated as

If we fix a stopping ratio of 0.1, we could place 4 bets before the ratio exceeds 0.1. If our initial cash pool was $200 we could place 5 bets until the ratio exceeds 0.1.

If we stop as soon as this ratio is reached it means we’re on a losing streak so it makes sense in the case of the classic strategy to bet again until our next win and walk away.

In the case of the reverse it makes sense to leave as soon as the ratio is met since we’re on a winning streak.

There are clear differences and similarities between the two strategies.

- The median final amount for the low risk stopping condition is actually slightly above 100 for both strategies, meaning it is likely you will break even or take home a couple of bucks.
- As the risk goes up the median value decreases where the classic strategy tends to have a higher median than the reverse strategy.
- Higher the risk, the higher potential for bigger wins.
- The reverse strategy has some interesting features. The ‘bumps’ occur when the risk is great enough to allow the next bet when you’re on a streak. However the chance of this streak is low enough that it doesn’t improve the strategy.

Overall the stopping strategies help to minimise loss rather than lock in wins, so on average you will still lose your money.

The code for the plots in this post can be found on github. The Martingale function is below.

```
# libraries
library(tidyverse)
# martingale function
martingale <- function(bet, cash, p, stop_condn = Inf, stop_factor = Inf, reverse = FALSE, plot = TRUE, stop_on_next_win = TRUE){
bet_vec <- vector(mode = "numeric")
cash_vec <- vector(mode = "numeric")
outcome <- vector(mode = "numeric")
winnings <- vector(mode = "numeric")
total_cash <- vector(mode = "numeric")
trial <- 0
total_cash[1] <- cash
while(total_cash[max(trial, 1)] > 0){
# iterate through trials
trial <- trial + 1
# update cash pool
if(trial == 1){
cash_vec[trial] <- cash
}else{
cash_vec[trial] <- total_cash[trial-1]
}
# set bet
if(!reverse){
if(outcome[trial - 1] == 1 || trial == 1){
bet_vec[trial] <- bet
}else{
bet_vec[trial] <- min(2*bet_vec[trial-1], cash_vec[trial]) # if there isn't enough to double the bet just bet what is left
}
}else{
if(outcome[trial - 1] == 0 || trial == 1){
bet_vec[trial] <- bet
}else{
bet_vec[trial] <- min(2*bet_vec[trial-1], cash_vec[trial]) # if there isn't enough to double the bet just bet what is left
}
}
# stop condition
if(bet_vec[trial]/cash_vec[trial] > stop_condn){
```

if(stop_on_next_win & !reverse){

stop_trigger <- TRUE

}else{

outcome[trial] <- NA

winnings[trial] <- NA

total_cash[trial] <- cash_vec[trial]

break

}

}
outcome[trial] <- sample(c(0,1), 1, prob = c(1-p, p))
winnings[trial] <- bet_vec[trial]*outcome[trial] - bet_vec[trial]*(1-outcome[trial])
total_cash[trial] <- cash_vec[trial] + winnings[trial]
# stop condition
if(total_cash[trial] >= stop_factor*cash) break
}
# make the plot
g1 <- NULL
if(plot){
df <- data.frame(trials = 1:trial, cash = total_cash)
gg <- ggplot() +
geom_line(data = df, mapping = aes(x = trials, y = cash), col = "darkmagenta", lty = 1, size = 1) +
geom_hline(yintercept = cash_vec[1], col = "grey", lty = 2) +
theme_minimal() +
labs(
x = "Number of spins",
y = "Total cash in hand",
title = ifelse(reverse, "Reverse Martingale strategy", "Martingale strategy"),
subtitle = "The growth and decline of the gamblers cash pool - it always ends the same way"
) +
ylim(0, NA)
print(gg)
}
return(list(
bet = bet_vec,
cash = cash_vec,
outcome = outcome,
total_cash = total_cash,
trials = trial,
plot = gg))
}
# run the simulation and plot the output
# try different parameters to see the effect
martingale(1, 100, 18/38, reverse = FALSE, plot = TRUE, stop_condn = 1,

stop_on_next_win = TRUE)

The post Martingale strategies don’t work, but we knew that – Simulation analysis in R appeared first on Daniel Oehm | Gradient Descending.

]]>ggplot – You can spot one from a mile away, which is great! And when you do it’s a silent […]

The post Adding Custom Fonts to ggplot in R appeared first on Daniel Oehm | Gradient Descending.

]]>ggplot – You can spot one from a mile away, which is great! And when you do it’s a silent fist bump. But sometimes you want more than the standard theme.

Fonts can breathe new life into your plots, helping to match the theme of your presentation, poster or report. This is always a second thought for me and need to work out how to do it again, hence the post.

There are two main packages for managing fonts – `extrafont`

, and `showtext`

.

A relatively old package and it’s not well supported unfortunately. You can run into problems, however the base functions work well enough.

The fonts in the system directory are first imported into the `extrafontdb`

using `font_import()`

. This only needs to be run once in order to load the fonts into the right directory. Secondly, they are registered in R using `loadfonts()`

for your specific device. The fonts need to be loaded each session.

```
library(tidyverse)
library(extrafont)
library(cowplot)
```

```
# import fonts - only once
font_import()
```

```
# load fonts - every session
loadfonts(device = "win", quiet = TRUE)
```

Below are all the available fonts (that I have – click on the image to enlarge).

To use these fonts in a plot, change the text family using one of the names above. For demonstration I’ll use the Antigua corn data from the DAAG package.

```
library(DAAG)
# corn plot
corn <- antigua %>%
dplyr::filter(ears > 0) %>%
ggplot(, aes(x = ears, y = harvwt, col = site)) +
geom_point(size = 4) +
scale_colour_manual(values = colorRampPalette(c("orange", "darkmagenta", "turquoise"))(8)) +
labs(title = "ANTIGUA CORN YIELDS",
x = "Ears of corn harvested",
y = "Harvest weight") +
theme(
text = element_text(family = "candara", size = 24),
plot.title = element_text(size = 30),
plot.caption = element_text(size = 28))
corn
```

It’s likely you’ll want more than what is available in the standard font set. You can add custom fonts with `extrafont()`

, however I’ve only had limited success. A better option is using `showtext`

.

`showtext`

is a package by Yixuan Qiu and it makes adding new fonts simple. There are a tonne of websites where you can download free fonts to suit pretty much any style you are going for. I’ll only touch on the key bits, so check out the vignette for more details.

The simplest way is to add fonts is via `font_add_google()`

. Find the font you like on Google Fonts and add it to R using the following.

```
library(showtext)
font_add_google(name = "Amatic SC", family = "amatic-sc")
```

Amatic SC can now be used by changing the font family to “amatic-sc”. For R to know how to properly render the text we first need to run `showtext_auto()`

prior to displaying the plot. One downside is it currently does not display in Rstudio. Either open a new graphics window with `windows()`

or save as an external file e.g. `.png`

.

```
# turn on showtext
showtext_auto()
```

Custom fonts are added by first,

- Finding the font you want to use (I’ve picked a few from 1001 Free Fonts and Font Space, but there are many more out there)
- Download the font
`.ttf()`

file and unzip if needed - Use
`font_add()`

to register the font - Run
`showtext_auto()`

to load the fonts

```
font_add(family = "docktrin", regular = "./fonts/docktrin/docktrin.ttf")
showtext_auto()
```

And that’s pretty much it. Given how effortless it is to add new fonts you can experiment with many different styles.

These fonts are outrageous but demonstrate that you really can go for any style, from something minimal and easy reading to something fit for a heavy metal band. For professional reports you’ll want to go for something sensible, but if you’re making a poster, website or infographic you may want to get creative e.g.

The tools are there for you to be as creative as you want to be.

The last thing to note is you’ll need to play around with different sizes given the resolution of your screen.

To turn off `showtext`

and use the standard fonts, run

```
showtext_auto(FALSE)
```

```
# load fonts
ft <- data.frame(x = sort(rep(1:4, 31))[1:nrow(fonttable())], y = rep(31:1, 4)[1:nrow(fonttable())], text_name = fonttable()$FullName)
font_plot <- ggplot(ft, aes(x = x, y = y)) +
geom_text(aes(label = text_name, family = text_name), size = 20) +
coord_cartesian(xlim = c(0.5, 4.5)) +
theme_void()
font_plot
# amatic sc
scale <- 4.5 # scale is an adjustment for a 4k screen
corn <- ggplot(antigua %>% dplyr::filter(ears > 0), aes(x = ears, y = harvwt, col = site)) +
geom_point(size = 4) +
scale_colour_manual(values = colorRampPalette(c("orange", "darkmagenta", "turquoise"))(8)) +
labs(title = "ANTIGUA CORN YIELDS",
subtitle = "Study of different treatments and their effect on corn yields in Antigua",
x = "Ears of corn harvested",
y = "Harvest weight",
caption = "@danoehm | gradientdescending.com") + theme_minimal() +
theme(
text = element_text(family = "amatic-sc", size = 22*scale),
plot.title = element_text(size = 26*scale, hjust = 0.5),
plot.subtitle = element_text(size = 14*scale, hjust = 0.5),
plot.caption = element_text(size = 12*scale),
legend.text = element_text(size = 16*scale))
png(file = "./fonts/corn yields amatic-sc.png", width = 3840, height = 2160, units = "px", res = 72*4)
corn
dev.off()
# wanted dead or alive
# generate data - unfortunately I couldn't find data on actual outlaws
n <- 20
x <- 100*runif(n)
y <- (25 + 0.005*x^2 + rnorm(n, 0, 10))*10
wanted <- data.frame(x, y) %>%
ggplot(aes(x = x, y = y)) +
geom_smooth(col = "black", lty = 2) +
geom_point(size = 4) +
theme_minimal() +
labs(title = "WANTED: DEAD OR ALIVE",
subtitle = "Relationship of the crimes committed by outlaws and the bounty on their head",
x = "CRIMES",
y = "BOUNTY",
caption = "@danoehm | gradientdescending.com") + theme_minimal() +
theme(
text = element_text(family = "docktrin", size = 16*scale),
plot.title = element_text(size = 40*scale, hjust = 0.5),
plot.subtitle = element_text(size = 14*scale, hjust = 0.5, margin = margin(t = 40)),
plot.caption = element_text(size = 10*scale),
legend.text = element_text(size = 12*scale),
panel.grid = element_line(color = "black"),
axis.title = element_text(size = 26*scale),
axis.text = element_text(color = "black"))
png(file = "./fonts/wanted1.png", width = 3840, height = 2160, units = "px", res = 72*4)
ggdraw() +
draw_image("./fonts/wanted_dead_or_alive_copped.png", scale = 1.62) + # a png of the background for the plot
draw_plot(wanted)
dev.off()
# Horror movies
horror <- read_csv("./fonts/imdb.csv")
gghorror <- horror %>%
dplyr::filter(str_detect(keywords, "horror")) %>%
dplyr::select(original_title, release_date, vote_average) %>%
ggplot(aes(x = release_date, y = vote_average)) +
geom_smooth(col = "#003300", lty = 2) +
geom_point(size = 4, col = "white") +
labs(title = "HORROR MOVIES!",
subtitle = "Average critics ratings of horror films and relationship over time",
x = "YEAR",
y = "RATING",
caption = "Data from IMDB.\n@danoehm | gradientdescending.com") + theme_minimal() +
theme(
text = element_text(family = "swamp-witch", size = 16*scale, color = "#006600"),
plot.title = element_text(size = 48*scale, hjust = 0.5),
plot.subtitle = element_text(family = "montserrat", size = 14*scale, hjust = 0.5, margin = margin(t = 30)),
plot.caption = element_text(family = "montserrat", size = 8*scale),
legend.text = element_text(size = 12*scale),
panel.grid = element_line(color = "grey30"),
axis.title = element_text(size = 26*scale),
axis.title.y = element_text(margin = margin(r = 15)),
axis.text = element_text(color = "#006600"),
plot.background = element_rect(fill = "grey10"))
png(file = "./fonts/swamp.png", width = 3840, height = 2160, units = "px", res = 72*4)
gghorror
dev.off()
```

The post Adding Custom Fonts to ggplot in R appeared first on Daniel Oehm | Gradient Descending.

]]>Townsville, Qld, has been inundated with torrential rain and has broken the record of the largest rainfall over a 10 […]

The post The Most Amount of Rain over a 10 Day Period on Record appeared first on Daniel Oehm | Gradient Descending.

]]>Townsville, Qld, has been inundated with torrential rain and has broken the record of the largest rainfall over a 10 day period. It has been devastating for the farmers and residents of Townsville. I looked at Townsville’s weather data to understand how significant this event was and if there have been comparable events in the past.

Where this may interest the R community is in obtaining the data. The package ‘bomrang’ is an API allowing R users to fetch weather data directly from the Australian Bureau of Meteorology (BOM) and have it returned in a tidy data frame.

Historical weather observations including rainfall, min/max temperatures and sun exposure are obtained via `get_historical()`

. Either the ID or the lat-long coordinates of the weather station are needed to extract the data. The ID information can be found on the BOM website by navigating to the observations page.

Using the station ID the rainfall data is extracted with the following.

```
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(bomrang))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(MCMCpack))
mycols <- c("darkmagenta", "turquoise")
# import data - simple as
townsville <- get_historical("032040")
```

And cleaning up the dates.

```
# fix date functions
fix.date <- function(y,m,d){
s <- paste(c(y, m, d), collapse = "-")
d <- as.Date(s, "%Y-%m-%d")
return(d)
}
for(k in 1:nrow(townsville)){
townsville$date[k] <- fix.date(townsville$Year[k], townsville$Month[k], townsville$Day[k])
}
townsville$date <- as.Date(townsville$date, origin = "1970-01-01")
# trim - for replication of when this was first run
townsville %<>% dplyr::filter(date < as.Date("2019-02-13"))
# take a look
head(townsville)
```

## Product_code Station_number Year Month Day Rainfall Period Quality ## 1 IDCJAC0009 32040 1941 1 1 0.0 NA Y ## 2 IDCJAC0009 32040 1941 1 2 6.6 1 Y ## 3 IDCJAC0009 32040 1941 1 3 16.5 1 Y ## 4 IDCJAC0009 32040 1941 1 4 205.5 1 Y ## 5 IDCJAC0009 32040 1941 1 5 175.0 1 Y ## 6 IDCJAC0009 32040 1941 1 6 72.9 1 Y ## date ## 1 1941-01-01 ## 2 1941-01-02 ## 3 1941-01-03 ## 4 1941-01-04 ## 5 1941-01-05 ## 6 1941-01-06

Applying a 10 day rolling window over the entire historical record it’s easy to see the significance of this rainfall event. The 8th February recorded 1259.8mm of rain in the 10 days prior. It dwarfs the previous record of 925.5mm set in 1953. It also highlights other significant events in the past, in particular 1968, 1998 and 2009 but these don’t come close to the 2019 event.

```
# get 10 day total
townsville$rolling10 <- 0
for(k in 10:nrow(townsville)){
townsville$rolling10[k] <- sum(townsville$Rainfall[(k-9):k])
}
# plot
ggplot(
townsville %>% dplyr::filter(date > as.Date("1940-01-01")),
aes(x = date, y = rolling10, col = rolling10)) + geom_line() +
scale_color_gradientn(colors = colorRampPalette(mycols)(32)) +
labs(y = "Total rainfall in the last 10 days")
```

It really was a phenomenal amount of rain. This was not the largest rainfall in a day however, that record occurred in 1998 with a massive 548.8mm of rain. In fact the 2019 floods don’t feature in the top 10 wettest days, but the consistency over 10 days made it the wettest on record.

```
townsville %>% arrange(desc(Rainfall)) %>% dplyr::select(date, Rainfall) %>% head(10)
```

## date Rainfall ## 1 1998-01-11 548.8 ## 2 1946-03-03 366.5 ## 3 1953-01-16 346.7 ## 4 1977-02-01 317.6 ## 5 1997-03-24 302.8 ## 6 1978-01-31 273.4 ## 7 2000-04-04 271.6 ## 8 1946-02-10 251.7 ## 9 2009-02-03 236.8 ## 10 1944-03-29 233.4

Townsville received over a years worth of rain in 10 days. The graph below shows the annual rainfall measurements and average annual rainfall (dotted line) given the historical records. Even with only 5-6 weeks of the year, 2019 is already one of the wettest years on record.

```
# calculate the total annual rainfall and rainfall to date
annual.rainfall <- townsville %>%
dplyr::filter(date > as.Date("1940-01-01")) %>%
mutate(
rainfall_to_date = as.numeric(as.POSIXlt(date)$yday < 40)*Rainfall,
rainfall_after_wet = as.numeric(as.POSIXlt(date)$yday < 90)*Rainfall
) %>%
group_by(Year) %>%
summarise(
annual = sum(Rainfall),
Feb12th = sum(rainfall_to_date),
april = sum(rainfall_after_wet),
remaining = sum(Rainfall) - sum(rainfall_to_date)
)
# bar plot
ggplot(annual.rainfall, aes(x = Year, y = annual, fill = annual)) +
geom_bar(stat = "identity") +
scale_fill_gradientn(colors = colorRampPalette(mycols)(32)) +
geom_hline(yintercept = mean(annual.rainfall$annual, na.rm = TRUE), lty = 3, col = "grey20", lwd = 1) +
labs(y = "Total annual rainfall")
```

On close inspection the data suggests that the first 40 years of records are less variable than from 1980 on-wards. There appears to be drier years and wetter years in the latter half.

The current record was set in 2000 at 2400mm where in this year Townsville had a few heavy rainfall events in the months up until April and some lesser events near the end of the year. Comparing 2019 to these years, there is definitely potential for 2019 to be the wettest on record.

```
ggplot(townsville %>% dplyr::filter(Year %in% c(1950, 1956, 2000, 2019)), aes(x = as.POSIXlt(date)$yday, y = Rainfall)) +
geom_line() +
facet_grid(Year ~ .) + labs(x = "Day of the year") +
labs(title = "Wettest 3 years on record vs 2019")
```

Below clearly shows which years have had significant rainfall in the first part of the year. The years which have received greater than 700mm (dotted line) are quite distinct from the bulk of the data. Since the wet season ends in April the other heavy years (like 2000) haven’t had their major events yet. This is shown in the April plot at the bottom which has a much stronger relationship (obviously). The years which experienced heavy rainfall at this time of year, in general didn’t get too much afterwards.

```
grid.arrange(
ggplot(annual.rainfall, aes(x = Feb12th, y = annual, col = annual)) +
geom_point(size = c(rep(2, nrow(annual.rainfall)-1), 4)) +
scale_color_gradientn(colors = colorRampPalette(mycols)(32)) +
labs(y = "Annual rainfall", x = "Total rainfall as at 12th Feb") +
geom_vline(xintercept = 700, lty = 3, col = "grey20", lwd = 1),
ggplot(annual.rainfall, aes(x = april, y = annual, col = annual)) +
geom_point(size = c(rep(2, nrow(annual.rainfall)-1), 4)) +
scale_color_gradientn(colors = colorRampPalette(mycols)(32)) +
labs(y = "Annual rainfall", x = "Total rainfall as at 1st April") +
geom_vline(xintercept = 700, lty = 3, col = "grey20", lwd = 1)
)
```

For what it’s worth I’ll offer a prediction for the expected annual rainfall and probability of being the wettest year on record (which, to be honest is a fools errand – tropical weather systems are pretty complex stuff)

```
# bayesian model
blm <- MCMCregress(annual ~ Feb12th, data = annual.rainfall %>% dplyr::filter(Feb12th > 0, Year != 2019),
sigma.mu = 235, sigma.var = 35^2) # priors from exploration - details skipped here
# prediction
x <- matrix(c(1, 1444))
pred.annual.rainfall <- data.frame(
annual = blm[,-3] %*% x + rnorm(10000, 0, sqrt(blm[,"sigma2"])), # posterior predictive distribution
exp.val = blm[,-3] %*% x) # mean distribution
# c(min(pred.annual.rainfall$annual), table(pred.annual.rainfall$annual < 1444)/10000)
n <- 1000
xstart <- rep(0, n)
xend <- rep(1700, n)
ystart <- blm[1:n,1] + blm[1:n,2]*xstart
yend <- blm[1:n,1] + blm[1:n,2]*xend
ystartp <- blm[1:n,1] + blm[1:n,2]*xstart + rnorm(n, 0, sqrt(blm[,3]))
yendp <- blm[1:n,1] + blm[1:n,2]*xend + rnorm(n, 0, sqrt(blm[,3]))
df.seg <- data.frame(xstart, xend, ystart, yend, ystartp, yendp)
exp.val <- quantile(pred.annual.rainfall$exp.val, c(0.025, 0.5, 0.975))
post.pred <- quantile(pred.annual.rainfall$annual, c(0.025, 0.5, 0.975))
prob.record.rainfall <- sum(as.numeric(pred.annual.rainfall$annual > 2400))/10000
# drawing each line of the posterior draws takes time but its a nice aesthetic
df.pred <- data.frame(x = c(1444, 1444), y = range(exp.val))
a <- 6
df.post <- data.frame(xs = rep(1444, 3) + c(0, -a, -a), xe = rep(1444, 3) + c(0, a, a),
ys = c(post.pred[1], post.pred[1], post.pred[3]), ye = c(post.pred[3], post.pred[1], post.pred[3]))
ggplot(annual.rainfall %>% dplyr::filter(Year != 2019, Feb12th > 0), aes(x = Feb12th, y = annual)) +
geom_segment(mapping = aes(x = xstart, xend = xend, y = ystartp, yend = yendp), data = df.seg, col = "grey20", alpha = I(0.05)) +
geom_segment(mapping = aes(x = xstart, xend = xend, y = ystart, yend = yend), data = df.seg, col = "darkmagenta", alpha = I(0.025)) +
geom_point() +
labs(y = "Annual rainfall", x = "Total rainfall as at 12th Feb") +
geom_line(mapping = aes(x = x, y = y), data = df.pred, size = 2) +
geom_segment(mapping = aes(x = xs, xend = xe, y = ys, yend = ye), data = df.post) +
geom_hline(yintercept = 2400, col = "red", lty = 2)
```

Based on this (crude) model, the expected annual rainfall credible interval is (2027, 2472) mm. Using the posterior predictive distribution for 2019, the probability 2019 will experience record rainfall is 0.29.

But let’s all hope there’s not too much more rain. There has been enough.

The post The Most Amount of Rain over a 10 Day Period on Record appeared first on Daniel Oehm | Gradient Descending.

]]>