# Surviving Survivor’s Tribal Council: Survival Analysis in Python/R

I’ve been playing around with Survival Analysis at work, and it hit me that the TV show Survivor is a perfect application of the method since it deals with people ‘dying’ in the metaphorical sense, a.k.a. getting voted out.

In this article, we will apply Survival Analysis to Survivor and calculate different survival curves which estimate the probability of reaching day 2, day 3, up to day 39 across demographic factors – sex, age, and geographic location.

But first, let me give a brief rundown on what Survival Analysis is.

## Survival Analysis and Kaplan-Meier Curves

When will I die? When will I recover from my sickness? When will the customer stop using our company’s service? These are questions that survival analysis attempts to answer in a probabilistic sense.

What do these questions have in common? Two things: a time element (which answers ‘when’) and the event itself (which answers ‘what will happen’). One of the main objectives of Survival Analysis is to build a map from time t to the probability of surviving, i.e. not yet having experienced the event, beyond time t. This function is called the **survival function**, often denoted by S(t). A graph of the survival function is called a **survival plot/curve**.

Survival curves are really handy since they can be used to answer a ton of interesting questions. Take for example the scenario of disease treatment. Let’s say we have two different treatments T1 and T2. We can look at the survival curves for T1 and T2 to see which treatment is more effective: the one whose survival curve stays consistently higher across time is the better option in terms of survival.

But how do you actually construct a survival function? It can be estimated by conducting an experiment, i.e. observing subjects from the start up to the time they ‘die’. Ideally, we would know the time of death of ALL our subjects, and in that case we can empirically construct the survival curve by computing, for every t, the fraction of subjects still alive at time t.
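To make the ideal, fully observed case concrete, here is a minimal sketch in plain Python. The death times are made-up numbers and `empirical_survival` is a hypothetical helper name; the function simply computes the fraction of subjects still alive after each observed death time.

```python
def empirical_survival(death_times):
    """Return {t: fraction of subjects still alive after time t}.

    Assumes every subject's death time is known (no censoring).
    """
    n = len(death_times)
    return {
        t: sum(1 for d in death_times if d > t) / n
        for t in sorted(set(death_times))
    }


# Hypothetical death times (in days) for five fully observed subjects.
death_times = [3, 6, 6, 9, 12]
curve = empirical_survival(death_times)

for t, s in curve.items():
    print(f"S({t}) = {s:.2f}")
```

The curve only changes at observed death times, which is why survival plots look like staircases.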

But sometimes the experiment ends, meaning we stop tracking the subjects, while some of them are still alive. Or we might lose track of still-alive subjects during the experiment, so we never learn the actual time that they die. All we know is that they were alive when we last saw them. Both scenarios produce **right-censored observations**: the time of the event is censored, meaning we don’t know the specific time of death, only that it is later than the last time we saw the subject. How do we construct survival curves in this case?

We can construct **Kaplan-Meier (KM) survival curves**. The basic idea is to use all available information for each subject, even right-censored ones, up until they die or are censored, and then use the chain rule for conditional probabilities to tie all the pieces of information together: the probability of surviving past time t is the product of the probabilities of surviving each earlier event time, given survival up to that point. In this way, we are able to use censored data even if we don’t know the actual time-to-event. You can read more about KM curves on Wikipedia.
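Here is a minimal sketch of the KM estimator in plain Python, just to show the mechanics. At every time where at least one death is observed, the running survival probability is multiplied by (1 - deaths / number at risk). The data and the function name `kaplan_meier` are made up for illustration; for real work, a tested library is the better choice.

```python
def kaplan_meier(durations, observed):
    """Return a list of (time, survival probability) steps.

    durations: time at which each subject died or was last seen
    observed:  True if the subject died at that time, False if censored
    """
    # The estimate only changes at times where a death was observed.
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone who has not died or been censored before t is at risk.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, observed) if d == t and e)
        # Chain rule: P(survive past t) = P(survive past t | alive at t) * previous.
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: six subjects, two of them right-censored.
durations = [3, 5, 5, 8, 10, 12]
observed = [True, False, True, True, False, True]

km_curve = kaplan_meier(durations, observed)
for t, s in km_curve:
    print(f"S({t}) = {s:.3f}")
```

Notice that the censored subjects never trigger a drop in the curve, but they still count in the at-risk group for as long as we observed them.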

## Survivor Alliance Analysis

Before proceeding, let me just introduce you to another set of posts I have on Survivor which I call *Survivor Alliance Analysis*. There, I define alliance metrics and apply network analysis methods to uncover alliances algorithmically for different seasons of the show. It’s all very interesting stuff, so please check it out.

- Scraping the Survivor Wiki with Beautiful Soup
- Computational Analysis of Survivor Alliances
- Survivor Alliance Networks Visualized
- Network Analysis of Survivor Alliances

## Survivor Survival Analysis

Now that we have some background on Survival Analysis and KM curves, let’s apply it to our favorite reality show, Survivor. Specifically, we will look at how different demographic stratifications (sex, age, geography) affect the castaways in terms of lasting in the game.

In the Survivor context, dying means getting voted out, so the survival probability at day t can be thought of as the probability of ‘still being in the game’ or ‘not getting voted out’ up to day t. Also in this context, castaways who reached the Final Tribal Council or who exited by means other than a tribal council vote (medical evacuation, quitting, etc.) are treated as censored observations since, technically, they weren’t voted out of the game.
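As a rough illustration of that encoding, each castaway can be reduced to a (duration, event) pair, where event is 1 for a vote-out and 0 for a censored exit. The castaway names, day counts, and exit labels below are hypothetical placeholders, not rows from the scraped data.

```python
# Hypothetical exit labels that count as censored rather than "death".
CENSORED_EXITS = {"finalist", "medevac", "quit"}


def to_survival_record(days_lasted, exit_type):
    """Map a castaway to (duration, event); event=1 means voted out."""
    event = 0 if exit_type in CENSORED_EXITS else 1
    return days_lasted, event


# Made-up castaways: (name, days lasted, how they left the game).
castaways = [
    ("Castaway A", 3, "voted_out"),
    ("Castaway B", 15, "medevac"),
    ("Castaway C", 24, "quit"),
    ("Castaway D", 38, "voted_out"),
    ("Castaway E", 39, "finalist"),
]

records = {
    name: to_survival_record(days, exit_type)
    for name, days, exit_type in castaways
}

for name, (duration, event) in records.items():
    status = "voted out" if event else "censored"
    print(f"{name}: day {duration}, {status}")
```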

The plots below are clipped up to day 39. Remember that The Australian Outback lasted up to day 42, and so days 40-42 of Australia are not included in the analysis below.
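One common way to express this kind of clipping is to treat anyone still in the game after the cutoff as censored at the cutoff. The sketch below shows that idea in plain Python; it is an illustrative assumption, not necessarily how the plots here were produced, and `clip_to_horizon` is a hypothetical helper name.

```python
MAX_DAY = 39  # cutoff day used for the plots


def clip_to_horizon(duration, event, horizon=MAX_DAY):
    """Censor any observation that runs past the horizon.

    A castaway who lasted beyond the horizon is recorded as still in the
    game at the horizon (event=0), whatever happened to them afterwards.
    """
    if duration > horizon:
        return horizon, 0
    return duration, event


# Hypothetical examples: (days lasted, event) with event=1 for voted out.
print(clip_to_horizon(41, 1))  # voted out after the cutoff
print(clip_to_horizon(39, 1))  # voted out exactly on the cutoff day
print(clip_to_horizon(20, 0))  # censored well before the cutoff
```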

For the data source, I scraped the Survivor Wiki for contestant information using Beautiful Soup. I used Python Pandas for preprocessing and the wonderful survival and survminer packages in R to do the actual Survival Analysis. You can check out all the code in my Github.

I’ll keep my explanations short and sweet and let the data speak for itself.

## Survival by Sex

The first thing I looked at is sex. We can see that early in the game, up to around day 27, males tend to last longer than females. This is quite an expected conclusion: the focus early in the game, during the tribal phase, is physical prowess, and males tend to have the upper hand in this aspect. When tribes lose immunity challenges, they often target the castaways seen as physically weakest, who are sadly almost always women.

## Survival by Age

In terms of age, it’s interesting to see how young and old castaways trade places on who lasts longer early and late in the game. In the early stage of the game, old castaways expectedly get voted out with higher probability than young ones, and this reverses in the latter part of the game.

(By young, I mean ages 18 to 35; by middle, 36 to 50; by old, above 50.)
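In code, these age groups could be assigned with a small helper like the sketch below. The function name `age_group` and the sample ages are made up for illustration; only the cutoffs come from the definition above.

```python
def age_group(age):
    """Bucket an age into 'young' (18-35), 'middle' (36-50), or 'old' (51+)."""
    if age < 18:
        raise ValueError("ages below 18 are outside the defined groups")
    if age <= 35:
        return "young"
    if age <= 50:
        return "middle"
    return "old"


# Hypothetical castaway ages.
sample_ages = [18, 35, 36, 50, 51, 72]
groups = [age_group(a) for a in sample_ages]
print(list(zip(sample_ages, groups)))
```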

## Survival by Age Group and Sex

Now let’s look at the combination of age and sex. Old females expectedly get the short end of the stick in the early stage. Young males have high survivability at that stage, but lose it all as the game continues. They oftentimes get targeted as ‘physical threats’ and are disposed of during the early merge. In terms of overall survivability, I’d say that old males have the best survival curve. They start out middle-of-the-pack in terms of survival probability in the early game, but once they get past the merge, they tend to go deep in the game.

## Survival by Region

Now let’s look at the effect of geographic origin on the survival curves. Frankly, I’m not very knowledgeable about cultural differences across the four regions (since I’m not from the US). Simply looking at the survival curves, we see that Midwest and South have the highest and lowest mid-game survivability, respectively, but Northeast leads in the endgame.

## Survival by Coast

All I know is that West Coast people are stereotypically more laid back than those on the East Coast. There isn’t really a lot of difference in the survival curves across this geographic stratification. One noticeable thing is that West Coasters have lower survivability in the latter parts of the game compared to East Coasters.

(By West Coast, I mean the Pacific West Coast.)

## … There’s More

This post is getting longer than I initially planned. Check out the next article, where I stratify survival curves by Old School and New School Survivor.

Don’t forget to like the FB page and sign up for the mailing list to get updates. Leave a comment below if this is something interesting to you.