Scraping the Survivor Wiki with Beautiful Soup

Survivor is my favorite reality competition show of all time. It’s been on the air for almost 17 years, and it’s not hard to see why. It has everything that you would want from a TV show – drama, twists, challenges. The show has gained a loyal following over the years, and it’s unlikely that it will end its run anytime soon.

Gameplay-wise, the main focus of the show is the concept of alliances. At its core, Survivor is a power struggle among several competing alliances. It’s very hard, if not impossible, to make it far in the game without being part of one. This is something that’s very interesting to study – the alliances and relationships in the game.

Before data analysis, one must first perform data extraction. It isn’t often that data is available in a convenient format. Usually, data coming from the internet is messy or tangled up in HTML. It’s safe to say that data extraction is not a trivial matter; it’s worthy of a blog post in itself.

In this post, I’ll be extracting data from the Survivor Wiki using the Python module Beautiful Soup. Specifically, the data I’ll be fetching is the voting history table from the pages of each season in the wiki. Along the way, I’ll introduce to you the essential parts of Beautiful Soup.

This is the first article in my Survivor Alliance Analysis series. The other articles in the series are:

  1. Scraping the Survivor Wiki with Beautiful Soup
  2. Computational Analysis of Survivor Alliances
  3. Survivor Alliance Networks Visualized
  4. Network Analysis of Survivor Alliances

[Screenshot: Survivor Caramoan voting history table. Source: http://survivor.wikia.com/wiki/Survivor:_Caramoan]

This post serves two purposes: as a very basic tutorial on web scraping in Python using Beautiful Soup, and as the first part of my four-article series on the computational analysis of Survivor alliances.

What is Web Scraping?

Web scraping is the process of extracting data from websites. Since websites are basically HTML documents underneath (with added bells and whistles from CSS and JavaScript), web scraping usually involves parsing the HTML document for the data you want. You can easily see the underlying HTML of any site using tools from your browser. For example, in Chrome, you can right-click and select View Page Source from the options that appear.

Scraping is quite a sensitive topic to talk about since it sits in a legal grey area. Note, however, that the content of the Survivor Wiki is under the CC-BY-SA license and can be used for this analysis.

To be proficient in web scraping, you must first have some understanding of the structure of an HTML document. In a nutshell, an HTML document can be thought of as a tree of nodes. Each node is an object which represents a part of the document (or a feature of the webpage). Each node branches off from a parent node (unless it’s the root node) and may branch off into child nodes (unless it’s a leaf node). Elements in the tree can be accessed using the programming API known as the Document Object Model (or the DOM). You can read more about the DOM on Wikipedia.

Let’s take a simple HTML document like the following, for example (the title, heading text, and class name are just placeholders):
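```html
<!-- contents are placeholders -->
<!DOCTYPE html>
<html>
  <head>
    <title>My First Webpage</title>
  </head>
  <body>
    <h1>Hello, World!</h1>
    <p class="intro">This is a simple paragraph.</p>
  </body>
</html>
```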

Opening this up in Google Chrome, it gets rendered as:

[Screenshot: the sample HTML document rendered in Google Chrome]

Looking at the HTML file, you can see items surrounded by < >. These are tags – the basic elements of an HTML file. The first tag, <!DOCTYPE html>, simply tells your browser that the file is an HTML file and should be interpreted as such. The second tag, <html>, is the root of the file, and basically says that all tags below it are part of the HTML and should be read. Notice that at the end of the document is the tag </html>. This is the closing tag for the opening tag <html>. Often, tags appear in pairs, but there are some tags which appear only once, for example the line break tag <br>.

As I’ve said earlier, HTML can be interpreted as a tree. The reason tags often appear in pairs is that this allows them to contain other elements. Notice that the html tag branches off into the head and body tags. The first contains metadata related to the file, including the title of the HTML page, and the second contains the elements which are actually rendered in the webpage.

Notice that the body contains two tags, h1 and p. The h1 tag is a heading tag, which renders as a heading in the webpage. The p tag is a paragraph tag, which contains a block of text. In the second tag, you can see that another attribute is included – a class. A class is an identifier for a tag, which you can use to access it, for example, when you want to manipulate its appearance using CSS. The important things to note about classes are that more than one tag can have the same class, and a tag can have more than one class. If a tag has more than one class, the classes are separated by whitespace. There is also another type of identifier, an id, which uniquely identifies a single tag.
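For illustration (the class and id names here are made up):

```html
<!-- a tag with two classes and an id; the names are placeholders -->
<p class="class_1 class_2" id="unique_paragraph">Some text.</p>
```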

All these tags can be accessed using the Document Object Model, as mentioned earlier.

Obviously, webpages on the Internet have way, way more complicated HTML than the one I’ve shown above. A quick check of the HTML source of this very page proves the point – it becomes apparent that web scraping isn’t a trivial matter.

Web Scraping with Beautiful Soup

Enter Beautiful Soup – a Python module which makes web scraping a very simple affair. It provides a clean interface with which you can do a lot of complicated things in only a few lines of code. I’ll provide a very short introduction on how to access information from webpages using Beautiful Soup, but if you want to learn more, the documentation is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Let’s assume that you already have the HTML document stored as a string in a variable called html_doc. If you only have the URL of the website that you wish to scrape, you can use the requests module to fetch the HTML. The way to do it is to call requests.get(target_url), where target_url is the URL of the webpage you want to scrape, and assign the output to some variable, say request_object. To obtain the HTML and save it to html_doc, assign request_object.content to html_doc (strictly speaking, .content gives you bytes and .text gives you a decoded string – Beautiful Soup accepts either). Simple, right? That’s the power of Python.
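In code, the fetch step looks something like this (the URL is just the Caramoan page mentioned above):

```python
import requests

target_url = "http://survivor.wikia.com/wiki/Survivor:_Caramoan"
request_object = requests.get(target_url)  # fetch the page over HTTP
html_doc = request_object.content          # the raw HTML (as bytes)
```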

Next, you have to convert the HTML string into a Beautiful Soup object, say bs_object, which you can query for tags. The way to do this is to assign BeautifulSoup(html_doc) to bs_object. From this object, you can now query for tags and the information that they contain.
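Continuing the example (passing an explicit parser is optional, but it silences a warning in recent versions of the library):

```python
from bs4 import BeautifulSoup

bs_object = BeautifulSoup(html_doc, "html.parser")  # parse the HTML into a tree
print(bs_object.title)  # the <title> tag of the page, if there is one
```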

There are a number of methods you can call on the Beautiful Soup object to query for the tags that you want. I’ll be focusing on three: .find(), .find_all(), and .select().

The .find() method accepts as input a number of things – a tag name, a class, or an id. The input identifies the tag that you want. Note that my discussion is very much simplified – the .find() method is really powerful. You can even query tags using their attributes, for example their style attribute, if they have one.

It’s entirely possible that your identifier matches a lot of tags in the HTML, but .find() only returns the first one it finds. The .find_all() method, on the other hand, returns all tags which match the input identifier.
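Here’s what these look like in practice (the style string and class name are just illustrations):

```python
# The first <table> tag in the document
first_table = bs_object.find("table")

# Every <td> tag whose style attribute is exactly this string
left_cells = bs_object.find_all("td", attrs={"style": "text-align: left;"})

# The first tag carrying a given class
intro_paragraph = bs_object.find(class_="intro")
```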

Lastly, if you’re well-versed in CSS, you can use a CSS selector as an identifier using the .select() method. The .select() method can also be used if you want to look for tags which have two or more classes. For example, if you want to get the p tags which have class="class_1 class_2", you can use .select("p.class_1.class_2") on the Beautiful Soup object.
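In code (with class_1 and class_2 standing in for whatever classes you’re after):

```python
# <p> tags that carry both classes
tagged_paragraphs = bs_object.select("p.class_1.class_2")

# Selectors can also express nesting, e.g. <td> cells inside table rows
cells = bs_object.select("table tr td")
```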

I’m just scratching the surface here. If you’re interested, the documentation linked above covers Beautiful Soup in much more depth.

Scraping the Survivor Wiki

Now that you’re acquainted with scraping in Beautiful Soup, let’s try to extract the voting history data from the Survivor Wiki Caramoan page. The main objective is to copy the contents of the table to a Pandas data frame. This allows for easy data manipulation.

I will save you all the effort of looking at the HTML of the wiki page. Let me just give the essential observations on the structure of the table.

  1. The voting history table is the last table tag matching the CSS selector 'table.wikitable.article-table' – that is, a table with both the wikitable and article-table classes.
  2. The tr tags which contain a single td tag with style 'text-align: left;' are the rows of the voting history table.
  3. The name of the voter is enclosed in a td tag with style 'text-align: left;' or 'text-align: left; white-space: nowrap;'.
  4. The other td tags in the row contain the contestants the voter voted for during the tribal councils he was present in, or '-' if he was not present. These tags can be accessed by calling find_next_siblings() on the tag in Number 3.

Using these four observations, you will be able to parse the HTML document for the votes of each contestant. Note that I will take a simplified approach and consider all votes cast, including those cancelled by a hidden immunity idol and those resulting in a tie – each cast vote is included in the resulting Pandas data frame.

I wrote a function called extract_voting_table_as_df(link) to do all I’ve stated above. Note that some cells span more than one column, particularly those votes which result in a tie, so I had to compensate for it in the resulting data frame.
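The full script is on my GitHub (linked at the end of this post), but to give you the idea, here’s a minimal sketch of how such a function might look, with the four observations above translated directly into code – the style strings, the colspan handling, and the padding at the end are assumptions about the wiki’s markup, so treat this as a sketch rather than the exact script:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup


def extract_voting_table_as_df(link):
    """Scrape a season's voting history table into a Pandas data frame.

    A sketch based on the four observations above; the style strings
    are assumptions about the wiki's markup.
    """
    html_doc = requests.get(link).content
    bs_object = BeautifulSoup(html_doc, "html.parser")

    # Observation 1: the voting history table is the last table
    # matching the selector 'table.wikitable.article-table'.
    table = bs_object.select("table.wikitable.article-table")[-1]

    votes = {}
    for row in table.find_all("tr"):
        # Observations 2 and 3: voter rows start with a left-aligned
        # cell holding the voter's name.
        name_cell = row.find(
            "td", style=lambda s: s and s.startswith("text-align: left")
        )
        if name_cell is None:
            continue  # not a voter row
        voter = name_cell.get_text(strip=True)

        # Observation 4: the sibling cells hold the votes cast.
        row_votes = []
        for cell in name_cell.find_next_siblings("td"):
            vote = cell.get_text(strip=True)
            # Cells spanning several columns (e.g. tie revotes) are
            # repeated so that every row lines up column by column.
            span = int(cell.get("colspan", 1))
            row_votes.extend([vote] * span)
        votes[voter] = row_votes

    # Pad shorter rows so the data frame comes out rectangular.
    width = max(len(v) for v in votes.values())
    for voter in votes:
        votes[voter] += [""] * (width - len(votes[voter]))

    return pd.DataFrame.from_dict(votes, orient="index")
```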

Hopefully, with the introduction to HTML and Beautiful Soup that I’ve given above, you are able to understand the script in its entirety.

The resulting data frame object for the season Caramoan looks like:

[Screenshot: the resulting Pandas data frame for Survivor: Caramoan]

The numbering of the tribal councils in the resulting data frame is not the same as that in the original Survivor Wiki page. In the wiki, the numbering is per episode – and one episode may include more than one voting round. This can be fixed with some Pandas manipulation, but it doesn’t matter for the analysis that I will be doing, so I’ll just leave it as is. In addition, cancelled votes are not marked as cancelled here. I want to consider these in my analysis, so I’m including them in the data frame.

In Conclusion

Now that data extraction is done, we can move on to the next step in our Survivor alliance analysis – quantifying voting relationships among Survivor castaways using voting history data. This involves a tiny bit of math and some logical thinking, so best be prepared for it! Using the alliance index that we define in the second article in the series, we are able to build Survivor alliance networks. Check it out in the third and fourth articles in the series, Survivor Alliance Networks Visualized and Network Analysis of Survivor Alliances.

You can check out the complete code for the scraping exercise I did above (as well as code for the next two posts) on my GitHub. If you have any questions, don’t hesitate to leave a comment below. While you’re at it, like the FB page and subscribe via e-mail (see sidebar) to get updates on new posts. ‘Til next time.
