Data Science Articles on the Web: Simpsons, Harry Potter, Pokemon

Data science methods applied to TV shows and movies are the coolest thing ever. This blog’s main purpose is to share my passion for data science applied to movies and TV. I’m not the first one to do this, however. There are a lot of articles and blogs on the web that use data science methods to study entertainment media. In this article, I share three of them with you.

One applied neural networks to Pokemon card type prediction, another used text analytics to study the Harry Potter books, and the third employed audio processing and sentiment analysis to determine how The Simpsons characters feel about each other. Needless to say, all of these are really interesting, so check them out.



Pokemon Card Type Prediction

Eric Feldman’s four-year-old nephew Yali recently discovered the wonders of Pokemon cards. Since Yali is too young to understand the rules of the Pokemon trading card game, he came up with a different game to play – visually identifying the type of Pokemon cards. (Note that Yali doesn’t know how to read yet.)

As a machine learning enthusiast, Eric thought that it would be a nice challenge to build a model that could beat Yali at predicting Pokemon card types. Thinking that optical character recognition could be considered cheating because, you know, Yali couldn’t read yet, he used the pixels of each Pokemon card as his input dataset.

In total, he built 1,533 neural network (NN) models of different configurations. Training data (Pokemon card images) was obtained from PkmnCards. He tested the models using a wide assortment of Pokemon cards with visual defects: some with drawings on them, some mutilated, some with parts edited out. The final model was an ensemble of the four best-performing NN models.
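To make the ensemble idea concrete, here is a minimal sketch in Python. It assumes the four models each output a probability per card type and that the ensemble simply averages them. That averaging rule, the card types, and the numbers are my assumptions for illustration, not details from Eric's post.

```python
from typing import Sequence

# Hypothetical subset of card types, for illustration only.
CARD_TYPES = ["fire", "water", "grass"]


def ensemble_predict(model_probs: Sequence[Sequence[float]]) -> str:
    """Average per-class probabilities across models and return the top class."""
    n_models = len(model_probs)
    n_classes = len(CARD_TYPES)
    # Mean probability of each class over all models.
    avg = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return CARD_TYPES[best]


# Made-up outputs of four models for a single card image.
example_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
]
prediction = ensemble_predict(example_probs)
print(prediction)
```

Averaging lets one model's mistake (the second one above prefers water) get outvoted by the others.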

He pitted his NN model against Yali in a Pokemon card type-identification showdown. In the end, his model prevailed with 23/25 correct, while Yali got 21/25 correct.

1 – machines, 0 – humans.



Harry Potter Textual Analysis

Zareen Farooqui, at the time of writing, was just starting out with data science. She wanted to do a Python project, and settled on a textual analysis of J.K. Rowling’s Harry Potter books.

Her article focuses on four aspects of textual analysis: word count/punctuation analysis, sentiment analysis, N-gram analysis, and relationship analysis.

Word Count/Punctuation Analysis

Zareen found that the last three books of the series contain far more words than the first four. The same trend holds for the number of unique words and punctuation marks. Her explanation is that J.K. Rowling’s writing style probably matured over time to keep up with the series’ maturing fanbase.
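Counts like these are easy to compute. The sketch below is my own minimal version, not Zareen's code. It assumes a word is any run of letters and apostrophes, and the function name text_stats is made up for this example.

```python
import re
import string


def text_stats(text: str) -> dict:
    """Count words, unique words, and punctuation marks in a text."""
    # Assumption: a "word" is a run of letters/apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    punctuation = [ch for ch in text if ch in string.punctuation]
    return {
        "words": len(words),
        "unique_words": len(set(words)),
        "punctuation": len(punctuation),
    }


sample = "Harry looked up. Harry smiled, and Ron smiled too!"
stats = text_stats(sample)
print(stats)
```

Running this once per book gives three numbers per book that can be compared across the series.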

Sentiment Analysis

The sentiment analysis part of her project is fairly simple. Zareen used a lexicon-based method: she counted the number of positive and negative words in each novel. Only a small share of the words in any novel carry sentiment. Each Harry Potter book is composed of about 3% positive words and 3% negative words; the remaining 94% or so are neutral.
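Here is a rough sketch of what counting positive and negative words can look like. The two tiny word sets are placeholders I made up. A real analysis would use a published sentiment lexicon, and I don't know which one Zareen used.

```python
import re

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"happy", "brave", "love"}
NEGATIVE = {"dark", "fear", "dead"}


def sentiment_shares(text: str) -> tuple:
    """Return the (positive, negative, neutral) share of words in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return (0.0, 0.0, 0.0)
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos / total, neg / total, (total - pos - neg) / total)


sample = "The brave boy felt no fear in the dark forest"
shares = sentiment_shares(sample)
print(shares)
```

Note that this approach ignores context: "no fear" still counts "fear" as a negative word, which is one reason plain word counting is a simplification.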

(If you’re interested in sentiment analysis, check out VADER Sentiment Analysis Explained, wherein I introduce in simple words how sentiment analysis is done. Also check out my other article wherein I rank the negativity of Black Mirror episodes using sentiment analysis.)

N-gram Analysis

Zareen also looked at how often certain word sequences (N-grams) occur across the different novels. An interesting result was that the 3-gram “Ron and Hermione” increased in frequency as the series went on, foreshadowing Ron and Hermione ending up together. She also counted mentions of the different Hogwarts classes in the text and found that Defense Against the Dark Arts and Potions come up the most, which suggests they are the most important to the story.
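Counting N-grams takes only a few lines of Python. This is a minimal sketch under my own assumptions (lowercased text, words made of letters and apostrophes, no attention paid to sentence boundaries), not the code from the article.

```python
import re
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count every run of n consecutive words in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    # zip over n shifted copies of the word list to get sliding windows.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "Ron and Hermione ran. Harry saw Ron and Hermione leave."
counts = ngram_counts(sample, 3)
print(counts["ron and hermione"])
```

Comparing counts["ron and hermione"] from book to book is all it takes to see a trend like the one described above.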

Relationship Analysis

Lastly, the article studied character relationships using the notion of a viewpoint character and a target character. Zareen first fixed a viewpoint character and looked for every occurrence of that character’s name in the text. For each occurrence, she took the 40 words before and the 40 words after it and counted how many times the target character was mentioned. She did that for all character pairs in the series and obtained a matrix of relationship strengths. Harry was found to be closest to Hermione and Ron, as expected.
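The windowing procedure can be sketched as follows. This is my reading of the description, not Zareen's code. It assumes each character is referred to by a single lowercase name, and the function names are made up.

```python
import re


def cooccurrence(words: list, viewpoint: str, target: str, window: int = 40) -> int:
    """Count target mentions within `window` words of each viewpoint mention."""
    count = 0
    for i, word in enumerate(words):
        if word == viewpoint:
            before = words[max(0, i - window):i]
            after = words[i + 1:i + 1 + window]
            count += (before + after).count(target)
    return count


def relationship_matrix(text: str, characters: list, window: int = 40) -> dict:
    """Build a nested dict: matrix[viewpoint][target] = co-occurrence count."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        a: {b: cooccurrence(words, a, b, window) for b in characters if b != a}
        for a in characters
    }


sample = "harry and ron walked while hermione read and harry waited for ron"
# A window of 3 keeps this toy example readable; the article used 40.
matrix = relationship_matrix(sample, ["harry", "ron", "hermione"], window=3)
print(matrix["harry"])
```

The matrix does not have to be symmetric, since a character who is mentioned often has more windows around their name.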

(Interested in network and relationship analysis? Check out my primer on Network Analysis and the Survivor Alliance Analysis series wherein I study relationships in the TV show Survivor.)



Sentiment Analysis of The Simpsons Characters

This is honestly one of the most amazing things I’ve ever seen on the Internet. In this article, Vic Paruchuri studied how The Simpsons characters feel about each other. He used the transcripts for all the episodes of the show, available online, as his dataset. The problem is, the transcripts don’t contain information on who spoke which line.

To work around this, he used the audio from each episode and a Support Vector Machine classifier to determine which character was speaking. After labelling the transcripts, Vic proceeded with sentiment analysis using the AFINN-111 word list and random indexing.

In the end, Vic was able to construct a sentiment matrix of the major Simpsons characters. Some results were expected, like Mr. Burns hating everyone, and some were quite unexpected, like Krusty the Clown hating Lisa but not the other way around.
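To give a feel for how labelled lines can turn into a sentiment matrix, here is a heavily simplified sketch. It is not Vic's method: it skips the audio classification and random indexing entirely and assumes a line expresses sentiment toward any character it names. The word scores are a hypothetical AFINN-style lexicon (AFINN lists rate words from -5 to +5), and the dialogue is made up.

```python
import re
from collections import defaultdict

# Hypothetical AFINN-style word scores, for illustration only.
VALENCE = {"love": 3, "great": 3, "hate": -3, "stupid": -2}


def sentiment_matrix(lines: list, characters: list) -> dict:
    """Sum word scores of each line into matrix[speaker][mentioned character]."""
    matrix = defaultdict(lambda: defaultdict(int))
    for speaker, text in lines:
        words = re.findall(r"[a-z']+", text.lower())
        score = sum(VALENCE.get(w, 0) for w in words)
        # Assumption: a line is "about" every other character it names.
        for name in characters:
            if name != speaker and name in words:
                matrix[speaker][name] += score
    return matrix


# Made-up labelled lines: (speaker, text).
lines = [
    ("burns", "I hate that stupid homer"),
    ("homer", "I love marge"),
    ("burns", "smithers is great"),
]
matrix = sentiment_matrix(lines, ["burns", "homer", "marge", "smithers"])
print(matrix["burns"]["homer"])
```

Because scores are stored per speaker, the matrix can be lopsided, which is how a result like one character disliking another "but not the other way around" can show up.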

Vic’s Simpsons analysis is the gold standard for projects about data science applied to media.


Do you know any cool data science articles?
