A common problem that data science newbies face after finishing an online course on data science or machine learning is answering the question, “Now, what?”
Yes, you may know a little bit about supervised learning and text mining, but that doesn’t really matter unless you convert your knowledge into action. It is important that you find a topic you like and try to solve a problem in it using your data science knowledge.
If you’re like me, and your interest lies in the field of media – TV, movies, etc. – then you wouldn’t have a problem looking for an interesting topic. Data science has been applied to the Pokemon card game, the Simpsons, and Harry Potter. I have posts on this blog wherein I applied it to Black Mirror and Survivor. You can formulate an interesting data science question for literally any show or movie you are interested in. And since it’s where your interest lies, doing data science becomes 10x more fun.
Here are some very cool project ideas involving TV shows and movies in which you can apply data science tools. One involves predicting the culprit in Scooby-Doo, another studies the relationship dynamics of Jesse and Celine in the Before Sunrise series, and another involves ranking the negativity of American Horror Story seasons.
Scooby-Doo Culprit Prediction
This is an idea that I thought would be fun to do. Usually, the format of the cartoon Scooby-Doo involves the following sequence or some variation thereof. First, the gang drives into some town menaced by a ghost or monster. They stop at the town and interact a bit with the townsfolk. They encounter the monster, investigate a bit, and eventually unmask the monster, who is usually one of the townsfolk.
One interesting thing we can do is to actually predict who the culprit is before the gang unmasks the monster. We can analyze the dialog of every character introduced in the episode and find patterns which identify the culprit. We can also probably map the relations that exist among the characters to identify ulterior motives. The pertinent tools for this project are textual analysis, particularly sentiment analysis, and network analysis.
There is no shortage of data for this task. A massive compendium of Scooby-Doo episodes exists because the show has been on air from 1969 to the present.
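To make the network analysis part a bit more concrete, here is a minimal sketch of how the relations among characters could be mapped. It assumes we already have, for each scene, the list of characters who appear in it. The scene data and character names below are placeholders I made up, and counting shared scenes is just one simple way to define a relation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical scene data: each scene lists the characters who appear in it.
# These names are placeholders, not taken from any real episode.
scenes = [
    ["Caretaker", "Mayor"],
    ["Caretaker", "Shopkeeper"],
    ["Mayor", "Shopkeeper"],
    ["Caretaker", "Mayor"],
    ["Caretaker", "Heiress"],
]


def build_cooccurrence(scenes):
    """Return a weighted undirected graph: graph[a][b] = number of shared scenes."""
    graph = defaultdict(lambda: defaultdict(int))
    for cast in scenes:
        # Count every unordered pair of distinct characters in the scene once.
        for a, b in combinations(sorted(set(cast)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def weighted_degree(graph):
    """Sum of edge weights per character, a crude measure of connectedness."""
    return {name: sum(neighbors.values()) for name, neighbors in graph.items()}


graph = build_cooccurrence(scenes)
degrees = weighted_degree(graph)

# Print characters from most to least connected (ties broken alphabetically).
for name, degree in sorted(degrees.items(), key=lambda item: (-item[1], item[0])):
    print(name, degree)
```

Whether the most connected character is more or less likely to be the culprit is exactly the kind of pattern the project would have to test. The sketch only builds the graph.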
Before Sunrise/Sunset/Midnight Analysis
I have two ideas for Richard Linklater’s Before series. One is an application of sentiment analysis, and the other is an application of topic modeling.
Idea 1: Sentiment Dynamics of Jesse and Celine
Jesse and Celine have so many lines of dialog. One project that we can do is to plot the sentiment dynamics of Jesse and Celine over the course of the three movies in order to see how their sentiment/mood changes. We can split this into two tasks.
For a specific movie, we divide the movie into k-minute bins and collect Jesse’s and Celine’s dialog for each bin. After that, we apply sentiment analysis to calculate the sentiment scores of Jesse and Celine for each bin and plot the dynamics. What we get for each movie are two time series which show how sentiment changes over time – one for Jesse and one for Celine. From the two time series, we will be able to see how Jesse’s and Celine’s sentiment or mood changes over the course of the movie: do they become happier, or sadder?
Next, we apply the previous procedure to each movie. In total, we have three sets (one for each movie) of two time series (one for Jesse and one for Celine). From these three sets, we can see how Jesse’s sentiment changes because of the two nine-year time jumps. Same for Celine.
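The binning procedure can be sketched roughly as follows. This is only an illustration: the tiny word lexicon, the (minute, speaker, line) record format, and the dialog lines are all assumptions I made up, and a real project would use a proper sentiment tool and timestamps taken from subtitles.

```python
from collections import defaultdict

# Toy sentiment lexicon (an assumption for illustration only).
LEXICON = {"love": 1, "happy": 1, "beautiful": 1, "sad": -1, "hate": -1, "afraid": -1}

# Hypothetical (minute, speaker, line) records; not real dialog from the films.
lines = [
    (1.0, "Jesse", "I love this city"),
    (3.5, "Celine", "I am afraid of flying"),
    (6.0, "Jesse", "That is sad but beautiful"),
    (7.2, "Celine", "I am happy we met"),
    (12.0, "Jesse", "I hate goodbyes"),
]


def score(text):
    """Sum lexicon values of the words in text; unknown words count as 0."""
    return sum(LEXICON.get(word.strip(".,!?").lower(), 0) for word in text.split())


def binned_sentiment(lines, k):
    """Return {speaker: {bin_index: mean line score}} using k-minute bins."""
    scores_by_bin = defaultdict(lambda: defaultdict(list))
    for minute, speaker, text in lines:
        # Bin 0 covers minutes [0, k), bin 1 covers [k, 2k), and so on.
        scores_by_bin[speaker][int(minute // k)].append(score(text))
    return {
        speaker: {b: sum(s) / len(s) for b, s in sorted(bins.items())}
        for speaker, bins in scores_by_bin.items()
    }


series = binned_sentiment(lines, k=5)
for speaker, bins in series.items():
    print(speaker, bins)
```

Each inner dictionary is one time series. Running the same function on each movie gives the three sets of two series, which can then be plotted side by side.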
Idea 2: Topic Summarization of Jesse and Celine’s Conversations
The three movies in the Before series take place at three points in time, each nine years apart. With the dialog of the three movies as our dataset, we can use topic modeling to identify the primary topics of the conversations of Jesse and Celine in each movie. Using the results here, we will be able to see how Jesse and Celine’s topics of conversation change over time, and we will be able to summarize the topics in a small set of words.
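As a rough first pass before fitting an actual topic model such as LDA, we could rank the words in each movie by TF-IDF to see which words are distinctive to that movie. To be clear, this is a simpler stand-in and not topic modeling itself. The word lists below are invented placeholders, not real transcripts.

```python
import math
from collections import Counter

# Hypothetical bag-of-words stand-ins for each film's dialog (not real transcripts).
docs = {
    "Before Sunrise": "train vienna love train poetry love time",
    "Before Sunset": "book paris love regret book time",
    "Before Midnight": "kids greece marriage kids love time",
}


def top_terms(docs, n=2):
    """Rank each document's words by TF-IDF and return the top n per document."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for c in counts.values() for word in c)
    total_docs = len(docs)
    result = {}
    for name, c in counts.items():
        length = sum(c.values())
        scores = {
            word: (freq / length) * math.log(total_docs / doc_freq[word])
            for word, freq in c.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[name] = [word for word, _ in ranked[:n]]
    return result


summary = top_terms(docs)
for name, words in summary.items():
    print(name, words)
```

Words that appear in every movie get a score of zero, so what remains is what sets each movie apart. A topic model would go further by grouping related words into topics.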
American Horror Story Sentiment Ranking
People often say that Asylum is the bleakest and most depressing season, while Coven is the campiest and most light-hearted. One project that we can do is apply sentiment analysis to come up with a ranking of each season’s sentiment. So we basically answer the questions: Which season is the most positive? Which is the most negative?
This is quite similar in objective to the study I made on Black Mirror, wherein I ranked the episodes in terms of their negativity. However, Black Mirror, format-wise, is a much simpler show than American Horror Story (AHS). Each episode of Black Mirror is a different narrative, whereas each season of AHS (~13 episodes) is a different narrative. Thus, how to actually do sentiment analysis on AHS becomes a legitimate question. We have to define a suitable and justifiable metric which would represent the sentiment of a season in order to be able to rank the sentiment of seasons.
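One candidate metric, and I am only assuming it is a reasonable starting point, is the mean of the episode-level sentiment scores in a season. The sketch below ranks seasons that way. The season labels and scores are invented placeholders, not measurements from the show.

```python
from statistics import mean

# Placeholder episode-level sentiment scores per season (invented numbers,
# not measurements from the show).
episode_scores = {
    "Season A": [-0.4, -0.6, -0.5],
    "Season B": [0.1, -0.2, 0.4],
    "Season C": [-0.1, -0.3, 0.1],
}


def rank_seasons(episode_scores):
    """Rank seasons from most negative to most positive by mean episode score."""
    season_means = {season: mean(scores) for season, scores in episode_scores.items()}
    return sorted(season_means, key=season_means.get)


ranking = rank_seasons(episode_scores)
print(ranking)
```

The mean treats every episode equally. A median, or a mean weighted by the amount of dialog in each episode, are other options, and justifying the choice is part of the project.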
Do you have any cool data science project ideas?