Jesse or Celine? Text Classification on Before Sunrise Dialog: Part I

Ah, Before Sunrise. The Before trio is my favorite trilogy of all time. It’s the epitome of romance on film, conveyed by Jesse and Celine in all its glory. This blog will not be complete without at least one post on the best love story of all time.

So let’s kick it off with an interesting research question. Given a line of dialog from the movie, can we determine who said it: Jesse or Celine? This is a basic binary classification exercise. The hard part is collecting and cleaning the data and then processing the text into machine learning features.

Since Jesse and Celine are very different people (i.e. they argue and disagree a lot), their word usage and patterns should be quite different as well – and that’s exactly what we want to pick up with this classification task. Will we get good enough results to distinguish the speaker?

Data Source

For this exercise, we will be using all spoken lines from Before Sunrise and Before Sunset. Let’s leave out Before Midnight for the moment: its script is only available as a PDF floating around the net, and I’m having a hard time scraping and cleaning its dialog into a usable form. For the first two movies, the scripts are directly available from the Before wiki, so I used Beautiful Soup to parse the HTML and return speaker-dialog tuples.
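Here’s a rough sketch of how that scraping could look with requests and Beautiful Soup. The URLs and the assumption that each spoken line sits in a paragraph formatted as ‘SPEAKER: dialog’ are placeholders of mine; the real wiki markup may well differ.

```python
import requests
from bs4 import BeautifulSoup

def scrape_script(url):
    """Return a list of (speaker, dialog) tuples from a script page.

    Assumes each line appears in a <p> tag formatted as
    'SPEAKER: dialog text' -- adjust to the actual wiki markup.
    """
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    pairs = []
    for p in soup.find_all("p"):
        text = p.get_text(strip=True)
        speaker, sep, line = text.partition(":")
        if sep and speaker.strip().title() in ("Jesse", "Celine"):
            pairs.append((speaker.strip().title(), line.strip()))
    return pairs

# Placeholder URLs -- swap in the actual Before wiki script pages.
sunrise_lines = scrape_script("https://before-trilogy.example.com/Before_Sunrise_script")
sunset_lines = scrape_script("https://before-trilogy.example.com/Before_Sunset_script")
```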

I stored all dialog tuples in a single Pandas dataframe and filtered out lines with fewer than three words. I had to do this because lines like ‘Hi’ and ‘Okay, thanks’ aren’t really that distinctive and would just add noise to our classification task.
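A minimal sketch of that step, assuming the tuples from the scraper above and column names of my own choosing:

```python
import pandas as pd

# Combine the tuples from both movies into one dataframe.
df = pd.DataFrame(sunrise_lines + sunset_lines, columns=["speaker", "line"])

# Drop very short lines ('Hi', 'Okay thanks', ...) that carry little signal.
df = df[df["line"].str.split().str.len() >= 3].reset_index(drop=True)

print(df["speaker"].value_counts())
```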

In total, I ended up with 1,024 lines with an even split for Jesse and Celine. Coincidence? I think not.

Modeling

We’ll be using a simple feedforward neural network built on Keras as our classification model. I initially tried passing the words through a Word2Vec embedding layer, but I realized that it doesn’t really make much sense to do that. Word2Vec encodes the *semantic* meaning of text, but what we actually want to detect is the distinction in word usage, something much shallower – we aren’t looking that deep into semantics. As such, a simple unigram bag-of-words representation is sufficient for this task. Check out my old article to get the gist of what this means.
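For reference, one way to build that unigram bag-of-words matrix is with scikit-learn’s CountVectorizer; this is just a sketch, not necessarily how the original features were built:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # unigram counts by default
X = vectorizer.fit_transform(df["line"]).toarray()

# Binary target: 1 for Jesse, 0 for Celine.
y = (df["speaker"] == "Jesse").astype(int).values
```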

I don’t want to bog you down with details on model selection, but I tried out two- and three-layer nets under 10-fold cross-validation. The best-performing one was a three-layer net with 32 hidden units in each of the first two (ReLU) layers and a single-unit sigmoid output layer. I also placed a dropout layer (rate 0.5) between the two hidden layers for some regularization.
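For concreteness, the described architecture would look roughly like this in Keras 2. The optimizer, epoch count, and batch size below are my guesses, not taken from the original code:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import StratifiedKFold

def build_model(input_dim):
    model = Sequential([
        Dense(32, activation="relu", input_dim=input_dim),
        Dropout(0.5),                    # regularization between the two hidden layers
        Dense(32, activation="relu"),
        Dense(1, activation="sigmoid"),  # outputs P(speaker == Jesse)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# 10-fold cross-validation over the bag-of-words matrix X and labels y.
fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=42).split(X, y):
    model = build_model(X.shape[1])
    model.fit(X[train_idx], y[train_idx], epochs=20, batch_size=32, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    fold_scores.append(acc)
print("mean test accuracy:", np.mean(fold_scores))
```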

Results

Currently, the best-performing model is sitting at 86% training accuracy and 62% test accuracy averaged over 10 folds. The model is obviously not all that amazing, but at least it’s doing better than random guessing by 12 percentage points.

We can use the model to tell us whether a line it has never seen before is more likely to be uttered by Jesse or Celine. Let’s try out lines from the third movie.

So let’s try:

Just keep practicing the piano, okay? You’re really good, and they spend so much time at that school of yours… just remember that music is actually something you will use in your life. Right, and don’t forget to – you want those sesame things, right? They’re really good.

That line is uttered by Jesse. The model predicts a 67% chance that Jesse said that.

Two kitties. Every time, every year, two cats. I mean it was just… amazing. Then one day, I was around 30 and I was having lunch with my Dad, I was remembering, mentioning little Cleopatra, and he was like – ‘the hardest thing I ever had to do was to kill those cute little kittens’ – and I was like WHAT? It turns out – listen to this – there were sometimes up to 7 kittens in that litter–

This one is by Celine. The model predicts Celine with 70% probability.

You’re so corny! Sometimes I’m just like –

By Celine, but the model predicts Jesse with 52% probability.
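For the curious, scoring a new line would just be a matter of running it through the same vectorizer and calling predict. This sketch reuses the hypothetical names from the earlier snippets and assumes a model fit on all of the training data:

```python
def who_said_it(line, model, vectorizer):
    """Return the predicted speaker and the model's confidence in that speaker."""
    x = vectorizer.transform([line]).toarray()
    p_jesse = float(model.predict(x)[0][0])
    return ("Jesse", p_jesse) if p_jesse >= 0.5 else ("Celine", 1.0 - p_jesse)

print(who_said_it("You're so corny! Sometimes I'm just like -", model, vectorizer))
```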

So there is some weight to our assumption that Jesse and Celine’s lines can be distinguished. I think the reason our model isn’t performing well on some lines is that the shorter the line, the sparser the BOW vector becomes. If it becomes too sparse, it becomes difficult to distinguish between Jesse and Celine.

Let’s be under no illusions, though. The classifier is still quite shitty at the moment, and there are lots of things we can do to improve it.

To Do

Well, I’m not confident that we will be able to obtain *extremely high* accuracy given that we only have a maximum of three movies’ worth of script to train on. We have to let Richard Linklater know that we need more data, so he has to make a fourth movie. 🙂

So, what can we do to improve our results?

One, we have to include Before Midnight in our training data. Obviously, this will improve our results. At this point, the more lines we have, the better.

Two, we can include bigrams in our bag-of-words model. Jesse and Celine might be using distinct word pairs, and this could help distinguish between the two. Who knows?

Three, sentiment might be something interesting to use as a feature. Jesse is a lot more mellow than Celine, and this could potentially be encoded in the sentiment of their words.

Four, we can try out different ML algorithms. Random forests are nonparametric models that historically have done really well on binary classification tasks. Why not try them out? There’s a quick sketch of this, together with the bigram idea, right after this list.
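As a sketch of items Two and Four, here’s how bigram features and a random forest baseline could be wired up with scikit-learn. This is purely illustrative, reuses the df and y from the earlier snippets, and isn’t part of the original experiments:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Unigrams + bigrams feeding a random forest, evaluated with the same 10-fold CV.
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=300, random_state=42),
)
print(cross_val_score(pipeline, df["line"], y, cv=10).mean())
```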

In terms of Before Sunrise as a data science topic, there is a lot of possible stuff to work on: sentiment analysis, topic modelling, dynamic word trends. You name it. It’s all very exciting. Let’s explore all of them on Data Meets Media.

If you’re interested in learning more about text analytics in Python, my favorite reference is:

The book focuses more on the practical aspects, so you don’t get bogged down by theory.
All of my code and processed data can be found on GitHub. Don’t forget to follow me on Facebook and Twitter.