This is a follow-up to my post on exploring geographic associations in the Tagalog FastText embedding. In the last post, we focused on associations with the provinces of the Philippines. This time, we zoom in on Metro Manila and check out city associations. For more details on the models used and analysis performed, it would be best to check out my previous post.
Two caveats. First, as mentioned in the last post, FastText is a word embedding, so it does not disambiguate word senses. As a result, Quezon City, the largest city in Metro Manila, is not properly represented in the model: the quezon token is an amalgam of Quezon province, Quezon City, and the president Manuel Quezon. Second, bigrams are not represented in the unigram FastText model. San Juan is a bigram, so it has no entry in the model. Because of these issues, Quezon City and San Juan are unfortunately not included in the analysis.
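To make the two exclusion rules concrete, here is a minimal sketch of how the city list could be filtered before any lookups. The helper name, the `ambiguous` set, and the toy vocabulary are all hypothetical and only illustrate the rules described above; this is not the code used for the analysis.

```python
def usable_city_tokens(cities, vocabulary, ambiguous=frozenset({"quezon"})):
    """Map each city to a single lookup token, skipping cities that cannot be used.

    A city is skipped when its name is multi-word (no unigram exists),
    when its token is known to mix several senses, or when the token
    is missing from the vocabulary.
    """
    usable = {}
    for city in cities:
        token = city.lower()
        # Drop a trailing " city" so "Quezon City" is checked as "quezon".
        if token.endswith(" city"):
            token = token[: -len(" city")]
        token = token.strip()
        if " " in token:           # e.g. "san juan": a bigram, not a unigram
            continue
        if token in ambiguous:     # e.g. "quezon": province, city, and president
            continue
        if token not in vocabulary:
            continue
        usable[city] = token
    return usable


# Toy vocabulary for illustration only (hypothetical, not the real model vocabulary).
toy_vocabulary = {"makati", "pasig", "quezon", "san", "juan"}
print(usable_city_tokens(["Makati", "Quezon City", "San Juan", "Pasig"], toy_vocabulary))
# prints {'Makati': 'makati', 'Pasig': 'pasig'}
```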
What’s the goal of this analysis? A FastText model encodes knowledge that is implicit in the corpus it was trained on. In the case of the pretrained model, the corpus is Tagalog Wikipedia. By testing some obvious associations, we can check whether the knowledge representation in the embedding model is satisfactory. By testing non-obvious associations, we can detect implicit biases in the input corpus.
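As a rough illustration of what "testing an association" can look like, the sketch below ranks cities by cosine similarity between a query word vector and each city vector. Cosine similarity is an assumption here (the details of the actual measure are in the previous post), and the function names and the 2-dimensional toy vectors are made up purely to exercise the code. In practice the vectors would come from the pretrained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_cities(query, city_tokens, vectors):
    """Return (city, score) pairs sorted from most to least associated with the query."""
    query_vector = vectors[query]
    scores = {
        city: cosine_similarity(query_vector, vectors[token])
        for city, token in city_tokens.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder vectors with neutral names; these are not real embedding values.
toy_vectors = {
    "query": [1.0, 0.0],
    "city_a": [0.9, 0.1],
    "city_b": [0.1, 0.9],
}
ranking = rank_cities("query", {"City A": "city_a", "City B": "city_b"}, toy_vectors)
for city, score in ranking:
    print(city, round(score, 3))
```

The scores for all cities can then be drawn on a map, which is how the figures in this post should be read.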
Let’s start with something obvious.
As expected, the southern part of Metro Manila lights up for airport, since this is where the airport is located. The association peaks at Parañaque (followed by Pasay), which is where the NAIA terminals can be found.
For river, Marikina, Pasig and Mandaluyong all light up. The first two have rivers named after them (Marikina River and Pasig River). Mandaluyong neighbors Pasig and is also bordered by the Pasig River.
Interestingly, Pasay is the city most associated with traffic. This is not such a big surprise when we think about all the recent developments in the city (Resorts World, casinos) and its proximity to the airport. Makati and Taguig, which I thought would have the highest association with traffic, do not score highly in the model.
I put an asterisk on the translation because there is really no direct translation for sosyal. Rendering sosyal as classy is somewhat debatable. Somewhat expectedly, Makati and Muntinlupa have the highest associations overall. Makati has Forbes Park and high-end shopping malls, while Muntinlupa has Alabang. All of these are strongly associated with the sosyal subculture.
For urban, Makati, Muntinlupa and Marikina are in the highest quantile. Makati in particular is expected, since it is home to a central business district. The appearance of Muntinlupa and Marikina at the top is quite surprising, since these two cities aren’t what I would put at the top of a list of the most urbanized. I would give the mantle to Taguig (for Bonifacio Global City) or Mandaluyong (for Ortigas).
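For readers wondering what "highest quantile" means in practice, here is one possible way to bin association scores into quantiles using only the Python standard library. This is a sketch under the assumption that the maps use quartile-style bins. The function name and the choice of four bins are mine, not taken from the original analysis.

```python
import bisect
import statistics


def quantile_bins(scores, n_bins=4):
    """Assign each city a bin index from 0 (lowest scores) to n_bins - 1 (highest).

    `scores` maps city name to association score and needs at least two entries.
    """
    # statistics.quantiles returns the n_bins - 1 cut points between bins.
    cuts = statistics.quantiles(list(scores.values()), n=n_bins)
    # bisect_right counts how many cut points a score is at or above.
    return {city: bisect.bisect_right(cuts, score) for city, score in scores.items()}
```

Cities whose bin index equals `n_bins - 1` are the ones in the highest quantile.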
What we’re probing in this analysis are the implicit associations contained in the Tagalog Wikipedia corpus. It would be more interesting to capture the biases in the minds of actual Filipinos.
How can we carry this out? By training a model on text from actual Filipinos. There are a myriad of sources: Reddit, online forums, Facebook. That’s the subject of the next article in this series.