One of the first data science specializations I took on Coursera is the five-course sequence from the University of Michigan. The specialization includes three ‘core’ courses focusing on data wrangling, visualization and modeling, and two ‘elective’ courses on text mining and social network analysis. The main selling point of this sequence is that the focus is not on theory but on actual practice with the de facto language of data science: Python.
While Andrew Ng’s Machine Learning dwells more on the clean side of data science (the theory), this sequence by the University of Michigan tackles the dirty side: the actual handling of data. You need to be adept at both if you want to be a great data scientist, so the two MOOCs complement each other very well. For a review of Andrew Ng’s Deep Learning specialization, check out my other blog post.
And while I more or less liked the core courses of this specialization, the latter courses left a lot to be desired. I’ll just blame it on the specialization currently being at version 1.0. I’m sure it’ll improve in the future.
Introduction to Data Science in Python
The first course focuses on basic Python patterns for data science and the Pandas library for data manipulation. It teaches you how to load data into your environment, how to clean it, and how to prepare it for visualization and modeling.
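To give a flavor of that load, clean, and prepare workflow, here is a minimal sketch of my own. It is not taken from the course assignments: the tiny CSV, its column names, and the `load_and_clean` helper are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical raw data: stray whitespace, inconsistent casing, one missing value.
RAW_CSV = """city,year,population
 ann arbor ,2015,117000
Detroit,2015,677000
detroit,2016,673000
Lansing,2016,
"""


def load_and_clean(csv_text: str) -> pd.DataFrame:
    """Load CSV text, normalize city names, and drop rows without a population."""
    df = pd.read_csv(io.StringIO(csv_text))
    return (
        df.assign(city=df["city"].str.strip().str.title())  # " ann arbor " -> "Ann Arbor"
        .dropna(subset=["population"])                      # remove the incomplete row
        .astype({"population": int})                        # floats (due to NaN) back to ints
        .reset_index(drop=True)
    )


clean = load_and_clean(RAW_CSV)

# Prepare a per-city summary that could feed a chart or a model.
by_city = clean.groupby("city")["population"].mean()
print(by_city)
```

The method chain keeps every cleaning step visible in one place, which is roughly the style of wrangling the assignments push you toward.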
The first course is hands-down the best in the sequence, and one I recommend to everyone who wants to get acquainted with the wonderful Pandas library in Python. One thing all the courses in the sequence have in common is that the programming assignments are a notch harder than in your usual Coursera course. The assignments in this specialization will surely stretch your wrangling muscles, and I bet you’ll come out of the course a Pandas wiz.
A great way to learn Pandas is to go over Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython from O’Reilly. It’s very comprehensive and hands-on. I treat it as the canonical reference for Pandas.
Applied Plotting, Charting & Data Representation in Python
Visualization is the topic of the second course in the sequence. The course teaches you some conceptual principles on what makes a good visualization as well as practical charting using the Matplotlib library in Python.
For me, the standard Matplotlib tools that you’d use 90% of the time are so basic that you don’t really need a whole course just to discuss Matplotlib. Maybe that’s why they put so many design principles in the course. However, what I liked about this course is that it goes way beyond standard Matplotlib and teaches really fancy stuff like interactive graphics and animations. Though you wouldn’t really use these in everyday work, they’re good to know for when you might need them.
For Matplotlib visualizations, a reference that served me well is Mastering matplotlib by McGreggor. It goes from simple techniques to the way-way advanced.
Applied Machine Learning in Python
The last core course goes over basic principles of machine learning and the Scikit-Learn library in Python. In terms of concepts, this course is a really basic overview compared to Andrew Ng’s wonderful course. What this course has in common with the next two courses in the sequence is that the handling of the material feels rushed.
A good reference for learning Scikit-Learn is Raschka’s Python Machine Learning. Sebastian Raschka also has a lot of useful Jupyter notebooks for learning ML models. Check his site out.
Applied Text Mining in Python
Textual data is my favorite form of data, so I had really high expectations for this course. This course goes over regular expressions, the NLTK (Natural Language Toolkit) library in Python, and to a very small extent the topic modeling library Gensim. For a first intro to these topics, I think this course is quite good. However, the coverage is a bit too shallow (it’s too much of an overview), and the standard NLTK tutorial is way more informative than this course.
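If you’ve never touched regular expressions, here is a small sketch of my own showing the kind of pattern matching and token counting the course starts with. To keep it dependency-free I use only Python’s standard library rather than NLTK, and the sample sentence and pattern names are made up.

```python
import re
from collections import Counter

# Hypothetical sample text with two ISO-style dates embedded in it.
TEXT = "Call me on 2017-03-14 or 2017-04-01. Text mining is fun; mining text is useful."

# Capture year, month, and day as separate groups.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# A deliberately crude tokenizer: runs of lowercase letters only.
TOKEN_RE = re.compile(r"[a-z]+")

dates = DATE_RE.findall(TEXT)
tokens = TOKEN_RE.findall(TEXT.lower())
counts = Counter(tokens)

print(dates)
print(counts.most_common(3))
```

A real tokenizer (NLTK’s included) handles punctuation, contractions, and numbers far more carefully; the letters-only pattern here is just enough to show the idea.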
Text Analytics with Python by Sarkar is a good book to get started with NLTK. It goes over the library, but it also gives good background on the different fields studied in NLP: semantic analysis, sentiment analysis, POS/NER tagging, topic modeling, and much more. For a first intro to the topic, you can’t go wrong with Sarkar’s book.
Applied Social Network Analysis in Python
The last course deals with network analysis with NetworkX. Frankly, I’m quite disappointed with this course. Most of the course just goes over the standard metrics for connectivity and centrality, models for network creation, and their corresponding NetworkX functions. You would gain the same amount of knowledge just reading Wikipedia. For future versions, I’d recommend adding use cases that show how these network metrics can be applied to solve real-world problems.
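To show what those metrics and their NetworkX functions look like in practice, here is a minimal sketch of my own on a made-up five-person friendship graph. The names and edges are hypothetical and not from the course.

```python
import networkx as nx

# Hypothetical friendship network: Ana is the hub, Eve is only reachable through Dee.
G = nx.Graph()
G.add_edges_from([
    ("Ana", "Ben"),
    ("Ana", "Cy"),
    ("Ana", "Dee"),
    ("Dee", "Eve"),
])

# Connectivity: is every person reachable from every other person?
connected = nx.is_connected(G)

# Centrality: the fraction of other nodes each person is directly tied to...
degree = nx.degree_centrality(G)
# ...and how often each person sits on the shortest path between two others.
betweenness = nx.betweenness_centrality(G)

most_central = max(degree, key=degree.get)
print(connected, most_central, degree, betweenness)
```

On a toy graph the numbers only confirm what you can see by eye; the metrics earn their keep on networks too large to draw.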
For social network analysis in general, Mining the Social Web by Russell is probably the best text right now. It doesn’t just go over NetworkX; it also covers different ways to access data from our favorite social media sites like Facebook, LinkedIn and Twitter. If you’re interested in all of that, check the book out!