The past two weeks covered supervised learning. After discovering NumPy and doing some feature engineering in the homework, we moved on to supervised learning proper, where a target variable y is mapped from the inputs X.
Week 2: Intro to Supervised Learning
An excerpt from our class IPython notebooks:
More formally, this is expressed as

y = f(X)

where f is the unknown function that maps X to y. Our task as data scientists is to come up with a g that captures f closely. Formally,

g(X) ≈ f(X)
Depending on the task, g may need to output continuous or discrete values. These two tasks are called regression and classification, respectively. In either case, the goal is to create models that capture the underlying function f. In the literature, you'll often see models also being called hypotheses.
For studying regression, we continued to use our board game dataset and modeled linear relationships between the features. Being a machine learning class, I didn't expound on the analytical way of solving this. Instead, we used gradient descent as our optimization technique and visualized the resulting lines. We also used R² as a goodness-of-fit metric, just to look at the problem from the perspective of statistics.
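The idea above can be sketched in a few lines of NumPy. The board game dataset isn't reproduced here, so this sketch uses synthetic data with a known slope and intercept; everything else (the mean-squared-error gradients, the R² formula) is standard.

```python
import numpy as np

# Synthetic stand-in for one board-game feature vs. a rating
# (the real class dataset isn't reproduced here).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100)
y = 2.5 * X + 1.0 + rng.normal(0, 1.0, 100)

# Fit y ≈ w*X + b by gradient descent on the mean squared error.
w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = (w * X + b) - y
    w -= lr * 2 * np.mean(err * X)  # dMSE/dw
    b -= lr * 2 * np.mean(err)      # dMSE/db

# R² = 1 - SS_res / SS_tot: fraction of variance explained by the line.
ss_res = np.sum((y - (w * X + b)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

After enough iterations, w and b recover the true parameters up to noise, and R² lands close to 1 because the relationship really is linear.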
After regression, we applied logistic regression to the "hello world" dataset for all novice data scientists: the Iris dataset. Cost functions and all, we dissected the logistic regression algorithm.
Lastly, I had them do a homework on Kaggle: the Titanic dataset. Kaggle is a site for data science competitions, as well as a good resource for novices. The Titanic dataset involves a rather somber account of who survived that fateful day when the Titanic sank, and the task was to predict the survivors. Everyone submitted their solutions to Kaggle, and some even got hooked on the whole gamification concept.
Week 3: Advanced Supervised Learning
We discussed our solutions to the Titanic dataset. I was surprised by some of the solutions: really good feature engineering, with techniques I had never tried myself. It truly is a rewarding experience to teach!
The focus of this week was seeing the weaknesses of assuming linear relationships or linear separating hyperplanes on harder datasets. I offered several perspectives on the bias-variance problem: first a practical, production-like view, then the bulls-eye model, then the statistics perspective, and lastly a mathematical approach. Having the most powerful algorithm in the world will not necessarily give good results. So what's the point of machine learning, then? The way out of this is a rather long discussion in the theory of learning (VC dimension, omg?). But long story short, what we want is an algorithm that understands the data "just right": neither overfitting nor underfitting, in a way useful for the task at hand.
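The bias-variance trade-off is easy to demonstrate numerically. The sketch below (not from the class notebooks; the sine curve and noise level are arbitrary choices) fits polynomials of increasing degree to noisy data: a line underfits, a degree-15 polynomial drives training error down while test error blows up, and a modest cubic sits in between.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of one period of a sine wave, plus a held-out test set.
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 30)

def errors(deg):
    """Train/test mean squared error for a degree-`deg` polynomial fit."""
    coefs = np.polyfit(x, y, deg)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

train1, test1 = errors(1)    # underfit: high bias
train3, test3 = errors(3)    # about right for one sine period
train15, test15 = errors(15) # overfit: high variance
```

The high-degree fit always wins on training error (the models are nested), yet the cubic beats the line on held-out data, which is the whole point.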
We then blitzed through higher-order polynomials, how neural networks automate that, and an introduction to support vector machines, without the tears. I did it this way because these concepts only go so far without actual application. I showed the decision boundaries, how to regularize the algorithms, and how to use them in scikit-learn. It was dizzying, I know. That's why I think the next session should be devoted to revisiting these algorithms, optimizing their hyperparameters, and diagnosing the learning algorithms.
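As a preview of that hyperparameter discussion, here is one way to tune an SVM in scikit-learn. The grid values below are illustrative guesses, not the ones from class; the key idea is that C controls regularization strength (small C favors a smoother, more regularized boundary) and cross-validation picks among the candidates.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 5-fold cross-validation over a small grid of C (regularization)
# and gamma (RBF kernel width) values.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]},
    cv=5,
)
grid.fit(X_tr, y_tr)
acc = grid.score(X_te, y_te)  # accuracy of the best model on held-out data
```

`grid.best_params_` then tells you which combination won, which is a good starting point for a diagnosis session.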
Lastly, I gave the homework! The homework is to use these new algorithms on the Titanic dataset and see how their Kaggle scores improve. Also, to encourage individual learning, I had them read papers involving machine learning and security. These papers use algorithms they haven't encountered before, but I'm confident they have the vocabulary needed to understand them. We will devote an hour to this next session. As a professor of mine said, "Imagine hearing about several papers at once. It's as if you've read them all."