Lecture 6: Cross Validation and Feature Engineering

Today’s Topics

In today’s lecture you will look at various techniques for understanding and dealing with overfitting. Dealing with overfitting leads naturally to the model selection problem: first and foremost, how do you decide which machine learning algorithm to use? Further, many machine learning algorithms have hyper-parameters that are not learned from the training data. Choosing which model to use, or what values to give its hyper-parameters, is a difficult task and can greatly affect the performance of your algorithm. Cross validation is a useful technique from statistics that lets you partition your data into many combinations of training, test and validation sets. You can then use cross validation to help you decide which machine learning model to use and what values to set for the hyper-parameters.
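To make this concrete, here is a minimal sketch of using k-fold cross validation to choose a hyper-parameter with scikit-learn. The dataset (iris) and the candidate values for the number of neighbours are illustrative assumptions, not taken from the lecture.

```python
# A minimal sketch: choose the number of neighbours for a k-NN
# classifier by 5-fold cross validation. Dataset and candidate
# values are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate each candidate hyper-parameter value with 5-fold cross
# validation and compare the mean validation accuracy.
for n_neighbors in [1, 3, 5, 11]:
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k={n_neighbors}: mean validation accuracy {scores.mean():.3f}")
```

With cv=5 the data is split into five folds; each candidate model is trained on four folds and validated on the fifth, rotating through all five, so every example is used for validation exactly once.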

Finally, we will cover various techniques for data normalisation and for dealing with categorical data. One important thing to remember is that a little data preprocessing can often improve the performance of your learning algorithm significantly.
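As a rough illustration of these preprocessing steps (the small dataset below is an assumption, not from the lecture), the following sketch standardises a numeric feature and shows both one-hot and dummy encoding of a categorical one using pandas:

```python
# A minimal preprocessing sketch on made-up data: normalise a numeric
# feature and encode a categorical one.
import pandas as pd

df = pd.DataFrame({"age": [20, 35, 50, 65],
                   "colour": ["red", "green", "red", "blue"]})

# Data normalisation: rescale a numeric feature to zero mean and unit
# variance so that features on different scales contribute comparably.
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df, columns=["colour"])

# Dummy encoding: as one-hot, but drop the first category so the
# remaining columns are not linearly redundant.
dummy = pd.get_dummies(df, columns=["colour"], drop_first=True)

print(one_hot, dummy, sep="\n\n")
```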

Slides

I used these slides in the lecture.

Reading Guide

What should I know by the end of this lecture?

  • What is overfitting? What are some strategies to avoid it?
  • What is the bias–variance trade-off?
  • What is the model selection problem?
  • What are hyper-parameters?
  • Why do you need to split data into training, test and validation sets? (See the sketch after this list.)
  • What is cross validation?
  • What is k-fold cross validation?
  • What are the different encoding strategies for categorical data, such as one-hot encoding and dummy encoding?
  • What is data normalisation and when is it necessary?
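The following sketch, referenced from the question on data splitting above, shows one common way to carve a dataset into training, validation and test sets. The split proportions and the iris dataset are illustrative assumptions.

```python
# A minimal sketch: split data into training, validation and test sets
# with two calls to train_test_split. Proportions are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set, used only once at the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Split the remainder into training and validation sets; the validation
# set is used for model selection and hyper-parameter tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```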