Lecture 9: Principal Component Analysis and Preprocessing

Today’s Topics

As we have seen before, it is often a good idea to do some preprocessing of your input data. Sometimes there are linear relationships hidden in the data, and if you do not discover and remove them, your machine learning algorithm will waste effort learning them. In linear algebra, principal component analysis (PCA) is a well-established method for reducing the dimension of a data set. It uses the eigenvalues and eigenvectors of the covariance matrix to uncover hidden linear relationships in your data set. For this course you do not need to know how to compute eigenvalues and eigenvectors by hand, but you should understand the definition and what is going on.
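To make the idea concrete, here is a minimal sketch of PCA via an eigendecomposition of the covariance matrix, using NumPy. The data matrix `X` and the number of components `k` are made-up toy values, not anything from the course material.

```python
import numpy as np

# Toy data: 200 samples, 5 features (an assumed example, not course data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

X_centered = X - X.mean(axis=0)         # PCA assumes zero-mean data
cov = np.cov(X_centered, rowvar=False)  # 5x5 covariance matrix

# eigh is for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # re-sort descending by variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

k = 2                                   # keep the top-k principal components
X_reduced = X_centered @ eigenvectors[:, :k]
print(X_reduced.shape)                  # (200, 2)
```

The key step is projecting the centred data onto the top-k eigenvectors: the eigenvectors give the directions of maximal variance, and the matching eigenvalues tell you how much variance each direction carries.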

An interesting application of PCA is Eigenfaces.

Slides

I use these slides. In previous years, I gave the lecture on the blackboard, and these are some notes that I used. They contain some extra derivations.

Reading Guide

What should I know by the end of this lecture?

  • What is principal component analysis (PCA)?
  • What is the covariance matrix, and what do its entries mean?
  • What do the eigenvalues of the covariance matrix tell you about the data?
  • How do you choose the number of dimensions in PCA? (One common heuristic is sketched after this list.)
  • What are some applications of PCA in machine learning?
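
On choosing the number of dimensions: a common heuristic is to keep the smallest number of components that explain some target fraction of the total variance, e.g. 95%. Here is a hedged sketch; the eigenvalues are toy numbers, assumed to be sorted in descending order as in the PCA sketch above.

```python
import numpy as np

# Toy eigenvalues of a covariance matrix, sorted descending (assumed values)
eigenvalues = np.array([4.2, 2.1, 0.9, 0.5, 0.3])

explained = eigenvalues / eigenvalues.sum()  # fraction of variance per component
cumulative = np.cumsum(explained)

# Smallest k whose components explain at least 95% of the variance
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(k, cumulative)  # 4 [0.525  0.7875 0.9    0.9625 1.    ]
```

Because each eigenvalue is the variance along its eigenvector, the cumulative sum directly measures how much of the data's spread the first k components retain.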