Classify cover type

Linköping University
Department of Mathematics
Lars Eldén

October 2008

NGSSC Computational Statistics

Computer project

Classification of Forest Cover Type

ASSIGNMENT

Construct an algorithm in MATLAB or R for classification of forest cover type. The data set is quite large (581 012 observations), so the purpose of the project is to experiment with the sizes of the training set and the test set and to see how the performance varies with size. Is the SVD-based method applicable? Can one use the function classregtree from the Statistics Toolbox in MATLAB, or the corresponding function (tree) in R?
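
One possible way to organize the size experiment is sketched here in Python, chosen only for illustration; the same loop carries over directly to MATLAB or R. The sketch runs on synthetic two-class data and uses a nearest-centroid classifier as a stand-in for whatever method you choose. All function names, the sizes, and the classifier are assumptions made for this example and are not part of the assignment.

```python
import random

def make_synthetic(n, rng):
    """Toy data: n labelled 2-D points from two overlapping Gaussian classes."""
    data = []
    for _ in range(n):
        label = rng.choice([1, 2])
        center = 0.0 if label == 1 else 2.0
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return data

def fit_centroids(train):
    """Compute the mean feature vector of each class in the training set."""
    sums, counts = {}, {}
    for x, y in train:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (Euclidean distance)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )

def accuracy(centroids, test):
    """Fraction of test observations that are classified correctly."""
    correct = sum(1 for x, y in test if predict(centroids, x) == y)
    return correct / len(test)

def size_experiment(data, train_sizes, test_size, seed=0):
    """Hold out one fixed random test set, then train on growing subsets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    pool = shuffled[test_size:]
    results = []
    for n in train_sizes:
        centroids = fit_centroids(pool[:n])
        results.append((n, accuracy(centroids, test)))
    return results

# Example run on toy data: one (training size, accuracy) pair per size.
toy = make_synthetic(600, random.Random(1))
for n, acc in size_experiment(toy, [10, 100, 400], test_size=200):
    print(n, round(acc, 3))
```

Keeping the test set fixed while the training set grows makes the accuracies comparable across sizes. Repeating the run with several seeds gives an idea of the variability.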

SPECIFIC TASKS

The tasks below are examples; you are not required to do all of them, except the random selection, which should always be done. If you have ideas of your own, go ahead and try them.

Tune the algorithm for accuracy of classification.
Check if all forest types are equally easy or difficult to classify.
Does it help to scale the data?
Investigate the properties of the matrix of observations using the SVD, and perhaps other tools.
Most of the attributes are qualitative. One may treat them as quantitative or ignore them. Does it make a difference?
When you divide the set into training and test sets, make a random selection so that you can be rather sure that you get representative sets.
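
Three of the tasks above (random selection, scaling, and accuracy per forest type) can be sketched as small helper functions. This is a minimal illustration in Python using only the standard library. The function names and the choice of standardization (zero mean, unit standard deviation, estimated from the training rows only) are assumptions made for this example; other scalings are equally reasonable to try.

```python
import random
from collections import Counter

def random_split(n_obs, train_fraction, seed=0):
    """Return (train_idx, test_idx), a random partition of range(n_obs)."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_fraction * n_obs))
    return idx[:n_train], idx[n_train:]

def fit_scaler(rows):
    """Column means and standard deviations, estimated from the given rows."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = []
    for j in range(p):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # A constant column has zero variance; divide by 1 so it is only centred.
        stds.append(var ** 0.5 if var > 0 else 1.0)
    return means, stds

def apply_scaler(rows, means, stds):
    """Standardize rows with previously estimated means and deviations."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified observations for each true class."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {c: hits[c] / totals[c] for c in sorted(totals)}
```

Estimating the scaling parameters on the training rows and then applying them unchanged to the test rows keeps information from the test set out of the training step. The per-class figures show directly whether some forest types are harder to classify than others, which a single overall accuracy can hide.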

DATA

The data file covtype.data is available at http://www.mai.liu.se/~laeld/kurser/NGSSC-comp-stat.

The training and test data are described in the file covtype.info. Note that these data are quite difficult to classify (e.g., 70% correct with neural networks and 58% with linear discriminant analysis).
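
As a starting point for reading the data, here is a minimal Python sketch. It assumes that covtype.data has the layout of the UCI Covertype data: one observation per line, comma-separated integers, 10 quantitative attributes followed by 44 binary indicator columns (4 wilderness areas and 40 soil types), with the cover type (1 to 7) in the last column. Check these assumptions against covtype.info before relying on them. The function and constant names are hypothetical.

```python
N_QUANTITATIVE = 10  # assumed: the first 10 columns are quantitative
N_BINARY = 44        # assumed: 4 wilderness-area + 40 soil-type indicators

def parse_covtype_lines(lines):
    """Parse comma-separated integer rows into (features, labels)."""
    expected = N_QUANTITATIVE + N_BINARY + 1  # attributes plus class label
    features, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        values = [int(v) for v in line.split(",")]
        if len(values) != expected:
            raise ValueError(
                "expected %d columns, got %d" % (expected, len(values))
            )
        features.append(values[:-1])
        labels.append(values[-1])
    return features, labels

def split_attribute_types(row):
    """Split one feature row into its (quantitative, binary) parts."""
    return row[:N_QUANTITATIVE], row[N_QUANTITATIVE:]
```

With the file in the working directory, opening it with `open("covtype.data")` and passing the file object to `parse_covtype_lines` would read the whole set. Separating the two attribute groups makes it easy to compare classification with and without the qualitative attributes.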