Linköping University
Department of Mathematics
Lars Eldén
October 2008
NGSSC Computational Statistics
Computer project
Classification of Forest Cover Type
ASSIGNMENT
Construct an algorithm in MATLAB or R for classification
of forest cover type. The data set is quite large (581 012 observations), so the purpose of the project
is to experiment with the sizes of the training and test sets and see how the performance varies with size.
Is the SVD-based method applicable? Can one use the function classregtree
from the statistics toolbox in MATLAB or the corresponding function (tree) in R?
SPECIFIC TASKS
The tasks below are examples; you are not required to do everything (except the random
selection, which should always be done). If you have your own ideas, go ahead and try them.
- Tune the algorithm for accuracy of classification.
- Check if all forest types are equally easy or difficult to classify.
- Does it help to scale the data?
- Investigate the properties of the matrix of observations using the SVD, and
perhaps other tools.
- Most of the attributes are qualitative. One may treat them as quantitative or ignore them.
Does it make a difference?
- When you divide the data set into training and test sets, make a random selection so that you can
be reasonably sure that you get representative sets.
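The random selection and the per-class check among the tasks above could be sketched as follows. This is only an illustration, written in Python with the standard library (in MATLAB, randperm plays the role of the shuffle, and in R, sample does). The names random_split and per_class_accuracy, the fixed seed, the set sizes and the toy labels are assumptions made for the example, not part of the assignment.

```python
import random
from collections import defaultdict


def random_split(n_obs, n_train, n_test, seed=0):
    """Return disjoint, randomly chosen index lists for training and testing."""
    if n_train + n_test > n_obs:
        raise ValueError("n_train + n_test must not exceed n_obs")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    idx = rng.sample(range(n_obs), n_train + n_test)  # draw without replacement
    return idx[:n_train], idx[n_train:]


def per_class_accuracy(true_labels, predicted_labels):
    """Return {class: fraction of that class classified correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}


# Hypothetical sizes: vary n_train and n_test to study the effect of set size.
train_idx, test_idx = random_split(n_obs=581012, n_train=10000, n_test=5000)

# Toy labels (made up) to show the per-class summary.
truth = [1, 1, 2, 2, 2, 3]
guess = [1, 2, 2, 2, 1, 3]
acc = per_class_accuracy(truth, guess)
```

Changing the seed gives a different random selection, so repeating an experiment over several seeds gives an idea of how much the results depend on the particular split.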
DATA
The data set covtype.data is available at http://www.mai.liu.se/~laeld/kurser/NGSSC-comp-stat.
The data are described in the file covtype.info. Note that these
data are quite difficult to classify (e.g., 70% correct with neural networks and 58%
with linear discriminant analysis).
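As a starting point for reading the data, a minimal sketch in Python is given below. It assumes that each line of covtype.data holds comma-separated integers with the class label in the last column; this layout is an assumption here and should be checked against covtype.info. The function name parse_rows and the two sample lines are made up for illustration.

```python
import csv
import io


def parse_rows(lines):
    """Split comma-separated lines into attribute rows and class labels."""
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        values = [int(v) for v in row]
        features.append(values[:-1])  # all but the last column (assumed attributes)
        labels.append(values[-1])     # last column (assumed class label)
    return features, labels


# Two made-up lines in the assumed layout, not taken from the real file.
sample = io.StringIO("2596,51,3,0,1,5\n2590,56,2,1,0,2\n")
X, y = parse_rows(sample)

# For the real file one would pass an open file handle instead, e.g.
# with open("covtype.data", newline="") as f:
#     X, y = parse_rows(f)
```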