# -*- coding: utf-8 -*-
"""IML-Lab1.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1G4kLtZu6wLUKSOHgaCn47d1Xfz_Y5Z8q

# Introduction to Machine Learning - Lab 1

In this lab, you will learn the skills to become a master Jedi. Please avoid the traps of the Sith and resist all dark emotions that could arise from the frustration associated with programming.
"""

import numpy as np

# Loading a plotting library
import matplotlib.pyplot as plt

# Loading a table library
import pandas as pd

# Changing how floating point numbers are displayed
pd.set_option("display.float_format", lambda x: f"{x:.5f}")

import sklearn

"""## Part 1 - Linear regression

**Task 1**

Download the data available at https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt. The data comes from a diabetes study that tries to measure disease progression one year after baseline from 10 explanatory variables (N = 442). The explanatory variables are age, sex, body mass index, average blood pressure, and six blood serum measurements.

1. Load it using pandas. Pandas has a function called "read_csv" that can be useful (documentation: https://pandas.pydata.org/pandas-docs/stable/).
2. Separate the data into X (explanatory variables) and y (response variable).
3. Inspect the data. Look, for instance, at the function "describe" in pandas.
4. Standardize the data by subtracting the mean and dividing by the standard deviation.
5. Inspect the data again to check that the mean is zero and the standard deviation is one.

(A sketch of one possible solution follows the placeholder cell below.)
"""

# Downloading the dataset
!wget https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt

# Loading the diabetes dataset
#df = ...

# Separating the dataset into X (explanatory variables) and y (response variable)
#X = ...
#y = ...

# Normalize the features
#X = ...
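"""*Example sketch (one possible solution for Task 1):* the snippet below fills in the placeholders above. It assumes the wget call saved the file as "diabetes.tab.txt"; the file is tab-separated and its header names the response column "Y"."""

# Load the tab-separated file and take a first look at it
df = pd.read_csv("diabetes.tab.txt", sep="\t")
print(df.describe())

# The response is the column "Y"; everything else is explanatory
X = df.drop(columns="Y")
y = df["Y"]

# Standardize column-wise: subtract the mean, divide by the standard deviation
X = (X - X.mean()) / X.std()

# Inspect again: the means should now be ~0 and the standard deviations ~1
print(X.describe())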
"""**Task 2**

In this task we want to perform linear regression on the data to see what relations we can find between the explanatory variables and the response variable. We are going to use the package scikit-learn (sklearn), a Python package containing a wide range of machine learning models. Check out the models in https://scikit-learn.org/stable/user_guide.html

1. Look at the documentation for scikit-learn's LinearRegression model and try to get familiar with what the input and output parameters do: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
2. Create a linear regression model and fit it to your data.
3. Extract the coefficients (weights) and the intercept (bias) from your fitted model and inspect them. Either look at the values or plot them as a bar chart. Which variables have the highest coefficients, and what does it mean if a coefficient is negative or positive?
4. Predict y from your model and the data in X, and calculate the root mean squared error of your predictions.
5. (Optional) Fit a new model without standardizing the data. What happens to the coefficients and the error now?
6. As we have a lot of variables, we cannot visualize our results and regression fit in such a multi-dimensional space. Instead, let's see how it looks in 2D: select an explanatory variable that has a large coefficient and plot it against y as a scatterplot (https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html or https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html). Add a line that represents the hyperplane resulting from the linear regression, projected onto this 2D space (***Hint***: *in 2D, this is a line, defined by the regression weight $w_x$ of the chosen explanatory variable $x$ and the bias: $y = x \cdot w_x + bias$*). Does it look like a good fit? (Optional:) And what if you instead use a less important explanatory variable?

(A sketch of one possible solution follows the placeholder cell below.)
"""

# Perform a regression (predict y from X)
from sklearn.linear_model import LinearRegression
#model = ...

# Model parameters
#coefficients = ...
#intercept = ...

# Calculate the error in the predictions
#y_pred = ...

# Scatter plot of a chosen explanatory variable vs. y
# and the regressed line fit
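"""*Example sketch (one possible solution for Task 2):* this reuses the standardized X and y from the Task 1 sketch above. The column name "BMI" is an assumption based on the header of the NCSU file; swap in any variable with a large coefficient."""

# Fit a linear regression model to the standardized data
model = LinearRegression()
model.fit(X, y)

# Extract the model parameters and inspect them as a bar chart
coefficients = model.coef_
intercept = model.intercept_
plt.bar(X.columns, coefficients)
plt.ylabel("coefficient")
plt.show()

# Root mean squared error of the in-sample predictions
y_pred = model.predict(X)
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
print(f"RMSE: {rmse:.5f}")

# Scatter plot of one strong explanatory variable against y, plus the
# regression hyperplane projected onto this 2D space: y = x * w_x + bias
x_col = "BMI"  # assumed column name; pick any variable with a large weight
w_x = coefficients[X.columns.get_loc(x_col)]
plt.scatter(X[x_col], y)
plt.plot(X[x_col], X[x_col] * w_x + intercept, color="red")
plt.xlabel(x_col)
plt.ylabel("disease progression (y)")
plt.show()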
"""## Part 2 - Naive Bayes

**Task 1**

For the second part we are going to look at classification using naive Bayes. The data contains passenger data from the Titanic, and the task is to predict "what sort of people were most likely to survive?". The passenger data has 7 features: Name, Sex, Age, Socio-economic class, Siblings/Spouses Aboard, Parents/Children Aboard and Fare, and a binary response variable "Survived".

1. Download the data from https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv and load it using pandas.
2. Inspect the data to see what you are working with. Test, for instance, if you can calculate the average survival rates for males and females.
3. Divide the data into a training set and a testing set. This can, for instance, be done with "sklearn.model_selection.train_test_split" (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
4. Separate X and y for both the training and testing set.
"""

# Downloading the titanic dataset
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

# Load the data and get familiar with it
#df = ...

# Separating the data into a training and a testing set
#df_train, df_test = ...

#X_train = ...
#y_train = ...
#X_test = ...
#y_test = ...

"""**Task 2**

The second part is to implement a naive Bayes classifier to predict survived/not survived from some of the variables. For now we focus on predicting survival based only on the gender and the passenger fare (but feel free to experiment with other combinations of variables that you find reasonable!). Keep in mind that while gender is a categorical (binary) variable, the fare is a continuous variable!

1. Write a function that takes in the gender variable and returns the probability of a subject being of this gender, given that it survived/died in the crash.
2. Write a function that takes in the fare variable and returns the probability density of a subject paying such a fare, given that they survived/died in the crash. (You can use a Gaussian distribution to model this; look, for example, into the *normal pdf* function from the *scipy.stats* library.)
3. Write a function that takes in an input of the relevant variables and returns the final prediction (that is, calculates all the needed intermediate results, using also the previous two functions, to finally output which one of $P(died\,|\,input\,sex,\,fare)$ or $P(survived\,|\,input\,sex,\,fare)$ is larger).
4. Predict survival on the test dataset. Calculate the accuracy of your prediction.
5. Plot a confusion matrix of the predictions (you can check out the confusion_matrix function from the sklearn library, and heatmap from seaborn for plotting). How do you interpret the results? Are the sex and fare variables we chose good predictors? What seems more informative, the accuracy or the confusion matrix, and why?
6. (Optional) Can you improve your predictions using other/more variables?

***Hint:*** *The quantity we are interested in is $P(survival\,|\,sex,\,fare)$. As described in the lecture slides, Bayes' theorem tells us that this is proportional to $P(sex,\,fare\,|\,survival) \cdot P(survival)$. You can assume that the fare and sex variables are conditionally independent given survival, which means $P(sex,\,fare\,|\,survival) = P(sex\,|\,survival) \cdot P(fare\,|\,survival)$.*

(A sketch of one possible solution follows the placeholder cell below.)
"""

# Write the needed functions:

# P(sex | survival)

# P(fare | survival)

# P(survived | fare, sex) vs. P(died | fare, sex) ?

# Use your model to predict survival on test data

# Calculate accuracy of your predictions

# Calculate, plot confusion matrix
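"""*Example sketch (one possible solution):* an end-to-end naive Bayes classifier. It assumes the column names "Survived", "Sex" and "Fare" (as in the titanic.csv linked above) and treats Sex and Fare as conditionally independent given survival, per the hint."""

from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns

df = pd.read_csv("titanic.csv")

# Quick inspection: average survival rate per gender
print(df.groupby("Sex")["Survived"].mean())

# Split into training and testing sets
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)

# Class priors P(survived) and P(died), estimated from the training set
p_survived = df_train["Survived"].mean()
p_died = 1 - p_survived

def p_sex_given_survival(sex, survived):
    # P(sex | survival): fraction of the given class with this gender
    group = df_train[df_train["Survived"] == survived]
    return (group["Sex"] == sex).mean()

def p_fare_given_survival(fare, survived):
    # P(fare | survival): Gaussian density fitted to the fares of the class
    group = df_train[df_train["Survived"] == survived]
    return norm.pdf(fare, loc=group["Fare"].mean(), scale=group["Fare"].std())

def predict(sex, fare):
    # Compare the two unnormalized posteriors and return the larger class
    score_survived = p_sex_given_survival(sex, 1) * p_fare_given_survival(fare, 1) * p_survived
    score_died = p_sex_given_survival(sex, 0) * p_fare_given_survival(fare, 0) * p_died
    return int(score_survived > score_died)

# Predict survival on the test set and compute the accuracy
y_test = df_test["Survived"].to_numpy()
y_pred = np.array([predict(s, f) for s, f in zip(df_test["Sex"], df_test["Fare"])])
print(f"Accuracy: {np.mean(y_pred == y_test):.5f}")

# Confusion matrix as a heatmap
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["died", "survived"], yticklabels=["died", "survived"])
plt.xlabel("predicted")
plt.ylabel("actual")
plt.show()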