Thursday, 24 January 2019

Machine Learning - Linear Regression


###################################################################
################Machine Learning - Linear Regression#####################
###################################################################
options(repos = c(CRAN = "http://cran.rstudio.com"))
library(readxl)
data <- read_excel("D:\\R studio\\R portable\\Practice\\ML - linear regression\\Bai-tap-thuc-hanh.xlsx")
head(data)
#rename the Vietnamese column names to English ('Chiều cao (cm)' = height in cm, 'Cân nặng (kg)' = weight in kg)
names(data)[names(data)=='Chiều cao (cm)'] <- 'Height'
names(data)[names(data)=='Cân nặng (kg)'] <- 'Weight'
head(data, n=6)

#we need to split the data into a training set and a testing set.
#As their names imply, the training set is used to train and build the model,
#and then this model is tested on the testing set.
#Let’s say we want to have about 75% of the data in the training set and 25% of the data in the testing set

#First, we set the seed so that the results are reproducible
set.seed(123)

#we create a sequence whose length is equal to the number of rows of the dataset. These numbers act as the indices of the dataset. We randomly select 75% of the numbers in the sequence and store it in the variable split
split <- sample(seq_len(nrow(data)), size = floor(0.75 * nrow(data)))

#we copy all the rows which correspond to the indices in split into trainData and all the remaining rows into testData
train.data <- data[split,]
head(train.data)
View(train.data)
test.data <- data[-split,]
head(test.data)

#Building the prediction model
predictionmodel <- lm(Weight ~ Height, data = train.data)

The above function will try to predict Weight based on the variable Height. Since we are using all the variables present in the dataset, a shorter way to write the same command is the formula below (this is very helpful when there are a large number of variables):
predictionmodel <- lm(Weight ~ ., data = train.data)

summary(predictionmodel)


This will help us determine which variables to include in the model. A linear regression can be represented by the equation: y_i = β_0 + β_1 x_i1 + β_2 x_i2 + … + ε_i, where y_i is the outcome we are predicting (Weight), x_ij are the various attributes (here only Height), β_1, β_2, … are their coefficients, β_0 is the intercept (constant term), and ε_i is the error term.
The first column in the summary, namely Estimate, gives us these values. The first value corresponds to the intercept β_0, and the rest of the values correspond to the various β coefficients. If the coefficient for a particular attribute is 0 or close to 0, it has very little to no effect on the prediction and can be removed. The Std. Error column gives an estimate of how much the coefficient may vary from the estimated value. The t value is calculated by dividing the Estimate column by the Std. Error column. The last column, Pr(>|t|), gives a measure of how likely it is that the coefficient is actually 0; it shrinks as the absolute t value grows. Hence, an attribute with a high absolute t value, or a very low Pr(>|t|), is desirable.
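As a minimal sketch (assuming the fitted model is still stored in predictionmodel), the coefficient table behind these columns can also be pulled out of the summary object directly:

#coefficient matrix: Estimate, Std. Error, t value and Pr(>|t|)
coef(summary(predictionmodel))
#just the fitted coefficients: intercept first, then the slope for Height
coef(predictionmodel)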
The easiest way to determine which variables are significant is by looking at the stars next to them. The scheme is explained at the bottom of the table. Variables with three stars are most significant, followed by two stars, and finally one. Variables with a period next to them may or may not be significant and are generally not included in prediction models, and variables with nothing next to them are not significant.
In our model, we can see that all our variables are highly significant, so we will leave our prediction model as it is. In case you are dealing with a dataset in which one or more variables are non-significant, it is advisable to test the model by removing one variable at a time. This is because when two variables are highly correlated with each other, they may both appear non-significant to the model, but when one of them is removed, the other can become significant. This effect is known as multicollinearity.
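With only Height as a predictor this does not arise here, but as a rough, hypothetical sketch of the effect on made-up data (the variables x1, x2 and y below are invented for illustration only):

#toy data: x2 is almost a copy of x1, so the two predictors are highly correlated
set.seed(123)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.1)
y  <- 2 * x1 + rnorm(50)
toy <- data.frame(y, x1, x2)
cor(toy$x1, toy$x2)                     #close to 1 -> possible multicollinearity
summary(lm(y ~ x1 + x2, data = toy))    #with both predictors, each may look non-significant
summary(lm(y ~ x1, data = toy))         #drop one and the remaining predictor is clearly significant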
The easiest way to check the accuracy of a model is by looking at the R-squared value. The summary provides two R-squared values, namely Multiple R-squared, and Adjusted R-squared. The Multiple R-squared is calculated as follows:
Multiple R-squared = 1 – SSE/SST where:
  • SSE is the sum of squares of the residuals. A residual is the difference between the predicted value and the actual value, and the residuals can be accessed by predictionmodel$residuals.
  • SST is the total sum of squares. It is calculated by summing the squares of difference between the actual value and the mean value.
For example, let's say that the actual values are 5, 6, 7, and 8, and a model predicts the outcomes as 4.5, 6.3, 7.2, and 7.9. Then, SSE can be calculated as: SSE = (5 – 4.5) ^ 2 + (6 – 6.3) ^ 2 + (7 – 7.2) ^ 2 + (8 – 7.9) ^ 2 = 0.39; and SST can be calculated as: mean = (5 + 6 + 7 + 8) / 4 = 6.5; SST = (5 – 6.5) ^ 2 + (6 – 6.5) ^ 2 + (7 – 6.5) ^ 2 + (8 – 6.5) ^ 2 = 5.
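As a quick check of that arithmetic in R (the two vectors below are just the toy numbers from the example above):

#toy example: actual values and model predictions
actual    <- c(5, 6, 7, 8)
predicted <- c(4.5, 6.3, 7.2, 7.9)
SSE <- sum((actual - predicted)^2)       #0.39
SST <- sum((actual - mean(actual))^2)    #5
1 - SSE/SST                              #Multiple R-squared, about 0.92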
The Adjusted R-squared value is similar to the Multiple R-squared value, but it accounts for the number of variables. This means that the Multiple R-squared will always increase when a new variable is added to the prediction model, whereas the Adjusted R-squared will decrease if the added variable is a non-significant one.
An R-squared value of 1 means that it is a perfect prediction model, and an R-squared value of 0 means that it is of no improvement over the baseline model (the baseline model just predicts the output to always be equal to the mean). From the summary, we can see that our R-squared value is 0.9284, which is very high.
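Rather than reading them off the printed output, both values can also be extracted from the summary object (again assuming the model is stored in predictionmodel):

model.summary <- summary(predictionmodel)
model.summary$r.squared       #Multiple R-squared
model.summary$adj.r.squared   #Adjusted R-squared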

#Testing the prediction model
prediction <- predict(predictionmodel, newdata = test.data)
head(prediction)
head(test.data$Weight)

#calculate the value of R-squared for the prediction model on the test data
SSE <- sum((test.data$Weight - prediction)^2)
SST <- sum((test.data$Weight - mean(test.data$Weight))^2)
1 - SSE/SST