###################################################################
################Machine Learning - Linear Regression#####################
###################################################################
options(repos = c(CRAN = "http://cran.rstudio.com"))
library(readxl)
data <- read_excel("D:\\R studio\\R portable\\Practice\\ML - linear regression\\Bai-tap-thuc-hanh.xlsx")
head(data)
#Rename the Vietnamese columns ("Chiều cao (cm)" = height in cm, "Cân nặng (kg)" = weight in kg) to English
names(data)[names(data) == 'Chiều cao (cm)'] <- 'Height'
names(data)[names(data) == 'Cân nặng (kg)'] <- 'Weight'
head(data, n = 6)
#We need to split the data into a training set and a testing set.
#As their names imply, the training set is used to train and build the model,
#and then this model is tested on the testing set.
#Let’s say we want about 75% of the data in the training set and 25% of the data in the testing set.
#First, we set the seed so that the results are reproducible
set.seed(123)
#We create a sequence whose length is equal to the number of rows of the dataset.
#These numbers act as the indices of the dataset. We randomly select 75% of the
#numbers in the sequence and store them in the variable split.
split <- sample(seq_len(nrow(data)), size = floor(0.75 * nrow(data)))
#We copy all the rows which correspond to the indices in split into train.data
#and all the remaining rows into test.data.
train.data <- data[split, ]
head(train.data)
View(train.data)
test.data <- data[-split, ]
head(test.data)
#Building the prediction model
predictionmodel <- lm(Weight ~ Height, data = train.data)
The above function will try to predict Weight based on the variable Height. Since we are using all the variables present in the dataset, a shorter way to write the above command is the following (this is very helpful when there are a large number of variables):
predictionmodel <- lm(Weight ~ ., data = train.data)
summary(predictionmodel)
This will help us determine which variables to include in the model. A linear regression can be represented by the equation: y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ⋯ + ε_i, where y_i represents the outcome we’re predicting (here, Weight), the x_ij represent the various attributes (here, just Height), the β values represent their coefficients, β_0 is the constant (intercept) term, and ε_i is the error term. The first column in the summary, namely Estimate, gives us these values. The first value corresponds to the intercept β_0, and the rest of the values correspond to the various β coefficients. If the coefficient for a particular attribute is 0 or close to 0, that means it has very little to no effect on the prediction and, hence, can be removed. The Std. Error column gives an estimate of how much the coefficient may vary from the estimated value. The t value is calculated by dividing the Estimate column by the Std. Error column. The last column, Pr(>|t|), gives a measure of how likely it is that the coefficient is 0 and is inversely related to the t value. Hence, an attribute with a high absolute t value, or a very low Pr(>|t|) value, is desirable.
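As a quick check on the numbers discussed above, the full coefficient table can also be pulled out of the summary object directly (a minimal sketch using the standard components returned by summary.lm):
#Matrix with the Estimate, Std. Error, t value and Pr(>|t|) columns
coefs <- summary(predictionmodel)$coefficients
coefs
#Individual coefficients can also be accessed by name
coef(predictionmodel)["(Intercept)"]
coef(predictionmodel)["Height"]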
The easiest way to determine which variables are significant is by looking at the stars next to them. The scheme is explained at the bottom of the table. Variables with three stars are most significant, followed by two stars, and finally one. Variables with a period next to them may or may not be significant and are generally not included in prediction models, and variables with nothing next to them are not significant.
In our model, we can see that all our variables are highly significant, so we will leave our prediction model as it is. In case you are dealing with a dataset in which there are one or more non-significant variables, it is advisable to test the model by removing one variable at a time. This is because when two variables are highly correlated with each other, they may both become non-significant to the model, but when one of them is removed, the other could become significant. This is due to multicollinearity. You can find out more about multicollinearity here.
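If your dataset does have several predictors, one simple way to spot such correlation is to look at the pairwise correlations between them before deciding which variable to drop (a minimal illustrative sketch; our example data has only Height, so the matrix below is trivial):
#Pairwise correlations between numeric predictors; values close to 1 or -1
#suggest two variables carry largely the same information
predictors <- train.data[, setdiff(names(train.data), "Weight")]
cor(predictors)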
The easiest way to check the accuracy of a model is by looking at the R-squared value. The summary provides two R-squared values, namely Multiple R-squared and Adjusted R-squared. The Multiple R-squared is calculated as follows: Multiple R-squared = 1 – SSE/SST, where:
- SSE is the sum of squared residuals. A residual is the difference between the predicted value and the actual value, and the residuals can be accessed via predictionmodel$residuals.
- SST is the total sum of squares. It is calculated by summing the squared differences between the actual values and their mean.
For example, let’s say that the actual values are 5, 6, 7, and 8, and a model predicts the outcomes as 4.5, 6.3, 7.2, and 7.9. Then SSE can be calculated as: SSE = (5 – 4.5)^2 + (6 – 6.3)^2 + (7 – 7.2)^2 + (8 – 7.9)^2; and SST can be calculated as: mean = (5 + 6 + 7 + 8) / 4 = 6.5; SST = (5 – 6.5)^2 + (6 – 6.5)^2 + (7 – 6.5)^2 + (8 – 6.5)^2.
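Expressed in R, the same toy calculation looks like this (a small sketch using the made-up values above, not our Height/Weight data):
actual <- c(5, 6, 7, 8)
predicted <- c(4.5, 6.3, 7.2, 7.9)
SSE <- sum((actual - predicted)^2)    #0.25 + 0.09 + 0.04 + 0.01 = 0.39
SST <- sum((actual - mean(actual))^2) #2.25 + 0.25 + 0.25 + 2.25 = 5
1 - SSE/SST                           #Multiple R-squared = 1 - 0.39/5 = 0.922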
The Adjusted R-squared value is similar to the Multiple R-squared value, but it accounts for the number of variables. This means that the Multiple R-squared will always increase when a new variable is added to the prediction model, but if the variable is a non-significant one, the Adjusted R-squared value will decrease. For more info, refer here.
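Both values are printed near the bottom of the summary output, and they can also be pulled out of the summary object programmatically (a minimal sketch using the standard components returned by summary.lm):
model.summary <- summary(predictionmodel)
model.summary$r.squared     #Multiple R-squared
model.summary$adj.r.squared #Adjusted R-squared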
An R-squared value of 1 means that it is a perfect prediction model, and an R-squared value of 0 means that it is no improvement over the baseline model (the baseline model just predicts the output to always be equal to the mean). From the summary, we can see that our R-squared value is 0.9284, which is very high.
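To see why the baseline model corresponds to an R-squared of 0, note that predicting the mean for every observation makes SSE identical to SST (a short illustrative check on the training data):
baseline.prediction <- mean(train.data$Weight)
SSE.baseline <- sum((train.data$Weight - baseline.prediction)^2)
SST.baseline <- sum((train.data$Weight - mean(train.data$Weight))^2)
1 - SSE.baseline/SST.baseline #exactly 0 for the baseline model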
#Testing the prediction model
prediction <- predict(predictionmodel, newdata = test.data)
head(prediction)
head(test.data$Weight)
#Calculate the value of R-squared for the prediction model on the test data
SSE <- sum((test.data$Weight - prediction)^2)
SST <- sum((test.data$Weight - mean(test.data$Weight))^2)
1 - SSE/SST