Basic Data Analytics: January 2019

Thursday, 24 January 2019

Machine Learning - Linear Regression

###################################################################

################Machine Learning - Linear Regression#####################

###################################################################

options(repos = c(CRAN = "http://cran.rstudio.com"))

library(readxl)

data <- read_excel("D:\\R studio\\R portable\\Practice\\ML - linear regression\\Bai-tap-thuc-hanh.xlsx")

head(data)

names(data)[names(data)=='Chiều cao (cm)'] <- 'Height'

names(data)[names(data)=='Cân nặng (kg)'] <- 'Weight'

head(data, n=6)

#we need to split the data into a training set and a testing set.

#As their names imply, the training set is used to train and build the model,

#and then this model is tested on the testing set.

#Let’s say we want to have about 75% of the data in the training set and 25% of the data in the testing set

#First, we set the seed so that the results are reproducible

set.seed(123)

#we create a sequence whose length is equal to the number of rows of the dataset. These numbers act as the indices of the dataset. We randomly select 75% of the numbers in the sequence and store it in the variable split

split <- sample(seq_len(nrow(data)), size = floor(0.75 * nrow(data)))

#we copy all the rows which correspond to the indices in split into trainData and all the remaining rows into testData

train.data <- data[split,]

head(train.data)

View(train.data)

test.data <- data[-split,]

head(test.data)

#Building the prediction model

predictionmodel <- lm(Weight~Height, data = train.data )

The above function will try to predict Weight based on the variable Height. Since we are using all the variables present in the dataset, a shorter way to write the above command is: (this is very helpful when are are a large number of variables.)

predictionModel <- lm(Weight ~ ., data = trainData)

summary(predictionmodel)

This will help us determine which variables to include in the model. A linear regression can be represented by the equation: y_i = β_1 x_i1 + β_2 x_i2 + β_3 x_i3 + ⋯ + ε_i where y_i represents the outcome we’re prediction (PE), x_irepresent the various attributes (AT, V, AP, and RH), β represent their coefficients, and ε represents the constant term.

The first column in the summary, namely Estimate gives us these values. The first value corresponds to ε, and the rest of the values correspond to the various βvalues. If the coefficient for a particular attribute is 0 or close to 0, that means it has very little to no effect on the prediction, and hence, can be removed. The standard error column gives an estimation of how much the coefficients may vary from the estimate values. The t value is calculated by dividing the estimate by the standard error column. The last column gives a measure of how likely it is that the coefficient is 0 and is inversely proportional to the t value column. Hence, an attribute with a high absolute value of t, or a very low absolute value of Pr(>|t|) is desirable.

The easiest way to determine which variables are significant is by looking at the stars next to them. The scheme is explained at the bottom of the table. Variables with three stars are most significant, followed by two stars, and finally one. Variables with a period next to them may or may not be significant and are generally not included in prediction models, and variables with nothing next to them are not significant.

In our model, we can see that all our variables are highly significant, so we will leave our prediction model as it is. In case you are dealing with a dataset in which there are one or more variables which are non-significant, it is advisable to test the model by removing one variable at a time. This is because when two variables are highly correlated with each other, they may become non-significant to the model, but when one of them is removed, they could become significant. This is due to multicollinearity. You can find out more about multicollinearity here.

The easiest way to check the accuracy of a model is by looking at the R-squared value. The summary provides two R-squared values, namely Multiple R-squared, and Adjusted R-squared. The Multiple R-squared is calculated as follows:

Multiple R-squared = 1 – SSE/SST where:

SSE is the sum of square of residuals. Residual is the difference between the predicted value and the actual value, and can be accessed by predictionModel$residuals.
SST is the total sum of squares. It is calculated by summing the squares of difference between the actual value and the mean value.

For example, lets say that we have 5, 6, 7, and 8, and a model predicts the outcomes as 4.5, 6.3, 7.2, and 7.9. Then, SSE can be calculated as: SSE = (5 – 4.5) ^ 2 + (6 – 6.3) ^ 2 + (7 – 7.2) ^ 2 + (8 – 7.9) ^ 2; and SST can be calculated as: mean = (5 + 6 + 7 + 8) / 4 = 6.5; SST = (5 – 6.5) ^ 2 + (6 – 6.5) ^ 2 + (7 – 6.5) ^ 2 + (8 – 6.5) ^ 2

The Adjusted R-squared value is similar to the Multiple R-squared value, but it accounts for the number of variables. This means that the Multiple R-squared will always increase when a new variable is added to the prediction model, but if the variable is a non-significant one, the Adjusted R-squared value will decrease. For more info, refer here.

An R-squared value of 1 means that it is a perfect prediction model, and an R-squared value of 0 means that it is of no improvement over the baseline model (the baseline model just predicts the output to always be equal to the mean). From the summary, we can see that our R-squared value is 0.9284, which is very high.

#Testing the prediction model

prediction <- predict(predictionmodel, newdata = test.data)

head(prediction)

head(test.data$Weight)

#calculate the value of R-squared for the prediction model on the test data

SSE <- sum((test.data$Weight - prediction)^2)

SST <- sum((test.data$Weight- mean(test.data$Weight))^2)

1-SSE/SST

Tuesday, 15 January 2019

[R] Toán tử pipe %>%

Vài năm trở lại đây, với sự phổ biến của package dplyr, toán tử %>% trở nên quen thuộc với phần lớn những người sử dụng R. Tuy nhiên, %>% không chỉ gói gọn trong package dplyr, nó đã được sử dụng bên trong rất nhiều các packages khác, chủ yếu liên quan đến việc data visualization.
Trong lập trình, pipe là 1 khái niệm thường dùng để mô tả việc sử dụng đầu ra của một hàm xử lý này để sử dụng như đầu vào của một hàm xử lý khác (chain). Khi khối xử lý kéo dài thành một chuối liên tiếp nhau thì còn được gọi là chainning và nó được gợi nhớ đến việc ghép nối giữa các ống nước trong cuộc sống hàng ngày (pipe)?

Ví dụ:
Trong R, khi ta muốn thể hiện: plot(exp(cumsum(returns)))
có thể viết thành: returns |> cumsum |> exp |> plot
Hoặc một cách "khoa học" hơn: %|>%` = function(x, y) y(x)
-> Console: 1:10 %|>% cumsum %|>% plot

Sự xuất hiện của dplyr

dplyr được thiết kế để xử lý các tác vụ xử lý với data frame, 5 chức năng chính bao gồm: select, filter, mutate, summarise và arrange. dplyr trình làng toán tử đầu tiên %.% nhằm thay thế cho hàm chain (trong package ptools của Peter Melstrup). Quá trình thay đổi được thể hiện như sau:

+ Cách viết truyền thống:

library(hflights)
library(dplyr)
filter(
   summarise(
     select(
       group_by(hflights, Year, Month, DayofMonth), 
       Year:DayofMonth, ArrDelay, DepDelay
     ), 
     arr = mean(ArrDelay, na.rm = TRUE),  
     dep = mean(DepDelay, na.rm = TRUE)
   ), 
   arr > 30 | dep > 30
 )

+ Cách sử dụng hàm Chain (dễ nhìn và trực quan hơn):

chain(
   hflights,
   group_by(Year, Month, DayofMonth),
   select(Year:DayofMonth, ArrDelay, DepDelay),
   summarise(
     arr = mean(ArrDelay, na.rm = TRUE), 
     dep = mean(DepDelay, na.rm = TRUE)
   ),
   filter(arr > 30 | dep > 30)
 )
+ Dùng toán tử %.%
hflights %.%
   group_by(Year, Month, DayofMonth) %.%
   select(Year:DayofMonth, ArrDelay, DepDelay) %.%
   summarise(
     arr = mean(ArrDelay, na.rm = TRUE), 
     dep = mean(DepDelay, na.rm = TRUE)
   ) %.%
   filter(arr > 30 | dep > 30)

Tuy nhiên, kể từ ngày 14/4/2014, %>% được sử dụng làm toán tử pipe trong dplyr, thay hoàn toàn cho %.% và chain.

(R) Hồi Quy Phân Vị - Quantile Regression

Phương pháp hồi quy phân vị được Koenker & Bassett giới thiệu lần đầu tiên năm 1978. Thay vì ước lượng các tham số của hàm hồi quy trung bình bằng phương pháp OLS, Koenker & Bassett (1978) đề xuất việc ước lượng tham số hồi quy trên từng phân vị của biến phụ thuộc để sao cho tổng chênh lệch tuyệt đối của hàm hồi quy tại phân vị τ của biến phụ thuộc là nhỏ nhất. Nói một cách khác, thay vì xác định tác động biên của biến độc lập đến giá trị trung bình của biến phụ thuộc, hồi quy phân vị sẽ giúp xác định tác động biên của biến độc lập đến biến phụ thuộc trên từng phân vị của biến phụ thuộc đó.

Ưu điểm:

- Thứ nhất, phương pháp hồi quy phân vị cho phép thể hiện một cách chi tiết về mối quan hệ giữa biến phụ thuộc và các biến độc lập trên từng phân vị của biến phụ thuộc, không phải chỉ xét mối quan hệ này trên giá trị trung bình như hồi quy OLS.

- Thứ hai, mặc dù các tính toán thực hiện trong hồi quy phân vị là phức tạp và khối lượng tính toán nhiều hơn trong OLS, nhưng với sự phát triển của toán học, thống kê học cộng với sự hỗ trợ của công nghệ thông tin thì những tính toán như quy hoạch tuyến tính, bootstrap, được thực hiện rất dễ dàng và nhanh chóng.

- Thứ ba, trong hồi quy OLS, các quan sát bất thường (outliers) thường được loại bỏ để ước lượng OLS không bị chệch. Trong khi đó, hồi quy phân vị có tính ổn định (robustness), không bị ảnh hưởng bởi sự hiện diện của các quan sát bất thường đó.

- Thứ tư, các kiểm định về tham số của hồi quy phân vị không dựa vào tính chuẩn của sai số. Hơn nữa, các kiểm định này không dựa trên bất kỳ một giả định nào về dạng phân phối của sai số hồi quy.

- Thứ năm, hồi quy phân vị đặc biệt phù hợp khi phân tích trên mô hình hồi quy có sự hiện diện của phương sai thay đổi hoặc trong mẫu số liệu mà hàm phân phối của biến phụ thuộc bất đối xứng quanh giá trị trung bình. Khi đó, hàm hồi quy phân vị trên các phân vị khác nhau sẽ có sự khác biệt rõ rệt, cho thấy tác 26 động không giống nhau của biến độc lập đến biến phụ thuộc ở những phân vị khác nhau.

Nhược điểm:

- Một là, các tính toán trong hồi quy phân vị phức tạp hơn so với OLS. Ví dụ như trong OLS, muốn tìm ước lượng tham số hồi quy sao cho tổng bình phương sai số là nhỏ nhất thì có thể áp dụng các công thức tìm cực trị của giải tích toán học như lấy đạo hàm riêng và giải hệ phương trình ứng với điều kiện cần của cực trị. Trong khi đó, ước lượng tham số của hồi quy phân vị thực hiện thông qua việc giải bài toán quy hoạch tuyến tính. Việc này sẽ khó khăn nếu không có sự hỗ trợ của máy tính.

- Hai là, phải thực hiện nhiều hàm hồi quy trên nhiều phân vị mới cho thấy được toàn diện sự tác động của biến độc lập đến biến phụ thuộc thay vì chỉ có một hàm hồi quy trung bình có điều kiện trong OLS.

- Ba là, việc áp dụng hồi quy phân vị cho các dạng hàm phi tuyến còn khá hạn chế. Các lý thuyết để xử lý tự tương quan hoặc nội sinh trong hồi quy phân vị còn chưa được phát triển hoàn thiện.

Với những thế mạnh đó, ngày càng có nhiều các bài nghiên cứu ứng dụng sử dụng hồi quy phân vị được thực hiện và công bố, cho thấy phương pháp ước lượng này đang ngày càng được hoàn thiện và là công cụ đắc lực trong nghiên cứu kinh tế cũng như các lĩnh vực khác. Ở VN các bạn có thể tham khảo hai nghiên cứu sau:

http://www.depocen.org/vn/our-publication/articles/189-cac-yeu-to-anh-huong-den-ket-qua-hoc-tap-o-hoa-ky.html.

http://sdh.ueh.edu.vn/download/luan-an-tien-si/LATS-TranThiTuanAnh.pdf.

Quartiles (tứ phân vị)

Tứ phân vị là đại lượng mô tả sự phân bố và sự phân tán của tập dữ liệu. Tứ phân vị có 3 giá trị, đó là tứ phân vị thứ nhất (Q1), thứ nhì (Q2), và thứ ba (Q3). Ba giá trị này chia một tập hợp dữ liệu (đã sắp xếp dữ liệu theo trật từ từ bé đến lớn) thành 4 phần có số lượng quan sát đều nhau.

Tứ phân vị được xác định như sau:

· Sắp xếp các số theo thứ tự tăng dần

· Cắt dãy số thành 4 phàn bằng nhau

· Tứ phân vị là các giá trị tại vị trí cắt

Độ trải giữa (Interquartile Range - IQR)

Interquartile Range được xác định như sau:

Typical Case

Nghiên cứu tác động của income lên chi cho thực phẩm (foodexp) với bộ dữ liệu có tên engel tích hợp trong gói quantreg do chính tác giả của phương pháp này là Koenker (2005) viết.

#Fix lỗi không install.package

options(repos = c(CRAN = "http://cran.rstudio.com"))

#Install necessary packages

rm(list = ls())

install.packages("quantreg")

library(quantreg)

library(ggplot2)

install.packages("stargazer")

library(stargazer)

#Read the “engel” dataset which is one of integrated dataset in “quantreg” package (Engel food expenditure data used in Koenker and Bassett(1982). This is a regression data set consisting of 235 observations on income and expenditure on food for Belgian working class households)

data("engel")

head(data)

plot(engel, log = "xy", main = "'engel' data (log - log scale)")

plot(log10(foodexp) ~ log10(income), data = engel, main = "'engel' data (log10 - transformed)")

#Transform engel to data frame

data.frame(engel)

head(engel, n=10)

view(engel)

#Quantile regression in R

phanvi <- rq(foodexp ~ income,engel, tau = c(0.25, 0.50, 0.75))

summary (phanvi)

# Để hỗ trợ so sánh trực quan, dùng gói stargazer:

q25=rq(foodexp ~ income, engel,tau=0.25)

q50=rq(foodexp ~ income, engel,tau=0.50)

q75=rq(foodexp ~ income, engel,tau=0.75)

stargazer(q25,q50,q75,title="Regression quantitle result", type="text")

#Explanation: in the first quantitle, foodexp = 0.029 + 0.474*income (If the income increase by 1 bp, the food expense will increase by 0.474 bp). It is clearly seen that there is a difference between the Q1 and the Q3, then we can take an Anova test to check whether it differs from each other or not. The hypothesis will be explained by:

Ho: the impact of income on foodexp is homogenous

H1: the impact of income on foodexp is not homogenous

anova(q25,q75,joint = FALSE)

#Data frame in quantitle

df_phanvi <- data.frame(t(coef(phanvi)))

View(df_phanvi)

#rename

names(df_phanvi)[names(df_phanvi)=='X.Intercept.'] <- 'Intercept'

names(df_phanvi)[names(df_phanvi)=='income'] <- 'slope'

head(df_phanvi)

or: colnames(df_phanvi) <- c("intercept", "slope")

df_phanvi$cacphanvi <- row.names(df_phanvi)

head(df_phanvi)

#Draw graph

graph <- qplot(x=income, y= foodexp, data = engel)

graph + geom_abline(aes(intercept = Intercept, slope = slope, colour = cacphanvi), data = df_phanvi)