Monday, 18 March 2019

Using dplyr for data visulization (Part 2: ARRANGE, SELECT, RENAME, MUTATE)

rm(list= ls())
library(tidyverse)
options(repos = c(CRAN = "http://cran.rstudio.com"))
install.packages("nycflights13")
library(nycflights13)
nycflights13::flights
arrange(flights, year, month, day)

#Using desc() to reorder by a column in descending order#
arrange(flights, desc(arr_delay))
?select

#select a column by name#
select(flights, year, month, day)
#select all columns between year and day#
select(flights, year:day)
#or except for year to day#
select(flights, -(year:day))

#rename a variable#
rename(flights,tail_num = tailnum)

#Add new variable by Mutate() function#
flights <- data.frame(flights)
head(flights)
flightsml <- select(flights, year:day, ends_with("delay"), distance, air_time)
head(flightsml)
mutate(flightsml,delay = arr_delay - dep_delay, speed = distance/air_time*60)
?flights
#*ends_with: for the variable which ends by letter "delay", such as dep_delay, arr_delay...*#

#Using pipe function : this is a series of imperative statements: group, then summarize, then filter. As suggested by this reading, a good way to pronounce %>% when reading code is “then.”
#This code is to explore the relationship between the distance and average delay for each location. 

delays <- flights %>%
  group_by(dest) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)
            ) %>%
  filter(count > 20, dest != "HNL")

delays %>% ggplot(delays, mapping = aes(x = dist, y = delay))+
  geom_point(aes(size = count), alpha = 1/3)+
  geom_smooth(se = FALSE)
#It looks like delays increase with distance up to ~750 miles and then decrease.

# Na.rm = TRUE
The aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument, which removes the missing values prior to computation.



Using dplyr for data visulization (Part1: FILTER)


###Using package dplyr for data visualization###
rm(list=ls())
options(repos = c(CRAN = "http://cran.rstudio.com"))
install.packages("nycflights13")
library(nycflights13)
library(tidyverse)

###Using flights data to practice###
int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time)
lgl stands for logical, vectors that contain only TRUE or FALSE
fctr stands for factors, which R uses to represent categorical variables with fixed possible values
date stands for dates

?flights
nycflights13::flights
head(flights, n= 10)

### Using dplyr function to manipulate the data ###
Filter (): Pick observations by their values
Arrange (): Reorder the rows
Select (): Pick variables by their names
Mutate(): Create new variables with functions of existing variables
Summarize (): Collapse many values down to a single summary
Group_by():changes the scope of each function from operating on the entire dataset to operating on it group-by-group

Filter:
filter(flights, month == 1, day == 1)
filter(flights, month == 2, day == 2)

jan1 <- filter(flights, month ==1, day ==1)
head(jan1, n =10)
dec25 <- filter(flights, month == 12, day ==25)
head(dec25, n =9)

### Comparision: the standard suite: >, >=, <, <=, != (not equal), and == (equal)###
###Boolean operators: & is “and,” | is “or,” and ! is “not.”
filter(flights, month == 11 | month ==12)

### x %in% y: This will select every row where x is one of the values in y
nov_dec <- filter(flights, month %in% c(11,12))

### find flights that weren’t delayed (on arrival or departure) by more than two hours
filter(flights, !(arr_delay >=120 | dep_delay >=120))
filter(flights, !(arr_delay >= 45 & dep_delay >=45))
filter(flights, (arr_delay <=-3 | dep_delay <=-3 ))