As you have seen in your own work, being able to summarize information is crucial. We need to be able to take our data and summarize it as well. We will consider doing this using the summarise() function.
As in the rest of these lessons, let’s consider what happens when we try to do this in base R. We will group the flights by dest and then take the mean of arr_delay within each group:
head(with(flights, tapply(arr_delay, dest, mean, na.rm=TRUE)))
head(aggregate(arr_delay ~ dest, flights, mean))
I will not explain the logic of exactly what R is doing in the Base R versions; instead, let’s consider the same task with the summarise() function.
summarise() Function
The summarise() function is:
summarise(.data, ...)
where
.data is the tibble of interest.
... is a list of name-paired summary functions, such as mean(), median(), var(), sd(), and min().
Note: summarise() is primarily useful with data that has been grouped by one or more variables.
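As a minimal sketch of the name-paired form, the snippet below applies a few of these summary functions to a small toy tibble, first ungrouped and then grouped. The tibble `toy` and its values are made up for illustration and are not taken from flights.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from flights)
toy <- tibble(
  dest      = c("ABQ", "ABQ", "ALB", "ALB", "ALB"),
  arr_delay = c(10, NA, -5, 20, 18)
)

# Ungrouped: the whole tibble collapses to a single row
overall <- toy %>%
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    sd_delay  = sd(arr_delay, na.rm = TRUE),
    min_delay = min(arr_delay, na.rm = TRUE)
  )

# Grouped: one row per destination
by_dest <- toy %>%
  group_by(dest) %>%
  summarise(
    avg_delay    = mean(arr_delay, na.rm = TRUE),
    median_delay = median(arr_delay, na.rm = TRUE)
  )
```

Each argument has the form name = summary_function(column), and the name on the left becomes a column in the result. Ungrouped, the result has one row; grouped, it has one row per group.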
Our example:
flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay, na.rm=TRUE))
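Consider the logic here: group the flights by dest, then summarise each group by taking the mean of arr_delay and naming the result avg_delay. This is much easier to understand than the Base R code.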
## # A tibble: 105 × 2
## dest avg_delay
## <chr> <dbl>
## 1 ABQ 4.381890
## 2 ACK 4.852273
## 3 ALB 14.397129
## 4 ANC -2.500000
## 5 ATL 11.300113
## 6 AUS 6.019909
## 7 AVL 8.003831
## 8 BDL 7.048544
## 9 BGR 8.027933
## 10 BHM 16.877323
## # ... with 95 more rows
Let’s say that we would like more than just the averages and want the minimum and maximum of both the departure and arrival delays for each carrier:
flights %>%
  group_by(carrier) %>%
  summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("delay"))
## # A tibble: 16 × 5
## carrier dep_delay_min arr_delay_min dep_delay_max arr_delay_max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9E -24 -68 747 744
## 2 AA -24 -75 1014 1007
## 3 AS -21 -74 225 198
## 4 B6 -43 -71 502 497
## 5 DL -33 -71 960 931
## 6 EV -32 -62 548 577
## 7 F9 -27 -47 853 834
## 8 FL -22 -44 602 572
## 9 HA -16 -70 1301 1272
## 10 MQ -26 -53 1137 1127
## 11 OO -14 -26 154 157
## 12 UA -20 -75 483 455
## 13 US -19 -70 500 492
## 14 VX -20 -86 653 676
## 15 WN -13 -58 471 453
## 16 YV -16 -46 387 381
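In later dplyr releases summarise_each() and funs() are deprecated. Assuming dplyr 1.0.0 or newer, a roughly equivalent sketch uses across() inside summarise(). The tibble `toy_flights` below is a hypothetical stand-in that only borrows the column names from flights; its values are made up.

```r
library(dplyr)

# Toy stand-in for flights (hypothetical values, same column names)
toy_flights <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(-3, 12, NA, 40),
  arr_delay = c(-10, 5, 7, NA)
)

# Apply min and max to every column whose name matches "delay"
delay_range <- toy_flights %>%
  group_by(carrier) %>%
  summarise(across(
    matches("delay"),
    list(
      min = ~ min(.x, na.rm = TRUE),
      max = ~ max(.x, na.rm = TRUE)
    )
  ))
```

With across() the result columns are named column_function, as in the output above, but they are ordered column by column (dep_delay_min, dep_delay_max, arr_delay_min, arr_delay_max) rather than function by function.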
The following is a new function:
n() counts the number of rows in a group.
Then for each day, count the total number of flights and sort in descending order.
Your answer should look like:
## Source: local data frame [365 x 3]
## Groups: month [12]
##
## month day flight_count
## <int> <int> <int>
## 1 11 27 1014
## 2 7 11 1006
## 3 7 8 1004
## 4 7 10 1004
## 5 12 2 1004
## 6 7 18 1003
## 7 7 25 1003
## 8 7 12 1002
## 9 7 9 1001
## 10 7 17 1001
## # ... with 355 more rows
We could also have used the tally() function:
flights %>%
  group_by(month, day) %>%
  tally(sort = TRUE)
## Source: local data frame [365 x 3]
## Groups: month [12]
##
## month day n
## <int> <int> <int>
## 1 11 27 1014
## 2 7 11 1006
## 3 7 8 1004
## 4 7 10 1004
## 5 12 2 1004
## 6 7 18 1003
## 7 7 25 1003
## 8 7 12 1002
## 9 7 9 1001
## 10 7 17 1001
## # ... with 355 more rows
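To see that counting with n() and counting with tally() agree, here is a small sketch on a toy tibble. The data are made up, and count() is shown as a third option on the assumption that a shorthand for group_by() followed by tally() is acceptable here. The .groups argument assumes dplyr 1.0.0 or newer.

```r
library(dplyr)

# Toy data (hypothetical): one row per flight
toy <- tibble(
  month = c(1, 1, 1, 1, 2, 2),
  day   = c(1, 1, 1, 2, 1, 1)
)

# Explicit version: n() inside summarise(), then sort
with_n <- toy %>%
  group_by(month, day) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# tally() counts rows per group and can sort in one step
with_tally <- toy %>%
  group_by(month, day) %>%
  tally(sort = TRUE) %>%
  ungroup()

# count() groups, tallies, and ungroups in a single call
with_count <- toy %>%
  count(month, day, sort = TRUE)
```

All three results hold the same counts in the same order; the only difference in the explicit version is that you choose the name of the count column yourself.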
The following is a new function:
n_distinct(vector) counts the number of unique items in that vector.
Then for each destination, count the total number of flights and the number of distinct planes that flew there.
Your answer will look like:
## # A tibble: 105 × 3
## dest flight_count plane_count
## <chr> <int> <int>
## 1 ABQ 254 108
## 2 ACK 265 58
## 3 ALB 439 172
## 4 ANC 8 6
## 5 ATL 17215 1180
## 6 AUS 2439 993
## 7 AVL 275 159
## 8 BDL 443 186
## 9 BGR 375 46
## 10 BHM 297 45
## # ... with 95 more rows
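The difference between n() and n_distinct() is easiest to see on a tiny example. The sketch below uses a made-up tibble; the column name tailnum is borrowed from flights as an assumption about which column identifies a plane, and the values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical): one row per flight
toy <- tibble(
  dest    = c("ABQ", "ABQ", "ABQ", "ALB", "ALB"),
  tailnum = c("N1", "N1", "N2", "N3", "N3")
)

per_dest <- toy %>%
  group_by(dest) %>%
  summarise(
    rows_in_group   = n(),                 # every row counts
    distinct_planes = n_distinct(tailnum)  # repeated values count once
  )
```

Here ABQ has three rows but only two distinct planes, so the two counts differ whenever the same plane appears more than once in a group.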