Tibbles

Previously we have worked with data in the form of

  • Vectors
  • Lists
  • Arrays
  • Dataframes

What is a Tibble?

A “tibble” is a modern take on the data frame. It keeps the features of the original data frame that have proven useful and drops many that are outdated. Tibbles are another addition to R from Hadley Wickham, provided by the tibble package. We will use them throughout the tidyverse in place of the traditional data frame that we just learned about.

Compared to Data Frames

  • A tibble never changes the input type.
    • No more worry about strings being automatically converted to factors (the default behaviour of data.frame() before R 4.0.0).
  • A tibble can have columns that are lists.
  • A tibble can have non-standard variable names.
    • Names can start with a number or contain spaces.
    • To refer to such a name, wrap it in backticks.
  • It only recycles vectors of length 1.
  • It never creates row names.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.2
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: package 'tidyr' was built under R version 3.3.2
try <- tibble(x = 1:3, y = list(1:5, 1:10, 1:20))
try
## # A tibble: 3 × 2
##       x          y
##   <int>     <list>
## 1     1  <int [5]>
## 2     2 <int [10]>
## 3     3 <int [20]>

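We can see that y is displayed as a list column. Outside of tibble(), each element of a list is treated as a separate column, and columns of different lengths cannot be combined. If we flatten the same inputs with c() and coerce the result (here with as_data_frame(), an older alias for as_tibble()), we get: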

try <- as_data_frame(c(x = 1:3, y = list(1:5, 1:10, 1:20)))
try
Error: Variables must be length 1 or 20. Problem variables: 'y1', 'y2'
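
Two other points from the comparison list, type preservation and recycling, can be checked in the same way. The following is a minimal sketch; the object names ok and res are made up for illustration.

```r
library(tibble)

# A length-1 value is recycled to match the other column.
ok <- tibble(x = 1:4, y = 0)
nrow(ok)
## [1] 4

# A length-2 vector is not recycled, so tibble() signals an error.
# data.frame(x = 1:4, y = 1:2) would silently repeat y instead.
res <- tryCatch(tibble(x = 1:4, y = 1:2), error = function(e) "error")
res
## [1] "error"

# Character input stays character.
class(tibble(s = c("a", "b"))$s)
## [1] "character"
```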

We can use a non-standard name in our tibble as well:

names(data.frame(`crazy name` = 1))
## [1] "crazy.name"
names(tibble(`crazy name` = 1))
## [1] "crazy name"

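Notice that data.frame() replaced the name we asked for. By default (check.names = TRUE) it converts column names into syntactically valid ones, so the space became a dot. tibble() keeps the name exactly as given.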
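
Once a column has a non-standard name, it has to be wrapped in backticks whenever it is used as a bare name. A short sketch, with a made-up tibble tb:

```r
library(tibble)

# Both names are non-standard: one contains a space, one starts with a digit.
tb <- tibble(`crazy name` = 1:3, `2nd` = c("a", "b", "c"))

# With $ the name must be wrapped in backticks.
tb$`crazy name`
## [1] 1 2 3

# With [[ ]] the name is an ordinary string, so no backticks are needed.
tb[["2nd"]]
## [1] "a" "b" "c"
```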

Coercing into Tibbles

An existing object can be coerced to a tibble with as_tibble(). It works much like as.data.frame(), but it is noticeably faster:

l <- replicate(26, sample(100), simplify = FALSE)
names(l) <- letters

microbenchmark::microbenchmark(
  as_tibble(l),
  as.data.frame(l)
)
## Unit: microseconds
##              expr      min       lq      mean    median       uq      max
##      as_tibble(l)  309.250  327.099  376.2002  344.7265  386.004 1689.046
##  as.data.frame(l) 1390.507 1464.361 1614.3087 1543.3465 1690.608 3104.097
##  neval cld
##    100  a 
##    100   b

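microbenchmark() runs each expression many times (100 here, as the neval column shows) and summarizes the distribution of the run times. In this run the median time for as_tibble() is roughly a quarter of that for as.data.frame() (about 345 versus 1,543 microseconds). The absolute difference is small for a single call, but it can add up when many objects are converted during an analysis.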
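
Speed aside, the two coercions should hold the same data, and a tibble is still a data frame underneath. A minimal check, rebuilding the list l from above and using the illustrative names tb and df:

```r
library(tibble)

# Same input as in the benchmark: 26 columns of 100 shuffled integers.
l <- replicate(26, sample(100), simplify = FALSE)
names(l) <- letters

tb <- as_tibble(l)
df <- as.data.frame(l)

# Both results have 100 rows and 26 columns, with the same names.
dim(tb)
dim(df)
identical(names(tb), names(df))

# A tibble inherits from data.frame, so functions written for
# data frames will generally accept it.
inherits(tb, "data.frame")
```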

Tibbles vs Data Frames

There are two key differences between tibbles and data frames in everyday use:

  • Printing.
  • Subsetting.

Printing

  • Tibbles print only the first 10 rows and as many columns as fit on the screen.
    • Each column displays its data type.
  • You will not accidentally print too much.
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
## # A tibble: 1,000 × 5
##                      a          b     c          d     e
##                 <dttm>     <date> <int>      <dbl> <chr>
## 1  2017-02-19 09:02:23 2017-03-09     1 0.02150370     f
## 2  2017-02-19 01:42:10 2017-03-09     2 0.08031493     k
## 3  2017-02-19 05:36:59 2017-03-08     3 0.11670172     u
## 4  2017-02-19 18:49:56 2017-03-09     4 0.24552337     h
## 5  2017-02-19 04:15:06 2017-03-05     5 0.11232662     b
## 6  2017-02-19 10:00:27 2017-03-09     6 0.52834632     m
## 7  2017-02-19 13:42:43 2017-03-16     7 0.78928491     v
## 8  2017-02-19 17:02:27 2017-03-16     8 0.80388276     h
## 9  2017-02-19 15:09:33 2017-03-19     9 0.45767339     d
## 10 2017-02-19 09:14:04 2017-02-25    10 0.18177950     t
## # ... with 990 more rows
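
When ten rows are not enough, the print method for tibbles accepts n and width arguments. A small sketch, using a made-up tibble tb rather than the one above:

```r
library(tibble)

tb <- tibble(id = 1:1000, value = runif(1000))

# Show 15 rows instead of the default 10.
print(tb, n = 15)

# Show every column, even if they do not fit on the screen.
print(tb, width = Inf)
```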

Subsetting

  • We can index a tibble in the manners we are used to
    • df$x
    • df[["x"]]
    • df[[1]]
  • We can also use a pipe, which we will learn about later.
    • df %>% .$x
    • df %>% .[["x"]]
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

df$x
## [1] 0.6227033 0.7363213 0.8551199 0.9173554 0.5542486
df[["x"]]
## [1] 0.6227033 0.7363213 0.8551199 0.9173554 0.5542486
df[[1]]
## [1] 0.6227033 0.7363213 0.8551199 0.9173554 0.5542486

The above commands should seem very familiar from our previous work. With piping (also called chaining) we can do the same:

df %>% .$x
## [1] 0.6227033 0.7363213 0.8551199 0.9173554 0.5542486
df %>% .[["x"]]
## [1] 0.6227033 0.7363213 0.8551199 0.9173554 0.5542486
df %>% .[[1]]
## [1] 0.6227033 0.7363213 0.8551199 0.9173554 0.5542486
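
The commands above behave the same for tibbles and data frames. The differences show up with single-bracket subsetting and with partial column names. A minimal sketch, with made-up objects tb and df that hold the same data:

```r
library(tibble)

tb <- tibble(xyz = 1:3, y = c("a", "b", "c"))
df <- data.frame(xyz = 1:3, y = c("a", "b", "c"))

# Selecting one column with [ keeps a tibble a tibble,
# while a data frame is simplified to a plain vector.
class(tb[, "xyz"])
## [1] "tbl_df"     "tbl"        "data.frame"
class(df[, "xyz"])
## [1] "integer"

# $ on a data frame partially matches "xy" to the column xyz.
df$xy
## [1] 1 2 3

# A tibble does not partially match: it returns NULL and warns
# (the warning is suppressed here to keep the output short).
suppressWarnings(tb$xy)
## NULL
```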
