In statistics, the data structure we use most often is the data frame. It looks like a matrix, but like a list it can hold a different type of data in each column. Within a single column every value must have the same type; if the values are mixed, R coerces them to a common type, which is usually character. Becoming proficient with data frames is essential for data analysis in R.
We usually create a data frame from vectors of equal length, one vector per column.
names <- c("Angela", "Shondra")
ages <- c(27,36)
insurance <- c(TRUE, TRUE)
patients <- data.frame(names, ages, insurance)
## names ages insurance
## 1 Angela 27 TRUE
## 2 Shondra 36 TRUE
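To see the "one type per column" rule in action, here is a minimal sketch that rebuilds the example and inspects the class of each column. The names patients_demo, col_types, and mixed are ours, chosen for illustration, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# Rebuild the example data frame (patients_demo is an illustrative name)
names <- c("Angela", "Shondra")
ages <- c(27, 36)
insurance <- c(TRUE, TRUE)
patients_demo <- data.frame(names, ages, insurance, stringsAsFactors = FALSE)

# Each column has exactly one class
col_types <- sapply(patients_demo, class)
print(col_types)

# Mixing types inside one vector forces a common type (here character)
mixed <- c(27, "unknown")
print(class(mixed))
```

Checking the classes this way before an analysis is a quick guard against columns that were silently coerced.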
We may wish to add rows or columns to our data. We can do this with rbind(), which binds rows, and cbind(), which binds columns.
For example, to add another patient to our patient data, we could do the following:
l <- c(names="Liu Jie", age=45, insurance=TRUE)
rbind(patients, l)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "Liu Jie"): invalid factor
## level, NA generated
## names ages insurance
## 1 Angela 27 TRUE
## 2 Shondra 36 TRUE
## 3 <NA> 45 TRUE
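This warning is a reminder to always check the type of each column. Here data.frame() stored the names column as a factor, which was the default for character vectors before R 4.0.0, so a name that is not an existing level becomes NA. We want the column to be character instead, so we convert it and then add the row again: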
patients$names <- as.character(patients$names)
patients <- rbind(patients, l)
## names ages insurance
## 1 Angela 27 TRUE
## 2 Shondra 36 TRUE
## 3 Liu Jie 45 TRUE
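One caveat with the approach above: c() builds an atomic vector, so the string, the number, and the logical in l are all stored as character, and binding that vector can turn the ages and insurance columns into character as well. A minimal sketch of a type-preserving alternative is to build the new row as a one-row data frame. The names patients_demo and new_patient are ours, for illustration only.

```r
# Starting data, with character names (no factors)
patients_demo <- data.frame(
  names = c("Angela", "Shondra"),
  ages = c(27, 36),
  insurance = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

# The new row is itself a data frame, so each value keeps its own type
new_patient <- data.frame(
  names = "Liu Jie",
  ages = 45,
  insurance = TRUE,
  stringsAsFactors = FALSE
)

# Column names must match for rbind() to line the rows up
patients_demo <- rbind(patients_demo, new_patient)

# str() shows that ages is still numeric and insurance is still logical
str(patients_demo)
```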
Finally, suppose we want to add another column of data. We first create the new values as a vector:
# Next appointments
next.appt <- c("09/23/2016", "04/14/2016", "02/25/2016")
# Let R know these are dates by describing the format of the input strings
next.appt <- as.Date(next.appt, format = "%m/%d/%Y")
next.appt
## [1] "2016-09-23" "2016-04-14" "2016-02-25"
We now have a vector of dates, which we can attach to the data frame with cbind():
patients <- cbind(patients, next.appt)
## names ages insurance next.appt
## 1 Angela 27 TRUE 2016-09-23
## 2 Shondra 36 TRUE 2016-04-14
## 3 Liu Jie 45 TRUE 2016-02-25
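A column can also be added by assigning to a new name with $, which is the form used later in this section. The sketch below uses a small stand-in data frame called appts (our name, not part of the example above) and shows that the new column really is of class Date, so it sorts chronologically rather than alphabetically.

```r
# Stand-in data frame for illustration
appts <- data.frame(
  names = c("Angela", "Shondra", "Liu Jie"),
  stringsAsFactors = FALSE
)

# Assigning to a new column name adds the column
appts$next.appt <- as.Date(
  c("09/23/2016", "04/14/2016", "02/25/2016"),
  format = "%m/%d/%Y"
)

# The column is a Date, not a character string
print(class(appts$next.appt))

# order() on a Date column sorts by calendar date
appts_sorted <- appts[order(appts$next.appt), ]
print(appts_sorted)
```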
To practice accessing the contents of a data frame, we will use the Titanic data set that is built into R. It is stored as a table, so we first convert it to a data frame:
titanic <- data.frame(Titanic)
We can look at the names of the columns in the data set:
colnames(titanic)
## [1] "Class" "Sex" "Age" "Survived" "Freq"
We can use the same [row, column] indexing that we used with arrays to look at the first 2 rows of data:
titanic[1:2, ]
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
A simple function for looking at the start of the data is head(), which shows the first six rows by default:
head(titanic)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
We can also look at the last few rows with tail():
tail(titanic)
## Class Sex Age Survived Freq
## 27 3rd Male Adult Yes 75
## 28 Crew Male Adult Yes 192
## 29 1st Female Adult Yes 140
## 30 2nd Female Adult Yes 80
## 31 3rd Female Adult Yes 76
## 32 Crew Female Adult Yes 20
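Both head() and tail() accept a second argument giving the number of rows to show, and dim() and str() give a quick overview of the whole data frame. A short sketch, where the variable names first_three, last_two, and dims are ours:

```r
titanic <- data.frame(Titanic)

# Ask for a specific number of rows instead of the default six
first_three <- head(titanic, 3)
last_two <- tail(titanic, 2)
print(first_three)
print(last_two)

# Number of rows and columns
dims <- dim(titanic)
print(dims)

# Compact summary of every column and its type
str(titanic)
```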
If we wished to access the age information, we could do this by the column number:
titanic[, 3]
## [1] Child Child Child Child Child Child Child Child Adult Adult Adult
## [12] Adult Adult Adult Adult Adult Child Child Child Child Child Child
## [23] Child Child Adult Adult Adult Adult Adult Adult Adult Adult
## Levels: Child Adult
or more frequently we would use the column name instead:
titanic[, "Age"]
## [1] Child Child Child Child Child Child Child Child Adult Adult Adult
## [12] Adult Adult Adult Adult Adult Child Child Child Child Child Child
## [23] Child Child Adult Adult Adult Adult Adult Adult Adult Adult
## Levels: Child Adult
This means we can access data by row or column number, but also by name. For large data frames, accessing columns by name is key: the code is easier to read and keeps working if the column order changes.
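There are several equivalent ways to pull out a column by name. The sketch below compares them; the variable names on the left-hand side are ours. Note the difference in the last line: single brackets without a comma return a one-column data frame rather than a vector.

```r
titanic <- data.frame(Titanic)

# Three ways to extract the Age column as a vector
age_by_index <- titanic[, "Age"]
age_by_dollar <- titanic$Age
age_by_brackets <- titanic[["Age"]]

# All three give the same result
print(identical(age_by_index, age_by_dollar))
print(identical(age_by_index, age_by_brackets))

# Single brackets with no comma keep the data frame structure
age_df <- titanic["Age"]
print(class(age_df))
```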
Let's say we want information about a particular class. A first attempt might be to use the class label as the row index:
titanic["1st", ]
## Class Sex Age Survived Freq
## NA <NA> <NA> <NA> <NA> NA
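This returns a row of NA values because a character row index is matched against the row names, which here are "1" through "32", not against the values in the Class column. Instead, we can select rows with a logical condition on the factor columns: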
first.class.freq <- titanic[titanic$Class=="1st", "Freq"]
## [1] 0 0 118 4 5 1 57 140
male.freq <- titanic[titanic$Sex=="Male", "Freq"]
## [1] 0 0 35 0 118 154 387 670 5 11 13 0 57 14 75 192
Then we can add up the selected values:
sum(first.class.freq)
## [1] 325
sum(male.freq)
## [1] 1731
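Conditions can be combined with & to narrow the selection further. As a sketch, the following counts the men in first class; first_male and first_male_subset are our own names, and subset() is shown as an alternative way to write the same filter.

```r
titanic <- data.frame(Titanic)

# Rows that are both first class and male
first_male <- titanic[titanic$Class == "1st" & titanic$Sex == "Male", ]
print(first_male)

# Total number of first-class men
first_male_total <- sum(first_male$Freq)
print(first_male_total)

# subset() expresses the same filter without repeating titanic$
first_male_subset <- subset(titanic, Class == "1st" & Sex == "Male")
```

The total agrees with the values printed above: the first-class male counts are 0, 118, 5, and 57, which sum to 180.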
As an exercise, create the following data frame and answer the questions below.
example <- data.frame(c1 = runif(50), c2 = rnorm(50), c3 = runif(50))
# 1. How many observations are there in example?
# 2. How many variables are there in example?
# 3. What are the names of the variables in example?
# 4. Create a data frame with only the observations where c1 > 0.2. Name this c1_gr_02.
# 5. Create a data frame with only the observations where c1 > 0.2 and c2 > 0.2. Name this c1_c2_gr_02.
One possible solution:
# 1. How many observations are there in example?
nrow(example)
# 2. How many variables are there in example?
ncol(example)
# 3. What are the names of the variables in example?
names(example)
# 4. Create a data frame with only the observations where c1 > 0.2. Name this c1_gr_02.
c1_gr_02 <- example[example$c1 > 0.2, ]
# 5. Create a data frame with only the observations where c1 > 0.2 and c2 > 0.2. Name this c1_c2_gr_02.
c1_c2_gr_02 <- example[example$c1 > 0.2 & example$c2 > 0.2, ]
Suppose we want not only the frequency in each group but also the proportion of all people on board that each group represents. We can ask R to calculate this and add it to our data as a new column:
titanic$surv_p <- titanic$Freq / sum(titanic$Freq)
head(titanic, 4)
## Class Sex Age Survived Freq surv_p
## 1 1st Male Child No 0 0.00000000
## 2 2nd Male Child No 0 0.00000000
## 3 3rd Male Child No 35 0.01590186
## 4 Crew Male Child No 0 0.00000000
Perhaps we are not pleased with the decimal places and would rather have a percentage. We can overwrite the column with the rescaled values:
titanic$surv_p <- titanic$surv_p * 100
head(titanic, 4)
## Class Sex Age Survived Freq surv_p
## 1 1st Male Child No 0 0.000000
## 2 2nd Male Child No 0 0.000000
## 3 3rd Male Child No 35 1.590186
## 4 Crew Male Child No 0 0.000000
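If the goal is fewer decimal places, round() can be applied to the column as well. The sketch below recomputes the percentage from scratch, stores a rounded copy in a new column called surv_p_rounded (our name), and checks that the unrounded percentages add up to 100.

```r
titanic <- data.frame(Titanic)

# Percentage of all people on board in each row
titanic$surv_p <- titanic$Freq / sum(titanic$Freq) * 100

# Keep one decimal place in a separate column
titanic$surv_p_rounded <- round(titanic$surv_p, 1)

# The unrounded percentages should total 100
total_pct <- sum(titanic$surv_p)
print(total_pct)

print(head(titanic, 4))
```

Storing the rounded values in a separate column keeps the full-precision numbers available for later calculations.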
In the future we will be performing many more operations on data frames.