Up to this point in the course we have worked only with toy data that existed only in the R window or that we made up. Now it is time to work on getting data into R from many different sources.
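We get data from many different sources. This section covers three of them: data built into R packages, delimited text files, and files from other statistical software.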
Many packages in R have built-in data. They use this data to demonstrate what their functions can do. It ends up being a great resource for us to use while we learn how to work with data.
If you would like to see what data you have in R right now, run the following command:
data()
In RStudio a window will pop up and display the data as well as what packages that data is in.
We can also list the data from a specific package. When you begin to have many packages installed in R you will want to make sure you call from specific packages.
data(package="tidyr")
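As a small illustration of working with built-in data, the sketch below lists the data sets in the datasets package and then loads one of them. The choice of mtcars is only an example; any listed data set would work the same way.

```r
# List the names of the data sets shipped with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load one data set by name (mtcars is just an illustrative choice)
data("mtcars", package = "datasets")

# Look at the first few rows
head(mtcars)
```

Once loaded, the data set behaves like any other data frame in the session.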
New packages are installed with the install.packages() function.
Many of R's built-in data sets live in the datasets package.
Much of the data we download or receive from researchers is in the form of delimited files. Whether the file is comma separated (csv) or tab delimited, there are multiple functions that can read these data into R.
We will stick to loading these data with the tidyverse packages, but be aware that these are not the only methods for doing this. We will use the tidyverse functions just to maintain consistency with everything else we do.
readr in Tidyverse
The first package in tidyverse we will use is called readr. It is a collection of multiple functions:
read_csv(): comma separated (CSV) files
read_tsv(): tab separated files
read_delim(): general delimited files
read_fwf(): fixed width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files
The related readxl package reads in Excel files.
In order to show an example of this we will first create a simple dataset, using the base R read.table() function:
##   subject sex size
## 1       1   M    7
## 2       2   F   NA
## 3       3   F    9
## 4       4   M   11
This function is able to read the text inside the quotation marks as the rows and columns of a dataset. If you have data that is separated by spaces, this command is a great way to load it.
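A call along these lines produces the data frame printed above. This is a minimal sketch: the exact text passed to read.table() is reconstructed from the printed output and is therefore an assumption.

```r
# Build a small data frame from whitespace separated text.
# header = TRUE tells read.table() that the first non-blank line holds column names.
# NOTE: the text block is reconstructed from the printed output (an assumption).
data <- read.table(header = TRUE, text = "
subject sex size
1 M 7
2 F NA
3 F 9
4 M 11
")

# Print the result
data
```

The literal NA in the text is read as a missing value, which is why the size column stays numeric.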
Let’s say that we wish to load a csv file into R now. We will take the data that we already have loaded in and create a simple csv file.
We write the csv file as shown below:
# Write to a file, suppress row names
write.csv(data, "data1.csv", row.names=FALSE)
# Same, except that instead of "NA", output blank cells
write.csv(data, "data2.csv", row.names=FALSE, na="")
# Use tabs, suppress row names and column names
write.table(data, "data3.tab", sep="\t", row.names=FALSE, col.names=FALSE)
Each of these calls creates a different file, which we will now read into R. First, we can see what each of these files looks like below:
readLines("data1.csv")
## [1] "\"subject\",\"sex\",\"size\"" "1,\"M\",7"
## [3] "2,\"F\",NA" "3,\"F\",9"
## [5] "4,\"M\",11"
We can see that in the above file we have commas separating all of the data elements. We also have NA where the data was missing.
readLines("data2.csv")
## [1] "\"subject\",\"sex\",\"size\"" "1,\"M\",7"
## [3] "2,\"F\"," "3,\"F\",9"
## [5] "4,\"M\",11"
In this one we do not have any NA; instead R has written the missing value as an empty field.
readLines("data3.tab")
## [1] "1\t\"M\"\t7" "2\t\"F\"\tNA" "3\t\"F\"\t9" "4\t\"M\"\t11"
With the third data set we do not have any commas; instead each \t represents a tab character. There is also no header row, because we suppressed the column names.
We can read the csv files with the read.csv() function:
data1 <- read.csv("data1.csv")
data1
##   subject sex size
## 1       1   M    7
## 2       2   F   NA
## 3       3   F    9
## 4       4   M   11
data2 <- read.csv("data2.csv")
data2
##   subject sex size
## 1       1   M    7
## 2       2   F   NA
## 3       3   F    9
## 4       4   M   11
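Both files come back with NA in the same place, even though one stores the missing value as the text NA and the other as an empty field. The sketch below checks this directly. It recreates the example data and uses temporary file paths so that it is self-contained; those names are illustrative assumptions, not part of the original example.

```r
# Recreate the small example data frame (values taken from the printed output)
data <- data.frame(subject = 1:4,
                   sex = c("M", "F", "F", "M"),
                   size = c(7, NA, 9, 11))

# Temporary paths stand in for data1.csv and data2.csv (assumed names)
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

# Missing values written as "NA" and as empty fields respectively
write.csv(data, f1, row.names = FALSE)
write.csv(data, f2, row.names = FALSE, na = "")

# Read both files back in
d1 <- read.csv(f1)
d2 <- read.csv(f2)

# Compare the two data frames
same <- isTRUE(all.equal(d1, d2))
same
```

read.csv() treats both the string NA and a blank field in a numeric column as missing, so the choice of na= when writing does not change what is read back here.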
With the tab delimited file we use the more general read.delim() function. Note that sep="\t" specifies which separator was used, and header=FALSE tells R that the file has no header row.
data3 <- read.delim("data3.tab", sep="\t", header=FALSE)
data3
##   V1 V2 V3
## 1  1  M  7
## 2  2  F NA
## 3  3  F  9
## 4  4  M 11
We could also use read.delim() to read a csv file by using sep=",".
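The examples above use the base R readers. For comparison, here is a minimal sketch of the readr counterparts, assuming the readr package is installed. The data frame and the temporary file paths are recreated here only to keep the snippet self-contained; they are illustrative assumptions.

```r
library(readr)

# Recreate the small example data frame (values taken from the printed output)
data <- data.frame(subject = 1:4,
                   sex = c("M", "F", "F", "M"),
                   size = c(7, NA, 9, 11))

# Write a csv file and a headerless tab delimited file to temporary paths
csv_path <- tempfile(fileext = ".csv")
tab_path <- tempfile(fileext = ".tab")
write.csv(data, csv_path, row.names = FALSE)
write.table(data, tab_path, sep = "\t", row.names = FALSE, col.names = FALSE)

# read_csv() is the readr counterpart of read.csv()
csv_data <- read_csv(csv_path)

# read_tsv() reads tab separated files; col_names = FALSE because there is no header row
tab_data <- read_tsv(tab_path, col_names = FALSE)

# read_delim() is the general form; the delimiter is given explicitly
delim_data <- read_delim(csv_path, delim = ",")

csv_data
```

Unlike read.csv(), the readr functions return tibbles and print a short summary of the column types they guessed.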
R can read data from more than just delimited files or internal datasets. R can also read files from other major statistical software, including SAS, Stata, and SPSS.
Haven Package
Haven is another R package that is part of the tidyverse. It is designed to bring in data from multiple sources. We can also use this package to write data back out in these same formats.
For SAS files we can read and write them in the following manner:
read_sas(data_file, catalog_file = NULL, encoding = NULL)
write_sas(data, path)
For Stata files, we can read and write them in the following manner:
read_dta(file, encoding = NULL)
read_stata(file, encoding = NULL)
write_dta(data, path, version = 14)
For SPSS files, we can read and write them in the following manner:
read_sav(file, user_na = FALSE)
read_por(file, user_na = FALSE)
write_sav(data, path)
read_spss(file, user_na = FALSE)
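To make the signatures above concrete, here is a minimal round trip through a Stata file, assuming the haven package is installed. The data frame and the temporary file path are illustrative assumptions; the same pattern should apply to write_sav() and read_sav() for SPSS files.

```r
library(haven)

# A small example data frame (values are illustrative)
data <- data.frame(subject = 1:4,
                   sex = c("M", "F", "F", "M"),
                   size = c(7, NA, 9, 11))

# Write the data to a temporary Stata file
dta_path <- tempfile(fileext = ".dta")
write_dta(data, dta_path)

# Read the Stata file back into R
stata_data <- read_dta(dta_path)
stata_data
```

The result is a tibble with the same columns, and the missing value in size comes back as NA.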