Getting started with the tidyverse; loading your data
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How do I load the tidyverse?
How can I read and write tabular data in R?
Objectives
To know how to load the main tidyverse packages
To be able to read csv data into a tibble
To be aware of the different types of data.
To begin exploring tibbles
To be able to extract parts of a tibble
Installing Tidyverse
The tidyverse consists of several packages (we discuss the main ones below), which are all on CRAN. We can install the tidyverse like any other CRAN package, using:
install.packages("tidyverse")
The tidyverse package acts as a “wrapper”, which will install all of the packages that constitute the tidyverse. If space is at a premium, individual packages from the tidyverse can be installed. In practice, it’s usually easier to install the whole thing.
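If you do only want part of the tidyverse, a minimal sketch of installing individual packages might look like the following. The choice of readr and dplyr is purely illustrative, and the requireNamespace() check is our own addition, so that packages which are already present are not reinstalled.

```r
# Illustrative choice of packages; substitute the ones you need
wanted <- c("readr", "dplyr")

# Work out which of these are not yet installed
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
to_install <- wanted[!installed]

# Install whatever is missing from CRAN (does nothing if all are present)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```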
Loading the tidyverse
We can load the tidyverse using:
library(tidyverse)
── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0 ✔ purrr 0.2.5
✔ tibble 1.4.2 ✔ dplyr 0.7.6
✔ tidyr 0.8.1 ✔ stringr 1.3.1
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
This loads the most commonly used packages in the tidyverse:
- readr: for reading data
- ggplot2: for plotting
- tibble: for creating “tibbles”; these are the tidyverse’s take on data frames.
- dplyr: for manipulating tibbles (or data frames); creating new variables, calculating summary statistics etc.
- tidyr: for reshaping data (converting it from long to wide format, and vice versa)
- purrr: for functional programming.
- stringr: for manipulating strings
- forcats: FOR CATegorical data (factors); this makes it easier to reorder and rename the levels in factor variables.
You can load each package you need for your analysis separately (e.g. library(readr), etc.); most processes will involve almost all of them though, so I tend to load them all in one go.
There are other packages included in the tidyverse, but these have more specialised uses, and so need to be loaded separately. For example, the lubridate package makes it easier to work with dates and times. A full list of all the tidyverse packages can be found on the tidyverse website.
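As a short sketch of what this looks like in practice, the code below lists every tidyverse package from within R and then loads lubridate explicitly. The date used is an arbitrary example of our own.

```r
library(tidyverse)

# List every package that belongs to the tidyverse, including
# those that library(tidyverse) may not attach for you
tidyverse_packages()

# Attach lubridate explicitly (harmless if it is already attached)
library(lubridate)

# Parse a date written as year-month-day, then extract parts of it
d <- ymd("2018-07-01")
year(d)
month(d)
```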
Workshop overview
In this workshop we’re going to work with some data from the gapminder project. The data consists of the population, GDP per capita and average life expectancy of 142 countries, recorded at five-year intervals.
We’ll start by reading the data into R, using readr. This will give us a tibble, which is the tidyverse’s (improved) take on a data.frame; we’ll briefly look at how to access parts of a tibble. We’ll then start analysing the data, using dplyr. We’ll end the session using ggplot2 (which is the tidyverse’s plotting package) to explore the data graphically.
Making a new project
If you’re using RStudio, it’s a good idea to use its projects feature to organise your work. Create a new project (File, New Project) for the workshop. Within this, create directories data, results and src, for our input data, analysis results, and analysis scripts respectively.
Uncompress the course data to the data directory.
Make sure that your working directory is the project directory. You can see your current working directory above your console window in RStudio, or by typing getwd() at the console.
The easiest way to set the working directory is to navigate to the project directory in the files tab, and then select “more”, “set as working directory”. You can also use the setwd() command, if you prefer.
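If you prefer to set up the directories from the console rather than through the RStudio interface, something like the following sketch should work. It assumes that the working directory is already the project directory.

```r
# Show the current working directory; this should be the project directory
getwd()

# Create the three workshop directories if they do not already exist
for (d in c("data", "results", "src")) {
  if (!dir.exists(d)) {
    dir.create(d)
  }
}

# Check that all three directories are now present
dir.exists(c("data", "results", "src"))
```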
Reading data
In this workshop we’ll limit ourselves to using comma separated data. We’ll work with the gapminder data, which is included in the course data.
Although standard R has the ability to load csv files, using the read.csv() function, we’ll use the functions in the readr tidyverse package to load this data.
Let’s make a new script for this episode, by choosing the menu options File, New File, R Script. We should make our scripts self-contained, so we should include the library(tidyverse) command at the start of the script. We can read the data contained in the gapminder-FiveYearData.csv file to an object called gapminder with the following command:
library(tidyverse)
gapminder <- read_csv("./data/gapminder-FiveYearData.csv")
Parsed with column specification:
cols(
country = col_character(),
year = col_integer(),
pop = col_double(),
continent = col_character(),
lifeExp = col_double(),
gdpPercap = col_double()
)
We see that the read_csv() function reports a “column specification”. This shows the variable names that were read in, and the type of data that each column was interpreted as.
gapminder
# A tibble: 1,704 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779.
2 Afghanistan 1957 9240934 Asia 30.3 821.
3 Afghanistan 1962 10267083 Asia 32.0 853.
4 Afghanistan 1967 11537966 Asia 34.0 836.
5 Afghanistan 1972 13079460 Asia 36.1 740.
6 Afghanistan 1977 14880372 Asia 38.4 786.
7 Afghanistan 1982 12881816 Asia 39.9 978.
8 Afghanistan 1987 13867957 Asia 40.8 852.
9 Afghanistan 1992 16317921 Asia 41.7 649.
10 Afghanistan 1997 22227415 Asia 41.8 635.
# ... with 1,694 more rows
When we enter gapminder by itself on the command line, it will print the contents of gapminder; we see that it consists of a tibble. A tibble is a way of storing tabular data, which is part of the tidyverse. We see the variable names, and an (abbreviated) string indicating what type of data is stored in each variable.
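Printing is not the only way to get an overview of a tibble. The sketch below uses a hand-typed stand-in for the first three rows of gapminder (the name gap_small and the rounded values are ours, taken from the printed output above), so that it runs without the data file.

```r
library(tidyverse)

# A hand-typed stand-in for the first three rows of gapminder
gap_small <- tribble(
  ~country,      ~year, ~pop,     ~continent, ~lifeExp, ~gdpPercap,
  "Afghanistan", 1952L, 8425333,  "Asia",     28.8,     779,
  "Afghanistan", 1957L, 9240934,  "Asia",     30.3,     821,
  "Afghanistan", 1962L, 10267083, "Asia",     32.0,     853
)

dim(gap_small)      # number of rows, then number of columns
names(gap_small)    # the variable names
glimpse(gap_small)  # one line per variable, showing its type and first values
```

The same three functions can be applied to the full gapminder tibble once it has been read in.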
read_csv() vs read.csv()
read.csv() is included as part of base R, and performs a similar job to read_csv(). We will be using read_csv() in this course; it is part of the tidyverse, so works well with other parts of the tidyverse, is faster than read.csv() and handles strings in a way that is usually more useful than read.csv().
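To see the two functions side by side, here is a small sketch that reads the same file with each. The file is a tiny illustrative one that we write to a temporary location, not part of the course data.

```r
library(readr)

# Write a tiny illustrative csv file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year", "Afghanistan,1952", "Albania,1952"), tmp)

# Base R returns a data frame
base_df <- read.csv(tmp)

# readr returns a tibble; the column types are given explicitly here
tidy_tb <- read_csv(tmp, col_types = cols(country = col_character(),
                                          year = col_integer()))

class(base_df)
class(tidy_tb)
```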
Loading other types of data
Another type of file you might encounter is the tab-separated value file (.tsv); these can be read with the read_tsv() function in the readr package. To read files with other delimiters, use the read_delim() function. If files are fixed width format (i.e. the variable is defined by its position on the line), then use the read_fwf() function.
The tidyverse comes with several packages for loading data in other formats. These include:
- readxl for reading data from Excel spreadsheets
- haven for reading SAS, SPSS and Stata data files
- xml2 for reading xml data
These aren’t loaded by default (when we use library("tidyverse")), so they will need to be loaded separately, e.g. library("readxl"), etc. There are also tidyverse packages for getting data via web APIs, or by “scraping” websites.
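As an illustration of the delimited-file readers, the sketch below writes the same two values to a tab-separated file and to a semicolon-separated file, then reads each back. Both files are made up for this example and live in temporary locations.

```r
library(readr)

# The column types we expect in both files
spec <- cols(country = col_character(), year = col_integer())

# A tab-separated file, read with read_tsv()
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("country\tyear", "Afghanistan\t1952"), tsv_file)
from_tsv <- read_tsv(tsv_file, col_types = spec)

# A semicolon-separated file, read with read_delim()
semi_file <- tempfile(fileext = ".txt")
writeLines(c("country;year", "Afghanistan;1952"), semi_file)
from_delim <- read_delim(semi_file, delim = ";", col_types = spec)

# Both readers recover the same values
identical(from_tsv$year, from_delim$year)
```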
Data types
Every piece of data in R is stored as either double, integer, complex, logical or character.
- integer variables can only store whole numbers
- double variables can store floating point numbers (i.e. with a decimal part)
- complex variables can store complex numbers (i.e. of the form 1+2i)
- logical variables can store TRUE or FALSE
- character variables can store strings of characters.
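We can check how a value is stored using base R's typeof() function. The values below are arbitrary examples of each type.

```r
typeof(1L)       # "integer"
typeof(1.5)      # "double"
typeof(1 + 2i)   # "complex"
typeof(TRUE)     # "logical"
typeof("Asia")   # "character"

# A whole number typed without the L suffix is stored as a double
typeof(1)        # "double"
```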
When we read the data into R using read_csv() it tries to work out what data type each variable is, which it does by looking at the data contained in the first 1000 rows of the data file. We can see from the displayed message that read_csv() has treated the country variable as a character variable, the gdpPercap and pop variables as floating point numbers and the year variable as an integer variable.
We can override these guesses using the col_types argument to read_csv(). Although read_csv() has guessed the correct column types for our file, it is a good idea to explicitly tell it what sort of data to expect in each column. This way our analysis will be robust to the wrong type of data being stored in a column (or to readr changing the algorithm it uses to guess data types).
read_csv() helpfully formats the column specification in the format that the col_types argument expects, so we can cut and paste this into our command (editing it if we need to override any of the guesses).
gapminder <- read_csv("./data/gapminder-FiveYearData.csv",
col_types = cols(
country = col_character(),
year = col_integer(),
pop = col_double(),
continent = col_character(),
lifeExp = col_double(),
gdpPercap = col_double()
))
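To illustrate why an explicit column specification makes an analysis more robust, here is a sketch using a deliberately broken file. The file and its contents are invented for this example; the exact wording of the warning depends on the version of readr.

```r
library(readr)

# An illustrative file in which one year has been recorded as text
bad_file <- tempfile(fileext = ".csv")
writeLines(c("country,year", "Afghanistan,1952", "Albania,unknown"), bad_file)

# Because we ask for an integer column, the bad value cannot be parsed:
# read_csv() warns us and stores NA instead of silently reading text
checked <- read_csv(bad_file,
                    col_types = cols(country = col_character(),
                                     year = col_integer()))

checked$year

# problems() lists the values that could not be parsed as requested
problems(checked)
```

Had we let read_csv() guess, the year column of this file would have been read as character without any warning.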
Importing data using RStudio
You may have noticed when we viewed the gapminder-FiveYearData.csv file in RStudio, before importing it, that another option appeared, labelled “Import Dataset”. This lets us import the data interactively. It can be more convenient to use this approach, rather than manually writing the required code. If you do this, you will find that the code RStudio has written is put into the console and run (and will appear in the history tab in RStudio). It’s fine to do this initially, but you should copy the generated code to your script, so that you can reproduce your analysis.
Exploring tibbles
We can “unpick” the contents of a tibble in several ways. We can return a vector containing the values of a variable using the dollar symbol, $:
gapminder$pop
We can also use the subsetting operator [] directly on tibbles. In contrast to a vector, a tibble is two dimensional. We pass two arguments to the [] operator; the first indicates the row(s) we require and the second indicates the columns. So to return rows 1 and 2, and columns 2 and 3 we can use:
gapminder[1:2,2:3]
# A tibble: 2 x 2
year pop
<int> <dbl>
1 1952 8425333
2 1957 9240934
If we leave an index blank, this acts as a wildcard and matches all of the rows or columns:
gapminder[1,]
# A tibble: 1 x 6
country year pop continent lifeExp gdpPercap
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779.
gapminder[,1]
# A tibble: 1,704 x 1
country
<chr>
1 Afghanistan
2 Afghanistan
3 Afghanistan
4 Afghanistan
5 Afghanistan
6 Afghanistan
7 Afghanistan
8 Afghanistan
9 Afghanistan
10 Afghanistan
# ... with 1,694 more rows
Subsetting a tibble returns another tibble; using $ to extract a variable returns a vector:
gapminder$country
gapminder[,1]
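We can confirm this difference with a quick check. The sketch below uses a small hand-made tibble (gap_small is our own name) so that it runs without the data file. It also shows that columns can be selected by name as well as by position.

```r
library(tibble)

gap_small <- tibble(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L),
  pop     = c(8425333, 9240934)
)

# $ returns a plain vector
is.vector(gap_small$country)

# [ , ] returns a tibble, even when only one column is selected
is_tibble(gap_small[, 1])

# Columns can be selected by name instead of by position
gap_small[1:2, c("year", "pop")]
```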
Tibbles vs data frames
Tibbles are used to represent tabular data in the tidyverse. In contrast, base R uses data frames to represent tabular data. One of the differences between these two types of object is what is returned when you extract a subset of rows/columns. In contrast to a tibble, taking a subset of a data frame doesn’t always return another data frame. For more details see the callout at the end of this episode, or take a look at the tibble vignette in R with vignette("tibble").
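A minimal sketch of this difference, using a made-up two-row data set stored both ways:

```r
library(tibble)

df <- data.frame(country = c("Afghanistan", "Albania"),
                 year = c(1952L, 1952L))
tb <- as_tibble(df)

# Selecting a single column of a data frame simplifies it to a vector
is.data.frame(df[, 1])

# The same subset of a tibble is still a tibble
is_tibble(tb[, 1])

# With a data frame, drop = FALSE is needed to keep the result a data frame
is.data.frame(df[, 1, drop = FALSE])
```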
Writing data in R
We can save a tibble (or data frame) to a csv file, using readr’s write_csv() function. For example, to save the gapminder data to mygapminder.csv:
write_csv(gapminder, "data/mygapminder.csv")
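To check that a file has been written as expected, we can look at its raw contents and read it back in. The sketch below does this for a small hand-made tibble written to a temporary file, so it does not touch the course data.

```r
library(readr)
library(tibble)

gap_small <- tibble(country = c("Afghanistan", "Albania"),
                    year = c(1952L, 1952L))

# Write the tibble to a temporary csv file
out_file <- tempfile(fileext = ".csv")
write_csv(gap_small, out_file)

# The file is plain text: a header line, then one line per row
readLines(out_file)

# Reading the file back in recovers the original values
read_back <- read_csv(out_file,
                      col_types = cols(country = col_character(),
                                       year = col_integer()))
identical(read_back$year, gap_small$year)
```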
Differences with base R
In this lesson we’ve taught you how to read files using the functionality in the readr package, which is part of the tidyverse.
This section highlights some of the differences between the tidyverse and its equivalent functionality in base R.
R’s standard data structure for tabular data is the data.frame. In contrast, read_csv() creates a tibble (also referred to, for historic reasons, as a tbl_df). This extends the functionality of a data.frame, and can, for the most part, be treated like a data.frame.
You may find that some older functions don’t work on tibbles. A tibble can be converted to a data frame using as.data.frame(mytibble). To convert a data frame to a tibble, use as_tibble(mydataframe).
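A short sketch of converting in both directions is given below. The names mydataframe and mytibble are placeholders, and we use as_tibble(), which is the current spelling of the conversion function in the tibble package.

```r
library(tibble)

mydataframe <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Data frame to tibble
mytibble <- as_tibble(mydataframe)
is_tibble(mytibble)

# Tibble back to a plain data frame
back <- as.data.frame(mytibble)
is_tibble(back)
class(back)
```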
Tibbles behave more consistently than data frames when subsetting with []; this will always return another tibble. This isn’t the case when working with data.frames. You can find out more about the differences between data.frames and tibbles by typing vignette("tibble").
read_csv() will always read variables containing text as character variables. In contrast, in versions of R before 4.0.0 the base R function read.csv() will, by default, convert any character variable to a factor. This is often not what you want, and can be overridden by passing the option stringsAsFactors = FALSE to read.csv(); from R 4.0.0 onwards this is the default.
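The sketch below makes the difference visible by setting stringsAsFactors explicitly, so that it behaves the same way regardless of the version of R. The file is a tiny illustrative one written to a temporary location.

```r
library(readr)

tmp <- tempfile(fileext = ".csv")
writeLines(c("country,continent", "Afghanistan,Asia", "Albania,Europe"), tmp)

# Text columns become factors when stringsAsFactors = TRUE
with_factors <- read.csv(tmp, stringsAsFactors = TRUE)
class(with_factors$continent)   # "factor"

# With stringsAsFactors = FALSE they stay as character
no_factors <- read.csv(tmp, stringsAsFactors = FALSE)
class(no_factors$continent)     # "character"

# read_csv() reads text as character
from_readr <- read_csv(tmp, col_types = cols(.default = col_character()))
class(from_readr$continent)     # "character"
```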
Key Points
Tibbles let us store tabular data in R. Tibbles are an extension of the base R data frame.
Use read_csv to read tabular data into a tibble in R.
Use write_csv to write tabular data to a comma separated value file.
Use factors to represent categorical data in R. You should specify the levels of your factors.