Manipulating tibbles with dplyr
Overview
Teaching: 40 min
Exercises: 15 minQuestions
How can I manipulate tibbles without repeating myself?
Objectives
To be able to use the six main
dplyr
data manipulation ‘verbs’ with pipes.
In the previous episode we used the readr
package to load tabular data into a tibble within R. The readr
package is part of a family of packages known as the tidyverse. The tidyverse packages are designed to work well together; they provide a modern and streamlined approach to data-analysis, and deal with some of the idiosyncrasies of base R.
This loads the most commonly used packages in the tidyverse; we used readr
in the previous episode. We will cover all of the other main packages, with the exception of purrr
in this course. There are other libraries included but these are less widely used, and must be loaded manually if they are required; these aren’t covered in this course.
Let’s dive in and look at how we can use the tidyverse to analyse and, in a couple of episodes’ time, plot data from the gapminder project. At the start of the course, you should have copied the file gapminder-FiveYearData.csv
to your data
directory. Take a look at it using a text editor such as notepad. The first line contains variable names, and values are separated by commas. Each record starts on a new line.
As we did with the previous episode we use the read_csv()
function to load data from a comma separated file. Let’s make a new script (using the file menu), and load the tidyverse: (in the previous episode we only loaded readr
; since we’ll be using several packages in the tidyverse, we load them all).
library("tidyverse")
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4
✔ tibble 3.1.4 ✔ dplyr 1.0.7
✔ tidyr 1.1.3 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
gapminder <- read_csv("./data/gapminder-FiveYearData.csv")
── Column specification ────────────────────────────────────────────────────────
cols(
country = col_character(),
year = col_double(),
pop = col_double(),
continent = col_character(),
lifeExp = col_double(),
gdpPercap = col_double()
)
New versions of readr
Remember that you may need to use the
spec()
function to show the full column specification if you are using a newer version of readr i.e.spec(gapminder)
As we discussed in the previous episode, variables in R can be character, integer, double, etc. A tibble (and R’s built in equivalent; the data-frame) require that all the values in a particular column have the same data type. The read_csv()
function will attempt to infer the data type of each column, and prints the column types it has guessed to the screen. If the wrong column types have been generated, you can pass the col_types=
option to read_csv()
.
For example, if we wanted to load the pop
column as a character string, we would use:
gapminderPopChar <- read_csv("./data/gapminder-FiveYearData.csv",
col_types = cols(
country = col_character(),
year = col_integer(),
pop = col_character(),
continent = col_character(),
lifeExp = col_double(),
gdpPercap = col_double()
) )
Setting column types
Try reading a file using the
read_csv()
defaults (i.e. guessing column types). If this fails you can cut and paste the guessed column specification, and modify this with the correct column types. It is good practice to do this anyway; it makes the data types of your columns explicit, and will help protect you if the format of your data changes.
Manipulating tibbles
Manipulation of tibbles means many things to many researchers. We often select only certain observations (rows) or variables (columns). We often group the data by a certain variable(s), or calculate summary statistics.
The dplyr
package
The dplyr
package is part of the tidyverse. It provides a number of very useful functions for manipulating tibbles (and their base-R cousin, the data.frame
)
in a way that will reduce repetition, reduce the probability of making
errors, and probably even save you some typing.
We will cover:
- selecting variables with
select()
- subsetting observations with
filter()
- grouping observations with
group_by()
- generating summary statistics using
summarize()
- generating new variables using
mutate()
- Sorting tibbles using
arrange()
- chaining operations together using pipes
%>%
Using select()
If, for example, we wanted to move forward with only a few of the variables in
our tibble we use the select()
function. This will keep only the
variables you select.
year_country_gdp <- select(gapminder,year,country,gdpPercap)
print(year_country_gdp)
# A tibble: 1,704 × 3
year country gdpPercap
<dbl> <chr> <dbl>
1 1952 Afghanistan 779.
2 1957 Afghanistan 821.
3 1962 Afghanistan 853.
4 1967 Afghanistan 836.
5 1972 Afghanistan 740.
6 1977 Afghanistan 786.
7 1982 Afghanistan 978.
8 1987 Afghanistan 852.
9 1992 Afghanistan 649.
10 1997 Afghanistan 635.
# … with 1,694 more rows
Select will select columns of data. What if we want to select rows that meet certain criteria?
Using filter()
The filter()
function is used to select rows of data. For example, to select only countries in
Europe:
gapminder_Europe <- filter(gapminder, continent=="Europe")
print(gapminder_Europe)
# A tibble: 360 × 6
country year pop continent lifeExp gdpPercap
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Albania 1952 1282697 Europe 55.2 1601.
2 Albania 1957 1476505 Europe 59.3 1942.
3 Albania 1962 1728137 Europe 64.8 2313.
4 Albania 1967 1984060 Europe 66.2 2760.
5 Albania 1972 2263554 Europe 67.7 3313.
6 Albania 1977 2509048 Europe 68.9 3533.
7 Albania 1982 2780097 Europe 70.4 3631.
8 Albania 1987 3075321 Europe 72 3739.
9 Albania 1992 3326498 Europe 71.6 2497.
10 Albania 1997 3428038 Europe 73.0 3193.
# … with 350 more rows
Only rows of the data where the condition (i.e. continent=="Europe"
) is TRUE
are kept.
Using pipes and dplyr
We’ve now seen how to choose certain columns of data (using select()
) and certain rows of data (using filter()
). In an analysis we often want to do both of these things (and many other things, like calculating summary statistics, which we’ll come to shortly). How do we combine these?
There are several ways of doing this; the method we will learn about today is using pipes.
The pipe operator %>%
lets us pipe the output of one command into the next. This allows us to build up a data-processing pipeline. This approach has several advantages:
- We can build the pipeline piecemeal — building the pipeline step-by-step is easier than trying to perform a complex series of operations in one go
- It is easy to modify and reuse the pipeline
- We don’t have to make temporary tibbles as the analysis progresses.
Pipelines and the shell
If you’re familiar with the Unix shell, you may already have used pipes to pass the output from one command to the next. The concept is the same, except the shell uses the
|
character rather than R’s pipe operator%>%
Keyboard shortcuts and getting help
The pipe operator can be tedious to type. In Rstudio pressing Ctrl + Shift+M under Windows / Linux will insert the pipe operator. On the mac, use ⌘ + Shift+M.
We can use tab completion to complete variable names when entering commands. This saves typing and reduces the risk of error.
RStudio includes a helpful “cheat sheet”, which summarises the main functionality and syntax of
dplyr
. This can be accessed via the help menu –> cheatsheets –> data transformation with dplyr.
Let’s rewrite the select command example using the pipe operator:
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
print(year_country_gdp)
# A tibble: 1,704 × 3
year country gdpPercap
<dbl> <chr> <dbl>
1 1952 Afghanistan 779.
2 1957 Afghanistan 821.
3 1962 Afghanistan 853.
4 1967 Afghanistan 836.
5 1972 Afghanistan 740.
6 1977 Afghanistan 786.
7 1982 Afghanistan 978.
8 1987 Afghanistan 852.
9 1992 Afghanistan 649.
10 1997 Afghanistan 635.
# … with 1,694 more rows
To help you understand why we wrote that in that way, let’s walk through it step
by step. First we summon the gapminder tibble and pass it on, using the pipe
symbol %>%
, to the next step, which is the select()
function. In this case
we don’t specify which data object we use in the select()
function since in
gets that from the previous pipe.
What if we wanted to combine this with the filter example? I.e. we want to select year, country and GDP per capita, but only for countries in Europe? We can join these two operations using a pipe; feeding the output of one command directly into the next:
year_country_gdp_euro <- gapminder %>%
filter(continent == "Europe") %>%
select(year,country,gdpPercap)
print(year_country_gdp_euro)
# A tibble: 360 × 3
year country gdpPercap
<dbl> <chr> <dbl>
1 1952 Albania 1601.
2 1957 Albania 1942.
3 1962 Albania 2313.
4 1967 Albania 2760.
5 1972 Albania 3313.
6 1977 Albania 3533.
7 1982 Albania 3631.
8 1987 Albania 3739.
9 1992 Albania 2497.
10 1997 Albania 3193.
# … with 350 more rows
Note that the order of these operations matters; if we reversed the order of the select()
and filter()
functions, the continent
variable wouldn’t exist in the data-set when we came to apply the filter.
What about if we wanted to match more than one item? To do this we use the %in%
operator:
gapminder_scandinavia <- gapminder %>%
filter(country %in% c("Denmark",
"Norway",
"Sweden"))
Another way of thinking about pipes
It might be useful to think of the statement
gapminder %>% filter(continent=="Europe") %>% select(year,country,gdpPercap)
as a sentence, which we can read as “take the gapminder data and then
filter
records where continent == Europe and thenselect
the year, country and gdpPercap.We can think of the
filter()
andselect()
functions as verbs in the sentence; they do things to the data flowing through the pipeline.
Splitting your commands over multiple lines
It’s generally a good idea to put one command per line when writing your analyses. This makes them easier to read. When doing this, it’s important that the
%>%
goes at the end of the line, as in the example above. If we put it at the beginning of a line, e.g.:gapminder_benelux <- gapminder %>% filter(country %in% c("Belgium", "Netherlands", "France"))
Error: <text>:2:1: unexpected SPECIAL 1: gapminder_benelux <- gapminder 2: %>% ^
the first line makes a valid R command. R will then treat the next line as a new command, which won’t work.
Challenge 1
Write a single command (which can span multiple lines and includes pipes) that will produce a tibble that has the values of
lifeExp
,country
andyear
, for the countries in Africa, but not for other Continents. How many rows does your tibble
have? (You can use thenrow()
function to find out how many rows are in a tibble.)Solution to Challenge 1
year_country_lifeExp_Africa <- gapminder %>% filter(continent=="Africa") %>% select(year,country,lifeExp) nrow(year_country_lifeExp_Africa)
[1] 624
As with last time, first we pass the gapminder tibble to the
filter()
function, then we pass the filtered version of the gapminder tibble to theselect()
function. Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.
Sorting tibbles
The arrange()
function will sort a tibble by one or more of the variables in it:
gapminder %>%
filter(continent == "Europe", year == 2007) %>%
arrange(pop)
# A tibble: 30 × 6
country year pop continent lifeExp gdpPercap
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Iceland 2007 301931 Europe 81.8 36181.
2 Montenegro 2007 684736 Europe 74.5 9254.
3 Slovenia 2007 2009245 Europe 77.9 25768.
4 Albania 2007 3600523 Europe 76.4 5937.
5 Ireland 2007 4109086 Europe 78.9 40676.
6 Croatia 2007 4493312 Europe 75.7 14619.
7 Bosnia and Herzegovina 2007 4552198 Europe 74.9 7446.
8 Norway 2007 4627926 Europe 80.2 49357.
9 Finland 2007 5238460 Europe 79.3 33207.
10 Slovak Republic 2007 5447502 Europe 74.7 18678.
# … with 20 more rows
We can use the desc()
function to sort a variable in reverse order:
gapminder %>%
filter(continent == "Europe", year == 2007) %>%
arrange(desc(pop))
# A tibble: 30 × 6
country year pop continent lifeExp gdpPercap
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Germany 2007 82400996 Europe 79.4 32170.
2 Turkey 2007 71158647 Europe 71.8 8458.
3 France 2007 61083916 Europe 80.7 30470.
4 United Kingdom 2007 60776238 Europe 79.4 33203.
5 Italy 2007 58147733 Europe 80.5 28570.
6 Spain 2007 40448191 Europe 80.9 28821.
7 Poland 2007 38518241 Europe 75.6 15390.
8 Romania 2007 22276056 Europe 72.5 10808.
9 Netherlands 2007 16570613 Europe 79.8 36798.
10 Greece 2007 10706290 Europe 79.5 27538.
# … with 20 more rows
Generating new variables
The mutate()
function lets us add new variables to our tibble. It will often be the case that these are variables we derive from existing variables in the data-frame.
As an example, the gapminder data contains the population of each country, and its GDP per capita. We can use this to calculate the total GDP of each country:
gapminder_totalgdp <- gapminder %>%
mutate(gdp = gdpPercap * pop)
We can also use functions within mutate to generate new variables. For example, to take the log of gdpPercap
we could use:
gapminder %>%
mutate(logGdpPercap = log(gdpPercap))
# A tibble: 1,704 × 7
country year pop continent lifeExp gdpPercap logGdpPercap
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779. 6.66
2 Afghanistan 1957 9240934 Asia 30.3 821. 6.71
3 Afghanistan 1962 10267083 Asia 32.0 853. 6.75
4 Afghanistan 1967 11537966 Asia 34.0 836. 6.73
5 Afghanistan 1972 13079460 Asia 36.1 740. 6.61
6 Afghanistan 1977 14880372 Asia 38.4 786. 6.67
7 Afghanistan 1982 12881816 Asia 39.9 978. 6.89
8 Afghanistan 1987 13867957 Asia 40.8 852. 6.75
9 Afghanistan 1992 16317921 Asia 41.7 649. 6.48
10 Afghanistan 1997 22227415 Asia 41.8 635. 6.45
# … with 1,694 more rows
The dplyr cheat sheet contains many useful functions which can be used with dplyr. This can be found in the help menu of RStudio. You will use one of these functions in the next challenge.
Challenge 2
Create a tibble containing each country in Europe, its life expectancy in 2007 and the rank of the country’s life expectancy. (note that ranking the countries will not sort the table; the row order will be unchanged. You can use the
arrange()
function to sort the table).Hint: First
filter()
to get the rows you want, and then usemutate()
to create a new variable with the rank in it. The cheat-sheet contains useful functions you can use when you make new variables (the cheat-sheets can be found in the help menu in RStudio).
There are several functions for ranking observations, which handle tied values differently. For this exercise it doesn’t matter which function you choose.Can you reverse the ranking order so that the country with the longest life expectancy gets the lowest rank? Hint: This is similar to sorting in reverse order.
Solution to challenge 2
europeLifeExp <- gapminder %>% filter(continent == "Europe", year == 2007) %>% select(country, lifeExp) %>% mutate(rank = min_rank(lifeExp)) print(europeLifeExp, n=100)
# A tibble: 30 × 3 country lifeExp rank <chr> <dbl> <int> 1 Albania 76.4 11 2 Austria 79.8 23 3 Belgium 79.4 20 4 Bosnia and Herzegovina 74.9 8 5 Bulgaria 73.0 3 6 Croatia 75.7 10 7 Czech Republic 76.5 12 8 Denmark 78.3 15 9 Finland 79.3 17 10 France 80.7 26 11 Germany 79.4 18 12 Greece 79.5 21 13 Hungary 73.3 4 14 Iceland 81.8 30 15 Ireland 78.9 16 16 Italy 80.5 25 17 Montenegro 74.5 6 18 Netherlands 79.8 22 19 Norway 80.2 24 20 Poland 75.6 9 21 Portugal 78.1 14 22 Romania 72.5 2 23 Serbia 74.0 5 24 Slovak Republic 74.7 7 25 Slovenia 77.9 13 26 Spain 80.9 28 27 Sweden 80.9 27 28 Switzerland 81.7 29 29 Turkey 71.8 1 30 United Kingdom 79.4 19
To reverse the order of the ranking, use the
desc
function, i.e.mutate(rank = min_rank(desc(lifeExp)))
.There are several functions for calculating ranks; you may have used, e.g.
dense_rank()
. The functions handle ties differently. The help file fordplyr
’s ranking functions explains the differences, and can be accessed with?ranking
.
Calculating summary statistics
We often wish to calculate a summary statistic (the mean, standard deviation, etc.)
for a variable. We frequently want to calculate a separate summary statistic for several
groups of data (e.g. the experiment and control group). We can calculate a summary statistic
for the whole data-set using the dplyr’s summarise()
function:
gapminder %>%
filter(year == 2007) %>%
summarise(meanlife = mean(lifeExp))
# A tibble: 1 × 1
meanlife
<dbl>
1 67.0
To generate summary statistics for each value of another variable we use the
group_by()
function:
gapminder %>%
filter(year == 2007) %>%
group_by(continent) %>%
summarise(meanlife = mean(lifeExp))
# A tibble: 5 × 2
continent meanlife
<chr> <dbl>
1 Africa 54.8
2 Americas 73.6
3 Asia 70.7
4 Europe 77.6
5 Oceania 80.7
Aside
In the examples above it would be preferable to calculate the weighted mean (to reflect the different populations of the countries). R can calculate this for us using
weighted.mean(lifeExp, pop)
. For simplicity I’ve used the regular mean in the above examples.
Statistics revision
If you need to revise or learn about statistical concepts, the University Library’s “My Learning Essentials” team have produced a site Start to Finish:Statistics which covers important statistical concepts.
Challenge 3
For each combination of continent and year, calculate the average life expectancy.
Solution to Challenge 3
lifeExp_bycontinentyear <- gapminder %>% group_by(continent, year) %>% summarise(mean_lifeExp = mean(lifeExp))
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
print(lifeExp_bycontinentyear)
# A tibble: 60 × 3 # Groups: continent [5] continent year mean_lifeExp <chr> <dbl> <dbl> 1 Africa 1952 39.1 2 Africa 1957 41.3 3 Africa 1962 43.3 4 Africa 1967 45.3 5 Africa 1972 47.5 6 Africa 1977 49.6 7 Africa 1982 51.6 8 Africa 1987 53.3 9 Africa 1992 53.6 10 Africa 1997 53.6 # … with 50 more rows
count()
and n()
A very common operation is to count the number of observations for each
group. The dplyr
package comes with two related functions that help with this.
If we need to use the number of observations in calculations, the n()
function
is useful. For instance, if we wanted to get the standard error of the life
expectancy per continent:
gapminder %>%
filter(year == 2002) %>%
group_by(continent) %>%
summarize(se_pop = sd(lifeExp)/sqrt(n()))
# A tibble: 5 × 2
continent se_pop
<chr> <dbl>
1 Africa 1.33
2 Americas 0.960
3 Asia 1.46
4 Europe 0.534
5 Oceania 0.630
Although we could use the group_by()
, n()
and summarize()
functions to calculate the number of observations in each group, dplyr
provides the count()
function which automatically groups the data, calculates the totals and then un-groups it.
For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use:
gapminder %>%
filter(year == 2002) %>%
count(continent, sort = TRUE)
# A tibble: 5 × 2
continent n
<chr> <int>
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
We can optionally sort the results in descending order by adding sort=TRUE
:
Connect mutate with logical filtering: ifelse()
When creating new variables, we can hook this with a logical condition. A simple combination of
mutate()
and ifelse()
facilitates filtering right where it is needed: in the moment of creating something new.
This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension
of the tibble will not change) or for updating values depending on this given condition.
The ifelse()
function takes three parameters. The first it the logical test. The second is the value to use if the test is TRUE for that observation, and the third is the value to use if the test is FALSE.
## keeping all data but "filtering" after a certain condition
# calculate GDP only for people with a life expectation above 50
gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
mutate(gdp_billion = ifelse(lifeExp > 50, gdpPercap * pop / 10^9, NA))
Equivalent functions in base R
In this course we’ve taught the tidyverse. You are likely come across code written others in base R. You can find a guide to some base R functions and their tidyverse equivalents here, which may be useful when reading their code.
Other great resources
- Data Wrangling tutorial — an excellent four part tutorial covering selecting data, filtering data, summarising and transforming your data.
- R for Data Science
- Data Wrangling Cheat sheet
- Introduction to dplyr — this is the package vignette. It can be viewed within R using
vignette(package="dplyr", "dplyr")
- Data wrangling with R and RStudio
Key Points
Use the
dplyr
package to manipulate tibbles.Use
select()
to choose variables from a tibbles.Use
filter()
to choose data based on values.Use
group_by()
andsummarize()
to work with subsets of data.Use
mutate()
to create new variables.