Manipulating tibbles with dplyr

Overview

Teaching: 40 min
Exercises: 15 min
Questions
  • How can I manipulate tibbles without repeating myself?

Objectives
  • To be able to use the six main dplyr data manipulation ‘verbs’ with pipes.

In the previous episode we used the readr package to load tabular data into a tibble within R. The readr package is part of a family of packages known as the tidyverse. The tidyverse packages are designed to work well together; they provide a modern and streamlined approach to data-analysis, and deal with some of the idiosyncrasies of base R.

This loads the most commonly used packages in the tidyverse; we used readr in the previous episode. We will cover all of the other main packages, with the exception of purrr in this course. There are other libraries included but these are less widely used, and must be loaded manually if they are required; these aren’t covered in this course.

Let’s dive in and look at how we can use the tidyverse to analyse and, in a couple of episodes’ time, plot data from the gapminder project. At the start of the course, you should have copied the file gapminder-FiveYearData.csv to your data directory. Take a look at it using a text editor such as notepad. The first line contains variable names, and values are separated by commas. Each record starts on a new line.

As we did with the previous episode we use the read_csv() function to load data from a comma separated file. Let’s make a new script (using the file menu), and load the tidyverse: (in the previous episode we only loaded readr; since we’ll be using several packages in the tidyverse, we load them all).

library("tidyverse")
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
gapminder <- read_csv("./data/gapminder-FiveYearData.csv")

── Column specification ────────────────────────────────────────────────────────
cols(
  country = col_character(),
  year = col_double(),
  pop = col_double(),
  continent = col_character(),
  lifeExp = col_double(),
  gdpPercap = col_double()
)

New versions of readr

Remember that you may need to use the spec() function to show the full column specification if you are using a newer version of readr i.e. spec(gapminder)

As we discussed in the previous episode, variables in R can be character, integer, double, etc. A tibble (and R’s built in equivalent; the data-frame) require that all the values in a particular column have the same data type. The read_csv() function will attempt to infer the data type of each column, and prints the column types it has guessed to the screen. If the wrong column types have been generated, you can pass the col_types= option to read_csv().

For example, if we wanted to load the pop column as a character string, we would use:

gapminderPopChar <- read_csv("./data/gapminder-FiveYearData.csv", 
                             col_types = cols(
                               country = col_character(),
                               year = col_integer(),
                               pop = col_character(),
                               continent = col_character(),
                               lifeExp = col_double(),
                               gdpPercap = col_double()
) )

Setting column types

Try reading a file using the read_csv() defaults (i.e. guessing column types). If this fails you can cut and paste the guessed column specification, and modify this with the correct column types. It is good practice to do this anyway; it makes the data types of your columns explicit, and will help protect you if the format of your data changes.

Manipulating tibbles

Manipulation of tibbles means many things to many researchers. We often select only certain observations (rows) or variables (columns). We often group the data by a certain variable(s), or calculate summary statistics.

The dplyr package

The dplyr package is part of the tidyverse. It provides a number of very useful functions for manipulating tibbles (and their base-R cousin, the data.frame) in a way that will reduce repetition, reduce the probability of making errors, and probably even save you some typing.

We will cover:

  1. selecting variables with select()
  2. subsetting observations with filter()
  3. grouping observations with group_by()
  4. generating summary statistics using summarize()
  5. generating new variables using mutate()
  6. Sorting tibbles using arrange()
  7. chaining operations together using pipes %>%

Using select()

If, for example, we wanted to move forward with only a few of the variables in our tibble we use the select() function. This will keep only the variables you select.

year_country_gdp <- select(gapminder,year,country,gdpPercap)
print(year_country_gdp)
# A tibble: 1,704 × 3
    year country     gdpPercap
   <dbl> <chr>           <dbl>
 1  1952 Afghanistan      779.
 2  1957 Afghanistan      821.
 3  1962 Afghanistan      853.
 4  1967 Afghanistan      836.
 5  1972 Afghanistan      740.
 6  1977 Afghanistan      786.
 7  1982 Afghanistan      978.
 8  1987 Afghanistan      852.
 9  1992 Afghanistan      649.
10  1997 Afghanistan      635.
# … with 1,694 more rows

Select will select columns of data. What if we want to select rows that meet certain criteria?

Using filter()

The filter() function is used to select rows of data. For example, to select only countries in Europe:

gapminder_Europe <- filter(gapminder, continent=="Europe") 
print(gapminder_Europe)
# A tibble: 360 × 6
   country  year     pop continent lifeExp gdpPercap
   <chr>   <dbl>   <dbl> <chr>       <dbl>     <dbl>
 1 Albania  1952 1282697 Europe       55.2     1601.
 2 Albania  1957 1476505 Europe       59.3     1942.
 3 Albania  1962 1728137 Europe       64.8     2313.
 4 Albania  1967 1984060 Europe       66.2     2760.
 5 Albania  1972 2263554 Europe       67.7     3313.
 6 Albania  1977 2509048 Europe       68.9     3533.
 7 Albania  1982 2780097 Europe       70.4     3631.
 8 Albania  1987 3075321 Europe       72       3739.
 9 Albania  1992 3326498 Europe       71.6     2497.
10 Albania  1997 3428038 Europe       73.0     3193.
# … with 350 more rows

Only rows of the data where the condition (i.e. continent=="Europe") is TRUE are kept.

Using pipes and dplyr

We’ve now seen how to choose certain columns of data (using select()) and certain rows of data (using filter()). In an analysis we often want to do both of these things (and many other things, like calculating summary statistics, which we’ll come to shortly). How do we combine these?

There are several ways of doing this; the method we will learn about today is using pipes.

The pipe operator %>% lets us pipe the output of one command into the next. This allows us to build up a data-processing pipeline. This approach has several advantages:

Pipelines and the shell

If you’re familiar with the Unix shell, you may already have used pipes to pass the output from one command to the next. The concept is the same, except the shell uses the | character rather than R’s pipe operator %>%

Keyboard shortcuts and getting help

The pipe operator can be tedious to type. In Rstudio pressing Ctrl + Shift+M under Windows / Linux will insert the pipe operator. On the mac, use + Shift+M.

We can use tab completion to complete variable names when entering commands. This saves typing and reduces the risk of error.

RStudio includes a helpful “cheat sheet”, which summarises the main functionality and syntax of dplyr. This can be accessed via the help menu –> cheatsheets –> data transformation with dplyr.

Let’s rewrite the select command example using the pipe operator:

year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
print(year_country_gdp)
# A tibble: 1,704 × 3
    year country     gdpPercap
   <dbl> <chr>           <dbl>
 1  1952 Afghanistan      779.
 2  1957 Afghanistan      821.
 3  1962 Afghanistan      853.
 4  1967 Afghanistan      836.
 5  1972 Afghanistan      740.
 6  1977 Afghanistan      786.
 7  1982 Afghanistan      978.
 8  1987 Afghanistan      852.
 9  1992 Afghanistan      649.
10  1997 Afghanistan      635.
# … with 1,694 more rows

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder tibble and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since in gets that from the previous pipe.

What if we wanted to combine this with the filter example? I.e. we want to select year, country and GDP per capita, but only for countries in Europe? We can join these two operations using a pipe; feeding the output of one command directly into the next:

year_country_gdp_euro <- gapminder %>%
    filter(continent == "Europe") %>%
    select(year,country,gdpPercap)
print(year_country_gdp_euro)
# A tibble: 360 × 3
    year country gdpPercap
   <dbl> <chr>       <dbl>
 1  1952 Albania     1601.
 2  1957 Albania     1942.
 3  1962 Albania     2313.
 4  1967 Albania     2760.
 5  1972 Albania     3313.
 6  1977 Albania     3533.
 7  1982 Albania     3631.
 8  1987 Albania     3739.
 9  1992 Albania     2497.
10  1997 Albania     3193.
# … with 350 more rows

Note that the order of these operations matters; if we reversed the order of the select() and filter() functions, the continent variable wouldn’t exist in the data-set when we came to apply the filter.

What about if we wanted to match more than one item? To do this we use the %in% operator:

gapminder_scandinavia <- gapminder %>% 
  filter(country %in% c("Denmark",
                        "Norway",
                        "Sweden"))

Another way of thinking about pipes

It might be useful to think of the statement

 gapminder %>%
    filter(continent=="Europe") %>%
    select(year,country,gdpPercap)

as a sentence, which we can read as “take the gapminder data and then filter records where continent == Europe and then select the year, country and gdpPercap.

We can think of the filter() and select() functions as verbs in the sentence; they do things to the data flowing through the pipeline.

Splitting your commands over multiple lines

It’s generally a good idea to put one command per line when writing your analyses. This makes them easier to read. When doing this, it’s important that the %>% goes at the end of the line, as in the example above. If we put it at the beginning of a line, e.g.:

gapminder_benelux <- gapminder 
%>% filter(country %in% c("Belgium", "Netherlands", "France"))
Error: <text>:2:1: unexpected SPECIAL
1: gapminder_benelux <- gapminder 
2: %>%
   ^

the first line makes a valid R command. R will then treat the next line as a new command, which won’t work.

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a tibble that has the values of lifeExp, country and year, for the countries in Africa, but not for other Continents. How many rows does your tibble
have? (You can use the nrow() function to find out how many rows are in a tibble.)

Solution to Challenge 1

year_country_lifeExp_Africa <- gapminder %>%
                           filter(continent=="Africa") %>%
                           select(year,country,lifeExp)
nrow(year_country_lifeExp_Africa)
[1] 624

As with last time, first we pass the gapminder tibble to the filter() function, then we pass the filtered version of the gapminder tibble to the select() function. Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.

Sorting tibbles

The arrange() function will sort a tibble by one or more of the variables in it:

gapminder %>%
  filter(continent == "Europe", year == 2007) %>% 
  arrange(pop)
# A tibble: 30 × 6
   country                 year     pop continent lifeExp gdpPercap
   <chr>                  <dbl>   <dbl> <chr>       <dbl>     <dbl>
 1 Iceland                 2007  301931 Europe       81.8    36181.
 2 Montenegro              2007  684736 Europe       74.5     9254.
 3 Slovenia                2007 2009245 Europe       77.9    25768.
 4 Albania                 2007 3600523 Europe       76.4     5937.
 5 Ireland                 2007 4109086 Europe       78.9    40676.
 6 Croatia                 2007 4493312 Europe       75.7    14619.
 7 Bosnia and Herzegovina  2007 4552198 Europe       74.9     7446.
 8 Norway                  2007 4627926 Europe       80.2    49357.
 9 Finland                 2007 5238460 Europe       79.3    33207.
10 Slovak Republic         2007 5447502 Europe       74.7    18678.
# … with 20 more rows

We can use the desc() function to sort a variable in reverse order:

gapminder %>%
  filter(continent == "Europe", year == 2007) %>% 
  arrange(desc(pop))
# A tibble: 30 × 6
   country         year      pop continent lifeExp gdpPercap
   <chr>          <dbl>    <dbl> <chr>       <dbl>     <dbl>
 1 Germany         2007 82400996 Europe       79.4    32170.
 2 Turkey          2007 71158647 Europe       71.8     8458.
 3 France          2007 61083916 Europe       80.7    30470.
 4 United Kingdom  2007 60776238 Europe       79.4    33203.
 5 Italy           2007 58147733 Europe       80.5    28570.
 6 Spain           2007 40448191 Europe       80.9    28821.
 7 Poland          2007 38518241 Europe       75.6    15390.
 8 Romania         2007 22276056 Europe       72.5    10808.
 9 Netherlands     2007 16570613 Europe       79.8    36798.
10 Greece          2007 10706290 Europe       79.5    27538.
# … with 20 more rows

Generating new variables

The mutate() function lets us add new variables to our tibble. It will often be the case that these are variables we derive from existing variables in the data-frame.

As an example, the gapminder data contains the population of each country, and its GDP per capita. We can use this to calculate the total GDP of each country:

gapminder_totalgdp <- gapminder %>% 
  mutate(gdp = gdpPercap * pop)

We can also use functions within mutate to generate new variables. For example, to take the log of gdpPercap we could use:

gapminder %>% 
  mutate(logGdpPercap = log(gdpPercap))
# A tibble: 1,704 × 7
   country      year      pop continent lifeExp gdpPercap logGdpPercap
   <chr>       <dbl>    <dbl> <chr>       <dbl>     <dbl>        <dbl>
 1 Afghanistan  1952  8425333 Asia         28.8      779.         6.66
 2 Afghanistan  1957  9240934 Asia         30.3      821.         6.71
 3 Afghanistan  1962 10267083 Asia         32.0      853.         6.75
 4 Afghanistan  1967 11537966 Asia         34.0      836.         6.73
 5 Afghanistan  1972 13079460 Asia         36.1      740.         6.61
 6 Afghanistan  1977 14880372 Asia         38.4      786.         6.67
 7 Afghanistan  1982 12881816 Asia         39.9      978.         6.89
 8 Afghanistan  1987 13867957 Asia         40.8      852.         6.75
 9 Afghanistan  1992 16317921 Asia         41.7      649.         6.48
10 Afghanistan  1997 22227415 Asia         41.8      635.         6.45
# … with 1,694 more rows

The dplyr cheat sheet contains many useful functions which can be used with dplyr. This can be found in the help menu of RStudio. You will use one of these functions in the next challenge.

Challenge 2

Create a tibble containing each country in Europe, its life expectancy in 2007 and the rank of the country’s life expectancy. (note that ranking the countries will not sort the table; the row order will be unchanged. You can use the arrange() function to sort the table).

Hint: First filter() to get the rows you want, and then use mutate() to create a new variable with the rank in it. The cheat-sheet contains useful functions you can use when you make new variables (the cheat-sheets can be found in the help menu in RStudio).
There are several functions for ranking observations, which handle tied values differently. For this exercise it doesn’t matter which function you choose.

Can you reverse the ranking order so that the country with the longest life expectancy gets the lowest rank? Hint: This is similar to sorting in reverse order.

Solution to challenge 2

europeLifeExp <- gapminder %>% 
  filter(continent == "Europe", year == 2007) %>% 
  select(country, lifeExp) %>% 
  mutate(rank = min_rank(lifeExp))
print(europeLifeExp, n=100)
# A tibble: 30 × 3
   country                lifeExp  rank
   <chr>                    <dbl> <int>
 1 Albania                   76.4    11
 2 Austria                   79.8    23
 3 Belgium                   79.4    20
 4 Bosnia and Herzegovina    74.9     8
 5 Bulgaria                  73.0     3
 6 Croatia                   75.7    10
 7 Czech Republic            76.5    12
 8 Denmark                   78.3    15
 9 Finland                   79.3    17
10 France                    80.7    26
11 Germany                   79.4    18
12 Greece                    79.5    21
13 Hungary                   73.3     4
14 Iceland                   81.8    30
15 Ireland                   78.9    16
16 Italy                     80.5    25
17 Montenegro                74.5     6
18 Netherlands               79.8    22
19 Norway                    80.2    24
20 Poland                    75.6     9
21 Portugal                  78.1    14
22 Romania                   72.5     2
23 Serbia                    74.0     5
24 Slovak Republic           74.7     7
25 Slovenia                  77.9    13
26 Spain                     80.9    28
27 Sweden                    80.9    27
28 Switzerland               81.7    29
29 Turkey                    71.8     1
30 United Kingdom            79.4    19

To reverse the order of the ranking, use the desc function, i.e. mutate(rank = min_rank(desc(lifeExp))).

There are several functions for calculating ranks; you may have used, e.g. dense_rank(). The functions handle ties differently. The help file for dplyr’s ranking functions explains the differences, and can be accessed with ?ranking.

Calculating summary statistics

We often wish to calculate a summary statistic (the mean, standard deviation, etc.) for a variable. We frequently want to calculate a separate summary statistic for several groups of data (e.g. the experiment and control group). We can calculate a summary statistic for the whole data-set using the dplyr’s summarise() function:

gapminder %>% 
  filter(year == 2007) %>% 
  summarise(meanlife = mean(lifeExp))
# A tibble: 1 × 1
  meanlife
     <dbl>
1     67.0

To generate summary statistics for each value of another variable we use the group_by() function:

gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  summarise(meanlife = mean(lifeExp))
# A tibble: 5 × 2
  continent meanlife
  <chr>        <dbl>
1 Africa        54.8
2 Americas      73.6
3 Asia          70.7
4 Europe        77.6
5 Oceania       80.7

Aside

In the examples above it would be preferable to calculate the weighted mean (to reflect the different populations of the countries). R can calculate this for us using weighted.mean(lifeExp, pop). For simplicity I’ve used the regular mean in the above examples.

Statistics revision

If you need to revise or learn about statistical concepts, the University Library’s “My Learning Essentials” team have produced a site Start to Finish:Statistics which covers important statistical concepts.

Challenge 3

For each combination of continent and year, calculate the average life expectancy.

Solution to Challenge 3

lifeExp_bycontinentyear <- gapminder %>% 
   group_by(continent, year) %>% 
  summarise(mean_lifeExp = mean(lifeExp))
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
print(lifeExp_bycontinentyear)
# A tibble: 60 × 3
# Groups:   continent [5]
  continent  year mean_lifeExp
  <chr>     <dbl>        <dbl>
1 Africa     1952         39.1
2 Africa     1957         41.3
3 Africa     1962         43.3
4 Africa     1967         45.3
5 Africa     1972         47.5
6 Africa     1977         49.6
7 Africa     1982         51.6
8 Africa     1987         53.3
9 Africa     1992         53.6
10 Africa     1997         53.6
# … with 50 more rows

count() and n()

A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.

If we need to use the number of observations in calculations, the n() function is useful. For instance, if we wanted to get the standard error of the life expectancy per continent:

gapminder %>%
    filter(year == 2002) %>%	
    group_by(continent) %>%
    summarize(se_pop = sd(lifeExp)/sqrt(n()))
# A tibble: 5 × 2
  continent se_pop
  <chr>      <dbl>
1 Africa     1.33 
2 Americas   0.960
3 Asia       1.46 
4 Europe     0.534
5 Oceania    0.630

Although we could use the group_by(), n() and summarize() functions to calculate the number of observations in each group, dplyr provides the count() function which automatically groups the data, calculates the totals and then un-groups it.

For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use:

gapminder %>%
    filter(year == 2002) %>%
    count(continent, sort = TRUE)
# A tibble: 5 × 2
  continent     n
  <chr>     <int>
1 Africa       52
2 Asia         33
3 Europe       30
4 Americas     25
5 Oceania       2

We can optionally sort the results in descending order by adding sort=TRUE:

Connect mutate with logical filtering: ifelse()

When creating new variables, we can hook this with a logical condition. A simple combination of mutate() and ifelse() facilitates filtering right where it is needed: in the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension of the tibble will not change) or for updating values depending on this given condition.

The ifelse() function takes three parameters. The first it the logical test. The second is the value to use if the test is TRUE for that observation, and the third is the value to use if the test is FALSE.

## keeping all data but "filtering" after a certain condition
# calculate GDP only for people with a life expectation above 50
gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
    mutate(gdp_billion = ifelse(lifeExp > 50, gdpPercap * pop / 10^9, NA)) 

Equivalent functions in base R

In this course we’ve taught the tidyverse. You are likely come across code written others in base R. You can find a guide to some base R functions and their tidyverse equivalents here, which may be useful when reading their code.

Other great resources

Key Points

  • Use the dplyr package to manipulate tibbles.

  • Use select() to choose variables from a tibbles.

  • Use filter() to choose data based on values.

  • Use group_by() and summarize() to work with subsets of data.

  • Use mutate() to create new variables.