Preparing R code for unattended parallel execution

Martin Herrerias Azcue (Research IT)

2024-04-11

Intro / Setup

These slides are available at: https://uomresearchit.github.io/RRCSF/

  • You’re expected to have R installed on your local machine, and an editor you feel comfortable with (doesn’t have to be Rstudio).

  • If not, you can either use RStudio Cloud, or follow-along from (i)CSF.

  • Dowload the materials for challenges here, or by cloning the course repository: https://github.com/UoMResearchIT/RRCSF.

Overview

Make sure your code can run:

Somewhere else

  • Use (project/package) relative file paths
  • Reproduce your environment (renv.lock or DESCRIPTION)

Unattended (CSF)

  • From the terminal, using R [CMD BATCH] or Rscript
  • Save results to disk (write_csv, save)
  • Save plots to disk (graphics devices and ggsave)

In parallel

  • Use all the cores in your laptop / reserved for your job

Use relative paths

  • Make sure to use -cwd when submitting a job

  • Make (double) sure the script is running where it’s supposed to

# Optional
library(here)
i_am("R/my_script.R")
setwd(here())
  • Make all paths relative to the project / package base dir.
load("data/my_data.RData")
source("R/my_functions.R")
  • If you’re saving results to subfolders, make sure they exist
if (!dir.exists("results")) dir.create("results")
write_csv(my_table, file = "results/my_table.csv")

Save all results (and plots) to disk

Using the --save flag saves the end state of R to .RData.
A more robust practice is to do so explicitly (and job-specific):

job_id = Sys.getenv('JOB_ID', NA)
save.image(file = paste0('my_results_', job_id, '.RData'))

Or save specific variables as tables, or binary data:

write_csv(iris, "results/iris.csv")
save(iris, file = "results/iris.RData")

Graphics Devices

By default, R/Rscript will change the Graphics Device when running from the shell, and you might notice your plots appear together on a file Rplots.pdf.

You can tweak the file-names / format / appearance using pdf, svg, png, … devices:

pdf(file = "my_plot.pdf", width = 4, height = 3)
# plot something
dev.off()

For ggplot2 plots, there’s also:

ggsave("plots/my_last_plot.pdf", 
       width = 20, height = 15, units = "cm")

🧩 Challenge 1. ‘Prepare code to run on CSF’

  • Copy the contents of challenges/01_bad to a new directory
  • Remove all absolute path references.
  • Make all paths relative to the new project directory
  • Save each plot individually to a .png file
  • At the end of the script, save the KM variable to results/KM.RData.
challenges/01_bad/R/main_script.R
# Plots kmeans clustering of iris data

library(dplyr)
library(cluster)
library(ggplot2)

setwd("/home/martin/R/RonCSF/bad_example/R")
source("../data-raw/prepare_data.R")

load("../data/iris.RData")

show(
  iris |>
    ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
    geom_point(size = 4) + theme_minimal()
)

set.seed(42)
KM <- iris |> select(-Species) |> kmeans(centers = 3, nstart = 20)
clusplot(iris, KM$cluster, color = TRUE, shade = TRUE)
challenges/01_bad/data-raw/prepare_data.R
# Writes the example `iris` data set to data/iris.RData

setwd("/home/martin/R/RonCSF/bad_example/data")

save(iris, file = "iris.RData")

Challenge 1. (Solution)

challenges/02_portable/R/main_script.R
# Plots kmeans clustering of iris data

library(dplyr)
library(cluster)
library(ggplot2)

library(here)
i_am("R/main_script.R")
setwd(here())

source("data-raw/prepare_data.R")
load("data/iris.RData")

if (!dir.exists("results")) dir.create("results")

iris |>
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 4) + theme_minimal()
ggsave("results/iris_plot.png", width = 600, height = 480, units = "px", dpi = 92)

set.seed(42)
KM <- iris |> select(-Species) |> kmeans(centers = 3, nstart = 20)

png(file = "results/KM_plot.png", width = 600, height = 480)
clusplot(iris, KM$cluster, color = TRUE, shade = TRUE)
dev.off()

save(file = "results/KM.RData", KM)
challenges/02_portable/data-raw/prepare_data.R
# Writes the example `iris` data set to data/iris.RData

if (!dir.exists("data")) dir.create("data")

save(iris, file = "data/iris.RData")

Reproduce your environment – R version

  1. Make sure your R.version matches one of the R versions installed on CSF
  • Use module search /R/ to see available options
  • Use e.g. module load apps/gcc/R/4.3.2 to select one

You can install multiple R versions on your local machine:

  • Rstudio: Tools > Global Options > General (Basic) (R sessions) > R version
  • Modify your PATH, or take a look at rig

Most code will run just fine with minor version differences.

renv will record, but not change, the R version.

Reproduce your environment – Packages & Settings

  1. Make sure your packages match those installed on CSF.

Use renv / packrat to:

  • Auto-detect dependencies (renv::dependencies)
  • Install missing packages (renv::install)
  • Document dependencies (renv::snapshot)
  • Reproduce your environment (renv::restore)
  • Keep different projects isolated
  1. Watch out for custom settings (.Rprofile, .Renviron, .R/Makevars) in your HOME directory. Try usethis::edit_r_environ() and friends.

Auto-detect, install, and log dependencies

install.packages("renv") if required.

Then renv::init() to:

  • Create renv folder
  • Add source("renv/activate.R") to .Rprofile
  • Try* to resolve your dependencies
  • Link any packages you already have to the project library
  • Try* to install missing packages
  • Write a renv.lock

(*) If something fails, resolve with renv::install(...) or install::packages(...)

When renv::status() is happy, renv::snapshot()

Full docs: https://rstudio.github.io/renv/articles/renv.html

🧩 Challenge 2. Install missing libraries with renv

  • Start from your solution to Challenge 1, or from challenges/02_portable.
  • Use renv to auto-detect dependencies, install missing packages, and generate a renv.lock file.
  • Test the code on your local machine, at this point it should run

Local snapshot, copy to remote

On a local R console:

renv::status()   # should reply `No issues found...`
renv::snapshot()

Copy files to remote.

You typically don’t want to include (system-specific) .Rprofile, .Renviron and the complete renv folder. e.g.:

rsync -avz --exclude ".*" --exclude "renv" \
  ./my_project <USER>@csf3.itservices.manchester.ac.uk:~/my_project

If you are cloning from a git repo, you might want to track these files (and renv.settings.json) in a system-specific branch, e.g.:

git clone --depth=1 -b CSF3 git@github.com:my_repo.git

… and remote restore

On a remote R console (interactive compute node session):

local$ ssh <USER>@csf3.itservices.manchester.ac.uk
login$ qrsh -l short
@node$ module load apps/gcc/R/4.3.2
@node$ R
> install.packages("renv")
> setwd("~/my_project")
> renv::restore()

If something goes wrong:

  • troubleshoot with renv::install and renv::status;
  • restart the R session
  • do a last renv::status

Explicit dependencies

Option 1. Build your library manually, from scratch:

renv::init(bare = TRUE)
renv::install('package@0.0.1')
...

Option 2. Manually edit a DESCRIPTION file:

See https://r-pkgs.org/description.html

DESCRIPTION
...
Imports:
    dplyr,
    tidyr (>= 1.1.0)

Then use:

renv::settings$snapshot.type="explicit"

config and settings

On CSF, disable sandbox to avoid warnings:

.Renviron
RENV_CONFIG_SANDBOX_ENABLED="false"

Bioconductor packages require special setting:

renv::settings$bioconductor.version("3.14")

Ignore packages that are only relevant for development, or for a particular platform:

renv::settings$ignored.packages(c('devtools','Rmpi', ...))

For full list of options:

🧩 Challenge 3. Copy the code to CSF, and restore your environment

  • Start with your results from Challenge 2
  • Make sure your local renv.lock is up to date with renv::status
  • Copy your code (excluding the renv directory) to CSF
  • Start an interactive session with qrsh -l short
  • Load the latest R module
  • Install renv (running R from your home folder)
  • Move to your project folder, and try renv::restore
  • Test using source("R/main_script.R") from the R console

HINT: to make things simpler, try this exercise copying your code to another (local) folder first

Running R from the shell

Rscript [options] example.R [...] &> example.log
R [options] < example.R [--args ...] &> example.log
R CMD BATCH [options] ['--args ...'] example.R [example.Rout]

Recommended [options]: --no-restore, --no-save

They’re roughly equivalent, but have different defaults, (type R -h to see all options).

  • Rscript x is similar to R --quiet --no-echo -f x
  • R CMD BATCH x is similar to R -f x &> example.Rout.
    It also prints a handy proc.time() at the end of the script.

R / Rscript have to be in the system’s PATH.
On CSF you achieve that using: module load apps/gcc/R/...

NOTE: There’s also littler. See “Why (or when) is Rscript (or littler) better than R CMD BATCH?” Stackoverflow

🧩 Challenge 4. Submit the script as a job

On your local machine:

  • Run R/main_script.R using R CMD BATCH and Rscript
  • Play with different flags, e.g. --save and --no-save, --echo and --no-echo
  • Check your log files after each run. When you’re happy…

On CSF3: Place your command in a job script, e.g.:

csf3.job
#!/bin/bash --login
#$ -cwd
module load apps/gcc/R/4.2.3
R CMD BATCH --no-save --no-restore R/main_script.R

… and submit from a login node:

qsub -l short csf3.job

Alternatively, use -V -b y to submit the command directly, e.g.:

module load apps/gcc/R/4.2.3
qsub -V -b y -l short "Rscript R/main_script.R > main_script.log"

Parallelization

Use all the cores available for your job (but not more)

Parallel computing is a world on its own:
https://cran.r-project.org/web/views/HighPerformanceComputing.html

  • Implicit parallelism
  • Explicit parallelism
    • Distributed (memory) processing
    • “Embarrasingly parallel” processing

Implicit parallelism

  • Vector / Matrix operations should run faster on CSF “out of the box”.

If you do sessionInfo() you should see:

Matrix products: default
BLAS:   /opt/apps/apps/gcc/R/4.3.2/lib64/R/lib/libRblas.so
LAPACK: /opt/apps/apps/gcc/R/4.3.2/lib64/R/lib/libRlapack.so
  • pnmath and romp are experimental projects to use OpenMP directives in base R functions.
    “Similar functionality is expected to become integrated into R ‘eventually’.” CRAN/HPC

In-package parallelism

R is single-threaded by default, but some packages can use multiple threads or have multi-threaded alternatives. Read your package documentation.

Distributed processing

Series of taks that can run asynchronously, but have non-trivial dependencies to each other.

  • Rmpi low-level MPI directives
  • mirai Minimalist Async Evaluation Framework for R
  • future “simple and uniform” asynchronous evaluation (wraps mirai / callr)
  • targets Make-like pipeline tool for R

“Embarrasingly parallel” processing

Series of tasks that are independent from the rest, e.g. apply or purrr::map calls, or plain for loops:

loop.R
# Runs `iter` dummy tasks of `sleep` seconds, sequentially

sleep <- 0.1
iter <- 24

cat(paste("Running", iter, "iterations of", sleep, "seconds sequentially\n"))

slow_fcn <- function(x) {
  Sys.sleep(sleep)
  x^2
}

y <- c()
for (i in 1:iter) {
  y[i] <- slow_fcn(i)
}

foreach() %dopar% with registerDo...

library(parallel)
library(foreach)
library(doParallel) # there's also doMC, doSNOW, doMPI

workers = as.integer(Sys.getenv("NSLOTS", unset = detectCores()))
cl <- parallel::makeCluster(workers)
doParallel::registerDoParallel(cl)

y <- foreach(x = 1:iter) %dopar% {
  slow_fcn(x)
}

stopCluster(cl)

foreach() %dofuture%

library(foreach)
library(doFuture)

# correctly detects NSLOTS on SGE
plan(multisession, workers = availableCores())

y <- foreach(x = 1:iter) %dofuture% {
  slow_fcn(x)
}
  • Uniform syntax (independent of cluster type)
  • Better dependency tracking
  • Handling of seeded random numbers

Henrik Bengtsson, “%dofuture% - a Better foreach() Parallelization Operator than %dopar%”, June 26, 2023 in R. https://www.jottr.org/2023/06/26/dofuture/

🧩 Challenge 5. Parallelize a for loop

  • Copy the contents of challenges/05_serial to a new directory
  • Use foreach ... %dofuture% to replace the for loop.
  • You might have to install the required foreach and doFuture packages, e.g. using renv::init
  • Test your script (locally) with R CMD BATCH loop.R
  • Test your script on CSF, providing the -pe smp.pe N option, e.g.:
qsub -V -b y -l short -pe smp.pe 4 "R CMD BATCH loop.R"

Non-scalar outputs, random numbers

  • Use a custom .combine to merge the results evaluated at different cores.
  • Random numbers require special care in parallel processing
foreach_dataframe.R
library(doFuture)
library(foreach)
library(progressr)

slow_table <- function(i, n) {
  Sys.sleep(0.1)
  data.frame(itr = i, idx = 1:n, x = runif(n))
}

parallel_tables <- function(iter, n) {

  plan(multisession, workers = 2)

  progress <- progressor(along = 1:iter)
  foreach(i = 1:iter,
          .combine = rbind,
          .options.future = list(seed = TRUE)) %dofuture% {

    progress(sprintf("i=%g", i))
    slow_table(i, n)
  }
}
df <- parallel_tables(100, 3)

Progress tracking

Iterations will usually not be evaluated in order, so progress tracking becomes challenging. %dofuture offers support for progressr

.Renviron
R_PROGRESSR_ENABLE="true"
foreach_dataframe.R
library(doFuture)
library(foreach)
library(progressr)

slow_table <- function(i, n) {
  Sys.sleep(0.1)
  data.frame(itr = i, idx = 1:n, x = runif(n))
}

parallel_tables <- function(iter, n) {

  plan(multisession, workers = 2)

  progress <- progressor(along = 1:iter)
  foreach(i = 1:iter,
          .combine = rbind,
          .options.future = list(seed = TRUE)) %dofuture% {

    progress(sprintf("i=%g", i))
    slow_table(i, n)
  }
}
df <- parallel_tables(100, 3)

Mapping alternatives

See: https://future.apply.futureverse.org/#role

Example with future.apply:

library(future.apply)
plan(multisession)

library(datasets)
library(stats)
y <- future_lapply(mtcars, FUN = mean, trim = 0.10)

🧩 Challenge 6. Self study

  • Try to modify the challenges/06_parallel/loop.R to use a custom .combine function, and to track progress using progressr.

  • Try to use a mapping function e.g. future_lapply to achieve the same result.

  • Look at parallel_tester.R and tester.job in the examples folder. They use [argparser(https://cran.r-project.org/web/packages/argparser/index.html) to make the script configurable from the command line. Try to run the script with different arguments.

  • Modify tester.job to submit a job array to run the same script with different arguments, in parallel.

Where to go next