# Optional
library(here)
i_am("R/my_script.R")
setwd(here())
2024-04-11
These slides are available at: https://uomresearchit.github.io/RRCSF/
You’re expected to have R installed on your local machine, and an editor you feel comfortable with (it doesn’t have to be RStudio).
If not, you can either use RStudio Cloud, or follow along from (i)CSF.
Download the materials for challenges here, or by cloning the course repository: https://github.com/UoMResearchIT/RRCSF.
Make sure your code can run:
- Somewhere else
  - Record dependencies (renv.lock or DESCRIPTION)
- Unattended (CSF)
  - Run with R [CMD BATCH] or Rscript
  - Save results (write_csv, save)
  - Save plots (ggsave)
- In parallel
Make sure to use -cwd when submitting a job.
Make (double) sure the script is running where it’s supposed to.
Using the --save flag saves the end state of R to .RData.
A more robust practice is to do so explicitly (and job-specific):
Or save specific variables as tables, or binary data:
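Both practices can be sketched in base R as follows. This is only an illustration: it assumes the scheduler exports a JOB_ID environment variable (as SGE does), and the file names and the toy model are made up.

```r
# Sketch: save results explicitly, with job-specific file names.
# Assumption: the scheduler exports JOB_ID (as SGE does); "local" is a
# made-up fallback for runs outside the scheduler.
job_id  <- Sys.getenv("JOB_ID", unset = "local")
out_dir <- file.path(tempdir(), "results")  # use "results" in a real project
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy computation, standing in for the real analysis
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)
coef_table <- as.data.frame(summary(fit)$coefficients)

# Whole workspace: the explicit counterpart of --save
workspace_file <- file.path(out_dir, sprintf("workspace_%s.RData", job_id))
save.image(file = workspace_file)

# A specific variable as a table ...
table_file <- file.path(out_dir, sprintf("coefficients_%s.csv", job_id))
write.csv(coef_table, table_file)

# ... or as binary data
fit_file <- file.path(out_dir, sprintf("fit_%s.rds", job_id))
saveRDS(fit, fit_file)
```

Objects saved with saveRDS are read back with readRDS, and .RData files with load.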
By default, R/Rscript will change the graphics device when running from the shell, and you might notice your plots appear together in a file, Rplots.pdf.
You can tweak the file names / format / appearance using the pdf, svg, png, … devices:
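A minimal sketch of writing a base-graphics plot to a named file with the pdf device. The file name is hypothetical, and a temporary directory is used only to keep the example self-contained.

```r
# Sketch: send a base-graphics plot to a named file instead of Rplots.pdf.
plot_file <- file.path(tempdir(), "sepal_scatter.pdf")  # hypothetical name

pdf(file = plot_file, width = 7, height = 5)  # size in inches
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$Species, pch = 19,
     xlab = "Sepal length", ylab = "Sepal width")
dev.off()  # close the device so the file is written completely
```

png() and svg() are used in the same way. Note that png() takes its width and height in pixels by default.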
For ggplot2 plots, there’s also ggsave.
- Copy challenges/01_bad to a new directory.
- Save the plots to a .png file.
- Save the KM variable to results/KM.RData.
challenges/01_bad/R/main_script.R
# Plots kmeans clustering of iris data
library(dplyr)
library(cluster)
library(ggplot2)
setwd("/home/martin/R/RonCSF/bad_example/R")
source("../data-raw/prepare_data.R")
load("../data/iris.RData")
show(
iris |>
ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 4) + theme_minimal()
)
set.seed(42)
KM <- iris |> select(-Species) |> kmeans(centers = 3, nstart = 20)
clusplot(iris, KM$cluster, color = TRUE, shade = TRUE)
challenges/02_portable/R/main_script.R
# Plots kmeans clustering of iris data
library(dplyr)
library(cluster)
library(ggplot2)
library(here)
i_am("R/main_script.R")
setwd(here())
source("data-raw/prepare_data.R")
load("data/iris.RData")
if (!dir.exists("results")) dir.create("results")
iris |>
ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 4) + theme_minimal()
ggsave("results/iris_plot.png", width = 600, height = 480, units = "px", dpi = 92)
set.seed(42)
KM <- iris |> select(-Species) |> kmeans(centers = 3, nstart = 20)
png(file = "results/KM_plot.png", width = 600, height = 480)
clusplot(iris, KM$cluster, color = TRUE, shade = TRUE)
dev.off()
save(file = "results/KM.RData", KM)
Check that R.version matches one of the R versions installed on CSF:
- module search /R/ to see available options
- module load apps/gcc/R/4.3.2 to select one
You can install multiple R versions on your local machine:
- RStudio: Tools > Global Options > General (Basic) (R sessions) > R version
- Otherwise, adjust your PATH, or take a look at rig
Most code will run just fine with minor version differences.
renv will record, but not change, the R version.
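One way to make the version check part of a script is sketched below. It assumes 4.3.2 is the version you intend to load on the cluster, and it compares only the major and minor components.

```r
# Sketch: compare the running R version with the one targeted on the cluster.
# Assumption: 4.3.2 is the version you plan to load; adjust as needed.
target  <- package_version("4.3.2")
current <- getRversion()

# Compare major.minor only, ignoring the patch level
same_minor <- identical(unlist(current)[1:2], unlist(target)[1:2])

if (!same_minor) {
  message(sprintf("Running R %s, but the target is R %s",
                  as.character(current), as.character(target)))
}
```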
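Use renv / packrat to:
- Find the packages your code depends on (renv::dependencies)
- Install them into a project library (renv::install)
- Record their versions in a lockfile (renv::snapshot)
- Reinstall them from the lockfile (renv::restore)
Watch out for user-level configuration files (.Rprofile, .Renviron, .R/Makevars) in your HOME directory. Try usethis::edit_r_environ() and friends.
Run install.packages("renv") if required.
Then renv::init() to:
- Create the renv folder
- Add source("renv/activate.R") to .Rprofile
- Create renv.lock (*)
(*) If something fails, resolve with renv::install(...) or install.packages(...)
When renv::status() is happy, renv::snapshot()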
Full docs: https://rstudio.github.io/renv/articles/renv.html
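The dependency detection step can be tried in isolation. The sketch below assumes renv is installed and skips the call otherwise. The project folder and script are made up for illustration.

```r
# Sketch: let renv detect the packages a script depends on.
# Assumption: renv is installed; the detection is skipped otherwise.
# demo_project and analysis.R are made-up names for illustration.
demo_project <- file.path(tempdir(), "demo_project")
dir.create(demo_project, showWarnings = FALSE)
writeLines(c("library(dplyr)", "p <- ggplot2::ggplot()"),
           file.path(demo_project, "analysis.R"))

detected <- NULL
if (requireNamespace("renv", quietly = TRUE)) {
  deps <- renv::dependencies(demo_project, quiet = TRUE)
  detected <- sort(unique(deps$Package))
  print(detected)
}

# In a real project, the interactive workflow is then roughly:
# renv::init()      # create renv/, .Rprofile and renv.lock
# renv::status()    # compare library, lockfile and code
# renv::snapshot()  # update renv.lock once status is clean
```

The detection is static: the packages do not need to be installed for their names to be found in the code.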
renv challenge:
- Go to challenges/02_portable.
- Use renv to auto-detect dependencies, install missing packages, and generate a renv.lock file.
- Run the renv commands from the R console.
You typically don’t want to include the (system-specific) .Rprofile, .Renviron and the complete renv folder when copying the project elsewhere, e.g.:
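One way to leave those files behind when copying a project is sketched below in base R. copy_project and its arguments are hypothetical names, not part of renv or of the course materials. A shell tool with an exclude list would work just as well.

```r
# Sketch: copy a project, skipping system-specific files and the renv folder.
# copy_project is a hypothetical helper written for this example.
copy_project <- function(src, dest,
                         exclude = c(".Rprofile", ".Renviron", "renv")) {
  files <- list.files(src, recursive = TRUE, all.files = TRUE)
  # First path component of each file, e.g. "renv" for "renv/activate.R"
  top  <- vapply(strsplit(files, "/", fixed = TRUE), `[`, character(1), 1)
  keep <- files[!top %in% exclude]
  for (f in keep) {
    target <- file.path(dest, f)
    dir.create(dirname(target), recursive = TRUE, showWarnings = FALSE)
    file.copy(file.path(src, f), target, overwrite = TRUE)
  }
  invisible(keep)
}

# Toy project to demonstrate the helper
src  <- file.path(tempdir(), "project_src")
dest <- file.path(tempdir(), "project_copy")
dir.create(file.path(src, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(src, "renv"), showWarnings = FALSE)
writeLines("x <- 1", file.path(src, "R", "main_script.R"))
writeLines("{}", file.path(src, "renv.lock"))
writeLines("# local settings", file.path(src, ".Rprofile"))
writeLines("# activation code", file.path(src, "renv", "activate.R"))

copied <- copy_project(src, dest)
print(copied)
```

Note that renv.lock is kept: it is the file renv::restore needs on the target system.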
If you are cloning from a git repo, you might want to track these files (and renv.settings.json) in a system-specific branch, e.g.:
If something goes wrong:
- Iterate with renv::install and renv::status;
- Restart the R session and run renv::status again.
Option 1. Build your library manually, from scratch:
Option 2. Manually edit a DESCRIPTION file:
See https://r-pkgs.org/description.html
DESCRIPTION
...
Imports:
dplyr,
tidyr (>= 1.1.0)
Then use:
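As a sketch of how such a file is consumed, the block below writes a minimal DESCRIPTION and reads the declared dependencies back with base R. The package name and version are invented. The commented renv calls are an assumption about the follow-up step: renv::install() without arguments installs the packages the project requires, which includes those declared in DESCRIPTION.

```r
# Sketch: write a minimal DESCRIPTION and read back the declared dependencies.
# The field values below are made up for illustration.
desc_file <- file.path(tempdir(), "DESCRIPTION")
writeLines(c(
  "Package: myproject",
  "Version: 0.0.1",
  "Imports:",
  "    dplyr,",
  "    tidyr (>= 1.1.0)"
), desc_file)

imports  <- read.dcf(desc_file, fields = "Imports")[[1]]
packages <- trimws(strsplit(imports, ",")[[1]])
print(packages)

# Assumed follow-up, run from the project root with renv installed:
# renv::install()   # install the project's declared dependencies
# renv::snapshot()  # record the installed versions in renv.lock
```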
On CSF, disable sandbox to avoid warnings:
Bioconductor packages require special setting:
Ignore packages that are only relevant for development, or for a particular platform:
For full list of options:
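The settings named above might look as follows. This is a sketch under the assumption that the option and setting names match your renv version. Only the options() call is executed here, because the renv calls need an initialised project.

```r
# Sketch of the settings mentioned above.
# Assumption: names as documented in ?renv::config and ?renv::settings.

# Disable the sandbox. To take effect for a project, this has to be set
# before renv is activated, e.g. at the top of the project .Rprofile.
options(renv.config.sandbox.enabled = FALSE)

# Bioconductor packages: initialise the project with Bioconductor support
# renv::init(bioconductor = TRUE)

# Ignore packages that are only relevant for development
# renv::settings$ignored.packages(c("devtools", "usethis"))

# Full list of options
# ?renv::config
# ?renv::settings
```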
- Check that renv.lock is up to date with renv::status
- Copy your project (without the renv directory) to CSF
- Start an interactive session with qrsh -l short
- Load an R module
- Install renv (running R from your home folder)
- Run renv::restore
- Run source("R/main_script.R") from the R console
HINT: to make things simpler, try this exercise copying your code to another (local) folder first
Recommended [options]: --no-restore, --no-save
R CMD BATCH and Rscript are roughly equivalent, but have different defaults (type R -h to see all options).
- Rscript x is similar to R --quiet --no-echo -f x
- R CMD BATCH x is similar to R -f x &> example.Rout. It also calls proc.time() at the end of the script.
R / Rscript have to be in the system’s PATH. On CSF you achieve that using: module load apps/gcc/R/...
NOTE: There’s also littler. See “Why (or when) is Rscript (or littler) better than R CMD BATCH?” (Stack Overflow)
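To see an unattended run without leaving R, the sketch below launches Rscript on a throw-away script using system2. The script name is made up. In practice you would type the equivalent command in a shell or a job script.

```r
# Sketch: run a throw-away script unattended with Rscript, from R itself.
# toy_script.R is a made-up file for illustration.
script <- file.path(tempdir(), "toy_script.R")
writeLines('cat("sum:", sum(1:10), "\\n")', script)

rscript <- file.path(R.home("bin"), "Rscript")  # the Rscript next to this R
output  <- system2(rscript,
                   args = c("--no-save", "--no-restore", shQuote(script)),
                   stdout = TRUE)
print(output)
```

Using R.home("bin") avoids relying on Rscript being in the PATH of the calling session.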
On your local machine:
- Run R/main_script.R using R CMD BATCH and Rscript
- Try the options --save and --no-save, --echo and --no-echo
On CSF3: Place your command in a job script, e.g.:
… and submit from a login node:
Alternatively, use -V -b y to submit the command directly, e.g.:
Use all the cores available for your job (but not more)
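A sketch of how a script might size its worker pool from the job allocation. It assumes the scheduler exports NSLOTS (as SGE does for parallel-environment jobs). job_cores is a hypothetical helper, not part of any package.

```r
# Sketch: size the worker pool from the scheduler's allocation.
# Assumption: the scheduler exports NSLOTS (as SGE does for -pe jobs);
# job_cores is a hypothetical helper written for this example.
job_cores <- function(default = 1L) {
  slots <- suppressWarnings(as.integer(Sys.getenv("NSLOTS", unset = NA)))
  if (is.na(slots) || slots < 1L) default else slots
}

n_workers <- job_cores()
message("Using ", n_workers, " worker(s)")
```

Falling back to a single worker keeps the script from grabbing every core on a shared machine when it runs outside a job. parallelly::availableCores() performs a similar lookup if you prefer a packaged solution.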
Parallel computing is a world on its own:
https://cran.r-project.org/web/views/HighPerformanceComputing.html
If you do sessionInfo() you should see:
Matrix products: default
BLAS: /opt/apps/apps/gcc/R/4.3.2/lib64/R/lib/libRblas.so
LAPACK: /opt/apps/apps/gcc/R/4.3.2/lib64/R/lib/libRlapack.so
pnmath and romp are experimental projects to use OpenMP directives in base R functions.
In-package parallelism
R is single-threaded by default, but some packages can use multiple threads or have multi-threaded alternatives. Read your package documentation.
Series of tasks that can run asynchronously, but have non-trivial dependencies on each other.
Series of tasks that are independent from the rest, e.g. apply or purrr::map calls, or plain for loops:
- foreach() %dopar% with registerDo...
- foreach() %dofuture%
Henrik Bengtsson, “%dofuture% - a Better foreach() Parallelization Operator than %dopar%”, June 26, 2023 in R. https://www.jottr.org/2023/06/26/dofuture/
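To illustrate what "independent tasks" means in practice, here is a minimal sketch that runs the same loop serially and on two local worker processes. It deliberately uses base R's parallel package rather than foreach with %dofuture%, so that it runs without extra packages. The slow_square function is made up.

```r
# Sketch: an independent-task loop, first serial, then on a small cluster.
# Swapped-in technique: base R's parallel package, not foreach/%dofuture%.
library(parallel)

slow_square <- function(i) {
  Sys.sleep(0.05)  # stand-in for real work
  i^2
}

serial_result <- lapply(1:8, slow_square)

cl <- makeCluster(2)  # two local worker processes
parallel_result <- parLapply(cl, 1:8, slow_square)
stopCluster(cl)

# Both approaches should return the same list
identical(serial_result, parallel_result)
```

Because no iteration depends on another, the order in which the workers finish does not affect the result.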
- Copy challenges/05_serial to a new directory.
- Use foreach ... %dofuture% to replace the for loop.
- Install the foreach and doFuture packages, e.g. using renv::init.
- Test your script with R CMD BATCH loop.R.
- On CSF, request N cores with the -pe smp.pe N option.
Use .combine to merge the results evaluated at different cores.
foreach_dataframe.R
library(doFuture)
library(foreach)
library(progressr)
slow_table <- function(i, n) {
Sys.sleep(0.1)
data.frame(itr = i, idx = 1:n, x = runif(n))
}
parallel_tables <- function(iter, n) {
plan(multisession, workers = 2)
progress <- progressor(along = 1:iter)
foreach(i = 1:iter,
.combine = rbind,
.options.future = list(seed = TRUE)) %dofuture% {
progress(sprintf("i=%g", i))
slow_table(i, n)
}
}
df <- parallel_tables(100, 3)
Iterations will usually not be evaluated in order, so progress tracking becomes challenging. %dofuture% offers support for progressr, as used in foreach_dataframe.R.
See: https://future.apply.futureverse.org/#role
Example with future.apply:
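A minimal sketch, assuming the future.apply package is installed. The block falls back to a plain lapply otherwise, so it runs either way. The slow_square function is made up for illustration.

```r
# Sketch: replace lapply with future_lapply.
# Assumption: future.apply (and its dependency, future) are installed;
# otherwise the block falls back to a serial lapply.
slow_square <- function(i) {
  Sys.sleep(0.1)  # stand-in for real work
  i^2
}

if (requireNamespace("future.apply", quietly = TRUE)) {
  future::plan(future::multisession, workers = 2)
  squares <- future.apply::future_lapply(1:4, slow_square)
  future::plan(future::sequential)  # release the workers
} else {
  squares <- lapply(1:4, slow_square)
}

print(unlist(squares))
```

The call has the same shape as lapply. The plan() call decides where the work runs.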
Try to modify challenges/06_parallel/loop.R to use a custom .combine function, and to track progress using progressr.
Try to use a mapping function, e.g. future_lapply, to achieve the same result.
Look at parallel_tester.R and tester.job in the examples folder. They use argparser (https://cran.r-project.org/web/packages/argparser/index.html) to make the script configurable from the command line. Try to run the script with different arguments.
Modify tester.job to submit a job array to run the same script with different arguments, in parallel.
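Inside a job array, each task needs a way to pick its own arguments. The sketch below assumes the scheduler exports SGE_TASK_ID for array jobs (as SGE does). task_params and the parameter names are hypothetical and unrelated to parallel_tester.R.

```r
# Sketch: pick one parameter combination per array task.
# Assumptions: the scheduler exports SGE_TASK_ID for array jobs (as SGE
# does); task_params and the parameter grid are invented for illustration.
params <- expand.grid(n = c(10, 100, 1000), workers = c(1, 2))

task_params <- function(params, id = Sys.getenv("SGE_TASK_ID", unset = "1")) {
  task_id <- suppressWarnings(as.integer(id))
  if (is.na(task_id)) task_id <- 1L  # e.g. "undefined" outside an array job
  stopifnot(task_id >= 1, task_id <= nrow(params))
  params[task_id, ]
}

task <- task_params(params)
message(sprintf("n = %g, workers = %g", task$n, task$workers))
```

With this layout, an array with as many tasks as rows in params covers every combination exactly once.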