
BASH Programming for Workflow Management

Dates, Scheduling, and Downloading Files

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How can we deal with date maths on the command line?

  • How can we schedule regular compute jobs?

  • How can we download files outside of a web-browser?

Objectives
  • Understand some of the tools available for common workflow tasks such as date handling, job scheduling, and file downloads.

Date maths

On UNIX systems the system time and date information can be found using the date command:

date
Wed 27 Jan 2021 11:52:17 GMT

This will return a formatted string (likely) containing information on the day, date, time, and timezone, as shown above.

This command can be used for more than just finding the current time, though. It can also be used to create formatted strings containing other dates and times, as well as calculating the same relative to either the current time or another given time. Here we will give a short introduction on how to use this command on linux systems.

Linux vs OSX

Although all UNIX systems will have a date program, the exact syntax for using it can vary from system to system, due to differing implementations of the Unix standards. The two systems you are most likely to encounter are Linux-based systems (such as Debian), and the BSD-based OSX system. These use quite different syntaxes, so care must be taken to use the correct syntax for the system you are on.

Below we will focus on the linux syntax, as this is most common for HPC environments. Where the OSX syntax is different this will be highlighted in a separate note.

The output string for the date command can be formatted by adding +[format string]. The options for these are defined in the man page for date, but some examples are %Y, for 4-digit year, %m for 2-digit month, and %d for 2-digit day of month. These can be used individually, or combined as you wish. E.g.:

date +%Y
2021
date +%Y-%m
2021-01
date +"%Y %d"
2021 27

In the last example we enclose the format string in quotation marks, to allow the inclusion of a space in the formatted output.

To display a date which is not now, you can use -d [date string]. The most useful (for our purposes) date string here is YYYYMMDD, e.g.:

date -d "20120423"
Mon 23 Apr 00:00:00 BST 2012

Displaying dates that are not ‘now’ on OSX

On OSX you have to explicitly set the format of the date string that you pass to date, using -f "[format string]" "[date string]". You will also need to ensure that date does not try to reset the system clock, by passing the -j flag too. The OSX command equivalent to the linux command above is:

date -j -f "%Y%m%d" "20120423"

To calculate an offset from a given date (either ‘now’, or a supplied date), you add the desired offset into your date string. This offset can be composed of a number of different elements, for example:

date -d "20210127 +3 day +1 month -18 year"
Sun  2 Mar 00:00:00 GMT 2003

The advantage of using date to do this calculation for you is that it can deal with transitioning across month and year boundaries easily - to do this by hand would require a lot of checks for the lengths of months, leap years, etc.

Offset calculation order

You should note that the offsets are applied in order from the largest unit downwards. This is most important where the offset will cross month boundaries, but could be important to remember in other scenarios too.

Calculating date offsets on OSX

On OSX the offset is given as one or more separate strings for each element of the offset that is required, each preceded by a -v flag. You should also note that the offset elements are applied in the order which you provide them, giving more explicit control over this process than you have with the linux date command.

The OSX command equivalent to the linux command above is:

date -j -v "-18y" -v "+1m" -v "+3d" -f "%Y%m%d" "20210127"

Scheduling tasks with CRON

Often workflows need to be repeated at regular intervals - checking for updates in source data, running maintenance tasks, or producing regular updates to services. These tasks can be automated using the cron job scheduler. Cron is available on most UNIX-based systems (although, on many HPC platforms, access to cron will be blocked for ordinary users). You can find out if you have it installed (and what tasks you have scheduled) using the crontab (cron table) command:

crontab -l

To configure cron it is easiest to create a configuration file, which can then be registered using the command:

crontab [config.txt]

Each line of this file will represent a job, and will look like:

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │                                   7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * <command to execute>

This notation can be a little confusing to use, but online tools such as crontab.guru are available for checking your configurations, to ensure that cron will run jobs at the times you expect.
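
For example, a minimal crontab file might contain a single job like the one below (the script and log file paths here are only illustrative assumptions):

# download data at 06:30 every morning, appending all output and errors to a log file
30 6 * * * bash /home/jon/scripts/download_data.sh >> /home/jon/Desktop/stderr.log 2>&1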

Scheduling jobs

Jon wants to set up two data download tasks, one to run at 2:45 am every Monday, and one which runs at 2:00 pm on the 1st and 15th of each month. What notation should he use to set these jobs? (crontab.guru can be used to work this out.)

Solution

45 2 * * 1 - 2:45 am on every Monday

00 14 1,15 * * - 2pm on the 1st and 15th of the month

OS-X and cron

OS-X does include cron, but it is difficult to use because of the security settings.

If you wish to use cron on OS-X you will need to enable Full Disk Access within the Security & Privacy settings menu for the program /usr/sbin/cron.

Downloading using wget

Often workflows involve input or source file downloads, and it is useful if these can be automated on the command line, rather than relying on using a web-browser for this.

Wget enables the retrieval of files using the widely used HTTP, HTTPS, FTP, and FTPS protocols (much as most web-browsers do). It has many features, such as being able to resume aborted downloads, using filename wild cards and recursive searches, and use of timestamps to determine if documents need redownloading. Here we will just cover the basic usage though, for more details and advanced usage check the GNU Wget documentation.

The basic usage is to simply give the path to a file that you wish to download. If it exists, and is accessible, then wget will download it and save it in your local directory. E.g.:

wget http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/meteo_ABED_2020-01-04_1800_data.txt

This can be used for directories as well as files, but when used on a directory what is returned is a file listing the directory contents (as you would see in a web-browser). E.g.:

wget http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms

To download all the contents of a directory you need to use the recursive -r flag. In most cases this should be combined with the -np flag, which tells wget that you don’t want it to crawl up to parent directories. You can also restrict the files you download based on their file-extension string using the -A flag.

For example, to download a single file with the recursive flags use:

wget -r -np http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/meteo_ABED_2020-01-04_1800_data.txt

This creates a directory structure following the http path structure, containing the file you requested.

To download all the text (.txt) files in that directory use:

wget -r -np -A txt http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/

This creates the same directory structure, but it will contain all the text files in that directory.

Restricting webcrawling to a single domain

Although it is not important for our interest in downloading data from internet archives, when downloading data from a website it can be useful to ensure that wget doesn’t follow links off that site. This can be done using the -D [domain] flag, where [domain] is the web domain that you wish to restrict wget to. In the examples given above we could add -D manchester.ac.uk if we wished to restrict searches to the University of Manchester domain, or even -D manunicast.seaes.manchester.ac.uk if we wished to restrict searches to the ManUniCast site only.
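
For example, the recursive download of text files shown above could be restricted to the University of Manchester domain with something like:

wget -r -np -D manchester.ac.uk -A txt http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/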

Using wildcards in file names

The HTTP protocol does not support wildcards, so using wildcards would not work in the examples given above. Wget can also use the FTP protocol though, which does support the use of wildcards, so if you are downloading data from an FTP server these are an option.

Scheduling jobs

Go to the data_download directory inside the data-adv-shell directory. Here there is a crontab file crontab_settings.txt and a script for downloading some data manunicast_download.sh.

Edit the crontab file to

  • set a time which is 3-4 minutes in the future
  • set the correct path to the manunicast_download.sh file

Then submit the crontab settings using:

crontab crontab_settings.txt

If it works you will find the downloaded file on the Desktop in a few minutes; if not, check the error message in stderr.log (also on the Desktop) to see what the problem was.

Once this has worked you can unset the crontab using:

crontab -r

Key Points

  • Using date for your date and time calculations will save you a lot of hassle

  • crontab can be used for scheduling regular tasks

  • wget can be used for scripting your data download process

  • Tool syntax is not always consistent across different unix flavours


Variables and Arrays

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How can I store information without writing it to a file?

Objectives
  • Learn how to create and reference a variable

  • Learn how to create and reference an indexed array

Variables

In the shell novice course you were, indirectly, introduced to bash variables, when working with loops: http://swcarpentry.github.io/shell-novice/05-loop/index.html

for thing in list of things
do
  echo $thing
done
echo 'variables persist:' $thing
list
of
things
variables persist: things

Variables can also be defined directly:

echo $thing
thing=this
echo $thing
things
this

Exploring Shell Variables

You can check if a variable exists, and what its contents are, using the declare -p command. This will list all variables and functions in the environment, so to make the output easier to work with it is recommended that you pipe it to another command, such as less or grep, to search for the variable you require.
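
For example, to check the thing variable created above you could use something like the following (the exact output format may differ slightly between bash versions):

declare -p | grep -w thing
declare -- thing="this"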

To delete a variable you can use unset.

thing=stuff
unset thing

Note that you do not reference the contents of the variable (using $thing) to do this, instead you reference the variable directly (using thing).

Referencing a non-existent variable

What happens if you reference a variable which doesn’t exist? e.g.

thing=stuff
unset thing
echo $thing

Solution

Referencing a non-existent variable simply returns an empty string; no error message is given. This can aid in the smooth running of a script, but can also create problems if you are used to the error messages that languages such as python return after referencing a non-existent variable. We will look later at how you can test for the existence of a variable more safely.

Arrays

BASH variables do not have to contain only a single value; they can contain a list of values instead.

To create an array object you can use the notation:

listthings=( these are my things )

Whitespace warning

Note that the whitespace is important — it is used to denote the breaks between individual values, as well as the start and end of the list. If you wish to have items which contain whitespace you will need to wrap these in quotation marks. e.g.

listthings=( these "are my" things )

The array object now contains an indexed list of values:

declare -p | grep -w listthings
declare -a listthings=([0]="these" [1]="are my" [2]="things")

To access the contents of the array you should use the indexes, e.g. [0] or [1]. You can also access all values using the index [@]. However, to make use of these indexes you must use curly braces to delimit the variable name.

Referencing Arrays

What are the outputs, and why, from the following commands?

  1. echo ${listthings[1]}
  2. echo $listthings[1]
  3. echo ${listthings[@]}
  4. echo $listthings
  5. echo ${#listthings[@]}

Solution

  1. are my (the value at index 1)
  2. these[1] (the first, index 0, value in the array, followed by the literal string [1])
  3. these are my things (all the values within the array)
  4. these (the first, index 0, value in the array)
  5. 3 (the length of the array)

Finally, we note that arrays can be referenced using a for loop, as at the start of this lesson:

for thing in ${listthings[@]}
do
  echo $thing
done

This is generally the best way to create a for loop; except for the most trivial examples it is wise to keep the array assignment separate from the loop itself.

Key Points

  • BASH variables can store a single piece of information

  • BASH arrays can store an indexed lists of information

  • {} denotes a code block, and is essential for referencing array elements

  • [@] denotes all of an array, while [X] denotes the value at position X

  • ${#VAR} returns the length of the string

  • ${#ARRAY[@]} returns the number of items in the array


Subshells and Functions

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How can you organise your code into functional blocks?

  • How can you import settings from other files into your scripts?

Objectives
  • Explain how BASH functions operate

  • Explain the difference between local and global variables

  • Explain how to import bash files into your environment

Shell and Environment Variables

In the shell novice course you were introduced to writing and using shell scripts, http://swcarpentry.github.io/shell-novice/06-script/index.html.

When a shell process is created, a corresponding environment is created to contain the shell process. The environment contains information needed for the shell to interact with the operating system (such as the location of your home directory, or which display is being used). Any further processes created by commands run in the shell process will inhabit this environment, and have access to this information too. This information is stored in variables, similar to the variables introduced in the previous lesson, but these are ‘environment’ or ‘global’ variables, whereas the variables used previously were ‘shell’ or ‘local’ variables. The difference between these is that ‘environment’ variables are readable by all processes in that environment, while ‘shell’ variables are unique to that process which created them.

Invoking a shell script leads to the launch of a new process, which processes the commands within that script. That process will not have access to the ‘shell’ variables of the shell from which it was launched.

To demonstrate this we can create a simple bash script, called ‘test.sh’ and containing this code:

echo $var1
echo $var2

Then run this code:

var1=23
export var2=12
bash test.sh

12

The first echo in test.sh prints only an empty line, because var1 was not exported and so is not visible to the new process. The export command turns a shell variable into an environment variable, enabling all processes within that environment to access it.

Exploring Environment Variables

You can check what other environment variables have been set using the declare -px command.

Environment variables are also accessible to other programs running in the same environment. For example, in python they can be accessed using os.environ:

python -c "import os ; print(os.environ['var2'])"

Subshells

Subshells are separate instances of the command process, run as a new process, and defined within your scripts using (). Because a subshell is run in a new process, these can be used for parallel processing (although we will not cover that here). Here we will introduce key concepts behind the use of subshells.

Unlike calling a shell script, subshells inherit the same variables as the original process, and thus can access any of these (even those which have not been exported).

x=5 ; (echo $x)
5

However these variables are copies, and modifications made to existing variables, or newly created variables, will not be available to the original process.

x=5 ; (echo $x; x=3; y=2; echo $x $y) ; echo $x $y
5
3 2
5

Code blocks

Code blocks can be denoted between curly brackets {} (as will be used later). However this does not launch a subprocess, and any changes to, or additions of, variables will be persistent.

x=5 ; { echo $x; x=3; y=2; echo $x $y; } ; echo $x $y
5
3 2
3 2

Note that, unlike subshells, the opening bracket must be followed by a space, and the final bracket needs to be on a new line, or separated by a ; from the last command used.

Because there is no persistence of variables outside of the subshell, to pass information out from this process you have to either write to file, or use one of the two following operations: command or process substitution.

Command Substitution

Even though the syntax for a subshell is very similar to that for declaring an array, we cannot directly save the output of a subshell in this manner; instead we have to use command substitution. This is a feature which enables the recording of the output from an executed command. The command is executed within a subshell, and the substitution is carried out using the following notation:

x=4; echo $x; x=$(y=5;echo $x$y); echo $x
4
45

Command Substitution using backquotes

Command substitution can also be carried out using backquotes ` `. This is not directly analogous to $(), as some special characters will need escaping within the backquotes, but they behave in very similar ways. ` ` is the older implementation, and so will be common in older script libraries, but it is now deprecated, and the $() notation is recommended for all new scripts.
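
For example, these two assignments are equivalent, but the second form is preferred for new scripts:

now=`date`
now=$(date)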

Example usage of command substitution

Jon wants to save the year, month, and day, from the date function, as variables, so that he can use them later for running download scripts. Can you write some code using command substitution to do this?

Solution

You will need three separate calls to the date function for this, one each for the day, month, and year. E.g.:

YEAR=$(date +%Y)
MONTH=$(date +%m)
DAY=$(date +%d)

Now that he has the code for obtaining today’s date, he needs to do the same for yesterday’s date. Can you copy and adapt your code above to do this?

Solution

Again you will need three separate calls to the date function for this, one each for the day, month, and year. In each of these you will need to subtract one day, e.g.:

PREV_YEAR=$(date -d "-1 day" +%Y)
PREV_MONTH=$(date -d "-1 day" +%m)
PREV_DAY=$(date -d "-1 day" +%d)

Using variables in an operational script

Now that Jon has code for calculating today and yesterday’s dates, he needs to add these to his download script. Please edit the manunicast_download.sh script, so that the web address passed to wget is for today’s data.

Solution

The web address is stored in the script as a simple string, so we can insert information from the date variables directly into this.

YEAR=$(date +%Y)
MONTH=$(date +%m)
DAY=$(date +%d)

PREV_YEAR=$(date -d "-1 day" +%Y)
PREV_MONTH=$(date -d "-1 day" +%m)
PREV_DAY=$(date -d "-1 day" +%d)

WEBROOT="http://manunicast.seaes.manchester.ac.uk/charts/manunicast/"

ADDRESS="${WEBROOT}${YEAR}${MONTH}${DAY}/d02/meteograms/meteo_ABED_${PREV_YEAR}-${PREV_MONTH}-${PREV_DAY}_1800_data.txt"
wget $ADDRESS

To make this code more readable, we have split out the fixed base of the address as WEBROOT, you do not need to do this in your own code.

Using {} brackets around the variable name is good practice, as it avoids errors where the text following the variable could otherwise be read as part of the variable name. It is also good practice to wrap the text string in " ", in case there are spaces in the string.
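
As a small illustration of why the braces matter (the variable names here are only for demonstration):

DAY=05
echo "$DAY_data"      # prints an empty line, because bash looks for a variable called DAY_data
echo "${DAY}_data"    # prints 05_data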

Process Substitution

Process substitution, using <(), is a feature which enables the usage of the output of an executed command within another command, similar to the piping of output using |.

As a (rather artificial) example, this enables us to compare the contents of two variables. Passing the values directly, or via a command substitution, leads to diff searching for files named after the variable values, e.g.:

diff $(echo 3) 5
diff: 3: No such file or directory
diff: 5: No such file or directory

Using process substitution ensures that the shell avoids this error:

diff <(echo 3) <(echo 5)
1c1
< 3
---
> 5

And these subshell methods can be nested as required:

cat <(echo year is $(date +%Y)) <(echo month is $(date +%m)) <(echo day is $(date +%d))
year is 2021
month is 02
day is 02

More information on, and examples of using, subshells are available here: https://www.tldp.org/LDP/abs/html/subshells.html

Functions

Bash functions are, at their most basic level, labelled code blocks which enable the repeated use of a set of commands. Their purpose is to make your code more readable and, in avoiding writing the same code time and again, more maintainable.

Functions can be declared by either

functionname () {
  commands
}

or

function functionname () {
  commands
}

Functions must be declared before they are used. As they are simply a code block within the shell process, they can read, modify, and create any variables within the shell.

var1='E'
examplechange () { var1='D'; }
echo $var1
examplechange
echo $var1
E
D

Note that the code within the function is not executed until it is called.

Bash functions are more basic than those of other programming languages, and do not easily lend themselves to modern functional programming practices. There are some tricks which can help though.

Arguments can be passed to functions, in a similar manner as for scripts and programs, as you learnt in the introduction to bash lessons:

readingargs () { echo $#; echo $1; echo $2; }
readingargs "arg1" "arg2"
2
arg1
arg2

To avoid unintentional usage of variables created within a function, these can be set as local, so that they are not available outside the function.

combineargs () { local var1=$1$2; echo $var1; }
var1="start"
var2="end"
combineargs $var1 $var2
echo $var1
startend
start

Note that using the local declaration ensures that you don’t overwrite any external variables which share the same name.

To return information from a function in a manner in which it can be assigned to an arbitrary variable, it is best to use echo to output it, and then capture the result.

funcresult="$(combineargs $var1 $var2)"

Multiple values can be returned in a similar manner, using the read command to split the output.

returnargs () { echo $1; echo $2; }
read arg1 arg2 <<<$(returnargs "start" "end")
echo $arg1
echo $arg2

Output Warning

What is the output from the following command? And why is it as it is?

read arg1 arg2 <<<$(returnargs "start here" "end")
echo $arg1
echo $arg2

Solution

The first word is stored in arg1, while the last two words are stored in arg2.

start
here end

Be aware that the outputs from a function are separated only by whitespace, so if any exists within an outputted variable then this will cause problems for the read command. In addition, if there are more items returned than expected, then any extra items will be appended to the last argument in the list.

Return Values

An arbitrary status can be returned from a function (using return $arg1), however this is best saved for error checking your processes, rather than for passing results out from the function.
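
As a minimal sketch of this (using the [[ ]] test covered in the next lesson, and the $? variable, which holds the exit status of the last command), a function could signal a problem to its caller like this:

checkargs () {
  if [[ $# -lt 2 ]]; then
    return 1    # signal an error: not enough arguments supplied
  fi
  return 0      # success
}
checkargs "only one argument"
echo $?
1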

Functionalising the Date commands

Jon wants to use the date commands that he has been given in later scripts, so he has created the following function template. Can you complete this, and replace the date code in the script with a call to the function?

determine_next_date () {
    # This function determines the next date, using a specified offset.
    # usage: determine_next_date [+/-] [#days] [current year] [current month] [current day]
}

Solution

determine_next_date () {
    # This function determines the next date, using a specified offset.
    # usage: determine_next_date [+/-] [#days] [current year] [current month] [current day]

    local next_year=$( date -d "$3$4$5 $1 $2 day" +%Y )
    local next_month=$( date -d "$3$4$5 $1 $2 day" +%m )
    local next_day=$( date -d "$3$4$5 $1 $2 day" +%d )

    echo $next_year $next_month $next_day
}
read year month day <<<$(determine_next_date - 1 $year $month $day)

Sourcing Bash Scripts

Rather than simply calling one bash script from another, with the limitations described above, code and resources from bash scripts can be imported into another script using the source functionality. This can take the form:

. script.sh

or:

source script.sh

All code within that script will be executed at the point it is sourced, so there are limitations in how this can be used. It is most useful for loading key variables, or for configuring the environment or loading functions (as is done in the .bash_profile and .bashrc files).
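
As a sketch of this pattern (the file names and contents here are illustrative assumptions), a settings file containing only variable definitions can be loaded by a main script:

# settings.sh
WEBROOT="http://manunicast.seaes.manchester.ac.uk/charts/manunicast/"
STATION="ABED"

# main_script.sh
source settings.sh
echo "downloading data for station ${STATION} from ${WEBROOT}"

This keeps site-specific settings in one place, so that several scripts can share them without repeating the values.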

Splitting off the date math function

In order that the determine_next_date function can be used in other scripts, please move it to a new script, called function_date_math.sh, source this script in your original file, and then make sure it still works as before.

Solution

Example scripts for linux and OSX are included in the date_math_function directory, called functions_date_math_linux.sh and functions_date_math_osx.sh.

Key Points

  • environment variables are accessible by all programs run from that shell

  • export turns a (private) shell variable into an environment variable

  • () creates a subshell

  • {} creates a code block within the current shell

  • $() allows a subshell to be used for command substitution, for saving the output as a variable

  • <() allows a subshell to be used for process substitution, for passing the output to another program

  • read var1 var2 <<<$() can be used to save more than one output from a command substitution

  • function NAME() {} creates a function code block

  • code inside a function is not executed on creation, and can be used repeatedly after creation

  • . script.sh and source script.sh enable the importing of code and variables from other scripts


BASH Logic and Maths

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • How can we have responsive workflows on the BASH shell?

Objectives
  • Explain Logical operators, and demonstrate IF structures

  • Detail Maths operators, and cover basic sequencing

Integer Maths

Bash variables are, in the general context, treated as strings. Integer maths can be carried out using ‘arithmetic expansion’, using a maths context to tell the interpreter when to do this. The recommended method for this is using $(( )), and this is used in the same manner as a command substitution:

i=3
j=$(($i + 1))
echo $j
4

Alternatively the (( )) command can be used to create the math context; this follows the same rules as $(( )), but gives an exit status (which will be useful later).

a=3
((a=$a+7))
echo $a
10

Within the math context the syntax and semantics of C’s integer arithmetic are used, enabling shorthands such as += or -= for incremental changes in a variable. For example, this gives the same result as the addition above:

a=3
((a+=7))
echo $a
10

let, expr and bc

Other programs are available for the execution of integer math in bash. For example:

j=$(expr $i + 1)
let j=i+1
j=$(echo "$i+1" | bc)

These follow the same integer math rules; however, because they do not use the math context that $(( )) and (( )) provide (which protects, for example, against whitespace errors), they can be more difficult to work with, and so won't be covered here.

The arithmetic operators that can be used within the math context are:

  +  :  Addition
  -  :  Subtraction
  *  :  Multiplication
  /  :  Division
  ** :  Exponent
  %  :  Modulus

Conditional expressions are also allowed, as covered below, as well as C’s bitwise operations and ternary operator, which won’t be covered here.

Integer / String Conversions

A lot of information strings containing digits, such as date strings, step numbers, etc, use leading zeros to ensure regularity of string length, as well as for sorting purposes. This habit, however, can cause problems when carrying out maths operations, e.g.:

x=05
y=08
echo $(($x+$y))
-bash: 05+08: value too great for base (error token is "08")

This occurs because bash interprets values starting with 0 as base 8 integers (and values starting with 0x as base 16 integers) instead of the base 10 integers which we generally use for maths.

To force an integer to be interpreted in base 10 we can prepend 10# to the value, e.g.:

x=05
y=08
echo $((10#$x+10#$y))
13

Note that this will not work for values which start with + or -, so it cannot be used to set the base of negative numbers. This is not, generally, an issue for date maths, but could be for other uses. In these situations it would be wiser to use regular expression substitutions in sed, as will be covered in a later lesson.

To create a string with leading zeros from an integer, use the printf (formatted print) program. The format string used is %X.Yi, where X is the width of the string (optional), and Y is the number of digits to use, e.g.:

day=5
printf "day number %.2i\n" $day
day number 05

If Control Structure

Many workflows are not linear, and all workflows will have exceptional cases in which operations should be halted (if, for example, a file is missing, or a calculation would fail). Bash has the if control structure, which can be used to control your workflow in these situations.

In it’s most basic form, this control structure will test for a condition, then execute a set of program statements if that condition is true. For example:

if [[ 8 > 7 ]];then
  echo "8 is greater than 7"
fi

and:

if [[ b > a ]];then
  echo "b is greater than a"
fi

These are lexicographic comparisons of the strings, using the alphabetical order. Other non-numeric conditions available are <, == and != (described in the table at the end of this section); note that <= and >= are not available for string comparisons:

if [[ b != a ]];then
  echo "b is not the same as a"
fi

For arithmetic comparisons inside [[ ]] brackets we instead would use the operators -gt, -lt, -le, -ge, -eq and -ne:

if [[ 8 -gt 7 ]];then
  echo "8 is greater than 7"
fi

For arithmetic expressions we can also use (( )), as described in the section above:

if (( 8 > 7 ));then
  echo "8 is greater than 7"
fi

These, confusingly, use the same symbols as the non-numeric conditions in the [[ ]] environment, rather than the -gt style numerical operators.

Logical Test

Which of these conditionals are True, which are False, and which don’t work?

  1. [[ 4 -eq 4 ]]
  2. [[ be == eb ]]
  3. [[ 4 == 4 ]]
  4. [[ 4 == 04 ]]
  5. (( 4 == 04 ))
  6. (( 4 -eq 04 ))

Solution

  1. True.
  2. False.
  3. True (but a string, not arithmetic comparison).
  4. False, because it is a string comparison.
  5. True, because this is a math context.
  6. Error, because -eq can’t be used within a math context

Arithmetic Comparisons of Strings

Be aware that if you try to make an arithmetic comparison of strings, the [[ ]] command converts any non-numeric strings to variable names.

a=2
b=6
if [[ "b" -gt "a" ]];then
 echo "this is slightly less obvious"
fi

One bracket or two?

[[ ]] is a relatively new command, which is not universally available in other, non-bash, shells. Before this there was [ ] (synonymous with the test command), which is more portable than [[ ]], but is not as powerful or robust. [ ] won't be covered here - more information on the differences between these can be found here: http://mywiki.wooledge.org/BashFAQ/031

To make the structure more useful, we can add alternate tests, to be conducted in sequence if the first fails, and a final default set of program statements which are executed if all conditional statements fail.

value=a
if [[ $value > g ]];then
  echo "value is greater than g"
elif [[ $value < g ]]; then
  echo "value is less than g"
else
  echo "value must be 8"
fi

There are three types of operators for these conditional expressions: file, numeric, and non-numeric. Each will return true (0) if the condition is met, or false (1) if the condition is not met. These operators are slightly different for [[ ]] and (( )) expression testing, as shown in the following two tables.

conditional expressions for [[ ]]

  non-numeric   numeric or boolean   description
  -e                                 file exists
  -d                                 file is a directory
  -f                                 file is a regular file
  -z                                 variable is empty or does not exist
  >             -gt                  greater than
  <             -lt                  less than
                -ge                  greater than or equal to
                -le                  less than or equal to
  ==            -eq                  equal
  !=            -ne                  not equal
                !                    logical NOT
                &&                   logical AND
                ||                   logical OR

conditional expressions for (( ))

  numeric or boolean   description
  >                    greater than
  <                    less than
  >=                   greater than or equal to
  <=                   less than or equal to
  ==                   equal
  !=                   not equal
  !                    logical NOT
  &&                   logical AND
  ||                   logical OR
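
The file test operators in the first table have not been demonstrated above; a minimal sketch of their use (the file name is hypothetical) is:

if [[ -e config.txt ]]; then
  echo "configuration file found"
else
  echo "configuration file is missing"
fi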

Key Points

  • (( )) is math context, and enables the use of C’s integer arithmetic operators

  • $(( )) can be used in the same way as a command substitution

  • bash interprets integer strings with a leading 0 as base 8; prefix strings or variables with 10# to force base 10

  • if statements can use both (( )) and [[ ]] commands for expression testing. These use different syntax, so be careful to check your code!


Advanced Loops

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • How do you avoid repetition of code (but not actions) in your scripts?

  • How can we repeat actions an indeterminate number of times, but not get stuck in infinity?

Objectives
  • Explain While loops, and how to control their use

  • Explain why waiting is useful, and how it is best used

Sequences and indexes for loops

As well as looping through arrays directly, we can also write loops which iterate through series of integer values, which can then be used as indexes. This can be useful if you need to add or remove matching data from a series of arrays, etc.

Two methods can be used to do this. The first is to use the seq command to build a sequence of integer values, e.g.:

seq 4

The general format for this command is seq [first] [increment] last, where the last value must be defined, but the increment value is optional, as is the first value (though this must be defined if the increment value is given). Note that other options can be added, but these will be system-specific, and won't be covered here. The default values for both the first and increment values are 1. The sequence will be built according to these rules, starting at the first value, and continuing until the next value created would be greater than the last value.

sequences

Please create a sequence of every 4th number between 3 and 37. Then create a sequence of these same numbers in reverse.

Solution

seq 3 4 37

seq 35 -4 3 (Note: because seq starts from the first number given, you have to use a priori knowledge to ensure that this sequence is the exact reverse of the first sequence).

real number maths

Although we are not interested in this feature for the purposes of indexing arrays, it should be noted that seq uses real numbers, not integers, and so more complex sequences of numbers can be created than are shown here.
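
For example, something like the following counts from 0 to 2 in steps of 0.5 (the exact formatting of the output may vary between seq implementations):

seq 0 0.5 2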

To use seq within a for loop we must execute the command within its own subshell, e.g:

for i in $(seq 1 31); do
  echo $i
done

Trying to do this without the subshell will simply cause your loop to iterate through the set of strings seq 1 31.

These integer values can be used for referencing items within an array - but do remember that bash array indexes start at 0, and adjust your seq command accordingly:

varlist=( a list of strings )
for i in $(seq 0 3); do
  echo ${varlist[$i]}
done

Indexing lengths

Because our index starts from 0, we must remember to end our loop at length-1. Not doing this in bash is unlikely to give a fatal error (unlike other, stricter, programming languages, for which it would cause a buffer overrun), but the empty string which is returned could cause issues for your workflow if you are not expecting it.

For loops can also be constructed using C-style notation. These use a set of three expressions (initialisation; end condition; increment) within a math context, e.g:

varlist=( a list of strings )
for (( i=0; i<4; i++ )); do
  echo ${varlist[$i]}
done

C-style integer increments

The code (( i++ )) increments the value of variable i by one, equivalent to using (( i+=1 )) or (more explicitly) (( i=$i+1 )). Negative increments can be performed by using - instead of +.

Looping through arrays without predetermined length

Often we want to loop through arrays which do not have a predetermined number of elements, so it is useful not to have to hard-code the end number into the for loop.

1) Remembering that you can get the length of an array using ${#array[@]}, and that integer maths should be carried out within an arithmetic expansion ($(( ))), please adapt the seq loop above so that it determines the array length automatically, and uses that within the for loop.

2) Please do the same using the C-style notation loop.

3) Please adapt the C-style notation loop to run through the array in reverse order.

Solutions

1)

varlist=( a list of strings )
len=${#varlist[@]}
endvalue=$(($len-1))
for i in $(seq 0 $endvalue); do
  echo ${varlist[$i]}
done

2)

varlist=( a list of strings )
len=${#varlist[@]}
for (( i=0; i<$len; i++ )); do
  echo ${varlist[$i]}
done

3)

varlist=( a list of strings )
len=${#varlist[@]}
endvalue=$(($len-1))
for (( i=$len-1; i>=0; i-- )); do
  echo ${varlist[$i]}
done

Note: the lines creating variables len and endvalue could be incorporated directly into the for loop statements, but they are explicitly stated here to make the solutions more readable.

While loops

So far we have worked with fixed length loops, however not all processes are a fixed length, and so a more open-ended solution is needed.

This is provided by the while loop, which uses a conditional statement to check when the loop should be exited.

For example, this can be used to break the loop after user input:

halt=no
while [[ $halt != 'yes' ]]; do
  sleep 3
  echo "break out of the loop?"
  read halt < /dev/tty
done

Note the use of sleep, to wait a given number of seconds before continuing with execution of the loop.

It can also be used to replicate (in a more convoluted, and less maintainable, manner) the for loops above:

varlist=( a list of strings )
len=${#varlist[@]}
i=0
while (( i<$len )); do
  echo ${varlist[$i]}
  (( i++ ))
done

While loops are useful for process control: for automating the checking to see if processes are finished, for example, and moving onto the next stage of the workflow once they are.

Tracking program progress

As an example of how while loops can be used to wait for a process to finish, we will create a function which waits for a random period before finishing. It writes its status to a log file, which we can use to track the progress of the program.

sleeptest () { echo 'started' > log.out ; sleep $(($RANDOM/1000)) ; echo 'finished' >> log.out ; }

Tracking the current status of the program can be done using tail, e.g.:

sleeptest &
tail -1 log.out
started

Using the -1 flag tells tail to only return the last line of the file.

Can you fill the three gaps in this while loop, so that it exits once the sleeptest function has ended?

sleeptest &
finished_tasks=0
job_limit=1
while [[ $finished_tasks ____ $job_limit ]]; do
  sleep 3
  finished_tasks=0
  log_tail=$(______)
  if [[ ______ ]]; then
    echo "finished a task"
    ((finished_tasks+=1))
  else
    echo "still going"
  fi
done

Solution

sleeptest &
finished_tasks=0
job_limit=1
while [[ $finished_tasks -lt $job_limit ]]; do
  sleep 3
  finished_tasks=0
  log_tail=$( tail -1 log.out )
  if [[ $log_tail == "finished" ]]; then
    echo "finished a task"
    ((finished_tasks+=1))
  else
    echo "still going"
  fi
done

Key Points

  • seq [first] [increment] last creates a sequence of (real) numbers

  • for loops can be controlled using seq or C-style notation

  • ${#array[@]} is useful for setting these sequences

  • while loops use conditional statements, and aren’t fixed in length like for loops

  • while loops can be used for process control


Sed and regular expressions

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • How can you edit text files within your scripts?

  • How can you make your searches more powerful?

Objectives
  • Detail regular expressions, and how they are used

  • Learn how to use regular expressions for string replacement

Workflows often require the editing of configuration files or scripts, or the searching of these for specific information to copy. This lesson will introduce a tool for editing files, as well as regular expressions, which can be used to make your searches more powerful.

Sed is a stream editor. It can be used to perform basic text transformations on an input stream, which can be a file, or it can be passed from a pipeline, allowing it to be combined with other tools.

To demonstrate the basic usage of sed, we will create a text file containing the string hello, and use sed to change this to world:

echo hello > input.txt
sed -e 's/hello/world/' input.txt
world

The -e flag indicates that the first parameter (s/hello/world/) is a script to be applied to the stream, while the following non-option parameters (input.txt) are input files. Sed will, by default, print all of the processed input; to save this for later use you will need to redirect the output into a file:

sed -e 's/hello/world/' input.txt > output.txt

The script we are using here is a substitution (indicated by the s at the start of the script). The first string (hello) is what is being searched for, while the last string (world) is the string that will be used as a replacement.
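
Because sed reads from standard input when no input file is given, the same script can also be applied to piped text:

echo hello | sed -e 's/hello/world/'
world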

Sed can be passed multiple inputs; it will treat these as a single stream:

echo goodbye > input_pt2.txt
sed -e 's/goodbye/world/' input.txt input_pt2.txt
hello
world

It can also be passed multiple scripts, which will be run in sequential order:

sed -e 's/goodbye/world/' -e 's/l/a/' input.txt input_pt2.txt
healo
worad

You will note that the second (and first) script has been applied to both of the input files, but only one l has been replaced. This is because the search was not global, that is, as soon as one matching string is found it is replaced and sed moves on to the next script.

If you want to replace all matching strings, then you need to specify that the script should be applied globally, by appending a g at the end of that script. E.g.:

sed -e 's/goodbye/world/' -e 's/l/a/g' input.txt input_pt2.txt
heaao
worad

Rather than searching the whole input for a match you can, if you know the exact line you wish to process, specify a line number:

echo 'hello' > input_lines.txt
echo 'hello' >> input_lines.txt
sed '2 s/hello/world/' input_lines.txt
hello
world

Note that sed indexes lines starting from 1, not 0.

Providing scripts via a file.

Sed allows the use of a file for providing the scripts for processing your files. This file should contain one script per line, and is passed using -f instead of -e, e.g.:

echo 's/goodbye/world/' > myscript.sed
sed -f myscript.sed input.txt input_pt2.txt
hello
world

Editing dates for a complex configuration file

Jon has a configuration file (namelist.input in the wrf_configuration directory) that he uses for running the Weather Research and Forecast (WRF) model. He wants to run this daily, keeping all of the configuration the same except for the start and end dates. These will need to be changed each day that the model is run, so that the start date is today, and the end date is today + 3 days. Can you automate this using sed? The template configuration file itself does not need to be useable, so can be modified if that would help you.

Solution

There are two ways to do this. The first is to specify the lines you want to change, and change only those lines. However this is a fragile solution, as any changes to the template configuration file could change the line numbering, breaking your script.

A more robust solution is to replace the dates in the template file with clear identifier strings. These should be something you would not normally see in a script (e.g. I tend to use a string such as %%DAY%% or %%MONTH%%). Then you can carry out a global sed action for each string requiring changing, without worrying that you might make any unwanted changes.
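
A sketch of this second approach is given below; the template file name and placeholder names are assumptions, so adapt them to match whatever identifiers you put into your own template:

YEAR=$(date +%Y)
MONTH=$(date +%m)
DAY=$(date +%d)

END_YEAR=$(date -d "+3 day" +%Y)
END_MONTH=$(date -d "+3 day" +%m)
END_DAY=$(date -d "+3 day" +%d)

# replace each placeholder globally, writing the result to the working configuration file
sed -e "s/%%START_YEAR%%/${YEAR}/g" \
    -e "s/%%START_MONTH%%/${MONTH}/g" \
    -e "s/%%START_DAY%%/${DAY}/g" \
    -e "s/%%END_YEAR%%/${END_YEAR}/g" \
    -e "s/%%END_MONTH%%/${END_MONTH}/g" \
    -e "s/%%END_DAY%%/${END_DAY}/g" \
    namelist.template > namelist.input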

Regular Expressions

In the Finding Things lesson in the shell novice course you were introduced to using wildcards, such as ., in your grep searches. Grep uses regular expressions (often abbreviated to regex), sequences of characters which define the string(s) to be searched for. Sed uses the same regex patterns for its searches, and we will cover some basic principles of using these here.

Regex syntax and interoperability

Regular expressions are implemented in a number of different programming languages. These all follow similar rules, but there will be differences, often subtle, between each of these implementations.

Many implementations follow the feature-rich regex syntax that was developed first for the Perl language. However UNIX command line programs tend to use the older ‘POSIX’ regex standards. These are further split into the POSIX Basic (BRE) and POSIX Extended Regular Expression (ERE) standards. Below we will teach you the ERE standard, because this has a more readable syntax which is closer to that of the more modern regex implementations. More information on the difference between the two POSIX standards can be found here.

To use ERE in sed the -E flag must be used. We will do this below, even in situations where it is not necessary, to get you in the habit of using it in your own code.

Regular expressions rely on the use of literal characters and metacharacters to construct the search term. Metacharacters are characters which have a special meaning (for example, . represents any single character). If you wish to search for a literal character which happens to be a regex metacharacter, then it will need to be “escaped”, that is, preceded by a \ character. For example:

echo "Hello. World" > input.txt
sed -E -e 's/\./.../' input.txt
Hello... World

Note that the string which is being used as a replacement is not a regex pattern, so the periods in this did not need escaping.

Forgetting to escape a metacharacter

What string would sed return if the \ character was not used in the above regex?

Solution

The first character found will be replaced, giving an output:

...ello. World

The search was not global though, so the rest of the string remains unchanged. If a g were added to the end of the script then it would replace every character in the string with ....

Matching ranges of characters

One of the most common patterns used in regex is the definition of a list or range of characters, which can be denoted using square brackets. E.g.:

sed -E -e 's/[HW]/J/g' input.txt
Jello. Jorld

This list of upper case characters to replace is very focused, but if you did not know in advance what the upper case characters would be you can use the range [A-Z]. Similarly, to replace all lower case letters use the range [a-z], and to replace any digit use [0-9]. These can be combined as you require; for example, to match all characters (of any case) between B and H you would use [B-Hb-h].

Creating new strings

What regex expressions would you use to create the following strings from the Hello. World string in the input.txt file?

  1. Halla. Warld
  2. Heno. Worn

Solution

  1. sed -E -e 's/[eo]/a/g'
  2. sed -E -e 's/l[ld]/n/g'

Matching Repeated Instances

It can be useful to match more, or less, than a single instance of a particular element in the search string. This can be done by adding one of these special characters after the element:

  *      match zero or more instances of the preceding element
  +      match one or more instances of the preceding element
  ?      match zero or one instance of the preceding element
  {N}    match exactly N instances of the preceding element
  {N,M}  match between N and M instances of the preceding element

The elements that these can be used on can be either single characters, or sets of characters. E.g.:

sed -E -e 's/l{2}o/n/g' input.txt
Hen. World

This is particularly useful for changing date strings, e.g.:

YEAR=2021
sed -E -e "s/[0-9]{4}/${YEAR}/g" <(echo 'the date is: 23-04-2020')
the date is: 23-04-2021

Here we change the date to that set in a previously set variable (note the use of double quotation marks, so that the shell will interpret the string and replace the variable name with the required value).

Matching Line Endings

The ^ and $ metacharacters can be used to respectively assert the position of the start or end of a line. This allows you to “anchor” your search at either end of a line. For example, if we are provided with a YEAR variable which only contains the last two digits, but we know that the year digits will always be at the end of the line, we can search for [0-9]{2} without risking changing the day or month:

YEAR=21
sed -E -e "s/[0-9]{2}$/${YEAR}/g" <(echo 'the date is: 23-04-2020')
the date is: 23-04-2021

Back References and Subexpressions

A back-reference is a regex command which refers to a previous part (or subexpression) of the matched regular expression. They can be used to repeat patterns within a regex search or, as we will do here, pass part of the matched regex forward to the replacement string. Back references are specified by a single escaped digit (e.g. \1; up to nine are allowed in a single regex), while the subexpression is indicated using () brackets.

A common use of these is pulling out a single element of the search, e.g. the year from a date string:

date | sed -E -e "s/^.*([0-9]{4}).*$/\1/g"
2021

Note how the 4-digit year is stored in a subexpression, while the strings before and after it are included in the match using ^.* and .*$.

BASH logic and regex

In the logic and maths lesson you were introduced to the [[ ]] command, which is used for logical control structures. This command also allows regex patterns to be used, checking to see if a given string matches the regex or not. This comparison is performed using the =~ operator. For example:

YEAR=1999
if [[ $YEAR =~ ^[0-9]{2}$ ]]; then
  echo "year is in 2 digit format"
elif [[ $YEAR =~ ^[0-9]{4}$ ]]; then
  echo "year is in 4 digit format"
else
  echo "year is in unrecognised format"
fi
year is in 4 digit format

Note that the ^ and $ metacharacters have been used to ensure the pattern matches the whole string, and that no partial matches are made by mistake.

Further Learning

Library Carpentry have a longer introduction to regex course (from which some of this material has been taken). If you will be working with, and processing, a lot of text files then you will find this course useful. Do note, however, it is written with the more advanced regex implementations in mind, so some features mentioned in that course will not be available for shell programming.

Key Points

  • sed performs basic text transformations on an input stream

  • The basic usage is sed -e 's/pattern/replacement/' input.txt

  • Multiple scripts can be chained, by using additional -e 's/pattern/replacement/' declarations

  • Matches will be made on the first instance of the pattern, or all matches can be found by using s/pattern/replacement/g

  • Extended regular expressions can be enabled with the -E flag

  • Specify character ranges using [A-Z0-9]

  • Repeat single characters or ranges by appending *, +, ?, or {RANGE}

  • Match the start and end of lines using ^ and $, respectively

  • Special characters can be matched if they are escaped by prepending \

  • Capture subexpressions with ( ), and back-reference these in your pattern or replacement text using \1-\9

  • Regex can be used in logic tests, with the =~ operator

  • Regex are easier to write than to read. Document yours well!


Symbolic Links

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can you reuse one file in multiple directories?

  • How can you ease moving around your file system?

Objectives
  • Detail symbolic links, and how they are used

  • Learn when not to use symbolic links, and how to avoid pitfalls in using them.

A symbolic link, also called a soft link, is a pointer which enables you to find another file, much like a shortcut in Windows. Like these it is useful for creating shortcuts within the file system, for simplifying the file paths used by other programs, or easing your navigation between different work directories in a networked system (important when working on HPC systems).

It is important to remember that symbolic links do not point directly to any data that might be in the target, they instead point to the file system itself. This allows you to link to either files or directories using the same command, and also to link to filesystems hosted on remote computers. But it also means that there is a high risk of data loss if the remote files are moved or deleted. Because of this it is recommended that you use them sparingly in your workflows.

We will use the dataset from the BASH introduction course to demonstrate the use of links.

A symbolic link can be created using the command:

cd ~/Desktop/
ln -s data-shell/molecules
ls -l molecules
lrwxrwxrwx  1  user group 20 Feb  8 20:00 molecules -> data-shell/molecules/

This has created a symlink to the molecules directory, with the name molecules. Like cp, ln will default to using the name of the object it is given, but unlike the copy command it does not need to be given a destination location.

You can identify symlinks by the @ following their name if ls -F is used:

ls -F
data-shell/ data-shell.zip molecules@

You can use cd to enter, and exit, this directory, as you would any other directory:

cd ~/Desktop/molecules
pwd
cd ..
pwd
/home/jon/Desktop/molecules
/home/jon/Desktop

This is because the cd command tracks the logical path you followed, including the symlink, so the .. in the second cd command returns you to the directory containing the symlink. We can disable this behaviour using the -P flag, which forces cd to resolve the symlinks to the physical directory structure before following .. paths:

cd ~/Desktop/molecules
pwd
cd -P ..
pwd
/home/jon/Desktop/molecules
/home/jon/Desktop/data-shell

In this case, even though we seemed to be in ~/Desktop/molecules, using .. while using the -P flag takes us to the ~/Desktop/data-shell directory, because that is the true parent directory. This relationship is made explicitly clear if we use -P for the first cd command:

cd -P ~/Desktop/molecules
pwd
cd ..
pwd
/home/jon/Desktop/data-shell/molecules
/home/jon/Desktop/data-shell

In this case we arrive directly in the original directory with the first cd command, meaning that it does not matter whether we use the -P flag or not for the second command, we will always arrive back in the data-shell directory.

Shell builtin commands, such as cd and pwd, are able to make use of the shell's tracking of the symlink, so that they deal with the directory structure as we would expect. Standalone commands such as ls and cp are not able to do this, and so they always resolve symlinks to the original directory structure before following .. paths.

Because of this behaviour, it is advised that you avoid using .. paths which cross a symlink in your scripts - in this situation it would be safer to use the absolute path (or a path relative to a fixed point, such as your home directory ~/).

Symlinks can be removed without destroying the object they point to:

cd ~/Desktop
ls -l molecules
rm molecules
ls -ld data-shell/molecules
lrwxrwxrwx 1 user group 20 Feb  8 19:58 molecules -> data-shell/molecules
drwxr-xr-x 2 user group 4096 Feb  8 11:36 data-shell/molecules

Although we can leave the name of the symlink the same as the original object, one of the most useful features of symlinks is being able to rename files without moving or changing the original file.

For example, in data-shell/data/elements/ we have xml files describing each atom. Each of these is named using the periodic table symbol, e.g. N.xml is the Nitrogen descriptor. However, we have a program which is expecting the files to have the full atom name, e.g. Nitrogen.xml. We can easily enough create these symlinks, e.g.:

cd ~/Desktop/data-shell/data
mkdir elements-fullnames
cd elements-fullnames
ln -s ../elements/N.xml Nitrogen.xml
ls -l
lrwxrwxrwx 1 user group 17 Feb  8 21:32 Nitrogen.xml -> ../elements/N.xml

Doing this will enable us to use the program, without having to create all the input files again. Do keep in mind though that, although you can delete symlinks without deleting the original file, if a program tries to write to a symlink, it will write to the original file. This method is suitable for easily replicating or renaming input files. Extreme caution should be used if you use the same method for output or log files.
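
A small illustration of this behaviour (the file names here are hypothetical):

echo original > notes.txt
ln -s notes.txt notes-link.txt
echo extra >> notes-link.txt
cat notes.txt
original
extra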

Scripting the linking of all atom files.

There are over 100 atom files in the elements directory, linking to each of these by hand would be quite painful. Fortunately these are text files, and each of them contains the full name of the element in the first line of the file, e.g.:

head -1 ~/Desktop/data-shell/data/elements/N.xml
<element name="Nitrogen"/>

To strip the atom name out of this string you can use sed:

head -1 ~/Desktop/data-shell/data/elements/N.xml | sed -E -e 's/^.*"([A-Za-z]*)".*$/\1/'
Nitrogen

Please write a bash script which will use a for loop and this string processing pipeline to create links that use the full element names for these files within whatever directory it is run.

Solution

for orig_file in "${@}"
do
   element_name=$(grep -i 'name=' ${orig_file} | sed -E -e 's/^.*"([A-Za-z]*)".*$/\1/' )
   ln -s ${orig_file} ${element_name}.xml
done

This script should be run using:

bash link_script.sh ~/Desktop/data-shell/data/elements/*.xml

If you have a hammer, every problem is a nail

In the above solution we use sed and regex to extract the string we require from the xml file. There are other bash tools that could do this for us, and in ways which are arguably more readable (and being as readable as possible is a good trait for code). Can you identify a tool from either these lessons, or the bash introduction lessons, that you could use for this, and adapt your script to use this tool?

Solution

The cut tool can be used to split the text we need, by using " as the delimiter:

head -1 ~/Desktop/data-shell/data/elements/N.xml | cut -d '"' -f 2
Nitrogen

Using this command instead would make your code more readable. Both solutions do still make similar assumptions about what format the string will take though (and the head command also makes major assumptions about the formatting of the file). These assumptions can make your scripts fragile. This fragility can be addressed by either well documenting what inputs you expect the script to have, or by building in extra checks to your code. The choice of which of these solutions to use depends on the script itself - which solution do you think would be the most appropriate for this script?

Key Points

  • Symbolic links to objects (files or directories) can be created using ln -s

  • These are links to the object, not its contents, so these can change or be deleted

  • Symbolic links can cross physical disks, and so are useful in networked filesystems

  • Caution must be exercised when following .. paths across symbolic links

  • They are most useful for linking to, and/or renaming, input and configuration files or directories