Dates, Scheduling, and Downloading Files
Overview
Teaching: 10 min
Exercises: 5 min
Questions
How can we deal with date maths on the command line?
How can we schedule regular compute jobs?
How can we download files outside of a web-browser?
Objectives
Understanding some of the tools available for addressing regular needs.
Date maths
On UNIX systems the system time and date information can be found using the date
command:
date
Wed 27 Jan 2021 11:52:17 GMT
This will return a formatted string (likely) containing information on the day, date, time, and timezone, as shown above.
This command can be used for more than just finding the current time, though. It can also be used to create formatted strings containing other dates and times, and to calculate dates and times relative to either the current time or another given time. Here we will give a short introduction on how to use this command on Linux systems.
Linux vs OSX
Although all UNIX systems will have a date program, the exact syntax for using it can vary from system to system, due to differing implementations of the Unix standards. The two systems you are most likely to encounter are Linux-based systems (such as Debian), and the FreeBSD Unix-based OSX system. These use quite different syntaxes, so care must be taken to use the correct syntax for the system you are on.
Below we will focus on the Linux syntax, as this is most common for HPC environments. Where the OSX syntax is different this will be highlighted in a separate note.
The output string for the date command can be formatted by adding +[format string]. The options for these are defined in the man page for date, but some examples are %Y for 4-digit year, %m for 2-digit month, and %d for 2-digit day of month. These can be used individually, or combined as you wish. E.g.:
date +%Y
2021
date +%Y-%m
2021-01
date +"%Y %d"
2021 27
In the last example we enclose the format string in quotation marks, to allow the inclusion of a space in the formatted output.
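Format strings are often used to build sortable file names. The following is a minimal sketch, assuming a hypothetical backup file name (backup_..., chosen purely for illustration); it captures the output of date using the $( ) notation, which is covered in a later lesson:

```shell
# Capture a sortable timestamp, such as 20210127_1152
# (4-digit year, month, day, then hour and minute)
stamp=$(date +"%Y%m%d_%H%M")

# Use the timestamp within a (hypothetical) file name
echo "backup_${stamp}.tar.gz"
```

Because the year comes first and every field has a fixed width, file names built this way sort chronologically.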
To display a date which is not now, you can use -d [date string]. The most useful (for our purposes) date string here is YYYYMMDD, e.g.:
date -d "20120423"
Mon 23 Apr 00:00:00 BST 2012
Displaying dates that are not ‘now’ on OSX
On OSX you have to explicitly set the format of the date string that you pass to date, using -f "[format string]" "[date string]". You will also need to ensure that date does not try to reset the system clock, by passing the -j flag too. The OSX command equivalent to the Linux command above is:
date -j -f "%Y%m%d" "20120423"
To calculate an offset from a given date (either ‘now’, or a supplied date), you add the desired offset into your date string. This offset can be composed of a number of different elements, for example:
date -d "20210127 +3 day +1 month -18 year"
Sun 2 Mar 00:00:00 GMT 2003
The advantage of using date to do this calculation for you is that it can deal with transitioning across month and year boundaries easily. To do this by hand would require a lot of checks for the lengths of months, leap years, etc.
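As a short illustration of these boundary cases, the sketch below (which assumes the Linux, GNU, version of date) adds one day to the end of February in a leap year and in a non-leap year, and to the last day of a year. The variable names are arbitrary, and the $( ) notation used to capture the output is covered in a later lesson.

```shell
# 2020 was a leap year, so the day after 28 February is 29 February
leap=$(date -d "20200228 +1 day" +%Y-%m-%d)

# 2021 was not a leap year, so the same offset rolls over into March
nonleap=$(date -d "20210228 +1 day" +%Y-%m-%d)

# Adding a day to 31 December moves into the next year
newyear=$(date -d "20201231 +1 day" +%Y-%m-%d)

echo "$leap $nonleap $newyear"
```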
Offset calculation order
You should note that the offsets are applied in order from the largest unit (years) downwards, regardless of the order in which you write them. This is most important where the offset will cross month boundaries, but could be important to remember in other scenarios too.
Calculating date offsets on OSX
On OSX the offset is given as one or more separate strings for each element of the offset that is required, each preceded by a -v flag. You should also note that the offset elements are applied in the order in which you provide them, giving more explicit control over this process than you have with the Linux date command.
The OSX command equivalent to the Linux command above is:
date -j -v "-18y" -v "+1m" -v "+3d" -f "%Y%m%d" "20210127"
Scheduling tasks with CRON
Often workflows need to be repeated at regular intervals: checking for updates in source data, running maintenance tasks, or producing regular updates to services. These tasks can be automated using the cron job scheduler. Cron is available on most UNIX based systems (although, on many HPC platforms, access to cron will be blocked for ordinary users). You can find out if you have it installed (and what tasks you have scheduled) using the crontab (cron table) command:
crontab -l
To configure cron it is easiest to create a configuration file, which will be read simply using the command:
crontab [config.txt]
Each line of this file will represent a job, and will look like:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │ 7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * <command to execute>
This notation can be a little confusing to use, but online tools such as crontab.guru are available for checking your configurations, to ensure that cron will run jobs at the times you expect.
Scheduling jobs
Jon wants to set two data download tasks, one to run at 2:45 am every Monday, and one which runs at 2:00 pm on the 1st and 15th of each month. What notation should he use to set these jobs? (crontab.guru can be used to work this out.)
Solution
45 2 * * 1
- 2:45 am on every Monday
00 14 1,15 * *
- 2pm on the 1st and 15th of the month
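As a sketch of how these schedules might look in a complete configuration file, the example below writes both jobs to a temporary file. The script paths (/home/jon/scripts/...) are hypothetical, and nothing is installed into cron by this code.

```shell
# Write the two schedules to a temporary file (cron itself is not touched here)
cronfile=$(mktemp)
cat > "$cronfile" <<'EOF'
# min hour day-of-month month day-of-week command
45 2 * * 1 /home/jon/scripts/weekly_download.sh
0 14 1,15 * * /home/jon/scripts/fortnightly_download.sh
EOF

# Count the job lines (those which do not start with '#')
njobs=$(grep -vc '^#' "$cronfile")
echo "$njobs jobs defined in $cronfile"
```

Passing such a file to the crontab command would install it, replacing any crontab that you already have, so it is worth checking crontab -l first.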
OS-X and cron
OS-X does include cron, but it is difficult to use because of the security settings. If you wish to use cron on OS-X you will need to enable Full Disk Access within the Security & Privacy settings menu for the program /usr/sbin/cron.
Downloading using wget
Often workflows involve input or source file downloads, and it is useful if these can be automated on the command line, rather than relying on using a web-browser for this.
Wget enables the retrieval of files using the widely used HTTP, HTTPS, FTP, and FTPS protocols (much as most web-browsers do). It has many features, such as being able to resume aborted downloads, using filename wild cards and recursive searches, and use of timestamps to determine if documents need redownloading. Here we will just cover the basic usage; for more details and advanced usage check the GNU Wget documentation.
The basic usage is to simply give the path to a file that you wish to download. If it exists, and is accessible, then wget will download it and save it in your local directory. E.g.:
wget http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/meteo_ABED_2020-01-04_1800_data.txt
This can be used for directories as well as files, but when used on a directory what is returned is a file listing the directory contents (as you would see in a web-browser). E.g.:
wget http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms
To download all the contents of a directory you need to use the recursive -r flag. In most cases this should be combined with the -np flag, which tells wget that you don’t want it to crawl up to parent directories. You can also restrict the files you download based on their file-extension string using the -A flag.
For example, to download a single file use:
wget -r -np http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/meteo_ABED_2020-01-04_1800_data.txt
This creates a directory structure following the http path structure, containing the file you requested.
To download all the text (.txt) files in that directory use:
wget -r -np -A txt http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/
This creates the same directory structure, but it will contain all the text files in that directory.
Restricting webcrawling to a single domain
Although it is not important for our interest in downloading data from internet archives, when downloading data from a website it can be useful to ensure that wget doesn’t follow links off that site. This can be done using the -D [domain] flag, where [domain] is the web domain that you wish to restrict wget to. In the examples given above we could add -D manchester.ac.uk if we wished to restrict searches to the University of Manchester domain, or even -D manunicast.seaes.manchester.ac.uk if we wished to restrict searches to the ManUniCast site only.
Using wildcards in file names
The HTTP protocol does not support wildcards, so using wildcards would not work in the examples given above. Wget can also use the FTP protocol though, which does support the use of wildcards, so if you are downloading data from an FTP server these are an option.
Scheduling jobs
Go to the data_download directory inside the data-adv-shell directory. Here there is a crontab file (crontab_settings.txt) and a script for downloading some data (manunicast_download.sh). Edit the crontab file to
- set a time which is 3-4 minutes in the future
- set the correct path to the manunicast_download.sh file
Then submit the crontab settings using:
crontab crontab_settings.txt
If it works you will find the downloaded file on the Desktop in a few minutes. If not, check the error message in stderr.log (also on the Desktop) to see what the problem was.
Once this has worked you can unset the crontab using:
crontab -r
Key Points
Using date for your date and time calculations will save you a lot of hassle
crontab can be used for scheduling regular tasks
wget can be used for scripting your data download process
Tool syntax is not always consistent across different unix flavours
Variables and Arrays
Overview
Teaching: 10 min
Exercises: 5 min
Questions
How can I store information without writing it to file?
Objectives
Learn how to create and reference a variable
Learn how to create and reference an indexed array
Variables
In the shell novice course you were, indirectly, introduced to bash variables, when working with loops: http://swcarpentry.github.io/shell-novice/05-loop/index.html
for thing in list of things
do
echo $thing
done
echo 'variables persist:' $thing
list
of
things
variables persist: things
Variables can also be defined directly:
echo $thing
thing=this
echo $thing
things
this
Exploring Shell Variables
You can check if a variable exists, and what its contents are, using the declare -p command. This will list all variables in the shell, so to make the output easier to use it is recommended that you pipe it to another command, such as less or grep, to search for the variable you require.
To delete a variable you can use unset
.
thing=stuff
unset thing
Note that you do not reference the contents of the variable (using $thing) to do this; instead you reference the variable directly (using thing).
Referencing a non-existent variable
What happens if you reference a variable which doesn’t exist? e.g.
thing=stuff
unset thing
echo $thing
Solution
Referencing a non-existent variable simply returns an empty string; no error message is given. This can aid in the smooth-running of a script, but can also create problems if you are used to the error messages that languages such as Python return after referencing a non-existent variable. We will look later at how you can test for the existence of a variable more safely.
Arrays
BASH objects do not have to only contain a single value, they can contain a list of values instead.
To create an array object you can use the notation:
listthings=( these are my things )
Whitespace warning
Note that the whitespace is important — it is used to denote the breaks between individual values, as well as the start and end of the list. If you wish to have items which contain whitespace you will need to wrap these in quotation marks. e.g.
listthings=( these "are my" things )
The array object now contains an indexed list of values:
declare -p | grep -w listthings
declare -a listthings=([0]="these" [1]="are my" [2]="things")
To access the contents of the array you should use the indexes, e.g. [0] or [1]. You can also access all values using the index [@]. However, to make use of these indexes you must use curly braces to delimit the variable name.
Referencing Arrays
What are the outputs, and why, from the following commands?
echo ${listthings[1]}
echo $listthings[1]
echo ${listthings[@]}
echo $listthings
echo ${#listthings[@]}
Solution
are my
This is the value at index 1.
these[1]
This is the first (index 0) value in the array, followed by the string [1].
these are my things
This is all values within the array.
these
This is the first (index 0) value in the array.
3
This is the length of the array.
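Finally, we note that arrays can be referenced using a for loop, as at the start of this lesson. The quotation marks around the array reference ensure that items containing whitespace (such as "are my") are kept together:
for thing in "${listthings[@]}"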
do
echo $thing
done
This is generally the best way to create a for
loop; except for the most trivial examples
it is wise to keep the array assignment separate from the loop itself.
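The array operations described above can be combined as in the following sketch. The array name (files) and its contents are hypothetical, and the += notation for appending items is a bash feature that is not otherwise covered in this lesson.

```shell
# Create an array; the quotation marks keep "first file.txt" as a single item
files=( "first file.txt" second.txt )

# Append another item to the end of the array
files+=( third.txt )

# Report the number of items in the array
echo "number of items: ${#files[@]}"

# Quoting "${files[@]}" stops the items being split on whitespace
for f in "${files[@]}"
do
    echo "item: $f"
done
```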
Key Points
BASH variables can store a single piece of information
BASH arrays can store an indexed list of information
{} delimits a variable name (as well as denoting a code block), and is essential for referencing arrays
[@] denotes all of an array, while [X] denotes the value at position X
${#VAR} returns the length of the string
${#ARRAY[@]} returns the number of items in the array
Subshells and Functions
Overview
Teaching: 30 min
Exercises: 10 min
Questions
How can you organise your code into functional blocks?
How can you import settings from other files into your scripts?
Objectives
Explain how BASH functions operate
Explain the difference between local and global variables
Explain how to import bash files into your environment
Shell and Environment Variables
In the shell novice course you were introduced to writing and using shell scripts, http://swcarpentry.github.io/shell-novice/06-script/index.html.
When a shell process is created, a corresponding environment is created to contain the shell process. The environment contains information needed for the shell to interact with the operating system (such as the location of your home directory, or which display is being used). Any further processes created by commands run in the shell process will inhabit this environment, and have access to this information too. This information is stored in variables, similar to the variables introduced in the previous lesson, but these are ‘environment’ or ‘global’ variables, whereas the variables used previously were ‘shell’ or ‘local’ variables. The difference between these is that ‘environment’ variables are readable by all processes in that environment, while ‘shell’ variables are unique to that process which created them.
Invoking a shell script leads to the launch of a new process, which processes the commands within that script. That process will not have access to the ‘shell’ variables of the shell from which it was launched.
To demonstrate this we can create a simple bash script, called ‘test.sh’ and containing this code:
echo $var1
echo $var2
Then run this code:
var1=23
export var2=12
bash test.sh
12
The export
command turns the shell variable into an environment variable, enabling all
processes within that environment to access it.
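The same behaviour can be seen without creating a separate script file, by asking a new bash process to run a command directly with bash -c. This is only an illustrative sketch; the variable child_view is an arbitrary name, and the $( ) notation used to capture the output is introduced later in this lesson.

```shell
var1=23          # shell variable: stays private to this shell
export var2=12   # environment variable: passed on to child processes

# The single quotes stop the current shell expanding the variables itself,
# so it is the child bash process which looks them up
child_view=$(bash -c 'echo "var1=[$var1] var2=[$var2]"')
echo "$child_view"
```

The child process reports an empty value for var1, because that variable was never exported.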
Exploring Environment Variables
You can check what other environment variables have been set using the
declare -px
command.
Environment variables are also accessible to other programs running in the same environment. E.g., for Python these can be accessed using os.environ:
python -c "import os ; print(os.environ['var2'])"
Subshells
Subshells are separate instances of the command process, run as a new process, and defined
within your scripts using ()
. Because a subshell is run in a new process, these can be
used for parallel processing (although we will not cover that here). Here we will
introduce key concepts behind the use of subshells.
Unlike calling a shell script, subshells inherit the same variables as the original process, and thus can access any of these (even those which have not been exported).
x=5 ; (echo $x)
5
However these variables are copies, and modifications made to existing variables, or newly created variables, will not be available to the original process.
x=5 ; (echo $x; x=3; y=2; echo $x $y) ; echo $x $y
5
3 2
5
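This isolation applies to more than variables. As a small sketch (start_dir is an arbitrary variable name), a change of working directory made inside a subshell does not affect the original shell, which can be a convenient way of running a command elsewhere without having to change back afterwards:

```shell
# Record where we are before starting
start_dir=$PWD

# The change of directory happens inside the subshell only
( cd / ; echo "inside the subshell: $PWD" )

# The original shell is still in the directory it started in
echo "outside the subshell: $PWD"
```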
Code blocks
Code blocks can be denoted between curly brackets {} (as will be used later). However this does not launch a subprocess, and any changes to, or additions of, variables will be persistent.
x=5 ; { echo $x; x=3; y=2; echo $x $y; } ; echo $x $y
5
3 2
3 2
Note that, unlike for subshells, the opening bracket must be followed by whitespace, and the closing bracket needs to be on a new line, or separated by a ; from the last command used.
Because there is no persistence of variables outside of the subshell, to pass information out from this process you have to either write to file, or use one of the two following operations: command or process substitution.
Command Substitution
Even though the syntax for a subshell is very similar to that for declaring an array, we cannot directly save the output of a subshell in the same manner. Instead we have to use command substitution. This is a feature which enables the recording of the output from an executed command. The command is executed within a subshell, and the substitution is carried out using the following notation:
x=4; echo $x; x=$(y=5;echo $x$y); echo $x
4
45
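Command substitution works with any command or pipeline which writes to standard output. A brief sketch, using arbitrary variable names (nlines and parent):

```shell
# Capture the output of a pipeline: count the lines produced by printf
nlines=$(printf 'a\nb\nc\n' | wc -l)
echo "line count: $nlines"

# Capture the output of a single command: the parent of a directory path
parent=$(dirname /usr/local/bin)
echo "parent directory: $parent"
```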
Command Substitution using backquotes
Command substitution can also be carried out using backquotes ` `. This is not directly analogous to $(), as some special characters will need escaping within the backquotes, but they behave in very similar manners. ` ` is the older implementation, and so will be common in script libraries, but its use is now discouraged, and the $() notation is recommended for all new scripts.
Example usage of command substitution
Jon wants to save the year, month, and day, from the date command, as variables, so that he can use them later for running download scripts. Can you write some code using command substitution to do this?
Solution
You will need three separate calls to the date command for this, one each for the day, month, and year. E.g.:
YEAR=$(date +%Y)
MONTH=$(date +%m)
DAY=$(date +%d)
Now that he has the code for obtaining today’s date, he needs to do the same for yesterday’s date. Can you copy and adapt your code above to do this?
Solution
Again you will need three separate calls to the date command for this, one each for the day, month, and year. To each of these you will need to add an offset of minus one day, e.g.:
PREV_YEAR=$(date -d "-1 day" +%Y)
PREV_MONTH=$(date -d "-1 day" +%m)
PREV_DAY=$(date -d "-1 day" +%d)
Using variables in an operational script
Now that Jon has code for calculating today’s and yesterday’s dates, he needs to add these to his download script. Please edit the manunicast_download.sh script, so that the web address passed to wget is for today’s data.
Solution
The web address is stored in the script as a simple string, so we can insert information from the date variables directly into this.
YEAR=$(date +%Y)
MONTH=$(date +%m)
DAY=$(date +%d)
PREV_YEAR=$(date -d "-1 day" +%Y)
PREV_MONTH=$(date -d "-1 day" +%m)
PREV_DAY=$(date -d "-1 day" +%d)
WEBROOT="http://manunicast.seaes.manchester.ac.uk/charts/manunicast/"
ADDRESS="${WEBROOT}${YEAR}${MONTH}${DAY}/d02/meteograms/meteo_ABED_${PREV_YEAR}-${PREV_MONTH}-${PREV_DAY}_1800_data.txt"
wget $ADDRESS
To make this code more readable, we have split out the fixed base of the address as WEBROOT; you do not need to do this in your own code.
Using { } brackets around the variable name is good practice, as it avoids errors where the text following the variable name would otherwise be read as part of that name. It is also good practice to wrap the text string in " ", in case there are spaces in the string.
Process Substitution
Process substitution, using <()
, is a feature which enables the usage of the output
of an executed command within another command, similar to the piping of output using |
.
As a (rather artificial) example, this enables us to compare the contents of two variables. Passing these values directly, or via a command substitution, leads to diff searching for files named after the values, e.g.:
diff $(echo 3) 5
diff: 3: No such file or directory
diff: 5: No such file or directory
Using process substitution ensures that the shell avoids this error:
diff <(echo 3) <(echo 5)
1c1
< 3
---
> 5
And these subshell methods can be nested as required:
cat <(echo year is $(date +%Y)) <(echo month is $(date +%m)) <(echo day is $(date +%d))
year is 2021
month is 02
day is 02
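A slightly more practical sketch of process substitution is to compare the output of two commands without writing either to a temporary file. Here the printf commands stand in for whatever would generate your real lists, and result is an arbitrary variable name; the if structure used is described in a later lesson.

```shell
# Compare a sorted copy of one list against another list;
# diff gives a zero (success) exit status when its two inputs are identical
if diff <(printf 'b\na\nc\n' | sort) <(printf 'a\nb\nc\n') > /dev/null
then
    result="same"
else
    result="different"
fi
echo "the lists are the $result"
```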
More information on, and examples of using, subshells are available here: https://www.tldp.org/LDP/abs/html/subshells.html
Functions
Bash functions are, at their most basic level, labelled code blocks which enable the repeated use of a set of commands. Their purpose is to make your code more readable and, in avoiding writing the same code time and again, more maintainable.
Functions can be declared by either
functionname () {
commands
}
or
function functionname () {
commands
}
Functions must be declared before they are used, and as they are a simple code block within the shell process, can read, modify, and create all variables within the shell.
var1='E'
examplechange () { var1='D'; }
echo $var1
examplechange
echo $var1
E
D
Note that the code within the function is not executed until it is called.
Bash functions are more basic than those of other programming languages, and do not easily lend themselves to modern functional programming practices. There are some tricks which can help though.
Arguments can be passed to functions, in a similar manner as for scripts and programs, as you learnt in the introduction to bash lessons:
readingargs () { echo $#; echo $1; echo $2; }
readingargs "arg1" "arg2"
2
arg1
arg2
To avoid unintentional usage of variables created within a function, these can be set as
local
, so that they are not available outside the function.
combineargs () { local var1=$1$2; echo $var1; }
var1="start"
var2="end"
combineargs $var1 $var2
echo $var1
startend
start
Note that using the local
declaration ensures that you don’t overwrite any external
variables which share the same name.
To return information from a function in a manner in which it can be allocated to an arbitrary variable, it is best to use echo to output it, and then capture the result of this.
funcresult="$(combineargs $var1 $var2)"
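Putting these pieces together, the following sketch shows a function which uses local variables internally and passes its result out using echo. The function name (add_suffix) and its arguments are hypothetical, chosen only for illustration.

```shell
# Join a name and a suffix with an underscore, and output the result
add_suffix () {
    local name="$1"
    local suffix="$2"
    echo "${name}_${suffix}"
}

# Capture the output of the function using command substitution
outfile=$(add_suffix "meteo" "2021")
echo "$outfile"
```

Because name and suffix are declared as local, calling the function does not create or change variables of those names in the calling shell.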
Multiple values can be returned in a similar manner, by echoing them on a single line and using the read command to split the output (read only processes the first line of its input).
returnargs () { echo $1 $2; }
read arg1 arg2 <<<$(returnargs "start" "end")
echo $arg1
echo $arg2
Output Warning
What is the output from the following command? And why is it as it is?
read arg1 arg2 <<<$(returnargs "start here" "end")
echo $arg1
echo $arg2
Solution
The first word is stored in arg1, while the last two words are stored in arg2.
start
here end
Be aware that the outputs from a function are separated only by whitespace, so if any exists within an outputted variable then this will cause problems for the read command. In addition, if there are more items returned than expected, then any extra items will be appended to the last argument in the list.
Return Values
An integer status (between 0 and 255) can be returned from a function (using return $arg1), however this is best saved for error checking your processes, rather than for passing results out from the function.
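As a sketch of using a return status for error checking, the hypothetical function below (is_directory) returns 0, meaning success, only if its argument is an existing directory. It uses the if structure and the -d test, both of which are described in a later lesson.

```shell
# Hypothetical helper: succeeds (status 0) only if its argument is a directory
is_directory () {
    if [[ -d $1 ]]; then
        return 0
    else
        return 1
    fi
}

# The status can be tested directly by an if statement
if is_directory "/"; then
    echo "/ is a directory"
fi

# $? holds the status of the most recently executed command
is_directory "/no/such/path" || echo "status: $?"
```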
Functionalising the Date commands
Jon wants to use the date commands that he has been given in later scripts, so he has created the following function template. Can you complete this, and replace the date code in the script with a call to the function?
determine_next_date () {
    # This function determines the next date, using a specified offset.
    # usage: determine_next_date [+/-] [#days] [current year] [current month] [current day]
}
Solution
determine_next_date () {
    # This function determines the next date, using a specified offset.
    # usage: determine_next_date [+/-] [#days] [current year] [current month] [current day]
    local next_year=$( date -d "$3$4$5 $1 $2 day" +%Y )
    local next_month=$( date -d "$3$4$5 $1 $2 day" +%m )
    local next_day=$( date -d "$3$4$5 $1 $2 day" +%d )
    echo $next_year $next_month $next_day
}
read year month day <<<$(determine_next_date - 1 $year $month $day)
Sourcing Bash Scripts
Rather than simply calling one bash script from another, with the limitations described above, code and resources from bash scripts can be imported into another script using the source functionality. This can take the form:
. script.sh
or:
source script.sh
All code within that script will be executed at the point it is sourced, so there are limitations in how this can be used. It is most useful for loading key variables, or for configuring the environment or loading functions (as is done in the .bash_profile and .bashrc files).
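A minimal sketch of this, using a temporary file to stand in for a settings script. The variable DATA_DIR and the function say_hello are hypothetical names used only for this illustration.

```shell
# Create a small settings script in a temporary file
settings_file=$(mktemp)
cat > "$settings_file" <<'EOF'
DATA_DIR="/tmp/data"
say_hello () { echo "hello from the sourced file"; }
EOF

# Sourcing runs the script within the current shell process,
# so its variables and functions remain available afterwards
source "$settings_file"

echo "$DATA_DIR"
say_hello

# Tidy up the temporary file
rm -f "$settings_file"
```

Had the file been run with bash instead of being sourced, neither DATA_DIR nor say_hello would exist in the calling shell afterwards.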
Splitting off the date math function
In order that the determine_next_date function can be used in other scripts, please move it to a new script, called function_date_math.sh, source this script in your original file, and then make sure it still works as before.
Solution
Example scripts for Linux and OSX are included in the date_math_function directory, called functions_date_math_linux.sh and functions_date_math_osx.sh.
Key Points
environment variables are accessible by all programs run from that shell
export turns a (private) shell variable into an environment variable
() creates a subshell
{} creates a code block within the current shell
$() allows a subshell to be used for command substitution, for saving the output as a variable
<() allows a subshell to be used for process substitution, for passing the output to another program
read var1 var2 <<<$() can be used to save more than one output from a command substitution
function NAME() {} creates a function code block
code inside functions is not executed on creation, and can be used repeatedly after creation
. script.sh and source script.sh enable the importing of code and variables from other scripts
BASH Logic and Maths
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How can we have responsive workflows on the BASH shell?
Objectives
Explain Logical operators, and demonstrate IF structures
Detail Maths operators, and cover basic sequencing
Integer Maths
Bash variables are, in the general context, treated as strings. Integer maths can be
carried out using ‘arithmetic expansion’, using a maths context to tell the interpreter
when to do this. The recommended method for this is using $(( ))
, and this is used in the
same manner as a command substitution:
i=3
j=$(($i + 1))
echo $j
4
Alternatively a command can be used to create the math context - these follow the same
rules as $(( ))
, but give an exit status (which will be useful later).
a=3
((a=$a+7))
echo $a
10
Within the math context the syntax and semantics of C’s integer arithmetic are used,
enabling shorthands such as +=
or -=
for incremental changes in a variable. For example,
this gives the same result as the addition above:
a=3
((a+=7))
echo $a
10
let, expr and bc
Other programs are available for the execution of integer math in bash. For example:
j=$(expr $i + 1)
let j=i+1
j=$(echo "$i+1" | bc)
For integer maths these give the same results; however, because they do not use the math context that $(( )) and (( )) provide (which protects, for example, against whitespace errors), they can be more difficult to work with, and so won’t be covered here.
The arithmetic operators that can be used within the math context are:
+ : Addition
- : Subtraction
* : Multiplication
/ : Division
** : Exponent
% : Modulus
Conditional expressions are also allowed, as covered below, as well as C’s bitwise operations and ternary operator, which won’t be covered here.
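A few of these operators in use are sketched below (the variable names are arbitrary). Note in particular that division is integer division, so any remainder is discarded.

```shell
quotient=$(( 7 / 2 ))      # integer division discards the remainder, giving 3
negative=$(( -7 / 2 ))     # truncation is towards zero, giving -3
remainder=$(( 7 % 2 ))     # modulus gives the remainder, 1
power=$(( 2 ** 10 ))       # exponent, giving 1024
echo "$quotient $negative $remainder $power"

# The C-style shorthand operators work on variables directly
count=10
(( count *= 3 ))
echo "$count"
```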
Integer / String Conversions
A lot of information strings containing digits, such as date strings, step numbers, etc, use leading zeros to ensure regularity of string length, as well as for sorting purposes. This habit, however, can cause problems when carrying out maths operations, e.g.:
x=05
y=08
echo $(($x+$y))
-bash: 05+08: value too great for base (error token is "08")
This occurs because bash interprets values starting with 0
as base 8 integers (and
values starting with 0x
as base 16 integers) instead of the base 10 integers which we
generally use for maths.
To force an integer to be interpreted in base 10 we can prepend 10#
to the value, e.g.:
x=05
y=08
echo $((10#$x+10#$y))
13
Note that this will not work for values which start with + or -, so cannot be used to set the base of negative numbers. This is not, generally, an issue for date maths, but could be for other uses. In these situations it would be wiser to use regular expression substitutions in sed, as will be covered in a later lesson.
To create a string with leading zeros from an integer, use the printf (formatted print) program. The format string used is %X.Yi, where X is the width of the string (optional), and Y is the minimum number of digits to use, e.g.:
day=5
printf "day number %.2i\n" $day
day number 05
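These two conversions are often needed together, for example when stepping through zero-padded month numbers. A short sketch, with arbitrary variable names:

```shell
month=08

# Force base 10, so that the leading zero does not trigger octal interpretation
next=$(( 10#$month + 1 ))

# Convert the integer result back into a 2-digit string
padded=$(printf "%.2i" "$next")
echo "$padded"
```

Here next holds the integer 9, while padded holds the string 09.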
If Control Structure
Many workflows are not linear, and all workflows will have exceptional cases in which
operations should be halted (if, for example, a file is missing, or a calculation would
fail). Bash has the if
control structure, which can be used to control your workflow
in these situations.
In its most basic form, this control structure will test for a condition, then execute a set of program statements if that condition is true. For example:
if [[ 8 > 7 ]];then
echo "8 is greater than 7"
fi
and:
if [[ b > a ]];then
echo "b is greater than a"
fi
These are lexicographic comparisons of the strings, using the alphabetical order. Other non-numeric conditions available are <, == and != (described in the table at the end of this section); note that >= and <= are not available for string comparisons inside [[ ]]:
if [[ b != a ]];then
echo "b is not the same as a"
fi
For arithmetic comparisons inside [[ ]] brackets we instead would use the operators -gt, -lt, -le, -ge, -eq and -ne:
if [[ 8 -gt 7 ]];then
echo "8 is greater than 7"
fi
For arithmetic expressions we can also use (( ))
, as described in the section above:
if (( 8 > 7 ));then
echo "8 is greater than 7"
fi
These, confusingly, use the same operators as the non-numeric conditions in the [[ ]]
environment, rather than being the same as the numerical operators.
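The difference between these comparison types matters most when comparing numbers with different numbers of digits. In the sketch below (the variable names are arbitrary) the same two values are compared in three ways:

```shell
# String comparison: "10" sorts before "9", because "1" comes before "9"
if [[ 10 > 9 ]]; then string_result="true"; else string_result="false"; fi

# Arithmetic comparison within [[ ]]
if [[ 10 -gt 9 ]]; then numeric_result="true"; else numeric_result="false"; fi

# Arithmetic comparison within the math context
if (( 10 > 9 )); then math_result="true"; else math_result="false"; fi

echo "string: $string_result, numeric: $numeric_result, math: $math_result"
```

Only the first comparison is false, even though it uses the same > symbol as the third.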
Logical Test
Which of these conditionals are True, which are False, and which don’t work?
[[ 4 -eq 4 ]]
[[ be == eb ]]
[[ 4 == 4 ]]
[[ 4 == 04 ]]
(( 4 == 04 ))
(( 4 -eq 04 ))
Solution
- True.
- False.
- True (but a string, not arithmetic comparison).
- False, because it is a string comparison.
- True, because this is a math context.
- Error, because -eq can’t be used within a math context.
Arithmetic Comparisons of Strings
Be aware that if you try to make an arithmetic comparison of strings, the [[ ]] command converts any non-numeric strings to variable names.
a=2
b=6
if [[ "b" -gt "a" ]];then
echo "this is slightly less obvious"
fi
One bracket or two?
[[ ]] is a relatively new command, which is not universally available in other, non-bash, shells. Before this there was [ ] (a synonym for the test command), which is more portable than [[ ]], but is not as powerful or robust. [ ] won’t be covered here; more information on the differences between these can be found here: http://mywiki.wooledge.org/BashFAQ/031
To make the structure more useful, we can add alternate tests, to be conducted in sequence if the first fails, and a final default set of program statements which are executed if all conditional statements fail.
value=a
if [[ $value > g ]];then
echo "value is greater than g"
elif [[ $value < g ]]; then
echo "value is less than g"
else
echo "value must be g"
fi
There are three types of operators for these conditional expressions:
file, numeric, and non-numeric. Each will return true (0) if the condition is met,
or false (1) if the condition is not met. These operators are slightly different for
[[ ]]
and (( ))
expression testing, as shown in the following two tables.
conditional expressions for [[ ]]
| object or non-numeric | numeric or boolean | description |
|---|---|---|
| -e | | file exists |
| -d | | file is a directory |
| -f | | file is a regular file |
| -z | | string is empty (e.g. an unset or empty variable) |
| > | -gt | greater than |
| < | -lt | less than |
| | -ge | greater than or equal to |
| | -le | less than or equal to |
| == | -eq | equal |
| != | -ne | not equal |
| | ! | logical NOT |
| | && | logical AND |
| | \|\| | logical OR |
conditional expressions for (( ))
| numeric or boolean | description |
|---|---|
| > | greater than |
| < | less than |
| >= | greater than or equal to |
| <= | less than or equal to |
| == | equal |
| != | not equal |
| ! | logical NOT |
| && | logical AND |
| || | logical OR |
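The difference between the string and numeric operators can be checked directly. The following is a minimal sketch; the variable names are illustrative only, and each one simply records whether the test before it succeeded.

```shell
#!/usr/bin/env bash
# Compare 4 and 04 with a string operator, a [[ ]] numeric operator,
# and a (( )) math context operator.
[[ 4 == 04 ]]  && string_test=true  || string_test=false    # "4" and "04" are different strings
[[ 4 -eq 04 ]] && numeric_test=true || numeric_test=false   # numerically both are four
(( 4 == 04 ))  && math_test=true    || math_test=false      # math context is always numeric

echo "string: $string_test, numeric: $numeric_test, math: $math_test"
```

Only the string comparison fails, because == inside [[ ]] compares the characters rather than the values.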
Key Points
(( )) is math context, and enables the use of C’s integer arithmetic operators
$(( )) can be used in the same way as a command substitution
bash interprets numbers with a leading 0 (e.g. 010) as base 8; prefix strings or variables with 10# to force base 10
if statements can use both (( )) and [[ ]] commands for expression testing. These use different syntax, so be careful to check your code!
Advanced Loops
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How do you avoid repetition of code (but not actions) in your scripts?
How can we repeat actions an indeterminate number of times, but not get stuck in infinity?
Objectives
Explain While loops, and how to control their use
Why is waiting good, and how is it best used?
Sequences and indexes for loops
As well as looping through arrays directly, we can also write loops which iterate through series of integer values, which can then be used as indexes. This can be useful if you need to add or remove matching data from a series of arrays, etc.
Two methods can be used to do this. The first is to use the seq
command to build a
sequence of integer values, e.g.:
seq 4
The universal format for this command is seq [first] [increment] last, where the last
value must be defined, but the incremental value is optional, as is the first value (but
this must be defined if the incremental value is defined). Note that other options can be
added, but these will be system-specific, and won’t be covered here. The default values for both
the start and incremental values are 1. The sequence will be built according to these rules,
starting at the first value, and continuing until the next value created would be greater than
the last value (or less than it, if the increment is negative).
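As a short illustration of the three forms described above (the variable names are arbitrary, chosen only for this sketch):

```shell
#!/usr/bin/env bash
to_four=$(seq 4)           # seq last: counts from 1 in steps of 1
three_to_six=$(seq 3 6)    # seq first last: counts from 3 in steps of 1
by_fives=$(seq 2 5 20)     # seq first increment last: stops before passing 20

# The variables are left unquoted here so that the newlines between the
# numbers collapse into spaces when printed.
echo $to_four
echo $three_to_six
echo $by_fives
```

The last sequence ends at 17, because the next value (22) would be greater than the last value given.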
sequences
Please create a sequence of every 4th number between 3 and 37. Then create a sequence of these same numbers in reverse.
Solution
seq 3 4 37
seq 35 -4 3
(Note: because seq starts from the first number given, you have to use a priori knowledge to ensure that this sequence is the exact reverse of the first sequence).
real number maths
Although we are not interested in this feature for the purposes of indexing arrays, it should be noted that
seq
uses real numbers, not integers, and so more complex sequences of numbers can be created than are shown here.
To use seq within a for loop we must execute the command within its own subshell, e.g.:
for i in $(seq 1 31); do
echo $i
done
Trying to do this without the subshell will simply cause your loop to iterate through the
set of strings seq 1 31
.
These integer values can be used for referencing items within an array - but do remember
that bash array indexes start at 0, and adjust your seq
command accordingly:
varlist=( a list of strings )
for i in $(seq 0 3); do
echo ${varlist[$i]}
done
Indexing lengths
Because our index starts from 0, we must remember to end our loop at length-1. Not doing this in bash is unlikely to give a fatal error (unlike other, stricter, programming languages, for which it would cause a buffer overrun), but the empty string which is returned could cause issues for your workflow if you are not expecting it.
For
loops can be constructed using C-style notation too. These use a set of three
expressions (start conditional; end conditional; increment conditional) within a math
context, e.g:
varlist=( a list of strings )
for (( i=0; i<4; i++ )); do
echo ${varlist[$i]}
done
C-style integer increments
The code (( i++ )) increments the value of variable i by one, equivalent to using (( i+=1 )) or (more explicitly) (( i=$i+1 )). Negative increments can be performed by using - instead of +.
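A brief sketch of these increment forms, using an arbitrary starting value of 5:

```shell
#!/usr/bin/env bash
i=5
(( i++ ))       # i is now 6
(( i+=1 ))      # i is now 7
(( i=i+1 ))     # i is now 8 (the $ is optional inside a math context)
(( i-- ))       # i is now 7
(( i-=2 ))      # i is now 5
echo "$i"
```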
Looping through arrays without predetermined length
Often we want to loop through arrays without a predetermined number of elements, so it is useful to not have to hard-code the end number into the for loop.
1) Remembering that you can get the length of an array using ${#array[@]}, and that integer maths should be carried out within an arithmetic expansion ($(( ))), please adapt the seq loop above so that it determines the array length automatically, and uses that within the for loop.
2) Please do the same using the C-style notation loop.
3) Please adapt the C-style notation loop to run through the array in reverse order.
Solutions
1)
varlist=( a list of strings )
len=${#varlist[@]}
endvalue=$(($len-1))
for i in $(seq 0 $endvalue); do
    echo ${varlist[$i]}
done
2)
varlist=( a list of strings )
len=${#varlist[@]}
for (( i=0; i<$len; i++ )); do
    echo ${varlist[$i]}
done
3)
varlist=( a list of strings )
len=${#varlist[@]}
endvalue=$(($len-1))
for (( i=$endvalue; i>=0; i-- )); do
    echo ${varlist[$i]}
done
Note: the lines creating variables len and endvalue could be incorporated directly into the for loop statements, but they are explicitly stated here to make the solutions more readable.
While loops
So far we have worked with fixed length loops, however not all processes are a fixed length, and so a more open-ended solution is needed.
This is provided by the while
loop, which uses a conditional statement to check when the
loop should be exited.
For example, this can be used to break the loop after user input:
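halt=no
while [[ $halt != 'yes' ]]; do
    sleep 3
    echo "break out of the loop?"
    read halt < /dev/tty
done
Note the use of sleep, to pause for a given number of seconds before continuing with execution
of the loop. (The similarly named wait command does something different: it pauses until a background process has finished.)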
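To avoid getting stuck in an endless loop, a second condition can cap the number of iterations. This is a minimal sketch: the names marker_file, max_attempts, and attempts are hypothetical, and the file being waited for is deliberately never created, so the loop always stops at the cap.

```shell
#!/usr/bin/env bash
marker_file=$(mktemp -u)     # a temporary path that does not exist yet
max_attempts=3
attempts=0

# Keep checking for the file, but give up after max_attempts checks.
while [[ ! -e $marker_file ]] && (( attempts < max_attempts )); do
    (( attempts += 1 ))
    echo "attempt $attempts: $marker_file not found yet"
    sleep 1
done
```

After the loop, checking whether the file exists tells you if the loop ended because the condition was met or because the limit was reached.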
It can also be used to replicate (in a more convoluted, and less maintainable, manner) the for loops above:
varlist=( a list of strings )
len=${#varlist[@]}
i=0
while (( i<$len )); do
echo ${varlist[$i]}
(( i++ ))
done
While loops are useful for process control: for automating the checking to see if processes are finished, for example, and moving onto the next stage of the workflow once they are.
Tracking program progress
As an example of how while loops can be used to wait for a process to finish, we will create a function which waits for a random period before finishing. It writes its status to a log file, which we can use to track the progress of the program.
sleeptest () { echo 'started' > log.out ; sleep $(($RANDOM/1000)) ; echo 'finished' >> log.out ; }
Tracking the current status of the program can be done using tail, e.g.:
sleeptest &
tail -1 log.out
started
Using the -1 flag tells tail to only return the last line of the file.
Can you fill the three gaps in this while loop, so that it exits once the sleeptest function has ended?
sleeptest &
finished_tasks=0
job_limit=1
while [[ $finished_tasks ____ $job_limit ]]; do
    sleep 3
    finished_tasks=0
    log_tail=$(______)
    if [[ ______ ]]; then
        echo "finished a task"
        ((finished_tasks+=1))
    else
        echo "still going"
    fi
done
Solution
sleeptest &
finished_tasks=0
job_limit=1
while [[ $finished_tasks -lt $job_limit ]]; do
    sleep 3
    finished_tasks=0
    log_tail=$( tail -1 log.out )
    if [[ $log_tail == "finished" ]]; then
        echo "finished a task"
        ((finished_tasks+=1))
    else
        echo "still going"
    fi
done
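When the process was started in the background from the same shell, polling a log file is not the only option: the bash builtin wait pauses the script until a given background job has exited. The sketch below assumes a hypothetical slow_task function as a stand-in for a long-running program.

```shell
#!/usr/bin/env bash
# slow_task is a hypothetical stand-in for a long-running program.
slow_task () { sleep 2; echo 'finished' > "$1"; }

log_file=$(mktemp)
slow_task "$log_file" &      # run in the background
task_pid=$!                  # $! holds the process ID of the last background job

wait "$task_pid"             # block here until that job has exited
task_status=$?               # wait returns the exit status of the job
log_tail=$(tail -n 1 "$log_file")    # tail -n 1 is equivalent to tail -1
echo "task exited with status $task_status, log says: $log_tail"
rm -f "$log_file"
```

Polling with a while loop remains the more general approach, since it also works for processes that were not started by the current shell.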
Key Points
seq [first] [increment] last creates a sequence of (real) numbers
for loops can be controlled using seq or C-style notation
${#array[@]} is useful for setting these sequences
while loops use conditional statements, and aren’t fixed in length like for loops
while loops can be used for process control
Sed and regular expressions
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How can you edit text files within your scripts?
How can you make your searches more powerful?
Objectives
Detail regular expressions, and how they are used
Learn how to use regular expressions for string replacement
Workflows often require the editing of configuration files or scripts, or the searching of these for specific information to copy. This lesson will introduce a tool for editing files, as well as regular expressions, which can be used to make your searches more powerful.
Sed is a stream editor. It can be used to perform basic text transformations on an input stream, which can be a file, or it can be passed from a pipeline, allowing it to be combined with other tools.
To demonstrate the basic usage of sed, we will create a text file containing the string
hello
, and use sed to change this to world
:
echo hello > input.txt
sed -e 's/hello/world/' input.txt
world
The -e
flag indicates that the first parameter (s/hello/world/
) is a script to be
applied to the stream, while the following non-option parameters (input.txt
) are input
files. Sed will, by default, print all the processed input; to save this for later use
you will need to redirect the output into a file:
sed -e 's/hello/world/' input.txt > output.txt
The script we are using here is a substitution (indicated by the s
at the start of the
script). The first string (hello
) is what is being searched for, while the last string
(world
) is the string that will be used as a replacement.
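Because sed works on a stream, the same substitution can be applied to text arriving through a pipe rather than from a file. A minimal sketch (the variable name greeting is illustrative):

```shell
#!/usr/bin/env bash
# No input file is given, so sed reads the stream coming through the pipe.
greeting=$(echo "hello" | sed -e 's/hello/world/')
echo "$greeting"
```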
Sed can be passed multiple inputs, which it will treat as a single stream:
echo goodbye > input_pt2.txt
sed -e 's/goodbye/world/' input.txt input_pt2.txt
hello
world
It can also be passed multiple scripts, which will be run in sequential order:
sed -e 's/goodbye/world/' -e 's/l/a/' input.txt input_pt2.txt
healo
worad
You will note that both scripts have been applied to both of the input
files, but only one l has been replaced on each line. This is because the search was not global; that
is, as soon as one matching string is found on a line it is replaced, and sed moves on to the next script.
If you want to replace all matching strings, then you need to specify that the script
should be applied globally, by appending a g
at the end of that script. E.g.:
sed -e 's/goodbye/world/' -e 's/l/a/g' input.txt input_pt2.txt
heaao
worad
Rather than search the whole input for a match you can, if you know the exact line you wish to process, specify a line number:
echo 'hello' > input_lines.txt
echo 'hello' >> input_lines.txt
sed '2 s/hello/world/' input_lines.txt
hello
world
Note that sed starts indexes from 1, not 0.
Providing scripts via a file
Sed allows the use of a file for providing the scripts for processing your files. This file should contain one script per line, and is passed using -f instead of -e, e.g.:
echo 's/goodbye/world/' > myscript.sed
sed -f myscript.sed input.txt input_pt2.txt
hello
world
Editing dates for a complex configuration file
Jon has a configuration file (namelist.input in the wrf_configuration directory) that he uses for running the Weather Research and Forecast (WRF) model. He wants to run this daily, keeping all of the configuration the same except for the start and end dates. These will need to be changed each day that the model is run, so that the start date is today, and the end date is today + 3 days. Can you automate this using sed? The template configuration file itself does not need to be usable, so can be modified if that would help you.
Solution
There are two ways to do this. The first is to specify the lines you want to change, and change only those lines. However this is a fragile solution, as any changes to the template configuration file could change the line numbering, breaking your script.
A more robust solution is to replace the dates in the template file with clear identifier strings. These should be something you would not normally see in a script (e.g. I tend to use a string such as %%DAY%% or %%MONTH%%). Then you can carry out a global sed action for each string requiring changing, without worrying that you might make any unwanted changes.
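The identifier-string approach could be sketched as follows. Everything specific here is an assumption made for illustration: the two-line template, the %%START%% and %%END%% identifiers, and the fill_dates function name are hypothetical, and a real namelist.input contains many more settings in its own date format. The date -d option is the GNU (Linux) syntax.

```shell
#!/usr/bin/env bash
# Hypothetical two-line template standing in for the real configuration file.
template=$(mktemp)
printf '%s\n' 'start_date = %%START%%' 'end_date = %%END%%' > "$template"

# fill_dates TEMPLATE START END: print the template with both identifiers replaced.
fill_dates () {
    sed -e "s/%%START%%/$2/g" -e "s/%%END%%/$3/g" "$1"
}

# GNU date syntax (Linux); the relative-date option differs on OSX.
start=$(date +%Y-%m-%d)
end=$(date -d "+3 days" +%Y-%m-%d)

fill_dates "$template" "$start" "$end"
```

Redirecting the output of fill_dates into a new file leaves the template untouched, so it can be reused the next day.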
Regular Expressions
In the Finding Things
lesson in the shell novice course you were introduced to using wildcards, such as .
, in
your grep
searches. Grep uses regular expressions (often abbreviated to regex),
sequences of characters which define the string(s) to be searched for. Sed uses the same
regex patterns for its searches, and we will cover some basic principles of using these
here.
Regex syntax and interoperability
Regular expressions are implemented in a number of different programming languages. These all follow similar rules, but there will be differences, often subtle, between each of these implementations.
Many implementations follow the feature-rich regex syntax that was developed first for the Perl language. However UNIX command line programs tend to use the older ‘POSIX’ regex standards. These are further split into the POSIX Basic (BRE) and POSIX Extended Regular Expression (ERE) standards. Below we will teach you the ERE standard, because this has a more readable syntax which is closer to that of the more modern regex implementations. More information on the difference between the two POSIX standards can be found here.
To use ERE in sed the
-E
flag must be used. We will do this below, even in situations where it is not necessary, to get you in the habit of using it in your own code.
Regular expressions rely on the use of literal characters and metacharacters to construct
the search term. Metacharacters are characters which have a special meaning (such as .,
which represents any single character). If you wish to search for a literal character which
happens to be a regex metacharacter, then it will need to be “escaped”, that is preceded
by a \ character. For example:
echo "Hello. World" > input.txt
sed -E -e 's/\./.../' input.txt
Hello... World
Note that the string which is being used as a replacement is not a regex pattern, so the periods in this did not need escaping.
Forgetting to escape a metacharacter
What string would sed return if the \ character was not used in the above regex?
Solution
The first character found will be replaced, giving an output:
...ello. World
The search was not global though, so the rest of the string remains unchanged. Only if a g were added to the end of the script would it replace every character in the string with three periods.
Matching ranges of characters
One of the most common patterns used in regex is the definition of a list or range of characters, which can be denoted using square brackets. E.g.:
sed -E -e 's/[HW]/J/g' input.txt
Jello. Jorld
This list of upper case characters to replace is very focused, but if you did not know in
advance what the upper case characters would be you can use the list [A-Z]. Similarly,
to replace all lower case letters use the list [a-z], and to replace any digit use
[0-9]. These can be combined as you require, for example, to match all characters (of
any case) between B and H you would use [B-Hb-h].
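These lists can be tried out on a short test string. In this sketch the string and variable names are arbitrary, and LC_ALL=C is set as a precaution, since the behaviour of ranges such as [A-Z] can vary between locales.

```shell
#!/usr/bin/env bash
export LC_ALL=C     # ranges are only guaranteed to behave this way in the C locale
text="Hello. World 2021"
upper=$(echo "$text" | sed -E -e 's/[A-Z]/X/g')       # every upper case letter
digits=$(echo "$text" | sed -E -e 's/[0-9]/#/g')      # every digit
b_to_h=$(echo "$text" | sed -E -e 's/[B-Hb-h]/_/g')   # letters B to H, in either case
printf '%s\n' "$upper" "$digits" "$b_to_h"
```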
Creating new strings
What regex expressions would you use to create the following strings from the Hello. World string in the input.txt file?
Halla. Warld
Heno. Worn
Solution
sed -E -e 's/[eo]/a/g'
sed -E -e 's/l[ld]/n/g'
Matching Repeated Instances
It can be useful to match more, or less, than a single instance of a particular element in the search string. This can be done by adding one of these special characters:
* matches the preceding element zero or more times
+ matches the preceding element one or more times
? matches the preceding element zero or one time
{VALUE} matches the preceding element the number of times defined by VALUE; ranges can be defined by {VALUE,VALUE}
The elements that these can be used on can be either single characters, or sets of characters. E.g.:
sed -E -e 's/l{2}o/n/g' input.txt
Hen. World
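The four repetition forms can be compared side by side on the same string. This is an illustrative sketch only; the variable names are arbitrary.

```shell
#!/usr/bin/env bash
text="Hello. World"
star=$(echo "$text" | sed -E -e 's/l*o/0/g')        # zero or more l, followed by o
plus=$(echo "$text" | sed -E -e 's/l+o/0/g')        # one or more l, followed by o
optional=$(echo "$text" | sed -E -e 's/l?d/D/g')    # d, with or without an l before it
range=$(echo "$text" | sed -E -e 's/l{1,2}/L/g')    # one or two l in a row
printf '%s\n' "$star" "$plus" "$optional" "$range"
```

Note the difference between the first two results: l*o also matches the o in World (which has no l in front of it), whereas l+o does not.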
This is particularly useful for changing date strings, e.g.:
YEAR=2021
sed -E -e "s/[0-9]{4}/${YEAR}/g" <(echo 'the date is: 23-04-2020')
the date is: 23-04-2021
Here we change the date to that set in a previously set variable (note the use of double quotation marks, so that the shell will interpret the string and replace the variable name with the required value).
Matching Line Endings
The ^
and $
metacharacters can be used to respectively assert the position of the
start or end of a line. This allows you to “anchor” your search at either end of a line.
For example, if we are provided with a YEAR variable which only contains the last two
digits, but we know that the year digits will always be at the end of the line, we can
search for [0-9]{2}
without risking changing the day or month:
YEAR=21
sed -E -e "s/[0-9]{2}$/${YEAR}/g" <(echo 'the date is: 23-04-2020')
the date is: 23-04-2021
Back References and Subexpressions
A back-reference is a regex command which refers to a previous part (or subexpression) of
the matched regular expression. They can be used to repeat patterns within a regex search
or, as we will do here, pass part of the matched regex forward to the replacement string.
Back references are specified by a single escaped digit (e.g. \1
; up to nine are allowed
in a single regex), while the subexpression is indicated using ()
brackets.
A common use of these is pulling out a single element of the search, e.g. the year from a date string:
date | sed -E -e "s/^.*([0-9]{4}).*$/\1/g"
2021
Note how the 4-digit year is stored in a subexpression, while the strings before and after it are included in the match using ^.*
and .*$
.
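Several subexpressions can be used at once, for example to reorder the parts of a date. This sketch assumes a date in DD-MM-YYYY form; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Three subexpressions capture day, month, and year; the replacement reorders them.
uk_date='23-04-2020'
iso_date=$(echo "$uk_date" | sed -E -e 's/^([0-9]{2})-([0-9]{2})-([0-9]{4})$/\3-\2-\1/')
echo "$iso_date"
```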
BASH logic and regex
In the logic and maths lesson you were introduced to the [[ ]]
command, which is used
for logical control structures. This command also allows regex patterns to be used,
checking to see if a given string matches the regex or not. This comparison is performed
using the =~
operator. For example:
YEAR=1999
if [[ $YEAR =~ ^[0-9]{2}$ ]]; then
echo "year is in 2 digit format"
elif [[ $YEAR =~ ^[0-9]{4}$ ]]; then
echo "year is in 4 digit format"
else
echo "year is in unrecognised format"
fi
year is in 4 digit format
Note that the ^
and $
metacharacters have been used to ensure the pattern matches the
whole string, and that no partial matches are made by mistake.
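Subexpressions also work with the =~ operator: after a successful match, bash stores the text matched by each subexpression in the BASH_REMATCH array. A minimal sketch, assuming a DD-MM-YYYY date string (the variable names are illustrative):

```shell
#!/usr/bin/env bash
datestring='23-04-2020'
# Storing the pattern in a variable avoids quoting problems inside [[ ]].
pattern='^([0-9]{2})-([0-9]{2})-([0-9]{4})$'
if [[ $datestring =~ $pattern ]]; then
    day=${BASH_REMATCH[1]}
    month=${BASH_REMATCH[2]}
    year=${BASH_REMATCH[3]}
    echo "day=$day month=$month year=$year"
else
    echo "not a DD-MM-YYYY date"
fi
```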
Further Learning
Library Carpentry have a longer introduction to regex course (from which some of this material has been taken). If you will be working with, and processing, a lot of text files then you will find this course useful. Do note, however, it is written with the more advanced regex implementations in mind, so some features mentioned in that course will not be available for shell programming.
Key Points
sed performs basic text transformations on an input stream
The basic usage is sed -e 's/pattern/replacement/' input.txt
Multiple scripts can be chained, by using additional -e 's/pattern/replacement/' declarations
Matches will be made on the first instance of the pattern on each line, or all matches can be found by using s/pattern/replacement/g
Extended regular expressions can be enabled with the -E flag
Specify character ranges using [A-Z0-9]
Repeat single characters or ranges by appending *, +, ?, or {RANGE}
Match the start and end of lines using ^ and $, respectively
Special characters can be matched if they are escaped by prepending \
Capture subexpressions with ( ), and back-reference these in your pattern or replacement text using \1 - \9
Regex can be used in logic tests, with the =~ operator
Regex are easier to write than to read. Document yours well!
Symbolic Links
Overview
Teaching: 10 min
Exercises: 10 min
Questions
How can you reuse one file in multiple directories?
How can you ease moving around your file system?
Objectives
Detail symbolic links, and how they are used
Learn when not to use symbolic links, and how to avoid pitfalls in using them.
A symbolic link, also called a soft link, is a pointer which enables you to find another file, much like a shortcut in Windows. Like these it is useful for creating shortcuts within the file system, for simplifying the file paths used by other programs, or easing your navigation between different work directories in a networked system (important when working on HPC systems).
It is important to remember that symbolic links do not point directly to any data that might be in the target, they instead point to the file system itself. This allows you to link to either files or directories using the same command, and also to link to filesystems hosted on remote computers. But it also means that there is a high risk of data loss if the remote files are moved or deleted. Because of this it is recommended that you use them sparingly in your workflows.
We will use the dataset from the BASH introduction course to demonstrate the use of links.
A symbolic link can be created using the command:
cd ~/Desktop/
ln -s data-shell/molecules
ls -l molecules
lrwxrwxrwx 1 user group 20 Feb 8 20:00 molecules -> data-shell/molecules
This has created a symlink to the molecules
directory, with the name molecules
.
Like cp
, ln
will default to the given object name, but unlike the copy command it does
not need to be given a destination location.
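The default naming can be compared with an explicitly named link. The sketch below works in a temporary directory, and all of the directory and link names are hypothetical.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p project/data

ln -s project/data            # no link name given: the link is called "data"
ln -s project/data inputs     # link name given: the link is called "inputs"

# readlink prints the target that a symlink points to.
default_target=$(readlink data)
named_target=$(readlink inputs)
echo "data -> $default_target"
echo "inputs -> $named_target"
```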
You can identify symlinks by the @
following their name if ls -F
is used:
ls -F
data-shell/ data-shell.zip molecules@
You can use cd
to enter, and exit, this directory, as you would any other directory:
cd ~/Desktop/molecules
pwd
cd ..
pwd
/home/jon/Desktop/molecules
/home/jon/Desktop
This is because the cd
command is able to resolve, or track, the symlinks by processing
them after following the ..
path in the second cd
command. We can disable this ability,
using the -P
flag (which forces cd
to resolve the symlinks to the original directory
structure before following ..
paths):
cd ~/Desktop/molecules
pwd
cd -P ..
pwd
/home/jon/Desktop/molecules
/home/jon/Desktop/data-shell
In this case, even though we seemed to be in ~/Desktop/molecules
, using ..
while using
the -P
flag takes us to the ~/Desktop/data-shell
directory, because that is the true
parent directory. This relationship is made explicitly clear if we use -P
for the first
cd
command:
cd -P ~/Desktop/molecules
pwd
cd ..
pwd
/home/jon/Desktop/data-shell/molecules
/home/jon/Desktop/data-shell
In this case we arrive directly in the original directory with the first cd
command,
meaning that it does not matter whether we use the -P
flag or not for the second command,
we will always arrive back in the data-shell
directory.
Shell intrinsic commands, such as cd and pwd, are able to make use of the shell’s tracking
of the symlink, so that they deal with the directory structure as we would expect. External commands,
such as ls and cp, are not able to do this, and so they always resolve symlinks to the
original directory structure before following .. paths.
Because of this behaviour, it is advised that you avoid using ..
paths which cross a
symlink in your scripts - in this situation it would be safer to use the absolute path
(or a path relative to a fixed point, such as your home directory ~/
).
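The two views of the directory structure can be compared in a script using pwd -L (the path as tracked by the shell) and pwd -P (the path with symlinks resolved). This is a sketch in a temporary directory; the names real, target, and shortcut are hypothetical.

```shell
#!/usr/bin/env bash
workdir=$(cd "$(mktemp -d)" && pwd -P)    # resolved path of a fresh temporary directory
mkdir -p "$workdir/real/target"
ln -s real/target "$workdir/shortcut"

cd "$workdir/shortcut" || exit 1
logical_path=$(pwd -L)       # the path as followed, symlink included
physical_path=$(pwd -P)      # the path with the symlink resolved

cd .. || exit 1
logical_parent=$(pwd -P)     # cd tracked the symlink, so this is $workdir

cd "$workdir/shortcut" || exit 1
cd -P .. || exit 1
physical_parent=$(pwd -P)    # the symlink was resolved first, so this is $workdir/real

printf '%s\n' "$logical_path" "$physical_path" "$logical_parent" "$physical_parent"
```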
Symlinks can be removed without destroying the object they point to:
cd ~/Desktop
ls -l molecules
rm molecules
ls -ld data-shell/molecules
lrwxrwxrwx 1 user group 20 Feb 8 19:58 molecules -> data-shell/molecules
drwxr-xr-x 2 user group 4096 Feb 8 11:36 data-shell/molecules
Although we can leave the name of the symlink the same as the original object, one of the most useful features of symlinks is being able to rename files without moving or changing the original file.
For example, in data-shell/data/elements/
we have xml
files describing each atom. Each
of these is named using the periodic table symbol, e.g. N.xml
is the Nitrogen descriptor.
However, we have a program which is expecting the files to have the full atom name, e.g.
Nitrogen.xml
. We can easily enough create these symlinks, e.g.:
cd ~/Desktop/data-shell/data
mkdir elements-fullnames
cd elements-fullnames
ln -s ../elements/N.xml Nitrogen.xml
ls -l
lrwxrwxrwx 1 user group 17 Feb 8 21:32 Nitrogen.xml -> ../elements/N.xml
Doing this will enable us to use the program, without having to create all the input files again. Do keep in mind though that, although you can delete symlinks without deleting the original file, if a program tries to write to a symlink, it will write to the original file. This method is suitable for easily replicating or renaming input files. Extreme caution should be used if you use the same method for output or log files.
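The following sketch demonstrates both behaviours in a temporary directory: writing to a symlink changes the original file, while deleting the symlink leaves the original in place. The file names are hypothetical.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "original contents" > original.txt
ln -s original.txt link.txt

echo "written via the link" > link.txt    # the write follows the link
original_now=$(cat original.txt)
echo "original.txt now contains: $original_now"

rm link.txt                               # removes only the link
[[ -f original.txt ]] && echo "original.txt still exists"
```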
Scripting the linking of all atom files.
There are over 100 atom files in the elements directory; linking to each of these by hand would be quite painful. Fortunately these are text files, and each of them contains the full name of the element in the first line of the file, e.g.:
head -1 ~/Desktop/data-shell/data/elements/N.xml
<element name="Nitrogen"/>
To strip the atom name out of this string you can use sed:
head -1 ~/Desktop/data-shell/data/elements/N.xml | sed -E -e 's/^.*"([A-Za-z]*)".*$/\1/'
Nitrogen
Please write a bash script which will use a for loop and this string processing pipeline to create links that use the full element names for these files within whatever directory it is run.
Solution
for orig_file in "${@}"
do
    element_name=$(grep -i 'name=' "${orig_file}" | sed -E -e 's/^.*"([A-Za-z]*)".*$/\1/')
    ln -s "${orig_file}" "${element_name}.xml"
done
This script should be run using:
bash link_script.sh ~/Desktop/data-shell/data/elements/*.xml
If you have a hammer, every problem is a nail
In the above solution we use sed and regex to extract the string we require from the xml file. There are other bash tools that could do this for us, and in ways which are arguably more readable (and being as readable as possible is a good trait for code). Can you identify a tool from either these lessons, or the bash introduction lessons, that you could use for this, and adapt your script to use this tool?
Solution
The cut tool can be used to split the text we need, by using " as the delimiter:
head -1 ~/Desktop/data-shell/data/elements/N.xml | cut -d '"' -f 2
Nitrogen
Using this command instead would make your code more readable. Both solutions do still make similar assumptions about what format the string will take though (and the head command also makes major assumptions about the formatting of the file). These assumptions can make your scripts fragile. This fragility can be addressed by either well documenting what inputs you expect the script to have, or by building in extra checks to your code. The choice of which of these solutions to use depends on the script itself - which solution do you think would be the most appropriate for this script?
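As one possible illustration of building in extra checks, the linking step could be wrapped in a function that validates the extracted name before creating the link. This is a sketch only: the function name link_element, the two checks chosen, and the messages are assumptions made for this example, not part of the original solution.

```shell
#!/usr/bin/env bash
# link_element FILE: link FILE into the current directory under its full
# element name, read from the first line of the file.
link_element () {
    local orig_file=$1
    local element_name
    element_name=$(head -n 1 "$orig_file" | cut -d '"' -f 2)

    # Check 1: the extracted name must be a single word made only of letters.
    if [[ ! $element_name =~ ^[A-Za-z]+$ ]]; then
        echo "skipping $orig_file: no element name found on line 1" >&2
        return 1
    fi
    # Check 2: never replace a file or link that is already there.
    if [[ -e ${element_name}.xml || -L ${element_name}.xml ]]; then
        echo "skipping $orig_file: ${element_name}.xml already exists" >&2
        return 1
    fi
    ln -s "$orig_file" "${element_name}.xml"
}
```

It could then be called inside the for loop from the earlier solution, e.g. link_element "${orig_file}", with the non-zero return value used to count or report the files that were skipped.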
Key Points
Symbolic links to objects (files or directories) can be created using ln -s
These are links to the object, not its contents, so the contents can change or be deleted
Symbolic links can cross physical disks, and so are useful in networked filesystems
Caution must be exercised when following .. paths across symbolic links
They are most useful for linking to, and/or renaming, input and configuration files or directories