Dates, Scheduling, and Downloading Files

Overview

Teaching: 10 min
Exercises: 5 min

Questions

How can we deal with date maths on the command line?

How can we schedule regular compute jobs?

How can we download files outside of a web-browser?

Objectives

Understand some of the tools available for addressing common, recurring needs.

Date maths

On UNIX systems the system time and date information can be found using the date command:

date

Wed 27 Jan 2021 11:52:17 GMT

This will return a formatted string (likely) containing information on the day, date, time, and timezone, as shown above.

This command can be used for more than just finding the current time, though. It can also create formatted strings containing other dates and times, and calculate dates and times relative to either the current time or another given time. Here we will give a short introduction on how to use this command on linux systems.

Linux vs OSX

Although all UNIX systems will have a date program, the exact syntax for using it can vary from system to system, due to differing implementations of the Unix standards. The two systems you are most likely to encounter are linux-based systems (such as debian), which use the GNU version of date, and the BSD-derived OS-X system. These use quite different syntaxes, so care must be taken to use the correct syntax for the system you are on.

Below we will focus on the linux syntax, as this is most common for HPC environments. Where the OSX syntax is different this will be highlighted in a separate note.

The output string for the date command can be formatted by adding +[format string]. The options for these are defined in the man page for date, but some examples are %Y, for 4-digit year, %m for 2-digit month, and %d for 2-digit day of month. These can be used individually, or combined as you wish. E.g.:

date +%Y

2021

date +%Y-%m

2021-01

date +"%Y %d"

2021 27

In the last example we enclose the format string in quotation marks, to allow inclusion of a space in the formatted output.
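A common use of format strings in scripts is building timestamped file names. The following is a minimal sketch of that idea; the results_ prefix and .log suffix are hypothetical, chosen only for illustration, and %H, %M, and %S (hour, minute, second) are further format options listed in the man page.

```shell
# Build a sortable timestamp: 4-digit year, month, day, then hour, minute, second
stamp=$(date +%Y%m%d_%H%M%S)

# Use it in a (hypothetical) log file name
logfile="results_${stamp}.log"
echo "$logfile"
```

Because the fields run from the largest unit to the smallest, file names built this way sort chronologically in a directory listing.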

To display a date which is not now, you can use -d [date string]. The most useful (for our purposes) date string here is YYYYMMDD, e.g.:

date -d "20120423"

Mon 23 Apr 00:00:00 BST 2012
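The -d option can be combined with a format string, which is handy for converting a date from one layout to another. The sketch below uses the linux (GNU) syntax and the same date as above; LC_ALL=C is set only so that the day name is printed in English regardless of your locale settings.

```shell
# Convert YYYYMMDD into an ISO-style date string
iso=$(date -d "20120423" +%Y-%m-%d)
echo "$iso"        # 2012-04-23

# Ask for the full weekday name of the same date
weekday=$(LC_ALL=C date -d "20120423" +%A)
echo "$weekday"    # Monday
```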

Displaying dates that are not ‘now’ on OSX

On OSX you have to explicitly set the format of the date string that you pass to date, using -f "[format string]" "[date string]". You will also need to ensure that date does not try to reset the system clock, by passing the -j flag too. The OSX command equivalent to the linux command above is:
date -j -f "%Y%m%d" "20120423"

To calculate an offset from a given date (either ‘now’, or a supplied date), you add the desired offset into your date string. This offset can be composed of a number of different elements, for example:

date -d "20210127 +3 day +1 month -18 year"

Sun  2 Mar 00:00:00 GMT 2003

The advantage of using date to do this calculation for you is that it can deal with transitioning across month and year boundaries easily - to do this by hand would require a lot of checks for the lengths of months, leap years, etc.
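As a short illustration of these boundary cases, the sketch below (linux/GNU syntax) steps back one day from 1 March in a leap year and in a non-leap year, and forward one day across a year boundary. The dates are arbitrary examples.

```shell
# The day before 1 March depends on whether the year is a leap year
leap=$(date -d "20200301 -1 day" +%Y%m%d)
nonleap=$(date -d "20210301 -1 day" +%Y%m%d)
echo "$leap $nonleap"     # 20200229 20210228

# Stepping forward one day across a year boundary
newyear=$(date -d "20201231 +1 day" +%Y%m%d)
echo "$newyear"           # 20210101
```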

Offset calculation order

You should note that the offsets are applied starting from the largest unit (years) and working downwards to the smallest, regardless of the order in which you write them. This is most important where the offset will cross month boundaries, but could be important to remember in other scenarios too.
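The effect can be seen with a date at the end of a month. In this sketch (linux/GNU syntax, arbitrary example date) the same two offsets are written in both orders and give the same answer. One way to force the day offset to be applied first, shown here as a workaround rather than a feature of date itself, is to chain two separate date calls.

```shell
# Writing the offsets in either order gives the same result
a=$(date -d "20210131 +1 day +1 month" +%Y%m%d)
b=$(date -d "20210131 +1 month +1 day" +%Y%m%d)
echo "$a $b"      # 20210304 20210304

# To apply the day offset first, do the calculation in two steps
step1=$(date -d "20210131 +1 day" +%Y%m%d)     # 20210201
c=$(date -d "${step1} +1 month" +%Y%m%d)
echo "$c"         # 20210301
```

The first result lands in March because adding one month to 31 January gives the non-existent 31 February, which date rolls forward into March before the extra day is counted.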

Calculating date offsets on OSX

On OSX the offset is given as one or more separate strings, one for each element of the offset that is required, each preceded by a -v flag. You should also note that the offset elements are applied in the order in which you provide them, giving more explicit control over this process than you have with the linux date command.

The OSX command equivalent to the linux command above is:
date -j -v "-18y" -v "+1m" -v "+3d" -f "%Y%m%d" "20210127"

Scheduling tasks with CRON

Often workflows need to be repeated at regular intervals - checking for updates in source data, running maintenance tasks, or producing regular updates to services. These tasks can be automated using the cron job scheduler. Cron is available on most UNIX based systems, although on many HPC platforms access to cron will be blocked for ordinary users. You can find out if you have it installed (and what tasks you have scheduled) using the crontab (cron table) command:

crontab -l

To configure cron it is easiest to create a configuration file, which is then installed using the command below. Note that this replaces any crontab you already have.

crontab [config.txt]

Each line of this file will represent a job, and will look like:

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │                                   7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * * <command to execute>

This notation can be a little confusing to use, but online tools such as crontab.guru are available for checking your configurations, to ensure that cron will run jobs at the times you expect.
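To make the layout concrete, here is a minimal sketch that writes a one-job configuration file. The file name example_crontab.txt and the script and log paths are hypothetical; the job is set to run at 6:30 am every day, and its output and error messages are appended to a log file so that problems can be diagnosed later.

```shell
# Write an example crontab file (script and log paths are hypothetical)
cat > example_crontab.txt <<'EOF'
# minute hour day-of-month month day-of-week command
30 6 * * * /home/user/scripts/daily_update.sh >> /home/user/daily_update.log 2>&1
EOF

# Inspect the file; it could then be installed with: crontab example_crontab.txt
cat example_crontab.txt
```

Lines starting with # are treated as comments by cron, so a header line like the one above is a cheap reminder of the field order.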

Scheduling jobs

Jon wants to set two data download tasks, one to run at 2:45 am every Monday, and one which runs at 2:00 pm on the 1st and 15th of each month. What notation should he use to set these jobs? (crontab.guru can be used to work this out.)

Solution

45 2 * * 1 - 2:45 am on every Monday

00 14 1,15 * * - 2pm on the 1st and 15th of the month

OS-X and cron

OS-X does include cron, but it is difficult to use because of the security settings.

If you wish to use cron on OS-X you will need to enable Full Disk Access within the Security & Privacy settings menu for the program /usr/sbin/cron.

Downloading using wget

Often workflows involve input or source file downloads, and it is useful if these can be automated on the command line, rather than relying on using a web-browser for this.

Wget enables the retrieval of files using the widely used HTTP, HTTPS, FTP, and FTPS protocols (much as most web-browsers do). It has many features, such as resuming aborted downloads, using filename wildcards and recursive searches, and using timestamps to determine whether documents need redownloading. Here we will cover only the basic usage; for more details and advanced usage check the GNU Wget documentation.

The basic usage is to simply give the URL of a file that you wish to download. If it exists, and is accessible, then wget will download it and save it in your current working directory. E.g.:

wget http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/meteo_ABED_2020-01-04_1800_data.txt

This can be used for directories as well as files, but when used on a directory what is returned is a file listing the directory contents (as you would see in a web-browser). E.g.:

wget http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms

To download all the contents of a directory you need to use the recursive -r flag. In most cases this should be combined with the -np flag, which tells wget that you don’t want it to crawl up to parent directories. You can also restrict the files you download based on their file-extension string using the -A flag.

For example, to download a single file use:

wget -r -np http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/meteo_ABED_2020-01-04_1800_data.txt

This creates a directory structure following the http path structure, containing the file you requested.

To download all the text (.txt) files in that directory use:

wget -r -np -A txt http://manunicast.seaes.manchester.ac.uk/charts/manunicast/20200105/d02/meteograms/

This creates the same directory structure, but it will contain all the text files in that directory.
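In a scripted workflow the download address is often built from a date. The sketch below combines date with the URL used above. It assumes, based only on that single example, that each directory is named by a run date (YYYYMMDD) and contains a file stamped 1800 on the previous day; this pattern is an assumption and should be checked against the archive before relying on it. The wget call is left commented out so the snippet can be tried without downloading anything.

```shell
# Run date in YYYYMMDD form (taken from the example URL above)
run_date="20200105"

# Assumed pattern: the file is stamped with the day before the run date
prev_day=$(date -d "${run_date} -1 day" +%Y-%m-%d)

base="http://manunicast.seaes.manchester.ac.uk/charts/manunicast"
url="${base}/${run_date}/d02/meteograms/meteo_ABED_${prev_day}_1800_data.txt"
echo "$url"

# Uncomment to perform the download
# wget "$url"
```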

Restricting webcrawling to a single domain

Although it is not important for our interest in downloading data from internet archives, when downloading data from a website it can be useful to ensure that wget doesn’t follow links off that site. This can be done using the -D [domain] flag, where [domain] is the web domain that you wish to restrict wget to. In the examples given above we could add -D manchester.ac.uk if we wished to restrict searches to the University of Manchester domain, or even -D manunicast.seaes.manchester.ac.uk if we wished to restrict searches to the ManUniCast site only.

Using wildcards in file names

The HTTP protocol does not support wildcards, so using wildcards would not work in the examples given above. Wget can also use the FTP protocol though, which does support the use of wildcards, so if you are downloading data from an FTP server these are an option.

Scheduling jobs

Go to the data_download directory inside the data-adv-shell directory. Here there is a crontab file crontab_settings.txt and a script for downloading some data manunicast_download.sh.

Edit the crontab file to:

set a time which is 3-4 minutes in the future

set the correct path to the manunicast_download.sh file

Then submit the crontab settings using:
crontab crontab_settings.txt
If it works you will find the downloaded file on the Desktop in a few minutes. If not, check the error message in stderr.log (also on the Desktop) to see what the problem was.

Once this has worked you can unset the crontab using:
crontab -r
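If you would rather not work out the time by hand, date can supply the minute and hour fields for a time a few minutes ahead. This is a minimal sketch using the linux (GNU) syntax; /path/to/ is a placeholder for wherever the script lives on your system, and the printed line is only a suggestion for what the job entry might look like, since the actual contents of crontab_settings.txt are not reproduced here.

```shell
# Minute and hour fields for a time four minutes from now
fields=$(date -d "+4 minutes" +"%M %H")

# Print a candidate crontab line (the script path is a placeholder)
echo "${fields} * * * /path/to/manunicast_download.sh"
```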

Key Points

Using date for your date and time calculations will save you a lot of hassle

crontab can be used for scheduling regular tasks

wget can be used for scripting your data download process

Tool syntax is not always consistent across different unix flavours

BASH Programming for Workflow Management
