Automation¶
The command line keeps track of prior commands that have been run. This history can be used both interactively, to select prior commands to re-run
- up and down arrow keys to bring back prior commands
- CTRL+r to interactively search through prior commands
and through the history command, which displays the most recent commands. This lets us process our history of commands with our
commands.
As an example, we created a variety of pipelines for extracting information from the igc files, such as
[tyson@gra-login2 ~]$ grep '^B' flights/0144f5b1.igc | cut -c 31-35 | sort | tail -n 1
which gives the highest GPS altitude recorded in the given igc file. After such a session we can use the history
command along with output redirection to save the commands we came up with to a file for future reference or for
sharing with a colleague
[tyson@gra-login2 ~]$ history 20 > flight-commands
where the 20
specifies that we want the last twenty commands run. From this it is trivial to open our file up
in a text editor like nano
and clean it up a bit to get a nice reference
[tyson@gra-login2 ~]$ nano flight-commands
...
[tyson@gra-login2 ~]$ cat flight-commands
Reference of useful pipes for working with igc files extracted from history
Date
grep '^HFDTE' flights/0144f5b1.igc | cut -c 6-
Pilot
grep '^HFPLT' flights/0144f5b1.igc | cut -d : -f 2
Plane
grep '^HFGIDGLIDERID' flights/0144f5b1.igc | cut -d : -f 2
Start time
grep '^B' flights/0144f5b1.igc | cut -c 2-7 | head -n 1
End time
grep '^B' flights/0144f5b1.igc | cut -c 2-7 | tail -n 1
Highest GPS altitude
grep '^B' flights/0144f5b1.igc | cut -c 31-35 | sort | tail -n 1
All pilots
grep ^HFPLT flights/*.igc | cut -d : -f 3 | sort | uniq
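The file written by history still carries the entry number that history prints in front of each command, and removing these is part of the cleanup done in the editor. A minimal sketch of doing that step with a command instead, assuming bash's default layout (a right-aligned number, an optional * for modified entries, then the command). The function name strip_history_numbers and the sample lines are made up for illustration, and the pattern would need adjusting if HISTTIMEFORMAT is set.

```shell
# Hypothetical helper: remove the leading entry number from each line of
# "history" output, assuming bash's default layout.
strip_history_numbers() {
    sed -E 's/^ *[0-9]+[* ] //'
}

# Made-up sample of what "history 2" might print.
printf '%s\n' \
    "  101  grep '^HFDTE' flights/0144f5b1.igc | cut -c 6-" \
    "  102  grep '^B' flights/0144f5b1.igc | cut -c 2-7 | head -n 1" |
    strip_history_numbers
```

In an interactive session this could be used as history 20 | strip_history_numbers > flight-commands.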
Once we have our commands in a file, it is pretty natural to wonder if we can get bash to just run our commands from the file instead of us having to type them back in each time.
Scripting¶
This is precisely what a shell script is: a file with a list of commands in it that we get our shell (bash) to
run. Our flight-commands file is almost a shell script as we have written it above. The only issue is that bash
doesn’t know what to make of the comments as they aren’t proper commands. We can fix this by prefixing them with
#
to mark them as comments
[tyson@gra-login2 ~]$ nano flight-commands
...
[tyson@gra-login2 ~]$ cat flight-commands
# Reference of useful pipes for working with igc files extracted from history
# Date
grep '^HFDTE' flights/0144f5b1.igc | cut -c 6-
...
Now we can tell bash to run our commands directly for us from our file
[tyson@gra-login2 ~]$ source flight-commands
There are actually several ways this last step can be done
. <filename> or source <filename>
- run commands in current session
( source <filename> )
- run commands in a sub shell (current directory and such will be restored)
bash <filename>
- start a new shell, run the commands, and exit back to current shell
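The practical difference between these shows up in what is left behind afterwards. A small sketch, assuming a throwaway one-line script that only sets a variable (the file and the variable name greeting are made up for this illustration). It uses the . spelling of source.

```shell
# Hypothetical one-line script that just sets a variable.
demo=$(mktemp)
echo 'greeting=hello' > "$demo"

unset greeting
bash "$demo"                        # new shell: its variables vanish on exit
after_bash=${greeting:-unset}

( . "$demo" )                       # sub shell: changes are discarded too
after_subshell=${greeting:-unset}

. "$demo"                           # current session: the variable stays set
after_source=${greeting:-unset}

echo "bash: $after_bash, sub shell: $after_subshell, source: $after_source"
rm -f "$demo"
```

This prints bash: unset, sub shell: unset, source: hello, as only the last form runs the commands in our own session.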
Earlier we had mentioned that Linux doesn’t use a .exe
extension to identify executable files. Instead executable
files have the executable mode set on them. We can set this with the command chmod +x <file>
and we can see it as the x when we look at the ls -l long listing. Because our program is a script, we also have to tell Linux
what program to use to run it by adding a special #!<interpreter> comment to the start of it
[tyson@gra-login2 ~]$ ls -l flight-commands
-rw-r----- 1 tyson tyson 630 May 11 22:32 flight-commands
[tyson@gra-login2 ~]$ chmod +x flight-commands
[tyson@gra-login2 ~]$ ls -l flight-commands
-rwxr-x--- 1 tyson tyson 630 May 11 22:32 flight-commands
[tyson@gra-login2 ~]$ nano flight-commands
...
[tyson@gra-login2 ~]$ cat flight-commands
#!/bin/bash
# Reference of useful pipes for working with igc files extracted from history
...
With all this in place (the executable mode set and the special interpreter comment as the first line) we can now directly run our file as if it were just another command
[tyson@gra-login2 ~]$ flight-commands
-bash: flight-commands: command not found
[tyson@gra-login2 ~]$ ./flight-commands
030816
Lena
...
The first run attempt failed because the current directory is not somewhere bash looks for a command unless we
explicitly tell it to, as we did in the second command. The places bash does look are given by the PATH environment variable in a :
delimited format. We can
see the setting of this variable by either using variable substitution with the echo
command or using the
declare
command to print it
[tyson@gra-login2 ~]$ echo $PATH
/opt/software/slurm/current/bin:...:/home/tyson/bin
[tyson@gra-login2 ~]$ declare -p PATH
declare -x PATH="/opt/software/slurm/current/bin:...:/home/tyson/bin"
The declare
version is interesting as it actually prints the declare
command we would have to run to set it to
its current value. This shows us additional information such as the -x
which means that it is to be also made
available (exported) to commands that bash runs as well.
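The effect of exporting can be seen by asking a command started from our shell what it can see. A minimal sketch, with the variable names demo_plain and demo_shared made up for this illustration:

```shell
demo_plain=one            # ordinary variable: stays inside this shell
export demo_shared=two    # exported: copied into the environment of commands we run

# Ask a child shell what it can see. The single quotes stop the current
# shell from substituting the variables itself.
child_view=$(sh -c 'echo "plain=[$demo_plain] shared=[$demo_shared]"')
echo "$child_view"
```

This prints plain=[] shared=[two]. Only the exported variable reaches the child, which is why PATH needs the -x for the commands we run to find other commands themselves.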
You might be tempted to add .
(the current directory) to this list. This will work, but don’t do it. If someone
puts an ls
command in a directory you go into and run ls
in, it will then run their ls
command and not the
system one you are expecting. Their ls
command could do anything, including deleting all your files or giving
them access to your account in the background. The last element of PATH
is a bin
directory under your home
directory. Create this directory instead and put your scripts there
[tyson@gra-login2 ~]$ mkdir bin
[tyson@gra-login2 ~]$ mv flight-commands bin
[tyson@gra-login2 ~]$ flight-commands
030816
Lena
GBJY
...
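The whole sequence (interpreter comment, executable mode, a directory listed in PATH) can be tried out in one go without touching the real ~/bin. A sketch under that assumption, using a temporary directory and a made-up script name hello-demo:

```shell
# Hypothetical stand-in for ~/bin so the sketch leaves the real one alone.
bindir=$(mktemp -d)

# Write a tiny script with the interpreter comment as its first line.
cat > "$bindir/hello-demo" <<'EOF'
#!/bin/bash
echo "hello from a script"
EOF

chmod +x "$bindir/hello-demo"      # set the executable mode
PATH="$PATH:$bindir"               # append the directory to the search list

result=$(hello-demo)               # found through PATH, no ./ needed
echo "$result"
rm -rf "$bindir"
```

Appending to the end of PATH, as here, means the system commands are still found first if a name clashes.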
Our command isn’t as useful as the other commands though, as with them we can say what files to operate on. Our command just ignores everything we tell it and always does the same operations on the same files
[tyson@gra-login2 ~]$ flight-commands --you-are-just-going-to-ignore-this--
030816
Lena
GBJY
...
To make our command more useful, we can use variables to change it from a specific command to run to a template command to run. We do this by replacing the fixed filenames with special symbols (variables) that get replaced with the arguments provided on the command line
$<n>
- the nth argument provided on the command line
$@
- all the arguments provided on the command line separated by spaces
$#
- the number of arguments provided on the command line
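Shell functions receive their arguments through the same variables as scripts, which gives a quick way to see what each one holds without creating a file. A small sketch, where show_args is a made-up name:

```shell
# Hypothetical function used only to display its own arguments.
show_args() {
    echo "count: $#"
    echo "first: $1"
    echo "all: $@"
}

show_args flights/0144f5b1.igc flights/04616075.igc
```

Here count is 2, first is flights/0144f5b1.igc, and all is both filenames separated by a space.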
With this we can make a copy of our example commands file and edit it into a command that takes an igc filename and prints the date of the flight
[tyson@gra-login2 ~]$ cp bin/flight-commands bin/igc-date
[tyson@gra-login2 ~]$ nano bin/igc-date
...
[tyson@gra-login2 ~]$ cat bin/igc-date
#!/bin/bash
# Run the date extraction pipeline using the first argument as the source filename
grep '^HFDTE' $1 | cut -c 6-
[tyson@gra-login2 ~]$ igc-date flights/fffdcaad.igc
250616
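One detail worth guarding against: written as a bare $1, a filename containing spaces is split into several words before grep sees it. A sketch of the same pipeline with the argument in double quotes, using a shell function igc_date as a stand-in for the script and a made-up one-line igc file:

```shell
workdir=$(mktemp -d)
# Made-up igc file with a space in its name, holding only a date header.
printf 'HFDTE030816\n' > "$workdir/my flight.igc"

# Stand-in for the igc-date script, with the argument in double quotes
# so the filename is passed to grep as a single word.
igc_date() {
    grep '^HFDTE' "$1" | cut -c 6-
}

flight_date=$(igc_date "$workdir/my flight.igc")
echo "$flight_date"
rm -rf "$workdir"
```

The igc files used here have no spaces in their names, so the unquoted version works for them. Quoting only matters once it is run on other people's files.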
All we have done is put a name to a pipeline template. This isn’t trivial though. Our minds can only deal with so much information at any one point. Switching from thinking about a complex pipeline to a simple, appropriately named command frees up the brain power required to successfully integrate that command into some other complex operation. Repeating this process lets us build up from small blocks to mansions.
Exercises¶
In these exercises, you will see the ;
character. In bash the ;
is equivalent to a newline (pressing enter
on your keyboard). This lets us write multiline commands on a single line. You will see when you scroll back
through your history (the up key), that bash will replace your newlines with ; characters.
1. In the live session, we converted our example date extraction pipeline into a new igc-date command
igc-date <filename>
- date field from the igc-file
Do this for the other pipelines to create the following commands
a. igc-pilot <filename>
- pilot field from the igc-file
b. igc-plane <filename>
- plane call sign from the igc-file
c. igc-start <filename>
- starting (first) time for the igc-file
d. igc-end <filename>
- ending (last) time for the igc-file
e. igc-maxalt <filename>
- maximum altitude recorded in the igc-file
2. The [[ <test> ]] command lets us perform a variety of tests (see help [[ and help test). Combined with the if <command>; then <command>; else <command>; fi command (see help if), this lets us write a further improved igc-date command that provides feedback to the user if it was invoked incorrectly.
[tyson@gra-login2 ~]$ cat bin/igc-date
#!/bin/bash
if [[ $# -eq 1 ]]; then
  grep '^HFDTE' $1 | cut -c 6-
else
  echo "Proper usage is igc-date [igc file]"
fi
Give this a try and update the other commands to also do this.
3. Add an elif [[ $# -eq 0 ]]; then <command> branch to make the command also support reading the igc file directly from the keyboard when not given any filenames, as most other commands do (remember CTRL+c aborts and CTRL+d signals the end of the input when testing this out).
Submitting jobs¶
Earlier we discussed how the graham supercomputer is actually a large number of very beefy standard computers, and that you run programs on these computers by telling the system
what commands you want to run, and
what resources those commands require to run.
Now that we know how to create scripts, we know how to do the first of these. The second is simply a matter of
looking through the command options for the sbatch (slurm batch) command. The Compute Canada documentation
wiki has pretty extensive coverage of most circumstances and what options should be
specified. At a minimum we need
--time <mm> or <dd>-<hh>:<mm>
- amount of time required (Slurm reads a bare two-field value such as 1:30 as minutes:seconds, not hours:minutes)
--mem-per-cpu [megabytes]
- amount of memory required
--account <account>
- sponsor account to record usage against
--output <file>
- file to record output in
The script will be killed if it exceeds the resources specified, so we want to give ourselves a bit of room when specifying our limits. We don’t want to be excessive though, as our script will not run until the system has secured all the specified resources for us, so the more we specify, the longer we wait before running.
Running sbatch without the --account option will print a list of possible accounts. For most users there is only one option
(their sponsor’s default account). For these guest accounts we use the special def-training-wa account, which is
configured to allow us to start small training jobs without much delay. A sample submission might then be
[tyson@gra-login2 ~]$ sbatch --time 5 --mem-per-cpu 500 --account def-training-wa --output example.log igc-date flights/0144f5b1.igc
Submitted batch job 31352623
To avoid having to specify all these options every time, the sbatch command also lets us put them in comments
at the top of our script file, after the #!/bin/bash line but before any commands, by prefixing them with #SBATCH.
#SBATCH --mem-per-cpu 500
#SBATCH --account def-training-wa
If an option is specified on both the command line and the script file, the command line will take precedence. In this sense, putting options in our script files gives us a powerful way to specify our defaults.
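Putting the pieces together, a complete job script might look like the sketch below. It only writes the script to a temporary location and lists the options it carries, as actually submitting it needs a Slurm cluster. The script name igc-date-job is made up, while the option values are the ones from the sample submission.

```shell
# Write a hypothetical job script (the name igc-date-job is made up).
job=$(mktemp -d)/igc-date-job
cat > "$job" <<'EOF'
#!/bin/bash
#SBATCH --time 5
#SBATCH --mem-per-cpu 500
#SBATCH --account def-training-wa
#SBATCH --output example.log

# To bash the #SBATCH lines are ordinary comments; only sbatch reads them.
igc-date flights/0144f5b1.igc
EOF

# Show which defaults the script carries.
grep '^#SBATCH' "$job"

# On the cluster it could then be submitted without repeating the options,
# or with a single one overridden on the command line, e.g.
#   sbatch igc-date-job
#   sbatch --time 10 igc-date-job
```

Keeping the options next to the commands they describe also means the resource request travels with the script when it is shared with a colleague.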
The number sbatch returned when we submitted our job (31352623) is the job identifier. Make sure to provide this to us if you ever request support for an issue
regarding your job so we can look it up. It can also be used to cancel a job with the scancel
(slurm cancel)
command. The squeue
(slurm queue) command shows you the status of queued jobs
[tyson@gra-login2 ~]$ squeue -u tyson
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS GRES MIN_MEM NODELIST (REASON)
31352623 tyson def-training igc-date R 4:50 1 1 (null) 500M gra1064 (None)
The -u <username>
option limits the output to just the jobs queued for the specified user. From this we see that
our job is running (ST
is R
) on the node (computer) gra1064
. Once a job has completed running, it is removed
from the queue and no longer shows up in the output of squeue
. Information about it can still be retrieved
using the sacct
(slurm accounting) command
[tyson@gra-login2 ~]$ sacct -S 2020-05-01 -u tyson
...
where the -S <date>
option specifies how far back in the job records to report on (the default is just the
current day). We can also look in the specified output file to get any messages that may have been printed by the
job (the extracted flight date in our case)
[tyson@gra-login2 ~]$ cat example.log
030816
The supercomputers have a lot of standard software already installed on them. It is not possible to enable all
these software packages at the same time though, as many provide the same commands, so you need to use the module
command to tell the system what software you want enabled. The module avail
command lists what software is
available. For example
[tyson@gra-login2 ~]$ module avail python
...
ipython-kernel/2.7 ipython-kernel/3.6 ipython-kernel/3.8 (D) python/3.6.10 (t,3.6) python/3.7.9 (t)
ipython-kernel/2.7 ipython-kernel/3.7 python/2.7.18 (t,2.7) python/3.7.7 (t,3.7) python/3.8.2 (t,D:3.8)
...
and the module load
command enables the chosen software
[tyson@gra-login2 ~]$ python --version
Python 3.7.7
[tyson@gra-login2 ~]$ module load python/3.8
[tyson@gra-login2 ~]$ python --version
Python 3.8.2
The module avail
command only shows software that is compatible with the current core packages that are loaded. To
see all available software you need to use the module spider
command. It will also tell you what other packages
you need to load in order to make your desired package available. For example
[tyson@gra-login2 ~]$ module avail qgis
...
No module(s) or extension(s) found!
Use "module spider" to find all possible modules and extensions.
[tyson@gra-login2 ~]$ module spider qgis
...
Versions:
qgis/2.18.24
qgis/3.10.6
...
For detailed information about a specific "qgis" package (including how to load the modules) use the module's full name.
...
[tyson@gra-login2 ~]$ module spider qgis/3.10.6
...
You will need to load all module(s) on any one of the lines below before the "qgis/3.10.6" module is available to load.
StdEnv/2020 gcc/9.3.0
...
Loops¶
At the very start, we demoed how easy it was to add the date to the name of a large number of files with the command line instead of a graphical user interface. We did this using a for loop. For loops let us create a template command (such as one to rename a file to include the date), and then apply it to a large number of cases.
Let’s consider the case of getting all our pilot names. We have our igc-pilot
command that gives us the pilot
field from a single igc file. If we wanted to retrieve all our pilots, we would have to run this command once
for each file. That is
[tyson@gra-login2 ~]$ igc-pilot flights/0144f5b1.igc
[tyson@gra-login2 ~]$ igc-pilot flights/04616075.igc
[tyson@gra-login2 ~]$ igc-pilot flights/054b9ff8.igc
...
Comparing these first few cases, it is pretty clear that the only thing changing between each of these commands is the name of the igc file. That is, we have a common command template that we are running
igc-pilot $file
where $file
is a placeholder for the filename that changes with each command. Our initial commands could then
equally well be written as
[tyson@gra-login2 ~]$ file=flights/0144f5b1.igc; igc-pilot $file
[tyson@gra-login2 ~]$ file=flights/04616075.igc; igc-pilot $file
[tyson@gra-login2 ~]$ file=flights/054b9ff8.igc; igc-pilot $file
...
where we are simply setting the value of file each time and then running our template, which runs igc-pilot on our
file.
A for loop is nothing more than special syntax for doing this that only requires us to specify the template once, which makes sense as the template is the same every time. In bash the syntax looks like this
for file in flights/0144f5b1.igc flights/04616075.igc flights/054b9ff8.igc ...; do
igc-pilot $file
done
By bringing the list of files we run our template for into one place, we have also now made it possible for us to specify our file list using a glob pattern. That is, we can say
[tyson@gra-login2 ~]$ for file in flights/*.igc; do
igc-pilot $file
done
Lena
Bill
Lena
...
A for
statement is also a command, and can be used like any other command. For example, its output can be piped
through sort
and uniq
in order to obtain a compact list of our pilots
[tyson@gra-login2 ~]$ for file in flights/*.igc; do igc-pilot $file; done | sort | uniq
Aasia
Bill
Fred
Lena
Mary
Mo
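The same loop-into-pipeline pattern can be tried on a few made-up files to check that it behaves as expected before running it on real data. In this sketch the shell function igc_pilot stands in for the igc-pilot script, the three igc files are invented, and sort -u is used as a shorter spelling of sort | uniq.

```shell
workdir=$(mktemp -d)
# Made-up igc files containing only a pilot header line.
printf 'HFPLTPILOTINCHARGE:Lena\n' > "$workdir/a.igc"
printf 'HFPLTPILOTINCHARGE:Bill\n' > "$workdir/b.igc"
printf 'HFPLTPILOTINCHARGE:Lena\n' > "$workdir/c.igc"

# Stand-in for the igc-pilot script.
igc_pilot() {
    grep '^HFPLT' "$1" | cut -d : -f 2
}

# The whole for loop acts as one command whose output feeds the pipeline.
pilots=$(for file in "$workdir"/*.igc; do igc_pilot "$file"; done | sort -u)
echo "$pilots"
rm -rf "$workdir"
```

Lena flew two of the three made-up flights but appears only once in the result, after Bill.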
When we say $(<command>)
in bash, it gets replaced with the output of <command>
. We can use this to write a
powerful for loop to (finally!) rename all our files to something more useful than their current names
[tyson@gra-login2 ~]$ cd flights
[tyson@gra-login2 ~]$ for file in *.igc; do
pilot=$(igc-pilot $file)
date=$(igc-date $file)
mv $file $date-$pilot-$file
done
[tyson@gra-login2 ~]$ ls
000000-Lena-551a9c25.igc 010815-Aasia-7cded70f.igc 010815-Mary-4bf15e07.igc ...
Having the pilot name and date in each of the file names allows us to easily do things like select all the files for a
specific pilot or year using glob patterns (e.g., *-Lena-*
will select all of Lena’s flights).
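Because a rename loop changes many files at once, it can be worth previewing what it would do first. One common approach, sketched here, is to put echo in front of mv so that each command is printed instead of run. The functions igc_date and igc_pilot are stand-ins for the scripts, and the igc file contents are made up.

```shell
workdir=$(mktemp -d)
# Made-up igc file holding just a date and a pilot header.
printf 'HFDTE030816\nHFPLTPILOTINCHARGE:Lena\n' > "$workdir/0144f5b1.igc"

# Stand-ins for the igc-date and igc-pilot scripts.
igc_date()  { grep '^HFDTE' "$1" | cut -c 6-; }
igc_pilot() { grep '^HFPLT' "$1" | cut -d : -f 2; }

# Run the loop in a sub shell so the cd does not affect our session.
preview=$(
    cd "$workdir" &&
    for file in *.igc; do
        pilot=$(igc_pilot "$file")
        date=$(igc_date "$file")
        # echo prints the mv command instead of running it
        echo mv "$file" "$date-$pilot-$file"
    done
)
echo "$preview"
rm -rf "$workdir"
```

Once the printed commands look right, removing the echo turns the preview into the real rename.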
Exercises¶
In this exercise we are going to create a triple loop to print out the highest altitude obtained by each pilot for each year by filling in the missing command in the following double loop
for year in 15 16 17; do
for pilot in Aasia Bill Fred Lena Mary Mo; do
maxalt=$(<command to retrieve greatest altitude for $pilot in $year>)
echo "$year - $pilot: $maxalt"
done
done
Past experience has shown this is a very tough problem for people to solve outright. When you get stuck like this, the key is to switch from trying to find an outright solution, and instead focus on trying to break the problem down into a series of smaller problems that you can produce outright solutions to. Often the steps become obvious if you do a few cases by hand. You just need to take note of what you did and get the computer to do the same.
For this problem, we can break it down into the following sub-problems (which are the exercise)
1. Extract the highest altitude for a single flight (we made a command for this).
2. Come up with a glob pattern to select all the flights in a given year for a given pilot.
3. Use both of these to extract the highest altitudes for a given pilot in a given year (put the highest altitude command in a for loop over the files that match the glob pattern).
4. Extract the highest of the highest flight altitudes (pipe the output of the for loop into a pipeline that extracts the largest value).
5. Insert the command you built up in 1-4 into the loop above and run that to see which pilot achieved the highest altitude each year.