CRUKCI Cluster Transition - Hands-on training
ssh my_username@clust1-headnode.cri.camres.org # access the cluster head node
cd /scratchb/my_group/my_username/ # go to your scratch space
java -jar /home/my_username/clarity-tools.jar -l SLX-ID # download your sequencing data
cd SLX-ID/ # navigate to your project folder
zcat my_sequence_file.fq.gz | more # output the content of the file paging through text one screenful at a time
EXERCISE Go to your Terminal window, or open a new one, and log in to the cluster head node.
- Navigate to your scratch folder.
- Download your preferred project data.
- Visualise one FASTQ file
Congratulations!
You did it!
A FASTQ file consists of multiple blocks of these four lines:
@NS500222:320:HHMJ3BGX3:1:11101:24390:1371 1:N:0:TAAGGCGA+ATAGAGAG
CATCTGCAAGTTGGAGACCCAGATAAGCCAGTAATGTAGTTCAGTCCATGACCAAACTGTCTCTTATACACATCT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEE
- @ followed by the identifier
- the sequence
- +
- the quality scores
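Because every read occupies exactly four lines, the number of reads in a FASTQ file is its line count divided by four. The sketch below illustrates this on a small compressed file built from the record shown above; the sandbox directory and the file name example.fq.gz are hypothetical, and gzip -dc is used as a portable equivalent of zcat.

```shell
# Hypothetical sandbox and file name, so no real data is touched.
workdir=$(mktemp -d)
fastq_file="$workdir/example.fq.gz"

# Write the four-line record shown above twice, then compress it.
for copy in 1 2; do
    cat <<'EOF'
@NS500222:320:HHMJ3BGX3:1:11101:24390:1371 1:N:0:TAAGGCGA+ATAGAGAG
CATCTGCAAGTTGGAGACCCAGATAAGCCAGTAATGTAGTTCAGTCCATGACCAAACTGTCTCTTATACACATCT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEE
EOF
done | gzip > "$fastq_file"

# Count the lines of the decompressed file and divide by four.
line_count=$(gzip -dc "$fastq_file" | wc -l)
read_count=$((line_count / 4))
echo "$read_count reads"
```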
/scratchb/bioinformatics/reference_data/reference_genomes/
/scratchb/bioinformatics/reference_data/reference_genomes/$organism/$assembly
e.g. for Human GRCh38 /scratchb/bioinformatics/reference_data/reference_genomes/homo_sapiens/GRCh38
Aligned sequences: SAM/BAM/CRAM
README.txt file
- README.txt file in each of your working directories
- history
EXERCISE Go to your Terminal window, or open a new one, and log in to the cluster head node.
- Navigate to your project data.
- Create a README.txt file.
- Type a small description and the commands you used to retrieve the data.
Congratulations!
You did it!
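As a rough sketch, a README.txt can also be written straight from the command line instead of an editor. The description text and the sandbox directory below are assumptions for illustration; the retrieval command is the one used earlier in this session.

```shell
# Hypothetical sandbox standing in for your project folder.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a short description and the command used to retrieve the data.
cat > README.txt <<'EOF'
Sequencing data for project SLX-ID
Retrieved with:
java -jar /home/my_username/clarity-tools.jar -l SLX-ID
EOF

# Check the content of the file.
cat README.txt
```

On an interactive session, the history command lists the commands you typed, which makes it easy to copy them into the file.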
We saw in session 1 how to use the wildcard * to run a command on multiple files, and how to combine commands using the pipe |. Unfortunately, a wildcard alone cannot be used to rename files. We will have to use a loop to perform an operation once for each item in a list.
EXERCISE Go to your Terminal window, or open a new one.
- Let’s get back to Nelle’s data from session 1. On the command line, download the zipped data file using wget followed by unzip to decompress the archive, navigate to the session1-data/nelle/creatures directory using the cd command, and list all the files using ls.
Congratulations!
You did it!
There are two files in this directory, but imagine we had hundreds which we would like to rename to original_*.dat to back them up before editing them.
If you run
mv *.dat original_*.dat
you’ll get an error message, because when mv receives more than two arguments it expects the last one to be a directory. Instead, we can use a loop to perform an operation once for each item in a list. Let’s start by printing the name of each of us using the echo command. We could do it one command at a time, or we can use a loop of the form
for variable_name in element_1 element_2 element_3; do command $variable_name; done
to repeat the same command once for each element of the list:
echo Anne
echo Rob
echo Jochen
echo Katie
echo Ummi
for name in Anne Rob Jochen Katie Ummi; do echo $name; done
EXERCISE Go to your Terminal window, or open a new one and go to session1-data/nelle/creatures.
- Write a loop to print out the name of each file first and then,
- Write a loop to rename these files *.dat into original_*.dat
Congratulations!
You did it!
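One possible way to write these two loops is sketched below. It runs in a temporary sandbox with empty placeholder files named after Nelle’s two creature files, so the directory and the variable name filename are assumptions for illustration. Printing each mv command with echo first is a cheap way to check a loop before letting it change anything.

```shell
# Hypothetical sandbox with empty stand-ins for the .dat files.
workdir=$(mktemp -d)
touch "$workdir/basilisk.dat" "$workdir/unicorn.dat"
cd "$workdir" || exit 1

# First loop: print the name of each file.
for filename in *.dat; do echo "$filename"; done

# Dry run: print each rename command instead of executing it.
for filename in *.dat; do echo mv "$filename" "original_$filename"; done

# Second loop: rename each file in turn.
for filename in *.dat; do mv "$filename" "original_$filename"; done
ls
```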
It is really important to give your variable a meaningful name, not a random or one-letter one. Programs are only useful if people can understand them, so meaningless names (like x) or misleading names (like temperature) increase the odds that the program won’t do what its readers think it does.
We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in a file so that we can re-run all those operations later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.
Let’s go back to the session1-data/nelle/molecules directory to extract lines 11 to 15 of a PDB file using a shell script called middle.sh.
First, we have to create the file middle.sh
cd session1-data/nelle/molecules
nano middle.sh
and type the commands we want to run
head -15 octane.pdb | tail -5
then make this file executable for yourself by changing its mode
chmod u+x middle.sh
and finally run the script
./middle.sh
What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than just retyping the command. Instead, let’s edit middle.sh and replace octane.pdb with a very special variable called $1. $1 means the first parameter on the command line. We can now run our script like this:
./middle.sh octane.pdb
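As a minimal sketch of this edit, the block below recreates middle.sh with $1 in place of the file name and runs it on a numbered test file, so that the selected lines are easy to recognise. The sandbox directory, the file numbers.txt and the added #!/bin/sh line are assumptions for illustration, not part of the original script.

```shell
# Hypothetical sandbox with a 20-line file standing in for octane.pdb.
workdir=$(mktemp -d)
seq 1 20 > "$workdir/numbers.txt"

# middle.sh after the edit: "$1" replaces the hard-coded file name.
cat > "$workdir/middle.sh" <<'EOF'
#!/bin/sh
head -15 "$1" | tail -5
EOF
chmod u+x "$workdir/middle.sh"

# Run the script with the file name as its first parameter.
result=$(cd "$workdir" && ./middle.sh numbers.txt)
echo "$result"
```

Quoting "$1" keeps the script working when a file name contains spaces.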
We still need to edit middle.sh each time we want to adjust the range of lines, though. Let’s fix that by using the special variables $2 and $3.
EXERCISE Go to your Terminal window, or open a new one and go to session1-data/nelle/molecules.
- Update the script middle.sh to take two other parameters on the command line for the range of lines to select
- Run the script
Congratulations!
You did it!
This works, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top.
# Select lines from the middle of a file.
# Usage: middle.sh filename -end_line -num_lines
head "$2" "$1" | tail "$3"
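To call the script for an arbitrary range, the two options have to be worked out from the first and last line wanted: the end line is the last line, and the number of lines is last minus first plus one. The sketch below does this arithmetic for lines 20 to 23 on a numbered test file; the sandbox, the file numbers.txt, the variable names and the #!/bin/sh line are assumptions for illustration.

```shell
# Hypothetical sandbox with a 30-line file standing in for a PDB file.
workdir=$(mktemp -d)
seq 1 30 > "$workdir/numbers.txt"

cat > "$workdir/middle.sh" <<'EOF'
#!/bin/sh
# Select lines from the middle of a file.
# Usage: middle.sh filename -end_line -num_lines
head "$2" "$1" | tail "$3"
EOF
chmod u+x "$workdir/middle.sh"

# Lines 20 to 23: end_line is 23 and num_lines is 23 - 20 + 1 = 4.
first_line=20
last_line=23
num_lines=$((last_line - first_line + 1))

result=$(cd "$workdir" && ./middle.sh numbers.txt "-$last_line" "-$num_lines")
echo "$result"
```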
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
We are mostly interested in its job scheduling aspect, which allocates resources to a user for a specified amount of time. Slurm provides resource management for the processors allocated to a job, so that multiple job steps can be simultaneously submitted and queued until there are available resources within the job’s allocation.
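Type a command followed by --help for command line usage details (the short form -h also works for sbatch and sacct, but squeue -h suppresses the header line instead).
sbatch --help
squeue --help
sacct --help
scancel --help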
We are going to start by submitting a very simple job to the cluster using the echo command and a shell script job.sh containing specific Slurm instructions.
First, log in to the cluster head node
ssh my_username@clust1-headnode.cri.camres.org
then create a job.sh file containing these lines, replacing /scratcha/xxlab/my_username with your scratch space
#!/bin/sh
#SBATCH --partition general
#SBATCH --mem 512
#SBATCH --job-name hello_world
#SBATCH --output /scratcha/xxlab/my_username/hello_world.%j.out
echo Running on $(hostname): 'Hello world!'
and finally submit your job to the cluster using the command sbatch
sbatch job.sh
In the job.sh script, we have specific SBATCH instructions:
- --partition: select a specific queue to submit the job
- --mem: specify the real memory required in megabytes, e.g. 2GB is 2048
- --job-name: specify a name for the job allocation
- --output: connect the batch script’s standard output directly to the file name specified. By default both standard output and standard error are directed to the same file. The filename is hello_world.%j.out where the %j is replaced by the job ID.
See the sbatch man page for all the options and explanations on submitting a batch script to Slurm.
EXERCISE Go to your Terminal window, or open a new one and go to session1-data/nelle/.
- Copy molecules/ onto your scratch space on the cluster using scp -r
- Submit middle.sh script to Slurm to extract lines 20-23 of octane.pdb
- Write a loop to submit jobs for all PDB files by modifying job.sh to take one command line argument which will be given at each step of the loop
Congratulations!
You did it!
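The last step of the exercise could look something like the sketch below. It assumes job.sh has already been modified to read the PDB file name from $1; that modified job.sh is not shown here. The DRY_RUN variable is a hypothetical switch that prints the sbatch commands instead of running them, and the sandbox files are empty placeholders, so the loop can be tried without a Slurm installation.

```shell
# Hypothetical sandbox with empty stand-ins for the PDB files.
workdir=$(mktemp -d)
touch "$workdir/octane.pdb" "$workdir/methane.pdb" "$workdir/propane.pdb"
cd "$workdir" || exit 1

# DRY_RUN is a hypothetical switch: 1 prints the commands, 0 submits them.
DRY_RUN=${DRY_RUN:-1}

submitted=""
for pdb_file in *.pdb; do
    if [ "$DRY_RUN" -eq 1 ]; then
        # Show what would be submitted without calling Slurm.
        echo sbatch job.sh "$pdb_file"
    else
        # The file name becomes $1 inside job.sh.
        sbatch job.sh "$pdb_file"
    fi
    submitted="$submitted $pdb_file"
done
```

On the cluster, running the same loop with DRY_RUN=0 from the directory containing job.sh and the PDB files would submit one job per file.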