HPC17-4: Exercises | SIB

Exercise 1: Check cluster status.

Objective: Discover the different Vital-IT clusters (UNIL, EPFL, UNIGE) in terms of host and CPU numbers. Learn how to evaluate the current load of machines on the clusters.

Go to the Vital-IT computing infrastructure webpage [https://www.vital-it.ch/services/infrastructure] and look at the cluster status of the UNIL, EPFL and UNIGE clusters. Use the information you find to answer the following questions:

Questions:

How many hosts and how many CPUs are available on each cluster?
What is the difference between a CPU and a host?
Compare the CPU/host ratios on the UNIL and EPFL clusters. Can both clusters be used to run a job that requires 8 threads?
What is the current average load on the UNIL cluster?

Exercise 2: Discover software applications installed on the cluster and find out how to load them.

Objective: learn how to check for software availability on the cluster using either your web browser or the command line interface of the Linux shell on the cluster.

A) Search for software using your web browser.

Go to the Vital-IT "bioinformatics tools" webpage [https://www.vital-it.ch/services/software] and browse/search through the different software categories. Answer the questions below.

Questions:

How many versions of "samtools" are currently available?
What is the category under which "samtools" is found?
What is the command to load "samtools" version 1.3?

If you have some application in mind that you plan to use, search to see whether it is installed on the cluster. If yes, what is the "module add" command you need in order to load that software?

B) Search for software from the command line, using the "vit_soft" command.

1. If not already done, open a shell on your computer and connect to the UNIL cluster development front-end machine (dev.vital-it.ch). The command to connect to this machine is the following (you will be asked for your password):

ssh userName@dev.vital-it.ch (replace "userName" with your own user name).

2. List the content of the /software directory and browse through the different categories and sub-categories. Almost all software available on the Vital-IT cluster is located in /software.

3. As software might not always be in the category you expect, a more efficient way to search for software on the cluster is to use the vit_soft command that is available by default on all machines on the cluster. Here are some useful options for the vit_soft command:

vit_soft -h (display help for the command).
vit_soft -s <string to search in package name> (search for a software package based on its name).
vit_soft -b <name of binary file> (search for a software package based on the exact name of a binary file it contains).
vit_soft -m <string to search in package name> (display the "module add" command for a package based on its name).
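The options listed above can be tried out with a short sketch like the following. It uses only the flags described in this exercise; the wrapper function vit_query is a hypothetical helper (not part of vit_soft) that simply avoids an error when the sketch is run on a machine where vit_soft is not installed. The package name "samtools" is only an example search string.

```shell
# Hypothetical helper: run a vit_soft query only where the command exists.
vit_query() {
    if command -v vit_soft >/dev/null 2>&1; then
        vit_soft "$@"
    else
        echo "vit_soft not found: run this on a Vital-IT machine" >&2
        return 127
    fi
}

vit_query -s samtools || true   # packages whose name contains "samtools"
vit_query -b samtools || true   # packages that provide a binary named "samtools"
vit_query -m samtools || true   # "module add" command(s) for matching packages
```

On the cluster itself you would normally type the vit_soft commands directly; the wrapper is only there to keep the sketch harmless elsewhere.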

Using the vit_soft command, answer the questions below.

Questions:

List all packages available for the software "tophat". How many are there? What is the latest version available?
Which package(s) provide the binary bam2fastx?
How can you load the latest version of the tophat package?
After loading the latest version of tophat, check which modules are currently loaded in your environment using the command module list. What do you notice? Can you see a potential pitfall when loading several modules?

C) Load / test software using the "module" command.

Let's assume that we want to run a short sequence alignment on a protein sequence (stored in the file exercise5/AF311055_prot.seq). As we know this will take only a few seconds to compute, we decide to run it directly on the development front-end machine, where running small scripts or programs for testing purposes is allowed. (Note: this would not be allowed on the production front-end!) The command that we want to execute is the following:

blastp -db "swiss" -query ./exercise5/AF311055_prot.seq

1. Change directory into /scratch/cluster/weekly/<userName>/HPCexercises

2. Try to run the blastp command on the development front-end. (Hint: if the command doesn't work then maybe you need to use the "module" command to enable it).

Questions:

Did the command work "out-of-the-box", or did you have to type something else before using it?
Which version of the blastp command did you use? Hint: to display the version of blastp, type "blastp -version".

3. Now run the same command again, but with a different version of blastp.

Questions:

How did you switch to another version of blastp?
How does the "module add" command affect the shell variable "$PATH"?
How many modules do you currently have loaded? What is the module command to check this?
What command(s) would you execute to revert to the version of blastp that you used in step 2 of this exercise?
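To explore the "$PATH" question without touching the cluster setup, the following sketch imitates what happens when directories are prepended to PATH one after the other. It rests on the assumption that loading a module typically puts the package's bin directory at the front of PATH; verify this on the cluster with "echo $PATH" before and after a "module add". The directories and the program "demo-tool" are made up for the demonstration.

```shell
# Two throwaway directories stand in for two installed versions of a tool.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/v1/bin" "$demo_root/v2/bin"
printf '#!/bin/sh\necho "demo-tool 1.0"\n' > "$demo_root/v1/bin/demo-tool"
printf '#!/bin/sh\necho "demo-tool 2.0"\n' > "$demo_root/v2/bin/demo-tool"
chmod +x "$demo_root/v1/bin/demo-tool" "$demo_root/v2/bin/demo-tool"

saved_path=$PATH

# Imitate loading version 1: its bin directory goes to the front of PATH.
PATH="$demo_root/v1/bin:$PATH"
first=$(demo-tool)

# Imitate loading version 2 on top: the shell now finds version 2 first,
# although the version 1 directory is still present further down in PATH.
PATH="$demo_root/v2/bin:$PATH"
second=$(demo-tool)

echo "after first load:  $first"
echo "after second load: $second"

# Restore the original PATH and clean up.
PATH=$saved_path
rm -rf "$demo_root"
```

The shell always runs the first matching executable it finds in PATH, so the order of the directories decides which version you get.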

Notes:

The general directory structure in /software is the following: /software/<category>/<package>/<version>
Module files are all located in /software/module/<category>

Exercise 3: Basic job submission via LSF command line client (bsub)

Objective: submit a simple test job to get familiar with the bsub command. Understand the difference between the front-end machine (from where jobs are submitted) and the cluster node machines (where jobs are executed). Understand the difference between the production front-end machine (frt.vital-it.ch) and the development front-end machine (dev.vital-it.ch).

1. Log in to the development front-end machine dev.vital-it.ch. The development front-end (dev.vital-it.ch) and the production front-end (frt.vital-it.ch) machines have access to exactly the same pool of machines, but running short scripts for testing purposes is allowed only on the development machine, and this is what we will do in this exercise.

2. Change directory into ./exercise3, where you should find a shell script named "simple.sh".

3. Look at the content of the "simple.sh" script (e.g. using the "cat" or "less" command). As its name suggests, the simple.sh script is extremely simple and does only one thing: display the name of the host executing the script on the standard output.

4. Submit the shell script "simple.sh" to the cluster using the bsub command.

Question:

Did you get the output of the simple.sh script you submitted? If not, why not?

5. Submit the shell script to the cluster again, but this time using the following command: bsub -I < ./simple.sh . The "-I" (capital i) option tells LSF to redirect the standard output stream of the node executing the job to the terminal of the front-end machine that submitted the job. You should now be able to see the output of your job and answer the questions below.

Questions:

What is the name of the machine that executed your simple.sh script?
Compare your result to those obtained by your neighbors. Are they the same? Should they be the same?

6. Now we will also run the script "simple.sh" directly on the front-end machine. This is only allowed because we are on the development front-end (it is not allowed to run software and scripts directly on the production front-end). Try to run the script by typing "./simple.sh"

Questions:

Why did the command ./simple.sh fail? How can the problem be fixed? (Hint: use the interpreter used in simple.sh to call the script.)
Compare the result you obtained from running the simple.sh script on the front-end machine to those you obtained when submitting the script to the cluster.

Note: In this exercise we could also use the syntax "bsub -o simple.out < ./simple.sh" to get the standard output of our job into a file named "simple.out". While this is a perfectly valid syntax, it is not the one we generally recommend for more complex scripts, where we advise you to always put all bsub options directly into the script that you submit using "#BSUB" (e.g. #BSUB -o simple.out). Embedding all options into the script is better in terms of reproducibility as all options that were used are kept together with the script.
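As an illustration of the note above, here is a minimal sketch of a script with the output option embedded as a "#BSUB" line. The file name hostname_job.sh is hypothetical, and the body is only a guess at what a script like simple.sh might contain. Because "#BSUB" lines are ordinary comments for the shell, the script also runs unchanged outside LSF.

```shell
#!/bin/bash
# Write the job's standard output to the file simple.out.
#BSUB -o simple.out

# Payload: print the name of the machine that runs the script.
# (uname -n is a fallback in case the hostname command is missing.)
node=$(hostname 2>/dev/null || uname -n)
echo "Running on: $node"
```

Saved as hostname_job.sh, it would be submitted with "bsub < ./hostname_job.sh"; no option is needed on the command line because bsub reads the "#BSUB" lines from the script.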

Exercise 4: Submit a job using a bioinformatics program

Objective: learn to submit a job that requires loading software.

1. Change directory into the ../exercise4 directory and have a look at the script named "blast.sh".

2. If you think the script is OK, submit it to the cluster. If not, modify it before submitting it.

Questions:

Does the original script complete successfully? If not, what is the problem, and how can it be fixed?

Note: In the script blast.sh we use the database "swiss" (i.e. Swiss-Prot). There are many more databases and respective indices that can be used. To see other databases available on the Vital-IT cluster, visit https://www.vital-it.ch/services/blast

Exercise 5: Basic bsub options and e-mail notification.

Objective: learn to use the basic bsub options (standard output and error streams redirection to file, email notification, and giving a name to your job).

1. Now that you are familiar with working on the front-end machine of the cluster (and know that you are not supposed to run programs/scripts directly on it), you can log in to the production front-end machine of Vital-IT: ssh <userName>@frt.vital-it.ch

2. Change directory into ../exercise5.

3. The directory hosts a file named "AF311055_prot.seq". This file contains a protein sequence (a list of amino acids).

4. You are asked to write a script that will align the protein sequence in "AF311055_prot.seq" against a reference database called "swiss" using the program "blastp". The command to run "blastp" on the protein sequence is the following:

blastp -db "swiss" -query AF311055_prot.seq

5. In addition, your script should also do the following:

Run in the "priority" queue when submitted to the cluster.
Give the name "blastRun" to the job when submitted to the cluster.
Send its standard output to a file named "blastRun.out", to be saved in your home directory.
Send its standard error stream to a file named "blastRun.err", to be saved in your home directory.
Send a notification email to your mailbox.

Questions:

Show your complete script "blastRun.sh".
What output do you get in "blastRun.out"?
Is the file "blastRun.err" empty? Why, or why not?

Hints:

Check the template script "HPC_bsub_script_template.sh" and the lecture notes to find out which bsub options (#BSUB) you should use.
All options for bsub should be embedded directly in the script.
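For orientation, the sketch below shows the general shape of a script with embedded bsub options. It is not the course template and deliberately uses other names than the ones asked for in this exercise: the job name "demoJob", the file names and the mail address are placeholders. The flags -q, -J, -o, -e, -u and -N are standard bsub options; check the lecture notes for the ones recommended at Vital-IT.

```shell
#!/bin/bash
# Queue to run in ("normal" is one of the queue names mentioned in these exercises).
#BSUB -q normal
# Job name, as displayed by LSF tools such as bjobs.
#BSUB -J demoJob
# Files that receive the job's standard output and standard error streams.
#BSUB -o demoJob.out
#BSUB -e demoJob.err
# Mail address (placeholder) and request for a job report by e-mail when the job ends.
#BSUB -u user@example.org
#BSUB -N

# Payload: one line on each stream, so that both files can be checked afterwards.
out_msg="this line goes to standard output"
err_msg="this line goes to standard error"
echo "$out_msg"
echo "$err_msg" >&2
```

Relative file names given to -o and -e are resolved from the directory where the job is submitted, so a full path is needed to place the files elsewhere.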

Exercise 6: Running jobs with more memory

Objective: learn the LSF options to request additional memory when running a job on the cluster.

In this exercise you will run a program called "FastQC" on the cluster. FastQC is a program that provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. More information about FastQC can be found here: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

1. Change directory into ../exercise6. It contains a script named "fastqc.sh"; have a look at it and answer the following questions:

Questions:

What does the "%J" variable in the #BSUB option do? Why is it useful to add this variable?
What does the $LSB_JOBID variable do in the main part of the script? Why is it useful?
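If you want to experiment with these two variables outside of fastqc.sh, a small sketch such as the one below can help. The file and directory names are hypothetical, and the fallback value "local" is an assumption of this sketch so that the script also runs when it is started by hand rather than by LSF.

```shell
#!/bin/bash
# %J is replaced by LSF with the job ID when the output file is created.
#BSUB -o demo_%J.out

# LSB_JOBID is set by LSF in the environment of a running job.
# Outside LSF it is unset, so fall back to the word "local".
job_id=${LSB_JOBID:-local}
result_dir="results_${job_id}"

mkdir -p "$result_dir"
echo "results for job $job_id go to $result_dir"
```

Submit it twice with "bsub < ./script_name.sh" and compare the files and directories that were created.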

2. Add an option so that the script requests 6 GB of memory, and also an option so that it runs in the "priority" queue (we can do this because it takes less than 30 minutes to complete), then submit the script to the cluster.

Questions:

How much memory was used for the job to complete? Could the job be run with less than 6 GB of memory?
What is the default memory allocation and the memory limit on the "priority", "normal" and "long" queues?
How much CPU time was used to complete the job?
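The sketch below shows one common way of writing a memory request with embedded bsub options. The flags -R "rusage[mem=...]" (memory to reserve) and -M (memory limit) are standard LSF options, but their units depend on how the cluster is configured: this sketch assumes MB for rusage[mem] and KB for -M, which you should confirm in the lecture notes before relying on it. Under that assumption, 6000 MB is about 6 GB.

```shell
#!/bin/bash
#BSUB -q priority
# Memory to reserve on the execution host. ASSUMPTION: value is in MB.
#BSUB -R "rusage[mem=6000]"
# Memory limit for the job. ASSUMPTION: value is in KB.
#BSUB -M 6291456

# Unit check for the -M value above: 6 GB expressed in KB.
mem_gb=6
mem_kb=$((mem_gb * 1024 * 1024))
echo "$mem_gb GB = $mem_kb KB"
```

Requesting much more memory than a job needs can delay its start, because LSF has to find a host with that amount free.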

Exercise 7: Running multi-threaded software on the cluster.

Objective: learn the LSF options to correctly run multi-threaded software on the cluster.

1. Change directory into ../exercise7 and look at the script named "tblastx_1c.sh". This script runs the program "tblastx" on a protein sequence.

2. Submit "tblastx_1c.sh" to the cluster.

3. "tblastx" is a program that supports multi-threading, but the command as used in the script "tblastx_1c.sh" runs on a single thread only. Create a copy of the script and rename it "tblastx_4c.sh", then modify it so that it runs using 4 threads. In order to achieve this, you will need to:

modify the #BSUB options in the script so that LSF allocates 4 cores to your job. Important: make sure that all the cores are located on the same host!
modify the blastall command so that it uses 4 threads (read the blastall documentation to find out which parameter sets the number of threads to use. You can access the documentation for blastall with the command blastall --help).
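The two modifications can be sketched as follows; compare it with your own solution rather than copying it. The options -n (number of cores) and -R "span[hosts=1]" (all cores on one host) are standard bsub options, and -a is the legacy blastall option for the number of processors. LSB_DJOB_NUMPROC is a variable that LSF sets inside a job to the number of allocated cores; the fallback value 4 is an assumption of this sketch so that it also runs outside LSF. The database and query options of the real script are not reproduced here.

```shell
#!/bin/bash
# Ask LSF for 4 cores ...
#BSUB -n 4
# ... and require that all of them are on the same host.
#BSUB -R "span[hosts=1]"

# Use as many threads as cores were allocated (4 when run outside LSF).
threads=${LSB_DJOB_NUMPROC:-4}

# Only print the command here; the real script would execute it.
echo "would run: blastall -p tblastx -a $threads (plus the options of the original script)"
```

Keeping the thread count tied to the allocation avoids running more threads than the cores that LSF reserved for the job.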

4. Submit "tblastx_4c.sh" to the cluster, then answer the questions below:

Questions:

What are the completion times of the "tblastx_1c.sh" and "tblastx_4c.sh" scripts? Compare both the wall time and the CPU time.
For the particular case of this script, is it worth adding more threads?
Do you see any potential drawbacks in requesting more CPUs in a script? (Hint: there are some.)

Note: When software has little or no documentation, it is not always easy to know whether it is multi-threaded. One way to find out is to observe directly how much CPU the program consumes, using a command such as "top". As an exercise, try to run the blastall commands from your "tblastx_1c.sh" and "tblastx_4c.sh" scripts directly on the dev.vital-it.ch machine. Then run "top -u $USER" in a second terminal window (on the same machine): you can see how many threads are used by looking at the "%CPU" column (100% = 1 thread, 200% = 2 threads, etc.).

Last modified: Friday, 8 September 2017, 3:44 PM