Objectives and hints

Below you will find a set of exercises of varying difficulty: choose some of them according to your experience (basic, advanced) and interests.

Exercises 5 and beyond are for advanced users, but after this course we expect all of you to be advanced users :-)

Exercise 1: Discover software applications installed on the cluster and how to use them

A) Graphical interface

Goal: get familiar with the organisation of software tools (genetics, proteomics, sequence analysis, UHTS etc.)

  • Use the link below to see which software applications are installed on the clusters:

 

  • Click on a few applications that you plan to use or would like to try, and see how they need to be executed.
    (if you can't think of any, you can use: tophat, bowtie, samtools, blast, etc.)
  • Check which modules you need to load in order to execute the applications

B) Command line

vit_soft

Try to find tools using the command ‘vit_soft’ (use options -s, -m, …)

Module

Goal: get basic experience with the command "module" to use software tools in your scripts.

  • If not already done, log into the front-end node using your UNIX userid and password.
  • In /software/, browse the different categories
  • There is a package called "tophat2" in the category "UHTS". How can you enable it?
  • Get familiar with the module subcommands use, add, list, avail and rm (see course slides)
Hints:
  • General directory structure: /software/<category>/<package>/<version>
  • Use the web interface if necessary, e.g. to see the available versions or the command to load or unload a package
  • module files are installed in /software/module/<category>
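For example, a command-line session to enable tophat2 might look like the sketch below (the exact module path and version are assumptions - check the web interface or "module avail" for the real ones):

```shell
module avail                              # list all available modules
module add UHTS/Aligner/tophat/2.0.13     # enable tophat2 (hypothetical path/version)
module list                               # confirm that it is loaded
module rm UHTS/Aligner/tophat/2.0.13      # unload it again
```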

 

Exercise 2: Basic Job Submission via LSF command line client (bsub)

A) Script displaying basic information

Goal: Submit a simple test job to get familiar with the bsub command line client and basic usage of the cluster.

Task:

  • From the /scratch/cluster/daily/<username>/examples-files folder, submit a simple test job that shows the date/time and hostname and lists the files in the directory $HOME.
    • you can edit the shell script simple.sh
    • (Optional) Select the priority queue for your job to get a quick response (note that we use the priority queue for testing small jobs only!)

Hints:

  • You can use the basic shell commands date and hostname. If you are not familiar with them, just type them on the command line. Next, write a very basic shell script - or modify the simple.sh script - and submit it using bsub.
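As a sketch, a minimal simple.sh along these lines could look as follows (the #BSUB option names are standard LSF; the "priority" queue name comes from the task above):

```shell
# Create a minimal job script; the heredoc writes it to simple.sh.
cat > simple.sh <<'EOF'
#!/bin/bash
#BSUB -J simple-test        # job name
#BSUB -o simple-%J.out      # stdout file (%J = job id)
#BSUB -e simple-%J.err      # stderr file
date                        # current date/time
hostname                    # execution host
ls "$HOME"                  # list files in your home directory
EOF
chmod +x simple.sh
# Submit it on the cluster:        bsub < simple.sh
# Optional quick-response queue:   bsub -q priority < simple.sh
```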

B) Script to execute a BLAST query


Goal: Get experience with the program BLAST (the executable is called blastall)

Task:

  • Go to the /scratch/cluster/daily/<username>/exercise_blast/ folder, then check the content of the file blast.sh
  • If you think that the script is OK, execute it with the job submission system. If not, modify it first and execute it afterwards.

Hints:

  • In the script blast.sh we use the database "swiss" (i.e. Swiss-Prot). Many more databases and their corresponding indices are available. If you want to see the other existing databases, use: 
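For orientation, a legacy blastall protein search against the "swiss" database generally has this shape (the query file name is hypothetical; the blast.sh on the cluster is the authoritative version):

```shell
# Hypothetical query file; "swiss" is the database named in the hint above.
blastall -p blastp -d swiss -i my_query.fa -o my_query.blastp.out
```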

Exercise 3: Job Submission via LSF command line client and e-mail notification


Tasks:

  • From the /scratch/cluster/daily/<username>/ folder, submit a simple test job which sends an e-mail notification to your mailbox (create a shell script and potentially use the application from Exercise 2b).

Hint:

  • In case you need help with an application, try to use BLAST.
    Run BLAST against a few selected sequences from UniProt, e.g. use protein P12344 as input: http://www.uniprot.org/uniprot/P12344.fasta
    Save your selected sequences from UniProt in the /scratch/cluster/daily/<username>/exercise_blast/ folder.
  • In case you need help with bsub arguments, get familiar with the user manual (man pages):

    man bsub
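As a sketch, a job script requesting e-mail notification can use the standard bsub options -N (send a job report by e-mail) and -u (address); the address below is a placeholder:

```shell
# Create a minimal job script that requests an e-mail notification.
cat > notify.sh <<'EOF'
#!/bin/bash
#BSUB -J notify-test
#BSUB -N                          # send a report by e-mail when the job ends
#BSUB -u first.last@example.org   # placeholder address -- use your own
#BSUB -o notify-%J.out
date
hostname
EOF
# On the cluster: bsub < notify.sh
```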

Exercise 4: Check cluster and job status


Overall cluster status:


Questions:

  • What is the average load on the cluster?
  • How many CPUs and how many hosts are available? What's the difference between CPU and host?

Queue status:


Questions:

  • How many jobs are currently running? (Hint: alternatively to the Web interface you can also use the following command: bjobs -u all)
  • How many jobs are pending?
  • Which queue has the most running jobs?
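All of the questions in this exercise can also be answered from the command line with standard LSF tools:

```shell
lsload         # current load on each host
bhosts         # hosts and their job slots
bqueues        # per-queue counts of pending (PEND) and running (RUN) jobs
bjobs -u all   # every user's jobs; add -p to list only pending ones
```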

Exercise 5: Single analysis example

FastQC is a program that provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines.

  • Look carefully at the file 'exercise_ga/fastqc.sh'
  • Run it on the cluster
  • Find _all_ the output files - including the error file.
  • How much "CPU time" was used to complete the job?

(Some information about the results will be given during the practicals)
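The CPU-time question can be answered with LSF's accounting commands once the job has finished (the job id below is a placeholder):

```shell
bjobs -l 123456   # detailed job information while the job is still known to LSF
bacct 123456      # accounting summary, including CPU time, after completion
```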

Task:

  • Based on the 'exercise_ga/fastqc.sh' file, create your own script using your favorite program and your input data and run it on the cluster.

Exercise 6: Experience a real world usage (Genome Analysis: Genome Assembly)

A short scientific background on Genome Assembly will be given before the exercise begins.

Scenario:
Some biologists want to assemble a new genome.
After DNA shearing and sequencing, they received the following files from the sequencing facility:

  • s_1_1_sequence.txt
  • s_1_2_sequence.txt

Now, you are in charge!

Prerequisite - Input data:

  • we already downloaded the S. aureus reference genome here: /db/HPC-course/ga/CP001844.fa
  • we already put the following files in the same directory: /db/HPC-course/ga/
    • soap.config - configuration file used by SOAPdenovo
    • sort_contigs.pl - a small Perl script to sort contigs and keep only those > 500 bp
  • we placed the files from the sequencing facility here:
    /db/HPC-course/ga/s_1_1_sequence.txt
    /db/HPC-course/ga/s_1_2_sequence.txt

Goals:

  • Understand the impact of large data (data from sequencing machine to Vital-IT cluster) - potentially several TBs! (experiment with real data)
  • How to efficiently process the data and finally store the results
  • Experience and discuss real world issues: pipelines, run programs locally or on the cluster, multi-thread software, etc.

Tasks:

  • Copy and extract the scripts from the following archive into your /scratch/cluster/daily/<username>/ directory: /db/HPC-course/ga/hpc-ga-ex-scripts.tar.gz
  • Run the following scripts and commands sequentially from your /scratch/cluster/daily/<username>/ directory - note that some results may be discussed
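With <username> taken from the $USER environment variable, the copy-and-extract step might look like this sketch:

```shell
cp /db/HPC-course/ga/hpc-ga-ex-scripts.tar.gz /scratch/cluster/daily/$USER/
cd /scratch/cluster/daily/$USER/
tar xzf hpc-ga-ex-scripts.tar.gz   # extract the exercise scripts
```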

A) Quality Control - QC
1a-quality-control.sh
2-data-cleaning.sh
1b-quality-control-filtered.sh
[exercise: modify script 1b to use your filtered sequences]

B) Assembly
3-assembly.sh
[exercise: modify script 3 to use your filtered sequences]
[exercise/ADVANCED: modify script 3 to use an array for the K parameter: 21 and 33]
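For the ADVANCED variant, an LSF job array can map the array index to the K value; a sketch (the assembly command itself is a placeholder - adapt 3-assembly.sh):

```shell
# Create a job-array script that runs once per K value (21 and 33).
cat > 3-assembly-array.sh <<'EOF'
#!/bin/bash
#BSUB -J "assembly[1-2]"          # two sub-jobs, indices 1 and 2
#BSUB -o assembly-%J-%I.out       # %I = job array index
KVALUES=(21 33)
K=${KVALUES[$((LSB_JOBINDEX - 1))]}   # map index 1->21, 2->33
echo "running assembly with K=$K"     # placeholder for the real SOAPdenovo call
EOF
# On the cluster: bsub < 3-assembly-array.sh
```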


C) Assembly evaluation [optional]

By metrics

Look at the file(s) XX.scafStatistics (with XX=21 or 33)
(these should be in /scratch/cluster/daily/<username>/soapdenovo_XXXXXX)

By mapping against the reference genome

## In /scratch/cluster/daily/<username>/,
## create a directory named 'nucmer_XX' and move into it:
mkdir nucmer_XX
cd nucmer_XX/

## replace the 'X' in the commands below to fit your data
perl /db/HPC-course/ga/sort_contigs.pl -b -m 500 -p -z ../soapdenovo_XXXXXX/graph_XX.scafSeq mybestsorted_XX.fa

## nucmer belongs to the MUMmer module and is not in your PATH ...
nucmer /db/HPC-course/ga/CP001844.fa mybestsorted_XX.fa -p ref-vs-mybestXX
mummerplot --large --layout ref-vs-mybestXX.delta --postscript

Transfer the file(s) out.ps to your laptop and display them with your favorite graphics program

  • [optional/ADVANCED] Create your own script merging the above commands from point C
  • Group discussion about the exercise

Exercise 7 (Advanced): Experience a real world usage (Genome Analysis: RNA-Seq example)

A short scientific background on RNA-Seq will be given before the exercise begins.

Scenario:
Some biologists want to study the effects of a new drug on the mouse transcriptome using RNA-Seq analysis.
They collected RNA from murine white blood cells after 6 h of treatment and sent it to a sequencing facility.
After a couple of days, they received the M_800.fastq file. Now, you are in charge of the analysis!

Prerequisite - Input data:

  • we already downloaded the bowtie2 mouse index - the mm10 version of the reference genome in the format used by bowtie2
    (it is now in /db/HPC-course/bowtie2_mouse_index/)
  • we already downloaded the mm10.gtf gene features file, which is needed by htseq-count
    (it is now in /db/HPC-course/mm10.gtf)
  • and we placed the file from the sequencing facility here:
    /db/HPC-course/RNA-Seq_reads/M_800.fastq

Goals:

  • Understand the impact of large data (data from sequencing machine to Vital-IT cluster) - potentially several TBs! (experiment with real data)
  • How to efficiently process the data and finally store the results (pipeline using tophat, bowtie, samtools and htseq-count)
  • Learn the benefit of using multiple threads (or not! :-)

Tasks:

  • Copy and extract the scripts from the following archive in your /scratch/cluster/daily/<username>/ directory: /db/HPC-course/ga/hpc-ga-ex-scripts.tar.gz
  • Take a careful look at the rnaseqanalysis_800k_1.sh shell script and compare it to rnaseqanalysis_800k_6.sh:
    analyze the different parts of the script: the BSUB options, the various controls, and the different programs and commands
  • Some of the participants will execute the rnaseqanalysis_800k_1.sh shell script while the others will execute rnaseqanalysis_800k_6.sh
    (At this point, a quick review will be given to clarify some points for people not familiar with the RNA-Seq analysis programs)
  • Group discussion about the exercise

Exercise 8 (Advanced): Get experience with a multi-threaded test program

Goal:

  • correctly run a multi-threaded application on the cluster using the correct LSF parameters.

Please check the two programs and see how many threads they are using:

If you run them on dev.vital-it.ch, use the command top to see how much CPU each program uses.

Task:

  • Based on the number of threads used, write an LSF script to launch the two programs above on the cluster. Note that you need to specify the number of cores. Keep in mind that you want to use one thread per core!
  • Can you use the same LSF script to launch both programs? Is there a potential problem with efficiency of cluster usage?
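As a sketch, such an LSF script reserves one core per thread with the -n option and keeps all cores on one host (4 threads here is only an example - use the thread count you observed with top; the program name is a placeholder):

```shell
# Create a job script for a multi-threaded program.
cat > threaded.sh <<'EOF'
#!/bin/bash
#BSUB -J threaded-test
#BSUB -n 4                        # reserve 4 cores: one per thread
#BSUB -R "span[hosts=1]"          # threads share memory, so keep all cores on one host
#BSUB -o threaded-%J.out
./my_threaded_program             # placeholder for one of the two test programs
EOF
# On the cluster: bsub < threaded.sh
```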

Exercise 9 (Advanced): Using a multi-threaded BLAST program

Goal: correctly run a multi-threaded BLAST on the cluster using the correct LSF parameters.

From the /scratch/cluster/daily/<username>/exercise_blast/ folder, check the two tblastx shell scripts:

  • tblastx_1c.sh
  • tblastx_4c.sh

Execute both scripts and compare their completion times. Is it worth adding more threads with the program blastall?
How could you change the BLAST output to show the 10 best alignments?
Hints: blastall is not in the "NCBI-BLAST" module. What does the tblastx program do? Would it require only a small amount of memory to run, or would it be rather memory-intensive?
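If you end up tuning the scripts, these legacy blastall options are the relevant ones (the query and output file names below are hypothetical; a 4-thread run is shown only as an example):

```shell
# -a <N>  number of processors (threads) to use
# -b <N>  number of database sequence alignments to show
# -v <N>  number of one-line descriptions to show
blastall -p tblastx -d swiss -i query.fa -a 4 -b 10 -v 10 -o result.txt
```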

Exercise 10 (Advanced): array job

Create an array job with 10 sub-jobs. Each sub-job should produce 3 output files that are stored in a distinct directory under

/scratch/cluster/daily/<username>/<jobid>/<job-index>


Hint: You can use the environment variable LSB_JOBINDEX to indicate the job index in your LSF-batch script. Use LSB_JOBID to indicate the jobid.
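A sketch of such an array job, using the LSB_JOBID / LSB_JOBINDEX variables from the hint (BASE defaults to your scratch directory and can be overridden for local testing):

```shell
# Create the array-job script (10 sub-jobs, 3 output files each).
cat > array.sh <<'EOF'
#!/bin/bash
#BSUB -J "array-test[1-10]"              # 10 sub-jobs
#BSUB -o array-%J-%I.out
# BASE defaults to your scratch directory; override it to test locally.
BASE=${BASE:-/scratch/cluster/daily/$USER}
OUTDIR="$BASE/$LSB_JOBID/$LSB_JOBINDEX"  # one directory per sub-job
mkdir -p "$OUTDIR"
for i in 1 2 3; do
    hostname > "$OUTDIR/output_$i.txt"   # 3 output files per sub-job
done
EOF
# On the cluster: bsub < array.sh
```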


Exercise 11 (Advanced): Design an embarrassingly parallel job

Choose one of the following tasks:

  • Run many BLAST jobs in parallel, i.e. prepare and submit many jobs that work on different data sets in parallel.
  • Select a software application (hint: is it already pre-installed?), run several instances in parallel and merge the final result.
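For the BLAST task, one possible shape is sketched below (all file names are hypothetical; note that splitting by line count assumes short, uniform records - a real pipeline would split on FASTA record boundaries):

```shell
split -l 200 all_queries.fa chunk_                                   # split the query set into chunks
for f in chunk_*; do
    bsub -o "$f.log" blastall -p blastp -d swiss -i "$f" -o "$f.out" # one job per chunk
done
# once all sub-jobs have finished, merge the results:
# cat chunk_*.out > merged.out
```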








Last modified: Thursday, 3 December 2015, 9:53 AM