Practical 2 BLAST/BLAT
a) BLAST with nucleotides
1) Run blastn (at NCBI) with the mRNA NM_001261117 (you should now be able to find its FASTA sequence easily) against the NCBI Human Genomic + Transcript (Human G+T) database.
You find a first hit on chromosome 9 and a second hit on chromosome 1.
Which is the better one?
Tip: if NCBI is too slow or does not give you the answer, try the same search with BLAT at ENSEMBL against the human genome.
Can you now answer the question above?
Can you run a BLAT or BLAST against multiple species?
Can you download all hit sequences (at NCBI? at ENSEMBL?)
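When comparing the two hits from exercise a) 1), it can help to put the numbers side by side: percent identity alone can be misleading if one alignment is much shorter than the other. The sketch below parses BLAST tabular output (the standard 12-column layout) and ranks hits by bit score, then by alignment length. The two rows and the names `subject_A` and `subject_B` are invented for illustration and are not the real results of this exercise.

```python
from dataclasses import dataclass

# Standard 12-column layout of BLAST tabular output.
COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
           "qstart qend sstart send evalue bitscore").split()


@dataclass
class Hit:
    subject: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def parse_tabular(text):
    """Parse tab-separated BLAST hits, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        hits.append(Hit(row["sseqid"], float(row["pident"]), int(row["length"]),
                        float(row["evalue"]), float(row["bitscore"])))
    return hits


def rank_hits(hits):
    """Order hits by bit score (descending), then by alignment length."""
    return sorted(hits, key=lambda h: (h.bitscore, h.length), reverse=True)


# Hypothetical rows, invented for illustration only (not real results).
EXAMPLE = (
    "query1\tsubject_A\t99.5\t1200\t6\t0\t1\t1200\t500\t1699\t0.0\t2180\n"
    "query1\tsubject_B\t100.0\t300\t0\t0\t1\t300\t40\t339\t1e-150\t555\n"
)

for hit in rank_hits(parse_tabular(EXAMPLE)):
    print(hit.subject, hit.pident, hit.length, hit.evalue, hit.bitscore)
```

In this invented example the hit with 100% identity ranks second because it covers only a small part of the query, which is the kind of detail to check in your own results.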
b) BLAST with taxonomy limitations
1) Run a blastp of the protein below against NR (NCBI), limiting the organism to Cetartiodactyla.
>myprot1
MASGPGGWLGPAFALRLLLAAVLQPVSAFRAEFSSESCRELGFSSNLLCSSCDLLGQFSL
LQLDPDCRGCCQEEAQFETKKYVRGSDPVLKLLDDNGNIAEELSILKWNTDSVEEFLSEK
LERI
The query is a protein from camel, but strangely the matches in other Cetartiodactyla species all contain an additional sequence in the middle.
The first match is perfect (of course, it is the query sequence itself ;-). If you look closer, you will find another camel protein that carries an insertion identical to the one in the related species.
Strangely, this match has a lower score than other, poorer matches. How can you explain this result?
2) Redo the blastp (click on the "Edit and Resubmit" link), changing the "Compositional adjustments" parameter to "No adjustment".
The second camel sequence now appears correctly placed (and scored) in second position.
Which camel sequence seems correct to you?
3) Go back to your first blastp results (click on the top tab "Recent Results").
Align the first 17 proteins (discard only the last hit, which is unrelated to your results: e-value = 10).
Which camel protein do you trust more?
4) Run a similar blastp against UniProtKB (UniProt), then filter the results by taxonomy to Cetartiodactyla.
Select and align all 8 proteins. Can you explain the differences?
What do you think of the Bos mutus (Yak) sequence (L8IZ46_9CETA) and the second pig sequence (I3LH65_PIG)?
Follow the ENSEMBL link in the UniProt entry of this pig sequence to view the genomic region and try to understand the problem.
Additional exercises for those who are quick:
c) A difficult case
1) Try to run a blastp against NCBI NR with this protein:
>my_weird_prot
MMMNKCVVFKNFKQMASSRMRAQQLYQALGGGVGGGSDGGNGGGDGGGNGGRGGGGGTGG
NGGGSDGGSEGGGGGNRGGGSGGGGAGGSGGGSGGNEGGGGGGNGGDGGGNGGGGGGGGN
GGGGGGGGNGGGGGGSGHSGGGGGGGSGGGGGGSGRSGGGSGGGSGGGGVSNGGGGSGGG
NGGGGGGGGGGGGGGGSGGGNDGGSGGGGGRGRGSGGGGGGTGGGGGKN
NCBI BLAST rejects the protein. Why? Because it looks like a DNA sequence!
How can you proceed?
You can append a fake sequence, as long as it does not contain A, C, G or T (e.g., multiple copies of QWER).
How many hits do you get?
2) Try with UniProt Blast
It works at UniProt and you find more hits, but the results might not be relevant.
Try adding the filtering parameter. How many hits do you get now?
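To see why the server in exercise c) 1) may mistake this protein for DNA, you can measure how much of the sequence consists of letters that are also nucleotide codes. The sketch below is only an illustration: the exact rule NCBI applies is not given in this practical, so the code simply reports the fraction of A, C, G and T and shows that padding with QWER lowers it. The function names `nucleotide_like_fraction` and `pad_with`, and the default of 10 copies, are hypothetical choices.

```python
# Letters shared between the amino-acid and nucleotide alphabets
# (assumption: only A, C, G, T, as mentioned in the exercise text).
NUCLEOTIDE_LETTERS = set("ACGT")


def nucleotide_like_fraction(seq):
    """Fraction of residues that could also be read as A, C, G or T."""
    seq = "".join(seq.split()).upper()  # drop line breaks and spaces
    if not seq:
        return 0.0
    return sum(ch in NUCLEOTIDE_LETTERS for ch in seq) / len(seq)


def pad_with(seq, unit="QWER", copies=10):
    """Append copies of a unit that contains no A, C, G or T."""
    return "".join(seq.split()) + unit * copies


MY_WEIRD_PROT = """
MMMNKCVVFKNFKQMASSRMRAQQLYQALGGGVGGGSDGGNGGGDGGGNGGRGGGGGTGG
NGGGSDGGSEGGGGGNRGGGSGGGGAGGSGGGSGGNEGGGGGGNGGDGGGNGGGGGGGGN
GGGGGGGGNGGGGGGSGHSGGGGGGGSGGGGGGSGRSGGGSGGGSGGGGVSNGGGGSGGG
NGGGGGGGGGGGGGGGSGGGNDGGSGGGGGRGRGSGGGGGGTGGGGGKN
"""

before = nucleotide_like_fraction(MY_WEIRD_PROT)
after = nucleotide_like_fraction(pad_with(MY_WEIRD_PROT))
print(f"A/C/G/T fraction before padding: {before:.2f}")
print(f"A/C/G/T fraction after padding:  {after:.2f}")
```

Keep in mind that the padding becomes part of the query, so any hits to the fake QWER tail should be ignored when you interpret the results.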
d) PSI-BLAST
You have identified a new protein domain (BRCT in BRCA1_HUMAN) and you need to find a homologue of this domain in yeast.
>BRCT_DOMAIN
STERVNKRMSMVVSGLTPEEFMLVYKFARKHHITLTNLITEETTHVVMKTDAEFVCERTL
KYFLGIAGGKWVVSYFWVTQSIKERKMLNEHDFEV
1) Using PSI-BLAST against SwissProt at NCBI, can you identify a protein in yeast (Saccharomyces cerevisiae)? If yes, how many iterations did you use? Can you align all the matching sequences?
2) Try the same at MyHits.
Can you align all the matching sequences?
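The reason iterations matter in exercise d) is that PSI-BLAST builds a position-specific scoring matrix (PSSM) from the hits of each round and searches again with that profile. The sketch below is a heavily simplified illustration of the idea, not the actual PSI-BLAST procedure: it assumes an ungapped alignment, a uniform background frequency of 1/20 and a fixed pseudocount, whereas the real program uses sequence weighting and more refined statistics. The toy alignment `TOY` is invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Simplifications (assumptions): ungapped input, uniform background
    frequencies (1/20) and the same pseudocount for every residue.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Toy ungapped alignment, invented for illustration only.
TOY = ["VVMKT", "VVLKS", "IVMKT", "VIMRT"]
pssm = build_pssm(TOY)

print("profile-like sequence:", round(score(pssm, "VVMKT"), 2))
print("unrelated sequence:   ", round(score(pssm, "GGGGG"), 2))
```

Residues seen often in a column score above zero and unseen residues score below zero, which is how a profile can pick up distant homologues that a single query sequence misses.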