{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "In these exercises we will work with genome sequence records for some isolates of the SARS-CoV-2 virus. We will inspect the records, get familiar with their annotations and features. We will then extract nucleic acid sequence for the spike gene and predict the protein sequence.\n", "\n", "### 7.1\n", "\n", "You are given a text file named \"SARS-CoV-2_acc.txt\". In this file, Genbank accessions for a number of SARS-CoV-2 isolates from different locations around the world are listed. The accessions are separated by new lines. Open, read and parse this file into a list called `accessions`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many accessions are there? What is the first accession? What is the last accession?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.2\n", "\n", "\n", "**1.** For this and the following exercises we will work only with the **first accession**. Retrieve the sequence record for the first accesssion from Genbank using `Bio.Entrez`. Keep this record as a new variable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the `id`, `name` and `description` of this record?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confirm that the ID of this record matches the first accession in the accessions text file. Based on these information, can you guess in which country this isolate was identified?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.** How many entries are there in the `annotations` of this record?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the 'taxonomy' and the 'references' of this record. What is the title of the publication this sequence was published?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.** How many entries are there in the `features` of this record? Print the first feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a 'source' feature and usually there is a single 'source' feature for a record. It holds like the 'annotations' very useful information about the source of this record. Can you confirm the country of origin?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many different possible values are there for the `type` of these features and what are they?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many 'gene's does this viral genome have, according to the features? How many 'CDS's are defined in the features? Are the number of genes and CDSs same?\n", "\n", "*Hint*, dictionaries can be used to count multiple things within a single loop." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.3\n", "\n", "In this exercise, we will find the 'gene' and 'CDS' features for the spike protein. The spike protein (S protein) is a large type I transmembrane protein, which is highly glycosylated. Spike proteins assemble into trimers on the virion surface to form the distinctive \"corona\", or crown-like appearance. The gene is usually called the **S** gene and the protein product is called **surface glycoprotein**.\n", "\n", "**1.** First, write a function, which accepts a single argument, a record, which has to be of `SeqRecord` type. This function should loop over all features of this record and return a `list` of features whose `.type` is either **'gene' or 'CDS'**. You will use this function in all other questions of **7.3**!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this function, print the **keys** of the qualifiers for all 'gene' and 'CDS' features. What is the single **key** that can be found in all qualifiers?\n", "\n", "*Hint:* Qualifiers are a special kind of dictionary called OrderedDict. Their keys can be accessed by the `.keys()` method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.** Now, using the same function, print the `'gene'` qualifier for all. Can you spot the 'S' gene?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.** Using the `'gene'` qualifier from the previous exercise, print the `location` of all 'gene' and 'CDS' features for the 'S' gene. \n", "\n", "*Attention, the value of the `'gene'` qualifier is a `list`*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.4\n", "\n", "**1.** Create a new record by **slicing** the previous record using the coordinates of the 'S' gene. How many features does this 'S' gene record have?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.** Translate the 'S' gene (transcript in this case, as this is an RNA virus) into a new variable called `s_protein`. Which translation table should we use (a little biology)? Does the protein sequence contain a stop? If so, try to translate without a stop character. How many amino acids does this protein have?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7.5\n", "\n", "**1.** In **\"SARS-CoV-2_S.gb\"** file, you will find the GenBank sequence records of the 'S' gene for the first 50 accessions we used in previous exercises. Can you create a **fasta** file that contains the spike protein sequences for these? Try to keep their `.id` and avoid translating stop codons!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2**. Did you notice the *Warning* above, when we try to translate the first 50 accessions. It seems the length of one or more of our 'S' CDSs is not multiple of three. Can you find which one?" ] } ], "metadata": { "kernelspec": { "display_name": "steps_venv", "language": "python", "name": "steps_venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }