{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "In these exercises we will work with genome sequence records for some isolates of the SARS-CoV-2 virus. We will inspect the records, get familiar with their annotations and features. We will then extract nucleic acid sequence for the spike gene and predict the protein sequence.\n",
    "\n",
    "\n",
    "### 7.1\n",
    "\n",
    "You are given a text file named \"SARS-CoV-2_acc.txt\". In this file, Genbank accessions for a number of SARS-CoV-2 isolates from different locations around the world are listed. The accessions are separated by new lines. Open, read and parse this file into a list called `accessions`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"SARS-CoV-2_acc.txt\") as infile:\n",
    "    accessions = infile.read().splitlines()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many accessions are there? What is the first accession? What is the last accession?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 200 accessions\n",
      "First accession is: MT385458.1\n",
      "Last accession is: MT263453.1\n"
     ]
    }
   ],
   "source": [
    "print(\"There are {} accessions\".format(len(accessions)))\n",
    "print(\"First accession is:\", accessions[0])\n",
    "print(\"Last accession is:\", accessions[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.2\n",
    "\n",
    "\n",
    "**1.** For this and the following exercises we will work only with the **first accession**. Retrieve the sequence record for the first accesssion from Genbank using `Bio.Entrez`. Keep this record as a new variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import Entrez\n",
    "from Bio import SeqIO\n",
    "\n",
    "Entrez.email = \"bulak.arpat@unil.ch\"\n",
    "handle = Entrez.efetch(db=\"nucleotide\", id=accessions[0], rettype=\"gb\", retmode=\"text\")\n",
    "rec = SeqIO.read(handle, \"genbank\")\n",
    "handle.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the `id`, `name` and `description` of this record?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID=MT385458.1\n",
      "Name=MT385458\n",
      "Description=Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/CA-CZB0706/2020, complete genome\n"
     ]
    }
   ],
   "source": [
    "print(\"ID={}\\nName={}\\nDescription={}\".format(\n",
    "    rec.id, rec.name, rec.description)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Confirm that the ID of this record matches the first accession in the accessions text file. Based on these information, can you guess in which country this isolate was identified?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accession IDs do match: True\n",
      "Country of origin could be USA\n"
     ]
    }
   ],
   "source": [
    "print(\"Accession IDs do match:\", rec.id == accessions[0])\n",
    "print(\"Country of origin could be\", rec.description.split(\"/\")[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2.** How many entries are there in the `annotations` of this record?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 12 annotations associated with this record\n"
     ]
    }
   ],
   "source": [
    "print(\"There are\", len(rec.annotations), \"annotations associated with this record\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print the 'taxonomy' and the 'references' of this record. What is the title of the publication this sequence was published?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Viruses', 'Riboviria', 'Orthornavirae', 'Pisuviricota', 'Pisoniviricetes', 'Nidovirales', 'Cornidovirineae', 'Coronaviridae', 'Orthocoronavirinae', 'Betacoronavirus', 'Sarbecovirus']\n",
      "[Reference(title='Direct Submission', ...)]\n",
      "\n",
      "Method of submission: Direct Submission\n"
     ]
    }
   ],
   "source": [
    "print(rec.annotations[\"taxonomy\"])\n",
    "print(rec.annotations[\"references\"])\n",
    "print()\n",
    "print(\"Method of submission:\", rec.annotations[\"references\"][0].title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3.** How many entries are there in the `features` of this record? Print the first feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 55 features\n",
      "type: source\n",
      "location: [0:29861](+)\n",
      "qualifiers:\n",
      "    Key: collection_date, Value: ['2020-03-23']\n",
      "    Key: country, Value: ['USA: CA']\n",
      "    Key: db_xref, Value: ['taxon:2697049']\n",
      "    Key: host, Value: ['Homo sapiens']\n",
      "    Key: isolate, Value: ['SARS-CoV-2/human/USA/CA-CZB0706/2020']\n",
      "    Key: isolation_source, Value: ['nasopharyngeal/oropharyngeal swab']\n",
      "    Key: mol_type, Value: ['genomic RNA']\n",
      "    Key: organism, Value: ['Severe acute respiratory syndrome coronavirus 2']\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"There are\", len(rec.features), \"features\")\n",
    "print(rec.features[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a 'source' feature and usually there is a single 'source' feature for a record. It holds like the 'annotations' very useful information about the source of this record. Can you confirm the country of origin?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The place of origin for this isolate is USA: CA\n"
     ]
    }
   ],
   "source": [
    "print(\"The place of origin for this isolate is\", rec.features[0].qualifiers[\"country\"][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many different possible values are there for the `type` of these features and what are they?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 5 different values for feature types\n",
      "['source', 'gene', 'CDS', 'mat_peptide', 'stem_loop']\n"
     ]
    }
   ],
   "source": [
    "# Using a list\n",
    "feature_types = []\n",
    "for feature in rec.features:\n",
    "    if feature.type not in feature_types:\n",
    "        feature_types.append(feature.type)\n",
    "print(\"There are {} different values for feature types\".format(len(feature_types)))\n",
    "print(feature_types)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 5 different values for feature types\n",
      "{'source', 'stem_loop', 'gene', 'mat_peptide', 'CDS'}\n"
     ]
    }
   ],
   "source": [
    "# Optional - alternative using set\n",
    "feature_types_set = set([feature.type for feature in rec.features])\n",
    "print(\"There are {} different values for feature types\".format(len(feature_types_set)))\n",
    "print(feature_types_set)\n",
    "\n",
    "# Can you spot the differences between the list and set versions?\n",
    "# Sets are very useful counters to hold unique members\n",
    "# Try help(set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many 'gene's does this viral genome have, according to the features? How many 'CDS's are defined in the features? Are the number of genes and CDSs same?\n",
    "\n",
    "*Hint*, dictionaries can be used to count multiple things within a single loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'CDS': 12, 'gene': 11}\n"
     ]
    }
   ],
   "source": [
    "counter = {'CDS': 0, 'gene': 0}\n",
    "for feature in rec.features:\n",
    "    if feature.type in counter:\n",
    "        counter[feature.type] += 1\n",
    "print(counter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.3\n",
    "\n",
    "In this exercise, we will find the 'gene' and 'CDS' features for the spike protein. The spike protein (S protein) is a large type I transmembrane protein, which is highly glycosylated. Spike proteins assemble into trimers on the virion surface to form the distinctive \"corona\", or crown-like appearance. The gene is usually called the **S** gene and the protein product is called **surface glycoprotein**.\n",
    "\n",
    "**1.** First, write a function, which accepts a single argument, a record, which has to be of `SeqRecord` type. This function should loop over all features of this record and return a `list` of features whose `.type` is either **'gene' or 'CDS'**. You will use this function in all other questions of **7.3**!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_gene_cds_features(rec):\n",
    "    result = []\n",
    "    for feature in rec.features:\n",
    "        if feature.type in ['CDS', 'gene']:\n",
    "            result.append(feature)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using this function, print the **keys** of the qualifiers for all 'gene' and 'CDS' features. What is the single **key** that can be found in all qualifiers?\n",
    "\n",
    "*Hint:* Qualifiers are a special kind of dictionary called OrderedDict. Their keys can be accessed by the `.keys()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gene: ['gene']\n",
      "CDS: ['gene', 'ribosomal_slippage', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n",
      "gene: ['gene']\n",
      "CDS: ['gene', 'codon_start', 'product', 'protein_id', 'translation']\n"
     ]
    }
   ],
   "source": [
    "for feature in get_gene_cds_features(rec):\n",
    "    print(\"{}: {}\".format(feature.type, list(feature.qualifiers.keys())))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2.**  Now, using the same function, print the `'gene'` qualifier for all. Can you spot the 'S' gene?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gene: ['ORF1ab']\n",
      "CDS: ['ORF1ab']\n",
      "CDS: ['ORF1ab']\n",
      "gene: ['S']\n",
      "CDS: ['S']\n",
      "gene: ['ORF3a']\n",
      "CDS: ['ORF3a']\n",
      "gene: ['E']\n",
      "CDS: ['E']\n",
      "gene: ['M']\n",
      "CDS: ['M']\n",
      "gene: ['ORF6']\n",
      "CDS: ['ORF6']\n",
      "gene: ['ORF7a']\n",
      "CDS: ['ORF7a']\n",
      "gene: ['ORF7b']\n",
      "CDS: ['ORF7b']\n",
      "gene: ['ORF8']\n",
      "CDS: ['ORF8']\n",
      "gene: ['N']\n",
      "CDS: ['N']\n",
      "gene: ['ORF10']\n",
      "CDS: ['ORF10']\n"
     ]
    }
   ],
   "source": [
    "for feature in get_gene_cds_features(rec):\n",
    "    print(\"{}: {}\".format(feature.type, feature.qualifiers['gene']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3.** Using the `'gene'` qualifier from the previous exercise, print the `location` of all 'gene' and 'CDS' features for the 'S' gene. \n",
    "\n",
    "*Attention, the value of the `'gene'` qualifier is a `list`*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gene: [21549:25371](+)\n",
      "CDS: [21549:25371](+)\n"
     ]
    }
   ],
   "source": [
    "for feature in get_gene_cds_features(rec):\n",
    "    if feature.qualifiers['gene'][0] == \"S\":\n",
    "        print(\"{}: {}\".format(feature.type, feature.location))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.4\n",
    "\n",
    "**1.** Create a new record by **slicing** the previous record using the coordinates of the 'S' gene. How many features does this 'S' gene record have?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'S' gene record has 2 features\n"
     ]
    }
   ],
   "source": [
    "s_gene = rec[21549:25371]\n",
    "print(\"'S' gene record has\", len(s_gene.features), \"features\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2.** Translate the 'S' gene (transcript in this case, as this is an RNA virus) into a new variable called `s_protein`. Which translation table should we use (a little biology)? Does the protein sequence contain a stop? If so, try to translate without a stop character. How many amino acids does this protein have?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "S protein has 1273 amino acids\n"
     ]
    }
   ],
   "source": [
    "s_protein = s_gene.translate(to_stop=True)\n",
    "print(\"S protein has\", len(s_protein), \"amino acids\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.5\n",
    "\n",
    "**1.** In **\"SARS-CoV-2_S.gb\"** file, you will find the GenBank sequence records of the 'S' gene for the first 50 accessions we used in previous exercises. Can you create a **fasta** file that contains the spike protein sequences for these? Try to keep their `.id` and avoid translating stop codons!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"spike_proteins.fa\", \"w\") as fasta:\n",
    "    for rec in SeqIO.parse(\"SARS-CoV-2_S.gb\", \"genbank\"):\n",
    "        prot = rec.translate(to_stop=True)\n",
    "        prot.id = rec.id\n",
    "        SeqIO.write(prot, fasta, \"fasta\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2**. Did you notice the *Warning* above, when we try to translate the first 50 accessions. It seems the length of one or more of our 'S' CDSs is not multiple of three. Can you find which one?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "'S' gene from 'MT444558.1' has a length of 3719, which is not a multiple of 3\n"
     ]
    }
   ],
   "source": [
    "for rec in SeqIO.parse(\"SARS-CoV-2_S.gb\", \"genbank\"):\n",
    "    for feature in rec.features:\n",
    "        if feature.type == 'gene' and 'S' in feature.qualifiers['gene']:\n",
    "            s_length = feature.location.end - feature.location.start\n",
    "            if s_length % 3 != 0:\n",
    "                print(\"'S' gene from '{}' has a length of {}, which is not a multiple of 3\".format(\n",
    "                    rec.id, s_length))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "steps_venv",
   "language": "python",
   "name": "steps_venv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}