{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Biopython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's consider a typical task in bioinformatics: we are interested in finding the GC% for some sequences. For this we would need i) open the file, ii) parse and extract the sequence information, and iii) calculate and report their GC%. In a very simple example, we could write something like following: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq = {}\n", "identifier = None\n", "with open(\"my_sequences.fa\") as input_file:\n", " for line in input_file:\n", " if line[0] == \">\":\n", " identifier, *description = line.strip()[1:].split(\" \", 1)\n", " my_seq[identifier] = \"\"\n", " else:\n", " my_seq[identifier] += line.strip().upper()\n", "for identifier in my_seq:\n", " gc_count = my_seq[identifier].count(\"G\") + my_seq[identifier].count(\"C\")\n", " gc_pc = gc_count / len(my_seq[identifier]) * 100\n", " print(\"GC% for {} is {:.2f}\".format(identifier, gc_pc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see how it would look like if we used Biopython:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio import SeqIO\n", "from Bio.SeqUtils import GC\n", "for rec in SeqIO.parse(\"my_sequences.fa\", \"fasta\"):\n", " print(\"GC% for {} is {:.2f}\".format(rec.id, GC(rec.seq)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Broad examples of what we can do with Biopython\n", "\n", "- Sequence analysis\n", " - Motif\n", " - Search: HMMs, Alignments\n", " - Restriction\n", "- Structures\n", " - SCOP\n", " - PDB\n", "- Database query\n", "- Phylogeny\n", "- Pathway\n", "- And more ...\n", "\n", "## Main concept of Biopython\n", "\n", "- I/O interface and parsing abilities for bioinformatics files/DBs\n", " - Blast\n", " - FASTA\n", " - PubMed\n", " - SwissProt\n", "- Efficient and practical data-structures for bioinformatics data\n", " - Sequences\n", " - Alignments\n", " - Structures\n", "- Methods implementing bioinformatics analysis\n", " - Translation\n", " - Classification\n", " - Phylogeny trees\n", "- Interface to common bioinformatics programs\n", " - Standalone Blast\n", " - ClustalW\n", " - EMBOSS\n", "\n", "## Objects\n", "\n", "In Biopython there are many modules and ecah module contains several major new data types. Objects created with these types serve specific purposes in the above mentioned functionalities. We will focus on sequence analysis; some new objects we will discover are:\n", "\n", "- `Seq`\n", "- `Alphabet`\n", "- `SeqRecord`\n", "- `SeqFeature`\n", "\n", "## Help - important!\n", "\n", "- A relatively detailed tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html\n", "- Help on certain concepts and modules: https://biopython.org/wiki/Category%3AWiki_Documentation\n", "\n", "In this notebook we will cover basics of the Biopython package:\n", " - Seq object and alphabets\n", " - Bio.SeqIO module and SeqRecord object\n", " - Interacting with external DBs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `Seq` Object and (Alphabets)\n", "\n", "Seq object is a flexible encapsulator for _sequence-like_ strings. It has two main components:\n", "- A Python string representing a _biological sequence_\n", "- An Alphabet object describing the _letters_ used in this sequence.\n", "\n", "The related module is __Bio.Seq__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio.Seq import Seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create our first `Seq` object. The only required positional argument is _'data'_, which is used for the sequence string. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq = Seq('AGCGCGATTTATATATAGCGAGCGATTCGGAGCGATCGACGGATTCGAC')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing and slicing of `Seq`\n", "\n", "A `Seq` object is based on Python's `str` object, so most methods we can use with a `str` object, we can also use with a `Seq` object. Let's try to use indexing and slicing with our sequence:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"5th nucleotide is {}\".format(my_seq[6]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq[5:11], my_seq[16:22]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note that the result of slicing is a **new** `Seq` object!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to make a difference between indexing and single-element slicing. Let's have a *closer* look at the results of these two operations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using index\n", "my_seq[6]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# And using slice\n", "my_seq[6:7]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 1.\n", "\n", "Can you make a new `Seq` object combining the nucleotides 6-11 and 17-22. *Hint:* try to use slices" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 1.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Methods of `Seq` objects\n", "\n", "As `Seq` objects inherits many properties from `str`, many of its methods are available to us to use on our sequences." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq.count('TAT') # counts the occurance of the input" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq.lower()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq.index('TAT') # gives the index (0-based) of the first occurance of the input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the main difference from calling these methods on a `str` object is that they return new `Seq` objects as a result:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq.split('GAT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, being a specialized object, it **does not** support all `str` methods. And more importantly, it implements many **new** methods that relate to biological sequences. Let's explore some examples for both cases:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# .isdigit() is one of the string methods that is not supported by Seq objects\n", "\n", "my_seq.isdigit()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# on the other hand, .count_overlap() is one the new methods that Seq implements\n", "\n", "my_seq.count_overlap('TAT')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 2.\n", "Can you tell the difference between the `str` method `.count()` and the specialized `Seq` method `.count_overlap()`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 2.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# another example for Seq specific method is .reverse_complement()\n", "\n", "print(\"Reverse complement of my sequence is: {}\".format(my_seq.reverse_complement()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> To see all methods and their descriptions you can always visit the documentation pages or simply use the help function whenever needed: `help(Seq)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mutability of `Seq` objects\n", "\n", "Just like the `str` object, the `Seq` object does not support item assignment; in other words, it is __immutable__." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq[6] = 'T'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, in biology, **mutations are fundamental**. An important variant of the `Seq` object is the `MutableSeq` object. There are two ways of creating a `MutableSeq` object. We can convert our normal `Seq` object into a mutable one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mutable_seq = my_seq.tomutable()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, we can import the `MutableSeq` class and instantiate ourselves:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio.Seq import MutableSeq\n", "mutable_seq = MutableSeq('AGCGCGATTTATATATAGCGAGCGATTCGGAGCGATCGACGGATTCGAC')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mutable_seq[6] = 'T'\n", "mutable_seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 3.\n", "Is point mutation the only type of mutation supported by the `MutableSeq` object? Can we introduce insertions/deletions by simple item assignment? Try it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 3.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Alphabets\n", "We have mentioned that `Seq` objects also contained an `Alphabet`. The purpose of the `Alphabet` class is to provide a consistent data structure and functionality around genetic code contained in different molecule types, such as DNA, RNA, protein." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq.alphabet" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_seq.back_transcribe() # back_transcribe() is a method of Seq objects, 'transcribing back' a RNA -> DNA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Best and simplest way to describe our sequence at the time of instantiation is to use the `Bio.Alphabet` module. IUPAC stands for 'International Union of Pure and Applied Chemistry' and the class `IUPAC` contains many well-defined alphabets to use with our sequences as described here: https://www.bioinformatics.org/sms2/iupac.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio.Alphabet import IUPAC\n", "\n", "seq_types = [\n", " \"extended_protein\", \"ambiguous_dna\", \"protein\", \"ambiguous_rna\",\n", " \"unambiguous_dna\", \"extended_dna\", \"unambiguous_rna\"]\n", "for seq_type in seq_types:\n", " letters = getattr(IUPAC, seq_type).letters\n", " print(\"Letters in {}: {}\".format(seq_type, letters))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_dna = Seq('AGGTCNNNNGCTWS', IUPAC.ambiguous_dna)\n", "\n", "# Be careful, our sequence is not validated against the letters.\n", "# That is, we can use letters that are not defined in the alphabet,\n", "# and yet it will not complain about it.\n", "\n", "some_dna = Seq('AGGTCNNNNGCTWS', IUPAC.unambiguous_dna) # See above the letters defined for 'unambiguous_dna'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a function that would compare the letters used in the sequence to the letters described in its alphabet and return a `set` of letters that normally would not have been allowed:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def undef_letters(seq):\n", " allowed_letters = set(seq.alphabet.letters)\n", " seq_letters = set(seq)\n", " return seq_letters - allowed_letters\n", "\n", "print(\"Undefined letters used in 'my_dna':\", undef_letters(my_dna))\n", "print(\"Undefined letters used in 'some_dna':\", undef_letters(some_dna))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see,\n", "- the `Alphabet` class is aimed at helping us to annotate the origin of a sequence\n", "- however, it is not *always* enforced\n", "- (and its functionality, sometimes, does not follow common sense)\n", "\n", "Its API recommends that we **should avoid** using the `Alphabet` **explicitly**:\n", "\n", "> The design of Bio.Aphabet included a number of historic design choices which, with the benefit of hindsight, were regretable. While the details remain to be agreed, we intend to remove or replace Bio.Alphabet in 2020. Please avoid using this module explicitly in your code. See also: https://github.com/biopython/biopython/issues/2046" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 'Biological' methods of `Seq` objects\n", "Let's now focus on `Seq` specific 'biological' methods:\n", "\n", "- complement()\n", "- reverse_complement()\n", "- transcribe()\n", "- back_transcribe()\n", "- translate()\n", "\n", "These methods do not *require* any other extra arguments and they return a new `Seq` object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Complementing and reverse complementing sequences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "intronless_dna = Seq(\n", " 'CACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTA'\n", " 'TAGGATCCAAAGAGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCA', IUPAC.unambiguous_dna)\n", "print(\"intronless_dna:\", intronless_dna)\n", "print(\"its alphabet:\", intronless_dna.alphabet)\n", "\n", "# If we need to find its complement sequence:\n", "\n", "compl_seq = intronless_dna.complement()\n", "print()\n", "print(\"compl_seq:\", compl_seq)\n", "\n", "# If we would need to find the reverse complement of our sequence, we can simply\n", "\n", "rev_strand = intronless_dna.reverse_complement()\n", "print()\n", "print(\"rev_strand:\", rev_strand)\n", "print(\"alternatively:\", compl_seq[::-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Transcription and reverse transcription of sequences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's now transcribe this piece of DNA into RNA. This is an intronless DNA sequence, so\n", "\n", "rna_seq = intronless_dna.transcribe()\n", "print(\"RNA:\", rna_seq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 4.\n", "What `alphabet` does the newly transcribed rna_seq have?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 4.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can also reverse transcribe an RNA sequence to cDNA...\n", "cdna_seq = rna_seq.back_transcribe()\n", "cdna_seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Translation of sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Up to here, we have described an intronless DNA sequence, and 'transcribed' it into an RNA sequence, `rna_seq`. If we know where the CDS starts, we can also translate the CDS into a protein sequence. In this example, the start codon can be found at 53. position." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "cds_seq = rna_seq[53:]\n", "cds_seq" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "protein_seq = cds_seq.translate()\n", "protein_seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.translate()` method can take extra arguments to finetune the translation specifics:\n", "\n", "- **table** - Which codon table to use? Tables are based on [NCBI's Genetic Codes](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). Default is the \"Standard\" table.\n", "\n", "\n", "- **stop_symbol** - Single character string, what to use for terminators. Default is the asterisk, *\n", "\n", "\n", "- **to_stop** - Boolean, defaults to False meaning do a full translation continuing on past any stop codons (translated as the specified stop_symbol). If True, translation is terminated at the first in frame stop codon (and the stop_symbol is not appended to the returned protein sequence).\n", "\n", "\n", "- **cds** - Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.\n", "\n", "\n", "- **gap** - Single character string to denote symbol used for gaps. It will try to guess the gap character from the alphabet." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Codon tables can be defined by string names (NCBI genetic code table name)\n", "# or integers (NCBI table ids)\n", "codon_tables = [\n", " \"Vertebrate Mitochondrial\", # 2. The Vertebrate Mitochondrial Code\n", " 6 # 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code\n", "]\n", "\n", "# Stop symbols could be and single character string\n", "stop_symbols = [\"\\\\\", \"£\"] # Careful with \"\\\", we need to escape with a backslash \"\\\"\n", "\n", "# to_stop is boolean, either True or False\n", "to_stops = [True, False]\n", "\n", "print(\"{: <24}\\t{}\\t{}\\t{}\".format(\"Table\", \"Symbol\", \"to_stop\", \"peptide\"))\n", "for table in codon_tables:\n", " for stop_symbol in stop_symbols:\n", " for to_stop in to_stops:\n", " protein = cds_seq.translate(table=table,\n", " stop_symbol=stop_symbol,\n", " to_stop=to_stop)\n", " print(\"{: <24}\\t{}\\t{}\\t{}\".format(table, stop_symbol, to_stop, protein))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional, more advanced code\n", "\n", "# The nested for-loops in the above example can be combined efficiently using\n", "# the 'itertools' module. The 'product' method returns the cartesian product of\n", "# its arguments, equivalent to a nested for-loop\n", "\n", "import itertools\n", "header = [\"Table\", \"Symbol\", \"to_stop\", \"peptide\"]\n", "row = \"{: <24}\\t{}\\t{}\\t{}\"\n", "\n", "print(row.format(*header)) # (*) unpacks the list as arguments to the function!\n", "for table, stop_symbol, to_stop in itertools.product(codon_tables, stop_symbols, to_stops):\n", " protein = cds_seq.translate(table=table,\n", " stop_symbol=stop_symbol,\n", " to_stop=to_stop)\n", " print(row.format(table, stop_symbol, to_stop, protein))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### When alphabets become useful" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Alphabet` can sometimes prevent us from errenously applying `Seq` methods to types of sequences that would 'biologically' not support these actions. For example, if we try to reverse-complement a protein sequence:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "protein_seq.reverse_complement()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, if we try to concatenate two incompatible `Seq` objects together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# These would work\n", "for pair in [(intronless_dna, intronless_dna),\n", " (cds_seq, rna_seq)]:\n", " try:\n", " new_seq = pair[0] + pair[1]\n", " except TypeError as err:\n", " print(err)\n", " else:\n", " print(\"New length:\", len(new_seq))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# These would NOT work\n", "for pair in [(rna_seq, intronless_dna),\n", " (cds_seq, protein_seq)]:\n", " try:\n", " new_seq = pair[0] + pair[1]\n", " except TypeError as err:\n", " print(err)\n", " else:\n", " print(\"New length:\", len(new_seq))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alpahbets will also be different if the sequence translated had a stop codon or not. Consider the following sequence that has an internal stop codon 'TGA':" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "some_cds = Seq(\n", " 'ATGCCTATTGGATCCAAAGAGAGGCCAACATTTTTTTGAATTTTTAAGACACGCTGCAACAAAGCA',\n", " IUPAC.unambiguous_dna)\n", "some_cds.translate()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# split first - translate after\n", "stop_codon = some_cds.index(\"TGA\") + 3 # index where the stop codon ends\n", "cds1, cds2 = some_cds[:stop_codon], some_cds[stop_codon:]\n", "cds1.translate(), cds2.translate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 5.\n", "Try the opposite now. Translate `come_cds` and split the protein sequence by stop codon. Can you notice a difference in terms of the alphabets of the products, and if so why? *Hint:* remember the `.split()` method and stops are indicated by '*'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 5.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bio.SeqIO Module and SeqRecord Object\n", "\n", "The `Bio.SeqIO` module provides a simple and uniform interface to read and parse from and write to various sequence file formats (including multiple sequence alignments). The `Bio.SeqIO` module supports a large number of sequence file formats, including *fasta*, *fastq*, and *genbank* and many more, which can be found [here](https://biopython.org/wiki/SeqIO). All sequences in this method will be accessed as `SeqRecord` objects.\n", "\n", "### Reading sequence records from files\n", "\n", "The main function of the module is `.parse()`, which takes a file handle (or filename) and format name, and returns a `SeqRecord` iterator. An iterator is an object that can be iterated upon, meaning that you can traverse through all the values, one by one. In this case `Bio.SeqIO.parse()` method returns an iterator of `SeqRecord` objects extracted and parsed from the input file. We can then iterate over the records and process them in a very efficient manner." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's parse an example fasta file and iterate over its records\n", "\n", "from Bio import SeqIO\n", "for record in SeqIO.parse(\"example.fa\", \"fasta\"): # the for loop iterates over the SeqRecord objects\n", " print(record.id) # each SeqRecord object contains an ID" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, it could be handier to supply a filehandle instead of the filename to `.parse()` method. This also permits the use of different input resources such as streams. We can re-examine the previous example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"example.fa\", \"r\") as filehandle:\n", " for rec in SeqIO.parse(filehandle, \"fasta\"):\n", " print(rec.id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second argument of the `Bio.SeqIO.parse()` method defines the type of the input file and it is mandatory. A detailed list of possible values for this argument can be found on [the official documentation](https://biopython.org/wiki/SeqIO). Let's open another file, **example.gp**, that contains the same sequences as before but this time in GenBank format (full GenPept format to be more precise)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for rec in SeqIO.parse(\"example.gp\", \"genbank\"):\n", " print(rec.id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `SeqRecord` objects\n", "We have mentioned that the results of 'Bio.SeqIO' parsing methods are `SeqRecord` objects. We have already seen that these objects contain an `.id` attribute. A `SeqRecord` object holds, beside the sequence (as a `Seq` object), identifiers (ID and name), description and optionally annotation and sub-features. The completeness of a record will depend on its source." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SeqRecord` contains many attributes, some of which can be rather complex. We will further examine most commonly used ones:\n", "\n", "- seq\n", "- id\n", "- name\n", "- description\n", "- annotations\n", "- features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's grab the last record in our example file to use in our following cells\n", "records = list(SeqIO.parse(\"example.gp\", \"genbank\"))\n", "rec = records[-1]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# rec.seq is a `Seq` object containing the sequence and the alphabet\n", "rec.seq" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# rec.id is a string containing the sequence identifier\n", "rec.id" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# rec.name is a string containing the display name of the sequence\n", "# it may be identical to or quite different than rec.id !!\n", "rec.name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# rec.description is also a simple string\n", "rec.description" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# rec.annotations is dictionary that contains different annotations about the sequence record\n", "# When created from a GenBank source, certain keys can be expected to be present such as source, taxonomy\n", "print(\"record is from\", rec.annotations[\"source\"])\n", "print(\"and its full taxonomy is\", rec.annotations[\"taxonomy\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# rec.features is a list\n", "# It contains SeqFeature objects which we will discuss briefly later.\n", "# But in a nutshell, they hold information about various positional features identified on the sequence\n", "\n", "rec.features[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine the records contained in 'example.gb' file more closely." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index, rec in enumerate(SeqIO.parse(\"example.gp\", \"genbank\")):\n", " print(\"{} [{}]\\t{}\\tlength {}\\twith {} features and {} annotations\".format(\n", " index, rec.id, rec.description[:21], len(rec.seq), len(rec.features), len(rec.annotations)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 6.\n", "\n", "Similarly parse the **example.fa** file and compare the `SeqRecord` objects to the above cell from **example.gp**. Which attributes are identical? What are the differences? Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 6.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a `SeqRecord` directly. We need to import the class from its module, `Bio.SeqRecord`, first. In order to create its sequence we will also need the `Bio.Seq.Seq` or a similar `Seq` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio.Seq import Seq\n", "from Bio.SeqRecord import SeqRecord\n", "\n", "record = SeqRecord(Seq(\"ACGGCTATCTGAGGACTACGAGCATCATCGA\"),\n", " id=\"my_seq_ID\", name=\"made-up sequence\",\n", " description=\"Just some randomly typed ATGCs\")\n", "print(record)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing a SeqRecord\n", "\n", "A very useful feature of the `SeqRecord` object is that we can simply slice it just like slicing a `str` or `list`. Within the context of a list, we refer to splice positions as indices. In this case, they correspond to positions along the sequence. Just like the `Seq` object.\n", "\n", "We have to keep in mind that the positions are zero-based, that is a portion of the sequnce that maps to 10-50 bases would need to be coded as 9:50" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's grab the last record in our example file\n", "records = list(SeqIO.parse(\"example.gp\", \"genbank\"))\n", "last_record = records[-1]\n", "\n", "# We were told that the zinc binding site is from 48 to 178 (do not forget to convert to 0-based)\n", "# If you are interested in finding out by examining the features, see below \"Optional Material\"\n", "zinc_binding = last_record[47:178]\n", "print(zinc_binding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a `SeqRecord` is sliced:\n", "\n", "- A new `SeqRecord` is returned\n", "- The sequence is sliced but the id, name and description are kept unchanged\n", "- Only the features fall **completely** wihtin the new coordinates are kept (see Optional section for more information)\n", "- Positions of any feature inherited will be adjusted to the new coordinates\n", "- All positions are 0-based " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 7.\n", "\n", "Can you create a new sequence record containing the nucleotides 23-123 from the `last_record`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 7.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `SeqFeature` object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have seen that the `SeqRecord` object has an attribute called `.features` which is a list of `SeqFeature` objects. Let's understand this object a little bit better.\n", "\n", "`SeqFeature` object describes a feature along a sequence. Some of its useful attributes are:\n", "\n", "- `.location` which tells us the 0-based coordinates of the feature\n", "- `.type` which holds the type or the annotation of the feature\n", "- `.qualifiers` is an ordered dictionary (the keys are always returned in same order) that holds qualifying information about the feature. These are analogous to the qualifiers from a GenBank feature table.\n", "\n", "`SeqFeature` is defined within its own module: `Bio.SeqFeature`. For more detail on this object and the module, you can start with its [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc37) and [documentation](https://biopython.org/docs/1.74/api/Bio.SeqFeature.html).\n", "\n", "Let's dissect one `SeqFeature` object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's grab the third record in our example file\n", "rec = list(SeqIO.parse(\"example.gp\", \"genbank\"))[2]\n", "\n", "# Simply print features neatly \n", "for f_i, feature in enumerate(rec.features):\n", " print(\"Feature index:\", f_i)\n", " print(feature)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe:\n", "\n", "1) `.type` attribute contains useful information for selecting the right kind of feature we might be interested in. Its values are controlled - any information that is available in the database will be used. But usually databases such as NCBI and UniProt are vocabulary controlled and use always the same keywords such as 'source' and 'CDS'.\n", "\n", "2) `.qualifiers` has no strict structure - it is a `dict` that contains many relevant information about the `.feature`. The **only consensus** is that the values are always a list! We have to explore and figure out what we need/want from `.qualifiers` ourselves.\n", "\n", "Let's see how we access the data within a `SeqFeature` object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# continuing with the last feature object:\n", "\n", "# be careful value of location is not list, tuple etc\n", "print(\"its location:\", feature.location)\n", "print(\"its start:\", feature.location.start)\n", "print(\"and end:\", feature.location.end)\n", "print()\n", "print(\"its type:\", feature.type)\n", "print()\n", "print(\"its qualifiers:\", feature.qualifiers) # this is an OrderedDict, essentially a dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Question 8.\n", "\n", "Can you print which translation table (only integer code) should be used for translating this CDS sequence?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can use this cell for Question 8.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing to sequence files\n", "So far we have seen how to parse sequence files and work with sequence records. In many applications, we will also need to write our processed sequence records back into a standard sequence file. For this purpose, the `Bio.SeqIO` module has a `.write()` function. It can be thought as the reverse of the `.parse()` method we have learnt above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's create 10 random sequences each around 400 bases long. For this we can use the 'random' module\n", "\n", "import random\n", "\n", "records = []\n", "for i in range(10):\n", " random_atgc = random.choices(\"ATGC\", k=random.randint(380, 420))\n", " record = SeqRecord(Seq(\"\".join(random_atgc)),\n", " id=\"my_seq_{}\".format(i),\n", " name=\"random sequnce {}\".format(i),\n", " description=\"A randomly generated sequence\")\n", " records.append(record)\n", "SeqIO.write(records, \"random_sequences.fa\", \"fasta\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing Online Databases\n", "\n", "Biophyton introduces multiple means to interact with commonly used bioinformatics databases over the internet. We will cover the basics of accessing online databases with two databases:\n", "\n", "1. NCBI's Entrez\n", "2. UniProt/SwissProt's ExPASy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Entrez can be queried using the `Bio.Entrez` module. There many different functionalities, but today we will only cover the `.efetch()` method, which we can use to fetch records from Entrez over the internet.\n", "\n", "Let's discover this functionality with an example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio import Entrez\n", "from Bio import SeqIO\n", "\n", "Entrez.email = \"bulak.arpat@unil.ch\" # This is obligatory.You should use yours!\n", "\n", "with Entrez.efetch(\n", " db=\"protein\", # which Entrez database to use, nucleotide, protein,..\n", " rettype=\"gp\", # return-type: fasta, gb (genbank), gp (genpept)\n", " retmode=\"text\", # return-mode: text or xml (perhaps other options?)\n", " id=\"NP_001315822.1\" # the accession ID of the record we would like to fetch\n", ") as handle: # Once it is a handle, we know the rest from SeqIO...\n", " seq_record = SeqIO.read(handle, \"gb\")\n", "print(\"{} with {} features\".format(seq_record.id, len(seq_record.features)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with Entrez.efetch(\n", " db=\"protein\",\n", " rettype=\"fasta\", # if the return-type is set to 'fasta'\n", " retmode=\"text\",\n", " id=\"NP_001315822.1\"\n", ") as handle:\n", " seq_record = SeqIO.read(handle, \"fasta\") # we can only parse it as fasta\n", "print(\"{} with {} features\".format(seq_record.id, len(seq_record.features)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Bio.ExPASy` module provides functions to interact with the UniProt/SwissProt database. Unfortunately, after a recent update in the database structure, this functionality is not working as intended. We provide nevertheless an example, which hopefully can work again in close future." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio import ExPASy\n", "from Bio import SeqIO\n", "\n", "with ExPASy.get_sprot_raw(\"P48977\") as handle: # a simple .get_sprot_raw() function fetches records\n", " record = SeqIO.read(handle, \"swiss\") # which we can then parse with SeqIO\n", "print(\"Retrived {}: {}\".format(record.id, record.description))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now, you can try to solve the exercises.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional material\n", "\n", "The Biopython module is 'huge'! Under this section, we have included some more information on basic functionality which is not needed for the exercises." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### On `SeqRecord` iterators and processing large sequence files with `SeqIO`\n", "\n", "We have learnt how we can iterate over the sequence records contained in an input file using the `.parse()` method of `SeqIO`. There are some important limitations that come with iterators. Below, we will explore some of there limitations, how we can overcome them but at a cost in memory usage. Finally, some alternatives if we need to process large files without iterators." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from Bio import SeqIO" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A word about iterators. Once we iterate over all values an iterator gets empty.\n", "# That means, we can not access the contents any longer.\n", "\n", "# Let's create an iterator of sequence records:\n", "records = SeqIO.parse(\"example.fa\", \"fasta\")\n", "\n", "# We can access the next item (in this case, the first one) via core next() method:\n", "print(\"First record is\", next(records).id)\n", "\n", "# Now, let's iterate over all values with a for loop:\n", "for record in records:\n", " print(\"first use\", record.seq[:5])\n", "print(\"We have reached the end of the iterator\")\n", "\n", "# At this point, we have reached the end of the iterator. We can not use it again:\n", "for record in records:\n", " print(\"second use\", record.seq[:5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# If we force the iterator with next(), it will not return a None, but a StopIteration exception\n", "next(records)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, sometimes it is useful to convert the iterator into a list or a dictionary, given that it is not too big to fit into the memory. Iterators can be easily converted into lists, simply casting a `list()` function. For conversion into a dictionary, the `Bio.SeqIO` provides a specialized method called `.to_dict()`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Converting the sequence record iterator into a list\n", "\n", "records = list(SeqIO.parse(\"example.fa\", \"fasta\"))\n", "# Now, let's loop over all values with for:\n", "for record in records:\n", " print(\"first use\", record.seq[:5])\n", "# And again:\n", "for record in records:\n", " print(\"second use\", record.seq[:5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Converting the sequence record iterator into a dictionary\n", "\n", "records = SeqIO.parse(\"example.fa\", \"fasta\")\n", "rec_dict = SeqIO.to_dict(records)\n", "print(\"Listing all items:\")\n", "for rec_id, rec in rec_dict.items():\n", " print(rec_id, rec.seq[:5])\n", "print()\n", "\n", "specific_rec_id = \"ADY55933.1\"\n", "print(\"Accessing to a specific record:\", specific_rec_id)\n", "print(specific_rec_id, rec_dict[specific_rec_id].seq[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For larger files, it isn’t possible to hold everything in memory, so `Bio.SeqIO.to_dict()` is not suitable. In these situations we can use the `Bio.SeqIO.index()` function. This will not populate a dictionary, rather index the input file such that we can access records in a an arbitrary order. This should be used with very large files; for small files it will be slower than the `.to_dict()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rec_dict = SeqIO.index(\"example.fa\", \"fasta\")\n", "print(specific_rec_id, rec_dict[specific_rec_id].seq[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More on `SeqFeature` objects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SeqFeature` object describes a feature along a sequence. Some of its useful attributes are:\n", "\n", "- `.location` which tells us the 0-based coordinates of the feature\n", "- `.type` which holds the type or the annotation of the feature\n", "- `.qualifiers` is an ordered dictionary (the keys are always returned in same order) that holds qualifying information about the feature. These are analogous to the qualifiers from a GenBank feature table.\n", "\n", "`SeqFeature` is defined within its own module: `Bio.SeqFeature`. For more detail on this object and the module, you can start with its [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc37) and [documentation](https://biopython.org/docs/1.74/api/Bio.SeqFeature.html).\n", "\n", "Let's extract some useful information for our sequence record from its features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for feature in rec.features:\n", " print(\"From {} to {}, there is a '{}' feature\".format(\n", " feature.location.start, feature.location.end, feature.type))\n", " for key, val in feature.qualifiers.items():\n", " print(\" {} is {}\".format(key, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we slice a `SeqRecord` that contains `SeqFeature`s, we have learnt that only those features that completely fall within the slice will be kept. Then the positions will be offsetted relative to the new sliced sequence." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's grab the last record in our example file\n", "records = list(SeqIO.parse(\"example.gp\", \"genbank\"))\n", "last_record = records[-1]\n", "\n", "# We have seen above that the zinc binding site is from 47 to 178 (0-based)\n", "zinc_binding = last_record[47:178]\n", "incomplete_overlap = last_record[30:100]\n", "print(zinc_binding)\n", "print()\n", "\n", "# Let's have a closer look at the single feature included within the sliced portion\n", "feat = zinc_binding.features[0]\n", "print(\"The feature is of '{}' type\".format(feat.type))\n", "print(\"It is from {} to {}\".format(feat.location.start, feat.location.end))\n", "print(\"Its location is {}\".format(feat.location)) # We have the positions of 4 residues that make contact with Zn\n", "print(\"Its qualifiers are: {}\".format(feat.qualifiers))\n", "print()\n", "\n", "# On the other hand, the incompletely overlapping part (look at the features)\n", "print(incomplete_overlap)" ] } ], "metadata": { "kernelspec": { "display_name": "steps_venv", "language": "python", "name": "steps_venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }