{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise solutions: reading and writing files\n",
    "\n",
    "## Exercise 3.1\n",
    "Write a function that reads the first 10 lines of the file `Homo_sapiens.GRCh38.99.MT.gtf` and writes them to another file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"../exercises/Homo_sapiens.GRCh38.99.MT.gtf\", 'r') as in_file, open(\"Homo_sapiens.GRCh38.99.MT.head.gtf\", 'w') as out_file:\n",
    "    for i in range(10):\n",
    "        print(in_file.readline(), file=out_file, end='')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notes :\n",
    "* to open the input file we specify its location relatively to our notebook. Since the file is located in a different directory, we must use `../exercises` to reach it.\n",
    "* We use the `end` optional argument of `print()` to avoid printing an additional `\\n` character that would cause line skips (because each line already ends with a `\\n` character read from the input).  \n",
    "  Alternatively we could have used `strip()` on the lines read from the input: `print(in_file.readline().strip(), file=out_file)`\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## Exercise 3.2\n",
    "Using the same `Homo_sapiens.GRCh38.99.MT.gtf` GFF file you used in the previous exercise, report (print to terminal) all of the feature entries for genes on the motichondrial chromosome (`MT`) between coordinates 7500 and 10000 on the forward strand.  \n",
    "The complete GFF file format description can be found <a href=\"https://www.ensembl.org/info/website/upload/gff.html\">here</a>, but here is a succint description: \n",
    "* Column 1: chromosome\n",
    "* Column 2: data source\n",
    "* Column 3: feature type\n",
    "* Column 4: feature start coordinate\n",
    "* Column 5: feature end coordinate\n",
    "* Column 6: score\n",
    "* Column 7: strand\n",
    "\n",
    "> hint: use the `split()` method; what is the separator between fields ?\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# In the code below, we will not presume that the features are ordered by chromosome and position, \n",
    "# and therefore we will read through the entire file.\n",
    "# However, if that assumption can be made, then it could be good way to terminate the reading early, \n",
    "# and save some execution time if the input file is very big (note that the gtf file of the full \n",
    "# human genome is ~2 GB)\n",
    "\n",
    "to_report = []\n",
    "with open(\"../exercises/Homo_sapiens.GRCh38.99.MT.gtf\" , 'r') as in_file:\n",
    "    for line in in_file:\n",
    "        split_line = line.split('\\t')    # The input file is in tab delimited format, so we split by tab.\n",
    "        chromosome = split_line[0]       # The 1st column of the file contains the chromosome info.\n",
    "        feature = split_line[2]          # the 3rd column contains the feature info.\n",
    "        \n",
    "        # The elements of the list returned by .split() are string, so to get numerical values for\n",
    "        # \"start\" and \"end\" we must convert them to int().\n",
    "        start = int(split_line[3]) \n",
    "        end   = int(split_line[4]) \n",
    "        is_forward_strand = split_line[6] == '+'    # The 7th column contains the strand info: + = forward; - = reverse.\n",
    "    \n",
    "        if chromosome == 'MT' and feature == 'gene' and is_forward_strand and start > 7500 and end < 10000:\n",
    "            to_report.append(line)\n",
    "\n",
    "for line in to_report:\n",
    "    print(line, end='')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## Exercise 3.3\n",
    "Write the sequences of every exon of the mitochondrial chromosome between coordinates 7500 and 12500 to a file, in the <a href=\"https://en.wikipedia.org/wiki/FASTA_format\">fasta</a> format.  \n",
    "In the output fasta file, identify the exons using the name of their gene, as well as their start and end positions.\n",
    "\n",
    "For instance, for an exon of gene ENSG000002, going from position 42 to 1337 : `>ENSG000002:42-1337`\n",
    "\n",
    "Use the same GFF file you used in the previous exercise, \n",
    "as well as the fasta file `Homo_sapiens.GRCh38.dna.chromosome.MT.fa`, which contains the sequence of the mitochondrial chromosome..\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Let's first create a function to extract the gene name for the \"Additional Info\" field of a GFF file. \n",
    "# This will simplify our code later.\n",
    "def get_gene_name_from_additional_info(field):\n",
    "    \"\"\"Extracts the gene name from the 'additionnal info' field of a GFF file\"\"\"\n",
    "    subfields = field.split(';')   # The \"additional info\" field uses ';' as sub-field delimiter.\n",
    "    #print(split_field)\n",
    "    gene_name = ''\n",
    "    \n",
    "    # Looking for a pattern like 'gene_id \"ENSG00000198786\"'\n",
    "    for subfield in subfields:\n",
    "        words = subfield.split(' ')\n",
    "        if words[0] == \"gene_id\": \n",
    "            gene_name = words[1].strip('\"')  # Remove the quote around subfield value.\n",
    "            break                            # if the gene_id subfield is found, exit loop immediatly.\n",
    "            \n",
    "    return gene_name\n",
    "    \n",
    "\n",
    "# Part 1: extract the gene name and coordinate positions of the exons of interest from the GTF file.\n",
    "desired_chromosome = 'MT'\n",
    "desired_feature = 'exon'\n",
    "extraction_begin = 7500\n",
    "extraction_end = 12500\n",
    "\n",
    "# Create an empty list to store \n",
    "exons = []\n",
    "with open(\"../exercises/Homo_sapiens.GRCh38.99.MT.gtf\" , 'r') as in_file:\n",
    "    for line in in_file:\n",
    "        # GFF is a tab delimited format, we split the input line by tab.\n",
    "        split_line = line.split('\\t')\n",
    "        \n",
    "        # Retrieve info from the different fields of the input line.\n",
    "        chromosome = split_line[0]\n",
    "        feature = split_line[2]\n",
    "        start = int(split_line[3]) \n",
    "        end   = int(split_line[4]) \n",
    "        is_forward_strand = split_line[6] == '+'   # The 7th column contains the strand info: + = forward; - = reverse.\n",
    "       \n",
    "        if chromosome == desired_chromosome and feature == desired_feature and start >= extraction_begin and end <= extraction_end:\n",
    "            # Extract the gene name form the additional info field - the 9th column of the file.\n",
    "            gene_name = get_gene_name_from_additional_info(split_line[8])\n",
    "            exons.append((gene_name, start, end))\n",
    "\n",
    "\n",
    "# Part 2: extract the whole sequence of the mitochondrial chromosome from the FASTA file.\n",
    "with open( \"../exercises/Homo_sapiens.GRCh38.dna.chromosome.MT.fa\", 'r' ) as in_file:\n",
    "    # Read the header line, as it does not contain nucleotides but only the FASTA header. \n",
    "    l = in_file.readline()\n",
    "    \n",
    "    # Loop through the remainer of the file to extract the entire nucleotide sequence of the MT chromosome.\n",
    "    sequence = ''\n",
    "    for l in in_file:\n",
    "        sequence += l.strip()\n",
    "    \n",
    "# Part 3: write exons and their sequences in FASTA format.\n",
    "with open(\"sequences.fa\", 'w') as out_file:\n",
    "    for exon_name, start, end in exons:   # Note: when iterating over a list of tuples that all have the same size, \n",
    "                                          # the elements of each tuple can be directly unpacked into variables.\n",
    "        # Here, we have to be careful: \n",
    "        #  * the GFF files indexing starts at 1, but python strings start indexing at 0, \n",
    "        #    so we have to add +1 to start and end GFF positions.\n",
    "        #  * the slicing excludes the end position, therfore we have to add another + 1\n",
    "        #    to the end value.\n",
    "        exon_seq = sequence[start - 1: end]\n",
    "\n",
    "        print('>', exon_name, ':', start, '-', end, sep='', file=out_file)\n",
    "        print(exon_seq, file=out_file)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}