{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# First  steps with Python exam\n",
    "\n",
    "This exam should be completed and sent to `wandrille.duchemin@unibas.ch` by the 11h of September 2020.\n",
    "I will send you an e-mail upon receiving your submission, so if you hear nothing from me send me further e-mails to get my attention.\n",
    "\n",
    "Solve the exercises in the jupyter notebook. **Remember to:**  \n",
    "* **Comment your code** to convey what you are trying to do and how you are trying to do it.\n",
    "* think about **using intermediary functions**, that can help you write **cleaner and more lisible code**.\n",
    "\n",
    "**Important:** Before sending us your notebook with the answers, please rename it to `firstName_LastName.exam.ipynb`, e.g. if your name is Alice Smith, the notebook would be renamed to `alice_smith.exam.ipynb`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "# Exercise 1\n",
    "\n",
    "Generate random numbers between 1 and 6 until you get the number 6.\n",
    "Simulate like you would be rolling a dice.\n",
    "1. You should indicate the number of random numbers generated before you got the number 6.\n",
    "2. You should print the sum over all the numbers generated.\n",
    "> **hint**: check the random library"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 2\n",
    "\n",
    "The file `name_size.txt` contains data about the name and height of a large number of people. Each line corresponds to one person and has two fields : a person's name and their height. \n",
    "A first name can appear several times in the file, meaning that several people in the sample bear that name.\n",
    "\n",
    "1. Take a look at the file. **Question:** what is the field separator?\n",
    "\n",
    "2. Read the file and build a dictionnary whose keys are first names, and whose values are lists containing the height of all people bearing the given name.\n",
    "\n",
    "For example, if the file was:\n",
    "```\n",
    "Robin 175.3\n",
    "Gallahad 163.5\n",
    "Arthur 182.0\n",
    "Robin 160.8\n",
    "Robin 183.7\n",
    "```\n",
    "\n",
    "Then the dictionnary should be :\n",
    "```\n",
    "{'Robin' : [ 175.3, 160.8, 183.7 ],\n",
    " 'Gallahad' : [ 163.5 ],\n",
    " 'Arthur' : [ 182.0 ] }\n",
    "```\n",
    "\n",
    "3. **Question:** what is the most popular (i.e. the most frequent) name that starts with the letter 'B'?\n",
    "\n",
    "4. Compute the average height of people whose name contains more than 5 letters.\n",
    "\n",
    "\n",
 **Remember:** you can">
    "> **Remember:** you can subdivide each question into smaller subtasks to achieve the requested result step by step.\n",
    "\n",
    "> **Hint:** the file is very long. It can be easier to try your code on a smaller dummy file (or subset of the original file) that you create for testing purposes.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 3\n",
    "\n",
    "We are given a piece of Python script by a colleague that is supposed to work with a certain kind of data file containing information on peptides with some very strange repeat elements. These repeat elements can be approximately 20-30 amino acid long, and each element can be repeated several times per protein. To add to the complexity of this imaginary phenomenon, such proteins typically contain 3-10 different elements.\n",
    "\n",
    "We have a data file called `peptide_data.tsv` that we know has correct format and information. The file is a simple tab-delimited text file with a header containing these information per repeat element (row):\n",
    "    \n",
    "    protein_ID  repeat_len  num_repeat  repeat_ID\n",
    "\n",
    "1. protein_ID    : ID of the protein (unique for each protein)\n",
    "2. repeat_length : length of repeat element in nucletides\n",
    "3. num_repeat    : how many times this element is repeated\n",
    "4. repeat_ID     : ID of the repeat element\n",
    "\n",
    "However, the script, which is given below seems to have a number of bugs/issues. The **_correct_** script should calculate the average size of repeated regions per protein. Note that in the script's output file the first letter of the protein IDs are capitalized. The script, somehow(!), collected many bugs (around 10) over time before it landed into our hands. Bugs are of different kinds, including but not limited to typos, missing lines and wrong types. Your task is to fix the code and annotate your fixes?\n",
    "\n",
    "> **Important:** Please indicate any \"fix\" you do in the code with a comment that briefly describes what you fixed and/or why it was needed.\n",
    "\n",
    "As a reference, you are also given a file (`output.tsv`) that contains the correct results if the script is fixed and works as intended."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "ename": "SyntaxError",
     "evalue": "invalid syntax (<ipython-input-1-4148f209625b>, line 42)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;36m  File \u001b[0;32m\"<ipython-input-1-4148f209625b>\"\u001b[0;36m, line \u001b[0;32m42\u001b[0m\n\u001b[0;31m    for repeat_size in repeats\u001b[0m\n\u001b[0m                              ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
     ]
    }
   ],
   "source": [
    "def parse_line(line):\n",
    "    \"\"\"\n",
    "    This function accepts a string containing single line from input file.\n",
    "\n",
    "    Input line has these columns:\n",
    "    1            2                3                 4\n",
    "    protein_ID   HPAA_repeat_len  HPAA_num_repeat   HPAA_ID\n",
    "\n",
    "    It returns a tuple of protein ID and two integers,\n",
    "    first one the length of repeat and\n",
    "    second one the number of repeat element\n",
    "    \n",
    "    (protein_id, repeat_len, num_repeat)\n",
    "    \"\"\"\n",
    "    parsed = line.split(',')\n",
    "    return parsed[1], parsed[2], parsed[3]\n",
    "\n",
    "def calc_total_len(repeat_len, num_repeat):\n",
    "    \"\"\"\n",
    "    Returns the product of repeat_len and num_repeat \n",
    "    \"\"\"\n",
    "    return repeat_len * num_repeat\n",
    "\n",
    "# init data dictionary\n",
    "data = []\n",
    "\n",
    "# read the input file\n",
    "with open(\"peptide_data.tsv\") as infile:\n",
    "    for line in infile:\n",
    "        protein_id, repeat_len, num_repeat = parse_line(line)\n",
    "        protein_id[0] = protein_id[0].upper()\n",
    "        if protein_id not in data:\n",
    "            data[protein_id] = []\n",
    "    data[protein_id].append(calc_total_len(repeat_len, num_repeat))\n",
    "\n",
    "# write the output\n",
    "with open(\"output2.tsv\") as outfile:\n",
    "    for protein in sorted(data.keys()):\n",
    "        repeats = data[protein]\n",
    "        total_num_repeats = 0\n",
    "        total_size_repeats = 0\n",
    "        for repeat_size in repeats\n",
    "            total_num_repeats += 1\n",
    "        total_size_repeats += repeat_size\n",
    "        print(protein, total_size_repeats / total_num_repeats,\n",
    "              sep=\"\\t\", file=oufile)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}