{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading and writing files\n", "In many use cases, you will want your python code to read/write from/to files stored on your local hard drive. \n", "Here are a few important points to consider when working with files:\n", "\n", "* Do I need to read the entire dataset/file into memory?\n", " * remember the table about the time a computer takes to perform various action. \n", " Accessing the hard drive is among the slower operations. \n", " Reading an entire file when you only need the first few lines will be costly\n", " * if you are reading a very large file, then having the entire file in memory at \n", " once may overburden your computer\n", " \n", "* Are there concurrency issues?\n", " * if another software (or even your code if you have messed up) writes to a file you \n", " are currently reading, you could run into trouble.\n", "\n", "Whether it is for reading or for writing, operations with files occur using **file objects** (sometimes also referred-to as file **handles**).\n", "File objects are created using the `open()` function, and closed using the object's `.close()` method. \n", "Here is a basic example:\n", "\n", "```python\n", "# Open a file in a given \"mode\" (e.g. read, writte or append).\n", "file = open(filename, mode)\n", "\n", "# Do something with the file...\n", "\n", "# When you are done, don't forget to close the file.\n", "file.close()\n", "```\n", "\n", "> Remember to always close the files you have opened, or you will encounter errors if you want to access this file again further down in your code.\n", "\n", "
\n", "\n", "### File opening modes\n", "When using the `open()` function, a **mode** can be passed as argument to the function. This specifies the type of access you will have on the file. For instance, the `'r'` mode will only allow to read the content of a file, and will not allow writing to it (this is useful to avoir accidental writing to the file).\n", "\n", "There are several more modes of opening files.\n", "* `'r'`: open file in read-only mode.\n", "* `'w'`: open file in write-only mode, **overwriting** existing file with the same name.\n", "* `'a'`: open file in write-only mode, **appending** to existing file with the same name.\n", "* `'rb'`, `'wb'`, `'ab'`: same as `'r'`, `'w'` and `'a'`, but reading/writing to/from binary files (such as `.zip` or `.bmp` image files). \n", " The content is read/written as bytes objects without any decoding.\n", "\n", "See `help(open)` for a full list of modes and details about them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Reading from files\n", "To start reading a file, one creates a **file object** using `open` function with `mode='r'` . \n", "Reading lines can then be done simply by iterating over the file object in a `for` loop or using the `.readline()` method of the file object." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "passionfruit\n", "\n", "oranges\n", "\n", "apples\n", "\n", "grapefruit (whole and segments)\n", "\n", "pointed sticks\n" ] } ], "source": [ "reading_handle = open(\"fresh_fruits.txt\" , mode='r')\n", "\n", "line = reading_handle.readline() # this function reads a single line from the file \n", "print(line) # I print the line \n", "\n", "line = reading_handle.readline()\n", "print(line) \n", "line = reading_handle.readline()\n", "print(line) \n", "line = reading_handle.readline()\n", "print(line) \n", "line = reading_handle.readline()\n", "print(line) \n", "# problem : how many time should I do this ?\n", "\n", "reading_handle.close()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `for` loop method solves this and makes fro much tidier code: each iteration reads 1 line." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "\n", "line 1 : oranges\n", "\n", "line 2 : apples\n", "\n", "line 3 : grapefruit (whole and segments)\n", "\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open(\"fresh_fruits.txt\" , mode='r')\n", "i = 0\n", "for line in reading_handle:\n", " print(\"line\", i, \":\", line)\n", " i += 1\n", " \n", "reading_handle.close()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### End-of-line characters\n", "As you can see in the example above, there are additionnal empty lines in between our prints. This is because the lines are read from the file with their **end-of-line** characters, which generally is `\\n` . \n", "To avoid this kind of issue, one typically uses the `.strip()` method of strings, which removes any whitespace or *end-of-line* character at the start or end of the string.\n", "\n", "Here is an illustratin of using `.strip()` when reading content from a file:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open(\"fresh_fruits.txt\", 'r')\n", "\n", "for i, line in enumerate(reading_handle): # enumerate() is our friend that will automatically enumerate items\n", " print(\"line\", i, \":\", line.strip()) # Note: we use the \"strip()\" method of \"str\" to remove the \n", " # trailing \"\\n\" (carriage return) of each line.\n", "\n", "reading_handle.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "And here is yet another way to read the fruity content of our file, this time using the `readlines()` function instead of `readline()`.\n", "As its name suggests, `readlines()` reads more than one line at a time (by default, all lines in the file)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fresh_fruits.txt has 5 lines:\n", "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open(\"fresh_fruits.txt\" , 'r')\n", "\n", "entire_file = reading_handle.readlines() # the whole content of the file is in memory\n", "print(reading_handle.name, \"has\", len(entire_file), \"lines:\") # and we can check how many lines there are\n", "for i, line in enumerate(entire_file): # before we start looping over them\n", " print(\"line\", i, \":\", line.strip())\n", "\n", "reading_handle.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** While our examples using `readline()` or `readlines()` work equally well, there can be important implications in using one or the other of these functions, especially when dealing with large files. \n", "Can you think of a drawback of using `readlines()` ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Answer:** using `readlines()` will (by default) load the entire file in memory, and this can be problematic when working with large files as is often the case in bioinformatics. Always consider the file sizes you are dealign with when using `readlines()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Opening files with context managers\n", "Now that you understand the basics of opening and closing a file, we can show you the actual \"pythonic\", recommended, way to deal with files:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "with open(\"fresh_fruits.txt\", 'r') as file:\n", " for i, line in enumerate(file):\n", " print(\"line\", i, \":\", line.strip())\n", " \n", "# Wait, did we just forget to close our file ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "As you might have noticed in the example above, there is no explicit call to `close()`. So how does our file get closed then?\n", "\n", "The magic happens in the `with ... as ...` construct, which starts a **context manager** code block in python. \n", "**Context managers** are code blocks that have special functions associated with them that execute automatically at the \n", "start and end of the context manager block. \n", "In the case of `with open() as ...:`, the special function executed at the start of the block is opening the file, and \n", "the special function executed at the end of the code block is closing the file handle.\n", "\n", "Using the `with open() as ...:` context manager has several advantages:\n", "* shorter, abstracted code: you don't have to worry about the implementation details of opening and closing a file.\n", "* guarantee that a file will always be closed, no matter what happens (e.g. if an error occurs).\n", "\n", "Context managers are not covered in this course, but you can look them up if you are interested.\n", "\n", "
\n", "\n", "### Mini Exercise:\n", "Read the content of `fresh_fruits.txt` using a context manager and a while loop. Make sure that no white space is printed between lines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Writing to files\n", "Writing to a file is achieved in pretty much the same way as reading from it, but the opening mode is now `'w'`. \n", "And instead of reading lines, we now `print()` them to the file." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "with open(\"shopping_list.txt\", 'w') as writing_handle:\n", " print(\"onion\", file=writing_handle)\n", " print( 34 , \"potato\" , file=writing_handle)\n", " print(\"shrubbery\", file=writing_handle)\n", " print(\"tomato sauce\", file=writing_handle)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By passing the file object (or file handle) to the `file` argument of the `print()` function, we now print to the file rather than to our terminal.\n", "> **Reminder**: the `'w'` mode overwrites the opened file - *i.e.*, if you use it on an existing file, its original content is lost. \n", "> **Pro tip:** you can open more than one file using a single `with` statement:\n", "```python\n", "with open(\"input.txt\", 'r') as in_file, open(\"output.txt\", 'w') as out_file:\n", " do_something()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### FYI\n", "\n", "You will see some Python code, especially older ones, will use the `.write()` method of the **file object**. There are some differences between the `print()` method and `.write()`; most important one is that `.write()` will not do any formatting and even the end-of-line (carriage return) characters need to be manually written." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "with open(\"shopping_list2.txt\", 'w') as writing_handle:\n", " writing_handle.write(\"onion\\n\")\n", " writing_handle.write(\"{} potato\\n\".format(34))\n", " writing_handle.write(\"shrubbery\\n\")\n", " writing_handle.write(\"tomato sauce\\n\")\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Mini Exercise:\n", "Write some code to read the content of the `shopping_list.txt` file we just created, in order to check that the writing did work properly. Make sure that no white space is printed between lines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Exercises: 3.1, 3.2 and 3.3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## to go further : reading using a while loop\n", "\n", "Here is an exemple of file reading where, instead of a for loop, we use a while loop and `.readine()`." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open(\"fresh_fruits.txt\", 'r')\n", "i = 0\n", "line = reading_handle.readline()\n", "\n", "# When the file has been entirely read, readline() returns an empty string and the while loop will end.\n", "# In python a non-empty string evalutes to \"True\", and therefore we can use \"while line\" as a shortcut \n", "# for \"while line != '' \".\n", "while line:\n", " print(\"line\", i, \":\", line.strip()) # Note: we use the \"strip()\" method of \"str\" to remove the \n", " # trailing \"\\n\" (carriage return) of each line.\n", " line = reading_handle.readline() # Don't forget this or you will have an infinite loop.\n", " i += 1\n", " \n", "reading_handle.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### some new cool syntax for Python 3.8 and above" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open(\"fresh_fruits.txt\", 'r')\n", "i = 0\n", "while (line := reading_handle.readline()): # := assigns values to variables as part of a larger expression. \n", " print(\"line\", i, \":\", line.strip()) # It is known as “the walrus operator” and it works really well\n", " i += 1 # together with the while-loop" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }