{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading and writing files\n", "In many use cases, you will want your python code to read/write from/to files stored on your local hard drive. \n", "Here are a few important points to consider when working with files:\n", "* do I need to read the entire dataset/file into memory?\n", " * remember the table about the time a computer takes to perform various action. Accessing the hard drive is among the slower operations. \n", " Reading an entire file when you only need the first few lines will be costly\n", " * if you are reading a very large file, then having the entire file in memory at once may overburden your computer\n", "* are there concurrency issues?\n", " * if another software (or even your code if you have messed up) writes to a file you are currently reading, you could run into trouble.\n", "\n", "Whether it is for reading or for writing, operations with files occur using **file objects** (sometimes also refferred-to as file **handles**).\n", "File objects are created using the `open()` function, and closed using the object's `.close()` method. \n", "Here is a basic example:\n", "\n", "```python\n", "# Open a file in a given \"mode\" (e.g. read, writte or append).\n", "file = open(filename, mode)\n", "\n", "# Do something with the file...\n", "\n", "# When you are done, don't forget to close the file.\n", "file.close()\n", "```\n", "\n", "> Remember to close the files you have opened, or you will encounter errors if you want to access this file again further down in your code.\n", "\n", "
\n", "\n", "### File opening modes\n", "When using the `open()` function, a **mode** can be passed as argument to the function. This specifies the type of access you will have on the file. For instance, the `'r'` mode will only allow to read the content of a file, and will not allow writing to it (this is useful to avoir accidental writing to the file).\n", "\n", "There are several more modes of opening files.\n", "* `'r'`: open file in read-only mode.\n", "* `'w'`: open file in write-only mode, **overwriting** existing file with the same name.\n", "* `'a'`: open file in write-only mode, **appending** to existing file with the same name.\n", "* `'rb'`, `'wb'`, `'ab'`: same as `'r'`, `'w'` and `'a'`, but reading/writing to/from binary files (such as `.zip` or `.bmp` image files). \n", " The content is read/written as bytes objects without any decoding.\n", "\n", "See `help(open)` for a full list of modes and details about them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Reading from files\n", "To start reading a file, one creates a **file object** using the `'r'` mode of the `open` function. \n", "Then, reading lines can be done simply by iterating over the file object in a `for` loop or using the `.readline()` method of file objects." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "\n", "line 1 : oranges\n", "\n", "line 2 : apples\n", "\n", "line 3 : grapefruit (whole and segments)\n", "\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open('fresh_fruits.txt' , 'r')\n", "i = 0\n", "for line in reading_handle:\n", " print('line', i, ':', line)\n", " i += 1\n", " \n", "reading_handle.close()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### End-of-line characters\n", "As you can see in the example above, there are additionnal empty lines in between our prints. This is because the lines are read from the file with their **end-of-line** characters, which generally is `\\n` . \n", "To avoid this kind of issue, one typically uses the `.strip()` method of strings, which removes any whitespace or *end-of-line* character at the start or end of the string.\n", "\n", "Here is an illustratin of using `.strip()` when reading content from a file, this time using a `while` loop:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open('fresh_fruits.txt', 'r')\n", "i = 0\n", "line = reading_handle.readline()\n", "\n", "# When the file has been entirely read, readline() returns an empty string and the while loop will end.\n", "# In python a non empty string evalutes to \"True\", and therefore we can use \"while line\" as a shortcut for \"while line != '' \".\n", "while line:\n", " print('line', i, ':', line.strip()) # Note: here we use the \"strip()\" method of \"str\" to remove the trailing \"\\n\" (carriage return) of each line.\n", " line = reading_handle.readline() # Don't forget this or you will have an infinite loop.\n", " i += 1\n", " \n", "reading_handle.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "And here is yet another way to read the fruity content of our file, this time using the `readlines()` function instead of `readline()`. \n", "As its name suggests, `readlines()` reads more than one line at a time (by default, all lines in the file).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "And here is yet another way to read the fruity content of our file, this time using the `readlines()` function instead of `readline()`.\n", "As its name suggests, `readlines()` reads more than one line at a time (by default, all lines in the file).\n", "\n", "**Question:** while our examples using `readline()` or `readlines()` work equally well, there can be important implications in using one or the other of these functions, especially when dealing with large files. \n", "can you think of a drawback of using `readlines()` ?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "reading_handle = open('fresh_fruits.txt' , 'r')\n", "entire_file = reading_handle.readlines()\n", "for i, line in enumerate(entire_file): # Reminder: the enumerate() function allows to generate (index, value) tuples for any iterable object.\n", " print('line', i, ':', line.strip())\n", "\n", "reading_handle.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Answer:** using `readlines()` will (by default) load the entire file in memory, and this can be problematic when working with large files as is often the case in bioinformatics. \n", "Always consider the file sizes you are dealign with when using `readlines()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Opening files with context managers\n", "Now that you understand the basics of opening and closing a file, we can show you the actual \"pythonic\", recommended, way to deal with files:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "line 0 : passionfruit\n", "line 1 : oranges\n", "line 2 : apples\n", "line 3 : grapefruit (whole and segments)\n", "line 4 : pointed sticks\n" ] } ], "source": [ "with open('fresh_fruits.txt', 'r') as file:\n", " for i, line in enumerate(file):\n", " print('line', i, ':', line.strip())\n", " \n", "# Wait, did we just forget to close our file ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "As you might have noticed in the example above, there is no explicit call to `close()`. So how does our file get closed then?\n", "\n", "The magic happens in the `with ... as ...` construct, which starts a **context manager** code block in python. \n", "**Context managers** are code blocks that have special functions associated with them that execute automatically at the \n", "start and end of the context manager block. \n", "In the case of `with open() as ...:`, the special function executed at the start of the block is opening the file, and \n", "the special function executed at the end of the code block is closing the file handle.\n", "\n", "Using the `with open() as ...:` context manager has several advantages:\n", "* shorter, abstracted code: you don't have to worry about the implementation details of opening and closing a file.\n", "* guarantee that a file will always be closed, no matter what happens (e.g. if an error occurs).\n", "\n", "Context managers are not covered in this course, but you can look them up if you are interested.\n", "\n", "
\n", "\n", "### Mini Exercise:\n", "Read the content of `fresh_fruits.txt` using a context manager and a while loop. Make sure to remove empty lines between print outputs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Writing to files\n", "Writing to a file is achieved in pretty much the same way as reading from it, but the opening mode is now `'w'`. \n", "And instead of reading lines, we now `print()` them to the file." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "with open(\"shopping_list.txt\", 'w') as writing_handle:\n", " print(\"onion\", file=writing_handle)\n", " print( 34 , \"potato\" , file=writing_handle)\n", " print(\"shrubbery\", file=writing_handle)\n", " print(\"tomato sauce\", file=writing_handle)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By passing the file object (or file handle) to the `file` argument of the `print()` function, we now print to the file rather than to our terminal.\n", "> **Reminder**: the `'w'` mode overwrites the opened file - *i.e.*, if you use it on an existing file, its original content is lost. \n", "> **Pro tip:** you can open more than one file using a single `with` statement:\n", "```python\n", "with open('input.txt', 'r') as in_file, open('output.txt', 'w') as out_file:\n", " do_something()\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "### Mini Exercise:\n", "Write some code to read the content of the `shopping_list.txt` file we just created, in order to check that the writing did work properly. Make sure that no white space is printed between lines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Exercises: 3.1, 3.2 and 3.3" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }