Retrieving information from the NCBI with Entrez
Entrez is a retrieval system for searching several linked databases hosted at the NCBI (National Center for Biotechnology Information, United States).
During this tutorial, we will learn to use the interface of NCBI Entrez to retrieve a protein of interest. As will be seen, a simple formulation of the query generally returns too many hits, and the desired answer may be lost in hundreds or thousands of other records. We will see how to use advanced search options in order to refine the query.
Quick panorama of the databases
- Open the Entrez home page: http://www.ncbi.nlm.nih.gov/Entrez/ You can see the impressive list of the databases supported at Entrez.
- As a first trial, we will see which of these databases contain information about our gene of interest (e.g. Gal4). In the query box, type the gene name.
How many results are returned for the submitted keyword in the different databases?
The query we formulated by entering a single keyword was obviously too imprecise, and we thus obtained tens of thousands of hits in different databases. In the subsequent steps we will learn how to use the Entrez query interface in order to formulate precise queries.
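The same per-database count can also be requested programmatically through the NCBI E-utilities. The lines below are only a minimal sketch: they build ESearch URLs that ask for the hit count alone, without contacting the server. The helper name `count_url` and the choice of four databases are illustrative assumptions, not part of the original tutorial.

```python
from urllib.parse import urlencode

# Base address of the ESearch utility of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(db: str, term: str) -> str:
    """Build an ESearch URL that asks only for the number of hits (hypothetical helper)."""
    params = {"db": db, "term": term, "rettype": "count"}
    return f"{ESEARCH}?{urlencode(params)}"


# Database identifiers picked for illustration; Entrez supports many more.
for db in ("protein", "nuccore", "gene", "pubmed"):
    print(count_url(db, "Gal4"))
```

Opening one of the printed URLs in a browser should return a small XML document containing the count for that database.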
A naive query to the protein database
We will now select the protein sequence database in order to collect information about the Gal4 protein from the budding yeast Saccharomyces cerevisiae.
Click on the link Protein: sequence database and enter the query Gal4.
How many results do you obtain? How many of them correspond to your needs? How could you try to improve the result?
The simple query Gal4 returned 61,135 proteins (Aug 2014). Needless to say, this is far more than we are looking for: the genome of the budding yeast Saccharomyces cerevisiae contains ~6,000 coding genes, and only one of them codes for the Gal4 protein.
A first reason is that we did not impose any constraint on the organism.
A second reason is that, by typing Gal4 in the query box, we asked Entrez to return all the proteins that contain this string in any field (name, description, ...). Thus, our answer includes some proteins related to Gal4, for example because they interact with this protein, or because a Gal4 fragment was used to construct hybrid proteins (e.g. for enhancer trap experiments).
A first improvement can be obtained by requiring additional words in the query. For instance, we could require the words "Saccharomyces" and "cerevisiae", in addition to "Gal4".
For this, you can use the logical operators 'AND', 'OR', and 'NOT' within the query sentence. Beware: these operators are case-sensitive, i.e. if you type them in lowercase, they will be treated as search words rather than as operators.
In the query box, type
Gal4 AND Saccharomyces AND cerevisiae
An even more precise way to select Saccharomyces cerevisiae is to quote the pair of words.
Gal4 AND "Saccharomyces cerevisiae"
This will only retain the records where these two words are written consecutively.
What about the result? Did we obtain an improvement? How do you explain the incorrect results?
By combining Gal4 and 'Saccharomyces cerevisiae' in the query, we already obtained some improvement, and the number of results has been reduced. However, we still obtain a lot of proteins (2,836 in Aug 2014), most of which do not directly correspond to Gal4 but are returned because the three words of our query were found in some field (name, description, organism, ...).
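As a small illustration of the two query forms used in this section, the sketch below assembles a query from a list of words, writing the operator in uppercase and quoting any multi-word phrase. The function name `entrez_and` is a hypothetical helper introduced here for illustration only.

```python
def entrez_and(*terms: str) -> str:
    """Join search terms with the uppercase AND operator.

    Terms that contain a space are wrapped in double quotes so that
    the words are searched as a consecutive phrase.
    """
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return " AND ".join(quoted)


# Three independent words.
print(entrez_and("Gal4", "Saccharomyces", "cerevisiae"))
# → Gal4 AND Saccharomyces AND cerevisiae

# One word plus a quoted phrase.
print(entrez_and("Gal4", "Saccharomyces cerevisiae"))
# → Gal4 AND "Saccharomyces cerevisiae"
```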
Imposing constraints on a specific field
You can refine the selection by specifying the field in which your query text has to be found.
- Click on the link Advanced below the query box.
- In the Search builder, select the field Gene name and enter GAL4. By pressing the Enter key, you obtain a list of matches for the gene name GAL4.
- Click the button Add to Search box. This will add a structured text in the query box.
- You can now click the Search link below the query box.
How many results do we obtain now? Do they all fit our needs? How could we refine the query?
We obtained an improvement over our first query (Gal4 alone) by requiring that the value GAL4 be found in the field Gene name. However, we still did not achieve the desired precision (in Aug 2014, the query returns 205 records). There are two reasons:
- We did not impose any constraint on the organism.
- We matched any gene whose name matches "gal4", including for example "galectin-4 [Homo sapiens]", which is also named "GAL4" (in the field "gene_synonym").
We would thus like to formulate a query with constraints on multiple fields: GAL4 as gene name and Saccharomyces cerevisiae as organism.
Specifying constraints on multiple fields
We will further use the Advanced query form to impose constraints simultaneously on gene name and on organism.
- In the query box, type the structured query obtained in the previous section.
Do not click on the Search button yet! We still need to add some constraints.
How many results do you obtain now? What is the difference between these entries?
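A query with constraints on both gene name and organism can also be typed by hand as a fielded query string. The sketch below shows one way to assemble it. The field labels "Gene Name" and "Organism" are assumed to be the ones the Search builder writes into the query box, so check them against the structured text you obtained; `fielded` is a hypothetical helper.

```python
def fielded(value: str, field: str) -> str:
    """Attach a field tag to a value, quoting multi-word values (hypothetical helper)."""
    text = f'"{value}"' if " " in value else value
    return f"{text}[{field}]"


# Field labels are assumptions; compare with the text produced by the Search builder.
constraints = [
    fielded("GAL4", "Gene Name"),
    fielded("Saccharomyces cerevisiae", "Organism"),
]
query = " AND ".join(constraints)
print(query)
# → GAL4[Gene Name] AND "Saccharomyces cerevisiae"[Organism]
```

Keeping each constraint as a separate list item makes it easy to add or drop a field without retyping the whole query.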
Browsing a protein entry
Now that we have selected a reasonably low number of proteins, we can identify the one we were searching for: Gal4p from Saccharomyces cerevisiae.
- The result list should include a record with the accession number P04386.2. Click on the link to display the entire record.
- Browse the resulting page to get an idea about the annotation content.
Saving the protein sequence in FASTA format
- At the top of the window, the option Display allows you to choose among different formats. Select the format FASTA. This will display the amino acid sequence of the Gal4p protein.
>gi|1169823|sp|P04386.2|GAL4_YEAST RecName: Full=Regulatory protein GAL4
- You can store this result in some file on your computer, in order to use it for further analyses.
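For further analyses it is often useful to split the FASTA header into its parts. The following sketch parses a header laid out like the one shown above (`>gi|<number>|<database>|<accession>|<name> <description>`). The function name and the returned keys are illustrative assumptions, and headers with another layout are rejected rather than guessed at.

```python
def parse_header(line: str) -> dict:
    """Split a FASTA header of the form >gi|<gi>|<db>|<accession>|<name> <description>."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # The identifier block ends at the first space; the rest is free text.
    identifier, _, description = line[1:].partition(" ")
    fields = identifier.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("unexpected identifier layout")
    _, gi, db, accession, name = fields
    return {
        "gi": gi,
        "db": db,
        "accession": accession,
        "name": name,
        "description": description,
    }


header = ">gi|1169823|sp|P04386.2|GAL4_YEAST RecName: Full=Regulatory protein GAL4"
record = parse_header(header)
print(record["accession"], record["name"])
# → P04386.2 GAL4_YEAST
```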
Getting the query history
An interesting feature of Entrez is the history. Click on the link Advanced below the query box: below the Search builder, you will see a section entitled Search history. You can select any of these previous queries in order to come back to its results, edit it, or combine it with others to refine the selection.
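Combining previous queries amounts to writing a new query that refers to them by number. The sketch below assumes that the Search history numbers each query and that the query box accepts references of the form `#1`, `#2`, and so on. Verify this against the history table you see, since the numbering depends on your own session. `combine_history` is a hypothetical helper.

```python
def combine_history(numbers, operator: str = "AND") -> str:
    """Combine numbered history entries with an uppercase logical operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT (uppercase)")
    return f" {operator} ".join(f"#{number}" for number in numbers)


# Entry numbers are placeholders; use the ones listed in your own history.
print(combine_history([1, 2]))
# → #1 AND #2
```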
This tutorial was borrowed from Jacques van Helden (firstname.lastname@example.org)