## Retrieving information from the NCBI with Entrez

Entrez is a retrieval system for searching several linked databases hosted at the NCBI (National Center for Biotechnology Information, United States).

### Goal

During this tutorial, we will learn to use the interface of NCBI Entrez to retrieve a protein of interest. As will be seen, a simple formulation of the query generally returns too many hits, and the desired answer may be lost in hundreds or thousands of other records. We will see how to use advanced search options in order to refine the query.

http://www.ncbi.nlm.nih.gov/Entrez/

### Quick panorama of the databases

1. Open the Entrez home page (http://www.ncbi.nlm.nih.gov/Entrez/). You can see the impressive list of databases supported by Entrez.
2. As a first trial, we will see which of these databases contain information about our gene of interest (e.g. Gal4). In the query box, type
Gal4
How many results are returned for the submitted keyword in the different databases?
The query we formulated by entering a single keyword was obviously too imprecise, and we thus obtained tens of thousands of hits in different databases. In the subsequent steps we will learn how to use the Entrez query interface in order to formulate precise queries.
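The same per-database counts can also be obtained without the web interface, through the NCBI E-utilities. The sketch below is only an illustration under stated assumptions: it uses the ESearch endpoint with `retmode=json`, the helper names (`esearch_url`, `parse_count`, `count_hits`) are hypothetical, and the list of databases is an arbitrary sample. Only the URLs are printed here; `count_hits` performs the actual network request and is left for the reader to call.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI ESearch utility.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build an ESearch URL; retmax=0 asks for the hit count without any IDs."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": 0}
    return ESEARCH + "?" + urlencode(params)


def parse_count(json_text):
    """Extract the number of hits from an ESearch JSON reply."""
    return int(json.loads(json_text)["esearchresult"]["count"])


def count_hits(db, term):
    """Query NCBI (network access required) and return the number of hits."""
    with urlopen(esearch_url(db, term)) as response:
        return parse_count(response.read().decode("utf-8"))


# Print the query URL for a few databases (an arbitrary sample, no network call).
for db in ("protein", "nuccore", "gene", "pubmed"):
    print(esearch_url(db, "Gal4"))
```

Calling `count_hits("protein", "Gal4")` should return a number comparable to the one shown on the web page, although the counts change as the databases grow.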

### A naive query to the protein database

We will now select the protein sequence database in order to collect information about the Gal4 protein from the budding yeast Saccharomyces cerevisiae.

Click on the link Protein: sequence database and enter the query Gal4.

#### Questions

How many results do you obtain? How many of them correspond to your needs? How could you try to improve the result?

The simple query Gal4 returned 61,135 proteins (Aug 2014). Needless to say, this is far more than we are looking for: the genome of the budding yeast Saccharomyces cerevisiae contains ~6,000 coding genes, and only one of them codes for the Gal4 protein.

A first reason is that we did not impose any constraint on the organism.

A second reason is that, by typing Gal4 in the query box, we asked Entrez to return all the proteins that contain this string in any field (name, description, ...). Thus, our answer includes proteins that are merely related to Gal4, for example because they interact with this protein, or because a Gal4 fragment was used to construct hybrid proteins (e.g. for enhancer trap experiments).

### Logical operators

A first improvement can be obtained by requiring additional words in the query. For instance, we could require the words "Saccharomyces" and "cerevisiae" in addition to "Gal4".

For this, you can use the logical operators 'AND', 'OR', and 'NOT' within the query. Beware: these operators are case-sensitive, i.e. if you type them in lowercase, they will be treated as search words rather than as operators.

In the query box, type

Gal4 AND Saccharomyces AND cerevisiae

An even more precise way to select Saccharomyces cerevisiae is to quote the pair of words.

Gal4 AND "Saccharomyces cerevisiae"

This will only retain the records where these two words are written consecutively.
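When such queries are assembled in a script rather than typed by hand, the two rules above (uppercase operators, quoted phrases) are easy to encode. The following is a minimal sketch; the helper names `quote_phrase` and `all_of` are hypothetical and not part of any NCBI tool.

```python
def quote_phrase(term):
    """Wrap a multi-word term in double quotes so it is searched as a phrase."""
    return f'"{term}"' if " " in term else term


def all_of(*terms):
    """Join terms with the uppercase AND operator."""
    return " AND ".join(quote_phrase(term) for term in terms)


# Three separate words, each of which must be present somewhere in the record.
print(all_of("Gal4", "Saccharomyces", "cerevisiae"))
# prints: Gal4 AND Saccharomyces AND cerevisiae

# The species name as a quoted phrase: the two words must be consecutive.
print(all_of("Gal4", "Saccharomyces cerevisiae"))
# prints: Gal4 AND "Saccharomyces cerevisiae"
```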

#### Questions

What about the result? Did we obtain an improvement? How do you explain the incorrect results?

By combining Gal4 and 'Saccharomyces cerevisiae' in the query, we already obtained some improvement, and the number of results has been reduced. However, we still obtain a lot of proteins (2,836 in Aug 2014), most of which do not directly correspond to Gal4 but are returned because the words of our query were found in some field (name, description, organism, ...).

### Imposing constraints on a specific field

You can refine the selection by specifying the field in which your query text has to be found.

2. In the Search builder, select the field Gene name and enter GAL4. By pressing the Enter key, you obtain a list of matches for the gene name GAL4.
3. Click the button Add to Search box. This will add a structured text to the query box:
GAL4[Gene Name]
4. You can now click the Search link below the query box.
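The structured text produced by the Search builder is simply the value followed by the field name in square brackets, so it can also be written directly. The sketch below shows this, together with the URL-encoded form the term takes when passed as a `term` parameter in a query string; `field_term` is a hypothetical helper name, and the encoding step uses Python's standard `urllib.parse.urlencode`.

```python
from urllib.parse import urlencode


def field_term(value, field):
    """Restrict a search value to one Entrez field, e.g. GAL4[Gene Name]."""
    return f"{value}[{field}]"


query = field_term("GAL4", "Gene Name")
print(query)
# prints: GAL4[Gene Name]

# In a URL, brackets and spaces must be escaped; urlencode takes care of it.
print(urlencode({"db": "protein", "term": query}))
# prints: db=protein&term=GAL4%5BGene+Name%5D
```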

#### Questions

How many results do we obtain now? Do they all fit our needs? How could we refine the query?

We obtained an improvement over our first query (Gal4 alone) by requiring that the value GAL4 be found in the field Gene name. However, we still did not achieve the desired precision (in Aug 2014, the query returned 205 records). There are two reasons:

1. We did not impose any constraint on the organism.
2. We matched any gene whose name matches "gal4", for example "galectin-4 [Homo sapiens]", also named "GAL4" (in the field "gene_synonym")

We would thus like to formulate a query with constraints on multiple fields: GAL4 as gene name and Saccharomyces cerevisiae as organism.

### Specifying constraints on multiple fields

We will again use the Advanced query form, this time to impose constraints simultaneously on gene name and on organism.

• In the query box, type the structured query obtained in the previous section.
GAL4[Gene Name]
Do not click on the Search button yet! We still need to add some constraints.
• In the Search builder, select Organism and type Saccharomyces cerevisiae. Click Add to Search Box. This should display the following query.
(GAL4[Gene Name]) AND Saccharomyces cerevisiae[Organism]
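The way the Search builder nests parentheses each time a constraint is added can be reproduced in a few lines. This is a sketch only: `combine` is a hypothetical helper, and the nesting rule is inferred from the query displayed above rather than from any documented specification.

```python
def combine(constraints):
    """Chain (value, field) pairs with AND, wrapping the query built so far
    in parentheses before each new term, as in the query shown above."""
    query = ""
    for value, field in constraints:
        term = f"{value}[{field}]"
        query = term if not query else f"({query}) AND {term}"
    return query


constraints = [
    ("GAL4", "Gene Name"),
    ("Saccharomyces cerevisiae", "Organism"),
]
print(combine(constraints))
# prints: (GAL4[Gene Name]) AND Saccharomyces cerevisiae[Organism]
```

Keeping the constraints as a list of pairs makes it easy to add or remove a field without retyping the whole query.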

#### Questions

How many results do you obtain now? What is the difference between these entries?

### Browsing a protein entry

Now that we have selected a reasonably low number of proteins, we can identify the one we were searching for: Gal4p from Saccharomyces cerevisiae.

• The result list should include a record with the accession number P04386.2. Click on the link to display the entire record.
• Browse the resulting page to get an idea about the annotation content.
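Once the accession number is known, the same record can be retrieved as plain text through the NCBI EFetch utility. The sketch below makes a few assumptions: it requests the GenPept flat file (`rettype=gp`, `retmode=text`), and the helper names `efetch_url` and `fetch_record` are hypothetical. Only the URL is printed; `fetch_record` performs the network request and is left for the reader to call.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI EFetch utility.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gp"):
    """Build an EFetch URL for one protein record (default: GenPept flat file)."""
    params = {
        "db": "protein",
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession, rettype="gp"):
    """Download the record as text (network access required)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


# Show the URL for the Gal4p record identified above (no network call).
print(efetch_url("P04386.2"))
```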

### Saving the protein sequence in FASTA format

• At the top of the window, the option Display allows you to choose among different formats. Select the format FASTA. This will display the amino acid sequence of the Gal4p protein.
	    >gi|1169823|sp|P04386.2|GAL4_YEAST RecName: Full=Regulatory protein GAL4
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEQLFLL
IFPREDLDMILKMDSLQDIKALLTGLFVQDNVNKDAVTDRLASVETDMPLTLRQHRISATSSSEESSNKG
QRQLTVSIDSAAHHDNSTIPLDFMPRDALHGFDWSEEDDMSDGLPFLKTDPNNNGFFGDGSLLCILRSIG
ILFNCILAIGAWCIEGESTDIDVFYYQNAKSHLTSKVFESGSIILVTALHLLSRYTQWRQKTNTSYNFHS
FSIRMAISLGLNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSSVDDVQRTT
TGPTIYHGIIETARLLQVFTKIYELDKTVTAEKSPICAKKCLMICNEIEEVSRQAPKFLQMDISTTALTN
LLKEHPWLSFTRFELKWKQLSLIIYVLRDFFTNFTQKKSQLEQDQNDHQSYEVKRCSIMLSDAAQRTVMS
VSSYMDNHNVTPYFAWNCSYYLFNAVLVPIKTLLSNSKSNAENNETAQLLQQINTVLMLLKKLATFKIQT
CEKYIQVLEEVCAPFLLSQCAIPLPHISYNNSNGSAIKNIVGSATIAQYPTLPEENVNNISVKYVSPGSV
GPSPVPLKSGASFSDLVKLLSNRPPSRNSPVTIPRSTPSHRSVTPFLGQQQQLQSLVPLTPSALFGGANF