Lab 1A - Exploring NCBI: Bioinformatic Methods I Lab 1
The National Center for Biotechnology Information (NCBI), maintained by the US National
Library of Medicine and National Institutes of Health, is one of the world’s most important
resources and repositories for biological data. This fantastic online resource provides an
extensive network of databases cataloging an ever-growing wealth of genetic, medical, and
biochemical information from all walks and crawls of life. Entire genomes, from viruses to
humans, are compiled, organized, and cross-referenced within these networks, such that surfing
the genome can be almost as easy as surfing the web.
But you have to know a) what you’re looking for, and b) what you’re looking at to get anything
out of these databases. This is what this first lab is going to help you do. Note that Google and
other search engines typically do not index database-driven websites, which is why they cannot be
used for searching for information that is stored at NCBI (nor do they handle sequence searching
well, especially in the case of protein sequences).
The primary portal for accessing data at NCBI is called Search NCBI. But first, let’s start by
visiting NCBI’s website and examining the interface, which undergoes constant change.
1. Open your Web browser and go to NCBI’s homepage: www.ncbi.nlm.nih.gov. This page
provides links to all of NCBI’s databases and resources. It’s worth exploring here just to
get a better idea of the scope of NCBI. If you click About the NCBI you will be taken to
a page summarizing some of these resources. You can also check out the NCBI
Handbook (https://www.ncbi.nlm.nih.gov/books/NBK143764/) for more information.
Copyright © 2021 by D.S. Guttman and N.J. Provart
Bioinformatic Methods I Lab 1
2. Now let’s move to the Search NCBI (formerly known as GQuery or Entrez) portal – select
All Databases from the navigation bar at the top of the NCBI start page and click “Search”
beside the empty field. First, scan down the assortment of databases queried through this
portal. You will notice there is everything from the biomedical literature at PubMed to
nucleotide databases, taxonomy databases, protein structure databases, and expression profile
databases. Let’s see what happens when you do an unguided search on the site. In the
"Search NCBI" box, type in bacteria. The output is a summary page of the number of hits in
each section. A search of bacteria gives millions of hits – not very helpful. We need
specifics.
Figure 2. The Search NCBI portal page with bacteria used as a search word.
3. Usually when searching these databases, you have either a region of DNA or a protein (or
protein function) of interest. For this lab you’ll be using a gene from Arabidopsis
thaliana, a small flowering plant that is like the fruit fly of the plant world as it has a
comparatively rapid life cycle and requires little space to grow. The protein product of
this gene is recorded under accession number NP_001318308, and it is an E3 ligase,
involved in ubiquitination of proteins, which is a signal for their degradation.
4. Go back to the Search NCBI portal page and try a more focused search. Use the search terms
associated with the gene sequence we’ll be using, together with the GenBank Field Qualifiers
shown below (a full list of qualifiers is presented in Appendix 1). Try the four different
searches presented below and look at the number of records, specifically “Protein” records,
found:
gene keywords
e.g. ubiquitin-protein ligase
gene keyword AND organism
e.g. ubiquitin-protein ligase AND Arabidopsis thaliana
gene keyword [PROT] AND organism [ORGN]
e.g. ubiquitin-protein ligase [PROT] AND Arabidopsis thaliana [ORGN]
accession or GI number
e.g. NP_001318308
That narrowed things down significantly!
Lab Quiz Question 1 (*Answer lab quiz questions while doing lab!)
Note that using parentheses can be very helpful in making sure you get exactly what you
want. For example:
Using quotation marks can also dramatically affect your search (e.g. 16s rRNA vs. “16s
rRNA”).
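The fielded searches listed above can also be composed programmatically. The following is a minimal sketch rather than part of the lab: it assumes NCBI's E-utilities ESearch endpoint at the URL shown, and it only builds the request URL without sending it. The helper names `fielded` and `esearch_url` are made up for this illustration.

```python
from urllib.parse import urlencode

# Assumption: base URL of NCBI's E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, qualifier):
    """Attach a field qualifier such as PROT or ORGN to a search term."""
    return f"{term} [{qualifier}]"


def esearch_url(db, query):
    """Build (but do not send) an ESearch request URL for one database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# The third search from the list above: keyword [PROT] AND organism [ORGN].
query = " AND ".join([
    fielded("ubiquitin-protein ligase", "PROT"),
    fielded("Arabidopsis thaliana", "ORGN"),
])
url = esearch_url("protein", query)

print(query)
print(url)
```

Pasting the printed query string into the Search NCBI box should behave like the third search typed by hand; the URL form is only useful if you later want to script such searches.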
Version numbers follow the Accession number and indicate the revision history of that entry
starting with 1 and increasing with each revision. The standard format is Accession.Version.
A GI number (GenInfo Identifier – sometimes written in lower case, "gi") was simply a series
of digits that was, until recently, assigned consecutively to each sequence record processed by
NCBI. The GI system of identifiers ran in parallel to the Accession.Version system; therefore, if
the DNA or protein sequence changed in any way, it would receive a new GI number.
Example: When a new entry was submitted to GenBank it was assigned an accession number
(say AF000001). Since this is the first version the Accession would be appended with ‘.1’, so it
would look like AF000001.1. At the same time it was given a GI number (say GI:1234567). Now
imagine that the researcher who originally submitted the record wanted to update the
information. The updated record would keep the same Accession number, but would increase in
version number (AF000001.2). The new record would have been given a completely new GI
number (say GI:9876543).
Why is this important? The Accession number will always give you the most up-to-date
information on a record, while the Accession.Version will always take you to a specific record.
There are times when you want the most current information, and other times when you want to
point to a particular piece of information from a particular point in time (e.g. a particular record
that you did an analysis with), even if more information has been subsequently added. Note that
as of September 2016, NCBI started phasing out the use of GI numbers. The use of
Accession.Version form is now recommended for accessing a particular record, instead of the
GI number. GI numbers are not to be confused with Entrez Gene IDs, which are an entirely
different referencing system that NCBI uses!
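To make the Accession.Version format concrete, here is a small sketch that splits an identifier into its two parts. The function name `split_accession` is hypothetical, and the identifiers are the illustrative ones used in the example above.

```python
def split_accession(identifier):
    """Split 'Accession.Version' into (accession, version).

    The version is returned as an int, or None when the identifier
    carries no '.version' suffix (i.e. it refers to the latest record).
    """
    accession, dot, version = identifier.strip().partition(".")
    return accession, (int(version) if dot else None)


print(split_accession("AF000001.1"))    # ('AF000001', 1)
print(split_accession("AF000001.2"))    # ('AF000001', 2)
print(split_accession("NP_001318308"))  # ('NP_001318308', None)
```

Keeping the version alongside the accession in your notes is what lets you point back to the exact record used in an analysis.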
1. At the bottom left of the NCBI homepage find the “NCBI Help Manual” link.
Click on it. Then access the “Entrez Help” section.
2. You are now in Entrez Help. The Entrez collection of databases is queried when you use
the Search NCBI interface. Note the contents that explain everything from search options
to saving sets of records.
3. Notice that under the section Entrez Searching Options some other appropriate
qualifiers are given, as illustrated on the previous page.
5. Search for our accession number of interest (e.g. NP_001318308 from above) through the
Search NCBI portal page. It should give you 1 protein sequence hit in the Proteins
section. Click on it so that you get its full GenBank description (you can also click on the
“armadillo/beta-catenin repeat protein [Arabidopsis thaliana]” link at the top of the page
as the NCBI system recognizes that you’ve entered a protein identifier and hence
provides some summary information for that above the numerical overview of results).
6. Notice all the hyperlinks within the text. It looks messy but is in fact straightforward. For
example, for taxonomic information, click on the SOURCE ORGANISM hyperlink (3702).
Some records have links to the primary publication where this sequence was originally cited
in a PUBMED number hyperlink (not the case in the above example, but there is a PubMed
reference for the sequence). Click around on different links and see what you find.
a. What is the taxonomic lineage of your organism?
b. Has the genome of this organism been sequenced, i.e. is there a Genome Project?
c. If so, can you find the accession for the full sequence or one of the chromosomes?
You can find much more information on the structure of a GenBank file at
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
7. Go back to the GenBank record and click on the CDS link, just above the actual sequence
(circled in red in Figure 3 on the previous page).
a. Where did this take you or what happened when you did this?
8. Go back to the GenBank record and examine the Related Information section on the lower
right. This gives you direct links to other databases with information on this query. Find the
Gene link.
Figure 4. The Related Information menu for NP_001318308, to the right of the record. The
arrow is pointing to the “Gene” link.
9. Select Gene from the Related Information menu. This is a great starter resource at NCBI.
Scroll through the different sections. Use them to answer the following questions.
a. Where is your gene’s location in the genome? (Tip: hover with your cursor over the
green bars in the “Genomic regions, transcripts, and products” section; the green
bars represent the gene in the sequence viewer)
b. How many exons do you see in this gene? Tip: how many green boxes are there?
c. What are the names of the genes surrounding it (i.e. what is its “Genomic context”)?
d. Does it have any conserved domains? What are they called? (Tip: use the “Related
Information” link to Conserved Domains on the right of the Gene page)
e. After exploring conserved domains go back to the Gene page. What biological
process (Gene Ontology terms) is this gene involved with (scroll down!)?
Figure 5. Truncated GenBank Gene page for At2g28830 (also known as PUB12), the gene
that encodes NP_001318308.
10. On the Gene page, there are also Additional links to examine a gene’s structure, function
and phylogenetic relationships further. The navigation sidebar on the right has an
“Additional links” hyperlink which will take you to the bottom of the page, where they’re
found for most genes. Click [+] Gene LinkOut to see them.
Click around and explore the variety of ways that data for PUB12 are interconnected and
displayed (don’t worry, you can’t break anything). Using the Related Information links
can you find any publications associated with this gene? What about gene expression
data? The next page shows the related “RefSeq RNA” record for the corresponding
encoding mRNA (NCBI’s RefSeq aims to provide canonical “reference” sequences –
genomic, mRNA, CDS, protein etc. – for many model organisms).
b. Why is the length of the mRNA different from the value you can calculate from the start
and stop positions in Question 9a?
On most NCBI search pages (except, oddly, Search NCBI) you can click on “Save Search” or “Create
Alert” below the search box. Register for an account and save your search. You can also
combine previous searches using the History tab and the search numbers listed within it, as
well as save your searches by registering for a My NCBI account, so you don’t have to keep
redoing the same searches in the future.
One of the most important bioinformatic strategies used for the functional annotation of
genes and genomes is to predict the function of uncharacterized genes or proteins based on
their similarity to sequences with better functional annotations. BLAST is perhaps the
single most important tool for finding database sequences that are similar to a query
sequence of interest.
The Basic Local Alignment Search Tool (BLAST; Altschul et al., 1997) is a very powerful
approach to identifying database sequences that share local similarity to a query sequence (see
below for definitions). There is a very important chain of assumptions used in biological
research that is generally followed when using BLAST:
Homologous genes share sequence similarity
Orthologous genes have the highest similarity among multiple species
Orthologous genes most likely have similar functions
Consequently, sequences that are most similar between multiple
species share similar functions
Note, it is very important to understand that these are only assumptions, and there are many
reasons and instances where these assumptions prove to be false. Nevertheless, they are a
reasonable starting place.
Definitions:
Similar sequences – sequences that share a significant number of residues (nucleotides
or amino acids). Sequences can be similar due to homology or simply by chance. The
higher the similarity between sequences, the more likely they are to be homologous.
Homologous sequences – sequences that are related through common ancestry.
Homology is qualitative – two sequences either are, or are not related through common
ancestry. Homologous sequences can vary greatly in their level of similarity – from
100% to 0%.
Orthologous sequences – sequences that are related through a past speciation event.
Orthologous sequences are assumed to share common functions.
Paralogous sequences – sequences that are related through a past gene duplication event.
Genes often diverge in function after duplicating; therefore, paralogous sequences are not
assumed to share a common function.
Query sequence – your sequence; the sequence you are interested in finding more about.
High Scoring Segment Pair (HSP) – ‘hits’ to the database. A subsequence match
between your query sequence and a database sequence returned by BLAST.
Local alignment – a sequence alignment that extends only across part of the sequence.
Global alignment – a sequence alignment that extends across the entire sequence (from
end to end).
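The difference between local and global alignment can be seen with a toy scoring exercise. The sketch below uses textbook dynamic programming (Smith-Waterman for local, Needleman-Wunsch for global), not BLAST's own heuristic, and the scores of +1 for a match and -1 for a mismatch or gap are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best end-to-end alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


query = "ACGT"
subject = "GGGGACGTGGGG"  # made-up sequence containing the query in the middle
print(local_alignment_score(query, subject))   # 4
print(global_alignment_score(query, subject))  # -4
```

The short query matches perfectly inside the longer sequence, so the local score is high, while the global score is dragged down by the gaps needed to span the whole subject. This is why a local method suits database searching, where a query often matches only part of a record.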
1. First, we need a query sequence for the search. Let’s start with our given gene again, but this
time we’ll use the nucleotide sequence corresponding to the protein sequence, not the protein
sequence. First try finding the gene’s DNA sequence using the Search NCBI tool again.
On the Search NCBI Portal page, search “All Databases” for your given protein sequence
again using the Accession number. Using the protein from the first part of this lab, we
would search for NP_001318308.
The first page that comes up is the summary page. Once you’re on this page you can
move to the database of interest. In this case you probably don't have hits in too many
databases since you had a very specific search.
Figure 7. Search NCBI portal queried for NP_001318308 (partial view), with Gene results
highlighted (numbers of results may differ slightly depending on when you’re accessing NCBI).
Try clicking the Gene link. Does the Gene page give you the gene sequence alone? What
do you get instead? Note the context specific link menus that pop up when you hover
over the graphic of the gene with your mouse pointer. You can click on the green boxes
denoting the exons of the gene to get links to various sequences and analyses associated
with the gene. Note that the green track is a composite of the mRNA and CDS tracks –
click on either the NM_ or NP_ number to see the deconvolution of the green track
(Figure 8).
Figure 8. Part of the Gene page for NP_001318308, showing pop-up to sequence links.
1. Click the green bars to make mRNA and protein tracks appear; 2. hover over the
mRNA track to see the info panel; 3. Click “Genbank” link to see the GenBank record for the
genomic region for this gene.
Click on the RefSeq RNAs link in the “Related information” panel on the right. This
takes you to the mRNA that encodes the protein you have been looking at (we are
accessing the same record you accessed in Step 10 of the first part of the lab). Notice the
feature list in the record. One Feature in the GenBank record is gene, and corresponds to
base position 1 – 1949 on this record. Another feature is the coding sequence (CDS),
which corresponds to base position 33 – 1781.
a. Given your biology background knowledge, why do you think these are different?
Above the Sequence Viewer panel, click on the “Go to nucleotide: Genbank” link (see
Step 3. in Figure 8 above). You will be taken to the genomic region that encodes the
mRNA you were just looking at. Notice how the gene feature corresponds to positions 1–
2201, while the mRNA feature corresponds to positions 1–86, 170–286, 370–819, and
906–2201 and the CDS feature corresponds to nucleotide positions 33–86, 170–286,
370–819, and 906–2033. You may have remarked that the sequence from the
chromosome has been reverse complemented.
b. Again, why are these different? Tip: recall the Central Dogma of Molecular
Biology!
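The coordinates quoted above can be cross-checked with a little arithmetic. This sketch simply sums the lengths of the listed segments (positions are 1-based and inclusive, as in GenBank feature tables); the function name `total_length` is made up for this illustration.

```python
def total_length(segments):
    """Sum the lengths of 1-based, inclusive (start, end) segments."""
    return sum(end - start + 1 for start, end in segments)


# Feature coordinates on the genomic-region record, as quoted in the lab text.
gene_span = [(1, 2201)]
mrna_segments = [(1, 86), (170, 286), (370, 819), (906, 2201)]
cds_segments = [(33, 86), (170, 286), (370, 819), (906, 2033)]

print(total_length(gene_span))      # 2201
print(total_length(mrna_segments))  # 1949
print(total_length(cds_segments))   # 1749

# The CDS on the mRNA record spans positions 33 to 1781.
print(1781 - 33 + 1)                # 1749
```

The summed mRNA segments give 1949, the length of the gene feature on the mRNA record, and the summed CDS segments give 1749, the length of the 33–1781 CDS on that record. Keep these numbers in mind when answering questions a and b.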
Let's return to the mRNA record we were previously working with (NM_001336190).
Click on the CDS link. Now you are looking at the information for the coding sequence,
as opposed to the whole gene or protein (highlighted in brown).
Using the “Display: FASTA” option in the grey bar at the bottom of the page, generate a
FASTA-formatted version of the CDS.
Now you have the sequence in the most basic and easily managed format – FASTA
format. FASTA format is simply a header line that starts with a ‘>’ followed by text
describing the sequence, and then the actual sequence beginning on the next line. The
sequence can be either DNA or protein, and may be continuous (scrolling off the page),
or cut into more manageable lengths typically ranging between 60-80 residues.
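Because FASTA is so simple, it is easy to read and write with a few lines of code. The sketch below is one possible way to do it, assuming plain-text input; the function names and the toy sequence are made up for illustration and are not a real NCBI record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped to `width` residues per line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + lines)


example = ">toy_sequence made-up example\nATGGCT\nTTAGGA\n"
records = parse_fasta(example)
print(records)  # [('toy_sequence made-up example', 'ATGGCTTTAGGA')]
print(format_fasta(*records[0], width=4))
```

Whether the sequence is on one long line or wrapped, the parser joins it back into a single string, which is why both layouts are interchangeable for tools such as BLAST.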
2. Let’s do some BLASTing! Use the “Run BLAST” link in the “Analyze This Sequence” part
of the webpage. [Or open a new tab or window in your browser and go back to the NCBI
home page (www.ncbi.nlm.nih.gov), then select BLAST from the Resources dropdown
along the top, under the DNA&RNA subsection].
There are lots of options here. We will discuss some of these next lab, but right now let’s
work with the simplest. Since our sequence is a nucleotide sequence, we want to do a
nucleotide blast.
On the BLAST page, note that under the Enter Query Sequence section, the NCBI
system has automatically entered the accession number (but you can also enter a GI
number, or FASTA sequence) and subrange (we’ll be searching with just the coding
sequence part of the mRNA sequence). You could also copy-and-paste the FASTA
formatted CDS sequence you found as in Figure 10 into the query box without defining a
subrange – you should be clear on the difference between an mRNA sequence and coding
sequence at this point...
Figure 11. The blastn query page, with optimization for “Somewhat similar sequences
(blastn)” selected.
Scan the sections of the page. You have quite a bit of control over how the algorithm runs
(particularly if you click [+] Algorithm parameters near the bottom).
We want to query the full NCBI database; the NCBI linking system has automatically
changed the default Database (which is Human) to Other and Nucleotide collection
(nr/nt) because our sequence is non-human. The nr database is the non-redundant
collection of sequences in GenBank.
Change the Program Selected / Optimized for to Somewhat similar sequences (blastn).
Note all the small question mark icons around the page. Click any one of these to find
out more about the associated parameter. For example, by clicking the question mark in
the Program Selection section you get a very brief summary of the different methods.
By clicking more you jump to a new page with full documentation for the algorithms.
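The choices made on this form (query, subrange, database, program) can also be written down as a set of request parameters. The sketch below is an assumption-laden illustration: the parameter names follow what NCBI documents for its BLAST URL API as far as we are aware, so check them against the current documentation before relying on them. It only builds the URL and sends nothing; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: endpoint of NCBI's BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blastn_put_params(accession, start, stop, database="nt"):
    """Collect the form choices for a blastn submission as a dict.

    Parameter names (CMD, PROGRAM, DATABASE, QUERY, QUERY_FROM, QUERY_TO)
    are assumed from the BLAST URL API and should be verified.
    """
    return {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastn",   # nucleotide query against nucleotide database
        "DATABASE": database,  # nucleotide collection
        "QUERY": accession,    # an accession or a FASTA sequence
        "QUERY_FROM": start,   # subrange start (CDS start on the mRNA)
        "QUERY_TO": stop,      # subrange end (CDS end on the mRNA)
    }


params = blastn_put_params("NM_001336190", 33, 1781)
print(BLAST_URL + "?" + urlencode(params))
```

Writing the settings down this way is mainly a record-keeping aid: it captures exactly which query, subrange, and database a search used.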
Identity – the extent to which two sequences are invariant. A very poor measure since it
doesn’t take into account the subtleties of sequence relationships (e.g. a small region of a
highly conserved domain within two sequences that are otherwise very poorly
conserved).
Bit score – the alignment score (S). A very precise measure that is normalized over the
particular score system employed. Suffers from the disadvantage of being dependent on
the length of the query.
E value – the expect value. A value that is based on the number of different alignments
with scores at least as good as that observed, which are expected to occur simply by
chance. The lower the E value, the more significant the score. This is by far the best
metric to use since results of different searches in the same database can be readily
compared. Note that E value is dependent on the size of the database (n) and the length
of the query sequence (m). The same sequence searched on different databases
containing identical hit sequences would result in different E values being reported.
$E = mn \cdot 2^{-S}$
We’ll go into greater detail about this calculation in next week’s class.
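As a preview, the formula can be tried out numerically. This is a minimal sketch that evaluates E = mn · 2^(-S) directly, with m the query length, n the database size, and S the bit score. The database sizes and the bit score below are hypothetical values chosen for illustration, and BLAST itself applies length corrections that are ignored here.

```python
def expect_value(query_length, database_size, bit_score):
    """E = m * n * 2^(-S), with S the bit score."""
    return query_length * database_size * 2.0 ** (-bit_score)


m = 1749                 # length of the CDS query used in this lab
small_db = 1_000_000     # hypothetical database size in residues
large_db = 2 * small_db  # the same hits inside a database twice as large
score = 50               # hypothetical bit score

print(expect_value(m, small_db, score))
print(expect_value(m, large_db, score))
print(expect_value(m, small_db, score + 1))
```

Doubling the database size doubles E for the same alignment, which is why identical hits found in different databases report different E values. Each additional bit of score halves E.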
Explore the Graphic Summary tab. Scroll your mouse over the coloured bars.
c. What do the coloured bars mean?
d. How does the colour code work?
e. What information is displayed when you hover on an entry?
f. What do you notice about the significance values as you move down the graphical
summary?
g. What is the genus and species of the top (best) hit?
h. What happens if you click on one of the entries?
i. How many sequence matches are listed for this query sequence? How are they
ordered? (you can sort these segments in other ways, like by identity, score, and
query start position.)
j. What happens if you click the Accession hotlink?
k. What happens if you click the Alignments hotlink?
Return the formatting to the original Pairwise format. Go back to the graphical
summary. If there are any low-scoring segments (i.e.: green or blue-coded blocks), click
on one.
n. What is its E-value?
o. Does it have a high percent identity? If so, why would BLAST give it such a
poor E-value?
p. Do you think these hits are homologous? Why or why not?
End of Lab!
Lab 1 Objectives
By the end of Lab 1 (comprising the lab including its boxes, and the lecture), you should:
know how to search for records at NCBI, both using search terms or identifiers (first part
of lab) and Search NCBI / GQuery, or using a nucleotide sequence and BLAST;
know the difference between a GenBank accession number, a version number, and a GI
number;
understand the difference between the nucleotide sequence database part of GenBank and
the protein sequence part of it;
know the parts of a GenBank record and be able to switch between sequence formats
(e.g. to FASTA format);
be familiar with the interconnectedness of various NCBI databases and be able to call up
linked records with ease;
be able to use nucleotide BLAST (Blastn) to search GenBank, and be able to interpret the
output – what does the E-value tell you etc.?;
understand the meaning of homologous, orthologous, and paralogous sequences;
be able to use the Help function to address any question you may have with regards to the
NCBI interface (if you have any questions on background material, check in with the
forums for this course on Coursera!).
Do not hesitate to post any questions you might have to the Forum section of the Coursera
website for this course if you do not understand any of the above after reading the relevant
material.
Further Reading
Chapter 2 “Information Organization and Sequence Databases” in Concepts in Bioinformatics and Genomics by
Jamil Momand and Alison McCurdy, Oxford University Press, 2017. pp 21-37.
SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and DJ Lipman (1997) Gapped BLAST
and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25: 3389-3402.
CA Kerfeld, KM Scott (2011) Using BLAST to Teach “E-value-tionary” Concepts. PLoS Biol 9(2):
e1001014. http://dx.doi.org/10.1371/journal.pbio.1001014.
Accession [ACCN]
Contains the unique accession number of the sequence or record, assigned to the nucleotide, protein, structure,
genome record, or PopSet by a sequence database builder. The Structure database accession index contains the PDB
IDs but not the MMDB IDs.
Filter [FILT]
Contains predetermined or filtered subsets of the various databases. These subsets or filters are created by grouping
records that are commonly linked to other GQuery databases or within the same database. For example, the PopSet
database Filter index includes PopSet all, PopSet medline, PopSet nucleotide, and PopSet protein. The PopSet
medline filter includes all PopSet records with links to PubMed; the PopSet nucleotide filter includes all PopSet
records with links to the nucleotide database; and, the PopSet protein filter includes all PopSet records with links to
the protein database. The PopSet all filter includes all PopSet records.
Issue [ISS]
Contains the issue number of the journal in which the data were published.
Keyword [KYWD]
Contains special index terms from the controlled vocabularies associated with the GenBank, EMBL, DDBJ, SWISS-
Prot, PIR, PRF, or PDB databases. Browse the Keyword indexes of the individual databases to become familiar with
these vocabularies. A Keyword index is not available in the Structure database.
Weight section of the GQuery help document. Note that molecular weight must be entered as a fixed 6 digit field,
filled with leading zeros (not letter O), e.g., 002002 [MOLWT]
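The fixed six-digit format is easy to get wrong by hand. As a small sketch, a search term with the required leading zeros could be produced like this (the function name `molwt_term` is made up for illustration):

```python
def molwt_term(weight):
    """Format an integer molecular weight as a zero-padded 6-digit [MOLWT] term."""
    if not 0 <= weight <= 999999:
        raise ValueError("molecular weight must fit in six digits")
    return f"{weight:06d} [MOLWT]"


print(molwt_term(2002))  # 002002 [MOLWT]
```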
Organism [ORGN]
Contains the scientific and common names for the organisms associated with protein and nucleotide sequences.
Properties [PROP]
Contains properties of the nucleotide or protein sequence. For example, the Nucleotide database's Properties index
includes molecule types, publication status, molecule locations, and GenBank divisions. A Properties index is not
available in the Structure database.
Uid [UID]
Contains the Medline unique identifier for records that contain published references that are linked to PubMed. The
Uid index is not browsable.
Volume [VOL]
Contains the volume number of the journal in which the data were published.