Lab 1 - Introduction and Protocol
BIOL1020 - LAB 1
INTRODUCTION TO BIOINFORMATICS
This is an in-person lab. The protocol below goes over the steps you need to take
to complete the pre-lab and collect data for your lab assignment.
You need to complete the PRE-LAB ASSIGNMENT and send an electronic copy through the
assignment submission on Canvas BEFORE the start of your lab. Submissions after the start of
the lab will not be accepted and will receive a zero for the preparation of the lab! It is
recommended that you work on the
pre-lab well before the due date and submit it at least 1 hour before the start of the lab
so that you have time to resolve any technical issues. Do not leave it until the last minute.
Late submissions of the pre-lab assignment due to technical difficulties will not
be addressed.
Your LAB ASSIGNMENT is to be completed during the lab session. A printed copy must be
submitted to the lab TA at the end of your scheduled lab session. Late submissions will not be
accepted. It is recommended that you have a draft of your
Lab Assignment (complete the components that you can in advance) well before your in-person lab
session. Then use the in-person lab session to complete or modify your Lab Assignment with the
help of your TA and lab partner.
For the PRE-LAB ASSIGNMENT you will need to upload a pdf, doc or docx file as an
assignment submission in Canvas. Here are the steps to upload a file as an assignment
submission in Canvas. Pay close attention that you are submitting the correct assignment in the
correct assignment folder.
Note: After your submission, always double check the submitted file to ensure that it is the
correct file, and it can be opened by the TA for grading. Technical issues in uploading
discovered after the due date will not be addressed.
BIOL 1020 Lab 1 2
Introduction
Model organisms offer a cost-effective way to follow the inheritance of genes through many
generations in a relatively short time among other applications. One of the first and most
important problems encountered in these genome projects was how to acquire, store, and analyze
massive amounts of DNA sequence information. GenBank, a major public repository of DNA
sequence data, has grown to include roughly 4.86 million individual sequence records
representing about 3.86 billion base pairs (as of early 2000) as compared to 0.56 million records
in 1995! GenBank contains the full and partial genome sequences of over 670 different
organisms, including 27 complete genomes (Reed 2000).
Where did all of these data come from? From local and international academic and government
research groups as well as commercial, privately owned companies. Almost all private
companies conducting genomic research, such as Celera Genomics, Incyte, Human Genome
Sciences, Millennium Pharmaceuticals, have sequenced stretches of human and other organisms'
DNA. Some of this privately-generated sequence data has been submitted to public databases
like GenBank, while some data remain proprietary.
GenBank, built by the National Center for Biotechnology Information (NCBI) at NIH in
Bethesda, Maryland, is part of the International Nucleotide Sequence Database Collaboration,
along with its two partners, the DNA Database of Japan (DDBJ, Mishima, Japan) and the
European Molecular Biology Laboratory (EMBL) nucleotide database from the European
Bioinformatics Institute (EBI, Hinxton, England). All three centers are separate points of data
submission, but all of them exchange updated information daily.
In the early 1980s, there was no common format or electronic submission of data, which
slowed the collation of information tremendously. In 1988, the three groups (American, Japanese
and European) met and agreed to use a common format for data elements, with each database
updating only the records that were submitted to it. This means that each record is owned by the
database that created it, that “update clashes” and overwritten records are prevented, and, most
importantly, that the information in all three databases is compatible and accessible to a global
community. These database centers are also computational biology centers, because it became clear
that sequence data could not simply be generated by automated means but needed to be proofread
by biologists. Additional tools were developed to analyze the information.
BLAST
In this lab, you will use BLAST to search the sequence database. This program is an important
research tool because it is a good combination of speed, sensitivity, flexibility, and statistical
rigor. BLAST is an acronym for Basic Local Alignment Search Tool. The BLAST search
algorithm takes your input sequence and compares it to all known genetic sequences (DNA or
protein), identifying the known molecules that have similar sequences.
There are various BLAST programs designed for either nucleotide sequence queries or protein
sequence queries. You will be using BLASTN for this lab: BLASTN takes a nucleotide sequence
(the query or unknown sequence) and its reverse complement, and searches them against a
nucleotide sequence database.
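To make the idea of a reverse complement concrete, here is a minimal Python sketch. It is only an illustration of the concept, not how BLASTN is implemented, and the function name `reverse_complement` is chosen here for the example. It assumes the sequence contains only the letters A, C, G and T.

```python
# Minimal sketch: reverse complement of a DNA string (A, C, G, T only).
# Each base is swapped for its pairing partner, then the string is reversed.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


query = "TATCGCGTATTGCC"  # short example sequence that also appears in this handout
print(reverse_complement(query))  # GGCAATACGCGATA
```

Searching with both strands matters because a database entry may have been recorded from either strand of the same double-stranded DNA.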
CLUSTALW2
The Multiple Sequence Alignment (MSA) algorithm ClustalW2 takes a set of input sequences
and aligns them so that homologous regions (the features that are common to the entire set of
sequences) are highlighted. This serves to identify the nucleotides or amino acids within the
sequences that have been conserved during their evolutionary divergence. Natural selection tends
to select against changes that result in loss of molecular function, thus conserved residues
identified in an MSA are presumed to be important for the structure and function of the
molecule.
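As a rough illustration of what "conserved" means in an alignment, the sketch below scans sequences that are already aligned and reports the columns where every sequence carries the same residue. The sequences are made up for the example, the "-" gap character is an assumed convention, and this is not the ClustalW2 algorithm itself, which also has to compute the alignment.

```python
def conserved_columns(aligned):
    """Return 0-based column indices where all sequences share the same residue.

    `aligned` is a list of equal-length strings; "-" marks a gap.
    Columns containing a gap in the first sequence are never counted.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    return [
        i
        for i in range(length)
        if aligned[0][i] != "-" and all(seq[i] == aligned[0][i] for seq in aligned)
    ]


# Hypothetical, already-aligned sequences (not real 16S data).
alignment = ["ACGT-A", "ACGTTA", "ACCTTA"]
print(conserved_columns(alignment))  # [0, 1, 3, 5]
```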
The Multiple Sequence Alignment, or MSA, homology search algorithm is sometimes called a
“many-against-each-other” search because the input is a small, defined set of sequences that are
compared only against each other, not against an entire database. This is in contrast to the
BLAST similarity search algorithm, a “one-against-all” similarity search, in which the input is a
single sequence that is compared against all other known sequences listed in the database. Thus,
the starting point for an MSA is a set of sequences that are already presumed to be homologous.
ENTREZ
Another extremely powerful tool is the National Center for Biotechnology Information's
(NCBI) Entrez search engine (http://www.ncbi.nlm.nih.gov/Entrez). Entrez not only lets you
search for genetic sequence database records but interfaces with several databases. Entrez
connects to databases of nucleotide or protein sequences, 3-dimensional structures of
macromolecules, and even the Medline bibliographic database via PubMed -- all from this single
site.
A separate pre-lab file has been made available for you to edit in the assignment folder.
You will perform the Bacterial ID Virtual Lab for section 1 of the pre-lab assignment by
connecting to the HHMI web site and using their “virtual Bacterial Identification Lab.” The
virtual lab demonstrates how one would isolate and purify a specific bacterial DNA sequence
using PCR. You will use the isolated DNA from this virtual lab in Section 2.
https://www.biointeractive.org/classroom-resources/bacterial-identification-virtual-lab
Section 2: Identify an unknown bacterium from the DNA sequence of its 16S rRNA gene using BLAST
Using the search tool BLAST, you will analyze the DNA sequence (from Section 1) and identify the
bacterium.
• Use the Multiple Sequence Alignment tool ClustalW to compare the sequences (refer
to pages 11-15 of the intro/protocol document). After conducting multiple sequence
alignments, you will construct a phylogenetic “tree” diagram using ClustalW. Save your
tree and copy/paste the diagram below.
• Instructions for how to use ClustalW can be found on page 15 of the introduction and
protocol document
• Indicate on your tree diagram which bacteria are most similar and which are most different.
Briefly explain your selection.
Background
For this part of the lab, which you will do on your own before you come to lab, you will
connect to the Howard Hughes Medical Institute's (HHMI) web site and use their “Virtual
Bacterial Identification Lab.” Explanatory sections are summarized or reprinted here from their
virtual lab (with permission). The purpose of the HHMI virtual lab is to familiarize you with the
science and techniques used to identify different types of bacteria based on their DNA sequence.
In the process, you will see how bacteria are grown and specific DNA is isolated and purified
using the molecular techniques Polymerase Chain Reaction (PCR) and DNA sequencing. These
techniques are fundamental tools and often used in molecular research as well as forensic
analysis.
In this specific example, the sequence of DNA used for identifying the bacterium is the region
that codes for the 16S subunit of the ribosomal RNA (16S rDNA). From genomic studies, it has
been found that different bacterial species have unique 16S rDNA, thus a comparison of this
region may be used as a diagnostic test. Imagine you are a pathologist or a pathology lab
technician at a well-equipped research hospital. Your task is to identify a bacterial sample
received from a clinician of a very sick patient who needs to be on the correct drug regimen as
quickly as possible.
Why use molecular techniques to identify the pathological (disease-causing) bacterium instead
of traditional methods?
Over the years, a battery of tests has been developed to categorize and identify bacteria. Tests
include staining and growing bacteria under a variety of conditions. Such procedures typically
require vigorously and reliably growing bacterial cultures. Many pathogens grow poorly on solid
medium while others grow only in liquid culture, making identification through traditional
techniques difficult or impossible. With the aid of molecular methods, however, these limitations
can be overcome. In addition, some species of bacteria cannot be differentiated from closely
related species through traditional methods. For these species, molecular methods offer the only
reliable and convenient means of identification.
Procedure
https://www.biointeractive.org/classroom-resources/bacterial-identification-virtual-lab
Go through the exercise, and “perform the lab” and complete all steps (all steps are described in
the virtual lab, in the link above.)
For your understanding (no need to report answers in the pre-lab assignment), you should be
able to answer the questions in italics:
1. Prepare a sample from the patient and isolate whole bacterial DNA. [How was this done in
the virtual lab?]
2. Make many copies of the desired piece of DNA. [What technique do you use? How do you
separate the desired DNA sequence from the rest of the DNA? How do you purify your
sample?]
4. In Section 2 below, we will discuss how you will use the DNA sequence information to
identify the bacterium isolated from the patient.
Background
As stated earlier, it has been found (from genomic studies) that different bacterial species have
unique 16S rDNA, thus a comparison of this region may be used as a diagnostic test. Bacterium
identification relies on matching the unknown sequence from a particular sample against a
database of all known 16S rDNA sequences using a program called BLAST.
http://www.ncbi.nlm.nih.gov/BLAST
After identifying the bacterium described in the online virtual lab (SAMPLE A), identify other
16S rDNA sequences (choose from samples B, C, D, E, F) using BLAST. You will need to
submit these to your TA at the beginning of the lab.
Why would you use BLAST? (Adapted from HHMI “Virtual Bacterial Identification Lab”)
Under what situations would a scientist search sequence databases? As an example, sequence
matching can be used to determine whether a newly identified DNA sequence is part of a known
gene. In the simplest scenario, if a new sequence is identical or almost identical (except for a few
nucleotide changes) to that of a gene in the sequence database, it is reasonable to predict that the
new sequence is either part of the same gene or of a closely related gene. But what if two
sequences that appear to be different share sections that are identical? How do you know whether
the identical sections are due to chance or indicate some meaningful relationship between the
two sequences? Sequence analysis using BLAST or another program provides a “similarity
score” to help answer this question. (The “similarity score” is discussed further below.)
If the function of a particular DNA sequence is already known (e.g., the 16S rRNA gene we
will be working with in this lab), comparing its sequence with that of the same gene from another
species of bacterium provides information about the evolutionary relationship between the two
bacterial species. The assumption here is that the number of positions that differ in the nucleotide
sequence is proportional to the time elapsed since the two species formed their own lines of
descent from a common predecessor.
However, not all DNA sequences change at a constant rate over time. For example, it is not at all
clear whether all organisms experience similar mutation rates from purely environmental factors
(from increased UV exposure, for example). If the DNA sequence has or has had at some point
in evolution a functional role, the rate of evolution and selection — which may be related to
population size among other things — can affect its rate of change. Moreover, in some cases,
mutations are caused by deletions, insertions, and substitutions of long sequences of DNA rather
than by single nucleotide changes. Finally, some sequences of DNA encode proteins with very
specific structural requirements, and any change may prove unfavorable to the organism. Such
sequences therefore do not tolerate change well and tend to remain the same for long periods of
time. These are referred to as “conserved” regions. In contrast, sequences that can
accommodate change more easily are referred to as “variable” regions.
Similarity score: Interpreting BLAST search results
(Reprinted from HHMI “Virtual Bacterial ID lab.”) [You should understand the basic distinction between
the BLAST Score and the E value]
Let's consider how one might go about assigning a numerical value to the degree of similarity
between two DNA sequences. Suppose we have two sequences as follows:
CGGCAT
CGCGAT
Let's assign one point for each base pair that matches exactly and 0 points for each base pair that
does not. We have C-C (match), G-G (match), G-C (no match), C-G (no match), A-A (match),
and T-T (match) for a total of 4 points. Under this hypothetical system, the more nucleotides that
match up, the higher the score.
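The hypothetical scoring system just described can be written as a few lines of Python. This is only a sketch of the handout's toy scheme (1 point per matching position, 0 otherwise); the function name `match_score` is invented for the example.

```python
def match_score(seq_a: str, seq_b: str) -> int:
    """Score two equal-length sequences: 1 point per matching position, 0 otherwise."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this simple scheme needs sequences of equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


print(match_score("CGGCAT", "CGCGAT"))  # 4
```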
When comparing two DNA sequences, it's important to remember that because of evolutionary
history, the sequences may have diverged not only by substitution of bases but also possibly by
deletions or insertions of bases. This means that the sequences that are being matched may not be
exactly the same length but might have gaps. In practical terms, for these two sequences, the best
match (for a total of 5 points) is:
CGGC_AT
CG_CGAT
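One way to see where the 5 points come from is to let a program try every placement of gaps. The sketch below assumes the same toy scoring as before (1 point per match, 0 for a mismatch) and, as a further simplification, charges nothing for a gap. Under those assumptions the best score is the length of the longest common subsequence, found by dynamic programming. Real alignment programs such as BLAST do penalize gaps, so this is an illustration rather than their actual method.

```python
def best_gapped_score(seq_a: str, seq_b: str) -> int:
    """Best match count when gaps may be inserted freely (match = 1, mismatch = 0, gap = 0)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # table[i][j] = best score for the first i bases of seq_a and first j bases of seq_b
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = table[i - 1][j - 1] + (1 if seq_a[i - 1] == seq_b[j - 1] else 0)
            gap_in_b = table[i - 1][j]  # base of seq_a aligned to a gap
            gap_in_a = table[i][j - 1]  # base of seq_b aligned to a gap
            table[i][j] = max(pair, gap_in_b, gap_in_a)
    return table[-1][-1]


print(best_gapped_score("CGGCAT", "CGCGAT"))  # 5
```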
From the simple example above, you can imagine how rapidly sequence comparisons can
become complicated as DNA length increases. The statistics for comparing two sequences of
DNA are thus highly complicated. Here we cover just the bare essence of the topic so that you
can interpret the response from your sequence query.
Suppose we submit the following 14-nucleotide sequence as a BLAST query:
TATCGCGTATTGCC
BLAST will come back with a result, starting with the reference of the search program, the
number of letters in your sequence, the number of letters in the database, a graphic representation
of the sequence matches, and a list of matches. The list of matches is sorted with the best
matching sequences shown first. For the sequence we used, the list starts with the following:
Sequences producing significant alignments:      Score (bits)   E Value
gb|AC012156.14|AC012 Homo sapiens chr 12..            28          5.8
ref|NC_001142.1 Saccharomyces cerevisiae...           28          5.8
What does this mean? “Score” is a numerical score assigned by BLAST. In the simple example
we used earlier, we simply assigned 1 point for matches and 0 points for non-matches. In
BLAST, the scoring system uses “bits” as the measure of information. For DNA, each position
can be occupied by either T, A, C, or G. Each match therefore contains 2 bits of information
(only 1 is correct out of 4 possible). For a 14-nucleotide-long sequence like ours, the maximum
match score then is 28 bits. The higher the score, the better the match.
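The arithmetic behind the 28-bit figure can be checked with a short sketch. It follows the simplified model in this handout (each matched position is worth log2 of the alphabet size, so 2 bits for DNA). Actual BLAST bit scores are derived from the scoring scheme and statistical parameters, so treat the function below, whose name is invented here, as an illustration only.

```python
import math


def max_bit_score(seq: str, alphabet_size: int = 4) -> float:
    """Maximum score under the simplified model: log2(alphabet_size) bits per matched base."""
    return len(seq) * math.log2(alphabet_size)


print(max_bit_score("TATCGCGTATTGCC"))  # 28.0
```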
“E-value” is the number of hits one can expect to see just by chance when searching a database
of a particular size. The value is defined as
$$E = \frac{N}{n} \cdot m \cdot n \cdot 2^{-S}$$
where m and n are the length of the two nucleotide sequences (measured in base pairs), S is the
bit score, and N refers to the total length of all sequences in the database. The formula should
make intuitive sense. For example, if S is higher (i.e., better matches), you would expect to see
fewer “hits.” On the other hand, if m or n are larger (i.e., one or the other sequence is longer),
then you would expect to see more hits purely by chance. Finally, if the database contains more
sequences (i.e., N is larger), then you would expect to see more hits. In any case, if BLAST
returns an E-value that is very small or close to zero, then you probably have a meaningful match
that is not due to random chance. To interpret the matches, you therefore need to pay attention to
whether the E-value is reasonably small. E-value is related to the P-value by the following
formula:
$$P = 1 - e^{-E}$$
So for a P-value of 0.95 (the statistically significant level), the E-value is around 3. Thus, in your
search, an E-value of 3 or less would be an acceptable match.
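The two formulas can be tried out numerically with the sketch below. It transcribes them as they are written in this handout (E as N/n times m times n times 2 to the power of minus S, and P as 1 minus e to the power of minus E). The function and parameter names are chosen for the example, and the input numbers in the first call are arbitrary values picked to make the arithmetic easy to follow.

```python
import math


def expected_hits(m: int, n: int, total_db_length: int, bit_score: float) -> float:
    """E-value as written in this handout: (N / n) * m * n * 2**(-S)."""
    return (total_db_length / n) * m * n * 2 ** (-bit_score)


def p_value(e_value: float) -> float:
    """P-value from an E-value: P = 1 - e**(-E)."""
    return 1 - math.exp(-e_value)


# Arbitrary illustrative numbers: (40 / 20) * 10 * 20 * 2**-3 = 50.0
print(expected_hits(m=10, n=20, total_db_length=40, bit_score=3))

# An E-value of 3 corresponds to a P-value of roughly 0.95.
print(round(p_value(3), 3))  # 0.95
```

Raising the bit score by one halves E, while doubling the database length doubles it, which matches the intuition given in the text.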
You should also keep in mind that there are a lot of sequences in the database and that some of
them are from the same species and therefore might be very similar. In some cases, the name of
the organism may have changed after it was originally reported; accordingly, two or more
sequences may match extremely well but appear to belong to completely different species.
• Because we wish to look for nucleotide sequences homologous to the 16S rDNA, you
will use the blastn program and the nr (non-redundant) database. You may leave all of
the other default settings. Paste your DNA sequence into the box provided, then click on
BLAST to submit your query.
3. Interpret the BLAST results and select the most likely identity of your unknown. After the
server computer conducts your analysis, the results are presented three ways: as a graphic, a
table of "hits" (identified similarities), and a series of sequence alignments.
A. The graphic has lines showing the positions and ranges of identity/similarity between your
sequence (the query) and other possible sequences in the database. The location and length
of each line indicate the extent of similarity; the closeness of the match is also shown by
the line's color.
B. Under the text “Sequences producing significant alignments” is the table of “hits.” Each
database entry similar to the query sequence is presented, beginning at the top with the
closest match and ending at the bottom with the weakest. Clicking on the code on the left
of each line (e.g., emb|V00296|ECLACZ) links you to the GenBank entry for the sequence.
Clicking on the number (the score) at the right end of the line will jump you downward
within the file to the sequence alignment.
C. Each matched sequence is presented as a separate alignment with the query sequence. Only
the similar/identical regions of each molecule's sequence are presented here. The numbers
after the words “Query” and “Sbjct” indicate the position within each database entry to which
the nucleotides on that line correspond. This display is where one can analyze in detail the
nucleotide differences between the query and its homologue. You do NOT need to print out
your BLAST search. You need to understand what the results mean and what is the most
likely match and why. Usually the best match is the result at the top of the list with the
highest score and lowest E value.
4. Find the GenBank file record for your bacterium and learn how to read a GenBank file. On the
left column of your BLAST search, click on the identification number of the bacterium you
think best matches your 16S rDNA sample. You should now be linked to the GenBank record
of your chosen bacterium. Look at the GenBank record of your chosen bacterium. Acquaint
yourself with the parts of the GenBank database record for your nucleotide sequence and be
able to identify the information in your record that is bolded below.
What types of information are contained in the following parts of the record?
• Locus
• Definition
• Keywords
• Accession and NID
• Source
• Organism
• Reference(s) --earliest record
• Medline
• Comments
• Features
• CDS Why is there no coding sequence for the 16S rRNA?
• /translation Why is your sequence NOT translated?
• /db_xref
• mutation
• variation
• exon Why do you NOT expect to find either exons or introns in your sequence?
• intron
• precursor_RNA
• mRNA
• Base Count
• Sequence
Background
You have seen that the 16S rDNA region is sufficiently different from one bacterium to
another to use the differences as a means of identification. However, how different/similar are
the regions, and are some bacterial 16S rDNA sequences more similar than others? You might
have predicted that genes coding for molecules with such a vital function as being part of the
ribosome would not have such variability between species. Start to generate questions about
what parts of the 16S rDNA have variable and conserved regions and why this might be so; look
in your textbook for the structure of 16S rRNA and see if you can predict variable and conserved
regions on the DNA. (Hint: look for regions that are double-stranded and single-stranded;
where are the active sites, etc.)
Understanding Phylogenetics
Our system of taxonomy is based on phylogeny. That is, we classify organisms together because
they have a common evolutionary ancestor. In most cases, we cannot determine ancestry directly
because the fossil record is poor for most organisms. Instead we rely on shared, homologous
features, and we say that organisms that share many features are closely related. Organisms that
share many features probably had a relatively recent common ancestor. Molecules, including
DNA, can also reveal homology, since these are passed down from one generation to the next.
For example, chimpanzees and humans share about 98% of their DNA because the common
ancestor of chimps and humans lived only about 6 million years ago. In 6 million years, there has
not been enough time for very much divergence to take place. However, the DNA of humans and
yeast is more dissimilar because humans and yeast shared an early eukaryotic ancestor no more
recently than about 1.2 billion years ago.
One type of phylogenetic tree, known as a rooted tree, contains a root, nodes, branches and
clades (Figure 1). A phylogenetic tree is composed of nodes, each representing a taxonomic unit
(species, populations, individuals), and branches, which define the relationship between the
taxonomic units in terms of descent and ancestry. Only one branch can connect any two adjacent
nodes. The branching pattern of the tree is called the topology, and the branch length usually
represents the number of changes that have occurred in the branch. This is called a scaled
branch. Scaled trees are often calibrated to represent the passage of time. Such trees have a
theoretical basis in the particular gene or genes under analysis. Branches can also be unscaled,
which means that the branch length is not proportional to the number of changes that have
occurred, although the actual number may be indicated numerically somewhere on the branch.
Groups sharing a node share a common ancestor and make up a clade. Phylogenetic trees may
also be either rooted or unrooted. In rooted trees, there is a particular node, called the root,
representing a common ancestor, from which a unique path leads to any other node. An unrooted
tree only specifies the relationship among species, without identifying a common ancestor, or
evolutionary path.
In Figure 1, humans, mice and flies are all animals, so they share ancestral traits: all are
eukaryotic, multicellular organisms that consume food. But humans and mice share some traits
that are not shared by flies. These shared traits (for example, hair and milk production) are
known as derived traits and determine the specific clade – mammals – of the organisms that
share them.
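A rooted tree like this one can be represented very simply in code. The sketch below uses nested tuples, where a string is a tip (a taxon) and a tuple is an internal node. The topology is assumed from the description of Figure 1 in the text (humans and mice grouped together, with flies branching off earlier), and the function names are invented for the example.

```python
def leaves(node):
    """Return the set of tip names below a node."""
    if isinstance(node, str):  # a tip of the tree
        return {node}
    tips = set()
    for child in node:
        tips |= leaves(child)
    return tips


def clades(node):
    """List the tip sets of every internal node, starting with the root."""
    if isinstance(node, str):
        return []
    found = [leaves(node)]
    for child in node:
        found.extend(clades(child))
    return found


# Assumed topology for Figure 1: ((human, mouse), fly)
tree = (("human", "mouse"), "fly")
for clade in clades(tree):
    print(sorted(clade))
```

The first set printed is the whole tree (everything descended from the root); the second is the clade containing only humans and mice.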
Text and figures adapted with permission from A. Vierstraete, University of Ghent, Belgium.
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Molecular phylogenetics
Rapid technological advances in molecular biology have allowed scientists to obtain DNA
sequences of genes and entire genomes of a large variety of organisms. Molecular phylogenetics
is the field that examines evolutionary relationships among groups specifically based on changes
occurring in DNA and protein structure. The number of differences in DNA sequences between
different groups reflects the accumulation of mutations over the time since they shared a
common ancestor.
16S/18S rRNA
The modern approach to phylogeny relies on molecular studies and sequence comparisons of
genes and proteins. One of the most extensively used methods to develop the phylogeny of both
prokaryotes and eukaryotes was pioneered in the early 1970s by Carl Woese. This so-called
"SSU sequencing" or "16S/18S sequencing" is based on the 16S (prokaryotes) and 18S
(eukaryotes) ribosomal RNA (rRNA) genes.
All living organisms contain the small (16S or 18S) and the large (23S or 28S) subunit rRNA.
Since these subunits are essential for protein synthesis, they all have the same function and must
have been developed in the early stages of life. Mutations in these genes can affect directly the
ordinary functioning of the ribosome and thus, only minor changes in these genes are allowed.
Otherwise, the ribosome can lose its function resulting in the elimination of mutated organisms.
Since 16S rRNA is rather sensitive to mutations, the corresponding gene contains a large
number of highly conserved regions. Other regions do not affect the ribosome's function,
and mutations there can accumulate over evolutionary time.
1. Align similar DNA sequences from different groups to detect similarities and differences
in nucleotide bases.
2. Establish sequence variation by observing the level of homology or similarity of
sequences among groups.
3. Build a tree by arranging groups based on the percentage of matching bases for
sequences and other factors.
4. Evaluate the tree, including the analysis of the resulting tree and comparison with trees
constructed with non-molecular data.
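Steps 2 and 3 above can be sketched in a few lines. The example below takes sequences that are already aligned, computes the percentage of matching bases for every pair, and reports the most similar pair, which is where a simple distance-based tree would start grouping. The sequences and names are hypothetical, and this is not the algorithm used by ClustalW or any other specific program.

```python
from itertools import combinations


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of matching bases over aligned positions where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for a, b in compared if a == b)
    return 100.0 * matches / len(compared)


def closest_pair(aligned: dict) -> tuple:
    """Return the pair of names whose sequences have the highest percent identity."""
    return max(
        combinations(sorted(aligned), 2),
        key=lambda pair: percent_identity(aligned[pair[0]], aligned[pair[1]]),
    )


# Hypothetical aligned sequences, far shorter than a real 16S rRNA gene.
aligned = {
    "bacterium1": "ACGTACGTAC",
    "bacterium2": "ACGTACGTAT",
    "bacterium3": "ACGAACCTTT",
}
for name_a, name_b in combinations(sorted(aligned), 2):
    print(name_a, name_b, percent_identity(aligned[name_a], aligned[name_b]))
print(closest_pair(aligned))
```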
Using DNA sequences in phylogenetics can generate very large data sets. To cope with such
huge amounts of data, scientists use the tools of bioinformatics to construct molecular
phylogenies. There are many computer programs that use different algorithms for analyzing
these large data sets. In this exercise, you will use a multiple sequence alignment program,
ClustalW. Remember that the more similar two sequences are, the more closely related they are
(the more recent their shared common ancestor). The more nucleotide changes that have occurred, the
more time has passed since the two groups shared a common ancestor, and therefore the more
distantly related they are.
Procedure
1. For this comparison you will compare the DNA from all 5 unknown bacteria. Select and copy
the 5 sequences posted in the lab folder on Canvas.
2. To use the CLUSTALW software for multiple sequence alignment, the DNA sequences must
be in a specific format:
>bacterium1 [Each new sequence must start with a ">" and have no spaces in the title]
ATGCTTAAA….. [DNA sequence starts on a new line]
>bacterium2
CGGTAAACT
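If you want to check your pasted text before submitting it, the format rules above can be expressed as a small parser. This is a sketch under the rules stated here (a title line starting with ">" and containing no spaces, followed by the sequence on new lines). The function name is invented, and the two short sequences are placeholders rather than real data.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-style text into {title: sequence}, enforcing the rules described above."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]
            if not name or " " in name:
                raise ValueError(f"title must be non-empty and contain no spaces: {line!r}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' title line")
        else:
            records[name] += line.upper()  # a sequence may span several lines
    return records


# Placeholder sequences for illustration only.
example = """>bacterium1
ATGCTTAAA
>bacterium2
CGGTAAACT
"""
print(parse_fasta(example))
```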
You do NOT need to print out your alignment results. You need to know how to read the
results, e.g., what portions of the sequences are identical and where they differ.
NAME:
ID #: CRN:
To be completed during the in-person lab session and submitted by the end of the
lab session. A paper copy is submitted to the TA before/by the end of the in-person lab
session.
Note: A Lab Assignment Word document is posted on Canvas. Print it and bring it to the lab session
to complete and submit to your TA.
Biological Problem #1: Imagine that you are working in a pathology lab and need to
identify the bacterial species contained in a sample from a very sick patient. Once you know
what species they are infected with, the doctor will be able to recommend the appropriate
antibiotic. You have purified the bacteria from the patient’s samples and extracted bacterial
DNA from a single colony. You then performed PCR using primers that anneal to the region
containing the 16S rRNA gene. You have sequenced the PCR product and you are now ready to
identify the bacterium. As you will recall, you performed a similar exercise in your pre-lab
assignment.
Step 1: You will use an unknown bacterial 16S rRNA gene sequence provided to you by your
TA during your lab session.
Step 4: In the pull-down database window called Choose Search Set, select nucleotide
collection (nr/nt) since you want to compare your nucleotide sequence with all other nucleotide
sequences in the database.
Step 5: Hit the “BLAST” button and wait for your results.
a) Which bacterial species is the patient most likely infected with? ___________________
_______________________________________________________________________
_______________________________________________________________________
c) Which sequence is the query sequence and which one is the subject sequence? ________
________________________________________________________________________
Step 6: Click on the hyperlink associated with your best Blast match to get to the GenBank
record.
f) Why is there no CDS (sequence coding for amino acids in protein) associated with this
record? _______________________________________________________________
_______________________________________________________________________
g) If you were to perform the same blastn analysis with your bacterial sequence a year from
now using a public database where sequences are constantly being added, would you
expect to obtain the same Score? E-value? Explain.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Biological Problem #2: You and your colleagues in the pathology lab have
sequenced the 16S rRNA gene for a total of 5 bacterial species. You are curious to see
how similar your 16S rRNA sequence (unknown provided by TA in Biological Problem #1) is to
the other 4 sequences. You have heard that multiple sequence alignments may provide some
information about this and would like to give it a try.
Step 1: All of the 16S rRNA gene sequences you need will be provided to you by your TA at
the beginning of your lab (the file provided by the TA includes the 1 unknown used in Biological
Problem #1 and 4 other bacterial sequences). Copy all of the sequences.
Paste your bacterial sequences into the window. In the pull-down menu, select DNA, since
that is what you are aligning.
Step 3: Click the Submit button and wait for your results to appear.
a) What do you think it indicates when the nucleotides have a star? What do you think it
means when they have a white space and no star? ___________________________
______________________________________________________________________
______________________________________________________________________
b) Would you estimate that all 5 sequences are quite similar (>90%) or not? ___________
______________________________________________________________________
c) Which bacterial sequence appears to differ the most from the others? Click on “View
Tree” at the top of the output page for a different visual representation. (Note that trees
generated by ClustalW represent sequence similarity and are not necessarily intended to
be interpreted as trees indicating evolutionary descent.)
_______________________________________________________________________
_______________________________________________________________________
d) Does your bacterial sequence cluster with any other sequence? Briefly describe the
relationship between your sequence and those of your colleagues.
________________________________________________________________________
________________________________________________________________________
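To make the alignment output more concrete, here is a minimal sketch of how a conservation line and a percent identity could be computed from sequences that are already aligned. The three short sequences are invented toy data, not the sequences from your TA, and the function names are hypothetical. This is not ClustalW's own code, only an illustration of the idea.

```python
def conservation_line(aligned):
    """Return '*' where every sequence has the same nucleotide, ' ' otherwise."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


def percent_identity(a, b):
    """Percent of aligned columns that match, skipping columns where both are gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Toy alignment (hypothetical); '-' marks a gap introduced by the alignment.
toy = ["ACGTACGT-A",
       "ACGTTCGT-A",
       "ACGAACGTGA"]

for seq in toy:
    print(seq)
print(conservation_line(toy))
print(f"seq1 vs seq2: {percent_identity(toy[0], toy[1]):.1f}% identical")
print(f"seq1 vs seq3: {percent_identity(toy[0], toy[2]):.1f}% identical")
```

Comparing the pairwise values for all of your sequences is one way to support your answers to questions b) and c).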
Use your knowledge and laboratory experience to develop a morphological phylogenetic tree
that depicts the evolutionary relationships for the organisms listed above. This tree will serve as
your hypothesis, which you will test using molecular data for the rbcL nucleotide sequences. Use
the file “Biological Problem 3 pictures” to help you with the morphological phylogenetic tree.
Step 1: Examine the organisms selected and determine which are charophytes, bryophytes,
pterophytes, gymnosperms or angiosperms.
a) What are the ancestral and derived characteristics of the major phyla of plants? Develop
a hypothesis for which organisms will be more closely related to each other (compare
the morphological and life cycle characteristics of land plants). Arrange the organisms
most closely related to each other into clades, and indicate which clades might share a
common ancestor. Draw your morphological phylogenetic tree in the space below.
Step 2: You will be testing your hypothesis that the morphological tree accurately represents
land plant phylogeny. You will use molecular data from the nucleotide sequences of rbcL.
b) Do you think the molecular evidence will support (be consistent with) your hypothesis or
falsify it? Write your predictions below.
________________________________________________________________________
________________________________________________________________________
Step 3: Using ClustalW, create a phylogenetic tree of the 9 plant species mentioned above
based on the rbcL gene. Use the sequences posted in the lab folder on Canvas in the file named
“Biological Problem 3 rbcL 9 sequences”.
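If you are curious how a tree can be derived from aligned sequences, the sketch below shows one simple distance-based approach. It is not ClustalW's method: ClustalW builds its trees with neighbor joining, whereas this example uses the simpler UPGMA clustering for readability. The four labels and the 10-nucleotide sequences are invented placeholders, not real rbcL data.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of compared (non-gap) aligned positions that differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned):
    """Cluster aligned sequences by average distance; return a Newick string."""
    dist = {frozenset(pair): p_distance(aligned[pair[0]], aligned[pair[1]])
            for pair in combinations(aligned, 2)}

    def cluster_distance(members1, members2):
        # UPGMA: mean of all pairwise distances between the two clusters.
        total = sum(dist[frozenset((x, y))] for x in members1 for y in members2)
        return total / (len(members1) * len(members2))

    # Map each cluster's Newick text to the names of the sequences it contains.
    clusters = {name: (name,) for name in aligned}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(clusters[pair[0]],
                                                     clusters[pair[1]]))
        members = clusters.pop(a) + clusters.pop(b)
        clusters[f"({a},{b})"] = members
    return next(iter(clusters)) + ";"


# Hypothetical toy alignment; names and sequences are placeholders.
toy = {
    "moss": "ATGTCACCAC",
    "fern": "ATGTCACCAA",
    "pine": "ATGTCTCCTA",
    "oak":  "ATGTCTCGTG",
}

tree = upgma(toy)
print(tree)
```

The printed text is in Newick format, a common plain-text way to write trees: each pair of parentheses groups the sequences that were joined together.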
b) The two phylogenetic trees are supported by different types of evidence. What evidence
was used to create the phylogenetic tree using bioinformatics in ClustalW?
________________________________________________________________________
________________________________________________________________________
d) What is rbcL, and why is it a particularly useful molecule for studying evolutionary
relationships in plants and green algae? ________________________________________
________________________________________________________________________
e) What are the limitations of using rbcL to construct a phylogenetic tree? ______________
________________________________________________________________________
1. Can you suggest reasons why a phylogeny based on molecular evidence and a phylogeny
based on morphology and other evidence might not be exactly the same?
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
Escherichia coli strain ICMP 15663 ATP synthase beta subunit (atpD) gene, partial cds
LOCUS DQ859781 1264 bp DNA linear BCT 26-JUL-2011
DEFINITION Escherichia coli strain ICMP 15663 ATP synthase beta subunit (atpD)
gene, partial cds.
[Note: Just as every item in a museum gets an Accession Number, so does every submission to GenBank.]
ACCESSION DQ859781
VERSION DQ859781.1 GI:112791345
KEYWORDS .
SOURCE Escherichia coli DSM 30083
[Note: The taxonomic classification of the organism.]
ORGANISM Escherichia coli DSM 30083
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Escherichia.
[Note: This sequence was published in a journal article.]
REFERENCE 1 (bases 1 to 1264)
AUTHORS Young,J.M. and Park,D.C.
TITLE Relationships of plant pathogenic enterobacteria based on partial
atpD, carA, and recA as individual and concatenated nucleotide
and peptide sequences
JOURNAL Syst. Appl. Microbiol. 30 (5), 343-354 (2007)
PUBMED 17451899
[Note: And it was submitted separately to GenBank.]
REFERENCE 2 (bases 1 to 1264)
AUTHORS Park,D.
TITLE Direct Submission
JOURNAL Submitted (19-JUL-2006) Landcare Research, Private Bag 92-170,
Auckland 1072, New Zealand
[Note: If you are unsure of what the sequence represents, look here.]
FEATURES Location/Qualifiers
source 1..1264
/organism="Escherichia coli DSM 30083"
/mol_type="genomic DNA"
/strain="ICMP 15663; DSM 30083"
/db_xref="taxon:866789"
/note="type strain of Escherichia coli"
[Note: If the gene has a designated name, it will appear here.]
gene <1..>1264
/gene="atpD"
[Note: The section of the sequence that does the protein coding.]
CDS <1..>1264
/gene="atpD"
/codon_start=1
/transl_table=11
/product="ATP synthase beta subunit"
/protein_id="ABI21943.1"
/db_xref="GI:112791346"
[Note: The amino acid residues the sequence codes for.]
/translation="VYDALEVQNGNERLVLEVQQQLGGGIVRTIAMGSSDGLRRGLDV
KDLEHPIEVPVGKATLGRIMNVLGEPVDMKGEIGEEERWAIHRAAPSYEELSNSQELL
ETGIKVIDLMCPFAKGGKVGLFGGAGVGKTVNMMELIRNIAIEHSGYSVFAGVGERTR
EGNDFYHEMTDSNVIDKVSLVYGQMNEPPGNRLRVALTGLTMAEKFRDEGRDVLLFVD
NIYRYTLAGTEVSALLGRMPSAVGYQPTLAEEMGVLQERITSTKTGSITSVQAVYVPA
DDLTDPSPATTFAHLDATVVLSRQIASLGIYPAVDPLDSTSRQLDPLVVGQEHYDTAR
GVQSILQRYQELKDIIAILGMDELSEEDKLVVARARKIQRFLSQPFFVAEVFTGSPGK
YVSLKDTIRGFKGIMEGEYDHLPEQAFYM"
ORIGIN
1 gtgtacgatg ctcttgaggt gcaaaatggt aatgagcgtc tggtgctgga agttcagcag
61 cagctcggcg gcggtatcgt gcgtaccatc gcaatgggtt cctccgacgg tctgcgtcgc
121 ggtctggatg taaaagacct cgaacacccg atcgaagtcc cggtaggtaa agcgactctg
181 ggccgtatca tgaacgtact gggtgaaccg gtcgacatga aaggcgagat cggtgaagaa
241 gagcgttggg cgattcaccg agcagcacct tcctacgaag agctgtcaaa ctctcaggaa
301 ctgctggaaa ccggtatcaa agttatcgac ctgatgtgtc cgttcgctaa gggcggtaaa
361 gttggtctgt tcggtggtgc gggtgtaggt aaaaccgtaa acatgatgga gcttattcgt
421 aacatcgcga tcgagcactc cggttactct gtgtttgcgg gcgtaggtga acgtactcgt
481 gagggtaacg acttctacca cgaaatgacc gactccaacg ttatcgacaa agtatccctg
541 gtgtatggcc agatgaacga gccgccggga aaccgtctgc gcgttgctct gaccggtctg
601 accatggctg agaaattccg tgacgaaggt cgtgacgttc tgctgttcgt tgacaacatc
661 tatcgttaca ccctggccgg tacggaagta tccgcactgc tgggccgtat gccttcagcg
721 gtaggttatc agccgaccct ggcggaagag atgggcgttc tgcaggaacg tatcacctcc
781 accaaaaccg gttctatcac ctccgtacag gcagtatacg tacctgcgga tgacttgact
841 gacccgtctc cggcaaccac ctttgcgcac cttgacgcaa ccgtggtact gagccgtcag
901 atcgcgtctc tgggtatcta cccggccgtt gacccgctgg actccaccag ccgtcagctg
961 gacccgctgg tggttggtca ggaacactac gacactgcgc gtggcgttca gtccatcctg
1021 caacgttatc aggaactgaa agacattatc gccatcctgg gtatggatga actgtctgaa
1081 gaagacaaac tggtggtagc gcgtgctcgt aagatccagc gcttcctgtc ccagccgttc
1141 ttcgtggcag aagtattcac cggttctccg ggtaaatacg tctccctgaa agacaccatc
1201 cgtggcttta aaggcatcat ggaaggcgaa tacgatcacc tgccggagca ggcgttctac
1261 atgg
//
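To see how the /translation qualifier relates to the nucleotides under ORIGIN, you can translate the first line of the sequence yourself. The sketch below is a minimal illustration with a hypothetical function name. It uses the standard codon assignments, which /transl_table=11 (the bacterial table) shares for ordinary codons, and it starts at the first base as /codon_start=1 indicates.

```python
BASES = "TCAG"
# Amino acids for all 64 codons in the conventional TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(origin_text, codon_start=1):
    """Translate ORIGIN-style text (position numbers and spaces are ignored)."""
    seq = "".join(ch for ch in origin_text.upper() if ch.isalpha())
    seq = seq[codon_start - 1:]
    # Unknown codons (for example, ones containing N) are shown as 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


# First ORIGIN line of the sample record above.
first_line = "1 gtgtacgatg ctcttgaggt gcaaaatggt aatgagcgtc tggtgctgga agttcagcag"
protein_start = translate(first_line)
print(protein_start)
```

Compare the printed residues with the beginning of the /translation qualifier in the record.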
LOCUS - A short mnemonic name for the entry, chosen to suggest the sequence's definition.
DEFINITION - A concise description of the sequence.
ACCESSION - The primary accession number is a unique, unchanging code assigned to each
entry. (Please use this code when citing information from GenBank.)
NID - The unique nucleic acid identifier that has been assigned to the current version of the
sequence data that are associated with the GenBank entry identified by a given primary
accession number.
KEYWORDS - Short phrases describing gene products and other information about an entry.
SEGMENT - Information on the order in which this entry appears in a series of discontinuous
sequences from the same molecule.
SOURCE - Common name of the organism or the name most frequently used in the literature.
ORGANISM - Formal scientific name of the organism (first line) and taxonomic classification
levels (second and subsequent lines).
REFERENCE - Citations for all articles containing data reported in this entry. Includes four
subkeywords and may repeat.
AUTHORS - Lists the authors of the citation.
TITLE - Full title of citation. Optional subkeyword (present in all but unpublished
citations); one or more records.
JOURNAL - Lists the journal name, volume, year, and page numbers of the citation.
MEDLINE - Provides the Medline unique identifier for a citation.
REMARK - Specifies the relevance of a citation to an entry.
COMMENT - Cross-references to other sequence entries, comparisons to other collections,
notes of changes in LOCUS names, and other remarks.
FEATURES - Table containing information on portions of the sequence that code for proteins
and RNA molecules and information on experimentally determined sites of biological
significance.
BASE COUNT - Summary of the number of occurrences of each base code in the sequence.
ORIGIN - Specification of how the first base of the reported sequence is operationally located
within the genome. Where possible, this includes its location within a larger genetic map.
The ORIGIN line is followed by sequence data (multiple records).
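The keywords above can also be read by a program. The sketch below is a minimal illustration, not an official parser. It assumes the standard GenBank flat-file layout, in which a keyword occupies the first 12 columns of a line and continuation lines leave those columns blank. The function name is hypothetical, and the sample text is abbreviated from the record in this appendix.

```python
SAMPLE = """\
LOCUS       DQ859781                1264 bp    DNA     linear   BCT 26-JUL-2011
DEFINITION  Escherichia coli strain ICMP 15663 ATP synthase beta subunit (atpD)
            gene, partial cds.
ACCESSION   DQ859781
VERSION     DQ859781.1  GI:112791345
SOURCE      Escherichia coli DSM 30083
  ORGANISM  Escherichia coli DSM 30083
FEATURES             Location/Qualifiers
"""


def parse_header(record_text):
    """Collect header keywords and their values, stopping at FEATURES or ORIGIN."""
    fields = {}
    current = None
    for line in record_text.splitlines():
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            # A new keyword (or a repeat, such as a second REFERENCE).
            current = keyword
            fields.setdefault(current, []).append(value)
        elif current is not None and value:
            # A continuation line belongs to the most recent keyword.
            fields[current][-1] += " " + value
    return fields


fields = parse_header(SAMPLE)
for keyword, values in fields.items():
    print(keyword, "->", values)
```

Each keyword maps to a list because some keywords, such as REFERENCE, may appear more than once in a record.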
exon Region that codes for part of spliced mRNA (in eukaryotes)
iDNA Intervening DNA eliminated by recombination
intron Transcribed region excised by mRNA splicing (in eukaryotes)
LTR Long terminal repeat
mat_peptide Mature peptide coding region (does not include stop codon)
misc_binding Miscellaneous binding site
misc_difference Miscellaneous difference feature
misc_recomb Miscellaneous recombination feature
misc_RNA Miscellaneous transcript feature not defined by other RNA keys
misc_signal Miscellaneous signal
misc_structure Miscellaneous DNA or RNA structure
modified_base The indicated base is a modified nucleotide
mRNA Messenger RNA
mutation A mutation alters the sequence here
old_sequence Presented sequence revises a previous version
precursor_RNA Any RNA species that is not yet the mature RNA product
primer Primer binding region used with PCR
promoter A region involved in transcription initiation
protein_bind Non-covalent protein binding site on DNA or RNA
RBS Ribosome binding site
rep_origin Replication origin for duplex DNA
repeat_region Sequence containing repeated subsequences
repeat_unit One repeated unit of a repeat_region
rRNA Ribosomal RNA
STS Sequence Tagged Site; operationally unique sequence that identifies the combination of
primer spans used in a PCR assay
tRNA Transfer RNA
unsure Authors are unsure about the sequence in this region
variation A related population contains stable mutation
-10_signal `Pribnow box' in prokaryotic promoters
-35_signal `-35 box' in prokaryotic promoters
3'UTR 3' untranslated region (trailer)
5'UTR 5' untranslated region (leader)
2. Know basic information about the application of the search and analysis tools used in this lab:
ENTREZ
• What is Entrez and for what type of information would you use ENTREZ?
• Where is it and how do you access Entrez? Although Entrez is not directly used for this
lab, it is a very useful tool to know about.
• Suggest reasons why some regions of the 16S rDNA are highly conserved while others
are not.
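Besides the NCBI website, Entrez records can be requested programmatically through NCBI's E-utilities. The sketch below only builds the request address for the sample record in Appendix A and does not contact the server. The function name is hypothetical, and NCBI's usage guidelines should be checked before sending many requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession):
    """Build an EFetch address that returns a nucleotide record as GenBank text."""
    params = {
        "db": "nuccore",      # the Entrez nucleotide database
        "id": accession,      # accession number of the record
        "rettype": "gb",      # GenBank flat-file format
        "retmode": "text",    # plain text rather than XML
    }
    return EFETCH + "?" + urlencode(params)


url = genbank_url("DQ859781")
print(url)

# To actually download the record (requires an internet connection):
# from urllib.request import urlopen
# with urlopen(url) as response:
#     record_text = response.read().decode()
```

Pasting the printed address into a web browser should display the same kind of record as the sample in Appendix A.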
Given a sample GenBank record, be able to find the following information (see the sample in
Appendix A):
1. What is the specific sequence? ("Definition")
2. Know what the permanent identifying code is for this sequence (accession number) and
distinguish this from the unique nucleic acid identifier used for the current version of
the sequence data (NID).
3. From what organism was this sequence isolated? (Source)
4. What was the first reference describing this sequence?
5. For this first reference, determine when and by whom it was written, and whether it was published.
6. What is the total number of A, C, G, and T nucleotides (Base Count) for this sequence? Look
for anomalies (e.g., is the number of each base similar, or is there a predominance
of G-C or A-T?).
7. What information can you get out of the Features section? For example, CDS (coding
sequence) indicates a region that is transcribed into mRNA and translated into amino
acids. Why is there no CDS indicated for 16S rDNA?
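For item 6, a base count can be checked with a few lines of code. The sketch below is a minimal illustration with hypothetical function names. It counts the bases in the first ORIGIN line of the sample record in Appendix A and reports the fraction of G and C.

```python
from collections import Counter


def base_count(origin_text):
    """Count each base in ORIGIN-style text, ignoring position numbers and spaces."""
    seq = "".join(ch for ch in origin_text.lower() if ch.isalpha())
    return Counter(seq)


def gc_fraction(counts):
    """Fraction of A, C, G, and T bases that are G or C."""
    total = sum(counts[base] for base in "acgt")
    return (counts["g"] + counts["c"]) / total


# First ORIGIN line of the sample record in Appendix A.
first_line = "1 gtgtacgatg ctcttgaggt gcaaaatggt aatgagcgtc tggtgctgga agttcagcag"
counts = base_count(first_line)

for base in "acgt":
    print(base, counts[base])
print(f"GC fraction: {gc_fraction(counts):.2f}")
```

Running the same function on the full sequence of a record gives the values to compare with its Base Count line, when the record has one.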