Lab 1 - Introduction and Protocol

BIOL 1020 Lab 1 1
BIOL1020 - LAB 1
INTRODUCTION TO BIOINFORMATICS
This is an in-person lab. The protocol below goes over the steps needed to complete the pre-lab
and to collect data for your lab assignment.
You need to complete the PRE-LAB ASSIGNMENT and send an electronic copy through the
assignment submission on Canvas BEFORE the start of your lab. Submissions after the start of
the lab will not be accepted and will receive a zero for the preparation of the lab! It is
recommended that you work on the pre-lab well before the due date and submit it at least
1 hour before the start of the lab, so that you have time to deal with any technical issues. Do
not leave it until the last minute. Late submissions of the pre-lab assignment caused by
technical difficulties will not be addressed.
Your LAB ASSIGNMENT is to be completed during the lab session. A printed copy must be
submitted to the lab TA before the end of your scheduled lab session. Late submissions will not
be accepted. It is recommended that you prepare a draft of your lab assignment (completing the
components that you can in advance) well before your in-person lab session, then use the
session to complete or modify your lab assignment with the help of your TA and lab partner.
For the PRE-LAB ASSIGNMENT you will need to upload a pdf, doc or docx file as an
assignment submission in Canvas, following the steps below. Make sure that you submit the
correct assignment to the correct assignment folder.
1. Open Assignments. In Course Navigation, click the Assignments link.

2. Select Assignment. Click the title of the assignment.
3. Submit Assignment. Click the Submit Assignment button.
4. Add File. ...
5. Add Another File. ...
6. View Submission.
Note: After your submission, always double check the submitted file to ensure that it is the
correct file, and it can be opened by the TA for grading. Technical issues in uploading
discovered after the due date will not be addressed.
Introduction
Bioinformatics is the application of computer technology to the management of biological
information and is used to address biological problems. Bioinformatics is essential in examining
how raw sequence data from genome sequencing projects can be used to generate information
about gene function, protein structure, molecular evolution, drug targets and disease
mechanisms. This emerging field requires individuals with multi-disciplinary background in
biology and computer science (adapted from A. Cordon and D. Messersmith).
Bioinformatics is a discipline combining mathematics and biology. Bioinformatics technology
includes the computational tools and databases that support genomic and related research, which
encompasses the study of DNA structure and function, gene expression, and protein production.
Bioinformatics technology enables the extraction of information that can be used in basic
molecular research as well as commercial applications such as drug discovery, clinical
diagnostics, and agricultural biotechnology.
Model organisms offer a cost-effective way to follow the inheritance of genes through many
generations in a relatively short time among other applications. One of the first and most
important problems encountered in these genome projects was how to acquire, store, and analyze
massive amounts of DNA sequence information. GenBank, a major public repository of DNA
sequence data, has grown to include roughly 4.86 million individual sequence records
representing about 3.86 billion base pairs (as of early 2000) as compared to 0.56 million records
in 1995! GenBank contains the full and partial genome sequences of over 670 different
organisms, including 27 complete genomes (Reed 2000).
Where did all of these data come from? From local and international academic and government
research groups as well as commercial, privately owned companies. Almost all private
companies conducting genomic research, such as Celera Genomics, Incyte, Human Genome
Sciences, Millennium Pharmaceuticals, have sequenced stretches of human and other organisms'
DNA. Some of this privately-generated sequence data has been submitted to public databases
like GenBank, while some data remain proprietary.
For good references, see:

Howard, K. 2000 (July). The Bioinformatics Gold Rush. Scientific American.
http://www.sciam.com/
Lim, H. 2000 (April). Bioinformatics in the Pre- and Post-Genomic Eras. Trends in
Biotechnology (TIBTECH) 18:133-135.
Reed, J. 2000 (March). Trends in Commercial Bioinformatics. http://www.oscargruss.com/reports.htm
Nucleotide Sequence Databases

GenBank is the National Institutes of Health's (NIH) genetic sequence database, an annotated
collection of all publicly available nucleotide and protein sequences. The unit records represent
single contiguous (gap free) stretches of DNA or RNA with additional comments. Presently, all
records in GenBank are generated from direct submissions to the DNA sequence databases from
the original scientists, who volunteer their records as part of the publication process. Most of the
input data are DNA sequences from which protein or RNA is inferred.
GenBank, built by the National Center for Biotechnology Information (NCBI) at NIH in
Bethesda, Maryland, is part of the International Nucleotide Sequence Database Collaboration,
along with its two partners, the DNA Database of Japan (DDBJ, Mishima, Japan) and the
European Molecular Biology Laboratory (EMBL) nucleotide database from the European
Bioinformatics Institute (EBI, Hinxton, England). All three centers are separate points of data
submission, but all of them exchange updated information daily.
In the early 1980s, there was no common format or electronic submission of data, which
slowed the collation of information tremendously. In 1988, the three groups (American, Japanese
and European) met and agreed to use a common format for data elements, whereby each database
updates only the records that were submitted to it. This means that each record is owned by the
database that created it, that “update clashes” and overwriting of records are prevented, and, most
importantly, that the information in all three databases is compatible as well as accessible to a
global community. These database centers are also computational biology centers, as it became
clear that sequence data couldn’t simply be generated by automated means but needs to be
proofread by biologists. Additional tools were developed to analyze the information.
Database Searching Tools
BLAST
In this lab, you will use BLAST to search the sequence database. This program is an important
research tool because it is a good combination of speed, sensitivity, flexibility, and statistical
rigor. BLAST is an acronym for Basic Local Alignment Search Tool. The BLAST search
algorithm takes your input sequence and compares it to all known genetic sequences (DNA or
protein), identifying the known molecules that have similar sequences.
BLAST is sometimes referred to as a “one-against-all” homology search algorithm, since the
input is a single sequence which is compared against all other known sequences. This is in
contrast to the Multiple Sequence Alignment, or MSA homology search, a “many-against-each-
other” search in which a small, defined set of sequences are compared only against each other,
not against the entire database.
There are various BLAST programs designed for either nucleotide sequence queries or protein
sequence queries. You will be using BLASTN for this lab: BLASTN takes a nucleotide sequence
(the query or unknown sequence) and its reverse complement, and searches them against a
nucleotide sequence database.
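As a minimal sketch of what “reverse complement” means, the short Python example below reverses a DNA sequence and swaps each base for its pairing partner. The function name and the example query are illustrative choices, not part of BLASTN itself, and the sketch assumes the sequence contains only A, C, G and T.

```python
# Pairing partner of each DNA base (unambiguous bases only).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    # Read the sequence backwards and replace each base with its partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

query = "TATCGCGTATTGCC"  # illustrative query sequence
print(reverse_complement(query))  # GGCAATACGCGATA
```

Searching both the query and its reverse complement means a match is found regardless of which DNA strand was deposited in the database.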
CLUSTALW2
The Multiple Sequence Alignment (MSA) algorithm ClustalW2 takes a set of input sequences
and aligns them so that homologous regions (the features that are common to the entire set of
sequences) are highlighted. This serves to identify the nucleotides or amino acids within the
sequences that have been conserved during their evolutionary divergence. Natural selection tends
to select against changes that result in loss of molecular function, thus conserved residues
identified in an MSA are presumed to be important for the structure and function of the
molecule.
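To illustrate what “conserved” means in an alignment, the sketch below marks the columns in which every aligned sequence carries the same base. It is a simplified stand-in for the conservation line that alignment programs print, not ClustalW's own algorithm; the three short sequences are made up, and a “-” is assumed to mark a gap.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where all sequences share one base (no gaps)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in aligned}  # distinct characters in this column
        if len(column) == 1 and "-" not in column:
            conserved.append(i)
    return conserved

def star_line(aligned):
    """Build a marker line with '*' under each fully conserved column."""
    marked = set(conserved_columns(aligned))
    return "".join("*" if i in marked else " " for i in range(len(aligned[0])))

# Made-up aligned sequences, for illustration only.
alignment = [
    "ACGT-ACGA",
    "ACGTTACGA",
    "ACCTTACTA",
]
for seq in alignment:
    print(seq)
print(star_line(alignment))
```

Columns without a star are the variable positions that distinguish the sequences from one another.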
The Multiple Sequence Alignment, or MSA, homology search algorithm is sometimes called a
“many-against-each-other” search because the input is a small, defined set of sequences that are
compared only against each other, not against an entire database. This is in contrast to the
BLAST similarity search algorithm, a “one-against-all” similarity search, in which the input is a
single sequence that is compared against all other known sequences listed in the database. Thus,
the starting point for an MSA is a set of sequences that are already presumed to be homologous.
ENTREZ
Another extremely powerful tool is the National Center for Biotechnology Information's
(NCBI) Entrez search engine (http://www.ncbi.nlm.nih.gov/Entrez). Entrez not only lets you
search for genetic sequence database records but interfaces with several databases. Entrez
connects to databases of nucleotide or protein sequences, 3-dimensional structures of
macromolecules, and even the Medline bibliographic database via PubMed -- all from this single
site.
PART I: PRE-LAB ASSIGNMENT

TO BE SUBMITTED ANY TIME BEFORE THE START OF THE LAB AS AN
ELECTRONIC COPY through the assignment submission on Canvas.
A separate pre-lab file has been made available for you to edit in the assignment folder.
Section 1: Bacterial ID Virtual Lab
You will perform the Bacterial ID Virtual Lab for section 1 of the pre-lab assignment by
connecting to the HHMI web site and using their “virtual Bacterial Identification Lab.” The
virtual lab demonstrates how one would isolate and purify a specific bacterial DNA sequence
using PCR. You will use the isolated DNA from this virtual lab in Section 2.
https://www.biointeractive.org/classroom-resources/bacterial-identification-virtual-lab
Note: refer to page 6 of Intro/protocol document.
Section 2: Identify unknown bacterium from the DNA sequence of its 16S rRNA gene using BLAST
Using the search tool BLAST, you will search the database with the DNA sequence obtained in
Section 1 and identify the bacterium.
SAMPLE A:____________________________ SAMPLE B:__________________________
SAMPLE C:____________________________ SAMPLE D:__________________________
SAMPLE E:____________________________ SAMPLE F:___________________________
Section 3: Multiple sequence alignment using ClustalW

For this exercise, you will compare the DNA sequences of the 16S rRNA gene from five unknown
bacteria. The sequences are provided for you in the lab folder on Canvas called “Lab 1-
Bactreium unknown sequences 1 to 5 16s rRNA”.
• Use the Multiple Sequence Alignment tool ClustalW to compare the sequences (refer
to pages 11-15 of the intro/protocol document). After conducting the multiple sequence
alignment, you will construct a phylogenetic “tree” diagram using ClustalW. Save your
tree and copy/paste the diagram below.
• Link for ClustalW: http://www.genome.jp/tools/clustalw/
• Instructions for how to use ClustalW can be found on page 15 of the introduction and
protocol document
• Indicate on your tree diagram which bacteria are most similar and which are most different.
Briefly explain your selection.
FOLLOW THE OUTLINED STEPS BELOW TO COMPLETE THE PRE-LAB
ASSIGNMENT.
Guide to Pre-Lab Assignment #1

Section 1: “Virtual Bacterial Identification Lab”: Isolation and Purification of
16S rDNA
Note: A Pre-lab Assignment Word document is posted on Canvas. Please download and complete
this document and submit the pre-lab assignment on Canvas (electronic copy) before the start of
your scheduled lab session.
Use this document as an introduction/protocol guide.
Background
For this part of the lab, which you will do on your own before you come to lab, you will
connect to the Howard Hughes Medical Institute's (HHMI) web site and use their “Virtual
Bacterial Identification Lab.” Explanatory sections are summarized or reprinted here from their
virtual lab (with permission). The purpose of the HHMI virtual lab is to familiarize you with the
science and techniques used to identify different types of bacteria based on their DNA sequence.
In the process, you will see how bacteria are grown and specific DNA is isolated and purified
using the molecular techniques Polymerase Chain Reaction (PCR) and DNA sequencing. These
techniques are fundamental tools and often used in molecular research as well as forensic
analysis.
In this specific example, the sequence of DNA used for identifying the bacterium is the region
that codes for the 16S subunit of the ribosomal RNA (16S rDNA). From genomic studies, it has
been found that different bacterial species have unique 16S rDNA, thus a comparison of this
region may be used as a diagnostic test. Imagine you are a pathologist or a pathology lab
technician at a well-equipped research hospital. Your task is to identify a bacterial sample
received from the clinician of a very sick patient who needs to be on the correct drug regimen as
quickly as possible.
Why use molecular techniques to identify the pathological (disease-causing) bacterium instead
of traditional methods?
Over the years, a battery of tests has been developed to categorize and identify bacteria. Tests
include staining and growing bacteria under a variety of conditions. Such procedures typically
require vigorously and reliably growing bacterial cultures. Many pathogens grow poorly on solid
medium while others grow only in liquid culture, making identification through traditional
techniques difficult or impossible. With the aid of molecular methods, however, these limitations
can be overcome. In addition, some species of bacteria cannot be differentiated from closely
related species through traditional methods. For these species, molecular methods offer the only
reliable and convenient means of identification.
Procedure
START NOW, by connecting to:
https://www.biointeractive.org/classroom-resources/bacterial-identification-virtual-lab
Go through the exercise, “perform the lab,” and complete all steps (all steps are described in
the virtual lab at the link above).
For your own understanding (there is no need to report answers in the pre-lab assignment), you
should be able to answer the questions in square brackets:
1. Prepare a sample from the patient and isolate whole bacterial DNA. [How was this done in
the virtual lab?]
2. Make many copies of the desired piece of DNA. [What technique do you use? How do you
separate the desired DNA sequence from the rest of the DNA? How do you purify your
sample?]
3. Sequence the DNA. [Briefly describe how this is done.]
4. In Section 2 below, we will discuss how you will use the DNA sequence information to
identify the bacterium isolated from the patient.
Section 2: “Virtual Bacterial Identification Lab”: BLAST Search to Identify
Bacterium
Background
As stated earlier, it has been found (from genomic studies) that different bacterial species have
unique 16S rDNA, thus a comparison of this region may be used as a diagnostic test. Bacterium
identification relies on matching the unknown sequence from a particular sample against a
database of all known 16S rDNA sequences using a program called BLAST.
http://www.ncbi.nlm.nih.gov/BLAST
After identifying the bacterium described in the online virtual lab (SAMPLE A), identify other
16S rDNA sequences (choose from samples B, C, D, E, F) using BLAST. You will need to
submit these to your TA at the beginning of the lab.
SAMPLE A:____________________________ SAMPLE B:__________________________
SAMPLE C:____________________________ SAMPLE D:__________________________
SAMPLE E:____________________________ SAMPLE F:___________________________
Why would you use BLAST? (Adapted from HHMI “Virtual Bacterial Identification Lab”)
Under what situations would a scientist search sequence databases? As an example, sequence
matching can be used to determine whether a newly identified DNA sequence is part of a known
gene. In the simplest scenario, if a new sequence is identical or almost identical (except for a few
nucleotide changes) to that of a gene in the sequence database, it is reasonable to predict that the
new sequence is either part of the same gene or of a closely related gene. But what if two
sequences that appear to be different share sections that are identical? How do you know whether
the identical sections are due to chance or indicate some meaningful relationship between the
two sequences? Sequence analysis using BLAST or another program provides a “similarity
score” to help answer this question. (The “similarity score” is discussed further below.)
If the function of a particular DNA sequence is already known (e.g., the 16S rRNA gene we
will be working with in this lab), comparing its sequence with that of the same gene from another
species of bacterium provides information about the evolutionary relationship between the two
bacterial species. The assumption here is that the number of positions that differ in the nucleotide
sequence is proportional to the time elapsed since the two species formed their own lines of
descent from a common predecessor.
However, not all DNA sequences change at a constant rate over time. For example, it is not at all
clear whether all organisms experience similar mutation rates from purely environmental factors
(from increased UV exposure, for example). If the DNA sequence has or has had at some point
in evolution a functional role, the rate of evolution and selection — which may be related to
population size among other things — can affect its rate of change. Moreover, in some cases,
mutations are caused by deletions, insertions, and substitutions of long sequences of DNA rather
than by single nucleotide changes. Finally, some sequences of DNA encode proteins with very
specific structural requirements, and any change may prove unfavorable to the organism. Such
sequences therefore do not tolerate change well and tend to remain the same for long periods of
time. These are referred to as “conserved” regions. In contrast, sequences that can
accommodate change more easily are referred to as “variable” regions.
Similarity score: Interpreting BLAST search results
(Reprinted from HHMI “Virtual Bacterial ID lab.”) [You should understand the basic distinction between
the BLAST Score and the E value]
Let's consider how one might go about assigning a numerical value to the degree of similarity
between two DNA sequences. Suppose we have two sequences as follows:
CGGCAT
CGCGAT
Let's assign one point for each base pair that matches exactly and 0 points for each base pair that
does not. We have C-C (match), G-G (match), G-C (no match), C-G (no match), A-A (match),
and T-T (match) for a total of 4 points. Under this hypothetical system, the more nucleotides that
match up, the higher the score.
When comparing two DNA sequences, it's important to remember that because of evolutionary
history, the sequences may have diverged not only by substitution of bases but also possibly by
deletions or insertions of bases. This means that the sequences that are being matched may not be
exactly the same length but might have gaps. In practical terms, for these two sequences, the best
match (for a total of 5 points, with each gap written as a single “-”) is:
CGGC-AT
CG-CGAT
Another possible alignment (for a total of 5 points) is:
CG-GCAT
CGCG-AT
From the simple example above, you can imagine how rapidly sequence comparisons can
become complicated as DNA length increases. The statistics for comparing two sequences of
DNA are thus highly complicated. Here we cover just the bare essence of the topic so that you
can interpret the response from your sequence query.
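The hypothetical one-point-per-match system described above can be sketched in a few lines of Python. This is only an illustration of that toy scheme: gaps are assumed to be written as a single “-”, a gap never counts as a match, and no gap penalty is applied, which is simpler than what BLAST actually does.

```python
def match_score(top, bottom, gap="-"):
    """Count aligned positions where both sequences carry the same base."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    # One point per identical pair; a gap paired with anything scores 0.
    return sum(1 for a, b in zip(top, bottom) if a == b and a != gap)

print(match_score("CGGCAT", "CGCGAT"))    # 4 (no gaps allowed)
print(match_score("CGGC-AT", "CG-CGAT"))  # 5 (one gap in each sequence)
print(match_score("CG-GCAT", "CGCG-AT"))  # 5 (a different gap placement)
```

Two different gap placements reach the same score, which is one reason alignment programs need additional rules, such as gap penalties, to choose between candidates.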
Let's suppose you do a BLAST search of the following sequence:
TATCGCGTATTGCC
BLAST will come back with a result, starting with the reference of the search program, the
number of letters in your sequence, the number of letters in the database, a graphic representation
of the sequence matches, and a list of matches. The list of matches is sorted with the best
matching sequences shown first. For the sequence we used, the list starts with the following:
                                                Score     E
Sequences producing significant alignments:     (bits)  Value
gb|AC012156.14|AC012 Homo sapiens chr 12..        28     5.8
ref|NC_001142.1 Saccharomyces cerevisiae...       28     5.8
What does this mean? “Score” is a numerical score assigned by BLAST. In the simple example
we used earlier, we simply assigned 1 point for matches and 0 points for non-matches. In
BLAST, the scoring system uses “bits” as the measure of information. For DNA, each position
can be occupied by either T, A, C, or G. Each match therefore contains 2 bits of information
(only 1 is correct out of 4 possible). For a 14-nucleotide-long sequence like ours, the maximum
match score then is 28 bits. The higher the score, the better the match.
“E-value” is the number of hits one can expect to see just by chance when searching a database
of a particular size. The value is defined as
$$E = \frac{N}{n} \cdot m \cdot n \cdot 2^{-S}$$
where m and n are the lengths of the two nucleotide sequences (measured in base pairs), S is the
bit score, and N refers to the total length of all sequences in the database. The formula should
make intuitive sense. For example, if S is higher (i.e., better matches), you would expect to see
fewer “hits.” On the other hand, if m or n are larger (i.e., one or the other sequence is longer),
then you would expect to see more hits purely by chance. Finally, if the database contains more
sequences (i.e., N is larger), then you would expect to see more hits. In any case, if BLAST
returns an E-value that is very small or close to zero, then you probably have a meaningful match
that is not due to random chance. To interpret the matches, you therefore need to pay attention to
whether the E-value is reasonably small. E-value is related to the P-value by the following
formula:
$$P = 1 - e^{-E}$$
So for a P-value of 0.95 (the statistically significant level), the E-value is around 3. Thus, in your
search, an E-value of 3 or less would be an acceptable match.
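The arithmetic in this section can be checked with a short Python sketch. It follows the formulas as written in this handout (2 bits per matching nucleotide, the E-value expression in terms of N, n, m and S, and the E-to-P conversion); the function names are illustrative, and the numbers passed to e_value below are hypothetical values chosen only to make the arithmetic easy to follow, not real database sizes.

```python
import math

def max_bit_score(length, alphabet_size=4):
    """Maximum bit score under the simple scheme: log2(4) = 2 bits per matching base."""
    return length * math.log2(alphabet_size)

def e_value(N, m, n, S):
    """E-value as written in the text: (N / n) * m * n * 2^(-S).

    Note that n cancels algebraically, leaving N * m * 2^(-S).
    """
    return (N / n) * m * n * 2.0 ** (-S)

def p_value(E):
    """Convert an E-value to a P-value: P = 1 - e^(-E)."""
    return 1.0 - math.exp(-E)

print(max_bit_score(14))              # 28.0 bits for a 14-nucleotide query
print(e_value(2**28, 14, 1024, 28))   # 14.0 with these hypothetical inputs
print(round(p_value(3), 3))           # 0.95
print(round(p_value(0.001), 6))       # for small E, P is approximately equal to E
```

Raising S by one bit halves the E-value, while doubling N or m doubles it, which matches the intuition described in the text.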
You should also keep in mind that there are a lot of sequences in the database and that some of
them are from the same species and therefore might be very similar. In some cases, the name of
the organism may have changed after it was originally reported; accordingly, two or more
sequences may match extremely well but appear to belong to completely different species.
Procedure (BLAST Search to identify your bacterium)

1. At this point, you need to select and copy the DNA sequence of your unknown bacterium.
2. Link to NCBI's BLAST Sequence Homology Search server
http://www.ncbi.nlm.nih.gov/BLAST and select nucleotide blast
• Because we wish to look for nucleotide sequences homologous to the 16S rDNA, you
will use the blastn program and the nr (non-redundant) database. You may leave all of
the other default settings. Paste your DNA sequence into the box provided, then click on
BLAST to submit your query.
3. Interpret the BLAST results and select the most likely identity of your unknown. After the
server computer conducts your analysis, the results are presented three ways: as a graphic, a
table of "hits" (identified similarities), and a series of sequence alignments.
A. The graphic has lines showing the positions and ranges of identity/similarity between your
sequence (the query) and other possible sequences in the database. The location and length
of each line indicate the extent of the similarity; the closeness of the match is also shown
by the line's color.
B. Under the text “Sequences producing significant alignments” is the table of “hits.” Each
database entry similar to the query sequence is presented, beginning at the top with the
closest match and ending at the bottom with the weakest. Clicking on the code on the left
of each line (e.g., emb|V00296|ECLACZ) links you to the GenBank entry for the sequence.
Clicking on the number (the score) at the right end of the line will jump you downward
within the file to the sequence alignment.
C. Each matched sequence is presented as a separate alignment with the query sequence. Only
the similar/identical regions of each molecule's sequence are presented here. The numbers
after the words “Query” and “Sbjct” indicate the position within each database entry to which
the nucleotides on that line correspond. This display is where one can analyze in detail the
nucleotide differences between the query and its homologue. You do NOT need to print out
your BLAST search. You need to understand what the results mean and what is the most
likely match and why. Usually the best match is the result at the top of the list with the
highest score and lowest E value.
4. Find the GenBank file record for your bacterium and learn how to read a GenBank file. On the
left column of your BLAST search, click on the identification number of the bacterium you
think best matches your 16S rDNA sample. You should now be linked to the GenBank record
of your chosen bacterium. Look at the GenBank record of your chosen bacterium. Acquaint
yourself with the parts of the GenBank database record for your nucleotide sequence and be
able to identify the information in your record that is bolded below.
What types of information are contained in the following parts of the record?
• Locus
• Definition
• Keywords
• Accession and NID
• Source
• Organism
• Reference(s) --earliest record
• Medline
• Comments
• Features
• CDS Why is there no coding sequence for the 16S rRNA?
• /translation Why is your sequence NOT translated?
• /db_xref
• mutation
• variation
• exon Why do you NOT expect to find either exons or introns in your sequence?
• intron
• precursor_RNA
• mRNA
• Base Count
• Sequence
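To get a feel for how a GenBank record is laid out, the sketch below splits a record into its top-level parts. It assumes the usual flat-file layout, in which each top-level keyword (LOCUS, DEFINITION, ACCESSION and so on) starts in the first column, continuation lines are indented, and “//” ends the record. The record shown is entirely made up for illustration, and the simplified parser folds indented sub-keywords into the section above them.

```python
def top_level_fields(record_text):
    """Collect the text under each top-level keyword of a GenBank-style record."""
    fields = {}
    current = None
    for line in record_text.splitlines():
        if line.startswith("//"):              # end-of-record marker
            break
        if line and not line[0].isspace():     # keyword starts in the first column
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None and line.strip():
            # Indented line: continuation of the current section.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields

# Made-up record, for illustration only (not a real GenBank entry).
EXAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 14 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Hypothetical bacterium 16S ribosomal RNA gene,
            partial sequence.
ACCESSION   EXAMPLE01
SOURCE      Hypothetical bacterium
ORIGIN
        1 tatcgcgtat tgcc
//
"""

fields = top_level_fields(EXAMPLE_RECORD)
for keyword, text in fields.items():
    print(keyword, "->", text)
```

Reading a real record by eye works the same way: find the keyword in the left margin, then read everything indented beneath it.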
Section 3: Identifying Conserved Sequences using Multiple Sequence
Alignment (MSA): ClustalW
Background
You have seen that the 16S rDNA region is sufficiently different from one bacterium to
another to use the differences as a means of identification. However, how different/similar are
the regions, and are some bacterial 16S rDNA sequences more similar than others? You might
have predicted that genes coding for molecules with such a vital function as being part of the
ribosome would not have such variability between species. Start to generate questions about
what parts of the 16S rDNA have variable and conserved regions and why this might be so; look
in your textbook for the structure of 16S rRNA and see if you can predict variable and conserved
regions on the DNA. (Hint: look for regions that are double-stranded and single-stranded;
where are the active sites, etc.)
Understanding Phylogenetics
Our system of taxonomy is based on phylogeny. That is, we classify organisms together because
they have a common evolutionary ancestor. In most cases, we cannot determine ancestry directly
because the fossil record is poor for most organisms. Instead we rely on shared, homologous
features, and we say that organisms that share many features are closely related. Organisms that
share many features probably had a relatively recent common ancestor. Molecules, including
DNA, can also reveal homology, since these are passed down from one generation to the next.
For example, chimpanzees and humans share about 98% of their DNA because the common
ancestor of chimps and humans lived only about 6 million years ago. In 6 million years, there has
not been enough time for very much divergence to take place. However, the DNA of humans and
yeast is more dissimilar because humans and yeast shared an early eukaryotic ancestor no more
recently than about 1.2 billion years ago.
Phylogenetic Trees: Presenting Evolutionary Relationships

Phylogenetic systematics is the field of biology that examines morphological characteristics,
biochemical pathways, and gene sequences to establish relationships among groups of
organisms. In phylogenetic studies, the most convenient way of visually presenting evolutionary
relationships is by constructing a phylogenetic tree. All phylogenetic trees are hypotheses that
are to be tested, modified and tested again.
One type of phylogenetic tree, known as a rooted tree, contains a root, nodes, branches and
clades (Figure 1). A phylogenetic tree is composed of nodes, each representing a taxonomic unit
(species, populations, individuals), and branches, which define the relationship between the
taxonomic units in terms of descent and ancestry. Only one branch can connect any two adjacent
nodes. The branching pattern of the tree is called the topology, and the branch length usually
represents the number of changes that have occurred in the branch. This is called a scaled
branch. Scaled trees are often calibrated to represent the passage of time. Such trees have a
theoretical basis in the particular gene or genes under analysis. Branches can also be unscaled,
which means that the branch length is not proportional to the number of changes that have
occurred, although the actual number may be indicated numerically somewhere on the branch.
Groups sharing a node share a common ancestor and make up a clade. Phylogenetic trees may
also be either rooted or unrooted. In rooted trees, there is a particular node, called the root,
representing a common ancestor, from which a unique path leads to any other node. An unrooted
tree only specifies the relationship among species, without identifying a common ancestor, or
evolutionary path.
Figure 1. Terminology associated with a phylogenetic tree reflecting evolutionary relationships.
In Figure 1, humans, mice and flies are all animals, and therefore they share ancestral traits: all
are eukaryotic, multicellular organisms that consume food. But humans and mice share some traits
that are not shared by flies. These shared traits (for example, hair and milk production) are
known as derived traits and determine the specific clade – mammals – of the organisms that
share them.
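The idea that each internal node defines a clade can be sketched with a small rooted tree written as nested Python tuples. The topology used here, with humans and mice grouped together and flies branching off earlier, is taken from the description above; the tuple representation and function names are illustrative assumptions, and Figure 1 itself may show more detail.

```python
def leaves(tree):
    """List the tip names (taxa) below a node."""
    if isinstance(tree, str):      # a tip is just a name
        return [tree]
    names = []
    for child in tree:             # an internal node is a tuple of children
        names.extend(leaves(child))
    return names

def clades(tree):
    """List the clade (sorted tip names) defined by every internal node, root first."""
    if isinstance(tree, str):
        return []
    found = [sorted(leaves(tree))]
    for child in tree:
        found.extend(clades(child))
    return found

# Rooted tree: (human, mouse) share a more recent ancestor than either does with fly.
tree = (("human", "mouse"), "fly")
print(clades(tree))  # [['fly', 'human', 'mouse'], ['human', 'mouse']]
```

The first clade corresponds to the root (all three animals) and the second to the mammals.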
Figure 2. Possible ways of drawing a tree.
Phylogenetic trees, a convenient way of representing evolutionary relationships among a group
of organisms, can be drawn in various ways. Branches on phylogenetic trees may be scaled (top
panel) representing the amount of evolutionary change, time, or both, when there is a molecular
clock, or they may be unscaled (middle panel) and have no direct correspondence with either
time or amount of evolutionary change. Phylogenetic trees may be rooted (top and middle
panels) or unrooted (bottom panels). In the case of unrooted trees, branching relationships
between taxa are specified by the way they are connected to each other, but the position of the
common ancestor is not. For example, on an unrooted tree with four species, there are five
branches (four external, one internal) on which the tree can be rooted. Rooting on each of the
five branches has different implications for evolutionary relationships.
Text and figures adapted with permission from A. Vierstraete, University of Ghent, Belgium.
http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
Molecular phylogenetics
Rapid technological advances in molecular biology have allowed scientists to obtain DNA
sequences of genes and entire genomes of a large variety of organisms. Molecular phylogenetics
is the field that examines evolutionary relationships among groups specifically based on changes
occurring in DNA and protein structure. The number of differences in DNA sequences between
different groups reflects the accumulation of mutations over the time since they shared a
common ancestor.
16S/18S rRNA
The modern approach to phylogeny relies on molecular studies and sequence comparisons of
genes and proteins. One of the most extensively used methods to develop the phylogeny of both
prokaryotes and eukaryotes was pioneered in the early 1970s by Carl Woese. This so-called
"SSU sequencing" or "16S/18S sequencing" is based on the 16S (prokaryotes) and 18S
(eukaryotes) ribosomal RNA (rRNA) genes.
All living organisms contain the small (16S or 18S) and the large (23S or 28S) subunit rRNA.
Since these subunits are essential for protein synthesis, they have the same function in all
organisms and must have arisen in the early stages of life. Mutations in these genes can directly
affect the normal functioning of the ribosome, so only minor changes are tolerated; otherwise
the ribosome can lose its function, and the mutated organisms are eliminated.
Because 16S rRNA is so sensitive to mutations, the corresponding gene contains a large
number of highly conserved regions. Other regions do not affect the ribosome's function,
and mutations in them can accumulate over evolutionary time.
Rubisco Large subunit

Over the past decade, botanists have produced several thousand phylogenetic analyses based on
molecular data, with particular emphasis on sequencing rbcL, the plastid gene encoding the large
subunit of Rubisco (ribulose bisphosphate carboxylase), which is widely used for reconstruction
of plant phylogenies due to its conservative nature. Rubisco is an enzyme involved in the first
major step of carbon fixation, a process by which atmospheric carbon dioxide is converted by
plants to energy-rich molecules such as glucose. Rubisco is responsible for almost all carbon
fixation on Earth.
Four steps are important in the construction of phylogenetic trees based on molecular data
(NCBI, 2004).
1. Align similar DNA sequences from different groups to detect similarities and differences
in nucleotide bases.
2. Establish sequence variation by observing the level of homology or similarity of
sequences among groups.
3. Build a tree by arranging groups based on the percentage of matching bases for
sequences and other factors.
4. Evaluate the tree, including the analysis of the resulting tree and comparison with trees
constructed with non-molecular data.
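Steps 1 to 3 can be illustrated with a small Python sketch that takes sequences that are already aligned, computes the percentage of matching bases for every pair, and reports the most similar pair. This is only an illustration and is not how ClustalW itself works internally. The function name and the three short toy sequences are invented for this example, and skipping columns that contain a gap is just one of several possible conventions.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of aligned columns with the same base.

    Assumption: columns where either sequence has a gap ("-") are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ignore gap columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned sequences (invented for illustration only).
aligned = {
    "bacterium1": "ATGCTTAAAC",
    "bacterium2": "ATGCTTAAGC",
    "bacterium3": "ATG-TTCAGC",
}

# Step 2: establish sequence variation for every pair of sequences.
identities = {
    (name1, name2): percent_identity(aligned[name1], aligned[name2])
    for name1, name2 in combinations(aligned, 2)
}

# Step 3 (first decision only): the most similar pair would be grouped first.
closest_pair = max(identities, key=identities.get)

for pair, value in identities.items():
    print(pair, round(value, 1))
print("most similar pair:", closest_pair)
```

A real tree-building program repeats this kind of grouping for all sequences and uses more refined distance measures, but the principle is the same: the pair with the highest percentage of matching bases is placed closest together on the tree.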
Using DNA sequences in phylogenetics can generate very large data sets. To cope with such
huge amounts of data, scientists use the tools of bioinformatics to construct molecular
phylogenies. There are many computer programs that use different algorithms for analyzing
these large data sets. In this exercise, you will use a multiple sequence alignment program,
ClustalW. Remember that the more similar two sequences are, the more closely related they are
(the more recent their shared common ancestor). The more nucleotide changes that have occurred,
the more time has passed since the two groups shared a common ancestor, and therefore the more
distantly related they are.
Procedure
1. In this exercise you will compare the DNA sequences from all 5 unknown bacteria. Select and
copy the 5 sequences posted in the lab folder on Canvas.
2. To use the CLUSTALW software for multiple sequence alignment, the DNA sequences must
be in a specific format:
>bacterium1 [Each new sequence must start with a ">" and have no spaces in the title]
ATGCTTAAA….. [DNA sequence starts on a new line]
>bacterium2
CGGTAAACT
3. Access the Kyoto University Bioinformatics Center's ClustalW MSA server
(http://www.genome.jp/tools/clustalw/) to conduct multiple sequence alignments.
• Select output format: CLUSTAL
• Select Pairwise Alignment SLOW/ACCURATE
• Select DNA not PROTEIN
• Pairwise Alignment Parameters for SLOW/ACCURATE select Weight Matrix:
CLUSTALW (IUB for DNA)
• Multiple Alignment Parameters select Weight matrix: CLUSTALW (IUB for
DNA)
• Click “Execute Multiple Alignment”
You do NOT need to print out your alignment results. You do need to know how to read them,
e.g., which portions of the sequences are identical and where they differ.
4. Interpret the results.

(The results of the MSA are a series of stacked lines, each line representing one of the
sequences in the query set. Gaps (dashes) are introduced as necessary to maximize the
alignment of identical or similar residues among the set of sequences. These gaps correspond to
insertions and/or deletions, which reflect important evolutionary events. At the bottom of each
stack of aligned sequences are symbols that summarize the alignment at that position in the
sequence. An asterisk denotes a position at which all query sequences have exactly the same
residue (a nucleotide in this exercise; an amino acid in a protein alignment). Dots
indicate the degree of homology when there is not complete sequence conservation.)
5. Create a phylogenetic tree.

• For a graphical view of the alignment, click on the pull-down menu “Select tree menu”.
Select “FastTree Full” and Execute.
• ENSURE THAT YOUR POP-UP BLOCKER IS TURNED OFF!!!
• Save your tree as a PNG file and input the picture of the tree into your pre-lab
assignment, section 3. Indicate on your tree diagram which bacteria are most
similar/different? Explain your selection.
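The input format from step 2 and the asterisk line described in step 4 can be sketched in a few lines of Python. This is a rough illustration only: the function names are hypothetical, the toy sequences are invented, and the check accepts only the letters A, C, G, T and N, which is a simplifying assumption (the server may accept other codes). The second function marks a column with an asterisk only when every sequence has the same base there and no sequence has a gap.

```python
def check_fasta(text):
    """Return the sequence titles found; raise ValueError on a format problem."""
    names = []
    current = None        # title of the record being read
    has_sequence = False  # whether the current record has any sequence lines
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None and not has_sequence:
                raise ValueError("no sequence found for " + current)
            title = line[1:]
            if not title or " " in title:
                raise ValueError("titles must be non-empty and contain no spaces")
            names.append(title)
            current = title
            has_sequence = False
        else:
            if current is None:
                raise ValueError("sequence data found before any '>' title line")
            if set(line.upper()) - set("ACGTN"):
                raise ValueError("unexpected character in sequence for " + current)
            has_sequence = True
    if current is not None and not has_sequence:
        raise ValueError("no sequence found for " + current)
    return names


def conservation_line(aligned_rows):
    """Build the summary line: '*' where all rows agree, ' ' elsewhere."""
    if not aligned_rows:
        raise ValueError("need at least one aligned sequence")
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned_rows):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


# Toy data, invented for illustration only.
example_input = ">bacterium1\nATGCTTAAAC\n>bacterium2\nATGCTTAAGC\n"
print(check_fasta(example_input))

rows = ["ATGCTTAAAC", "ATGCTTAAGC", "ATG-TTCAGC"]
for row in rows:
    print(row)
print(conservation_line(rows))
```

Reading the printed summary line under the stacked sequences is the same task you perform by eye on the ClustalW output: columns with an asterisk are identical in all sequences, and unmarked columns contain a difference or a gap.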
PART II: LAB ASSIGNMENT
LAB 1 LAB ASSIGNMENT

BIOINFORMATICS
NAME:
ID #: CRN:
To be completed during the in-person lab session and submitted by the end of the
lab session. A paper copy is submitted to the TA before/by the end of the in-person lab
session.
Note: A Lab Assignment word document is posted on Canvas. Print and bring to the lab session
to complete/submit to TA.
Biological Problem #1: Imagine that you are working in a pathology lab and need to
identify the bacterial species contained in a sample from a very sick patient. Once you know
what species they are infected with, the doctor will be able to recommend the appropriate
antibiotic. You have purified the bacteria from the patient’s samples and extracted bacterial
DNA from a single colony. You then performed PCR using primers that anneal to the region
containing the 16S rRNA gene. You have sequenced the PCR product and you are now ready to
identify the bacterium. As you will recall, you performed a similar exercise in your pre-lab
assignment.
Objective: BLAST of an unknown bacterial 16S rRNA gene sequence
Step 1: You will use an unknown bacterial 16S rRNA gene sequence provided to you by your
TA during your lab session.
Step 2: Copy your assigned unknown bacterial sequence.
Step 3: Open the BLAST website (http://www.ncbi.nlm.nih.gov/BLAST), select nucleotide
blast, and paste your sequence into the large empty window labelled Enter Query Sequence.
Your sequence is already in FASTA format, but click on the link to find out what FASTA format
means.
Step 4: In the pull-down database window called Choose Search Set, select nucleotide
collection (nr/nt) since you want to compare your nucleotide sequence with all other nucleotide
sequences in the database.
Step 5: Hit the “BLAST” button and wait for your results.
a) Which bacterial species is the patient most likely infected with? ___________________
_______________________________________________________________________
b) What features of the BLAST output influenced your decision? _______________________
_______________________________________________________________________
c) Which sequence is the query sequence and which one is the subject sequence? ________
________________________________________________________________________
Step 6: Click on the hyperlink associated with your best BLAST match to get to the GenBank
record.
d) What is the Accession Number of your best match? ____________________________
e) Who submitted this sequence? _____________________________________________
f) Why is there no CDS (sequence coding for amino acids in protein) associated with this
record? _______________________________________________________________
_______________________________________________________________________
g) If you were to perform the same blastn analysis with your bacterial sequence a year from
now using a public database where sequences are constantly being added, would you
expect to obtain the same Score? E-value? Explain.
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Biological Problem #2: You and your fellow colleagues in the pathology lab have
sequenced the 16S rRNA gene sequence for a total of 5 bacterial species. You are curious to see
how similar your 16S rRNA sequence (unknown provided by TA in Biological Problem #1) is to
the other 4 sequences. You have heard that multiple sequence alignments may provide some
information about this and would like to give it a try.
Objective: Multiple Sequence Alignment (ClustalW) of bacterial 16S rRNA gene
sequences from 5 bacterial species.
Step 1: All of the 16S rRNA gene sequences you need will be provided to you by your TA at
the beginning of your lab (the file provided by the TA includes the 1 unknown used in Biological
Problem #1 and 4 other bacterial sequences). Copy all of the sequences.
Step 2: Open the Multiple Sequence Alignment link (ClustalW):
http://www.genome.jp/tools/clustalw/
Paste your bacterial sequences into the window. In the pull-down window, select DNA since
that is what you are aligning.
Step 3: Click the Submit button and wait for your results to appear.
a) What do you think it indicates when the nucleotides have a star? What do you think it
means when they have a white space and no star? ___________________________
______________________________________________________________________
______________________________________________________________________
b) Would you estimate that all 5 sequences are quite similar (>90%) or not? ___________
______________________________________________________________________
c) Which bacterial sequence appears to differ the most from the others? Click on “View
Tree” at the top of the output page for a different visual representation. (Note that trees
generated by ClustalW represent sequence similarity and are not necessarily intended to
be interpreted as a tree indicating evolutionary descent).
_______________________________________________________________________
_______________________________________________________________________
d) Does your bacterial sequence cluster with any other sequence? Briefly describe the
relationship between your sequence and those of your colleagues.
________________________________________________________________________
________________________________________________________________________
Biological Problem #3: The ribulose bisphosphate carboxylase (Rubisco) protein is
essential to carbon fixation in photosynthesis and is found in green algae and all land plants.
Its gene is therefore an ideal choice for establishing phylogenetic relationships among green
algae and land plants using nucleotide sequences. The gene sequences for the large subunit (rbcL)
of the Rubisco protein have been isolated for Arabidopsis, Chara, Equisetum, Lilium, Marchantia,
Pinus, Polypodium, Polytrichum and Zamia. They are available in GenBank and in the lab folder on
Canvas, under “Miscellaneous files for Pre-lab and Lab Assignment #1”, in the file named
“Biological Problem 3 rbcL 9 sequences”.
Use your knowledge and laboratory experience to develop a morphological phylogenetic tree
that depicts the evolutionary relationships for the organisms listed above. This tree will serve as
your hypothesis, which you will test using molecular data for the rbcL nucleotide sequences. Use
the file “Biological Problem 3 pictures” to help you with the morphological phylogenetic tree.
Step 1: Examine the organisms selected and determine which are charophytes, bryophytes,
pterophytes, gymnosperms or angiosperms.
a) What are the ancestral and derived characteristics of the major phyla of plants? Develop
a hypothesis for which organisms will be more closely related to each other (compare
the morphological and life cycle characteristics of land plants). Arrange the organisms
most closely related to each other into clades, and indicate which clades might share a
common ancestor. Draw your morphological phylogenetic tree in the space below.
Step 2: You will be testing your hypothesis that the morphological tree accurately represents
land plant phylogeny. You will use molecular data from the nucleotide sequences of rbcL.
b) Do you think the molecular evidence will support (be consistent with) your hypothesis or
falsify it? Write your predictions below.
________________________________________________________________________
________________________________________________________________________
Step 3: Create a phylogenetic tree of the 9 plant species mentioned above from their rbcL gene
sequences using ClustalW. Use the sequences posted in the lab folder on Canvas in the file named
“Biological Problem 3 rbcL 9 sequences”.
a) Compare your molecular phylogenetic tree to your hypothesized morphological tree.
Describe any similarities or differences and your thoughts on why there may be
differences.
________________________________________________________________________
________________________________________________________________________
b) The two phylogenetic trees are supported by different types of evidence. What evidence
was used to create the phylogenetic tree using bioinformatics in ClustalW?
________________________________________________________________________
________________________________________________________________________
c) What types of evidence support your hypothesized “morphological” tree?

________________________________________________________________________
________________________________________________________________________
d) What is rbcL, and why is it a particularly useful molecule for studying evolutionary
relationships in plants and green algae? ________________________________________
________________________________________________________________________
e) What are the limitations in using rbcL to construct a phylogenetic tree? ______________
________________________________________________________________________
Applying Your Knowledge
1. Can you suggest reasons why a phylogeny based on molecular evidence and a phylogeny
based on morphology and other evidence might not be exactly the same?
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
2. Zoologists worldwide are sequencing a mitochondrial gene, CO1 (cytochrome c oxidase
subunit 1), which is found in all animals and appears to be distinctive for each species. The
sequence of nucleotides can be used as a universal DNA bar code. By comparing the CO1 DNA
sequence for an animal to a growing database of DNA sequences, scientists can accurately
identify any animal and also discover species not previously known to science. How might DNA
bar coding, which uses molecular biology and bioinformatics, be useful in enforcing
international laws for banning the import of endangered species? How might these approaches
stimulate the study of biodiversity in remote areas?
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
______________________________________________________________________________
Appendix A: Sample GenBank Record

Escherichia coli strain ICMP 15663 ATP synthase beta subunit (atpD) gene, partial cds
[This is a genomic DNA sequence, 1264 base pairs long.]
LOCUS       DQ859781     1264 bp    DNA     linear   BCT 26-JUL-2011
DEFINITION  Escherichia coli strain ICMP 15663 ATP synthase beta subunit (atpD)
            gene, partial cds.
[Just as every item in a museum gets an Accession Number, so does every submission to GenBank.]
ACCESSION   DQ859781
VERSION     DQ859781.1  GI:112791345
KEYWORDS    .
SOURCE      Escherichia coli DSM 30083
[The taxonomic classification of the organism.]
  ORGANISM  Escherichia coli DSM 30083
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
[This sequence was published in a journal article.]
REFERENCE   1 (bases 1 to 1264)
  AUTHORS   Young,J.M. and Park,D.C.
  TITLE     Relationships of plant pathogenic enterobacteria based on partial
            atpD, carA, and recA as individual and concatenated nucleotide
            and peptide sequences
  JOURNAL   Syst. Appl. Microbiol. 30 (5), 343-354 (2007)
  PUBMED    17451899
[And it was submitted separately to GenBank.]
REFERENCE   2 (bases 1 to 1264)
  AUTHORS   Park,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (19-JUL-2006) Landcare Research, Private Bag 92-170,
            Auckland 1072, New Zealand
[If you are unsure of what the sequence represents, look here.]
FEATURES             Location/Qualifiers
     source          1..1264
                     /organism="Escherichia coli DSM 30083"
                     /mol_type="genomic DNA"
                     /strain="ICMP 15663; DSM 30083"
                     /db_xref="taxon:866789"
                     /note="type strain of Escherichia coli"
[If the gene has a designated name, it will appear here.]
     gene            <1..>1264
                     /gene="atpD"
[The section of the sequence that does the protein coding.]
     CDS             <1..>1264
                     /gene="atpD"
                     /codon_start=1
                     /transl_table=11
                     /product="ATP synthase beta subunit"
                     /protein_id="ABI21943.1"
                     /db_xref="GI:112791346"
[The amino acid residues the sequence codes for.]
/translation="VYDALEVQNGNERLVLEVQQQLGGGIVRTIAMGSSDGLRRGLDV
KDLEHPIEVPVGKATLGRIMNVLGEPVDMKGEIGEEERWAIHRAAPSYEELSNSQELL
ETGIKVIDLMCPFAKGGKVGLFGGAGVGKTVNMMELIRNIAIEHSGYSVFAGVGERTR
EGNDFYHEMTDSNVIDKVSLVYGQMNEPPGNRLRVALTGLTMAEKFRDEGRDVLLFVD
NIYRYTLAGTEVSALLGRMPSAVGYQPTLAEEMGVLQERITSTKTGSITSVQAVYVPA
DDLTDPSPATTFAHLDATVVLSRQIASLGIYPAVDPLDSTSRQLDPLVVGQEHYDTAR
GVQSILQRYQELKDIIAILGMDELSEEDKLVVARARKIQRFLSQPFFVAEVFTGSPGK
YVSLKDTIRGFKGIMEGEYDHLPEQAFYM"
ORIGIN
1 gtgtacgatg ctcttgaggt gcaaaatggt aatgagcgtc tggtgctgga agttcagcag
61 cagctcggcg gcggtatcgt gcgtaccatc gcaatgggtt cctccgacgg tctgcgtcgc
121 ggtctggatg taaaagacct cgaacacccg atcgaagtcc cggtaggtaa agcgactctg
181 ggccgtatca tgaacgtact gggtgaaccg gtcgacatga aaggcgagat cggtgaagaa
241 gagcgttggg cgattcaccg agcagcacct tcctacgaag agctgtcaaa ctctcaggaa
301 ctgctggaaa ccggtatcaa agttatcgac ctgatgtgtc cgttcgctaa gggcggtaaa
361 gttggtctgt tcggtggtgc gggtgtaggt aaaaccgtaa acatgatgga gcttattcgt
421 aacatcgcga tcgagcactc cggttactct gtgtttgcgg gcgtaggtga acgtactcgt
481 gagggtaacg acttctacca cgaaatgacc gactccaacg ttatcgacaa agtatccctg
541 gtgtatggcc agatgaacga gccgccggga aaccgtctgc gcgttgctct gaccggtctg
601 accatggctg agaaattccg tgacgaaggt cgtgacgttc tgctgttcgt tgacaacatc
661 tatcgttaca ccctggccgg tacggaagta tccgcactgc tgggccgtat gccttcagcg
721 gtaggttatc agccgaccct ggcggaagag atgggcgttc tgcaggaacg tatcacctcc
781 accaaaaccg gttctatcac ctccgtacag gcagtatacg tacctgcgga tgacttgact
841 gacccgtctc cggcaaccac ctttgcgcac cttgacgcaa ccgtggtact gagccgtcag
901 atcgcgtctc tgggtatcta cccggccgtt gacccgctgg actccaccag ccgtcagctg
961 gacccgctgg tggttggtca ggaacactac gacactgcgc gtggcgttca gtccatcctg
1021 caacgttatc aggaactgaa agacattatc gccatcctgg gtatggatga actgtctgaa
1081 gaagacaaac tggtggtagc gcgtgctcgt aagatccagc gcttcctgtc ccagccgttc
1141 ttcgtggcag aagtattcac cggttctccg ggtaaatacg tctccctgaa agacaccatc
1201 cgtggcttta aaggcatcat ggaaggcgaa tacgatcacc tgccggagca ggcgttctac
1261 atgg
//
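If you want to pull basic facts out of a record like this one with a script, the LOCUS line is a convenient place to start. The Python sketch below splits the LOCUS line of the sample record into named fields and checks how many complete codons fit into the reported length. The function name is hypothetical, and splitting on whitespace is an assumption that works for this sample line only; real GenBank records use fixed column positions and may contain fields that are not shown here.

```python
def parse_locus(line):
    """Split a LOCUS line shaped like the sample record into named fields.

    Assumption: eight whitespace-separated fields in the order shown in the
    sample (LOCUS, name, length, 'bp', molecule, topology, division, date).
    """
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("LOCUS line does not match the layout of the sample record")
    return {
        "name": fields[1],
        "length_bp": int(fields[2]),
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "date": fields[7],
    }


locus = parse_locus("LOCUS DQ859781 1264 bp DNA linear BCT 26-JUL-2011")

# A coding sequence is read three bases at a time, starting at codon_start=1.
complete_codons, leftover_bases = divmod(locus["length_bp"], 3)

print(locus)
print("complete codons:", complete_codons, "leftover bases:", leftover_bases)
```

The length of 1264 bases is not a multiple of three, which is consistent with the record describing only part of the coding sequence ("partial cds").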
Appendix A continued: GenBank record file-- definitions

http://ncbi.nlm.nih.gov/genbank/gbrel.txt
LOCUS - A short mnemonic name for the entry, chosen to suggest the sequence's definition.
DEFINITION - A concise description of the sequence.
ACCESSION - The primary accession number is a unique, unchanging code assigned to each
entry. (Please use this code when citing information from GenBank.)
NID - The unique nucleic acid identifier that has been assigned to the current version of the
sequence data that are associated with the GenBank entry identified by a given primary
accession number.
KEYWORDS - Short phrases describing gene products and other information about an entry.
SEGMENT - Information on the order in which this entry appears in a series of discontinuous
sequences from the same molecule.
SOURCE - Common name of the organism or the name most frequently used in the literature.
ORGANISM - Formal scientific name of the organism (first line) and taxonomic classification
levels (second and subsequent lines).
REFERENCE - Citations for all articles containing data reported in this entry. Includes four
subkeywords and may repeat.
AUTHORS - Lists the authors of the citation.
TITLE - Full title of citation. Optional subkeyword (present in all but unpublished
citations)/one or more records.
JOURNAL - Lists the journal name, volume, year, and page numbers of the citation.
MEDLINE - Provides the Medline unique identifier for a citation.
REMARK - Specifies the relevance of a citation to an entry.
COMMENT - Cross-references to other sequence entries, comparisons to other collections,
notes of changes in LOCUS names, and other remarks.
FEATURES - Table containing information on portions of the sequence that code for proteins
and RNA molecules and information on experimentally determined sites of biological
significance.
BASE COUNT - Summary of the number of occurrences of each base code in the sequence.
ORIGIN - Specification of how the first base of the reported sequence is operationally located
within the genome. Where possible, this includes its location within a larger genetic map.
The ORIGIN line is followed by sequence data (multiple records).
Feature Key Names

The first column of the feature descriptor line contains the feature key. It starts at column 6 and
can continue to column 20. The list of valid feature keys is shown below.
allele Related strain contains alternative gene form
attenuator Sequence related to transcription termination
CDS Sequence coding for amino acids in protein (includes stop codon)
enhancer Cis-acting enhancer of promoter function
exon Region that codes for part of spliced mRNA (in eukaryotes)
iDNA Intervening DNA eliminated by recombination
intron Transcribed region excised by mRNA splicing (in eukaryotes)
LTR Long terminal repeat
mat_peptide Mature peptide coding region (does not include stop codon)
misc_binding Miscellaneous binding site
misc_difference Miscellaneous difference feature
misc_recomb Miscellaneous recombination feature
misc_RNA Miscellaneous transcript feature not defined by other RNA keys
misc_signal Miscellaneous signal
misc_structure Miscellaneous DNA or RNA structure
modified_base The indicated base is a modified nucleotide
mRNA Messenger RNA
mutation A mutation alters the sequence here
old_sequence Presented sequence revises a previous version
precursor_RNA Any RNA species that is not yet the mature RNA product
primer Primer binding region used with PCR
promoter A region involved in transcription initiation
protein_bind Non-covalent protein binding site on DNA or RNA
RBS Ribosome binding site
rep_origin Replication origin for duplex DNA
repeat_region Sequence containing repeated subsequences
repeat_unit One repeated unit of a repeat_region
rRNA Ribosomal RNA
STS Sequence Tagged Site; operationally unique sequence that identifies the combination of
primer spans used in a PCR assay
tRNA Transfer RNA
unsure Authors are unsure about the sequence in this region
variation A related population contains stable mutation
-10_signal 'Pribnow box' in prokaryotic promoters
-35_signal '-35 box' in prokaryotic promoters
3'UTR 3' untranslated region (trailer)
5'UTR 5' untranslated region (leader)
Appendix B: Study Guide

This outline of questions and key concepts is provided as a study aid for your preparation for this
lab on Bioinformatics.
1. Know basic terminology about bioinformatics.
• What is Bioinformatics?
• Recognize and be able to distinguish the names of places (and their location), repositories
of data, and search and analysis tools, e.g., the National Center for Biotechnology Information
(NCBI) in Bethesda, Maryland (near Washington, D.C.); databanks include GenBank (at NCBI),
EMBL (Europe) and DDBJ (Japan); search tools include Entrez, BLAST and MSA.
2. Know basic information about the application of the search and analysis tools used in this lab:
ENTREZ
• What is Entrez and for what type of information would you use ENTREZ?
• Where is it and how do you access Entrez? Although Entrez is not directly used for this
lab, it is a very useful tool to know about.
BLAST (what does this stand for?)

• What is BLAST, what type of data do you input into this tool and what kind of
information does the tool provide?
• When would you use BLAST (i.e., for what kind of scientific questions would BLAST be
useful)?
• Why is BLAST referred to as a "one against all similarity search"?
• What do Score and E-value represent?
• Given a sample BLAST search result be able to locate basic types of information such as
the Accession Number, and type of gene, sequence alignment as well as identify and
explain your reasoning for the best match.
• Further thought question: What features of a sequence influence the minimum size (i.e.,
number of bases) to give a clear and unambiguous result in a BLAST search?
• Given some information about unknown/unidentified sequenced material and BLAST
search results, identify the most likely match and explain why.
Multiple Sequence Alignment (MSA) using ClustalW

• What is MSA, what type of data do you input into the ClustalW tool and what kind of
information does the tool provide?
• In what form must your data be to input into MSA?
• When would you use MSA (i.e., for what kind of scientific questions would MSA be
useful)?
• Why is MSA referred to as "many against each other similarity search"?
• Given a sample output, identify homologous regions and variable regions. What are
conservative substitutions? Identify some on the output. What is the difference
between conservative and non-conservative substitutions?
• What general comments can you make about the degree of similarity between your two
(or more) bacteria samples?
• Suggest reasons why some regions of the 16S rDNA are highly conserved while others
are not.
Given a sample GenBank record, be able to find the following information (see the sample in
Appendix A):
1. What is the specific sequence? ("Definition")
2. Know what the permanent identifying code is for this sequence (accession number) and
distinguish this from the unique nucleic acid identifier used for the current version of
the sequence data. (NID).
3. From what organism was this sequence isolated? (Source)
4. What was the first reference describing this sequence?
5. For this first reference locate when, by whom, and if published.
6. What is the total number of A, C, G and T nucleotides (Base Count) for this sequence? Look
for anomalies, e.g., are the numbers of each base similar or is there a predominance
of G-C or A-T?
7. What information can you get out of the Features section? For example, CDS (coding sequence)
indicates a region that is transcribed into mRNA for translation into amino acids. Why is
there no CDS indicated for 16S rDNA?
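When a record does not list a Base Count, the numbers in item 6 can be worked out directly from the sequence lines under ORIGIN. The Python sketch below counts each base and reports the percentage of G and C. The function names are hypothetical, and the input is just the first 20 bases of the sample record in Appendix A, typed in with its position number as it appears in the record.

```python
from collections import Counter


def base_counts(origin_text):
    """Count A, C, G and T in ORIGIN-style text, ignoring digits and spaces."""
    letters = [ch.upper() for ch in origin_text if ch.isalpha()]
    counts = Counter(letters)
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_percent(counts):
    """Percentage of bases that are G or C."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no bases to count")
    return 100.0 * (counts["G"] + counts["C"]) / total


# First 20 bases of the sample record, with the leading position number.
counts = base_counts("1 gtgtacgatg ctcttgaggt")
gc = gc_percent(counts)

print(counts)
print("GC content (%):", gc)
```

Running the same functions on the full ORIGIN block shows whether G-C or A-T pairs predominate across the whole sequence.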
3. What you need to look for in the Virtual Bacterial ID Lab

Know some basic information about the isolation, purification, and identification of bacterial
DNA described in this exercise; the answers to all of these questions are given in the
background text information in the vlab:
1. Why use molecular techniques to identify the unknown bacterium rather than standard
microbiological tests?
2. Why is 16S rDNA used in this exercise to identify the unknown bacterium? Realizing
that 16S rRNA has loops and folds gives you a clue to why such an important molecule
has both conserved and variable regions. Look up the structure of 16S rRNA in your
textbook.
3. Isolate a bacterial sample from the patient [only in the simplest terms].
4. Isolate whole bacterial DNA. [How was this done in the vlab?]
5. Make many copies of the desired piece of DNA.
a. What technique do you use?
b. How do you separate the desired DNA sequence from the rest of the DNA?
c. How do you purify your sample?
6. Sequence the DNA. Briefly describe how this is done.
Literature Cited
Ball, M., G. Duncan, D. Ranieri, and S. Kiser. 2002. Exploring important biological concepts
using Biology Workbench. Pages 85-109, in Tested Studies for Laboratory Teaching,
Volume 23 (M.A. O’Donnell, Editor). Proceedings of the 23rd Workshop/Conference of the
Association for Biology Laboratory Education (ABLE), 392 pages.
Gurney, T., R. Ethel, D. Ratnapradipa, and R. Bossard. 2000. Introduction to the molecular
phylogeny of insects. Pages 63-77, in Tested Studies for Laboratory Teaching, Volume
21 (S.J. Karcher, Editor). Proceedings of the 21st Workshop/Conference of the Association
for Biology Laboratory Education (ABLE), 509 pages.
Gurney, T., R. LeMon, and K. Nolan. 2001. DNA sequencing to illustrate mutation and
evolution. Pages 100-119, in Tested studies for Laboratory Teaching, Volume 22 (S.J.
Karcher, editor). Proceedings of the 22nd Workshop/Conference of the Association for
Biology Laboratory Education (ABLE), 489 pages.
Hershberger, R.P. 2000. What I could teach Darwin using “Darwin 2000,” an interactive web
site for student research into the evolution of genes and proteins. Pages 1-32 in Tested
Studies for Laboratory Teaching, Volume 21 (S.J. Karcher, Editor). Proceedings of the 21st
Workshop/Conference of the Association for Biology Laboratory Education (ABLE), 509
pages.
Howard, K. 2000. The bioinformatics gold rush. Scientific American July, 2000. Accessed
online (http://www.sciam.com) July, 2000.
Lim, H. 2000. Bioinformatics in the pre- and post-genomic eras. Trends in Biotechnology (April)
18: 133-135.
Reed, J. 2000. Trends in commercial bioinformatics (March). Accessed online
(http://www.oscargruss.com/reports.htm) July, 2000.

Appendix A: Sample GenBank Record

Appendix A continued: GenBank record file-- definitions

Feature Key Names

Appendix B: Study Guide

BLAST (what does this stand for?)

Multiple Sequence Alignment (MSA) using ClustalW

3. What you need to look for in the Virtual Bactieria ID Lab

SAMPLE A:__ SAMPLE B:

SAMPLE C:__ SAMPLE D:

SAMPLE E:__ SAMPLE F:_

SAMPLE A:__ SAMPLE B:

SAMPLE C:__ SAMPLE D:

SAMPLE E:__ SAMPLE F:_