Bioinformatics
Biot 2083
By: Sefinew Tilahun (MSc in Biotechnology)
Introduction
• Biologists have been dealing with problems of information management since the
17th century, when they started to catalogue different animal and plant species
•Taxonomy (the practice of classifying organisms) was the first informatics problem
in biology.
– Why is it important to classify organisms into particular species?
• Individuals of species that are useful to humans as sources of food and medicine can be
identified.
• Taxonomy also enables us to glimpse the evolutionary history of life on earth: it lets us construct
an evolutionary tree, or phylogeny, inferring which organisms evolved from which others.
– However, some living things are more closely related than others, and explaining this requires
understanding how genetic information shapes life and drives its evolution.
• Genetic information is stored in the cell in the form
of biological macromolecules, such as nucleic acids
and proteins.
• The genetic information not only drives the functioning of the
whole organism, but also drives the evolutionary engine.
• In the genomic era, large numbers of genomes have been
sequenced
– Many interesting problems arise out of sequence analysis
– For some, the terms bioinformatics and computational biology have become
interchangeable; for others, there is a sharp distinction between them.
• In 1960, Margaret Dayhoff, Richard Eck, and Robert Ledley began the pioneering work
of computer-aided analysis of protein data.
– They capitalized on their experience and training in computing, mathematics, and life
sciences in collecting and organizing protein sequences, sequence analysis, and studies
of protein evolution.
– Their work could be regarded as the direct ancestor of modern bioinformatics.
•In 1965, Dayhoff, Eck, and a couple of colleagues compiled the first Atlas of
Protein Sequence and Structure, which had 50 sequences known at that time.
– This compilation of protein sequence and structure information was the predecessor of
the current gene and protein databases that form the backbone of contemporary
bioinformatics.
– Eventually, in 1972 this database became the Protein Information Resource (PIR)
database, now maintained at Georgetown University.
• Margaret Dayhoff, as an independent researcher, brought her
background in mathematics, chemistry, and computing to address
problems in biology, particularly protein chemistry, and became the
pioneer in the application of mathematical and computational
methods to biochemistry
– One of her most important contributions was developing,
together with Richard Eck, the single-letter code for amino acids
that is used by all protein analysis tools.
– She developed a computer algorithm for protein-sequence
alignment, which was (correctly) thought to reveal their
evolutionary history.
• In 1961, Richard Eck compared all the sequences of hemoglobin
– He realized that the information on amino-acid sequence could be organized in different ways in order
– He also identified numerous amino-acid substitutions in proteins and noted that the pattern of
• At a conference in 1964, Eck presented a cryptogrammic method to trace the evolution of
proteins.
– He suggested that, using this result, one could calculate the degree of relatedness of each protein with
reference to its ancestors, and draw a family tree in which the distances between the branches
represented a quantitative measure of relatedness. Thus, Eck outlined the basis for reconstructing a
phylogenetic tree
• Robert Ledley envisioned an important application of computers to sequence
analysis.
– He suggested that once a polypeptide chain is cut into many overlapping
fragments, whose sequences can be determined by peptide sequencing, the
partial sequences could be reassembled into the full sequence
using a computer.
• In 1960, Dayhoff and Ledley wrote FORTRAN programs that could direct the
assembly of partial peptide sequences in the right order in less than 5 minutes
•Dayhoff published the first reconstruction of a phylogenetic tree using a
substitution matrix for studying protein evolution, called the PAM matrix.
Molecular medicine
• The human genome will have profound effects on the fields of
biomedical research and clinical medicine.
– More specific medicines that act on the cause, not merely the
symptoms, of the disease can be developed
• These highly specific drugs promise to have fewer side effects than
many of today’s medicines.
Microbial genome applications
• The arrival of the complete genome sequences and their potential
to provide a greater insight into the microbial world and its
capacities could have broad and far reaching implications for
environment, health, energy and industrial applications.
• For these reasons, in 1994, the US Department of Energy (DOE)
initiated the MGP (Microbial Genome Project)
– To sequence genomes of bacteria useful in energy production,
environmental cleanup, industrial processing and toxic waste
reduction.
• By studying the genetic material of these organisms, scientists
can begin
– To understand these microbes at a very fundamental level and
– Isolate the genes that give them their unique abilities to survive
under extreme conditions.
Waste cleanup
• Deinococcus radiodurans is known as the world’s toughest bacterium, and it
is the most radiation-resistant organism known.
• Scientists are interested in this organism because of its potential usefulness
in cleaning up waste sites that contain radiation and toxic chemicals
Climate change studies
•The human body is composed of around 60 trillion cells and about 320
different cell types, each having a different type of function or structural
property.
– Prokaryotes are smaller than eukaryotic cells, have a simpler
structure, and are single-celled organisms (but not all single-celled
organisms are prokaryotes)
So what is the difference between the two types of cells?
– Inside the nucleus are the chromosomes, where all of the genetic
intergenic DNA
• In eukaryotes, it is more complicated. Non-coding DNA is
found not only between genes but also within genes, as introns
• Introns are not totally absent from prokaryotes, but they are
extremely rare
• Moreover, there is usually only a single intron in a gene, unlike in
eukaryotes where many genes have multiple introns.
• Most known examples are within the genes of bacteriophages
• Molecular biology has clarified some aspects of the C-value
paradox.
• The range in C-values does not correlate well with the
complexity of the organism or its number of genes.
– There is a tendency for species with higher C-values to
have higher proportions of repetitive DNA
• The flow of genetic information:
– DNA directs the synthesis of RNA, and RNA then in turn directs the
synthesis of Protein.
• This flow of genetic information from nucleic acids to protein has
been called the central dogma of molecular biology
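The DNA → RNA → protein flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a full model of gene expression. It assumes the input is the coding (sense) strand, so transcription reduces to replacing T with U, and it assumes translation starts at the first AUG and stops at the first in-frame stop codon. The names `transcribe`, `translate`, and `CODON_TABLE` are chosen here for illustration.

```python
from itertools import product

# Standard genetic code written in the conventional U, C, A, G codon order;
# amino acids use the single-letter code and '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")
print(mrna)             # AUGGCCUGGUAA
print(translate(mrna))  # MAW
```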
• The term transcriptome refers to the complete collection of all
possible mRNAs (including splice variants) of an organism.
– This can be thought of as the regions of an organism’s genome
that get transcribed into messenger RNA.
• In some cases, the transcriptome can be extended to include all
transcribed elements, including non-coding RNAs used for
structural and regulatory purposes.
• The term proteome refers to the complete collection of proteins
that can be produced by an organism.
– The proteome can be studied either as a static (sum of all
proteins possible) or a dynamic (all proteins found at a specific
time point) entity.
Genetic Code
• Since there are 4 possible bases (A, C, G, U) and 3 bases in the
codon, there are 4 * 4 * 4 = 64 possible codon sequences.
• However, the codon AUG can also be used as a signal to initiate
translation, while the codons UAA, UAG, and UGA are terminal
codons signaling the end of translation.
– That leaves 61 codons that can code for amino
acids (AUG also codes for an amino acid, methionine).
• However, there are only 20 amino acids.
• Therefore, the genetic code is redundant, meaning that a single
amino acid could be coded for by several different codons.
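The codon arithmetic above can be checked by enumerating every triplet. The sketch below is only a counting exercise; the variable names are illustrative.

```python
from itertools import product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered triplet of the four bases: 4 * 4 * 4 combinations.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Codons that specify an amino acid (everything except the three stops).
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons))        # 64
print(len(sense_codons))      # 61
print("AUG" in sense_codons)  # True: the start codon also codes for an amino acid
```

With 61 sense codons and only 20 amino acids, most amino acids must be specified by more than one codon, which is the redundancy described above.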
Amino Acids
• Amino acids are the building blocks from which proteins are
made.
• There are 20 different amino acids that vary from each other by
their side chain groups.
• Amino acids are linked to one another via a single chemical bond,
called a peptide bond.
• A linear chain of amino acids can be referred to as a peptide (if it
is short – less than 30 a.a. long) or polypeptide (which can be
upwards of 4000 residues long).
Reading Frames and open reading frames
• A given double-stranded DNA sequence has six reading frames: three on
the forward strand and three on the reverse strand.
• Theoretically, the different reading frames give entirely different
proteins
• The reading frame used for protein synthesis (ORF: Open Reading
Frame) is determined by the position of the initiation codon
• ORF is any continuous reading frame that starts with a start codon
and ends with a stop codon.
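A simple six-frame ORF scan following the definition above might look like the sketch below. It assumes the input is one strand written 5’ to 3’, uses ATG as the only start codon, and reports each ORF from its first ATG to the next in-frame stop. The function names are illustrative, and real gene finders apply further criteria such as a minimum length.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna):
    """Return (strand, frame, sequence) for each ATG...stop in all six frames."""
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the frame
                elif start is not None and codon in STOP_CODONS:
                    orfs.append((strand, frame, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGGG"))  # [('+', 2, 'ATGAAATAG')]
```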
Mutation
• Alterations in the sequence of DNA are known as mutations.
• Mutations can have harmful effects, even death
• They can also have beneficial results, or they can be neutral.
• The slow accumulation of such changes is responsible for the
process known as evolution.
• In single-celled organisms, mutations are passed on from one
generation to the next when the organism divides
• In multicellular organisms, mutations are passed on to the next
generation only if they occur in the germ line cells and
are passed on during sexual reproduction.
• Mutations that occur in somatic cells will only be passed on to the
descendants of those cells
• Such mutant cell lines will be restricted to the original
multicellular organism where the mutation occurred
Types of Mutation
• Point mutations: changes in one or a few nucleotides
– Substitution
– Insertion
– Deletion
Silent mutation
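The three kinds of point mutation can be illustrated with plain string operations. In this sketch, positions are 0-based, the example sequence is invented for illustration, and `codons` simply splits a sequence into complete triplets so the effect on the reading frame is visible.

```python
def substitute(dna, pos, base):
    """Replace the base at pos with another base."""
    return dna[:pos] + base + dna[pos + 1:]


def insert(dna, pos, bases):
    """Insert one or more bases before position pos."""
    return dna[:pos] + bases + dna[pos:]


def delete(dna, pos, length=1):
    """Remove length bases starting at pos."""
    return dna[:pos] + dna[pos + length:]


def codons(dna):
    """Split a sequence into complete triplets, starting at position 0."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


original = "ATGGCCAAATGA"
print(codons(original))                      # ['ATG', 'GCC', 'AAA', 'TGA']
# GCC -> GCT still codes for alanine, so this substitution is silent.
print(codons(substitute(original, 5, "T")))  # ['ATG', 'GCT', 'AAA', 'TGA']
# A single-base insertion or deletion shifts every downstream codon.
print(codons(insert(original, 3, "C")))      # ['ATG', 'CGC', 'CAA', 'ATG']
print(codons(delete(original, 3)))           # ['ATG', 'CCA', 'AAT']
```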
DNA sequencing
• If you just digest DNA into its four component bases and measure
the quantity of each, it tells you the base composition but not the DNA sequence.
Individuals differ in the base sequence of their DNA
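A quick way to see what measuring base quantities alone does and does not capture is to compare two made-up sequences with the same composition. The sequences and the function name below are illustrative only.

```python
from collections import Counter


def base_composition(dna):
    """Count how many times each base occurs, ignoring order."""
    return Counter(dna.upper())


seq_a = "ATGCGATACG"
seq_b = "GCATATCGGA"

print(base_composition(seq_a) == base_composition(seq_b))  # True: same base counts
print(seq_a == seq_b)                                      # False: different sequences
```

Both sequences contain three A, three G, two C, and two T, yet their base order differs, so the counts alone cannot distinguish them.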
Common DNA sequencing techniques
1. Maxam-Gilbert chemical degradation method
2. Sanger dideoxy (chain-termination) method
3. Automated sequencing
4. Next generation sequencing
– Pyrosequencing (1996)
– 454 technology
Maxam-Gilbert Chemical degradation method
• Designed by Allan Maxam and Walter Gilbert in 1977
• This method uses specific chemicals to modify individual DNA
bases or sets of bases prior to cleavage of the sugar–phosphate
backbone with piperidine at the modified bases
• It uses ss or dsDNA radiolabelled at either the 5’ end using
polynucleotide kinase or the 3’ end by terminal transferase (why is
this needed?)
• After cleavage it generates a set of fragments that differ by
at least one nucleotide in each reaction tube
Involves the following steps
1. Label either the 5’ end of the DNA with 32P or use a radiolabelled
phosphate-containing nucleotide at the 3’ end
2. Separate the labeled strands (denature)
3. Divide the mixture into four samples and treat each with a different
chemical that destroys specific bases:
– A and G with dimethyl sulfate or formic acid
– Only G with dimethyl sulfate and piperidine
– T and C with hydrazine (under alkaline conditions)
– Only C with hydrazine + 1 M NaCl
4. Run each of the four samples by electrophoresis in a separate lane of the
gel
5. Autoradiography
• Limitations: not widely used, time-consuming, and expensive
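The logic of reading a sequence from the four reaction lanes can be sketched as a toy simulation. It assumes an idealized experiment: a 5’-labelled strand, exactly one cleavage per molecule, and every position cleaved in some molecule. Cleavage at 0-based position i is modelled as leaving a labelled fragment of i bases. The lane names and functions are illustrative, and no chemistry is modelled.

```python
# Bases destroyed in each of the four reactions.
LANES = {"G": "G", "A+G": "AG", "C+T": "CT", "C": "C"}


def cleave(dna):
    """Return, per lane, the set of labelled fragment lengths produced."""
    return {
        lane: {i for i, base in enumerate(dna) if base in targets}
        for lane, targets in LANES.items()
    }


def read_gel(lanes, length):
    """Read bases from the shortest fragment to the longest."""
    sequence = []
    for i in range(length):
        if i in lanes["G"]:
            sequence.append("G")  # band in both the G and A+G lanes
        elif i in lanes["A+G"]:
            sequence.append("A")  # band in A+G only
        elif i in lanes["C"]:
            sequence.append("C")  # band in both the C and C+T lanes
        elif i in lanes["C+T"]:
            sequence.append("T")  # band in C+T only
        else:
            sequence.append("N")  # no band at this length
    return "".join(sequence)


lanes = cleave("GATTACA")
print(read_gel(lanes, 7))  # GATTACA
```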
Sanger dideoxy (primer extension/chain-termination) method
Automated sequencing
• Developed in 1990
• It is an improvement of the Sanger method in that it uses a
different fluorescent dye attached to each of the four ddNTPs
• Performed in a single tube with differently tagged ddNTPs
• This avoids the hazards of radioactive isotopes and
reduces the time needed for sequencing
• The fluorescent tags are attached to the chain-terminating
nucleotides
• Sequence data is obtained in real time by detecting the DNA bands
within the gel during the electrophoretic separation
• Each of the four dideoxynucleotides carries a spectrally different
fluorophore
• The DNA bands are detected by their fluorescence
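The idea of reading a sequence from dye-labelled chain terminators can be sketched as follows. This is an idealized model: it assumes that one terminated fragment exists for every length, that fragments separate perfectly by size, and that the dye colours are an arbitrary assignment made up for this example. The read-out is the newly synthesized strand, which is the reverse complement of the template.

```python
import random

# Hypothetical colour assignment; real instruments use their own dye sets.
DYES = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYES.items()}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def terminated_fragments(template):
    """Return (length, dye) for every chain-terminated copy of the template."""
    new_strand = template.translate(COMPLEMENT)[::-1]  # synthesized 5' to 3'
    fragments = [
        (n, DYES[new_strand[n - 1]]) for n in range(1, len(new_strand) + 1)
    ]
    random.shuffle(fragments)  # all fragments are mixed in a single tube
    return fragments


def read_trace(fragments):
    """Order fragments by size (electrophoresis) and read the terminal dyes."""
    return "".join(BASE_FOR_DYE[dye] for _, dye in sorted(fragments))


print(read_trace(terminated_fragments("AAGCT")))  # AGCTT
```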
Genome sequencing
• How can we sequence a whole genome?
• Current sequencing methods such as chain-termination
sequencing can read only up to about 750 bp at a time.
• But the total size of a typical bacterial genome is 4,000,000 bp and
the human genome is >3 billion bp
• Therefore, the whole genome needs to be fragmented into pieces that are
cloned into a vector and then sequenced
• The problem then is how to reconstruct the original genome
sequence based on the small fragments that are cloned into
individual vectors?
• Several basic approaches have been used so far:
Clone Contigs
• This method works by generating overlapping DNA fragments
cloned into a vector
• One clone is isolated from a library and sequenced;
a second clone, whose insert overlaps with the
first, is then identified by hybridization.
• The second clone is then sequenced and the information used
to identify a third clone, whose insert overlaps with the second
clone, and so on.
• This is the basis of chromosome walking. However, this
method is laborious.
• A single clone has to be isolated and sequenced before the
next overlapping clone can be found.
• It involves much more work and so takes longer and costs
more money.
• Additional time and effort is needed to construct the
overlapping series of cloned DNA fragments.
Whole genome shotgun sequencing
• The fragments of the genome, which have been randomly
generated, are cloned into a vector and each insert is sequenced.
• The sequences are then examined for overlaps and the genome is
reconstructed by assembling the overlapping sequences together
using bioinformatics software.
• This approach was first used to sequence the genome of the
bacterium Haemophilus influenzae
• The entire genome of the organism was randomly fragmented using
sonication and then small fragments (in the range of 1.5–2 kbp)
were cloned into a vector (pUC18). The resulting library consisted
of approximately 20,000 individual clones.
• Each of these was then sequenced to generate approximately 12
million base pairs of sequence information (six times the length
of the H. influenzae genome).
• The sequence obtained from each clone was then assembled into
contigs based on the overlaps between the individual clones.
• Shotgun sequencing is problematic for eukaryotic genomes because
a read from one repeat element might mistakenly be assigned an overlap
with the identical sequence present in a different repeat element
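The overlap-and-merge step of shotgun assembly can be sketched with a small greedy algorithm. This is a toy illustration, not how production assemblers work. It assumes error-free reads from a single strand, repeatedly merges the pair with the longest suffix-prefix overlap, and uses made-up reads. The function names and the `min_len` threshold are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


reads = ["GCCGGAATAC", "ATTAGACCTG", "CCTGCCGGAA"]
print(greedy_assemble(reads))  # ['ATTAGACCTGCCGGAATAC']
```

Because the merge decision relies only on matching sequence, two reads from different copies of a repeat look like a valid overlap, which is the misassembly risk described above.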
Hierarchical shotgun
• The whole genome shotgun approach to contig assembly has proved to
be successful in sequencing comparatively small genomes.
• The majority of bacterial genomes can be sequenced by this
method.
• For larger genomes, however, assembly of contigs with this method
is problematic
• But it could be greatly simplified if the genomic DNA is first
broken up into a series of overlapping large clones such as those
produced by cloning into BACs.
• A library of smaller clones is then produced from each BAC and
subjected to shotgun sequencing
• The hierarchical approach provides a mechanism to relatively
easily construct assembled contigs for a particular part of the
genome
Protein sequencing