Bioinformatics

Biot 2083
By: Sefinew Tilahun (MSc in Biotechnology)
Introduction
•Biologists have been dealing with problems of information management since the
17th century, when they started to catalogue different animal and plant species.
•Taxonomy (the practice of classifying organisms) was the first informatics problem
in biology.
– Why is it important to classify organisms into particular species?
• Species that are useful to humans as sources of food and medicine can be
identified.
• Taxonomy also enables us to glimpse the evolutionary history of life on earth: it lets us construct
an evolutionary tree, or phylogeny, inferring which organisms evolved from which other
ones, in what order, and when.

•Biologists were faced with the problem of how to organize, access, and sensibly
add new members to the existing information
– Previously, simple physical characteristics of organisms were used to identify and
differentiate between species
– However, some living things are more closely related than others and
yet exhibit different characteristics
•Thus, an understanding of the molecular basis of life is fundamental to
understanding how genetic information shapes life and drives its evolution.
•Genetic information is stored in the cell in the form
of biological macromolecules, such as nucleic acids
and proteins.
• The genetic information not only drives the functioning of the
whole organism, but also drives the evolutionary engine.
• In the genomic era, large numbers of genomes have been
sequenced
– Many interesting problems arise out of sequence analysis
• Once we have the sequences, large DNA sequences need to be parsed
into their components (genes, transcription units, protein-coding regions,
regulatory elements, and so forth).
• This is followed by genome annotation, where the biological functions of
these various elements are (more or less tentatively) predicted.
• All of this requires databases, tools and analysis methods
What is Bioinformatics?
•Defining the terms bioinformatics and computational biology is not
necessarily an easy task.
– For some, the terms bioinformatics and computational biology have become
completely interchangeable, while for others there is a great
distinction.
•Bioinformatics is highly interdisciplinary, requiring at least mathematical,
statistical, biological, physical, and chemical knowledge, and its
implementation may furthermore require knowledge of computer science,
chemical engineering, biotechnology, medicine, pharmacology, etc.

• Multiple definitions are available in the literature and on the web because of
the integrative nature of bioinformatics
• Put simply, bioinformatics is a union of biology and informatics

• The term “bioinformatics” came into use in the mid-1980s to
describe the application of information science and
technology in the life sciences.
– It was very general, covering everything from robotics to artificial
intelligence.
• Later, it was defined as “the use of computers to retrieve, process, analyze,
and simulate biological information”.
• Biological information is related to biological macromolecules such as
DNA, RNA, and proteins.
• It could also be defined as an integration of mathematical, statistical
and computer methods to analyze biological, biochemical and
biophysical data
• An even narrower definition was “the application of information
technology to the management of biological data”.
– Such definitions fail to capture the centrality of information in
biology.
•A more appropriate definition of bioinformatics is, therefore, “the
science of how information is generated, transmitted, received,
stored, processed and interpreted in biological systems”
•Although it began with sequence comparison (which is a sub-
branch of the study of the non-randomness of DNA sequences), it
now encompasses a far wider spread of activity, which truly
epitomizes modern scientific research.
–It has become an extremely active research field.

• Bioinformatics gained prominence due to the
advancement of genome and proteome studies that produced
unprecedented amounts of biological data
• Bioinformatics is thus well positioned to revive
consideration of the central question “what is life?”
History of Bioinformatics
•In 1960, Margaret Dayhoff, Richard Eck, and Robert Ledley began pioneering work on the
computer-aided analysis of protein data.
– They capitalized on their experience and training in computing, mathematics, and life
sciences in collecting and organizing protein sequences, sequence analysis, and studies
of protein evolution.
– Their work could be regarded as the direct ancestor of modern bioinformatics.
•In 1965, Dayhoff, Eck, and a couple of colleagues compiled the first Atlas of
Protein Sequence and Structure, which contained the 65 protein sequences known at that time.
– This compilation of protein sequence and structure information was the predecessor of
the current gene and protein databases that form the backbone of contemporary
bioinformatics.
– Eventually, in 1984, this database became the Protein Information Resource (PIR)
database, now maintained at Georgetown University.
• Margaret Dayhoff, as an independent researcher, brought her
background in mathematics, chemistry, and computing to address
problems in biology, particularly protein chemistry, and became the
pioneer in the application of mathematical and computational
methods to biochemistry
– One of her most important contributions was developing,
together with Richard Eck, the single-letter code for amino acids
that is used by all protein analysis tools.
– She developed a computer algorithm for protein-sequence
alignment, which was (correctly) thought to reveal the
evolutionary history of the proteins.
•In 1961, Richard Eck compared all the sequences of hemoglobin
variants, and of other proteins such as insulin, from different species.
– He realized that the information on amino-acid sequence could be organized in different ways in order
to produce specific patterns
– He also identified numerous amino-acid substitutions in proteins and noted that the pattern of
substitutions was not random.
•At a conference in 1964, Eck presented a cryptogrammic method to trace the evolution of
proteins.
– He suggested that, using this result, one could calculate the degree of relatedness of each protein with
reference to its ancestors, and draw a family tree in which the distances between the branches
represented a quantitative measure of relatedness. Thus, Eck outlined the basis of reconstruction of a
phylogenetic tree
•Robert Ledley envisioned an important application of computers to sequence
analysis.
– He suggested that after the polypeptide chain is cut into many overlapping
fragments, whose sequences can be determined by peptide sequencing, the
reassembly of the partial sequences into the full sequence could be
done by computer.
– Thus, Ledley suggested that computers could assist biochemists in their
efforts to determine protein sequences
•In 1960 Dayhoff and Ledley wrote FORTRAN programs that could direct the
assembly of partial peptide sequences in the right order in less than 5 minutes
•Dayhoff published the first reconstruction of a phylogenetic tree using a
maximum parsimony method. She also developed the first amino-acid
substitution matrix for studying protein evolution, called the PAM matrix.
– PAM stands for point accepted mutation (also referred to as percent
accepted mutation) because it represents accepted point mutation per 100
amino acid residues.
•A publication by Dayhoff, entitled Computer Analysis of Protein
Evolution, can be regarded as one of the most important initial
publications in bioinformatics and molecular phylogenetics.

•For her enormous pioneering contributions, Margaret Dayhoff is popularly
regarded as the founder of modern bioinformatics
•In 1970 the first sequence alignment algorithm was developed by Needleman and
Wunsch
•The Protein Data Bank (PDB) of X-ray and NMR structures was established in the
early 1970s by Brookhaven National Laboratory
•The first protein structure prediction algorithm was developed by Chou and
Fasman in 1974
•In 1977 the Maxam-Gilbert and Sanger DNA sequencing methods were introduced
•In 1978 the term bioinformatics was coined by Paulien Hogeweg and Ben Hesper
as “the study of information processes in biotic systems”
• The first DNA database was established in 1979
• In the 1980s
– GenBank established, Human Genome Project started, fast database searching
algorithms developed
• FASTA by William Pearson
• BLAST by Stephen Altschul and coworkers
– Development of principles of sequence alignment

• Prediction of RNA secondary structure
• Prediction of protein secondary and 3D structure
• In the 1990s: prediction of genes and studies of complete genome sequences

• As of Feb. 2015, the number of deposited sequences had increased to more than 187
million
Applications of Bioinformatics
• Bioinformatics plays a vital role in the areas of structural
genomics, functional genomics, and nutritional genomics
• It covers emerging scientific research and the exploration of
proteomes
• Bioinformatics is used for transcriptome analysis where mRNA
expression levels can be determined.
• Bioinformatics is used
– To identify and structurally modify a natural product
– To design a compound with the desired properties
– To assess its therapeutic effects theoretically
• Cheminformatics analysis includes similarity searching,
clustering, modeling, virtual screening, etc.
• Bioinformatics tools are very effective in the prediction, analysis
and interpretation of clinical and preclinical findings.
• Bioinformatics as applied to DNA sequences can be exploited to find
– Individual genes in the form of protein-coding
sequences (exons),
– Expanses of nucleotides that might interrupt gene regions
(introns),
– Domains within the DNA that might control the expression of
individual genes (e.g., promoters, enhancers, silencers, splice
sites),
– Repeated elements (insertion sequences and transposons in
prokaryotes; micro- and mini-satellites in eukaryotic genomes),
– Three elements important for chromosome and gene
maintenance (origins of replication, centromeres and telomeres)
• For proteins, important applications of bioinformatics include
identifying important domains within polypeptides, such as
– Catalytic active sites,
– Substrate binding sites, and
– Regions of protein-protein interaction,
as well as the prediction of protein-folding pathways
Applications in Other Fields
Molecular medicine
• The human genome sequence will have profound effects on the fields of
biomedical research and clinical medicine.
• With the completion of the human genome and the use of
bioinformatics tools
– We can search for the genes directly associated with different
diseases and
– Begin to understand the molecular basis of these diseases more
clearly.
• This new knowledge of the molecular mechanisms of disease will
enable better treatments, cures and even preventative tests to be
developed.
Personalized medicine
• Clinical medicine will become more personalized with the
development of the field of pharmacogenomics.
• This is the study of how an individual’s genetic inheritance affects
the body’s response to drugs.
• Today, doctors have to use trial and error to find the best drug to treat
a particular patient as those with the same clinical symptoms can
show a wide range of responses to the same treatment.
• In the future, doctors will be able to analyze a patient’s genetic
profile and prescribe the best available drug therapy and dosage from
the beginning.
Preventative medicine
• With the specific details of the genetic mechanisms of diseases
being unraveled, the development of diagnostic tests to measure a
person’s susceptibility to different diseases may become a distinct
reality.
Gene therapy
• In the not too distant future, with the use of bioinformatics tools, the
potential for using genes themselves to treat disease may become a
reality.
• Gene therapy is the approach used to treat, cure or even prevent
disease by changing the expression of a person’s genes.
Drug development
• Bioinformatics is playing an increasingly important role in almost all
aspects of drug discovery and drug development.
• At present all drugs on the market target only about 500 proteins.
• With an improved understanding of disease mechanisms, and using
computational tools to identify and validate new drug targets,
– More specific medicines that act on the cause, not merely the
symptoms, of the disease can be developed
• These highly specific drugs promise to have fewer side effects than
many of today’s medicines.
Microbial genome applications
• The arrival of the complete genome sequences and their potential
to provide a greater insight into the microbial world and its
capacities could have broad and far reaching implications for
environment, health, energy and industrial applications.
• For these reasons, in 1994, the US Department of Energy (DOE)
initiated the MGP (Microbial Genome Project)
– To sequence genomes of bacteria useful in energy production,
environmental cleanup, industrial processing and toxic waste
reduction.
• By studying the genetic material of these organisms, scientists
can begin
– To understand these microbes at a very fundamental level and
– Isolate the genes that give them their unique abilities to survive
under extreme conditions.
Waste cleanup
• Deinococcus radiodurans is known as the world’s toughest bacterium and
is the most radiation-resistant organism known.
• Scientists are interested in this organism because of its potential usefulness
in cleaning up waste sites that contain radiation and toxic chemicals
Climate change Studies
• Increasing levels of carbon dioxide emission, mainly through the
expanding use of fossil fuels for energy, are thought to contribute to global
climate change.
• Recently, the DOE (Department of Energy, USA) launched a program to
decrease atmospheric carbon dioxide levels.
• One method of doing so is to study the genomes of microbes that
use carbon dioxide as their sole carbon source.
Alternative energy sources
• Scientists are studying the genome of the microbe Chlorobium
tepidum which has an unusual capacity for generating energy from
light
Goals of bioinformatic analysis
• The ultimate goal of bioinformatics is to be able
to predict the biological processes in health and disease.
• In order to acquire such an ability, a thorough understanding of the
biological processes is necessary.
– Functions of a cell can be better understood by analyzing sequence
data
– Cellular functions are mainly performed by proteins whose capabilities
are ultimately determined by their sequences.
• Therefore, the proximate goal of bioinformatics is to develop such an
understanding through
– Analysis and integration of the information obtained on genes and
proteins
– Development of new tools and continuous improvement of the existing set
of tools for diverse types of analysis
• Bioinformatics has three constituents.
1. Database creation to enable the storage and management of
large collections of biological data.
2. Algorithm development to specify relationships between data sets.
3. Analysis and interpretation of biological data sets, such as DNA, RNA
and protein sequences, structures, gene expression profiles and
biochemical pathways, using these tools.
• The most common query is:
– ‘I have determined a new sequence, or structure;
what do the databanks contain that is like it?’
• Once a set of sequences or structures similar to the probe object is
fished out of the appropriate database, the researcher is in a
position to identify and investigate their common features.
• Tools for sequence analysis include
– Sequence alignment, sequence database searching, motif and
pattern discovery, gene and promoter finding, reconstruction of
evolutionary relationships, genome assembly and comparison
• Tools for structural analysis include
– Protein and nucleic acid structural analysis, comparison,
classification and prediction
• Tools for functional analysis include
– Gene expression profiling, protein-protein interaction prediction,
protein sub-cellular location prediction, metabolic pathway
reconstruction, construction and curation of biological databases
Levels of bioinformatics analysis
1. Analysis of a single gene (protein) sequence
– Similarity with other known genes
– Identification of well-defined domains in the sequence
– Prediction of secondary and tertiary structure
2. Analysis of complete genomes (genomics)
– Which gene families are present, which are missing?
– Searching the location of genes on the chromosomes
– Expansion/duplication of gene families
– Identification of "missing" genes and hence products
3. Sequence structure analysis
– Protein structure analysis
• Remember that structure determines function
4. Analysis of genes and genomes with respect to functional data
– Expression analysis
– Proteomics, protein concentration measurements, covalent modifications
– Comparison and analysis of biochemical pathways

Limitations of bioinformatics
• It is important to know the limitations of bioinformatics, to avoid over-reliance on and
over-expectation of its output
– It depends on experimental science to produce raw data
– Bioinformatics tools do not replace traditional experimental research
methods but complement them
– Quality of predictions depends on the quality of data and
algorithms used
– If the sequences are wrong or the annotations incorrect, the results
of analysis will be misleading as well.
E.g. errors in the sequence affect the alignment result as well as
the outcome of structural or phylogenetic analysis.
• Outcome of computation also depends on the computing power
available.
• Many accurate but exhaustive algorithms cannot be used because of
the slow rate of computation. Instead, less accurate but faster
algorithms have to be used.
• Caution should always be exercised when interpreting prediction
results
• It is a good practice to use multiple programs if available for
confirmation
Chapter Two
Fundamentals of Genetic Information
Cells
•Every organism is made up of tiny structures called cells.
– Each cell is in itself a complex system enclosed in a membrane.
•The human body is composed of around 60 trillion cells and about 320
different cell types, each having a different type of function or structural
property.
•There are two types of organisms based on their cell type:
– Eukaryotes (which represent most of the organisms we can see,
including plants and animals)
– Prokaryotes (which have smaller cells with a simpler structure than
eukaryotes and are single-celled organisms, although not all
single-celled organisms are prokaryotes)
So what is the difference between the two types of cells?
•A eukaryotic cell has a nucleus, which is separated from the rest
of the cell by a membrane.
– Inside the nucleus are the chromosomes, where all of the genetic
information for the organism is stored.
– Chromosomes are very long DNA molecules packaged with proteins
– Chromosomes are rod-shaped or circular filamentous bodies present in
the nucleus (eukaryotes) or the nucleoid region (prokaryotes)
•In addition, eukaryotic cells contain organelles with various
functions, many of them membrane-bound, including centrioles,
lysosomes, mitochondria, ribosomes, etc.
•Contained within the nucleus are one or several long double-stranded DNA
molecules organized as chromosomes.
– For humans, there are 22 pairs of autosomes, as well as one
pair of sex chromosomes.

DNA
• DNA is the basis for the building blocks encoding the information of
life. It is sometimes referred to as “the blueprint of life”
• A single-stranded DNA molecule, called a polynucleotide or
oligomer, is a chain of small molecules called nucleotides.
• There are four different nucleotides, distinguished by their bases: adenine (A),
cytosine (C), guanine (G) and thymine (T); they are broadly
classified as purines (A and G) and pyrimidines (C and T).
• The ends of the polynucleotide are different, meaning that each
polynucleotide sequence will have a directionality.
– The ends of the polynucleotide are marked either 3’ or 5’.
– The general convention is to label the coding strand from 5’ to
3’ (left to right).
• DNA can be either single-stranded or double-stranded.
• When DNA is double-stranded, the second strand is referred to as
the reverse complement strand because it runs in the opposite
direction to the first
– The bases of the second strand are complementary to the bases in the first.
• In the case of DNA, A binds to T, and C binds to G.
• Two complementary polynucleotide chains form a stable structure
known as the DNA double helix.
– The double helix structure of DNA was discovered by Watson,
Crick and Franklin.
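The pairing rules above (A with T, C with G, strands running in opposite directions) can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the example sequence are hypothetical and not part of the course material.

```python
# Watson-Crick pairing rules from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the second strand of a DNA duplex, written 5' to 3'."""
    # Reverse first (the strands are antiparallel), then complement each base.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

strand = "ATGCGT"                      # hypothetical sequence, written 5' -> 3'
print(reverse_complement(strand))      # ACGCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.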
RNA
•It is similar to DNA in the fact that it is constructed from nucleotides.
– However, instead of thymine (T), an alternative base uracil (U) is found in RNA.
•RNA can be found as double-stranded or single-stranded, and can also be
part of a hybrid helix where one strand is an RNA strand and the other is a
DNA strand.
•RNA is generally found as a single-stranded molecule that may form
secondary or tertiary structures through pairing of complementary bases
between parts of the same strand.
•One of the most important roles of RNA is in protein synthesis.
•Two of the major RNA molecules involved in protein synthesis are
– messenger RNA (mRNA) and
– transfer RNA (tRNA).
mRNA
•It is a linear RNA molecule copied from DNA.
•Transcription is the process in which DNA is copied into an RNA molecule.
•mRNA encodes the genetic information as copied from the DNA molecules.
• In eukaryotic cells, before the mRNA can be translated into a
protein, it needs to be modified, a process known as
post-transcriptional modification.
• Most eukaryotic genes contain coding regions,
called exons, and noncoding regions, called introns.
• One of the steps in processing the mRNA is to remove the intronic
regions and to splice together the coding, or exonic, regions.
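Transcription and splicing as described above can be sketched as simple string operations. This is a toy model under stated assumptions: the gene, its exon and intron segments and the function names are all hypothetical, and real splicing depends on splice-site signals that are not modelled here.

```python
def transcribe(coding_strand: str) -> str:
    """Write the RNA copy of a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")

def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments, given as (start, end) slices; introns are dropped."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical toy gene: exon 1, one intron, exon 2.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "GAATAA"
gene = exon1 + intron + exon2
exon_coords = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

pre_mrna = transcribe(gene)
mature_mrna = splice(pre_mrna, exon_coords)
print(pre_mrna)       # AUGGCCGUAAGUUUCAGGAAUAA
print(mature_mrna)    # AUGGCCGAAUAA
```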
tRNA
• Attached to each tRNA molecule is an amino acid.
• The amino acid to be attached is determined by a three base
sequence called an anticodon sequence, which is complementary
to the sequence in the mRNA.
• Translation is the process in which the nucleotide base sequence
of the processed mRNA is used to order and join the amino acids
into a protein with the help of ribosomes and tRNA.
What is a Gene?
• A gene can be thought of as the DNA sequence necessary for the
synthesis of a functional protein or RNA molecule.
Genome, Transcriptome, Proteome
• When the term genome is used, it typically refers to the chromosomal
DNA of an organism.
– The number of chromosomes and the genome size vary quite
significantly from one organism to another.
• Genome size does not reflect gene number
• In fact, many plant genomes are much greater in size than the
human genome.
Figure: Differences in gene topography in four species. Light green =
introns; dark green = exons; white = regions between the coding
sequences (including regulatory regions plus “spacer” DNA).

Genome size and organismal complexity paradoxes
• Genome size (C-value) refers to the total amount of DNA contained
within a haploid nucleus, or one half the amount in a diploid
somatic cell
• It is usually measured in picograms (pg)
– 1 pg of DNA = 0.978 × 10^9 bp = 978 Mb
• The relationship between genome size, the number of proteins
synthesized and organismal complexity is not straightforward
• The confusion over these points has been called the "C-value
paradox"
• What are the confusing facts?
• Generally, at the lower range of complexity, the expected increase of
genome size with complexity holds
– Bacteria have smaller genomes than eukaryotes, and viruses
have smaller genomes than bacteria
• In larger organisms, the relationship breaks down: much of the DNA of
larger organisms is non-coding “junk DNA”
• Is there a direct proportion between the size of DNA and the
number of genes/proteins?
– In eukaryotic organisms we might expect an economical use
of DNA, that most or all of it would code for protein (as in
prokaryotes)
– No! Genome size does not reflect gene number
• In eukaryotes, much more DNA is present in the genome than is
needed to code for the proteins
• In viruses and prokaryotes, the amount of coding DNA increases
linearly with genome size and ~80- 95% of the genome is
coding DNA
• BUT in eukaryotes, up to 98-99% of the genome can be NON-
coding DNA
• Humans, Arabidopsis and nematodes have about the same
number of genes
• Decreased proportion of coding DNA in large eukaryote genomes is due to
increase in proportion of introns and repetitive sequences
• In prokaryotes, almost all the non-coding DNA is found between genes as
intergenic DNA
• In eukaryotes, it is more complicated. The non-coding DNA is
found not only between the genes, but also within genes
• Introns are not totally absent from prokaryotes, but they are
extremely rare
• Moreover, there is usually only a single intron in a gene, unlike in
eukaryotes where many genes have multiple introns.
• Most known examples are within the genes of bacteriophages
• Molecular biology has clarified some aspects of the C-value
paradox:
• The range in C-values does not correlate well with the
complexity of the organism or with its number of genes.
– There is a tendency for species with higher C-values to
have higher proportions of repetitive DNA
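The picogram-to-base-pair conversion quoted earlier for C-values (1 pg of DNA = 0.978 × 10^9 bp) can be sketched as a pair of helper functions. The function names and the 3.5 pg example value are illustrative assumptions, not figures taken from the text.

```python
# Conversion factor quoted in the text: 1 pg of DNA = 0.978 x 10^9 base pairs.
BP_PER_PG = 0.978e9

def pg_to_bp(c_value_pg: float) -> float:
    """Convert a C-value in picograms to a number of base pairs."""
    return c_value_pg * BP_PER_PG

def bp_to_pg(n_bp: float) -> float:
    """Convert a genome size in base pairs to picograms of DNA."""
    return n_bp / BP_PER_PG

c_value = 3.5                                   # illustrative C-value in pg
print(f"{pg_to_bp(c_value) / 1e6:.0f} Mb")      # genome size in megabases
```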
• The flow of genetic information;
– DNA directs the synthesis of RNA, and RNA then in turn directs the
synthesis of Protein.
• This flow of genetic information from nucleic acids to protein has
been called the central dogma of molecular biology
• The term transcriptome refers to the complete collection of all
possible mRNAs (including splice variants) of an organism.
– This can be thought of as the regions of an organism’s genome
that get transcribed into messenger RNA.
• In some cases, the transcriptome can be extended to include all
transcribed elements, including non-coding RNAs used for
structural and regulatory purposes.
• The term proteome refers to the complete collection of proteins
that can be produced by an organism.
– The proteome can be studied either as a static (sum of all
proteins possible) or a dynamic (all proteins found at a specific
time point) entity.
Genetic Code
• Since there are 4 possible bases (A, C, G, U) and 3 bases in the
codon, there are 4 * 4 * 4 = 64 possible codon sequences.
• The codon AUG is also used as a signal to initiate
translation, while the codons UAA, UAG, and UGA are termination (stop)
codons signaling the end of translation.
– That leaves 61 codons that can code for amino
acids (AUG also codes for the amino acid methionine).
• However, there are only 20 amino acids.
• Therefore, the genetic code is redundant, meaning that a single
amino acid could be coded for by several different codons.
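The counts above (64 codons, 3 stop codons, 61 sense codons, 20 amino acids) can be checked with a short sketch. It assumes the standard genetic code, written as a 64-letter string in UCAG order with `*` marking stop codons; the variable and function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, one letter per codon, codons enumerated in UCAG order
# (UUU, UUC, UUA, UUG, UCU, ...). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = {c: aa for c, aa in CODON_TABLE.items() if aa != "*"}
codons_per_amino_acid = Counter(sense_codons.values())

def translate(mrna: str) -> str:
    """Translate an mRNA from its first base until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))                  # 64
print(stop_codons)                       # ['UAA', 'UAG', 'UGA']
print(len(sense_codons))                 # 61
print(len(codons_per_amino_acid))        # 20
print(codons_per_amino_acid["L"], codons_per_amino_acid["M"])  # 6 1
print(translate("AUGGCCGAAUAA"))         # MAE
```

The last two lines show the redundancy directly: leucine (L) has six codons while methionine (M) has only AUG.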
Amino Acids
• Amino acids are the building blocks from which proteins are
made.
• There are 20 different amino acids that vary from each other by
their side chain groups.
• Amino acids are linked to one another via a single chemical bond,
called a peptide bond.
• A linear chain of amino acids can be referred to as a peptide (if it
is short – less than 30 a.a. long) or polypeptide (which can be
upwards of 4000 residues long).
Reading Frames and open reading frames
• A given double-stranded DNA sequence has six possible reading frames:
three on the forward strand and three on the reverse strand.
• Theoretically, the different reading frames give entirely different
proteins
• The reading frame used for protein synthesis (ORF: Open Reading
Frame) is determined by the position of the initiation codon
• An ORF is any continuous reading frame that starts with a start codon
and ends with a stop codon.
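The six reading frames and the ORF definition above can be sketched as follows. This is a minimal illustration, assuming ATG as the only start codon and ignoring nested start codons inside an ORF that is already open; the function names and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq: str) -> str:
    """Return the reverse strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq: str, frame: int):
    """Yield ORFs (start codon through stop codon) in one frame of one strand."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    start = None
    for idx, codon in enumerate(codons):
        if start is None and codon == "ATG":
            start = idx                              # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            yield "".join(codons[start:idx + 1])     # stop codon closes it
            start = None

def six_frame_orfs(seq: str):
    """Collect (strand, frame, orf) for the three forward and three reverse frames."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for orf in orfs_in_frame(s, frame):
                found.append((strand, frame, orf))
    return found

print(six_frame_orfs("CCATGGCTTAAGG"))   # [('+', 2, 'ATGGCTTAA')]
```

In this toy sequence only frame 2 of the forward strand contains a start codon followed in-frame by a stop codon; the reverse strand has an ATG but no stop codon after it.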
Mutation
• Alterations in the sequence of DNA are known as mutations.
• Mutations can have harmful effects, even death
• They can also have beneficial results, or they can be neutral.
• The slow accumulation of such changes is responsible for the
process known as evolution.
• In single-celled organisms, mutations are passed on from one
generation to the next when the organism divides
• In multicellular organisms, mutations are inherited by the next
generation only if they occur in the germ-line cells and
are passed on during sexual reproduction.
• Mutations that occur in somatic cells will only be passed on to the
descendants of those cells
• Such mutant cell lines will be restricted to the original
multicellular organism where the mutation occurred
Types of Mutation
• Point mutations: changes in one or a few nucleotides
– Substitution
– Insertion
– Deletion
Base Substitution Mutations

• If one base is replaced by another, a base substitution mutation has occurred
• These may be subdivided into transitions and transversions.
• In a transition a pyrimidine is replaced by another pyrimidine (i.e., T is replaced
by C or vice versa) or a purine is replaced by another purine (i.e., A is replaced
by G or vice versa).
• A transversion occurs when one base is replaced by another of a
different type; for example, a pyrimidine is replaced by a purine or
vice versa.
• Proteins must assume their correct three-dimensional structure in
order to function properly.
• The amino acids in the active site, and others that are critical for
correct folding of the protein, are essential.
• When a change in the base sequence alters a codon so that one
amino acid in a protein is replaced with a different amino acid, this
is called a missense mutation.
• The severity of a missense mutation depends on the location and
the nature of the amino acid that was substituted.
• Replacing one amino acid with another that has similar chemical
and physical properties is known as a conservative substitution.
• If the missense mutation lies in a conserved region of the protein,
it will have a significant effect
• Since the critical regions of most proteins occupy only a small
proportion of the total sequence, most conservative substitutions
will be relatively mild and usually non-lethal
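The transition/transversion distinction described earlier in this section can be sketched as a small classifier built on the purine and pyrimidine classes. The function name is hypothetical and the sketch handles DNA bases only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in PURINES | PYRIMIDINES or alt not in PURINES | PYRIMIDINES:
        raise ValueError("expected DNA bases A, C, G or T")
    if ref == alt:
        raise ValueError("no substitution: both bases are the same")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine) = transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(substitution_type("A", "G"))   # transition
print(substitution_type("C", "T"))   # transition
print(substitution_type("A", "C"))   # transversion
```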
Silent mutation
• There are 64 different codons, so most of the 20 possible amino acids
have more than one codon (degeneracy of the genetic code).
• So a base change that converts the original codon into another
codon that codes for the same amino acid will have no effect on the
final structure of the protein
• Very often, altering the third base of a codon has no effect on the
protein that will be made.
• Third base mutations that do not alter protein identity can
sometimes have effects due to differential codon usage and tRNA
bias.
– This is the reason for changing some codons of a bacterial gene to
the codons preferred by the eukaryotic host when we express
prokaryotic genes in eukaryotes
Nonsense mutation
• Not all codons encode amino acids.
• Three (UAA, UAG and UGA) are stop codons that signal the end of a
polypeptide chain.
• A nonsense mutation occurs when the codon for an amino acid is mutated to
give a stop codon.
• The ribosome stops and the rest of the protein does not get made.
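The three outcomes of a base substitution within a codon (silent, missense, nonsense) can be sketched by comparing the amino acids encoded before and after the change. The sketch assumes the standard genetic code in UCAG order; the function name and the example codons are illustrative.

```python
from itertools import product

# Standard genetic code in UCAG order; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change as silent, missense or nonsense.

    The reference codon is assumed to code for an amino acid (not a stop codon).
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == "*":
        raise ValueError("reference codon is a stop codon")
    if alt_aa == ref_aa:
        return "silent"        # same amino acid, protein unchanged
    if alt_aa == "*":
        return "nonsense"      # premature stop codon
    return "missense"          # a different amino acid

print(classify_substitution("GAA", "GAG"))   # silent   (Glu -> Glu)
print(classify_substitution("GAA", "GUA"))   # missense (Glu -> Val)
print(classify_substitution("UAC", "UAA"))   # nonsense (Tyr -> stop)
```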
Insertions and deletions (INDELS)

• Mutations that remove one or more bases are known as deletions and those that
add extra bases are known as insertions.
• The effect of a deletion (or insertion) depends greatly on how many bases are
removed (or inserted).
• In particular, we should distinguish between point mutations where
one (or a very few) bases are affected, and gross deletions and
insertions that affect long segments of DNA.
• Point deletions and insertions may have major effects due to
disruption of the reading frame (frameshift mutation)
• The introduction or removal of one or two bases can have drastic
effects since the alteration changes the reading frame of that
gene.
Frameshift Mutations
• A frameshift mutation shifts the reading frame of the genetic message
• Frameshift mutations usually completely destroy the function of a protein,
unless they occur extremely close to the far end of the coding sequence.
• However, insertion or deletion of three bases adds or removes a whole
codon and the reading frame is retained.
– Apart from the single amino acid that is gained or lost, the rest of the
protein is unchanged.
• If the deleted (or inserted) amino acid is in a less vital region of
the protein, a functional protein may still be made.
• Adding or deleting more than three bases may also be tolerated, as
long as the number is a multiple of three and the change occurs outside
an essential region.
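
The difference between a frameshift and an in-frame insertion can be checked with a small translation sketch. The gene below is a made-up example sequence, and translate is a hypothetical helper that reads codons from the first base and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first base until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCTGAAAAGCTGTTTTAA"           # invented example: M A E K L F stop

one_inserted = gene[:3] + "C" + gene[3:]     # +1 base: reading frame shifts
three_inserted = gene[:3] + "CCC" + gene[3:]  # +3 bases: reading frame retained
one_deleted = gene[:3] + gene[4:]             # -1 base: reading frame shifts

print(translate(gene))            # original protein
print(translate(one_inserted))    # truncated by a premature stop codon
print(translate(three_inserted))  # one extra amino acid, rest unchanged
print(translate(one_deleted))     # every residue after the deletion is altered
```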
Sequencing
• A set of methods for determining the order of the nucleotides or
amino acids in a DNA or protein molecule
• Knowledge of the sequence of genes, and of the entire genome, is
vital
– To understand how genes and proteins work
– To understand how different gene products influence the
activity of each other within the context of the whole
organism.
DNA sequencing
• If you just digest DNA into its four component bases and measure
the quantity of each, it tells you only the base composition, not the
DNA sequence.
• Individuals differ in the base sequence of their DNA
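
A short check makes the point that base composition alone does not fix the sequence. The two sequences are invented examples.

```python
from collections import Counter

seq_a = "ATGCATGC"   # invented example
seq_b = "AATTGGCC"   # same bases, different order

same_composition = Counter(seq_a) == Counter(seq_b)
same_sequence = seq_a == seq_b

print(dict(Counter(seq_a)))
print(same_composition, same_sequence)
```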
Common DNA sequencing techniques
1. Chemical Degradation Method (Maxam-Gilbert)

2. Chain-termination method (Sanger Method)
– Enzymatic method of sequencing-the most widely used
3. Automated sequencing
4. Next generation sequencing
– Pyrosequencing (1996)
– 454 technology
Maxam-Gilbert Chemical degradation method
• Designed by Allan Maxam and Walter Gilbert in 1977
• This method uses specific chemicals to modify individual DNA
bases or sets of bases prior to cleavage of the sugar–phosphate
backbone with piperidine at the modified bases
• It uses ss or dsDNA radiolabelled at either the 5’ end using
polynucleotide kinase or the 3’ end using terminal transferase (why is
this needed?)
• After cleavage it generates a set of fragments that differ by
at least one nucleotide in each reaction tube
Involves the following steps
1. Label the 5’ end of the DNA with 32P, or add a radiolabelled
phosphate-containing nucleotide at the 3’ end
2. Separate the labeled strands (denature)
3. Divide the mixture into four samples and treat each with a different
chemical that destroys specific bases
– A and G with dimethyl sulfate or formic acid
– Only G with dimethyl sulfate and piperidine
– T and C with hydrazine (at alkaline condition)
– Only C with hydrazine + 1M NaCl
4. Run each of the four samples in a separate lane of the gel
(electrophoresis)
5. Autoradiography
• Limitations: no longer popular; time consuming and expensive
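
How the four lanes are combined into a sequence can be sketched as follows. This is an idealized model, not a description of real gel data: each base-specific reaction is assumed to cut exactly at its target bases, the fragment size is taken as the number of intact bases between the 5’ label and the destroyed base, and both function names are hypothetical.

```python
def cleavage_lanes(seq):
    """Sizes of the 5'-labelled fragments expected in each of the four lanes."""
    reactions = {"G": "G", "A+G": "AG", "C+T": "CT", "C": "C"}
    return {
        lane: {i for i, base in enumerate(seq) if base in targets}
        for lane, targets in reactions.items()
    }


def read_gel(lanes, length):
    """Call one base per fragment size, from the shortest fragment upwards."""
    calls = []
    for size in range(length):
        if size in lanes["G"]:
            calls.append("G")      # band in both the G and A+G lanes
        elif size in lanes["A+G"]:
            calls.append("A")      # band in the A+G lane only
        elif size in lanes["C"]:
            calls.append("C")      # band in both the C and C+T lanes
        elif size in lanes["C+T"]:
            calls.append("T")      # band in the C+T lane only
        else:
            calls.append("N")      # no band at this size
    return "".join(calls)


sequence = "GATTACAGC"             # invented example
lanes = cleavage_lanes(sequence)
print(read_gel(lanes, len(sequence)))
```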
Sanger dideoxy (primer extension/chain-termination) method
• Developed by Frederick Sanger (1977)
• The principle of this sequencing method is based on DNA
replication (which takes place in every dividing cell, and is also the
basis of PCR)
• Requires the following components
1. ssDNA template
2. A primer for DNA synthesis
3. DNA polymerase (usually Klenow fragment of DNA pol I)
4. dNTPs (dATP, dTTP, dCTP, dGTP), one of which is
radioactively labeled (Why?)
5. Dideoxynucleotide triphosphates, ddNTPs (ddATP, ddTTP,
ddCTP, ddGTP)
• The 3'-OH group necessary for formation of the phosphodiester
bond is missing in ddNTPs
• With addition of enzyme (DNA polymerase), the primer is
extended until a ddNTP is encountered
• The chain ends with the incorporation of the ddNTP, because it
lacks the 3’ OH functional group needed for chain elongation
• With the proper dNTP:ddNTP ratio (usually about 99:1), chains
terminate at every possible position along the length of the template
• By carrying out four reactions, one per ddNTP, four separate sets of
fragments are formed, each terminated at a specific base
• Fragments from each of the four tubes are loaded in four separate
gel lanes and the resulting terminated chains are resolved by
electrophoresis
It involves the following major steps
1. Preparation of four reaction tubes (A mix, T mix, C mix, G mix),
each containing:
– ssDNA template
– Primer
– Four dNTPs
– Enzyme (DNA polymerase)
– A small amount of one ddNTP, which brings about chain termination
2. Run the four separate reactions, each with a different ddNTP
3. Fragment separation on a high resolution polyacrylamide gel
4. Expose the gel to X-ray film and read the sequence from bottom to top
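
The steps above can be sketched as an idealized simulation. It assumes that every possible termination product is present, ignores the primer length, and uses hypothetical function names; the template is an invented example written 3’ to 5’ so that the newly synthesized strand reads 5’ to 3’.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_lanes(template_3to5):
    """Fragment lengths expected in each ddNTP lane of an idealized run."""
    # The polymerase builds the complementary strand in the 5'->3' direction.
    new_strand = "".join(COMPLEMENT[base] for base in template_3to5)
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)   # a chain ending in ddN stops at this length
    return lanes


def read_gel(lanes):
    """Read bands from the bottom of the gel (shortest) to the top (longest)."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "TACGGCAT"                # invented example, written 3'->5'
lanes = sanger_lanes(template)
print(lanes)
print(read_gel(lanes))               # sequence of the newly synthesized strand
```

Note that the sequence read from the gel is that of the new strand, which is complementary to the template.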
• Sanger sequencing is the most popular protocol for sequencing; it is
very adaptable and scalable to large sequencing projects
Limitations
– It is time consuming and error prone because
• Multiple pipetting steps are required to set up each reaction
• Reactions need to be loaded onto four lanes of a gel to separate
the products
• Sequencing gels have to be read manually
• To tackle those problems, a new technology was needed that combines
the four individual sequencing reactions into a single reaction that
can be analysed on a single lane of a gel. That is
automated sequencing
Automated DNA sequencing
• Developed in the mid-1980s and in routine use by the 1990s
• It is an improvement of the Sanger method in that a different
fluorescent dye is attached to each of the four ddNTPs
• Performed in a single tube with the differently tagged ddNTPs
• This avoids the hazards of radioactive isotopes and
reduces the time needed to sequence
• The fluorescent tags are attached to the chain-terminating
nucleotides
• Sequence data is obtained in real time by detecting the DNA bands
within the gel during the electrophoretic separation
• Each of the four dideoxynucleotides carries a spectrally different
fluorophore
• The DNA bands are detected by their fluorescence
Genome sequencing
• How can we sequence a whole genome?
• Current sequencing methods such as chain-termination
sequencing can read only up to about 750 bp of sequence per reaction.
• But the total size of a typical bacterial genome is about 4,000,000 bp
and the human genome is >3 billion bp
• Therefore, the whole genome needs to be fragmented into pieces,
which are cloned into a vector and then sequenced
• The problem then is how to reconstruct the original genome
sequence from the small fragments that are cloned into
individual vectors
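
The scale of the problem follows from simple arithmetic on the figures above. The sketch below assumes every read is a full 750 bp and that reads tile the genome perfectly, which real projects never achieve; reads_needed and its coverage parameter are hypothetical names.

```python
READ_LENGTH = 750  # bp per chain-termination read, as stated above


def reads_needed(genome_size, coverage=1, read_length=READ_LENGTH):
    """Minimum number of reads to cover the genome `coverage` times (rounded up)."""
    total_bases = genome_size * coverage
    return (total_bases + read_length - 1) // read_length


print(reads_needed(4_000_000))               # typical bacterial genome, 1x
print(reads_needed(3_000_000_000))           # human genome, 1x
print(reads_needed(4_000_000, coverage=6))   # bacterial genome, sixfold redundancy
```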
• Several basic approaches have been used so far:
Clone Contigs
• This method starts by generating overlapping DNA fragments
cloned into a vector (a library)
• One clone from the library is isolated and sequenced; a second
clone, whose insert overlaps with the first, is then identified
by hybridization.
• The second clone is then sequenced and the information used
to identify a third clone, whose insert overlaps with the second
clone, and so on.
• This is the basis of chromosome walking. However, this
method is laborious.
• A single clone has to be isolated and sequenced before the
next overlapping clone can be found.
• It involves much more work and so takes longer and costs
more money.
• Additional time and effort is needed to construct the
overlapping series of cloned DNA fragments.
Whole genome shotgun sequencing
• The fragments of the genome, which have been randomly
generated, are cloned into a vector and each insert is sequenced.
• The sequences are then examined for overlaps and the genome is
reconstructed by assembling the overlapping sequences together
using bioinformatics software.
• This approach was first used to sequence the genome of the
bacterium Haemophilus influenzae
• The entire genome of the organism was randomly fragmented using
sonication and then small fragments (in the range of 1.5–2 kbp)
were cloned into a vector (pUC18). The resulting library consisted
of approximately 20,000 individual clones.
• Each of these was then sequenced to generate approximately 12
million base pairs of sequence information (six times the length
of the H. influenzae genome).
• The sequence obtained from each clone was then assembled into
contigs based on the overlaps between the individual clones.
• Shotgun sequencing is harder to apply to repeat-rich eukaryotic
genomes because one repeat element might wrongly be assigned an
overlap with the identical sequence present in a different repeat element
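
The idea of assembling reads by their overlaps can be illustrated with a minimal greedy sketch. This is a toy approach, not the method used by any particular assembly program: it repeatedly merges the pair of reads with the longest exact suffix/prefix overlap, ignores sequencing errors and reverse strands, and can be misled by repeats in exactly the way described above. The reads and function names are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by longest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break                       # remaining reads form separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["TTAGCCTAGGA", "ATGGCGTACG", "GTACGTTAGC"]   # invented example reads
contigs = greedy_assemble(reads)
print(contigs)
```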
Hierarchical shotgun
• The whole genome shotgun approach to contig assembly has proved to
be successful in sequencing comparatively small genomes.
• The majority of bacterial genomes can be sequenced by this
method.
• For larger genomes, however, assembly of contigs with this method
is problematic
• But it can be greatly simplified if the genomic DNA is first
broken up into a series of overlapping large clones, such as those
produced by cloning into BACs (bacterial artificial chromosomes).
• A library of smaller clones is then produced from each BAC and
subjected to shotgun sequencing
• The hierarchical approach provides a mechanism to relatively
easily construct assembled contigs for a particular part of the
genome
Protein sequencing
• The advent of large-scale genomic sequencing has greatly
simplified the task of determining the primary structure of proteins,
because open reading frames in the nucleotide sequence specify the
amino acid sequences of the corresponding proteins.
• But the genome sequences of most organisms are still unknown.
• Even for those that are known, post-translational modifications
may change a protein's properties.
• Thus, complete characterization of the protein primary structure
often requires determination of the protein sequence
• Every molecule of a given protein contains precisely the same total
number of amino acids (residues), in the same proportions.
• Thus, a formula for a protein looks like this:
– Protein A = (30 glycines + 44 alanines + 5 tyrosines + 14
glutamines + . . .)
• These amino acids are linked together as a chain and the true
identity of a protein is derived not only from its composition, but
also from the precise order of its constituent amino acids.
• The first amino-acid sequence of a protein (bovine insulin) was
determined by Sanger in the early 1950s.
• Sanger was awarded the Nobel Prize in Chemistry in 1958 for this work
Edman degradation
• The method uses a set of chemical reactions to remove and identify the
amino acid residue at the N-terminus of the polypeptide chain, i.e. the
residue with a free α-amino group.
• Then the next residue in the sequence is made available and subjected
to the same round of chemical reactions.
• Reiteration of this process reveals the sequence of the polypeptide.
• Phenyl isothiocyanate (PITC), also known as Edman’s reagent, reacts
with the α-amino group (or, in the case of a prolyl residue, with the imino
group) at the N-terminal end of the polypeptide chain to form a
phenylthiocarbamyl (PTC) derivative of the terminal residue.
• The derivatized peptide is then treated with acid (e.g. HCl) in an
anhydrous solvent. The N-terminal residue is cleaved from the
remainder of the peptide and recovered as a stable phenylthiohydantoin
derivative (PTH–amino acid).
• After the first cycle of the reaction, the amino group of the second amino
acid is free to react with Edman’s reagent, and at the end of the reaction
the PTH derivative of the second amino acid from the N-terminus is released.
• The process continues until the end of the sequence or until a disulfide
bond is encountered.
• The liberated PTH amino acid is identified and quantified relative
to a standard by chromatography (HPLC) with UV detection.
• Disulfide bonds in the polypeptide must therefore be reduced
before the sequencing process can be initiated. Reduction can be
done with β-mercaptoethanol
• Clearly, a free α-amino group is required for this reaction to occur.
• It is necessary to avoid contamination of the sample with amine-
containing nonpeptidic species since these, too, may react with
PITC and generate products that interfere with subsequent
analysis.
• In practice this method cannot be used to sequence polypeptides longer
than about 50 amino acids.
• Larger proteins have to be broken down into short peptide
fragments using specific proteases
• Trypsin (cleaves a protein at the carboxyl side of lysine and arginine
residues)
• Chymotrypsin (cleaves at the carboxyl side of tyrosine, tryptophan and
phenylalanine).
• Specific cleavage can also be achieved by chemical methods such as
cyanogen bromide (CNBr), which cleaves at the carboxyl side of
methionine residues
• A protein with 12 internal methionines will yield 13 polypeptide
fragments on cleavage with CNBr.
• Different treatments (e.g. trypsin and chymotrypsin) produce different,
overlapping sets of fragments; these can be sequenced and assembled
in the same fashion as DNA.
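
The cleavage rules listed above can be sketched as a simple in-silico digestion. This is a simplification under stated assumptions: cleavage is taken to be complete and to happen after every listed residue, with no exceptions for neighbouring residues. The helper cleave_after, the CLEAVAGE_RULES table, and the test sequences are all invented for the example.

```python
# One-letter residue codes after which each reagent is assumed to cut.
CLEAVAGE_RULES = {
    "trypsin": "KR",        # lysine, arginine
    "chymotrypsin": "YWF",  # tyrosine, tryptophan, phenylalanine
    "CNBr": "M",            # methionine
}


def cleave_after(protein, residues):
    """Split a protein on the carboxyl side of every listed residue."""
    fragments, current = [], ""
    for amino_acid in protein:
        current += amino_acid
        if amino_acid in residues:
            fragments.append(current)
            current = ""
    if current:                     # C-terminal fragment without a cut site
        fragments.append(current)
    return fragments


peptide = "AKGRFMLWSK"              # invented example
for reagent, residues in CLEAVAGE_RULES.items():
    print(reagent, cleave_after(peptide, residues))

# 12 internal methionines give 13 fragments with CNBr.
twelve_met = "AM" * 12 + "A"
print(len(cleave_after(twelve_met, CLEAVAGE_RULES["CNBr"])))
```

Because trypsin and chymotrypsin cut at different residues, their fragment sets overlap, which is what allows the peptide order to be reconstructed.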
Limitations of Edman degradation
• The major drawback of the procedure remains the length of the
peptide chain.
• If the chain exceeds a length of 50-60 residues the procedure tends
to fail
• This can be solved by taking the larger peptide chain and cleaving
it into smaller fragments using cyanogen bromide, trypsin,
chymotrypsin or any enzyme/chemical which can break peptide
chains.
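
One way to see why long chains fail is that no reaction cycle is perfectly efficient, so the usable signal shrinks with every residue removed. The sketch below assumes a constant per-cycle efficiency; the 98% efficiency and the 30% detection threshold are hypothetical values chosen only for illustration, not measured figures.

```python
def cumulative_yield(per_cycle_efficiency, cycles):
    """Fraction of peptide still in phase after a number of cycles."""
    return per_cycle_efficiency ** cycles


def max_readable_length(per_cycle_efficiency, min_yield):
    """Number of cycles completed before the yield drops below min_yield."""
    if not 0 < per_cycle_efficiency < 1 or not 0 < min_yield <= 1:
        raise ValueError("efficiency must be in (0, 1) and min_yield in (0, 1]")
    cycles = 0
    while cumulative_yield(per_cycle_efficiency, cycles + 1) >= min_yield:
        cycles += 1
    return cycles


# Hypothetical numbers: 98% efficiency per cycle, 30% minimum usable yield.
print(round(cumulative_yield(0.98, 50), 3))
print(max_readable_length(0.98, 0.30))
```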
Chromosome Mutations:
• Changes in the structure of entire chromosomes:
– Inversion
– Translocation
– Deletion
– Duplication
• Changes in the number of chromosomes:
– Aneuploidy
– Euploidy

