Databases in Bioinformatics - An Introduction
INTRODUCTION
The amount and rate of accumulation of biological information is increasing
exponentially with the discovery of new and automated sequencing methods and the
development of powerful new technologies for acquiring large-scale genomic and
proteomic datasets. The exponential growth in molecular sequence data started in the
early 1980s, when the methods for DNA sequencing became widely available. The data
generated from various sequencing projects needed to be stored and analysed to
annotate the genes and their products and measure their dynamic interactions. As a
result came the concept of a 'biological database' to store biological data in an electronic
format. All the biological databases use standardized formats. This tremendous growth
in biological data has turned the biological sciences into a data-rich science. The
common examples of biological data are the nucleotide sequences (genes and genomes);
the protein sequences and motifs; the macromolecular structural data generated from
X-ray crystallography and macromolecular NMR; metabolic pathways; gene expression
data (microarrays); protein-protein interactions; and many other types of data related to
biological function and processes. This explosion of biological sequence data in the early
1980s paved the way for the development of three popular databases: NCBI (National
Center for Biotechnology Information), EMBL (European Molecular Biology
Laboratory), and DDBJ (DNA Data Bank of Japan).
BIOLOGICAL DATABASES
We can define biological database as a collection of data that is structured, searchable,
updated periodically, and cross-referenced. The database administrator updates these
data from time to time by editing existing data and adding new data. Biological
databases are developed to perform several functions. Some of the main purposes/
functions of biological databases are as follows:
• Databases aid in the systematization of results from biological experiments and
analysis. All the biological data obtained through experiments or analysis are
useful for future work. So databases help to organize and store all known data,
which prevents recomputing and duplication of experiments.
• Databases make biological data available to scientists at one place and help them
to obtain data for their research and cross-validation.
• Biological data in databases are available in computer-readable form, and this
forms the first fundamental step of biological data analysis.
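The four properties in the definition above (structured, searchable, periodically updated, cross-referenced) can be illustrated with a small in-memory example. This is only a minimal sketch using Python's built-in sqlite3 module; the table names, column names, accession numbers, and cross-reference targets are hypothetical and do not reflect the schema of any real biological database.

```python
import sqlite3


def build_demo_database():
    """Create a tiny in-memory database with hypothetical records."""
    conn = sqlite3.connect(":memory:")
    # Structured: every record follows the same fixed set of fields.
    conn.executescript(
        """
        CREATE TABLE sequence_record (
            accession    TEXT PRIMARY KEY,
            organism     TEXT NOT NULL,
            sequence     TEXT NOT NULL,
            last_updated TEXT NOT NULL
        );
        CREATE TABLE cross_reference (
            accession TEXT NOT NULL REFERENCES sequence_record(accession),
            target_db TEXT NOT NULL,
            target_id TEXT NOT NULL
        );
        """
    )
    # Hypothetical entries, invented for illustration only.
    conn.executemany(
        "INSERT INTO sequence_record VALUES (?, ?, ?, ?)",
        [
            ("DEMO0001", "Escherichia coli", "ATGGCTAAAGGT", "2005-01-10"),
            ("DEMO0002", "Homo sapiens", "ATGCCCTTTGGA", "2005-02-03"),
        ],
    )
    conn.executemany(
        "INSERT INTO cross_reference VALUES (?, ?, ?)",
        [
            ("DEMO0001", "structure_db", "STRUCT-1"),
            ("DEMO0002", "literature_db", "PAPER-42"),
        ],
    )
    conn.commit()
    return conn


def find_by_organism(conn, organism):
    """Searchable: return the accessions recorded for one organism."""
    rows = conn.execute(
        "SELECT accession FROM sequence_record "
        "WHERE organism = ? ORDER BY accession",
        (organism,),
    ).fetchall()
    return [row[0] for row in rows]


def update_sequence(conn, accession, new_sequence, date):
    """Updated periodically: edit an existing record and stamp the date."""
    conn.execute(
        "UPDATE sequence_record SET sequence = ?, last_updated = ? "
        "WHERE accession = ?",
        (new_sequence, date, accession),
    )
    conn.commit()


def cross_references(conn, accession):
    """Cross-referenced: list links from a record to other databases."""
    return conn.execute(
        "SELECT target_db, target_id FROM cross_reference "
        "WHERE accession = ? ORDER BY target_db",
        (accession,),
    ).fetchall()
```

Calling build_demo_database() and then find_by_organism(conn, "Homo sapiens") would return the single hypothetical accession stored for that organism. Real biological databases are far larger and use richer record formats, but the same four operations apply.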
Sequence Databases
Sequence databases are applicable to both nucleic acid sequences (GenBank, EMBL-Bank,
and DDBJ) and protein sequences (Entrez protein, Integr8, proteome FASTA,
[Figure: Types of databases - sequence database (nucleotide (DNA) and protein), genome database, microarray database (transcriptome), bibliographic database (literature), chemical database, metabolic database (pathways and enzymes), structure database (3D structure of macromolecules), disease database, enzyme database]
Genome Databases
Genome databases are a repository of whole genome nucleotide sequences of various
organisms - prokaryotes, eukaryotes, and viruses. These databases also provide views
for a variety of genomes, sequence maps with contigs, and integrated genetic and
physical maps along with annotated gene information. For example, Entrez Genome of
NCBI has the genome sequence data for six major organism types: Archaea, Bacteria,
Eukaryotes, Viruses, Viroids, and Plasmids. Genome Information Broker (GIB) is
another database of complete genome sequence data
(http://www.gib.genes.nig.ac.jp).
Bibliographic Databases
Bibliographic databases are scientific literature databases consisting of numerous
research papers and articles from various journals. PubMed, available at NCBI, is
the most widely used bibliographic database. PubMed is a special type of database that
helps users stay current with the literature of various subjects. PubMed is maintained by
the National Library of Medicine (NLM) and contains more than 12.8 million
[...] cross-database search system (Wheeler et al. 2005) so that users can see more than
just journal abstracts and titles in response to their text queries.
MEDLINE is the NLM's premier bibliographic database covering the fields of
medicine, nursing, dentistry, veterinary medicine, the health-care system, and the
preclinical sciences. MEDLINE contains bibliographic citations and author abstracts
from more than 4,800 biomedical journals published in the United States and 70 other
countries. The database contains over 12 million citations dating back to the
mid-1960s. Coverage is worldwide, but most records are from English-language sources or
have English abstracts.
Microarray Databases
These databases contain data obtained from microarray-based experiments measuring
the abundance of mRNA, genomic DNA, and protein molecules, and also from
non-array-based technologies, such as SAGE and mass spectrometry peptide profiling.
These data are otherwise known as transcriptome data. Examples of such databases
are GEO (Gene Expression Omnibus), Gensat, ArrayExpress, Cancer Gene
Expression Database (CGED), and Human Gene Expression Index (HuGE Index).
Metabolic Databases
Metabolic databases contain data on biochemical pathways and enzymes in different
organisms. KEGG and MetaCyc are the noteworthy metabolic databases. Organism-specific
databases, such as EcoCyc, FlyBase, and CCDB, hold data for individual organisms.
All these databases are elaborated in the subsequent chapters.
Chemical Databases
These databases store chemical information on various molecules. For example,
PubChem of NCBI contains substance descriptions of small molecules with fewer
than 1,000 atoms and 1,000 bonds.
Structure Databases
Structure databases include data on the 3D structure of nucleic acids and proteins. The data
types found in these databases are crystallographic or NMR coordinate data, structure
factors for the X-ray structures or constraint files for the NMR structures, and
information about the experiments used to determine the structures, such as
crystallization information, data collection, and refinement statistics. Examples of such
nucleic acid databases are NDB (Nucleic Acid Database) and SCOR (Structural
Classification of RNA). PDB (Protein Data Bank) is the most popular repository of 3D
structures of proteins obtained either by NMR or X-ray crystallography.
Disease Databases
These are the exclusive sources for disease-related information. For example, OMIM
(Online Mendelian Inheritance in Man) provides data about human genes and genetic
disorders. Genetic Association Database is another popular disease database
containing data on human genetic association studies of complex diseases and disorders.
[...] etc. This database [...] involved. The [...] Nomenclature [...]
Databases Based on the Data Source
There are two general classes of biological databases based on the source
of biological data - (i) Archival or Primary database and (ii) Curated or Secondary
database.
Primary Database
Primary or Archival databases accept or include original data from researchers with
relatively little checking or validation. They contain original submissions by
researchers. Most of the archival databases are public and offer open access to the
community for annotation purposes. GenBank and EMBL-Bank are examples of
primary nucleic acid databases. These are the nucleotide sequence databases of the National
Center for Biotechnology Information (NCBI) and the European Molecular Biology
Laboratory (EMBL), respectively (see Table 3.1). Primary protein sequence databases
are UniProt, PIR, Swiss-Prot, and TrEMBL; primary structure databases include PDB
and Nucleic Acid Database (NDB).
Table 3.1 Primary databases, their descriptions and web interfaces

Database     Description                                         Web interface
GenBank      Nucleotide sequence database of NCBI                http://www.ncbi.nih.gov/Genbank/
EMBL         Database of European Molecular Biology Laboratory   http://www.ebi.ac.uk/embl
DDBJ         DNA Data Bank of Japan                              http://www.ddbj.nig.ac.jp
UniProt      Universal Protein Resource                          http://www.uniprot.org
PIR          Protein Information Resource                        http://pir.georgetown.edu/
Swiss-Prot   Manually curated protein-only sequence database     http://www.expasy.ch
TrEMBL       Translated EMBL is a very large protein
             database in Swiss-Prot format
PDB          Protein Data Bank - repository for 3D               http://www.rcsb.org/pdb
             structure of macromolecules
NDB          Nucleic Acid Database                               http://www.ndbserver.ru[...]edu/
Composite Database
A composite database combines different primary database sources (see Table 3.3).
This makes querying and searching multiple resources more efficient. Although these
databases are compiled from various primary databases, non-redundancy is maintained by
filtering multiple data from different primary database sources. The best-known examples
of composite databases are OWL, Non-Redundant Database (NRDB), and BioSilico.
OWL is a non-redundant composite of four publicly available primary sources:
Swiss-Prot, PIR, GenBank (translation), and NRL-3D.
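The idea of compiling several primary sources while filtering out redundant entries can be sketched as follows. This is a simplified illustration, not the procedure used by OWL or NRDB: it assumes that two entries are redundant only when their sequences match exactly after whitespace removal and upper-casing, and the source names and identifiers used with it are hypothetical.

```python
def build_composite(sources):
    """Merge records from several primary sources into a non-redundant list.

    `sources` maps a source name to a list of (identifier, sequence) pairs.
    Sources are processed in dictionary order, so the first source listed
    contributes the first origin recorded for each sequence.
    Assumption: redundancy means an exact match of the normalised sequence.
    """
    composite = {}
    for source_name, records in sources.items():
        for identifier, sequence in records:
            # Normalise so that case and stray whitespace do not hide duplicates.
            key = "".join(sequence.split()).upper()
            if key not in composite:
                composite[key] = {"sequence": key, "origins": []}
            # Keep track of every primary entry that carried this sequence.
            composite[key]["origins"].append((source_name, identifier))
    # Dictionaries preserve insertion order, so output order is stable.
    return list(composite.values())
```

Given two hypothetical sources that both hold the same protein sequence under different identifiers, build_composite returns a single entry for that sequence with both origins recorded, so a search runs once over the merged set instead of once per source.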
Table 3.3 Composite databases, their descriptions and web interfaces