IInd Sem Class1
Bioinformatics is a modern discipline integrating different branches of science i.e. Biology, Chemistry & Information technology.
Informatics related to Biological and Medical sciences: Bioinformatics, Structural Bioinformatics, Medical Informatics, Chemoinformatics, Pharmacy Informatics, Clinical Informatics
It brings together Computer Science, Information Technology, Mathematics, Chemistry, Physics, and Medicine, with the objectives of developing tools to analyze biological, biochemical, and biophysical data and of generating new knowledge in these areas. It is a fact that persons trained and skilled in all of these disciplines do not exist, and if this area is to develop in our country, such persons will have to be trained.
In other words, bioinformatics is the combination of biology and information technology. It is a branch of science that deals with the computer-based analysis of large biological data sets.
It incorporates the development of databases to store and search data, and of statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.
[Figure: DNA, RNA, protein synthesis]
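The flow from DNA to RNA to protein named in the labels above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: it assumes the input is the coding strand of the DNA, assumes the standard genetic code, and the function names `transcribe` and `translate` are hypothetical.

```python
# Standard genetic code with codons enumerated in U, C, A, G order
# (assumption: the standard code applies to the sequence at hand).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
protein = translate(mrna)
print(mrna, protein)
```

The example sequence is made up for illustration; real sequences would come from one of the databases described later.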
COMPUTERS IN BIOLOGY
Development of new scientific methods and algorithms for managing large amounts of sequence and structural data.
As the full genome sequences of many species and data from structural genomics, micro-arrays, and proteomics became available, integrating these data on a common platform requires sophisticated bioinformatics tools {Sequence-Structure-Function}. Organizing these data into knowledge databases and developing appropriate software tools for analyzing them are going to be major challenges. India, as a major player in the IT industry, has the potential to develop such resources at an affordable cost.
Fig: Schematic outline of the application of SB (homology modeling) and crystallography (structural molecular biology) in the drug discovery process.
Labels in the figure: target protein sequence; homology modeling of target protein; X-ray; virtual library of compounds or QSAR analysis; large-scale docking.
Table: Some important structural bioinformatics databases/resources/tools (S.No., database and its importance, URL)
1. National Center for Biotechnology Information (NCBI): Provides a general search for nucleotide sequences, protein sequences, biomolecule 3D structures, genomes, taxonomy or literature.
   URL: http://www.ncbi.nlm.nih.gov/Entrez/
2. Structural Genomics Target Database (sgtdb): 3-D models of all sequences under investigation by structural genomics centers.
   URL: http://spam.sdsc.edu/
3. Structure Comparison Database (CE): Pair-wise structure comparisons based on the Combinatorial Extension (CE) Algorithm for both a representative set and complete set of protein structures; includes alignments.
   URL: http://cl.sdsc.edu/ce.html
4. CKAAP DB: Database of structures with Conserved Key Amino Acid Positions.
   URL: http://ckaaps.sdsc.edu/perl/browser.pl
5. Protein Data Bank (PDB): The single worldwide source of primary structural data on biological macromolecules determined experimentally.
   URL: http://www.rcsb.org/pdb
6. Extended GO Annotation of PDB Chains: Use of structure comparison to extend the coverage of GO terms in the PDB.
   URL: http://spdc.sdsc.edu/
7. PDBbind: Designed to provide a collection of experimentally measured binding affinity data (Kd, Ki, and IC50) exclusively for the protein-ligand complexes available in PDB.
   URL: http://www.pdbbind.org/
Bioinformatics
Information Resources And Networks
Outline
Bioinformatics Information Resources And Networks
EMBnet: European Molecular Biology Network
DBs and Tools:
- Nucleic Acid Sequence Databases
- Protein Information Resources
- Metabolic Databases
- Mapping Databases
- Databases concerning Mutations
- Literature Databases
EMBnet - Nodes
EMBnet (41 nodes): National Nodes (18), governmental
National Nodes
- Vienna Biocenter - Austria
- CSC - Finland
- DKFZ - Germany
- INCBI - Ireland
- IEN-AdR - Italy
- Bio - Norway
- PEN - Portugal
- CNB-CSIC - Spain
- SIB - Switzerland
- BEN - Belgium
- INFOBIOGEN - France
- HEN - Hungary
- INN - Israel
- CMBI - Netherlands
- IBB - Poland
- GeneBee - Russia
- BMC - Sweden
- SEQNET - UK
Appointed by the governments.
Provide on-line services, user support and training.
EMBnet - Nodes
Specialist Nodes
- MIPS (Munich Information Center for Protein Sequences)
- ICGEB
- Pharmacia
- F. Hoffmann-La Roche
- EBI
- HGMP-RC
- Sanger
- UCL
Academic, industrial or research centers in specific areas of bioinformatics.
Largely responsible for maintenance of biological databases and software.
EBI (Hinxton Hall, Cambridge, UK): an important specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases.
EMBnet - Nodes
Associate Nodes
- IBBM - Argentina
- ANGIS - Australia
- CBI - China
- CIGB - Cuba
- CDFD - India
- EMBnet - Brazil
- CBR - Canada
- EMBnet - Chile
- EMBnet - Colombia
- CIFN - Mexico
EMBnet's Mission
- Assist in biotechnological and bioinformatics related research
- Provide training and education
- Exploit network infrastructures
- Investigate and develop new technologies
- Bridge between commercial and academic sectors
EMBnet's SRS
Sequence Retrieval System: Network Browser for Databanks in Molecular Biology
http://srs.embl-heidelberg.de:8000/srs5/
Available across the EMBnet National Nodes, Specialist Nodes and Associate Nodes.
Data Bank      No. Entries   Indexing Date   Group             Availability
SPTREMBL       1449374       16-Jun-2005     Sequence          ok
SPTREMBLNEW    143140        17-Jun-2005     Sequence          ok
REMTREMBL      92182         20-Jun-2005     Sequence          ok
PIR            283416        16-Jun-2005     Sequence          ok
WORMPEP        19538         16-Jun-2005     Sequence          ok
DROSOPHILA     14100         16-Jun-2005     Sequence          ok
EMBLNEW        4035816       21-Nov-2005     Sequence          ok
EMBL           20343598      30-Dec-2005     Sequence          ok
EMBLEST        31990232      06-Jan-2006     Sequence          ok
EMBLWGS        11106060      24-Sep-2005     Sequence          ok
GENBANK        19233264      18-Nov-2005     Sequence          ok
GENBANKEST     31008556      23-Feb-2006     Sequence          ok
REFSEQP        8006          16-Jun-2005     Sequence          ok
SUBTILIST      -             16-Jun-2005     Sequence          ok
TFCELL         816           07-Apr-2003     TransFac          ok
TFCLASS        27            07-Apr-2003     TransFac          ok
TFMATRIX       246           07-Apr-2003     TransFac          ok
TFGENE         1035          07-Apr-2003     TransFac          ok
PDB            34927         08-Feb-2006     Protein3DStruct   ok
DSSP           30832         22-Nov-2005     Protein3DStruct   ok
HSSP           30369         08-Feb-2006     Protein3DStruct   ok
PDBFINDER      35701         28-Mar-2006     Protein3DStruct   ok
NRL3D          6063          16-Jun-2005     Protein3DStruct   ok
FLYGENES       7556          16-Jun-2005     Genome            ok
FLYREFS        0             07-Apr-2003     Genome            ok
OMIM           17004         18-Oct-2005     Mutations         ok
REPTILIA       8364          18-Jan-2006     Others            ok
NCBI
Mission:
- Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease
- Creation of systems for storing and analysing biological information
- Development of advanced methods of computer-based information processing
- Facilitation of user access to DBs and software
- Co-ordination of efforts to gather biotechnology information worldwide
Since 1992, NCBI has maintained GenBank and collaborated with the international nucleotide DBs EMBL and DDBJ (Japan).
It provides Entrez, which facilitates access to biological DBs (similar to SRS, which is provided by EMBnet).
NCBI - Responsibilities
- administers research on biomedical problems at the molecular level using mathematical and computational methods
- maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry, and other governmental agencies
- promotes scientific communication by sponsoring meetings, workshops, and lecture series
- supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
- engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
- develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
- develops and promotes standards for databases, data deposition and exchange, and biological nomenclature
Nucleic Acid Sequence Databases
The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported worldwide and exchange new and updated entries on a daily basis.
Nucleic acid sequence databases:
- EMBL (Europe)
- GenBank (USA)
- DDBJ (Japan)
- ENSEMBL (project between EMBL-EBI and the Sanger Institute)
- dbEST (division of GenBank)
- GSDB (division of GenBank)
source: http://www3.ebi.ac.uk/Services/DBStats/
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.
Ref: EMBL Nucleotide Sequence Database: developments in 2005, Nucleic Acids Research, 2006, Vol. 34, D10-D15
http://genome-www.stanford.edu/Saccharomyces
FlyBase (A Database of Drosophila Genes & Genomes): http://flybase.bio.indiana.edu (fruit fly)
MGD (Mouse Genome Database): http://www.informatics.jax.org (mouse)
The primary structure of a protein is its amino acid sequence. Sequences are stored in primary databases as linear alphabets that denote the constituent residues (ACDEFGHIKLMNPQRSTVWY).
The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).
The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.
[Figure: levels of protein description. Primary: sequence (AVILDRYFH). Secondary: motif ([AS]-[IL]2-X[DE]-R-[FYW]2-H). Tertiary: domain, module. Structure database.]
Protein sequence Databases
- SWISS-PROT: Protein knowledgebase
- TrEMBL: Computer-annotated supplement to Swiss-Prot
- PIR: Protein Information Resource
- MIPS: Munich Information Centre for Protein Sequences
- NRL-3D: produced by PIR
Frequ.   Species
13049    Homo sapiens (Human)
10132    Mus musculus (Mouse)
5189     Saccharomyces cerevisiae (Baker's yeast)
4847     Escherichia coli
4669     Rattus norvegicus (Rat)
Further frequencies (species not listed): 3665, 2863, 2814, 2750, 2286
[Figure: composite protein sequence databases and their source databases. Composite databases shown: OWL, MIPSX, SP+TrEMBL. Sources shown: SWISS-PROT, TrEMBL, PIR, PIR1-4, PIRMOD, PDB, GenBank, GenPept, NRL-3D, MIPSOwn, MIPSTrn, MIPSH, EMTrans, GBTrans, Kabat, PseqIP, SWISS-PROTupdate, GenPeptupdate.]
Secondary databases
Secondary databases contain pattern data, i.e., diagnostic signatures for protein families. These signatures encode the most highly conserved features of multiply aligned sequences, which are often crucial to the structure or function of the protein.
The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands), which, in sequence alignments, are often apparent as well-conserved motifs.
Patterns are stored as regular expressions, fingerprints, blocks, profiles, etc.
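As a rough illustration of how a regular-expression pattern can act as a diagnostic signature, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans sequences with it. The pattern string is one assumed reading of the motif style used in these slides, the two test sequences are made up, and `prosite_to_regex` is a hypothetical helper that handles only a small subset of the PROSITE syntax.

```python
import re

# One pattern element: a residue, an allowed set [..] or a forbidden set {..},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue in ("x", "X"):          # any residue
            core = "."
        elif residue.startswith("{"):      # any residue except these
            core = "[^" + residue[1:-1] + "]"
        else:                              # one residue or an allowed set
            core = residue
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H")
print(regex)
print(bool(re.search(regex, "MKAILGDRFYHQ")))  # made-up sequence containing the motif
print(bool(re.search(regex, "AVILDRYFH")))     # sequence without the motif
```

A regular expression either matches or does not, which is why fingerprints, blocks and profiles are also used where family members vary more.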
Secondary DB   Primary source    Stored information
PROSITE        SWISS-PROT        Regular expressions (patterns)
Profiles       SWISS-PROT
PRINTS         OWL
BLOCKS         PROSITE/PRINTS
IDENTIFY       BLOCKS/PRINTS
Secondary databases
TRANSFAC: http://transfac.gbf.de
EPD: http://www.epd.isb-sib.ch
InterPro: http://www.ebi.ac.uk/interpro/
PROSITE: http://www.expasy.ch/prosite
BLOCKS: http://blocks.fhcrc.org
PRINTS: ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
PFAM: http://www.sanger.ac.uk/Software/Pfam/index.shtml
ProDom: http://www.toulouse.inra.fr/prodom.html
GeneCards: http://bioinformatics.weizmann.ac.il/cards
ENSEMBL: http://www.ensembl.org
EcoCyc: http://ecocyc.panbio.com/ecocyc/ecocyc.html
Secondary databases
- There is some overlap in content between the secondary databases.
- PDBsum alone has 35,291 entries.
- Pattern DB growth is slow because the addition of detailed family annotation is very time consuming.
- PROSITE and PRINTS are the only comprehensively, manually annotated secondary DBs.
To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro.
Structure Classification Databases
- PDBsum: Protein Data Bank
- CATH: Class, Architecture, Topology, Homology
- SCOP: Structural Classification of Proteins
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
CATH: http://www.biochem.ucl.ac.uk/bsm/cath
DSSP: http://www.sander.ebi.ac.uk/dssp
FSSP: http://www.ebi.ac.uk/dali/fssp
HSSP: http://www.sander.ebi.ac.uk/hssp
Metabolic Databases
A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.
Mapping Databases
OMIM (Online Mendelian Inheritance in Man): http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
RHDB: http://corba.ebi.ac.uk/RHdb
HGBASE: http://hgbase.cgr.ki.se
HAEMA: http://europium.csc.mrc.ac.uk/usr/WWW/WebPages/database.dir/quiz.dir/intrquiz.htm
Literature Databases
PubMed: http://www.ncbi.nlm.nih.gov/entrez/query
Bioinformatics Online: http://www.bioinformatics.oupjournals.org
Nature: http://www.nature.com
Science: http://www.sciencemag.org
Database formats
There is no universally agreed format for genome databases, and several viewers and browsers have been developed with graphical displays for genomic sequence analysis and annotation.
URL
www.sanger.ac.uk/Software/Artemis
www.acedb.org/Tutorial/brieftutorial/shtml
www.ensembl.org/apollo
www.ensembl.org
www.ncbi.nlm.nih.gov
genome.ucsc.edu
Common formats
There are several conventions for representing nucleic acid and protein sequences, of which the following are widely used: NBRF/PIR, FASTA and GDE.
These formats have limited facilities for comments, which must include a unique identifier code and sequence accession number
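To make the FASTA convention concrete, here is a minimal parsing sketch. It assumes the usual layout, in which each record starts with a `>` comment line whose first word is the identifier, followed by one or more sequence lines. The identifiers and sequences in the example are invented, and `parse_fasta` is a hypothetical helper rather than a function from any particular library.

```python
def parse_fasta(text: str) -> dict:
    """Return a dict mapping each record identifier to its full sequence."""
    records = {}
    identifier = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The first word of the comment line is taken as the identifier.
            identifier = line[1:].split()[0]
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)
    # Join the sequence lines of each record into one string.
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """>seq1 hypothetical protein fragment
AVILDRYFH
ACDEFGHIK
>seq2 hypothetical DNA fragment
ATGGCCTAA
"""

sequences = parse_fasta(example)
print(sequences)
```

Because the comment line is free text, tools generally rely on the identifier or accession number at its start, which is why that part has to be unique.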
http://www.pdb.org
Submission of sequences
Sequences may be submitted to any of the three primary databases using the tools provided by the database curators. Such tools include WebIn and BankIt, which can be used over the Internet, and Sequin, a stand-alone application.
WebIn: http://www.ebi.ac.uk/embl/Submission/webin.html
BankIt: http://www.ncbi.nlm.nih.gov/BankIt/
Database interrogation
All the databases discussed above can be searched by sequence similarity. However, detailed text-based searches of the annotations are also possible using tools such as Entrez. The simplest way to cross-reference between the primary nucleotide sequence databases and SWISS-PROT is to search by accession number, as this provides an unambiguous identifier of genes and their products.
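A text-based Entrez search can also be issued programmatically. The sketch below only builds the query URL and does not contact the server. It assumes NCBI's E-utilities `esearch` endpoint, which is not mentioned in the slides, and the database name, search term and field tags are illustrative assumptions; `build_entrez_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not named in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_search_url(database: str, term: str, max_results: int = 20) -> str:
    """Build (but do not send) a text-search URL for one Entrez database."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Illustrative query: the term and its field tags are assumptions.
url = build_entrez_search_url(
    "nucleotide", "insulin[Title] AND Homo sapiens[Organism]", max_results=5
)
print(url)
```

Searching by accession number works the same way, with the accession passed as the search term.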