IInd Sem Class1

Introduction to Bioinformatics
Bioinformatics is a modern discipline that integrates several branches of science, chiefly Biology, Chemistry and Information Technology.
Informatics related to the biological and medical sciences:
- Bioinformatics
- Structural Bioinformatics
- Medical Informatics
- Chemoinformatics
- Pharmacy Informatics
- Clinical Informatics
Bioinformatics has a strong interdisciplinary character. It can be considered a confluence of Biology, Computer Science, Information Technology, Mathematics, Chemistry, Physics and Medicine, with the objectives of developing tools to analyze biological, biochemical and biophysical data and of generating new knowledge in these areas. People trained and skilled across all of these fields are scarce, and if this area is to develop in our country such people will have to be trained.
In other words, bioinformatics is the combination of biology and information technology: a branch of science that deals with the computer-based analysis of large biological data sets.
It incorporates the development of databases to store and search data, and of statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.
DNA -> RNA -> Protein (protein synthesis)
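The DNA, RNA and protein flow named on this slide can be illustrated with a few lines of Python. This is only a sketch: the demo sequence is made up, and the codon table is deliberately partial (it holds just the codons the demo needs, taken from the standard genetic code).

```python
# Minimal sketch of the DNA -> RNA -> protein flow.
# CODON_TABLE is deliberately partial: only the codons used in the demo.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # Raises KeyError for codons outside the partial table.
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


demo_dna = "ATGTTTGGCAAATAA"  # hypothetical sequence, for illustration only
demo_rna = transcribe(demo_dna)
demo_protein = translate(demo_rna)
print(demo_rna, demo_protein)
```

A real analysis would use the full 64-codon table; the partial table keeps the example short.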
COMPUTERS IN BIOLOGY
Development of:
- New scientific methods
- Algorithms for managing large amounts of sequence and structural data
As the full genome sequences of many species and data from structural genomics, microarrays and proteomics became available, integrating these data on a common platform requires sophisticated bioinformatics tools (Sequence -> Structure -> Function). Organizing these data into knowledge databases and developing appropriate software tools to analyze them are going to be major challenges. India, as a major player in the IT industry, has the potential to develop such resources at an affordable cost.
Structural Bioinformatics in Drug Discovery
[Figure: flow chart with the boxes "Target protein sequence", "Homology modeling of target protein", "Crystal structure of target protein", "X-ray", "Virtual library of compounds or QSAR analysis", "Large-scale docking", "Lead identification & lead optimization", "Confirm using crystallography, kinetic analysis" and "Compound development (drug)".]
Fig: Schematic outline of the application of structural bioinformatics (SB; homology modeling) and crystallography (structural molecular biology) in the drug discovery process.
Table: Some important structural bioinformatics databases, resources and tools (S.No., database and its importance, URL)
1. National Center for Biotechnology Information (NCBI): provides a general search for nucleotide sequences, protein sequences, biomolecule 3D structures, genomes, taxonomy or literature.
   http://www.ncbi.nlm.nih.gov/Entrez/
2. Structural Genomics Target Database (sgtdb): 3-D models of all sequences under investigation by structural genomics centers.
   http://spam.sdsc.edu/
3. Structure Comparison Database (CE): pair-wise structure comparisons based on the Combinatorial Extension (CE) algorithm for both a representative set and the complete set of protein structures; includes alignments.
   http://cl.sdsc.edu/ce.html
4. CKAAP DB: database of structures with Conserved Key Amino Acid Positions.
   http://ckaaps.sdsc.edu/perl/browser.pl
5. Protein Data Bank (PDB): the single worldwide source of primary structural data on biological macromolecules determined experimentally.
   http://www.rcsb.org/pdb
6. Extended GO Annotation of PDB Chains: use of structure comparison to extend the coverage of GO terms in the PDB.
   http://spdc.sdsc.edu/
7. PDBbind: designed to provide a collection of experimentally measured binding affinity data (Kd, Ki and IC50) exclusively for the protein-ligand complexes available in the PDB.
   http://www.pdbbind.org/
Bioinformatics Information Resources And Networks
Outline
- EMBnet: European Molecular Biology Network
- NCBI: National Center For Biotechnology Information
- DBs and Tools:
  - Nucleic Acid Sequence Databases
  - Protein Information Resources
  - Metabolic Databases
  - Mapping Databases
  - Databases concerning Mutations
  - Literature Databases
EMBnet: European Molecular Biology Network
- Founded in 1988
- A network that links European laboratories that use biocomputing and bioinformatics in molecular biology research
- A science-based group of collaborating nodes throughout Europe, plus nodes outside Europe
- Provides information, services and training to its users
- Works to increase the availability and accessibility of data resources and computing tools
- Aims to increase knowledge and proficiency in bioinformatics through education and training
EMBnet - Nodes (41 nodes)
- National Nodes (18): governmental
- Specialist Nodes (9): academic, industrial and research centers
- Associate Nodes (11): biocomputing centers from non-European countries
EMBnet - National Nodes
Appointed by the governments. Provide on-line services, user support and training.
Vienna Biocenter - Austria; CSC - Finland; DKFZ - Germany; INCBI - Ireland; IEN-AdR - Italy; Bio - Norway; PEN - Portugal; CNB-CSIC - Spain; SIB - Switzerland; BEN - Belgium; INFOBIOGEN - France; HEN - Hungary; INN - Israel; CMBI - Netherlands; IBB - Poland; GeneBee - Russia; BMC - Sweden; SEQNET - UK
EMBnet - Specialist Nodes
Academic, industrial or research centers in specific areas of bioinformatics. Largely responsible for maintenance of biological databases and software.
MIPS (Munich Information Center for Protein Sequences); ICGEB; Pharmacia; F. Hoffmann-La Roche; EBI; HGMP-RC; Sanger; UCL
EBI, at Hinxton Hall (Cambridge, UK), is a key specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases.
EMBnet - Associate Nodes
Centers from non-European countries.
IBBM - Argentina; ANGIS - Australia; CBI - China; CIGB - Cuba; CDFD - India; SANBI - South Africa; EMBnet - Brazil; CBR - Canada; EMBnet - Chile; EMBnet - Colombia; CIFN - Mexico
EMBnet's Mission
- Assist in biotechnological and bioinformatics-related research
- Provide training and education
- Exploit network infrastructures
- Investigate and develop new technologies
- Bridge between commercial and academic sectors
Who are EMBnet's Users?
- More than 40,000 registered users from all over the world, as well as a larger number of Internet users
- All scientists working in the Life Sciences, from undergraduate students to top-level scientists, in academia as well as industry, can get support from EMBnet
EMBnet's SRS
Sequence Retrieval System (SRS): Network Browser for Databanks in Molecular Biology
- The result of a research project within EMBnet to interrogate together all the resources gathered across the National, Specialist and Associate Nodes
- SRS is a network browser for DBs in molecular biology
- SRS allows any flat-file DB to be indexed to any other
- Queries run across a range of different DB types via a single interface, independent of underlying data structures or query languages
http://srs.embl-heidelberg.de:8000/srs5/
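The idea of indexing one flat-file databank to another can be sketched as follows. This is not SRS itself, only an illustration: the entries and accession codes are invented, and the layout (two-letter line codes, AC and DR lines, "//" as entry terminator) is assumed to follow the EMBL/SWISS-PROT flat-file style.

```python
from collections import defaultdict

# A tiny, made-up protein flat file in an EMBL/SWISS-PROT-like layout:
# two-letter line codes, entries terminated by "//".
PROTEIN_FLATFILE = """\
ID   DEMO1_HUMAN
AC   PDEMO1;
DR   EMBL; XDEMO1;
//
ID   DEMO2_HUMAN
AC   PDEMO2;
DR   EMBL; XDEMO2;
DR   EMBL; XDEMO3;
//
"""


def parse_entries(text):
    """Split a flat file into entries; each entry maps a line code to its values."""
    entries, current = [], defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            entries.append(dict(current))
            current = defaultdict(list)
        elif line.strip():
            code, _, value = line.partition("   ")
            current[code].append(value.strip())
    return entries


def build_link_index(entries):
    """Index nucleotide accession -> protein accessions via the DR (cross-reference) lines."""
    index = defaultdict(list)
    for entry in entries:
        protein_acc = entry["AC"][0].rstrip(";")
        for xref in entry.get("DR", []):
            db, acc = [part.strip() for part in xref.rstrip(";").split(";")][:2]
            if db == "EMBL":
                index[acc].append(protein_acc)
    return dict(index)


link_index = build_link_index(parse_entries(PROTEIN_FLATFILE))
print(link_index["XDEMO3"])
```

Once such an index exists, a query phrased against the nucleotide databank can be answered with protein entries without the user knowing how either file is laid out, which is the point the slide makes about a single interface.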
Databanks indexed by the SRS server:
Data Bank       No. Entries  Indexing Date  Group            Availability
SWISSPROT       163235       10-Jun-2005    Sequence         ok
SWISSNEW        81134        22-Mar-2006    Sequence         ok
NRDB            2269647      29-Mar-2006    Sequence         ok
SWALL           3022528      22-Mar-2006    Sequence         ok
UNIPROT_SPROT   212425       22-Mar-2006    Sequence         ok
UNIPROT_TREMBL  2666963      23-Mar-2006    Sequence         ok
TREMBLNEW       624819       12-Dec-2005    Sequence         ok
TREMBL          2576118      04-Oct-2005    Sequence         ok
SPTREMBL        1449374      16-Jun-2005    Sequence         ok
SPTREMBLNEW     143140       17-Jun-2005    Sequence         ok
REMTREMBL       92182        20-Jun-2005    Sequence         ok
PIR             283416       16-Jun-2005    Sequence         ok
WORMPEP         19538        16-Jun-2005    Sequence         ok
DROSOPHILA      14100        16-Jun-2005    Sequence         ok
EMBLNEW         4035816      21-Nov-2005    Sequence         ok
EMBL            20343598     30-Dec-2005    Sequence         ok
EMBLEST         31990232     06-Jan-2006    Sequence         ok
EMBLWGS         11106060     24-Sep-2005    Sequence         ok
GENBANK         19233264     18-Nov-2005    Sequence         ok
GENBANKEST      31008556     23-Feb-2006    Sequence         ok
REFSEQP         8006         16-Jun-2005    Sequence         ok
SUBTILIST       -            16-Jun-2005    Sequence         ok
PROSITE         1935         22-Mar-2006    SeqRelated       ok
PROSITEDOC      1407         22-Mar-2006    SeqRelated       ok
BLOCKS          4034         16-Jun-2005    SeqRelated       ok
EPD             1375         16-Jun-2005    SeqRelated       ok
ENZYME          4173         16-Jun-2005    SeqRelated       ok
PRINTS          865          16-Jun-2005    SeqRelated       ok
TFSITE          4342         07-Apr-2003    TransFac         ok
TFFACTOR        1799         07-Apr-2003    TransFac         ok
TFCELL          816          07-Apr-2003    TransFac         ok
TFCLASS         27           07-Apr-2003    TransFac         ok
TFMATRIX        246          07-Apr-2003    TransFac         ok
TFGENE          1035         07-Apr-2003    TransFac         ok
PDB             34927        08-Feb-2006    Protein3DStruct  ok
DSSP            30832        22-Nov-2005    Protein3DStruct  ok
HSSP            30369        08-Feb-2006    Protein3DStruct  ok
PDBFINDER       35701        28-Mar-2006    Protein3DStruct  ok
NRL3D           6063         16-Jun-2005    Protein3DStruct  ok
FLYGENES        7556         16-Jun-2005    Genome           ok
FLYREFS         0            07-Apr-2003    Genome           ok
OMIM            17004        18-Oct-2005    Mutations        ok
REPTILIA        8364         18-Jan-2006    Others           ok
NCBI: National Center For Biotechnology Information
- Leading American information provider
- Established in 1988 as a division of the National Library of Medicine (NLM)
- Located on the campus of the National Institutes of Health (NIH) in Bethesda, Maryland
Mission:
- Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease
- Creation of systems for storing and analysing biological information
- Development of advanced methods of computer-based information processing
- Facilitation of user access to DBs and software
- Co-ordination of efforts to gather biotechnology information worldwide
NCBI
- Since 1992, maintains GenBank and collaborates with the international nucleotide DBs EMBL (Europe) and DDBJ (Japan)
- Provides Entrez, which facilitates access to biological DBs (similar to SRS, which is provided by EMBnet)
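As a rough illustration of programmatic access to Entrez, the sketch below builds a search URL for NCBI's public E-utilities interface. The slides do not mention E-utilities, so treat the choice of that interface, the example database ("pubmed") and the search term as assumptions. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the NCBI E-utilities (assumed here; not named in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that asks Entrez for record IDs matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: literature records with "bioinformatics" in the title.
url = esearch_url("pubmed", "bioinformatics[Title]", retmax=5)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return an XML list of matching record identifiers; that step is left out so the example runs offline.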
NCBI - Responsibilities
- Administers research on biomedical problems at the molecular level using mathematical and computational methods
- Maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry and other governmental agencies
- Promotes scientific communication by sponsoring meetings, workshops and lecture series
- Supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
- Engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
- Develops, distributes, supports and coordinates access to a variety of databases and software for the scientific and medical communities
- Develops and promotes standards for databases, data deposition and exchange, and biological nomenclature
Nucleic Acid Sequence Databases
The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported worldwide and exchange new and updated entries on a daily basis.
- EMBL (Europe)
- GenBank (USA)
- DDBJ (Japan)
- ENSEMBL (joint project between EMBL-EBI and the Sanger Institute)
- dbEST (division of GenBank)
- GSDB (division of GenBank)
source: http://www3.ebi.ac.uk/Services/DBStats/
Nucleic Acid Sequence Databases - EMBL

This morning the EMBL Database contained 127,450,085,130 nucleotides in 69,666,551 entries. Breakdown by entry type:
Entry type                    Entries      Nucleotides
Standard                      56,843,150   61,498,109,356
Constructed (CON)             497,187      n/a
Third Party Annotation (TPA)  4,884        334,827,880
Whole Genome Shotgun (WGS)    12,318,618   64,837,183,592
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.
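The totals quoted on this slide allow two quick derived figures: the mean length of an entry and the share of all nucleotides that come from whole genome shotgun (WGS) entries. The short calculation below uses only the numbers from the slide; the variable names are illustrative.

```python
# Figures copied from the slide's EMBL statistics.
total_nucleotides = 127_450_085_130
total_entries = 69_666_551
wgs_nucleotides = 64_837_183_592

mean_entry_length = total_nucleotides / total_entries  # nucleotides per entry
wgs_share = wgs_nucleotides / total_nucleotides        # fraction of all nucleotides

print(f"mean entry length: {mean_entry_length:.0f} nucleotides")
print(f"WGS share of nucleotides: {wgs_share:.1%}")
```

The mean works out to a little over 1,800 nucleotides per entry, and WGS data account for roughly half of the nucleotides even though they make up under a fifth of the entries.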

[Figure: plots of the number of entries (current 69,666,551) and total nucleotides (current 127,450,085,130).]
Ref: EMBL Nucleotide Sequence Database: developments in 2005, Nucleic Acids Research, 2006, Vol. 34, D10-D15
[Figure: database content by nucleotide count for Homo sapiens, Bos taurus, Macaca mulatta, Mus musculus, Canis familiaris, Loxodonta africana, Rattus norvegicus, Monodelphis domestica, Pan troglodytes, Danio rerio and other species.]
Nucleic Acid Sequence Databases - GenBank

- GenBank, which is produced at NCBI, is split into smaller, discrete divisions. This facilitates fast, specific searches by restricting queries to particular database subsets.
- During 1992-1997, the level of EST and STS data within GenBank grew 10-fold. Even so, the overall sequence information contributed by such partial data was still less than that of the higher-quality sequences in the other major divisions.
Specialised Genomic Resources

In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources. These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.
- SGD: Saccharomyces Genome Database
- UniGene: gene-oriented clusters from GenBank
- TIGR: databases of The Institute for Genomic Research
- ACeDB: A C. elegans DataBase
Specialised Genomic Databases

SGD (Saccharomyces Genome Database): a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.
http://genome-www.stanford.edu/Saccharomyces
AceDB (A C. elegans DataBase)
http://www.acedb.org (C. elegans)
FlyBase (A Database of Drosophila Genes & Genomes)
http://flybase.bio.indiana.edu (fruit fly)
MGD (Mouse Genome Database)
http://www.informatics.jax.org (mouse)
Protein Information Resources

Levels of protein sequence and structural organisation:
- Primary: the primary structure of a protein is its amino acid sequence.
- Secondary: the secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).
- Tertiary: the tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.
Principles of Protein Structure
Primary structure: a sequence written in the 20-letter amino acid alphabet ACDEFGHIKLMNPQRSTVWY.
Protein Information Resources

Each level of organisation has its own kind of database:
- primary: sequence, e.g. AVILDRYFH (primary database)
- secondary: motif, e.g. [AS]-[IL]2-X[DE]-R-[FYW]2-H (secondary database)
- tertiary: domain, module (structure database)
Primary Protein Databases

The primary structure of a protein is its amino acid sequence. These sequences are stored in primary databases as linear alphabets that denote the constituent residues.
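Because a primary-database sequence is just a string over the 20-letter amino acid alphabet, simple checks are easy to write. The sketch below validates a sequence against that alphabet and counts its residues; the function names are illustrative, and the example peptide is the one shown on the earlier slide.

```python
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def invalid_residues(sequence: str) -> set:
    """Return the characters in `sequence` that are not standard residues."""
    return set(sequence.upper()) - AMINO_ACIDS


def composition(sequence: str) -> Counter:
    """Count how often each residue occurs in `sequence`."""
    return Counter(sequence.upper())


peptide = "AVILDRYFH"
print(invalid_residues(peptide))   # an empty set means the sequence is valid
print(composition(peptide)["A"])   # number of alanine residues
```

Real databases also use a few extra codes (for example for ambiguous or unknown residues), which this minimal check would flag as invalid.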
Protein sequence databases:
- SWISS-PROT: protein knowledgebase
- TrEMBL: computer-annotated supplement to Swiss-Prot
- PIR: Protein Information Resource
- MIPS: Munich Information Centre for Protein Sequences
- NRL-3D: produced by PIR
Protein Sequence Databases

- Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references.
- Total number of species represented in Swiss-Prot: 9,520
- The average sequence length in Swiss-Prot is 362 amino acids.
- Swiss-Prot is the most highly annotated protein sequence DB.
Table of the most represented species
No.  Frequency  Species
1    13049      Homo sapiens (Human)
2    10132      Mus musculus (Mouse)
3    5189       Saccharomyces cerevisiae (Baker's yeast)
4    4847       Escherichia coli
5    4669       Rattus norvegicus (Rat)
6    3665       Arabidopsis thaliana (Mouse-ear cress)
7    2863       Schizosaccharomyces pombe (Fission yeast)
8    2814       Bacillus subtilis
9    2750       Caenorhabditis elegans
10   2286       Drosophila melanogaster (Fruit fly)
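The Swiss-Prot figures on this slide can be cross-checked with a little arithmetic: the average sequence length follows from the total amino acids and the entry count, and the species table shows how concentrated the database is. The numbers below are copied from the slide; the variable names are illustrative.

```python
entries = 197_228
amino_acids = 71_501_181
# Entry counts for the ten most represented species, in table order.
top10 = [13049, 10132, 5189, 4847, 4669, 3665, 2863, 2814, 2750, 2286]

mean_length = amino_acids / entries   # average residues per entry
human_share = top10[0] / entries      # fraction of entries from Homo sapiens
top10_share = sum(top10) / entries    # fraction from the ten listed species

print(f"mean sequence length: {mean_length:.1f}")
print(f"human share: {human_share:.1%}, top-10 share: {top10_share:.1%}")
```

The computed mean lies between 362 and 363 residues, consistent with the quoted average of 362, and the ten listed species account for roughly a quarter of all entries.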
Composite Protein Sequence Databases

- Composite databases amalgamate a variety of different primary databases.
- They render sequence searching much more efficient, because they obviate the need to interrogate multiple resources.
- Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures.
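A minimal sketch of the amalgamation step is shown below. It assumes the simplest possible redundancy criterion, namely that two entries are redundant when their sequences are identical, and it keeps the copy from whichever source is listed first. The accession codes and sequences are invented, and real composite databases apply their own, more elaborate criteria.

```python
def merge_nonredundant(sources):
    """Merge sources in priority order, dropping exact duplicate sequences."""
    merged, seen = [], set()
    for source_name, records in sources:
        for accession, sequence in records:
            key = sequence.upper()
            if key in seen:
                continue  # an identical sequence was already taken from an earlier source
            seen.add(key)
            merged.append((source_name, accession, sequence))
    return merged


# Hypothetical records standing in for two primary databases.
swissprot_like = [("SP_DEMO1", "MKTAYIAK"), ("SP_DEMO2", "MSLLTEVE")]
pir_like = [("PIR_DEMO1", "MKTAYIAK"), ("PIR_DEMO2", "MGDVEKGK")]

composite = merge_nonredundant([("SWISS-PROT", swissprot_like), ("PIR", pir_like)])
for record in composite:
    print(record)
```

Changing the order of the sources changes which copy of a duplicated sequence survives, which is one reason different composite databases built from the same inputs can differ.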
Composite databases and their primary sources:
- NRDB (Non-Redundant DataBase): PDB, SWISS-PROT, PIR, GenPept, SWISS-PROTupdate, GenPeptupdate
- OWL: SWISS-PROT, PIR, GenBank, NRL-3D
- MIPSX: PIR1-4, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP
- SP+TrEMBL: SWISS-PROT, TrEMBL
Secondary databases
- Secondary databases contain pattern data, i.e., diagnostic signatures for protein families.
- These signatures encode the most highly conserved features of multiply aligned sequences, which are often crucial to the structure or function of the protein.
- The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands), which, in sequence alignments, are often apparent as well-conserved motifs.
- Patterns are stored as regular expressions, fingerprints, blocks, profiles, etc.
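To make the "regular expression" kind of signature concrete, the sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The converter handles only a small subset of the syntax, and the pattern is the motif from the earlier slide rewritten in that syntax ("x" for any residue, "(2)" for a repeat), which is an interpretation rather than something the slides state. The test sequence is made up.

```python
import re

# One pattern element: optional "<" anchor, a residue spec, optional repeat, optional ">" anchor.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a (subset of) PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


motif = "[AS]-[IL](2)-x-[DE]-R-[FYW](2)-H"
regex = prosite_to_regex(motif)
hit = re.search(regex, "MKALIGDRFWHTT")  # hypothetical sequence containing the motif
print(regex, hit.group(0) if hit else None)
```

Read this way, the motif requires two consecutive I/L residues straight after the first position, so the example sequence AVILDRYFH from the earlier slide would not match it; the slide's notation may have been intended more loosely.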
Secondary databases
Secondary DB  Primary source   Stored information
PROSITE       SWISS-PROT       Regular expressions (patterns)
Profiles      SWISS-PROT       Weighted matrices (profiles)
PRINTS        OWL              Aligned motifs (fingerprints)
BLOCKS        PROSITE/PRINTS   Aligned motifs (blocks)
IDENTIFY      BLOCKS/PRINTS    Fuzzy regular expressions (patterns)
Secondary databases
TRANSFAC http://transfac.gbf.de
EPD http://www.epd.isb-sib.ch
InterPro http://www.ebi.ac.uk/interpro/
PROSITE http://www.expasy.ch/prosite
BLOCKS http://blocks.fhcrc.org
PRINTS ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
PFAM http://www.sanger.ac.uk/Software/Pfam/index.shtml
ProDom http://www.toulouse.inra.fr/prodom.html
GeneCards http://bioinformatics.weizmann.ac.il/cards
ENSEMBL http://www.ensembl.org
EcoCyc http://ecocyc.panbio.com/ecocyc/ecocyc.html
Secondary databases
- There is some overlap in content between the secondary databases.
- PDBsum alone has 35,291 entries.
- Pattern DB growth is slow because the addition of detailed family annotation is very time consuming.
- PROSITE and PRINTS are the only comprehensively, manually annotated secondary DBs.
- To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro.
Structure Classification DBs

Contain 3D structures available from crystallographic and spectroscopic studies.
- PDBsum: Protein Data Bank
- CATH: Class, Architecture, Topology, Homology
- SCOP: Structural Classification of Proteins
PDB
http://www.rcsb.org
SCOP
http://scop.mrc-lmb.cam.ac.uk/scop
CATH
http://www.biochem.ucl.ac.uk/bsm/cath
DSSP
http://www.sander.ebi.ac.uk/dssp
FSSP
http://www.ebi.ac.uk/dali/fssp
HSSP
http://www.sander.ebi.ac.uk/hssp
Metabolic Databases
A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.
KEGG (Kyoto Encyclopedia of Genes and Genomes)

http://www.genome.ad.jp/kegg
ENZYME (Enzyme nomenclature database)

http://www.expasy.ch/enzyme
BRENDA (Enzyme Information System)

http://www.brenda.uni-koeln.de
EMP (Enzymes and Metabolic Pathways database)

http://www.empproject.com
Mapping Databases
OMIM
(Online Mendelian Inheritance in Man)
http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
GDB (The GDB Human Genome Database)

http://www.gdb.org
RHDB
http://corba.ebi.ac.uk/RHdb
Databases concerning Mutations

dbSNP
http://www.ncbi.nlm.nih.gov/SNP
HGBASE
http://hgbase.cgr.ki.se
The SNP Consortium (TSC)

http://snp.cshl.org
HAEMA
http://europium.csc.mrc.ac.uk/usr/WWW/WebPages/database.dir/quiz.dir/intrquiz.htm
Literature Databases
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query
Bioinformatics Online
http://www.bioinformatics.oupjournals.org
Nature
http://www.nature.com
Science
http://www.sciencemag.org
Database tools for displaying and annotating genomic sequence data

Viewer           URL
Artemis          www.sanger.ac.uk/Software/Artemis
ACeDB            www.acedb.org/Tutorial/brieftutorial/shtml
Apollo           www.ensembl.org/apollo
EnsEMBL          www.ensembl.org
NCBI map viewer  www.ncbi.nlm.nih.gov
GoldenPath       genome.ucsc.edu
Database formats
There is no universally agreed format for genome databases, and several viewers and browsers have been developed with graphical displays for genomic sequence analysis and annotation.
Common formats
There are several conventions for representing nucleic acid and protein sequences, of which the following are widely used:
- NBRF/PIR
- FASTA
- GDE
These formats have limited facilities for comments, which must include a unique identifier code and sequence accession number.
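Of the formats listed, FASTA is the simplest: each record is a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal reader for that layout; the identifiers and sequences in the example are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """\
>DEMO0001 hypothetical peptide
AVILDRYFH
ACDEFGHIK
>DEMO0002 another hypothetical entry
MKTAYIAK
"""

for header, sequence in parse_fasta(EXAMPLE):
    identifier = header.split()[0]  # first word of the comment line is the identifier
    print(identifier, len(sequence))
```

The single header line is the only place for comments, which is why the identifier and accession number have to be packed into it.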
Formats for multiple sequence alignment

There are separate formats for multiple sequence alignment representation, of which the following are popular:
- MSF
- PHYLIP
- ALN
Files of structural data

Structural data are maintained as flat files using the PDB format. Such files contain orthogonal atomic coordinates together with annotations, comments and experimental details.
http://www.pdb.org
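In a PDB flat file the atomic coordinates sit in fixed-width ATOM (and HETATM) records, so they can be read by slicing columns. The sketch below parses one such record; the column positions follow the published PDB format, while the example line and the `Atom` class are illustrative and only a subset of the fields is extracted.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record into an Atom."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38, orthogonal X in angstroms
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Illustrative record (not taken from a specific PDB entry).
EXAMPLE_LINE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom_line(EXAMPLE_LINE)
print(atom)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch with no space between them.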
Submission of sequences
Sequences may be submitted to any of the three primary databases using the tools provided by the database curators. Such tools include WebIn and BankIt, which can be used over the Internet, and Sequin, a stand-alone application.
http://www.ebi.ac.uk/embl/Submission/webin.html
http://www.ncbi.nlm.nih.gov/BankIt/
Database interrogation
- All the databases discussed above can be searched by sequence similarity.
- However, detailed text-based searches of the annotations are also possible using tools such as Entrez.
- The simplest way to cross-reference between the primary nucleotide sequence databases and SWISS-PROT is to search by accession number, as this provides an unambiguous identifier of genes and their products.
