PH Dsharmaji


Statement of Purpose (SoP) /Research Proposal – PhD Bioinformatics/ Computational Biology

1. Name of the student: K S PRAHARSHIT SHARMA sharmaji@iscb.org +917995158734

2. Title + objectives: An Integrated Analysis Of DiCodon Usage Bias (DCUB) and Epitope-
Mapping (EM) across Broad LCR/ CBZ Spectrum Accommodated By HyperProteoGenome

Objectives: To arrive at holistic compositional, substitutional, algebraic and Boolean one-to-one
correlations, including computational rule equivalence based on the 256 simple ECA (Elementary
Cellular Automata), toward universally resolving the pertinent SSF (Sequence-Structure-Function)
4-dimensional space-time paradigm within the context of macromolecular genetic-coding
translation from cDNA to proteins, at quantum many-body resolution (namely amino acids), and
in-silico prediction of DCUB + EM via NGS-MS coupling (proteogenomics datasets, based on
multi-omic integration of Next Generation Sequencing and Mass Spectrometry experiments). The
HyperProteoGenomic equation is a natural extrapolation of the following ECA-QCA
computational rule equivalence (Elementary Cellular Automata - Quaternary Cellular Automata):

Genetic code 64 triplets 4-fold expansion


Whereby a 3-bit Ts-Tv (transitions/transversions) mutation schema (R/Y = puRines/pYrimidines)
is both necessary and sufficient to map onto a no-neighborhood point cDNA mutation schema
(A|C|G|T). Naturally, encoding with 4 nt (nucleotides, A|C|G|T) on the left-hand side and 20 aa
(twenty amino acids) on the right, we then need to “solve” for x in the HyperProteoGenomic (HPG)
equation:

3. Short context (related to research topic) The Universal Genetic Code is represented by
codons, which translate mRNA (cDNA) to proteins. Degeneracy and evolutionary pressures
encompassing tRNA isoacceptors, TATA-box/Pribnow-box and Shine-Dalgarno sequences lead to
favoring the recruitment of one particular codon over another to code for the same amino acid. This
phenomenon is codon bias. DCUB, computed in terms of dicodon usage frequency, is instead given
by a 64 x 64 = 4,096-element vector, which has been shown to capture a more global picture of gene
sequences than codon-bias measures alone. In this context, codon optimization is impactful since it
can be used to increase the translational efficiency of a gene and improve expression of the gene of
interest according to the host organism’s DCUB (rather than its codon bias). We justify the choice of
DCUB herein: solving for x in the 2nd equation above, interestingly, x is 99.9455% close to e
(Napier’s constant), given that every “default log” above (base unspecified) is to the base e, that is,
a natural logarithm. DCUB must then inevitably come into the picture, since we ought to interrogate
the threshold between the 5th and 6th “Dicon” (dicodon) positions with respect to the inequation
shown below:

implicitly reckoning that,

namely, the first 2 terms of the Maclaurin series expansion for e = 2.71828 (to 5 decimal places of
numerical approximation).
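
As an illustration of the 64 x 64 = 4,096 dicodon usage vector described in this section, the following is a minimal Python sketch. It assumes that dicodons are counted as overlapping in-frame pairs of adjacent codons and that hexamers containing ambiguous bases are ignored; the function name dicodon_usage and the toy sequence are hypothetical and not taken from any existing tool.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"
CODONS = ["".join(c) for c in product(NUCLEOTIDES, repeat=3)]  # 64 codons
DICODONS = [a + b for a in CODONS for b in CODONS]              # 64 x 64 = 4,096


def dicodon_usage(cds: str) -> list[float]:
    """Return the 4,096-element dicodon frequency vector of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon, if any
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Assumption: count overlapping in-frame pairs of adjacent codons.
    pairs = Counter(a + b for a, b in zip(codons, codons[1:]))
    # Keep only hexamers over A/C/G/T (pairs containing e.g. N are ignored).
    counts = {d: pairs.get(d, 0) for d in DICODONS}
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DICODONS]


# Toy example (hypothetical sequence): ATG GCC ATG GCC TAA
vector = dicodon_usage("ATGGCCATGGCCTAA")
print(len(vector), vector[DICODONS.index("ATGGCC")])
```

Normalizing by the total number of observed dicodons gives relative frequencies; a study comparing genes of different lengths might prefer a different normalization.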

4. Problem: research question to be researched.


The middle (QCA) term involving 4 in the 1st equation under Section 2 (Objectives), if expressed as:

which has been our motivating factor to express the RHS above in KLD (Kullback–Leibler
Divergence) form. This leads us to conjecture that the triplet (3-tuple) codon (base/radix = 4) is but a
special case involving uniform mono-nucleotide background frequencies, whereas for extreme GC
content, as in the AT-rich genome of Plasmodium falciparum, the “Codome-core-length” tends to e
(Napier’s constant). This is inspired by Amos Bairoch’s (Founder of SIB: Swiss Institute of
Bioinformatics) review paper “The 20 years of PROSITE”, wherein the authors declare that “The
average size of PROSITE patterns is 20 amino acids”. The HPG accommodates the entire plausible
broad spectrum of LCRs/CBZs (Low Complexity Regions/Compositionally Biased Zones), spanning
from exactly 20 extremely low-complexity 20-aa (amino-acid) LCRs, represented as A(20), C(20),
..., Y(20), up to 20! = 2432902008176640000 completely complex PROSITE patterns. In each such
permuted set of polypeptide-string data structures, the observed amino-acid residue frequency of
occurrence based on the “Estimated Independent Counts Model” = (1/20) = 0.05, as per the formula,

The work seeks to predict epitopes purely from a genetic-coding perspective for generalized
vaccine design; e.g., the core length of “linear T-cell epitopes” is a nonamer (9 residues), as evident
from the KLD (relative entropy) conservation equation with respect to information content:
The bioinformatics tools envisaged to “map” dicodons are fasta36 (with ktup = 6 nt, which
fortunately happens to be the default ktup for the FASTA sequence alignment tool); to discern the
PROSITE patterns with respect to UniProt, ScanProsite (https://prosite.expasy.org/scanprosite/); for
validating the epitopes, IMGT and IEDB (Immune Epitope DataBase & analysis resource,
https://www.iedb.org/) as well as the NetCTL 1.2 Server (http://www.cbs.dtu.dk/services/NetCTL/);
and, most importantly, for binarized ECA mapping (http://atlas.wolfram.com/01/01/), Wolfram
Mathematica v.12.3 (https://www.wolfram.com/mathematica/new-in-12/). Noting that indeed,

That is, 10 (tending to 11) complete ECA cycles (88 rules within each being fundamentally inequivalent) suffice to map
onto the HPG. This might potentially be validated experimentally using Hi-C sequencing, by dynamically estimating the
eccentricities of the 4,096 dicodons from within-CDS contact maps, coupled with RNA-folding studies of cDNA
hexamers and ribosome profiling, and by studying PPIs (Protein-Protein Interactions) and inferring interologs. The
information content is a measure of the degree of conservation and has a value between zero (no conservation; all
amino acids are equally probable) and log2(20) = 4.32 bits (full conservation; only a single amino acid is observed at
that position). Ultimately, the research question concerns the subtle relationship between optimization of genetic
coding and immunoinformatics, radix determinism and weight prioritization of individual amino-acid residues, which
holds the promise of leading us to formulate algebraic and Boolean “rules” to achieve robust one-to-one T-cell
epitope mapping. Bioinformatics tools needed along the way will be drawn on demand from
https://www.startbioinfo.org/ and https://bio.tools/, and databases from http://bigd.big.ac.cn/databasecommons/
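
To make the stated bounds on information content concrete (zero for no conservation, log2(20) = 4.32 bits for full conservation), here is a minimal Python sketch that computes the information content of one alignment column as the KLD of the observed residue frequencies against a uniform background of 1/20. The function name information_content is hypothetical, and no small-sample correction is applied; both are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(AMINO_ACIDS)  # uniform background frequency, 0.05


def information_content(column: str) -> float:
    """Information content (bits) of one alignment column.

    Computed as the Kullback-Leibler divergence of the observed residue
    frequencies from a uniform background (assumption: no pseudocounts).
    """
    counts = Counter(column.upper())
    n = sum(counts[aa] for aa in AMINO_ACIDS)
    if n == 0:
        raise ValueError("column contains no standard amino acids")
    bits = 0.0
    for aa in AMINO_ACIDS:
        p = counts[aa] / n
        if p > 0:
            bits += p * math.log2(p / BACKGROUND)
    return bits


# Fully conserved column versus a column with every residue seen once
print(round(information_content("AAAAAAAAA"), 2))
print(round(information_content(AMINO_ACIDS), 2))
```

With real alignments, the uniform background would normally be replaced by observed background frequencies, which changes the upper bound per residue.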

5. References related to research topic [Metareferences within Below URL may be referred to]:-

https://www.itsoc.org/profile/9590
Books: https://www.wolframscience.com/nks/ , https://www.springer.com/gp/book/9789811316388

SUMMARY:
To express the gist of the HyperProteoGenome using square substitution matrices such as the
PAM, BLOSUM, WAG and JTT matrices, the RHS can be re-expressed as shown below:

In the above identity, uniform frequencies are assumed for the amino-acid substitutions, upon which
certain perturbations may be introduced based on observed protein primary-structure datasets. As
far as PAM matrix reconstruction/recalibration is concerned, we need at least 85% sequence
identity among equal-length protein sequence alignments, which means that a minimum of 17 out
of 20 amino acids must match the input (by convention arranged, neighborless, in alphabetical
order): A C D E F G H I K L M N P Q R S T V W Y
That is to say, we need to “sample” the RHS of the HPG (20^20 polypeptide motifs of length 20)
such that the maximum Hamming distance between the two sequences of any pair is 20 minus 17 = 3,
to obtain PAM statistics.
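
The 85% identity criterion above (at least 17 of 20 positions identical, i.e. Hamming distance at most 3) can be sketched in Python as follows. The function names and the two variant sequences are hypothetical; integer arithmetic is used for the threshold so that the boundary case of exactly 85% is not affected by floating-point rounding.

```python
REFERENCE = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids in alphabetical order


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def meets_identity(a: str, b: str, min_identity_percent: int = 85) -> bool:
    """True if the two sequences share at least the given percent identity."""
    matches = len(a) - hamming(a, b)
    return matches * 100 >= min_identity_percent * len(a)


# Hypothetical variants of the reference with 3 and 4 substitutions
variant_3 = "GGGEFGHIKLMNPQRSTVWY"
variant_4 = "GGGGFGHIKLMNPQRSTVWY"
print(hamming(REFERENCE, variant_3), meets_identity(REFERENCE, variant_3))
print(hamming(REFERENCE, variant_4), meets_identity(REFERENCE, variant_4))
```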

Coming back to the exact one-to-one mapping of binary-encoded transitions/transversions (Ts/Tv)
to point cDNA mutations, we can take an exploratory detour toward characterizing this identity
using the 2000 dinucleotide properties from the DiProDB server in Germany
(http://diprodb.fli-leibniz.de/ShowTable.php), and split the nonameric linear T-cell epitopes into 3
tripeptides, which might also let us address the issue of conformational epitopes using up to 8000
tripeptide properties from the AU-KBC server in Chennai:
http://www.au-kbc.org/research_areas/bio/projects/protein/tri.html

On another note, the motivation for choosing the Dicon/dicodon as the fundamental unit linking
genetic coding to immunoinformatics stems from the concept of Hamming-distance inequations
applied to k-mers from k = 1 to k = 7, as surveyed at:
https://github.com/bioinformer/DicodonFactorial
It is important to note that the following simplest subset of 20 extreme LCRs (a single amino acid tandemly repeated
20 times), which are indeed present in protein sequences, needs to be extended up to the full set of 20^20 motifs as
part of this PhD task.

A - {Ala} - Alanine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Ala) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5515248

C - {Cys} - Cysteine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Cys) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5515219

D - {Asp} - Aspartate 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Asp) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5515255

E - {Glu} - Glutamate 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Glu) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5515301

F - {Phe} - Phenylalanine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Phe) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5515316

G - {Gly} - Glycine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Gly) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5516815

H - {His} - Histidine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20His) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5516823

I - {Ile} - Isoleucine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Ile) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5516879

K - {Lys} - Lysine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Lys) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518572

L - {Leu} - Leucine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Leu) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518617

M - {Met} - Methionine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Met) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518645

N - {Asn} - Asparagine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Asn) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518865

P - {Pro} - Proline 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Pro) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518887

Q - {Gln} - Glutamine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Gln) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518903

R - {Arg} - Arginine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Arg) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518924

S - {Ser} - Serine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Ser) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5518999

T - {Thr} - Threonine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Thr) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5519051

V - {Val} - Valine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Val) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5519124

W - {Trp} - Tryptophan 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Trp) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5519138

Y - {Tyr} - Tyrosine 20-mer Basis LCR/ CBZ: Low Complexity Regions (or) Compositionally Biased Zones in Real
Protein Sequences (Version HPG20Tyr) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5519147
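
As a rough illustration of how the 20 homopolymeric 20-mer basis LCRs listed above could be located in a protein sequence, here is a minimal Python sketch. It relies on plain substring search, and the function name find_homopolymer_lcrs and the example sequence are hypothetical; it is not the procedure used to build the Zenodo datasets.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MOTIF_LENGTH = 20

# The 20 extreme LCRs: A(20), C(20), ..., Y(20)
HOMOPOLYMERS = {aa: aa * MOTIF_LENGTH for aa in AMINO_ACIDS}

# Size of the full motif space (20^20) and of the permutation subset (20!)
TOTAL_MOTIFS = len(AMINO_ACIDS) ** MOTIF_LENGTH
PERMUTATION_MOTIFS = math.factorial(MOTIF_LENGTH)


def find_homopolymer_lcrs(protein: str) -> dict[str, list[int]]:
    """Return 0-based start positions of each homopolymeric 20-mer found.

    Overlapping hits are reported, so a run of 21 residues yields two starts.
    """
    protein = protein.upper()
    hits: dict[str, list[int]] = {}
    for aa, motif in HOMOPOLYMERS.items():
        positions = []
        start = protein.find(motif)
        while start != -1:
            positions.append(start)
            start = protein.find(motif, start + 1)
        if positions:
            hits[aa] = positions
    return hits


# Hypothetical sequence containing a run of 21 glutamines
example = "M" + "Q" * 21 + "PLK"
print(find_homopolymer_lcrs(example))
print(TOTAL_MOTIFS, PERMUTATION_MOTIFS)
```

Enumerating all 20^20 motifs explicitly is not feasible in memory, so any extension beyond this basis set would need sampling or indexed search rather than a dictionary of motifs.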
