Microbial Diagnosis

Bioinformatics and Computational Biology
Supervisor: Dr. Ebrahim Zaghloul
• Hassan Magdy Abdallah
• Esraa Basher Abdelgawad
• Khaled Emad Nabil

First Paper
Identifying viruses from metagenomic data by deep learning

Outline
• What is the microbiome?
• Metagenomics
• Next-Generation Sequencing (NGS)
• Introduction
• Challenges
• Dataset
• Methodology
• DeepVirFinder
• Metavirome datasets
• Results
• Conclusion
What is the microbiome?
- "Micro" is a prefix denoting something tiny or microscopic. Here it refers
to microorganisms: tiny organisms such as bacteria, viruses, and fungi.
- "Biome" describes the ecosystem or environment in which the
microorganisms live.
- This environment could be a person, soil, or a river.
https://link.springer.com/article/10.1007/s40484-019-0187-4
What is the microbiome?
- Human microbiome: the microorganisms that reside in or on the
human body. It includes the gut microbiome, skin microbiome,
oral (mouth) microbiome, and those of other body sites. The human
microbiome plays crucial roles in health and disease.
Microbiome vs microbiota vs metagenome
- Microbiome: the microorganisms (and their genes) living in a specific environment
- Microbiota: the microorganisms (by type) living in a specific environment
- Metagenome: the genes of the microorganisms in a specific environment
Metagenomics
- Metagenomics is the application of high-throughput
genomic techniques to the study of communities of
microbial organisms.
- "High-throughput" techniques for metagenomic
analysis are methods that let scientists analyze
the genetic material (DNA or RNA) of an entire microbial
community in a sample quickly and efficiently.
Metagenomics
- Sequencing is the process of determining the nucleotide
sequence of DNA or RNA molecules extracted from the complex
mixture of microorganisms within a sample. It provides
information about the genetic composition and diversity
of the microbial community present in the sample.
Metagenomics
Shotgun Sequencing:
- Sequence all DNA
- More information
- Higher complexity
- Higher cost
Amplicon Sequencing:
- Sequence only a specific gene
- No functional information
- Less complex to analyse
- Cheaper
What do sequencing technologies offer for metagenomics?
Next-Generation Sequencing (NGS):
NGS refers to a group of high-throughput
sequencing technologies that enable rapid and cost-effective sequencing of
DNA or RNA. These technologies allow for the simultaneous sequencing of
millions to billions of DNA fragments in parallel, making them ideal for studying
complex biological samples like microbial communities.
[Figure: flow cell]
Next-Generation Sequencing (NGS): output data
A FASTQ file is a text file that contains the sequencing data (reads), with a per-base quality score for each read.
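To make the FASTQ layout concrete, here is a minimal Python sketch of a reader. It assumes the common four-lines-per-record layout (header starting with "@", sequence, "+" separator, quality string) and Phred+33 quality encoding; the function names and the two example reads are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality

def phred33(quality):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality]

# Two made-up reads, held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n")
records = list(read_fastq(example))
```

The same generator works on a real file by passing an open text handle in place of the in-memory example.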
Introduction
• DeepVirFinder is a method for predicting viral sequences in metagenomic data.
• DeepVirFinder is a reference-free and alignment-free method: it does not require a
reference database of known viral sequences or time-consuming alignment steps.
• DeepVirFinder was trained on a large dataset of viral sequences discovered before May 2015. When
evaluated on sequences discovered after that date, it outperformed the state-of-the-art method,
VirFinder, at all contig lengths.
• DeepVirFinder was then applied to real human gut metagenomic samples from patients with colorectal
carcinoma (CRC).
Challenges
• Focus on cultivable hosts:
• Existing reference databases used by tools like Kraken, Centrifuge, and MetaPhlAn are biased towards
viruses with cultivable hosts.
• These databases primarily contain viruses whose hosts can be grown in the lab, neglecting the vast
majority that cannot.
• Limited representation:
• Consequently, these databases offer an incomplete picture of viral diversity in the human gut.
• It is estimated that only about 15% of gut viruses have any similarity to known viruses in the databases.
• This highlights the need for new methods, like DeepVirFinder described earlier, that can identify viruses
independently of existing databases. Such approaches are crucial for uncovering the unseen majority of
viruses residing within our guts.
Dataset
• We collected 2,314 reference genomes of viruses infecting prokaryotes
(bacteria and archaea) from NCBI.
• The dataset was partitioned into three parts based on the dates when the
genomes were discovered.
• We used the genomes discovered before January 2014 for training, those
between January 2014 and May 2015 for validation, and those after May
2015 for testing.
• Partitioning the dataset by date not only avoids overlap between the
training, validation, and test datasets, but also helps to evaluate the
method's ability to predict future unknown viruses based on
previously known viruses.
https://www.ncbi.nlm.nih.gov/genome/browse
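The date-based partition can be sketched in a few lines of Python. This is only an illustration: the exact cut-off days (1 January 2014 and 1 May 2015) are assumptions, since the slides give months only, and the genome names and dates are hypothetical.

```python
from datetime import date

TRAIN_END = date(2014, 1, 1)   # discovered before this date -> training
VALID_END = date(2015, 5, 1)   # assumption: "May 2015" is read as 1 May 2015

def assign_split(discovered):
    """Map a discovery date to 'train', 'validation' or 'test'."""
    if discovered < TRAIN_END:
        return "train"
    if discovered < VALID_END:
        return "validation"
    return "test"

# Hypothetical genome records: (name, discovery date).
genomes = [
    ("virus_A", date(2012, 6, 3)),
    ("virus_B", date(2014, 9, 17)),
    ("virus_C", date(2016, 2, 1)),
]
splits = {name: assign_split(discovered) for name, discovered in genomes}
```

Because every genome falls into exactly one interval, no sequence can appear in more than one of the three sets.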
Dataset continued...
• We updated the dataset to include new viruses discovered after May 2015, and it was
natural to use them as test data.
• Since sequences in real metagenomic data are of various lengths,
ranging from hundreds to thousands of base pairs, we fragmented the
genomes into non-overlapping sequences of different sizes, L = 150, 300,
500, 1000, and 3000 bp.
• We built a separate model for each sequence size.
• The viral sequences are paired with the same number of prokaryotic sequences,
fragmented from RefSeq and partitioned by the exact same dates.
https://www.ncbi.nlm.nih.gov/genome/browse
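The fragmentation step described above can be illustrated with a short Python sketch. It assumes that a trailing piece shorter than L is simply discarded, which the slides do not specify; the toy genome is made up.

```python
def fragment(genome, length):
    """Split a genome into consecutive non-overlapping pieces of exactly `length` bases."""
    # Assumption: a leftover tail shorter than `length` is dropped.
    return [genome[i:i + length] for i in range(0, len(genome) - length + 1, length)]

toy_genome = "ACGTACGTAC"          # 10 bases, a stand-in for a real genome
pieces = fragment(toy_genome, 4)   # two pieces of 4 bases; the last 2 bases are dropped
```

In the setting of the slides the same function would be called once per size, for L in (150, 300, 500, 1000, 3000).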
Methodology
• We used deep learning techniques to develop a powerful framework
for predicting viral sequences.
• Given a query sequence, the framework gives a score between 0 and 1;
a larger score indicates a higher probability of being a viral
sequence.
• Previously, k-mer frequencies were used as features to build a logistic
regression model with Lasso regularization.
• The success of that method confirmed that viruses and their hosts have
different k-mer usage.
• Those k-mers can be generalized as motifs, which can be described using
a position weight matrix (PWM).
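As an illustration of the k-mer features mentioned above, the following Python sketch computes the relative frequency of every k-mer over the alphabet A, C, G, T. It is a simplified stand-in, not the feature extraction of any particular tool; the function name and the example sequence are assumptions.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequence, k):
    """Return the relative frequency of each of the 4**k possible k-mers."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts[kmer] / total for kmer in all_kmers}

freqs = kmer_frequencies("ACGTACGT", 2)   # 7 overlapping 2-mers in total
```

The resulting vector has a fixed length of 4**k regardless of sequence length, which is what makes it usable as input to a regression model.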
Methodology continued…
• We expected that using motifs as features could increase the model's
flexibility and produce a better prediction model.
• Thus, we designed a convolutional neural network (CNN,
ConvNet) that extracts motif intensities from sequences and uses them
as features for prediction.
• The advantage of the CNN is that it simultaneously learns the motif
patterns and the prediction function from the data, so there is no need to
specify motifs beforehand.
• We call our method DeepVirFinder.
DeepVirFinder
• The model consists of a convolutional layer, a pooling layer, a fully
connected layer, and several dropout layers.
• We first encoded the DNA sequence of length L using one-hot
encoding, with (A, C, G, T) coded as (1000, 0100, 0010, 0001), respectively,
resulting in a 4 × L matrix.
• A convolutional layer with rectifier (ReLU) activation contains M motifs of length f;
each motif scans the sequence from beginning to end, producing a
series of motif intensities through the convolution operation.
• Thus, the output of the convolutional layer is an M × (L − f + 1) matrix.
• A max pooling layer reduces the dimension by keeping only the highest
intensity for each motif, resulting in an M × 1 matrix.
• A dropout regularization layer of rate 0.1 follows, which reduces overfitting
by randomly dropping a few dimensions during training.
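The shapes described above can be checked with a small pure-Python sketch of one-hot encoding, a ReLU convolution, and max pooling. It is not the authors' implementation: the two motifs are hand-written for illustration rather than learned, and dropout and the dense layer are left out.

```python
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def one_hot(sequence):
    """Encode a DNA string as L one-hot columns (a 4 x L matrix stored column by column)."""
    return [ONE_HOT[base] for base in sequence]

def conv_relu(encoded, motifs):
    """Scan each motif along the sequence; returns M rows of L - f + 1 intensities."""
    feature_map = []
    for motif in motifs:                       # a motif is a list of f weight columns
        f = len(motif)
        row = []
        for start in range(len(encoded) - f + 1):
            window = encoded[start:start + f]
            score = sum(w * x
                        for w_col, x_col in zip(motif, window)
                        for w, x in zip(w_col, x_col))
            row.append(max(0.0, score))        # rectifier (ReLU) activation
        feature_map.append(row)
    return feature_map

def global_max_pool(feature_map):
    """Keep only the highest intensity of each motif (M x 1 output)."""
    return [max(row) for row in feature_map]

# Hypothetical hand-written motifs (M = 2, f = 2): one fires on "AC", one on "GG".
motif_ac = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
motif_gg = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0)]

encoded = one_hot("ACGTAC")                    # L = 6
feature_map = conv_relu(encoded, [motif_ac, motif_gg])
pooled = global_max_pool(feature_map)
```

With L = 6 and f = 2 each motif yields L − f + 1 = 5 intensities, and pooling reduces the 2 × 5 map to one value per motif.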
DeepVirFinder continued…
• The subsequent layer is a dense layer containing M fully connected neurons, with
an output of dimension M × 1.
• After another dropout layer of rate 0.1, the output is summarized using a sigmoid
function to generate a prediction score ranging from 0 to 1.
• Since DNA is double stranded, and the contigs in real data can come
from either strand, the prediction score should be identical whether the forward
strand or the reverse strand is provided as input.
• Thus, we apply the same network to the reverse complement of the original
sequence, and the final prediction score is the average of the predictions for the
original and the reverse complement sequences.
• The network is trained by minimizing the binary cross-entropy loss through
back-propagation, using the Adam optimization algorithm for stochastic gradient descent
with learning rate 0.001.
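The strand-averaging idea and the loss can be sketched as follows. The `toy_model` function is a hypothetical stand-in for the trained network (it just applies a sigmoid to a count of A's), used only to show that averaging over both strands gives the same score whichever strand is supplied.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]

def strand_symmetric_score(model, sequence):
    """Average the model's score on a sequence and on its reverse complement."""
    return 0.5 * (model(sequence) + model(reverse_complement(sequence)))

def binary_cross_entropy(label, score, eps=1e-12):
    """Loss for one example; label is 1 (viral) or 0 (prokaryotic)."""
    score = min(max(score, eps), 1.0 - eps)    # keep log() away from 0
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))

def toy_model(sequence):
    # Hypothetical placeholder for the network: sigmoid of (number of A's - 2).
    return 1.0 / (1.0 + math.exp(-(sequence.count("A") - 2)))

forward = strand_symmetric_score(toy_model, "AAACG")
backward = strand_symmetric_score(toy_model, reverse_complement("AAACG"))
```

Here `toy_model` alone gives different scores for the two strands, while the averaged score is identical for both.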
Metavirome datasets
• To achieve high prediction accuracy, a deep learning algorithm needs a large
amount of training data.
• Though a large number of training sequences were obtained from RefSeq, there is
potential to enlarge the training dataset by including viral sequences from
metavirome sequencing data.
• Metavirome sequencing targets mainly viruses by removing prokaryotic cells from
samples using physical 0.22 µm filters.
• Metavirome sequencing does not rely on culturing viruses in the lab, so a wide
diversity of viruses can be captured, including unknown viruses.
• A few studies have used this technique to extract viruses and sequence viral
genomes from the human gut and the ocean.
• Norman et al. sequenced viruses in human gut samples from IBD patients using
Illumina sequencing technology.
Metavirome datasets continued…
• We collected the metavirome samples from those studies, aiming to add more
viral diversity, especially unknown viruses, to the training data.
• We were careful with quality control of the samples, because a sample
can be contaminated by prokaryotic DNA: the physical filters may not exclude
small prokaryotic cells.
• The details of metavirome data preparation and quality control can be found in the
supplementary materials and Table S1.
• Up to 1.3 million sequences were generated from the metavirome data, and they
were then combined with sequences derived from viral RefSeq genomes discovered before May 2015 for
training.
• The new model was evaluated and compared with the original model trained
on RefSeq only, using the test sequences from RefSeq after May 2015.
Results
• In convolutional neural networks, two critical parameters, the length of the motifs (or
filters) and the number of motifs, determine the complexity of the model.
• To find the best parameter settings, we trained the model with different combinations of
the two parameters using the data before January 2014, and evaluated the AUROC on the
validation dataset.
• We studied motif lengths ranging from 4 to 18 and numbers of motifs of 100,
500, 1000, and 1500.
• We observed that as motif length increased from 4 to 8, the validation AUROC increased
rapidly.
• The highest AUROC was achieved when the motif length was around 10, and the value stayed
at the same level as the motif length increased further.
• For example, for the model trained with 500 bp sequences and 1000 motifs,
the validation AUROC increased from 0.7747 to 0.9464 as motif
length increased from 4 to 8, and reached its highest value of 0.9496 at motif
length 10.
• The trend is similar for all other sequence sizes and numbers of motifs. Thus, we set
the motif length to 10 in the final model.
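Since model selection here relies on AUROC, a small Python sketch of the metric may help. It uses the pairwise definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half); the labels and scores are made-up numbers, not results from the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up validation labels (1 = viral) and prediction scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.6]
value = auroc(labels, scores)
```

The pairwise form is quadratic in the number of examples, so it suits small illustrations; a rank-based computation would be the usual choice for large validation sets.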
Results
DeepVirFinder outperforms VirFinder
• We compared the performance of the new model, DeepVirFinder, with that of VirFinder.
• VirFinder uses k-mer frequencies as sequence features and trains a regularized
logistic regression model to predict viral sequences.
• To make a fair comparison, both methods were trained on sequences discovered before May
2015 and assessed on data from after May 2015.
• DeepVirFinder outperformed VirFinder at all sequence lengths: the ROC curve for
DeepVirFinder is always higher than that for VirFinder.
• The improvement in AUROC is more pronounced for short sequences of length < 1000
bp.
Conclusion
• Viruses play crucial roles in regulating microbial communities, but our knowledge of
viruses has long been held back by the limitations of traditional experimental techniques and
computational methods.
• We developed a powerful method, DeepVirFinder, to identify viral sequences in
metagenomic data. To the best of our knowledge, it is the first time that deep learning
techniques have been used for problems in metagenomics.
• Built on deep learning techniques and trained with a large number of viral
sequences, DeepVirFinder outperformed the state-of-the-art method, VirFinder, at all
contig lengths.
• Enlarging the training data with additional millions of viral sequences from
environmental samples further improved the prediction accuracy for the
under-represented viral groups.
Second Paper
Benchmark of structured machine learning methods for microbial identification from
mass-spectrometry data
MALDI-TOF MS?
(Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry)
• The technology has matured into a genuinely paradigm-breaking
technology in microbiology,
• allowing a microorganism to be characterized quickly, cheaply, and efficiently.
• It can be used to identify a microorganism by matching its spectrum against a reference
database of annotated fingerprints.

History of MALDI-TOF MS
• MALDI-TOF MS can be traced back to the late 1980s when it was
first developed and introduced by Franz Hillenkamp and Michael
Karas at the University of Münster in Germany.

How does MALDI-TOF MS work?
Sample Preparation: A small amount of the bacterial colony is mixed with a matrix compound. This
matrix absorbs the laser energy and helps in the desorption and ionization of the bacterial
molecules.
Laser Irradiation: The sample is then irradiated with a laser beam. The laser energy causes the matrix molecules
and the bacterial molecules to desorb and ionize.
Ionization: As a result of laser irradiation, the desorbed molecules from the bacterial colony are ionized,
creating positively charged ions.
Acceleration: The positively charged ions are accelerated in an electric field towards the detector.
Time-of-Flight Measurement: The accelerated ions travel through a flight tube, where they are separated based on their
mass-to-charge ratio (m/z). The time taken for ions of different masses to reach the
detector is measured.
Detection: As the ions reach the detector at different times based on their mass-to-charge ratio, the
detector records the arrival time of the ions. This data is used to generate a mass spectrum.
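The link between arrival time and m/z can be made explicit with an idealized model: an ion of charge z accelerated through a voltage U gains kinetic energy z·e·U = ½·m·v², then drifts a length d at constant speed, so t = d·sqrt(m / (2·z·e·U)). The Python sketch below applies this formula; the 20 kV voltage and 1 m tube length are assumed example values, not instrument settings from the slides.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms per dalton

def flight_time(mass_da, charge=1, voltage=20_000.0, tube_length=1.0):
    """Idealized drift time in seconds for an ion in a linear TOF analyzer."""
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage   # joules gained in the field
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)    # metres per second
    return tube_length / velocity

t_light = flight_time(2_000.0)   # a 2 kDa singly charged ion
t_heavy = flight_time(8_000.0)   # a 8 kDa singly charged ion
```

In this model flight time grows with the square root of m/z, so an ion four times heavier takes twice as long to reach the detector.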
Gram staining
Purple = Gram-positive
Pink = Gram-negative
Named after Hans Christian Gram

Gram-positive vs Gram-negative bacteria
Gram-Positive Bacteria: These bacteria have a thick layer of peptidoglycan
in their cell walls, which retains the crystal violet
dye used in the staining process. As a result, Gram-positive
bacteria appear purple or blue under a
microscope after staining.
Gram-Negative Bacteria: These bacteria have a thinner layer of peptidoglycan in their cell
walls, which is surrounded by an outer membrane. During the
staining process, the outer membrane is disrupted and the crystal
violet dye is washed out. Gram-negative bacteria take up the
counterstain (safranin) and appear pink or red under a microscope.
MicroMass dataset. This table describes the content of the MicroMass dataset in terms of
bacterial genera and species. It also gives the number of bacterial strains and
mass spectra for each species.
GRAM+ Tree
GRAM- Tree
SVM (support vector machine)
SVM one-vs-all (OvA)
SVM one-vs-one (OvO)
Cost-Sensitive Multiclass SVMs:
In practical applications like microbial identification, different classification errors can
have varying impacts. For example, mistaking one class for another may have different
consequences based on the context. Cost-sensitive classifiers are designed to distinguish
between various types of classification errors and penalize them differently during the
learning process.
In TreeLoss, one such approach, the loss between two classes is the length of the shortest path
connecting them in a hierarchical tree, reflecting how closely related the
classes are. This ensures that the classifier's decisions align with the specific
requirements of the application, enhancing its effectiveness in practical settings.
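A tree-based loss of this kind can be sketched in Python as the number of edges on the shortest path between the true and the predicted class. The toy taxonomy below is hypothetical and only illustrates the computation; it is not the hierarchy used in the paper.

```python
# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {
    "species_a1": "genus_A",
    "species_a2": "genus_A",
    "species_b1": "genus_B",
    "genus_A": "root",
    "genus_B": "root",
}

def path_to_root(node):
    """List the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def tree_loss(true_class, predicted_class):
    """Number of edges on the shortest path between two classes in the tree."""
    up_true = path_to_root(true_class)
    depth_in_true = {node: steps for steps, node in enumerate(up_true)}
    for steps, node in enumerate(path_to_root(predicted_class)):
        if node in depth_in_true:              # lowest common ancestor reached
            return steps + depth_in_true[node]
    raise ValueError("classes are not in the same tree")
```

Confusing two species of the same genus costs 2 here, while confusing species from different genera costs 4, so errors between distant classes are penalized more.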
Cascade Support Vector Machines
Cascade Support Vector Machines (C-SVMs) do not involve drawing
decision trees in the traditional sense. Instead of using a tree-like structure to
make decisions based on features, C-SVMs employ a hierarchical approach
to train multiple SVM models on subsets of the data, with each model
focusing on different aspects of the classification task.
For example, in a classification problem involving species taxonomy, the
cascade could start by distinguishing between broad categories like mammals
and birds, then proceed to differentiate between specific mammalian orders,
and so on.
$f(x) = \operatorname{sign}(w_1 \cdot x + b_1)$
Cascade Support Vector Machines
Dendrogram-SVMs (DSVMs):
• Hierarchical Classification: Utilizes a dendrogram structure to represent hierarchical relationships between classes.
• Traversal and Prediction: Classification involves traversing the dendrogram from the root to the appropriate leaf node based on sample
characteristics.
• SVM at Each Internal Node: In the dendrogram-based SVM formulation, a binary SVM is trained at each internal node of the
dendrogram and decides which branch a sample follows.
• Final Prediction: Obtained when the sample reaches a leaf node, where the corresponding class label is assigned.
Cascade SVMs:
• Sequential Classification: Employs a sequential cascade structure with each node representing a separate SVM model.
• Hierarchical Refinement: Each SVM model in the cascade refines the classification based on decisions made by previous SVM models.
• Binary Classification: Performs binary classification at each node, deciding between two classes based on a decision boundary.
• Final Prediction: Made based on sequential decisions made by all SVM models in the cascade.
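The sequential decisions of a cascade can be sketched as a walk down a tree in which every internal node applies a linear decision of the form sign(w · x + b). The weights, the two-feature input, and the class names below are hand-picked assumptions for illustration; in practice each node's SVM would be trained on data.

```python
def sign_decision(weights, bias, x):
    """Linear decision sign(w . x + b), returned as +1 or -1."""
    value = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if value >= 0 else -1

# Hypothetical two-level cascade. Each internal node stores
# (weights, bias, child taken on -1, child taken on +1); leaves are class names.
CASCADE = {
    "root": ([1.0, 0.0], 0.0, "gram_negative", "gram_positive"),
    "gram_positive": ([0.0, 1.0], -1.0, "species_p1", "species_p2"),
    "gram_negative": ([0.0, 1.0], 1.0, "species_n1", "species_n2"),
}

def cascade_predict(x, node="root"):
    """Follow binary decisions from the root until a leaf (class label) is reached."""
    while node in CASCADE:
        weights, bias, negative_child, positive_child = CASCADE[node]
        if sign_decision(weights, bias, x) == 1:
            node = positive_child
        else:
            node = negative_child
    return node
```

For instance, `cascade_predict([2.0, 3.0])` first takes the positive branch at the root and then the positive branch at the second level. An error made at the root cannot be corrected further down, which is the usual trade-off of cascade designs.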
Adenine pairs with Thymine (A with T)
Guanine pairs with Cytosine (G with C)
What does the code
translate to?
Each gene's code uses A, C, G and T in various combinations
to specify which amino acid is needed at each position
within a protein.
Amino acids are the building blocks of proteins. This means
our DNA codes for different proteins that perform specific
functions in our body.
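How combinations of A, C, G and T specify amino acids can be shown with a short Python sketch that reads a DNA string three bases (one codon) at a time, using the standard genetic code with stop codons written as "*". The example sequence is made up, and the sketch assumes the coding strand is given in frame.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA coding sequence codon by codon; incomplete trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

protein = translate("ATGGCCAAATAA")   # codons ATG, GCC, AAA, TAA
```

Each three-letter codon maps to one amino acid, so the four codons in the example give methionine, alanine, lysine and a stop signal.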
Where do our genes
hang out?
Genes are segments of DNA located on
structures called chromosomes.
Chromosomes are long, thread-like structures made up of
DNA and proteins. They carry many genes and are found
inside the nucleus of our cells.
FUN FACT
Humans typically have 46
chromosomes, which come in
pairs, with 23 inherited from
each parent.
Why do we need genes?
We need genes because they contain the
instructions for building and maintaining our
bodies.
They also play a crucial role in passing traits
from parents to children through
reproduction.
FUN FACT
The Human Genome Project's finished
sequence put the number of human
protein-coding genes at an estimated
20,000 to 25,000.
GENETICS ELEMENTS
References
• First Paper:
• https://pubmed.ncbi.nlm.nih.gov/34084563/
• https://link.springer.com/article/10.1007/s40484-019-0187-4
• https://www.ncbi.nlm.nih.gov/genome/browse
• Second Paper:
• https://arxiv.org/pdf/1506.07251
