Microbial Diagnosis
Microorganisms — bacteria, viruses, and fungi — live in and on the human body.
https://link.springer.com/article/10.1007/s40484-019-0187-4
What is the microbiome?
Microbiome vs microbiota vs metagenome
Metagenomics
Shotgun Sequencing:
- Sequence all DNA
- More information
- Higher complexity
- Higher cost
Amplicon Sequencing:
- Sequence only specific gene
- No functional information
- Less complex to analyse
- Cheaper
What sequencing technologies offer for metagenomics?
Next-Generation Sequencing (NGS):
NGS refers to a group of high-throughput
sequencing technologies that enable rapid and cost-effective sequencing of
DNA or RNA. These technologies allow for the simultaneous sequencing of
millions to billions of DNA fragments in parallel, making them ideal for studying
complex biological samples like microbial communities.
Next-Generation Sequencing (NGS):
Flow cell
Next-Generation Sequencing (NGS): output data
A FASTQ file is a text file that contains sequencing data (reads): each read is stored as four lines, an identifier, the base sequence, a separator, and per-base quality scores.
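The four-line record layout described above can be parsed with a few lines of code; this is a minimal illustrative sketch, not a production FASTQ reader (it assumes well-formed, unwrapped records):

```python
def read_fastq(lines):
    """Yield (read_id, sequence, quality) triples from FASTQ text.
    Each record is four lines: @id, sequence, '+' separator, quality string."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                 # '+' separator line, ignored here
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual
```

For example, feeding it the lines `@r1`, `ACGT`, `+`, `IIII` yields one record with id `r1`.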
Introduction
• DeepVirFinder is a reference-free and alignment-free method, which means that it does not require a database of known viral genomes or alignment against reference sequences to make predictions.
• DeepVirFinder was trained on a large dataset of viral sequences discovered before May 2015.
• DeepVirFinder was then applied to real human gut metagenomic samples from patients with colorectal carcinoma (CRC).
Challenges
• Limited Representation:
• These databases primarily contain information on viruses that can be grown in labs, neglecting the vast majority that cannot.
• Consequently, these databases offer an incomplete picture of viral diversity in the human gut.
• It's estimated that only 15% of gut viruses have any similarity to known viruses in the databases.
• This highlights the need for new methods, like DeepVirFinder described earlier, that can identify viruses
independent of existing databases. These novel approaches are crucial for uncovering the unseen majority of
viruses residing within our guts.
Dataset
• The dataset was partitioned into three parts based on the dates when the
genomes were discovered.
• We used the genomes discovered before January 2014 for training, those
between January 2014 and May 2015 for validation, and those after May
2015 for testing.
• The partitioning of the dataset not only avoids the overlaps between the
training, validation, and test datasets, but also helps to evaluate the
method’s ability for predicting future unknown viruses based on the
previously known viruses.
https://www.ncbi.nlm.nih.gov/genome/browse
Dataset continued...
• We updated the dataset to include new viruses discovered after May 2015, and it was natural to use them as test data.
Methodology
• The success of the method confirmed that viruses and their hosts have different k-mer usage.
• A max pooling layer reduces the dimension by keeping only the highest
intensity for each motif, resulting in an M × 1 matrix.
• The subsequent layer is a dense layer containing M fully connected neurons with
the output matrix of dimension M × 1.
• With another dropout layer of rate 0.1, the output is summarized using a sigmoid
function to generate a prediction score ranging from 0 to 1.
• Since the DNA sequence is double stranded, and contigs in real data can come from both strands, the prediction score should be identical whether the forward or the reverse strand is provided as the input.
• Thus, we apply the same network to the reverse complement of the original
sequence, and the final prediction score is the average of the predictions from the
original and the reverse complement sequences.
• The network is trained by minimizing the binary cross-entropy loss through back-propagation, using the Adam stochastic gradient descent optimizer with a learning rate of 0.001.
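The strand-averaging and loss computations above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation; `network` stands in for the trained CNN and can be any function mapping a sequence to a raw score:

```python
import math

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def sigmoid(x):
    """Squash a raw network output into a (0, 1) prediction score."""
    return 1.0 / (1.0 + math.exp(-x))

def strand_averaged_score(network, seq):
    """Average the scores on the sequence and its reverse complement,
    so the prediction is identical for either strand."""
    return 0.5 * (network(seq) + network(reverse_complement(seq)))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss minimized during training (y_true is 1 for virus, 0 for host)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))
```

Because the reverse complement of a reverse complement is the original sequence, `strand_averaged_score` gives the same value regardless of which strand is passed in.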
DeepVirFinder continued…
Metavirome datasets
• Though a large number of training sequences were obtained from RefSeq, there is
a potential to enlarge the training dataset by including viral sequences from
metavirome sequencing data.
• Metavirome sequencing does not rely on culturing viruses in the lab so that a wide
diversity of viruses can be captured including unknown viruses.
• A few studies have used this technique to extract viruses and sequence viral genomes in the human gut and the ocean.
• Norman et al. sequenced viruses in human gut samples from IBD patients using Illumina sequencing technology.
Metavirome datasets continued…
• We collected the metavirome samples from those studies and aimed to add more
viral diversity, especially unknown viruses, to the training data.
• We were careful with quality control of the samples because they can be contaminated by prokaryotic DNA, since the physical filters may not exclude small prokaryotic cells.
• The details of preparation of metavirome data and quality control can be found in the
supplementary materials and Table S1.
• Up to 1.3 million sequences were generated from the metavirome data, and they were then combined with sequences derived from viral RefSeq genomes released before May 2015 for training.
• The new model was evaluated and compared with the original model trained based
on RefSeqs only, using the test sequences from RefSeqs after May 2015.
Results
• In convolutional neural networks, two critical parameters, the length of motifs (or filters) and the number of motifs, determine the complexity of the model.
• To find the best parameter settings, we trained the model with different combinations of
the two parameters using the data before January 2014, and evaluated AUROC on the
validation dataset.
• We studied motif lengths ranging from 4 to 18 and numbers of motifs of 100, 500, 1000, and 1500.
• We observed that as motif length increased from 4 to 8, the validation AUROC increased
rapidly.
• The highest AUROC was achieved when the motif length was around 10, and the value stayed at the same level as the motif length further increased.
• For example, for the model trained with 500 bp sequences, when fixing the model at 1000 motifs, the validation AUROC increased from 0.7747 to 0.9464 as the motif length increased from 4 to 8, and achieved the highest value of 0.9496 when the motif length was 10.
• The trend is similar for all other sequence sizes and numbers of motifs. Thus, we set
the motif length as 10 in the final model.
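The AUROC used above for model selection can be computed directly from labels and prediction scores via the rank-sum (Mann-Whitney) formulation; a minimal sketch, not the authors' evaluation code:

```python
def auroc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example outscores a randomly chosen negative one (ties
    count as 1/2).  O(P*N) pairwise version, fine for small examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative labels")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a fully inverted one gives 0.0, and random scores give about 0.5.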
Results
DeepVirFinder outperforms VirFinder
• We compared the performance of the new model DeepVirFinder with that of VirFinder.
• VirFinder used k-mer frequencies as features for sequences and trained a regularized
logistic regression model to predict viral sequences.
• To make a fair comparison, both methods were trained using the sequences before May
2015 and assessed on data after May 2015.
• DeepVirFinder outperformed VirFinder at all sequence lengths, where the ROC curve for
DeepVirFinder is always higher than that for VirFinder.
• The improvement in AUROC is more remarkable for short sequences of length < 1000
bp.
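The k-mer frequency features that VirFinder feeds into its regularized logistic regression can be sketched as follows (an illustrative reimplementation, not the published code):

```python
from collections import Counter

def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer occurring in the sequence;
    a VirFinder-style model uses these frequencies as the feature
    vector for a regularized logistic regression classifier."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}
```

For instance, "ATATAT" with k=2 contains five 2-mers: "AT" three times and "TA" twice, giving frequencies 0.6 and 0.4.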
Conclusion
• Viruses play crucial roles in regulating microbial communities, but our knowledge of viruses has long been limited by traditional experimental techniques and computational methods.
• Built on deep learning techniques and trained with a large number of viral sequences, DeepVirFinder outperformed the state-of-the-art method, VirFinder, at all contig lengths.
• Enlarging the training data with an additional million viral sequences from environmental samples further improved the prediction accuracy for under-represented viral groups.
Second Paper
MALDI-TOF MS, a mass-spectrometry-based identification technology in microbiology
Sample Preparation: You take a small amount of the bacterial colony and mix it with a matrix compound. This matrix absorbs the laser energy and helps in the desorption and ionization of the bacterial molecules.
Laser Irradiation: The sample is then irradiated with a laser beam. The laser energy causes the matrix molecules and the bacterial molecules to desorb and ionize.
Ionization: As a result of laser irradiation, the desorbed molecules from the bacterial colony are ionized, creating positively charged ions.
Acceleration: The positively charged ions are accelerated in an electric field towards the detector.
Time-of-Flight Measurement: The accelerated ions travel through a flight tube, where they are separated based on their mass-to-charge ratio (m/z). The time taken for ions of different masses to reach the detector is measured.
Detection: As the ions reach the detector at different times based on their mass-to-charge ratio, the detector records the arrival time of ions. This data is used to generate a mass spectrum.
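The time-of-flight separation follows from basic kinematics: an ion of mass m and charge z accelerated through potential U gains kinetic energy zeU = ½mv², so its flight time over a tube of length L is t = L·√(m / (2zeU)), meaning heavier ions arrive later. A small idealized sketch (ignoring real-instrument corrections such as delayed extraction):

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge, coulombs
AMU = 1.66053906660e-27     # atomic mass unit, kilograms

def flight_time(mass_amu, charge, voltage, tube_length):
    """Idealized time of flight in seconds:
    z e U = 1/2 m v^2  =>  t = L * sqrt(m / (2 z e U))."""
    m = mass_amu * AMU
    return tube_length * math.sqrt(m / (2 * charge * E_CHARGE * voltage))
```

Doubling the mass multiplies the flight time by √2, which is exactly how ions of different m/z become separated along the tube.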
History of MALDI-TOF MS
Gram-positive bacteria: These bacteria have a thick layer of peptidoglycan in their cell walls, which retains the crystal violet dye used in the staining process. As a result, Gram-positive bacteria appear purple or blue under a microscope after staining.
Gram-negative bacteria: These bacteria have a thinner layer of peptidoglycan in their cell walls, which is surrounded by an outer membrane. During the staining process, the outer membrane is disrupted, and the crystal violet dye is washed out. Gram-negative bacteria take up the counterstain (safranin) and appear pink or red under a microscope.
MicroMass dataset. This table describes the MicroMass dataset content, in terms of used
bacterial genera and species. It also provides information on the number of bacterial strains and
mass spectra for each species.
GRAM+ Tree
GRAM- Tree
SVM (Support Vector Machine)
SVM One-vs-All (OvA)
SVM One-vs-One (OvO)
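In the one-vs-all scheme, one binary SVM is trained per class, and a sample is assigned to the class whose decision function w·x + b scores highest. A toy inference sketch with made-up linear classifiers (not from the paper):

```python
def ova_predict(classifiers, x):
    """classifiers: {class_label: (w, b)}, one binary linear decision
    function per class; predict the label with the largest score w.x + b."""
    def score(w, b):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(classifiers, key=lambda c: score(*classifiers[c]))
```

One-vs-one instead trains a classifier for every pair of classes and takes a majority vote, which is why OvO needs K(K-1)/2 models versus K for OvA.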
Cost-Sensitive Multiclass SVMs:
In practical applications like microbial identification, different classification errors can
have varying impacts. For example, mistaking one class for another may have different
consequences based on the context. Cost-sensitive classifiers are designed to distinguish
between various types of classification errors and penalize them differently during the
learning process.
In TreeLoss, a specific approach mentioned in the paper, the loss function is based on the shortest path connecting the classes in a hierarchical tree structure, reflecting the relatedness of classes. This approach ensures that the classifier's decisions align with the specific requirements of the application, enhancing its effectiveness in practical settings.
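A TreeLoss-style misclassification cost can be computed as the shortest-path length between two class nodes in the hierarchy. A minimal sketch over a hypothetical taxonomy (the parent map and species names below are illustrative, not the MicroMass hierarchy):

```python
def tree_distance(parent, a, b):
    """Shortest-path length between nodes a and b in a tree given as a
    child -> parent map; confusing nearby classes costs less than
    confusing distant ones."""
    def ancestors(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    pa, pb = ancestors(a), ancestors(b)
    depth = {n: i for i, n in enumerate(pa)}
    for j, n in enumerate(pb):
        if n in depth:           # first common ancestor
            return depth[n] + j  # steps up from a plus steps down to b
    raise ValueError("nodes are not in the same tree")
```

Under this cost, mistaking one Gram-negative species for another is penalized less than mistaking it for a Gram-positive one, which is exactly the behavior a cost-sensitive classifier should encourage.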
Cascade Support Vector Machines
Cascade Support Vector Machines (C-SVMs) do not involve drawing
decision trees in the traditional sense. Instead of using a tree-like structure to
make decisions based on features, C-SVMs employ a hierarchical approach
to train multiple SVM models on subsets of the data, with each model
focusing on different aspects of the classification task.
For example, in a classification problem involving species taxonomy, the
cascade could start by distinguishing between broad categories like mammals
and birds, then proceed to differentiate between specific mammalian orders,
and so on.
f(x) = sign(w₁ ⋅ x + b₁)
Cascade Support Vector Machines
Dendrogram-SVMs (DSVMs):
• Hierarchical Classification: Utilizes a dendrogram structure to represent hierarchical relationships between classes.
• Traversal and Prediction: Classification involves traversing the dendrogram from the root to the appropriate leaf node based on sample
characteristics.
• No Sequential SVMs: Does not employ separate SVM models at each node; classification is based on the hierarchical structure of the
dendrogram.
• Final Prediction: Obtained when the sample reaches a leaf node, where the corresponding class label is assigned.
Cascade SVMs:
• Sequential Classification: Employs a sequential cascade structure with each node representing a separate SVM model.
• Hierarchical Refinement: Each SVM model in the cascade refines the classification based on decisions made by previous SVM models.
• Binary Classification: Performs binary classification at each node, deciding between two classes based on a decision boundary.
• Final Prediction: Made based on sequential decisions made by all SVM models in the cascade.
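The sequential refinement described above can be sketched as a chain of binary decisions, each narrowing the candidate set until a leaf label is reached. An illustrative toy (decision functions and labels are made up, not the paper's models):

```python
def cascade_predict(node, x):
    """Walk a cascade of binary deciders: each internal node is a tuple
    (decide, left, right) where decide(x) -> bool chooses the branch;
    leaves are final class labels (strings)."""
    while not isinstance(node, str):
        decide, left, right = node
        node = left if decide(x) else right
    return node
```

In practice each `decide` would be a trained binary SVM's sign(w·x + b); here simple threshold functions stand in for them.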
Adenine pairs with Thymine
(A with T)
Amino acids are the building blocks for proteins. This means
our DNA codes for different proteins that perform specific
functions in our body.
Where do our genes hang out?
Genes are segments of DNA located on
structures called chromosomes.
• First Paper:
• https://pubmed.ncbi.nlm.nih.gov/34084563/
• https://link.springer.com/article/10.1007/s40484-019-0187-4
• https://www.ncbi.nlm.nih.gov/genome/browse
• Second Paper:
• https://arxiv.org/pdf/1506.07251