Kraken2: Manual and Usage Specifications

Wood et al.
Genome Biology (2019) 20:257

https://doi.org/10.1186/s13059-019-1891-0
SHORT REPORT Open Access
Improved metagenomic analysis with Kraken 2
Derrick E. Wood1,2, Jennifer Lu2,3 and Ben Langmead1,2*
Abstract
Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data,
its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by
reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while
maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode,
providing increased sensitivity in viral metagenomics analysis.
Keywords: Metagenomics, Metagenomics classification, Microbiome, Probabilistic data structures, Alignment-free
methods, Minimizers
Assigning taxonomic labels to sequencing reads is an important part of many computational genomics pipelines for metagenomics projects. Recent years have seen several approaches to accomplish this task in a time-efficient manner [1–3]. One such tool, Kraken [4], uses a memory-intensive algorithm that associates short genomic substrings (k-mers) with the lowest common ancestor (LCA) taxa. Kraken and related tools like KrakenUniq [5] have proven highly efficient and accurate in independent tool comparisons [6, 7]. But Kraken’s high memory requirements force many researchers to either use a reduced-sensitivity MiniKraken database [8, 9] or to build and use many indexes over subsets of the reference sequences [10, 11]. Its memory requirements can easily exceed 100 GB [7], especially when the reference data includes large eukaryotic genomes [12, 13]. Here, we introduce Kraken 2, which provides a major reduction in memory usage as well as faster classification, a spaced seed searching scheme, a translated search mode for matching in amino acid space, and continued compatibility with the Bracken [14] species-level sequence abundance estimation algorithm.

Kraken 2 addresses the issue of large memory requirements through two changes to Kraken 1’s data structures and algorithms. While Kraken 1 used a sorted list of k-mer/LCA pairs indexed by minimizers [15], Kraken 2 introduces a probabilistic, compact hash table to map minimizers to LCAs. This table uses one third of the memory of a standard hash table, at the cost of some specificity and accuracy. Additionally, Kraken 2 only stores minimizers (of length ℓ, ℓ ≤ k) from the reference sequence library in its data structure, whereas Kraken 1 stored all k-mers. This change means that, during classification, the minimizer (ℓ-mer) is the substring compared against a reference set in Kraken 2, while Kraken 1 compared k-mers (Fig. 1a, b). Kraken 2’s index for a specific reference database with 9.1 Gbp of genomic sequences uses 10.6 GB of memory when classifying. Kraken 1’s index for the same reference uses 72.4 GB of memory for classification (Fig. 2a, Additional file 1: Table S1). In general, a Kraken 2 database is about 85% smaller than a Kraken 1 database over the same references (Additional file 2: Figure S1).

Kraken 2’s approach is faster than Kraken 1’s because only distinct minimizers from the query (read) trigger accesses to the hash table. A similar minimizer-based approach has proven useful in accelerating read alignment [16]. Kraken 2 additionally provides a hash-based subsampling approach that reduces the set of minimizer/LCA pairs included in the table, allowing the user to specify a target hash table size; smaller hash tables yield lower memory usage and higher classification

* Correspondence: langmea@cs.jhu.edu
1 Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
2 Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
Full list of author information is available at the end of the article
© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Fig. 1 Differences in operation between the two versions of Kraken. a Both versions of Kraken begin classifying a k-mer by computing its ℓ bp
minimizer (highlighted in magenta). The default values of k and ℓ for each version are shown in the figure. b Kraken 2 applies a spaced seed
mask of s spaces to the minimizer and calculates a compact hash code, which is then used as a search query in its compact hash table; the
lowest common ancestor (LCA) taxon associated with the compact hash code is then assigned to the k-mer (see the “Methods” section for full
details). In Kraken 1, the minimizer is used to accelerate the search for the k-mer, through the use of an offset index and a limited-range binary
search; the association between k-mer and LCA is directly stored in the sorted list. c Kraken 2 also achieves lower memory usage than Kraken 1
by using fewer bits to store the LCA and storing a compact hash code of the minimizer rather than the full k-mer. d Impact on speed, memory
usage, and prokaryotic genus F1-measure in Kraken 2 when changing k with respect to ℓ (ℓ = 31, s = 7 for all three graphs). e Impact on
prokaryotic genus sensitivity and positive predictive value (PPV) when changing the number of minimizer spaces s (k = 35, ℓ = 31 for both
graphs). In d and e, the data are from our parameter sweep results in Additional file 1: Table S2, and the default values of the independent
variables for Kraken 2 are marked with a circle.
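As a rough illustration of panels b and c, the Python sketch below builds the spaced seed mask and packs a truncated hash code plus a taxon value into one 32-bit cell, then looks keys up by linear probing, following the description in the Methods. The names `spaced_seed_mask` and `CompactHashTable` are hypothetical. The 17-bit value width is the September 2018 figure quoted in the Methods. Overwriting a cell on insertion is a simplifying assumption, not a statement about how Kraken 2 updates stored values during a build.

```python
MASK64 = (1 << 64) - 1


def fmix64(key: int) -> int:
    """64-bit finalization mix from MurmurHash3."""
    key &= MASK64
    key ^= key >> 33
    key = (key * 0xFF51AFD7ED558CCD) & MASK64
    key ^= key >> 33
    key = (key * 0xC4CEB9FE1A85EC53) & MASK64
    key ^= key >> 33
    return key


def spaced_seed_mask(l: int, s: int) -> str:
    """Per-position mask of length l: '0' is masked, '1' is compared.

    Masking starts at the next-to-rightmost position and takes every
    other position until s positions are masked.
    """
    if not 0 <= s <= l // 2:
        raise ValueError("s must be between 0 and l // 2")
    bits = ["1"] * l
    for j in range(s):
        bits[l - 2 - 2 * j] = "0"
    return "".join(bits)


class CompactHashTable:
    """Toy CHT: value in the low bits of a 32-bit cell, truncated hash above it."""

    def __init__(self, n_cells: int, value_bits: int = 17):
        self.cells = [0] * n_cells            # a cell equal to 0 is empty
        self.value_bits = value_bits
        self.code_bits = 32 - value_bits      # 15 bits when value_bits is 17

    def _start_and_code(self, key: int):
        h = fmix64(key)
        # Start position is h mod |T|; the compact code is the top bits of h.
        return h % len(self.cells), h >> (64 - self.code_bits)

    def get(self, key: int) -> int:
        """Return the stored value, or 0 once the probe reaches an empty cell."""
        idx, code = self._start_and_code(key)
        for _ in range(len(self.cells)):
            cell = self.cells[idx]
            if cell == 0:
                return 0
            if cell >> self.value_bits == code:
                return cell & ((1 << self.value_bits) - 1)
            idx = (idx + 1) % len(self.cells)
        return 0

    def set(self, key: int, value: int) -> None:
        """Store a positive value; overwrites a cell with a matching code (assumption)."""
        if not 0 < value < (1 << self.value_bits):
            raise ValueError("value must be positive and fit in value_bits")
        idx, code = self._start_and_code(key)
        for _ in range(len(self.cells)):
            cell = self.cells[idx]
            if cell == 0 or cell >> self.value_bits == code:
                self.cells[idx] = (code << self.value_bits) | value
                return
            idx = (idx + 1) % len(self.cells)
        raise RuntimeError("table is full")
```

Because only the truncated code is compared, two different keys with the same code and nearby start positions are indistinguishable to `get`, which is the source of the false positives the Methods describe.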
throughput at the expense of lower classification accuracy (Fig. 1d, Additional file 1: Table S2).

Kraken 2 also features other improvements to accuracy and runtime. A new translated search mode (Kraken 2X) uses a reduced amino acid alphabet and increases sensitivity on viral datasets compared to nucleotide-based search. Block- and batch-based parsing within the critical section is used to improve thread scaling, in a manner similar to that used in recent versions of Bowtie 2 [17]. We also added a form of spaced seed search and automated masking of low-complexity reference sequences to improve accuracy.

To assess the accuracy and performance of Kraken 2, we selected 40 prokaryotic and 10 viral genomes for which we had reference genomes for at least 2 sister subspecies and at least 2 sister species (Additional file 1: Table S3). We then created a reference genome (or protein) set that excluded the 50 taxa for the genomes we selected. This reference set and taxonomy were held constant between the various classifiers we examined, avoiding any confounding due to the differences in the reference database. A similar approach has been recently used for this same purpose in another study [7].

We simulated 1 million Illumina 100 × 100 nt paired-end reads from each of the 50 selected genomes, for a total of 50 million reads (25 million fragments). We processed these data with 4 nucleotide search-based sequence classification programs (Centrifuge [1], CLARK
Fig. 2 Comparison between Kraken 2 and other sequence classification tools. a Processing speed (in millions of reads per minute) and memory
usage (measured by maximum resident set size, in gigabytes) are shown for each classifier, as evaluated on 50 million paired-end simulated reads
with 16 threads. Accuracy results are shown for b 40 prokaryotic genomes and c 10 viral genomes. The results here are shown for sensitivity,
positive predictive value (PPV), and F1-measure as evaluated on a per-fragment basis at the genus rank, with 1000 reads simulated from each
genome. The strains from which reads were simulated were excluded from the reference libraries for each classification tool. “Kraken 2X” is Kraken
2 using translated search against a protein database. Full results for these strain-exclusion experiments are available in Additional file 1: Table S1
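The memory usage in panel a is driven largely by the size of the compact hash table, which the Methods derive from an estimate of the number of distinct minimizers. The Python sketch below works through that arithmetic: the sampled estimate D = |Q|(F/E), the cell count D/0.7, the 4-byte cells, and the hash threshold v = (1 − f)·M used for hash-based subsampling. The function names are hypothetical, and rounding the cell count up to a whole cell is an assumption.

```python
MAX_HASH = (1 << 64) - 1  # assumed maximum output M of a 64-bit hash function


def estimate_distinct(hash_codes, E=4, F=1024):
    """Estimate the number of distinct minimizers from their hash codes.

    Only codes with h mod F < E enter the sample set Q; the estimate
    is |Q| * F / E.
    """
    q = {h for h in hash_codes if h % F < E}
    return len(q) * F // E


def cht_cells(distinct, load_percent=70):
    """Cells needed so `distinct` entries give the target load factor (rounded up)."""
    return -(-distinct * 100 // load_percent)


def cht_bytes(distinct, load_percent=70):
    """Total table size in bytes, at 32 bits (4 bytes) per cell."""
    return 4 * cht_cells(distinct, load_percent)


def min_allowed_hash(required, maximum, max_hash=MAX_HASH):
    """Threshold v = (1 - f) * M with f = S / S', in exact integer arithmetic.

    `required` is the estimated capacity S' and `maximum` the user limit S.
    Minimizers whose hash code is below v are neither stored nor queried.
    """
    if maximum >= required:
        return 0  # everything fits, so no subsampling is needed
    return max_hash * (required - maximum) // required
```

For example, `cht_bytes(700)` gives 4000 bytes: 700 entries at a 70% load factor need 1000 cells of 4 bytes each.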
[2], Kraken 1 [4], and KrakenUniq [5]) and a translated search classifier (Kaiju [3]). We additionally processed these data with Kraken 2, using several different databases created with different parameters (the “Methods” section).

This strain-exclusion approach mimics the real-world scenario where reads likely originate from strains that are genetically distinct from those in the database. The addition of simulated sequencing errors also provides further genetic distance between the test data and the reference sequences. Through this approach, we sought to avoid overly optimistic estimates of a classifier’s performance.

We found that Kraken 2 exhibited similar, and often superior, per-sequence accuracy to the other nucleotide classifiers and that Kraken 2X provided similar (though slightly lower) accuracy compared to Kaiju (Fig. 2b, Additional file 1: Table S1). The nucleotide-based classifiers exhibited lower accuracy on the viral read data than did the translated search classifiers, demonstrating the advantage of translated search in scenarios marked by high genetic variability and sparsity of available reference genomes [3].

In some cases, we found that Kraken 2 would not classify a large proportion of reads correctly at the species level, despite the presence of at least two sister strains in the reference database (Additional file 2: Figure S2). This was often the result of classifications that were either incorrect at the species level or correct but only made at the genus level (or higher). Such classifications can occur when genomes from different species or genera share a high genomic identity, which is the case in multiple places of the taxonomy, including the Shigella [18], Bacillus [19], and Pseudomonas [20] genera. A redefinition of the taxonomy based on the phylogeny as recently proposed [21] would likely improve sensitivity at the species level.

Following our evaluation of the classifiers’ accuracy, we then examined the runtime and memory requirements of each program. Kraken 2 provided substantial increases in processing speed, classifying paired-end data at over 93 million reads per minute while using 16 threads, a speed over 5 times faster than Kraken 1, the next-fastest classifier (Fig. 2a, Additional file 1: Table S1). Additionally, Kraken 2 exhibited superior thread scaling to Kraken 1 (Additional file 1: Table S4). Kraken 2’s memory requirement is also 15% of Kraken 1’s, and only 2.5 times as much as that of the least memory-intensive classifier we examined, Centrifuge. With respect to the translated search programs, Kraken 2X is over 3 times faster and uses 47% less memory than Kaiju.
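One contributor to this speed, according to the Methods, is that minimizers are found with a sliding window minimum rather than by rescanning each k-mer. The Python sketch below shows that technique on plain strings, next to a brute-force version that follows the definition directly. The function names are hypothetical, and the XOR shuffling and spaced seed masking steps are left out for brevity, so this is an illustration of the idea rather than Kraken 2's code.

```python
from collections import deque

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(lmer: str) -> str:
    """Return the lexicographically smaller of an l-mer and its reverse complement."""
    return min(lmer, lmer.translate(COMPLEMENT)[::-1])


def minimizer(kmer: str, l: int) -> str:
    """Brute force: smallest canonical l-mer in one k-mer, Theta(k) candidates."""
    return min(canonical(kmer[i:i + l]) for i in range(len(kmer) - l + 1))


def sliding_minimizers(seq: str, k: int, l: int):
    """Yield the minimizer of every k-mer in seq, left to right.

    The deque holds (candidate, position) pairs whose candidates increase
    from front to back, so the front is always the current minimum.
    """
    window = k - l + 1                      # number of l-mers per k-mer
    dq = deque()
    for pos in range(len(seq) - l + 1):
        cand = canonical(seq[pos:pos + l])
        # Remove queued candidates that are not smaller than the new one.
        while dq and dq[-1][0] >= cand:
            dq.pop()
        dq.append((cand, pos))
        # Drop the front element if it lies outside the current k-mer.
        if dq[0][1] <= pos - window:
            dq.popleft()
        if pos >= window - 1:               # a full k-mer has been seen
            yield dq[0][0]
```

Each l-mer is pushed and popped at most once, which gives the amortized constant cost per k-mer that the Methods contrast with the older per-k-mer scan.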
To determine if Kraken 2 exhibited similar analytical performance on real sequencing data, we classified read data from the FDA-ARGOS project [22]. We compared the fragment classifications obtained by the various classification programs to the taxonomic labels attached to the corresponding ARGOS experiment. Kraken 2 exhibits similar genus-level concordance and discordance statistics to the other nucleotide search classifiers, while Kraken 2X exhibits similar but less agreement with the ARGOS labels than does Kaiju (Additional file 1: Table S5). These results agree with those obtained in the strain-exclusion experiment on simulated data.

As a continuation of the strain-exclusion experiments, we applied Bracken [14] to the Kraken 1 and Kraken 2 results, estimating species- and genus-level sequence abundance for prokaryotic species. Bracken uses a Bayesian algorithm to integrate reads Kraken classified at higher taxonomic levels into the abundance estimates. Although the true strain-level taxa are excluded from the database, Bracken recaptured most of the true genus-level and species-level sequence abundances using both Kraken 2 and Kraken 1 classification results. Comparing the results, the Bracken estimates were more accurate with Kraken 2 than with Kraken 1 at both the genus and species levels, likely owing to Kraken 2’s higher sensitivity (Additional file 2: Figure S2). Bracken ran in less than 1 s, a minute fraction of the runtime of any of the classification programs we examined.

As databases of assembled genomes continue to grow, databases of reference sequences used for metagenomics studies will also grow [21, 23]. We presented Kraken 2, an extremely memory-efficient metagenomics classification tool that replaces Kraken 1’s k-mer database with a probabilistic data structure that is substantially smaller, allowing six to seven times more reference data compared to Kraken 1. The algorithms introduced in Kraken 2 to subsample the set of genomic substrings also provide Kraken 2 with the ability to further reduce the size of its database and accelerate the processing of sequencing data. We showed Kraken 2’s accuracy is comparable to that of Kraken 1 and other competing tools, consistent with other studies [6, 7]. We also showed that its new translated search mode has accuracy approaching that of the protein-focused Kaiju tool, while using less memory and runtime. Also, Kraken 2 is compatible with the Bracken software for species-level quantification, making Kraken 2 straightforwardly usable for that application.

In the future, it will be important to consider additional use cases for Kraken 2. For example, other data structures similar to our compact hash table, such as the counting quotient filter [24], could be implemented and used in computing environments and applications that may benefit from a particular data structure’s design and properties. Additionally, the KrakenUniq [5] tool uses the HyperLogLog sketch [25] to estimate the number of distinct k-mers matched at each node of the taxonomy, a statistic that is used in turn to better determine the presence or absence of individual genomes. We plan to add this functionality in the future, as it enables applications in the diagnosis of infections where the infectious agent is present at low abundance.

Methods
Compact hash table
The hash table used by Kraken 2 to store minimizer/LCA key-value pairs is very similar to a traditional hash table that would use linear probing for collision resolution, with some modifications. Kraken 2’s compact hash table (CHT) uses a fixed-size array of 32-bit hash cells to store key-value pairs. Within a cell, the number of bits used to store the value of the key-value pair will vary depending on the number of bits needed to represent all unique taxonomy ID numbers found in the reference sequence library; this was 17 bits with the standard Kraken 2 database in September 2018. The value is stored in the least significant bits of the hash cell and must be a positive integer. Values of 0 represent empty cells. Within the remaining bits of the hash cell, the most significant bits of the key’s hash code (a compact hash code) are stored. Searching for a key K in the CHT is done by computing the hash code of the key h(K) then linearly scanning the table array starting at position h(K) mod |T| (where |T| is the number of cells in the array) for a matching key. Examples of this search process—including both key/value insertion and querying—are shown in Additional file 2: Figure S3. In Kraken 2, the hash function h used is the finalization function from MurmurHash3 [26].

Compacting hash codes in this way allows Kraken 2 to use 32 bits for a key-value pair, a reduction compared to the 96 bits used by Kraken 1 (64 bits for key, 32 for value) (Fig. 1c). But it also creates a new way in which keys can “collide,” in turn impacting the accuracy of queries to the CHT. Two distinct keys can be treated as identical by the CHT if they share the same compact hash code and their starting search positions are close enough to cause a linear probe to encounter a stored matching compact hash code before an empty cell is found. This property gives the CHT its probabilistic nature, in that two types of false-positive query results are possible: either (a) a key that was not inserted can be reported as present in the table or (b) the values of two keys can be confused with each other. In Kraken 2, the former error is indeed a false positive, whereas the latter results in a less specific LCA being assigned to the minimizer (Additional file 2: Figure S3). The probability of either of these errors is < 1% with Kraken 2’s default load factor of 70% (Additional file 2: Figure S4). The
adverse effect on read-level classification is further mitigated by the algorithm Kraken 2 uses to combine information from across the read, which is unchanged from Kraken 1 and utilizes information from all k-mers in a sequence to counteract low-frequency erroneous LCA values that could be returned by a key-value store.

The probabilistic nature and comparisons involving parts of a key’s hash code make the CHT similar to the counting quotient filter (CQF) described by Pandey et al. [24]. Like the CQF, Kraken 2’s CHT features high locality of memory access during an individual query due to the linear probing that the CHT employs. Unlike the CQF, however, our CHT does not allow the full hash code to be recovered from a stored value (the CQF’s remainder), and so we are unable to resize a CHT once it is instantiated. Our CHT also has an additional possibility of error compared to the CQF, where two keys that do not have the same full hash code but share a truncated hash code will be treated as identical. The CQF can avoid such “soft” hash collisions.

Internal taxonomy of a Kraken 2 database
While Kraken 1 used the taxonomy provided by the user without modification, Kraken 2 makes some modifications to its internal representation of the taxonomy that cause that representation to differ from the user-provided taxonomy. First, Kraken 2 finds a minimal set of nodes in the user-provided taxonomy. This minimal set consists of all nodes to which a reference sequence is assigned, as well as all of those nodes’ ancestors; edges between nodes in this set remain as they were in the user-provided taxonomy, maintaining the tree structure in the internal representation. Kraken 2 then assigns nodes in the minimal set sequentially increasing internal taxonomy ID numbers using a breadth-first search (BFS) beginning at the root, with the root having an internal ID number of 1. This BFS provides a guarantee that ancestor nodes will have smaller internal ID numbers than their descendants; an example of this numbering is shown in Additional file 2: Figure S3. Kraken 2 stores a mapping of its internal taxonomy numbers to the external taxonomy ID numbers to make its results more easily interpretable, and performs all output using the external taxonomy ID numbers.

Kraken 2’s use of this internal taxonomy representation allows for the easier computation of the LCA of two nodes because the ID numbers themselves give information as to their relative depths in the tree, while the National Center for Biotechnology Information (NCBI) taxonomy IDs lack this property. The internal taxonomy representation also allows Kraken 2 to use the minimal number of bits for storage of taxonomy ID numbers, giving maximal space for the compact hash codes and reducing the probability of CHT errors (or “hash table collisions,” as we describe elsewhere in this paper).

A Kraken 2 database consists of a CHT and this internal taxonomy representation. Typical databases will be built using the NCBI taxonomy [27], but users can override this default to create custom databases for atypical use cases.

Minimizer-based subsampling
In contrast to Kraken 1’s use of all k-mers in the standard use case, Kraken 2 subsamples the set of genomic substrings and inserts only the distinct minimizers into its database (Fig. 1b). We define the ℓ bp minimizer of a k-mer (ℓ ≤ k) to be the lexicographically smallest canonical ℓ-mer found within the k-mer. An ℓ-mer is called canonical if it is lexicographically less than or equal to its reverse complement. Note that if k = ℓ, no subsampling occurs and Kraken 2 inserts the same substrings into its data structure that Kraken 1 would. Additionally, as the difference between k and ℓ grows, fewer substrings are inserted into the CHT, reducing its size along with Kraken 2’s memory usage and runtime (Fig. 1d, Additional file 1: Table S2). The default values for Kraken 2, k = 35 and ℓ = 31, were determined after the analysis of the parameter sweep results we show in Additional file 1: Table S2.

Kraken 2 determines which ℓ-mers are minimizers by the use of a sliding window minimum algorithm, in contrast to Kraken 1’s implementation, which examined each k-mer anew. This allows for a faster determination of minimizers, as less work is required when moving from one k-mer to the next overlapping k-mer (in terms of computational complexity, the new approach uses an average of O(1) time to calculate a new minimizer vs. Θ(k) time with the older algorithm). The sliding window minimum calculation uses a double-ended queue (or “deque”) in which canonicalized candidate ℓ-mers are inserted in the back, along with the candidates’ position in the original sequence. As a new candidate is encountered, enqueued candidates are removed from the back of the deque until the candidate at the back has a lesser value than the new candidate (as determined by lexicographical ordering). The new candidate is then pushed onto the back of the deque.

Once a k-mer’s worth of ℓ-mers has been processed in this way, the front of the deque contains the minimizer of that k-mer. This property is then maintained during scanning subsequent bases by removing the front element in the deque if it is from a position in the original sequence that is not in the current k-mer. In this way, the front element of the deque holds the minimizer of the k-mer currently being examined.

We further augmented the sliding window algorithm to include the exclusive or (XOR) shuffling operation
from Kraken 1. This operation serves to permute the ordering of the ℓ-mers when calculating minimizers and helps to avoid a bias toward low-complexity ℓ-mers when selecting the minimizer of a k-mer [4, 15]. To shuffle, we calculate the XOR value of the ℓ-mer and a predefined constant and use this value as the “candidate” that is put in the deque. When the original ℓ-mer value is needed again, the operation is reversed by XORing a second time with the same constant.

Spaced seed usage
Spaced k-mers, a similar concept to spaced seeds, have been shown to improve the ability to classify reads within the Kraken framework [28]. Kraken 2 uses a simple spaced seed approach where a user specifies an integer s when building a database that indicates how many positions in the minimizer will be masked (i.e., not considered when searching). Beginning with the next-to-rightmost position in the minimizer, every other position is masked until s positions have been masked. For example, if s = 3 and ℓ = 12, the positions in the bit string 1111 1101 0101 with a “0” would be masked. When using Kraken 2, Kraken 1’s classification results can be most closely approximated by setting k = ℓ = 31 and s = 0, as these settings will avoid any minimizer-based subsampling and spaced seed usage. Kraken 2’s default value for s is 7 and was determined after the analysis of the parameter sweep results we show in Additional file 1: Table S2.

The canonical ℓ-mers that are minimizer candidates are masked with the spaced seed mask prior to their insertion into the deque for the sliding window calculation. By performing canonicalization of the minimizer candidates prior to applying the spaced seed mask, we ensure the result is the same whether applied to the ℓ-mer or its reverse complement.

Kraken 1’s sensitivity performance was governed by the value of k (the length of the searched substring). By comparison, the use of spaced seeds and minimizer-based subsampling means that Kraken 2’s sensitivity performance will be largely governed by ℓ − s (the number of compared bases in Kraken 2’s searched substring). Thus, increasing s will generally increase sensitivity while decreasing positive predictive value (Fig. 1e, Additional file 1: Table S2).

Hash-based subsampling
Kraken 2 estimates the required capacity of the hash table given the k, ℓ, and s values chosen along with the sequence data in a database’s reference genomic library. Some users will not have access to large memory computers, and therefore, this estimate may be greater than the maximum possible hash table size that they can work with. To aid such users, Kraken 2 allows them to specify a maximum size when building a database. If the estimated required capacity is larger than the maximum requested size, then the minimizers will be subsampled further using a hash function. Given an estimated required capacity S′ and a maximum user-specified capacity of S (S < S′), we can calculate the value f = S/S′, which is the fraction of available minimizers that the user will be able to hold in their database. A minimum allowable hash value of v = (1 − f)∙M can also be calculated, where M is the maximum value output by hash function h. Any minimizer in the reference library with a hash code less than v will not be inserted into the hash table. This value v is also provided to the classifier so that only minimizers with hash codes greater than or equal to v will be used to probe the hash table, saving the search failure runtime penalty that would be incurred by searching for minimizers guaranteed not to be in the hash table.

Evaluation of k-mer level discordance rates
At a k-mer level, there are two main types of discordance between Kraken 1 and Kraken 2’s results: those caused by two distinct k-mers sharing the same minimizer (a “minimizer collision”) and those caused by two distinct minimizers being indistinguishable by the CHT (a “hash table collision”). Minimizer collisions are not always damaging. When one occurs between k-mers from very closely related genomes, such a collision might detect true homology even in the face of single nucleotide polymorphisms and/or sequencing error. That said, minimizer collisions between k-mers from distantly related genomes could produce either elevated LCA values (if both genomes are in the reference library) or incorrectly classified k-mers (if one of the genomes is not in the reference library). Hash table collisions are a consequence of the probabilistic nature of the CHT and can also cause either elevated LCA values or incorrectly classified k-mers (Additional file 2: Figure S3). We note that these different discordant results are all at a k-mer level and may not always affect a query sequence’s classification due to the many k-mers’ worth of data that are used to classify a query sequence; aside from slight modifications to handle the subsampling methods we use in Kraken 2, the classification method of Kraken 2 is identical to Kraken 1.

We wished to estimate the rate at which these collisions would cause discordance at a k-mer level between the Kraken 1 and Kraken 2 results. To do so, we selected a specific bacterial genome for which we had neighboring genomes at each taxonomic rank from species to phylum. The selected genome was our “reference sequence,” and eight others were progressively more taxonomically distant from the reference sequence. We list the nine genomes used in these experiments in
Additional file 1: Table S6. We additionally created a synthetic genome with 4 Mbp of uniformly random DNA. Together, these ten sequences formed a set of “query sequences” and were the basis for our evaluation of collision rates. For these experiments, we used the default Kraken 2 values of k = 35, ℓ = 31, and s = 7, unless otherwise noted.

To determine the rates of discordance caused by minimizer collisions, we compared each of the ten query sequences’ k-mers to the set of reference sequence k-mers. For each sequence, the minimizer collision rate is the proportion of distinct k-mers in a query sequence that (a) are not in the set of reference sequence k-mers and (b) share a minimizer with a reference sequence k-mer. The various sequences’ minimizer collision rates are summarized in Additional file 1: Table S7. We hypothesized that the minimizer collision rate would be influenced by the length of the minimizer used, due to the length’s direct relationship to the number of possible minimizers. To test this, we repeated the minimizer collision rate estimation experiment focusing on the reference genome and using the random synthetic genome as the sole query sequence. Setting k = 35 and s = 0, we varied the ℓ parameter from 8 to 31. Minimizer lengths greater than 15 had collision rates under 1%. Minimizer lengths greater than 22 had 0 collisions. The full results are shown in Additional file 2: Figure S5.

To determine the rates of discordance caused by hash table collisions, we compared each of the ten query sequences’ minimizers to a CHT populated with the reference sequence minimizers. The CHT was created with a load factor of 70% and 15 bits reserved for the truncated hash code (the same parameters used in Kraken 2’s standard database in September 2018). For each sequence, the hash table collision rate is the proportion of distinct minimizers in a query sequence that (a) are not minimizers in the set of reference sequence minimizers and (b) are reported by the CHT as being inserted in the hash table. The various sequences’ hash table collision rates are summarized in Additional file 1: Table S8. To investigate the impact of load factor and truncated hash code size on hash table collision rates, we repeated the hash table collision rate experiment, but focused only on the reference genome and used the random synthetic genome as the sole query sequence. We used the same default values of k, ℓ, and s as before (35, 31, and 7, respectively) and calculated hash table collision rates while varying both the load factor and truncated hash code size. The impact of these two parameters on hash table collision rates is shown in Additional file 2: Figure S4. The parameters adopted for Kraken 2’s default mode had an error rate of 0.016%, consistent with the results seen when comparing genomes of different species (Additional file 1: Table S8).

Processing of a standard genomic reference library
The CHT’s modest memory requirements, and the additional savings yielded by minimizer-based subsampling, allow more reference genomic data to be included in Kraken 2’s standard reference library. Whereas Kraken 1’s default database had data from archaeal, bacterial, and viral genomes, Kraken 2’s default database additionally includes the GRCh38 assembly of the human genome [29] and the “UniVec_Core” subset of the UniVec database [30]. We include these in Kraken 2’s default database to allow for easier classification of human microbiome reads and more accurate classification of reads containing vector sequences.

Additionally, we have implemented masking of low-complexity sequences from reference sequences in Kraken 2, by using the “dustmasker” [31] (for nucleotide sequences) and “segmasker” [32] (for protein sequences) tools from NCBI. Using the tools’ default settings, nucleotide and protein sequences are checked for low-complexity regions, and those regions identified are masked and not processed further by the Kraken 2 database building process. In this manner, we seek to reduce false positives resulting from these low-complexity sequences, similar to the build process for Centrifuge [1].

Populating the Kraken 2 hash table
Kraken 2 begins building a CHT by first estimating the number of distinct minimizers present in the reference library for the selected values of k, ℓ, and s. This is done through a form of zeroth frequency moment estimation [33] where Kraken 2 creates a small set structure implemented with a traditional hash table. In this set Q, we insert only the distinct minimizers that satisfy the criterion h(m) mod F < E, where h(m) is the hash code of the minimizer m and E ≪ F (in practice, Kraken 2 uses E = 4 and F = 1024). We then find the estimate of the total number of distinct minimizers by multiplying the number of satisfactory distinct minimizers (|Q|) by F/E. This form of estimation requires storing in memory only a fraction of all distinct minimizers (approximately E/F) and allows us to quickly set the capacity of our CHT properly without needing to first store all elements in it.

After estimating the number of distinct minimizers D = |Q|(F/E) present in the reference library, Kraken 2 then allocates memory for a CHT containing D/0.7 hash table cells. We selected the divisor of 0.7 so that the resultant hash table will have approximately 30% of its cells remain empty after the population of the CHT (i.e., the CHT will have a load factor of 70%). As stated earlier, the cells of this table are 32 bits each, and so the total memory required for Kraken 2’s CHT is 32D/0.7 bits or 4D/0.7 bytes.

Kraken 2 then proceeds to scan each genome in the reference library. Each genome must be associated with
a taxonomic ID number so that Kraken 2 can calculate LCA values; genomes without associated taxonomy IDs are therefore not processed by Kraken 2. For a minimizer M in a genome G, Kraken 2 attempts to insert a key-value pair containing M (key) and the taxonomic ID T (value) associated with G into the CHT. If the CHT does not report that M was previously inserted, then the <M, T> key-value pair will be inserted, indicating that the LCA of M is currently T. If M was previously inserted into the CHT, with LCA value T*, then its associated LCA value is updated to equal the LCA of T and T*. All minimizers are processed in this way; once the reference library’s minimizers are all processed, the LCA values are properly set for each of the minimizers and the database build is complete. The LCA operation is both commutative and associative, facilitating parallel index construction.

Classification of a sequence fragment with Kraken 2

Kraken 2 classifies sequence fragments similarly to Kraken 1, with modifications to facilitate minimizer- and hash-based subsampling. For each k-mer in an input sequence, Kraken 2 finds its minimizer and, if it is distinct from the previous k-mer’s minimizer, uses it as a key to probe the CHT. If the minimizer matches a key in the CHT, Kraken 2 considers the associated LCA value to be the k-mer’s LCA (Fig. 1b). Classification then proceeds in the same manner as Kraken 1, taking note of how many k-mer hits mapped to each taxon, constructing a pruned classification tree, and using the leaf of the maximally scoring root-to-leaf path of that tree to classify the sequence [4]. If hash-based subsampling was used to build the CHT, each minimizer has its hash code compared against the table’s maximum allowable hash code, and minimizers with higher-than-allowed hash codes are not searched against the CHT. Any k-mer containing an ambiguous nucleotide code is also not searched against the CHT.

We note that although Kraken 2 only uses the minimizer to query the CHT, the LCA found via this query is assigned by Kraken 2 to the k-mer rather than only the minimizer. This means that a stretch of n overlapping k-mers that share a minimizer will all be assigned the same LCA value by Kraken 2 and that n hits to that LCA will be part of the classification tree, even though only one distinct minimizer was present among the k-mers.

Parsing of input files

Previous work by Langmead et al. [17] has shown the importance of removing parsing work from critical sections, i.e., portions of the program that can be executed by only 1 thread at a time. Kraken 2 uses 2 different methods to defer a majority of parsing work from the critical section to thread-local execution. The first method (referred to as “batch deferred” parsing by Langmead et al.) reads a set number of lines (40,000 in Kraken 2) of input into a thread-local buffer within the critical section and then parses the input within a single thread’s execution. This method is used to perform reading of paired-end FASTQ input, where the lengths of a fragment’s mates can be different and reading a consistent number of lines from both input files is necessary to ensure a thread is working with complete mate pairs. For FASTA or single-end FASTQ input, Kraken 2 instead uses a more efficient method that reads in a set number of bytes (3 MB in Kraken 2) of input into a thread-local buffer within the critical section and continues reading input into that buffer until a record boundary is found, at which point a thread leaves the critical section and parses its input. These modifications allow Kraken 2 to more efficiently use multiple threads than did Kraken 1 (Additional file 1: Table S4).

Translated search

To perform a translated search, Kraken 2X first builds a database from a set of reference proteins in the same manner that Kraken 2 does for nucleotide sequences. The usual alphabet of 20 amino acids is reduced to 15 using the 15-character alphabet of Solis [34]; we add a single additional value representing selenocysteine, pyrrolysine, and translation termination (stop codons). This gives us 16 characters in our reduced alphabet, allowing us to represent a character with 4 bits. Minimizers of reference proteins are calculated using the same methods as for nucleotide sequences (i.e., using spaced seeds if requested and a sliding window minimum algorithm), but reverse complements are not calculated and by default k = 15, ℓ = 12, and s = 0.

When searching against a protein minimizer database, Kraken 2X translates all six reading frames of the input query DNA sequences into the reduced amino acid alphabet. Minimizers from all six frames are pooled and used to query the CHT, and therefore, all contribute to the Kraken 2X classification of a query sequence.

Generation of data for strain exclusion experiments

We downloaded the reference genome and protein data used for the clade exclusion experiments from NCBI in January 2018 from the archaeal, bacterial, and viral domains. We also downloaded the taxonomy from NCBI at this same time. Using the taxonomy ID information for each sequence, we obtained a set of all taxonomy IDs represented by the reference genomes. From this set, we selected a subset of “eligible strains” that had both two sister sub-species taxa present and two sister species taxa present in the set of reference genomes. We selected this subset by examining only those nucleotide
sequences with the phrase “complete genome” in their FASTA record header but excluding those that were plasmids or second or third chromosomes. In this manner, we sought to ensure we did not count a genome multiple times due to multiple sequences being associated with that genome. From the eligible strain subset, 40 prokaryotic taxonomy IDs and 10 viral taxonomy IDs were selected arbitrarily to be the strains of origin for our experiments. The strains selected are listed in Additional file 1: Table S3.

After selecting the taxonomy IDs that represented the strains of origin, we gathered all of the nucleotide sequences we had downloaded (including chromosome and plasmid sequences excluded from our examination when creating the eligible strain subset) into a single file and did the same for the protein sequences. For both the nucleotide and protein files, we placed sequences with taxonomy IDs that were outside the strains of origin into a strain exclusion reference file. Then, for each taxonomy ID in our strain of origin set, we created a single “strain reference” file containing all nucleotide sequences that were associated with that taxonomy ID.

We used Mason 2 [35] to simulate 100-bp paired-end Illumina sequence data from our strains of origin, with 500,000 fragments being simulated from each strain. When simulating the reads, we used the default options for simulating sequencing errors with Mason 2’s “mason_simulator” command. These defaults caused the simulator to simulate sequencing errors at rates of 0.4% for mismatches, 0.005% for insertions, and 0.005% for deletions. We combined simulated reads from the strains of origin into a single set of read data. We also shuffled the order of the fragments in this set to control for ordering effects that might affect runtime.

Execution of strain exclusion experiments

To evaluate the accuracy and computational performance of Kraken 2, we compared it to Kraken 1 and several other programs. In selecting these programs, we concentrated on two main properties. First, because Kraken’s principal aim is to provide high-speed taxonomic sequence classification, we looked for taxonomic sequence classification tools that were high in classification speed (within approximately an order of magnitude of Kraken 1). Secondly, because our experiments rely on holding fixed the reference data between programs, we selected tools which had the ability to customize the underlying reference sequence set and taxonomy using whole-genome reference data. These two requirements led to our selection of KrakenUniq, CLARK, Centrifuge, and Kaiju as comparator programs. We note that these requirements exclude an accuracy evaluation against programs that are not taxonomic sequence classifiers (programs that output a mapping of sequences to taxa). Sequence abundance estimation programs (which map taxa to sequence counts or frequencies), such as Bracken, and population abundance estimation programs (which map taxa to organism counts or frequencies), such as MetaPhlAn [36], are answering related but different problems than those in our comparator set. For example, Bracken does not actually change any of the taxonomic labels associated with the sequenced fragments but rather adjusts the fragment counts associated with low-rank taxa. We also note that although MetaPhlAn does, as part of its operation, classify a small proportion of reads that map to marker genes, this proportion can be less than 10% of reads [6] in whole-genome shotgun metagenomic experiments (such as ours), and thus, MetaPhlAn would yield far lower per-sequence sensitivity relative to the tools in our comparison.

In brief, we used the nucleotide search-based classification programs (Kraken 1, KrakenUniq, Kraken 2, CLARK, and Centrifuge) to build a strain-exclusion database from reference genomes, and we used the translated search-based classification programs (Kraken 2X and Kaiju) to build a strain-exclusion database from reference protein sequences. We compared Kraken 2 and Kraken 2X (both using the code base from Kraken 2.0.8) against Kraken 1.1.1, KrakenUniq 0.5.6, CLARK 1.2.4, Centrifuge 1.0.3-beta, and Kaiju 1.5.0. Because CLARK requires a rank to be specified at the time of building a database, and our evaluations center on genus-rank accuracy, we built a CLARK database for the genus rank for our evaluation work in this paper.

Classifiers received the simulated read data as paired-end FASTQ input. To evaluate runtime and memory usage, we sought to eliminate the performance impact of reading or writing from disk or from a network storage location. To accomplish this, we copied simulated read data and classifier databases onto a random access memory (RAM) filesystem and directed the classifiers to read input from and write output to that RAM filesystem.

Accuracy was evaluated on a smaller subset of the simulated data containing 1000 fragments per genome of origin or 50,000 fragments in total. To obtain processing speed and memory usage information, we ran each classifier using 16 threads on 25 million sequences’ worth of simulated read data. We used the taskset command to restrict each classifier to the appropriate number of processors (e.g., “taskset -c 0-15” was used with our 16 thread experiments); this ensures that a classifier that uses an external process to aid in its execution has that process’ runtime properly counted against its runtime here. The “/usr/bin/time -v” command provided us with elapsed wall clock time and maximum resident set size data (memory usage) for each experiment and allowed us to verify that no major page faults were incurred by a classifier during its execution (the absence
of which indicates minimal disk- or network-related input/output effects on the runtime). Classifiers were run on a computer with 32 Xeon 2.3 GHz CPUs (16 hyperthreaded cores) and 244 GB of RAM.

Evaluation of accuracy in strain exclusion experiments

We evaluated the accuracy of each classifier at a per-fragment level, with respect to a particular taxonomic rank. Each fragment had a known true subspecies taxon of origin, which implied a true taxon of origin at both the species and genus ranks, which is where we measured accuracy. We now describe how we counted true-positive (TP), false-negative (FN), vague positive (VP), and false-positive (FP) results at the genus and species levels. We describe this at the genus level specifically, but the analogous procedure was also used at the species level. For a given true genus of origin, a TP classification is a classification at that genus or at a descendant of that genus. Because we excluded the strains of origin from our reference databases, we expected all classifiers to make incorrect strain-level classifications and so allow classifications of descendants of the true genus to be judged as TP. We define an FN classification as a failure of a classifier to assign any classification to a sequence and a VP classification as a classification at an ancestor of the true genus of origin. Finally, we define an FP classification as a classification that is incorrect, that is, not at the true genus of origin nor an ancestor or descendant of that true genus. These four categories are mutually exclusive, and all fragments run through a classifier will have their classification (or lack thereof) categorized by one of these categories.

These categories are different from those typically used for binary classification problems; they are used here because these methods can make classifications that are not at leaves of the taxonomic tree but are still correct. For example, a classification of an Escherichia coli fragment as Escherichia would be evaluated as TP for genus-rank accuracy, but as VP for species-rank accuracy. Classification of that same fragment as Vibrio would be evaluated as FP at any rank below class (because the LCA of Vibrio and Escherichia is the class taxon Gammaproteobacteria) and would be evaluated as TP for the class rank and above.

Using these categories, we define rank-level sensitivity as the proportion of input fragments that were true-positive classifications, or TP/(TP + VP + FN + FP). We define rank-level positive predictive value (PPV) as the proportion of classifications that were true positives (excluding vague positives), or TP/(TP + FP). Along with these definitions of rank-level sensitivity and PPV, we also define an F1-measure as the harmonic mean of those two values.

Evaluation of thread scaling efficiency

To evaluate Kraken 1’s and Kraken 2’s ability to efficiently use multiple threads, we performed an experiment using the strain exclusion databases and simulated read data we described previously in this section. We ran both Kraken 1 and Kraken 2 on the same data using 1, 4, and 16 threads. The 2 programs were run once on the data as paired-end read data and once as single-end read data. Read data and Kraken database files were all placed on a RAM filesystem, and the “taskset” command was used to limit the classifier programs to only as many cores as the number of threads being used. These conditions mirror those of our main strain exclusion experiments, only varying the number of threads between the various runs of the classifiers. The results for this experiment are shown in Additional file 1: Table S4. In short, Kraken 2 exhibits superior speedup with respect to the number of threads allocated compared to Kraken 1. This is especially true for paired-end reads.

FDA-ARGOS experimental concordance evaluation

The FDA-ARGOS (dAtabase for Reference Grade micrObial Sequences) project provides sequencing experiments for many microbial isolates [22]. We used the NCBI’s Sequence Read Archive [37] to find all 1392 experiments related to the FDA-ARGOS project (accession PRJNA231221). Because some tools are unable to properly process reads of differing lengths, we selected only those 263 experiments that were run on an Illumina HiSeq 4000 instrument and produced 151-bp reads. We then randomly selected 1 experiment from each genus to download and used reservoir sampling to select a subset of 10,000 paired-end fragments from each selected experiment. We also removed experiments for which our strain-exclusion reference genome set did not have a reference genome of the same species as the sequenced isolate. These steps yielded 25 experiments’ worth of data, for 250,000 paired-end fragments in total. Using the strain-exclusion databases created earlier, we then used each classifier to classify the data and examined the percentage of each experiment’s fragments that were classified.

Because the FDA-ARGOS data are from real sequencing experiments, several factors could explain discordance between a classifier’s results and the experiments’ assigned taxa, including the evolutionary distance between sequences and reference data, low-quality sequencing runs, and contamination. The true causes of such discordance may not be discernable, and even when they are, they often require an in-depth examination of the sequencing and reference data. For these reasons, we do not report sensitivity and PPV for these data because we cannot be certain of the true taxonomic origin of each individual fragment of real sequencing data. Rather, we
evaluated the concordance of the SRA-assigned taxa with the fragments’ classifications at the genus rank and report for each classifier the following quantities: (a) the percentage of fragments with a concordant classification at the genus rank, (b) the percentage of fragments with a discordant classification at the genus rank, (c) the percentage of fragments with a classification of an ancestor of the SRA-assigned genus taxon, and (d) the percentage of fragments that were not classified. The results of this concordance evaluation are provided in full in Additional file 1: Table S5.

Parameter sweeps

We examined various values for parameters to ensure Kraken 2’s default parameters would provide an advantageous balance of accuracy, classification speed, and memory usage. Specifically, we looked at parameters relating to minimizer-based subsampling (k and ℓ), hash-based subsampling (f = S/S′), and spaced seed usage (s). For Kraken 2, we performed two parameter sweeps, with one focused on minimizer-based subsampling and one focused on hash-based subsampling. The first parameter sweep looked at values for ℓ in the interval [25, 31], values for k in the interval [ℓ, ℓ + 10], and values for s in the interval [0, 7]; the second parameter sweep looked at values of ℓ in the interval [25, 31], fixed k = ℓ, values for f in the set {0.125, 0.25, 0.5}, and values for s in the interval [0, 7]. We also performed a third parameter sweep, focused on translated search (Kraken 2X), where we looked at values for ℓ in the interval [11, 15], values for k in the interval [ℓ, ℓ + 3], and values for s in the interval [0, 3].

Each parameter sweep used the strain exclusion data that we previously created to build databases, and we used the same accuracy and timing methods for these databases that we did in the cross-classifier comparison. The results of the first two parameter sweeps, run on nucleotide databases, are provided in Additional file 1: Table S2, while the results of the third parameter sweep, run on protein databases, are provided in Additional file 1: Table S9. We note that the parameter sweeps yielded a large number of parameter combinations giving approximately the same, near-optimal levels of accuracy. This suggests performance is not overly sensitive to particular parameter settings.

genomic data, once with k = 31, ℓ = 31, s = 0 (corresponding to Kraken 1’s defaults, which effectively counts the number of distinct k-mers) and again with k = 35, ℓ = 31, s = 7 (Kraken 2’s defaults).

The size of a Kraken 1 database is a function of the number of distinct k-mers in the reference data. If there are X distinct k-mers, the size of Kraken 1’s database.kdb (sorted list of k-mer/LCA pairs) file will be 1072 + 12X bytes; the 1072-byte term is the size of the Jellyfish/Kraken header data, and 12 bytes are used for each k-mer/LCA pair. The database.idx (minimizer offset index) file is 8,589,934,608 bytes, a function of Kraken 1’s default minimizer length of 15. The full database size is the sum of the sizes of those two files.

Similarly, the size of a Kraken 2 hash table is a function of the estimate of the number of distinct minimizers in the reference data. If there are an estimated Y distinct minimizers, Kraken 2’s hash table will be 32 + ⌊4Y/0.7⌋ bytes in size (representing 32 bytes of metadata and using 4 bytes per cell and a load factor of 0.7).

We used the estimates of the numbers of distinct k-mers and distinct minimizers to calculate the database sizes of Kraken 1 and Kraken 2 for successively larger subsets of the strain exclusion set. The results of this evaluation are shown in Additional file 2: Figure S1, with raw data available in Additional file 1: Table S10.

Reviewing the results when all genomic sequences were added, our results indicate that the number of distinct k-mers is approximately 3.1 times the number of distinct minimizers for the settings we have selected for Kraken 1 and Kraken 2. It is not possible to draw a direct relationship between the number of distinct k-mers or minimizers and the number of sequence bases processed. For example, homology between similar strains and species will cause the number of distinct k-mers/minimizers to grow slower than the total number of bases. Examining the linear-term coefficients from the database-size expressions (12X and 4Y/0.7) indicates a Kraken 2 database will be approximately 15% of the size of a Kraken 1 database of the same reference data; this is because X ≈ 3.1Y, and (4/0.7)/(12 × 3.1) = 0.15. When we examine the full reference set, the 15% estimate is consistent with the ratio of Kraken 2’s hash table size (10.456 GB) to Kraken 1’s database.kdb file size (77.490 − 8.589 = 68.901 GB), which is 10.456/68.901 = 0.152.
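The arithmetic behind the database-size expressions (1072 + 12X bytes for Kraken 1’s database.kdb, 32 + ⌊4Y/0.7⌋ bytes for Kraken 2’s hash table) can be checked with a short sketch. The Python below is illustrative only and is not code from Kraken 1 or Kraken 2: the function and constant names are hypothetical, and every numeric constant is taken from the text. As an assumption made here to avoid floating-point rounding, ⌊4Y/0.7⌋ is computed as the integer division ⌊40Y/7⌋, which is the same quantity.

```python
# Illustrative sketch only: names are hypothetical, constants come from the text.

KRAKEN1_HEADER_BYTES = 1072        # Jellyfish/Kraken header in database.kdb
KRAKEN1_BYTES_PER_PAIR = 12        # bytes per k-mer/LCA pair
KRAKEN1_IDX_BYTES = 8_589_934_608  # database.idx size at minimizer length 15
KRAKEN2_METADATA_BYTES = 32        # hash table metadata


def kraken1_kdb_bytes(distinct_kmers: int) -> int:
    """Size of database.kdb for X distinct k-mers: 1072 + 12X bytes."""
    return KRAKEN1_HEADER_BYTES + KRAKEN1_BYTES_PER_PAIR * distinct_kmers


def kraken1_total_bytes(distinct_kmers: int) -> int:
    """Full Kraken 1 database size: database.kdb plus database.idx."""
    return kraken1_kdb_bytes(distinct_kmers) + KRAKEN1_IDX_BYTES


def kraken2_table_bytes(estimated_minimizers: int) -> int:
    """Hash table size for Y minimizers: 32 + floor(4Y/0.7) bytes.

    4/0.7 equals 40/7, so the floor is an exact integer division.
    """
    return KRAKEN2_METADATA_BYTES + (40 * estimated_minimizers) // 7


def linear_size_ratio(kmers_per_minimizer: float) -> float:
    """Ratio of the linear terms (4Y/0.7) / (12X) when X = r * Y."""
    return (4 / 0.7) / (12 * kmers_per_minimizer)


if __name__ == "__main__":
    # Ratio predicted from X ≈ 3.1Y, as in the text.
    print(f"{linear_size_ratio(3.1):.2f}")  # prints 0.15
    # Ratio of the reported sizes (10.456 GB vs. 68.901 GB).
    print(f"{10.456 / 68.901:.3f}")         # prints 0.152
```

The two printed values reproduce the 15% estimate and the observed ratio of 0.152 quoted above; the header and metadata terms are omitted from the ratio because they are negligible at these database sizes.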
Evaluation of database sizes of Kraken 1 and Kraken 2

We began by shuffling the reference DNA sequences in our strain exclusion set and recorded the total number of bases in each sequence. We modified Kraken 2’s capacity estimator to report an estimate of the number of distinct minimizers after each sequence processed, rather than only after all sequences are processed. Finally, we ran the capacity estimator twice on the shuffled

Bracken experiments on strain exclusion data

We first generated Bracken metadata from each of the Kraken 1 and Kraken 2 reference libraries used in the strain exclusion experiments. We then used Bracken to estimate genus- and species-level abundance from the Kraken 1 and Kraken 2 classification results on the prokaryotic strain exclusion read data. Due to the low sequence similarity between our simulated viral reads and
the strain-exclusion reference data, none of the nucleotide search programs exhibited high sensitivity on these reads, including Kraken 1 and Kraken 2. Such low classification rates prevent Bracken from inferring taxonomy for a large proportion of the viral reads. Additionally, the taxonomy for viruses has several examples where species are not grouped by ancestry and lack similarity in both gene organization and genomic sequence [38]. For these reasons, we chose to exclude the simulated viral reads from our analysis of Bracken.

For overall evaluation of the accuracy of Bracken in these strain exclusion experiments, we calculated the mean absolute percentage error (MAPE):

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{x=1}^{n} \frac{\left| T_x - S_x \right|}{T_x}$$

where S_x is the estimated number of reads and T_x is the true number of reads for taxon x. In this strain exclusion experiment, n = 40, the total number of distinct prokaryotic species and genera in the sample and T_x = 1000 for each taxon.

Supplementary information

Supplementary information accompanies this paper at https://doi.org/10.1186/s13059-019-1891-0.

Additional file 1: Table S1. Comparison of accuracy and computational performance. Table S2. Comparison of Kraken 2 with other classifiers, using various parameter values. Table S3. Genomes excluded in strain-exclusion simulation. Table S4. Thread scaling evaluation results. Table S5. Evaluation of FDA-ARGOS sequencing data. Table S6. Sequences used for evaluation of collision rates. Table S7. Minimizer collision evaluation results. Table S8. Hash table collision evaluation results. Table S9. Comparison of Kraken 2X with other classifiers, using various parameter values. Table S10. Database size evaluation results.
Additional file 2: Figure S1. Estimation of database sizes for Kraken 1 and Kraken 2 as sequences are added to the reference set. Figure S2. Bracken performance on strain exclusion simulated prokaryotic data. Figure S3. Examples of compact hash table usage with Kraken 2. Figure S4. Evaluation of compact hash table error rates as a function of two variables. Figure S5. Evaluation of minimizer collision rates as a function of minimizer length.
Additional file 3: Review history.

Acknowledgements

The authors would like to thank James R. White and Steven Salzberg for the helpful discussions about the manuscript.

Peer review information

Barbara Cheifet was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Review history

The review history is available as Additional file 3.

Authors’ contributions

DEW and BL designed the algorithms for Kraken 2. DEW developed the Kraken 2 software. DEW, JL, and BL designed the experiments. DEW and JL performed the experiments. DEW, JL, and BL prepared and reviewed the manuscript. All authors read and approved the final manuscript.

Funding

BL and DEW were supported by NSF grant IIS-1349906. BL was additionally supported by NIH grant R01-GM118568. JL was supported by NIH grant R35-GM130151.

Availability of data and materials

We have made the data for our strain exclusion experiments publicly available for download, including all reference sequences, taxonomy, and simulated read data [39]. Code to generate all databases from these reference sequences, to generate simulated read data, and to run the comparison of classifiers’ accuracy and performance is also available for public download in a GitHub repository [40] and via permanent storage at https://doi.org/10.5281/zenodo.3520278.

Kraken 2’s source code is open-source, licensed under the MIT License, and available in a GitHub repository [41]. The specific version of Kraken 2 evaluated here, version 2.0.8, is also permanently available at https://doi.org/10.5281/zenodo.3520272.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Author details

1 Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
2 Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
3 Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.

Received: 20 September 2019 Accepted: 18 November 2019

References

1. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26:1721–9.
2. Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. 2015;16:236.
3. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7:11257.
4. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46.
5. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018;19:198.
6. Lindgreen S, Adair KL, Gardner PP. An evaluation of the accuracy and speed of metagenome analysis tools. Sci Rep. 2016;6:19233.
7. Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking metagenomics tools for taxonomic classification. Cell. 2019;178:779–94.
8. Eyice Ö, et al. SIP metagenomics identifies uncultivated Methylophilaceae as dimethylsulphide degrading bacteria in soil and lake sediment. ISME J. 2015;9:2336.
9. Merelli I, et al. Low-power portable devices for metagenomics analysis: fog computing makes bioinformatics ready for the Internet of Things. Futur Gener Comput Syst. 2018;88:467–78.
10. Lu J, Salzberg SL. Removing contaminants from databases of draft genomes. PLoS Comput Biol. 2018;14:e1006277.
11. Donovan PD, Gonzalez G, Higgins DG, Butler G, Ito K. Identification of fungi in shotgun metagenomics datasets. PLoS One. 2018;13:e0192898.
12. Meiser A, Otte J, Schmitt I, Grande FD. Sequencing genomes from mixed DNA samples - evaluating the metagenome skimming approach in lichenized fungi. Sci Rep. 2017;7:14881.
13. Knutson TP, Velayudhan BT, Marthaler DG. A porcine enterovirus G associated with enteric disease contains a novel papain-like cysteine protease. J Gen Virol. 2017;98:1305–10.
14. Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in metagenomics data. PeerJ Comput Sci. 2017;3:e104.
15. Roberts M, Hayes W, Hunt B, Mount S, Yorke J. Reducing storage
requirements for biological sequence comparison. Bioinformatics. 2004;20:
3363–9.
16. Li H. Minimap2: pairwise alignment for nucleotide sequences.
Bioinformatics. 2018;34:3094–100.
17. Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to
hundreds of threads on general-purpose processors. Bioinformatics. 2018;
35(3):421–32.
18. Pettengill EA, Pettengill JB, Binet R. Phylogenetic analyses of Shigella and
enteroinvasive Escherichia coli for the identification of molecular
epidemiological markers: whole-genome comparative analysis does not
support distinct genera designation. Front Microbiol. 2016;6:1573.
19. Helgason E, et al. Bacillus anthracis, Bacillus cereus, and Bacillus
thuringiensis—one species on the basis of genetic evidence. Appl Environ
Microbiol. 2000;66:2627 LP–2630.
20. Gomila M, Peña A, Mulet M, Lalucat J, García-Valdés E. Phylogenomics and
systematics in Pseudomonas. Front Microbiol. 2015;6:214.
21. Parks DH, et al. A standardized bacterial taxonomy based on genome
phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36:996.
22. Sichtig H, et al. FDA-ARGOS: a public quality-controlled genome database
resource for infectious disease sequencing diagnostics and regulatory
science research. bioRxiv. 2018;482059. https://doi.org/10.1101/482059.
23. Stewart RD, et al. Assembly of 913 microbial genomes from metagenomic
sequencing of the cow rumen. Nat Commun. 2018;9:870.
24. Pandey P, Bender MA, Johnson R, Patro R. A general-purpose counting
filter: making every bit count. In: Proc 2017 ACM Int Conf Manag Data; 2017.
p. 775–87. https://doi.org/10.1145/3035918.3035963.
25. Flajolet P, Fusy É, Gandouet O, Meunier F. Hyperloglog: the analysis of a
near-optimal cardinality estimation algorithm. Discret Math Theor Comput
Sci Proc. 2007;AH:127–46.
26. Appleby A. SMHasher GitHub repository. https://github.com/aappleby/smhasher
27. Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2011;40:
D136–43.
28. Břinda K, Sykulski M, Kucherov G. Spaced seeds improve k-mer-based
metagenomic classification. Bioinformatics. 2015;31:3584–92.
29. Church DM, et al. Extending reference assembly models. Genome Biol. 2015;
16:13.
30. The UniVec Database. https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/
31. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A fast and symmetric DUST
implementation to mask low-complexity DNA sequences. J Comput Biol.
2006;13:1028–40.
32. Wootton JC, Federhen S. Analysis of compositionally biased regions in
sequence databases. Methods Enzymol. 1996;266:554–71.
33. Flajolet P, Martin GN. Probabilistic counting algorithms for data base
applications. J Comput Syst Sci. 1985;31:182–209.
34. Solis AD. Amino acid alphabet reduction preserves fold information
contained in contact interactions in proteins. Proteins Struct Funct
Bioinforma. 2015;83:2198–216.
35. Holtgrewe M. Mason: a read simulator for second generation sequencing
data. Technical Report TR-B-10-06. 2010.
36. Segata N, et al. Metagenomic microbial community profiling using unique
clade-specific marker genes. Nat Methods. 2012;9:811–4.
37. Kodama Y, et al. The sequence read archive: explosive growth of
sequencing data. Nucleic Acids Res. 2011;40:D54–6.
38. Lawrence JG, Hatfull GF, Hendrix RW. Imbroglios of viral taxonomy: genetic
exchange and failings of phenetic approaches. J Bacteriol. 2002;184:4891–905.
39. Wood DE. Kraken 2 Manuscript Data. https://doi.org/10.5281/zenodo.3365797
40. Wood DE. Kraken 2 Experiment GitHub repository.
https://github.com/DerrickWood/kraken2-experiment-code
41. Wood DE. Kraken 2 GitHub repository. https://github.com/DerrickWood/kraken2
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.