10 1021@acs Jcim 0c01010
org/jcim Article
Special Issue: COVID19 - Computational Chemists Meet the Moment
Received: August 27, 2020
■ INTRODUCTION
The need for rapid time-to-solution in drug discovery has become accentuated by the Covid-19 pandemic. Given the urgency of the pandemic, in parallel with vaccine development, both repurposing established antivirals and discovering new drugs are needed. Among the present candidates, apart from antibody treatments, the antiviral remdesivir, a nucleoside analog that acts by interfering with RNA synthesis, is active against SARS-CoV-2,1−5 shortening recovery from Covid-19 in hospitalized patients.6 Other promising candidates, such as dexamethasone, may modulate the host response.7
When surveying clinical trials for Covid-19, one is struck by the number of trials that are not based on knowledge of the drug interacting with a known target. There are several examples. As one illustration, baloxavir is a specific inhibitor of the cap-snatching endonuclease of influenza virus, which is a member of the PD-(D/E)xK two-metal nuclease superfamily.8 Coronaviruses have an endonuclease but of a completely different fold (NendoU) and different active site residues.9 NendoU also oligomerizes into a hexamer. Although it would seem unlikely that baloxavir would bind to NendoU, it has nevertheless been in clinical trials. Similarly, lopinavir and ritonavir are also undergoing testing, even though they target proteases of the unrelated HIV.4 Although in principle enzyme active sites with similar chemical functions may bind similar ligands, the steric and physicochemical substrates of drug−protein binding are nuanced. The frequent trial of drugs specific for targets known to be absent in SARS-CoV-2 seems to us to be symptomatic of a lack of precision in combating this pandemic. For further information on other ongoing trials on small molecule drugs, biologics, passive immunization with antibodies, and vaccines, the reader is referred to the comprehensive review by Liu et al.10
The aforementioned challenges, taken together, demonstrate
that there exists an unequivocal need for de novo drug discovery campaigns as well as repurposing studies.11
There are a number of events in the SARS-CoV-2 viral replication cycle12 to target for antiviral therapeutic development: from viral entry to membrane fusion, travel to the host endoplasmic reticulum where translation of the viral genome occurs, to formation of the viral replication complex and formation from host membranes of double-membrane vesicles (DMVs),13,14 the passage of the replicon through the Golgi and the release of the virion from the cell. Each of these steps involves key viral proteins and occurs in a different compartment of the host cell. For example, the binding of the virus to the ACE2 receptor involves the receptor-binding domain (RBD) of the virus S (spike) protein, prefusion cleavage involves the binding of host TMPRSS and furin proteases to the S1/S2 dibasic domain,15 formation of the replication complex and the DMVs involves the nonstructural proteins (NSPs), and the N protein is required for packaging of the viral genome into the newly assembled virion.16 The replication complex is made up of 15 mature NSPs, which are encoded by orf1ab and orf1a genes as the pp1ab and pp1a polyproteins.17 Currently, many efforts are targeting the main protease, MPro,18 which is required for cleavage of the large viral polypeptide into its functional proteins, the RNA-dependent RNA polymerase (RdRp)19 responsible for the production of new viral RNA, and some efforts target prevention of S cleavage.20 In addition, viral proteins also function to impede the host’s defense mechanisms: both proteases have been shown to inhibit the human immune response by interacting with immune proteins in SARS-CoV.21 It is, therefore, important to understand regions on these proteins that act as binding sites for both substrates (as in the case of the proteases) and for protein−protein interaction.
In previous work, very early in the pandemic, we combined restrained temperature replica-exchange molecular dynamics (restrained T-REMD) simulations with virtual high-throughput screening in a supercomputer-based ensemble docking campaign to identify well-characterized drugs, metabolites, or natural products that bind to either the S-protein:ACE2 receptor interface or the RBD of the S-protein.22 From this ensemble docking campaign, we provided a ranking of the predicted binding affinities of over 8000 drugs, metabolites, and natural products (and their isomers) with regard to the SARS-CoV-2 S-protein and the S-protein:ACE2 receptor complex. The ranked list has been incorporated into experimental testing using a high throughput screen that was implemented in the SARS-CoV outbreak, and new compounds will be added as discovered. Three of the top compounds, hypericin (a component of St. John’s Wort), imatinib, and quercetin, identified in the initial S-protein:ACE2 receptor screen are now in clinical trials.
Here, we report on an optimized supercomputing pipeline for early stage drug discovery together with results on 24 systems involving 8 SARS-CoV-2 proteins. The computational approach mimics what happens in nature, using “structure-based” drug discovery. Generally speaking, the availability of many experimental protein structures combined with massive increases in computational power and methodological advances have led to a resurgence of computational studies in which trial compounds are docked into binding sites in three-dimensional models of the protein targets and then ranked according to their strength of binding. Computational docking has been particularly useful in early stages of molecular discovery in order to identify initial hits to be prioritized for experimental validation.
Early docking studies were performed with static target crystal structures and rigid ligands. These were quite successful in some cases, such as in the discovery of antivirals for HIV and influenza.23,24 Unfortunately, though, at that time, structures for few targets existed, and the process was relatively inefficient: calculations were relatively inaccurate, and computers could dock only ∼100 compounds in a reasonable time frame. Since the 1990s, the power of supercomputers has increased by a factor of a million or so. Rigid docking of over a billion compounds has been performed in a few days. Thus, virtual high-throughput screening has outperformed equivalent experimental high-throughput screening and has been shown to rapidly identify very tightly binding compounds.25
Strictly rigid docking does not often take place in protein:ligand interactions, as both ligands and proteins undergo thermally driven internal motions, which lead to fluctuating binding site conformations.26 Therefore, a particularly important development has been the recognition that incorporating target flexibility into drug discovery protocols can improve the drug discovery process.27 Ensemble docking uses different conformations of the protein targets of interest, and combinatorially performs the docking of databases of compounds against each of the protein target conformations. This process models the conformational selection binding mechanism, as opposed to a more limited induced-fit mechanism. The method requires the generation of an ensemble of protein conformations to be used in the docking calculations.
Ensemble docking of small probe molecules for flexible pharmacophore modeling was introduced in 1999. It was shown that consensus pharmacophore models, based on multiple MD structures or on multiple crystallographic structures, were more successful than models based on single conformations in yielding successful predictions of binding.28 In our own laboratories, ensemble docking has produced experimentally validated hits against each of the 16 protein targets presented to us over the past few years. Our groups have increasingly used an ensemble approach to perform docking.22,29−46 In addition, we have shown that the clustering of protein target MD trajectories usually brings a large improvement in the quality of ensemble docking compared to what is obtained using single structure docking.47
Ensemble generation using MD and docking both require significant computational power: to perform MD simulations of sufficient duration and to dock large databases of compounds. This combinatorially large computational time requirement essentially limits this approach to high-performance computing (HPC) architectures for large database screenings, even when only a subset of protein conformations is used in docking, for example, following clustering of the target’s MD configurations.
HPC involves the use of specialized, large supercomputing systems to perform large calculations that are parallelized over many compute nodes, each consisting of dozens of cores. Traditionally, the use of a high-speed interconnect allows rapid communication between separate compute units and clever parallelization schemes to enable rapid calculations on problems too large to fit on a single compute unit. These schemes have historically involved specialized programs that focus efforts to optimize communication overlap. The use of graphics processing units (GPUs) has helped to accelerate many types of calculations. The Summit supercomputer, housed at the Oak Ridge Leadership Computing Facility (OLCF), is currently the
B https://dx.doi.org/10.1021/acs.jcim.0c01010
J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Journal of Chemical Information and Modeling pubs.acs.org/jcim Article
fastest supercomputer in the United States. Summit is an IBM AC922 system consisting of 4608 large nodes, each with six NVIDIA Volta V100 GPUs. Each node also contains two POWER9 CPU sockets for a total of 42 cores per node. The GROMACS MD program48−50 is able to make use of all aspects of the Summit supercomputer’s HPC utilities, including the GPUs and the interconnect, providing for both strong and weak scaling, which dramatically decreases time per MD step and increases the size of the system that can be simulated efficiently. The temperature replica exchange molecular dynamics (T-REMD) routine,51−54 which was chosen here for the MD calculations (see below), uses the interconnect not only to allow for parallelization of a single simulated biomolecule, but also to communicate between separate replicas of the system, each carried out at a different temperature, and performs exchanges between replicas to accelerate the conformational sampling of the biomolecule.55
Protein−ligand docking has hitherto not been considered a traditional HPC task, as each docking calculation is short and does not require multiple nodes to complete. In fact, many docking programs can run on a single CPU core. However, the number of cores on a supercomputer or cluster can provide a resource to perform many simultaneous docking calculations that greatly decrease the time-to-solution for screening a large data set of ligands. Cloud and distributed computing resources also provide this type of completely parallel solution for high-throughput screening.56,57 The use of GPUs has recently been made possible for the widely used program AutoDock4,58,59 resulting in the program AutoDock-GPU, which provides up to 50× speedup over AutoDock4 (available at https://github.com/ccsb-scripts/AutoDock-GPU).60−64 Thus, the use of leadership HPC facilities for ensemble docking can provide the ability to screen billions of ligands to a full set of conformations generated with HPC-based MD simulations. Quantum mechanical refinement of classical docking ranking based on fragment molecular orbital (FMO) techniques also naturally benefits greatly from massively parallel supercomputer capabilities.65
In this work, we describe our efforts establishing a supercomputing-based pipeline for ensemble docking and preliminary results on its application to discovering therapeutics that target viral proteins of SARS-CoV-2. The pipeline and results presented here represent our contribution to date to the work of the USA HPC Covid-19 Consortium that was created on March 29th, 2020. We describe the choice of eight targets and the preparation of protein models from experimental data. We report on T-REMD simulations performed for the targets totaling about half a millisecond of simulation time. We have docked repurposing databases to ten configurations of each protein simulated using the popular docking program AutoDock Vina. We also describe efforts deploying AutoDock-GPU60−62 at scale on Summit that demonstrate the docking of over a billion compounds in 24 h with full structural optimization of the ligand. Future developments involving the use of AI and quantum chemistry in rescoring and clustering are also outlined. The pipeline described here can also be used in future work to target human proteins66,67 known to interact with viral proteins, or in disease-causing responses in Covid-19 and more generally in computational structure-based drug discovery.
■ METHODS
Computational methods in drug discovery narrow a vast chemical search space to a tractable set of compounds suitable for experimental testing. Experimental work can involve a variety of tests, including live virus testing as well as target engagement studies, and will not be considered further here. Rather, we discuss the procedures of structural modeling, MD simulation, and docking.
Choice of Target Proteins and Generation of Structural Models from Experimental Work. Multiple groups have been using structural biology techniques, including X-ray crystallography, small-angle scattering (SAS), and cryogenic electron microscopy (cryoEM), to investigate the structure of proteins and protein complexes from SARS-CoV-2.68−71 However, obtaining a structure from the Protein Data Bank (PDB) or perhaps a revised model from another resource is only the starting point. Often structures obtained from the PDB do not fully resolve all residues, and a determination must be made whether and how to model them. Also, as structural models are rapidly being released to aid in the fight against COVID-19, the potential inclusion of a few structural errors is an unfortunate reality. In particular, the identification of metal cations in protein structures requires careful thought and examination of their local coordination environment.
Even with perfectly assigned and complete experimental structures, it may not be enough to consider viral protein targets as chemically invariant structures for modeling and binding calculations. Large differences in pH in various cell compartments as the virus travels through the host cell72−74 can qualitatively change the protein’s structure and function. Differences in pH also affect the protonation states of the proteins and the small molecules being tested as drugs, altering drug binding preferences. Finally, the oligomerization states of the target proteins are important to consider as the interactions between protein monomers may influence the shapes of the active sites. Another important factor to consider when performing in silico screens using ensemble docking is the ability to construct a useful model of a particular protein for MD simulation. For instance, certain metal-containing regions of a protein may not have an existing classical mechanics model (force field parameters), or existing models may be inadequate. In addition, highly charged, disordered, and ion-dependent biomolecules have been known to have less accurate force fields and may perform poorly in an MD simulation.75−78
Proteins chosen for ensemble docking in this study were those that had a crystal structure available with a reasonable resolution, were amenable to accurate simulation with classical MD force fields, and were also known to be important for viral pathogenicity based on either recent studies or those on SARS-CoV. The 24 systems studied comprise nine protein domains. Two of these, RBD of the S (spike) protein and the N-terminal region of the N (nucleocapsid) protein, are domains in structural proteins found attached/within the virion. The N protein is used for packaging the viral genome and is essential for the assembly of the virion.79 The remaining seven domains come from nonstructural proteins (NSPs) 3, 5, 9, 10, 15, and 16, which form the replication complex and are involved in a number of key tasks leading to the creation of new virus particles.80 Two domains come from NSP3,81 the ADP ribose phosphatase (ADRP, also known as macro- or “X”) domain, and the papain-like protease domain (PLPro).82 The ADRP seems to be involved in ADP-ribosylation, which is used in cell signaling and thus may act to inhibit the host immune response.83 PLPro cleaves regions of the polyprotein to release nonstructural proteins and also is involved in the mechanisms the virus uses to counteract the host immune response, for
instance by interaction with host immune proteins.80,84 NSP5 is the main protease (MPro), which self-cleaves and also cleaves other regions along the polyprotein, releasing essential proteins to perform their tasks in the assembly of the replication complex, and is also involved in interacting with and preventing the action of host immune proteins.84,85 The exact function of NSP9 is unclear, but it has been found in SARS-CoV to be required for viral replication and has been shown to bind to RNA oligonucleotides.86 NSP15 is an endoribonuclease specific for uridine whose exact function is also unknown, but it has been implicated in interfering with host immune response both through direct interaction and by cleaving viral RNA to prevent detection by the host.87 NSP16 is thought to be a methyltransferase that requires NSP10 as a cofactor, and acts to disguise viral mRNA from the host immune response by adding a methylation onto the RNA cap which host cells use to mark RNA as belonging to “self” versus “pathogen”.88−90
The explosion in research and literature fueled by the Covid-19 pandemic, together with the need for searching through related literature on other coronaviruses, has created a challenge for researchers needing to understand the structural details and cellular contexts of the SARS-CoV-2 proteins. To help navigate this challenging landscape, we have been developing new tools based on natural language processing and machine learning for enabling a more robust search for specific questions required for our modeling, simulation, and ligand docking work,91−93 featuring targeted filtering and exploiting external resources (e.g., Wikidata, ChEBI, PubChem) to expand our search capability. For example, after we generate a set of related keywords, the service will screen for the terms referring to a chemical substance and fetch the chemical information (e.g., SMILES string) from PubChem automatically. In addition, using this keyword search enables the ontologies (e.g., Wikidata, ChEBI) to be used to link related chemicals and their properties for document annotations in query results. The main data resource of the system is a collection of scientific papers, which are collated from major publications. The full-text article access and download from the publishers’ archives are performed under the publishers’ agreements, and the internal article corpus in our system is updated on a weekly basis.
To provide a diverse survey of the conformational ensembles of the SARS-CoV-2 viral proteome, we performed T-REMD simulations of 24 different model systems listed in Table 1. Additionally, Supporting Information (SI) Table S1 is also provided, which summarizes the PDB entry simulated, complete protonation state choices (where applicable), and the number of replicas used for the T-REMD.
Simulation Model Preparation. To engage in the use of MD for the rapid generation of conformational ensembles for drug discovery one requires that the input for MD be generated in a semiautomated fashion by which the atomic coordinates, obtained from experimental or in silico protein structure prediction methods, can be quickly processed into MD input files. To facilitate this semiautomated approach, CHARMM-GUI was used for most model building.94 The general system building method used here involves the direct retrieval of structures from the PDB and processing to model missing residues, assign protonation states, add disulfide bonds (where noted in the PDB annotation), add glycosylation (where resolved in the crystal structures of the S-protein receptor binding domain and ACE2), neutralize the charge of the system (using Na+ and Cl− ions), and solvate (with TIP3P water). Many proteins have coordinated ions that serve structural roles, such as the Zn2+ cations in Nsp10, or catalytic roles. The treatment of Zn-complexes in fixed-charge classical MD force fields is a challenge, and for some systems, it may result in the failure to maintain Zn-protein coordination;95−97 when found necessary (as noted below and also summarized in SI Table S1), an explicit bond representation was used. All of these considerations mandate an abundance of care when preparing a biologically accurate model for simulation. Below we discuss considerations taken into account when modeling some of the proteins simulated.
S (Spike) Protein (PDB: 6W41). Presently available structures of the S protein have nine gaps totaling approximately 150 residues, in addition to over 20 and over 100 missing residues at the N- and C-termini, respectively. Current models also lack post-translational modifications, including glycosylation and formation of disulfide bonds. The S protein is heavily glycosylated, with roughly 20% of its mass in glycan chains, yet at most a few mono- or disaccharides are present in the structure.
In our preliminary study,22 we made use of a homology model of the entire spike with restraints applied such that only the human ACE2-Spike interface was unconstrained. Here, using crystal structures of the ACE2-S protein complex, simulations of the receptor-binding domain of the S Protein (Spike) both in complex with the human ACE2 receptor and on its own (referred to within the text as the “Apo” RBD) were performed. The viral spike receptor-binding domain was chosen to provide
insight into the details of the initial viral-host recognition process. Glycosylation resolved from crystallographic imaging was used, and an annotated disulfide bond was also included.
Main Protease (PDB: 6Y2E and 6WQF). The main protease, MPro, is an attractive target for the development of antiviral drugs. There is compelling evidence that the enzymatically active species is the dimeric assembly of MPro. A dimer is observed in most crystal structures of CoV MPro, as well as in solution at sufficiently high concentrations. In addition, a linear increase in the enzyme activity at increasing concentration suggests catalytic incompetence of the monomer.98 Therefore, the full dimer was considered in the present MD simulations for MPro, using as starting coordinates the apo-homodimer in the crystal structures 6Y2E and 6WQF.99,100
The crystal structures show that SARS-CoV-2 MPro, similarly to other MPros,85,99,101,102 is composed of three domains: Domains I (residues 8−101) and II (residues 102−184) are arranged in an antiparallel β-barrel structure, and domain III (residues 201−303) contains five α-helices arranged in a globular cluster. Domain III is a specific feature of CoV MPro proteins and was suggested to be essential in the proteolytic activity by keeping domain II and the long loop connecting domains II and III (residues 185−200) in the proper orientation, and/or by orienting the N-terminal residues that are essential for the dimerization.101 Dimerization occurs through interactions between the helical domains of the two monomers and through hydrogen bonding interactions between the N-terminal residues of one monomer and key residues in the other monomer. In particular, the salt bridge between the N-terminal Ser1 of one monomer and Glu166 of the other monomer has been suggested to be essential to maintaining the catalytically competent conformation.101,103 The substrate-binding site is located in a cleft between domains I and II and contains a highly conserved catalytic dyad formed by Cys145 and His41. Comparison among the two apo crystal structures and the crystal structure obtained in the presence of an inhibitor reveals85 only minor structural differences in the position of a few side-chains and no relevant changes in the substrate-binding site, except for the rotation of the side chain of Met165, which is in the proximity of His41.
Although the catalytic mechanism is not fully understood, there is a general agreement in considering that the proteolytic activity of CoV MPros is initiated by activation of the enzyme through a proton transfer reaction in the catalytic dyad, leading to a charge-separated state with a highly reactive thiolate. It has also been suggested that such a proton transfer reaction is induced by the presence of the native substrate.104,105 Therefore, in the present MD simulations of the apoenzyme, Cys145 and His41 were simulated in their neutral state, with His41 protonated at Nδ (i.e., HSD). This choice is based on the observation that the His41-Nε appears to be the best candidate as proton acceptor from Cys145 because in the crystal structures the His41-Nε is closer than the His41-Nδ to the Cys145-S and the His41-Nδ is already involved in a hydrogen bond to a highly conserved water molecule, which is considered the third element of the catalytic site. A recent QM/MM study106 also supports this proton transfer mode and the role of water in catalysis. However, the ε-nitrogen protonation state for His41 (HIE) cannot be decisively ruled out, and MD simulations were performed also considering this alternative, although less probable, protonation state.
The protonation states of two additional His residues, namely His163 and His172, were also highlighted as being crucial for the enzymatic activity of CoV MPro. In particular, the doubly protonated (cationic) state of His163 at pH 6.0 was suggested to modulate relevant conformational variations involving Glu166, Phe140, and His172, leading to a catalytically inactive conformation.101 At higher pH values, both His163 and His172 should be uncharged, and, on the basis of the hydrogen-bonding pattern that can be inferred from the crystal structures, the HSE protonation state was used for both His163 and His172 in the present MD simulations. All other His residues were also simulated in their neutral state, assigning the Nδ or Nε protonation state on the basis of their chemical environment and hydrogen-bonding patterns. The selected protonation states are as follows: HSD64, HSD80, HSE164, HSE246.
PLPro (PDB: 6W9C and 6WRH). For PLPro, the Zn cation was coordinated to C189, C192, C224, and C226. Similar to MPro, the protonation states of the His residues in PLPro were not readily available. Here we pursued two potential protonation state variants, a charged variant and a neutral variant. For the charged variant the protonation states were obtained with the use of the PropKa 3.0 server assuming pH ∼5, corresponding to its presumed cellular (lysosome) environment,107 with 6W9C being assigned to pH 5 based on its physiological role in acidic environments. Protonation states for the neutral state were manually assigned using 6WRH as the original coordinate file, with C111S mutation reversed. For both variants, Zn-coordination during temperature-replica exchange was enforced by topological patches applied with CHARMM-compatible tools. TopoGromacs108 was used to convert the system and associated force field to GROMACS format.
NSP15 (PDB: 6VWW). Large (His) tags were present during the recombinant expression process used to purify NSP15 for crystallization, and these tags were not removed before crystallization. Prior to simulating the monomeric and hexameric forms, the artifactual His tags were removed from NSP15 using MOE2019, and the structure was subjected to a “quick prep” with the prepare protein module of MOE to resolve potential issues in the resulting structure. The resulting truncated PDBs were then uploaded to CHARMM-GUI for neutralization and solvation.
NSP10 (PDB: 6W4H). For NSP10, both in its monomeric and complexed forms, one Zn cation was liganded to C4370, C4373, C4381, and C4383, while the other bound Zn was liganded to C4327, C4330, H4336, and C4343. For 6VYO, H59 and H145 were liganded.
Molecular Mechanics and Molecular Dynamics System Preparation (Force Fields, Counterions, Energy Minimization, and Equilibration). All simulations were performed using the GROMACS48,109 software suite and the CHARMM36m force field,110 which is the default choice in CHARMM-GUI. For all systems, the protein was solvated in water-boxes with edge-distances of 1 nm, and only neutralizing Na+ and Cl− ions were used. Short-range interactions were treated with a smooth force-switch cutoff of 1.2 nm, and long-range electrostatics were treated using the particle-mesh Ewald (PME) formalism, as implemented in GROMACS.111 To facilitate the use of a 2 fs MD time step, all covalent bonds to hydrogen were constrained with the LINCS algorithm112 in all simulations. Following system preparation, all solvated models generated were subject to steepest-descent energy minimization with a stopping condition of either reaching the force-convergence criterion of 1000 kJ mol−1 nm−1 or a maximum of 5000 iterations. Energy minimization was performed primarily to remove potential clashes between the
solvent, ions, and the protein (or protein complex) of interest. After clash-removal minimization, short (250 ps) NPT relaxation simulations (with default position restraints generated from CHARMM-GUI) were performed to relax the simulation box dimensions for each replica (at different temperatures) independently (see T-REMD Protocol). For these relaxation simulations, the Berendsen baro/thermostats113 (as implemented in GROMACS) and an integration time step of 1 fs were used.

T-REMD Protocol. MD simulations provide a means to study the conformational dynamics of proteins. However, frequently MD becomes ‘trapped,’ resulting in the need for many long simulations to effectively sample a protein’s conformational landscape.114 To overcome this sampling challenge, enhanced sampling techniques can be used. For the present work, temperature replica-exchange molecular dynamics (T-REMD) was employed, whereby multiple copies of a target system are simulated simultaneously with each copy (replica) at a different temperature, with periodic coordinate swapping (performed in such a manner that detailed balance is preserved) between the copies.52−54 By running at multiple temperatures, with exchanges, the dynamics of the system avoids “kinetic traps” and provides a robust sampling of the protein free-energy landscape, and thus the protein conformational diversity.115 T-REMD was chosen for several reasons:
(1) it guarantees an increase in sampling efficiency over straightforward MD,116
(2) it does not require the assignment of reaction coordinates (or collective-variables) a priori to accelerate conformational sampling,
(3) it does not require direct modification to the system Hamiltonian.117

T-REMD simulations for each system were performed with the GROMACS simulation suite. A limited temperature range of 310 K to ∼350 K was chosen to maintain physiological configuration space. For each protein system, the number of replicas and the temperature of each replica were chosen using the temperature predictor server by Patriksson and van der Spoel118 with a target exchange probability set at 0.2, though the actual exchange probability was found to be ∼0.3 for all systems. All simulations were performed for a total of 750 ns per replicate.

After relaxation, production T-REMD simulations were performed with a frame-saving interval of 10 ps and an integration time step of 2 fs. Production simulations were performed, similar to the relaxation simulations, in the NPT ensemble. Unlike the relaxation simulations, the V-rescale (Bussi) thermostat119 and the Parrinello−Rahman barostat120,121 were used. Regardless of the temperature window, the target pressure for each replicate was set to 1 bar.

Trajectory Analysis. For all systems, the measures of the gyration tensor (from which shape anisotropies are derived), solvent-accessible surface area (SASA), pairwise simulation frame versus simulation frame RMSD matrices, and RMSD-based clusters were obtained using a combination of in-house VMD122 scripts, NumPy, and SciPy.123 The RMSD clustering considered only the lowest temperature replica (310 K), and the pairwise RMSD matrices were rapidly generated using the QCP algorithm.124 Clustering was performed using hierarchical clustering with a complete linkage method, as implemented within SciPy. For generality in evaluating structural diversity, clustering was initially performed based on the RMSD of all heavy protein atoms, and where additional diversity of active sites was of interest for subsequent docking, a second round of clustering was performed based on binding site residues and protein−protein interfaces. VMD atom selections for the docking specific clustering are summarized in SI Table S2.

Although T-REMD is an efficient simulation method, and the 310 K data do correspond to a formal statistical mechanical ensemble generated at this temperature, as with other enhanced sampling methods the risk is always present that the enhancement of the sampling takes the system to regions of configurational space beyond that which would be significantly sampled by the protein physiologically; for example, to partially or wholly unfolded states. We, therefore, take care to identify these and to not perform docking screens on such configurations.

Docking. Two different docking databases were used.
(1) A smaller database of potential ligands was built by merging the contents of the SWEETLEADS,125,126 the repurposing database SuperDRUG2,127,128 and the NCI-diversity database,129 yielding 13 757 unique compounds. This database has been ensemble docked to all systems, as listed in Table 2, with noted targeted binding sites. This database was docked using local HPC clusters using AutoDock Vina.
(2) Supercomputing docking runs were performed involving billion-plus compound screens of the Enamine database using an accelerated version of AutoDock: AutoDock-GPU. To date, these runs have been performed on two crystal structures of MPro.

Table 2. List of Proteins and Binding Sites Used for “Smaller Database” Docking. PPI Refers to a Protein−Protein Interface^a
receptor/binding site
MPro monomer/catalytic pocket
MPro dimer/PPI
NSP9 dimer/FTMap sites
nucleocapsid phosphoprotein/RNA binding site
nucleocapsid phosphoprotein/PPI
nucleocapsid tetramer/FTMap sites
NSP3 ADRP domain (asymmetric unit, dimer)/active site
NSP3 ADRP domain (monomer) active site
NSP15 monomer/catalytic pocket
NSP15 dimer/PPI
NSP10 monomer/PPI to NSP16
NSP16 monomer/PPI to NSP10
NSP10:NSP16/PPI
NSP9 monomer/PPI
^a In some cases, FTMap was used to identify potential binding sites (see SI Table S2).

Smaller Database Docking. Data and Protocols. Docking to the target structures obtained from the MD simulations (as listed in Table 1) was performed on various HPC clusters using VinaMPI130 and MOE. Two sets of structures were used in the ensemble docking. In the first series of docking calculations, only the first 100 ns of the T-REMD trajectories were used, and the results of the docking simulations were passed on to collaborators for experimental testing. In the second series of docking, as the MD trajectories were extended beyond their first 100 ns, the clustering was performed on the entire 750 ns trajectories, as described in the results section below.

For the VinaMPI130 calculations, the “Exhaustiveness” parameter was kept at its default value of 10. Databases of potential ligands were built by merging the contents of the SWEETLEADS,125,126 SuperDRUG2,127,128 and NCI-diversity databases,129 yielding 13 757 unique compounds.
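As an illustration of the exchange step described in the T-REMD Protocol, the sketch below evaluates the standard Metropolis acceptance probability for swapping two neighboring replicas and builds a temperature ladder between 310 K and 350 K. The geometric spacing, the replica count of 24, and the example energies are assumptions made only for illustration; in this work the temperatures came from the Patriksson and van der Spoel server, and GROMACS performs the exchanges internally.

```python
import math

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)

def geometric_ladder(t_min, t_max, n_replicas):
    """Temperatures spaced by a constant ratio between t_min and t_max (assumed spacing)."""
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio ** i for i in range(n_replicas)]

def exchange_probability(e_i, e_j, t_i, t_j):
    """Metropolis probability of swapping the configurations held at t_i and t_j.

    e_i, e_j are the potential energies (kJ/mol) of those configurations.
    The pressure-volume term that enters in the NPT ensemble is omitted here.
    """
    beta_i = 1.0 / (KB * t_i)
    beta_j = 1.0 / (KB * t_j)
    delta = (beta_i - beta_j) * (e_j - e_i)
    return 1.0 if delta <= 0 else math.exp(-delta)

# Hypothetical example: 24 replicas and made-up energies for two neighbors
ladder = geometric_ladder(310.0, 350.0, 24)
p_swap = exchange_probability(-1010.0, -1000.0, 310.0, 312.0)
```

A swap is always accepted when the hotter replica currently holds the lower-energy configuration; otherwise it is accepted with a probability that shrinks as the temperature gap or the energy gap grows, which is why closely spaced temperatures are needed to reach exchange probabilities near 0.2 to 0.3.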
F https://dx.doi.org/10.1021/acs.jcim.0c01010
Using the program MOE, compounds with more than 49 rotatable bonds were deleted from the database, and only one stereoisomer was included for each compound. Very low molecular weight (<58) compounds (single atoms, ions, very small functional groups) were also removed. The resulting database included ∼9K unique molecules. The compounds were protonated at pH 7 and energy-minimized using the MOE software to obtain low energy 3D structures. The compounds were saved on disk in sdf format and then converted to PDBQT format using AutoDockTools.131,132

The ligands were docked to 10 clusters per receptor, each cluster corresponding to a different configuration of the binding pocket. The clusters corresponding to the first 100 ns of the MD simulations have been uploaded to the publicly available structure repository https://cmbcovid19.flywheelsites.com/data/; additional data, including the complete trajectories from the 750 ns T-REMD simulations, are forthcoming. The residues used to determine the clusters fall into one of three categories: the protein active site, residues at the protein−protein interfaces (for complexes), and all the protein non-hydrogen atoms. Tables 2 and list the receptors and binding sites we have screened so far.

Binding Sites for Docking. In general, we have two classes of potential binding sites: (1) catalytic pocket or substrate-binding site and (2) PPI. The first aims at identifying potential competitive inhibitors of the viral proteins, and the second aims at finding compounds potentially disrupting a viral protein−protein complex. Binding site definition requires manual intervention and cannot be easily automated. Examples of definitions are listed below for three viral proteins.
(a) In the main protease dimer (PDB: 6WQF), the docking box contains catalytic sites of chain A and PPI residues. The docking box was constructed to align with the peptide-binding groove on either side of the catalytic dyad of chain A, which extends outward to include the S3, S2, S1, S1′, and S2′ substrate binding pockets.
(b) In the NSP10-NSP16 complex (PDB: 6W4H), the S-adenosyl methionine (SAM) binding site Asp6928 in NSP16 was considered.89 In addition, PPI residues such as Tyr4349, Val4295 to Leu4298 in NSP10, and Gln6885 in NSP16 were included. Tyr4349 and Gln6885 interact with each other in SARS-CoV virus,89 and Val4295 to Leu4298 are hot spot residues in the SARS-CoV-2 virus computationally predicted using the crystal structure along with the KFC2 method,133 which is based on a machine learning predictive model (https://kfc.mitchell-lab.org). Hot spot residues are the fraction of PPI residues that account significantly for the overall protein−protein binding affinity, and they are typically determined experimentally using alanine scanning mutagenesis.134
(c) In the N-terminal domain of nucleocapsid protein tetramer (PDB: 6VYO), three critical RNA-binding residues on the beta-sheet core were included in docking: Arg88, Arg92, and Arg107.71,135,136

Billion-Compound Supercomputer Docking with Enamine REAL Database. A major aim of this exercise was to see whether it would be possible to dock a billion compounds with full ligand optimization on the OLCF Summit supercomputer in 24 h of wall-clock time. To perform efficient ensemble docking, we modified AutoDock-GPU60,62 to enable it to run at peak efficiency on the Summit system. For compatibility, OpenCL kernels were rewritten in CUDA, and file input and output were streamlined to enable it to keep up with the GPU’s speed. These modifications, together with the size of the Summit supercomputer, indeed allow over 1 billion compounds to be docked within 24 h. This capability will enable giga-compound docking for a number of proteins in the viral proteome and beyond.

We performed initial docking tests using this framework on NSP15 (NendoU) and the main protease (MPro). For NSP15 we used a 9000 compound data set composed of the SWEETLEADS125 database with additional ligands, and also a trimmed version of this data set containing only ligands with fewer than 11 rotatable bonds, consisting of about 5000 ligands. For tests with MPro, we used a 90 000 ligand subset of the Enamine REAL database.137 All ligands were prepared with AutoDockTools,132,138 and the receptor grids were generated with the program autogrid with a grid spacing of 0.375 Å. We tested a set of search box sizes: 40, 25, 20, and 15 Å³, and different settings for the number of runs, nruns, which defines how many separate instances of the genetic algorithm are executed. For the trimmed data set, we also performed docking with AutoDock Vina with exhaustiveness of 10 to compare results. These results provided us with the confidence to dock over 1 billion compounds from the Enamine REAL database to two different MPro crystal structures, 5R84 and 6WQF,100 with a search space 25 Å large on each side, centered on the active site. The analysis of this data set is ongoing. Due to the documented inaccuracies of force field-based scoring functions in the task of screening and affinity prediction of compounds,139 rescoring of at least 1% of the billion compounds is being performed using the accurate, yet highly computationally efficient machine learning-based rescoring method known as RF-Score-3.140 Also, at least 50% of those compounds rescored with RF-Score-3 will be further filtered using recently developed rescoring described below in Future Directions and Preliminary Results from New Methodologies, subsection Protein−Ligand Rescoring Using Machine Learning.

Sequence Analysis and Mutational Entropy Calculations. We performed an analysis of available sequences of the SARS-CoV-2 virus to look for numbers of mutations and map these locations on the proteins we were using as drug discovery targets. All complete, high-coverage genomes labeled as human host SARS-CoV-2 were downloaded from GISAID141,142 on May 5, 2020, yielding a total of 16 252 genomes. Sequences were filtered to remove any genomes that had greater than 3% ambiguous (N) nucleotides or were less than 29 000 nucleotides in length, resulting in 14 284 genomes. Multiple sequence alignment of the 14 284 genomes was performed using MAFFT143 v.7.464 with the --addfragments method using NC_045512.2 (EPI_ISL_402125) as the reference genome and removing insertions relative to the reference. Mature protein-coding sequences for each protein were extracted from the alignment using coordinates from the reference genome and translated using FAST144 v1.6, with protein sequences containing internal stop codons discarded from further analyses. Shannon entropy145 was calculated for every column of each protein alignment using a custom script, disregarding ambiguous and gap characters. Additionally, the frequency and types of substitutions with respect to the reference were recorded. For visualization of the mutation entropy per residue of the proteins studied in this paper, entropy values were color-coded in protein PDB structures. Known SARS-CoV and SARS-CoV-2 structures were downloaded from the Protein Data Bank, their sequences were aligned with the SARS-CoV-2 reference genome (NC_045512.2) using BLASTP, and the calculated entropy of
the sequences was embedded in the PDB file in the place of the B factor column using a custom Python script.

Preliminary QM Refinement Protocol. Along with ML-based approaches, quantum mechanics (QM)-based refinement of classical docking results is being developed here as a tool to narrow down the list of promising inhibitor candidates.146 Until recently, the inclusion of QM electronic structure in high-throughput drug screening was deemed computationally intractable due to the enormous computational resources required even for density functional theory (DFT) calculations. The poor scaling of most quantum chemical methods further exacerbates the situation. A viable emergent alternative is the recently developed linear-scaling version of an approximate, yet remarkably accurate DFT method called “fragment molecular orbital density-functional tight-binding” (FMO−DFTB).147 This method is implemented in the widely-utilized GAMESS quantum chemistry code.148 We here report preliminary calculations of FMO−DFTB with the so-called polarizable continuum model (PCM) of the solvent149 for quantum mechanics-based evaluation of potential COVID-19 spike protein inhibitor drugs identified by re-clustering and re-docking to an extended simulation of the S protein, similar to the initial work by Smith & Smith.22 For the PCM calculations, the cavity was calculated using simplified united atomic (SUAHF) radii,150 which are available for all the chemical elements contained in all ligand compounds. Because the binding site is widely open, the dielectric constant of water, ε = 78.39, was used. In addition, to improve the accuracy in describing non-covalent interactions, the D3 dispersion correction was employed. To obtain the refined binding energy of a given candidate, its unbound geometry, the unbound protein, and its corresponding complex were optimized using FMO−DFTB/PCM. While the unbound ligands were completely optimized, only selected residues in the binding pocket of the unbound protein and in the protein−ligand complexes were locally optimized. The QM-refined binding energy is defined here as the difference between the FMO−DFTB/PCM total energy of the complex and the sum of the total energy of the unbound protein plus the total energy of the unbound ligand. In preliminary work, the QM refinement was carried out for the Vina top-10 best binders of each spike protein cluster. In total, 15 spike protein clusters were investigated, and the binding energies of 150 protein−ligand complexes were refined.

■ RESULTS
We present here preliminary results obtained for members of the SARS-CoV-2 proteome. Naturally, ongoing refinements of the results are continually being undertaken, and the results are incomplete. However, they give a snapshot report on the state of delivery of the pipeline. At the moment of submission, 24 T-REMD simulations have been performed on nine members of the proteome, in various oligomerization and protonation states, for a total of 0.612 ms of MD aggregated over all replicas and ∼17.25 μs aggregated over all lowest temperature windows. At present ∼2.07 M physical docking calculations have been performed with the smaller database and, on Summit, 2.4 billion docking calculations with the Enamine REAL database. The preliminary results presented are general trends observed in the MD and docking runs and do not describe details of the candidate compounds or dynamical properties of individual proteins, which will be reserved for future publications. Results of MD and docking are available at the Web site https://coronavirus-hpc.ornl.gov and will be updated as new simulations and docking results become available.

T-REMD Scaling Performance. Figure 1 shows the performance per replica on Summit of T-REMD simulations for the majority of the simulations performed in this work using GROMACS version 2020.1. A few simulations were run with GROMACS 2018 and/or with different scheduling parameters; these achieved only 20−50% of the performance shown in the figure and are not included in it. We found that performance was maximized when running all bonded and nonbonded calculations on the GPUs (interatomic and both particle-mesh Ewald and pairwise Lennard-Jones). With the noted choices, performance saturates at around 100 ns/day for 250 000 atoms and above, even if more nodes are allocated per replica, for two reasons. First, the GPU-based fast Fourier transform is limited to a single GPU. Second, communication latencies between nodes slow down the calculation. However, throughput around 100 ns/day can still be achieved for simulations above 250 000 atoms if nodes are increased proportionately to system size.

Figure 1. Simulation throughput per replica. Each point represents the performance achieved by replica-exchange MD simulations on a single protein/water system. Run parameters were one replica per node (each node has six GPUs), using between 24 and 40 replicas in a given system.

T-REMD: Conformational Sampling of SARS-CoV-2 Proteins. T-REMD simulations were performed with the number of replicas ranging from 20 to 60 for 750 ns each for an aggregate sampling of over 0.6 ms (Table 1 and SI Table S1). Given the scaling data noted above, for the total 816 replicas simulated, the calculations (if performed simultaneously) used the equivalent of ∼18% of the entire Summit supercomputer for ∼3 days. The performance, if the entire machine were used to simulate all of the different protein systems at the same time, would thus scale up to ∼1 ms/day. For all systems, the replicate temperatures range from 310 K to ∼350 K, and the average exchange probabilities were near 0.3.

From the simulations, structural diversity was quantified by calculating, when a binding site is known, the gyration tensor of the binding site residues and the solvent-accessible surface area (SASA) of the binding sites, and by constructing pairwise snapshot−snapshot root-mean-squared deviation (RMSD) matrices for the target temperature replica, that is, the replica with the temperature set to 310 K (see Methods for calculation details). Additionally, using the gyration tensor, the shape anisotropy of the pockets was also obtained.
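As a rough illustration of how a shape anisotropy can be derived from a gyration tensor, the sketch below uses the common relative shape anisotropy, κ² = (3/2)·tr(S²)/tr(S)² − 1/2, which is 0 for a sphere-like arrangement and 1 for atoms on a line. This definition, the equal atomic weights, and the function names are assumptions; the exact expression used by the in-house VMD/NumPy scripts is not given in the text.

```python
def gyration_tensor(coords):
    """3x3 gyration tensor of a list of (x, y, z) coordinates, equal weights assumed."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for c in coords:
        d = [c[k] - center[k] for k in range(3)]
        for a in range(3):
            for b in range(3):
                tensor[a][b] += d[a] * d[b] / n
    return tensor

def shape_anisotropy(tensor):
    """Relative shape anisotropy: 0 for sphere-like sets, 1 for points on a line."""
    trace = sum(tensor[a][a] for a in range(3))
    # tr(S^2) equals the sum of squared eigenvalues because S is symmetric,
    # so no explicit diagonalization is needed
    trace_sq = sum(tensor[a][b] * tensor[b][a] for a in range(3) for b in range(3))
    return 1.5 * trace_sq / trace ** 2 - 0.5

# Toy coordinates: a straight rod and the six vertices of an octahedron
rod = [(float(i), 0.0, 0.0) for i in range(10)]
octahedron = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
              (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
kappa_rod = shape_anisotropy(gyration_tensor(rod))
kappa_oct = shape_anisotropy(gyration_tensor(octahedron))
```

Applied per frame to the binding site atoms, a value near 1 would correspond to the rod-like pocket geometries and a value near 0 to the spheroid-like ones.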
Figure 2. Configurational variability of PLPro (PDB: 6WRH) with neutral HIS protonation states. (A) Overlay of 26 RMSD aligned structures from
the lowest temperature replicate spanning the 750 ns of sampling. (B) Population distribution for shape anisotropy (κ) and solvent accessible surface
area (SASA), with redder colors indicating greater occupancy of these kappa-SASA combinations. The distributions are also reflected by one-
dimensional histograms above and to the right of the plot, and black dots within the population distribution, which represent position information for
10% of the total snapshots considered. (C) Pairwise RMSD clustering for the lowest temperature replica, with the snapshots ordered according to their
cluster. The clusters in this instance were defined using a cutoff of half the maximum RMSD observed within the simulation and are labeled according
to color with a color-bar for reference located above the plot. (D) Pairwise RMSD distribution across all snapshots. (E) Population statistics for the
clusters introduced in (C).
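The clustering behind panel (C) can be sketched as follows: complete-linkage agglomerative clustering on a pairwise RMSD matrix, cut at half the maximum RMSD. The pure-Python implementation and the toy matrix below are illustrative assumptions only; the authors report using SciPy, where `scipy.cluster.hierarchy.linkage` with `method="complete"` followed by `fcluster` with `criterion="distance"` would be the practical route for thousands of frames.

```python
def half_max_cutoff(dist):
    """Cutoff of half the largest pairwise RMSD, as in the Figure 2 caption."""
    return 0.5 * max(max(row) for row in dist)

def complete_linkage_clusters(dist, cutoff):
    """Agglomerative complete-linkage clustering, cut at a distance cutoff.

    dist: symmetric matrix (list of lists) of pairwise RMSD values.
    Returns one cluster label per frame, in frame order.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage_distance(a, b):
        # complete linkage: cluster distance is the largest pairwise distance
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cutoff:
            break  # every remaining merge would exceed the cutoff
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical RMSD matrix (in Å) for four frames forming two groups
rmsd = [[0.0, 0.5, 4.0, 4.1],
        [0.5, 0.0, 3.9, 4.0],
        [4.0, 3.9, 0.0, 0.6],
        [4.1, 4.0, 0.6, 0.0]]
labels = complete_linkage_clusters(rmsd, half_max_cutoff(rmsd))
```

With complete linkage, every pair of frames inside a cluster is guaranteed to lie within the cutoff of each other, which keeps each cluster structurally tight before a representative is chosen for docking.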
Linkage-based RMSD clustering, using the pairwise RMSD matrices, was performed to gauge the overall structural diversity of the proteins. Figure 2 provides example conformations and calculated quantities for one example target, the neutral variant of the PL-Protease (PLPro). Similar plots for the other simulated systems are provided as Supporting Information and on https://coronavirus-hpc.ornl.gov.

From Figure 2A (and subfigure A of the SI Figures S1−S23), it is clear that the simulations generate a diverse ensemble of states with varied loop structures. For the case of the neutral variant PL-Protease, Figure 3C,E indicate the existence of a number of dominant conformational states. Figure 3D,C further suggest that, although six dominant states exist, these states could be grouped into two “superstates”, which may indicate a switching-like behavior or the potential existence of a “hinge”. Finally, subfigure B shows a significant amount of sampling of rod-like geometries (anisotropies near 1); however, there are states that have a correlated reduction in SASA and shape anisotropy, which would correspond with a nearly continuous transition between rod-like structures and spheroid-like structures.

The general conformational variation highlighted by Figure 2 and SI Figures S1−S22, to some degree, masks the conformational variation within binding sites; however, when clusters within the T-REMD trajectory are identified for docking to the individual binding sites, they demonstrate significant variability within the active site region (Figure 4). Although they are not specifically active site residues, the variability of the residues at the tip of the loop centered on Y266 and of the charged residue pair R164-E165 near the active site implies that accounting for the protein conformational ensemble is essential. Otherwise, the docking calculations would be strongly biased by the rotameric states present in the single static structure used in typical single-structure docking calculations.

Smaller Database Docking. A preliminary analysis was performed of general trends seen in docking the smaller database to the 24 SARS-CoV-2 protein systems. For each protein target, all the docking results from each of the 10 cluster configurations were combined, and the top 500 scoring compounds extracted. The selectivity of the compounds for any given target varies considerably (Table 3), with the number of compounds present on any two different top 500 lists as low as 132 or as high as 283. In comparison, from two random selections of 500 items out of 9014 items (see SI Figure S24, 5th percentile = 19 compounds, 95th percentile = 36 compounds), 27 identical compounds would be expected on average. Thus, the high number of identical top-scoring compounds observed between any two targets indicates a nonrandom selection of these duplicate compounds.

Table 3. Number of Duplicate Compounds Found in the Top 500 Lists in Specific Pairs of Proteins^a
^a In grey/diagonal: the number of compounds unique to the corresponding target/site in the respective top 500 lists.

For any particular target, the number of nonduplicate compounds is relatively low, ranging from 8 to 50 (Table 3). The majority of the compounds selected bind to a single target (Figure 4). However, of all the compounds that are found in the top 500 lists, over half are calculated to bind to 3 or more targets. Molecular weight was found to be only weakly related to the number of protein targets a compound is calculated to bind to (see SI Figures S25 and S26). Therefore, the overlap in the top-scoring compounds is not an artifact of the size of the ligand. In the absence of a systematic experimental assay on each of these compounds, it is difficult to assess the significance of the overlap in the top 500 compounds. It is important to note that the “top compounds” are assembled based on relative docking scores between compounds against the same target, and not based on their absolute docking scores, which could artifactually inflate the number of duplicates. A high number of duplicates between the lists obtained on two different proteins could also indicate a computational bias of some compounds based on other criteria than their good fit to the targets. On the other hand, such high numbers could correctly identify promiscuous binding sites that do not display marked structural specificity and hence could be indeed targeted by similar compounds. It is outside of the scope of the present work to assertively differentiate between these two possibilities. However, the number of duplicates varies greatly across several pairs of targets, which renders unlikely a systematic bias in the docking (because of, say, molecular weight or other ligand properties independent of the target’s binding site).

Only ∼55% of the top 500 compounds were the same in the docking results from the 100 and 750 ns clusters. Thus, extending the T-REMD simulation time by a factor of 7.5 nearly doubled the chemical diversity. Future analysis will be needed to indicate if the compounds that are identical in both sets of docking calculations are promiscuous compounds that would bind to many protein structures or if many of the clusters from the MD trajectories end up being selected by the same compounds.

Comparison of Docking with Experimental Screening Results. The chemical databases used in the ensemble docking contained compounds from a variety of sources (i.e., SweetLeads, NCI, and Enamine). In a separate experimental screening campaign, 2900 chemicals have been tested by the National Institutes of Health, National Center for Advancing Translational Sciences (NCATS) and listed on https://opendata.ncats.nih.gov/covid19/databrowser (accessed 2020/11/02). Many of
of true positives, that is, how many of the top computational-scoring compounds were identified by NCATS as active as a percentage of how many of the top computational-scoring compounds were experimentally tested by NCATS.

We found that computational prediction is systematically enriched compared to a random selection of compounds. The experimental hit rate for NCATS compounds active in the spike protein is 6.1% for “strong actives” (NCATS definition) and 28.3% for strong and moderately active compounds (the value of 28.3% being unusually high). In contrast, the computational enrichment is between about two and four times as high (Table 4 and Figure 5). Out of the 673 unique compounds in the union of our top 500 lists for each spike protein simulation variant, 235 have been tested experimentally by NCATS. Of these 235 compounds, 33 (14%) are experimentally strong actives and 125 (53%) are strongly or moderately active. Narrowing the ranked lists from docking to the top 10 resulted in 17 unique compounds, of which 4 have experimental activity (0.14% of the total NCATS screen). Interestingly, all four of the experimentally tested compounds (i.e., 100% of the tested compounds in our top 10 lists) are strongly active.

Table 4. Number of Top-Scoring Computationally Predicted Compounds^a, Corresponding NCATS-Tested Compounds As a Subset from First Column, Percentage of Strong and Strong+Moderately Active Compounds for the Spike Protein (Top) and MPro (Bottom) Targets
no. of top compounds   no. of corresponding compounds   percentage of NCATS   percentage of NCATS actives
(docking)^a            tested (NCATS)                   actives (strong)      (strong+moderate)
Spike
673                    235                              14.0%                 53.2%
420                    158                              17.1%                 57.0%
292                    108                              20.4%                 61.1%
149                    55                               25.5%                 67.3%
81                     27                               33.3%                 77.8%
17                     4                                100.0%                100.0%
MPro
968                    359                              -                     7.0%
648                    221                              -                     6.8%
459                    156                              -                     7.7%
248                    86                               -                     9.3%
136                    45                               -                     8.9%
32                     7                                -                     14.3%
^a Top compounds from docking were obtained from the top 500, 300, 200, 100, and 50 ranked lists that correspond to each of the spike and MPro targets. For both systems multiple docking runs were considered and only unique compounds are reported.

Figure 5. Comparison of S-protein (Spike) true-positive rates for strong-actives. Plot shows percentage of experimental NCATS positives in top computational-predicted chemicals as solid line. Dashed line represents constant NCATS positive rate for comparison.

For the MPro assay, NCATS identifies only 1 strongly active out of 2675 compounds in the approved drugs collection that were tested experimentally. Therefore, we considered the overlap in our database with both the strong and moderately active compounds. Although the enrichment rates obtained through computational docking were not as high as the rates obtained for the spike protein, the computationally obtained MPro enrichments ranged from 7% to 14% and are still systematically better than the rates obtained experimentally (5.7%).

Interestingly, for both targets, as the number of ranked compounds considered is decreased, the computational enrichment improves. As expected, the very top screening compounds with the best docking score are often the most likely to have experimental activity, and the further we go down the ranked list, the lower the computational enrichment. Thus, docking performs better as a tool for identifying a small number of active compounds in a very small subset of a database, rather than a tool to identify many active compounds in a large subset of a database. This result is important when considering the prioritization of compounds from the billion-plus compound screen described below. A threshold of 10% or even 5% of a billion compounds database would still likely be too large a number to screen experimentally, but less than 1% would be more amenable to experimental validation. The present results suggest that the best enrichments are indeed obtained for a small or very small subset of the chemical databases used in docking.

Billion-Compound Supercomputing Screens. We found that for ligands with fewer rotatable bonds, such as those found in the Enamine data set, a docking calculation using 20 repeated runs could be performed in 0.5−2.5 s when using a Summit GPU (Figure 6). The same set of ligands docked with Vina on Summit’s CPUs showed a large spread of timings, with some ligands requiring nearly five minutes to complete (Figure 6). In practice, this means that with GPU-enabled docking, it is feasible to flexibly dock a billion compounds in about a day on modern supercomputers, whereas with Vina, a similar calculation would require a multiyear effort on a university cluster. We confirmed that for ligands with fewer than 11 rotatable bonds, the Solis-Wets algorithm in AD-GPU provides equivalent results to the new ADADELTA algorithm.61 For the trimmed data set, the top 5% of scores obtained with AD-GPU using the Solis-Wets algorithm formed an intersection with the top 5% of scores from Vina consisting of 18% of each top 5% set. Analysis of the full billion ligand sets is currently ongoing.

Mutation Analysis. The mutation frequency of the proteins simulated in this study is generally low. It should be noted, however, that it is as yet early in the history of SARS-CoV-2, and thus increased relative variability of residues along the proteome
Figure 6. General benchmarking of Autodock-GPU and Autodock Vina performance against subset of Enamine database.
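The per-ligand timings summarized in Figure 6 can be turned into a rough wall-clock estimate for a billion-compound screen. The sketch below takes the 0.5−2.5 s per ligand reported for AutoDock-GPU and assumes 4608 Summit nodes with six GPUs each; the node count is an assumption not stated in the text, and perfect load balancing with no I/O overhead is also assumed.

```python
def wall_clock_hours(n_ligands, seconds_per_ligand, n_gpus):
    """Idealized wall-clock hours: perfect load balance, no I/O or startup cost."""
    return n_ligands * seconds_per_ligand / n_gpus / 3600.0

# Assumption (not stated in the text): 4608 Summit nodes, six GPUs per node
SUMMIT_GPUS = 4608 * 6

fast = wall_clock_hours(1e9, 0.5, SUMMIT_GPUS)  # best-case per-ligand timing
slow = wall_clock_hours(1e9, 2.5, SUMMIT_GPUS)  # worst-case per-ligand timing
print(f"{fast:.1f} h to {slow:.1f} h for one billion ligands")
```

Under these idealized assumptions the estimate spans roughly 5 to 25 hours, which is in line with the statement that a billion compounds can be docked in about a day.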
Figure 7. Example mutational entropy analysis. Residues are colored by entropy, with redder colors corresponding to greater entropy.
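A minimal sketch of the per-column Shannon entropy that underlies the coloring in Figure 7 is given below. It skips gap and ambiguous characters, as described in the Methods. The base-2 logarithm is an assumption, chosen because it is consistent with the reported value of 0.94 for D614G; the function names, the toy alignment, and the use of 14 284 sequences as the column depth are likewise illustrative and are not the authors' custom script.

```python
import math
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, skipping gaps and ambiguous codes."""
    counts = Counter(ch for ch in column if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def alignment_entropy(sequences):
    """Entropy for every column of equal-length aligned protein sequences."""
    return [column_entropy(col) for col in zip(*sequences)]

# Toy alignment: only the second column varies
toy = alignment_entropy(["MKV", "MRV", "MKV", "MRV"])

# Illustration with the reported D614G count, assuming all 14 284 genomes cover the site
d614g = column_entropy("G" * 9107 + "D" * (14284 - 9107))
```

Because the entropy depends only on the residue frequencies in a column, a site split roughly 64:36 between two residues, as for D614G, scores close to the two-state maximum of 1 bit, whereas a site with a few hundred variants among more than 14 000 sequences scores near 0.1.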
may indicate the propensity of those residues for future mutations. We did find higher variability, given by the entropy values, in other SARS-CoV-2 proteins not included in this study; in particular, the Spike mutation D614G noted in other reports continues to be seen with high frequency since being described in a recent preprint that performed an analysis on GISAID through April 13.151 We counted 9107 D614G mutations (up from 3577 found April 13) and calculated an entropy of 0.94 for this residue. The NSP12 RdRp protein also shows a large mutation entropy at residue 323, with a mutation entropy of 0.95. This residue, P, has mutated to L 9078 times (and F 3 times). Note that not every protein was represented in all sequences used for entropy calculations. Other regions of the genome with higher entropy values (greater than 0.5) are residues 203 and 204 of orf9 (entropy 0.70 and 0.69, respectively), residue 85 of NSP2 (0.74), 37 of NSP6 (0.57), 57 of orf3a (0.81), and 84 of orf8 (0.56).
The highest entropy found among the structures in this study was in the main protease, with an entropy of 0.13 for residue 15, a glycine. We found 261 G15S mutations and one G15D mutation in our data set. The MPro also has a number of other residues with relatively high entropies, including residue 90, with entropy 0.07 and 117 K90R mutations, and 266 with entropy 0.04 and 64 A266V mutations. After this, the next highest entropy was 0.06 for the N-terminal region of the N protein and also for NSP15. These are displayed in Figure 7. An entropy of 0.04 was also found for domain X of NSP3 (glycine 76). A lower mutation entropy was found in PLPro, compared to MPro, with the highest value being 0.03. These mutations are important to consider when choosing targets for drug discovery, in that a protein that seems to be more rapidly mutating could potentially lead to an ineffective therapeutic if mutations alter the shape of the drug-binding site. In the case of MPro, the highest entropy mutations were not found in the active site; however, it is possible that they may still affect its conformation indirectly. The reduced mutation entropy for PLPro may indicate that an effort to target a protease could meet with fewer mutation-related problems if targeting PLPro rather than MPro.
■ FUTURE DIRECTIONS AND PRELIMINARY RESULTS FROM NEW METHODOLOGIES
As emphasized above, this article is a progress report on an ongoing project. The development of the pipeline is continuing with advances being made in several directions. Notably, we are incorporating artificial intelligence and machine learning into rescoring ligand ranking and clustering the MD trajectories. Further, we are developing methods to rescore docking using quantum chemical approaches. Although these developments have not been incorporated into the pipeline at the time of writing and were not applied to generate the results described above, we report on progress with them here.
Clustering MD Trajectories Using Deep Learning and AI. The deluge of data generated from simulations such as the T-REMD runs reported here can make traditional machine learning and clustering approaches (based on measures of similarity in the RMSD-space, or other metrics) quite challenging. Often, practical aspects of computing dictate the use of subsample tracts of the MD data itself or use of prior
Figure 8. Deep learning clusters T-REMD simulations of the NSP15 hexameric complex into conformational states that are potentially relevant for
docking studies. (A) A 3D-representation of the CVAE learned from the T-REMD simulations shows the presence of multiple conformational states.
Each conformation from the simulation is painted using the RMSD to the starting structure and shows the presence of distinct directions in the
conformational landscape where low- and high-RMSD structures are distributed. To understand this representation better, we use a t-distributed stochastic
neighbor embedding (t-SNE) algorithm to embed the data into a low-dimensional space, where we can clearly visualize how the conformational
landscape is organized. In this two-dimensional space, we visualize various observables from the simulations, including (B) RMSD to the native
structure, (C) SASA, and (D) radius of gyration. In each of these cases, we can observe the presence of at least three dominant substates with distinct
structural characteristics, which can be further used for docking simulations.
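How substates in a two-dimensional embedding such as panels B to D might be turned into a short list of conformations for docking can be sketched as follows. The CVAE and t-SNE steps themselves are not reproduced here. The sketch assumes that 2D embedded coordinates are already available for every frame, and it swaps in plain k-means as a stand-in grouping step, which is an assumption of this example and not the method reported in the article. All names and data points are hypothetical.

```python
from math import dist


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


def representatives(points, labels, centroids):
    """Index of the frame closest to the centroid of each cluster."""
    reps = []
    for c, centre in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps.append(min(members, key=lambda i: dist(points[i], centre)))
    return reps


# Toy embedding: six hypothetical frames forming two well-separated groups.
embedded = [(0, 0), (5, 5), (1, 0), (6, 5), (0, 1), (5, 6)]
labels, centroids = kmeans(embedded, k=2)
reps = representatives(embedded, labels, centroids)
print("cluster labels:", labels)
print("representative frame indices:", reps)
```

The representative frame of each group is the one whose embedded coordinates lie closest to the group centroid. In practice the number of groups would have to be chosen by inspecting the embedding.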
knowledge about these data sets (e.g., knowing that the ligand binds only in a certain orientation) to filter such data sets. Deep learning techniques can be particularly valuable in “sifting” through large data sets and can be powerful for clustering T-REMD simulations. We are investigating the use of a variational autoencoder with convolutional filters (CVAE), previously developed to cluster protein folding trajectories,152,153 to cluster the T-REMD simulations of NSP15. As shown in Figure 8, we find that the latent dimensions learned from the simulations indeed cluster the simulation data into a small number of conformational states. These states correspond to transitions observed in the simulations, as seen from various measured observables from the data such as the binding site RMSD, SASA, and the radius of gyration.
The outcomes from the clustering provide insights into aspects of how the T-REMD simulations have sampled the conformational landscape; for example, in the case of this protein, as observed in Panel C, there is only one conformational state which has sampled a large SASA, indicating a potentially open state (which has only a minor change in the overall RMSD, Panel B). This information can be particularly helpful for selecting conformations and identifying metastable states for docking simulations.154
Protein−ligand Rescoring Using Machine-Learning. The computational identification of drug compounds, and small molecules in general, that bind to a protein consists of three distinct tasks: (1) identifying a putative conformation of the protein−ligand complex (the docking problem); (2) given a docked conformation, determining whether or not the ligand is a true binder (the screening problem); and (3) if the ligand is determined to be a true binder, ascertaining a relative, or better yet, absolute binding affinity (the affinity, or scoring, problem). In principle, one could perform the screening and affinity prediction problems using molecular dynamics techniques such as free energy perturbation, thermodynamic integration, or more approximate methods such as MM/PB(GB)SA (molecular mechanics/Poisson-Boltzmann[generalized Born]-surface area). However, this is computationally intractable for large numbers of compounds, even with supercomputers, and the accuracy can often be poor. Furthermore, these rigorous first-principles-based methods assume a putative binding site, and cannot be applied to cases where the binding site is unknown. While the score or energy given by computational docking programs such as AutoDock Vina is reasonably well-suited for docking pose prediction, improvements are possible on the screening and affinity problems, and for this, we use machine learning here.
There is an ongoing need for the development of computationally tractable models that can be easily validated on benchmark docking data sets. To this end, accurate, physics-based, machine-learned models for the docking and affinity have been trained using the PDBbind database, a data set consisting of experimentally determined protein−ligand complex structures with accurate experimental binding affinities.155−158 On an independent data set, the CASF-2013 benchmark,158,159 affinity prediction (random forest-based) models achieve, at best, a
the interactions between ligands and the S-protein, we compare FMO−DFTB/PCM pair interaction energy (PIE) to that of the higher-level, but more expensive FMO-MP2/PCM method. The PIEs were calculated for ligands binding to the S-protein in the binding pocket. Figure 9 shows that FMO−DFTB/PCM interaction energies agree very well with high-level ab initio FMO-MP2/PCM data with the R correlation coefficient; in this

■ CONCLUSIONS
The present manuscript reports on the establishment of a supercomputer-based virtual high-throughput screening ensemble-docking pipeline that takes into account the dynamic properties of protein targets, as well as preliminary results on
simulations and docking screens to a number of protein targets from SARS-CoV-2.
The speed at which structural data have been derived experimentally for the SARS-CoV-2 proteome means that several of the simulations reported above were “out of date” almost immediately. By this, we mean that the simulations were performed using models derived from experiments that had been superseded by higher-resolution or more complete data. Examples of these are the S-protein, MPro, N Protein, and NSP9. Clearly, as information on structures increases in quality, simulations will be further repeated. Furthermore, the complexity of the structural models derived is expected to increase. For example, models of the S protein interacting with the viral envelope or extending up to the complete virion can be envisaged and, in principle at least, incorporated into drug screening protocols.
The present results provide comprehensive simulation models for eight of the viral proteins in 24 molecular systems. T-REMD is well suited for massively parallel supercomputing because many replicas are run simultaneously, and they need to communicate with each other. In the present tests, 350 ns/day/replica was obtained for the smallest (NSP3 phosphatase/ADRP) system, and this, therefore, scales up to about 1.5 ms/day of aggregate MD time, given the hypothetical situation that one had about 100 different proteins to run of roughly the same size. For bigger systems, with 10^5 atoms, the throughput is lower, about 1.0 ms/day. Nevertheless, it is clear that extensive simulation data can be obtained on many proteins with a short time-to-solution on this machine. As one possible future direction, one might envisage running T-REMD on the 44 drug targets that have been suggested as a minimal screen for the toxicity effects in human drug trials.162
The ensemble docking performed so far mostly involves repurposing databases and therefore is limited to about 10k compounds. Many of these compounds are predicted to be quite promiscuous in binding to the targets. Two of the compounds identified in the top 1% of our preliminary S-protein screen have been reported to be in two registered clinical trials (quercetin and hypericin). Further, several compounds from the screens reported above show activity in reducing live viral infectivity: these results will be reported elsewhere.
The docking results using the smaller database were not run on Summit, because code running on GPUs is preferred on Summit. However, as COVID-19 therapeutic research moves beyond repurposing to the discovery of novel compounds, there is a need to quickly screen many more compounds. Therefore, we have installed Autodock-GPU and demonstrated that it is capable of screening 1 billion compounds on Summit in 12 h when scaled to the whole machine. Although several other groups have reported billion-compound screens, these have been using AI approaches or rigid docking without pose optimization.163,164 The present billion-compound screen calculations, therefore, represent a potential supercomputer-driven paradigm shift in computational drug discovery and can be envisaged to be performed on dozens of proteins in a single day when the exascale era of supercomputing arrives, as planned for 2021.

■ ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c01010.
Additional tables describing the simulation systems are provided; additionally, descriptive figures showing cluster diversity for each protein system are also provided (PDF)

■ AUTHOR INFORMATION
Corresponding Author
J. C. Smith − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; The University of Tennessee, Knoxville, Department of Biochemistry & Cellular and Molecular Biology, Knoxville, Tennessee 37996, United States; orcid.org/0000-0002-2978-3227; Email: smithjc@ornl.gov

Authors
A. Acharya − School of Physics, Georgia Institute of Technology, Atlanta, Georgia 30332, United States; orcid.org/0000-0002-6960-7789
R. Agarwal − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee 37996, United States; orcid.org/0000-0003-1029-2281
M. B. Baker − Computer Science and Mathematics Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States
J. Baudry − The University of Alabama in Huntsville, Department of Biological Sciences, Huntsville, Alabama 35899, United States; orcid.org/0000-0002-1969-1679
D. Bhowmik − Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States; orcid.org/0000-0001-7770-9091
S. Boehm − Computer Science and Mathematics Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States
K. G. Byler − The University of Alabama in Huntsville, Department of Biological Sciences, Huntsville, Alabama 35899, United States
S. Y. Chen − Computational Science Initiative, Brookhaven National Laboratory, Upton, New York 11973, United States
L. Coates − Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States; orcid.org/0000-0003-2342-049X
C. J. Cooper − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee 37996, United States; orcid.org/0000-0002-5527-9948
O. Demerdash − Biosciences Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States
I. Daidone − Department of Physical and Chemical Sciences, University of L’Aquila, I-67010 L’Aquila, Italy; orcid.org/0000-0001-8970-8408
J. D. Eblen − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; The University of Tennessee, Knoxville, Department of Biochemistry & Cellular and Molecular Biology, Knoxville, Tennessee 37996, United States
S. Ellingson − University of Kentucky, Division of Biomedical Informatics, College of Medicine, Lexington, Kentucky 40536, United States
S. Forli − Scripps Research, La Jolla, California 92037, United States; orcid.org/0000-0002-5964-7111
J. Glaser − National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States
J. C. Gumbart − School of Physics, Georgia Institute of Technology, Atlanta, Georgia 30332, United States; orcid.org/0000-0002-1510-7842
J. Gunnels − HPC Engineering, Amazon Web Services, Seattle, Washington 98121, United States
O. Hernandez − Computer Science and Mathematics Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States
S. Irle − Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States; Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States; Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee, Knoxville, Tennessee 37996, United States; orcid.org/0000-0003-4995-4991
D. W. Kneller − Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
A. Kovalevsky − Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States; orcid.org/0000-0003-4459-9142
J. Larkin − NVIDIA Corporation, Santa Clara, California 95051, United States
T. J. Lawrence − Biosciences Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States
S. LeGrand − NVIDIA Corporation, Santa Clara, California 95051, United States
S.-H. Liu − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; The University of Tennessee, Knoxville, Department of Biochemistry & Cellular and Molecular Biology, Knoxville, Tennessee 37996, United States; orcid.org/0000-0002-8889-1375
J. C. Mitchell − Biosciences Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States
G. Park − Computational Science Initiative, Brookhaven National Laboratory, Upton, New York 11973, United States
J. M. Parks − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee 37996, United States; orcid.org/0000-0002-3103-9333
A. Pavlova − School of Physics, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
L. Petridis − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; The University of Tennessee, Knoxville, Department of Biochemistry & Cellular and Molecular Biology, Knoxville, Tennessee 37996, United States; orcid.org/0000-0001-8569-060X
D. Poole − NVIDIA Corporation, Santa Clara, California 95051, United States
L. Pouchard − Computational Science Initiative, Brookhaven National Laboratory, Upton, New York 11973, United States
A. Ramanathan − Data Science and Learning Division, Argonne National Lab, Lemont, Illinois 60439, United States
D. M. Rogers − National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; orcid.org/0000-0002-5187-1768
D. Santos-Martins − Scripps Research, La Jolla, California 92037, United States
A. Scheinberg − Jubilee Development, Cambridge, Massachusetts 02139, United States
A. Sedova − Biosciences Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States; orcid.org/0000-0002-8233-3057
Y. Shen − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee 37996, United States
M. D. Smith − UT/ORNL Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; The University of Tennessee, Knoxville, Department of Biochemistry & Cellular and Molecular Biology, Knoxville, Tennessee 37996, United States; orcid.org/0000-0002-0777-7539
C. Soto − Computational Science Initiative, Brookhaven National Laboratory, Upton, New York 11973, United States
A. Tsaris − National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States
M. Thavappiragasam − Biosciences Division, Oak Ridge National Lab, Oak Ridge, Tennessee 37830, United States
A. F. Tillack − Scripps Research, La Jolla, California 92037, United States
J. V. Vermaas − National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States; orcid.org/0000-0003-3139-6469
V. Q. Vuong − Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States; Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States; Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee, Knoxville, Tennessee 37996, United States
J. Yin − National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37830, United States
S. Yoo − Computational Science Initiative, Brookhaven National Laboratory, Upton, New York 11973, United States
M. Zahran − Department of Biological Sciences, New York City College of Technology, The City University of New York (CUNY), Brooklyn, New York 11201, United States
L. Zanetti-Polzi − CNR Institute of Nanoscience, I-41125 Modena, Italy; orcid.org/0000-0002-2550-4796
Complete contact information is available at: https://pubs.acs.org/10.1021/acs.jcim.0c01010

Author Contributions
Authors are listed in alphabetical order. All authors assisted in drafting and revising the manuscript. A.A. assisted in MPro protonation state selection. R.A. assisted in AD-GPU for Summit development and testing; he also assisted in clustering and performed ensemble docking calculations. J.B. performed clustering, ensemble docking, and docking analysis and curated the repurposing ligand database. M.B.B. created and deployed the ligand pre-processing pipeline at scale on Summit. S.B. assisted
Nsp15 endoribonuclease NendoU from SARS-CoV-2. Protein Sci. 2020, 29, 1596−1605.
(10) Liu, C.; Zhou, Q.; Li, Y.; Garner, L. V.; Watkins, S. P.; Carter, L. J.; Smoot, J.; Gregg, A. C.; Daniels, A. D.; Jervey, S.; Albaiu, D. Research and Development on Therapeutic Agents and Vaccines for COVID-19 and Related Human Coronavirus Diseases. ACS Cent. Sci. 2020, 6, 315−331.
(11) Kim, P. S.; Read, S. W.; Fauci, A. S. Therapy for Early COVID-19: A Critical Need. JAMA, J. Am. Med. Assoc. 2020. DOI: 10.1001/jama.2020.22813
(12) Jiang, S.; Hillyer, C.; Du, L. Neutralizing Antibodies against SARS-CoV-2 and Other Human Coronaviruses. Trends Immunol. 2020, 41, 355−359.
(13) Netherton, C. L.; Wileman, T. Virus factories, double membrane vesicles and viroplasm generated in animal cells. Curr. Opin. Virol. 2011, 1, 381−7.
(14) den Boon, J. A.; Diaz, A.; Ahlquist, P. Cytoplasmic viral replication complexes. Cell Host Microbe 2010, 8, 77−85.
(15) Hoffmann, M.; Kleine-Weber, H.; Pohlmann, S. A Multibasic Cleavage Site in the Spike Protein of SARS-CoV-2 Is Essential for Infection of Human Lung Cells. Mol. Cell 2020, 78, 779−784 e5.
(16) Hagemeijer, M. C.; Rottier, P. J.; de Haan, C. A. Biogenesis and dynamics of the coronavirus replicative structures. Viruses 2012, 4, 3245−69.
(17) Wu, A.; Peng, Y.; Huang, B.; Ding, X.; Wang, X.; Niu, P.; Meng, J.; Zhu, Z.; Zhang, Z.; Wang, J. Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China. Cell Host Microbe 2020, 27, 325.
(18) Li, X.; Zhang, L.; Duan, Y.; Yu, J.; Wang, L.; Yang, K.; Liu, F.; You, T.; Liu, X.; Yang, X. Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. Nature 2020, 582, 289.
(19) Yin, W.; Mao, C.; Luan, X.; Shen, D.-D.; Shen, Q.; Su, H.; Wang, X.; Zhou, F.; Zhao, W.; Gao, M. Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir. Science 2020, 368, 1499.
(20) Wang, X.; Cao, R.; Zhang, H.; Liu, J.; Xu, M.; Hu, H.; Li, Y.; Zhao, L.; Li, W.; Sun, X. The anti-influenza virus drug, Arbidol is an efficient inhibitor of SARS-CoV-2 in vitro. Cell Discovery 2020, 6, 1−5.
(21) Lei, J.; Hilgenfeld, R. RNA-virus proteases counteracting host innate immunity. FEBS Lett. 2017, 591, 3190−3210.
(22) Smith, M.; Smith, J. Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein-Human ACE2 Interface. 2020.
(23) Kaldor, S. W.; Kalish, V. J.; Davies, J. F., 2nd; Shetty, B. V.; Fritz, J. E.; Appelt, K.; Burgess, J. A.; Campanale, K. M.; Chirgadze, N. Y.; Clawson, D. K.; Dressman, B. A.; Hatch, S. D.; Khalil, D. A.; Kosa, M. B.; Lubbehusen, P. P.; Muesing, M. A.; Patick, A. K.; Reich, S. H.; Su, K. S.; Tatlock, J. H. Viracept (nelfinavir mesylate, AG1343): a potent, orally bioavailable inhibitor of HIV-1 protease. J. Med. Chem. 1997, 40, 3979−85.
(24) von Itzstein, M.; Wu, W.-Y.; Kok, G. B.; Pegg, M. S.; Dyason, J. C.; Jin, B.; Van Phan, T.; Smythe, M. L.; White, H. F.; Oliver, S. W. Rational design of potent sialidase-based inhibitors of influenza virus replication. Nature 1993, 363, 418−423.
(25) Gorgulla, C.; Boeszoermenyi, A.; Wang, Z. F.; Fischer, P. D.; Coote, P. W.; Padmanabha Das, K. M.; Malets, Y. S.; Radchenko, D. S.; Moroz, Y. S.; Scott, D. A.; Fackeldey, K.; Hoffmann, M.; Iavniuk, I.; Wagner, G.; Arthanari, H. An open-source drug discovery platform enables ultra-large virtual screens. Nature 2020, 580, 663−668.
(26) Amaro, R. E.; Baudry, J.; Chodera, J.; Demir, Ö.; McCammon, J. A.; Miao, Y.; Smith, J. C. Ensemble Docking in Drug Discovery. Biophys. J. 2018, 114, 2271−2278.
(27) Teague, S. J. Implications of protein flexibility for drug discovery. Nat. Rev. Drug Discovery 2003, 2, 527−541.
(28) Carlson, H. A.; Masukawa, K. M.; McCammon, J. A. Method for Including the Dynamic Fluctuations of a Protein in Computer-Aided Drug Design. J. Phys. Chem. A 1999, 103, 10213−10219.
(29) Amaro, R. E.; Baudry, J.; Chodera, J.; Demir, Ö.; McCammon, J. A.; Miao, Y.; Smith, J. C. Ensemble Docking in Drug Discovery. Biophys. J. 2018, 114, 2271−2278.
(30) Pi, M.; Kapoor, K.; Wu, Y.; Ye, R.; Senogles, S. E.; Nishimoto, S. K.; Hwang, D.-J.; Miller, D. D.; Narayanan, R.; Smith, J. C.; Baudry, J.; Quarles, L. D. Structural and Functional Evidence for Testosterone Activation of GPRC6A in Peripheral Tissues. Mol. Endocrinol. 2015, 29, 1759−1773.
(31) Pi, M.; Kapoor, K.; Ye, R.; Nishimoto, S. K.; Smith, J. C.; Baudry, J.; Quarles, L. D. Evidence for osteocalcin binding and activation of GPRC6A in β-cells. Endocrinology 2016, 157, 1866−1880.
(32) Evangelista, W.; Weir, R. L.; Ellingson, S. R.; Harris, J. B.; Kapoor, K.; Smith, J. C.; Baudry, J. Ensemble-based docking: From hit discovery to metabolism and toxicity predictions. Bioorg. Med. Chem. 2016, 24, 4928−4935.
(33) Xiao, Z.; Riccardi, D.; Velazquez, H. A.; Chin, A. L.; Yates, C. R.; Carrick, J. D.; Smith, J. C.; Baudry, J.; Quarles, L. D. A computationally identified compound antagonizes excess FGF-23 signaling in renal tubules and a mouse model of hypophosphatemia. Sci. Signaling 2016, 9, ra113.
(34) Abdali, N.; Parks, J. M.; Haynes, K. M.; Chaney, J. L.; Green, A. T.; Wolloscheck, D.; Walker, J. K.; Rybenkov, V. V.; Baudry, J.; Smith, J. C.; Zgurskaya, H. I. Reviving Antibiotics: Efflux Pump Inhibitors That Interact with AcrA, a Membrane Fusion Protein of the AcrAB-TolC Multidrug Efflux Pump. ACS Infect. Dis. 2017, 3, 89−98.
(35) Dale, J. B.; Smeesters, P. R.; Courtney, H. S.; Penfound, T. A.; Hohn, C. M.; Smith, J. C.; Baudry, J. Y. Structure-based design of broadly protective group a streptococcal M protein-based vaccines. Vaccine 2017, 35, 19−26.
(36) Haynes, K. M.; Abdali, N.; Jhawar, V.; Zgurskaya, H. I.; Parks, J. M.; Green, A. T.; Baudry, J.; Rybenkov, V. V.; Smith, J. C.; Walker, J. K. Identification and Structure−Activity Relationships of Novel Compounds that Potentiate the Activities of Antibiotics in Escherichia coli. J. Med. Chem. 2017, 60, 6205−6219.
(37) Velazquez, H. A.; Riccardi, D.; Xiao, Z.; Quarles, L. D.; Yates, C. R.; Baudry, J.; Smith, J. C. Ensemble docking to difficult targets in early-stage drug discovery: Methodology and application to fibroblast growth factor 23. Chem. Biol. Drug Des. 2018, 91, 491−504.
(38) Xiao, Z.; Baudry, J.; Cao, L.; Huang, J.; Chen, H.; Yates, C. R.; Li, W.; Dong, B.; Waters, C. M.; Smith, J. C. Polycystin-1 interacts with TAZ to stimulate osteoblastogenesis and inhibit adipogenesis. J. Clin. Invest. 2018, 128, 157−174.
(39) Pi, M.; Kapoor, K.; Ye, R.; Hwang, D.-J.; Miller, D. D.; Smith, J. C.; Baudry, J.; Quarles, L. D. Computationally identified novel agonists for GPRC6A. PLoS One 2018, 13, e0195980.
(40) Darzynkiewicz, Z. M.; Green, A. T.; Abdali, N.; Hazel, A.; Fulton, R. L.; Kimball, J.; Gryczynski, Z.; Gumbart, J. C.; Parks, J. M.; Smith, J. C. Identification of binding sites for efflux pump inhibitors of the AcrAB-TolC component AcrA. Biophys. J. 2019, 116, 648−658.
(41) Evangelista Falcon, W.; Ellingson, S. R.; Smith, J. C.; Baudry, J. Ensemble Docking in Drug Discovery: How Many Protein Configurations from Molecular Dynamics Simulations are Needed To Reproduce Known Ligand Binding? J. Phys. Chem. B 2019, 123, 5189−5195.
(42) Aranha, M. P.; Spooner, C.; Demerdash, O.; Czejdo, B.; Smith, J. C.; Mitchell, J. C. Prediction of peptide binding to MHC using machine learning with sequence and structure-based feature sets. Biochim. Biophys. Acta, Gen. Subj. 2020, 1864, 129535.
(43) Smith, M.; Smith, J. C. Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein-Human ACE2 Interface. ChemRxiv 2020, DOI: 10.26434/chemrxiv.11871402.v4.
(44) Parks, J. M.; Smith, J. C. How to discover antiviral drugs quickly. N. Engl. J. Med. 2020, 382, 2261.
(45) Agarwal, R.; Bensing, B. A.; Mi, D.; Vinson, P. N.; Baudry, J.; Iverson, T. M.; Smith, J. C. Structure based virtual screening identifies small molecule effectors for the sialoglycan binding protein Hsa. Biochem. J. 2020, 477, 3695−3707.
(46) Gupta, M.; Ha, K.; Agarwal, R.; Quarles, L. D.; Smith, J. C., computing Pipelines Search for Therapeutics Against COVID-19.
Molecular dynamics analysis of the binding of human interleukin-6 with Comput. Sci. Eng. 2020, 1−1.
interleukin-6 α-receptor. Proteins: Struct., Funct., Genet. 2020. (65) Fedorov, D. G., The fragment molecular orbital method:
DOI: 10.1002/prot.26002 theoretical development, implementation in GAMESS and applica-
(47) Evangelista Falcon, W.; Ellingson, S. R.; Smith, J. C.; Baudry, J. tions. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2017, 7. DOI: 10.1002/
Ensemble Docking in Drug Discovery: How Many Protein wcms.1322
Configurations from Molecular Dynamics Simulations are Needed (66) Gordon, D. E.; Jang, G. M.; Bouhaddou, M.; Xu, J.; Obernier, K.;
To Reproduce Known Ligand Binding? J. Phys. Chem. B 2019, 123, White, K. M.; O’Meara, M. J.; Rezelj, V. V.; Guo, J. Z.; Swaney, D. L. A
5189−5195. SARS-CoV-2 protein interaction map reveals targets for drug
(48) Abraham, M. J.; Murtola, T.; Schulz, R.; Páll, S.; Smith, J. C.; repurposing. Nature 2020, 1−13.
Hess, B.; Lindahl, E. GROMACS: High performance molecular (67) Zhou, H.; Fang, Y.; Xu, T.; Ni, W. J.; Shen, A. Z.; Meng, X. M.,
simulations through multi-level parallelism from laptops to super- Potential therapeutic targets and promising drugs for combating SARS-
computers. SoftwareX 2015, 1−2, 19−25. CoV-2. Br. J. Pharmacol. 2020, 177, 3147.
(49) Kutzner, C.; Páll, S.; Fechner, M.; Esztermann, A.; de Groot, B. (68) Kim, Y.; Jedrzejczak, R.; Maltseva, N. I.; Wilamowski, M.; Endres,
L.; Grubmüller, H. Best bang for your buck: GPU nodes for M.; Godzik, A.; Michalska, K.; Joachimiak, A., Crystal structure of
GROMACS biomolecular simulations. J. Comput. Chem. 2015, 36, Nsp15 endoribonuclease NendoU from SARS-CoV-2. Protein Sci.
1990−2008. 2020, 29, 1596.
(50) Van Der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. (69) Jin, Z.; Du, X.; Xu, Y.; Deng, Y.; Liu, M.; Zhao, Y.; Zhang, B.; Li,
E.; Berendsen, H. J. C. GROMACS: Fast, flexible, and free. J. Comput. X.; Zhang, L.; Peng, C. Structure of M pro from SARS-CoV-2 and
Chem. 2005, 26, 1701−1718. discovery of its inhibitors. Nature 2020, 582, 1−5.
(51) Earl, D. J.; Deem, M. W. Parallel tempering: Theory, applications, (70) Lan, J.; Ge, J.; Yu, J.; Shan, S.; Zhou, H.; Fan, S.; Zhang, Q.; Shi,
and new perspectives. Phys. Chem. Chem. Phys. 2005, 7, 3910−3916. X.; Wang, Q.; Zhang, L. Structure of the SARS-CoV-2 spike receptor-
(52) Hansmann, U. H. E. Parallel tempering algorithm for binding domain bound to the ACE2 receptor. Nature 2020, 581, 215−
conformational studies of biological molecules. Chem. Phys. Lett. 220.
1997, 281, 140−150. (71) Kang, S.; Yang, M.; Hong, Z.; Zhang, L.; Huang, Z.; Chen, X.; He,
(53) Sugita, Y.; Okamoto, Y. Replica-exchange molecular dynamics S.; Zhou, Z.; Zhou, Z.; Chen, Q., Crystal structure of SARS-CoV-2
method for protein folding. Chem. Phys. Lett. 1999, 314, 141−151. nucleocapsid protein RNA binding domain reveals potential unique
(54) Sugita, Y.; Kitao, A.; Okamoto, Y. Multidimensional replica- drug targeting sites. Acta Pharm. Sin. B 2020.101228
exchange method for free-energy calculations. J. Chem. Phys. 2000, 113, (72) Maeda, Y.; Kinoshita, T. The acidic environment of the Golgi is
6042−6051. critical for glycosylation and transport. In Methods Enzymol.; Elsevier:
(55) Bernardi, R. C.; Melo, M. C. R.; Schulten, K. Enhanced sampling 2010; Vol. 480, pp 495−510.
techniques in molecular dynamics simulations of biological systems. (73) Wu, M. M.; Grabe, M.; Adams, S.; Tsien, R. Y.; Moore, H. P.;
Biochim. Biophys. Acta, Gen. Subj. 2015, 1850, 872−877. Machen, T. E. Mechanisms of pH regulation in the regulated secretory
(56) Tsai, T.-Y.; Chang, K.-W.; Chen, C. Y.-C. iScreen: world’s first pathway. J. Biol. Chem. 2001, 276, 33027−35.
cloud-computing web server for virtual screening and de novo drug (74) Zumla, A.; Chan, J. F. W.; Azhar, E. I.; Hui, D. S. C.; Yuen, K.-Y.
design based on TCM database@Taiwan. J. Comput.-Aided Mol. Des. Coronaviruses drug discovery and therapeutic options. Nat. Rev.
2011, 25, 525−531. Drug Discovery 2016, 15, 327−347.
(57) Dolezal, R.; Sobeslav, V.; Hornig, O.; Balik, L.; Korabecny, J.; (75) Chen, A. A.; Pappu, R. V. Parameters of Monovalent Ions in the
Kuca, K. HPC Cloud Technologies for Virtual Screening in Drug AMBER-99 Forcefield: Assessment of Inaccuracies and Proposed
Discovery. Cham, 2015; Springer International Publishing: Cham, Improvements. J. Phys. Chem. B 2007, 111, 11884−11887.
2015; pp 440−449. (76) Yoo, J.; Aksimentiev, A. Improved Parametrization of Li+, Na+,
(58) Huey, R.; Morris, G. M.; Olson, A. J.; Goodsell, D. S. A K+, and Mg2+ Ions for All-Atom Molecular Dynamics Simulations of
semiempirical free energy force field with charge-based desolvation. J. Nucleic Acid Systems. J. Phys. Chem. Lett. 2012, 3, 45−50.
Comput. Chem. 2007, 28, 1145−1152. (77) Ahlstrand, E.; Schpector, J. Z.; Friedman, R. Computer
(59) Huey, R.; Goodsell, D. S.; Morris, G. M.; Olson, A. J. Grid-based simulations of alkali-acetate solutions: Accuracy of the forcefields in
hydrogen bond potentials with improved directionality. Letters in Drug difference concentrations. J. Chem. Phys. 2017, 147, 194102.
Design & Discovery 2004, 1, 178−183. (78) Jing, Z.; Liu, C.; Cheng, S. Y.; Qi, R.; Walker, B. D.; Piquemal, J.-
(60) El Khoury, L.; Santos-Martins, D.; Sasmal, S.; Eberhardt, J.; P.; Ren, P. Polarizable Force Fields for Biomolecular Simulations:
Bianco, G.; Ambrosio, F. A.; Solis-Vasquez, L.; Koch, A.; Forli, S.; Recent Advances and Applications. Annu. Rev. Biophys. 2019, 48, 371−
Mobley, D. L. Comparison of affinity ranking using AutoDock-GPU 394.
and MM-GBSA scores for BACE-1 inhibitors in the D3R Grand (79) Chang, C.-k.; Hou, M.-H.; Chang, C.-F.; Hsiao, C.-D.; Huang,
Challenge 4. J. Comput.-Aided Mol. Des. 2019, 33, 1011−1020. T.-h. The SARS coronavirus nucleocapsid protein−forms and
(61) Solis-Vasquez, L.; Santos-Martins, D.; Koch, A.; Forli, S. functions. Antiviral Res. 2014, 103, 39−50.
Evaluating the Energy Efficiency of OpenCL-accelerated AutoDock (80) Astuti, I.; Ysrafil. Severe Acute Respiratory Syndrome
Molecular Docking. In 2020 28th Euromicro International Conference on Coronavirus 2 (SARS-CoV-2): An overview of viral structure and
Parallel, Distributed and Network-Based Processing (PDP), 2020; IEEE, host response. Diabetes & Metabolic Syndrome: Clinical Research &
2020; pp 162−166. Reviews 2020, 14, 407−412.
(62) Santos-Martins, D.; Solis-Vasquez, L.; Koch, A.; Forli, S., (81) Lei, J.; Kusov, Y.; Hilgenfeld, R. Nsp3 of coronaviruses:
Accelerating autodock4 with gpus and gradient-based local search. Structures and functions of a large multi-domain protein. Antiviral Res.
2019. 2018, 149, 58−74.
(63) LeGrand, S.; Scheinberg, A.; Tillack, A. F.; Thavappiragasam, M.; (82) Báez-Santos, Y. M.; St John, S. E.; Mesecar, A. D. The SARS-
Vermaas, J. V.; Agarwal, R.; Larkin, J.; Poole, D.; Santos-Martins, D.; coronavirus papain-like protease: structure, function and inhibition by
Solis-Vasquez, L. GPU-Accelerated Drug Discovery with Docking on designed antiviral compounds. Antiviral Res. 2015, 115, 21−38.
the Summit Supercomputer: Porting, Optimization, and Application to (83) Michalska, K.; Kim, Y.; Jedrzejczak, R.; Maltseva, N. I.; Stols, L.;
COVID-19 Research. In Proceedings of the 11th ACM International Endres, M.; Joachimiak, A., Crystal structures of SARS-CoV-2 ADP-
Conference on Bioinformatics; Computational Biology and Health ribose phosphatase (ADRP): from the apo form to ligand complexes.
Informatics, 2020; 2020; pp 1−10. IUCrJ 2020, 7814
(64) Vermaas, J. V.; Sedova, A.; Baker, M.; Boehm, S.; Rogers, D.; (84) Lei, J.; Hilgenfeld, R. RNA-virus proteases counteracting host
Larkin, J.; Glaser, J.; Smith, M.; Hernandez, O.; Smith, J. Super- innate immunity. FEBS Lett. 2017, 591, 3190−3210.
(85) Jin, Z.; Du, X.; Xu, Y.; Deng, Y.; Liu, M.; Zhao, Y.; Zhang, B.; Li, X.; Zhang, L.; Peng, C.; Duan, Y.; Yu, J.; Wang, L.; Yang, K.; Liu, F.; Jiang, R.; Yang, X.; You, T.; Liu, X.; Yang, X.; Bai, F.; Liu, H.; Liu, X.; Guddat, L. W.; Xu, W.; Xiao, G.; Qin, C.; Shi, Z.; Jiang, H.; Rao, Z.; Yang, H. Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. Nature 2020, 582, 289−293.
(86) Littler, D. R.; Gully, B. S.; Colson, R. N.; Rossjohn, J. Crystal Structure of the SARS-CoV-2 Non-structural Protein 9, Nsp9. iScience 2020, 23, 101258.
(87) Kim, Y.; Jedrzejczak, R.; Maltseva, N. I.; Wilamowski, M.; Endres, M.; Godzik, A.; Michalska, K.; Joachimiak, A. Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2. Protein Sci. 2020, 29, 1596−1605.
(88) Rosas-Lemus, M.; Minasov, G.; Shuvalova, L.; Inniss, N. L.; Kiryukhina, O.; Wiersum, G.; Kim, Y.; Jedrzejczak, R.; Enders, M.; Jaroszewski, L.; Godzik, A.; Joachimiak, A.; Satchell, K. J. F. The crystal structure of nsp10-nsp16 heterodimer from SARS-CoV-2 in complex with S-adenosylmethionine. bioRxiv 2020, 2020.04.17.047498.
(89) Chen, Y.; Su, C.; Ke, M.; Jin, X.; Xu, L.; Zhang, Z.; Wu, A.; Sun, Y.; Yang, Z.; Tien, P.; Ahola, T.; Liang, Y.; Liu, X.; Guo, D. Biochemical and Structural Insights into the Mechanisms of SARS Coronavirus RNA Ribose 2′-O-Methylation by nsp16/nsp10 Protein Complex. PLoS Pathog. 2011, 7, e1002294.
(90) Menachery, V. D.; Debbink, K.; Baric, R. S. Coronavirus non-structural protein 16: Evasion, attenuation, and possible treatments. Virus Res. 2014, 194, 191−199.
(91) Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, 1301.3781.
(92) Robertson, S. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval 2009, 3, 333−389.
(93) Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Association for Computational Linguistics; Minneapolis, MN, 2019.
(94) Jo, S.; Kim, T.; Iyer, V. G.; Im, W. CHARMM-GUI: a web-based graphical user interface for CHARMM. J. Comput. Chem. 2008, 29, 1859−65.
(95) Zhang, J.; Yang, W.; Piquemal, J.-P.; Ren, P. Modeling Structural Coordination and Ligand Binding in Zinc Proteins with a Polarizable Potential. J. Chem. Theory Comput. 2012, 8, 1314−1324.
(96) Calimet, N.; Simonson, T. CysxHisy−Zn2+ interactions: Possibilities and limitations of a simple pairwise force field. J. Mol. Graphics Modell. 2006, 24, 404−411.
(97) Wambo, T. O.; Chen, L. Y.; McHardy, S. F.; Tsin, A. T. Molecular dynamics study of human carbonic anhydrase II in complex with Zn2+ and acetazolamide on the basis of all-atom force field simulations. Biophys. Chem. 2016, 214−215, 54−60.
(98) Fan, K.; Wei, P.; Feng, Q.; Chen, S.; Huang, C.; Ma, L.; Lai, B.; Pei, J.; Liu, Y.; Chen, J. Biosynthesis, purification, and substrate specificity of severe acute respiratory syndrome coronavirus 3C-like proteinase. J. Biol. Chem. 2004, 279, 1637−1642.
(99) Zhang, L.; Lin, D.; Sun, X.; Curth, U.; Drosten, C.; Sauerhering, L.; Becker, S.; Rox, K.; Hilgenfeld, R. Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors. Science 2020, 368, 409−412.
(100) Kneller, D. W.; Phillips, G.; O’Neill, H. M.; Jedrzejczak, R.; Stols, L.; Langan, P.; Joachimiak, A.; Coates, L.; Kovalevsky, A. Structural plasticity of SARS-CoV-2 3CL M(pro) active site cavity revealed by room temperature X-ray crystallography. Nat. Commun. 2020, 11, 3202.
(101) Yang, H.; Yang, M.; Ding, Y.; Liu, Y.; Lou, Z.; Zhou, Z.; Sun, L.; Mo, L.; Ye, S.; Pang, H. The crystal structures of severe acute respiratory syndrome virus main protease and its complex with an inhibitor. Proc. Natl. Acad. Sci. U. S. A. 2003, 100, 13190−13195.
(102) Wang, F.; Chen, C.; Tan, W.; Yang, K.; Yang, H. Structure of Main Protease from Human Coronavirus NL63: Insights for Wide Spectrum Anti-Coronavirus Drug Design. Sci. Rep. 2016, 6, 22677.
(103) Anand, K.; Palm, G. J.; Mesters, J. R.; Siddell, S. G.; Ziebuhr, J.; Hilgenfeld, R. Structure of coronavirus main proteinase reveals combination of a chymotrypsin fold with an extra α-helical domain. EMBO J. 2002, 21, 3213−3224.
(104) Paasche, A.; Zipper, A.; Schäfer, S.; Ziebuhr, J.; Schirmeister, T.; Engels, B. Evidence for substrate binding-induced zwitterion formation in the catalytic Cys-His dyad of the SARS-CoV main protease. Biochemistry 2014, 53, 5930−5946.
(105) Świderek, K.; Moliner, V. Revealing the Molecular Mechanisms of Proteolysis of SARS-CoV-2 Mpro from QM/MM Computational Methods. ChemRxiv 2020, 10.26434/chemrxiv.12283967.v1.
(106) Świderek, K.; Moliner, V. Revealing the molecular mechanisms of proteolysis of SARS-CoV-2 Mpro by QM/MM computational methods. Chemical Science 2020, 11, 10626.
(107) Dolinsky, T. J.; Czodrowski, P.; Li, H.; Nielsen, J. E.; Jensen, J. H.; Klebe, G.; Baker, N. A. PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res. 2007, 35, W522−W525.
(108) Vermaas, J. V.; Hardy, D. J.; Stone, J. E.; Tajkhorshid, E.; Kohlmeyer, A. TopoGromacs: Automated Topology Conversion from CHARMM to GROMACS within VMD. J. Chem. Inf. Model. 2016, 56, 1112−1116.
(109) Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. J. Chem. Theory Comput. 2008, 4, 435−47.
(110) Huang, J.; Rauscher, S.; Nawrocki, G.; Ran, T.; Feig, M.; de Groot, B. L.; Grubmüller, H.; MacKerell, A. D. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat. Methods 2017, 14, 71−73.
(111) Abraham, M. J.; Gready, J. E. Optimization of parameters for molecular dynamics simulation using smooth particle-mesh Ewald in GROMACS 4.5. J. Comput. Chem. 2011, 32, 2031−2040.
(112) Hess, B. P-LINCS: A parallel linear constraint solver for molecular simulation. J. Chem. Theory Comput. 2008, 4, 116−122.
(113) Berendsen, H. J. C.; Postma, J. P. M.; van Gunsteren, W. F.; DiNola, A.; Haak, J. R. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 1984, 81, 3684−3690.
(114) Bernardi, R. C.; Melo, M. C. R.; Schulten, K. Enhanced sampling techniques in molecular dynamics simulations of biological systems. Biochim. Biophys. Acta, Gen. Subj. 2015, 1850, 872−877.
(115) Hansmann, U. H. E. Parallel tempering algorithm for conformational studies of biological molecules. Chem. Phys. Lett. 1997, 281, 140−150.
(116) Nymeyer, H. How Efficient Is Replica Exchange Molecular Dynamics? An Analytic Approach. J. Chem. Theory Comput. 2008, 4, 626−636.
(117) Yang, Y. I.; Shao, Q.; Zhang, J.; Yang, L.; Gao, Y. Q. Enhanced sampling in molecular dynamics. J. Chem. Phys. 2019, 151, No. 070902.
(118) Patriksson, A.; van der Spoel, D. A temperature predictor for parallel tempering simulations. Phys. Chem. Chem. Phys. 2008, 10, 2073−2077.
(119) Bussi, G.; Donadio, D.; Parrinello, M. Canonical sampling through velocity rescaling. J. Chem. Phys. 2007, 126.
(120) Parrinello, M.; Rahman, A. Polymorphic transitions in single crystals: A new molecular dynamics method. J. Appl. Phys. 1981, 52, 7182−7190.
(121) Nosé, S.; Klein, M. Constant pressure molecular dynamics for molecular systems. Mol. Phys. 1983, 50, 1055−1076.
(122) Humphrey, W.; Dalke, A.; Schulten, K. VMD: visual molecular dynamics. J. Mol. Graphics 1996, 14, 33−38.
(123) Bressert, E. SciPy and NumPy: An Overview for Developers; O’Reilly Media, Inc., 2012.
(124) Theobald, D. L. Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Crystallogr., Sect. A: Found. Crystallogr. 2005, 61, 478−480.
(125) Novick, P. A.; Ortiz, O. F.; Poelman, J.; Abdulhay, A. Y.; Pande, V. S. SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery. PLoS One 2013, 8, e79568.
(126) Novick, P. A.; Ortiz, O. F.; Poelman, J.; Abdulhay, A. Y.; Pande, V. S. SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery. PLoS One 2013, 8, e79568.
(127) Goede, A.; Dunkel, M.; Mester, N.; Frommel, C.; Preissner, R. SuperDrug: a conformational drug database. Bioinformatics 2005, 21, 1751−1753.
(128) Siramshetty, V. B.; Eckert, O. A.; Gohlke, B.-O.; Goede, A.; Chen, Q.; Devarakonda, P.; Preissner, S.; Preissner, R. SuperDRUG2: a one stop resource for approved/marketed drugs. Nucleic Acids Res. 2018, 46, D1137−D1143.
(129) Repositories, N., In; 2013.
(130) Ellingson, S. R.; Smith, J. C.; Baudry, J. VinaMPI: Facilitating multiple receptor high-throughput virtual docking on high-performance computers. J. Comput. Chem. 2013, 34, 2212−2221.
(131) El-Hachem, N.; Haibe-Kains, B.; Khalil, A.; Kobeissy, F. H.; Nemer, G. AutoDock and AutoDockTools for protein-ligand docking: Beta-site amyloid precursor protein cleaving enzyme 1 (BACE1) as a case study. Methods Mol. Biol. 2017, 1598, 391−403.
(132) Morris, G. M.; Huey, R.; Olson, A. J. Using autodock for ligand-receptor docking. Current Protocols in Bioinformatics 2008, 24, 8.14.1−8.14.40.
(133) Zhu, X.; Mitchell, J. C. KFC2: a knowledge-based hot spot prediction method based on interface solvation, atomic density, and plasticity features. Proteins: Struct., Funct., Genet. 2011, 79, 2671−2683.
(134) Macalino, S. J. Y.; Basith, S.; Clavio, N. A. B.; Chang, H.; Kang, S.; Choi, S. Evolution of In Silico Strategies for Protein-Protein Interaction Drug Discovery. Molecules 2018, 23, 1963.
(135) Lagzian, M.; Valadan, R.; Saeedi, M.; Roozbeh, F.; Hedayatizadeh-Omran, A.; Amanlou, M.; Alizadeh-Navaei, R. Repurposing naproxen as a potential antiviral agent against SARS-CoV-2. 2020, DOI: 10.21203/rs.3.rs-21833/v1.
(136) Dinesh, D. C.; Chalupska, D.; Silhan, J.; Veverka, V.; Boura, E. Structural basis of RNA recognition by the SARS-CoV-2 nucleocapsid phosphoprotein. PLoS Pathog. 2020, 16, e1009100 (bioRxiv 2020.04.02.022194).
(137) Shivanyuk, A.; Ryabukhin, S.; Tolmachev, A.; Bogolyubsky, A.; Mykytenko, D.; Chupryna, A.; Heilman, W.; Kostyuk, A. Enamine real database: Making chemical diversity real. Chemistry Today 2007, 25, 58−59.
(138) Forli, W.; Halliday, S.; Belew, R.; Olson, A. J., In; Citeseer: 2012.
(139) Gaillard, T. Evaluation of AutoDock and AutoDock Vina on the CASF-2013 Benchmark. J. Chem. Inf. Model. 2018, 58, 1697−1706.
(140) Li, H.; Leung, K. S.; Wong, M. H.; Ballester, P. J. Improving AutoDock Vina Using Random Forest: The Growing Accuracy of Binding Affinity Prediction by the Effective Exploitation of Larger Data Sets. Mol. Inf. 2015, 34, 115−26.
(141) GISAID. https://www.gisaid.org/.
(142) GISAID, GISAID-Homepage.
(143) Katoh, K.; Toh, H. Parallelization of the MAFFT multiple sequence alignment program. Bioinformatics 2010, 26, 1899−1900.
(144) Lawrence, T. J.; Kauffman, K. T.; Amrine, K. C. H.; Carper, D. L.; Lee, R. S.; Becich, P. J.; Canales, C. J.; Ardell, D. H. FAST: FAST Analysis of Sequences Toolbox. Front. Genet. 2015, 6, 172.
(145) Shannon, C. E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379−423.
(146) Morao, I.; Heifetz, A.; Fedorov, D. G. Accurate Scoring in Seconds with the Fragment Molecular Orbital and Density-Functional Tight-Binding Methods. Methods Mol. Biol. 2020, 2114, 143−148.
(147) Nishimoto, Y.; Fedorov, D. G.; Irle, S. Density-functional tight-binding combined with the fragment molecular orbital method. J. Chem. Theory Comput. 2014, 10, 4801−4812.
(148) Barca, G. M. J.; Bertoni, C.; Carrington, L.; Datta, D.; De Silva, N.; Deustua, J. E.; Fedorov, D. G.; Gour, J. R.; Gunina, A. O.; Guidez, E.; Harville, T.; Irle, S.; Ivanic, J.; Kowalski, K.; Leang, S. S.; Li, H.; Li, W.; Lutz, J. J.; Magoulas, I.; Mato, J.; Mironov, V.; Nakata, H.; Pham, B. Q.; Piecuch, P.; Poole, D.; Pruitt, S. R.; Rendell, A. P.; Roskop, L. B.; Ruedenberg, K.; Sattasathuchana, T.; Schmidt, M. W.; Shen, J.; Slipchenko, L.; Sosonkina, M.; Sundriyal, V.; Tiwari, A.; Galvez Vallejo, J. L.; Westheimer, B.; Wloch, M.; Xu, P.; Zahariev, F.; Gordon, M. S. Recent developments in the general atomic and molecular electronic structure system. J. Chem. Phys. 2020, 152, 154102.
(149) Fedorov, D. G.; Kitaura, K.; Li, H.; Jensen, J. H.; Gordon, M. S. The polarizable continuum model (PCM) interfaced with the fragment molecular orbital method (FMO). J. Comput. Chem. 2006, 27, 976−985.
(150) Zhan, C.-G.; Chipman, D. M. Cavity size in reaction field theory. J. Chem. Phys. 1998, 109, 10543−10558.
(151) Korber, B.; Fischer, W.; Gnanakaran, S. G.; Yoon, H.; Theiler, J.; Abfalterer, W.; Foley, B.; Giorgi, E. E.; Bhattacharya, T.; Parker, M. D. Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv 2020.
(152) Bhowmik, D.; Gao, S.; Young, M. T.; Ramanathan, A. Deep clustering of protein folding simulations. BMC Bioinf. 2018, 19, 484.
(153) Romero, R.; Ramanathan, A.; Yuen, T.; Bhowmik, D.; Mathew, M.; Munshi, L. B.; Javaid, S.; Bloch, M.; Lizneva, D.; Rahimova, A.; Khan, A.; Taneja, C.; Kim, S.-M.; Sun, L.; New, M. I.; Haider, S.; Zaidi, M. Mechanism of glucocerebrosidase activation and dysfunction in Gaucher disease unraveled by molecular dynamics and deep learning. Proc. Natl. Acad. Sci. U. S. A. 2019, 116, 5086−5095.
(154) Lee, H.; Turilli, M.; Jha, S.; Bhowmik, D.; Ma, H.; Ramanathan, A. DeepDriveMD: Deep-Learning Driven Adaptive Molecular Simulations for Protein Folding. In 2019 IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS), 17 Nov., 2019; pp 12−19.
(155) Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind database: Collection of binding affinities for protein−ligand complexes with known three-dimensional structures. J. Med. Chem. 2004, 47, 2977−2980.
(156) Wang, R.; Fang, X.; Lu, Y.; Yang, C.-Y.; Wang, S. The PDBbind database: methodologies and updates. J. Med. Chem. 2005, 48, 4111−4119.
(157) Cheng, T.; Li, X.; Li, Y.; Liu, Z.; Wang, R. Comparative assessment of scoring functions on a diverse test set. J. Chem. Inf. Model. 2009, 49, 1079−1093.
(158) Li, Y.; Han, L.; Liu, Z.; Wang, R. Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results. J. Chem. Inf. Model. 2014, 54, 1717−1736.
(159) Li, Y.; Liu, Z.; Li, J.; Han, L.; Liu, J.; Zhao, Z.; Wang, R. Comparative assessment of scoring functions on an updated benchmark: 1. Compilation of the test set. J. Chem. Inf. Model. 2014, 54, 1700−16.
(160) Huang, N.; Shoichet, B. K.; Irwin, J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789−6801.
(161) Mysinger, M. M.; Carchia, M.; Irwin, J. J.; Shoichet, B. K. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 2012, 55, 6582−6594.
(162) Bowes, J.; Brown, A. J.; Hamon, J.; Jarolimek, W.; Sridhar, A.; Waldron, G.; Whitebread, S. Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nat. Rev. Drug Discovery 2012, 11, 909−922.
(163) Ton, A. T.; Gentile, F.; Hsing, M.; Ban, F.; Cherkasov, A. Rapid identification of potential inhibitors of SARS-CoV-2 main protease by deep docking of 1.3 billion compounds. Mol. Inf. 2020, 39, 2000028.
(164) Gorgulla, C.; Boeszoermenyi, A.; Wang, Z.-F.; Fischer, P. D.; Coote, P. W.; Das, K. M. P.; Malets, Y. S.; Radchenko, D. S.; Moroz, Y. S.; Scott, D. A. An open-source drug discovery platform enables ultra-large virtual screens. Nature 2020, 580, 663−668.